Blog
How We Solved a PII Redaction Problem at Scale
Like most real problems, it was a combination of AI, deterministic programming, and human in the loop review.
We recently had to solve a very specific problem: redact personally identifiable information from a large backlog of unstructured narratives so the data could be safely used.
The scale made this a non-trivial problem, AI tools or not.
The existing crash narrative backlog is about 1.84 million records with an additional ~100k per month. The average narrative is about 650 characters, but a good number exceed 1,000 characters. To redact one record would be simple, but to redact them all is a systems problem rather than a pure technical one.
A straightforward paid redaction service was not realistic. One commercial option we looked at was roughly $1 per record. At this scale, that turns into something close to $1.8 million just to clear the historical backlog. Running every narrative through frontier-model APIs had the same basic problem: even if the per-record cost sounds small, it becomes expensive quickly when multiplied by nearly two million records, and then again for future refreshes.
So we needed something cheaper, repeatable, auditable, and good enough to trust.
The final design became a hybrid pipeline.
First, we do deterministic redaction with some smart hints from related data. We also tried Microsoft Presidio, but it turned into a layered regex mountain trying to account for the unstructured nature and endless edge cases. And relying on any model, in Presidio or not, trying to guess what a person's name is vs. the name of a street or restaurant is incredibly hard. If we already know the person's name from associated data, it's much better to use that deterministically. The deterministic pass handles the high-confidence, known PII first.
Second, we classify which rows still look risky. Most records did not need an LLM. The risk classifier looks for signs that something may have escaped the deterministic pass, such as remaining name-like context, strong PII cues, driver/license/plate patterns, or person/address context. We trained the classifier on thousands of samples using the walk-forward/holdout technique to know we were converging on a good deterministic solution.
Third, only those risky records go to a local LLM. The current model is gpt-oss-120b, served through an OpenAI-compatible local vLLM endpoint. It is not a paid frontier API call per record and the PII never leaves the trusted network. The LLM acts as a final context-aware reviewer, not the primary redactor. This is the step like asking a human to just read the narrative and ask, "Is there any PII here?" It's not deterministic, but that's the point: you cannot realistically have a truly deterministic method to catch every single edge case. Could we have just passed every record through the local LLM since it's free and trusted? Sure, but at about 20 tokens/second and close to 100% GPU usage, it would have taken about 20,000 hours to get through the backlog.
Finally, every LLM result is validated. Because even the LLM isn't perfect, a very small subset is marked as needing human review.
Operationally, we processed the backlog in 100,000-record batches. This is what the numbers looked like for the first batch:
100,000 deterministic rows processed.
22,352 records reviewed by the LLM.
826 records modified by the LLM.
55 records marked human-review-required.
That means about 22.4% of records went to the LLM, and human review was required for about 0.055% of the full batch.
For the batch, it took about 10.6 hours end to end.
The lesson was that it took a very custom system with disciplined iteration to achieve a good result. It was a combination of deterministic programming, AI integration, and human review.
This is exactly the pattern I'm seeing play out across any real problem. AI isn't a magic wand; it's one of many tools to use to solve hard problems.