
What if you could anonymize unstructured text as easily as a database?
For years, anonymization meant structured data: tables, columns, clear field names. It was a familiar landscape where finding a name or a social security number came down to a simple lookup. Today, sensitive data lives somewhere else entirely. It's buried in your emails, your logs, your PDFs, your reports, your customer exchanges. And free text gives you nothing to work with: no columns, no types, no metadata. You're starting from scratch. This is where AI changes the game for DOT Anonymizer: anonymization rules written in plain language, semantic detection, and full PDF anonymization. Here's how it works.
1. AI as a rule generator
We started by integrating AI into rule generation. Anonymization rules aren't a commodity. They're a corporate asset — built around your applications, your controls, your business logic. Not a universal standard. Your DNA.
So we let AI write them. A business user types: "For this IBAN field, apply the checksum formula based on the country code." That's it. What used to take Groovy code, deep technical knowledge, and hours of work now takes a sentence. The LLM generates the complete rule, ready to run in DOT Anonymizer.
The impact is immediate. Business users stop waiting on developers. Consultants stop rewriting the same scripts. Functional teams focus on what they actually know — the business — and AI handles the rest.
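To make the IBAN example concrete, here is a minimal Python sketch of the kind of logic such a rule has to contain. It is not the code the LLM generates and not DOT Anonymizer's rule format; the function names are hypothetical, and the only thing taken as given is the standard IBAN check-digit scheme (ISO 13616, modulo 97).

```python
def _to_number(chars: str) -> int:
    # Map each character to its IBAN value: digits stay as-is, A=10 ... Z=35.
    return int("".join(str(int(ch, 36)) for ch in chars.upper()))


def iban_check_digits(country_code: str, bban: str) -> str:
    """Compute the two check digits for a country code and a BBAN."""
    # Append the country code and a "00" placeholder to the BBAN, then
    # take 98 minus the remainder modulo 97, zero-padded to two digits.
    remainder = _to_number(bban + country_code + "00") % 97
    return f"{98 - remainder:02d}"


def is_valid_iban(iban: str) -> bool:
    """Check an IBAN: move the first four characters to the end, mod 97 must be 1."""
    compact = iban.replace(" ", "")
    return _to_number(compact[4:] + compact[:4]) % 97 == 1


def build_iban(country_code: str, replacement_bban: str) -> str:
    """Hypothetical helper: wrap a substitute BBAN in a checksum-valid IBAN."""
    check = iban_check_digits(country_code, replacement_bban)
    return f"{country_code.upper()}{check}{replacement_bban.upper()}"
```

The point of recomputing the check digits is that the anonymized value still passes validation in the applications that consume it. The sketch does not check country-specific BBAN lengths, which a production rule would likely need.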
2. The challenge of unstructured text
Anonymizing free text across varied formats and contexts required a powerful tool. Not magical, but powerful. And that tool is artificial intelligence.
Take a single email, a medical report, a doctor-to-patient letter, an internal note. Each one is packed with sensitive data. But there's no first_name column telling you what's a name and what's a diagnosis. The structure simply isn't there.
Our answer: Large Language Models (LLMs), with two jobs: a first detection pass, then refinement in plain language.
Detection runs on models trained per business domain: personal data, financial data, medical data, with more in the pipeline. Each domain gets its own algorithm, tuned to the data and the context it actually appears in.
When the first pass finishes, you're not stuck with what the model gave you. Refine it in plain language: anonymize a field it missed, restore a field it shouldn't have touched. The interaction is conversational, the result is precise, and no technical skill is needed to get there.
Once you're happy with a workflow, save it as a template. Run it in batch on similar documents through the API or CLI. Scale it across the organization. The whole pipeline becomes repeatable at industrial scale.
And here's the critical part: the LLM detects, but your rules transform. ChatGPT doesn't decide what happens to your data. You do. Your rules, your business logic, your consistency.
That consistency matters more than people realize. If "John Smith" becomes "Alex Brown" in your customer database, he needs to be "Alex Brown" in the related medical letter, in the application logs, everywhere. Otherwise the anonymization leaks. Our approach closes that gap across every data source you have.
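The split between detection and transformation, and the consistency requirement, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DOT Anonymizer's implementation: the detection step is replaced by a literal string search standing in for the LLM, consistency is obtained with keyed hashing (HMAC), and every name in the code is hypothetical.

```python
import hashlib
import hmac
from typing import Callable, NamedTuple


class Detection(NamedTuple):
    start: int
    end: int
    label: str  # e.g. "PERSON"


# Tiny illustrative lists; real replacement lists would be much larger
# to keep two different people from receiving the same alias.
FIRST_NAMES = ["Alex", "Sam", "Robin", "Casey", "Morgan", "Jamie"]
LAST_NAMES = ["Brown", "Green", "Taylor", "Clark", "Lewis", "Walker"]


def consistent_name(real_name: str, secret_key: bytes) -> str:
    """Map a real name to the same fake name every time, for a given key."""
    normalized = " ".join(real_name.lower().split())
    digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).digest()
    first = FIRST_NAMES[digest[0] % len(FIRST_NAMES)]
    last = LAST_NAMES[digest[1] % len(LAST_NAMES)]
    return f"{first} {last}"


def detect_names(text: str, known_names: list[str]) -> list[Detection]:
    """Stand-in for the detection step: finds literal occurrences only."""
    found = []
    for name in known_names:
        pos = text.find(name)
        while pos != -1:
            found.append(Detection(pos, pos + len(name), "PERSON"))
            pos = text.find(name, pos + len(name))
    return found


def transform(text: str, detections: list[Detection],
              rules: dict[str, Callable[[str], str]]) -> str:
    """Apply the rule registered for each label; detection never picks the output."""
    # Replace from right to left so earlier offsets stay valid.
    # Overlapping detections are not handled in this sketch.
    for det in sorted(detections, key=lambda d: d.start, reverse=True):
        rule = rules.get(det.label)
        if rule is not None:
            text = text[:det.start] + rule(text[det.start:det.end]) + text[det.end:]
    return text


key = b"demo-key-keep-the-real-one-secret"
rules = {"PERSON": lambda name: consistent_name(name, key)}
letter = "Dear John Smith, your results are ready."
log_line = "2024-03-01 login ok user=John Smith"
print(transform(letter, detect_names(letter, ["John Smith"]), rules))
print(transform(log_line, detect_names(log_line, ["John Smith"]), rules))
```

Both printed lines carry the same substitute name, because the alias depends only on the normalized real name and the secret key, not on the document it was found in. The detector only reports where sensitive values are; what they become is decided entirely by the `rules` table.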
One last thing: none of this leaves your infrastructure. Our AI integrations run fully on-premises. Your sensitive data stays where it belongs.
3. PDFs: a category of their own
Let's talk about PDFs. Every company runs on them — for sharing, archiving, processing. And they're uniquely difficult to anonymize.
PDFs aren't just files to read. They feed business applications, populate test environments, drive automated pipelines. Break the layout and you break the systems that depend on it.
So we set a high bar: anonymize the content, leave the document visually intact, down to the character. A one-page document processes in about ten seconds, which makes high-volume runs realistic.
The same care goes into metadata — the part everyone forgets. Author name, originating company, software version: any of these can betray a document that looked perfectly clean. Picture a confidential tender response where "Authored by: Company X" sits quietly in the file properties. DOT Anonymizer catches that.
And for scanned PDFs, we handle OCR too. The terrain gets rougher — handwriting, character-recognition gaps, source-to-output mismatches — but the goal stays the same: a usable, secure document at the end of the pipeline.
4. What about large volumes?
Anonymizing unstructured text isn't at odds with scale. It works with it.
For organizations dealing with massive document volumes, DOT Anonymizer fits into your existing architecture. Validate the engine once, then run it in batch through the API or CLI. Wire it to your scheduler. Run overnight, on weekends, whenever your production environments are quiet.
Throughput depends on document type and volume, but the pattern is the same: large flows handled without disrupting anything in production.
5. What's next for AI-driven anonymization?
The next milestone is the Magic Button: upload a document, click once, get it back anonymized. No configuration, no setup. The system identifies the regulated fields itself — GDPR, CNIL, whichever framework applies — anonymizes them, and drops the result back into your folder. Zero friction, fully automated.
Beyond that, we're working on several fronts at once.
At ARCAD, "long term" means a few months. On data privacy, every day counts.



