Anonymization of unstructured text with AI - DOT Anonymizer

By Florian Pusello · June 4, 2026

What if you could anonymize unstructured text as easily as a database? For years, anonymization meant structured data: tables, columns, clear field names — a familiar landscape where finding a name or a social security number came down to a simple lookup. Today, sensitive data lives somewhere else entirely. It's buried in your emails, your logs, your PDFs, your reports, your customer exchanges. And free text gives you nothing to work with: no columns, no types, no metadata. You're starting from scratch. This is where AI changes the game for DOT Anonymizer: anonymization rules written in plain language, semantic detection, and full PDF anonymization. Here's how it works.

Key takeaways

  • Generate your anonymization rules with AI in plain language: no technical expertise required.

  • Detect sensitive data in free text using LLMs trained for each business domain.

  • Keep your data in-house with fully on-premise processing. No public cloud.

  • PDF, metadata and OCR supported with preservation of the original layout.

1. AI as a rule generator

We started by integrating AI into rule generation. Anonymization rules aren't a commodity. They're a corporate asset — built around your applications, your controls, your business logic. Not a universal standard. Your DNA.

So we let AI write them. A business user types: "For this IBAN field, apply the checksum formula based on the country code." That's it. What used to take Groovy code, deep technical knowledge, and hours of work now takes a sentence. The LLM generates the complete rule, ready to run in DOT Anonymizer.
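
To make the IBAN example concrete, here is a minimal sketch of the standard mod-97 check-digit calculation that such a rule would need to apply. It is written in Python purely for illustration (rules in DOT Anonymizer are described above as Groovy), and the function names `iban_check_digits` and `is_valid_iban` are hypothetical, not part of the product. It does not enforce per-country IBAN lengths.

```python
def _to_digits(value: str) -> str:
    """Convert characters to digits: 0-9 stay as-is, A=10 ... Z=35."""
    return "".join(str(int(ch, 36)) for ch in value.upper())


def iban_check_digits(country_code: str, bban: str) -> str:
    """Compute the two check digits for a country code and a BBAN."""
    # Append the country code and a "00" placeholder to the BBAN,
    # then take 98 minus the remainder modulo 97.
    rearranged = bban + country_code + "00"
    remainder = int(_to_digits(rearranged)) % 97
    return f"{98 - remainder:02d}"


def is_valid_iban(iban: str) -> bool:
    """Check an IBAN by moving its first four characters to the end."""
    compact = iban.replace(" ", "").upper()
    rearranged = compact[4:] + compact[:4]
    return int(_to_digits(rearranged)) % 97 == 1


# A widely published sample IBAN, not real account data.
print(iban_check_digits("GB", "WEST12345698765432"))  # 82
print(is_valid_iban("GB82 WEST 1234 5698 7654 32"))   # True
```

An anonymization rule built on this idea could generate a fake BBAN, then recompute the check digits so the masked IBAN still passes validation in downstream applications.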

The impact is immediate. Business users stop waiting on developers. Consultants stop rewriting the same scripts. Functional teams focus on what they actually know — the business — and AI handles the rest.

2. The challenge of unstructured text

Anonymizing free text across varied formats and contexts required a powerful tool. Not magical, but powerful. And that tool is artificial intelligence.

Take a single email, a medical report, a doctor-to-patient letter, an internal note. Each one is packed with sensitive data. But there's no first_name column telling you what's a name and what's a diagnosis. The structure simply isn't there.

Our answer: Large Language Models (LLMs), with two jobs.

  • Find the sensitive data — names, social security numbers, medical treatments, anything that matters.

  • Classify what each piece is — first name? IBAN? internal code?

Detection runs on models trained per business domain: personal data, financial data, medical data, with more in the pipeline. Each domain gets its own algorithm, tuned to the data and the context it actually appears in.

When the first pass finishes, you're not stuck with what the model gave you. Refine it in plain language: anonymize a field it missed, restore a field it shouldn't have touched. The interaction is conversational and the result precise, with no technical skill needed to get there.

Once you're happy with a workflow, save it as a template. Run it in batch on similar documents through the API or CLI. Scale it across the organization. The whole pipeline becomes industrial.

And here's the critical part: the LLM detects, but your rules transform. The model doesn't decide what happens to your data. You do. Your rules, your business logic, your consistency.
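
The split between detection and transformation can be sketched as follows. This is an illustration under stated assumptions, not DOT Anonymizer's implementation: the detector here is a pair of regular expressions standing in for the LLM, and the names `Detection`, `detect`, `RULES`, and `anonymize` are all hypothetical. The point is only the shape of the pipeline: the detector returns labeled spans, and a separate rule table decides what happens to each label.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Detection:
    start: int  # offset of the first character of the span
    end: int    # offset just past the last character
    label: str  # what the span was classified as


def detect(text: str) -> list[Detection]:
    """Stand-in detector: regexes instead of an LLM, for illustration only."""
    patterns = {
        "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.-]+",
        "PHONE": r"\+?\d[\d ]{7,}\d",
    }
    found = [
        Detection(m.start(), m.end(), label)
        for label, pattern in patterns.items()
        for m in re.finditer(pattern, text)
    ]
    return sorted(found, key=lambda d: d.start)


# The rules, not the detector, decide how each label is transformed.
RULES: dict[str, Callable[[str], str]] = {
    "EMAIL": lambda value: "user@example.com",
    "PHONE": lambda value: re.sub(r"\d", "0", value),
}


def anonymize(text: str, detections: list[Detection]) -> str:
    out = text
    # Replace from the end of the text so earlier offsets stay valid.
    for d in sorted(detections, key=lambda d: d.start, reverse=True):
        rule = RULES.get(d.label)
        if rule is None:
            continue  # no rule for this label: leave the span untouched
        out = out[:d.start] + rule(out[d.start:d.end]) + out[d.end:]
    return out


note = "Contact Jane at jane.doe@corp.example or +33 6 12 34 56 78."
print(anonymize(note, detect(note)))
```

Note that the regex stand-in leaves "Jane" in place: a first name has no fixed pattern to match, which is the gap semantic detection is meant to fill.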

That consistency matters more than people realize. If "John Smith" becomes "Alex Brown" in your customer database, he needs to be "Alex Brown" in the related medical letter, in the application logs, everywhere. Otherwise the anonymization leaks. Our approach closes that gap across every data source you have.
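
One common way to get this kind of cross-source consistency is a keyed, deterministic mapping: the same input always yields the same replacement, wherever it appears. The sketch below uses HMAC-SHA256 from the Python standard library. It is an assumption about how such consistency could be achieved, not a description of DOT Anonymizer's internals, and `SECRET_KEY`, the name lists, and `pseudonym` are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key: in practice it would come from a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"

# Deliberately tiny pools for illustration; real pools would be far larger.
FIRST_NAMES = ["Alex", "Sam", "Robin", "Casey", "Morgan", "Jamie"]
LAST_NAMES = ["Brown", "Taylor", "Martin", "Clark", "Lewis", "Walker"]


def pseudonym(full_name: str, key: bytes = SECRET_KEY) -> str:
    """Map a name to a replacement that is stable for a given key."""
    # Normalize spacing and case so "john  SMITH" and "John Smith" agree.
    normalized = " ".join(full_name.split()).casefold()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).digest()
    first = FIRST_NAMES[digest[0] % len(FIRST_NAMES)]
    last = LAST_NAMES[digest[1] % len(LAST_NAMES)]
    return f"{first} {last}"


# The same person in two different sources gets the same replacement.
customer_row = {"name": "John Smith"}
letter = "Dear John Smith, your results are ready."

masked_row = {"name": pseudonym(customer_row["name"])}
masked_letter = letter.replace("John Smith", pseudonym("John Smith"))
print(masked_row["name"] in masked_letter)  # True
```

With pools this small, two different people can end up with the same replacement; a larger pool or a stored mapping table would be needed where that matters.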

One last thing: none of this leaves your infrastructure. Our AI integrations run fully on-premise. Your sensitive data stays where it belongs.

3. PDFs: a category of their own

Let's talk about PDFs. Every company runs on them — for sharing, archiving, processing. And they're uniquely difficult to anonymize.

  • Extracting and anonymizing the text? Manageable.

  • Preserving the exact layout of the original document? That's the hard part.

PDFs aren't just files to read. They feed business applications, populate test environments, drive automated pipelines. Break the layout and you break the systems that depend on it.

So we set a high bar: anonymize the content, leave the document visually intact, down to the character. A one-page document processes in about ten seconds, which makes high-volume runs realistic.

The same care goes into metadata — the part everyone forgets. Author name, originating company, software version: any of these can betray a document that looked perfectly clean. Picture a confidential tender response where "Authored by: Company X" sits quietly in the file properties. DOT Anonymizer catches that.

And for scanned PDFs, we handle OCR too. The terrain gets rougher — handwriting, character-recognition gaps, source-to-output mismatches — but the goal stays the same: a usable, secure document at the end of the pipeline.

4. What about large volumes?

Anonymizing unstructured text isn't at odds with scale. It works with it.

For organizations dealing with massive document volumes, DOT Anonymizer fits into your existing architecture. Validate the engine once, then run it in batch through the API or CLI. Wire it to your scheduler. Run overnight, on weekends, whenever your production environments are quiet.

Throughput depends on document type and volume, but the pattern is the same: large flows handled without disrupting anything in production.
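
For rough capacity planning, the figure of about ten seconds per one-page document quoted in the PDF section can be turned into a back-of-the-envelope estimate. The sketch below is only arithmetic: the eight-hour window and the worker counts are hypothetical, and it assumes throughput scales linearly with parallel workers, which real deployments may not achieve.

```python
def documents_per_window(
    seconds_per_doc: float, window_hours: float, workers: int = 1
) -> int:
    """Estimate how many documents fit in a batch window."""
    if seconds_per_doc <= 0 or window_hours <= 0 or workers < 1:
        raise ValueError("durations must be positive and workers at least 1")
    # Whole documents one worker can finish within the window.
    per_worker = int(window_hours * 3600 // seconds_per_doc)
    # Assumes linear scaling across workers (an optimistic simplification).
    return per_worker * workers


# Eight-hour overnight window, ten seconds per one-page document.
print(documents_per_window(10, 8))             # 2880
print(documents_per_window(10, 8, workers=4))  # 11520
```

Multi-page or scanned documents would take longer per file, so measured timings from a pilot run are a better input than the single-page figure.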

5. What's next for AI-driven anonymization?

The next milestone is the Magic Button: upload a document, click once, get it back anonymized. No configuration, no setup. The system identifies the regulated fields itself (under GDPR, CNIL guidance, or whichever framework applies), anonymizes them, and drops the result back into your folder. Zero friction, fully automated.

Beyond that, we're working on several fronts at once:

  • PDF and Office documents, end-to-end — metadata, OCR, layout fidelity.

  • New formats and use cases, because sensitive data keeps showing up in new places.

  • Custom prompts, so every organization can define what "sensitive" means in their own world — a confidential price, a unique spec, a proprietary field.

  • MCP and API integration, so AI-driven anonymization slots into your existing pipelines without disruption.

At ARCAD, "long term" means a few months. On data privacy, every day counts.

About the author

Florian Pusello

Anonymization Solution Specialist

Florian spent years working with data on the business side of major enterprises before joining ARCAD Software as a Solution Architect. He now helps clients deploy and scale anonymization across Business Intelligence, Data Science, and AI projects.

For any question about anonymization, contact our specialists.
