Anonymization in Unstructured Data: A Guide for IT & Technical Managers - Part 2

Anonymizing unstructured data like medical records or legal documents is much harder than with structured data. The primary challenge is identifying sensitive information (PII) within free-form text, which can be obscured by jargon, abbreviations, and OCR errors. This guide explores the viable approaches, from simple rule-based systems to advanced Machine Learning and hybrid models.

ANDRÁS ZIMMER

In Part 1 we discussed anonymization methods for structured data. Now we turn to unstructured data.

There are many very different kinds of unstructured data, from JSON files through texts in various formats (plain text, PDFs, Word documents, etc.) to media (images, video, audio).

In this discussion we stick to text-based documents (whether they are textual in their original form or as a result of OCR).

Even within text-based documents, the anonymization challenges vary greatly depending on the topic. In this post I’ll often draw examples from medical records and legal documents–two kinds of documents Hifly has substantial experience with.

Remember that the goal is to remove or mask identifiers (names, IDs, dates, etc.) while preserving the document’s utility for downstream analysis.

Sometimes anonymization is done on the basis of privacy best practices. However, specifically with regard to medical and legal documents, there are also regulatory requirements in most jurisdictions. Make sure you understand the requirements and their implications before committing to any solution.

Challenges

When it comes to pinpointing anonymization targets, structured data is much simpler to handle than unstructured data, mostly because one knows where to look for sensitive information. With unstructured data we do not have that luxury…

And therein lies the biggest challenge: how do we identify sensitive information?

Most of the documents we are concerned with are either one large blob of text, or at least contain free-text fields. In these text blocks nothing really tells us whether a sequence of numbers is a sensitive personal identifier or some other code whose exact value is important for making sense of the document and has to be kept intact. Much the same goes for names: is a name sensitive (such as that of a witness) or a required reference (e.g. part of the title of a precedent case)?

To make matters worse, many of these documents are converted from printouts (i.e. OCR’d scans) and thus often contain errors in text and layout, causing PII detection to fail or produce false hits. Even digitally produced PDFs may have complex layouts that scatter identifiers in non-linear ways. And even when the pieces of a PII item are logically grouped together, they can be structured and separated in various ways (think of international addresses or phone numbers).

The next hurdle is that both the medical and the legal fields have their own language style, even within a single language. We have routinely come across clinical notes that are a mixture of several languages (say, Hungarian, Latin, and English), full of abbreviations and the doctors’ technical jargon. Oh, yes, and let’s not forget that they are non-standard in many ways (i.e. different doctors use different terms and abbreviation schemes).

Viable Approaches to Detecting Sensitive Information

So figuring out what to anonymize in free-form text is far from trivial.

Sometimes it makes sense to apply human labor to the task. However, manual review has an inherent scaling limitation (in most cases it falls short by literally several orders of magnitude), it is unreliable, and it often requires exposing sensitive information to individuals who are hard to control. So it is obviously not what we are looking for as a general solution.

Rule-based Approaches

Perhaps the simplest approaches worth trying are rule-based ones.

These rely on manually crafted patterns, dictionaries, and heuristics to find sensitive information.  These methods are transparent and easily customized. They also don’t require training data.

However, they can be brittle: if data deviates slightly from expected patterns, the rules may miss the PII. They also struggle with the ambiguity and variability of natural language.

They work best when PII has very predictable formats (like account numbers) or in “somewhat” structured text (like database dumps), but less so in free narrative.
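
To make this concrete, here is a minimal sketch of a rule-based detector combining regular expressions with a small name dictionary. The patterns, names, and formats below are illustrative assumptions, not a production rule set.

```python
import re

# Hypothetical patterns -- real rule sets are tuned to the local ID, phone,
# and date formats that actually appear in the documents.
PATTERNS = {
    "ACCOUNT_NUMBER": re.compile(r"\b\d{8}-\d{8}(?:-\d{8})?\b"),
    "PHONE": re.compile(r"\+\d{1,2}[ -]?\d{1,2}[ -]?\d{3}[ -]?\d{4}\b"),
    "DATE": re.compile(r"\b\d{4}[./-]\d{2}[./-]\d{2}\b"),
}

# A tiny dictionary of given names; real systems use much larger gazetteers.
NAME_DICTIONARY = {"anna", "péter", "john", "maria"}

def find_pii(text: str):
    """Return (start, end, label) spans found by regexes and dictionary lookup."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label))
    for match in re.finditer(r"\b\w+\b", text):
        if match.group(0).lower() in NAME_DICTIONARY:
            hits.append((match.start(), match.end(), "NAME"))
    return sorted(hits)

print(find_pii("Contact Anna at +36 30 123 4567 regarding account 11773016-11111018."))
```

The strengths and weaknesses are immediately visible: the account and phone patterns are precise but brittle, and the dictionary lookup cannot tell a patient’s name from a doctor’s.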

Machine Learning Based Approaches

Machine learning (ML) approaches treat de-identification as a learning problem: the system is trained on examples of text with sensitive entities labeled, so it can predict those entities in new text. For these “Named Entity Recognition” (NER) tasks, most modern systems use transformer-based models. These models are structurally similar to the core of LLMs (large language models).

They consider the context of each information element in the text, which helps resolve some ambiguities. Thus they often vastly outperform simple rules engines, especially for free-form text. (Note, however, that they are not perfect, either.)
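
As a rough sketch, a transformer-based NER detector can be wired up in a few lines with the Hugging Face transformers pipeline. The model identifier below is a placeholder assumption; any token-classification model fine-tuned for PII/PHI detection could be substituted.

```python
from transformers import pipeline

# The model id is hypothetical -- swap in a de-identification model of your choice.
ner = pipeline(
    "token-classification",
    model="some-org/deid-ner-model",
    aggregation_strategy="simple",   # merge word pieces into whole entities
)

text = "Patient John Smith was admitted on 2023-05-14 to St. Mary's Hospital."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
```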

One of their major downsides is that ML models require training data (annotated examples of documents with all the sensitive information marked), which can be costly to obtain, especially in specialized domains. It is well known that general-purpose de-identification models can miss a large portion of domain-specific sensitive information. There are pre-trained models tuned for anonymization in specific domains but they also have their limitations (e.g. across languages or legal systems).

Despite these caveats, ML-based approaches (especially with modern NLP models) form the core of many high-performance de-identification pipelines. They can achieve high recall and decent precision when trained well.

Hybrid Systems

The most robust anonymization solutions often combine rule-based and ML techniques, leveraging the strengths of each. Hybrid systems might use an ML model as the primary detector and then apply post-processing rules to refine its outputs (for example, if the model tags something as a name that matches a known hospital name, a rule might downgrade that to an organization entity). Conversely, a set of pattern rules might first catch easy instances (like obvious ID numbers) and an ML model then handles the trickier bits of text.
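
A minimal sketch of the post-processing idea above, assuming the ML model and the rules both return (start, end, label) spans; the hospital dictionary and helper names are illustrative only.

```python
# Hypothetical post-processing: downgrade PERSON hits that match a known
# organization dictionary, and add high-precision rule hits the model missed.
KNOWN_HOSPITALS = {"st. mary's hospital", "szent imre kórház"}

def post_process(text: str, ml_spans, rule_spans):
    """ml_spans / rule_spans: lists of (start, end, label) tuples."""
    refined = []
    for start, end, label in ml_spans:
        surface = text[start:end].lower()
        if label == "PERSON" and surface in KNOWN_HOSPITALS:
            label = "ORGANIZATION"          # the rule overrides the ML label
        refined.append((start, end, label))
    covered = {(s, e) for s, e, _ in refined}
    for start, end, label in rule_spans:    # keep rule hits the model missed
        if (start, end) not in covered:
            refined.append((start, end, label))
    return sorted(refined)
```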

Hybrid systems may also incorporate a human-in-the-loop for edge cases–e.g., automatically anonymize 95% of a document and then present the uncertain parts to a human reviewer. In high-stakes applications semi-automatic anonymization with manual checking is common. For example, a tool might highlight all detected names in a court decision and let a legal clerk confirm or correct them before finalizing the anonymized version.
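
The “uncertain parts” can be selected simply by thresholding the detector’s confidence scores; the cut-off below is an arbitrary illustration that would need tuning on validation data.

```python
REVIEW_THRESHOLD = 0.9   # arbitrary cut-off; tune on validation data

def route_spans(spans):
    """Split detected spans into auto-anonymized vs. human-review queues.

    spans: list of dicts with 'start', 'end', 'label' and 'score' keys.
    """
    auto, needs_review = [], []
    for span in spans:
        if span["score"] >= REVIEW_THRESHOLD:
            auto.append(span)          # anonymize automatically
        else:
            needs_review.append(span)  # highlight for the reviewer
    return auto, needs_review
```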

Pre-built Solutions

Frameworks

The industry has long realized that these challenges are largely the same across organizations and that similar solutions have been built over and over, so automated frameworks emerged.

There are good ones for medical texts. However, they are mostly US-centric and tend to be less reliable in other contexts. For example, de-identification models built on BioBERT, a leading biomedical language model, are typically fine-tuned on HIPAA entity categories: great for US-based applications but less useful elsewhere. And even when deployed on US documents, they may require tuning to maximize their performance.

Other fields have fewer pre-built options as of now but several are in the works (e.g. the European Legal Large Language Model).

In domains that do not have available pre-trained models and services, organizations usually need to train their own models on their own data, or use some generic solution with more tuning.

One very notable open-source generic framework for detecting and anonymizing PII in text (and even images) is Microsoft Presidio. Presidio is highly extensible: it comes with predefined recognizers for many entity types (using regex, checksums, and NLP) and allows adding custom ones. It can also anonymize text with multiple different strategies (masking, replacement, encryption). It is code-first, and thus it is easy to build into data processing pipelines–and, indeed, it is often used in enterprises to comply with privacy laws by scrubbing logs, documents, or incoming text data streams.
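
A minimal sketch of Presidio’s documented analyzer/anonymizer flow (exact options may differ slightly between versions):

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

text = "John Smith's phone number is 212-555-5555."

# Detect PII entities with the built-in recognizers (regex, checksums, NLP).
analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")

# Replace every detected entity with a fixed mask; other strategies
# (redact, hash, encrypt) are configured the same way via operators.
anonymizer = AnonymizerEngine()
anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<MASKED>"})},
)
print(anonymized.text)
```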

Services

Another approach worth noting, particularly in healthcare, is service-based pre-trained models.

All large cloud platforms offer services in this field for PII detection and extraction (e.g. Amazon Comprehend Medical, Google Cloud DLP, and Azure AI Language PII detection).

However, they obviously require that the document either already resides in their cloud or gets uploaded to the service, and doing so without prior anonymization is a non-starter for many use-cases.

LLMs can also be used but they fall under the “generic” category in their effectiveness: they need to be fine-tuned and/or guided heavily to be sufficiently reliable.

Anonymization Methods

The actual methods of anonymization available are a subset of those discussed in the previous post for structured data:

  • Data Masking: redacting the information with some “fixed” text (often “<MASKED>” or asterisks).
  • Tokenization: replacing the information with a “hash” or other unique identifier. These mappings are then stored in a table so that the original values can be reconstructed (see the sketch after this list). Note that the mapping table must be properly guarded. (In some applications, where reversing the values is not a requirement, the mapping table may be omitted altogether or dropped after a short while.)
  • Encryption: similar to tokenization, but the replacement is produced by an algorithm reversible with a key rather than by keeping all token-value pairs in a table. It requires fewer and smaller secrets to be managed. In this case, too, the same original value always gets mapped to the same replacement–something that may or may not be preferable for the given use-case.
  • Replacement with synthetic data: replacing the information with something “similar” (in function). This is most useful for certain training and test scenarios, and not a very good idea for production documents (as they may look realistic enough to be accidentally processed as if the information in them were valid).
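
As an illustration of the tokenization method above, here is a minimal sketch with an in-memory mapping table. In practice the table would live in a properly secured store, and the token format is an arbitrary choice.

```python
import secrets

class Tokenizer:
    """Replace sensitive values with opaque tokens, keeping a reversible mapping."""

    def __init__(self):
        self._value_to_token = {}
        self._token_to_value = {}   # guard this table as carefully as the data itself

    def tokenize(self, value: str, entity_type: str) -> str:
        if value not in self._value_to_token:
            token = f"<{entity_type}_{secrets.token_hex(4)}>"
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]

tok = Tokenizer()
print(tok.tokenize("John Smith", "NAME"))   # e.g. <NAME_a1b2c3d4>
print(tok.tokenize("John Smith", "NAME"))   # same token for the same value
```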

Conclusion

Anonymization of unstructured data is much harder than that of structured data because it is difficult to identify the pieces of information that need to be redacted. Linguistic peculiarities and technical inconsistencies (e.g. OCR artifacts) make the process all the more demanding.

Building reliable and robust data pipelines with proper anonymization is a rather complex endeavor. In many cases the actual solution is field and application specific, and it requires a surprising amount of effort and finesse.

Both coded (“rule-based”) and machine-learning approaches can be successfully applied to the problem. The best results often come from mixing methods: using rules for high-precision cues, ML models for contextual understanding, and domain-specific knowledge to guide both.

There are frameworks that support building these kinds of solutions; general ones as well as field-specific ones. Healthcare related documentation in the US is perhaps the best served vertical in this regard with several local and cloud options; the farther we get (either geographically or domain-wise), the fewer high-performance, pre-built options are available.

Ultimately, anonymizing unstructured text is less about choosing a single tool and more about designing a system that balances accuracy, compliance, and practical constraints. As organizations increasingly rely on sensitive document processing, investing in a thoughtful, layered anonymization strategy is no longer optional—it's essential.
