Anonymization in Structured Data: A Guide for IT & Technical Managers

We’ve put together a simple guide on what we believe is a must-have security capability: anonymizing structured data effectively. Let’s make sure your data stays both secure and usable for business operations.

ANDRÁS ZIMMER

Table of Contents

  • Introduction

    What is Anonymization?

    Structured Data in the Context of Anonymization

  • Methods of Anonymization for Structured Data

    1. Data Masking

    2. Tokenization

    3. Generalization & Aggregation

    4. Differential Privacy

    5. Synthetic Data Generation

  • Encryption

    1. Decrypt while Processing

    2. Homomorphic Encryption

  • Choosing the Right Approach: Trade-offs & Challenges

  • Best Practices for Implementing Anonymization in Structured Data

  • Conclusion: Building a Future-Proof Anonymization Strategy

 


Introduction

Data is a double-edged sword. On one hand, it fuels business insights, powers AI models, and drives automation. On the other, it’s a prime target for breaches, leaks, and compliance violations.

IT managers often face a tough balance: securing data while ensuring it remains useful for business processes. Many assume anonymization is just about masking personally identifiable information (PII), but it is a much more complex topic. It requires a deep understanding of both the data itself and the applicable anonymization techniques, and it demands an awareness of the potential trade-offs.

This post will break down how to effectively anonymize structured data, discussing various approaches, their strengths and weaknesses, and how to ensure anonymized data remains both secure and useful.

What is Anonymization?

There are many definitions of anonymization, but all of them aim to remove or sufficiently obfuscate pieces of information so that the original content becomes inaccessible (usually within reasonable constraints).

For example, the EU’s General Data Protection Regulation (commonly referred to as the GDPR) defines ‘pseudonymization’ as "the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information [...]."

Note: Although there are important distinctions between anonymization and pseudonymization, they are interchangeable for the purposes of this post.

It is important to understand that anonymization is context-sensitive. For example, a list of telephone numbers issued by a certain provider may not require anonymization on its own; however, once coupled with the subscribers’ personal data, the numbers become personal identifiers and thus do need protection.


Structured Data in the Context of Anonymization

Structured data is highly organized and stored in fixed fields (“columns”) within databases, data files, spreadsheets, and CRM systems. It is the most common form of what is widely considered data. We have long kept data in this form because it is easy to comprehend, search, sort, and analyze, which makes it both valuable and vulnerable.

Most notably, with structured data we have data containers (“tables” and “columns”) with homogeneous contents (i.e., everything in a column should be the “same kind of data”). There are slight deviations that may matter for anonymization (e.g., “JSON” columns in databases), but we assume that we can decide what level of protection each column requires in our context.

Examples include:

  • Common data types:
    • Names, email addresses, phone numbers
    • Online identifiers (e.g. social platform names, ids, links)
    • Financial identifiers (e.g. credit card numbers)
    • Any identifier that makes a person identifiable, including state-issued and intra-company identifiers as well
  • Less common but still sensitive:
    • Location data (GPS coordinates, IP addresses)
    • Network logs (firewall data, access records)
    • Device identifiers (MAC addresses, IMEIs)

 

Since structured data is well-defined, anonymization techniques must ensure usability while preventing re-identification. A naive approach (e.g., simple redaction), although effective 😁, often isn’t enough.

 


Methods of Anonymization for Structured Data

1. Data Masking

Data masking replaces sensitive values with obfuscated but realistic-looking substitutes.

It works well for user interfaces where the full data isn’t needed, and it is easy to implement for that purpose.

On the other hand, it is very hard to apply when the original values and their properties are, indeed, required (e.g., stability and uniqueness for joins, or checksums for validation).

Example: Masking Credit Card Numbers

A call center agent might see: 4111-XXXX-XXXX-3456 instead of 4111-5678-9012-3456 in the credit card number field.
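A minimal sketch of such masking in Python; exactly which digits stay visible is a policy decision (card-brand and PCI rules constrain how many may be shown):

```python
import re

def mask_credit_card(pan: str) -> str:
    """Mask a card number, keeping only the first and last four digits."""
    digits = re.sub(r"\D", "", pan)               # strip dashes and spaces
    masked = digits[:4] + "X" * (len(digits) - 8) + digits[-4:]
    # Re-insert dashes in groups of four for display
    return "-".join(masked[i:i + 4] for i in range(0, len(masked), 4))

print(mask_credit_card("4111-5678-9012-3456"))    # -> 4111-XXXX-XXXX-3456
```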

2. Tokenization

Tokenization replaces real data with unique tokens, while preserving referential integrity. The original data is stored securely and only accessible through a token vault.

Tokenization is usually applied to PII values (or more broadly “dimensions”); it is uncommon to tokenize numeric values (“facts”).

Tokenization gets around many of the problems of data masking by effectively creating a new identifier for each entity. However, it requires a secure token vault to maintain the mapping between tokens and original values. Another problem is that it can hinder certain kinds of analytics without detokenization. (For example, if genders are tokenized, the results of a "count by gender" query cannot be presented without either revealing the token meanings to the users or detokenization.)

While data masking is most often applied at the UI level only, tokenization is usually applied much deeper in the data stack (ideally as early in the pipeline as feasible).

Example: Storing Customer IDs Without Revealing SSNs

Instead of storing John Doe - 123-45-6789 - $27, a database would store TKN-34886320 - TKN-82739412 - $27.
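A toy Python illustration of the mechanism; a real token vault is a separate, hardened service with access control and audit logging, and the TKN- token format is simply carried over from the example above:

```python
import secrets

class TokenVault:
    """In-memory token vault sketch (illustration only, not production code)."""

    def __init__(self):
        self._token_to_value: dict[str, str] = {}
        self._value_to_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so joins and deduplication keep working
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = f"TKN-{secrets.randbelow(10**8):08d}"
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]

vault = TokenVault()
print(vault.tokenize("123-45-6789"))   # e.g. TKN-34886320
print(vault.tokenize("123-45-6789"))   # same token again: referential integrity
```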

3. Generalization & Aggregation

These methods reduce the level of detail in a dataset, making individual identification harder while maintaining statistical utility.

This is sometimes sufficient for data protection while still enabling the necessary level of analytics. However, care must be taken to preserve the important properties of the data:

  • Statistical properties: Mean, standard deviation, percentiles
  • NULL values and categorical distributions: To maintain data quality
  • Referential integrity: Ensuring related tables (e.g., customer transactions) remain linked

Note that exactly which of these need attention is always context-dependent, but ignoring them can break analytics models and introduce bias.

Also, of all the methods mentioned so far, this is the trickiest one to protect against "reverse inference": when the attributes together, and especially when combined with other records, reveal enough information to single out specific individuals. The most trivial such case is a sample of one with a certain combination of attributes. There are more elaborate ones, and they are much harder to defend against.

Example: Anonymizing Date of Birth in Medical Records

Instead of DOB: 1989-07-21, store and use Age: 35-40.
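A minimal sketch of this generalization in Python; the five-year band width is an arbitrary choice for illustration:

```python
from datetime import date

def generalize_dob(dob: date, band: int = 5) -> str:
    """Replace an exact date of birth with a coarse age band, e.g. '35-40'."""
    today = date.today()
    # Subtract one if this year's birthday hasn't happened yet
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    low = age // band * band
    return f"{low}-{low + band}"

print(generalize_dob(date(1989, 7, 21)))   # e.g. '35-40'
```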

4. Differential Privacy

Differential privacy adds controlled noise to datasets to prevent re-identification while preserving overall trends.

This approach allows most statistical analyses to run smoothly over the data set, because aggregate results tend to naturally "average out" the noise.

However, for this to be effective, it needs to be applied to a large data set (e.g. census data, social trends). It also requires careful planning, both regarding the level of noise and the independence of the pieces of information it is applied to.

Example: Protecting User Behavior Analytics

A search engine logs searches but adds random noise to prevent tracking individual users.
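A minimal sketch of the Laplace mechanism, the classic building block of differential privacy, applied to a counting query; the epsilon value (the privacy budget) below is an illustrative assumption, not a standard:

```python
import numpy as np

def dp_count(true_count: int, epsilon: float = 0.1) -> float:
    """Differentially private count via the Laplace mechanism.
    A counting query has sensitivity 1, so the noise scale is 1/epsilon."""
    noise = np.random.default_rng().laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# E.g., how many users searched for a given term today
print(dp_count(1280))   # close to 1280, but rarely exactly it
```

Smaller epsilon means more noise and stronger privacy; releasing many noisy answers about the same individuals consumes the budget, which is why the independence of the released pieces of information matters.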

5. Synthetic Data Generation

Sometimes it is possible to replace the original set with an artificially generated one that retains its statistical patterns.

Its virtue is that the result can be completely open, as it contains none of the exact sensitive pieces of information from the original dataset. The downside is that such data is hard to generate even when only a relatively small number of known statistical properties must hold, and it gets exponentially more difficult as the number of invariants grows. Notably, this approach tends to remove edge cases present in the original data set and unintentionally introduce other edge cases.

This approach is often used in generating test inputs/systems and training data.

Example: AI Model Training

Instead of training an AI fraud detection model on real transactions, a bank generates synthetic transactions based on real spending patterns.
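A minimal sketch of the idea: estimate a simple distribution from the real amounts offline, then sample fresh values from it. The log-normal shape and the parameter values below are assumptions for illustration; production generators fit much richer models (correlations, categorical fields, time structure):

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Parameters assumed to have been estimated from the real transaction amounts
mu, sigma = 3.4, 0.9   # mean and std of log(amount); hypothetical values

synthetic_amounts = rng.lognormal(mean=mu, sigma=sigma, size=10_000)
print(synthetic_amounts[:5].round(2))
print(f"mean={synthetic_amounts.mean():.2f}  p95={np.percentile(synthetic_amounts, 95):.2f}")
```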

 


Encryption

Encrypting the data is obviously one form of anonymization, in that it makes the original data inaccessible without external information (the key). However, in most setups it also makes the data unusable for practical analytical purposes. There are applications, though, where this approach can still work well.

Note that care must still be taken to either properly encrypt the computation results, or to make sure in other ways that the output does not reveal protected information. Therefore, depending on the application, one may still need some form of anonymization :)

1. Decrypt while Processing

One possible approach is to decrypt the data only on the computation nodes, while it is being processed, keeping it encrypted everywhere else (both "at rest" and "in flight").

In these cases the data is in the clear only very briefly, in a well-guarded, constrained environment. At the same time, all analyses can be fully and easily performed, as the data is fully available at calculation time. The downside is that it requires special data processing tools; off-the-shelf ones are not set up for this use case.

Example: LLM Compute Vault

A secure cloud computing setup where only a locally deployed AI model node decrypts patient records for real-time analysis, but no human can access them.
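A minimal sketch of the pattern using the cryptography package's Fernet symmetric encryption; the record format and the toy "analysis" are invented for illustration, and in practice the key would be provisioned to the compute node by a KMS rather than generated inline:

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # in reality: fetched from a KMS/HSM
vault = Fernet(key)

# Data stays encrypted at rest and in flight...
ciphertext = vault.encrypt(b"patient_id=42;hba1c=6.1")

# ...and is decrypted only inside the constrained compute environment.
plaintext = vault.decrypt(ciphertext).decode()
hba1c = float(plaintext.split("hba1c=")[1])
finding = b"elevated" if hba1c >= 5.7 else b"normal"

# The result is re-encrypted before it leaves the node.
encrypted_result = vault.encrypt(finding)
```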

2. Homomorphic Encryption

A different approach to working with encrypted data is what's called homomorphic encryption: it allows computations to be performed directly on encrypted data(!), without decrypting it. It is somewhat counter-intuitive that such a thing can exist at all, but it does.

This sounds like the ideal contender: data is obfuscated all the time, everywhere, and we can still infer from it.

The catch, however, is that while it is theoretically possible and even has implementations, it is prohibitively slow (compute-intensive) for most practical purposes, especially real-time ones.

Example: Making Statistics Available over Highly Sensitive Data

An insurance company wants to make patient-level information available for statistical modeling internally on its cloud platform, without revealing patient records to the cloud provider or its staff. Homomorphic encryption lets them analyze the encrypted data directly.
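A sketch using the python-paillier (phe) package, which implements the additively homomorphic Paillier scheme; fully homomorphic schemes (e.g. via Microsoft SEAL) support richer computations at far higher cost:

```python
from phe import paillier   # pip install phe

public_key, private_key = paillier.generate_paillier_keypair()

claims = [1200, 450, 980]                          # sensitive per-patient amounts
encrypted = [public_key.encrypt(x) for x in claims]

# The cloud side can sum the ciphertexts without holding the private key...
encrypted_total = sum(encrypted[1:], encrypted[0])

# ...and only the key holder can decrypt the aggregate.
print(private_key.decrypt(encrypted_total))        # 2630
```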

 


Choosing the Right Approach: Trade-offs & Challenges

Every anonymization technique comes with trade-offs:

| Method | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Data Masking | Easy to implement | Partial security | UI protection |
| Tokenization | Reversible, secure | Requires token vault | Storing sensitive PII |
| Generalization | Reduces re-identification risk | Can lose precision | Analytics, reports |
| Differential Privacy | Strong anonymity | Less useful on small datasets | Large-scale data studies |
| Synthetic Data | No real PII exposure | May introduce bias | AI model training |
| Decrypt-while-Processing | Highly secure | Needs special tools | Large-scale data whose entirety must be obfuscated |
| Homomorphic Encryption | Ultimate security | High computational cost | Highly secure cloud computing |

Best Practices for Implementing Anonymization in Structured Data

  1. Classify & tag sensitive data – Know what needs protection.
  2. Choose your approach wisely – No single approach works for all cases.
  3. Regularly audit for re-identification risks – Test anonymized datasets for vulnerabilities.
  4. Balance security and usability – Don’t over-anonymize if data still needs to be useful.
  5. Keep encryption and anonymization separate – Both are necessary for full protection.

Conclusion: Building a Future-Proof Anonymization Strategy

As compliance laws evolve and cyber threats grow, IT leaders must think beyond basic PII masking. The best anonymization strategies use a mix of techniques to balance security, performance, and business needs.

By understanding how structured data is anonymized, companies can:

  • Prevent data leaks without crippling operations
  • Stay compliant with evolving regulations
  • Enable secure AI and analytics

 

The key? Anonymization isn’t just about hiding data—it’s about securing it while keeping it useful.

Next up in Part 2: How to Anonymize Unstructured Data (Emails, Documents, Logs, and More). Stay tuned! 

Article by ANDRÁS ZIMMER
Data Observability
Governance
