Data Masking and Data Anonymization: understanding the different algorithms

June 28th 2021

Data masking and anonymization are fundamental aspects of data protection. These techniques transform the information in a dataset so that it can no longer be linked to the people it describes. Anonymization can take different forms depending on the algorithm used: some algorithms substitute certain data with other data, some hide certain data completely, and others alter certain values so that the original dataset can no longer be recovered. To better understand how each algorithm works and what it is useful for, we will go through the different data masking techniques one by one.

In our examples, we will start with the following dataset, containing a name and a salary:

Name: Brown – Salary: 95000

Name: Smith – Salary: 125000

Substitution algorithms: maintaining a realistic appearance

When a substitution algorithm is used, some of the information in the original dataset is replaced with alternative information. The replacement values look realistic, but they no longer reveal the identity of the people in the original dataset. In our example, the new data would be as follows:

Name: Green – Salary: 95000

Name: Jones – Salary: 125000
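As a rough illustration, substitution can be sketched in Python as follows. The pool of replacement names, the `substitute_name` helper and the `secret` parameter are assumptions made for this example and do not come from any particular tool. A keyed hash is used so that the same input always maps to the same replacement, which is one possible design choice among others.

```python
import hashlib

# Hypothetical pool of realistic replacement surnames (assumption for this sketch).
FAKE_NAMES = ["Green", "Jones", "Taylor", "Walker", "Hall", "Young"]


def substitute_name(name: str, secret: str = "demo-secret") -> str:
    """Pick a replacement name from the pool, deterministically for a given input."""
    digest = hashlib.sha256((secret + name).encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(FAKE_NAMES)
    return FAKE_NAMES[index]


records = [
    {"name": "Brown", "salary": 95000},
    {"name": "Smith", "salary": 125000},
]

# Only the name is substituted; the salary is left untouched, as in the example above.
masked = [{**record, "name": substitute_name(record["name"])} for record in records]
print(masked)
```

With such a small pool, two different real names can end up with the same replacement; a real project would typically use a much larger dictionary of values.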

Randomized algorithms: shuffling the data

With this type of algorithm, the characters of each value in a column are randomly shuffled. This makes it very difficult to retrieve the original information. Based on the example dataset, shuffling the names could give the following result:

Name: Worbn – Salary: 95000

Name: Miths – Salary: 125000
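A minimal Python sketch of character shuffling is shown below. The `shuffle_characters` helper is a hypothetical name chosen for this example, and the fixed seed is only there to make the demonstration reproducible. The result is re-capitalized to mimic the look of the example above.

```python
import random


def shuffle_characters(value: str, rng: random.Random) -> str:
    """Return the characters of `value` in random order, with only the first one capitalized."""
    chars = list(value)
    rng.shuffle(chars)  # in-place random permutation of the characters
    return "".join(chars).capitalize()


rng = random.Random(42)  # fixed seed: an assumption made for a repeatable demo
names = ["Brown", "Smith"]
shuffled = [shuffle_characters(name, rng) for name in names]
print(shuffled)
```

A random permutation can occasionally return the original order, especially for short values, so an implementation may want to reshuffle when the output equals the input.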

Numerical variation algorithms: reproducing a result representative of the original dataset

Using a number and date variation algorithm, it is possible to create a fictitious dataset based on the numerical information in the original dataset. By varying each value randomly within a sufficiently wide range (e.g. +/- 10%), it is possible to produce results that remain realistic while no longer revealing the exact original values. This could give the following result in our example:

Name: Brown – Salary: 102600

Name: Smith – Salary: 112500
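The idea can be sketched in Python as follows, assuming a uniform random variation within +/- 10%. The `vary_number` helper and its parameters are hypothetical names used for illustration; other distributions or ranges are equally possible.

```python
import random


def vary_number(value: float, max_fraction: float, rng: random.Random) -> int:
    """Shift `value` by a random percentage drawn from [-max_fraction, +max_fraction]."""
    factor = 1 + rng.uniform(-max_fraction, max_fraction)
    return round(value * factor)


rng = random.Random()  # unseeded: results differ on every run
salaries = [95000, 125000]
varied = [vary_number(salary, 0.10, rng) for salary in salaries]
print(varied)
```

The same approach could be applied to dates by shifting each one by a random number of days instead of a percentage.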

Redaction algorithms: artificially replacing data

To make a dataset completely anonymous, it is possible to use a redaction algorithm. This replaces the real data in the targeted fields with a constant or a random, unrelated string. In other words, it is a substitution algorithm where the replacement does not attempt to look realistic. This could give the following result in our example:

Name: xxxxx – Salary: 95000

Name: xxxxx – Salary: 125000
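Both variants of redaction, constant and random, can be sketched in a few lines of Python. The helper names `redact_constant` and `redact_random` and the default placeholder are assumptions made for this example.

```python
import secrets
import string


def redact_constant(value: str, placeholder: str = "xxxxx") -> str:
    """Replace the value with a fixed placeholder, whatever the input is."""
    return placeholder


def redact_random(value: str, length: int = 5) -> str:
    """Replace the value with a random lowercase string unrelated to the input."""
    return "".join(secrets.choice(string.ascii_lowercase) for _ in range(length))


names = ["Brown", "Smith"]
print([redact_constant(name) for name in names])
print([redact_random(name) for name in names])
```

Neither function reads the input value, which is what makes the output unrelated to the original data.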

Masking algorithms: keeping a usable database

Closely related to redaction, a masking algorithm performs a partial redaction, in which part of the information is retained during anonymization. In our example, the result could be:

Name: Bxxxx – Salary: 95000

Name: Sxxxx – Salary: 125000
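A partial mask of this kind can be sketched in Python as follows, assuming that the first character is kept and the rest is replaced. The `mask_partial` helper and its parameters are hypothetical names used for illustration.

```python
def mask_partial(value: str, visible: int = 1, mask_char: str = "x") -> str:
    """Keep the first `visible` characters and replace the remainder with `mask_char`."""
    if visible < 0:
        raise ValueError("visible must be zero or positive")
    kept = value[:visible]
    hidden_count = max(len(value) - visible, 0)
    return kept + mask_char * hidden_count


names = ["Brown", "Smith"]
print([mask_partial(name) for name in names])  # ['Bxxxx', 'Sxxxx']
```

Note that this version preserves the length of the original value, which may itself leak some information; a fixed-length mask is a possible alternative.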

Customized algorithms: to meet more specific needs

Sometimes, the algorithms listed earlier are not sufficient or do not meet specific requirements. In such cases, custom algorithms can be developed, generally on request. For example, a company may need to swap certain values between different rows to make the data anonymous. In our example, this could result in:

Name: Brown – Salary: 125000

Name: Smith – Salary: 95000
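As one possible illustration of such a custom rule, the following Python sketch reverses the order of the values in a single column, which swaps the two salaries of the example. The `reverse_column` helper is a hypothetical name; an actual custom algorithm would depend entirely on the requirements at hand.

```python
def reverse_column(rows: list[dict], column: str) -> list[dict]:
    """Return copies of the rows in which the values of `column` appear in reverse order."""
    reversed_values = [row[column] for row in rows][::-1]
    return [{**row, column: value} for row, value in zip(rows, reversed_values)]


records = [
    {"name": "Brown", "salary": 95000},
    {"name": "Smith", "salary": 125000},
]

swapped = reverse_column(records, "salary")
print(swapped)
```

The names stay in place while the salaries move to another row, so the set of values in each column is unchanged even though the link between name and salary is broken.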

We have seen that there are many different data masking and anonymization algorithms, each of which produces a new and very different dataset. They do not all mask information in the same way, but together they allow organizations to find the solution that meets their specific needs and constraints.