5 most common questions about data anonymization

by Maurice Marrel

GDPR and other data privacy and data protection regulations have raised more questions about the handling of data than ever before. We asked our DPO and anonymization expert, Maurice Marrel, to answer some of the most common questions facing our customers today.

Summary

What is the role of anonymization in GDPR compliance?
Anonymization and pseudonymization – how do they differ?
Personal vs. sensitive data – what does this change for data handling?
How can I safeguard IT performance when introducing anonymization?
How can I identify which data should be anonymized?

1. What is the role of anonymization in GDPR compliance?

In recent years, "digital everywhere" has dramatically transformed the flow of data.
Production data is copied into test, QA or pre-production environments, where it is exposed to testers, acceptance staff and developers who are not authorized to see it, on machines far less protected than production environments.
Many files are also shared with external partners, who often only require a small part of the data actually transferred.

This personal data must be protected from leaks and other indiscretions.
In response, specific new legislation has emerged, such as the GDPR in Europe.

These new regulations require confidential data to be desensitized.
Desensitization means transforming the data, using non-reversible algorithms.
However, the data must remain usable. A test user must still see on the screen, in the last name field, a modified last name that "looks like" a last name.
Similarly, the domain must remain the same: an IBAN / RIB or a social security number must stay valid and compatible with the requirements and validation checks made by applications to allow the tests to actually run.
These same constraints must still apply even in the case of data redundancy in legacy databases, or across multiple database management systems.
These concerns must all be taken into account by any anonymization solution.
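
As a rough illustration of these constraints, the Python sketch below masks a last name and an IBAN. It is not taken from any particular product: the function names, the pool of surrogate names and the secret key are all assumptions made for the example. A keyed hash (HMAC) makes the replacement repeatable, so a value duplicated across tables or databases receives the same masked value everywhere, and the IBAN check digits are recomputed with the standard mod-97 rule so that application validation still passes.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be kept outside the test environment.
SECRET_KEY = b"example-masking-key"

# Hypothetical pool of replacement values that still "look like" last names.
SURROGATE_NAMES = ["Martin", "Bernard", "Dubois", "Thomas", "Robert", "Richard"]


def _keyed_digest(value: str) -> bytes:
    """Keyed hash: repeatable for a given key, not reversible without it."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).digest()


def mask_last_name(name: str) -> str:
    """Replace a last name with a surrogate chosen deterministically."""
    digest = _keyed_digest(name.strip().lower())
    index = int.from_bytes(digest[:4], "big") % len(SURROGATE_NAMES)
    return SURROGATE_NAMES[index]


def _iban_remainder(iban: str) -> int:
    """Mod-97 remainder of an IBAN whose spaces have already been removed."""
    rearranged = iban[4:] + iban[:4]
    # Letters become two-digit numbers: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97


def is_valid_iban(iban: str) -> bool:
    """Check the IBAN check digits only (not country-specific rules)."""
    compact = iban.replace(" ", "").upper()
    return _iban_remainder(compact) == 1


def mask_iban(iban: str) -> str:
    """Replace the digits of the account part, then recompute the check digits."""
    compact = iban.replace(" ", "").upper()
    country, bban = compact[:2], compact[4:]
    stream = _keyed_digest(compact)
    # Keep letters and the overall length; swap each digit for a derived one.
    new_bban = "".join(
        str(stream[i % len(stream)] % 10) if ch.isdigit() else ch
        for i, ch in enumerate(bban)
    )
    check = 98 - _iban_remainder(country + "00" + new_bban)
    return f"{country}{check:02d}{new_bban}"
```

Calling mask_iban("GB82 WEST 1234 5698 7654 32") twice returns the same masked IBAN, and is_valid_iban accepts it. The sketch only recomputes the IBAN check digits; national keys embedded in the account part, such as the French RIB key, would need their own recalculation.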

2. Anonymization and pseudonymization - how do they differ?

Anonymization ensures that the original data can never be retrieved by any means, unlike pseudonymization.

In a test environment, even if the machines are secure, it is the developers, testers, QA staff, and training personnel who have direct access to the data. It is therefore imperative to anonymize or pseudonymize the data upstream.
With pseudonymization, the original data can optionally be kept, encrypted, in software metadata, so that it can be retrieved individually on request, and only by authorized persons. The original values are therefore preserved. This can be useful, for example, for investigating specific, one-off problems in a test environment.
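
To make the idea concrete, here is a minimal Python sketch of such a lookup. The PseudonymVault class, the "CUST-" token format and the user names are hypothetical, and the sketch keeps the mapping in memory rather than encrypting it, which a real solution would have to do.

```python
import secrets


class PseudonymVault:
    """Hypothetical store mapping pseudonyms back to original values.

    A real vault would be encrypted at rest and kept outside the test
    environment; this sketch keeps it in memory to stay short.
    """

    def __init__(self, authorized_users):
        self._authorized = set(authorized_users)
        self._by_original = {}
        self._by_token = {}

    def pseudonymize(self, value: str) -> str:
        """Return a stable random token for the value, creating it if needed."""
        if value not in self._by_original:
            token = "CUST-" + secrets.token_hex(4).upper()
            while token in self._by_token:  # extremely unlikely collision
                token = "CUST-" + secrets.token_hex(4).upper()
            self._by_original[value] = token
            self._by_token[token] = value
        return self._by_original[value]

    def reveal(self, token: str, user: str) -> str:
        """Give the original value back, but only to an authorized person."""
        if user not in self._authorized:
            raise PermissionError(f"{user} may not reveal original data")
        return self._by_token[token]


vault = PseudonymVault(authorized_users={"dpo"})
token = vault.pseudonymize("Jane Example")
original = vault.reveal(token, "dpo")  # a tester calling reveal() would be refused
```

Because the token is random, nothing about the original value can be derived from it; the only way back is through the vault, which is exactly what makes the technique reversible.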

Pseudonymization is often the only solution that allows normal operation of applications and the completeness of test scenarios.
On the other hand, it is a potentially reversible technique, because some identification keys cannot be replaced for technical reasons. Pseudonymization can leave identifying data in place, such as customer numbers, which are sometimes the only link between data storage technologies (DBMS, files). By combining several data sets, a malicious party may be able to guess some of the original data statistically.

3. Personal vs. sensitive data - what does this change for data handling?

According to the CNIL, personal data is "any information relating to a natural person who can be identified, directly or indirectly", whereas sensitive data is "any information that reveals racial or ethnic origins, political, philosophical or religious opinions, trade union membership, health or sexual orientation of a natural person".

But this differentiation of data can be confusing.
The most important point is to identify the data to be anonymized. The goal is to prevent anyone from finding links between these data. For example, there is no need to modify health status data if the corresponding first and last names are anonymized.

Anonymization therefore utilizes algorithms that apply to all types of data.

4. How can I safeguard IT performance when introducing anonymization?

Performance should not be considered in isolation; security must be taken into account as well.
Anonymization is an additional process and will therefore necessarily have an impact on performance. However, if it is well planned, and its scope and requirements are well defined, that impact can be kept to a minimum. On average, only about twenty percent of data needs to be anonymized.

In general, the data to be anonymized is retrieved directly from a production environment for insertion into a test environment. But even if users (developers, testers, etc.) do not have access while processing is under way, test environments are usually less protected.
The ideal solution, in this case, will be to make a copy of the production database. This will allow the first instance to remain available while the other is being anonymized.
The anonymized data will then be dispatched to the relevant test, QA and training environments.
Another solution is to isolate a copy of the production environment on test machines, limit access to it during anonymization, and then distribute the result to the test environments.
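
The copy-then-anonymize workflow can be sketched in Python with SQLite standing in for the production database. The customers table, its columns and the placeholder masking rule are all hypothetical and deliberately trivial; the point is only that the original instance is never modified.

```python
import sqlite3


def clone_database(source: sqlite3.Connection) -> sqlite3.Connection:
    """Copy the source into a separate in-memory database."""
    copy = sqlite3.connect(":memory:")
    source.backup(copy)
    return copy


def anonymize_customers(db: sqlite3.Connection) -> None:
    """Overwrite identifying columns in the copy only (hypothetical schema)."""
    db.execute(
        "UPDATE customers SET last_name = 'Customer' || id, "
        "email = 'customer' || id || '@example.invalid'"
    )
    db.commit()


# Stand-in for production, with a hypothetical customers table.
production = sqlite3.connect(":memory:")
production.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, last_name TEXT, email TEXT)"
)
production.execute(
    "INSERT INTO customers (last_name, email) VALUES ('Durand', 'durand@example.com')"
)
production.commit()

# Production stays available and untouched while the copy is anonymized.
staging = clone_database(production)
anonymize_customers(staging)

prod_rows = production.execute("SELECT last_name, email FROM customers").fetchall()
staging_rows = staging.execute("SELECT last_name, email FROM customers").fetchall()
```

Only the anonymized copy (staging here) would then be dispatched to the test, QA and training environments.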

5. How can I identify which data should be anonymized?

Typically, anonymization is required for test environments.
A good knowledge of the overall scope of the database is important, because it will help in assessing which types of data will need to be anonymized.
It is also important to consider how specific data relate to each other, as some data are inseparable.
To assist the administrator, the discovery of the data eligible for anonymization must be as automated as possible, using algorithms catering for the various types of data.
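
As an illustration of what automated discovery might look like, the Python sketch below samples the values of each column and flags those that mostly match a known pattern. The patterns, the 80% threshold and the function names are assumptions chosen for the example; a real tool would need far more detectors and would probably also inspect column names.

```python
import re

# Hypothetical detectors, tried in this order.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "iban": re.compile(r"^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$"),
    "phone": re.compile(r"^\+?\d[\d .-]{7,}\d$"),
}


def _matches(label, pattern, sample):
    """IBANs are compared without spaces; other values are used as-is."""
    candidate = sample.replace(" ", "").upper() if label == "iban" else sample
    return bool(pattern.match(candidate))


def classify_column(values, threshold=0.8):
    """Return the first data type that most sample values match, or None."""
    samples = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not samples:
        return None
    for label, pattern in PATTERNS.items():
        hits = sum(_matches(label, pattern, s) for s in samples)
        if hits / len(samples) >= threshold:
            return label
    return None


def discover(table):
    """Map each column name to its detected type, keeping only the hits."""
    report = {}
    for column, values in table.items():
        label = classify_column(values)
        if label:
            report[column] = label
    return report


# Hypothetical sample: column name -> sampled values.
sample_table = {
    "contact": ["a.b@example.com", "c@example.org", None],
    "account": ["GB82 WEST 1234 5698 7654 32", "FR7630006000011234567890189"],
    "qty": [1, 2, 3],
    "tel": ["+33 1 23 45 67 89", "01 23 45 67 89"],
}
candidates = discover(sample_table)
```

The resulting report lists the columns that look eligible for anonymization and leaves the final decision to the administrator.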

But in some cases, anonymization is needed for production environments. This is especially the case with the "right to be forgotten", which has been considerably reinforced by the GDPR.

Indeed, anyone residing in the European Union whose personal data is held by an organization may take control of that data.
But in many cases, simply deleting this data would have a significant impact on other data. In such cases anonymization is therefore a better solution as it renders personal data inaccessible, while preserving the usability of data to allow normal application operation and consistency of results.

Take the example of an online commerce site. When a product is sold, the stock movement, payment and parcel delivery data are necessary for the business to operate and cannot be removed. However, the buyer's name, address and banking data can be.
The right to be forgotten, whether it results from a specific request or a regulation on the conservation of historical data, is the most common reason for anonymizing a production environment.
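
The online commerce example can be sketched as follows, again in Python with SQLite. The schema, the sample rows and the forget_customer function are hypothetical; the sketch simply shows the buyer's identifying fields being overwritten while the order records that the business depends on stay intact.

```python
import sqlite3


def forget_customer(db: sqlite3.Connection, customer_id: int) -> None:
    """Blank out identifying fields while keeping the row and its orders."""
    db.execute(
        "UPDATE customers SET name = 'Anonymized customer', "
        "address = NULL, iban = NULL WHERE id = ?",
        (customer_id,),
    )
    db.commit()


db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY, name TEXT, address TEXT, iban TEXT
    );
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        product TEXT, amount REAL, delivered INTEGER
    );
    INSERT INTO customers
        VALUES (1, 'Jane Example', '1 Example Street', 'GB82WEST12345698765432');
    INSERT INTO orders VALUES (100, 1, 'Keyboard', 49.90, 1);
""")

forget_customer(db, 1)

customer = db.execute(
    "SELECT name, address, iban FROM customers WHERE id = 1"
).fetchone()
orders = db.execute(
    "SELECT product, amount, delivered FROM orders WHERE customer_id = 1"
).fetchall()
```

Deleting the customer row instead would orphan the order, which is the kind of side effect on other data that anonymization avoids.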

Conclusion

Anonymization meets the requirements of the GDPR because it transforms data irreversibly, while retaining its usability
Anonymization concerns all data, personal or sensitive
If the anonymization scope and requirements are well defined and planned ahead, any impact on performance will be minimized
Anonymization may be necessary in a production environment in response to "right to be forgotten" requirements

Maurice Marrel

Senior Solutions Consultant, DOT Software

Maurice Marrel has over 20 years' experience on IBM i (and its predecessors) and remains actively involved in modernization projects at the forefront of technology on the platform.

Now specializing in technical pre-sales and training for ARCAD’s solutions for Enterprise Modernization on IBM i, Maurice has a wide-ranging technical background including IT management in aerospace and energy industries, and project leadership in several technology sectors including software development tooling.