GDPR and Anonymization: How Far Should We Go?

I regularly discuss the content of the General Data Protection Regulation (GDPR) with experts and prospects, and in particular its definition of anonymity. One claim keeps coming up: "according to the definition of the GDPR, it is almost impossible to anonymize data".

1. What the law says

An anonymization solution must be built on a case-by-case basis and adapted to the precise use for which it is intended. To help assess an anonymization solution, the G29 (the Article 29 Working Party, which brought together the European data protection authorities) proposes three criteria:

  • Individualization (singling out): is it still possible to isolate an individual in the dataset?
  • Correlation (linkability): is it possible to link separate datasets about the same individual?
  • Inference: can we infer information about an individual?

Thus:

  • a dataset for which it is not possible to individualize, correlate or infer is a priori anonymous;
  • a dataset that fails at least one of the three criteria can only be considered anonymous following a detailed analysis of the risks of re-identification.

G29's Opinion 05/2014 on Anonymization Techniques

2. What this text implies

If a sufficiently precise data item allows even a single individual to be traced in the anonymized set, then the anonymization can be considered insufficient, and using this dataset as anonymous data would therefore be in breach of the GDPR. It follows that the data controller (or anyone else with access to the source data from which the anonymized set was generated) can trace individuals if even a single numerical value is retained unchanged during anonymization.
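To make this concrete, here is a minimal sketch in Python of how someone holding the source data could trace individuals through one retained value. The function name `reidentify`, the field names and the sample records are all hypothetical and only serve the illustration.

```python
def reidentify(source_rows, anonymized_rows, retained_field, identity_field):
    """Link anonymized rows back to source rows through a value left unchanged."""
    # Index the source rows by the value that survived anonymization.
    index = {}
    for row in source_rows:
        index.setdefault(row[retained_field], []).append(row)

    matches = {}
    for position, row in enumerate(anonymized_rows):
        candidates = index.get(row[retained_field], [])
        # A value that is unique in the source points to exactly one person.
        if len(candidates) == 1:
            matches[position] = candidates[0][identity_field]
    return matches


# Hypothetical source data held by the data controller.
source = [
    {"name": "Alice Martin", "city": "Lyon", "balance": 1532.47},
    {"name": "Bruno Petit", "city": "Paris", "balance": 88.10},
    {"name": "Chloe Durand", "city": "Paris", "balance": 20417.90},
]

# Hypothetical "anonymized" copy: names masked, cities generalized,
# rows shuffled, but the balance column kept as-is.
anonymized = [
    {"name": "XXX", "city": "Region A", "balance": 20417.90},
    {"name": "XXX", "city": "Region B", "balance": 1532.47},
    {"name": "XXX", "city": "Region A", "balance": 88.10},
]

traced = reidentify(source, anonymized, "balance", "name")
```

In this toy example every anonymized row is traced back to a name, because each balance is unique in the source. Real datasets are larger, but precise numerical values tend to be just as distinctive.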

The claim in question, "according to the definition of the GDPR, it is almost impossible to anonymize data", would seem to be true when one tries to anonymize an entire information system. Its corollary, "since it is impossible to anonymize, there is no point in launching an anonymization project", would then be the conclusion, cutting short any discussion on the subject.

3. What needs to be put in place

We can conclude that anonymization, to be complete, must alter all of the source data through randomization and generalization techniques, and that the outcome should be verified by measuring the risk of re-identification. For more information on these topics, you can look up in the literature the notions of "k-anonymity", "l-diversity", "t-closeness", "δ-disclosure privacy", "β-likeness", "δ-presence", "k-map", "thresholds on average risk", "methods based on super-population models", "(ε, δ)-differential privacy" or "game-theoretic de-identification approach".
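As an illustration of what such a measurement looks like, the sketch below computes two of the simplest indicators, k-anonymity and the "distinct" variant of l-diversity, on a small table. The function names, column names and records are hypothetical, and real verification tools do considerably more than this.

```python
from collections import defaultdict


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        classes[key].append(row)
    return classes


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group: each person hides among at least k rows."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return min(len(group) for group in classes.values())


def l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any group."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return min(len({row[sensitive] for row in group})
               for group in classes.values())


# Hypothetical records after generalization of age and postcode.
records = [
    {"age_band": "30-39", "postcode": "75***", "diagnosis": "flu"},
    {"age_band": "30-39", "postcode": "75***", "diagnosis": "asthma"},
    {"age_band": "40-49", "postcode": "69***", "diagnosis": "flu"},
    {"age_band": "40-49", "postcode": "69***", "diagnosis": "flu"},
    {"age_band": "40-49", "postcode": "69***", "diagnosis": "diabetes"},
]

k = k_anonymity(records, ["age_band", "postcode"])
l = l_diversity(records, ["age_band", "postcode"], "diagnosis")
```

A result of k = 1 means that at least one person can be singled out by the quasi-identifiers alone. A low l means that knowing someone's group is enough to infer their sensitive value.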

These terms sound barbaric to anyone not conversant with mathematical theory, but they all serve the same purpose: verifying, at various levels, whether the anonymization is complete.

4. It is therefore inefficient to set up an anonymization project

In order to set up a company-wide anonymization system that complies with the GDPR, not only would an investment in a complete anonymization and mathematical verification solution be needed, but also the procurement of some very powerful servers to run the verification algorithms, and the hiring of an expert statistician or data analyst to define the verification criteria and understand the result.

Very quickly, the cost appears disproportionate to the benefits, especially as the latter are very often underestimated. Many companies discover these benefits far too late, usually after having measured the cost of an intrusion into their system.

GDPR: Data Masking and Anonymization: Why manage your personal data with Data Masking and Anonymization?

5. And in real life?

We are now faced with the inability to meet the requirements of the GDPR by anonymizing the information system at a reasonable price. Should we conclude that it is appropriate to do nothing? Admittedly, anonymization that complies 100% with the GDPR is very complex to implement. But let's come back to a more pragmatic question: is it really a problem that people who already have access to production data can trace back to the individual in an anonymized set?

From my point of view, the risk this creates for the protection of personal data is low. There are certain cases, such as the reconciliation of data between competitors or the case of a dishonest subcontractor, which could cause concern, but these concerns are more a matter of industrial espionage or unfair competition than of the GDPR. In the case of an unscrupulous subcontractor, a piece of advice: switch to a more trustworthy provider, even a more expensive one. You will gain in the end.

Let's go back to the basics of the GDPR. At no point does it say that anonymization is necessary to protect data. Anonymization is a way to take data out of the scope of the GDPR, but by no means the only solution.

6. Data exposure area

The GDPR requires security measures to be proportionate to the risks that data subjects face if their personal data is used without their consent.

Thus, the exposure of this data should be limited to those who need it for processing purposes. Most production environments in companies offer sufficient measures to protect against data theft, but the way data is used in the IT world today creates exposure areas that are much larger and less secure than production environments.

Some of these exposure surfaces are handled by cryptography, such as the encryption of communications with web browsers or between sites (HTTPS). Others, such as mobile phone applications connected to corporate networks, are governed by a strict internal IT security policy. Still other sources of exposure are test environments, which are often copies of the production environment without adequate security. In this case, anonymization, even if incomplete (i.e. pseudonymization), drastically reduces the data attack surface, which improves the overall state of security and compliance with the GDPR. In practice, pseudonymization does not take data out of the scope of the GDPR, but the regulation encourages it and eases several of its requirements for the companies that use it.
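For test environments, one common way to pseudonymize an identifier is to replace it with a keyed hash. The sketch below is only one possible approach, not a recommendation of a specific product or method. The function name `pseudonymize`, the key and the e-mail addresses are hypothetical, and the truncation to 16 characters is an arbitrary choice for readability.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed hash (HMAC-SHA256).

    The same input always gives the same pseudonym, so joins between
    tables still work, but the mapping cannot be recomputed without the key.
    """
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for readability (assumption)


# Hypothetical key: in practice it would stay in production,
# outside the test environment that receives the pseudonymized copy.
key = b"example-key-kept-outside-the-test-environment"

p1 = pseudonymize("alice.martin@example.com", key)
p2 = pseudonymize("alice.martin@example.com", key)
p3 = pseudonymize("bruno.petit@example.com", key)
```

Here `p1` and `p2` are identical while `p3` differs, which keeps the test data consistent across tables. Anyone holding the key can rebuild the mapping from the source data, which is precisely why the result is pseudonymized rather than anonymous.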

7. It is therefore still relevant to start an anonymization process

Anonymization, even if incomplete, therefore provides solutions that improve the level of compliance with the GDPR and lays the foundation for an ideal solution.

The cost and technicality of setting up a perfect global anonymization solution make it almost impossible to implement at the current time. However, the reduction of risks and implementation of projects that will evolve over time make these projects relevant and in line with current needs.

Most anonymization solutions are constantly evolving and improve the situation by successive optimizations of the state of anonymity. The evolution in technology will bring its share of solutions but also its share of problems, which is why it is not wise to procrastinate in putting in place protective measures. On the contrary, it is advisable to anticipate the risk.

The less data is exposed, the less emerging technologies such as “machine learning”, “quantum computers” or “AI” (a term that I hate due to its overuse these days) will be able to have an impact on the lives of people who have trusted us with their data. After all, the GDPR is not there to bother us with strict and unfounded rules to follow but to protect individuals.

We don’t need to strictly anonymize to be compliant with the GDPR. We need to put in place a set of data protection elements that ultimately protect people.

8. My advice

Embarking on a project to completely anonymize the production database is a costly and time-consuming exercise. In addition, it carries a significant risk of failure: the anonymization is likely to be incomplete, in which case it amounts to pseudonymization and the data remains within the scope of the GDPR.

However, it can be done right. To do this, we need to distinguish between uses, and compartmentalize our needs so that we have full control over what we use:

  • Open-data or statistics: Identify the relevant scope of data and export it to a database that is easier to anonymize, where each useful data item can be processed correctly. Assess the needs precisely for each element, check for noise, and truncate or generalize your data. On controlled sets, anonymization analysis is relevant and possible at a reasonable cost. Bear in mind that this use is very risky in terms of leakage, as such datasets are usually intended for wide distribution outside your company. One must therefore be very careful when generating them.
  • Generation of test samples for development purposes: As the data remains within the company, the risk is lower. Again, the size of the sample to be anonymized should be reduced to minimize the risk of re-identification. By implementing an anonymization process, you will greatly reduce the data attack surface and therefore effectively protect your data from potential leakage. Unless every data item is altered during anonymization, it is strongly recommended that you put strong security measures in place on your anonymized databases. In the case of theft, even if the data is not anonymous in the eyes of the law because it can be re-identified by your authorized employees, it will be unusable by the thief.
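The generalization, truncation and noise mentioned for the first use case can be sketched in a few lines of Python. Everything here is hypothetical: the helper names, the column names, the band width, the number of postcode digits kept and the noise spread are assumptions to be tuned for each dataset. Uniform noise like this is also not differential privacy, only a simple perturbation.

```python
import random


def generalize_age(age, width=10):
    """Replace an exact age with a band, e.g. 34 becomes "30-39"."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def truncate_postcode(postcode, keep=2):
    """Keep only the first characters of a postcode and mask the rest."""
    return postcode[:keep] + "*" * (len(postcode) - keep)


def add_noise(value, spread, rng):
    """Shift a numerical value by a random amount within +/- spread."""
    return round(value + rng.uniform(-spread, spread), 2)


def anonymize_row(row, rng):
    """Build the exported row: direct identifiers are simply not copied."""
    return {
        "age_band": generalize_age(row["age"]),
        "postcode": truncate_postcode(row["postcode"]),
        "salary": add_noise(row["salary"], 500.0, rng),
    }


# Seeded only to make the example reproducible. A real export should
# not use a predictable seed.
rng = random.Random(42)

row = {"name": "Alice Martin", "age": 34, "postcode": "75011", "salary": 41250.0}
result = anonymize_row(row, rng)
```

The exported row contains the age band "30-39", the postcode "75***" and a salary within 500 of the original, and no name at all. Whether this is enough still has to be checked by measuring the re-identification risk on the complete exported set.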

And I will end on an optimistic note: don't forget that anonymization is an ongoing project, not only because it will not be perfect the first time, but also because the algorithms that work today may not work tomorrow. You therefore have the time and opportunity to set up a continuous improvement process and eventually achieve a perfect anonymization of all the elements within your information system. In other words, you can start small and take the time to do the job right.