Data detection

1. What is the role of data discovery in a GDPR compliance project?

When you initiate a GDPR compliance project, you probably do not know where all your personal data is stored. You know that your software packages (ERP, HR software, accounting software, CRM...) hold some personal data in their own databases, but not precisely where or in what format.

To locate this data, you could turn to your solution provider, who should be able to tell you where, and in which database, the information is stored. Unfortunately, providers are often unable to locate personal data in the database, for one of several reasons: the company may no longer exist, it may lack sufficient skills in the database framework used, or your application may have been customized and the related information lost.

In these cases, the automated discovery of personal data can help. Starting from a list of identified data sources, data detection software scans databases and files for personal data according to predefined algorithms. Once you have the list of data, you can start building your data registry and begin your compliance work.
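To make the idea of scanning a data source more concrete, here is a minimal sketch in Python. It assumes a SQLite database and a single, simplified e-mail rule; the function names `scan_sqlite` and `build_demo_db`, the demo tables and the sample size are all hypothetical, and real detection software would cover many more source types and rules.

```python
import re
import sqlite3

# Assumption: a simplified e-mail pattern, not a full address validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scan_sqlite(conn, sample_size=100):
    """Return {(table, column): hits} for columns whose sampled values contain e-mail addresses."""
    findings = {}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # Only a sample of rows is read, to keep the scan cheap.
        cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (sample_size,))
        columns = [desc[0] for desc in cursor.description]
        for row in cursor:
            for column, value in zip(columns, row):
                if isinstance(value, str) and EMAIL_RE.search(value):
                    key = (table, column)
                    findings[key] = findings.get(key, 0) + 1
    return findings


def build_demo_db():
    """Create a small in-memory database with invented content."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, contact TEXT, note TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
        (1, "alice@example.com", "prefers phone"),
        (2, "bob@example.org", "n/a"),
    ])
    conn.execute("CREATE TABLE products (sku TEXT, label TEXT)")
    conn.execute("INSERT INTO products VALUES ('A-1', 'Widget')")
    return conn


if __name__ == "__main__":
    print(scan_sqlite(build_demo_db()))
```

The result lists the tables and columns that appear to hold personal data, which is the kind of raw material you would then review before entering it in the registry.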

2. What is a data registry and how is it set up?

In the context of the GDPR, the data registry acts as an inventory of data use; in particular, it details why a company is processing that data. This registry plays a key role in documenting compliance.

Article 30 of the GDPR states, "Each controller and, where applicable, the controller's representative, shall maintain a record of processing activities under its responsibility."

In this data registry, you will need to indicate specific information about how the personal data is processed:

  • Who?
    • Identify the data controller.
    • Identify the persons responsible for the operations.
    • Identify the processors.
  • What?
    • Identify the categories of data.
    • Identify the sensitivity of the data.
  • Why is this necessary?
    • Identify the purpose for which the data is collected.
  • Where is it stored?
    • Identify where the data is stored.
    • Identify any country to which the data is likely to be transferred.
  • Until when?
    • Identify how long the data can/should be stored.
  • How is it protected?
    • Identify the means of access to the data and the security measures implemented to protect it.
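As an illustration only, the questions above could be captured in a simple record structure. The sketch below is in Python; the class name `ProcessingRecord`, its field names and the choice of which fields count as required are assumptions made for the example, not a format mandated by the GDPR.

```python
from dataclasses import dataclass, field

# Assumption: these fields must be filled in before an entry is considered usable.
REQUIRED = ("controller", "data_categories", "purpose",
            "storage_location", "security_measures")


@dataclass
class ProcessingRecord:
    """One registry entry; every field name here is illustrative."""
    controller: str                      # Who: data controller
    operations_owner: str                # Who: person responsible for the operations
    purpose: str                         # Why: purpose of the collection
    storage_location: str                # Where: where the data is stored
    retention_months: int                # Until when: storage duration
    sensitive: bool = False              # What: sensitivity of the data
    processors: list = field(default_factory=list)          # Who: processors
    data_categories: list = field(default_factory=list)     # What: categories of data
    transfer_countries: list = field(default_factory=list)  # Where: countries of transfer
    security_measures: list = field(default_factory=list)   # How: access and protection

    def missing_fields(self):
        """Return the names of required fields that are still empty."""
        return [name for name in REQUIRED if not getattr(self, name)]


if __name__ == "__main__":
    record = ProcessingRecord(
        controller="Example Ltd",
        operations_owner="HR manager",
        purpose="Payroll",
        storage_location="HR database, on premises",
        retention_months=60,
        data_categories=["name", "bank details"],
    )
    print(record.missing_fields())
```

A check such as `missing_fields` is one way to spot incomplete entries while the registry is being built.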

Personal data detection software can help in creating and maintaining this registry by automating the technical side of the task. The personal data sources and the types of data stored are identified automatically, which makes it easier to categorize the data afterwards.

3. What are the challenges when searching for personal data?

The principal challenge in data detection is completeness, both of the data sources processed and of the detection rules themselves.

The term "data source" refers to any location where data is stored, including:

  • Databases (SQL, NoSQL),
  • Storage of software packages outside the database (XML, files),
  • Emails (on the server and on users' machines),
  • Hidden data (Excel files, CSV exports, documents stored on the network or user workstations).

Each of these potential data sources should be considered when performing the inventory work. The more detailed the list, the more exhaustive the detection will be.
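As a rough illustration of this inventory work for file-based sources, the Python sketch below walks a directory tree and counts candidate files by source type. The extension-to-type mapping `SOURCE_TYPES` and the function name `inventory` are hypothetical; you would adapt both to the formats actually present in your information system.

```python
from collections import Counter
from pathlib import Path

# Assumption: a small, illustrative mapping from file extension to source type.
SOURCE_TYPES = {
    ".xml": "package storage",
    ".csv": "export",
    ".xls": "spreadsheet",
    ".xlsx": "spreadsheet",
    ".eml": "email",
    ".pst": "email",
    ".db": "database",
    ".sqlite": "database",
}


def inventory(root):
    """Count candidate data sources under `root`, grouped by source type."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            kind = SOURCE_TYPES.get(path.suffix.lower())
            if kind:
                counts[kind] += 1
    return dict(counts)


if __name__ == "__main__":
    # Scans the current directory; replace "." with a network share or home folder.
    print(inventory("."))
```

Files with unknown extensions are ignored here, so the mapping itself becomes part of the inventory: any format missing from it is a source the detection will not see.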

It is important to note that the GDPR does not protect only sensitive data (e.g. political opinions, race, sexual orientation, religion), whose collection is in any case prohibited, with some exceptions. Any data is considered personal as soon as it can be linked to a person, either directly (because it contains a name, a photo, a fingerprint, a postal address, an e-mail address, a telephone number, a social security number, an internal number, an IP address, a computer connection identifier, a voice recording, etc.) or indirectly, if the link can be made by cross-referencing with other information.

In this context, detection rules are able to locate data that could lead to a person being identified, either directly or indirectly (re-identification by cross-referencing).
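Re-identification by cross-referencing can be illustrated with a small Python sketch. It assumes that records are plain dictionaries and that a set of columns (here zip code, year of birth and gender, all invented for the example) is combined; the function name `unique_combinations` is hypothetical. A row whose combination of values is unique in the data set can potentially be traced back to one person, even though no single column names that person.

```python
from collections import Counter


def unique_combinations(rows, quasi_identifiers):
    """Return the rows whose combination of quasi-identifier values occurs only once."""
    def key(row):
        return tuple(row[q] for q in quasi_identifiers)

    counts = Counter(key(row) for row in rows)
    return [row for row in rows if counts[key(row)] == 1]


# Invented sample data, with no direct identifier such as a name.
ROWS = [
    {"zip": "75001", "birth_year": 1980, "gender": "F"},
    {"zip": "75001", "birth_year": 1980, "gender": "F"},
    {"zip": "75001", "birth_year": 1975, "gender": "M"},
    {"zip": "69002", "birth_year": 1980, "gender": "F"},
]

if __name__ == "__main__":
    print(len(unique_combinations(ROWS, ["zip"])))
    print(len(unique_combinations(ROWS, ["zip", "birth_year", "gender"])))
```

In this sample, the zip code alone isolates one row, while the three columns together isolate two, which shows why detection rules need to consider combinations of fields and not just individual ones.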

Such data usually has a particular format that can be detected by more or less complex IT techniques. Typical examples include:

  • Address,
  • Zip code,
  • Name,
  • Date of birth,
  • Face in a picture,
  • GPS position.

This personal data should then be protected to ensure that it is used only for its primary purpose, unless it is decoupled from the identifying data (that is, anonymized).
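For data types with a regular textual format, the simplest detection technique is pattern matching. The Python sketch below assumes three simplified rules (a five-digit zip code, an ISO-style date and a decimal GPS position); the rule names and patterns are illustrative assumptions, and they will produce both false positives and misses on real data. Names, postal addresses and faces in pictures generally require more complex techniques that are not shown here.

```python
import re

# Assumption: simplified, illustrative patterns; real rules depend on country and format.
RULES = {
    "zip_code": re.compile(r"\b\d{5}\b"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "gps_position": re.compile(r"-?\d{1,2}\.\d{3,},\s*-?\d{1,3}\.\d{3,}"),
}


def detect(text):
    """Return {rule name: matches} for every rule that matches the text at least once."""
    return {name: rule.findall(text)
            for name, rule in RULES.items()
            if rule.search(text)}


if __name__ == "__main__":
    sample = "Born 1984-03-12, lives in 75011 Paris, last seen at 48.8566, 2.3522."
    print(detect(sample))
```

Keeping the rules in a single dictionary makes it straightforward to add or tighten a rule as the detection is tuned against your own data sources.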

4. What are the main criteria to take into account when selecting a data detection solution?

When choosing a tool for detecting personal data, the first thing to take into account is the ultimate intention behind the detection itself.

For example, your goal may be to remove a person from your entire information system (the right to be forgotten), or simply to extract all or part of your databases for testing purposes. In these two cases you will not manage the discovery of personal data in the same way. In the first case, you will have to search (and above all find) ALL data, whereas in the second case you will only have to manage the subset of data extracted.

The cost of the data detection solution is also a factor to take into account. There is no point in implementing complex processes, such as link search or the anonymization of low-quality data (e.g. typing errors, data entry errors, scans with character recognition), if you are starting from a well-known data source, or if regenerating documents from anonymized data is sufficient for your needs (for example, to create test data sets). Limiting the complexity of the detection results in faster processing and makes the solution easier and less costly to manage.

It would be risky to embark on a complex process of modeling your complete information system if you simply want to generate test data sets for specific applications or extract anonymized statistical data from certain components of the system.

The scale of the task could quickly turn into a financial drain and demand skills or decision-making beyond the current capacities of the project manager, which introduces a high risk of failure into the project. For this reason, it is vital to limit the scope of the project according to the actual need at hand.

Moreover, the GDPR does not require all the company's information to be anonymized, but only data being used outside of its original purpose. It is therefore very rare that an entire information system will need to be anonymized.