Responsible data management toolbox

3.6 The de-identification process


Keep in mind

When sharing, retaining or archiving personal data, de-identification (generally anonymization) is necessary to avoid compromising that data, unless there are specific reasons to keep the data identifiable.

Pseudonymization is used if you ever need to link the shared data back to your initial database.

Anonymization is used if you want to encourage data recipients to keep, share and reuse the data without risk to the data subjects. However, anonymization is not simple to implement if you really want to make sure that no re-identification is possible.

De-identification processes aim to make it more difficult or impossible to identify the people behind the data, and so make it easier to share data of interest by reducing or eliminating the risk of compromising personal data. De-identified data is generally what is shared with institutions, partners, donors or the sector at large, unless there are specific reasons to do otherwise (see subsection 3.5 on Data Sharing Agreements).

There are different degrees of data identifiability, as shown in this Future of Privacy Forum visual.


However, let’s focus on the two main methods of de-identification, anonymization and pseudonymization, which have different purposes.

You can refer to OCHA’s guidance note N°1 on Statistical Disclosure Control for additional insights into the disclosure of information and risks of re-identification.

3.6.1 What is pseudonymization?

Pseudonymization “describes the processing of personal data in a way that personal data can no longer be attributed to a specific data subject without the use of additional information, such as a key code.” (definition taken from the Mercy Corps Data Protection & Privacy Guide).

This method limits the risk of identification because it replaces a direct identification element with an indirect one. The indirect element remains intact and can, with one or more additional pieces of information, be traced back to the data subject.

For example, when compiling a list of patients of a medical emergency organisation, the first and last names are recorded in a file that is separate from the rest of the personal information and are replaced by a code.
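The patient-list example can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `pseudonymize`, the field names and the sample records are hypothetical, and the key table would in practice be stored separately from the coded dataset, with restricted access.

```python
import secrets


def pseudonymize(records, identifier_fields):
    """Split records into a key table (code -> identifiers) and a coded dataset."""
    key_table = {}
    coded = []
    for record in records:
        # Random code that carries no information about the person.
        code = secrets.token_hex(4)
        while code in key_table:  # very unlikely, but keep codes unique
            code = secrets.token_hex(4)
        # Direct identifiers go to the key table only.
        key_table[code] = {f: record[f] for f in identifier_fields}
        # Everything else stays in the dataset, attached to the code.
        rest = {f: v for f, v in record.items() if f not in identifier_fields}
        coded.append({"code": code, **rest})
    return key_table, coded


# Hypothetical sample data, for illustration only.
patients = [
    {"first_name": "Amina", "last_name": "Diallo", "diagnosis": "malaria"},
    {"first_name": "Jean", "last_name": "Martin", "diagnosis": "fracture"},
]
key_table, coded_patients = pseudonymize(patients, ["first_name", "last_name"])
```

Only `coded_patients` would be shared. Whoever holds `key_table` can still link each code back to a person, which is why the result remains personal data.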

There are several pseudonymization techniques, including the secret key encryption system: decryption is possible provided the key is known.
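Python's standard library does not include symmetric encryption, so the sketch below illustrates a different but related technique: a keyed hash (HMAC). Unlike secret-key encryption, the code cannot be decrypted, but anyone who holds the key can recompute the code for a known identifier and so relink the data. The function name `keyed_pseudonym`, the key value and the sample name are assumptions made for illustration.

```python
import hashlib
import hmac


def keyed_pseudonym(identifier: str, secret_key: bytes, length: int = 12) -> str:
    """Derive a stable code from an identifier using HMAC-SHA256."""
    # Normalize so that trivial spelling differences give the same code.
    normalized = identifier.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
    # Truncation keeps codes short; shorter codes raise the chance of collisions.
    return digest[:length]


# Hypothetical key: in practice, generate it randomly and store it securely.
SECRET_KEY = b"replace-with-a-randomly-generated-key"

code_a = keyed_pseudonym("Amina Diallo", SECRET_KEY)
code_b = keyed_pseudonym("  amina diallo ", SECRET_KEY)
code_c = keyed_pseudonym("Amina Diallo", b"another-key")
```

Here `code_a` and `code_b` are identical because the same person and the same key are used, while `code_c` differs because the key differs. Protecting the key is therefore as important as protecting the original database.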

Pseudonymization is a good practice for securing data, because the data is no longer directly connected to an individual, and for limiting the risk of correlating personal data. However, the data remains personal data, and the data subject can be found if other data is cross-checked (for example the original database with the associated “keys”).

3.6.2 What is anonymization?

Anonymization “is the process by which Personal Data is rendered anonymous so that an individual (or “data subject”) is no longer identifiable” (definition from the Mercy Corps Data Protection & Privacy Guide). This process is irreversible, unlike pseudonymization.

The data no longer contains any information allowing for direct identification (for example, surname or first name) or indirect identification (for example, date of birth or the number of people in a household in a community) of a person. This eliminates the risk of compromising the rights of the populations concerned and allows data sets to be retained longer for reuse, for example in future projects or for training purposes. Data protection legislation no longer applies to anonymized data: transfer, retention and dissemination are possible without restriction.

Anonymization changes the nature of the data: the purpose is not to secure the data but precisely to make it reusable.

The French CNIL recommends the following steps to establish an appropriate anonymization technique:

  • identify the information that should be retained according to its relevance (the data to be reused should be of real value to the organisation, its staff or its partners, for example with a view to improving the effectiveness of interventions);
  • remove direct identification elements as well as rare values that could allow easy re-identification of individuals (for example, the mention of an individual’s age can make it very easy to re-identify centenarians, as can the number of household members in a small community);
  • distinguish important information from secondary or unnecessary (i.e. removable) information;
  • define the ideal and the acceptable level of granularity for each piece of information retained (i.e. the level of detail that the organisation considers useful for reuse).
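The removal of direct identifiers and rare values can be sketched as follows. This is a minimal illustration, not a complete anonymization procedure: the function names, the threshold `min_count` and the sample survey are assumptions, and suppressing rare values column by column does not by itself rule out re-identification through combinations of values.

```python
from collections import Counter


def drop_direct_identifiers(records, direct_identifiers):
    """Return copies of the records without the direct identification fields."""
    return [
        {k: v for k, v in r.items() if k not in direct_identifiers}
        for r in records
    ]


def suppress_rare_values(records, field, min_count, placeholder=None):
    """Replace values of `field` that occur fewer than `min_count` times."""
    counts = Counter(r[field] for r in records)
    return [
        {**r, field: r[field] if counts[r[field]] >= min_count else placeholder}
        for r in records
    ]


# Hypothetical survey data, for illustration only.
survey = [
    {"name": "A", "age": 34, "household_size": 4},
    {"name": "B", "age": 34, "household_size": 4},
    {"name": "C", "age": 101, "household_size": 4},
    {"name": "D", "age": 34, "household_size": 12},
]
cleaned = drop_direct_identifiers(survey, {"name"})
cleaned = suppress_rare_values(cleaned, "age", min_count=2)
cleaned = suppress_rare_values(cleaned, "household_size", min_count=2)
```

In this sample, the age of the centenarian and the unusually large household size are each replaced by `None`, while the common values are kept.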

There are two main types of anonymization techniques (the different aspects of which are detailed in the opinion on anonymization techniques of the Article 29 Working Party, or “G29”, the predecessor of the European Data Protection Board):

  • randomisation: this technique alters the data to make it less accurate, so that it becomes uncertain enough to no longer be attributable to a specific person (for example, by adding or removing up to 10 cm from the measured height of a person);
  • generalisation: this technique changes the scale of the data to generalise or dilute it (for example, recording a date to the month rather than the week).

Generalisation, which is often easier to set up, is the technique most used by NGOs.
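Both families of techniques can be illustrated with a short sketch. The function names, the 10-year age bands and the uniform noise are assumptions chosen for illustration; a real project would pick the level of detail and the amount of noise according to the intended reuse of the data.

```python
import random
from datetime import date


def generalize_date_to_month(d: date) -> str:
    """Generalisation: keep only the year and month of a date."""
    return d.strftime("%Y-%m")


def generalize_age_to_band(age: int, width: int = 10) -> str:
    """Generalisation: replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def add_noise(value_cm: float, max_shift_cm: float = 10.0, rng=random) -> float:
    """Randomisation: shift a measurement by a random amount within +/- max_shift_cm."""
    return value_cm + rng.uniform(-max_shift_cm, max_shift_cm)


month = generalize_date_to_month(date(2023, 5, 17))  # "2023-05"
band = generalize_age_to_band(37)                    # "30-39"
noisy_height = add_noise(170.0)                      # somewhere between 160.0 and 180.0
```

Generalisation is deterministic and keeps the data truthful at a coarser scale, whereas randomisation makes each individual value uncertain while roughly preserving overall distributions.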

Since 2018, the Centre for Humanitarian Data has been assessing datasets shared on the Humanitarian Data Exchange (HDX) platform and has found cases where the risk of disclosing the identity of individuals is too high. It uses statistical disclosure control (SDC), via free, open-source software called sdcMicro, to assess and reduce the risk that anonymized data can be re-identified (see OCHA’s guidance note N°1 on Statistical Disclosure Control).
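One simple risk measure used in statistical disclosure control is k-anonymity: the size of the smallest group of records that share the same combination of indirectly identifying values. The sketch below is not sdcMicro and does not reproduce its interface; the function names, the choice of quasi-identifiers and the sample data are assumptions made for illustration.

```python
from collections import Counter


def _key(record, quasi_identifiers):
    """Combination of quasi-identifier values for one record."""
    return tuple(record[q] for q in quasi_identifiers)


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    if not records:
        return 0
    sizes = Counter(_key(r, quasi_identifiers) for r in records)
    return min(sizes.values())


def records_below_k(records, quasi_identifiers, k):
    """Records whose quasi-identifier combination is shared by fewer than k records."""
    sizes = Counter(_key(r, quasi_identifiers) for r in records)
    return [r for r in records if sizes[_key(r, quasi_identifiers)] < k]


# Hypothetical dataset, for illustration only.
dataset = [
    {"age_band": "30-39", "district": "North", "sex": "F"},
    {"age_band": "30-39", "district": "North", "sex": "F"},
    {"age_band": "30-39", "district": "North", "sex": "F"},
    {"age_band": "90-99", "district": "North", "sex": "M"},
]
quasi = ["age_band", "district", "sex"]
k = k_anonymity(dataset, quasi)
risky = records_below_k(dataset, quasi, k=3)
```

With this sample, `k` is 1 because one record has a unique combination of values, and `risky` contains that single record. Such records are candidates for further generalisation or suppression before the data is shared.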

In any event, it is important for NGOs to stay up to date with new anonymization techniques and technologies, given how quickly they become obsolete (especially when it comes to artificial intelligence, the consequences of which, in terms of personal databases and potential cross-referencing, are as yet impossible to ascertain).