6.4 Process and curate
TABLE OF CONTENTS
- Case study: a beneficiary selection of questionable quality
- Case study: improvised and messy data cleaning
- Case study: poorly controlled pseudonymisation
- Key resources
Case study: a beneficiary selection of questionable quality
After a data collection, you, as a data management specialist, are asked to prepare the data used to select beneficiaries for a cash transfer programme. You notice a number of inconsistencies and inaccuracies in the database, which could harm some potential beneficiaries. But time is running out, the selection must be made and communicated the next day, and you see no choice but to hand over the database as is to the project manager in charge of the analysis.
Potential consequences:
- Incorrect analysis
- Unethical and questionable response
- Poor relevance and effectiveness of the response provided
- Reputational risks and a loss of beneficiaries’ trust in the NGO
In light of the potential consequences, namely:
- the selection of the “wrong” beneficiaries for access to financial assistance,
- the questionable quality of the database,
it is reasonable to ask whether the people most in need of assistance will actually have access to the service in question.
Risk mitigation measures will be decided on the basis of the potential consequences in each situation, but it should not be forgotten that having reliable data for decision-making, in constant compliance with the “do no harm” principle, is part of responsible data management.
If the quality is too degraded, organise a complementary collection which – although costly – could allow the NGO to ensure an objective selection of beneficiaries.
The better the collection is prepared upstream (qualitative collections to inform the survey design, a tried-and-tested questionnaire, field tests, etc.), the more smoothly it will run.
You can also correct course during the collection by identifying errors or inconsistencies through quality controls, daily debriefing sessions with enumerators, discussions with thematic managers who know the area, etc.
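A minimal sketch of such automated quality controls (field names like household_id and age are hypothetical, and the plausibility thresholds are illustrative only):

```python
def quality_flags(records):
    """Return (record_index, issue) pairs to review with enumerators each day."""
    flags = []
    seen_ids = set()
    for i, rec in enumerate(records):
        hid = rec.get("household_id")
        if not hid:
            flags.append((i, "missing household_id"))
        elif hid in seen_ids:
            flags.append((i, "duplicate household_id"))  # possible double entry
        else:
            seen_ids.add(hid)
        age = rec.get("age")
        if age is None or not (0 <= age <= 120):
            flags.append((i, "implausible age"))
    return flags

# Illustrative records from one day of collection
survey = [
    {"household_id": "HH001", "age": 34},
    {"household_id": "HH001", "age": 29},   # duplicate household ID
    {"household_id": "HH002", "age": 150},  # implausible age
]
print(quality_flags(survey))
# → [(1, 'duplicate household_id'), (2, 'implausible age')]
```

Running a check like this after each day of collection turns quality control into a routine step rather than a last-minute discovery.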
Quality data is an essential foundation for quality action.
Case study: improvised and messy data cleaning
A data manager prepares the collected data before sending it to the person who will perform the analysis. The data manager spots what they perceive as inconsistencies and errors in the collected data and decides, on their own, to adjust and correct them without referring to the analysis plan or consulting either the team members who performed the collection or the person who will perform the analysis. No documentation is produced to explain these changes, and no file versioning is performed.
Potential consequences:
- Erroneous, biased or otherwise off-topic analysis results if errors are introduced into the data
- For the teams, for the NGO, a risk of wasting time and unnecessary extra costs
- A risk of poor decisions, and of a response with low relevance and effectiveness, because the teams no longer have access to quality data; this may also create a reputational risk for the NGO
- Poor targeting of a vulnerable population
Start by retrieving the original data, and if possible make a comparison to identify what has been modified or deleted. Assess whether it is necessary to start all over again or whether only the data that has been changed needs to be corrected.
If an analysis plan exists, redo the data cleaning based on it, and comprehensively document the procedure and every action performed so that the person conducting the analysis can be informed.
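As a sketch of the comparison step, assuming records are exported as lists of dictionaries keyed by a hypothetical respondent_id, the changed and deleted records can be listed like this:

```python
def compare_versions(original, modified, key="respondent_id"):
    """List record keys that were deleted or changed between two file versions."""
    orig = {r[key]: r for r in original}
    mod = {r[key]: r for r in modified}
    deleted = sorted(set(orig) - set(mod))
    changed = sorted(k for k in set(orig) & set(mod) if orig[k] != mod[k])
    return {"deleted": deleted, "changed": changed}

# Illustrative data: the original export versus the undocumented "cleaned" file
original = [
    {"respondent_id": "R1", "income": "120"},
    {"respondent_id": "R2", "income": "300"},
    {"respondent_id": "R3", "income": ""},
]
modified = [
    {"respondent_id": "R1", "income": "120"},
    {"respondent_id": "R2", "income": "030"},  # silently edited
]
print(compare_versions(original, modified))
# → {'deleted': ['R3'], 'changed': ['R2']}
```

Versioning the files and keeping a comparison log like this makes every change traceable and reviewable.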
To start with, develop an analysis plan when building the survey protocol: it provides a framework that guides the data cleaning so that the cleaning matches the needs of the subsequent analysis as closely as possible. This plan can be accompanied by a protocol clarifying the procedure to follow when questions arise about the validity or quality of the data.
Support and train the data managers responsible for cleaning. Involving them in the data collection process, for example by having them attend planning meetings or by sharing meeting minutes, would also help them better understand the context of the collection.
In addition, having a resource person available to answer questions throughout the data collection, cleaning and analysis cycle would limit this kind of situation.
Case study: poorly controlled pseudonymisation
HQ asks the field team to pseudonymise the beneficiaries’ data before sharing it with the provider responsible for carrying out the analysis. But the field team does not understand this notion well and creates an identification code that includes the beneficiaries’ last names. The database is sent to the provider as is.
Potential consequences:
- Disclosure of personal and sensitive data to third parties, in breach of humanitarian and data protection principles
- Possible harm to individuals and their communities, especially when the data is sensitive
- Loss of control over what can be done with such data
- Potential targeting of a vulnerable population by other actors if the information has not been sufficiently protected by the third party in charge of the analysis
- Reputational risks for the NGO
Possible immediate actions:
- Contact the provider responsible for the analysis and ask them to delete the data received
- Train the team and/or provide a guide on the principles of pseudonymisation and anonymisation
- Support teams in creating a new identification code format
Develop training on responsible data management that covers the principles of pseudonymisation and anonymisation and their importance. Provide easily accessible, understandable resources on pseudonymisation and anonymisation procedures, including a list of good practices and both good and bad examples.
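As an illustrative sketch (the code format and key handling here are assumptions, not a prescribed procedure), an identification code can be derived from a keyed hash of an internal ID, so that the shared code reveals nothing about the beneficiary’s name:

```python
import hmac
import hashlib

# Hypothetical secret key: kept by the NGO, never shared with the provider
SECRET_KEY = b"keep-this-key-out-of-shared-files"

def pseudonym(internal_id: str) -> str:
    """Derive a stable, opaque identification code from an internal ID."""
    digest = hmac.new(SECRET_KEY, internal_id.encode(), hashlib.sha256).hexdigest()
    return "BEN-" + digest[:10].upper()

# Bad practice from the case study: a code built from the last name
bad_code = "DIALLO-0042"        # directly reveals identity

# Better: the same beneficiary always maps to the same opaque code
good_code = pseudonym("0042")
print(good_code)
```

Unlike a code built from the last name, the derived code is stable across files but cannot be reversed without the secret key, so records can still be linked by the provider without exposing identities.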
To go further and act more globally, consider setting up a data transfer agreement (DTA) with each provider that will process data. A DTA clarifies the roles and responsibilities of the parties involved and stipulates additional restrictions or safeguards on how data is processed and shared. It can be complemented by a data sharing protocol, with a validation step for every occasion on which sensitive data is shared with third parties.
Key resources
- ACAPS articles, such as the technical note on data cleaning and the associated poster on detecting data of questionable quality
- Section 7.2 of the Responsible Data Management Toolkit by the CaLP Network