3.3.1 Duplicates


For a survey to be reliable, each record must appear only once in the database, but sometimes there are several records for the same subject.
The following are some examples of common duplications in datasets in the humanitarian context:
- A survey form is entered into the database more than once, by mistake.
- A mobile survey is started, stopped, then started again. For example, an interview is already in progress with a household when another household member shows up, and the household asks to start again from the beginning. If the first record is not deleted in the field, the household will appear twice: once as an incomplete record and once in full.
- When beneficiaries are registered, some individuals or households may register twice (or more).
These examples are not exhaustive and it is recommended that you always check carefully for duplicates in the datasets before starting the analysis.
The most straightforward way to handle duplicates is to ensure that each record has a unique identifier. If the dataset already has a unique identifier, check that every recorded identifier is truly unique and that none is repeated. Numeric identifiers are preferable, although alphanumeric values can also be used. Names are rarely a good identifier, because they are prone to spelling mistakes and homonyms.
- If no unique identifier system was planned, you will have to create one from information already present in the database (e.g. Age + Location + time of record).
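As an illustration of both checks, the following is a minimal sketch in Python (standard library only). The records, the field names (`id`, `age`, `location`, `recorded_at`) and the helper functions are hypothetical; adapt them to the columns of your own dataset.

```python
from collections import Counter

# Hypothetical records; the field names are assumptions for illustration.
records = [
    {"id": "HH-001", "age": 34, "location": "Camp A", "recorded_at": "2023-05-02 09:15"},
    {"id": "HH-002", "age": 51, "location": "Camp B", "recorded_at": "2023-05-02 10:40"},
    {"id": "HH-001", "age": 34, "location": "Camp A", "recorded_at": "2023-05-02 09:15"},
]


def repeated_values(values):
    """Return the values that occur more than once, with their counts."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}


def composite_key(record, fields=("age", "location", "recorded_at")):
    """Build a fallback identifier from fields already present in the record."""
    # Trimming and lowercasing avoids treating "Camp A" and "camp a " as different.
    return "+".join(str(record[field]).strip().lower() for field in fields)


# Check 1: is the existing identifier really unique?
duplicate_ids = repeated_values(r["id"] for r in records)

# Check 2: no identifier available, so combine existing fields instead.
duplicate_keys = repeated_values(composite_key(r) for r in records)

print(duplicate_ids)
print(duplicate_keys)
```

A composite key of this kind is only as reliable as the fields it combines: two different people of the same age registered at the same place and time would share a key, so repeated keys are best treated as candidates to review rather than confirmed duplicates.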
Finding these duplicates is an important phase of data cleaning, as their presence can bias the indicators produced in the analysis. Each tool has its own way of finding duplicates (if you are using Excel you can review the method here - available in French).
Collecting data with MDC tools can really help you to spot duplicates and errors thanks to the metadata, such as the time of data collection (start-end), the enumerator, or a unique ID that is captured automatically.
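To show how such metadata can be used, here is a rough sketch in Python that flags interviews which may have been restarted: the same enumerator submits two records for the same household, and the second one starts shortly after the first one ended. The field names (`uuid`, `enumerator`, `household`, `start`, `end`), the sample data and the 30-minute threshold are all assumptions; actual export formats vary from one MDC tool to another.

```python
from datetime import datetime, timedelta
from itertools import groupby

# Hypothetical submissions exported with their metadata.
submissions = [
    {"uuid": "a1", "enumerator": "E07", "household": "HH-014",
     "start": "2023-05-02T09:15:00", "end": "2023-05-02T09:21:00"},
    {"uuid": "a2", "enumerator": "E07", "household": "HH-014",
     "start": "2023-05-02T09:23:00", "end": "2023-05-02T10:05:00"},
    {"uuid": "a3", "enumerator": "E07", "household": "HH-020",
     "start": "2023-05-02T11:00:00", "end": "2023-05-02T11:40:00"},
]


def group_key(row):
    """Records are compared only within the same enumerator and household."""
    return (row["enumerator"], row["household"])


def suspected_restarts(rows, max_gap=timedelta(minutes=30)):
    """Return (earlier_uuid, later_uuid) pairs that look like a restarted interview."""
    # ISO 8601 timestamps sort chronologically when sorted as text.
    ordered = sorted(rows, key=lambda r: (group_key(r), r["start"]))
    pairs = []
    for _, group in groupby(ordered, key=group_key):
        group = list(group)
        for earlier, later in zip(group, group[1:]):
            gap = (datetime.fromisoformat(later["start"])
                   - datetime.fromisoformat(earlier["end"]))
            # Overlapping records give a negative gap and are flagged as well.
            if gap <= max_gap:
                pairs.append((earlier["uuid"], later["uuid"]))
    return pairs


restarts = suspected_restarts(submissions)
print(restarts)
```

The flagged pairs are only suspects. Whether the earlier record is really an abandoned attempt still has to be confirmed, for example by checking how much of the form was filled in.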
What to do with duplicates?
When you know for a fact that two entries are duplicates (data collected on the same subject), you should keep only one record. If different data have been entered for the same data subject (e.g. individual, household), investigate which record should be kept, using the metadata or by contacting the data collector.
Warning: This does not apply to longitudinal data collection (data collection that follows the same information over time, e.g. the weight of the same child), where it is normal and intended to have several records for the same subject at different points in time.
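One possible way to apply this rule is sketched below in Python: exact copies are reduced to a single record, while records that share an identifier but contain different values are set aside for investigation rather than resolved automatically. The data, the field names and the `resolve_duplicates` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical records; "id" is assumed to be the unique identifier.
records = [
    {"id": "HH-001", "members": 5, "water_source": "well"},
    {"id": "HH-001", "members": 5, "water_source": "well"},   # exact copy
    {"id": "HH-002", "members": 3, "water_source": "tap"},
    {"id": "HH-002", "members": 4, "water_source": "tap"},    # conflicting values
    {"id": "HH-003", "members": 6, "water_source": "river"},
]


def resolve_duplicates(rows, id_field="id"):
    """Split rows into records to keep and conflicting groups to investigate."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[id_field]].append(row)

    kept, to_review = [], {}
    for identifier, group in groups.items():
        # Collapse exact copies while preserving the original order.
        distinct = []
        for row in group:
            if row not in distinct:
                distinct.append(row)
        if len(distinct) == 1:
            kept.append(distinct[0])            # unique, or only exact copies
        else:
            to_review[identifier] = distinct    # needs a human decision
    return kept, to_review


kept, to_review = resolve_duplicates(records)
print([row["id"] for row in kept])
print(to_review)
```

Keeping the conflicting groups separate, instead of silently keeping the first or last record, leaves the final choice to whoever can check the metadata or speak to the data collector.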
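For longitudinal data, the duplicate check can still be run if the date of the visit is treated as part of the identifier. The short Python sketch below assumes hypothetical fields `child_id`, `visit_date` and `weight_kg`: several records for the same child are expected, but two records for the same child on the same date are flagged.

```python
from collections import Counter

# Hypothetical follow-up measurements for one child.
measurements = [
    {"child_id": "C-12", "visit_date": "2023-03-01", "weight_kg": 9.8},
    {"child_id": "C-12", "visit_date": "2023-04-01", "weight_kg": 10.1},
    {"child_id": "C-12", "visit_date": "2023-04-01", "weight_kg": 10.1},
]

# The subject alone repeats by design; subject + visit date should not.
counts = Counter((m["child_id"], m["visit_date"]) for m in measurements)
duplicate_visits = [key for key, n in counts.items() if n > 1]

print(duplicate_visits)
```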