3.4 Identifying duplicates


Before you can analyse your data, it is important to ensure the reliability of the data. In any data set, each record must have a unique identifier. If the dataset already has a unique identifier, it is necessary to ensure that all the identifiers of the records are truly unique, that there are no duplicates. The search for these duplicates is an important phase of data cleaning, as their presence can lead to the production of biased indicators.
Here are some examples of frequently encountered duplicates in data sets used in the humanitarian context:
- A survey form is entered repeatedly into the database by mistake.
- A mobile survey is initiated, then interrupted, then resumed from the beginning. Imagine that the survey is initiated with a given household and that a member of the household appears while the interview is already under way. The household then asks to start everything over from the beginning. If the first record is not deleted in the field, it will appear twice: first in an incomplete manner and then a second time in its entirety.
- When beneficiaries are registered, some individuals or households may attempt to register twice (or more).
These examples are not exhaustive, and it is recommended that you always check carefully for possible duplicates in the datasets before starting the analysis.
It is possible to identify duplicates in Excel via several methods:
- Under Data, click Data Tools and then click Remove Duplicates.
- It is also possible to highlight single or duplicate values, use the Conditional Formatting command under the Style group of the Home tab.
- Create a PivotTable with the data to identify duplicates in a column, as PivotTables can handle large datasets faster: drag the unique field (home number, registration number, phone number, etc.) into “LINES” and the record identifier into “VALUES”; the calculation is “COUNT”. In the PivotTable, identify fields that have multiple records. These are duplicates. Filters can be applied to the initial dataset for further investigation.
Find more information on PivotTables in this section.