
3.7 Summary of Data Quality Control


Before data can be analysed, it must first be checked for possible errors. Some errors may only become obvious once the analysis has begun (for example, a value that looks like an outlier but is actually a typo). Even so, it is preferable to detect as many errors as possible beforehand in order to avoid having to backtrack during the analysis.

You can find the definition of data quality in the glossary of the First steps with program data toolbox.

The table below lists several common errors and the techniques to correct them. It should be noted that some of these errors can be mitigated by means of data entry controls in the input platform.
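For example (a hypothetical set-up), a drop-down list created in Excel via Data > Data Validation, with “Allow: List” and the permitted categories (such as “male” and “female”) as the source, prevents a spelling like “man” from being typed into a “head of household” column in the first place; most mobile data collection platforms offer equivalent controls through predefined choice lists.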

| Error | Description | Technical correction |
|-------|-------------|----------------------|
| Outliers | An outlier is an observation that is uncommonly far from all the other measured values. Possible outliers should be checked at the data-processing stage to make sure they are genuine values and not errors. | Conditional formatting |
| Missing data/gaps | There may be gaps in the data due to a lack of response (for example, informants refused to answer, were not available or did not have enough information to answer) or to irrelevant questions (the question did not apply to the respondent). People who use the data need to know what “missing data” means in order to analyse it correctly. | Conditional formatting; Find and Replace. Enter “NULL value” in the empty cells or, if differentiation is required, “no response” or “not applicable”. It is common to use -9999 or N/A for null values. |
| Zero values | Zero values have a direct impact on calculations performed on a dataset. For example, the average of 0 and 1 is 0.5, while the average of an empty cell and 1 is 1. Particular attention should be paid to this point to avoid erroneous analysis results. | Find and Replace. Zero is never used in place of a “NULL value” (empty cell); it is only used when the value really is significant (a true value of zero). |
| Duplicates | Duplicates are two or more identical records (for example, the same household appears twice or data from the same respondent is recorded twice). | Filters; conditional formatting (in Excel, the Duplicate Values rule); PivotTables with a count of unique records |
| Typing errors/spelling errors | Input errors can be a problem during the analysis phase, as one category may end up being treated as several separate categories. For example, if one person writes “male” under “head of household” and another writes “man”, households recorded as “man” may not be counted if the keyword used for the calculation is “male”. | Manual check; Find and Replace for input errors/common mistakes |
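To illustrate these techniques, here is one possible set of helper formulas, assuming a purely hypothetical layout in which respondent IDs are in column A, a numeric variable in column B and the “head of household” category in column C, with data in rows 2 to 500:

Duplicates: =COUNTIF($A$2:$A$500,A2)>1 returns TRUE when the ID in A2 appears more than once; the same formula can be used as a conditional formatting rule to highlight duplicate rows.

Missing data: =COUNTBLANK(B2:B500) counts the empty cells in a range, which helps locate gaps before they are replaced with “NULL value”, “no response” or “not applicable”.

Zero values: the AVERAGE function ignores empty cells, so =AVERAGE(B2:B3) returns 1 if B2 is empty and B3 contains 1, but 0.5 if B2 contains 0; a quick test like this shows how blanks and zeros affect a calculation.

Outliers: =OR(B2<QUARTILE($B$2:$B$500,1)-1.5*(QUARTILE($B$2:$B$500,3)-QUARTILE($B$2:$B$500,1)),B2>QUARTILE($B$2:$B$500,3)+1.5*(QUARTILE($B$2:$B$500,3)-QUARTILE($B$2:$B$500,1))) flags values that fall outside 1.5 times the interquartile range, one common rule of thumb; any flagged value still needs to be checked against the source before being treated as an error.

Typing errors: comparing =COUNTIF($C$2:$C$500,"male") with =COUNTIF($C$2:$C$500,"man") shows at a glance whether the same category has been entered under several spellings before correcting them with Find and Replace.

These column references and thresholds are only examples; adapt them to the actual structure of your dataset.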

The following exercise is only available on the French page at this stage - translation coming soon!

The Excel and analysis toolkit uses a set of fictitious data generated from a ‘real-world’ scenario in the humanitarian sector.

You will find a more complete presentation of the case study (study presentation, research questions, list of indicators, analysis plan, comprehensive databases) in the “Data Analysis Toolkit” – Part 7, “The case study”.

To practice data cleaning and quality control, here is a hands-on exercise whose goal is to clean the raw data from the case study. After twenty or so steps - from formatting to identifying missing values to removing duplicates - the database will be ready for analysis.

Use the [“Excel Toolbox - Cleaning - Practice exercise - V1 - Statement”] file to practice data cleaning, following the steps described.

In case of difficulties, or to compare your results with the clean database, refer to the file [“Excel Toolbox - Cleaning - Practical exercise - V1 – Solutions”].