When cleaning a database, you should look for several different types of inconsistencies and errors, such as:
Data may be entered in the wrong format:
- Special characters that were transformed when exported
- “0” instead of an empty cell or “N/A” (i.e. for a numeric question that has not been answered)
- Unit issues (age in months/years, meters/feet, etc.), date or cell formats
- Numbers stored as text
- Differences in how special characters are treated across languages
Wrong or inconsistent data may be entered by the enumerators, or wrongly calculated by the software:
- Spelling inconsistencies ⇒ for example, place names with several different spellings
- Percentages adding up to more than 100%, suggesting the calculations were not set up correctly
- Sums that do not match (logical inconsistencies), e.g. the total number of household members differs from the sum of male and female members in the household
- Typing errors, for example “-5” people in the household
- Other logical inconsistencies in the responses ⇒ e.g. the household reports no children, but 3 children going to school
- Other inconsistent answers arising for various reasons, for example different interpretations of questions by enumerators or interviewees
- This is a common issue that can be addressed through good practices related to translation and testing surveys
- Incorrectly derived values, which occur when a function for a derived field was calculated incorrectly from the survey inputs
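Several of the checks above can be automated. The sketch below flags some of the listed error types in a single survey record; the column names (hh_total, hh_male, hh_female, age_years) and the age threshold are illustrative assumptions, so adapt them to your own form.

```python
def check_record(record):
    """Return a list of issues found in one survey record (a dict)."""
    issues = []

    # Numbers stored as text: try to coerce before comparing.
    try:
        total = int(record["hh_total"])
        male = int(record["hh_male"])
        female = int(record["hh_female"])
    except (ValueError, TypeError):
        return ["numeric field stored as non-numeric text"]

    # Typing errors: negative counts are impossible.
    if min(total, male, female) < 0:
        issues.append("negative household count")

    # Logical inconsistency: the total must match the sum of the parts.
    if total != male + female:
        issues.append("hh_total differs from hh_male + hh_female")

    # Unit-issue heuristic: an age above 120 may be in months, not years.
    age = record.get("age_years")
    if age is not None and int(age) > 120:
        issues.append("age_years looks like months, not years")

    return issues
```

Running such checks on every record before analysis turns the error list above into a repeatable, documented cleaning step rather than an ad hoc manual review.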
MDC tools can help prevent inconsistent data collection. Two of the most important tools are skip-logic and constraints. Well-designed MDC forms limit many inconsistencies, but not all, so the data still needs to be checked and cleaned.
Skip-logic (often called ‘relevancies’): The survey asks questions only if they make sense based on previous answers. It is also possible to restrict the possible answers to a specific question based on previous answers.
An easy example of skip-logic can be seen in the case study where respondents are asked: “Does your household receive food assistance (general in-kind food distribution and/or cash grants and/or food vouchers)?”
Based on ‘Yes’ or ‘No’ answers, respondents are asked one of the following questions, but not the other:
- (If Yes) “How many days did the food from the general food distribution from the last cycle of food assistance last?”
- (If No) “Why do you not have access to the local food assistance programmes?”
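In an XLSForm, this skip-logic is expressed in the `relevant` column of the survey sheet. The sketch below follows the case-study questions; the field names (food_assist, days_lasted, no_access_why) and choice-list names are assumptions, not the actual names in the case-study file.

```
type                   name           label                                                relevant
select_one yes_no      food_assist    Does your household receive food assistance?
integer                days_lasted    How many days did the food from the last cycle       ${food_assist} = 'yes'
                                      of food assistance last?
select_one reasons     no_access_why  Why do you not have access to the local food         ${food_assist} = 'no'
                                      assistance programmes?
```

A question whose `relevant` expression evaluates to false is never shown, so the enumerator cannot record a contradictory answer.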
Constraints: Constraints can limit the input values (if integer) or responses (if a nominal variable). With constraints on integer variables, you can set a maximum and minimum value in the MDC form and reduce data entry errors. With constraints on nominal variables, you can limit the answer options based on prior responses (in the same or other questions), thereby reducing logical inconsistencies within the responses.
An example of a constraint on an integer variable can be seen in the case study where respondents are asked: “What is the total number of household members?”. Responses can only be input between 1 and 30 (no negative numbers and no responses higher than 30). This leaves open the possibility for very large families, while removing the possibility of a data input error, such as a negative (-) symbol before the value, or an extra zero.
An example of a constraint on a nominal variable can be seen in the case study where respondents are asked: “Which of your household’s basic needs can you not afford?” The potential answer options are a series of basic household needs, however two other options include ‘Don’t know’ or ‘All needs are met’. The form does not allow respondents to select a basic need that they cannot afford, along with ‘Don’t know’ or ‘All basic needs are met’.
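Both kinds of constraint are written in the `constraint` column of an XLSForm, where `.` refers to the current answer. The rows below sketch the two case-study examples; the field and list names are illustrative assumptions.

```
type                          name         constraint                                               constraint_message
integer                       hh_size      . >= 1 and . <= 30                                       Enter a value between 1 and 30.
select_multiple basic_needs   unmet_needs  not(selected(., 'dont_know') and count-selected(.) > 1)  'Don't know' cannot be combined
                                                                                                    with other options.
```

The integer constraint rejects negative signs and extra zeros at entry time; the select_multiple constraint prevents mutually exclusive options from being ticked together.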
However, be careful: a poorly designed MDC form with inherent bias can actually decrease data quality by preventing enumerators from entering respondents’ actual answers. In this instance, the form would introduce ‘confirmation bias’.
Access the XLSForm file here.
Access the ZIP file here.
Download the PPT file here.
What to do with errors, missing values, and/or true identified extreme values?
If time and resources allow, it is always preferable to re-measure the suspect or erroneous value, for example through direct follow-up by enumerators with the research subject who first answered the survey questions. If this is not possible, you can follow one of the courses of action below (summarized from the ACAPS 2016 resource Data cleaning):
- Leave the data unchanged: Despite suspicion of a datapoint, leave it unchanged and include it in the analysis. The larger the sample size, the less one potential error will affect the analysis.
Remember, it is possible that the datapoint is a genuine outlier, and choosing to correct it often comes down to the subjective perspective of the person conducting the analysis. As such, leaving the data unchanged is a conservative approach. However, this course of action is more difficult to justify if the dataset has a small sample size, since one incorrect value then has a greater potential impact on the overall results.
- Correct the data: This can be done only if the respondent’s original intentions in answering questions can be fully determined, for example after follow-ups with the enumerator, or the respondent directly.
However, the more conservative approach is the one above: if there are no notes or evidence to explain where the error came from, it is often better to leave the data as is. Changing values without justification can introduce significant bias, as changes will be based on the preconceived notions of the person conducting the analysis. Therefore, without evidence of an error, it is best to leave the data as is and treat it as an outlier.
- Delete the data: You can delete the data if the value is so far from the norm that it is highly illogical and would distort descriptive or inferential statistics.
However, should you delete just this response or the entire record? Remember that any time data is deleted, there is a risk of ‘cherry-picking’, consciously or subconsciously, to obtain the expected results. Impossible values should never be left unchanged: correct them if a correct value can be found, otherwise delete them.
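One way to keep these decisions transparent is to flag suspect values rather than silently changing or deleting them, so that every correction or deletion can later be reviewed and justified. The sketch below annotates each record with a status; the field name (hh_size) and range are illustrative assumptions.

```python
def flag_suspect(records, field, low, high):
    """Annotate each record with a cleaning status instead of mutating it."""
    flagged = []
    for rec in records:
        value = rec.get(field)
        if value is None:
            status = "missing"
        elif not low <= value <= high:
            # Review, then correct or delete with a documented justification.
            status = "out_of_range"
        else:
            status = "ok"
        # Copy the record so the original data stays untouched.
        flagged.append({**rec, f"{field}_status": status})
    return flagged
```

Because the original values are preserved alongside the flags, the analysis can be re-run under different cleaning decisions, which limits the cherry-picking risk described above.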
Overall, the best course of action is context specific, and up to the discretion of the researcher. Collecting more background information can help you determine the best course of action.
Note: MDC tools can provide valuable additional information. Metadata from the submitted data can help, for example the time spent completing a survey, the recorded location, or a review of potential ‘gaps’ in the tools (such as a missing constraint, which could lead to data entry errors). Gathering feedback from the data collection team is also very useful, if time and resources allow.
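As a small illustration of using metadata, the sketch below flags interviews completed suspiciously quickly from start and end timestamps. The ISO-style timestamp format and the 10-minute threshold are illustrative assumptions; check the metadata format your MDC platform actually exports.

```python
from datetime import datetime

def too_fast(start, end, min_minutes=10):
    """Return True if the interview took less than min_minutes."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    duration = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return duration.total_seconds() < min_minutes * 60
```

Records flagged this way are not necessarily wrong, but they are good candidates for follow-up with the enumerator before deciding to keep, correct, or delete the data.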