Quantitative data analysis Toolbox

# 3.3.3 Outliers

An outlier is an extreme value that differs abnormally from the rest of a variable's distribution. In other words, the value of this observation differs greatly from the other values of the same variable. Like the search for duplicates, the detection of these extreme values (or outliers) is an essential step in data cleaning, as extreme values can influence the statistics you produce, particularly by increasing or decreasing the average.

For instance, suppose you are calculating the average number of food baskets received per month. The expected number is between 2 and 5, but if one or several families report having received up to 10 food baskets in the last month, those answers will pull the average upward.
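To make the effect concrete, here is a minimal sketch in Python using made-up numbers (the family counts below are illustrative, not survey data). It compares the mean and the median with and without a single family reporting 10 baskets.

```python
from statistics import mean, median

# Hypothetical answers from nine families, all within the expected range of 2 to 5.
typical = [2, 3, 3, 4, 4, 5, 2, 3, 4]

# The same answers plus one family reporting 10 food baskets.
with_outlier = typical + [10]

mean_typical = mean(typical)
mean_with_outlier = mean(with_outlier)
median_typical = median(typical)
median_with_outlier = median(with_outlier)

print(f"Mean without the extreme value:   {mean_typical:.2f}")
print(f"Mean with the extreme value:      {mean_with_outlier:.2f}")
print(f"Median without the extreme value: {median_typical:.2f}")
print(f"Median with the extreme value:    {median_with_outlier:.2f}")
```

In this sample, one extreme answer out of ten moves the mean noticeably more than it moves the median.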

• If you are using Excel to analyze your data, here is how you can spot them (resource available in French)

What to do with outliers?

First, highlight extreme values so that they are clearly identified and you get an automatic, visual overview (for example, through conditional formatting in Excel).
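Outside Excel, the same highlighting step can be sketched in Python. The rule used here (flag anything more than 1.5 interquartile ranges beyond the first or third quartile) is one common convention and an assumption of this example, not a rule prescribed by this toolbox. The function name `flag_outliers` and the sample data are hypothetical.

```python
from statistics import quantiles

def flag_outliers(values, k=1.5):
    """Return one boolean per value: True if it lies outside the IQR fences.

    k=1.5 is a common convention, not a fixed rule; adjust it to your context.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartiles (default 'exclusive' method)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]

# Hypothetical monthly food basket counts.
baskets = [2, 3, 3, 4, 4, 5, 2, 3, 4, 10]
flags = flag_outliers(baskets)

# List the flagged values with their row position so they can be reviewed.
flagged = [(row, value) for row, (value, flag) in enumerate(zip(baskets, flags)) if flag]
print(flagged)
```

Flagging only marks values for review. It does not decide whether they are errors or true extreme values.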

Determining the cause of an outlier then helps you decide what to do with it.

• Outliers caused by the inclusion of a subject who does not fit your target population. For instance, adults interviewed during a survey on children.
• In that case, the data should be excluded from the analysis and can be removed.
• Outliers caused by a typing error or a misunderstanding of the question. For example, 120 food baskets received in one day by one family.
• If they are easily identifiable, they can be treated as an error (see above).

If MDC forms are properly designed, it becomes much harder for the enumerator to enter unrealistic or extreme values in the field. MDC can limit possible answers (i.e. provide a maximum or minimum answer, or a selection of possible qualitative values). For instance, a form can be set to accept only an age within a specific range, thereby removing the possibility of including data from people outside this range, as well as the possibility of data input errors.

However, be careful: a poorly designed MDC form with inherent bias can actually decrease data quality by preventing enumerators from entering the subjects' actual answers. In this instance, the form would contain ‘confirmation bias’.
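The kind of range constraint described above can be sketched as a small validation function. This is an illustration in Python, not the syntax of any particular MDC tool. The function name `validate_age` and the default limits of 0 to 17 (a survey on children) are assumptions made for the example.

```python
def validate_age(raw_value, minimum=0, maximum=17):
    """Return (is_valid, message) for an age typed into a form field.

    The limits are hypothetical; set them to match your target population.
    """
    try:
        age = int(raw_value)
    except (TypeError, ValueError):
        return False, "Age must be a whole number."
    if not minimum <= age <= maximum:
        return False, f"Age must be between {minimum} and {maximum}."
    return True, ""

# A plausible answer is accepted; an out-of-range or mistyped one is rejected.
print(validate_age("12"))
print(validate_age("45"))
print(validate_age("abc"))
```

If the limits are set too tightly, the same check will reject genuine answers, which is the data quality risk mentioned above.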

But it is not always easy to see whether an outlier is caused by an error or if it is a true extreme value.

• For true or undetermined extreme values, it is not always desirable to exclude or delete them. Sometimes it is best to keep outliers in your data, as they can capture valuable information that is part of your survey results.
• You should examine the influence of extreme values on your analysis results before making any decision. Depending on what you find, you can:
• Keep them, especially if you know they are true extreme values and you want to consider them in your analysis. In that case, analyze the median rather than the mean to minimize their effect on the final analysis. Further, the difference between the median and the mean highlights how strongly extreme values pull on the results.
• Exclude outliers from your analysis, although this can also introduce bias into your results and any subsequent statistical analysis.
• Replace extreme values with a normal or expected value, but be aware that this can distort your analysis. Two methods are possible:
• Mean imputation (replacing the missing or inconsistent value with the mean value), which is easy to do but can reduce the statistical value of the dataset.
• Winsorizing, which is statistically more robust and means setting all outliers to a specified percentile of the data. To do this in Excel, you can refer here.
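The options listed above can be compared side by side before making a decision. The sketch below is a minimal Python illustration under several assumptions: the data are made up, the outlier rule (more than 6 baskets) is hypothetical, mean imputation uses the mean of the non-outlying values, and winsorizing clamps values to the 5th and 95th percentiles. None of these choices is prescribed by this toolbox.

```python
from statistics import mean, median, quantiles

# Hypothetical monthly food basket counts; 10 is the extreme value.
baskets = [2, 3, 3, 4, 4, 5, 2, 3, 4, 10]
is_outlier = [value > 6 for value in baskets]  # hypothetical flagging rule

# Option 1: keep everything (and report the median alongside the mean).
kept = list(baskets)

# Option 2: exclude the flagged values.
excluded = [v for v, flag in zip(baskets, is_outlier) if not flag]

# Option 3a: mean imputation, here using the mean of the non-flagged values.
fill_value = mean(excluded)
imputed = [fill_value if flag else v for v, flag in zip(baskets, is_outlier)]

# Option 3b: winsorizing, clamping values to chosen percentiles.
def winsorize(values, lower_pct=5, upper_pct=95):
    """Clamp values below/above the given percentiles to those percentiles."""
    cuts = quantiles(values, n=100, method="inclusive")  # cuts[p - 1] is percentile p
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

winsorized = winsorize(baskets)

for label, data in [("kept", kept), ("excluded", excluded),
                    ("imputed", imputed), ("winsorized", winsorized)]:
    print(f"{label:<11} n={len(data):>2}  mean={mean(data):.2f}  median={median(data):.2f}")
```

Printing the mean and median for each option shows how much the choice of treatment changes the result, which is the information to report in your methodology.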

Keep in mind that even if some values do not seem to make sense, when there is no evidence of error the most conservative approach is to leave the data as is. Changing values introduces the possibility of significant bias based on the subjective perspective of the person conducting the analysis!

In any case, your treatment of outliers should be described in the methodology section of your report, alongside a description of its effects on the analysis.