3.5 Identifying extreme values
An outlier is an extreme value, which is abnormally different from the distribution of a variable. In other words, the value of this observation greatly differs from the other values of the same variable. Just like the search for duplicates, the detection of these extreme values (or outliers) is an essential step in data cleaning, since extreme values can influence upward, or downward certain statistics produced - particularly those relating to averages - and thus lead to erroneous analyses.
Extreme values can be identified via the following steps: to begin with, determine the 1st quartile Q1=QUARTILE(cell range;1) and the 3rd quartile Q3=QUARTILE(cell range;3) of the variable of interest and then evaluate the interquartile range (IQR), i.e., the middle (50%) spread of your data. The latter is calculated as the difference between the value of the 1st and 3rd quartiles, Q3-Q1. The third step is to calculate the lower and upper limits, which are the smallest and largest values in the data range you are using. For the lower limit, the IQR value is then multiplied by 1.5 and subtracted from the Q1 value. For the upper limit, the IQR is multiplied by 1.5, and added to the Q3 value.
To identify outliers, the =OR() function can be used to locate data below the lower limit value or above the upper limit value.
Depending on the context, a TRUE value may then indicate an outlier (above, line 8).
Conditional formatting can also be used to highlight extreme values, by specifying under which conditions a value is considered as such.
You can find more information on conditional formatting in this section.
Once you have identified the extreme values, you can either:
- keep them as is,
- replace them with a normal or random expected value,
- remove them from the analysis.
You can find more information on extreme values in this section of the Data analysis toolbox.
The following exercise is only available on the French page at this stage- translation coming soon!
In order to practice calculating quartiles (and other descriptive statistics), here is a practical exercise (Bonus 3) using clean data from the case study - developed by CartONG as part of the Data Analysis Toolbox publication. A more complete presentation of this case study can be found in the “Data Analysis Toolbox” – Part “7 The case study ”.
In this example, we propose to generate, using certain functions, descriptive statistics (including the 1st and 3rd quartiles, as well as the median), concerning:
- The age of household members
- The Food Consumption Score (FCS) of households
Use the [“Excel Toolbox - Analysis 4.3 - Practice exercise - V1 - Statement”] file to practice calculating descriptive analyses using functions.
In case of difficulties, or to compare your results with those of the case study, refer to the file [“Excel Toolbox - Analysis 4.3 - Practical exercise - V1 – Solutions”].