4.3.3 Measures of variability
Measures of variability tell us about how spread out, or varied, our data is across the possible range of values.
The simplest way to do this is by calculating the overall range of our observations, subtracting the smallest observation from the largest.
However, the standard deviation is the most commonly used measure of variability. The standard deviation tells us, on average, how far our observations are from the mean. A larger standard deviation indicates that our data is more spread out. This page outlines the formula behind calculating the standard deviation. How to calculate that in Excel: Simple!
To go into more detail: what does the standard deviation signify? The situation is summed up in the following graph:
This means that, generally, you want to use the standard deviation when the graphical distribution of your data looks like a bell (that is, a Gaussian, or normal, distribution).
In that case, approximately 68.2% (34.1% + 34.1%) of the values are located between your average - 1 standard deviation and your average + 1 standard deviation.
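As a minimal sketch of the two measures described so far, the range and the standard deviation can be computed with Python's standard `statistics` module. The values in `litres` are hypothetical and only serve as an illustration; the choice between the sample and the population formula is an assumption you need to make for your own data.

```python
import statistics

# Hypothetical survey values: litres of water per person per day.
litres = [18.5, 19.0, 20.5, 21.0, 21.0, 22.0, 23.5, 24.5]

# Range: largest observation minus smallest observation.
value_range = max(litres) - min(litres)

mean = statistics.mean(litres)

# Sample standard deviation (divides by n - 1): the usual choice when
# the data come from a survey of part of the population.
sample_sd = statistics.stdev(litres)

# Population standard deviation (divides by n): for a complete census.
population_sd = statistics.pstdev(litres)

print(f"range         = {value_range:.1f}")
print(f"mean          = {mean:.2f}")
print(f"sample sd     = {sample_sd:.2f}")
print(f"population sd = {population_sd:.2f}")
```

In Excel, the corresponding functions are `STDEV.S` (sample) and `STDEV.P` (population).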
Let’s say you want to understand the impact of constructing wells on your beneficiary population in a camp setting, which you measure through the amount of water that people in the surrounding households receive from the wells per day. Through a survey, you find a mean value of 21 litres per person per day and a standard deviation of 2.1 litres. This would mean that 68.2% of the values are between 18.9 and 23.1 litres per person per day. Further, about 95% of the values will fall between 21 - 2 × 2.1 = 16.8 and 21 + 2 × 2.1 = 25.2 litres per person per day.
You can therefore hypothesise (if your survey was conducted with an appropriate sampling method) that most of your beneficiaries in the area (about 95% of them) have received between 16.8 and 25.2 litres/person/day.
If your target is 15 litres per person per day, you have met your target. If you are aiming for 20 litres per person per day, however, you should still analyse the data in more detail to understand the discrepancies (even though you could say that, on average, your distribution has met the target). You can then check whether you have a geographical imbalance (for example, a zone that is not well covered) that you should focus on during your next round of construction activities. Therefore, you could almost say that the standard deviation is a measure of the “volatility” of the measurement of your indicators (and hence of the trust you can place in it).
As you can see, the standard deviation and the mean don’t tell you much on their own, but interrogating the numbers will lead you to ask the right questions of your program team.
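The arithmetic behind these two intervals can be sketched as follows. The helper `interval` is a hypothetical name introduced for illustration, and the multipliers 1 and 2 are the usual rounded values for roughly 68% and 95% of a bell-shaped distribution.

```python
mean = 21.0  # litres per person per day, from the survey example
sd = 2.1     # standard deviation, in the same unit


def interval(mean, sd, k):
    """Return the bounds (mean - k * sd, mean + k * sd)."""
    return (mean - k * sd, mean + k * sd)


low_68, high_68 = interval(mean, sd, 1)  # about 68% of values
low_95, high_95 = interval(mean, sd, 2)  # about 95% of values

print(f"68%: {low_68:.1f} to {high_68:.1f} litres")
print(f"95%: {low_95:.1f} to {high_95:.1f} litres")
```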
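To put a rough number on this comparison with the targets, one option is to estimate the share of people who fall below each target. The sketch below assumes that the survey values follow a normal distribution with the mean and standard deviation of the example, which your real data may not do; it uses `NormalDist` from Python's standard library.

```python
from statistics import NormalDist

# Assumption: water received is normally distributed with the
# mean (21) and standard deviation (2.1) found in the survey example.
water = NormalDist(mu=21.0, sigma=2.1)

for target in (15.0, 20.0):
    # cdf(x) is the share of the distribution at or below x.
    share_below = water.cdf(target)
    print(f"target {target:.0f} l: about {share_below:.1%} below target")
```

Under that assumption, almost nobody falls below 15 litres, while close to a third of people fall below 20 litres even though the mean is above 20. That is the kind of discrepancy worth investigating further.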
It is crucial not to reduce the analysis of data to the simple duo of the mean and standard deviation. Indeed, you must always display your data in order to better visually understand the spread of your data, and therefore assess the relevance of the mean and the standard deviation.
Look, for example, at the two graphs below:
In the first case (on the left), assessing the mean and the standard deviation will be relevant, as the distribution looks like a Gaussian curve.
In the graph on the right though, the approximation will be factually wrong, as it will not accurately represent how your data are distributed. In these cases, you should:
- Employ other ways to statistically model the distribution (via an exponential or Poisson distribution, etc., whose parameters are different from the mean and the standard deviation), or;
- Assume that the numbers you have for the mean and standard deviation are not really representative, and therefore shouldn’t be used for decision making, but rather as a basis for further assessments.
What you need to know:
- You should always plot your data to look at the distribution. Use a histogram, and add the mean. Remember that a curve that does not look like a bell hints that you shouldn’t place much confidence in the mean and standard deviation.
- It also means that, for two series of measurements that have the same mean and standard deviation, you cannot draw any conclusion without looking at the plots.
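The second point can be illustrated with a small sketch. The two series below are made-up numbers chosen so that they share the same mean and the same (population) standard deviation while having very different shapes; `text_histogram` is a hypothetical helper that draws a rough histogram in plain text.

```python
import statistics
from collections import Counter

# Hypothetical series: same mean and standard deviation, different shapes.
single_peak = [2, 4, 4, 4, 5, 5, 7, 9]
two_groups = [3, 3, 3, 3, 7, 7, 7, 7]


def text_histogram(values):
    """Return one line per integer value, with one '#' per observation."""
    counts = Counter(values)
    lines = []
    for value in range(min(values), max(values) + 1):
        lines.append(f"{value:>2} | {'#' * counts[value]}")
    return "\n".join(lines)


for name, series in (("single peak", single_peak), ("two groups", two_groups)):
    mean = statistics.mean(series)
    sd = statistics.pstdev(series)
    print(f"{name}: mean = {mean}, standard deviation = {sd}")
    print(text_histogram(series))
    print()
```

The summary numbers are identical, but the second histogram shows two separate clusters with nothing in between, which the mean and standard deviation alone would never reveal.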
The variance is closely related to the standard deviation, as it is just the square of the standard deviation. While these two measures both quantify similar characteristics of your data, the standard deviation is much more commonly used. Check out how to do it in Excel (available in French).
A quartile is a division of your data set into 4 equal parts (therefore using an approach similar to the median, which divides it into 2). As with the median, you first need to order all observations from smallest to largest.
- Then the first quartile (Q1) will be the value that sits midway, by rank, between the smallest number of your data set and the median of the data set (25% of the total values fall below it).
- The second quartile (Q2) is the median of the data set (50% of the total values fall below it).
- The third quartile (Q3) will be the value that sits midway, by rank, between the median of the data set and the highest number of your data set (75% of the total values fall below it).
Once you’ve defined these values, you can identify the interquartile range (IQR), which is the difference between Q3 and Q1 (Q3 - Q1). This spread analysis is complementary to the standard deviation, as it uses a parameter that is less influenced by extreme values and therefore gives you a more robust estimate of the spread.
Therefore, when you want to analyse your data and, specifically, exclude outliers, you could use the following representation, known as a box plot:
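A minimal sketch of the quartile and IQR calculation, using `statistics.quantiles` from Python's standard library. The data are hypothetical, and the `"inclusive"` method is one of several conventions for computing quartiles, so other tools (including Excel) may return slightly different values on the same data. The example also replaces one value with an extreme one to show how little the IQR reacts compared with the standard deviation.

```python
import statistics


def quartiles_and_iqr(values):
    """Return (Q1, Q2, Q3, IQR) using the 'inclusive' quartile method."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q2, q3, q3 - q1


# Hypothetical observations (litres per person per day).
clean = [15, 16, 17, 18, 19, 20, 21, 22, 23]
# Same data, with the largest value replaced by an extreme one.
with_outlier = [15, 16, 17, 18, 19, 20, 21, 22, 90]

for name, data in (("clean", clean), ("with outlier", with_outlier)):
    q1, q2, q3, iqr = quartiles_and_iqr(data)
    sd = statistics.pstdev(data)
    print(f"{name}: Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}, sd={sd:.2f}")
```

The quartiles and the IQR are the same for both series, whereas the standard deviation grows several times larger once the extreme value is present.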
In statistics, quartiles (dividing into 4) and deciles (dividing into 10) are the most commonly used quantiles to characterise a dataset. You can also divide your dataset into other intervals with equal probabilities, such as terciles (dividing into 3), sextiles (dividing into 6), or percentiles (dividing into 100), according to the level of granularity you want and the size of your dataset.
What you need to know: You should always use a form of quantile (at least the median, but also quartiles) to analyse your data, because quantiles are less susceptible to outliers and long-tailed distributions. Although these measures are less efficient than the parameters of a normal distribution (such as the mean and standard deviation), they are much more appropriate when your data have a different shape, for example the following:
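As a short illustration, the same standard-library function can produce any of these divisions by changing the number of groups `n`. The `scores` list is a hypothetical data set of 100 observations; note that dividing into n groups always yields n - 1 cut points.

```python
import statistics

# Hypothetical data set: 100 observations with values 1 to 100.
scores = list(range(1, 101))

terciles = statistics.quantiles(scores, n=3)       # 2 cut points
deciles = statistics.quantiles(scores, n=10)       # 9 cut points
percentiles = statistics.quantiles(scores, n=100)  # 99 cut points

print(f"terciles: {terciles}")
print(f"deciles:  {deciles}")
print(f"number of percentile cut points: {len(percentiles)}")
```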
In our case study, we have created a box plot of the FCS scores for each household by region, which shows some interesting differences for interpretation.
To read the box plot: the bottom ‘whisker’ marks the minimum value and the top ‘whisker’ marks the highest value. The bottom of the box marks the location of the first quartile and the top of the box the location of the third quartile; the box length, therefore, shows the interquartile range. The ‘X’ in the box represents the mean, whereas the line across the box represents the median.
Firstly, each region has very similar ranges of values, as seen through the interquartile range; this makes sense, given that the FCS is a discrete, quantitative variable with a minimum value of 0 and a maximum value of 112 (which implies that each of the food groups was consumed every day for the last seven days).
However, despite very similar mean values (represented by the ‘X’), we can see that the median values differ significantly among regions (represented by the lines in the boxes). Notably, Region 1 shows a median above 50 (higher than the mean), whereas Region 2 shows a median of 31 (below the mean). With the median closer to the top of the box, Region 1 has a negative skew and, with the opposite, Region 2 has a positive skew.
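The values that such a box plot displays can be sketched as below. The function name `box_plot_summary` and the `fcs_scores` list are hypothetical, and the whiskers are placed at the minimum and maximum as described above; many plotting tools use a different whisker convention and draw extreme values as separate points.

```python
import statistics


def box_plot_summary(values):
    """Return the numbers a box plot shows, plus the mean (the 'X')."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),   # bottom whisker
        "q1": q1,             # bottom of the box
        "median": median,     # line across the box
        "q3": q3,             # top of the box
        "max": max(values),   # top whisker
        "iqr": q3 - q1,       # length of the box
        "mean": statistics.mean(values),
    }


# Hypothetical FCS scores for the households of one region.
fcs_scores = [21, 28, 31, 35, 42, 49, 56, 70, 91]

summary = box_plot_summary(fcs_scores)
for label, value in summary.items():
    print(f"{label:>6}: {value}")
```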
The positive skew seen in Region 2 suggests that although some households have high FCS scores, more households have FCS scores below the mean (on the lower side of the box). In the world of our case study, this suggests that Region 2 could have higher rates of food insecurity, as measured by the FCS indicator.
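The reading of skew used here, comparing the mean with the median, can be sketched as a simple rule of thumb. The function `skew_direction` and the two lists of scores are hypothetical and are not the case study data; the rule is a quick heuristic, not a formal skewness statistic.

```python
import statistics


def skew_direction(values):
    """Rule of thumb: compare the mean with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positive"  # a tail of high values pulls the mean up
    if mean < median:
        return "negative"  # a tail of low values pulls the mean down
    return "none"


# Hypothetical FCS scores, for illustration only.
region_1 = [20, 45, 52, 55, 58, 60, 62]  # one low score, median 55
region_2 = [25, 28, 30, 31, 35, 60, 95]  # two high scores, median 31

print("Region 1:", skew_direction(region_1))
print("Region 2:", skew_direction(region_2))
```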