Quantitative Data Analysis Toolbox

# 4.4 Looking for relations between variables

In many cases we’re not just concerned with analyzing each variable on its own, but also want to understand how different variables relate to each other. We may want to understand how changes in one variable are linked to changes in another variable. When X increases, does Y also increase? Does it decrease? By how much? This is called correlation analysis.

A good way to start looking for relationships is by creating basic scatter plots, with one variable plotted on the X axis and the other on the Y axis.
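As a minimal sketch of such a scatter plot, the snippet below plots hypothetical household-size and FCS values (illustrative numbers only, not the case-study data) and assumes the matplotlib library is available:

```python
# A basic scatter plot: one variable per axis.
# Household-size and FCS values here are hypothetical, for illustration only.
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window needed
import matplotlib.pyplot as plt

household_size = [2, 3, 4, 5, 6, 7, 8]         # hypothetical X values
fcs_score = [62, 58, 55, 49, 44, 41, 38]       # hypothetical Y values

plt.scatter(household_size, fcs_score)
plt.xlabel("Household size (number of members)")
plt.ylabel("Food Consumption Score (FCS)")
plt.title("Household size vs. FCS")
plt.savefig("scatter_fcs.png")
```

Eyeballing the resulting plot is often enough to see whether the points drift upward, downward, or show no pattern at all.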

In our case study, we want to examine whether the reduced Coping Strategy Index (rCSI) scores and the Food Consumption Scores (FCS) correlate with household size (number of household members).

Our hypothesis is that larger households among our beneficiary population will have lower levels of food security, since food security assistance from humanitarian actors is often a standard size (or value, if cash-based), and we know that almost a third of our sample relies primarily on food security assistance to acquire food.

We have conducted a correlation analysis of household size against the rCSI and the FCS separately. Under our hypothesis, household size should correlate positively with rCSI (since higher rCSI scores indicate higher food insecurity) and negatively with FCS (since lower FCS scores indicate higher food insecurity).

In our scatter plots, we also add a trendline (or line of best fit), which displays the linear relationship between the two variables. Based on the analysis above and the direction of the trendlines, we can say that household size is positively correlated with rCSI and negatively correlated with FCS. This suggests that food insecurity (indicated by lower FCS scores or higher rCSI scores) is indeed correlated with household size.
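The trendline your software draws is computed by least squares. A small sketch of that computation, using hypothetical data chosen to mimic the pattern described above (not the actual case-study figures):

```python
# Least-squares trendline (line of best fit) for y = a*x + b.
# Data values are hypothetical illustrations, not the case-study figures.

def trendline(xs, ys):
    """Return slope a and intercept b of the least-squares fit y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

household_size = [2, 3, 4, 5, 6, 7, 8]
rcsi = [5, 8, 10, 14, 16, 19, 22]     # hypothetical: rises with household size
fcs = [62, 58, 55, 49, 44, 41, 38]    # hypothetical: falls with household size

slope_rcsi, _ = trendline(household_size, rcsi)
slope_fcs, _ = trendline(household_size, fcs)
# A positive slope for rCSI and a negative slope for FCS
# would match the hypothesised directions of correlation.
```

The sign of the slope tells you the direction of the relationship; its magnitude tells you roughly how much Y changes per unit of X.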

Note: These conclusions are only valid if your survey was conducted with an appropriate sampling method.

You can also use basic statistical measures to quantify the correlation between variables. Pearson’s correlation coefficient is most commonly used in this case. The formula is somewhat involved to calculate by hand, but data analysis software such as Excel will calculate it easily. The coefficient ranges between -1 and 1, with -1 indicating a perfect negative correlation and 1 indicating a perfect positive correlation. A value of zero (or close to zero) would indicate no correlation between the variables.
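Under the hood, Pearson’s coefficient is the covariance of the two variables divided by the product of their standard deviations. A self-contained sketch with toy data:

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient: the covariance of xs and ys
    divided by the product of their standard deviations."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: perfectly linear relationships give r at the extremes.
r_pos = pearson_r([1, 2, 3], [2, 4, 6])  # y doubles x: r close to 1
r_neg = pearson_r([1, 2, 3], [6, 4, 2])  # y falls as x rises: r close to -1
```

In practice you would feed in your two survey columns (e.g. household size and FCS) rather than toy lists; Excel and most statistics packages implement the same formula.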

Also keep in mind that this statistic will only identify the presence of a linear correlation, which appears as points clustering around a straight line on a scatter plot. Curved, or nonlinear, correlations are more complicated and would need to be evaluated using alternative methods.

You should always be careful when interpreting the results of a correlation analysis. We can never assume that correlation is the same as causation! A classic example of this is data showing that rates of sunburn are highly correlated with ice cream sales. Clearly, sunburns aren’t causing people to buy more ice cream. We’re missing a third variable, temperature, which causes co-occurring increases in both sunburns and ice cream sales. There are many documented examples of variable pairs that are highly correlated yet logically unrelated to each other (such as per capita chicken consumption and US crude oil imports).
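The sunburn-and-ice-cream effect can be reproduced with a tiny simulation: generate a hidden third variable (temperature) and let it drive two otherwise unrelated variables. The numbers and coefficients below are arbitrary, chosen only to illustrate the mechanism:

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson's r: covariance over the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(42)
# The hidden confounder: daily temperature in degrees C (arbitrary range).
temperature = [random.uniform(15, 35) for _ in range(200)]

# Temperature drives BOTH variables; neither causes the other.
sunburns = [0.8 * t + random.gauss(0, 2) for t in temperature]
ice_cream_sales = [1.5 * t + random.gauss(0, 5) for t in temperature]

r = pearson_r(sunburns, ice_cream_sales)
# r comes out strongly positive even though there is no causal
# link between sunburns and ice cream sales.
```

Whenever a correlation surprises you, it is worth asking what third variable could be driving both series before drawing any causal conclusion.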