Effect Size

- Effect size is used to show how much an intervention works.
- The greater the value of the effect size, the greater the degree of effect between the variables.
- There are many different effect sizes to consider, each slightly different from each other and used in different circumstances. The three main categories are the group difference indices, correlation coefficients and the risk estimates.

You may have heard that in statistics there is such a thing as significance. Statistical significance shows that an effect is present in a statistical study and is given by a p-value. Practical significance, on the other hand, concerns how large that effect is in real life: this is what effect size measures, as effect sizes show the strength or magnitude of the relationship between two or more variables.

It is important to note that a statistically significant result in a hypothesis test tells us nothing about whether the effect is large or small, or indeed important at all. That is why it is important to consider effect sizes as well when hypothesis testing. In other words, the effect size can be used to answer how well something works, rather than whether it works at all. However, true effect sizes are never known with utmost certainty, and should be reported alongside measures of uncertainty such as a confidence interval or the standard error. Of course, as with any hypothesis testing, a narrative explanation should be provided as well to give purpose to the test and link results back to real life.

Effect sizes can be small, medium or large, and how to determine this depends on which effect size measure you use - there are a few! There are three main types of effect size: group difference indices, correlation coefficients and risk estimates.

As the name implies, group difference indices denote the difference between two or more groups.

Cohen's *d* is probably the most popular effect size to use in hypothesis testing for continuous data where the sample size is large (more than 50), and is calculated as:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}$$

where $\bar{x}_1$ and $\bar{x}_2$ are the two group means and $s_{pooled}$ is the pooled standard deviation.

Generally:

- *d* ≤ 0.5 is a small effect size
- 0.5 < *d* < 0.8 is a medium effect size
- 0.8 ≤ *d* < 1.3 is a large effect size
- 1.3 ≤ *d* is a very large effect size.

Notice that the equation requires the pooled standard deviation - this is of course assuming equal variance between groups! If you are not assuming this, you will need to use an alternative, such as Glass's Δ. The pooled standard deviation is a weighted average of the two group standard deviations, calculated by:

$$s_{pooled} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

where

- *n*1 is the sample size for group 1
- *n*2 is the sample size for group 2
- *s*1 is the standard deviation for group 1
- *s*2 is the standard deviation for group 2.
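A minimal Python sketch of Cohen's *d* using the pooled standard deviation, assuming equal variances; the group data below is hypothetical, purely for illustration:

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(group1, group2):
    """Cohen's d: mean difference divided by the pooled standard deviation
    (assumes equal variances between the groups)."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)  # sample standard deviations
    # Pooled SD: weighted average of the two group standard deviations
    s_pooled = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (mean(group1) - mean(group2)) / s_pooled

# Hypothetical scores for a treatment and a control group
treatment = [5.1, 4.9, 6.2, 5.8, 5.5, 6.0]
control = [4.2, 4.5, 4.8, 4.0, 4.6, 4.3]
print(round(cohens_d(treatment, control), 2))
```

In practice the resulting *d* would be reported alongside a measure of uncertainty, as noted earlier.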

Glass's Δ (pronounced 'delta') is very similar to Cohen's *d*, as it is also used as an effect size between groups of continuous data, but is instead calculated as:

$$\Delta = \frac{\bar{x}_1 - \bar{x}_2}{s_{control}}$$

where $s_{control}$ is the standard deviation of the control group.

The standard deviation of the control group is used here, as opposed to the pooled standard deviation, because it makes Glass's Δ less sensitive to differences in variances: the effect size does not change when groups have equal means but different variances.

Like Cohen's *d*, Glass's Δ should be used when groups have large sample sizes.

Hedges' *g* is preferable to Cohen's *d* or Glass's Δ when working with uneven sample sizes or sample sizes less than 20 (for sample sizes above 20, Hedges' *g* and Cohen's *d* are approximately equal). This is because it corrects for the bias that smaller samples can introduce. Otherwise, there is very little difference between Cohen's *d* and Hedges' *g*.
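Glass's Δ and Hedges' *g* can be sketched the same way. The bias-correction factor used for Hedges' *g* below is the common approximation 1 - 3/(4(*n*1 + *n*2) - 9) - an assumption worth checking against your preferred reference - and the data is again hypothetical:

```python
from math import sqrt
from statistics import mean, stdev

def glass_delta(treatment, control):
    """Glass's delta: standardises by the control group's SD only."""
    return (mean(treatment) - mean(control)) / stdev(control)

def hedges_g(group1, group2):
    """Hedges' g: Cohen's d with an approximate small-sample bias correction."""
    n1, n2 = len(group1), len(group2)
    s_pooled = sqrt(((n1 - 1) * stdev(group1) ** 2 + (n2 - 1) * stdev(group2) ** 2)
                    / (n1 + n2 - 2))
    d = (mean(group1) - mean(group2)) / s_pooled
    correction = 1 - 3 / (4 * (n1 + n2) - 9)  # approximate bias-correction factor
    return d * correction

# Hypothetical small samples, where the correction matters most
treatment = [12.0, 14.5, 13.2, 15.1, 13.8]
control = [11.0, 11.8, 12.4, 10.9, 11.5]
print(round(glass_delta(treatment, control), 2))
print(round(hedges_g(treatment, control), 2))
```

Note that the correction factor is slightly below 1, so *g* is always a little smaller in magnitude than the uncorrected *d*.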

You may be surprised to hear that Pearson's *r* can be used as an effect size measurement, but as a correlation measures the strength and direction of association between variables, it can indeed be interpreted as a measure of effect. Correlation coefficients used in this way are therefore considered indices of association.

Pearson's *r* is the most popular measure of correlation; note that *r* is only indicative of the strength of the relationship between variables, rather than of any cause and effect between them. It is used as a measurement of effect to show association between linearly related continuous variables, producing a value for *r* between -1 and 1:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}}$$

where:

- *n* is the sample size
- *x* represents one variable
- *y* represents the other (it doesn't matter which way round *x* and *y* go!)

such that:

- 0.7 < *r* ≤ 1 indicates a strong positive relationship
- 0.4 < *r* ≤ 0.7 indicates a moderate positive relationship
- 0 < *r* ≤ 0.4 indicates a weak positive relationship
- *r* = 0 indicates no relationship
- -0.4 ≤ *r* < 0 indicates a weak negative relationship
- -0.7 ≤ *r* < -0.4 indicates a moderate negative relationship
- -1 ≤ *r* < -0.7 indicates a strong negative relationship.

Be aware that Pearson's *r* is a parametric measure and should only be used on continuous data that is normally distributed, homogeneous and independent. Alternatives for nonparametric data are Spearman's ρ (rho) and Kendall's τ (tau).
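Pearson's *r* can be computed directly from sums over the paired data. A quick sketch with illustrative data:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r via the computational (sums) formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))   # sum of products
    sxx = sum(a * a for a in x)              # sum of squares of x
    syy = sum(b * b for b in y)              # sum of squares of y
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]    # y is perfectly linear in x
print(pearson_r(x, y))  # → 1.0
```

A perfectly linear decreasing relationship would give -1.0 instead, matching the interpretation bands above.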

Cramer's *V*, also known as Cramer's φ (phi) coefficient, is another association index, however this time it is used for the Chi-squared test of independence, i.e., for categorical variables which are not ordinal, with each group having expected frequencies of at least five. Cramer's *V* is calculated from the Chi-squared statistic obtained from a contingency table formed using the categorical variables.

The value of *V* varies between zero and one and is calculated by:

$$V = \sqrt{\frac{X^2}{n \cdot \min(r - 1, c - 1)}}$$

where:

- *X*² is the Chi-squared statistic
- *n* is the sample size
- *r* is the number of rows in the contingency table
- *c* is the number of columns in the contingency table.

How *V* is interpreted depends on the degrees of freedom. Recall that the degrees of freedom are given by:

min(*c* - 1, *r* - 1).

| Degrees of Freedom | Small Effect Size | Medium Effect Size | Large Effect Size |
| --- | --- | --- | --- |
| 1 | 0.1 | 0.3 | 0.5 |
| 2 | 0.07 | 0.21 | 0.35 |
| 3 | 0.06 | 0.17 | 0.29 |
| 4 | 0.05 | 0.15 | 0.25 |
| 5 | 0.04 | 0.13 | 0.22 |

Typically, a value of *V* equal to zero indicates no association and a value of *V* equal to one indicates perfect association. *V* is less informative than *r* in one respect: *r* shows the direction of association (that is, a positive or negative relationship) whereas *V* does not.

When reporting Cramer's *V*, you should also report alongside it the Chi-squared statistic, the degrees of freedom and the p-value.
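A small pure-Python sketch of Cramer's *V*, computing the Chi-squared statistic from a hypothetical 2×2 contingency table:

```python
from math import sqrt

def cramers_v(table):
    """Cramer's V from a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Chi-squared statistic: sum of (observed - expected)^2 / expected
    chi2 = sum((table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
               / (row_totals[i] * col_totals[j] / n)
               for i in range(len(table)) for j in range(len(table[0])))
    r, c = len(table), len(table[0])
    return sqrt(chi2 / (n * min(r - 1, c - 1)))

# Hypothetical counts: rows = treatment/control, columns = event/no event
table = [[30, 10],
         [20, 40]]
print(round(cramers_v(table), 2))
```

In a real analysis a library routine (for example SciPy's `chi2_contingency`) would normally supply the Chi-squared statistic, degrees of freedom and p-value for reporting alongside *V*.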

A particular type of effect measurement is the risk estimate, which is a way of comparing risks, and is most commonly used in medicine and medical research, although it can of course be used outside those disciplines. There are three main types of risk estimate: relative risk (RR), odds ratio (OR) and risk difference (RD).

In statistics, RR compares the risks of two groups against each other: specifically, it is the probability of an event occurring in a treatment group divided by the probability of it occurring in a control group.

RR is calculated by:

$$RR = \frac{P(\text{event in treatment group})}{P(\text{event in control group})}$$

Generally, we have:

- RR < 1: The event is less likely to occur in the treatment group than the control
- RR = 1: The event is neither more nor less likely to occur in the treatment group than the control
- RR > 1: The event is more likely to occur in the treatment group than the control.
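The calculation is a simple ratio of risks; a sketch with hypothetical trial counts:

```python
def relative_risk(events_treat, n_treat, events_ctrl, n_ctrl):
    """Relative risk: probability of the event in the treatment group
    divided by the probability in the control group."""
    risk_treat = events_treat / n_treat
    risk_ctrl = events_ctrl / n_ctrl
    return risk_treat / risk_ctrl

# Hypothetical trial: 15/100 events in treatment, 30/100 in control
print(round(relative_risk(15, 100, 30, 100), 2))
```

Here the result is below 1, meaning the event is less likely in the treatment group than in the control group.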

As you might imagine, the OR is based on the odds of an event: the ratio of the probability that the event happens to the probability that it does not. More specifically, it is used to compare the odds of a favourable outcome in a treatment group with those in a control group. The odds of an event happening can be any number between zero and infinity, and for an event with probability *p* are calculated by:

$$\text{odds} = \frac{p}{1 - p}$$

and therefore, the odds ratio is calculated by:

$$OR = \frac{\text{odds in treatment group}}{\text{odds in control group}}$$

The closer the OR is to 1, the closer the estimated effects are to being the same for both groups; the further the OR is from 1, the more likely it is that the treatment has a real-life effect.

Many medical researchers tend to prefer risk to odds because it is more intuitive: risk divides the number of events by the total number of cases, whereas odds divides it by the number of cases in which the event did not occur.

RD is the difference in observed risks and is useful when needing to consider the differences in likely outcomes. It is very similar to RR, but instead of dividing the probabilities we subtract them:

$$RD = P(\text{event in treatment group}) - P(\text{event in control group})$$

RD can be calculated for any study. When using risk estimates, it is good practice to report both RR and RD in a medical study to reduce confusion.
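The three risk estimates fit together neatly in code; a sketch with hypothetical group risks:

```python
def odds(p):
    """Odds of an event with probability p: p divided by (1 - p)."""
    return p / (1 - p)

def odds_ratio(p_treat, p_ctrl):
    """Odds ratio: odds in the treatment group over odds in the control group."""
    return odds(p_treat) / odds(p_ctrl)

def risk_difference(p_treat, p_ctrl):
    """Risk difference: treatment risk minus control risk."""
    return p_treat - p_ctrl

# Hypothetical risks: 20% of treatment patients and 40% of controls had the event
p_treat, p_ctrl = 0.2, 0.4
print(round(odds_ratio(p_treat, p_ctrl), 3))
print(round(risk_difference(p_treat, p_ctrl), 2))
```

Note how the OR (0.375) and the RR for the same data (0.5) differ: reporting RD alongside RR, as suggested above, helps avoid that kind of confusion.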