Effect Size
Statistical significance flags an effect unlikely to be pure noise. Effect size tells you whether it's big enough to matter.
The question significance can't answer
A drug reduces recovery time by 1 hour. The study has 50,000 participants. p = 0.001.
Is that a good drug?
Statistical significance says the result would be surprising if nothing were going on. It says nothing about the size of the difference: whether that 1-hour reduction is trivial or life-changing. With a large enough sample, nearly any nonzero effect becomes statistically significant, including ones too small to matter.
Effect size is the number that tells you how big the difference actually is. Without it, a significant result is an incomplete answer.
Overlap is the intuition
Imagine two groups — treated and untreated — each distributed normally around different means. An effect size is essentially a measure of how separated those two distributions are.
When there's almost no effect, the distributions sit almost exactly on top of each other. A treated person has about a 50% chance of doing better than an untreated person — a coin flip. You can't tell which group an individual came from.
When there's a very large effect, the distributions pull apart. With the means about 1.5 standard deviations apart, the treated person has roughly an 85% chance of beating the untreated one. You can start to tell them apart. With a large enough effect, there's barely any overlap at all.
This visual separation is what effect size measures. And it makes the practical question concrete: if I give this treatment to one person, how likely are they to do better than someone who didn't get it?
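The "how likely is a treated person to do better" question can be put into numbers. The sketch below assumes both groups are normally distributed with the same standard deviation, in which case the chance that a random treated person beats a random untreated one is Φ(d / √2), where d is the separation between the means in standard deviations. The function names are illustrative, not taken from any library.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def probability_of_superiority(d: float) -> float:
    """Chance that a random treated person beats a random untreated one.

    Assumes two normal distributions with equal standard deviations whose
    means are d standard deviations apart.
    """
    return normal_cdf(d / math.sqrt(2.0))


if __name__ == "__main__":
    for d in (0.0, 0.2, 0.5, 0.8, 1.5):
        print(f"separation {d:.1f} SD -> {probability_of_superiority(d):.1%}")
```

Under these assumptions, no separation gives exactly 50%, a separation of half a standard deviation gives about 64%, and it takes roughly 1.5 standard deviations to reach about 85%.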
Cohen's d
The most common standardized measure for comparing two group means is Cohen's d. It expresses the difference between means in units of standard deviation — so it's comparable across studies measuring different things on different scales.
Cohen proposed rough benchmarks: d = 0.2 is small, d = 0.5 is medium, d = 0.8 is large. These are not laws — context determines what matters. A d of 0.1 in a cheap, scalable intervention affecting millions of people can be enormously valuable. A d of 0.5 in a painful and expensive treatment might not be worth it.
For the drug reducing recovery time by 1 hour: if the standard deviation of recovery times is 24 hours, the effect size is d = 1/24 ≈ 0.04. Tiny, regardless of the p-value.
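As a minimal sketch of the calculation, the code below computes Cohen's d from two samples using the pooled standard deviation, which is one common convention. The sample values are made up for illustration, and the 24-hour standard deviation is the assumption from the drug example, not a measured figure.

```python
import math
from statistics import mean, variance


def cohens_d(treated, control):
    """Difference in means divided by the pooled standard deviation.

    The sign follows treated minus control, so a treatment that lowers the
    outcome (for example, shorter recovery time) gives a negative d.
    """
    n1, n2 = len(treated), len(control)
    pooled_var = (
        (n1 - 1) * variance(treated) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treated) - mean(control)) / math.sqrt(pooled_var)


# Drug example: a 1-hour difference against an assumed 24-hour standard deviation.
d_drug = 1.0 / 24.0

if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    treated = [2.0, 4.0, 6.0]
    control = [1.0, 3.0, 5.0]
    print(f"toy samples: d = {cohens_d(treated, control):.2f}")
    print(f"drug example: d = {d_drug:.3f}")
```

Because d is measured in standard deviations, the same function works whether the outcome is hours, test scores, or blood pressure.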
Figure: Cohen's "medium" effect (d ≈ 0.5). There is about a 64% chance that a treated person beats an untreated one, and the curves visibly pull apart.
Why large studies manufacture significance
Here is the key relationship: statistical significance depends on both effect size and sample size.
With a massive sample, on the order of half a million people per group, even a d of 0.01 (a difference barely distinguishable from zero) will reliably produce p < 0.001. This is not a flaw in statistics; it's working as designed. But it means that large-sample studies can make the trivially small look profound.
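The dependence on sample size can be sketched with a simple normal approximation. For two equal groups of n people each, an observed standardized difference d corresponds to a test statistic of roughly z = d · √(n / 2). The function below is an illustrative approximation under that assumption, not a replacement for a proper t-test on real data.

```python
import math


def two_sided_p(d: float, n_per_group: int) -> float:
    """Approximate two-sided p-value for an observed standardized difference d.

    Assumes two equal-sized groups and a normal (z) approximation, so the
    test statistic is d * sqrt(n_per_group / 2).
    """
    z = d * math.sqrt(n_per_group / 2.0)
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Same tiny effect, growing sample: only the p-value changes.
    for n in (100, 10_000, 500_000, 1_000_000):
        print(f"d = 0.01, n per group = {n:>9,}: p = {two_sided_p(0.01, n):.2e}")
```

The effect size stays fixed at 0.01 in every row. Only the sample size moves, and the p-value goes from unremarkable to far below 0.001.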
This has caused real damage in psychology and medicine. Studies with tens of thousands of participants report "significant" effects that explain 0.3% of the variance in an outcome. The p-value is real. The effect is not practically meaningful.
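To put "0.3% of the variance" on the same scale as Cohen's d, one standard conversion applies when the comparison is between two equal-sized groups: r² = d² / (d² + 4). The sketch below assumes that setup. Real studies with unequal groups or continuous predictors would need a different conversion.

```python
import math


def r_squared_from_d(d: float) -> float:
    """Share of variance explained by group membership, equal group sizes assumed."""
    return d * d / (d * d + 4.0)


def d_from_r_squared(r2: float) -> float:
    """Inverse conversion: Cohen's d implied by a given share of variance explained."""
    return 2.0 * math.sqrt(r2) / math.sqrt(1.0 - r2)


if __name__ == "__main__":
    print(f"r^2 = 0.003 corresponds to d = {d_from_r_squared(0.003):.2f}")
    print(f"d = 0.5 corresponds to r^2 = {r_squared_from_d(0.5):.3f}")
```

Under this assumption, 0.3% of the variance corresponds to a d of about 0.11, below Cohen's "small" benchmark of 0.2.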
The fix is to report effect sizes alongside p-values. Many journals now require this. But it's still common to see research presented as a headline — "X significantly improves Y" — with no mention of how much.
How to read effect sizes in practice
When you encounter research:
- Look for Cohen's d, r², odds ratios, or "explained variance" — these are all forms of effect size
- Compare the effect to what would matter in the real world, not to a table of "small/medium/large"
- Be skeptical of large studies reporting tiny effects as breakthroughs
- Be equally skeptical of small studies that find large effects — those estimates are noisy and rarely replicate