Correlation vs. Causation
Why two things moving together doesn't mean one causes the other — and how to tell the difference.
They move together. So what?
Ice cream sales and drowning rates are correlated. Countries with more TV sets per person tend to have higher life expectancy. The number of Nicolas Cage films released in a year has tracked the number of swimming pool drownings.
These correlations are real in the data. None of them are causal.
Correlation means two things tend to move together. Causation means one thing makes the other happen. These are fundamentally different, and confusing them is one of the most costly mistakes in data analysis.
Four reasons things correlate without causation
1. A common cause (confounding)
Ice cream and drowning both rise in summer. Hot weather causes people to buy more ice cream and to swim more. Hot weather is the confounder — the variable you're not looking at that explains both.
Whenever you see a surprising correlation, ask: is there a third variable that causes both?
2. Reverse causation
Surveys often find that people who exercise more report being happier. Does exercise cause happiness, or do happier people exercise more? Both directions are plausible. Cross-sectional data (a single snapshot in time) can't tell you which way the arrow points.
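3. Chance (especially with many comparisons or small samples)
With enough variables, some will correlate by pure coincidence. If you test 100 variables that are truly unrelated to your outcome, expect about 5 of them to pass a significance threshold of p < 0.05 just by chance. This is the multiple comparisons problem. In small samples the coincidental correlations can also be large, because a few unusual points are enough to move r.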
4. Selection bias
The data you can see has been filtered before you got it. Mutual funds with 10-year track records look like a strong asset class until you remember that the funds that failed and were closed aren't in the dataset. The surviving funds make the whole category look better than it was. The link between "exists today" and "strong returns" is real in the visible sample, but it doesn't generalize, because the sample isn't the population.
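Here is a rough sketch of the survivorship effect. The return distribution, the closure rule (a fund shuts once it has lost more than 30% of its starting value), and the name `simulate_funds` are all assumptions made for illustration, not a model of any real fund data.

```python
import random
from statistics import mean


def simulate_funds(n_funds=1000, years=10, seed=1):
    rng = random.Random(seed)
    all_outcomes, survivor_outcomes = [], []
    for _ in range(n_funds):
        value = 1.0  # value of 1 unit invested at the start
        survived = True
        for _ in range(years):
            value *= 1 + rng.gauss(0.04, 0.15)  # hypothetical yearly return
            if value < 0.7:  # hypothetical rule: fund closes after a 30% loss
                survived = False
                break
        all_outcomes.append(value)  # what investors actually ended up with
        if survived:
            survivor_outcomes.append(value)  # what today's dataset shows
    return all_outcomes, survivor_outcomes


all_outcomes, survivor_outcomes = simulate_funds()
print(f"funds started:  {len(all_outcomes)}")
print(f"funds surviving: {len(survivor_outcomes)}")
print(f"mean final value, all funds:      {mean(all_outcomes):.2f}")
print(f"mean final value, survivors only: {mean(survivor_outcomes):.2f}")
```

Every fund draws from the same return distribution, so no fund is better than another. The survivors still show a higher average, purely because of which rows made it into the dataset.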
[Figure: ice cream sales track drownings tightly. But does ice cream cause drowning?]
What correlation actually measures
The Pearson correlation coefficient r ranges from -1 to +1:
- r = 1: perfect positive linear relationship; as x goes up, y goes up along a straight line
- r = 0: no linear relationship
- r = -1: perfect negative linear relationship; as x goes up, y goes down along a straight line
Note that r only measures linear relationships. Two variables can have a strong curved relationship and a correlation near zero.
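A short sketch of the computation, written out by hand so each step is visible. The helper name `pearson_r` and the toy data are assumptions for illustration. The parabola is a perfectly predictable relationship, yet it has no linear trend for r to pick up.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r: how x and y vary together, scaled by how much each varies alone."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)


xs = list(range(-5, 6))               # -5, -4, ..., 5
rising_line = [2 * x + 1 for x in xs]
falling_line = [-3 * x for x in xs]
parabola = [x * x for x in xs]        # y is fully determined by x, but not linearly

r_rising = pearson_r(xs, rising_line)
r_falling = pearson_r(xs, falling_line)
r_parabola = pearson_r(xs, parabola)
print(f"rising line:  r = {r_rising:.2f}")
print(f"falling line: r = {r_falling:.2f}")
print(f"parabola:     r = {r_parabola:.2f}")
```

The two lines give r of 1 and -1 regardless of their slopes, and the parabola gives 0 because the rising half cancels the falling half. Plotting the data is the usual way to catch this.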
How to actually establish causation
Correlation is easy to find. Causation requires more work:
Randomized controlled trials (RCTs): randomly assign people to treatment and control groups. Randomization breaks the link between confounders and treatment, so on average the groups differ only in the treatment. A difference in outcomes larger than chance would produce can then be attributed to the treatment.
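The sketch below compares self-selected treatment with randomized assignment on synthetic data. The true effect of 2.0, the "health" confounder, the opt-in probabilities, and the name `difference_in_means` are all assumptions chosen for illustration.

```python
import random
from statistics import mean

TRUE_EFFECT = 2.0  # hypothetical effect of the treatment on the outcome


def difference_in_means(randomize, n=20000, seed=2):
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        health = rng.gauss(0, 1)  # confounder: also improves the outcome
        if randomize:
            gets_treatment = rng.random() < 0.5  # coin flip, unrelated to health
        else:
            # Self-selection: healthier people opt in far more often.
            gets_treatment = rng.random() < (0.8 if health > 0 else 0.2)
        outcome = 3.0 * health + rng.gauss(0, 1)
        if gets_treatment:
            outcome += TRUE_EFFECT
        (treated if gets_treatment else control).append(outcome)
    return mean(treated) - mean(control)


observational_estimate = difference_in_means(randomize=False)
rct_estimate = difference_in_means(randomize=True)
print(f"true effect:            {TRUE_EFFECT:.2f}")
print(f"observational estimate: {observational_estimate:.2f}")
print(f"randomized estimate:    {rct_estimate:.2f}")
```

The observational comparison should overstate the effect by a wide margin, because the treated group was healthier to begin with. The randomized comparison should land close to the true value, up to sampling noise.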
Natural experiments: sometimes circumstances create quasi-random assignment. Economists have studied the causal effect of education by comparing people born just before and just after a school enrollment cutoff date, since which side of the cutoff a birthday falls on is close to random.
Directed acyclic graphs (DAGs): a formal tool for mapping causal assumptions and identifying which variables to control for (and which to leave alone).
Mechanism: understanding why X would cause Y (the biological, physical, or economic pathway) strengthens the case for causation, though a plausible story is not proof on its own. Correlation without a mechanism should always prompt skepticism.
The practical habit
When you see a correlation reported — in a news article, a business dashboard, a research paper — ask four questions:
- Could a third variable cause both?
- Could the arrow point the other way?
- Could this be chance, given how many things were tested?
- Is the sample filtered in a way related to what you're measuring?
If you can't rule out all four, what you have is a correlation. That is interesting, but it is not yet an action item.