
Multiple Comparisons

Why running enough statistical tests guarantees false positives — and what to do about it.

Before this: P-Values

The jelly bean study

A research team investigates whether jelly beans cause acne. They test 20 colors, one at a time, using p < 0.05 as their threshold. Nineteen colors show no effect. Yellow jelly beans: p = 0.04.

They publish: "Yellow jelly beans linked to acne."

This result is very likely noise. They ran 20 tests, each with a 5% chance of producing a false positive when there is no real effect. The expected number of false positives from 20 tests on truly null effects is 20 × 0.05 = 1, which is exactly what they found.

The problem is not that they tested too many things. The problem is they treated the single hit as if they'd only tested once.
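A quick simulation makes the "one expected hit" concrete. This is a minimal sketch, assuming every color truly has no effect, so that each test's p-value behaves like a uniform random draw between 0 and 1. The function name, seed, and number of simulated studies are illustrative choices, not details of any real study.

```python
import random

def count_false_positives(n_tests=20, alpha=0.05, rng=random):
    """Count 'significant' results among n_tests tests of true null effects.

    Assumption: under a true null hypothesis a p-value is uniform on [0, 1],
    so each test is simulated as a single uniform draw.
    """
    return sum(rng.random() < alpha for _ in range(n_tests))

rng = random.Random(0)   # fixed seed so the run is repeatable
n_studies = 10_000       # hypothetical number of repeated jelly bean studies
hits = [count_false_positives(rng=rng) for _ in range(n_studies)]

mean_hits = sum(hits) / n_studies   # should land close to 20 * 0.05 = 1
print(f"average false positives per study: {mean_hits:.2f}")
```

Averaged over many simulated studies, the count of spurious hits settles near one per study, even though no color does anything.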

Why 5% compounds

A p < 0.05 threshold means you accept a 5% probability of a false positive on any single test, when the null hypothesis is true.

Run 1 test: 5% chance of a false positive. Run 20 independent tests: the probability of getting at least one false positive is:

$$1 - (1 - 0.05)^{20} \approx 64\%$$

Run 100 tests on pure noise: roughly a 99.4% chance of at least one "significant" result. Not probably — almost certainly.

This is the multiple comparisons problem (also called the problem of multiplicity). Every additional test is another lottery ticket for a false discovery.
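The compounding can be checked directly. The sketch below evaluates 1 - (1 - α)^k for a few values of k, assuming the tests are independent and every null hypothesis is true. The function name is a hypothetical label chosen for this example.

```python
def prob_at_least_one_false_positive(k, alpha=0.05):
    """Chance of one or more false positives across k independent null tests."""
    return 1 - (1 - alpha) ** k

for k in (1, 20, 100):
    print(f"k = {k:>3}: {prob_at_least_one_false_positive(k):.1%}")
# k =   1: 5.0%
# k =  20: 64.2%
# k = 100: 99.4%
```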

The dead salmon

In 2009, researchers Craig Bennett and colleagues demonstrated this with deliberate absurdity. They put a dead Atlantic salmon in an fMRI scanner and showed it photographs of humans in social situations. The task: infer what emotion the person in the photo was experiencing.

Then they ran a standard fMRI analysis — without correcting for multiple comparisons across the thousands of voxels (3D pixels) in the brain scan. Result: significant neural activity in the salmon's brain. Posthumously. A dead fish, apparently reading emotions.

The salmon couldn't think. But fMRI analysis touches tens of thousands of locations in the brain simultaneously. At p < 0.05, you'd expect 5% of those to show "significance" by chance. Without correction, you will always find something. The study won an Ig Nobel Prize and became a landmark warning about analytic practices in neuroimaging.

Jelly Bean Lab

Interactive simulation: test K = 20 colors at α = 0.05, and about 1.0 "hit" is expected from pure noise. In the sample run shown, 1 of the 20 tests came out significant (T2, p = 0.003).

The Bonferroni correction

The straightforward fix: make the threshold stricter. If you run k tests and want the overall false positive rate to stay at or below 5%, divide your significance threshold by k.

Testing 20 jelly bean colors? Use p < 0.05/20 = 0.0025 instead of 0.05. A result now needs to be much stronger to qualify as significant.

This is the Bonferroni correction. It's conservative — it can miss real effects when k is large — but it forces you to be honest about how many questions you asked.
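A minimal sketch of the correction in code. The list of p-values is made up for illustration (with 0.04 standing in for the yellow jelly bean result), and the function name is a hypothetical choice.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the per-test threshold alpha / k and a pass/fail flag per test."""
    k = len(p_values)
    threshold = alpha / k
    return threshold, [p < threshold for p in p_values]

# Illustrative p-values (invented for this sketch): 20 tests, one nominal "hit".
p_values = [0.62, 0.04, 0.53, 0.98, 0.97, 0.28, 0.61, 0.72, 0.43, 0.99,
            0.46, 0.49, 0.14, 0.40, 0.25, 0.15, 0.49, 0.07, 0.40, 0.77]

threshold, flags = bonferroni_significant(p_values)
print(f"per-test threshold: {threshold:.4f}")                # 0.0025
print("uncorrected hits:", sum(p < 0.05 for p in p_values))  # 1
print("Bonferroni hits: ", sum(flags))                       # 0
```

The lone p = 0.04 result clears the naive 0.05 bar but falls well short of 0.0025, so it no longer counts as a finding.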

In genomics, researchers test millions of genetic variants simultaneously. The corrected threshold used there is often p < 5 × 10⁻⁸ — not 1 in 20, but roughly 1 in 20 million. That's how severe the multiple comparisons problem gets at scale.

Dashboard p-hacking

This problem is rampant in data science, often not by deliberate fraud but by structure. A business dashboard with 50 metrics, filtered by 10 segments, across 12 time windows, represents thousands of implicit comparisons. When something pops as "significant," it may simply be noise that won the lottery.

This is sometimes called p-hacking: the practice — intentional or not — of running analyses until something significant appears, then reporting that as the finding. The result looks valid because each individual test was done correctly. The problem is the selection process that chose which test to report.

Pre-registration — committing to your hypothesis and analysis plan before looking at data — is the gold standard defense. It closes the gap between "tests planned" and "tests run."
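The dashboard arithmetic is worth spelling out. This sketch assumes, for illustration only, that every metric, segment, and time-window combination is checked at α = 0.05 and that none of them reflects a real effect. The variable names are hypothetical.

```python
n_metrics, n_segments, n_windows = 50, 10, 12   # figures from the dashboard example
alpha = 0.05

n_comparisons = n_metrics * n_segments * n_windows   # 6000 implicit tests
expected_false_alarms = alpha * n_comparisons        # about 300 if nothing is real
bonferroni_threshold = alpha / n_comparisons         # per-test bar after correction

print(f"implicit comparisons:  {n_comparisons}")
print(f"expected false alarms: {expected_false_alarms:.0f}")
print(f"Bonferroni threshold:  {bonferroni_threshold:.2e}")
```

Under these assumptions, a dashboard of this size would be expected to surface hundreds of "significant" cells from noise alone.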

Key takeaways

  • A significant result across many tested hypotheses is selection, not discovery — the right question is whether it would survive as the only test you ran.
  • The probability of at least one false positive compounds fast: ~64% at K=20 tests on noise, ~99% at K=100.
  • Bonferroni (α/K) is the blunt fix; Benjamini-Hochberg controls the false-discovery rate less conservatively when K is large.
  • Pre-registration is the gold-standard defense — committing to the hypothesis before seeing the data closes the gap between "tests planned" and "tests run".
  • Business dashboards are silent multiple-comparisons machines: 50 metrics × 10 segments × 12 time windows is thousands of implicit tests.
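
For reference, the Benjamini-Hochberg procedure named in the takeaways can be sketched in a few lines. This is a plain-Python illustration of the standard step-up rule (sort the p-values, find the largest rank r with p ≤ (r / K) × q, reject everything up to that rank), not a replacement for a tested statistics library. The p-values are made up for the example.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per p-value: True where the hypothesis is rejected
    under the Benjamini-Hochberg step-up procedure at FDR level q."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])  # indices, smallest p first
    # Find the largest 1-based rank r whose p-value satisfies p <= (r / k) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / k * q:
            max_rank = rank
    # Reject every hypothesis ranked at or below max_rank.
    rejected = [False] * k
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected

# Illustrative p-values (invented): a few strong signals among noise.
p_values = [0.001, 0.008, 0.012, 0.041, 0.20, 0.35, 0.48, 0.62, 0.77, 0.91]
bh = benjamini_hochberg(p_values)
bonf = [p < 0.05 / len(p_values) for p in p_values]

print("Benjamini-Hochberg rejections:", sum(bh))    # 3
print("Bonferroni rejections:        ", sum(bonf))  # 1
```

On this toy input Bonferroni keeps only the strongest result, while Benjamini-Hochberg keeps three, which is the sense in which it is less conservative.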
