Multiple Comparisons

Why running enough statistical tests all but guarantees false positives, and what to do about it.

Before this: P-Values

The jelly bean study

A research team investigates whether jelly beans cause acne. They test 20 colors, one at a time, using p < 0.05 as their threshold. Nineteen colors show no effect. Yellow jelly beans: p = 0.04.

They publish: "Yellow jelly beans linked to acne."

This result is very likely noise. They ran 20 tests, each with a 5% chance of a false positive when there is no real effect. The expected number of false positives from 20 tests on truly null effects is 20 × 0.05 = 1, which is exactly what they found.

The problem is not that they tested too many things. The problem is they treated the single hit as if they'd only tested once.
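The "one expected false positive" figure can be checked with a small simulation. This is a minimal sketch, not the study's analysis: it assumes every color is truly null and that a null p-value is uniform on [0, 1], which holds for a well-calibrated continuous test. The function name and its parameters are illustrative.

```python
import random

def mean_false_positives(n_tests=20, alpha=0.05, n_studies=50_000, seed=0):
    """Average number of 'significant' results per study when nothing is real.

    Assumption: under a true null, each p-value is uniform on [0, 1],
    so it is drawn directly with rng.random().
    """
    rng = random.Random(seed)
    total_hits = 0
    for _ in range(n_studies):
        # Count how many of this study's tests fall below alpha by chance.
        total_hits += sum(rng.random() < alpha for _ in range(n_tests))
    return total_hits / n_studies

mean_hits = mean_false_positives()
print(f"average false positives per 20-test study: {mean_hits:.2f}")
```

The printed average should land close to 1, matching 20 × 0.05.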

Why 5% compounds

A p < 0.05 threshold means you accept a 5% probability of a false positive on any single test, when the null hypothesis is true.

Run 1 test: 5% chance of a false positive. Run 20 independent tests: the probability of getting at least one false positive is:

$1 - (1 - 0.05)^{20} \approx 64\%$

Run 100 tests on pure noise: roughly a 99.4% chance of at least one "significant" result. Not probably — almost certainly.

This is the multiple comparisons problem (also called the problem of multiplicity). Every additional test is another lottery ticket for a false discovery.
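The formula above is easy to evaluate for any number of tests. The sketch below assumes the tests are independent and that every null hypothesis is true; `familywise_error_rate` is an illustrative name, not a library function.

```python
def familywise_error_rate(alpha: float, k: int) -> float:
    """Probability of at least one false positive in k independent null tests."""
    return 1.0 - (1.0 - alpha) ** k

for k in (1, 20, 100):
    print(f"k = {k:>3}: {familywise_error_rate(0.05, k):.1%}")
# k =   1: 5.0%
# k =  20: 64.2%
# k = 100: 99.4%
```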

The dead salmon

In 2009, Craig Bennett and colleagues demonstrated this with deliberate absurdity. They put a dead Atlantic salmon in an fMRI scanner, showed it photographs of humans in social situations, and "asked" it to infer what emotion the person in each photo was experiencing.

Then they ran a standard fMRI analysis — without correcting for multiple comparisons across the thousands of voxels (3D pixels) in the brain scan. Result: significant neural activity in the salmon's brain. Posthumously. A dead fish, apparently reading emotions.

The salmon couldn't think. But an fMRI analysis tests tens of thousands of locations in the brain simultaneously. At p < 0.05, you'd expect about 5% of locations with no real signal to show "significance" by chance. Without correction, you will almost always find something. The study won an Ig Nobel Prize and became a landmark warning about analytic practices in neuroimaging.

Jelly Bean Lab (interactive): test K = 20 colors at α = 0.05; about 1.0 "hit" is expected from pure noise.

The Bonferroni correction

The straightforward fix: make the threshold stricter. If you run k tests and want the chance of any false positive across all of them to stay at or below 5%, divide your significance threshold by k.

Testing 20 jelly bean colors? Use p < 0.05/20 = 0.0025 instead of 0.05. A result now needs to be much stronger to qualify as significant.

This is the Bonferroni correction. It's conservative — it can miss real effects when k is large — but it forces you to be honest about how many questions you asked.
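Applied to the jelly bean study, the correction looks like this. The sketch is illustrative: the function name is made up, and the 19 non-yellow p-values are placeholder values, since the example only gives yellow's p = 0.04.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the corrected threshold and a flag per test (True = still significant)."""
    k = len(p_values)
    threshold = alpha / k  # Bonferroni: split alpha evenly across the k tests
    return threshold, [p < threshold for p in p_values]

# Hypothetical data: 19 unremarkable colors (placeholder p = 0.50) plus yellow at 0.04.
p_values = [0.50] * 19 + [0.04]
threshold, flags = bonferroni_significant(p_values)

print(f"corrected threshold: {threshold:.4f}")
print(f"colors still significant: {sum(flags)}")
```

Yellow's p = 0.04 clears the uncorrected 0.05 bar but not the corrected 0.0025 one, so no color is flagged.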

In genomics, researchers test millions of genetic variants simultaneously. The corrected threshold used there is often p < 5 × 10⁻⁸ — not 1 in 20, but roughly 1 in 20 million. That's how severe the multiple comparisons problem gets at scale.
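One way to read that threshold is as a Bonferroni correction. The figure of roughly one million independent tests across the genome is an assumption, not stated above:

$\alpha / k = 0.05 / 10^{6} = 5 \times 10^{-8}$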

Dashboard p-hacking

This problem is rampant in data science, often not by deliberate fraud but by structure. A business dashboard with 50 metrics, filtered by 10 segments, across 12 time windows, represents thousands of implicit comparisons. When something pops as "significant," it may simply be noise that won the lottery.
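A rough count shows the scale. This sketch assumes every metric, segment, and time-window combination is compared once at α = 0.05 and that none of the effects are real; the variable names are illustrative.

```python
metrics, segments, time_windows = 50, 10, 12
alpha = 0.05

# Every combination of metric, segment, and window is one implicit comparison.
implicit_comparisons = metrics * segments * time_windows

# With no real effects, each comparison has an alpha chance of looking significant.
expected_false_positives = implicit_comparisons * alpha

print(f"implicit comparisons: {implicit_comparisons}")
print(f"expected false positives: {expected_false_positives:.0f}")
```

Under those assumptions the dashboard makes 6,000 comparisons and would be expected to show about 300 "significant" cells from noise alone.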

This is sometimes called p-hacking: the practice — intentional or not — of running analyses until something significant appears, then reporting that as the finding. The result looks valid because each individual test was done correctly. The problem is the selection process that chose which test to report.

Pre-registration — committing to your hypothesis and analysis plan before looking at data — is the gold standard defense. It closes the gap between "tests planned" and "tests run."

Key takeaways

A lone significant result picked out of many tested hypotheses may reflect selection rather than discovery; the right question is whether it survives a threshold that accounts for every test you ran.
The probability of at least one false positive compounds fast: ~64% at K=20 tests on noise, ~99% at K=100.
Bonferroni (α/K) is the blunt fix; the Benjamini-Hochberg procedure controls the false discovery rate instead, a less conservative criterion that keeps more power when K is large.
Pre-registration is the gold-standard defense — committing to the hypothesis before seeing the data closes the gap between "tests planned" and "tests run".
Business dashboards are silent multiple-comparisons machines: 50 metrics × 10 segments × 12 time windows is thousands of implicit tests.
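For readers who want to see the Benjamini-Hochberg procedure named in the takeaways, here is a minimal sketch of its standard step-up form. The p-values are hypothetical and the function name is illustrative. The procedure assumes independent (or positively dependent) tests.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per test: True where the null is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first

    # Find the largest 1-based rank r whose sorted p-value satisfies p <= r * q / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank

    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values from 10 tests.
p_values = [0.001, 0.008, 0.012, 0.04, 0.30, 0.55, 0.70, 0.81, 0.92, 0.99]
bh_rejected = benjamini_hochberg(p_values)
bonferroni_rejected = [p < 0.05 / len(p_values) for p in p_values]

print(f"Benjamini-Hochberg rejections: {sum(bh_rejected)}")
print(f"Bonferroni rejections: {sum(bonferroni_rejected)}")
```

On these made-up values, Benjamini-Hochberg keeps three results where Bonferroni keeps one, which is the sense in which it is less conservative.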
