P-Values

The most cited number in science is also the most misunderstood. Here is what p < 0.05 actually means — and what it doesn't.

The number science runs on

In 2011, social psychologist Diederik Stapel announced a study claiming that meat eaters are more selfish. Like the papers he had already published, it came with p < 0.05, the kind of result that satisfies reviewers. Later that year it emerged that Stapel had fabricated his data, not occasionally, but for years, across dozens of published studies.

His p-values were always beautifully significant.

This isn't an argument that p-values are useless. It's an illustration that they're far more limited than their reputation suggests. A p-value below 0.05 does not mean a hypothesis is probably true. It does not mean a result will replicate. It does not mean much at all on its own.

What a p-value actually says

The p-value answers one specific, narrow question: if the null hypothesis were true, how likely is it that you'd see data at least this extreme by chance?

More precisely: the p-value is the probability of observing a test statistic at least as extreme as the one you measured, assuming the null hypothesis is true and the other assumptions behind the test (random sampling, the stated statistical model) hold.

A p-value of 0.03 says: if there were really no effect, random sampling variation would produce data this extreme only 3% of the time.

That's it. Nothing more.
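To make that definition concrete, here is a minimal sketch that estimates a p-value by brute force: simulate the null hypothesis many times and count how often the result is at least as extreme as the one observed. The coin-flip setup (60 heads in 100 flips of a supposedly fair coin), the function name `simulated_p_value`, and the simulation settings are illustrative assumptions.

```python
import random

def simulated_p_value(observed_heads, n_flips=100, n_sims=10_000, seed=0):
    """Estimate a two-sided p-value for a fair-coin null by simulation."""
    rng = random.Random(seed)
    expected = n_flips / 2
    observed_gap = abs(observed_heads - expected)
    extreme = 0
    for _ in range(n_sims):
        # One simulated experiment in a world where the null is true.
        heads = sum(rng.random() < 0.5 for _ in range(n_flips))
        # Count it if it is at least as far from 50/50 as the observed result.
        if abs(heads - expected) >= observed_gap:
            extreme += 1
    return extreme / n_sims

p = simulated_p_value(observed_heads=60)
print(f"simulated two-sided p = {p:.3f}")
```

The estimate should land near 0.06: a fair coin produces a split at least as lopsided as 60/40 in roughly 6% of runs. Nothing in the calculation says how likely it is that the coin is actually fair.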

The three wrong interpretations

Wrong interpretation 1: "p < 0.05 means the hypothesis is probably true."

A p-value does not tell you the probability that your hypothesis is correct. To make that claim, you'd need a prior probability — how plausible was this hypothesis before you ran the study? A surprising hypothesis confirmed by a single p < 0.05 result should not be taken as established. Many unlikely hypotheses will appear to pass the threshold by chance alone.
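A rough sketch of why the prior matters: among hypotheses that reach p < 0.05, the share that are actually true depends on how many true hypotheses were tested in the first place. The priors and the 80% power figure below are assumed values for illustration, and `share_of_true_findings` is a hypothetical helper name.

```python
def share_of_true_findings(prior, power, alpha=0.05):
    """Among results with p < alpha, the fraction that reflect a real effect.

    prior: fraction of tested hypotheses that are actually true (assumed)
    power: chance a real effect reaches p < alpha (assumed)
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)

for prior in (0.5, 0.1, 0.01):
    share = share_of_true_findings(prior, power=0.8)
    print(f"prior {prior:>4}: {share:.3f} of significant results are real")
# prior 0.5 -> 0.941, prior 0.1 -> 0.640, prior 0.01 -> 0.139
```

With a 1-in-100 prior, most significant results are false alarms even though every one of them cleared the same 0.05 bar.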

Wrong interpretation 2: "p < 0.05 means the result will replicate."

A p-value of 0.04 means that, under the null, you'd see this outcome 4% of the time. But if the null is actually false and there is a real effect, the p-value doesn't tell you how large or how stable that effect is. Studies with p = 0.049 often fail to replicate not because they were fraudulent, but because they had low statistical power and were sitting on the edge of the threshold.
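The edge-of-threshold problem can be sketched with a normal approximation. Assume a study's z-statistic is normally distributed around the true effect (measured in standard-error units) with standard deviation 1; the function name `replication_probability` and that modelling choice are assumptions for illustration.

```python
from statistics import NormalDist

def replication_probability(true_z, alpha=0.05):
    """Chance that an identical, independent study reaches two-sided p < alpha,
    assuming its z-statistic is drawn from Normal(true_z, 1)."""
    std = NormalDist()
    critical = std.inv_cdf(1 - alpha / 2)        # about 1.96 for alpha = 0.05
    upper_tail = 1 - std.cdf(critical - true_z)  # significant in the same direction
    lower_tail = std.cdf(-critical - true_z)     # significant in the opposite direction
    return upper_tail + lower_tail

# A study that just squeaked past the threshold (z = 1.97, p close to 0.049),
# generously assuming the true effect is exactly as large as the one observed.
print(round(replication_probability(1.97), 2))
```

Under that generous assumption the replication succeeds only about half the time. If the original estimate was inflated, as estimates selected for significance tend to be, the chance is lower still.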

Wrong interpretation 3: "p = 0.03 means there's a 3% chance of a false positive."

This is a version of the prosecutor's fallacy, also called the fallacy of the transposed conditional. The p-value is built from $P(\text{data} \mid H_0)$: the probability of data at least this extreme, given the null. What we actually want to know is $P(H_0 \mid \text{data})$: the probability that the null is true, given this data. The two are related by Bayes' theorem, and they are not the same number.
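Written out for a simple two-hypothesis setting, with $H_1$ standing in for the alternative (a notation assumed here for illustration), Bayes' theorem reads:

$P(H_0 \mid \text{data}) = \dfrac{P(\text{data} \mid H_0)\,P(H_0)}{P(\text{data} \mid H_0)\,P(H_0) + P(\text{data} \mid H_1)\,P(H_1)}$

Getting from the left-hand quantity to the right requires the prior $P(H_0)$ and the probability of the data under the alternative, $P(\text{data} \mid H_1)$. A p-value supplies neither.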

What it's actually measuring

When you run a hypothesis test, you:

1. Assume the null hypothesis is true (e.g., "the drug has no effect")
2. Compute a test statistic from your data (e.g., the difference in means)
3. Ask: if the null were true, how extreme would this test statistic need to be to occur less than 5% of the time by chance?
4. If your observed statistic crosses that threshold, you reject the null.

The p-value is the probability in step 3 — measured not at the threshold but at your actual statistic. A p of 0.03 says your result was more extreme than the 5% threshold; a p of 0.40 says it wasn't.
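One way to sketch these steps in code is a permutation test, chosen here because it mirrors the logic directly: if the drug has no effect, the group labels are arbitrary, so shuffling them shows what differences chance alone produces. The symptom scores are made-up integers and `permutation_test` is a hypothetical helper, not a method taken from this article.

```python
import random
from statistics import mean

def permutation_test(treated, control, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    # Step 2: the test statistic computed from the actual data.
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_perm):
        # Step 1: under the null, labels carry no information, so reshuffle them.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treated]) - mean(pooled[n_treated:])
        # Step 3: is this chance difference at least as extreme as the observed one?
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_perm

# Hypothetical scores for two groups of eight patients.
treated = [12, 15, 11, 14, 16, 13, 17, 15]
control = [10, 12, 9, 13, 11, 10, 12, 11]

observed, p_value = permutation_test(treated, control)
reject_null = p_value < 0.05  # Step 4: compare against the chosen threshold.
print(f"difference = {observed}, p = {p_value:.4f}, reject null: {reject_null}")
```

The returned p-value is the share of label shuffles that produced a difference at least as large as the real one, which is the definition applied at the observed statistic rather than at the threshold.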

P-value definition

$p = P(T \geq t_{\text{obs}} \mid H_0 \text{ is true})$

Where $T$ is the test statistic and $t_{\text{obs}}$ is the value you observed. This is the one-sided form; a two-sided test, the more common default, counts extremes in either direction, using $|T| \geq |t_{\text{obs}}|$. Either way, the p-value sums the probability of results at least as extreme as yours.
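As a small sketch, both forms can be computed with Python's standard library when the test statistic is a z-statistic, that is, assuming $T$ follows a standard normal distribution under the null. The function names are illustrative.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def p_one_sided(z_obs):
    """P(T >= z_obs | H0) for a standard normal test statistic."""
    return 1 - STANDARD_NORMAL.cdf(z_obs)

def p_two_sided(z_obs):
    """P(|T| >= |z_obs| | H0): extremes in either direction count."""
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z_obs)))

z = 2.00
print(f"one-sided p = {p_one_sided(z):.4f}")  # 0.0228
print(f"two-sided p = {p_two_sided(z):.4f}")  # 0.0455
print(f"about 1 in {1 / p_two_sided(z):.0f} experiments under the null")  # 1 in 22
```

For a symmetric null distribution like this one, the two-sided value is twice the one-sided value at a positive statistic.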

Null distribution explorer

[Interactive widget: a slider sets the observed z-statistic between -4.00 and 4.00. At z = 2.00 (|z| = 2.00), the two-sided p-value is 0.0455, a frequency of about 1 in 22.]

Uncommon under the null, but happens in about 1 in 22 experiments by chance.

Why 0.05?

The threshold of 0.05 was popularized by Ronald Fisher in the 1920s as a convenient rule of thumb: roughly 1 in 20 seemed like a reasonable bar. It was never meant to be a universal standard. Fisher himself argued against mechanical application of any fixed threshold.

The field adopted it anyway, and now careers, publications, and drug approvals hinge on whether a number crosses this line. This is a sociological phenomenon, not a scientific one.

What you should actually ask

A p-value below 0.05 is weak evidence of anything on its own. Stronger evidence comes from:

Effect size: how large is the effect, not just whether it's there?
Confidence intervals: a result with p = 0.04 whose interval stretches from near zero to a large effect tells you far less about the size of the effect than one with a narrow interval.
Replication: has anyone else found the same result in an independent sample?
Pre-registration: was this hypothesis specified before data collection, or was it found by searching through the data?
Prior plausibility: how plausible was this effect before the study?
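The first two items can be sketched for a two-group comparison. The scores below are made up, `effect_summary` is a hypothetical helper, and the interval uses a normal approximation, which is an assumption: with samples this small a t-based interval would be somewhat wider.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def effect_summary(a, b, level=0.95):
    """Cohen's d and an approximate confidence interval for mean(a) - mean(b)."""
    diff = mean(a) - mean(b)
    var_a, var_b = stdev(a) ** 2, stdev(b) ** 2
    # Cohen's d: the difference in units of the pooled standard deviation.
    pooled_sd = sqrt(((len(a) - 1) * var_a + (len(b) - 1) * var_b)
                     / (len(a) + len(b) - 2))
    cohens_d = diff / pooled_sd
    # Normal-approximation interval for the raw difference in means.
    std_err = sqrt(var_a / len(a) + var_b / len(b))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return cohens_d, (diff - z * std_err, diff + z * std_err)

treated = [12, 15, 11, 14, 16, 13, 17, 15]  # hypothetical scores
control = [10, 12, 9, 13, 11, 10, 12, 11]

d, (low, high) = effect_summary(treated, control)
print(f"Cohen's d = {d:.2f}, 95% CI for the difference = ({low:.2f}, {high:.2f})")
```

Reporting the interval alongside the p-value shows the range of effect sizes the data are compatible with, which a bare "p < 0.05" hides.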

The replication crisis in social science, medicine, and nutrition science is substantially a p-value problem: studies powered just enough to cross 0.05, testing hypotheses post-hoc, with results that dissolve on replication. The p-value is a useful tool. It is not a truth detector.
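A short sketch of why post-hoc searching is so corrosive: if several independent hypotheses are tested and none has a real effect, the chance that at least one crosses 0.05 grows quickly. The simulation assumes independent tests whose p-values are uniform between 0 and 1 under the null, which holds for continuous test statistics; the function names are illustrative.

```python
import random

def chance_of_false_positive(n_tests, alpha=0.05):
    """P(at least one p < alpha) across n independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests

def simulated_chance(n_tests, alpha=0.05, n_studies=20_000, seed=0):
    """The same quantity estimated by drawing uniform p-values."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(n_tests))
        for _ in range(n_studies)
    )
    return hits / n_studies

for k in (1, 5, 20):
    print(f"{k:>2} tests: formula {chance_of_false_positive(k):.3f}, "
          f"simulated {simulated_chance(k):.3f}")
```

With 20 looks at the data, the formula gives about 0.64: a "significant" finding is more likely than not even when nothing is there.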

Key takeaways

A p-value answers one specific question: how surprising would this data be if the null hypothesis were true?
It does not give the probability that the null is true, that a hypothesis is correct, or that a result will replicate.
The 0.05 threshold is a Fisher-era rule of thumb, not a law of nature.
Effect size, confidence intervals, replication, and pre-registration matter more than crossing 0.05.
The replication crisis is, in large part, a p-value problem — significance at threshold without the rest of the evidence is weak evidence.
