P-Values

The most cited number in science is also the most misunderstood. Here is what p < 0.05 actually means — and what it doesn't.

The number science runs on

In 2011, social psychologist Diederik Stapel announced a study claiming that eating meat makes people more selfish. Like the dozens of papers he had already published in peer-reviewed journals, it reported statistically significant results (p < 0.05). That same year, it emerged that Stapel had fabricated his data, not occasionally, but for years across dozens of studies.

His p-values were always beautifully significant.

This isn't an argument that p-values are useless. It's an illustration that they're far more limited than their reputation suggests. A p-value below 0.05 does not mean a hypothesis is probably true. It does not mean a result will replicate. It does not mean much at all on its own.

What a p-value actually says

The p-value answers one specific, narrow question: if the null hypothesis were true, how likely is it that you'd see data at least this extreme by chance?

More precisely: the p-value is the probability of observing a test statistic at least as extreme as the one you measured, assuming the null hypothesis is true and everything else about your study design holds.

A p-value of 0.03 says: if there were really no effect, random sampling variation would produce data this extreme only 3% of the time.

That's it. Nothing more.

The three wrong interpretations

Wrong interpretation 1: "p < 0.05 means the hypothesis is probably true."

A p-value does not tell you the probability that your hypothesis is correct. To make that claim, you'd need a prior probability — how plausible was this hypothesis before you ran the study? A surprising hypothesis confirmed by a single p < 0.05 result should not be taken as established. Many unlikely hypotheses will appear to pass the threshold by chance alone.
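To see how much the prior matters, here is a minimal sketch of the arithmetic. The priors, the 80% power figure, and the function name are assumptions chosen for illustration, not values from any real study.

```python
def prob_real_given_significant(prior, power, alpha):
    """P(effect is real | p < alpha), via Bayes' theorem.

    prior: P(effect is real) before the study (an assumption, not a measurement)
    power: P(p < alpha | effect is real)
    alpha: P(p < alpha | no effect), the significance threshold
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)


# Hypothetical scenarios: same threshold, different prior plausibility.
for prior in (0.5, 0.1, 0.01):
    ppv = prob_real_given_significant(prior, power=0.8, alpha=0.05)
    print(f"prior = {prior:.2f} -> P(real | significant) = {ppv:.2f}")
```

Under these assumed numbers, a hypothesis that started with a 10% chance of being true ends up at roughly 64% after a significant result, and one that started at 1% stays well under 20%.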

Wrong interpretation 2: "p < 0.05 means the result will replicate."

A p-value of 0.04 means that, under the null, you'd see this outcome 4% of the time. But if the null is actually false and there is a real effect, the p-value doesn't tell you how large or how stable that effect is. Studies with p = 0.049 often fail to replicate not because they were fraudulent, but because they had low statistical power and were sitting on the edge of the threshold.
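A rough way to quantify "sitting on the edge": the sketch below estimates the chance that an identical replication reaches significance. It assumes a z-test, the same sample size and design, and (optimistically) that the true effect equals the originally observed one. The function name is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def replication_probability(p_original, alpha=0.05):
    """Chance an identical replication gets two-sided p < alpha,
    ASSUMING the true effect equals the originally observed one."""
    z_obs = STD_NORMAL.inv_cdf(1 - p_original / 2)  # z behind the original p
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)      # about 1.96 for alpha = 0.05
    # Under the assumption, the replication's z is Normal(z_obs, 1).
    upper = 1 - STD_NORMAL.cdf(z_crit - z_obs)      # significant, same direction
    lower = STD_NORMAL.cdf(-z_crit - z_obs)         # significant, opposite direction
    return upper + lower


for p in (0.049, 0.01, 0.001):
    print(f"original p = {p} -> replication chance = {replication_probability(p):.2f}")
```

Even under this generous assumption, a result at p = 0.049 is close to a coin flip on replication. If the original estimate was inflated by selective publication, the real chance is lower still.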

Wrong interpretation 3: "p = 0.03 means there's a 3% chance of a false positive."

This confuses two different conditional probabilities, a mistake sometimes called the prosecutor's fallacy. The p-value is a probability of the form $P(\text{data} \mid H_0)$: the probability of data at least this extreme given the null. What we actually want to know is $P(H_0 \mid \text{data})$: the probability the null is true given this data. These are related by Bayes' theorem and they are not the same number.
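Written out, the relationship looks like this. The sketch assumes a single alternative hypothesis, labeled $H_1$ here for illustration, and uses "data" as shorthand for "data at least this extreme":

```latex
P(H_0 \mid \text{data}) =
  \frac{P(\text{data} \mid H_0)\, P(H_0)}
       {P(\text{data} \mid H_0)\, P(H_0) + P(\text{data} \mid H_1)\, P(H_1)}
```

The p-value supplies only one term on the right-hand side. The prior $P(H_0)$ and the behavior of the data under $H_1$ have to come from somewhere else.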

What it's actually measuring

When you run a hypothesis test, you:

  1. Assume the null hypothesis is true (e.g., "the drug has no effect")
  2. Compute a test statistic from your data (e.g., the difference in means)
  3. Ask: if the null were true, how extreme would this test statistic need to be to occur less than 5% of the time by chance?
  4. If your observed statistic crosses that threshold, you reject the null.

The p-value turns step 3 around: instead of asking where the 5% cutoff sits, it reports the tail probability at your actual statistic. A p of 0.03 means your result fell beyond the 5% cutoff; a p of 0.40 means it did not.
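The steps above can be sketched with an exact permutation test. This is one simple way to build a null distribution, not the only kind of hypothesis test, and the measurements below are invented for illustration.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(treatment, control):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = list(treatment) + list(control)
    n_treat = len(treatment)
    # Step 2: the test statistic observed in the real data.
    observed = abs(mean(treatment) - mean(control))

    # Steps 1 and 3: if the null is true, group labels are arbitrary, so
    # every way of relabelling the pooled data is equally likely.
    at_least_as_extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_treat):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = abs(mean(group_a) - mean(group_b))
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total


# Hypothetical measurements, invented for illustration only.
drug = [5.1, 6.3, 5.8, 6.9, 6.1]
placebo = [4.8, 5.0, 5.6, 4.4, 5.3]
p = permutation_p_value(drug, placebo)
# Step 4: compare against the chosen threshold.
print(f"p = {p:.4f}; reject the null at 0.05? {p < 0.05}")
```

Enumerating every relabelling is only practical for small samples (here 252 splits). Larger studies typically sample relabellings at random or use a closed-form null distribution instead.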

P-value definition

$$p = P(T \geq t_{\text{obs}} \mid H_0 \text{ is true})$$

Where $T$ is the test statistic and $t_{\text{obs}}$ is the value you observed. This is the one-sided form; a two-sided test (the more common default) counts extremes in either direction, using $|T| \geq |t_{\text{obs}}|$. Either way, the p-value sums the probability of results at least as extreme as yours.

[Interactive figure: Null distribution explorer. A slider sets z between -4.00 and 4.00. At |z| = 2.00 the two-sided p is 0.0455, a frequency of about 1 in 22: uncommon under the null, but it happens in about 1 in 22 experiments by chance.]
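The explorer's numbers can be reproduced in a few lines, assuming the test statistic follows a standard normal distribution under the null (a z-test). The function name is hypothetical.

```python
from statistics import NormalDist


def p_value_from_z(z_obs, two_sided=True):
    """Tail probability of a standard normal null distribution at z_obs."""
    null = NormalDist()  # mean 0, standard deviation 1
    if two_sided:
        # Count extremes in both directions: |Z| >= |z_obs|.
        return 2 * (1 - null.cdf(abs(z_obs)))
    # One-sided: only Z >= z_obs counts as extreme.
    return 1 - null.cdf(z_obs)


p = p_value_from_z(2.0)
print(f"two-sided p = {p:.4f}, about 1 in {1 / p:.0f}")
# prints: two-sided p = 0.0455, about 1 in 22
```

For the same z, the one-sided value is half the two-sided one, which is why it matters to state which form a reported p-value uses.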

Why 0.05?

The 0.05 threshold was popularized by Ronald Fisher in the 1920s as a convenient rule of thumb: roughly 1 in 20 seemed like a reasonable bar. It was never meant to be a universal standard. Fisher himself argued against mechanical application of any fixed threshold.

The field adopted it anyway, and now careers, publications, and drug approvals hinge on whether a number crosses this line. This is a sociological phenomenon, not a scientific one.

What you should actually ask

A p-value below 0.05 is weak evidence of anything on its own. Stronger evidence comes from:

  • Effect size: how large is the effect, not just whether it's there?
  • Confidence intervals: a p = 0.04 result whose interval stretches from near zero to a large effect is much less compelling than one with a narrow interval.
  • Replication: has anyone else found the same result in an independent sample?
  • Pre-registration: was this hypothesis specified before data collection, or was it found by searching through the data?
  • Prior plausibility: how plausible was this effect before the study?
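As a small illustration of why effect size and intervals add information, the sketch below compares two hypothetical studies that share the same p-value. It assumes a normally distributed estimate with a known standard error; all numbers are invented.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def summarize(estimate, std_error, confidence=0.95):
    """Return (two-sided p, CI low, CI high) for a normally distributed estimate."""
    z = estimate / std_error
    p = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    z_crit = STD_NORMAL.inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return p, estimate - z_crit * std_error, estimate + z_crit * std_error


# Two hypothetical studies with the same z = 2.05, so the same p-value.
studies = {
    "small study, big estimate": (4.10, 2.00),
    "large study, tiny estimate": (0.41, 0.20),
}
for name, (estimate, std_error) in studies.items():
    p, low, high = summarize(estimate, std_error)
    print(f"{name}: p = {p:.3f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Both studies report p of about 0.04, yet one is compatible with effects anywhere from roughly 0.2 to 8 and the other pins the effect below 1. The p-value alone cannot tell these situations apart.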

The replication crisis in social science, medicine, and nutrition science is substantially a p-value problem: studies powered just enough to cross 0.05, testing hypotheses post-hoc, with results that dissolve on replication. The p-value is a useful tool. It is not a truth detector.
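The cost of post-hoc searching is easy to put a number on. This sketch assumes every null hypothesis is true and the tests are independent, which real analyses rarely satisfy exactly; the function name is hypothetical.

```python
def chance_of_any_false_positive(num_tests, alpha=0.05):
    """P(at least one p < alpha) across independent tests when every null is true."""
    return 1 - (1 - alpha) ** num_tests


for k in (1, 5, 20):
    chance = chance_of_any_false_positive(k)
    print(f"{k:2d} tests -> {chance:.0%} chance of at least one 'significant' result")
```

With 20 looks at the data, a spurious p < 0.05 is more likely than not, which is why a hypothesis specified in advance carries more weight than one found by searching.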

Key takeaways

  • A p-value answers one specific question: how surprising would this data be if the null hypothesis were true?
  • It does not give the probability that the null is true, that a hypothesis is correct, or that a result will replicate.
  • The 0.05 threshold is a Fisher-era rule of thumb, not a law of nature.
  • Effect size, confidence intervals, replication, and pre-registration matter more than crossing 0.05.
  • The replication crisis is, in large part, a p-value problem — significance at threshold without the rest of the evidence is weak evidence.
