May 24, 2026 · Essay · 8 min read · Statistics · Data Thinking

Reading Research After the Replication Crisis

A single p < 0.05 study is weaker evidence than it looks — here is why the replication crisis should change the discount rate every data practitioner applies to published findings.

The product manager pastes a link into the deck. "Studies show that social proof nudges increase conversion by 15–20%." The behavioral-science finding has been cited in at least three industry blog posts, mentioned in a Reforge course, and attributed back to a 2012 psychology paper. It's going to justify a sprint. The question nobody in the room is asking: how much should we trust a single published result?

Not much. The hard lesson of the last decade in empirical research is that a published, peer-reviewed, statistically significant finding is not a settled fact. It is a signal — and a weaker one than the journal publication implies. Knowing why changes how you should discount external evidence when it lands in your strategy docs.

What the 2015 replication study actually showed

In 2015, 270 researchers in the Open Science Collaboration systematically tried to reproduce 100 psychology studies. All were peer-reviewed and published in top journals, and about 97% had statistically significant original results (p < 0.05). The headline criterion for replication was straightforward: did the replication study itself reach statistical significance?

Only about 36% did. Beyond the pass/fail rate, replicated effect sizes were on average roughly half the magnitude of the originals. Both numbers matter. The 64% failure rate tells you how often the finding couldn't be confirmed at all under similar conditions. The halving of effect sizes tells you that even when something replicated, the original estimate was probably inflated — the true effect was smaller than the first study suggested.

This was the moment a quiet methodological concern became a genuine scientific problem. But the causes were not primarily fraud.

Three systemic causes that should worry any analyst

Publication bias. Journals prefer positive results. A study showing that an environmental cue changes behavior is publishable. A study showing that it doesn't is not. The unpublished null results pile up in researchers' file drawers. The published literature is therefore a systematically biased sample of all research actually conducted — tilted toward findings that cleared p < 0.05, regardless of whether the underlying effect is real or stable. A stack of supporting papers can reflect the publication filter as much as the underlying phenomenon.
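The filter described above can be made concrete with a small simulation. This is a sketch under stated assumptions, not a model of any real literature: 2,000 hypothetical labs each test an effect whose true size is zero, outcomes are normal with a known standard deviation of 1, and only significant results in the predicted direction get "published". The function names and parameters are invented for this example.

```python
import math
import random
from statistics import NormalDist, mean


def simulate_study(rng, true_effect, n_per_group):
    """Return (estimated effect, two-sided p-value) for one two-group study.

    Assumes normal outcomes with known SD = 1, so a z-test applies.
    """
    treat = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    ctrl = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    diff = mean(treat) - mean(ctrl)
    se = math.sqrt(2.0 / n_per_group)  # SE of a difference in means
    p = 2.0 * (1.0 - NormalDist().cdf(abs(diff) / se))
    return diff, p


def file_drawer(n_studies=2000, true_effect=0.0, n_per_group=25, seed=1):
    """Run many studies and apply a hypothetical publication filter."""
    rng = random.Random(seed)
    results = [simulate_study(rng, true_effect, n_per_group)
               for _ in range(n_studies)]
    # Assumed filter: only significant, positive results leave the file drawer.
    published = [d for d, p in results if p < 0.05 and d > 0]
    return {
        "run": n_studies,
        "published": len(published),
        "mean_effect_all": mean(d for d, _ in results),
        "mean_effect_published": mean(published),
    }


summary = file_drawer()
print(summary)
```

Under these assumptions about 2.5% of the studies clear the filter, so a few dozen "supporting" papers appear. Every one of them reports an effect above roughly 0.55 standard deviations, even though the true effect is zero and the average across all studies run sits near zero.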

Low statistical power. Many psychology studies ran on 20 or 30 undergraduates in a single lab. Small samples produce noisy effect size estimates. Even when a real effect exists, an underpowered study will either miss it (producing a false negative) or, when it happens to detect it, will overestimate the size. The first published result is often the lucky extreme of a noisy measurement. Replications regress toward a smaller, more accurate estimate — which reads as "failure to replicate" even when the original phenomenon is partially real.
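The inflation described here follows mechanically from filtering noisy estimates on significance. The sketch below assumes a true standardized effect of 0.2, outcomes with a standard deviation of 1, and estimates drawn directly from their normal sampling distribution. All values and names are illustrative, not taken from any study.

```python
import math
import random
from statistics import NormalDist, mean


def significance_filter(true_effect=0.2, n_per_group=25,
                        n_studies=20000, seed=2):
    """Return (power, exaggeration ratio) for a two-group design.

    Exaggeration ratio = average size of the significant estimates
    divided by the true effect.
    """
    rng = random.Random(seed)
    se = math.sqrt(2.0 / n_per_group)       # SE of a difference in means
    z_crit = NormalDist().inv_cdf(0.975)    # two-sided 5% threshold
    # Each study's estimate is one draw from the sampling distribution.
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    significant = [d for d in estimates if abs(d) / se > z_crit]
    power = len(significant) / n_studies
    exaggeration = mean(abs(d) for d in significant) / true_effect
    return power, exaggeration


for n in (25, 1000):
    power, ratio = significance_filter(n_per_group=n)
    print(f"n per group = {n}: power = {power:.2f}, "
          f"exaggeration = {ratio:.2f}x")
```

With 25 participants per group, the design detects the effect only about one time in ten, and any estimate that reaches significance has to exceed roughly 0.55, nearly three times the true value. With 1,000 per group the power is close to 1 and the significant estimates average close to the true effect.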

Researcher degrees of freedom. Choosing which outcome to report, when to stop collecting data, which covariates to include, whether to exclude outliers — each decision is a fork in the analysis path. If researchers explore enough paths and report the one that crossed p < 0.05, the finding is statistically significant by construction, not by evidence. This is what p-values were never designed to handle: a significance threshold applied at the end of an undisclosed search through analysis space. The result looks like a confirmatory test but behaves like multiple comparisons.
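A rough way to see the size of this problem is to simulate a study with no true effect in which the analyst measures several outcomes and reports whichever one reaches significance. The sketch below assumes the outcomes are independent, which is a simplification. Real analysis forks are usually correlated, so the inflation per fork is smaller than shown, but the direction is the same. The function name and parameters are hypothetical.

```python
import random
from statistics import NormalDist


def false_positive_rate(n_outcomes, n_studies=20000, alpha=0.05, seed=3):
    """Share of null studies where at least one outcome has p < alpha."""
    rng = random.Random(seed)
    norm = NormalDist()
    hits = 0
    for _ in range(n_studies):
        # Under the null, each outcome's z-statistic is standard normal.
        p_values = [2.0 * (1.0 - norm.cdf(abs(rng.gauss(0.0, 1.0))))
                    for _ in range(n_outcomes)]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_studies


for k in (1, 5, 10):
    simulated = false_positive_rate(k)
    expected = 1 - 0.95 ** k  # closed form for independent tests
    print(f"{k} outcomes: simulated {simulated:.3f}, expected {expected:.3f}")
```

For independent tests the chance of at least one false positive is 1 - 0.95^k: about 23% with five outcomes and about 40% with ten, against the nominal 5%.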

Fraud does enter the picture — Diederik Stapel, a Dutch social psychologist, confessed to fabricating data in dozens of studies over a decade. But misconduct explains only a small fraction of the failed replications. The larger causes are these three structural features of how academic science was practiced, not bad actors.

The failures that matter for practitioners

Ego depletion was one of social psychology's most cited effects: the idea that willpower is a finite resource that gets depleted by use. In the original 1998 study, participants who had to resist eating cookies gave up sooner on a subsequent puzzle. The effect was reported again in dozens of follow-up studies and made it into management consulting and habit-change programs.

A 2016 pre-registered multi-lab replication — 23 labs, approximately 2,100 participants — found essentially no effect. The best current interpretation is that the original result was probably inflated, and any underlying phenomenon is far smaller and more conditional than a decade of follow-up implied.

Power posing was the claim that adopting expansive physical postures changes hormone levels (raising testosterone, lowering cortisol) and improves outcomes under stress. A talk about it became one of the most-watched TED talks of all time. The hormonal claims didn't replicate. One of the original study's co-authors publicly distanced herself from those claims, acknowledging the evidence didn't support them.

Could you have known in advance that these findings would not hold up? The honest answer is that you often can't tell without the replication data. Which is exactly why your prior on any single published result should be lower than the p-value implies.

What reform looks like — and how to use it as a signal

The response to the replication crisis has been real. Pre-registration means committing to the hypothesis, analysis plan, and sample size before collecting data — so the published result is genuinely confirmatory, not exploratory dressed up as confirmatory. Registered Reports go further: a journal agrees to publish the study based on design quality before results are known, eliminating publication bias at the source. Many Labs replication projects pool statistical power across dozens of independent sites, producing effect size estimates that are far more stable than any single lab result.

When you're evaluating a finding, these are your quality signals:

Was the study pre-registered? You can check on OSF or AsPredicted.
Has it been replicated in a large multi-site study, or only in the original lab (or labs that cite the original)?
Are effect sizes reported with confidence intervals, not just a significance verdict? A significant p with a wide confidence interval around a small effect is much weaker evidence than it looks.
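To illustrate the last point, the sketch below turns a reported effect and its standard error into a p-value and a confidence interval using a normal approximation. The inputs (an effect of 0.30 with a standard error of 0.15) are made up for illustration, and the function name is hypothetical.

```python
from statistics import NormalDist


def summarize(effect, std_error, level=0.95):
    """Return (two-sided p-value, (ci_low, ci_high)), normal approximation."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(0.5 + level / 2.0)
    p_value = 2.0 * (1.0 - norm.cdf(abs(effect) / std_error))
    ci = (effect - z_crit * std_error, effect + z_crit * std_error)
    return p_value, ci


# Hypothetical reported result: effect = 0.30, standard error = 0.15.
p_value, (low, high) = summarize(0.30, 0.15)
print(f"p = {p_value:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

This result passes the significance test, yet the interval runs from just above zero to nearly 0.6. The data are compatible with a negligible effect and with one twice the point estimate, which the bare verdict "significant" hides.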

Practitioner rules

Discount a single p < 0.05 study hard. One significant result in one lab on one sample justifies only a small update to your prior. It becomes informative when it replicates independently, at scale, in populations that resemble the one you care about.
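One way to put a number on this discount is the probability that a significant finding reflects a real effect, given a prior and the study's power. The sketch below applies Bayes' rule. The prior of 10% and the power values are illustrative assumptions, not estimates for any field, and the calculation ignores publication bias and flexible analysis, both of which would lower the result further.

```python
def prob_true_given_significant(prior, power, alpha=0.05):
    """P(effect is real | p < alpha), by Bayes' rule.

    prior: assumed share of tested hypotheses that are true
    power: probability a real effect yields p < alpha
    alpha: probability a null effect yields p < alpha
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Illustrative inputs: 1 in 10 tested hypotheses is true.
for power in (0.35, 0.80):
    posterior = prob_true_given_significant(prior=0.10, power=power)
    print(f"power = {power:.2f}: P(real | significant) = {posterior:.2f}")
```

With these inputs a significant result from a low-powered study is more likely false than true (about 44%), and even a well-powered study reaches only about 64%.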

The more surprising the finding, the more skeptical you should be. Dramatic results — ego depletion, power posing, social priming — attracted citations precisely because they were counterintuitive and large. Counterintuitive and large is also the signature of publication-filtered noise. If a finding feels too clean or too convenient, that's information.

Be especially skeptical when the finding conveniently supports a decision someone already wants to make. "Studies show X" arriving in a deck to justify a sprint that was already planned is not evidence being evaluated — it is evidence being marshaled. The confirmation bias operating on the researcher who p-hacked the finding is now operating on the PM citing it.

Weight effect sizes and confidence intervals over the significance verdict. A pre-registered multi-site study with a small but precisely estimated effect is far more useful than a viral single-study result with a large claimed effect and no replication. The former tells you roughly what to expect. The latter tells you what happened once under favorable conditions.

Look for systematic reviews and meta-analyses, not individual papers. A meta-analysis that covers the full published literature — including null results where possible — gives you a better estimate of the true effect than any single study, including the most-cited one.
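The core arithmetic of a meta-analysis is a precision-weighted average. The sketch below uses a fixed-effect, inverse-variance model, which is the simplest variant. Published meta-analyses often use random-effects models instead, and no pooling method can undo publication bias in the studies it is fed. The three studies are hypothetical: one small study with a large estimate and two larger studies with small ones.

```python
import math


def pool_fixed_effect(effects, std_errors):
    """Inverse-variance weighted mean and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se


# Hypothetical inputs: (effect, standard error) for three studies.
effects = [0.60, 0.10, 0.05]
std_errors = [0.30, 0.08, 0.06]

pooled, pooled_se = pool_fixed_effect(effects, std_errors)
low, high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled effect = {pooled:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

The small study's estimate of 0.60 barely moves the pooled value, which lands near 0.08, because the weights scale with precision rather than with how striking or how widely cited a result is.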

For evaluating any specific piece of external evidence (a vendor whitepaper, an industry report, a paper that surfaced in Slack), see the companion note "How to Read a Study Without Being Fooled," which works through the structural questions one at a time. This note is the bigger-picture backdrop: why the published literature as a whole carries less certainty than its p-values imply, and what that means for your prior before you open any individual study.

The lesson is not that published research is useless. It is that a p-value is not a reliability certificate. It is a measure of surprise under a specific null hypothesis, computed after an analysis process you usually can't fully audit. Calibrating how much to update on a published finding — rather than treating publication as validation — is one of the more practically useful things a working data person can internalize.