Why Your A/B Test Peeks Lie
Stopping a test the moment p drops below 0.05 doesn't make you fast — it inflates your false positive rate from 5% to over 20%. Here is what peeking actually does to your numbers, and what to do instead.
The PM was watching the dashboard on Day 4 when the line crossed. Conversion up 6.2% on the treatment, p = 0.043. The test had been scheduled for two weeks, but the result was already there. Why wait? She shipped the change to 100% of traffic that afternoon.
Six weeks later, the metric was flat. Not slightly worse, not slightly better — flat. The team blamed seasonality, an ecosystem change, a competing release. The post-mortem concluded the effect had "decayed."
The effect hadn't decayed. It had never existed. The test result was a false positive, manufactured almost entirely by the act of stopping early.
This is the most common, most expensive, and most quietly tolerated mistake in modern experimentation. It is also the one that breaks the math in the clearest, most quantifiable way. If you only fix one statistical habit in your team this year, fix this one.
The thing a p-value is actually promising you
A p-value of 0.05 makes a single, specific promise: if the treatment has no real effect, the probability of seeing data at least this extreme by chance is 5%. That promise is conditional on something most teams never think about: the stopping rule. The p-value assumes you decided in advance how long to run the test, and then you ran it that long.
The moment you start peeking and deciding to stop based on what you see, you've broken the contract. The 5% number doesn't apply to your experiment anymore. It applies to a different, hypothetical experiment that you didn't actually run.
The intuition is easier than the math. You are not asking "is the treatment effect real?" You are asking "did the running difference between treatment and control happen to cross a significance threshold at any point during my observation window?" Those two questions have wildly different answers. The first is what you care about. The second is what peek-and-stop measures.
The numbers, plainly
Suppose the treatment has zero effect. The test statistic — the standardized difference between treatment and control — is a random walk around zero. Real-world bumps and dips will move it around. With a fixed-N test, you look once, at the end, and the probability that random noise has pushed the statistic past the ±1.96 threshold is exactly 5%. That is the false positive rate the math promises.
Now imagine you look every day for a month. Each day, you ask the same question: has the statistic crossed the threshold yet? The random walk has thirty separate opportunities to wander across the line and trigger your stopping rule. It only has to cross once.
Under the standard random-walk model with equal information increments, here is what the false positive rate actually looks like as a function of how many times you peek:
| Looks | False positive rate |
|---|---|
| 1 (no peeking) | 5.0% |
| 2 | 8.3% |
| 5 | 14.2% |
| 10 | 19.3% |
| 20 | 24.7% |
| Continuous monitoring | climbs toward 100% with no bound |
The team that runs a two-week test and peeks at the dashboard each morning is not running a 5% false-positive experiment. They are running a roughly 20% one. Out of every five tests where the treatment does nothing at all, about one will still hand them a "significant" result to ship.
[Interactive chart: 5,000 simulated A/B tests with no real effect, with a slider for how often you peek]
The dashed line on the chart is the 5% rate the textbook says your test guarantees. Drag the slider to the right and watch the gap. Every additional look is another lottery ticket bought against the null hypothesis.
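If you would rather reproduce the numbers than trust a chart, the simulation fits in a few lines. This is a minimal sketch under the same assumptions as the table: no real effect, equally spaced looks, and a z-statistic built from unit-variance increments. The function name and its parameters are illustrative, not taken from any particular tool.

```python
import numpy as np

def peeking_false_positive_rate(n_looks, n_tests=5000, z_crit=1.96, seed=0):
    """Share of null A/B tests that look 'significant' at one or more of n_looks peeks."""
    rng = np.random.default_rng(seed)
    # Under the null, each look adds one unit of pure noise to the running sum.
    increments = rng.standard_normal((n_tests, n_looks))
    running_sum = np.cumsum(increments, axis=1)
    # Standardize: after k looks the running sum has standard deviation sqrt(k).
    looks = np.arange(1, n_looks + 1)
    z = running_sum / np.sqrt(looks)
    # Peek-and-stop declares a win if the statistic crosses the line at ANY look.
    stopped_early = (np.abs(z) >= z_crit).any(axis=1)
    return stopped_early.mean()

if __name__ == "__main__":
    for k in (1, 2, 5, 10, 20):
        rate = peeking_false_positive_rate(k)
        print(f"{k:>2} looks: {rate:.1%} false positives")
```

With 5,000 simulated tests the estimates wobble by a few tenths of a percentage point from run to run, so expect them to land near the table values rather than exactly on them.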
Where the lie comes from
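The technical version is short. Under the null hypothesis, the cumulative sum of differences behaves like a Brownian motion B(t), and the test statistic z_t is that sum rescaled by its standard deviation: z_t = B(t)/√t. The probability that z_t crosses ±1.96 at some point before time T is much higher than the probability that it sits beyond ±1.96 at time T. The first is a statement about the maximum of the whole path, the second about the tail of a single draw. The chance that the maximum has crossed the line keeps growing with every look; the chance that the endpoint is beyond it stays fixed at 5%.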
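In symbols, for the discrete version with K equally spaced looks, the comparison looks like this. The notation (K looks, unit-variance increments X_i) is an assumption chosen for illustration, matching the equal-information model used for the table above.

```latex
\[
Z_k = \frac{1}{\sqrt{k}} \sum_{i=1}^{k} X_i,
\qquad X_i \overset{\text{iid}}{\sim} \mathcal{N}(0, 1)
\]
\[
\underbrace{\Pr\bigl(|Z_K| \ge 1.96\bigr)}_{\text{look once, at the end}} = 0.05
\qquad \text{but} \qquad
\underbrace{\Pr\Bigl(\max_{1 \le k \le K} |Z_k| \ge 1.96\Bigr)}_{\text{stop at the first crossing}} > 0.05
\quad \text{for } K \ge 2 .
\]
```

The second event contains the first, so its probability can only be larger, and each extra look adds more ways for it to happen.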
This was worked out formally in Armitage, McPherson and Rowe (1969), the paper that quantified how repeated significance tests on accumulating data inflate the error rate. The result has been textbook material for half a century. It is not new and it is not controversial.
What's new is that A/B testing tools made peeking trivial. Before dashboards, running a sequential test required computing custom thresholds, often by hand. The mechanical friction kept teams honest. Once peeking became as easy as refreshing a tab, the discipline went with it.
The math doesn't care how easy the dashboard is. The contract is the same.
Why this matters in practice
The 20% number above is the minimum damage in the cleanest possible setting — one team, one experiment, one peek per day, no other degrees of freedom. Real experimentation programs add multipliers. Different metrics get inspected at different times. Decisions to stop are made under social pressure, around release deadlines, when someone senior walks past the dashboard. Sub-segments are inspected when the main effect looks weak ("it works for mobile users!"). Each of these is another peek. Each peek inflates the false positive rate further.
This is the engine behind the famous Simmons, Nelson and Simonsohn (2011) result that combinations of researcher degrees of freedom — peeking, dropping outliers, choosing covariates — can push the false positive rate of a single test above 60%. Industrial A/B testing has the same degrees of freedom that academic psychology had, in the same combinations, with the same consequences.
The teams that ignore this and the teams that take it seriously look identical from the outside. Both ship features. Both have dashboards full of "wins." The difference shows up later, in the metrics that never quite move at the company level, in the wins that don't compound, in the gap between the sum of reported A/B test improvements and the actual line on the quarterly chart. Where did all the wins go? They were peeks.
What good practice actually looks like
You have three serious options. None of them require giving up the ability to monitor.
Pre-register the test horizon and stick to it. Decide in advance how long the test will run, based on a power calculation, and resist the urge to stop early. This is the cleanest fix and the one teams hate the most, because it requires saying no to the small voice that whispers "but we already know."
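The power calculation behind a fixed horizon is short enough to write down. Here is a rough sketch for a conversion-rate test, using the standard normal approximation for a two-sided two-proportion z-test. The function names, the 80% power default, and the traffic figure in the example are all assumptions for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_arm(p_base, rel_lift, alpha=0.05, power=0.80):
    """Visitors needed per arm to detect a relative lift (normal approximation)."""
    p_treat = p_base * (1 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p_base + p_treat) / 2
    # Noise term under the null (pooled rate) plus noise term under the alternative.
    null_term = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    alt_term = z_beta * sqrt(p_base * (1 - p_base) + p_treat * (1 - p_treat))
    return ceil((null_term + alt_term) ** 2 / (p_treat - p_base) ** 2)

def horizon_days(p_base, rel_lift, daily_visitors_per_arm, **kwargs):
    """Whole days required to collect the sample at the assumed traffic level."""
    n = sample_size_per_arm(p_base, rel_lift, **kwargs)
    return ceil(n / daily_visitors_per_arm)

if __name__ == "__main__":
    # Hypothetical inputs: 10% baseline conversion, 10% relative lift, 2,000 visitors/arm/day.
    print(sample_size_per_arm(0.10, 0.10), "visitors per arm")
    print(horizon_days(0.10, 0.10, daily_visitors_per_arm=2000), "days")
```

Write the resulting number of days into the test plan before launch. That number, not the dashboard, is what ends the test.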
Use a sequential test with α-spending. Pocock boundaries, O'Brien-Fleming boundaries, and the broader α-spending framework are exactly the corrected versions of peek-as-you-go testing. They use stricter thresholds at each interim look so the overall false positive rate stays at 5%. The thresholds are well-tabulated and supported by standard libraries. The cost is a slightly stricter bar for stopping early, which is also exactly the cost you should be paying.
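To see what "stricter thresholds at each interim look" means in numbers, you can approximate the classical Pocock and O'Brien-Fleming boundaries by simulation. This is a sketch, not how the standard libraries do it: they use numerical integration, and the Monte Carlo calibration here is swapped in only because it is easy to read. It assumes equally spaced looks, and the function name and arguments are made up for this example.

```python
import numpy as np

def group_sequential_thresholds(n_looks, shape="pocock", alpha=0.05,
                                n_sims=200_000, seed=0):
    """Approximate per-look |z| thresholds that keep the overall error rate at alpha."""
    rng = np.random.default_rng(seed)
    looks = np.arange(1, n_looks + 1)
    # Simulate z-statistic paths under the null, one column per look.
    z = np.cumsum(rng.standard_normal((n_sims, n_looks)), axis=1) / np.sqrt(looks)
    if shape == "pocock":
        weights = np.ones(n_looks)           # same threshold at every look
    elif shape == "obf":
        weights = np.sqrt(n_looks / looks)   # very strict early, near 1.96 at the end
    else:
        raise ValueError("shape must be 'pocock' or 'obf'")
    # Pick the constant c so that only a fraction alpha of null paths
    # ever exceed c * weights at any look.
    worst_ratio = np.max(np.abs(z) / weights, axis=1)
    c = np.quantile(worst_ratio, 1 - alpha)
    return c * weights

if __name__ == "__main__":
    print("Pocock:         ", np.round(group_sequential_thresholds(5, "pocock"), 2))
    print("O'Brien-Fleming:", np.round(group_sequential_thresholds(5, "obf"), 2))
```

For five looks the output should land close to the tabulated values: roughly 2.41 at every look for Pocock, and an O'Brien-Fleming schedule that starts above 4.5 and ends near 2.04. Either way, the Day 4 result with p = 0.043 (|z| of about 2.0) would not have cleared the bar.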
Use always-valid confidence intervals (confidence sequences). The modern version of the same idea, developed by Howard, Ramdas, McAuliffe and Sekhon (2018) and others. A confidence sequence is a CI you can peek at as many times as you want without losing coverage. The math is non-trivial but the implementation is not — it's the approach behind the always-valid / sequential offerings on platforms like Optimizely's Stats Engine and Statsig, and similar internal experimentation systems at large tech companies. If your tool advertises "valid at any time" or "peek-safe" results, this is the family of methods it's drawing on.
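For a feel of what a confidence sequence looks like, here is a sketch of one of the simplest members of the family: a two-sided normal-mixture boundary for a running mean. It assumes observations with a known standard deviation `sigma`, which real A/B data does not give you, and the tuning parameter `rho` and both function names are choices made for this example. Production systems handle estimated variances and differences between arms.

```python
import numpy as np

def mixture_cs_radius(t, alpha=0.05, rho=100.0, sigma=1.0):
    """Half-width of a normal-mixture confidence sequence after t observations.

    rho (in units of observations) sets where the interval is tightest
    relative to a fixed-sample interval; it must be chosen before the test.
    """
    t = np.asarray(t, dtype=float)
    return sigma * np.sqrt((t + rho) * np.log((t + rho) / (rho * alpha**2))) / t

def null_miss_rate(n_paths=1000, horizon=2000, alpha=0.05, rho=100.0, seed=0):
    """Share of simulated streams whose interval EVER excludes the true mean (zero)."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, horizon + 1)
    running_mean = np.cumsum(rng.standard_normal((n_paths, horizon)), axis=1) / t
    excluded = np.abs(running_mean) > mixture_cs_radius(t, alpha, rho)
    # Checking after every single observation is the most aggressive peeking possible.
    return excluded.any(axis=1).mean()

if __name__ == "__main__":
    print("fixed-sample 95% half-width at t=1000:", round(1.96 / np.sqrt(1000), 3))
    print("confidence-sequence half-width at t=1000:", round(float(mixture_cs_radius(1000)), 3))
    print("share of null streams ever flagged:", null_miss_rate())
```

The interval is wider than the fixed-sample one at every sample size. That extra width is the price of being allowed to look after every observation and still keep the error rate at or below 5%.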
You can also adopt Bayesian inference, which is not magically immune to garbage stopping rules but is more honest about what it tells you under continuous monitoring. The point is not which framework you choose. The point is to choose something that knows you're peeking.
For why each underlying piece of the math behaves this way, see the concept pages on p-values and multiple comparisons — the latter is the same phenomenon viewed from a slightly different angle.
The real failure mode is organizational
Peeking is not, in the end, a statistics problem. It is a problem about who has authority to stop a test, when, and based on what evidence. Most A/B test failures of this type happen because nobody on the team owns the stopping rule. The PM watches the dashboard. The data scientist runs the test. The director asks "are we ready to ship?" in the standup. The result is decided by whoever loses patience first.
A team that says "we'll run it until it's significant" is a team that produces 20% false positives, or worse. A team that says "we'll run it for 14 days, no exceptions, and then we'll look" produces calibrated 5% false positives. The math doesn't care which culture you build. It just records the consequences.
The reframe is the takeaway. A p-value isn't a property of your data. It's a property of your data plus the rule that governed your experiment. Specify the rule in advance, in writing, before you see anything. Pre-register the stopping rule the same way you pre-register the hypothesis. If you can't bring yourself to do that, at least be honest about what your numbers are actually saying — which is, in most cases, considerably less than you think.
The test is only as strong as the contract you signed with the universe before it started.