May 7, 2026 | Essay | 8 min read | Probability, Data Thinking

The Alert at 11 PM

A fraud alert fires on a Friday night. You have ninety seconds to decide whether to escalate, page the on-call engineer, and potentially lock out a real customer — or ignore it and move on. This is what good probabilistic reasoning looks like under pressure.

It's 11:14 PM on a Friday. Your phone buzzes with an automated fraud alert. A transaction just tripped the model: score 0.87 out of 1.0, flagged as high-confidence suspicious. The system is set to auto-block at 0.90. This one didn't cross that threshold, but it generated a page. You're the on-call analyst.

You have a few options. Page the fraud team, wake someone up, and block the account. Escalate to the on-call engineer for a manual review. Or ignore it and let the transaction clear. The customer is trying to buy something for a few hundred dollars from an unfamiliar merchant at an unusual hour.

Most people in this situation anchor on the score. 0.87 feels high. The model flagged it. Ninety seconds of reasoning under pressure, and the instinct is to act — to err on the side of caution, because what if it really is fraud?

That instinct has a cost. And the cost comes from a number almost nobody thinks to ask about first: how often does a score like this turn out to be real fraud?

The base rate you forgot to check

Suppose your fraud model fires on roughly 1% of transactions (illustrative). Of the alerts it generates with a score between 0.80 and 0.90, historical data shows that about 15% are confirmed fraud after manual review (illustrative). The other 85% are legitimate customers — unusual location, new device, late-night purchase — who get blocked, locked out, and forced through a friction-heavy dispute process that your support team knows ends in churn roughly half the time.

The model's score is not the probability of fraud. It's a feature of the model's output, shaped by whatever data it was trained on. The probability you actually care about (given this specific score, at this threshold, on your platform, in this customer population) is a different number, and it lives in your historical alert data, not in the score itself.

This is what base rate neglect costs you in practice. It's not an abstract cognitive bias. It's the gap between "the score is 0.87" and "the probability this is real fraud, given that the score is 0.87, is 15%." Those are not the same sentence. The second one is the one that should drive your decision.

Working the expected value

Once you have a rough probability — call it 15% chance of real fraud — the question shifts from "is this fraud?" to "what is the right action given 15% odds?"

That calculation depends on two costs you can actually estimate.

Cost of blocking a legitimate customer: a few hundred dollars of lost revenue, a friction event, and some measurable churn risk. Call the total expected cost something in the range of tens of dollars in lifetime value lost (illustrative), concentrated in a cohort of customers who were already doing something unusual and may not come back.

Cost of letting real fraud through: depends entirely on your liability model. If the platform eats the chargeback, it's the transaction amount plus the chargeback fee plus operational cost. For a few-hundred-dollar transaction, that might be $350–450 all-in (illustrative).

Expected cost of blocking: 0.85 × (cost of wrongly blocking a real customer). Expected cost of passing: 0.15 × (cost of fraud going through).

Whether blocking or passing has the lower expected cost depends on numbers specific to your platform. But the structure of the decision is clear: if most alerts in the 0.80–0.90 score band turn out to be false alarms, reflexive blocking is expensive even before you count the customer experience damage. The threshold that feels "safe" (act on everything the model flags) may not be the threshold that minimizes harm.

This is expected value reasoning, and it forces a discipline that score-anchoring doesn't: you have to make the costs explicit. The moment you write down the two numbers, the right decision often becomes obvious — and it's frequently not the "safe" one.
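
To make the comparison concrete, here is a minimal sketch of the two expected costs. The dollar figures are assumptions picked from inside the illustrative ranges above ($40 for a wrongful block, $400 for fraud that clears), and the function and variable names are hypothetical, not part of any real fraud system.

```python
def expected_costs(p_fraud, cost_false_block, cost_missed_fraud):
    """Return (expected cost of blocking, expected cost of passing)."""
    # Blocking only costs us when the customer was legitimate.
    block = (1.0 - p_fraud) * cost_false_block
    # Passing only costs us when the transaction really was fraud.
    allow = p_fraud * cost_missed_fraud
    return block, allow


# Illustrative assumptions, not measured values.
P_FRAUD = 0.15             # confirmed-fraud rate in the 0.80-0.90 score band
COST_FALSE_BLOCK = 40.0    # "tens of dollars" of lifetime value lost
COST_MISSED_FRAUD = 400.0  # transaction amount plus chargeback and ops cost

block_cost, pass_cost = expected_costs(P_FRAUD, COST_FALSE_BLOCK, COST_MISSED_FRAUD)
cheaper = "block" if block_cost < pass_cost else "pass"
print(f"block: ${block_cost:.2f}  pass: ${pass_cost:.2f}  cheaper: {cheaper}")
```

With these particular assumptions, blocking comes out cheaper at a 15% fraud probability. The answer flips once the probability falls below roughly 9% (40 / 440), or if a wrongful block costs more than assumed. Which side of that line you are on is exactly what writing the numbers down tells you.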

The reference class problem at midnight

Here's where the 11 PM context matters. Your instinct is to pull up this transaction and reason about it specifically: unusual merchant, unfamiliar geography, late hour. Those feel like signals.

But which past alerts is this alert actually like?

Late-night transactions from unfamiliar merchants skew toward travelers, shift workers, and people buying gifts — not toward fraud. If your historical data shows that late-night alerts from this merchant category resolve as fraud 8% of the time versus 22% for daytime alerts in other categories (illustrative), the time and merchant type should shift your prior down, not up. Your specific circumstances are evidence, but they have to be assessed against actual outcome data — not against a narrative that matches the pattern of a fraud you once saw.

This is the reference class question: which bucket of past cases does this alert belong to? Not which story does it remind you of, but which slice of your historical alert distribution matches this alert's features? The answer to that question is the only honest prior. Intuition built from memorable cases skews toward the vivid and the recent. Data from the last ten thousand alerts doesn't.

Bayes' theorem is the formal machinery for this update — prior probability of fraud, likelihood of seeing this score and these features given fraud vs. not fraud, posterior after the update. In practice you don't need to do the algebra at midnight. You need the lookup table your team should have built from past alert outcomes, segmented by score band and transaction type. That table is a pre-computed Bayesian update. If it doesn't exist, the first thing to do after this alert resolves is build it.
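
For readers who want to see the update spelled out once, here is a small sketch of it. The 15% prior is the illustrative band-level rate used earlier; the two likelihoods are purely hypothetical numbers chosen for the example, and `bayes_update` is an illustrative helper, not a library function.

```python
def bayes_update(prior, p_evidence_given_fraud, p_evidence_given_legit):
    """Posterior P(fraud | evidence) from a prior and two likelihoods."""
    numerator = prior * p_evidence_given_fraud
    denominator = numerator + (1.0 - prior) * p_evidence_given_legit
    return numerator / denominator


prior = 0.15  # illustrative fraud rate for alerts in this score band

# Hypothetical likelihoods: suppose the "late night, unfamiliar merchant"
# pattern appears in 10% of confirmed-fraud alerts and 20% of legitimate ones.
posterior = bayes_update(prior, p_evidence_given_fraud=0.10, p_evidence_given_legit=0.20)
print(f"prior: {prior:.3f}  posterior: {posterior:.3f}")
```

Because the pattern is assumed to be twice as common among legitimate alerts as among fraudulent ones, the estimate moves down, to about 8%. A lookup table segmented by score band and transaction type gives you that posterior directly, without anyone estimating the likelihoods by hand.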

Calibration is a team infrastructure problem

The deeper issue is that most fraud teams are not calibrated. They know their model's overall precision and recall, but they don't have a clear empirical answer to "what fraction of 0.85-scored alerts in transaction category X are real fraud, and what fraction are false alarms?"

Without that table, every alert is an exercise in gut-feel with a veneer of quantitative confidence. The score makes the decision feel data-driven. But if nobody has checked whether a score of 0.87 in this context means 15%, 40%, or 5% real fraud, the score is providing false precision rather than genuine probability.

Calibration here means the same thing it means in weather forecasting: when the model says 0.87, does fraud actually materialize 87% of the time? In most fraud systems the answer is no — the raw scores are not probabilities, they're relative rankings, and the actual fraud rate at a given score band depends on threshold tuning, population shift, and recency of the training data. A team that treats 0.87 as "87% chance of fraud" is miscalibrated by construction.

The fix is empirical, not theoretical. Run the alerts from the past six months. For each score decile, calculate the confirmed fraud rate. Build the lookup table. Now your on-call analyst has something to work with at 11 PM that is actually a probability.
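
A rough sketch of that table-building step follows, using only the standard library. It assumes alert history is available as `(score, confirmed_fraud)` pairs and uses fixed-width tenths of the score range rather than quantile-based deciles. Both are simplifying assumptions, and the toy `history` list is made-up data for illustration only.

```python
from collections import defaultdict


def calibration_table(alerts, n_bins=10):
    """Summarize (score, confirmed_fraud) pairs by score band."""
    # bin index -> [alert count, confirmed-fraud count, sum of scores]
    cells = defaultdict(lambda: [0, 0, 0.0])
    for score, confirmed_fraud in alerts:
        b = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 joins the top band
        cell = cells[b]
        cell[0] += 1
        cell[1] += int(confirmed_fraud)
        cell[2] += score
    rows = []
    for b in sorted(cells):
        n, n_fraud, score_sum = cells[b]
        rows.append({
            "band": (b / n_bins, (b + 1) / n_bins),
            "alerts": n,
            "mean_score": score_sum / n,  # what the model "said"
            "fraud_rate": n_fraud / n,    # what actually happened
        })
    return rows


# Toy history, invented for the example.
history = [
    (0.87, False), (0.83, True), (0.88, False), (0.85, False),
    (0.95, True), (0.92, True), (0.91, False), (0.42, False),
]
table = calibration_table(history)
for row in table:
    lo, hi = row["band"]
    print(f"{lo:.1f}-{hi:.1f}  n={row['alerts']}  "
          f"mean score={row['mean_score']:.2f}  fraud rate={row['fraud_rate']:.2f}")
```

The gap between `mean_score` and `fraud_rate` in each row is the miscalibration. In practice each band needs enough resolved alerts for its rate to be stable, and the same grouping can be repeated per transaction type.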

What this all adds up to

The alert at 11 PM is a decision problem, not a pattern-matching problem. The decision structure is:

What is the actual base rate of fraud at this score band and transaction type?
Given that base rate, what is the expected cost of acting versus passing?
Does the specific context of this transaction push the base rate up or down, based on data rather than narrative?

None of this requires math at midnight. It requires having built the infrastructure — the calibrated lookup tables, the expected-cost calculations by transaction type, the documented decision thresholds — before the alert fires. Probabilistic reasoning is mostly preparation work done in advance so that the decision at 11 PM is a lookup, not a guess.
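
As a sketch of what "a lookup, not a guess" could look like, the snippet below combines a precomputed fraud-rate table with a break-even threshold derived from the two error costs. Everything in it is hypothetical: the segment names, the table contents (which reuse the essay's illustrative 8% and 22% figures and assume they apply within the 0.80–0.90 band), and the $40 / $400 costs.

```python
def break_even_probability(cost_false_block, cost_missed_fraud):
    """Fraud probability at which blocking and passing cost the same.

    Solves (1 - p) * cost_false_block == p * cost_missed_fraud for p.
    """
    return cost_false_block / (cost_false_block + cost_missed_fraud)


# Hypothetical tables, built offline from resolved alerts and finance estimates.
FRAUD_RATE = {  # (score band, segment) -> historical confirmed-fraud rate
    ("0.80-0.90", "late_night_this_category"): 0.08,
    ("0.80-0.90", "daytime_other_category"): 0.22,
}
COSTS = {  # segment -> (cost of wrongful block, cost of missed fraud)
    "late_night_this_category": (40.0, 400.0),
    "daytime_other_category": (40.0, 400.0),
}


def recommend(score_band, segment):
    """Return 'block' or 'pass' using only precomputed numbers."""
    p_fraud = FRAUD_RATE[(score_band, segment)]
    threshold = break_even_probability(*COSTS[segment])
    return "block" if p_fraud >= threshold else "pass"


for segment in COSTS:
    print(segment, "->", recommend("0.80-0.90", segment))
```

Under these assumptions the break-even probability is about 9%, so the late-night segment falls just under it and the daytime segment well over. The point of the design is that the analyst never computes anything at 11 PM; the thresholds are reviewed and documented ahead of time.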

The teams that do this well are not better at math under pressure. They are better at recognizing that probability is not a property of the alert. It's a property of the alert plus the population it came from, the model that generated it, and the history of how alerts like this one resolved. Without that context, the score is just a number that feels like certainty.

Hold it lightly. Check the base rate. Price the two errors. Then decide.