Overfitting

Why a model that gets everything right on training data is probably wrong — and how to build models that generalize.

Perfect on practice, useless on the exam

A student preparing for an exam memorizes every practice question — not the underlying concepts, but the specific answers. When exam day arrives and the questions are slightly different, the student fails.

A machine learning model can make exactly this mistake. Given enough complexity, a model can learn the quirks of its training data so precisely that it stops learning anything real. It has memorized the noise, not the signal.

This is overfitting. It's the central problem in applied machine learning.

The polynomial metaphor

Suppose you have 10 data points showing the relationship between years of experience and salary. You want to fit a curve.

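A degree-1 polynomial (a straight line) is simple. It will not pass through every point, and it captures only the broad trend. If the true relationship curves, for example because salary growth levels off with seniority, a line cannot follow that shape no matter how much data you give it. This is underfitting: the model is too simple to represent the pattern in the data.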

A degree-9 polynomial can thread through all 10 training points exactly. Training error drops to zero. But the curve bends wildly between the points, and if you use it to predict a new employee's salary, the prediction is absurd. This is overfitting: the model is so flexible that it has encoded every accidental wiggle in the training data as if it were a real pattern.
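The comparison above can be sketched in a few lines of Python with NumPy. This is an illustration under stated assumptions, not the article's own code: the sine-shaped ground truth, the noise level, and the random seed are all made up for the demo and stand in for the salary data.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def true_fn(x):
    # Assumed ground truth, for illustration only.
    return np.sin(2 * np.pi * x)

# 10 noisy training points, matching the size of the example above.
x_train = np.linspace(0.0, 1.0, 10)
y_train = true_fn(x_train) + rng.normal(0.0, 0.2, size=x_train.size)

# Fresh points from the same process, never used for fitting.
x_test = rng.uniform(0.0, 1.0, size=200)
y_test = true_fn(x_test) + rng.normal(0.0, 0.2, size=x_test.size)

def mse(model, x, y):
    # Mean squared error of a fitted polynomial on (x, y).
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit; Polynomial.fit rescales x internally
    # for numerical stability.
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree}: train MSE {results[degree][0]:.4f}, "
          f"test MSE {results[degree][1]:.4f}")
```

With 10 points, the degree-9 fit has as many coefficients as there are observations, so its training error is zero up to rounding. Its error on the fresh points is what tells you whether that fit means anything.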

[Interactive figure: Overfitting explorer. A slider sets the polynomial degree from 1 to 10; the plot shows the fitted curve, the true function (sin), and the training points. At degree 3, labeled a good fit: training error 0.0974, test error 0.1071.]

Training error vs. test error

The key diagnostic is the gap between two numbers:

Training error: how well the model fits the data it was trained on
Test error: how well it performs on new data it has never seen

As you increase model complexity, training error keeps falling, or at worst stays flat: a more flexible model can always fit the data it has at least as well as a simpler one. Test error typically follows a U-shaped curve instead: it decreases at first as the model captures real patterns, then rises as the model starts fitting noise.

The point of minimum test error is the sweet spot: complex enough to capture the signal, simple enough not to memorize the noise.

In practice, you approximate this by holding out a portion of your data before training, then evaluating the model on that held-out set. The test set is untouched during training — it's the only honest measure of how the model will perform on data it hasn't seen.
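A minimal sketch of that hold-out procedure, sweeping polynomial degree as the complexity knob. The synthetic sine data, the 30% hold-out fraction, and the degree range are assumptions picked for the demo.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

# Synthetic data: assumed sine signal plus Gaussian noise.
x = rng.uniform(0.0, 1.0, size=60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.size)

# Hold out 30% of the data before any fitting happens.
order = rng.permutation(x.size)
n_test = int(0.3 * x.size)
test_idx, train_idx = order[:n_test], order[n_test:]

def mse(model, idx):
    # Mean squared error of the model on the rows selected by idx.
    return float(np.mean((model(x[idx]) - y[idx]) ** 2))

degrees = list(range(1, 13))
train_err, test_err = [], []
for d in degrees:
    model = Polynomial.fit(x[train_idx], y[train_idx], d)  # fit on training rows only
    train_err.append(mse(model, train_idx))
    test_err.append(mse(model, test_idx))

best_degree = degrees[int(np.argmin(test_err))]
for d, tr, te in zip(degrees, train_err, test_err):
    marker = "  <- lowest held-out error" if d == best_degree else ""
    print(f"degree {d:2d}: train {tr:.4f}  held-out {te:.4f}{marker}")
```

One caveat on the design: once the held-out set has been used to choose the degree, it is no longer an untouched measure. A common remedy is a three-way split, with a validation set for choosing and a final test set used once at the end.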

Why it happens

Every dataset contains two things mixed together: the true underlying pattern (signal) and random variation specific to this sample (noise). A flexible enough model will fit both.

The noise in training data is just that: noise. New data from the same process carries noise too, but it is a fresh, independent draw, so the particular quirks of the training sample do not repeat. When the model predicts on new data, its noise-memorization provides no benefit and its complexity actively hurts.

This is also why more features aren't always better. Each additional feature gives the model another dimension to overfit in. Add a column of random numbers to a least-squares regression and training error cannot go up, and will almost always go down slightly, while test error will typically go up.
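The random-column effect is easy to reproduce. The sketch below assumes a linear signal with one informative feature and uses ordinary least squares; the helper names `make_data` and `design`, the sample sizes, and the coefficient 3.0 are hypothetical choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, n_test, max_junk = 40, 1000, 20

def make_data(n):
    signal = rng.normal(size=n)                      # one informative feature
    junk = rng.normal(size=(n, max_junk))            # columns unrelated to y
    y = 3.0 * signal + rng.normal(0.0, 1.0, size=n)  # assumed linear signal + noise
    return signal, junk, y

def design(signal, junk, n_junk):
    # Intercept, the real feature, and the first n_junk noise columns.
    return np.column_stack([np.ones(signal.size), signal, junk[:, :n_junk]])

s_tr, j_tr, y_tr = make_data(n_train)
s_te, j_te, y_te = make_data(n_test)

scores = {}
for n_junk in (0, 5, 20):
    a_tr = design(s_tr, j_tr, n_junk)
    a_te = design(s_te, j_te, n_junk)
    coef, *_ = np.linalg.lstsq(a_tr, y_tr, rcond=None)  # ordinary least squares
    train_mse = float(np.mean((a_tr @ coef - y_tr) ** 2))
    test_mse = float(np.mean((a_te @ coef - y_te) ** 2))
    scores[n_junk] = (train_mse, test_mse)
    print(f"{n_junk:2d} junk columns: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Because each larger model contains the smaller one, the training error cannot rise as junk columns are added. The test error has no such guarantee, and with only 40 training rows it usually gets worse.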

What to do about it

Collect more data. More training data dilutes the noise. The noise is random and averages out; the signal persists. Larger datasets make it harder for a model to memorize quirks.

Reduce model complexity. Use a simpler model — fewer features, lower polynomial degree, shallower tree depth. Simpler models have fewer degrees of freedom to overfit.

Regularization. Techniques like L1 (lasso) and L2 (ridge) regularization add a penalty on the size of the model's coefficients to the training objective. A coefficient can only grow if the improvement in fit outweighs the penalty, which discourages the model from chasing noise.
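As a rough illustration of the L2 case, the sketch below fits the same degree-9 polynomial with and without a ridge penalty, using the closed-form solution. The data, the penalty strength `lam = 1e-3`, and the helper `ridge_fit` are assumptions for the demo, and for simplicity the intercept is penalized along with the other coefficients, which production implementations usually avoid.

```python
import numpy as np

rng = np.random.default_rng(3)

# 15 noisy samples of an assumed sine signal.
x = np.linspace(0.0, 1.0, 15)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=x.size)

degree = 9
# Polynomial features 1, x, x^2, ..., x^9.
features = np.vander(x, degree + 1, increasing=True)

def ridge_fit(a, b, lam):
    # Minimizes ||a @ w - b||^2 + lam * ||w||^2 via the closed form
    # w = (a^T a + lam * I)^-1 a^T b. Requires lam > 0 here.
    return np.linalg.solve(a.T @ a + lam * np.eye(a.shape[1]), a.T @ b)

def train_mse(w):
    return float(np.mean((features @ w - y) ** 2))

# Unpenalized least squares as the baseline.
w_ols, *_ = np.linalg.lstsq(features, y, rcond=None)
w_ridge = ridge_fit(features, y, lam=1e-3)

print(f"no penalty: train MSE {train_mse(w_ols):.4f}, "
      f"coefficient norm {np.linalg.norm(w_ols):.1f}")
print(f"ridge:      train MSE {train_mse(w_ridge):.4f}, "
      f"coefficient norm {np.linalg.norm(w_ridge):.1f}")
```

The penalized fit gives up some training error in exchange for much smaller coefficients, which is what keeps the curve from swinging between points. The penalty strength is itself a complexity knob and is normally chosen on held-out data.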

Cross-validation. Instead of a single train/test split, divide the data into $k$ folds. Train on $k-1$ folds, evaluate on the remaining fold, and rotate until every fold has served as the evaluation set once. Average the $k$ validation errors. This gives a more stable estimate of generalization performance than a single split and is a standard evaluation protocol.
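The rotation can be written out by hand in a few lines. This is a sketch under assumptions: the synthetic sine data, $k = 5$, and the function name `k_fold_mse` are illustrative, and libraries such as scikit-learn provide maintained equivalents.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(4)

# Synthetic data: assumed sine signal plus Gaussian noise.
x = rng.uniform(0.0, 1.0, size=50)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.size)

def k_fold_mse(x, y, degree, k, rng):
    """Average validation MSE of a polynomial of the given degree over k folds."""
    # Shuffle once, then cut the indices into k nearly equal folds.
    folds = np.array_split(rng.permutation(x.size), k)
    errors = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = Polynomial.fit(x[train_idx], y[train_idx], degree)
        errors.append(np.mean((model(x[val_idx]) - y[val_idx]) ** 2))
    return float(np.mean(errors)), folds

cv_scores = {}
for degree in (1, 3, 9):
    cv_scores[degree], folds = k_fold_mse(x, y, degree, k=5, rng=rng)
    print(f"degree {degree}: 5-fold validation MSE {cv_scores[degree]:.4f}")
```

Every observation is used for validation exactly once and for training $k-1$ times, so no single lucky or unlucky split decides the estimate.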

In finance

Overfitting is especially dangerous in quantitative finance, where it goes by a different name: backtest overfitting or p-hacking a strategy.

A trader with 10 years of historical data and a flexible enough model can find a trading rule that would have generated extraordinary returns — on that specific historical period, with its specific sequence of events. When the rule is deployed on new market data, the edge evaporates. The model learned the particular path that history took, not anything durable about market dynamics.
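A small simulation makes the mechanism concrete. Everything here is assumed for illustration: 1,000 hypothetical strategies whose daily returns are pure noise with zero true mean, a 1% daily volatility, and one 252-day period each for the backtest and for deployment.

```python
import numpy as np

rng = np.random.default_rng(5)

n_strategies = 1000  # hypothetical candidate trading rules
n_days = 252         # roughly one trading year per period

# Every "strategy" is pure noise: daily returns with zero true mean.
in_sample = rng.normal(0.0, 0.01, size=(n_strategies, n_days))
out_of_sample = rng.normal(0.0, 0.01, size=(n_strategies, n_days))

def annualized_sharpe(daily_returns):
    # Mean over standard deviation of daily returns, scaled by sqrt(252).
    mean = daily_returns.mean(axis=-1)
    std = daily_returns.std(axis=-1, ddof=1)
    return mean / std * np.sqrt(252)

sharpe_in = annualized_sharpe(in_sample)
sharpe_out = annualized_sharpe(out_of_sample)

best = int(np.argmax(sharpe_in))  # the rule a backtest search would pick
print(f"best backtest Sharpe:         {sharpe_in[best]:.2f}")
print(f"same rule, out of sample:     {sharpe_out[best]:.2f}")
print(f"average out-of-sample Sharpe: {sharpe_out.mean():.2f}")
```

None of these strategies has any edge by construction, yet picking the best of 1,000 backtests produces an impressive in-sample number. The out-of-sample figure for that same rule is just another draw from noise centered on zero.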

The same problem appears in factor investing. Of the hundreds of factors that "predict" returns in published academic research, many fail to hold up out of sample. They were real patterns in the data used to discover them; they were noise in the broader data-generating process.

The underlying principle

Overfitting is a special case of a deeper idea: optimizing a proxy metric (training error) is not the same as achieving the real goal (generalization), and the harder you optimize the proxy, the further the two can drift apart. The model is doing exactly what you asked, minimizing error on the training set, and that objective is subtly wrong.

This shows up everywhere. A recommendation algorithm optimized for clicks learns to recommend outrage. A hiring model trained on past employees learns to replicate historical biases. A model maximizing training accuracy learns to memorize.

The fix is always the same in structure: measure what you actually care about — performance on unseen data — not a convenient proxy for it.
