The Bias-Variance Tradeoff: Why Your Models Fail (And How to Fix Them)

LDS Team · Let's Data Science

Picture yourself at an archery range with two friends. Alex shoots every arrow into the same tight cluster, but the cluster sits a foot left of the bullseye. Sam scatters arrows across the entire target, though the average landing point is dead center. Alex has high bias (consistently wrong in the same direction). Sam has high variance (right on average, but wildly inconsistent). The best archer groups arrows tightly around the bullseye, minimizing both.

The bias-variance tradeoff is the machine learning version of this archery problem. Every model lives somewhere on the spectrum between Alex and Sam. Understanding where your model sits is the single most useful diagnostic skill in data science, and it determines whether you should add complexity, add data, or reach for regularization.

We'll work through one running example from start to finish: predicting house prices from square footage, where the true relationship is quadratic (price per square foot rises for larger homes). Every formula, table, and code block maps to this same scenario.

The Error Decomposition Formula

Every prediction error in supervised learning breaks down into exactly three components. This isn't a heuristic or a rule of thumb. It's a mathematical identity.

Suppose we want to predict a target $Y$ given input $X$, where the true relationship is $Y = f(X) + \epsilon$. The noise term $\epsilon$ has mean zero and variance $\sigma^2$. Our model $\hat{f}(X)$ tries to approximate $f(X)$. The expected prediction error at any point $x$ decomposes as:

$$\mathrm{E}\left[(Y - \hat{f}(x))^2\right] = \left(\mathrm{E}[\hat{f}(x)] - f(x)\right)^2 + \mathrm{E}\left[\left(\hat{f}(x) - \mathrm{E}[\hat{f}(x)]\right)^2\right] + \sigma^2$$

Which simplifies to:

$$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

Where:

  • $\text{Bias}^2 = (\mathrm{E}[\hat{f}(x)] - f(x))^2$ measures how far the model's average prediction deviates from the true value
  • $\text{Variance} = \mathrm{E}[(\hat{f}(x) - \mathrm{E}[\hat{f}(x)])^2]$ measures how much the prediction changes when trained on different datasets
  • $\sigma^2$ is the irreducible error, the noise inherent in the data that no model can eliminate
  • $\hat{f}(x)$ is our model's prediction at input $x$
  • $f(x)$ is the true underlying function

In Plain English: For our house price example, bias is the systematic error a straight line makes when the real pattern is curved. No matter how many houses we train on, a linear model will always undershoot prices at the extremes and overshoot in the middle. Variance is how much a degree-7 polynomial's predictions jump around depending on which specific 150 houses we happened to sample. Irreducible error is the randomness left over: two identical 2,000 sqft houses in the same neighborhood sell for different prices because of factors we can't measure (the seller's urgency, market timing, curb appeal).
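To make the arithmetic concrete, here is the decomposition plugged through for a single house. The average-prediction and variance figures below are invented for illustration, not outputs of a fitted model:

```python
# Plugging illustrative numbers into the decomposition at one point x.
true_value = 310_000       # f(x): true price of a 2,000 sqft house
avg_prediction = 291_000   # E[f_hat(x)] across many retrained models (made up)
variance = 33_300_000      # Var[f_hat(x)] across those models (made up)
noise = 35_000 ** 2        # sigma^2: irreducible error ($35k noise std)

bias_sq = (avg_prediction - true_value) ** 2
total_error = bias_sq + variance + noise
print(f"Bias^2 {bias_sq:,} + Variance {variance:,} + Noise {noise:,}"
      f" = Total {total_error:,}")
```

Note that the $19,000 systematic miss contributes $361M to squared error, dwarfing the variance term: for this hypothetical model, bias is the problem worth attacking first.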

[Figure: Error decomposition showing total prediction error split into bias squared, variance, and irreducible error, with fixes for each]

The "tradeoff" exists because bias and variance pull in opposite directions as you adjust model complexity:

| Action | Effect on Bias | Effect on Variance |
|---|---|---|
| Increase model complexity | Decreases | Increases |
| Add more features | Decreases | Increases |
| Increase regularization | Increases | Decreases |
| Gather more training data | No change | Decreases |

Underfitting: When the Model Is Too Simple

Underfitting (high bias) happens when your model lacks the capacity to capture the real patterns in the data. A straight linear regression line through a curved relationship is the textbook example. The model has already decided the world is linear and stubbornly refuses to learn otherwise.

Back to the archery analogy: giving Alex more arrows (more data) won't help. The arrows will keep landing left because the aiming technique is wrong. Collecting 10,000 more house sales won't fix a linear model that can't represent curvature.

Symptoms of underfitting

  • High training error
  • High validation error
  • Training and validation errors are close to each other (both bad)
  • The model is simpler than the data warrants

Common Pitfall: If your model performs poorly and your first instinct is "I need more data," stop. Check whether training error is also high. If it is, more data won't help. The model isn't capable of learning the pattern you already have. Fix the model first.
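You can watch this failure refuse to improve with data in a short NumPy sketch (the quadratic price function and its coefficients are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_train_rmse(n):
    """Fit a straight line to simulated quadratic house-price data."""
    sqft = rng.uniform(800, 4000, n)
    price = 50_000 + 40 * sqft + 0.02 * sqft**2 + rng.normal(0, 35_000, n)
    coefs = np.polyfit(sqft, price, deg=1)        # the degree-1 model
    resid = price - np.polyval(coefs, sqft)
    return float(np.sqrt(np.mean(resid**2)))

small, large = linear_train_rmse(150), linear_train_rmse(15_000)
print(f"train RMSE, 150 houses:    ${small:,.0f}")
print(f"train RMSE, 15,000 houses: ${large:,.0f}")
# A 100x bigger sample leaves the bias-driven error essentially unchanged:
# the straight line still cannot absorb the curvature.
```

The training RMSE barely moves between 150 and 15,000 houses because the residual curvature is structural, not statistical.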

Overfitting: When the Model Memorizes Noise

Overfitting (high variance) is the opposite failure mode. The model is so flexible that it fits not just the real pattern but also the random noise unique to your specific training set. A degree-15 polynomial will thread through every training point, capturing bumps and wiggles that are accidents of sampling rather than real relationships.

In the archery analogy: Sam adjusts aim for every tiny gust of wind instead of learning a consistent technique. Some arrows hit by chance, but there's no reliable pattern.

Symptoms of overfitting

  • Very low training error (sometimes near zero)
  • High validation or test error
  • Large gap between training error and validation error
  • Performance degrades on new data from the same distribution

Key Insight: Overfitting is harder to spot than underfitting because the training metrics look great. A model with 99% training accuracy and 65% test accuracy is overfitting badly, but if you only check training performance, you'll think you have a winner. Always validate with held-out data or cross-validation.

Seeing the Tradeoff in Code

Let's make this concrete with our house price example. We generate 150 houses where price depends on square footage quadratically (with noise), then fit polynomial regression models of increasing complexity.

```text
Polynomial Degree vs Prediction Error (House Prices)
=================================================================
Degree |     Train RMSE | CV RMSE (5-fold) | Gap
-----------------------------------------------------------------
     1 | $      38,564 | $        45,684 | $     7,120
     2 | $      35,783 | $        35,956 | $       172
     3 | $      35,738 | $        38,630 | $     2,892
     5 | $      35,624 | $       109,717 | $    74,093
     7 | $      35,293 | $       576,694 | $   541,402

Train RMSE keeps dropping as complexity grows.
CV RMSE hits a minimum at degree 2, then rises sharply.
The widening gap between train and CV is the signature of overfitting.
```
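A sweep like the one above can be reproduced in spirit with plain NumPy. This is a sketch: the data-generating coefficients, noise level, and fold splits are assumptions, so the exact dollar figures will differ from the table:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated market: price is quadratic in square footage, plus noise.
n = 150
sqft = rng.uniform(800, 4000, n)
price = 50_000 + 40 * sqft + 0.02 * sqft**2 + rng.normal(0, 35_000, n)
z = (sqft - sqft.mean()) / sqft.std()   # standardize for numerical stability

folds = np.array_split(rng.permutation(n), 5)   # one fixed 5-fold split

def fold_rmse(deg):
    """Mean train / CV RMSE for a degree-`deg` polynomial."""
    tr, cv = [], []
    for f in folds:
        mask = np.ones(n, dtype=bool)
        mask[f] = False                 # hold out this fold
        coefs = np.polyfit(z[mask], price[mask], deg)
        tr.append(np.sqrt(np.mean((price[mask] - np.polyval(coefs, z[mask])) ** 2)))
        cv.append(np.sqrt(np.mean((price[f] - np.polyval(coefs, z[f])) ** 2)))
    return np.mean(tr), np.mean(cv)

results = {deg: fold_rmse(deg) for deg in (1, 2, 3, 5, 7)}
for deg, (tr, cv) in results.items():
    print(f"degree {deg}: train ${tr:,.0f}  cv ${cv:,.0f}")
```

Whatever the exact numbers, the shape is the same: train RMSE falls monotonically with degree while CV RMSE bottoms out near the true degree and then climbs.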

Notice the classic pattern. Training RMSE decreases monotonically from $38,564 to $35,293 as the polynomial grows from degree 1 to degree 7. But cross-validated RMSE tells the real story: it bottoms out at degree 2 ($35,956), then explodes to $576,694 at degree 7. That gap between train and CV error is exactly what the bias-variance tradeoff predicts.

[Figure: Model complexity curve showing underfitting zone, sweet spot, and overfitting zone along the polynomial degree spectrum]

Measuring Bias and Variance Empirically

You can't compute bias directly on real data because you don't know the true function $f(x)$. But with synthetic data, we can simulate the decomposition by training models on many different random samples from the same distribution, then measuring how the predictions vary.

```text
True price for 2000 sqft house: $310,000
Noise standard deviation: $35,000

Degree |           Bias^2 |         Variance |            Noise |            Total
------------------------------------------------------------------------------
     1 | $        363.3M | $         21.6M | $       1225.0M | $       1609.9M
     2 | $          0.2M | $         33.3M | $       1225.0M | $       1258.5M
     5 | $          0.4M | $         64.6M | $       1225.0M | $       1289.9M

As degree increases: bias drops, variance rises, noise stays fixed.
Total error = Bias^2 + Variance + Irreducible Noise
```

The numbers tell the story clearly. Degree 1 has massive bias ($363.3M) because a straight line systematically misses the quadratic curve. Degree 2 collapses that bias to nearly zero ($0.2M) and adds only modest variance ($33.3M). Degree 5 keeps bias low but nearly doubles variance to $64.6M. The irreducible noise ($1,225M) stays identical across all three because no model can reduce it.

Pro Tip: This simulation technique is powerful for research and teaching, but it requires knowing the true function. In practice, you diagnose bias vs variance through learning curves and cross-validation gaps instead.
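A minimal version of that simulation might look like the sketch below. The "true" price function, noise level, and sample sizes are invented for illustration, so it won't reproduce the table's exact numbers, but the pattern (bias collapses, variance grows) should hold:

```python
import numpy as np

rng = np.random.default_rng(7)
f = lambda s: 50_000 + 40 * s + 0.02 * s**2   # assumed "true" price function
x0, sigma, n = 2000.0, 35_000, 150            # eval point, noise std, sample size

def bias_variance_at_x0(deg, n_trials=300):
    """Retrain on fresh samples each trial; decompose the error at x0."""
    preds = []
    for _ in range(n_trials):
        sqft = rng.uniform(800, 4000, n)
        price = f(sqft) + rng.normal(0, sigma, n)
        mu, sd = sqft.mean(), sqft.std()       # standardize for stable fits
        coefs = np.polyfit((sqft - mu) / sd, price, deg)
        preds.append(np.polyval(coefs, (x0 - mu) / sd))
    preds = np.asarray(preds)
    return (preds.mean() - f(x0)) ** 2, preds.var()

res = {deg: bias_variance_at_x0(deg) for deg in (1, 2, 5)}
for deg, (bias_sq, var) in res.items():
    print(f"degree {deg}: bias^2 ${bias_sq/1e6:7.1f}M  variance ${var/1e6:7.1f}M")
```

The key trick is that bias and variance are properties of the *distribution* of trained models, which is why the loop draws a fresh training set every trial.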

Finding the Sweet Spot with Cross-Validation

In the real world, you don't know the true function. Cross-validation serves as your substitute: it estimates generalization error by repeatedly holding out different slices of the training data.

```text
Model Complexity Sweep: Finding the Optimal Polynomial Degree
==================================================
  Degree 1: Train $    38,564  |  CV $    45,684
  Degree 2: Train $    35,783  |  CV $    35,956 <-- best
  Degree 3: Train $    35,738  |  CV $    38,630
  Degree 4: Train $    35,738  |  CV $    45,741
  Degree 5: Train $    35,624  |  CV $   109,717
  Degree 6: Train $    35,606  |  CV $   261,688
  Degree 7: Train $    35,293  |  CV $   576,694

Optimal complexity: degree 2 (CV RMSE: $35,956)
The true function is quadratic (degree 2), and cross-validation found it.
```

Cross-validation correctly identifies degree 2 as optimal without knowing the true function. This is the practical workflow: sweep over complexity settings, track CV error, and pick the model where it's minimized. For hyperparameter tuning at scale, tools like GridSearchCV or Optuna automate this sweep across multiple parameters simultaneously.
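With scikit-learn, the whole sweep collapses into a grid search over the polynomial degree. This is a sketch on simulated data; the generating coefficients are assumptions carried over from the running example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
sqft = rng.uniform(800, 4000, (150, 1))
price = (50_000 + 40 * sqft + 0.02 * sqft**2).ravel() + rng.normal(0, 35_000, 150)

pipe = Pipeline([
    ("scale", StandardScaler()),     # keep high powers numerically sane
    ("poly", PolynomialFeatures()),
    ("reg", LinearRegression()),
])
search = GridSearchCV(
    pipe,
    param_grid={"poly__degree": range(1, 8)},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(sqft, price)
print("best degree:", search.best_params_["poly__degree"])
print(f"best CV RMSE: ${-search.best_score_:,.0f}")
```

Wrapping the degree in a Pipeline step means the grid search tunes model complexity with exactly the same machinery it would use for any other hyperparameter.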

Diagnosing Bias vs Variance in Practice

On real datasets, you can't run the simulation above. Instead, you compare training error against validation error and look at the gap between them. Learning curves formalize this diagnostic by plotting error as a function of training set size.

| Symptom | Diagnosis | What the Learning Curve Shows |
|---|---|---|
| High training error, high validation error, small gap | High bias (underfitting) | Both curves flatten at a high error rate. More data won't help. |
| Low training error, high validation error, large gap | High variance (overfitting) | Training curve stays low while validation curve stays high. More data may help. |
| Low training error, low validation error, small gap | Good fit | Both curves converge at a low error rate. Model generalizes well. |

[Figure: Decision flowchart for diagnosing and fixing high bias vs high variance, with specific remedies for each]
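scikit-learn's `learning_curve` makes the high-bias diagnosis easy to check in code. Here's a sketch on simulated data (same assumed quadratic ground truth as before) that deliberately uses an underpowered degree-1 model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1)
sqft = rng.uniform(800, 4000, (600, 1))
price = (50_000 + 40 * sqft + 0.02 * sqft**2).ravel() + rng.normal(0, 35_000, 600)

# Degree-1 model on curved data: both curves should flatten at a HIGH
# error level with a small gap -- the high-bias signature.
model = make_pipeline(StandardScaler(), PolynomialFeatures(1), LinearRegression())
sizes, train_scores, val_scores = learning_curve(
    model, sqft, price, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="neg_root_mean_squared_error",
)
train_rmse = -train_scores.mean(axis=1)
val_rmse = -val_scores.mean(axis=1)
for n_, tr, va in zip(sizes, train_rmse, val_rmse):
    print(f"n={n_:4d}  train ${tr:,.0f}  val ${va:,.0f}")
```

A plot of `train_rmse` and `val_rmse` against `sizes` is the learning curve itself; here the converging-but-high pattern is already visible in the printed numbers.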

Fixing High Bias vs High Variance

Once you've diagnosed the problem, the fix depends on which side of the tradeoff you're stuck on. These are not interchangeable. Applying a variance fix to a bias problem (or vice versa) will make things worse.

Fixing high bias (underfitting)

| Fix | How It Helps | Example |
|---|---|---|
| Increase model complexity | Gives the model capacity to learn curved relationships | Switch from a degree-1 to a degree-2 polynomial |
| Add features | Provides new information the model was missing | Add lot size, bedrooms, neighborhood to the house price model |
| Use feature engineering | Creates interaction or polynomial terms | Add sqft^2, sqft x bedrooms |
| Decrease regularization | Allows the model to fit more freely | Reduce Ridge alpha from 100 to 0.01 |
| Switch to a more expressive model | Non-linear models can capture complex patterns | Move from linear regression to random forests |

Fixing high variance (overfitting)

| Fix | How It Helps | Example |
|---|---|---|
| Get more training data | Forces the model to generalize instead of memorize | Expand from 150 to 5,000 house sales |
| Increase regularization | Penalizes complexity, shrinks coefficients | Increase Ridge alpha from 0.01 to 10 |
| Reduce model complexity | Fewer parameters means less room to memorize noise | Drop from a degree-7 to a degree-2 polynomial |
| Feature selection | Removes noisy or irrelevant features | Drop "seller's favorite color" from the dataset |
| Use ensemble methods | Bagging averages many high-variance models to reduce total variance | Random forests bag many deep trees |
| Early stopping | Halts training before the model starts fitting noise | Stop gradient boosting at 100 trees instead of 10,000 |

Pro Tip: If you have high bias, gathering more data is a waste of time and money. A straight line fit to 150 houses and a straight line fit to 15,000 houses will both miss the curve. Fix the model's capacity first, then worry about data volume.

The Modern Plot Twist: Double Descent

Everything above describes the classical regime with its clean U-shaped test error curve. But starting around 2019, researchers documented something unexpected: when you push model complexity far beyond the interpolation threshold (where the model perfectly fits every training point), test error can actually decrease again.

This is the double descent phenomenon, described formally by Belkin et al. (2019) in "Reconciling Modern Machine-Learning Practice and the Bias-Variance Trade-off" and further characterized by Nakkiran et al. (2021) in "Deep Double Descent". The test error curve has three phases instead of two:

  1. Classical regime: error decreases as complexity grows (reducing bias)
  2. Interpolation peak: error spikes right at the point where the model has just enough parameters to fit the training data exactly
  3. Overparameterized regime: error decreases again as the model gains far more parameters than data points

Related phenomena include benign overfitting (Bartlett et al., 2020), where massively overparameterized models interpolate noisy training data yet still generalize well, and grokking (Power et al., 2022), where models suddenly learn to generalize long after memorizing the training set.

Key Insight: Double descent does not invalidate the classical bias-variance tradeoff. It extends it. For 95% of practical work with tabular data, XGBoost, and traditional ML models, the classical U-shaped curve is exactly what you'll observe. Double descent primarily shows up in deep neural networks and very high-dimensional kernel methods. Understanding the classical tradeoff remains essential. It's the foundation that double descent builds on.

When to Think About Bias-Variance (and When Not To)

The bias-variance framework is most useful in specific situations.

Think about it when:

  • Your model performs well on training data but poorly on validation data (variance)
  • Your model performs poorly on both training and validation data (bias)
  • You're deciding between a simple model and a complex one
  • You're choosing regularization strength
  • You're deciding whether to collect more data or engineer better features

Don't overthink it when:

  • You're using a well-tuned ensemble method like XGBoost with built-in regularization
  • You're fine-tuning a large pretrained model (different optimization dynamics apply)
  • Your data has severe quality issues like missing values or label errors (fix data quality first)

Conclusion

The bias-variance tradeoff is the diagnostic compass for every model that underperforms. High training error points to underfitting; a widening gap between training and validation error signals overfitting. Once you've identified the problem, the fix is directional: bias problems need more model capacity, variance problems need more constraint.

For practical diagnosis on real projects, learning curves are your best tool. They plot training and validation error as a function of sample size and show you exactly which regime your model lives in. And when you've identified high variance, regularization is often the fastest fix, adding a penalty term that trades a small increase in bias for a large reduction in variance.

Every time your model disappoints, resist the urge to swap algorithms randomly. Ask one question first: am I underfitting or overfitting? That answer determines everything that comes next.

Frequently Asked Interview Questions

Q: Your model has 95% training accuracy but only 60% test accuracy. What's happening, and how do you fix it?

The 35-point gap is a textbook sign of overfitting (high variance). The model has memorized training-specific patterns rather than learning generalizable rules. Fixes include adding regularization, collecting more data, reducing model complexity, or applying dropout (for neural networks).

Q: You increase your training set from 1,000 to 100,000 samples, but validation error barely improves. What does this tell you?

More data didn't help, which means the model is likely underfitting (high bias). Its capacity is too limited to capture the true pattern regardless of data volume. Increasing model complexity, adding features, or switching to a more expressive algorithm would be more productive.

Q: Explain the difference between bias and variance in one sentence each.

Bias measures how far the model's average prediction is from the true value, reflecting systematic errors from simplifying assumptions. Variance measures how much the predictions change when trained on different subsets of the data, reflecting the model's sensitivity to the specific training sample.

Q: How does bagging reduce variance without increasing bias?

Bagging (bootstrap aggregating) trains multiple copies of a high-variance model on different random subsets of the training data, then averages their predictions. Averaging independent noisy estimates reduces variance by a factor proportional to the number of models. Each individual model retains its low bias because it's still complex, but the noise in individual predictions cancels out across the ensemble. Random forests apply this principle to decision trees.
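The averaging arithmetic is easy to verify in the idealized case of independent, unbiased estimators (real bagged trees are correlated, so the reduction in practice is smaller than this best case):

```python
import numpy as np

rng = np.random.default_rng(5)

# Each "model" is an unbiased but noisy estimate of the same true value.
true_value = 310_000
n_models, n_trials = 50, 2000
single = true_value + rng.normal(0, 40_000, n_trials)
bagged = true_value + rng.normal(0, 40_000, (n_trials, n_models)).mean(axis=1)

print(f"single-model std:     {single.std():,.0f}")
print(f"50-model average std: {bagged.std():,.0f}  (~ 40,000 / sqrt(50))")
```

Both estimators stay centered on the true value (no added bias), while the averaged one shrinks the spread by roughly $\sqrt{50}$.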

Q: Can you have both high bias and high variance at the same time?

Yes, though it's less common. A model could have high bias in one region of the input space and high variance in another. This typically happens with a model too simple for the overall trend that also fits noise in sparse regions. A poorly tuned k-nearest neighbors with k too large for local patterns (bias) but sensitive to outliers (variance) can exhibit this.

Q: Why doesn't collecting more data reduce bias?

Bias comes from the model's structural assumptions, not from the amount of data. A linear model assumes a straight-line relationship. Whether you fit it to 100 points or 1 million points, it will still predict a straight line. The only way to reduce bias is to change the model itself, either by increasing its complexity, adding better features, or reducing regularization that constrains it.

Q: What is double descent, and does it mean the bias-variance tradeoff is wrong?

Double descent describes a phenomenon where test error decreases, then increases, then decreases again as model complexity grows past the interpolation threshold. It doesn't invalidate the classical tradeoff; it extends the picture to the overparameterized regime where models have far more parameters than training samples. In practice, double descent mainly appears in deep learning and kernel methods. For tabular ML and classical models, the standard U-shaped tradeoff curve remains the right mental model.

Q: You're choosing between a simple logistic regression and a complex gradient-boosted tree for a tabular dataset with 500 samples and 50 features. How does bias-variance inform your choice?

With only 500 samples and 50 features, overfitting risk is high. Gradient-boosted trees have many tunable parameters and can easily memorize a small dataset. Logistic regression has higher bias but far lower variance, which makes it a safer default. If you do choose gradient boosting, aggressive regularization (low learning rate, limited max depth, high min samples per leaf) is essential. Cross-validation should guide the final decision.

Hands-On Practice

The bias-variance tradeoff is one of the most fundamental concepts in machine learning. You'll visualize this tradeoff using polynomial regression on real data. By fitting models of increasing complexity, you will see exactly how underfitting (high bias) and overfitting (high variance) manifest in practice, and learn to identify the sweet spot that balances both.

Building Intuition from First Principles

Rather than just reading about bias and variance, we implement polynomial models of varying complexity to see the tradeoff in action. This hands-on approach reveals why simple models underfit, complex models overfit, and how learning curves help diagnose these issues.

Dataset: 120 temperature vs. efficiency records showing a non-linear relationship, well suited to polynomial regression and to demonstrating the bias-variance tradeoff.

In this exercise you visualize the bias-variance tradeoff with polynomial regression: a degree-1 model underfits (high bias), a degree-15 model overfits (high variance), and a degree-3 model achieves the right balance, with learning curves providing diagnostic insight into each model's behavior. Try experimenting with different Ridge regularization values (alpha) to see how regularization can help control overfitting in complex models.

Explore all career paths