Probability Calibration: Why High Accuracy Doesn't Mean You Can Trust Your Model

LDS Team
Let's Data Science
10 min read

Imagine a doctor using an AI diagnostic tool. The model analyzes a patient's scan and predicts: "Positive for Disease X (Confidence: 90%)."

Based on that 90% confidence, the doctor orders an invasive surgery. But what if that model—despite having 98% overall accuracy—is "overconfident"? What if, historically, when the model says "90%," the patient actually has the disease only 60% of the time?

In that scenario, the surgery might be a mistake. The model is accurate (it usually gets the class right), but it is not calibrated (its probability scores don't reflect reality).

In machine learning, we often obsess over accuracy, AUC, and F1 scores, assuming that a better score means a better model. But in risk-sensitive fields—finance, healthcare, fraud detection—knowing how sure the model is matters just as much as the prediction itself.

This guide explores probability calibration: the missing link between a model's raw output and trustworthy, actionable probabilities.

What is probability calibration?

Probability calibration is the process of adjusting a model's output scores so that they reflect the true likelihood of the event occurring. If a calibrated model predicts a 70% probability for 100 different instances, the positive class should actually occur for roughly 70 of those instances.

In many machine learning workflows, we treat the output of .predict_proba() as ground truth. We assume that a 0.8 output means an 80% chance. However, raw outputs from complex models (like Gradient Boosting or Neural Networks) are often just "ranking scores" rather than true probabilities.

Calibration fixes this alignment.

💡 Pro Tip: A perfectly calibrated model is not necessarily an accurate one. A model that predicts the global average (e.g., 0.5 for a balanced dataset) for every single instance is "perfectly calibrated" but useless. You need both discrimination (accuracy) and calibration.

Why do high-accuracy models often have poor calibration?

Modern machine learning algorithms often sacrifice probability calibration to maximize classification accuracy or minimize loss functions that don't prioritize probability alignment. Different algorithms distort probabilities in unique ways based on their mathematical foundations.

Let's look at why specific high-performance models fail the "trust" test:

  1. Naive Bayes: This algorithm assumes all features are independent (which they rarely are). This assumption pushes predicted probabilities toward the extremes (0 and 1). A Naive Bayes model might output 0.99 for a slightly positive case, making it incredibly overconfident.
  2. Random Forests: Because Random Forest averages the predictions of many trees, it rarely outputs 0 or 1. The variance reduction pulls probabilities toward the center (0.5), making the model under-confident at the extremes.
  3. Support Vector Machines (SVMs): Support Vector Machines don't naturally output probabilities at all—they calculate distances to a hyperplane. We have to force-convert these distances to probabilities, which often results in poor calibration without tuning.
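
To see these distortions for yourself, a quick experiment is to histogram the raw predict_proba outputs of an overconfident and an under-confident model on the same data. Below is a minimal sketch using a synthetic dataset; the models and dataset parameters are illustrative choices, not part of the tutorial later in this guide.

python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, n_features=20, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Compare how each model spreads its raw scores
for name, model in [("Naive Bayes", GaussianNB()),
                    ("Random Forest", RandomForestClassifier(random_state=0))]:
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)[:, 1]
    plt.hist(probs, bins=20, alpha=0.5, label=name)

plt.xlabel("Predicted probability of the positive class")
plt.ylabel("Count")
plt.legend()
plt.show()
# Naive Bayes typically piles scores near 0 and 1 (overconfident);
# Random Forest keeps scores away from the extremes (under-confident).
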

The "S-Curve" of Distortion

If you plot the predicted probability against the actual fraction of positives:

  • Overconfident models look like an inverted 'S'.
  • Underconfident models look like a standard 'S'.
  • Calibrated models look like a diagonal line (y = x).

How do we measure calibration errors?

To fix calibration, we first need to visualize and quantify how far our model's confidence strays from reality. We use three primary tools: Reliability Diagrams, the Brier Score, and Expected Calibration Error.

1. Reliability Diagrams (Calibration Curves)

A reliability diagram bins the predictions (e.g., 0-10%, 10-20%, ...) and calculates the actual fraction of positives in each bin.

  • X-axis: Mean predicted probability in the bin.
  • Y-axis: Fraction of actual positives in that bin.
  • Perfect Calibration: The points fall exactly on the diagonal line.

If the curve dips below the diagonal, the model is overconfident (predicting higher probability than reality). If it floats above, the model is underconfident.
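
Under the hood, a reliability diagram is just binning and averaging. The sketch below hand-rolls the computation that sklearn.calibration.calibration_curve performs with uniform bins, using a few illustrative predictions:

python
import numpy as np

def reliability_table(y_true, y_prob, n_bins=10):
    """Bin predictions and compare the mean predicted probability
    with the observed fraction of positives in each bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])  # assign each prediction to a bin
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.sum() == 0:
            continue  # skip empty bins
        rows.append((y_prob[mask].mean(),  # x: mean predicted probability
                     y_true[mask].mean(),  # y: fraction of actual positives
                     int(mask.sum())))     # bin size
    return rows

# Toy predictions (illustrative values, not real model output)
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
y_prob = np.array([0.10, 0.20, 0.80, 0.70, 0.90, 0.40, 0.60, 0.95])
for mean_pred, frac_pos, count in reliability_table(y_true, y_prob, n_bins=4):
    print(f"predicted={mean_pred:.2f}  observed={frac_pos:.2f}  n={count}")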

2. The Brier Score

The Brier Score is essentially the Mean Squared Error (MSE) applied to probability predictions. It measures both calibration and refinement (sharpness).

BS = \frac{1}{N} \sum_{i=1}^{N} (f_i - o_i)^2

Where:

  • f_i is the forecasted probability for instance i.
  • o_i is the actual outcome (0 or 1).
  • N is the number of instances.

In Plain English: The Brier Score calculates the squared distance between your "confidence" and the "truth." If you predict 0.9 and the result is 1, the error is minimal: (0.9 - 1)^2 = 0.01. If you predict 0.9 and the result is 0, the error is massive: (0.9 - 0)^2 = 0.81. Lower Brier Scores are better.
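
As a quick sanity check, the formula is a one-liner in NumPy, and scikit-learn's brier_score_loss returns the same value. The arrays below are toy values for illustration:

python
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([1, 0, 1, 1, 0])             # actual outcomes
y_prob = np.array([0.9, 0.9, 0.8, 0.6, 0.2])   # forecasted probabilities

manual_brier = np.mean((y_prob - y_true) ** 2)  # direct implementation of the formula
sklearn_brier = brier_score_loss(y_true, y_prob)

print(manual_brier, sklearn_brier)  # both ≈ 0.212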

3. Expected Calibration Error (ECE)

While Brier Score mixes accuracy and calibration, ECE isolates the calibration error. It effectively measures the weighted average gap between the reliability diagram bins and the diagonal line.

\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right|

In Plain English: ECE asks, "On average, how far off is the model's confidence?" If the model says 80% confident, but is only 70% accurate in that bucket, the gap is 10%. ECE averages these gaps across all data points. If ECE is 0.05, your model's probability estimates are, on average, 5% off from reality.
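
scikit-learn does not ship an ECE metric, so below is a minimal sketch of the binary-class variant, which compares the predicted positive-class probability with the observed positive rate in each bin (the same quantities a reliability diagram plots). Treat it as an illustration of the formula rather than a reference implementation:

python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average gap between confidence and observed outcome rate,
    computed over equal-width probability bins (binary positive-class variant)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.sum() == 0:
            continue
        conf = y_prob[mask].mean()   # average predicted probability in the bin
        obs = y_true[mask].mean()    # observed fraction of positives
        ece += (mask.sum() / len(y_prob)) * abs(obs - conf)
    return ece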

How does Platt Scaling fix calibration?

Platt Scaling is a parametric method that assumes the distortion in your probabilities follows a sigmoid shape. It essentially trains a Logistic Regression model on the output of your original classifier.

Mathematically, Platt Scaling learns two parameters, A and B, to transform the raw output f(x):

P(y = 1 \mid f(x)) = \frac{1}{1 + \exp(A f(x) + B)}

In Plain English: Think of Platt Scaling as a translator. Your SVM speaks a language called "Distance to Hyperplane," which humans don't understand. Platt Scaling listens to the SVM and translates it into "Probability," fitting a smooth S-curve to map high distances to high probabilities and low distances to low probabilities.
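
Conceptually, this amounts to fitting a one-feature logistic regression on the classifier's raw scores from a held-out calibration set. The sketch below shows the idea; in practice you would reach for CalibratedClassifierCV(method='sigmoid'), used later in this guide, which handles this for you:

python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_scale(scores_calib, y_calib, scores_new):
    """Map raw classifier scores to probabilities with a 1-D logistic regression.
    The fitted weight and intercept play the role of A and B in the formula above."""
    lr = LogisticRegression()
    lr.fit(np.asarray(scores_calib).reshape(-1, 1), y_calib)
    return lr.predict_proba(np.asarray(scores_new).reshape(-1, 1))[:, 1]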

When to use Platt Scaling:

  • You have a small calibration dataset.
  • The distortion looks "S-shaped" (sigmoid).
  • You are calibrating SVMs or Naive Bayes models.

How does Isotonic Regression differ from Platt Scaling?

Isotonic Regression is a non-parametric approach. Instead of assuming a sigmoid shape, it fits a free-form, non-decreasing (monotonic) function that maps raw scores to probabilities, producing a stepwise curve.

Because Isotonic Regression doesn't assume a specific shape (like the sigmoid in Platt Scaling), it can correct complex distortions that aren't S-shaped.
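
The underlying estimator in scikit-learn is sklearn.isotonic.IsotonicRegression, fit directly on raw scores from a held-out calibration set. Here is a minimal sketch of that idea (CalibratedClassifierCV(method='isotonic'), used later, wraps the same machinery):

python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def isotonic_calibrate(scores_calib, y_calib, scores_new):
    """Learn a monotonic, stepwise mapping from raw scores to probabilities."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(np.asarray(scores_calib), y_calib)  # fit on the calibration set
    return iso.predict(np.asarray(scores_new))  # calibrated probabilities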

When to use Isotonic Regression:

  • You have a large calibration dataset (1,000+ samples) to prevent overfitting.
  • The distortion is irregular (not a simple S-shape).
  • You are calibrating Random Forest or Gradient Boosting models.

⚠️ Common Pitfall: Isotonic Regression is powerful but prone to overfitting on small datasets. If you only have a few hundred validation points, the isotonic model might memorize noise. In those cases, stick to Platt Scaling.

Practical Application: Calibrating a Support Vector Machine

Let's verify these concepts with Python. We will generate a classification dataset, train an uncalibrated SVM, and then calibrate it using both Platt Scaling (Sigmoid) and Isotonic Regression.

1. Setup and Data Generation

python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.model_selection import train_test_split
from sklearn.metrics import brier_score_loss

# Create a synthetic dataset
X, y = make_classification(
    n_samples=10000, n_features=20, n_informative=2, 
    n_redundant=2, weights=[0.9, 0.1], random_state=42
)

# Split into train and calibration/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

2. Training the Uncalibrated Model

SVMs (specifically LinearSVC) do not output probabilities by default. We use the decision function (distance to margin) and normalize it, or use CalibratedClassifierCV to handle it for us.

python
# Train a Linear SVM (uncalibrated)
svm = LinearSVC(max_iter=10000, random_state=42, dual='auto')
svm.fit(X_train, y_train)

# SVM outputs "decision function" (distance), not probability
# We normalize roughly to 0-1 for visualization comparison (naive approach)
decision_scores = svm.decision_function(X_test)
prob_uncalibrated = (decision_scores - decision_scores.min()) / (decision_scores.max() - decision_scores.min())

3. Applying Calibration

We use sklearn's CalibratedClassifierCV. Note that we set cv='prefit' because we have already trained the base model. If you haven't trained the model yet, you can use cross-validation (cv=5) to train and calibrate simultaneously.

python
# 1. Platt Scaling (Sigmoid)
calibrated_sigmoid = CalibratedClassifierCV(svm, method='sigmoid', cv='prefit')
calibrated_sigmoid.fit(X_test, y_test)  # In practice, calibrate on a separate validation split, not the final test set
prob_sigmoid = calibrated_sigmoid.predict_proba(X_test)[:, 1]

# 2. Isotonic Regression
calibrated_isotonic = CalibratedClassifierCV(svm, method='isotonic', cv='prefit')
calibrated_isotonic.fit(X_test, y_test)
prob_isotonic = calibrated_isotonic.predict_proba(X_test)[:, 1]
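
If you would rather not manage a prefit model and a separate calibration split, the cross-validated route mentioned above trains and calibrates in one call. Here is a sketch of that alternative on the same training data (it refits the SVM internally on each fold):

python
# Alternative: train and calibrate together via internal cross-validation
calibrated_cv = CalibratedClassifierCV(
    LinearSVC(max_iter=10000, random_state=42, dual='auto'),
    method='sigmoid',
    cv=5,
)
calibrated_cv.fit(X_train, y_train)                  # fits and calibrates across 5 folds
prob_cv = calibrated_cv.predict_proba(X_test)[:, 1]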

4. Evaluating Results

python
# Calculate Brier Scores (Lower is better)
brier_raw = brier_score_loss(y_test, prob_uncalibrated)
brier_sig = brier_score_loss(y_test, prob_sigmoid)
brier_iso = brier_score_loss(y_test, prob_isotonic)

print(f"Brier Score (Uncalibrated - Naive): {brier_raw:.4f}")
print(f"Brier Score (Platt Scaling):        {brier_sig:.4f}")
print(f"Brier Score (Isotonic):             {brier_iso:.4f}")

Expected Output:

text
Brier Score (Uncalibrated - Naive): 0.0832 (example)
Brier Score (Platt Scaling):        0.0510 (lower is better)
Brier Score (Isotonic):             0.0498 (lowest)

The calibrated scores will significantly outperform the raw normalization. The Isotonic method often edges out Platt Scaling on larger datasets because it fits the specific irregularities of the SVM's distortion.

Visualization: Plotting the Reliability Curve

Numbers are great, but the curve tells the story.

python
plt.figure(figsize=(10, 6))

# Plot perfectly calibrated line
plt.plot([0, 1], [0, 1], "k:", label="Perfectly Calibrated")

# Calculate curves
fraction_of_positives, mean_predicted_value = calibration_curve(y_test, prob_uncalibrated, n_bins=10)
plt.plot(mean_predicted_value, fraction_of_positives, "s-", label="Uncalibrated")

fraction_of_positives, mean_predicted_value = calibration_curve(y_test, prob_sigmoid, n_bins=10)
plt.plot(mean_predicted_value, fraction_of_positives, "s-", label="Platt Scaling")

fraction_of_positives, mean_predicted_value = calibration_curve(y_test, prob_isotonic, n_bins=10)
plt.plot(mean_predicted_value, fraction_of_positives, "s-", label="Isotonic")

plt.ylabel("Fraction of positives")
plt.xlabel("Mean predicted probability")
plt.title("Reliability Diagram")
plt.legend()
plt.show()

In the resulting plot, you will see the Uncalibrated line deviating wildly (likely an S-shape or inverted S-shape), while the Platt and Isotonic lines hug the diagonal dotted line much more closely. This visual confirms that when the calibrated model says "20% risk," the actual risk is indeed very close to 20%.

Comparison: Platt Scaling vs. Isotonic Regression

| Feature | Platt Scaling (Sigmoid) | Isotonic Regression |
| --- | --- | --- |
| Model Type | Parametric (logistic) | Non-parametric (stepwise) |
| Assumption | Sigmoid (S-shape) distortion | Monotonic (non-decreasing) |
| Data Needed | Low (small datasets OK) | High (>1,000 samples recommended) |
| Overfitting Risk | Low | High (on small data) |
| Best For | SVMs, Naive Bayes | Random Forest, Gradient Boosting |

Conclusion

Accuracy tells you if the model is right. Calibration tells you when to trust it.

In real-world applications—like automated loan approvals or medical triage—an uncalibrated probability is dangerous. If you set a threshold of 0.8 for approving a loan, but your model's 0.8 actually corresponds to a 0.5 probability of repayment, you are taking on massive hidden risk.

By using tools like Reliability Diagrams and applying Platt Scaling or Isotonic Regression, you transform raw, messy scores into interpretable, actionable probabilities.

To deepen your understanding of how different models generate these raw scores in the first place, check out our guides on Logistic Regression (which is naturally calibrated) and Gradient Boosting (which often requires calibration).


Hands-On Practice

Probability calibration is essential when you need to trust your model's confidence scores, not just its predictions. In this tutorial, you will train a classifier on passenger survival data and then calibrate its probability outputs using both Platt Scaling (sigmoid) and Isotonic Regression. By visualizing reliability diagrams, you will see firsthand how calibration transforms unreliable confidence scores into trustworthy probability estimates.

Dataset: Passenger Survival (Binary). Titanic-style survival prediction with clear class patterns: women and first-class passengers have higher survival rates. Expected accuracy ≈ 78-85% depending on the model.

Try It Yourself

Binary Classification: 800 passenger records (Titanic-style)

Experiment by changing the base model from RandomForestClassifier to a GaussianNB (which tends to be overconfident) or LogisticRegression (which is naturally well-calibrated). Observe how different classifiers have different calibration curves before and after applying Platt Scaling. You can also try adjusting the number of bins in the calibration_curve function to see how granularity affects the reliability diagram.