Why 99% Accuracy Can Be a Disaster: The Ultimate Guide to ML Metrics

LDS Team, Let's Data Science

A fraud detection model scores 99.5% accuracy. The team celebrates, deploys to production, and watches **$2.3 million** vanish over the next quarter. The model learned to predict "legitimate" for every transaction. It was right 99.5% of the time because only 0.5% of transactions were fraudulent, but it caught exactly zero fraud.

This is the accuracy trap, and it bites teams more often than most are willing to admit. ML metrics exist to tell you whether your model actually solves the problem it was built for. Accuracy answers the wrong question when classes are imbalanced. Precision and recall answer different, more useful questions. F1 balances the two. AUC tells you whether the model's probability rankings are any good at all. The right metric depends on what mistakes cost in your specific domain.

Every formula, code block, and diagram in this article references the same scenario: detecting fraudulent transactions in a dataset where fraud is rare. By the end, you will know exactly which metric to report and why.

The Confusion Matrix

The confusion matrix is a 2x2 table that breaks every prediction your classifier makes into one of four categories. A long-standing staple of accuracy assessment (surveyed in Stehman's 1997 review of the field), it remains the foundation from which precision, recall, F1, and virtually every other classification metric is calculated.

For our fraud detection example, "Positive" means the model flags a transaction as fraud, and "Negative" means it labels it as legitimate.

| | Predicted Legitimate | Predicted Fraud |
| --- | --- | --- |
| **Actually Legitimate** | True Negative (TN): correctly cleared | False Positive (FP): flagged a clean transaction (Type I Error) |
| **Actually Fraud** | False Negative (FN): missed real fraud (Type II Error) | True Positive (TP): caught the fraud |

Pro Tip: Read the labels this way. "True/False" tells you whether the prediction was correct. "Positive/Negative" tells you what the model predicted. A False Positive is a prediction of Positive that turned out to be False.

[Figure: Confusion matrix anatomy showing the four quadrants with fraud detection labels]
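Here is a minimal sketch (assuming scikit-learn is available) of computing the four quadrants on a handful of toy labels:

```python
# Build a 2x2 confusion matrix with scikit-learn.
# Toy labels: 1 = fraud, 0 = legitimate.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]  # actual labels
y_pred = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]  # model predictions

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
```

Reading the result against the table above: the one clean transaction flagged at index 2 is the False Positive, and the one missed fraud at index 7 is the False Negative.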

Type I and Type II Errors

Not all mistakes carry the same weight. A Type I Error (False Positive) means your fraud system freezes a customer's card during a legitimate purchase. Annoying, but recoverable. A Type II Error (False Negative) means a thief drains the account while your model looks the other way. The business eats the loss.

Which error costs more depends entirely on the domain. Cancer screening cannot afford to miss sick patients (Type II). An email spam filter cannot afford to delete important messages (Type I). This asymmetry drives every metric choice that follows.

The Accuracy Trap

Accuracy is the ratio of correct predictions to total predictions. It treats every correct answer equally, regardless of class, which makes it misleading whenever one class dominates the dataset.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Where:

  • TP is the count of true positives (correctly flagged fraud)
  • TN is the count of true negatives (correctly cleared legitimate transactions)
  • FP is the count of false positives (legitimate transactions wrongly flagged)
  • FN is the count of false negatives (fraud that slipped through)

In Plain English: Accuracy asks "out of all transactions we reviewed, how many did we label correctly?" It counts a missed fraud case and a correctly cleared legitimate transaction as equally valuable. In our fraud dataset, clearing legitimate transactions is trivially easy because there are so many of them.

Consider 1,000 transactions where 950 are legitimate and 50 are fraudulent. A model that always predicts "legitimate" gets 950 right and 50 wrong: 95% accuracy. It catches zero fraud. The metric looks good. The model is useless.
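The scenario can be reproduced in a few lines of scikit-learn (a sketch; the 950/50 split matches the example above, and `zero_division=0` keeps the undefined precision from raising a warning):

```python
# The lazy model: 950 legitimate (0), 50 fraud (1),
# and predictions that say "legitimate" for everything.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [0] * 950 + [1] * 50
y_pred = [0] * 1000  # always predicts legitimate

print("=== Lazy Model (always predicts legitimate) ===")
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred, zero_division=0):.4f}")
print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
print(f"F1 Score:  {f1_score(y_true, y_pred, zero_division=0):.4f}")

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"\nConfusion Matrix:\n  TN={tn}  FP={fp}\n  FN={fn}   TP={tp}")
```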

Expected Output:

```text
=== Lazy Model (always predicts legitimate) ===
Accuracy:  0.9500
Precision: 0.0000
Recall:    0.0000
F1 Score:  0.0000

Confusion Matrix:
  TN=950  FP=0
  FN=50   TP=0
```

95% accuracy, zero fraud caught.

The moment you see class imbalance worse than 80/20, stop using accuracy as your primary metric. It will lie to you. The alternative metrics below exist specifically because accuracy fails in these situations, which happen to be the situations where correct classification matters most.

Precision: Quality of Positive Predictions

Precision measures how trustworthy your model's positive predictions are. When the model flags a transaction as fraud, precision tells you the probability that it really is fraud.

$$\text{Precision} = \frac{TP}{TP + FP}$$

Where:

  • TP is the number of transactions correctly flagged as fraud
  • FP is the number of legitimate transactions wrongly flagged

In Plain English: Precision answers "when the fraud alert fires, how often is it a real fraud?" High precision means fewer false alarms. If your fraud team manually reviews every flagged transaction, low precision wastes their time on legitimate purchases.

When Precision Matters Most

Optimize for precision when false positives are expensive:

  • Spam filtering. Deleting a legitimate email from a client is far worse than letting a spam message through.
  • Content moderation. Removing a user's post incorrectly damages trust and can create PR incidents.
  • Recommendation systems. Recommending irrelevant content erodes user engagement over time.

Recall: Completeness of Positive Detection

Recall (also called sensitivity or true positive rate) measures what fraction of actual positives your model found. It tells you how many fraudulent transactions slipped past undetected.

$$\text{Recall} = \frac{TP}{TP + FN}$$

Where:

  • TP is the number of fraud cases the model caught
  • FN is the number of fraud cases the model missed

In Plain English: Recall answers "out of all the actual fraud in the data, what percentage did we catch?" A recall of 0.80 means 20% of fraud went undetected. For a bank processing 10 million transactions a month, that 20% gap translates to real financial losses.

When Recall Matters Most

Optimize for recall when false negatives are expensive:

  • Medical diagnosis. Telling a cancer patient they are healthy can be fatal. A false positive just means additional testing.
  • Fraud detection. Missing fraud costs real money. Flagging a clean transaction costs a phone call.
  • Security threat detection. Missing an intrusion attempt can compromise an entire network.
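The two definitions reduce to a few lines of arithmetic. A minimal sketch with hypothetical counts (8 true positives, 2 false positives, 4 false negatives):

```python
# Precision and recall from raw counts. The counts are made up
# for illustration: the model flagged 10 transactions, 8 of which
# were real fraud, while 12 fraud cases existed in total.
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)  # of the flagged, how many were fraud?
recall = tp / (tp + fn)     # of the fraud, how many did we flag?

print(f"Precision: {precision:.2f}")  # 0.80
print(f"Recall:    {recall:.2f}")     # 0.67
```

Here the model is fairly trustworthy when it fires (80% precision) but still lets a third of the fraud through (67% recall).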

The Precision-Recall Tradeoff

You cannot maximize both precision and recall simultaneously. They pull in opposite directions. Tightening a model's criteria (raising the classification threshold) improves precision but hurts recall. Loosening criteria does the reverse.

[Figure: Precision-recall tradeoff visualization showing how threshold changes shift the balance between false positives and false negatives]

Think of it as a security checkpoint at an airport. An extremely strict checkpoint (high threshold) catches nearly zero innocent travelers but might miss some actual threats because it only flags the most obvious cases. A loose checkpoint (low threshold) catches every potential threat but also pulls aside hundreds of innocent passengers.

The right balance depends on what your business can tolerate. This is not a technical decision alone; it is a product decision that involves stakeholders who understand the cost of each error type.

F1 Score: Balancing Precision and Recall

The F1 score is the harmonic mean of precision and recall. Unlike the arithmetic mean, it punishes models that excel at one metric while failing at the other.

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Where:

  • Precision is the fraction of positive predictions that are correct
  • Recall is the fraction of actual positives that were detected

In Plain English: A model with 100% recall but 1% precision (it flags everything as fraud) gets an arithmetic mean of 50.5%, which sounds acceptable. The F1 score punishes that imbalance and returns roughly 0.02. F1 only climbs when both metrics are reasonably high.

Why the harmonic mean instead of the regular average? The arithmetic mean of 1.00 and 0.01 is 0.505. The harmonic mean is 0.0198. The harmonic mean is dominated by the smaller value, which is exactly what you want: it forces the model to be decent at both finding positives and being accurate about them.

Beyond Standard F1: The F-beta Score

Sometimes you do want to weight one metric more heavily. The F-beta score generalizes F1 by introducing a parameter β that controls the relative importance of recall versus precision.

$$F_\beta = (1 + \beta^2) \times \frac{\text{Precision} \times \text{Recall}}{(\beta^2 \times \text{Precision}) + \text{Recall}}$$

Where:

  • β controls the weight given to recall
  • β = 1 gives equal weight (standard F1)
  • β = 2 weighs recall twice as heavily as precision
  • β = 0.5 weighs precision twice as heavily as recall

In Plain English: For fraud detection where missing fraud is costlier than a false alarm, use F2 (β = 2). This tells the model: "I care about catching fraud twice as much as I care about avoiding false alarms."

| F-beta Variant | β | Emphasis | Use Case |
| --- | --- | --- | --- |
| F0.5 | 0.5 | Precision-heavy | Spam filtering, legal document review |
| F1 | 1.0 | Equal weight | General-purpose balanced evaluation |
| F2 | 2.0 | Recall-heavy | Medical screening, fraud detection |
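The output below comes from a toy split of 20 transactions arranged to give TP=4, FP=2, TN=12, FN=2; here is a sketch using scikit-learn's `fbeta_score`:

```python
# 20 transactions arranged so that TP=4, FP=2, TN=12, FN=2.
# The split itself is hypothetical, chosen for illustration.
from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true = [1] * 6 + [0] * 14               # 6 actual fraud, 14 legitimate
y_pred = [1] * 4 + [0] * 2 + [1] * 2 + [0] * 12

print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
print(f"F1:        {fbeta_score(y_true, y_pred, beta=1):.4f}")
print(f"F0.5:      {fbeta_score(y_true, y_pred, beta=0.5):.4f}")
print(f"F2:        {fbeta_score(y_true, y_pred, beta=2):.4f}")
```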

Expected Output:

```text
Confusion Matrix: TP=4, FP=2, TN=12, FN=2
Precision: 0.6667  (of 6 fraud predictions, 4 correct)
Recall:    0.6667  (of 6 actual fraud, 4 caught)
F1:        0.6667
F0.5:      0.6667  (precision-weighted)
F2:        0.6667  (recall-weighted)
```

In this case, precision and recall happen to be equal, so all F-beta variants converge to the same value. The differences become dramatic when precision and recall diverge, which is the common case on imbalanced data. With unequal precision and recall, F2 would pull toward recall and F0.5 would pull toward precision.

ROC Curve and AUC

The ROC (Receiver Operating Characteristic) curve and its summary statistic AUC (Area Under the Curve) measure a model's ability to rank positive examples higher than negative ones across all possible classification thresholds. Unlike precision, recall, and F1, which depend on a specific threshold (usually 0.5), AUC evaluates the quality of the model's probability estimates themselves.

Most classifiers do not just output "fraud" or "legitimate." They output a probability: "this transaction has a 73% chance of being fraud." You then pick a threshold. Everything above the threshold gets flagged. The ROC curve shows what happens to the True Positive Rate and False Positive Rate as you sweep that threshold from 0 to 1.

False Positive Rate

$$FPR = \frac{FP}{FP + TN}$$

Where:

  • FP is the number of legitimate transactions incorrectly flagged as fraud
  • TN is the number of legitimate transactions correctly cleared

In Plain English: FPR asks "of all the clean transactions, what fraction did we wrongly flag?" We want FPR to be low while recall (TPR) stays high. The ROC curve plots TPR on the y-axis against FPR on the x-axis.

Interpreting AUC Values

AUC condenses the entire ROC curve into a single number between 0 and 1. According to Hosmer and Lemeshow's applied logistic regression guidelines, AUC values can be roughly categorized as follows.

| AUC Range | Interpretation | Practical Meaning |
| --- | --- | --- |
| 0.90 to 1.00 | Excellent | Model ranks almost all fraud above legitimate |
| 0.80 to 0.90 | Good | Suitable for most production systems |
| 0.70 to 0.80 | Fair | May need feature engineering or a stronger model |
| 0.50 to 0.70 | Poor | Barely better than random guessing |
| 0.50 | Random | No discriminative power at all |

Key Insight: AUC = 0.85 has a concrete probabilistic interpretation. Pick one random fraud case and one random legitimate case. There is an 85% chance the model assigns a higher fraud probability to the actual fraud. This makes AUC excellent for comparing models before you have committed to a specific threshold.

[Figure: ROC curve interpretation showing the diagonal baseline, good classifier curve, and AUC shading]

When AUC Misleads

AUC is not perfect. On extremely imbalanced datasets (say, 0.01% positive rate), the ROC curve can look excellent even when precision is terrible. The reason: FPR divides by the number of negatives, which is enormous. A model that produces thousands of false positives barely nudges the FPR because the denominator is so large. For heavily skewed datasets, consider the Precision-Recall AUC (PR-AUC) instead, which is more sensitive to performance on the minority class. The scikit-learn metrics documentation covers average_precision_score for this purpose.
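As a sketch, here are ten hand-picked probabilities that produce the AUC in the output below, including a single ranking "crossing": one fraud scored 0.40, below a legitimate transaction scored 0.65.

```python
# Five frauds and five legitimate transactions with hand-picked
# probabilities (illustrative values, not model output).
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.40,   # fraud probabilities
          0.65, 0.30, 0.20, 0.10, 0.05]   # legitimate probabilities

# 24 of the 25 fraud/legitimate pairs are ranked correctly: 24/25 = 0.96
print(f"AUC Score: {roc_auc_score(y_true, scores):.3f}")
```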

Expected Output:

```text
AUC Score: 0.960

Interpretation: pick one random fraud and one random legitimate
transaction. There is a 96% chance the model assigns a
higher fraud probability to the actual fraud case.
```

An AUC of 0.96 means the model's probability rankings are excellent. Almost every fraudulent transaction receives a higher score than the legitimate ones. Notice that one fraud case received a probability of only 0.40 while one legitimate transaction scored 0.65. That single crossing is what keeps the AUC below 1.0. The next step is choosing a threshold that converts those rankings into binary decisions, which is where the precision-recall tradeoff reenters the picture.

Threshold Tuning in Practice

The default classification threshold of 0.5 is arbitrary. Nothing magical about it. Adjusting the threshold lets you slide along the precision-recall curve to the operating point your business actually needs.

For fraud detection, you might lower the threshold to 0.3. More transactions get flagged (recall goes up), but some are false alarms (precision goes down). A bank's fraud investigation team can handle the extra volume. For a system that automatically blocks transactions without human review, you might raise the threshold to 0.7 so only high-confidence cases get blocked.

Pro Tip: Use scikit-learn's precision_recall_curve to find the threshold that maximizes F1 (or F2, or whatever F-beta variant matches your cost structure). Do not eyeball it. The scikit-learn metrics module covers every threshold-tuning utility available.
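A sketch of that workflow on synthetic probabilities (the scores here are hypothetical; swap in your validation-set probabilities):

```python
# Sweep precision_recall_curve and pick the threshold maximizing F1.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
probs = np.array([0.95, 0.90, 0.85, 0.80, 0.40,
                  0.65, 0.30, 0.20, 0.10, 0.05])

precision, recall, thresholds = precision_recall_curve(y_true, probs)
# precision and recall have one more entry than thresholds
# (a final precision=1, recall=0 point), so drop the last value.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)

best = np.argmax(f1)
print(f"Best threshold: {thresholds[best]:.2f}  (F1 = {f1[best]:.3f})")
```

To optimize F2 instead, replace the F1 line with the F-beta formula (β = 2) over the same precision and recall arrays.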

Multi-class Metrics: Macro, Micro, and Weighted Averages

When you move beyond binary classification (fraud vs. legitimate) to multi-class problems (classifying transactions as legitimate, card theft, account takeover, or identity fraud), you need a way to aggregate per-class metrics into a single number. Scikit-learn offers three averaging strategies, and the choice matters.

| Averaging | Computation | Best For |
| --- | --- | --- |
| Macro | Compute metric per class, take unweighted average | Treating all classes equally, even rare ones |
| Micro | Pool all TP/FP/FN globally, compute metric once | Overall performance across all instances |
| Weighted | Compute per class, average weighted by class size | Compromise between macro and micro |

$$\text{Macro F1} = \frac{1}{C} \sum_{c=1}^{C} F1_c$$

Where:

  • C is the total number of classes
  • F1_c is the F1 score calculated independently for class c

In Plain English: Macro averaging gives equal voice to every class. If your model achieves 0.95 F1 on legitimate transactions but 0.30 F1 on identity fraud, the macro average exposes that failure. Micro averaging, by contrast, is dominated by the majority class and would hide the poor performance on rare fraud types.

Common Pitfall: Micro-averaged precision, recall, and F1 all collapse to the same value, which equals accuracy. If you report micro-F1 on an imbalanced multi-class problem, you have not escaped the accuracy trap at all. Always check macro averages when class sizes differ significantly.
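The output below can be reproduced on a small hypothetical sample of 16 transactions (the class encoding 0 through 3 is an assumption for illustration):

```python
# Classes: 0=Legitimate, 1=Card Theft, 2=Account Takeover, 3=Identity Fraud.
# The labels are hand-constructed so the model catches most legitimate
# transactions but never predicts Identity Fraud.
from sklearn.metrics import f1_score

y_true = [0] * 10 + [1] * 4 + [2] + [3]
y_pred = [0] * 9 + [2] + [1, 1, 0, 0] + [2] + [0]

names = ["Legitimate", "Card Theft", "Account Takeover", "Identity Fraud"]
per_class = f1_score(y_true, y_pred, average=None,
                     labels=[0, 1, 2, 3], zero_division=0)

print("Per-class F1 scores:")
for name, score in zip(names, per_class):
    print(f"  {name:<20}: {score:.4f}")

print(f"\nMacro F1:    {f1_score(y_true, y_pred, average='macro', zero_division=0):.4f}")
print(f"Micro F1:    {f1_score(y_true, y_pred, average='micro'):.4f}")
print(f"Weighted F1: {f1_score(y_true, y_pred, average='weighted', zero_division=0):.4f}")
```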

Expected Output:

```text
Per-class F1 scores:
  Legitimate          : 0.8182
  Card Theft          : 0.6667
  Account Takeover    : 0.6667
  Identity Fraud      : 0.0000

Macro F1:    0.5379  (equal weight per class)
Micro F1:    0.7500  (= accuracy for multi-class)
Weighted F1: 0.7197  (weighted by class size)
```

Notice how micro F1 (0.75) looks respectable, but macro F1 (0.54) reveals that the model completely fails on identity fraud. The macro average forces you to confront that failure instead of hiding it behind majority-class performance.

Metric Selection Decision Framework

Choosing the right metric is not purely a technical decision. It encodes your assumptions about what mistakes cost. The table below provides a practical decision framework.

| Scenario | Primary Metric | Why |
| --- | --- | --- |
| Balanced classes, general purpose | Accuracy or F1 | Classes are equally important |
| Imbalanced classes, missing positives is costly | Recall or F2 | You cannot afford false negatives |
| Imbalanced classes, false alarms are costly | Precision or F0.5 | False positives waste resources |
| Model comparison before threshold selection | AUC | Threshold-independent ranking quality |
| Heavily skewed data (< 1% positive) | PR-AUC | More sensitive than ROC-AUC to minority class |
| Multi-class with rare categories | Macro F1 | Prevents majority class from masking failures |

[Figure: Metric selection decision tree for classification problems based on class balance and error cost structure]

Key Insight: Start with the confusion matrix. Look at it. Understand where the model fails. Then pick the metric that punishes the kind of failure you care about. Every number you report to stakeholders should connect back to a real cost.

Production Considerations

Computational cost varies across metrics. Accuracy, precision, recall, and F1 all run in O(n) time where n is the number of predictions. AUC requires sorting predictions by probability, making it O(n log n). For datasets with millions of predictions, this difference is negligible in practice. The confusion matrix itself is O(n) to build but O(C^2) in memory for C classes. With 10,000 classes (rare, but possible in extreme multi-label problems), that matrix alone consumes significant memory.

In production monitoring, track metrics on sliding windows rather than full history. A random forest that scored 0.92 F1 at deployment can degrade to 0.74 after three months of data drift. Metrics computed on stale evaluation sets will not catch this. Set up alerts on weekly metric recalculations against fresh labeled samples, and validate on proper cross-validation splits during development.

Conclusion

The confusion matrix is your starting point. Every useful classification metric is just a different lens on those four numbers: TP, TN, FP, FN. Accuracy counts all correct predictions equally and fails the moment classes are imbalanced. Precision tells you how much to trust a positive prediction. Recall tells you how many real positives you missed. F1 balances both, and F-beta lets you tilt the balance toward whichever error type costs more. AUC evaluates ranking quality independent of any specific threshold.

The metric you choose defines what "good" means for your model. A fraud detection system optimized for accuracy might score 99.5% while hemorrhaging money. The same model evaluated on recall or F2 would immediately reveal the problem. Metric selection is a business conversation, not just a technical one.

To make sure your metrics are not inflated by a lucky data split, validate your model properly using cross-validation. If your scores are suspiciously high, you may be overfitting; our guide on the bias-variance tradeoff explains how to diagnose and fix that. And if your model's probabilities say "80% fraud" but only 50% of those cases are actually fraud, you need probability calibration to align predicted confidence with observed outcomes.

Frequently Asked Interview Questions

Q: Your classification model has 98% accuracy on a fraud detection task. Your manager is thrilled. What is the first thing you check?

Check the class distribution. If only 2% of transactions are fraudulent, a model that predicts "legitimate" for everything achieves 98% accuracy while catching zero fraud. I would immediately look at precision, recall, and the confusion matrix to understand whether the model actually identifies any positive cases.

Q: Explain the difference between precision and recall using a real-world example.

Precision measures how many of your positive predictions are correct: "of all the transactions we flagged as fraud, how many actually were?" Recall measures how many actual positives you found: "of all the real fraud in the dataset, what fraction did we catch?" In medical testing, high recall means you rarely miss a sick patient; high precision means you rarely scare a healthy one.

Q: When would you choose F2 over standard F1?

F2 weights recall twice as heavily as precision. I would choose it any time the cost of missing a positive is significantly higher than the cost of a false alarm. Cancer screening is the classic example: missing a malignant tumor is far worse than ordering an extra biopsy on a benign one.

Q: A model has AUC of 0.95 but F1 of 0.40 at the default 0.5 threshold. What is happening?

The model ranks positives above negatives very well (high AUC), but the default threshold is not appropriate for the class distribution. By adjusting the threshold (likely lowering it for an imbalanced dataset), you can find an operating point with much better F1. AUC says the model has good discriminative ability; the threshold just needs tuning.

Q: Why is micro-averaged F1 the same as accuracy in multi-class problems?

When you pool all true positives, false positives, and false negatives across classes and compute a single precision and recall, both equal the overall fraction of correct predictions. The harmonic mean of two identical values is that same value. This is why micro-F1 does not add information beyond accuracy for multi-class evaluation.

Q: Your fraud model has perfect recall but terrible precision. How do you fix it without retraining?

Raise the classification threshold. The model is currently flagging too many transactions as fraud (low threshold). By increasing the threshold, you require higher confidence before flagging, which reduces false positives and improves precision. The tradeoff is that recall will decrease, so find the threshold that gives the best F1 or F-beta for your cost structure.

Q: When should you use PR-AUC instead of ROC-AUC?

When the positive class is extremely rare (less than 1% of the data). ROC-AUC can look excellent because the false positive rate denominator (total negatives) is enormous, making even thousands of false positives barely register. PR-AUC directly measures precision and recall without this dilution effect, giving a more honest picture of model performance on the minority class.

Q: You are building a content moderation system. Should you optimize for precision or recall?

It depends on the moderation action. For automated removal of content (no human review), optimize for precision since incorrectly removing legitimate content damages user trust and can create legal issues. For flagging content for human review, optimize for recall since a moderator can dismiss false positives but cannot review content that was never flagged.

Hands-On Practice

Now see these metrics in action with real data. You'll build a classifier on an imbalanced dataset and compare what different metrics tell you about model performance. You'll also visualize the confusion matrix and ROC curve to understand why accuracy alone can be dangerously misleading.

Dataset: ML Fundamentals (Loan Approval). A loan approval dataset with a roughly 76/24 class imbalance, perfect for demonstrating why accuracy can mislead you and why precision, recall, and F1 matter.

Experiment with different thresholds to see how precision and recall trade off. Also try adding class_weight='balanced' to the RandomForestClassifier to see how it affects the metrics on the minority class.
