
Linear Discriminant Analysis: The Supervised Upgrade to PCA

LDS Team · Let's Data Science · 12 min

Your dataset has 13 chemical measurements for 178 wine samples and three known cultivars. You need to compress those 13 features into something a simple classifier can handle. PCA would find the directions of maximum variance, but variance and class separability are not the same thing. The axis where your data spreads the most might be completely useless for telling Cultivar 0 apart from Cultivar 2.

Linear Discriminant Analysis (LDA) solves this by looking at the labels. Instead of asking "where does the data vary most?", LDA asks "which projection pushes the classes as far apart as possible while keeping each class tight?" Ronald Fisher formalized this idea in 1936, and it remains one of the fastest, most interpretable methods for supervised dimensionality reduction in the scikit-learn 1.8 toolkit.

Every formula and code block in this article uses the same running example: the classic Wine dataset with 13 chemical features across three cultivars. We'll build LDA from scatter matrices, compare it head-to-head with PCA, and see exactly where it wins and where it breaks.

[Figure: PCA vs LDA comparison, unsupervised variance maximization versus supervised class separation for dimensionality reduction]

The Core Idea: Maximizing Class Separation

Linear Discriminant Analysis is a supervised dimensionality reduction technique that projects high-dimensional data into a lower-dimensional space optimized for classification. Where PCA preserves global variance without knowing which samples belong to which class, LDA explicitly uses class labels to find projections that separate categories as cleanly as possible.

LDA balances two competing goals simultaneously:

  1. Push class centers apart. The mean of Cultivar 0 in the projected space should be far from the mean of Cultivar 2.
  2. Keep each class compact. The wine samples within Cultivar 0 should cluster tightly around their own mean, not scatter everywhere.

Think of it like organizing tables at a conference banquet. You want the ML team and the Stats team seated on opposite sides of the room (maximize between-group distance), but you also want each team's members sitting close together rather than wandering around (minimize within-group scatter).

Key Insight: PCA summarizes data. LDA separates data. If your next step is classification, LDA gives you a projection tailored for that goal. If your next step is exploratory visualization with no labels, PCA is the right starting point.

Criterion      | PCA                           | LDA
---------------|-------------------------------|------------------------------
Supervision    | Unsupervised (ignores labels) | Supervised (requires labels)
Objective      | Maximize total variance       | Maximize class separation
Output axes    | Principal components          | Linear discriminants
Max dimensions | min(n, p)                     | min(C - 1, p)
Best use       | Compression, denoising        | Classification preprocessing

Common Pitfall: With a binary classification problem (2 classes), LDA produces exactly 1 component. You cannot make a 2D scatter plot of LDA output for binary data. You get a 1D number line. For 3 classes like our wine dataset, you get at most 2 discriminant axes.
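You can check the component cap directly. A quick sketch, assuming scikit-learn is installed (the Wine dataset ships with it):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_wine(return_X_y=True)

# Binary subproblem: keep cultivars 0 and 1 only -> at most C - 1 = 1 axis
mask = y < 2
X_binary = LinearDiscriminantAnalysis().fit_transform(X[mask], y[mask])
print(X_binary.shape)   # one column: a 1D number line, not a 2D scatter plot

# All three cultivars -> at most 2 discriminant axes
X_three = LinearDiscriminantAnalysis().fit_transform(X, y)
print(X_three.shape)
```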

Fisher's Criterion: The Math Behind "Good Separation"

Fisher's Linear Discriminant (1936) formalized what "good separation" means mathematically. Looking at the distance between class means alone is not enough. Two class centers can be far apart yet still overlap heavily if each class is spread out. Two close centers can be perfectly separable if each class is a tight dot.

Fisher defined a score that rewards distance between class centers and penalizes scatter within classes:

J(w) = \frac{(\mu_1 - \mu_2)^2}{s_1^2 + s_2^2}

Where:

  • J(w) is Fisher's criterion for projection direction w
  • \mu_1, \mu_2 are the projected class means
  • s_1^2, s_2^2 are the projected within-class variances (scatter)
  • The numerator measures between-class distance
  • The denominator measures within-class spread

In Plain English: Fisher's criterion says "the best projection line is the one where the wine cultivar centers land far apart, but the wines within each cultivar stay clustered tightly." A high score means clean separation. A low score means the classes bleed into each other after projection.

This two-class formulation extends naturally to multiple classes through matrix algebra.
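To make the criterion concrete, here is a toy two-class sketch in NumPy (synthetic 2D points, not the wine data): project onto a candidate direction w and score it with J(w).

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic 2D classes: means differ mostly along x, and both classes
# are tight along x but spread out along y
class_a = rng.normal(loc=[0.0, 0.0], scale=[0.5, 1.5], size=(50, 2))
class_b = rng.normal(loc=[3.0, 1.0], scale=[0.5, 1.5], size=(50, 2))

def fisher_j(w, a, b):
    """Fisher's criterion: squared gap between projected means over summed projected scatter."""
    w = np.asarray(w, dtype=float)
    w /= np.linalg.norm(w)
    pa, pb = a @ w, b @ w
    return (pa.mean() - pb.mean()) ** 2 / (pa.var() + pb.var())

j_x = fisher_j([1, 0], class_a, class_b)   # along the mean gap and the tight axis
j_y = fisher_j([0, 1], class_a, class_b)   # along the noisy axis
print(f"J(x-axis) = {j_x:.2f}, J(y-axis) = {j_y:.2f}")
```

The x-axis wins by a wide margin: the class centers land far apart and each class stays tight, exactly what the criterion rewards.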

Scatter Matrices: Within-Class and Between-Class

For multi-class data with p features, Fisher's criterion generalizes to two scatter matrices. These are the engine of LDA.

Within-Class Scatter Matrix (S_W)

The within-class scatter matrix measures how much samples spread inside each class. It's the sum of covariance-like matrices computed per class:

S_W = \sum_{c=1}^{C} \sum_{i \in \text{class } c} (x_i - \mu_c)(x_i - \mu_c)^T

Where:

  • C is the number of classes (3 for wine cultivars)
  • x_i is an individual sample vector (13 wine measurements)
  • \mu_c is the mean vector of class c
  • S_W is a p \times p matrix (13 x 13 for wine data)

In Plain English: S_W captures the "noise" in your data. It tells you how scattered the Cultivar 0 wines are around the average Cultivar 0, how scattered the Cultivar 1 wines are around their average, and so on. LDA wants to minimize the influence of this noise.

Between-Class Scatter Matrix (S_B)

The between-class scatter matrix measures how far each class center sits from the global data center:

S_B = \sum_{c=1}^{C} N_c (\mu_c - \mu)(\mu_c - \mu)^T

Where:

  • N_c is the number of samples in class c
  • \mu_c is the mean vector of class c
  • \mu is the overall mean across all samples
  • S_B is a p \times p matrix weighted by class size

In Plain English: S_B captures the "signal." It measures how far the average Cultivar 0 wine and the average Cultivar 2 wine sit from the center of all wines combined. LDA wants to maximize this separation.
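The two definitions translate directly to NumPy. This sketch computes both matrices on the wine features (standardized first, an assumption on our part), and the fact that within-class plus between-class scatter equals total scatter gives a built-in sanity check:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)   # standardize the 13 features first
mu = X.mean(axis=0)                     # overall mean (near zero after scaling)

p = X.shape[1]
Sw = np.zeros((p, p))                   # within-class scatter S_W
Sb = np.zeros((p, p))                   # between-class scatter S_B
for c in np.unique(y):
    Xc = X[y == c]
    mu_c = Xc.mean(axis=0)
    Sw += (Xc - mu_c).T @ (Xc - mu_c)
    Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)

print(f"Sw trace: {Sw.trace():.2f}")
print(f"Sb trace: {Sb.trace():.2f}")
# Sanity check: within-class plus between-class scatter equals total scatter
print(np.allclose(Sw + Sb, (X - mu).T @ (X - mu)))  # True
```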

The Generalized Eigenvalue Problem

LDA seeks a projection vector w that maximizes the ratio of between-class scatter to within-class scatter:

J(w) = \frac{w^T S_B w}{w^T S_W w}

The solution comes from the generalized eigenvalue problem:

S_W^{-1} S_B w = \lambda w

Where:

  • w is an eigenvector (a linear discriminant direction)
  • \lambda is the corresponding eigenvalue (how much separation that direction provides)
  • The eigenvector with the largest \lambda is the best single projection axis
  • For C classes, at most C - 1 eigenvalues are non-zero

In Plain English: The optimal discriminant directions are the eigenvectors of S_W^{-1} S_B. Each eigenvector defines an axis in feature space, and its eigenvalue tells you how much class separation that axis captures. For the wine dataset with 3 cultivars, we get exactly 2 non-zero eigenvalues, meaning LDA compresses 13 features down to 2 discriminant axes.
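A sketch of that eigendecomposition in NumPy (recomputing the scatter matrices inline so it runs standalone; standardizing the features first is our assumption):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)
p, mu = X.shape[1], X.mean(axis=0)

Sw, Sb = np.zeros((p, p)), np.zeros((p, p))
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)
    Sb += len(Xc) * np.outer(mc - mu, mc - mu)

# Generalized eigenvalue problem: Sw^{-1} Sb w = lambda w
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
order = np.argsort(eigvals.real)[::-1]
eigvals = eigvals.real[order]

print(np.round(eigvals[:4], 4))      # only the first C - 1 = 2 are non-zero
W = eigvecs.real[:, order[:2]]       # projection matrix: 13 features -> 2 axes
print((X @ W).shape)
```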

[Figure: LDA algorithm pipeline, from labeled data through scatter matrices to the eigenvalue problem and projection]

LDA Step by Step on the Wine Dataset

Let's build LDA from scratch to see every computation, then verify against scikit-learn.

Output:
=== LDA from Scratch (Wine Dataset) ===
Sw trace (within-class scatter): 1299.98
Sb trace (between-class scatter): 1014.02
Ratio Sb/Sw trace: 0.7800

Eigenvalue 1 (LD1): 12.0147  (72.0% of separation)
Eigenvalue 2 (LD2): 4.6831  (28.0% of separation)

Top 5 features driving LD1:
  flavanoids                          |w| = 0.6007
  proline                             |w| = 0.3940
  od280/od315_of_diluted_wines        |w| = 0.3836
  total_phenols                       |w| = 0.2890
  color_intensity                     |w| = 0.2695

sklearn explained_variance_ratio_: [0.6875, 0.3125]

The within-class scatter (trace 1300) represents noise. The between-class scatter (trace 1014) represents signal. LD1 captures 72% of the discriminant information, driven primarily by flavanoids and proline, the same polyphenol chemistry that varies most between cultivars. Notice that sklearn's explained_variance_ratio_ uses a slightly different normalization than raw eigenvalue proportions, but both agree on the relative ranking.

Pro Tip: Flavanoids dominate LD1 here because polyphenol content is the strongest chemical difference between these three Italian cultivars. This kind of domain-relevant feature importance is a major advantage of LDA over black-box methods.

LDA as a Classifier

LDA is not just a dimensionality reduction tool. It's also a full classifier. Scikit-learn's LinearDiscriminantAnalysis can predict class labels directly by fitting class-conditional Gaussian distributions and applying Bayes' theorem (conceptually similar to how Naive Bayes classifies, but without the feature independence assumption).

Let's compare four approaches on the wine dataset: all 13 features with logistic regression, PCA-reduced features, LDA-reduced features, and LDA used directly as a classifier.
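Here is a sketch of that four-way comparison (scikit-learn assumed; the split below is an arbitrary choice, so your accuracies may differ slightly from the table that follows):

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=42, stratify=y)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

def logreg_acc(train, test):
    """Fit logistic regression on a feature representation, score on the test split."""
    return LogisticRegression(max_iter=1000).fit(train, y_tr).score(test, y_te)

pca = PCA(n_components=2).fit(X_tr)                            # unsupervised 2D
lda = LinearDiscriminantAnalysis(n_components=2).fit(X_tr, y_tr)  # supervised 2D

results = {
    "All 13 features": logreg_acc(X_tr, X_te),
    "PCA (2D)": logreg_acc(pca.transform(X_tr), pca.transform(X_te)),
    "LDA (2D)": logreg_acc(lda.transform(X_tr), lda.transform(X_te)),
    "LDA classifier": lda.score(X_te, y_te),   # LDA predicting labels directly
}
for name, acc in results.items():
    print(f"{name:<16} | {acc:.4f}")
```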

Output:
Method               |  Dims | Accuracy
----------------------------------------
All 13 features      |    13 | 0.9815
PCA (2D)             |     2 | 0.9630
LDA (2D)             |     2 | 0.9815
LDA classifier       |    13 | 1.0000

LDA 5-fold CV accuracy: 0.9717 (+/- 0.0252)
Individual folds: ['1.0000', '1.0000', '0.9444', '0.9429', '0.9714']

The results tell a clear story. PCA compresses 13 features to 2 and drops to 96.3% accuracy because it discards information that separates classes. LDA compresses to the same 2 dimensions but matches the full 13-feature model at 98.2%. And LDA used directly as a classifier hits 100% on this test split.

The cross-validation score (97.2%) is more realistic than the single test split. Some folds see 100%, others 94.3%, which reflects the small dataset size (178 samples). Still, LDA punches well above its weight for a method with zero hyperparameters to tune.

Key Insight: LDA compresses 13 features into 2 and loses nothing for classification. PCA compresses the same 13 into 2 and loses nearly 2 percentage points. The difference is supervision: LDA knows which features matter for separating cultivars, PCA does not.

LDA vs QDA: Relaxing the Equal Covariance Assumption

LDA assumes all classes share the same covariance matrix. In geometric terms, every class cloud has the same shape and orientation; they differ only in location. This is called the homoscedasticity assumption, and it leads to linear decision boundaries between classes.

Quadratic Discriminant Analysis (QDA) drops this assumption. Each class gets its own covariance matrix, producing curved (quadratic) decision boundaries. When one wine cultivar forms a tight sphere and another forms an elongated ellipse, QDA can capture that difference.

In scikit-learn 1.8, QDA gained new solver, shrinkage, and covariance_estimator parameters (added in version 1.6), bringing it closer to LDA's API flexibility. You can now apply automatic Ledoit-Wolf shrinkage to QDA's per-class covariance estimates.
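A sketch of the cross-validated comparison (scikit-learn assumed; QDA is left at its defaults here so the snippet runs on any recent version):

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "LDA (shrinkage=auto)": LinearDiscriminantAnalysis(solver="lsqr",
                                                       shrinkage="auto"),
}
cv_means = {}
for name, model in models.items():
    # Standardize inside the CV loop to avoid leaking test-fold statistics
    scores = cross_val_score(make_pipeline(StandardScaler(), model), X, y, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name:<22} | {scores.mean():.4f} | {scores.std():.4f}")
```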

Output:
=== LDA vs QDA on Wine Dataset (5-Fold CV) ===
Method                    | Mean Acc |    Std
------------------------------------------------
LDA                       | 0.9717   | 0.0252
QDA                       | 0.9551   | 0.0137
LDA (shrinkage=auto)      | 0.9663   | 0.0110

Class 0 covariance trace: 5.12
Class 1 covariance trace: 10.07
Class 2 covariance trace: 6.34

Covariance matrix differences (Frobenius norm):
  Class 0 vs Class 1: 2.82
  Class 0 vs Class 2: 2.14
  Class 1 vs Class 2: 3.13

LDA beats QDA here (97.2% vs 95.5%) despite the class covariances being somewhat different (Class 1 has nearly double the trace of Class 0). With only 178 samples split across 3 classes, QDA must estimate three separate 13x13 covariance matrices: 91 free parameters each, 273 in total. The bias-variance tradeoff bites hard here: QDA's extra flexibility costs it in variance.

Factor                 | LDA                            | QDA
-----------------------|--------------------------------|--------------------------------
Decision boundary      | Linear (hyperplane)            | Quadratic (curved)
Covariance assumption  | Shared across classes          | Per-class covariance
Parameters to estimate | One shared covariance          | One covariance per class
When it wins           | Small n, similar class shapes  | Large n, different class shapes
Risk                   | Underfitting curved boundaries | Overfitting with few samples

Pro Tip: When in doubt between LDA and QDA, start with LDA. It has fewer parameters and is more stable on small datasets. Switch to QDA only when you have plenty of samples per class (at least 10x the number of features) and you suspect class shapes genuinely differ.

When to Use LDA (and When NOT to Use It)

LDA is not a universal tool. Knowing its boundaries prevents wasted effort and misleading results.

[Figure: Decision guide for choosing between LDA, QDA, PCA, shrinkage LDA, t-SNE, and UMAP based on data characteristics]

Use LDA when:

Scenario                         | Why LDA fits
---------------------------------|--------------------------------------------------
Classification preprocessing     | LDA directly optimizes for class separation
Few classes relative to features | The C - 1 component limit is fine when C is small
Roughly Gaussian classes         | LDA's math assumes normal distributions per class
Baseline comparison              | Fast, no hyperparameters, strong default
Interpretable reduction          | Discriminant weights map to original features

Do NOT use LDA when:

Scenario                            | Why LDA fails
------------------------------------|------------------------------------------------------------------------------
No labels available                 | LDA is supervised; use PCA instead
Non-linear class boundaries         | Classes wrapped around each other need kernel methods or nonlinear embeddings
p >> n (more features than samples) | S_W becomes singular; use shrinkage LDA or PCA first
Heavily skewed / multimodal classes | The Gaussian assumption breaks; consider random forests
Many classes, few features          | C - 1 can exceed p, capping the useful discriminants at p

Common Pitfall: LDA is extremely sensitive to outliers. A single extreme measurement can shift a class mean and inflate the scatter matrix, warping the entire projection. Check for outliers before fitting LDA. Shrinkage LDA (solver='lsqr', shrinkage='auto') partially mitigates this by regularizing S_W, but does not eliminate the problem.

Production Considerations

Computational Complexity

LDA is fast. The dominant cost is computing and inverting S_W: O(p^2 n) to build the scatter matrices and O(p^3) for the inversion.

Dataset size          | Time    | Memory             | Notes
----------------------|---------|--------------------|--------------------------------------------------
Wine (178 x 13)       | < 1 ms  | Trivial            | Any solver works
10K x 100             | ~10 ms  | ~80 KB for scatter | Standard LDA is fine
100K x 1,000          | ~1 s    | ~8 MB for scatter  | Consider solver='lsqr'
1M x 10,000           | Minutes | ~800 MB            | Must use shrinkage; PCA first may help
p > n (genomics, NLP) | Fails   | S_W singular       | Shrinkage mandatory, or run PCA to reduce p first

The Small-Sample, High-Dimensional Problem

When the number of features exceeds the number of samples (p > n), the within-class scatter matrix S_W becomes singular (non-invertible). You cannot compute S_W^{-1}. This is common in genomics (20,000 genes, 200 patients) and NLP (vocabulary-size features, hundreds of documents).

Two standard solutions:

  1. Shrinkage LDA. Regularize S_W by blending it with a scaled identity matrix: S_W^{\text{reg}} = (1 - \alpha) S_W + \alpha I. Setting shrinkage='auto' uses the Ledoit-Wolf estimator to pick \alpha automatically.
  2. PCA then LDA. Use PCA to reduce dimensionality below n first, then apply LDA on the reduced features. This PCA + LDA pipeline is the standard approach in face recognition (Fisherfaces, introduced by Belhumeur et al., 1997).
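Both fixes are one-liners in scikit-learn. This sketch uses a synthetic p > n problem; the dataset and its parameters are illustrative assumptions, not the wine data:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# p > n: 500 features but only 200 samples, so S_W is singular
X, y = make_classification(n_samples=200, n_features=500, n_informative=10,
                           n_classes=3, n_clusters_per_class=1,
                           class_sep=2.0, random_state=0)

# Option 1: shrinkage regularizes S_W so it is always invertible
shrunk = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
acc_shrunk = cross_val_score(shrunk, X, y, cv=5).mean()

# Option 2: PCA below n first, then LDA (the Fisherfaces recipe)
pca_lda = make_pipeline(PCA(n_components=50), LinearDiscriminantAnalysis())
acc_pca_lda = cross_val_score(pca_lda, X, y, cv=5).mean()

print(f"Shrinkage LDA: {acc_shrunk:.3f}")
print(f"PCA -> LDA:    {acc_pca_lda:.3f}")
```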

Feature Scaling

LDA computes means and scatter matrices from the raw feature values. If one feature ranges from 278 to 1680 (proline) and another from 0.48 to 1.71 (hue), the large-scale feature dominates both scatter matrices. In exact arithmetic the S_W^{-1} term cancels this out, so LDA's projection is scale-invariant in theory, but standardizing is still the right habit: it improves numerical conditioning, it makes the discriminant weights comparable across features, and it becomes genuinely necessary once you add shrinkage, which blends S_W toward the identity and is therefore not scale-invariant.
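A quick sketch of both effects (scikit-learn assumed): fitting LDA on raw and on standardized wine features gives essentially identical predictions, since S_W^{-1} cancels the scaling, but only the standardized discriminant weights are comparable across features.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

raw = LinearDiscriminantAnalysis().fit(X, y)
std = LinearDiscriminantAnalysis().fit(X_std, y)

# Predictions agree up to numerical round-off
agreement = (raw.predict(X) == std.predict(X_std)).mean()
print(f"Prediction agreement: {agreement:.3f}")

# Raw-scale LD1 weights are inflated for small-range features like hue;
# standardized weights reflect discriminative power per standard deviation
print(np.round(np.abs(raw.scalings_[:, 0]), 2))
print(np.round(np.abs(std.scalings_[:, 0]), 2))
```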

Key Assumptions Checklist

  1. Normality. Each class is drawn from a multivariate Gaussian. Moderate violations are tolerable; heavy skew or multimodality will hurt.
  2. Homoscedasticity. All classes share the same covariance. If violated badly, switch to QDA.
  3. Linear separability. The optimal decision boundary between classes is a hyperplane. Curved boundaries need nonlinear methods.
  4. No perfect multicollinearity. Perfectly correlated features make S_W singular even when n > p. Drop duplicates or use shrinkage.

Conclusion

Linear Discriminant Analysis occupies a specific, valuable niche: supervised dimensionality reduction for classification. It compresses features by maximizing class separation through Fisher's criterion, producing discriminant axes that directly target the downstream task. On the wine dataset, LDA compressed 13 features to 2 and matched the accuracy of a model using all 13 features, something PCA in 2D could not achieve.

The most important practical takeaway is understanding LDA's boundaries. It assumes Gaussian, equally-shaped class distributions with linear decision boundaries. When those assumptions roughly hold (and they do surprisingly often for structured tabular data), LDA is hard to beat for its speed and simplicity. When they don't, feature engineering or nonlinear classifiers like random forests are better choices.

If you arrived here from the PCA guide, the contrast should now be sharp: PCA preserves variance, LDA preserves class separation, and picking the right one depends entirely on whether you have labels and what you plan to do next.

Frequently Asked Interview Questions

Q: What is the fundamental difference between PCA and LDA?

PCA finds projection directions that maximize total variance in the data, without looking at class labels. LDA finds directions that maximize the ratio of between-class variance to within-class variance. PCA is unsupervised and best for compression; LDA is supervised and best for classification preprocessing. They can give very different results when the highest-variance direction is not the most discriminative one.

Q: Why is LDA limited to at most C - 1 components?

The between-class scatter matrix S_B has rank at most C - 1 because it's constructed from C class means in p-dimensional space. Once the overall mean is subtracted, those C mean vectors span a subspace of dimension at most C - 1 (just as 3 points define a plane, not a volume). Any eigenvector beyond the (C-1)th has a zero eigenvalue and captures no class separation.

Q: When would LDA fail as a classifier?

LDA fails when its core assumptions are violated: non-Gaussian class distributions (e.g., bimodal or heavily skewed), strongly unequal covariance matrices across classes (where QDA would be better), or nonlinear decision boundaries (e.g., one class surrounding another in a ring). It also struggles when the number of features exceeds the number of samples, making S_W singular, though shrinkage can partially address this.

Q: Your LDA model achieves 98% accuracy on the test set but only 94% in cross-validation. What's happening?

The single test split is overly optimistic due to favorable random partitioning. Cross-validation averages over multiple splits and gives a more reliable estimate. With small datasets, a single split can easily land most easy-to-classify samples in the test set. Always report cross-validated scores for model comparison, especially with fewer than 500 samples.

Q: How does LDA handle the case where S_W is singular?

When p > n or features are perfectly correlated, S_W has no inverse and standard LDA breaks. Two approaches work: shrinkage LDA regularizes S_W by adding a scaled identity matrix (S_W^{\text{reg}} = (1-\alpha)S_W + \alpha I), making it invertible. Alternatively, you can run PCA first to reduce dimensionality below n, then apply LDA in the reduced space. Scikit-learn supports shrinkage directly via solver='lsqr', shrinkage='auto'; the PCA route is a two-step pipeline.

Q: Can LDA be used for multi-class problems, or only binary classification?

LDA handles multi-class problems naturally. Fisher's original formulation was for two classes, but the scatter matrix generalization extends to any number of classes. For C classes, you get up to C - 1 discriminant components. The method simultaneously separates all classes, not just pairs.

Q: A colleague suggests running LDA on one-hot encoded categorical features. Is this a good idea?

No. LDA assumes continuous, approximately Gaussian features. One-hot encoded variables are binary (0/1), violating the normality assumption. The covariance and mean calculations will be technically valid but the Gaussian class-conditional model will be a poor fit. For datasets mixing continuous and categorical features, consider encoding categoricals with target encoding or ordinal encoding, then applying LDA only to the continuous portion. Or skip LDA entirely and use a tree-based classifier that handles mixed types natively.

Q: How would you choose between LDA, QDA, and logistic regression for a classification task?

LDA and logistic regression produce linear decision boundaries but arrive there differently: LDA models the joint distribution P(X, y), while logistic regression models the conditional P(y | X) directly. LDA is more efficient when the Gaussian assumption holds but more fragile when it doesn't. QDA allows quadratic boundaries at the cost of estimating more parameters. In practice, start with LDA and logistic regression as baselines. If both underperform, try QDA with sufficient data. If boundaries are truly nonlinear, move to kernel methods or ensemble classifiers.

Hands-On Practice

Linear Discriminant Analysis (LDA) is often called the "supervised sibling" of PCA. While PCA blindly searches for variance, LDA uses class labels to find the projection that best separates your categories. In this hands-on tutorial, we will work with a high-dimensional Wine Analysis dataset to visualize the critical difference between maximizing variance (PCA) and maximizing separability (LDA), and see why LDA is the superior choice for preprocessing before classification.

Dataset: Wine Analysis (High-Dimensional). Wine chemical analysis with 27 features (13 original + 9 derived + 5 noise) and 3 cultivar classes. PCA needs 2 components for 45% of the variance, 5 for 64%, and 10 for 83%. The noise features have near-zero importance. Ideal for practicing dimensionality reduction, feature selection, and regularization.

Notice how the PCA plot likely showed some overlap between classes, while the LDA plot separated them into distinct, tight clusters. This perfectly illustrates Fisher's criterion: maximizing distance between class means while minimizing variance within classes. Try experimenting by changing n_components in PCA to see how many dimensions are required to match LDA's performance, or inspect the 'noise' columns in the feature importance plot to confirm that LDA correctly identified them as irrelevant.
