Linear Discriminant Analysis: The Supervised Upgrade to PCA

LDS Team · Let's Data Science

Imagine you are trying to separate a pile of apples from a pile of oranges based on data like "weight" and "redness."

If you use Principal Component Analysis (PCA), the algorithm blindly looks for the direction where the data varies the most. It ignores the labels. It might say, "Hey, weight varies a lot!" and project everything onto the weight axis. But if small apples and small oranges weigh the same, you’ve just mashed your distinct classes into a confused mess.

Linear Discriminant Analysis (LDA) is smarter. It cheats. It peeks at the labels.

LDA doesn't just ask "Where is the variance?" It asks, "What combination of features pushes the apples as far away from the oranges as possible?" It is the supervised sibling of PCA—a technique that reduces dimensionality not to preserve data shape, but to optimize class separation.

In this guide, we’ll move beyond the black box. We’ll explore the intuition behind "Fisher’s Criterion," break down the math of scatter matrices, and implement LDA in Python to see exactly why it often beats PCA for classification tasks.

What is Linear Discriminant Analysis?

Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction technique used primarily for classification. It projects data from a high-dimensional space into a lower-dimensional space with one specific goal: maximize the separability between known categories.

While PCA is an unsupervised algorithm that focuses on preserving the global structure (variance) of the dataset, LDA is a supervised algorithm that focuses on the boundaries between classes.

🔑 Key Insight: PCA summarizes data; LDA separates data. Use PCA for compression. Use LDA when your next step is classification.

The Two Goals of LDA

To separate classes perfectly, LDA tries to satisfy two competing requirements simultaneously:

  1. Maximize the distance between the means (centers) of different classes.
  2. Minimize the spread (variance) within each class.

Think of it like organizing a crowded wedding reception. To keep the "Bride's Family" and "Groom's Family" distinct, you want to put their tables on opposite sides of the room (maximize distance between means), but you also want family members to sit tight in their chairs rather than wandering around (minimize within-class variance).

LDA vs. PCA: The Fundamental Difference

It is impossible to understand LDA without comparing it to its famous cousin, PCA.

If the direction in which the data varies the most happens to be the same direction that separates the classes, PCA and LDA will give similar results. But when the feature that distinguishes the classes carries little overall variance, PCA throws it away and the classes get mashed together.

| Feature | PCA (Principal Component Analysis) | LDA (Linear Discriminant Analysis) |
|---|---|---|
| Type | Unsupervised (ignores labels) | Supervised (uses labels) |
| Goal | Maximize total variance | Maximize class separation |
| Output | Principal Components | Linear Discriminants |
| Max Dimensions | $\min(n\_samples, n\_features)$ | $\min(n\_classes - 1, n\_features)$ |
| Best For | Visualization, Noise Reduction | Pre-processing for Classification |

⚠️ Common Pitfall: Beginners often try to use LDA to project a binary classification problem (2 classes) into 2D for visualization. You can't. LDA is limited to $C - 1$ components. For 2 classes, LDA produces a 1D line, not a 2D plane.

If you need a refresher on how PCA calculates variance, check out our guide on PCA: Reducing Dimensions While Keeping What Matters.

The Intuition: Fisher's Criterion

How do we mathematically define "good separation"?

In 1936, Ronald Fisher proposed a solution now known as Fisher's Linear Discriminant. He realized that looking at the distance between means ($\mu_1 - \mu_2$) isn't enough.

Imagine two clouds of data points.

  • Scenario A: The centers are far apart, but the clouds are huge and puffy. They overlap significantly.
  • Scenario B: The centers are closer together, but the clouds are tiny, tight dots. They don't overlap at all.

Scenario B is better for classification. Fisher formalized this by defining a score (a "criterion") that rewards distance between centers and penalizes the spread (variance) of the data.

$$J(w) = \frac{(\mu_1 - \mu_2)^2}{s_1^2 + s_2^2}$$

In Plain English: after projecting the data onto a direction $w$, the score is the squared distance between the projected class means divided by the combined scatter within the projected classes. We want the top number (distance) to be huge and the bottom number (scatter) to be tiny. The direction $w$ that produces the highest score $J(w)$ is our ideal projection line.
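To make this concrete, here is a minimal sketch of Fisher's score for the apples-versus-oranges idea from the introduction. All numbers are invented for illustration; only the scoring logic matters.

python
import numpy as np

# Toy 2-class data: columns are [weight (g), redness (0-1)] -- purely illustrative values
apples = np.array([[150, 0.80], [160, 0.90], [155, 0.85], [145, 0.90]])
oranges = np.array([[150, 0.30], [158, 0.25], [148, 0.35], [162, 0.30]])

def fisher_score(w, class_a, class_b):
    """Project both classes onto direction w and compute J(w)."""
    a, b = class_a @ w, class_b @ w        # 1D projections
    between = (a.mean() - b.mean()) ** 2   # squared distance between projected means
    within = a.var() + b.var()             # combined scatter within the projected classes
    return between / within

print("Project onto weight:  J =", fisher_score(np.array([1.0, 0.0]), apples, oranges))
print("Project onto redness: J =", fisher_score(np.array([0.0, 1.0]), apples, oranges))
# Weight alone scores poorly (the classes overlap); redness separates them by a wide margin.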

The Mathematics of Separation

To apply this to high-dimensional data with multiple classes, we use matrices. We need to construct two specific matrices: the Within-Class Scatter Matrix ($S_W$) and the Between-Class Scatter Matrix ($S_B$).

1. Within-Class Scatter Matrix ($S_W$)

This matrix measures how much the data spreads out inside each class. We calculate the covariance matrix for each class individually and sum them up.

$$S_W = \sum_{c=1}^{C} \sum_{i=1}^{N_c} (x_i - \mu_c)(x_i - \mu_c)^T$$

In Plain English: $S_W$ represents the "noise" or confusion in the data. It tells us how scattered the "apples" are from the average apple, and how scattered the "oranges" are from the average orange. We want to minimize the influence of this matrix.
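As a quick sketch, $S_W$ can be assembled in a few lines of NumPy. The tiny three-class dataset below is invented purely for illustration:

python
import numpy as np

# Toy data: 6 samples, 2 features, 3 classes (illustrative values only)
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 3.1], [3.2, 2.9], [5.0, 5.2], [5.1, 4.8]])
y = np.array([0, 0, 1, 1, 2, 2])

n_features = X.shape[1]
S_W = np.zeros((n_features, n_features))
for c in np.unique(y):
    X_c = X[y == c]
    mu_c = X_c.mean(axis=0)
    # Sum of outer products of each sample's deviation from its own class mean
    S_W += (X_c - mu_c).T @ (X_c - mu_c)

print(S_W)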

2. Between-Class Scatter Matrix ($S_B$)

This matrix measures how far the class centers are from the global center of the data.

$$S_B = \sum_{c=1}^{C} N_c (\mu_c - \mu)(\mu_c - \mu)^T$$

In Plain English: $S_B$ represents the "signal" or the separation. It measures how far the average apple and average orange are from the center of the entire fruit basket. We want to maximize this matrix.
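And a matching sketch for $S_B$, using the same invented toy data:

python
import numpy as np

# Same illustrative toy data as in the S_W sketch
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 3.1], [3.2, 2.9], [5.0, 5.2], [5.1, 4.8]])
y = np.array([0, 0, 1, 1, 2, 2])

mu = X.mean(axis=0)                        # global mean of the whole "fruit basket"
S_B = np.zeros((X.shape[1], X.shape[1]))
for c in np.unique(y):
    X_c = X[y == c]
    mu_c = X_c.mean(axis=0)
    # Each class contributes N_c copies of the outer product of (mu_c - mu)
    S_B += len(X_c) * np.outer(mu_c - mu, mu_c - mu)

print(S_B)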

3. The Optimization Problem

We want to find a projection vector ww that maximizes the ratio of "Between Scatter" to "Within Scatter."

$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$

To solve this, we don't use gradient descent. It turns out this ratio is maximized by solving the Generalized Eigenvalue Problem:

$$S_W^{-1} S_B w = \lambda w$$

In Plain English: The optimal projection directions (linear discriminants) are simply the eigenvectors of the matrix $S_W^{-1} S_B$. The eigenvector with the largest eigenvalue provides the best separation, the second largest provides the second best, and so on.
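Putting the two matrices together, the sketch below solves the eigenvalue problem directly on the same toy data. This is for intuition only; in practice you would let scikit-learn do the work, since a raw matrix inverse is numerically fragile.

python
import numpy as np

# Same illustrative toy data as above
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 3.1], [3.2, 2.9], [5.0, 5.2], [5.1, 4.8]])
y = np.array([0, 0, 1, 1, 2, 2])

d = X.shape[1]
S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
mu = X.mean(axis=0)
for c in np.unique(y):
    X_c = X[y == c]
    mu_c = X_c.mean(axis=0)
    S_W += (X_c - mu_c).T @ (X_c - mu_c)
    S_B += len(X_c) * np.outer(mu_c - mu, mu_c - mu)

# Solve S_W^{-1} S_B w = lambda w and sort the eigenvectors by eigenvalue
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1]
print("Eigenvalues (sorted):", eigvals.real[order])
print("Top discriminant direction:", eigvecs.real[:, order[0]])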

Because the solution is built from class means and covariance-like scatter matrices, LDA works best when each class is roughly Gaussian in shape. If you are interested in how probabilistic distributions model data, you might enjoy our article on Gaussian Mixture Models.

Python Implementation: LDA in Action

Let's verify LDA's power using Python and the classic Wine dataset. This dataset has 13 features and 3 classes of wine. We will compare how PCA and LDA project this 13-dimensional data down to 2 dimensions.

Step 1: Setup and Loading Data

python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load data
wine = load_wine()
X = wine.data
y = wine.target
target_names = wine.target_names

# Standardize the features (Crucial for PCA, good practice for LDA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Original Data Shape: {X_scaled.shape}")
# Output: Original Data Shape: (178, 13)

Step 2: Applying PCA vs LDA

python
# Run PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Run LDA
# Note: We must pass 'y' (labels) to LDA!
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)

# Visualizing the difference
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

for color, i, target_name in zip(['navy', 'turquoise', 'darkorange'], [0, 1, 2], target_names):
    # Plot PCA
    axes[0].scatter(X_pca[y == i, 0], X_pca[y == i, 1], color=color, alpha=.8, lw=2,
                    label=target_name)
    # Plot LDA
    axes[1].scatter(X_lda[y == i, 0], X_lda[y == i, 1], color=color, alpha=.8, lw=2,
                    label=target_name)

axes[0].legend(loc='best', shadow=False, scatterpoints=1)
axes[0].set_title('PCA of Wine dataset (Unsupervised)')

axes[1].legend(loc='best', shadow=False, scatterpoints=1)
axes[1].set_title('LDA of Wine dataset (Supervised)')

plt.show()

What to expect in the plot:

  • PCA Plot: You will likely see the classes somewhat separated, but there might be significant overlap, especially between similar wine types. PCA preserved the shape, but not necessarily the distinctness.
  • LDA Plot: You will see three tight, distinct clusters with clear gaps between them. LDA deliberately rotated the data to make these gaps as wide as possible.

Step 3: Explained Variance Ratio

Just like PCA, LDA components have "explained variance" (discriminability) ratios.

python
print("LDA Explained Variance Ratio:", lda.explained_variance_ratio_)
# Output Example: [0.687, 0.313]

This tells us that the first discriminant captures roughly 69% of the useful separation information, and the second captures the remaining 31%.
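To go beyond eyeballing the plots, one optional check (not part of the walkthrough above, just a suggested experiment) is to compare a simple classifier trained on the 2D PCA projection versus the 2D LDA projection. Fitting each reducer inside the cross-validation folds avoids leaking label information:

python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# Each pipeline scales, reduces to 2 dimensions, then classifies
pca_pipe = make_pipeline(StandardScaler(), PCA(n_components=2),
                         LogisticRegression(max_iter=1000))
lda_pipe = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(n_components=2),
                         LogisticRegression(max_iter=1000))

print(f"PCA -> LogReg accuracy: {cross_val_score(pca_pipe, X, y, cv=5).mean():.3f}")
print(f"LDA -> LogReg accuracy: {cross_val_score(lda_pipe, X, y, cv=5).mean():.3f}")

Exact scores depend on the folds, but the comparison turns the "variance vs. separability" trade-off into a number you can measure.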

How many dimensions can you keep?

This is the most critical constraint of LDA that trips up practitioners.

In PCA, if you have 100 features, you can find up to 100 principal components. In LDA, the maximum number of components is determined by the number of classes, not features.

$$\text{Max Components} = \min(\text{Number of Features}, \text{Number of Classes} - 1)$$

Why? Because the $C$ class means span a subspace of at most $C - 1$ dimensions. Think about it geometrically:

  • 2 points (class means) can always be joined by a single line (1D).
  • 3 points (class means) always fit on a single plane (2D).
  • 4 points (class means) define at most a 3D volume.

If you have a binary classification problem (Positive vs. Negative), you can only extract 1 LDA component. You cannot make a 2D scatter plot of LDA results for a binary problem—you can only plot them on a 1D number line (histogram).
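Here is a short sketch of that constraint in action. The breast cancer dataset from scikit-learn is used only as a convenient binary (2-class) example:

python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_breast_cancer(return_X_y=True)  # 30 features, 2 classes

# Asking for 2 components with only 2 classes raises an error
try:
    LinearDiscriminantAnalysis(n_components=2).fit(X, y)
except ValueError as e:
    print("Error:", e)

# The most you can extract is min(30, 2 - 1) = 1 component: one number per sample
X_1d = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
print("Projected shape:", X_1d.shape)  # (569, 1)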

Critical Assumptions and Limitations

LDA is powerful, but it is mathematically rigid. It makes strong assumptions about your data.

  1. Normality: It assumes independent variables are normally distributed for each class. If your data is highly skewed, LDA might perform poorly.
  2. Homoscedasticity: It assumes that every class shares the same covariance matrix (i.e., the "shape" and "spread" of each class cloud is roughly the same). If one class is a tight ball and another is a long cigar shape, Quadratic Discriminant Analysis (QDA) is often a better choice.
  3. Outliers: Since LDA relies on mean and variance (squared deviations), it is highly sensitive to outliers. A single extreme value can shift the mean and inflate the scatter matrix, throwing off the projection.
  4. Small Sample Size Problem: If the number of features is greater than the number of samples ($D > N$), the scatter matrix $S_W$ becomes singular (non-invertible), so $S_W^{-1}$ cannot be computed. In these cases, you must use regularization (shrinkage LDA, sketched below) or perform PCA first to reduce dimensions.
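As a rough sketch of the shrinkage workaround mentioned in point 4, the snippet below builds a deliberately wide synthetic matrix (more features than samples, purely illustrative) and fits a shrinkage LDA on it:

python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))     # 40 samples, 200 features: D > N
y = np.repeat([0, 1], 20)
X[y == 1, :5] += 1.0               # shift a few features so there is some real signal

# The 'lsqr' (or 'eigen') solver with shrinkage regularizes the covariance estimate,
# sidestepping the singular scatter matrix; the default 'svd' solver has no shrinkage option
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
print("Training accuracy:", lda.fit(X, y).score(X, y))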

Conclusion

Linear Discriminant Analysis bridges the gap between raw data and classification models. While PCA blindly preserves variance, LDA acts as a smart filter, highlighting exactly the combinations of features that distinguish your classes.

In the modern data science stack, LDA is often used twice:

  1. As a Preprocessor: To compress data before feeding it into a simpler classifier like Logistic Regression or Naive Bayes.
  2. As a Classifier: Since the LDA projection defines linear decision boundaries, it can classify new data points directly based on which class mean they are closest to (see the sketch below).
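A brief sketch of the second use, treating LinearDiscriminantAnalysis as a classifier on the same Wine data (the split and random seed are arbitrary illustrative choices):

python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# No separate classifier needed: LDA predicts directly from the fitted class means
# and shared covariance estimate
clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")
print("Class probabilities, first test sample:", clf.predict_proba(X_test[:1]).round(3))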

If you are dealing with complex data where the separation isn't linear (e.g., the "Swiss Roll" dataset), standard LDA will fail. In those cases, you need to explore non-linear manifold learning techniques like t-SNE or UMAP. However, for structured, tabular data with clear class distinctions, LDA remains one of the most efficient tools in the arsenal.

Hands-On Practice

Linear Discriminant Analysis (LDA) is often called the "supervised sibling" of PCA. While PCA blindly searches for variance, LDA uses class labels to find the projection that best separates your categories. In this hands-on tutorial, we will work with a high-dimensional Wine Analysis dataset to visualize the critical difference between maximizing variance (PCA) and maximizing separability (LDA), and see why LDA is the superior choice for preprocessing before classification.

Dataset: Wine Analysis (High-Dimensional). Wine chemical analysis with 27 features (13 original + 9 derived + 5 noise) and 3 cultivar classes. With PCA, 2 components capture about 45% of the variance, 5 capture 64%, and 10 capture 83%. The noise features have near-zero importance, making this dataset ideal for practicing dimensionality reduction, feature selection, and regularization.

Try It Yourself


Notice how the PCA plot likely showed some overlap between classes, while the LDA plot separated them into distinct, tight clusters. This perfectly illustrates Fisher's criterion: maximizing distance between class means while minimizing variance within classes. Try experimenting by changing n_components in PCA to see how many dimensions are required to match LDA's performance, or inspect the 'noise' columns in the feature importance plot to confirm that LDA correctly identified them as irrelevant.