PCA: Reducing Dimensions While Keeping What Matters

LDS Team · Let's Data Science · 11 min read

Your dataset has 13 chemical measurements for 178 wine samples. Several features track the same underlying chemistry: total phenols, flavanoids, and proanthocyanins all measure polyphenol content from slightly different angles. Feeding all 13 columns into a classifier wastes compute, inflates variance in your estimates, and makes visualization impossible. Principal Component Analysis (PCA) compresses those correlated columns into a smaller set of uncorrelated axes that preserve the information and discard the redundancy.

Karl Pearson proposed the core idea in 1901, and PCA remains the default first move for dimensionality reduction over a century later. The scikit-learn PCA documentation lists it among the library's most-used transformers: fast, deterministic, and backed by clean linear algebra.

Every formula and code block in this article maps to one running example: the classic Wine dataset with 13 chemical features across three cultivars. We'll compress 13 dimensions into 2, watch three grape varieties separate clearly on screen, and learn exactly which original features drive each principal component.

[Figure: PCA pipeline showing standardization, covariance computation, eigendecomposition, and projection steps]

Variance as a Proxy for Information

PCA treats variance as a stand-in for information content. A feature with zero variance tells you nothing: if every wine sample has the same ash content, that column cannot distinguish Cultivar 1 from Cultivar 3. High variance spreads data points apart, giving a classifier something to work with.

| Feature | Range | Variance | Information Value |
|---|---|---|---|
| Alcohol | 11.0 to 14.8% | High | Separates cultivars well |
| Proline | 278 to 1680 mg/L | Very high (raw) | Strong signal, but scale dominates |
| Hue | 0.48 to 1.71 | Low (raw) | Still informative after scaling |
| Constant column | Always 1.00 | Zero | Useless |

PCA finds the directions in your data cloud that maximize this spread. The first principal component (PC1) points along the widest axis. The second (PC2) points along the widest remaining axis that is perpendicular to PC1. Each subsequent component captures progressively less variance.

Key Insight: PCA is a feature extraction technique, not feature selection. Feature selection picks existing columns and drops others. PCA creates entirely new variables by combining the originals. The distinction matters because PCA components lose the semantic meaning of the original features.

The Geometry Behind PCA

You might recall linear regression, which finds a line minimizing vertical distance between data points and predictions. PCA finds a line minimizing the perpendicular distance between each point and the line. Minimizing perpendicular distance turns out to be mathematically equivalent to maximizing the variance of the projected points.

Picture a cloud of wine samples in 13-dimensional space. PCA rotates the coordinate system so that:

  1. PC1 aligns with the direction of maximum spread in the data cloud.
  2. PC2 aligns with the maximum spread in the remaining directions, constrained to be orthogonal (90 degrees) to PC1.
  3. PC3 through PC13 continue this pattern, each orthogonal to all previous components.

Because every component is perpendicular to the others, all principal components are uncorrelated by construction. This is one of PCA's most useful properties: it removes multicollinearity entirely.

[Figure: Comparison of PCA projection versus linear regression showing perpendicular vs vertical error minimization]

The Mathematics Step by Step

The geometric intuition is clean, but the engine underneath is linear algebra. PCA relies on eigendecomposition of the covariance matrix. Let's walk through each step using our wine example.

Step 1: Standardization

PCA is extremely sensitive to scale. Proline ranges from 278 to 1680 (mg/L). Hue ranges from 0.48 to 1.71. Without standardization, PCA will think proline carries all the information simply because its numbers are bigger.

z = \frac{x - \mu}{\sigma}

Where:

  • x is the original feature value
  • μ is the mean of that feature across all samples
  • σ is the standard deviation of that feature
  • z is the standardized value (mean 0, variance 1)

In Plain English: Standardization puts every wine measurement on an equal footing. We subtract the average so each feature centers at zero, then divide by the spread so that proline (measured in hundreds) and hue (measured in decimals) contribute fairly. After scaling, both columns have a standard deviation of 1.0.
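As a sanity check, here is a minimal sketch of the z-score transform done both by hand in NumPy and with scikit-learn's StandardScaler; on the Wine data the two agree exactly:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X = load_wine().data                              # shape (178, 13)

# Manual z-score: subtract each column's mean, divide by its std
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transform via scikit-learn
X_scaled = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_scaled))            # True
print(X_scaled.std(axis=0).round(6))              # every column is now 1.0
```

Note that both use the population standard deviation (dividing by n), so the outputs match to machine precision.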

The cost of skipping this step is severe. Here's what happens:

code
PCA WITHOUT scaling:
PC1 explains 99.8% of variance
PC2 explains 0.2% of variance

Top 3 PC1 loadings (unscaled):
  proline                       : +0.9998  (range: 278 to 1680)
  magnesium                     : +0.0179  (range: 70 to 162)
  alcalinity_of_ash             : -0.0047  (range: 11 to 30)

PCA WITH scaling:
PC1 explains 36.2% of variance
PC2 explains 19.2% of variance

Top 3 PC1 loadings (scaled):
  flavanoids                    : +0.4229
  total_phenols                 : +0.3947
  od280/od315_of_diluted_wines  : +0.3762

Common Pitfall: Without scaling, proline dominates PC1 with a loading of 0.9998. PCA essentially becomes "the proline axis." After scaling, the loadings spread across flavanoids, phenols, and diluted wines, which is a far more informative decomposition. Always standardize before PCA unless all features share the same unit and scale.

Step 2: The Covariance Matrix

After standardization, we compute the covariance matrix to quantify how features move together.

\Sigma = \frac{1}{n-1} (X - \bar{X})^T (X - \bar{X})

Where:

  • Σ is the d × d covariance matrix (13 × 13 for the wine data)
  • X is the n × d data matrix (178 samples, 13 features)
  • X̄ is the column-wise mean of X
  • n is the number of samples
  • The diagonal entries are variances; off-diagonal entries are covariances

In Plain English: The covariance matrix is a relationship report card for your dataset. The diagonal tells you how much each wine feature varies on its own. The off-diagonal tells you whether features move together: when flavanoids go up, do total phenols go up too? (Yes, they do. Their covariance is strongly positive.)

Step 3: Eigendecomposition

This is the core of PCA. We find the eigenvectors and eigenvalues of the covariance matrix.

\Sigma v = \lambda v

Where:

  • Σ is the covariance matrix
  • v is an eigenvector (a direction in feature space)
  • λ is the corresponding eigenvalue (a scalar measuring variance along that direction)

In Plain English: Most vectors change direction when you multiply them by a matrix. Eigenvectors are the special directions that stay pointed the same way; the matrix only stretches them. Each eigenvector becomes a principal component axis, and its eigenvalue tells you how much variance that axis captures. Bigger eigenvalue means more information.
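A short sketch of this step: np.linalg.eigh is the right routine for symmetric matrices like Σ, and we can verify the defining property Σv = λv directly on the wine covariance matrix:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)
cov = np.cov(X, rowvar=False)

# eigh returns eigenvalues in ascending order, so re-sort descending
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Check the defining property for the top eigenvector
v, lam = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(cov @ v, lam * v))              # True
print(eigenvalues[:2].round(4))                   # [4.7324 2.5111]
```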

Step 4: Sort and Project

We sort eigenvectors by descending eigenvalue. The top k eigenvectors form our projection matrix W_k, and we transform the data:

Z = X W_k

Where:

  • X is the standardized n × d data matrix
  • W_k is the d × k matrix of the top k eigenvectors
  • Z is the projected n × k data in the reduced space

In Plain English: We keep only the most informative axes and throw away the rest. For the wine data, keeping 5 of 13 eigenvectors captures over 80% of the total variance. The projected data Z has fewer columns but retains most of the original structure.

Here is the full eigendecomposition computed from scratch, then verified against scikit-learn:
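The computation might look like this minimal sketch (eigenvector signs can differ between np.linalg.eigh and scikit-learn's SVD-based solver, so the comparison uses absolute values):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)

# Steps 2-4: covariance -> eigendecomposition -> sort -> project to k = 2
cov = np.cov(X, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]
Z = X @ vecs[:, :2]                               # projected data, (178, 2)

# Verify against scikit-learn; component signs are arbitrary
Z_sk = PCA(n_components=2).fit_transform(X)
print(np.allclose(np.abs(Z), np.abs(Z_sk)))       # True
print(abs(np.corrcoef(Z[:, 0], Z[:, 1])[0, 1]) < 1e-10)   # True: orthogonal
```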

code
Covariance matrix shape: (13, 13)
Diagonal (variances) first 5: [1.006 1.006 1.006 1.006 1.006]

Top 5 eigenvalues: [4.7324 2.5111 1.4542 0.9242 0.858 ]
Proportion explained: [0.362  0.1921 0.1112 0.0707 0.0656]

Projected shape: (178, 2)
Variance along PC1: 4.7059
Variance along PC2: 2.4970
Correlation PC1-PC2: -0.000000

The diagonal values are all approximately 1.006 because we standardized first: each feature has unit variance, and the slight deviation from 1.0 comes from Bessel's correction dividing by n − 1 instead of n. The correlation between PC1 and PC2 is zero to numerical precision, confirming orthogonality.

Choosing the Right Number of Components

Choosing k is the central practical decision. Keep too few components and you lose signal. Keep too many and you defeat the purpose of reduction. Two standard approaches help.

The Scree Plot

A scree plot displays the variance explained by each component in descending order. You look for an "elbow" where the curve flattens, similar to the elbow method in K-means clustering.

The Cumulative Variance Threshold

A more principled approach: pick the smallest k such that the cumulative explained variance crosses a target (commonly 80%, 90%, or 95%).
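scikit-learn supports this threshold directly: passing a float strictly between 0 and 1 as n_components keeps just enough components to reach that fraction of variance. A short sketch on the wine data:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)

# Manual cumulative-variance threshold
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
k_80 = int(np.argmax(cumulative >= 0.80)) + 1
print(k_80)                                       # 5

# Or let sklearn choose: a float n_components means "variance to keep"
pca_80 = PCA(n_components=0.80).fit(X)
print(pca_80.n_components_)                       # 5
```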

code
Component | Individual % | Cumulative %
------------------------------------------
PC 1      |        36.20 |       36.20
PC 2      |        19.21 |       55.41
PC 3      |        11.12 |       66.53
PC 4      |         7.07 |       73.60
PC 5      |         6.56 |       80.16
PC 6      |         4.94 |       85.10
PC 7      |         4.24 |       89.34
PC 8      |         2.68 |       92.02
PC 9      |         2.22 |       94.24
PC10      |         1.93 |       96.17
PC11      |         1.74 |       97.91
PC12      |         1.30 |       99.20
PC13      |         0.80 |      100.00

Components for 80% variance: 5
Components for 90% variance: 8
Components for 95% variance: 10

Five components capture 80% of the variance in 13 features. That is a compression ratio of roughly 2.6:1 with only 20% information loss. For visualization, 2 components (55.4%) are usually enough to reveal cluster structure. For downstream modeling, 8 components (92%) is a solid default.

Pro Tip: There is no universally correct threshold. For exploratory visualization, 2 or 3 components are fine even at 60% variance. For a production classifier where every percentage of accuracy matters, 95% is safer. Always validate with a downstream metric rather than picking a number in isolation.

[Figure: Scree plot showing explained variance by component with elbow at PC5 for the wine dataset]

Interpreting Loadings

One of PCA's biggest criticisms is the loss of interpretability. When you blend alcohol, malic acid, and ash into "PC1," what does PC1 actually represent?

Loadings answer this. Each loading is the weight that an original feature contributes to a given component. High absolute loading means that feature strongly influences the component.
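The loadings table below can be reproduced from pca.components_, whose rows each hold one component's weights (signs may be flipped relative to the table, since eigenvector signs are arbitrary):

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_wine()
X = StandardScaler().fit_transform(data.data)
pca = PCA(n_components=2).fit(X)

# components_ has shape (k, d): each row is one component's loadings
loadings = pd.DataFrame(pca.components_.T,
                        index=data.feature_names,
                        columns=["PC1", "PC2"])

# Rank features by absolute contribution to PC1
top5 = loadings["PC1"].abs().sort_values(ascending=False).head(5)
print(top5)                                       # flavanoids leads PC1
```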

code
Feature Loadings:
                                 PC1     PC2
alcohol                       0.1443  0.4837
malic_acid                   -0.2452  0.2249
ash                          -0.0021  0.3161
alcalinity_of_ash            -0.2393 -0.0106
magnesium                     0.1420  0.2996
total_phenols                 0.3947  0.0650
flavanoids                    0.4229 -0.0034
nonflavanoid_phenols         -0.2985  0.0288
proanthocyanins               0.3134  0.0393
color_intensity              -0.0886  0.5300
hue                           0.2967 -0.2792
od280/od315_of_diluted_wines  0.3762 -0.1645
proline                       0.2868  0.3649

Top 5 features for PC1 (by absolute loading):
  flavanoids                     +0.4229
  total_phenols                  +0.3947
  od280/od315_of_diluted_wines   +0.3762
  proanthocyanins                +0.3134
  nonflavanoid_phenols           -0.2985

Top 5 features for PC2 (by absolute loading):
  color_intensity                +0.5300
  alcohol                        +0.4837
  proline                        +0.3649
  ash                            +0.3161
  magnesium                      +0.2996

PC1 loads heavily on flavanoids, total phenols, and diluted wines (all positive) versus nonflavanoid phenols (negative). You could label this axis "polyphenol richness." Wines scoring high on PC1 have more complex phenolic profiles. PC2 picks up color intensity, alcohol, and proline, which you could call the "body and color" axis.

This kind of interpretation is always subjective, but it gives stakeholders something meaningful to discuss instead of abstract component numbers.

Complete PCA Pipeline with sklearn

Here is the full workflow: load, scale, fit PCA, and compare model accuracy with and without dimensionality reduction.
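One way to produce a comparison like the output below; exact accuracies depend on the train/test split, and the 30% stratified split with random_state=42 is an assumption here, not necessarily the article's setup:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

scores = {}
for k in [None, 2, 5, 8]:                 # None = keep all 13 features
    steps = [("scale", StandardScaler())]
    if k is not None:
        steps.append(("pca", PCA(n_components=k)))
    steps.append(("clf", LogisticRegression(max_iter=1000)))
    scores[k] = Pipeline(steps).fit(X_tr, y_tr).score(X_te, y_te)
    label = "All 13" if k is None else f"PCA {k}"
    print(f"{label:>7} | {scores[k]:.4f}")
```

Because scaling and PCA live inside the Pipeline, both are fit on the training fold only, which is exactly the leakage-free pattern the Pro Tip below describes.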

code
Original shape: (178, 13)

  Features | Dims | Accuracy
--------------------------------
    All 13 |   13 | 0.9815
     PCA 2 |    2 | 0.9630
     PCA 5 |    5 | 0.9630
     PCA 8 |    8 | 0.9815

With just 2 components (15% of the original dimensionality), logistic regression still achieves 96.3% accuracy. At 8 components, accuracy matches the full 13-feature model exactly. This is the practical payoff of PCA: fewer features, faster training, same performance.

Pro Tip: Always fit PCA on the training set only, then call .transform() on the test set. Fitting on the full dataset before splitting causes data leakage because the test set's variance influences the component directions. In a production pipeline, use sklearn.pipeline.Pipeline to keep this automatic.
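A minimal sketch of the leakage-free pattern done by hand: fit on train, then apply the frozen transforms to test:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_tr, X_te = train_test_split(X, test_size=0.3, random_state=0)

# Fit the scaler and PCA on TRAIN only...
scaler = StandardScaler().fit(X_tr)
pca = PCA(n_components=2).fit(scaler.transform(X_tr))

# ...then only transform() the test set: its variance never
# influences the learned component directions
Z_te = pca.transform(scaler.transform(X_te))
print(Z_te.shape)                                 # (54, 2)
```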

The set_output API for Cleaner Code

Since scikit-learn 1.2, every transformer supports set_output(transform="pandas"), which returns DataFrames instead of raw NumPy arrays. This is especially convenient with PCA because you get named columns:

python
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
pca.set_output(transform="pandas")
result = pca.fit_transform(X_scaled_df)
# result is a DataFrame with columns: ['pca0', 'pca1', 'pca2']

In scikit-learn 1.8 (December 2025), this API is stable and works across all transformers including IncrementalPCA and Pipeline objects.

When to Use PCA

PCA is the right tool in specific situations. Reaching for it by default on every dataset is a mistake.

Use PCA when:

| Scenario | Why PCA Helps |
|---|---|
| Many correlated features | PCA collapses redundancy into fewer orthogonal axes |
| Visualization of high-D data | 2 or 3 components give a quick visual summary |
| Preprocessing before distance-based models | KNN, SVM, and K-means all suffer from the curse of dimensionality |
| Noise reduction | Low-variance components often capture measurement noise |
| Multicollinearity in regression | PCA components are uncorrelated by definition |

Do NOT use PCA when:

| Scenario | Why PCA Fails |
|---|---|
| Interpretability is critical | "PC3 went up" means nothing to a business stakeholder |
| Relationships are nonlinear | Swiss-roll data gets smashed flat; use t-SNE or UMAP instead |
| Features are already independent | If features are uncorrelated, PCA just reorders them |
| Sparse data (NLP, recommender systems) | Centering destroys sparsity; use TruncatedSVD |
| You need supervised reduction | PCA ignores labels; LDA maximizes class separation |

Key Insight: PCA maximizes variance, not class separation. Two classes might overlap entirely along the direction of highest variance while being perfectly separable along a low-variance direction. If you have labels, at least compare PCA results with LDA before committing.
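A quick way to see the difference on the wine data, using silhouette score as a rough measure of how cleanly the three cultivars separate in each 2-D space (LDA can yield at most classes − 1 = 2 components here):

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

Z_pca = PCA(n_components=2).fit_transform(X)      # labels never seen
Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

# LDA, which optimizes for class separation, scores higher here
print("PCA silhouette:", round(silhouette_score(Z_pca, y), 3))
print("LDA silhouette:", round(silhouette_score(Z_lda, y), 3))
```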

PCA Compared to Alternatives

Choosing between dimensionality reduction techniques depends on your data and your goal. This table provides a quick decision framework. For a deeper comparison, see Feature Selection vs Feature Extraction.

| Method | Linear? | Supervised? | Preserves | Best For |
|---|---|---|---|---|
| PCA | Yes | No | Global variance | General-purpose reduction, denoising |
| LDA | Yes | Yes | Class separability | Classification preprocessing |
| t-SNE | No | No | Local neighborhoods | 2D/3D visualization only |
| UMAP | No | Optional | Local + some global | Visualization, faster than t-SNE |
| Autoencoders | No | No | Learned representation | Complex nonlinear compression |
| TruncatedSVD | Yes | No | Variance (no centering) | Sparse matrices (NLP, TF-IDF) |

[Figure: Decision matrix comparing PCA, LDA, t-SNE, UMAP, and TruncatedSVD across key criteria]

Production Considerations

Computational Complexity

Standard PCA computes the full SVD, which costs O(n·d²) for n samples and d features. For the wine dataset (178 × 13), this is trivial. For a genomics dataset with 20,000 genes across 500 patients, the covariance matrix alone is 20,000 × 20,000, which takes real time and memory.

| Variant | Time Complexity | Memory | Use Case |
|---|---|---|---|
| Full PCA (PCA) | O(n·d²) | O(d²) | d < 5,000 |
| Randomized PCA (PCA(svd_solver='randomized')) | O(n·d·k) | O(d·k) | Large d, small k |
| IncrementalPCA | O(b·d²) per batch | O(b·d) | Streaming data, memory-constrained |
| TruncatedSVD | O(n·d·k) | O(d·k) | Sparse matrices |

Where k is the number of components and b is the batch size.

Incremental PCA for Large Datasets

When your data doesn't fit in memory, IncrementalPCA processes it in batches. Each batch contributes to a running estimate of the principal components:

python
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=50, batch_size=1000)
for batch in data_generator():
    ipca.partial_fit(batch)

X_reduced = ipca.transform(X_new)

This is essential for datasets with millions of rows. The memory footprint stays proportional to batch_size * n_features regardless of the total dataset size.

Sparse Data

PCA requires centering (subtracting the mean), which converts a sparse matrix to dense and destroys the memory advantage of sparsity. For text data represented as TF-IDF vectors, use TruncatedSVD instead. It performs the same decomposition without centering, keeping the matrix sparse throughout.
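A small sketch with a synthetic sparse matrix standing in for a TF-IDF document-term matrix (the 1000 × 5000 shape and 1% density are arbitrary choices for illustration):

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix in CSR format, ~1% of entries nonzero
X_sparse = sp.random(1000, 5000, density=0.01, format="csr", random_state=0)

# TruncatedSVD accepts the sparse input directly: no centering, no densifying
svd = TruncatedSVD(n_components=50, random_state=0)
Z = svd.fit_transform(X_sparse)

print(sp.issparse(X_sparse), Z.shape)             # True (1000, 50)
```

Running standard PCA on the same input would first require converting it to a dense 1000 × 5000 array, which is exactly the memory blow-up TruncatedSVD avoids.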

Limitations and Failure Modes

PCA is powerful within its assumptions, but those assumptions break down in predictable ways.

Linearity. PCA finds linear projections. If your data lies on a curved manifold (a spiral, a Swiss roll, or any surface that bends through high-dimensional space), PCA will flatten the curvature and mix together points that should stay separated.

Outlier sensitivity. PCA maximizes variance, and outliers inflate variance. A single extreme point can tilt a principal component toward itself, distorting the entire decomposition. Detect and remove (or clip) outliers before fitting PCA, or preprocess with a robust scaler so extreme values carry less weight.
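A toy illustration of the tilt, on synthetic 2-D data rather than the wine set: one extreme point rotates PC1 by a large angle.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2)) * [3.0, 1.0]        # cloud elongated along x

pc1_clean = PCA(n_components=1).fit(X).components_[0]

# One extreme point, far off the main axis, pulls PC1 toward itself
X_out = np.vstack([X, [0.0, 60.0]])
pc1_out = PCA(n_components=1).fit(X_out).components_[0]

# Angle between the two PC1 directions (sign-insensitive)
cos = abs(float(pc1_clean @ pc1_out))
angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
print(round(angle, 1))            # large tilt: PC1 has swung toward the outlier
```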

Mean-centering assumption. PCA assumes the data is centered. Scikit-learn handles this automatically, but if you preprocess data outside sklearn (say, in a Spark pipeline), forgetting to center will produce wrong results silently.

Orthogonality constraint. Real-world factors of variation are rarely perfectly orthogonal. The "quality" axis of wine might genuinely correlate with the "body" axis in nature. PCA forces them apart, which can sometimes split a single meaningful factor across two components.

Conclusion

PCA reduces dimensionality by rotating your coordinate system to align with the directions of maximum variance, then discarding the axes that carry the least information. The math is clean: standardize, compute the covariance matrix, extract eigenvectors, and project. The practical value is equally clear: fewer features, faster models, better generalization, and the ability to visualize structure that high-dimensional spaces hide.

The most important takeaway is knowing PCA's boundaries. It works brilliantly for correlated, linearly structured data. It fails for nonlinear manifolds, sparse matrices, and cases where you need interpretable features. For nonlinear visualization, t-SNE and UMAP are better tools. For supervised reduction with class labels, LDA is worth comparing against PCA every time.

Start with PCA. If the scree plot shows a clean elbow and downstream accuracy holds, you're done. If not, move to more flexible methods.

Frequently Asked Interview Questions

Q: PCA maximizes variance, but variance is not always the same as useful information. When does this assumption break down?

Variance equals information only when the signal you care about is spread along high-variance directions. In classification, two classes might be perfectly separable along a low-variance axis while overlapping completely along the highest-variance one. PCA would keep the useless axis and discard the discriminative one. LDA addresses this by incorporating class labels to maximize between-class separation instead of total variance.

Q: You run PCA on a dataset and PC1 explains 99% of variance. Is this a good sign?

Not necessarily. It likely means one feature dominates the others in scale. If you forgot to standardize, proline (range 278 to 1680) will overwhelm hue (range 0.48 to 1.71). Check whether you applied StandardScaler. If you did and one component still dominates, it means the data genuinely has one dominant direction of variation, which is fine.

Q: A colleague suggests using PCA before running a random forest. Is this a good idea?

Generally no. Random forests are invariant to feature scale and handle correlated features through random subspace selection. PCA destroys interpretable feature names and the orthogonality it provides is not needed by tree-based models. PCA before linear models (logistic regression, SVM) is more useful because those models struggle with multicollinearity and high dimensionality.

Q: Explain the difference between PCA and feature selection in one sentence each.

PCA creates new features by linearly combining all original features to maximize variance. Feature selection picks a subset of original features and discards the rest, preserving interpretability but ignoring correlations between the kept features.

Q: How do you decide between 80%, 90%, and 95% variance thresholds?

For visualization, 2 to 3 components are fine even at 50 to 60% variance. For a downstream classifier, start at 90% and check whether accuracy degrades compared to using all features. If it does not, stay at 90%. If the task is sensitive to small signals (medical diagnosis, fraud detection), go to 95% or higher. Always validate with a held-out test set rather than relying on the variance number alone.

Q: Can PCA be applied to categorical features?

Standard PCA operates on continuous numerical data. For categorical features, you would need to encode them first (one-hot, target encoding), but PCA on one-hot encoded data is problematic because the binary nature violates the continuous variance assumption. Multiple Correspondence Analysis (MCA) is the categorical analog of PCA and handles this properly.

Q: Your PCA components explain 95% of variance, but your model's accuracy dropped after reduction. What went wrong?

The 5% discarded variance likely contained the discriminative signal for your specific target. Variance explained is measured without reference to any label. A small-variance component might be the only one that separates classes. Try increasing the number of components, or switch to LDA which directly optimizes for class separation. Another possibility: the downstream model was powerful enough to handle the full dimensionality, and PCA introduced unnecessary information loss.

Q: What happens if you apply PCA to data that is already uncorrelated?

PCA still works, but it does not help. If features are already uncorrelated, the covariance matrix is diagonal, and the eigenvectors align with the original feature axes. PCA simply reorders the features by decreasing variance. You get no compression benefit because there is no redundancy to remove. In this case, simple feature selection by variance would achieve the same result with less computation.

Hands-On Practice

We will explain Principal Component Analysis (PCA) by applying it to a high-dimensional wine dataset. You will see firsthand how PCA transforms complex, correlated data into a compact set of orthogonal features (principal components) that preserve the most critical information. By visualizing the transition from 27 noisy features down to just a few powerful components, you'll gain an intuitive understanding of variance as information and learn how to effectively battle the "Curse of Dimensionality."

Dataset: Wine Analysis (High-Dimensional) Wine chemical analysis with 27 features (13 original + 9 derived + 5 noise) and 3 cultivar classes. PCA: 2 components=45%, 5=64%, 10=83% variance. Noise features have near-zero importance. Perfect for dimensionality reduction, feature selection, and regularization.

Try changing n_components in the final classification step to 5 or 10 and observe whether accuracy reaches 100%. You can also experiment with the StandardScaler step: comment it out to see how drastically unscaled data affects PCA performance (spoiler: the feature with the largest numbers will dominate the variance). Finally, look closely at the loadings to identify which chemical properties are the primary drivers of wine differences.

Explore all career paths