
UMAP Explained: The Faster, Smarter Alternative to t-SNE

LDS Team · Let's Data Science · 12 min read

You have 1,797 handwritten digits, each described by 64 pixel values. Plotting them in 2D should reveal ten clean groups, one per digit. PCA mashes them into an overlapping blob. t-SNE produces beautiful islands but scrambles the distances between them and takes minutes on anything larger than 100K rows.

UMAP (Uniform Manifold Approximation and Projection) fixes both problems. Published in 2018 by Leland McInnes, John Healy, and James Melville, UMAP grounds dimensionality reduction in algebraic topology instead of probability distributions. The result: an algorithm that runs in $O(n \log n)$ time, preserves global relationships t-SNE throws away, and ships with a transform() method that can embed new data without re-fitting. As of umap-learn 0.5.11 (January 2026), the library remains the most widely used non-linear projection tool in production data science and single-cell genomics alike.

Throughout this article, we will use the MNIST digits dataset as a running example to build every concept from scratch.

Dimensionality Reduction and the Manifold Hypothesis

Dimensionality reduction compresses high-dimensional data into fewer dimensions while retaining as much structure as possible. The manifold hypothesis says real-world data rarely fills its full feature space; instead, it lives on a lower-dimensional surface (a "manifold") embedded in that space.

Consider our digits. Each 8x8 image occupies a point in 64-dimensional space, but the actual variety of handwritten digits traces a much lower-dimensional surface. A "3" can slant, grow thick or thin, yet all its variants lie on a smooth manifold within that 64D cube. The goal is to flatten that manifold onto a 2D plane without tearing it.

PCA does this with a linear projection: it finds the two directions of maximum variance and discards the rest. On digits, that keeps only 21.6% of the variance and drops 5-NN classification accuracy from 94.4% to 51.4%. The curved manifold that separates a "3" from an "8" gets crushed.

[Figure: UMAP algorithm pipeline from high-dimensional input through graph construction to low-dimensional embedding]
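The baseline experiment can be reproduced with scikit-learn alone. This sketch uses 5-fold cross-validation for the 5-NN score, so the exact accuracies may differ slightly from the figures shown below:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target

def knn_accuracy(features):
    # Mean 5-fold cross-validated accuracy of a 5-NN classifier
    return cross_val_score(KNeighborsClassifier(n_neighbors=5), features, y, cv=5).mean()

print(f"5-NN on full 64D: {knn_accuracy(X):.3f}")
for n in (2, 10):
    pca = PCA(n_components=n).fit(X)
    acc = knn_accuracy(pca.transform(X))
    var = pca.explained_variance_ratio_.sum()
    print(f"5-NN on PCA {n}D: {acc:.3f} ({var:.1%} variance retained)")
```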

Expected output:

```text
Dataset: 1797 handwritten digits, 64 pixel features
Task: reduce 64 dims to 2 dims, then classify with 5-NN

5-NN on full 64D:    0.944 accuracy
5-NN on PCA 2D:      0.514 accuracy (21.6% variance retained)
5-NN on PCA 10D:     0.896 accuracy (58.9% variance retained)

Crushing 64 features into 2 with PCA destroys 46% of
classification power. Non-linear methods like UMAP preserve far
more structure in those 2 dimensions.
```

Key Insight: PCA's linear projection loses nearly half the classification signal when going from 64D to 2D. Non-linear methods like UMAP and t-SNE can preserve far more of the manifold's curved geometry in just two dimensions.

Local Metrics: How UMAP Adapts to Varying Density

UMAP handles varying data density by assigning every point its own personal distance scale. Dense neighborhoods get a short ruler; sparse regions get a stretched one. This adaptive scaling is the core idea that separates UMAP from naive distance-based methods.

Here is the precise mechanism. For each point $x_i$, UMAP computes:

  1. $\rho_i$: the distance to the nearest neighbor.
  2. $\sigma_i$: a scaling factor found via binary search so that the sum of neighbor weights equals $\log_2(k)$.

The connection weight from $x_i$ to $x_j$ is then:

$$w_{i,j} = \exp\left(-\frac{d(x_i, x_j) - \rho_i}{\sigma_i}\right)$$

Where:

  • $w_{i,j}$ is the directed connection strength from point $i$ to point $j$
  • $d(x_i, x_j)$ is the distance between the two points in the original high-dimensional space
  • $\rho_i$ is the distance from $x_i$ to its nearest neighbor (guarantees at least one connection with weight 1.0)
  • $\sigma_i$ is the local bandwidth that normalizes for density around $x_i$

In Plain English: Subtracting $\rho_i$ means every point's nearest neighbor gets a weight of exactly 1.0, whether that neighbor is 0.9 units away (inside a tight cluster of digit-0 images) or 44.4 units away (an oddball digit-2 sitting alone). UMAP does not care about absolute isolation; it guarantees every point connects to something.

This is what makes UMAP density-invariant. In our digits dataset, the ratio between the most isolated point's $\rho$ and the least isolated point's $\rho$ is 47.5x. Without local metric normalization, those sparse points would float as disconnected islands and the embedding would be meaningless.
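The local metric can be sketched with scikit-learn's NearestNeighbors. This is a simplified approximation of what umap-learn does internally (the library also clips the bandwidth and exposes local-connectivity options), so treat the exact values as approximate:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

k = 15
X = StandardScaler().fit_transform(load_digits().data)

# Distances to the k nearest neighbors (column 0 is the point itself, so drop it)
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
knn_dists = dists[:, 1:]

rho = knn_dists[:, 0]  # distance to the nearest neighbor

def find_sigma(ds, r, target=np.log2(k), n_iter=64):
    """Binary search for sigma so the neighbor weights sum to log2(k)."""
    lo, hi = 1e-8, 1e4
    for _ in range(n_iter):
        mid = (lo + hi) / 2
        if np.exp(-np.maximum(ds - r, 0) / mid).sum() > target:
            hi = mid  # weights too heavy: shrink the bandwidth
        else:
            lo = mid  # weights too light: stretch the bandwidth
    return (lo + hi) / 2

sigma = np.array([find_sigma(knn_dists[i], rho[i]) for i in range(len(X))])

print(f"Mean rho: {rho.mean():.3f}")
print(f"Ratio max/min rho: {rho.max() / rho.min():.1f}x")
```

By construction, each point's nearest neighbor gets weight exp(0) = 1.0, and the remaining weights sum (with it) to roughly log2(15) ≈ 3.91.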

Expected output:

```text
--- Local Metric Statistics (k=15) ---
Mean rho (nearest-neighbor dist): 3.689
Std  rho: 2.008
Min  rho: 0.936
Max  rho: 44.424
Ratio max/min: 47.5x

Point 0 (digit=0):
  rho (nearest neighbor dist): 2.3312
  sigma (approx, k-th neighbor): 3.7118

Neighbor | Distance | UMAP Weight
--------------------------------------
  877     | 2.3312   | 1.0000
  1541    | 3.0666   | 0.8203
  1167    | 3.0872   | 0.8157
  1365    | 3.1844   | 0.7947
  464     | 3.2440   | 0.7820
  ...     | ...      | ...
  812     | 3.7118   | 0.6894
```

Notice how the nearest neighbor (point 877, also a digit 0) gets weight 1.0 by construction, and the weight decays smoothly to 0.69 at the 15th neighbor. This exponential decay is controlled entirely by the local density around point 0.

Fuzzy Simplicial Sets: Building the High-Dimensional Graph

A fuzzy simplicial set is UMAP's version of a nearest-neighbor graph where each edge carries a probability of existing rather than a binary yes/no. This probabilistic framing comes from algebraic topology and is what gives UMAP its mathematical rigor (McInnes et al., 2018 arXiv paper).

The weight $w_{i,j}$ is directional: point $i$'s view of point $j$ differs from point $j$'s view of point $i$, because each point has its own $\rho$ and $\sigma$. UMAP symmetrizes these directed weights using a fuzzy set union:

$$P_{ij} = w_{i,j} + w_{j,i} - w_{i,j} \cdot w_{j,i}$$

Where:

  • $P_{ij}$ is the symmetric edge weight (probability of connection) between points $i$ and $j$
  • $w_{i,j}$ is the directed weight from $i$'s perspective
  • $w_{j,i}$ is the directed weight from $j$'s perspective

In Plain English: This is just the probability formula for "A or B": $P(A \cup B) = P(A) + P(B) - P(A) \cdot P(B)$. If our oddball digit-2 (with $\rho = 44.4$) considers a digit-4 its closest neighbor, it assigns weight 1.0. The digit-4, sitting in a dense cluster, barely notices the digit-2 and assigns weight 0.04. The fuzzy union gives $1.0 + 0.04 - 1.0 \cdot 0.04 = 1.0$, so the connection survives. UMAP's graph is generous: if either side considers the other a neighbor, the bond stays.

This one-sided acceptance is critical. Without it, outliers would disconnect from the graph entirely, and the embedding would lose them.
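The fuzzy union is a one-liner on a weight matrix. Here is a toy sketch with made-up directed weights; the 0.04/1.0 pair mirrors the digit-2/digit-4 example above:

```python
import numpy as np

# Toy directed weight matrix W (hypothetical values) for 3 points;
# W[i, j] is point i's view of point j
W = np.array([
    [0.0, 1.0, 0.04],
    [0.6, 0.0, 1.0],
    [1.0, 0.2, 0.0],
])

# Fuzzy set union, applied element-wise: P = W + W^T - W * W^T
P = W + W.T - W * W.T

print(P)
# The (0, 2) edge: 0.04 from one side, 1.0 from the other
# -> 0.04 + 1.0 - 0.04*1.0 = 1.0, so the bond survives at full strength
```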

Cross-Entropy Optimization: Attraction and Repulsion

UMAP optimizes the low-dimensional layout by minimizing the cross-entropy between the high-dimensional graph $P$ and the low-dimensional graph $Q$. This choice of loss function is the mathematical reason UMAP preserves global structure better than t-SNE.

$$C = \sum_{i \neq j} \left[ P_{ij} \log\left(\frac{P_{ij}}{Q_{ij}}\right) + (1 - P_{ij}) \log\left(\frac{1 - P_{ij}}{1 - Q_{ij}}\right) \right]$$

Where:

  • $C$ is the total cross-entropy cost
  • $P_{ij}$ is the high-dimensional edge probability (from the fuzzy simplicial set)
  • $Q_{ij}$ is the low-dimensional edge probability (computed from embedded positions)
  • The first term is the attraction force (active when $P_{ij}$ is large)
  • The second term is the repulsion force (active when $P_{ij}$ is small)

In Plain English: t-SNE only has the first term, the attraction. It pulls neighbors together but never explicitly pushes non-neighbors apart. UMAP has both. Think of arranging our digit clusters on a table: t-SNE makes sure each group of 0s, 1s, 2s stays clustered, but it might stack the 3s right on top of the 8s. UMAP's repulsion term actively pushes the 3-cluster away from the 8-cluster, preserving the fact that 3 and 8 look more similar to each other than either does to 1.

The optimization itself uses stochastic gradient descent with negative sampling (not all $n^2$ pairs, just a random subset of non-neighbors per iteration), which is why UMAP runs so fast.
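A quick numeric check makes the division of labor concrete. The ce_terms helper below is hypothetical (not part of umap-learn); it evaluates the attraction and repulsion contributions of the loss for a single pair:

```python
import numpy as np

def ce_terms(P, Q):
    # Attraction and repulsion terms of UMAP's cross-entropy for one edge.
    # Guard the P=0 and P=1 edge cases where a term vanishes.
    attract = P * np.log(P / Q) if P > 0 else 0.0
    repel = (1 - P) * np.log((1 - P) / (1 - Q)) if P < 1 else 0.0
    return attract, repel

# Neighbors (high P) placed too far apart (low Q): attraction dominates
print(ce_terms(P=0.9, Q=0.1))

# Non-neighbors (low P) placed too close (high Q): repulsion dominates
print(ce_terms(P=0.1, Q=0.9))
```

In the first case the attraction term is the large positive penalty; in the second it is the repulsion term. t-SNE's KL loss has no analogue of that second penalty, which is the point of the comparison above.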

[Figure: Comparison of PCA, t-SNE, and UMAP across key properties including complexity, structure preservation, and transform support]

UMAP vs t-SNE vs PCA: A Direct Comparison

Each method makes a different tradeoff between speed, structure, and usability. This table captures the practical differences that matter when choosing a method for real work.

| Property | PCA | t-SNE | UMAP |
| --- | --- | --- | --- |
| Type | Linear | Non-linear | Non-linear |
| Time complexity | $O(\min(n,d)^2 \cdot \max(n,d))$ | $O(n \log n)$ Barnes-Hut, $O(n^2)$ exact | $O(n \log n)$ |
| Global structure | Preserved (linear only) | Lost (cluster distances meaningless) | Preserved (distances meaningful) |
| Local structure | Poor for non-linear data | Excellent | Excellent |
| transform() | Yes | No | Yes |
| Scales to 1M+ rows | Yes | Impractical | Yes (minutes) |
| Initialization | Deterministic (eigendecomp) | Random or PCA | Spectral (more stable) |
| Density handling | Global variance only | Perplexity-based (uniform) | Point-wise $\rho$/$\sigma$ (adaptive) |
| Primary use | Feature reduction, noise removal | Small-dataset visualization | Visualization + preprocessing |

Common Pitfall: UMAP's global structure preservation does not mean distances are metric-space accurate like in PCA. The relative positioning of clusters is meaningful (a 3-cluster being near an 8-cluster means they share features), but you cannot measure absolute distances with a ruler. The improvement over t-SNE is significant; the distances are just not perfect.

UMAP in Python: The Complete Implementation

UMAP's Python library, umap-learn (version 0.5.11, January 2026), follows the scikit-learn API with fit(), transform(), and fit_transform().

```bash
pip install umap-learn
```

```python
import umap
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_scaled = StandardScaler().fit_transform(digits.data)
labels = digits.target

# Default UMAP: n_neighbors=15, min_dist=0.1, metric='euclidean'
reducer = umap.UMAP(random_state=42)
embedding = reducer.fit_transform(X_scaled)

print(f"Input shape:  {X_scaled.shape}")   # (1797, 64)
print(f"Output shape: {embedding.shape}")  # (1797, 2)

plt.figure(figsize=(10, 8))
scatter = plt.scatter(
    embedding[:, 0], embedding[:, 1],
    c=labels, cmap='Spectral', s=5
)
plt.colorbar(scatter, label='Digit')
plt.title('UMAP Projection of MNIST Digits')
plt.xlabel('UMAP-1')
plt.ylabel('UMAP-2')
plt.show()
```

You will see ten distinct clusters with visually similar digits (1 and 7, 3 and 8) positioned closer together. This global arrangement is UMAP's key advantage over t-SNE.

Embedding New Data and Supervised Mode

UMAP's transform() method projects unseen data without re-fitting, something t-SNE cannot do. You fit once on training data and embed new samples as many times as needed:

```python
reducer = umap.UMAP(random_state=42)
reducer.fit(X_train_scaled)
new_embedding = reducer.transform(X_test_scaled)
```

UMAP also accepts class labels via y=labels during training (supervised UMAP), which creates dramatically cleaner separation. A recent study (Ding et al., 2026, arXiv:2603.02275) confirmed supervised UMAP excels for classification but shows limitations for regression tasks.

Hyperparameter Tuning: n_neighbors, min_dist, and metric

UMAP has three critical hyperparameters. Getting them right is the difference between a clear, informative visualization and a confusing mess.

[Figure: UMAP hyperparameter effects showing n_neighbors controlling local vs global balance, min_dist controlling packing tightness, and metric controlling the distance function]

n_neighbors (default: 15)

This controls the size of the local neighborhood UMAP considers when building the graph. It is the single most impactful parameter.

  • Low (5 to 15): Emphasizes fine-grained local structure. Produces sharper, more fragmented clusters. Good for spotting sub-populations (e.g., different writing styles within digit-4 images).
  • High (50 to 200): Emphasizes global topology. Clusters may merge, but the overall manifold shape becomes clearer. Better for seeing the big picture of how all ten digit classes relate.

min_dist (default: 0.1)

Controls the minimum distance between points in the output. This is a visual parameter, not a structural one.

  • Low (0.0 to 0.1): Points clump tightly into dense balls. Best when using UMAP as a preprocessing step for DBSCAN or HDBSCAN clustering.
  • High (0.5 to 0.99): Points spread out evenly. Better for visualization when you need to see individual data points.

metric (default: 'euclidean')

UMAP supports any distance function from scipy.spatial.distance plus custom callables. Common choices:

| Data Type | Recommended Metric |
| --- | --- |
| Continuous features (scaled) | euclidean |
| TF-IDF text vectors | cosine |
| Binary features | jaccard or hamming |
| Correlation patterns | correlation |
| Mixed types | mahalanobis |

Pro Tip: When using UMAP before DBSCAN or HDBSCAN clustering, set n_neighbors to 30 or higher and min_dist to 0.0. This produces tight, well-separated islands that density-based clustering algorithms can detect cleanly. It's a workflow used routinely in single-cell RNA sequencing analysis to identify cell types.

| Parameter | Default | Good Starting Range | Effect |
| --- | --- | --- | --- |
| n_neighbors | 15 | 5 to 200 | Local vs global structure balance |
| min_dist | 0.1 | 0.0 to 0.99 | Packing tightness |
| n_components | 2 | 2 to 100 | Output dimensionality |
| metric | euclidean | euclidean, cosine, etc. | Distance function |
| random_state | None | 42 (fixed for reproducibility) | Initialization seed |

When to Use UMAP (and When Not To)

UMAP is the right tool for specific situations, and the wrong tool for others. Picking the wrong method wastes compute and produces misleading results.

[Figure: Decision guide for choosing between PCA, t-SNE, and UMAP based on dataset size, interpretability needs, and embedding requirements]

Use UMAP when:

  1. Visualizing datasets above 10K rows. t-SNE becomes painfully slow; UMAP stays fast. On 1M samples, UMAP finishes in minutes where t-SNE would take days.
  2. You need to embed new data later. UMAP's transform() method is essential for production pipelines where the model fits once on training data and projects incoming samples.
  3. Global structure matters. If knowing that cluster A is near cluster B (and far from cluster C) carries meaningful information, UMAP is the clear choice.
  4. Preprocessing for clustering. UMAP + HDBSCAN is the standard pipeline in single-cell genomics (Scanpy, Seurat v5) and is increasingly common in general-purpose ML.
  5. Feature extraction for classifiers. Reducing 500 features to 50 with UMAP, then feeding those into gradient boosting, often outperforms PCA-based reduction because UMAP preserves non-linear boundaries.

Do NOT use UMAP when:

  1. You need interpretable components. PCA gives you loadings, explained variance ratios, and linear coefficients. UMAP dimensions have no inherent meaning.
  2. Your data is genuinely linear. If PCA with 2 components explains 90%+ variance, there is no manifold to learn. Adding UMAP complexity buys nothing.
  3. Exact reproducibility is required. Despite random_state, UMAP results can vary slightly across library versions and platforms. PCA is deterministic.
  4. You are doing supervised regression preprocessing. Recent research (Ding et al., 2026) shows supervised UMAP's limitations with continuous targets. Stick with PCA or autoencoders.
  5. Tiny datasets (under 500 samples). The nearest-neighbor graph becomes unreliable with too few points, and there is not enough data to define a stable manifold.

Production Considerations

UMAP's computational profile makes it viable for production in ways t-SNE never was. Time complexity is $O(n \log n)$ through approximate nearest neighbors (NN-Descent). On 100K samples, expect 30 to 60 seconds on a single CPU core. Memory is $O(n \times k)$ for the neighbor graph; at 1M samples with $k = 15$, that is roughly 120 MB, far cheaper than t-SNE's $O(n^2)$ distance matrix.

For larger workloads, RAPIDS cuML provides a GPU-native UMAP implementation (NVIDIA Technical Blog) that handles 10M+ samples in seconds. A fitted reducer can be pickled and reloaded for transform() calls in deployment, just like any scikit-learn estimator.
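Persisting a fitted reducer is plain pickle. The sketch below uses scikit-learn's PCA as a stand-in estimator so it runs without umap-learn installed; a fitted umap.UMAP object follows the same fit-once, serialize, transform-later pattern:

```python
import pickle
import numpy as np
from sklearn.decomposition import PCA  # stand-in; a fitted umap.UMAP pickles the same way

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 8))

reducer = PCA(n_components=2).fit(X_train)

# Serialize the fitted reducer once at training time...
blob = pickle.dumps(reducer)

# ...and reload it in the serving process for transform() calls
loaded = pickle.loads(blob)
X_new = rng.normal(size=(5, 8))
print(loaded.transform(X_new).shape)  # (5, 2)
```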

Pro Tip: For datasets above 500K rows, switch to cuML's GPU implementation. With cuML, 5M-row datasets embed in under 10 seconds.

Conclusion

UMAP's foundation in algebraic topology, specifically fuzzy simplicial sets and cross-entropy optimization, gives it a mathematical edge that translates into practical advantages: speed, global structure preservation, and the ability to embed new data. Understanding the ρ\rho/σ\sigma local metric, the fuzzy union symmetrization, and the attraction/repulsion balance in the loss function turns UMAP from a black-box visualization tool into something you can tune with confidence.

For the clustering pipeline that naturally follows UMAP embedding, explore our guides on DBSCAN and HDBSCAN. If you want to understand the linear baseline that UMAP improves upon, start with PCA: Reducing Dimensions While Keeping What Matters. And for the algorithm UMAP was built to replace, read our deep dive on t-SNE.

The best embedding is the one you actually understand. Now you do.

Frequently Asked Interview Questions

Q: Explain the difference between how t-SNE and UMAP preserve structure.

t-SNE minimizes KL divergence, which only penalizes placing neighbors far apart (attraction). It has no explicit term for keeping non-neighbors separated. UMAP minimizes cross-entropy, which includes both an attraction term for neighbors and a repulsion term that pushes non-neighbors apart. This repulsion term is why UMAP preserves global cluster distances better than t-SNE.

Q: What does the rho parameter represent in UMAP, and why is it important?

Rho ($\rho_i$) is the distance from point $x_i$ to its nearest neighbor. UMAP subtracts $\rho_i$ from all distances before computing edge weights, which guarantees every point connects to at least one neighbor with weight 1.0. This makes the algorithm density-invariant: a point in a sparse region is treated the same as one in a dense cluster, preventing disconnected components in the graph.

Q: Can UMAP be used for tasks other than visualization?

Yes. UMAP can reduce to any number of dimensions (n_components=50), not just 2 or 3. Common non-visualization uses include preprocessing features for gradient boosting or logistic regression, compressing embeddings for approximate nearest-neighbor search, and creating inputs for density-based clustering algorithms like HDBSCAN. Its transform() method makes it practical in production ML pipelines.

Q: Your UMAP embedding shows two clusters merged that you expected to be separate. How would you fix this?

Lower the n_neighbors parameter to focus more on local structure, which tends to fragment merged clusters. You could also lower min_dist to pack points more tightly, making boundaries sharper. If the merge persists, check if the data genuinely lacks separation in those classes, or try supervised UMAP by passing class labels to pull the groups apart.

Q: Why does UMAP use spectral initialization instead of random initialization?

Spectral initialization uses the eigenvectors of the graph Laplacian to place points in the low-dimensional space. This gives the optimizer a starting position that already reflects the global topology, so SGD converges faster and more consistently. Random initialization works but introduces more run-to-run variability and can trap the optimization in poor local minima.

Q: A colleague claims UMAP distances between clusters are "exact." Is this correct?

No. UMAP preserves the relative positioning of clusters much better than t-SNE, meaning if cluster A is genuinely closer to cluster B than to cluster C in the original space, that relationship usually holds in the UMAP plot. However, the absolute distances are still distorted by the non-linear projection. You cannot measure pixels on a UMAP plot and draw metric conclusions. For exact distances, use PCA.

Q: When would you choose t-SNE over UMAP despite UMAP's advantages?

t-SNE can produce slightly sharper local cluster boundaries on small datasets (under 10K samples) where speed is not a concern. If your only goal is visualizing tight clusters for a publication figure and you do not need to embed new data or preserve global structure, t-SNE with careful perplexity tuning still produces compelling visualizations. In practice, this is an increasingly rare use case.

Q: How does UMAP scale to million-row datasets compared to PCA and t-SNE?

PCA handles millions of rows in seconds via matrix decomposition. t-SNE is $O(n^2)$ exact or $O(n \log n)$ with Barnes-Hut, but becomes impractical above 100K rows. UMAP is $O(n \log n)$ with a smaller constant factor. On 1M rows, expect 5 to 15 minutes on CPU, or seconds with RAPIDS cuML on GPU.

Hands-On Practice

Visualizing high-dimensional data often requires balancing global structure with local detail. In this tutorial, we explore the fundamental concepts behind UMAP (Uniform Manifold Approximation and Projection) by manually dismantling the mathematical machinery it uses, specifically how it handles varying data density using nearest-neighbor graphs. Using the Wine Analysis dataset, we will compare PCA and t-SNE to understand the trade-offs UMAP solves, and then write code to inspect the "local connectivity" that makes UMAP unique.

Dataset: Wine Analysis (High-Dimensional). Wine chemical analysis with 27 features (13 original + 9 derived + 5 noise) and 3 cultivar classes. PCA variance explained: 2 components = 45%, 5 = 64%, 10 = 83%. Noise features have near-zero importance. Perfect for dimensionality reduction, feature selection, and regularization.

By dissecting the nearest neighbor graph, we've seen how UMAP adapts to data density, something standard distance metrics miss. Try experimenting with the k parameter (n_neighbors) in the code; increasing it forces the algorithm to consider a broader 'local' view, often preserving more global structure at the cost of local detail. In a production environment with the full umap-learn library, this n_neighbors parameter is your primary knob for balancing the focus between the forest (global) and the trees (local).

Explore all career paths