<!-- slug: hierarchical-clustering-building-the-family-tree-of-your-data --> <!-- excerpt: Master hierarchical clustering with Python. Learn linkage criteria, dendrogram analysis, silhouette scoring, and when to pick hierarchical over K-Means. -->
Picture a library with 10,000 books and no genre labels. You could sort them into 10 bins and call it done, but that ignores the fact that "science fiction" contains hard sci-fi, space opera, cyberpunk, and climate fiction as sub-genres. Hierarchical clustering solves exactly this problem. Instead of forcing data into a flat set of buckets the way K-Means does, it builds a tree structure that captures relationships at every level of granularity, from individual data points all the way up to one giant cluster. You pick where to "cut" the tree after the fact.
Throughout this article, we'll work with one scenario end to end: segmenting 150 customers based on two behavioral features. Every formula, every code block, and every diagram ties back to this dataset so the concepts stay concrete.
The hierarchical clustering framework
Hierarchical clustering is an unsupervised learning algorithm that organizes data points into a nested tree of clusters called a dendrogram. Unlike K-Means, which demands you specify $k$ before running, hierarchical clustering produces the full tree first. You decide on the number of clusters later by slicing the tree at whatever height makes sense for your problem.
Two strategies exist for building the tree:
- Agglomerative (bottom-up): Each data point starts as its own cluster. At every step, the two closest clusters merge. Repeat until one cluster remains. This is the standard approach and the one every major library implements.
- Divisive (top-down): All points start in a single cluster that recursively splits. Computationally expensive ($O(2^n)$ possible splits in the worst case) and rarely used in practice.
The rest of this article focuses exclusively on agglomerative hierarchical clustering because it's what you'll encounter in real codebases, interviews, and scikit-learn's AgglomerativeClustering.
Agglomerative algorithm mechanics
*Figure: Step-by-step agglomerative merge process for hierarchical clustering*
The agglomerative algorithm follows five repeating steps. Here's how it works on our 150-customer dataset:
- Initialize: Treat each of the 150 data points as its own cluster, so we start with 150 clusters.
- Compute distances: Build the pairwise distance matrix between every pair of clusters.
- Find the closest pair: Identify the two clusters with the smallest inter-cluster distance.
- Merge: Combine those two clusters into one. We now have 149 clusters.
- Repeat: Go back to step 2 with updated distances. Continue until a single cluster remains.
Each merge gets recorded with the distance at which it happened. That record is the dendrogram.
The critical question isn't the merging mechanics themselves; it's the definition of "closest." Two clusters can each contain dozens of points, so "distance between clusters" isn't immediately obvious. That definition is called the linkage criterion, and it changes everything about the shape of your results.
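To make the five steps concrete, here is a deliberately naive sketch of the merge loop in pure NumPy, using single linkage on a handful of toy points. It runs in $O(n^3)$ time and is for illustration only; real code should use scipy's optimized `linkage` instead.

```python
import numpy as np

def naive_agglomerative(X, n_clusters=1):
    """Naive agglomerative clustering with single linkage.

    Returns the merge history as (members_a, members_b, distance) per step.
    """
    # Step 1: every point starts as its own cluster
    clusters = [[i] for i in range(len(X))]
    history = []
    while len(clusters) > n_clusters:
        # Steps 2-3: scan all cluster pairs for the closest one
        # (single linkage: smallest point-to-point distance between members)
        best = (None, None, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(np.linalg.norm(X[a] - X[b])
                        for a in clusters[i] for b in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, d = best
        # Step 4: merge the pair and record the merge distance
        history.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        # Step 5: loop back until n_clusters remain
    return history

points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
history = naive_agglomerative(points)
for a, b, d in history:
    print(f"merge {a} + {b} at distance {d:.2f}")
```

The recorded `(members_a, members_b, distance)` triples are exactly the information a dendrogram plots: which clusters merged and at what height.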
Linkage criteria explained
*Figure: Linkage criteria comparison for hierarchical clustering*
Measuring the distance between two individual points is straightforward (usually Euclidean distance, sometimes Manhattan or cosine). Measuring the distance between two groups of points requires a design choice. That choice is the linkage criterion, and it dramatically affects cluster shape.
Single linkage (minimum distance)
Single linkage defines the distance between two clusters as the shortest distance between any point in Cluster A and any point in Cluster B.

$$d(A, B) = \min_{a \in A,\, b \in B} d(a, b)$$

Where:

- $d(A, B)$ is the inter-cluster distance between clusters $A$ and $B$
- $d(a, b)$ is the point-to-point distance (e.g., Euclidean) between points $a$ and $b$
- $a \in A$ means point $a$ belongs to cluster $A$
- The $\min$ operator selects the smallest pairwise distance
In Plain English: If even one customer in Segment A is close to one customer in Segment B, those segments merge. Single linkage only looks at the nearest edges, which lets it discover elongated and non-convex cluster shapes (crescents, spirals). The downside is "chaining," where a trail of closely spaced points bridges two genuinely separate groups into one.
Complete linkage (maximum distance)
Complete linkage measures the distance between two clusters using their farthest pair of points.

$$d(A, B) = \max_{a \in A,\, b \in B} d(a, b)$$

Where:

- $d(A, B)$ is the inter-cluster distance
- $d(a, b)$ is the point-to-point distance between $a$ and $b$
- The $\max$ operator selects the largest pairwise distance
In Plain English: Two customer segments only merge when even their most distant members are reasonably close. This produces compact, roughly spherical clusters because the algorithm refuses to merge groups when any pair of members is too far apart. It's the conservative cousin of single linkage.
Ward's method (variance minimization)
Ward's method (Ward, 1963) is the default in scikit-learn and the most commonly used linkage in practice. Instead of measuring geometric distance, it computes the increase in total within-cluster variance that would result from a merge.

$$\Delta(A, B) = \frac{n_A\, n_B}{n_A + n_B}\, \lVert \mu_A - \mu_B \rVert^2$$

Where:

- $\Delta(A, B)$ is the increase in within-cluster variance from merging $A$ and $B$
- $n_A$ and $n_B$ are the number of points in clusters $A$ and $B$
- $\mu_A$ and $\mu_B$ are the centroids (mean vectors) of clusters $A$ and $B$
- $\lVert \mu_A - \mu_B \rVert^2$ is the squared Euclidean distance between the two centroids
In Plain English: Ward's method asks: "If I merge these two customer segments, how much messier does the combined group become?" If two segments have similar spending patterns, merging them barely increases variance, so the cost is low. If they're very different, the cost is high. This naturally produces balanced, similarly-sized clusters, which is why it's the go-to default for most real-world segmentation tasks.
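A quick numeric sketch of that merge cost, using a hypothetical `ward_cost` helper and made-up toy segments. The formula is $\Delta(A, B) = \frac{n_A n_B}{n_A + n_B} \lVert \mu_A - \mu_B \rVert^2$:

```python
import numpy as np

def ward_cost(A: np.ndarray, B: np.ndarray) -> float:
    """Increase in within-cluster variance from merging clusters A and B."""
    n_a, n_b = len(A), len(B)
    mu_a, mu_b = A.mean(axis=0), B.mean(axis=0)
    return n_a * n_b / (n_a + n_b) * float(np.sum((mu_a - mu_b) ** 2))

# Two toy segments with similar spending patterns: cheap to merge
similar = ward_cost(np.array([[1.0, 1.0], [1.2, 0.9]]),
                    np.array([[1.1, 1.1], [0.9, 1.0]]))
# Two very different segments: expensive to merge
different = ward_cost(np.array([[1.0, 1.0], [1.2, 0.9]]),
                      np.array([[8.0, 9.0], [8.2, 9.1]]))
print(f"similar segments:   {similar:.3f}")
print(f"different segments: {different:.3f}")
```

Ward's algorithm always picks the pending merge with the lowest $\Delta$, which is why similar segments combine early and dissimilar ones only at the top of the tree.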
Linkage comparison at a glance
| Linkage | Merges by | Cluster shape | Weakness | Best for |
|---|---|---|---|---|
| Single | Nearest neighbor | Elongated, chain-like | Chaining effect with noise | Non-convex clusters (moons, spirals) |
| Complete | Farthest neighbor | Compact, spherical | Sensitive to outliers pulling max distance | Well-separated round clusters |
| Average | Mean pairwise distance | Moderate between single and complete | No strong guarantee | General use when Ward fails |
| Ward | Variance increase | Balanced, similarly-sized | Assumes Euclidean distance | Default choice for most datasets |
Pro Tip: Always start with Ward's method. Switch to single linkage only if you suspect your clusters have non-convex shapes, and switch to complete linkage if you need tight, compact groups. Average linkage is the safe middle ground when nothing else works.
Reading and cutting dendrograms
A dendrogram is the tree diagram that records the entire history of merges. The y-axis shows the distance (or dissimilarity) at which each merge occurred. The x-axis lists individual data points (or truncated cluster labels). Reading a dendrogram well is a skill that separates someone who just calls .fit() from someone who actually understands their data.
How to read it:
- Each leaf at the bottom represents one data point (or, in truncated view, a mini-cluster).
- A horizontal bar connecting two branches indicates a merge. The height of that bar is the distance at which the merge happened.
- Long vertical lines without any merges represent a big "jump" in distance. Those jumps indicate natural cluster boundaries.
How to cut it:
Draw a horizontal line across the dendrogram. The number of vertical lines that line crosses equals the number of clusters you'd get. The trick is finding the right height.
- Cut low: Many small, highly specific clusters. Good for fine-grained analysis but risks splitting natural groups.
- Cut high: Fewer, broader clusters. Good for top-level segmentation but may lump distinct groups together.
Key Insight: Look for the largest vertical gap in the dendrogram that no horizontal merge line crosses. That gap represents the distance jump where genuinely distinct clusters were forced to merge. Cutting just below that gap usually gives the most natural grouping.
Automated cut with inconsistency or silhouette
You don't have to eyeball the dendrogram. Scipy's fcluster with criterion='distance' lets you specify a cut height directly. For a more principled approach, compute the silhouette score for different numbers of clusters and pick the $k$ that maximizes it. We'll do exactly this in the code section below.
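Here is a sketch of both cut strategies, assuming a Ward linkage matrix built on standardized `make_blobs` data with `random_state=42`; the cut height `t=5.0` is illustrative, not prescriptive:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=150, centers=4, n_features=2, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
Z = linkage(X_scaled, method="ward")

# Option 1: cut the tree at an explicit height
labels_by_height = fcluster(Z, t=5.0, criterion="distance")
print("clusters at height 5.0:", len(np.unique(labels_by_height)))

# Option 2: pick k by maximizing the silhouette score
best_k = max(range(2, 8),
             key=lambda k: silhouette_score(
                 X_scaled, fcluster(Z, t=k, criterion="maxclust")))
print("best k by silhouette:", best_k)
```

`criterion="maxclust"` asks fcluster for at most `t` clusters, which makes it easy to sweep candidate values of $k$ against the same linkage matrix without refitting anything.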
Customer segmentation with scipy and scikit-learn
Let's build hierarchical clustering end-to-end on our 150-customer dataset. We'll generate synthetic data with four natural segments, compute linkage matrices, visualize dendrograms, and assign cluster labels.
Generating and visualizing the data
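A minimal sketch of the setup, assuming `make_blobs` with four centers and `random_state=42` (the exact ranges in the output below depend on the seed):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# 150 customers, 2 behavioral features, 4 natural segments
X, y_true = make_blobs(n_samples=150, centers=4, n_features=2,
                       cluster_std=1.0, random_state=42)

print(f"Dataset shape: {X.shape}")
print(f"Feature 1 range: [{X[:, 0].min():.2f}, {X[:, 0].max():.2f}]")
print(f"Feature 2 range: [{X[:, 1].min():.2f}, {X[:, 1].max():.2f}]")

# Scatter plot of the unlabeled data
plt.scatter(X[:, 0], X[:, 1], s=30)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("150 customers, 2 behavioral features")
plt.savefig("customers.png")  # use plt.show() in an interactive session
```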
Expected output:
Dataset shape: (150, 2)
Feature 1 range: [-11.76, 6.53]
Feature 2 range: [-9.31, 11.24]
The scatter plot shows four visually distinct clusters, which is exactly what we'd expect from make_blobs with four centers. In a real scenario, you wouldn't know how many segments exist. That's where the dendrogram comes in.
Building and comparing dendrograms
Before clustering, always standardize your features. Hierarchical clustering relies on distance calculations, and features on different scales will dominate the result.
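A sketch of the standardize-then-link workflow, regenerating the same synthetic dataset (with `random_state=42` assumed; the merge distances in the output below depend on the seed):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Same synthetic dataset as above
X, _ = make_blobs(n_samples=150, centers=4, n_features=2, random_state=42)

# Standardize first: hierarchical clustering is distance-based
X_scaled = StandardScaler().fit_transform(X)

# Linkage matrix: each row is [cluster_i, cluster_j, merge_distance, new_size]
Z_ward = linkage(X_scaled, method="ward")
Z_complete = linkage(X_scaled, method="complete")
Z_single = linkage(X_scaled, method="single")

print("Last 5 merges (Ward linkage):")
print("Step | Distance | New Cluster Size")
n_merges = len(Z_ward)  # n - 1 = 149 merges
for step in range(n_merges - 5, n_merges):
    _, _, dist, size = Z_ward[step]
    print(f"{step + 1:4d} | {dist:8.3f} | {int(size):4d}")

# Three dendrograms side by side, truncated to the last 20 merged clusters
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
pairs = [(Z_ward, "Ward"), (Z_complete, "Complete"), (Z_single, "Single")]
for ax, (Z, name) in zip(axes, pairs):
    dendrogram(Z, ax=ax, truncate_mode="lastp", p=20, no_labels=True)
    ax.set_title(f"{name} linkage")
fig.savefig("dendrograms.png")
```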
Expected output:
Last 5 merges (Ward linkage):
Step | Distance | New Cluster Size
145 | 1.484 | 38
146 | 1.559 | 37
147 | 7.200 | 75
148 | 15.663 | 113
149 | 16.763 | 150
Notice the massive jump from 1.559 to 7.200 at step 147. That's the dendrogram telling you: "merging beyond four clusters costs a lot more." Ward's dendrogram shows four distinct branches with long vertical stems. The complete linkage dendrogram gives a similar picture. Single linkage produces a messier tree because of chaining effects; two of the four groups may look merged at lower heights.
Common Pitfall: Never skip standardization before hierarchical clustering. If one feature ranges from 0 to 100,000 and another from 0 to 1, the algorithm will base its entire hierarchy on the large-scale feature, producing meaningless clusters.
Fitting the model and evaluating clusters
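A sketch of fitting and scoring, again assuming the seeded `make_blobs` data (the exact sizes and score in the output below depend on the seed):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Same synthetic dataset as above
X, _ = make_blobs(n_samples=150, centers=4, n_features=2, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Ward linkage, cutting the tree at 4 clusters
model = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels = model.fit_predict(X_scaled)

print("Cluster sizes:")
for c in range(4):
    print(f"  Cluster {c}: {np.sum(labels == c)} customers")

score = silhouette_score(X_scaled, labels)
print(f"Silhouette score: {score:.4f}")
```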
Expected output:
Cluster sizes:
Cluster 0: 37 customers
Cluster 1: 38 customers
Cluster 2: 38 customers
Cluster 3: 37 customers
Silhouette score: 0.7724
A silhouette score of 0.77 is excellent. Values above 0.5 indicate well-defined clusters; above 0.7 means the clusters are clearly separated with little overlap.
Finding the optimal number of clusters
The dendrogram gave us a visual hint (four clusters), but you can quantify this with the silhouette method. Compute the score for several values of $k$ and pick the maximum.
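A sketch of the sweep, using the same seeded dataset (exact scores in the output below depend on the seed):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Same synthetic dataset as above
X, _ = make_blobs(n_samples=150, centers=4, n_features=2, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Fit once per candidate k and record the silhouette score
scores = {}
for k in range(2, 8):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print("Silhouette scores:")
for k, s in scores.items():
    marker = " <-- best" if k == best_k else ""
    print(f"k={k}: {s:.4f}{marker}")
```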
Expected output:
Silhouette scores:
k=2: 0.5452
k=3: 0.7329
k=4: 0.7724 <-- best
k=5: 0.6440
k=6: 0.5532
k=7: 0.4749
The silhouette score peaks at $k = 4$, confirming what the dendrogram suggested. This two-pronged validation (visual from dendrogram + quantitative from silhouette) is the standard workflow.
Linkage choice on non-globular data
Ward's method excels with roughly spherical clusters, but real data isn't always so cooperative. When clusters have crescent, spiral, or elongated shapes, single linkage can outperform every other method. Here's a demonstration using the classic "two moons" dataset.
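A sketch of the comparison, assuming `make_moons` with `noise=0.05` and `random_state=42` (the Ward ARI in the output below varies with the seed):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Classic non-globular benchmark: two interleaved crescents
X_moons, y_moons = make_moons(n_samples=200, noise=0.05, random_state=42)

# Compare Ward and single linkage against the known crescent labels
ari = {}
for method in ("ward", "single"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=method).fit_predict(X_moons)
    ari[method] = adjusted_rand_score(y_moons, labels)
    print(f"{method.capitalize()} ARI: {ari[method]:.4f}")
```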
Expected output:
Ward ARI: 0.4464
Single ARI: 1.0000
Single linkage achieves a perfect Adjusted Rand Index of 1.0 because it follows the contour of each crescent. Ward's method, which minimizes variance, tries to split along a straight line and butchers both crescents. This is the one scenario where single linkage shines; for everything else, Ward's method is the safer bet.
Key Insight: There's no universally "best" linkage. Ward works for round, balanced clusters. Single linkage works for chain-like or non-convex shapes. Complete linkage works for compact, well-separated groups. Always visualize your dendrogram and try at least two linkage methods before committing.
When to use hierarchical clustering (and when not to)
*Figure: Clustering method selection guide for hierarchical, K-Means, and DBSCAN*
Choosing the right clustering algorithm depends on your data size, cluster shape expectations, and whether you need the hierarchy itself.
Use hierarchical clustering when
- You need the full hierarchy. Taxonomy construction (gene families, document categories, customer tiers) benefits from seeing how groups nest inside each other.
- You don't know $k$ upfront. The dendrogram lets you explore multiple granularities without re-running the algorithm.
- Your dataset is small to medium. Under 10,000 rows, hierarchical clustering runs in seconds and gives richer output than K-Means.
- Determinism matters. Hierarchical clustering produces the same result every time. No random initialization, no variance between runs.
Do NOT use hierarchical clustering when
- You have more than ~10,000 points. The $O(n^2)$ memory for the distance matrix and $O(n^2)$ time for optimized Ward's method make it impractical for large datasets. At 50,000 rows, you're looking at ~20 GB of memory for the full distance matrix alone. Use K-Means or Mini-Batch K-Means instead.
- Clusters have irregular density. DBSCAN handles varying density far better because it defines clusters as dense regions separated by sparse areas.
- You need online/incremental clustering. Once built, you can't add a new point to a dendrogram without recomputing from scratch. K-Means can incrementally update centroids.
- You only care about flat cluster labels. If you don't need the hierarchy, K-Means is faster and simpler.
Head-to-head comparison
| Criterion | Hierarchical | K-Means | DBSCAN |
|---|---|---|---|
| Specify $k$ upfront | No | Yes | No |
| Time complexity | $O(n^2)$ to $O(n^3)$ | $O(n \cdot k \cdot i)$ | $O(n \log n)$ with indexing |
| Memory | $O(n^2)$ | $O(n)$ | $O(n)$ |
| Max practical size | ~10K rows | Millions | Hundreds of thousands |
| Cluster shape | Depends on linkage | Spherical | Arbitrary |
| Deterministic | Yes | No (random init) | Yes |
| Outputs hierarchy | Yes | No | No |
| Handles noise | Poorly | Poorly | Natively |
Production considerations and scaling
Computational complexity
The naive agglomerative algorithm runs in $O(n^3)$ time because each of the $n - 1$ merge steps scans an $O(n^2)$ distance matrix. Optimized implementations (such as the one in scipy's linkage) reduce this:
| Linkage | Time | Memory | Notes |
|---|---|---|---|
| Ward, centroid, median | $O(n^2)$ | $O(n^2)$ | Use nearest-neighbor chain algorithm |
| Single | $O(n^2)$ | $O(n)$ | Minimum spanning tree approach |
| Complete, average | $O(n^2 \log n)$ | $O(n^2)$ | Priority queue for merge selection |
On a modern laptop, expect roughly: 1,000 points in 0.006s, 5,000 in 0.24s, 10,000 in 1.1s. Beyond 20,000, you'll start feeling the wait.
Memory bottleneck
The pairwise distance matrix for $n$ points requires $n(n-1)/2$ entries stored as 64-bit floats. For $n = 50{,}000$, that's about 9.3 GB. This is often the binding constraint before time complexity becomes an issue.
Pro Tip: If you absolutely need hierarchical clustering on a larger dataset, consider subsampling first. Cluster a random 5,000-point subset, build the dendrogram, determine $k$, then assign the full dataset using AgglomerativeClustering(n_clusters=k) with connectivity constraints or simply use K-Means with the $k$ you discovered. This hybrid approach gives you the best of both worlds.
Scikit-learn versus scipy
Use scipy (linkage + dendrogram) when you want to visualize the dendrogram and explore different cut heights. Use scikit-learn (AgglomerativeClustering) when you just need cluster labels for downstream tasks. Scikit-learn also supports a connectivity parameter for spatially constrained clustering (e.g., only merge adjacent pixels in an image).
Common pitfalls
The scaling trap
Like K-Nearest Neighbors and K-Means, hierarchical clustering is distance-based. Features measured on different scales will distort the hierarchy.
Common Pitfall: Always apply StandardScaler before clustering. If annual income ranges from $20,000 to $200,000 and age ranges from 18 to 70, the algorithm will build its entire hierarchy based on income alone. Standardizing brings both features to zero mean and unit variance, giving them equal influence on distance calculations.
Sensitivity to noise and outliers
Single linkage is especially vulnerable. A single outlier positioned between two distinct clusters acts as a bridge, merging them prematurely. Ward's method handles noise better because outliers increase variance, making those merges expensive. If your data has significant noise, consider DBSCAN instead; it treats sparse-region points as noise by design.
Irreversible merges
Once two clusters merge in the agglomerative process, they can never split apart. If an early merge was a mistake (say, due to noise), every subsequent merge inherits that error. This is why the linkage criterion matters so much. Ward's method is the most forgiving here because its variance-based cost function naturally resists merging dissimilar groups.
Conclusion
Hierarchical clustering gives you something no other clustering algorithm provides: a complete map of how your data organizes itself at every scale. The dendrogram isn't just a pretty visualization. It's a diagnostic tool that reveals natural groupings, shows you exactly where cluster boundaries blur, and lets you pick the granularity that matches your business question.
The algorithm's main limitation is computational. The $O(n^2)$ memory requirement and quadratic time complexity cap practical usage at roughly 10,000 to 20,000 points. Beyond that, you're better off with K-Means for large spherical clusters or DBSCAN for arbitrary shapes with noise. For datasets that fit within the size constraint, though, hierarchical clustering remains the most informative unsupervised method available.
The practical workflow is straightforward: standardize features, compute Ward's linkage, inspect the dendrogram for large distance jumps, validate your $k$ with silhouette scores, and assign labels. Try single linkage if your clusters are non-convex. And when someone asks, "Why four clusters instead of five?" you can point to the dendrogram and show them exactly where the tree says the boundary is. That kind of transparency is hard to beat.
Frequently Asked Interview Questions
Q: Why would you choose hierarchical clustering over K-Means?
Hierarchical clustering doesn't require specifying the number of clusters in advance, which is valuable during exploratory analysis. It also produces a dendrogram that reveals the nested structure of your data at every level of granularity. K-Means is faster and scales to large datasets, but it gives you only flat cluster labels with no insight into how groups relate to each other.
Q: What is the chaining effect in single linkage, and how do you mitigate it?
The chaining effect occurs when a trail of closely spaced points connects two genuinely separate clusters, causing single linkage to merge them. A single noise point between two groups can act as a bridge. You can mitigate it by switching to Ward's method or complete linkage for globular data, or by removing outliers before clustering. If the data genuinely has non-convex shapes, accept single linkage but preprocess with a noise filter.
Q: How do you determine the optimal number of clusters from a dendrogram?
Look for the longest vertical gap in the dendrogram where no horizontal merge occurs. That gap represents a large jump in merge distance, indicating that the clusters below are well-separated. Draw a horizontal line through that gap; the number of branches it crosses is your optimal $k$. For quantitative confirmation, compute silhouette scores for multiple values of $k$ and pick the maximum.
Q: What is the time and space complexity of agglomerative hierarchical clustering?
The naive algorithm runs in $O(n^3)$ time and $O(n^2)$ space. Optimized implementations using nearest-neighbor chains bring Ward's method down to $O(n^2)$ time with $O(n^2)$ space for the distance matrix. The memory constraint is usually the binding factor. For 50,000 points, the distance matrix alone requires about 9.3 GB.
Q: When would Ward's method produce poor results?
Ward's method minimizes within-cluster variance, which biases it toward spherical, equally-sized clusters. It struggles with non-convex shapes (crescents, spirals), clusters of very different sizes, and data where Euclidean distance isn't meaningful (e.g., text data with cosine similarity). In those cases, single linkage or DBSCAN may perform better.
Q: Can you add new data points to an existing dendrogram without recomputing?
No. Agglomerative clustering processes all points at once to build the merge tree. Adding a new point requires recomputing the entire distance matrix and re-running the algorithm. If you need incremental clustering, K-Means allows online updates to centroids, and incremental variants of density-based methods like DBSCAN have been developed.
Q: A colleague scaled features using Min-Max normalization before hierarchical clustering and got poor results. What went wrong?
Min-Max normalization compresses all values to [0, 1] but is highly sensitive to outliers. A single extreme value stretches the scale, pushing the bulk of data into a narrow range. Standardization (z-score scaling) is usually better for distance-based clustering because it centers each feature at zero with unit variance, making the distance metric equally sensitive to all features regardless of their original range. The colleague should switch to StandardScaler and verify results.
Hands-On Practice
Hierarchical clustering offers a powerful way to visualize data relationships without pre-committing to a specific number of groups. In this hands-on tutorial, we'll apply Agglomerative Hierarchical Clustering, focusing on how different linkage criteria (Single vs. Complete) dramatically change the structure of our clusters. Using a customer segmentation dataset, we will build dendrograms to visualize the "family tree" of our data and discover natural groupings based on spending behavior and income.
Dataset: Customer Segments (Clustering). Customer segmentation data with 5 natural clusters based on income, spending, and age. Silhouette score ≈ 0.75 with k=5.
Try experimenting by changing the linkage parameter in AgglomerativeClustering to 'complete' or 'single' and observe how the cluster shapes change in the scatter plot. Single linkage often fails to separate the distinct groups in this dataset, merging them into chains. You can also adjust n_clusters to 3 or 7 to see how the hierarchy splits or merges these customer segments.