DBSCAN: Finding Clusters of Any Shape (Without Knowing K)

LDS Team
Let's Data Science
11 min read

<!-- slug: dbscan-finding-clusters-of-any-shape-without-knowing-k -->
<!-- excerpt: Learn how DBSCAN finds clusters of any shape without specifying K. Covers epsilon tuning, k-distance plots, Python code, and when to choose DBSCAN over K-Means. -->

Look at a satellite image of a city at night. You don't count neighborhoods before you see them. Your eyes just follow the bright, densely packed lights and group them naturally, while the scattered dots on the outskirts register as isolated stragglers. That spatial intuition is exactly what DBSCAN (Density-Based Spatial Clustering of Applications with Noise) formalizes into an algorithm. Instead of forcing data into a predetermined number of spherical blobs the way K-Means does, DBSCAN defines clusters as contiguous regions of high density separated by regions of low density. It discovers the number of clusters on its own, handles arbitrary shapes, and has a built-in mechanism for labeling outliers.

Throughout this article, we'll work with one consistent example: the moons dataset, two interlocking half-circles generated by scikit-learn's make_moons. Every formula, code block, and diagram will reference this dataset so the concepts stay grounded in a single concrete scenario.

K-Means limitations on non-spherical data

K-Means clustering partitions data by assigning each point to the nearest centroid, then iterating until centroids stabilize. This objective function implicitly assumes clusters are convex and roughly spherical because it minimizes within-cluster variance (the sum of squared Euclidean distances to each centroid). When clusters actually are ball-shaped and well separated, K-Means is fast, simple, and effective.

But real data is often messier. Consider two interlocking crescents. K-Means can't trace the curve of each crescent because it draws a straight decision boundary between centroids. It ends up slicing both moons horizontally, grouping the top of the left crescent with the top of the right one. The algorithm prioritizes compactness over connectivity.

K-Means also has no concept of "this point doesn't belong anywhere." Every single observation gets assigned to a cluster, including outliers that sit far from any natural group. A single anomalous point can drag a centroid away from where it should be, distorting the entire partition.

These two weaknesses (shape assumption and forced assignment) motivate density-based clustering. Instead of asking "which center is this point closest to?", DBSCAN asks "is this point surrounded by enough neighbors to be part of a crowd?"

The two parameters that control everything

DBSCAN relies on exactly two hyperparameters. Getting them right is the main challenge, but the upside is that you never need to specify the number of clusters.

Epsilon (ε) is the radius of the neighborhood around each point. Draw a circle of radius ε centered on a data point; every other point inside that circle is a neighbor.

min_samples is the minimum number of points (including the center point itself) required inside that ε-circle for the center to qualify as a core point. Think of it as the density threshold: "how crowded does a neighborhood need to be before we call it part of a cluster?"

Formally, the ε-neighborhood of a point p in dataset D is:

N_ε(p) = { q ∈ D | dist(p, q) ≤ ε }

Where:

  • N_ε(p) is the set of all neighbors of point p within distance ε
  • q is any other point in the dataset D
  • dist(p, q) is the distance between p and q (typically Euclidean)
  • ε is the maximum neighborhood radius

In Plain English: For each point in the moons dataset, draw a circle with radius ε around it. Count how many other points fall inside. If the count meets or exceeds min_samples, that point sits in a dense region. If the circle is nearly empty, the point is isolated.
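The definition maps directly to a few lines of NumPy. Here is a minimal sketch on the moons dataset; the helper name `epsilon_neighborhood` is ours for illustration, not a library function:

```python
import numpy as np
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.06, random_state=42)

def epsilon_neighborhood(X, p_idx, eps):
    """Indices of all points within distance eps of point p (self included)."""
    dists = np.linalg.norm(X - X[p_idx], axis=1)  # Euclidean distance to every point
    return np.flatnonzero(dists <= eps)

neighbors = epsilon_neighborhood(X, p_idx=0, eps=0.3)
is_core = len(neighbors) >= 5  # min_samples = 5
print(f"Point 0 has {len(neighbors)} neighbors; core point: {is_core}")
```

Because the moons are densely sampled, nearly every point passes the `min_samples` test at eps=0.3.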

| Parameter | Controls | Too small | Too large |
|---|---|---|---|
| ε | Neighborhood radius | Data fragments into many tiny clusters and noise | Distinct clusters merge into one blob |
| min_samples | Density threshold | Sparse regions form clusters (noisy results) | Only the densest cores survive; border points become noise |

Core, border, and noise point classification

Every point in the dataset falls into exactly one of three categories. This three-way classification is the engine that drives DBSCAN's ability to find arbitrary shapes and reject outliers simultaneously.

Figure: DBSCAN classifies every data point as core, border, or noise based on neighborhood density.

Core points

A point is a core point if its ε-neighborhood contains at least min_samples points (counting itself). Core points are the backbone of each cluster. They sit deep inside dense regions, and every cluster must contain at least one.

In the moons dataset with eps=0.3 and min_samples=5, the vast majority of points along each crescent qualify as core points because the curves are densely sampled.

Border points

A point is a border point if it doesn't have enough neighbors to be core, but it falls within the ε-radius of at least one core point. Border points form the outer fringe of a cluster. They belong to the cluster but don't have the density to anchor it.

On the moons dataset, border points tend to appear at the tips of each crescent where the local density thins out.

Noise points

A point is a noise point (outlier) if it is neither core nor border. It sits in a low-density region, unreachable from any core point. DBSCAN labels these with -1 and effectively says "I don't know what group this belongs to."

This is a massive advantage over K-Means. Instead of warping a cluster to accommodate a distant anomaly, DBSCAN simply excludes it. For applications like outlier detection, this built-in noise handling is one of DBSCAN's most valuable features.

Key Insight: The distinction between core and border matters for cluster stability. Core points always belong to a specific cluster. A border point could technically be reachable from core points in two different clusters; in practice, scikit-learn assigns it to whichever cluster's core point was processed first.

The DBSCAN algorithm step by step

The algorithm explores data by hopping from one dense point to the next, expanding clusters through a chain reaction of reachable core points. It's conceptually similar to a flood-fill in image processing.

Figure: The DBSCAN algorithm iterates through unvisited points, expanding clusters through chains of density-reachable core points.

  1. Pick an unvisited point p. The order doesn't affect the final clusters (only border point assignments can vary).
  2. Count neighbors. Find every point within distance ε of p.
  3. Classify p. If the neighbor count is below min_samples, label p as noise (tentatively; it might become a border point later).
  4. Start a new cluster. If p is a core point, create a new cluster and add p plus all its neighbors.
  5. Expand. For each neighbor that's also a core point, recursively add its neighbors to the same cluster. This chain reaction is what lets DBSCAN trace curves, spirals, and any other shape where dense points form a continuous path.
  6. Repeat. Move to the next unvisited point and go back to step 1.
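The six steps above fit in a compact pure-NumPy implementation. This is a teaching sketch with O(n²) neighborhood queries, not a replacement for sklearn.cluster.DBSCAN:

```python
import numpy as np
from sklearn.datasets import make_moons

def dbscan(X, eps, min_samples):
    """Toy DBSCAN: returns labels where -1 marks noise, 0..k-1 mark clusters."""
    n = len(X)
    labels = np.full(n, -1)               # everything starts as tentative noise
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def region_query(i):
        # Step 2: all points within eps of point i (naive O(n) scan)
        return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

    for i in range(n):                    # Steps 1 and 6: visit every point once
        if visited[i]:
            continue
        visited[i] = True
        neighbors = region_query(i)
        if len(neighbors) < min_samples:  # Step 3: tentative noise
            continue
        labels[i] = cluster_id            # Step 4: i is core, start a cluster
        seeds = list(neighbors)
        while seeds:                      # Step 5: flood-fill expansion
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id    # claim unclaimed/noise points
            if not visited[j]:
                visited[j] = True
                j_neighbors = region_query(j)
                if len(j_neighbors) >= min_samples:
                    seeds.extend(j_neighbors)  # j is also core: keep expanding
        cluster_id += 1
    return labels

X, _ = make_moons(n_samples=300, noise=0.06, random_state=42)
labels = dbscan(X, eps=0.3, min_samples=5)
n_found = len(set(labels) - {-1})
print("clusters found:", n_found)
```

Note how a point labeled noise in step 3 can still be claimed later as a border point when a core point's expansion reaches it.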

The expansion step is the critical piece. On the moons dataset, a core point near the middle of the left crescent connects to its neighbors, which connect to their neighbors, and so on until the entire left crescent belongs to one cluster. The two crescents never merge because there's a low-density gap between them that no ε-radius can bridge (given an appropriate ε).

Computational complexity

With a naive distance calculation (no spatial index), DBSCAN runs in O(n²) time because every point's neighborhood query scans the entire dataset. Using a spatial index like a k-d tree or ball tree drops this to O(n log n) on average for low-dimensional data. Scikit-learn's implementation uses these indices automatically when you set algorithm='auto'.

| Scenario | Time complexity | Memory |
|---|---|---|
| No spatial index | O(n²) | O(n) |
| k-d tree (low dims) | O(n log n) average | O(n) |
| Ball tree | O(n log n) average | O(n) |
| High dimensions (>20) | Effectively O(n²) | O(n) |

Pro Tip: For datasets beyond 50,000 points, precompute the distance matrix only if it fits in memory. On 100K points with 64-bit floats, the full distance matrix eats ~75 GB. Use algorithm='ball_tree' or algorithm='kd_tree' with the leaf_size parameter tuned for your data size instead.
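In scikit-learn this is just a constructor argument. A quick sketch (the leaf_size value here is illustrative, not tuned):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.06, random_state=42)

# Force a ball tree index; leaf_size trades index memory against query speed
db = DBSCAN(eps=0.3, min_samples=5, algorithm="ball_tree", leaf_size=40).fit(X)
```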

K-Means versus DBSCAN on the moons dataset

Theory is fine, but the difference becomes visceral when you see both algorithms side by side on data that K-Means was never designed to handle.

Figure: DBSCAN versus K-Means comparison showing strengths and weaknesses of centroid-based and density-based clustering.
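A sketch of the comparison, using the article's running parameters (eps=0.3, min_samples=5 on 300 moons points with noise=0.06):

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.06, random_state=42)

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

n_clusters = len(set(db.labels_) - {-1})
n_noise = int(np.sum(db.labels_ == -1))
print(f"K-Means: assigned all {len(X)} points to {len(set(km.labels_))} clusters")
print(f"DBSCAN:  found {n_clusters} clusters, {n_noise} noise points")
```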

Expected output:

```
K-Means: assigned all 300 points to 2 clusters
DBSCAN:  found 2 clusters, 0 noise points
```

The left plot shows K-Means drawing a vertical-ish boundary through the center, grouping the top halves of both moons together. The right plot shows DBSCAN correctly tracing each crescent as a separate cluster. With low noise (noise=0.06), DBSCAN classifies every point into a cluster; none end up as noise.

Inspecting core, border, and noise points in practice

Adding random scatter points to the moons dataset lets us see all three point types in action.
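A minimal sketch of that experiment, using scikit-learn's core_sample_indices_ attribute to separate the three point types. The exact counts depend on where the random outliers land, so they may differ slightly from the output shown below:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

rng = np.random.default_rng(42)
X_moons, _ = make_moons(n_samples=300, noise=0.06, random_state=42)
outliers = rng.uniform(low=[-1.5, -1.0], high=[2.5, 1.5], size=(20, 2))
X = np.vstack([X_moons, outliers])

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True   # sklearn exposes core points directly
noise_mask = labels == -1
border_mask = ~core_mask & ~noise_mask      # in a cluster, but not core

n_clusters = len(set(labels) - {-1})
print(f"Total points: {len(X)}")
print(f"Clusters found: {n_clusters}")
print(f"Core points:   {core_mask.sum()}")
print(f"Border points: {border_mask.sum()}")
print(f"Noise points:  {noise_mask.sum()}")
```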

Expected output:

```
Total points: 320
Clusters found: 2
Core points:   310
Border points: 3
Noise points:  7
```

The 20 randomly scattered outliers mostly land in low-density regions and get flagged as noise (red X markers). A few land close enough to the crescents to become border points (orange squares). The core points (colored dots) form the dense interior of each moon.

Choosing epsilon and min_samples

Picking the right ε and min_samples is the hardest part of using DBSCAN. The algorithm doesn't have a built-in loss function to optimize, so there's no equivalent of the elbow method for K-Means. But there are reliable heuristics.

Setting min_samples

A practical rule of thumb ties min_samples to your data's dimensionality d:

min_samples ≥ d + 1

Where:

  • min_samples is the DBSCAN density threshold parameter
  • d is the number of features (dimensions) in the dataset

In Plain English: For the 2D moons dataset, d = 2, so min_samples should be at least 3. In practice, min_samples = 5 is the standard starting point for 2D data. For noisy datasets, bump it up to 7 or 10 to require denser neighborhoods before forming clusters.

The original DBSCAN paper by Ester et al. (1996) recommended min_samples = 4 as a default for 2D datasets, noting that larger values produce "smoother" cluster boundaries.

Finding epsilon with the k-distance plot

The k-distance plot is the most reliable method for selecting ε. The idea: for each point, compute the distance to its k-th nearest neighbor (where k = min_samples − 1). Sort these distances, plot them, and look for the knee where the curve bends sharply upward.
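A sketch using sklearn.neighbors.NearestNeighbors. Here we query each point's 5th-nearest neighbor (excluding the point itself) to match the summary statistics printed below; exact values depend on the dataset's random seed:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.06, random_state=42)

k = 5  # ask for k+1 neighbors because each point's nearest "neighbor" is itself
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)
kth = np.sort(distances[:, -1])  # sorted distance to each point's 5th neighbor

print(f"Min:    {kth.min():.4f}")
print(f"Median: {np.median(kth):.4f}")
print(f"Max:    {kth.max():.4f}")
# To see the knee: plt.plot(kth); plt.ylabel("5th-NN distance")
```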

Expected output:

```
5th-nearest neighbor distances (sorted):
  Min:    0.0431
  25th %: 0.0692
  Median: 0.0842
  75th %: 0.1051
  Max:    0.2052
```

The plot shows a mostly flat curve that bends upward near the right end. Points below the knee have close neighbors (they're inside dense regions). Points above the knee are farther from their neighbors (they're in sparse areas or outliers). The y-value at the knee is your ε.

For this moons dataset, the knee falls around 0.15 to 0.20. Setting eps=0.3 works well because the moons have tight, consistent density. In messier data, you'd want to be more precise about locating the knee.

Common Pitfall: Don't blindly pick the knee point. Always validate by running DBSCAN with your chosen ε and checking whether the resulting clusters make visual (or domain) sense. The k-distance plot is a starting point, not a final answer.

Epsilon sensitivity in action

The table below shows how dramatically different ε values change the output on the same 300-point moons dataset with min_samples=5:

| ε | Clusters | Noise points | What happens |
|---|---|---|---|
| 0.1 | 15 | 26 | Too small: data shatters into fragments |
| 0.2 | 2 | 0 | Good: both moons cleanly separated |
| 0.3 | 2 | 0 | Good: slightly more tolerance for spread |
| 0.5 | 1 | 0 | Too large: both moons merge into one |

The sweet spot for this data is ε ∈ [0.15, 0.35]. Outside that range, you either fragment the moons or merge them.
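You can reproduce this sweep in a few lines (exact fragment counts at the smallest ε may vary with the random seed):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.06, random_state=42)

results = {}
for eps in (0.1, 0.2, 0.3, 0.5):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels) - {-1})
    n_noise = int(np.sum(labels == -1))
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```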

Feature scaling before DBSCAN

DBSCAN measures distance between points. If one feature ranges from 0 to 1,000 and another from 0 to 1, the first feature dominates the distance calculation and the second is effectively ignored. Scaling is not optional for distance-based algorithms.

Use standardization (z-score scaling) when features are roughly Gaussian, or MinMax scaling when they're bounded. The choice of scaler matters less than the decision to scale at all.
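A minimal sketch with StandardScaler; the income and age features here are synthetic stand-ins for any two wildly different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Hypothetical unscaled features: annual income vs. age
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(20_000, 100_000, size=200),  # dominates raw Euclidean distance
    rng.uniform(18, 90, size=200),           # effectively ignored without scaling
])

X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
```

After scaling, both features contribute comparably to every neighborhood query, so ε has a consistent meaning across dimensions.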

Warning: Always fit the scaler on training data only. If you're using DBSCAN as part of a pipeline (for example, segmenting customers before building a classifier), fitting the scaler on the full dataset introduces data leakage.

When to use DBSCAN (and when not to)

DBSCAN excels in specific scenarios and fails in others. Knowing both sides prevents wasted time.

Use DBSCAN when:

  • Clusters have non-convex shapes (crescents, rings, elongated blobs)
  • The data contains genuine outliers that shouldn't influence cluster formation
  • You don't know the number of clusters in advance and want the algorithm to discover it
  • The data is low-to-moderate dimensional (2 to about 15 features)

Don't use DBSCAN when:

  • Clusters have varying densities. A single global ε can't simultaneously capture a tight cluster and a diffuse one. A small ε fragments the sparse cluster; a large one merges the dense cluster with its surroundings. For varying-density data, use HDBSCAN, which extends DBSCAN with a hierarchy of density levels.
  • The data is high-dimensional (50+ features). Euclidean distance loses meaning when dimensionality grows. Everything becomes roughly equidistant, and no ε value will distinguish dense from sparse regions. Reduce dimensions first with PCA or UMAP.
  • You need soft (probabilistic) cluster assignments. DBSCAN produces hard labels. Each point belongs to exactly one cluster or is noise. For soft clustering, look at Gaussian Mixture Models.
  • Speed is critical on very large datasets without spatial indexing. Without a k-d tree or ball tree, DBSCAN's O(n²) time makes it impractical on millions of points.

| Criterion | K-Means | DBSCAN | HDBSCAN |
|---|---|---|---|
| Cluster shape | Spherical | Any | Any |
| Number of clusters | Must specify K | Auto-detected | Auto-detected |
| Noise handling | None | Built-in | Built-in |
| Varying densities | No | No | Yes |
| High dimensions | OK (Euclidean degrades but still runs) | Poor | Better than DBSCAN |
| Speed (100K points) | Fast (seconds) | Moderate (with index) | Moderate (with index) |

DBSCAN beyond scikit-learn

Scikit-learn's DBSCAN implementation is the standard starting point, but it's worth knowing your options.

HDBSCAN is the most important extension. It removes the need to choose ε by building a hierarchy of clusterings across all possible epsilon values and extracting the most stable clusters. In practice, HDBSCAN often produces better results with less tuning. As of scikit-learn 1.3+, HDBSCAN is available directly via sklearn.cluster.HDBSCAN.

OPTICS (Ordering Points To Identify the Clustering Structure) is another density-based alternative included in scikit-learn. It creates a reachability plot that lets you visually identify clusters at different density levels. Think of it as a more flexible version of DBSCAN that doesn't commit to a single ε.

For very large datasets (millions of points), consider using the algorithm parameter in scikit-learn's DBSCAN to select ball_tree or kd_tree, and tune leaf_size for your hardware. Libraries like cuML (RAPIDS) offer GPU-accelerated DBSCAN for datasets that won't fit in reasonable CPU time.

Conclusion

DBSCAN reframes the clustering question from "where are the centers?" to "where are the crowds?" By tracing regions of high point density, it discovers clusters of any shape, from crescents to spirals to irregular blobs, without you ever specifying how many clusters to expect. The three-way classification into core, border, and noise points gives it a built-in mechanism for outlier detection that centroid-based methods like K-Means simply lack.

The main challenge is parameter selection. The k-distance plot gives you a principled starting point for ε, and the rule of thumb min_samples ≥ d + 1 sets the density threshold. But always validate visually, and know that DBSCAN's single global density threshold breaks down when clusters have different densities.

If you're working with data where density varies across clusters, HDBSCAN is the natural next step. And if you're feeding DBSCAN output into downstream models, remember that proper feature scaling and an understanding of distance-based learning are prerequisites, not afterthoughts. Start with the k-distance plot, find the knee, and let the density guide you to the structure hidden in your data.

Frequently Asked Interview Questions

Q: What is the difference between DBSCAN and K-Means, and when would you pick DBSCAN?

K-Means assumes clusters are spherical and requires you to specify the number of clusters upfront. DBSCAN discovers clusters of arbitrary shape and determines the count automatically based on density. Pick DBSCAN when clusters are non-convex, when your data has outliers you don't want influencing cluster formation, or when you genuinely don't know how many groups exist.

Q: Explain the roles of epsilon and min_samples in DBSCAN.

Epsilon defines the radius of the neighborhood around each point, and min_samples sets the minimum number of points required within that radius for a point to be classified as a core point. Together, they control the density threshold: smaller epsilon or larger min_samples requires tighter, denser regions to form clusters, while looser settings create broader, more inclusive clusters.

Q: How does DBSCAN handle outliers, and why is this an advantage over K-Means?

DBSCAN labels any point that isn't density-reachable from a core point as noise (label -1). These outliers are excluded from all clusters. K-Means has no noise concept and forces every point into a cluster, which means a single distant outlier can shift a centroid and distort the entire partition. DBSCAN's noise handling makes it naturally suitable for anomaly detection tasks.

Q: What happens to DBSCAN's performance in high-dimensional spaces?

In high dimensions, Euclidean distances between points converge (the curse of dimensionality), making it nearly impossible for any fixed epsilon to distinguish dense from sparse regions. Neighborhood queries also slow down because spatial indexing structures like k-d trees degrade above ~20 dimensions. The standard approach is to reduce dimensionality with PCA or UMAP before applying DBSCAN.

Q: DBSCAN found only one cluster in your data. What are two likely causes and how would you fix each?

First, epsilon might be too large, causing all points to be density-reachable from each other and collapsing into a single cluster. Lower epsilon to separate the groups. Second, the data might not have been scaled, so one high-range feature dominates the distance metric and makes everything look equally close. Apply standardization before running DBSCAN and re-tune epsilon on the scaled data.

Q: Can DBSCAN handle clusters with different densities? If not, what would you use instead?

No. DBSCAN uses a single global epsilon, so it can't simultaneously capture a tight cluster and a diffuse one. An epsilon that works for the dense cluster will fragment the sparse one, and vice versa. HDBSCAN solves this by building a hierarchy of clusterings over all possible epsilon values and extracting the most stable clusters at each density level.

Q: Describe the time complexity of DBSCAN and how you would scale it to a large dataset.

Without spatial indexing, DBSCAN is O(n²) because each neighborhood query scans all points. With a k-d tree or ball tree, average complexity drops to O(n log n) in low dimensions. For datasets with millions of points, use approximate nearest neighbor methods or GPU-accelerated implementations like cuML's DBSCAN. Also ensure your data is low-dimensional; high dimensions make spatial indexes ineffective.

Q: A colleague ran DBSCAN and got 50 clusters with 40% of points marked as noise. What went wrong?

The epsilon value is almost certainly too small. When epsilon is too restrictive, each small pocket of points forms its own miniature cluster, and anything outside those pockets gets labeled noise. The fix is to generate a k-distance plot, identify the knee, and set epsilon at that value. Alternatively, min_samples might be set too high for the data's actual density.

Hands-On Practice

In this hands-on tutorial, we apply Density-Based Spatial Clustering of Applications with Noise (DBSCAN), an algorithm designed to find clusters of arbitrary shapes and identify outliers. Unlike K-Means, which forces data into spherical blobs, DBSCAN groups points based on density, mimicking how humans naturally group dense regions of data. We will apply this technique to a customer segmentation dataset, visualizing how DBSCAN separates core customers from outliers based on their income and spending habits.

Dataset: Customer Segments (Clustering) Customer segmentation data with 5 natural clusters based on income, spending, and age. Silhouette score ≈ 0.75 with k=5.

Try changing eps to 0.2 or 0.5 and observe how the number of clusters and noise points changes drastically. Lower eps values will fracture clusters and increase noise (points become isolated), while higher eps values will merge distinct clusters into giant blobs. You can also experiment with min_samples; increasing it makes the algorithm stricter, requiring denser regions to form a cluster.
