K-Means Clustering
K-Means is one of the most popular clustering algorithms in machine learning. It partitions data into K distinct, non-overlapping groups based on similarity, making it well suited for customer segmentation, image compression, and pattern discovery.
What is Clustering?
Clustering is an unsupervised learning technique that groups similar data points together without predefined labels. Unlike classification, where you train on labeled examples ("spam" vs "not spam", "cat" vs "dog"), clustering discovers natural groupings in your data automatically. It's like organizing a messy closet without knowing what categories of clothes you have - the algorithm figures out "shirts", "pants", and "shoes" just by looking at similarities.
Think of sorting a deck of cards without knowing the rules. You might group them by color, number, or suit. Clustering algorithms do exactly this - they find patterns and similarities that humans might miss when dealing with thousands or millions of data points across dozens of dimensions.
How K-Means Works
Imagine a teacher splitting a class into 4 project groups:
Step 1 (Initialize): The teacher randomly picks 4 students as "group leaders" (centroids)
Step 2 (Assign): Each remaining student joins the group leader they're most similar to (nearest centroid)
Step 3 (Update): After everyone joins a group, find the "average" student in each group - this becomes the new group leader
Step 4 (Repeat): Students might switch groups if they're now closer to a different leader. Keep repeating until no one switches anymore!
This is exactly how K-Means works - it iteratively assigns points to nearest centroids, then recalculates centroids, until everything stabilizes (converges).
K-Means partitions data into K clusters by iteratively assigning points to the nearest cluster center (centroid) and then updating the centroids based on the assigned points. The algorithm follows these steps:
K-Means Algorithm Steps
- Initialize: Randomly place K centroids in the feature space
- Assign: Assign each data point to the nearest centroid
- Update: Recalculate centroids as the mean of all assigned points
- Repeat: Continue steps 2-3 until centroids stop moving (convergence)
Key insight: K-Means minimizes the within-cluster sum of squares (inertia), making clusters as compact as possible.
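The four steps above can be sketched in plain NumPy. This is an illustrative toy (with a made-up function name, `kmeans_naive`), not the optimized implementation scikit-learn uses:

```python
import numpy as np

def kmeans_naive(X, k, n_iter=100, seed=0):
    """Toy K-Means: assign points to the nearest centroid, then move centroids."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign: (n, k) matrix of distances from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each centroid moves to the mean of its assigned points
        # (if a cluster ends up empty, keep its old centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # Converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Demo on two well-separated blobs
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(8, 0.5, (20, 2))])
labels, centroids = kmeans_naive(X_demo, k=2, seed=1)
print("Centroids found near:", centroids.round(1))
```

In practice you would use scikit-learn's `KMeans` (shown next), which adds smarter initialization (k-means++) and multiple restarts.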
Implementing K-Means in Python
Scikit-learn provides a simple and efficient K-Means implementation. Let us start with a basic example using synthetic data to understand how clustering works.
# ============================================
# K-Means Clustering - Complete Implementation
# ============================================
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate sample data with 4 natural clusters
# This creates synthetic data where we KNOW there are 4 groups
X, y_true = make_blobs(
n_samples=300, # 300 data points total
centers=4, # 4 actual cluster centers
cluster_std=0.60, # How spread out points are (standard deviation)
random_state=42 # For reproducibility
)
print(f"Data shape: {X.shape}") # (300, 2) - 300 samples, 2 features
print(f"Feature 1 range: [{X[:, 0].min():.2f}, {X[:, 0].max():.2f}]")
print(f"Feature 2 range: [{X[:, 1].min():.2f}, {X[:, 1].max():.2f}]")
print()
# ============================================
# STEP 1: Create K-Means Model
# ============================================
kmeans = KMeans(
n_clusters=4, # We want to find 4 clusters
random_state=42, # For reproducible results
n_init=10, # Run algorithm 10 times with different initializations
max_iter=300 # Maximum iterations per run
)
print("Fitting K-Means...")
kmeans.fit(X) # This is where the magic happens!
print("✓ Convergence achieved!")
print()
# ============================================
# STEP 2: Get Results
# ============================================
# Get cluster assignments for each data point
labels = kmeans.labels_ # Array of cluster IDs (0, 1, 2, 3)
print(f"Cluster labels: {np.unique(labels)}") # [0 1 2 3]
print(f"Sample assignments: {labels[:10]}") # First 10 points' clusters
print()
# Get final centroid positions
centroids = kmeans.cluster_centers_ # Shape: (4, 2) - 4 centroids with 2 coordinates
print(f"Centroids shape: {centroids.shape}") # (4, 2)
print(f"Centroid 0: {centroids[0]}") # Position of cluster 0's center
print()
# Get inertia (within-cluster sum of squared distances)
print(f"Inertia (lower is better): {kmeans.inertia_:.2f}")
print(f"Number of iterations to converge: {kmeans.n_iter_}")
print()
# Count points in each cluster
for cluster_id in range(4):
count = np.sum(labels == cluster_id)
print(f"Cluster {cluster_id}: {count} points")
Visualizing the clusters helps understand how K-Means partitioned the data:
# Visualize the clusters
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red',
marker='X', s=200, edgecolors='black', label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means Clustering Results')
plt.legend()
plt.show()
Important K-Means Parameters
Understanding the key parameters helps you tune K-Means for better results:
| Parameter | Default | Description |
|---|---|---|
| n_clusters | 8 | Number of clusters (K) to create |
| init | 'k-means++' | Initialization method for centroids |
| n_init | 10 | Number of times to run with different centroid seeds |
| max_iter | 300 | Maximum iterations per run |
| random_state | None | Seed for reproducibility |
K-Means is distance-based, so features with larger numeric ranges dominate the result. Scale your features before clustering:
# Always scale features before K-Means
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Now apply K-Means on scaled data
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X_scaled)
Predicting New Data Points
Once trained, you can assign new data points to existing clusters:
# Predict cluster for new data points
# (apply the SAME scaler used during training before predicting)
new_points = np.array([[0, 0], [4, 4], [-3, 2]])
new_points_scaled = scaler.transform(new_points)
predictions = kmeans.predict(new_points_scaled)
print(f"New points assigned to clusters: {predictions}")
# Output: [2 0 1] (example cluster assignments)
K-Means Limitations
While powerful, K-Means has some important limitations to keep in mind:
- Must specify K in advance: You need to decide how many clusters before running. If you guess wrong (K=3 but true clusters=5), results are meaningless.
- Assumes spherical clusters: Struggles with crescent or elongated shapes (like moons). Works best when clusters are round and compact.
- Sensitive to initialization: Bad starting centroids = bad final clusters. That's why we use k-means++ and n_init=10 to try multiple starts.
- Sensitive to outliers: Extreme points pull centroids away from true centers, distorting entire clusters.
- Struggles with varying sizes: Prefers equal-sized clusters. One huge cluster (1000 points) next to a tiny cluster (10 points) often gives poor separation.
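The "spherical clusters" limitation is easy to demonstrate. On scikit-learn's `make_moons` data (two interleaved half-circles), K-Means cannot recover the true groups. A quick sketch:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaved half-circles: clear structure, but NOT spherical
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=42)
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)

# How often K-Means agrees with the true moon membership
# (taking the better of the two possible label orderings)
agreement = max(np.mean(labels == y_true), np.mean(labels != y_true))
print(f"Agreement with true moons: {agreement:.0%}")  # typically well below 100%
```

K-Means draws a straight boundary between the two centroids, so it slices through the crescents instead of following them. Density-based algorithms like DBSCAN handle such shapes better.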
K-Means Strengths
Despite these limitations, K-Means has compelling advantages:
- Fast and scalable: Handles millions of points efficiently. Roughly O(n x K x i) work (n points, K clusters, i iterations) - very fast!
- Easy to understand & interpret: Intuitive: "find K centers, assign points to nearest". Simple to explain to non-technical stakeholders.
- Works with large datasets: Can cluster 1M+ points in seconds, where hierarchical clustering would take hours or days on the same data.
- Guaranteed convergence: Always reaches a stable solution (a local minimum). Won't run forever like some algorithms.
- Great for compact clusters: When data naturally forms round, well-separated groups (e.g., customer segments), K-Means excels!
Practice Questions: K-Means Clustering
Test your understanding with these hands-on exercises.
Task: Create sample data using make_blobs with 200 samples and 3 centers, then fit a K-Means model.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate data
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
# Fit K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
print(f"Inertia: {kmeans.inertia_:.2f}")
Given:
import numpy as np
customers = np.array([
[25, 50000], [30, 60000], [35, 75000], # Young professionals
[45, 120000], [50, 150000], [55, 140000], # Senior executives
[22, 25000], [24, 30000], [26, 28000] # Entry level
])
Task: Scale the data and cluster customers into 3 segments.
Show Solution
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np
customers = np.array([
[25, 50000], [30, 60000], [35, 75000],
[45, 120000], [50, 150000], [55, 140000],
[22, 25000], [24, 30000], [26, 28000]
])
# Scale the data
scaler = StandardScaler()
customers_scaled = scaler.fit_transform(customers)
# Cluster into 3 segments
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(customers_scaled)
print(f"Customer segments: {labels}")
Task: Generate 2D data with 4 clusters, apply K-Means, and create a scatter plot showing clusters with different colors and centroids marked with red X markers.
Show Solution
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate data
X, _ = make_blobs(n_samples=300, centers=4,
cluster_std=0.7, random_state=42)
# Fit K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red',
marker='X', s=200, edgecolors='black')
plt.title('K-Means Clustering with Centroids')
plt.show()
Task: After fitting K-Means with 5 clusters on sample data, count how many data points are assigned to each cluster.
Show Solution
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate and cluster data
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)
kmeans = KMeans(n_clusters=5, random_state=42)
labels = kmeans.fit_predict(X)
# Count samples per cluster
unique, counts = np.unique(labels, return_counts=True)
for cluster, count in zip(unique, counts):
print(f"Cluster {cluster}: {count} samples")
Elbow Method & Silhouette Score
How do you choose the right number of clusters? The Elbow method and Silhouette score are two essential techniques that help you find the optimal K value for your clustering analysis.
The Challenge of Choosing K
Here's the tricky part about K-Means: you have to tell it how many groups to create BEFORE it runs. But wait — if you're using clustering because you don't know the structure of your data, how are you supposed to know how many groups exist? It's like being asked "How many colors should I use?" before you've even seen the painting!
This is one of the biggest challenges in clustering. Get the number wrong, and you'll get misleading results:
Too Few Clusters (K too small)
You're forcing naturally different groups to merge together. Imagine putting "cats" and "dogs" into one "pets" cluster when you actually need them separate for your analysis. You lose important distinctions and your insights become too generic to be useful.
Too Many Clusters (K too large)
You're splitting natural groups into artificial sub-groups. Like separating "orange cats" and "brown cats" into different clusters when they really behave the same way. You end up with meaningless, fragmented groups that don't represent real patterns.
The good news? We have data-driven methods to help us find the "sweet spot" — the optimal K where clusters are meaningful and well-separated. These techniques evaluate clustering quality at different K values, so you're not just guessing. Let's explore the two most popular methods: the Elbow Method and Silhouette Score.
The Elbow Method
Think of a delivery company deciding how many warehouses to build:
• 1 warehouse: High shipping costs (long distances) - like having just 1 cluster
• 2 warehouses: Much better! Costs drop significantly
• 3 warehouses: Even better, but improvement is smaller
• 4 warehouses: Marginal improvement - costs barely decrease
• 10 warehouses: Tiny improvement but 10x the overhead!
The "elbow" is where adding more warehouses gives diminishing returns. In K-Means, inertia (cost) drops sharply at first, then plateaus. The elbow point = optimal K!
The Elbow method works by plotting the inertia (a measure of cluster "tightness") against different values of K. Here's the key insight: as you add more clusters, inertia ALWAYS decreases. Why? Because with more clusters, each point gets closer to its assigned center. In the extreme case, if K equals the number of data points, every point IS its own center, and inertia becomes zero!
But here's the catch: we don't want K to equal our data size — that defeats the purpose of clustering! Instead, we want to find the "elbow" point where adding more clusters stops giving us meaningful improvement. It's the sweet spot between too few clusters (high inertia) and too many (overfitting).
Inertia (Within-Cluster Sum of Squares / WCSS)
Inertia measures how "spread out" points are within their assigned clusters. Think of it as a measure of cluster compactness or cohesion. It calculates the sum of squared distances between each data point and the center (centroid) of its assigned cluster.
In simple terms: Low inertia = points are huddled close to their cluster centers (good!). High inertia = points are scattered far from their centers (bad clustering).
Formula: Inertia = Σ (distance from each point to its centroid)²
The squaring makes bigger distances count more — a point that's twice as far contributes 4x to inertia!
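Since inertia is just this sum of squared point-to-centroid distances, you can recompute it by hand and check it against scikit-learn's `inertia_` attribute:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X)

# Sum of squared distances from each point to its assigned centroid
diffs = X - kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = np.sum(diffs ** 2)

print(f"sklearn inertia: {kmeans.inertia_:.2f}")
print(f"manual inertia:  {manual_inertia:.2f}")  # matches up to float rounding
```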
# ============================================
# Step 1: Import Libraries and Create Sample Data
# ============================================
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate sample data with known clusters
X, _ = make_blobs(
n_samples=300, # 300 data points
centers=4, # We know there are 4 true clusters
cluster_std=0.60, # Spread of points
random_state=42 # For reproducibility
)
Importing Libraries
matplotlib.pyplot: For creating visualizations (our elbow chart).
KMeans: The clustering algorithm we're testing.
make_blobs: Creates fake data with natural groupings — perfect for testing!
Creating Test Data
n_samples=300: We're creating 300 data points.
centers=4: These points are grouped around 4 centers (but we'll pretend we don't know this!).
cluster_std=0.60: How spread out each group is. Lower = tighter clusters.
# ============================================
# Step 2: Test Different K Values (1 to 10)
# ============================================
print("Testing different K values...")
print("=" * 50)
# We'll store the inertia (error) for each K
inertias = []
k_range = range(1, 11) # Test K from 1 to 10
for k in k_range:
# Create K-Means with current K value
kmeans = KMeans(
n_clusters=k, # Number of clusters to create
random_state=42, # Same starting point each time
n_init=10 # Try 10 different starting positions
)
# Train the model on our data
kmeans.fit(X)
# Save the inertia (how "spread out" points are from centers)
inertias.append(kmeans.inertia_)
The Loop: Testing Each K Value
Think of this like trying different numbers of pizza slices to cut your pizza into. We're asking: "If I divide this data into 1 group, 2 groups, 3 groups... how well does each fit?"
n_clusters=k: the number of clusters currently being tested.
n_init=10: K-Means starts randomly, so we try 10 initializations and keep the best.
inertia_: the "error" - the sum of squared distances from points to their centers.
# ============================================
# Step 3: Print Results to See the Pattern
# ============================================
for i, k in enumerate(k_range):
print(f"K={k:2d}: Inertia={inertias[i]:8.2f}")
print("=" * 50)
print()
# Calculate how much inertia DECREASES when we add one more cluster
print("Inertia reduction when adding one more cluster:")
print("-" * 50)
for i in range(len(inertias) - 1):
reduction = inertias[i] - inertias[i+1]
percent = (reduction / inertias[i]) * 100
# Mark significant drops with an arrow
marker = " ← BIG DROP!" if percent > 30 else ""
print(f"K={i+1} → {i+2}: Reduction = {reduction:7.2f} ({percent:5.1f}%){marker}")
Understanding the Output
This is the key insight! We're looking at how much inertia drops when we add each new cluster:
| Transition | What Happens | What It Means |
|---|---|---|
| K=1 → K=2 | HUGE drop (often 50%+) | Splitting into 2 groups helps A LOT |
| K=2 → K=3 | Big drop (30-40%) | 3 groups is much better than 2 |
| K=3 → K=4 | Medium drop (15-25%) | Still helpful, but less so |
| K=4 → K=5+ | Small drops (5-10%) | ⚠️ ELBOW HERE! More clusters don't help much |
💡 The "elbow" is where the % reduction suddenly gets much smaller! That's your optimal K.
# ============================================
# Step 4: Find the Elbow Point Automatically
# ============================================
print()
print("→ Look for where reduction % drops significantly!")
print("→ That's your elbow point (optimal K)")
print()
# Simple automatic detection: find biggest "drop in drops"
drops = []
for i in range(len(inertias) - 1):
drops.append(inertias[i] - inertias[i+1])
# The elbow is where the drop suddenly becomes much smaller
# (i.e., the "second derivative" is largest)
elbow_k = 1
max_drop_change = 0
for i in range(len(drops) - 1):
drop_change = drops[i] - drops[i+1] # How much did the improvement slow down?
if drop_change > max_drop_change:
max_drop_change = drop_change
elbow_k = i + 2 # K value at this elbow
print(f"🎯 Detected Elbow Point: K = {elbow_k}")
print(f" (This is likely the optimal number of clusters)")
The Math Behind Elbow Detection
We calculate the "drop in drops" — basically asking: "When did the improvement suddenly slow down?"
• If K=2 reduced error by 1000 and K=3 reduced by 500, that's a drop change of 500.
• The biggest "drop change" indicates the elbow — where adding clusters stopped helping as much.
Why This Works
Imagine driving and slowing down. The elbow is like when you hit the brakes hardest —
the biggest change in your rate of deceleration.
At this point, you've captured the main clusters. Adding more just splits existing groups unnecessarily.
Now visualize the elbow curve to find the optimal K:
# ============================================
# Step 5: Visualize the Elbow Curve
# ============================================
plt.figure(figsize=(10, 6))
# Plot the inertia values as a line with dots
plt.plot(k_range, inertias, 'bo-', linewidth=2, markersize=8)
# Add labels and title
plt.xlabel('Number of Clusters (K)', fontsize=12)
plt.ylabel('Inertia (Within-cluster sum of squares)', fontsize=12)
plt.title('Elbow Method for Optimal K', fontsize=14, fontweight='bold')
# Make the x-axis show each K value
plt.xticks(k_range)
# Add grid for easier reading
plt.grid(True, alpha=0.3)
# Mark the elbow point
plt.axvline(x=elbow_k, color='r', linestyle='--', label=f'Elbow at K={elbow_k}')
plt.legend()
plt.tight_layout()
plt.show()
Reading the Chart
X-axis: Number of clusters (K)
Y-axis: Inertia (error/spread)
Goal: Find where the line "bends" like an elbow!
The "Elbow" Shape
The curve drops steeply at first (adding clusters helps a lot), then flattens out (adding more clusters doesn't help much). The bend = your answer!
Red Dashed Line
The vertical red line marks the detected elbow point. In our example with 4 true clusters, it should appear at K=4!
Silhouette Score
While the Elbow method looks at how tight clusters are internally, the Silhouette Score takes a different approach: it measures both how close points are to their own cluster AND how far they are from other clusters. This gives us a more complete picture of clustering quality.
Imagine rating how well you fit into your friend group at a party. The silhouette score asks each data point two questions:
Question 1 (a): How similar are you to people in YOUR group? Do you share interests, talk easily, feel comfortable? (This is called intra-cluster cohesion)
Question 2 (b): How different are you from people in the NEAREST other group? Would you fit in there, or do you feel like an outsider? (This is called inter-cluster separation)
Your Silhouette score = (b - a) / max(a, b)
• Score near +1: You LOVE your group and have nothing in common with others. Perfect placement! 🎯
• Score near 0: You're on the boundary — you could go either way. Maybe you're between two friend circles? 🤷
• Score near -1: You're in the WRONG group! You have more in common with another group. Time to switch! ❌
The average silhouette score across ALL people (data points) tells you how well the entire party is organized!
The Silhouette score measures how similar each point is to its own cluster compared to other clusters. Unlike inertia (where lower is better), higher silhouette scores are better, and the score has a meaningful, interpretable range from -1 to +1. This makes it easier to understand than inertia, which can be any positive number.
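The per-point formula can be verified by hand on a tiny made-up example and compared against scikit-learn's `silhouette_samples`:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two tiny 1D clusters; verify the score of the first point by hand
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

a = abs(X[0, 0] - X[1, 0])               # mean distance within own cluster: 1.0
b = np.mean(np.abs(X[0, 0] - X[2:, 0]))  # mean distance to the other cluster: 10.5
s_manual = (b - a) / max(a, b)           # (10.5 - 1.0) / 10.5

s_sklearn = silhouette_samples(X, labels)[0]
print(f"manual:  {s_manual:.3f}")
print(f"sklearn: {s_sklearn:.3f}")  # same value
```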
Score near +1: Perfect Clustering
Points are very close to their own cluster center and very far from other clusters. Ideal separation!
Score near 0: Overlapping Clusters
Points are on the boundary between clusters. They could belong to either one. Clusters may be too close.
Score near -1: Wrong Assignment
Points are closer to a DIFFERENT cluster than their own! They've been misclassified. Bad clustering!
| Score Range | Interpretation | What It Means in Practice |
|---|---|---|
| 0.71 to 1.0 | Strong structure | Excellent! Clusters are clearly separated and well-defined. Your K choice is great. |
| 0.51 to 0.70 | Reasonable structure | Good clustering. Some overlap may exist, but groups are still meaningful. |
| 0.26 to 0.50 | Weak structure | Clusters are overlapping. Consider trying different K values or a different algorithm. |
| < 0.25 | No substantial structure | Poor clustering. The data might not have natural clusters, or your K is way off. |
# ============================================
# Calculate Silhouette Score for Different K Values
# ============================================
from sklearn.metrics import silhouette_score
silhouette_scores = []
k_range = range(2, 11) # NOTE: Silhouette requires at least 2 clusters (K=1 has no meaning)
print("Testing Silhouette Score for K = 2 to 10...")
print("=" * 50)
for k in k_range:
# Create and fit K-Means
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = kmeans.fit_predict(X) # Get cluster labels for each point
# Calculate silhouette score
score = silhouette_score(X, labels)
silhouette_scores.append(score)
# Interpret the score
if score >= 0.71:
quality = "Excellent ⭐"
elif score >= 0.51:
quality = "Good ✓"
elif score >= 0.26:
quality = "Fair ⚠"
else:
quality = "Poor ✗"
print(f"K={k}: Silhouette Score = {score:.3f} → {quality}")
print("=" * 50)
print(f"\n→ Best K = {k_range[silhouette_scores.index(max(silhouette_scores))]} "
      f"(highest score: {max(silhouette_scores):.3f})")
Visualize to find the K with highest silhouette score:
# ============================================
# Visualize Silhouette Scores
# ============================================
plt.figure(figsize=(10, 6))
# Plot the scores
plt.plot(k_range, silhouette_scores, 'go-', linewidth=2, markersize=10)
# Highlight the best K
best_k = k_range[silhouette_scores.index(max(silhouette_scores))]
plt.axvline(x=best_k, color='r', linestyle='--', label=f'Best K = {best_k}')
# Labels
plt.xlabel('Number of Clusters (K)', fontsize=12)
plt.ylabel('Silhouette Score', fontsize=12)
plt.title('Silhouette Score for Optimal K\n(Higher is Better!)', fontsize=14, fontweight='bold')
plt.xticks(k_range)
plt.ylim(0, 1) # Score range is -1 to 1, but we expect positive values
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()
Reading the Silhouette Chart
Unlike the Elbow method (where you look for a bend), here you simply find the peak — the K value with the highest score. That's your optimal number of clusters! The red dashed line marks it for you.
Elbow vs Silhouette: Use Both!
Elbow method: Looks at cluster tightness (inertia) — requires visual interpretation.
Silhouette: Looks at separation AND tightness — gives a clear "best" answer.
When they agree, you can be confident in your K choice!
Combining Both Methods
For robust cluster selection, use both methods together. They often agree, but when they differ, consider domain knowledge and the specific use case.
# Combined visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Elbow plot
axes[0].plot(range(1, 11), inertias, 'bo-', linewidth=2)
axes[0].set_xlabel('Number of Clusters (K)')
axes[0].set_ylabel('Inertia')
axes[0].set_title('Elbow Method')
axes[0].grid(True, alpha=0.3)
# Silhouette plot
axes[1].plot(range(2, 11), silhouette_scores, 'go-', linewidth=2)
axes[1].set_xlabel('Number of Clusters (K)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Analysis')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Silhouette Visualization
A silhouette plot shows the score for each individual point, helping you understand cluster quality in more detail:
# Silhouette plot for individual samples
from sklearn.metrics import silhouette_samples, silhouette_score
import numpy as np
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)
# Get silhouette values for each sample
sample_silhouette_values = silhouette_samples(X, labels)
# Average silhouette score
avg_score = silhouette_score(X, labels)
print(f"Average Silhouette Score: {avg_score:.3f}")
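Building on those per-sample values, the classic silhouette plot draws one horizontal bar per point, grouped by cluster. A minimal sketch (recomputing everything so it runs standalone; the spacing constants are arbitrary layout choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)
values = silhouette_samples(X, labels)

plt.figure(figsize=(8, 6))
y_lower = 10
for cluster_id in range(4):
    cluster_values = np.sort(values[labels == cluster_id])
    y_upper = y_lower + len(cluster_values)
    # One horizontal bar per sample, sorted so each cluster forms a "blade"
    plt.fill_betweenx(np.arange(y_lower, y_upper), 0, cluster_values, alpha=0.7)
    plt.text(-0.05, (y_lower + y_upper) / 2, str(cluster_id))
    y_lower = y_upper + 10  # gap between clusters
plt.axvline(x=silhouette_score(X, labels), color='red', linestyle='--',
            label='Average score')
plt.xlabel('Silhouette value')
plt.ylabel('Samples (grouped by cluster)')
plt.title('Per-Sample Silhouette Plot')
plt.legend()
plt.show()
```

Clusters whose blades fall mostly below the red average line, or that contain negative values, are candidates for a different K.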
Practice Questions: Elbow & Silhouette
Test your understanding with these hands-on exercises.
Task: Fit K-Means with K=5 and print the inertia value.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=200, centers=5, random_state=42)
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(X)
print(f"Inertia: {kmeans.inertia_:.2f}")
Task: Create an elbow plot for K values from 1 to 10 and identify the elbow point.
Show Solution
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
inertias = []
for k in range(1, 11):
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X)
inertias.append(kmeans.inertia_)
plt.plot(range(1, 11), inertias, 'bo-')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# Elbow appears at K=4
Task: Fit K-Means with K=4 and calculate the silhouette score.
Show Solution
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)
score = silhouette_score(X, labels)
print(f"Silhouette Score: {score:.3f}")
Task: Test K values from 2 to 8 and print the K with the highest silhouette score.
Show Solution
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
best_k = 2
best_score = -1
for k in range(2, 9):
kmeans = KMeans(n_clusters=k, random_state=42)
labels = kmeans.fit_predict(X)
score = silhouette_score(X, labels)
print(f"K={k}: Score={score:.3f}")
if score > best_score:
best_score = score
best_k = k
print(f"\nBest K: {best_k} with score {best_score:.3f}")
Hierarchical Clustering (Agglomerative)
Hierarchical clustering builds a tree of clusters, letting you explore data at multiple granularity levels. Unlike K-Means, you do not need to specify the number of clusters upfront, giving you more flexibility in exploratory analysis.
Think of how a family tree organizes people: individuals group into immediate families, families into extended families, and so on up to one big tree. Hierarchical clustering works exactly like this: start with individual data points (people), progressively merge similar ones (create families), and build a tree showing relationships at every level. You can "cut" the tree at any height to get your desired number of groups - just like deciding whether to organize by immediate families (many groups) or extended families (fewer groups).
How Hierarchical Clustering Works
Agglomerative (bottom-up) hierarchical clustering starts with each data point as its own cluster, then progressively merges the closest clusters until only one remains. This creates a tree structure called a dendrogram that shows how clusters are related.
Imagine you have 100 customers. Initially, each customer is their own "cluster" (100 clusters). The algorithm finds the two most similar customers and merges them (99 clusters). It repeats this process, always merging the two closest clusters, until all customers form one giant cluster. The dendrogram records this entire merge history, letting you "rewind" to any stage and extract the cluster structure you need.
Agglomerative Clustering Steps
- Start: Each data point is its own cluster (N clusters)
- Find: Identify the two closest clusters
- Merge: Combine them into a single cluster (N-1 clusters)
- Repeat: Continue until all points are in one cluster
- Cut: Choose where to cut the dendrogram for desired cluster count
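The merge-and-cut procedure above can be sketched with SciPy, whose `linkage` function records the full merge history and `fcluster` cuts the resulting tree (scikit-learn's `AgglomerativeClustering`, shown later, wraps the same idea):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=42)

# Each row of Z records one merge: (cluster_a, cluster_b, distance, new_size)
Z = linkage(X, method='ward')
print(f"Merges recorded: {Z.shape[0]}")  # N - 1 = 59 merges

# "Cut" the tree so that at most 3 flat clusters remain
flat_labels = fcluster(Z, t=3, criterion='maxclust')
print(f"Clusters found: {len(np.unique(flat_labels))}")
```

Keeping `Z` around is useful: you can cut the same tree at different heights without re-clustering.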
Linkage Methods
The linkage method determines how to measure distance between clusters. Different linkage methods produce different clustering results:
| Linkage Method | How It Works (Simple Explanation) | Best For | Real-World Analogy |
|---|---|---|---|
| ward (recommended) | Minimizes total variance within clusters: merge the pair whose merge increases the total "spread" (sum of squared distances from cluster centers) the least. Creates balanced, compact groups. | Compact, equally-sized clusters; the most common choice | School classes: form balanced classrooms where students have similar skill levels (low variance within each class) |
| complete | Uses the maximum distance between any two points (one from each cluster). Conservative: merges only when even the farthest points are close. Also called "furthest neighbor" or "maximum linkage". | Tight, spherical clusters; prevents elongated chains; similar-sized groups | Gated communities: only merge neighborhoods if even the farthest houses are still close together (strict boundaries) |
| average | Uses the average distance over all cross-cluster point pairs - a balanced middle ground. Also called UPGMA (Unweighted Pair Group Method with Arithmetic mean). | General purpose; moderate cluster sizes; robust compromise | City districts: merge districts based on the average commute time between all residents (balanced consideration) |
| single | Uses the minimum distance between any two points (one from each cluster). Easily merges clusters when ANY two points are close, creating elongated, chain-like "bridges". Also called "nearest neighbor"; sensitive to outliers. | Elongated, chain-like patterns; can create unbalanced clusters; rare use cases | Highway system: connect cities if ANY two border points are close, even if city centers are far (creates long chains) |
Tip: Start with linkage='ward'. It's the most commonly used method, tends to produce intuitive, well-balanced clusters, and works well with Euclidean distance. Only experiment with other linkages if ward doesn't capture your data structure well.
Implementing Hierarchical Clustering
Scikit-learn provides AgglomerativeClustering, and scipy offers tools for dendrograms:
# ============================================
# Hierarchical (Agglomerative) Clustering
# ============================================
# Import required libraries
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import numpy as np
print("Setting up Hierarchical Clustering...")
print("=" * 50)
# Generate sample data with 4 natural groups
X, _ = make_blobs(
n_samples=150, # 150 data points
centers=4, # 4 true clusters
cluster_std=0.60, # Moderate spread
random_state=42 # Reproducibility
)
print(f"Data shape: {X.shape}") # (150, 2)
print(f"Starting with {len(X)} individual clusters (each point is its own cluster)")
print()
# ============================================
# STEP 1: Apply Agglomerative Clustering
# ============================================
# Create the clustering model
agg_clustering = AgglomerativeClustering(
n_clusters=4, # Cut the tree to get 4 final clusters
linkage='ward' # Ward minimizes within-cluster variance (good default)
)
print("Applying Agglomerative Clustering...")
print("Process: Start with 150 clusters → Merge closest pairs → Repeat until 4 remain")
print()
# Fit and predict cluster labels
# Under the hood: Builds full dendrogram, then cuts at the right height
labels = agg_clustering.fit_predict(X)
print("✓ Clustering complete!")
print()
# ============================================
# STEP 2: Analyze Results
# ============================================
print(f"Cluster labels: {np.unique(labels)}") # [0 1 2 3]
print()
# Count points per cluster
for cluster_id in range(4):
count = np.sum(labels == cluster_id)
percentage = (count / len(X)) * 100
print(f"Cluster {cluster_id}: {count} points ({percentage:.1f}%)")
print()
print(f"Total merges performed: {len(X) - 4}") # Started with 150, ended with 4
print(f"Number of clusters: {agg_clustering.n_clusters_}")
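You don't have to fix the number of clusters upfront: AgglomerativeClustering can instead cut the tree at a merge-distance threshold via its `distance_threshold` parameter. A short sketch (the threshold value 5.0 is an arbitrary choice for illustration, not a recommendation):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Same blob data as above
X, _ = make_blobs(n_samples=150, centers=4, cluster_std=0.60, random_state=42)

# n_clusters=None + distance_threshold: stop merging once the next merge
# would exceed the threshold, letting the data determine the cluster count
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0)
labels = agg.fit_predict(X)

print(f"Clusters found at threshold 5.0: {agg.n_clusters_}")
```

After fitting, the `n_clusters_` attribute tells you how many clusters survived the cut.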
Creating Dendrograms
Dendrograms visualize the hierarchical structure and help you decide where to cut for a specific number of clusters:
# ============================================
# Creating a Dendrogram Visualization
# ============================================
# Import visualization tools
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
print("Building dendrogram...")
print()
# ============================================
# STEP 1: Calculate Linkage Matrix
# ============================================
# The linkage matrix records the merge history:
# - Which clusters were merged at each step
# - The distance between them at merge time
# - How many original points each new cluster contains
linkage_matrix = linkage(
X, # Data to cluster
method='ward' # Ward linkage (minimizes variance)
)
print(f"Linkage matrix shape: {linkage_matrix.shape}") # (n_samples-1, 4)
print(f"Number of merges recorded: {len(linkage_matrix)}")
print()
print("Each row represents one merge:")
print("[cluster1_id, cluster2_id, merge_distance, num_points_in_new_cluster]")
print()
print("First 3 merges:")
print(linkage_matrix[:3])
print()
# ============================================
# STEP 2: Plot Dendrogram
# ============================================
plt.figure(figsize=(12, 6))
# Create the dendrogram plot
# truncate_mode='level' limits display to top 5 levels (for readability)
dendrogram(
linkage_matrix, # The merge history
truncate_mode='level', # Show only top levels
p=5 # Show 5 levels (prevents clutter with large datasets)
)
plt.xlabel('Sample Index (or Cluster Size)', fontsize=12)
plt.ylabel('Distance (Height of Merge)', fontsize=12)
plt.title('Hierarchical Clustering Dendrogram\n(Higher merges = more different clusters)',
fontsize=14, fontweight='bold')
plt.axhline(y=10, color='r', linestyle='--', label='Potential cut point')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.show()
print("✓ Dendrogram created!")
print()
print("How to read it:")
print("- X-axis: Data points or clusters")
print("- Y-axis: Distance at which clusters merge")
print("- Vertical lines: Clusters being merged")
print("- Height of U-shape: How different the clusters are")
print("- Cut horizontally to get desired number of clusters")
You can cut the dendrogram at different heights to get different numbers of clusters:
# Cut dendrogram at specific distance threshold
from scipy.cluster.hierarchy import fcluster
# Get clusters by cutting at distance threshold
distance_threshold = 10
clusters = fcluster(linkage_matrix, t=distance_threshold, criterion='distance')
print(f"Number of clusters at threshold {distance_threshold}: {len(np.unique(clusters))}")
# Or cut to get specific number of clusters
n_clusters = 4
clusters = fcluster(linkage_matrix, t=n_clusters, criterion='maxclust')
print(f"Cluster labels: {np.unique(clusters)}")
Visualizing Cluster Results
# Visualize clustering results
plt.figure(figsize=(12, 5))
# Ward linkage with 4 clusters
plt.subplot(1, 2, 1)
agg = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels = agg.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.title('Agglomerative Clustering (4 clusters)')
# Compare with different n_clusters
plt.subplot(1, 2, 2)
agg = AgglomerativeClustering(n_clusters=6, linkage='ward')
labels = agg.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.title('Agglomerative Clustering (6 clusters)')
plt.tight_layout()
plt.show()
When to Use Hierarchical Clustering
- Explore cluster structure: You don't know how many clusters exist? A dendrogram shows all the possibilities at once.
- Number of clusters unknown: Unlike K-Means, you don't need to specify K upfront. Build the full tree, then cut wherever makes sense.
- Small-medium datasets: Works well up to about 10,000 points. Beyond that, use K-Means instead.
- Visualize hierarchy: Stakeholders love dendrograms - they show how groups relate at multiple levels.
- Nested clusters meaningful: e.g., biological taxonomy (Species → Genus → Family), organizational charts.
When NOT to Use Hierarchical Clustering
- Dataset is very large: O(n²) memory and up to O(n³) time. 100k+ points can take hours or days - use K-Means or Mini-Batch K-Means.
- Adding new data points: Points can't be added incrementally; the entire tree must be rebuilt. Use K-Means for dynamic data.
- Real-time clustering needed: Too slow for production systems that need instant results - use K-Means.
- Memory is limited: The pairwise distance computation is memory-hungry (a full distance matrix for 50k points is on the order of 10 GB). K-Means uses far less.
- Simple flat clusters suffice: If you don't need the hierarchy, don't pay the computational cost - use K-Means instead.
Practice Questions: Hierarchical Clustering
Test your understanding with these hands-on exercises.
Task: Create sample data and apply AgglomerativeClustering with 3 clusters using ward linkage.
Show Solution
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)
print(f"Labels: {labels[:10]}")
Task: Generate data and create a dendrogram using scipy with ward linkage.
Show Solution
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=50, centers=3, random_state=42)
linkage_matrix = linkage(X, method='ward')
plt.figure(figsize=(10, 5))
dendrogram(linkage_matrix)
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.title('Dendrogram')
plt.show()
Task: Apply AgglomerativeClustering with ward, complete, and average linkage, then visualize the results side by side.
Show Solution
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=150, centers=4, random_state=42)
linkages = ['ward', 'complete', 'average']
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, link in zip(axes, linkages):
agg = AgglomerativeClustering(n_clusters=4, linkage=link)
labels = agg.fit_predict(X)
ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
ax.set_title(f'Linkage: {link}')
plt.tight_layout()
plt.show()
Task: Use fcluster to cut the dendrogram at distance=15 and count resulting clusters.
Show Solution
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
import numpy as np
X, _ = make_blobs(n_samples=100, centers=4, random_state=42)
linkage_matrix = linkage(X, method='ward')
clusters = fcluster(linkage_matrix, t=15, criterion='distance')
n_clusters = len(np.unique(clusters))
print(f"Clusters at distance 15: {n_clusters}")
DBSCAN for Density-Based Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters of arbitrary shapes and automatically detects outliers. It is ideal when your data has non-spherical clusters or contains noise that should not belong to any cluster.
Picture a music festival at night: dense crowds gather around each stage, while a few people wander alone between them. DBSCAN formalizes exactly this intuition:
• Core Point (Stage area): If you're standing somewhere with at least 5 people within 2 meters, you're in the "core" of a crowd
• Border Point (Edge of crowd): You're within 2 meters of a core area but don't have 5 people around you
• Noise Point (Lone wanderer): You're not near any dense crowd - just wandering alone
The algorithm connects all core areas that are close enough (within 2m), forming clusters of any shape! No need to specify how many stages (clusters) exist - DBSCAN discovers them automatically.
How DBSCAN Works
Unlike K-Means which uses distance to centroids, DBSCAN defines clusters as dense regions of points separated by sparse regions. It groups together points that are closely packed and marks points in low-density regions as outliers.
The algorithm has a simple logic: If you have enough neighbors nearby, you're in a cluster. It doesn't care about cluster shape (spherical, crescent, serpentine) - only about local density. This makes DBSCAN perfect for real-world data where clusters aren't always nice, round blobs.
DBSCAN Point Types (Detailed Explanation)
1. Core Point (Dense Center)
Definition: A point is "core" if it has at least min_samples neighbors within eps distance (including itself).
Simple Explanation: Imagine you're at a party. You're a "core person" if you can stand in one spot and have at least 5 people (including yourself) within 2 meters. This means you're in a dense crowd - the heart of the party!
Example: If eps=2m and min_samples=5, and you count 6 people within 2 meters of you (including yourself), you're a core point.
2. Border Point (Edge of Crowd)
Definition: A point that is NOT core (doesn't have min_samples neighbors) but is within eps distance of a core point.
Simple Explanation: You're on the edge of the party crowd. You don't have 5 people around YOU (not core), but you're standing within 2 meters of someone who IS in the dense center (core). You're part of the cluster but on the periphery.
Example: You only have 2 people within 2 meters of you, BUT one of those people is a core point. You're a border point - attached to the cluster but not central.
3. Noise Point (Outlier)
Definition: A point that is neither core nor border. It doesn't have enough neighbors to be core, and it's not close enough to any core point to be border.
Simple Explanation: You're wandering alone between party areas. You don't have 5 people around you (not core), and you're not near anyone who does (not border). You're an isolated wanderer - labeled as noise with cluster ID = -1.
Example: You only have 1 person within 2 meters, and that person is also alone. Both of you are noise points - outliers that don't belong to any cluster.
Once every point is classified, the algorithm builds clusters:
- Find all core points (points with enough neighbors)
- Connect core points that are within eps of each other - these form the cluster "backbones"
- Attach border points to their nearby clusters
- Label the remaining points as noise (outliers)
Think of it like finding islands: core points are land, border points are the shoreline, and noise points are isolated rocks in the ocean!
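The point-type rules above can be sketched in a few lines of NumPy. The 1-D "positions", eps=1.0, and min_samples=3 below are made-up toy values for illustration, not something sklearn produces:

```python
import numpy as np

# Hypothetical 1-D positions of "people"; eps and min_samples chosen for the toy example
points = np.array([0.0, 0.5, 1.0, 1.2, 1.4, 2.3, 8.0])
eps, min_samples = 1.0, 3

# Pairwise distances (1-D data, so plain absolute differences)
dist = np.abs(points[:, None] - points[None, :])

neighbors = dist <= eps                          # neighborhood includes the point itself
core = neighbors.sum(axis=1) >= min_samples      # dense enough -> core point
border = ~core & (neighbors & core).any(axis=1)  # not core, but next to a core point
noise = ~core & ~border                          # neither -> outlier

print(f"core:   {points[core]}")    # the dense run 0.0 ... 1.4
print(f"border: {points[border]}")  # 2.3 - only 2 neighbors, but one is core
print(f"noise:  {points[noise]}")   # 8.0 - isolated
```

This is just the classification step; real DBSCAN then connects core points into clusters, as the steps above describe.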
DBSCAN Parameters
DBSCAN has two main parameters that control cluster formation:
| Parameter | What It Controls (Simple Explanation) | How to Choose | Effect on Results |
|---|---|---|---|
| eps (epsilon) | Neighborhood radius - "how close is close?" The maximum distance for two points to be considered neighbors. Think of it as the radius of a circle around each point: anything inside the circle is a neighbor, anything outside is not. Real example: if eps = 2 meters, anyone within 2 m of you is your neighbor; someone 2.1 m away is NOT. Units match your data (meters, pixels, standardized units, etc.). | Method 1 - k-distance plot: plot the distance to the k-th nearest neighbor and pick eps at the "elbow" (see the detailed section below). Method 2 - domain knowledge: use meaningful distances from your domain (e.g. "customers within 5 km are in the same region"). Method 3 - trial and error: start with eps = 0.5 on scaled data, or the average distance to the k-th neighbor, then adjust. | Too large: everything becomes one giant cluster. Too small: everything becomes noise/outliers. Just right: clear clusters with some noise. Critical parameter - the most important one to tune correctly! |
| min_samples | Crowd-size threshold - "how many make a crowd?" The minimum number of points (including the point itself) required within eps to form a dense region (core point). Lower values make clusters easier to form; higher values enforce a stricter definition of density. Real example: if min_samples = 5, you need at least 5 people within eps to count as a "crowd" (core point); 3 nearby people are not dense enough. Always an integer, minimum 1 (though 1-2 are impractical). | Rule of thumb: 2-D data: start with min_samples = 4. Higher dimensions: use 2 × n_features. Practical minimum: at least 3. Large datasets: 10-20 for stricter clusters. Tip: less critical than eps - start with 4-5 and adjust if needed. | Larger values: fewer clusters (stricter), more noise points, only very dense regions become clusters. Smaller values: more clusters (lenient), less noise, clusters form easily. Less sensitive than eps. |
A good starting point for standardized data:
DBSCAN(eps=0.5, min_samples=5)  # good default for scaled data
Then adjust eps based on the k-distance plot, and keep min_samples=5 unless you have a specific reason to change it.
Important: Always scale your features first (e.g., with StandardScaler) so all features have similar ranges. Otherwise eps becomes meaningless (which unit would it use?).
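To get a feel for how sensitive DBSCAN is to eps, sweep it over the two-moons data and watch the cluster and noise counts change. The three eps values here are arbitrary illustrative picks (too small, about right, too large):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

results = {}
for eps in [0.02, 0.2, 2.0]:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 = noise
    n_noise = list(labels).count(-1)
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With a tiny eps almost everything is labeled noise; with a huge eps the chain effect swallows both moons into a single cluster; only the middle value recovers the two crescents.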
Implementing DBSCAN
# ============================================
# DBSCAN - Density-Based Clustering
# ============================================
# Import required libraries
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
import numpy as np
import matplotlib.pyplot as plt
print("Setting up DBSCAN for non-spherical clusters...")
print("=" * 50)
# ============================================
# STEP 1: Generate Non-Spherical Data
# ============================================
# K-Means struggles with moon/crescent shapes
# DBSCAN excels at these!
X, _ = make_moons(
n_samples=300, # 300 data points
noise=0.05, # Small amount of noise (outliers)
random_state=42 # Reproducibility
)
print(f"Data shape: {X.shape}") # (300, 2)
print(f"Data has 2 crescent-shaped clusters (like two moons)")
print()
# ============================================
# STEP 2: Apply DBSCAN Algorithm
# ============================================
# Create DBSCAN model with two key parameters
dbscan = DBSCAN(
eps=0.2, # Maximum distance between neighbors (neighborhood radius)
# "two points within 0.2 units of each other are in the same neighborhood"
min_samples=5 # Minimum points needed to form a dense region (core point)
# "Need at least 5 points nearby to be a 'crowd'"
)
print("DBSCAN Parameters:")
print(f" eps (neighborhood radius): {dbscan.eps}")
print(f" min_samples (crowd size): {dbscan.min_samples}")
print()
print("Running DBSCAN...")
print("Process: For each point, count neighbors within eps")
print(" If >= min_samples, mark as core point")
print(" Connect core points to form clusters")
print(" Label remaining points as border or noise")
print()
# Fit and predict cluster labels
# Label -1 means NOISE (outlier)
# Labels 0, 1, 2, ... are cluster IDs
labels = dbscan.fit_predict(X)
print("✓ DBSCAN complete!")
print()
# ============================================
# STEP 3: Analyze Results
# ============================================
# Calculate number of clusters (excluding noise)
# Noise is labeled as -1, so we subtract 1 if it exists
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
# Count noise points (labeled as -1)
n_noise = list(labels).count(-1)
# Get unique cluster labels
unique_labels = set(labels)
print(f"Unique labels: {sorted(unique_labels)}")
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise} ({n_noise/len(X)*100:.1f}% of data)")
print()
# Show cluster sizes
for label in sorted(unique_labels):
if label == -1:
continue # Skip noise, already counted
count = list(labels).count(label)
percentage = (count / len(X)) * 100
print(f"Cluster {label}: {count} points ({percentage:.1f}%)")
print()
print(f"Core samples (dense points): {len(dbscan.core_sample_indices_)}")
print(f"Components (connected regions): {dbscan.components_.shape}")
Visualize how DBSCAN handles non-spherical clusters:
# Visualize DBSCAN results
plt.figure(figsize=(12, 5))
# DBSCAN on moon data
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.title(f'DBSCAN: {n_clusters} clusters, {n_noise} noise points')
# Compare with K-Means (struggles with moons)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=42)
km_labels = kmeans.fit_predict(X)
plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=km_labels, cmap='viridis', alpha=0.6)
plt.title('K-Means (struggles with non-spherical)')
plt.tight_layout()
plt.show()
Finding Optimal eps
The k-distance plot helps find a good eps value. For each point, calculate the distance to its k-th nearest neighbor (commonly k = min_samples), then sort and plot these distances. Most points sit in a flat region of similar, small k-distances (core and border points), while noise points show much larger distances (a sharp rise). The "elbow" where the curve bends sharply is your candidate eps:
# ============================================
# k-Distance Plot for Optimal eps Selection
# ============================================
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt
print("Finding optimal eps using k-distance plot...")
print("=" * 50)
# ============================================
# STEP 1: Calculate k-Nearest Neighbor Distances
# ============================================
# Choose k = min_samples (or min_samples - 1)
# This is the same k we'll use in DBSCAN's min_samples
k = 4
print(f"Calculating distance to {k}th nearest neighbor for each point...")
# Create a nearest neighbors model
neighbors = NearestNeighbors(n_neighbors=k)
neighbors.fit(X) # Fit on our data
# Calculate distances to k nearest neighbors for each point
# distances shape: (n_samples, k) - distances to each of k neighbors
# indices shape: (n_samples, k) - indices of k nearest neighbors
distances, indices = neighbors.kneighbors(X)
print(f"Distances shape: {distances.shape}") # (300, 4)
print()
# ============================================
# STEP 2: Extract and Sort k-th Neighbor Distance
# ============================================
# Get the distance to the k-th (farthest returned) neighbor for each point.
# [:, k-1] selects the last column. Note: when querying the training points
# themselves, each point counts as its own nearest neighbor (distance 0),
# so this is really the (k-1)th true neighbor; use n_neighbors=k+1 for exactly k.
k_distances = distances[:, k-1]
print(f"Sample k-distances (first 10 points):")
print(k_distances[:10])
print()
# Sort distances in ascending order for visualization
k_distances_sorted = np.sort(k_distances)
print(f"Min k-distance: {k_distances_sorted[0]:.4f}")
print(f"Max k-distance: {k_distances_sorted[-1]:.4f}")
print(f"Median k-distance: {np.median(k_distances_sorted):.4f}")
print()
# ============================================
# STEP 3: Plot k-Distance Graph
# ============================================
plt.figure(figsize=(10, 6))
# Plot sorted k-distances
plt.plot(k_distances_sorted, linewidth=2, color='steelblue')
# Add suggested eps line (visual helper)
# Rule of thumb: eps at the "knee" or elbow of the curve
suggested_eps = 0.2 # This would be determined visually or algorithmically
plt.axhline(y=suggested_eps, color='red', linestyle='--',
linewidth=2, label=f'Suggested eps = {suggested_eps}')
plt.xlabel('Points (sorted by k-distance)', fontsize=12)
plt.ylabel(f'{k}th Nearest Neighbor Distance', fontsize=12)
plt.title(f'k-Distance Plot (k={k})\nLook for the elbow point!',
fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.legend(fontsize=11)
plt.tight_layout()
plt.show()
print("✓ k-distance plot created!")
print()
print("How to interpret:")
print("- Flat region (left): Points with close neighbors (core/border points)")
print("- Sharp rise (right): Points with distant neighbors (noise points)")
print("- Elbow point: Optimal eps value (where curve bends sharply)")
print()
print(f"Suggested eps for this data: {suggested_eps}")
# The elbow point suggests optimal eps
Handling Noise Points
DBSCAN labels noise points as -1. You can filter them out or analyze them separately:
# Identify noise points
noise_mask = labels == -1
noise_points = X[noise_mask]
clustered_points = X[~noise_mask]
print(f"Total points: {len(X)}")
print(f"Clustered points: {len(clustered_points)}")
print(f"Noise points: {len(noise_points)}")
# Visualize with noise highlighted
plt.figure(figsize=(10, 6))
plt.scatter(clustered_points[:, 0], clustered_points[:, 1],
c=labels[~noise_mask], cmap='viridis', alpha=0.6, label='Clustered')
plt.scatter(noise_points[:, 0], noise_points[:, 1],
c='red', marker='x', s=50, label='Noise')
plt.legend()
plt.title('DBSCAN with Noise Points Highlighted')
plt.show()
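If your application needs every point assigned to a cluster (say, every customer must land in some segment), a common post-processing trick is to attach each noise point to the cluster of its nearest core sample. A sketch, assuming DBSCAN found at least one cluster on this data:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_.copy()

noise_mask = labels == -1
if noise_mask.any():
    # Find each noise point's nearest core sample (db.components_ holds the
    # core samples' coordinates) and adopt that core point's cluster label
    nn = NearestNeighbors(n_neighbors=1).fit(db.components_)
    _, idx = nn.kneighbors(X[noise_mask])
    core_labels = labels[db.core_sample_indices_]
    labels[noise_mask] = core_labels[idx.ravel()]

print(f"Noise before: {noise_mask.sum()} points, after: {(labels == -1).sum()} points")
```

Whether this is appropriate depends on your use case - sometimes the noise label itself is the insight (e.g. anomaly detection), and forcing an assignment would hide it.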
Algorithm Comparison Guide
| Algorithm | Best For | Advantages | Disadvantages | When to Use |
|---|---|---|---|---|
| K-Means | Spherical, equal-sized clusters | Fast & scalable; simple to understand; works on large datasets; easy to implement | Must specify K upfront; assumes spherical shapes; sensitive to outliers; poor for varying densities | Customer segmentation, image compression, document clustering with clear groups |
| Hierarchical | Exploring nested structures | No need to specify K; produces a dendrogram; multiple granularities; deterministic results | Slow (O(n²) or O(n³)); high memory usage; can't undo merges; not scalable to big data | Biological taxonomy, social network analysis, organizational hierarchies (small-medium datasets) |
| DBSCAN | Arbitrary shapes with noise | Finds any cluster shape; robust to outliers; no need to specify K; identifies noise points | Struggles with varying densities; sensitive to parameters; slower than K-Means; hard to tune eps/min_samples | Geospatial data, anomaly detection, non-globular clusters, data with outliers |
Decision Flowchart
Follow this decision tree to choose the best clustering algorithm for your data. Start at the top and answer each question:
- Question 1: Do you know how many clusters (K) you want? Can you specify the exact number of groups upfront (e.g. "I want 3 customer segments")?
  - Yes → go to Question 2.
  - No, you want the algorithm to discover K → go to Question 3.
- Question 2: How large is your dataset? (The number of data points matters for performance.)
  - Large (> 100,000 points) → K-Means (fast & scalable).
  - Medium (10k-100k points) → K-Means (best performance).
  - Small (< 10,000 points) → any algorithm works fine.
- Question 3: Since you don't know K, dataset size determines the algorithm choice.
  - Small enough that you can afford the computational cost → Hierarchical (explore the structure with a dendrogram).
  - Large, or you need an efficient algorithm → go to Question 4.
- Question 4: What do the cluster shapes and outliers look like?
  - Spherical (round, compact clusters) → K-Means (use the Elbow Method to pick K).
  - Irregular (crescent, serpentine shapes) → DBSCAN (handles any shape).
  - Has outliers (noisy data with anomalies) → DBSCAN (labels noise as -1).
Quick Reference Guide:
K-Means
- Know K upfront
- Large datasets
- Spherical clusters
- Fast & simple
Hierarchical
- Don't know K
- Small datasets (<10k)
- Explore structure
- See dendrogram
DBSCAN
- Don't know K
- Irregular shapes
- Has outliers/noise
- Density-based
🎯 Pro Tip: Still Unsure?
Try all three algorithms and compare their Silhouette Scores. The algorithm with the highest score (closer to +1) is your best choice for that specific dataset!
Practice Questions: DBSCAN
Test your understanding with these hands-on exercises.
Task: Generate moon-shaped data and apply DBSCAN with eps=0.3 and min_samples=5.
Show Solution
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X)
print(f"Unique labels: {set(labels)}")
Task: Apply DBSCAN and print the number of clusters found and noise points detected.
Show Solution
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters: {n_clusters}")
print(f"Noise points: {n_noise}")
Task: Generate data, calculate 5th nearest neighbor distances, and plot the k-distance graph.
Show Solution
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
neighbors = NearestNeighbors(n_neighbors=5)
neighbors.fit(X)
distances, _ = neighbors.kneighbors(X)
distances = np.sort(distances[:, 4])
plt.figure(figsize=(10, 6))
plt.plot(distances)
plt.xlabel('Points')
plt.ylabel('5th NN Distance')
plt.title('k-Distance Plot')
plt.show()
Task: Apply DBSCAN and create a scatter plot where noise points are shown in red with X markers.
Show Solution
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)
noise_mask = labels == -1
plt.figure(figsize=(10, 6))
plt.scatter(X[~noise_mask, 0], X[~noise_mask, 1],
c=labels[~noise_mask], cmap='viridis')
plt.scatter(X[noise_mask, 0], X[noise_mask, 1],
c='red', marker='x', s=50, label='Noise')
plt.legend()
plt.show()
Cluster Evaluation & Interpretation
Evaluating clustering results is challenging because there are no ground truth labels. Learn how to assess cluster quality using internal metrics, visualizations, and domain knowledge to ensure your clusters are meaningful and actionable.
Internal vs External Evaluation
Clustering evaluation falls into two categories depending on whether you have ground truth labels:
Internal metrics - used when you do not have ground truth labels:
- Silhouette Score
- Calinski-Harabasz Index
- Davies-Bouldin Index
- Inertia / SSE
External metrics - used when ground truth labels are available:
- Adjusted Rand Index (ARI)
- Normalized Mutual Information (NMI)
- Homogeneity / Completeness
- V-Measure
Internal Evaluation Metrics
These metrics evaluate cluster quality based only on the data and cluster assignments:
# Import evaluation metrics
from sklearn.metrics import silhouette_score, calinski_harabasz_score
from sklearn.metrics import davies_bouldin_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
# Fit K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)
# Calculate internal metrics
silhouette = silhouette_score(X, labels)
calinski = calinski_harabasz_score(X, labels)
davies_bouldin = davies_bouldin_score(X, labels)
print(f"Silhouette Score: {silhouette:.3f}") # Higher is better (-1 to 1)
print(f"Calinski-Harabasz: {calinski:.2f}") # Higher is better
print(f"Davies-Bouldin: {davies_bouldin:.3f}") # Lower is better
| Metric | Range | Goal | What It Measures (Beginner Explanation) | How to Interpret Score |
|---|---|---|---|---|
| Silhouette Score | -1 to +1 | Maximize (higher = better) | How well-separated and cohesive the clusters are. For each point it compares (a) the average distance to points in its own cluster with (b) the average distance to points in the nearest other cluster. Formula intuition: (b - a) / max(a, b). Think: "Do I fit well in my group vs. how different am I from other groups?" | +0.71 to +1.0: strong, well-separated clusters. +0.51 to +0.70: good cluster structure. +0.26 to +0.50: weak/overlapping clusters. ≤ +0.25: poor clustering. Negative: points are likely in the wrong clusters! Aim for > 0.5 for good clustering. |
| Calinski-Harabasz Index (Variance Ratio) | 0 to ∞ | Maximize (higher = better) | The ratio of between-cluster spread to within-cluster tightness. Good clustering has clusters far apart from each other (high between-cluster variance) while points within each cluster are tightly packed (low within-cluster variance). Think: "Are my groups well-separated AND internally compact?" | Higher is better, but there is no universal "good" score - it is relative. Use it to compare different K values and pick the K with the highest index (e.g. K=3: CH=120; K=4: CH=145 ← better; K=5: CH=138). Tends to favor K-Means over DBSCAN. |
| Davies-Bouldin Index | 0 to ∞ | Minimize (lower = better) | The average similarity between each cluster and its most similar cluster: for each cluster, find its "closest rival", measure how similar they are, then average across all clusters. Bad clustering: clusters overlap with their neighbors. Good clustering: even the most similar pairs are well separated. | Close to 0: excellent separation (0 is the theoretical perfect score). 0.0-0.5: very good. 0.5-1.0: moderate. > 1.0: poor (overlapping clusters). Lower is always better - a score of 0.3 beats 0.8. Use it to compare K values and pick the lowest. |
- Silhouette Score: Best all-around metric, easy to interpret (-1 to 1 range)
- Calinski-Harabasz: Good for comparing K values, fast to compute
- Davies-Bouldin: Sensitive to cluster overlap, good complement to Silhouette
Pro tip: Use all three and see if they agree! If Silhouette says K=4 is best, and CH Index also peaks at K=4, that's strong evidence.
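A sketch of that cross-check: compute all three internal metrics for several candidate K values on the same data and see where they agree (the K range 2-6 is an arbitrary illustrative choice):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = {
        'silhouette': silhouette_score(X, labels),                # higher is better
        'calinski_harabasz': calinski_harabasz_score(X, labels),  # higher is better
        'davies_bouldin': davies_bouldin_score(X, labels),        # lower is better
    }
    print(f"K={k}: " + "  ".join(f"{m}={v:.3f}" for m, v in scores[k].items()))
```

If the silhouette peak, the Calinski-Harabasz peak, and the Davies-Bouldin minimum all land on the same K, that agreement is strong evidence for that choice.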
External Evaluation Metrics
When you have ground truth labels (for validation or benchmarking), use these metrics:
# External metrics (when true labels are available)
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score
# Generate data with known labels
X, y_true = make_blobs(n_samples=300, centers=4, random_state=42)
# Cluster the data
kmeans = KMeans(n_clusters=4, random_state=42)
y_pred = kmeans.fit_predict(X)
# Calculate external metrics
ari = adjusted_rand_score(y_true, y_pred)
nmi = normalized_mutual_info_score(y_true, y_pred)
homogeneity = homogeneity_score(y_true, y_pred)
completeness = completeness_score(y_true, y_pred)
v_measure = v_measure_score(y_true, y_pred)
print(f"Adjusted Rand Index: {ari:.3f}")
print(f"Normalized MI: {nmi:.3f}")
print(f"Homogeneity: {homogeneity:.3f}")
print(f"Completeness: {completeness:.3f}")
print(f"V-Measure: {v_measure:.3f}")
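One property worth knowing: external metrics compare the grouping, not the label values, so renaming the clusters does not change the score. A minimal check with hand-made labels:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [2, 2, 0, 0, 1, 1]  # identical grouping, different label names

# Both metrics see a perfect match despite the relabeling
print(adjusted_rand_score(y_true, y_pred))           # 1.0
print(normalized_mutual_info_score(y_true, y_pred))  # 1.0
```

This matters because clustering algorithms assign arbitrary cluster IDs - cluster "0" from K-Means has no reason to correspond to class "0" in your ground truth.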
Comparing Multiple Algorithms
Evaluate different clustering algorithms on the same data to find the best approach:
# Compare clustering algorithms
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
# Prepare data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
# Test different algorithms
results = {}
# K-Means
km = KMeans(n_clusters=4, random_state=42)
km_labels = km.fit_predict(X_scaled)
results['K-Means'] = silhouette_score(X_scaled, km_labels)
# Agglomerative
agg = AgglomerativeClustering(n_clusters=4)
agg_labels = agg.fit_predict(X_scaled)
results['Agglomerative'] = silhouette_score(X_scaled, agg_labels)
# DBSCAN
db = DBSCAN(eps=0.5, min_samples=5)
db_labels = db.fit_predict(X_scaled)
if len(set(db_labels)) > 1:  # Need at least 2 clusters for silhouette
    results['DBSCAN'] = silhouette_score(X_scaled, db_labels)
for algo, score in results.items():
print(f"{algo}: {score:.3f}")
Cluster Interpretation
After clustering, analyze what makes each cluster unique to derive actionable insights:
# Analyze cluster characteristics
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
# Create sample customer data
np.random.seed(42)
customers = pd.DataFrame({
    'age': np.random.randint(18, 70, 100),
    'income': np.random.randint(20000, 150000, 100),
    'spending_score': np.random.randint(1, 100, 100)
})
# Scale and cluster
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(customers)
kmeans = KMeans(n_clusters=3, random_state=42)
customers['cluster'] = kmeans.fit_predict(X_scaled)
# Analyze cluster profiles
cluster_summary = customers.groupby('cluster').agg({
    'age': ['mean', 'std'],
    'income': ['mean', 'std'],
    'spending_score': ['mean', 'std']
}).round(2)
print("Cluster Profiles:")
print(cluster_summary)
# Visualize cluster profiles with radar/bar charts
import matplotlib.pyplot as plt
cluster_means = customers.groupby('cluster')[['age', 'income', 'spending_score']].mean()
# Normalize for comparison
cluster_normalized = (cluster_means - cluster_means.min()) / (cluster_means.max() - cluster_means.min())
# Bar chart comparison
cluster_normalized.T.plot(kind='bar', figsize=(10, 6))
plt.title('Cluster Profile Comparison (Normalized)')
plt.xlabel('Features')
plt.ylabel('Normalized Value')
# Example names for illustration - choose names that match your actual cluster profiles
plt.legend(title='Cluster', labels=['Young Budget', 'High Spenders', 'Conservative'])
plt.tight_layout()
plt.show()
Best Practices for Cluster Analysis
Clustering Best Practices
- Always scale your data: Use StandardScaler before clustering
- Try multiple K values: Use Elbow and Silhouette methods together
- Compare algorithms: Different data shapes suit different algorithms
- Validate with domain knowledge: Do clusters make business sense?
- Document your findings: Name clusters meaningfully for stakeholders
- Test stability: Run with different random seeds
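The stability point above can be sketched as a quick check: cluster with several seeds and compute pairwise ARI between the runs (assuming the same blob data as earlier; values near 1.0 suggest the clusters are stable):

```python
# Stability check: re-cluster with different seeds, compare runs with ARI
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

seeds = (0, 1, 2, 3)
runs = [KMeans(n_clusters=4, random_state=s, n_init=10).fit_predict(X)
        for s in seeds]

aris = []
for i in range(len(runs)):
    for j in range(i + 1, len(runs)):
        ari = adjusted_rand_score(runs[i], runs[j])
        aris.append(ari)
        print(f"seed {seeds[i]} vs seed {seeds[j]}: ARI = {ari:.3f}")
```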
Practice Questions: Cluster Evaluation
Test your understanding with these hands-on exercises.
Task: Fit K-Means with 4 clusters and print silhouette, Calinski-Harabasz, and Davies-Bouldin scores.
Show Solution
from sklearn.metrics import silhouette_score, calinski_harabasz_score
from sklearn.metrics import davies_bouldin_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)
print(f"Silhouette: {silhouette_score(X, labels):.3f}")
print(f"CH Index: {calinski_harabasz_score(X, labels):.2f}")
print(f"DB Index: {davies_bouldin_score(X, labels):.3f}")
Task: Generate data with known labels, cluster with K-Means, and calculate Adjusted Rand Index.
Show Solution
from sklearn.metrics import adjusted_rand_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42)
y_pred = kmeans.fit_predict(X)
ari = adjusted_rand_score(y_true, y_pred)
print(f"Adjusted Rand Index: {ari:.3f}")
Task: Apply both K-Means and AgglomerativeClustering with 4 clusters, compare their silhouette scores, and print which is better.
Show Solution
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
km = KMeans(n_clusters=4, random_state=42)
km_labels = km.fit_predict(X)
km_score = silhouette_score(X, km_labels)
agg = AgglomerativeClustering(n_clusters=4)
agg_labels = agg.fit_predict(X)
agg_score = silhouette_score(X, agg_labels)
print(f"K-Means: {km_score:.3f}")
print(f"Agglomerative: {agg_score:.3f}")
print(f"Winner: {'K-Means' if km_score > agg_score else 'Agglomerative'}")
Task: Create a DataFrame with features, cluster it, and print the mean of each feature per cluster.
Show Solution
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
df = pd.DataFrame({
    'age': np.random.randint(20, 60, 100),
    'income': np.random.randint(30000, 120000, 100)
})
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
kmeans = KMeans(n_clusters=3, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)
print(df.groupby('cluster').mean().round(2))
Key Takeaways
K-Means is Fast and Simple
Works best with spherical clusters and requires specifying K upfront. Scale your data first for best results
Elbow Method Finds Optimal K
Plot inertia vs K and look for the "elbow" point where adding clusters gives diminishing returns
Silhouette Measures Quality
Score ranges from -1 to 1. Higher values indicate well-separated, cohesive clusters
Hierarchical Shows Structure
Dendrograms reveal nested cluster relationships. Cut at different heights for different granularities
DBSCAN Handles Noise
Finds arbitrary-shaped clusters and labels outliers as noise. No need to specify cluster count
Choose Algorithm Wisely
K-Means for speed, Hierarchical for exploration, DBSCAN for noisy or irregular-shaped data
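To illustrate the last takeaway, here is a hypothetical comparison on make_moons data, whose crescent shapes break K-Means' spherical-cluster assumption (the eps=0.3 value is a guess that suits this data's scale):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

# K-Means assumes spherical clusters - it tends to slice the moons in half
km_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)
# DBSCAN follows density - it traces each crescent instead
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

km_ari = adjusted_rand_score(y_true, km_labels)
db_ari = adjusted_rand_score(y_true, db_labels)
print(f"K-Means ARI: {km_ari:.3f}")
print(f"DBSCAN ARI: {db_ari:.3f}")
```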
Knowledge Check
Quick Quiz
Test what you've learned about clustering algorithms