K-Means Clustering
K-Means is one of the most popular clustering algorithms in machine learning. It partitions data into K distinct, non-overlapping groups based on similarity, making it well suited for customer segmentation, image compression, and pattern discovery.
What is Clustering?
Clustering is an unsupervised learning technique that groups similar data points together without predefined labels. Unlike classification, where you train on labeled examples ("spam" vs "not spam", "cat" vs "dog"), clustering discovers natural groupings in your data automatically. It's like organizing a messy closet without knowing what categories of clothes you have - the algorithm figures out "shirts", "pants", and "shoes" just by looking at similarities.
Think of sorting a deck of cards without knowing the rules. You might group them by color, number, or suit. Clustering algorithms do exactly this - they find patterns and similarities that humans might miss when dealing with thousands or millions of data points across dozens of dimensions.
How K-Means Works
Imagine a teacher splitting a class into 4 project groups:
Step 1 (Initialize): The teacher randomly picks 4 students as "group leaders" (centroids)
Step 2 (Assign): Each remaining student joins the group leader they're most similar to (nearest centroid)
Step 3 (Update): After everyone joins a group, find the "average" student in each group - this becomes the new group leader
Step 4 (Repeat): Students might switch groups if they're now closer to a different leader. Keep repeating until no one switches anymore!
This is exactly how K-Means works - it iteratively assigns points to nearest centroids, then recalculates centroids, until everything stabilizes (converges).
K-Means partitions data into K clusters by iteratively assigning points to the nearest cluster center (centroid) and then updating the centroids based on the assigned points. The algorithm follows these steps:
K-Means Algorithm Steps
- Initialize: Randomly place K centroids in the feature space
- Assign: Assign each data point to the nearest centroid
- Update: Recalculate centroids as the mean of all assigned points
- Repeat: Continue steps 2-3 until centroids stop moving (convergence)
Key insight: K-Means minimizes the within-cluster sum of squares (inertia), making clusters as compact as possible.
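The four steps above can be sketched in plain NumPy. This is an illustrative toy (with a made-up function name, `kmeans_naive`), not the optimized implementation scikit-learn uses:

```python
import numpy as np

def kmeans_naive(X, k, n_iter=100, seed=0):
    """Toy K-Means: assign points to the nearest centroid, then move centroids."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign: (n, k) matrix of distances from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each centroid moves to the mean of its assigned points
        # (if a cluster ends up empty, keep its old centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # Converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Demo on two well-separated blobs
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(8, 0.5, (20, 2))])
labels, centroids = kmeans_naive(X_demo, k=2, seed=1)
print("Centroids found near:", centroids.round(1))
```

In practice you would use scikit-learn's `KMeans` (shown next), which adds smarter initialization (k-means++) and multiple restarts.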
Implementing K-Means in Python
Scikit-learn provides a simple and efficient K-Means implementation. Let us start with a basic example using synthetic data to understand how clustering works.
# ============================================
# K-Means Clustering - Complete Implementation
# ============================================
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate sample data with 4 natural clusters
# This creates synthetic data where we KNOW there are 4 groups
X, y_true = make_blobs(
n_samples=300, # 300 data points total
centers=4, # 4 actual cluster centers
cluster_std=0.60, # How spread out points are (standard deviation)
random_state=42 # For reproducibility
)
print(f"Data shape: {X.shape}") # (300, 2) - 300 samples, 2 features
print(f"Feature 1 range: [{X[:, 0].min():.2f}, {X[:, 0].max():.2f}]")
print(f"Feature 2 range: [{X[:, 1].min():.2f}, {X[:, 1].max():.2f}]")
print()
# ============================================
# STEP 1: Create K-Means Model
# ============================================
kmeans = KMeans(
n_clusters=4, # We want to find 4 clusters
random_state=42, # For reproducible results
n_init=10, # Run algorithm 10 times with different initializations
max_iter=300 # Maximum iterations per run
)
print("Fitting K-Means...")
kmeans.fit(X) # This is where the magic happens!
print("✓ Convergence achieved!")
print()
# ============================================
# STEP 2: Get Results
# ============================================
# Get cluster assignments for each data point
labels = kmeans.labels_ # Array of cluster IDs (0, 1, 2, 3)
print(f"Cluster labels: {np.unique(labels)}") # [0 1 2 3]
print(f"Sample assignments: {labels[:10]}") # First 10 points' clusters
print()
# Get final centroid positions
centroids = kmeans.cluster_centers_ # Shape: (4, 2) - 4 centroids with 2 coordinates
print(f"Centroids shape: {centroids.shape}") # (4, 2)
print(f"Centroid 0: {centroids[0]}") # Position of cluster 0's center
print()
# Get inertia (within-cluster sum of squared distances)
print(f"Inertia (lower is better): {kmeans.inertia_:.2f}")
print(f"Number of iterations to converge: {kmeans.n_iter_}")
print()
# Count points in each cluster
for cluster_id in range(4):
count = np.sum(labels == cluster_id)
print(f"Cluster {cluster_id}: {count} points")
Visualizing the clusters helps understand how K-Means partitioned the data:
# Visualize the clusters
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red',
marker='X', s=200, edgecolors='black', label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means Clustering Results')
plt.legend()
plt.show()
Important K-Means Parameters
Understanding the key parameters helps you tune K-Means for better results:
| Parameter | Default | Description |
|---|---|---|
| n_clusters | 8 | Number of clusters (K) to create |
| init | 'k-means++' | Initialization method for centroids |
| n_init | 10 | Number of times to run with different centroid seeds |
| max_iter | 300 | Maximum iterations per run |
| random_state | None | Seed for reproducibility |
K-Means is distance-based, so features with larger numeric ranges dominate the result. Scale your features before clustering:
# Always scale features before K-Means
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Now apply K-Means on scaled data
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X_scaled)
Predicting New Data Points
Once trained, you can assign new data points to existing clusters:
# Predict cluster for new data points
# (apply the SAME scaler used during training before predicting)
new_points = np.array([[0, 0], [4, 4], [-3, 2]])
new_points_scaled = scaler.transform(new_points)
predictions = kmeans.predict(new_points_scaled)
print(f"New points assigned to clusters: {predictions}")
# Output: [2 0 1] (example cluster assignments)
K-Means Limitations
While powerful, K-Means has some important limitations to keep in mind:
- Must specify K in advance: You need to decide how many clusters before running. If you guess wrong (K=3 but true clusters=5), results are meaningless.
- Assumes spherical clusters: Struggles with crescent or elongated shapes (like moons). Works best when clusters are round and compact.
- Sensitive to initialization: Bad starting centroids = bad final clusters. That's why we use k-means++ and n_init=10 to try multiple starts.
- Sensitive to outliers: Extreme points pull centroids away from true centers, distorting entire clusters.
- Struggles with varying sizes: Prefers equal-sized clusters. One huge cluster (1000 points) next to a tiny cluster (10 points) often gives poor separation.
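The "spherical clusters" limitation is easy to demonstrate. On scikit-learn's `make_moons` data (two interleaved half-circles), K-Means cannot recover the true groups. A quick sketch:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaved half-circles: clear structure, but NOT spherical
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=42)
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)

# How often K-Means agrees with the true moon membership
# (taking the better of the two possible label orderings)
agreement = max(np.mean(labels == y_true), np.mean(labels != y_true))
print(f"Agreement with true moons: {agreement:.0%}")  # typically well below 100%
```

K-Means draws a straight boundary between the two centroids, so it slices through the crescents instead of following them. Density-based algorithms like DBSCAN handle such shapes better.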
K-Means Strengths
Despite these limitations, K-Means has compelling advantages:
- Fast and scalable: Handles millions of points efficiently. Roughly O(n x K x i) work (n points, K clusters, i iterations) - very fast!
- Easy to understand & interpret: Intuitive: "find K centers, assign points to nearest". Simple to explain to non-technical stakeholders.
- Works with large datasets: Can cluster 1M+ points in seconds, where hierarchical clustering would take hours or days on the same data.
- Guaranteed convergence: Always reaches a stable solution (a local minimum). Won't run forever like some algorithms.
- Great for compact clusters: When data naturally forms round, well-separated groups (e.g., customer segments), K-Means excels!
Practice Questions: K-Means Clustering
Test your understanding with these hands-on exercises.
Task: Create sample data using make_blobs with 200 samples and 3 centers, then fit a K-Means model.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate data
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
# Fit K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
print(f"Inertia: {kmeans.inertia_:.2f}")
Given:
import numpy as np
customers = np.array([
[25, 50000], [30, 60000], [35, 75000], # Young professionals
[45, 120000], [50, 150000], [55, 140000], # Senior executives
[22, 25000], [24, 30000], [26, 28000] # Entry level
])
Task: Scale the data and cluster customers into 3 segments.
Show Solution
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np
customers = np.array([
[25, 50000], [30, 60000], [35, 75000],
[45, 120000], [50, 150000], [55, 140000],
[22, 25000], [24, 30000], [26, 28000]
])
# Scale the data
scaler = StandardScaler()
customers_scaled = scaler.fit_transform(customers)
# Cluster into 3 segments
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(customers_scaled)
print(f"Customer segments: {labels}")
Task: Generate 2D data with 4 clusters, apply K-Means, and create a scatter plot showing clusters with different colors and centroids marked with red X markers.
Show Solution
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate data
X, _ = make_blobs(n_samples=300, centers=4,
cluster_std=0.7, random_state=42)
# Fit K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red',
marker='X', s=200, edgecolors='black')
plt.title('K-Means Clustering with Centroids')
plt.show()
Task: After fitting K-Means with 5 clusters on sample data, count how many data points are assigned to each cluster.
Show Solution
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate and cluster data
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)
kmeans = KMeans(n_clusters=5, random_state=42)
labels = kmeans.fit_predict(X)
# Count samples per cluster
unique, counts = np.unique(labels, return_counts=True)
for cluster, count in zip(unique, counts):
print(f"Cluster {cluster}: {count} samples")
Elbow Method & Silhouette Score
How do you choose the right number of clusters? The Elbow method and Silhouette score are two essential techniques that help you find the optimal K value for your clustering analysis.
The Challenge of Choosing K
Here's the tricky part about K-Means: you have to tell it how many groups to create BEFORE it runs. But wait — if you're using clustering because you don't know the structure of your data, how are you supposed to know how many groups exist? It's like being asked "How many colors should I use?" before you've even seen the painting!
This is one of the biggest challenges in clustering. Get the number wrong, and you'll get misleading results:
Too Few Clusters (K too small)
You're forcing naturally different groups to merge together. Imagine putting "cats" and "dogs" into one "pets" cluster when you actually need them separate for your analysis. You lose important distinctions and your insights become too generic to be useful.
Too Many Clusters (K too large)
You're splitting natural groups into artificial sub-groups. Like separating "orange cats" and "brown cats" into different clusters when they really behave the same way. You end up with meaningless, fragmented groups that don't represent real patterns.
The good news? We have data-driven methods to help us find the "sweet spot" — the optimal K where clusters are meaningful and well-separated. These techniques evaluate clustering quality at different K values, so you're not just guessing. Let's explore the two most popular methods: the Elbow Method and Silhouette Score.
The Elbow Method
Think of a delivery company deciding how many warehouses to build:
• 1 warehouse: High shipping costs (long distances) - like having just 1 cluster
• 2 warehouses: Much better! Costs drop significantly
• 3 warehouses: Even better, but improvement is smaller
• 4 warehouses: Marginal improvement - costs barely decrease
• 10 warehouses: Tiny improvement but 10x the overhead!
The "elbow" is where adding more warehouses gives diminishing returns. In K-Means, inertia (cost) drops sharply at first, then plateaus. The elbow point = optimal K!
The Elbow method works by plotting the inertia (a measure of cluster "tightness") against different values of K. Here's the key insight: as you add more clusters, inertia ALWAYS decreases. Why? Because with more clusters, each point gets closer to its assigned center. In the extreme case, if K equals the number of data points, every point IS its own center, and inertia becomes zero!
But here's the catch: we don't want K to equal our data size — that defeats the purpose of clustering! Instead, we want to find the "elbow" point where adding more clusters stops giving us meaningful improvement. It's the sweet spot between too few clusters (high inertia) and too many (overfitting).
Inertia (Within-Cluster Sum of Squares / WCSS)
Inertia measures how "spread out" points are within their assigned clusters. Think of it as a measure of cluster compactness or cohesion. It calculates the sum of squared distances between each data point and the center (centroid) of its assigned cluster.
In simple terms: Low inertia = points are huddled close to their cluster centers (good!). High inertia = points are scattered far from their centers (bad clustering).
Formula: Inertia = Σ (distance from each point to its centroid)²
The squaring makes bigger distances count more — a point that's twice as far contributes 4x to inertia!
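Since inertia is just this sum of squared point-to-centroid distances, you can recompute it by hand and check it against scikit-learn's `inertia_` attribute:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X)

# Sum of squared distances from each point to its assigned centroid
diffs = X - kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = np.sum(diffs ** 2)

print(f"sklearn inertia: {kmeans.inertia_:.2f}")
print(f"manual inertia:  {manual_inertia:.2f}")  # matches up to float rounding
```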
# ============================================
# Step 1: Import Libraries and Create Sample Data
# ============================================
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate sample data with known clusters
X, _ = make_blobs(
n_samples=300, # 300 data points
centers=4, # We know there are 4 true clusters
cluster_std=0.60, # Spread of points
random_state=42 # For reproducibility
)
Importing Libraries
matplotlib.pyplot: For creating visualizations (our elbow chart).
KMeans: The clustering algorithm we're testing.
make_blobs: Creates fake data with natural groupings — perfect for testing!
Creating Test Data
n_samples=300: We're creating 300 data points.
centers=4: These points are grouped around 4 centers (but we'll pretend we don't know this!).
cluster_std=0.60: How spread out each group is. Lower = tighter clusters.
# ============================================
# Step 2: Test Different K Values (1 to 10)
# ============================================
print("Testing different K values...")
print("=" * 50)
# We'll store the inertia (error) for each K
inertias = []
k_range = range(1, 11) # Test K from 1 to 10
for k in k_range:
# Create K-Means with current K value
kmeans = KMeans(
n_clusters=k, # Number of clusters to create
random_state=42, # Same starting point each time
n_init=10 # Try 10 different starting positions
)
# Train the model on our data
kmeans.fit(X)
# Save the inertia (how "spread out" points are from centers)
inertias.append(kmeans.inertia_)
The Loop: Testing Each K Value
Think of this like trying different numbers of pizza slices to cut your pizza into. We're asking: "If I divide this data into 1 group, 2 groups, 3 groups... how well does each fit?"
n_clusters=k: the number of clusters currently being tested.
n_init=10: K-Means starts randomly, so we try 10 initializations and keep the best.
inertia_: the "error" - the sum of squared distances from points to their centers.
# ============================================
# Step 3: Print Results to See the Pattern
# ============================================
for i, k in enumerate(k_range):
print(f"K={k:2d}: Inertia={inertias[i]:8.2f}")
print("=" * 50)
print()
# Calculate how much inertia DECREASES when we add one more cluster
print("Inertia reduction when adding one more cluster:")
print("-" * 50)
for i in range(len(inertias) - 1):
reduction = inertias[i] - inertias[i+1]
percent = (reduction / inertias[i]) * 100
# Mark significant drops with an arrow
marker = " ← BIG DROP!" if percent > 30 else ""
print(f"K={i+1} → {i+2}: Reduction = {reduction:7.2f} ({percent:5.1f}%){marker}")
Understanding the Output
This is the key insight! We're looking at how much inertia drops when we add each new cluster:
| Transition | What Happens | What It Means |
|---|---|---|
| K=1 → K=2 | HUGE drop (often 50%+) | Splitting into 2 groups helps A LOT |
| K=2 → K=3 | Big drop (30-40%) | 3 groups is much better than 2 |
| K=3 → K=4 | Medium drop (15-25%) | Still helpful, but less so |
| K=4 → K=5+ | Small drops (5-10%) | ⚠️ ELBOW HERE! More clusters don't help much |
💡 The "elbow" is where the % reduction suddenly gets much smaller! That's your optimal K.
# ============================================
# Step 4: Find the Elbow Point Automatically
# ============================================
print()
print("→ Look for where reduction % drops significantly!")
print("→ That's your elbow point (optimal K)")
print()
# Simple automatic detection: find biggest "drop in drops"
drops = []
for i in range(len(inertias) - 1):
drops.append(inertias[i] - inertias[i+1])
# The elbow is where the drop suddenly becomes much smaller
# (i.e., the "second derivative" is largest)
elbow_k = 1
max_drop_change = 0
for i in range(len(drops) - 1):
drop_change = drops[i] - drops[i+1] # How much did the improvement slow down?
if drop_change > max_drop_change:
max_drop_change = drop_change
elbow_k = i + 2 # K value at this elbow
print(f"🎯 Detected Elbow Point: K = {elbow_k}")
print(f" (This is likely the optimal number of clusters)")
The Math Behind Elbow Detection
We calculate the "drop in drops" — basically asking: "When did the improvement suddenly slow down?"
• If K=2 reduced error by 1000 and K=3 reduced by 500, that's a drop change of 500.
• The biggest "drop change" indicates the elbow — where adding clusters stopped helping as much.
Why This Works
Imagine driving and slowing down. The elbow is like when you hit the brakes hardest —
the biggest change in your rate of deceleration.
At this point, you've captured the main clusters. Adding more just splits existing groups unnecessarily.
Now visualize the elbow curve to find the optimal K:
# ============================================
# Step 5: Visualize the Elbow Curve
# ============================================
plt.figure(figsize=(10, 6))
# Plot the inertia values as a line with dots
plt.plot(k_range, inertias, 'bo-', linewidth=2, markersize=8)
# Add labels and title
plt.xlabel('Number of Clusters (K)', fontsize=12)
plt.ylabel('Inertia (Within-cluster sum of squares)', fontsize=12)
plt.title('Elbow Method for Optimal K', fontsize=14, fontweight='bold')
# Make the x-axis show each K value
plt.xticks(k_range)
# Add grid for easier reading
plt.grid(True, alpha=0.3)
# Mark the elbow point
plt.axvline(x=elbow_k, color='r', linestyle='--', label=f'Elbow at K={elbow_k}')
plt.legend()
plt.tight_layout()
plt.show()
Reading the Chart
X-axis: Number of clusters (K)
Y-axis: Inertia (error/spread)
Goal: Find where the line "bends" like an elbow!
The "Elbow" Shape
The curve drops steeply at first (adding clusters helps a lot), then flattens out (adding more clusters doesn't help much). The bend = your answer!
Red Dashed Line
The vertical red line marks the detected elbow point. In our example with 4 true clusters, it should appear at K=4!
Silhouette Score
While the Elbow method looks at how tight clusters are internally, the Silhouette Score takes a different approach: it measures both how close points are to their own cluster AND how far they are from other clusters. This gives us a more complete picture of clustering quality.
Imagine rating how well you fit into your friend group at a party. The silhouette score asks each data point two questions:
Question 1 (a): How similar are you to people in YOUR group? Do you share interests, talk easily, feel comfortable? (This is called intra-cluster cohesion)
Question 2 (b): How different are you from people in the NEAREST other group? Would you fit in there, or do you feel like an outsider? (This is called inter-cluster separation)
Your Silhouette score = (b - a) / max(a, b)
• Score near +1: You LOVE your group and have nothing in common with others. Perfect placement! 🎯
• Score near 0: You're on the boundary — you could go either way. Maybe you're between two friend circles? 🤷
• Score near -1: You're in the WRONG group! You have more in common with another group. Time to switch! ❌
The average silhouette score across ALL people (data points) tells you how well the entire party is organized!
The Silhouette score measures how similar each point is to its own cluster compared to other clusters. Unlike inertia (where lower is better), higher silhouette scores are better, and the score has a meaningful, interpretable range from -1 to +1. This makes it easier to understand than inertia, which can be any positive number.
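The per-point formula can be verified by hand on a tiny made-up example and compared against scikit-learn's `silhouette_samples`:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two tiny 1D clusters; verify the score of the first point by hand
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

a = abs(X[0, 0] - X[1, 0])               # mean distance within own cluster: 1.0
b = np.mean(np.abs(X[0, 0] - X[2:, 0]))  # mean distance to the other cluster: 10.5
s_manual = (b - a) / max(a, b)           # (10.5 - 1.0) / 10.5

s_sklearn = silhouette_samples(X, labels)[0]
print(f"manual:  {s_manual:.3f}")
print(f"sklearn: {s_sklearn:.3f}")  # same value
```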
Score near +1: Perfect Clustering
Points are very close to their own cluster center and very far from other clusters. Ideal separation!
Score near 0: Overlapping Clusters
Points are on the boundary between clusters. They could belong to either one. Clusters may be too close.
Score near -1: Wrong Assignment
Points are closer to a DIFFERENT cluster than their own! They've been misclassified. Bad clustering!
| Score Range | Interpretation | What It Means in Practice |
|---|---|---|
| 0.71 to 1.0 | Strong structure | Excellent! Clusters are clearly separated and well-defined. Your K choice is great. |
| 0.51 to 0.70 | Reasonable structure | Good clustering. Some overlap may exist, but groups are still meaningful. |
| 0.26 to 0.50 | Weak structure | Clusters are overlapping. Consider trying different K values or a different algorithm. |
| < 0.25 | No substantial structure | Poor clustering. The data might not have natural clusters, or your K is way off. |
# ============================================
# Calculate Silhouette Score for Different K Values
# ============================================
from sklearn.metrics import silhouette_score
silhouette_scores = []
k_range = range(2, 11) # NOTE: Silhouette requires at least 2 clusters (K=1 has no meaning)
print("Testing Silhouette Score for K = 2 to 10...")
print("=" * 50)
for k in k_range:
# Create and fit K-Means
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = kmeans.fit_predict(X) # Get cluster labels for each point
# Calculate silhouette score
score = silhouette_score(X, labels)
silhouette_scores.append(score)
# Interpret the score
if score >= 0.71:
quality = "Excellent ⭐"
elif score >= 0.51:
quality = "Good ✓"
elif score >= 0.26:
quality = "Fair ⚠"
else:
quality = "Poor ✗"
print(f"K={k}: Silhouette Score = {score:.3f} → {quality}")
print("=" * 50)
print(f"\n→ Best K = {k_range[silhouette_scores.index(max(silhouette_scores))]} "
      f"(highest score: {max(silhouette_scores):.3f})")
Visualize to find the K with highest silhouette score:
# ============================================
# Visualize Silhouette Scores
# ============================================
plt.figure(figsize=(10, 6))
# Plot the scores
plt.plot(k_range, silhouette_scores, 'go-', linewidth=2, markersize=10)
# Highlight the best K
best_k = k_range[silhouette_scores.index(max(silhouette_scores))]
plt.axvline(x=best_k, color='r', linestyle='--', label=f'Best K = {best_k}')
# Labels
plt.xlabel('Number of Clusters (K)', fontsize=12)
plt.ylabel('Silhouette Score', fontsize=12)
plt.title('Silhouette Score for Optimal K\n(Higher is Better!)', fontsize=14, fontweight='bold')
plt.xticks(k_range)
plt.ylim(0, 1) # Score range is -1 to 1, but we expect positive values
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()
Reading the Silhouette Chart
Unlike the Elbow method (where you look for a bend), here you simply find the peak — the K value with the highest score. That's your optimal number of clusters! The red dashed line marks it for you.
Elbow vs Silhouette: Use Both!
Elbow method: Looks at cluster tightness (inertia) — requires visual interpretation.
Silhouette: Looks at separation AND tightness — gives a clear "best" answer.
When they agree, you can be confident in your K choice!
Combining Both Methods
For robust cluster selection, use both methods together. They often agree, but when they differ, consider domain knowledge and the specific use case.
# Combined visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Elbow plot
axes[0].plot(range(1, 11), inertias, 'bo-', linewidth=2)
axes[0].set_xlabel('Number of Clusters (K)')
axes[0].set_ylabel('Inertia')
axes[0].set_title('Elbow Method')
axes[0].grid(True, alpha=0.3)
# Silhouette plot
axes[1].plot(range(2, 11), silhouette_scores, 'go-', linewidth=2)
axes[1].set_xlabel('Number of Clusters (K)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Analysis')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Silhouette Visualization
A silhouette plot shows the score for each individual point, helping you understand cluster quality in more detail:
# Silhouette plot for individual samples
from sklearn.metrics import silhouette_samples, silhouette_score
import numpy as np
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)
# Get silhouette values for each sample
sample_silhouette_values = silhouette_samples(X, labels)
# Average silhouette score
avg_score = silhouette_score(X, labels)
print(f"Average Silhouette Score: {avg_score:.3f}")
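Building on those per-sample values, the classic silhouette plot draws one horizontal bar per point, grouped by cluster. A minimal sketch (recomputing everything so it runs standalone; the spacing constants are arbitrary layout choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)
values = silhouette_samples(X, labels)

plt.figure(figsize=(8, 6))
y_lower = 10
for cluster_id in range(4):
    cluster_values = np.sort(values[labels == cluster_id])
    y_upper = y_lower + len(cluster_values)
    # One horizontal bar per sample, sorted so each cluster forms a "blade"
    plt.fill_betweenx(np.arange(y_lower, y_upper), 0, cluster_values, alpha=0.7)
    plt.text(-0.05, (y_lower + y_upper) / 2, str(cluster_id))
    y_lower = y_upper + 10  # gap between clusters
plt.axvline(x=silhouette_score(X, labels), color='red', linestyle='--',
            label='Average score')
plt.xlabel('Silhouette value')
plt.ylabel('Samples (grouped by cluster)')
plt.title('Per-Sample Silhouette Plot')
plt.legend()
plt.show()
```

Clusters whose blades fall mostly below the red average line, or that contain negative values, are candidates for a different K.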
Practice Questions: Elbow & Silhouette
Test your understanding with these hands-on exercises.
Task: Fit K-Means with K=5 and print the inertia value.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=200, centers=5, random_state=42)
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(X)
print(f"Inertia: {kmeans.inertia_:.2f}")
Task: Create an elbow plot for K values from 1 to 10 and identify the elbow point.
Show Solution
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
inertias = []
for k in range(1, 11):
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X)
inertias.append(kmeans.inertia_)
plt.plot(range(1, 11), inertias, 'bo-')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# Elbow appears at K=4
Task: Fit K-Means with K=4 and calculate the silhouette score.
Show Solution
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)
score = silhouette_score(X, labels)
print(f"Silhouette Score: {score:.3f}")
Task: Test K values from 2 to 8 and print the K with the highest silhouette score.
Show Solution
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
best_k = 2
best_score = -1
for k in range(2, 9):
kmeans = KMeans(n_clusters=k, random_state=42)
labels = kmeans.fit_predict(X)
score = silhouette_score(X, labels)
print(f"K={k}: Score={score:.3f}")
if score > best_score:
best_score = score
best_k = k
print(f"\nBest K: {best_k} with score {best_score:.3f}")
Hierarchical Clustering (Agglomerative)
Hierarchical clustering builds a tree of clusters, letting you explore data at multiple granularity levels. Unlike K-Means, you do not need to specify the number of clusters upfront, giving you more flexibility in exploratory analysis.
Think of how a family tree organizes people: individuals group into immediate families, families into extended families, and so on up to one big tree. Hierarchical clustering works exactly like this: start with individual data points (people), progressively merge similar ones (create families), and build a tree showing relationships at every level. You can "cut" the tree at any height to get your desired number of groups - just like deciding whether to organize by immediate families (many groups) or extended families (fewer groups).
How Hierarchical Clustering Works
Agglomerative (bottom-up) hierarchical clustering starts with each data point as its own cluster, then progressively merges the closest clusters until only one remains. This creates a tree structure called a dendrogram that shows how clusters are related.
Imagine you have 100 customers. Initially, each customer is their own "cluster" (100 clusters). The algorithm finds the two most similar customers and merges them (99 clusters). It repeats this process, always merging the two closest clusters, until all customers form one giant cluster. The dendrogram records this entire merge history, letting you "rewind" to any stage and extract the cluster structure you need.
Agglomerative Clustering Steps
- Start: Each data point is its own cluster (N clusters)
- Find: Identify the two closest clusters
- Merge: Combine them into a single cluster (N-1 clusters)
- Repeat: Continue until all points are in one cluster
- Cut: Choose where to cut the dendrogram for desired cluster count
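The merge-and-cut procedure above can be sketched with SciPy, whose `linkage` function records the full merge history and `fcluster` cuts the resulting tree (scikit-learn's `AgglomerativeClustering`, shown later, wraps the same idea):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=42)

# Each row of Z records one merge: (cluster_a, cluster_b, distance, new_size)
Z = linkage(X, method='ward')
print(f"Merges recorded: {Z.shape[0]}")  # N - 1 = 59 merges

# "Cut" the tree so that at most 3 flat clusters remain
flat_labels = fcluster(Z, t=3, criterion='maxclust')
print(f"Clusters found: {len(np.unique(flat_labels))}")
```

Keeping `Z` around is useful: you can cut the same tree at different heights without re-clustering.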
Linkage Methods
The linkage method determines how to measure distance between clusters. Different linkage methods produce different clustering results:
| Linkage Method | How It Works (Simple Explanation) | Best For | Real-World Analogy |
|---|---|---|---|
| ward (recommended) | Minimizes total variance within clusters: merge the pair whose merge increases the total "spread" (sum of squared distances from cluster centers) the least. Creates balanced, compact groups. | Compact, equally-sized clusters; the most common choice | School classes: form balanced classrooms where students have similar skill levels (low variance within each class) |
| complete | Uses the maximum distance between any two points (one from each cluster). Conservative: merges only when even the farthest points are close. Also called "furthest neighbor" or "maximum linkage". | Tight, spherical clusters; prevents elongated chains; similar-sized groups | Gated communities: only merge neighborhoods if even the farthest houses are still close together (strict boundaries) |
| average | Uses the average distance over all cross-cluster point pairs - a balanced middle ground. Also called UPGMA (Unweighted Pair Group Method with Arithmetic mean). | General purpose; moderate cluster sizes; robust compromise | City districts: merge districts based on the average commute time between all residents (balanced consideration) |
| single | Uses the minimum distance between any two points (one from each cluster). Easily merges clusters when ANY two points are close, creating elongated, chain-like "bridges". Also called "nearest neighbor"; sensitive to outliers. | Elongated, chain-like patterns; can create unbalanced clusters; rare use cases | Highway system: connect cities if ANY two border points are close, even if city centers are far (creates long chains) |
Tip: Start with linkage='ward'. It's the most commonly used method, tends to produce intuitive, well-balanced clusters, and works well with Euclidean distance. Only experiment with other linkages if ward doesn't capture your data structure well.
Implementing Hierarchical Clustering
Scikit-learn provides AgglomerativeClustering, and scipy offers tools for dendrograms:
# ============================================
# Hierarchical (Agglomerative) Clustering
# ============================================
# Import required libraries
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import numpy as np
print("Setting up Hierarchical Clustering...")
print("=" * 50)
# Generate sample data with 4 natural groups
X, _ = make_blobs(
n_samples=150, # 150 data points
centers=4, # 4 true clusters
cluster_std=0.60, # Moderate spread
random_state=42 # Reproducibility
)
print(f"Data shape: {X.shape}") # (150, 2)
print(f"Starting with {len(X)} individual clusters (each point is its own cluster)")
print()
# ============================================
# STEP 1: Apply Agglomerative Clustering
# ============================================
# Create the clustering model
agg_clustering = AgglomerativeClustering(
n_clusters=4, # Cut the tree to get 4 final clusters
linkage='ward' # Ward minimizes within-cluster variance (good default)
)
print("Applying Agglomerative Clustering...")
print("Process: Start with 150 clusters → Merge closest pairs → Repeat until 4 remain")
print()
# Fit and predict cluster labels
# Under the hood: Builds full dendrogram, then cuts at the right height
labels = agg_clustering.fit_predict(X)
print("✓ Clustering complete!")
print()
# ============================================
# STEP 2: Analyze Results
# ============================================
print(f"Cluster labels: {np.unique(labels)}") # [0 1 2 3]
print()
# Count points per cluster
for cluster_id in range(4):
count = np.sum(labels == cluster_id)
percentage = (count / len(X)) * 100
print(f"Cluster {cluster_id}: {count} points ({percentage:.1f}%)")
print()
print(f"Total merges performed: {len(X) - 4}") # Started with 150, ended with 4
print(f"Number of clusters: {agg_clustering.n_clusters_}")
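You don't have to fix the number of clusters upfront: AgglomerativeClustering can instead cut the tree at a merge-distance threshold via its `distance_threshold` parameter. A short sketch (the threshold value 5.0 is an arbitrary choice for illustration, not a recommendation):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Same blob data as above
X, _ = make_blobs(n_samples=150, centers=4, cluster_std=0.60, random_state=42)

# n_clusters=None + distance_threshold: stop merging once the next merge
# would exceed the threshold, letting the data determine the cluster count
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0)
labels = agg.fit_predict(X)

print(f"Clusters found at threshold 5.0: {agg.n_clusters_}")
```

After fitting, the `n_clusters_` attribute tells you how many clusters survived the cut.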
Creating Dendrograms
Dendrograms visualize the hierarchical structure and help you decide where to cut for a specific number of clusters:
# ============================================
# Creating a Dendrogram Visualization
# ============================================
# Import visualization tools
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
print("Building dendrogram...")
print()
# ============================================
# STEP 1: Calculate Linkage Matrix
# ============================================
# The linkage matrix records the merge history:
# - Which clusters were merged at each step
# - The distance between them at merge time
# - How many original points each new cluster contains
linkage_matrix = linkage(
X, # Data to cluster
method='ward' # Ward linkage (minimizes variance)
)
print(f"Linkage matrix shape: {linkage_matrix.shape}") # (n_samples-1, 4)
print(f"Number of merges recorded: {len(linkage_matrix)}")
print()
print("Each row represents one merge:")
print("[cluster1_id, cluster2_id, merge_distance, num_points_in_new_cluster]")
print()
print("First 3 merges:")
print(linkage_matrix[:3])
print()
# ============================================
# STEP 2: Plot Dendrogram
# ============================================
plt.figure(figsize=(12, 6))
# Create the dendrogram plot
# truncate_mode='level' limits display to top 5 levels (for readability)
dendrogram(
linkage_matrix, # The merge history
truncate_mode='level', # Show only top levels
p=5 # Show 5 levels (prevents clutter with large datasets)
)
plt.xlabel('Sample Index (or Cluster Size)', fontsize=12)
plt.ylabel('Distance (Height of Merge)', fontsize=12)
plt.title('Hierarchical Clustering Dendrogram\n(Higher merges = more different clusters)',
fontsize=14, fontweight='bold')
plt.axhline(y=10, color='r', linestyle='--', label='Potential cut point')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.show()
print("✓ Dendrogram created!")
print()
print("How to read it:")
print("- X-axis: Data points or clusters")
print("- Y-axis: Distance at which clusters merge")
print("- Vertical lines: Clusters being merged")
print("- Height of U-shape: How different the clusters are")
print("- Cut horizontally to get desired number of clusters")
You can cut the dendrogram at different heights to get different numbers of clusters:
# Cut dendrogram at specific distance threshold
from scipy.cluster.hierarchy import fcluster
# Get clusters by cutting at distance threshold
distance_threshold = 10
clusters = fcluster(linkage_matrix, t=distance_threshold, criterion='distance')
print(f"Number of clusters at threshold {distance_threshold}: {len(np.unique(clusters))}")
# Or cut to get specific number of clusters
n_clusters = 4
clusters = fcluster(linkage_matrix, t=n_clusters, criterion='maxclust')
print(f"Cluster labels: {np.unique(clusters)}")
Visualizing Cluster Results
# Visualize clustering results
plt.figure(figsize=(12, 5))
# Ward linkage with 4 clusters
plt.subplot(1, 2, 1)
agg = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels = agg.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.title('Agglomerative Clustering (4 clusters)')
# Compare with different n_clusters
plt.subplot(1, 2, 2)
agg = AgglomerativeClustering(n_clusters=6, linkage='ward')
labels = agg.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.title('Agglomerative Clustering (6 clusters)')
plt.tight_layout()
plt.show()
When to Use Hierarchical Clustering
- Explore cluster structure: You don't know how many clusters exist? A dendrogram shows all the possibilities at once.
- Number of clusters unknown: Unlike K-Means, you don't need to specify K upfront. Build the full tree, then cut wherever makes sense.
- Small-medium datasets: Works well up to about 10,000 points. Beyond that, use K-Means instead.
- Visualize hierarchy: Stakeholders love dendrograms - they show how groups relate at multiple levels.
- Nested clusters meaningful: e.g., biological taxonomy (Species → Genus → Family), organizational charts.
When NOT to Use Hierarchical Clustering
- Dataset is very large: O(n²) memory and up to O(n³) time. 100k+ points can take hours or days - use K-Means or Mini-Batch K-Means.
- Adding new data points: Points can't be added incrementally; the entire tree must be rebuilt. Use K-Means for dynamic data.
- Real-time clustering needed: Too slow for production systems that need instant results - use K-Means.
- Memory is limited: The pairwise distance computation is memory-hungry (a full distance matrix for 50k points is on the order of 10 GB). K-Means uses far less.
- Simple flat clusters suffice: If you don't need the hierarchy, don't pay the computational cost - use K-Means instead.
Practice Questions: Hierarchical Clustering
Test your understanding with these hands-on exercises.
Task: Create sample data and apply AgglomerativeClustering with 3 clusters using ward linkage.
Show Solution
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)
print(f"Labels: {labels[:10]}")
Task: Generate data and create a dendrogram using scipy with ward linkage.
Show Solution
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=50, centers=3, random_state=42)
linkage_matrix = linkage(X, method='ward')
plt.figure(figsize=(10, 5))
dendrogram(linkage_matrix)
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.title('Dendrogram')
plt.show()
Task: Apply AgglomerativeClustering with ward, complete, and average linkage, then visualize the results side by side.
Show Solution
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=150, centers=4, random_state=42)
linkages = ['ward', 'complete', 'average']
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, link in zip(axes, linkages):
agg = AgglomerativeClustering(n_clusters=4, linkage=link)
labels = agg.fit_predict(X)
ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
ax.set_title(f'Linkage: {link}')
plt.tight_layout()
plt.show()
Task: Use fcluster to cut the dendrogram at distance=15 and count resulting clusters.
Show Solution
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
import numpy as np
X, _ = make_blobs(n_samples=100, centers=4, random_state=42)
linkage_matrix = linkage(X, method='ward')
clusters = fcluster(linkage_matrix, t=15, criterion='distance')
n_clusters = len(np.unique(clusters))
print(f"Clusters at distance 15: {n_clusters}")
DBSCAN for Density-Based Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters of arbitrary shapes and automatically detects outliers. It is ideal when your data has non-spherical clusters or contains noise that should not belong to any cluster.
Picture a music festival at night: dense crowds gather around each stage, while a few people wander alone between them. DBSCAN formalizes exactly this intuition:
• Core Point (Stage area): If you're standing somewhere with at least 5 people within 2 meters, you're in the "core" of a crowd
• Border Point (Edge of crowd): You're within 2 meters of a core area but don't have 5 people around you
• Noise Point (Lone wanderer): You're not near any dense crowd - just wandering alone
The algorithm connects all core areas that are close enough (within 2m), forming clusters of any shape! No need to specify how many stages (clusters) exist - DBSCAN discovers them automatically.
How DBSCAN Works
Unlike K-Means which uses distance to centroids, DBSCAN defines clusters as dense regions of points separated by sparse regions. It groups together points that are closely packed and marks points in low-density regions as outliers.
The algorithm has a simple logic: If you have enough neighbors nearby, you're in a cluster. It doesn't care about cluster shape (spherical, crescent, serpentine) - only about local density. This makes DBSCAN perfect for real-world data where clusters aren't always nice, round blobs.
DBSCAN Point Types (Detailed Explanation)
1. Core Point (Dense Center)
Definition: A point is "core" if it has at least min_samples neighbors within eps distance (including itself).
Simple Explanation: Imagine you're at a party. You're a "core person" if you can stand in one spot and have at least 5 people (including yourself) within 2 meters. This means you're in a dense crowd - the heart of the party!
Example: If eps=2m and min_samples=5, and you count 6 people within 2 meters of you (including yourself), you're a core point.
2. Border Point (Edge of Crowd)
Definition: A point that is NOT core (doesn't have min_samples neighbors) but is within eps distance of a core point.
Simple Explanation: You're on the edge of the party crowd. You don't have 5 people around YOU (not core), but you're standing within 2 meters of someone who IS in the dense center (core). You're part of the cluster but on the periphery.
Example: You only have 2 people within 2 meters of you, BUT one of those people is a core point. You're a border point - attached to the cluster but not central.
3. Noise Point (Outlier)
Definition: A point that is neither core nor border. It doesn't have enough neighbors to be core, and it's not close enough to any core point to be border.
Simple Explanation: You're wandering alone between party areas. You don't have 5 people around you (not core), and you're not near anyone who does (not border). You're an isolated wanderer - labeled as noise with cluster ID = -1.
Example: You only have 1 person within 2 meters, and that person is also alone. Both of you are noise points - outliers that don't belong to any cluster.
Once every point is classified, the algorithm builds clusters:
- Find all core points (points with enough neighbors)
- Connect core points that are within eps of each other - these form the cluster "backbones"
- Attach border points to their nearby clusters
- Label the remaining points as noise (outliers)
Think of it like finding islands: core points are land, border points are the shoreline, and noise points are isolated rocks in the ocean!
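The point-type rules above can be sketched in a few lines of NumPy. The 1-D "positions", eps=1.0, and min_samples=3 below are made-up toy values for illustration, not something sklearn produces:

```python
import numpy as np

# Hypothetical 1-D positions of "people"; eps and min_samples chosen for the toy example
points = np.array([0.0, 0.5, 1.0, 1.2, 1.4, 2.3, 8.0])
eps, min_samples = 1.0, 3

# Pairwise distances (1-D data, so plain absolute differences)
dist = np.abs(points[:, None] - points[None, :])

neighbors = dist <= eps                          # neighborhood includes the point itself
core = neighbors.sum(axis=1) >= min_samples      # dense enough -> core point
border = ~core & (neighbors & core).any(axis=1)  # not core, but next to a core point
noise = ~core & ~border                          # neither -> outlier

print(f"core:   {points[core]}")    # the dense run 0.0 ... 1.4
print(f"border: {points[border]}")  # 2.3 - only 2 neighbors, but one is core
print(f"noise:  {points[noise]}")   # 8.0 - isolated
```

This is just the classification step; real DBSCAN then connects core points into clusters, as the steps above describe.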
DBSCAN Parameters
DBSCAN has two main parameters that control cluster formation:
| Parameter | What It Controls (Simple Explanation) | How to Choose | Effect on Results |
|---|---|---|---|
| eps (epsilon) | Neighborhood radius - "how close is close?" The maximum distance for two points to be considered neighbors. Think of it as the radius of a circle around each point: anything inside the circle is a neighbor, anything outside is not. Real example: if eps = 2 meters, anyone within 2 m of you is your neighbor; someone 2.1 m away is NOT. Units match your data (meters, pixels, standardized units, etc.). | Method 1 - k-distance plot: plot the distance to the k-th nearest neighbor and pick eps at the "elbow" (see the detailed section below). Method 2 - domain knowledge: use meaningful distances from your domain (e.g. "customers within 5 km are in the same region"). Method 3 - trial and error: start with eps = 0.5 on scaled data, or the average distance to the k-th neighbor, then adjust. | Too large: everything becomes one giant cluster. Too small: everything becomes noise/outliers. Just right: clear clusters with some noise. Critical parameter - the most important one to tune correctly! |
| min_samples | Crowd-size threshold - "how many make a crowd?" The minimum number of points (including the point itself) required within eps to form a dense region (core point). Lower values make clusters easier to form; higher values enforce a stricter definition of density. Real example: if min_samples = 5, you need at least 5 people within eps to count as a "crowd" (core point); 3 nearby people are not dense enough. Always an integer, minimum 1 (though 1-2 are impractical). | Rule of thumb: 2-D data: start with min_samples = 4. Higher dimensions: use 2 × n_features. Practical minimum: at least 3. Large datasets: 10-20 for stricter clusters. Tip: less critical than eps - start with 4-5 and adjust if needed. | Larger values: fewer clusters (stricter), more noise points, only very dense regions become clusters. Smaller values: more clusters (lenient), less noise, clusters form easily. Less sensitive than eps. |
A good starting point for standardized data:
DBSCAN(eps=0.5, min_samples=5)  # good default for scaled data
Then adjust eps based on the k-distance plot, and keep min_samples=5 unless you have a specific reason to change it.
Important: Always scale your features first (e.g., with StandardScaler) so all features have similar ranges. Otherwise eps becomes meaningless (which unit would it use?).
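To get a feel for how sensitive DBSCAN is to eps, sweep it over the two-moons data and watch the cluster and noise counts change. The three eps values here are arbitrary illustrative picks (too small, about right, too large):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

results = {}
for eps in [0.02, 0.2, 2.0]:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 = noise
    n_noise = list(labels).count(-1)
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With a tiny eps almost everything is labeled noise; with a huge eps the chain effect swallows both moons into a single cluster; only the middle value recovers the two crescents.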
Implementing DBSCAN
# ============================================
# DBSCAN - Density-Based Clustering
# ============================================
# Import required libraries
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
import numpy as np
import matplotlib.pyplot as plt
print("Setting up DBSCAN for non-spherical clusters...")
print("=" * 50)
# ============================================
# STEP 1: Generate Non-Spherical Data
# ============================================
# K-Means struggles with moon/crescent shapes
# DBSCAN excels at these!
X, _ = make_moons(
n_samples=300, # 300 data points
noise=0.05, # Small amount of noise (outliers)
random_state=42 # Reproducibility
)
print(f"Data shape: {X.shape}") # (300, 2)
print(f"Data has 2 crescent-shaped clusters (like two moons)")
print()
# ============================================
# STEP 2: Apply DBSCAN Algorithm
# ============================================
# Create DBSCAN model with two key parameters
dbscan = DBSCAN(
eps=0.2, # Maximum distance between neighbors (neighborhood radius)
# "two points within 0.2 units of each other are in the same neighborhood"
min_samples=5 # Minimum points needed to form a dense region (core point)
# "Need at least 5 points nearby to be a 'crowd'"
)
print("DBSCAN Parameters:")
print(f" eps (neighborhood radius): {dbscan.eps}")
print(f" min_samples (crowd size): {dbscan.min_samples}")
print()
print("Running DBSCAN...")
print("Process: For each point, count neighbors within eps")
print(" If >= min_samples, mark as core point")
print(" Connect core points to form clusters")
print(" Label remaining points as border or noise")
print()
# Fit and predict cluster labels
# Label -1 means NOISE (outlier)
# Labels 0, 1, 2, ... are cluster IDs
labels = dbscan.fit_predict(X)
print("✓ DBSCAN complete!")
print()
# ============================================
# STEP 3: Analyze Results
# ============================================
# Calculate number of clusters (excluding noise)
# Noise is labeled as -1, so we subtract 1 if it exists
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
# Count noise points (labeled as -1)
n_noise = list(labels).count(-1)
# Get unique cluster labels
unique_labels = set(labels)
print(f"Unique labels: {sorted(unique_labels)}")
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise} ({n_noise/len(X)*100:.1f}% of data)")
print()
# Show cluster sizes
for label in sorted(unique_labels):
if label == -1:
continue # Skip noise, already counted
count = list(labels).count(label)
percentage = (count / len(X)) * 100
print(f"Cluster {label}: {count} points ({percentage:.1f}%)")
print()
print(f"Core samples (dense points): {len(dbscan.core_sample_indices_)}")
print(f"Components (connected regions): {dbscan.components_.shape}")
Visualize how DBSCAN handles non-spherical clusters:
# Visualize DBSCAN results
plt.figure(figsize=(12, 5))
# DBSCAN on moon data
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.title(f'DBSCAN: {n_clusters} clusters, {n_noise} noise points')
# Compare with K-Means (struggles with moons)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=42)
km_labels = kmeans.fit_predict(X)
plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=km_labels, cmap='viridis', alpha=0.6)
plt.title('K-Means (struggles with non-spherical)')
plt.tight_layout()
plt.show()
Finding Optimal eps
The k-distance plot helps find a good eps value. For each point, calculate the distance to its k-th nearest neighbor (commonly k = min_samples), then sort and plot these distances. Most points sit in a flat region of similar, small k-distances (core and border points), while noise points show much larger distances (a sharp rise). The "elbow" where the curve bends sharply is your candidate eps:
# ============================================
# k-Distance Plot for Optimal eps Selection
# ============================================
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt
print("Finding optimal eps using k-distance plot...")
print("=" * 50)
# ============================================
# STEP 1: Calculate k-Nearest Neighbor Distances
# ============================================
# Choose k = min_samples (or min_samples - 1)
# This is the same k we'll use in DBSCAN's min_samples
k = 4
print(f"Calculating distance to {k}th nearest neighbor for each point...")
# Create a nearest neighbors model
neighbors = NearestNeighbors(n_neighbors=k)
neighbors.fit(X) # Fit on our data
# Calculate distances to k nearest neighbors for each point
# distances shape: (n_samples, k) - distances to each of k neighbors
# indices shape: (n_samples, k) - indices of k nearest neighbors
distances, indices = neighbors.kneighbors(X)
print(f"Distances shape: {distances.shape}") # (300, 4)
print()
# ============================================
# STEP 2: Extract and Sort k-th Neighbor Distance
# ============================================
# Get the distance to the k-th (farthest returned) neighbor for each point.
# [:, k-1] selects the last column. Note: when querying the training points
# themselves, each point counts as its own nearest neighbor (distance 0),
# so this is really the (k-1)th true neighbor; use n_neighbors=k+1 for exactly k.
k_distances = distances[:, k-1]
print(f"Sample k-distances (first 10 points):")
print(k_distances[:10])
print()
# Sort distances in ascending order for visualization
k_distances_sorted = np.sort(k_distances)
print(f"Min k-distance: {k_distances_sorted[0]:.4f}")
print(f"Max k-distance: {k_distances_sorted[-1]:.4f}")
print(f"Median k-distance: {np.median(k_distances_sorted):.4f}")
print()
# ============================================
# STEP 3: Plot k-Distance Graph
# ============================================
plt.figure(figsize=(10, 6))
# Plot sorted k-distances
plt.plot(k_distances_sorted, linewidth=2, color='steelblue')
# Add suggested eps line (visual helper)
# Rule of thumb: eps at the "knee" or elbow of the curve
suggested_eps = 0.2 # This would be determined visually or algorithmically
plt.axhline(y=suggested_eps, color='red', linestyle='--',
linewidth=2, label=f'Suggested eps = {suggested_eps}')
plt.xlabel('Points (sorted by k-distance)', fontsize=12)
plt.ylabel(f'{k}th Nearest Neighbor Distance', fontsize=12)
plt.title(f'k-Distance Plot (k={k})\nLook for the elbow point!',
fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.legend(fontsize=11)
plt.tight_layout()
plt.show()
print("✓ k-distance plot created!")
print()
print("How to interpret:")
print("- Flat region (left): Points with close neighbors (core/border points)")
print("- Sharp rise (right): Points with distant neighbors (noise points)")
print("- Elbow point: Optimal eps value (where curve bends sharply)")
print()
print(f"Suggested eps for this data: {suggested_eps}")
# The elbow point suggests optimal eps
Handling Noise Points
DBSCAN labels noise points as -1. You can filter them out or analyze them separately:
# Identify noise points
noise_mask = labels == -1
noise_points = X[noise_mask]
clustered_points = X[~noise_mask]
print(f"Total points: {len(X)}")
print(f"Clustered points: {len(clustered_points)}")
print(f"Noise points: {len(noise_points)}")
# Visualize with noise highlighted
plt.figure(figsize=(10, 6))
plt.scatter(clustered_points[:, 0], clustered_points[:, 1],
c=labels[~noise_mask], cmap='viridis', alpha=0.6, label='Clustered')
plt.scatter(noise_points[:, 0], noise_points[:, 1],
c='red', marker='x', s=50, label='Noise')
plt.legend()
plt.title('DBSCAN with Noise Points Highlighted')
plt.show()
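If your application needs every point assigned to a cluster (say, every customer must land in some segment), a common post-processing trick is to attach each noise point to the cluster of its nearest core sample. A sketch, assuming DBSCAN found at least one cluster on this data:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_.copy()

noise_mask = labels == -1
if noise_mask.any():
    # Find each noise point's nearest core sample (db.components_ holds the
    # core samples' coordinates) and adopt that core point's cluster label
    nn = NearestNeighbors(n_neighbors=1).fit(db.components_)
    _, idx = nn.kneighbors(X[noise_mask])
    core_labels = labels[db.core_sample_indices_]
    labels[noise_mask] = core_labels[idx.ravel()]

print(f"Noise before: {noise_mask.sum()} points, after: {(labels == -1).sum()} points")
```

Whether this is appropriate depends on your use case - sometimes the noise label itself is the insight (e.g. anomaly detection), and forcing an assignment would hide it.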
Algorithm Comparison Guide
| Algorithm | Best For | Advantages | Disadvantages | When to Use |
|---|---|---|---|---|
| K-Means | Spherical, equal-sized clusters | Fast & scalable; simple to understand; works on large datasets; easy to implement | Must specify K upfront; assumes spherical shapes; sensitive to outliers; poor for varying densities | Customer segmentation, image compression, document clustering with clear groups |
| Hierarchical | Exploring nested structures | No need to specify K; produces a dendrogram; multiple granularities; deterministic results | Slow (O(n²) or O(n³)); high memory usage; can't undo merges; not scalable to big data | Biological taxonomy, social network analysis, organizational hierarchies (small-medium datasets) |
| DBSCAN | Arbitrary shapes with noise | Finds any cluster shape; robust to outliers; no need to specify K; identifies noise points | Struggles with varying densities; sensitive to parameters; slower than K-Means; hard to tune eps/min_samples | Geospatial data, anomaly detection, non-globular clusters, data with outliers |
Decision Flowchart
Follow this decision tree to choose the best clustering algorithm for your data. Start at the top and answer each question:
- Question 1: Do you know how many clusters (K) you want? Can you specify the exact number of groups upfront (e.g. "I want 3 customer segments")?
  - Yes → go to Question 2.
  - No, you want the algorithm to discover K → go to Question 3.
- Question 2: How large is your dataset? (The number of data points matters for performance.)
  - Large (> 100,000 points) → K-Means (fast & scalable).
  - Medium (10k-100k points) → K-Means (best performance).
  - Small (< 10,000 points) → any algorithm works fine.
- Question 3: Since you don't know K, dataset size determines the algorithm choice.
  - Small enough that you can afford the computational cost → Hierarchical (explore the structure with a dendrogram).
  - Large, or you need an efficient algorithm → go to Question 4.
- Question 4: What do the cluster shapes and outliers look like?
  - Spherical (round, compact clusters) → K-Means (use the Elbow Method to pick K).
  - Irregular (crescent, serpentine shapes) → DBSCAN (handles any shape).
  - Has outliers (noisy data with anomalies) → DBSCAN (labels noise as -1).
Quick Reference Guide:
K-Means
- Know K upfront
- Large datasets
- Spherical clusters
- Fast & simple
Hierarchical
- Don't know K
- Small datasets (<10k)
- Explore structure
- See dendrogram
DBSCAN
- Don't know K
- Irregular shapes
- Has outliers/noise
- Density-based
🎯 Pro Tip: Still Unsure?
Try all three algorithms and compare their Silhouette Scores. The algorithm with the highest score (closer to +1) is your best choice for that specific dataset!
Practice Questions: DBSCAN
Test your understanding with these hands-on exercises.
Task: Generate moon-shaped data and apply DBSCAN with eps=0.3 and min_samples=5.
Show Solution
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X)
print(f"Unique labels: {set(labels)}")
Task: Apply DBSCAN and print the number of clusters found and noise points detected.
Show Solution
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters: {n_clusters}")
print(f"Noise points: {n_noise}")
Task: Generate data, calculate 5th nearest neighbor distances, and plot the k-distance graph.
Show Solution
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
neighbors = NearestNeighbors(n_neighbors=5)
neighbors.fit(X)
distances, _ = neighbors.kneighbors(X)
distances = np.sort(distances[:, 4])
plt.figure(figsize=(10, 6))
plt.plot(distances)
plt.xlabel('Points')
plt.ylabel('5th NN Distance')
plt.title('k-Distance Plot')
plt.show()
Task: Apply DBSCAN and create a scatter plot where noise points are shown in red with X markers.
Show Solution
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)
noise_mask = labels == -1
plt.figure(figsize=(10, 6))
plt.scatter(X[~noise_mask, 0], X[~noise_mask, 1],
c=labels[~noise_mask], cmap='viridis')
plt.scatter(X[noise_mask, 0], X[noise_mask, 1],
c='red', marker='x', s=50, label='Noise')
plt.legend()
plt.show()
Cluster Evaluation & Interpretation
Evaluating clustering results is challenging because there are no ground truth labels. Learn how to assess cluster quality using internal metrics, visualizations, and domain knowledge to ensure your clusters are meaningful and actionable.
Internal vs External Evaluation
Clustering evaluation falls into two categories depending on whether you have ground truth labels:
Internal metrics - used when you do not have ground truth labels:
- Silhouette Score
- Calinski-Harabasz Index
- Davies-Bouldin Index
- Inertia / SSE
External metrics - used when ground truth labels are available:
- Adjusted Rand Index (ARI)
- Normalized Mutual Information (NMI)
- Homogeneity / Completeness
- V-Measure
Internal Evaluation Metrics
These metrics evaluate cluster quality based only on the data and cluster assignments:
# Import evaluation metrics
from sklearn.metrics import silhouette_score, calinski_harabasz_score
from sklearn.metrics import davies_bouldin_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
# Fit K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)
# Calculate internal metrics
silhouette = silhouette_score(X, labels)
calinski = calinski_harabasz_score(X, labels)
davies_bouldin = davies_bouldin_score(X, labels)
print(f"Silhouette Score: {silhouette:.3f}") # Higher is better (-1 to 1)
print(f"Calinski-Harabasz: {calinski:.2f}") # Higher is better
print(f"Davies-Bouldin: {davies_bouldin:.3f}") # Lower is better
| Metric | Range | Goal | What It Measures (Beginner Explanation) | How to Interpret Score |
|---|---|---|---|---|
| Silhouette Score | -1 to +1 | Maximize (higher = better) | How well-separated and cohesive the clusters are. For each point it compares (a) the average distance to points in its own cluster with (b) the average distance to points in the nearest other cluster. Formula intuition: (b - a) / max(a, b). Think: "Do I fit well in my group vs. how different am I from other groups?" | +0.71 to +1.0: strong, well-separated clusters. +0.51 to +0.70: good cluster structure. +0.26 to +0.50: weak/overlapping clusters. ≤ +0.25: poor clustering. Negative: points are likely in the wrong clusters! Aim for > 0.5 for good clustering. |
| Calinski-Harabasz Index (Variance Ratio) | 0 to ∞ | Maximize (higher = better) | The ratio of between-cluster spread to within-cluster tightness. Good clustering has clusters far apart from each other (high between-cluster variance) while points within each cluster are tightly packed (low within-cluster variance). Think: "Are my groups well-separated AND internally compact?" | Higher is better, but there is no universal "good" score - it is relative. Use it to compare different K values and pick the K with the highest index (e.g. K=3: CH=120; K=4: CH=145 ← better; K=5: CH=138). Tends to favor K-Means over DBSCAN. |
| Davies-Bouldin Index | 0 to ∞ | Minimize (lower = better) | The average similarity between each cluster and its most similar cluster: for each cluster, find its "closest rival", measure how similar they are, then average across all clusters. Bad clustering: clusters overlap with their neighbors. Good clustering: even the most similar pairs are well separated. | Close to 0: excellent separation (0 is the theoretical perfect score). 0.0-0.5: very good. 0.5-1.0: moderate. > 1.0: poor (overlapping clusters). Lower is always better - a score of 0.3 beats 0.8. Use it to compare K values and pick the lowest. |
- Silhouette Score: Best all-around metric, easy to interpret (-1 to 1 range)
- Calinski-Harabasz: Good for comparing K values, fast to compute
- Davies-Bouldin: Sensitive to cluster overlap, good complement to Silhouette
Pro tip: Use all three and see if they agree! If Silhouette says K=4 is best, and CH Index also peaks at K=4, that's strong evidence.
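A sketch of that cross-check: compute all three internal metrics for several candidate K values on the same data and see where they agree (the K range 2-6 is an arbitrary illustrative choice):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = {
        'silhouette': silhouette_score(X, labels),                # higher is better
        'calinski_harabasz': calinski_harabasz_score(X, labels),  # higher is better
        'davies_bouldin': davies_bouldin_score(X, labels),        # lower is better
    }
    print(f"K={k}: " + "  ".join(f"{m}={v:.3f}" for m, v in scores[k].items()))
```

If the silhouette peak, the Calinski-Harabasz peak, and the Davies-Bouldin minimum all land on the same K, that agreement is strong evidence for that choice.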
External Evaluation Metrics
When you have ground truth labels (for validation or benchmarking), use these metrics:
# External metrics (when true labels are available)
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score
# Generate data with known labels
X, y_true = make_blobs(n_samples=300, centers=4, random_state=42)
# Cluster the data
kmeans = KMeans(n_clusters=4, random_state=42)
y_pred = kmeans.fit_predict(X)
# Calculate external metrics
ari = adjusted_rand_score(y_true, y_pred)
nmi = normalized_mutual_info_score(y_true, y_pred)
homogeneity = homogeneity_score(y_true, y_pred)
completeness = completeness_score(y_true, y_pred)
v_measure = v_measure_score(y_true, y_pred)
print(f"Adjusted Rand Index: {ari:.3f}")
print(f"Normalized MI: {nmi:.3f}")
print(f"Homogeneity: {homogeneity:.3f}")
print(f"Completeness: {completeness:.3f}")
print(f"V-Measure: {v_measure:.3f}")
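One property worth knowing: external metrics compare the grouping, not the label values, so renaming the clusters does not change the score. A minimal check with hand-made labels:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [2, 2, 0, 0, 1, 1]  # identical grouping, different label names

# Both metrics see a perfect match despite the relabeling
print(adjusted_rand_score(y_true, y_pred))           # 1.0
print(normalized_mutual_info_score(y_true, y_pred))  # 1.0
```

This matters because clustering algorithms assign arbitrary cluster IDs - cluster "0" from K-Means has no reason to correspond to class "0" in your ground truth.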
Comparing Multiple Algorithms
Evaluate different clustering algorithms on the same data to find the best approach:
# Compare clustering algorithms
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
# Prepare data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
# Test different algorithms
results = {}
# K-Means
km = KMeans(n_clusters=4, random_state=42)
km_labels = km.fit_predict(X_scaled)
results['K-Means'] = silhouette_score(X_scaled, km_labels)
# Agglomerative
agg = AgglomerativeClustering(n_clusters=4)
agg_labels = agg.fit_predict(X_scaled)
results['Agglomerative'] = silhouette_score(X_scaled, agg_labels)
# DBSCAN
db = DBSCAN(eps=0.5, min_samples=5)
db_labels = db.fit_predict(X_scaled)
if len(set(db_labels)) > 1:  # Need at least 2 clusters for silhouette
    results['DBSCAN'] = silhouette_score(X_scaled, db_labels)
for algo, score in results.items():
print(f"{algo}: {score:.3f}")
Cluster Interpretation
After clustering, analyze what makes each cluster unique to derive actionable insights:
# Analyze cluster characteristics
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
# Create sample customer data
np.random.seed(42)
customers = pd.DataFrame({
    'age': np.random.randint(18, 70, 100),
    'income': np.random.randint(20000, 150000, 100),
    'spending_score': np.random.randint(1, 100, 100)
})
# Scale and cluster
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(customers)
kmeans = KMeans(n_clusters=3, random_state=42)
customers['cluster'] = kmeans.fit_predict(X_scaled)
# Analyze cluster profiles
cluster_summary = customers.groupby('cluster').agg({
    'age': ['mean', 'std'],
    'income': ['mean', 'std'],
    'spending_score': ['mean', 'std']
}).round(2)
print("Cluster Profiles:")
print(cluster_summary)
# Visualize cluster profiles with radar/bar charts
import matplotlib.pyplot as plt
cluster_means = customers.groupby('cluster')[['age', 'income', 'spending_score']].mean()
# Normalize for comparison
cluster_normalized = (cluster_means - cluster_means.min()) / (cluster_means.max() - cluster_means.min())
# Bar chart comparison
cluster_normalized.T.plot(kind='bar', figsize=(10, 6))
plt.title('Cluster Profile Comparison (Normalized)')
plt.xlabel('Features')
plt.ylabel('Normalized Value')
# Example names for illustration - choose names that match your actual cluster profiles
plt.legend(title='Cluster', labels=['Young Budget', 'High Spenders', 'Conservative'])
plt.tight_layout()
plt.show()
Best Practices for Cluster Analysis
Clustering Best Practices
- Always scale your data: Use StandardScaler before clustering
- Try multiple K values: Use Elbow and Silhouette methods together
- Compare algorithms: Different data shapes suit different algorithms
- Validate with domain knowledge: Do clusters make business sense?
- Document your findings: Name clusters meaningfully for stakeholders
- Test stability: Run with different random seeds
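The stability point above can be sketched as a quick check: cluster with several seeds and compute pairwise ARI between the runs (assuming the same blob data as earlier; values near 1.0 suggest the clusters are stable):

```python
# Stability check: re-cluster with different seeds, compare runs with ARI
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

seeds = (0, 1, 2, 3)
runs = [KMeans(n_clusters=4, random_state=s, n_init=10).fit_predict(X)
        for s in seeds]

aris = []
for i in range(len(runs)):
    for j in range(i + 1, len(runs)):
        ari = adjusted_rand_score(runs[i], runs[j])
        aris.append(ari)
        print(f"seed {seeds[i]} vs seed {seeds[j]}: ARI = {ari:.3f}")
```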
Practice Questions: Cluster Evaluation
Test your understanding with these hands-on exercises.
Task: Fit K-Means with 4 clusters and print silhouette, Calinski-Harabasz, and Davies-Bouldin scores.
Show Solution
from sklearn.metrics import silhouette_score, calinski_harabasz_score
from sklearn.metrics import davies_bouldin_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)
print(f"Silhouette: {silhouette_score(X, labels):.3f}")
print(f"CH Index: {calinski_harabasz_score(X, labels):.2f}")
print(f"DB Index: {davies_bouldin_score(X, labels):.3f}")
Task: Generate data with known labels, cluster with K-Means, and calculate Adjusted Rand Index.
Show Solution
from sklearn.metrics import adjusted_rand_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42)
y_pred = kmeans.fit_predict(X)
ari = adjusted_rand_score(y_true, y_pred)
print(f"Adjusted Rand Index: {ari:.3f}")
Task: Apply both K-Means and AgglomerativeClustering with 4 clusters, compare their silhouette scores, and print which is better.
Show Solution
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
km = KMeans(n_clusters=4, random_state=42)
km_labels = km.fit_predict(X)
km_score = silhouette_score(X, km_labels)
agg = AgglomerativeClustering(n_clusters=4)
agg_labels = agg.fit_predict(X)
agg_score = silhouette_score(X, agg_labels)
print(f"K-Means: {km_score:.3f}")
print(f"Agglomerative: {agg_score:.3f}")
print(f"Winner: {'K-Means' if km_score > agg_score else 'Agglomerative'}")
Task: Create a DataFrame with features, cluster it, and print the mean of each feature per cluster.
Show Solution
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
df = pd.DataFrame({
    'age': np.random.randint(20, 60, 100),
    'income': np.random.randint(30000, 120000, 100)
})
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
kmeans = KMeans(n_clusters=3, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)
print(df.groupby('cluster').mean().round(2))
Key Takeaways
K-Means is Fast and Simple
Works best with spherical clusters and requires specifying K upfront. Scale your data first for best results
Elbow Method Finds Optimal K
Plot inertia vs K and look for the "elbow" point where adding clusters gives diminishing returns
Silhouette Measures Quality
Score ranges from -1 to 1. Higher values indicate well-separated, cohesive clusters
Hierarchical Shows Structure
Dendrograms reveal nested cluster relationships. Cut at different heights for different granularities
DBSCAN Handles Noise
Finds arbitrary-shaped clusters and labels outliers as noise. No need to specify cluster count
Choose Algorithm Wisely
K-Means for speed, Hierarchical for exploration, DBSCAN for noisy or irregular-shaped data
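To illustrate the last takeaway, here is a hypothetical comparison on make_moons data, whose crescent shapes break K-Means' spherical-cluster assumption (the eps=0.3 value is a guess that suits this data's scale):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

# K-Means assumes spherical clusters - it tends to slice the moons in half
km_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)
# DBSCAN follows density - it traces each crescent instead
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

km_ari = adjusted_rand_score(y_true, km_labels)
db_ari = adjusted_rand_score(y_true, db_labels)
print(f"K-Means ARI: {km_ari:.3f}")
print(f"DBSCAN ARI: {db_ari:.3f}")
```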
Knowledge Check
Quick Quiz
Test what you've learned about clustering algorithms