Principal Component Analysis (PCA)
PCA is the most widely used dimensionality reduction technique. It transforms your data into a new coordinate system where the axes (principal components) capture the maximum variance in decreasing order.
Why Reduce Dimensions?
Imagine trying to describe a person using 1,000 different measurements - height, weight, shoe size, favorite color, number of books read, coffee preference, etc. Many of these measurements are related (tall people often have larger shoe sizes) or irrelevant for your task. Dimensionality reduction is like finding the most important 5-10 characteristics that capture the essence of a person.
High-dimensional data presents several challenges: It's hard to visualize (we can only see 3 dimensions at most), computationally expensive to process (more features = more calculations), and often contains redundant or correlated features (like temperature in Celsius and Fahrenheit - they're the same information!). Dimensionality reduction addresses these issues while preserving the essential patterns and relationships in your data.
Benefits:
- Faster model training: Fewer features = less computation. A model with 10 features trains far faster than one with 100 (though rarely by an exact 10x)!
- Reduced storage: Save disk space and memory. Store 50 features instead of 5,000.
- Better visualization: We can see 2D/3D plots. Can't visualize 784 dimensions (like in images)!
- Removes noise: Gets rid of random fluctuations and keeps the signal. Like removing static from a radio.
- Avoids overfitting: Fewer features mean simpler models that generalize better to new data.
Trade-offs:
- Information loss: Like compressing a song to MP3 - smaller file, but some quality is lost.
- Less interpretable: "Principal Component 1" is harder to explain than "age" or "income".
- Linear only (PCA): PCA can't capture curved patterns. Like trying to draw a circle with straight lines.
- Requires scaling: Must standardize features first or results will be biased toward large-scale features.
- Outlier sensitive: Extreme values can distort the components. One billionaire in a dataset of average earners!
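The outlier effect is easy to demonstrate. In this sketch (synthetic data, invented for illustration), a single extreme point drags almost all of the variance onto the first component:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X_clean = rng.normal(size=(100, 2))           # two uncorrelated features
X_outlier = np.vstack([X_clean, [100, 100]])  # one "billionaire" point added

ratio_clean = PCA(n_components=2).fit(X_clean).explained_variance_ratio_
ratio_outlier = PCA(n_components=2).fit(X_outlier).explained_variance_ratio_

print(f"PC1 share, clean data:   {ratio_clean[0]:.2%}")   # roughly an even split
print(f"PC1 share, with outlier: {ratio_outlier[0]:.2%}") # close to 100%
```

One extreme point is enough to make PC1 point straight at the outlier, which is why outlier handling (or robust PCA variants) matters before reducing dimensions.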
How PCA Works
Beginner-Friendly Explanation
Imagine you're looking at a football field from the stands. You see players moving in all directions. But if you rotate your view to look from the sideline, you'll notice most movement happens along the length of the field (1st dimension) and less width-wise (2nd dimension).
PCA does exactly this! It "rotates" your data to find the directions where your data varies the most. The 1st principal component is the direction of maximum variation (like the field's length), the 2nd is perpendicular and captures the next most variation (like the width), and so on.
Technically speaking: PCA finds new axes (called principal components) that are linear combinations of your original features. Think of a linear combination as a recipe - "PC1 = 0.5 × height + 0.3 × weight + 0.2 × age". Each component is a weighted mix of your original features.
The key rule: The first component points in the direction of maximum variance (where data is most spread out). The second is perpendicular (orthogonal) to the first and captures the next most variance. The third is perpendicular to both, and so on. This guarantees no overlap or redundancy between components!
PCA Steps (What Happens Under the Hood)
1. Standardize: Scale features to have zero mean (center at 0) and unit variance (same spread). Why? Without this, features with larger scales (like salary in $100K vs age in years) would dominate the components unfairly.
2. Covariance Matrix: Calculate how each pair of features varies together. Think of it as a "relationship table" - high covariance means two features move together (height & weight), near-zero means no linear relationship.
3. Eigenvectors & Eigenvalues: Find the directions (eigenvectors) of maximum variance and their importance (eigenvalues). Eigenvectors = the new axes (principal components). Eigenvalues = how much variance each axis captures. Bigger eigenvalue = more important component!
4. Select Components: Keep the top k components based on explained variance (usually 90-95%). You're choosing: "I want to keep 95% of the information - how many components do I need?" Typically far fewer than the original features!
5. Transform: Project your original data onto the selected principal components. This is the final step - rotating and projecting your data into the new coordinate system. Your 100 features become 10 components!
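The five steps above can be sketched directly in NumPy. This is an illustrative re-implementation via eigendecomposition of the covariance matrix - scikit-learn's PCA actually uses SVD internally, but both find the same subspace:

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data

# Step 1: standardize (zero mean, unit variance)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)        # shape (4, 4)

# Step 3: eigenvectors (directions) and eigenvalues (variance captured)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices

# Step 4: sort by eigenvalue (descending) and keep the top k
order = np.argsort(eigvals)[::-1]
k = 2
top_vecs = eigvecs[:, order[:k]]         # shape (4, 2)

# Step 5: project the data onto the selected components
X_proj = X_std @ top_vecs                # shape (150, 2)

explained = eigvals[order[:k]] / eigvals.sum()
print("Explained variance ratio:", explained.round(4))
```

The printed ratios match what `sklearn.decomposition.PCA` reports for the scaled iris data (roughly 0.73 and 0.23).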
Key Insight (The Magic of PCA):
Each principal component is orthogonal (perpendicular, at 90°) to all the others, which means the components are completely uncorrelated. No redundancy! If features A and B were 80% correlated (redundant), PCA combines their shared variation into a single component. You get independent, information-rich features!
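You can verify the "completely uncorrelated" claim numerically: after PCA, the correlation between any two component scores is zero up to floating-point precision. A quick check on the iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# Correlation matrix of the component scores
corr = np.corrcoef(X_pca, rowvar=False)
print(corr.round(6))  # off-diagonal entries are ~0
```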
Implementing PCA in Python
Scikit-learn makes PCA straightforward. Always remember to scale your data first!
# Import required libraries
import numpy as np # For numerical operations
import matplotlib.pyplot as plt # For creating visualizations
from sklearn.decomposition import PCA # The PCA algorithm implementation
from sklearn.preprocessing import StandardScaler # For scaling features (CRITICAL!)
from sklearn.datasets import load_iris # Sample dataset with 4 features
# Load the famous Iris dataset
# - 150 flowers (samples)
# - 4 measurements per flower: sepal length, sepal width, petal length, petal width
# - 3 species: setosa, versicolor, virginica
iris = load_iris()
X = iris.data # Shape: (150, 4) - 150 flowers, 4 features each
y = iris.target # Species labels: 0, 1, or 2
print(f"Original shape: {X.shape}") # (150, 4) - We'll reduce from 4D to 2D
print(f"Feature names: {iris.feature_names}") # See what we're compressing
# Step 1: Scale the data (ABSOLUTELY CRITICAL!)
# WHY: PCA is based on variance. If one feature ranges 0-1000 and another 0-5,
# the 0-1000 feature will dominate the principal components unfairly!
# StandardScaler makes all features have mean=0 and standard deviation=1
scaler = StandardScaler() # Create the scaler
X_scaled = scaler.fit_transform(X) # Fit to data AND transform it
# After scaling, all features have:
# - Mean = 0 (centered)
# - Std Dev = 1 (same spread)
# This ensures fair comparison!
print(f"Original mean: {X.mean(axis=0).round(2)}") # Different means
print(f"Scaled mean: {X_scaled.mean(axis=0).round(2)}") # All near 0!
# Step 2: Apply PCA (reduce from 4 dimensions to 2)
# n_components=2 means "give me the top 2 principal components"
# These 2 will capture the MOST variance from the original 4 features
pca = PCA(n_components=2) # Create PCA object
X_pca = pca.fit_transform(X_scaled) # Fit PCA and transform data
# What just happened?
# - fit(): PCA analyzed X_scaled and found the 2 best directions
# - transform(): Projected the data onto those 2 directions
# - Result: We went from 4D to 2D!
print(f"Reduced shape: {X_pca.shape}") # (150, 2) - Same 150 flowers, but only 2 features now!
print(f"We reduced dimensions by {(1 - 2/4)*100:.0f}%!") # 50% reduction
Visualize the reduced data to see how well the classes separate:
# Visualize the 2D projection
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.7)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA: Iris Dataset (4D to 2D)')
plt.colorbar(scatter, label='Species')
plt.show()
Understanding PCA Attributes
After fitting, PCA provides useful attributes to understand the transformation:
# After fitting, PCA provides useful attributes to understand the transformation
# 1. Explained Variance Ratio - THE MOST IMPORTANT METRIC!
# This tells you: "What percentage of total variance does each component capture?"
print("Explained variance ratio:", pca.explained_variance_ratio_)
# Example output: [0.7296, 0.2285]
# Interpretation:
# - PC1 captures 73% of the variance (MOST important direction)
# - PC2 captures 23% of the variance (2nd most important)
# - Together: 73% + 23% = 96% of total variance retained!
# 2. Total Variance Retained
# How much of the original information did we keep?
total_variance = sum(pca.explained_variance_ratio_)
print(f"Total variance retained: {total_variance:.2%}") # e.g., 95.81%
# Rule of thumb:
# - 90-95%: Excellent! Lost very little information
# - 80-90%: Good for most tasks
# - < 80%: Might have lost too much information
# 3. Component Loadings (Feature Contributions)
# These show HOW each original feature contributes to each component
print("Component loadings shape:", pca.components_.shape) # (2, 4)
# Shape: (n_components, n_original_features)
# Each row is one principal component
# Each column shows contribution from one original feature
# Example: If pca.components_[0] = [0.5, 0.3, -0.6, -0.5]
# It means: PC1 = 0.5*sepal_length + 0.3*sepal_width - 0.6*petal_length - 0.5*petal_width
# Positive = feature increases with PC1, Negative = feature decreases with PC1
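The "recipe" interpretation can be confirmed numerically: each PC1 score is just the dot product of the (centered) data with the first loading vector. A small self-contained check:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# PC1 scores = centered data projected onto the first loading vector
pc1_manual = (X_scaled - pca.mean_) @ pca.components_[0]
print(np.allclose(pc1_manual, X_pca[:, 0]))  # True
```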
Feature Importance from PCA
The component loadings show how much each original feature contributes to each principal component:
# Visualize feature contributions to components
import pandas as pd
loadings = pd.DataFrame(
pca.components_.T,
columns=['PC1', 'PC2'],
index=iris.feature_names
)
print(loadings)
# Plot loadings as heatmap
plt.figure(figsize=(8, 4))
plt.imshow(loadings.values, cmap='coolwarm', aspect='auto')
plt.colorbar(label='Loading')
plt.xticks([0, 1], ['PC1', 'PC2'])
plt.yticks(range(4), iris.feature_names)
plt.title('Feature Loadings on Principal Components')
plt.show()
Practice Questions: PCA
Test your understanding with these hands-on exercises.
Task: Load the iris dataset, scale it, and apply PCA to reduce to 3 components.
Show Solution
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)
print(f"Reduced shape: {X_pca.shape}")
Task: Apply PCA with 2 components and print the total variance retained as a percentage.
Show Solution
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
pca = PCA(n_components=2)
pca.fit(X_scaled)
total = sum(pca.explained_variance_ratio_)
print(f"Total variance retained: {total:.2%}")
Task: Apply PCA to reduce iris data to 2D and create a scatter plot colored by species.
Show Solution
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=iris.target, cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA of Iris Dataset')
plt.colorbar(label='Species')
plt.show()
Task: Use PCA with n_components=0.95 to automatically select components that retain 95% variance.
Show Solution
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
# Use ratio to auto-select components
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(f"Components needed: {pca.n_components_}")
print(f"Variance retained: {sum(pca.explained_variance_ratio_):.2%}")
Explained Variance & Scree Plot
How many components should you keep? Explained variance ratios and scree plots help you decide the optimal number of dimensions to retain while preserving most of the information in your data.
Understanding Explained Variance
Beginner-Friendly Explanation
Imagine you have a $100 budget to explain why students succeed. You discover:
- Study hours explains $60 of success (60%)
- Sleep quality explains $25 (25%)
- Diet explains $10 (10%)
- Shoe size explains $5 (5%)
The first two factors (study + sleep) give you $85 of your $100 budget - that's 85% explained! This is exactly what explained variance means in PCA. Each component "explains" a portion of the total variation in your data, and you want to keep enough components to hit your budget (usually 90-95%).
Each principal component captures a portion of the total variance (spread) in your data. The first component always captures the most (it's designed to!), the second captures the next most from what remains, and so on. The explained variance ratio tells you what fraction of total variance each component represents - like slices of a pie.
Why does this matter? If PC1 explains 90% of variance and PC2 explains 8%, keeping just these 2 components means you've retained 98% of your data's information! The remaining components (PC3, PC4, ...) might just be noise and can be safely discarded.
One Component at a Time
explained_variance_ratio_ shows what proportion each component captures individually. Think: "How much does this ONE component contribute?"
- PC1: 73% (biggest slice)
- PC2: 23% (2nd slice)
- PC3: 4% (tiny slice)
Running Total
The cumulative sum shows total variance retained as you add more components. Think: "How much do I have so far?"
- PC1 alone: 73%
- PC1+PC2: 96% (almost there!)
- All 3: 100% (everything)
When to Stop?
Common practice: keep enough components to retain 90-95% of total variance. Think: "I'm okay losing 5-10% to save space."
- More components = more info, less compression
- Fewer components = less info, more compression
- 90-95% is the sweet spot!
Computing Explained Variance
Fit PCA on all components first to see how variance is distributed:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
# Load and scale data
iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
# Fit PCA on ALL components
pca = PCA() # No n_components = keep all
pca.fit(X_scaled)
# Get explained variance
variance_ratio = pca.explained_variance_ratio_
cumulative = np.cumsum(variance_ratio)
print("Individual variance:", variance_ratio.round(3))
print("Cumulative variance:", cumulative.round(3))
Creating a Scree Plot
A scree plot visualizes the explained variance for each component. Look for an "elbow" where adding more components yields diminishing returns:
# Create a scree plot
fig, ax1 = plt.subplots(figsize=(10, 6))
# Bar chart for individual variance
components = range(1, len(variance_ratio) + 1)
ax1.bar(components, variance_ratio, alpha=0.7, color='steelblue', label='Individual')
ax1.set_xlabel('Principal Component')
ax1.set_ylabel('Explained Variance Ratio', color='steelblue')
ax1.tick_params(axis='y', labelcolor='steelblue')
# Line for cumulative variance
ax2 = ax1.twinx()
ax2.plot(components, cumulative, 'ro-', label='Cumulative')
ax2.axhline(y=0.95, color='g', linestyle='--', label='95% threshold')
ax2.set_ylabel('Cumulative Explained Variance', color='red')
ax2.tick_params(axis='y', labelcolor='red')
plt.title('Scree Plot with Cumulative Variance')
fig.legend(loc='center right', bbox_to_anchor=(0.85, 0.5))
plt.tight_layout()
plt.show()
Reading a Scree Plot (The Elbow Hunter's Guide)
The scree plot is named after the geological term for debris at the base of a cliff. Imagine a mountain:
- The Cliff (Steep Drop): First few components with HIGH explained variance - these are important!
- The Scree (Flat Tail): Later components with LOW explained variance - mostly noise, can discard
- The Elbow (Bend Point): Where the steep drop transitions to flat - THIS is your cutoff!
The Elbow Rule (How to Decide):
Choose the number of components at the point where the curve bends sharply (the "elbow").
Before the elbow: Each component adds significant value.
After the elbow: Adding more components gives diminishing returns - not worth it!
Pro Tip: If you see 4 components before the elbow, keep 4. Don't keep the flat tail!
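The same decision can be automated: instead of eyeballing the elbow, pick the first k whose cumulative explained variance crosses your threshold. A minimal sketch on the iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)

threshold = 0.95
k = int(np.argmax(cumulative >= threshold)) + 1  # first component crossing the line
print(f"Keep {k} components to retain {cumulative[k - 1]:.2%}")
```

For iris this picks k=2 (about 96% retained), matching what PCA(n_components=0.95) selects automatically.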
Automatic Component Selection
Instead of manually choosing, you can specify a variance threshold:
# Method 1: Specify variance to retain
pca_95 = PCA(n_components=0.95) # Keep 95% of variance
X_reduced = pca_95.fit_transform(X_scaled)
print(f"Components for 95%: {pca_95.n_components_}")
# Method 2: Specify exact number
pca_2 = PCA(n_components=2)
X_2d = pca_2.fit_transform(X_scaled)
print(f"Variance with 2 components: {sum(pca_2.explained_variance_ratio_):.2%}")
Variance Thresholds Table
Common rules of thumb for choosing the number of components:
| Target Variance | Use Case | Trade-off |
|---|---|---|
| 80-85% | Quick exploration, visualization | Loses more detail, but very compact |
| 90-95% | General purpose, most ML tasks | Good balance of compression and accuracy |
| 99% | When accuracy is critical | Minimal compression, removes only noise |
| 2-3 components | Visualization only (2D/3D plots) | May lose significant variance |
Reconstructing Data from Components
You can transform data back to the original space. The difference between original and reconstructed data shows what information was lost:
# Reduce to 2 components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
# Reconstruct back to original dimensions
X_reconstructed = pca.inverse_transform(X_reduced)
# Calculate reconstruction error
mse = np.mean((X_scaled - X_reconstructed) ** 2)
print(f"Reconstruction MSE: {mse:.4f}")
# Compare original vs reconstructed for first sample
print("Original:", X_scaled[0].round(2))
print("Reconstructed:", X_reconstructed[0].round(2))
Practice Questions: Explained Variance
Test your understanding with these hands-on exercises.
Task: Fit PCA on the iris dataset without limiting components and print the explained variance ratio.
Show Solution
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA() # Keep all components
pca.fit(X_scaled)
print("Variance ratio:", pca.explained_variance_ratio_.round(4))
Task: Compute and print the cumulative explained variance using numpy cumsum.
Show Solution
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA()
pca.fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("Cumulative variance:", cumulative.round(4))
Task: Create a bar chart of individual variance and overlay a line plot of cumulative variance.
Show Solution
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA().fit(X_scaled)
variance = pca.explained_variance_ratio_
cumulative = np.cumsum(variance)
components = range(1, len(variance) + 1)
fig, ax1 = plt.subplots(figsize=(8, 5))
ax1.bar(components, variance, alpha=0.7, color='blue')
ax1.set_xlabel('Component')
ax1.set_ylabel('Individual Variance')
ax2 = ax1.twinx()
ax2.plot(components, cumulative, 'r-o')
ax2.set_ylabel('Cumulative Variance')
plt.title('Scree Plot')
plt.show()
Task: Use n_components=0.90 to find how many components are needed to retain 90% variance.
Show Solution
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=0.90)
pca.fit(X_scaled)
print(f"Components for 90%: {pca.n_components_}")
print(f"Actual variance: {sum(pca.explained_variance_ratio_):.2%}")
t-SNE for Visualization
t-SNE (t-distributed Stochastic Neighbor Embedding) excels at revealing clusters and patterns in high-dimensional data by creating stunning 2D or 3D visualizations that preserve local structure.
What is t-SNE?
Beginner-Friendly Explanation
Imagine you're creating a map of a city. PCA would preserve the overall layout - north stays north, the downtown area stays in the center. But what if you care more about neighborhoods? You want houses on the same street to stay close together, even if that means bending the overall map.
That's t-SNE! It focuses on keeping neighbors together. If data points A and B were close in the original high-dimensional space, t-SNE makes sure they're also close in the 2D visualization. It's willing to distort global distances to preserve local neighborhoods - perfect for discovering clusters!
Technical explanation: Unlike PCA which preserves global variance (overall spread of data), t-SNE focuses on preserving local neighborhoods. Points that are close together in high-dimensional space will remain close in the low-dimensional projection. This makes t-SNE excellent for discovering clusters that might not be visible with PCA.
Important caveat: Because t-SNE distorts global structure, the distances between clusters are NOT meaningful. Two clusters might appear far apart in a t-SNE plot but could actually be close in the original space. t-SNE is for visualization and cluster discovery ONLY, not for measuring distances!
PCA:
- Linear transformation: Like rotating a photo - no bending or warping, just rotation
- Preserves global structure: Overall relationships and distances stay accurate
- Fast & deterministic: Same input = same output, every time. Runs in seconds!
- Feature reduction: Great for ML pipelines - use PCA features as input to models
- Interpretable axes: Can understand what each PC represents (hard but possible)
t-SNE:
- Non-linear transformation: Can bend and warp to reveal hidden patterns
- Preserves local structure: Keeps neighbors together, but distorts overall layout
- Slower & stochastic: Different runs give slightly different results. Takes minutes!
- Visualization only: Beautiful plots but don't use for machine learning features
- Not interpretable: Axes have no meaning - just look at cluster separation
How t-SNE Works (The Neighborhood Preservation Dance)
1. Compute High-D Probabilities: Calculate the probability that each point would "pick" every other point as a neighbor in high-dimensional space. Think: "If point A were at a party, how likely is it to hang out with point B vs point C?" Close points = high probability.
2. Random Low-D Initialization: Randomly scatter all points in 2D or 3D space (like throwing confetti). The starting point is random - that's why different runs give slightly different results (stochastic).
3. Iterative Optimization: Gradually move points around in 2D so that the low-D neighbor probabilities match the high-D probabilities. Like organizing a party - keep rearranging people until everyone is near their friends. This takes many iterations (1000+).
4. Use a t-Distribution: In low dimensions, use a t-distribution (instead of a Gaussian), which has heavy tails. Why? It prevents the "crowding problem" and gives clusters room to spread out. Without this, all points would pile up in the center!
Key Insight (The t-SNE Philosophy):
t-SNE optimizes for neighborhood preservation, NOT absolute distances! Two points that are far apart in the t-SNE plot might still be relatively close in the original space - you just can't tell. Only trust local clusters, not global layout! The magic: reveals hidden clusters beautifully.
Applying t-SNE in Python
Scikit-learn provides an easy-to-use t-SNE implementation:
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
# Load and scale data
iris = load_iris()
X = iris.data
y = iris.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_scaled)
print(f"Original shape: {X.shape}") # (150, 4)
print(f"t-SNE shape: {X_tsne.shape}") # (150, 2)
Important: unlike PCA, scikit-learn's TSNE has no transform() method. You cannot apply a fitted t-SNE to new data. Always use fit_transform() on all your data at once.
# Visualize t-SNE results
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', alpha=0.7, s=50)
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.title('t-SNE Visualization of Iris Dataset')
plt.colorbar(scatter, label='Species')
plt.show()
The Perplexity Parameter
Perplexity is the most important hyperparameter. It controls how many neighbors each point considers:
| Perplexity Value | Effect on Visualization | Best For |
|---|---|---|
| 5-10 (very low) | Focuses on very tight, local structure. Each point only considers its 5-10 nearest neighbors. Can create many small, fragmented clusters. | Small datasets (< 100 points), when you want to see micro-clusters, detecting very tight groups |
| 30 (default) | Balanced view of both local and global structure. Each point considers ~30 neighbors. Most reliable and recommended starting point. | Most datasets, general purpose visualization, when you're not sure what value to use - START HERE! |
| 50-100 (high) | Considers broader neighborhoods, preserves more global structure. Creates smoother, more spread-out clusters. Less fragmentation. | Larger datasets (1000+ samples), when you want to see macro-structure, reducing over-clustering artifacts |
# Compare different perplexity values
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
perplexities = [5, 30, 50]
for ax, perp in zip(axes, perplexities):
tsne = TSNE(n_components=2, perplexity=perp, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
ax.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', alpha=0.7)
ax.set_title(f'Perplexity = {perp}')
ax.set_xlabel('Dim 1')
ax.set_ylabel('Dim 2')
plt.tight_layout()
plt.show()
t-SNE Best Practices
Do:
- Use for visualization only, not as features
- Always set random_state for reproducibility
- Try multiple perplexity values
- Scale your data first
- Use PCA to reduce to ~50 dims first for speed
Don't:
- Interpret distances between clusters
- Use for feature engineering
- Trust cluster sizes (they can be artifacts)
- Run on very large datasets directly
- Use on new unseen data
Speed Optimization with PCA
For datasets with many features, first reduce with PCA before applying t-SNE:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.datasets import fetch_openml
# Load MNIST (784 features)
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data[:5000], mnist.target[:5000].astype(int)
# Step 1: Reduce to 50 dimensions with PCA
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X)
# Step 2: Apply t-SNE on reduced data (much faster!)
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_pca)
print(f"784 dims -> 50 dims -> 2 dims")
Practice Questions: t-SNE
Test your understanding with these hands-on exercises.
Task: Apply t-SNE to reduce the iris dataset to 2 dimensions.
Show Solution
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
print(f"Shape: {X_tsne.shape}")
Task: Create a scatter plot of t-SNE output colored by species.
Show Solution
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
plt.figure(figsize=(8, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=iris.target, cmap='viridis')
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.colorbar(label='Species')
plt.title('t-SNE of Iris')
plt.show()
Task: Create a 1x3 subplot comparing t-SNE with perplexity values 5, 30, and 50.
Show Solution
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, perp in zip(axes, [5, 30, 50]):
tsne = TSNE(n_components=2, perplexity=perp, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
ax.scatter(X_tsne[:, 0], X_tsne[:, 1], c=iris.target, cmap='viridis')
ax.set_title(f'Perplexity = {perp}')
plt.tight_layout()
plt.show()
Task: For large datasets, first reduce dimensions with PCA before applying t-SNE.
Show Solution
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
# Step 1: PCA to 30 dimensions
pca = PCA(n_components=30)
X_pca = pca.fit_transform(X)
# Step 2: t-SNE on reduced data
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_pca)
print(f"Original: {X.shape} -> PCA: {X_pca.shape} -> t-SNE: {X_tsne.shape}")
UMAP for High-Dimensional Data
UMAP (Uniform Manifold Approximation and Projection) is a modern technique that preserves both local and global structure, runs faster than t-SNE, and works well for both visualization and as a preprocessing step for machine learning.
Why UMAP? (The Modern Champion)
Beginner-Friendly Explanation
Imagine you need to create a map of your city for tourists:
- PCA: Fast but boring - just a rotated satellite view, loses neighborhoods
- t-SNE: Beautiful and shows neighborhoods, but SLOW, and you can't add new locations later
- UMAP: Beautiful like t-SNE, FAST like PCA, AND you can add new locations to the same map!
UMAP is the best of both worlds! It was developed in 2018 to fix t-SNE's limitations. It creates gorgeous visualizations (like t-SNE) while being faster, preserving more global structure, and supporting transformation of new data (like PCA). It's quickly becoming the go-to choice!
UMAP (Uniform Manifold Approximation and Projection) was developed in 2018 and has quickly become the preferred choice for many data scientists and machine learning practitioners. Why? Because it combines the best aspects of PCA (speed, usability as features for ML) with the visualization quality of t-SNE.
The UMAP advantage: Unlike t-SNE which only preserves local structure, UMAP preserves both local AND global structure. This means cluster separations and relative positions are more meaningful. Plus, it scales to millions of data points efficiently!
Blazing Fast!
UMAP is significantly faster than t-SNE, especially on larger datasets. Where t-SNE might take 30 minutes, UMAP can finish in a couple of minutes!
- 1,000 points: seconds
- 10,000 points: 1-2 min
- 1,000,000+ points: possible!
Best of Both!
Unlike t-SNE (local only), UMAP preserves more global relationships. Distances between clusters are more meaningful and trustworthy.
- See clusters (local)
- AND their relationships (global)
- More accurate overall view!
ML Ready!
UMAP can transform new data using a fitted model. This makes it usable for machine learning pipelines, unlike t-SNE which can't!
- Fit on training data
- Transform test data
- Use as ML features!
Installing UMAP
UMAP is not included in scikit-learn. Install it separately:
# Install UMAP (run once)
# pip install umap-learn
# Note: the package is 'umap-learn' but you import 'umap'
import umap
print("UMAP installed successfully!")
Note: the PyPI package name is umap-learn (not umap), but you import it as import umap or from umap import UMAP.
Applying UMAP in Python
import umap
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
# Load and scale data
iris = load_iris()
X = iris.data
y = iris.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply UMAP
reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X_scaled)
print(f"Original shape: {X.shape}") # (150, 4)
print(f"UMAP shape: {X_umap.shape}") # (150, 2)
# Visualize UMAP results
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='viridis', alpha=0.7, s=50)
plt.xlabel('UMAP Dimension 1')
plt.ylabel('UMAP Dimension 2')
plt.title('UMAP Visualization of Iris Dataset')
plt.colorbar(scatter, label='Species')
plt.show()
Key UMAP Parameters
UMAP has two main hyperparameters that control the embedding:
| Parameter | Default | Effect & What It Controls | Guidance (When to Change) |
|---|---|---|---|
| `n_neighbors` | 15 | Controls local vs. global focus: the number of neighbors each point considers when building the manifold. Higher = broader view; lower = tighter focus. | Increase (30-100) for more global structure and a smoother embedding. Decrease (5-10) for fine local detail or small datasets. |
| `min_dist` | 0.1 | Controls point spacing: the minimum distance between points in the low-dimensional output. Think "cluster tightness". Lower = tighter clusters; higher = more spread out. | Lower (0.01-0.05) for tight, dense clusters and clear separation. Higher (0.3-0.5) for spread-out embeddings where individual points are easier to see. |
| `n_components` | 2 | Output dimensions: usually 2 for 2D plots, 3 for 3D plots, or higher when the output is used as ML features. | 2 or 3 for visualization and plotting. 10-50 when using the output as features for machine learning models. |
| `metric` | 'euclidean' | Distance metric: how "closeness" between points is measured. Euclidean is standard straight-line distance; other options include cosine, manhattan, and hamming. | 'cosine' for text/document data (word vectors). 'manhattan' for grid-like data and coordinates. 'euclidean' is a good default for most data. |
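To build intuition for the `metric` choice, here is a small sketch using scikit-learn's pairwise distance helpers (not UMAP itself) of why cosine distance suits direction-based data like TF-IDF word vectors: it ignores vector magnitude, while Euclidean distance does not.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Two "documents" with identical word proportions but different lengths:
# b is just a, doubled. Cosine sees them as pointing in the same direction.
a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 4.0, 0.0]])

print(cosine_distances(a, b))     # ~0.0: same direction, "identical" documents
print(euclidean_distances(a, b))  # sqrt(5) ≈ 2.24: length difference dominates
```

With `metric='cosine'`, UMAP would treat these two points as near-duplicates; with the default `'euclidean'`, it would place them far apart.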
Tip: start with the defaults (n_neighbors=15, min_dist=0.1) first.
Then experiment: try n_neighbors=5 (local focus) vs. n_neighbors=50 (global focus),
and min_dist=0.01 (tight) vs. min_dist=0.5 (spread). See which reveals your data best!
# Experiment with parameters
fig, axes = plt.subplots(2, 2, figsize=(12, 12))
params = [
{'n_neighbors': 5, 'min_dist': 0.1},
{'n_neighbors': 50, 'min_dist': 0.1},
{'n_neighbors': 15, 'min_dist': 0.01},
{'n_neighbors': 15, 'min_dist': 0.5},
]
for ax, p in zip(axes.flat, params):
reducer = umap.UMAP(**p, random_state=42)
X_umap = reducer.fit_transform(X_scaled)
ax.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='viridis', alpha=0.7)
ax.set_title(f"n_neighbors={p['n_neighbors']}, min_dist={p['min_dist']}")
plt.tight_layout()
plt.show()
UMAP vs t-SNE Comparison
| Aspect | t-SNE (2008) | UMAP (2018) |
|---|---|---|
| Speed | Slow (O(n²)); 10K points take 10-30 minutes | Fast (approximate nearest neighbors); 10K points take 1-2 minutes |
| Global structure | Poorly preserved; only local neighborhoods are reliable | Better preserved; both local and global structure are trustworthy |
| Transform new data | Not supported; must rerun on all data every time | Supported; a fitted model can transform() new test data |
| Scalability | Up to ~10K points; impractical beyond that | Millions of points; scales well to large datasets |
| Use as ML features | Not recommended; visualization only, never for training | Works well as a preprocessing step for ML models |
| Verdict | Good for 2D plots of small data | Modern default: faster, more versatile, scales better |
Bottom Line
For visualization: Try both! UMAP is usually faster and better, but t-SNE sometimes reveals different patterns.
For ML preprocessing: Use UMAP (or PCA). Never use t-SNE as features.
For large datasets (>10K points): UMAP is your only practical choice.
Using UMAP for ML Preprocessing
Unlike t-SNE, UMAP can be used as a preprocessing step for machine learning:
import umap
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
# Load data
digits = load_digits()
X, y = digits.data, digits.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit UMAP on training data
reducer = umap.UMAP(n_components=10, random_state=42)
X_train_umap = reducer.fit_transform(X_train)
# Transform test data using fitted UMAP
X_test_umap = reducer.transform(X_test)
# Train classifier on UMAP features
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_umap, y_train)
print(f"Accuracy: {clf.score(X_test_umap, y_test):.2%}")
Choosing a Technique
- PCA: Start here. Fast, interpretable, good for feature reduction and denoising
- t-SNE: Best visualization for finding clusters. Use only for visualization
- UMAP: Modern default. Fast, good visualization, can be used for ML features
Tip: For visualization, try both t-SNE and UMAP. For ML pipelines, prefer PCA or UMAP since they support transforming new data.
Practice Questions: UMAP
Test your understanding with these hands-on exercises.
Task: Use UMAP to reduce the iris dataset to 2 dimensions.
Show Solution
import umap
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X_scaled)
print(f"Shape: {X_umap.shape}")
Task: Apply UMAP with n_neighbors=30 and min_dist=0.05.
Show Solution
import umap
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
reducer = umap.UMAP(n_neighbors=30, min_dist=0.05, random_state=42)
X_umap = reducer.fit_transform(X_scaled)
print(f"Shape: {X_umap.shape}")
Task: Fit UMAP on training data and transform test data separately.
Show Solution
import umap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
reducer = umap.UMAP(n_components=2, random_state=42)
X_train_umap = reducer.fit_transform(X_train_scaled)
X_test_umap = reducer.transform(X_test_scaled)
print(f"Train: {X_train_umap.shape}, Test: {X_test_umap.shape}")
Task: Create a scatter plot of UMAP output colored by target class.
Show Solution
import umap
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X_scaled)
plt.figure(figsize=(8, 6))
plt.scatter(X_umap[:, 0], X_umap[:, 1], c=iris.target, cmap='viridis')
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.colorbar(label='Species')
plt.title('UMAP of Iris Dataset')
plt.show()
Decision Flowchart: Which Technique to Use?
Confused about when to use PCA, t-SNE, or UMAP? Follow this guide to choose the right dimensionality reduction technique for your situation.
START: I need dimensionality reduction.
Question 1: What is your PRIMARY goal? Think about what matters most for your task.
- Feature reduction (I want to reduce features for use in machine learning models) → Use PCA. It preserves maximum variance, is linear and interpretable, can transform new data, and fits naturally into ML pipelines.
- Visualization (I want 2D/3D plots to explore and present my data) → Go to Question 2. The right choice depends on dataset size, speed requirements, and quality needs.
- Both (I need features for ML AND want to visualize my data) → Use UMAP. It produces great visualizations, can transform new data, is faster than t-SNE, and its output works as ML features.
Question 2: How large is your dataset? (Only for the visualization goal; if you chose feature reduction, you're already done with PCA!)
- Small (< 1,000 samples, e.g., Iris with 150 samples) → t-SNE or UMAP, your choice! Small datasets run fast with either method, so try both and see which reveals clusters better (for UMAP, try n_neighbors=5-15).
- Medium (1K-10K samples, e.g., an MNIST subset) → UMAP (preferred). t-SNE starts getting slow here; UMAP provides similar quality much faster.
- Large (> 10,000 samples, e.g., full MNIST, ImageNet) → UMAP only. t-SNE becomes impractical beyond 10K points, while UMAP scales to millions.
Quick Reference Guide
PCA: use when...
- You need features for ML
- You want speed and efficiency
- Your data is roughly linear
- Interpretability matters
(Linear · Fast · Can transform new data)
t-SNE: use when...
- You only need visualization
- The dataset is small (< 10K samples)
- You want striking cluster plots
- You need publication-quality visuals
(Non-linear · Slow · Visualization only)
UMAP: use when...
- You need visualization AND features
- The dataset is any size
- You want speed plus quality
- You want the modern default choice
(Non-linear · Fast · Versatile)
Pro Tip: The Hybrid Approach
For large, high-dimensional datasets (like images with 784+ features):
1. Use PCA first to reduce to ~50 dimensions (fast, removes noise)
2. Then apply UMAP or t-SNE to reduce to 2D for visualization
This combination gives you: Speed + Quality + Efficiency!
X_pca = PCA(n_components=50).fit_transform(X) → X_viz = UMAP().fit_transform(X_pca)
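The two-step recipe above can be sketched with scikit-learn alone; this version uses t-SNE for the second step so it runs without extra packages, but a fitted `umap.UMAP()` drops in at that step the same way. The digits dataset (subsampled to 500 points here to keep t-SNE fast) stands in for any high-dimensional data:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# 500 images, 64 pixel features each, standardized first
digits = load_digits()
X = StandardScaler().fit_transform(digits.data[:500])

# Step 1: PCA to 30 dimensions (digits only has 64 features,
# so 30 plays the role of the ~50 suggested for larger data)
X_pca = PCA(n_components=30, random_state=42).fit_transform(X)

# Step 2: non-linear reduction of the denoised data to 2D for plotting
X_viz = TSNE(n_components=2, random_state=42).fit_transform(X_pca)

print(X_pca.shape)  # (500, 30)
print(X_viz.shape)  # (500, 2)
```

The PCA step is what makes the second step affordable: the non-linear method now works on 30 denoised features instead of 64 raw pixels.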
Key Takeaways
PCA for Linear Reduction
Projects data onto orthogonal axes of maximum variance. Fast, interpretable, and great for preprocessing
Scree Plot for Component Selection
Look for the "elbow" where explained variance drops. Aim to retain 80-95% of total variance
t-SNE for Cluster Visualization
Excellent for revealing clusters in 2D/3D. Use perplexity 5-50 and only for visualization
UMAP is Faster and Versatile
Preserves global structure better than t-SNE. Good for both visualization and ML preprocessing
Always Scale Before Reduction
StandardScaler is essential for PCA. Features with larger scales will dominate otherwise
Choose Based on Goal
PCA for preprocessing/speed, t-SNE for publication visuals, UMAP for general-purpose reduction
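The 80-95% variance rule of thumb from the takeaways can be checked in a few lines; this sketch uses scikit-learn's PCA on the digits dataset to count how many components are needed for 90% of the variance:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale first: PCA is dominated by large-scale features otherwise
X = StandardScaler().fit_transform(load_digits().data)

# Fit PCA with all components, then accumulate the explained variance
pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# First index where the running total crosses 90%
n_90 = int(np.argmax(cumvar >= 0.90)) + 1
print(f"{n_90} of {X.shape[1]} components retain 90% of the variance")
```

scikit-learn can also do this selection for you: passing a float, as in `PCA(n_components=0.90)`, keeps exactly enough components to reach that variance fraction.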