What are Anomalies?
An anomaly (also called an outlier) is a data point that differs significantly from the majority of the data. These unusual observations might indicate fraud, equipment failure, disease, or simply measurement errors. Detecting them automatically is crucial in many real-world applications where manual inspection is impossible.
Understanding Anomalies (Beginner's Guide)
Anomaly detection is the art and science of finding "the odd one out" in your data. Think of it like spotting a wolf disguised among a flock of sheep, finding a counterfeit bill hidden among real currency, or identifying a submarine surfacing in a sea of fishing boats. These rare, unusual observations often signal something important: fraud, system failures, security breaches, or groundbreaking discoveries.
In the real world, anomalies are everywhere. Your bank uses anomaly detection to freeze your card when it sees a purchase in a foreign country you've never visited. Hospitals use it to detect unusual heart rhythms that might indicate a heart attack. Tech companies use it to spot hackers trying to break into servers. Manufacturing plants use it to catch defective products before they ship. The applications are endless because wherever there's data, there can be anomalies hiding within it.
The Three Types of Anomalies
Point Anomalies
The simplest type: a single data point that stands out dramatically from the rest of the dataset.
Examples: A $50,000 purchase on a student's credit card, a server CPU spike to 100% when it normally runs at 20%, a patient's blood pressure reading of 200/150 when normal is 120/80.
Contextual Anomalies
A value that's normal in one context but anomalous in another. Context matters!
Examples: A 30C temperature in summer is fine, but in winter it signals a broken sensor. High sales on Black Friday are expected, but on a random Tuesday they're suspicious. Loud music at 3 PM is normal; at 3 AM it's a noise complaint.
Collective Anomalies
Individual points look normal, but together they form an anomalous pattern.
Examples: Multiple small transactions that individually seem fine but together indicate money laundering. A sequence of login attempts from different IPs suggesting a coordinated attack. Gradual sensor drift that only becomes apparent over time.
Why is Anomaly Detection Challenging?
Extreme Class Imbalance
Anomalies might be 1 in 10,000 or even 1 in a million. Traditional classifiers fail because they can achieve 99.99% accuracy by just predicting "normal" for everything!
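To see the imbalance trap concretely, here is a minimal sketch (with hypothetical numbers) of a do-nothing detector that still scores 99.99% accuracy:

```python
# A "detector" that never flags anything still scores 99.99% accuracy
# on data with 1 anomaly per 10,000 samples (hypothetical illustration)
import numpy as np

y_true = np.zeros(10_000, dtype=int)  # 0 = normal
y_true[0] = 1                         # a single anomaly

y_pred = np.zeros_like(y_true)        # always predict "normal"

accuracy = (y_pred == y_true).mean()
recall = (y_pred[y_true == 1] == 1).mean()  # fraction of anomalies caught

print(f"Accuracy: {accuracy:.2%}")      # 99.99% -- looks great...
print(f"Anomaly recall: {recall:.2%}")  # 0.00% -- ...but catches nothing
```

This is why anomaly detection is evaluated with precision, recall, and F1 on the anomaly class rather than raw accuracy.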
Unknown Unknowns
You often don't know what anomalies look like beforehand. Fraudsters constantly invent new schemes, hackers find new vulnerabilities, and equipment fails in unexpected ways.
Concept Drift
What's "normal" changes over time. User behavior evolves, systems get upgraded, and seasonal patterns shift. Your anomaly detector must adapt.
False Alarm Fatigue
Too many false positives and analysts start ignoring alerts. Too few and you miss real threats. Finding the right balance is critical for production systems.
Key Insight: Anomalies are rare by definition - if something happens frequently, it's not an anomaly, it's a pattern! Most anomaly detection algorithms assume anomalies make up less than 1-5% of your data. This rarity is both a curse (hard to find examples to learn from) and a blessing (most of your data can be used to learn "normal" behavior).
Real-World Applications
Fraud Detection
Credit card fraud, insurance claims, money laundering detection
Manufacturing
Defect detection, quality control, predictive maintenance
Cybersecurity
Intrusion detection, malware identification, unusual access patterns
Healthcare
Disease outbreak detection, abnormal test results, patient monitoring
Supervised vs Unsupervised Anomaly Detection
Supervised (with labels)
You have labeled examples of both normal and anomalous data.
- Works like regular classification
- Requires labeled anomaly examples
- Best when anomaly types are known
- Example: Spam detection with labeled spam
Unsupervised (no labels)
You only have data without labels, mostly normal observations.
- Learns "normal" behavior from data
- Flags anything that deviates
- Can detect unknown anomaly types
- This is what we focus on!
# Generate sample data with anomalies
import numpy as np
from sklearn.datasets import make_blobs
# Create normal data (2 clusters)
np.random.seed(42)
X_normal, _ = make_blobs(n_samples=300, centers=2, cluster_std=0.5, random_state=42)
# Add some anomalies (random points far from clusters)
n_anomalies = 15
X_anomalies = np.random.uniform(low=-6, high=6, size=(n_anomalies, 2))
# Combine into one dataset
X = np.vstack([X_normal, X_anomalies])
y_true = np.array([0] * len(X_normal) + [1] * n_anomalies) # 0=normal, 1=anomaly
print(f"Total samples: {len(X)}")
print(f"Normal samples: {len(X_normal)} ({len(X_normal)/len(X)*100:.1f}%)")
print(f"Anomalies: {n_anomalies} ({n_anomalies/len(X)*100:.1f}%)")
This code creates a synthetic dataset for practicing anomaly detection. We use make_blobs() to generate 300 "normal" points clustered in two groups, simulating typical behavior like legitimate transactions from two customer segments. Then we add 15 "anomalies" by generating random points uniformly across a larger range using np.random.uniform(). These anomalies are scattered randomly and don't belong to either cluster, mimicking fraudulent transactions or equipment malfunctions that don't follow normal patterns. We use np.vstack() to stack both arrays vertically into one dataset. The y_true array stores the ground truth labels (0 for normal, 1 for anomaly), which we'll use later to evaluate our detectors. Notice that anomalies make up only about 5% of the data, which is realistic for most real-world scenarios where anomalies are rare by definition.
Practice Questions: Introduction
Test your understanding with these coding challenges.
Task: Generate 100 normal values from a Gaussian distribution (mean=50, std=5) and add 5 extreme outliers.
Show Solution
import numpy as np
np.random.seed(42)
# Normal data: mean=50, std=5
normal_data = np.random.normal(50, 5, 100)
# Outliers: extreme values
outliers = np.array([10, 15, 90, 95, 100])
# Combine
data = np.concatenate([normal_data, outliers])
print(f"Mean: {data.mean():.2f}")
print(f"Std: {data.std():.2f}")
print(f"Min: {data.min():.2f}, Max: {data.max():.2f}")
Task: Given a dataset, calculate what percentage of points are beyond 2 standard deviations from the mean.
Show Solution
import numpy as np
np.random.seed(42)
data = np.random.normal(0, 1, 1000)
mean = data.mean()
std = data.std()
threshold = 2 # 2 standard deviations
# Points beyond 2 std
is_outlier = np.abs(data - mean) > threshold * std
n_outliers = is_outlier.sum()
print(f"Total points: {len(data)}")
print(f"Outliers (>2 std): {n_outliers}")
print(f"Percentage: {n_outliers/len(data)*100:.2f}%")
# Should be ~5% for normal distribution
Task: Create a function that labels data points as point anomalies based on distance from cluster centers.
Show Solution
import numpy as np
from sklearn.cluster import KMeans
def detect_point_anomalies(X, n_clusters=2, threshold=2.0):
    # Cluster the data
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(X)
    # Calculate distance to nearest cluster center
    distances = kmeans.transform(X).min(axis=1)
    # Flag points beyond threshold * std as anomalies
    mean_dist = distances.mean()
    std_dist = distances.std()
    is_anomaly = distances > mean_dist + threshold * std_dist
    return is_anomaly
# Test it
np.random.seed(42)
X = np.vstack([np.random.randn(100, 2), [[5, 5], [-5, -5]]])
anomalies = detect_point_anomalies(X)
print(f"Anomalies found: {anomalies.sum()}")
Task: Create a time-series dataset where values are normal in summer but anomalous in winter (contextual anomalies).
Show Solution
import numpy as np
np.random.seed(42)
# 12 months of temperature data
months = np.arange(1, 13)
# Normal seasonal pattern: warm in summer (months 6-8)
base_temp = 15 + 10 * np.sin((months - 3) * np.pi / 6)
# Add noise
temperatures = base_temp + np.random.normal(0, 2, 12)
# Add contextual anomaly: 25C in January (month 1)
temperatures[0] = 25 # Too warm for winter!
# Detect: compare each month to its expected range
for i, (month, temp) in enumerate(zip(months, temperatures)):
    expected = base_temp[i]
    if abs(temp - expected) > 8:  # More than 8 degrees off
        print(f"Month {month}: {temp:.1f}C - ANOMALY (expected ~{expected:.1f}C)")
    else:
        print(f"Month {month}: {temp:.1f}C - Normal")
Task: Given a dataset with known ground truth labels, calculate the actual contamination ratio and compare to detected anomalies.
Show Solution
import numpy as np
from sklearn.ensemble import IsolationForest
np.random.seed(42)
# Create data with known anomalies
n_normal, n_anomaly = 950, 50
X_normal = np.random.randn(n_normal, 2)
X_anomaly = np.random.uniform(-5, 5, (n_anomaly, 2))
X = np.vstack([X_normal, X_anomaly])
y_true = np.array([0]*n_normal + [1]*n_anomaly)
# True contamination
true_contamination = n_anomaly / len(X)
print(f"True contamination: {true_contamination:.2%}")
# Detect with different contamination settings
for cont in [0.01, 0.05, 0.10]:
    iso = IsolationForest(contamination=cont, random_state=42)
    y_pred = iso.fit_predict(X)
    detected = (y_pred == -1).sum()
    true_positives = ((y_pred == -1) & (y_true == 1)).sum()
    print(f"cont={cont:.2f}: detected {detected}, true positives {true_positives}/{n_anomaly}")
Statistical Methods
The simplest anomaly detection methods are based on statistics. If you understand the normal distribution of your data, anything too far from the center can be flagged as unusual. These methods are fast, interpretable, and work well for univariate (single-feature) data.
Z-Score Method (Standard Score)
The Z-score (also called the standard score) measures how many standard deviations a data point is away from the mean. It's one of the oldest and most intuitive methods for detecting outliers, rooted in the properties of the normal (Gaussian) distribution. When data follows a bell curve, we know exactly what percentage of values should fall within certain ranges.
The beauty of the Z-score is its simplicity: it transforms any dataset into a standardized scale where 0 represents the mean, and each unit represents one standard deviation. A Z-score of 2 means "this value is 2 standard deviations above average." A Z-score of -1.5 means "this value is 1.5 standard deviations below average."
Formula:
Z = (x - mean) / std
Where x is your data point, mean is the average of all data, and std is the standard deviation (spread) of the data.
Common Thresholds:
- |Z| > 2: mild outlier (about 5% of normal data)
- |Z| > 2.5: moderate outlier (about 1.2% of normal data)
- |Z| > 3: strong outlier (about 0.3% of normal data)
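These tail percentages follow directly from the standard normal distribution; a quick check using scipy's survival function norm.sf:

```python
# Two-sided tail mass beyond each |Z| threshold for a standard normal
from scipy.stats import norm

tails = {z: 2 * norm.sf(z) for z in [2.0, 2.5, 3.0]}  # sf(z) = 1 - cdf(z); doubled for both tails
for z, tail in tails.items():
    print(f"|Z| > {z}: {tail:.2%} of normally distributed data")
```

This prints roughly 4.55%, 1.24%, and 0.27%, matching the thresholds above.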
When to Use Z-Score
- Data is approximately normally distributed
- You're working with a single feature (univariate)
- You need a quick, interpretable baseline
- You can explain results to non-technical stakeholders
Limitations
- Assumes normal distribution (fails on skewed data)
- Mean and std are affected by outliers themselves!
- Doesn't work well for multivariate anomalies
- Fixed threshold may not suit all datasets
Pro Tip: The Z-score has a "masking effect" problem - extreme outliers can inflate the mean and standard deviation so much that moderate outliers get masked and appear normal. Consider using the Modified Z-score with median and MAD (Median Absolute Deviation) for more robust detection.
# Z-Score Anomaly Detection
import numpy as np
from scipy import stats
# Sample data with outliers
np.random.seed(42)
data = np.concatenate([
    np.random.normal(100, 10, 95),     # Normal values around 100
    np.array([150, 160, 40, 30, 200])  # Obvious outliers
])
# Calculate Z-scores
z_scores = np.abs(stats.zscore(data))
# Find outliers (|Z| > 3)
threshold = 3
outliers_mask = z_scores > threshold
outliers = data[outliers_mask]
print(f"Mean: {data.mean():.2f}, Std: {data.std():.2f}")
print(f"Outliers found: {len(outliers)}")
print(f"Outlier values: {outliers}")
The Z-score method works by converting each data point to a standardized score that tells us how many standard deviations it is from the mean. We use scipy.stats.zscore() which automatically calculates (x - mean) / std for every value. We take the absolute value because we care about extreme values in both directions (too high OR too low). The threshold of 3 is a common choice because in a normal distribution, only about 0.27% of values fall beyond 3 standard deviations. So if we have 1000 points and find 50 beyond this threshold, something is clearly unusual. In our example, we created normal data around 100 with a standard deviation of 10, then added 5 extreme values (150, 160, 40, 30, 200). The Z-score method will flag these because they're far from the expected range of roughly 70-130. This method is simple and fast, but it has a weakness: it assumes your data is roughly normally distributed, and it can be skewed by the very outliers you're trying to detect!
IQR Method (Interquartile Range)
The IQR method is more robust to outliers because it uses the median instead of the mean. It's the same method used to create box plots and is less affected by extreme values.
IQR Formula
Q1 = 25th percentile, Q3 = 75th percentile, IQR = Q3 - Q1
Lower bound: Q1 - 1.5 * IQR
Upper bound: Q3 + 1.5 * IQR
Anything outside these bounds is an outlier.
# IQR Method for Outlier Detection
import numpy as np
# Same data as before
np.random.seed(42)
data = np.concatenate([
    np.random.normal(100, 10, 95),
    np.array([150, 160, 40, 30, 200])
])
# Calculate IQR
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Find outliers
outliers_mask = (data < lower_bound) | (data > upper_bound)
outliers = data[outliers_mask]
print(f"Q1: {Q1:.2f}, Q3: {Q3:.2f}, IQR: {IQR:.2f}")
print(f"Bounds: [{lower_bound:.2f}, {upper_bound:.2f}]")
print(f"Outliers: {outliers}")
The IQR method divides your data into quartiles (four equal parts) and uses the middle 50% to define what's "normal." We use np.percentile(data, 25) to find Q1 (the value below which 25% of data falls) and np.percentile(data, 75) for Q3. The Interquartile Range (IQR) is simply Q3 - Q1, representing the spread of the central half of your data. The "1.5 * IQR" rule is a standard convention from statistics: points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are considered outliers. This method is more robust than Z-scores because even if you have extreme outliers, they won't affect the quartile calculations much since quartiles only care about the middle of the data. The tradeoff is that IQR is slightly less sensitive to subtle anomalies compared to Z-score, but it's much more reliable when you don't know if your data is normally distributed.
Multivariate Statistical Methods
# Mahalanobis Distance for Multivariate Data
import numpy as np
from scipy.spatial.distance import mahalanobis
from scipy.stats import chi2
# Generate 2D data with outliers
np.random.seed(42)
normal_data = np.random.multivariate_normal(
    mean=[0, 0],
    cov=[[1, 0.5], [0.5, 1]],
    size=100
)
outliers = np.array([[4, 4], [-4, 3], [3, -4]])
X = np.vstack([normal_data, outliers])
# Calculate Mahalanobis distance for each point
mean = np.mean(X, axis=0)
cov = np.cov(X.T)
cov_inv = np.linalg.inv(cov)
distances = np.array([mahalanobis(x, mean, cov_inv) for x in X])
# Chi-squared threshold (p < 0.01 for 2 dimensions)
threshold = np.sqrt(chi2.ppf(0.99, df=2))
outlier_mask = distances > threshold
print(f"Threshold (99%): {threshold:.2f}")
print(f"Outliers found: {outlier_mask.sum()}")
print(f"Max distance: {distances.max():.2f}")
When you have multiple features, Euclidean distance doesn't work well because it ignores correlations between features. The Mahalanobis distance solves this by accounting for the shape and orientation of your data's distribution. Think of it as measuring distance in "standard deviation units" along the natural axes of your data cloud. We generate correlated 2D data using multivariate_normal() with a covariance matrix that creates an elliptical shape. For each point, we calculate its Mahalanobis distance from the center using the inverse covariance matrix. The chi2.ppf(0.99, df=2) gives us the critical value from the chi-squared distribution with 2 degrees of freedom (one per feature) at the 99% confidence level. Points with Mahalanobis distance exceeding this threshold are in the outer 1% and likely anomalies. This method is powerful for detecting multivariate outliers that might look normal when examining each feature individually but are unusual when considering all features together.
Practice Questions: Statistical Methods
Test your understanding with these coding challenges.
Task: Calculate Z-scores manually without using scipy.
Show Solution
import numpy as np
data = np.array([10, 12, 14, 100, 15, 13, 11, 200])
# Manual Z-score calculation
mean = np.mean(data)
std = np.std(data)
z_scores = (data - mean) / std
print(f"Mean: {mean:.2f}, Std: {std:.2f}")
for val, z in zip(data, z_scores):
    status = "OUTLIER" if abs(z) > 2 else "normal"
    print(f"Value: {val:3d}, Z: {z:6.2f} - {status}")
Task: Apply both methods to the same data and compare which outliers each finds.
Show Solution
import numpy as np
from scipy import stats
np.random.seed(42)
data = np.concatenate([np.random.normal(50, 5, 100), [10, 90, 95]])
# Z-score method
z_outliers = np.abs(stats.zscore(data)) > 3
# IQR method
Q1, Q3 = np.percentile(data, [25, 75])
IQR = Q3 - Q1
iqr_outliers = (data < Q1 - 1.5*IQR) | (data > Q3 + 1.5*IQR)
print(f"Z-score outliers: {data[z_outliers]}")
print(f"IQR outliers: {data[iqr_outliers]}")
print(f"Agreement: {(z_outliers == iqr_outliers).sum()}/{len(data)}")
Task: Implement the Modified Z-score using median and MAD (Median Absolute Deviation) for more robust outlier detection.
Show Solution
import numpy as np
def modified_zscore(data, threshold=3.5):
    median = np.median(data)
    mad = np.median(np.abs(data - median))  # Median Absolute Deviation
    # Modified Z-score formula
    modified_z = 0.6745 * (data - median) / mad
    return np.abs(modified_z) > threshold
# Test with data containing extreme outliers
np.random.seed(42)
data = np.concatenate([np.random.normal(100, 10, 100), [500, 600]])
# Compare regular vs modified Z-score
from scipy import stats
regular_outliers = np.abs(stats.zscore(data)) > 3
modified_outliers = modified_zscore(data)
print(f"Regular Z-score outliers: {data[regular_outliers]}")
print(f"Modified Z-score outliers: {data[modified_outliers]}")
Task: Detect outliers in 3D data using Mahalanobis distance with chi-squared threshold.
Show Solution
import numpy as np
from scipy.spatial.distance import mahalanobis
from scipy.stats import chi2
np.random.seed(42)
# 3D correlated data
cov = [[1, 0.5, 0.3], [0.5, 1, 0.4], [0.3, 0.4, 1]]
X_normal = np.random.multivariate_normal([0, 0, 0], cov, 200)
X_outliers = np.array([[4, 4, 4], [-4, 3, -3]])
X = np.vstack([X_normal, X_outliers])
# Calculate Mahalanobis distances
mean = X.mean(axis=0)
cov_matrix = np.cov(X.T)
cov_inv = np.linalg.inv(cov_matrix)
distances = [mahalanobis(x, mean, cov_inv) for x in X]
# Chi-squared threshold for 3 dimensions, 99% confidence
threshold = np.sqrt(chi2.ppf(0.99, df=3))
outliers = np.array(distances) > threshold
print(f"Threshold: {threshold:.2f}")
print(f"Outliers found: {outliers.sum()}")
print(f"Outlier indices: {np.where(outliers)[0]}")
Task: Implement Tukey's fences with both 1.5 (mild outliers) and 3.0 (extreme outliers) multipliers.
Show Solution
import numpy as np
def tukey_fences(data, k=1.5):
    Q1, Q3 = np.percentile(data, [25, 75])
    IQR = Q3 - Q1
    lower = Q1 - k * IQR
    upper = Q3 + k * IQR
    outliers = (data < lower) | (data > upper)
    return outliers, lower, upper
np.random.seed(42)
data = np.concatenate([
    np.random.normal(50, 5, 100),
    [20, 25, 80, 85, 100]  # Various severity outliers
])
# Mild outliers (k=1.5)
mild, low1, up1 = tukey_fences(data, k=1.5)
print(f"Mild outliers (k=1.5): bounds [{low1:.1f}, {up1:.1f}]")
print(f" Found: {data[mild]}")
# Extreme outliers (k=3.0)
extreme, low3, up3 = tukey_fences(data, k=3.0)
print(f"\nExtreme outliers (k=3.0): bounds [{low3:.1f}, {up3:.1f}]")
print(f" Found: {data[extreme]}")
Isolation Forest
Isolation Forest is one of the most popular and effective anomaly detection algorithms. Its brilliant insight is that anomalies are "easy to isolate" - they require fewer random splits to separate from the rest of the data because they're different from everything else.
How Isolation Forest Works
Isolation Forest flips the traditional anomaly detection approach on its head. Instead of trying to define what "normal" looks like and finding points that don't fit, it exploits a key property of anomalies: they are few and different. Because anomalies are rare and have unusual feature values, they are easier to "isolate" or separate from the rest of the data.
Imagine repeatedly slicing a pizza with random cuts. Normal toppings clustered in the center take many cuts to isolate - you keep slicing and they're still grouped with others. But that one weird olive way out in the corner? Just one or two random cuts and it's completely alone! That's the core intuition behind Isolation Forest.
The Algorithm Step by Step
Random Split
Randomly select a feature, then randomly select a split value between the min and max of that feature
Partition
Split the data: points less than split value go left, greater go right. Repeat recursively until each point is isolated
Measure Path
Count how many splits (tree depth) it took to isolate each point. This is the "path length"
Score
Shorter average path length = easier to isolate = more anomalous. Build many trees and average the results
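The four steps above can be sketched in a few lines. This toy version (not the full algorithm: no subsampling and no path-length normalization) simply counts random splits until a chosen point is isolated:

```python
# Toy isolation: count random axis-aligned splits until a target point is alone
import numpy as np

def isolation_path_length(X, point_idx, rng):
    idx = np.arange(len(X))  # indices still grouped with the target point
    depth = 0
    while len(idx) > 1:
        feature = rng.integers(X.shape[1])         # step 1: random feature
        col = X[idx, feature]
        split = rng.uniform(col.min(), col.max())  # step 1: random split value
        # step 2: keep only the side of the split containing the target point
        keep = col < split if X[point_idx, feature] < split else col >= split
        idx = idx[keep]
        depth += 1                                 # step 3: one more split used
    return depth                                   # path length

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[8.0, 8.0]]])  # last row: outlier

# step 4: average over many random "trees"
inlier_depth = np.mean([isolation_path_length(X, 0, rng) for _ in range(50)])
outlier_depth = np.mean([isolation_path_length(X, 200, rng) for _ in range(50)])
print(f"Inlier average path length:  {inlier_depth:.1f}")
print(f"Outlier average path length: {outlier_depth:.1f}  (shorter = more anomalous)")
```

The far-away outlier is consistently separated in far fewer splits than a point inside the cluster, which is exactly the signal Isolation Forest averages over many trees.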
Why Isolation Forest is So Popular
Blazing Fast
No distance calculations needed! Time complexity is O(n log n) for training, making it suitable for millions of samples. Most other methods are O(n²) or worse.
Handles High Dimensions
Unlike distance-based methods that suffer from the "curse of dimensionality," Isolation Forest works well even with hundreds of features because it uses random subspace sampling.
Low Memory Footprint
The algorithm uses subsampling (default 256 samples per tree), so even huge datasets don't blow up memory. You can tune this with max_samples.
Few Hyperparameters
Just n_estimators (number of trees) and contamination (expected anomaly fraction). Default values work well for most cases!
Key Parameters:
- n_estimators: Number of isolation trees (default 100). More trees = more stable results but slower.
- contamination: Expected proportion of anomalies (e.g., 0.05 for 5%). Sets the decision threshold.
- max_samples: Number of samples per tree (default 'auto' = min(256, n_samples)). Smaller = faster but potentially less accurate.
- max_features: Features to consider per split (default 1.0 = all). Lower values add more randomness.
Watch Out: Isolation Forest can struggle with "local" anomalies - points that are anomalous relative to their local neighborhood but not globally. For varying-density clusters, consider LOF instead. Also, it may not work well when anomalies form their own cluster (they might not be easy to isolate!).
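For the local-anomaly case described above, scikit-learn provides LocalOutlierFactor; here is a minimal sketch (the cluster layout is an illustrative assumption):

```python
# LOF flags points that are sparse relative to their own neighborhood,
# even if they are not extreme globally
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

np.random.seed(42)
tight = np.random.randn(100, 2) * 0.1             # dense cluster at the origin
loose = np.random.randn(100, 2) * 2.0 + [10, 10]  # sparse cluster far away
local_anomaly = np.array([[1.0, 1.0]])            # close to the tight cluster,
                                                  # but far for its density
X = np.vstack([tight, loose, local_anomaly])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
preds = lof.fit_predict(X)  # -1 = anomaly, 1 = normal

print(f"Points flagged: {(preds == -1).sum()}")
print(f"Local anomaly flagged: {preds[-1] == -1}")
```

A global method could easily miss the point at (1, 1): it sits between the clusters and is not extreme overall, but LOF compares its density to that of its dense neighbors and flags it.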
# Basic Isolation Forest
from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_blobs
import numpy as np
# Create data with anomalies
np.random.seed(42)
X_normal, _ = make_blobs(n_samples=300, centers=2, cluster_std=0.5)
X_anomalies = np.random.uniform(low=-6, high=6, size=(15, 2))
X = np.vstack([X_normal, X_anomalies])
# Fit Isolation Forest
iso_forest = IsolationForest(
    n_estimators=100,    # Number of trees
    contamination=0.05,  # Expected % of anomalies
    random_state=42
)
predictions = iso_forest.fit_predict(X)
# -1 = anomaly, 1 = normal
n_anomalies = (predictions == -1).sum()
print(f"Anomalies detected: {n_anomalies}")
print(f"Normal points: {(predictions == 1).sum()}")
Scikit-learn's IsolationForest makes anomaly detection straightforward. The n_estimators=100 parameter sets the number of isolation trees in the forest (more trees = more stable results, but slower). The contamination=0.05 is crucial: it tells the algorithm to expect about 5% of your data to be anomalies. This parameter determines the decision threshold for the anomaly scores. The fit_predict() method trains the model and returns predictions in one step. Unlike most scikit-learn classifiers that return 0 or 1, Isolation Forest returns -1 for anomalies and 1 for normal points. This convention can be confusing at first! Under the hood, the algorithm builds 100 random binary trees, and for each point, it measures how quickly the point gets isolated (shorter path = more anomalous). Points that consistently get isolated early across all trees receive low scores and are flagged as anomalies.
Understanding Anomaly Scores
# Analyzing Anomaly Scores
from sklearn.ensemble import IsolationForest
import numpy as np
# Same setup
np.random.seed(42)
X_normal = np.random.randn(100, 2) # Normal: centered at origin
X_anomalies = np.array([[5, 5], [-5, 5], [5, -5], [-5, -5]]) # Corner anomalies
X = np.vstack([X_normal, X_anomalies])
# Get anomaly scores (lower = more anomalous)
iso_forest = IsolationForest(contamination=0.05, random_state=42)
iso_forest.fit(X)
scores = iso_forest.decision_function(X)
print("Anomaly Scores (lower = more anomalous):")
print(f" Normal data mean score: {scores[:100].mean():.3f}")
print(f" Anomaly scores: {scores[-4:]}")
print(f" Score range: [{scores.min():.3f}, {scores.max():.3f}]")
The decision_function() method returns raw anomaly scores instead of just predictions. Lower (more negative) scores indicate more anomalous points, while higher (more positive) scores indicate normal points. The threshold for classification is set based on the contamination parameter. Understanding these scores is useful because sometimes you want more nuance than a binary normal/anomaly label. For instance, you might want to rank all transactions by suspiciousness rather than just flagging the top 5%. You could also use custom thresholds different from what contamination implies. In production systems, you might alert differently based on score severity: mild anomalies get logged, severe anomalies trigger immediate investigation.
Tuning Contamination Parameter
# Effect of Contamination Parameter
from sklearn.ensemble import IsolationForest
import numpy as np
np.random.seed(42)
X_normal = np.random.randn(200, 2)
X_anomalies = np.random.uniform(-5, 5, (10, 2))
X = np.vstack([X_normal, X_anomalies])
print("Effect of contamination parameter:")
print("-" * 40)
for cont in [0.01, 0.05, 0.10, 0.20]:
    iso = IsolationForest(contamination=cont, random_state=42)
    preds = iso.fit_predict(X)
    n_flagged = (preds == -1).sum()
    print(f"contamination={cont:.2f}: {n_flagged:3d} flagged ({n_flagged/len(X)*100:.1f}%)")
The contamination parameter has a direct effect on how many points get flagged as anomalies. Setting contamination=0.01 means "I expect only 1% of my data to be anomalous," so only the most extreme outliers get flagged. Setting contamination=0.20 means "I expect 20% to be anomalous," which will flag many more points, potentially including some normal ones. Choosing the right value requires domain knowledge: for credit card fraud, you might use 0.001 (0.1%) because fraud is rare. For manufacturing defects, you might use 0.05 (5%) if quality issues are more common. If you don't know the true contamination rate, you can set contamination='auto' which uses a heuristic threshold, though this may not be optimal for your specific use case.
Practice Questions: Isolation Forest
Test your understanding with these coding challenges.
Task: Apply Isolation Forest to the Iris dataset and find how many samples are flagged as anomalies.
Show Solution
from sklearn.ensemble import IsolationForest
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
iso = IsolationForest(contamination=0.1, random_state=42)
predictions = iso.fit_predict(X)
print(f"Total samples: {len(X)}")
print(f"Normal: {(predictions == 1).sum()}")
print(f"Anomalies: {(predictions == -1).sum()}")
Task: Use anomaly scores to find the top 5 most anomalous points in a dataset.
Show Solution
from sklearn.ensemble import IsolationForest
import numpy as np
np.random.seed(42)
X = np.vstack([np.random.randn(100, 2), np.random.uniform(-5, 5, (5, 2))])
iso = IsolationForest(random_state=42)
iso.fit(X)
scores = iso.decision_function(X)
# Get indices of 5 lowest scores
top_5_idx = np.argsort(scores)[:5]
print("Top 5 most anomalous points:")
for i, idx in enumerate(top_5_idx):
    print(f" {i+1}. Index {idx}: score={scores[idx]:.3f}, point={X[idx]}")
Task: Compare results with different numbers of trees (50, 100, 200, 500) and measure consistency.
Show Solution
from sklearn.ensemble import IsolationForest
import numpy as np
np.random.seed(42)
X = np.vstack([np.random.randn(500, 5), np.random.uniform(-4, 4, (25, 5))])
results = {}
for n_trees in [50, 100, 200, 500]:
    iso = IsolationForest(n_estimators=n_trees, contamination=0.05, random_state=42)
    preds = iso.fit_predict(X)
    results[n_trees] = set(np.where(preds == -1)[0])
    print(f"n_estimators={n_trees}: {(preds == -1).sum()} anomalies")
# Check consistency between different settings
print(f"\nCommon anomalies (all agree): {len(results[50] & results[100] & results[200] & results[500])}")
Task: Use decision_function scores to create a custom threshold instead of using contamination.
Show Solution
from sklearn.ensemble import IsolationForest
import numpy as np
np.random.seed(42)
X = np.vstack([np.random.randn(200, 2), np.random.uniform(-4, 4, (10, 2))])
# Train without contamination (use auto)
iso = IsolationForest(contamination='auto', random_state=42)
iso.fit(X)
scores = iso.decision_function(X)
# Create custom thresholds based on score percentiles
print("Custom threshold results:")
for percentile in [1, 5, 10]:
    threshold = np.percentile(scores, percentile)
    anomalies = scores < threshold
    print(f" Bottom {percentile}% (threshold={threshold:.3f}): {anomalies.sum()} anomalies")
# Or use a fixed score threshold
fixed_threshold = -0.1
print(f"\nFixed threshold {fixed_threshold}: {(scores < fixed_threshold).sum()} anomalies")
Task: Evaluate Isolation Forest using precision, recall, and F1-score on a dataset with known labels.
Show Solution
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
import numpy as np
np.random.seed(42)
# Create labeled data
X_normal = np.random.randn(200, 3)
X_anomaly = np.random.uniform(-4, 4, (20, 3))
X = np.vstack([X_normal, X_anomaly])
y_true = np.array([1]*200 + [-1]*20) # 1=normal, -1=anomaly
# Detect anomalies
iso = IsolationForest(contamination=0.1, random_state=42)
y_pred = iso.fit_predict(X)
# Calculate metrics (anomaly is positive class)
print("Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))
print(f"\nPrecision: {precision_score(y_true, y_pred, pos_label=-1):.3f}")
print(f"Recall: {recall_score(y_true, y_pred, pos_label=-1):.3f}")
print(f"F1-Score: {f1_score(y_true, y_pred, pos_label=-1):.3f}")
One-Class SVM
One-Class SVM learns a decision boundary that encompasses "normal" data. Any new point falling outside this boundary is considered an anomaly. It's particularly good at novelty detection - finding new, previously unseen types of anomalies.
How One-Class SVM Works
One-Class SVM (Support Vector Machine) takes a fundamentally different approach to anomaly detection. Instead of learning from both normal and anomalous examples, it learns ONLY from normal data. The algorithm finds the smallest possible "envelope" or boundary that contains all (or most) of the normal training data. Any new point falling outside this boundary is flagged as a novelty (anomaly).
Imagine drawing the tightest possible fence around a flock of sheep in a field. The fence should be as small as possible while still containing all the sheep inside. Once the fence is built, any animal that appears outside it is probably not a sheep - it could be a wolf, a deer, or something entirely unexpected. That's One-Class SVM: it learns what "inside the fence" looks like, and anything outside is suspicious.
The Technical Details
One-Class SVM works by mapping data to a high-dimensional feature space (using a kernel function) and then finding a hyperplane that separates the data from the origin with maximum margin. The RBF (Radial Basis Function) kernel is most commonly used because it can create non-linear, flexible boundaries that wrap around complex data distributions. The algorithm finds "support vectors" - the critical points that define the boundary.
When One-Class SVM Shines
- Clean training data available: You have a dataset of only normal examples
- Novelty detection: You want to detect NEW types of anomalies never seen before
- Complex boundaries: The boundary between normal and abnormal isn't a simple shape
- High-dimensional data: Works well when features outnumber samples
- Small to medium datasets: Best for under 10,000 training samples
Limitations to Consider
- Slow training: O(n²) to O(n³) complexity makes it impractical for large datasets
- Memory intensive: Stores all support vectors, which can be many
- Parameter sensitivity: Results depend heavily on nu and gamma tuning
- Scaling required: Features must be normalized for good performance
- No probability estimates: Returns -1 or 1, not confidence scores
Key Parameters Explained
nu (0 to 1)
Upper bound on the fraction of training errors and lower bound on the fraction of support vectors. Think of it as "how tight is the fence?" Higher nu = tighter boundary around the densest data = more points flagged (more false positives). Lower nu = looser boundary = fewer points flagged (more false negatives).
kernel
The function that transforms data. 'rbf' (default) is best for most cases. 'linear' is faster but can only create linear boundaries. 'poly' for polynomial boundaries.
gamma
RBF kernel width. 'scale' (default) auto-adjusts. Higher gamma = more complex boundary (risk of overfitting). Lower gamma = smoother boundary.
Novelty vs Outlier Detection: One-Class SVM is designed for novelty detection - training on clean data, then detecting new anomalies. This is different from outlier detection (like Isolation Forest) which can handle contaminated training data. If your training data contains anomalies, use Isolation Forest or LOF instead!
# One-Class SVM for Anomaly Detection
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
import numpy as np
# Generate normal training data
np.random.seed(42)
X_train = np.random.randn(200, 2) # Only normal data for training
# Test data with some anomalies
X_test_normal = np.random.randn(50, 2)
X_test_anomaly = np.random.uniform(-5, 5, (10, 2))
X_test = np.vstack([X_test_normal, X_test_anomaly])
# Scale the data (important for SVM!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train One-Class SVM
ocsvm = OneClassSVM(kernel='rbf', gamma='scale', nu=0.1)
ocsvm.fit(X_train_scaled)
# Predict on test data
predictions = ocsvm.predict(X_test_scaled)
print(f"Normal detected: {(predictions == 1).sum()}")
print(f"Anomalies detected: {(predictions == -1).sum()}")
One-Class SVM is designed for novelty detection, where you train only on normal data and then detect new, unseen anomalies. This is different from Isolation Forest which can handle mixed data during training. We use StandardScaler because SVMs are sensitive to feature scales. The kernel='rbf' (Radial Basis Function) allows for non-linear decision boundaries, which is usually what you want. The gamma='scale' automatically adjusts based on feature count and variance. The nu=0.1 parameter is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. Think of it like contamination but for the training phase. After training on only normal data, we apply predict() to new test data, and the model flags points that don't fit the learned "normal" pattern as -1 (anomaly).
Tuning the Nu Parameter
# Effect of Nu Parameter
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
import numpy as np
np.random.seed(42)
X_train = np.random.randn(300, 2)
X_test = np.vstack([np.random.randn(100, 2), np.random.uniform(-4, 4, (20, 2))])
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
print("Effect of nu parameter:")
print("-" * 40)
for nu in [0.01, 0.05, 0.10, 0.20]:
    ocsvm = OneClassSVM(kernel='rbf', gamma='scale', nu=nu)
    ocsvm.fit(X_train_s)
    preds = ocsvm.predict(X_test_s)
    n_flagged = (preds == -1).sum()
    print(f"nu={nu:.2f}: {n_flagged:3d} anomalies ({n_flagged/len(X_test)*100:.1f}%)")
The nu parameter controls how tight the decision boundary is drawn around the normal data. Lower values (like 0.01) allow almost no training errors, so the boundary expands to enclose nearly all the training data and flags fewer test points as anomalies. Higher values (like 0.20) tolerate more training errors, producing a tighter boundary around the densest data that flags more points. Formally, nu is an upper bound on the fraction of margin errors (training points treated as outliers) and a lower bound on the fraction of support vectors. In practice, tune nu based on your domain knowledge of what fraction of your data you expect to be anomalous. Unlike Isolation Forest's contamination, which only shifts the decision threshold, nu affects the actual model training, so changing it requires retraining.
Practice Questions: One-Class SVM
Test your understanding with these coding challenges.
Task: Train on only digit 0 and see if other digits are detected as anomalies.
Show Solution
from sklearn.svm import OneClassSVM
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
digits = load_digits()
X, y = digits.data, digits.target
# Train only on digit 0
X_train = X[y == 0]
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
ocsvm = OneClassSVM(nu=0.1, kernel='rbf')
ocsvm.fit(X_train_s)
# Test on all digits
X_test_s = scaler.transform(X)
preds = ocsvm.predict(X_test_s)
print(f"Digit 0 flagged as anomaly: {(preds[y==0] == -1).sum()}/{(y==0).sum()}")
print(f"Other digits as anomaly: {(preds[y!=0] == -1).sum()}/{(y!=0).sum()}")
Task: Compare One-Class SVM with linear and RBF kernels on the same data.
Show Solution
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
import numpy as np
np.random.seed(42)
X_train = np.random.randn(200, 2)
X_test = np.vstack([np.random.randn(80, 2), np.random.uniform(-4, 4, (20, 2))])
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
for kernel in ['linear', 'rbf']:
    ocsvm = OneClassSVM(kernel=kernel, nu=0.1)
    ocsvm.fit(X_train_s)
    preds = ocsvm.predict(X_test_s)
    print(f"{kernel:6s} kernel: {(preds == -1).sum()} anomalies detected")
Task: Experiment with different gamma values and observe their effect on anomaly detection.
Show Solution
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
import numpy as np
np.random.seed(42)
X_train = np.random.randn(200, 2)
X_test = np.vstack([np.random.randn(50, 2), np.random.uniform(-3, 3, (10, 2))])
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
print("Effect of gamma on RBF kernel:")
for gamma in [0.01, 0.1, 1.0, 10.0, 'scale']:
    ocsvm = OneClassSVM(kernel='rbf', gamma=gamma, nu=0.1)
    ocsvm.fit(X_train_s)
    preds = ocsvm.predict(X_test_s)
    n_sv = len(ocsvm.support_)
    print(f" gamma={str(gamma):6s}: {(preds == -1).sum()} anomalies, {n_sv} support vectors")
Task: Use decision_function to get continuous scores and rank test samples by anomaly likelihood.
Show Solution
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
import numpy as np
np.random.seed(42)
X_train = np.random.randn(200, 2)
X_test = np.vstack([
np.random.randn(10, 2), # Normal
np.array([[3, 3], [4, 4], [-3, -3]]) # Anomalies
])
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
ocsvm = OneClassSVM(kernel='rbf', nu=0.1)
ocsvm.fit(X_train_s)
# Get decision scores (negative = more anomalous)
scores = ocsvm.decision_function(X_test_s)
# Rank by anomaly likelihood
ranking = np.argsort(scores)
print("Test samples ranked by anomaly likelihood (most anomalous first):")
for rank, idx in enumerate(ranking[:5]):
    print(f" {rank+1}. Index {idx}: score={scores[idx]:.3f}, point={X_test[idx]}")
Task: Implement a simple validation strategy for One-Class SVM using holdout normal data.
Show Solution
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
np.random.seed(42)
# Only normal data for training
X_normal = np.random.randn(300, 3)
X_train, X_val = train_test_split(X_normal, test_size=0.2, random_state=42)
# Anomalies for testing
X_anomaly = np.random.uniform(-4, 4, (30, 3))
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)
X_anomaly_s = scaler.transform(X_anomaly)
# Try different nu values
print("Nu tuning (lower false positive rate on validation is better):")
for nu in [0.01, 0.05, 0.1, 0.2]:
    ocsvm = OneClassSVM(kernel='rbf', nu=nu)
    ocsvm.fit(X_train_s)
    val_fp = (ocsvm.predict(X_val_s) == -1).sum() / len(X_val_s)  # False positive rate
    test_tp = (ocsvm.predict(X_anomaly_s) == -1).sum() / len(X_anomaly_s)  # True positive rate
    print(f" nu={nu:.2f}: Val FP rate={val_fp:.1%}, Test TP rate={test_tp:.1%}")
Local Outlier Factor (LOF)
LOF takes a different approach: instead of asking "is this point far from everything?", it asks "is this point in a less dense neighborhood than its neighbors?" This makes LOF excellent at detecting anomalies in datasets with clusters of varying densities.
How LOF Works
Local Outlier Factor (LOF) takes a fundamentally local approach to anomaly detection. Instead of asking "is this point far from the global center?", it asks "is this point in a less dense neighborhood compared to its neighbors?" This makes LOF uniquely powerful for datasets where different regions have different natural densities.
Think of it as the "lonely in a crowd" detector. Imagine a city with a dense downtown and sparse suburbs. A person standing alone in the suburbs isn't suspicious - that's normal for suburbs. But a person standing completely alone in the middle of Times Square? That's weird! LOF captures this intuition by comparing local densities rather than using a global threshold.
The Algorithm Explained
Step 1: Find k-Nearest Neighbors
For each point, identify its k closest neighbors using Euclidean distance (or another metric). The parameter n_neighbors controls k.
Step 2: Calculate Reachability Distance
Measure how "reachable" each point is from its neighbors. Points in dense areas have small reachability distances.
Step 3: Compute Local Reachability Density (LRD)
Calculate each point's local density as the inverse of the average reachability distance to its neighbors.
Step 4: Compute LOF Score
Compare each point's LRD to its neighbors' LRDs. If a point's density is much lower than its neighbors', LOF > 1 (anomaly!).
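The four steps above can be sketched on a tiny dataset. This is a simplified illustration that substitutes plain k-nearest-neighbor distances for the exact reachability distance of step 2, so the scores only approximate true LOF values:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four tightly clustered points plus one isolated point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 3.0]])
k = 3

# Step 1: find each point's k nearest neighbors
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each point is its own neighbor
dists, idx = nn.kneighbors(X)
dists, idx = dists[:, 1:], idx[:, 1:]  # drop self

# Steps 2-3 (simplified): local density as inverse of mean neighbor distance
lrd = 1.0 / dists.mean(axis=1)

# Step 4: LOF = average density of neighbors / own density
lof = lrd[idx].mean(axis=1) / lrd
for point, score in zip(X, lof):
    print(f"{point}: LOF ~ {score:.2f}")
```

The clustered points have neighbors with the same density as themselves, so their scores come out at 1.0, while the isolated point's score is far above 1: it sits in a much sparser region than its neighbors.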
Interpreting LOF Scores
Normal Point
The point has similar local density to its neighbors. It "fits in" with its local neighborhood perfectly.
Mildly Suspicious
The point is somewhat sparser than neighbors. Could be on the edge of a cluster or a mild outlier.
Strong Anomaly
The point is in a much sparser region than its neighbors. High confidence anomaly - investigate!
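In scikit-learn these scores are exposed, negated, as negative_outlier_factor_; flipping the sign recovers the positive LOF values described above. A small sketch with illustrative cutoffs (the 1.5 and 2.0 thresholds here are this example's assumptions, not a standard):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X = np.vstack([rng.standard_normal((100, 2)), [[5.0, 5.0]]])  # one planted anomaly

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)
scores = -lof.negative_outlier_factor_  # flip sign: now ~1 = normal, larger = sparser

# Bucket the three highest scores using illustrative thresholds
for score in sorted(scores)[-3:]:
    if score < 1.5:
        label = "normal"
    elif score < 2.0:
        label = "mildly suspicious"
    else:
        label = "strong anomaly"
    print(f"LOF = {score:.2f} -> {label}")
```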
LOF vs Global Methods
LOF Advantages
- Handles clusters with different densities
- Detects local anomalies missed by global methods
- Provides interpretable scores (how anomalous?)
- No assumption about data distribution
LOF Limitations
- O(n²) complexity - slow for large datasets
- Sensitive to n_neighbors choice
- Struggles in high dimensions (distance becomes meaningless)
- Memory intensive (stores all pairwise distances)
Choosing n_neighbors: Too small (< 10) makes LOF noisy and sensitive to local fluctuations. Too large (> 50) makes it behave like a global method. A good rule of thumb: n_neighbors should be larger than the minimum number of points a cluster must contain (so a point is compared against a full cluster) but smaller than the maximum number of nearby points that could themselves be outliers. Values between 15 and 30 work well for most datasets.
# Local Outlier Factor
from sklearn.neighbors import LocalOutlierFactor
import numpy as np
# Create clusters with different densities + anomalies
np.random.seed(42)
cluster1 = np.random.randn(100, 2) * 0.5 # Dense cluster
cluster2 = np.random.randn(100, 2) * 2 + [5, 5] # Sparse cluster
anomalies = np.array([[2, 2], [3, 0], [0, 4]])
X = np.vstack([cluster1, cluster2, anomalies])
# Apply LOF
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
predictions = lof.fit_predict(X)
print(f"Normal points: {(predictions == 1).sum()}")
print(f"Anomalies detected: {(predictions == -1).sum()}")
# LOF scores (negative = more anomalous)
scores = lof.negative_outlier_factor_
print(f"Score range: [{scores.min():.2f}, {scores.max():.2f}]")
LOF is a density-based method that shines when your data has clusters of different densities. In our example, we create a dense cluster (small standard deviation) and a sparse cluster (larger standard deviation), plus some isolated anomalies between them. The n_neighbors=20 parameter sets how many neighbors to consider when computing local density. Too few neighbors makes LOF noisy, too many makes it insensitive to local structure. The contamination=0.05 works similarly to Isolation Forest. The negative_outlier_factor_ attribute gives raw LOF scores. These are negative (by convention) and closer to -1 means normal, while much more negative (like -3 or -5) indicates anomalies. Unlike global methods, LOF can correctly identify points that are far from a dense cluster even if they're close to a sparse cluster, because it compares local densities rather than absolute distances.
The n_neighbors Parameter
# Effect of n_neighbors on LOF
from sklearn.neighbors import LocalOutlierFactor
import numpy as np
np.random.seed(42)
X = np.vstack([
np.random.randn(200, 2),
np.random.uniform(-5, 5, (10, 2))
])
print("Effect of n_neighbors:")
print("-" * 40)
for k in [5, 10, 20, 50]:
    lof = LocalOutlierFactor(n_neighbors=k, contamination=0.05)
    preds = lof.fit_predict(X)
    n_anomalies = (preds == -1).sum()
    print(f"n_neighbors={k:2d}: {n_anomalies} anomalies detected")
The n_neighbors parameter is crucial for LOF's performance. With too few neighbors (like 5), the algorithm becomes sensitive to noise because it looks at only a tiny local neighborhood. With too many neighbors (like 50+), it behaves more like a global method and loses its ability to detect anomalies in varying-density clusters. A good rule of thumb: choose n_neighbors larger than the minimum number of points a cluster must contain (so legitimate small clusters aren't flagged wholesale) but smaller than the maximum number of nearby points that could themselves be outliers. For most datasets, values between 10 and 30 work well. If you're unsure, try a few values and see which gives results that match your domain knowledge.
Novelty Detection with LOF
# LOF for Novelty Detection
from sklearn.neighbors import LocalOutlierFactor
import numpy as np
np.random.seed(42)
# Train on clean data only
X_train = np.random.randn(200, 2)
# Test on new data with potential anomalies
X_test = np.vstack([
np.random.randn(50, 2), # Normal
np.array([[4, 4], [-4, 4], [4, -4]]) # Anomalies
])
# Use novelty=True for new data prediction
lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(X_train)
# Predict on test data
predictions = lof.predict(X_test)
print(f"Test normal: {(predictions == 1).sum()}")
print(f"Test anomalies: {(predictions == -1).sum()}")
By default, LOF only works with fit_predict() which requires all data at once. But setting novelty=True enables novelty detection mode, where you can train on clean data and then call predict() on new, unseen samples. This is similar to how One-Class SVM works. In novelty mode, the LOF model learns the density patterns of the training data, then any new point that falls into a region sparser than what was seen during training gets flagged as a novelty (anomaly). This is useful in production scenarios where you have a baseline of normal behavior and want to detect when new, unusual patterns appear. Remember: in novelty mode, your training data should be "clean" (contain only normal samples), otherwise the model will learn to accept anomalies as normal.
Practice Questions: LOF
Test your understanding with these coding challenges.
Task: Find anomalies in the breast cancer dataset using LOF.
Show Solution
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
cancer = load_breast_cancer()
X = StandardScaler().fit_transform(cancer.data)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
predictions = lof.fit_predict(X)
print(f"Total samples: {len(X)}")
print(f"Normal: {(predictions == 1).sum()}")
print(f"Anomalies: {(predictions == -1).sum()}")
Task: Rank all points by their LOF score and print the 5 most anomalous.
Show Solution
from sklearn.neighbors import LocalOutlierFactor
import numpy as np
np.random.seed(42)
X = np.vstack([np.random.randn(100, 2), np.random.uniform(-4, 4, (10, 2))])
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)
scores = lof.negative_outlier_factor_
# Lower (more negative) = more anomalous
top_5 = np.argsort(scores)[:5]
print("Top 5 anomalies:")
for i, idx in enumerate(top_5):
    print(f" {i+1}. Index {idx}: LOF score = {-scores[idx]:.2f}")
Task: Compare LOF results with different n_neighbors values (5, 10, 20, 50).
Show Solution
from sklearn.neighbors import LocalOutlierFactor
import numpy as np
np.random.seed(42)
# Two clusters with different densities
cluster1 = np.random.randn(100, 2) * 0.5 # Dense
cluster2 = np.random.randn(100, 2) * 2 + [4, 4] # Sparse
outliers = np.array([[2, 2], [0, 4], [4, 0]]) # Between clusters
X = np.vstack([cluster1, cluster2, outliers])
print("Effect of n_neighbors on varying-density data:")
for k in [5, 10, 20, 50]:
    lof = LocalOutlierFactor(n_neighbors=k, contamination=0.02)
    preds = lof.fit_predict(X)
    print(f" k={k:2d}: {(preds == -1).sum()} anomalies detected")
Task: Use LOF in novelty detection mode to train on clean data and detect anomalies in new data.
Show Solution
from sklearn.neighbors import LocalOutlierFactor
import numpy as np
np.random.seed(42)
# Clean training data
X_train = np.random.randn(200, 2)
# Test data with mix of normal and anomalies
X_test_normal = np.random.randn(50, 2)
X_test_anomaly = np.array([[4, 4], [-4, 4], [4, -4], [-4, -4], [0, 5]])
X_test = np.vstack([X_test_normal, X_test_anomaly])
# LOF in novelty mode
lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(X_train)
# Predict on new data
preds = lof.predict(X_test)
scores = lof.decision_function(X_test)
print(f"Normal test points flagged as anomaly: {(preds[:50] == -1).sum()}/50")
print(f"True anomalies detected: {(preds[50:] == -1).sum()}/{len(X_test_anomaly)}")
print(f"Anomaly scores for true anomalies: {scores[50:]}")
Task: Compare LOF using Euclidean, Manhattan, and Cosine distance metrics.
Show Solution
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
import numpy as np
np.random.seed(42)
# High-dimensional data
X_normal = np.random.randn(200, 10)
X_anomaly = np.random.uniform(-3, 3, (10, 10))
X = np.vstack([X_normal, X_anomaly])
y_true = np.array([1]*200 + [-1]*10)
# Scale for fair comparison
X_scaled = StandardScaler().fit_transform(X)
print("LOF with different distance metrics:")
for metric in ['euclidean', 'manhattan', 'cosine']:
    lof = LocalOutlierFactor(n_neighbors=20, metric=metric, contamination=0.05)
    preds = lof.fit_predict(X_scaled)
    # Calculate true and false positives
    tp = ((preds == -1) & (y_true == -1)).sum()
    fp = ((preds == -1) & (y_true == 1)).sum()
    print(f" {metric:10s}: TP={tp}/10, FP={fp}")
Comparison and Applications
Each anomaly detection method has its strengths. The right choice depends on your data size, dimensionality, whether you have clean training data, and what type of anomalies you expect. There's no single "best" method - the optimal choice depends entirely on your specific use case.
In practice, you'll often want to try multiple methods and compare their results. Different algorithms make different assumptions about what constitutes an anomaly, so they may flag different points. Points flagged by multiple methods are often the most confident anomalies, while points flagged by only one method may warrant further investigation. Below, we compare the methods we've learned and provide guidance on when to use each one.
| Method | Best For | Speed | Scalability |
|---|---|---|---|
| Z-Score / IQR | Simple univariate data, quick analysis | Very Fast | Excellent |
| Isolation Forest | General purpose, high dimensions, large datasets | Fast | Excellent |
| One-Class SVM | Novelty detection, clean training data available | Slow | Poor (< 10k) |
| LOF | Varying density clusters, local anomalies | Medium | Medium |
Quick Decision Guide
Ask yourself these questions to choose the right method:
1. How large is your dataset?
- < 1K samples: Any method works
- 1K - 100K: Isolation Forest or LOF
- > 100K: Isolation Forest only
2. Do you have clean training data?
- Yes: One-Class SVM or LOF (novelty=True)
- No: Isolation Forest or LOF
3. What's your data structure?
- Univariate: Z-score or IQR
- Varying density clusters: LOF
- High-dimensional (100+): Isolation Forest or SVM
4. What matters most?
- Speed: Isolation Forest
- Local accuracy: LOF
- Novel patterns: One-Class SVM
- Explainability: Statistical methods
Unsure? Start with IsolationForest — it's the best general-purpose choice!
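The decision guide could be encoded as a small helper. This choose_detector function is hypothetical, a rough heuristic built from the rules above rather than any library API:

```python
def choose_detector(n_samples, clean_training_data, varying_density, univariate=False):
    """Suggest an anomaly detection method (heuristic sketch, not a library API)."""
    if univariate:
        return "Z-score / IQR"          # simple data: statistical methods first
    if n_samples > 100_000:
        return "Isolation Forest"       # only method that scales this far
    if clean_training_data:
        # Novelty detection: SVM for small data, LOF in novelty mode otherwise
        return "One-Class SVM" if n_samples < 10_000 else "LOF (novelty=True)"
    if varying_density:
        return "LOF"                    # local densities matter
    return "Isolation Forest"           # best general-purpose default

print(choose_detector(n_samples=500, clean_training_data=True, varying_density=False))
print(choose_detector(n_samples=50_000, clean_training_data=False, varying_density=True))
```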
# Step 1: Setup comparison experiment
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
import numpy as np
import time
# Create test data with known anomalies
np.random.seed(42)
X_normal = np.random.randn(500, 10)
X_anomaly = np.random.uniform(-4, 4, (25, 10))
X = np.vstack([X_normal, X_anomaly])
y_true = np.array([1]*500 + [-1]*25) # 1=normal, -1=anomaly
# Scale data
X_scaled = StandardScaler().fit_transform(X)
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Anomaly ratio: {25/525*100:.1f}%")
We set up a fair comparison by creating a 10-dimensional dataset with 500 normal samples (from a standard normal distribution) and 25 anomalies (uniformly distributed, not following the same pattern). The labels use the scikit-learn convention where 1 means normal and -1 means anomaly. We'll compare all three methods on this exact same data to see which performs best. Data scaling with StandardScaler is important for One-Class SVM, and it doesn't hurt the other methods. The 5% anomaly ratio is realistic for many real-world scenarios.
# Step 2: Compare all methods
methods = {
'Isolation Forest': IsolationForest(contamination=0.05, random_state=42),
'One-Class SVM': OneClassSVM(nu=0.05, kernel='rbf'),
'LOF': LocalOutlierFactor(n_neighbors=20, contamination=0.05)
}
print("\nMethod Comparison:")
print("-" * 55)
print(f"{'Method':<20} {'F1 Score':>10} {'Time (ms)':>12} {'Detected':>10}")
print("-" * 55)
for name, model in methods.items():
    start = time.time()
    if name == 'LOF':
        preds = model.fit_predict(X_scaled)
    else:
        preds = model.fit(X_scaled).predict(X_scaled)
    elapsed = (time.time() - start) * 1000
    f1 = f1_score(y_true, preds, pos_label=-1)
    n_detected = (preds == -1).sum()
    print(f"{name:<20} {f1:>10.3f} {elapsed:>12.1f} {n_detected:>10}")
We compare all three methods using F1 score as our metric (the harmonic mean of precision and recall, perfect for imbalanced data like anomaly detection). We set comparable parameters: contamination=0.05 for Isolation Forest and LOF, and nu=0.05 for One-Class SVM. Note the different calling conventions: LOF uses fit_predict() in one step, while the others use fit().predict(). We measure execution time in milliseconds to highlight speed differences. In practice, you'll typically see Isolation Forest being fastest while maintaining good accuracy, One-Class SVM being slowest but potentially most accurate for novelty detection, and LOF falling in between. The "Detected" column shows how many anomalies each method flagged, ideally close to the true 25.
# Step 3: Ensemble approach - combine multiple detectors
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
import numpy as np
np.random.seed(42)
X = np.vstack([np.random.randn(200, 5), np.random.uniform(-4, 4, (10, 5))])
# Get predictions from multiple methods
iso = IsolationForest(contamination=0.05, random_state=42)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
iso_pred = iso.fit_predict(X)
lof_pred = lof.fit_predict(X)
# Voting: anomaly if BOTH methods agree
ensemble_pred = np.where((iso_pred == -1) & (lof_pred == -1), -1, 1)
print(f"Isolation Forest anomalies: {(iso_pred == -1).sum()}")
print(f"LOF anomalies: {(lof_pred == -1).sum()}")
print(f"Ensemble (both agree): {(ensemble_pred == -1).sum()}")
A powerful technique is to combine multiple anomaly detectors into an ensemble. In this example, we use a strict voting rule: a point is only flagged as an anomaly if BOTH Isolation Forest and LOF agree. This reduces false positives (normal points incorrectly flagged as anomalies) at the cost of potentially missing some true anomalies. You could also use softer voting like "at least one method" for higher sensitivity, or "majority vote" if you have three or more methods. Another approach is to average the anomaly scores from different methods. Ensembles are particularly useful in production systems where the cost of a false positive is high, like blocking a legitimate transaction. By requiring agreement between methods that work differently (tree-based vs density-based), you're more confident that flagged points are truly anomalous.
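The score-averaging variant mentioned above can be sketched like this, rescaling each method's raw scores to [0, 1] before averaging (the min-max normalization and toy data are this example's illustrative choices):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X = np.vstack([rng.standard_normal((200, 2)), rng.uniform(-4, 4, (10, 2))])

iso = IsolationForest(random_state=42).fit(X)
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)

def to_unit(scores):
    """Min-max scale so that 1 = most anomalous (both methods: lower raw score = more anomalous)."""
    s = -np.asarray(scores)
    return (s - s.min()) / (s.max() - s.min())

# Average the normalized scores from the two detectors
combined = (to_unit(iso.score_samples(X)) + to_unit(lof.negative_outlier_factor_)) / 2
flagged = np.argsort(combined)[-10:]  # ten highest combined scores
print(f"Top combined anomaly score: {combined.max():.2f}")
print(f"Flagged indices: {sorted(flagged.tolist())}")
```

Averaging preserves how anomalous each detector considers a point, rather than collapsing to a hard -1/+1 vote, at the cost of needing a threshold on the combined score.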
Key Takeaways
Anomalies are Rare
By definition, anomalies make up a small fraction of data (typically less than 5%)
Statistical Methods First
Z-score and IQR are fast, interpretable, and great for simple univariate analysis
Isolation Forest Scales
Your go-to choice for large datasets and high dimensions with great speed
One-Class SVM for Novelty
Best when you have clean training data and want to detect new anomaly types
LOF for Density Variance
Excels when clusters have different densities where global methods fail
Ensemble for Confidence
Combine multiple methods to reduce false positives in critical applications
Knowledge Check
Test your understanding of anomaly detection:
What does Isolation Forest use to identify anomalies?
When is One-Class SVM preferred over Isolation Forest?
What does an LOF score of 1.0 indicate?
What is the contamination parameter in scikit-learn anomaly detectors?
Why might Z-score method fail to detect outliers in multivariate data?
Which method would work best for detecting anomalies in a dataset with clusters of varying densities?