The Accuracy Trap
Accuracy seems like the obvious choice for evaluating classifiers - it tells you what percentage of predictions are correct. But in many real-world scenarios, accuracy is dangerously misleading. When classes are imbalanced, a model can achieve high accuracy while being completely useless.
Think of it Like This (Airport Security Analogy)
Imagine an airport security system that checks for dangerous items. If 99.9% of passengers are safe, a "lazy" system that says "everyone is safe" would be 99.9% accurate! But it would miss every single dangerous person - making it completely useless. This is exactly what happens with accuracy in machine learning when your classes are imbalanced.
What is Accuracy?
Accuracy is the ratio of correct predictions to total predictions. It measures overall correctness but treats all errors equally, regardless of their real-world consequences.
Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
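As a quick numeric check (the counts below are made up for illustration), the formula can be computed directly:

```python
# Hypothetical confusion-matrix counts, chosen for illustration
tp, tn, fp, fn = 40, 50, 5, 5

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy: {accuracy:.0%}")  # 90%
```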
The Imbalanced Class Problem
Consider a fraud detection system where only 1% of transactions are fraudulent. A model that predicts "not fraud" for everything achieves 99% accuracy - but catches zero fraud! This is the accuracy paradox.
Visualizing the Problem
Your Dataset (1000 samples)
Lazy Model: "Everything is Normal"
- 990 correct (normal → normal)
- 10 missed (fraud → normal)
Accuracy: 99% but 0 frauds caught!
Step 1: Import Libraries
# The Accuracy Paradox - Imbalanced Classes
from sklearn.metrics import accuracy_score
import numpy as np
What we're importing:
- `accuracy_score` - Function to calculate accuracy (correct predictions / total predictions)
- `numpy` - Library for working with arrays and numerical operations
Step 2: Create an Imbalanced Dataset
# Step 1: Create our dataset
# We have 1000 transactions total
# - 990 are normal transactions (labeled as 0)
# - Only 10 are fraudulent (labeled as 1)
# This is a "class imbalance" - one class dominates!
y_true = np.array([0] * 990 + [1] * 10)
Understanding the data:
- `[0] * 990` - Creates a list with 990 zeros (normal transactions)
- `[1] * 10` - Creates a list with 10 ones (fraudulent transactions)
- `[0]*990 + [1]*10` - Combines both lists into one list of 1000 elements
- `np.array()` - Converts the list to a NumPy array for efficient operations
Class Distribution:
Class 0 (Normal): 990 samples (99%)
Class 1 (Fraud): 10 samples (1%)
This is a 99:1 class imbalance!
Step 3: Create a "Lazy" Model
# Step 2: Create a "lazy" model
# This useless model ALWAYS predicts "not fraud" (0)
# It never even looks at the data!
y_pred_useless = np.zeros(1000) # All zeros = all "not fraud"
What this "model" does:
- `np.zeros(1000)` - Creates an array of 1000 zeros
- This means: predict "Not Fraud" (0) for EVERY transaction
- The model doesn't analyze any features - it just says "everything is normal"
- This is what we call a "majority class classifier" or "lazy classifier"
Reality Check: This is NOT a real model - it's a demonstration of how a useless prediction strategy can still get high accuracy!
Step 4: Calculate Accuracy
# Step 3: Calculate accuracy
# accuracy = correct predictions / total predictions
accuracy = accuracy_score(y_true, y_pred_useless)
print(f"Accuracy: {accuracy:.1%}") # 99.0% - WOW, looks amazing!
How accuracy is calculated:
- `accuracy_score(y_true, y_pred)` - Compares actual vs predicted labels
- Formula: Accuracy = Correct Predictions / Total Predictions
- Our calculation: 990 correct (normal predicted as normal) + 0 correct (fraud predicted as fraud)
- Result: 990 / 1000 = 0.99 = 99%
Breakdown:
990 normal transactions predicted as normal = CORRECT
10 fraud transactions predicted as normal = WRONG
990 correct out of 1000 = 99% accuracy
Step 5: The Reality Check - How Many Frauds Caught?
# Step 4: But wait... how many frauds did we catch?
frauds_caught = int(sum(y_pred_useless[y_true == 1]))
print(f"Frauds detected: {frauds_caught} out of 10") # 0 - TERRIBLE!
# The model is USELESS despite 99% accuracy!
Understanding the code:
- `y_true == 1` - Creates a boolean mask: True where the actual label is 1 (fraud)
- `y_pred_useless[y_true == 1]` - Gets predictions only for the actual fraud cases
- `sum(...)` - Counts how many of those predictions were 1 (detected as fraud)
- Result: 0 frauds detected out of 10 actual frauds!
THE ACCURACY PARADOX:
- Accuracy: 99% (sounds amazing!)
- Frauds caught: 0/10 (completely useless!)
- This model would cost a company MILLIONS in undetected fraud!
When Accuracy Fails
- Imbalanced datasets: When one class dominates (fraud detection, disease diagnosis)
- Different error costs: When false positives and false negatives have different consequences
- Rare event prediction: When you care most about finding the minority class
Common Beginner Mistake
The Mistake: "My model has 98% accuracy, so it must be great!"
The Reality: Always check your class distribution first. If 98% of your data is one class, even a random model could get 98% accuracy.
The Fix: Always look at precision, recall, and confusion matrix alongside accuracy. These tell you how well your model performs on EACH class, not just overall.
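As a minimal sketch of that fix, sklearn's `classification_report` prints precision and recall per class. Run on the same lazy all-zeros setup as the fraud example above, it makes the failure obvious even though accuracy is 99%:

```python
from sklearn.metrics import classification_report
import numpy as np

# Same setup as the fraud example: 990 normal (0), 10 fraud (1)
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)  # lazy model: always "not fraud"

# Per-class precision/recall exposes what overall accuracy hides:
# recall for class 1 is 0.00 despite 99% accuracy
print(classification_report(y_true, y_pred, zero_division=0))
```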
Step 1: Import Libraries
# A slightly better model - but accuracy doesn't show it
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
What we're importing:
- `LogisticRegression` - A popular classification algorithm for binary problems
- `make_classification` - Function to generate synthetic classification datasets
- `train_test_split` - Splits data into training and testing sets
- `accuracy_score` - Calculates the accuracy metric
Step 2: Create Imbalanced Dataset
# Create imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
Understanding the parameters:
- `n_samples=1000` - Create 1000 total samples
- `weights=[0.95, 0.05]` - Class distribution: 95% class 0, 5% class 1 (imbalanced!)
- `n_features=10` - Each sample has 10 input features
- `random_state=42` - Ensures reproducibility (same data each time)
After train_test_split:
X_train, y_train: 750 samples (75%) - for training
X_test, y_test: 250 samples (25%) - for testing
Expected class distribution in the test set:
- Class 0: ~238 samples (95%)
- Class 1: ~12 samples (5%)
Step 3: Train the Model
# Train a real model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Training process:
- `LogisticRegression()` - Creates a logistic regression classifier object
- `model.fit(X_train, y_train)` - Trains the model by learning patterns from the training data
- `model.predict(X_test)` - Uses the trained model to predict labels for the test data
Output: y_pred contains 250 predictions (0 or 1) for each test sample
Step 4: Evaluate and Analyze Results
print(f"Accuracy: {accuracy_score(y_test, y_pred):.1%}")
print(f"Minority class in test: {sum(y_test)} samples")
print(f"Minority class predicted: {sum(y_pred)} samples")
Understanding the output:
- `accuracy_score(y_test, y_pred)` - Overall accuracy (typically high due to imbalance)
- `sum(y_test)` - Counts actual minority class samples (1s) in the test set
- `sum(y_pred)` - Counts how many minority class predictions the model made
What to look for:
If sum(y_pred) << sum(y_test): Model is UNDER-predicting the minority class (missing positives)
If sum(y_pred) >> sum(y_test): Model is OVER-predicting the minority class (too many false alarms)
Ideally: sum(y_pred) ≈ sum(y_test)
Key Insight: High accuracy alone doesn't tell you if the model is actually detecting the minority class. Always compare predicted vs actual minority class counts!
Practice Questions
Test your understanding with these coding challenges.
Task: Given true labels and predictions, calculate accuracy manually and with sklearn.
Show Solution
from sklearn.metrics import accuracy_score
import numpy as np
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])
# Manual calculation
correct = sum(y_true == y_pred)
total = len(y_true)
manual_acc = correct / total
print(f"Manual accuracy: {manual_acc:.1%}")
# sklearn
sklearn_acc = accuracy_score(y_true, y_pred)
print(f"Sklearn accuracy: {sklearn_acc:.1%}")
Line-by-Line Explanation:
- `y_true == y_pred` - Creates a boolean array: True where the prediction matches the actual label
- `sum(...)` - Counts True values (correct predictions) = 7 out of 10
- `correct / total` - Manual formula: 7/10 = 0.70 = 70%
- `accuracy_score()` - sklearn's function does the same calculation automatically
Result: Both methods give 70% accuracy - 7 correct predictions out of 10 total.
Task: Create datasets with different imbalance ratios and show how a dummy classifier accuracy changes.
Show Solution
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
for imbalance in [0.5, 0.7, 0.9, 0.99]:
    X, y = make_classification(n_samples=1000, weights=[imbalance],
                               n_features=5, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    dummy = DummyClassifier(strategy='most_frequent')
    dummy.fit(X_train, y_train)
    acc = dummy.score(X_test, y_test)
    print(f"Imbalance {imbalance:.0%}: Dummy accuracy = {acc:.1%}")
Line-by-Line Explanation:
- `for imbalance in [0.5, 0.7, 0.9, 0.99]` - Tests four different class imbalance levels
- `weights=[imbalance]` - Sets the proportion of the majority class (50%, 70%, 90%, 99%)
- `DummyClassifier(strategy='most_frequent')` - A "dumb" model that always predicts the most common class
- `dummy.score()` - Returns accuracy on the test data
Result Pattern: 50% imbalance → ~50% accuracy | 90% imbalance → ~90% accuracy | 99% imbalance → ~99% accuracy. The dummy model's accuracy equals the majority class proportion!
Task: Given a dataset with 950 negative and 50 positive samples, calculate what accuracy a model would get if it predicted all negatives. Then create a model that catches at least 40 positives and compare.
Show Solution
import numpy as np
from sklearn.metrics import accuracy_score
# Create imbalanced dataset
y_true = np.array([0] * 950 + [1] * 50) # 950 neg, 50 pos
# Lazy model: predict all negative
y_pred_lazy = np.zeros(1000)
lazy_acc = accuracy_score(y_true, y_pred_lazy)
print(f"Lazy model accuracy: {lazy_acc:.1%}") # 95%!
print(f"Positives caught: {int(sum(y_pred_lazy[y_true == 1]))}/50")
# Better model: catches 40 positives, but makes 80 false positives
y_pred_better = np.zeros(1000)
y_pred_better[950:990] = 1  # Catches 40 of 50 positives
y_pred_better[800:880] = 1  # 80 false positives
better_acc = accuracy_score(y_true, y_pred_better)
print(f"\nBetter model accuracy: {better_acc:.1%}")  # 91% - lower than the lazy model!
print(f"Positives caught: {int(sum(y_pred_better[y_true == 1]))}/50")
print("\nThe 'better' model has LOWER accuracy but catches more fraud!")
Key insight: The lazy model has 95% accuracy but catches ZERO fraud. The "better" model has lower accuracy but catches 40/50 fraud cases. This proves accuracy alone is misleading for imbalanced data!
Task: Create a function that takes y_true and returns the "baseline accuracy" - the accuracy you'd get by always predicting the most common class. Use this to evaluate if a model is actually learning.
Show Solution
import numpy as np
from collections import Counter
from sklearn.metrics import accuracy_score
def get_baseline_accuracy(y_true):
    """Calculate baseline accuracy (always predict majority class)"""
    # Find most common class
    counter = Counter(y_true)
    most_common_class, count = counter.most_common(1)[0]
    # Baseline accuracy = majority class proportion
    baseline_acc = count / len(y_true)
    return baseline_acc, most_common_class

def is_model_useful(y_true, y_pred):
    """Check if model beats baseline"""
    baseline_acc, majority = get_baseline_accuracy(y_true)
    model_acc = accuracy_score(y_true, y_pred)
    print(f"Baseline accuracy: {baseline_acc:.1%} (always predict {majority})")
    print(f"Model accuracy: {model_acc:.1%}")
    print(f"Improvement: {(model_acc - baseline_acc)*100:.1f} percentage points")
    return model_acc > baseline_acc

# Test it
y_true = np.array([0]*900 + [1]*100)
y_pred = np.array([0]*850 + [1]*150)  # Some predictions
is_model_useful(y_true, y_pred)
What this function does: Creates a reusable tool to check if your model is actually learning. It calculates the baseline (majority-class) accuracy and compares your model against it. If your model doesn't beat the baseline, it's not useful!
Confusion Matrix
The confusion matrix is the foundation of classification evaluation. It breaks down predictions into four categories: True Positives, True Negatives, False Positives, and False Negatives. Understanding these four quadrants is essential for computing all other classification metrics.
Think of it Like This (Medical Test Analogy)
Imagine a COVID test. True Positive = You have COVID and test positive (correct!). True Negative = You don't have COVID and test negative (correct!). False Positive = You're healthy but test positive (false alarm!). False Negative = You have COVID but test negative (missed! dangerous!).
The Four Quadrants
True Positive (TP): Predicted positive, actually positive
True Negative (TN): Predicted negative, actually negative
False Positive (FP): Predicted positive, actually negative (Type I error - false alarm)
False Negative (FN): Predicted negative, actually positive (Type II error - missed detection)
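The four counts can be computed directly from a pair of toy label arrays (made up here for illustration):

```python
import numpy as np

y_true = np.array([1, 1, 0, 0, 1, 0])  # actual labels
y_pred = np.array([1, 0, 0, 1, 1, 0])  # model predictions

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted positive, actually positive
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # predicted negative, actually negative
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted positive, actually negative
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted negative, actually positive
print(tp, tn, fp, fn)  # 2 2 1 1
```

Every sample falls into exactly one quadrant, so the four counts always sum to the total number of samples.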
Visual Confusion Matrix
| Actual \ Predicted | Negative (0) | Positive (1) |
|---|---|---|
| Negative (0) | TN (correct rejection) | FP (false alarm) |
| Positive (1) | FN (missed!) | TP (correct detection) |

Reading tip: the diagonal cells (TN, TP) are correct predictions; the off-diagonal cells (FP, FN) are errors.
Memory Trick
The first word tells you if the prediction was right (True/False). The second word tells you what was predicted (Positive/Negative). So "False Positive" = wrong prediction + predicted positive = predicted positive when it was actually negative.
Step 1: Import Required Libraries
# Creating and visualizing a confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
What we're importing:
- `confusion_matrix` - Function to create the confusion matrix from predictions
- `ConfusionMatrixDisplay` - For visualizing the matrix as a heatmap
- `LogisticRegression` - Our classification model
- `load_breast_cancer` - Built-in dataset with 569 samples (cancer diagnosis)
- `train_test_split` - Splits data into training and testing sets
Step 2: Load and Split the Data
# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, random_state=42
)
Data preparation:
- `load_breast_cancer()` - Loads the dataset with 30 features (tumor measurements)
- `data.data` - The features (X) - measurements like radius, texture, etc.
- `data.target` - The labels (y) - 0 = malignant, 1 = benign
- `train_test_split()` - Default split: 75% training, 25% testing
- `random_state=42` - Makes the split reproducible (same results every time)
Result: X_train/y_train = 426 samples, X_test/y_test = 143 samples
Step 3: Train the Model
# Train model
model = LogisticRegression(max_iter=5000, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Model training:
- `LogisticRegression(max_iter=5000)` - Creates the classifier (5000 iterations to ensure convergence)
- `model.fit(X_train, y_train)` - Trains the model to learn patterns from the training data
- `model.predict(X_test)` - Uses the trained model to predict on unseen test data
Output: y_pred contains 143 predictions (0 or 1) for each test sample
Step 4: Create the Confusion Matrix
# Create confusion matrix
# This function compares y_test (actual) vs y_pred (predicted)
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# Output looks like:
# [[TN, FP],
# [FN, TP]]
Creating the matrix:
- `confusion_matrix(y_test, y_pred)` - Compares actual labels vs predicted labels
- First argument = actual values, second argument = predicted values
- Returns a 2x2 numpy array for binary classification
Matrix Layout:
Predicted 0 | Predicted 1
Actual 0: TN | FP
Actual 1: FN | TP
Step 5: Extract Individual Values
# Extract each value for clarity
print(f"\nTrue Negatives (TN): {cm[0,0]} - Correctly said 'no cancer'")
print(f"False Positives (FP): {cm[0,1]} - Wrongly said 'cancer' (false alarm)")
print(f"False Negatives (FN): {cm[1,0]} - Wrongly said 'no cancer' (MISSED!)")
print(f"True Positives (TP): {cm[1,1]} - Correctly detected cancer")
Accessing matrix values:
- `cm[0,0]` → TN (row 0, col 0) - Actual=0, Predicted=0
- `cm[0,1]` → FP (row 0, col 1) - Actual=0, Predicted=1
- `cm[1,0]` → FN (row 1, col 0) - Actual=1, Predicted=0
- `cm[1,1]` → TP (row 1, col 1) - Actual=1, Predicted=1
Remember: In medical diagnosis, FN is the most dangerous error - missing actual cancer cases!
Reading the Confusion Matrix
Rows represent actual classes, columns represent predicted classes. The diagonal shows correct predictions. For binary classification, position [0,0] is TN, [0,1] is FP, [1,0] is FN, and [1,1] is TP.
Step-by-Step: Reading a Confusion Matrix
- Look at the diagonal (top-left to bottom-right): These are your CORRECT predictions. Higher is better!
- Look at off-diagonal elements: These are your ERRORS. Lower is better!
- Top-right (FP): How many times you cried wolf? (said positive but was negative)
- Bottom-left (FN): How many times you missed danger? (said negative but was positive)
Step 1: Define Your Data
# Extract metrics from confusion matrix
from sklearn.metrics import confusion_matrix
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]
Setting up test data:
- `y_true` - Actual labels: 3 negatives (0) + 5 positives (1) = 8 samples total
- `y_pred` - What our model predicted for each sample
Visual Comparison:
Sample:  1 2 3 4 5 6 7 8
Actual:  0 0 0 1 1 1 1 1
Predict: 0 1 0 1 1 0 1 1
Result:  ✓ ✗ ✓ ✓ ✓ ✗ ✓ ✓
Step 2: Create Matrix and Extract Values
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(f"True Negatives: {tn}") # Correctly predicted negatives
print(f"False Positives: {fp}") # Type I errors
print(f"False Negatives: {fn}") # Type II errors
print(f"True Positives: {tp}") # Correctly predicted positives
Extracting values:
- `confusion_matrix(y_true, y_pred)` - Creates the 2x2 matrix
- `cm.ravel()` - Flattens the matrix into a 1D array: [TN, FP, FN, TP]
- Unpacking: `tn, fp, fn, tp = cm.ravel()` assigns each value to a variable
Expected Output:
True Negatives: 2 (correctly predicted negative)
False Positives: 1 (wrongly predicted positive - a false alarm)
False Negatives: 1 (missed an actual positive!)
True Positives: 4 (correctly predicted positive)
Step 3: Calculate Accuracy Manually
# Calculate metrics manually
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"\nAccuracy: {accuracy:.1%}")
Accuracy formula:
- Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
- In words: (Correct predictions) / (All predictions)
- Calculation: (4 + 2) / (4 + 2 + 1 + 1) = 6/8 = 75%
Key Insight: From just these 4 values (TN, FP, FN, TP), you can calculate ALL classification metrics!
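For instance, reusing the same four values from the example above (TN=2, FP=1, FN=1, TP=4), precision, recall, and F1 fall out of the same counts:

```python
tn, fp, fn, tp = 2, 1, 1, 4  # values from the example above

precision = tp / (tp + fp)  # 4/5 = 0.8
recall = tp / (tp + fn)     # 4/5 = 0.8
f1 = 2 * precision * recall / (precision + recall)
print(f"Precision: {precision:.1%}, Recall: {recall:.1%}, F1: {f1:.1%}")
```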
Practice Questions
Task: Given predictions and true labels, create a confusion matrix using sklearn.
Show Solution
from sklearn.metrics import confusion_matrix
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
tn, fp, fn, tp = cm.ravel()
print(f"\nTP: {tp}, TN: {tn}, FP: {fp}, FN: {fn}")
What this code does: Creates a 2x2 confusion matrix and extracts all four values using ravel(). The matrix shows how predictions compare to actual labels - rows are true labels, columns are predictions.
Task: Create a confusion matrix for a 3-class classification problem.
Show Solution
from sklearn.metrics import confusion_matrix
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, random_state=42
)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print("Multi-class Confusion Matrix:")
print(cm)
print(f"\nClasses: {iris.target_names}")
What this code shows: For 3+ classes, the confusion matrix becomes NxN. Each cell [i,j] shows how many samples of class i were predicted as class j. Diagonal values are correct predictions.
Task: Given a confusion matrix with TN=85, FP=5, FN=3, TP=7, calculate accuracy, precision, recall, and F1 manually.
Show Solution
# Given values
TN, FP, FN, TP = 85, 5, 3, 7
# Calculate all metrics
accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * (precision * recall) / (precision + recall)
specificity = TN / (TN + FP)
print(f"Total samples: {TP + TN + FP + FN}")
print(f"\nAccuracy: {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall (Sensitivity): {recall:.2%}")
print(f"Specificity: {specificity:.2%}")
print(f"F1 Score: {f1:.2%}")
# Verify with sklearn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np
y_true = [0]*TN + [0]*FP + [1]*FN + [1]*TP
y_pred = [0]*TN + [1]*FP + [0]*FN + [1]*TP
print(f"\nVerification with sklearn:")
print(f"Accuracy: {accuracy_score(y_true, y_pred):.2%}")
What this code teaches: Shows how to calculate ALL metrics from just 4 numbers (TP, TN, FP, FN). This is essential knowledge - once you understand the confusion matrix, you can compute any classification metric!
Task: Create a function that displays a confusion matrix as a colored heatmap with labels.
Show Solution
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
import numpy as np
def plot_confusion_matrix(y_true, y_pred, labels=None):
    """Plot a confusion matrix heatmap"""
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=labels or ['Negative', 'Positive'],
                yticklabels=labels or ['Negative', 'Positive'])
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.title('Confusion Matrix')
    # Print totals
    total = cm.sum()
    correct = np.trace(cm)
    print(f"Total samples: {total}")
    print(f"Correct predictions: {correct}")
    print(f"Accuracy: {correct/total:.1%}")
    plt.tight_layout()
    plt.show()
# Example usage
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, random_state=42)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
plot_confusion_matrix(y_test, y_pred, labels=['Malignant', 'Benign'])
What this function creates: A reusable visualization function using seaborn heatmap. Color intensity shows cell values, making it easy to spot where the model makes mistakes. Great for presentations!
Task: Given a confusion matrix, identify how many Type I errors (FP) and Type II errors (FN) occurred, and explain which is worse for a medical diagnosis scenario.
Show Solution
import numpy as np
from sklearn.metrics import confusion_matrix
# Example: Cancer detection results
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1] # 5 healthy, 5 sick
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0] # Model predictions
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print("Confusion Matrix:")
print(cm)
print(f"\n--- Error Analysis ---")
print(f"Type I Errors (False Positives): {fp}")
print(f" -> Healthy people told they have cancer")
print(f" -> Causes: Unnecessary stress, extra tests")
print(f"\nType II Errors (False Negatives): {fn}")
print(f" -> Sick people told they're healthy")
print(f" -> Causes: Delayed treatment, could be FATAL!")
print(f"\n--- For Medical Diagnosis ---")
print(f"Type II errors (FN) are MUCH WORSE!")
print(f"Missing a cancer case could cost a life.")
print(f"A false alarm just means more tests.")
print(f"\nConclusion: Optimize for HIGH RECALL (catch all cancers)")
Real-world insight: This code demonstrates why understanding error types matters. In medical diagnosis, Type II errors (false negatives) are often deadly - missing a cancer case is far worse than a false alarm that leads to extra tests.
Precision and Recall
Precision and recall are complementary metrics that focus on the positive class. Precision asks "Of all positive predictions, how many were correct?" while recall asks "Of all actual positives, how many did we find?" The trade-off between them is fundamental to classification.
Think of it Like This (Search Engine Analogy)
You search "chocolate cake recipes" and get 100 results:
- Precision: Of the 100 results shown, how many are ACTUALLY chocolate cake recipes? (Are the results relevant?)
- Recall: Of ALL chocolate cake recipes that exist on the internet, what percentage did the search find? (Did we find everything?)
The Two Key Questions
Precision asks:
"When I say YES, am I right?"
Out of everyone I predicted as positive, how many actually were positive?
Recall asks:
"Did I find them all?"
Out of everyone who was actually positive, how many did I correctly identify?
Precision
Of all positive predictions, how many were actually positive?
Precision = TP / (TP + FP)
High precision means few false positives. Important when false alarms are costly (spam detection).
Recall (Sensitivity)
Of all actual positives, how many did we correctly identify?
Recall = TP / (TP + FN)
High recall means few false negatives. Important when missing positives is costly (disease detection).
# Calculating precision and recall
from sklearn.metrics import precision_score, recall_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Create imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(f"Precision: {precision:.3f}") # Of predicted positives, how many correct?
print(f"Recall: {recall:.3f}") # Of actual positives, how many found?
- `make_classification(weights=[0.9, 0.1])` - Creates imbalanced data: 90% class 0, 10% class 1
- `precision_score(y_test, y_pred)` - Calculates TP / (TP + FP) - "When I say positive, am I right?"
- `recall_score(y_test, y_pred)` - Calculates TP / (TP + FN) - "Did I catch all the positives?"
Interpretation: If precision=0.8 and recall=0.6, then 80% of your positive predictions are correct, but you only caught 60% of actual positives. Use both metrics together to understand your model's behavior!
The Precision-Recall Trade-off
You can increase recall by predicting more positives (lower threshold), but this usually decreases precision. Conversely, being more conservative increases precision but lowers recall. The right balance depends on your problem.
Real-World Trade-off Examples
Spam Filter (Needs High Precision)
If your spam filter marks an important email as spam, you might miss a job offer or important message!
Better to let some spam through (accept lower recall) than to block real emails (keep precision high).
Cancer Screening (Needs High Recall)
If the test misses someone with cancer, they might not get treatment in time!
Better to raise some false alarms (accept lower precision) than to miss an actual cancer (keep recall high).
# Precision-Recall trade-off with threshold adjustment
from sklearn.metrics import precision_score, recall_score
import numpy as np
# Get probability predictions
y_proba = model.predict_proba(X_test)[:, 1]
print("Threshold | Precision | Recall")
print("-" * 35)
for threshold in [0.3, 0.4, 0.5, 0.6, 0.7]:
    y_pred_thresh = (y_proba >= threshold).astype(int)
    prec = precision_score(y_test, y_pred_thresh, zero_division=0)
    rec = recall_score(y_test, y_pred_thresh)
    print(f"   {threshold:.1f}    |   {prec:.3f}   |  {rec:.3f}")
- `predict_proba(X_test)[:, 1]` - Gets probability scores (0-1) for the positive class
- `(y_proba >= threshold).astype(int)` - Converts probabilities to 0/1 predictions using the threshold
- `zero_division=0` - Prevents an error when a high threshold yields no positive predictions
Trade-off Pattern:
- Threshold 0.3 → Predict more positives → High recall (~0.9), Low precision (~0.5)
- Threshold 0.7 → Predict fewer positives → Low recall (~0.5), High precision (~0.9)
- Default 0.5 is rarely optimal - tune based on your use case!
Practice Questions
Task: Given TP=80, FP=20, FN=10, calculate precision and recall.
Show Solution
TP, FP, FN = 80, 20, 10
precision = TP / (TP + FP)
recall = TP / (TP + FN)
print(f"Precision: {precision:.2%}") # 80%
print(f"Recall: {recall:.2%}") # 88.9%
The formulas: Precision = TP/(TP+FP) measures "when I predict positive, am I correct?" Recall = TP/(TP+FN) measures "did I find all the positives?" Both are essential for understanding model behavior.
Task: Find the threshold that achieves at least 90% recall with maximum precision.
Show Solution
from sklearn.metrics import precision_recall_curve
import numpy as np

# Uses y_test and y_proba from the threshold example above
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
# Find the highest-precision threshold that still achieves 90% recall
target_recall = 0.90
valid_idx = recalls >= target_recall
if valid_idx.any():
    best_idx = np.argmax(precisions[:-1][valid_idx[:-1]])
    best_threshold = thresholds[valid_idx[:-1]][best_idx]
    print(f"Threshold: {best_threshold:.3f}")
    print(f"Precision: {precisions[:-1][valid_idx[:-1]][best_idx]:.3f}")
    print(f"Recall: {recalls[:-1][valid_idx[:-1]][best_idx]:.3f}")
What this code does: Uses sklearn's precision_recall_curve to find all threshold-precision-recall combinations, then filters to keep only thresholds achieving 90%+ recall, and picks the one with highest precision. This is how you optimize for a specific recall target!
Task: Create a table showing how precision and recall change as you adjust the classification threshold from 0.1 to 0.9.
Show Solution
from sklearn.metrics import precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np
# Create dataset
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2],
random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Train model
model = LogisticRegression()
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
# Show trade-off at different thresholds
print("Threshold | Precision | Recall | Predicted Pos")
print("-" * 50)
for thresh in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    y_pred = (y_proba >= thresh).astype(int)
    # Handle edge case of no positive predictions
    if sum(y_pred) == 0:
        prec = 0
    else:
        prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    n_pos = sum(y_pred)
    print(f"   {thresh:.1f}    |   {prec:.2f}    |  {rec:.2f}  |  {n_pos}")
print("\nNotice: Lower threshold = More positives = Higher recall, Lower precision")
Key pattern: This table reveals the fundamental precision-recall trade-off. Low threshold (0.1) = predict more positives = high recall but low precision. High threshold (0.9) = predict fewer positives = high precision but low recall. Choose based on your use case!
Task: A spam filter has precision=0.95 and recall=0.70. Explain in plain English what this means and whether this is good for a spam filter.
Show Solution
# Spam Filter Analysis
precision = 0.95
recall = 0.70
print("=== Spam Filter Performance Analysis ===")
print(f"\nPrecision: {precision:.0%}")
print(" What it means: When the filter marks an email as spam,")
print(" it's correct 95% of the time.")
print(" Only 5% of 'spam' emails are actually legitimate (false positives).")
print(f"\nRecall: {recall:.0%}")
print(" What it means: Of all actual spam emails,")
print(" the filter catches 70% of them.")
print(" 30% of spam emails slip through to your inbox (false negatives).")
print("\n=== Is this good for a spam filter? ===")
print("YES, this is EXCELLENT for a spam filter!")
print("\nWhy?")
print("1. HIGH PRECISION (95%) is critical for spam filters")
print(" - Losing important emails is VERY bad")
print(" - Missing a job offer or contract is worse than seeing some spam")
print("\n2. Lower recall (70%) is acceptable")
print(" - Seeing some spam in inbox is annoying but not harmful")
print(" - Users can manually delete spam that gets through")
print("\nFor spam filters: PRECISION > RECALL")
Real-world application: This analysis shows how to interpret metrics for a specific use case. For spam filters, blocking legitimate emails (false positives) is worse than missing some spam, so we prioritize precision over recall.
Task: For a 3-class classification problem, calculate macro, micro, and weighted average precision.
Show Solution
from sklearn.metrics import precision_score, classification_report
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Load 3-class dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, random_state=42)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Different averaging methods
print("=== Precision Averaging Methods ===")
print(f"\nMacro (simple average): {precision_score(y_test, y_pred, average='macro'):.3f}")
print(" -> Each class counts equally")
print(f"\nWeighted (by class size): {precision_score(y_test, y_pred, average='weighted'):.3f}")
print(" -> Larger classes have more influence")
print(f"\nMicro (total TP / total P): {precision_score(y_test, y_pred, average='micro'):.3f}")
print(" -> Treats all samples equally")
print("\n=== Full Classification Report ===")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Averaging methods explained: Macro = simple average (good when all classes equally important). Weighted = accounts for class sizes. Micro = treats each sample equally. Use classification_report() to see per-class metrics plus averages!
F1 Score and Other Metrics
The F1 score is the harmonic mean of precision and recall, providing a single number that balances both. When you need one metric to optimize, F1 is often a good choice for imbalanced problems. Other useful metrics include specificity, balanced accuracy, and Matthews Correlation Coefficient.
Why Do We Need F1?
Imagine you have to report ONE number to your boss about how good your model is. You can't say "precision is 80% and recall is 60%" - they want ONE number! F1 score combines both into a single metric. It's like a "balanced score" that's only high when BOTH precision AND recall are good.
F1 Score
F1 = 2 * (precision * recall) / (precision + recall). The harmonic mean penalizes extreme values, so F1 is only high when both precision and recall are high.
Why harmonic mean? If precision is 100% but recall is 1%, the arithmetic mean would be 50.5%. The harmonic mean correctly gives a low score of 1.98%.
Worked Example: Calculating F1 Score
Let's say your model has:
- Precision = 0.80 (80%)
- Recall = 0.60 (60%)
F1 = 2 × (0.80 × 0.60) / (0.80 + 0.60)
F1 = 2 × 0.48 / 1.40
F1 = 0.96 / 1.40 = 0.686 (68.6%)
Result
F1 = 0.69
Note: F1 (68.6%) is lower than the arithmetic mean (70%) because the harmonic mean penalizes the imbalance between precision and recall.
# F1 Score and classification report
from sklearn.metrics import f1_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, random_state=42
)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1:.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))
- f1_score(y_test, y_pred) - Calculates 2×(precision×recall)/(precision+recall)
- classification_report() - Generates a complete performance table
- target_names=data.target_names - Adds readable class labels to the report
Classification Report Output:
- Per-class metrics: Precision, Recall, F1 for each class separately
- Support: Number of samples in each class (helps spot imbalance)
- Macro avg: Simple average across classes (treats all classes equally)
- Weighted avg: Weighted by class size (larger classes count more)
Other Useful Metrics
Specificity
TN / (TN + FP)
True negative rate - how well we identify negatives.
Balanced Accuracy
(Recall + Specificity) / 2
Average of recall per class - handles imbalance.
Matthews Correlation
MCC = (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
Balanced measure even for imbalanced classes.
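Specificity and balanced accuracy are described above but never computed directly in this section. A minimal sketch deriving both from the confusion matrix, using tiny hand-made label arrays (illustrative values only):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, balanced_accuracy_score

# Hand-made example: 6 negatives, 4 positives (hypothetical labels)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# sklearn's ravel order for binary labels is tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

specificity = tn / (tn + fp)            # true negative rate
recall = tp / (tp + fn)                 # true positive rate (sensitivity)
balanced_acc = (recall + specificity) / 2

print(f"Specificity: {specificity:.3f}")
print(f"Balanced accuracy (manual):  {balanced_acc:.3f}")
print(f"Balanced accuracy (sklearn): {balanced_accuracy_score(y_true, y_pred):.3f}")
```

The manual balanced accuracy should agree with sklearn's, since for binary labels balanced accuracy is exactly the mean of recall and specificity.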
# Comparing multiple metrics
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, balanced_accuracy_score, matthews_corrcoef)
metrics = {
'Accuracy': accuracy_score(y_test, y_pred),
'Precision': precision_score(y_test, y_pred),
'Recall': recall_score(y_test, y_pred),
'F1 Score': f1_score(y_test, y_pred),
'Balanced Accuracy': balanced_accuracy_score(y_test, y_pred),
'MCC': matthews_corrcoef(y_test, y_pred)
}
for name, score in metrics.items():
    print(f"{name:18}: {score:.3f}")
- accuracy_score() - (TP+TN) / Total - can be misleading on imbalanced data
- precision_score() - TP / (TP+FP) - "When I predict positive, am I right?"
- recall_score() - TP / (TP+FN) - "Did I catch all positives?"
- f1_score() - Harmonic mean of precision and recall
- balanced_accuracy_score() - Average of recall per class (handles imbalance)
- matthews_corrcoef() - Uses all 4 values (TP, TN, FP, FN), ranges from -1 to +1
Always compute multiple metrics! Different metrics reveal different aspects of model performance. MCC is often considered the most reliable single metric for binary classification.
Practice Questions
Task: Given precision=0.8 and recall=0.6, calculate F1 score.
Show Solution
precision = 0.8
recall = 0.6
f1 = 2 * (precision * recall) / (precision + recall)
print(f"F1 Score: {f1:.3f}") # 0.686
The F1 formula: F1 = 2×(precision×recall)/(precision+recall). The result (0.686) is lower than the simple average (0.70) because the harmonic mean penalizes the imbalance between precision and recall.
Task: Train a classifier on imbalanced data and compare F1 with balanced accuracy.
Show Solution
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, balanced_accuracy_score
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")
print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.3f}")
Comparison insight: F1 focuses on the positive class performance while balanced accuracy averages recall across both classes. On highly imbalanced data (95% negative), these metrics tell different stories - use both to understand your model fully!
Task: Calculate both arithmetic mean and F1 (harmonic mean) for precision=1.0 and recall=0.01. Explain why F1 is more appropriate.
Show Solution
# Extreme case: Perfect precision, terrible recall
precision = 1.0 # 100% - every positive prediction is correct
recall = 0.01 # 1% - we only found 1% of actual positives!
# Arithmetic mean (simple average)
arithmetic_mean = (precision + recall) / 2
print(f"Arithmetic Mean: {arithmetic_mean:.2%}") # 50.5%
# Harmonic mean (F1 score)
f1 = 2 * (precision * recall) / (precision + recall)
print(f"F1 Score (Harmonic Mean): {f1:.2%}") # 1.98%
print("\n=== Why F1 is Better ===")
print(f"\nThe model catches only {recall:.0%} of positives!")
print("This is a TERRIBLE model - it misses 99% of cases.")
print("\nArithmetic mean of 50.5% is MISLEADING.")
print("It makes the model look acceptable when it's useless.")
print("\nF1 of 1.98% correctly shows the model is BAD.")
print("The harmonic mean punishes extreme values.")
print("\nRule: F1 is only high when BOTH precision AND recall are high.")
Why this matters: The arithmetic mean (50.5%) would suggest a decent model, but this model only catches 1% of positives! F1's harmonic mean (1.98%) correctly reveals how bad this model is. This is why we use F1 instead of averaging.
Task: Calculate F0.5, F1, and F2 scores for a model. Explain when you would use each.
Show Solution
from sklearn.metrics import fbeta_score
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Create dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
# Calculate different F-beta scores
f05 = fbeta_score(y_test, y_pred, beta=0.5) # Precision-focused
f1 = fbeta_score(y_test, y_pred, beta=1.0) # Balanced
f2 = fbeta_score(y_test, y_pred, beta=2.0) # Recall-focused
print("=== F-beta Scores ===")
print(f"\nF0.5 (Precision-focused): {f05:.3f}")
print(" Use when: False positives are costly")
print(" Examples: Spam detection, recommendation systems")
print(f"\nF1 (Balanced): {f1:.3f}")
print(" Use when: Both errors are equally bad")
print(" Examples: General classification, balanced datasets")
print(f"\nF2 (Recall-focused): {f2:.3f}")
print(" Use when: False negatives are costly")
print(" Examples: Cancer detection, fraud detection")
print("\n=== The Formula ===")
print("F-beta = (1 + β²) × (precision × recall) / (β² × precision + recall)")
print("\nHigher beta = More weight on recall")
print("Lower beta = More weight on precision")
Choosing beta: F0.5 emphasizes precision (spam filters, recommendations). F1 is balanced. F2 emphasizes recall (medical diagnosis, fraud). The beta value lets you tune the precision-recall trade-off mathematically!
Task: Calculate MCC manually and explain why it's considered more reliable than F1 for imbalanced datasets.
Show Solution
import numpy as np
from sklearn.metrics import matthews_corrcoef, f1_score
# Confusion matrix values
TP, TN, FP, FN = 8, 900, 10, 2 # Highly imbalanced!
# Manual MCC calculation
numerator = (TP * TN) - (FP * FN)
denominator = np.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
mcc_manual = numerator / denominator
print("=== Matthews Correlation Coefficient ===")
print(f"\nTP={TP}, TN={TN}, FP={FP}, FN={FN}")
print(f"Total samples: {TP+TN+FP+FN}")
print(f"Class imbalance: {TN+FP} negative vs {TP+FN} positive")
print(f"\nMCC (manual): {mcc_manual:.3f}")
# Verify with sklearn
y_true = [0]*TN + [0]*FP + [1]*FN + [1]*TP
y_pred = [0]*TN + [1]*FP + [0]*FN + [1]*TP
print(f"MCC (sklearn): {matthews_corrcoef(y_true, y_pred):.3f}")
print(f"F1 Score: {f1_score(y_true, y_pred):.3f}")
print("\n=== Why MCC is Better ===")
print("1. Uses ALL four values (TP, TN, FP, FN)")
print("2. F1 ignores True Negatives!")
print("3. MCC ranges from -1 to +1:")
print(" +1 = perfect prediction")
print(" 0 = random prediction")
print(" -1 = total disagreement")
print("\n4. MCC only gives high score when model does well")
print(" on BOTH classes, not just the positive class.")
MCC advantage: MCC uses all four confusion matrix values (TP, TN, FP, FN) making it a truly balanced metric. It returns values from -1 (total disagreement) to +1 (perfect), with 0 meaning random. Many researchers consider MCC the most reliable single metric!
ROC Curves and AUC
The ROC (Receiver Operating Characteristic) curve visualizes the trade-off between true positive rate and false positive rate across all possible thresholds. The AUC (Area Under the Curve) summarizes this into a single number from 0 to 1, where 1 is perfect and 0.5 is random guessing.
What is ROC-AUC in Simple Terms?
Imagine you have two people - one who has a disease and one who doesn't. You show your model both people.
AUC is the probability that your model correctly gives a higher "disease score" to the sick person than the healthy person. If AUC = 0.9, that means 90% of the time, your model ranks the sick person higher - that's great!
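This ranking interpretation can be checked numerically. The sketch below uses synthetic scores (drawn from two normal distributions, so ties are effectively impossible) and compares roc_auc_score against a brute-force count over all positive/negative pairs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Synthetic scores: positives tend to score higher than negatives
y = np.array([0] * 500 + [1] * 100)
scores = np.concatenate([rng.normal(0.0, 1.0, 500),   # negatives
                         rng.normal(1.5, 1.0, 100)])  # positives

# Brute force: fraction of (positive, negative) pairs ranked correctly
pos, neg = scores[y == 1], scores[y == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])

auc = roc_auc_score(y, scores)
print(f"Pairwise ranking probability: {pairwise:.4f}")
print(f"roc_auc_score:                {auc:.4f}")
```

With tied scores the brute-force count would need to credit ties as 0.5; with continuous scores the two numbers match.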
Why Use ROC-AUC?
Threshold Independent
It evaluates your model across ALL possible thresholds, not just 0.5
Easy to Compare
Single number makes comparing models simple
Measures Ranking
How well does your model rank positives above negatives?
ROC-AUC Interpretation
AUC represents the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. It measures ranking quality, not absolute predictions.
- AUC = 1.0: Perfect classifier
- AUC 0.9-1.0: Excellent
- AUC 0.8-0.9: Good
- AUC = 0.5: Random guessing
# Computing ROC curve and AUC
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, random_state=42
)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
# Get probability predictions
y_proba = model.predict_proba(X_test)[:, 1]
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
# Calculate AUC
auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {auc:.3f}")
- predict_proba(X_test)[:, 1] - Gets probability of positive class (column 1) for each sample
- roc_curve(y_test, y_proba) - Returns FPR, TPR, and thresholds at each point on the curve
- roc_auc_score(y_test, y_proba) - Calculates area under the ROC curve (0.5 to 1.0)
Understanding the Output:
- FPR (False Positive Rate): FP / (FP + TN) - how many negatives we incorrectly labeled positive
- TPR (True Positive Rate): Same as Recall = TP / (TP + FN)
- AUC = 0.5: Random guessing (diagonal line) | AUC = 1.0: Perfect classifier
Precision-Recall Curves
For imbalanced datasets, precision-recall curves are often more informative than ROC curves. They focus on the positive class and don't get inflated by a large number of true negatives.
# Precision-Recall curve and Average Precision
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score
# Calculate PR curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
# Average Precision (area under PR curve)
ap = average_precision_score(y_test, y_proba)
print(f"Average Precision: {ap:.3f}")
# Find threshold for specific precision target
target_prec = 0.95
valid = precision[:-1] >= target_prec
if valid.any():
    idx = np.argmax(recall[:-1][valid])
    print(f"Threshold for {target_prec:.0%} precision: {thresholds[valid][idx]:.3f}")
    print(f"Recall at that threshold: {recall[:-1][valid][idx]:.3f}")
- precision_recall_curve() - Returns precision, recall arrays at each threshold
- average_precision_score() - Calculates area under PR curve (AP score)
- precision[:-1] >= target_prec - Finds all thresholds with precision ≥ 95%
- np.argmax(recall[:-1][valid]) - Among valid thresholds, pick the one with highest recall
PR vs ROC for Imbalanced Data:
- ROC-AUC can be misleading: Includes true negatives, which are easy to get right when negatives dominate
- PR-AUC (Average Precision) is better: Focuses only on positive class performance
- Rule of thumb: Use PR curves when positive class is rare (<10% of data)
Practice Questions
Task: Train a classifier and compute its ROC-AUC score.
Show Solution
from sklearn.metrics import roc_auc_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Binary classification (setosa vs others)
iris = load_iris()
X, y = iris.data, (iris.target == 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC: {auc:.3f}")
Key steps: Get probability predictions with predict_proba()[:, 1], then pass true labels and probabilities to roc_auc_score(). AUC requires probability scores, not hard predictions!
Task: Compare AUC scores of Logistic Regression, Decision Tree, and Random Forest.
Show Solution
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
models = {
'Logistic Regression': LogisticRegression(max_iter=1000),
'Decision Tree': DecisionTreeClassifier(random_state=42),
'Random Forest': RandomForestClassifier(random_state=42)
}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_proba = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_proba)
    print(f"{name}: AUC = {auc:.3f}")
Model comparison pattern: Loop through multiple models, train each, get probabilities, and compute AUC. This gives you a single number to rank models. Higher AUC = better ranking ability!
Task: Plot the ROC curve and find the optimal threshold using Youden's J statistic (maximizes TPR - FPR).
Show Solution
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
import numpy as np
# Assume model and data are ready
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, random_state=42)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)
# Find optimal threshold (Youden's J statistic)
j_scores = tpr - fpr
optimal_idx = np.argmax(j_scores)
optimal_threshold = thresholds[optimal_idx]
print(f"AUC: {auc:.3f}")
print(f"Optimal Threshold: {optimal_threshold:.3f}")
print(f"At this threshold:")
print(f" TPR (Recall): {tpr[optimal_idx]:.3f}")
print(f" FPR: {fpr[optimal_idx]:.3f}")
# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.5)')
plt.scatter(fpr[optimal_idx], tpr[optimal_idx],
color='red', s=100, zorder=5,
label=f'Optimal (threshold={optimal_threshold:.2f})')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve with Optimal Threshold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
What this creates: A complete ROC curve visualization with the optimal threshold marked. Youden's J statistic (TPR-FPR) finds the point farthest from the diagonal line - the best balance between sensitivity and specificity.
Task: For an imbalanced dataset, compare ROC-AUC with Average Precision (PR-AUC) and explain why they might give different impressions.
Show Solution
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Create HIGHLY imbalanced dataset (5% positive)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
# Calculate both metrics
roc_auc = roc_auc_score(y_test, y_proba)
avg_precision = average_precision_score(y_test, y_proba)
print("=== Imbalanced Dataset Analysis ===")
print(f"Positive class: {sum(y_test)} out of {len(y_test)} ({sum(y_test)/len(y_test):.1%})")
print(f"\nROC-AUC: {roc_auc:.3f}")
print(f"Average Precision (PR-AUC): {avg_precision:.3f}")
print("\n=== Why the Difference? ===")
print("\nROC-AUC looks HIGH because:")
print(" - It includes True Negatives")
print(" - With many negatives, getting TNs right is easy")
print(" - This 'inflates' the ROC-AUC score")
print("\nPR-AUC looks LOWER because:")
print(" - It focuses ONLY on the positive class")
print(" - Ignores True Negatives completely")
print(" - Shows how well we find the rare positives")
print("\n=== Recommendation ===")
print("For imbalanced data, TRUST PR-AUC more!")
print("It gives a more realistic view of performance.")
Critical insight: ROC-AUC can be misleadingly high on imbalanced data because it includes true negatives (which are easy to get right when negatives dominate). PR-AUC focuses only on positive class performance - use it for imbalanced problems!
Task: Your model has AUC = 0.85. Explain what this means in practical terms that a non-technical manager could understand.
Show Solution
auc = 0.85
print("=== Explaining AUC to Your Manager ===")
print(f"\nOur fraud detection model has AUC = {auc}")
print("\nWhat does this mean in simple terms?")
print("="*50)
print(f"\n1. THE RANKING TEST:")
print(f" If I pick one fraudulent transaction and one normal one,")
print(f" our model will correctly identify which is fraud")
print(f" {auc:.0%} of the time.")
print(f"\n2. THE QUALITY SCALE:")
print(f" - AUC = 0.5 = Random guessing (coin flip)")
print(f" - AUC = 0.7-0.8 = Acceptable")
print(f"   - AUC = 0.8-0.9 = Good (← We are HERE)")
print(f" - AUC = 0.9-1.0 = Excellent")
print(f" - AUC = 1.0 = Perfect")
print(f"\n3. PRACTICAL MEANING:")
print(f" Our model is performing WELL.")
print(f" It successfully distinguishes fraud from normal transactions")
print(f" much better than random chance.")
print(f"\n4. ROOM FOR IMPROVEMENT:")
print(f" We're at 85%, so there's still 15% where the model")
print(f" ranks a normal transaction higher than a fraud.")
print(f" We could try more features or better algorithms.")
Communication tip: This code shows how to explain AUC to non-technical stakeholders. The "ranking test" interpretation (85% chance of correctly ranking a positive above a negative) is intuitive and practical!
Choosing the Right Metric
The best metric depends on your specific problem, the class distribution, and the costs of different types of errors. There is no one-size-fits-all answer. Understanding your domain and the consequences of mistakes is crucial for choosing appropriate evaluation criteria.
Quick Decision Framework
Ask yourself these questions to choose the right metric:
Question 1: Is my data imbalanced?
If yes → Avoid accuracy! Use F1, precision, recall, or balanced accuracy instead.
Question 2: Which error is worse - false positive or false negative?
False positive worse → Focus on Precision. False negative worse → Focus on Recall.
Question 3: Do I need to compare multiple models?
If yes → Use ROC-AUC or F1 score for easy comparison.
Question 4: Can I tune the threshold later?
If yes → ROC-AUC evaluates across all thresholds. If no → Use metrics at your chosen threshold.
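The four questions can be folded into a toy helper. This is purely illustrative (a hypothetical suggest_metric function encoding the rules of thumb above, not a library API):

```python
def suggest_metric(imbalanced, worse_error, comparing_models):
    """Toy encoding of the decision framework.

    worse_error: 'fp' (false positives hurt more),
                 'fn' (false negatives hurt more), or 'equal'.
    """
    if comparing_models:
        return "ROC-AUC (or F1 on imbalanced data)"
    if worse_error == "fp":
        return "Precision (or F0.5)"
    if worse_error == "fn":
        return "Recall (or F2)"
    if imbalanced:
        return "F1 or balanced accuracy"   # accuracy is misleading here
    return "Accuracy is acceptable"

# Example: fraud detection - imbalanced, missed fraud is the costly error
print(suggest_metric(imbalanced=True, worse_error="fn", comparing_models=False))
```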
Common Use Cases by Industry
| Use Case | Primary Metric | Why? |
|---|---|---|
| Email Spam Detection | Precision | Don't want important emails in spam folder |
| Cancer Detection | Recall | Must catch all cancer cases, even with some false alarms |
| Fraud Detection | F1 / Recall | Balance between catching fraud and not blocking customers |
| Search Engine | Precision@K | Top results must be relevant |
| General Model Comparison | ROC-AUC | Threshold-independent comparison |
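Precision@K from the table is not computed anywhere above and is not a built-in sklearn scorer; a minimal sketch (hypothetical precision_at_k helper, assuming a higher score means the item is ranked higher):

```python
import numpy as np

def precision_at_k(y_true, scores, k):
    """Fraction of the top-k scored items that are actually relevant."""
    top_k = np.argsort(scores)[::-1][:k]        # indices of the k highest scores
    return np.mean(np.asarray(y_true)[top_k])

# Toy search-ranking example: 1 = relevant document, 0 = irrelevant
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

print(precision_at_k(y_true, scores, k=3))  # top 3 items: 2 of 3 are relevant
```

This is the metric a search engine cares about: only the first page of results matters, so errors below rank K are ignored entirely.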
Metric Selection Guide
Use Precision when...
- False positives are costly (spam detection)
- You want high confidence in positive predictions
- Action is expensive (manual review)
Use Recall when...
- False negatives are costly (disease detection)
- Missing positives is dangerous
- You need to catch all positives
Use F1 when...
- You need a single metric to optimize
- Classes are imbalanced
- Both precision and recall matter
Use ROC-AUC when...
- You want to compare models overall
- Threshold can be tuned later
- Classes are roughly balanced
Step 1: Import All Required Metrics
# Complete evaluation pipeline
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, classification_report)
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
What we're importing:
- accuracy_score - Overall correctness (but misleading for imbalanced data!)
- precision_score - Of all positive predictions, how many were correct?
- recall_score - Of all actual positives, how many did we catch?
- f1_score - Harmonic mean of precision and recall (balanced measure)
- roc_auc_score - Area under ROC curve (threshold-independent)
- classification_report - Prints a complete summary table
Pro Tip: Always import multiple metrics - never rely on just one!
Step 2: Create an Imbalanced Dataset
# Create dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
Understanding the dataset:
- n_samples=1000 - Create 1000 total samples
- weights=[0.9, 0.1] - 90% class 0 (negative), 10% class 1 (positive) - IMBALANCED!
- random_state=42 - Makes results reproducible
- train_test_split() - Splits into 75% training, 25% testing
Expected distribution:
Training set: ~750 samples (~675 negative, ~75 positive)
Testing set: ~250 samples (~225 negative, ~25 positive)
This 90:10 imbalance is common in real-world problems like fraud detection, disease diagnosis, or spam filtering!
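The expected split sizes can be verified directly; a quick check (assuming train_test_split's default test_size of 0.25):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

print(f"Train size: {len(y_train)}, class counts: {np.bincount(y_train)}")
print(f"Test size:  {len(y_test)}, class counts: {np.bincount(y_test)}")
# Roughly 90:10 in each split; pass stratify=y to make the ratio exact per split
```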
Step 3: Train the Model
# Train model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
Training and predicting:
- LogisticRegression() - Creates a logistic regression classifier
- model.fit(X_train, y_train) - Trains on training data
- model.predict(X_test) - Gets hard predictions (0 or 1)
- model.predict_proba(X_test) - Gets probability scores for each class
- [:, 1] - Takes only the probability of class 1 (positive class)
Two types of predictions:
y_pred (hard): [0, 1, 0, 0, 1, ...] → Used for accuracy, precision, recall, F1
y_proba (soft): [0.2, 0.8, 0.3, 0.1, 0.9, ...] → Used for ROC-AUC
Soft predictions give more information - they tell you HOW CONFIDENT the model is about each prediction!
Step 4: Calculate and Display All Metrics
# Complete metrics report
print("="*50)
print("CLASSIFICATION METRICS REPORT")
print("="*50)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall: {recall_score(y_test, y_pred):.3f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")
print("="*50)
Understanding each metric:
- accuracy_score(y_test, y_pred) - (TP + TN) / Total - Can be misleading!
- precision_score(y_test, y_pred) - TP / (TP + FP) - "When I say positive, am I right?"
- recall_score(y_test, y_pred) - TP / (TP + FN) - "Did I catch all positives?"
- f1_score(y_test, y_pred) - 2 × (P × R) / (P + R) - Balance of both
- roc_auc_score(y_test, y_proba) - Uses probabilities, not hard predictions!
Example output:
==================================================
CLASSIFICATION METRICS REPORT
==================================================
Accuracy:  0.920  ← Looks great, but...
Precision: 0.667  ← Only 2/3 of "positive" predictions correct
Recall:    0.480  ← Missing half the actual positives!
F1 Score:  0.558  ← Shows the real picture
ROC-AUC:   0.892  ← Good ranking ability
==================================================
Key Insight: Notice how accuracy (92%) looks great, but recall (48%) reveals we're missing half the positive cases! This is why you must look at ALL metrics together.
Practice Questions
Task: Create a function that returns all classification metrics in a dictionary.
Show Solution
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, balanced_accuracy_score, matthews_corrcoef,
                             roc_auc_score, average_precision_score)
def evaluate_classifier(y_true, y_pred, y_proba=None):
    metrics = {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
        'balanced_acc': balanced_accuracy_score(y_true, y_pred),
        'mcc': matthews_corrcoef(y_true, y_pred)
    }
    if y_proba is not None:
        metrics['roc_auc'] = roc_auc_score(y_true, y_proba)
        metrics['avg_precision'] = average_precision_score(y_true, y_proba)
    return metrics
# Usage
results = evaluate_classifier(y_test, y_pred, y_proba)
for name, score in results.items():
    print(f"{name}: {score:.3f}")
Reusable function: This evaluation function returns all metrics in a dictionary, making it easy to log, compare, or display results. Include probability-based metrics (ROC-AUC, Average Precision) when probabilities are available!
Task: For each scenario below, identify which metric you should prioritize and explain why.
Show Solution
scenarios = [
{
"name": "Airport Security (detecting weapons)",
"metric": "RECALL",
"reason": "Missing a weapon could be fatal. False alarms are inconvenient but acceptable."
},
{
"name": "YouTube Recommendation System",
"metric": "PRECISION",
"reason": "Recommending irrelevant videos annoys users. Missing some good videos is fine."
},
{
"name": "Credit Card Fraud Detection",
"metric": "F1 or F2 (recall-weighted)",
"reason": "Missing fraud is costly, but too many false alarms block legitimate purchases."
},
{
"name": "Email Spam Filter",
"metric": "PRECISION (or F0.5)",
"reason": "Sending important emails to spam is very bad. Some spam in inbox is tolerable."
},
{
"name": "Autonomous Car Pedestrian Detection",
"metric": "RECALL (near 100%)",
"reason": "Missing a pedestrian could kill someone. False braking is safe."
},
{
"name": "Product Defect Detection in Factory",
"metric": "RECALL",
"reason": "Shipping defective products damages brand reputation and safety."
}
]
print("=== Metric Selection for Real-World Scenarios ===")
for s in scenarios:
    print(f"\n{s['name']}")
    print(f"  Best Metric: {s['metric']}")
    print(f"  Why: {s['reason']}")
Decision framework: This reference shows how to match metrics to real-world scenarios. Key insight: safety-critical applications (security, medical, autonomous vehicles) typically need high recall, while user experience applications (spam, recommendations) need high precision.
Task: Create a custom scoring function where: FN costs $500 (missed fraud), FP costs $5 (blocked legitimate transaction). Calculate total cost.
Show Solution
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
def calculate_cost(y_true, y_pred, fn_cost=500, fp_cost=5):
    """
    Calculate total cost of classification errors.
    In fraud detection:
    - FN (missed fraud): Customer loses money, bank pays = HIGH COST
    - FP (blocked transaction): Customer inconvenienced = LOW COST
    """
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    total_cost = (fn * fn_cost) + (fp * fp_cost)
    print("Confusion Matrix:")
    print(f"  TN={tn}, FP={fp}")
    print(f"  FN={fn}, TP={tp}")
    print("\nCost Breakdown:")
    print(f"  FN cost: {fn} × ${fn_cost} = ${fn * fn_cost}")
    print(f"  FP cost: {fp} × ${fp_cost} = ${fp * fp_cost}")
    print(f"  Total Cost: ${total_cost}")
    return total_cost
# Example with fraud detection
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Compare two models with different thresholds
model = LogisticRegression()
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
print("=== Model 1: Default threshold (0.5) ===")
y_pred_default = (y_proba >= 0.5).astype(int)
cost1 = calculate_cost(y_test, y_pred_default)
print("\n=== Model 2: Lower threshold (0.3) - catches more fraud ===")
y_pred_lower = (y_proba >= 0.3).astype(int)
cost2 = calculate_cost(y_test, y_pred_lower)
print(f"\n=== Comparison ===")
print(f"Savings by using lower threshold: ${cost1 - cost2}")
Business value: This cost-sensitive approach translates model errors into dollars. By adjusting the threshold to catch more fraud (lowering from 0.5 to 0.3), we reduce missed fraud (expensive FNs) at the cost of more blocked transactions (cheap FPs). The net result is usually significant savings!
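Rather than comparing just two hand-picked thresholds, you can sweep a whole range and keep the cheapest one. A minimal self-contained sketch, using synthetic probabilities and the same assumed $500/$5 costs as above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def total_cost(y_true, y_pred, fn_cost=500, fp_cost=5):
    """Dollar cost of errors: expensive missed fraud, cheap blocked transactions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fn * fn_cost + fp * fp_cost

# Synthetic labels and fraud probabilities (positives tend to score higher)
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=500)
y_proba = np.clip(0.4 * y_true + 0.6 * rng.random(500), 0, 1)

# Sweep candidate thresholds and keep the cost-minimizing one
thresholds = np.arange(0.05, 0.95, 0.05)
costs = [total_cost(y_true, (y_proba >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmin(costs))]
print(f"Cheapest threshold: {best_t:.2f} (total cost ${min(costs)})")
```

Because false negatives cost 100× more than false positives here, the sweep typically lands on a low threshold — the model blocks more legitimate transactions in exchange for missing far less fraud.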
Task: Train 3 different models and create a comparison table showing all metrics. Which model would you choose for cancer detection?
Show Solution
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import pandas as pd
# Load cancer detection data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=42)
# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=5000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
}
# Evaluate each model
results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    results.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred),
        'ROC-AUC': roc_auc_score(y_test, y_proba)
    })
# Display comparison table
df = pd.DataFrame(results)
print("=== Model Comparison Table ===")
print(df.to_string(index=False))
# Recommendation for cancer detection
print("\n=== Recommendation for Cancer Detection ===")
print("For cancer detection, we prioritize RECALL!")
print("We must catch ALL cancer cases, even at the cost of some false alarms.")
best_recall_model = df.loc[df['Recall'].idxmax(), 'Model']
print(f"\nBest model by Recall: {best_recall_model}")
Model selection strategy: This creates a comparison table with all metrics, then selects the best model based on the most important metric for the use case. For cancer detection, we pick the model with highest recall to minimize missed diagnoses!
Key Takeaways
Accuracy Trap
Don't rely on accuracy alone, especially with imbalanced data. A model predicting only the majority class can have high accuracy but be useless.
Confusion Matrix
The confusion matrix is the foundation. TP, TN, FP, FN let you calculate all other metrics and understand where your model fails.
Precision vs Recall
There is always a trade-off. Optimizing for one usually hurts the other. Choose based on which error type is more costly.
F1 Score
Use F1 when you need a single number that balances precision and recall. It is especially useful for imbalanced datasets.
ROC-AUC
AUC measures ranking ability across all thresholds. Great for comparing models, but consider PR curves for imbalanced data.
Domain Matters
Choose metrics based on your specific problem. Cancer detection needs high recall. Spam detection needs high precision.
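The precision–recall trade-off is easy to see by sliding the decision threshold over a handful of toy scores (the values below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy scores: positives tend to score higher, but the classes overlap
y_true  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_score = np.array([0.10, 0.20, 0.35, 0.50, 0.60, 0.40, 0.55, 0.70, 0.80, 0.90])

# Raising the threshold trades recall for precision
for t in (0.3, 0.5, 0.7):
    y_pred = (y_score >= t).astype(int)
    print(f"threshold={t}: precision={precision_score(y_true, y_pred):.2f}, "
          f"recall={recall_score(y_true, y_pred):.2f}")
```

As the threshold rises from 0.3 to 0.7, precision climbs to 1.00 while recall falls from 1.00 to 0.60 — exactly the trade-off described above.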
Knowledge Check
Test your understanding of classification metrics:
What is the main problem with using accuracy on imbalanced datasets?
In a confusion matrix, what does a False Positive represent?
For a disease detection system where missing a sick patient is very dangerous, which metric should you prioritize?
What is the F1 score?
What does an ROC-AUC score of 0.5 indicate?
When would you prefer Precision-Recall curves over ROC curves?