The Accuracy Trap
Accuracy seems like the obvious choice for evaluating classifiers - it tells you what percentage of predictions are correct. But in many real-world scenarios, accuracy is dangerously misleading. When classes are imbalanced, a model can achieve high accuracy while being completely useless.
Think of it Like This (Airport Security Analogy)
Imagine an airport security system that checks for dangerous items. If 99.9% of passengers are safe, a "lazy" system that says "everyone is safe" would be 99.9% accurate! But it would miss every single dangerous person - making it completely useless. This is exactly what happens with accuracy in machine learning when your classes are imbalanced.
What is Accuracy?
Accuracy is the ratio of correct predictions to total predictions. It measures overall correctness but treats all errors equally, regardless of their real-world consequences.
Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
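As a quick numeric check (the counts below are made up for illustration), the formula can be computed directly:

```python
# Hypothetical confusion-matrix counts, chosen for illustration
tp, tn, fp, fn = 40, 50, 5, 5

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy: {accuracy:.0%}")  # 90%
```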
The Imbalanced Class Problem
Consider a fraud detection system where only 1% of transactions are fraudulent. A model that predicts "not fraud" for everything achieves 99% accuracy - but catches zero fraud! This is the accuracy paradox.
Visualizing the Problem
Your Dataset (1000 samples)
Lazy Model: "Everything is Normal"
- 990 correct (normal → normal)
- 10 missed (fraud → normal)
Accuracy: 99% but 0 frauds caught!
Step 1: Import Libraries
# The Accuracy Paradox - Imbalanced Classes
from sklearn.metrics import accuracy_score
import numpy as np
What we're importing:
- `accuracy_score` - Function to calculate accuracy (correct predictions / total predictions)
- `numpy` - Library for working with arrays and numerical operations
Step 2: Create an Imbalanced Dataset
# Step 1: Create our dataset
# We have 1000 transactions total
# - 990 are normal transactions (labeled as 0)
# - Only 10 are fraudulent (labeled as 1)
# This is a "class imbalance" - one class dominates!
y_true = np.array([0] * 990 + [1] * 10)
Understanding the data:
- `[0] * 990` - Creates a list with 990 zeros (normal transactions)
- `[1] * 10` - Creates a list with 10 ones (fraudulent transactions)
- `[0]*990 + [1]*10` - Combines both lists into one list of 1000 elements
- `np.array()` - Converts the list to a NumPy array for efficient operations
Class Distribution:
Class 0 (Normal): 990 samples (99%)
Class 1 (Fraud): 10 samples (1%)
This is a 99:1 class imbalance!
Step 3: Create a "Lazy" Model
# Step 2: Create a "lazy" model
# This useless model ALWAYS predicts "not fraud" (0)
# It never even looks at the data!
y_pred_useless = np.zeros(1000) # All zeros = all "not fraud"
What this "model" does:
- `np.zeros(1000)` - Creates an array of 1000 zeros
- This means: predict "Not Fraud" (0) for EVERY transaction
- The model doesn't analyze any features - it just says "everything is normal"
- This is what we call a "majority class classifier" or "lazy classifier"
Reality Check: This is NOT a real model - it's a demonstration of how a useless prediction strategy can still get high accuracy!
Step 4: Calculate Accuracy
# Step 3: Calculate accuracy
# accuracy = correct predictions / total predictions
accuracy = accuracy_score(y_true, y_pred_useless)
print(f"Accuracy: {accuracy:.1%}") # 99.0% - WOW, looks amazing!
How accuracy is calculated:
- `accuracy_score(y_true, y_pred)` - Compares actual vs predicted labels
- Formula: Accuracy = Correct Predictions / Total Predictions
- Our calculation: 990 correct (normal predicted as normal) + 0 correct (fraud predicted as fraud)
- Result: 990 / 1000 = 0.99 = 99%
Breakdown:
990 normal transactions predicted as normal = CORRECT
10 fraud transactions predicted as normal = WRONG
990 correct out of 1000 = 99% accuracy
Step 5: The Reality Check - How Many Frauds Caught?
# Step 4: But wait... how many frauds did we catch?
frauds_caught = int(sum(y_pred_useless[y_true == 1]))
print(f"Frauds detected: {frauds_caught} out of 10") # 0 - TERRIBLE!
# The model is USELESS despite 99% accuracy!
Understanding the code:
- `y_true == 1` - Creates a boolean mask: True where the actual label is 1 (fraud)
- `y_pred_useless[y_true == 1]` - Gets predictions only for the actual fraud cases
- `sum(...)` - Counts how many of those predictions were 1 (detected as fraud)
- Result: 0 frauds detected out of 10 actual frauds!
THE ACCURACY PARADOX:
- Accuracy: 99% (sounds amazing!)
- Frauds caught: 0/10 (completely useless!)
- This model would cost a company MILLIONS in undetected fraud!
When Accuracy Fails
- Imbalanced datasets: When one class dominates (fraud detection, disease diagnosis)
- Different error costs: When false positives and false negatives have different consequences
- Rare event prediction: When you care most about finding the minority class
Common Beginner Mistake
The Mistake: "My model has 98% accuracy, so it must be great!"
The Reality: Always check your class distribution first. If 98% of your data is one class, even a random model could get 98% accuracy.
The Fix: Always look at precision, recall, and confusion matrix alongside accuracy. These tell you how well your model performs on EACH class, not just overall.
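As a minimal sketch of that fix, sklearn's `classification_report` prints precision and recall per class. Run on the same lazy all-zeros setup as the fraud example above, it makes the failure obvious even though accuracy is 99%:

```python
from sklearn.metrics import classification_report
import numpy as np

# Same setup as the fraud example: 990 normal (0), 10 fraud (1)
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)  # lazy model: always "not fraud"

# Per-class precision/recall exposes what overall accuracy hides:
# recall for class 1 is 0.00 despite 99% accuracy
print(classification_report(y_true, y_pred, zero_division=0))
```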
Step 1: Import Libraries
# A slightly better model - but accuracy doesn't show it
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
What we're importing:
- `LogisticRegression` - A popular classification algorithm for binary problems
- `make_classification` - Function to generate synthetic classification datasets
- `train_test_split` - Splits data into training and testing sets
- `accuracy_score` - Calculates the accuracy metric
Step 2: Create Imbalanced Dataset
# Create imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
Understanding the parameters:
- `n_samples=1000` - Create 1000 total samples
- `weights=[0.95, 0.05]` - Class distribution: 95% class 0, 5% class 1 (imbalanced!)
- `n_features=10` - Each sample has 10 input features
- `random_state=42` - Ensures reproducibility (same data each time)
After train_test_split:
X_train, y_train: 750 samples (75%) - for training
X_test, y_test: 250 samples (25%) - for testing
Expected class distribution in the test set:
- Class 0: ~238 samples (95%)
- Class 1: ~12 samples (5%)
Step 3: Train the Model
# Train a real model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Training process:
- `LogisticRegression()` - Creates a logistic regression classifier object
- `model.fit(X_train, y_train)` - Trains the model by learning patterns from the training data
- `model.predict(X_test)` - Uses the trained model to predict labels for the test data
Output: y_pred contains 250 predictions (0 or 1) for each test sample
Step 4: Evaluate and Analyze Results
print(f"Accuracy: {accuracy_score(y_test, y_pred):.1%}")
print(f"Minority class in test: {sum(y_test)} samples")
print(f"Minority class predicted: {sum(y_pred)} samples")
Understanding the output:
- `accuracy_score(y_test, y_pred)` - Overall accuracy (typically high due to imbalance)
- `sum(y_test)` - Counts actual minority class samples (1s) in the test set
- `sum(y_pred)` - Counts how many minority class predictions the model made
What to look for:
If sum(y_pred) << sum(y_test): Model is UNDER-predicting the minority class (missing positives)
If sum(y_pred) >> sum(y_test): Model is OVER-predicting the minority class (too many false alarms)
Ideally: sum(y_pred) ≈ sum(y_test)
Key Insight: High accuracy alone doesn't tell you if the model is actually detecting the minority class. Always compare predicted vs actual minority class counts!
Practice Questions
Test your understanding with these coding challenges.
Task: Given true labels and predictions, calculate accuracy manually and with sklearn.
Show Solution
from sklearn.metrics import accuracy_score
import numpy as np
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])
# Manual calculation
correct = sum(y_true == y_pred)
total = len(y_true)
manual_acc = correct / total
print(f"Manual accuracy: {manual_acc:.1%}")
# sklearn
sklearn_acc = accuracy_score(y_true, y_pred)
print(f"Sklearn accuracy: {sklearn_acc:.1%}")
Line-by-Line Explanation:
- `y_true == y_pred` - Creates a boolean array: True where the prediction matches the actual label
- `sum(...)` - Counts True values (correct predictions) = 7 out of 10
- `correct / total` - Manual formula: 7/10 = 0.70 = 70%
- `accuracy_score()` - sklearn's function does the same calculation automatically
Result: Both methods give 70% accuracy - 7 correct predictions out of 10 total.
Task: Create datasets with different imbalance ratios and show how a dummy classifier accuracy changes.
Show Solution
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
for imbalance in [0.5, 0.7, 0.9, 0.99]:
    X, y = make_classification(n_samples=1000, weights=[imbalance],
                               n_features=5, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    dummy = DummyClassifier(strategy='most_frequent')
    dummy.fit(X_train, y_train)
    acc = dummy.score(X_test, y_test)
    print(f"Imbalance {imbalance:.0%}: Dummy accuracy = {acc:.1%}")
Line-by-Line Explanation:
- `for imbalance in [0.5, 0.7, 0.9, 0.99]` - Tests four different class imbalance levels
- `weights=[imbalance]` - Sets the proportion of the majority class (50%, 70%, 90%, 99%)
- `DummyClassifier(strategy='most_frequent')` - A "dumb" model that always predicts the most common class
- `dummy.score()` - Returns accuracy on the test data
Result Pattern: 50% imbalance → ~50% accuracy | 90% imbalance → ~90% accuracy | 99% imbalance → ~99% accuracy. The dummy model's accuracy equals the majority class proportion!
Task: Given a dataset with 950 negative and 50 positive samples, calculate what accuracy a model would get if it predicted all negatives. Then create a model that catches at least 40 positives and compare.
Show Solution
import numpy as np
from sklearn.metrics import accuracy_score
# Create imbalanced dataset
y_true = np.array([0] * 950 + [1] * 50) # 950 neg, 50 pos
# Lazy model: predict all negative
y_pred_lazy = np.zeros(1000)
lazy_acc = accuracy_score(y_true, y_pred_lazy)
print(f"Lazy model accuracy: {lazy_acc:.1%}") # 95%!
print(f"Positives caught: {int(sum(y_pred_lazy[y_true == 1]))}/50")
# Better model: catches 40 positives, but makes 80 false positives
y_pred_better = np.zeros(1000)
y_pred_better[950:990] = 1  # Catches 40 of 50 positives
y_pred_better[800:880] = 1  # 80 false positives
better_acc = accuracy_score(y_true, y_pred_better)
print(f"\nBetter model accuracy: {better_acc:.1%}")  # 91% - lower than the lazy model!
print(f"Positives caught: {int(sum(y_pred_better[y_true == 1]))}/50")
print("\nThe 'better' model has LOWER accuracy but catches more fraud!")
Key insight: The lazy model has 95% accuracy but catches ZERO fraud. The "better" model has lower accuracy but catches 40/50 fraud cases. This proves accuracy alone is misleading for imbalanced data!
Task: Create a function that takes y_true and returns the "baseline accuracy" - the accuracy you'd get by always predicting the most common class. Use this to evaluate if a model is actually learning.
Show Solution
import numpy as np
from collections import Counter
from sklearn.metrics import accuracy_score
def get_baseline_accuracy(y_true):
    """Calculate baseline accuracy (always predict majority class)"""
    # Find most common class
    counter = Counter(y_true)
    most_common_class, count = counter.most_common(1)[0]
    # Baseline accuracy = majority class proportion
    baseline_acc = count / len(y_true)
    return baseline_acc, most_common_class

def is_model_useful(y_true, y_pred):
    """Check if model beats baseline"""
    baseline_acc, majority = get_baseline_accuracy(y_true)
    model_acc = accuracy_score(y_true, y_pred)
    print(f"Baseline accuracy: {baseline_acc:.1%} (always predict {majority})")
    print(f"Model accuracy: {model_acc:.1%}")
    print(f"Improvement: {(model_acc - baseline_acc)*100:.1f} percentage points")
    return model_acc > baseline_acc

# Test it
y_true = np.array([0]*900 + [1]*100)
y_pred = np.array([0]*850 + [1]*150)  # Some predictions
is_model_useful(y_true, y_pred)
What this function does: Creates a reusable tool to check if your model is actually learning. It calculates the baseline (majority-class) accuracy and compares your model against it. If your model doesn't beat the baseline, it's not useful!
Confusion Matrix
The confusion matrix is the foundation of classification evaluation. It breaks down predictions into four categories: True Positives, True Negatives, False Positives, and False Negatives. Understanding these four quadrants is essential for computing all other classification metrics.
Think of it Like This (Medical Test Analogy)
Imagine a COVID test. True Positive = You have COVID and test positive (correct!). True Negative = You don't have COVID and test negative (correct!). False Positive = You're healthy but test positive (false alarm!). False Negative = You have COVID but test negative (missed! dangerous!).
The Four Quadrants
True Positive (TP): Predicted positive, actually positive
True Negative (TN): Predicted negative, actually negative
False Positive (FP): Predicted positive, actually negative (Type I error - false alarm)
False Negative (FN): Predicted negative, actually positive (Type II error - missed detection)
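The four counts can be computed directly from a pair of toy label arrays (made up here for illustration):

```python
import numpy as np

y_true = np.array([1, 1, 0, 0, 1, 0])  # actual labels
y_pred = np.array([1, 0, 0, 1, 1, 0])  # model predictions

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted positive, actually positive
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # predicted negative, actually negative
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted positive, actually negative
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted negative, actually positive
print(tp, tn, fp, fn)  # 2 2 1 1
```

Every sample falls into exactly one quadrant, so the four counts always sum to the total number of samples.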
Visual Confusion Matrix
| Actual \ Predicted | Negative (0) | Positive (1) |
|---|---|---|
| Negative (0) | TN (correct rejection) | FP (false alarm) |
| Positive (1) | FN (missed!) | TP (correct detection) |

Reading tip: the diagonal cells (TN, TP) are correct predictions; the off-diagonal cells (FP, FN) are errors.
Memory Trick
The first word tells you if the prediction was right (True/False). The second word tells you what was predicted (Positive/Negative). So "False Positive" = wrong prediction + predicted positive = predicted positive when it was actually negative.
Step 1: Import Required Libraries
# Creating and visualizing a confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
What we're importing:
- `confusion_matrix` - Function to create the confusion matrix from predictions
- `ConfusionMatrixDisplay` - For visualizing the matrix as a heatmap
- `LogisticRegression` - Our classification model
- `load_breast_cancer` - Built-in dataset with 569 samples (cancer diagnosis)
- `train_test_split` - Splits data into training and testing sets
Step 2: Load and Split the Data
# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, random_state=42
)
Data preparation:
- `load_breast_cancer()` - Loads the dataset with 30 features (tumor measurements)
- `data.data` - The features (X) - measurements like radius, texture, etc.
- `data.target` - The labels (y) - 0 = malignant, 1 = benign
- `train_test_split()` - Default split: 75% training, 25% testing
- `random_state=42` - Makes the split reproducible (same results every time)
Result: X_train/y_train = 426 samples, X_test/y_test = 143 samples
Step 3: Train the Model
# Train model
model = LogisticRegression(max_iter=5000, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Model training:
- `LogisticRegression(max_iter=5000)` - Creates the classifier (5000 iterations to ensure convergence)
- `model.fit(X_train, y_train)` - Trains the model to learn patterns from the training data
- `model.predict(X_test)` - Uses the trained model to predict on unseen test data
Output: y_pred contains 143 predictions (0 or 1) for each test sample
Step 4: Create the Confusion Matrix
# Create confusion matrix
# This function compares y_test (actual) vs y_pred (predicted)
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# Output looks like:
# [[TN, FP],
# [FN, TP]]
Creating the matrix:
- `confusion_matrix(y_test, y_pred)` - Compares actual labels vs predicted labels
- First argument = actual values, second argument = predicted values
- Returns a 2x2 numpy array for binary classification
Matrix Layout:
Predicted 0 | Predicted 1
Actual 0: TN | FP
Actual 1: FN | TP
Step 5: Extract Individual Values
# Extract each value for clarity
print(f"\nTrue Negatives (TN): {cm[0,0]} - Correctly said 'no cancer'")
print(f"False Positives (FP): {cm[0,1]} - Wrongly said 'cancer' (false alarm)")
print(f"False Negatives (FN): {cm[1,0]} - Wrongly said 'no cancer' (MISSED!)")
print(f"True Positives (TP): {cm[1,1]} - Correctly detected cancer")
Accessing matrix values:
- `cm[0,0]` → TN (row 0, col 0) - Actual=0, Predicted=0
- `cm[0,1]` → FP (row 0, col 1) - Actual=0, Predicted=1
- `cm[1,0]` → FN (row 1, col 0) - Actual=1, Predicted=0
- `cm[1,1]` → TP (row 1, col 1) - Actual=1, Predicted=1
Remember: In medical diagnosis, FN is the most dangerous error - missing actual cancer cases!
Reading the Confusion Matrix
Rows represent actual classes, columns represent predicted classes. The diagonal shows correct predictions. For binary classification, position [0,0] is TN, [0,1] is FP, [1,0] is FN, and [1,1] is TP.
Step-by-Step: Reading a Confusion Matrix
- Look at the diagonal (top-left to bottom-right): These are your CORRECT predictions. Higher is better!
- Look at off-diagonal elements: These are your ERRORS. Lower is better!
- Top-right (FP): How many times you cried wolf? (said positive but was negative)
- Bottom-left (FN): How many times you missed danger? (said negative but was positive)
Step 1: Define Your Data
# Extract metrics from confusion matrix
from sklearn.metrics import confusion_matrix
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]
Setting up test data:
- `y_true` - Actual labels: 3 negatives (0) + 5 positives (1) = 8 samples total
- `y_pred` - What our model predicted for each sample
Visual Comparison:
Sample:  1 2 3 4 5 6 7 8
Actual:  0 0 0 1 1 1 1 1
Predict: 0 1 0 1 1 0 1 1
Result:  ✓ ✗ ✓ ✓ ✓ ✗ ✓ ✓
Step 2: Create Matrix and Extract Values
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(f"True Negatives: {tn}") # Correctly predicted negatives
print(f"False Positives: {fp}") # Type I errors
print(f"False Negatives: {fn}") # Type II errors
print(f"True Positives: {tp}") # Correctly predicted positives
Extracting values:
- `confusion_matrix(y_true, y_pred)` - Creates the 2x2 matrix
- `cm.ravel()` - Flattens the matrix into a 1D array: [TN, FP, FN, TP]
- Unpacking: `tn, fp, fn, tp = cm.ravel()` assigns each value to a variable
Expected Output:
True Negatives: 2 (correctly predicted negative)
False Positives: 1 (wrongly predicted positive - a false alarm)
False Negatives: 1 (missed an actual positive!)
True Positives: 4 (correctly predicted positive)
Step 3: Calculate Accuracy Manually
# Calculate metrics manually
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"\nAccuracy: {accuracy:.1%}")
Accuracy formula:
- Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
- In words: (Correct predictions) / (All predictions)
- Calculation: (4 + 2) / (4 + 2 + 1 + 1) = 6/8 = 75%
Key Insight: From just these 4 values (TN, FP, FN, TP), you can calculate ALL classification metrics!
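For instance, reusing the same four values from the example above (TN=2, FP=1, FN=1, TP=4), precision, recall, and F1 fall out of the same counts:

```python
tn, fp, fn, tp = 2, 1, 1, 4  # values from the example above

precision = tp / (tp + fp)  # 4/5 = 0.8
recall = tp / (tp + fn)     # 4/5 = 0.8
f1 = 2 * precision * recall / (precision + recall)
print(f"Precision: {precision:.1%}, Recall: {recall:.1%}, F1: {f1:.1%}")
```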
Practice Questions
Task: Given predictions and true labels, create a confusion matrix using sklearn.
Show Solution
from sklearn.metrics import confusion_matrix
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
tn, fp, fn, tp = cm.ravel()
print(f"\nTP: {tp}, TN: {tn}, FP: {fp}, FN: {fn}")
What this code does: Creates a 2x2 confusion matrix and extracts all four values using ravel(). The matrix shows how predictions compare to actual labels - rows are true labels, columns are predictions.
Task: Create a confusion matrix for a 3-class classification problem.
Show Solution
from sklearn.metrics import confusion_matrix
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, random_state=42
)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print("Multi-class Confusion Matrix:")
print(cm)
print(f"\nClasses: {iris.target_names}")
What this code shows: For 3+ classes, the confusion matrix becomes NxN. Each cell [i,j] shows how many samples of class i were predicted as class j. Diagonal values are correct predictions.
Task: Given a confusion matrix with TN=85, FP=5, FN=3, TP=7, calculate accuracy, precision, recall, and F1 manually.
Show Solution
# Given values
TN, FP, FN, TP = 85, 5, 3, 7
# Calculate all metrics
accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * (precision * recall) / (precision + recall)
specificity = TN / (TN + FP)
print(f"Total samples: {TP + TN + FP + FN}")
print(f"\nAccuracy: {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall (Sensitivity): {recall:.2%}")
print(f"Specificity: {specificity:.2%}")
print(f"F1 Score: {f1:.2%}")
# Verify with sklearn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np
y_true = [0]*TN + [0]*FP + [1]*FN + [1]*TP
y_pred = [0]*TN + [1]*FP + [0]*FN + [1]*TP
print(f"\nVerification with sklearn:")
print(f"Accuracy: {accuracy_score(y_true, y_pred):.2%}")
What this code teaches: Shows how to calculate ALL metrics from just 4 numbers (TP, TN, FP, FN). This is essential knowledge - once you understand the confusion matrix, you can compute any classification metric!
Task: Create a function that displays a confusion matrix as a colored heatmap with labels.
Show Solution
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
import numpy as np
def plot_confusion_matrix(y_true, y_pred, labels=None):
    """Plot a confusion matrix heatmap"""
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=labels or ['Negative', 'Positive'],
                yticklabels=labels or ['Negative', 'Positive'])
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.title('Confusion Matrix')
    # Print totals
    total = cm.sum()
    correct = np.trace(cm)
    print(f"Total samples: {total}")
    print(f"Correct predictions: {correct}")
    print(f"Accuracy: {correct/total:.1%}")
    plt.tight_layout()
    plt.show()
# Example usage
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, random_state=42)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
plot_confusion_matrix(y_test, y_pred, labels=['Malignant', 'Benign'])
What this function creates: A reusable visualization function using seaborn heatmap. Color intensity shows cell values, making it easy to spot where the model makes mistakes. Great for presentations!
Task: Given a confusion matrix, identify how many Type I errors (FP) and Type II errors (FN) occurred, and explain which is worse for a medical diagnosis scenario.
Show Solution
import numpy as np
from sklearn.metrics import confusion_matrix
# Example: Cancer detection results
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1] # 5 healthy, 5 sick
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0] # Model predictions
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print("Confusion Matrix:")
print(cm)
print(f"\n--- Error Analysis ---")
print(f"Type I Errors (False Positives): {fp}")
print(f" -> Healthy people told they have cancer")
print(f" -> Causes: Unnecessary stress, extra tests")
print(f"\nType II Errors (False Negatives): {fn}")
print(f" -> Sick people told they're healthy")
print(f" -> Causes: Delayed treatment, could be FATAL!")
print(f"\n--- For Medical Diagnosis ---")
print(f"Type II errors (FN) are MUCH WORSE!")
print(f"Missing a cancer case could cost a life.")
print(f"A false alarm just means more tests.")
print(f"\nConclusion: Optimize for HIGH RECALL (catch all cancers)")
Real-world insight: This code demonstrates why understanding error types matters. In medical diagnosis, Type II errors (false negatives) are often deadly - missing a cancer case is far worse than a false alarm that leads to extra tests.
Precision and Recall
Precision and recall are complementary metrics that focus on the positive class. Precision asks "Of all positive predictions, how many were correct?" while recall asks "Of all actual positives, how many did we find?" The trade-off between them is fundamental to classification.
Think of it Like This (Search Engine Analogy)
You search "chocolate cake recipes" and get 100 results:
- Precision: Of the 100 results shown, how many are ACTUALLY chocolate cake recipes? (Are the results relevant?)
- Recall: Of ALL chocolate cake recipes that exist on the internet, what percentage did the search find? (Did we find everything?)
The Two Key Questions
Precision asks:
"When I say YES, am I right?"
Out of everyone I predicted as positive, how many actually were positive?
Recall asks:
"Did I find them all?"
Out of everyone who was actually positive, how many did I correctly identify?
Precision
Of all positive predictions, how many were actually positive?
Precision = TP / (TP + FP)
High precision means few false positives. Important when false alarms are costly (spam detection).
Recall (Sensitivity)
Of all actual positives, how many did we correctly identify?
Recall = TP / (TP + FN)
High recall means few false negatives. Important when missing positives is costly (disease detection).
# Calculating precision and recall
from sklearn.metrics import precision_score, recall_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Create imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(f"Precision: {precision:.3f}") # Of predicted positives, how many correct?
print(f"Recall: {recall:.3f}") # Of actual positives, how many found?
- `make_classification(weights=[0.9, 0.1])` - Creates imbalanced data: 90% class 0, 10% class 1
- `precision_score(y_test, y_pred)` - Calculates TP / (TP + FP) - "When I say positive, am I right?"
- `recall_score(y_test, y_pred)` - Calculates TP / (TP + FN) - "Did I catch all the positives?"
Interpretation: If precision=0.8 and recall=0.6, then 80% of your positive predictions are correct, but you only caught 60% of actual positives. Use both metrics together to understand your model's behavior!
The Precision-Recall Trade-off
You can increase recall by predicting more positives (lower threshold), but this usually decreases precision. Conversely, being more conservative increases precision but lowers recall. The right balance depends on your problem.
Real-World Trade-off Examples
Spam Filter (Needs High Precision)
If your spam filter marks an important email as spam, you might miss a job offer or important message!
Better to let some spam through (accept lower recall) than to block real emails (keep precision high).
Cancer Screening (Needs High Recall)
If the test misses someone with cancer, they might not get treatment in time!
Better to raise some false alarms (accept lower precision) than to miss an actual cancer (keep recall high).
# Precision-Recall trade-off with threshold adjustment
from sklearn.metrics import precision_score, recall_score
import numpy as np
# Get probability predictions
y_proba = model.predict_proba(X_test)[:, 1]
print("Threshold | Precision | Recall")
print("-" * 35)
for threshold in [0.3, 0.4, 0.5, 0.6, 0.7]:
    y_pred_thresh = (y_proba >= threshold).astype(int)
    prec = precision_score(y_test, y_pred_thresh, zero_division=0)
    rec = recall_score(y_test, y_pred_thresh)
    print(f"   {threshold:.1f}    |   {prec:.3f}   |  {rec:.3f}")
- `predict_proba(X_test)[:, 1]` - Gets probability scores (0-1) for the positive class
- `(y_proba >= threshold).astype(int)` - Converts probabilities to 0/1 predictions using the threshold
- `zero_division=0` - Prevents an error when a high threshold yields no positive predictions
Trade-off Pattern:
- Threshold 0.3 → Predict more positives → High recall (~0.9), Low precision (~0.5)
- Threshold 0.7 → Predict fewer positives → Low recall (~0.5), High precision (~0.9)
- Default 0.5 is rarely optimal - tune based on your use case!
Practice Questions
Task: Given TP=80, FP=20, FN=10, calculate precision and recall.
Show Solution
TP, FP, FN = 80, 20, 10
precision = TP / (TP + FP)
recall = TP / (TP + FN)
print(f"Precision: {precision:.2%}") # 80%
print(f"Recall: {recall:.2%}") # 88.9%
The formulas: Precision = TP/(TP+FP) measures "when I predict positive, am I correct?" Recall = TP/(TP+FN) measures "did I find all the positives?" Both are essential for understanding model behavior.
Task: Find the threshold that achieves at least 90% recall with maximum precision.
Show Solution
from sklearn.metrics import precision_recall_curve
import numpy as np

# Uses y_test and y_proba from the threshold example above
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
# Find the highest-precision threshold that still achieves 90% recall
target_recall = 0.90
valid_idx = recalls >= target_recall
if valid_idx.any():
    best_idx = np.argmax(precisions[:-1][valid_idx[:-1]])
    best_threshold = thresholds[valid_idx[:-1]][best_idx]
    print(f"Threshold: {best_threshold:.3f}")
    print(f"Precision: {precisions[:-1][valid_idx[:-1]][best_idx]:.3f}")
    print(f"Recall: {recalls[:-1][valid_idx[:-1]][best_idx]:.3f}")
What this code does: Uses sklearn's precision_recall_curve to find all threshold-precision-recall combinations, then filters to keep only thresholds achieving 90%+ recall, and picks the one with highest precision. This is how you optimize for a specific recall target!
Task: Create a table showing how precision and recall change as you adjust the classification threshold from 0.1 to 0.9.
Show Solution
from sklearn.metrics import precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np
# Create dataset
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2],
random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Train model
model = LogisticRegression()
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
# Show trade-off at different thresholds
print("Threshold | Precision | Recall | Predicted Pos")
print("-" * 50)
for thresh in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    y_pred = (y_proba >= thresh).astype(int)
    # Handle edge case of no positive predictions
    if sum(y_pred) == 0:
        prec = 0
    else:
        prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    n_pos = sum(y_pred)
    print(f"   {thresh:.1f}    |   {prec:.2f}    |  {rec:.2f}  |  {n_pos}")
print("\nNotice: Lower threshold = More positives = Higher recall, Lower precision")
Key pattern: This table reveals the fundamental precision-recall trade-off. Low threshold (0.1) = predict more positives = high recall but low precision. High threshold (0.9) = predict fewer positives = high precision but low recall. Choose based on your use case!
Task: A spam filter has precision=0.95 and recall=0.70. Explain in plain English what this means and whether this is good for a spam filter.
Show Solution
# Spam Filter Analysis
precision = 0.95
recall = 0.70
print("=== Spam Filter Performance Analysis ===")
print(f"\nPrecision: {precision:.0%}")
print(" What it means: When the filter marks an email as spam,")
print(" it's correct 95% of the time.")
print(" Only 5% of 'spam' emails are actually legitimate (false positives).")
print(f"\nRecall: {recall:.0%}")
print(" What it means: Of all actual spam emails,")
print(" the filter catches 70% of them.")
print(" 30% of spam emails slip through to your inbox (false negatives).")
print("\n=== Is this good for a spam filter? ===")
print("YES, this is EXCELLENT for a spam filter!")
print("\nWhy?")
print("1. HIGH PRECISION (95%) is critical for spam filters")
print(" - Losing important emails is VERY bad")
print(" - Missing a job offer or contract is worse than seeing some spam")
print("\n2. Lower recall (70%) is acceptable")
print(" - Seeing some spam in inbox is annoying but not harmful")
print(" - Users can manually delete spam that gets through")
print("\nFor spam filters: PRECISION > RECALL")
Real-world application: This analysis shows how to interpret metrics for a specific use case. For spam filters, blocking legitimate emails (false positives) is worse than missing some spam, so we prioritize precision over recall.
Task: For a 3-class classification problem, calculate macro, micro, and weighted average precision.
Show Solution
from sklearn.metrics import precision_score, classification_report
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Load 3-class dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, random_state=42)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Different averaging methods
print("=== Precision Averaging Methods ===")
print(f"\nMacro (simple average): {precision_score(y_test, y_pred, average='macro'):.3f}")
print(" -> Each class counts equally")
print(f"\nWeighted (by class size): {precision_score(y_test, y_pred, average='weighted'):.3f}")
print(" -> Larger classes have more influence")
print(f"\nMicro (total TP / total P): {precision_score(y_test, y_pred, average='micro'):.3f}")
print(" -> Treats all samples equally")
print("\n=== Full Classification Report ===")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Averaging methods explained: Macro = simple average (good when all classes equally important). Weighted = accounts for class sizes. Micro = treats each sample equally. Use classification_report() to see per-class metrics plus averages!
F1 Score and Other Metrics
The F1 score is the harmonic mean of precision and recall, providing a single number that balances both. When you need one metric to optimize, F1 is often a good choice for imbalanced problems. Other useful metrics include specificity, balanced accuracy, and Matthews Correlation Coefficient.
Why Do We Need F1?
Imagine you have to report ONE number to your boss about how good your model is. You can't say "precision is 80% and recall is 60%" - they want ONE number! F1 score combines both into a single metric. It's like a "balanced score" that's only high when BOTH precision AND recall are good.
F1 Score
F1 = 2 * (precision * recall) / (precision + recall). The harmonic mean penalizes extreme values, so F1 is only high when both precision and recall are high.
Why harmonic mean? If precision is 100% but recall is 1%, the arithmetic mean would be 50.5%. The harmonic mean correctly gives a low score of 1.98%.
Worked Example: Calculating F1 Score
Let's say your model has:
- Precision = 0.80 (80%)
- Recall = 0.60 (60%)
F1 = 2 × (0.80 × 0.60) / (0.80 + 0.60)
F1 = 2 × 0.48 / 1.40
F1 = 0.96 / 1.40 = 0.686 (68.6%)
Result
F1 = 0.69
Note: F1 (68.6%) is lower than the arithmetic mean (70%) because the harmonic mean penalizes the imbalance between precision and recall.
# F1 Score and classification report
from sklearn.metrics import f1_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, random_state=42
)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1:.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))
- f1_score(y_test, y_pred) - Calculates 2×(precision×recall)/(precision+recall)
- classification_report() - Generates a complete performance table
- target_names=data.target_names - Adds readable class labels to the report
Classification Report Output:
- Per-class metrics: Precision, Recall, F1 for each class separately
- Support: Number of samples in each class (helps spot imbalance)
- Macro avg: Simple average across classes (treats all classes equally)
- Weighted avg: Weighted by class size (larger classes count more)
Other Useful Metrics
Specificity
TN / (TN + FP)
True negative rate - how well we identify negatives.
Balanced Accuracy
(Recall + Specificity) / 2
Average of recall per class - handles imbalance.
Matthews Correlation
MCC = (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
Balanced measure even for imbalanced classes.
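Specificity and balanced accuracy are described above but never computed directly in this section. A minimal sketch deriving both from the confusion matrix, using tiny hand-made label arrays (illustrative values only):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, balanced_accuracy_score

# Hand-made example: 6 negatives, 4 positives (hypothetical labels)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# sklearn's ravel order for binary labels is tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

specificity = tn / (tn + fp)            # true negative rate
recall = tp / (tp + fn)                 # true positive rate (sensitivity)
balanced_acc = (recall + specificity) / 2

print(f"Specificity: {specificity:.3f}")
print(f"Balanced accuracy (manual):  {balanced_acc:.3f}")
print(f"Balanced accuracy (sklearn): {balanced_accuracy_score(y_true, y_pred):.3f}")
```

The manual balanced accuracy should agree with sklearn's, since for binary labels balanced accuracy is exactly the mean of recall and specificity.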
# Comparing multiple metrics
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, balanced_accuracy_score, matthews_corrcoef)
metrics = {
'Accuracy': accuracy_score(y_test, y_pred),
'Precision': precision_score(y_test, y_pred),
'Recall': recall_score(y_test, y_pred),
'F1 Score': f1_score(y_test, y_pred),
'Balanced Accuracy': balanced_accuracy_score(y_test, y_pred),
'MCC': matthews_corrcoef(y_test, y_pred)
}
for name, score in metrics.items():
    print(f"{name:18}: {score:.3f}")
- accuracy_score() - (TP+TN) / Total - can be misleading on imbalanced data
- precision_score() - TP / (TP+FP) - "When I predict positive, am I right?"
- recall_score() - TP / (TP+FN) - "Did I catch all positives?"
- f1_score() - Harmonic mean of precision and recall
- balanced_accuracy_score() - Average of recall per class (handles imbalance)
- matthews_corrcoef() - Uses all 4 values (TP, TN, FP, FN), ranges from -1 to +1
Always compute multiple metrics! Different metrics reveal different aspects of model performance. MCC is often considered the most reliable single metric for binary classification.
Practice Questions
Task: Given precision=0.8 and recall=0.6, calculate F1 score.
Show Solution
precision = 0.8
recall = 0.6
f1 = 2 * (precision * recall) / (precision + recall)
print(f"F1 Score: {f1:.3f}") # 0.686
The F1 formula: F1 = 2×(precision×recall)/(precision+recall). The result (0.686) is lower than the simple average (0.70) because the harmonic mean penalizes the imbalance between precision and recall.
Task: Train a classifier on imbalanced data and compare F1 with balanced accuracy.
Show Solution
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, balanced_accuracy_score
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")
print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.3f}")
Comparison insight: F1 focuses on the positive class performance while balanced accuracy averages recall across both classes. On highly imbalanced data (95% negative), these metrics tell different stories - use both to understand your model fully!
Task: Calculate both arithmetic mean and F1 (harmonic mean) for precision=1.0 and recall=0.01. Explain why F1 is more appropriate.
Show Solution
# Extreme case: Perfect precision, terrible recall
precision = 1.0 # 100% - every positive prediction is correct
recall = 0.01 # 1% - we only found 1% of actual positives!
# Arithmetic mean (simple average)
arithmetic_mean = (precision + recall) / 2
print(f"Arithmetic Mean: {arithmetic_mean:.2%}") # 50.5%
# Harmonic mean (F1 score)
f1 = 2 * (precision * recall) / (precision + recall)
print(f"F1 Score (Harmonic Mean): {f1:.2%}") # 1.98%
print("\n=== Why F1 is Better ===")
print(f"\nThe model catches only {recall:.0%} of positives!")
print("This is a TERRIBLE model - it misses 99% of cases.")
print("\nArithmetic mean of 50.5% is MISLEADING.")
print("It makes the model look acceptable when it's useless.")
print("\nF1 of 1.98% correctly shows the model is BAD.")
print("The harmonic mean punishes extreme values.")
print("\nRule: F1 is only high when BOTH precision AND recall are high.")
Why this matters: The arithmetic mean (50.5%) would suggest a decent model, but this model only catches 1% of positives! F1's harmonic mean (1.98%) correctly reveals how bad this model is. This is why we use F1 instead of averaging.
Task: Calculate F0.5, F1, and F2 scores for a model. Explain when you would use each.
Show Solution
from sklearn.metrics import fbeta_score
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Create dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
# Calculate different F-beta scores
f05 = fbeta_score(y_test, y_pred, beta=0.5) # Precision-focused
f1 = fbeta_score(y_test, y_pred, beta=1.0) # Balanced
f2 = fbeta_score(y_test, y_pred, beta=2.0) # Recall-focused
print("=== F-beta Scores ===")
print(f"\nF0.5 (Precision-focused): {f05:.3f}")
print(" Use when: False positives are costly")
print(" Examples: Spam detection, recommendation systems")
print(f"\nF1 (Balanced): {f1:.3f}")
print(" Use when: Both errors are equally bad")
print(" Examples: General classification, balanced datasets")
print(f"\nF2 (Recall-focused): {f2:.3f}")
print(" Use when: False negatives are costly")
print(" Examples: Cancer detection, fraud detection")
print("\n=== The Formula ===")
print("F-beta = (1 + β²) × (precision × recall) / (β² × precision + recall)")
print("\nHigher beta = More weight on recall")
print("Lower beta = More weight on precision")
Choosing beta: F0.5 emphasizes precision (spam filters, recommendations). F1 is balanced. F2 emphasizes recall (medical diagnosis, fraud). The beta value lets you tune the precision-recall trade-off mathematically!
Task: Calculate MCC manually and explain why it's considered more reliable than F1 for imbalanced datasets.
Show Solution
import numpy as np
from sklearn.metrics import matthews_corrcoef, f1_score
# Confusion matrix values
TP, TN, FP, FN = 8, 900, 10, 2 # Highly imbalanced!
# Manual MCC calculation
numerator = (TP * TN) - (FP * FN)
denominator = np.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
mcc_manual = numerator / denominator
print("=== Matthews Correlation Coefficient ===")
print(f"\nTP={TP}, TN={TN}, FP={FP}, FN={FN}")
print(f"Total samples: {TP+TN+FP+FN}")
print(f"Class imbalance: {TN+FP} negative vs {TP+FN} positive")
print(f"\nMCC (manual): {mcc_manual:.3f}")
# Verify with sklearn
y_true = [0]*TN + [0]*FP + [1]*FN + [1]*TP
y_pred = [0]*TN + [1]*FP + [0]*FN + [1]*TP
print(f"MCC (sklearn): {matthews_corrcoef(y_true, y_pred):.3f}")
print(f"F1 Score: {f1_score(y_true, y_pred):.3f}")
print("\n=== Why MCC is Better ===")
print("1. Uses ALL four values (TP, TN, FP, FN)")
print("2. F1 ignores True Negatives!")
print("3. MCC ranges from -1 to +1:")
print(" +1 = perfect prediction")
print(" 0 = random prediction")
print(" -1 = total disagreement")
print("\n4. MCC only gives high score when model does well")
print(" on BOTH classes, not just the positive class.")
MCC advantage: MCC uses all four confusion matrix values (TP, TN, FP, FN) making it a truly balanced metric. It returns values from -1 (total disagreement) to +1 (perfect), with 0 meaning random. Many researchers consider MCC the most reliable single metric!
ROC Curves and AUC
The ROC (Receiver Operating Characteristic) curve visualizes the trade-off between true positive rate and false positive rate across all possible thresholds. The AUC (Area Under the Curve) summarizes this into a single number from 0 to 1, where 1 is perfect and 0.5 is random guessing.
What is ROC-AUC in Simple Terms?
Imagine you have two people - one who has a disease and one who doesn't. You show your model both people.
AUC is the probability that your model correctly gives a higher "disease score" to the sick person than the healthy person. If AUC = 0.9, that means 90% of the time, your model ranks the sick person higher - that's great!
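This ranking interpretation can be checked numerically. The sketch below uses synthetic scores (drawn from two normal distributions, so ties are effectively impossible) and compares roc_auc_score against a brute-force count over all positive/negative pairs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Synthetic scores: positives tend to score higher than negatives
y = np.array([0] * 500 + [1] * 100)
scores = np.concatenate([rng.normal(0.0, 1.0, 500),   # negatives
                         rng.normal(1.5, 1.0, 100)])  # positives

# Brute force: fraction of (positive, negative) pairs ranked correctly
pos, neg = scores[y == 1], scores[y == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])

auc = roc_auc_score(y, scores)
print(f"Pairwise ranking probability: {pairwise:.4f}")
print(f"roc_auc_score:                {auc:.4f}")
```

With tied scores the brute-force count would need to credit ties as 0.5; with continuous scores the two numbers match.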
Why Use ROC-AUC?
Threshold Independent
It evaluates your model across ALL possible thresholds, not just 0.5
Easy to Compare
Single number makes comparing models simple
Measures Ranking
How well does your model rank positives above negatives?
ROC-AUC Interpretation
AUC represents the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. It measures ranking quality, not absolute predictions.
- AUC = 1.0: Perfect classifier
- AUC 0.9-1.0: Excellent
- AUC 0.8-0.9: Good
- AUC = 0.5: Random guessing
# Computing ROC curve and AUC
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, random_state=42
)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
# Get probability predictions
y_proba = model.predict_proba(X_test)[:, 1]
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
# Calculate AUC
auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {auc:.3f}")
- predict_proba(X_test)[:, 1] - Gets probability of positive class (column 1) for each sample
- roc_curve(y_test, y_proba) - Returns FPR, TPR, and thresholds at each point on the curve
- roc_auc_score(y_test, y_proba) - Calculates area under the ROC curve (0.5 to 1.0)
Understanding the Output:
- FPR (False Positive Rate): FP / (FP + TN) - how many negatives we incorrectly labeled positive
- TPR (True Positive Rate): Same as Recall = TP / (TP + FN)
- AUC = 0.5: Random guessing (diagonal line) | AUC = 1.0: Perfect classifier
Precision-Recall Curves
For imbalanced datasets, precision-recall curves are often more informative than ROC curves. They focus on the positive class and don't get inflated by a large number of true negatives.
# Precision-Recall curve and Average Precision
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score
# Calculate PR curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
# Average Precision (area under PR curve)
ap = average_precision_score(y_test, y_proba)
print(f"Average Precision: {ap:.3f}")
# Find threshold for specific precision target
target_prec = 0.95
valid = precision[:-1] >= target_prec
if valid.any():
    idx = np.argmax(recall[:-1][valid])
    print(f"Threshold for {target_prec:.0%} precision: {thresholds[valid][idx]:.3f}")
    print(f"Recall at that threshold: {recall[:-1][valid][idx]:.3f}")
- precision_recall_curve() - Returns precision, recall arrays at each threshold
- average_precision_score() - Calculates area under PR curve (AP score)
- precision[:-1] >= target_prec - Finds all thresholds with precision ≥ 95%
- np.argmax(recall[:-1][valid]) - Among valid thresholds, pick the one with highest recall
PR vs ROC for Imbalanced Data:
- ROC-AUC can be misleading: Includes true negatives, which are easy to get right when negatives dominate
- PR-AUC (Average Precision) is better: Focuses only on positive class performance
- Rule of thumb: Use PR curves when positive class is rare (<10% of data)
Practice Questions
Task: Train a classifier and compute its ROC-AUC score.
Show Solution
from sklearn.metrics import roc_auc_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Binary classification (setosa vs others)
iris = load_iris()
X, y = iris.data, (iris.target == 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC: {auc:.3f}")
Key steps: Get probability predictions with predict_proba()[:, 1], then pass true labels and probabilities to roc_auc_score(). AUC requires probability scores, not hard predictions!
Task: Compare AUC scores of Logistic Regression, Decision Tree, and Random Forest.
Show Solution
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
models = {
'Logistic Regression': LogisticRegression(max_iter=1000),
'Decision Tree': DecisionTreeClassifier(random_state=42),
'Random Forest': RandomForestClassifier(random_state=42)
}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_proba = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_proba)
    print(f"{name}: AUC = {auc:.3f}")
Model comparison pattern: Loop through multiple models, train each, get probabilities, and compute AUC. This gives you a single number to rank models. Higher AUC = better ranking ability!
Task: Plot the ROC curve and find the optimal threshold using Youden's J statistic (maximizes TPR - FPR).
Show Solution
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
import numpy as np
# Assume model and data are ready
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, random_state=42)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)
# Find optimal threshold (Youden's J statistic)
j_scores = tpr - fpr
optimal_idx = np.argmax(j_scores)
optimal_threshold = thresholds[optimal_idx]
print(f"AUC: {auc:.3f}")
print(f"Optimal Threshold: {optimal_threshold:.3f}")
print(f"At this threshold:")
print(f" TPR (Recall): {tpr[optimal_idx]:.3f}")
print(f" FPR: {fpr[optimal_idx]:.3f}")
# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.5)')
plt.scatter(fpr[optimal_idx], tpr[optimal_idx],
color='red', s=100, zorder=5,
label=f'Optimal (threshold={optimal_threshold:.2f})')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve with Optimal Threshold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
What this creates: A complete ROC curve visualization with the optimal threshold marked. Youden's J statistic (TPR-FPR) finds the point farthest from the diagonal line - the best balance between sensitivity and specificity.
Task: For an imbalanced dataset, compare ROC-AUC with Average Precision (PR-AUC) and explain why they might give different impressions.
Show Solution
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Create HIGHLY imbalanced dataset (5% positive)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
# Calculate both metrics
roc_auc = roc_auc_score(y_test, y_proba)
avg_precision = average_precision_score(y_test, y_proba)
print("=== Imbalanced Dataset Analysis ===")
print(f"Positive class: {sum(y_test)} out of {len(y_test)} ({sum(y_test)/len(y_test):.1%})")
print(f"\nROC-AUC: {roc_auc:.3f}")
print(f"Average Precision (PR-AUC): {avg_precision:.3f}")
print("\n=== Why the Difference? ===")
print("\nROC-AUC looks HIGH because:")
print(" - It includes True Negatives")
print(" - With many negatives, getting TNs right is easy")
print(" - This 'inflates' the ROC-AUC score")
print("\nPR-AUC looks LOWER because:")
print(" - It focuses ONLY on the positive class")
print(" - Ignores True Negatives completely")
print(" - Shows how well we find the rare positives")
print("\n=== Recommendation ===")
print("For imbalanced data, TRUST PR-AUC more!")
print("It gives a more realistic view of performance.")
Critical insight: ROC-AUC can be misleadingly high on imbalanced data because it includes true negatives (which are easy to get right when negatives dominate). PR-AUC focuses only on positive class performance - use it for imbalanced problems!
Task: Your model has AUC = 0.85. Explain what this means in practical terms that a non-technical manager could understand.
Show Solution
auc = 0.85
print("=== Explaining AUC to Your Manager ===")
print(f"\nOur fraud detection model has AUC = {auc}")
print("\nWhat does this mean in simple terms?")
print("="*50)
print(f"\n1. THE RANKING TEST:")
print(f" If I pick one fraudulent transaction and one normal one,")
print(f" our model will correctly identify which is fraud")
print(f" {auc:.0%} of the time.")
print(f"\n2. THE QUALITY SCALE:")
print(f" - AUC = 0.5 = Random guessing (coin flip)")
print(f" - AUC = 0.7-0.8 = Acceptable")
print(f"   - AUC = 0.8-0.9 = Good (← We are HERE)")
print(f" - AUC = 0.9-1.0 = Excellent")
print(f" - AUC = 1.0 = Perfect")
print(f"\n3. PRACTICAL MEANING:")
print(f" Our model is performing WELL.")
print(f" It successfully distinguishes fraud from normal transactions")
print(f" much better than random chance.")
print(f"\n4. ROOM FOR IMPROVEMENT:")
print(f" We're at 85%, so there's still 15% where the model")
print(f" ranks a normal transaction higher than a fraud.")
print(f" We could try more features or better algorithms.")
Communication tip: This code shows how to explain AUC to non-technical stakeholders. The "ranking test" interpretation (85% chance of correctly ranking a positive above a negative) is intuitive and practical!
Choosing the Right Metric
The best metric depends on your specific problem, the class distribution, and the costs of different types of errors. There is no one-size-fits-all answer. Understanding your domain and the consequences of mistakes is crucial for choosing appropriate evaluation criteria.
Quick Decision Framework
Ask yourself these questions to choose the right metric:
Question 1: Is my data imbalanced?
If yes → Avoid accuracy! Use F1, precision, recall, or balanced accuracy instead.
Question 2: Which error is worse - false positive or false negative?
False positive worse → Focus on Precision. False negative worse → Focus on Recall.
Question 3: Do I need to compare multiple models?
If yes → Use ROC-AUC or F1 score for easy comparison.
Question 4: Can I tune the threshold later?
If yes → ROC-AUC evaluates across all thresholds. If no → Use metrics at your chosen threshold.
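The four questions can be folded into a toy helper. This is purely illustrative (a hypothetical suggest_metric function encoding the rules of thumb above, not a library API):

```python
def suggest_metric(imbalanced, worse_error, comparing_models):
    """Toy encoding of the decision framework.

    worse_error: 'fp' (false positives hurt more),
                 'fn' (false negatives hurt more), or 'equal'.
    """
    if comparing_models:
        return "ROC-AUC (or F1 on imbalanced data)"
    if worse_error == "fp":
        return "Precision (or F0.5)"
    if worse_error == "fn":
        return "Recall (or F2)"
    if imbalanced:
        return "F1 or balanced accuracy"   # accuracy is misleading here
    return "Accuracy is acceptable"

# Example: fraud detection - imbalanced, missed fraud is the costly error
print(suggest_metric(imbalanced=True, worse_error="fn", comparing_models=False))
```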
Common Use Cases by Industry
| Use Case | Primary Metric | Why? |
|---|---|---|
| Email Spam Detection | Precision | Don't want important emails in spam folder |
| Cancer Detection | Recall | Must catch all cancer cases, even with some false alarms |
| Fraud Detection | F1 / Recall | Balance between catching fraud and not blocking customers |
| Search Engine | Precision@K | Top results must be relevant |
| General Model Comparison | ROC-AUC | Threshold-independent comparison |
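Precision@K from the table is not computed anywhere above and is not a built-in sklearn scorer; a minimal sketch (hypothetical precision_at_k helper, assuming a higher score means the item is ranked higher):

```python
import numpy as np

def precision_at_k(y_true, scores, k):
    """Fraction of the top-k scored items that are actually relevant."""
    top_k = np.argsort(scores)[::-1][:k]        # indices of the k highest scores
    return np.mean(np.asarray(y_true)[top_k])

# Toy search-ranking example: 1 = relevant document, 0 = irrelevant
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

print(precision_at_k(y_true, scores, k=3))  # top 3 items: 2 of 3 are relevant
```

This is the metric a search engine cares about: only the first page of results matters, so errors below rank K are ignored entirely.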
Metric Selection Guide
Use Precision when...
- False positives are costly (spam detection)
- You want high confidence in positive predictions
- Action is expensive (manual review)
Use Recall when...
- False negatives are costly (disease detection)
- Missing positives is dangerous
- You need to catch all positives
Use F1 when...
- You need a single metric to optimize
- Classes are imbalanced
- Both precision and recall matter
Use ROC-AUC when...
- You want to compare models overall
- Threshold can be tuned later
- Classes are roughly balanced
Step 1: Import All Required Metrics
# Complete evaluation pipeline
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, classification_report)
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
What we're importing:
- accuracy_score - Overall correctness (but misleading for imbalanced data!)
- precision_score - Of all positive predictions, how many were correct?
- recall_score - Of all actual positives, how many did we catch?
- f1_score - Harmonic mean of precision and recall (balanced measure)
- roc_auc_score - Area under ROC curve (threshold-independent)
- classification_report - Prints a complete summary table
Pro Tip: Always import multiple metrics - never rely on just one!
Step 2: Create an Imbalanced Dataset
# Create dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
Understanding the dataset:
- n_samples=1000 - Create 1000 total samples
- weights=[0.9, 0.1] - 90% class 0 (negative), 10% class 1 (positive) - IMBALANCED!
- random_state=42 - Makes results reproducible
- train_test_split() - Splits into 75% training, 25% testing
Expected distribution:
Training set: ~750 samples (~675 negative, ~75 positive)
Testing set: ~250 samples (~225 negative, ~25 positive)
This 90:10 imbalance is common in real-world problems like fraud detection, disease diagnosis, or spam filtering!
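The expected split sizes can be verified directly; a quick check (assuming train_test_split's default test_size of 0.25):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

print(f"Train size: {len(y_train)}, class counts: {np.bincount(y_train)}")
print(f"Test size:  {len(y_test)}, class counts: {np.bincount(y_test)}")
# Roughly 90:10 in each split; pass stratify=y to make the ratio exact per split
```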
Step 3: Train the Model
# Train model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
Training and predicting:
- LogisticRegression() - Creates a logistic regression classifier
- model.fit(X_train, y_train) - Trains on training data
- model.predict(X_test) - Gets hard predictions (0 or 1)
- model.predict_proba(X_test) - Gets probability scores for each class
- [:, 1] - Takes only the probability of class 1 (positive class)
Two types of predictions:
y_pred (hard): [0, 1, 0, 0, 1, ...] → Used for accuracy, precision, recall, F1
y_proba (soft): [0.2, 0.8, 0.3, 0.1, 0.9, ...] → Used for ROC-AUC
Soft predictions give more information - they tell you HOW CONFIDENT the model is about each prediction!
Step 4: Calculate and Display All Metrics
# Complete metrics report
print("="*50)
print("CLASSIFICATION METRICS REPORT")
print("="*50)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall: {recall_score(y_test, y_pred):.3f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")
print("="*50)
Understanding each metric:
- accuracy_score(y_test, y_pred) - (TP + TN) / Total - Can be misleading!
- precision_score(y_test, y_pred) - TP / (TP + FP) - "When I say positive, am I right?"
- recall_score(y_test, y_pred) - TP / (TP + FN) - "Did I catch all positives?"
- f1_score(y_test, y_pred) - 2 × (P × R) / (P + R) - Balance of both
- roc_auc_score(y_test, y_proba) - Uses probabilities, not hard predictions!
Example output:
==================================================
CLASSIFICATION METRICS REPORT
==================================================
Accuracy:  0.920  ← Looks great, but...
Precision: 0.667  ← Only 2/3 of "positive" predictions correct
Recall:    0.480  ← Missing half the actual positives!
F1 Score:  0.558  ← Shows the real picture
ROC-AUC:   0.892  ← Good ranking ability
==================================================
Key Insight: Notice how accuracy (92%) looks great, but recall (48%) reveals we're missing half the positive cases! This is why you must look at ALL metrics together.
Practice Questions
Task: Create a function that returns all classification metrics in a dictionary.
Show Solution
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, balanced_accuracy_score, matthews_corrcoef,
                             roc_auc_score, average_precision_score)
def evaluate_classifier(y_true, y_pred, y_proba=None):
    metrics = {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
        'balanced_acc': balanced_accuracy_score(y_true, y_pred),
        'mcc': matthews_corrcoef(y_true, y_pred)
    }
    if y_proba is not None:
        metrics['roc_auc'] = roc_auc_score(y_true, y_proba)
        metrics['avg_precision'] = average_precision_score(y_true, y_proba)
    return metrics
# Usage
results = evaluate_classifier(y_test, y_pred, y_proba)
for name, score in results.items():
    print(f"{name}: {score:.3f}")
Reusable function: This evaluation function returns all metrics in a dictionary, making it easy to log, compare, or display results. Include probability-based metrics (ROC-AUC, Average Precision) when probabilities are available!
Task: For each scenario below, identify which metric you should prioritize and explain why.
Show Solution
scenarios = [
{
"name": "Airport Security (detecting weapons)",
"metric": "RECALL",
"reason": "Missing a weapon could be fatal. False alarms are inconvenient but acceptable."
},
{
"name": "YouTube Recommendation System",
"metric": "PRECISION",
"reason": "Recommending irrelevant videos annoys users. Missing some good videos is fine."
},
{
"name": "Credit Card Fraud Detection",
"metric": "F1 or F2 (recall-weighted)",
"reason": "Missing fraud is costly, but too many false alarms block legitimate purchases."
},
{
"name": "Email Spam Filter",
"metric": "PRECISION (or F0.5)",
"reason": "Sending important emails to spam is very bad. Some spam in inbox is tolerable."
},
{
"name": "Autonomous Car Pedestrian Detection",
"metric": "RECALL (near 100%)",
"reason": "Missing a pedestrian could kill someone. False braking is safe."
},
{
"name": "Product Defect Detection in Factory",
"metric": "RECALL",
"reason": "Shipping defective products damages brand reputation and safety."
}
]
print("=== Metric Selection for Real-World Scenarios ===")
for s in scenarios:
    print(f"\n{s['name']}")
    print(f"  Best Metric: {s['metric']}")
    print(f"  Why: {s['reason']}")
Decision framework: This reference shows how to match metrics to real-world scenarios. Key insight: safety-critical applications (security, medical, autonomous vehicles) typically need high recall, while user experience applications (spam, recommendations) need high precision.
Task: Create a custom scoring function where: FN costs $500 (missed fraud), FP costs $5 (blocked legitimate transaction). Calculate total cost.
Show Solution
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
def calculate_cost(y_true, y_pred, fn_cost=500, fp_cost=5):
    """
    Calculate total cost of classification errors.
    In fraud detection:
    - FN (missed fraud): Customer loses money, bank pays = HIGH COST
    - FP (blocked transaction): Customer inconvenienced = LOW COST
    """
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    total_cost = (fn * fn_cost) + (fp * fp_cost)
    print("Confusion Matrix:")
    print(f"  TN={tn}, FP={fp}")
    print(f"  FN={fn}, TP={tp}")
    print("\nCost Breakdown:")
    print(f"  FN cost: {fn} × ${fn_cost} = ${fn * fn_cost}")
    print(f"  FP cost: {fp} × ${fp_cost} = ${fp * fp_cost}")
    print(f"  Total Cost: ${total_cost}")
    return total_cost
# Example with fraud detection
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Compare two models with different thresholds
model = LogisticRegression()
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
print("=== Model 1: Default threshold (0.5) ===")
y_pred_default = (y_proba >= 0.5).astype(int)
cost1 = calculate_cost(y_test, y_pred_default)
print("\n=== Model 2: Lower threshold (0.3) - catches more fraud ===")
y_pred_lower = (y_proba >= 0.3).astype(int)
cost2 = calculate_cost(y_test, y_pred_lower)
print(f"\n=== Comparison ===")
print(f"Savings by using lower threshold: ${cost1 - cost2}")
Business value: This cost-sensitive approach translates model errors into dollars. By adjusting the threshold to catch more fraud (lowering from 0.5 to 0.3), we reduce missed fraud (expensive FNs) at the cost of more blocked transactions (cheap FPs). The net result is usually significant savings!
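Rather than comparing just two hand-picked thresholds, you can sweep a whole range and keep the cheapest one. A minimal self-contained sketch, using synthetic probabilities and the same assumed $500/$5 costs as above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def total_cost(y_true, y_pred, fn_cost=500, fp_cost=5):
    """Dollar cost of errors: expensive missed fraud, cheap blocked transactions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fn * fn_cost + fp * fp_cost

# Synthetic labels and fraud probabilities (positives tend to score higher)
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=500)
y_proba = np.clip(0.4 * y_true + 0.6 * rng.random(500), 0, 1)

# Sweep candidate thresholds and keep the cost-minimizing one
thresholds = np.arange(0.05, 0.95, 0.05)
costs = [total_cost(y_true, (y_proba >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmin(costs))]
print(f"Cheapest threshold: {best_t:.2f} (total cost ${min(costs)})")
```

Because false negatives cost 100× more than false positives here, the sweep typically lands on a low threshold — the model blocks more legitimate transactions in exchange for missing far less fraud.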
Task: Train 3 different models and create a comparison table showing all metrics. Which model would you choose for cancer detection?
Show Solution
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import pandas as pd
# Load cancer detection data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=42)
# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=5000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
}
# Evaluate each model
results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    results.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred),
        'ROC-AUC': roc_auc_score(y_test, y_proba)
    })
# Display comparison table
df = pd.DataFrame(results)
print("=== Model Comparison Table ===")
print(df.to_string(index=False))
# Recommendation for cancer detection
print("\n=== Recommendation for Cancer Detection ===")
print("For cancer detection, we prioritize RECALL!")
print("We must catch ALL cancer cases, even at the cost of some false alarms.")
best_recall_model = df.loc[df['Recall'].idxmax(), 'Model']
print(f"\nBest model by Recall: {best_recall_model}")
Model selection strategy: This creates a comparison table with all metrics, then selects the best model based on the most important metric for the use case. For cancer detection, we pick the model with highest recall to minimize missed diagnoses!
Key Takeaways
Accuracy Trap
Don't rely on accuracy alone, especially with imbalanced data. A model predicting only the majority class can have high accuracy but be useless.
Confusion Matrix
The confusion matrix is the foundation. TP, TN, FP, FN let you calculate all other metrics and understand where your model fails.
Precision vs Recall
There is always a trade-off. Optimizing for one usually hurts the other. Choose based on which error type is more costly.
F1 Score
Use F1 when you need a single number that balances precision and recall. It is especially useful for imbalanced datasets.
ROC-AUC
AUC measures ranking ability across all thresholds. Great for comparing models, but consider PR curves for imbalanced data.
Domain Matters
Choose metrics based on your specific problem. Cancer detection needs high recall. Spam detection needs high precision.
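The precision–recall trade-off is easy to see by sliding the decision threshold over a handful of toy scores (the values below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy scores: positives tend to score higher, but the classes overlap
y_true  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_score = np.array([0.10, 0.20, 0.35, 0.50, 0.60, 0.40, 0.55, 0.70, 0.80, 0.90])

# Raising the threshold trades recall for precision
for t in (0.3, 0.5, 0.7):
    y_pred = (y_score >= t).astype(int)
    print(f"threshold={t}: precision={precision_score(y_true, y_pred):.2f}, "
          f"recall={recall_score(y_true, y_pred):.2f}")
```

As the threshold rises from 0.3 to 0.7, precision climbs to 1.00 while recall falls from 1.00 to 0.60 — exactly the trade-off described above.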
Knowledge Check
Test your understanding of classification metrics:
What is the main problem with using accuracy on imbalanced datasets?
In a confusion matrix, what does a False Positive represent?
For a disease detection system where missing a sick patient is very dangerous, which metric should you prioritize?
What is the F1 score?
What does an ROC-AUC score of 0.5 indicate?
When would you prefer Precision-Recall curves over ROC curves?