Regression Metrics
Regression models predict continuous values, so we need metrics that measure how far our predictions are from the actual values. Understanding these metrics helps you choose the right one for your problem and interpret model performance correctly.
Why Multiple Metrics?
No single metric tells the whole story. Each metric has strengths and weaknesses, and the right choice depends on your specific use case. Let's understand when to use each one.
Mean Absolute Error (MAE)
The average of the absolute differences between predictions and actual values. MAE = (1/n) × Σ|yᵢ - ŷᵢ|
Interpretation: "On average, predictions are off by X units." MAE is in the same units as your target variable (dollars, meters, temperature, etc.), making it very intuitive to interpret and explain to non-technical stakeholders.
Why "Absolute"? We use absolute value |error| because we don't care if we're predicting too high or too low - we just care about the magnitude of the mistake. Without absolute value, a +10 error and -10 error would cancel out to 0, hiding the mistakes!
# ============================================
# Mean Absolute Error (MAE) - Step by Step
# ============================================
from sklearn.metrics import mean_absolute_error
import numpy as np
# Example: Predicting restaurant bills (in dollars)
y_true = np.array([100, 150, 200, 250, 300]) # Actual bills
y_pred = np.array([110, 140, 210, 240, 320]) # Model predictions
print("Comparing Actual vs Predicted:")
print("Actual:", y_true)
print("Predicted:", y_pred)
print()
# Let's calculate MAE manually to understand what's happening:
print("Manual Calculation:")
errors = np.abs(y_true - y_pred) # Take absolute value of each error
print("Absolute Errors:", errors) # [10, 10, 10, 10, 20]
print(f"Sum of errors: {np.sum(errors)}") # 60
print(f"Number of samples: {len(errors)}") # 5
manual_mae = np.sum(errors) / len(errors) # Average
print(f"Manual MAE: {manual_mae:.2f}")
print()
# Now with sklearn (same result, less code!):
mae = mean_absolute_error(y_true, y_pred)
print(f"Sklearn MAE: {mae:.2f}")
print()
# INTERPRETATION:
print("=" * 50)
print("WHAT THIS MEANS:")
print(f"On average, predictions are off by ${mae:.2f}")
print("This is in the SAME UNITS as your target (dollars).")
print("Lower MAE = Better model!")
print("=" * 50)
Pros:
- Easy to interpret (same units as target)
- Robust to outliers
- Linear penalty for all errors
Cons:
- Doesn't heavily penalize large errors
- Not differentiable at zero (an issue for some optimizers)
- May not be ideal when big errors are costly
Mean Squared Error (MSE) & RMSE
Mean Squared Error (MSE)
The average of the squared differences between predictions and actual values. MSE = (1/n) × Σ(yᵢ - ŷᵢ)²
Key property: Squaring makes the penalty grow quadratically, so large errors count far more than small ones:
- Error of 2 → Contributes 4 to MSE (2²)
- Error of 5 → Contributes 25 to MSE (5²) - 6.25x worse!
- Error of 10 → Contributes 100 to MSE (10²) - 25x worse than error of 2!
RMSE (Root MSE): We take the square root to bring MSE back to original units. RMSE is more interpretable than MSE because it's in the same scale as your target variable.
# ============================================
# MSE and RMSE - Understanding the Difference
# ============================================
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
# Using same example data
y_true = np.array([100, 150, 200, 250, 300])
y_pred = np.array([110, 140, 210, 240, 320])
print("=== Manual Calculation to Understand MSE ===")
errors = y_true - y_pred # [−10, 10, −10, 10, −20]
print("Raw errors:", errors)
squared_errors = errors ** 2 # [100, 100, 100, 100, 400]
print("Squared errors:", squared_errors)
print("→ Notice how the −20 error (squared = 400) dominates!")
print()
manual_mse = np.mean(squared_errors)
print(f"Manual MSE: {manual_mse:.2f}")
print(f"Manual RMSE: {np.sqrt(manual_mse):.2f}")
print()
# Using sklearn:
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse) # Take square root to get back to original units
# Newer sklearn (>= 1.4) also provides root_mean_squared_error(y_true, y_pred);
# the older squared=False argument to mean_squared_error is deprecated
print("=== Using Sklearn ===")
print(f"MSE: {mse:.2f} (units²)")
print(f"RMSE: {rmse:.2f} (original units - dollars)")
print()
# Compare with MAE:
mae = mean_absolute_error(y_true, y_pred)
print("=== Comparison ===")
print(f"MAE: ${mae:.2f}")
print(f"RMSE: ${rmse:.2f}")
print(f"Ratio RMSE/MAE: {rmse/mae:.2f}")
print()
print("INTERPRETATION:")
print("→ RMSE > MAE means we have some large errors")
print("→ If RMSE ≈ MAE, all errors are similar size")
print("→ If RMSE >> MAE (like 2x or more), we have outliers!")
R-Squared (R²) - Coefficient of Determination
• R² = 1.0 (100%) → Perfect! Your model explains ALL variation in the target
• R² = 0.80 (80%) → Good! Your model explains 80% of why the target varies; 20% is still random/unexplained
• R² = 0.50 (50%) → Okay - better than guessing the mean, but leaves a lot unexplained
• R² = 0.0 (0%) → Useless - no better than always predicting the average
• R² < 0 (negative) → Terrible! Worse than just guessing the mean every time!
R² Score (Coefficient of Determination)
Measures what proportion of the variance (variation) in the target variable is explained by your model.
R² = 1 - (SS_residual / SS_total) = 1 - (Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²)
Range and Interpretation:
- R² = 1.0: Perfect prediction - All variance explained
- R² = 0.7-0.9: Good model - Explains most of the variance
- R² = 0.5-0.7: Moderate - Some predictive power
- R² = 0.0: Baseline - No better than predicting mean
- R² < 0: Worse than baseline! - Model is making things worse
Important: R² can be negative on test data if predictions are consistently worse than just predicting the mean. This is a red flag that your model has serious problems!
# R-Squared Score (continuing with the restaurant-bill data above)
from sklearn.metrics import r2_score
r2 = r2_score(y_true, y_pred)
print(f"R² Score: {r2:.4f}")
# R² ≈ 0.97 means the model explains ~97% of the variance in the target
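The 0.97 above can be reproduced from the formula by hand - a quick sanity check on the same toy data:

```python
# Verifying the R² formula manually on the restaurant-bill example
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100, 150, 200, 250, 300])
y_pred = np.array([110, 140, 210, 240, 320])

ss_res = np.sum((y_true - y_pred) ** 2)         # Σ(yᵢ - ŷᵢ)² = 800
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # Σ(yᵢ - ȳ)² = 25000
manual_r2 = 1 - ss_res / ss_tot

print(f"Manual R²:  {manual_r2:.4f}")   # 0.9680
print(f"Sklearn R²: {r2_score(y_true, y_pred):.4f}")
```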
Complete Regression Evaluation
# Complete regression evaluation example
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
# Generate sample data
X, y = make_regression(n_samples=200, n_features=3, noise=15, random_state=42)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Calculate all metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print("=" * 40)
print("REGRESSION METRICS SUMMARY")
print("=" * 40)
print(f"MAE: {mae:.4f} (avg error in original units)")
print(f"MSE: {mse:.4f} (penalizes large errors)")
print(f"RMSE: {rmse:.4f} (interpretable units)")
print(f"R²: {r2:.4f} (variance explained)")
print("=" * 40)
| Metric | Best When | Sensitive to Outliers? |
|---|---|---|
| MAE | All errors matter equally, outliers present | No (robust) |
| MSE/RMSE | Large errors are especially bad | Yes (penalizes heavily) |
| R² | Comparing models, explaining variance | Somewhat |
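The outlier sensitivity claimed in the table can be demonstrated with a quick experiment (made-up numbers): the same predictions, with and without a single large error.

```python
# Sketch: how one outlier affects MAE vs RMSE
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true  = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
clean   = np.array([12.0, 18.0, 32.0, 38.0, 52.0])  # every error = 2
outlier = np.array([12.0, 18.0, 32.0, 38.0, 90.0])  # one error = 40

for name, y_pred in [("clean", clean), ("with outlier", outlier)]:
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    print(f"{name:>13}: MAE={mae:5.2f}  RMSE={rmse:5.2f}  ratio={rmse/mae:.2f}")
```

On the clean data MAE and RMSE agree (ratio 1.0); the single outlier moves MAE only modestly but roughly doubles RMSE.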
Practice Questions: Regression Metrics
Task: Given y_true = [10, 20, 30, 40, 100] and y_pred = [12, 18, 32, 38, 50], calculate MAE and RMSE. What does their ratio tell you?
Show Solution
y_true = np.array([10, 20, 30, 40, 100])
y_pred = np.array([12, 18, 32, 38, 50])
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"Ratio RMSE/MAE: {rmse/mae:.2f}")
# High ratio (~1.9) indicates presence of large errors (outliers)
# The 100 -> 50 prediction (error of 50) dominates RMSE
Question: What does an R-squared of 0.85 mean? What about a negative R-squared?
Show Solution
R-squared = 0.85: The model explains 85% of the variance in the target variable. 15% remains unexplained.
Negative R-squared: The model is worse than just predicting the mean! This happens when predictions are completely off.
# Demonstration of negative R-squared
from sklearn.metrics import r2_score
y_true = [1, 2, 3, 4, 5]
y_pred = [10, 20, 30, 40, 50] # Completely wrong scale!
print(f"R-squared: {r2_score(y_true, y_pred):.2f}") # Negative!
Task: Write a function that takes y_true and y_pred and prints a complete evaluation report with all regression metrics.
Show Solution
def regression_report(y_true, y_pred):
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
    import numpy as np
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    # MAPE assumes y_true contains no zeros (division by zero otherwise)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    print("REGRESSION EVALUATION REPORT")
    print("=" * 40)
    print(f"MAE:  {mae:.4f}")
    print(f"MSE:  {mse:.4f}")
    print(f"RMSE: {rmse:.4f}")
    print(f"R2:   {r2:.4f}")
    print(f"MAPE: {mape:.2f}%")
    print("=" * 40)
    if rmse / mae > 1.5:
        print("Warning: High RMSE/MAE ratio suggests outliers")
    if r2 < 0:
        print("Warning: Negative R2 - model is worse than baseline!")

# Usage
regression_report(y_true, y_pred)
Classification Metrics
Classification models predict categories, so we need metrics that measure how often predictions are correct. But accuracy alone can be misleading - especially with imbalanced datasets. Let's understand the full toolkit.
The Accuracy Trap
Consider a fraud detection model where 99% of transactions are legitimate. A model that predicts "not fraud" for everything achieves 99% accuracy - but catches zero fraudsters! This is why we need precision, recall, and F1 score.
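A minimal reproduction of this trap, with made-up fraud labels:

```python
# The accuracy trap: a "predict nothing" fraud model
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1% fraud, 99% legitimate (deterministic toy data)
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros(1000, dtype=int)  # lazy model: always predicts "not fraud"

print(f"Accuracy: {accuracy_score(y_true, y_pred):.1%}")  # 99.0%
print(f"Recall:   {recall_score(y_true, y_pred):.1%}")    # 0.0% - catches zero fraudsters
```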
Accuracy
The ratio of correct predictions to total predictions. Accuracy = (TP + TN) / (TP + TN + FP + FN)
Best for: Balanced datasets where all classes are equally important.
# Accuracy Score
from sklearn.metrics import accuracy_score
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.2%}") # 80%
Precision
Precision (Positive Predictive Value)
Of all positive predictions your model made, what proportion were actually positive?
Precision = TP / (TP + FP) = "True Positives / All Predicted Positives"
Focus on Precision when: False positives are costly
- Spam filter: Marking good emails as spam annoys users (false positive = bad!)
- Product recommendations: Suggesting bad products hurts trust
- Ad targeting: Showing ads to wrong people wastes money
- Criminal justice: False accusation is very serious
# Precision Score
from sklearn.metrics import precision_score
precision = precision_score(y_true, y_pred)
print(f"Precision: {precision:.2%}")
# "Of all emails I marked as spam, X% were actually spam"
Recall (Sensitivity, True Positive Rate)
In cancer detection, missing a patient (false negative) can be deadly! That's why recall is critical in medical diagnosis - you'd rather have some false alarms than miss sick patients.
Of all actual positives in the data, what proportion did we successfully catch?
Recall = TP / (TP + FN) = "True Positives / All Actually Positive"
Focus on Recall when: False negatives are costly/dangerous
- Disease detection: Missing a sick patient can be fatal (false negative = very bad!)
- Fraud detection: Missing fraud costs money and hurts customers
- Search engines: Missing relevant results frustrates users
- Security systems: Missing a threat can be catastrophic
# Recall Score
from sklearn.metrics import recall_score
recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.2%}")
# "Of all actual spam emails, I caught X%"
F1 Score - The Balance
F1 Score
The harmonic mean of precision and recall - balances both. F1 = 2 × (Precision × Recall) / (Precision + Recall)
Why harmonic mean? It penalizes extreme values. If precision=1.0 but recall=0.01, F1 ≈ 0.02 (not 0.5 like arithmetic mean).
# F1 Score
from sklearn.metrics import f1_score
f1 = f1_score(y_true, y_pred)
print(f"F1 Score: {f1:.2%}")
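The harmonic-mean penalty mentioned above is easy to check numerically on an extreme precision/recall pair:

```python
# Why harmonic mean: it punishes a lopsided precision/recall pair
precision, recall = 1.0, 0.01

arithmetic = (precision + recall) / 2                       # 0.505 - looks deceptively okay
harmonic = 2 * precision * recall / (precision + recall)    # ~0.020 - exposes the useless recall

print(f"Arithmetic mean:    {arithmetic:.3f}")
print(f"Harmonic mean (F1): {harmonic:.3f}")
```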
Complete Classification Report
# Complete classification evaluation
from sklearn.metrics import classification_report, accuracy_score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Generate imbalanced data
X, y = make_classification(
n_samples=1000, n_features=10, n_classes=2,
weights=[0.9, 0.1], # 90% class 0, 10% class 1
random_state=42
)
# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Classification report - shows all metrics per class
print("CLASSIFICATION REPORT")
print("=" * 50)
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))
High Precision: When you predict positive, you're usually right. Few false alarms.
High Recall: You catch most actual positives. Few missed cases.
High F1: Good balance of both. Neither too many false alarms nor missed cases.
Practice Questions: Classification Metrics
Question: For a cancer detection model, would you prioritize precision or recall? Why?
Show Answer
Recall! Missing a cancer diagnosis (false negative) is far worse than a false alarm (false positive) that leads to more tests. You'd rather have some healthy patients get extra tests than miss cancer patients.
# For cancer detection, optimize for recall
from sklearn.metrics import recall_score, make_scorer
# You can use this as scoring in GridSearchCV
recall_scorer = make_scorer(recall_score)
Task: Given TP=80, FP=20, FN=10, TN=90, calculate precision, recall, and F1 score manually, then verify with sklearn.
Show Solution
# Manual calculation
TP, FP, FN, TN = 80, 20, 10, 90
precision = TP / (TP + FP) # 80 / 100 = 0.80
recall = TP / (TP + FN) # 80 / 90 = 0.889
f1 = 2 * (precision * recall) / (precision + recall) # 0.842
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1 Score: {f1:.3f}")
# Verify with sklearn
from sklearn.metrics import precision_score, recall_score, f1_score
# Reconstruct y_true and y_pred from confusion matrix values
y_true = [1]*TP + [1]*FN + [0]*FP + [0]*TN
y_pred = [1]*TP + [0]*FN + [1]*FP + [0]*TN
print(f"\nSklearn verification:")
print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall: {recall_score(y_true, y_pred):.3f}")
print(f"F1: {f1_score(y_true, y_pred):.3f}")
Task: Train a multi-class classifier and generate a complete classification report. Explain macro vs weighted averages.
Show Solution
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Load multi-class data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.3, random_state=42
)
# Train and predict
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Classification report
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# Macro avg: Simple average across classes (treats all classes equally)
# Weighted avg: Average weighted by support (class frequency)
# Use macro when all classes are equally important
# Use weighted when you want to account for class imbalance
Confusion Matrix & ROC-AUC
The confusion matrix gives you the complete picture of classification performance, while ROC curves help you understand performance across all thresholds.
The Confusion Matrix
Think of a COVID test - every prediction falls into one of four scenarios:
1. True Positive (TP): Test says "Positive" AND person actually has COVID ✅
2. True Negative (TN): Test says "Negative" AND person actually doesn't have COVID ✅
3. False Positive (FP): Test says "Positive" BUT person doesn't have COVID ❌ (False alarm!)
4. False Negative (FN): Test says "Negative" BUT person actually has COVID ❌ (Dangerous miss!)
The confusion matrix counts how many of each scenario happened. The diagonal (TP and TN) = correct predictions. Off-diagonal (FP and FN) = mistakes.
Confusion Matrix
A table showing predicted vs actual classes. For binary classification, it has 4 cells: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
Reading the Matrix (sklearn format):
- Rows: What the ACTUAL truth was
- Columns: What your model PREDICTED
- Diagonal cells: Correct predictions ✅ (high numbers = good!)
- Off-diagonal cells: Mistakes ❌ (low numbers = good!)
Remember: "True" means correct, "False" means wrong. "Positive/Negative" refers to what the model predicted.
| | Predicted Negative (0) | Predicted Positive (1) |
|---|---|---|
| Actual Negative (0) | TN - Correct rejection | FP - False alarm (Type I) |
| Actual Positive (1) | FN - Missed! (Type II) | TP - Correct hit! |
# Creating and visualizing a confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# Sample predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
# Create confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
# [[5, 2],   # TN=5, FP=2
#  [2, 6]]   # FN=2, TP=6
# Visualize with heatmap
fig, ax = plt.subplots(figsize=(8, 6))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Negative', 'Positive'])
disp.plot(cmap='Blues', ax=ax)
plt.title('Confusion Matrix')
plt.show()
Normalized Confusion Matrix
Raw counts can be misleading with imbalanced classes. Normalize to see proportions!
# Normalized confusion matrix (row-wise)
cm_normalized = confusion_matrix(y_true, y_pred, normalize='true')
print("Normalized Confusion Matrix (by true label):")
print(cm_normalized)
# Visualize
disp = ConfusionMatrixDisplay(confusion_matrix=cm_normalized,
display_labels=['Negative', 'Positive'])
disp.plot(cmap='Blues', values_format='.2%')
plt.title('Normalized Confusion Matrix')
plt.show()
ROC Curve
ROC Curve (Receiver Operating Characteristic)
A plot showing the trade-off between True Positive Rate (Recall) and False Positive Rate at various classification thresholds. It helps you choose the optimal threshold.
Axes: X-axis = False Positive Rate (FPR = FP / (FP + TN)), Y-axis = True Positive Rate (TPR = Recall = TP / (TP + FN))
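Before plotting the full curve, it helps to compute TPR and FPR at a single threshold from first principles (toy numbers):

```python
# TPR and FPR at one threshold, computed by hand
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_proba = np.array([0.9, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1, 0.05])

threshold = 0.5
y_pred = (y_proba >= threshold).astype(int)

tp = np.sum((y_pred == 1) & (y_true == 1))  # 2
fn = np.sum((y_pred == 0) & (y_true == 1))  # 1
fp = np.sum((y_pred == 1) & (y_true == 0))  # 1
tn = np.sum((y_pred == 0) & (y_true == 0))  # 4

tpr = tp / (tp + fn)  # 0.667
fpr = fp / (fp + tn)  # 0.200
print(f"At threshold {threshold}: TPR={tpr:.3f}, FPR={fpr:.3f}")
# Lowering the threshold raises both TPR and FPR - that trade-off IS the ROC curve
```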
# ROC Curve
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Generate sample data
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train model and get probability predictions
model = LogisticRegression()
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1] # Probability of positive class
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, 'b-', linewidth=2, label='Model')
plt.plot([0, 1], [0, 1], 'r--', label='Random Classifier')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
AUC Score
AUC (Area Under ROC Curve)
A single number summarizing model performance across all thresholds. Ranges from 0 to 1, where 1 is perfect and 0.5 is random guessing.
Interpretation: AUC = probability that a randomly chosen positive example ranks higher than a randomly chosen negative example.
# AUC Score
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(y_test, y_proba)
print(f"AUC Score: {auc:.3f}")
# Interpretation:
# AUC = 0.5 → Random guessing
# AUC = 0.7 → Acceptable
# AUC = 0.8 → Good
# AUC = 0.9+ → Excellent
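The ranking interpretation above can be verified directly: on a toy example, count over all positive/negative pairs how often the positive gets the higher score.

```python
# AUC as a ranking probability, checked against roc_auc_score
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos = scores[y == 1]
neg = scores[y == 0]
# For every (positive, negative) pair: 1 if positive outscores, 0.5 for ties
pairs = [(0.5 if p == n else float(p > n)) for p in pos for n in neg]
manual_auc = np.mean(pairs)

print(f"Manual ranking AUC: {manual_auc:.3f}")  # 8 of 9 pairs ranked correctly
print(f"roc_auc_score:      {roc_auc_score(y, scores):.3f}")
```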
Use ROC-AUC for:
- Comparing models regardless of threshold
- When you need a single performance number
- When false positives and false negatives are equally costly
- For balanced datasets
Avoid ROC-AUC for:
- Highly imbalanced datasets (use Precision-Recall curve instead)
- When you care more about the positive class
- When false positives and false negatives have different costs
Precision-Recall Curve (Alternative)
# Precision-Recall Curve (better for imbalanced data)
from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt
# Calculate precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
ap = average_precision_score(y_test, y_proba)
# Plot
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, 'b-', linewidth=2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'Precision-Recall Curve (AP = {ap:.3f})')
plt.grid(True, alpha=0.3)
plt.show()
print(f"Average Precision: {ap:.3f}")
Practice Questions: ROC & AUC
Question: Given two models with predictions, plot their ROC curves on the same graph and determine which is better.
Show Answer
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
# Train two models
rf = RandomForestClassifier(random_state=42)
lr = LogisticRegression(random_state=42)
rf.fit(X_train, y_train)
lr.fit(X_train, y_train)
# Get probabilities
rf_proba = rf.predict_proba(X_test)[:, 1]
lr_proba = lr.predict_proba(X_test)[:, 1]
# Calculate ROC curves
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_proba)
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_proba)
# Calculate AUC
rf_auc = roc_auc_score(y_test, rf_proba)
lr_auc = roc_auc_score(y_test, lr_proba)
# Plot both
plt.figure(figsize=(8, 6))
plt.plot(rf_fpr, rf_tpr, 'b-', label=f'Random Forest (AUC={rf_auc:.3f})')
plt.plot(lr_fpr, lr_tpr, 'g-', label=f'Logistic Reg (AUC={lr_auc:.3f})')
plt.plot([0, 1], [0, 1], 'r--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
The model with the higher AUC (curve closer to top-left) is better!
Question: Given this confusion matrix [[85, 15], [25, 75]], identify TP, TN, FP, FN and calculate accuracy.
Show Solution
# Confusion matrix format in sklearn:
# [[TN, FP],
# [FN, TP]]
TN, FP = 85, 15
FN, TP = 25, 75
print(f"True Negatives (TN): {TN}")
print(f"False Positives (FP): {FP}")
print(f"False Negatives (FN): {FN}")
print(f"True Positives (TP): {TP}")
accuracy = (TP + TN) / (TP + TN + FP + FN)
print(f"\nAccuracy: {accuracy:.2%}") # 80%
Task: Use the ROC curve to find the optimal classification threshold using Youden's J statistic.
Show Solution
from sklearn.metrics import roc_curve
import numpy as np
# Get ROC curve values
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
# Youden's J statistic: J = TPR - FPR (maximize this)
j_scores = tpr - fpr
optimal_idx = np.argmax(j_scores)
optimal_threshold = thresholds[optimal_idx]
print(f"Optimal Threshold: {optimal_threshold:.3f}")
print(f"At this threshold:")
print(f" TPR (Recall): {tpr[optimal_idx]:.3f}")
print(f" FPR: {fpr[optimal_idx]:.3f}")
# Apply optimal threshold
y_pred_optimal = (y_proba >= optimal_threshold).astype(int)
# Compare with default 0.5 threshold
from sklearn.metrics import f1_score
print(f"\nF1 with 0.5 threshold: {f1_score(y_test, y_proba >= 0.5):.3f}")
print(f"F1 with optimal threshold: {f1_score(y_test, y_pred_optimal):.3f}")
Cross-Validation Techniques
Why Cross-Validation?
Training on one split and testing on another can give lucky or unlucky results. Cross-validation provides a more robust estimate of model performance by using multiple splits.
K-Fold Cross-Validation
How K-Fold Works (K=5 example):
1. Split your data into 5 equal pieces (folds)
2. Round 1: Train on folds 1,2,3,4 → Test on fold 5 → Get score #1
3. Round 2: Train on folds 1,2,3,5 → Test on fold 4 → Get score #2
4. Round 3: Train on folds 1,2,4,5 → Test on fold 3 → Get score #3
5. Round 4: Train on folds 1,3,4,5 → Test on fold 2 → Get score #4
6. Round 5: Train on folds 2,3,4,5 → Test on fold 1 → Get score #5
7. Final score = Average of all 5 scores ± standard deviation
Every data point gets to be in the test set exactly once!
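That "exactly once" property is easy to verify in code (a small sketch with 20 samples):

```python
# Verifying that 5-fold CV puts every sample in the test set exactly once
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)  # 20 tiny samples
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

test_counts = np.zeros(len(X), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    print(f"Fold {fold}: train on {len(train_idx)} samples, test on {len(test_idx)}")
    test_counts[test_idx] += 1

print("Each sample tested exactly once:", bool(np.all(test_counts == 1)))  # True
```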
K-Fold Cross-Validation
Split data into K equal folds. Train on K-1 folds, test on 1 fold. Repeat K times, each time using a different fold as test set. Report average performance ± std deviation.
Common choices:
- K=5: Good balance, trains on 80% data (faster, slightly biased)
- K=10: More stable estimate, trains on 90% data (slower, less biased)
- More folds: → More computation but more accurate performance estimate
- Fewer folds: → Faster but less reliable (might get unlucky with split)
# ============================================
# K-Fold Cross-Validation - Complete Example
# ============================================
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import numpy as np
# Generate sample data
X, y = make_classification(
n_samples=1000, # 1000 samples
n_features=20, # 20 features
n_classes=2, # Binary classification
random_state=42
)
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print()
# Create model (not trained yet!)
model = RandomForestClassifier(random_state=42)
# METHOD 1: Simple 5-fold CV (one line!)
print("=== Method 1: Simple 5-Fold CV ===")
scores = cross_val_score(
model, # The model to evaluate
X, y, # Data
cv=5, # Number of folds
scoring='accuracy' # Metric to use
)
print(f"Scores for each fold: {scores}")
print(f"Mean accuracy: {scores.mean():.3f}")
print(f"Std deviation: {scores.std():.3f}")
print(f"95% Confidence: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
print()
# METHOD 2: More control with KFold object
print("=== Method 2: Explicit KFold with Shuffle ===")
kfold = KFold(
n_splits=5, # 5 folds
shuffle=True, # IMPORTANT: Shuffle before splitting!
random_state=42 # For reproducibility
)
scores = cross_val_score(model, X, y, cv=kfold)
print(f"Shuffled KFold mean: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
print()
# WHY SHUFFLE?
print("=== Why Shuffle Matters ===")
kfold_no_shuffle = KFold(n_splits=5, shuffle=False) # Don't shuffle
scores_no_shuffle = cross_val_score(model, X, y, cv=kfold_no_shuffle)
print(f"Without shuffle: {scores_no_shuffle.mean():.3f}")
print(f"With shuffle: {scores.mean():.3f}")
print("→ If data is ordered (e.g., all class 0 first, then class 1),")
print("→ not shuffling gives biased results!")
Stratified K-Fold
Like K-Fold, but ensures each fold has the same proportion of classes as the original dataset. Essential for imbalanced classification!
When to use: Classification problems, especially with imbalanced classes. This is the default when you pass a classifier to cross_val_score.
# Stratified K-Fold (preserves class proportions)
from sklearn.model_selection import StratifiedKFold
# Imbalanced data (90% class 0, 10% class 1)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(f"Class distribution: {np.bincount(y)}") # [900, 100]
# Stratified ensures each fold has ~90/10 split
strat_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=strat_kfold, scoring='f1')
print(f"Stratified KFold F1: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
Leave-One-Out (LOO)
Leave-One-Out Cross-Validation
Train on N-1 samples, test on 1 sample. Repeat N times (where N = dataset size). Gives unbiased estimate but is computationally expensive.
Use when: You have a very small dataset (less than 100 samples) and need every sample for training.
# Leave-One-Out CV (expensive!)
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
# Small dataset
X_small, y_small = make_classification(n_samples=50, n_features=5, random_state=42)
# LOO CV
loo = LeaveOneOut()
model = KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(model, X_small, y_small, cv=loo)
print(f"LOO CV Accuracy: {scores.mean():.3f}")
print(f"Number of folds: {len(scores)}") # 50 folds!
Time Series Split
For time series data, you can't randomly shuffle! The training set must always come before the test set in time.
# Time Series Split (for temporal data)
from sklearn.model_selection import TimeSeriesSplit
import matplotlib.pyplot as plt
# Time series CV - always train on past, test on future
tscv = TimeSeriesSplit(n_splits=5)
# Visualize the splits
X_time = np.arange(100) # Simulated time series indices
fig, axes = plt.subplots(5, 1, figsize=(10, 8))
for i, (train_idx, test_idx) in enumerate(tscv.split(X_time)):
    axes[i].scatter(train_idx, [1]*len(train_idx), c='blue', label='Train', s=10)
    axes[i].scatter(test_idx, [1]*len(test_idx), c='red', label='Test', s=10)
    axes[i].set_title(f'Split {i+1}')
    axes[i].set_yticks([])
plt.tight_layout()
plt.show()
Multiple Metrics in CV
# Cross-validation with multiple metrics
from sklearn.model_selection import cross_validate
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=42)
# Evaluate multiple metrics at once
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
cv_results = cross_validate(model, X, y, cv=5, scoring=scoring, return_train_score=True)
# Print results
print("CROSS-VALIDATION RESULTS")
print("=" * 50)
for metric in scoring:
    train_key = f'train_{metric}'
    test_key = f'test_{metric}'
    print(f"{metric.upper()}")
    print(f"  Train: {cv_results[train_key].mean():.3f} (+/- {cv_results[train_key].std():.3f})")
    print(f"  Test:  {cv_results[test_key].mean():.3f} (+/- {cv_results[test_key].std():.3f})")
| CV Method | Best For | Pros | Cons |
|---|---|---|---|
| K-Fold | General use | Good balance of bias/variance | May not preserve class balance |
| Stratified K-Fold | Classification (imbalanced) | Preserves class proportions | Slightly more complex |
| Leave-One-Out | Tiny datasets | Uses maximum data for training | Very slow, high variance |
| Time Series Split | Sequential/temporal data | Respects time ordering | Early folds have less training data |
Practice Questions: Cross-Validation
Question: If you tune hyperparameters using the test set, you're cheating! Implement nested CV to get an unbiased performance estimate.
Show Answer
# Nested Cross-Validation
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.svm import SVC
# Outer CV for performance estimation
# Inner CV for hyperparameter tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
# Hyperparameter search (inner loop)
param_grid = {'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']}
grid_search = GridSearchCV(SVC(), param_grid, cv=inner_cv, scoring='accuracy')
# Outer loop for unbiased estimation
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv, scoring='accuracy')
print(f"Nested CV Accuracy: {nested_scores.mean():.3f} (+/- {nested_scores.std():.3f})")
print("This estimate is unbiased - it doesn't 'see' the test data during tuning!")
Task: Perform 5-fold cross-validation on a Random Forest classifier and print mean score with standard deviation.
Show Solution
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# Create model
model = RandomForestClassifier(n_estimators=100, random_state=42)
# 5-fold CV
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Individual fold scores: {scores}")
print(f"Mean accuracy: {scores.mean():.3f}")
print(f"Standard deviation: {scores.std():.3f}")
print(f"95% CI: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
Task: Run cross-validation that returns training and test scores for accuracy, precision, recall, and F1 simultaneously.
Show Solution
from sklearn.model_selection import cross_validate
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=42)
# Multiple metrics at once
scoring = ['accuracy', 'precision', 'recall', 'f1']
cv_results = cross_validate(
    model, X, y, cv=5,
    scoring=scoring,
    return_train_score=True
)
# Print formatted results
print("CROSS-VALIDATION RESULTS")
print("=" * 50)
for metric in scoring:
    train_mean = cv_results[f'train_{metric}'].mean()
    train_std = cv_results[f'train_{metric}'].std()
    test_mean = cv_results[f'test_{metric}'].mean()
    test_std = cv_results[f'test_{metric}'].std()
    print(f"\n{metric.upper()}:")
    print(f"  Train: {train_mean:.3f} (+/- {train_std:.3f})")
    print(f"  Test:  {test_mean:.3f} (+/- {test_std:.3f})")
    # Check for overfitting
    if train_mean - test_mean > 0.1:
        print(f"  Warning: Gap suggests overfitting!")
Hyperparameter Tuning
Hyperparameters are settings you choose before training - like the number of trees in a forest or the learning rate. Finding the best combination is crucial for model performance.
What are Hyperparameters?
Think of training a model like baking a cake. Hyperparameters = settings you choose BEFORE baking:
• Oven temperature (350°F vs 400°F)
• Baking time (30 min vs 45 min)
• Pan size (8" vs 9")
Parameters = Things that happen DURING baking:
• How much the cake rises
• How ingredients combine
• The final texture and taste
You choose hyperparameters; the model learns parameters. Hyperparameter tuning = experimenting with different oven temps and times to find the perfect recipe!
Parameters (learned during training):
- Weights in neural networks
- Coefficients in linear regression
- Split points in decision trees
Hyperparameters (set before training):
- Number of trees, max depth
- Learning rate, regularization strength
- Number of neighbors (K)
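The distinction shows up directly in sklearn: you pass hyperparameters to the constructor, and the model exposes learned parameters as fitted attributes. A minimal sketch on synthetic data:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Hyperparameter: chosen BEFORE training (regularization strength C)
model = LogisticRegression(C=0.5, max_iter=1000)

# Parameters: learned DURING training
model.fit(X, y)
print("Hyperparameter C:    ", model.C)          # still exactly what we set
print("Learned coefficients:", model.coef_)      # discovered by fit()
print("Learned intercept:   ", model.intercept_)
```

Tuning searches over values of `C`; `fit()` then finds the coefficients for whichever `C` you chose.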
GridSearchCV - Exhaustive Search
Imagine picking an outfit from a small wardrobe:
• 3 shirts (red, blue, green)
• 2 pants (jeans, khakis)
• 2 shoes (sneakers, boots)
GridSearch tries ALL combinations: red+jeans+sneakers, red+jeans+boots, red+khakis+sneakers, red+khakis+boots, blue+jeans+sneakers, etc. That's 3 × 2 × 2 = 12 complete outfits to try!
With cross-validation (CV=5), you're trying each outfit in 5 different lighting conditions. So 12 × 5 = 60 total evaluations! This is why GridSearch can be slow.
GridSearchCV (Exhaustive Grid Search)
Try every possible combination of hyperparameters in your grid. Guaranteed to find the absolute best combination in your search space, but computationally expensive.
Computational Cost: Number of fits = (combinations) × (cv folds)
Example: 3 params with 4 values each = 4³ = 64 combinations
With 5-fold CV: 64 × 5 = 320 model trainings!
Pro tip: Start with a coarse grid (few values), find promising regions, then do a finer grid search in that region.
# ============================================
# GridSearchCV - Complete Example
# ============================================
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import time
# Generate sample data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
print("=== Setting Up GridSearch ===")
# Define the base model (hyperparameters will be set by GridSearch)
rf = RandomForestClassifier(random_state=42)
# Define hyperparameter grid
# Start with a COARSE grid (few values) to explore broadly
param_grid = {
    'n_estimators': [50, 100, 200],      # Number of trees - 3 values
    'max_depth': [5, 10, 20, None],      # Max tree depth - 4 values
    'min_samples_split': [2, 5, 10],     # Min samples to split - 3 values
    'min_samples_leaf': [1, 2, 4]        # Min samples in leaf - 3 values
}
print("Parameter grid:")
for param, values in param_grid.items():
    print(f"  {param}: {values}")
# Calculate total combinations
total_combinations = 1
for values in param_grid.values():
    total_combinations *= len(values)
print(f"\nTotal combinations: {total_combinations}")
print(f"With 5-fold CV: {total_combinations * 5} model fits!")
print("\nStarting GridSearch... (this may take a while)\n")
# Create GridSearchCV object
grid_search = GridSearchCV(
    estimator=rf,              # Model to tune
    param_grid=param_grid,     # Grid of parameters
    cv=5,                      # 5-fold cross-validation
    scoring='accuracy',        # Metric to optimize
    n_jobs=-1,                 # Use ALL CPU cores (parallel)
    verbose=1,                 # Show progress (2=more detail, 3=even more)
    return_train_score=True    # Also return training scores
)
# Fit - this tries all combinations!
start_time = time.time()
grid_search.fit(X, y)
elapsed = time.time() - start_time
# Display results
print("\n" + "="*50)
print("GRIDSEARCH RESULTS")
print("="*50)
print(f"Time taken: {elapsed:.1f} seconds")
print(f"\nBest parameters found:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")
print(f"\nBest cross-validation score: {grid_search.best_score_:.4f}")
# Access the best model (already trained!)
best_model = grid_search.best_estimator_
print(f"\nBest model: {best_model}")
# Get all results as a DataFrame (useful for analysis)
import pandas as pd
results_df = pd.DataFrame(grid_search.cv_results_)
# Show top 5 parameter combinations
print("\nTop 5 parameter combinations:")
top_5 = results_df.nsmallest(5, 'rank_test_score')[[
    'params', 'mean_test_score', 'std_test_score', 'rank_test_score'
]]
print(top_5.to_string(index=False))
RandomizedSearchCV - Faster Alternative
Why RandomSearch Often Works Better:
• Most hyperparameters don't matter much - only 1-2 are really important
• GridSearch wastes time exploring worthless combinations exhaustively
• RandomSearch explores MORE unique values for important parameters in the same time
• Example: If 'learning_rate' is critical but 'max_features' doesn't matter much, RandomSearch tries 50 different learning_rates vs GridSearch trying only 5
Rule of thumb: Use GridSearch for ≤3 parameters. Use RandomSearch for 4+ parameters or when you're exploring a large space.
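One way to see the difference: a grid commits to a few fixed learning rates, while random search draws a fresh value every trial. A small sketch using `scipy.stats.loguniform` (the specific ranges are illustrative):

```python
from scipy.stats import loguniform

# A grid would test only these 3 learning rates, repeated across every
# combination of the other (possibly unimportant) hyperparameters:
grid_values = [0.001, 0.01, 0.1]

# Random search draws a new value each trial: 10 trials = 10 distinct rates
dist = loguniform(1e-3, 1e-1)
random_values = dist.rvs(size=10, random_state=42)

print("Grid explores:  ", grid_values)
print("Random explores:", random_values.round(4))
```

If learning rate is the parameter that matters, the random strategy has sampled it at ten different points instead of three.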
RandomizedSearchCV
Randomly sample combinations from the parameter space. You control how many combinations to try (n_iter). Often finds good solutions much faster than GridSearch.
Why it's effective:
- Explores more diverse values for important parameters
- Less time wasted on exhaustively trying every combo
- Can use continuous distributions (not just discrete values)
- Finds "good enough" solutions quickly; can increase n_iter if needed
Research shows: random search with around 60 trials often matches or beats a grid of the same size, because it explores more distinct values along the dimensions that actually matter (Bergstra & Bengio, 2012).
# RandomizedSearchCV - Random sampling
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
# Define distributions for random sampling
param_distributions = {
    'n_estimators': randint(50, 500),       # Sample from 50-500
    'max_depth': randint(5, 50),            # Sample from 5-50
    'min_samples_split': randint(2, 20),    # Sample from 2-20
    'min_samples_leaf': randint(1, 10),     # Sample from 1-10
    'max_features': uniform(0.1, 0.9)       # Sample from 0.1-1.0
}
# Random search
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_distributions,
    n_iter=50,             # Try 50 random combinations
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1
)
random_search.fit(X, y)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.3f}")
Analyzing Search Results
# Analyzing search results
import pandas as pd
import matplotlib.pyplot as plt
# Get results as DataFrame
results_df = pd.DataFrame(grid_search.cv_results_)
# View key columns
print(results_df[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']].head(10))
# Plot hyperparameter importance (example with max_depth)
# max_depth=None means "no limit" and can't go on a numeric axis - drop those rows
mask = [p['max_depth'] is not None for p in results_df['params']]
depths = [p['max_depth'] for p, keep in zip(results_df['params'], mask) if keep]
scores = results_df['mean_test_score'][mask]
plt.figure(figsize=(10, 5))
plt.scatter(depths, scores, alpha=0.5)
plt.xlabel('max_depth')
plt.ylabel('Mean CV Score')
plt.title('Effect of max_depth on Performance')
plt.show()
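A groupby over the results table complements the scatter plot by showing each value's average score directly. A self-contained sketch (toy data and a small grid, for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=42)
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'max_depth': [3, 5, 10], 'n_estimators': [50, 100]},
    cv=3,
)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)
# Average CV score for each max_depth, marginalized over n_estimators
summary = results.groupby('param_max_depth')['mean_test_score'].mean()
print(summary)
```

A flat summary here suggests the parameter barely matters, which is useful to know before spending compute refining it.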
Best Practices
Do:
- Start with RandomizedSearch to explore
- Use GridSearch to refine promising regions
- Always use cross-validation (cv=5+)
- Set n_jobs=-1 for parallelization
- Use early stopping when available
Don't:
- Tune on test data (data leakage!)
- Grid search with too many parameters
- Use default hyperparameters blindly
- Forget to check for overfitting
- Tune if the model is already good enough
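To illustrate the "use early stopping" point: some sklearn estimators can halt training once an internal validation split stops improving. A sketch with `GradientBoostingClassifier` on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

gbc = GradientBoostingClassifier(
    n_estimators=1000,        # upper bound, not a commitment
    validation_fraction=0.1,  # 10% held out internally for monitoring
    n_iter_no_change=5,       # stop after 5 rounds without improvement
    random_state=42,
)
gbc.fit(X, y)

# n_estimators_ is how many trees were actually built before stopping
print(f"Requested up to 1000 trees, stopped at {gbc.n_estimators_}")
```

This turns `n_estimators` from a parameter you must tune into a budget the model spends only as needed.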
Common Hyperparameters by Model
| Model | Key Hyperparameters | Typical Range |
|---|---|---|
| Random Forest | n_estimators, max_depth, min_samples_split | 100-500, 10-50, 2-20 |
| XGBoost | learning_rate, max_depth, n_estimators | 0.01-0.3, 3-10, 100-1000 |
| SVM | C, gamma, kernel | 0.1-100, 0.001-1, rbf/linear |
| KNN | n_neighbors, weights, metric | 3-15, uniform/distance, euclidean |
| Logistic Regression | C, penalty, solver | 0.01-100, l1/l2, lbfgs/saga |
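Ranges like C: 0.1-100 span several orders of magnitude, so they are usually searched on a log scale rather than a linear one. A small sketch using `np.logspace` (toy data for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Evenly spaced in log-space: [0.01, 0.1, 1, 10, 100]
c_values = np.logspace(-2, 2, 5)
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {'C': c_values},
    cv=5,
)
grid.fit(X, y)
print(f"Log-scale grid: {c_values}")
print(f"Best C: {grid.best_params_['C']}")
```

A linear grid over 0.1-100 would cluster almost all its points above 10; the log grid covers every order of magnitude equally.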
Practice Questions: Hyperparameter Tuning
Question: Create a complete pipeline that: (1) scales data, (2) tunes a model, and (3) evaluates on a held-out test set.
Show Answer
# Complete tuning pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report
# Split data - keep test set completely separate!
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(random_state=42))
])
# Define parameter grid (note: prefix with step name__)
param_grid = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__kernel': ['rbf', 'linear'],
    'svc__gamma': ['scale', 'auto', 0.1, 1]
}
# Grid search on training data only
grid_search = GridSearchCV(
    pipeline, param_grid, cv=5,
    scoring='f1', n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV F1: {grid_search.best_score_:.3f}")
# Evaluate on held-out test set
y_pred = grid_search.predict(X_test)
print("\n" + "="*50)
print("TEST SET PERFORMANCE (Never seen during tuning!)")
print("="*50)
print(classification_report(y_test, y_pred))
Task: Use GridSearchCV to find the best n_neighbors for KNN classifier from values [3, 5, 7, 9, 11].
Show Solution
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
# Define model and parameter grid
knn = KNeighborsClassifier()
param_grid = {'n_neighbors': [3, 5, 7, 9, 11]}
# Grid search
grid_search = GridSearchCV(
    knn, param_grid, cv=5,
    scoring='accuracy'
)
grid_search.fit(X, y)
print(f"Best K: {grid_search.best_params_['n_neighbors']}")
print(f"Best CV Accuracy: {grid_search.best_score_:.3f}")
# View all results
results = pd.DataFrame(grid_search.cv_results_)
print(results[['param_n_neighbors', 'mean_test_score', 'rank_test_score']])
Task: Use RandomizedSearchCV to tune a Random Forest with random sampling from distributions.
Show Solution
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint, uniform
# Model
rf = RandomForestClassifier(random_state=42)
# Parameter distributions (not fixed values!)
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(3, 30),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9)
}
# Random search - try 20 random combinations
random_search = RandomizedSearchCV(
    rf, param_dist,
    n_iter=20,             # Only 20 combinations instead of all
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)
random_search.fit(X, y)
print(f"Best params: {random_search.best_params_}")
print(f"Best CV Accuracy: {random_search.best_score_:.3f}")
Quick Reference Guide
Use these tables to quickly decide which metrics, CV strategy, and tuning method to use for your specific problem.
When to Use Each Metric
| Problem Type | Primary Metric | Secondary Metrics | When to Use |
|---|---|---|---|
| Regression (general) | MAE + RMSE | R² | MAE for interpretability, RMSE when large errors are bad, R² to explain variance |
| Balanced Classification | Accuracy | Confusion Matrix | Equal class sizes, all errors equally costly |
| Imbalanced Classification | F1-Score | Precision, Recall, PR-AUC | Minority class important, need balance of precision/recall |
| Medical Diagnosis | Recall | F1, Confusion Matrix | Missing sick patients is dangerous (minimize false negatives) |
| Spam Detection | Precision | F1 | False alarms annoy users (minimize false positives) |
| Ranking/Threshold Selection | ROC-AUC | PR-AUC (if imbalanced) | Comparing models across all thresholds, probability calibration |
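Each metric in the table corresponds to an sklearn scoring string, so switching the evaluation target is a one-argument change. A sketch on a synthetic imbalanced dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Imbalanced data: roughly 90% negative, 10% positive
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)
model = RandomForestClassifier(random_state=42)

# Same model, different questions: pick the scorer that matches your row in the table
results = {}
for scoring in ['accuracy', 'f1', 'recall', 'roc_auc']:
    scores = cross_val_score(model, X, y, cv=5, scoring=scoring)
    results[scoring] = scores.mean()
    print(f"{scoring:>10}: {scores.mean():.3f}")
```

On data this imbalanced, expect accuracy to look flattering while f1 and recall tell a harsher story about the minority class.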
CV Strategy Quick Reference
K-Fold (K=5): Default choice for most problems. ✓ Fast ✓ Reliable. Use for: balanced data.
Stratified K-Fold: For classification with imbalanced classes. ✓ Preserves class proportions. Use for: imbalanced classes.
TimeSeriesSplit: For temporal data. ✓ No data leakage ✗ Never shuffle! Use for: time series.
Tuning Method Decision Matrix
| Situation | Use This | Why? |
|---|---|---|
| 1-3 hyperparameters, clear ranges | GridSearchCV | Exhaustive search is feasible and finds exact best |
| 4+ hyperparameters or large search space | RandomizedSearchCV | Much faster, explores space effectively |
| Expensive training (large data/slow model) | Lower CV folds (cv=3) | Saves computation while maintaining reliability |
| Initial exploration (don't know good ranges) | RandomizedSearchCV → GridSearchCV | Random finds region, Grid refines within it |
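The last row's two-stage workflow can be sketched as follows; the dataset and depth ranges are illustrative:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=42)
rf = RandomForestClassifier(random_state=42)

# Stage 1: broad random exploration to find a promising region
coarse = RandomizedSearchCV(
    rf, {'max_depth': randint(2, 30)}, n_iter=10, cv=3, random_state=42
)
coarse.fit(X, y)
best_depth = coarse.best_params_['max_depth']

# Stage 2: fine grid centered on the region stage 1 found
fine = GridSearchCV(
    rf, {'max_depth': [max(2, best_depth - 2), best_depth, best_depth + 2]}, cv=3
)
fine.fit(X, y)
print(f"Coarse best depth: {best_depth}, refined best: {fine.best_params_['max_depth']}")
```

The coarse pass spends its budget covering the whole range; the fine pass spends its budget only where it is likely to pay off.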
Key Takeaways
Choose the Right Metrics
Use MSE/RMSE for regression when large errors are costly. Use F1-score for imbalanced classification. Accuracy is only meaningful with balanced classes.
Precision vs Recall Trade-off
Prioritize recall when missing positives is costly (disease detection). Prioritize precision when false alarms are costly (spam filtering).
Confusion Matrix Reveals All
Always examine the confusion matrix. It shows exactly where your model fails - which classes get confused and in what direction.
ROC-AUC for Threshold-Free Evaluation
ROC-AUC compares models regardless of threshold. Use Precision-Recall curves instead for highly imbalanced datasets.
Cross-Validation is Essential
Always use cross-validation for reliable performance estimates. Use Stratified K-Fold for classification, TimeSeriesSplit for temporal data.
Tune Hyperparameters Properly
Never tune on test data! Use GridSearchCV for small grids, RandomizedSearchCV for exploration. Keep a final test set completely unseen.
Knowledge Check
Quick Quiz
Test what you've learned about model evaluation and tuning