Module 9.4

Model Evaluation & Tuning

Learn how to measure model performance correctly, avoid common pitfalls, and optimize your models using cross-validation and hyperparameter tuning. The difference between a good model and a great one lies here!

50 min read
Intermediate
Hands-on Examples
What You'll Learn
  • Regression metrics (MSE, RMSE, MAE, R²)
  • Classification metrics (precision, recall, F1)
  • Confusion matrix & ROC-AUC curves
  • Cross-validation techniques
  • Hyperparameter tuning strategies
Contents
01

Regression Metrics

Regression models predict continuous values, so we need metrics that measure how far our predictions are from the actual values. Understanding these metrics helps you choose the right one for your problem and interpret model performance correctly.

Why Multiple Metrics?

No single metric tells the whole story. Each metric has strengths and weaknesses, and the right choice depends on your specific use case. Let's understand when to use each one.

Key principle: All error metrics (MSE, RMSE, MAE) should be as low as possible. R² should be as high as possible (max 1.0, but can be negative for terrible models).

Mean Absolute Error (MAE)

Real-World Example: Imagine predicting house prices. If your MAE is $15,000, it means your predictions are off by an average of $15,000 in either direction. A $200,000 house might be predicted as $185,000 or $215,000. MAE tells you the typical error size.
Error Metric

Mean Absolute Error (MAE)

The average of the absolute differences between predictions and actual values. MAE = (1/n) × Σ|yᵢ - ŷᵢ|

Interpretation: "On average, predictions are off by X units." MAE is in the same units as your target variable (dollars, meters, temperature, etc.), making it very intuitive to interpret and explain to non-technical stakeholders.

Why "Absolute"? We use absolute value |error| because we don't care if we're predicting too high or too low - we just care about the magnitude of the mistake. Without absolute value, a +10 error and -10 error would cancel out to 0, hiding the mistakes!
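A two-line check makes the cancellation concrete (a quick sketch with one overshoot and one undershoot of equal size):

```python
import numpy as np

# One prediction 10 too high, one 10 too low
errors = np.array([10, -10])

print(np.mean(errors))          # 0.0  - raw errors cancel, hiding both mistakes
print(np.mean(np.abs(errors)))  # 10.0 - MAE reports the true typical error size
```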

# ============================================
# Mean Absolute Error (MAE) - Step by Step
# ============================================
from sklearn.metrics import mean_absolute_error
import numpy as np

# Example: Predicting restaurant bills (in dollars)
y_true = np.array([100, 150, 200, 250, 300])  # Actual bills
y_pred = np.array([110, 140, 210, 240, 320])  # Model predictions

print("Comparing Actual vs Predicted:")
print("Actual:", y_true)
print("Predicted:", y_pred)
print()

# Let's calculate MAE manually to understand what's happening:
print("Manual Calculation:")
errors = np.abs(y_true - y_pred)  # Take absolute value of each error
print("Absolute Errors:", errors)  # [10, 10, 10, 10, 20]
print(f"Sum of errors: {np.sum(errors)}")  # 60
print(f"Number of samples: {len(errors)}")  # 5
manual_mae = np.sum(errors) / len(errors)  # Average
print(f"Manual MAE: {manual_mae:.2f}")
print()

# Now with sklearn (same result, less code!):
mae = mean_absolute_error(y_true, y_pred)
print(f"Sklearn MAE: {mae:.2f}")
print()

# INTERPRETATION:
print("=" * 50)
print("WHAT THIS MEANS:")
print(f"On average, predictions are off by ${mae:.2f}")
print("This is in the SAME UNITS as your target (dollars).")
print("Lower MAE = Better model!")
print("=" * 50)
MAE Advantages
  • Easy to interpret (same units as target)
  • Robust to outliers
  • Linear penalty for all errors
MAE Disadvantages
  • Doesn't heavily penalize large errors
  • Not differentiable at zero (optimization issue)
  • May not be ideal when big errors are costly

Mean Squared Error (MSE) & RMSE

Why Squaring Matters: Think of it like grading an exam with squared penalties. A student who gets 10 questions wrong isn't just 5 times worse than one who gets 2 wrong - under squaring, they score 25 times worse (10²/2² = 100/4 = 25x). MSE punishes big mistakes much more than small ones. This is useful when large errors are especially problematic (like predicting medical dosages - being way off could be dangerous!).
Error Metric

Mean Squared Error (MSE)

The average of the squared differences between predictions and actual values. MSE = (1/n) × Σ(yᵢ - ŷᵢ)²

Key property: Squaring penalizes large errors far more heavily than small ones - the penalty grows quadratically, not linearly:

  • Error of 2 → Contributes 4 to MSE (2²)
  • Error of 5 → Contributes 25 to MSE (5²) - 6.25x worse!
  • Error of 10 → Contributes 100 to MSE (10²) - 25x worse than error of 2!

RMSE (Root MSE): We take the square root to bring MSE back to original units. RMSE is more interpretable than MSE because it's in the same scale as your target variable.

# ============================================
# MSE and RMSE - Understanding the Difference
# ============================================
from sklearn.metrics import mean_squared_error
import numpy as np

# Using same example data
y_true = np.array([100, 150, 200, 250, 300])
y_pred = np.array([110, 140, 210, 240, 320])

print("=== Manual Calculation to Understand MSE ===")
errors = y_true - y_pred  # [−10, 10, −10, 10, −20]
print("Raw errors:", errors)

squared_errors = errors ** 2  # [100, 100, 100, 100, 400]
print("Squared errors:", squared_errors)
print("→ Notice how the −20 error (squared = 400) dominates!")
print()

manual_mse = np.mean(squared_errors)
print(f"Manual MSE: {manual_mse:.2f}")
print(f"Manual RMSE: {np.sqrt(manual_mse):.2f}")
print()

# Using sklearn:
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # Take square root to get back to original units
# In sklearn >= 1.4 you can use root_mean_squared_error(y_true, y_pred) directly
# (the older squared=False argument to mean_squared_error is deprecated)

print("=== Using Sklearn ===")
print(f"MSE:  {mse:.2f}  (units²)")
print(f"RMSE: {rmse:.2f}  (original units - dollars)")
print()

# Compare with MAE:
from sklearn.metrics import mean_absolute_error  # needed if running this block alone

mae = mean_absolute_error(y_true, y_pred)
print("=== Comparison ===")
print(f"MAE:  ${mae:.2f}")
print(f"RMSE: ${rmse:.2f}")
print(f"Ratio RMSE/MAE: {rmse/mae:.2f}")
print()
print("INTERPRETATION:")
print("→ RMSE > MAE means we have some large errors")
print("→ If RMSE ≈ MAE, all errors are similar size")
print("→ If RMSE >> MAE (like 2x or more), we have outliers!")
RMSE vs MAE: If RMSE is significantly larger than MAE, it means you have some large errors (outliers). The ratio RMSE/MAE indicates error distribution - a ratio close to 1 means errors are similar in size; a high ratio means some predictions are way off.

R-Squared (R²) - Coefficient of Determination

Simple Analogy: Imagine you're trying to predict exam scores. The baseline (dumbest) approach is to always predict the average score. R² tells you how much better your model is than this baseline.

• R² = 1.0 (100%) → Perfect! Your model explains ALL variation in scores
• R² = 0.80 (80%) → Good! Your model explains 80% of why scores vary; 20% is still random/unexplained
• R² = 0.50 (50%) → Okay - Better than guessing the mean, but leaves a lot unexplained
• R² = 0.0 (0%) → Useless - No better than always predicting the average
• R² < 0 (negative) → Terrible! Worse than just guessing the mean every time!
Goodness of Fit

R² Score (Coefficient of Determination)

Measures what proportion of the variance (variation) in the target variable is explained by your model.
R² = 1 - (SS_residual / SS_total) = 1 - (Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²)

Range and Interpretation:

  • R² = 1.0: Perfect prediction - All variance explained
  • R² = 0.7-0.9: Good model - Explains most of the variance
  • R² = 0.5-0.7: Moderate - Some predictive power
  • R² = 0.0: Baseline - No better than predicting mean
  • R² < 0: Worse than baseline! - Model is making things worse

Important: R² can be negative on test data if predictions are consistently worse than just predicting the mean. This is a red flag that your model has serious problems!

# R-Squared Score
from sklearn.metrics import r2_score

r2 = r2_score(y_true, y_pred)
print(f"R² Score: {r2:.4f}")
# R² = 0.97 means model explains 97% of variance in target
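To tie the score back to the formula, here is R² computed by hand from SS_residual and SS_total, using the same restaurant-bill arrays as the earlier examples:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100, 150, 200, 250, 300])
y_pred = np.array([110, 140, 210, 240, 320])

ss_res = np.sum((y_true - y_pred) ** 2)         # 800   - model's squared error
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 25000 - baseline (predict the mean)
r2_manual = 1 - ss_res / ss_tot

print(f"Manual R²:  {r2_manual:.4f}")  # 0.9680
print(f"Sklearn R²: {r2_score(y_true, y_pred):.4f}")
```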

Complete Regression Evaluation

# Complete regression evaluation example
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Generate sample data
X, y = make_regression(n_samples=200, n_features=3, noise=15, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Calculate all metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("=" * 40)
print("REGRESSION METRICS SUMMARY")
print("=" * 40)
print(f"MAE:  {mae:.4f}  (avg error in original units)")
print(f"MSE:  {mse:.4f}  (penalizes large errors)")
print(f"RMSE: {rmse:.4f}  (interpretable units)")
print(f"R²:   {r2:.4f}  (variance explained)")
print("=" * 40)
Metric     Best When                                       Sensitive to Outliers?
MAE        All errors matter equally, outliers present     No (robust)
MSE/RMSE   Large errors are especially bad                 Yes (penalizes heavily)
R²         Comparing models, explaining variance           Somewhat

Practice Questions: Regression Metrics

Task: Given y_true = [10, 20, 30, 40, 100] and y_pred = [12, 18, 32, 38, 50], calculate MAE and RMSE. What does their ratio tell you?

Solution:
y_true = np.array([10, 20, 30, 40, 100])
y_pred = np.array([12, 18, 32, 38, 50])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"Ratio RMSE/MAE: {rmse/mae:.2f}")
# High ratio (~1.9) indicates presence of large errors (outliers)
# The 100->50 prediction (error of 50) dominates RMSE

Question: What does an R-squared of 0.85 mean? What about a negative R-squared?

Solution:

R-squared = 0.85: The model explains 85% of the variance in the target variable. 15% remains unexplained.

Negative R-squared: The model is worse than just predicting the mean! This happens when predictions are completely off.

# Demonstration of negative R-squared
from sklearn.metrics import r2_score

y_true = [1, 2, 3, 4, 5]
y_pred = [10, 20, 30, 40, 50]  # Completely wrong scale!

print(f"R-squared: {r2_score(y_true, y_pred):.2f}")  # Negative!

Task: Write a function that takes y_true and y_pred and prints a complete evaluation report with all regression metrics.

Solution:
def regression_report(y_true, y_pred):
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
    import numpy as np
    
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    mape = np.mean(np.abs((np.array(y_true) - np.array(y_pred)) / np.array(y_true))) * 100  # assumes y_true has no zeros
    
    print("REGRESSION EVALUATION REPORT")
    print("=" * 40)
    print(f"MAE:  {mae:.4f}")
    print(f"MSE:  {mse:.4f}")
    print(f"RMSE: {rmse:.4f}")
    print(f"R2:   {r2:.4f}")
    print(f"MAPE: {mape:.2f}%")
    print("=" * 40)
    
    if rmse / mae > 1.5:
        print("Warning: High RMSE/MAE ratio suggests outliers")
    if r2 < 0:
        print("Warning: Negative R2 - model is worse than baseline!")

# Usage
regression_report(y_true, y_pred)
02

Classification Metrics

Classification models predict categories, so we need metrics that measure how often predictions are correct. But accuracy alone can be misleading - especially with imbalanced datasets. Let's understand the full toolkit.

The Accuracy Trap

Consider a fraud detection model where 99% of transactions are legitimate. A model that predicts "not fraud" for everything achieves 99% accuracy - but catches zero fraudsters! This is why we need precision, recall, and F1 score.

Warning: High accuracy doesn't always mean a good model. On imbalanced datasets, always check precision, recall, and F1 score for the minority class.
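The trap is easy to reproduce with a toy imbalanced dataset and a "lazy" model that always predicts the majority class (a sketch; the numbers are chosen for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 990 legitimate (0) and 10 fraudulent (1) transactions
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)  # lazy model: always predict "not fraud"

print(f"Accuracy: {accuracy_score(y_true, y_pred):.1%}")  # 99.0% - looks great!
print(f"Recall:   {recall_score(y_true, y_pred):.1%}")    # 0.0% - catches zero fraud
```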

Accuracy

Basic Metric

Accuracy

The ratio of correct predictions to total predictions. Accuracy = (TP + TN) / (TP + TN + FP + FN)

Best for: Balanced datasets where all classes are equally important.

# Accuracy Score
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.2%}")  # 80%

Precision

Simple Analogy: Imagine you're a fisherman trying to catch salmon. Precision answers: "Of all the fish I thought were salmon and put in my bucket, how many were actually salmon?" If you catch 100 fish and call them all salmon, but only 70 are actually salmon (30 are other fish), your precision is 70%. High precision = fewer false alarms = you're picky and careful.
Quality Metric

Precision (Positive Predictive Value)

Of all positive predictions your model made, what proportion were actually positive?
Precision = TP / (TP + FP) = "True Positives / All Predicted Positives"

Focus on Precision when: False positives are costly

  • Spam filter: Marking good emails as spam annoys users (false positive = bad!)
  • Product recommendations: Suggesting bad products hurts trust
  • Ad targeting: Showing ads to wrong people wastes money
  • Criminal justice: False accusation is very serious
# Precision Score
from sklearn.metrics import precision_score

precision = precision_score(y_true, y_pred)
print(f"Precision: {precision:.2%}")
# "Of all emails I marked as spam, X% were actually spam"

Recall (Sensitivity, True Positive Rate)

Medical Analogy: Imagine a cancer screening test. There are 100 people with cancer in a population. Recall answers: "Of those 100 people who actually have cancer, how many did the test successfully identify?" If the test catches 95 out of 100 cancer patients, recall is 95%. High recall = catching most cases = you're thorough and don't miss much.

In cancer detection, missing a patient (false negative) can be deadly! That's why recall is critical in medical diagnosis - you'd rather have some false alarms than miss sick patients.
Coverage Metric

Recall (Sensitivity, True Positive Rate)

Of all actual positives in the data, what proportion did we successfully catch?
Recall = TP / (TP + FN) = "True Positives / All Actually Positive"

Focus on Recall when: False negatives are costly/dangerous

  • Disease detection: Missing a sick patient can be fatal (false negative = very bad!)
  • Fraud detection: Missing fraud costs money and hurts customers
  • Search engines: Missing relevant results frustrates users
  • Security systems: Missing a threat can be catastrophic
# Recall Score
from sklearn.metrics import recall_score

recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.2%}")
# "Of all actual spam emails, I caught X%"

F1 Score - The Balance

Harmonic Mean

F1 Score

The harmonic mean of precision and recall - balances both. F1 = 2 × (Precision × Recall) / (Precision + Recall)

Why harmonic mean? It penalizes extreme values. If precision=1.0 but recall=0.01, F1 ≈ 0.02 (not 0.5 like arithmetic mean).

# F1 Score
from sklearn.metrics import f1_score

f1 = f1_score(y_true, y_pred)
print(f"F1 Score: {f1:.2%}")
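To see why the harmonic mean is the right combiner, compare it with the arithmetic mean for the extreme precision/recall pair mentioned above:

```python
precision, recall = 1.0, 0.01  # perfect precision, near-zero recall

arithmetic = (precision + recall) / 2
harmonic = 2 * precision * recall / (precision + recall)

print(f"Arithmetic mean: {arithmetic:.3f}")  # 0.505 - deceptively okay
print(f"F1 (harmonic):   {harmonic:.3f}")    # 0.020 - exposes the useless recall
```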

Complete Classification Report

# Complete classification evaluation
from sklearn.metrics import classification_report, accuracy_score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Generate imbalanced data
X, y = make_classification(
    n_samples=1000, n_features=10, n_classes=2,
    weights=[0.9, 0.1],  # 90% class 0, 10% class 1
    random_state=42
)

# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Classification report - shows all metrics per class
print("CLASSIFICATION REPORT")
print("=" * 50)
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))
High Precision

When you predict positive, you're usually right. Few false alarms.

High Recall

You catch most actual positives. Few missed cases.

High F1

Good balance of both. Neither too many false alarms nor missed cases.

Precision vs Recall Trade-off: You can usually increase one at the cost of the other by adjusting the classification threshold. Lower threshold → higher recall, lower precision. Higher threshold → higher precision, lower recall.
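A minimal sketch of that trade-off, sweeping three thresholds over a freshly trained toy classifier (the dataset and model here are illustrative, not from the earlier examples):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

proba = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

for threshold in [0.3, 0.5, 0.7]:
    pred = (proba >= threshold).astype(int)  # positive if probability clears the bar
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_te, pred):.2f}, "
          f"recall={recall_score(y_te, pred):.2f}")
```

Raising the threshold shrinks the set of predicted positives, so recall can only fall (or stay flat) while precision tends to rise.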

Practice Questions: Classification Metrics

Question: For a cancer detection model, would you prioritize precision or recall? Why?

Answer:

Recall! Missing a cancer diagnosis (false negative) is far worse than a false alarm (false positive) that leads to more tests. You'd rather have some healthy patients get extra tests than miss cancer patients.

# For cancer detection, optimize for recall
from sklearn.metrics import recall_score, make_scorer

# You can use this as scoring in GridSearchCV
recall_scorer = make_scorer(recall_score)

Task: Given TP=80, FP=20, FN=10, TN=90, calculate precision, recall, and F1 score manually, then verify with sklearn.

Solution:
# Manual calculation
TP, FP, FN, TN = 80, 20, 10, 90

precision = TP / (TP + FP)  # 80 / 100 = 0.80
recall = TP / (TP + FN)     # 80 / 90 = 0.889
f1 = 2 * (precision * recall) / (precision + recall)  # 0.842

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1 Score: {f1:.3f}")

# Verify with sklearn
from sklearn.metrics import precision_score, recall_score, f1_score

# Reconstruct y_true and y_pred from confusion matrix values
y_true = [1]*TP + [1]*FN + [0]*FP + [0]*TN
y_pred = [1]*TP + [0]*FN + [1]*FP + [0]*TN

print(f"\nSklearn verification:")
print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall: {recall_score(y_true, y_pred):.3f}")
print(f"F1: {f1_score(y_true, y_pred):.3f}")

Task: Train a multi-class classifier and generate a complete classification report. Explain macro vs weighted averages.

Solution:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load multi-class data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Train and predict
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Classification report
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Macro avg: Simple average across classes (treats all classes equally)
# Weighted avg: Average weighted by support (class frequency)
# Use macro when all classes are equally important
# Use weighted when you want to account for class imbalance
03

Confusion Matrix & ROC-AUC

The confusion matrix gives you the complete picture of classification performance, while ROC curves help you understand performance across all thresholds.

The Confusion Matrix

Medical Test Analogy: Imagine a COVID test. The confusion matrix shows 4 scenarios:

1. True Positive (TP): Test says "Positive" AND person actually has COVID ✅
2. True Negative (TN): Test says "Negative" AND person actually doesn't have COVID ✅
3. False Positive (FP): Test says "Positive" BUT person doesn't have COVID ❌ (False alarm!)
4. False Negative (FN): Test says "Negative" BUT person actually has COVID ❌ (Dangerous miss!)

The confusion matrix counts how many of each scenario happened. The diagonal (TP and TN) = correct predictions. Off-diagonal (FP and FN) = mistakes.
Visualization Tool

Confusion Matrix

A table showing predicted vs actual classes. For binary classification, it has 4 cells: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

Reading the Matrix (sklearn format):

  • Rows: What the ACTUAL truth was
  • Columns: What your model PREDICTED
  • Diagonal cells: Correct predictions ✅ (high numbers = good!)
  • Off-diagonal cells: Mistakes ❌ (low numbers = good!)

Remember: "True" means correct, "False" means wrong. "Positive/Negative" refers to what the model predicted.

                     Predicted Negative (0)      Predicted Positive (1)
Actual Negative (0)  TN (correct rejection)      FP (false alarm, Type I)
Actual Positive (1)  FN (missed! Type II)        TP (correct hit!)
# Creating and visualizing a confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Sample predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1]

# Create confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
# [[5, 2],    # TN=5, FP=2
#  [2, 6]]    # FN=2, TP=6

# Visualize with heatmap
fig, ax = plt.subplots(figsize=(8, 6))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Negative', 'Positive'])
disp.plot(cmap='Blues', ax=ax)
plt.title('Confusion Matrix')
plt.show()

Normalized Confusion Matrix

Raw counts can be misleading with imbalanced classes. Normalize to see proportions!

# Normalized confusion matrix (row-wise)
cm_normalized = confusion_matrix(y_true, y_pred, normalize='true')
print("Normalized Confusion Matrix (by true label):")
print(cm_normalized)

# Visualize
disp = ConfusionMatrixDisplay(confusion_matrix=cm_normalized, 
                               display_labels=['Negative', 'Positive'])
disp.plot(cmap='Blues', values_format='.2%')
plt.title('Normalized Confusion Matrix')
plt.show()

ROC Curve

Threshold-Independent

ROC Curve (Receiver Operating Characteristic)

A plot showing the trade-off between True Positive Rate (Recall) and False Positive Rate at various classification thresholds. It helps you choose the optimal threshold.

Axes: X-axis = False Positive Rate (FPR = FP / (FP + TN)), Y-axis = True Positive Rate (TPR = Recall = TP / (TP + FN))

# ROC Curve
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Generate sample data
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model and get probability predictions
model = LogisticRegression()
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]  # Probability of positive class

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, 'b-', linewidth=2, label='Model')
plt.plot([0, 1], [0, 1], 'r--', label='Random Classifier')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

AUC Score

Area Under the Curve

AUC (Area Under ROC Curve)

A single number summarizing model performance across all thresholds. Ranges from 0 to 1, where 1 is perfect and 0.5 is random guessing.

Interpretation: AUC = probability that a randomly chosen positive example ranks higher than a randomly chosen negative example.

# AUC Score
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_test, y_proba)
print(f"AUC Score: {auc:.3f}")

# Interpretation:
# AUC = 0.5  → Random guessing
# AUC = 0.7  → Acceptable
# AUC = 0.8  → Good
# AUC = 0.9+ → Excellent
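That pairwise interpretation can be verified directly by counting, over every (positive, negative) pair, how often the positive example gets the higher score (a small hand-made example; ties would count half):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1])
scores = np.array([0.1, 0.6, 0.35, 0.8, 0.4])

pos = scores[y_true == 1]  # scores of actual positives
neg = scores[y_true == 0]  # scores of actual negatives

# Pairwise score differences: one entry per (positive, negative) pair
pairs = pos[:, None] - neg[None, :]
manual_auc = (np.sum(pairs > 0) + 0.5 * np.sum(pairs == 0)) / pairs.size

print(f"Pairwise AUC: {manual_auc:.3f}")  # 0.833 (5 of 6 pairs ranked correctly)
print(f"Sklearn AUC:  {roc_auc_score(y_true, scores):.3f}")
```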
When to Use ROC-AUC
  • Comparing models regardless of threshold
  • When you need a single performance number
  • When false positives and false negatives are equally costly
  • For balanced datasets
When ROC-AUC Misleads
  • Highly imbalanced datasets (use Precision-Recall curve instead)
  • When you care more about the positive class
  • When false positives and false negatives have different costs

Precision-Recall Curve (Alternative)

# Precision-Recall Curve (better for imbalanced data)
from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

# Calculate precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
ap = average_precision_score(y_test, y_proba)

# Plot
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, 'b-', linewidth=2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'Precision-Recall Curve (AP = {ap:.3f})')
plt.grid(True, alpha=0.3)
plt.show()

print(f"Average Precision: {ap:.3f}")
Choosing the Right Curve: Use ROC-AUC for balanced datasets. Use Precision-Recall curve and Average Precision (AP) for imbalanced datasets where you care more about the minority (positive) class.
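A quick sketch of the divergence: on a roughly 95/5 imbalanced problem, ROC-AUC often looks comfortable while average precision tells a harsher story (the dataset here is synthetic, so exact numbers will vary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# ROC-AUC is computed over all positive/negative pairs and can look flattering
# under imbalance; average precision focuses on the minority (positive) class
print(f"ROC-AUC:           {roc_auc_score(y_te, proba):.3f}")
print(f"Average Precision: {average_precision_score(y_te, proba):.3f}")
```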

Practice Questions: ROC & AUC

Task: Given two models with predictions, plot their ROC curves on the same graph and determine which is better.

Answer:
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

# Train two models
rf = RandomForestClassifier(random_state=42)
lr = LogisticRegression(random_state=42)

rf.fit(X_train, y_train)
lr.fit(X_train, y_train)

# Get probabilities
rf_proba = rf.predict_proba(X_test)[:, 1]
lr_proba = lr.predict_proba(X_test)[:, 1]

# Calculate ROC curves
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_proba)
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_proba)

# Calculate AUC
rf_auc = roc_auc_score(y_test, rf_proba)
lr_auc = roc_auc_score(y_test, lr_proba)

# Plot both
plt.figure(figsize=(8, 6))
plt.plot(rf_fpr, rf_tpr, 'b-', label=f'Random Forest (AUC={rf_auc:.3f})')
plt.plot(lr_fpr, lr_tpr, 'g-', label=f'Logistic Reg (AUC={lr_auc:.3f})')
plt.plot([0, 1], [0, 1], 'r--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

The model with the higher AUC (curve closer to top-left) is better!

Question: Given this confusion matrix [[85, 15], [25, 75]], identify TP, TN, FP, FN and calculate accuracy.

Solution:
# Confusion matrix format in sklearn:
# [[TN, FP],
#  [FN, TP]]

TN, FP = 85, 15
FN, TP = 25, 75

print(f"True Negatives (TN): {TN}")
print(f"False Positives (FP): {FP}")
print(f"False Negatives (FN): {FN}")
print(f"True Positives (TP): {TP}")

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(f"\nAccuracy: {accuracy:.2%}")  # 80%

Task: Use the ROC curve to find the optimal classification threshold using Youden's J statistic.

Solution:
from sklearn.metrics import roc_curve
import numpy as np

# Get ROC curve values
fpr, tpr, thresholds = roc_curve(y_test, y_proba)

# Youden's J statistic: J = TPR - FPR (maximize this)
j_scores = tpr - fpr
optimal_idx = np.argmax(j_scores)
optimal_threshold = thresholds[optimal_idx]

print(f"Optimal Threshold: {optimal_threshold:.3f}")
print(f"At this threshold:")
print(f"  TPR (Recall): {tpr[optimal_idx]:.3f}")
print(f"  FPR: {fpr[optimal_idx]:.3f}")

# Apply optimal threshold
y_pred_optimal = (y_proba >= optimal_threshold).astype(int)

# Compare with default 0.5 threshold
from sklearn.metrics import f1_score
print(f"\nF1 with 0.5 threshold: {f1_score(y_test, y_proba >= 0.5):.3f}")
print(f"F1 with optimal threshold: {f1_score(y_test, y_pred_optimal):.3f}")
04

Cross-Validation Techniques

Training on one split and testing on another can give lucky or unlucky results. Cross-validation provides a more robust estimate of model performance by using multiple splits.

Why Cross-Validation?

The Problem: A single train/test split can give misleading results. Your test set might be "easy" or "hard" by chance. Cross-validation averages across multiple splits to give a more reliable estimate.
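The instability is easy to observe: train the same model on the same data, changing only the random split, and watch the test accuracy move (a sketch; exact numbers depend on your environment):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

scores = []
for seed in range(5):  # same model and data - only the split changes
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))
    print(f"split seed={seed}: test accuracy={scores[-1]:.3f}")

print(f"Spread across splits: {max(scores) - min(scores):.3f}")
```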

K-Fold Cross-Validation

Restaurant Analogy: Imagine testing a chef's cooking ability. Instead of having them cook just ONE meal for judges (which might be lucky or unlucky), you have them cook 5 different meals on 5 different days, with different judges each time. Then you average the scores. This is more reliable than a single test!

How K-Fold Works (K=5 example):
1. Split your data into 5 equal pieces (folds)
2. Round 1: Train on folds 1,2,3,4 → Test on fold 5 → Get score #1
3. Round 2: Train on folds 1,2,3,5 → Test on fold 4 → Get score #2
4. Round 3: Train on folds 1,2,4,5 → Test on fold 3 → Get score #3
5. Round 4: Train on folds 1,3,4,5 → Test on fold 2 → Get score #4
6. Round 5: Train on folds 2,3,4,5 → Test on fold 1 → Get score #5
7. Final score = Average of all 5 scores ± standard deviation

Every data point gets to be in the test set exactly once!
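The round-by-round procedure above maps directly onto KFold.split, which yields the train/test index pairs one round at a time (a sketch with a small toy model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=100, n_features=5, random_state=42)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for round_num, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    # Train on 4 folds, test on the held-out fold
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
    print(f"Round {round_num}: train on {len(train_idx)}, "
          f"test on {len(test_idx)}, score={scores[-1]:.3f}")

print(f"Final: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

Concatenating the five test-index arrays covers every sample exactly once, which is the defining property of K-fold.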
Standard Method

K-Fold Cross-Validation

Split data into K equal folds. Train on K-1 folds, test on 1 fold. Repeat K times, each time using a different fold as test set. Report average performance ± std deviation.

Common choices:

  • K=5: Good balance, trains on 80% data (faster, slightly biased)
  • K=10: More stable estimate, trains on 90% data (slower, less biased)
  • More folds → more computation, but a more reliable performance estimate
  • Fewer folds → faster, but a noisier estimate (you might get unlucky with a split)
# ============================================
# K-Fold Cross-Validation - Complete Example
# ============================================
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import numpy as np

# Generate sample data
X, y = make_classification(
    n_samples=1000,      # 1000 samples
    n_features=20,       # 20 features
    n_classes=2,         # Binary classification
    random_state=42
)

print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print()

# Create model (not trained yet!)
model = RandomForestClassifier(random_state=42)

# METHOD 1: Simple 5-fold CV (one line!)
print("=== Method 1: Simple 5-Fold CV ===")
scores = cross_val_score(
    model,              # The model to evaluate
    X, y,               # Data
    cv=5,               # Number of folds
    scoring='accuracy'  # Metric to use
)

print(f"Scores for each fold: {scores}")
print(f"Mean accuracy: {scores.mean():.3f}")
print(f"Std deviation: {scores.std():.3f}")
print(f"Approx. 95% interval: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
print()

# METHOD 2: More control with KFold object
print("=== Method 2: Explicit KFold with Shuffle ===")
kfold = KFold(
    n_splits=5,          # 5 folds
    shuffle=True,        # IMPORTANT: Shuffle before splitting!
    random_state=42      # For reproducibility
)

scores = cross_val_score(model, X, y, cv=kfold)
print(f"Shuffled KFold mean: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
print()

# WHY SHUFFLE?
print("=== Why Shuffle Matters ===")
kfold_no_shuffle = KFold(n_splits=5, shuffle=False)  # Don't shuffle
scores_no_shuffle = cross_val_score(model, X, y, cv=kfold_no_shuffle)

print(f"Without shuffle: {scores_no_shuffle.mean():.3f}")
print(f"With shuffle:    {scores.mean():.3f}")
print("→ If data is ordered (e.g., all class 0 first, then class 1),")
print("→ not shuffling gives biased results!")

Stratified K-Fold

For Imbalanced Data

Stratified K-Fold

Like K-Fold, but ensures each fold has the same proportion of classes as the original dataset. Essential for imbalanced classification!

When to use: Classification problems, especially with imbalanced classes. This is the default when you pass a classifier to cross_val_score.

# Stratified K-Fold (preserves class proportions)
from sklearn.model_selection import StratifiedKFold

# Imbalanced data (90% class 0, 10% class 1)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

print(f"Class distribution: {np.bincount(y)}")  # roughly [900, 100]

# Stratified ensures each fold has ~90/10 split
strat_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X, y, cv=strat_kfold, scoring='f1')
print(f"Stratified KFold F1: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

Leave-One-Out (LOO)

Maximum Folds

Leave-One-Out Cross-Validation

Train on N-1 samples, test on 1 sample. Repeat N times (where N = dataset size). Gives a nearly unbiased estimate, but it is computationally expensive and the estimate has high variance.

Use when: You have a very small dataset (less than 100 samples) and need every sample for training.

# Leave-One-Out CV (expensive!)
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Small dataset
X_small, y_small = make_classification(n_samples=50, n_features=5, random_state=42)

# LOO CV
loo = LeaveOneOut()
model = KNeighborsClassifier(n_neighbors=3)

scores = cross_val_score(model, X_small, y_small, cv=loo)
print(f"LOO CV Accuracy: {scores.mean():.3f}")
print(f"Number of folds: {len(scores)}")  # 50 folds!

Time Series Split

For time series data, you can't randomly shuffle! The training set must always come before the test set in time.

# Time Series Split (for temporal data)
from sklearn.model_selection import TimeSeriesSplit
import matplotlib.pyplot as plt

# Time series CV - always train on past, test on future
tscv = TimeSeriesSplit(n_splits=5)

# Visualize the splits
X_time = np.arange(100)  # Simulated time series indices

fig, axes = plt.subplots(5, 1, figsize=(10, 8))
for i, (train_idx, test_idx) in enumerate(tscv.split(X_time)):
    axes[i].scatter(train_idx, [1]*len(train_idx), c='blue', label='Train', s=10)
    axes[i].scatter(test_idx, [1]*len(test_idx), c='red', label='Test', s=10)
    axes[i].set_title(f'Split {i+1}')
    axes[i].set_yticks([])
plt.tight_layout()
plt.show()

Multiple Metrics in CV

# Cross-validation with multiple metrics
from sklearn.model_selection import cross_validate
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(random_state=42)

# Evaluate multiple metrics at once
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
cv_results = cross_validate(model, X, y, cv=5, scoring=scoring, return_train_score=True)

# Print results
print("CROSS-VALIDATION RESULTS")
print("=" * 50)
for metric in scoring:
    train_key = f'train_{metric}'
    test_key = f'test_{metric}'
    print(f"{metric.upper()}")
    print(f"  Train: {cv_results[train_key].mean():.3f} (+/- {cv_results[train_key].std():.3f})")
    print(f"  Test:  {cv_results[test_key].mean():.3f} (+/- {cv_results[test_key].std():.3f})")
CV Method         | Best For                    | Pros                           | Cons
K-Fold            | General use                 | Good balance of bias/variance  | May not preserve class balance
Stratified K-Fold | Classification (imbalanced) | Preserves class proportions    | Slightly more complex
Leave-One-Out     | Tiny datasets               | Uses maximum data for training | Very slow, high variance
Time Series Split | Sequential/temporal data    | Respects time ordering         | Early folds have less training data
Rule of Thumb: Always use Stratified K-Fold for classification. Use 5 folds for large datasets, 10 folds for smaller ones. Never use regular K-Fold on time series data!
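To see what stratification actually guarantees, this small sketch (synthetic imbalanced data, assumed here for illustration) compares the per-fold minority-class rate under plain K-Fold and Stratified K-Fold:

```python
# Sketch: inspect per-fold class proportions under both CV strategies.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced synthetic data (~90/10 split)
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

for name, cv in [('KFold', KFold(n_splits=5, shuffle=True, random_state=0)),
                 ('StratifiedKFold', StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    # Fraction of minority-class samples in each test fold
    rates = [y[test].mean() for _, test in cv.split(X, y)]
    print(f"{name}: minority rate per fold = {np.round(rates, 2)}")
```

With plain K-Fold the minority rate drifts from fold to fold; StratifiedKFold pins each fold's rate to the overall class proportion, which is exactly why it is the safe default for classification.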

Practice Questions: Cross-Validation

Question: If you tune hyperparameters using the test set, you're cheating! Implement nested CV to get an unbiased performance estimate.

Show Answer
# Nested Cross-Validation
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Outer CV for performance estimation
# Inner CV for hyperparameter tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Hyperparameter search (inner loop)
param_grid = {'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']}
grid_search = GridSearchCV(SVC(), param_grid, cv=inner_cv, scoring='accuracy')

# Outer loop for unbiased estimation
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv, scoring='accuracy')

print(f"Nested CV Accuracy: {nested_scores.mean():.3f} (+/- {nested_scores.std():.3f})")
print("This estimate is unbiased - it doesn't 'see' the test data during tuning!")

Task: Perform 5-fold cross-validation on a Random Forest classifier and print mean score with standard deviation.

Show Solution
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Create model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold CV
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

print(f"Individual fold scores: {scores}")
print(f"Mean accuracy: {scores.mean():.3f}")
print(f"Standard deviation: {scores.std():.3f}")
print(f"95% CI: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

Task: Run cross-validation that returns training and test scores for accuracy, precision, recall, and F1 simultaneously.

Show Solution
from sklearn.model_selection import cross_validate
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(random_state=42)

# Multiple metrics at once
scoring = ['accuracy', 'precision', 'recall', 'f1']

cv_results = cross_validate(
    model, X, y, cv=5, 
    scoring=scoring, 
    return_train_score=True
)

# Print formatted results
print("CROSS-VALIDATION RESULTS")
print("=" * 50)
for metric in scoring:
    train_mean = cv_results[f'train_{metric}'].mean()
    train_std = cv_results[f'train_{metric}'].std()
    test_mean = cv_results[f'test_{metric}'].mean()
    test_std = cv_results[f'test_{metric}'].std()
    
    print(f"\n{metric.upper()}:")
    print(f"  Train: {train_mean:.3f} (+/- {train_std:.3f})")
    print(f"  Test:  {test_mean:.3f} (+/- {test_std:.3f})")
    
    # Check for overfitting
    if train_mean - test_mean > 0.1:
        print(f"  Warning: Gap suggests overfitting!")
05

Hyperparameter Tuning

Hyperparameters are settings you choose before training - like the number of trees in a forest or the learning rate. Finding the best combination is crucial for model performance.

What are Hyperparameters?

Cooking Analogy: Think of training a model like baking a cake.

Hyperparameters = Settings you choose BEFORE baking:
• Oven temperature (350°F vs 400°F)
• Baking time (30 min vs 45 min)
• Pan size (8" vs 9")

Parameters = Things that happen DURING baking:
• How much the cake rises
• How ingredients combine
• The final texture and taste

You choose hyperparameters; the model learns parameters. Hyperparameter tuning = experimenting with different oven temps and times to find the perfect recipe!
Parameters (Learned)
  • Weights in neural networks
  • Coefficients in linear regression
  • Split points in decision trees

Learned during training

Hyperparameters (Chosen)
  • Number of trees, max depth
  • Learning rate, regularization strength
  • Number of neighbors (K)

Set before training
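The distinction is visible directly in code. In this sketch (synthetic data; the specific model and values are illustrative), hyperparameters go into the constructor before training, while parameters appear as fitted attributes afterward:

```python
# Hyperparameters vs. parameters, shown with logistic regression
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Hyperparameters: chosen BEFORE training, passed to the constructor
model = LogisticRegression(C=1.0, penalty='l2', solver='lbfgs')

# Parameters: learned DURING training, stored as fitted attributes
model.fit(X, y)
print(f"Learned coefficients (parameters): {model.coef_}")
print(f"Learned intercept:                 {model.intercept_}")
print(f"Chosen C (hyperparameter):         {model.C}")
```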

GridSearchCV - Exhaustive Search

Shopping Analogy: Imagine you're trying every possible outfit combination:

• 3 shirts (red, blue, green)
• 2 pants (jeans, khakis)
• 2 shoes (sneakers, boots)

GridSearch tries ALL combinations: red+jeans+sneakers, red+jeans+boots, red+khakis+sneakers, red+khakis+boots, blue+jeans+sneakers, etc. That's 3 × 2 × 2 = 12 complete outfits to try!

With cross-validation (CV=5), you're trying each outfit in 5 different lighting conditions. So 12 × 5 = 60 total evaluations! This is why GridSearch can be slow.
Brute Force Approach

GridSearchCV (Exhaustive Grid Search)

Try every possible combination of hyperparameters in your grid. Guaranteed to find the absolute best combination in your search space, but computationally expensive.

Computational Cost: Number of fits = (combinations) × (cv folds)

Example: 3 params with 4 values each = 4³ = 64 combinations
With 5-fold CV: 64 × 5 = 320 model trainings!

Pro tip: Start with a coarse grid (few values), find promising regions, then do a finer grid search in that region.
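The coarse-then-fine idea from the pro tip might look like this two-stage sketch (synthetic data and parameter ranges are illustrative assumptions):

```python
# Two-stage search: coarse grid finds a promising region, fine grid refines it
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Stage 1: coarse grid spanning orders of magnitude
coarse = GridSearchCV(LogisticRegression(max_iter=1000),
                      {'C': [0.01, 0.1, 1, 10, 100]}, cv=3)
coarse.fit(X, y)
best_C = coarse.best_params_['C']
print(f"Coarse best C: {best_C}")

# Stage 2: fine grid centered on the coarse winner
fine_grid = {'C': [best_C * f for f in [0.25, 0.5, 1, 2, 4]]}
fine = GridSearchCV(LogisticRegression(max_iter=1000), fine_grid, cv=3)
fine.fit(X, y)
print(f"Refined best C: {fine.best_params_['C']}")
print(f"Refined CV score: {fine.best_score_:.3f}")
```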

# ============================================
# GridSearchCV - Complete Example
# ============================================
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import time

# Generate sample data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

print("=== Setting Up GridSearch ===")

# Define the base model (hyperparameters will be set by GridSearch)
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
# Start with a COARSE grid (few values) to explore broadly
param_grid = {
    'n_estimators': [50, 100, 200],           # Number of trees - 3 values
    'max_depth': [5, 10, 20, None],           # Max tree depth - 4 values  
    'min_samples_split': [2, 5, 10],          # Min samples to split - 3 values
    'min_samples_leaf': [1, 2, 4]             # Min samples in leaf - 3 values
}

print(f"Parameter grid:")
for param, values in param_grid.items():
    print(f"  {param}: {values}")

# Calculate total combinations
total_combinations = 1
for values in param_grid.values():
    total_combinations *= len(values)

print(f"\nTotal combinations: {total_combinations}")
print(f"With 5-fold CV: {total_combinations * 5} model fits!")
print("\nStarting GridSearch... (this may take a while)\n")

# Create GridSearchCV object
grid_search = GridSearchCV(
    estimator=rf,                    # Model to tune
    param_grid=param_grid,           # Grid of parameters
    cv=5,                            # 5-fold cross-validation
    scoring='accuracy',              # Metric to optimize
    n_jobs=-1,                       # Use ALL CPU cores (parallel)
    verbose=1,                       # Show progress (2=more detail, 3=even more)
    return_train_score=True          # Also return training scores
)

# Fit - this tries all combinations!
start_time = time.time()
grid_search.fit(X, y)
elapsed = time.time() - start_time

# Display results
print("\n" + "="*50)
print("GRIDSEARCH RESULTS")
print("="*50)
print(f"Time taken: {elapsed:.1f} seconds")
print(f"\nBest parameters found:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")

print(f"\nBest cross-validation score: {grid_search.best_score_:.4f}")

# Access the best model (already trained!)
best_model = grid_search.best_estimator_
print(f"\nBest model: {best_model}")

# Get all results as a DataFrame (useful for analysis)
import pandas as pd
results_df = pd.DataFrame(grid_search.cv_results_)

# Show top 5 parameter combinations
print("\nTop 5 parameter combinations:")
top_5 = results_df.nsmallest(5, 'rank_test_score')[[
    'params', 'mean_test_score', 'std_test_score', 'rank_test_score'
]]
print(top_5.to_string(index=False))

RandomizedSearchCV - Faster Alternative

Lottery Strategy Analogy: Instead of trying every possible lottery number combination (GridSearch = guaranteed to find winning combo but takes forever), you randomly pick 100 tickets (RandomizedSearch = might find winner much faster!).

Why RandomSearch Often Works Better:
• Most hyperparameters don't matter much - only 1-2 are really important
• GridSearch wastes time exploring worthless combinations exhaustively
• RandomSearch explores MORE unique values for important parameters in the same time
• Example: If 'learning_rate' is critical but 'max_features' doesn't matter much, RandomSearch tries 50 different learning_rates vs GridSearch trying only 5

Rule of thumb: Use GridSearch for ≤3 parameters. Use RandomSearch for 4+ parameters or when you're exploring a large space.
Random Sampling Strategy

RandomizedSearchCV

Randomly sample combinations from the parameter space. You control how many combinations to try (n_iter). Often finds good solutions much faster than GridSearch.

Why it's effective:

  • Explores more diverse values for important parameters
  • Less time wasted on exhaustively trying every combo
  • Can use continuous distributions (not just discrete values)
  • Finds "good enough" solutions quickly; can increase n_iter if needed

Research shows: RandomSearch with 60 trials often outperforms GridSearch with 60 combinations because it explores the space more effectively.

# RandomizedSearchCV - Random sampling
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define distributions for random sampling
param_distributions = {
    'n_estimators': randint(50, 500),           # Sample from 50-500
    'max_depth': randint(5, 50),                # Sample from 5-50
    'min_samples_split': randint(2, 20),        # Sample from 2-20
    'min_samples_leaf': randint(1, 10),         # Sample from 1-10
    'max_features': uniform(0.1, 0.9)           # Sample from 0.1-1.0
}

# Random search
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_distributions,
    n_iter=50,        # Try 50 random combinations
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1
)

random_search.fit(X, y)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.3f}")

Analyzing Search Results

# Analyzing search results
import pandas as pd
import matplotlib.pyplot as plt

# Get results as DataFrame
results_df = pd.DataFrame(grid_search.cv_results_)

# View key columns
print(results_df[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']].head(10))

# Plot hyperparameter importance (example with max_depth)
# Note: max_depth=None must be mapped to a number, or scatter() will fail
depths = [p['max_depth'] if p['max_depth'] is not None else 50
          for p in results_df['params']]
scores = results_df['mean_test_score']

plt.figure(figsize=(10, 5))
plt.scatter(depths, scores, alpha=0.5)
plt.xlabel('max_depth')
plt.ylabel('Mean CV Score')
plt.title('Effect of max_depth on Performance')
plt.show()

Best Practices

Do This
  • Start with RandomizedSearch to explore
  • Use GridSearch to refine promising regions
  • Always use cross-validation (cv=5+)
  • Set n_jobs=-1 for parallelization
  • Use early stopping when available
Don't Do This
  • Don't tune on test data (data leakage!)
  • Don't grid search with too many parameters
  • Don't use default hyperparameters blindly
  • Don't forget to check for overfitting
  • Don't tune if model is already good enough

Common Hyperparameters by Model

Model               | Key Hyperparameters                        | Typical Range
Random Forest       | n_estimators, max_depth, min_samples_split | 100-500, 10-50, 2-20
XGBoost             | learning_rate, max_depth, n_estimators     | 0.01-0.3, 3-10, 100-1000
SVM                 | C, gamma, kernel                           | 0.1-100, 0.001-1, rbf/linear
KNN                 | n_neighbors, weights, metric               | 3-15, uniform/distance, euclidean
Logistic Regression | C, penalty, solver                         | 0.01-100, l1/l2, lbfgs/saga
Pro Tip: For large datasets, consider using Halving Grid/Random Search (sklearn 0.24+) which progressively eliminates poor candidates, dramatically reducing computation time while still finding good solutions.
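A minimal sketch of successive halving, assuming scikit-learn 0.24+ and synthetic data (note the experimental import that enables the class):

```python
# Successive halving: evaluate all candidates on a small budget,
# then repeatedly keep the best fraction with more resources
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

param_grid = {'max_depth': [3, 5, 10, None],
              'min_samples_split': [2, 5, 10]}

halving = HalvingGridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    factor=3,              # keep the top ~1/3 of candidates each round
    resource='n_samples',  # grow the training-set size each round
    cv=3,
    random_state=42
)
halving.fit(X, y)

print(f"Best params: {halving.best_params_}")
print(f"Best CV score: {halving.best_score_:.3f}")
```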

Practice Questions: Hyperparameter Tuning

Question: Create a complete pipeline that: (1) scales data, (2) tunes a model, and (3) evaluates on a held-out test set.

Show Answer
# Complete tuning pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Split data - keep test set completely separate!
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(random_state=42))
])

# Define parameter grid (note: prefix with step name__)
param_grid = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__kernel': ['rbf', 'linear'],
    'svc__gamma': ['scale', 'auto', 0.1, 1]
}

# Grid search on training data only
grid_search = GridSearchCV(
    pipeline, param_grid, cv=5, 
    scoring='f1', n_jobs=-1
)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best CV F1: {grid_search.best_score_:.3f}")

# Evaluate on held-out test set
y_pred = grid_search.predict(X_test)
print("\n" + "="*50)
print("TEST SET PERFORMANCE (Never seen during tuning!)")
print("="*50)
print(classification_report(y_test, y_pred))

Task: Use GridSearchCV to find the best n_neighbors for KNN classifier from values [3, 5, 7, 9, 11].

Show Solution
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd

# Define model and parameter grid
knn = KNeighborsClassifier()
param_grid = {'n_neighbors': [3, 5, 7, 9, 11]}

# Grid search
grid_search = GridSearchCV(
    knn, param_grid, cv=5, 
    scoring='accuracy'
)
grid_search.fit(X, y)

print(f"Best K: {grid_search.best_params_['n_neighbors']}")
print(f"Best CV Accuracy: {grid_search.best_score_:.3f}")

# View all results
results = pd.DataFrame(grid_search.cv_results_)
print(results[['param_n_neighbors', 'mean_test_score', 'rank_test_score']])

Task: Use RandomizedSearchCV to tune a Random Forest with random sampling from distributions.

Show Solution
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint, uniform

# Model
rf = RandomForestClassifier(random_state=42)

# Parameter distributions (not fixed values!)
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(3, 30),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9)
}

# Random search - try 20 random combinations
random_search = RandomizedSearchCV(
    rf, param_dist, 
    n_iter=20,  # Only 20 combinations instead of all
    cv=5, 
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)
random_search.fit(X, y)

print(f"Best params: {random_search.best_params_}")
print(f"Best CV Accuracy: {random_search.best_score_:.3f}")

Quick Reference Guide

Use these tables to quickly decide which metrics, CV strategy, and tuning method to use for your specific problem.

When to Use Each Metric

Problem Type                | Primary Metric | Secondary Metrics         | When to Use
Regression (general)        | MAE + RMSE     | R²                        | MAE for interpretability, RMSE when large errors are bad, R² to explain variance
Balanced Classification     | Accuracy       | Confusion Matrix          | Equal class sizes, all errors equally costly
Imbalanced Classification   | F1-Score       | Precision, Recall, PR-AUC | Minority class important, need balance of precision/recall
Medical Diagnosis           | Recall         | F1, Confusion Matrix      | Missing sick patients is dangerous (minimize false negatives)
Spam Detection              | Precision      | F1                        | False alarms annoy users (minimize false positives)
Ranking/Threshold Selection | ROC-AUC        | PR-AUC (if imbalanced)    | Comparing models across all thresholds, probability calibration

CV Strategy Quick Reference

K-Fold (K=5)

Default choice for most problems

✓ Fast ✓ Reliable Use: Balanced data
Stratified K-Fold

For classification with imbalanced classes

✓ Preserves ratios Use: Imbalanced classes
TimeSeriesSplit

For temporal data

✓ No data leakage ✗ Never shuffle! Use: Time series

Tuning Method Decision Matrix

Situation                                    | Use This                              | Why?
1-3 hyperparameters, clear ranges            | GridSearchCV                          | Exhaustive search is feasible and finds exact best
4+ hyperparameters or large search space     | RandomizedSearchCV                    | Much faster, explores space effectively
Expensive training (large data/slow model)   | Lower CV folds (cv=3)                 | Saves computation while maintaining reliability
Initial exploration (don't know good ranges) | RandomizedSearchCV, then GridSearchCV | Random finds region, Grid refines within it
Golden Workflow: (1) Split data → (2) Baseline model → (3) Try alternatives with CV → (4) Tune best model with GridSearchCV/RandomizedSearchCV → (5) Final test set evaluation ONCE
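The golden workflow can be sketched end to end (synthetic data; the specific models and grid are illustrative assumptions):

```python
# Golden workflow: split -> baseline -> compare with CV -> tune -> final test
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# (1) Split once; the test set stays untouched until the very end
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# (2) Baseline model
baseline = LogisticRegression(max_iter=1000)
print(f"Baseline CV: {cross_val_score(baseline, X_train, y_train, cv=5).mean():.3f}")

# (3) Try an alternative, still with CV on training data
rf = RandomForestClassifier(random_state=42)
print(f"RF CV:       {cross_val_score(rf, X_train, y_train, cv=5).mean():.3f}")

# (4) Tune the stronger model on training data only
grid = GridSearchCV(rf, {'max_depth': [5, 10, None]}, cv=5)
grid.fit(X_train, y_train)

# (5) ONE final evaluation on the held-out test set
test_acc = accuracy_score(y_test, grid.predict(X_test))
print(f"Final test accuracy: {test_acc:.3f}")
```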

Key Takeaways

Choose the Right Metrics

Use MSE/RMSE for regression when large errors are costly. Use F1-score for imbalanced classification. Accuracy is only meaningful with balanced classes.

Precision vs Recall Trade-off

Prioritize recall when missing positives is costly (disease detection). Prioritize precision when false alarms are costly (spam filtering).

Confusion Matrix Reveals All

Always examine the confusion matrix. It shows exactly where your model fails - which classes get confused and in what direction.

ROC-AUC for Threshold-Free Evaluation

ROC-AUC compares models regardless of threshold. Use Precision-Recall curves instead for highly imbalanced datasets.

Cross-Validation is Essential

Always use cross-validation for reliable performance estimates. Use Stratified K-Fold for classification, TimeSeriesSplit for temporal data.

Tune Hyperparameters Properly

Never tune on test data! Use GridSearchCV for small grids, RandomizedSearchCV for exploration. Keep a final test set completely unseen.

Knowledge Check

Quick Quiz

Test what you've learned about model evaluation and tuning

1 Which regression metric is MOST sensitive to outliers?
2 For a fraud detection model where only 1% of transactions are fraudulent, which metric is MOST appropriate?
3 In a confusion matrix, what does FN (False Negative) represent?
4 What AUC score indicates a model performing no better than random guessing?
5 Which cross-validation method should you use for time series data?
6 What is the main advantage of RandomizedSearchCV over GridSearchCV?