Parameters vs Hyperparameters
Before diving into tuning, it is essential to understand the difference between parameters and hyperparameters. Parameters are learned from data during training (like weights in neural networks), while hyperparameters are set before training and control the learning process itself (like learning rate or tree depth).
What are Hyperparameters?
Hyperparameters are configuration settings that control the learning algorithm itself. They are not learned from data but must be specified before training begins. The right hyperparameter values can dramatically improve model performance.
Examples: Learning rate, number of trees in Random Forest, max depth of decision trees, regularization strength (C in SVM), number of neighbors (k in KNN)
Parameters
- Learned from training data
- Model weights and biases
- Split points in decision trees
- Coefficients in linear regression
Hyperparameters
- Set before training begins
- Control the learning process
- Number of trees, max depth
- Learning rate, regularization
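The distinction is visible directly in scikit-learn: hyperparameters are constructor arguments available before training, while learned parameters appear as attributes with a trailing underscore only after fitting. A minimal sketch using LogisticRegression on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Hyperparameters: set in the constructor, inspectable before training
model = LogisticRegression(C=1.0, max_iter=200)
print("Hyperparameter C:", model.get_params()["C"])

# Parameters: learned from data, exist only after fit()
model.fit(X, y)
print("Coefficient shape:", model.coef_.shape)  # one row per class
print("Intercepts:", model.intercept_)
```

Trying to access `model.coef_` before calling `fit()` raises an error, which is scikit-learn's way of enforcing this distinction.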
Why Hyperparameter Tuning Matters
Default hyperparameters rarely give optimal results. A well-tuned model can significantly outperform a default one. However, tuning must be done carefully to avoid overfitting to the validation set.
# Example: Impact of hyperparameters on model performance
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42
)
# Default hyperparameters
rf_default = RandomForestClassifier(random_state=42)
rf_default.fit(X_train, y_train)
default_acc = accuracy_score(y_test, rf_default.predict(X_test))
print(f"Default accuracy: {default_acc:.4f}") # ~0.9561
# Tuned hyperparameters
rf_tuned = RandomForestClassifier(
n_estimators=200, max_depth=10, min_samples_split=5, random_state=42
)
rf_tuned.fit(X_train, y_train)
tuned_acc = accuracy_score(y_test, rf_tuned.predict(X_test))
print(f"Tuned accuracy: {tuned_acc:.4f}") # ~0.9737
This code demonstrates the impact of hyperparameter tuning. We load the breast cancer dataset and split it into 80% training and 20% testing sets. A Random Forest with default hyperparameters reaches roughly 95.6% test accuracy. A tuned version - 200 trees instead of the default 100, a maximum depth of 10 to limit overfitting, and a minimum of 5 samples required to split a node - reaches roughly 97.4%. Keep in mind this is a single train/test split; a difference this small should be confirmed with cross-validation before drawing conclusions.
Common Hyperparameters by Algorithm
| Algorithm | Key Hyperparameters | Typical Range |
|---|---|---|
| Random Forest | n_estimators, max_depth, min_samples_split | 10-500, 3-30, 2-20 |
| SVM | C, kernel, gamma | 0.001-1000, rbf/linear/poly, scale/auto |
| KNN | n_neighbors, weights, metric | 1-50, uniform/distance, euclidean/manhattan |
| Gradient Boosting | learning_rate, n_estimators, max_depth | 0.01-0.3, 50-500, 3-10 |
| Logistic Regression | C, penalty, solver | 0.001-100, l1/l2, lbfgs/liblinear |
Practice Questions
Task: Create a Random Forest classifier and print both the hyperparameters (before fitting) and the learned parameters (after fitting).
Solution
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
# Load data
X, y = load_iris(return_X_y=True)
# Create model - hyperparameters are set here
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
# Print hyperparameters (before fitting)
print("Hyperparameters:")
print(f" n_estimators: {rf.n_estimators}")
print(f" max_depth: {rf.max_depth}")
# Train the model - parameters are learned here
rf.fit(X, y)
# Print learned parameters (after fitting)
print("\nLearned Parameters:")
print(f" Number of trees: {len(rf.estimators_)}")
print(f" Feature importances: {rf.feature_importances_}")
Task: Train two SVM classifiers on the breast cancer dataset - one with default parameters and one with C=10, gamma=0.001. Compare their accuracies.
Solution
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
# Load and split data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Scale features (important for SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Default SVM
svm_default = SVC(random_state=42)
svm_default.fit(X_train_scaled, y_train)
default_acc = accuracy_score(y_test, svm_default.predict(X_test_scaled))
# Tuned SVM
svm_tuned = SVC(C=10, gamma=0.001, random_state=42)
svm_tuned.fit(X_train_scaled, y_train)
tuned_acc = accuracy_score(y_test, svm_tuned.predict(X_test_scaled))
print(f"Default SVM accuracy: {default_acc:.4f}")
print(f"Tuned SVM accuracy: {tuned_acc:.4f}")
print(f"Improvement: {(tuned_acc - default_acc)*100:.2f}%")
Task: Train Decision Tree classifiers with max_depth values from 1 to 20. Plot training and validation accuracy to visualize underfitting vs overfitting.
Solution
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)
depths = range(1, 21)
train_scores = []
val_scores = []
for depth in depths:
dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
dt.fit(X_train, y_train)
train_scores.append(dt.score(X_train, y_train))
val_scores.append(dt.score(X_val, y_val))
plt.figure(figsize=(10, 6))
plt.plot(depths, train_scores, 'b-o', label='Training Accuracy')
plt.plot(depths, val_scores, 'r-o', label='Validation Accuracy')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.title('Effect of max_depth on Model Performance')
plt.legend()
plt.grid(True)
plt.show()
best_depth = depths[val_scores.index(max(val_scores))]
print(f"Best max_depth: {best_depth} with val accuracy: {max(val_scores):.4f}")
Task: Create a function that returns a dictionary of common hyperparameters and their typical ranges for different sklearn classifiers (RandomForest, SVM, KNN).
Solution
def get_hyperparameter_ranges(algorithm):
"""Return common hyperparameter ranges for different algorithms."""
ranges = {
'RandomForest': {
'n_estimators': (50, 500),
'max_depth': (3, 30),
'min_samples_split': (2, 20),
'min_samples_leaf': (1, 10),
'max_features': ['sqrt', 'log2', None]
},
'SVM': {
'C': (0.001, 1000), # Log scale recommended
'kernel': ['rbf', 'linear', 'poly'],
'gamma': ['scale', 'auto', 0.001, 0.01, 0.1],
'degree': (2, 5) # Only for poly kernel
},
'KNN': {
'n_neighbors': (1, 50),
'weights': ['uniform', 'distance'],
'metric': ['euclidean', 'manhattan', 'minkowski'],
'p': (1, 2) # 1=manhattan, 2=euclidean
}
}
return ranges.get(algorithm, "Algorithm not found")
# Test the function
for algo in ['RandomForest', 'SVM', 'KNN']:
print(f"\n{algo} Hyperparameters:")
for param, range_val in get_hyperparameter_ranges(algo).items():
print(f" {param}: {range_val}")
Grid Search
Grid Search is the most straightforward approach to hyperparameter tuning. It exhaustively tries every combination of hyperparameter values you specify and evaluates each using cross-validation. While thorough, it can be computationally expensive for large parameter spaces.
How Grid Search Works
Grid Search creates a grid of all possible hyperparameter combinations and evaluates each one using cross-validation. It then selects the combination that produces the best average score across all folds.
Example: If you have 3 values for param A and 4 values for param B, Grid Search will try all 3 × 4 = 12 combinations.
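The combination count can be verified with scikit-learn's ParameterGrid, which enumerates the same grid that GridSearchCV iterates over (the parameter names A and B here are placeholders, not real estimator parameters):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'A': [1, 2, 3],            # 3 values
    'B': ['w', 'x', 'y', 'z']  # 4 values
}

# ParameterGrid expands the dict into every combination
grid = list(ParameterGrid(param_grid))
print(len(grid))   # 12 combinations
print(grid[:2])    # each entry is a dict of one combination
```

This is handy for estimating runtime before launching a search: total fits = number of combinations × number of CV folds.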
Using GridSearchCV
Scikit-learn provides GridSearchCV which combines grid search with cross-validation. It automatically handles the train/validation split and returns the best parameters found.
# Step 1: Import required libraries
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
We import GridSearchCV for hyperparameter tuning, RandomForestClassifier as our model, and utilities for loading the breast cancer dataset and splitting it into train/test sets.
# Step 2: Load and split the data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
We load the breast cancer dataset and split it into training and testing sets. The random_state ensures reproducibility so you get the same split every time you run the code.
# Step 3: Define the parameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [5, 10, 15, None],
'min_samples_split': [2, 5, 10]
}
The parameter grid is a dictionary where keys are hyperparameter names and values are lists of values to try. This grid creates 3 × 4 × 3 = 36 different combinations that GridSearchCV will evaluate.
# Step 4: Create GridSearchCV object
grid_search = GridSearchCV(
estimator=RandomForestClassifier(random_state=42),
param_grid=param_grid,
cv=5, # 5-fold cross-validation
scoring='accuracy',
n_jobs=-1, # Use all CPU cores
verbose=1
)
We create a GridSearchCV object with our Random Forest estimator and parameter grid. The cv=5 means 5-fold cross-validation, scoring='accuracy' defines how to evaluate, n_jobs=-1 uses all CPU cores for parallel processing, and verbose=1 shows progress during fitting.
# Step 5: Fit and find best parameters
grid_search.fit(X_train, y_train)
Calling fit() runs all 36 combinations × 5 folds = 180 model trainings, plus one final refit of the best configuration on the full training set (refit=True is the default). GridSearchCV automatically selects the best hyperparameter combination based on cross-validation scores.
# Step 6: Access the results
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print(f"Test score: {grid_search.score(X_test, y_test):.4f}")
After fitting, we access best_params_ for the optimal hyperparameter values, best_score_ for the average cross-validation score of the best model, and use score() to evaluate performance on the held-out test set.
Analyzing Grid Search Results
The cv_results_ attribute contains detailed information about every combination tried. You can use this to understand how different hyperparameters affect performance.
# Step 1: Convert results to DataFrame
import pandas as pd
results_df = pd.DataFrame(grid_search.cv_results_)
The cv_results_ attribute is a dictionary containing detailed information about every hyperparameter combination tried. We convert it to a pandas DataFrame for easier analysis and visualization.
# Step 2: View top performing combinations
top_5 = results_df.nsmallest(5, 'rank_test_score')[
['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(top_5.to_string())
Using nsmallest on rank_test_score, we extract the top 5 performing combinations. We select the most useful columns: params (hyperparameter values), mean_test_score (average CV accuracy), std_test_score (score variance), and rank_test_score (1 = best).
# Step 3: Access the best trained model
best_model = grid_search.best_estimator_
print(f"\nBest model: {best_model}")
The best_estimator_ attribute gives direct access to the actual trained model with the best hyperparameters. This model is ready to use for predictions without needing to retrain.
Practice Questions
Task: Use GridSearchCV to find the best n_neighbors value (1-20) for a KNN classifier on the iris dataset.
Solution
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
param_grid = {'n_neighbors': list(range(1, 21))}
grid_search = GridSearchCV(
KNeighborsClassifier(),
param_grid,
cv=5,
scoring='accuracy'
)
grid_search.fit(X, y)
print(f"Best n_neighbors: {grid_search.best_params_['n_neighbors']}")
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")
Task: Perform Grid Search on SVM with C and gamma parameters. Create a heatmap showing accuracy for each combination.
Solution
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': [0.001, 0.01, 0.1, 1]
}
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_scaled, y)
# Reshape scores into a (C, gamma) matrix: cv_results_ follows the
# sorted-key grid order, with gamma varying fastest within each C
scores = grid_search.cv_results_['mean_test_score'].reshape(4, 4)
plt.figure(figsize=(8, 6))
plt.imshow(scores, cmap='viridis')
plt.colorbar(label='Accuracy')
plt.xticks(range(4), param_grid['gamma'])
plt.yticks(range(4), param_grid['C'])
plt.xlabel('gamma')
plt.ylabel('C')
plt.title('SVM Grid Search Results')
plt.show()
print(f"Best params: {grid_search.best_params_}")
Task: After running GridSearchCV on a Random Forest, extract the best model and plot the top 10 feature importances.
Solution
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
import numpy as np
# Load data
data = load_breast_cancer()
X, y = data.data, data.target
# Grid search
param_grid = {'n_estimators': [100, 200], 'max_depth': [5, 10, 15]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
param_grid, cv=5, n_jobs=-1)
grid_search.fit(X, y)
# Get best model and feature importances
best_model = grid_search.best_estimator_
importances = best_model.feature_importances_
indices = np.argsort(importances)[::-1][:10] # Top 10
# Plot
plt.figure(figsize=(10, 6))
plt.title('Top 10 Feature Importances (Tuned Random Forest)')
plt.bar(range(10), importances[indices])
plt.xticks(range(10), [data.feature_names[i] for i in indices], rotation=45, ha='right')
plt.tight_layout()
plt.show()
Task: Run a Grid Search, save the best model to a file using joblib, then load it back and make predictions.
Solution
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import joblib
# Load data and run grid search
X, y = load_iris(return_X_y=True)
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, 7]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
param_grid, cv=5)
grid_search.fit(X, y)
print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")
# Save best model
joblib.dump(grid_search.best_estimator_, 'best_rf_model.pkl')
print("Model saved to 'best_rf_model.pkl'")
# Load model and predict
loaded_model = joblib.load('best_rf_model.pkl')
predictions = loaded_model.predict(X[:5])
print(f"Predictions on first 5 samples: {predictions}")
Random Search
Random Search samples hyperparameter values randomly from specified distributions instead of trying every combination. Bergstra and Bengio (2012) showed that Random Search is often more efficient than Grid Search, finding good hyperparameters with fewer iterations, especially when some hyperparameters matter more than others.
Why Random Search Works
Not all hyperparameters are equally important. Because Random Search samples each parameter independently, every trial tests a fresh value of the important parameters, so with the same computational budget it usually covers the space more effectively than Grid Search.
Key Insight: If only 1 of 5 hyperparameters really matters, Grid Search wastes time on combinations that differ only in unimportant parameters.
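This insight can be illustrated with a quick simulation on a made-up objective where only parameter `a` affects the score: with a budget of 9 trials, a 3 × 3 grid tests only 3 distinct values of `a`, while random search tests 9.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical objective: only 'a' matters, best at a = 0.73
def score(a, b):
    return -(a - 0.73) ** 2  # 'b' is irrelevant

# Grid search: 3 x 3 = 9 trials, but only 3 distinct values of 'a'
grid_vals = [0.25, 0.5, 0.75]
grid_best = max(score(a, b) for a in grid_vals for b in grid_vals)

# Random search: 9 trials, 9 distinct values of 'a'
random_best = max(score(rng.uniform(0, 1), rng.uniform(0, 1))
                  for _ in range(9))

print(f"Grid best:   {grid_best:.5f}")
print(f"Random best: {random_best:.5f}")
```

The grid can get no closer to the optimum than its nearest grid line (here a = 0.75), no matter how the trials differ in `b`; random search has nine independent chances to land near a = 0.73.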
Using RandomizedSearchCV
Scikit-learn provides RandomizedSearchCV which works similarly to GridSearchCV but samples from distributions. You specify the number of iterations instead of exhaustively searching.
# Step 1: Import required libraries
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from scipy.stats import randint, uniform
We import RandomizedSearchCV for random hyperparameter search, RandomForestClassifier as our model, and scipy.stats distributions (randint for integers, uniform for continuous values) to define parameter ranges.
# Step 2: Load the data
X, y = load_breast_cancer(return_X_y=True)
We load the breast cancer dataset for classification. The return_X_y=True parameter returns features and target as separate arrays directly.
# Step 3: Define parameter distributions
param_distributions = {
'n_estimators': randint(50, 500),
'max_depth': randint(3, 20),
'min_samples_split': randint(2, 20),
'min_samples_leaf': randint(1, 10),
'max_features': uniform(0.1, 0.9)
}
Instead of fixed lists like Grid Search, we define distributions. randint(50, 500) samples random integers from 50-499, and uniform(0.1, 0.9) samples continuous floats from 0.1-1.0. This allows exploring a much larger parameter space.
# Step 4: Create RandomizedSearchCV object
random_search = RandomizedSearchCV(
estimator=RandomForestClassifier(random_state=42),
param_distributions=param_distributions,
n_iter=100, # Number of random combinations to try
cv=5,
scoring='accuracy',
n_jobs=-1,
random_state=42,
verbose=1
)
We create a RandomizedSearchCV object with n_iter=100, meaning it will try 100 random combinations instead of exhaustively enumerating all possibilities. With 5-fold CV, this trains 100 × 5 = 500 models, plus a final refit of the best configuration. The random_state ensures reproducible results.
# Step 5: Fit and find best parameters
random_search.fit(X, y)
Calling fit() runs all 100 random combinations with 5-fold cross-validation each. RandomizedSearchCV automatically tracks which combination performed best.
# Step 6: Access the results
print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.4f}")
After fitting, best_params_ contains the optimal hyperparameter values found, and best_score_ gives the average cross-validation accuracy of the best configuration.
Choosing Distributions
Use appropriate distributions for different hyperparameter types. Integer parameters use randint, continuous parameters use uniform or log-uniform distributions.
from scipy.stats import randint, uniform, loguniform
# Different distribution types
param_distributions = {
# Integer parameters
'n_estimators': randint(10, 500), # Uniform integers 10-499
'max_depth': randint(1, 30), # Uniform integers 1-29
# Continuous parameters
'max_features': uniform(0.1, 0.9), # Uniform float 0.1-1.0
# Log-scale parameters (for values spanning orders of magnitude)
'learning_rate': loguniform(1e-4, 1), # Log-uniform 0.0001-1.0
'C': loguniform(1e-3, 1e3), # Log-uniform 0.001-1000
# Categorical parameters
'criterion': ['gini', 'entropy'] # Random choice from list
}
This code shows how to pick an appropriate distribution for each hyperparameter type. randint(low, high) samples integers uniformly from low to high-1, suitable for parameters like n_estimators and max_depth. uniform(loc, scale) samples floats uniformly from loc to loc+scale, suitable for continuous parameters like max_features. Parameters that span several orders of magnitude, such as learning_rate or the regularization strength C, call for loguniform, which samples on a log scale so that values like 0.001, 0.01, and 0.1 are equally likely. Categorical parameters are given as plain lists, and RandomizedSearchCV picks uniformly among the options.
Practice Questions
Task: Use RandomizedSearchCV to tune a GradientBoostingClassifier with learning_rate, n_estimators, and max_depth parameters.
Solution
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from scipy.stats import uniform, randint
X, y = load_breast_cancer(return_X_y=True)
param_distributions = {
'learning_rate': uniform(0.01, 0.29), # 0.01 to 0.30
'n_estimators': randint(50, 300),
'max_depth': randint(2, 10),
'min_samples_split': randint(2, 20)
}
random_search = RandomizedSearchCV(
GradientBoostingClassifier(random_state=42),
param_distributions,
n_iter=50,
cv=5,
scoring='accuracy',
random_state=42,
n_jobs=-1
)
random_search.fit(X, y)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.4f}")
Task: Compare the time and performance of Grid Search (60 combinations) vs Random Search (60 iterations) on a Random Forest.
Solution
import time
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from scipy.stats import randint
X, y = load_breast_cancer(return_X_y=True)
# Grid Search - 60 combinations (3 x 4 x 5)
param_grid = {
'n_estimators': [50, 100, 150],
'max_depth': [5, 10, 15, 20],
'min_samples_split': [2, 4, 6, 8, 10]
}
start = time.time()
grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
param_grid, cv=5, n_jobs=-1)
grid_search.fit(X, y)
grid_time = time.time() - start
# Random Search - 60 iterations
param_dist = {
'n_estimators': randint(50, 200),
'max_depth': randint(5, 25),
'min_samples_split': randint(2, 15)
}
start = time.time()
random_search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
param_dist, n_iter=60, cv=5, n_jobs=-1,
random_state=42)
random_search.fit(X, y)
random_time = time.time() - start
print(f"Grid Search: {grid_search.best_score_:.4f} in {grid_time:.2f}s")
print(f"Random Search: {random_search.best_score_:.4f} in {random_time:.2f}s")
Task: Use RandomizedSearchCV with loguniform distribution to find the best learning_rate for a GradientBoostingClassifier (range: 0.001 to 1.0).
Solution
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_iris
from scipy.stats import loguniform
X, y = load_iris(return_X_y=True)
# Log-uniform samples evenly on log scale
# So 0.001, 0.01, 0.1 are equally likely
param_distributions = {
'learning_rate': loguniform(0.001, 1.0),
'n_estimators': [50, 100, 150]
}
random_search = RandomizedSearchCV(
GradientBoostingClassifier(random_state=42),
param_distributions,
n_iter=20,
cv=5,
random_state=42,
n_jobs=-1
)
random_search.fit(X, y)
print(f"Best learning_rate: {random_search.best_params_['learning_rate']:.6f}")
print(f"Best n_estimators: {random_search.best_params_['n_estimators']}")
print(f"Best CV score: {random_search.best_score_:.4f}")
Task: Run RandomizedSearchCV and plot how the best score improves over iterations to visualize convergence.
Solution
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from scipy.stats import randint
import matplotlib.pyplot as plt
import numpy as np
X, y = load_breast_cancer(return_X_y=True)
param_dist = {
'n_estimators': randint(10, 300),
'max_depth': randint(2, 20),
'min_samples_split': randint(2, 15)
}
random_search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_dist,
n_iter=50,
cv=5,
random_state=42,
n_jobs=-1
)
random_search.fit(X, y)
# Extract scores and compute running best
scores = random_search.cv_results_['mean_test_score']
running_best = np.maximum.accumulate(scores)
# Plot convergence
plt.figure(figsize=(10, 6))
plt.plot(range(1, 51), scores, 'b.', alpha=0.5, label='Individual scores')
plt.plot(range(1, 51), running_best, 'r-', linewidth=2, label='Best so far')
plt.xlabel('Iteration')
plt.ylabel('CV Accuracy')
plt.title('Random Search Convergence')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
print(f"Final best score: {running_best[-1]:.4f}")
Cross-Validation Strategies
Cross-validation is crucial for hyperparameter tuning because it provides a reliable estimate of model performance on unseen data. Different CV strategies are appropriate for different types of data. Choosing the right strategy prevents data leakage and gives you trustworthy results.
K-Fold CV
Standard approach. Splits data into k equal folds. Each fold serves as validation once. Use when data is i.i.d. (independent and identically distributed).
Stratified K-Fold
Preserves class proportions in each fold. Essential for imbalanced classification. Default for classification in sklearn.
Time Series Split
For sequential data. Training set grows, test set moves forward. Prevents future data leaking into training.
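The practical difference between K-Fold and Stratified K-Fold shows up when you count the class labels in each validation fold of an imbalanced dataset. A small sketch using synthetic data with made-up 90/10 class weights:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=200, weights=[0.9, 0.1],
                           random_state=0)

for name, cv in [("KFold", KFold(5, shuffle=True, random_state=0)),
                 ("StratifiedKFold",
                  StratifiedKFold(5, shuffle=True, random_state=0))]:
    # Count [class 0, class 1] samples in each validation fold
    counts = [np.bincount(y[val], minlength=2).tolist()
              for _, val in cv.split(X, y)]
    print(f"{name:16s} validation-fold class counts: {counts}")
```

Plain K-Fold lets the minority-class count drift from fold to fold, while Stratified K-Fold keeps it within one sample of equal across all folds.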
Implementing Different CV Strategies
from sklearn.model_selection import (KFold, StratifiedKFold,
                                     TimeSeriesSplit, cross_val_score)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import numpy as np
X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=42)
# Standard K-Fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf)
print(f"K-Fold: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
# Stratified K-Fold (preserves class proportions)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)
print(f"Stratified K-Fold: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
# Time Series Split (for temporal data; used on this non-temporal
# dataset only to illustrate the API)
tscv = TimeSeriesSplit(n_splits=5)
# Note: Don't shuffle time series data!
scores = cross_val_score(model, X, y, cv=tscv)
print(f"Time Series Split: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
This code demonstrates three different cross-validation strategies and when to use each. We import KFold (basic splitting), StratifiedKFold (preserves class ratios), TimeSeriesSplit (for temporal data), and cross_val_score for evaluation. Standard K-Fold with 5 splits and shuffling enabled works well for balanced datasets with i.i.d. data. Stratified K-Fold ensures each fold maintains the same proportion of each class as the full dataset, which is critical for imbalanced classification problems. TimeSeriesSplit is designed for sequential data where the training window grows and the test window moves forward in time - we never shuffle this data to preserve temporal order and prevent future information from leaking into training. Each method reports the mean score and 95% confidence interval.
Using Custom CV in GridSearchCV
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
# Create stratified CV splitter
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Use in GridSearchCV
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 15]}
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=cv_strategy, # Custom CV strategy
scoring='accuracy',
n_jobs=-1
)
grid_search.fit(X, y)
print(f"Best score with Stratified CV: {grid_search.best_score_:.4f}")
This code shows how to use a custom cross-validation strategy within GridSearchCV. We create a StratifiedKFold splitter with 5 folds, shuffling enabled, and a fixed random_state for reproducibility. Instead of passing just an integer to the cv parameter of GridSearchCV, we pass our custom cv_strategy object. This ensures that class proportions are maintained in every fold during the hyperparameter search, which is especially important for imbalanced datasets where regular K-Fold might create folds with very few samples of the minority class. The stratified approach gives more reliable and consistent cross-validation scores.
Practice Questions
Task: Create an imbalanced dataset and compare K-Fold vs Stratified K-Fold cross-validation scores.
Solution
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
import numpy as np
# Create imbalanced dataset (90% class 0, 10% class 1)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
random_state=42)
model = LogisticRegression(random_state=42)
# Standard K-Fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
kf_scores = cross_val_score(model, X, y, cv=kf, scoring='f1')
print(f"K-Fold F1: {kf_scores.mean():.4f} (+/- {kf_scores.std()*2:.4f})")
print(f" Individual folds: {kf_scores}")
# Stratified K-Fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
skf_scores = cross_val_score(model, X, y, cv=skf, scoring='f1')
print(f"\nStratified K-Fold F1: {skf_scores.mean():.4f} (+/- {skf_scores.std()*2:.4f})")
print(f" Individual folds: {skf_scores}")
print(f"\nStratified has lower variance: {skf_scores.std() < kf_scores.std()}")
Task: Create synthetic time series data and visualize how TimeSeriesSplit creates the train/test folds. Show the growing training window.
Solution
from sklearn.model_selection import TimeSeriesSplit
import numpy as np
import matplotlib.pyplot as plt
# Create synthetic time series
np.random.seed(42)
n_samples = 100
X = np.arange(n_samples).reshape(-1, 1)
y = np.sin(X.ravel() * 0.1) + np.random.randn(n_samples) * 0.1
# TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
fig, axes = plt.subplots(5, 1, figsize=(12, 8), sharex=True)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
ax = axes[fold]
ax.scatter(train_idx, y[train_idx], c='blue', label='Train', alpha=0.7)
ax.scatter(test_idx, y[test_idx], c='red', label='Test', alpha=0.7)
ax.set_ylabel(f'Fold {fold + 1}')
ax.legend(loc='upper right')
ax.set_xlim(0, n_samples)
print(f"Fold {fold + 1}: Train size={len(train_idx)}, Test size={len(test_idx)}")
axes[-1].set_xlabel('Time Index')
plt.suptitle('TimeSeriesSplit Visualization - Growing Training Window')
plt.tight_layout()
plt.show()
Best Practices
Hyperparameter tuning is powerful but can lead to overfitting if done incorrectly. Following best practices ensures your tuned model generalizes well to truly unseen data. These guidelines help you avoid common pitfalls and build more robust models.
Do's
- Hold out a final test set: Never tune on the test set. Use train/validation/test split.
- Start coarse, then refine: Begin with a wide search, then narrow down around promising values.
- Use appropriate CV: Match your CV strategy to your data type (stratified for imbalanced, time series split for sequential).
- Consider computational budget: Use Random Search when the parameter space is large.
- Log your experiments: Track all hyperparameter combinations and their scores for reproducibility.
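The "start coarse, then refine" advice can be sketched as a two-stage search: a broad randomized pass first, then a small grid around the best values found. The ranges and the ±25 / ±2 refinement windows below are illustrative choices, not recommendations:

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Stage 1: coarse random search over wide ranges
coarse = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': randint(50, 300), 'max_depth': randint(2, 30)},
    n_iter=10, cv=3, random_state=42, n_jobs=-1
)
coarse.fit(X, y)
best = coarse.best_params_

# Stage 2: fine grid search in a narrow window around the coarse optimum
fine = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [max(10, best['n_estimators'] + d) for d in (-25, 0, 25)],
     'max_depth': [max(1, best['max_depth'] + d) for d in (-2, 0, 2)]},
    cv=3, n_jobs=-1
)
fine.fit(X, y)

print(f"Coarse best: {best}, score {coarse.best_score_:.4f}")
print(f"Refined best: {fine.best_params_}, score {fine.best_score_:.4f}")
```

The coarse stage spends its budget locating a promising region cheaply; the fine stage only has to search a 3 × 3 neighborhood instead of the full space.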
Don'ts
- Don't tune on test data: This leads to overly optimistic performance estimates.
- Don't ignore data leakage: Ensure preprocessing is done inside CV folds, not before.
- Don't over-tune: Too many iterations can lead to overfitting the validation set.
- Don't forget scaling: Many algorithms require feature scaling - include it in your pipeline.
- Don't ignore variance: A model with slightly lower mean but much lower variance may be better.
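The last point can be made concrete with made-up fold scores for two candidate models: comparing mean and spread together may favor the steadier model even though its mean is slightly lower.

```python
import numpy as np

# Hypothetical 5-fold CV scores for two tuned candidates
model_a = np.array([0.97, 0.90, 0.99, 0.89, 0.98])  # higher mean, erratic
model_b = np.array([0.94, 0.95, 0.93, 0.95, 0.94])  # lower mean, steady

for name, s in [("A", model_a), ("B", model_b)]:
    print(f"Model {name}: mean={s.mean():.4f}, std={s.std():.4f}")

# One simple tie-breaking heuristic: compare the worst fold,
# a pessimistic lower bound on expected performance
print("Worst fold - A:", model_a.min(), " B:", model_b.min())
```

Model A wins on the mean, but its worst fold (0.89) is well below B's (0.93); in production, the model with the tighter spread is often the safer pick.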
Using Pipelines to Prevent Data Leakage
Always include preprocessing steps inside your pipeline. This ensures that transformations are fit only on training data during each CV fold.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
# Create pipeline with scaling INSIDE
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(random_state=42))
])
# Parameter names use step__param format
param_grid = {
    'svm__C': [0.1, 1, 10, 100],
    'svm__gamma': ['scale', 'auto', 0.01, 0.1]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X, y)
print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")
This code demonstrates how to use Pipelines to prevent data leakage during hyperparameter tuning. We create a Pipeline with named steps: 'scaler' for StandardScaler and 'svm' for SVC. The key benefit is that the scaler will be fit only on training data during each CV fold, preventing information from the validation set from leaking into the preprocessing step. When defining the parameter grid, we use the step__param naming convention - for example, 'svm__C' refers to the C parameter of the svm step. Running GridSearchCV on the entire pipeline ensures that scaling happens correctly inside each fold, giving us realistic performance estimates and preventing the common mistake of scaling all data before splitting.
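A useful follow-up detail: after such a search, `best_estimator_` is the whole fitted pipeline, not just the SVM, so the refit scaler travels with the model and new data gets scaled automatically at predict time. A minimal sketch (using a smaller grid than above to keep it quick):

```python
# Sketch: the winning pipeline, refit on all data, is available as best_estimator_
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
pipeline = Pipeline([('scaler', StandardScaler()), ('svm', SVC(random_state=42))])
search = GridSearchCV(pipeline, {'svm__C': [1, 10]}, cv=3, n_jobs=-1)
search.fit(X, y)

best_pipe = search.best_estimator_        # a fitted Pipeline, not a bare SVC
scaler = best_pipe.named_steps['scaler']  # the scaler fit during the final refit
print(f"Scaler was fit on {scaler.n_features_in_} features")
print(f"Best C: {best_pipe.named_steps['svm'].C}")
```

Because the scaler is part of the saved object, serializing `best_pipe` (e.g. with joblib) captures the preprocessing and the model together.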
Nested Cross-Validation
For unbiased performance estimation, use nested CV: outer loop for evaluation, inner loop for hyperparameter tuning.
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
# Inner CV: hyperparameter tuning
param_grid = {'n_estimators': [50, 100], 'max_depth': [5, 10]}
inner_cv = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=3, n_jobs=-1
)
# Outer CV: unbiased performance estimate
outer_scores = cross_val_score(inner_cv, X, y, cv=5, scoring='accuracy')
print(f"Nested CV scores: {outer_scores}")
print(f"Mean: {outer_scores.mean():.4f} (+/- {outer_scores.std()*2:.4f})")
This code implements nested cross-validation for unbiased performance estimation. We create two CV loops: an inner loop (GridSearchCV with 3-fold CV) that tunes hyperparameters and finds the best configuration, and an outer loop (cross_val_score with 5-fold CV) that evaluates how well the entire tuning process generalizes to unseen data. The outer scores give us an unbiased estimate of how well our tuned model will actually perform in production. Without nested CV, we might overfit to the validation set during tuning and get overly optimistic performance estimates. This approach is especially important when you need to report realistic expected performance to stakeholders or compare different algorithms fairly.
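To see what nesting corrects for, it helps to print the non-nested score next to the nested one. The sketch below reuses the same grid; on most runs the non-nested `best_score_` is at least as high as the nested mean, which is exactly the optimism nested CV removes (the ordering is typical, not guaranteed on every run):

```python
# Sketch: non-nested best_score_ vs nested estimate on the same search
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
param_grid = {'n_estimators': [50, 100], 'max_depth': [5, 10]}
inner = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3, n_jobs=-1)

# Non-nested: the same folds pick the hyperparameters AND report the score
inner.fit(X, y)
non_nested = inner.best_score_

# Nested: outer folds never influence hyperparameter selection
nested = cross_val_score(inner, X, y, cv=5).mean()
print(f"Non-nested: {non_nested:.4f}, nested: {nested:.4f}")
```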
Practice Questions
Task: Create a complete pipeline with scaling, PCA, and SVM. Use GridSearchCV to tune and properly evaluate on a held-out test set.
Solution:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
X, y = load_breast_cancer(return_X_y=True)
# Hold out test set FIRST
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('svm', SVC(random_state=42))
])
# Parameter grid
param_grid = {
    'pca__n_components': [5, 10, 15],
    'svm__C': [0.1, 1, 10],
    'svm__gamma': ['scale', 0.01, 0.1]
}
# Tune on training data only
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"CV score: {grid_search.best_score_:.4f}")
# Final evaluation on test set
y_pred = grid_search.predict(X_test)
print(f"\nTest set performance:")
print(classification_report(y_test, y_pred))
Task: Combine hyperparameter tuning with early stopping for a Gradient Boosting model. Use validation score to stop training early.
Solution:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import load_breast_cancer
import numpy as np
X, y = load_breast_cancer(return_X_y=True)
# Split: train, validation (for early stopping), test
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.2, random_state=42)
# Grid search with early stopping via n_iter_no_change
param_grid = {
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5],
    'n_estimators': [500],        # High value, early stopping will limit
    'n_iter_no_change': [10],     # Stop if no improvement for 10 iterations
    'validation_fraction': [0.1]  # Use 10% for internal validation
}
grid_search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
print(f"Best params: {grid_search.best_params_}")
print(f"Actual n_estimators used: {best_model.n_estimators_}")
print(f"Test accuracy: {best_model.score(X_test, y_test):.4f}")
Task: Create a function that performs hyperparameter tuning with full logging of all experiments to a CSV file for reproducibility.
Solution:
import os
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from datetime import datetime

def tune_with_logging(estimator, param_grid, X, y, log_file='tuning_log.csv'):
    """Perform tuning with full experiment logging."""
    # Record start time
    start_time = datetime.now()
    # Run grid search
    grid_search = GridSearchCV(
        estimator, param_grid, cv=5,
        scoring='accuracy', n_jobs=-1,
        return_train_score=True
    )
    grid_search.fit(X, y)
    # Create results DataFrame
    results = pd.DataFrame(grid_search.cv_results_)
    # Add metadata
    results['experiment_time'] = start_time.strftime('%Y-%m-%d %H:%M:%S')
    results['estimator'] = type(estimator).__name__
    results['total_fits'] = len(results) * 5  # n_combinations * cv_folds
    # Select important columns
    cols = ['experiment_time', 'estimator', 'params', 'total_fits',
            'mean_train_score', 'mean_test_score', 'std_test_score',
            'rank_test_score', 'mean_fit_time']
    results = results[cols]
    # Save to CSV (append if the log file already exists)
    results.to_csv(log_file, mode='a', index=False,
                   header=not os.path.exists(log_file))
    print(f"Logged {len(results)} experiments to {log_file}")
    print(f"Best params: {grid_search.best_params_}")
    print(f"Best score: {grid_search.best_score_:.4f}")
    return grid_search
# Usage
X, y = load_breast_cancer(return_X_y=True)
param_grid = {'n_estimators': [50, 100], 'max_depth': [5, 10]}
grid_search = tune_with_logging(
RandomForestClassifier(random_state=42),
param_grid, X, y
)
Key Takeaways
Hyperparameters Matter
Hyperparameters control the learning process and can dramatically affect model performance. Default values rarely give optimal results.
Grid Search
Exhaustively tries all combinations. Thorough but slow. Use when parameter space is small and you need to be comprehensive.
Random Search
Samples randomly from distributions. Often more efficient than Grid Search. Use for large parameter spaces.
Cross-Validation
Essential for reliable estimates. Use Stratified for imbalanced data, TimeSeriesSplit for temporal data.
Prevent Data Leakage
Always use pipelines to ensure preprocessing happens inside CV. Never tune on test data.
Balance Performance
Consider both mean score and variance. A slightly lower mean with much lower variance may generalize better.
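The "Balance Performance" takeaway can be made concrete with `cv_results_`: instead of taking `rank_test_score` at face value, rank candidates by a pessimistic mean-minus-one-std score. This penalty is an illustrative heuristic, not a scikit-learn built-in:

```python
# Sketch: prefer the candidate with the best pessimistic (mean - std) CV score
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [50, 100], 'max_depth': [3, 10]},
    cv=5, n_jobs=-1
)
search.fit(X, y)

means = search.cv_results_['mean_test_score']
stds = search.cv_results_['std_test_score']
# Penalize unstable candidates: one std subtracted from the mean
robust_idx = int(np.argmax(means - stds))
print(f"Highest mean:  {search.cv_results_['params'][int(np.argmax(means))]}")
print(f"Most robust:   {search.cv_results_['params'][robust_idx]}")
```

When the two choices differ, the robust pick trades a little mean accuracy for fold-to-fold stability, which often generalizes better.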
Knowledge Check
Test your understanding of hyperparameter tuning with this quick quiz.
What is the main difference between parameters and hyperparameters?
When should you prefer Random Search over Grid Search?
Why should you use Stratified K-Fold for imbalanced classification?
What is data leakage in the context of hyperparameter tuning?
How do you access the best model after GridSearchCV finishes?
What distribution should you use for hyperparameters like learning rate that span several orders of magnitude?