Ridge Regression (L2 Regularization)
Ridge regression is your first line of defense against overfitting. By adding a penalty term proportional to the square of coefficient magnitudes, Ridge shrinks large coefficients toward zero without eliminating them entirely. This simple yet powerful technique transforms unstable, overfit models into robust predictors that generalize well to new data.
What is Ridge Regression?
Ridge regression (also called Tikhonov regularization or L2 regularization) modifies the standard linear regression cost function by adding a penalty term equal to the sum of squared coefficients multiplied by a hyperparameter lambda (also written as alpha in scikit-learn).
The intuition: Imagine you're packing for a trip with a weight limit on your luggage. Standard linear regression packs whatever it wants (large coefficients for features it thinks are important). Ridge regression imposes a "tax" on heavy items - you can still bring them, but they cost you extra. This encourages the model to use smaller, more distributed weights across all features rather than relying heavily on just a few.
Why "Ridge"? The name comes from the ridge that appears when you plot the penalty term. The technique was originally developed to handle situations where the feature matrix is singular or near-singular (multicollinearity), creating a "ridge" in the solution space.
The Ridge Regression Cost Function
The Ridge regression objective adds an L2 penalty term to the ordinary least squares (OLS) cost function. This penalty grows with the square of coefficient values, effectively discouraging the model from assigning very large weights to any single feature.
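Written out with $n$ samples and $p$ features (and the intercept conventionally left out of the penalty), the objective is:

$$J(\theta) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} \theta_j^2$$

Scikit-learn's alpha plays the role of $\lambda$ here; some texts also scale the squared-error term by $1/n$ or $1/2n$, which only rescales the meaning of $\lambda$.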
Lambda (Regularization Strength)
- Lambda = 0: No regularization (standard linear regression)
- Small lambda: Slight penalty, coefficients slightly reduced
- Large lambda: Strong penalty, coefficients pushed toward zero
- Lambda → infinity: Coefficients are driven toward zero (the unpenalized intercept remains)
Key Properties
- Shrinks coefficients but never sets them exactly to zero
- Handles multicollinearity effectively
- Always has a unique solution (regularization ensures invertibility)
- Works well when all features contribute to prediction
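A quick sketch of the multicollinearity point (synthetic data invented just for this illustration): if two columns are exact duplicates, Ridge doesn't fail or pick one arbitrarily; the L2 penalty splits the weight evenly between them.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x])                      # two identical columns
y = 4 * x.ravel() + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)   # weight split roughly equally, each coefficient near 2
```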
Ridge Regression in Python
Scikit-learn provides an intuitive interface for Ridge regression through the Ridge class. The key hyperparameter is alpha (equivalent to lambda in the mathematical notation), which controls the regularization strength.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Generate sample data with multicollinearity
np.random.seed(42)
X = np.random.randn(100, 5)
X[:, 1] = X[:, 0] + np.random.randn(100) * 0.1 # Correlated features
y = 3*X[:, 0] + 2*X[:, 2] + np.random.randn(100) * 0.5
# Split and scale data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train Ridge regression with different alpha values
for alpha in [0.01, 0.1, 1.0, 10.0]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_scaled, y_train)
    score = ridge.score(X_test_scaled, y_test)
    print(f"Alpha={alpha:5.2f}: R²={score:.4f}, Coefficients={ridge.coef_.round(2)}")
The code starts with np.random.seed(42) to ensure you get the same results every time you run it. We then create multicollinearity by making feature 1 almost identical to feature 0 using X[:, 1] = X[:, 0] + np.random.randn(100) * 0.1. Before training, we use StandardScaler() to normalize features to mean=0 and std=1, which is essential for fair regularization since the L2 penalty treats all features equally. Finally, Ridge(alpha=alpha) creates the model, where higher alpha means stronger regularization and smaller coefficients.
As alpha increases, you'll notice the coefficients get smaller (shrink toward zero), which reduces overfitting but may increase bias. The goal is finding the alpha that balances these tradeoffs for the best test performance.
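One way to automate that search is RidgeCV, which evaluates a grid of candidate alphas by cross-validation. A minimal sketch on synthetic data similar to the example above:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 2 * X[:, 2] + rng.normal(scale=0.5, size=100)

# RidgeCV picks the best alpha from the candidates via (generalized) cross-validation
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
print(f"Best alpha: {ridge_cv.alpha_}")
```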
Closed-Form Solution
Unlike standard linear regression, which can fail when features are highly correlated, Ridge regression always has a unique solution. The regularization term adds a positive value to the diagonal of the X^T X matrix, ensuring it is always invertible.
import numpy as np
def ridge_regression_closed_form(X, y, alpha):
    """Implement Ridge regression using the closed-form solution"""
    n_features = X.shape[1]
    # Add column of ones for intercept
    X_with_intercept = np.column_stack([np.ones(len(X)), X])
    identity_with_intercept = np.eye(n_features + 1)
    identity_with_intercept[0, 0] = 0  # Don't regularize intercept
    # Closed-form solution: (X^T X + alpha*I)^-1 X^T y
    XtX = X_with_intercept.T @ X_with_intercept
    regularized = XtX + alpha * identity_with_intercept
    theta = np.linalg.inv(regularized) @ X_with_intercept.T @ y
    return theta[0], theta[1:]  # intercept, coefficients
# Example usage
intercept, coefs = ridge_regression_closed_form(X_train_scaled, y_train, alpha=1.0)
print(f"Intercept: {intercept:.4f}")
print(f"Coefficients: {coefs.round(4)}")
The function builds an identity matrix with np.eye(n_features + 1), which has 1s on the diagonal and 0s elsewhere, sized to match the design matrix after a column of 1s is added for the intercept term with np.column_stack([np.ones(len(X)), X]). Importantly, we set identity_with_intercept[0, 0] = 0 because we don't want to regularize the intercept, since it's not a feature coefficient. The core computation uses X.T @ X to compute X-transpose times X (a covariance-like matrix), then adds alpha * identity to the diagonal. This is the magic of Ridge regression: adding alpha to the diagonal ensures the matrix is always invertible. Finally, np.linalg.inv(...) performs the matrix inversion to solve the equation.
This matters because ordinary least squares fails when features are perfectly correlated (singular matrix). Ridge's lambda*I addition guarantees a unique solution always exists, which is why Ridge is often called "regularized least squares."
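A small numeric check of that claim, with made-up data: two perfectly collinear columns make X^T X singular, but adding alpha to the diagonal makes it well conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))
X = np.hstack([x, x])          # perfectly collinear columns
XtX = X.T @ X

# Singular (or numerically near-singular) without regularization...
print("cond(X^T X):", np.linalg.cond(XtX))
# ...but well conditioned once alpha*I is added to the diagonal
print("cond(X^T X + I):", np.linalg.cond(XtX + 1.0 * np.eye(2)))
```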
Practice Questions
Problem: Given coefficients [2.5, -1.8, 0.9] and alpha=0.5, calculate the L2 penalty term that would be added to the cost function.
Show Solution
import numpy as np
# Given coefficients
coefficients = np.array([2.5, -1.8, 0.9])
alpha = 0.5
# L2 penalty = alpha * sum(coefficients^2)
l2_penalty = alpha * np.sum(coefficients ** 2)
# = 0.5 * (6.25 + 3.24 + 0.81)
# = 0.5 * 10.3
# = 5.15
print(f"L2 Penalty: {l2_penalty}")
# Output: L2 Penalty: 5.15
Explanation: The L2 penalty squares each coefficient (2.5²=6.25, 1.8²=3.24, 0.9²=0.81), sums them (10.3), and multiplies by alpha (5.15). This value is added to the MSE cost.
Problem: Train Ridge regression on a dataset with alpha values [0.001, 0.1, 10, 1000]. Plot how the coefficients change as alpha increases.
Show Solution
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
# Generate data
X, y = make_regression(n_samples=100, n_features=10, noise=10)
# Test different alpha values
alphas = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
coef_matrix = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X, y)
    coef_matrix.append(ridge.coef_)
# Plot coefficient paths
coef_matrix = np.array(coef_matrix)
plt.figure(figsize=(10, 6))
for i in range(coef_matrix.shape[1]):
    plt.plot(np.log10(alphas), coef_matrix[:, i], label=f'Coef {i+1}')
plt.xlabel('log10(Alpha)')
plt.ylabel('Coefficient Value')
plt.title('Ridge Coefficient Paths')
plt.legend(bbox_to_anchor=(1.05, 1))
plt.tight_layout()
plt.show()
Explanation: As alpha increases, all coefficients shrink toward zero but never reach exactly zero. The plot shows how stronger regularization constrains the model.
Problem: Implement Ridge regression using gradient descent. Train on synthetic data and compare your implementation with scikit-learn's Ridge.
Show Solution
import numpy as np
from sklearn.linear_model import Ridge
class RidgeRegressionGD:
    def __init__(self, alpha=1.0, learning_rate=0.01, n_iterations=1000):
        self.alpha = alpha
        self.lr = learning_rate
        self.n_iter = n_iterations

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        for _ in range(self.n_iter):
            y_pred = X @ self.weights + self.bias
            error = y_pred - y
            # Gradients with L2 penalty
            dw = (1/n_samples) * (X.T @ error + self.alpha * self.weights)
            db = (1/n_samples) * np.sum(error)
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
        return self

    def predict(self, X):
        return X @ self.weights + self.bias
# Compare with sklearn
np.random.seed(42)
X = np.random.randn(100, 5)
y = 3*X[:, 0] + 2*X[:, 1] + np.random.randn(100) * 0.5
# Our implementation
my_ridge = RidgeRegressionGD(alpha=1.0, learning_rate=0.1, n_iterations=1000)
my_ridge.fit(X, y)
# Sklearn
sk_ridge = Ridge(alpha=1.0)
sk_ridge.fit(X, y)
print("Our weights:", my_ridge.weights.round(4))
print("Sklearn weights:", sk_ridge.coef_.round(4))
Explanation: The gradient of the L2 penalty is 2*alpha*weights; the factor of 2 is absorbed by the 1/2 conventionally placed in front of the cost (equivalently, folded into the learning rate), leaving alpha*weights in the update. This pushes weights toward zero proportionally to their current value.
Lasso Regression (L1 Regularization)
Lasso regression takes regularization a step further by using the absolute value of coefficients instead of their squares. This seemingly small change has a profound effect: Lasso can drive coefficients to exactly zero, effectively performing automatic feature selection. When you need a sparse model that uses only the most important features, Lasso is your tool of choice.
What is Lasso Regression?
LASSO stands for Least Absolute Shrinkage and Selection Operator. It adds a penalty proportional to the absolute value of coefficients (L1 norm) to the cost function.
The key difference from Ridge: While Ridge shrinks coefficients toward zero but keeps them non-zero, Lasso can eliminate features entirely by setting their coefficients to exactly zero. This makes Lasso a powerful tool for feature selection in high-dimensional datasets.
When to use Lasso: Choose Lasso when you suspect many features are irrelevant or redundant. It will automatically identify and remove them, leaving you with a simpler, more interpretable model. This is especially valuable when you have more features than samples (p > n scenarios).
The Lasso Cost Function
The Lasso cost function replaces the squared penalty (L2) with an absolute value penalty (L1). This seemingly minor change fundamentally alters the optimization landscape and enables the sparse solutions that make Lasso unique.
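With the same notation as the Ridge objective, the Lasso cost function is:

$$J(\theta) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} |\theta_j|$$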
Feature Selection Power
Lasso's L1 penalty creates sparse solutions where many coefficients are exactly zero. This is automatic feature selection.
- Identifies truly important features
- Reduces model complexity
- Improves interpretability
- Prevents overfitting on irrelevant features
Limitations
Lasso has some known limitations to be aware of:
- Arbitrarily selects one feature from correlated groups
- Cannot select more features than samples (n)
- May be unstable with highly correlated features
- No closed-form solution (requires iterative optimization)
Why L1 Produces Zeros (Geometric Intuition)
The reason Lasso produces sparse solutions while Ridge does not comes down to geometry. Imagine the constraint regions as shapes on a 2D plane where we're optimizing two coefficients. Ridge creates a circular constraint (all points at equal distance from origin), while Lasso creates a diamond shape with corners on the axes.
The optimization process finds where the elliptical contours of the MSE loss function first touch the constraint region. For Ridge's circle, this contact point is almost never on an axis (zero coefficient). For Lasso's diamond, the contact point often hits a corner, which lies on an axis - meaning one coefficient is exactly zero!
Ridge (L2) Constraint
Circular boundary: $\theta_1^2 + \theta_2^2 \leq t$
Loss contours rarely touch axes - coefficients stay non-zero
Lasso (L1) Constraint
Diamond boundary: $|\theta_1| + |\theta_2| \leq t$
Loss contours often hit corners - coefficients become exactly zero
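A quick numeric check of this intuition, on synthetic data with eight pure-noise features (invented for illustration): Ridge leaves every coefficient non-zero, while Lasso zeroes several exactly.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("Ridge exact zeros:", np.sum(ridge.coef_ == 0))   # typically 0
print("Lasso exact zeros:", np.sum(lasso.coef_ == 0))   # several of the noise features
```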
Lasso Regression in Python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Generate data with many irrelevant features
np.random.seed(42)
n_samples, n_features = 100, 20
X = np.random.randn(n_samples, n_features)
# Only first 3 features actually matter
y = 5*X[:, 0] - 3*X[:, 1] + 2*X[:, 2] + np.random.randn(n_samples) * 0.5
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_scaled, y)
# Count non-zero coefficients
n_nonzero = np.sum(lasso.coef_ != 0)
print(f"Non-zero coefficients: {n_nonzero} out of {n_features}")
print(f"Selected features: {np.where(lasso.coef_ != 0)[0]}")
print(f"Coefficients: {lasso.coef_[lasso.coef_ != 0].round(3)}")
We set n_features = 20 to create 20 potential predictors, but most of them are just noise. The target is defined as y = 5*X[:, 0] - 3*X[:, 1] + 2*X[:, 2], meaning only features 0, 1, and 2 truly affect the outcome. When we create the Lasso model with Lasso(alpha=0.1), the alpha parameter controls sparsity: higher values mean more coefficients become exactly zero. After training, np.sum(lasso.coef_ != 0) counts how many features Lasso kept (non-zero coefficients), and np.where(lasso.coef_ != 0)[0] tells us which specific features were selected.
In real datasets with hundreds of features, many are irrelevant or redundant. Lasso automatically identifies and removes them, giving you a simpler, more interpretable model. This is especially valuable when you need to explain your model to stakeholders.
Cross-Validation for Alpha Selection
Choosing the right alpha value is critical for Lasso performance. Scikit-learn provides LassoCV which automatically finds the optimal alpha using cross-validation.
from sklearn.linear_model import LassoCV
import matplotlib.pyplot as plt
# LassoCV automatically finds optimal alpha
lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 100), cv=5)
lasso_cv.fit(X_scaled, y)
print(f"Optimal alpha: {lasso_cv.alpha_:.6f}")
print(f"R² score: {lasso_cv.score(X_scaled, y):.4f}")
print(f"Non-zero features: {np.sum(lasso_cv.coef_ != 0)}")
# Plot alpha vs MSE
plt.figure(figsize=(10, 5))
plt.semilogx(lasso_cv.alphas_, lasso_cv.mse_path_.mean(axis=1))
plt.axvline(lasso_cv.alpha_, color='r', linestyle='--', label=f'Optimal alpha={lasso_cv.alpha_:.4f}')
plt.xlabel('Alpha')
plt.ylabel('Mean Squared Error')
plt.title('Lasso Cross-Validation')
plt.legend()
plt.show()
The np.logspace(-4, 1, 100) function creates 100 alpha values ranging from 0.0001 to 10, evenly spaced on a logarithmic scale. This matters because alpha's effect is multiplicative, so we need to test across orders of magnitude. Setting cv=5 tells the model to use 5-fold cross-validation for reliable error estimates. After fitting, lasso_cv.alpha_ gives us the optimal alpha value found by cross-validation, and lasso_cv.mse_path_ contains the MSE for each alpha across all CV folds, which we use to create the visualization.
When reading the plot, the curve shows how prediction error changes with alpha. Values too low (left side) lead to overfitting, while values too high (right side) cause underfitting. The red dashed line marks the sweet spot that balances both extremes.
Practice Questions
Problem: Given coefficients [2.5, -1.8, 0.0, 0.9, 0.0] and alpha=0.3, calculate the L1 penalty term.
Show Solution
import numpy as np
coefficients = np.array([2.5, -1.8, 0.0, 0.9, 0.0])
alpha = 0.3
# L1 penalty = alpha * sum(|coefficients|)
l1_penalty = alpha * np.sum(np.abs(coefficients))
# = 0.3 * (2.5 + 1.8 + 0 + 0.9 + 0)
# = 0.3 * 5.2
# = 1.56
print(f"L1 Penalty: {l1_penalty}")
# Output: L1 Penalty: 1.56
Explanation: L1 uses absolute values, so we get |2.5|+|-1.8|+|0|+|0.9|+|0| = 5.2, then multiply by alpha to get 1.56. Note how zeros don't contribute to the penalty.
Problem: Create a dataset with 50 features but only 5 truly important ones. Use Lasso to identify the important features and verify the selection.
Show Solution
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
n_samples, n_features = 200, 50
X = np.random.randn(n_samples, n_features)
# True important features: indices 0, 10, 20, 30, 40
important_features = [0, 10, 20, 30, 40]
true_coefs = [3.0, -2.5, 4.0, -1.5, 2.0]
y = sum(c * X[:, i] for i, c in zip(important_features, true_coefs))
y += np.random.randn(n_samples) * 0.5 # Add noise
# Scale and fit Lasso
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
lasso = LassoCV(cv=5)
lasso.fit(X_scaled, y)
# Check which features were selected
selected = np.where(lasso.coef_ != 0)[0]
print(f"True important features: {important_features}")
print(f"Lasso selected features: {list(selected)}")
print(f"Match: {set(selected) == set(important_features)}")
Explanation: With proper regularization, Lasso correctly identifies the 5 important features out of 50, demonstrating its power for automatic feature selection.
Problem: Implement the soft-thresholding operator and use it in a coordinate descent algorithm for Lasso regression.
Show Solution
import numpy as np
def soft_threshold(rho, alpha):
    """Soft thresholding operator for L1 regularization"""
    if rho < -alpha:
        return rho + alpha
    elif rho > alpha:
        return rho - alpha
    else:
        return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iterations=1000):
    """Lasso using coordinate descent"""
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    for iteration in range(n_iterations):
        for j in range(n_features):
            # Calculate residual excluding feature j
            residual = y - X @ weights + weights[j] * X[:, j]
            # Calculate rho (correlation)
            rho = X[:, j] @ residual / n_samples
            # Apply soft thresholding
            z = X[:, j] @ X[:, j] / n_samples
            weights[j] = soft_threshold(rho, alpha) / z
    return weights
# Test on data
np.random.seed(42)
X = np.random.randn(100, 10)
y = 3*X[:, 0] - 2*X[:, 1] + np.random.randn(100) * 0.1
weights = lasso_coordinate_descent(X, y, alpha=0.1)
print("Non-zero weights:", np.where(weights != 0)[0])
print("Weight values:", weights[weights != 0].round(4))
Explanation: Coordinate descent optimizes one coefficient at a time. The soft-thresholding operator is key to Lasso - it shrinks coefficients and sets them to zero when they're smaller than alpha.
Elastic Net: The Best of Both Worlds
What if you want both the feature selection of Lasso and the stability of Ridge? Elastic Net combines L1 and L2 regularization into a single model, giving you fine-grained control over the balance between sparsity and coefficient shrinkage. It's particularly powerful when dealing with correlated features where Lasso alone might be unstable.
What is Elastic Net?
Elastic Net linearly combines the L1 and L2 penalty terms, controlled by a mixing parameter. This creates a regularization approach that inherits advantages from both Ridge and Lasso while mitigating their individual weaknesses.
The key advantage: When features are correlated, Lasso tends to arbitrarily select one and ignore the others. Elastic Net's L2 component encourages grouped selection - if features are correlated, they tend to be selected together. This leads to more stable and interpretable models.
Two hyperparameters: Elastic Net has two parameters to tune: alpha (overall regularization strength) and l1_ratio (balance between L1 and L2). When l1_ratio=1, it becomes Lasso; when l1_ratio=0, it becomes Ridge.
The Elastic Net Cost Function
In scikit-learn, this is parameterized slightly differently using alpha and l1_ratio:
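For reference, the objective scikit-learn's ElasticNet minimizes (per its documentation) is:

$$\frac{1}{2n} \|y - X\theta\|_2^2 + \alpha \cdot \text{l1\_ratio} \cdot \|\theta\|_1 + \frac{\alpha \, (1 - \text{l1\_ratio})}{2} \|\theta\|_2^2$$

so the two extremes of l1_ratio recover the pure L1 and pure L2 penalties: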
l1_ratio = 0
Pure Ridge (L2 only) - No feature selection, all coefficients shrunk
l1_ratio = 0.5
Balanced - Equal L1 and L2 influence, moderate sparsity
l1_ratio = 1
Pure Lasso (L1 only) - Maximum sparsity, feature selection
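You can check the l1_ratio=1 extreme directly: at the same alpha, ElasticNet with l1_ratio=1.0 and Lasso solve the same optimization problem. A sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2 * X[:, 0] + rng.normal(scale=0.3, size=100)

# Same alpha, l1_ratio=1 puts all of the penalty on the L1 term
enet = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.allclose(enet.coef_, lasso.coef_))  # True: identical solutions
```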
Elastic Net in Python
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.preprocessing import StandardScaler
import numpy as np
# Generate data with correlated features
np.random.seed(42)
n_samples = 200
base_feature = np.random.randn(n_samples)
# Create correlated features
X = np.column_stack([
    base_feature,                                      # Feature 0
    base_feature + np.random.randn(n_samples) * 0.1,   # Correlated with 0
    base_feature + np.random.randn(n_samples) * 0.1,   # Correlated with 0
    np.random.randn(n_samples),                        # Independent
    np.random.randn(n_samples),                        # Independent
])
y = 3 * base_feature + np.random.randn(n_samples) * 0.5
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Compare Lasso vs Elastic Net on correlated data
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_scaled, y)
print(f"Lasso coefficients: {lasso.coef_.round(3)}")
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_scaled, y)
print(f"Elastic Net coefficients: {elastic.coef_.round(3)}")
Cross-Validation for Hyperparameter Tuning
from sklearn.linear_model import ElasticNetCV
import matplotlib.pyplot as plt
# ElasticNetCV searches over both alpha and l1_ratio
elastic_cv = ElasticNetCV(
    l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99],
    alphas=np.logspace(-4, 1, 50),
    cv=5
)
elastic_cv.fit(X_scaled, y)
print(f"Optimal alpha: {elastic_cv.alpha_:.6f}")
print(f"Optimal l1_ratio: {elastic_cv.l1_ratio_:.2f}")
print(f"R² score: {elastic_cv.score(X_scaled, y):.4f}")
print(f"Non-zero coefficients: {np.sum(elastic_cv.coef_ != 0)}")
print(f"Coefficients: {elastic_cv.coef_.round(4)}")
The parameter grid includes l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99] to test different L1/L2 balances, ranging from mostly Ridge (0.1) to nearly pure Lasso (0.99). We also test alphas=np.logspace(-4, 1, 50), which gives us 50 regularization strengths from 0.0001 to 10. With cv=5 for 5-fold cross-validation, the model evaluates 7 x 50 = 350 different combinations to find the best settings.
After training, elastic_cv.alpha_ gives you the winning alpha value, and elastic_cv.l1_ratio_ tells you the winning L1/L2 balance, where values closer to 1 are more Lasso-like and values closer to 0 are more Ridge-like. np.sum(elastic_cv.coef_ != 0) shows how many features were selected. If the optimal l1_ratio is near 1, your data prefers sparse Lasso-style solutions; if it's near 0, Ridge-style shrinkage works better. Values in between indicate you benefit from the combination.
When to Use Each Method
| Scenario | Recommended Method | Reasoning |
|---|---|---|
| Features are independent | Lasso | Clean feature selection, maximum sparsity |
| Features are correlated | Elastic Net | Grouped selection, stability |
| All features matter | Ridge | Shrinks but keeps all features |
| More features than samples (p > n) | Elastic Net | Can select more than n features (unlike Lasso) |
| Need interpretability | Lasso | Sparse models are easier to explain |
| Unsure about feature relationships | Elastic Net | Flexible, covers both extremes |
Practice Questions
Problem: Given coefficients [2.0, -1.0, 0.5], alpha=0.4, and l1_ratio=0.7, calculate the total regularization penalty.
Show Solution
import numpy as np
coefficients = np.array([2.0, -1.0, 0.5])
alpha = 0.4
l1_ratio = 0.7
# L1 component: alpha * l1_ratio * sum(|coef|)
l1_penalty = alpha * l1_ratio * np.sum(np.abs(coefficients))
# = 0.4 * 0.7 * (2.0 + 1.0 + 0.5) = 0.28 * 3.5 = 0.98
# L2 component: alpha * (1-l1_ratio) * sum(coef^2)
l2_penalty = alpha * (1 - l1_ratio) * np.sum(coefficients ** 2)
# = 0.4 * 0.3 * (4.0 + 1.0 + 0.25) = 0.12 * 5.25 = 0.63
total_penalty = l1_penalty + l2_penalty
print(f"L1 penalty: {l1_penalty:.2f}")
print(f"L2 penalty: {l2_penalty:.2f}")
print(f"Total penalty: {total_penalty:.2f}")
# Output: Total penalty: 1.61
Explanation: Elastic Net splits the penalty between L1 and L2 according to l1_ratio. Here, 70% goes to L1 (promoting sparsity) and 30% to L2 (encouraging small coefficients).
Problem: On the same dataset, compare Ridge, Lasso, and Elastic Net. Report the number of non-zero coefficients and R² scores for each.
Show Solution
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
# Generate dataset
np.random.seed(42)
X = np.random.randn(200, 20)
y = 3*X[:, 0] - 2*X[:, 1] + 1.5*X[:, 2] + np.random.randn(200) * 0.5
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Compare models
models = {
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
    'Elastic Net': ElasticNet(alpha=0.1, l1_ratio=0.5)
}
print(f"{'Model':<15} {'Non-zero':<10} {'R² (CV)':<10}")
print("-" * 35)
for name, model in models.items():
    model.fit(X_scaled, y)
    n_nonzero = np.sum(model.coef_ != 0)
    cv_score = cross_val_score(model, X_scaled, y, cv=5).mean()
    print(f"{name:<15} {n_nonzero:<10} {cv_score:.4f}")
Explanation: Ridge keeps all 20 coefficients non-zero, Lasso selects the fewest (sparsest), and Elastic Net falls in between. Performance (R²) should be similar if regularization is tuned properly.
Problem: Use GridSearchCV to find optimal alpha and l1_ratio for Elastic Net. Visualize the results as a heatmap.
Show Solution
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
# Generate data
np.random.seed(42)
X = np.random.randn(200, 15)
y = 2*X[:, 0] - 3*X[:, 1] + X[:, 2] + np.random.randn(200) * 0.3
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Grid search
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1.0, 10.0],
    'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
}
grid_search = GridSearchCV(
    ElasticNet(), param_grid, cv=5, scoring='r2', return_train_score=True
)
grid_search.fit(X_scaled, y)
print(f"Best params: {grid_search.best_params_}")
print(f"Best R²: {grid_search.best_score_:.4f}")
# Create heatmap
scores = grid_search.cv_results_['mean_test_score'].reshape(5, 5)
plt.figure(figsize=(8, 6))
plt.imshow(scores, cmap='viridis', aspect='auto')
plt.colorbar(label='R² Score')
plt.xticks(range(5), param_grid['l1_ratio'])
plt.yticks(range(5), param_grid['alpha'])
plt.xlabel('l1_ratio')
plt.ylabel('alpha')
plt.title('Elastic Net Grid Search Results')
plt.show()
Explanation: The heatmap visualizes how different combinations of alpha and l1_ratio affect model performance. The best region appears in yellow/bright colors.
Preventing Overfitting with Regularization
Overfitting is the nemesis of machine learning models - it occurs when a model learns the training data too well, including its noise and peculiarities, failing to generalize to new data. Regularization is our primary weapon against overfitting, constraining model complexity to find the sweet spot between underfitting and overfitting.
Understanding Overfitting
Overfitting occurs when a model captures noise in the training data rather than the underlying pattern. The model performs excellently on training data but poorly on unseen test data.
Signs of overfitting:
- Large gap between training and validation performance
- Very high training accuracy (near 100%)
- Model complexity far exceeds necessary level
- Coefficients with very large magnitudes
- High variance in predictions across different training sets
Why regularization helps: By penalizing large coefficients, regularization prevents the model from fitting to noise. Noise typically requires extreme coefficient values to capture, so the penalty term discourages this behavior.
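A short illustration of that point, using the same kind of noisy sine data as the plots below (the degree and alpha are this sketch's own choices): the unregularized degree-15 fit needs very large coefficients to chase the noise, while Ridge keeps them small.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
X = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.3, size=30)

# Same degree-15 features, with and without an L2 penalty
ols = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X, y)
ridge = make_pipeline(PolynomialFeatures(15), Ridge(alpha=0.1)).fit(X, y)
print(f"OLS   max |coef|: {np.abs(ols[-1].coef_).max():.1f}")
print(f"Ridge max |coef|: {np.abs(ridge[-1].coef_).max():.1f}")
```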
The Bias-Variance Tradeoff
Understanding the bias-variance tradeoff is essential to grasping why regularization works. Every model's prediction error can be decomposed into three components: bias (systematic error), variance (sensitivity to training data), and irreducible noise.
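In symbols, for squared error at a point $x$:

$$\mathbb{E}\left[ (y - \hat{f}(x))^2 \right] = \mathrm{Bias}\left[\hat{f}(x)\right]^2 + \mathrm{Var}\left[\hat{f}(x)\right] + \sigma^2$$

where $\sigma^2$ is the irreducible noise.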
High Bias (Underfitting)
Model is too simple to capture patterns
- Low training score
- Low test score
- Regularization too strong
Just Right
Model captures true patterns without noise
- Good training score
- Good test score (similar)
- Optimal regularization
High Variance (Overfitting)
Model memorizes training data including noise
- High training score
- Low test score
- Regularization too weak
Visualizing Overfitting
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
# Generate noisy data
np.random.seed(42)
X = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + np.random.randn(30) * 0.3
X_test = np.linspace(0, 1, 100).reshape(-1, 1)
We create 30 training points with np.linspace(0, 1, 30) and add Gaussian noise with np.random.randn(30) * 0.3. The X_test array contains 100 evenly spaced points that we'll use to visualize how smoothly each model predicts across the entire range.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
Setting figsize=(15, 4) makes the figure wide enough to display all three plots clearly.
# Underfitting: Degree 1
model1 = make_pipeline(PolynomialFeatures(1), LinearRegression())
model1.fit(X, y)
axes[0].scatter(X, y, alpha=0.7)
axes[0].plot(X_test, model1.predict(X_test), 'r-', linewidth=2)
axes[0].set_title('Underfitting (Degree 1)')
The make_pipeline function chains together the polynomial feature transformation and linear regression into one convenient model. Even though this model has low variance (it won't change much with different training data), it has high bias because it systematically misses the true pattern. This is underfitting.
# Just right: Degree 4 with regularization
model2 = make_pipeline(PolynomialFeatures(4), Ridge(alpha=0.1))
model2.fit(X, y)
axes[1].scatter(X, y, alpha=0.7)
axes[1].plot(X_test, model2.predict(X_test), 'g-', linewidth=2)
axes[1].set_title('Good Fit (Degree 4 + Ridge)')
Here the degree-4 polynomial is combined with Ridge at alpha=0.1 to prevent the coefficients from becoming too extreme. This combination gives us the best of both worlds: enough complexity to fit the true pattern, but enough constraint to avoid fitting the noise. The green curve should follow the general shape of the data without wiggling through every individual point.
# Overfitting: Degree 15 no regularization
model3 = make_pipeline(PolynomialFeatures(15), LinearRegression())
model3.fit(X, y)
axes[2].scatter(X, y, alpha=0.7)
axes[2].plot(X_test, model3.predict(X_test), 'orange', linewidth=2)
axes[2].set_title('Overfitting (Degree 15)')
plt.tight_layout()
plt.show()
The tight_layout() function automatically adjusts spacing between plots so they don't overlap. When you run this code, you'll see three plots side by side: the red line (underfitting) is too straight, the green line (good fit) captures the wave nicely, and the orange line (overfitting) wiggles excessively. This visual comparison is one of the best ways to understand why regularization matters.
Learning Curves: Diagnosing Overfitting
Learning curves plot model performance against training set size, revealing whether your model suffers from bias or variance. They're invaluable for diagnosing overfitting and guiding model selection.
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
We import the learning_curve function from sklearn, which handles the heavy lifting of training models on different-sized subsets of data. The other imports give us plotting capabilities, numerical operations, Ridge regression for regularization, polynomial feature generation, and a convenient way to chain these steps together.
def plot_learning_curve(model, X, y, title):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='neg_mean_squared_error'
    )
The learning_curve() function trains the model on progressively larger portions of the training data (10%, 20%, ... up to 100%), specified by train_sizes=np.linspace(0.1, 1.0, 10). It uses 5-fold cross-validation (cv=5) at each size to get reliable error estimates. The scoring metric is negative MSE because sklearn's convention is to maximize scores, so errors are returned as negative values.
    train_mean = -train_scores.mean(axis=1)
    val_mean = -val_scores.mean(axis=1)
Calling .mean(axis=1) averages the scores across all cross-validation folds for each training set size, giving us one training error and one validation error per size. The leading minus sign converts the negative MSE values back into positive errors.
    plt.figure(figsize=(8, 5))
    plt.plot(train_sizes, train_mean, 'o-', label='Training Error')
    plt.plot(train_sizes, val_mean, 'o-', label='Validation Error')
    plt.xlabel('Training Set Size')
    plt.ylabel('Mean Squared Error')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
# Generate data
np.random.seed(42)
X = np.random.randn(200, 1)
y = 0.5 * X.ravel() ** 2 + np.random.randn(200) * 0.3
# High variance (overfitting) - weak regularization
plot_learning_curve(
    make_pipeline(PolynomialFeatures(10), Ridge(alpha=0.001)),
    X, y, 'Weak Regularization (Overfitting)'
)
# Good fit - proper regularization
plot_learning_curve(
    make_pipeline(PolynomialFeatures(10), Ridge(alpha=1.0)),
    X, y, 'Proper Regularization (Good Fit)'
)
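The same diagnosis can be made numerically without plotting. The sketch below (the final_gap helper is our own convenience function, not part of sklearn) computes the gap between validation and training error at the full training size for the two pipelines above; a large positive gap signals high variance:

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

np.random.seed(42)
X = np.random.randn(200, 1)
y = 0.5 * X.ravel() ** 2 + np.random.randn(200) * 0.3

def final_gap(model):
    # Validation-minus-training MSE at the largest training size
    _, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, train_sizes=[1.0],
        scoring='neg_mean_squared_error'
    )
    return (-val_scores).mean() - (-train_scores).mean()

weak = final_gap(make_pipeline(PolynomialFeatures(10), Ridge(alpha=0.001)))
strong = final_gap(make_pipeline(PolynomialFeatures(10), Ridge(alpha=1.0)))
print(f"Gap with alpha=0.001: {weak:.4f}")
print(f"Gap with alpha=1.0:   {strong:.4f}")
```

Exact numbers depend on the random seed, but the weakly regularized pipeline generally shows the wider gap.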
Regularization Path Visualization
A regularization path shows how coefficients change as regularization strength (alpha) varies. This visualization helps you understand which features are most important and how aggressive the regularization needs to be.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
# Generate data
np.random.seed(42)
X = np.random.randn(100, 5)
y = 3*X[:, 0] + 2*X[:, 1] - X[:, 2] + np.random.randn(100) * 0.5
# Test many alpha values
alphas = np.logspace(-3, 3, 100)
# Store coefficients
ridge_coefs = []
lasso_coefs = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X, y)
    ridge_coefs.append(ridge.coef_)
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X, y)
    lasso_coefs.append(lasso.coef_)
Setting max_iter=10000 for Lasso ensures the coordinate descent solver converges across the whole alpha grid; at the weakly regularized (small alpha) end, the default iteration limit is often too low.
# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for i in range(5):
    axes[0].plot(np.log10(alphas), [c[i] for c in ridge_coefs], label=f'Feature {i}')
    axes[1].plot(np.log10(alphas), [c[i] for c in lasso_coefs], label=f'Feature {i}')
axes[0].set_xlabel('log10(Alpha)')
axes[0].set_ylabel('Coefficient')
axes[0].set_title('Ridge Regularization Path')
axes[0].legend()
axes[1].set_xlabel('log10(Alpha)')
axes[1].set_ylabel('Coefficient')
axes[1].set_title('Lasso Regularization Path')
axes[1].legend()
plt.tight_layout()
plt.show()
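To confirm numerically what the two panels show, here is a small spot-check on the same kind of data: at a strong alpha, Ridge shrinks every coefficient but keeps all five non-zero, while Lasso has already driven some to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

np.random.seed(42)
X = np.random.randn(100, 5)
y = 3*X[:, 0] + 2*X[:, 1] - X[:, 2] + np.random.randn(100) * 0.5

ridge = Ridge(alpha=100).fit(X, y)
lasso = Lasso(alpha=1.0, max_iter=10000).fit(X, y)

print(np.count_nonzero(ridge.coef_))       # → 5: Ridge keeps every feature
print(np.count_nonzero(lasso.coef_) < 5)   # → True: Lasso dropped at least one
```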
Practice Questions
Problem: A model has training R² = 0.99 and test R² = 0.45. What does this indicate and what should you do?
Solution:
# Analysis
train_r2 = 0.99
test_r2 = 0.45
gap = train_r2 - test_r2  # 0.54
print(f"Training R²: {train_r2}")
print(f"Test R²: {test_r2}")
print(f"Gap: {gap:.2f}")
# Diagnosis
diagnosis = """
This is a clear case of OVERFITTING:
- Training score (0.99) is almost perfect
- Test score (0.45) is much lower
- Large gap (0.54) between train and test
Solutions:
1. Increase regularization (higher alpha)
2. Reduce model complexity (fewer features)
3. Get more training data
4. Apply cross-validation for tuning
5. Try Lasso for feature selection
"""
print(diagnosis)
Explanation: The large gap between training and test performance is the hallmark of overfitting: the model has memorized the training data but cannot generalize to new samples.
Problem: Create a validation curve that shows how training and validation scores change with different alpha values. Find the optimal alpha.
Solution:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve
from sklearn.linear_model import Ridge
# Generate overfitting-prone data
np.random.seed(42)
X = np.random.randn(100, 20)
y = 2*X[:, 0] - X[:, 1] + np.random.randn(100) * 0.5
# Validation curve
alphas = np.logspace(-4, 4, 50)
train_scores, val_scores = validation_curve(
    Ridge(), X, y, param_name='alpha',
    param_range=alphas, cv=5, scoring='r2'
)
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
# Find optimal alpha
optimal_idx = np.argmax(val_mean)
optimal_alpha = alphas[optimal_idx]
plt.figure(figsize=(10, 6))
plt.semilogx(alphas, train_mean, 'b-', label='Training')
plt.semilogx(alphas, val_mean, 'r-', label='Validation')
plt.axvline(optimal_alpha, color='g', linestyle='--',
            label=f'Optimal alpha={optimal_alpha:.4f}')
plt.xlabel('Alpha (Regularization Strength)')
plt.ylabel('R² Score')
plt.title('Validation Curve for Ridge Regression')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
print(f"Optimal alpha: {optimal_alpha:.6f}")
print(f"Best validation R²: {val_mean[optimal_idx]:.4f}")
Explanation: The validation curve shows underfitting at high alpha (both scores low) and overfitting at low alpha (train high, val low). The optimal alpha maximizes validation score.
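If you only need the optimal alpha and not the plot, scikit-learn's RidgeCV performs the same grid search internally via cross-validation. A minimal sketch on the same kind of data:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

np.random.seed(42)
X = np.random.randn(100, 20)
y = 2*X[:, 0] - X[:, 1] + np.random.randn(100) * 0.5

# RidgeCV evaluates every alpha in the grid with cross-validation,
# then refits on the full data using the best one
model = RidgeCV(alphas=np.logspace(-4, 4, 50), cv=5).fit(X, y)
print(f"Optimal alpha: {model.alpha_:.6f}")
```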
Problem: Build a complete pipeline that: (1) scales features, (2) applies polynomial features, (3) uses regularized regression, and (4) uses cross-validation to find optimal hyperparameters.
Solution:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Generate non-linear data
np.random.seed(42)
X = np.random.randn(200, 3)
y = 2*X[:, 0]**2 - 3*X[:, 1] + X[:, 0]*X[:, 2] + np.random.randn(200) * 0.5
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Build pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures()),
    ('regressor', ElasticNet(max_iter=10000))
])
# Hyperparameter grid
param_grid = {
    'poly__degree': [1, 2, 3],
    'regressor__alpha': [0.01, 0.1, 1.0],
    'regressor__l1_ratio': [0.2, 0.5, 0.8]
}
# Grid search with cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2', n_jobs=-1)
grid_search.fit(X_train, y_train)
# Evaluate best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print("Best Parameters:", grid_search.best_params_)
print(f"Best CV R²: {grid_search.best_score_:.4f}")
print(f"Test R²: {r2_score(y_test, y_pred):.4f}")
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
Explanation: This pipeline handles the entire workflow (scaling, polynomial feature generation, regularized regression, and hyperparameter tuning) and makes a solid template for real projects.
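One convenience worth knowing: after fitting, GridSearchCV refits the best pipeline on the whole training set and exposes predict() itself, so you rarely need to pull out best_estimator_ by hand. A compact self-contained sketch (a smaller grid and dataset than above, for speed):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

np.random.seed(0)
X = np.random.randn(80, 2)
y = X[:, 0]**2 - X[:, 1] + np.random.randn(80) * 0.1

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures()),
    ('regressor', ElasticNet(max_iter=10000))
])
search = GridSearchCV(pipeline, {'poly__degree': [1, 2]}, cv=3, scoring='r2')
search.fit(X, y)

# predict() on the search object delegates to the refit best_estimator_
X_new = np.random.randn(3, 2)
same = np.allclose(search.predict(X_new), search.best_estimator_.predict(X_new))
print(same)  # → True
```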
Choosing the Right Regularization Method
With three regularization techniques at your disposal, choosing the right one can be daunting. This section provides a decision framework based on your data characteristics, model requirements, and performance goals.
Decision Framework
Quick Decision Guide
Choose Ridge When:
- All features are potentially relevant
- Features are highly correlated
- You need a closed-form solution
- Interpretability isn't a priority
Choose Lasso When:
- Many features are irrelevant
- You need automatic feature selection
- Interpretability is important
- You want a sparse model
Choose Elastic Net When:
- Features are correlated AND some are irrelevant
- You need grouped feature selection
- More features than samples (p > n)
- You're unsure which to use
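When you're genuinely unsure, you can let cross-validation make the call: ElasticNetCV tunes alpha and l1_ratio together, and the chosen ratio tells you whether the data favors Lasso-like (near 1) or Ridge-like (near 0) behavior. A minimal sketch:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

np.random.seed(42)
X = np.random.randn(120, 8)
y = 2*X[:, 0] - X[:, 1] + np.random.randn(120) * 0.3

# Tunes alpha (its own internal grid) and l1_ratio (our candidates)
# jointly via 5-fold cross-validation
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, max_iter=10000).fit(X, y)
print(f"Chosen l1_ratio: {model.l1_ratio_}")
print(f"Chosen alpha: {model.alpha_:.4f}")
```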
Complete Comparison Table
| Aspect | Ridge (L2) | Lasso (L1) | Elastic Net |
|---|---|---|---|
| Penalty Type | Sum of squared coefficients | Sum of absolute coefficients | Combination of L1 and L2 |
| Sparsity | No (all coefficients non-zero) | Yes (many coefficients zero) | Yes (controlled by l1_ratio) |
| Feature Selection | No | Yes (automatic) | Yes (grouped) |
| Correlated Features | Handles well (shrinks together) | Unstable (picks one arbitrarily) | Handles well (grouped selection) |
| Closed-Form Solution | Yes | No (needs iterative solver) | No (needs iterative solver) |
| Max Features Selected | All features | At most n (number of samples) | No limit |
| Hyperparameters | alpha only | alpha only | alpha and l1_ratio |
| Computational Cost | Low (closed-form) | Medium (coordinate descent) | Medium (coordinate descent) |
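The "Closed-Form Solution" row can be verified directly: with the intercept disabled, Ridge's minimizer is w = (XᵀX + αI)⁻¹Xᵀy, and solving that linear system reproduces scikit-learn's coefficients. A small sketch:

```python
import numpy as np
from sklearn.linear_model import Ridge

np.random.seed(0)
X = np.random.randn(50, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + np.random.randn(50) * 0.1

alpha = 1.0
# Closed-form Ridge solution: w = (X^T X + alpha * I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

model = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
print(np.allclose(w, model.coef_))  # → True
```

Lasso and Elastic Net have no such formula because the absolute-value penalty is not differentiable at zero, which is why they rely on iterative coordinate descent.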
Practical Comparison Example
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
# Create dataset with different characteristics
np.random.seed(42)
n_samples = 200
# Scenario: Mix of important, irrelevant, and correlated features
X = np.random.randn(n_samples, 15)
# Add correlated features
X[:, 5] = X[:, 0] + np.random.randn(n_samples) * 0.1
X[:, 6] = X[:, 0] + np.random.randn(n_samples) * 0.1
# Target depends on only a few features
y = 3*X[:, 0] - 2*X[:, 1] + 1.5*X[:, 2] + np.random.randn(n_samples) * 0.5
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Compare models
models = {
    'Ridge (alpha=1.0)': Ridge(alpha=1.0),
    'Lasso (alpha=0.1)': Lasso(alpha=0.1),
    'Elastic Net (0.1, 0.5)': ElasticNet(alpha=0.1, l1_ratio=0.5)
}
print(f"{'Model':<25} {'CV R²':>10} {'Non-zero':>10}")
print("=" * 47)
for name, model in models.items():
    cv_score = cross_val_score(model, X_scaled, y, cv=5).mean()
    model.fit(X_scaled, y)
    n_nonzero = np.sum(model.coef_ != 0)
    print(f"{name:<25} {cv_score:>10.4f} {n_nonzero:>10}")
print("\nCoefficient Analysis:")
for name, model in models.items():
    model.fit(X_scaled, y)
    print(f"\n{name}:")
    top3 = np.argsort(np.abs(model.coef_))[-3:][::-1]
    print(f"  Top 3 features by |coefficient|: {top3}")
    print(f"  Coefficient values: {model.coef_[top3].round(3)}")
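The "Correlated Features" behavior deserves its own spot-check. In this sketch, two nearly identical copies of one signal are fed to both models: Ridge splits the weight roughly evenly across the twins, while Lasso tends to concentrate it on one copy (and which copy it keeps is essentially arbitrary).

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

np.random.seed(0)
n = 200
base = np.random.randn(n)
# Two almost-duplicate features carrying the same signal
X = np.column_stack([base + np.random.randn(n) * 0.01,
                     base + np.random.randn(n) * 0.01])
y = base + np.random.randn(n) * 0.1

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X, y)

print(np.isclose(ridge.coef_[0], ridge.coef_[1], rtol=0.3))  # weight shared
print(abs(lasso.coef_[0] - lasso.coef_[1]) > 0.05)           # weight concentrated
```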
Key Takeaways
Ridge (L2) Shrinks All
Adds squared coefficient penalty, shrinks all weights toward zero but never eliminates them entirely
Lasso (L1) Selects Features
Uses absolute value penalty, drives unimportant coefficients to exactly zero for automatic feature selection
Elastic Net Combines Both
Mixes L1 and L2 penalties, providing grouped selection and stability with correlated features
Alpha Controls Strength
Higher alpha means stronger regularization, use cross-validation to find the optimal value
Prevents Overfitting
Regularization constrains model complexity, reducing variance and improving generalization to new data
Scale Features First
Always standardize features before regularization since penalties treat all coefficients equally
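The last point is easy to demonstrate. In this sketch, two equally informative features live on very different scales; without standardization, the same alpha crushes the unit-scale coefficient while leaving the large-scale one almost untouched.

```python
import numpy as np
from sklearn.linear_model import Ridge

np.random.seed(42)
x_small = np.random.randn(200)          # unit scale
x_big = np.random.randn(200) * 1000     # 1000x larger scale
X = np.column_stack([x_small, x_big])
y = x_small + x_big / 1000 + np.random.randn(200) * 0.1  # equal contributions

lo = Ridge(alpha=1).fit(X, y).coef_
hi = Ridge(alpha=10000).fit(X, y).coef_

# The shrinkage factor is roughly S / (S + alpha), where S is the feature's
# sum of squares, so the large-scale feature barely feels the penalty
print(hi[0] / lo[0] < 0.5)   # → True: unit-scale coefficient heavily shrunk
print(hi[1] / lo[1] > 0.9)   # → True: large-scale coefficient barely moves
```

Standardizing first puts both features on the same footing, so the penalty treats them symmetrically.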
Knowledge Check
Test your understanding of regularization techniques:
What type of penalty does Ridge regression add to the cost function?
Which regularization technique can set coefficients to exactly zero?
What happens when you increase the alpha (regularization strength) parameter?
What is the main advantage of Elastic Net over Lasso when features are correlated?
Why is feature scaling important before applying regularization?
A model has high training accuracy but low test accuracy. What should you do?