Ridge Regression (L2 Regularization)
Ridge regression is your first line of defense against overfitting. By adding a penalty term proportional to the square of coefficient magnitudes, Ridge shrinks large coefficients toward zero without eliminating them entirely. This simple yet powerful technique transforms unstable, overfit models into robust predictors that generalize well to new data.
What is Ridge Regression?
Ridge regression (also called Tikhonov regularization or L2 regularization) modifies the standard linear regression cost function by adding a penalty term equal to the sum of squared coefficients multiplied by a hyperparameter lambda (also written as alpha in scikit-learn).
The intuition: Imagine you're packing for a trip with a weight limit on your luggage. Standard linear regression packs whatever it wants (large coefficients for features it thinks are important). Ridge regression imposes a "tax" on heavy items - you can still bring them, but they cost you extra. This encourages the model to use smaller, more distributed weights across all features rather than relying heavily on just a few.
Why "Ridge"? The name comes from the ridge that appears when you plot the penalty term. The technique was originally developed to handle situations where the feature matrix is singular or near-singular (multicollinearity), creating a "ridge" in the solution space.
The Ridge Regression Cost Function
The Ridge regression objective adds an L2 penalty term to the ordinary least squares (OLS) cost function. This penalty grows with the square of coefficient values, effectively discouraging the model from assigning very large weights to any single feature.
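Written out with $n$ samples and $p$ features (and the intercept conventionally left out of the penalty), the objective is:

$$J(\theta) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} \theta_j^2$$

Scikit-learn's alpha plays the role of $\lambda$ here; some texts also scale the squared-error term by $1/n$ or $1/2n$, which only rescales the meaning of $\lambda$.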
Lambda (Regularization Strength)
- Lambda = 0: No regularization (standard linear regression)
- Small lambda: Slight penalty, coefficients slightly reduced
- Large lambda: Strong penalty, coefficients pushed toward zero
- Lambda → infinity: Coefficients are driven toward zero (the unpenalized intercept remains)
Key Properties
- Shrinks coefficients but never sets them exactly to zero
- Handles multicollinearity effectively
- Always has a unique solution (regularization ensures invertibility)
- Works well when all features contribute to prediction
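A quick sketch of the multicollinearity point (synthetic data invented just for this illustration): if two columns are exact duplicates, Ridge doesn't fail or pick one arbitrarily; the L2 penalty splits the weight evenly between them.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x])                      # two identical columns
y = 4 * x.ravel() + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)   # weight split roughly equally, each coefficient near 2
```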
Ridge Regression in Python
Scikit-learn provides an intuitive interface for Ridge regression through the Ridge class. The key hyperparameter is alpha (equivalent to lambda in the mathematical notation), which controls the regularization strength.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Generate sample data with multicollinearity
np.random.seed(42)
X = np.random.randn(100, 5)
X[:, 1] = X[:, 0] + np.random.randn(100) * 0.1 # Correlated features
y = 3*X[:, 0] + 2*X[:, 2] + np.random.randn(100) * 0.5
# Split and scale data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train Ridge regression with different alpha values
for alpha in [0.01, 0.1, 1.0, 10.0]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_scaled, y_train)
    score = ridge.score(X_test_scaled, y_test)
    print(f"Alpha={alpha:5.2f}: R²={score:.4f}, Coefficients={ridge.coef_.round(2)}")
The code starts with np.random.seed(42) to ensure you get the same results every time you run it. We then create multicollinearity by making feature 1 almost identical to feature 0 using X[:, 1] = X[:, 0] + np.random.randn(100) * 0.1. Before training, we use StandardScaler() to normalize features to mean=0 and std=1, which is essential for fair regularization since the L2 penalty treats all features equally. Finally, Ridge(alpha=alpha) creates the model, where higher alpha means stronger regularization and smaller coefficients.
As alpha increases, you'll notice the coefficients get smaller (shrink toward zero), which reduces overfitting but may increase bias. The goal is finding the alpha that balances these tradeoffs for the best test performance.
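One way to automate that search is RidgeCV, which evaluates a grid of candidate alphas by cross-validation. A minimal sketch on synthetic data similar to the example above:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 2 * X[:, 2] + rng.normal(scale=0.5, size=100)

# RidgeCV picks the best alpha from the candidates via (generalized) cross-validation
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
print(f"Best alpha: {ridge_cv.alpha_}")
```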
Closed-Form Solution
Unlike standard linear regression, which can fail when features are highly correlated, Ridge regression always has a unique solution. The regularization term adds a positive value to the diagonal of the X^T X matrix, ensuring it is always invertible.
import numpy as np
def ridge_regression_closed_form(X, y, alpha):
    """Implement Ridge regression using the closed-form solution"""
    n_features = X.shape[1]
    # Add column of ones for intercept
    X_with_intercept = np.column_stack([np.ones(len(X)), X])
    identity_with_intercept = np.eye(n_features + 1)
    identity_with_intercept[0, 0] = 0  # Don't regularize intercept
    # Closed-form solution: (X^T X + alpha*I)^-1 X^T y
    XtX = X_with_intercept.T @ X_with_intercept
    regularized = XtX + alpha * identity_with_intercept
    theta = np.linalg.inv(regularized) @ X_with_intercept.T @ y
    return theta[0], theta[1:]  # intercept, coefficients
# Example usage
intercept, coefs = ridge_regression_closed_form(X_train_scaled, y_train, alpha=1.0)
print(f"Intercept: {intercept:.4f}")
print(f"Coefficients: {coefs.round(4)}")
The function builds an identity matrix with np.eye(n_features + 1), which has 1s on the diagonal and 0s elsewhere, sized to match the design matrix after a column of 1s is added for the intercept term with np.column_stack([np.ones(len(X)), X]). Importantly, we set identity_with_intercept[0, 0] = 0 because we don't want to regularize the intercept, since it's not a feature coefficient. The core computation uses X.T @ X to compute X-transpose times X (a covariance-like matrix), then adds alpha * identity to the diagonal. This is the magic of Ridge regression: adding alpha to the diagonal ensures the matrix is always invertible. Finally, np.linalg.inv(...) performs the matrix inversion to solve the equation.
This matters because ordinary least squares fails when features are perfectly correlated (singular matrix). Ridge's lambda*I addition guarantees a unique solution always exists, which is why Ridge is often called "regularized least squares."
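A small numeric check of that claim, with made-up data: two perfectly collinear columns make X^T X singular, but adding alpha to the diagonal makes it well conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))
X = np.hstack([x, x])          # perfectly collinear columns
XtX = X.T @ X

# Singular (or numerically near-singular) without regularization...
print("cond(X^T X):", np.linalg.cond(XtX))
# ...but well conditioned once alpha*I is added to the diagonal
print("cond(X^T X + I):", np.linalg.cond(XtX + 1.0 * np.eye(2)))
```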
Practice Questions
Problem: Given coefficients [2.5, -1.8, 0.9] and alpha=0.5, calculate the L2 penalty term that would be added to the cost function.
Show Solution
import numpy as np
# Given coefficients
coefficients = np.array([2.5, -1.8, 0.9])
alpha = 0.5
# L2 penalty = alpha * sum(coefficients^2)
l2_penalty = alpha * np.sum(coefficients ** 2)
# = 0.5 * (6.25 + 3.24 + 0.81)
# = 0.5 * 10.3
# = 5.15
print(f"L2 Penalty: {l2_penalty}")
# Output: L2 Penalty: 5.15
Explanation: The L2 penalty squares each coefficient (2.5²=6.25, 1.8²=3.24, 0.9²=0.81), sums them (10.3), and multiplies by alpha (5.15). This value is added to the MSE cost.
Problem: Train Ridge regression on a dataset with alpha values [0.001, 0.1, 10, 1000]. Plot how the coefficients change as alpha increases.
Show Solution
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
# Generate data
X, y = make_regression(n_samples=100, n_features=10, noise=10)
# Test different alpha values
alphas = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
coef_matrix = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X, y)
    coef_matrix.append(ridge.coef_)
# Plot coefficient paths
coef_matrix = np.array(coef_matrix)
plt.figure(figsize=(10, 6))
for i in range(coef_matrix.shape[1]):
    plt.plot(np.log10(alphas), coef_matrix[:, i], label=f'Coef {i+1}')
plt.xlabel('log10(Alpha)')
plt.ylabel('Coefficient Value')
plt.title('Ridge Coefficient Paths')
plt.legend(bbox_to_anchor=(1.05, 1))
plt.tight_layout()
plt.show()
Explanation: As alpha increases, all coefficients shrink toward zero but never reach exactly zero. The plot shows how stronger regularization constrains the model.
Problem: Implement Ridge regression using gradient descent. Train on synthetic data and compare your implementation with scikit-learn's Ridge.
Show Solution
import numpy as np
from sklearn.linear_model import Ridge
class RidgeRegressionGD:
    def __init__(self, alpha=1.0, learning_rate=0.01, n_iterations=1000):
        self.alpha = alpha
        self.lr = learning_rate
        self.n_iter = n_iterations

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        for _ in range(self.n_iter):
            y_pred = X @ self.weights + self.bias
            error = y_pred - y
            # Gradients with L2 penalty
            dw = (1/n_samples) * (X.T @ error + self.alpha * self.weights)
            db = (1/n_samples) * np.sum(error)
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
        return self

    def predict(self, X):
        return X @ self.weights + self.bias
# Compare with sklearn
np.random.seed(42)
X = np.random.randn(100, 5)
y = 3*X[:, 0] + 2*X[:, 1] + np.random.randn(100) * 0.5
# Our implementation
my_ridge = RidgeRegressionGD(alpha=1.0, learning_rate=0.1, n_iterations=1000)
my_ridge.fit(X, y)
# Sklearn
sk_ridge = Ridge(alpha=1.0)
sk_ridge.fit(X, y)
print("Our weights:", my_ridge.weights.round(4))
print("Sklearn weights:", sk_ridge.coef_.round(4))
Explanation: The gradient of the L2 penalty is 2*alpha*weights; the factor of 2 is absorbed by the 1/2 conventionally placed in front of the cost (equivalently, folded into the learning rate), leaving alpha*weights in the update. This pushes weights toward zero proportionally to their current value.
Lasso Regression (L1 Regularization)
Lasso regression takes regularization a step further by using the absolute value of coefficients instead of their squares. This seemingly small change has a profound effect: Lasso can drive coefficients to exactly zero, effectively performing automatic feature selection. When you need a sparse model that uses only the most important features, Lasso is your tool of choice.
What is Lasso Regression?
LASSO stands for Least Absolute Shrinkage and Selection Operator. It adds a penalty proportional to the absolute value of coefficients (L1 norm) to the cost function.
The key difference from Ridge: While Ridge shrinks coefficients toward zero but keeps them non-zero, Lasso can eliminate features entirely by setting their coefficients to exactly zero. This makes Lasso a powerful tool for feature selection in high-dimensional datasets.
When to use Lasso: Choose Lasso when you suspect many features are irrelevant or redundant. It will automatically identify and remove them, leaving you with a simpler, more interpretable model. This is especially valuable when you have more features than samples (p > n scenarios).
The Lasso Cost Function
The Lasso cost function replaces the squared penalty (L2) with an absolute value penalty (L1). This seemingly minor change fundamentally alters the optimization landscape and enables the sparse solutions that make Lasso unique.
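With the same notation as the Ridge objective, the Lasso cost function is:

$$J(\theta) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} |\theta_j|$$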
Feature Selection Power
Lasso's L1 penalty creates sparse solutions where many coefficients are exactly zero. This is automatic feature selection.
- Identifies truly important features
- Reduces model complexity
- Improves interpretability
- Prevents overfitting on irrelevant features
Limitations
Lasso has some known limitations to be aware of:
- Arbitrarily selects one feature from correlated groups
- Cannot select more features than samples (n)
- May be unstable with highly correlated features
- No closed-form solution (requires iterative optimization)
Why L1 Produces Zeros (Geometric Intuition)
The reason Lasso produces sparse solutions while Ridge does not comes down to geometry. Imagine the constraint regions as shapes on a 2D plane where we're optimizing two coefficients. Ridge creates a circular constraint (all points at equal distance from origin), while Lasso creates a diamond shape with corners on the axes.
The optimization process finds where the elliptical contours of the MSE loss function first touch the constraint region. For Ridge's circle, this contact point is almost never on an axis (zero coefficient). For Lasso's diamond, the contact point often hits a corner, which lies on an axis - meaning one coefficient is exactly zero!
Ridge (L2) Constraint
Circular boundary: $\theta_1^2 + \theta_2^2 \leq t$
Loss contours rarely touch axes - coefficients stay non-zero
Lasso (L1) Constraint
Diamond boundary: $|\theta_1| + |\theta_2| \leq t$
Loss contours often hit corners - coefficients become exactly zero
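A quick numeric check of this intuition, on synthetic data with eight pure-noise features (invented for illustration): Ridge leaves every coefficient non-zero, while Lasso zeroes several exactly.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("Ridge exact zeros:", np.sum(ridge.coef_ == 0))   # typically 0
print("Lasso exact zeros:", np.sum(lasso.coef_ == 0))   # several of the noise features
```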
Lasso Regression in Python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Generate data with many irrelevant features
np.random.seed(42)
n_samples, n_features = 100, 20
X = np.random.randn(n_samples, n_features)
# Only first 3 features actually matter
y = 5*X[:, 0] - 3*X[:, 1] + 2*X[:, 2] + np.random.randn(n_samples) * 0.5
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_scaled, y)
# Count non-zero coefficients
n_nonzero = np.sum(lasso.coef_ != 0)
print(f"Non-zero coefficients: {n_nonzero} out of {n_features}")
print(f"Selected features: {np.where(lasso.coef_ != 0)[0]}")
print(f"Coefficients: {lasso.coef_[lasso.coef_ != 0].round(3)}")
We set n_features = 20 to create 20 potential predictors, but most of them are just noise. The target is defined as y = 5*X[:, 0] - 3*X[:, 1] + 2*X[:, 2], meaning only features 0, 1, and 2 truly affect the outcome. When we create the Lasso model with Lasso(alpha=0.1), the alpha parameter controls sparsity: higher values mean more coefficients become exactly zero. After training, np.sum(lasso.coef_ != 0) counts how many features Lasso kept (non-zero coefficients), and np.where(lasso.coef_ != 0)[0] tells us which specific features were selected.
In real datasets with hundreds of features, many are irrelevant or redundant. Lasso automatically identifies and removes them, giving you a simpler, more interpretable model. This is especially valuable when you need to explain your model to stakeholders.
Cross-Validation for Alpha Selection
Choosing the right alpha value is critical for Lasso performance. Scikit-learn provides LassoCV which automatically finds the optimal alpha using cross-validation.
from sklearn.linear_model import LassoCV
import matplotlib.pyplot as plt
# LassoCV automatically finds optimal alpha
lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 100), cv=5)
lasso_cv.fit(X_scaled, y)
print(f"Optimal alpha: {lasso_cv.alpha_:.6f}")
print(f"R² score: {lasso_cv.score(X_scaled, y):.4f}")
print(f"Non-zero features: {np.sum(lasso_cv.coef_ != 0)}")
# Plot alpha vs MSE
plt.figure(figsize=(10, 5))
plt.semilogx(lasso_cv.alphas_, lasso_cv.mse_path_.mean(axis=1))
plt.axvline(lasso_cv.alpha_, color='r', linestyle='--', label=f'Optimal alpha={lasso_cv.alpha_:.4f}')
plt.xlabel('Alpha')
plt.ylabel('Mean Squared Error')
plt.title('Lasso Cross-Validation')
plt.legend()
plt.show()
The np.logspace(-4, 1, 100) function creates 100 alpha values ranging from 0.0001 to 10, evenly spaced on a logarithmic scale. This matters because alpha's effect is multiplicative, so we need to test across orders of magnitude. Setting cv=5 tells the model to use 5-fold cross-validation for reliable error estimates. After fitting, lasso_cv.alpha_ gives us the optimal alpha value found by cross-validation, and lasso_cv.mse_path_ contains the MSE for each alpha across all CV folds, which we use to create the visualization.
When reading the plot, the curve shows how prediction error changes with alpha. Values too low (left side) lead to overfitting, while values too high (right side) cause underfitting. The red dashed line marks the sweet spot that balances both extremes.
Practice Questions
Problem: Given coefficients [2.5, -1.8, 0.0, 0.9, 0.0] and alpha=0.3, calculate the L1 penalty term.
Show Solution
import numpy as np
coefficients = np.array([2.5, -1.8, 0.0, 0.9, 0.0])
alpha = 0.3
# L1 penalty = alpha * sum(|coefficients|)
l1_penalty = alpha * np.sum(np.abs(coefficients))
# = 0.3 * (2.5 + 1.8 + 0 + 0.9 + 0)
# = 0.3 * 5.2
# = 1.56
print(f"L1 Penalty: {l1_penalty}")
# Output: L1 Penalty: 1.56
Explanation: L1 uses absolute values, so we get |2.5|+|-1.8|+|0|+|0.9|+|0| = 5.2, then multiply by alpha to get 1.56. Note how zeros don't contribute to the penalty.
Problem: Create a dataset with 50 features but only 5 truly important ones. Use Lasso to identify the important features and verify the selection.
Show Solution
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
n_samples, n_features = 200, 50
X = np.random.randn(n_samples, n_features)
# True important features: indices 0, 10, 20, 30, 40
important_features = [0, 10, 20, 30, 40]
true_coefs = [3.0, -2.5, 4.0, -1.5, 2.0]
y = sum(c * X[:, i] for i, c in zip(important_features, true_coefs))
y += np.random.randn(n_samples) * 0.5 # Add noise
# Scale and fit Lasso
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
lasso = LassoCV(cv=5)
lasso.fit(X_scaled, y)
# Check which features were selected
selected = np.where(lasso.coef_ != 0)[0]
print(f"True important features: {important_features}")
print(f"Lasso selected features: {list(selected)}")
print(f"Match: {set(selected) == set(important_features)}")
Explanation: With proper regularization, Lasso correctly identifies the 5 important features out of 50, demonstrating its power for automatic feature selection.
Problem: Implement the soft-thresholding operator and use it in a coordinate descent algorithm for Lasso regression.
Show Solution
import numpy as np
def soft_threshold(rho, alpha):
    """Soft thresholding operator for L1 regularization"""
    if rho < -alpha:
        return rho + alpha
    elif rho > alpha:
        return rho - alpha
    else:
        return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iterations=1000):
    """Lasso using coordinate descent"""
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    for iteration in range(n_iterations):
        for j in range(n_features):
            # Calculate residual excluding feature j
            residual = y - X @ weights + weights[j] * X[:, j]
            # Calculate rho (correlation)
            rho = X[:, j] @ residual / n_samples
            # Apply soft thresholding
            z = X[:, j] @ X[:, j] / n_samples
            weights[j] = soft_threshold(rho, alpha) / z
    return weights
# Test on data
np.random.seed(42)
X = np.random.randn(100, 10)
y = 3*X[:, 0] - 2*X[:, 1] + np.random.randn(100) * 0.1
weights = lasso_coordinate_descent(X, y, alpha=0.1)
print("Non-zero weights:", np.where(weights != 0)[0])
print("Weight values:", weights[weights != 0].round(4))
Explanation: Coordinate descent optimizes one coefficient at a time. The soft-thresholding operator is key to Lasso - it shrinks coefficients and sets them to zero when they're smaller than alpha.
Elastic Net: The Best of Both Worlds
What if you want both the feature selection of Lasso and the stability of Ridge? Elastic Net combines L1 and L2 regularization into a single model, giving you fine-grained control over the balance between sparsity and coefficient shrinkage. It's particularly powerful when dealing with correlated features where Lasso alone might be unstable.
What is Elastic Net?
Elastic Net linearly combines the L1 and L2 penalty terms, controlled by a mixing parameter. This creates a regularization approach that inherits advantages from both Ridge and Lasso while mitigating their individual weaknesses.
The key advantage: When features are correlated, Lasso tends to arbitrarily select one and ignore the others. Elastic Net's L2 component encourages grouped selection - if features are correlated, they tend to be selected together. This leads to more stable and interpretable models.
Two hyperparameters: Elastic Net has two parameters to tune: alpha (overall regularization strength) and l1_ratio (balance between L1 and L2). When l1_ratio=1, it becomes Lasso; when l1_ratio=0, it becomes Ridge.
The Elastic Net Cost Function
In scikit-learn, this is parameterized slightly differently using alpha and l1_ratio:
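For reference, the objective scikit-learn's ElasticNet minimizes (per its documentation) is:

$$\frac{1}{2n} \|y - X\theta\|_2^2 + \alpha \cdot \text{l1\_ratio} \cdot \|\theta\|_1 + \frac{\alpha \, (1 - \text{l1\_ratio})}{2} \|\theta\|_2^2$$

so the two extremes of l1_ratio recover the pure L1 and pure L2 penalties: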
l1_ratio = 0
Pure Ridge (L2 only) - No feature selection, all coefficients shrunk
l1_ratio = 0.5
Balanced - Equal L1 and L2 influence, moderate sparsity
l1_ratio = 1
Pure Lasso (L1 only) - Maximum sparsity, feature selection
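You can check the l1_ratio=1 extreme directly: at the same alpha, ElasticNet with l1_ratio=1.0 and Lasso solve the same optimization problem. A sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2 * X[:, 0] + rng.normal(scale=0.3, size=100)

# Same alpha, l1_ratio=1 puts all of the penalty on the L1 term
enet = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.allclose(enet.coef_, lasso.coef_))  # True: identical solutions
```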
Elastic Net in Python
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.preprocessing import StandardScaler
import numpy as np
# Generate data with correlated features
np.random.seed(42)
n_samples = 200
base_feature = np.random.randn(n_samples)
# Create correlated features
X = np.column_stack([
    base_feature,                                      # Feature 0
    base_feature + np.random.randn(n_samples) * 0.1,   # Correlated with 0
    base_feature + np.random.randn(n_samples) * 0.1,   # Correlated with 0
    np.random.randn(n_samples),                        # Independent
    np.random.randn(n_samples),                        # Independent
])
y = 3 * base_feature + np.random.randn(n_samples) * 0.5
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Compare Lasso vs Elastic Net on correlated data
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_scaled, y)
print(f"Lasso coefficients: {lasso.coef_.round(3)}")
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_scaled, y)
print(f"Elastic Net coefficients: {elastic.coef_.round(3)}")
Cross-Validation for Hyperparameter Tuning
from sklearn.linear_model import ElasticNetCV
import matplotlib.pyplot as plt
# ElasticNetCV searches over both alpha and l1_ratio
elastic_cv = ElasticNetCV(
    l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99],
    alphas=np.logspace(-4, 1, 50),
    cv=5
)
elastic_cv.fit(X_scaled, y)
print(f"Optimal alpha: {elastic_cv.alpha_:.6f}")
print(f"Optimal l1_ratio: {elastic_cv.l1_ratio_:.2f}")
print(f"R² score: {elastic_cv.score(X_scaled, y):.4f}")
print(f"Non-zero coefficients: {np.sum(elastic_cv.coef_ != 0)}")
print(f"Coefficients: {elastic_cv.coef_.round(4)}")
The parameter grid includes l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99] to test different L1/L2 balances, ranging from mostly Ridge (0.1) to nearly pure Lasso (0.99). We also test alphas=np.logspace(-4, 1, 50), which gives us 50 regularization strengths from 0.0001 to 10. With cv=5 for 5-fold cross-validation, the model evaluates 7 x 50 = 350 different combinations to find the best settings.
After training, elastic_cv.alpha_ gives you the winning alpha value, and elastic_cv.l1_ratio_ tells you the winning L1/L2 balance, where values closer to 1 are more Lasso-like and values closer to 0 are more Ridge-like. np.sum(elastic_cv.coef_ != 0) shows how many features were selected. If the optimal l1_ratio is near 1, your data prefers sparse Lasso-style solutions; if it's near 0, Ridge-style shrinkage works better. Values in between indicate you benefit from the combination.
When to Use Each Method
| Scenario | Recommended Method | Reasoning |
|---|---|---|
| Features are independent | Lasso | Clean feature selection, maximum sparsity |
| Features are correlated | Elastic Net | Grouped selection, stability |
| All features matter | Ridge | Shrinks but keeps all features |
| More features than samples (p > n) | Elastic Net | Can select more than n features (unlike Lasso) |
| Need interpretability | Lasso | Sparse models are easier to explain |
| Unsure about feature relationships | Elastic Net | Flexible, covers both extremes |
Practice Questions
Problem: Given coefficients [2.0, -1.0, 0.5], alpha=0.4, and l1_ratio=0.7, calculate the total regularization penalty.
Show Solution
import numpy as np
coefficients = np.array([2.0, -1.0, 0.5])
alpha = 0.4
l1_ratio = 0.7
# L1 component: alpha * l1_ratio * sum(|coef|)
l1_penalty = alpha * l1_ratio * np.sum(np.abs(coefficients))
# = 0.4 * 0.7 * (2.0 + 1.0 + 0.5) = 0.28 * 3.5 = 0.98
# L2 component: alpha * (1-l1_ratio) * sum(coef^2)
l2_penalty = alpha * (1 - l1_ratio) * np.sum(coefficients ** 2)
# = 0.4 * 0.3 * (4.0 + 1.0 + 0.25) = 0.12 * 5.25 = 0.63
total_penalty = l1_penalty + l2_penalty
print(f"L1 penalty: {l1_penalty:.2f}")
print(f"L2 penalty: {l2_penalty:.2f}")
print(f"Total penalty: {total_penalty:.2f}")
# Output: Total penalty: 1.61
Explanation: Elastic Net splits the penalty between L1 and L2 according to l1_ratio. Here, 70% goes to L1 (promoting sparsity) and 30% to L2 (encouraging small coefficients).
Problem: On the same dataset, compare Ridge, Lasso, and Elastic Net. Report the number of non-zero coefficients and R² scores for each.
Show Solution
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
# Generate dataset
np.random.seed(42)
X = np.random.randn(200, 20)
y = 3*X[:, 0] - 2*X[:, 1] + 1.5*X[:, 2] + np.random.randn(200) * 0.5
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Compare models
models = {
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
    'Elastic Net': ElasticNet(alpha=0.1, l1_ratio=0.5)
}
print(f"{'Model':<15} {'Non-zero':<10} {'R² (CV)':<10}")
print("-" * 35)
for name, model in models.items():
    model.fit(X_scaled, y)
    n_nonzero = np.sum(model.coef_ != 0)
    cv_score = cross_val_score(model, X_scaled, y, cv=5).mean()
    print(f"{name:<15} {n_nonzero:<10} {cv_score:.4f}")
Explanation: Ridge keeps all 20 coefficients non-zero, Lasso selects the fewest (sparsest), and Elastic Net falls in between. Performance (R²) should be similar if regularization is tuned properly.
Problem: Use GridSearchCV to find optimal alpha and l1_ratio for Elastic Net. Visualize the results as a heatmap.
Show Solution
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
# Generate data
np.random.seed(42)
X = np.random.randn(200, 15)
y = 2*X[:, 0] - 3*X[:, 1] + X[:, 2] + np.random.randn(200) * 0.3
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Grid search
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1.0, 10.0],
    'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
}
grid_search = GridSearchCV(
    ElasticNet(), param_grid, cv=5, scoring='r2', return_train_score=True
)
grid_search.fit(X_scaled, y)
print(f"Best params: {grid_search.best_params_}")
print(f"Best R²: {grid_search.best_score_:.4f}")
# Create heatmap
scores = grid_search.cv_results_['mean_test_score'].reshape(5, 5)
plt.figure(figsize=(8, 6))
plt.imshow(scores, cmap='viridis', aspect='auto')
plt.colorbar(label='R² Score')
plt.xticks(range(5), param_grid['l1_ratio'])
plt.yticks(range(5), param_grid['alpha'])
plt.xlabel('l1_ratio')
plt.ylabel('alpha')
plt.title('Elastic Net Grid Search Results')
plt.show()
Explanation: The heatmap visualizes how different combinations of alpha and l1_ratio affect model performance. The best region appears in yellow/bright colors.
Preventing Overfitting with Regularization
Overfitting is the nemesis of machine learning models - it occurs when a model learns the training data too well, including its noise and peculiarities, failing to generalize to new data. Regularization is our primary weapon against overfitting, constraining model complexity to find the sweet spot between underfitting and overfitting.
Understanding Overfitting
Overfitting occurs when a model captures noise in the training data rather than the underlying pattern. The model performs excellently on training data but poorly on unseen test data.
Signs of overfitting:
- Large gap between training and validation performance
- Very high training accuracy (near 100%)
- Model complexity far exceeds necessary level
- Coefficients with very large magnitudes
- High variance in predictions across different training sets
Why regularization helps: By penalizing large coefficients, regularization prevents the model from fitting to noise. Noise typically requires extreme coefficient values to capture, so the penalty term discourages this behavior.
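A short illustration of that point, using the same kind of noisy sine data as the plots below (the degree and alpha are this sketch's own choices): the unregularized degree-15 fit needs very large coefficients to chase the noise, while Ridge keeps them small.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
X = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.3, size=30)

# Same degree-15 features, with and without an L2 penalty
ols = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X, y)
ridge = make_pipeline(PolynomialFeatures(15), Ridge(alpha=0.1)).fit(X, y)
print(f"OLS   max |coef|: {np.abs(ols[-1].coef_).max():.1f}")
print(f"Ridge max |coef|: {np.abs(ridge[-1].coef_).max():.1f}")
```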
The Bias-Variance Tradeoff
Understanding the bias-variance tradeoff is essential to grasping why regularization works. Every model's prediction error can be decomposed into three components: bias (systematic error), variance (sensitivity to training data), and irreducible noise.
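In symbols, for squared error at a point $x$:

$$\mathbb{E}\left[ (y - \hat{f}(x))^2 \right] = \mathrm{Bias}\left[\hat{f}(x)\right]^2 + \mathrm{Var}\left[\hat{f}(x)\right] + \sigma^2$$

where $\sigma^2$ is the irreducible noise.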
High Bias (Underfitting)
Model is too simple to capture patterns
- Low training score
- Low test score
- Regularization too strong
Just Right
Model captures true patterns without noise
- Good training score
- Good test score (similar)
- Optimal regularization
High Variance (Overfitting)
Model memorizes training data including noise
- High training score
- Low test score
- Regularization too weak
Visualizing Overfitting
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
# Generate noisy data
np.random.seed(42)
X = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + np.random.randn(30) * 0.3
X_test = np.linspace(0, 1, 100).reshape(-1, 1)
We create 30 training points with np.linspace(0, 1, 30) and add Gaussian noise with np.random.randn(30) * 0.3. The X_test array contains 100 evenly spaced points that we'll use to visualize how smoothly each model predicts across the entire range.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
Setting figsize=(15, 4) makes the figure wide enough to display all three plots clearly.
# Underfitting: Degree 1
model1 = make_pipeline(PolynomialFeatures(1), LinearRegression())
model1.fit(X, y)
axes[0].scatter(X, y, alpha=0.7)
axes[0].plot(X_test, model1.predict(X_test), 'r-', linewidth=2)
axes[0].set_title('Underfitting (Degree 1)')
The make_pipeline function chains together the polynomial feature transformation and linear regression into one convenient model. Even though this model has low variance (it won't change much with different training data), it has high bias because it systematically misses the true pattern. This is underfitting.
# Just right: Degree 4 with regularization
model2 = make_pipeline(PolynomialFeatures(4), Ridge(alpha=0.1))
model2.fit(X, y)
axes[1].scatter(X, y, alpha=0.7)
axes[1].plot(X_test, model2.predict(X_test), 'g-', linewidth=2)
axes[1].set_title('Good Fit (Degree 4 + Ridge)')
Here the degree-4 polynomial is combined with Ridge at alpha=0.1 to prevent the coefficients from becoming too extreme. This combination gives us the best of both worlds: enough complexity to fit the true pattern, but enough constraint to avoid fitting the noise. The green curve should follow the general shape of the data without wiggling through every individual point.
# Overfitting: Degree 15 no regularization
model3 = make_pipeline(PolynomialFeatures(15), LinearRegression())
model3.fit(X, y)
axes[2].scatter(X, y, alpha=0.7)
axes[2].plot(X_test, model3.predict(X_test), 'orange', linewidth=2)
axes[2].set_title('Overfitting (Degree 15)')
plt.tight_layout()
plt.show()
The tight_layout() function automatically adjusts spacing between plots so they don't overlap. When you run this code, you'll see three plots side by side: the red line (underfitting) is too straight, the green line (good fit) captures the wave nicely, and the orange line (overfitting) wiggles excessively. This visual comparison is one of the best ways to understand why regularization matters.
Learning Curves: Diagnosing Overfitting
Learning curves plot model performance against training set size, revealing whether your model suffers from bias or variance. They're invaluable for diagnosing overfitting and guiding model selection.
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
We import the learning_curve function from sklearn, which handles the heavy lifting of training models on different-sized subsets of data. The other imports give us plotting capabilities, numerical operations, Ridge regression for regularization, polynomial feature generation, and a convenient way to chain these steps together.
def plot_learning_curve(model, X, y, title):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='neg_mean_squared_error'
    )
The learning_curve() function trains the model on progressively larger portions of the training data (10%, 20%, ... up to 100%), specified by train_sizes=np.linspace(0.1, 1.0, 10). It uses 5-fold cross-validation (cv=5) at each size to get reliable error estimates. The scoring metric is negative MSE because sklearn's convention is to maximize scores, so errors are returned as negative values.
    train_mean = -train_scores.mean(axis=1)
    val_mean = -val_scores.mean(axis=1)
Calling .mean(axis=1) averages the scores across all cross-validation folds for each training set size, giving us one training error and one validation error per size. The leading minus sign converts the negative MSE values back into positive errors.
    plt.figure(figsize=(8, 5))
    plt.plot(train_sizes, train_mean, 'o-', label='Training Error')
    plt.plot(train_sizes, val_mean, 'o-', label='Validation Error')
    plt.xlabel('Training Set Size')
    plt.ylabel('Mean Squared Error')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
# Generate data
np.random.seed(42)
X = np.random.randn(200, 1)
y = 0.5 * X.ravel() ** 2 + np.random.randn(200) * 0.3
# High variance (overfitting) - weak regularization
plot_learning_curve(
    make_pipeline(PolynomialFeatures(10), Ridge(alpha=0.001)),
    X, y, 'Weak Regularization (Overfitting)'
)
# Good fit - proper regularization
plot_learning_curve(
    make_pipeline(PolynomialFeatures(10), Ridge(alpha=1.0)),
    X, y, 'Proper Regularization (Good Fit)'
)
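The same diagnosis can be made numerically without plotting. The sketch below (the final_gap helper is our own convenience function, not part of sklearn) computes the gap between validation and training error at the full training size for the two pipelines above; a large positive gap signals high variance:

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

np.random.seed(42)
X = np.random.randn(200, 1)
y = 0.5 * X.ravel() ** 2 + np.random.randn(200) * 0.3

def final_gap(model):
    # Validation-minus-training MSE at the largest training size
    _, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, train_sizes=[1.0],
        scoring='neg_mean_squared_error'
    )
    return (-val_scores).mean() - (-train_scores).mean()

weak = final_gap(make_pipeline(PolynomialFeatures(10), Ridge(alpha=0.001)))
strong = final_gap(make_pipeline(PolynomialFeatures(10), Ridge(alpha=1.0)))
print(f"Gap with alpha=0.001: {weak:.4f}")
print(f"Gap with alpha=1.0:   {strong:.4f}")
```

Exact numbers depend on the random seed, but the weakly regularized pipeline generally shows the wider gap.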
Regularization Path Visualization
A regularization path shows how coefficients change as regularization strength (alpha) varies. This visualization helps you understand which features are most important and how aggressive the regularization needs to be.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
# Generate data
np.random.seed(42)
X = np.random.randn(100, 5)
y = 3*X[:, 0] + 2*X[:, 1] - X[:, 2] + np.random.randn(100) * 0.5
# Test many alpha values
alphas = np.logspace(-3, 3, 100)
# Store coefficients
ridge_coefs = []
lasso_coefs = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X, y)
    ridge_coefs.append(ridge.coef_)
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X, y)
    lasso_coefs.append(lasso.coef_)
Setting max_iter=10000 for Lasso ensures the coordinate descent solver converges across the whole alpha grid; at the weakly regularized (small alpha) end, the default iteration limit is often too low.
# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for i in range(5):
    axes[0].plot(np.log10(alphas), [c[i] for c in ridge_coefs], label=f'Feature {i}')
    axes[1].plot(np.log10(alphas), [c[i] for c in lasso_coefs], label=f'Feature {i}')
axes[0].set_xlabel('log10(Alpha)')
axes[0].set_ylabel('Coefficient')
axes[0].set_title('Ridge Regularization Path')
axes[0].legend()
axes[1].set_xlabel('log10(Alpha)')
axes[1].set_ylabel('Coefficient')
axes[1].set_title('Lasso Regularization Path')
axes[1].legend()
plt.tight_layout()
plt.show()
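To confirm numerically what the two panels show, here is a small spot-check on the same kind of data: at a strong alpha, Ridge shrinks every coefficient but keeps all five non-zero, while Lasso has already driven some to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

np.random.seed(42)
X = np.random.randn(100, 5)
y = 3*X[:, 0] + 2*X[:, 1] - X[:, 2] + np.random.randn(100) * 0.5

ridge = Ridge(alpha=100).fit(X, y)
lasso = Lasso(alpha=1.0, max_iter=10000).fit(X, y)

print(np.count_nonzero(ridge.coef_))       # → 5: Ridge keeps every feature
print(np.count_nonzero(lasso.coef_) < 5)   # → True: Lasso dropped at least one
```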
Practice Questions
Problem: A model has training R² = 0.99 and test R² = 0.45. What does this indicate and what should you do?
Solution:
# Analysis
train_r2 = 0.99
test_r2 = 0.45
gap = train_r2 - test_r2  # 0.54
print(f"Training R²: {train_r2}")
print(f"Test R²: {test_r2}")
print(f"Gap: {gap:.2f}")
# Diagnosis
diagnosis = """
This is a clear case of OVERFITTING:
- Training score (0.99) is almost perfect
- Test score (0.45) is much lower
- Large gap (0.54) between train and test
Solutions:
1. Increase regularization (higher alpha)
2. Reduce model complexity (fewer features)
3. Get more training data
4. Apply cross-validation for tuning
5. Try Lasso for feature selection
"""
print(diagnosis)
Explanation: The large gap between training and test performance is the hallmark of overfitting: the model has memorized the training data but cannot generalize to new samples.
Problem: Create a validation curve that shows how training and validation scores change with different alpha values. Find the optimal alpha.
Solution:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve
from sklearn.linear_model import Ridge
# Generate overfitting-prone data
np.random.seed(42)
X = np.random.randn(100, 20)
y = 2*X[:, 0] - X[:, 1] + np.random.randn(100) * 0.5
# Validation curve
alphas = np.logspace(-4, 4, 50)
train_scores, val_scores = validation_curve(
    Ridge(), X, y, param_name='alpha',
    param_range=alphas, cv=5, scoring='r2'
)
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
# Find optimal alpha
optimal_idx = np.argmax(val_mean)
optimal_alpha = alphas[optimal_idx]
plt.figure(figsize=(10, 6))
plt.semilogx(alphas, train_mean, 'b-', label='Training')
plt.semilogx(alphas, val_mean, 'r-', label='Validation')
plt.axvline(optimal_alpha, color='g', linestyle='--',
            label=f'Optimal alpha={optimal_alpha:.4f}')
plt.xlabel('Alpha (Regularization Strength)')
plt.ylabel('R² Score')
plt.title('Validation Curve for Ridge Regression')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
print(f"Optimal alpha: {optimal_alpha:.6f}")
print(f"Best validation R²: {val_mean[optimal_idx]:.4f}")
Explanation: The validation curve shows underfitting at high alpha (both scores low) and overfitting at low alpha (train high, val low). The optimal alpha maximizes validation score.
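If you only need the optimal alpha and not the plot, scikit-learn's RidgeCV performs the same grid search internally via cross-validation. A minimal sketch on the same kind of data:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

np.random.seed(42)
X = np.random.randn(100, 20)
y = 2*X[:, 0] - X[:, 1] + np.random.randn(100) * 0.5

# RidgeCV evaluates every alpha in the grid with cross-validation,
# then refits on the full data using the best one
model = RidgeCV(alphas=np.logspace(-4, 4, 50), cv=5).fit(X, y)
print(f"Optimal alpha: {model.alpha_:.6f}")
```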
Problem: Build a complete pipeline that: (1) scales features, (2) applies polynomial features, (3) uses regularized regression, and (4) uses cross-validation to find optimal hyperparameters.
Solution:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Generate non-linear data
np.random.seed(42)
X = np.random.randn(200, 3)
y = 2*X[:, 0]**2 - 3*X[:, 1] + X[:, 0]*X[:, 2] + np.random.randn(200) * 0.5
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Build pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures()),
    ('regressor', ElasticNet(max_iter=10000))
])
# Hyperparameter grid
param_grid = {
    'poly__degree': [1, 2, 3],
    'regressor__alpha': [0.01, 0.1, 1.0],
    'regressor__l1_ratio': [0.2, 0.5, 0.8]
}
# Grid search with cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2', n_jobs=-1)
grid_search.fit(X_train, y_train)
# Evaluate best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print("Best Parameters:", grid_search.best_params_)
print(f"Best CV R²: {grid_search.best_score_:.4f}")
print(f"Test R²: {r2_score(y_test, y_pred):.4f}")
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
Explanation: This pipeline handles the entire workflow (scaling, polynomial feature generation, regularized regression, and hyperparameter tuning) and makes a solid template for real projects.
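One convenience worth knowing: after fitting, GridSearchCV refits the best pipeline on the whole training set and exposes predict() itself, so you rarely need to pull out best_estimator_ by hand. A compact self-contained sketch (a smaller grid and dataset than above, for speed):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

np.random.seed(0)
X = np.random.randn(80, 2)
y = X[:, 0]**2 - X[:, 1] + np.random.randn(80) * 0.1

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures()),
    ('regressor', ElasticNet(max_iter=10000))
])
search = GridSearchCV(pipeline, {'poly__degree': [1, 2]}, cv=3, scoring='r2')
search.fit(X, y)

# predict() on the search object delegates to the refit best_estimator_
X_new = np.random.randn(3, 2)
same = np.allclose(search.predict(X_new), search.best_estimator_.predict(X_new))
print(same)  # → True
```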
Choosing the Right Regularization Method
With three regularization techniques at your disposal, choosing the right one can be daunting. This section provides a decision framework based on your data characteristics, model requirements, and performance goals.
Decision Framework
Quick Decision Guide
Choose Ridge When:
- All features are potentially relevant
- Features are highly correlated
- You need a closed-form solution
- Interpretability isn't a priority
Choose Lasso When:
- Many features are irrelevant
- You need automatic feature selection
- Interpretability is important
- You want a sparse model
Choose Elastic Net When:
- Features are correlated AND some are irrelevant
- You need grouped feature selection
- More features than samples (p > n)
- You're unsure which to use
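When you're genuinely unsure, you can let cross-validation make the call: ElasticNetCV tunes alpha and l1_ratio together, and the chosen ratio tells you whether the data favors Lasso-like (near 1) or Ridge-like (near 0) behavior. A minimal sketch:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

np.random.seed(42)
X = np.random.randn(120, 8)
y = 2*X[:, 0] - X[:, 1] + np.random.randn(120) * 0.3

# Tunes alpha (its own internal grid) and l1_ratio (our candidates)
# jointly via 5-fold cross-validation
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, max_iter=10000).fit(X, y)
print(f"Chosen l1_ratio: {model.l1_ratio_}")
print(f"Chosen alpha: {model.alpha_:.4f}")
```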
Complete Comparison Table
| Aspect | Ridge (L2) | Lasso (L1) | Elastic Net |
|---|---|---|---|
| Penalty Type | Sum of squared coefficients | Sum of absolute coefficients | Combination of L1 and L2 |
| Sparsity | No (all coefficients non-zero) | Yes (many coefficients zero) | Yes (controlled by l1_ratio) |
| Feature Selection | No | Yes (automatic) | Yes (grouped) |
| Correlated Features | Handles well (shrinks together) | Unstable (picks one arbitrarily) | Handles well (grouped selection) |
| Closed-Form Solution | Yes | No (needs iterative solver) | No (needs iterative solver) |
| Max Features Selected | All features | At most n (number of samples) | No limit |
| Hyperparameters | alpha only | alpha only | alpha and l1_ratio |
| Computational Cost | Low (closed-form) | Medium (coordinate descent) | Medium (coordinate descent) |
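The "Closed-Form Solution" row can be verified directly: with the intercept disabled, Ridge's minimizer is w = (XᵀX + αI)⁻¹Xᵀy, and solving that linear system reproduces scikit-learn's coefficients. A small sketch:

```python
import numpy as np
from sklearn.linear_model import Ridge

np.random.seed(0)
X = np.random.randn(50, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + np.random.randn(50) * 0.1

alpha = 1.0
# Closed-form Ridge solution: w = (X^T X + alpha * I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

model = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
print(np.allclose(w, model.coef_))  # → True
```

Lasso and Elastic Net have no such formula because the absolute-value penalty is not differentiable at zero, which is why they rely on iterative coordinate descent.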
Practical Comparison Example
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
# Create dataset with different characteristics
np.random.seed(42)
n_samples = 200
# Scenario: Mix of important, irrelevant, and correlated features
X = np.random.randn(n_samples, 15)
# Add correlated features
X[:, 5] = X[:, 0] + np.random.randn(n_samples) * 0.1
X[:, 6] = X[:, 0] + np.random.randn(n_samples) * 0.1
# Target depends on only a few features
y = 3*X[:, 0] - 2*X[:, 1] + 1.5*X[:, 2] + np.random.randn(n_samples) * 0.5
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Compare models
models = {
    'Ridge (alpha=1.0)': Ridge(alpha=1.0),
    'Lasso (alpha=0.1)': Lasso(alpha=0.1),
    'Elastic Net (0.1, 0.5)': ElasticNet(alpha=0.1, l1_ratio=0.5)
}
print(f"{'Model':<25} {'CV R²':>10} {'Non-zero':>10}")
print("=" * 47)
for name, model in models.items():
    cv_score = cross_val_score(model, X_scaled, y, cv=5).mean()
    model.fit(X_scaled, y)
    n_nonzero = np.sum(model.coef_ != 0)
    print(f"{name:<25} {cv_score:>10.4f} {n_nonzero:>10}")
print("\nCoefficient Analysis:")
for name, model in models.items():
    model.fit(X_scaled, y)
    print(f"\n{name}:")
    top3 = np.argsort(np.abs(model.coef_))[-3:][::-1]
    print(f"  Top 3 features by |coefficient|: {top3}")
    print(f"  Coefficient values: {model.coef_[top3].round(3)}")
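The "Correlated Features" behavior deserves its own spot-check. In this sketch, two nearly identical copies of one signal are fed to both models: Ridge splits the weight roughly evenly across the twins, while Lasso tends to concentrate it on one copy (and which copy it keeps is essentially arbitrary).

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

np.random.seed(0)
n = 200
base = np.random.randn(n)
# Two almost-duplicate features carrying the same signal
X = np.column_stack([base + np.random.randn(n) * 0.01,
                     base + np.random.randn(n) * 0.01])
y = base + np.random.randn(n) * 0.1

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X, y)

print(np.isclose(ridge.coef_[0], ridge.coef_[1], rtol=0.3))  # weight shared
print(abs(lasso.coef_[0] - lasso.coef_[1]) > 0.05)           # weight concentrated
```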
Key Takeaways
Ridge (L2) Shrinks All
Adds squared coefficient penalty, shrinks all weights toward zero but never eliminates them entirely
Lasso (L1) Selects Features
Uses absolute value penalty, drives unimportant coefficients to exactly zero for automatic feature selection
Elastic Net Combines Both
Mixes L1 and L2 penalties, providing grouped selection and stability with correlated features
Alpha Controls Strength
Higher alpha means stronger regularization, use cross-validation to find the optimal value
Prevents Overfitting
Regularization constrains model complexity, reducing variance and improving generalization to new data
Scale Features First
Always standardize features before regularization since penalties treat all coefficients equally
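The last point is easy to demonstrate. In this sketch, two equally informative features live on very different scales; without standardization, the same alpha crushes the unit-scale coefficient while leaving the large-scale one almost untouched.

```python
import numpy as np
from sklearn.linear_model import Ridge

np.random.seed(42)
x_small = np.random.randn(200)          # unit scale
x_big = np.random.randn(200) * 1000     # 1000x larger scale
X = np.column_stack([x_small, x_big])
y = x_small + x_big / 1000 + np.random.randn(200) * 0.1  # equal contributions

lo = Ridge(alpha=1).fit(X, y).coef_
hi = Ridge(alpha=10000).fit(X, y).coef_

# The shrinkage factor is roughly S / (S + alpha), where S is the feature's
# sum of squares, so the large-scale feature barely feels the penalty
print(hi[0] / lo[0] < 0.5)   # → True: unit-scale coefficient heavily shrunk
print(hi[1] / lo[1] > 0.9)   # → True: large-scale coefficient barely moves
```

Standardizing first puts both features on the same footing, so the penalty treats them symmetrically.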
Knowledge Check
Test your understanding of regularization techniques:
What type of penalty does Ridge regression add to the cost function?
Which regularization technique can set coefficients to exactly zero?
What happens when you increase the alpha (regularization strength) parameter?
What is the main advantage of Elastic Net over Lasso when features are correlated?
Why is feature scaling important before applying regularization?
A model has high training accuracy but low test accuracy. What should you do?