Simple and Multiple Linear Regression
Linear regression models the relationship between features and a continuous target by fitting a linear equation to observed data using Ordinary Least Squares (OLS).
Understanding Linear Regression
Linear regression finds the best-fitting line (or hyperplane) through data points by minimizing the sum of squared differences between predicted and actual values. This method, called Ordinary Least Squares (OLS), provides the optimal coefficients for the linear equation.
Simple Linear Regression
One feature predicting the target:
y = b0 + b1 * x
- y - Predicted value (target)
- b0 - Intercept (y when x=0)
- b1 - Slope (change in y per unit x)
- x - Input feature
Multiple Linear Regression
Multiple features predicting the target:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
- b0 - Intercept term
- b1...bn - Coefficients for each feature
- x1...xn - Input features
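The multiple-regression equation is just a dot product plus an intercept. A quick sketch with made-up coefficient values (not from any fitted model) shows the arithmetic:

```python
import numpy as np

# Hypothetical learned parameters (illustrative values only)
b0 = 10.0                          # intercept
b = np.array([0.5, 2.0, -1.5])     # coefficients b1..b3
x = np.array([4.0, 3.0, 2.0])      # one sample's features x1..x3

# y = b0 + b1*x1 + b2*x2 + b3*x3, written as a dot product
y = b0 + np.dot(b, x)
print(y)  # 10 + 2 + 6 - 3 = 15.0
```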
Ordinary Least Squares (OLS)
OLS finds coefficients by minimizing the sum of squared residuals - the squared differences between actual values (y) and predicted values (y-hat).
- Residual: the vertical distance from a data point to the fitted line
- Squared error: squaring penalizes larger errors more heavily
- Minimization: find the line with the smallest total squared error
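For the single-feature case the minimization has a closed-form solution. This sketch (with illustrative data) computes the OLS slope and intercept by hand and checks them against scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Closed-form OLS: b1 = Sxy / Sxx, b0 = mean(y) - b1 * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# sklearn minimizes the same squared-error objective
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(f"manual:  b0={b0:.3f}, b1={b1:.3f}")
print(f"sklearn: b0={model.intercept_:.3f}, b1={model.coef_[0]:.3f}")
```

Both approaches give the same line, because OLS has a unique minimizer here.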
Simple Linear Regression
Let's build a simple model to predict exam scores based on study hours. This classic example demonstrates the core concepts of linear regression with a single input feature.
# ========================================
# STEP 1: Import required libraries
# ========================================
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
print("Libraries imported successfully!")
We'll use NumPy for numerical operations, scikit-learn's LinearRegression for the model, and train_test_split to divide our data. Now let's create our dataset:
# ========================================
# STEP 2: Create sample data
# ========================================
# Study hours (X) - must be 2D array for sklearn
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
# Exam scores (y) - the target we want to predict
scores = np.array([45, 55, 60, 70, 75, 82, 88, 92])
print("Dataset Information:")
print(f" Number of students: {len(hours)}")
print(f" Study hours range: {hours.min()} to {hours.max()} hours")
print(f" Score range: {scores.min()} to {scores.max()} points")
print("\nNotice: More study hours generally lead to higher scores!")
Our dataset has 8 students. Notice the pattern: as study hours increase, exam scores tend to increase too. This is a positive linear relationship. Now let's split our data:
# ========================================
# STEP 3: Split data into training and test sets
# ========================================
# Reserve 25% for testing (2 students)
X_train, X_test, y_train, y_test = train_test_split(
hours, scores,
test_size=0.25, # 25% for testing
random_state=42 # For reproducibility
)
print("Data Split:")
print(f" Training set: {len(X_train)} students")
print(f" Test set: {len(X_test)} students")
print(f"\nTraining data (hours, scores):")
for hour, score in zip(X_train.flatten(), y_train):
print(f" {hour} hours -> {score} points")
We split the data 75-25. The model will learn from 6 students (training set) and be tested on 2 unseen students (test set). This mimics real-world scenarios where we predict for new data. Let's train the model:
# ========================================
# STEP 4: Create and train the model
# ========================================
model = LinearRegression()
model.fit(X_train, y_train) # This is where learning happens!
print("Model trained successfully!")
print("\nLearned Parameters:")
print(f" Intercept (b0): {model.intercept_:.2f}")
print(f" Slope (b1): {model.coef_[0]:.2f}")
print(f"\nEquation: score = {model.intercept_:.2f} + {model.coef_[0]:.2f} * hours")
print(f"\nInterpretation:")
print(f" - Base score (0 hours): {model.intercept_:.2f} points")
print(f" - Each additional hour adds: {model.coef_[0]:.2f} points")
The model learned that a student who studies 0 hours would get around 38.5 points (intercept), and each additional hour of study adds about 6.8 points to the score (slope). This is the best-fit line through our data! Now let's evaluate:
# ========================================
# STEP 5: Make predictions and evaluate
# ========================================
# Predict on test set
predictions = model.predict(X_test)
print("Test Set Predictions:")
for hour, actual, pred in zip(X_test.flatten(), y_test, predictions):
diff = actual - pred
print(f" {hour} hours: Predicted={pred:.1f}, Actual={actual}, Error={diff:.1f}")
# Calculate performance metrics
r2 = r2_score(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"\nModel Performance:")
print(f" R² Score: {r2:.4f} (1.0 = perfect fit)")
print(f" RMSE: {rmse:.2f} points")
if r2 > 0.9:
print(" ✓ Excellent fit! Model captures the pattern well.")
elif r2 > 0.7:
print(" ✓ Good fit! Predictions are reliable.")
else:
print(" ⚠ Moderate fit. Consider more data or features.")
Our model shows strong performance with an R² near 1.0, meaning it explains most of the variance in exam scores. The small RMSE indicates predictions are close to actual values. Finally, let's predict for a new student:
# ========================================
# STEP 6: Predict for new data
# ========================================
# What if a student studies 5.5 hours?
new_student_hours = np.array([[5.5]])
predicted_score = model.predict(new_student_hours)[0]
print("\nPrediction for New Student:")
print(f" Study time: 5.5 hours")
print(f" Predicted score: {predicted_score:.1f} points")
print(f"\nCalculation:")
print(f" {model.intercept_:.2f} + {model.coef_[0]:.2f} * 5.5 = {predicted_score:.1f}")
# Try a few more predictions
print("\nAdditional Predictions:")
for hours_val in [2.5, 6.5, 9.0]:
pred = model.predict([[hours_val]])[0]
print(f" {hours_val} hours -> {pred:.1f} points")
The model predicts a student studying 5.5 hours would score around 75 points. We can predict for any study time using our learned equation. Notice how predictions for 9 hours might seem high - this is extrapolation (predicting outside the training range), which can be less reliable. The model assumes the linear relationship continues beyond the observed data.
Multiple Linear Regression
Now let's predict house prices using multiple features: size, bedrooms, and age. This demonstrates how linear regression handles multiple inputs simultaneously to make more accurate predictions.
# ========================================
# STEP 1: Import libraries and create dataset
# ========================================
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
# Create a realistic house price dataset
data = {
'size_sqft': [1200, 1500, 1800, 2100, 2400, 2700, 3000, 3300],
'bedrooms': [2, 2, 3, 3, 4, 4, 5, 5],
'age_years': [20, 15, 12, 8, 10, 5, 3, 1],
'price_1000s': [180, 220, 280, 340, 380, 450, 520, 600]
}
df = pd.DataFrame(data)
print("House Price Dataset:")
print(df.to_string(index=False))
print(f"\nDataset has {len(df)} houses with {len(df.columns)-1} features")
Our dataset contains 8 houses with 3 features each. Notice how price generally increases with size and bedrooms, but decreases with age. These are the patterns our model will learn. Let's prepare the data:
# ========================================
# STEP 2: Separate features (X) and target (y)
# ========================================
# Features: everything except price
X = df[['size_sqft', 'bedrooms', 'age_years']]
# Target: what we want to predict
y = df['price_1000s']
print("Features (X):")
print(f" Shape: {X.shape} (8 houses, 3 features)")
print(f" Columns: {X.columns.tolist()}")
print("\nTarget (y):")
print(f" Shape: {y.shape}")
print(f" Name: {y.name}")
print(f" Range: ${y.min()} - ${y.max()} (in thousands)")
We've separated our predictors (size, bedrooms, age) from what we're predicting (price). This X and y structure is standard in machine learning. Now let's split for training and testing:
# ========================================
# STEP 3: Split into training and test sets
# ========================================
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.25, # Reserve 25% for testing
random_state=42
)
print("Data Split:")
print(f" Training set: {len(X_train)} houses")
print(f" Test set: {len(X_test)} houses")
print("\nTraining samples:")
for idx in X_train.index:
print(f" {df.loc[idx, 'size_sqft']} sqft, {df.loc[idx, 'bedrooms']} bed, " +
f"{df.loc[idx, 'age_years']} yrs -> ${df.loc[idx, 'price_1000s']}k")
We'll train on 6 houses and test on 2 unseen houses. This 75-25 split is common for small datasets. With more data, you might use 80-20 or 90-10. Let's train the model:
# ========================================
# STEP 4: Train the multiple regression model
# ========================================
model = LinearRegression()
model.fit(X_train, y_train)
print("Model Training Complete!")
print("\nLearned Coefficients:")
print("-" * 60)
for feature, coef in zip(X.columns, model.coef_):
direction = "increases" if coef > 0 else "decreases"
print(f" {feature:12s}: {coef:>10.4f}")
print(f" -> Each unit {direction} price by ${abs(coef):.4f}k")
print(f"\n Intercept: {model.intercept_:>10.4f}")
print(f" -> Base price when all features are zero")
print("\n" + "="*60)
print("EQUATION:")
print(f"price = {model.intercept_:.2f} + {model.coef_[0]:.4f}*size + " +
f"{model.coef_[1]:.4f}*bedrooms + {model.coef_[2]:.4f}*age")
print("="*60)
The model learned three relationships: (1) Each additional square foot adds ~$0.15k to price, (2) Each bedroom adds ~$25k, (3) Each year of age reduces price by ~$2k. The negative coefficient for age makes intuitive sense - older houses are generally less valuable. Now let's evaluate:
# ========================================
# STEP 5: Make predictions and evaluate accuracy
# ========================================
# Predict on test set
y_pred = model.predict(X_test)
print("Test Set Predictions:")
print("-" * 70)
for idx, (actual, pred) in enumerate(zip(y_test, y_pred)):
error = actual - pred
error_pct = (error / actual) * 100
test_idx = X_test.index[idx]
print(f"House {test_idx}:")
print(f" Features: {X_test.loc[test_idx, 'size_sqft']} sqft, " +
f"{X_test.loc[test_idx, 'bedrooms']} bed, " +
f"{X_test.loc[test_idx, 'age_years']} yrs")
print(f" Predicted: ${pred:.1f}k")
print(f" Actual: ${actual:.1f}k")
print(f" Error: ${error:.1f}k ({error_pct:+.1f}%)")
print()
# Calculate performance metrics
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = np.mean(np.abs(y_test - y_pred))
print("="*70)
print("Model Performance Metrics:")
print(f" R² Score: {r2:.4f}")
print(f" -> Model explains {r2*100:.1f}% of price variance")
print(f" RMSE: ${rmse:.2f}k")
print(f" -> Average prediction error magnitude")
print(f" MAE: ${mae:.2f}k")
print(f" -> Mean absolute error")
Our model shows excellent performance with R² close to 1.0, indicating it captures the relationships between features and price very well. The low RMSE confirms predictions are accurate. Let's make predictions for new houses:
# ========================================
# STEP 6: Predict prices for new houses
# ========================================
# Create some hypothetical houses
new_houses = pd.DataFrame({
'size_sqft': [2000, 2500, 1800],
'bedrooms': [3, 4, 2],
'age_years': [5, 2, 15]
})
print("Predictions for New Houses:")
print("="*70)
predictions = model.predict(new_houses)
for idx, (pred, row) in enumerate(zip(predictions, new_houses.itertuples()), 1):
print(f"\nHouse {idx}:")
print(f" Size: {row.size_sqft} sqft")
print(f" Bedrooms: {row.bedrooms}")
print(f" Age: {row.age_years} years")
print(f" Predicted Price: ${pred:.1f}k (${pred*1000:,.0f})")
# Break down the prediction
print(f"\n Price breakdown:")
print(f" Base: ${model.intercept_:.2f}k")
print(f" Size contribution: ${model.coef_[0]*row.size_sqft:.2f}k")
print(f" Bedroom contribution: ${model.coef_[1]*row.bedrooms:.2f}k")
print(f" Age impact: ${model.coef_[2]*row.age_years:.2f}k")
print(f" Total: ${pred:.1f}k")
print("\n" + "="*70)
print("Key Insight: The model combines all features to make holistic predictions!")
For each new house, the model combines all three features to make a prediction. Notice how a larger, newer house with more bedrooms gets a higher price prediction. The beauty of multiple regression is that it considers all features simultaneously, capturing their combined effect on price. Each coefficient represents the independent contribution of that feature, holding all other features constant.
Key Assumptions
| Assumption | Description | Violation Effect |
|---|---|---|
| Linearity | Relationship between X and y is linear | Poor predictions, try polynomial |
| Independence | Observations are independent | Unreliable standard errors |
| Homoscedasticity | Constant variance of residuals | Inefficient estimates |
| Normality | Residuals are normally distributed | Invalid confidence intervals |
| No Multicollinearity | Features are not highly correlated | Unstable coefficients |
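After fitting, a few of these assumptions can be eyeballed from the residuals. A minimal sketch on synthetic, truly linear data, checking that the residuals center on zero and show no trend against the fitted values (a strong pattern there would suggest non-linearity or heteroscedasticity):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 5 + rng.normal(0, 2, size=100)  # linear + noise

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# With an intercept, OLS residuals always sum to (near) zero
print(f"Mean residual: {residuals.mean():.2e}")

# Residuals should be uncorrelated with fitted values
corr = np.corrcoef(fitted, residuals)[0, 1]
print(f"Residual-vs-fitted correlation: {corr:.4f}")
```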
Practice Questions: Linear Regression
Test your understanding of linear regression concepts.
Question: How do you interpret a coefficient of 50 for the size_sqft feature?
Show Solution
# Example: Interpreting coefficients
from sklearn.linear_model import LinearRegression
import numpy as np
# Train a simple model
X = np.array([[1000], [1500], [2000], [2500]]) # size in sqft
y = np.array([200000, 250000, 300000, 350000]) # price in dollars
model = LinearRegression()
model.fit(X, y)
print(f"Coefficient for size_sqft: {model.coef_[0]:.2f}")
print(f"Interpretation: Each additional sqft increases price by ${model.coef_[0]:.2f}")
print(f"\nPrediction for 1800 sqft: ${model.predict([[1800]])[0]:,.2f}")
Question: What does a negative coefficient indicate?
Show Solution
# Example: Negative coefficient
import numpy as np
from sklearn.linear_model import LinearRegression
# House age vs price (older houses = lower price)
X = np.array([[0], [5], [10], [15], [20]]) # age in years
y = np.array([300000, 275000, 250000, 225000, 200000]) # price
model = LinearRegression()
model.fit(X, y)
print(f"Coefficient for age: {model.coef_[0]:,.2f}")
print(f"Interpretation: Each additional year decreases price by ${abs(model.coef_[0]):,.2f}")
print(f"\nNew house (age=0): ${model.predict([[0]])[0]:,.2f}")
print(f"Old house (age=25): ${model.predict([[25]])[0]:,.2f}")
Question: Why does OLS square the errors instead of using absolute values?
Show Solution
# Demonstration: Squared vs Absolute Errors
import numpy as np
# Sample predictions and actuals
y_true = np.array([10, 20, 30, 40])
y_pred = np.array([12, 19, 35, 38]) # Small and large errors
# Calculate errors
errors = y_true - y_pred
absolute_errors = np.abs(errors)
squared_errors = errors ** 2
print("Error Analysis:")
for i in range(len(errors)):
print(f" Point {i+1}: error={errors[i]:2.0f}, |error|={absolute_errors[i]:2.0f}, error²={squared_errors[i]:4.0f}")
print(f"\nTotal Absolute Error: {np.sum(absolute_errors)}")
print(f"Total Squared Error: {np.sum(squared_errors)}")
print(f"\nNotice: Large errors (±5) contribute much more to squared error!")
Question: What is multicollinearity and how does it affect the model?
Show Solution
# Detecting multicollinearity
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# Create correlated features (sqft and rooms are highly correlated)
np.random.seed(42)
sqft = np.random.randint(1000, 3000, 100)
rooms = sqft / 200 + np.random.randn(100) * 2 # Highly correlated!
price = sqft * 150 + rooms * 5000 + np.random.randn(100) * 10000
X = pd.DataFrame({'sqft': sqft, 'rooms': rooms})
y = price
# Check correlation
print("Correlation Matrix:")
print(X.corr())
# Train model
model = LinearRegression().fit(X, y)
print(f"\nCoefficients:")
print(f" sqft: {model.coef_[0]:.2f}")
print(f" rooms: {model.coef_[1]:.2f}")
print("\nHigh correlation (>0.8) indicates multicollinearity!")
Polynomial Regression
When data shows a curved pattern, polynomial regression captures non-linear relationships by adding polynomial terms to the feature set while remaining linear in its coefficients.
Understanding Polynomial Regression
When data shows a curved pattern, a straight line cannot capture the relationship well. Polynomial regression extends linear regression by adding polynomial terms (x^2, x^3, etc.) as new features, allowing the model to fit curves while remaining linear in its coefficients.
- Degree 1 (Linear): straight line, y = b0 + b1*x
- Degree 2 (Quadratic): parabola, y = b0 + b1*x + b2*x^2
- Degree 3+ (Cubic and higher): complex curves with inflection points
Why is it Still Called Linear?
Polynomial regression is linear in its coefficients (b0, b1, b2), not in its features. The polynomial terms (x^2, x^3) are treated as new features. We are still finding a linear combination of features - just with transformed features.
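This can be demonstrated directly: hand-building the x² column and fitting an ordinary LinearRegression recovers a quadratic exactly, because the model is still just a linear combination of its (transformed) features. A small sketch with noise-free data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1 + 2 * x + 3 * x**2          # exact quadratic, no noise

# Treat x and x^2 as two separate features of an ordinary linear model
X = np.column_stack([x, x**2])
model = LinearRegression().fit(X, y)

# The "linear" model recovers the quadratic's coefficients exactly
print(f"b0={model.intercept_:.2f}, b1={model.coef_[0]:.2f}, b2={model.coef_[1]:.2f}")
```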
Creating Polynomial Features
Polynomial features transform a single feature x into multiple features (x, x², x³, etc.). This allows linear regression to fit curved patterns. Let's see how this transformation works:
# ========================================
# STEP 1: Import and create original data
# ========================================
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
# Start with simple numbers
X = np.array([[2], [3], [4], [5]])
print("Original Feature Values:")
print(X)
print(f"Shape: {X.shape} (4 samples, 1 feature)")
print(f"\nThink of this as: [2, 3, 4, 5]")
We start with just one feature: the values 2, 3, 4, and 5. Right now, linear regression could only draw a straight line through these points. Let's add polynomial terms:
# ========================================
# STEP 2: Create degree-2 polynomial features
# ========================================
poly = PolynomialFeatures(
degree=2, # Maximum power (x, x^2)
include_bias=False # Don't add column of 1s (intercept handled by model)
)
# Transform the data
X_poly = poly.fit_transform(X)
print("Polynomial Features (degree=2):")
print(X_poly)
print(f"Shape: {X_poly.shape} (4 samples, 2 features)")
print("\nBreakdown for each sample:")
for i, (original, poly_row) in enumerate(zip(X, X_poly)):
print(f" x={original[0]:.0f} -> [x={poly_row[0]:.0f}, x²={poly_row[1]:.0f}]")
print("\nWhat happened:")
print(" Original: [2, 3, 4, 5]")
print(" Now: [[2, 4], [3, 9], [4, 16], [5, 25]]")
print(" Each row: [x, x²]")
The transformer created a new feature x² for each value! Now x=2 becomes [2, 4], meaning we have both the original value (2) and its square (4). This lets the model learn curved relationships. Let's try degree 3:
# ========================================
# STEP 3: Try degree-3 polynomial features
# ========================================
poly3 = PolynomialFeatures(degree=3, include_bias=False)
X_poly3 = poly3.fit_transform(X)
print("Polynomial Features (degree=3):")
print(X_poly3)
print(f"Shape: {X_poly3.shape} (4 samples, 3 features)")
print("\nBreakdown for each sample:")
for i, (original, poly_row) in enumerate(zip(X, X_poly3)):
print(f" x={original[0]:.0f} -> [x={poly_row[0]:.0f}, x²={poly_row[1]:.0f}, x³={poly_row[2]:.0f}]")
print("\nPattern:")
print(" Degree 1: [x]")
print(" Degree 2: [x, x²]")
print(" Degree 3: [x, x², x³]")
print(" Degree n: [x, x², x³, ..., x^n]")
With degree 3, we now have three features per sample: the original x, x², and x³. For x=2, we get [2, 4, 8]. Higher degrees allow more complex curves, but be careful - too high and you'll overfit! With multiple original features, PolynomialFeatures also creates interaction terms:
# ========================================
# STEP 4: Multiple features and interactions
# ========================================
# Two original features
X_multi = np.array([[1, 2], [3, 4], [5, 6]])
print("Original features (x1, x2):")
print(X_multi)
print(f"Shape: {X_multi.shape}")
# Apply degree-2 transformation
poly_multi = PolynomialFeatures(degree=2, include_bias=False)
X_multi_poly = poly_multi.fit_transform(X_multi)
print("\nPolynomial features (degree=2):")
print(X_multi_poly)
print(f"Shape: {X_multi_poly.shape}")
print("\nFeature names:")
print(poly_multi.get_feature_names_out(['x1', 'x2']))
print("\nBreakdown:")
print(" [x1, x2] becomes [x1, x2, x1², x1*x2, x2²]")
print(" Original: 2 features")
print(" After degree-2: 5 features (includes x1*x2 interaction!)")
print("\n" + "="*60)
print("KEY INSIGHT: With n features and degree d,")
print("the number of output features = C(n+d, d) - 1")
print("For 2 features, degree 2: 5 features")
print("For 10 features, degree 2: 65 features!")
print("Feature explosion can happen quickly!")
With multiple features, PolynomialFeatures creates not just powers (x1², x2²) but also interactions (x1*x2). This can lead to feature explosion - 10 features with degree 2 becomes 65 features! This is why feature selection and regularization become crucial with polynomial regression.
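The feature count from the formula above is easy to tabulate with `math.comb`; a small helper sketch (the function name is ours):

```python
from math import comb

def n_poly_features(n_features: int, degree: int) -> int:
    """Output feature count of PolynomialFeatures with include_bias=False:
    C(n + d, d) - 1."""
    return comb(n_features + degree, degree) - 1

for n in [2, 10, 100]:
    for d in [2, 3]:
        print(f"{n:3d} features, degree {d}: {n_poly_features(n, d):7d} features")
```

For example, 2 features at degree 2 give 5 outputs, while 10 features at degree 3 already give 285.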
Polynomial Regression Pipeline
Let's build a complete polynomial regression pipeline that generates non-linear data, tests multiple polynomial degrees, and compares their performance. This shows how to find the optimal complexity for your data:
# ========================================
# STEP 1: Import libraries and generate curved data
# ========================================
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Set random seed for reproducibility
np.random.seed(42)
# Generate non-linear data: y = 2x² - 3x + 5 + noise
X = np.linspace(0, 10, 100).reshape(-1, 1) # 100 points from 0 to 10
y = 2*X.flatten()**2 - 3*X.flatten() + 5 + np.random.randn(100)*10
print("Dataset Information:")
print(f" Samples: {len(X)}")
print(f" X range: [{X.min():.1f}, {X.max():.1f}]")
print(f" y range: [{y.min():.1f}, {y.max():.1f}]")
print(f" True relationship: y = 2x² - 3x + 5 + noise")
print(f" This is a QUADRATIC (degree 2) pattern!")
We've created data following a parabola (y = 2x² - 3x + 5) with added noise. Since the true relationship is quadratic, we expect degree 2 to perform best. Let's split the data:
# ========================================
# STEP 2: Split data for training and testing
# ========================================
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% for testing
random_state=42
)
print("Data Split:")
print(f" Training: {len(X_train)} samples (80%)")
print(f" Testing: {len(X_test)} samples (20%)")
print(f"\nWhy split? We need to test on unseen data to detect overfitting!")
The 80-20 split lets us train on most data while reserving some for unbiased evaluation. Now let's test different polynomial degrees:
# ========================================
# STEP 3: Test multiple polynomial degrees
# ========================================
print("\nTesting Different Polynomial Degrees:")
print("="*70)
degrees = [1, 2, 3, 5, 10] # From simple to very complex
results = []
for degree in degrees:
# Create pipeline: polynomial transformation -> linear regression
pipeline = Pipeline([
('poly', PolynomialFeatures(degree=degree, include_bias=False)),
('regressor', LinearRegression())
])
# Train the pipeline
pipeline.fit(X_train, y_train)
# Predict on both sets
y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)
# Calculate metrics
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
results.append({
'degree': degree,
'train_rmse': train_rmse,
'test_rmse': test_rmse,
'train_r2': train_r2,
'test_r2': test_r2
})
print(f"\nDegree {degree}:")
print(f" Train RMSE: {train_rmse:6.2f} | Test RMSE: {test_rmse:6.2f}")
print(f" Train R²: {train_r2:6.4f} | Test R²: {test_r2:6.4f}")
Notice the pattern in the results! For degree 1 (linear), both train and test performance are poor - it's underfitting. For degree 2, performance improves dramatically because it matches the true quadratic relationship. For very high degrees (10+), train RMSE drops but test RMSE increases - that's overfitting! Let's analyze the results:
# ========================================
# STEP 4: Analyze and interpret results
# ========================================
print("\n" + "="*70)
print("ANALYSIS:")
print("="*70)
# Find best degree by test RMSE
best_idx = min(range(len(results)), key=lambda i: results[i]['test_rmse'])
best = results[best_idx]
print(f"\nBest degree: {best['degree']}")
print(f" Test RMSE: {best['test_rmse']:.2f}")
print(f" Test R²: {best['test_r2']:.4f}")
# Check for overfitting (large gap between train and test)
print("\nOverfitting Check:")
for res in results:
gap = res['train_rmse'] - res['test_rmse']
gap_pct = abs(gap) / res['train_rmse'] * 100
if gap < -5: # Test error much higher than train
status = " OVERFITTING"
elif gap_pct < 10:
status = " Good fit"
else:
status = " Acceptable"
print(f" Degree {res['degree']:2d}: Gap = {gap:6.2f} {status}")
print("\n" + "="*70)
print("KEY INSIGHTS:")
print("="*70)
print("1. Degree 1 (linear): Underfits - can't capture the curve")
print("2. Degree 2 (quadratic): BEST - matches true relationship!")
print("3. Degree 3-5: Slight overfitting, but still reasonable")
print("4. Degree 10+: Severe overfitting - train perfect, test poor")
print("\nRule of thumb: Choose lowest degree with good test performance!")
The analysis reveals the bias-variance tradeoff in action! Degree 1 has high bias (underfitting), degree 2 balances bias and variance perfectly, and degree 10 has high variance (overfitting). Let's visualize this:
# ========================================
# STEP 5: Visualize the fits (optional)
# ========================================
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 5))
# Create smooth line for predictions
X_line = np.linspace(0, 10, 300).reshape(-1, 1)
for i, degree in enumerate([1, 2, 10], 1):
plt.subplot(1, 3, i)
# Train model
pipeline = Pipeline([
('poly', PolynomialFeatures(degree=degree)),
('regressor', LinearRegression())
])
pipeline.fit(X_train, y_train)
y_line = pipeline.predict(X_line)
# Plot
plt.scatter(X_train, y_train, alpha=0.5, label='Training data', s=30)
plt.scatter(X_test, y_test, alpha=0.5, label='Test data', s=30, color='orange')
plt.plot(X_line, y_line, 'r-', linewidth=2, label=f'Degree {degree} fit')
plt.xlabel('X')
plt.ylabel('y')
plt.title(f'Polynomial Degree {degree}')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('polynomial_comparison.png', dpi=100, bbox_inches='tight')
print("\nVisualization saved as 'polynomial_comparison.png'")
print("Notice: Degree 1 = straight line, Degree 2 = smooth curve, Degree 10 = wiggly!")
The visualization clearly shows the problem: degree 1 misses the curve entirely, degree 2 captures it smoothly, and degree 10 wiggles excessively trying to hit every training point. This visual confirmation helps explain why degree 2 achieves the best test performance - it matches the complexity of the true underlying relationship without overfitting to noise.
Choosing the Right Degree
Underfitting (degree too low):
- Model too simple for the data
- Cannot capture the curved pattern
- Low train and test scores
- High bias, low variance
Overfitting (degree too high):
- Model too complex for the data
- Fits noise instead of the pattern
- High train score, low test score
- Low bias, high variance
Practice Questions: Polynomial Regression
Test your understanding of polynomial regression and feature engineering.
Question: How many features does PolynomialFeatures(degree=3) create from one input?
Show Solution
# Counting polynomial features
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X = np.array([[2], [3], [4]])
# Without bias
poly_no_bias = PolynomialFeatures(degree=3, include_bias=False)
X_poly_no_bias = poly_no_bias.fit_transform(X)
print("Without bias:")
print(f" Original shape: {X.shape}")
print(f" Polynomial shape: {X_poly_no_bias.shape}")
print(f" Feature names: {poly_no_bias.get_feature_names_out(['x'])}")
# With bias (default)
poly_with_bias = PolynomialFeatures(degree=3, include_bias=True)
X_poly_with_bias = poly_with_bias.fit_transform(X)
print(f"\nWith bias:")
print(f" Polynomial shape: {X_poly_with_bias.shape}")
print(f" Feature names: {poly_with_bias.get_feature_names_out(['x'])}")
Question: Why is using a very high polynomial degree risky?
Show Solution
# Demonstrating overfitting with high degree
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
np.random.seed(42)
X = np.linspace(0, 10, 20).reshape(-1, 1)
y = 2 + 3*X.ravel() + np.random.randn(20) * 2 # Linear with noise
for degree in [2, 5, 15]:
poly = PolynomialFeatures(degree=degree)
X_poly = poly.fit_transform(X)
model = LinearRegression().fit(X_poly, y)
y_pred = model.predict(X_poly)
print(f"Degree {degree:2d}:")
print(f" R²: {r2_score(y, y_pred):.4f}")
print(f" Max |coefficient|: {np.max(np.abs(model.coef_)):.2e}")
print(f" Features created: {X_poly.shape[1]}")
print("\nNotice: Degree 15 fits the training data almost perfectly but with huge coefficients!")
Question: What happens to feature count with multiple inputs and high degree?
Show Solution
# Feature explosion demonstration
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
print("Feature Explosion Analysis:")
print("-" * 50)
for n_features in [2, 5, 10]:
X = np.random.rand(10, n_features)
for degree in [2, 3]:
poly = PolynomialFeatures(degree=degree, include_bias=False)
X_poly = poly.fit_transform(X)
print(f"\nOriginal: {n_features} features, Degree: {degree}")
print(f" Result: {X_poly.shape[1]} features")
print(f" Growth: {X_poly.shape[1]/n_features:.1f}x increase")
print("\n" + "="*50)
print("With 10 features and degree 3: 285 features!")
print("This makes the model very complex and prone to overfitting.")
Question: When should you consider polynomial regression?
Show Solution
# Checking if polynomial regression is needed
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
# Create curved data (quadratic relationship)
np.random.seed(42)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 + 3*X.ravel() - 0.5*X.ravel()**2 + np.random.randn(50) * 2
# Try linear
linear_model = LinearRegression().fit(X, y)
y_pred_linear = linear_model.predict(X)
linear_r2 = r2_score(y, y_pred_linear)
# Try polynomial (degree 2)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
poly_model = LinearRegression().fit(X_poly, y)
y_pred_poly = poly_model.predict(X_poly)
poly_r2 = r2_score(y, y_pred_poly)
print("Model Comparison:")
print(f" Linear R²: {linear_r2:.4f}")
print(f" Polynomial R²: {poly_r2:.4f}")
print(f"\nImprovement: {(poly_r2 - linear_r2)*100:.1f}%")
print("\nConclusion: Use polynomial when linear shows poor fit!")
Ridge and Lasso Regularization
Regularization adds penalty terms to the loss function, discouraging large coefficients to prevent overfitting and improve generalization to new data.
Understanding Regularization
Regularization adds a penalty term to the loss function that discourages large coefficient values. This constrains the model, preventing it from fitting noise in the training data and improving generalization to new data.
The alpha hyperparameter controls the strength of regularization:
alpha = 0
No regularization (standard OLS)
Optimal alpha
Balance between fit and simplicity
Large alpha
Heavy penalty, simple model
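A small sketch makes this spectrum concrete (the data and alpha values are invented for illustration): fit Ridge at increasing alphas and watch the largest coefficient shrink.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with known coefficients
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
true_coefs = np.array([3.0, -2.0, 1.5, 0.0, 0.5])
y = X @ true_coefs + rng.standard_normal(100) * 0.1

max_coefs = []
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    max_coefs.append(np.max(np.abs(model.coef_)))
    print(f"alpha={alpha:>8}: max |coef| = {max_coefs[-1]:.4f}")

# As alpha grows, coefficients are pulled toward zero;
# a tiny alpha behaves almost like plain OLS.
```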
Ridge Regression (L2)
Penalty: Sum of squared coefficients
Penalty = alpha * Sum(bi^2)
Characteristics:
- Shrinks coefficients toward zero
- Never sets coefficients to exactly zero
- Keeps all features in the model
- Good when most features are useful
- Handles multicollinearity well
Lasso Regression (L1)
Penalty: Sum of absolute coefficients
Penalty = alpha * Sum(|bi|)
Characteristics:
- Shrinks coefficients toward zero
- Can set coefficients to exactly zero
- Performs automatic feature selection
- Good when many features are irrelevant
- Creates sparse, interpretable models
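The two penalty formulas above can be computed directly. For a made-up coefficient vector, this shows how each one measures coefficient size:

```python
import numpy as np

b = np.array([2.0, -0.5, 0.0, 1.5])  # toy coefficient vector
alpha = 0.1

l2_penalty = alpha * np.sum(b ** 2)     # Ridge: alpha * Sum(bi^2)
l1_penalty = alpha * np.sum(np.abs(b))  # Lasso: alpha * Sum(|bi|)

print(f"Ridge (L2) penalty: {l2_penalty:.3f}")  # 0.1 * 6.5 = 0.650
print(f"Lasso (L1) penalty: {l1_penalty:.3f}")  # 0.1 * 4.0 = 0.400
```

Note how squaring makes the L2 penalty weigh the large coefficient (2.0) disproportionately, while the L1 penalty grows only linearly — that linear growth is what lets Lasso push small coefficients all the way to zero.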
Always Scale Features First!
Regularization penalizes coefficient magnitude. Without scaling, features on different scales get penalized unfairly - large-scale features have smaller coefficients and get penalized less. Always use StandardScaler or MinMaxScaler before applying regularization.
Ridge Regression Implementation
Ridge regression (L2 regularization) shrinks coefficients by penalizing their squared values. This prevents overfitting by keeping coefficients small, making the model more stable. Let's see how it works:
# ========================================
# STEP 1: Import libraries and generate data
# ========================================
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
import numpy as np
# Generate regression data with multiple features
X, y = make_regression(
n_samples=200, # 200 data points
n_features=20, # 20 features (many for the sample size!)
noise=10, # Add some noise
random_state=42
)
print("Dataset Information:")
print(f" Samples: {X.shape[0]}")
print(f" Features: {X.shape[1]}")
print(f" Target range: [{y.min():.1f}, {y.max():.1f}]")
print(f"\nWith 20 features and 200 samples, overfitting is a real risk!")
We have 20 features for only 200 samples - a relatively high feature-to-sample ratio where regularization becomes important. Without it, the model might fit noise. Let's split and scale:
# ========================================
# STEP 2: Split data and scale features
# ========================================
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% for testing
random_state=42
)
print("Data Split:")
print(f" Training: {len(X_train)} samples")
print(f" Testing: {len(X_test)} samples")
# CRITICAL: Scale features before regularization!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("\nFeature Scaling:")
print(f" Original X_train mean: {X_train.mean(axis=0)[:3]}")
print(f" Scaled X_train mean: {X_train_scaled.mean(axis=0)[:3]}")
print(f" Scaled X_train std: {X_train_scaled.std(axis=0)[:3]}")
print("\nAfter scaling: mean ≈ 0, std ≈ 1 for all features")
print("This ensures fair penalty across all features!")
StandardScaler transforms each feature to have mean=0 and std=1. This is crucial for regularization! Without scaling, features with large values would have artificially small coefficients, getting penalized less. Now let's try Ridge with a specific alpha:
# ========================================
# STEP 3: Train Ridge with specific alpha
# ========================================
# Try alpha = 1.0
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
# Evaluate on test set
train_score = ridge.score(X_train_scaled, y_train)
test_score = ridge.score(X_test_scaled, y_test)
print("Ridge Regression (alpha=1.0):")
print(f" Training R²: {train_score:.4f}")
print(f" Test R²: {test_score:.4f}")
print(f" Gap: {train_score - test_score:.4f}")
if train_score - test_score < 0.05:
print(" Small gap = good generalization!")
else:
print(" Large gap = possible overfitting")
# Examine coefficients
print(f"\nCoefficient Statistics:")
print(f" Max |coefficient|: {np.max(np.abs(ridge.coef_)):.4f}")
print(f" Mean |coefficient|: {np.mean(np.abs(ridge.coef_)):.4f}")
print(f" Coefficients near zero (<0.1): {np.sum(np.abs(ridge.coef_) < 0.1)}")
print(f" All 20 coefficients are non-zero (Ridge never zeros out!)")
With alpha=1.0, Ridge shrinks coefficients but keeps all 20 features in the model. The penalty term (alpha * sum of squared coefficients) keeps them from getting too large. But is alpha=1.0 optimal? Let's use cross-validation to find the best value:
# ========================================
# STEP 4: Find optimal alpha with RidgeCV
# ========================================
# Test a range of alpha values
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
print("Testing different alpha values:")
print("-" * 60)
# RidgeCV performs cross-validation to find best alpha
ridge_cv = RidgeCV(
alphas=alphas,
cv=5, # 5-fold cross-validation
scoring='r2' # Optimize for R² score
)
ridge_cv.fit(X_train_scaled, y_train)
print(f"\nBest alpha found: {ridge_cv.alpha_}")
print(f"\nThis was selected from {len(alphas)} candidates using 5-fold CV")
print(f"Each alpha was tested on 5 different train/validation splits!")
# Evaluate the best model
train_score_cv = ridge_cv.score(X_train_scaled, y_train)
test_score_cv = ridge_cv.score(X_test_scaled, y_test)
print(f"\nBest Ridge Model Performance:")
print(f" Training R²: {train_score_cv:.4f}")
print(f" Test R²: {test_score_cv:.4f}")
print(f" Gap: {train_score_cv - test_score_cv:.4f}")
RidgeCV automatically tested all 7 alpha values using 5-fold cross-validation and selected the one with best validation performance. This systematic approach is better than manually guessing alpha. Let's compare with an unregularized model:
# ========================================
# STEP 5: Compare with unregularized model
# ========================================
from sklearn.linear_model import LinearRegression
# Train standard linear regression (no regularization)
linreg = LinearRegression()
linreg.fit(X_train_scaled, y_train)
linreg_train = linreg.score(X_train_scaled, y_train)
linreg_test = linreg.score(X_test_scaled, y_test)
print("\n" + "="*70)
print("COMPARISON: Ridge vs Standard Linear Regression")
print("="*70)
print("\nStandard Linear Regression:")
print(f" Training R²: {linreg_train:.4f}")
print(f" Test R²: {linreg_test:.4f}")
print(f" Gap: {linreg_train - linreg_test:.4f}")
print(f" Max |coefficient|: {np.max(np.abs(linreg.coef_)):.4f}")
print(f"\nRidge (alpha={ridge_cv.alpha_}):")
print(f" Training R²: {train_score_cv:.4f}")
print(f" Test R²: {test_score_cv:.4f}")
print(f" Gap: {train_score_cv - test_score_cv:.4f}")
print(f" Max |coefficient|: {np.max(np.abs(ridge_cv.coef_)):.4f}")
improvement = test_score_cv - linreg_test
coef_reduction = (1 - np.max(np.abs(ridge_cv.coef_)) / np.max(np.abs(linreg.coef_))) * 100
print(f"\n" + "="*70)
print("RIDGE BENEFITS:")
print(f" Test performance improvement: {improvement:+.4f}")
print(f" Coefficient shrinkage: {coef_reduction:.1f}%")
print(f" Smaller gap between train/test (better generalization!)")
print("="*70)
Ridge regularization typically shows smaller coefficients and better generalization (smaller train-test gap) compared to standard linear regression. The L2 penalty keeps coefficients manageable, preventing the model from relying too heavily on any single feature. This is especially valuable when features are correlated or when sample size is limited relative to feature count.
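The point about correlated features is worth seeing directly. In this sketch (data invented), two nearly identical features make plain OLS coefficients unstable, while Ridge splits the weight between them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 50
f1 = rng.standard_normal(n)
f2 = f1 + rng.standard_normal(n) * 0.01  # almost an exact copy of f1
X = np.column_stack([f1, f2])
y = f1 + f2 + rng.standard_normal(n) * 0.1  # true weights: +1 and +1

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS can only pin down the SUM of the two coefficients;
# the individual split swings with the noise.
print("OLS coefficients:  ", ols.coef_.round(3))
print("Ridge coefficients:", ridge.coef_.round(3))
```

The sum of the OLS coefficients still lands near 2, but how that sum is divided between the two features is essentially arbitrary; Ridge's penalty prefers the balanced solution, giving each feature a coefficient near 1.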
Lasso Regression for Feature Selection
Lasso (L1 regularization) can set coefficients to exactly zero, effectively removing features from the model. This makes it perfect for feature selection! Let's create data where only some features matter:
# ========================================
# STEP 1: Create data with relevant and irrelevant features
# ========================================
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
# Set seed for reproducibility
np.random.seed(42)
n_samples = 200
# Create 5 RELEVANT features
X_relevant = np.random.randn(n_samples, 5)
# Create 15 IRRELEVANT noise features
X_noise = np.random.randn(n_samples, 15)
# Combine: 20 features total (only first 5 matter)
X = np.hstack([X_relevant, X_noise])
print("Dataset Design:")
print(f" Total features: 20")
print(f" Relevant features: 5 (indices 0-4)")
print(f" Noise features: 15 (indices 5-19)")
# Target depends ONLY on the first 5 features
y = (3*X[:, 0]         # Feature 0: +3.0
     + 2*X[:, 1]       # Feature 1: +2.0
     - 1.5*X[:, 2]     # Feature 2: -1.5 (negative relationship!)
     + 0.5*X[:, 3]     # Feature 3: +0.5 (small effect)
     + 1*X[:, 4]       # Feature 4: +1.0
     + np.random.randn(n_samples)*0.5)  # Add small noise
print("\nTrue coefficients:")
print(" Feature 0: +3.0")
print(" Feature 1: +2.0")
print(" Feature 2: -1.5")
print(" Feature 3: +0.5")
print(" Feature 4: +1.0")
print(" Features 5-19: 0.0 (should have NO effect!)")
We've deliberately created data where features 5-19 are pure noise with no relationship to y. A good feature selection method should identify and eliminate these. Let's prepare the data:
# ========================================
# STEP 2: Split and scale the data
# ========================================
# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42
)
print(f"\nData Split:")
print(f" Training: {len(X_train)} samples")
print(f" Testing: {len(X_test)} samples")
# Scale features (essential for regularization!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"\nFeatures scaled to mean=0, std=1")
print(f"This ensures fair L1 penalty across all features")
With scaling complete, all features are on equal footing. Now Lasso can fairly decide which features to keep based on their predictive value, not their scale. Let's train with cross-validation:
# ========================================
# STEP 3: Train Lasso with cross-validation
# ========================================
# LassoCV automatically finds best alpha
lasso_cv = LassoCV(
cv=5, # 5-fold cross-validation
random_state=42,
max_iter=10000 # Ensure convergence
)
print("Training Lasso with 5-fold cross-validation...")
lasso_cv.fit(X_train_scaled, y_train)
print(f"\nOptimal alpha: {lasso_cv.alpha_:.6f}")
# Evaluate performance
train_score = lasso_cv.score(X_train_scaled, y_train)
test_score = lasso_cv.score(X_test_scaled, y_test)
print(f"\nModel Performance:")
print(f" Training R²: {train_score:.4f}")
print(f" Test R²: {test_score:.4f}")
print(f" Gap: {train_score - test_score:.4f} (smaller is better!)")
LassoCV found the optimal alpha value that balances fit and sparsity. Now comes the exciting part - let's see which features it selected:
# ========================================
# STEP 4: Examine feature selection results
# ========================================
print("\n" + "="*70)
print("FEATURE SELECTION ANALYSIS")
print("="*70)
# Count non-zero coefficients
n_selected = np.sum(lasso_cv.coef_ != 0)
n_zeroed = np.sum(lasso_cv.coef_ == 0)
print(f"\nFeatures selected: {n_selected} out of 20")
print(f"Features eliminated: {n_zeroed}")
# Find which features were selected
selected_indices = np.where(lasso_cv.coef_ != 0)[0]
zeroed_indices = np.where(lasso_cv.coef_ == 0)[0]
print(f"\nSelected feature indices: {selected_indices.tolist()}")
print(f"Eliminated feature indices: {zeroed_indices.tolist()[:10]}..."
if len(zeroed_indices) > 10 else f"Eliminated feature indices: {zeroed_indices.tolist()}")
print("\nCoefficients for SELECTED features:")
print("-" * 70)
for idx in selected_indices:
coef = lasso_cv.coef_[idx]
feature_type = "RELEVANT" if idx < 5 else "NOISE"
print(f" Feature {idx:2d}: {coef:>8.4f} [{feature_type}]")
if len(selected_indices) > 0:
n_correct = np.sum(selected_indices < 5) # Should be indices 0-4
n_incorrect = len(selected_indices) - n_correct
print(f"\nSelection Accuracy:")
print(f" Correctly selected relevant features: {n_correct} / 5")
print(f" Incorrectly selected noise features: {n_incorrect}")
Lasso successfully identified the relevant features! Notice how it set coefficients for features 5-19 (the noise) to exactly zero, removing them from the model. This automatic feature selection is one of Lasso's most powerful capabilities. Let's compare with Ridge:
# ========================================
# STEP 5: Compare with Ridge Regression
# ========================================
from sklearn.linear_model import RidgeCV
print("\n" + "="*70)
print("LASSO vs RIDGE COMPARISON")
print("="*70)
# Train Ridge for comparison
ridge_cv = RidgeCV(cv=5)
ridge_cv.fit(X_train_scaled, y_train)
ridge_train = ridge_cv.score(X_train_scaled, y_train)
ridge_test = ridge_cv.score(X_test_scaled, y_test)
ridge_zeros = np.sum(np.abs(ridge_cv.coef_) < 1e-10) # Effectively zero
print("\nLasso (L1 Regularization):")
print(f" Test R²: {test_score:.4f}")
print(f" Features with zero coefficients: {n_zeroed}")
print(f" Model complexity: {n_selected} features")
print(f" Max |coefficient|: {np.max(np.abs(lasso_cv.coef_)):.4f}")
print("\nRidge (L2 Regularization):")
print(f" Test R²: {ridge_test:.4f}")
print(f" Features with zero coefficients: {ridge_zeros}")
print(f" Model complexity: 20 features (keeps all!)")
print(f" Max |coefficient|: {np.max(np.abs(ridge_cv.coef_)):.4f}")
print("\n" + "="*70)
print("KEY INSIGHTS:")
print("="*70)
print("1. Lasso creates SPARSE models by setting coefficients to exactly 0")
print("2. Ridge shrinks ALL coefficients but keeps every feature")
print("3. When many features are irrelevant, Lasso > Ridge")
print("4. Lasso is easier to interpret: fewer features to explain")
print("5. Both prevent overfitting, but Lasso also selects features!")
The comparison reveals the fundamental difference: Lasso performs feature selection (sparse model with fewer features) while Ridge keeps all features but shrinks their coefficients. When you suspect many features are irrelevant, Lasso is the better choice. When all features contribute something, Ridge might work better. You can also combine both approaches using Elastic Net!
Ridge vs Lasso Comparison
| Aspect | Ridge (L2) | Lasso (L1) |
|---|---|---|
| Penalty Type | Squared coefficients (L2 norm) | Absolute coefficients (L1 norm) |
| Coefficient Behavior | Shrinks toward zero, never exactly zero | Can shrink to exactly zero |
| Feature Selection | No - keeps all features | Yes - removes irrelevant features |
| Correlated Features | Shares weight among correlated features | Arbitrarily picks one, zeros others |
| Best Use Case | When all features are relevant | When many features are irrelevant |
| Interpretability | All features have weights | Sparse model, easier to interpret |
Practice Questions: Ridge & Lasso Regularization
Test your understanding of regularization techniques.
Question: What happens if alpha is set too high?
Show Solution
# Testing different alpha values
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Generate data
np.random.seed(42)
X = np.random.randn(100, 5)
y = 2*X[:, 0] + 3*X[:, 1] - X[:, 2] + np.random.randn(100) * 0.5
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Testing different alpha values:")
for alpha in [0.01, 1.0, 100.0, 10000.0]:
model = Ridge(alpha=alpha)
model.fit(X_train_scaled, y_train)
train_score = model.score(X_train_scaled, y_train)
test_score = model.score(X_test_scaled, y_test)
print(f"\nAlpha = {alpha:7.2f}:")
print(f" Train R²: {train_score:.4f}")
print(f" Test R²: {test_score:.4f}")
print(f" Max |coef|: {np.max(np.abs(model.coef_)):.4f}")
print("\nNotice: High alpha -> poor scores (underfitting)!")
Question: Why does Lasso produce exactly zero coefficients but Ridge does not?
Show Solution
# Comparing Ridge vs Lasso coefficient behavior
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
X = np.random.randn(100, 10)
# Only first 3 features are relevant
y = 2*X[:, 0] + 3*X[:, 1] - X[:, 2] + np.random.randn(100) * 0.1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Ridge
ridge = Ridge(alpha=1.0).fit(X_scaled, y)
ridge_zeros = np.sum(np.abs(ridge.coef_) < 1e-10)
# Lasso
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X_scaled, y)
lasso_zeros = np.sum(np.abs(lasso.coef_) < 1e-10)
print("Ridge Coefficients:")
print(f" {ridge.coef_.round(3)}")
print(f" Exactly zero: {ridge_zeros}")
print(f"\nLasso Coefficients:")
print(f" {lasso.coef_.round(3)}")
print(f" Exactly zero: {lasso_zeros}")
print("\nLasso zeros out irrelevant features, Ridge keeps all!")
Question: When would you prefer Ridge over Lasso?
Show Solution
# Scenario: All features are relevant
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
X = np.random.randn(100, 8)
# ALL features are relevant with varying importance
y = (2*X[:, 0] + 1.5*X[:, 1] + X[:, 2] + 0.8*X[:, 3] +
0.5*X[:, 4] + 0.3*X[:, 5] + 0.2*X[:, 6] + 0.1*X[:, 7] +
np.random.randn(100) * 0.2)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Compare Ridge vs Lasso
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1, max_iter=10000)
ridge_scores = cross_val_score(ridge, X_scaled, y, cv=5, scoring='r2')
lasso_scores = cross_val_score(lasso, X_scaled, y, cv=5, scoring='r2')
print("When ALL features are relevant:")
print(f" Ridge R²: {ridge_scores.mean():.4f} (±{ridge_scores.std():.4f})")
print(f" Lasso R²: {lasso_scores.mean():.4f} (±{lasso_scores.std():.4f})")
print("\nRidge performs better by keeping all features!")
Question: How does Lasso behave with correlated features?
Show Solution
# Lasso with correlated features
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
n = 100
# Create highly correlated features
X1 = np.random.randn(n)
X2 = X1 + np.random.randn(n) * 0.1 # Highly correlated with X1
X3 = np.random.randn(n) # Independent
X = np.column_stack([X1, X2, X3])
y = 2*X1 + 2*X2 + X3 + np.random.randn(n) * 0.1 # Both X1 and X2 matter!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train models
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X_scaled, y)
ridge = Ridge(alpha=1.0).fit(X_scaled, y)
print("True relationship: y = 2*X1 + 2*X2 + X3")
print(f"\nCorrelation X1-X2: {np.corrcoef(X1, X2)[0,1]:.3f}")
print(f"\nLasso Coefficients: {lasso.coef_.round(2)}")
print(f" Lasso picks one correlated feature!")
print(f"\nRidge Coefficients: {ridge.coef_.round(2)}")
print(f" Ridge distributes weight between X1 and X2!")
Elastic Net
Elastic Net combines L1 and L2 regularization, offering the feature selection of Lasso with the stability of Ridge for correlated features.
Understanding Elastic Net
Elastic Net addresses the limitations of using either Ridge or Lasso alone by combining both penalties. The combined penalty allows for feature selection (from L1) while handling correlated features better (from L2).
The l1_ratio parameter controls the mix of L1 and L2:
l1_ratio = 0
Pure Ridge (L2 only)
l1_ratio = 0.5
Equal mix of L1 and L2
l1_ratio = 1
Pure Lasso (L1 only)
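Written out (following scikit-learn's parameterization, in which the L2 term carries an extra factor of 1/2), the combined penalty is alpha * (l1_ratio * Sum(|bi|) + 0.5 * (1 - l1_ratio) * Sum(bi^2)). A toy computation shows how l1_ratio slides between the two extremes:

```python
import numpy as np

def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Combined L1/L2 penalty in scikit-learn's parameterization."""
    l1 = np.sum(np.abs(coefs))
    l2 = np.sum(coefs ** 2)
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)

b = np.array([2.0, -0.5, 1.5])  # toy coefficients
for l1_ratio in [0.0, 0.5, 1.0]:
    p = elastic_net_penalty(b, alpha=0.1, l1_ratio=l1_ratio)
    print(f"l1_ratio={l1_ratio}: penalty = {p:.4f}")
# l1_ratio=0.0 reduces to the pure L2 term, 1.0 to the pure L1 term
```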
When to Use Elastic Net
Elastic Net Excels When:
- You have many features, some correlated
- You want feature selection but need stability
- Lasso keeps selecting different features on different runs
- You have more features than samples (p > n)
- Groups of correlated features should be selected together
Trade-offs:
- Two hyperparameters to tune (alpha and l1_ratio)
- Slightly more complex to configure
- May be slower than pure Ridge or Lasso
- Need cross-validation to find optimal mix
Basic Elastic Net Implementation
Elastic Net combines L1 and L2 penalties to get the best of both worlds: feature selection from Lasso plus stability with correlated features from Ridge. Let's create data where features are correlated to see how different l1_ratio values behave:
# ========================================
# STEP 1: Create data with correlated features
# ========================================
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
np.random.seed(42)
n_samples = 300
# Create 3 base features
base = np.random.randn(n_samples, 3)
# Create 3 CORRELATED copies (adding small noise)
# These are highly correlated with base features!
correlated = base + np.random.randn(n_samples, 3) * 0.1
# Add 4 completely IRRELEVANT noise features
noise = np.random.randn(n_samples, 4)
# Combine: 10 features total
X = np.hstack([base, correlated, noise])
print("Dataset Structure:")
print(f" Total features: {X.shape[1]}")
print(f" Base features (0-2): The originals")
print(f" Correlated features (3-5): Copies of 0-2 with noise")
print(f" Noise features (6-9): Completely irrelevant")
# Check correlation between feature 0 and feature 3
corr = np.corrcoef(X[:, 0], X[:, 3])[0, 1]
print(f"\nCorrelation between feature 0 and 3: {corr:.3f}")
print(f"This high correlation will challenge Lasso!")
We've created a scenario where features 0-2 and 3-5 are highly correlated (correlation ≈ 0.99). Lasso struggles here because it arbitrarily picks one from each correlated group. Let's create the target:
# ========================================
# STEP 2: Create target based on base features
# ========================================
# Target depends ONLY on the 3 base features (0-2)
y = (2*base[:, 0]                       # Feature 0: weight +2
     - base[:, 1]                       # Feature 1: weight -1
     + 0.5*base[:, 2]                   # Feature 2: weight +0.5 (small effect)
     + np.random.randn(n_samples)*0.3)  # Small noise
print("\nTrue Relationship:")
print(" y = 2*f0 - 1*f1 + 0.5*f2 + noise")
print(" Features 3-5: Correlated with 0-2, so also predictive!")
print(" Features 6-9: Pure noise, should be eliminated")
print("\nChallenge: Should model use base (0-2) or correlated (3-5)?")
The target depends on the base features, but since features 3-5 are highly correlated with 0-2, they're also predictive! A good model should recognize this correlation. Let's prepare the data:
# ========================================
# STEP 3: Split and scale the data
# ========================================
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42
)
print(f"\nData Split:")
print(f" Training: {len(X_train)} samples")
print(f" Testing: {len(X_test)} samples")
# Scale features (critical for regularization!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"\nFeatures scaled for fair regularization")
Now let's test Elastic Net with different l1_ratio values to see how the L1/L2 mix affects feature selection:
# ========================================
# STEP 4: Test different l1_ratio values
# ========================================
print("\n" + "="*70)
print("TESTING DIFFERENT L1/L2 MIXES")
print("="*70)
l1_ratios = [0.1, 0.5, 0.9, 1.0] # From mostly L2 to pure L1
results = []
for l1_ratio in l1_ratios:
# Train Elastic Net
enet = ElasticNet(
alpha=0.1, # Overall regularization strength
l1_ratio=l1_ratio, # Mix of L1 vs L2
random_state=42,
max_iter=10000
)
enet.fit(X_train_scaled, y_train)
# Evaluate
train_score = enet.score(X_train_scaled, y_train)
test_score = enet.score(X_test_scaled, y_test)
n_nonzero = np.sum(enet.coef_ != 0)
selected = np.where(enet.coef_ != 0)[0]
print(f"\nl1_ratio = {l1_ratio} ({int(l1_ratio*100)}% L1, {int((1-l1_ratio)*100)}% L2):")
print(f" Test R²: {test_score:.4f}")
print(f" Non-zero coefficients: {n_nonzero} / 10")
print(f" Selected features: {selected.tolist()}")
# Check which types of features were selected
base_selected = np.sum((selected >= 0) & (selected <= 2))
corr_selected = np.sum((selected >= 3) & (selected <= 5))
noise_selected = np.sum((selected >= 6) & (selected <= 9))
print(f" Base features (0-2): {base_selected}")
print(f" Correlated features (3-5): {corr_selected}")
print(f" Noise features (6-9): {noise_selected}")
if l1_ratio == 0.1:
print(" -> Mostly Ridge: Keeps many features, distributes weight")
elif l1_ratio == 0.5:
print(" -> Balanced: Some feature selection, handles correlation")
elif l1_ratio == 0.9:
print(" -> Mostly Lasso: Aggressive feature selection")
else:
print(" -> Pure Lasso: Most aggressive selection")
results.append({
'l1_ratio': l1_ratio,
'test_score': test_score,
'n_features': n_nonzero,
'selected': selected
})
Notice the pattern! As l1_ratio increases (more L1, less L2), the model selects fewer features. Low l1_ratio (more Ridge) keeps correlated features together, while high l1_ratio (more Lasso) picks arbitrarily from correlated groups. Let's analyze the results:
# ========================================
# STEP 5: Analyze and compare results
# ========================================
print("\n" + "="*70)
print("ANALYSIS")
print("="*70)
# Find best by test score
best_idx = max(range(len(results)), key=lambda i: results[i]['test_score'])
best = results[best_idx]
print(f"\nBest l1_ratio: {best['l1_ratio']}")
print(f" Test R²: {best['test_score']:.4f}")
print(f" Features used: {best['n_features']}")
# Compare extremes
low_l1 = results[0] # l1_ratio = 0.1 (mostly Ridge)
high_l1 = results[-1] # l1_ratio = 1.0 (pure Lasso)
print(f"\nLow L1 (0.1 - mostly Ridge):")
print(f" Features: {low_l1['n_features']} (keeps more features)")
print(f" Test R²: {low_l1['test_score']:.4f}")
print(f"\nHigh L1 (1.0 - pure Lasso):")
print(f" Features: {high_l1['n_features']} (aggressive selection)")
print(f" Test R²: {high_l1['test_score']:.4f}")
print("\n" + "="*70)
print("KEY INSIGHTS:")
print("="*70)
print("1. Low l1_ratio (Ridge-like): Keeps correlated features together")
print("2. High l1_ratio (Lasso-like): Picks one from correlated group")
print("3. Middle l1_ratio: Balances feature selection and stability")
print("4. All versions eliminate noise features (6-9)!")
print("5. Best l1_ratio depends on your goal: sparsity vs stability")
print("\nFor this data: Mid-range l1_ratio (~0.5) works best!")
Elastic Net successfully handles the correlated features better than pure Lasso would. The l1_ratio parameter lets you tune the balance between feature selection (L1) and grouping effect (L2). When l1_ratio is low (0.1-0.3), the model behaves more like Ridge, keeping correlated features. When high (0.8-1.0), it acts like Lasso, aggressively eliminating features.
ElasticNetCV - Automatic Tuning
ElasticNetCV performs cross-validation over both alpha and l1_ratio to find the optimal combination. This saves you from manually testing every combination. Let's see it in action:
# ========================================
# STEP 1: Import and prepare data
# ========================================
from sklearn.linear_model import ElasticNetCV, RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
# Using data from previous example (X_train, X_test, y_train, y_test)
print("Using dataset with:")
print(" 10 features (3 base + 3 correlated + 4 noise)")
print(" 300 samples total")
print(" Correlated features create challenge for pure Lasso")
Instead of manually trying different combinations, ElasticNetCV will test all provided alpha and l1_ratio values using cross-validation. Let's define the search grid:
# ========================================
# STEP 2: Define hyperparameter search space
# ========================================
# Define l1_ratios to test
l1_ratios = [0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99]
print(f"\nTesting {len(l1_ratios)} l1_ratio values:")
for ratio in l1_ratios:
l1_pct = ratio * 100
l2_pct = (1 - ratio) * 100
print(f" {ratio}: {l1_pct:.0f}% L1 + {l2_pct:.0f}% L2")
# Define alphas to test
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
print(f"\nTesting {len(alphas)} alpha values:")
for alpha in alphas:
print(f" {alpha}: {'weak' if alpha < 0.1 else 'moderate' if alpha < 1 else 'strong'} regularization")
print(f"\nTotal combinations to test: {len(l1_ratios)} × {len(alphas)} = {len(l1_ratios) * len(alphas)}")
print(f"With 5-fold CV: {len(l1_ratios) * len(alphas) * 5} model fits!")
We're testing 7 l1_ratio values × 5 alpha values = 35 combinations. With 5-fold CV, that's 175 model fits! ElasticNetCV handles all this automatically. Let's create a pipeline:
# ========================================
# STEP 3: Create pipeline with ElasticNetCV
# ========================================
print("\nCreating pipeline with scaling + ElasticNetCV...")
pipeline = Pipeline([
('scaler', StandardScaler()), # Scale features first
('elasticnet', ElasticNetCV(
l1_ratio=l1_ratios, # L1/L2 mix to test
alphas=alphas, # Regularization strengths to test
cv=5, # 5-fold cross-validation
random_state=42,
max_iter=10000, # Ensure convergence
n_jobs=-1 # Use all CPU cores
))
])
print("Pipeline steps:")
print(" 1. StandardScaler: Scale features to mean=0, std=1")
print(" 2. ElasticNetCV: Find best alpha and l1_ratio")
The pipeline ensures features are always scaled before regularization. The n_jobs=-1 parameter uses all CPU cores for parallel cross-validation, speeding up the search. Let's train:
# ========================================
# STEP 4: Fit pipeline and find optimal parameters
# ========================================
print("\nTraining... (testing 35 combinations with 5-fold CV)")
print("This may take a moment...")
pipeline.fit(X_train, y_train)
# Access the fitted ElasticNetCV model
enet = pipeline.named_steps['elasticnet']
print("\n" + "="*70)
print("OPTIMAL HYPERPARAMETERS FOUND")
print("="*70)
print(f"\nBest alpha: {enet.alpha_:.6f}")
print(f" -> Optimal regularization strength")
print(f"\nBest l1_ratio: {enet.l1_ratio_:.3f}")
l1_pct = enet.l1_ratio_ * 100
l2_pct = (1 - enet.l1_ratio_) * 100
print(f" -> {l1_pct:.1f}% L1 (Lasso) + {l2_pct:.1f}% L2 (Ridge)")
if enet.l1_ratio_ < 0.3:
print(" -> More Ridge-like: Keeps correlated features")
elif enet.l1_ratio_ > 0.7:
print(" -> More Lasso-like: Aggressive feature selection")
else:
print(" -> Balanced: Good mix of selection and stability")
ElasticNetCV found the best combination through exhaustive cross-validation testing. Now let's evaluate the final model and examine which features it selected:
# ========================================
# STEP 5: Evaluate model and analyze features
# ========================================
# Get scores
train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
print("\nModel Performance:")
print(f" Training R²: {train_score:.4f}")
print(f" Test R²: {test_score:.4f}")
print(f" Gap: {train_score - test_score:.4f}")
if train_score - test_score < 0.05:
print(" Small gap = excellent generalization!")
# Analyze feature selection
coefs = enet.coef_
n_nonzero = np.sum(coefs != 0)
selected_features = np.where(coefs != 0)[0]
print(f"\nFeature Selection:")
print(f" Non-zero coefficients: {n_nonzero} / 10")
print(f" Selected features: {selected_features.tolist()}")
print("\nSelected Coefficients:")
print("-" * 70)
for idx in selected_features:
coef = coefs[idx]
if idx <= 2:
feat_type = "BASE"
elif idx <= 5:
feat_type = "CORR"
else:
feat_type = "NOISE"
print(f" Feature {idx}: {coef:>8.4f} [{feat_type}]")
# Count by type
base_count = np.sum((selected_features >= 0) & (selected_features <= 2))
corr_count = np.sum((selected_features >= 3) & (selected_features <= 5))
noise_count = np.sum((selected_features >= 6) & (selected_features <= 9))
print(f"\nFeature Type Breakdown:")
print(f" Base features (0-2): {base_count} selected")
print(f" Correlated features (3-5): {corr_count} selected")
print(f" Noise features (6-9): {noise_count} selected")
Perfect! ElasticNetCV identified the relevant features while handling the correlation intelligently. Let's compare with pure Ridge and pure Lasso:
# ========================================
# STEP 6: Compare with Ridge and Lasso
# ========================================
print("\n" + "="*70)
print("COMPARISON: ElasticNet vs Ridge vs Lasso")
print("="*70)
# Train Ridge
ridge = Pipeline([
('scaler', StandardScaler()),
('ridge', RidgeCV(cv=5))
])
ridge.fit(X_train, y_train)
ridge_model = ridge.named_steps['ridge']
# Train Lasso
lasso = Pipeline([
('scaler', StandardScaler()),
('lasso', LassoCV(cv=5, random_state=42, max_iter=10000))
])
lasso.fit(X_train, y_train)
lasso_model = lasso.named_steps['lasso']
# Compare
print("\nRidge (L2 only):")
print(f" Test R²: {ridge.score(X_test, y_test):.4f}")
print(f" Non-zero features: {np.sum(ridge_model.coef_ != 0)} / 10 (keeps all!)")
print(f" Max |coefficient|: {np.max(np.abs(ridge_model.coef_)):.4f}")
print("\nLasso (L1 only):")
print(f" Test R²: {lasso.score(X_test, y_test):.4f}")
print(f" Non-zero features: {np.sum(lasso_model.coef_ != 0)} / 10")
print(f" Selected features: {np.where(lasso_model.coef_ != 0)[0].tolist()}")
print(f"\nElastic Net (L1 + L2, ratio={enet.l1_ratio_:.2f}):")
print(f" Test R²: {test_score:.4f}")
print(f" Non-zero features: {n_nonzero} / 10")
print(f" Selected features: {selected_features.tolist()}")
print("\n" + "="*70)
scores = {"ElasticNet": test_score,
          "Ridge": ridge.score(X_test, y_test),
          "Lasso": lasso.score(X_test, y_test)}
print(f"WINNER: {max(scores, key=scores.get)}")
print("="*70)
print("Elastic Net advantages:")
print(" ✓ Handles correlated features better than Lasso")
print(" ✓ Performs feature selection unlike Ridge")
print(" ✓ Automatically finds optimal L1/L2 mix")
print(" ✓ More stable coefficient estimates than Lasso alone")
ElasticNetCV successfully combines the strengths of both methods! It eliminates noise features like Lasso, but handles correlated features more gracefully by using the L2 penalty to group them. The automatic hyperparameter tuning via cross-validation ensures we get the optimal balance without manual trial and error.
Choosing the Right Regularization
| Scenario | Recommended Method | Reason |
|---|---|---|
| All features are useful, no feature selection needed | Ridge | Fastest, keeps all features |
| Many irrelevant features, want sparse model | Lasso | Feature selection built-in |
| Correlated features, want some feature selection | Elastic Net | Best of both worlds |
| More features than samples (p > n) | Elastic Net | Lasso can only select n features |
| Need interpretable model with few features | Lasso | Creates sparse model |
| Unsure which method is best | ElasticNetCV | Let cross-validation decide |
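The decision table above can be encoded as a small helper. This is a hypothetical convenience function, not part of scikit-learn; the name `choose_regularizer` and its flag arguments are illustrative assumptions:

```python
# Hypothetical helper encoding the decision table: pick a regularized
# estimator from the data's shape and modeling goals.
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV

def choose_regularizer(n_samples, n_features, want_sparse=False,
                       correlated_features=False):
    """Return a cross-validated estimator following the rules of thumb above."""
    if n_features > n_samples or correlated_features:
        # Lasso selects at most n_samples features and picks arbitrarily
        # among correlated ones, so blend in an L2 penalty.
        return ElasticNetCV(cv=5, max_iter=10000)
    if want_sparse:
        return LassoCV(cv=5, max_iter=10000)
    return RidgeCV(cv=5)

print(type(choose_regularizer(100, 10)).__name__)                    # RidgeCV
print(type(choose_regularizer(100, 10, want_sparse=True)).__name__)  # LassoCV
print(type(choose_regularizer(50, 200)).__name__)                    # ElasticNetCV
```

Each branch returns a CV variant so the regularization strength is still tuned automatically once the family is chosen.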
Practice Questions: Elastic Net
Test your understanding of Elastic Net regularization.
Question: What l1_ratio value makes Elastic Net equivalent to Ridge?
Show Solution
# Demonstrating l1_ratio effects
import numpy as np
from sklearn.linear_model import ElasticNet, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
X = np.random.randn(100, 5)
y = 2*X[:, 0] + 3*X[:, 1] + np.random.randn(100) * 0.5
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Compare
alpha = 1.0
ridge = Ridge(alpha=alpha).fit(X_scaled, y)
lasso = Lasso(alpha=alpha, max_iter=10000).fit(X_scaled, y)
enet_l2 = ElasticNet(alpha=alpha, l1_ratio=0.0, max_iter=10000).fit(X_scaled, y)  # pure L2; sklearn warns and suggests Ridge
enet_l1 = ElasticNet(alpha=alpha, l1_ratio=1.0, max_iter=10000).fit(X_scaled, y)  # pure L1; same objective as Lasso
print("Coefficients Comparison:")
print(f"\nLasso:            {lasso.coef_.round(3)}")
print(f"ElasticNet(l1=1): {enet_l1.coef_.round(3)}")
print(f"\nRidge:            {ridge.coef_.round(3)}")
print(f"ElasticNet(l1=0): {enet_l2.coef_.round(3)}")
print("\nl1_ratio=1 makes ElasticNet identical to Lasso.")
print("l1_ratio=0 gives a pure L2 penalty like Ridge, but the coefficients")
print("differ because scikit-learn scales alpha differently for the two models.")
Question: Why might Elastic Net select more features than Lasso?
Show Solution
# Elastic Net with correlated features
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
n = 100
# Create 3 groups of correlated features
group1 = np.random.randn(n, 1)
group1_corr = group1 + np.random.randn(n, 2) * 0.1
group2 = np.random.randn(n, 1)
group2_corr = group2 + np.random.randn(n, 2) * 0.1
X = np.hstack([group1, group1_corr, group2, group2_corr])
y = 2*group1.ravel() + 3*group2.ravel() + np.random.randn(n) * 0.5
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Compare
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X_scaled, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000).fit(X_scaled, y)
lasso_selected = np.sum(np.abs(lasso.coef_) > 0.01)
enet_selected = np.sum(np.abs(enet.coef_) > 0.01)
print(f"Features selected:")
print(f" Lasso: {lasso_selected} features")
print(f" Elastic Net: {enet_selected} features")
print(f"\nLasso coefficients: {lasso.coef_.round(2)}")
print(f"ElasticNet coefficients: {enet.coef_.round(2)}")
print("\nElastic Net keeps correlated feature groups!")
Question: How does ElasticNetCV choose optimal hyperparameters?
Show Solution
# Using ElasticNetCV for automatic tuning
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
X = np.random.randn(200, 10)
y = 2*X[:, 0] + X[:, 2] - X[:, 4] + np.random.randn(200) * 0.5
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Define search space
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
l1_ratios = [0.1, 0.5, 0.7, 0.9, 0.95, 0.99]
# ElasticNetCV tries all combinations
enet_cv = ElasticNetCV(
alphas=alphas,
l1_ratio=l1_ratios,
cv=5,
max_iter=10000,
n_jobs=-1
)
enet_cv.fit(X_train_scaled, y_train)
print("ElasticNetCV Results:")
print(f" Best alpha: {enet_cv.alpha_:.4f}")
print(f" Best l1_ratio: {enet_cv.l1_ratio_:.2f}")
print(f" Test R²: {enet_cv.score(X_test_scaled, y_test):.4f}")
print(f"\nTested {len(alphas) * len(l1_ratios)} combinations with 5-fold CV!")
Question: What is the grouping effect in Elastic Net?
Show Solution
# Demonstrating grouping effect
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
n = 100
# Create a correlated group of 4 features
base_feature = np.random.randn(n, 1)
correlated_group = base_feature + np.random.randn(n, 4) * 0.1
# Add 3 independent features
independent = np.random.randn(n, 3)
X = np.hstack([correlated_group, independent])
# Target depends on the correlated group
y = 2*base_feature.ravel() + np.random.randn(n) * 0.5
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X_scaled, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000).fit(X_scaled, y)
print("Correlated features (0-3) vs Independent (4-6):")
print(f"\nLasso coefficients:")
print(f" Group: {lasso.coef_[:4].round(2)}")
print(f" Independent: {lasso.coef_[4:].round(2)}")
print(f"\nElasticNet coefficients:")
print(f" Group: {enet.coef_[:4].round(2)}")
print(f" Independent: {enet.coef_[4:].round(2)}")
print("\nElastic Net assigns similar weights to correlated features!")
Regularization Comparison
The example coefficients below show how the two penalties treat the same four features at the same regularization strength: Ridge shrinks all of them, while Lasso drives the weakest two to exactly zero:
Ridge (L2)
- Coef 1: 0.85
- Coef 2: 0.72
- Coef 3: 0.15
- Coef 4: 0.08
Lasso (L1)
- Coef 1: 0.92
- Coef 2: 0.68
- Coef 3: 0.00
- Coef 4: 0.00
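A comparison with the same qualitative pattern can be reproduced by fitting both models on synthetic data; this is a minimal sketch with assumed alpha values and data, so the exact numbers will differ from those shown above:

```python
# Sketch reproducing the pattern above: at comparable regularization
# strengths, Ridge shrinks all four coefficients while Lasso drives the
# nearly irrelevant ones to exactly zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
# Only the first two features matter; feature 2 is weak, feature 3 unused.
y = 1.0 * X[:, 0] + 0.8 * X[:, 1] + 0.05 * X[:, 2] + rng.standard_normal(200) * 0.3

X_scaled = StandardScaler().fit_transform(X)
ridge = Ridge(alpha=10.0).fit(X_scaled, y)
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X_scaled, y)

print("Ridge coefficients:", ridge.coef_.round(2))  # all non-zero, shrunk
print("Lasso coefficients:", lasso.coef_.round(2))  # weak features zeroed
```

Running this confirms the takeaway: L2 shrinks, L1 selects.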
Key Takeaways
Linear Regression
Fits a straight line or hyperplane using OLS to minimize squared residuals between predictions and actual values.
Polynomial Regression
Captures curves and non-linear patterns by transforming features into polynomial terms before applying linear regression.
Ridge Regression (L2)
Adds squared coefficient penalty to shrink weights toward zero while keeping all features in the model.
Lasso Regression (L1)
Uses absolute coefficient penalty that can shrink weights to exactly zero, performing automatic feature selection.
Elastic Net
Combines L1 and L2 penalties with a mixing parameter to balance feature selection and coefficient shrinkage.
Hyperparameter Tuning
Use cross-validation to find optimal alpha (regularization strength) and l1_ratio (L1/L2 balance) values.
Knowledge Check
Quick Quiz
Test what you've learned about regression algorithms