Module 6.4

Correlation & Regression

Discover relationships between variables and build predictive models. Learn to quantify associations with correlation coefficients and predict outcomes using linear regression techniques.

35 min read
Intermediate
What You'll Learn
  • Pearson and Spearman correlation
  • Simple linear regression
  • Multiple linear regression
  • Model evaluation metrics
  • Regression assumptions and diagnostics
Contents
01

Correlation Analysis

Correlation measures the strength and direction of the linear relationship between two variables. Understanding correlation is fundamental to data science because it helps identify which variables are related before building predictive models. A positive correlation means both variables increase together, while a negative correlation means one increases as the other decreases.

Interactive Correlation Explorer

[Interactive widget: drag a slider to move r from -1.0 (perfect negative) through 0 to +1.0 (perfect positive); the scatter plot, trend line, and strength/direction labels update as r changes.]
r = 0: No linear relationship exists. Knowing the value of X tells you nothing about Y. The data points form a circular cloud with no discernible pattern.
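This can be checked numerically: two independently drawn variables give an r near zero and (typically) a non-significant p-value. A minimal sketch; the seed and sample size are arbitrary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=1000)   # X drawn independently...
y = rng.normal(size=1000)   # ...of Y

# With no relationship, r hovers near 0 (within sampling noise ~1/sqrt(n))
r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.3f}")
```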

Figure 6.4.1: Different correlation coefficients visualized, from perfect positive (+1) to perfect negative (-1)

Pearson Correlation Coefficient

The Pearson correlation coefficient (r) is the most widely used measure of linear association between two continuous variables. Named after Karl Pearson, it quantifies both the strength and direction of a relationship by calculating the covariance divided by the product of standard deviations:

r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √[Σ(xᵢ - x̄)² × Σ(yᵢ - ȳ)²]

import numpy as np
import pandas as pd
from scipy import stats

# Sample data: study hours and exam scores
study_hours = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
exam_scores = np.array([65, 70, 72, 78, 82, 85, 88, 92, 95, 98])

# Calculate Pearson correlation using SciPy
correlation, p_value = stats.pearsonr(study_hours, exam_scores)
print(f"Pearson r: {correlation:.4f}")  # ≈0.9967 (very strong positive)
print(f"P-value: {p_value:.6f}")  # Very small = statistically significant

# Manual calculation to understand the formula
x_mean, y_mean = np.mean(study_hours), np.mean(exam_scores)
numerator = np.sum((study_hours - x_mean) * (exam_scores - y_mean))
denominator = np.sqrt(np.sum((study_hours - x_mean)**2) * np.sum((exam_scores - y_mean)**2))
r_manual = numerator / denominator
print(f"Manual calculation: {r_manual:.4f}")  # matches stats.pearsonr
  • Strong (|r| > 0.7): variables strongly move together
  • Moderate (0.4 ≤ |r| ≤ 0.7): noticeable association
  • Weak (|r| < 0.4): little linear relationship

Spearman Rank Correlation

Spearman's rank correlation (ρ or rho) is a non-parametric measure that assesses monotonic relationships, whether linear or not. Instead of using raw values, it converts data to ranks and then computes correlation on those ranks. This makes it:

  • Robust to outliers: Extreme values have less influence since they become just another rank
  • Suitable for ordinal data: Works with ranked categories (e.g., survey responses 1-5)
  • Detects monotonic relationships: Catches relationships where variables consistently move together, even if not linearly
# Spearman correlation - robust to outliers and non-linear monotonic relationships
spearman_r, spearman_p = stats.spearmanr(study_hours, exam_scores)
print(f"Spearman rho: {spearman_r:.4f}")  # 1.0000 (perfect monotonic)

# Demonstrate robustness to outliers
x_with_outlier = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 50])  # Outlier: 50!
y_with_outlier = np.array([65, 70, 72, 78, 82, 85, 88, 92, 95, 98])

pearson_outlier, _ = stats.pearsonr(x_with_outlier, y_with_outlier)
spearman_outlier, _ = stats.spearmanr(x_with_outlier, y_with_outlier)

print(f"With outlier - Pearson: {pearson_outlier:.4f}")   # ≈0.64 (dragged down by the outlier)
print(f"With outlier - Spearman: {spearman_outlier:.4f}")  # 1.0000 (ranks unchanged - robust!)

Correlation Matrix

When analyzing multiple variables simultaneously, a correlation matrix displays all pairwise correlations in a compact table format. This is essential for exploratory data analysis (EDA) to understand variable relationships before building models.

Figure 6.4.2: Correlation matrix heatmap showing relationships between student performance factors
# Create a DataFrame with multiple variables
df = pd.DataFrame({
    'study_hours': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'sleep_hours': [5, 6, 7, 6, 8, 7, 6, 8, 7, 8],
    'previous_gpa': [2.5, 2.8, 3.0, 3.2, 3.5, 3.3, 3.6, 3.8, 3.9, 4.0],
    'attendance': [70, 75, 80, 85, 88, 90, 92, 94, 96, 98],
    'exam_score': [65, 70, 72, 78, 82, 85, 88, 92, 95, 98]
})

# Generate correlation matrix
corr_matrix = df.corr(method='pearson')
print(corr_matrix.round(2))

# Find highly correlated pairs (excluding diagonal)
import itertools
print("\nHighly correlated pairs (|r| > 0.7):")
for col1, col2 in itertools.combinations(df.columns, 2):
    r = corr_matrix.loc[col1, col2]
    if abs(r) > 0.7:
        print(f"  {col1} & {col2}: r = {r:.3f}")

Visualizing Correlations

Correlation coefficients tell only part of the story. Visualizations reveal patterns, outliers, and non-linear relationships that raw numbers might miss. Always combine numerical analysis with visual inspection to fully understand your data relationships.

import seaborn as sns
import matplotlib.pyplot as plt

# Create a correlation heatmap
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Heatmap with annotations
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
            vmin=-1, vmax=1, square=True, ax=axes[0],
            fmt='.2f', linewidths=0.5)
axes[0].set_title('Correlation Matrix Heatmap')

# Pairplot for visual inspection of all relationships
# (Better to run separately for large figures)
# sns.pairplot(df, diag_kind='kde')

# Scatter plot with regression line for key relationship
sns.regplot(x='study_hours', y='exam_score', data=df, ax=axes[1],
            scatter_kws={'alpha': 0.7}, line_kws={'color': 'red'})
axes[1].set_title(f'Study Hours vs Exam Score (r = {corr_matrix.loc["study_hours", "exam_score"]:.3f})')

plt.tight_layout()
plt.show()

# Clustermap - reorders variables by similarity (creates its own figure,
# so plt.figure/plt.title would be ignored; title it via the returned grid)
cg = sns.clustermap(corr_matrix, annot=True, cmap='coolwarm', center=0,
                    vmin=-1, vmax=1, figsize=(8, 6))
cg.fig.suptitle('Clustered Correlation Matrix', y=1.02)
plt.show()
Visualization Best Practices

When visualizing correlations:

  • Use diverging colormaps: coolwarm or RdBu centers at 0, making positive/negative clear
  • Always check scatter plots: Anscombe's Quartet shows identical correlations with vastly different patterns
  • Consider clustering: Clustermaps reveal groups of related variables
  • Watch for outliers: A single point can dramatically change correlation
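Anscombe's Quartet is worth seeing once with real numbers. A small sketch using the classic values (transcribed here): each of the four datasets has r ≈ 0.816 despite wildly different shapes (clean line, curve, outlier-driven line, vertical cluster).

```python
from scipy import stats

# Anscombe's Quartet: four datasets with nearly identical r but very different shapes
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    'I':   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II':  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV':  ([8]*7 + [19] + [8]*3,
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (xs, ys) in quartet.items():
    r, _ = stats.pearsonr(xs, ys)
    print(f"Dataset {name}: r = {r:.3f}")  # all four ≈ 0.816
```

Plotting these four (e.g. with `sns.lmplot` on seaborn's bundled `anscombe` dataset) shows why the scatter plot check is non-negotiable.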
Correlation ≠ Causation

A high correlation between two variables does not mean one causes the other. Common pitfalls include:

  • Confounding variables: A third variable may cause both (e.g., ice cream sales and drowning both increase in summer, both caused by hot weather)
  • Reverse causation: The direction might be opposite (does A cause B, or B cause A?)
  • Spurious correlation: Pure coincidence, especially with enough variables
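A quick simulation makes the confounder pitfall concrete. All variable names and effect sizes below are invented for illustration: temperature drives both outcomes, so they correlate strongly even though neither causes the other.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
temperature = rng.uniform(10, 35, size=200)          # the confounder

# Both outcomes depend only on temperature, not on each other
ice_cream = 2.0 * temperature + rng.normal(0, 5, 200)
drownings = 0.3 * temperature + rng.normal(0, 1, 200)

r, _ = stats.pearsonr(ice_cream, drownings)
print(f"ice cream vs drownings: r = {r:.2f}")   # strong, but not causal

# Controlling for the confounder: correlate the residuals after
# regressing each variable on temperature
ic_resid = ice_cream - np.poly1d(np.polyfit(temperature, ice_cream, 1))(temperature)
dr_resid = drownings - np.poly1d(np.polyfit(temperature, drownings, 1))(temperature)
r_partial, _ = stats.pearsonr(ic_resid, dr_resid)
print(f"after controlling for temperature: r = {r_partial:.2f}")  # near 0
```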

Practice: Correlation Analysis

Scenario: A marketing team has data on monthly advertising spend (thousands $) and resulting sales (thousands $):

ads = [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
sales = [85, 95, 110, 125, 135, 145, 160, 170, 185, 195]

Task: Calculate the Pearson correlation, interpret the strength, and determine if the relationship is statistically significant (α = 0.05).

Show Solution
from scipy import stats
import numpy as np

ads = [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
sales = [85, 95, 110, 125, 135, 145, 160, 170, 185, 195]

r, p_value = stats.pearsonr(ads, sales)

print(f"Pearson r: {r:.4f}")
print(f"P-value: {p_value:.6f}")
print(f"\nInterpretation:")
print(f"- r = {r:.2f}: Very strong positive correlation")
print(f"- p < 0.05: Statistically significant")
print(f"- R² = {r**2:.2%} of sales variance explained by ad spend")
print(f"\nConclusion: Strong evidence that increased advertising")
print(f"is associated with increased sales.")

Scenario: A tutor collected data on study hours vs quiz scores.

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

Task: Calculate the Pearson correlation and determine if it's statistically significant.

Show Solution
from scipy import stats

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

r, p_value = stats.pearsonr(x, y)
print(f"Pearson r: {r:.4f}")  # 0.7746
print(f"P-value: {p_value:.4f}")  # 0.1243

# Interpretation: Moderate-strong positive correlation
# but p-value > 0.05 suggests not statistically significant
# (sample size is small)

Scenario: HR data includes a 100-year-old CEO founder - an outlier that could skew analysis.

ages = np.array([25, 30, 35, 40, 45, 50, 55, 60, 100])  # Note outlier!
income = np.array([30, 40, 50, 60, 70, 80, 85, 90, 95])

Task: Calculate both Pearson and Spearman correlations, explain the difference, and recommend which to use.

Show Solution
from scipy import stats

ages = np.array([25, 30, 35, 40, 45, 50, 55, 60, 100])
income = np.array([30, 40, 50, 60, 70, 80, 85, 90, 95])

pearson_r, _ = stats.pearsonr(ages, income)
spearman_r, _ = stats.spearmanr(ages, income)

print(f"Pearson r: {pearson_r:.4f}")   # ~0.82 (affected by outlier)
print(f"Spearman rho: {spearman_r:.4f}")  # 1.00 (robust!)

# Spearman is more appropriate because:
# 1. There's an outlier (age 100) skewing Pearson
# 2. Spearman uses ranks, making it robust to extreme values
# 3. The monotonic relationship is perfect when using ranks

Scenario: Marketing wants to know which factor drives sales most. Build a correlation matrix and identify the strongest driver.

Show Solution
import pandas as pd
import numpy as np

# Create sample data
np.random.seed(42)
df = pd.DataFrame({
    'advertising': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55],
    'sales': [100, 150, 180, 220, 250, 300, 350, 380, 420, 460],
    'employees': [5, 7, 8, 10, 12, 14, 16, 18, 20, 22],
    'satisfaction': [3.2, 3.5, 3.8, 4.0, 4.1, 4.3, 4.4, 4.5, 4.6, 4.7]
})

# Generate correlation matrix
corr_matrix = df.corr()
print(corr_matrix.round(3))

# Find strongest correlation (excluding diagonal)
import itertools
strongest = {'pair': None, 'r': 0}
for col1, col2 in itertools.combinations(df.columns, 2):
    r = abs(corr_matrix.loc[col1, col2])
    if r > strongest['r']:
        strongest = {'pair': (col1, col2), 'r': r}

print(f"\nStrongest: {strongest['pair']} with r = {strongest['r']:.3f}")

Scenario: Before building a ML model, you need to identify highly correlated features that could cause multicollinearity.

Task: Create correlation_report(df, threshold=0.7) that returns all highly correlated pairs above the threshold for feature selection.

Show Solution
def correlation_report(df, threshold=0.7):
    """Generate report of highly correlated variable pairs."""
    # Get numeric columns only
    numeric_df = df.select_dtypes(include=['number'])
    corr_matrix = numeric_df.corr()
    
    # Find pairs above threshold
    high_corr = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i+1, len(corr_matrix.columns)):
            col1 = corr_matrix.columns[i]
            col2 = corr_matrix.columns[j]
            r = corr_matrix.iloc[i, j]
            if abs(r) >= threshold:
                high_corr.append({
                    'var1': col1,
                    'var2': col2,
                    'correlation': round(r, 4),
                    'strength': 'Strong' if abs(r) >= 0.7 else 'Moderate',
                    'direction': 'Positive' if r > 0 else 'Negative'
                })
    
    # Create report DataFrame
    if high_corr:
        report = pd.DataFrame(high_corr)
        report = report.sort_values('correlation', key=abs, ascending=False)
        return report
    else:
        return f"No correlations above threshold {threshold}"

# Test
report = correlation_report(df, threshold=0.7)
print(report)

# Example output (truncated - every pair above the threshold is listed):
#           var1        var2  correlation strength direction
# 0  advertising       sales        0.998   Strong  Positive
# 1    employees       sales        0.996   Strong  Positive
# ...
02

Simple Linear Regression

Simple linear regression models the relationship between a single independent variable (predictor) and a dependent variable (outcome). The model finds the best-fit line through the data points using Ordinary Least Squares (OLS), which minimizes the sum of squared errors between predicted and actual values.

Figure 6.4.3: Simple regression uses one predictor (2D line), while multiple regression uses many predictors (hyperplane)

The Regression Equation

The simple linear regression equation is y = mx + b (or y = β₀ + β₁x in statistical notation):

Slope (m or β₁)

The rate of change: how much Y changes for each one-unit increase in X. A slope of 3.5 means Y increases by 3.5 units when X increases by 1.

Intercept (b or β₀)

The starting point: the predicted value of Y when X equals zero. Represents the baseline before the predictor has any effect.
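For a single predictor, OLS has a closed form: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the means. A quick sketch on the full ten-point study data (no train/test split here):

```python
import numpy as np

# Same data as elsewhere in this module
study_hours = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
exam_scores = np.array([65, 70, 72, 78, 82, 85, 88, 92, 95, 98])

x_mean, y_mean = study_hours.mean(), exam_scores.mean()

# OLS closed form: slope = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
slope = np.sum((study_hours - x_mean) * (exam_scores - y_mean)) / \
        np.sum((study_hours - x_mean) ** 2)
intercept = y_mean - slope * x_mean   # the line passes through (x̄, ȳ)

print(f"score = {slope:.3f} × hours + {intercept:.3f}")  # score = 3.667 × hours + 58.667
```

This is exactly what `LinearRegression` computes for one feature; the sklearn version generalizes the same least-squares criterion to many predictors.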

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Sample data: study hours and exam scores
study_hours = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
exam_scores = np.array([65, 70, 72, 78, 82, 85, 88, 92, 95, 98])

# Prepare data - sklearn expects 2D array for X
X = study_hours.reshape(-1, 1)
y = exam_scores

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# Extract the equation parameters
slope = model.coef_[0]
intercept = model.intercept_

print(f"Regression Equation: score = {slope:.2f} × hours + {intercept:.2f}")
print(f"Slope: {slope:.4f} (each extra study hour → +{slope:.1f} points)")
print(f"Intercept: {intercept:.4f} (baseline score with 0 hours)")

# Make a prediction
new_hours = 12
predicted_score = model.predict([[new_hours]])[0]
print(f"\nPrediction: {new_hours} hours of study → {predicted_score:.1f} points")
Figure 6.4.4: The best-fit regression line minimizes the distance between points and the line

Model Evaluation Metrics

After fitting a model, we need to assess how well it performs. Three essential metrics help evaluate regression models:

Figure 6.4.5: Understanding R², RMSE, and MAE, the key metrics for evaluating regression performance
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Make predictions on test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)

# Note: with only 10 samples, the 20% test split has just 2 points,
# so the exact metric values are illustrative rather than reliable
print(f"R-squared: {r2:.4f}")       # proportion of score variance explained
print(f"RMSE: {rmse:.4f}")          # typical prediction error, in points
print(f"MAE: {mae:.4f}")            # average absolute error, in points

# Understanding the metrics:
print("\n--- Metric Interpretations ---")
print(f"R²: Model explains {r2*100:.1f}% of score variance")
print(f"RMSE: Predictions are off by ~{rmse:.1f} points on average")
print(f"MAE: Average absolute error is {mae:.1f} points")
  • R² (R-squared), range 0 to 1: proportion of variance in Y explained by the model. R² = 0.85 means 85% of variance is explained. Higher is better.
  • RMSE, range 0 to ∞ (same units as Y): penalizes large errors more heavily. Lower is better, especially when big mistakes are costly.
  • MAE, range 0 to ∞ (same units as Y): average absolute error, more robust to outliers than RMSE. Lower is better.

Interpreting R-squared

An R² of 0.98 means the model explains 98% of the variance in exam scores based on study hours. Higher values indicate better fit, but beware of overfitting with complex models. Always validate on unseen test data!

Visualizing Regression Results

A well-crafted regression visualization shows not just the best-fit line, but also the uncertainty around predictions. Confidence intervals and residual plots help assess model quality and identify potential issues like heteroscedasticity or influential outliers.

import matplotlib.pyplot as plt
import seaborn as sns

# Create figure with subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Regression line with confidence interval
sns.regplot(x=study_hours, y=exam_scores, ax=axes[0], 
            ci=95, scatter_kws={'s': 60, 'alpha': 0.7})
axes[0].set_xlabel('Study Hours')
axes[0].set_ylabel('Exam Score')
axes[0].set_title('Regression with 95% CI')

# 2. Actual vs Predicted
y_pred = model.predict(X)
axes[1].scatter(y, y_pred, alpha=0.7)
axes[1].plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
axes[1].set_xlabel('Actual Scores')
axes[1].set_ylabel('Predicted Scores')
axes[1].set_title('Actual vs Predicted')

# 3. Residual plot
residuals = y - y_pred
axes[2].scatter(y_pred, residuals, alpha=0.7)
axes[2].axhline(y=0, color='r', linestyle='--')
axes[2].set_xlabel('Predicted Values')
axes[2].set_ylabel('Residuals')
axes[2].set_title('Residual Plot')

plt.tight_layout()
plt.show()

# Check if residuals look random (good) or show patterns (bad)
print(f"Residual mean: {residuals.mean():.4f}")  # Should be ~0
print(f"Residual std: {residuals.std():.4f}")

Practice: Simple Linear Regression

Scenario: A real estate company has square footage and sale prices for recent homes:

sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245000, 312000, 279000, 308000, 199000, 219000, 405000, 324000, 319000, 255000]

Task: Build a regression model, calculate R² and RMSE, and predict the price for a 2000 sqft home. Discuss whether the model is reliable.

Show Solution
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

sqft = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245000, 312000, 279000, 308000, 199000, 219000, 405000, 324000, 319000, 255000])

# Reshape for sklearn
X = sqft.reshape(-1, 1)
y = price

# Fit model
model = LinearRegression()
model.fit(X, y)

# Make predictions
y_pred = model.predict(X)

# Evaluate
r2 = r2_score(y, y_pred)
rmse = np.sqrt(mean_squared_error(y, y_pred))

print(f"Equation: Price = ${model.coef_[0]:.2f} × sqft + ${model.intercept_:,.0f}")
print(f"R² Score: {r2:.4f}")
print(f"RMSE: ${rmse:,.0f}")

# Predict for 2000 sqft
new_price = model.predict([[2000]])[0]
print(f"\nPredicted price for 2000 sqft: ${new_price:,.0f}")

# Assessment
print(f"\nModel Assessment:")
print(f"- R² = {r2:.2%} variance explained (moderate fit)")
print(f"- RMSE = ${rmse:,.0f} average error")
print(f"- Use cautiously - other factors affect price!")

Scenario: Marketing needs to forecast sales if they increase budget to $900.

advertising = np.array([100, 200, 300, 400, 500, 600, 700, 800])
sales = np.array([10, 22, 28, 38, 45, 55, 60, 68])

Task: Build a regression model, interpret the coefficients, and predict sales for $900 spend.

Show Solution
from sklearn.linear_model import LinearRegression
import numpy as np

advertising = np.array([100, 200, 300, 400, 500, 600, 700, 800])
sales = np.array([10, 22, 28, 38, 45, 55, 60, 68])

# Reshape for sklearn
X = advertising.reshape(-1, 1)
y = sales

# Fit model
model = LinearRegression()
model.fit(X, y)

print(f"Equation: sales = {model.coef_[0]:.4f} × advertising + {model.intercept_:.4f}")
# sales ≈ 0.0814 × advertising + 4.11

# Predict for $900
prediction = model.predict([[900]])[0]
print(f"Predicted sales for $900: {prediction:.1f} units")  # ≈77.4 units

Scenario: Logistics built a delivery time model. Assess if it's production-ready.

y_true = np.array([100, 150, 200, 250, 300])  # Actual delivery times (min)
y_pred = np.array([110, 145, 195, 260, 290])  # Predicted times

Task: Calculate R², RMSE, and MAE. Provide a recommendation on model readiness.

Show Solution
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np

y_true = np.array([100, 150, 200, 250, 300])
y_pred = np.array([110, 145, 195, 260, 290])

r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)

print(f"R-squared: {r2:.4f}")  # 0.9860 (excellent!)
print(f"RMSE: {rmse:.4f}")     # 8.3666
print(f"MAE: {mae:.4f}")       # 8.0000

# Assessment: Very good model!
# - R² of ~0.99 means about 99% of variance explained
# - RMSE/MAE of ~8 minutes on values ranging 100-300 is small
# - Errors are roughly 4% of the mean delivery time

Scenario: Avoid overfitting by implementing proper model validation with holdout data.

Task: Create synthetic data, split 80/20, train a model on training data, and evaluate only on test data.

Show Solution
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# Create dataset
np.random.seed(42)
X = np.random.rand(100, 1) * 10  # 100 samples, values 0-10
y = 3 * X.flatten() + 5 + np.random.randn(100) * 2  # y = 3x + 5 + noise

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Train: {len(X_train)}, Test: {len(X_test)}")

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on test set (NOT train set!)
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"Test R²: {r2:.4f}")
print(f"Test RMSE: {rmse:.4f}")
print(f"True equation: y = 3x + 5")
print(f"Learned: y = {model.coef_[0]:.2f}x + {model.intercept_:.2f}")

Task: Write a function simple_regression_pipeline(X, y, test_size=0.2) that performs training, testing, and returns a dictionary with the model and all metrics.

Show Solution
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np

def simple_regression_pipeline(X, y, test_size=0.2, random_state=42):
    """
    Complete simple regression pipeline.
    Returns model and comprehensive metrics.
    """
    # Ensure X is 2D
    if len(X.shape) == 1:
        X = X.reshape(-1, 1)
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    
    # Train model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Metrics
    results = {
        'model': model,
        'slope': model.coef_[0],
        'intercept': model.intercept_,
        'train_r2': r2_score(y_train, y_train_pred),
        'test_r2': r2_score(y_test, y_test_pred),
        'train_rmse': np.sqrt(mean_squared_error(y_train, y_train_pred)),
        'test_rmse': np.sqrt(mean_squared_error(y_test, y_test_pred)),
        'train_mae': mean_absolute_error(y_train, y_train_pred),
        'test_mae': mean_absolute_error(y_test, y_test_pred),
        'n_train': len(X_train),
        'n_test': len(X_test)
    }
    
    return results

# Test
X = np.random.rand(100) * 10
y = 2.5 * X + 10 + np.random.randn(100) * 3

results = simple_regression_pipeline(X, y)
print(f"Equation: y = {results['slope']:.2f}x + {results['intercept']:.2f}")
print(f"Train R²: {results['train_r2']:.4f}, Test R²: {results['test_r2']:.4f}")
print(f"Test RMSE: {results['test_rmse']:.4f}")
03

Multiple Linear Regression

Multiple linear regression extends simple regression by including multiple predictor variables. This allows modeling more complex real-world scenarios where outcomes depend on several factors. Each coefficient represents the effect of one variable while holding all others constant, a key concept called "ceteris paribus" (all else being equal).

y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε

Where ε represents the error term (residuals)

Building a Multiple Regression Model

import pandas as pd
from sklearn.linear_model import LinearRegression

# Multiple predictors: hours studied, sleep hours, previous GPA
data = pd.DataFrame({
    'hours_studied': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'sleep_hours': [5, 6, 7, 6, 8, 7, 6, 8, 7, 8],
    'previous_gpa': [2.5, 2.8, 3.0, 3.2, 3.5, 3.3, 3.6, 3.8, 3.9, 4.0],
    'exam_score': [65, 70, 72, 78, 82, 85, 88, 92, 95, 98]
})

# Features and target
X_multi = data[['hours_studied', 'sleep_hours', 'previous_gpa']]
y_multi = data['exam_score']

# Fit multiple regression model
multi_model = LinearRegression()
multi_model.fit(X_multi, y_multi)

# Display the equation
print(f"Intercept: {multi_model.intercept_:.4f}")
print("\nCoefficients (effect of each variable):")
for feature, coef in zip(X_multi.columns, multi_model.coef_):
    print(f"  {feature}: {coef:.4f}")

# Interpretation
print("\n--- Interpretation ---")
print(f"Each extra study hour → +{multi_model.coef_[0]:.2f} points (holding sleep & GPA constant)")
print(f"Each extra sleep hour → +{multi_model.coef_[1]:.2f} points (holding study & GPA constant)")
print(f"Each GPA point increase → +{multi_model.coef_[2]:.2f} points (holding study & sleep constant)")

Feature Importance & Standardization

Raw coefficients can't be compared directly because features have different scales. A coefficient of 3.5 for hours (range: 2-11) isn't comparable to 8.2 for GPA (range: 2.5-4.0). Standardizing features (mean=0, std=1) makes coefficients directly comparable:

from sklearn.preprocessing import StandardScaler

# Standardize features (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_multi)

# Fit model on standardized data
model_scaled = LinearRegression()
model_scaled.fit(X_scaled, y_multi)

print("Standardized Coefficients (Feature Importance):")
print("(Larger absolute value = more important)")
for feature, coef in zip(X_multi.columns, model_scaled.coef_):
    importance = "★" * min(5, int(abs(coef) / 2) + 1)
    print(f"  {feature}: {coef:+.4f} {importance}")
Adjusted R² vs Regular R²

Regular R² always increases when you add more predictors, even useless ones! Adjusted R² penalizes model complexity:

Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - p - 1)]

Where n = sample size, p = number of predictors. Use this when comparing models with different numbers of features.
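Sklearn has no built-in adjusted R², but it is one line from the formula above. A sketch on synthetic data (the variables and noise levels are made up): adding a pure-noise predictor can only raise in-sample R², while adjusted R² applies a complexity penalty.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def adjusted_r2(r2, n, p):
    """Adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Synthetic data: y depends on x1 only; x2 is a useless noise predictor
rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(0, 10, n)
x2 = rng.normal(size=n)
y = 3 * x1 + 5 + rng.normal(0, 2, n)

results = {}
for label, X in [('1 predictor', x1.reshape(-1, 1)),
                 ('2 predictors', np.column_stack([x1, x2]))]:
    r2 = r2_score(y, LinearRegression().fit(X, y).predict(X))
    results[label] = (r2, adjusted_r2(r2, n, X.shape[1]))
    print(f"{label}: R² = {r2:.4f}, adjusted R² = {results[label][1]:.4f}")

# In-sample R² never drops when a predictor is added; adjusted R² pays a penalty
```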

Practice: Multiple Regression

Scenario: An auto company wants to predict MPG from vehicle specs:

data = {
    'horsepower': [130, 165, 150, 150, 140, 198, 220, 215, 225, 190],
    'weight': [3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 3850],
    'cylinders': [8, 8, 8, 8, 8, 8, 8, 8, 8, 8],
    'mpg': [18, 15, 18, 16, 17, 14, 14, 14, 12, 16]
}

Task: Build a multiple regression model, interpret the coefficients, and predict MPG for a car with 175 HP, 3800 lbs, and 6 cylinders.

Show Solution
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

data = {
    'horsepower': [130, 165, 150, 150, 140, 198, 220, 215, 225, 190],
    'weight': [3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 3850],
    'cylinders': [8, 8, 8, 8, 8, 8, 8, 8, 8, 8],
    'mpg': [18, 15, 18, 16, 17, 14, 14, 14, 12, 16]
}
df = pd.DataFrame(data)

X = df[['horsepower', 'weight', 'cylinders']]
y = df['mpg']

model = LinearRegression()
model.fit(X, y)

print("Coefficients:")
for name, coef in zip(X.columns, model.coef_):
    print(f"  {name}: {coef:.6f}")
print(f"  Intercept: {model.intercept_:.2f}")

print("\nInterpretation:")
print(f"  - Each additional HP → {model.coef_[0]:.3f} MPG change")
print(f"  - Each additional lb → {model.coef_[1]:.4f} MPG change")

# Predict for new car
# Caveat: 'cylinders' is constant (all 8) in this sample, so its
# coefficient is not identifiable, and predicting for 6 cylinders
# extrapolates beyond the data - treat this estimate with caution
new_car = [[175, 3800, 6]]
pred_mpg = model.predict(new_car)[0]
print(f"\nPredicted MPG for 175HP/3800lb/6cyl: {pred_mpg:.1f}")

Scenario: An automotive engineer wants to estimate MPG from horsepower and weight data.

data = {
    'horsepower': [130, 165, 150, 150, 140, 198, 220, 215],
    'weight': [3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312],
    'mpg': [18, 15, 18, 16, 17, 14, 14, 14]
}
Show Solution
import pandas as pd
from sklearn.linear_model import LinearRegression

data = {
    'horsepower': [130, 165, 150, 150, 140, 198, 220, 215],
    'weight': [3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312],
    'mpg': [18, 15, 18, 16, 17, 14, 14, 14]
}
df = pd.DataFrame(data)

X = df[['horsepower', 'weight']]
y = df['mpg']

model = LinearRegression()
model.fit(X, y)

print(f"Intercept: {model.intercept_:.4f}")
print(f"Horsepower coef: {model.coef_[0]:.6f}")
print(f"Weight coef: {model.coef_[1]:.6f}")

# Interpretation: check the printed signs - in this sample, heavier and
# more powerful cars get lower MPG. Note that raw coefficients are not
# directly comparable (and with two highly correlated predictors they
# can be unstable); standardize before judging relative importance.

Scenario: A school wants to know which factor (study hours, sleep, caffeine) matters most for exam scores.

Task: Use standardized coefficients to determine the most important predictor.

df = pd.DataFrame({
    'hours_studied': [2, 3, 4, 5, 6, 7, 8, 9, 10],
    'sleep_hours': [5, 6, 7, 6, 8, 7, 6, 8, 7],
    'caffeine_mg': [100, 150, 200, 250, 100, 150, 300, 200, 250],
    'exam_score': [65, 70, 75, 78, 85, 82, 88, 92, 95]
})
Show Solution
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'hours_studied': [2, 3, 4, 5, 6, 7, 8, 9, 10],
    'sleep_hours': [5, 6, 7, 6, 8, 7, 6, 8, 7],
    'caffeine_mg': [100, 150, 200, 250, 100, 150, 300, 200, 250],
    'exam_score': [65, 70, 75, 78, 85, 82, 88, 92, 95]
})

X = df[['hours_studied', 'sleep_hours', 'caffeine_mg']]
y = df['exam_score']

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit on standardized data
model = LinearRegression()
model.fit(X_scaled, y)

# Compare standardized coefficients
print("Standardized Coefficients (Feature Importance):")
for feature, coef in zip(X.columns, model.coef_):
    print(f"  {feature}: {coef:+.4f}")

# hours_studied has largest coefficient - most important!
04

Assumptions & Diagnostics

Linear regression relies on several assumptions that must be checked for valid inference. Violating these assumptions can lead to biased estimates, incorrect p-values, and poor predictions. The good news: we can detect violations using diagnostic plots and statistical tests.

Figure 6.4.6: Diagnostic plots showing assumption violations (bottom) vs. satisfied assumptions (top)

The LINE Assumptions

Remember the acronym LINE for the four key assumptions:

Linearity

The relationship between predictors and outcome should be linear. Check residual vs. fitted plots and look for curves or patterns.

Independence

Observations should be independent of each other. Violated in time series data (autocorrelation). Check with Durbin-Watson test.

No Heteroscedasticity

Residuals should have constant variance across all levels of predictors. Look for "funnel shapes" in residual plots.

Errors Normal

Residuals should be approximately normally distributed for valid confidence intervals and hypothesis tests.

Diagnostic Tests

Figure 6.4.7: Residual plot showing the difference between actual and predicted values
import statsmodels.api as sm
from scipy import stats

# Add constant for intercept (statsmodels requires this explicitly)
X_with_const = sm.add_constant(X_multi)

# Fit OLS model for comprehensive diagnostics
ols_model = sm.OLS(y_multi, X_with_const).fit()
print(ols_model.summary())

# Extract residuals for analysis
residuals = ols_model.resid
fitted_values = ols_model.fittedvalues

# ============================================
# TEST 1: Normality of Residuals (Shapiro-Wilk)
# ============================================
stat, p_value = stats.shapiro(residuals)
print(f"\nShapiro-Wilk Test for Normality:")
print(f"  P-value: {p_value:.4f}")
print(f"  Result: {'✓ Normal' if p_value > 0.05 else '✗ NOT Normal'} (α=0.05)")

# ============================================
# TEST 2: Multicollinearity (VIF)
# ============================================
from statsmodels.stats.outliers_influence import variance_inflation_factor

print("\nVariance Inflation Factors (VIF):")
for i, col in enumerate(X_multi.columns):
    vif = variance_inflation_factor(X_multi.values, i)
    status = "✓" if vif < 5 else "⚠" if vif < 10 else "✗"
    print(f"  {col}: {vif:.2f} {status}")
print("  Rule: VIF > 5 = moderate, VIF > 10 = severe multicollinearity")
Multicollinearity Warning

VIF (Variance Inflation Factor) measures how much a coefficient's variance is inflated due to correlation with other predictors:

  • VIF = 1: No correlation with other predictors (ideal)
  • VIF > 5: Moderate multicollinearity (investigate)
  • VIF > 10: Severe multicollinearity (take action!)

Solutions: Remove correlated predictors, combine them into one feature, or use Ridge/Lasso regression.
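As a hedged sketch of the Ridge remedy: when two predictors are nearly identical, OLS may split their shared effect into unstable, offsetting coefficients, while Ridge's L2 penalty shrinks them toward a stable compromise. The data and alpha value below are illustrative assumptions, not from this lesson's dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # x2 is nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)  # only x1 truly drives y

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS may split the shared effect into large, offsetting values;
# Ridge shrinks both toward roughly equal shares that sum to ~3
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```

The key point: Ridge doesn't decide which of the correlated predictors "wins"; it spreads the effect across them, which stabilizes the estimates. Lasso (L1 penalty) would instead tend to zero one of them out entirely.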

Practice: Model Diagnostics

Scenario: A colleague built a model predicting employee performance from tenure. The R² is 0.85 but something seems off. Actual vs predicted values:

y_actual = np.array([45, 52, 58, 63, 70, 75, 82, 88, 95, 100])
y_pred = np.array([50, 52, 56, 62, 68, 76, 80, 86, 94, 96])

Task: Calculate residuals, check if they're normally distributed (Shapiro-Wilk), and identify if there's a pattern suggesting model issues.

Show Solution
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

y_actual = np.array([45, 52, 58, 63, 70, 75, 82, 88, 95, 100])
y_pred = np.array([50, 52, 56, 62, 68, 76, 80, 86, 94, 96])

# Calculate residuals
residuals = y_actual - y_pred

print("Residual Statistics:")
print(f"  Mean: {residuals.mean():.2f} (should be ~0)")
print(f"  Std Dev: {residuals.std():.2f}")
print(f"  Min: {residuals.min()}, Max: {residuals.max()}")

# Shapiro-Wilk test for normality
stat, p_value = stats.shapiro(residuals)
print(f"\nShapiro-Wilk Test:")
print(f"  Statistic: {stat:.4f}")
print(f"  P-value: {p_value:.4f}")
print(f"  {'Normal' if p_value > 0.05 else 'NOT Normal'} distribution")

# Look for patterns
print(f"\nResiduals: {residuals}")
print("\nDiagnosis:")
print("- Residuals trend from negative (model overpredicts at low tenure)")
print("- toward positive (model underpredicts at high tenure)")
print("- A systematic trend like this suggests a non-linear relationship!")
print("- Consider polynomial regression or transforming the predictor")

Scenario: Before publishing results, you must verify the normality assumption for valid inference.

Task: Fit a model and test if residuals are normally distributed using the Shapiro-Wilk test.

Show Solution
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Create sample data
np.random.seed(42)
X = np.random.rand(50, 1) * 10
y = 2 * X.flatten() + 3 + np.random.randn(50) * 2

# Fit model
model = LinearRegression()
model.fit(X, y)

# Get residuals
residuals = y - model.predict(X)

# Shapiro-Wilk test
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk statistic: {stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value > 0.05:
    print("Result: Residuals are normally distributed (fail to reject H0)")
else:
    print("Result: Residuals are NOT normally distributed (reject H0)")

Scenario: Your model coefficients have unexpected signs. Use VIF to detect multicollinearity.

Task: Calculate VIF for features and identify which ones should be removed.

df = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'x2': [1.1, 2.0, 3.1, 4.0, 5.1, 6.0, 7.1, 8.0, 9.1, 10.0],  # ~x1
    'x3': [5, 3, 6, 2, 8, 4, 7, 1, 9, 5]  # Independent
})
Show Solution
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'x2': [1.1, 2.0, 3.1, 4.0, 5.1, 6.0, 7.1, 8.0, 9.1, 10.0],
    'x3': [5, 3, 6, 2, 8, 4, 7, 1, 9, 5]
})

# Calculate VIF for each feature
vif_data = pd.DataFrame()
vif_data['Feature'] = df.columns
vif_data['VIF'] = [variance_inflation_factor(df.values, i) 
                   for i in range(df.shape[1])]

print(vif_data)
# x1 and x2 will have HIGH VIF (>10) due to near-perfect correlation
# x3 will have low VIF (~1)

# Solution: Remove x1 OR x2 (they're nearly identical)
print("\nRecommendation: Remove x1 or x2 - they carry the same information")

Key Takeaways

Correlation Measures Association

Pearson measures linear relationships, while Spearman measures monotonic relationships between variables

Correlation is Not Causation

High correlation does not prove one variable causes changes in another; a confounding third variable may drive both

Simple Regression Uses One Predictor

Models the linear relationship between one independent and one dependent variable

Multiple Regression Uses Many Predictors

Each coefficient shows the effect of one variable while holding others constant

R-squared Measures Explained Variance

Indicates how much of the outcome variation is explained by the model

Check Model Assumptions

Verify linearity, independence, homoscedasticity, and normality for valid results

Knowledge Check

Test your understanding of correlation and regression:

Question 1 of 6

A Pearson correlation coefficient of -0.85 indicates:

Question 2 of 6

When should you use Spearman correlation instead of Pearson?

Question 3 of 6

In the regression equation y = 3x + 10, what does the coefficient 3 represent?

Question 4 of 6

An R-squared value of 0.75 means:

Question 5 of 6

Which assumption violation is indicated by a funnel-shaped residual plot?

Question 6 of 6

What does a VIF value greater than 10 indicate?
