Correlation Analysis
Correlation measures the strength and direction of the linear relationship between two variables. Understanding correlation is fundamental to data science because it helps identify which variables are related before building predictive models. A positive correlation means both variables increase together, while a negative correlation means one increases as the other decreases.
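A two-line check with toy numbers makes the sign concrete; `np.corrcoef` is one quick way to compute r:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = np.array([2.0, 4.0, 6.0, 8.0, 10.0])    # rises with x
y_down = np.array([10.0, 8.0, 6.0, 4.0, 2.0])  # falls as x rises

# np.corrcoef returns a 2x2 correlation matrix; [0, 1] is r(x, y)
print(round(np.corrcoef(x, y_up)[0, 1], 4))    # 1.0 (perfect positive)
print(round(np.corrcoef(x, y_down)[0, 1], 4))  # -1.0 (perfect negative)
```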
Interactive Correlation Explorer
Explore: Drag the slider to see how different correlation values look. Watch how the scatter plot changes as r moves from -1 (perfect negative) to +1 (perfect positive).

r = 0: No linear relationship exists. Knowing the value of X tells you nothing about Y. The data points form a circular cloud with no discernible pattern.
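That caveat cuts both ways: r near 0 only rules out a *linear* pattern. A minimal sketch with made-up quadratic data, where Y is completely determined by X yet r is zero:

```python
import numpy as np

x = np.linspace(-3, 3, 101)
y = x ** 2  # Y is a perfect function of X - just not a linear one

r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.4f}")  # ~0: the symmetric curve has no linear trend
```

Knowing X here predicts Y exactly, yet Pearson's r sees nothing, which is why scatter plots matter.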
Pearson Correlation Coefficient
The Pearson correlation coefficient (r) is the most widely used measure of linear association between two continuous variables. Named after Karl Pearson, it quantifies both the strength and direction of a relationship by calculating the covariance divided by the product of standard deviations:
r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √[Σ(xᵢ - x̄)² × Σ(yᵢ - ȳ)²]
import numpy as np
import pandas as pd
from scipy import stats
# Sample data: study hours and exam scores
study_hours = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
exam_scores = np.array([65, 70, 72, 78, 82, 85, 88, 92, 95, 98])
# Calculate Pearson correlation using SciPy
correlation, p_value = stats.pearsonr(study_hours, exam_scores)
print(f"Pearson r: {correlation:.4f}") # 0.9967 (very strong positive)
print(f"P-value: {p_value:.6f}") # Very small = statistically significant
# Manual calculation to understand the formula
x_mean, y_mean = np.mean(study_hours), np.mean(exam_scores)
numerator = np.sum((study_hours - x_mean) * (exam_scores - y_mean))
denominator = np.sqrt(np.sum((study_hours - x_mean)**2) * np.sum((exam_scores - y_mean)**2))
r_manual = numerator / denominator
print(f"Manual calculation: {r_manual:.4f}") # Same result: 0.9967
Rule-of-thumb strength bands:
- Strong (|r| > 0.7): variables strongly move together
- Moderate (|r| 0.4 - 0.7): noticeable association
- Weak (|r| < 0.4): little linear relationship
Spearman Rank Correlation
Spearman's rank correlation (ρ or rho) is a non-parametric measure that assesses monotonic relationships, whether linear or not. Instead of using raw values, it converts data to ranks and then computes correlation on those ranks. This makes it:
- Robust to outliers: Extreme values have less influence since they become just another rank
- Suitable for ordinal data: Works with ranked categories (e.g., survey responses 1-5)
- Detects monotonic relationships: Catches relationships where variables consistently move together, even if not linearly
# Spearman correlation - robust to outliers and non-linear monotonic relationships
spearman_r, spearman_p = stats.spearmanr(study_hours, exam_scores)
print(f"Spearman rho: {spearman_r:.4f}") # 1.0000 (perfect monotonic)
# Demonstrate robustness to outliers
x_with_outlier = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 50]) # Outlier: 50!
y_with_outlier = np.array([65, 70, 72, 78, 82, 85, 88, 92, 95, 98])
pearson_outlier, _ = stats.pearsonr(x_with_outlier, y_with_outlier)
spearman_outlier, _ = stats.spearmanr(x_with_outlier, y_with_outlier)
print(f"With outlier - Pearson: {pearson_outlier:.4f}") # ~0.64 (heavily affected!)
print(f"With outlier - Spearman: {spearman_outlier:.4f}") # 1.0000 (completely robust!)
Correlation Matrix
When analyzing multiple variables simultaneously, a correlation matrix displays all pairwise correlations in a compact table format. This is essential for exploratory data analysis (EDA) to understand variable relationships before building models.
# Create a DataFrame with multiple variables
df = pd.DataFrame({
'study_hours': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
'sleep_hours': [5, 6, 7, 6, 8, 7, 6, 8, 7, 8],
'previous_gpa': [2.5, 2.8, 3.0, 3.2, 3.5, 3.3, 3.6, 3.8, 3.9, 4.0],
'attendance': [70, 75, 80, 85, 88, 90, 92, 94, 96, 98],
'exam_score': [65, 70, 72, 78, 82, 85, 88, 92, 95, 98]
})
# Generate correlation matrix
corr_matrix = df.corr(method='pearson')
print(corr_matrix.round(2))
# Find highly correlated pairs (excluding diagonal)
import itertools
print("\nHighly correlated pairs (|r| > 0.7):")
for col1, col2 in itertools.combinations(df.columns, 2):
r = corr_matrix.loc[col1, col2]
if abs(r) > 0.7:
print(f" {col1} & {col2}: r = {r:.3f}")
Visualizing Correlations
Correlation coefficients tell only part of the story. Visualizations reveal patterns, outliers, and non-linear relationships that raw numbers might miss. Always combine numerical analysis with visual inspection to fully understand your data relationships.
import seaborn as sns
import matplotlib.pyplot as plt
# Create a correlation heatmap
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Heatmap with annotations
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
vmin=-1, vmax=1, square=True, ax=axes[0],
fmt='.2f', linewidths=0.5)
axes[0].set_title('Correlation Matrix Heatmap')
# Pairplot for visual inspection of all relationships
# (Better to run separately for large figures)
# sns.pairplot(df, diag_kind='kde')
# Scatter plot with regression line for key relationship
sns.regplot(x='study_hours', y='exam_score', data=df, ax=axes[1],
scatter_kws={'alpha': 0.7}, line_kws={'color': 'red'})
axes[1].set_title(f'Study Hours vs Exam Score (r = {corr_matrix.loc["study_hours", "exam_score"]:.3f})')
plt.tight_layout()
plt.show()
# Clustermap - reorders variables by similarity
# (clustermap creates its own figure, so don't call plt.figure() first)
g = sns.clustermap(corr_matrix, annot=True, cmap='coolwarm', center=0,
                   vmin=-1, vmax=1, figsize=(8, 6))
g.fig.suptitle('Clustered Correlation Matrix')
plt.show()
Visualization Best Practices
When visualizing correlations:
- Use diverging colormaps: coolwarm or RdBu centers at 0, making positive/negative clear
- Always check scatter plots: Anscombe's Quartet shows identical correlations with vastly different patterns
- Consider clustering: Clustermaps reveal groups of related variables
- Watch for outliers: A single point can dramatically change correlation
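Anscombe's point is easy to verify with the quartet's published values; datasets I (a noisy line) and II (a smooth curve) are reproduced below, and their r values agree to three decimals:

```python
import numpy as np

# Anscombe's quartet, datasets I and II (classic published values)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

r1 = np.corrcoef(x, y1)[0, 1]
r2 = np.corrcoef(x, y2)[0, 1]
print(f"dataset I  (linear-ish): r = {r1:.3f}")  # ~0.816
print(f"dataset II (parabola):   r = {r2:.3f}")  # ~0.816 as well
```

Plot the two and they look nothing alike; only the scatter plot reveals that dataset II is curved.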
Correlation ≠ Causation
A high correlation between two variables does not mean one causes the other. Common pitfalls include:
- Confounding variables: A third variable may cause both (e.g., ice cream sales and drowning both increase in summer, both caused by hot weather)
- Reverse causation: The direction might be opposite (does A cause B, or B cause A?)
- Spurious correlation: Pure coincidence, especially with enough variables
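A small simulation (all numbers invented) makes the confounding pitfall concrete: neither series below causes the other, yet they correlate strongly because both are driven by temperature:

```python
import numpy as np

rng = np.random.default_rng(0)
temperature = rng.uniform(10, 35, 500)                  # the hidden common cause
ice_cream = 2.0 * temperature + rng.normal(0, 5, 500)   # sales driven by heat
drownings = 0.3 * temperature + rng.normal(0, 2, 500)   # swimming driven by heat

# Strong correlation despite zero causal link between the two series
r = np.corrcoef(ice_cream, drownings)[0, 1]
print(f"r(ice_cream, drownings) = {r:.2f}")
```

Controlling for temperature (e.g., partial correlation) would make the apparent relationship largely vanish.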
Practice: Correlation Analysis
Scenario: A marketing team has data on monthly advertising spend (thousands $) and resulting sales (thousands $):
ads = [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
sales = [85, 95, 110, 125, 135, 145, 160, 170, 185, 195]
Task: Calculate the Pearson correlation, interpret the strength, and determine if the relationship is statistically significant (α = 0.05).
Show Solution
from scipy import stats
import numpy as np
ads = [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
sales = [85, 95, 110, 125, 135, 145, 160, 170, 185, 195]
r, p_value = stats.pearsonr(ads, sales)
print(f"Pearson r: {r:.4f}")
print(f"P-value: {p_value:.6f}")
print(f"\nInterpretation:")
print(f"- r = {r:.2f}: Very strong positive correlation")
print(f"- p < 0.05: Statistically significant")
print(f"- R² = {r**2:.2%} of sales variance explained by ad spend")
print(f"\nConclusion: Strong evidence that increased advertising")
print(f"is associated with increased sales.")
Scenario: A tutor collected data on study hours vs quiz scores.
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
Task: Calculate the Pearson correlation and determine if it's statistically significant.
Show Solution
from scipy import stats
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
r, p_value = stats.pearsonr(x, y)
print(f"Pearson r: {r:.4f}") # 0.7746
print(f"P-value: {p_value:.4f}") # 0.1243
# Interpretation: Moderate-strong positive correlation
# but p-value > 0.05 suggests not statistically significant
# (sample size is small)
Scenario: HR data includes a 100-year-old CEO founder - an outlier that could skew analysis.
ages = np.array([25, 30, 35, 40, 45, 50, 55, 60, 100]) # Note outlier!
income = np.array([30, 40, 50, 60, 70, 80, 85, 90, 95])
Task: Calculate both Pearson and Spearman correlations, explain the difference, and recommend which to use.
Show Solution
from scipy import stats
ages = np.array([25, 30, 35, 40, 45, 50, 55, 60, 100])
income = np.array([30, 40, 50, 60, 70, 80, 85, 90, 95])
pearson_r, _ = stats.pearsonr(ages, income)
spearman_r, _ = stats.spearmanr(ages, income)
print(f"Pearson r: {pearson_r:.4f}") # ~0.85 (affected by outlier)
print(f"Spearman rho: {spearman_r:.4f}") # 1.00 (robust!)
# Spearman is more appropriate because:
# 1. There's an outlier (age 100) skewing Pearson
# 2. Spearman uses ranks, making it robust to extreme values
# 3. The monotonic relationship is perfect when using ranks
Scenario: Marketing wants to know which factor drives sales most. Build a correlation matrix and identify the strongest driver.
Show Solution
import pandas as pd
import numpy as np
# Create sample data
np.random.seed(42)
df = pd.DataFrame({
'advertising': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55],
'sales': [100, 150, 180, 220, 250, 300, 350, 380, 420, 460],
'employees': [5, 7, 8, 10, 12, 14, 16, 18, 20, 22],
'satisfaction': [3.2, 3.5, 3.8, 4.0, 4.1, 4.3, 4.4, 4.5, 4.6, 4.7]
})
# Generate correlation matrix
corr_matrix = df.corr()
print(corr_matrix.round(3))
# Find strongest correlation (excluding diagonal)
import itertools
strongest = {'pair': None, 'r': 0}
for col1, col2 in itertools.combinations(df.columns, 2):
r = abs(corr_matrix.loc[col1, col2])
if r > strongest['r']:
strongest = {'pair': (col1, col2), 'r': r}
print(f"\nStrongest: {strongest['pair']} with r = {strongest['r']:.3f}")
Scenario: Before building a ML model, you need to identify highly correlated features that could cause multicollinearity.
Task: Create correlation_report(df, threshold=0.7) that returns all highly correlated pairs above the threshold for feature selection.
Show Solution
def correlation_report(df, threshold=0.7):
"""Generate report of highly correlated variable pairs."""
# Get numeric columns only
numeric_df = df.select_dtypes(include=['number'])
corr_matrix = numeric_df.corr()
# Find pairs above threshold
high_corr = []
for i in range(len(corr_matrix.columns)):
for j in range(i+1, len(corr_matrix.columns)):
col1 = corr_matrix.columns[i]
col2 = corr_matrix.columns[j]
r = corr_matrix.iloc[i, j]
if abs(r) >= threshold:
high_corr.append({
'var1': col1,
'var2': col2,
'correlation': round(r, 4),
'strength': 'Strong' if abs(r) >= 0.7 else 'Moderate',
'direction': 'Positive' if r > 0 else 'Negative'
})
# Create report DataFrame
if high_corr:
report = pd.DataFrame(high_corr)
report = report.sort_values('correlation', key=abs, ascending=False)
return report
else:
return f"No correlations above threshold {threshold}"
# Test
report = correlation_report(df, threshold=0.7)
print(report)
# The report lists every pair with |r| >= threshold, strongest first.
# In this toy data nearly every pair is strongly correlated, so expect
# several rows (advertising & sales near the top).
Simple Linear Regression
Simple linear regression models the relationship between a single independent variable (predictor) and a dependent variable (outcome). The model finds the best-fit line through the data points using Ordinary Least Squares (OLS), which minimizes the sum of squared errors between predicted and actual values.
The Regression Equation
The simple linear regression equation is y = mx + b (or y = β₀ + β₁x in statistical notation):
Slope (m or β₁)
The rate of change: how much Y changes for each one-unit increase in X. A slope of 3.5 means Y increases by 3.5 units when X increases by 1.
Intercept (b or β₀)
The starting point: the predicted value of Y when X equals zero. Represents the baseline before the predictor has any effect.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np
# Sample data: study hours and exam scores
study_hours = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
exam_scores = np.array([65, 70, 72, 78, 82, 85, 88, 92, 95, 98])
# Prepare data - sklearn expects 2D array for X
X = study_hours.reshape(-1, 1)
y = exam_scores
# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Create and fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# Extract the equation parameters
slope = model.coef_[0]
intercept = model.intercept_
print(f"Regression Equation: score = {slope:.2f} × hours + {intercept:.2f}")
print(f"Slope: {slope:.4f} (each extra study hour → +{slope:.1f} points)")
print(f"Intercept: {intercept:.4f} (baseline score with 0 hours)")
# Make a prediction
new_hours = 12
predicted_score = model.predict([[new_hours]])[0]
print(f"\nPrediction: {new_hours} hours of study → {predicted_score:.1f} points")
Model Evaluation Metrics
After fitting a model, we need to assess how well it performs. Three essential metrics help evaluate regression models:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
# Make predictions on test set
y_pred = model.predict(X_test)
# Calculate evaluation metrics
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
print(f"R-squared: {r2:.4f}") # close to 1 on this near-linear data
print(f"RMSE: {rmse:.4f}") # typical error, in score points
print(f"MAE: {mae:.4f}") # average absolute error, in score points
# (with only 2 test points these numbers are unstable - shown for illustration)
# Understanding the metrics:
print("\n--- Metric Interpretations ---")
print(f"R²: Model explains {r2*100:.1f}% of score variance")
print(f"RMSE: Predictions are off by ~{rmse:.1f} points on average")
print(f"MAE: Average absolute error is {mae:.1f} points")
R² (R-squared)
Range: 0 to 1
Proportion of variance in Y explained by the model. R² = 0.85 means 85% of variance is explained. Higher is better.
RMSE
Range: 0 to ∞ (same units as Y)
Penalizes large errors more heavily. Lower is better. Useful when big mistakes are costly.
MAE
Range: 0 to ∞ (same units as Y)
Average absolute error. More robust to outliers than RMSE. Lower is better.
Interpreting R-squared
An R² of 0.98 means the model explains 98% of the variance in exam scores based on study hours. Higher values indicate better fit, but beware of overfitting with complex models. Always validate on unseen test data!
Visualizing Regression Results
A well-crafted regression visualization shows not just the best-fit line, but also the uncertainty around predictions. Confidence intervals and residual plots help assess model quality and identify potential issues like heteroscedasticity or influential outliers.
import matplotlib.pyplot as plt
import seaborn as sns
# Create figure with subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# 1. Regression line with confidence interval
sns.regplot(x=study_hours, y=exam_scores, ax=axes[0],
ci=95, scatter_kws={'s': 60, 'alpha': 0.7})
axes[0].set_xlabel('Study Hours')
axes[0].set_ylabel('Exam Score')
axes[0].set_title('Regression with 95% CI')
# 2. Actual vs Predicted
y_pred = model.predict(X)
axes[1].scatter(y, y_pred, alpha=0.7)
axes[1].plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
axes[1].set_xlabel('Actual Scores')
axes[1].set_ylabel('Predicted Scores')
axes[1].set_title('Actual vs Predicted')
# 3. Residual plot
residuals = y - y_pred
axes[2].scatter(y_pred, residuals, alpha=0.7)
axes[2].axhline(y=0, color='r', linestyle='--')
axes[2].set_xlabel('Predicted Values')
axes[2].set_ylabel('Residuals')
axes[2].set_title('Residual Plot')
plt.tight_layout()
plt.show()
# Check if residuals look random (good) or show patterns (bad)
print(f"Residual mean: {residuals.mean():.4f}") # Should be ~0
print(f"Residual std: {residuals.std():.4f}")
Practice: Simple Linear Regression
Scenario: A real estate company has square footage and sale prices for recent homes:
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245000, 312000, 279000, 308000, 199000, 219000, 405000, 324000, 319000, 255000]
Task: Build a regression model, calculate R² and RMSE, and predict the price for a 2000 sqft home. Discuss whether the model is reliable.
Show Solution
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
sqft = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245000, 312000, 279000, 308000, 199000, 219000, 405000, 324000, 319000, 255000])
# Reshape for sklearn
X = sqft.reshape(-1, 1)
y = price
# Fit model
model = LinearRegression()
model.fit(X, y)
# Make predictions
y_pred = model.predict(X)
# Evaluate
r2 = r2_score(y, y_pred)
rmse = np.sqrt(mean_squared_error(y, y_pred))
print(f"Equation: Price = ${model.coef_[0]:.2f} × sqft + ${model.intercept_:,.0f}")
print(f"R² Score: {r2:.4f}")
print(f"RMSE: ${rmse:,.0f}")
# Predict for 2000 sqft
new_price = model.predict([[2000]])[0]
print(f"\nPredicted price for 2000 sqft: ${new_price:,.0f}")
# Assessment
print(f"\nModel Assessment:")
print(f"- R² = {r2:.2%} variance explained (moderate fit)")
print(f"- RMSE = ${rmse:,.0f} average error")
print(f"- Use cautiously - other factors affect price!")
Scenario: Marketing needs to forecast sales if they increase budget to $900.
advertising = np.array([100, 200, 300, 400, 500, 600, 700, 800])
sales = np.array([10, 22, 28, 38, 45, 55, 60, 68])
Task: Build a regression model, interpret the coefficients, and predict sales for $900 spend.
Show Solution
from sklearn.linear_model import LinearRegression
import numpy as np
advertising = np.array([100, 200, 300, 400, 500, 600, 700, 800])
sales = np.array([10, 22, 28, 38, 45, 55, 60, 68])
# Reshape for sklearn
X = advertising.reshape(-1, 1)
y = sales
# Fit model
model = LinearRegression()
model.fit(X, y)
print(f"Equation: sales = {model.coef_[0]:.4f} × advertising + {model.intercept_:.4f}")
# sales ≈ 0.0814 × advertising + 4.11
# Predict for $900
prediction = model.predict([[900]])[0]
print(f"Predicted sales for $900: {prediction:.1f} units") # ~77.4 units
Scenario: Logistics built a delivery time model. Assess if it's production-ready.
y_true = np.array([100, 150, 200, 250, 300]) # Actual delivery times (min)
y_pred = np.array([110, 145, 195, 260, 290]) # Predicted times
Task: Calculate R², RMSE, and MAE. Provide a recommendation on model readiness.
Show Solution
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np
y_true = np.array([100, 150, 200, 250, 300])
y_pred = np.array([110, 145, 195, 260, 290])
r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
print(f"R-squared: {r2:.4f}") # 0.9860 (excellent!)
print(f"RMSE: {rmse:.4f}") # 8.37
print(f"MAE: {mae:.4f}") # 8.00
# Assessment: Very good model!
# - R² of 0.99 means 99% of variance explained
# - RMSE/MAE of ~8-9 units on values ranging 100-300 is small
# - Errors are roughly 4% of the mean value
Scenario: Avoid overfitting by implementing proper model validation with holdout data.
Task: Create synthetic data, split 80/20, train a model on training data, and evaluate only on test data.
Show Solution
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
# Create dataset
np.random.seed(42)
X = np.random.rand(100, 1) * 10 # 100 samples, values 0-10
y = 3 * X.flatten() + 5 + np.random.randn(100) * 2 # y = 3x + 5 + noise
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"Train: {len(X_train)}, Test: {len(X_test)}")
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate on test set (NOT train set!)
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Test R²: {r2:.4f}")
print(f"Test RMSE: {rmse:.4f}")
print(f"True equation: y = 3x + 5")
print(f"Learned: y = {model.coef_[0]:.2f}x + {model.intercept_:.2f}")
Task: Write a function simple_regression_pipeline(X, y, test_size=0.2) that performs training, testing, and returns a dictionary with the model and all metrics.
Show Solution
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np
def simple_regression_pipeline(X, y, test_size=0.2, random_state=42):
"""
Complete simple regression pipeline.
Returns model and comprehensive metrics.
"""
# Ensure X is 2D
if len(X.shape) == 1:
X = X.reshape(-1, 1)
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=test_size, random_state=random_state
)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Metrics
results = {
'model': model,
'slope': model.coef_[0],
'intercept': model.intercept_,
'train_r2': r2_score(y_train, y_train_pred),
'test_r2': r2_score(y_test, y_test_pred),
'train_rmse': np.sqrt(mean_squared_error(y_train, y_train_pred)),
'test_rmse': np.sqrt(mean_squared_error(y_test, y_test_pred)),
'train_mae': mean_absolute_error(y_train, y_train_pred),
'test_mae': mean_absolute_error(y_test, y_test_pred),
'n_train': len(X_train),
'n_test': len(X_test)
}
return results
# Test
X = np.random.rand(100) * 10
y = 2.5 * X + 10 + np.random.randn(100) * 3
results = simple_regression_pipeline(X, y)
print(f"Equation: y = {results['slope']:.2f}x + {results['intercept']:.2f}")
print(f"Train R²: {results['train_r2']:.4f}, Test R²: {results['test_r2']:.4f}")
print(f"Test RMSE: {results['test_rmse']:.4f}")
Multiple Linear Regression
Multiple linear regression extends simple regression by including multiple predictor variables. This allows modeling more complex real-world scenarios where outcomes depend on several factors. Each coefficient represents the effect of one variable while holding all others constant, a key concept called "ceteris paribus" (all else being equal).
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Where ε represents the error term (residuals)
Building a Multiple Regression Model
import pandas as pd
from sklearn.linear_model import LinearRegression
# Multiple predictors: hours studied, sleep hours, previous GPA
data = pd.DataFrame({
'hours_studied': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
'sleep_hours': [5, 6, 7, 6, 8, 7, 6, 8, 7, 8],
'previous_gpa': [2.5, 2.8, 3.0, 3.2, 3.5, 3.3, 3.6, 3.8, 3.9, 4.0],
'exam_score': [65, 70, 72, 78, 82, 85, 88, 92, 95, 98]
})
# Features and target
X_multi = data[['hours_studied', 'sleep_hours', 'previous_gpa']]
y_multi = data['exam_score']
# Fit multiple regression model
multi_model = LinearRegression()
multi_model.fit(X_multi, y_multi)
# Display the equation
print(f"Intercept: {multi_model.intercept_:.4f}")
print("\nCoefficients (effect of each variable):")
for feature, coef in zip(X_multi.columns, multi_model.coef_):
print(f" {feature}: {coef:.4f}")
# Interpretation
print("\n--- Interpretation ---")
print(f"Each extra study hour → +{multi_model.coef_[0]:.2f} points (holding sleep & GPA constant)")
print(f"Each extra sleep hour → +{multi_model.coef_[1]:.2f} points (holding study & GPA constant)")
print(f"Each GPA point increase → +{multi_model.coef_[2]:.2f} points (holding study & sleep constant)")
Feature Importance & Standardization
Raw coefficients can't be compared directly because features have different scales. A coefficient of 3.5 for hours (range: 2-11) isn't comparable to 8.2 for GPA (range: 2.5-4.0). Standardizing features (mean=0, std=1) makes coefficients directly comparable:
from sklearn.preprocessing import StandardScaler
# Standardize features (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_multi)
# Fit model on standardized data
model_scaled = LinearRegression()
model_scaled.fit(X_scaled, y_multi)
print("Standardized Coefficients (Feature Importance):")
print("(Larger absolute value = more important)")
for feature, coef in zip(X_multi.columns, model_scaled.coef_):
importance = "★" * min(5, int(abs(coef) / 2) + 1)
print(f" {feature}: {coef:+.4f} {importance}")
Adjusted R² vs Regular R²
Regular R² always increases when you add more predictors, even useless ones! Adjusted R² penalizes model complexity:
Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - p - 1)]
Where n = sample size, p = number of predictors. Use this when comparing models with different numbers of features.
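A quick sketch of the formula in action (synthetic data; the `adjusted_r2` helper is ours, not a library function): adding a junk predictor nudges R² up but can drag adjusted R² down:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(42)
n = 50
x_real = rng.uniform(0, 10, n)
junk = rng.normal(size=n)                      # predictor unrelated to y
y = 3 * x_real + 5 + rng.normal(0, 2, n)

X1 = x_real.reshape(-1, 1)
X2 = np.column_stack([x_real, junk])
r2_one = r2_score(y, LinearRegression().fit(X1, y).predict(X1))
r2_two = r2_score(y, LinearRegression().fit(X2, y).predict(X2))

# Plain R2 can only go up when a predictor is added; adjusted R2 can go down
print(f"1 predictor:       R2={r2_one:.4f}  adj={adjusted_r2(r2_one, n, 1):.4f}")
print(f"+1 junk predictor: R2={r2_two:.4f}  adj={adjusted_r2(r2_two, n, 2):.4f}")
```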
Practice: Multiple Regression
Scenario: An auto company wants to predict MPG from vehicle specs:
data = {
'horsepower': [130, 165, 150, 150, 140, 198, 220, 215, 225, 190],
'weight': [3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 3850],
'cylinders': [8, 8, 8, 8, 8, 8, 8, 8, 8, 8],
'mpg': [18, 15, 18, 16, 17, 14, 14, 14, 12, 16]
}
Task: Build a multiple regression model, interpret the coefficients, and predict MPG for a car with 175 HP, 3800 lbs, and 6 cylinders.
Show Solution
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
data = {
'horsepower': [130, 165, 150, 150, 140, 198, 220, 215, 225, 190],
'weight': [3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 3850],
'cylinders': [8, 8, 8, 8, 8, 8, 8, 8, 8, 8],
'mpg': [18, 15, 18, 16, 17, 14, 14, 14, 12, 16]
}
df = pd.DataFrame(data)
X = df[['horsepower', 'weight', 'cylinders']]
y = df['mpg']
model = LinearRegression()
model.fit(X, y)
print("Coefficients:")
for name, coef in zip(X.columns, model.coef_):
print(f" {name}: {coef:.6f}")
print(f" Intercept: {model.intercept_:.2f}")
print("\nInterpretation:")
print(f" - Each additional HP → {model.coef_[0]:.3f} MPG change")
print(f" - Each additional lb → {model.coef_[1]:.4f} MPG change")
# Predict for new car
# Note: cylinders is constant (all 8) in the training data, so its
# coefficient carries no information and 6 cylinders is pure extrapolation
new_car = pd.DataFrame([[175, 3800, 6]], columns=X.columns)
pred_mpg = model.predict(new_car)[0]
print(f"\nPredicted MPG for 175HP/3800lb/6cyl: {pred_mpg:.1f}")
Scenario: An automotive engineer wants to estimate MPG from horsepower and weight data.
data = {
'horsepower': [130, 165, 150, 150, 140, 198, 220, 215],
'weight': [3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312],
'mpg': [18, 15, 18, 16, 17, 14, 14, 14]
}
Show Solution
import pandas as pd
from sklearn.linear_model import LinearRegression
data = {
'horsepower': [130, 165, 150, 150, 140, 198, 220, 215],
'weight': [3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312],
'mpg': [18, 15, 18, 16, 17, 14, 14, 14]
}
df = pd.DataFrame(data)
X = df[['horsepower', 'weight']]
y = df['mpg']
model = LinearRegression()
model.fit(X, y)
print(f"Intercept: {model.intercept_:.4f}")
print(f"Horsepower coef: {model.coef_[0]:.6f}")
print(f"Weight coef: {model.coef_[1]:.6f}")
# Interpretation: both coefficients negative - more HP and weight = lower MPG
# (to say which matters more, compare standardized coefficients, not raw ones)
Scenario: A school wants to know which factor (study hours, sleep, caffeine) matters most for exam scores.
Task: Use standardized coefficients to determine the most important predictor.
df = pd.DataFrame({
'hours_studied': [2, 3, 4, 5, 6, 7, 8, 9, 10],
'sleep_hours': [5, 6, 7, 6, 8, 7, 6, 8, 7],
'caffeine_mg': [100, 150, 200, 250, 100, 150, 300, 200, 250],
'exam_score': [65, 70, 75, 78, 85, 82, 88, 92, 95]
})
Show Solution
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
df = pd.DataFrame({
'hours_studied': [2, 3, 4, 5, 6, 7, 8, 9, 10],
'sleep_hours': [5, 6, 7, 6, 8, 7, 6, 8, 7],
'caffeine_mg': [100, 150, 200, 250, 100, 150, 300, 200, 250],
'exam_score': [65, 70, 75, 78, 85, 82, 88, 92, 95]
})
X = df[['hours_studied', 'sleep_hours', 'caffeine_mg']]
y = df['exam_score']
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit on standardized data
model = LinearRegression()
model.fit(X_scaled, y)
# Compare standardized coefficients
print("Standardized Coefficients (Feature Importance):")
for feature, coef in zip(X.columns, model.coef_):
print(f" {feature}: {coef:+.4f}")
# hours_studied has largest coefficient - most important!
Assumptions & Diagnostics
Linear regression relies on several assumptions that must be checked for valid inference. Violating these assumptions can lead to biased estimates, incorrect p-values, and poor predictions. The good news: we can detect violations using diagnostic plots and statistical tests.
The LINE Assumptions
Remember the acronym LINE for the four key assumptions:
Linearity
The relationship between predictors and outcome should be linear. Check residual vs. fitted plots and look for curves or patterns.
Independence
Observations should be independent of each other. Violated in time series data (autocorrelation). Check with Durbin-Watson test.
No Heteroscedasticity
Residuals should have constant variance across all levels of predictors. Look for "funnel shapes" in residual plots.
Errors Normal
Residuals should be approximately normally distributed for valid confidence intervals and hypothesis tests.
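The diagnostics that follow cover normality and multicollinearity; for the independence check, `statsmodels` provides the Durbin-Watson statistic (values near 2 suggest no autocorrelation; values near 0 or 4 are red flags). A minimal sketch with simulated residuals:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(7)
white_noise = rng.normal(size=200)        # what healthy residuals look like
random_walk = np.cumsum(white_noise)      # heavily autocorrelated "residuals"

# DW ~ 2 means no autocorrelation; near 0 = positive, near 4 = negative
print(f"independent residuals:    DW = {durbin_watson(white_noise):.2f}")
print(f"autocorrelated residuals: DW = {durbin_watson(random_walk):.2f}")
```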
Diagnostic Tests
import statsmodels.api as sm
from scipy import stats
# Add constant for intercept (statsmodels requires this explicitly)
X_with_const = sm.add_constant(X_multi)
# Fit OLS model for comprehensive diagnostics
ols_model = sm.OLS(y_multi, X_with_const).fit()
print(ols_model.summary())
# Extract residuals for analysis
residuals = ols_model.resid
fitted_values = ols_model.fittedvalues
# ============================================
# TEST 1: Normality of Residuals (Shapiro-Wilk)
# ============================================
stat, p_value = stats.shapiro(residuals)
print(f"\nShapiro-Wilk Test for Normality:")
print(f" P-value: {p_value:.4f}")
print(f" Result: {'✓ Normal' if p_value > 0.05 else '✗ NOT Normal'} (α=0.05)")
# ============================================
# TEST 2: Multicollinearity (VIF)
# ============================================
from statsmodels.stats.outliers_influence import variance_inflation_factor
# VIF should be computed on a design matrix that includes the constant;
# without it the values come out artificially inflated
vif_X = sm.add_constant(X_multi)
print("\nVariance Inflation Factors (VIF):")
for i, col in enumerate(X_multi.columns):
    vif = variance_inflation_factor(vif_X.values, i + 1)  # skip the constant
    status = "✓" if vif < 5 else "⚠" if vif < 10 else "✗"
    print(f" {col}: {vif:.2f} {status}")
print(" Rule: VIF > 5 = moderate, VIF > 10 = severe multicollinearity")
Multicollinearity Warning
VIF (Variance Inflation Factor) measures how much a coefficient's variance is inflated due to correlation with other predictors:
- VIF = 1: No correlation with other predictors (ideal)
- VIF > 5: Moderate multicollinearity (investigate)
- VIF > 10: Severe multicollinearity (take action!)
Solutions: Remove correlated predictors, combine them into one feature, or use Ridge/Lasso regression.
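To illustrate the Ridge suggestion (toy data, invented numbers): with a near-duplicate predictor, OLS coefficients can become unstable while Ridge spreads the shared effect more evenly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 40
x1 = rng.uniform(0, 10, n)
x2 = x1 + rng.normal(0, 0.01, n)   # near-duplicate of x1 -> severe multicollinearity
y = 2 * x1 + rng.normal(0, 1, n)   # only x1 truly drives y

X = np.column_stack([x1, x2])
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS may split the shared effect erratically between the twins;
# Ridge shrinks the difference while the pair's total stays near 2
print("OLS coefs:  ", np.round(ols.coef_, 2))
print("Ridge coefs:", np.round(ridge.coef_, 2))
```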
Practice: Model Diagnostics
Scenario: A colleague built a model predicting employee performance from tenure. The R² is 0.85 but something seems off. Actual vs predicted values:
y_actual = np.array([45, 52, 58, 63, 70, 75, 82, 88, 95, 100])
y_pred = np.array([50, 52, 56, 62, 68, 76, 80, 86, 94, 96])
Task: Calculate residuals, check if they're normally distributed (Shapiro-Wilk), and identify if there's a pattern suggesting model issues.
Show Solution
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
y_actual = np.array([45, 52, 58, 63, 70, 75, 82, 88, 95, 100])
y_pred = np.array([50, 52, 56, 62, 68, 76, 80, 86, 94, 96])
# Calculate residuals
residuals = y_actual - y_pred
print("Residual Statistics:")
print(f" Mean: {residuals.mean():.2f} (should be ~0)")
print(f" Std Dev: {residuals.std():.2f}")
print(f" Min: {residuals.min()}, Max: {residuals.max()}")
# Shapiro-Wilk test for normality
stat, p_value = stats.shapiro(residuals)
print(f"\nShapiro-Wilk Test:")
print(f" Statistic: {stat:.4f}")
print(f" P-value: {p_value:.4f}")
print(f" {'Normal' if p_value > 0.05 else 'NOT Normal'} distribution")
# Look for patterns
print(f"\nResiduals: {residuals}")
print("\nDiagnosis:")
print("- The first observation is heavily overpredicted (residual -5)")
print("- Most later residuals are positive (mild underprediction)")
print("- A drift from negative to positive residuals hints at non-linearity")
print("- Consider polynomial regression or a transformation")
Scenario: Before publishing results, you must verify the normality assumption for valid inference.
Task: Fit a model and test if residuals are normally distributed using the Shapiro-Wilk test.
Show Solution
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
# Create sample data
np.random.seed(42)
X = np.random.rand(50, 1) * 10
y = 2 * X.flatten() + 3 + np.random.randn(50) * 2
# Fit model
model = LinearRegression()
model.fit(X, y)
# Get residuals
residuals = y - model.predict(X)
# Shapiro-Wilk test
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk statistic: {stat:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value > 0.05:
print("Result: Residuals are normally distributed (fail to reject H0)")
else:
print("Result: Residuals are NOT normally distributed (reject H0)")
Scenario: Your model coefficients have unexpected signs. Use VIF to detect multicollinearity.
Task: Calculate VIF for features and identify which ones should be removed.
df = pd.DataFrame({
'x1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'x2': [1.1, 2.0, 3.1, 4.0, 5.1, 6.0, 7.1, 8.0, 9.1, 10.0], # ~x1
'x3': [5, 3, 6, 2, 8, 4, 7, 1, 9, 5] # Independent
})
Show Solution
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
df = pd.DataFrame({
'x1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'x2': [1.1, 2.0, 3.1, 4.0, 5.1, 6.0, 7.1, 8.0, 9.1, 10.0],
'x3': [5, 3, 6, 2, 8, 4, 7, 1, 9, 5]
})
# Include a constant so VIF measures correlation among predictors,
# not correlation with the intercept
X = sm.add_constant(df)
vif_data = pd.DataFrame()
vif_data['Feature'] = df.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i + 1)
                   for i in range(df.shape[1])]
print(vif_data)
print(vif_data)
# x1 and x2 will have HIGH VIF (>10) due to near-perfect correlation
# x3 will have low VIF (~1)
# Solution: Remove x1 OR x2 (they're nearly identical)
print("\nRecommendation: Remove x1 or x2 - they carry same info")
Key Takeaways
Correlation Measures Association
Pearson measures linear relationships, Spearman measures monotonic relationships between variables
Correlation is Not Causation
High correlation does not prove one variable causes changes in another
Simple Regression Uses One Predictor
Models the linear relationship between one independent and one dependent variable
Multiple Regression Uses Many Predictors
Each coefficient shows the effect of one variable while holding others constant
R-squared Measures Explained Variance
Indicates how much of the outcome variation is explained by the model
Check Model Assumptions
Verify linearity, independence, homoscedasticity, and normality for valid results
Knowledge Check
Test your understanding of correlation and regression:
A Pearson correlation coefficient of -0.85 indicates:
When should you use Spearman correlation instead of Pearson?
In the regression equation y = 3x + 10, what does the coefficient 3 represent?
An R-squared value of 0.75 means:
Which assumption violation is indicated by a funnel-shaped residual plot?
What does a VIF value greater than 10 indicate?