Module 2.1

Master Linear Regression

The foundation of predictive modeling. Learn how to predict continuous values by fitting a line to your data, understand the mathematics behind it, and implement it with Python and scikit-learn.

50 min
Beginner
Hands-on
What You'll Learn
  • Master simple and multiple linear regression
  • Learn gradient descent and normal equation
  • Understand polynomial regression for non-linear data
  • Evaluate models using MSE, RMSE, MAE, and R²
  • Build end-to-end regression pipelines in Python
Contents
01

Understanding Linear Regression

Linear regression is the foundation of machine learning - it's where your journey into predictive modeling begins. At its core, linear regression answers a fundamental question: how can we predict a continuous value based on one or more input features? Imagine drawing the best-fit line through scattered data points - that's linear regression in action.

Key Concept

What is Linear Regression?

Linear regression is a supervised learning algorithm that models the relationship between input features and a continuous target variable by fitting a linear equation to observed data.

Think of it like this: Imagine plotting points on a graph and trying to draw the "best-fit" straight line through them. Linear regression does exactly this mathematically! It finds the line (or hyperplane in multiple dimensions) that comes closest to all your data points, allowing you to make predictions for new inputs.

Why "Linear"? Because the relationship is represented as a straight-line equation (like y = mx + b from algebra). Each one-unit increase in the input changes the prediction by the same fixed amount - the slope - no matter where on the line you are.

Real-world analogy: It's like predicting how much you'll earn based on years of experience. If you plot salary vs. experience for many people, you'll see a general upward trend. Linear regression finds the line that best captures this trend, letting you estimate salary for any experience level.

Real-World Applications

Linear regression powers countless real-world applications. Predicting house prices based on square footage, estimating sales revenue from advertising spend, forecasting stock prices, determining insurance premiums, and predicting student test scores are just a few examples where linear regression excels.

Housing Prices

Predict home values based on size, location, bedrooms, age, and neighborhood features.

Sales Forecasting

Estimate future sales based on advertising budget, season, and market trends.

The Linear Equation

The simplest form of linear regression uses one input feature (x) to predict one output (y). The equation is:

Linear Regression Equation
y = mx + b
The fundamental equation for simple linear regression
Slope (m)

How steep the line is - the rate of change between x and y

Intercept (b)

Where the line crosses the y-axis (value when x=0)

Target (y)

The predicted output (dependent variable)

Feature (x)

The input feature (independent variable)

Think of it This Way: If you're predicting house prices (y) from square footage (x), the slope (m) tells you how much the price increases per additional square foot, and the intercept (b) is the base price when square footage is zero.

Simple Example in Python

import numpy as np
import matplotlib.pyplot as plt

# Sample data: hours studied vs test score
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
scores = np.array([50, 55, 60, 65, 70, 75, 80, 85])

# Calculate slope (m) and intercept (b)
m = np.cov(hours, scores)[0, 1] / np.var(hours, ddof=1)  # 5.0 (ddof=1 matches np.cov's sample covariance)
b = np.mean(scores) - m * np.mean(hours)  # 45.0

# Make prediction
new_hours = 9
predicted_score = m * new_hours + b  # 90.0
print(f"Predicted score: {predicted_score}")

Regression vs Classification

| Aspect      | Regression                | Classification               |
|-------------|---------------------------|------------------------------|
| Output Type | Continuous (numbers)      | Discrete (categories)        |
| Examples    | Price, temperature, age   | Spam/not spam, cat/dog       |
| Algorithms  | Linear, polynomial, ridge | Logistic, SVM, decision tree |
| Evaluation  | MSE, MAE, R-squared       | Accuracy, F1-score, AUC      |
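To make the contrast concrete, here is a minimal sketch (the toy data and the choice of LogisticRegression as the classifier are illustrative) showing the same inputs feeding a regressor and a classifier in scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])  # one feature, six samples

# Regression: target is a continuous number
y_continuous = np.array([1.5, 3.1, 4.4, 6.2, 7.4, 9.1])
reg = LinearRegression().fit(X, y_continuous)
print(reg.predict([[7]]))  # a continuous value (around 10.5)

# Classification: target is a discrete category (0 or 1)
y_labels = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y_labels)
print(clf.predict([[7]]))  # a class label: [1]
```

Same input shape, same `.fit()`/`.predict()` pattern - the difference is entirely in what the target variable represents.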

Practice Questions

Problem: Given a car's weight in pounds, predict its miles per gallon (MPG). Use the equation: MPG = 50 - 0.006 × weight. What's the MPG for a 3000-pound car?

Show Solution
# Given equation: MPG = 50 - 0.006 * weight
weight = 3000

# Calculate MPG
mpg = 50 - 0.006 * weight  # 32.0

print(f"A {weight} lb car gets {mpg} MPG")
# Output: A 3000 lb car gets 32.0 MPG

Explanation: This is a simple linear equation where weight is the input feature, and MPG is the output. The negative slope (-0.006) indicates that heavier cars get worse mileage.

Problem: Given two points on a line: (2, 5) and (6, 13), calculate the slope and intercept of the line. Then predict y when x = 10.

Show Solution
# Two points: (x1, y1) = (2, 5) and (x2, y2) = (6, 13)
x1, y1 = 2, 5
x2, y2 = 6, 13

# Calculate slope: m = (y2 - y1) / (x2 - x1)
slope = (y2 - y1) / (x2 - x1)  # 2.0

# Calculate intercept: b = y - mx
intercept = y1 - slope * x1  # 1.0

# Predict for x = 10
x_new = 10
y_pred = slope * x_new + intercept  # 21.0

print(f"Slope: {slope}, Intercept: {intercept}")
print(f"When x={x_new}, y={y_pred}")

Explanation: The slope formula measures the rate of change. The intercept is calculated by plugging one point into y = mx + b and solving for b.

Problem: Create a function that calculates the best-fit line for a dataset using the least squares method. Use it to fit a line to: x = [1, 2, 3, 4, 5] and y = [2, 4, 5, 4, 5].

Show Solution
import numpy as np

def fit_linear_regression(x, y):
    """Calculate slope and intercept using least squares"""
    n = len(x)
    
    # Calculate means
    x_mean = np.mean(x)
    y_mean = np.mean(y)
    
    # Calculate slope
    numerator = np.sum((x - x_mean) * (y - y_mean))
    denominator = np.sum((x - x_mean) ** 2)
    slope = numerator / denominator
    
    # Calculate intercept
    intercept = y_mean - slope * x_mean
    
    return slope, intercept

# Test data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Fit the model
slope, intercept = fit_linear_regression(x, y)  # 0.6, 2.2

print(f"Best-fit line: y = {slope}x + {intercept}")
print(f"Prediction for x=6: {slope * 6 + intercept}")  # 5.8

Explanation: This implements the ordinary least squares method, which minimizes the sum of squared differences between actual and predicted values. The formulas calculate the optimal slope and intercept.

02

The Mathematics

Understanding the math behind linear regression empowers you to debug models, tune hyperparameters, and extend the algorithm to solve new problems. Don't worry - we'll build the intuition step by step before diving into formulas.

The Cost Function (MSE)

How do we measure how "good" our line is? We use a cost function, specifically Mean Squared Error (MSE). It calculates the average squared difference between predicted values and actual values. The goal of training is to minimize this cost.

Formula

Mean Squared Error

Mean Squared Error Formula
$$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
Where yi is actual, ŷi is predicted, n is sample count

What does this mean? MSE measures how "wrong" our predictions are on average. Here's the breakdown:

  • $(y_i - \hat{y}_i)$: The difference between actual and predicted values (this is the "error")
  • Squared $(...)^2$: We square each error to make all values positive (so positive and negative errors don't cancel out) and to penalize larger errors more heavily
  • Sum $\sum$: Add up all the squared errors from every data point
  • Average $\frac{1}{n}$: Divide by the number of samples to get the average squared error

Interpretation: Lower MSE = better model! An MSE of 0 means perfect predictions (rare in practice). The MSE value is in "squared units" - if predicting prices in dollars, MSE is in "dollars squared".

Step 1: Define the MSE Function

import numpy as np

def mean_squared_error(y_true, y_pred):
    """Calculate Mean Squared Error"""
    return np.mean((y_true - y_pred) ** 2)
What this code does: Creates a reusable function to calculate Mean Squared Error (MSE), which measures prediction accuracy. The formula (y_true - y_pred) ** 2 computes the squared difference between actual and predicted values - squaring ensures all errors are positive and penalizes larger errors more heavily. Then np.mean() averages all these squared errors to give us a single performance metric.

Key insight: MSE is in "squared units" (if predicting prices in thousands, MSE is in "thousands squared"), which is why we often take the square root (RMSE) for easier interpretation.
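Since RMSE is just the square root of MSE, the conversion is one line; a quick self-contained sketch using illustrative house prices in thousands:

```python
import numpy as np

actual = np.array([200, 250, 300, 350, 400])     # true prices ($k)
predicted = np.array([210, 240, 310, 340, 390])  # model outputs ($k)

mse = np.mean((actual - predicted) ** 2)  # in "$k squared"
rmse = np.sqrt(mse)                       # back in $k

print(f"MSE: {mse}, RMSE: {rmse}")  # MSE: 100.0, RMSE: 10.0
```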

Step 2: Prepare Example Data

# Example: actual vs predicted house prices (in thousands)
actual = np.array([200, 250, 300, 350, 400])
predicted = np.array([210, 240, 310, 340, 390])
What this code does: Creates two NumPy arrays for comparison - actual contains the true house prices from the market, and predicted contains what our machine learning model predicted. Notice the predictions aren't perfect: first house is actually $200k but predicted $210k (off by $10k).

Real-world context: This mismatch is normal! No model is perfect. The goal is to minimize these differences, which is exactly what MSE measures.

Step 3: Calculate and Display MSE

mse = mean_squared_error(actual, predicted)  # 100.0
print(f"MSE: ${mse}k squared")  # Lower is better!
What this code does: Calls our MSE function to evaluate prediction accuracy, returning 100.0 (every prediction is off by exactly $10k, and 10² = 100). To interpret this: take the square root (RMSE = √100 = $10k), meaning our predictions are off by about $10k on average.

Performance guide: MSE alone isn't meaningful - you need context. For house prices, MSE=100 might be excellent for $200k-$400k homes, but terrible for $50k apartments. Always compare MSE to your target value range. Lower MSE = Better model!
Why Square the Errors? Squaring ensures that positive and negative errors don't cancel out, penalizes large errors more heavily, and makes the math simpler for optimization algorithms.
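A quick numeric check of the cancellation problem; a minimal sketch with made-up values:

```python
import numpy as np

actual = np.array([10, 20, 30, 40])
predicted = np.array([15, 15, 35, 35])  # errors: -5, +5, -5, +5

raw_mean = np.mean(actual - predicted)     # 0.0  -- errors cancel out!
mse = np.mean((actual - predicted) ** 2)   # 25.0 -- squaring exposes them
mae = np.mean(np.abs(actual - predicted))  # 5.0  -- absolute value also works

print(raw_mean, mse, mae)
```

The raw average of the errors is exactly zero even though every prediction is wrong by 5 - which is why MSE (or MAE) is used instead.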

Gradient Descent

Gradient descent is an iterative optimization algorithm that systematically finds the optimal parameters (slope and intercept) for our linear regression model. Imagine you're blindfolded on a hillside and need to reach the lowest point (valley). You can't see where the valley is, but you can feel the slope beneath your feet. You take small steps in the direction that feels most downhill, and with each step, you get closer to the bottom. That's exactly how gradient descent works!

The algorithm starts with random initial values for the slope (m) and intercept (b) - essentially making a random guess at the line. It then calculates the error (how wrong the predictions are) using the cost function. Next, it computes the "gradient" - a mathematical measure that tells us which direction and how steeply the error is increasing. By moving in the opposite direction of the gradient (downhill), we reduce the error. The algorithm repeats this process hundreds or thousands of times, taking small steps each time, until it converges to the optimal parameters where the error is minimized. The size of each step is controlled by the "learning rate" - too large and you might overshoot the valley; too small and it takes forever to get there.

1
Initialize

Start with random values for m (slope) and b (intercept), typically both set to 0. This is your starting point - imagine placing a completely random line on your data. The line will be completely wrong at first, but that's okay! The algorithm will improve it step by step.

2
Calculate Gradient

Calculate how much the error changes when you slightly adjust m and b. This tells you the direction of steepest descent - which way to move to reduce error fastest. Think of it as measuring the slope of the hill you're standing on to figure out which direction is most downhill.

3
Update Parameters

Move m and b in the opposite direction of the gradient (downhill) by a small step determined by the learning rate. Repeat steps 2-3 hundreds of times until the error stops decreasing significantly - this is called convergence, meaning you've found the optimal line!

Update Rules

Gradient Descent Equations

Parameter Update for Slope
$$ m_{new} = m_{old} - \alpha \frac{\partial MSE}{\partial m} $$
Update the slope parameter in the direction that reduces error
Parameter Update for Intercept
$$ b_{new} = b_{old} - \alpha \frac{\partial MSE}{\partial b} $$
Update the intercept parameter in the direction that reduces error

Where α is the learning rate (step size), typically between 0.001 and 0.1.

Understanding the equations:

  • $m$ and $b$: The slope and intercept of our line (the parameters we're trying to find)
  • $\frac{\partial MSE}{\partial m}$ and $\frac{\partial MSE}{\partial b}$: These are "gradients" - they tell us which direction to adjust $m$ and $b$ to reduce error
  • $\alpha$ (learning rate): Controls how big each adjustment step is. Too large and we might overshoot; too small and learning takes forever

The intuition: Imagine you're hiking down a mountain in fog (can't see the bottom). You feel the slope under your feet and take a step downhill. That's gradient descent! The gradient tells you which way is "downhill" (toward lower error), and the learning rate determines your step size. You repeat this until you reach the valley (minimum error).

Step 1: Define the Gradient Descent Function

import numpy as np

def gradient_descent(x, y, learning_rate=0.01, epochs=1000):
    m, b = 0, 0  # Initialize slope and intercept to zero
    n = len(x)   # Number of data points
What this code does: Initializes the gradient descent function with starting parameters. m (slope) and b (intercept) both start at 0 - imagine starting with a flat horizontal line at y=0. These will iteratively improve during training.

Parameters explained:
learning_rate=0.01: Controls how big each step is during optimization. Too large = overshooting, too small = slow training.
epochs=1000: Number of complete passes through the dataset to adjust m and b.
n = len(x): Dataset size, needed for gradient calculations.

Step 2: Training Loop - Make Predictions

    for epoch in range(epochs):
        # Make predictions using current m and b
        y_pred = m * x + b
What this code does: For each training iteration (epoch), calculates predictions using the current line equation y_pred = mx + b. In epoch 1, with m=0 and b=0, all predictions will be 0 - completely wrong! But that's okay, the gradients will tell us how to improve.

Training cycle: This loop runs 1000 times. Each iteration makes the predictions slightly better by adjusting m and b based on the errors.

Step 3: Calculate Gradients

        # Calculate gradients (direction of steepest increase in error)
        dm = -(2/n) * np.sum(x * (y - y_pred))  # Gradient for slope
        db = -(2/n) * np.sum(y - y_pred)        # Gradient for intercept
What this code does: Calculates the gradients (slopes) that tell us which direction to move m and b to reduce error. These formulas come from taking partial derivatives of the MSE cost function with respect to m and b.

Mathematical intuition:
dm = -(2/n) * Σ[x * (y - y_pred)]: Shows how much to change slope m
db = -(2/n) * Σ[(y - y_pred)]: Shows how much to change intercept b
• Negative sign: We move opposite to the gradient (downhill) to minimize error
• The (2/n) factor comes from averaging over all n data points
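If the derivative formulas feel like a leap of faith, they can be verified numerically: a central-difference approximation of how MSE changes should match the analytic gradients. A small sketch with arbitrary parameter values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
m, b = 0.5, 0.5  # arbitrary current parameters
n = len(x)

def mse(m, b):
    return np.mean((y - (m * x + b)) ** 2)

# Analytic gradients from the formulas above
y_pred = m * x + b
dm = -(2 / n) * np.sum(x * (y - y_pred))
db = -(2 / n) * np.sum(y - y_pred)

# Numerical gradients via central differences
eps = 1e-6
dm_num = (mse(m + eps, b) - mse(m - eps, b)) / (2 * eps)
db_num = (mse(m, b + eps) - mse(m, b - eps)) / (2 * eps)

print(f"dm: analytic {dm:.4f}, numeric {dm_num:.4f}")  # both -12.0
print(f"db: analytic {db:.4f}, numeric {db_num:.4f}")  # both -5.0
```

This "gradient check" trick generalizes to any model where you derive gradients by hand.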

Step 4: Update Parameters

        # Update parameters by moving against the gradient
        m = m - learning_rate * dm
        b = b - learning_rate * db
        
        # Print progress every 100 epochs
        if epoch % 100 == 0:
            mse = np.mean((y - y_pred) ** 2)
            print(f"Epoch {epoch}: MSE = {mse:.2f}")
What this code does: Updates parameters by taking a small step in the direction that reduces error. The learning_rate (0.01) acts as a multiplier - it prevents us from taking huge jumps that might overshoot the optimal values.

Progress tracking: Every 100 epochs, we print the MSE to monitor training. You should see it steadily decreasing - if it increases, your learning rate is too high! Illustrative output shape: "Epoch 0: MSE = 50.20" → "Epoch 100: MSE = 12.80" → ... → "Epoch 900: MSE = 2.30" (the final print lands at epoch 900, since the loop runs epochs 0-999).

Step 5: Complete Example with Test Data

    return m, b

# Test with sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
m, b = gradient_descent(x, y)  # m ≈ 0.6, b ≈ 2.2

print(f"Final equation: y = {m:.2f}x + {b:.2f}")
What this code does: Returns the optimized parameters and demonstrates the function with sample data. After 1000 training iterations, it finds the best-fit line with approximately m=0.6 (slope) and b=2.2 (intercept).

Verification: The test data points are [1,2,3,4,5] and [2,4,5,4,5]. Plugging into y = 0.6x + 2.2: when x=1 → y=2.8 (close to actual 2), when x=5 → y=5.2 (close to actual 5). The model has learned the underlying pattern! You can now use this equation to predict y for any new x value.

Learning Rate Impact

| Learning Rate      | Behavior                 | Result                           |
|--------------------|--------------------------|----------------------------------|
| Too Small (0.0001) | Tiny steps down the hill | Slow convergence, may timeout    |
| Just Right (0.01)  | Steady progress downhill | Converges efficiently to minimum |
| Too Large (0.5)    | Overshoots the valley    | Oscillates or diverges           |

The Normal Equation: Direct Solution

The Normal Equation is a powerful mathematical formula that provides a direct, closed-form analytical solution to linear regression - completely different from gradient descent's iterative approach. While gradient descent is like carefully walking down a mountain step-by-step in the fog, the Normal Equation is like having a helicopter that takes you directly to the bottom in one jump!

This method uses matrix calculus and linear algebra to find the optimal parameters θ (theta) that minimize the cost function in a single computation. Instead of repeatedly adjusting parameters through hundreds or thousands of iterations, it solves the optimization problem algebraically by taking the derivative of the cost function, setting it to zero, and solving for θ. The result is a formula that, when given your data, outputs the perfect parameters immediately.

The beauty of the Normal Equation lies in its mathematical elegance: it's derived from the principle that at the minimum of a convex function (like our MSE cost function), the gradient equals zero. By solving the equation ∇J(θ) = 0 using matrix differentiation, we arrive at a direct formula. However, this convenience comes with computational costs - the formula requires computing a matrix inverse, which becomes extremely slow and memory-intensive when dealing with large feature matrices. This is why understanding both gradient descent and the Normal Equation is crucial: each has its place depending on your dataset size and computational resources.

The Normal Equation Formula
$$ \theta = (X^T X)^{-1} X^T y $$
Where X is the design matrix (feature matrix with a column of 1s added for the intercept term), y is the target vector containing all output values, and θ (theta) is the parameter vector containing all weights including the bias term. The T denotes transpose, and -1 denotes matrix inverse.
Understanding the Components

Breaking Down the Normal Equation

XᵀX (X transpose times X): Creates a square matrix that captures the relationships between all features. This is called the "Gram matrix" and it measures how features correlate with each other.

(XᵀX)⁻¹ (The inverse): This is the computationally expensive part! Matrix inversion requires O(n³) operations where n is the number of features. For 10,000 features, that's 1 trillion operations! This is why the Normal Equation slows down dramatically with more features.

Xᵀy (X transpose times y): Projects the target values onto the feature space, creating a vector that represents how each feature relates to the output.

Why this works: The formula essentially finds the point in parameter space where the gradient of the cost function equals zero - the mathematical definition of a minimum. For linear regression's convex cost function, this point is guaranteed to be the global minimum, giving us the absolutely best possible parameters.

The tradeoff:

  • ✓ Advantages: Exact solution (no approximation), no hyperparameters to tune (no learning rate!), guaranteed to find the global minimum, no need for feature scaling, no risk of getting stuck in local minima.
  • ✗ Disadvantages: Computationally expensive O(n³) for large feature sets, requires storing large matrices in memory, fails when XᵀX is non-invertible (singular matrix), doesn't work with non-linear models, can't be easily adapted to online learning.

When to use it: Ideal for small to medium datasets (< 10,000 samples and < 1,000 features) where you want an exact solution quickly. Perfect for academic projects, small business applications, and situations where you need interpretable, reproducible results. For massive datasets, web-scale applications, or real-time learning, stick with gradient descent or its variants.

Advantages

  • No learning rate needed - No hyperparameter tuning required, eliminating trial and error
  • One calculation - Solves in a single step instead of thousands of iterations
  • Exact solution - Mathematically guaranteed optimal parameters, no approximation
  • Perfect for small datasets - Fast and efficient when features are limited

Limitations

  • Slow for large features - Becomes impractical when n > 10,000 features
  • Matrix inversion cost - O(n³) complexity means runtime grows cubically with the number of features
  • Singular matrix issues - Fails when XᵀX is non-invertible
  • Limited scope - Only works for linear models, not other algorithms
import numpy as np

# Sample data
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]])  # First column is 1s
y = np.array([2, 4, 5, 4, 5])

# Apply normal equation: theta = (X^T X)^-1 X^T y
X_transpose = X.T
theta = np.linalg.inv(X_transpose.dot(X)).dot(X_transpose).dot(y)

print(f"Optimal parameters: {theta}")
# [2.2 0.6] - intercept and slope

# Make predictions
X_test = np.array([[1, 6], [1, 7]])  # Include 1s for intercept
predictions = X_test.dot(theta)
print(f"Predictions: {predictions}")  # [5.8 6.4]
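One practical note: in production code, explicitly inverting XᵀX with np.linalg.inv is usually avoided for numerical stability. A sketch of two common alternatives that recover the same parameters:

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]])
y = np.array([2, 4, 5, 4, 5])

# Option 1: solve the normal equations as a linear system (no explicit inverse)
theta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Option 2: SVD-based least squares (also copes with ill-conditioned X)
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_solve)  # approximately [2.2, 0.6] - intercept, slope
print(theta_lstsq)  # same result
```

Both give the optimal parameters without ever forming a matrix inverse, which is faster and less error-prone when XᵀX is nearly singular.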

Practice Questions

Problem: Given actual values [10, 20, 30] and predicted values [12, 18, 28], calculate the Mean Squared Error.

Show Solution
import numpy as np

actual = np.array([10, 20, 30])
predicted = np.array([12, 18, 28])

# Calculate squared errors
squared_errors = (actual - predicted) ** 2  # [4, 4, 4]

# Calculate mean
mse = np.mean(squared_errors)  # 4.0

print(f"MSE: {mse}")
# Output: MSE: 4.0

Explanation: Each prediction is off by 2, so squared errors are all 4. The average is 4.0.

Problem: For the line y = 2x + 1 with one data point (3, 8), calculate the gradients ∂MSE/∂m and ∂MSE/∂b. Prediction is 2(3) + 1 = 7.

Show Solution
import numpy as np

# Data point
x, y = 3, 8

# Current parameters
m, b = 2, 1

# Make prediction
y_pred = m * x + b  # 7
error = y - y_pred  # 1

# Calculate gradients
dm = -2 * x * error  # -6
db = -2 * error  # -2

print(f"Gradient for m: {dm}")
print(f"Gradient for b: {db}")
# dm = -6 means m should increase
# db = -2 means b should increase

Explanation: Negative gradients indicate the parameters should increase to reduce error.

Problem: Implement a complete training loop with gradient descent for the dataset x=[1,2,3], y=[2,4,6]. Track MSE every 10 iterations for 100 epochs.

Show Solution
import numpy as np

x = np.array([1, 2, 3])
y = np.array([2, 4, 6])

m, b = 0, 0  # Initialize
learning_rate = 0.01
n = len(x)

print("Epoch | MSE")
print("-" * 20)

for epoch in range(101):
    # Predictions
    y_pred = m * x + b
    
    # Calculate MSE
    mse = np.mean((y - y_pred) ** 2)
    
    # Print progress
    if epoch % 10 == 0:
        print(f"{epoch:5d} | {mse:.4f}")
    
    # Calculate gradients
    dm = -(2/n) * np.sum(x * (y - y_pred))
    db = -(2/n) * np.sum(y - y_pred)
    
    # Update parameters
    m -= learning_rate * dm
    b -= learning_rate * db

print(f"\nFinal: m={m:.4f}, b={b:.4f}")
# The data lie exactly on y = 2x, so m should head toward 2 and b toward 0

Explanation: MSE decreases every iteration as the model moves toward the perfect line y = 2x (m=2, b=0). With only 100 epochs at this learning rate it gets close but not all the way; running more epochs would tighten the fit further.

03

Simple Linear Regression

Simple linear regression is the most fundamental supervised learning algorithm in machine learning. It's called "simple" not because it's trivial, but because it uses exactly one input feature to predict the target variable. Think of it as finding the best-fitting straight line through your data points - like connecting the dots in a way that minimizes errors.

Imagine you're trying to predict someone's weight based solely on their height. You collect data from 100 people, plot height on the x-axis and weight on the y-axis, and notice the points form a rough upward pattern - taller people tend to weigh more. Simple linear regression draws the single straight line that best captures this trend, allowing you to predict weight for any height value. While real-world relationships are rarely perfectly linear, this technique provides an excellent starting point for understanding machine learning and works surprisingly well for many practical problems.

Core Concept

Simple Linear Regression Explained

What it is: Simple linear regression models the relationship between one input feature (x) and one output variable (y) by fitting a straight line through your data points. The line represents a mathematical function that transforms any input value into a predicted output value.

y = mx + b

Understanding the equation:

  • y (Target Variable): The value we want to predict. This is the "dependent variable" because its value depends on x. Examples: house price, student test score, stock price tomorrow, electricity consumption.
  • x (Input Feature): The single piece of information we use to make predictions. This is the "independent variable" because we treat it as given. Examples: house square footage, hours studied, today's stock price, outdoor temperature.
  • m (Slope): The rate of change - how much y increases (or decreases if negative) when x increases by 1 unit. This is the coefficient or weight. Example: if m = 5000, then each additional year of experience adds $5000 to predicted salary. A steeper slope means x has a stronger influence on y.
  • b (Intercept): The baseline value - what we predict for y when x equals zero. This is also called the bias term. Example: if b = 30000, the model predicts a starting salary of $30,000 with 0 years of experience. Sometimes this doesn't make physical sense (negative height at age 0?), but it's mathematically necessary for the line to fit properly.

How the model learns: During training, the algorithm searches for the combination of m and b that makes the predictions least wrong, as measured by the cost function. Through an optimization process (gradient descent or the normal equation), it systematically adjusts m and b to minimize these errors. The final values of m and b represent the "learned" line that best captures the relationship between x and y in your training data.

Real-world interpretation: Once trained, you can interpret your model: "For every 1 unit increase in [x], we expect [y] to change by [m] units, starting from a baseline of [b] when [x] is zero." This interpretability is one of linear regression's greatest strengths - unlike complex neural networks, you can easily explain the model's decisions to stakeholders.

When to use simple linear regression:

  • You have one primary feature that you believe influences your target
  • The relationship appears roughly linear when you plot the data
  • You need an interpretable model that's easy to explain
  • You're just starting with machine learning and want to understand fundamentals
  • ✗ Don't use if: you have multiple important features (use multiple regression), the relationship is clearly curved (use polynomial regression), or you have categorical predictions (use classification)

Key limitation: The word "simple" specifically means we use only ONE feature. Real-world problems often depend on multiple factors - house prices depend on size, location, age, neighborhood, etc. When you need to consider multiple features simultaneously, you'll graduate to Multiple Linear Regression, which extends this concept to many dimensions while keeping the same interpretable structure.
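As a preview of that next step, the scikit-learn API stays identical with several features; a hedged sketch where the house features and prices are entirely made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical houses: [square feet, bedrooms, age in years] (made-up numbers)
X = np.array([
    [1500, 3, 20],
    [2000, 4, 15],
    [1200, 2, 30],
    [1800, 3, 10],
    [2500, 4, 5],
])
y = np.array([250, 320, 180, 300, 420])  # price in $k

model = LinearRegression().fit(X, y)
print("Coefficients:", model.coef_)  # one learned slope per feature
print("Intercept:", model.intercept_)
print("Prediction:", model.predict([[1700, 3, 12]]))
```

The only change from simple regression is that each row of X now has three columns, and `coef_` holds three slopes - one per feature - instead of one.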

Example: Predicting Salary from Experience

Let's predict salary based on years of experience. This is a classic simple linear regression problem - as experience increases, we expect salary to increase linearly.

Step 1: Import Libraries and Prepare Data

import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Data: years of experience and salary (in thousands)
experience = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
salary = np.array([40, 45, 50, 55, 65, 70, 75, 80, 85, 90])
What this code does: Imports essential libraries and creates sample data showing the relationship between work experience and salary. The .reshape(-1, 1) converts experience from a 1D array [1,2,3,...] to a 2D array [[1],[2],[3],...].

Why reshape? Scikit-learn REQUIRES features to be 2D (rows × columns matrix), even for single-feature problems. The -1 means "figure out the number of rows automatically" and 1 means "make it 1 column". This is a common gotcha for beginners!

Step 2: Create and Train the Model

# Create and train the model
model = LinearRegression()
model.fit(experience, salary)
What this code does: Creates a LinearRegression model object and trains it with .fit() on our data. The fit method automatically finds the optimal slope and intercept that minimize the mean squared error.

Behind the scenes: For LinearRegression, scikit-learn doesn't use gradient descent - it solves the least-squares problem directly with an SVD-based solver (scipy.linalg.lstsq), which computes the same optimum as the closed-form normal equation β = (XᵀX)⁻¹Xᵀy but more stably. For small datasets this is faster than iterative optimization!

Step 3: Extract and Display Parameters

# Get parameters
slope = model.coef_[0]  # 5.73
intercept = model.intercept_  # 34.0

print(f"Salary = {slope:.2f} × Experience + {intercept:.2f}")
# Output: Salary = 5.73 × Experience + 34.00
What this code does: Extracts the learned parameters from the trained model. model.coef_ is an array containing the slope(s) - we use [0] to get the first (and only) coefficient. model.intercept_ gives us the y-intercept.

Interpreting results: The equation y = 5.73x + 34.00 tells us:
• Starting salary (0 experience) = $34,000
• Each year of experience adds about $5,730 to salary
• After 10 years: $34,000 + (10 × $5,730) ≈ $91,300 (close to the actual $90k in the data)

Step 4: Make Predictions

# Make prediction for 12 years experience
prediction = model.predict([[12]])  # ~102.73k
print(f"Predicted salary for 12 years: ${prediction[0]:.2f}k")
What this code does: Uses the trained model to predict salary for someone with 12 years of experience. We pass [[12]] as a 2D array (matching the training format) to get a prediction of approximately $102,730.

Important note: Predictions are most reliable within the training data range (1-10 years). Predicting for 12 years is extrapolation, which can be less accurate. For 50 years of experience, the model would predict an unrealistic salary because the linear relationship might not hold for extreme values!
Interpretation: The slope of 5.73 means that for each additional year of experience, salary increases by approximately $5,730. The intercept of 34.00 represents the starting salary for someone with zero experience.

Visualizing the Fit

import matplotlib.pyplot as plt

# Plot data points
plt.scatter(experience, salary, color='blue', label='Actual Data')

# Plot regression line
plt.plot(experience, model.predict(experience), 
         color='red', linewidth=2, label='Best Fit Line')

plt.xlabel('Years of Experience')
plt.ylabel('Salary ($1000s)')
plt.title('Salary vs Experience')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Shows how well the line fits through the data points

Assumptions of Linear Regression

Linearity

The relationship between x and y is linear. Plot the data to verify this assumption holds.

Independence

Observations are independent of each other. No correlation between residuals.

Homoscedasticity

Variance of errors is constant across all levels of the independent variable.

Normality

Residuals (errors) are normally distributed. Check with a histogram or Q-Q plot.
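The normality and independence assumptions can also be sanity-checked in code. This is a sketch using synthetic stand-in residuals and SciPy; in a real project you'd swap in your model's actual residuals:

```python
import numpy as np
from scipy import stats

# Stand-in residuals; in practice use residuals = y_test - y_pred
rng = np.random.default_rng(42)
residuals = rng.normal(loc=0, scale=2, size=100)

# Normality: Shapiro-Wilk test (a large p-value means no evidence against normality)
stat, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk: statistic={stat:.3f}, p={p:.3f}")

# Independence (rough check): correlation between consecutive residuals
lag1_corr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(f"Lag-1 residual correlation: {lag1_corr:.3f}")  # near 0 is good
```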

Complete Example with Train-Test Split

Step 1: Import Libraries and Generate Sample Data

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Generate sample data
np.random.seed(42)  # For reproducibility
X = np.random.rand(100, 1) * 10
y = 2.5 * X.squeeze() + 5 + np.random.randn(100) * 2
What this code does: Imports all required libraries and generates 100 synthetic data points following the pattern y = 2.5x + 5 + noise. The np.random.randn(100) adds Gaussian (normal) random noise to simulate real-world data imperfections.

Why add noise? Real-world data is NEVER perfectly linear - there are always measurement errors, unmodeled factors, and random variations. Adding noise makes our synthetic data more realistic and tests the model's ability to find the underlying pattern despite the noise. The true parameters (m=2.5, b=5) are what we want the model to recover.

Step 2: Split Data into Training and Testing Sets

# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
What this code does: Splits the 100 data points into 80 training samples and 20 testing samples using train_test_split. The random_state=42 ensures reproducibility - you'll get the same split every time you run the code.

The golden rule: NEVER train and test on the same data! Training on all data gives falsely optimistic performance metrics. The test set simulates "unseen future data" - if the model performs well on test data, it will likely work well in production. This is the foundation of preventing overfitting!

Step 3: Train the Model

# Train model
model = LinearRegression()
model.fit(X_train, y_train)
What this code does: Creates a new LinearRegression model and trains it EXCLUSIVELY on the 80 training samples using .fit(). At this point, the model has never seen the 20 test samples.

Best practice verification: The test data is "locked away" during training - pretend it doesn't exist! This separation is crucial for honest evaluation. If you peek at test data during training (data leakage), your performance metrics will be misleadingly high and your model will fail in production.

Step 4: Evaluate Model Performance

# Evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)  # ~4.2
r2 = r2_score(y_test, y_pred)  # ~0.92

print(f"MSE: {mse:.2f}")
print(f"R²: {r2:.2f}")
print(f"Model: y = {model.coef_[0]:.2f}x + {model.intercept_:.2f}")
What this code does: Evaluates model performance on the test set (unseen data) using two metrics: MSE ≈ 4.2 and R² ≈ 0.92 (taking the square root of MSE gives RMSE ≈ 2.05).

Interpreting the results:
MSE = 4.2: Average squared prediction error. Taking sqrt gives RMSE = 2.05, meaning predictions are typically within ±2 units of actual values
R² = 0.92: Model explains 92% of variance in the data - excellent! (0.7-0.8 is good, 0.9+ is great)
Learned equation (y ≈ 2.5x + 5): The model recovers values close to the true parameters (m=2.5, b=5) despite the noise! This validates our model and training process.

Practice Questions

Problem: Create a LinearRegression model to fit the data: x = [1, 2, 3, 4, 5] and y = [3, 5, 7, 9, 11]. Print the slope and intercept.

Show Solution
from sklearn.linear_model import LinearRegression
import numpy as np

# Prepare data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([3, 5, 7, 9, 11])

# Create and train model
model = LinearRegression()
model.fit(X, y)

# Extract parameters
slope = model.coef_[0]  # 2.0
intercept = model.intercept_  # 1.0

print(f"Slope: {slope}")
print(f"Intercept: {intercept}")
print(f"Equation: y = {slope}x + {intercept}")
# Output: y = 2.0x + 1.0

Explanation: The data follows y = 2x + 1 perfectly, so the model learns these exact values.

Problem: Train a model on house data (size in sqft, price in $k): [(1000, 200), (1500, 250), (2000, 300), (2500, 350)]. Predict price for 1800 sqft.

Show Solution
from sklearn.linear_model import LinearRegression
import numpy as np

# Data
sizes = np.array([1000, 1500, 2000, 2500]).reshape(-1, 1)
prices = np.array([200, 250, 300, 350])

# Train model
model = LinearRegression()
model.fit(sizes, prices)

# Predict for 1800 sqft
size_new = np.array([[1800]])
predicted_price = model.predict(size_new)  # ~280k

print(f"Predicted price: ${predicted_price[0]:.0f}k")
print(f"Price per sqft: ${model.coef_[0] * 1000:.0f}/sqft")
# Output: Predicted price: $280k
#         Price per sqft: $100/sqft

Explanation: The model learns a linear relationship. Prices are stored in $k, so the raw coefficient is 0.10 ($0.10k), meaning each square foot adds $100 to the price.

Problem: Generate 50 data points from y = 3x + 5 + noise. Train 3 models with different train-test splits (60%, 70%, 80%). Compare R² scores.

Show Solution
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import numpy as np

# Generate data
np.random.seed(42)
X = np.random.rand(50, 1) * 10
y = 3 * X.squeeze() + 5 + np.random.randn(50)

# Test different split ratios
split_ratios = [0.4, 0.3, 0.2]  # test sizes

print("Test Size | Train R² | Test R²")
print("-" * 35)

for test_size in split_ratios:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    
    print(f"{test_size:.1f}      | {train_r2:.4f}  | {test_r2:.4f}")

# More training data generally gives better, more stable models

Explanation: Larger training sets typically yield more reliable models. Test R² shows how well the model generalizes.

04

Multiple Linear Regression

Real-world problems rarely depend on just one feature. Multiple linear regression extends simple linear regression to handle multiple input features simultaneously. Instead of a line in 2D space, we're fitting a hyperplane in multi-dimensional space.

Definition

Multiple Linear Regression

Multiple linear regression extends simple linear regression to handle multiple input features simultaneously. Instead of just one feature, we can use many features to make more accurate predictions:

Multiple Linear Regression Equation
$$ y = w_1x_1 + w_2x_2 + ... + w_nx_n + b $$
General form with n features and their weights

Understanding the components:

  • $y$: The target variable we're predicting (still just one output)
  • $x_1, x_2, ..., x_n$: Multiple input features (e.g., for house prices: $x_1$ = size, $x_2$ = bedrooms, $x_3$ = location score, $x_4$ = age)
  • $w_1, w_2, ..., w_n$: Weights (coefficients) - one for each feature, showing how much that feature impacts the prediction
  • $b$: The bias/intercept term - base prediction when all features are zero

Real-world example: Predicting house prices...

  • Simple regression: Price = 150 × Size + 50000
  • Multiple regression: Price = 150 × Size + 20000 × Bedrooms - 500 × Age + 30000 × Location + 50000

The benefit: By considering multiple factors at once, we get more nuanced and accurate predictions that capture real-world complexity!
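Plugging one hypothetical house into the multiple-regression equation above makes the arithmetic concrete (the coefficients are the illustrative ones from the example, not fitted values):

```python
# Illustrative coefficients from the example equation above
size, bedrooms, age, location = 1500, 3, 10, 7

price = 150 * size + 20000 * bedrooms - 500 * age + 30000 * location + 50000
# 225,000 + 60,000 - 5,000 + 210,000 + 50,000
print(f"Predicted price: ${price:,}")  # $540,000
```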

Example: House Price Prediction

Let's predict house prices using multiple features: size, number of bedrooms, age, and location score.

Step 1: Import Libraries and Create DataFrame

from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd

# House data: size, bedrooms, age, location_score, price
data = pd.DataFrame({
    'size': [1500, 1800, 2400, 2000, 2200],
    'bedrooms': [3, 3, 4, 3, 4],
    'age': [10, 5, 2, 8, 3],
    'location': [7, 8, 9, 6, 8],
    'price': [300, 350, 450, 380, 420]
})
What this code does: Creates a pandas DataFrame with real estate data where each row represents one house. Features include size (sqft), bedrooms, age (years), and location score (1-10). For example, the first house: 1500 sqft, 3 bedrooms, 10 years old, location 7, costs $300k.

Data structure insight: Unlike simple regression (1 feature), we now have 4 features predicting price. This is multiple linear regression. The model will learn how each feature independently contributes to price: price = b₀ + b₁(size) + b₂(bedrooms) + b₃(age) + b₄(location)

Step 2: Separate Features and Target

# Separate features and target
X = data[['size', 'bedrooms', 'age', 'location']]
y = data['price']
What this code does: Separates the dataset into features (X) and target (y). X is a 5×4 matrix (5 houses, 4 features each), and y is a 5-element array of prices.

Machine learning terminology:
Features (X): Independent variables, inputs, predictors - what we know about each house
Target (y): Dependent variable, output, label - what we're trying to predict
This split is standard practice in ML - X and y are fed separately to the .fit() method.

Step 3: Train the Model

# Train model
model = LinearRegression()
model.fit(X, y)
What this code does: Trains the linear regression model on all 4 features simultaneously. The model learns a separate coefficient for each feature, determining how much each one affects price.

How it works: The model solves the system: y = b₀ + b₁x₁ + b₂x₂ + b₃x₃ + b₄x₄ by finding the values of b₀ (intercept) and b₁-b₄ (coefficients) that minimize prediction error across all 5 houses. This happens in milliseconds using matrix algebra!

Step 4: Examine Feature Coefficients

# Print coefficients
print("Feature Importance:")
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature:12s}: {coef:7.2f}")
print(f"Intercept: {model.intercept_:.2f}")
What this code does: Displays each feature's coefficient (weight/importance) in the learned equation. These reveal how each feature impacts price.

Coefficient interpretation (illustrative values - your fitted numbers may differ):
size = 0.15: Each additional sqft adds $150 to price (coefficient × 1000)
bedrooms = 20.0: Each extra bedroom adds $20,000
age = -2.5: Each year older reduces price by $2,500 (negative = bad for price)
location = 15.0: Each point of location score adds $15,000
Larger magnitude = stronger influence on price predictions.

Step 5: Make Prediction for New House

# Predict for new house: 2100 sqft, 3 bed, 5 years old, location 8
new_house = pd.DataFrame([[2100, 3, 5, 8]], columns=X.columns)
predicted_price = model.predict(new_house)  # ~405k
print(f"\nPredicted price: ${predicted_price[0]:.0f}k")
What this code does: Makes a prediction for a new house with specific features: 2100 sqft, 3 bedrooms, 5 years old, location score 8. The model returns an estimated price of approximately $405,000.

Critical requirement: Features MUST be provided in the exact same order and format as training: [[size, bedrooms, age, location]]. Mixing up the order (e.g., [bedrooms, size, age, location]) will produce garbage predictions! The double brackets create a 2D array for a single prediction - for multiple houses, use: [[2100,3,5,8], [1800,2,3,9]]
Interpreting Coefficients: Each coefficient represents the change in price when that feature increases by 1 unit while all other features stay constant. A positive coefficient means the feature increases the price, negative means it decreases the price.

Matrix Notation

Multiple linear regression is elegantly expressed using matrix notation, which is how it's actually computed:

Matrix Form

$$ \mathbf{y} = \mathbf{X}\mathbf{w} + b $$
Vectorized form - compact and computationally efficient

What's happening here? Instead of writing separate equations for each data point, we use matrices to represent ALL the data and calculations at once. This is the "vectorized" form - it's more compact and much faster to compute!

Breaking down the components:

  • $\mathbf{X}$: Feature matrix (n samples × m features) - each row is one data point, each column is one feature. Think of it as a spreadsheet with your data!
  • $\mathbf{w}$: Weight vector (m × 1) - contains the weight for each feature in a single column
  • $\mathbf{y}$: Target vector (n × 1) - contains the predicted value for each sample
  • $b$: Bias term - often incorporated into $\mathbf{X}$ by adding a column of 1s

The Normal Equation (vectorized):

$ \mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} $

This directly computes optimal weights using matrix operations.

Why use matrix form? It's not just notation! Matrix operations are highly optimized in libraries like NumPy, making calculations on thousands of data points nearly instant. Writing loops would be much slower!
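A small sketch of that difference: the Python loop and the vectorized matrix-vector product compute exactly the same predictions, but the vectorized form is one NumPy call (the weights and data here are illustrative, not from any dataset above):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((1000, 4))             # 1000 samples, 4 features
w = np.array([1.5, -2.0, 0.5, 3.0])  # illustrative weights
b = 10.0                              # illustrative bias

# Loop form: one dot product per sample (slow in pure Python)
y_loop = np.array([sum(w[j] * X[i, j] for j in range(4)) + b
                   for i in range(1000)])

# Vectorized form: one matrix-vector product for ALL samples
y_vec = X @ w + b

print(np.allclose(y_loop, y_vec))  # True - same numbers, far less work
```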

Feature Scaling

When features have different scales (like size in thousands and bedrooms in single digits), it's important to standardize them. This makes coefficients comparable and helps gradient descent converge faster.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
import numpy as np

# Data with different scales
X = np.array([
    [2000, 3, 10],  # size, bedrooms, age
    [1500, 2, 5],
    [2500, 4, 2],
    [1800, 3, 8]
])
y = np.array([400, 300, 500, 350])

# Scale features (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train model on scaled data
model = LinearRegression()
model.fit(X_scaled, y)

# Predict for new house
new_house = [[2100, 3, 6]]
new_house_scaled = scaler.transform(new_house)
prediction = model.predict(new_house_scaled)  # ~425k

print(f"Predicted price: ${prediction[0]:.0f}k")

Feature Engineering

  • Polynomial Features: create interaction terms (e.g., size² or size × bedrooms)
  • Log Transform: handle skewed distributions (e.g., log(price) or log(size))
  • One-Hot Encoding: convert categories to binary (e.g., location → [0,1,0,0])
  • Feature Binning: group continuous values (e.g., age → [young, mid, old])
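These techniques can be sketched in a few lines of pandas. The column names and bin edges below are hypothetical, chosen only to show each transformation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'size': [1500, 2400, 1800],
    'location': ['suburb', 'city', 'rural'],
    'price': [300, 450, 250],
})

# Polynomial / interaction feature: size squared
df['size_sq'] = df['size'] ** 2

# Log transform: compress the skewed price scale
df['log_price'] = np.log(df['price'])

# One-hot encoding: location -> one binary column per category
df = pd.get_dummies(df, columns=['location'])

# Feature binning: group continuous sizes into labeled ranges
df['size_bin'] = pd.cut(df['size'], bins=[0, 1600, 2000, np.inf],
                        labels=['small', 'mid', 'large'])
print(df.columns.tolist())
```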

Polynomial Regression for Non-Linear Data

Real-world relationships are often non-linear - like how productivity increases with sleep up to a point, then decreases. Polynomial regression extends linear regression to model curved relationships by adding polynomial features (x², x³, etc.).

Key Concept

Polynomial Regression

Polynomial regression allows us to fit curved patterns in data by transforming features into polynomial terms (squared, cubed, etc.), then applying linear regression techniques. It's perfect for when your data doesn't follow a straight line!

Example - Degree 2 (Quadratic):

$$ y = b_0 + b_1 x + b_2 x^2 $$

How it works:

  • Original feature: $x$ (e.g., temperature)
  • Create new features: $x^2$ (temperature squared), and optionally $x^3$, $x^4$, etc.
  • Fit linear regression: Treat $x$ and $x^2$ as separate features and find coefficients $b_0, b_1, b_2$
  • Result: A curved line that can capture U-shaped or hill-shaped patterns!

The name confusion: Despite having $x^2$ terms, it's still called "linear" regression because it's linear in the coefficients ($b_0, b_1, b_2$). We're not squaring the coefficients - we're squaring the features! The model still finds $b$ values using the same linear regression math.

Real-world example: Predicting ice cream sales vs. temperature. Sales might increase with temperature, but above 95°F people stay indoors, so sales drop. This inverted-U pattern needs polynomial regression - a straight line can't capture it!

Warning: Higher degrees (3, 4, 5+) can fit training data very closely but may "overfit" - they follow noise rather than true patterns. Start with degree 2 or 3 and test on validation data!

Degree 1 (Linear)

$$ y = b_0 + b_1 x $$

Straight line

Degree 2 (Quadratic)

$$ y = b_0 + b_1 x + b_2 x^2 $$

U-shaped or inverted-U curve

Degree 3 (Cubic)

$$ y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 $$

S-shaped curves

Step 1: Import Libraries and Create Non-Linear Data

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Non-linear data (productivity vs hours of sleep)
X = np.array([4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([60, 70, 85, 95, 90, 75, 60])  # Inverted U-shape
What this code does: Creates non-linear data representing the relationship between sleep hours and productivity. Notice the inverted U-shape: productivity increases from 4-7 hours (more sleep = better), then decreases from 7-10 hours (too much sleep = worse).

Why simple linear regression fails: A straight line can NEVER fit this curved pattern well - it would either underpredict the middle values or overpredict the extremes. This is where polynomial regression comes in! By adding x² terms, we can model curves, parabolas, and complex non-linear relationships.

Step 2: Create Polynomial Features

# Create polynomial features (degree 2)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print("Original features:", X[0])      # [4]
print("Polynomial features:", X_poly[0])  # [4, 16] (x, x²)
What this code does: Transforms each input value into polynomial features using PolynomialFeatures. For degree=2, each x becomes [x, x²]. Example: x=4 → [4, 16].

The polynomial trick: We're still using linear regression, but on transformed features! The model learns: y = b₀ + b₁x + b₂x². It's linear in the coefficients (b₀, b₁, b₂) but creates a curved prediction. Higher degrees (3, 4, 5...) create more complex curves but risk overfitting. include_bias=False means don't add a column of 1s (intercept is handled by LinearRegression).

Step 3: Train Model on Polynomial Features

# Train model on polynomial features
model = LinearRegression()
model.fit(X_poly, y)
What this code does: Trains a LinearRegression model on the polynomial features X_poly. The model is "linear" but the relationship it learns is quadratic: y = b₀ + b₁x + b₂x².

Key insight: This is the magic of polynomial regression - we're using the exact same LinearRegression class, just feeding it engineered features! The model doesn't know it's fitting a curve; from its perspective, it's doing regular linear regression with 2 features. But when plotted against the original x, the predictions form a parabola that captures the inverted-U pattern.
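To verify that insight, here's a sketch that builds the [x, x²] columns by hand and confirms they match what PolynomialFeatures produces, using the same sleep/productivity data as above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Same sleep/productivity data as above
X = np.array([4, 5, 6, 7, 8, 9, 10], dtype=float).reshape(-1, 1)
y = np.array([60, 70, 85, 95, 90, 75, 60], dtype=float)

# Build [x, x²] by hand - equivalent to PolynomialFeatures(degree=2, include_bias=False)
X_manual = np.hstack([X, X ** 2])

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(np.allclose(X_manual, X_poly))  # True: identical feature matrices

model = LinearRegression().fit(X_manual, y)
print(model.coef_)  # the x² coefficient is negative - an inverted-U parabola
```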

Step 4: Make Predictions

# Predict
X_test = np.array([[7.5]])
X_test_poly = poly.transform(X_test)
prediction = model.predict(X_test_poly)
print(f"Predicted productivity at 7.5 hrs sleep: {prediction[0]:.1f}")
What this code does: Makes a prediction for 7.5 hours of sleep. First, transforms 7.5 into [7.5, 56.25] using poly.transform(), then feeds it to the model for prediction.

Critical step - DON'T FORGET: You MUST transform new data using the same poly object that transformed training data. If you forget and just pass [[7.5]] to the model, you'll get an error ("Expected 2 features, got 1") or wrong predictions. Always: X_new → poly.transform(X_new) → model.predict()

Step 5: Visualize the Polynomial Fit

# Visualize
X_range = np.linspace(4, 10, 100).reshape(-1, 1)
X_range_poly = poly.transform(X_range)
y_pred = model.predict(X_range_poly)

plt.scatter(X, y, color='blue', label='Actual data')
plt.plot(X_range, y_pred, color='red', label='Polynomial fit (degree 2)')
plt.xlabel('Hours of Sleep')
plt.ylabel('Productivity Score')
plt.legend()
plt.title('Polynomial Regression: Productivity vs Sleep')
plt.show()
What this code does: Generates 100 evenly-spaced points between 4 and 10, transforms them to polynomial features, gets predictions, and plots the smooth curved line. This visualizes how well the polynomial fits the inverted-U-shaped data.

Visualization insight: You'll see the red polynomial curve pass through or near the blue actual data points, capturing the inverted-U pattern. Compare this to a linear fit (straight line), which would have large errors at the extremes. This demonstrates polynomial regression's power for non-linear relationships! Pro tip: Try different degrees (3, 4, 5) to see how curve complexity changes.

Choosing the Polynomial Degree with Cross-Validation

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np

# Try different polynomial degrees
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([2, 4, 5, 7, 8, 9, 10, 11, 11, 12])

for degree in range(1, 6):
    # Create pipeline: polynomial features + linear regression
    model = make_pipeline(
        PolynomialFeatures(degree=degree),
        LinearRegression()
    )
    
    # Cross-validation scores
    scores = cross_val_score(model, X, y, cv=5, 
                            scoring='r2')
    print(f"Degree {degree}: R² = {scores.mean():.3f} (+/- {scores.std():.3f})")

# Choose degree with highest average R² without overfitting
# Output might show:
# Degree 1: R² = 0.920 (+/- 0.050)
# Degree 2: R² = 0.945 (+/- 0.030)  <- Best balance
# Degree 3: R² = 0.935 (+/- 0.055)  <- Starting to overfit
# Degree 4: R² = 0.880 (+/- 0.120)  <- Overfitting

Multicollinearity Warning

Watch Out: When features are highly correlated (like size and number of rooms), it's hard to determine their individual effects. Use correlation matrices to detect multicollinearity and consider removing redundant features.
import pandas as pd
import numpy as np

# Create sample data
data = pd.DataFrame({
    'size': [1500, 1800, 2400, 2000, 2200],
    'rooms': [5, 6, 8, 7, 8],  # Highly correlated with size
    'age': [10, 5, 2, 8, 3]
})

# Calculate correlation matrix
correlation_matrix = data.corr()
print(correlation_matrix)

# If correlation > 0.8, consider removing one feature
print("\nHigh correlations detected:")
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.8:
            col1 = correlation_matrix.columns[i]
            col2 = correlation_matrix.columns[j]
            print(f"{col1} & {col2}: {correlation_matrix.iloc[i, j]:.2f}")
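Beyond pairwise correlations, the Variance Inflation Factor (VIF) catches collinearity involving several features at once. This is a sketch computing VIF with plain sklearn (statsmodels has a ready-made version); the rule of thumb that VIF above roughly 5-10 signals trouble is a common convention, not a hard threshold:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same illustrative data as above: size and rooms are nearly collinear
data = pd.DataFrame({
    'size': [1500, 1800, 2400, 2000, 2200],
    'rooms': [5, 6, 8, 7, 8],
    'age': [10, 5, 2, 8, 3],
})

vifs = {}
for col in data.columns:
    X_others = data.drop(columns=col)
    # R² from regressing this feature on all the other features
    r2 = LinearRegression().fit(X_others, data[col]).score(X_others, data[col])
    # VIF_j = 1 / (1 - R²_j); guard against a perfect fit
    vifs[col] = float('inf') if r2 >= 1 else 1 / (1 - r2)

for col, vif in vifs.items():
    print(f"{col:6s}: VIF = {vif:.1f}")  # size and rooms come out high
```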

Practice Questions

Problem: Train a model with two features: hours_studied [2, 3, 4, 5] and hours_slept [8, 7, 8, 9], predicting test_score [60, 65, 75, 80]. Predict for 4.5 hours studied and 8 hours slept.

Show Solution
from sklearn.linear_model import LinearRegression
import numpy as np

# Features
X = np.array([
    [2, 8],  # studied, slept
    [3, 7],
    [4, 8],
    [5, 9]
])
y = np.array([60, 65, 75, 80])

# Train
model = LinearRegression()
model.fit(X, y)

# Predict
prediction = model.predict([[4.5, 8]])  # ~76.7
print(f"Predicted score: {prediction[0]:.1f}")

# Coefficients
print(f"Study effect: {model.coef_[0]:.2f} points per hour")
print(f"Sleep effect: {model.coef_[1]:.2f} points per hour")

Explanation: The model learns how much each factor (studying and sleeping) contributes to the test score.

Problem: Given features with different scales - income [50000, 60000, 70000] and age [25, 30, 35] predicting spending [5000, 6000, 7000]. Apply StandardScaler and train a model.

Show Solution
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
import numpy as np

# Data
X = np.array([[50000, 25], [60000, 30], [70000, 35]])
y = np.array([5000, 6000, 7000])

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Original:", X[0])
print("Scaled:", X_scaled[0])

# Train on scaled data
model = LinearRegression()
model.fit(X_scaled, y)

# Predict for income=65000, age=32
new_data = [[65000, 32]]
new_scaled = scaler.transform(new_data)
prediction = model.predict(new_scaled)  # ~6500

print(f"Predicted spending: ${prediction[0]:.0f}")

Explanation: StandardScaler ensures both features contribute fairly despite income being in tens of thousands and age in tens.

Problem: Create polynomial features (x², xy) from x=[1,2,3] and y=[2,3,4], then train a model to predict z=[5,10,17]. Compare with and without polynomial features.

Show Solution
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4]])
z = np.array([5, 10, 17])

# Model 1: Linear features only
model1 = LinearRegression()
model1.fit(X, z)
pred1 = model1.predict(X)
r2_linear = r2_score(z, pred1)

# Model 2: With polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # x, y, x², xy, y²

model2 = LinearRegression()
model2.fit(X_poly, z)
pred2 = model2.predict(X_poly)
r2_poly = r2_score(z, pred2)

print(f"Linear R²: {r2_linear:.4f}")
print(f"Polynomial R²: {r2_poly:.4f}")
print(f"\nNew features: {poly.get_feature_names_out()}")
# Polynomial features capture non-linear relationships!

Explanation: Polynomial features allow linear regression to model non-linear relationships by creating interaction and squared terms.

05

Python Implementation

Let's put everything together with a complete, production-ready implementation. We'll load real data, preprocess it, train a model, evaluate performance, and make predictions - all the steps you'll use in actual projects.

Complete End-to-End Pipeline

Step 1: Import Libraries and Load Data

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt

# 1. Load and explore data
data = pd.DataFrame({
    'square_feet': [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700],
    'bedrooms': [3, 3, 2, 4, 2, 3, 4, 5, 3, 3],
    'age_years': [10, 8, 15, 5, 20, 12, 3, 2, 18, 9],
    'price': [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
})

print("Dataset shape:", data.shape)
print("\nFirst few rows:")
print(data.head())
What this code does: Imports all required libraries for a complete ML pipeline and creates a 10-house dataset with 3 features each. Uses pd.DataFrame to organize data with labeled columns for easy manipulation.

Best practice: Always explore data first! data.shape tells you dimensions (10 rows, 4 columns), data.head() shows first 5 rows to verify data loaded correctly, data.describe() gives statistics (mean, std, min, max). This prevents errors like missing values, wrong data types, or incorrect columns. Real-world tip: Add data.isnull().sum() to check for missing data!

Step 2: Prepare and Split Data

# 2. Prepare features and target
X = data[['square_feet', 'bedrooms', 'age_years']]
y = data['price']

# 3. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
What this code does: Separates features (X) from target (y), then splits into training (80% = 8 houses) and testing (20% = 2 houses) sets. random_state=42 ensures reproducibility.

The 80/20 rule: This is the most common split ratio, but it depends on dataset size:
Small datasets (<100 samples): Use 80/20 or cross-validation
Medium datasets (100-10k): 80/20 or 70/30
Large datasets (>100k): Can use 90/10 or even 98/2
Rule of thumb: Test set should have at least 30-50 samples for reliable performance estimates. Our 2-sample test set is tiny - in production, you'd need more data!

Step 3: Scale Features

# 4. Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
What this code does: Standardizes features using StandardScaler so each feature has mean=0 and standard deviation=1. This rescales square_feet (1000-2000 range) and bedrooms (2-5 range) to comparable scales.

Critical workflow:
1. scaler.fit(X_train): Learns mean and std from training data ONLY
2. scaler.transform(X_train): Applies transformation to training data
3. scaler.transform(X_test): Applies SAME transformation to test data
Never fit on test data! This would cause data leakage - the test set is supposed to be "unseen". Formula used: z = (x - mean) / std
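The fit-on-train-only rule can be demonstrated in a few lines. In this sketch (toy numbers, not the house data) the test point is scaled with the training mean and std, so it lands well outside the usual ±1 range - exactly what should happen for an out-of-range value:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[10.0]])

# Fit on TRAIN only: learns mean=2.5 and std≈1.118 from the training column
scaler = StandardScaler().fit(X_train)

# Test data is scaled with the TRAIN statistics - it may land far outside [-1, 1]
z = scaler.transform(X_test)[0, 0]
print(f"Scaled test value: {z:.3f}")  # ≈ (10 - 2.5) / 1.118 ≈ 6.708
```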

Step 4: Train Model

# 5. Train model
model = LinearRegression()
model.fit(X_train_scaled, y_train)
What this code does: Creates a LinearRegression model and trains it on the scaled training data using .fit(). The model learns the relationship between standardized features and prices.

Training speed: For linear regression with small datasets like this, training is nearly instantaneous (<1ms). The algorithm uses matrix operations to directly compute optimal coefficients. For comparison: gradient descent would take longer but works better for huge datasets that don't fit in memory.

Step 5: Make Predictions

# 6. Make predictions
y_train_pred = model.predict(X_train_scaled)
y_test_pred = model.predict(X_test_scaled)
What this code does: Generates predictions for both training and test sets. We predict on both to compare performance and detect overfitting.

Overfitting check: If train_R² >> test_R² (e.g., train=0.95, test=0.65), the model memorized training data but doesn't generalize. For linear regression with proper train/test split, this is rare - but common with complex models like deep neural networks. Ideally: train and test scores should be close (e.g., both around 0.85-0.90).
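The train/test comparison can be reduced to a single gap number. This sketch reuses the synthetic y = 2.5x + 5 + noise pattern from earlier (an assumption, with a larger sample of 200 points) and checks that the two R² scores stay close:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data following the same y = 2.5x + 5 + noise pattern as earlier
rng = np.random.default_rng(0)
X = rng.random((200, 1)) * 10
y = 2.5 * X.ravel() + 5 + rng.normal(0, 2, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# A small gap between train and test R² suggests no overfitting
gap = model.score(X_tr, y_tr) - model.score(X_te, y_te)
print(f"Train/test R² gap: {gap:.3f}")
```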

Step 6: Evaluate Performance

# 7. Evaluate
print("\nModel Performance:")
print(f"Train R²: {r2_score(y_train, y_train_pred):.4f}")
print(f"Test R²: {r2_score(y_test, y_test_pred):.4f}")
print(f"Test MSE: {mean_squared_error(y_test, y_test_pred):.2f}")
print(f"Test MAE: {mean_absolute_error(y_test, y_test_pred):.2f}")

# 8. Feature importance
print("\nFeature Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef:.2f}")
What this code does: Calculates comprehensive performance metrics: R² (variance explained), MSE (mean squared error), MAE (mean absolute error), and displays feature coefficients.

Metrics interpretation guide:
R² (Train vs Test): Both should be high and similar. If Train R²=0.95 and Test R²=0.93 → Great! If Test R² < 0.7 → Model struggles with unseen data
MSE: Depends on scale - for $200k-$400k prices, MSE < 1000 is excellent
MAE: More interpretable than MSE - shows average prediction error in same units as target
Feature coefficients: After scaling, larger magnitude = more important feature. Sign shows direction (+ increases price, - decreases)

Evaluation Metrics Explained

R² Score

Range: 0 to 1 (higher is better)

Meaning: Proportion of variance in y explained by the model. R² = 0.85 means 85% of variance is explained.

MSE

Range: 0 to ∞ (lower is better)

Meaning: Average of squared errors. Penalizes large errors heavily. Units are squared ($ squared).

RMSE

Range: 0 to ∞ (lower is better)

Meaning: Square root of MSE. Same units as target ($). More interpretable than MSE.

MAE

Range: 0 to ∞ (lower is better)

Meaning: Average of absolute errors. More robust to outliers. Same units as target ($).

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# True and predicted values
y_true = np.array([100, 150, 200, 250, 300])
y_pred = np.array([110, 140, 190, 260, 290])

# Calculate all metrics
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # Or root_mean_squared_error(y_true, y_pred) in sklearn 1.4+
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"MSE:  {mse:.2f}  (units squared)")
print(f"RMSE: {rmse:.2f} (same units as target)")
print(f"MAE:  {mae:.2f}  (same units as target)")
print(f"R²:   {r2:.4f} (0 to 1, dimensionless)")

# Example output:
# MSE:  100.00  (units squared)
# RMSE: 10.00   (same units as target)
# MAE:  10.00   (same units as target)
# R²:   0.9800  (0 to 1, dimensionless)

Model Diagnostics

After training, always check residuals (prediction errors) to verify assumptions hold:

Step 1: Calculate Residuals

import numpy as np
import matplotlib.pyplot as plt

# Calculate residuals (prediction errors)
residuals = y_test - y_test_pred
What this code does: Calculates residuals (prediction errors) by subtracting predictions from actual values: residual = actual - predicted. Positive residual = underestimated, negative = overestimated.

Why analyze residuals? They reveal model problems that overall metrics (R², MSE) might hide:
- Patterns in residuals: indicate missing features or the wrong model type
- Non-normal distribution: violates the linear regression assumptions behind inference
- Increasing variance: heteroscedasticity (predictions less reliable at the extremes)
A well-specified model leaves residuals that look like random noise centered at 0 with constant variance.

Step 2: Create Residual Plot

# 1. Residual plot (should show random scatter)
plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.scatter(y_test_pred, residuals, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
What this code does: Creates a scatter plot of residuals vs predicted values with a horizontal reference line at y=0. This is THE most important diagnostic plot for regression.

Good residual plot: random scatter around y = 0 with no pattern (like random noise).
Bad residual plot patterns:
- Curved pattern: add polynomial features or switch to a non-linear model
- Cone/funnel shape: heteroscedasticity, variance grows with the predicted value (try a log transformation of the target)
- Clusters/groups: missing categorical features or outliers
If you see patterns, your model is systematically wrong in predictable ways!
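One way to make the "curved pattern" concrete: fit a straight line to deliberately quadratic data and compare residuals in the middle of the range against the edges. A minimal sketch (the data and cutoffs are illustrative):

```python
# Sketch: a curved residual pattern from fitting a line to quadratic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = X.squeeze() ** 2 + rng.normal(0, 0.3, 200)   # the true relationship is quadratic

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Random noise would average near 0 everywhere; a curved pattern shows up
# as systematically negative residuals in the middle and positive at the edges.
middle = residuals[np.abs(X.squeeze()) < 1].mean()
edges = residuals[np.abs(X.squeeze()) > 2].mean()
print(f"mean residual  middle: {middle:.2f}   edges: {edges:.2f}")
# Fix: add PolynomialFeatures(degree=2) before the linear model.
```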

Step 3: Create Histogram of Residuals

# 2. Histogram (should be roughly normal/bell-shaped)
plt.subplot(1, 3, 2)
plt.hist(residuals, bins=10, edgecolor='black')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Residual Distribution')
What this code does: Creates a histogram showing the frequency distribution of residuals. For valid linear regression, residuals should follow a normal (bell-shaped) distribution centered at 0.

Distribution shapes:
- Bell curve centered at 0: perfect, the normality assumption is satisfied
- Skewed right/left: asymmetric errors; consider transforming the target variable (log, sqrt)
- Multiple peaks (bimodal): suggests missing features or subgroups in the data
- Uniform/flat: non-normal, possibly the wrong model type
Why care? With non-normal residuals, confidence intervals and hypothesis tests are unreliable!
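The log-transformation suggestion can be checked numerically: generate a target with multiplicative (right-skewed) noise and compare residual skewness before and after taking the log. A sketch under those assumptions, with illustrative parameters:

```python
# Sketch: log-transforming a multiplicative-noise target reduces residual skew.
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 300)
X = x.reshape(-1, 1)
y = np.exp(1.0 + 0.3 * x + rng.normal(0, 0.4, 300))   # multiplicative, right-skewed noise

raw_model = LinearRegression().fit(X, y)
log_model = LinearRegression().fit(X, np.log(y))      # log(y) = 1 + 0.3x + noise is linear

skew_raw = stats.skew(y - raw_model.predict(X))
skew_log = stats.skew(np.log(y) - log_model.predict(X))
print(f"residual skewness  raw: {skew_raw:.2f}   log: {skew_log:.2f}")
# Note: predictions made on the log scale need np.exp(...) to return to original units.
```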

Step 4: Create Q-Q Plot

# 3. Q-Q plot (points should follow diagonal line)
plt.subplot(1, 3, 3)
from scipy import stats
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot')

plt.tight_layout()
plt.show()
What this code does: Creates a Quantile-Quantile (Q-Q) plot comparing residual distribution to theoretical normal distribution. If residuals are normal, points lie on the diagonal line.

Reading Q-Q plots:
- Points on the diagonal: residuals are normally distributed
- S-shaped curve: the distribution has heavier or lighter tails than normal
- Points diverging at the extremes: outliers present
- Systematic deviation: skewed distribution
Q-Q plots are more sensitive than histograms for detecting non-normality. On small datasets, take deviations from the line seriously; on large datasets, minor deviations at the extremes are usually acceptable in practice.

Step 5: Statistical Test for Normality

# Statistical test for normality
_, p_value = stats.shapiro(residuals)
print(f"Shapiro test p-value: {p_value:.4f}")
print("Normal!" if p_value > 0.05 else "Not normal - check assumptions")
What this code does: Performs the Shapiro-Wilk statistical test for normality. Tests the null hypothesis: "residuals come from a normal distribution."

Interpreting p-values:
- p > 0.05: cannot reject the null hypothesis → residuals are plausibly normal (good!)
- p ≤ 0.05: reject the null hypothesis → residuals are NOT normal (problem!)
- p < 0.01: strong evidence of non-normality

Important caveats: With large samples (>1000), Shapiro-Wilk becomes oversensitive - tiny deviations from normality (that don't matter practically) will show p<0.05. For large datasets, rely more on visual plots (histogram, Q-Q) than p-values. For small datasets (<50), the test has low power and might miss non-normality.
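The sample-size caveat is easy to demonstrate: draw the same mildly non-normal distribution (here a t-distribution with 10 degrees of freedom, an illustrative choice) at two sizes and compare the p-values. Small samples usually pass; large samples are often flagged:

```python
# Sketch: Shapiro-Wilk sensitivity grows with sample size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
for n in (50, 2000):
    sample = rng.standard_t(df=10, size=n)   # mildly heavier tails than normal
    _, p = stats.shapiro(sample)
    print(f"n={n:5d}   Shapiro p={p:.4f}")
```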

Cross-Validation

Use k-fold cross-validation to get a more reliable estimate of model performance:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np

# Create model
model = LinearRegression()

# Perform 5-fold cross-validation
cv_scores = cross_val_score(
    model, X_train_scaled, y_train, 
    cv=5, scoring='r2'
)

print("Cross-Validation R² Scores:", cv_scores)
print(f"Mean R²: {cv_scores.mean():.4f}")
print(f"Std Dev: {cv_scores.std():.4f}")

# Also check MSE
cv_mse = -cross_val_score(
    model, X_train_scaled, y_train,
    cv=5, scoring='neg_mean_squared_error'
)

print(f"\nMean MSE: {cv_mse.mean():.2f}")

Saving and Loading Models

import joblib

# Save model and scaler
joblib.dump(model, 'house_price_model.pkl')
joblib.dump(scaler, 'feature_scaler.pkl')

print("Model saved successfully!")

# Load model later for predictions
loaded_model = joblib.load('house_price_model.pkl')
loaded_scaler = joblib.load('feature_scaler.pkl')

# Make prediction with loaded model
new_house = [[2000, 3, 10]]  # sqft, beds, age
new_house_scaled = loaded_scaler.transform(new_house)
prediction = loaded_model.predict(new_house_scaled)

print(f"Predicted price: ${prediction[0]:.0f}k")

Common Pitfalls to Avoid

Don't: Use the test set for model selection or tuning - this causes data leakage and overoptimistic performance estimates.
Don't: Forget to scale test data using the same scaler fitted on training data. Never fit scaler on test data!
Don't: Extrapolate too far beyond your training data range. Models are unreliable outside the training distribution.
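The scaling pitfall in code form: a sketch of the leakage-safe fit/transform order, with illustrative synthetic data:

```python
# Sketch: fit the scaler on the training split only, then reuse it for test.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=100, scale=20, size=(200, 2))   # unscaled, sqft-like ranges
y = X[:, 0] + X[:, 1] + rng.normal(0, 1, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit AND transform: training data only
X_test_scaled = scaler.transform(X_test)         # transform only: reuses training mean/std

# Wrong (leaks test-set statistics into preprocessing):
#     X_scaled = StandardScaler().fit_transform(X)  # fitted before splitting
print("train column means:", X_train_scaled.mean(axis=0).round(6))   # ≈ [0, 0]
```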

Practice Questions

Problem: Given actual values [100, 200, 300] and predictions [110, 190, 290], calculate the R² score.

Show Solution
from sklearn.metrics import r2_score
import numpy as np

y_true = np.array([100, 200, 300])
y_pred = np.array([110, 190, 290])

r2 = r2_score(y_true, y_pred)  # ~0.985

print(f"R² Score: {r2:.4f}")
print(f"The model explains {r2*100:.1f}% of the variance")

# Manual calculation
ss_res = np.sum((y_true - y_pred) ** 2)  # residual sum of squares
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
r2_manual = 1 - (ss_res / ss_tot)
print(f"Manual R²: {r2_manual:.4f}")

Explanation: R² measures how much better your model is compared to just predicting the mean.
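You can verify the "better than predicting the mean" interpretation directly: a model that always predicts the mean scores exactly R² = 0.

```python
# Sketch: the mean-prediction baseline scores R² = 0 by construction.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100, 200, 300])
baseline = np.full_like(y_true, y_true.mean(), dtype=float)   # always predict 200

print(r2_score(y_true, baseline))   # 0.0: no better (and no worse) than the baseline
```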

Problem: Create a complete pipeline with StandardScaler and LinearRegression. Use 3-fold cross-validation to evaluate. Data: X = [[1,2], [2,3], [3,4], [4,5], [5,6], [6,7]], y = [3, 5, 7, 9, 11, 13]. (Six samples split cleanly into three folds of two; with only five samples one fold would hold a single point, where R² is undefined.)

Show Solution
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np

X = np.array([[1,2], [2,3], [3,4], [4,5], [5,6], [6,7]])
y = np.array([3, 5, 7, 9, 11, 13])  # y = x1 + x2 exactly, so expect near-perfect scores

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])

# Cross-validation
cv_scores = cross_val_score(pipeline, X, y, cv=3, scoring='r2')

print("CV Scores:", cv_scores)
print(f"Mean R²: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# Train on full data for deployment
pipeline.fit(X, y)

# Predict
new_data = [[7, 8]]
prediction = pipeline.predict(new_data)
print(f"Prediction for [7,8]: {prediction[0]:.2f}")

Explanation: Pipelines ensure preprocessing steps are applied consistently and prevent data leakage during cross-validation.

Problem: Generate 100 points with y = 2x + 3 + noise. Train a model, calculate residuals, and check if they're normally distributed using Shapiro-Wilk test.

Show Solution
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy import stats
import matplotlib.pyplot as plt

# Generate data
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2 * X.squeeze() + 3 + np.random.normal(0, 1, 100)

# Train model
model = LinearRegression()
model.fit(X, y)

# Calculate residuals
y_pred = model.predict(X)
residuals = y - y_pred

# Normality test
stat, p_value = stats.shapiro(residuals)

print(f"Shapiro-Wilk statistic: {stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value > 0.05:
    print("✓ Residuals are normally distributed (assumption holds)")
else:
    print("✗ Residuals not normal (violation of assumption)")

# Visual check
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residual Plot')

plt.subplot(1, 2, 2)
plt.hist(residuals, bins=20, edgecolor='black')
plt.xlabel('Residuals')
plt.title('Distribution')

plt.tight_layout()
plt.show()

Explanation: Shapiro-Wilk test checks normality. P > 0.05 means residuals are likely normal, validating a key regression assumption.

Key Takeaways

Linear Model

Linear regression fits a straight line through data points to predict continuous values based on input features

Cost Function

Uses Mean Squared Error to measure prediction error; optimal parameters are found by gradient descent or, in closed form, by the normal equation
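For reference, gradient descent on the MSE cost can be sketched in a few lines (the learning rate and iteration count are illustrative choices for this data scale):

```python
# Sketch: batch gradient descent minimizing MSE for y = w*x + b.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 2 * x + 3 + rng.normal(0, 0.1, 100)   # true line: w=2, b=3 plus noise

w, b, lr = 0.0, 0.0, 0.5                  # start from zero; lr tuned to x's scale
for _ in range(2000):
    error = (w * x + b) - y
    w -= lr * 2 * np.mean(error * x)      # ∂MSE/∂w
    b -= lr * 2 * np.mean(error)          # ∂MSE/∂b

print(f"w ≈ {w:.2f}, b ≈ {b:.2f}")        # converges near w=2, b=3
```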

Scikit-learn

Python's scikit-learn provides LinearRegression class for easy implementation and training of models

Evaluation Metrics

Use R-squared, MSE, and MAE to assess model performance and compare different regression models

Assumptions

Linear regression assumes linearity, independence, homoscedasticity, and normally distributed errors

Feature Engineering

Transform and scale features appropriately to improve model performance and interpretability

Knowledge Check

Test your understanding of linear regression:

Question 1 of 6

What is the primary goal of linear regression?

Question 2 of 6

In the equation y = mx + b, what does 'm' represent?

Question 3 of 6

What is the most common cost function used in linear regression?

Question 4 of 6

What does an R-squared value of 0.85 indicate?

Question 5 of 6

Which scikit-learn method trains a linear regression model?

Question 6 of 6

What is the difference between simple and multiple linear regression?
