Assignment Overview
In this assignment, you will build a complete Housing Price Prediction System using various regression techniques. This comprehensive project requires you to apply ALL concepts from Module 2: simple linear regression, multiple linear regression, polynomial regression, Ridge regularization, Lasso regularization, and proper model evaluation using regression metrics.
You will use pandas, numpy, matplotlib, seaborn, and scikit-learn for this assignment.
Linear Regression (2.1)
Simple & multiple linear regression, coefficients, assumptions
Polynomial Regression (2.2)
Feature transformation, degree selection, overfitting detection
Regularization (2.3)
Ridge (L2), Lasso (L1), ElasticNet, alpha tuning
The Scenario
HomeValue Analytics
You have been hired as a Machine Learning Engineer at HomeValue Analytics, a real estate technology company that helps buyers and sellers understand property values. The lead data scientist has given you this task:
"We have historical housing data with various features like square footage, bedrooms, location scores, and more. We need you to build multiple regression models, compare their performance, and recommend the best approach for predicting house prices. Pay special attention to overfitting - we need models that generalize well!"
Your Task
Create a Jupyter Notebook called regression_analysis.ipynb that implements multiple
regression models, compares their performance using appropriate metrics, and provides recommendations
for the best model to use in production.
The Dataset
You will work with a Housing Price dataset. Create this CSV file as shown below:
File: housing_data.csv (Housing Data)
house_id,square_feet,bedrooms,bathrooms,age_years,garage_size,location_score,has_pool,has_garden,distance_to_city,price
H001,1850,3,2,5,2,8.5,0,1,12,385000
H002,2400,4,3,2,2,9.2,1,1,8,520000
H003,1200,2,1,25,1,6.5,0,0,22,195000
H004,3200,5,4,1,3,9.5,1,1,5,725000
H005,1650,3,2,15,1,7.2,0,1,18,275000
H006,2100,4,2,8,2,8.0,0,1,14,385000
H007,1400,2,1,30,1,5.8,0,0,28,165000
H008,2800,4,3,3,2,9.0,1,1,6,595000
H009,1950,3,2,12,2,7.8,0,1,16,325000
H010,3500,5,4,0,3,9.8,1,1,3,850000
H011,1100,2,1,35,0,5.2,0,0,32,145000
H012,2250,4,2,6,2,8.3,0,1,11,425000
H013,1750,3,2,18,1,6.8,0,0,20,255000
H014,2650,4,3,4,2,8.8,1,1,9,485000
H015,1550,3,1,22,1,6.2,0,0,25,215000
H016,2950,5,3,2,3,9.3,1,1,4,675000
H017,1350,2,1,28,1,5.5,0,0,30,155000
H018,2050,3,2,10,2,7.5,0,1,15,345000
H019,1900,3,2,7,2,8.2,0,1,13,365000
H020,3100,5,4,1,3,9.6,1,1,4,780000
H021,1450,2,1,20,1,6.0,0,0,24,185000
H022,2300,4,2,5,2,8.4,1,1,10,445000
H023,1600,3,2,14,1,7.0,0,1,19,285000
H024,2700,4,3,3,2,8.9,1,1,7,545000
H025,1250,2,1,32,0,5.0,0,0,35,135000
Columns Explained
- house_id - Unique identifier (string)
- square_feet - Living area in sq ft (integer) - key predictor
- bedrooms - Number of bedrooms (integer)
- bathrooms - Number of bathrooms (integer)
- age_years - Age of the house (integer)
- garage_size - Garage capacity in cars (integer)
- location_score - Location desirability 1-10 (float)
- has_pool - Has swimming pool (binary: 0/1)
- has_garden - Has garden (binary: 0/1)
- distance_to_city - Distance to city center in km (integer)
- price - Sale price in dollars (target variable)
Requirements
Your regression_analysis.ipynb must implement ALL of the following functions.
Each function is mandatory and will be tested individually.
Load and Explore Data
Create a function load_and_explore(filename) that:
- Loads the CSV file using pandas
- Displays basic statistics for all numerical columns
- Checks for missing values and data types
- Returns the DataFrame and exploration summary
def load_and_explore(filename):
    """Load dataset and return exploration summary."""
    # Must return: (df, exploration_dict)
    pass
Visualize Feature Relationships
Create a function visualize_relationships(df, target='price') that:
- Creates scatter plots of each feature vs target
- Creates a correlation heatmap
- Identifies the most correlated features
- Saves plots as feature_analysis.png
def visualize_relationships(df, target='price'):
    """Create visualizations of feature-target relationships."""
    # Must save: feature_analysis.png
    pass
Simple Linear Regression
Create a function simple_linear_regression(X, y, feature_name) that:
- Trains a simple linear regression with ONE feature
- Plots the regression line with data points
- Returns model, coefficients, and intercept
- Prints the regression equation
def simple_linear_regression(X, y, feature_name):
    """Train simple linear regression and visualize."""
    # Return: (model, coefficient, intercept)
    pass
Multiple Linear Regression
Create a function multiple_linear_regression(X_train, X_test, y_train, y_test) that:
- Trains a multiple linear regression with ALL features
- Returns model and predictions
- Displays feature importance (coefficients)
def multiple_linear_regression(X_train, X_test, y_train, y_test):
    """Train multiple linear regression model."""
    # Return: (model, y_pred, coefficients_dict)
    pass
Polynomial Regression
Create a function polynomial_regression(X_train, X_test, y_train, y_test, degree=2) that:
- Creates polynomial features using PolynomialFeatures
- Trains linear regression on transformed features
- Returns model, transformer, and predictions
def polynomial_regression(X_train, X_test, y_train, y_test, degree=2):
    """Train polynomial regression model."""
    # Return: (model, poly_transformer, y_pred)
    pass
Find Optimal Polynomial Degree
Create a function find_optimal_degree(X_train, X_test, y_train, y_test, max_degree=5) that:
- Tests polynomial degrees from 1 to max_degree
- Tracks train and test errors for each degree
- Plots learning curves to show overfitting
- Returns the optimal degree based on test performance
def find_optimal_degree(X_train, X_test, y_train, y_test, max_degree=5):
    """Find optimal polynomial degree by comparing train/test errors."""
    # Return: (optimal_degree, results_df)
    pass
Ridge Regression
Create a function ridge_regression(X_train, X_test, y_train, y_test, alpha=1.0) that:
- Trains Ridge regression with specified alpha
- Returns model, predictions, and coefficients
- Compares coefficient magnitudes with linear regression
def ridge_regression(X_train, X_test, y_train, y_test, alpha=1.0):
    """Train Ridge regression model."""
    # Return: (model, y_pred, coefficients)
    pass
Lasso Regression
Create a function lasso_regression(X_train, X_test, y_train, y_test, alpha=1.0) that:
- Trains Lasso regression with specified alpha
- Returns model, predictions, and coefficients
- Identifies features with zero coefficients (feature selection)
def lasso_regression(X_train, X_test, y_train, y_test, alpha=1.0):
    """Train Lasso regression model."""
    # Return: (model, y_pred, coefficients, zero_features)
    pass
Tune Regularization Alpha
Create a function tune_alpha(X_train, y_train, model_type='ridge') that:
- Uses cross-validation to find optimal alpha
- Tests alphas: [0.001, 0.01, 0.1, 1, 10, 100]
- Plots alpha vs cross-validation score
- Returns optimal alpha and CV results
def tune_alpha(X_train, y_train, model_type='ridge'):
    """Find optimal alpha using cross-validation."""
    # Return: (optimal_alpha, cv_results_df)
    pass
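A sketch using cross_val_score with R² as the selection criterion (the required alpha-vs-score plot is omitted; `mean_r2` is an illustrative column name):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

def tune_alpha(X_train, y_train, model_type='ridge'):
    """Pick alpha by mean cross-validated R^2."""
    alphas = [0.001, 0.01, 0.1, 1, 10, 100]
    scores = []
    for alpha in alphas:
        if model_type == 'ridge':
            model = Ridge(alpha=alpha)
        else:
            model = Lasso(alpha=alpha, max_iter=10000)
        cv = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
        scores.append(cv.mean())
    cv_results_df = pd.DataFrame({"alpha": alphas, "mean_r2": scores})
    optimal_alpha = alphas[int(np.argmax(scores))]
    return optimal_alpha, cv_results_df
```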
Calculate Regression Metrics
Create a function calculate_regression_metrics(y_true, y_pred, model_name) that:
- Calculates MSE, RMSE, MAE, and R² score
- Returns a dictionary with all metrics
- Optionally creates residual plot
def calculate_regression_metrics(y_true, y_pred, model_name):
    """Calculate and return regression metrics."""
    # Return: dict with 'mse', 'rmse', 'mae', 'r2'
    pass
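The metric calculations themselves are a few lines with scikit-learn (the optional residual plot is omitted here):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def calculate_regression_metrics(y_true, y_pred, model_name):
    """Calculate standard regression metrics for one model."""
    mse = mean_squared_error(y_true, y_pred)
    metrics = {
        "mse": mse,
        "rmse": np.sqrt(mse),  # same units as the target
        "mae": mean_absolute_error(y_true, y_pred),
        "r2": r2_score(y_true, y_pred),
    }
    print(f"{model_name}: " + ", ".join(f"{k}={v:.4f}" for k, v in metrics.items()))
    return metrics
```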
Compare All Models
Create a function compare_models(results_dict) that:
- Takes dictionary of model results
- Creates comparison bar charts for all metrics
- Saves comparison as
model_comparison.png - Returns DataFrame with comparison table
def compare_models(results_dict):
    """Compare all models and visualize results."""
    # Return: comparison_df
    pass
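Building the comparison table from the per-model metric dicts can be as simple as the sketch below (the required bar charts and model_comparison.png are omitted here):

```python
import pandas as pd

def compare_models(results_dict):
    """Build a comparison table from per-model metric dicts."""
    # results_dict maps model name -> {'mse': ..., 'rmse': ..., 'mae': ..., 'r2': ...}
    comparison_df = pd.DataFrame(results_dict).T
    comparison_df = comparison_df.sort_values("r2", ascending=False)
    return comparison_df
```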
Main Pipeline
Create a main() function that:
- Runs the complete regression analysis pipeline
- Trains all model types and collects results
- Generates all required visualizations
- Prints final recommendation for best model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def main():
    # 1. Load and explore data
    df, summary = load_and_explore("housing_data.csv")

    # 2. Visualize relationships
    visualize_relationships(df)

    # 3. Prepare features and target
    X = df.drop(['house_id', 'price'], axis=1)
    y = df['price']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # 4. Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # 5. Train all models
    results = {}

    # Simple Linear Regression (using square_feet)
    slr_model, coef, intercept = simple_linear_regression(X_train[['square_feet']], y_train, 'square_feet')

    # Multiple Linear Regression
    mlr_model, mlr_pred, mlr_coefs = multiple_linear_regression(X_train_scaled, X_test_scaled, y_train, y_test)
    results['Linear Regression'] = calculate_regression_metrics(y_test, mlr_pred, 'Linear Regression')

    # Polynomial Regression
    optimal_degree, degree_results = find_optimal_degree(X_train_scaled, X_test_scaled, y_train, y_test)
    poly_model, poly_trans, poly_pred = polynomial_regression(X_train_scaled, X_test_scaled, y_train, y_test, optimal_degree)
    results['Polynomial Regression'] = calculate_regression_metrics(y_test, poly_pred, 'Polynomial Regression')

    # Ridge Regression
    ridge_alpha, ridge_cv = tune_alpha(X_train_scaled, y_train, 'ridge')
    ridge_model, ridge_pred, ridge_coefs = ridge_regression(X_train_scaled, X_test_scaled, y_train, y_test, ridge_alpha)
    results['Ridge Regression'] = calculate_regression_metrics(y_test, ridge_pred, 'Ridge Regression')

    # Lasso Regression
    lasso_alpha, lasso_cv = tune_alpha(X_train_scaled, y_train, 'lasso')
    lasso_model, lasso_pred, lasso_coefs, zero_feats = lasso_regression(X_train_scaled, X_test_scaled, y_train, y_test, lasso_alpha)
    results['Lasso Regression'] = calculate_regression_metrics(y_test, lasso_pred, 'Lasso Regression')

    # 6. Compare all models
    comparison_df = compare_models(results)
    print(comparison_df)

    # 7. Recommendation (comparison_df is indexed by model name with an 'r2' column)
    best_model = comparison_df.loc[comparison_df['r2'].idxmax()]
    print(f"\nRecommendation: {best_model.name} with R² = {best_model['r2']:.4f}")

if __name__ == "__main__":
    main()
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
housing-price-regression
Required Files
housing-price-regression/
├── regression_analysis.ipynb # Your Jupyter Notebook with ALL 12 functions
├── housing_data.csv # Input dataset (as provided or extended)
├── feature_analysis.png # Feature relationship visualizations
├── model_comparison.png # Model comparison bar charts
├── predictions.csv # Test predictions from best model
└── README.md # REQUIRED - see contents below
README.md Must Include:
- Your full name and submission date
- Summary of all models trained and their metrics
- Your recommendation for the best model and why
- Any challenges faced and how you solved them
- Instructions to run your notebook
Do Include
- All 12 functions implemented and working
- Docstrings for every function
- Clear visualizations with labels and titles
- Model comparison with reasoning
- Hyperparameter tuning with cross-validation
- README.md with all required sections
Do Not Include
- Any .pyc or __pycache__ files (use .gitignore)
- Virtual environment folders
- Large model pickle files
- Code that doesn't run without errors
- Hardcoded file paths
Enter your GitHub username - we'll verify your repository automatically
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Linear Regression | 25 | Correct implementation of simple and multiple linear regression |
| Polynomial Regression | 30 | Proper feature transformation, degree selection, overfitting analysis |
| Regularization | 35 | Correct Ridge and Lasso implementation with alpha tuning |
| Model Evaluation | 25 | Accurate calculation of MSE, RMSE, MAE, R² and proper comparison |
| Visualizations | 25 | Clear, informative plots with proper labels and titles |
| Code Quality | 35 | Docstrings, comments, naming conventions, and clean organization |
| Total | 175 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
Linear Regression (2.1)
Understanding coefficients, interpreting regression equations, feature importance
Polynomial Regression (2.2)
Feature transformation, detecting overfitting, selecting optimal complexity
Regularization (2.3)
Ridge vs Lasso, feature selection with L1, hyperparameter tuning
Model Comparison
Evaluating regression models, understanding metrics, making recommendations
Pro Tips
Regression Best Practices
- Always scale features before regularization
- Check for multicollinearity using VIF
- Visualize residuals to check assumptions
- Use cross-validation for hyperparameter tuning
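On the multicollinearity tip above: VIF for feature j is 1/(1 - R²_j), where R²_j comes from regressing feature j on all the other features. You can use statsmodels for this, but a self-contained numpy sketch shows the idea (the `vif` helper below is illustrative, not part of the assignment API):

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2 of column vs the rest)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([others, np.ones(len(y))])  # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2) if r2 < 1.0 else np.inf)
    return vifs
```

A common rule of thumb is that VIF above 5-10 signals problematic multicollinearity.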
Model Selection
- Start simple, increase complexity gradually
- Compare train vs test performance
- Consider interpretability vs accuracy trade-off
- Lasso is better when you suspect many irrelevant features
Metrics to Focus On
- R² tells you how much variance is explained
- RMSE is in the same units as target
- MAE is more robust to outliers than MSE
- Compare metrics across train and test sets
Common Mistakes
- Forgetting to scale features for regularized models
- Using polynomial degree too high (overfitting)
- Not using cross-validation for alpha selection
- Ignoring the bias-variance trade-off