Assignment Overview
In this assignment, you will build a complete Housing Price Prediction System using various regression techniques. This comprehensive project requires you to apply ALL concepts from Module 2: simple linear regression, multiple linear regression, polynomial regression, Ridge regularization, Lasso regularization, and proper model evaluation using regression metrics.
You will use pandas, numpy, matplotlib, seaborn, and scikit-learn for this assignment.
Linear Regression (2.1)
Simple & multiple linear regression, coefficients, assumptions
Polynomial Regression (2.2)
Feature transformation, degree selection, overfitting detection
Regularization (2.3)
Ridge (L2), Lasso (L1), ElasticNet, alpha tuning
The Scenario
HomeValue Analytics
You have been hired as a Machine Learning Engineer at HomeValue Analytics, a real estate technology company that helps buyers and sellers understand property values. The lead data scientist has given you this task:
"We have historical housing data with various features like square footage, bedrooms, location scores, and more. We need you to build multiple regression models, compare their performance, and recommend the best approach for predicting house prices. Pay special attention to overfitting - we need models that generalize well!"
Your Task
Create a Jupyter Notebook called regression_analysis.ipynb that implements multiple
regression models, compares their performance using appropriate metrics, and provides recommendations
for the best model to use in production.
The Dataset
You will work with a Housing Price dataset. Create this CSV file as shown below:
File: housing_data.csv (Housing Data)
house_id,square_feet,bedrooms,bathrooms,age_years,garage_size,location_score,has_pool,has_garden,distance_to_city,price
H001,1850,3,2,5,2,8.5,0,1,12,385000
H002,2400,4,3,2,2,9.2,1,1,8,520000
H003,1200,2,1,25,1,6.5,0,0,22,195000
H004,3200,5,4,1,3,9.5,1,1,5,725000
H005,1650,3,2,15,1,7.2,0,1,18,275000
H006,2100,4,2,8,2,8.0,0,1,14,385000
H007,1400,2,1,30,1,5.8,0,0,28,165000
H008,2800,4,3,3,2,9.0,1,1,6,595000
H009,1950,3,2,12,2,7.8,0,1,16,325000
H010,3500,5,4,0,3,9.8,1,1,3,850000
H011,1100,2,1,35,0,5.2,0,0,32,145000
H012,2250,4,2,6,2,8.3,0,1,11,425000
H013,1750,3,2,18,1,6.8,0,0,20,255000
H014,2650,4,3,4,2,8.8,1,1,9,485000
H015,1550,3,1,22,1,6.2,0,0,25,215000
H016,2950,5,3,2,3,9.3,1,1,4,675000
H017,1350,2,1,28,1,5.5,0,0,30,155000
H018,2050,3,2,10,2,7.5,0,1,15,345000
H019,1900,3,2,7,2,8.2,0,1,13,365000
H020,3100,5,4,1,3,9.6,1,1,4,780000
H021,1450,2,1,20,1,6.0,0,0,24,185000
H022,2300,4,2,5,2,8.4,1,1,10,445000
H023,1600,3,2,14,1,7.0,0,1,19,285000
H024,2700,4,3,3,2,8.9,1,1,7,545000
H025,1250,2,1,32,0,5.0,0,0,35,135000
Columns Explained
- house_id - Unique identifier (string)
- square_feet - Living area in sq ft (integer) - key predictor
- bedrooms - Number of bedrooms (integer)
- bathrooms - Number of bathrooms (integer)
- age_years - Age of the house (integer)
- garage_size - Garage capacity in cars (integer)
- location_score - Location desirability 1-10 (float)
- has_pool - Has swimming pool (binary: 0/1)
- has_garden - Has garden (binary: 0/1)
- distance_to_city - Distance to city center in km (integer)
- price - Sale price in dollars (target variable)
Requirements
Your regression_analysis.ipynb must implement ALL of the following functions.
Each function is mandatory and will be tested individually.
Load and Explore Data
Create a function load_and_explore(filename) that:
- Loads the CSV file using pandas
- Displays basic statistics for all numerical columns
- Checks for missing values and data types
- Returns the DataFrame and exploration summary
def load_and_explore(filename):
    """Load dataset and return exploration summary."""
    # Must return: (df, exploration_dict)
    pass
Visualize Feature Relationships
Create a function visualize_relationships(df, target='price') that:
- Creates scatter plots of each feature vs target
- Creates a correlation heatmap
- Identifies the most correlated features
- Saves plots as feature_analysis.png
def visualize_relationships(df, target='price'):
    """Create visualizations of feature-target relationships."""
    # Must save: feature_analysis.png
    pass
Simple Linear Regression
Create a function simple_linear_regression(X, y, feature_name) that:
- Trains a simple linear regression with ONE feature
- Plots the regression line with data points
- Returns model, coefficients, and intercept
- Prints the regression equation
def simple_linear_regression(X, y, feature_name):
    """Train simple linear regression and visualize."""
    # Return: (model, coefficient, intercept)
    pass
Multiple Linear Regression
Create a function multiple_linear_regression(X_train, X_test, y_train, y_test) that:
- Trains a multiple linear regression with ALL features
- Returns model and predictions
- Displays feature importance (coefficients)
def multiple_linear_regression(X_train, X_test, y_train, y_test):
    """Train multiple linear regression model."""
    # Return: (model, y_pred, coefficients_dict)
    pass
Polynomial Regression
Create a function polynomial_regression(X_train, X_test, y_train, y_test, degree=2) that:
- Creates polynomial features using PolynomialFeatures
- Trains linear regression on transformed features
- Returns model, transformer, and predictions
def polynomial_regression(X_train, X_test, y_train, y_test, degree=2):
    """Train polynomial regression model."""
    # Return: (model, poly_transformer, y_pred)
    pass
Find Optimal Polynomial Degree
Create a function find_optimal_degree(X_train, X_test, y_train, y_test, max_degree=5) that:
- Tests polynomial degrees from 1 to max_degree
- Tracks train and test errors for each degree
- Plots learning curves to show overfitting
- Returns the optimal degree based on test performance
def find_optimal_degree(X_train, X_test, y_train, y_test, max_degree=5):
    """Find optimal polynomial degree by comparing train/test errors."""
    # Return: (optimal_degree, results_df)
    pass
Ridge Regression
Create a function ridge_regression(X_train, X_test, y_train, y_test, alpha=1.0) that:
- Trains Ridge regression with specified alpha
- Returns model, predictions, and coefficients
- Compares coefficient magnitudes with linear regression
def ridge_regression(X_train, X_test, y_train, y_test, alpha=1.0):
    """Train Ridge regression model."""
    # Return: (model, y_pred, coefficients)
    pass
Lasso Regression
Create a function lasso_regression(X_train, X_test, y_train, y_test, alpha=1.0) that:
- Trains Lasso regression with specified alpha
- Returns model, predictions, and coefficients
- Identifies features with zero coefficients (feature selection)
def lasso_regression(X_train, X_test, y_train, y_test, alpha=1.0):
    """Train Lasso regression model."""
    # Return: (model, y_pred, coefficients, zero_features)
    pass
Tune Regularization Alpha
Create a function tune_alpha(X_train, y_train, model_type='ridge') that:
- Uses cross-validation to find optimal alpha
- Tests alphas: [0.001, 0.01, 0.1, 1, 10, 100]
- Plots alpha vs cross-validation score
- Returns optimal alpha and CV results
def tune_alpha(X_train, y_train, model_type='ridge'):
    """Find optimal alpha using cross-validation."""
    # Return: (optimal_alpha, cv_results_df)
    pass
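A sketch using cross_val_score with R² as the selection criterion (the required alpha-vs-score plot is omitted; `mean_r2` is an illustrative column name):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

def tune_alpha(X_train, y_train, model_type='ridge'):
    """Pick alpha by mean cross-validated R^2."""
    alphas = [0.001, 0.01, 0.1, 1, 10, 100]
    scores = []
    for alpha in alphas:
        if model_type == 'ridge':
            model = Ridge(alpha=alpha)
        else:
            model = Lasso(alpha=alpha, max_iter=10000)
        cv = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
        scores.append(cv.mean())
    cv_results_df = pd.DataFrame({"alpha": alphas, "mean_r2": scores})
    optimal_alpha = alphas[int(np.argmax(scores))]
    return optimal_alpha, cv_results_df
```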
Calculate Regression Metrics
Create a function calculate_regression_metrics(y_true, y_pred, model_name) that:
- Calculates MSE, RMSE, MAE, and R² score
- Returns a dictionary with all metrics
- Optionally creates residual plot
def calculate_regression_metrics(y_true, y_pred, model_name):
    """Calculate and return regression metrics."""
    # Return: dict with 'mse', 'rmse', 'mae', 'r2'
    pass
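The metric calculations themselves are a few lines with scikit-learn (the optional residual plot is omitted here):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def calculate_regression_metrics(y_true, y_pred, model_name):
    """Calculate standard regression metrics for one model."""
    mse = mean_squared_error(y_true, y_pred)
    metrics = {
        "mse": mse,
        "rmse": np.sqrt(mse),  # same units as the target
        "mae": mean_absolute_error(y_true, y_pred),
        "r2": r2_score(y_true, y_pred),
    }
    print(f"{model_name}: " + ", ".join(f"{k}={v:.4f}" for k, v in metrics.items()))
    return metrics
```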
Compare All Models
Create a function compare_models(results_dict) that:
- Takes dictionary of model results
- Creates comparison bar charts for all metrics
- Saves comparison as
model_comparison.png - Returns DataFrame with comparison table
def compare_models(results_dict):
    """Compare all models and visualize results."""
    # Return: comparison_df
    pass
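Building the comparison table from the per-model metric dicts can be as simple as the sketch below (the required bar charts and model_comparison.png are omitted here):

```python
import pandas as pd

def compare_models(results_dict):
    """Build a comparison table from per-model metric dicts."""
    # results_dict maps model name -> {'mse': ..., 'rmse': ..., 'mae': ..., 'r2': ...}
    comparison_df = pd.DataFrame(results_dict).T
    comparison_df = comparison_df.sort_values("r2", ascending=False)
    return comparison_df
```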
Main Pipeline
Create a main() function that:
- Runs the complete regression analysis pipeline
- Trains all model types and collects results
- Generates all required visualizations
- Prints final recommendation for best model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def main():
    # 1. Load and explore data
    df, summary = load_and_explore("housing_data.csv")

    # 2. Visualize relationships
    visualize_relationships(df)

    # 3. Prepare features and target
    X = df.drop(['house_id', 'price'], axis=1)
    y = df['price']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # 4. Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # 5. Train all models
    results = {}

    # Simple Linear Regression (using square_feet)
    slr_model, coef, intercept = simple_linear_regression(X_train[['square_feet']], y_train, 'square_feet')

    # Multiple Linear Regression
    mlr_model, mlr_pred, mlr_coefs = multiple_linear_regression(X_train_scaled, X_test_scaled, y_train, y_test)
    results['Linear Regression'] = calculate_regression_metrics(y_test, mlr_pred, 'Linear Regression')

    # Polynomial Regression
    optimal_degree, degree_results = find_optimal_degree(X_train_scaled, X_test_scaled, y_train, y_test)
    poly_model, poly_trans, poly_pred = polynomial_regression(X_train_scaled, X_test_scaled, y_train, y_test, optimal_degree)
    results['Polynomial Regression'] = calculate_regression_metrics(y_test, poly_pred, 'Polynomial Regression')

    # Ridge Regression
    ridge_alpha, ridge_cv = tune_alpha(X_train_scaled, y_train, 'ridge')
    ridge_model, ridge_pred, ridge_coefs = ridge_regression(X_train_scaled, X_test_scaled, y_train, y_test, ridge_alpha)
    results['Ridge Regression'] = calculate_regression_metrics(y_test, ridge_pred, 'Ridge Regression')

    # Lasso Regression
    lasso_alpha, lasso_cv = tune_alpha(X_train_scaled, y_train, 'lasso')
    lasso_model, lasso_pred, lasso_coefs, zero_feats = lasso_regression(X_train_scaled, X_test_scaled, y_train, y_test, lasso_alpha)
    results['Lasso Regression'] = calculate_regression_metrics(y_test, lasso_pred, 'Lasso Regression')

    # 6. Compare all models
    comparison_df = compare_models(results)
    print(comparison_df)

    # 7. Recommendation (comparison_df is indexed by model name with an 'r2' column)
    best_model = comparison_df.loc[comparison_df['r2'].idxmax()]
    print(f"\nRecommendation: {best_model.name} with R² = {best_model['r2']:.4f}")

if __name__ == "__main__":
    main()
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
housing-price-regression
Required Files
housing-price-regression/
├── regression_analysis.ipynb # Your Jupyter Notebook with ALL 12 functions
├── housing_data.csv # Input dataset (as provided or extended)
├── feature_analysis.png # Feature relationship visualizations
├── model_comparison.png # Model comparison bar charts
├── predictions.csv # Test predictions from best model
└── README.md # REQUIRED - see contents below
README.md Must Include:
- Your full name and submission date
- Summary of all models trained and their metrics
- Your recommendation for the best model and why
- Any challenges faced and how you solved them
- Instructions to run your notebook
Do Include
- All 12 functions implemented and working
- Docstrings for every function
- Clear visualizations with labels and titles
- Model comparison with reasoning
- Hyperparameter tuning with cross-validation
- README.md with all required sections
Do Not Include
- Any .pyc or __pycache__ files (use .gitignore)
- Virtual environment folders
- Large model pickle files
- Code that doesn't run without errors
- Hardcoded file paths
Enter your GitHub username - we'll verify your repository automatically
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Linear Regression | 25 | Correct implementation of simple and multiple linear regression |
| Polynomial Regression | 30 | Proper feature transformation, degree selection, overfitting analysis |
| Regularization | 35 | Correct Ridge and Lasso implementation with alpha tuning |
| Model Evaluation | 25 | Accurate calculation of MSE, RMSE, MAE, R² and proper comparison |
| Visualizations | 25 | Clear, informative plots with proper labels and titles |
| Code Quality | 35 | Docstrings, comments, naming conventions, and clean organization |
| Total | 175 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
Linear Regression (2.1)
Understanding coefficients, interpreting regression equations, feature importance
Polynomial Regression (2.2)
Feature transformation, detecting overfitting, selecting optimal complexity
Regularization (2.3)
Ridge vs Lasso, feature selection with L1, hyperparameter tuning
Model Comparison
Evaluating regression models, understanding metrics, making recommendations
Pro Tips
Regression Best Practices
- Always scale features before regularization
- Check for multicollinearity using VIF
- Visualize residuals to check assumptions
- Use cross-validation for hyperparameter tuning
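On the multicollinearity tip above: VIF for feature j is 1/(1 - R²_j), where R²_j comes from regressing feature j on all the other features. You can use statsmodels for this, but a self-contained numpy sketch shows the idea (the `vif` helper below is illustrative, not part of the assignment API):

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2 of column vs the rest)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([others, np.ones(len(y))])  # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2) if r2 < 1.0 else np.inf)
    return vifs
```

A common rule of thumb is that VIF above 5-10 signals problematic multicollinearity.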
Model Selection
- Start simple, increase complexity gradually
- Compare train vs test performance
- Consider interpretability vs accuracy trade-off
- Lasso is better when you suspect many irrelevant features
Metrics to Focus On
- R² tells you how much variance is explained
- RMSE is in the same units as target
- MAE is more robust to outliers than MSE
- Compare metrics across train and test sets
Common Mistakes
- Forgetting to scale features for regularized models
- Using polynomial degree too high (overfitting)
- Not using cross-validation for alpha selection
- Ignoring the bias-variance trade-off