Capstone Project 1

House Price Prediction

Build a complete end-to-end machine learning pipeline for predicting house prices. You will perform exploratory data analysis, engineer features, train multiple regression models, and evaluate performance using the well-known Kaggle House Prices dataset and its 79 features.

15-20 hours
Intermediate
500 Points
What You Will Build
  • Complete EDA notebook
  • Feature engineering pipeline
  • Multiple regression models
  • Model comparison & tuning
  • Final prediction system
Contents
01

Project Overview

This capstone project brings together everything you have learned in the Machine Learning course. You will work with the famous Kaggle House Prices dataset containing 1,460 training samples with 79 features describing almost every aspect of residential homes in Ames, Iowa. The dataset includes 36 numerical features and 43 categorical features covering everything from lot size and basement quality to garage type and sale conditions. Your goal is to build a production-ready regression pipeline that accurately predicts sale prices.

Skills Applied: This project tests your proficiency in Python (pandas, numpy, matplotlib, seaborn), scikit-learn (preprocessing, pipelines, models), feature engineering, hyperparameter tuning, and model evaluation.
EDA

Explore distributions, correlations, and identify patterns

Feature Engineering

Create, transform, and select the best features

Model Training

Train and compare multiple regression algorithms

Evaluation

Rigorous evaluation with cross-validation and metrics

Learning Objectives

Technical Skills
  • Master pandas for data manipulation and cleaning
  • Perform comprehensive exploratory data analysis
  • Build sklearn pipelines with ColumnTransformer
  • Implement feature engineering for mixed data types
  • Train and tune multiple regression models
ML Engineering Skills
  • Handle missing values with domain knowledge
  • Encode categorical variables effectively
  • Perform hyperparameter tuning with GridSearchCV
  • Evaluate models using appropriate regression metrics
  • Create reproducible and documented pipelines
02

Business Scenario

HomeValue AI

You have been hired as a Machine Learning Engineer at HomeValue AI, a real estate technology startup. The company is building an automated home valuation system to help buyers, sellers, and real estate agents get accurate price estimates instantly. The CEO has given you this challenge:

"We have historical sales data from Ames, Iowa - one of the most detailed housing datasets available. I need you to build a prediction model that can estimate house prices within 10% of actual sale prices. The model needs to be explainable so agents can tell clients WHY a house is valued at a certain price. Can you build us a reliable, interpretable pricing engine?"

Sarah Chen, CEO, HomeValue AI

Business Questions to Answer

Price Prediction
  • What is the predicted sale price for a given house?
  • What is the prediction confidence interval?
  • How accurate is our model on unseen data?
  • Which model performs best for this dataset?
Feature Importance
  • What features most influence house prices?
  • How much does each bedroom add to value?
  • What is the premium for quality finishes?
  • How does neighborhood affect pricing?
Market Insights
  • Are there undervalued properties in the market?
  • What renovations add the most value?
  • How does age affect property value?
  • What is the price distribution by neighborhood?
Model Insights
  • Where does the model make the largest errors?
  • Are there outliers affecting predictions?
  • How does model performance vary by price range?
  • What data quality issues need addressing?
Pro Tip: Think like a data scientist! Your model should not only predict well but also provide interpretable insights that real estate professionals can understand and trust.
03

The Dataset

You will work with the famous Kaggle House Prices dataset. Download the CSV files containing training data with 79 explanatory features:

Dataset Download

Download the house prices dataset files and save them to your project folder. The CSV files contain all necessary data for building your prediction model.

Original Data Source

This project uses the House Prices: Advanced Regression Techniques dataset from Kaggle - one of the most popular competition datasets for learning regression. The dataset was compiled by Dean De Cock for use in data science education and contains 79 features describing homes in Ames, Iowa.

Dataset Info: 1,460 training samples × 81 columns | 1,459 test samples | Price Range: $34,900 - $755,000 | Median: $163,000 | Features: 36 numerical, 43 categorical | Target: SalePrice (log-transformed recommended)
Key Features Overview

Category | Features | Description
Area | LotArea, GrLivArea, TotalBsmtSF, 1stFlrSF, 2ndFlrSF, GarageArea | Square footage measurements
Rooms | BedroomAbvGr, TotRmsAbvGrd, FullBath, HalfBath, KitchenAbvGr | Room counts
Quality | OverallQual, OverallCond | 1-10 rating scales
Age | YearBuilt, YearRemodAdd, GarageYrBlt | Construction and remodel years
Basement | BsmtFinSF1, BsmtFinSF2, BsmtUnfSF | Basement area breakdown
Porch | WoodDeckSF, OpenPorchSF, EnclosedPorch, ScreenPorch | Outdoor areas
Target | SalePrice | Sale price in USD

Category | Features | Example Values
Location | Neighborhood, MSZoning | CollgCr, Veenker, NAmes; RL, RM, FV
Building | BldgType, HouseStyle, RoofStyle | 1Fam, 2fmCon; 1Story, 2Story
Quality | ExterQual, ExterCond, BsmtQual, KitchenQual | Ex, Gd, TA, Fa, Po
Garage | GarageType, GarageFinish, GarageCond | Attchd, Detchd, BuiltIn
Basement | BsmtExposure, BsmtFinType1, BsmtCond | Gd, Av, Mn, No
Utilities | Heating, CentralAir, Electrical | GasA, GasW; Y, N
Sale | SaleType, SaleCondition | WD, New, COD; Normal, Abnormal

Feature | Missing % | Reason | Recommended Action
PoolQC | 99.5% | No pool | Fill with "None"
MiscFeature | 96.3% | No misc feature | Fill with "None"
Alley | 93.8% | No alley access | Fill with "None"
Fence | 80.8% | No fence | Fill with "None"
FireplaceQu | 47.3% | No fireplace | Fill with "None"
LotFrontage | 17.7% | Missing data | Impute by neighborhood median
GarageYrBlt | 5.5% | No garage | Fill with 0 or YearBuilt
Dataset Stats: 1,460 training samples, 79 features (36 numerical, 43 categorical), ~6% missing values overall
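The imputation guidance above translates into a small pandas helper. This is only a sketch: the column names follow the real dataset, but `impute_domain_aware` is a hypothetical function and the toy frame below is illustrative, not real data.

```python
import pandas as pd

# Columns where NaN means "feature absent", per the missing-values table
NONE_COLS = ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu"]

def impute_domain_aware(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # NaN in these columns means the house simply lacks the feature
    for col in NONE_COLS:
        if col in df:
            df[col] = df[col].fillna("None")
    # LotFrontage is genuinely missing -> use the neighborhood median
    if {"LotFrontage", "Neighborhood"}.issubset(df.columns):
        df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
            lambda s: s.fillna(s.median())
        )
    # GarageYrBlt: no garage -> fall back to YearBuilt
    if "GarageYrBlt" in df:
        df["GarageYrBlt"] = df["GarageYrBlt"].fillna(df.get("YearBuilt"))
    return df

# Toy rows for illustration (not real dataset records)
toy = pd.DataFrame({
    "PoolQC": [None, "Gd"],
    "Neighborhood": ["NAmes", "NAmes"],
    "LotFrontage": [60.0, None],
    "GarageYrBlt": [None, 1999.0],
    "YearBuilt": [1965, 1999],
})
clean = impute_domain_aware(toy)
```

The groupby-transform trick keeps the imputation local to each neighborhood instead of using one global median.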
Target Variable: SalePrice: Mean $180,921 | Median $163,000 | Right-skewed (log transform recommended)
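The recommended log transform is a pair of numpy calls; a minimal sketch, using a few illustrative prices spanning the dataset's stated range:

```python
import numpy as np

# Illustrative prices only ($34,900 - $755,000 per the dataset stats)
prices = np.array([34_900.0, 163_000.0, 180_921.0, 755_000.0])

# Train on log1p(SalePrice); the transform compresses the right tail
log_prices = np.log1p(prices)

# Invert model predictions with expm1 before reporting dollar amounts
dollars = np.expm1(log_prices)
```

Forgetting the `expm1` inversion on predictions is a common source of wildly wrong "dollar" outputs.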
Sample Data Preview

Here is what a typical training record looks like:

Id | MSSubClass | LotArea | OverallQual | YearBuilt | GrLivArea | BedroomAbvGr | Neighborhood | SalePrice
1 | 60 | 8,450 | 7 | 2003 | 1,710 | 3 | CollgCr | $208,500
2 | 20 | 9,600 | 6 | 1976 | 1,262 | 3 | Veenker | $181,500
3 | 60 | 11,250 | 7 | 2001 | 1,786 | 3 | CollgCr | $223,500
Data Quality Note: The dataset contains many missing values that are NOT random - they often indicate the absence of a feature (e.g., no pool, no garage). Understanding this is crucial for proper imputation!
04

Project Requirements

Your project must include all of the following components. Structure your deliverables as Jupyter notebooks with clear documentation and code organization.

1
Exploratory Data Analysis (EDA)

Create 01_eda.ipynb:

  • Load data and examine shape, dtypes, and missing values
  • Analyze target variable (SalePrice) distribution
  • Visualize numerical feature distributions with histograms
  • Analyze categorical features with bar plots
  • Create correlation matrix heatmap for numerical features
  • Identify top 10 features correlated with SalePrice
  • Scatter plots for key features vs SalePrice
  • Document 5+ key insights from your analysis
Deliverable: Comprehensive EDA notebook with at least 15 visualizations and markdown cells explaining findings.
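The correlation-ranking step above can be sketched as follows. The frame here is synthetic (real code would load train.csv instead); only three of the dataset's feature names appear, and the coefficients are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
# Synthetic stand-in for the training frame
df = pd.DataFrame({
    "GrLivArea": rng.uniform(500, 4000, n),
    "OverallQual": rng.integers(1, 11, n).astype(float),
    "YearBuilt": rng.integers(1900, 2011, n).astype(float),
})
df["SalePrice"] = (100 * df["GrLivArea"]
                   + 20_000 * df["OverallQual"]
                   + rng.normal(0, 10_000, n))

# Rank numeric features by absolute correlation with the target
corr = df.corr()["SalePrice"].drop("SalePrice")
top = corr.abs().sort_values(ascending=False).head(10)
print(top)
```

On the real data, pair this ranking with the correlation heatmap so redundant features (e.g. GarageCars vs GarageArea) are visible too.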
2
Feature Engineering

Create 02_feature_engineering.ipynb:

  • Handle Missing Values: Domain-aware imputation strategy
  • Create New Features:
    • TotalSF = TotalBsmtSF + 1stFlrSF + 2ndFlrSF
    • TotalBathrooms = FullBath + 0.5*HalfBath + BsmtFullBath + 0.5*BsmtHalfBath
    • Age = YrSold - YearBuilt
    • Remodeled = 1 if YearRemodAdd != YearBuilt else 0
    • TotalPorchSF = sum of all porch areas
    • HasPool, HasGarage, HasFireplace = binary indicators
  • Encode Categoricals: OrdinalEncoder for quality features, OneHotEncoder for nominal
  • Handle Outliers: Identify and treat outliers in GrLivArea, LotArea
  • Log Transform: Transform SalePrice and skewed features
Deliverable: Feature engineering notebook creating at least 10 new features with justification for each transformation.
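A few of the engineered features listed above, sketched as a pandas function. The single toy row is illustrative, and only a subset of the required features is shown:

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Combined square footage across basement and both floors
    df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
    # Half baths count as 0.5
    df["TotalBathrooms"] = (df["FullBath"] + 0.5 * df["HalfBath"]
                            + df["BsmtFullBath"] + 0.5 * df["BsmtHalfBath"])
    df["Age"] = df["YrSold"] - df["YearBuilt"]
    df["Remodeled"] = (df["YearRemodAdd"] != df["YearBuilt"]).astype(int)
    df["HasFireplace"] = (df["Fireplaces"] > 0).astype(int)
    return df

# Illustrative row (not a real dataset record)
toy = pd.DataFrame({
    "TotalBsmtSF": [856], "1stFlrSF": [856], "2ndFlrSF": [854],
    "FullBath": [2], "HalfBath": [1], "BsmtFullBath": [1], "BsmtHalfBath": [0],
    "YrSold": [2008], "YearBuilt": [2003], "YearRemodAdd": [2003],
    "Fireplaces": [0],
})
feat = add_features(toy)
```

Keeping feature creation in one function makes it trivial to apply identically to train and test frames.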
3
Model Training

Create 03_model_training.ipynb:

  • Build sklearn Pipeline: Use ColumnTransformer for preprocessing
  • Train at least 5 models:
    • Linear Regression (baseline)
    • Ridge Regression
    • Lasso Regression
    • Random Forest Regressor
    • XGBoost/Gradient Boosting
  • Cross-Validation: 5-fold CV for all models
  • Hyperparameter Tuning: GridSearchCV for top 2 models
  • Model Comparison: Create comparison table with metrics
Deliverable: Model training notebook with at least 5 models trained, tuned, and compared using consistent evaluation methodology.
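The train-and-compare loop might look like the sketch below. `make_regression` produces a synthetic stand-in for your preprocessed feature matrix, and the model settings are placeholders to tune, not recommendations:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed training data
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=42)

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.01),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=42),
    "GradBoost": GradientBoostingRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    # 5-fold CV; sklearn negates error scores, so flip the sign back
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    results[name] = -scores.mean()

for name, rmse in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:>12}: RMSE = {rmse:,.1f}")
```

Using the same `cv` and `scoring` for every model is what makes the comparison table fair.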
4
Model Evaluation

Create 04_evaluation.ipynb:

  • Calculate metrics: RMSE, MAE, R², MAPE on test set
  • Create residual plots (residuals vs predicted)
  • Plot actual vs predicted values
  • Analyze prediction error distribution
  • Feature importance visualization (top 20 features)
  • Error analysis by price range (low, medium, high)
  • Identify worst predictions and analyze patterns
Deliverable: Evaluation notebook with comprehensive analysis of model performance and error patterns.
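The four metrics take only a few lines; `y_true` and `y_pred` below are hypothetical placeholders for your test-set actuals and model predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical values - replace with real test-set actuals and predictions
y_true = np.array([208_500.0, 181_500.0, 223_500.0, 140_000.0])
y_pred = np.array([200_000.0, 190_000.0, 215_000.0, 150_000.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # percent

# Residuals feed the residual-vs-predicted plot and error analysis
residuals = y_true - y_pred
```

RMSE exceeding MAE by a wide margin signals a few large errors; the worst-prediction analysis should chase those rows down.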
5
Final Report

Create analysis_report.pdf:

  • Executive Summary (1 page): Key findings and model performance
  • Data Analysis (2 pages): EDA insights with visualizations
  • Methodology (2 pages): Feature engineering and model approach
  • Results (2 pages): Model comparison and final performance
  • Recommendations (1 page): Business insights and next steps
Deliverable: Professional PDF report (6-8 pages) suitable for presentation to non-technical stakeholders.
05

Model Specifications

Train and evaluate the following models. Use cross-validation for fair comparison and tune hyperparameters for your top performers.

Linear Models
  • Linear Regression: Baseline model, no regularization
  • Ridge (L2): Tune alpha: [0.1, 1.0, 10.0, 100.0]
  • Lasso (L1): Tune alpha: [0.0001, 0.001, 0.01, 0.1]
  • ElasticNet: Optional - combine L1 and L2
Tree-Based Models
  • Random Forest: Tune n_estimators, max_depth, min_samples_split
  • Gradient Boosting: Tune learning_rate, n_estimators, max_depth
  • XGBoost: Tune learning_rate, max_depth, subsample, colsample_bytree
  • LightGBM: Optional - faster alternative
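Tuning the Ridge alpha grid from the spec above with GridSearchCV, sketched on synthetic data standing in for your preprocessed features:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real feature matrix
X, y = make_regression(n_samples=200, n_features=30, noise=15.0, random_state=0)

# Alpha grid taken from the model specifications above
grid = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0, 100.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)

best_alpha = grid.best_params_["alpha"]
best_rmse = -grid.best_score_  # undo sklearn's sign convention
```

The same pattern extends to the tree-based grids; wrap the estimator in your preprocessing Pipeline and prefix parameter names with the step name (e.g. `regressor__max_depth`).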
Evaluation Metrics
RMSE

Root Mean Squared Error - penalizes large errors

MAE

Mean Absolute Error - average error magnitude

R² Score

Coefficient of determination - variance explained

RMSLE

Root Mean Squared Log Error - Kaggle metric
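RMSLE is simply RMSE computed in log1p space, so a hand-rolled version is a one-liner:

```python
import numpy as np

def rmsle(y_true, y_pred):
    # RMSE on log1p-transformed values; a uniform +10% error
    # scores roughly log(1.1) ~ 0.095 regardless of price level
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

y_true = np.array([100_000.0, 200_000.0])
```

Because the metric is relative, a $20,000 miss on a $200,000 house costs the same as a $10,000 miss on a $100,000 house.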

Sample Pipeline Code
# Build preprocessing pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Define column types
numerical_cols = ['LotArea', 'GrLivArea', 'TotalBsmtSF', 'YearBuilt', ...]
categorical_cols = ['Neighborhood', 'HouseStyle', 'ExterQual', ...]

# Preprocessing pipelines
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='None')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine with ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# Full pipeline with model
from sklearn.ensemble import RandomForestRegressor

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])
Target Performance: Aim for RMSLE < 0.15 on your validation set. Top Kaggle solutions achieve RMSLE around 0.10-0.12.
06

Required Visualizations

Create at least 20 visualizations across your notebooks. Each visualization should have proper titles, labels, and interpretive commentary.

EDA
Exploratory Visualizations
  • SalePrice distribution (histogram + KDE)
  • Log-transformed SalePrice distribution
  • Correlation heatmap (top 15 features)
  • Missing values heatmap
  • GrLivArea vs SalePrice scatter
  • OverallQual vs SalePrice boxplot
  • Neighborhood price distribution
  • Year vs SalePrice trend
Model
Model Evaluation Visualizations
  • Actual vs Predicted scatter plot
  • Residuals vs Predicted plot
  • Residual distribution histogram
  • Feature importance bar chart (top 20)
  • Model comparison bar chart (CV scores)
  • Learning curves (train vs validation)
  • Cross-validation score boxplots
  • Error analysis by price segment
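A sketch of the actual-vs-predicted and residual panels with matplotlib, using dummy predictions (the Agg backend line is only needed outside a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.uniform(50_000, 500_000, 100)
y_pred = y_true + rng.normal(0, 20_000, 100)  # dummy model output

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Actual vs predicted: points should hug the diagonal
ax1.scatter(y_true, y_pred, alpha=0.5)
lims = [y_true.min(), y_true.max()]
ax1.plot(lims, lims, "r--")
ax1.set(title="Actual vs Predicted", xlabel="Actual SalePrice",
        ylabel="Predicted SalePrice")

# Residuals vs predicted: look for fanning or curvature
ax2.scatter(y_pred, y_true - y_pred, alpha=0.5)
ax2.axhline(0, color="r", linestyle="--")
ax2.set(title="Residuals vs Predicted", xlabel="Predicted SalePrice",
        ylabel="Residual")

fig.tight_layout()
fig.savefig("actual_vs_predicted.png", dpi=150)
```

A funnel shape in the residual panel on raw prices is the classic sign that the log transform of the target is needed.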
Design Tips: Use consistent color schemes, add proper titles and axis labels, and include brief interpretations in markdown cells below each visualization.
07

Submission Requirements

Create a public GitHub repository with the exact name shown below:

Required Repository Name
house-price-prediction-ml
github.com/<your-username>/house-price-prediction-ml
Required Project Structure
house-price-prediction-ml/
├── data/
│   ├── train.csv                 # Original training data
│   ├── test.csv                  # Original test data
│   └── data_description.txt      # Feature descriptions
├── notebooks/
│   ├── 01_eda.ipynb              # Exploratory data analysis
│   ├── 02_feature_engineering.ipynb # Feature engineering
│   ├── 03_model_training.ipynb   # Model training & tuning
│   └── 04_evaluation.ipynb       # Model evaluation
├── models/
│   └── best_model.joblib         # Saved best model
├── reports/
│   └── analysis_report.pdf       # Final analysis report
├── visualizations/
│   ├── correlation_heatmap.png
│   ├── feature_importance.png
│   ├── actual_vs_predicted.png
│   ├── residuals_plot.png
│   └── model_comparison.png
├── requirements.txt              # Python dependencies
└── README.md                     # Project documentation
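Producing `models/best_model.joblib` from the tree above is two joblib calls; the small Ridge model below is only a stand-in for your tuned pipeline:

```python
import joblib
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Stand-in for the tuned pipeline from 03_model_training.ipynb
X, y = make_regression(n_samples=100, n_features=5, random_state=0)
model = Ridge(alpha=1.0).fit(X, y)

# In the repo this path would be models/best_model.joblib
joblib.dump(model, "best_model.joblib")

# Later (e.g. in a prediction script), reload and predict
loaded = joblib.load("best_model.joblib")
preds = loaded.predict(X[:3])
```

Saving the whole preprocessing-plus-model Pipeline, rather than just the estimator, is what lets the prediction script accept raw feature rows.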
README.md Required Sections
1. Project Header
  • Project title and description
  • Your full name and submission date
  • Course and project number
2. Business Context
  • HomeValue AI scenario overview
  • Project objectives
  • Dataset summary
3. Technologies Used
  • Python, pandas, numpy
  • scikit-learn, XGBoost
  • matplotlib, seaborn
4. Key Findings
  • Top 5 insights from EDA
  • Most important features
  • Best performing model
5. Model Performance
  • Final RMSE, MAE, R² scores
  • Model comparison table
  • Cross-validation results
6. Visualizations
  • Key visualization screenshots
  • Brief captions for each
  • Link to notebooks
7. How to Run
  • Installation instructions
  • Running notebooks
  • Making predictions
8. Contact
  • GitHub profile link
  • LinkedIn (optional)
Do Include
  • All 4 notebooks with clear documentation
  • At least 20 visualizations with interpretations
  • Trained model saved with joblib
  • PDF report with visualizations
  • requirements.txt with all dependencies
  • Professional README with screenshots
Do Not Include
  • Virtual environment folders (venv, .venv)
  • Jupyter checkpoint files (.ipynb_checkpoints)
  • Extremely large data files (>100MB)
  • API keys or personal credentials
  • Incomplete or non-running notebooks
Important: Before submitting, restart your kernel and run all cells to verify your notebooks execute without errors!
Submit Your Project

Submit your GitHub username and your repository will be verified automatically.
08

Grading Rubric

Your project will be graded on the following criteria. Total: 500 points.

Criteria | Points | Description
Exploratory Data Analysis | 80 | Comprehensive EDA with 15+ visualizations and documented insights
Feature Engineering | 100 | At least 10 new features, proper encoding, missing value handling
Model Training | 100 | 5+ models trained, proper pipelines, hyperparameter tuning
Model Evaluation | 80 | Comprehensive evaluation with multiple metrics and error analysis
Model Performance | 50 | R² > 0.85, RMSLE < 0.15 on validation set
Analysis Report | 50 | Professional PDF report with insights and recommendations
Documentation | 40 | README quality, code comments, notebook organization
Total | 500 |
Grading Levels
Excellent
450-500

Exceeds all requirements with exceptional quality

Good
375-449

Meets all requirements with good quality

Satisfactory
300-374

Meets minimum requirements

Needs Work
<300

Missing key requirements

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.

Submit Your Project