Capstone Project 3

House Price Prediction

Build a complete machine learning pipeline to predict property prices across major Indian cities. Apply feature engineering for real estate data, compare multiple regression algorithms, and interpret model predictions using feature importance analysis.

10-15 hours
Advanced
600 Points
What You Will Build
  • Feature engineering pipeline
  • Multiple regression models
  • Model comparison framework
  • Feature importance analysis
  • Price prediction system
Contents
01

Project Overview

This advanced capstone project challenges you to build a complete machine learning pipeline for predicting residential property prices across major Indian cities. You will work with a realistic housing dataset containing 150 properties from Mumbai, Bangalore, Delhi, Chennai, Hyderabad, Pune, and Kolkata. Your goal is to engineer meaningful features, train multiple regression models, compare their performance, and interpret which factors drive property valuations.

Skills Applied: This project tests your proficiency in feature engineering, data preprocessing (encoding, scaling), multiple regression algorithms (Linear, Ridge, Lasso, Random Forest, Gradient Boosting), cross-validation, hyperparameter tuning, and model interpretation.
Learning Objectives
Feature Engineering Mastery
  • Create domain-specific features (price per sqft, room ratios, location scores)
  • Understand feature impact through importance analysis
  • Combine multiple raw features into composite metrics
  • Apply business logic to generate meaningful derived features
Model Comparison Skills
  • Train and evaluate 5+ regression algorithms systematically
  • Understand trade-offs: accuracy vs interpretability vs training time
  • Use cross-validation for robust performance estimation
  • Select optimal model based on multiple evaluation metrics
Evaluation & Interpretation
  • Interpret RMSE, MAE, and R-squared in business context
  • Analyze feature importance from tree-based models
  • Identify model strengths and weaknesses through residual analysis
  • Translate model insights into actionable business recommendations
End-to-End ML Pipeline
  • Build complete workflow: EDA → Engineering → Training → Evaluation
  • Handle data preprocessing (encoding, scaling) without leakage
  • Document methodology and findings professionally
  • Create reproducible analysis with clear explanations
Real-World Application

This project mirrors actual work done by data scientists at real estate tech companies like Zillow, Redfin, or MagicBricks. The ability to predict property prices accurately is a multi-million dollar business problem, and your solution demonstrates production-ready ML engineering skills.

Feature Engineering

Create derived features from raw property attributes

Multiple Models

Train and compare 5+ regression algorithms

Model Comparison

Evaluate using RMSE, MAE, and R-squared metrics

Interpretation

Analyze feature importance and model insights

02

Business Scenario

PropValue Analytics Pvt. Ltd.

You have been hired as a Machine Learning Engineer at PropValue Analytics, a real estate technology startup that provides property valuation services to banks, insurance companies, and individual buyers across India. The traditional property valuation process takes 5-7 days and costs ₹5,000-₹10,000 per assessment. The company wants to disrupt this market with AI-powered instant valuations priced at ₹499, making property assessment accessible to millions of Indians.

Currently, the company relies on manual appraisals by experienced real estate agents, but this approach doesn't scale. With 50-100 valuation requests coming in daily, the team is overwhelmed. Additionally, manual valuations suffer from inconsistency and human bias, with the same property sometimes receiving price estimates that vary by 15-20% depending on which agent performs the assessment.

"Our clients need accurate property valuations within minutes, not days. We have collected data on 150 properties across 7 major cities with verified sale prices. Can you build a model that predicts prices with at least 85% accuracy and tells us which features matter most for valuation? We also need to understand if our model works equally well across all cities or if we need city-specific models."

Vikram Mehta, Chief Data Officer
The Business Challenge

PropValue Analytics faces several critical challenges that machine learning can address:

Speed vs Accuracy

Manual valuations take 5-7 days but are reasonably accurate (90-95%). Instant online estimates are fast but often wildly inaccurate (60-70% accuracy), leading to customer distrust.

Market Variability

Mumbai properties average ₹150 Lakhs while similar properties in Pune cost ₹60 Lakhs. The model must capture both national patterns and city-specific pricing dynamics.

Feature Complexity

18 features influence price, but which matter most? Is a furnished 2BHK worth more than an unfurnished 3BHK? Does being near a metro station add ₹10 Lakhs or ₹30 Lakhs to value?

Business Questions to Answer

Price Prediction
  • What is the predicted price for a given property?
  • How accurate is the model across different cities?
  • What is the prediction confidence interval?
Feature Impact
  • Which features have the highest impact on price?
  • How does location affect property valuation?
  • What is the price premium for furnished properties?
Model Selection
  • Which algorithm performs best for this data?
  • Is there overfitting in complex models?
  • What are the trade-offs between models?
Market Insights
  • How does price vary by city and region?
  • What is the price per square foot by property type?
  • How does age affect property depreciation?
Pro Tip: Think like a real estate analyst! Your model should not only predict prices but also provide interpretable insights that help stakeholders understand what drives property values.
03

The Dataset

You will work with a realistic Indian housing market dataset containing 150 residential properties across 7 major cities. This professionally curated dataset includes verified sale prices, making it ideal for supervised learning. Each property record contains 18 features covering physical attributes, location factors, and neighborhood amenities.

Dataset Overview:
150
Total Properties
18
Features per Property
7
Major Cities
45-425
Price Range (Lakhs)
Why This Dataset is Perfect for Regression
Real Market Data

All properties have verified sale prices from actual transactions (2023-2024). No synthetic or estimated values, ensuring your model learns from real market dynamics.

Feature Diversity

Mix of numerical (area, age, metro distance), categorical (city, furnishing), and binary (main road) features. Includes interaction opportunities (bedroom/bathroom ratio, floor position).

Geographic Variation

Properties span Tier-1 metros (Mumbai, Bangalore, Delhi) and Tier-2 cities (Pune, Hyderabad), capturing different market segments and price dynamics.

Dataset Schema
Column | Type | Description
property_id | String | Unique property identifier (HP001, HP002, ...)
location | String | Specific locality/neighborhood name
city | String | City name (Mumbai, Bangalore, Delhi, etc.)
region | String | Geographic region (North, South, East, West)
property_type | String | Type of property (Apartment, Villa)
bedrooms | Integer | Number of bedrooms (1-5)
bathrooms | Integer | Number of bathrooms (1-4)
area_sqft | Integer | Total area in square feet
floor | Integer | Floor number (0 for ground/villa)
total_floors | Integer | Total floors in the building
age_years | Integer | Age of property in years
furnishing | String | Furnishing status (Furnished, Semi-Furnished, Unfurnished)
parking | Integer | Number of parking spaces (0-3)
amenities_score | Integer | Amenities rating (1-10 scale)
nearby_schools | Integer | Number of schools within 2km
nearby_hospitals | Integer | Number of hospitals within 2km
metro_distance_km | Float | Distance to nearest metro station (km)
main_road | String | On main road (Yes/No)
price_lakhs | Float | Target Variable: Price in Indian Lakhs
Data Quality & Completeness
Clean Data: No missing values in any column. All features are complete and ready for modeling. Focus your time on feature engineering and model optimization, not data cleaning.
Balanced Distribution: Good representation across cities (15-25 properties each), property types (60% apartments, 40% villas), and price ranges (no extreme outliers dominating).
Understanding Key Features
Feature Category | Features Included | Expected Impact on Price
Size & Layout | area_sqft, bedrooms, bathrooms | High - Direct correlation with price
Location | city, region, location | High - Mumbai properties 2-3x costlier than others
Connectivity | metro_distance_km, main_road | Medium - ₹5-15 Lakhs premium for accessibility
Condition & Quality | age_years, furnishing, amenities_score | Medium - New/furnished properties command higher prices
Neighborhood | nearby_schools, nearby_hospitals | Low-Medium - Desirable but less impactful than size/location
Building Features | floor, total_floors, parking | Low-Medium - Varies by property type and city
Price Context: 1 Lakh = 100,000 rupees. A property priced at 100 Lakhs = ₹1 Crore (10 million rupees). Average Indian property prices range from 40-120 Lakhs depending on city and size.
Getting Started
Loading the Dataset

Start by loading the CSV file and examining its basic properties:

  • Shape: Check number of rows (properties) and columns (features)
  • Price Range: Find minimum and maximum prices in Lakhs
  • Cities: List all unique city values
  • Preview: Display first few rows to understand data structure
04

Project Requirements

Your Jupyter Notebook must include all of the following components. Structure your notebook with clear markdown headers and explanations for each section.

1
Project Setup and Introduction

Title, your name, date, project overview, and business context. Import all required libraries: pandas, numpy, sklearn, matplotlib, seaborn, plotly.

Required Library Groups
  • Data handling: pandas, numpy
  • Visualization: matplotlib, seaborn, plotly
  • Preprocessing: StandardScaler, train_test_split, LabelEncoder/OneHotEncoder
  • Models: LinearRegression, Ridge, Lasso, RandomForestRegressor, GradientBoostingRegressor
  • Evaluation: mean_squared_error, mean_absolute_error, r2_score, cross_val_score
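The library groups above map onto a single import cell. A minimal sketch (plotly is listed in the brief but commented out here, since it is only needed for the optional interactive charts):

```python
# One possible import cell covering the required groups.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

# Models
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Evaluation
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# import plotly.express as px  # optional: only for the interactive bonus charts
```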
2
Exploratory Data Analysis (EDA)

Comprehensive data exploration to understand patterns, distributions, and relationships before modeling.

Univariate Analysis
  • Numerical Features: Analyze mean, median, std, min/max for price, area, age, parking
  • Target Distribution: Histogram + KDE plot of price_lakhs to check for skewness
  • Categorical Features: Value counts and percentages for city, property_type, furnishing
  • Missing Values: Check for nulls with df.isnull().sum() (should be zero)
Bivariate Analysis
  • Correlation Heatmap: Visualize relationships between all numerical features
  • Price by City: Box plots showing price distribution across 7 cities
  • Area vs Price: Scatter plot with property_type color coding
  • Categorical Comparisons: Bar charts for average price by furnishing and property_type
Outlier Detection
  • Price Outliers: Identify properties >3 standard deviations from mean
  • Area Extremes: Flag properties with unusually large/small area_sqft
  • Age Analysis: Check for very old properties (20+ years) affecting pricing
  • Decision: Document whether to keep, cap, or remove outliers
Key Insights to Document
  • Price Range: What's the min, max, and average property price?
  • Top Correlations: Which features correlate most with price (>0.5)?
  • City Patterns: Which city has highest/lowest average prices?
  • Property Types: Are villas significantly more expensive than apartments?
EDA Best Practice: Create at least 8-10 visualizations covering distribution plots, correlation heatmaps, categorical comparisons, and scatter plots. Each chart should answer a specific business question about the data.
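The univariate and bivariate checks above reduce to a handful of pandas calls. A minimal sketch on a tiny stand-in frame (column names follow the dataset schema; the values are invented):

```python
import pandas as pd

# Tiny stand-in frame using the documented column names; values are invented.
df = pd.DataFrame({
    "city": ["Mumbai", "Pune", "Mumbai", "Delhi"],
    "area_sqft": [900, 1200, 1500, 1100],
    "price_lakhs": [150.0, 60.0, 240.0, 110.0],
})

summary = df["price_lakhs"].describe()                # univariate stats
nulls = df.isnull().sum()                             # missing-value check
city_avg = df.groupby("city")["price_lakhs"].mean()   # average price by city
corr = df[["area_sqft", "price_lakhs"]].corr()        # correlation with target

print(city_avg)
```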
3
Feature Engineering

Create at least 5 new derived features (see Feature Engineering section for ideas):

  • Price per square foot calculation
  • Location-based features (city premium, metro accessibility)
  • Property age categories
  • Room ratios and space efficiency metrics
  • Amenity and accessibility composite scores
4
Data Preprocessing

Transform data into machine-learning-ready format through encoding and scaling.

Categorical Encoding

Convert text categories to numbers:

  • One-Hot Encoding: City, property_type, furnishing, main_road
  • Why One-Hot?: No ordinal relationship (Mumbai isn't "greater than" Delhi)
  • Result: Creates binary columns (city_Mumbai, city_Delhi, etc.)
  • Drop first: Use drop_first=True to avoid multicollinearity
Example: property_type with 2 values (Apartment, Villa) creates 1 column: property_type_Villa (1=Villa, 0=Apartment)
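The same encoding in pandas, on a hypothetical three-row frame:

```python
import pandas as pd

df = pd.DataFrame({"property_type": ["Apartment", "Villa", "Apartment"]})

# drop_first=True drops the alphabetically first category (Apartment),
# leaving a single binary column: 1 = Villa, 0 = Apartment.
encoded = pd.get_dummies(df, columns=["property_type"], drop_first=True)
print(encoded.columns.tolist())  # ['property_type_Villa']
```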
Feature Scaling

Normalize numerical features to same scale:

  • StandardScaler: Mean=0, StdDev=1 (recommended for linear models)
  • Why scale?: area_sqft (500-3000) vs parking (0-3) - prevent large values dominating
  • Critical: Fit scaler on training data only, then transform train + test
  • Not needed: Tree-based models (Random Forest, Gradient Boosting)
Why? Regularization penalties and coefficient magnitudes in linear models are scale-dependent. Without scaling, area_sqft (values in the thousands) dominates bedrooms (values 1-5).
Critical: Prevent Data Leakage

Always follow this order:

  1. Separate features (X) from target (y)
  2. Encode categorical variables
  3. Split into train/test sets (80/20 split)
  4. Fit scaler on training data only, then transform both train and test

Why? Fitting the scaler on test data causes information leakage and inflates performance metrics.
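The four steps can be sketched as follows (a synthetic stand-in frame; column names follow the schema, values are invented):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing frame (100 rows, invented values).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "area_sqft": rng.integers(500, 3000, 100),
    "furnishing": rng.choice(["Furnished", "Semi-Furnished", "Unfurnished"], 100),
    "price_lakhs": rng.uniform(45, 425, 100),
})

# 1. Separate features (X) from target (y)
X = df.drop(columns="price_lakhs")
y = df["price_lakhs"]

# 2. Encode categorical variables
X = pd.get_dummies(X, columns=["furnishing"], drop_first=True)

# 3. Split BEFORE scaling (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Fit the scaler on the training split only, then transform both
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler before the split (or on the full data) would leak test-set statistics into training, which is exactly the mistake the ordering above prevents.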

5
Model Training and Comparison

Train at least 5 different regression models and compare performance:

  • Linear Regression (baseline)
  • Ridge Regression (L2 regularization)
  • Lasso Regression (L1 regularization)
  • Random Forest Regressor
  • Gradient Boosting Regressor
6
Model Evaluation
  • Calculate RMSE, MAE, and R-squared for each model
  • Perform 5-fold cross-validation
  • Create a comparison table of all models
  • Analyze residual plots for the best model
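The train-evaluate-compare loop for all five models might look like this. `make_regression` stands in for the housing data here, so the scores will not match the expected ranges quoted later in this document:

```python
import pandas as pd
from sklearn.datasets import make_regression  # stand-in for the housing data
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

X, y = make_regression(n_samples=150, n_features=10, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    cv = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    rows.append({
        "model": name,
        "rmse": mean_squared_error(y_test, pred) ** 0.5,  # RMSE in target units
        "mae": mean_absolute_error(y_test, pred),
        "r2": r2_score(y_test, pred),
        "cv_r2": f"{cv.mean():.3f} ± {cv.std():.3f}",
    })

comparison = pd.DataFrame(rows).sort_values("r2", ascending=False)
print(comparison.to_string(index=False))
```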
7
Feature Importance Analysis

Understanding which features drive property prices helps stakeholders make data-informed decisions about pricing, renovations, and investment priorities. Different models reveal different aspects of feature importance.

Tree-Based Model Importance

Method: Extract feature_importances_ attribute from Random Forest or Gradient Boosting.

What it shows: How often features are used to split data and how much they reduce error.

Example: If area_sqft has importance 0.45, it accounts for 45% of the total importance the model assigns across all features.

Action: Sort features by importance, visualize top 10-15 with horizontal bar chart.

Linear Model Coefficients

Method: Extract coef_ attribute from Linear, Ridge, or Lasso models.

What it shows: How much price changes per unit increase in each feature.

Example: Coefficient of 5.2 for bedrooms means each additional bedroom adds 5.2 lakhs to price.

Action: Rank features by absolute coefficient value (the sign shows the direction of the effect), visualize top contributors.

Important Considerations:
  • Feature scaling matters: For linear models, scale features before extracting coefficients (otherwise large-range features dominate)
  • Correlation ≠ Causation: High importance doesn't mean changing that feature will change the price
  • Context is key: A feature with 5% importance might still be critical for specific property segments
  • Compare across models: If a feature is important in both tree-based AND linear models, it's truly influential
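Both extraction routes can be sketched on a toy dataset where area is constructed to dominate (invented data, not the project dataset):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "area_sqft": rng.uniform(500, 3000, n),
    "bedrooms": rng.integers(1, 6, n).astype(float),
})
# Price built so area contributes far more variation than bedrooms.
y = 0.08 * X["area_sqft"] + 6.0 * X["bedrooms"] + rng.normal(0, 10, n)

# Tree-based route: feature_importances_ (sums to 1 across features)
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# Linear route: coef_ (rank by absolute value)
lr = LinearRegression().fit(X, y)
coefs = pd.Series(lr.coef_, index=X.columns)
print(coefs.abs().sort_values(ascending=False))
# Note the scaling caveat above in action: the raw bedrooms coefficient (~6)
# looks larger than area's (~0.08) even though area drives far more variance,
# because the features were not standardized before fitting.
```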
8
Insights and Recommendations

Transform your analysis into actionable business intelligence. Your insights should connect model findings to real-world decisions for buyers, sellers, investors, and real estate professionals.

Framework for Deriving Insights
A
Pricing Drivers

Identify the top 3-5 features with highest importance. Quantify their impact with examples: "Each additional bedroom adds approximately 8-12 lakhs to property value" or "Properties in South Mumbai command a 35% premium over similar properties in suburbs."

B
Surprising Findings

Highlight unexpected patterns from your EDA or feature importance: "Age has minimal impact on price (only 3% importance), suggesting buyers prioritize location and size over property age" or "Parking adds more value than an extra bedroom in urban areas."

C
Model Performance Context

Explain what your R-squared means in practical terms: "Model achieves 85% R-squared, meaning it can predict property prices within ±15 lakhs for 80% of properties. The remaining 15% variation is likely due to unique features like architectural style, renovation quality, or negotiation skills."

D
Segment-Specific Patterns

Analyze if pricing rules differ across segments: "For luxury properties (>1 Cr), floor level and amenities become critical. For budget properties (<50 lakhs), total area and locality score are primary drivers."

E
Data Quality Observations

Note any data limitations: "Properties above 5 Cr are underrepresented (only 2% of dataset), so model predictions may be less reliable for ultra-luxury segment. Recommend collecting more high-end property data."

Stakeholder-Specific Recommendations
For Sellers
  • Focus on high-impact, low-cost improvements (e.g., if bathrooms add value, consider bathroom upgrades)
  • Time market listings based on locality demand patterns
  • Price competitively using model predictions adjusted for unique features
For Buyers
  • Identify undervalued properties where actual price is significantly below model prediction
  • Prioritize features with high importance if budget is constrained
  • Consider emerging localities where location score is improving
For Investors
  • Target properties with features that have growing importance trends
  • Renovate to maximize features with highest ROI based on model coefficients
  • Portfolio diversification based on segment-specific pricing patterns
For Developers
  • Design projects emphasizing high-importance features (e.g., optimize sqft/bedroom ratio)
  • Price new developments using model as baseline, adjust for amenities
  • Identify underserved market segments with pricing gaps
Example High-Quality Insight:

"Analysis of 5,000 properties reveals that price per sqft varies by 85% across locations, making locality the single strongest pricing factor (42% feature importance). Properties within 2km of metro stations command a premium of 18-25%, while properties with dedicated parking add 8-12 lakhs to valuation. The model achieves 85% R-squared, accurately predicting prices within ±15 lakhs for most properties. This suggests location + connectivity + parking form the golden triangle of value drivers in the current market."

05

Feature Engineering

Create at least 5 derived features from the raw data. Feature engineering is crucial for improving model performance and capturing domain-specific knowledge about real estate valuation. Well-designed features can boost model accuracy by 10-20% compared to using raw data alone.

Why Feature Engineering Matters:

Raw features tell you a property has 3 bedrooms and 1500 sqft. But what really matters for pricing?

  • Price per sqft reveals if a property is overpriced for its size
  • Room density indicates efficient space usage vs. sprawling layouts
  • Floor ratio captures premium for high floors with better views
  • Locality score combines multiple amenities into a single quality metric
Recommended Feature Categories
Space Metrics
Key Metrics to Create:
  • price_per_sqft: Convert total price to per-square-foot rate for fair size comparison
  • room_density: Measure how many rooms exist per 1000 sqft (efficient vs. spacious)
  • bhk_ratio: Bedroom-to-bathroom ratio (2:1 is typical, 3:1 may indicate inadequate bathrooms)
  • floor_ratio: Position in building (0.8-1.0 = top floors = premium views)
Calculation Example:
Property: 1500 sqft, 3 bedrooms, 2 bathrooms, 8th floor of 10
- price_per_sqft = (price_lakhs × 100,000) ÷ 1500
- room_density = (3 + 2) ÷ 1500 × 1000 = 3.33
- bhk_ratio = 3 ÷ 2 = 1.5 (balanced)
- floor_ratio = 8 ÷ 10 = 0.8 (high floor premium)
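The worked example translates directly into code. A price of 120 Lakhs is assumed here purely for the price_per_sqft line, since the example above doesn't state one:

```python
# Worked example: 1500 sqft, 3 bed, 2 bath, 8th floor of 10.
# price_lakhs = 120.0 is an assumed value for illustration.
price_lakhs, area_sqft = 120.0, 1500
bedrooms, bathrooms = 3, 2
floor, total_floors = 8, 10

price_per_sqft = price_lakhs * 100_000 / area_sqft        # rupees per sqft
room_density = (bedrooms + bathrooms) / area_sqft * 1000  # rooms per 1000 sqft
bhk_ratio = bedrooms / bathrooms                          # bedroom:bathroom
floor_ratio = floor / total_floors                        # position in building

print(price_per_sqft, round(room_density, 2), bhk_ratio, floor_ratio)
# → 8000.0 3.33 1.5 0.8
```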
Location Features
Location-Based Features:
  • city_tier: Classify as Tier-1 metros (Mumbai, Delhi, Bangalore) vs Tier-2 cities
  • metro_accessible: Binary flag for metro stations within 1km (walkable distance)
  • locality_score: Sum nearby schools + hospitals as neighborhood quality indicator
  • connectivity_score: Weighted combination: (metro_accessible × 3) + (main_road × 2)
Why This Matters:
Tier-1 cities command 2-3x premium. Metro accessibility adds ₹10-20 Lakhs. Properties near 5+ schools/hospitals are family-friendly and more desirable. Main road access improves resale value by ₹5-10 Lakhs.
Age Categories
Age-Based Categorization:
  • age_category: Group as New (0-3 years), Recent (4-10 years), or Old (more than 10 years)
  • is_new_property: Binary indicator for brand new construction (premium pricing)
  • depreciation_factor: Estimate value reduction: max(0, 1 - age/30) to model 30-year lifespan
Age Impact on Price:
New (0-3): Full value, no depreciation, warranty coverage
Recent (4-10): Slight depreciation (5-10%), established neighborhood
Old (>10): 15-30% depreciation, may need renovation, but mature locality
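A sketch of the three age-based features with pandas (toy ages chosen for illustration):

```python
import pandas as pd

ages = pd.Series([1, 5, 12, 25], name="age_years")  # toy values

# Bin edges follow the categories above: New 0-3, Recent 4-10, Old >10.
age_category = pd.cut(ages, bins=[-1, 3, 10, 200], labels=["New", "Recent", "Old"])
is_new_property = (ages <= 3).astype(int)
depreciation_factor = (1 - ages / 30).clip(lower=0)  # 30-year lifespan model

print(list(age_category))  # ['New', 'Recent', 'Old', 'Old']
print(list(is_new_property))
```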
Quality Scores
Quality & Amenity Scores:
  • furnishing_score: Convert to numeric scale (Unfurnished=0, Semi=1, Furnished=2)
  • total_amenity_score: Weighted sum: amenities_score + (parking × 2) + locality_score
  • luxury_index: Composite score combining furnishing, amenities, and floor ratio for premium properties
Premium Features:
Furnished properties command 10-15% premium. Each parking space adds ₹3-5 Lakhs. Amenities (gym, pool, security) add ₹5-15 Lakhs. Luxury index >7 indicates high-end properties with 20-30% price premium.
Feature Engineering Best Practices:
  • Check Correlations: After creating features, use df.corr() to find highly correlated features (>0.85). Remove or combine them to avoid multicollinearity.
  • Domain Knowledge: Think like a real estate agent - what factors do buyers actually care about? Create features that capture buyer priorities.
  • Test Impact: Train a baseline model, add your engineered features, and measure improvement in R-squared or RMSE to validate their value.
  • Document Rationale: In your notebook, explain WHY you created each feature and what business insight it captures.
Common Feature Engineering Mistakes
What NOT to Do
  • Create features with missing values (breaks model training)
  • Use target variable in feature calculation (data leakage)
  • Create 20+ features without testing which help (overfitting risk)
  • Ignore extreme outliers in derived features (skews scaling)
Best Practices
  • Start with 5-7 well-reasoned features, add more if needed
  • Validate each feature makes sense (no negative room densities)
  • Create features that could be calculated for new, unseen properties
  • Document formulas clearly for reproducibility
06

ML Models to Implement

Train and compare at least 5 different regression models. Each model should be evaluated using consistent metrics and cross-validation for fair comparison. The goal is not just to find the "best" model, but to understand the trade-offs between interpretability, accuracy, training time, and complexity.

Why Multiple Models?

Different algorithms make different assumptions about data:

  • Linear models assume straight-line relationships (fast, interpretable, but may underfit)
  • Regularized models prevent overfitting by penalizing complex coefficients
  • Tree-based models capture non-linear patterns and feature interactions automatically
  • Ensemble methods combine multiple models for superior accuracy (but less interpretable)
Model Comparison Guide
Model | Type | Key Parameters | Best For
Linear Regression | Linear | None (baseline) | Baseline comparison, interpretable coefficients
Ridge Regression | Linear (L2) | alpha: 0.1, 1.0, 10.0 | Handling multicollinearity, preventing overfitting
Lasso Regression | Linear (L1) | alpha: 0.01, 0.1, 1.0 | Feature selection, sparse solutions
Random Forest | Ensemble | n_estimators: 100, max_depth: 10-20 | Non-linear relationships, feature importance
Gradient Boosting | Ensemble | n_estimators: 100, learning_rate: 0.1 | Best accuracy, handles complex patterns
Deep Dive: Understanding Each Algorithm
Linear Regression

How it works: Finds best-fit line minimizing squared errors. Price = (coef1 × area) + (coef2 × bedrooms) + ... + intercept

  • Pros: Fast training, interpretable coefficients showing feature impact
  • Cons: Assumes linear relationships, sensitive to outliers, can't capture interactions
  • Use case: Baseline to beat. If it performs well (R² >0.75), data has linear patterns.
Ridge & Lasso Regression

How they work: Linear regression + penalty for large coefficients. Ridge (L2) shrinks all coefficients. Lasso (L1) can zero out features.

  • Pros: Prevent overfitting, handle correlated features, Lasso does automatic feature selection
  • Cons: Still assume linearity, need to tune alpha parameter
  • Use case: When many features are correlated (area, bedrooms, bathrooms all correlate)
Random Forest

How it works: Builds 100+ decision trees on random data subsets, averages predictions. Each tree asks yes/no questions about features.

  • Pros: Handles non-linear relationships, no feature scaling needed, provides feature importance
  • Cons: Slower training, hard to interpret (an average over 100+ trees), can overfit if individual trees are grown too deep
  • Use case: When relationships are complex (e.g., Mumbai properties behave differently than Pune)
Gradient Boosting

How it works: Builds trees sequentially, each correcting previous tree's errors. Learns patterns iteratively.

  • Pros: Often the highest accuracy, captures complex feature interactions
  • Cons: Longest training time, easy to overfit, requires careful tuning
  • Use case: When you need best prediction accuracy and have clean, engineered features
Model Training Process
1
Initialize Models

Create instances of all 5 regression algorithms with sensible default hyperparameters. Use consistent random_state (42) for reproducibility.

2
Train on Training Data

Fit each model using the scaled training features (X_train_scaled) and target values (y_train). The model learns patterns and coefficients during this step.

3
Generate Predictions

Use each trained model to predict prices on the test set (X_test_scaled). These predictions will be compared against actual prices (y_test).

4
Calculate Performance Metrics

Compute RMSE, MAE, and R-squared for each model by comparing predictions to actual values. Lower RMSE/MAE and higher R-squared indicate better performance.

5
Cross-Validation

Perform 5-fold cross-validation on training data to get more reliable performance estimates. This helps detect overfitting.

6
Compare Results

Create a comparison table showing all metrics for all models. Sort by R-squared to identify the best performer.

Expected Performance Range
Model | Expected R-squared | Expected RMSE (Lakhs) | Typical Training Time
Linear Regression | 0.75 - 0.82 | 18 - 25 | < 1 second
Ridge Regression | 0.76 - 0.83 | 17 - 24 | < 1 second
Lasso Regression | 0.74 - 0.81 | 19 - 26 | < 1 second
Random Forest | 0.82 - 0.88 | 14 - 20 | 5 - 15 seconds
Gradient Boosting | 0.84 - 0.90 | 12 - 18 | 10 - 30 seconds
Interpretation Tip

R-squared of 0.85 means the model explains 85% of price variance. The remaining 15% is due to unmeasured factors like:

  • Neighborhood reputation and schools
  • Recent renovations or property condition details
  • Seller motivation and negotiation factors
  • Market timing and economic conditions
Understanding Evaluation Metrics
RMSE (Root Mean Squared Error)

Formula: Square root of average squared errors

Interpretation: Average prediction error in lakhs. RMSE of 20 means typical error is ±20 lakhs.

Key feature: Heavily penalizes large errors. A few big mistakes hurt RMSE more than many small ones.

Use case: When large errors are particularly costly (e.g., overpricing luxury properties).

MAE (Mean Absolute Error)

Formula: Average of absolute errors

Interpretation: Easier to explain to non-technical stakeholders. MAE of 15 means average error is 15 lakhs.

Key feature: Treats all errors equally, not sensitive to outliers.

Use case: When you want robust metric that isn't distorted by a few extreme cases.

R-squared (R²)

Formula: 1 - (Sum of squared errors / Total variance)

Interpretation: Percentage of price variance explained by model. R² of 0.85 = model explains 85% of variation.

Key feature: Scale-independent, ranges from 0 to 1 (higher is better).

Use case: Comparing models on different datasets or quickly assessing model quality.

Comparing Metrics: Practical Examples
Scenario | Model A | Model B | Which is Better?
General comparison | R² = 0.85, RMSE = 18L | R² = 0.80, RMSE = 22L | Model A - Higher R² (and, on the same test set, correspondingly lower RMSE) means better overall fit
Cost of big errors | MAE = 15L, RMSE = 25L | MAE = 17L, RMSE = 20L | Model B - Lower RMSE indicates fewer catastrophic errors
Stakeholder reporting | MAE = 12L, R² = 0.82 | MAE = 15L, R² = 0.85 | Model A - Lower MAE easier to communicate ("average error is 12 lakhs")
Luxury segment | R² = 0.75, max error = 80L | R² = 0.78, max error = 50L | Model B - Smaller max error critical for high-value properties
Cross-Validation: The Gold Standard

Test set performance can be misleading if you happen to get an "easy" or "hard" split. Cross-validation divides training data into 5 folds, trains on 4 and validates on 1, rotating through all combinations.

Example interpretation: If cross-validation R² is 0.83 with std dev of 0.03, your model consistently performs well (0.80-0.86 range). If std dev is 0.12, performance is unstable (0.71-0.95 range) - investigate why.

Action: Always report mean CV score ± standard deviation. Use CV scores for model selection, test set only for final performance estimate.

Detecting Overfitting

Training R² = 0.95, Test R² = 0.72: Model memorized training data. Reduce model complexity or add regularization.
Training R² = 0.85, Test R² = 0.83: Healthy performance, model generalizes well.
Training R² = 0.68, Test R² = 0.70: Underfitting, model too simple. Add features or try more complex algorithms.
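This check is two calls to `.score()`. A quick sketch on synthetic data (the exact numbers will differ from the illustrative thresholds above):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=150, n_features=10, noise=30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # R² on data the model has seen
test_r2 = model.score(X_test, y_test)     # R² on held-out data
print(f"train R² = {train_r2:.2f}, test R² = {test_r2:.2f}")
# A large gap signals memorization; a small gap means healthy generalization.
```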

07

Required Visualizations

Create at least 10 visualizations covering EDA, model comparison, and feature importance. Use a mix of Matplotlib, Seaborn, and Plotly for different chart types.

1. Distribution Plot
Target Variable Distribution

Histogram of price_lakhs with KDE curve

2. Heatmap
Correlation Matrix

Correlation between all numerical features

3. Box Plot
Price by City

Price distribution across different cities

4. Scatter Plot
Area vs Price

Relationship with property type color coding

5. Bar Chart
Average Price by City

Mean price comparison across cities

6. Violin Plot
Price by Furnishing

Distribution by furnishing status

7. Bar Chart
Model Comparison

R-squared scores for all 5 models

8. Scatter Plot
Actual vs Predicted

45-degree line comparison for best model

9. Residual Plot
Residual Analysis

Residuals vs predicted values

10. Horizontal Bar
Feature Importance

Top 15 features from Random Forest

Bonus
Interactive Map

Plotly map showing prices by city

Bonus
CV Score Distribution

Cross-validation scores across folds

Visualization 1: Feature Importance Analysis
Purpose

Identify which features have the strongest influence on price predictions in tree-based models (Random Forest or Gradient Boosting).

What to Show
  • Top 15 features sorted by importance
  • Horizontal bar chart for easy feature name reading
  • Importance scores (0 to 1 scale)
  • Clear labels and title
What to Look For
  • Area metrics typically rank #1 or #2
  • Location features (city dummies, metro distance) in top 5
  • Bathrooms often more important than bedrooms
  • Engineered features competing with original ones
Visualization 2: Actual vs Predicted Prices
Purpose

Visually assess prediction accuracy by plotting predicted prices against actual prices. Perfect predictions would fall on a 45-degree diagonal line.

What to Show
  • Scatter plot: x-axis = actual prices, y-axis = predicted
  • 45-degree dashed reference line (perfect prediction)
  • Use your best performing model (likely Gradient Boosting)
  • Axis labels with units (Lakhs)
What to Look For
  • Points close to line = accurate predictions
  • Points below line = model under-predicts (conservative)
  • Points above line = model over-predicts (optimistic)
  • Outliers = properties with unusual characteristics
Business Insight

If your model consistently under-predicts luxury properties (₹1 Cr+), it may need engineered features capturing premium amenities or neighborhood prestige.
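One way to sketch the actual-vs-predicted plot; synthetic data and Gradient Boosting are used purely for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
y_pred = GradientBoostingRegressor(random_state=42).fit(X_train, y_train).predict(X_test)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y_test, y_pred, alpha=0.6)

# 45-degree reference line: points on it are perfect predictions
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
ax.plot(lims, lims, "r--", label="Perfect prediction")
ax.set_xlabel("Actual Price (Lakhs)")
ax.set_ylabel("Predicted Price (Lakhs)")
ax.set_title("Actual vs Predicted Prices")
ax.legend()
fig.tight_layout()
fig.savefig("actual_vs_predicted.png")
```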

Visualization Best Practices
EDA Charts

Use Seaborn for statistical visualizations:

  • Heatmaps for correlations
  • Distribution plots with KDE
  • Box/violin plots by category
Model Comparison

Use Matplotlib for simple comparisons:

  • Bar charts for metrics across models
  • Horizontal bars for feature importance
  • Residual plots for error analysis
Interactive Plots

Use Plotly for exploration:

  • Scatter plots with hover details
  • 3D visualizations of relationships
  • Geographic maps for location data
Interpreting Common Visualization Patterns
Distribution Plots (Histograms)

Right-skewed price distribution: Most properties are affordable (₹30-70L), with a few luxury outliers (₹2 Cr+)

Insight: Apply a log transformation to normalize the distribution for linear models

Business value: Focus marketing on 70-80% of market (middle segment), create specialized strategies for luxury tier
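A small illustration of the log-transform step; the prices below are hypothetical, not drawn from the project dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices in lakhs (two luxury outliers on the right)
prices = pd.Series([35, 42, 48, 55, 60, 68, 75, 90, 150, 240])

# log1p handles the transform safely even if a value were 0
log_prices = np.log1p(prices)
print(f"Skew before: {prices.skew():.2f}, after: {log_prices.skew():.2f}")

# Predictions made on the log scale must be inverted before reporting
recovered = np.expm1(log_prices)
assert np.allclose(recovered, prices)
```

Remember that RMSE/MAE computed on the log scale are not in lakhs; invert predictions with expm1 before quoting errors in rupee terms.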

Correlation Heatmap

Strong correlation (0.8+): total_sqft and bedrooms are often correlated - larger homes have more bedrooms

Insight: This creates a multicollinearity issue for linear models. Consider dropping one of the pair or combining them into a ratio feature

Business value: Don't build 5-bedroom homes in small areas - the market expects proportionality
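A sketch of flagging highly correlated pairs and applying the ratio-feature remedy; the simulated bedrooms/total_sqft relationship is illustrative:

```python
import numpy as np
import pandas as pd

# Simulated data where total_sqft tracks bedrooms closely (illustrative)
rng = np.random.default_rng(42)
bedrooms = rng.integers(1, 6, size=150)
df = pd.DataFrame({
    "bedrooms": bedrooms,
    "total_sqft": bedrooms * 450 + rng.normal(0, 100, 150),
    "age_years": rng.integers(0, 30, 150),
})

# Keep only the upper triangle so each pair appears once
corr = df.corr()
pairs = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
print(pairs[pairs.abs() > 0.8])  # pairs that threaten linear models

# One remedy: replace the redundant pair with a ratio feature
df["space_per_room"] = df["total_sqft"] / df["bedrooms"]
```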

Box Plots (Price by City)

Mumbai median at ₹95L, Pune at ₹52L: Location drives 45% price difference even for similar properties

Insight: Location is critical feature - include city/area encoding in model

Business value: Investment strategy should prioritize location over property features. A basic flat in Mumbai can outperform a luxury villa in a tier-2 city

Residual Plot

Random scatter around zero: Model assumptions are satisfied, no systematic bias

Funnel shape (increasing spread): The model is less accurate for expensive properties - a heteroscedasticity issue

Business value: Use the model confidently for the budget-to-mid range (₹30-80L); add a ±20% margin for luxury predictions
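A minimal residual-plot sketch; synthetic data and Ridge are stand-ins for the real dataset and your chosen model:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=150, n_features=8, noise=12.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
y_pred = Ridge().fit(X_train, y_train).predict(X_test)
residuals = y_test - y_pred  # positive = model under-predicted

fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(y_pred, residuals, alpha=0.6)
ax.axhline(0, color="red", linestyle="--")  # zero-error reference
ax.set_xlabel("Predicted Price (Lakhs)")
ax.set_ylabel("Residual (Actual - Predicted)")
ax.set_title("Residuals vs Predicted Values")
fig.tight_layout()
fig.savefig("residuals.png")
```

Look for a funnel opening toward higher predicted prices: that is the heteroscedasticity pattern described above.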

Visualization Interpretation Checklist

For every visualization you create, ask these questions:

  1. What pattern do I see? (e.g., positive correlation, outliers, skewed distribution)
  2. Why does this pattern exist? (e.g., supply-demand dynamics, construction costs, buyer preferences)
  3. How does it affect my model? (e.g., need transformation, feature engineering, separate segments)
  4. What business decision does it inform? (e.g., pricing strategy, target market, renovation priorities)
  5. Should I investigate further? (e.g., outliers requiring detailed analysis, unexpected correlations)
08

Submission Requirements

Create a public GitHub repository with the exact name shown below:

Required Repository Name
house-price-prediction
github.com/<your-username>/house-price-prediction
Required Project Structure
Directory Layout
  • data/ folder containing housing_data.csv
  • notebooks/ folder with house_price_analysis.ipynb (your main notebook)
  • models/ folder for saved model files (optional: best_model.pkl)
  • requirements.txt at root level listing all dependencies
  • README.md at root level with project documentation
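If you choose to save your model, a minimal joblib sketch matching the suggested models/best_model.pkl path; the Random Forest here is a placeholder for whichever estimator wins your comparison:

```python
from pathlib import Path

import joblib  # ships with scikit-learn installs; also pip-installable
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Placeholder: train whichever model won your comparison
X, y = make_regression(n_samples=150, n_features=8, random_state=42)
best_model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Persist to the models/ folder from the required layout
Path("models").mkdir(exist_ok=True)
joblib.dump(best_model, "models/best_model.pkl")

# Later (e.g., in a prediction script), restore without retraining
loaded = joblib.load("models/best_model.pkl")
```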
README.md Must Include:
  • Your full name and submission date
  • Project overview and business context
  • Model comparison table with RMSE, MAE, R-squared for all models
  • Best model selection with justification
  • Top 5 feature insights from importance analysis
  • Technologies used (Python, Pandas, Scikit-learn, etc.)
  • Instructions to run the notebook
  • Screenshots of at least 4 visualizations
Required Python Libraries

Create a requirements.txt file with these dependencies (minimum versions):

Library Version Purpose
pandas 2.0.0+ Data manipulation and analysis
numpy 1.24.0+ Numerical operations and arrays
scikit-learn 1.3.0+ ML models, preprocessing, evaluation
matplotlib 3.7.0+ Static visualizations
seaborn 0.12.0+ Statistical visualizations
plotly 5.18.0+ Interactive charts
jupyter 1.0.0+ Notebook environment
nbformat 5.9.0+ Notebook formatting
joblib 1.3.0+ Model serialization (optional)
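The table above maps directly onto a requirements.txt file using minimum-version pins:

```text
pandas>=2.0.0
numpy>=1.24.0
scikit-learn>=1.3.0
matplotlib>=3.7.0
seaborn>=0.12.0
plotly>=5.18.0
jupyter>=1.0.0
nbformat>=5.9.0
joblib>=1.3.0
```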
Do Include
  • Clear markdown sections with headers
  • All code cells executed with outputs
  • At least 5 trained regression models
  • At least 10 visualizations
  • Model comparison table
  • Feature importance analysis
  • Business insights and recommendations
  • README with model performance and screenshots
Do Not Include
  • Virtual environment folders (venv, .env)
  • Any .pyc or __pycache__ files
  • Unexecuted notebooks
  • Hardcoded absolute file paths
  • Large model files (keep under 100MB)
  • API keys or credentials
Important: Before submitting, run Kernel > Restart and Run All to ensure your notebook executes from top to bottom without errors!
Submit Your Project

Enter your GitHub username - we will verify your repository automatically

09

Grading Rubric

Your project will be graded on the following criteria. Total: 600 points. Each criterion includes specific requirements that must be met for full credit.

Criteria Points Detailed Requirements
EDA and Data Understanding 75
  • Dataset overview with shape and column types
  • Missing value analysis with handling strategy
  • Distribution analysis of target variable (price_lakhs)
  • Correlation heatmap with interpretation
  • At least 3 category-wise price comparisons (city, property type, etc.)
Feature Engineering 100
  • Minimum 5 derived features with clear business justification
  • Examples: space_per_room, price_per_sqft, location_score, age_category
  • Each feature must have explanation of expected impact on price
  • Analysis showing engineered features in top 15 feature importance
Data Preprocessing 50
  • Categorical encoding (One-Hot or Label Encoding with justification)
  • Feature scaling using StandardScaler or MinMaxScaler
  • 80-20 train-test split with random_state for reproducibility
  • No data leakage (scaling fit only on training data)
Model Training 100
  • Exactly 5 regression models: Linear, Ridge, Lasso, Random Forest, Gradient Boosting
  • Hyperparameters documented for each model
  • Models trained on same data splits for fair comparison
  • Training time and complexity considerations discussed
Model Evaluation 75
  • RMSE, MAE, and R-squared calculated for all 5 models
  • 5-fold cross-validation scores for each model
  • Comparison table showing all metrics side-by-side
  • Best model selection with justification beyond just highest R-squared
Visualizations 75
  • Minimum 10 professional visualizations
  • Mix of Matplotlib, Seaborn, and Plotly charts
  • Must include: correlation heatmap, feature importance, actual vs predicted, model comparison bar chart
  • All charts properly labeled with titles, axis labels, and legends
Feature Importance 50
  • Feature importance extracted from Random Forest or Gradient Boosting
  • Top 15 features visualized with horizontal bar chart
  • Business interpretation of top 5 features (why they matter)
  • Comparison of engineered vs original feature importance
Code Quality 25
  • Markdown cells explaining each major section
  • Code comments for complex operations
  • Consistent variable naming (no single-letter names like x, y, z)
  • Notebook runs from top to bottom without errors
Documentation 25
  • README with your name, date, and project overview
  • Model comparison table in README
  • At least 4 visualization screenshots
  • Instructions to run notebook and requirements.txt included
Business Insights 25
  • Summary of top 3 price drivers from feature importance
  • Recommendations for PropValue Analytics based on model findings
  • Model limitations and scenarios where predictions may be unreliable
  • Next steps for model improvement
Total 600

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.

Submit Your Project
10

Pre-Submission Checklist

Notebook Requirements
Repository Requirements