Assignment Overview
In this assignment, you will build a complete Feature Engineering Pipeline for predicting house prices. The project requires you to apply core feature engineering techniques: missing value handling, encoding strategies, feature creation, feature selection, and sklearn pipelines. These skills often make the difference between a mediocre model and an excellent one.
Feature Creation
Polynomial features, interactions, aggregations, domain-specific features
Feature Selection
Filter methods, wrapper methods, embedded methods, RFE, importance-based
Sklearn Pipelines
Pipeline, ColumnTransformer, custom transformers, reproducibility
The Scenario
HomeValue Analytics - House Price Prediction
You have been hired as a Data Scientist at HomeValue Analytics, a real estate analytics company. The team has a raw dataset of house sales but the initial model performance is poor. The Chief Data Officer has given you this challenge:
"Our raw data has missing values, mixed data types, and irrelevant features. We tried a basic model but got an R² of only 0.65. Can you build a feature engineering pipeline that transforms this messy data into something that gives us at least 0.85 R²?"
Your Task
Create a Jupyter Notebook called feature_engineering.ipynb that implements a complete
feature engineering pipeline. Your code must clean the data, create meaningful features, select the
most important ones, and demonstrate significant improvement in model performance.
The Dataset
Create a synthetic house sales dataset (house_prices.csv) with the following structure:
File: house_prices.csv (House Sales Data)
house_id,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,neighborhood,garage_type,heating_type,price
1,3,1.5,1800,5650,1.0,0,0,3,7,1180,620,1955,,98178,47.5112,-122.257,1340,5650,Urban,Attached,Gas,221900
2,4,2.5,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.7210,-122.319,1690,7639,Suburban,Detached,Electric,538000
3,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,,98028,47.7379,-122.233,2720,8062,Rural,,Oil,180000
...
Columns Explained
- house_id - Unique identifier (integer)
- bedrooms - Number of bedrooms (integer)
- bathrooms - Number of bathrooms (float)
- sqft_living - Living area square footage (integer)
- sqft_lot - Lot size square footage (integer)
- floors - Number of floors (float)
- waterfront - Waterfront property, 0/1 (binary)
- view - View quality rating, 0-4 (ordinal)
- condition - Overall condition, 1-5 (ordinal)
- grade - Construction grade, 1-13 (ordinal)
- sqft_above - Above-ground square footage (integer)
- sqft_basement - Basement square footage (integer)
- yr_built - Year built (integer)
- yr_renovated - Year renovated, blank if never (integer/null)
- zipcode - ZIP code (categorical)
- lat, long - Geographic coordinates (float)
- sqft_living15, sqft_lot15 - Living area / lot size of the 15 nearest neighboring houses (integer)
- neighborhood - Urban/Suburban/Rural (categorical)
- garage_type - Attached/Detached/None (categorical with nulls)
- heating_type - Gas/Electric/Oil (categorical)
- price - Sale price in USD (target)
Requirements
Your feature_engineering.ipynb must implement ALL of the following functions.
Each function is mandatory and will be tested individually.
Load and Explore Data
Create a function load_and_explore(filepath) that:
- Loads the CSV file into a DataFrame
- Prints shape, dtypes, and missing value counts
- Identifies numerical vs categorical columns
- Returns DataFrame and column type lists
def load_and_explore(filepath):
    """Load data and perform initial exploration."""
    # Return: df, numerical_cols, categorical_cols
    pass
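One way this function could be fleshed out (a sketch, not the required solution; it assumes pandas is installed and uses dtype-based splitting as a first heuristic):

```python
import pandas as pd

def load_and_explore(filepath):
    """Load data and perform initial exploration."""
    df = pd.read_csv(filepath)
    print(f"Shape: {df.shape}")
    print(df.dtypes)
    print("Missing values per column:")
    print(df.isna().sum())
    # Heuristic: anything with a numeric dtype is treated as numerical
    numerical_cols = df.select_dtypes(include="number").columns.tolist()
    categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
    return df, numerical_cols, categorical_cols
```

Note that zipcode loads as an integer and will land in numerical_cols, so you may want to reclassify it by hand.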
Handle Missing Values
Create a function handle_missing_values(df, strategy='smart') that:
- Implements multiple strategies: 'drop', 'mean', 'median', 'mode', 'smart'
- 'smart' uses domain knowledge (e.g., a missing yr_renovated means the house was never renovated, so fill it with 0)
- Creates indicator columns for missingness (useful features)
- Returns cleaned DataFrame with no missing values
def handle_missing_values(df, strategy='smart'):
    """Handle missing values with specified strategy."""
    # 'smart' strategy: yr_renovated NaN -> 0, garage_type NaN -> 'None'
    # Create: has_renovation, has_garage indicator columns
    pass
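A sketch of how the branches could look (assumes pandas; column names match the dataset above):

```python
import pandas as pd

def handle_missing_values(df, strategy="smart"):
    """Handle missing values with the specified strategy."""
    df = df.copy()
    if strategy == "drop":
        return df.dropna()
    if strategy == "smart":
        # Missingness itself is informative: keep indicator columns
        df["has_renovation"] = df["yr_renovated"].notna().astype(int)
        df["has_garage"] = df["garage_type"].notna().astype(int)
        # Domain knowledge: blank yr_renovated means "never renovated"
        df["yr_renovated"] = df["yr_renovated"].fillna(0)
        df["garage_type"] = df["garage_type"].fillna("None")
        return df
    # 'mean' / 'median' fallback, applied per numeric column
    num = df.select_dtypes(include="number").columns
    df[num] = df[num].fillna(df[num].agg(strategy))
    return df
```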
Encode Categorical Variables
Create a function encode_categoricals(df, method='auto') that:
- Uses OneHotEncoder for nominal categories (neighborhood, heating_type)
- Uses OrdinalEncoder for ordinal categories (view, condition, grade)
- Uses TargetEncoder for high-cardinality (zipcode)
- Returns encoded DataFrame and fitted encoders
def encode_categoricals(df, method='auto'):
    """Encode categorical variables appropriately."""
    # Nominal: OneHotEncoder
    # Ordinal: OrdinalEncoder with proper ordering
    # High-cardinality: TargetEncoder
    pass
Create Domain Features
Create a function create_domain_features(df) that:
- Creates age = current_year - yr_built
- Creates years_since_renovation = current_year - yr_renovated (if renovated)
- Creates total_sqft, and price_per_sqft where applicable (caution: price_per_sqft is derived from the target, so use it for analysis only, never as a model input)
- Creates bed_bath_ratio and living_lot_ratio
- Returns DataFrame with new features
def create_domain_features(df):
    """Create domain-specific features."""
    # age, years_since_renovation, total_sqft
    # bed_bath_ratio, living_lot_ratio, has_basement
    pass
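A sketch of these derivations (CURRENT_YEAR is hardcoded here as an assumption; in practice derive it from `datetime.date.today()`. price_per_sqft is deliberately omitted because it uses the target):

```python
import pandas as pd

CURRENT_YEAR = 2024  # assumption; use datetime.date.today().year in practice

def create_domain_features(df):
    """Create domain-specific features."""
    df = df.copy()
    df["age"] = CURRENT_YEAR - df["yr_built"]
    # Never-renovated houses (yr_renovated == 0) fall back to the house's age
    renovated = df["yr_renovated"] > 0
    df["years_since_renovation"] = (
        (CURRENT_YEAR - df["yr_renovated"]).where(renovated, df["age"])
    )
    df["total_sqft"] = df["sqft_above"] + df["sqft_basement"]
    df["has_basement"] = (df["sqft_basement"] > 0).astype(int)
    df["bed_bath_ratio"] = df["bedrooms"] / df["bathrooms"].replace(0, 1)
    df["living_lot_ratio"] = df["sqft_living"] / df["sqft_lot"]
    return df
```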
Create Interaction Features
Create a function create_interaction_features(df, pairs) that:
- Creates multiplication interactions for specified pairs
- Creates polynomial features (degree 2) for important numericals
- Example: sqft_living × grade, bedrooms × bathrooms
- Returns DataFrame with interaction features
def create_interaction_features(df, pairs):
    """Create interaction features between specified column pairs."""
    # pairs example: [('sqft_living', 'grade'), ('bedrooms', 'bathrooms')]
    pass
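A minimal sketch; the squared terms here stand in for degree-2 polynomial features on the columns involved (sklearn's `PolynomialFeatures` is an alternative that also generates the cross terms):

```python
import pandas as pd

def create_interaction_features(df, pairs):
    """Multiply specified column pairs; add squared terms for columns used."""
    df = df.copy()
    for a, b in pairs:
        df[f"{a}_x_{b}"] = df[a] * df[b]
    # Degree-2 polynomial terms for every column that appears in a pair
    for col in {c for pair in pairs for c in pair}:
        df[f"{col}_sq"] = df[col] ** 2
    return df
```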
Handle Outliers
Create a function handle_outliers(df, method='iqr', threshold=1.5) that:
- Implements IQR method and Z-score method
- Provides options: 'remove', 'cap', 'log_transform'
- Visualizes outliers before/after treatment
- Returns treated DataFrame
def handle_outliers(df, method='iqr', threshold=1.5, treatment='cap'):
    """Detect and handle outliers."""
    # method: 'iqr' or 'zscore'
    # treatment: 'remove', 'cap', or 'log_transform'
    pass
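A sketch of the detection and treatment logic (the required before/after visualization is omitted here; add boxplots or histograms around the treatment step):

```python
import numpy as np
import pandas as pd

def handle_outliers(df, method="iqr", threshold=1.5, treatment="cap"):
    """Detect outliers per numeric column and cap, remove, or log-transform."""
    df = df.copy()
    for col in df.select_dtypes(include="number"):
        if method == "iqr":
            q1, q3 = df[col].quantile([0.25, 0.75])
            lo, hi = q1 - threshold * (q3 - q1), q3 + threshold * (q3 - q1)
        else:  # 'zscore'
            mean, std = df[col].mean(), df[col].std()
            lo, hi = mean - threshold * std, mean + threshold * std
        if treatment == "cap":
            df[col] = df[col].clip(lo, hi)
        elif treatment == "remove":
            df = df[df[col].between(lo, hi)]
        else:  # 'log_transform' compresses the tail instead of dropping rows
            df[col] = np.log1p(df[col] - df[col].min())
    return df
```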
Scale Features
Create a function scale_features(df, method='standard') that:
- Implements StandardScaler, MinMaxScaler, RobustScaler
- Only scales numerical columns
- Returns scaled DataFrame and fitted scaler
def scale_features(df, method='standard'):
    """Scale numerical features."""
    # method: 'standard', 'minmax', 'robust'
    # Return: scaled_df, scaler
    pass
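One possible shape for this function, mapping the method name to the scikit-learn scaler class:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

def scale_features(df, method="standard"):
    """Scale only the numeric columns; return the fitted scaler for reuse."""
    scalers = {"standard": StandardScaler, "minmax": MinMaxScaler,
               "robust": RobustScaler}
    scaler = scalers[method]()
    df = df.copy()
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = scaler.fit_transform(df[num_cols])
    return df, scaler
```

Returning the fitted scaler matters: at prediction time you call `scaler.transform`, never `fit_transform`, on new data.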
Select Features - Filter Methods
Create a function select_features_filter(X, y, method='correlation', k=10) that:
- Implements correlation-based selection
- Implements mutual information
- Implements variance threshold
- Returns selected features and scores
def select_features_filter(X, y, method='correlation', k=10):
    """Filter-based feature selection."""
    # method: 'correlation', 'mutual_info', 'variance'
    # Return: selected_features, scores
    pass
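A sketch of the three filter variants; each scores features independently of any model, then keeps the top k:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def select_features_filter(X, y, method="correlation", k=10):
    """Rank features by a cheap per-feature statistic and keep the top k."""
    if method == "correlation":
        scores = X.corrwith(y).abs()
    elif method == "mutual_info":
        # Mutual information also captures non-linear dependence on y
        scores = pd.Series(mutual_info_regression(X, y), index=X.columns)
    else:  # 'variance': near-constant features carry little information
        scores = X.var()
    selected = scores.sort_values(ascending=False).head(k).index.tolist()
    return selected, scores
```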
Select Features - Wrapper Methods
Create a function select_features_wrapper(X, y, method='rfe', n_features=10) that:
- Implements Recursive Feature Elimination (RFE)
- Implements Sequential Feature Selection (forward/backward)
- Returns selected features and ranking
def select_features_wrapper(X, y, method='rfe', n_features=10):
    """Wrapper-based feature selection."""
    # method: 'rfe', 'forward', 'backward'
    # Return: selected_features, ranking
    pass
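A sketch using `RFE` and `SequentialFeatureSelector` with a linear model as the wrapped estimator (any regressor works; wrapper methods refit it many times, so they are slow on wide data):

```python
import pandas as pd
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

def select_features_wrapper(X, y, method="rfe", n_features=10):
    """Search feature subsets by repeatedly fitting a model."""
    model = LinearRegression()
    if method == "rfe":
        selector = RFE(model, n_features_to_select=n_features).fit(X, y)
        ranking = pd.Series(selector.ranking_, index=X.columns)
    else:  # 'forward' or 'backward'
        selector = SequentialFeatureSelector(
            model, n_features_to_select=n_features, direction=method).fit(X, y)
        # SFS has no ranking_; report the keep/drop mask instead
        ranking = pd.Series(selector.get_support().astype(int), index=X.columns)
    selected = X.columns[selector.get_support()].tolist()
    return selected, ranking
```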
Select Features - Embedded Methods
Create a function select_features_embedded(X, y, method='lasso') that:
- Implements Lasso regularization-based selection
- Implements tree-based feature importance (Random Forest)
- Visualizes feature importance
- Returns selected features and importance scores
def select_features_embedded(X, y, method='lasso'):
    """Embedded feature selection methods."""
    # method: 'lasso', 'random_forest', 'gradient_boosting'
    # Return: selected_features, importance_scores
    pass
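A sketch of the lasso and random-forest branches (a gradient-boosting branch would look like the forest one; the required importance plot is left out):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV

def select_features_embedded(X, y, method="lasso"):
    """Selection as a by-product of model fitting."""
    if method == "lasso":
        # L1 regularization drives irrelevant coefficients exactly to zero
        model = LassoCV(cv=3).fit(X, y)
        importance = pd.Series(abs(model.coef_), index=X.columns)
    else:  # 'random_forest' ('gradient_boosting' would be analogous)
        model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
        importance = pd.Series(model.feature_importances_, index=X.columns)
    selected = importance[importance > 0].index.tolist()
    return selected, importance
```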
Build Sklearn Pipeline
Create a function build_pipeline(numerical_cols, categorical_cols) that:
- Uses ColumnTransformer for different column types
- Chains imputers, encoders, scalers in proper order
- Creates a reproducible, production-ready pipeline
- Returns the Pipeline object, ready to be fitted
def build_pipeline(numerical_cols, categorical_cols):
    """Build sklearn Pipeline with ColumnTransformer."""
    # numerical: impute -> scale
    # categorical: impute -> encode
    # Return: Pipeline object
    pass
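A sketch of a full preprocessing-plus-model pipeline (Ridge is one reasonable final estimator; swap in whatever model you evaluate elsewhere):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_pipeline(numerical_cols, categorical_cols):
    """Chain imputation, scaling, and encoding per column type, then a model."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # handle_unknown="ignore" keeps predict() safe on unseen categories
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numerical_cols),
        ("cat", categorical, categorical_cols),
    ])
    return Pipeline([("preprocess", preprocess), ("model", Ridge())])
```

Because imputers, scalers, and encoders live inside the pipeline, they are fit only on whatever data you pass to `pipeline.fit`, which is exactly what prevents train/test leakage.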
Compare Before/After Performance
Create a function compare_performance(X_raw, X_engineered, y) that:
- Trains same model on raw vs engineered features
- Uses cross-validation for fair comparison
- Reports R², MAE, RMSE improvements
- Visualizes improvement metrics
def compare_performance(X_raw, X_engineered, y):
    """Compare model performance before and after feature engineering."""
    # Train same model (e.g., Ridge) on both datasets
    # Report: R², MAE, RMSE before/after
    # Visualize: bar chart of improvements
    pass
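A sketch of the comparison using cross-validation with the same Ridge model on both feature sets (the required bar chart is omitted; plot the two metric dicts side by side):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def compare_performance(X_raw, X_engineered, y, cv=5):
    """Cross-validate the same model on raw vs engineered features."""
    results = {}
    for name, X in [("raw", X_raw), ("engineered", X_engineered)]:
        scores = {
            "r2": cross_val_score(Ridge(), X, y, cv=cv, scoring="r2").mean(),
            "mae": -cross_val_score(Ridge(), X, y, cv=cv,
                                    scoring="neg_mean_absolute_error").mean(),
            "rmse": -cross_val_score(Ridge(), X, y, cv=cv,
                                     scoring="neg_root_mean_squared_error").mean(),
        }
        print(name, scores)
        results[name] = scores
    return results
```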
Main Pipeline
Create a main() function that:
- Loads and explores the raw data
- Applies all feature engineering steps
- Compares feature selection methods
- Builds the final pipeline
- Demonstrates performance improvement
def main():
    # Load data
    df, num_cols, cat_cols = load_and_explore("house_prices.csv")
    y = df["price"]
    X_raw = df.drop(columns=["price"])
    # Apply feature engineering
    df = handle_missing_values(df, strategy='smart')
    df = create_domain_features(df)
    df = handle_outliers(df, method='iqr', treatment='cap')
    X_engineered = df.drop(columns=["price"])
    # Build pipeline
    pipeline = build_pipeline(num_cols, cat_cols)
    # Compare performance
    compare_performance(X_raw, X_engineered, y)
    print("Feature engineering complete!")

if __name__ == "__main__":
    main()
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
house-price-feature-engineering
Required Files
house-price-feature-engineering/
├── feature_engineering.ipynb # Your Jupyter Notebook with ALL 13 functions
├── house_prices.csv # Synthetic dataset (5,000+ houses)
├── feature_importance.png # Feature importance visualization
├── correlation_heatmap.png # Correlation matrix heatmap
├── before_after_comparison.png # Performance comparison chart
├── pipeline_diagram.png # Visual representation of your pipeline
├── feature_report.txt # Summary of engineered features
└── README.md # REQUIRED - see contents below
README.md Must Include:
- Your full name and submission date
- List of all engineered features with descriptions
- Performance improvement metrics (before vs after R², MAE, RMSE)
- Explanation of your feature selection strategy
- Instructions to run your notebook
Do Include
- All 13 functions implemented and working
- At least 10 new engineered features
- Multiple feature selection methods compared
- Sklearn Pipeline using ColumnTransformer
- Measurable performance improvement (≥0.10 R² gain)
- README.md with all required sections
Do Not Include
- Data leakage (fitting on test data)
- Hardcoded feature selections without justification
- Any .pyc or __pycache__ files
- Virtual environment folders
- Code that doesn't run without errors
- Target variable in feature engineering
Enter your GitHub username when you submit; your repository will be verified automatically.
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Function Implementation | 70 | All 13 functions correctly implemented with proper logic |
| Feature Creation | 30 | At least 10 meaningful features with domain justification |
| Feature Selection | 30 | Comparison of filter, wrapper, and embedded methods |
| Sklearn Pipeline | 25 | Proper use of Pipeline and ColumnTransformer |
| Performance Improvement | 25 | Demonstrated improvement in R² (≥0.10 gain required) |
| Visualizations | 10 | Clear feature importance and comparison visualizations |
| Code Quality | 10 | Docstrings, comments, clean organization |
| Total | 200 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
Feature Creation
Creating domain-specific features, interactions, and polynomial features that capture hidden patterns
Feature Selection
Filter, wrapper, and embedded methods - knowing when to use each and comparing their effectiveness
Sklearn Pipelines
Building reproducible, production-ready pipelines with ColumnTransformer and custom transformers
Data Cleaning
Strategic handling of missing values, outliers, and encoding - the foundation of good features
Pro Tips
Domain Knowledge is Key
- Price per sqft is more informative than raw price (compute it for analysis only; as a model input it leaks the target)
- Age of house matters more than year built
- Bathroom-to-bedroom ratio indicates luxury
- Location features (lat/long) can be clustered
Avoid Data Leakage
- Never use target for feature engineering
- Fit scalers/encoders on train data only
- Use pipelines to ensure proper ordering
- Target encoding needs special handling
Feature Selection Strategy
- Start with correlation analysis (filter)
- Use RFE for final selection (wrapper)
- Validate with tree-based importance (embedded)
- Remove highly correlated features (>0.9)
Common Pitfalls
- Creating too many features (overfitting)
- Not handling multicollinearity
- Forgetting to scale after feature creation
- Using raw categorical IDs as features