Assignment Overview
In this assignment, you will build a complete Feature Engineering Pipeline for predicting house prices. The project requires you to apply core feature engineering techniques: missing value handling, encoding strategies, feature creation, feature selection, and sklearn pipelines. These skills often make the difference between a mediocre model and an excellent one.
Feature Creation
Polynomial features, interactions, aggregations, domain-specific features
Feature Selection
Filter methods, wrapper methods, embedded methods, RFE, importance-based
Sklearn Pipelines
Pipeline, ColumnTransformer, custom transformers, reproducibility
The Scenario
HomeValue Analytics - House Price Prediction
You have been hired as a Data Scientist at HomeValue Analytics, a real estate analytics company. The team has a raw dataset of house sales but the initial model performance is poor. The Chief Data Officer has given you this challenge:
"Our raw data has missing values, mixed data types, and irrelevant features. We tried a basic model but got an R² of only 0.65. Can you build a feature engineering pipeline that transforms this messy data into something that gives us at least 0.85 R²?"
Your Task
Create a Jupyter Notebook called feature_engineering.ipynb that implements a complete
feature engineering pipeline. Your code must clean the data, create meaningful features, select the
most important ones, and demonstrate significant improvement in model performance.
The Dataset
Create a synthetic house sales dataset (house_prices.csv) with the following structure:
File: house_prices.csv (House Sales Data)
house_id,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,neighborhood,garage_type,heating_type,price
1,3,1.5,1800,5650,1.0,0,0,3,7,1180,620,1955,,98178,47.5112,-122.257,1340,5650,Urban,Attached,Gas,221900
2,4,2.5,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.7210,-122.319,1690,7639,Suburban,Detached,Electric,538000
3,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,,98028,47.7379,-122.233,2720,8062,Rural,,Oil,180000
...
Columns Explained
- house_id - Unique identifier (integer)
- bedrooms - Number of bedrooms (integer)
- bathrooms - Number of bathrooms (float)
- sqft_living - Living area square footage (integer)
- sqft_lot - Lot size square footage (integer)
- floors - Number of floors (float)
- waterfront - Waterfront property, 0/1 (binary)
- view - View quality rating, 0-4 (ordinal)
- condition - Overall condition, 1-5 (ordinal)
- grade - Construction grade, 1-13 (ordinal)
- sqft_above - Above-ground square footage (integer)
- sqft_basement - Basement square footage (integer)
- yr_built - Year built (integer)
- yr_renovated - Year renovated, blank if never (integer/null)
- zipcode - ZIP code (categorical)
- lat, long - Geographic coordinates (float)
- sqft_living15, sqft_lot15 - Living area / lot size of the 15 nearest neighboring houses (integer)
- neighborhood - Urban/Suburban/Rural (categorical)
- garage_type - Attached/Detached/None (categorical with nulls)
- heating_type - Gas/Electric/Oil (categorical)
- price - Sale price in USD (target)
Requirements
Your feature_engineering.ipynb must implement ALL of the following functions.
Each function is mandatory and will be tested individually.
Load and Explore Data
Create a function load_and_explore(filepath) that:
- Loads the CSV file into a DataFrame
- Prints shape, dtypes, and missing value counts
- Identifies numerical vs categorical columns
- Returns DataFrame and column type lists
def load_and_explore(filepath):
    """Load data and perform initial exploration."""
    # Return: df, numerical_cols, categorical_cols
    pass
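One way this function could be fleshed out (a sketch, not the required solution; it assumes pandas is installed and uses dtype-based splitting as a first heuristic):

```python
import pandas as pd

def load_and_explore(filepath):
    """Load data and perform initial exploration."""
    df = pd.read_csv(filepath)
    print(f"Shape: {df.shape}")
    print(df.dtypes)
    print("Missing values per column:")
    print(df.isna().sum())
    # Heuristic: anything with a numeric dtype is treated as numerical
    numerical_cols = df.select_dtypes(include="number").columns.tolist()
    categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
    return df, numerical_cols, categorical_cols
```

Note that zipcode loads as an integer and will land in numerical_cols, so you may want to reclassify it by hand.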
Handle Missing Values
Create a function handle_missing_values(df, strategy='smart') that:
- Implements multiple strategies: 'drop', 'mean', 'median', 'mode', 'smart'
- 'smart' uses domain knowledge (e.g., a missing yr_renovated means the house was never renovated, so fill it with 0)
- Creates indicator columns for missingness (useful features)
- Returns cleaned DataFrame with no missing values
def handle_missing_values(df, strategy='smart'):
    """Handle missing values with specified strategy."""
    # 'smart' strategy: yr_renovated NaN -> 0, garage_type NaN -> 'None'
    # Create: has_renovation, has_garage indicator columns
    pass
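A sketch of how the branches could look (assumes pandas; column names match the dataset above):

```python
import pandas as pd

def handle_missing_values(df, strategy="smart"):
    """Handle missing values with the specified strategy."""
    df = df.copy()
    if strategy == "drop":
        return df.dropna()
    if strategy == "smart":
        # Missingness itself is informative: keep indicator columns
        df["has_renovation"] = df["yr_renovated"].notna().astype(int)
        df["has_garage"] = df["garage_type"].notna().astype(int)
        # Domain knowledge: blank yr_renovated means "never renovated"
        df["yr_renovated"] = df["yr_renovated"].fillna(0)
        df["garage_type"] = df["garage_type"].fillna("None")
        return df
    # 'mean' / 'median' fallback, applied per numeric column
    num = df.select_dtypes(include="number").columns
    df[num] = df[num].fillna(df[num].agg(strategy))
    return df
```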
Encode Categorical Variables
Create a function encode_categoricals(df, method='auto') that:
- Uses OneHotEncoder for nominal categories (neighborhood, heating_type)
- Uses OrdinalEncoder for ordinal categories (view, condition, grade)
- Uses TargetEncoder for high-cardinality (zipcode)
- Returns encoded DataFrame and fitted encoders
def encode_categoricals(df, method='auto'):
    """Encode categorical variables appropriately."""
    # Nominal: OneHotEncoder
    # Ordinal: OrdinalEncoder with proper ordering
    # High-cardinality: TargetEncoder
    pass
Create Domain Features
Create a function create_domain_features(df) that:
- Creates age = current_year - yr_built
- Creates years_since_renovation = current_year - yr_renovated (if renovated)
- Creates total_sqft, and price_per_sqft where applicable (caution: price_per_sqft is derived from the target, so use it for analysis only, never as a model input)
- Creates bed_bath_ratio and living_lot_ratio
- Returns DataFrame with new features
def create_domain_features(df):
    """Create domain-specific features."""
    # age, years_since_renovation, total_sqft
    # bed_bath_ratio, living_lot_ratio, has_basement
    pass
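A sketch of these derivations (CURRENT_YEAR is hardcoded here as an assumption; in practice derive it from `datetime.date.today()`. price_per_sqft is deliberately omitted because it uses the target):

```python
import pandas as pd

CURRENT_YEAR = 2024  # assumption; use datetime.date.today().year in practice

def create_domain_features(df):
    """Create domain-specific features."""
    df = df.copy()
    df["age"] = CURRENT_YEAR - df["yr_built"]
    # Never-renovated houses (yr_renovated == 0) fall back to the house's age
    renovated = df["yr_renovated"] > 0
    df["years_since_renovation"] = (
        (CURRENT_YEAR - df["yr_renovated"]).where(renovated, df["age"])
    )
    df["total_sqft"] = df["sqft_above"] + df["sqft_basement"]
    df["has_basement"] = (df["sqft_basement"] > 0).astype(int)
    df["bed_bath_ratio"] = df["bedrooms"] / df["bathrooms"].replace(0, 1)
    df["living_lot_ratio"] = df["sqft_living"] / df["sqft_lot"]
    return df
```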
Create Interaction Features
Create a function create_interaction_features(df, pairs) that:
- Creates multiplication interactions for specified pairs
- Creates polynomial features (degree 2) for important numericals
- Example: sqft_living × grade, bedrooms × bathrooms
- Returns DataFrame with interaction features
def create_interaction_features(df, pairs):
    """Create interaction features between specified column pairs."""
    # pairs example: [('sqft_living', 'grade'), ('bedrooms', 'bathrooms')]
    pass
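A minimal sketch; the squared terms here stand in for degree-2 polynomial features on the columns involved (sklearn's `PolynomialFeatures` is an alternative that also generates the cross terms):

```python
import pandas as pd

def create_interaction_features(df, pairs):
    """Multiply specified column pairs; add squared terms for columns used."""
    df = df.copy()
    for a, b in pairs:
        df[f"{a}_x_{b}"] = df[a] * df[b]
    # Degree-2 polynomial terms for every column that appears in a pair
    for col in {c for pair in pairs for c in pair}:
        df[f"{col}_sq"] = df[col] ** 2
    return df
```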
Handle Outliers
Create a function handle_outliers(df, method='iqr', threshold=1.5) that:
- Implements IQR method and Z-score method
- Provides options: 'remove', 'cap', 'log_transform'
- Visualizes outliers before/after treatment
- Returns treated DataFrame
def handle_outliers(df, method='iqr', threshold=1.5, treatment='cap'):
    """Detect and handle outliers."""
    # method: 'iqr' or 'zscore'
    # treatment: 'remove', 'cap', or 'log_transform'
    pass
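A sketch of the detection and treatment logic (the required before/after visualization is omitted here; add boxplots or histograms around the treatment step):

```python
import numpy as np
import pandas as pd

def handle_outliers(df, method="iqr", threshold=1.5, treatment="cap"):
    """Detect outliers per numeric column and cap, remove, or log-transform."""
    df = df.copy()
    for col in df.select_dtypes(include="number"):
        if method == "iqr":
            q1, q3 = df[col].quantile([0.25, 0.75])
            lo, hi = q1 - threshold * (q3 - q1), q3 + threshold * (q3 - q1)
        else:  # 'zscore'
            mean, std = df[col].mean(), df[col].std()
            lo, hi = mean - threshold * std, mean + threshold * std
        if treatment == "cap":
            df[col] = df[col].clip(lo, hi)
        elif treatment == "remove":
            df = df[df[col].between(lo, hi)]
        else:  # 'log_transform' compresses the tail instead of dropping rows
            df[col] = np.log1p(df[col] - df[col].min())
    return df
```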
Scale Features
Create a function scale_features(df, method='standard') that:
- Implements StandardScaler, MinMaxScaler, RobustScaler
- Only scales numerical columns
- Returns scaled DataFrame and fitted scaler
def scale_features(df, method='standard'):
    """Scale numerical features."""
    # method: 'standard', 'minmax', 'robust'
    # Return: scaled_df, scaler
    pass
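One possible shape for this function, mapping the method name to the scikit-learn scaler class:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

def scale_features(df, method="standard"):
    """Scale only the numeric columns; return the fitted scaler for reuse."""
    scalers = {"standard": StandardScaler, "minmax": MinMaxScaler,
               "robust": RobustScaler}
    scaler = scalers[method]()
    df = df.copy()
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = scaler.fit_transform(df[num_cols])
    return df, scaler
```

Returning the fitted scaler matters: at prediction time you call `scaler.transform`, never `fit_transform`, on new data.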
Select Features - Filter Methods
Create a function select_features_filter(X, y, method='correlation', k=10) that:
- Implements correlation-based selection
- Implements mutual information
- Implements variance threshold
- Returns selected features and scores
def select_features_filter(X, y, method='correlation', k=10):
    """Filter-based feature selection."""
    # method: 'correlation', 'mutual_info', 'variance'
    # Return: selected_features, scores
    pass
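A sketch of the three filter variants; each scores features independently of any model, then keeps the top k:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def select_features_filter(X, y, method="correlation", k=10):
    """Rank features by a cheap per-feature statistic and keep the top k."""
    if method == "correlation":
        scores = X.corrwith(y).abs()
    elif method == "mutual_info":
        # Mutual information also captures non-linear dependence on y
        scores = pd.Series(mutual_info_regression(X, y), index=X.columns)
    else:  # 'variance': near-constant features carry little information
        scores = X.var()
    selected = scores.sort_values(ascending=False).head(k).index.tolist()
    return selected, scores
```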
Select Features - Wrapper Methods
Create a function select_features_wrapper(X, y, method='rfe', n_features=10) that:
- Implements Recursive Feature Elimination (RFE)
- Implements Sequential Feature Selection (forward/backward)
- Returns selected features and ranking
def select_features_wrapper(X, y, method='rfe', n_features=10):
    """Wrapper-based feature selection."""
    # method: 'rfe', 'forward', 'backward'
    # Return: selected_features, ranking
    pass
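A sketch using `RFE` and `SequentialFeatureSelector` with a linear model as the wrapped estimator (any regressor works; wrapper methods refit it many times, so they are slow on wide data):

```python
import pandas as pd
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

def select_features_wrapper(X, y, method="rfe", n_features=10):
    """Search feature subsets by repeatedly fitting a model."""
    model = LinearRegression()
    if method == "rfe":
        selector = RFE(model, n_features_to_select=n_features).fit(X, y)
        ranking = pd.Series(selector.ranking_, index=X.columns)
    else:  # 'forward' or 'backward'
        selector = SequentialFeatureSelector(
            model, n_features_to_select=n_features, direction=method).fit(X, y)
        # SFS has no ranking_; report the keep/drop mask instead
        ranking = pd.Series(selector.get_support().astype(int), index=X.columns)
    selected = X.columns[selector.get_support()].tolist()
    return selected, ranking
```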
Select Features - Embedded Methods
Create a function select_features_embedded(X, y, method='lasso') that:
- Implements Lasso regularization-based selection
- Implements tree-based feature importance (Random Forest)
- Visualizes feature importance
- Returns selected features and importance scores
def select_features_embedded(X, y, method='lasso'):
    """Embedded feature selection methods."""
    # method: 'lasso', 'random_forest', 'gradient_boosting'
    # Return: selected_features, importance_scores
    pass
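A sketch of the lasso and random-forest branches (a gradient-boosting branch would look like the forest one; the required importance plot is left out):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV

def select_features_embedded(X, y, method="lasso"):
    """Selection as a by-product of model fitting."""
    if method == "lasso":
        # L1 regularization drives irrelevant coefficients exactly to zero
        model = LassoCV(cv=3).fit(X, y)
        importance = pd.Series(abs(model.coef_), index=X.columns)
    else:  # 'random_forest' ('gradient_boosting' would be analogous)
        model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
        importance = pd.Series(model.feature_importances_, index=X.columns)
    selected = importance[importance > 0].index.tolist()
    return selected, importance
```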
Build Sklearn Pipeline
Create a function build_pipeline(numerical_cols, categorical_cols) that:
- Uses ColumnTransformer for different column types
- Chains imputers, encoders, scalers in proper order
- Creates a reproducible, production-ready pipeline
- Returns the Pipeline object, ready to be fitted
def build_pipeline(numerical_cols, categorical_cols):
    """Build sklearn Pipeline with ColumnTransformer."""
    # numerical: impute -> scale
    # categorical: impute -> encode
    # Return: Pipeline object
    pass
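A sketch of a full preprocessing-plus-model pipeline (Ridge is one reasonable final estimator; swap in whatever model you evaluate elsewhere):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_pipeline(numerical_cols, categorical_cols):
    """Chain imputation, scaling, and encoding per column type, then a model."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # handle_unknown="ignore" keeps predict() safe on unseen categories
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numerical_cols),
        ("cat", categorical, categorical_cols),
    ])
    return Pipeline([("preprocess", preprocess), ("model", Ridge())])
```

Because imputers, scalers, and encoders live inside the pipeline, they are fit only on whatever data you pass to `pipeline.fit`, which is exactly what prevents train/test leakage.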
Compare Before/After Performance
Create a function compare_performance(X_raw, X_engineered, y) that:
- Trains same model on raw vs engineered features
- Uses cross-validation for fair comparison
- Reports R², MAE, RMSE improvements
- Visualizes improvement metrics
def compare_performance(X_raw, X_engineered, y):
    """Compare model performance before and after feature engineering."""
    # Train same model (e.g., Ridge) on both datasets
    # Report: R², MAE, RMSE before/after
    # Visualize: bar chart of improvements
    pass
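A sketch of the comparison using cross-validation with the same Ridge model on both feature sets (the required bar chart is omitted; plot the two metric dicts side by side):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def compare_performance(X_raw, X_engineered, y, cv=5):
    """Cross-validate the same model on raw vs engineered features."""
    results = {}
    for name, X in [("raw", X_raw), ("engineered", X_engineered)]:
        scores = {
            "r2": cross_val_score(Ridge(), X, y, cv=cv, scoring="r2").mean(),
            "mae": -cross_val_score(Ridge(), X, y, cv=cv,
                                    scoring="neg_mean_absolute_error").mean(),
            "rmse": -cross_val_score(Ridge(), X, y, cv=cv,
                                     scoring="neg_root_mean_squared_error").mean(),
        }
        print(name, scores)
        results[name] = scores
    return results
```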
Main Pipeline
Create a main() function that:
- Loads and explores the raw data
- Applies all feature engineering steps
- Compares feature selection methods
- Builds the final pipeline
- Demonstrates performance improvement
def main():
    # Load data
    df, num_cols, cat_cols = load_and_explore("house_prices.csv")
    y = df["price"]
    X_raw = df.drop(columns=["price"])
    # Apply feature engineering
    df = handle_missing_values(df, strategy='smart')
    df = create_domain_features(df)
    df = handle_outliers(df, method='iqr', treatment='cap')
    X_engineered = df.drop(columns=["price"])
    # Build pipeline
    pipeline = build_pipeline(num_cols, cat_cols)
    # Compare performance
    compare_performance(X_raw, X_engineered, y)
    print("Feature engineering complete!")

if __name__ == "__main__":
    main()
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
house-price-feature-engineering
Required Files
house-price-feature-engineering/
├── feature_engineering.ipynb # Your Jupyter Notebook with ALL 13 functions
├── house_prices.csv # Synthetic dataset (5,000+ houses)
├── feature_importance.png # Feature importance visualization
├── correlation_heatmap.png # Correlation matrix heatmap
├── before_after_comparison.png # Performance comparison chart
├── pipeline_diagram.png # Visual representation of your pipeline
├── feature_report.txt # Summary of engineered features
└── README.md # REQUIRED - see contents below
README.md Must Include:
- Your full name and submission date
- List of all engineered features with descriptions
- Performance improvement metrics (before vs after R², MAE, RMSE)
- Explanation of your feature selection strategy
- Instructions to run your notebook
Do Include
- All 13 functions implemented and working
- At least 10 new engineered features
- Multiple feature selection methods compared
- Sklearn Pipeline using ColumnTransformer
- Measurable performance improvement (≥0.10 R² gain)
- README.md with all required sections
Do Not Include
- Data leakage (fitting on test data)
- Hardcoded feature selections without justification
- Any .pyc or __pycache__ files
- Virtual environment folders
- Code that doesn't run without errors
- Target variable in feature engineering
Enter your GitHub username when you submit; your repository will be verified automatically.
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Function Implementation | 70 | All 13 functions correctly implemented with proper logic |
| Feature Creation | 30 | At least 10 meaningful features with domain justification |
| Feature Selection | 30 | Comparison of filter, wrapper, and embedded methods |
| Sklearn Pipeline | 25 | Proper use of Pipeline and ColumnTransformer |
| Performance Improvement | 25 | Demonstrated improvement in R² (≥0.10 gain required) |
| Visualizations | 10 | Clear feature importance and comparison visualizations |
| Code Quality | 10 | Docstrings, comments, clean organization |
| Total | 200 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
Feature Creation
Creating domain-specific features, interactions, and polynomial features that capture hidden patterns
Feature Selection
Filter, wrapper, and embedded methods - knowing when to use each and comparing their effectiveness
Sklearn Pipelines
Building reproducible, production-ready pipelines with ColumnTransformer and custom transformers
Data Cleaning
Strategic handling of missing values, outliers, and encoding - the foundation of good features
Pro Tips
Domain Knowledge is Key
- Price per sqft is more informative than raw price (compute it for analysis only; as a model input it leaks the target)
- Age of house matters more than year built
- Bathroom-to-bedroom ratio indicates luxury
- Location features (lat/long) can be clustered
Avoid Data Leakage
- Never use target for feature engineering
- Fit scalers/encoders on train data only
- Use pipelines to ensure proper ordering
- Target encoding needs special handling
Feature Selection Strategy
- Start with correlation analysis (filter)
- Use RFE for final selection (wrapper)
- Validate with tree-based importance (embedded)
- Remove highly correlated features (>0.9)
Common Pitfalls
- Creating too many features (overfitting)
- Not handling multicollinearity
- Forgetting to scale after feature creation
- Using raw categorical IDs as features