Assignment 8-A

Feature Engineering Challenge

Apply your feature engineering skills to real-world scenarios: create new features from raw data, encode categorical variables, scale numerical features, and select the most predictive features for machine learning models.

5-7 hours
Challenging
200 Points
What You'll Practice
  • Feature creation from raw data
  • Categorical encoding techniques
  • Feature scaling and normalization
  • Feature selection methods
  • Building ML-ready pipelines
01

Assignment Overview

In this assignment, you will build a complete Feature Engineering Pipeline for a machine learning project. This comprehensive project requires you to apply ALL concepts from Module 8: feature creation, categorical encoding, feature scaling, and feature selection to prepare raw data for ML models.

Scikit-learn Focus: You must use scikit-learn transformers and pipelines for all feature engineering tasks. This tests your understanding of proper ML preprocessing workflows.
Skills Applied: This assignment tests your understanding of Feature Creation (Topic 8.1), Categorical Encoding (Topic 8.2), Feature Scaling (Topic 8.3), and Feature Selection (Topic 8.4) from Module 8.
Feature Creation (8.1)

Polynomial features, datetime extraction, domain-specific features

Encoding (8.2)

One-hot, label, ordinal, and target encoding techniques

Scaling (8.3)

StandardScaler, MinMaxScaler, RobustScaler, normalization

Selection (8.4)

Variance threshold, correlation, mutual information, RFE

02

The Scenario

PropertyPredict Real Estate

You have been hired as a Machine Learning Engineer at PropertyPredict, a real estate analytics company. Your manager has assigned you this project:

"We have raw property listing data that needs to be transformed for our price prediction model. The data contains mixed types: numerical features, categorical variables, dates, and text. Your job is to build a robust feature engineering pipeline that can handle all these data types and produce ML-ready features. The pipeline must be reproducible and work on new data."

Your Task

Create a Jupyter notebook called feature_engineering.ipynb that implements a complete feature engineering pipeline using scikit-learn. Your pipeline must transform raw property data into features suitable for training a machine learning model.

03

The Dataset

You will work with real property listings data containing mixed data types: numerical features, categorical variables, dates, and boolean flags. Download the CSV file and explore it to understand the features you'll need to engineer.

Property Listings Dataset

Real estate data with mixed feature types for feature engineering practice

Your Task: Load the property listings dataset and explore its structure. You'll find numerical features (square_feet, bedrooms, etc.), categorical variables (property_type, neighborhood), datetime data (listing_date), and boolean flags (has_pool, has_fireplace). Some columns contain missing values that you'll need to handle as part of your feature engineering pipeline.
04

Requirements

Your feature_engineering.ipynb must implement ALL of the following tasks. Each task is mandatory and will be tested individually.

Part 1: Feature Creation (50 points)

1
Extract DateTime Features

Create a function extract_date_features(df) that:

  • Extracts year, month, day, day_of_week, quarter from listing_date
  • Creates is_weekend boolean feature
  • Creates days_since_listing (days from listing to today)
  • Returns DataFrame with new date-based features
def extract_date_features(df):
    """Extract features from listing_date column."""
    # Must create: year, month, day, day_of_week, quarter, is_weekend, days_since_listing
    pass
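As a hint, a minimal sketch of what this function could look like using pandas' `.dt` accessor (the `listing_date` column name comes from the dataset description; `days_since_listing` here is measured against today's date):

```python
import pandas as pd

def extract_date_features(df):
    """Extract calendar features from the listing_date column."""
    out = df.copy()
    dates = pd.to_datetime(out["listing_date"])
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    out["day"] = dates.dt.day
    out["day_of_week"] = dates.dt.dayofweek            # Monday=0 ... Sunday=6
    out["quarter"] = dates.dt.quarter
    out["is_weekend"] = out["day_of_week"] >= 5        # Saturday or Sunday
    out["days_since_listing"] = (pd.Timestamp.today().normalize() - dates).dt.days
    return out

demo = pd.DataFrame({"listing_date": ["2024-01-06", "2024-03-15"]})
features = extract_date_features(demo)
```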
2
Create Domain Features

Create a function create_domain_features(df) that:

  • Creates price_per_sqft = price / square_feet
  • Creates property_age = 2024 - year_built
  • Creates bed_bath_ratio = bedrooms / bathrooms
  • Creates total_rooms = bedrooms + bathrooms
  • Creates is_new_construction = True if year_built >= 2020
def create_domain_features(df):
    """Create domain-specific engineered features."""
    # Must create meaningful combinations of existing features
    pass
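These are simple arithmetic combinations; a sketch is below (column names follow the dataset description — consider guarding against division by zero if `bathrooms` can be 0):

```python
import pandas as pd

def create_domain_features(df):
    """Create domain-specific engineered features."""
    out = df.copy()
    out["price_per_sqft"] = out["price"] / out["square_feet"]
    out["property_age"] = 2024 - out["year_built"]
    out["bed_bath_ratio"] = out["bedrooms"] / out["bathrooms"]
    out["total_rooms"] = out["bedrooms"] + out["bathrooms"]
    out["is_new_construction"] = out["year_built"] >= 2020
    return out

demo = pd.DataFrame({"price": [300000], "square_feet": [1500],
                     "year_built": [2021], "bedrooms": [3], "bathrooms": [2]})
res = create_domain_features(demo)
```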
3
Create Polynomial Features

Create a function create_polynomial_features(df, columns, degree=2) that:

  • Uses sklearn PolynomialFeatures on specified numerical columns
  • Creates interaction terms and polynomial terms
  • Returns DataFrame with new polynomial features added
def create_polynomial_features(df, columns, degree=2):
    """Create polynomial and interaction features."""
    # Must use: sklearn.preprocessing.PolynomialFeatures
    pass
4
Create Binned Features

Create a function create_binned_features(df) that:

  • Bins square_feet into categories: 'Small', 'Medium', 'Large', 'Luxury'
  • Bins property_age into: 'New', 'Modern', 'Established', 'Historic'
  • Uses pd.cut() with appropriate bin edges
def create_binned_features(df):
    """Create binned categorical features from numerical columns."""
    # Must use: pd.cut() or pd.qcut()
    pass
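A sketch with `pd.cut`; the bin edges below are illustrative assumptions — choose edges that make sense for the actual distribution (or use `pd.qcut` for quantile-based bins):

```python
import pandas as pd

def create_binned_features(df):
    """Create binned categorical features from numerical columns."""
    out = df.copy()
    # bin edges are illustrative assumptions; tune them to the real data
    out["size_category"] = pd.cut(
        out["square_feet"],
        bins=[0, 1000, 2000, 3500, float("inf")],
        labels=["Small", "Medium", "Large", "Luxury"],
    )
    out["age_category"] = pd.cut(
        out["property_age"],
        bins=[-1, 5, 20, 50, float("inf")],
        labels=["New", "Modern", "Established", "Historic"],
    )
    return out

demo = pd.DataFrame({"square_feet": [800, 4000], "property_age": [3, 80]})
res = create_binned_features(demo)
```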

Part 2: Categorical Encoding (50 points)

5
One-Hot Encoding

Create a function apply_onehot_encoding(df, columns) that:

  • Uses sklearn OneHotEncoder on specified categorical columns
  • Handles unknown categories gracefully (ignore or use a default)
  • Returns DataFrame with encoded columns
def apply_onehot_encoding(df, columns):
    """Apply one-hot encoding to categorical columns."""
    # Must use: sklearn.preprocessing.OneHotEncoder
    pass
6
Ordinal Encoding

Create a function apply_ordinal_encoding(df, column, order) that:

  • Uses sklearn OrdinalEncoder with specified category order
  • Applies it to the 'condition' column with the order: Poor < Fair < Good < Excellent
  • Returns encoded column values
def apply_ordinal_encoding(df, column, order):
    """Apply ordinal encoding with specified category order."""
    # Must use: sklearn.preprocessing.OrdinalEncoder
    pass
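A sketch — the key detail is passing the category order via `categories=[order]` so 'Poor' maps to 0 and 'Excellent' to 3:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

def apply_ordinal_encoding(df, column, order):
    """Apply ordinal encoding with specified category order."""
    encoder = OrdinalEncoder(categories=[order])
    # fit_transform expects a 2-D input, hence df[[column]]
    return encoder.fit_transform(df[[column]]).ravel()

demo = pd.DataFrame({"condition": ["Good", "Poor", "Excellent"]})
codes = apply_ordinal_encoding(demo, "condition",
                               ["Poor", "Fair", "Good", "Excellent"])
```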
7
Label Encoding

Create a function apply_label_encoding(df, columns) that:

  • Uses sklearn LabelEncoder on specified columns
  • Stores the encoder for inverse transformation
  • Returns encoded DataFrame and encoder dictionary
def apply_label_encoding(df, columns):
    """Apply label encoding to categorical columns."""
    # Must use: sklearn.preprocessing.LabelEncoder
    pass
8
Target Encoding

Create a function apply_target_encoding(df, column, target) that:

  • Replaces categories with mean of target variable
  • Handles train/test split properly (fit on train, transform both)
  • Returns encoded column values
def apply_target_encoding(df, column, target):
    """Apply target encoding (mean encoding) to a categorical column."""
    # Calculate mean of target for each category
    pass
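A sketch of the core computation; for the leakage-safe version required above, compute the means on the training split only and `.map()` them onto both splits (the global mean is a common fallback for categories unseen in training):

```python
import pandas as pd

def apply_target_encoding(df, column, target):
    """Apply target encoding (mean encoding) to a categorical column."""
    means = df.groupby(column)[target].mean()      # mean of target per category
    global_mean = df[target].mean()                # fallback for unseen categories
    return df[column].map(means).fillna(global_mean)

demo = pd.DataFrame({"neighborhood": ["A", "A", "B"],
                     "price": [100.0, 200.0, 300.0]})
encoded = apply_target_encoding(demo, "neighborhood", "price")
```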

Part 3: Feature Scaling (50 points)

9
Standard Scaling

Create a function apply_standard_scaling(df, columns) that:

  • Uses sklearn StandardScaler (z-score normalization)
  • Fits on training data and transforms both train and test
  • Returns scaled DataFrame and fitted scaler
def apply_standard_scaling(df, columns):
    """Apply standard scaling (z-score normalization)."""
    # Must use: sklearn.preprocessing.StandardScaler
    pass
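A minimal sketch; the same pattern (fit, transform, return the fitted scaler) carries over to the MinMax and Robust variants below:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def apply_standard_scaling(df, columns):
    """Apply standard scaling (z-score normalization)."""
    scaler = StandardScaler()
    out = df.copy()
    out[columns] = scaler.fit_transform(out[columns])
    return out, scaler                  # keep the scaler to transform new data

demo = pd.DataFrame({"square_feet": [1000.0, 2000.0, 3000.0]})
scaled, scaler = apply_standard_scaling(demo, ["square_feet"])
```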
10
MinMax Scaling

Create a function apply_minmax_scaling(df, columns) that:

  • Uses sklearn MinMaxScaler to scale features to [0, 1] range
  • Returns scaled DataFrame and fitted scaler
def apply_minmax_scaling(df, columns):
    """Apply min-max scaling to [0, 1] range."""
    # Must use: sklearn.preprocessing.MinMaxScaler
    pass
11
Robust Scaling

Create a function apply_robust_scaling(df, columns) that:

  • Uses sklearn RobustScaler (uses median and IQR, resistant to outliers)
  • Applies it to columns with potential outliers
  • Returns scaled DataFrame and fitted scaler
def apply_robust_scaling(df, columns):
    """Apply robust scaling using median and IQR."""
    # Must use: sklearn.preprocessing.RobustScaler
    pass
12
Log Transformation

Create a function apply_log_transform(df, columns) that:

  • Applies log1p transformation (log(1 + x)) to handle skewed distributions
  • Useful for price and square_feet, which are right-skewed
  • Returns transformed DataFrame
def apply_log_transform(df, columns):
    """Apply log transformation for skewed features."""
    # Must use: np.log1p()
    pass
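A sketch — `np.log1p` is defined at 0 and compresses the right tail of skewed features; `np.expm1` inverts it if you need original units back:

```python
import numpy as np
import pandas as pd

def apply_log_transform(df, columns):
    """Apply log transformation for skewed features."""
    out = df.copy()
    out[columns] = np.log1p(out[columns])   # log(1 + x), safe at x = 0
    return out

demo = pd.DataFrame({"price": [0.0, np.e - 1]})
res = apply_log_transform(demo, ["price"])
```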

Part 4: Feature Selection (50 points)

13
Variance Threshold

Create a function select_by_variance(df, threshold=0.01) that:

  • Uses sklearn VarianceThreshold to remove low-variance features
  • Returns list of features that pass the threshold
def select_by_variance(df, threshold=0.01):
    """Remove features with variance below threshold."""
    # Must use: sklearn.feature_selection.VarianceThreshold
    pass
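A sketch; `get_support()` returns a boolean mask over the input columns, which makes recovering the surviving feature names straightforward:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def select_by_variance(df, threshold=0.01):
    """Remove features with variance below threshold."""
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(df)
    # boolean mask over columns -> names of the features that survive
    return df.columns[selector.get_support()].tolist()

demo = pd.DataFrame({"constant": [1.0, 1.0, 1.0], "varying": [1.0, 2.0, 3.0]})
kept = select_by_variance(demo)
```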
14
Correlation Analysis

Create a function remove_correlated_features(df, threshold=0.9) that:

  • Calculates correlation matrix for numerical features
  • Identifies highly correlated pairs (above threshold)
  • Removes one feature from each correlated pair
  • Returns DataFrame with reduced features
def remove_correlated_features(df, threshold=0.9):
    """Remove highly correlated features."""
    # Calculate correlation matrix and remove redundant features
    pass
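One common approach: mask the correlation matrix to its upper triangle so each pair is inspected once, then drop the later column of every pair above the threshold (which member of a pair to drop is a judgment call):

```python
import numpy as np
import pandas as pd

def remove_correlated_features(df, threshold=0.9):
    """Remove highly correlated features."""
    corr = df.corr().abs()
    # upper triangle only, so each pair is considered exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

demo = pd.DataFrame({
    "sqft": [1000.0, 2000.0, 3000.0],
    "sqm": [93.0, 186.0, 279.0],     # perfectly correlated with sqft
    "beds": [3.0, 2.0, 4.0],
})
reduced = remove_correlated_features(demo)
```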
15
Mutual Information

Create a function select_by_mutual_info(X, y, k=10) that:

  • Uses sklearn mutual_info_regression for regression target
  • Selects top k features with highest mutual information
  • Returns selected feature names and scores
def select_by_mutual_info(X, y, k=10):
    """Select top k features by mutual information."""
    # Must use: sklearn.feature_selection.mutual_info_regression
    pass
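A sketch; the toy data below (one informative feature, one pure-noise feature) is an illustration, not the property dataset:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def select_by_mutual_info(X, y, k=10):
    """Select top k features by mutual information."""
    scores = mutual_info_regression(X, y, random_state=0)
    ranked = pd.Series(scores, index=X.columns).sort_values(ascending=False)
    return ranked.head(k).index.tolist(), ranked

rng = np.random.default_rng(0)
X = pd.DataFrame({"signal": rng.normal(size=200), "noise": rng.normal(size=200)})
y = 3 * X["signal"] + rng.normal(scale=0.1, size=200)
selected, scores = select_by_mutual_info(X, y, k=1)
```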
16
Recursive Feature Elimination

Create a function select_by_rfe(X, y, n_features=10) that:

  • Uses sklearn RFE with a base estimator (e.g., RandomForest)
  • Recursively eliminates least important features
  • Returns selected feature names and ranking
def select_by_rfe(X, y, n_features=10):
    """Select features using Recursive Feature Elimination."""
    # Must use: sklearn.feature_selection.RFE
    pass
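A sketch using `LinearRegression` as the base estimator to keep the demo fast — swap in `RandomForestRegressor` as suggested above if you prefer; selected features end up with a ranking of 1:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

def select_by_rfe(X, y, n_features=10):
    """Select features using Recursive Feature Elimination."""
    rfe = RFE(estimator=LinearRegression(), n_features_to_select=n_features)
    rfe.fit(X, y)
    selected = X.columns[rfe.support_].tolist()
    ranking = pd.Series(rfe.ranking_, index=X.columns)  # 1 = selected
    return selected, ranking

rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y = 5 * X["a"] + 0.01 * rng.normal(size=100)
selected, ranking = select_by_rfe(X, y, n_features=1)
```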

Part 5: Complete Pipeline (Bonus - 30 points)

17
Build Complete Pipeline

Create a function build_feature_pipeline() that:

  • Uses sklearn Pipeline and ColumnTransformer
  • Handles numerical and categorical features separately
  • Chains imputation, encoding, scaling, and selection
  • Returns a fitted pipeline that can transform new data
def build_feature_pipeline():
    """Build a complete feature engineering pipeline."""
    # Must use: sklearn.pipeline.Pipeline and ColumnTransformer
    # Chain: SimpleImputer -> Encoders -> Scalers
    pass
18
Demonstrate Pipeline Usage

Create cells that demonstrate:

  • Splitting data into train/test sets
  • Fitting pipeline on training data only
  • Transforming both train and test data
  • Showing the final feature matrix ready for ML
05

Submission Instructions

Submit your completed assignment via GitHub following these instructions:

1
Create Jupyter Notebook

Create a single notebook called feature_engineering.ipynb containing all functions listed above.

  • Organize your notebook with clear markdown headers for each part
  • Each function must have a docstring explaining what it does
  • Include test cells that demonstrate each function working
  • Add markdown cells explaining your approach
2
Include Test Demonstrations

In your notebook, add cells that:

  • Load the property listings dataset
  • Call each of your functions with the data
  • Print results showing the transformations
  • Demonstrate the complete pipeline on train/test split
3
Create README

Create README.md that includes:

  • Your name and assignment title
  • Instructions to run your code
  • List of all functions with brief descriptions
  • Any challenges you faced and how you solved them
4
Create requirements.txt
numpy>=1.24.0
pandas>=2.0.0
scikit-learn>=1.3.0
5
Repository Structure

Your GitHub repository should look like this:

propertypredict-feature-engineering/
├── README.md
├── requirements.txt
└── feature_engineering.ipynb    # All functions with test demonstrations
6
Submit via Form

Once your repository is ready:

  • Make sure your repository is public or shared with your instructor
  • Click the "Submit Assignment" button below
  • Fill in the submission form with your GitHub repository URL
Important: Make sure all cells in your notebook run without errors before submitting!
06

Grading Rubric

Your assignment will be graded on the following criteria:

Criteria                  | Points          | Description
Feature Creation          | 50              | DateTime extraction, domain features, polynomial features, binning
Categorical Encoding      | 50              | One-hot, ordinal, label, and target encoding implementations
Feature Scaling           | 50              | Standard, MinMax, Robust scaling and log transformation
Feature Selection         | 50              | Variance threshold, correlation, mutual info, RFE
Complete Pipeline (Bonus) | 30              | Scikit-learn Pipeline with ColumnTransformer
Total                     | 200 (+30 bonus) |

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.

Submit Your Assignment
07

What You Will Practice

Feature Creation (8.1)

DateTime extraction, domain-specific features, polynomial features, and binning continuous variables

Categorical Encoding (8.2)

One-hot encoding, ordinal encoding, label encoding, and target/mean encoding for ML models

Feature Scaling (8.3)

StandardScaler, MinMaxScaler, RobustScaler, and log transformations for different scenarios

Feature Selection (8.4)

Variance threshold, correlation analysis, mutual information, and recursive feature elimination

08

Pro Tips

Encoding Best Practices
  • Use one-hot encoding for low-cardinality nominal features
  • Use ordinal encoding for ordered categories
  • Use target encoding for high-cardinality features
  • Always fit encoders on training data only
Scaling Guidelines
  • StandardScaler for normally distributed data
  • MinMaxScaler when bounded range is needed
  • RobustScaler when outliers are present
  • Always scale after train/test split
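The last guideline is worth seeing in code — the scaler learns its statistics from the training split only, and the test split is transformed with those same statistics:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # statistics come from train only
X_test_scaled = scaler.transform(X_test)         # no refitting on test data
```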
Time Management
  • Start with feature creation (most creative part)
  • Test each function independently first
  • Build the pipeline incrementally
  • Leave time for testing the complete workflow
Common Mistakes
  • Data leakage: fitting on test data
  • Forgetting to handle missing values first
  • Not storing fitted transformers for new data
  • Encoding target variable (do not encode y!)
09

Pre-Submission Checklist

Feature Engineering Requirements
Repository Requirements