Module 8.1

Feature Engineering

Learn how to transform raw data into powerful features that improve machine learning model performance. Master techniques like encoding, scaling, binning, and creating interaction features!

45 min read
Intermediate
Hands-on Examples
What You'll Learn
  • Understanding feature engineering fundamentals
  • Encoding categorical variables effectively
  • Scaling and normalizing numerical features
  • Creating interaction and polynomial features
  • Binning continuous variables strategically
Contents
01

What is Feature Engineering?

Feature engineering is the art and science of transforming raw data into meaningful inputs for machine learning models. It is often the most impactful step in the data science pipeline, as better features lead to better model performance. Many Kaggle competition winners attribute their success more to clever feature engineering than to choosing the right algorithm.

Why Feature Engineering Matters

Beginner-Friendly Explanation

Imagine you're a detective trying to identify criminals from photos. If you only have raw pixel values (millions of numbers from 0-255), it's nearly impossible to find patterns. But if someone extracts useful features like "has beard", "wears glasses", "age range", "height" - suddenly identification becomes much easier!

That's feature engineering! You transform raw, messy data into clean, meaningful features that highlight the patterns your model needs to learn. It's like translating a foreign language into one the model understands.

Real-world house price example: Your raw data might include the house's build date (e.g., 1985). But what really matters to buyers is the age of the house - calculated as current year minus build year. A house built in 1985 is now 41 years old. That transformation from "build date" → "age" is feature engineering in action!

The fundamental truth: Machine learning models can only learn from the patterns you give them. If important relationships are hidden deep in your raw data, even the most sophisticated algorithm (neural networks, XGBoost, etc.) will struggle. Feature engineering makes those patterns explicit and learnable.

Without Feature Engineering
  • Raw data: birth_date = "1990-05-15"
  • Model sees: Just a string or timestamp
  • Problem: Can't learn age patterns directly
  • Result: Poor model performance
With Feature Engineering
  • Engineered: age = 2026 - 1990 = 36
  • Model sees: Clear numerical value
  • Benefit: Can learn age-based patterns easily
  • Result: Much better predictions!
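In pandas, that birth-date-to-age transformation is a one-liner. A minimal sketch (the column name and dates here are made up; the reference year 2026 matches the example above):

```python
import pandas as pd

# Hypothetical raw data: birth dates stored as strings
customers = pd.DataFrame({'birth_date': ['1990-05-15', '1985-11-02', '2000-01-30']})

# Engineered feature: age in years relative to a fixed reference year
# (a fixed year keeps the feature reproducible)
reference_year = 2026
customers['age'] = reference_year - pd.to_datetime(customers['birth_date']).dt.year

print(customers)
#    birth_date  age
# 0  1990-05-15   36
# 1  1985-11-02   41
# 2  2000-01-30   26
```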
Core Concept

Feature Engineering (The Secret Sauce)

Definition: The process of using domain knowledge to create, transform, and select features (input variables) that make machine learning algorithms work more effectively. It bridges the gap between raw, messy data and the clean, meaningful patterns models need to learn.

Expert Quote:

"Applied machine learning is basically feature engineering"

- Andrew Ng (Stanford Professor, Founder of Coursera, former Google Brain lead)

Why is this so important?

  • Kaggle competition winners often report spending most of their time on feature engineering, far more than on model selection
  • Good features with a simple model often beat poor features with a complex model
  • Better features can substantially improve accuracy without changing the algorithm at all!
Real-world example: In fraud detection, raw transaction data includes amount, time, and location. Feature engineering creates derived features like "transactions per hour", "distance from usual location", and "deviation from typical spending" - which are far more predictive of fraud than raw values.
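A hedged sketch of what such derived fraud features might look like in pandas - all column names and values here are invented for illustration:

```python
import pandas as pd

# Hypothetical raw transaction log
tx = pd.DataFrame({
    'user_id':   [1, 1, 1, 2, 2],
    'amount':    [50.0, 60.0, 5000.0, 20.0, 25.0],
    'timestamp': pd.to_datetime(['2026-01-01 10:00', '2026-01-01 10:30',
                                 '2026-01-01 10:45', '2026-01-02 09:00',
                                 '2026-01-05 14:00']),
})

# Derived feature 1: minutes since the user's previous transaction
tx['mins_since_last'] = (
    tx.groupby('user_id')['timestamp'].diff().dt.total_seconds() / 60
)

# Derived feature 2: how large this amount is vs. the user's typical spend
user_mean = tx.groupby('user_id')['amount'].transform('mean')
tx['amount_vs_typical'] = tx['amount'] / user_mean

print(tx[['user_id', 'amount', 'mins_since_last', 'amount_vs_typical']])
# The 5000.0 transaction stands out on both features: only 15 minutes
# after the previous one, and ~2.9x the user's average spend
```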

The Feature Engineering Pipeline

Think of it like a factory assembly line! Raw materials (your data) enter one end, go through various stations (feature engineering techniques), and come out as finished products (features ready for machine learning models).

Feature engineering encompasses several types of transformations. Understanding when to apply each technique is crucial for building effective models. Each technique solves a specific problem!

Technique | Purpose | Example | When to Use
Encoding | Convert categorical to numerical | "red", "blue" → 0, 1 or [1,0], [0,1] | When you have text labels that models can't understand
Scaling | Normalize numerical ranges | Income (0-1M) → (0-1) | When features have different units or ranges (e.g., age vs. income)
Transformation | Change distribution shape | log(income) for skewed data | When data is heavily skewed or has outliers
Creation | Derive new features | age = 2026 - birth_year | When you can calculate meaningful features from existing ones
Binning | Discretize continuous values | age → "young", "adult", "senior" | When categories are more meaningful than exact numbers
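As a quick taste of the Transformation technique: a log transform compresses a skewed feature so one extreme value no longer dominates. A minimal sketch with made-up income values:

```python
import numpy as np

# Hypothetical right-skewed incomes: most values small, one huge outlier
income = np.array([30_000, 40_000, 45_000, 55_000, 2_000_000])

# log1p(x) = log(1 + x): compresses large values, and is safe even when x = 0
income_log = np.log1p(income)

print(income_log.round(2))
# Before: the max is ~67x the min. After the log, the spread shrinks
# to about 1.4x, so the extreme earner no longer dominates the feature.
```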
❌ Without Pipeline
  • Apply transformations inconsistently
  • Forget steps when deploying model
  • Data leakage from test set
  • Harder to debug problems
With Pipeline
  • Consistent transformation workflow
  • Easy to reproduce and deploy
  • Prevents data leakage automatically
  • Clean, organized code
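To make the pipeline idea concrete, here is a small preview sketch using scikit-learn's Pipeline and ColumnTransformer (the tiny dataset and column names are made up; these tools are introduced properly below):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up customer table
df = pd.DataFrame({
    'age':     [25, 45, 35, 52],
    'income':  [35000, 72000, 55000, 89000],
    'city':    ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore'],
    'churned': [1, 0, 0, 0],
})

# One "station" per column type: scale the numbers, one-hot encode the text
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])

# Chaining preprocessing + model guarantees the SAME transformations
# run at training time and at prediction time - no forgotten steps
pipe = Pipeline([('prep', preprocess), ('model', LogisticRegression())])
pipe.fit(df[['age', 'income', 'city']], df['churned'])

print(pipe.predict(df[['age', 'income', 'city']]))
```

Because the scalers and encoders are fitted inside the pipeline, they only ever learn from the data you pass to `fit()` - which is exactly how data leakage from the test set is prevented.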

Setting Up Your Environment

Think of these as your toolbox! Just like a carpenter has different tools for different jobs (hammer, saw, drill), we have different libraries for different feature engineering tasks.

Let's import the libraries we'll use throughout this lesson. We'll work with pandas for data manipulation and scikit-learn for preprocessing transformations. Don't worry! We'll explain each tool when we use it.

# ===== CORE LIBRARIES FOR FEATURE ENGINEERING =====

# pandas: The Excel of Python - handles our data tables (DataFrames)
# WHY? We need to load, view, and manipulate our data
import pandas as pd

# numpy: Handles mathematical operations and arrays
# WHY? Many ML algorithms work with numpy arrays under the hood
import numpy as np

# ===== SCALING TOOLS (Make features comparable) =====

# StandardScaler: Converts data to mean=0, std=1 (like converting to z-scores)
# WHEN TO USE: Most common choice, works with normally distributed data
from sklearn.preprocessing import StandardScaler

# MinMaxScaler: Squishes all values between 0 and 1
# WHEN TO USE: When you need exact 0-1 range, or data has clear boundaries
from sklearn.preprocessing import MinMaxScaler

# RobustScaler: Like StandardScaler but ignores outliers
# WHEN TO USE: When your data has lots of extreme values/outliers
from sklearn.preprocessing import RobustScaler

# ===== ENCODING TOOLS (Convert text to numbers) =====

# LabelEncoder: Converts categories to simple numbers (0, 1, 2, 3...)
# WHEN TO USE: For target variable, or ordinal categories (small < medium < large)
from sklearn.preprocessing import LabelEncoder

# OneHotEncoder: Creates separate columns for each category (0s and 1s)
# WHEN TO USE: For non-ordinal categories where no order exists (red, blue, green)
from sklearn.preprocessing import OneHotEncoder

# ===== FEATURE CREATION TOOLS =====

# PolynomialFeatures: Creates interactions like x1*x2, x1^2, etc.
# WHEN TO USE: When you suspect features interact (e.g., length × width = area)
from sklearn.preprocessing import PolynomialFeatures

# ===== DATA SPLITTING =====

# train_test_split: Divides data into training and testing sets
# WHY? To evaluate if our model works on NEW, unseen data
from sklearn.model_selection import train_test_split

print("Libraries loaded successfully!")
print("🎯 Ready to engineer some features!")

Pro Tip: Import Only What You Need

In real projects, don't import everything! Import only the specific tools you'll use. This makes your code faster and easier to understand. Think of it like packing for a trip - only bring what you need!

Creating Sample Data

Let's create a sample dataset to practice feature engineering techniques. This dataset represents customer information for a subscription service. Real-world scenario!

The Scenario: You work for a streaming service (like Netflix or Spotify). You have customer data and want to predict who will cancel their subscription. To make good predictions, you need to engineer features from the raw data!

# Create sample customer dataset
data = {
    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'age': [25, 45, 35, 52, 28, 61, 33, 40],
    'income': [35000, 72000, 55000, 89000, 42000, 95000, 48000, 67000],
    'education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'Master', 'High School', 'Bachelor'],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi'],
    'subscription_months': [6, 24, 12, 36, 3, 48, 9, 18],
    'churned': [1, 0, 0, 0, 1, 0, 1, 0]
}

df = pd.DataFrame(data)
print(df.head())
What just happened? We created a DataFrame with 8 customers. Each row is one customer, and each column is a different piece of information (feature). This is the typical format for machine learning data!
# Check data types and structure
print(df.dtypes)
# customer_id             int64
# age                     int64
# income                  int64
# education              object
# city                   object
# subscription_months     int64
# churned                 int64
Understanding the output: int64 means numbers (can be used directly by ML models). object means text (needs to be converted to numbers first). Notice that 'education' and 'city' are objects - we'll need to encode these!
Important: Notice that 'education' and 'city' are object types (categorical). Machine learning models cannot use these directly - we need to encode them into numbers.

Practice Questions: Feature Engineering Basics

Test your understanding with these hands-on exercises.

Task: Given the sample dataframe, write code to count how many numerical and categorical columns exist.

Show Solution
# Count numerical and categorical columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = df.select_dtypes(include=['object']).columns

print(f"Numerical columns ({len(numerical_cols)}): {list(numerical_cols)}")
print(f"Categorical columns ({len(categorical_cols)}): {list(categorical_cols)}")
# Numerical columns (5): ['customer_id', 'age', 'income', 'subscription_months', 'churned']
# Categorical columns (2): ['education', 'city']
💡 Explanation: select_dtypes() filters columns by their data type. We found 5 numerical columns (numbers) and 2 categorical columns (text). This helps us know which features need encoding before modeling!

Task: Create a new feature called 'tenure_years' by converting subscription_months to years (divide by 12).

Show Solution
# Create tenure in years
df['tenure_years'] = df['subscription_months'] / 12

print(df[['customer_id', 'subscription_months', 'tenure_years']])
#    customer_id  subscription_months  tenure_years
# 0            1                    6          0.50
# 1            2                   24          2.00
# 2            3                   12          1.00
# ...
💡 Explanation: We created a new feature by dividing months by 12 to get years. Customer 0 has been subscribed for 6 months = 0.5 years. Customer 1 for 24 months = 2 years. This makes it easier to understand customer loyalty at a glance!

Task: Create a feature 'income_per_tenure_month' that divides income by subscription_months.

Show Solution
# Calculate income per month of tenure
df['income_per_tenure_month'] = df['income'] / df['subscription_months']

print(df[['customer_id', 'income', 'subscription_months', 'income_per_tenure_month']].head())
#    customer_id  income  subscription_months  income_per_tenure_month
# 0            1   35000                    6               5833.333333
# 1            2   72000                   24               3000.000000
💡 Explanation: This ratio feature shows income per month of subscription tenure. Customer 1: $35,000 income ÷ 6 months = $5,833 per tenure month. Customer 2: $72,000 ÷ 24 months = $3,000. Even though Customer 2 has the higher income, Customer 1's ratio is larger because their tenure is much shorter - ratio features like this can surface relationships that neither raw feature shows on its own!

Task: Create an 'age_group' feature that categorizes customers as 'Young' (under 30), 'Adult' (30-50), or 'Senior' (over 50).

Show Solution
# Create age groups using np.select or pd.cut
def categorize_age(age):
    if age < 30:
        return 'Young'
    elif age <= 50:
        return 'Adult'
    else:
        return 'Senior'

df['age_group'] = df['age'].apply(categorize_age)

# Alternative using pd.cut (bins chosen so age 30 lands in 'Adult', matching the function above;
# pd.cut's intervals are right-inclusive by default, so bins=[0, 30, ...] would put 30 in 'Young')
# df['age_group'] = pd.cut(df['age'], bins=[0, 29, 50, 100], labels=['Young', 'Adult', 'Senior'])

print(df[['customer_id', 'age', 'age_group']])
#    customer_id  age age_group
# 0            1   25     Young
# 1            2   45     Adult
# 2            3   35     Adult
# 3            4   52    Senior
💡 Explanation: We used a function to categorize ages into groups. Age 25 → "Young", Age 45 → "Adult", Age 52 → "Senior". This is called binning - converting continuous numbers into discrete categories. Useful when age groups are more meaningful than exact ages for your analysis (like marketing campaigns targeting specific age groups).
02

Encoding Categorical Variables

Machine learning algorithms work with numbers, not text. Categorical variables like "color" or "city" must be converted to numerical representations before models can use them effectively. Choosing the right encoding method depends on whether categories have a natural order and how many unique values exist.

Types of Categorical Variables

Think of it like sorting things! Can you arrange these categories in a meaningful order? If YES → Ordinal (has order). If NO → Nominal (no order). This simple question determines your encoding strategy!

Before encoding, identify whether your categorical variable is nominal (no order) or ordinal (has order). This determines which encoding technique to use. Critical decision!

Nominal (No Order)

🤔 Can you rank these? NO! They're just different options with no "better" or "worse".

  • City: Mumbai, Delhi, Bangalore (no city is "higher" than another)
  • Color: Red, Blue, Green (no natural order)
  • Payment: Cash, Card, UPI (all equal methods)
  • Gender: Male, Female, Other (no ranking possible)
Use One-Hot Encoding
Creates separate columns: city_Mumbai, city_Delhi, city_Bangalore
Ordinal (Has Order)

🤔 Can you rank these? YES! They have a clear progression from low to high.

  • Education: High School < Bachelor < Master < PhD (clear progression)
  • Size: Small < Medium < Large (obvious order)
  • Rating: Poor < Fair < Good < Excellent (quality scale)
  • Priority: Low < Medium < High (importance level)
Use Label/Ordinal Encoding
Assigns numbers that respect order: PhD=3 > Master=2 > Bachelor=1
⚠️ Common Mistake: Using Wrong Encoding

DON'T use Label Encoding for Nominal variables! If you encode cities as Mumbai=0, Delhi=1, Bangalore=2, the model treats Bangalore as "twice" Delhi and "greater than" Mumbai - which makes no sense for cities! Always use One-Hot Encoding for categories without order.

Label Encoding (Ordinal Variables)

Analogy: Giving medals! 🥇🥈🥉 Bronze=1, Silver=2, Gold=3. Label encoding assigns numbers to categories, and the numbers should reflect the ranking. Perfect for education levels, ratings, or sizes!

Label encoding converts categories to integers (0, 1, 2, ...). This works well for ordinal variables where the numeric order matches the category order. ⚠️ Warning: Be careful with nominal variables - the model might incorrectly assume "Delhi (1)" is "greater than" "Mumbai (0)".

# WHY? ML algorithms need numbers, not text
# WHAT? LabelEncoder assigns a unique integer to each category
from sklearn.preprocessing import LabelEncoder

# Create label encoder
# WHAT IT DOES: Learns all unique categories and assigns 0, 1, 2...
label_encoder = LabelEncoder()

# Encode education (ordinal - has natural order)
# WARNING: LabelEncoder assigns numbers ALPHABETICALLY, not by order!
df['education_encoded'] = label_encoder.fit_transform(df['education'])

print(df[['education', 'education_encoded']])
#       education  education_encoded
# 0   High School                  1  # ❌ Should be 0 (lowest education)
# 1      Bachelor                  0  # ❌ Should be 1 
# 2        Master                  2  # Correct
# 3           PhD                  3  # Correct
# 4      Bachelor                  0  # ❌ Should be 1

# PROBLEM: The numbers don't match the actual education order!
Problem Detected! LabelEncoder assigned values alphabetically:
  • Bachelor=0 (starts with 'B')
  • High School=1 (starts with 'H')
  • Master=2 (starts with 'M')
  • PhD=3 (starts with 'P')

This does NOT match the true education order! The model will think Bachelor (0) is less educated than High School (1). We need to fix this manually!

Solution: Manual Ordinal Encoding

# THE RIGHT WAY: Define the order explicitly using a dictionary
# WHY? You control the exact mapping to match real-world meaning
# WHAT? Create a dictionary that maps each category to its correct rank

education_order = {
    'High School': 0,  # Lowest education level
    'Bachelor': 1,     # Next level up
    'Master': 2,       # Advanced degree
    'PhD': 3           # Highest education level
}

# Apply the mapping using .map() method
# WHAT IT DOES: Looks up each education value and replaces with its number
df['education_encoded'] = df['education'].map(education_order)

print(df[['education', 'education_encoded']])
#       education  education_encoded
# 0   High School                  0  # Correct! Lowest level
# 1      Bachelor                  1  # Correct! 
# 2        Master                  2  # Correct!
# 3           PhD                  3  # Correct! Highest level

# NOW THE NUMBERS MATCH THE REAL ORDER! 🎉

Pro Tip: Always Use Manual Mapping for Ordinal!

For ordinal variables, always define your own mapping dictionary. Don't trust automatic encoding to get the order right! Think about: "What's the logical progression?" and map accordingly.

One-Hot Encoding (Nominal Variables)

Analogy: Checkbox survey! Imagine a survey asking "Which cities have you visited?" You can check: ☐ Mumbai ☐ Delhi ☐ Bangalore. Each city gets its own YES/NO (1/0) column. That's exactly what One-Hot Encoding does!

One-hot encoding creates a separate binary column for each category. This is the standard approach for nominal variables because it does not imply any ordering between categories. Best practice for cities, colors, payment methods!

❌ BEFORE (Text)

city
Mumbai
Delhi
Mumbai
Bangalore

❌ Model can't use this!

AFTER (Numbers)

Bangalore Delhi Mumbai
    0       0      1
    0       1      0
    0       0      1
    1       0      0

Model can use this!

Method 1: Using Pandas get_dummies()

# EASIEST METHOD: Use pandas get_dummies()
# WHY? Quick and simple for exploration
# WHAT IT DOES: Creates a 1 where category matches, 0 everywhere else

# One-hot encode city using pandas get_dummies
# prefix='city' adds 'city_' before each column name for clarity
city_encoded = pd.get_dummies(df['city'], prefix='city')

print(city_encoded)
#    city_Bangalore  city_Delhi  city_Mumbai
# 0           False       False         True   # ← This row is Mumbai (only Mumbai column = 1)
# 1           False        True        False   # ← This row is Delhi
# 2           False       False         True   # ← Mumbai again
# 3            True       False        False   # ← Bangalore

# INTERPRETATION: Each row has exactly ONE "True" (1), indicating which city it is
# No city is "better" than another - they're just different!
# Add encoded columns to original dataframe
# axis=1 means "add as columns" (not rows)
# WHY? Keep original data + new encoded columns together
df_encoded = pd.concat([df, city_encoded], axis=1)

print(df_encoded[['customer_id', 'city', 'city_Bangalore', 'city_Delhi', 'city_Mumbai']].head())
#    customer_id       city  city_Bangalore  city_Delhi  city_Mumbai
# 0            1     Mumbai           False       False         True
# 1            2      Delhi           False        True        False

# ===== SHORTCUT: Do everything in one line! =====
# This drops the original 'city' column and replaces with encoded columns
df_encoded = pd.get_dummies(df, columns=['city'], prefix=['city'])
print(df_encoded.columns.tolist())  # See all columns including new city_ ones
🎯 The "Dummy Variable Trap": When using linear regression or other linear models, having ALL city columns creates redundancy! If someone is NOT in Mumbai AND NOT in Delhi, they MUST be in Bangalore. We can drop one column without losing information. Use drop_first=True to avoid this!
# Drop first category to avoid multicollinearity (dummy variable trap)
# WHY? If city_Delhi=0 and city_Mumbai=0, we KNOW it must be Bangalore
# This is called the "reference category" or "baseline"
city_encoded = pd.get_dummies(df['city'], prefix='city', drop_first=True)

print(city_encoded)
#    city_Delhi  city_Mumbai
# 0       False         True   # Mumbai (Bangalore is implied when both are 0)
# 1        True        False   # Delhi
# 2       False         True   # Mumbai  
# 3       False        False   # ← Bangalore (the "reference" - neither Delhi nor Mumbai)

# INTERPRETATION: 
# - Bangalore is the baseline (0, 0)
# - Delhi is represented by (1, 0)
# - Mumbai is represented by (0, 1)
# We reduced from 3 columns to 2, but kept all the information!

Scikit-learn OneHotEncoder

For machine learning pipelines, scikit-learn's OneHotEncoder is more robust. It remembers categories from training data and handles new categories gracefully.

from sklearn.preprocessing import OneHotEncoder

# Create encoder
onehot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Fit and transform city column
city_array = df[['city']]  # Must be 2D array
city_encoded = onehot_encoder.fit_transform(city_array)

# Get feature names
feature_names = onehot_encoder.get_feature_names_out(['city'])
print(feature_names)  # ['city_Bangalore' 'city_Delhi' 'city_Mumbai']

# Create DataFrame with encoded values
city_df = pd.DataFrame(city_encoded, columns=feature_names)
print(city_df.head())
Method | Best For | Pros | Cons
LabelEncoder | Ordinal variables | Simple, no extra columns | Implies order (bad for nominal)
pd.get_dummies() | Quick exploration | Easy to use, readable | Does not remember categories
OneHotEncoder | ML pipelines | Handles unknowns, pipeline-ready | More verbose syntax

Practice Questions: Encoding

Test your understanding with these hands-on exercises.

Given:

payments = pd.DataFrame({'method': ['Cash', 'Card', 'UPI', 'Card', 'Cash']})

Task: One-hot encode the payment method column.

Show Solution
payments = pd.DataFrame({'method': ['Cash', 'Card', 'UPI', 'Card', 'Cash']})

# One-hot encode
encoded = pd.get_dummies(payments['method'], prefix='payment')
print(encoded)
#    payment_Card  payment_Cash  payment_UPI
# 0         False          True        False
# 1          True         False        False
# 2         False         False         True
# 3          True         False        False
# 4         False          True        False

Given:

ratings = pd.DataFrame({'satisfaction': ['Poor', 'Good', 'Excellent', 'Fair', 'Good']})

Task: Create ordinal encoding where Poor=0, Fair=1, Good=2, Excellent=3.

Show Solution
ratings = pd.DataFrame({'satisfaction': ['Poor', 'Good', 'Excellent', 'Fair', 'Good']})

# Define order mapping
order_map = {'Poor': 0, 'Fair': 1, 'Good': 2, 'Excellent': 3}

# Apply mapping
ratings['satisfaction_encoded'] = ratings['satisfaction'].map(order_map)
print(ratings)
#   satisfaction  satisfaction_encoded
# 0         Poor                     0
# 1         Good                     2
# 2    Excellent                     3
# 3         Fair                     1
# 4         Good                     2

Task: Use sklearn's OneHotEncoder to encode the 'city' column, then create a DataFrame with proper column names.

Show Solution
from sklearn.preprocessing import OneHotEncoder

# Create and fit encoder
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
city_encoded = encoder.fit_transform(df[['city']])

# Get feature names and create DataFrame
feature_names = encoder.get_feature_names_out(['city'])
city_df = pd.DataFrame(city_encoded, columns=feature_names)

# Combine with original data
result = pd.concat([df.drop('city', axis=1), city_df], axis=1)
print(result.head())
💡 Explanation: This is the production-ready way to one-hot encode! The encoder "remembers" which cities it saw during training, so when you get new data, it can handle unknown cities gracefully (via handle_unknown='ignore'). Perfect for ML pipelines that will be deployed to production!

Given:

products = pd.DataFrame({'category': ['A', 'B', 'A', 'C', 'A', 'B', 'A', 'D', 'A', 'B']})

Task: Encode categories by their frequency (how often they appear).

Show Solution
products = pd.DataFrame({'category': ['A', 'B', 'A', 'C', 'A', 'B', 'A', 'D', 'A', 'B']})

# Calculate frequency of each category
freq_map = products['category'].value_counts(normalize=True).to_dict()

# Apply frequency encoding
products['category_freq'] = products['category'].map(freq_map)
print(products)
#   category  category_freq
# 0        A            0.5
# 1        B            0.3
# 2        A            0.5
# 3        C            0.1
# 4        A            0.5
# ... (rows 5-9 follow the same mapping)
💡 Explanation: Frequency encoding replaces categories with how often they appear. Category 'A' appears 5 times out of 10 (50% = 0.5), 'B' appears 3 times (30% = 0.3), 'C' once (10% = 0.1). This is great for high-cardinality features (many unique categories) because it doesn't create tons of columns like one-hot encoding would!
03

Scaling Numerical Features

When features have vastly different scales (e.g., age in years vs. income in thousands), many algorithms struggle to learn effectively. Feature scaling ensures all variables contribute equally and prevents features with larger values from dominating the model.

Why Scaling Matters

Analogy: Comparing apples and... skyscrapers! 🍎🏢 Imagine trying to compare the size of an apple (measured in centimeters) with the height of a building (measured in meters). If you don't convert to the same scale, the building will always seem "more important" just because the numbers are bigger! That's exactly what happens with Age (20-70) vs Income (20,000-200,000).

Consider a customer dataset with age (20-70) and income (20000-200000). Without scaling, a machine learning algorithm might think income is "more important" simply because its values are 1000x larger! Distance-based algorithms like KNN and gradient-based optimizers are especially sensitive to scale.

❌ WITHOUT Scaling
Customer 1: age=25, income=35000
Customer 2: age=45, income=72000

Distance = √[(45-25)² + (72000-35000)²]
         = √[400 + 1,369,000,000]
         = ~37,000
                    

Problem: Income difference dominates! Age barely matters.

WITH Scaling
Customer 1: age=0.0, income=0.0
Customer 2: age=0.5, income=0.6

Distance = √[(0.5-0.0)² + (0.6-0.0)²]
         = √[0.25 + 0.36]
         = ~0.78
                    

Solution: Both features contribute equally!
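You can verify the arithmetic above with a few lines of numpy. Note the scaled distance comes out near 0.83 when you use the dataset's true min/max values - the 0.5/0.6 figures above are rounded:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# [age, income] for the two customers, plus the dataset's max row
# so the scaler sees the full range (25-61 for age, 35000-95000 for income)
X = np.array([
    [25.0, 35000.0],   # Customer 1
    [45.0, 72000.0],   # Customer 2
    [61.0, 95000.0],   # row holding the dataset's max values
])

# Unscaled: the income gap completely dominates the distance
d_raw = np.linalg.norm(X[0] - X[1])
print(round(d_raw))        # 37000

# Scaled to 0-1: both features now contribute
X_scaled = MinMaxScaler().fit_transform(X)
d_scaled = np.linalg.norm(X_scaled[0] - X_scaled[1])
print(round(d_scaled, 2))  # 0.83
```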

Algorithms that NEED scaling:
  • Distance-based: K-Nearest Neighbors, K-Means Clustering, SVM
  • Gradient-based: Neural Networks, Logistic Regression, Linear Regression
  • Component-based: Principal Component Analysis (PCA)

DON'T need scaling: Tree-based models (Random Forest, XGBoost, Decision Trees) - they split on thresholds, not distances!
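A quick sanity check of that claim: a decision tree makes identical predictions whether or not you scale the inputs (tiny made-up dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset: [age, income] → churned
X = np.array([[25, 35000], [45, 72000], [35, 55000],
              [52, 89000], [28, 42000], [61, 95000]], dtype=float)
y = np.array([1, 0, 0, 0, 1, 0])

# Fit the same tree on raw features and on standardized features
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(
    StandardScaler().fit_transform(X), y
)

# Trees split on per-feature thresholds, and scaling just shifts those
# thresholds - the resulting splits (and predictions) are identical
print(tree_raw.predict(X))
print(tree_scaled.predict(StandardScaler().fit(X).transform(X)))
# Both lines print the same labels
```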

# WHY? Let's see how different our feature scales are
# WHAT IT SHOWS: The huge difference in magnitude between age and income

print("Age range:", df['age'].min(), "-", df['age'].max())  
print("Income range:", df['income'].min(), "-", df['income'].max())
# Age range: 25 - 61      # ← Spans about 36 units
# Income range: 35000 - 95000  # ← Spans 60,000 units! ~1600x bigger!

StandardScaler (Z-score Normalization)

Analogy: Class exam scores! 📊 Imagine you scored 75 on a test. Is that good? It depends! If the class average was 50, you're above average. If it was 90, you're below. StandardScaler converts your score to "how many standard deviations from average" - a universal measure!

StandardScaler transforms data to have zero mean (μ=0) and unit variance (σ²=1). Each value becomes a z-score: how many standard deviations it is from the mean. Most commonly used scaler!

Formula

StandardScaler

z = (x - mean) / std

What it means:

  • z = 0: Exactly at average
  • z = 1: One std above average
  • z = -1: One std below average
  • z = 2: Two std above (rare!)

Best for:

  • Normally distributed data
  • Algorithms assuming zero-centered data
  • SVM, Logistic Regression, Neural Networks
  • PCA and gradient descent
# WHY? Most ML algorithms work better with zero-centered data
# WHAT? Transform data so mean=0 and std=1
from sklearn.preprocessing import StandardScaler

# Create scaler object
# WHAT IT DOES: Will learn the mean and std from training data
scaler = StandardScaler()

# Select numerical features to scale
# WHY THESE? Only numerical features need scaling, not categorical/encoded ones
numerical_features = ['age', 'income', 'subscription_months']

# Fit and transform in one step
# fit() = Learn the mean and std from this data
# transform() = Apply the formula: (x - mean) / std
# fit_transform() = Do both at once!
df_scaled = df.copy()  # Make a copy to keep original
df_scaled[numerical_features] = scaler.fit_transform(df[numerical_features])

print(df_scaled[numerical_features].head())
#        age   income  subscription_months
# 0 -1.2966  -1.3702              -0.9186  # ← Age is ~1.3 std BELOW average
# 1  0.4467   0.4485               0.3062  # ← Age is ~0.45 std ABOVE average
# 2 -0.4249  -0.3871              -0.5103  # ← Everything a bit below average

# INTERPRETATION:
# - Negative values = below average
# - Positive values = above average
# - Values near 0 = close to average
# - Most values between -3 and +3 (99.7% rule for normal distribution)
# Verify that scaling worked correctly
# EXPECTED: Mean should be ~0, Standard deviation should be ~1

print("Mean after scaling:", df_scaled[numerical_features].mean().values)
print("Std after scaling:", df_scaled[numerical_features].std().values)
# Mean after scaling: [~0, ~0, ~0]  # ← All close to zero!
# Std after scaling: [~1.07, ~1.07, ~1.07]  # ← Close to one! (pandas .std() uses ddof=1,
#                                           #   while sklearn scaled with ddof=0, so not exactly 1)

# WHY CHECK? To confirm StandardScaler worked as expected
# WHAT IF NOT? Would indicate a problem with your code

Pro Tip: When to Use StandardScaler

Use StandardScaler when your data is roughly normally distributed (bell curve shape). It's the default choice for most ML tasks. If you have severe outliers, consider RobustScaler instead!

MinMaxScaler (Normalization)

Analogy: Adjusting photo brightness! 📸 Imagine you have photos with brightness values from 50 to 200. MinMaxScaler "compresses" them all to a 0-100 scale. The darkest photo (50) becomes 0, the brightest (200) becomes 100, and everything else fits proportionally in between. Perfect 0-1 range!

MinMaxScaler transforms data to a fixed range, typically 0 to 1. This is useful when you need bounded values or when the data is not normally distributed. Perfect for neural networks!

Formula

MinMaxScaler

x_scaled = (x - min) / (max - min)

What it means:

  • Smallest value → 0
  • Largest value → 1
  • Everything else → Between 0 and 1
  • Preserves the shape of distribution

Best for:

  • Neural networks (sigmoid/tanh activation)
  • Image pixel values (already 0-255)
  • Bounded domains (percentages, probabilities)
  • When you need exact 0-1 range
# WHY? Some algorithms (like neural networks) work best with 0-1 inputs
# WHAT? Squish all values to exactly 0-1 range
from sklearn.preprocessing import MinMaxScaler

# Create scaler
# WHAT IT DOES: Will find min and max from training data
minmax_scaler = MinMaxScaler()  # Default range is (0, 1)

# Fit and transform
# Formula: (x - min) / (max - min)
df_minmax = df.copy()
df_minmax[numerical_features] = minmax_scaler.fit_transform(df[numerical_features])

print(df_minmax[numerical_features].head())
#         age    income  subscription_months
# 0  0.000000  0.000000             0.066667  # ← Age 25 is the MINIMUM, so it becomes 0
# 1  0.555556  0.616667             0.466667  # ← Age 45 is a bit past halfway from min to max
# 2  0.277778  0.333333             0.200000  # ← Age 35 is about 28% of the way from min to max

# INTERPRETATION:
# - 0.0 = Minimum value in original data
# - 1.0 = Maximum value in original data
# - 0.5 = Exactly in the middle
# - All values GUARANTEED to be between 0 and 1!
# Verify range - should be exactly 0 and 1
# WHY CHECK? To confirm MinMaxScaler worked correctly

print("Min after scaling:", df_minmax[numerical_features].min().values)
print("Max after scaling:", df_minmax[numerical_features].max().values)
# Min after scaling: [0.0, 0.0, 0.0]  # ← All features start at exactly 0!
# Max after scaling: [1.0, 1.0, 1.0]  # ← All features end at exactly 1!

# PERFECT! All features now on the same 0-1 scale
⚠️ Warning: Very Sensitive to Outliers! If you have ONE extreme value (like income=1,000,000 in a dataset where most incomes are 30,000-80,000), MinMaxScaler will squash everything else near 0! The outlier becomes 1.0 and everything else gets compressed. Use RobustScaler if you have outliers.
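To see this warning in action, here's a tiny sketch with made-up incomes (the numbers are illustrative, not from the lesson's dataset):

```python
# One outlier (1,000,000) compresses every other value toward 0
import numpy as np
from sklearn.preprocessing import MinMaxScaler

incomes = np.array([[30000], [45000], [60000], [80000], [1_000_000]])

scaled = MinMaxScaler().fit_transform(incomes)
print(scaled.ravel().round(3))
# The four "normal" incomes all land below ~0.06; only the outlier reaches 1.0
```

The useful variation among the normal incomes is squashed into a tiny sliver of the 0-1 range, which is exactly why RobustScaler is the better pick here.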

RobustScaler (Outlier-Resistant)

Analogy: Ignoring the extremists! 🛡️ Imagine calculating "average happiness" in a room. If 99 people rate 7/10 but one confused person writes 1000/10, the average gets ruined! RobustScaler uses the median (the middle person) instead, so extreme outliers are simply ignored. Smart!

RobustScaler uses the median and interquartile range (IQR) instead of mean and standard deviation. This makes it resistant to outliers that would skew StandardScaler or completely ruin MinMaxScaler. Best choice when you have outliers!

With Outlier
Data: [10, 12, 11, 13, 12, 500, 11]

StandardScaler:

Mean = 81.3 (ruined by 500!)
Std = 170.9 (huge!)
Scaled: [-0.42, -0.41, -0.41, -0.40, -0.41, 2.45, -0.41]
# Most values bunched near -0.4, outlier at 2.45
RobustScaler Solution
Data: [10, 12, 11, 13, 12, 500, 11]

RobustScaler:

Median = 12 (unaffected!)
IQR = 1.5 (ignores extremes)
Scaled: [-1.33, 0, -0.67, 0.67, 0, 325.33, -0.67]
# Outlier still an outlier, but doesn't ruin the others!
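You can verify these statistics yourself with plain numpy — a quick sketch applying both formulas to the same toy data:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 500, 11])

# StandardScaler formula: (x - mean) / std  (population std, as sklearn uses)
standard = (data - data.mean()) / data.std()

# RobustScaler formula: (x - median) / IQR
q1, q3 = np.percentile(data, [25, 75])
robust = (data - np.median(data)) / (q3 - q1)

print(standard.round(2))  # outlier near 2.45, everything else bunched near -0.4
print(robust.round(2))    # normal values spread nicely from -1.33 to 0.67
```

Notice how the outlier inflates the mean and std for StandardScaler, while the median (12) and IQR (1.5) for RobustScaler are completely untouched by it.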
# WHY? When you have extreme values that shouldn't influence normal data
# WHAT? Uses median (50th percentile) and IQR (25th to 75th percentile range)
from sklearn.preprocessing import RobustScaler

# Create robust scaler
# WHAT IT DOES: Uses statistics that ignore extreme values
robust_scaler = RobustScaler()

# Fit and transform
# Formula: (x - median) / IQR
# WHERE: IQR = Q3 (75th percentile) - Q1 (25th percentile)
df_robust = df.copy()
df_robust[numerical_features] = robust_scaler.fit_transform(df[numerical_features])

print(df_robust[numerical_features].head())
#         age    income  subscription_months
# 0 -0.666667 -0.800000            -0.444444  # ← Scaled relative to MEDIAN, not mean
# 1  0.333333  0.400000             0.555556  
# 2 -0.166667 -0.133333             0.000000  # ← Close to median!

# INTERPRETATION:
# - Values centered around MEDIAN (not mean)
# - Scaled by IQR (middle 50% of data)
# - Outliers don't distort the scaling!

Pro Tip: Detecting If You Need RobustScaler

Check your data: If the mean and median differ substantially, or if df.describe() shows extreme values (like a max far above the 75th percentile), you have outliers! Use RobustScaler instead of StandardScaler.
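One way to turn this check into code (the threshold below is a rule of thumb, not an official test):

```python
import pandas as pd

# Made-up incomes with one extreme value
incomes = pd.Series([32000, 41000, 38000, 45000, 52000, 47000, 1_000_000])

mean, median = incomes.mean(), incomes.median()
iqr = incomes.quantile(0.75) - incomes.quantile(0.25)

# If the mean is pulled far outside the middle 50% of the data, suspect outliers
needs_robust = abs(mean - median) > iqr
print(f"mean={mean:,.0f}  median={median:,.0f}  IQR={iqr:,.0f}")
print("Use RobustScaler:", needs_robust)
```

Here the single outlier drags the mean to roughly 179,000 while the median stays at 45,000 — a glaring gap that signals RobustScaler.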

Comprehensive Scaler Comparison

| Scaler | Formula | Output Range | Best For | Outlier Sensitive? |
|---|---|---|---|---|
| StandardScaler (most common) | (x - mean) / std | Unbounded, usually -3 to +3 | Normal distributions; SVM, Logistic Regression; PCA, neural networks; gradient descent algorithms | Yes - outliers shift mean/std |
| MinMaxScaler (fixed range) | (x - min) / (max - min) | 0 to 1, exact bounds | Neural networks; image data (pixels); bounded features; when an exact 0-1 range is needed | Extremely - one outlier ruins everything |
| RobustScaler (outlier-proof) | (x - median) / IQR | Unbounded, similar to Standard | Data with outliers; financial data; real-world messy data; skewed distributions | No - uses robust statistics |

🎯 Decision Guide: Which Scaler Should I Use?

Choose StandardScaler if:
  • Data is roughly normally distributed
  • No significant outliers
  • Using SVM, PCA, or Logistic Regression
  • Default choice!
Choose MinMaxScaler if:
  • Need exact 0-1 range
  • Using Neural Networks (sigmoid/tanh)
  • Image/pixel data
  • NO outliers present
Choose RobustScaler if:
  • Data has outliers
  • Skewed distribution
  • Financial or real-world messy data
  • StandardScaler gives weird results

Avoiding Data Leakage

Analogy: Cheating on the exam! 📝 Imagine studying for a test by looking at the actual test questions beforehand. You'd score 100%, but it's not a real measure of your knowledge! Data leakage is when your model "sees" test data during training. It gives you fake good results that completely fail in real-world use. This is the #1 mistake beginners make!

A critical mistake is fitting scalers on the entire dataset before splitting into train/test. This causes data leakage because test data statistics (mean, std, min, max) influence the training process. Your model appears to work great, but fails badly on real new data!

❌ WRONG: Data Leakage
# ❌ BAD: Scale BEFORE splitting
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Uses ALL data!

# Then split
X_train, X_test = train_test_split(X_scaled)

# PROBLEM: Test data mean/std influenced training!
# Model has "seen" test data statistics
# Results are FAKE GOOD!
Why it's wrong: The scaler calculated mean and std using BOTH train AND test data. When you scale training data, you're using information from the test set. The model is cheating!
CORRECT: No Leakage
# GOOD: Split FIRST, then scale
X_train, X_test = train_test_split(X)

# Fit scaler on TRAINING data ONLY
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Transform test using TRAINING statistics
X_test_scaled = scaler.transform(X_test)  # No fit!

# PERFECT: Test data is truly unseen!
Why it's right: The scaler only knows about training data. Test data is transformed using training stats, simulating real-world scenario where you only have training data!
# CORRECT WORKFLOW: Preventing Data Leakage
# WHY? Must simulate real-world: you only have training data when building model!
from sklearn.model_selection import train_test_split

# Step 1: Define features and target
# WHAT? X = input features, y = what we want to predict
X = df[numerical_features]  # Our input data
y = df['churned']           # What we're trying to predict

# Step 2: Split into train and test FIRST (before any scaling!)
# WHY? This is the REAL separation - test set is "future unseen data"
# test_size=0.2 means 20% for testing, 80% for training
# random_state=42 makes the split reproducible (same split every time)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,    # 20% test, 80% train
    random_state=42   # For reproducibility
)

print(f"Training samples: {len(X_train)}")  # e.g., 6 samples
print(f"Test samples: {len(X_test)}")       # e.g., 2 samples

# Step 3: Create scaler (doesn't know anything yet)
scaler = StandardScaler()

# Step 4: Fit AND transform on training data
# fit() = Learn the mean and std from TRAINING data ONLY
# transform() = Apply the scaling using those learned statistics
# fit_transform() = Do both at once
X_train_scaled = scaler.fit_transform(X_train)  # Learn from train, apply to train
print("Scaler learned from training data only!")

# Step 5: Transform test data using TRAINING statistics (no fit!)
# WHY NO FIT? Because we pretend test data doesn't exist yet!
# We use the SAME mean/std we learned from training data
X_test_scaled = scaler.transform(X_test)  # Apply training stats to test
print("Test data scaled using training statistics!")

# RESULT: Test data is TRULY unseen!
# The scaler never learned anything from test data
# This simulates real-world: you'll get new data that wasn't in training
🎯 Golden Rule of Scaling:
  1. Split FIRST: Separate train and test before ANY preprocessing
  2. fit_transform() on train: Learn statistics and apply them
  3. transform() on test: Apply training statistics (no learning!)
  4. NEVER fit() on test data - pretend it doesn't exist during training!
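Beyond manual splitting, scikit-learn Pipelines bake the golden rule in: the scaler is fitted only on whatever data the pipeline itself is fitted on, so cross-validation can't leak by accident. A minimal sketch (the toy X and y below are made up for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data: [age, income] features and a binary target
X = np.array([[25, 35000], [45, 72000], [35, 55000], [52, 90000],
              [29, 40000], [61, 110000], [33, 48000], [47, 81000]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Scaler + model chained: fit() scales and trains in one leak-proof step
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)
print(pipe.predict(X[:2]))
```

Passing `pipe` to `cross_val_score` would refit the scaler on each training fold separately, which is exactly the split-first discipline described above, automated.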

Pro Tip: Why This Matters So Much

In real-world ML, you train on historical data, then deploy to predict NEW data you've never seen. If you leak test data into training, your validation metrics will look amazing (95% accuracy!), but production performance will be terrible (60% accuracy). Always simulate the real scenario: test data is future data you don't have yet!

Practice Questions: Scaling

Test your understanding with these hands-on exercises.

Given:

scores = np.array([[85], [92], [78], [95], [88]])

Task: Scale these scores to a 0-1 range using MinMaxScaler.

Show Solution
from sklearn.preprocessing import MinMaxScaler
import numpy as np

scores = np.array([[85], [92], [78], [95], [88]])

scaler = MinMaxScaler()
scores_scaled = scaler.fit_transform(scores)

print(scores_scaled)
# [[0.41176471]
#  [0.82352941]
#  [0.        ]
#  [1.        ]
#  [0.58823529]]

Given:

salaries = pd.DataFrame({'salary': [50000, 75000, 120000, 45000, 200000]})

Task: Standardize the salary column and verify the mean is approximately 0.

Show Solution
from sklearn.preprocessing import StandardScaler
import pandas as pd

salaries = pd.DataFrame({'salary': [50000, 75000, 120000, 45000, 200000]})

scaler = StandardScaler()
salaries['salary_scaled'] = scaler.fit_transform(salaries[['salary']])

print(salaries)
print(f"\nMean: {salaries['salary_scaled'].mean():.10f}")  # Very close to 0
print(f"Std: {salaries['salary_scaled'].std():.2f}")  # ~1.12 (pandas .std() uses ddof=1; sklearn standardizes to population std = 1)

Task: Split the customer dataframe into train/test (80/20), then properly scale numerical features avoiding data leakage.

Show Solution
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features and target
X = df[['age', 'income', 'subscription_months']]
y = df['churned']

# Split first (before any scaling!)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and fit scaler on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Transform test data using training statistics
X_test_scaled = scaler.transform(X_test)

print("Training shape:", X_train_scaled.shape)
print("Test shape:", X_test_scaled.shape)
print("No data leakage!")
💡 Explanation: This is the CORRECT way! We split first (80% train, 20% test), then fitted the scaler ONLY on training data. Test data was transformed using training statistics. The test set remains truly "unseen" - just like real-world new data would be. Always remember: Split → Fit on train → Transform both!

Given:

data = pd.DataFrame({'value': [10, 12, 11, 13, 12, 100, 11, 14]})

Task: Scale this data using RobustScaler (note the outlier value of 100).

Show Solution
from sklearn.preprocessing import RobustScaler
import pandas as pd

data = pd.DataFrame({'value': [10, 12, 11, 13, 12, 100, 11, 14]})

scaler = RobustScaler()
data['value_scaled'] = scaler.fit_transform(data[['value']])

print(data)
# The outlier (100) is scaled but doesn't distort other values
# because RobustScaler uses median and IQR
💡 Explanation: Notice the outlier (100) didn't ruin the scaling of other values! RobustScaler used the median (middle value = 12) instead of mean, and scaled by IQR (range of middle 50% of data) instead of standard deviation. The values 10-14 are scaled nicely, while the outlier 100 is just... an outlier. That's exactly what we want!
04

Creating New Features

Analogy: Being a detective! 🕵️ Raw data is like scattered clues. Feature creation is combining those clues to form insights! Example: You have "birth_year" but the model needs "age". You have "total_spent" and "num_orders" but what's really valuable is "average_order_value" (spent ÷ orders). This is where creativity meets data science!

Sometimes the best features are not in your original dataset - you create them. Interaction features, polynomial terms, and domain-specific transformations can unlock hidden patterns that raw features alone cannot express. Often improves accuracy by 10-20%!

Mathematical Transformations

Basic mathematical operations can create powerful new features. Ratios, differences, and aggregations often capture relationships that models would otherwise struggle to learn. Think: "What would make sense to a human?"

Common Feature Creation Patterns
Ratios

total_spent / num_orders = avg_order_value
clicks / impressions = click_rate

Differences

current_year - birth_year = age
end_date - start_date = duration

Aggregates

sum(purchases) = total_purchases
mean(ratings) = avg_rating

# WHY? Let's create a realistic e-commerce scenario
# WHAT? Customer behavior data for an online store
# GOAL: Predict who will become VIP customers

orders = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'total_orders': [12, 5, 28, 8, 15],         # How many times they ordered
    'total_spent': [45000, 12000, 125000, 32000, 58000],  # Total money spent
    'account_age_days': [365, 120, 730, 200, 450]  # How long they've been a customer
})

print(orders)
#    customer_id  total_orders  total_spent  account_age_days
# 0            1            12        45000               365
# 1            2             5        12000               120
# 2            3            28       125000               730  # ← This customer looks very valuable!
# 3            4             8        32000               200
# 4            5            15        58000               450
# ===== CREATE RATIO FEATURES =====
# WHY? Raw numbers don't tell the full story!
# Customer 3 spent 125k, but over 730 days. Customer 1 spent 45k in 365 days.
# Who's more valuable per day? We need to CALCULATE that!

# Feature 1: Average Order Value
# WHAT IT MEANS: How much does this customer spend per order?
# WHY VALUABLE? High AOV = big spender = VIP customer
orders['avg_order_value'] = orders['total_spent'] / orders['total_orders']

# Feature 2: Orders Per Month  
# WHAT IT MEANS: How frequently does this customer order?
# WHY VALUABLE? High frequency = engaged customer = less likely to leave
orders['orders_per_month'] = orders['total_orders'] / (orders['account_age_days'] / 30)

# Feature 3: Spend Per Day
# WHAT IT MEANS: Average daily spending rate
# WHY VALUABLE? Normalizes spending by account age - fairer comparison
orders['spend_per_day'] = orders['total_spent'] / orders['account_age_days']

print(orders[['customer_id', 'avg_order_value', 'orders_per_month', 'spend_per_day']])
#    customer_id  avg_order_value  orders_per_month  spend_per_day
# 0            1      3750.000000          0.986301     123.287671  # ← Good spender!
# 1            2      2400.000000          1.250000     100.000000  # ← Frequent but small orders
# 2            3      4464.285714          1.150685     171.232877  # ← BEST CUSTOMER! High value + frequent
# 3            4      4000.000000          1.200000     160.000000  # ← Also great!
# 4            5      3866.666667          1.000000     128.888889

# INTERPRETATION:
# Customer 3: High AOV ($4,464), frequent orders (1.15/month), high daily spend ($171)
# This customer is MUCH more valuable than Customer 2 (frequent but cheap orders)
# These engineered features reveal insights the raw numbers hid!
Domain knowledge is key: "Average order value" and "orders per month" are meaningful business metrics. Creating features that make sense in your domain often works better than random combinations.

Date and Time Features

Dates contain rich information that models cannot use directly. Extract components like year, month, day of week, and calculate durations to unlock temporal patterns.

# Sample data with dates and times
transactions = pd.DataFrame({
    'transaction_id': [1, 2, 3, 4, 5],
    'date': pd.to_datetime(['2024-01-15 09:30:00', '2024-03-22 14:45:00', '2024-07-04 20:15:00', '2024-11-28 11:00:00', '2024-12-25 16:30:00']),
    'amount': [1500, 2300, 890, 4500, 3200]
})

# Extract date components
transactions['year'] = transactions['date'].dt.year
transactions['month'] = transactions['date'].dt.month
transactions['day'] = transactions['date'].dt.day
transactions['hour'] = transactions['date'].dt.hour  # Hour of day (0-23)
transactions['day_of_week'] = transactions['date'].dt.dayofweek  # 0=Monday
transactions['is_weekend'] = transactions['day_of_week'].isin([5, 6]).astype(int)

print(transactions[['date', 'month', 'hour', 'day_of_week', 'is_weekend']])
# Create time-of-day categories from hour
def get_time_period(hour):
    if hour < 6:
        return 'Night'
    elif hour < 12:
        return 'Morning'
    elif hour < 18:
        return 'Afternoon'
    else:
        return 'Evening'

transactions['time_period'] = transactions['hour'].apply(get_time_period)

print(transactions[['date', 'hour', 'time_period']])
#                  date  hour time_period
# 0 2024-01-15 09:30:00     9     Morning
# 1 2024-03-22 14:45:00    14   Afternoon
# 2 2024-07-04 20:15:00    20     Evening
# Calculate days since a reference date
reference_date = pd.to_datetime('2024-01-01')
transactions['days_since_year_start'] = (transactions['date'] - reference_date).dt.days

# Quarter and season
transactions['quarter'] = transactions['date'].dt.quarter

print(transactions[['date', 'quarter', 'days_since_year_start']])

Interaction Features

Interaction features capture the combined effect of two or more variables. For example, the effect of "experience" on salary might be different for different "education levels".

# Create interaction features manually
df['age_income_interaction'] = df['age'] * df['income']
df['income_per_age'] = df['income'] / df['age']

print(df[['age', 'income', 'age_income_interaction', 'income_per_age']].head())
#    age  income  age_income_interaction  income_per_age
# 0   25   35000                  875000     1400.000000
# 1   45   72000                 3240000     1600.000000

Polynomial Features

Polynomial features create higher-order terms and interactions automatically. This is useful for capturing non-linear relationships in linear models.

from sklearn.preprocessing import PolynomialFeatures

# Sample data
X = df[['age', 'income']].head(3)
print("Original features:")
print(X)

# Create polynomial features (degree=2)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Get feature names
feature_names = poly.get_feature_names_out(['age', 'income'])
print("\nPolynomial feature names:")
print(feature_names)
# ['age', 'income', 'age^2', 'age income', 'income^2']
# Create DataFrame with polynomial features
X_poly_df = pd.DataFrame(X_poly, columns=feature_names)
print(X_poly_df)
#     age   income    age^2   age income       income^2
# 0  25.0  35000.0    625.0     875000.0   1225000000.0
# 1  45.0  72000.0   2025.0    3240000.0   5184000000.0
# 2  35.0  55000.0   1225.0    1925000.0   3025000000.0
Caution: Polynomial features grow combinatorially! With 10 features and degree=3, you get 286 features (including the bias term). Use interaction_only=True to limit to just interactions without powers.
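You can estimate the feature count before building anything, using standard combinatorial formulas (a sketch; counts include the bias term, matching include_bias=True):

```python
from math import comb

def n_poly_features(n, degree, interaction_only=False):
    """Number of columns PolynomialFeatures would produce for n input features."""
    if interaction_only:
        # bias + all products of k DISTINCT features, for k = 1..degree
        return sum(comb(n, k) for k in range(degree + 1))
    # all monomials of total degree <= degree ("stars and bars" count)
    return comb(n + degree, degree)

print(n_poly_features(10, 3))                          # 286 - explodes fast!
print(n_poly_features(10, 3, interaction_only=True))   # 176 - noticeably tamer
print(n_poly_features(2, 2))                           # 6: 1, x1, x2, x1^2, x1*x2, x2^2
```

Running this check first helps you decide whether a given degree is feasible before your feature matrix balloons in memory.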

Text-Based Feature Extraction

Text fields often contain valuable information. Extract features like length, word count, or presence of specific keywords.

# Sample product reviews
reviews = pd.DataFrame({
    'review_id': [1, 2, 3],
    'text': [
        'Great product! Works perfectly.',
        'Terrible quality. Broke after one week. DO NOT BUY!',
        'Good value for money'
    ]
})

# Extract text features
reviews['char_count'] = reviews['text'].str.len()
reviews['word_count'] = reviews['text'].str.split().str.len()
reviews['avg_word_length'] = reviews['char_count'] / reviews['word_count']
reviews['has_exclamation'] = reviews['text'].str.contains('!').astype(int)
reviews['is_uppercase_heavy'] = (reviews['text'].str.count(r'[A-Z]') > 5).astype(int)

print(reviews[['text', 'word_count', 'has_exclamation', 'is_uppercase_heavy']])

Practice Questions: Creating Features

Test your understanding with these hands-on exercises.

Given:

customers = pd.DataFrame({
    'revenue': [5000, 12000, 8000, 3000],
    'acquisition_cost': [500, 800, 600, 400]
})

Task: Create a 'roi' feature as revenue divided by acquisition_cost.

Show Solution
customers = pd.DataFrame({
    'revenue': [5000, 12000, 8000, 3000],
    'acquisition_cost': [500, 800, 600, 400]
})

customers['roi'] = customers['revenue'] / customers['acquisition_cost']
print(customers)
#    revenue  acquisition_cost   roi
# 0     5000               500  10.0
# 1    12000               800  15.0
# 2     8000               600  13.33
# 3     3000               400   7.5

Given:

orders = pd.DataFrame({
    'order_date': pd.to_datetime(['2024-06-15', '2024-12-25', '2024-03-08'])
})

Task: Extract month, is_weekend, and is_holiday (Dec 25) features.

Show Solution
orders = pd.DataFrame({
    'order_date': pd.to_datetime(['2024-06-15', '2024-12-25', '2024-03-08'])
})

orders['month'] = orders['order_date'].dt.month
orders['day_of_week'] = orders['order_date'].dt.dayofweek
orders['is_weekend'] = orders['day_of_week'].isin([5, 6]).astype(int)
orders['is_holiday'] = ((orders['order_date'].dt.month == 12) & 
                        (orders['order_date'].dt.day == 25)).astype(int)

print(orders)
#   order_date  month  day_of_week  is_weekend  is_holiday
# 0 2024-06-15      6            5           1           0
# 1 2024-12-25     12            2           0           1
# 2 2024-03-08      3            4           0           0

Given:

X = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6]})

Task: Create degree-2 polynomial features and display the resulting feature names.

Show Solution
from sklearn.preprocessing import PolynomialFeatures
import pandas as pd

X = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6]})

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

feature_names = poly.get_feature_names_out(['x1', 'x2'])
X_poly_df = pd.DataFrame(X_poly, columns=feature_names)

print("Feature names:", feature_names)
print(X_poly_df)
# Feature names: ['x1' 'x2' 'x1^2' 'x1 x2' 'x2^2']

Given:

products = pd.DataFrame({
    'description': ['Premium quality laptop', 'Budget phone with great battery life', 'Watch']
})

Task: Create word_count and avg_word_length features.

Show Solution
products = pd.DataFrame({
    'description': ['Premium quality laptop', 'Budget phone with great battery life', 'Watch']
})

products['word_count'] = products['description'].str.split().str.len()
products['char_count'] = products['description'].str.len()
products['avg_word_length'] = products['char_count'] / products['word_count']

print(products)
#                             description  word_count  char_count  avg_word_length
# 0                Premium quality laptop           3          22         7.333333
# 1  Budget phone with great battery life           6          36         6.000000
# 2                                 Watch           1           5         5.000000
05

Binning and Discretization

Converting continuous variables into discrete bins can reduce noise, handle outliers, and capture non-linear relationships that simple linear models would miss. Binning transforms numerical data into categorical groups based on value ranges.

When to Use Binning

Binning is particularly useful when the exact numerical value matters less than which range or category it falls into. Think about age groups for marketing, income brackets for loans, or temperature ranges for weather classification.

Good Use Cases
  • Age groups for marketing segments
  • Income brackets for credit scoring
  • Time of day for traffic analysis
  • Reducing impact of outliers
  • When business rules use categories
Avoid When
  • Exact values are important
  • Linear relationship with target
  • Using tree-based models (they bin naturally)
  • Small datasets (loses information)
  • When continuous is more predictive

Equal-Width Binning with pd.cut()

Equal-width binning divides the range into bins of equal size. Use this when you want consistent intervals regardless of how data is distributed.

# Sample age data
ages = pd.DataFrame({'age': [22, 35, 45, 19, 67, 52, 28, 41, 73, 31]})

# Create 4 equal-width bins
ages['age_bin'] = pd.cut(ages['age'], bins=4)

print(ages)
#    age          age_bin
# 0   22   (18.946, 32.5]
# 1   35    (32.5, 46.0]
# 2   45    (32.5, 46.0]
# 3   19   (18.946, 32.5]
# 4   67    (59.5, 73.0]
# Custom bin edges with labels
age_bins = [0, 18, 30, 50, 65, 100]
age_labels = ['Minor', 'Young Adult', 'Adult', 'Middle Age', 'Senior']

ages['age_group'] = pd.cut(ages['age'], bins=age_bins, labels=age_labels)

print(ages[['age', 'age_group']])
#    age    age_group
# 0   22  Young Adult
# 1   35        Adult
# 2   45        Adult
# 3   19  Young Adult
# 4   67       Senior
# 5   52   Middle Age
Tip: Use meaningful business-defined bins when possible. Age groups like "18-25", "26-35" often align with marketing segments and are more interpretable than arbitrary ranges.

Equal-Frequency Binning with pd.qcut()

Equal-frequency (quantile) binning puts approximately the same number of observations in each bin. This is useful when you want balanced groups regardless of value distribution.

# Create 4 equal-frequency bins (quartiles)
ages['age_quartile'] = pd.qcut(ages['age'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

print(ages[['age', 'age_quartile']])
#    age age_quartile
# 0   22           Q1
# 1   35           Q2
# 2   45           Q3
# 3   19           Q1
# 4   67           Q4

# Check bin distribution
print(ages['age_quartile'].value_counts())
# Each quartile has approximately equal count
# Quantile binning with custom quantiles
income = pd.DataFrame({'income': [25000, 45000, 72000, 38000, 150000, 55000, 42000, 89000]})

# Create percentile-based bins
income['income_tier'] = pd.qcut(
    income['income'], 
    q=[0, 0.25, 0.5, 0.75, 1.0],
    labels=['Low', 'Medium', 'High', 'Premium']
)

print(income)

Comparing cut() vs qcut()

| Feature | pd.cut() (Equal-Width) | pd.qcut() (Equal-Frequency) |
|---|---|---|
| Bin sizes | Same range width | Same number of items |
| Best for | Uniformly distributed data | Skewed distributions |
| Custom edges | Yes, via bins parameter | No (uses quantiles) |
| Outlier handling | May create sparse bins | Distributes evenly |
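The "outlier handling" difference is easy to see with a toy skewed series (values assumed for illustration):

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])  # one extreme value

print(pd.cut(values, bins=4).value_counts().sort_index().tolist())
# [7, 0, 0, 1]  <- equal-width: the outlier stretches the range, middle bins empty
print(pd.qcut(values, q=4).value_counts().sort_index().tolist())
# [2, 2, 2, 2]  <- equal-frequency: balanced bins regardless of the outlier
```

With cut(), one extreme value leaves you with sparse, nearly useless bins; qcut() keeps every bin populated.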

Scikit-learn KBinsDiscretizer

For machine learning pipelines, scikit-learn's KBinsDiscretizer offers binning strategies that integrate seamlessly with other transformers.

from sklearn.preprocessing import KBinsDiscretizer

# Sample data
X = np.array([[22], [35], [45], [19], [67], [52], [28], [41]])

# Uniform strategy (equal-width)
discretizer = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')
X_binned = discretizer.fit_transform(X)

print("Original vs Binned:")
for orig, binned in zip(X.flatten(), X_binned.flatten()):
    print(f"  {orig} -> Bin {int(binned)}")
# Different encoding strategies
# 'ordinal': Returns bin indices (0, 1, 2, ...)
# 'onehot': Returns one-hot encoded sparse matrix
# 'onehot-dense': Returns one-hot encoded dense matrix

discretizer_onehot = KBinsDiscretizer(n_bins=3, encode='onehot-dense', strategy='quantile')
X_onehot = discretizer_onehot.fit_transform(X)

print("One-hot encoded bins shape:", X_onehot.shape)  # (8, 3) for 3 bins

Practical Example: Customer Segmentation

Let's combine multiple binning techniques for customer segmentation in a real-world scenario.

# Create customer dataset
customers = pd.DataFrame({
    'customer_id': range(1, 11),
    'age': [22, 35, 45, 28, 67, 52, 31, 41, 58, 24],
    'annual_spend': [1200, 8500, 15000, 3200, 25000, 12000, 4500, 9800, 18000, 2100],
    'transactions': [12, 45, 89, 23, 156, 67, 34, 52, 98, 15]
})

# Bin age into life stages
age_bins = [0, 25, 35, 50, 65, 100]
age_labels = ['Gen Z', 'Millennial', 'Gen X', 'Boomer', 'Silent']
customers['generation'] = pd.cut(customers['age'], bins=age_bins, labels=age_labels)

# Bin spend into value tiers (quantile-based)
customers['value_tier'] = pd.qcut(
    customers['annual_spend'], 
    q=3, 
    labels=['Bronze', 'Silver', 'Gold']
)

# Bin transaction frequency
customers['activity_level'] = pd.cut(
    customers['transactions'],
    bins=[0, 25, 75, float('inf')],
    labels=['Low', 'Medium', 'High']
)

print(customers[['customer_id', 'generation', 'value_tier', 'activity_level']])

Practice Questions: Binning

Test your understanding with these hands-on exercises.

Given:

temps = pd.DataFrame({'temp_celsius': [5, 15, 25, 32, 8, 22, 38, 12]})

Task: Create a 'weather' column with categories: Cold (0-10), Mild (10-20), Warm (20-30), Hot (30+).

Show Solution
temps = pd.DataFrame({'temp_celsius': [5, 15, 25, 32, 8, 22, 38, 12]})

bins = [0, 10, 20, 30, 50]
labels = ['Cold', 'Mild', 'Warm', 'Hot']

temps['weather'] = pd.cut(temps['temp_celsius'], bins=bins, labels=labels)
print(temps)
#    temp_celsius weather
# 0             5    Cold
# 1            15    Mild
# 2            25    Warm
# 3            32     Hot

Given:

spend = pd.DataFrame({'amount': [100, 5000, 250, 12000, 800, 3500, 15000, 450]})

Task: Use qcut to create 4 equal-frequency spending tiers.

Show Solution
spend = pd.DataFrame({'amount': [100, 5000, 250, 12000, 800, 3500, 15000, 450]})

spend['tier'] = pd.qcut(spend['amount'], q=4, labels=['Tier 1', 'Tier 2', 'Tier 3', 'Tier 4'])
print(spend)
print("\nCounts per tier:")
print(spend['tier'].value_counts())
# Each tier has 2 customers

Task: Use KBinsDiscretizer to bin the 'age' column into 5 quantile-based bins with one-hot encoding.

Show Solution
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np

ages = np.array([[22], [35], [45], [28], [67], [52], [31], [41], [58], [24]])

discretizer = KBinsDiscretizer(
    n_bins=5, 
    encode='onehot-dense', 
    strategy='quantile'
)
ages_binned = discretizer.fit_transform(ages)

print("Shape:", ages_binned.shape)  # (10, 5)
print("First few rows:")
print(ages_binned[:5])

Given:

events = pd.DataFrame({'hour': [6, 14, 22, 3, 9, 18, 12, 20]})

Task: Create time_period with Night (0-6), Morning (6-12), Afternoon (12-18), Evening (18-24).

Show Solution
events = pd.DataFrame({'hour': [6, 14, 22, 3, 9, 18, 12, 20]})

bins = [0, 6, 12, 18, 24]
labels = ['Night', 'Morning', 'Afternoon', 'Evening']

events['time_period'] = pd.cut(
    events['hour'], 
    bins=bins, 
    labels=labels,
    right=False  # Left-closed bins [0,6), [6,12), ... so hour=6 falls in Morning
)

print(events)
#    hour time_period
# 0     6     Morning
# 1    14   Afternoon
# 2    22     Evening
# 3     3       Night

Interactive: Scaling Visualizer

See how different scaling methods transform your data in real-time. Adjust the input values and observe how StandardScaler, MinMaxScaler, and RobustScaler produce different results.


Encoding Comparison

Label encoding maps each category to an integer (e.g., Red → 0), which implies a false ordering for nominal data. One-hot encoding instead creates one binary column per category (Red | Blue | Green → 1 0 0), with no implied order.
Recommendation

For nominal (unordered) categories like colors, use One-Hot Encoding to avoid implying a false order.
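A minimal pandas illustration of this recommendation (the 'color' column here is a made-up example):

```python
import pandas as pd

df_colors = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Blue']})

# One binary column per category - no artificial ordering among colors
onehot = pd.get_dummies(df_colors['color'], prefix='color')
print(onehot.astype(int))
#    color_Blue  color_Green  color_Red
# 0           0            0          1
# 1           1            0          0
# 2           0            1          0
# 3           1            0          0
```

Exactly one column is 1 per row, and no color is "greater than" another — unlike label encoding, where Green=2 would look twice as large as Blue=1.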

Key Takeaways

Features Drive Model Success

Better features often matter more than choosing a fancier algorithm - invest time in feature engineering

Encode Categoricals Wisely

Use one-hot for nominal, label/ordinal for ordered categories - wrong encoding hurts models

Scale Features Appropriately

StandardScaler for normal distributions, MinMaxScaler for bounded ranges, RobustScaler for outliers

Create Meaningful Features

Interaction terms, ratios, and domain-specific features often capture patterns raw data cannot

Binning Reduces Noise

Convert continuous to categorical when exact values matter less than ranges or categories

Avoid Data Leakage

Fit transformers on training data only, then apply to test - never peek at test data statistics

Knowledge Check

Quick Quiz

Test what you've learned about feature engineering

1 What is the primary purpose of feature engineering?
2 Which encoding method is best for a categorical variable with no inherent order (e.g., color: red, blue, green)?
3 Which scaler transforms features to have zero mean and unit variance?
4 What is an interaction feature?
5 When would you use binning on a continuous variable?
6 What is data leakage in feature engineering?