Module 8.2

Categorical Encoding

Transform categorical variables into numerical representations that machine learning algorithms can understand. Master one-hot, label, ordinal, target, frequency, and binary encoding techniques!

40 min read
Intermediate
Hands-on Examples
What You'll Learn
  • One-hot encoding for nominal categories
  • Label and ordinal encoding techniques
  • Target encoding for high-cardinality features
  • Frequency and binary encoding methods
  • Choosing the right encoding strategy
Contents
01

One-Hot Encoding

One-hot encoding is the most widely used technique for converting nominal categorical variables into numerical format. It creates a new binary column for each unique category, where 1 indicates the presence of that category and 0 indicates absence. This approach is ideal when categories have no natural ordering and works exceptionally well with linear models, neural networks, and distance-based algorithms.

Analogy: Checkbox Survey! Imagine a survey asking "Which cities have you visited?" You can check: ☐ New York ☐ Boston ☐ Chicago. Each city gets its own YES/NO (1/0) column. That's exactly what One-Hot Encoding does - it creates a separate binary column for each category!
One-Hot Encoding Definition

A technique that transforms a categorical column with n unique values into n binary columns (or n-1 to avoid multicollinearity). Each row has exactly one "1" across these new columns, representing its category.

Example: City column with values [New York, Boston, Chicago] becomes 3 columns: city_New York, city_Boston, city_Chicago. Each row has 1 in its city column, 0s elsewhere.

Using Pandas get_dummies()

The simplest way to perform one-hot encoding in Python is using pd.get_dummies(). This function automatically detects categorical columns and creates binary indicator variables. Fastest method for exploration!

# WHY? Machine learning models need numbers, not text!
# WHAT? pd.get_dummies() converts categories to binary (0/1) columns
import pandas as pd

# Sample customer data with categorical 'city' and 'plan' columns
# SCENARIO: Telecom company analyzing customer subscriptions by location
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'city': ['New York', 'Boston', 'Chicago', 'New York', 'Boston'],
    'plan': ['Premium', 'Basic', 'Premium', 'Basic', 'Standard']
})

print("Original Data:")
print(customers)
# Output:
#    customer_id      city      plan
# 0            1  New York   Premium    # ← Customer 1 lives in New York, has Premium plan
# 1            2    Boston     Basic    # ← Customer 2 lives in Boston, has Basic plan
# 2            3   Chicago   Premium    # ← Customer 3 lives in Chicago, has Premium plan
# 3            4  New York     Basic    # ← Customer 4 lives in New York, has Basic plan
# 4            5    Boston  Standard    # ← Customer 5 lives in Boston, has Standard plan

# STEP 1: One-hot encode ONLY the 'city' column
# WHY? We want to convert city names to numbers without losing information
# WHAT IT DOES: Creates 3 new columns (city_Boston, city_Chicago, city_New York)
# prefix='city' adds 'city_' before each column name for clarity
# dtype=int keeps the output as 0/1 integers (newer pandas defaults to True/False booleans)
encoded = pd.get_dummies(customers, columns=['city'], prefix='city', dtype=int)

print("\nOne-Hot Encoded:")
print(encoded)
# Output:
#    customer_id      plan  city_Boston  city_Chicago  city_New York
# 0            1   Premium            0             0              1    # ← New York = [0, 0, 1]
# 1            2     Basic            1             0              0    # ← Boston = [1, 0, 0]
# 2            3   Premium            0             1              0    # ← Chicago = [0, 1, 0]
# 3            4     Basic            0             0              1    # ← New York = [0, 0, 1]
# 4            5  Standard            1             0              0    # ← Boston = [1, 0, 0]

# INTERPRETATION:
# - Each city gets its own binary column
# - Row 0: city_New York = 1 (customer is in New York), other city columns = 0
# - Row 1: city_Boston = 1 (customer is in Boston), other city columns = 0
# - Each row has exactly ONE "1" in the city columns (mutually exclusive categories)
# - No city is considered "higher" or "better" than another - just different!
Why This Works: One-hot encoding treats each category as equally different. The distance between New York and Boston is the same as the distance between Boston and Chicago in the encoded space. This is perfect for nominal categories like cities, colors, or product types where no natural ordering exists!
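This equal-distance property is easy to verify. A quick sketch with NumPy, using the encoded city vectors from the output above (column order Boston, Chicago, New York):

```python
import numpy as np

# One-hot vectors from the example: columns are [Boston, Chicago, New York]
new_york = np.array([0, 0, 1])
boston   = np.array([1, 0, 0])
chicago  = np.array([0, 1, 0])

# Euclidean distance between ANY two distinct one-hot vectors is sqrt(2)
print(np.linalg.norm(new_york - boston))   # 1.4142... (sqrt(2))
print(np.linalg.norm(boston - chicago))    # 1.4142... (sqrt(2))
print(np.linalg.norm(new_york - chicago))  # 1.4142... (sqrt(2))
```

No pair of cities is "closer" than any other, which is exactly what you want for nominal categories.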
The Dummy Variable Trap

Problem: When using one-hot encoding with linear regression or similar models, having ALL n columns creates perfect multicollinearity. This happens because if you know the values of n-1 columns, you can perfectly predict the nth column!

Example: If city_Boston=0 and city_Chicago=0, we KNOW city_New York MUST be 1. The third column is redundant!

Solution: Use pd.get_dummies(df, drop_first=True) to automatically drop the first category (called the "reference category" or "baseline"). Note: Tree-based models (Random Forest, XGBoost) are NOT affected by this issue!

# AVOIDING THE DUMMY VARIABLE TRAP
# WHY? Linear models can't handle perfect multicollinearity
# WHAT? Drop the first category as a "baseline" - it's implied when all others are 0

# Encode BOTH 'city' and 'plan' columns with drop_first=True
# WHAT IT DOES: Creates n-1 columns for each categorical variable
# The first category (alphabetically) becomes the baseline/reference
# dtype=int keeps the output as 0/1 integers (newer pandas defaults to True/False booleans)
encoded_safe = pd.get_dummies(customers, columns=['city', 'plan'], drop_first=True, dtype=int)

print(encoded_safe)
# Output:
#    customer_id  city_Chicago  city_New York  plan_Premium  plan_Standard
# 0            1             0              1             1              0  # New York, Premium
# 1            2             0              0             0              0  # Boston (baseline!), Basic (baseline!)
# 2            3             1              0             1              0  # Chicago, Premium
# 3            4             0              1             0              0  # New York, Basic (baseline!)
# 4            5             0              0             0              1  # Boston (baseline!), Standard

# INTERPRETATION:
# City baselines: Boston is the baseline city (dropped)
#   - city_Chicago=0, city_New York=0 → Boston (implied)
#   - city_Chicago=1, city_New York=0 → Chicago
#   - city_Chicago=0, city_New York=1 → New York
# 
# Plan baselines: Basic is the baseline plan (dropped)
#   - plan_Premium=0, plan_Standard=0 → Basic (implied)
#   - plan_Premium=1, plan_Standard=0 → Premium
#   - plan_Premium=0, plan_Standard=1 → Standard
#
# WHY THIS WORKS: We reduced from 3+3=6 columns to 2+2=4 columns, but kept ALL information!

Pro Tip: When to Drop First

Always drop_first=True for: Linear Regression, Logistic Regression, Ridge, Lasso.
Don't need to drop for: Random Forest, XGBoost, LightGBM, Decision Trees (they handle it fine).
Why? Tree-based models split on features independently, so multicollinearity doesn't matter!

Using Scikit-learn OneHotEncoder

For machine learning pipelines, sklearn.preprocessing.OneHotEncoder is preferred because it can be fitted once on training data and consistently applied to test data, handling unseen categories gracefully. Production-ready approach!

Why OneHotEncoder > get_dummies() for ML:
  • Consistency: Encoder remembers categories from training data
  • Unseen categories: Handles new categories in test data gracefully
  • Pipeline integration: Works seamlessly with sklearn pipelines
  • Production deployment: Save encoder, use on new data later
# WHY? For production ML pipelines, we need reproducible encoding
# WHAT? OneHotEncoder learns categories from training data and applies consistently
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# STEP 1: Create encoder with important parameters
# sparse_output=False → Get regular array instead of sparse matrix (easier to work with)
# handle_unknown='ignore' → If test data has NEW categories not in training, give them all 0s
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# STEP 2: Fit encoder on training data
# WHY double brackets? OneHotEncoder expects a 2D input, and customers[['city']]
# returns a one-column DataFrame rather than a 1D Series
cities = customers[['city']]  # 2D DataFrame (5 rows, 1 column)
print("Shape of input:", cities.shape)  # (5, 1)

# fit_transform() = Learn the unique categories AND encode them in one step
# WHAT IT LEARNS: "There are 3 cities: Boston, Chicago, New York"
encoded_array = encoder.fit_transform(cities)
print("\nEncoded array shape:", encoded_array.shape)  # (5, 3) - 5 rows, 3 columns

# STEP 3: Get feature names (column names for the encoded data)
# WHY? We need to know which column represents which city
feature_names = encoder.get_feature_names_out(['city'])
print("\nFeature names:", feature_names)  
# Output: ['city_Boston' 'city_Chicago' 'city_New York']

# STEP 4: Convert numpy array back to pandas DataFrame for readability
encoded_df = pd.DataFrame(encoded_array, columns=feature_names)
print("\nEncoded DataFrame:")
print(encoded_df)
# Output:
#    city_Boston  city_Chicago  city_New York
# 0          0.0           0.0            1.0  # ← New York
# 1          1.0           0.0            0.0  # ← Boston
# 2          0.0           1.0            0.0  # ← Chicago
# 3          0.0           0.0            1.0  # ← New York
# 4          1.0           0.0            0.0  # ← Boston

# INTERPRETATION:
# - The encoder created columns in ALPHABETICAL order: Boston, Chicago, New York
# - Each row has exactly one 1.0 and two 0.0s
# - Unlike get_dummies(), this encoder is now "trained" and can be reused!
Key Advantage: You can now save this encoder (using pickle or joblib) and use it in production! When new customer data arrives, apply the SAME encoding using encoder.transform(new_data) - no retraining needed!
When to Use One-Hot Encoding
  • Nominal categories with no natural order
  • Low to medium cardinality (fewer than 15-20 unique values)
  • Linear models, neural networks, SVM
  • When interpretability matters
When to Avoid One-Hot Encoding
  • High cardinality features (100+ categories)
  • Memory-constrained environments
  • Tree-based models (simpler encoding works fine)
  • When sparse matrices cause performance issues

Handling Unseen Categories

Real-World Problem: What happens when your model encounters a NEW category it has never seen before? For example, you trained on data from New York, Boston, and Chicago, but test data includes Seattle. Without proper handling, your model will crash! This is one of the most common production bugs in ML systems.

A common challenge is handling categories in test data that were not seen during training. Scikit-learn's handle_unknown='ignore' parameter solves this by setting all columns to 0 for unknown categories. Critical for production!

# SCENARIO: Train on 3 cities, then encounter NEW city in production
# WHY THIS MATTERS: Real-world data is messy - you can't predict all future categories!

# STEP 1: Train encoder on training data (only 3 cities)
# SIMULATION: This is what you know during model development
train_cities = pd.DataFrame({'city': ['New York', 'Boston', 'Chicago']})

# Create encoder with handle_unknown='ignore'
# WHAT IT DOES: If it sees an unknown category, it assigns all 0s (safe default)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(train_cities)  # LEARN: "I know 3 cities: Boston, Chicago, New York"

print("Categories learned during training:", encoder.categories_)
# Output: [array(['Boston', 'Chicago', 'New York'], dtype=object)]

# STEP 2: Test data contains UNSEEN category 'Seattle'
# SIMULATION: This is what happens in production with real-world data
test_cities = pd.DataFrame({'city': ['Boston', 'Seattle', 'New York']})
print("\nTest data (notice Seattle is NEW!):")
print(test_cities)
#        city
# 0    Boston    # ← Known category
# 1   Seattle    # ← UNKNOWN CATEGORY! Not in training data
# 2  New York    # ← Known category

# STEP 3: Transform test data using the trained encoder
# WHAT HAPPENS: Boston and New York are encoded normally, Seattle gets all 0s
encoded_test = encoder.transform(test_cities)

print("\nEncoded test data:")
encoded_df = pd.DataFrame(encoded_test, columns=encoder.get_feature_names_out())
print(encoded_df)
# Output:
#    city_Boston  city_Chicago  city_New York
# 0          1.0           0.0            0.0  # ← Boston: Encoded normally
# 1          0.0           0.0            0.0  # ← Seattle: ALL ZEROS (unknown category)
# 2          0.0           0.0            1.0  # ← New York: Encoded normally

# INTERPRETATION:
# Row 0 (Boston): city_Boston=1, others=0 → Correct encoding
# Row 1 (Seattle): ALL ZEROS → "I don't know this city, treat it as missing/unknown"
# Row 2 (New York): city_New York=1, others=0 → Correct encoding
#
# WHY ALL ZEROS FOR SEATTLE? It's a safe default that says:
# "This data point doesn't match any category I know, so I'll treat it as neutral"
# The model can still make predictions, but won't use city information for Seattle
Without handle_unknown='ignore'
encoder = OneHotEncoder()
encoder.fit(train_cities)
encoder.transform(test_cities)

# ERROR: ValueError: Found unknown categories ['Seattle'] during transform

Result: Production system crashes! ❌

With handle_unknown='ignore'
encoder = OneHotEncoder(
    handle_unknown='ignore'
)
encoder.fit(train_cities)
encoded = encoder.transform(test_cities)

# SUCCESS: Seattle gets [0, 0, 0]

Result: System handles it gracefully! ✅

Pro Tip: Alternative Strategies

1. Rare category grouping: Before encoding, replace rare categories (appearing <3 times) with "Other"
2. Add "Unknown" category: Include an explicit "Unknown" category in training data
3. Use frequency encoding: For high-cardinality features, frequency encoding handles unknowns naturally
4. Monitor unknowns: Log how often unknown categories appear - might indicate data drift!
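Strategy 3 above can be sketched in a few lines. A minimal example of frequency encoding (the city names are illustrative): each category is replaced by its share of the training data, and an unseen category maps to NaN, which we fill with 0:

```python
import pandas as pd

train_cities = pd.Series(['NYC', 'NYC', 'NYC', 'Boston', 'Boston', 'Chicago'])
test_cities  = pd.Series(['Boston', 'Seattle', 'NYC'])  # Seattle is unseen!

# Learn relative frequencies from TRAINING data only
freq = train_cities.value_counts(normalize=True)
# NYC: 0.5, Boston: 0.333..., Chicago: 0.166...

# Apply to test data; unseen 'Seattle' maps to NaN, which we treat as 0
encoded = test_cities.map(freq).fillna(0)
print(encoded.tolist())  # [0.333..., 0.0, 0.5]
```

A single numeric column regardless of cardinality, and an unknown category naturally lands at 0 ("never seen before").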

Practice Questions

Task: Given a DataFrame with product categories, one-hot encode the 'category' column and drop the first category to avoid multicollinearity.

# Starter code
import pandas as pd

products = pd.DataFrame({
    'product_id': [101, 102, 103, 104, 105],
    'name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Audio']
})

# Your code here: One-hot encode 'category' with drop_first=True
View Solution
import pandas as pd

products = pd.DataFrame({
    'product_id': [101, 102, 103, 104, 105],
    'name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Audio']
})

# One-hot encode with drop_first to avoid dummy variable trap
encoded_products = pd.get_dummies(products, columns=['category'], drop_first=True, dtype=int)  # dtype=int keeps 0/1 output
print(encoded_products)
# Output:
#    product_id        name  category_Audio  category_Electronics
# 0         101      Laptop               0                     1
# 1         102       Mouse               0                     0  <- Accessories (dropped)
# 2         103    Keyboard               0                     0
# 3         104     Monitor               0                     1
# 4         105  Headphones               1                     0
💡 Explanation: The original 'category' column had 3 unique values: Accessories, Audio, Electronics. Using drop_first=True, we dropped 'Accessories' (alphabetically first) to avoid multicollinearity. Now we have 2 binary columns: category_Audio and category_Electronics. When both are 0, we know it's Accessories!

Task: Create a scikit-learn OneHotEncoder that can handle unseen categories, fit it on training data, and transform both training and test data.

# Starter code
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np

train_data = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'size': ['Small', 'Large', 'Medium', 'Large', 'Small']
})

test_data = pd.DataFrame({
    'color': ['Blue', 'Yellow', 'Red'],  # Yellow is unseen!
    'size': ['Small', 'XL', 'Large']     # XL is unseen!
})

# Your code here: Create encoder, fit on train, transform both
View Solution
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np

train_data = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'size': ['Small', 'Large', 'Medium', 'Large', 'Small']
})

test_data = pd.DataFrame({
    'color': ['Blue', 'Yellow', 'Red'],  # Yellow is unseen!
    'size': ['Small', 'XL', 'Large']     # XL is unseen!
})

# Create encoder that ignores unseen categories
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Fit on training data only
encoder.fit(train_data)

# Transform both datasets
train_encoded = encoder.transform(train_data)
test_encoded = encoder.transform(test_data)

# Get feature names
feature_names = encoder.get_feature_names_out(['color', 'size'])

# Convert to DataFrames
train_df = pd.DataFrame(train_encoded, columns=feature_names)
test_df = pd.DataFrame(test_encoded, columns=feature_names)

print("Encoded Training Data:")
print(train_df)

print("\nEncoded Test Data (notice Yellow and XL get all zeros):")
print(test_df)
💡 Explanation: Notice how 'Yellow' and 'XL' (unseen during training) both get all zeros across their respective feature columns. This is because we used handle_unknown='ignore'. Without this parameter, the encoder would crash with a ValueError. This approach allows the model to make predictions even with new categories, treating them as "unknown/other".

Task: When a categorical column has too many unique values, group rare categories into "Other" before one-hot encoding. Keep only categories that appear at least 3 times.

# Starter code
import pandas as pd

orders = pd.DataFrame({
    'order_id': range(1, 16),
    'country': ['USA', 'USA', 'USA', 'USA', 'Canada', 'Canada', 'Canada', 
                'UK', 'UK', 'France', 'Germany', 'Spain', 'Italy', 'Japan', 'Australia']
})

# Your code here: Group rare countries (< 3 occurrences) into 'Other', then one-hot encode
View Solution
import pandas as pd

orders = pd.DataFrame({
    'order_id': range(1, 16),
    'country': ['USA', 'USA', 'USA', 'USA', 'Canada', 'Canada', 'Canada', 
                'UK', 'UK', 'France', 'Germany', 'Spain', 'Italy', 'Japan', 'Australia']
})

# Count occurrences of each country
country_counts = orders['country'].value_counts()
print("Country counts:\n", country_counts)

# Find categories with fewer than 3 occurrences
rare_countries = country_counts[country_counts < 3].index.tolist()
print("\nRare countries (< 3):", rare_countries)

# Replace rare countries with 'Other'
orders['country_grouped'] = orders['country'].apply(
    lambda x: 'Other' if x in rare_countries else x
)

print("\nGrouped values:\n", orders['country_grouped'].value_counts())

# One-hot encode the grouped column
encoded = pd.get_dummies(orders, columns=['country_grouped'], prefix='country', dtype=int)  # dtype=int keeps 0/1 output
print("\nEncoded DataFrame:")
print(encoded)
💡 Explanation: High cardinality (too many unique categories) creates too many columns and sparse data. By grouping rare categories (appearing <3 times) into 'Other', we reduced the dimensionality from 9 countries to 3 groups: USA, Canada, and Other. Note that UK appears only twice, so it is grouped into 'Other' as well. This technique prevents overfitting, reduces memory usage, and makes the model more robust to new rare countries in production.
02

Label & Ordinal Encoding

Label encoding and ordinal encoding both convert categories to integers, but they serve different purposes. Label encoding assigns arbitrary numbers to categories, while ordinal encoding preserves meaningful order. Understanding when to use each is crucial for building effective machine learning models.

Label Encoding

Assigns a unique integer (0, 1, 2, ...) to each category; scikit-learn's LabelEncoder assigns them in sorted (alphabetical) order. The numbers have no inherent meaning or relationship.

Ordinal Encoding

Assigns integers that reflect the natural order or ranking of categories. For example, "Low" = 0, "Medium" = 1, "High" = 2.

Label Encoding with Scikit-learn

Label encoding is straightforward but should be used carefully. Since it assigns arbitrary numbers, models may incorrectly assume numerical relationships between categories.

from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Sample data with color categories
products = pd.DataFrame({
    'product': ['Shirt', 'Pants', 'Hat', 'Shoes', 'Jacket'],
    'color': ['Red', 'Blue', 'Green', 'Red', 'Blue']
})

# Create and fit label encoder
label_encoder = LabelEncoder()
products['color_encoded'] = label_encoder.fit_transform(products['color'])

print(products)
# Output:
#   product  color  color_encoded
# 0   Shirt    Red              2
# 1   Pants   Blue              0
# 2     Hat  Green              1
# 3   Shoes    Red              2
# 4  Jacket   Blue              0

# View the mapping
print("\nLabel mapping:")
for i, label in enumerate(label_encoder.classes_):
    print(f"  {label} -> {i}")
# Output:
#   Blue -> 0
#   Green -> 1
#   Red -> 2
💡 Explanation: LabelEncoder assigns integers alphabetically: Blue=0, Green=1, Red=2. The numbers are arbitrary and don't mean "Red is 2x better than Blue". However, some algorithms like linear regression might incorrectly interpret this as an ordering. That's why label encoding is safest with tree-based models (Random Forest, XGBoost) which don't assume ordinal relationships between categories.
Label Encoding Pitfall

Label encoding can mislead algorithms into thinking there is an order (Red > Green > Blue) when none exists. For nominal categories, one-hot encoding is usually safer. However, tree-based models (Random Forest, XGBoost) handle label encoding well because they split on individual values rather than assuming order.

Ordinal Encoding with Scikit-learn

When categories have a natural ranking, ordinal encoding preserves this information. You must explicitly define the category order.

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Customer data with ordinal categories
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor'],
    'satisfaction': ['Low', 'Medium', 'High', 'High', 'Medium']
})

# Define the order for each category
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
satisfaction_order = ['Low', 'Medium', 'High']

# Create ordinal encoder with specified order
ordinal_encoder = OrdinalEncoder(
    categories=[education_order, satisfaction_order]
)

# Fit and transform
customers[['education_encoded', 'satisfaction_encoded']] = ordinal_encoder.fit_transform(
    customers[['education', 'satisfaction']]
)

print(customers)
# Output:
#    customer_id    education satisfaction  education_encoded  satisfaction_encoded
# 0            1  High School          Low                0.0                   0.0
# 1            2     Bachelor       Medium                1.0                   1.0
# 2            3       Master         High                2.0                   2.0
# 3            4          PhD         High                3.0                   2.0
# 4            5     Bachelor       Medium                1.0                   1.0
💡 Explanation: Ordinal encoding preserves meaningful order. For education: High School(0) < Bachelor(1) < Master(2) < PhD(3). For satisfaction: Low(0) < Medium(1) < High(2). The numbers now have meaning – higher numbers represent higher levels. This helps models understand that a PhD is "more" than a Bachelor, unlike label encoding where the numbers would be arbitrary.

Manual Ordinal Mapping with Pandas

For simple cases or when you need more control, you can create ordinal mappings manually using a dictionary.

import pandas as pd

# Survey responses
survey = pd.DataFrame({
    'response_id': [1, 2, 3, 4, 5],
    'experience': ['Beginner', 'Expert', 'Intermediate', 'Beginner', 'Expert'],
    'priority': ['Low', 'Critical', 'Medium', 'High', 'Medium']
})

# Define ordinal mappings
experience_map = {'Beginner': 0, 'Intermediate': 1, 'Expert': 2}
priority_map = {'Low': 0, 'Medium': 1, 'High': 2, 'Critical': 3}

# Apply mappings using .map()
survey['experience_encoded'] = survey['experience'].map(experience_map)
survey['priority_encoded'] = survey['priority'].map(priority_map)

print(survey)
# Output:
#    response_id    experience priority  experience_encoded  priority_encoded
# 0            1      Beginner      Low                   0                 0
# 1            2        Expert Critical                   2                 3
# 2            3  Intermediate   Medium                   1                 1
# 3            4      Beginner     High                   0                 2
# 4            5        Expert   Medium                   2                 1
💡 Explanation: The .map() method gives you full control over the encoding. You can see that experience levels increase: Beginner(0) → Intermediate(1) → Expert(2), and priority increases: Low(0) → Medium(1) → High(2) → Critical(3). This manual approach is great for small datasets or when you need custom orderings. Unlike OrdinalEncoder, it's also easier to read and doesn't require sklearn.
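One caveat with this approach: .map() silently returns NaN for any value missing from the dictionary, so it pays to check for unmapped categories. A minimal sketch (the 'Urgent' value is hypothetical, not part of the survey above):

```python
import pandas as pd

priority_map = {'Low': 0, 'Medium': 1, 'High': 2, 'Critical': 3}
responses = pd.Series(['Low', 'Critical', 'Urgent'])  # 'Urgent' is NOT in the map

encoded = responses.map(priority_map)
print(encoded.isna().sum())   # 1 -> one unmapped value slipped through as NaN

# Decide explicitly how to handle it, e.g. flag it with -1
encoded = encoded.fillna(-1).astype(int)
print(encoded.tolist())       # [0, 3, -1]
```

Unlike OrdinalEncoder, .map() never raises an error for unknown values, so this check is the manual equivalent of handle_unknown.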
Aspect         | Label Encoding                     | Ordinal Encoding
---------------|------------------------------------|--------------------------------------
Order Matters? | No - assigns arbitrary integers    | Yes - preserves natural ranking
Use Case       | Tree-based models, target variable | Ordinal features like ratings, sizes
Examples       | Colors, countries, product IDs     | Education level, size (S/M/L), rating
Risk           | May imply false ordering           | Must define correct order manually
Sklearn Class  | LabelEncoder                       | OrdinalEncoder

Inverse Transform

Both encoders support converting encoded values back to original categories, which is useful for interpreting model predictions.

# Inverse transform with LabelEncoder
original_colors = label_encoder.inverse_transform([0, 1, 2])
print("Inverse transform:", original_colors)  # ['Blue' 'Green' 'Red']

# Inverse transform with OrdinalEncoder
original_values = ordinal_encoder.inverse_transform([[2.0, 1.0], [3.0, 2.0]])
print("Inverse transform:", original_values)
# [['Master' 'Medium']
#  ['PhD' 'High']]
💡 Explanation: Both LabelEncoder and OrdinalEncoder can reverse their encoding using inverse_transform(). This is useful for interpreting model predictions or displaying results in human-readable format. For example, if your model predicts education level as 2.0, you can convert it back to "Master" for easier understanding.

Practice Questions

Task: Use LabelEncoder to encode the 'product_type' column and print the mapping between original values and encoded integers.

# Starter code
from sklearn.preprocessing import LabelEncoder
import pandas as pd

inventory = pd.DataFrame({
    'item_id': [1, 2, 3, 4, 5, 6],
    'product_type': ['Electronics', 'Clothing', 'Food', 'Electronics', 'Clothing', 'Furniture']
})

# Your code here: Label encode and show the mapping
View Solution
from sklearn.preprocessing import LabelEncoder
import pandas as pd

inventory = pd.DataFrame({
    'item_id': [1, 2, 3, 4, 5, 6],
    'product_type': ['Electronics', 'Clothing', 'Food', 'Electronics', 'Clothing', 'Furniture']
})

# Create and fit label encoder
encoder = LabelEncoder()
inventory['type_encoded'] = encoder.fit_transform(inventory['product_type'])

# Display result
print(inventory)

# Show the mapping
print("\nEncoding Mapping:")
for i, label in enumerate(encoder.classes_):
    print(f"  {label} -> {i}")
# Clothing -> 0
# Electronics -> 1
# Food -> 2
# Furniture -> 3
💡 Explanation: LabelEncoder automatically sorts categories alphabetically and assigns integers starting from 0. The encoding is arbitrary (Clothing=0 doesn't mean it's "less" than Electronics=1). This approach works well for tree-based models but can mislead linear models into assuming false orderings.

Task: The 'size' column contains T-shirt sizes with a natural order. Use OrdinalEncoder to encode them preserving the correct order: XS < S < M < L < XL < XXL.

# Starter code
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

tshirts = pd.DataFrame({
    'sku': ['TS001', 'TS002', 'TS003', 'TS004', 'TS005'],
    'size': ['M', 'XL', 'S', 'XXL', 'XS']
})

# Your code here: Ordinal encode with correct size order
View Solution
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

tshirts = pd.DataFrame({
    'sku': ['TS001', 'TS002', 'TS003', 'TS004', 'TS005'],
    'size': ['M', 'XL', 'S', 'XXL', 'XS']
})

# Define the correct order from smallest to largest
size_order = [['XS', 'S', 'M', 'L', 'XL', 'XXL']]

# Create ordinal encoder with the specified order
encoder = OrdinalEncoder(categories=size_order)

# Fit and transform
tshirts['size_encoded'] = encoder.fit_transform(tshirts[['size']])

print(tshirts)
# Output:
#      sku size  size_encoded
# 0  TS001    M           2.0  (XS=0, S=1, M=2)
# 1  TS002   XL           4.0
# 2  TS003    S           1.0
# 3  TS004  XXL           5.0
# 4  TS005   XS           0.0

# Verify the mapping
print("\nSize order mapping:")
for i, size in enumerate(size_order[0]):
    print(f"  {size} -> {i}")
💡 Explanation: By explicitly defining the order [XS, S, M, L, XL, XXL], we preserve the natural size progression. Now the model understands that XXL(5) is larger than XS(0). Without specifying this order, the encoder would assign numbers alphabetically (L=0, M=1, S=2, XL=3, XS=4, XXL=5), which would be meaningless for size comparisons!

Task: Fit an OrdinalEncoder on training data and apply it to test data. Handle an unknown category in test data by using handle_unknown='use_encoded_value' with unknown_value=-1.

# Starter code
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import numpy as np

train = pd.DataFrame({
    'rating': ['Poor', 'Good', 'Excellent', 'Good', 'Poor']
})

test = pd.DataFrame({
    'rating': ['Good', 'Average', 'Excellent']  # 'Average' is unseen!
})

# Your code here: Handle unknown categories with -1
View Solution
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import numpy as np

train = pd.DataFrame({
    'rating': ['Poor', 'Good', 'Excellent', 'Good', 'Poor']
})

test = pd.DataFrame({
    'rating': ['Good', 'Average', 'Excellent']  # 'Average' is unseen!
})

# Define ordinal categories
rating_order = [['Poor', 'Good', 'Excellent']]

# Create encoder that assigns -1 to unknown categories
encoder = OrdinalEncoder(
    categories=rating_order,
    handle_unknown='use_encoded_value',
    unknown_value=-1
)

# Fit on training data
encoder.fit(train[['rating']])

# Transform both datasets
train['rating_encoded'] = encoder.transform(train[['rating']])
test['rating_encoded'] = encoder.transform(test[['rating']])

print("Training Data:")
print(train)

print("\nTest Data (notice 'Average' gets -1):")
print(test)
# Output:
#       rating  rating_encoded
# 0       Good             1.0
# 1    Average            -1.0   <- Unknown category
# 2  Excellent             2.0
💡 Explanation: By using handle_unknown='use_encoded_value' with unknown_value=-1, we can handle unseen categories gracefully. The encoder assigns -1 to 'Average' (not seen during training). This is better than crashing, and the -1 signals to downstream processes that this category is unknown. You can then handle it appropriately (e.g., assign global mean, create an "Other" category, or flag for review).
03

Target Encoding

Target encoding (also called mean encoding) replaces each category with a statistic derived from the target variable, typically the mean. This technique is particularly powerful for high-cardinality categorical features where one-hot encoding would create too many columns. However, it must be applied carefully to avoid data leakage and overfitting.

Target Encoding

A technique that replaces each category with the mean (or other aggregate) of the target variable for that category. For example, if customers from "California" have an average purchase of $150, "California" is encoded as 150.

Basic Target Encoding

The simplest form of target encoding calculates the mean of the target for each category. This creates a strong predictive signal but requires careful handling to prevent leakage.

import pandas as pd
import numpy as np

# Customer purchase data
customers = pd.DataFrame({
    'customer_id': range(1, 11),
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC', 'Chicago', 'LA', 'NYC', 'Chicago'],
    'purchase_amount': [150, 200, 180, 90, 220, 160, 100, 190, 170, 110]
})

# Calculate target mean for each city
city_means = customers.groupby('city')['purchase_amount'].mean()
print("Mean purchase by city:")
print(city_means)
# Output:
# city
# Chicago    100.000000
# LA         203.333333
# NYC        165.000000

# Apply target encoding
customers['city_encoded'] = customers['city'].map(city_means)
print("\nTarget Encoded Data:")
print(customers[['customer_id', 'city', 'purchase_amount', 'city_encoded']])
# Output:
#    customer_id     city  purchase_amount  city_encoded
# 0            1      NYC              150    165.000000
# 1            2       LA              200    203.333333
# 2            3      NYC              180    165.000000
# ...
💡 Explanation: Target encoding replaces each city with the average purchase amount for that city. NYC customers average $165, LA customers average $203, and Chicago customers average $100. This encoding captures the relationship between city and purchase amount, creating a strong predictive feature. However, this basic approach has data leakage because each row's target influences its own encoding!
Data Leakage Warning

The basic approach above causes data leakage because each row's target value influences its own encoding. This leads to overfitting, especially for rare categories. Always use one of these solutions:

  • Leave-one-out encoding: Exclude the current row when calculating the mean
  • K-fold target encoding: Use cross-validation folds to calculate means
  • Smoothing: Blend category mean with global mean based on category size
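Smoothing and k-fold encoding are demonstrated below; leave-one-out is simple enough to sketch here. This is a vectorized version on a small hypothetical subset of the customer data (singleton categories fall back to the global mean — an assumed policy):

```python
import pandas as pd

customers = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC'],
    'purchase_amount': [150, 200, 180, 90, 220, 160]
})

# Leave-one-out: each row is encoded with the mean of the OTHER rows in its category
grp = customers.groupby('city')['purchase_amount']
sums = grp.transform('sum')
counts = grp.transform('count')
customers['city_loo'] = (sums - customers['purchase_amount']) / (counts - 1)

# Singleton categories (Chicago here) divide 0/0 -> NaN; fall back to the global mean
customers['city_loo'] = customers['city_loo'].fillna(customers['purchase_amount'].mean())
print(customers)
```

For the first NYC row, the encoding is the mean of the other two NYC purchases, (180 + 160) / 2 = 170 — the row's own target never influences its own encoding.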

Target Encoding with Smoothing

Smoothing (also called regularization) blends the category mean with the global mean. Categories with few samples rely more on the global mean, preventing overfitting on rare categories.

import pandas as pd
import numpy as np

def target_encode_smooth(df, column, target, m=10):
    """
    Target encoding with smoothing.
    m: smoothing parameter (higher = more smoothing toward global mean)
    """
    # Calculate global mean
    global_mean = df[target].mean()
    
    # Calculate category statistics
    agg = df.groupby(column)[target].agg(['mean', 'count'])
    
    # Apply smoothing formula: (count * category_mean + m * global_mean) / (count + m)
    smoothed = (agg['count'] * agg['mean'] + m * global_mean) / (agg['count'] + m)
    
    return df[column].map(smoothed)

# Example data with rare category
sales = pd.DataFrame({
    'region': ['East', 'East', 'East', 'East', 'East',
               'West', 'West', 'West', 
               'North', 'North',
               'South'],  # South has only 1 sample!
    'revenue': [100, 120, 110, 130, 115,
                200, 180, 190,
                80, 85,
                500]  # South's single high value could cause overfitting
})

# Without smoothing
sales['region_raw'] = sales['region'].map(
    sales.groupby('region')['revenue'].mean()
)

# With smoothing (m=3)
sales['region_smooth'] = target_encode_smooth(sales, 'region', 'revenue', m=3)

print(sales[['region', 'revenue', 'region_raw', 'region_smooth']])
# Notice South drops from 500 (raw) toward the global mean (~164.5), landing near 248
💡 Explanation: Smoothing prevents overfitting on rare categories. South has only 1 sample with revenue=$500. Without smoothing, South would be encoded as 500 (very high), but this might be noise. With smoothing (m=3), South's encoding blends toward the global mean of ≈164.5: (1*500 + 3*164.5)/(1+3) ≈ 248. Categories with more samples (like East with 5 samples) are less affected by smoothing and stay closer to their true mean.

K-Fold Target Encoding

The most robust approach uses cross-validation: for each fold, calculate the encoding using only the out-of-fold data. This completely prevents target leakage within the training set.

from sklearn.model_selection import KFold
import pandas as pd
import numpy as np

def kfold_target_encode(df, column, target, n_splits=5):
    """K-fold target encoding to prevent leakage."""
    df = df.copy()
    df['encoded'] = np.nan
    
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    
    for train_idx, val_idx in kf.split(df):
        # Calculate means using ONLY training fold data
        train_means = df.iloc[train_idx].groupby(column)[target].mean()
        
        # Apply to validation fold
        df.loc[df.index[val_idx], 'encoded'] = df.iloc[val_idx][column].map(train_means)
    
    # Fill any NaN (unseen categories) with global mean
    global_mean = df[target].mean()
    df['encoded'] = df['encoded'].fillna(global_mean)
    
    return df['encoded']

# Apply k-fold encoding
sales['region_kfold'] = kfold_target_encode(sales, 'region', 'revenue', n_splits=3)
print(sales[['region', 'revenue', 'region_kfold']])
💡 Explanation: K-fold encoding completely prevents target leakage within the training set. For each fold, we calculate the encoding using ONLY the other folds' data. For example, when encoding row 0, we calculate the mean using rows in other folds, not including row 0 itself. This is the gold standard for target encoding in competitions and production ML systems!

Using Category Encoders Library

The category_encoders library provides production-ready target encoding with built-in smoothing and cross-validation support.

# Install: pip install category-encoders
import category_encoders as ce
import pandas as pd
from sklearn.model_selection import train_test_split

# Prepare data
X = sales[['region']]
y = sales['revenue']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create target encoder with smoothing
encoder = ce.TargetEncoder(cols=['region'], smoothing=1.0)

# Fit on training data ONLY (pass y_train to prevent leakage)
X_train_encoded = encoder.fit_transform(X_train, y_train)
X_test_encoded = encoder.transform(X_test)

print("Training encoded:")
print(X_train_encoded)

print("\nTest encoded:")
print(X_test_encoded)
💡 Explanation: The category_encoders library provides production-ready target encoding with built-in leakage prevention. By fitting on training data with y_train, it learns the category-to-target relationships. When transforming test data, it applies the learned encoding WITHOUT looking at test target values (preventing leakage). The smoothing parameter (1.0 here) adds regularization to prevent overfitting on rare categories.
| Aspect           | Target Encoding                      | One-Hot Encoding                           |
|------------------|--------------------------------------|--------------------------------------------|
| Dimensionality   | 1 column (regardless of cardinality) | n columns (one per category)               |
| High Cardinality | Excellent: no dimension explosion    | Poor: creates sparse, high-dimensional data |
| Leakage Risk     | High: must use regularization        | None                                       |
| Information      | Encodes relationship with target     | No target information                      |
| Best For         | Tree-based models, high cardinality  | Linear models, low cardinality             |

Practice Questions

Task: Calculate the mean salary for each department and create a target-encoded column.

# Starter code
import pandas as pd

employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'department': ['Sales', 'Engineering', 'Sales', 'HR', 'Engineering', 'Sales', 'HR', 'Engineering'],
    'salary': [50000, 80000, 55000, 45000, 85000, 52000, 48000, 90000]
})

# Your code here: Target encode 'department' using mean salary
View Solution
import pandas as pd

employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'department': ['Sales', 'Engineering', 'Sales', 'HR', 'Engineering', 'Sales', 'HR', 'Engineering'],
    'salary': [50000, 80000, 55000, 45000, 85000, 52000, 48000, 90000]
})

# Calculate mean salary per department
dept_means = employees.groupby('department')['salary'].mean()
print("Mean salary by department:")
print(dept_means)

# Target encode
employees['dept_encoded'] = employees['department'].map(dept_means)
print("\nEncoded DataFrame:")
print(employees)
# Engineering -> 85000, HR -> 46500, Sales -> 52333.33
💡 Explanation: Each department is replaced with its average salary: Engineering=$85,000, HR=$46,500, Sales=$52,333. This creates a powerful feature because it directly encodes the relationship between department and salary. However, remember this causes data leakage – in production, you should use k-fold encoding or fit only on training data!

Task: Implement smoothed target encoding using the formula: (n * category_mean + m * global_mean) / (n + m), where n is category count and m is the smoothing parameter.

# Starter code
import pandas as pd

products = pd.DataFrame({
    'product': range(1, 13),
    'brand': ['Apple', 'Apple', 'Apple', 'Apple', 'Apple',
              'Samsung', 'Samsung', 'Samsung',
              'LG', 'LG',
              'Sony', 'Nokia'],  # Sony and Nokia are rare
    'rating': [4.5, 4.2, 4.8, 4.3, 4.6,
               4.0, 3.8, 4.1,
               3.5, 3.6,
               5.0, 2.0]  # Extreme values for rare brands
})

# Your code here: Implement smoothed target encoding with m=3
View Solution
import pandas as pd

products = pd.DataFrame({
    'product': range(1, 13),
    'brand': ['Apple', 'Apple', 'Apple', 'Apple', 'Apple',
              'Samsung', 'Samsung', 'Samsung',
              'LG', 'LG',
              'Sony', 'Nokia'],  # Sony and Nokia are rare
    'rating': [4.5, 4.2, 4.8, 4.3, 4.6,
               4.0, 3.8, 4.1,
               3.5, 3.6,
               5.0, 2.0]  # Extreme values for rare brands
})

# Smoothing parameter
m = 3

# Calculate global mean
global_mean = products['rating'].mean()
print(f"Global mean rating: {global_mean:.2f}")

# Calculate brand statistics
brand_stats = products.groupby('brand')['rating'].agg(['mean', 'count'])
print("\nBrand statistics:")
print(brand_stats)

# Apply smoothing formula
brand_stats['smoothed'] = (
    (brand_stats['count'] * brand_stats['mean'] + m * global_mean) /
    (brand_stats['count'] + m)
)
print("\nSmoothed values:")
print(brand_stats[['mean', 'smoothed']])

# Apply to dataframe
products['brand_encoded'] = products['brand'].map(brand_stats['smoothed'])
print("\nFinal DataFrame:")
print(products[['product', 'brand', 'rating', 'brand_encoded']])
# Notice Sony (5.0) and Nokia (2.0) are smoothed toward global mean
💡 Explanation: Smoothing formula: (n * category_mean + m * global_mean) / (n + m). Apple (5 samples) with mean 4.48 stays close to 4.48 after smoothing. But Sony (1 sample) with extreme rating 5.0 is heavily smoothed toward global mean (≈4.0), becoming ~4.3. This prevents overfitting: we don't trust Sony's single data point as much as Apple's 5 data points!

Task: Implement proper target encoding using K-fold cross-validation to prevent data leakage in training data, then apply the learned encoding to test data.

# Starter code
from sklearn.model_selection import KFold, train_test_split
import pandas as pd
import numpy as np

# Create sample data
data = pd.DataFrame({
    'store': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'] * 3,
    'sales': np.random.randint(100, 500, 30)
})

# Split into train and test
train_df, test_df = train_test_split(data, test_size=0.3, random_state=42)

# Your code here: 
# 1. K-fold encode training data (no leakage)
# 2. Calculate final encoding from all training data
# 3. Apply to test data
View Solution
from sklearn.model_selection import KFold, train_test_split
import pandas as pd
import numpy as np

# Create sample data
np.random.seed(42)
data = pd.DataFrame({
    'store': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'] * 3,
    'sales': np.random.randint(100, 500, 30)
})

# Split into train and test
train_df, test_df = train_test_split(data, test_size=0.3, random_state=42)
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

# Step 1: K-fold encode training data
train_df['store_encoded'] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, val_idx in kf.split(train_df):
    # Calculate means from training fold only
    fold_means = train_df.iloc[train_idx].groupby('store')['sales'].mean()
    
    # Apply to validation fold
    train_df.loc[train_df.index[val_idx], 'store_encoded'] = \
        train_df.iloc[val_idx]['store'].map(fold_means)

# Fill NaN with global mean
global_mean = train_df['sales'].mean()
train_df['store_encoded'] = train_df['store_encoded'].fillna(global_mean)

# Step 2: Calculate final encoding from ALL training data
final_encoding = train_df.groupby('store')['sales'].mean()

# Step 3: Apply to test data
test_df['store_encoded'] = test_df['store'].map(final_encoding)
test_df['store_encoded'] = test_df['store_encoded'].fillna(global_mean)

print("Training data with K-fold encoding:")
print(train_df.head(10))

print("\nTest data with training-based encoding:")
print(test_df)

print("\nFinal encoding mapping:")
print(final_encoding)
💡 Explanation: This is the proper production approach for target encoding! Step 1: Use k-fold to encode training data without leakage (each row encoded using OTHER folds' means). Step 2: Calculate final encoding from ALL training data. Step 3: Apply that final encoding to test data (test data never influences the encoding). This ensures no data leakage while maximizing the use of training information!
04

Frequency & Binary Encoding

Frequency encoding and binary encoding are powerful alternatives for handling categorical variables, especially with high cardinality. Frequency encoding uses occurrence counts, while binary encoding converts label-encoded integers to binary digits. Both methods balance dimensionality reduction with information preservation.

Frequency Encoding

Replaces each category with its frequency (count or proportion) in the dataset. Categories that appear often get higher values. Simple and effective when frequency correlates with the target.

Binary Encoding

First applies label encoding, then converts each integer to its binary representation across multiple columns. Creates log2(n) columns instead of n, dramatically reducing dimensionality.

Frequency Encoding

Frequency encoding is intuitive: common categories get higher values, rare categories get lower values. This often works well because popular categories may have different behavior patterns than rare ones.

import pandas as pd

# E-commerce order data
orders = pd.DataFrame({
    'order_id': range(1, 16),
    'product_category': ['Electronics', 'Electronics', 'Electronics', 'Electronics', 'Electronics',
                         'Clothing', 'Clothing', 'Clothing', 'Clothing',
                         'Books', 'Books', 'Books',
                         'Toys', 'Toys',
                         'Jewelry']
})

# Count frequency of each category
freq_map = orders['product_category'].value_counts()
print("Category frequencies:")
print(freq_map)
# Output:
# Electronics    5
# Clothing       4
# Books          3
# Toys           2
# Jewelry        1

# Frequency encoding (count)
orders['category_freq_count'] = orders['product_category'].map(freq_map)

# Frequency encoding (proportion)
orders['category_freq_prop'] = orders['product_category'].map(freq_map / len(orders))

print("\nEncoded DataFrame:")
print(orders)
# Output:
#     order_id product_category  category_freq_count  category_freq_prop
# 0          1      Electronics                    5            0.333333
# 1          2      Electronics                    5            0.333333
# ...
# 14        15          Jewelry                    1            0.066667
💡 Explanation: Frequency encoding assigns values based on how often each category appears. Electronics (5 occurrences) gets encoded as 5 or 0.333 (33.3%), while Jewelry (1 occurrence) gets 1 or 0.067 (6.7%). This simple encoding captures popularity: all Electronics orders get the same value (5), making it easy for models to learn that popular categories might behave differently than rare ones.
When Frequency Encoding Shines

Frequency encoding works best when the popularity of a category is predictive. For example, in fraud detection, rare transaction types might be more suspicious. In recommendation systems, popular items may have different click-through rates.
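The fraud intuition can be made explicit by pairing the frequency with a rarity flag. The data and the 15% cutoff below are arbitrary illustrations:

```python
import pandas as pd

txns = pd.DataFrame({
    'txn_type': ['purchase'] * 8 + ['wire', 'chargeback']
})

# Proportion-based frequency encoding
freq = txns['txn_type'].value_counts(normalize=True)
txns['type_freq'] = txns['txn_type'].map(freq)

# Hypothetical rule: anything under a 15% share is "rare" and worth extra scrutiny
txns['is_rare'] = (txns['type_freq'] < 0.15).astype(int)
print(txns.drop_duplicates('txn_type'))
```

Here 'purchase' (80% of rows) gets flagged 0, while the single 'wire' and 'chargeback' rows (10% each) get flagged 1.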

Handling Train/Test Frequency Encoding

When applying frequency encoding to test data, use the frequencies calculated from training data only. This prevents data leakage and handles unseen categories gracefully.

import pandas as pd
from sklearn.model_selection import train_test_split

# Sample data
data = pd.DataFrame({
    'user_id': range(1, 21),
    'browser': ['Chrome', 'Chrome', 'Chrome', 'Chrome', 'Chrome', 'Chrome',
                'Firefox', 'Firefox', 'Firefox', 'Firefox',
                'Safari', 'Safari', 'Safari',
                'Edge', 'Edge',
                'Chrome', 'Firefox', 'Safari', 'Opera', 'IE']  # Opera, IE appear once
})

# Split data
train_df, test_df = train_test_split(data, test_size=0.3, random_state=42)

# Calculate frequencies from TRAINING data only
train_freq = train_df['browser'].value_counts()
print("Training frequencies:")
print(train_freq)

# Apply to both datasets
train_df['browser_freq'] = train_df['browser'].map(train_freq)
test_df['browser_freq'] = test_df['browser'].map(train_freq)

# Handle unseen categories with 0 or median frequency
test_df['browser_freq'] = test_df['browser_freq'].fillna(0)

print("\nTest data with frequency encoding:")
print(test_df)
# Unseen categories get 0
💡 Explanation: Always calculate frequencies from training data only to prevent data leakage! If test data has a browser not seen in training (like Opera appearing only in test), we assign it frequency=0 (or median) rather than calculating its actual test frequency. This ensures the encoding is based solely on training data patterns, making it valid for production deployment.

Binary Encoding

Binary encoding is a clever compromise between label encoding and one-hot encoding. It first assigns integers to categories, then converts those integers to binary representation. For 8 categories (requiring 3 bits), you get 3 columns instead of 8.

import pandas as pd
import numpy as np

# Sample data with 8 categories
colors = pd.DataFrame({
    'item_id': range(1, 9),
    'color': ['Red', 'Blue', 'Green', 'Yellow', 'Orange', 'Purple', 'Pink', 'Brown']
})

# Step 1: Label encode (0-7)
color_map = {color: i for i, color in enumerate(colors['color'].unique())}
colors['label_encoded'] = colors['color'].map(color_map)

# Step 2: Convert to binary (3 bits needed for 8 values)
def int_to_binary_columns(df, column, n_bits):
    """Convert integer column to binary representation."""
    binary_cols = []
    for i in range(n_bits):
        col_name = f'{column}_bit_{i}'
        df[col_name] = (df[column] >> i) & 1
        binary_cols.append(col_name)
    return binary_cols

n_bits = int(np.ceil(np.log2(len(color_map))))  # 3 bits for 8 categories
binary_cols = int_to_binary_columns(colors, 'label_encoded', n_bits)

print(f"Number of bits needed: {n_bits}")
print("\nBinary Encoded DataFrame:")
print(colors)
# Output (bit columns shortened for readability; the actual names
# produced by the function are label_encoded_bit_0..2):
#    item_id   color  label_encoded  bit_0  bit_1  bit_2
# 0        1     Red              0      0      0      0
# 1        2    Blue              1      1      0      0
# 2        3   Green              2      0      1      0
# 3        4  Yellow              3      1      1      0
# 4        5  Orange              4      0      0      1
# 5        6  Purple              5      1      0      1
# 6        7    Pink              6      0      1      1
# 7        8   Brown              7      1      1      1
💡 Explanation: Binary encoding is brilliant for high-cardinality features! Instead of 8 one-hot columns, we only need 3 binary columns (since 2³=8). Each category gets a unique binary pattern: Red=000, Blue=001, Green=010, Yellow=011, Orange=100, Purple=101, Pink=110, Brown=111. For 100 categories, one-hot needs 100 columns but binary only needs 7 (2⁷=128). Huge dimensionality reduction!

Using Category Encoders Library

The category_encoders library provides optimized implementations of both frequency and binary encoding.

# Install: pip install category-encoders
import category_encoders as ce
import pandas as pd

# Sample data
df = pd.DataFrame({
    'city': ['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix', 
             'Philadelphia', 'San Antonio', 'San Diego'],
    'population': [8.3, 4.0, 2.7, 2.3, 1.7, 1.6, 1.5, 1.4]
})

# Binary Encoding
binary_encoder = ce.BinaryEncoder(cols=['city'])
df_binary = binary_encoder.fit_transform(df)
print("Binary Encoded:")
print(df_binary)
# Note: BinaryEncoder's internal ordinal codes start at 1, so 8 cities
# typically yield 4 bit columns rather than log2(8) = 3 -- still far fewer than 8

# Count/Frequency Encoding
count_encoder = ce.CountEncoder(cols=['city'])
df_count = count_encoder.fit_transform(df)
print("\nCount Encoded:")
print(df_count)
💡 Explanation: The category_encoders library makes encoding much easier! BinaryEncoder creates on the order of log₂(n) binary columns automatically; because its internal ordinal codes start at 1, 8 cities typically produce 4 bit columns rather than 3, still far fewer than the 8 one-hot columns. CountEncoder counts category occurrences (like frequency encoding). Both encoders can be saved and reused in production, handle unseen categories, and integrate seamlessly with sklearn pipelines!
| Encoding  | Columns Created | 100 Categories | 1000 Categories | Best Use Case                      |
|-----------|-----------------|----------------|-----------------|------------------------------------|
| One-Hot   | n               | 100 columns    | 1000 columns    | Low cardinality, linear models     |
| Label     | 1               | 1 column       | 1 column        | Tree models, target variable       |
| Binary    | log2(n)         | 7 columns      | 10 columns      | High cardinality, any model        |
| Frequency | 1               | 1 column       | 1 column        | When frequency is predictive       |
| Target    | 1               | 1 column       | 1 column        | High cardinality, supervised tasks |

Choosing the Right Encoding Method

The choice of encoding depends on several factors. Here is a decision guide to help you choose the right method for your situation.

Encoding Decision Guide
  1. Is there a natural order? → Use Ordinal Encoding
  2. Low cardinality (< 15) + Linear Model? → Use One-Hot Encoding
  3. Any cardinality + Tree Model? → Use Label or Ordinal Encoding
  4. High cardinality + Supervised Task? → Use Target Encoding (with regularization)
  5. High cardinality + Frequency matters? → Use Frequency Encoding
  6. High cardinality + Need compromise? → Use Binary Encoding
  7. Memory constrained? → Avoid One-Hot, prefer Label/Binary/Frequency
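The guide above can be condensed into a rough helper. The cardinality threshold of 15 and the return labels are illustrative assumptions, not hard rules:

```python
def suggest_encoding(n_unique, ordered=False, model='tree', supervised=True):
    """Heuristic mirror of the decision guide above (thresholds are illustrative)."""
    if ordered:                                # 1. natural order wins
        return 'ordinal'
    if model == 'linear' and n_unique < 15:    # 2. low cardinality + linear model
        return 'one-hot'
    if model == 'tree':                        # 3. tree models tolerate integer codes
        return 'label or ordinal'
    if supervised:                             # 4. high cardinality, supervised task
        return 'target (with smoothing or k-fold)'
    return 'binary or frequency'               # 5/6. dimensionality compromise

print(suggest_encoding(5, ordered=True))    # ordinal
print(suggest_encoding(8, model='linear'))  # one-hot
print(suggest_encoding(5000, model='tree')) # label or ordinal
```

In practice you would still validate the choice empirically; the helper just keeps the default reasoning in one place.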

Practice Questions

Task: Frequency encode the 'payment_method' column using both count and proportion.

# Starter code
import pandas as pd

transactions = pd.DataFrame({
    'transaction_id': range(1, 13),
    'payment_method': ['Credit Card', 'Credit Card', 'Credit Card', 'Credit Card',
                       'PayPal', 'PayPal', 'PayPal',
                       'Debit Card', 'Debit Card',
                       'Bank Transfer', 'Bank Transfer',
                       'Crypto']
})

# Your code here: Frequency encode with count and proportion
View Solution
import pandas as pd

transactions = pd.DataFrame({
    'transaction_id': range(1, 13),
    'payment_method': ['Credit Card', 'Credit Card', 'Credit Card', 'Credit Card',
                       'PayPal', 'PayPal', 'PayPal',
                       'Debit Card', 'Debit Card',
                       'Bank Transfer', 'Bank Transfer',
                       'Crypto']
})

# Calculate frequency map
freq_map = transactions['payment_method'].value_counts()
print("Frequency counts:")
print(freq_map)

# Frequency encoding - count
transactions['payment_freq_count'] = transactions['payment_method'].map(freq_map)

# Frequency encoding - proportion
total = len(transactions)
transactions['payment_freq_prop'] = transactions['payment_method'].map(freq_map / total)

print("\nEncoded DataFrame:")
print(transactions)
# Credit Card: 4 (0.333), PayPal: 3 (0.25), etc.
💡 Explanation: Frequency encoding is super simple: just count how many times each category appears! Credit Card appears 4 times (33.3%), PayPal 3 times (25%), Debit Card 2 times (16.7%), Bank Transfer 2 times, and Crypto once (8.3%). This single-column encoding captures popularity information. Models can learn that rare payment methods (Crypto) might behave differently than common ones (Credit Card).

Task: Implement binary encoding without using category_encoders. First label encode, then convert to binary columns.

# Starter code
import pandas as pd
import numpy as np

countries = pd.DataFrame({
    'user_id': range(1, 11),
    'country': ['USA', 'Canada', 'UK', 'Germany', 'France', 
                'Japan', 'Australia', 'Brazil', 'India', 'Mexico']
})

# Your code here:
# 1. Label encode the countries (0-9)
# 2. Calculate number of bits needed
# 3. Create binary columns
View Solution
import pandas as pd
import numpy as np

countries = pd.DataFrame({
    'user_id': range(1, 11),
    'country': ['USA', 'Canada', 'UK', 'Germany', 'France', 
                'Japan', 'Australia', 'Brazil', 'India', 'Mexico']
})

# Step 1: Label encode
unique_countries = countries['country'].unique()
country_map = {country: i for i, country in enumerate(unique_countries)}
countries['country_label'] = countries['country'].map(country_map)

print("Label encoding map:")
for country, label in country_map.items():
    print(f"  {country}: {label}")

# Step 2: Calculate bits needed (10 countries need 4 bits: 2^4 = 16)
n_categories = len(unique_countries)
n_bits = int(np.ceil(np.log2(n_categories)))
print(f"\nCategories: {n_categories}, Bits needed: {n_bits}")

# Step 3: Create binary columns
for bit in range(n_bits):
    countries[f'country_bit_{bit}'] = (countries['country_label'] >> bit) & 1

print("\nBinary Encoded DataFrame:")
print(countries)

# Verify: One-hot would need 10 columns, binary only needs 4!
💡 Explanation: Binary encoding converts label-encoded integers to binary representation. For 10 countries, we need 4 bits (2⁴=16 ≥ 10). Each country gets a unique binary pattern across 4 columns. USA(0)=0000, Canada(1)=0001, UK(2)=0010, Germany(3)=0011, etc. This is MUCH better than one-hot encoding which would create 10 columns! With binary, we reduced dimensionality by 60% while preserving all information.

Task: Apply one-hot, label, binary, and frequency encoding to the same categorical column. Compare the resulting number of features and summarize when to use each.

# Starter code
import pandas as pd
import numpy as np

# 20 different product brands
np.random.seed(42)
brands = [f'Brand_{chr(65+i)}' for i in range(20)]  # Brand_A to Brand_T

products = pd.DataFrame({
    'product_id': range(1, 101),
    'brand': np.random.choice(brands, 100)
})

# Your code here:
# 1. One-hot encode
# 2. Label encode  
# 3. Binary encode
# 4. Frequency encode
# 5. Print comparison of dimensions
View Solution
import pandas as pd
import numpy as np

# 20 different product brands
np.random.seed(42)
brands = [f'Brand_{chr(65+i)}' for i in range(20)]  # Brand_A to Brand_T

products = pd.DataFrame({
    'product_id': range(1, 101),
    'brand': np.random.choice(brands, 100)
})

print(f"Original data: {products.shape[0]} rows, {len(products['brand'].unique())} unique brands\n")

# 1. One-Hot Encoding
df_onehot = pd.get_dummies(products, columns=['brand'], prefix='brand')
onehot_cols = len([c for c in df_onehot.columns if c.startswith('brand_')])
print(f"One-Hot Encoding: {onehot_cols} new columns")

# 2. Label Encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
products['brand_label'] = le.fit_transform(products['brand'])
print(f"Label Encoding: 1 new column")

# 3. Binary Encoding
n_bits = int(np.ceil(np.log2(len(brands))))  # 5 bits for 20 brands
for bit in range(n_bits):
    products[f'brand_bit_{bit}'] = (products['brand_label'] >> bit) & 1
print(f"Binary Encoding: {n_bits} new columns")

# 4. Frequency Encoding
freq_map = products['brand'].value_counts() / len(products)
products['brand_freq'] = products['brand'].map(freq_map)
print(f"Frequency Encoding: 1 new column")

# Summary comparison
print("\n" + "="*50)
print("ENCODING COMPARISON SUMMARY")
print("="*50)
print(f"{'Encoding':<20} {'Columns':<10} {'Best For'}")
print("-"*50)
print(f"{'One-Hot':<20} {onehot_cols:<10} Linear models, low cardinality")
print(f"{'Label':<20} {1:<10} Tree models, target encoding base")
print(f"{'Binary':<20} {n_bits:<10} High cardinality, dimension reduction")
print(f"{'Frequency':<20} {1:<10} When frequency correlates with target")

# Show sample of final dataframe
print("\nSample encoded data:")
print(products.head())
💡 Explanation: This comparison reveals the dimensionality tradeoff! For 20 brands: One-Hot creates 20 columns (massive for 1000+ categories), Label creates 1 column (but loses information), Binary creates only 5 columns (log₂(20) ≈ 4.32 → 5 bits), and Frequency creates 1 column (with popularity info). Choose based on: cardinality (how many categories?), model type (linear vs tree?), and memory constraints. For 1000 categories, one-hot is impossible but binary only needs 10 columns!

Interactive Demo

Explore how different encoding methods transform categorical data in real-time. Use these interactive tools to visualize the differences between one-hot, label, binary, and frequency encoding.

Encoding Comparison Tool

Enter comma-separated categories to see how each encoding method transforms them.


Cardinality Impact Visualizer

See how the number of unique categories affects the dimensionality of different encoding methods.

Example with 10 unique categories: One-Hot creates 10 columns, Binary creates 4, and Label, Frequency, and Target each create 1. With 10 categories, one-hot encoding is reasonable. Consider binary encoding if dimensionality becomes a concern.

Key Takeaways

One-Hot Encoding

Creates binary columns for each category. Best for nominal variables with low cardinality. Beware of the dummy variable trap.
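The dummy variable trap (the nth column is perfectly predictable from the other n-1, causing multicollinearity in linear models) is avoided by dropping one column; a quick sketch with `drop_first=True`:

```python
import pandas as pd

df = pd.DataFrame({'city': ['NYC', 'Boston', 'Chicago', 'NYC']})

# n-1 columns: the first category (alphabetically, Boston) becomes the implicit baseline
dummies = pd.get_dummies(df['city'], prefix='city', drop_first=True)
print(dummies.columns.tolist())  # ['city_Chicago', 'city_NYC']
```

A row of all zeros then means "Boston", so no information is lost.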

Label & Ordinal

Label encoding assigns integers arbitrarily. Ordinal encoding respects natural order. Use ordinal for ranked categories like education levels.

Target Encoding

Replaces categories with target mean. Powerful for high cardinality but prone to overfitting. Use smoothing and cross-validation.

Frequency Encoding

Encodes categories by their occurrence count. Simple and effective when frequency correlates with target. No explosion of dimensions.

Binary Encoding

Combines label encoding with binary representation. Reduces dimensions compared to one-hot. Great for high-cardinality features.

Choosing Wisely

Consider cardinality, ordinality, and model type. Tree-based models handle label encoding well. Linear models often need one-hot.

Knowledge Check

Quick Quiz

Test what you've learned about categorical encoding techniques

1 What is the main disadvantage of one-hot encoding for high-cardinality features?
2 When should you use ordinal encoding instead of label encoding?
3 What is the primary risk of target encoding?
4 How does frequency encoding transform categorical values?
5 Why is binary encoding more efficient than one-hot encoding for high-cardinality features?
6 Which encoding method is generally preferred for tree-based models like Random Forest?