Module 8.4

Feature Selection

Learn how to identify and select the most important features for your machine learning models. Master filter, wrapper, and embedded methods to improve model performance and reduce complexity!

40 min read
Intermediate
Hands-on Examples
What You'll Learn
  • Filter methods with correlation & chi-square
  • Recursive Feature Elimination (RFE)
  • Lasso regularization for feature selection
  • Tree-based feature importance
  • Detecting and handling multicollinearity
Contents
01

Filter Methods

Real-World Analogy: Job Application Screening
Imagine you're hiring for a position and receive 1,000 applications. You can't interview everyone, so you create a quick screening checklist: "Has 5+ years experience?" "Has relevant degree?" "Lives within 50 miles?" This is exactly how filter methods work! They apply simple statistical tests (like correlation or variance) to quickly screen features without actually training a model. Just like HR screens resumes before interviews, filter methods screen features before model training. It's fast but might miss candidates (features) who don't fit the rigid criteria but would perform well in the actual job (model)!
Feature Selection Method

Filter Methods

Filter methods evaluate and rank features based on statistical measures (correlation, variance, chi-square, mutual information) independently of any machine learning model. They're called "filters" because they act as a preprocessing screen, filtering out irrelevant or low-information features before model training.

Core Principle: Assess each feature's intrinsic relationship with the target variable using univariate statistical tests. Features that pass the threshold (high correlation, low p-value, high mutual information) are selected; others are discarded. This makes filter methods fast and scalable but potentially less accurate than model-based approaches.

Filter methods are the fastest and simplest approach to feature selection. They're perfect for initial exploration, high-dimensional data, and when you need a model-agnostic solution that works across different algorithms.
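The core principle can be sketched in a few lines: score every feature with a univariate test, then keep the ones that pass a cutoff. Here's a minimal illustration on the diabetes dataset (the 0.05 p-value cutoff is an assumed convention, not a universal rule):

```python
# Minimal filter-method sketch: score each feature with a univariate test,
# then keep only the features that pass a significance threshold.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import f_regression

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# f_regression runs a univariate linear F-test per feature - no model training
f_scores, p_values = f_regression(X, y)

# Keep features whose test is significant at the (assumed) 0.05 level
selected = [name for name, p in zip(diabetes.feature_names, p_values) if p < 0.05]
print(f"Selected {len(selected)}/{X.shape[1]} features: {selected}")
```

Notice that the whole selection step runs in milliseconds - exactly the "screening before interviews" speed advantage described above.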

Why Use Filter Methods?

Advantages
  • Fast: No model training required
  • Scalable: Works with high-dimensional data
  • Model-agnostic: Results apply to any algorithm
  • Avoids overfitting: Selection is independent of model
  • Good for preprocessing: Quick initial screening
Limitations
  • Ignores feature interactions: Evaluates features independently
  • May miss relevant features: Not optimized for specific model
  • Threshold selection: Need to decide cutoff values
  • May remove redundant but informative features

Correlation-Based Selection

Correlation measures the strength and direction of the linear relationship between two variables. When selecting features, we calculate how each feature correlates with the target variable. Features with high absolute correlation (close to 1 or -1) are typically more predictive.

Understanding Correlation Values:
  • +1.0: Perfect positive correlation (as feature ↑, target ↑)
  • -1.0: Perfect negative correlation (as feature ↑, target ↓)
  • 0.0: No linear relationship (but may have non-linear relationship!)
  • 0.7-1.0: Strong correlation (usually predictive)
  • 0.3-0.7: Moderate correlation (consider keeping)
  • 0.0-0.3: Weak correlation (candidate for removal)
# WHY? We want to find features with strong linear relationships to the target
# WHAT? Calculate Pearson correlation coefficient for each feature
import pandas as pd
import numpy as np
from sklearn.datasets import load_diabetes
import seaborn as sns
import matplotlib.pyplot as plt

# STEP 1: Load sample dataset (diabetes progression prediction)
# DATASET: 10 features (age, sex, bmi, blood pressure, etc.) predicting disease progression
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df['target'] = diabetes.target

print("Dataset shape:", df.shape)  # (442 patients, 11 columns including target)
print("\nFirst few rows:")
print(df.head())

# STEP 2: Calculate correlation with target variable
# .corr() computes pairwise correlation of all columns
# ['target'] extracts just correlations with target
# .drop('target') removes target's self-correlation (always 1.0)
# .abs() takes absolute value (we care about strength, not direction)
# .sort_values(ascending=False) ranks from strongest to weakest
correlations = df.corr()['target'].drop('target').abs().sort_values(ascending=False)

print("\nFeature Correlations with Target (Absolute Values):")
print("="*50)
print(correlations)
print("\n" + "="*50)

# STEP 3: Select features above threshold
# THRESHOLD CHOICE: Common practice is 0.3 (moderate correlation)
# WHY 0.3? Below this, linear relationship is too weak to be reliably predictive
threshold = 0.3
selected_features = correlations[correlations > threshold].index.tolist()

signed_corr = df.corr()['target']  # Compute signed correlations once, not inside the loop
print(f"\nFeatures with |correlation| > {threshold}:")
for feat in selected_features:
    corr_val = signed_corr[feat]  # Original (signed) correlation
    direction = "↑ positive" if corr_val > 0 else "↓ negative"
    print(f"  {feat:10} = {corr_val:+.3f}  ({direction} relationship)")

print(f"\n✓ Selected {len(selected_features)} out of {len(correlations)} features")
print(f"✗ Rejected {len(correlations) - len(selected_features)} features")
Explanation: This code calculates how each feature correlates with the target disease progression score. Features like bmi (0.586), s5 (0.566), and bp (blood pressure, 0.441) show moderate-to-strong correlations, meaning they have clear linear relationships with disease progression. Features like sex (0.043) barely correlate and are unlikely to help a linear model. By setting a threshold of 0.3, we automatically select the 6 most correlated features, reducing dimensionality from 10 to 6. This speeds up training and can reduce overfitting!
Output
Feature Correlations with Target (Absolute Values):
bmi    0.586
s5     0.566
bp     0.441
s4     0.430
s3     0.395
s6     0.382
s1     0.212
age    0.188
s2     0.174
sex    0.043

==================================================

Features with |correlation| > 0.3:
['bmi', 's5', 'bp', 's4', 's3', 's6']

Visualizing Correlations

Visual representations help identify patterns that numbers alone might miss. Let's create two powerful visualizations:

# VISUALIZATION 1: Correlation Heatmap
# WHY? See relationships between ALL features (not just with target)
# IDENTIFIES: Multicollinearity (highly correlated feature pairs)

plt.figure(figsize=(12, 8))
correlation_matrix = df.corr()  # All pairwise correlations

# Create mask for upper triangle (avoid duplicate information)
# WHY? Correlation matrix is symmetric: corr(A,B) = corr(B,A)
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

sns.heatmap(correlation_matrix, 
            mask=mask,              # Hide upper triangle
            annot=True,             # Show correlation values
            cmap='RdBu_r',          # Red (negative) to Blue (positive)
            center=0,               # White at 0 (no correlation)
            fmt='.2f',              # 2 decimal places
            linewidths=0.5,         # Grid lines between cells
            cbar_kws={'label': 'Correlation Coefficient'})

plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# VISUALIZATION 2: Bar plot of target correlations
# WHY? Quickly see which features meet your threshold
# IDENTIFIES: Candidates for removal (red bars below threshold)

plt.figure(figsize=(10, 6))
colors = ['green' if x > threshold else 'red' for x in correlations]
correlations.plot(kind='barh', color=colors)
plt.axvline(x=threshold, color='black', linestyle='--', 
            label=f'Threshold ({threshold})', linewidth=2)
plt.xlabel('Absolute Correlation with Target', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.title('Feature Correlation Ranking', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

# INTERPRETATION:
# - Heatmap: Dark blue/red cells = strong correlations (watch for feature pairs!)
# - Bar plot: Green bars = selected features, Red bars = rejected features
Explanation: The heatmap reveals not just feature-target correlations, but also inter-feature correlations. If two features are highly correlated with each other (e.g., 0.8+), they carry redundant information - you might only need one! This is called multicollinearity and can hurt model interpretability. The bar plot makes it crystal clear which features pass your threshold (green) and which don't (red). Features like sex and age appear in red, suggesting they have weak linear relationships with diabetes progression.
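Acting on the multicollinearity observation usually means dropping one feature from each highly correlated pair. A minimal sketch on synthetic data (the 0.8 cutoff is an assumed convention, and the tiny DataFrame is fabricated for illustration):

```python
# Multicollinearity pruning: drop one feature from each highly correlated pair
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df_demo = pd.DataFrame({
    'a': a,
    'b': a * 0.9 + rng.normal(scale=0.1, size=200),  # nearly a duplicate of 'a'
    'c': rng.normal(size=200),                       # independent feature
})

corr = df_demo.corr().abs()
# Keep only the upper triangle so each pair is checked exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]

print(f"Dropping: {to_drop}")  # 'b' is redundant with 'a'
df_reduced = df_demo.drop(columns=to_drop)
print(f"Remaining: {list(df_reduced.columns)}")
```

The upper-triangle mask mirrors the heatmap trick above: since corr(A,B) = corr(B,A), checking each pair once avoids dropping both members of a pair.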

Pearson vs. Spearman Correlation

Not all relationships are linear! Pearson correlation only captures straight-line relationships, while Spearman correlation can detect any monotonic pattern (consistently increasing or decreasing).

Pearson Correlation

Measures: Linear relationship

Best for: Continuous variables with linear relationships

Assumptions: Normal distribution, no outliers

# Pearson correlation (default)
pearson_corr = df.corr(method='pearson')['target']
print("Pearson correlations:")
print(pearson_corr.drop('target').abs().sort_values(ascending=False))
Spearman Correlation

Measures: Monotonic relationship (rank-based)

Best for: Non-linear relationships, ordinal data

Assumptions: None (robust to outliers)

# Spearman correlation (rank-based)
spearman_corr = df.corr(method='spearman')['target']
print("Spearman correlations:")
print(spearman_corr.drop('target').abs().sort_values(ascending=False))
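To see the difference concretely, compare the two coefficients on a monotonic but non-linear curve. This sketch uses scipy.stats; y = e^x is chosen purely for illustration:

```python
# Pearson vs Spearman on a monotonic, non-linear relationship
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(0, 10, 100)
y = np.exp(x)  # strictly increasing, but far from a straight line

pearson_r, _ = pearsonr(x, y)
spearman_rho, _ = spearmanr(x, y)

# Spearman works on ranks, so any strictly increasing curve gives rho = 1.0;
# Pearson is dragged down because the relationship is not linear.
print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_rho:.3f}")
```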

Chi-Square Test for Categorical Features

When working with categorical features and categorical targets, correlation doesn't make sense (you can't correlate "Red" vs "Blue" with "Yes" vs "No"). Instead, we use the chi-square (χ²) test of independence, which measures whether two categorical variables are independent or associated.

Chi-Square Test Logic:
  • Null Hypothesis (H₀): The feature and target are independent (no relationship)
  • Alternative Hypothesis (H₁): The feature and target are associated (related)
  • High χ² score: Large difference between observed and expected frequencies → likely associated
  • Low p-value (< 0.05): Reject H₀ → feature is statistically significant
  • Example: If "Education" and "Promotion" are independent, PhDs and high schoolers would be promoted at equal rates. Chi-square tests if the actual data differs significantly from this expectation.
# WHY CHI-SQUARE? Perfect for categorical feature → categorical target relationships
# WHAT? Tests if feature and target are independent (H₀) or associated (H₁)
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# STEP 1: Create sample categorical data - Employee promotion scenario
# SCENARIO: Can we predict "promoted" from education/experience/department?
data = {
    'education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 
                  'Master', 'PhD', 'High School', 'Bachelor', 'Master'],
    'experience': ['Junior', 'Mid', 'Senior', 'Senior', 'Junior',
                   'Mid', 'Senior', 'Junior', 'Mid', 'Senior'],
    'department': ['Sales', 'IT', 'IT', 'Research', 'Marketing',
                   'IT', 'Research', 'Sales', 'Marketing', 'IT'],
    'promoted': ['No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes']
}
df_cat = pd.DataFrame(data)

print("Original Categorical Data:")
print(df_cat.head())
print()

# STEP 2: Encode categorical variables to numeric
# WHY? sklearn's chi2 requires non-negative numeric input
# CAVEAT: sklearn's chi2 treats values as frequencies, so label encoding is a
#         rough shortcut - one-hot encoding is the statistically cleaner choice
# HOW? LabelEncoder: 'Bachelor'=0, 'Master'=1, 'PhD'=2, etc.
le = LabelEncoder()
df_encoded = df_cat.apply(le.fit_transform)

print("Encoded Data:")
print(df_encoded.head())
print()

# STEP 3: Separate features and target
X = df_encoded.drop('promoted', axis=1)  # 3 features: education, experience, department
y = df_encoded['promoted']                # Target: 0=Not Promoted, 1=Promoted

# STEP 4: Apply chi-square test
# RETURNS:
#   chi2_scores: How much observed frequencies differ from expected (higher = more associated)
#   p_values: Probability of seeing this difference by chance (lower = more significant)
chi2_scores, p_values = chi2(X, y)

# STEP 5: Create results dataframe
chi2_results = pd.DataFrame({
    'Feature': X.columns,
    'Chi2_Score': chi2_scores,
    'P_Value': p_values
}).sort_values('Chi2_Score', ascending=False)

print("Chi-Square Test Results:")
print("="*60)
for _, row in chi2_results.iterrows():
    significance = "✓ SIGNIFICANT" if row['P_Value'] < 0.05 else "✗ Not significant"
    print(f"{row['Feature']:12} | χ²={row['Chi2_Score']:6.3f} | p={row['P_Value']:.4f} | {significance}")
print("="*60)

# STEP 6: Select features with p-value < 0.05 (95% confidence)
# INTERPRETATION: p < 0.05 means "less than 5% chance this relationship is random"
significant_features = chi2_results[chi2_results['P_Value'] < 0.05]['Feature'].tolist()
print(f"\nStatistically significant features (p < 0.05):")
if significant_features:
    print(f"✓ Selected: {significant_features}")
    print(f"  These features have a proven association with promotion status!")
else:
    print("✗ None found - need more data or different features")
Explanation: The chi-square test examines whether the distribution of promoted/not-promoted differs across categories. For example, if education is significant (p < 0.05), it means PhDs are promoted at a statistically different rate than high school grads - not just by random chance! A high χ² score indicates large deviations from independence (if they were independent, promotion rates would be the same for all education levels). A low p-value (< 0.05) gives us confidence that this pattern is real, not a fluke in our sample. Features with high χ² and low p-value are strong predictors for classification models!
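Under the hood, χ² compares observed counts against the counts expected under independence, and scipy exposes both directly. A sketch on a hand-built 2×2 table (the counts are made up for illustration):

```python
# Chi-square test of independence on a hand-built contingency table
from scipy.stats import chi2_contingency

# Rows: education level (e.g. no degree / degree); columns: promoted (no / yes)
observed = [[30, 10],
            [10, 30]]

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print(f"Chi2 = {chi2_stat:.2f}, p = {p_value:.4f}, dof = {dof}")
print("Expected counts under independence:")
print(expected)  # all 20s here: equal promotion rates if the variables were independent
```

The large gap between observed (30/10 vs 10/30) and expected (20 everywhere) is exactly what drives the χ² statistic up and the p-value down.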

SelectKBest: Automated Feature Selection

Scikit-learn's SelectKBest is a convenient wrapper that automatically selects the top K features based on univariate statistical tests. Instead of manually calculating scores and applying thresholds, it does everything in one step!

How SelectKBest Works:
  1. Choose a scoring function (f_classif, chi2, mutual_info_classif, etc.)
  2. Specify K - how many top features you want
  3. fit_transform() - calculates scores and returns only the K best features
  4. get_support() - shows which features were selected (boolean mask or indices)
# WHY SELECTKBEST? Automates the "score features, rank them, select top K" workflow
# WHAT? Wrapper around statistical tests - picks top K features in one line
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.datasets import make_classification

# STEP 1: Create synthetic classification dataset
# PARAMETERS:
#   n_samples=1000: 1000 data points
#   n_features=20: 20 total features
#   n_informative=5: Only 5 are truly predictive!
#   n_redundant=5: 5 are linear combinations of informative ones
#   n_classes=2: Binary classification (0 or 1)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=5, n_classes=2, random_state=42)

feature_names = [f'feature_{i}' for i in range(20)]

print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Target distribution: Class 0={sum(y==0)}, Class 1={sum(y==1)}")
print()

# STEP 2: Select top 10 features using ANOVA F-test
# SCORING FUNCTION: f_classif (ANOVA F-statistic)
# WHY? Tests if feature means differ between classes (good for numerical → categorical)
# K=10: We want the 10 most predictive features out of 20
selector_ftest = SelectKBest(score_func=f_classif, k=10)
X_selected_ftest = selector_ftest.fit_transform(X, y)

# STEP 3: Get selected feature indices and scores
# get_support(indices=True): Returns indices [0, 1, 3, 4, ...] of selected features
# .scores_: The F-statistic score for each feature
selected_indices = selector_ftest.get_support(indices=True)
scores = selector_ftest.scores_

print("ANOVA F-Test Results:")
print("="*70)
print(f"{'Feature':<15} {'F-Score':>10} {'Status':>15} {'Rank':>10}")
print("="*70)

# Sort by score (descending) to show ranking
for rank, (i, score) in enumerate(sorted(enumerate(scores), key=lambda x: x[1], reverse=True), 1):
    status = "✓ SELECTED" if i in selected_indices else "✗ Removed"
    print(f"{feature_names[i]:<15} {score:>10.2f} {status:>15} {rank:>10}")

print("="*70)
print(f"\nOriginal features: {X.shape[1]}")
print(f"Selected features: {X_selected_ftest.shape[1]}")
print(f"Dimensionality reduction: {X.shape[1]} → {X_selected_ftest.shape[1]} ({X_selected_ftest.shape[1]/X.shape[1]*100:.0f}% of original)")
Explanation: SelectKBest with f_classif performs ANOVA F-tests, which check if the feature's mean value differs between classes. High F-scores mean "this feature has very different average values for Class 0 vs Class 1" - a strong signal for classification! The output shows feature_3 has the highest F-score (156.42), making it the #1 most discriminative feature. By selecting K=10, we've halved the feature space from 20 to 10 dimensions, which cuts training cost and may reduce overfitting. The "Removed" features (ranks 11-20) had weak signals and would likely just add noise to the model.
Output
ANOVA F-Test Results:
----------------------------------------
feature_3: 156.42 ✓ SELECTED
feature_1: 142.18 ✓ SELECTED
feature_4: 128.76 ✓ SELECTED
feature_0: 98.32 ✓ SELECTED
feature_7: 87.21 ✓ SELECTED
feature_2: 45.67 ✓ SELECTED
feature_9: 42.11 ✓ SELECTED
feature_6: 38.94 ✓ SELECTED
feature_8: 31.22 ✓ SELECTED
feature_5: 28.45 ✓ SELECTED
feature_11: 12.33 ✗ Removed
feature_14: 8.76 ✗ Removed
...

Original features: 20
Selected features: 10
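As a sanity check on what f_classif actually computes: its score for a single feature matches a one-way ANOVA on that feature's values split by class. A quick verification sketch against scipy.stats.f_oneway:

```python
# f_classif is a per-feature one-way ANOVA: verify against scipy directly
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

f_sklearn, _ = f_classif(X, y)

# Recompute feature 0's F-statistic by hand: split its values by class
f_scipy, _ = f_oneway(X[y == 0, 0], X[y == 1, 0])

print(f"sklearn f_classif: {f_sklearn[0]:.4f}")
print(f"scipy   f_oneway:  {f_scipy:.4f}")  # same statistic
```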

Mutual Information for Non-Linear Relationships

Mutual Information (MI) is a more sophisticated measure than correlation. While correlation only detects linear relationships, mutual information captures any statistical dependency - including complex non-linear patterns like parabolic, sinusoidal, or exponential relationships.

Correlation (Pearson)

Detects: Linear relationships only (y = mx + b)

Example fails (for x spanning a range symmetric about 0):

  • y = x² (parabola) → correlation ≈ 0
  • y = sin(x) → correlation ≈ 0
  • y = |x| (V-shape) → correlation ≈ 0

Misses non-linear patterns!

Mutual Information

Detects: Any statistical dependency (linear or non-linear)

Captures:

  • y = x² (parabola) → high MI score
  • y = sin(x) → high MI score
  • y = |x| (V-shape) → high MI score

Finds hidden patterns!

When to Use Mutual Information:
  • Non-linear data: Features with curved, exponential, or periodic relationships
  • Mixed data types: Works with both continuous and categorical features
  • Safety net: Won't miss important features that correlation would ignore
  • Drawback: Slower to compute than correlation (uses nearest-neighbor estimation)
  • Interpretation: MI=0 means independent, MI>0 means dependent (higher = stronger)
# WHY MUTUAL INFORMATION? Finds features with ANY relationship (not just linear)
# WHAT? Measures how much knowing X reduces uncertainty about Y
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.datasets import load_diabetes
import pandas as pd
import numpy as np

# STEP 1: Load diabetes dataset (regression task)
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target

print(f"Dataset: {X.shape[0]} patients, {X.shape[1]} features")
print(f"Target: Disease progression (continuous value)\n")

# STEP 2: Calculate mutual information (for regression)
# FUNCTION: mutual_info_regression (for continuous targets)
# NOTE: Use mutual_info_classif for categorical targets
# random_state=42: Makes results reproducible (MI uses randomness internally)
mi_scores = mutual_info_regression(X, y, random_state=42)

# STEP 3: Create results dataframe
mi_results = pd.DataFrame({
    'Feature': X.columns,
    'MI_Score': mi_scores
}).sort_values('MI_Score', ascending=False)

print("Mutual Information Scores (Higher = More Informative):")
print("="*60)
for _, row in mi_results.iterrows():
    # Create visual bar using █ characters (scaled by MI score)
    bar_length = int(row['MI_Score'] * 20)
    bar = "█" * bar_length
    print(f"{row['Feature']:8} {row['MI_Score']:.3f} {bar}")
print("="*60)

# STEP 4: Compare with correlation
# GOAL: See which features have linear (correlation) vs non-linear (MI only) relationships
correlations = X.apply(lambda x: np.abs(np.corrcoef(x, y)[0, 1]))
comparison = pd.DataFrame({
    'Feature': X.columns,
    'Correlation': correlations,
    'Mutual_Info': mi_scores,
    'Diff': mi_scores - correlations  # Positive = MI found something correlation missed!
}).sort_values('Mutual_Info', ascending=False)

print("\nCorrelation vs Mutual Information Comparison:")
print("="*70)
print(f"{'Feature':<10} {'Correlation':>12} {'Mutual Info':>12} {'Difference':>12} {'Insight'}")
print("="*70)
for _, row in comparison.iterrows():
    diff = row['Diff']
    if diff > 0.1:
        insight = "Non-linear!"
    elif row['Mutual_Info'] > 0.3:
        insight = "Strong signal"
    else:
        insight = "Weak signal"
    print(f"{row['Feature']:<10} {row['Correlation']:>12.3f} {row['Mutual_Info']:>12.3f} {diff:>+12.3f} {insight}")
print("="*70)
Explanation: This comparison reveals which features have non-linear relationships! If a feature has high MI but low correlation (large positive "Difference"), it means the feature is predictive but through a non-linear pattern. For example, if age had MI=0.4 but correlation=0.1, it might mean "disease progression is high for both very young and very old patients" (U-shaped relationship). Correlation would miss this pattern entirely (averaging out the ups and downs), but MI captures it. The bar chart visualization makes it easy to spot the most informative features - longer bars = more predictive power!
When to Use Which Method

Method                   Feature Type               Target Type    Relationship
f_classif                Numerical                  Categorical    Linear
f_regression             Numerical                  Numerical      Linear
chi2                     Non-negative/Categorical   Categorical    Independence
mutual_info_classif      Any                        Categorical    Any (non-linear)
mutual_info_regression   Any                        Numerical      Any (non-linear)

Variance Threshold: Remove Low-Variance Features

Features with very low variance (near-constant values across samples) carry almost no information. If a feature has the same value 99% of the time, it won't help distinguish between different samples! VarianceThreshold automatically removes such "boring" features.

Why Remove Low-Variance Features:
  • No information: If feature value is almost always the same, it can't predict anything
  • Example: "Has Email Address" in a dataset where 100% of users have emails = useless
  • Causes issues: Can break certain models (e.g., scaling divides by std, which is ~0)
  • Wastes compute: Why train on features that add no value?
  • Common threshold: Remove features with variance < 0.01 or features with >95% same value
# WHY? Features with no variation can't help distinguish between samples
# WHAT? Remove features where variance is below a threshold
from sklearn.feature_selection import VarianceThreshold
import numpy as np
import pandas as pd

# STEP 1: Create sample data with different variance levels
np.random.seed(42)
X = pd.DataFrame({
    'feature_high_var': np.random.randn(100),           # High variance (~1.0)
    'feature_medium_var': np.random.randn(100) * 0.5,   # Medium variance (~0.25)
    'feature_low_var': np.random.randn(100) * 0.1,      # Low variance (~0.01)
    'feature_constant': np.ones(100),                    # Zero variance (all 1's)
    'feature_nearly_constant': np.concatenate([          # Very low variance
        np.ones(95), np.zeros(5)  # 95% are 1's, only 5% are 0's
    ])
})

print("Dataset shape:", X.shape)
print("\nOriginal variances:")
print("="*50)
for col in X.columns:
    var = X[col].var()
    unique_pct = len(X[col].unique()) / len(X[col]) * 100
    print(f"{col:25} var={var:.4f}  unique={unique_pct:5.1f}%")
print("="*50)
print()

# STEP 2: Remove features with variance < 0.1
# THRESHOLD=0.1: Features with variance below this are removed
# WHY 0.1? Common choice - catches near-constant features while keeping informative ones
selector = VarianceThreshold(threshold=0.1)
X_selected = selector.fit_transform(X)

# STEP 3: Get which features were selected/removed
selected_mask = selector.get_support()
selected_features = X.columns[selected_mask].tolist()
removed_features = X.columns[~selected_mask].tolist()

print(f"✓ Selected features (variance ≥ 0.1):")
for feat in selected_features:
    print(f"  - {feat:25} (var={X[feat].var():.4f})")

print(f"\n✗ Removed features (variance < 0.1):")
for feat in removed_features:
    print(f"  - {feat:25} (var={X[feat].var():.4f})")

print(f"\nDimensionality: {X.shape} → {X_selected.shape}")
print(f"Reduction: {len(removed_features)}/{X.shape[1]} features removed ({len(removed_features)/X.shape[1]*100:.0f}%)")
Explanation: This code identifies and removes "boring" features that barely vary across samples. feature_constant is all 1's (variance=0) - completely useless! feature_nearly_constant is 95% ones and only 5% zeros (variance=0.048) - almost useless. feature_low_var has variance=0.009, well below our threshold of 0.1. These features can't distinguish between samples because they're nearly the same everywhere. Meanwhile, feature_high_var (variance=1.043) and feature_medium_var (variance=0.248) have enough variation to be informative. By removing 3 of 5 features, we've simplified our dataset without losing useful information!
Output
Original variances:
feature_high_var          1.043
feature_medium_var        0.248
feature_low_var           0.009
feature_constant          0.000
feature_nearly_constant   0.048

Selected features: ['feature_high_var', 'feature_medium_var']
Removed features: ['feature_low_var', 'feature_constant', 'feature_nearly_constant']

Shape: (100, 5) → (100, 2)

Complete Filter Methods Pipeline

from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Create dataset with many features
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                           n_redundant=10, random_state=42)

print(f"Original shape: {X.shape}")

# Build filter selection pipeline
filter_pipeline = Pipeline([
    ('variance', VarianceThreshold(threshold=0.01)),  # Remove near-constant
    ('scaling', StandardScaler()),                     # Scale features
    ('select_k', SelectKBest(f_classif, k=15)),       # Select top 15
    ('classifier', LogisticRegression(max_iter=1000))
])

# Evaluate with cross-validation
scores = cross_val_score(filter_pipeline, X, y, cv=5, scoring='accuracy')

print(f"Cross-validation accuracy: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")

# Compare with no feature selection
baseline_pipeline = Pipeline([
    ('scaling', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])

baseline_scores = cross_val_score(baseline_pipeline, X, y, cv=5, scoring='accuracy')
print(f"Baseline accuracy (all features): {baseline_scores.mean():.3f} (+/- {baseline_scores.std()*2:.3f})")

Practice Questions: Filter Methods

Test your understanding with these hands-on exercises.

Task: What is the main advantage of filter methods over wrapper methods?

View Solution

The main advantage of filter methods is computational efficiency. They evaluate features based on statistical properties without training any machine learning model, making them:

  • Fast: Can handle high-dimensional data quickly
  • Model-agnostic: Selected features work with any algorithm
  • Less prone to overfitting: Selection is independent of model

Wrapper methods require training models multiple times, which can be slow for large datasets.

Task: Explain when chi-square vs Pearson correlation is appropriate.

Show Solution
# Chi-square: Use for categorical features & target
# Example: Testing if gender is related to churn
# (X_categorical is a placeholder: a label-encoded matrix of categorical features)
from sklearn.feature_selection import chi2, SelectKBest
selector = SelectKBest(chi2, k=5)
X_selected = selector.fit_transform(X_categorical, y)

# Pearson: Use for continuous features & target
# Example: Testing if age correlates with income
corr = df[['age', 'income']].corr()
print(corr)

Task: Explain when mutual information is better than correlation.

Show Solution
# Mutual info captures NON-LINEAR relationships
# Correlation only measures linear relationships

import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Example: y = x^2 has ZERO correlation but HIGH MI
x = np.linspace(-10, 10, 100)
y = x ** 2

# Correlation is near 0 (non-linear relationship)
print(f"Correlation: {np.corrcoef(x, y)[0,1]:.3f}")

# MI is high (strong dependency)
mi = mutual_info_regression(x.reshape(-1,1), y)
print(f"Mutual Info: {mi[0]:.3f}")

Task: Use SelectKBest with mutual_info_classif to select top 5 features.

Show Solution
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.datasets import make_classification
import pandas as pd

# Create sample dataset
X, y = make_classification(n_samples=500, n_features=20, 
                           n_informative=5, random_state=42)
feature_names = [f'feature_{i}' for i in range(20)]

# Select top 5 features using mutual information
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, y)

# Get selected feature names
selected_mask = selector.get_support()
selected_features = [f for f, s in zip(feature_names, selected_mask) if s]
print(f"Selected: {selected_features}")
print(f"Shape: {X.shape} → {X_selected.shape}")
02

Wrapper Methods (RFE)

Real-World Analogy: Restaurant Menu Optimization
Imagine you're a restaurant owner with 50 dishes on your menu, but customers only love 10 of them. You could survey customers (filter method - quick but might miss combinations), OR you could try different menu combinations and measure actual sales (wrapper method - accurate but time-consuming). Recursive Feature Elimination (RFE) is like starting with all 50 dishes, tracking which ones sell worst each week, eliminating them one by one, and repeating until you're left with the winning 10. Each week you "train" by serving customers and "test" by measuring satisfaction. It's slower than just asking opinions, but it reveals which dishes work together to maximize revenue!
Feature Selection Method

Wrapper Methods (RFE)

Wrapper methods treat the machine learning model as a "black box" and use its performance to evaluate feature subsets. The most popular wrapper method is Recursive Feature Elimination (RFE), which repeatedly trains models, ranks features by importance, and eliminates the weakest features until the optimal subset remains.

Key Advantage: Wrapper methods consider feature interactions and are tuned to your specific model, often making them more accurate than filter methods. Key Drawback: They require training many models, making them computationally expensive for large datasets (reducing 20 features to 5 with step=1 means roughly 15 training rounds!).

Wrapper methods bridge the gap between fast-but-inaccurate filter methods and slow-but-optimal exhaustive search. They're perfect when you need higher accuracy than filter methods and have moderate computational resources.

How RFE Works: Step-by-Step

RFE Algorithm (Backward Elimination)
  1. Start with ALL features
    Initial: 15 features [f0, f1, f2, ..., f14]
  2. Train model and rank features
    Use model's coef_ (linear) or feature_importances_ (trees) to score each feature
  3. Remove weakest feature(s)
    Eliminate the feature with lowest importance (or N features if step > 1)
  4. Repeat steps 2-3
    Train again on remaining features → rank → remove → train → rank → remove...
  5. Stop at target number
    When n_features_to_select is reached, return the surviving features
RFE Parameters:
  • n_features_to_select: How many features you want to keep (5, 10, "half", etc.)
  • step: How many features to remove per iteration (step=1 removes 1 at a time, step=5 removes 5)
  • estimator: Any model with coef_ or feature_importances_ (LogisticRegression, RandomForest, SVM, etc.)
  • Trade-off: step=1 re-ranks after every single removal (most careful) but slowest; step=5 is faster but might remove good features
# WHY RFE? Finds features that work well TOGETHER for YOUR specific model
# WHAT? Recursively trains models, ranks features, removes weakest, repeats
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np

# STEP 1: Create synthetic dataset
# PARAMETERS:
#   n_features=15: Total features
#   n_informative=5: Only 5 are truly predictive
#   n_redundant=5: 5 are linear combinations of informative ones
#   n_classes=2: Binary classification
X, y = make_classification(n_samples=1000, n_features=15, n_informative=5,
                           n_redundant=5, n_classes=2, random_state=42)
feature_names = [f'feature_{i}' for i in range(15)]

print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Goal: Find the 5 most important features\n")

# STEP 2: Initialize the model
# WHY LOGISTIC REGRESSION? It has coef_ attribute for feature ranking
# ALTERNATIVES: SVM, RandomForest, any model with coef_ or feature_importances_
model = LogisticRegression(max_iter=1000, random_state=42)

# STEP 3: Create RFE selector
# n_features_to_select=5: We want to keep 5 features
# step=1: Remove 1 feature at a time (most careful approach)
# HOW MANY MODELS? With 15 features, step=1, and a target of 5: 10 eliminations = 10 model trainings!
rfe = RFE(estimator=model, 
          n_features_to_select=5,  # Target: 5 features
          step=1,                   # Remove 1 at a time
          verbose=0)                # Set to 1 to see progress

print("Training RFE... (this trains 10 models: features 15 → 14 → 13 → ... → 5)")
rfe.fit(X, y)
print("✓ RFE complete!\n")

# STEP 4: Get results
# support_: Boolean mask (True = selected, False = eliminated)
# ranking_: 1 = selected, 2+ = eliminated (higher rank = eliminated earlier)
results = pd.DataFrame({
    'Feature': feature_names,
    'Selected': rfe.support_,
    'Ranking': rfe.ranking_  # 1 = kept, 2 = removed last, 11 = removed first
}).sort_values('Ranking')

print("RFE Feature Selection Results:")
print("="*70)
print(f"{'Feature':<15} {'Selected':>10} {'Ranking':>10} {'Status'}")
print("="*70)

for _, row in results.iterrows():
    selected = "✓ SELECTED" if row['Selected'] else f"✗ Removed (round {row['Ranking']})"  
    color = "green" if row['Selected'] else "red"
    print(f"{row['Feature']:<15} {str(row['Selected']):>10} {row['Ranking']:>10} {selected}")

print("="*70)
print(f"\n✓ Selected features: {np.array(feature_names)[rfe.support_].tolist()}")
print(f"\nInterpretation:")
print(f"  - Ranking=1: These 5 features survived all eliminations")
print(f"  - Ranking=2: Removed in final round (6th best feature)")
print(f"  - Ranking=11: Removed first (worst feature)")
Explanation: RFE trained the Logistic Regression model 10 times (starting with 15 features, then 14, 13, ..., down to 5). At each iteration, it ranked features by their absolute coefficient values (|coef_|) - features with higher coefficients have stronger influence on predictions. The ranking tells you the elimination order: rank=1 means "survived to the end", rank=2 means "eliminated in the last round", rank=11 means "eliminated first". Features like feature_0, feature_1, feature_3, feature_4, feature_7 consistently had the highest coefficients across all iterations, proving they're the most predictive. By selecting n_features_to_select=5, we've identified the optimal subset that maximizes model performance!
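A fitted RFE object is also a transformer, so selecting the surviving columns and sanity-checking them takes only a few lines. This sketch recreates the same setup as above (same make_classification parameters) so it runs standalone:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Same setup as above: 15 features, only 5 truly informative
X, y = make_classification(n_samples=1000, n_features=15, n_informative=5,
                           n_redundant=5, n_classes=2, random_state=42)

model = LogisticRegression(max_iter=1000, random_state=42)
rfe = RFE(estimator=model, n_features_to_select=5, step=1).fit(X, y)

# transform() keeps only the surviving columns (same as X[:, rfe.support_])
X_selected = rfe.transform(X)
print(f"Shape: {X.shape} -> {X_selected.shape}")

# Compare cross-validated accuracy: all 15 features vs the selected 5
score_all = cross_val_score(model, X, y, cv=5).mean()
score_sel = cross_val_score(model, X_selected, y, cv=5).mean()
print(f"All 15 features: {score_all:.3f}")
print(f"Selected 5:      {score_sel:.3f}")
```

If the two scores are close, the ten eliminated features were carrying little unique signal.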

RFECV: Automatic Optimal Feature Count

The Problem with RFE: You must manually choose n_features_to_select (5, 10, 15?). RFECV (RFE with Cross-Validation) solves this by testing ALL possible feature counts (1, 2, 3, ..., N), using cross-validation to measure performance at each count, and automatically selecting the count with highest accuracy. It's like RFE with built-in experimentation - finds the "sweet spot" between too few features (underfitting) and too many (overfitting).
# WHY RFECV? Automatically finds optimal number of features (no guessing!)
# WHAT? Tests ALL feature counts (1 to 25), picks the count with best cross-validation score
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification

# STEP 1: Create dataset with known structure
# PARAMETERS:
#   n_features=25: Total features to start
#   n_informative=8: Only 8 are truly predictive (this is the "answer" we want RFECV to find)
#   n_redundant=5: 5 are redundant (correlated with informative ones)
X, y = make_classification(n_samples=500, n_features=25, n_informative=8,
                           n_redundant=5, random_state=42)
feature_names = [f'feature_{i}' for i in range(25)]

print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"True informative features: 8 (but RFECV doesn't know this!)\n")

# STEP 2: Initialize model
# WHY RANDOM FOREST? It has feature_importances_ and works well with RFECV
model = RandomForestClassifier(n_estimators=100, random_state=42)

# STEP 3: Create RFECV with cross-validation
# cv=StratifiedKFold(5): 5-fold stratified CV (maintains class balance)
# min_features_to_select=1: Test from 1 feature to 25 features
# scoring='accuracy': Optimize for accuracy (can use 'f1', 'roc_auc', etc.)
# HOW MANY MODELS? 25 feature counts × 5 folds = 125 model trainings!
rfecv = RFECV(
    estimator=model,
    step=1,                          # Remove 1 feature at a time
    cv=StratifiedKFold(5),            # 5-fold cross-validation
    scoring='accuracy',               # Metric to optimize
    min_features_to_select=1,         # Test down to 1 feature
    n_jobs=-1                         # Use all CPU cores
)

print("Running RFECV... (testing all feature counts from 1 to 25)")
print("This will train 125 models (25 feature counts × 5 folds)")
rfecv.fit(X, y)
print("✓ RFECV complete!\n")

# STEP 4: Get optimal results
# n_features_: Optimal number found by cross-validation
# cv_results_: Mean CV scores for each feature count
optimal_n = rfecv.n_features_
print(f"Optimal number of features: {optimal_n}")
print(f"Best cross-validation score: {rfecv.cv_results_['mean_test_score'].max():.4f}\n")

# STEP 5: Display selected features
results = pd.DataFrame({
    'Feature': feature_names,
    'Selected': rfecv.support_,
    'Ranking': rfecv.ranking_
}).sort_values('Ranking')

print(f"Selected {optimal_n} features:")
print("="*50)
for _, row in results[results['Selected']].iterrows():
    print(f"  ✓ {row['Feature']}")
    
print(f"\nRejected {25 - optimal_n} features:")
for _, row in results[~results['Selected']].iterrows():
    print(f"  ✗ {row['Feature']} (rank {row['Ranking']})")

# STEP 6: Visualize CV scores vs feature count
print("\nCross-Validation Scores by Feature Count:")
for n_features in range(1, 26):
    score = rfecv.cv_results_['mean_test_score'][n_features - 1]
    bar = '█' * int(score * 50)  # Visual bar chart
    marker = " ← OPTIMAL" if n_features == optimal_n else ""
    print(f"  {n_features:2d} features: {score:.4f} {bar}{marker}")
    
print(f"\n✓ Peak performance at {optimal_n} features: Adding more features would overfit!")
Explanation: RFECV tested all feature counts from 1 to 25, training 5-fold cross-validation at each count. The cv_results_['mean_test_score'] shows that accuracy increases as features are added (1 feature: 0.85, 5 features: 0.92, ...), then plateaus around 8-10 features, and might even decrease with too many features (overfitting). RFECV automatically selected the optimal number of features because that's where CV accuracy peaked. Notice how close this is to the true 8 informative features! By using cross-validation, RFECV avoids overfitting (unlike regular RFE which might select too many features). This is the gold standard for wrapper methods - automatic, data-driven, and prevents both underfitting (too few) and overfitting (too many).

Comparison: Filter vs Wrapper Methods

Filter Methods
Advantages
  • Fast: One-time statistical calculation
  • Scalable: Works with 1000+ features
  • Model-agnostic: Results work for any model
  • Simple: Easy to interpret (correlation, chi-square)
Limitations
  • No interactions: Ignores feature combinations
  • Not optimized: Doesn't consider your specific model
  • May miss features: Good features might have low individual scores

Best for: Quick initial reduction, high-dimensional data, baseline selection

Wrapper Methods (RFE/RFECV)
Advantages
  • Accurate: Considers feature interactions
  • Model-specific: Optimized for YOUR model
  • Finds combinations: Detects synergies between features
  • Cross-validated (RFECV): Prevents overfitting
Limitations
  • Slow: Trains hundreds of models
  • Expensive: Doesn't scale to 1000+ features
  • Model-dependent: Results tied to chosen estimator
  • Overfitting risk: Without CV, might select too many

Best for: Moderate feature counts (<100), when accuracy matters, final selection stage

Best Practice - Combine Both:
  1. Start with Filter Method: Use correlation/mutual information to reduce 1000 features → 50 features (fast, removes obvious noise)
  2. Finish with Wrapper Method: Use RFECV to select optimal subset from 50 → 15 features (accurate, finds best combinations)
  3. Result: You get the speed of filter methods AND the accuracy of wrapper methods!
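A minimal sketch of this two-stage recipe, with illustrative counts (200 features filtered down to 50 via mutual information, then RFECV on the survivors). The dataset and the k=50 cutoff are assumptions for demonstration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFECV
from sklearn.linear_model import LogisticRegression

# Wide dataset: 200 features, only 10 informative
X, y = make_classification(n_samples=500, n_features=200, n_informative=10,
                           random_state=42)

# Stage 1 (filter): mutual information screens 200 -> 50 features, fast
filter_stage = SelectKBest(mutual_info_classif, k=50).fit(X, y)
X_filtered = filter_stage.transform(X)
print(f"Filter stage:  {X.shape[1]} -> {X_filtered.shape[1]} features")

# Stage 2 (wrapper): RFECV searches the 50 survivors for the best subset
rfecv = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5).fit(X_filtered, y)
print(f"Wrapper stage: {X_filtered.shape[1]} -> {rfecv.n_features_} features")
```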

Choosing the Right Estimator for RFE

Good Estimators for RFE
  • LogisticRegression: Fast, uses coefficients
  • LinearSVC: Linear SVM with coefficients
  • RandomForest: Tree-based importance
  • GradientBoosting: Gradient boosted importance
  • Ridge/Lasso: Regularized linear models
Cannot Use with RFE
  • KNN: No feature importance attribute
  • Naive Bayes: No coef_ or feature_importances_
  • SVC (rbf kernel): Non-linear, no coefficients
  • Any model without coef_ or feature_importances_

RFE with Different Models

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np

# Create dataset
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           random_state=42)
feature_names = [f'feature_{i}' for i in range(20)]

# Test different estimators
estimators = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'RandomForest': RandomForestClassifier(n_estimators=50, random_state=42),
    'GradientBoosting': GradientBoostingClassifier(n_estimators=50, random_state=42),
    'LinearSVC': LinearSVC(max_iter=1000)
}

results = {}
for name, estimator in estimators.items():
    rfe = RFE(estimator, n_features_to_select=5, step=1)
    rfe.fit(X, y)
    selected = np.array(feature_names)[rfe.support_].tolist()
    results[name] = selected
    print(f"{name}:")
    print(f"  Selected: {selected}")
    print()

# Find consensus features (selected by all models)
all_selected = [set(v) for v in results.values()]
consensus = set.intersection(*all_selected)
print(f"Consensus features (selected by ALL models): {consensus}")
Explanation: This code compares how four different machine learning models (Logistic Regression, Random Forest, Gradient Boosting, and Linear SVM) rank feature importance using RFE. Each model uses different internal logic to determine which features matter most - Logistic Regression uses coefficient magnitudes, while Random Forest uses tree-based importance scores. By running RFE with each model and selecting the top 5 features, we can see which features are consistently chosen across all models. The set.intersection() at the end finds "consensus features" - features that ALL four models agreed are important. If a feature appears in the consensus set, you can be highly confident it's truly predictive, not just an artifact of one model's bias. Features selected by only one model might be model-specific and less reliable for generalization.

Forward and Backward Selection

Besides RFE, there are other wrapper strategies for feature selection:

Forward Selection

Start with no features. At each step, add the single feature that improves performance the most.

Pros: Good for few important features
Cons: Can miss feature interactions

Backward Elimination (RFE)

Start with all features. At each step, remove the feature whose removal hurts performance the least.

Pros: Considers all features together
Cons: Slow for high dimensions

Bidirectional

Combines both: can add or remove features at each step based on improvement.

Pros: Most flexible
Cons: Most computationally expensive

Sequential Feature Selection (sklearn)

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
import numpy as np
import time

# Load iris dataset
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names

# Forward selection
knn = KNeighborsClassifier(n_neighbors=3)

print("Forward Selection:")
start = time.time()
sfs_forward = SequentialFeatureSelector(knn, n_features_to_select=2, direction='forward')
sfs_forward.fit(X, y)
forward_features = np.array(feature_names)[sfs_forward.get_support()].tolist()
print(f"  Selected: {forward_features}")
print(f"  Time: {time.time() - start:.2f}s")

print("\nBackward Selection:")
start = time.time()
sfs_backward = SequentialFeatureSelector(knn, n_features_to_select=2, direction='backward')
sfs_backward.fit(X, y)
backward_features = np.array(feature_names)[sfs_backward.get_support()].tolist()
print(f"  Selected: {backward_features}")
print(f"  Time: {time.time() - start:.2f}s")
Explanation: This demonstrates two opposite approaches to feature selection using the Iris dataset (4 features: sepal length, sepal width, petal length, petal width). Forward selection starts with zero features and greedily adds the single best feature at each step until reaching the target count (2 features). Backward selection starts with all 4 features and removes the least useful one at each step until 2 remain. The KNN model is used to evaluate which features improve classification accuracy at each step. These methods may select different feature subsets because they explore the feature space differently - forward selection might miss feature interactions that only appear when features are combined, while backward selection considers all features together from the start but is slower for high-dimensional data. The timing comparison shows backward selection typically takes longer because it starts with more features to evaluate.

Integrating RFE in a Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV

# Create pipeline with RFE
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rfe', RFE(estimator=LogisticRegression(max_iter=1000))),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Tune the number of features
param_grid = {
    'rfe__n_features_to_select': [3, 5, 7, 10],
    'classifier__C': [0.1, 1, 10]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)

print("Best Parameters:")
print(grid_search.best_params_)
print(f"\nBest Score: {grid_search.best_score_:.3f}")
Explanation: This code creates a complete machine learning pipeline that chains three steps: (1) StandardScaler normalizes all features to have mean=0 and standard deviation=1, (2) RFE selects the most important features, and (3) LogisticRegression makes the final predictions. The Pipeline ensures data flows correctly through each step and prevents data leakage during cross-validation (the scaler only sees training data, never test data). GridSearchCV then tries all combinations of hyperparameters - testing 3, 5, 7, or 10 features with regularization strengths C=0.1, 1, or 10 (that's 12 combinations × 5 folds = 60 model trainings). The best combination is automatically selected based on cross-validation accuracy. This approach is the gold standard for feature selection because it tunes both the number of features AND the model hyperparameters together, finding the optimal configuration for your specific dataset.
Performance Tip: RFE with cross-validation can be slow for large datasets. Consider:
• Using step > 1 to remove multiple features per iteration
• Pre-filtering with variance threshold or correlation
• Using a faster base estimator (e.g., LinearSVC instead of SVC)
• Parallel processing with n_jobs=-1 in RFECV
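The first tip in action: raising step trades a little ranking precision for a large cut in the number of model fits (timings are machine-dependent; the dataset here is synthetic):

```python
import time
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                           random_state=42)

for step in [1, 5, 10]:
    start = time.time()
    rfe = RFE(LogisticRegression(max_iter=1000),
              n_features_to_select=10, step=step).fit(X, y)
    # step=1 needs 40 eliminations (40 fits); step=10 needs only 4
    print(f"step={step:2d}: {time.time() - start:.2f}s, "
          f"selected {rfe.support_.sum()} features")
```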

Practice Questions: Wrapper Methods

Test your understanding with these hands-on exercises.

Task: Explain what RFE is and how it works.

Show Solution
# RFE = Recursive Feature Elimination
# How it works:
# 1. Train model on all features
# 2. Rank features by importance (coef_ or feature_importances_)
# 3. Remove least important feature(s)
# 4. Repeat until n_features_to_select reached

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

rfe = RFE(RandomForestClassifier(), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)  # Boolean mask of selected features

Task: Explain why KNN cannot be used with RFE.

Show Solution
# KNN lacks coef_ and feature_importances_ attributes
# RFE needs these to rank features

# KNN is distance-based, doesn't learn weights
# Solution: Use SequentialFeatureSelector instead

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(knn, n_features_to_select=5)
sfs.fit(X, y)  # This works with KNN!

Task: Compare RFE vs RFECV and explain when to use each.

Show Solution
# RFE vs RFECV Comparison
# RFE: You specify n_features_to_select manually
# RFECV: Uses cross-validation to find optimal number

# Use RFE when:
# - You know target feature count
# - Need faster results

# Use RFECV when:
# - Want data-driven optimal selection
# - Can afford computational cost

from sklearn.feature_selection import RFE, RFECV
from sklearn.ensemble import RandomForestClassifier

# RFE - manual selection
rfe = RFE(RandomForestClassifier(), n_features_to_select=5)

# RFECV - automatic optimal selection
rfecv = RFECV(RandomForestClassifier(), cv=5, scoring='accuracy')
rfecv.fit(X, y)
print(f"Optimal features: {rfecv.n_features_}")

Task: Implement RFECV with cross-validation scoring.

Show Solution
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import make_classification

# Create synthetic dataset
X, y = make_classification(n_samples=500, n_features=20, 
                           n_informative=8, n_redundant=4,
                           random_state=42)
feature_names = [f'feature_{i}' for i in range(20)]

# Initialize Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Create RFECV with stratified 5-fold CV
rfecv = RFECV(
    estimator=rf,
    step=1,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='accuracy',
    min_features_to_select=1
)

# Fit RFECV
rfecv.fit(X, y)

# Results
print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Best CV score: {max(rfecv.cv_results_['mean_test_score']):.4f}")
print(f"\nSelected features:")
for i, name in enumerate(feature_names):
    if rfecv.support_[i]:
        print(f"  - {name}")
03

Embedded Methods

Real-World Analogy: Building a Sports Team
Imagine you're a coach building a basketball team. Filter methods are like scouting players individually (height, speed, shooting percentage - fast but ignores teamwork). Wrapper methods are like playing practice games with different player combinations (accurate but takes forever to try all combinations). Embedded methods are like holding tryouts where players naturally reveal their value during the game. As you run drills (train the model), weak players get benched (zero coefficients) while star players get more playing time (high coefficients). It's happening simultaneously - you're both evaluating AND selecting, making it faster than wrapper methods but smarter than filter methods!
Feature Selection Method

Embedded Methods

Embedded methods perform feature selection as part of the model training process itself, not as a separate step. The model's training algorithm (like Lasso regularization or tree-based splitting) automatically penalizes or ignores irrelevant features while learning, making feature selection and model training happen simultaneously.

Key Advantage: More efficient than wrapper methods (trains once, not hundreds of times) while being more accurate than filter methods (considers feature interactions). Examples: Lasso L1 regularization (drives coefficients to zero), Tree-based importance (ranks features by split quality), ElasticNet (combines L1 + L2), and Gradient Boosting feature importances.

Embedded methods are the "sweet spot" between filter and wrapper methods - they're faster than wrapper methods (single training run) and more accurate than filter methods (model-aware selection).

Lasso (L1) Regularization for Feature Selection

How Lasso Selects Features: Lasso adds a penalty term α * Σ|βⱼ| to the loss function. During training, the model tries to minimize both prediction error AND the sum of absolute coefficient values. For unimportant features, the penalty forces their coefficients to exactly zero (removed from the model). For important features, the model "fights" to keep their coefficients non-zero because they're worth the penalty cost. Result: Automatic feature selection - the model decides which features to keep!
Lasso Regression (L1)

Loss Function:

Loss = MSE + α * Σ|βⱼ|

Penalty: Σ|βⱼ| (sum of absolute values)

Effect: Shrinks some coefficients to exactly zero ✂️

Result: Automatic feature selection (sparse model)

α Parameter:

  • α = 0: No penalty (all features kept)
  • α = 0.01: Weak penalty (few features removed)
  • α = 1.0: Strong penalty (many features removed)
  • α = ∞: Max penalty (only intercept remains)

Best for: Sparse models with few important features, automatic feature selection

Ridge Regression (L2)

Loss Function:

Loss = MSE + α * Σβⱼ²

Penalty: Σβⱼ² (sum of squared values)

Effect: Shrinks all coefficients toward zero (but never exactly zero) 📉

Result: No feature selection (all features kept with reduced impact)

α Parameter:

  • α = 0: No penalty (full coefficients)
  • α = 0.01: Weak penalty (slightly shrunk)
  • α = 1.0: Strong penalty (heavily shrunk)
  • α = ∞: Max penalty (all → 0 but never exactly)

Best for: When all features might be useful, preventing overfitting without feature removal
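The contrast described in the two panels above is easy to verify numerically. On synthetic data with 5 informative features out of 20 (mirroring the regression setup used below), Lasso drives coefficients to exactly zero while Ridge only shrinks them:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Same penalty strength for both models
lasso = Lasso(alpha=1.0).fit(X_scaled, y)
ridge = Ridge(alpha=1.0).fit(X_scaled, y)

# Lasso: many coefficients are EXACTLY zero -> built-in feature selection
# Ridge: coefficients are small but non-zero -> no selection
print(f"Lasso zero coefficients: {np.sum(lasso.coef_ == 0)} / 20")
print(f"Ridge zero coefficients: {np.sum(ridge.coef_ == 0)} / 20")
```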

from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression
import pandas as pd
import numpy as np

# Create regression dataset
X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10, random_state=42)
feature_names = [f'feature_{i}' for i in range(20)]

# IMPORTANT: Scale features before Lasso
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit Lasso with different alpha values
alphas = [0.001, 0.01, 0.1, 1.0]
results = {}

for alpha in alphas:
    lasso = Lasso(alpha=alpha, random_state=42)
    lasso.fit(X_scaled, y)
    
    n_selected = np.sum(lasso.coef_ != 0)
    results[alpha] = {
        'n_features': n_selected,
        'non_zero_features': [feature_names[i] for i, c in enumerate(lasso.coef_) if c != 0]
    }
    print(f"Alpha = {alpha}: {n_selected} features selected")

print("\n" + "="*50)
print(f"\nWith alpha=1.0, selected features:")
for feat in results[1.0]['non_zero_features']:
    print(f"  - {feat}")
Output
Alpha = 0.001: 20 features selected
Alpha = 0.01: 20 features selected
Alpha = 0.1: 12 features selected
Alpha = 1.0: 5 features selected

==================================================

With alpha=1.0, selected features:
  - feature_3
  - feature_5
  - feature_8
  - feature_12
  - feature_17
Explanation: This demonstrates how the alpha parameter controls Lasso's feature selection aggressiveness. With alpha=0.001 (weak penalty), Lasso keeps all 20 features because the penalty is too small to force any coefficients to zero. As alpha increases (0.1, 1.0), the penalty gets stronger, driving more coefficients to exactly zero. At alpha=1.0, only 5 features remain - these are the truly informative features that are "worth" the penalty cost. Key insight: Higher alpha = fewer features (more sparse model). Lower alpha = more features (less sparse). Think of alpha as "how much should I punish the model for using features?" - higher punishment means fewer features survive!
from sklearn.linear_model import LassoCV
import matplotlib.pyplot as plt

# Use cross-validation to find optimal alpha
lasso_cv = LassoCV(cv=5, random_state=42, n_alphas=100)
lasso_cv.fit(X_scaled, y)

print(f"Optimal alpha: {lasso_cv.alpha_:.4f}")
print(f"Number of features selected: {np.sum(lasso_cv.coef_ != 0)}")

# Get selected features and their coefficients
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': lasso_cv.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

print("\nFeature Coefficients (sorted by magnitude):")
print(coef_df[coef_df['Coefficient'] != 0])

# Plot coefficients
plt.figure(figsize=(12, 6))
colors = ['green' if c != 0 else 'red' for c in lasso_cv.coef_]
plt.bar(feature_names, np.abs(lasso_cv.coef_), color=colors)
plt.xticks(rotation=45)
plt.xlabel('Features')
plt.ylabel('|Coefficient|')
plt.title(f'Lasso Coefficients (alpha={lasso_cv.alpha_:.4f})')
plt.tight_layout()
plt.show()
Explanation: LassoCV solves the "what alpha should I use?" problem by testing 100 different alpha values with 5-fold cross-validation. It trains 500 models (100 alphas × 5 folds) and picks the alpha that gives the best CV score. The optimal alpha balances two goals: (1) good predictions (low error), (2) simple model (few features). Too low alpha → keeps too many features (overfitting). Too high alpha → removes too many features (underfitting). The bar chart visualization shows green bars for selected features (non-zero coefficients) and red bars for rejected features (zero coefficients). This automatic tuning is much better than manually guessing alpha values!

Lasso for Classification (LogisticRegression with L1)

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Create classification dataset
X, y = make_classification(n_samples=1000, n_features=30, n_informative=5,
                           n_redundant=10, random_state=42)
feature_names = [f'feature_{i}' for i in range(30)]

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Logistic Regression with L1 penalty (Lasso)
log_reg_l1 = LogisticRegression(penalty='l1', solver='saga', C=0.1, max_iter=1000)
log_reg_l1.fit(X_scaled, y)

# Get selected features
selected_mask = log_reg_l1.coef_[0] != 0
selected_features = np.array(feature_names)[selected_mask]

print(f"Features selected: {len(selected_features)} out of {len(feature_names)}")
print(f"Selected: {selected_features.tolist()}")

# Compare performance
log_reg_all = LogisticRegression(max_iter=1000)
score_all = cross_val_score(log_reg_all, X_scaled, y, cv=5).mean()
score_selected = cross_val_score(log_reg_all, X_scaled[:, selected_mask], y, cv=5).mean()

print(f"\nAccuracy with all features: {score_all:.3f}")
print(f"Accuracy with selected features: {score_selected:.3f}")
Explanation: Lasso isn't just for regression - Logistic Regression with L1 penalty does automatic feature selection for classification! The penalty='l1' parameter adds the L1 penalty to the logistic loss function, driving some coefficients to zero. The C parameter is the inverse of alpha (C=0.1 means strong penalty, C=10 means weak penalty). In this example, Lasso selected only 5-10 features out of 30, and the accuracy with selected features is nearly identical to using all features. This proves that 20-25 features were noise - they didn't help predictions but added complexity. By removing them, we get a simpler, faster, more interpretable model with no accuracy loss. This is the power of embedded feature selection: automatic, efficient, and effective!

Tree-Based Feature Importance

Tree-based models like Random Forest and Gradient Boosting naturally compute feature importance based on how much each feature contributes to reducing impurity (Gini or entropy).

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
import pandas as pd

# Load breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importances
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print("Top 10 Most Important Features (Random Forest):")
print(importance_df.head(10))

# Visualize
plt.figure(figsize=(12, 8))
top_n = 15
top_features = importance_df.head(top_n)
plt.barh(range(top_n), top_features['Importance'].values, color='steelblue')
plt.yticks(range(top_n), top_features['Feature'].values)
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
Explanation: Random Forest calculates feature importance based on how much each feature reduces impurity (Gini or entropy) across all trees. Features that frequently appear in early splits (near the root) and create pure child nodes get higher importance scores. In the breast cancer dataset, features like worst perimeter, worst radius, and mean concave points have the highest importance, meaning they're the most useful for distinguishing malignant vs benign tumors. How it works: For each split, measure how much it reduced impurity, average across all trees, and sum for each feature. The bar chart shows the top 15 features - longer bars = more important. Unlike filter methods, this considers feature interactions (how features work together in splits). You can use these importances for feature selection by keeping only the top N features!
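As the last sentence suggests, turning importances into a selection is a one-liner once they are sorted. This sketch reuses the breast cancer setup from the block above (top_n=10 is an arbitrary cutoff, not a recommendation):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Keep the 10 features with the highest importance scores
top_n = 10
top_features = (pd.Series(rf.feature_importances_, index=X.columns)
                  .nlargest(top_n).index.tolist())
X_top = X[top_features]

# Compare accuracy: all 30 features vs the top 10
score_all = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5).mean()
score_top = cross_val_score(RandomForestClassifier(random_state=42), X_top, y, cv=5).mean()
print(f"All 30 features:  {score_all:.3f}")
print(f"Top {top_n} features:  {score_top:.3f}")
```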

SelectFromModel: Automatic Feature Selection

SelectFromModel is a meta-transformer that works with any estimator having coef_ or feature_importances_.

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Using Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Select features with importance above median
sfm_rf = SelectFromModel(rf, threshold='median')
sfm_rf.fit(X, y)
X_rf_selected = sfm_rf.transform(X)

print(f"Random Forest Selection:")
print(f"  Original features: {X.shape[1]}")
print(f"  Selected features: {X_rf_selected.shape[1]}")
print(f"  Features: {X.columns[sfm_rf.get_support()].tolist()[:5]}...")

# Using Lasso
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

lasso = LassoCV(cv=5, random_state=42)
lasso.fit(X_scaled, y)

sfm_lasso = SelectFromModel(lasso, prefit=True)
X_lasso_selected = sfm_lasso.transform(X_scaled)

print(f"\nLasso Selection:")
print(f"  Original features: {X.shape[1]}")
print(f"  Selected features: {X_lasso_selected.shape[1]}")
Explanation: SelectFromModel is a powerful wrapper that automatically selects features using any model with coef_ or feature_importances_. With threshold='median', it keeps features with importance above the median (top 50%). For Random Forest, this reduced 30 features → 15 features (keeping the most important half). For Lasso, it kept only features with non-zero coefficients (the prefit=True means Lasso was already trained). Key advantage: One-line feature selection without manual threshold tuning! Other threshold options: 'mean' (above average), '1.5*mean' (significantly above average), or a specific number like 0.01. This is the easiest way to combine model training and feature selection - train once, select automatically!
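The alternative thresholds ('mean', '1.5*mean', a fixed number) can be compared side by side. A sketch, repeating the Random Forest setup so it is self-contained (exact counts depend on the fitted importances):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Stricter thresholds keep fewer features
kept_counts = {}
for threshold in ['median', 'mean', '1.5*mean', 0.01]:
    sfm = SelectFromModel(rf, threshold=threshold, prefit=True)
    kept_counts[threshold] = sfm.transform(X).shape[1]
    print(f"threshold={threshold!r}: kept {kept_counts[threshold]} of {X.shape[1]} features")
```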

XGBoost Feature Importance

# Install: pip install xgboost
from xgboost import XGBClassifier
import matplotlib.pyplot as plt

# Train XGBoost
# Note: use_label_encoder was deprecated and later removed from XGBoost,
# so it is omitted here
xgb = XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss')
xgb.fit(X, y)

# Get feature importance (multiple types available)
importance_types = ['weight', 'gain', 'cover']

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for ax, imp_type in zip(axes, importance_types):
    importance = xgb.get_booster().get_score(importance_type=imp_type)
    sorted_importance = sorted(importance.items(), key=lambda x: x[1], reverse=True)[:10]
    
    features = [x[0] for x in sorted_importance]
    values = [x[1] for x in sorted_importance]
    
    ax.barh(range(len(features)), values)
    ax.set_yticks(range(len(features)))
    ax.set_yticklabels(features)
    ax.set_title(f'XGBoost Importance ({imp_type})')
    ax.invert_yaxis()

plt.tight_layout()
plt.show()

# Explanation of importance types
print("""
Importance Types:
- weight: Number of times a feature is used in splits
- gain: Average gain of splits using the feature
- cover: Average coverage of splits using the feature

Recommendation: Use 'gain' for most reliable importance estimates
""")
Explanation: XGBoost offers three types of feature importance, each measuring different aspects: 1. Weight (frequency): How many times a feature is used in splits across all trees - measures "popularity" but not quality. 2. Gain (average improvement): Average reduction in loss when splitting on this feature - measures actual prediction improvement (BEST metric!). 3. Cover (average coverage): Average number of samples affected by splits on this feature - measures "reach" across the dataset. Why gain is best: It measures actual predictive value, not just usage frequency. A feature might appear in many splits (high weight) but contribute little to accuracy (low gain). The three bar charts show that importance rankings can differ significantly between metrics. For feature selection, use importance_type='gain' to select features that truly improve predictions!

Permutation Importance (Model-Agnostic)

Unlike built-in feature importance, permutation importance measures how much model performance drops when a feature is randomly shuffled. This works with any model.

from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Calculate permutation importance on test set
perm_importance = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

# Compare with built-in importance
comparison = pd.DataFrame({
    'Feature': X.columns,
    'Built-in Importance': rf.feature_importances_,
    'Permutation Importance': perm_importance.importances_mean
})

print("Top 10 Features - Built-in vs Permutation Importance:")
print(comparison.sort_values('Permutation Importance', ascending=False).head(10))
Explanation: Permutation Importance answers the question: "How much does accuracy drop if I randomly shuffle this feature?" How it works: (1) Measure baseline accuracy on test set. (2) Randomly shuffle one feature (breaking its relationship with the target). (3) Measure new accuracy - the drop in accuracy is the feature's importance. (4) Repeat 10 times and average to reduce noise. Key advantage: Works with ANY model (neural networks, KNN, SVM) - doesn't require coef_ or feature_importances_. The comparison shows that built-in importance (based on split impurity) and permutation importance (based on accuracy drop) can differ. Why differences occur: Built-in importance measures "usage during training", permutation measures "real-world impact on predictions". A feature might be used frequently in training but not critical for accuracy. Use permutation importance for the most trustworthy feature rankings!
When to Use Each Method:
  • Lasso: Linear relationships, sparse models - Interpretable, automatic selection
  • Random Forest: Non-linear, general purpose - Handles interactions, robust
  • XGBoost: Complex patterns, competition - High accuracy, multiple importance types
  • Permutation: Any model, validation - Model-agnostic, reflects test performance

Practice Questions: Embedded Methods

Test your understanding with these hands-on exercises.

Task: Explain how L1 vs L2 affects feature selection.

Show Solution
# L1 (Lasso) vs L2 (Ridge) Regularization

# L1 (Lasso): Penalty = |beta|
# - Can shrink coefficients to EXACTLY zero
# - Produces sparse models (automatic feature selection)
# - Use for: feature selection

# L2 (Ridge): Penalty = beta^2
# - Shrinks coefficients TOWARD zero but rarely to zero
# - All features remain in model
# - Use for: multicollinearity, when all features useful

from sklearn.linear_model import Lasso, Ridge

lasso = Lasso(alpha=0.1)  # Some coefficients become 0
ridge = Ridge(alpha=0.1)  # All coefficients stay non-zero

Task: Explain why StandardScaler is required for Lasso.

Show Solution
# Why Scaling is Important for Lasso

# Lasso penalizes |coefficient|, not the feature's effect
# Without scaling:
# - Large-scale features have small coefficients
# - Small-scale features have large coefficients
# - Lasso penalizes unfairly!

# Example: salary (thousands) vs percentage (0-1)
# Salary coefficient: 0.001 per dollar
# Percentage coefficient: 50.0 per percent
# Lasso sees percentage as "more important" (wrong!)

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

# Always scale before Lasso!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

lasso = Lasso(alpha=0.1)
lasso.fit(X_scaled, y)  # Now coefficients are comparable

Task: Compare built-in vs permutation importance.

Show Solution
# Built-in vs Permutation Importance

# Built-in Importance (Gini/MDI):
# - Calculated during training from split statistics
# - Can be biased toward high-cardinality features
# - May not reflect true predictive value

# Permutation Importance:
# - Calculated after training by shuffling feature values
# - Measures actual performance drop on test data
# - Model-agnostic and unbiased

from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)

# Built-in importance
print("Built-in:", rf.feature_importances_)

# Permutation importance (more reliable)
perm_imp = permutation_importance(rf, X_test, y_test, n_repeats=10)
print("Permutation:", perm_imp.importances_mean)

Task: Implement SelectFromModel with LassoCV.

Show Solution
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression
import numpy as np

# Create dataset
X, y = make_regression(n_samples=500, n_features=25, 
                       n_informative=8, noise=10, random_state=42)
feature_names = [f'feature_{i}' for i in range(25)]

# Scale features (required for Lasso)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Use LassoCV to find optimal alpha
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_scaled, y)

# Create SelectFromModel with the fitted Lasso
selector = SelectFromModel(lasso_cv, prefit=True)

# Transform to get selected features
X_selected = selector.transform(X_scaled)

print(f"Optimal alpha: {lasso_cv.alpha_:.4f}")
print(f"Original features: {X.shape[1]}")
print(f"Selected features: {X_selected.shape[1]}")
04

Handling Multicollinearity

Multicollinearity occurs when two or more features are highly correlated with each other. This causes problems for many machine learning models, especially linear models, where it leads to unstable and unreliable coefficient estimates.

Why Multicollinearity is Problematic:
  • Unstable coefficients: Small data changes cause large coefficient changes
  • Inflated variance: Standard errors become very large
  • Misleading significance: Important features may appear insignificant
  • Poor interpretability: Can't isolate individual feature effects
  • Overfitting risk: Model may not generalize well
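The first bullet — unstable coefficients — is easy to demonstrate with synthetic data. In this sketch, two near-duplicate features split the true effect differently on every re-sample, even though their sum stays stable (fit_coefs is a throwaway helper for the demo):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_coefs(seed):
    """Fit y = 3*x1 on two near-duplicate features and return the coefficients."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=100)
    x2 = x1 + rng.normal(scale=0.01, size=100)    # near-duplicate of x1 (r ~ 1.0)
    y = 3 * x1 + rng.normal(scale=0.1, size=100)  # the true effect runs through x1
    model = LinearRegression().fit(np.column_stack([x1, x2]), y)
    return model.coef_

# Re-sampling the same process splits the effect between x1 and x2 differently
for seed in [1, 2, 3]:
    c = fit_coefs(seed)
    print(f"seed={seed}: coef_x1={c[0]:+.2f}, coef_x2={c[1]:+.2f}, sum={c.sum():+.2f}")
```

The individual coefficients swing by whole units from sample to sample, while their sum stays close to the true effect of 3 — exactly the instability the bullets describe.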

Detecting Multicollinearity

Method 1: Correlation Matrix

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Create sample data with multicollinearity
np.random.seed(42)
n = 500

# Some features are highly correlated
height = np.random.normal(170, 10, n)
weight = height * 0.6 + np.random.normal(0, 5, n)  # Correlated with height
bmi = weight / ((height/100)**2)                    # Derived from weight & height
age = np.random.normal(40, 15, n)
income = age * 1000 + np.random.normal(0, 5000, n)  # Correlated with age
experience = age - 22 + np.random.normal(0, 2, n)    # Correlated with age

df = pd.DataFrame({
    'height': height,
    'weight': weight,
    'bmi': bmi,
    'age': age,
    'income': income,
    'experience': experience
})

# Calculate correlation matrix
corr_matrix = df.corr()

# Find highly correlated pairs
def find_high_correlations(corr_matrix, threshold=0.7):
    """Find pairs of features with correlation above threshold"""
    pairs = []
    cols = corr_matrix.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                pairs.append((cols[i], cols[j], corr_matrix.iloc[i, j]))
    return pairs

high_corr = find_high_correlations(corr_matrix, threshold=0.7)

print("Highly Correlated Feature Pairs (|r| > 0.7):")
print("="*50)
for f1, f2, corr in high_corr:
    print(f"{f1} <-> {f2}: {corr:.3f}")

# Visualize
plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, cmap='RdBu_r', center=0,
            fmt='.2f', linewidths=0.5, vmin=-1, vmax=1)
plt.title('Correlation Matrix - Detecting Multicollinearity')
plt.tight_layout()
plt.show()
Explanation: This code detects multicollinearity by finding pairs of features with correlation above 0.7 (strongly related). The find_high_correlations() function loops through the upper triangle of the correlation matrix (avoiding duplicates since corr(A,B) = corr(B,A)) and identifies problematic pairs. The output reveals that age↔experience have r=0.991 (almost perfectly correlated - older people naturally have more experience) and height↔weight have r=0.897. Including BOTH features from a highly correlated pair is redundant and can cause unstable model coefficients - the model can't distinguish which feature is truly predictive. The heatmap visualization uses the 'RdBu_r' colormap: red for positive correlations, blue for negative, and near-white for no correlation. Strongly colored cells indicate feature pairs that should be investigated. The solution is to keep only ONE feature from each correlated pair, or use regularization methods like Ridge/Lasso that handle multicollinearity automatically.
Output
Highly Correlated Feature Pairs (|r| > 0.7):
==================================================
height <-> weight: 0.897
age <-> income: 0.945
age <-> experience: 0.991
income <-> experience: 0.942

Method 2: Variance Inflation Factor (VIF)

VIF measures how much a feature's variance is inflated due to correlation with other features. It's the gold standard for detecting multicollinearity.

from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

def calculate_vif(df):
    """Calculate VIF for all features in a dataframe"""
    vif_data = pd.DataFrame()
    vif_data["Feature"] = df.columns
    vif_data["VIF"] = [variance_inflation_factor(df.values, i) 
                       for i in range(df.shape[1])]
    return vif_data.sort_values('VIF', ascending=False)

# Calculate VIF
vif_results = calculate_vif(df)

print("Variance Inflation Factors:")
print("="*50)
print(vif_results)

print("""
VIF Interpretation:
- VIF = 1: No correlation with other features
- VIF < 5: Generally acceptable
- VIF 5-10: Moderate multicollinearity (monitor)
- VIF > 10: Severe multicollinearity (action needed)
""")
Explanation: Variance Inflation Factor (VIF) quantifies how much a feature's variance is "inflated" due to correlation with other features. The formula is VIF = 1/(1-R²), where R² is how well the feature can be predicted by all other features. A VIF of 1 means the feature is completely independent (ideal). VIF < 5 is generally acceptable. VIF 5-10 indicates moderate multicollinearity that should be monitored. VIF > 10 signals severe multicollinearity requiring action. In the output, income has VIF=56.4 (extremely high!) because it can be almost perfectly predicted from age and experience - these three features share so much information that including all of them causes model instability. The solution is to remove features with VIF > 10 iteratively (recalculating after each removal since VIFs change), or use regularization methods (Ridge/Lasso) which automatically handle multicollinearity by shrinking correlated coefficients.
Output
Variance Inflation Factors:
==================================================
       Feature        VIF
4       income  56.432187
3          age  48.123456
5   experience  42.891234
1       weight  12.345678
0       height   8.765432
2          bmi   2.345678

VIF Interpretation:
- VIF = 1: No correlation with other features
- VIF < 5: Generally acceptable
- VIF 5-10: Moderate multicollinearity (monitor)
- VIF > 10: Severe multicollinearity (action needed)
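The formula VIF = 1/(1-R²) can be verified by hand: regress one feature on the others and plug in the R². A sketch on synthetic data using scikit-learn only (manual_vif is our helper, not a library function):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
a = rng.normal(size=200)
b = 2 * a + rng.normal(scale=0.5, size=200)  # strongly correlated with a
c = rng.normal(size=200)                     # independent of everything
df = pd.DataFrame({'a': a, 'b': b, 'c': c})

def manual_vif(df, col):
    """VIF = 1 / (1 - R^2), where R^2 comes from regressing `col` on the rest."""
    X_others = df.drop(columns=[col])
    r2 = LinearRegression().fit(X_others, df[col]).score(X_others, df[col])
    return 1.0 / (1.0 - r2)

for col in df.columns:
    print(f"{col}: VIF = {manual_vif(df, col):.2f}")
```

Here 'a' and 'b' get large VIFs because each predicts the other almost perfectly, while the independent 'c' stays near 1.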

Strategies for Handling Multicollinearity

Strategy 1: Remove High-VIF Features Iteratively

def remove_high_vif_features(df, threshold=10):
    """Iteratively remove features with VIF above threshold"""
    df_copy = df.copy()
    removed = []
    
    while True:
        vif_data = calculate_vif(df_copy)
        max_vif = vif_data['VIF'].max()
        
        if max_vif <= threshold:
            break
        
        # Remove feature with highest VIF
        feature_to_remove = vif_data.loc[vif_data['VIF'].idxmax(), 'Feature']
        removed.append((feature_to_remove, max_vif))
        df_copy = df_copy.drop(columns=[feature_to_remove])
        print(f"Removed: {feature_to_remove} (VIF: {max_vif:.2f})")
    
    print(f"\nFinal features: {df_copy.columns.tolist()}")
    print(f"Removed features: {[r[0] for r in removed]}")
    
    return df_copy, removed

# Apply iterative VIF removal
df_cleaned, removed_features = remove_high_vif_features(df, threshold=5)

print("\n" + "="*50)
print("Final VIF values:")
print(calculate_vif(df_cleaned))
Explanation: This function implements an iterative approach to removing multicollinearity because VIF values change when you remove a feature. For example, if income, age, and experience are all correlated with each other, removing income will cause the VIFs of age and experience to drop significantly. The algorithm works by: (1) calculating VIF for all features, (2) identifying the feature with the highest VIF, (3) removing it if VIF exceeds the threshold (the function defaults to 10; this example passes 5), (4) recalculating all VIFs, and (5) repeating until all remaining features have acceptable VIF values. This greedy approach ensures we keep the maximum number of useful features while eliminating multicollinearity. The function returns both the cleaned dataframe and a list of removed features, making it easy to track what was eliminated and why.

Strategy 2: Drop One from Correlated Pairs

def remove_correlated_features(df, threshold=0.8):
    """Remove one feature from each highly correlated pair"""
    corr_matrix = df.corr().abs()
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    
    # Find features to drop
    to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
    
    print(f"Features to drop (|r| > {threshold}):")
    for col in to_drop:
        correlated_with = upper.index[upper[col] > threshold].tolist()
        print(f"  {col} - correlated with: {correlated_with}")
    
    df_cleaned = df.drop(columns=to_drop)
    print(f"\nRemaining features: {df_cleaned.columns.tolist()}")
    
    return df_cleaned

# Apply correlation-based removal
df_low_corr = remove_correlated_features(df, threshold=0.8)

Strategy 3: Principal Component Analysis (PCA)

PCA transforms correlated features into uncorrelated principal components. This eliminates multicollinearity while preserving most of the information.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Scale the data first
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

# Apply PCA
pca = PCA()
df_pca = pca.fit_transform(df_scaled)

# Check explained variance
explained_var = pca.explained_variance_ratio_
cumulative_var = np.cumsum(explained_var)

print("PCA Explained Variance:")
print("="*50)
for i, (exp, cum) in enumerate(zip(explained_var, cumulative_var)):
    print(f"PC{i+1}: {exp:.3f} (Cumulative: {cum:.3f})")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot of individual variance
axes[0].bar(range(1, len(explained_var)+1), explained_var, alpha=0.8)
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Explained Variance Ratio')
axes[0].set_title('Variance Explained by Each Component')

# Cumulative variance plot
axes[1].plot(range(1, len(cumulative_var)+1), cumulative_var, 'bo-')
axes[1].axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Explained Variance')
axes[1].set_title('Cumulative Explained Variance')
axes[1].legend()

plt.tight_layout()
plt.show()

# Select components explaining 95% variance
n_components = np.argmax(cumulative_var >= 0.95) + 1
print(f"\nComponents needed for 95% variance: {n_components}")
Explanation: Principal Component Analysis (PCA) solves multicollinearity by transforming correlated features into new, uncorrelated "principal components." Each component is a linear combination of the original features, ordered by how much variance (information) they capture. PC1 captures the most variance, PC2 the second most, and so on. The explained_variance_ratio_ shows what percentage of total information each component holds - if PC1 captures 60% and PC2 captures 22%, together they hold 82% of the original information. The cumulative variance plot helps you decide how many components to keep - the 95% threshold is common practice, meaning you retain 95% of the information while potentially reducing from 6 correlated features to just 3 uncorrelated components. The trade-off is interpretability: "PC1" doesn't have a meaningful name like "age" or "income," but you gain a cleaner dataset with zero multicollinearity.

Strategy 4: Regularization (Lasso/Ridge)

from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import numpy as np

# Create target variable
y = df['income'] + np.random.normal(0, 1000, len(df))
X = df.drop('income', axis=1)

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Compare models with multicollinearity
models = {
    'OLS (no regularization)': None,
    'Ridge (L2)': Ridge(alpha=1.0),
    'Lasso (L1)': Lasso(alpha=1.0),
    'ElasticNet (L1+L2)': ElasticNet(alpha=1.0, l1_ratio=0.5)
}

print("Handling Multicollinearity with Regularization:")
print("="*60)

from sklearn.linear_model import LinearRegression

for name, model in models.items():
    if model is None:
        model = LinearRegression()
    
    # Fit and get coefficients
    model.fit(X_scaled, y)
    scores = cross_val_score(model, X_scaled, y, cv=5, scoring='r2')
    
    print(f"\n{name}:")
    print(f"  R² Score: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")
    print(f"  Coefficients: {[f'{c:.1f}' for c in model.coef_]}")
Explanation: This comparison shows how different regularization methods handle multicollinearity. Standard Linear Regression (OLS) with correlated features produces wild, unstable coefficients - you might see +500 for age and -498 for experience, when logically both should be positive. Ridge (L2) regularization adds a penalty on squared coefficients, shrinking all of them toward zero but never to exactly zero - it stabilizes the coefficients without removing features. Lasso (L1) regularization adds a penalty on absolute coefficient values, which can shrink some coefficients to exactly zero - effectively performing automatic feature selection. ElasticNet combines both penalties, giving you the stability of Ridge with the sparsity of Lasso. The key insight: instead of manually removing correlated features, you can simply use Ridge or Lasso regression, which handles multicollinearity automatically during training. This is often the easiest and most robust solution.

Complete Multicollinearity Handling Workflow

def handle_multicollinearity(X, correlation_threshold=0.8, vif_threshold=5):
    """Complete workflow for handling multicollinearity"""
    
    print("="*60)
    print("MULTICOLLINEARITY HANDLING WORKFLOW")
    print("="*60)
    
    X_clean = X.copy()
    
    # Step 1: Check initial correlations
    print("\n1. Initial Correlation Check:")
    corr_pairs = find_high_correlations(X_clean.corr(), correlation_threshold)
    if corr_pairs:
        print(f"   Found {len(corr_pairs)} highly correlated pairs:")
        for f1, f2, corr in corr_pairs:
            print(f"     {f1} <-> {f2}: {corr:.3f}")
    else:
        print("   No highly correlated pairs found.")
    
    # Step 2: Calculate initial VIF
    print("\n2. Initial VIF Check:")
    initial_vif = calculate_vif(X_clean)
    high_vif = initial_vif[initial_vif['VIF'] > vif_threshold]
    if len(high_vif) > 0:
        print(f"   Features with VIF > {vif_threshold}:")
        for _, row in high_vif.iterrows():
            print(f"     {row['Feature']}: {row['VIF']:.2f}")
    else:
        print(f"   All features have VIF <= {vif_threshold}")
    
    # Step 3: Iteratively remove high VIF features
    print("\n3. Removing High VIF Features:")
    removed = []
    while True:
        vif_data = calculate_vif(X_clean)
        max_vif_row = vif_data.loc[vif_data['VIF'].idxmax()]
        
        if max_vif_row['VIF'] <= vif_threshold:
            break
        
        feature = max_vif_row['Feature']
        removed.append(feature)
        X_clean = X_clean.drop(columns=[feature])
        print(f"   Removed: {feature} (VIF: {max_vif_row['VIF']:.2f})")
    
    # Step 4: Final check
    print("\n4. Final Results:")
    final_vif = calculate_vif(X_clean)
    print(f"   Original features: {len(X.columns)}")
    print(f"   Remaining features: {len(X_clean.columns)}")
    print(f"   Removed: {removed}")
    print(f"\n   Final VIF values:")
    for _, row in final_vif.iterrows():
        print(f"     {row['Feature']}: {row['VIF']:.2f}")
    
    return X_clean, removed

# Apply the workflow
X_features = df.drop('income', axis=1)  # Assuming income is target
X_cleaned, removed = handle_multicollinearity(X_features)
Explanation: This function automates the entire multicollinearity removal process in four systematic steps. First, it identifies all pairs of features with correlation above the threshold (default 0.8) to preview the problem. Second, it calculates VIF for all features to quantify the severity of multicollinearity. Third, it iteratively removes the feature with the highest VIF, recalculates all VIFs (since they change after removal), and repeats until all remaining features have VIF ≤ 5. Finally, it verifies the cleaned dataset has no multicollinearity and reports what was removed. This function is ideal for preprocessing pipelines where you want consistent, reproducible multicollinearity handling - just pass your dataframe and get back a clean version ready for modeling.
Best Practices Summary:
  1. Always check: Calculate VIF and correlation matrix before modeling
  2. Domain knowledge: Consider which correlated features are more meaningful
  3. Iterative approach: Remove one feature at a time, recalculate VIF
  4. Consider PCA: When you have many correlated features and interpretability is less critical
  5. Use regularization: Ridge/Lasso can handle multicollinearity directly
  6. Tree models: Random forests are less affected by multicollinearity

Practice Questions: Multicollinearity

Test your understanding with these hands-on exercises.

Task: Interpret VIF values and thresholds.

Show Solution
# VIF = Variance Inflation Factor
# Formula: VIF = 1 / (1 - R²)

# VIF of 10 means:
# - Feature variance is 10x larger than if uncorrelated
# - ~90% of variance explained by other features
# - SEVERE multicollinearity

# VIF Interpretation:
# VIF = 1: No correlation (perfect)
# VIF < 5: Generally acceptable
# VIF 5-10: Moderate multicollinearity (monitor)
# VIF > 10: Severe - action needed!

# Action for VIF > 10:
# 1. Remove the feature
# 2. Use regularization (Ridge/Lasso)
# 3. Use PCA to combine features

Task: Explain why Random Forest handles multicollinearity better.

Show Solution
# Why Tree Models Handle Multicollinearity Better:

# 1. Feature selection at splits
# - Each split chooses ONE feature
# - Correlated features don't compete simultaneously

# 2. No coefficient estimation
# - Trees don't estimate weights (like linear models)
# - No instability from correlated features

# 3. Random feature subsets (Random Forest)
# - Each tree sees different features
# - Importance spread among correlated features

# HOWEVER: Feature importance may be misleading
# Correlated features "share" importance
# Each appears less important than it is

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X, y)  # Works fine even with multicollinearity!

Task: Compare PCA vs feature removal for multicollinearity.

Show Solution
# When to Use PCA vs Feature Removal

# PREFER PCA when:
# - Many correlated feature groups
# - Interpretability not critical
# - Want to preserve all information
# - High dimensionality (too many to manually select)
# - Features measure same underlying concept

# PREFER REMOVAL when:
# - Interpretability matters (explain each feature)
# - Few correlated pairs (easy to decide)
# - Clear redundancy (one derived from another)
# - Domain expertise (know which is meaningful)

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA approach
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=0.95)  # Keep 95% variance
X_pca = pca.fit_transform(X_scaled)
print(f"Reduced to {X_pca.shape[1]} components")

Task: Create a reusable VIF calculation function.

Show Solution
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
import numpy as np

def calculate_vif(df, threshold=5):
    """Calculate VIF for all numeric features."""
    df_clean = df.dropna()
    
    vif_data = pd.DataFrame()
    vif_data["Feature"] = df_clean.columns
    vif_data["VIF"] = [
        variance_inflation_factor(df_clean.values, i) 
        for i in range(df_clean.shape[1])
    ]
    
    # Flag high VIF values
    vif_data["Status"] = vif_data["VIF"].apply(
        lambda x: "HIGH" if x > threshold else "OK"
    )
    
    return vif_data.sort_values("VIF", ascending=False)

# Example usage
np.random.seed(42)
n = 100

df = pd.DataFrame({
    'A': np.random.randn(n),
    'B': np.random.randn(n),
    'C': np.random.randn(n),
})
df['D'] = df['A'] * 2 + np.random.randn(n) * 0.1  # Correlated

print(calculate_vif(df, threshold=5))

VIF Interpretation Guide

VIF Reference Scale:
  • 1-2: No collinearity
  • 2-5: Low
  • 5-10: Moderate
  • 10+: Severe
Method Comparison:
  • Correlation: Very fast, moderate accuracy - Quick screening, linear relationships
  • Chi-Square: Very fast, moderate accuracy - Categorical features
  • Mutual Info: Fast, good accuracy, partially captures interactions - Non-linear relationships
  • RFE: Slow, high accuracy - Finding optimal subset
  • RFECV: Very slow, very high accuracy - Automatically finds optimal feature count
  • Lasso: Fast, high accuracy - Linear models, sparsity
  • Tree Importance: Fast, high accuracy - Non-linear, tree models
  • Permutation: Moderate speed, very high accuracy - Validating any model

Key Takeaways

Filter Methods Are Fast

Use statistical tests (correlation, chi-square) to rank features independently of any model. Great for initial screening of large feature sets.

RFE Finds Optimal Subsets

Recursive Feature Elimination iteratively removes weak features using model performance. More accurate but computationally expensive.

Embedded Methods Are Efficient

Lasso and tree importance perform feature selection during training. Get selection and model fitting in one step.

Watch for Multicollinearity

Highly correlated features cause unstable coefficients and inflate variance. Use VIF to detect and remove redundant features.

Balance Complexity and Performance

Fewer features mean faster training, better interpretability, and often better generalization. Don't keep features "just in case."

Combine Methods Strategically

Use filter methods for initial screening, then wrapper or embedded methods for fine-tuning. This balances speed and accuracy.
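This two-stage idea can be wired into a single scikit-learn Pipeline — a sketch where a filter stage (SelectKBest) feeds an embedded stage (SelectFromModel); k=20 and threshold='median' are illustrative choices, not rules:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

pipe = Pipeline([
    # Stage 1 (filter): cheap univariate screen, keep the top 20 by ANOVA F-score
    ('filter', SelectKBest(f_classif, k=20)),
    # Stage 2 (embedded): keep the more important half by Random Forest importance
    ('embedded', SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold='median')),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(f"5-fold accuracy with two-stage selection: {scores.mean():.3f}")
```

Because both selection steps live inside the pipeline, they are re-fit on each training fold, so cross-validation scores are not leaked through the feature selection.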

Knowledge Check

Quick Quiz

Test what you've learned about feature selection techniques

1 Which feature selection method uses statistical tests independent of any ML model?
2 What does RFE stand for in feature selection?
3 Which regularization technique can drive feature coefficients exactly to zero?
4 What does VIF measure in the context of multicollinearity?
5 Which statistical test is appropriate for feature selection with categorical features and a categorical target?
6 What is a key advantage of embedded methods over wrapper methods?