Filter Methods
Imagine you're hiring for a position and receive 1,000 applications. You can't interview everyone, so you create a quick screening checklist: "Has 5+ years experience?" "Has relevant degree?" "Lives within 50 miles?" This is exactly how filter methods work! They apply simple statistical tests (like correlation or variance) to quickly screen features without actually training a model. Just like HR screens resumes before interviews, filter methods screen features before model training. It's fast but might miss candidates (features) who don't fit the rigid criteria but would perform well in the actual job (model)!
Filter methods evaluate and rank features based on statistical measures (correlation, variance, chi-square, mutual information) independently of any machine learning model. They're called "filters" because they act as a preprocessing screen, filtering out irrelevant or low-information features before model training.
Core Principle: Assess each feature's intrinsic relationship with the target variable using univariate statistical tests. Features that pass the threshold (high correlation, low p-value, high mutual information) are selected; others are discarded. This makes filter methods fast and scalable but potentially less accurate than model-based approaches.
Filter methods are the fastest and simplest approach to feature selection. They're perfect for initial exploration, high-dimensional data, and when you need a model-agnostic solution that works across different algorithms.
Why Use Filter Methods?
Advantages:
- Fast: No model training required
- Scalable: Works with high-dimensional data
- Model-agnostic: Results apply to any algorithm
- Avoids overfitting: Selection is independent of the model
- Good for preprocessing: Quick initial screening
Limitations:
- Ignores feature interactions: Evaluates each feature independently
- May miss relevant features: Not optimized for any specific model
- Threshold selection: You still have to choose cutoff values
- Blind to redundancy: Two near-duplicate features can both score well and both be selected
Correlation-Based Selection
Correlation measures the strength and direction of the linear relationship between two variables. When selecting features, we calculate how each feature correlates with the target variable. Features with high absolute correlation (close to 1 or -1) are typically more predictive.
- +1.0: Perfect positive correlation (as feature ↑, target ↑)
- -1.0: Perfect negative correlation (as feature ↑, target ↓)
- 0.0: No linear relationship (but may have non-linear relationship!)
- 0.7-1.0: Strong correlation (usually predictive)
- 0.3-0.7: Moderate correlation (consider keeping)
- 0.0-0.3: Weak correlation (candidate for removal)
# WHY? We want to find features with strong linear relationships to the target
# WHAT? Calculate Pearson correlation coefficient for each feature
import pandas as pd
import numpy as np
from sklearn.datasets import load_diabetes
import seaborn as sns
import matplotlib.pyplot as plt
# STEP 1: Load sample dataset (diabetes progression prediction)
# DATASET: 10 features (age, sex, bmi, blood pressure, etc.) predicting disease progression
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df['target'] = diabetes.target
print("Dataset shape:", df.shape) # (442 patients, 11 columns including target)
print("\nFirst few rows:")
print(df.head())
# STEP 2: Calculate correlation with target variable
# .corr() computes pairwise correlation of all columns
# ['target'] extracts just correlations with target
# .drop('target') removes target's self-correlation (always 1.0)
# .abs() takes absolute value (we care about strength, not direction)
# .sort_values(ascending=False) ranks from strongest to weakest
correlations = df.corr()['target'].drop('target').abs().sort_values(ascending=False)
print("\nFeature Correlations with Target (Absolute Values):")
print("="*50)
print(correlations)
print("\n" + "="*50)
# STEP 3: Select features above threshold
# THRESHOLD CHOICE: Common practice is 0.3 (moderate correlation)
# WHY 0.3? Below this, linear relationship is too weak to be reliably predictive
threshold = 0.3
selected_features = correlations[correlations > threshold].index.tolist()
print(f"\nFeatures with |correlation| > {threshold}:")
for feat in selected_features:
    corr_val = df.corr()['target'][feat]  # Get original (signed) correlation
    direction = "↑ positive" if corr_val > 0 else "↓ negative"
    print(f"  {feat:10} = {corr_val:+.3f} ({direction} relationship)")
print(f"\n✓ Selected {len(selected_features)} out of {len(correlations)} features")
print(f"✗ Rejected {len(correlations) - len(selected_features)} features")
bmi (0.586), s5 (0.566), and bp (blood pressure, 0.441) show moderate-to-strong correlations,
meaning they have clear linear relationships with disease progression.
Features like sex (0.043) barely correlate and are unlikely to help linear models.
By setting a threshold of 0.3, we automatically select the 6 most correlated features, reducing dimensionality from 10 to 6.
This speeds up training and can reduce overfitting!
Feature Correlations with Target (Absolute Values):
bmi    0.586
s5     0.566
bp     0.441
s4     0.430
s3     0.395
s6     0.382
s1     0.212
age    0.188
s2     0.174
sex    0.043
==================================================
Features with |correlation| > 0.3: ['bmi', 's5', 'bp', 's4', 's3', 's6']
Visualizing Correlations
Visual representations help identify patterns that numbers alone might miss. Let's create two powerful visualizations:
# VISUALIZATION 1: Correlation Heatmap
# WHY? See relationships between ALL features (not just with target)
# IDENTIFIES: Multicollinearity (highly correlated feature pairs)
plt.figure(figsize=(12, 8))
correlation_matrix = df.corr() # All pairwise correlations
# Create mask for upper triangle (avoid duplicate information)
# WHY? Correlation matrix is symmetric: corr(A,B) = corr(B,A)
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
sns.heatmap(correlation_matrix,
mask=mask, # Hide upper triangle
annot=True, # Show correlation values
cmap='RdBu_r', # Red (negative) to Blue (positive)
center=0, # White at 0 (no correlation)
fmt='.2f', # 2 decimal places
linewidths=0.5, # Grid lines between cells
cbar_kws={'label': 'Correlation Coefficient'})
plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# VISUALIZATION 2: Bar plot of target correlations
# WHY? Quickly see which features meet your threshold
# IDENTIFIES: Candidates for removal (red bars below threshold)
plt.figure(figsize=(10, 6))
colors = ['green' if x > threshold else 'red' for x in correlations]
correlations.plot(kind='barh', color=colors)
plt.axvline(x=threshold, color='black', linestyle='--',
label=f'Threshold ({threshold})', linewidth=2)
plt.xlabel('Absolute Correlation with Target', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.title('Feature Correlation Ranking', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
# INTERPRETATION:
# - Heatmap: Dark blue/red cells = strong correlations (watch for feature pairs!)
# - Bar plot: Green bars = selected features, Red bars = rejected features
sex and age appear in red, suggesting they have weak linear relationships with diabetes progression.
Pearson vs. Spearman Correlation
Not all relationships are linear! Pearson correlation only captures straight-line relationships, while Spearman correlation can detect any monotonic pattern (consistently increasing or decreasing).
Pearson Correlation
- Measures: Linear relationship
- Best for: Continuous variables with linear relationships
- Assumptions: Normal distribution, no outliers
# Pearson correlation (default)
pearson_corr = df.corr(method='pearson')['target']
print("Pearson correlations:")
print(pearson_corr.drop('target').abs().sort_values(ascending=False))
Spearman Correlation
- Measures: Monotonic relationship (rank-based)
- Best for: Non-linear relationships, ordinal data
- Assumptions: None (robust to outliers)
# Spearman correlation (rank-based)
spearman_corr = df.corr(method='spearman')['target']
print("Spearman correlations:")
print(spearman_corr.drop('target').abs().sort_values(ascending=False))
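The difference shows up clearly on a monotonic but curved relationship. Here is a minimal sketch on synthetic data (not the diabetes set), using only pandas and NumPy:

```python
# Synthetic demo: y = e^x is strictly increasing but strongly curved,
# so the ranks line up perfectly while a straight-line fit does not.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df_demo = pd.DataFrame({'x': rng.uniform(0, 5, 200)})
df_demo['y'] = np.exp(df_demo['x'])

pearson = df_demo['x'].corr(df_demo['y'], method='pearson')
spearman = df_demo['x'].corr(df_demo['y'], method='spearman')
print(f"Pearson:  {pearson:.3f}")   # noticeably below 1: penalized by curvature
print(f"Spearman: {spearman:.3f}")  # exactly 1: the ranking is identical
```

Spearman hits exactly 1.0 because the relationship is strictly monotonic; Pearson drops because the points don't sit on a straight line.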
Chi-Square Test for Categorical Features
When working with categorical features and categorical targets, correlation doesn't make sense (you can't correlate "Red" vs "Blue" with "Yes" vs "No"). Instead, we use the chi-square (χ²) test of independence, which measures whether two categorical variables are independent or associated.
- Null Hypothesis (H₀): The feature and target are independent (no relationship)
- Alternative Hypothesis (H₁): The feature and target are associated (related)
- High χ² score: Large difference between observed and expected frequencies → likely associated
- Low p-value (< 0.05): Reject H₀ → feature is statistically significant
- Example: If "Education" and "Promotion" are independent, PhDs and high schoolers would be promoted at equal rates. Chi-square tests if the actual data differs significantly from this expectation.
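The education/promotion logic above can be checked by hand with scipy's independence test on a small contingency table. The counts below are invented for illustration:

```python
# Hypothetical 2x2 table: rows = education (Degree / No degree),
# columns = promoted (Yes / No). Counts are made up for this sketch.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 20],   # Degree:    30 promoted, 20 not
                     [10, 40]])  # No degree: 10 promoted, 40 not

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2_stat:.2f}, p = {p_value:.4f}, dof = {dof}")
print("Expected counts under independence:")
print(expected)  # both groups promoted at the overall 40% rate
```

The bigger the gap between observed and expected counts, the larger χ² grows and the smaller p gets; that same signal is what sklearn's chi2 scorer ranks features by.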
# WHY CHI-SQUARE? Perfect for categorical feature → categorical target relationships
# WHAT? Tests if feature and target are independent (H₀) or associated (H₁)
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.preprocessing import LabelEncoder
import pandas as pd
# STEP 1: Create sample categorical data - Employee promotion scenario
# SCENARIO: Can we predict "promoted" from education/experience/department?
data = {
'education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor',
'Master', 'PhD', 'High School', 'Bachelor', 'Master'],
'experience': ['Junior', 'Mid', 'Senior', 'Senior', 'Junior',
'Mid', 'Senior', 'Junior', 'Mid', 'Senior'],
'department': ['Sales', 'IT', 'IT', 'Research', 'Marketing',
'IT', 'Research', 'Sales', 'Marketing', 'IT'],
'promoted': ['No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes']
}
df_cat = pd.DataFrame(data)
print("Original Categorical Data:")
print(df_cat.head())
print()
# STEP 2: Encode categorical variables to numeric
# WHY? sklearn's chi2 requires non-negative numeric input
# HOW? LabelEncoder: 'Bachelor'=0, 'Master'=1, 'PhD'=2, etc.
# CAVEAT: chi2 treats these codes as magnitudes, so label encoding is a
# shortcut here - one-hot encoding is statistically cleaner for nominal data
le = LabelEncoder()
df_encoded = df_cat.apply(le.fit_transform)
print("Encoded Data:")
print(df_encoded.head())
print()
# STEP 3: Separate features and target
X = df_encoded.drop('promoted', axis=1) # 3 features: education, experience, department
y = df_encoded['promoted'] # Target: 0=Not Promoted, 1=Promoted
# STEP 4: Apply chi-square test
# RETURNS:
# chi2_scores: How much observed frequencies differ from expected (higher = more associated)
# p_values: Probability of seeing this difference by chance (lower = more significant)
chi2_scores, p_values = chi2(X, y)
# STEP 5: Create results dataframe
chi2_results = pd.DataFrame({
'Feature': X.columns,
'Chi2_Score': chi2_scores,
'P_Value': p_values
}).sort_values('Chi2_Score', ascending=False)
print("Chi-Square Test Results:")
print("="*60)
for _, row in chi2_results.iterrows():
    significance = "✓ SIGNIFICANT" if row['P_Value'] < 0.05 else "✗ Not significant"
    print(f"{row['Feature']:12} | χ²={row['Chi2_Score']:6.3f} | p={row['P_Value']:.4f} | {significance}")
print("="*60)
# STEP 6: Select features with p-value < 0.05 (95% confidence)
# INTERPRETATION: p < 0.05 means "less than 5% chance this relationship is random"
significant_features = chi2_results[chi2_results['P_Value'] < 0.05]['Feature'].tolist()
print(f"\nStatistically significant features (p < 0.05):")
if significant_features:
    print(f"✓ Selected: {significant_features}")
    print(f"  These features show a statistically significant association with promotion status!")
else:
    print("✗ None found - need more data or different features")
If education is significant (p < 0.05), it means PhDs are promoted at a statistically different rate than high school grads,
not just by random chance. A high χ² score indicates large deviations from independence (if they were independent, promotion rates would be the same for all education levels).
A low p-value (< 0.05) gives us confidence that this pattern is real, not a fluke in our sample.
Features with high χ² and low p-value are strong predictors for classification models!
SelectKBest: Automated Feature Selection
Scikit-learn's SelectKBest is a convenient wrapper that automatically selects the top K features
based on univariate statistical tests. Instead of manually calculating scores and applying thresholds,
it does everything in one step!
- Choose a scoring function (f_classif, chi2, mutual_info_classif, etc.)
- Specify K - how many top features you want
- fit_transform() - calculates scores and returns only the K best features
- get_support() - shows which features were selected (boolean mask or indices)
# WHY SELECTKBEST? Automates the "score features, rank them, select top K" workflow
# WHAT? Wrapper around statistical tests - picks top K features in one line
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.datasets import make_classification
# STEP 1: Create synthetic classification dataset
# PARAMETERS:
# n_samples=1000: 1000 data points
# n_features=20: 20 total features
# n_informative=5: Only 5 are truly predictive!
# n_redundant=5: 5 are linear combinations of informative ones
# n_classes=2: Binary classification (0 or 1)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
n_redundant=5, n_classes=2, random_state=42)
feature_names = [f'feature_{i}' for i in range(20)]
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Target distribution: Class 0={sum(y==0)}, Class 1={sum(y==1)}")
print()
# STEP 2: Select top 10 features using ANOVA F-test
# SCORING FUNCTION: f_classif (ANOVA F-statistic)
# WHY? Tests if feature means differ between classes (good for numerical → categorical)
# K=10: We want the 10 most predictive features out of 20
selector_ftest = SelectKBest(score_func=f_classif, k=10)
X_selected_ftest = selector_ftest.fit_transform(X, y)
# STEP 3: Get selected feature indices and scores
# get_support(indices=True): Returns indices [0, 1, 3, 4, ...] of selected features
# .scores_: The F-statistic score for each feature
selected_indices = selector_ftest.get_support(indices=True)
scores = selector_ftest.scores_
print("ANOVA F-Test Results:")
print("="*70)
print(f"{'Feature':<15} {'F-Score':>10} {'Status':>15} {'Rank':>10}")
print("="*70)
# Sort by score (descending) to show ranking
for rank, (i, score) in enumerate(sorted(enumerate(scores), key=lambda x: x[1], reverse=True), 1):
    status = "✓ SELECTED" if i in selected_indices else "✗ Removed"
    print(f"{feature_names[i]:<15} {score:>10.2f} {status:>15} {rank:>10}")
print("="*70)
print(f"\nOriginal features: {X.shape[1]}")
print(f"Selected features: {X_selected_ftest.shape[1]}")
print(f"Dimensionality reduction: {X.shape[1]} → {X_selected_ftest.shape[1]} ({X_selected_ftest.shape[1]/X.shape[1]*100:.0f}% of original)")
f_classif performs ANOVA F-tests, which check if the feature's mean value differs between classes.
High F-scores mean "this feature has very different average values for Class 0 vs Class 1" - a strong signal for classification!
The output shows feature_3 has the highest F-score (156.42), making it the #1 most discriminative feature.
By selecting K=10, we've reduced our feature space from 20 to 10 dimensions, which can speed up training and may reduce overfitting.
The "Removed" features (ranks 11-20) had weak signals and would likely just add noise to the model.
ANOVA F-Test Results:
----------------------------------------
feature_3:  156.42  ✓ SELECTED
feature_1:  142.18  ✓ SELECTED
feature_4:  128.76  ✓ SELECTED
feature_0:   98.32  ✓ SELECTED
feature_7:   87.21  ✓ SELECTED
feature_2:   45.67  ✓ SELECTED
feature_9:   42.11  ✓ SELECTED
feature_6:   38.94  ✓ SELECTED
feature_8:   31.22  ✓ SELECTED
feature_5:   28.45  ✓ SELECTED
feature_11:  12.33  ✗ Removed
feature_14:   8.76  ✗ Removed
...
Original features: 20
Selected features: 10
Mutual Information for Non-Linear Relationships
Mutual Information (MI) is a more sophisticated measure than correlation. While correlation only detects linear relationships, mutual information captures any statistical dependency - including complex non-linear patterns like parabolic, sinusoidal, or exponential relationships.
Correlation
- Detects: Linear relationships only (y = mx + b)
- Example failures:
  - y = x² (parabola) → correlation ≈ 0
  - y = sin(x) → correlation ≈ 0
  - y = |x| (V-shape) → correlation ≈ 0
- Misses non-linear patterns!
Mutual Information
- Detects: Any statistical dependency (linear or non-linear)
- Captures:
  - y = x² (parabola) → high MI score
  - y = sin(x) → high MI score
  - y = |x| (V-shape) → high MI score
- Finds hidden patterns!
- Non-linear data: Features with curved, exponential, or periodic relationships
- Mixed data types: Works with both continuous and categorical features
- Safety net: Won't miss important features that correlation would ignore
- Drawback: Slower to compute than correlation (uses nearest-neighbor estimation)
- Interpretation: MI=0 means independent, MI>0 means dependent (higher = stronger)
# WHY MUTUAL INFORMATION? Finds features with ANY relationship (not just linear)
# WHAT? Measures how much knowing X reduces uncertainty about Y
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.datasets import load_diabetes
import pandas as pd
import numpy as np
# STEP 1: Load diabetes dataset (regression task)
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target
print(f"Dataset: {X.shape[0]} patients, {X.shape[1]} features")
print(f"Target: Disease progression (continuous value)\n")
# STEP 2: Calculate mutual information (for regression)
# FUNCTION: mutual_info_regression (for continuous targets)
# NOTE: Use mutual_info_classif for categorical targets
# random_state=42: Makes results reproducible (MI uses randomness internally)
mi_scores = mutual_info_regression(X, y, random_state=42)
# STEP 3: Create results dataframe
mi_results = pd.DataFrame({
'Feature': X.columns,
'MI_Score': mi_scores
}).sort_values('MI_Score', ascending=False)
print("Mutual Information Scores (Higher = More Informative):")
print("="*60)
for _, row in mi_results.iterrows():
    # Create visual bar using █ characters (scaled by MI score)
    bar_length = int(row['MI_Score'] * 20)
    bar = "█" * bar_length
    print(f"{row['Feature']:8} {row['MI_Score']:.3f} {bar}")
print("="*60)
# STEP 4: Compare with correlation
# GOAL: See which features have linear (correlation) vs non-linear (MI only) relationships
correlations = X.apply(lambda x: np.abs(np.corrcoef(x, y)[0, 1]))
comparison = pd.DataFrame({
'Feature': X.columns,
'Correlation': correlations,
'Mutual_Info': mi_scores,
'Diff': mi_scores - correlations # Positive = MI found something correlation missed!
}).sort_values('Mutual_Info', ascending=False)
print("\nCorrelation vs Mutual Information Comparison:")
print("="*70)
print(f"{'Feature':<10} {'Correlation':>12} {'Mutual Info':>12} {'Difference':>12} {'Insight'}")
print("="*70)
for _, row in comparison.iterrows():
    diff = row['Diff']
    if diff > 0.1:
        insight = "Non-linear!"
    elif row['Mutual_Info'] > 0.3:
        insight = "Strong signal"
    else:
        insight = "Weak signal"
    print(f"{row['Feature']:<10} {row['Correlation']:>12.3f} {row['Mutual_Info']:>12.3f} {diff:>+12.3f} {insight}")
print("="*70)
If age had MI=0.4 but correlation=0.1, it might mean "disease progression is high for both very young and very old patients" (a U-shaped relationship).
Correlation would miss this pattern entirely (averaging out the ups and downs), but MI captures it.
The bar chart visualization makes it easy to spot the most informative features - longer bars = more predictive power!
| Method | Feature Type | Target Type | Relationship |
|---|---|---|---|
| f_classif | Numerical | Categorical | Linear |
| f_regression | Numerical | Numerical | Linear |
| chi2 | Non-negative/Categorical | Categorical | Independence |
| mutual_info_classif | Any | Categorical | Any (non-linear) |
| mutual_info_regression | Any | Numerical | Any (non-linear) |
Variance Threshold: Remove Low-Variance Features
Features with very low variance (near-constant values across samples) carry almost no information.
If a feature has the same value 99% of the time, it won't help distinguish between different samples!
VarianceThreshold automatically removes such "boring" features.
- No information: If feature value is almost always the same, it can't predict anything
- Example: "Has Email Address" in a dataset where 100% of users have emails = useless
- Causes issues: Can break certain models (e.g., scaling divides by std, which is ~0)
- Wastes compute: Why train on features that add no value?
- Common threshold: Remove features with variance < 0.01 or features with >95% same value
# WHY? Features with no variation can't help distinguish between samples
# WHAT? Remove features where variance is below a threshold
from sklearn.feature_selection import VarianceThreshold
import numpy as np
import pandas as pd
# STEP 1: Create sample data with different variance levels
np.random.seed(42)
X = pd.DataFrame({
'feature_high_var': np.random.randn(100), # High variance (~1.0)
'feature_medium_var': np.random.randn(100) * 0.5, # Medium variance (~0.25)
'feature_low_var': np.random.randn(100) * 0.1, # Low variance (~0.01)
'feature_constant': np.ones(100), # Zero variance (all 1's)
'feature_nearly_constant': np.concatenate([ # Very low variance
np.ones(95), np.zeros(5) # 95% are 1's, only 5% are 0's
])
})
print("Dataset shape:", X.shape)
print("\nOriginal variances:")
print("="*50)
for col in X.columns:
    var = X[col].var()
    unique_pct = len(X[col].unique()) / len(X[col]) * 100
    print(f"{col:25} var={var:.4f} unique={unique_pct:5.1f}%")
print("="*50)
print()
# STEP 2: Remove features with variance < 0.1
# THRESHOLD=0.1: Features with variance below this are removed
# WHY 0.1? Common choice - catches near-constant features while keeping informative ones
selector = VarianceThreshold(threshold=0.1)
X_selected = selector.fit_transform(X)
# STEP 3: Get which features were selected/removed
selected_mask = selector.get_support()
selected_features = X.columns[selected_mask].tolist()
removed_features = X.columns[~selected_mask].tolist()
print(f"✓ Selected features (variance ≥ 0.1):")
for feat in selected_features:
    print(f"  - {feat:25} (var={X[feat].var():.4f})")
print(f"\n✗ Removed features (variance < 0.1):")
for feat in removed_features:
    print(f"  - {feat:25} (var={X[feat].var():.4f})")
print(f"\nDimensionality: {X.shape} → {X_selected.shape}")
print(f"Reduction: {len(removed_features)}/{X.shape[1]} features removed ({len(removed_features)/X.shape[1]*100:.0f}%)")
feature_constant is all 1's (variance=0) - completely useless!
feature_nearly_constant is 95% ones and only 5% zeros (variance=0.048) - almost useless.
feature_low_var has variance=0.009, well below our threshold of 0.1.
These features can't distinguish between different samples because they're nearly the same everywhere.
Meanwhile, feature_high_var (variance=1.043) and feature_medium_var (variance=0.248) have enough variation to be informative.
By removing 3 out of 5 features, we've simplified our dataset without losing useful information!
Original variances:
feature_high_var           1.043
feature_medium_var         0.248
feature_low_var            0.009
feature_constant           0.000
feature_nearly_constant    0.048
dtype: float64
Selected features: ['feature_high_var', 'feature_medium_var']
Removed features: ['feature_low_var', 'feature_constant', 'feature_nearly_constant']
Shape: (100, 5) → (100, 2)
Complete Filter Methods Pipeline
Putting the pieces together: wrapping the filter steps in a scikit-learn Pipeline means selection is re-fit inside every cross-validation fold, so no information leaks from the validation data.
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Create dataset with many features
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
n_redundant=10, random_state=42)
print(f"Original shape: {X.shape}")
# Build filter selection pipeline
filter_pipeline = Pipeline([
('variance', VarianceThreshold(threshold=0.01)), # Remove near-constant
('scaling', StandardScaler()), # Scale features
('select_k', SelectKBest(f_classif, k=15)), # Select top 15
('classifier', LogisticRegression(max_iter=1000))
])
# Evaluate with cross-validation
scores = cross_val_score(filter_pipeline, X, y, cv=5, scoring='accuracy')
print(f"Cross-validation accuracy: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")
# Compare with no feature selection
baseline_pipeline = Pipeline([
('scaling', StandardScaler()),
('classifier', LogisticRegression(max_iter=1000))
])
baseline_scores = cross_val_score(baseline_pipeline, X, y, cv=5, scoring='accuracy')
print(f"Baseline accuracy (all features): {baseline_scores.mean():.3f} (+/- {baseline_scores.std()*2:.3f})")
Practice Questions: Filter Methods
Test your understanding with these hands-on exercises.
Task: Explain the main advantage of filter methods over wrapper methods.
View Solution
The main advantage of filter methods is computational efficiency. They evaluate features based on statistical properties without training any machine learning model, making them:
- Fast: Can handle high-dimensional data quickly
- Model-agnostic: Selected features work with any algorithm
- Less prone to overfitting: Selection is independent of model
Wrapper methods require training models multiple times, which can be slow for large datasets.
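That speed gap is easy to demonstrate. A rough timing sketch on synthetic data (absolute numbers will vary by machine):

```python
# Time a filter (one pass of statistics) against a wrapper (many model fits).
import time
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                           random_state=42)

start = time.perf_counter()
SelectKBest(f_classif, k=10).fit(X, y)  # one pass of univariate statistics
filter_time = time.perf_counter() - start

start = time.perf_counter()
RFE(LogisticRegression(max_iter=1000),
    n_features_to_select=10).fit(X, y)  # ~40 full model fits
wrapper_time = time.perf_counter() - start

print(f"Filter (SelectKBest): {filter_time:.4f}s")
print(f"Wrapper (RFE):        {wrapper_time:.4f}s")
```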
Task: Explain when chi-square vs Pearson correlation is appropriate.
Show Solution
# Chi-square: Use for categorical features & target
# Example: Testing if gender is related to churn
from sklearn.feature_selection import chi2, SelectKBest
selector = SelectKBest(chi2, k=5)
X_selected = selector.fit_transform(X_categorical, y)  # X_categorical: an encoded feature matrix (assumed defined)
# Pearson: Use for continuous features & target
# Example: Testing if age correlates with income
corr = df[['age', 'income']].corr()
print(corr)
Task: Explain when mutual information is better than correlation.
Show Solution
# Mutual info captures NON-LINEAR relationships
# Correlation only measures linear relationships
import numpy as np
from sklearn.feature_selection import mutual_info_regression
# Example: y = x^2 has ZERO correlation but HIGH MI
x = np.linspace(-10, 10, 100)
y = x ** 2
# Correlation is near 0 (non-linear relationship)
print(f"Correlation: {np.corrcoef(x, y)[0,1]:.3f}")
# MI is high (strong dependency)
mi = mutual_info_regression(x.reshape(-1,1), y)
print(f"Mutual Info: {mi[0]:.3f}")
Task: Use SelectKBest with mutual_info_classif to select top 5 features.
Show Solution
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.datasets import make_classification
import pandas as pd
# Create sample dataset
X, y = make_classification(n_samples=500, n_features=20,
n_informative=5, random_state=42)
feature_names = [f'feature_{i}' for i in range(20)]
# Select top 5 features using mutual information
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, y)
# Get selected feature names
selected_mask = selector.get_support()
selected_features = [f for f, s in zip(feature_names, selected_mask) if s]
print(f"Selected: {selected_features}")
print(f"Shape: {X.shape} → {X_selected.shape}")
Wrapper Methods (RFE)
Imagine you're a restaurant owner with 50 dishes on your menu, but customers only love 10 of them. You could survey customers (filter method - quick but might miss combinations), OR you could try different menu combinations and measure actual sales (wrapper method - accurate but time-consuming). Recursive Feature Elimination (RFE) is like starting with all 50 dishes, tracking which ones sell worst each week, eliminating them one by one, and repeating until you're left with the winning 10. Each week you "train" by serving customers and "test" by measuring satisfaction. It's slower than just asking opinions, but it reveals which dishes work together to maximize revenue!
Wrapper methods treat the machine learning model as a "black box" and use its performance to evaluate feature subsets. The most popular wrapper method is Recursive Feature Elimination (RFE), which repeatedly trains models, ranks features by importance, and eliminates the weakest features until the optimal subset remains.
Key Advantage: Wrapper methods consider feature interactions and are optimized for your specific model, making them more accurate than filter methods. Key Drawback: They require training many models, making them computationally expensive for large datasets (with step=1, RFE trains roughly one model per eliminated feature!).
Wrapper methods bridge the gap between fast-but-inaccurate filter methods and slow-but-optimal exhaustive search. They're perfect when you need higher accuracy than filter methods and have moderate computational resources.
How RFE Works: Step-by-Step
-
Start with ALL features
Initial: 15 features [f0, f1, f2, ..., f14] -
Train model and rank features
Use model's coef_ (linear) or feature_importances_ (trees) to score each feature -
Remove weakest feature(s)
Eliminate the feature with lowest importance (or N features if step > 1) -
Repeat steps 2-3
Train again on remaining features → rank → remove → train → rank → remove... -
Stop at target number
When n_features_to_select is reached, return the surviving features
- n_features_to_select: How many features you want to keep (5, 10, "half", etc.)
- step: How many features to remove per iteration (step=1 removes 1 at a time, step=5 removes 5)
- estimator: Any model with coef_ or feature_importances_ (LogisticRegression, RandomForest, SVM, etc.)
- Trade-off: step=1 is most careful (re-ranks after every single removal) but slowest; step=5 is faster but might remove good features
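The loop described above can be written out by hand. This is an illustrative sketch, not sklearn's internal implementation (which ranks by squared coefficients and handles more estimator types):

```python
# Manual RFE: train, rank features by |coefficient|, drop the weakest, repeat.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)
remaining = list(range(X.shape[1]))  # indices of surviving features
target = 4

while len(remaining) > target:
    model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
    weakest = int(np.argmin(np.abs(model.coef_[0])))  # lowest |coef| among survivors
    removed = remaining.pop(weakest)
    print(f"Removed feature_{removed} ({len(remaining)} features left)")

print("Survivors:", [f"feature_{i}" for i in remaining])
```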
# WHY RFE? Finds features that work well TOGETHER for YOUR specific model
# WHAT? Recursively trains models, ranks features, removes weakest, repeats
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np
# STEP 1: Create synthetic dataset
# PARAMETERS:
# n_features=15: Total features
# n_informative=5: Only 5 are truly predictive
# n_redundant=5: 5 are linear combinations of informative ones
# n_classes=2: Binary classification
X, y = make_classification(n_samples=1000, n_features=15, n_informative=5,
n_redundant=5, n_classes=2, random_state=42)
feature_names = [f'feature_{i}' for i in range(15)]
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Goal: Find the 5 most important features\n")
# STEP 2: Initialize the model
# WHY LOGISTIC REGRESSION? It has coef_ attribute for feature ranking
# ALTERNATIVES: SVM, RandomForest, any model with coef_ or feature_importances_
model = LogisticRegression(max_iter=1000, random_state=42)
# STEP 3: Create RFE selector
# n_features_to_select=5: We want to keep 5 features
# step=1: Remove 1 feature at a time (most careful approach)
# HOW MANY MODELS? With 15 features, step=1, and a target of 5: one fit per elimination round = about 10 model trainings
rfe = RFE(estimator=model,
n_features_to_select=5, # Target: 5 features
step=1, # Remove 1 at a time
verbose=0) # Set to 1 to see progress
print("Training RFE... (this trains 10 models: features 15 → 14 → 13 → ... → 5)")
rfe.fit(X, y)
print("✓ RFE complete!\n")
# STEP 4: Get results
# support_: Boolean mask (True = selected, False = eliminated)
# ranking_: 1 = selected, 2+ = eliminated (higher rank = eliminated earlier)
results = pd.DataFrame({
'Feature': feature_names,
'Selected': rfe.support_,
'Ranking': rfe.ranking_,
'Elimination_Order': rfe.ranking_ # 1 = kept, 2 = removed last, 11 = removed first
}).sort_values('Ranking')
print("RFE Feature Selection Results:")
print("="*70)
print(f"{'Feature':<15} {'Selected':>10} {'Ranking':>10} {'Status'}")
print("="*70)
for _, row in results.iterrows():
    selected = "✓ SELECTED" if row['Selected'] else f"✗ Removed (round {row['Ranking']})"
    print(f"{row['Feature']:<15} {str(row['Selected']):>10} {row['Ranking']:>10} {selected}")
print("="*70)
print(f"\n✓ Selected features: {np.array(feature_names)[rfe.support_].tolist()}")
print(f"\nInterpretation:")
print(f" - Ranking=1: These 5 features survived all eliminations")
print(f" - Ranking=2: Removed in final round (6th best feature)")
print(f" - Ranking=11: Removed first (worst feature)")
In this run, feature_0, feature_1, feature_3, feature_4, and feature_7 consistently had the highest coefficients across iterations,
suggesting they are the most predictive. By setting n_features_to_select=5, we've recovered a subset that matches the 5 informative features built into the dataset.
RFECV: Automatic Optimal Feature Count
RFE has one drawback: you must choose n_features_to_select in advance (5? 10? 15?).
RFECV (RFE with Cross-Validation) removes the guesswork by testing every feature count (1, 2, 3, ..., N),
using cross-validation to measure performance at each count, and automatically selecting the count with the highest score.
It's RFE with built-in experimentation: it finds the sweet spot between too few features (underfitting) and too many (overfitting).
# WHY RFECV? Automatically finds optimal number of features (no guessing!)
# WHAT? Tests ALL feature counts (1 to 25), picks the count with best cross-validation score
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
# STEP 1: Create dataset with known structure
# PARAMETERS:
# n_features=25: Total features to start
# n_informative=8: Only 8 are truly predictive (this is the "answer" we want RFECV to find)
# n_redundant=5: 5 are redundant (correlated with informative ones)
X, y = make_classification(n_samples=500, n_features=25, n_informative=8,
n_redundant=5, random_state=42)
feature_names = [f'feature_{i}' for i in range(25)]
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"True informative features: 8 (but RFECV doesn't know this!)\n")
# STEP 2: Initialize model
# WHY RANDOM FOREST? It has feature_importances_ and works well with RFECV
model = RandomForestClassifier(n_estimators=100, random_state=42)
# STEP 3: Create RFECV with cross-validation
# cv=StratifiedKFold(5): 5-fold stratified CV (maintains class balance)
# min_features_to_select=1: Test from 1 feature to 25 features
# scoring='accuracy': Optimize for accuracy (can use 'f1', 'roc_auc', etc.)
# HOW MANY MODELS? 25 feature counts × 5 folds = 125 model trainings!
rfecv = RFECV(
estimator=model,
step=1, # Remove 1 feature at a time
cv=StratifiedKFold(5), # 5-fold cross-validation
scoring='accuracy', # Metric to optimize
min_features_to_select=1, # Test down to 1 feature
n_jobs=-1 # Use all CPU cores
)
print("Running RFECV... (testing all feature counts from 1 to 25)")
print("This will train 125 models (25 feature counts × 5 folds)")
rfecv.fit(X, y)
print("✓ RFECV complete!\n")
# STEP 4: Get optimal results
# n_features_: Optimal number found by cross-validation
# cv_results_: Mean CV scores for each feature count
optimal_n = rfecv.n_features_
print(f"Optimal number of features: {optimal_n}")
print(f"Best cross-validation score: {rfecv.cv_results_['mean_test_score'].max():.4f}\n")
# STEP 5: Display selected features
results = pd.DataFrame({
'Feature': feature_names,
'Selected': rfecv.support_,
'Ranking': rfecv.ranking_
}).sort_values('Ranking')
print(f"Selected {optimal_n} features:")
print("="*50)
for _, row in results[results['Selected']].iterrows():
print(f" ✓ {row['Feature']}")
print(f"\nRejected {25 - optimal_n} features:")
for _, row in results[~results['Selected']].iterrows():
print(f" ✗ {row['Feature']} (rank {row['Ranking']})")
# STEP 6: Visualize CV scores vs feature count
print("\nCross-Validation Scores by Feature Count:")
for n_features in range(1, 26):
score = rfecv.cv_results_['mean_test_score'][n_features - 1]
bar = '█' * int(score * 50) # Visual bar chart
marker = " ← OPTIMAL" if n_features == optimal_n else ""
print(f" {n_features:2d} features: {score:.4f} {bar}{marker}")
print(f"\n✓ Peak performance at {optimal_n} features: Adding more features would overfit!")
cv_results_['mean_test_score'] shows that accuracy increases as features are added (1 feature: 0.85, 5 features: 0.92, ...),
then plateaus around 8-10 features, and might even decrease with too many features (overfitting).
RFECV automatically selected the optimal number of features because that's where CV accuracy peaked.
Notice how close this is to the true 8 informative features! By using cross-validation, RFECV avoids overfitting
(unlike regular RFE which might select too many features). This is the gold standard for wrapper methods -
automatic, data-driven, and prevents both underfitting (too few) and overfitting (too many).
Comparison: Filter vs Wrapper Methods
Filter Methods
Advantages
- Fast: One-time statistical calculation
- Scalable: Works with 1000+ features
- Model-agnostic: Results work for any model
- Simple: Easy to interpret (correlation, chi-square)
Limitations
- No interactions: Ignores feature combinations
- Not optimized: Doesn't consider your specific model
- May miss features: Good features might have low individual scores
Best for: Quick initial reduction, high-dimensional data, baseline selection
Wrapper Methods
Advantages
- Accurate: Considers feature interactions
- Model-specific: Optimized for YOUR model
- Finds combinations: Detects synergies between features
- Cross-validated (RFECV): Prevents overfitting
Limitations
- Slow: Trains hundreds of models
- Expensive: Doesn't scale to 1000+ features
- Model-dependent: Results tied to chosen estimator
- Overfitting risk: Without CV, might select too many
Best for: Moderate feature counts (<100), when accuracy matters, final selection stage
Hybrid Strategy
- Start with Filter Method: Use correlation/mutual information to reduce 1000 features → 50 features (fast, removes obvious noise)
- Finish with Wrapper Method: Use RFECV to select optimal subset from 50 → 15 features (accurate, finds best combinations)
- Result: You get the speed of filter methods AND the accuracy of wrapper methods!
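The hybrid strategy above can be sketched in a few lines. This is an illustrative pipeline, not a fixed recipe: the dataset is synthetic and the feature counts (60 → 20 → RFECV's optimum) are arbitrary choices.

```python
# Hybrid sketch: a fast filter stage, then RFECV on the survivors.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=60, n_informative=6,
                           random_state=42)

# Stage 1 (filter): keep the 20 features with highest mutual information
filt = SelectKBest(mutual_info_classif, k=20)
X_filtered = filt.fit_transform(X, y)

# Stage 2 (wrapper): let RFECV pick the optimal subset of those 20
rfecv = RFECV(LogisticRegression(max_iter=1000), cv=3)
rfecv.fit(X_filtered, y)

print(f"60 -> {X_filtered.shape[1]} (filter) -> {rfecv.n_features_} (RFECV)")
```

Because the wrapper stage only sees 20 candidates instead of 60, the expensive part of the search shrinks dramatically.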
Choosing the Right Estimator for RFE
- LogisticRegression: Fast, uses coefficients
- LinearSVC: Linear SVM with coefficients
- RandomForest: Tree-based importance
- GradientBoosting: Gradient boosted importance
- Ridge/Lasso: Regularized linear models
- KNN: No feature importance attribute
- Naive Bayes: No coef_ or feature_importances_
- SVC (rbf kernel): Non-linear, no coefficients
- Any model without coef_ or feature_importances_
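You can check compatibility before wiring an estimator into RFE. The helper below is an illustrative function (not part of scikit-learn): it fits the estimator and tests for the attributes RFE needs.

```python
# Check whether a fitted estimator exposes an importance attribute RFE can use.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

def supports_rfe(estimator, X, y):
    """Return True if the fitted estimator has coef_ or feature_importances_."""
    fitted = estimator.fit(X, y)
    return hasattr(fitted, 'coef_') or hasattr(fitted, 'feature_importances_')

print(supports_rfe(LogisticRegression(max_iter=1000), X, y))  # True
print(supports_rfe(KNeighborsClassifier(), X, y))             # False
```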
RFE with Different Models
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import LinearSVC
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
# Create dataset
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
random_state=42)
feature_names = [f'feature_{i}' for i in range(20)]
# Test different estimators
estimators = {
'LogisticRegression': LogisticRegression(max_iter=1000),
'RandomForest': RandomForestClassifier(n_estimators=50, random_state=42),
'GradientBoosting': GradientBoostingClassifier(n_estimators=50, random_state=42),
'LinearSVC': LinearSVC(max_iter=1000)
}
results = {}
for name, estimator in estimators.items():
rfe = RFE(estimator, n_features_to_select=5, step=1)
rfe.fit(X, y)
selected = np.array(feature_names)[rfe.support_].tolist()
results[name] = selected
print(f"{name}:")
print(f" Selected: {selected}")
print()
# Find consensus features (selected by all models)
all_selected = [set(v) for v in results.values()]
consensus = set.intersection(*all_selected)
print(f"Consensus features (selected by ALL models): {consensus}")
set.intersection() at the end finds "consensus features" - features that ALL four models agreed are important.
If a feature appears in the consensus set, you can be highly confident it's truly predictive, not just an artifact of one model's bias.
Features selected by only one model might be model-specific and less reliable for generalization.
Forward and Backward Selection
Besides RFE, there are other wrapper strategies for feature selection:
Forward Selection: Start with no features. Add one feature at a time that improves performance the most.
Pros: Good for few important features
Cons: Can miss feature interactions
Backward Elimination: Start with all features. Remove one feature at a time that hurts performance the least.
Pros: Considers all features together
Cons: Slow for high dimensions
Stepwise Selection: Combines both: can add or remove features at each step based on improvement.
Pros: Most flexible
Cons: Most computationally expensive
Sequential Feature Selection (sklearn)
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
import numpy as np
import time
# Load iris dataset
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
# Forward selection
knn = KNeighborsClassifier(n_neighbors=3)
print("Forward Selection:")
start = time.time()
sfs_forward = SequentialFeatureSelector(knn, n_features_to_select=2, direction='forward')
sfs_forward.fit(X, y)
forward_features = np.array(feature_names)[sfs_forward.get_support()].tolist()
print(f" Selected: {forward_features}")
print(f" Time: {time.time() - start:.2f}s")
print("\nBackward Selection:")
start = time.time()
sfs_backward = SequentialFeatureSelector(knn, n_features_to_select=2, direction='backward')
sfs_backward.fit(X, y)
backward_features = np.array(feature_names)[sfs_backward.get_support()].tolist()
print(f" Selected: {backward_features}")
print(f" Time: {time.time() - start:.2f}s")
Integrating RFE in a Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV
# Create pipeline with RFE
pipeline = Pipeline([
('scaler', StandardScaler()),
('rfe', RFE(estimator=LogisticRegression(max_iter=1000))),
('classifier', LogisticRegression(max_iter=1000))
])
# Tune the number of features
param_grid = {
'rfe__n_features_to_select': [3, 5, 7, 10],
'classifier__C': [0.1, 1, 10]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)
print("Best Parameters:")
print(grid_search.best_params_)
print(f"\nBest Score: {grid_search.best_score_:.3f}")
Pipeline ensures data flows correctly through each step and prevents data leakage during cross-validation (the scaler only sees training data, never test data).
GridSearchCV then tries all combinations of hyperparameters - testing 3, 5, 7, or 10 features with regularization strengths C=0.1, 1, or 10 (that's 12 combinations × 5 folds = 60 model trainings).
The best combination is automatically selected based on cross-validation accuracy.
This approach is the gold standard for feature selection because it tunes both the number of features AND the model hyperparameters together, finding the optimal configuration for your specific dataset.
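After the search, the winning feature subset can be read off the fitted pipeline. A self-contained sketch (synthetic data and a deliberately small grid, so it runs quickly):

```python
# Inspect which features the best pipeline kept after a grid search over RFE.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np

X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=42)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rfe', RFE(LogisticRegression(max_iter=1000))),
    ('classifier', LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(pipeline, {'rfe__n_features_to_select': [3, 5]}, cv=3)
grid.fit(X, y)

# The fitted RFE step lives inside best_estimator_
mask = grid.best_estimator_.named_steps['rfe'].support_
print(f"Selected feature indices: {np.where(mask)[0].tolist()}")
```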
To speed up RFE on large datasets, consider:
- Using step > 1 to remove multiple features per iteration
- Pre-filtering with a variance threshold or correlation before RFE
- Using a faster base estimator (e.g., LinearSVC instead of SVC)
- Parallel processing with n_jobs=-1 in RFECV
Practice Questions: Wrapper Methods
Test your understanding with these hands-on exercises.
Task: Explain what RFE is and how it works.
Show Solution
# RFE = Recursive Feature Elimination
# How it works:
# 1. Train model on all features
# 2. Rank features by importance (coef_ or feature_importances_)
# 3. Remove least important feature(s)
# 4. Repeat until n_features_to_select reached
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
rfe = RFE(RandomForestClassifier(), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_) # Boolean mask of selected features
Task: Explain why KNN cannot be used with RFE.
Show Solution
# KNN lacks coef_ and feature_importances_ attributes
# RFE needs these to rank features
# KNN is distance-based, doesn't learn weights
# Solution: Use SequentialFeatureSelector instead
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(knn, n_features_to_select=5)
sfs.fit(X, y) # This works with KNN!
Task: Compare RFE vs RFECV and explain when to use each.
Show Solution
# RFE vs RFECV Comparison
# RFE: You specify n_features_to_select manually
# RFECV: Uses cross-validation to find optimal number
# Use RFE when:
# - You know target feature count
# - Need faster results
# Use RFECV when:
# - Want data-driven optimal selection
# - Can afford computational cost
from sklearn.feature_selection import RFE, RFECV
from sklearn.ensemble import RandomForestClassifier
# RFE - manual selection
rfe = RFE(RandomForestClassifier(), n_features_to_select=5)
# RFECV - automatic optimal selection
rfecv = RFECV(RandomForestClassifier(), cv=5, scoring='accuracy')
rfecv.fit(X, y)
print(f"Optimal features: {rfecv.n_features_}")
Task: Implement RFECV with cross-validation scoring.
Show Solution
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import make_classification
# Create synthetic dataset
X, y = make_classification(n_samples=500, n_features=20,
n_informative=8, n_redundant=4,
random_state=42)
feature_names = [f'feature_{i}' for i in range(20)]
# Initialize Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
# Create RFECV with stratified 5-fold CV
rfecv = RFECV(
estimator=rf,
step=1,
cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
scoring='accuracy',
min_features_to_select=1
)
# Fit RFECV
rfecv.fit(X, y)
# Results
print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Best CV score: {max(rfecv.cv_results_['mean_test_score']):.4f}")
print(f"\nSelected features:")
for i, name in enumerate(feature_names):
if rfecv.support_[i]:
print(f" - {name}")
Embedded Methods
Imagine you're a coach building a basketball team. Filter methods are like scouting players individually (height, speed, shooting percentage - fast but ignores teamwork). Wrapper methods are like playing practice games with different player combinations (accurate but takes forever to try all combinations). Embedded methods are like holding tryouts where players naturally reveal their value during the game. As you run drills (train the model), weak players get benched (zero coefficients) while star players get more playing time (high coefficients). It's happening simultaneously - you're both evaluating AND selecting, making it faster than wrapper methods but smarter than filter methods!
Embedded Methods
Embedded methods perform feature selection as part of the model training process itself, not as a separate step. The model's training algorithm (like Lasso regularization or tree-based splitting) automatically penalizes or ignores irrelevant features while learning, making feature selection and model training happen simultaneously.
Key Advantage: More efficient than wrapper methods (trains once, not hundreds of times) while being more accurate than filter methods (considers feature interactions). Examples: Lasso L1 regularization (drives coefficients to zero), Tree-based importance (ranks features by split quality), ElasticNet (combines L1 + L2), and Gradient Boosting feature importances.
Embedded methods are the "sweet spot" between filter and wrapper methods - they're faster than wrapper methods (single training run) and more accurate than filter methods (model-aware selection).
Lasso (L1) Regularization for Feature Selection
How it works: Lasso adds a penalty term α * Σ|βⱼ| to the loss function.
During training, the model tries to minimize both prediction error AND the sum of absolute coefficient values.
For unimportant features, the penalty forces their coefficients to exactly zero (removed from the model).
For important features, the model "fights" to keep their coefficients non-zero because they're worth the penalty cost.
Result: Automatic feature selection - the model decides which features to keep!
Loss Function:
Loss = MSE + α * Σ|βⱼ|
Penalty: Σ|βⱼ| (sum of absolute values)
Effect: Shrinks some coefficients to exactly zero ✂️
Result: Automatic feature selection (sparse model)
α Parameter:
- α = 0: No penalty (all features kept)
- α = 0.01: Weak penalty (few features removed)
- α = 1.0: Strong penalty (many features removed)
- α = ∞: Max penalty (only intercept remains)
Best for: Sparse models with few important features, automatic feature selection
Loss Function:
Loss = MSE + α * Σβⱼ²
Penalty: Σβⱼ² (sum of squared values)
Effect: Shrinks all coefficients toward zero (but never exactly zero) 📉
Result: No feature selection (all features kept with reduced impact)
α Parameter:
- α = 0: No penalty (full coefficients)
- α = 0.01: Weak penalty (slightly shrunk)
- α = 1.0: Strong penalty (heavily shrunk)
- α = ∞: Max penalty (all → 0 but never exactly)
Best for: When all features might be useful, preventing overfitting without feature removal
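The contrast between the two penalties is easy to verify empirically. On the same scaled data, Lasso drives many coefficients to exactly zero while Ridge only shrinks them (the dataset and alpha below are illustrative):

```python
# Lasso produces exact zeros; Ridge shrinks but keeps every coefficient.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler
import numpy as np

X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(X_scaled, y)
ridge = Ridge(alpha=1.0).fit(X_scaled, y)

print(f"Lasso zero coefficients: {np.sum(lasso.coef_ == 0)}")  # many
print(f"Ridge zero coefficients: {np.sum(ridge.coef_ == 0)}")  # none
```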
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression
import pandas as pd
import numpy as np
# Create regression dataset
X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
noise=10, random_state=42)
feature_names = [f'feature_{i}' for i in range(20)]
# IMPORTANT: Scale features before Lasso
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit Lasso with different alpha values
alphas = [0.001, 0.01, 0.1, 1.0]
results = {}
for alpha in alphas:
lasso = Lasso(alpha=alpha, random_state=42)
lasso.fit(X_scaled, y)
n_selected = np.sum(lasso.coef_ != 0)
results[alpha] = {
'n_features': n_selected,
'non_zero_features': [feature_names[i] for i, c in enumerate(lasso.coef_) if c != 0]
}
print(f"Alpha = {alpha}: {n_selected} features selected")
print("\n" + "="*50)
print(f"\nWith alpha=1.0, selected features:")
for feat in results[1.0]['non_zero_features']:
print(f" - {feat}")
Alpha = 0.001: 20 features selected
Alpha = 0.01: 20 features selected
Alpha = 0.1: 12 features selected
Alpha = 1.0: 5 features selected
==================================================
With alpha=1.0, selected features:
 - feature_3
 - feature_5
 - feature_8
 - feature_12
 - feature_17
from sklearn.linear_model import LassoCV
import matplotlib.pyplot as plt
# Use cross-validation to find optimal alpha
lasso_cv = LassoCV(cv=5, random_state=42, n_alphas=100)
lasso_cv.fit(X_scaled, y)
print(f"Optimal alpha: {lasso_cv.alpha_:.4f}")
print(f"Number of features selected: {np.sum(lasso_cv.coef_ != 0)}")
# Get selected features and their coefficients
coef_df = pd.DataFrame({
'Feature': feature_names,
'Coefficient': lasso_cv.coef_
}).sort_values('Coefficient', key=abs, ascending=False)
print("\nFeature Coefficients (sorted by magnitude):")
print(coef_df[coef_df['Coefficient'] != 0])
# Plot coefficients
plt.figure(figsize=(12, 6))
colors = ['green' if c != 0 else 'red' for c in lasso_cv.coef_]
plt.bar(feature_names, np.abs(lasso_cv.coef_), color=colors)
plt.xticks(rotation=45)
plt.xlabel('Features')
plt.ylabel('|Coefficient|')
plt.title(f'Lasso Coefficients (alpha={lasso_cv.alpha_:.4f})')
plt.tight_layout()
plt.show()
Lasso for Classification (LogisticRegression with L1)
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
# Create classification dataset
X, y = make_classification(n_samples=1000, n_features=30, n_informative=5,
n_redundant=10, random_state=42)
feature_names = [f'feature_{i}' for i in range(30)]
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Logistic Regression with L1 penalty (Lasso)
log_reg_l1 = LogisticRegression(penalty='l1', solver='saga', C=0.1, max_iter=1000)
log_reg_l1.fit(X_scaled, y)
# Get selected features
selected_mask = log_reg_l1.coef_[0] != 0
selected_features = np.array(feature_names)[selected_mask]
print(f"Features selected: {len(selected_features)} out of {len(feature_names)}")
print(f"Selected: {selected_features.tolist()}")
# Compare performance
log_reg_all = LogisticRegression(max_iter=1000)
score_all = cross_val_score(log_reg_all, X_scaled, y, cv=5).mean()
score_selected = cross_val_score(log_reg_all, X_scaled[:, selected_mask], y, cv=5).mean()
print(f"\nAccuracy with all features: {score_all:.3f}")
print(f"Accuracy with selected features: {score_selected:.3f}")
The penalty='l1' parameter adds the L1 penalty to the logistic loss function, driving some coefficients to zero.
The C parameter is the inverse of alpha (C=0.1 means strong penalty, C=10 means weak penalty).
In this example, Lasso selected only 5-10 features out of 30, and the accuracy with selected features is nearly identical to using all features.
This proves that 20-25 features were noise - they didn't help predictions but added complexity.
By removing them, we get a simpler, faster, more interpretable model with no accuracy loss.
This is the power of embedded feature selection: automatic, efficient, and effective!
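The inverse relationship between C and the penalty strength can be seen by sweeping C and counting surviving coefficients. The C values below are illustrative, and exact counts will vary with the data:

```python
# Smaller C = stronger L1 penalty = fewer non-zero coefficients.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import numpy as np

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           n_redundant=10, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

counts = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    clf = LogisticRegression(penalty='l1', solver='saga', C=C, max_iter=5000)
    clf.fit(X_scaled, y)
    counts[C] = int(np.sum(clf.coef_[0] != 0))
    print(f"C={C:>5}: {counts[C]} non-zero coefficients")
```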
Tree-Based Feature Importance
Tree-based models like Random Forest and Gradient Boosting naturally compute feature importance based on how much each feature contributes to reducing impurity (Gini or entropy).
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
import pandas as pd
# Load breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
# Get feature importances
importance_df = pd.DataFrame({
'Feature': X.columns,
'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)
print("Top 10 Most Important Features (Random Forest):")
print(importance_df.head(10))
# Visualize
plt.figure(figsize=(12, 8))
top_n = 15
top_features = importance_df.head(top_n)
plt.barh(range(top_n), top_features['Importance'].values, color='steelblue')
plt.yticks(range(top_n), top_features['Feature'].values)
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
worst perimeter, worst radius, and mean concave points have the highest importance,
meaning they're the most useful for distinguishing malignant vs benign tumors.
How it works: For each split, measure how much it reduced impurity, average across all trees, and sum for each feature.
The bar chart shows the top 15 features - longer bars = more important.
Unlike filter methods, this considers feature interactions (how features work together in splits).
You can use these importances for feature selection by keeping only the top N features!
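Keeping the top N is a one-liner with pandas. A minimal sketch on the same breast cancer data (N=10 is an arbitrary illustrative cutoff):

```python
# Select the top-10 features by Random Forest importance.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Rank features by importance and slice the top 10
top10 = (pd.Series(rf.feature_importances_, index=X.columns)
           .nlargest(10).index)
X_top = X[top10]
print(f"Reduced {X.shape[1]} features to {X_top.shape[1]}")
```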
SelectFromModel: Automatic Feature Selection
SelectFromModel is a meta-transformer that works with any estimator having coef_ or feature_importances_.
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
# Using Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
# Select features with importance above median
sfm_rf = SelectFromModel(rf, threshold='median')
sfm_rf.fit(X, y)
X_rf_selected = sfm_rf.transform(X)
print(f"Random Forest Selection:")
print(f" Original features: {X.shape[1]}")
print(f" Selected features: {X_rf_selected.shape[1]}")
print(f" Features: {X.columns[sfm_rf.get_support()].tolist()[:5]}...")
# Using Lasso
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
lasso = LassoCV(cv=5, random_state=42)
lasso.fit(X_scaled, y)
sfm_lasso = SelectFromModel(lasso, prefit=True)
X_lasso_selected = sfm_lasso.transform(X_scaled)
print(f"\nLasso Selection:")
print(f" Original features: {X.shape[1]}")
print(f" Selected features: {X_lasso_selected.shape[1]}")
SelectFromModel works with any estimator that exposes coef_ or feature_importances_.
With threshold='median', it keeps features with importance above the median (top 50%).
For Random Forest, this reduced 30 features → 15 features (keeping the most important half).
For Lasso, it kept only features with non-zero coefficients (the prefit=True means Lasso was already trained).
Key advantage: One-line feature selection without manual threshold tuning!
Other threshold options: 'mean' (above average), '1.5*mean' (significantly above average), or a specific number like 0.01.
This is the easiest way to combine model training and feature selection - train once, select automatically!
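The alternative thresholds can be tried on a single fitted estimator; max_features is another handy option that caps the count directly. This is a sketch on synthetic data, so the feature counts it prints will differ on your dataset:

```python
# Comparing SelectFromModel thresholds on one prefit estimator.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

for threshold in ['mean', '1.5*mean', 0.01]:
    sfm = SelectFromModel(rf, threshold=threshold, prefit=True)
    n_kept = sfm.transform(X).shape[1]
    print(f"threshold={threshold!r}: {n_kept} features kept")

# max_features caps the count regardless of threshold
sfm = SelectFromModel(rf, threshold=-float('inf'), max_features=5, prefit=True)
print(f"max_features=5: {sfm.transform(X).shape[1]} features kept")
```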
XGBoost Feature Importance
# Install: pip install xgboost
from xgboost import XGBClassifier
import matplotlib.pyplot as plt
# Train XGBoost
# Note: use_label_encoder was deprecated and removed in recent XGBoost versions; omit it there
xgb = XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss')
xgb.fit(X, y)
# Get feature importance (multiple types available)
importance_types = ['weight', 'gain', 'cover']
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, imp_type in zip(axes, importance_types):
importance = xgb.get_booster().get_score(importance_type=imp_type)
sorted_importance = sorted(importance.items(), key=lambda x: x[1], reverse=True)[:10]
features = [x[0] for x in sorted_importance]
values = [x[1] for x in sorted_importance]
ax.barh(range(len(features)), values)
ax.set_yticks(range(len(features)))
ax.set_yticklabels(features)
ax.set_title(f'XGBoost Importance ({imp_type})')
ax.invert_yaxis()
plt.tight_layout()
plt.show()
# Explanation of importance types
print("""
Importance Types:
- weight: Number of times a feature is used in splits
- gain: Average gain of splits using the feature
- cover: Average coverage of splits using the feature
Recommendation: Use 'gain' for most reliable importance estimates
""")
Tip: Prefer importance_type='gain' when selecting features, since it reflects how much each feature actually improves predictions.
Permutation Importance (Model-Agnostic)
Unlike built-in feature importance, permutation importance measures how much model performance drops when a feature is randomly shuffled. This works with any model.
from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# Calculate permutation importance on test set
perm_importance = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
# Compare with built-in importance
comparison = pd.DataFrame({
'Feature': X.columns,
'Built-in Importance': rf.feature_importances_,
'Permutation Importance': perm_importance.importances_mean
})
print("Top 10 Features - Built-in vs Permutation Importance:")
print(comparison.sort_values('Permutation Importance', ascending=False).head(10))
Permutation importance works with any fitted model, even those without coef_ or feature_importances_.
The comparison shows that built-in importance (based on split impurity) and permutation importance (based on accuracy drop) can differ.
Why differences occur: Built-in importance measures "usage during training", permutation measures "real-world impact on predictions".
A feature might be used frequently in training but not critical for accuracy. Use permutation importance for the most trustworthy feature rankings!
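Permutation importance can drive selection directly: keep only features whose shuffling actually hurts test accuracy. A self-contained sketch (the "mean importance > 0" cutoff is an illustrative choice; a stricter rule would subtract a multiple of importances_std):

```python
# Select features whose permutation hurts held-out accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
import numpy as np

X, y = make_classification(n_samples=400, n_features=15, n_informative=5,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)

keep = result.importances_mean > 0   # shuffling these features hurt accuracy
print(f"Kept {keep.sum()} of {X.shape[1]} features: {np.where(keep)[0].tolist()}")
```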
Choosing an embedded method:
- Lasso: Linear relationships, sparse models - Interpretable, automatic selection
- Random Forest: Non-linear, general purpose - Handles interactions, robust
- XGBoost: Complex patterns, competition - High accuracy, multiple importance types
- Permutation: Any model, validation - Model-agnostic, reflects test performance
Practice Questions: Embedded Methods
Test your understanding with these hands-on exercises.
Task: Explain how L1 vs L2 affects feature selection.
Show Solution
# L1 (Lasso) vs L2 (Ridge) Regularization
# L1 (Lasso): Penalty = |beta|
# - Can shrink coefficients to EXACTLY zero
# - Produces sparse models (automatic feature selection)
# - Use for: feature selection
# L2 (Ridge): Penalty = beta^2
# - Shrinks coefficients TOWARD zero but rarely to zero
# - All features remain in model
# - Use for: multicollinearity, when all features useful
from sklearn.linear_model import Lasso, Ridge
lasso = Lasso(alpha=0.1) # Some coefficients become 0
ridge = Ridge(alpha=0.1) # All coefficients stay non-zero
Task: Explain why StandardScaler is required for Lasso.
Show Solution
# Why Scaling is Important for Lasso
# Lasso penalizes |coefficient|, not the feature's effect
# Without scaling:
# - Large-scale features have small coefficients
# - Small-scale features have large coefficients
# - Lasso penalizes unfairly!
# Example: salary (thousands) vs percentage (0-1)
# Salary coefficient: 0.001 per dollar
# Percentage coefficient: 50.0 per percent
# Lasso sees percentage as "more important" (wrong!)
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
# Always scale before Lasso!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
lasso = Lasso(alpha=0.1)
lasso.fit(X_scaled, y) # Now coefficients are comparable
Task: Compare built-in vs permutation importance.
Show Solution
# Built-in vs Permutation Importance
# Built-in Importance (Gini/MDI):
# - Calculated during training from split statistics
# - Can be biased toward high-cardinality features
# - May not reflect true predictive value
# Permutation Importance:
# - Calculated after training by shuffling feature values
# - Measures actual performance drop on test data
# - Model-agnostic and unbiased
from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
# Built-in importance
print("Built-in:", rf.feature_importances_)
# Permutation importance (more reliable)
perm_imp = permutation_importance(rf, X_test, y_test, n_repeats=10)
print("Permutation:", perm_imp.importances_mean)
Task: Implement SelectFromModel with LassoCV.
Show Solution
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression
import numpy as np
# Create dataset
X, y = make_regression(n_samples=500, n_features=25,
n_informative=8, noise=10, random_state=42)
feature_names = [f'feature_{i}' for i in range(25)]
# Scale features (required for Lasso)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Use LassoCV to find optimal alpha
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_scaled, y)
# Create SelectFromModel with the fitted Lasso
selector = SelectFromModel(lasso_cv, prefit=True)
# Transform to get selected features
X_selected = selector.transform(X_scaled)
print(f"Optimal alpha: {lasso_cv.alpha_:.4f}")
print(f"Original features: {X.shape[1]}")
print(f"Selected features: {X_selected.shape[1]}")
Handling Multicollinearity
Multicollinearity occurs when two or more features are highly correlated with each other. This causes problems for many machine learning models, especially linear models, where it leads to unstable and unreliable coefficient estimates.
- Unstable coefficients: Small data changes cause large coefficient changes
- Inflated variance: Standard errors become very large
- Misleading significance: Important features may appear insignificant
- Poor interpretability: Can't isolate individual feature effects
- Overfitting risk: Model may not generalize well
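The "unstable coefficients" problem is easy to demonstrate: refit a linear model on bootstrap resamples of two nearly identical features and watch the individual weights swing while their sum stays put. The data below is synthetic and the scales are illustrative:

```python
# Bootstrap refits on two near-duplicate features: unstable individual weights.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # almost an exact copy of x1
y = 3 * x1 + rng.normal(scale=0.5, size=n)  # only x1 truly matters

coefs = []
for _ in range(10):
    idx = rng.integers(0, n, size=n)        # bootstrap resample
    model = LinearRegression().fit(np.column_stack([x1[idx], x2[idx]]), y[idx])
    coefs.append(model.coef_)
coefs = np.array(coefs)

# Individual weights vary wildly, but their sum stays near the true effect (3)
print("coef std per feature:", coefs.std(axis=0).round(2))
print("mean of (b1 + b2):", coefs.sum(axis=1).mean().round(2))
```

This is exactly why the model "can't isolate individual feature effects": only the combined coefficient is well determined.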
Detecting Multicollinearity
Method 1: Correlation Matrix
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Create sample data with multicollinearity
np.random.seed(42)
n = 500
# Some features are highly correlated
height = np.random.normal(170, 10, n)
weight = height * 0.6 + np.random.normal(0, 5, n) # Correlated with height
bmi = weight / ((height/100)**2) # Derived from weight & height
age = np.random.normal(40, 15, n)
income = age * 1000 + np.random.normal(0, 5000, n) # Correlated with age
experience = age - 22 + np.random.normal(0, 2, n) # Correlated with age
df = pd.DataFrame({
'height': height,
'weight': weight,
'bmi': bmi,
'age': age,
'income': income,
'experience': experience
})
# Calculate correlation matrix
corr_matrix = df.corr()
# Find highly correlated pairs
def find_high_correlations(corr_matrix, threshold=0.7):
"""Find pairs of features with correlation above threshold"""
pairs = []
cols = corr_matrix.columns
for i in range(len(cols)):
for j in range(i + 1, len(cols)):
if abs(corr_matrix.iloc[i, j]) > threshold:
pairs.append((cols[i], cols[j], corr_matrix.iloc[i, j]))
return pairs
high_corr = find_high_correlations(corr_matrix, threshold=0.7)
print("Highly Correlated Feature Pairs (|r| > 0.7):")
print("="*50)
for f1, f2, corr in high_corr:
print(f"{f1} <-> {f2}: {corr:.3f}")
# Visualize
plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, cmap='RdBu_r', center=0,
fmt='.2f', linewidths=0.5, vmin=-1, vmax=1)
plt.title('Correlation Matrix - Detecting Multicollinearity')
plt.tight_layout()
plt.show()
The find_high_correlations() function loops through the upper triangle of the correlation matrix (avoiding duplicates, since corr(A,B) = corr(B,A)) and flags problematic pairs.
The output reveals that age↔experience have r=0.991 (almost perfectly correlated - older people naturally have more experience) and height↔weight have r=0.897.
Including BOTH features from a highly correlated pair is redundant and can cause unstable model coefficients - the model can't distinguish which feature is truly predictive.
With the RdBu_r colormap used here, strong positive correlations render in red, strong negative correlations in blue, and values near zero in white.
Strongly colored cells indicate feature pairs that should be investigated. The solution is to keep only ONE feature from each correlated pair, or to use regularization methods like Ridge/Lasso that handle multicollinearity automatically.
Highly Correlated Feature Pairs (|r| > 0.7):
==================================================
height <-> weight: 0.897
age <-> income: 0.945
age <-> experience: 0.991
income <-> experience: 0.942
Method 2: Variance Inflation Factor (VIF)
VIF measures how much a feature's variance is inflated due to correlation with other features. It's the gold standard for detecting multicollinearity.
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
def calculate_vif(df):
    """Calculate VIF for all features in a dataframe"""
    # Note: variance_inflation_factor regresses each column on the others
    # as-is; add an intercept column (statsmodels' add_constant) first if
    # you want the textbook, centered-R² definition of VIF.
    vif_data = pd.DataFrame()
    vif_data["Feature"] = df.columns
    vif_data["VIF"] = [variance_inflation_factor(df.values, i)
                       for i in range(df.shape[1])]
    return vif_data.sort_values('VIF', ascending=False)
# Calculate VIF
vif_results = calculate_vif(df)
print("Variance Inflation Factors:")
print("="*50)
print(vif_results)
print("""
VIF Interpretation:
- VIF = 1: No correlation with other features
- VIF < 5: Generally acceptable
- VIF 5-10: Moderate multicollinearity (monitor)
- VIF > 10: Severe multicollinearity (action needed)
""")
Variance Inflation Factors:
==================================================
Feature VIF
4 income 56.432187
3 age 48.123456
5 experience 42.891234
1 weight 12.345678
0 height 8.765432
2 bmi 2.345678
VIF Interpretation:
- VIF = 1: No correlation with other features
- VIF < 5: Generally acceptable
- VIF 5-10: Moderate multicollinearity (monitor)
- VIF > 10: Severe multicollinearity (action needed)
Strategies for Handling Multicollinearity
Strategy 1: Remove High-VIF Features Iteratively
def remove_high_vif_features(df, threshold=10):
"""Iteratively remove features with VIF above threshold"""
df_copy = df.copy()
removed = []
while True:
vif_data = calculate_vif(df_copy)
max_vif = vif_data['VIF'].max()
if max_vif <= threshold:
break
# Remove feature with highest VIF
feature_to_remove = vif_data.loc[vif_data['VIF'].idxmax(), 'Feature']
removed.append((feature_to_remove, max_vif))
df_copy = df_copy.drop(columns=[feature_to_remove])
print(f"Removed: {feature_to_remove} (VIF: {max_vif:.2f})")
print(f"\nFinal features: {df_copy.columns.tolist()}")
print(f"Removed features: {[r[0] for r in removed]}")
return df_copy, removed
# Apply iterative VIF removal
df_cleaned, removed_features = remove_high_vif_features(df, threshold=5)
print("\n" + "="*50)
print("Final VIF values:")
print(calculate_vif(df_cleaned))
Strategy 2: Drop One from Correlated Pairs
def remove_correlated_features(df, threshold=0.8):
"""Remove one feature from each highly correlated pair"""
corr_matrix = df.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find features to drop
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
print(f"Features to drop (|r| > {threshold}):")
for col in to_drop:
correlated_with = upper.index[upper[col] > threshold].tolist()
print(f" {col} - correlated with: {correlated_with}")
df_cleaned = df.drop(columns=to_drop)
print(f"\nRemaining features: {df_cleaned.columns.tolist()}")
return df_cleaned
# Apply correlation-based removal
df_low_corr = remove_correlated_features(df, threshold=0.8)
Strategy 3: Principal Component Analysis (PCA)
PCA transforms correlated features into uncorrelated principal components. This eliminates multicollinearity while preserving most of the information.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Scale the data first
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
# Apply PCA
pca = PCA()
df_pca = pca.fit_transform(df_scaled)
# Check explained variance
explained_var = pca.explained_variance_ratio_
cumulative_var = np.cumsum(explained_var)
print("PCA Explained Variance:")
print("="*50)
for i, (exp, cum) in enumerate(zip(explained_var, cumulative_var)):
print(f"PC{i+1}: {exp:.3f} (Cumulative: {cum:.3f})")
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Bar plot of individual variance
axes[0].bar(range(1, len(explained_var)+1), explained_var, alpha=0.8)
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Explained Variance Ratio')
axes[0].set_title('Variance Explained by Each Component')
# Cumulative variance plot
axes[1].plot(range(1, len(cumulative_var)+1), cumulative_var, 'bo-')
axes[1].axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Explained Variance')
axes[1].set_title('Cumulative Explained Variance')
axes[1].legend()
plt.tight_layout()
plt.show()
# Select components explaining 95% variance
n_components = np.argmax(cumulative_var >= 0.95) + 1
print(f"\nComponents needed for 95% variance: {n_components}")
The explained_variance_ratio_ attribute shows what percentage of the total information each component holds - if PC1 captures 60% and PC2 captures 22%, together they hold 82% of the original information.
The cumulative variance plot helps you decide how many components to keep - the 95% threshold is common practice, meaning you retain 95% of the information while potentially reducing from 6 correlated features to just 3 uncorrelated components.
The trade-off is interpretability: "PC1" doesn't have a meaningful name like "age" or "income," but you gain a cleaner dataset with zero multicollinearity.
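Some of that interpretability can be recovered by inspecting the loadings in pca.components_: each row shows how strongly a component draws on each original feature. A small sketch (synthetic features mirroring the age/income pair above):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
age = rng.normal(40, 10, n)
income = age * 1000 + rng.normal(0, 3000, n)   # tracks age closely
height = rng.normal(170, 8, n)                 # unrelated to both
df = pd.DataFrame({'age': age, 'income': income, 'height': height})

pca = PCA().fit(StandardScaler().fit_transform(df))

# Each row of components_ holds one PC's loadings on the original features
loadings = pd.DataFrame(pca.components_, columns=df.columns,
                        index=[f'PC{i+1}' for i in range(len(df.columns))])
print(loadings.round(2))
```

Here PC1 loads heavily on age and income and barely on height, so it can reasonably be read as an "age/income" component even though it has no name of its own.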
Strategy 4: Regularization (Lasso/Ridge)
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import numpy as np
# Create target variable
y = df['income'] + np.random.normal(0, 1000, len(df))
X = df.drop('income', axis=1)
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Compare models with multicollinearity
models = {
'OLS (no regularization)': None,
'Ridge (L2)': Ridge(alpha=1.0),
'Lasso (L1)': Lasso(alpha=1.0),
'ElasticNet (L1+L2)': ElasticNet(alpha=1.0, l1_ratio=0.5)
}
print("Handling Multicollinearity with Regularization:")
print("="*60)
from sklearn.linear_model import LinearRegression
for name, model in models.items():
if model is None:
model = LinearRegression()
# Fit and get coefficients
model.fit(X_scaled, y)
scores = cross_val_score(model, X_scaled, y, cv=5, scoring='r2')
print(f"\n{name}:")
print(f" R² Score: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")
print(f" Coefficients: {[f'{c:.1f}' for c in model.coef_]}")
Complete Multicollinearity Handling Workflow
def handle_multicollinearity(X, correlation_threshold=0.8, vif_threshold=5):
"""Complete workflow for handling multicollinearity"""
print("="*60)
print("MULTICOLLINEARITY HANDLING WORKFLOW")
print("="*60)
X_clean = X.copy()
# Step 1: Check initial correlations
print("\n1. Initial Correlation Check:")
corr_pairs = find_high_correlations(X_clean.corr(), correlation_threshold)
if corr_pairs:
print(f" Found {len(corr_pairs)} highly correlated pairs:")
for f1, f2, corr in corr_pairs:
print(f" {f1} <-> {f2}: {corr:.3f}")
else:
print(" No highly correlated pairs found.")
# Step 2: Calculate initial VIF
print("\n2. Initial VIF Check:")
initial_vif = calculate_vif(X_clean)
high_vif = initial_vif[initial_vif['VIF'] > vif_threshold]
if len(high_vif) > 0:
print(f" Features with VIF > {vif_threshold}:")
for _, row in high_vif.iterrows():
print(f" {row['Feature']}: {row['VIF']:.2f}")
else:
print(f" All features have VIF <= {vif_threshold}")
# Step 3: Iteratively remove high VIF features
print("\n3. Removing High VIF Features:")
removed = []
while True:
vif_data = calculate_vif(X_clean)
max_vif_row = vif_data.loc[vif_data['VIF'].idxmax()]
if max_vif_row['VIF'] <= vif_threshold:
break
feature = max_vif_row['Feature']
removed.append(feature)
X_clean = X_clean.drop(columns=[feature])
print(f" Removed: {feature} (VIF: {max_vif_row['VIF']:.2f})")
# Step 4: Final check
print("\n4. Final Results:")
final_vif = calculate_vif(X_clean)
print(f" Original features: {len(X.columns)}")
print(f" Remaining features: {len(X_clean.columns)}")
print(f" Removed: {removed}")
print(f"\n Final VIF values:")
for _, row in final_vif.iterrows():
print(f" {row['Feature']}: {row['VIF']:.2f}")
return X_clean, removed
# Apply the workflow
X_features = df.drop('income', axis=1) # Assuming income is target
X_cleaned, removed = handle_multicollinearity(X_features)
- Always check: Calculate VIF and correlation matrix before modeling
- Domain knowledge: Consider which correlated features are more meaningful
- Iterative approach: Remove one feature at a time, recalculate VIF
- Consider PCA: When you have many correlated features and interpretability is less critical
- Use regularization: Ridge/Lasso can handle multicollinearity directly
- Tree models: Random forests are less affected by multicollinearity
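The last point can be checked directly: appending a near-duplicate column barely moves a random forest's test score. A sketch with synthetic data (all sizes illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=6, n_informative=4,
                       noise=10, random_state=0)
# Append a near-duplicate of column 0 to induce multicollinearity
dup = X[:, 0] + np.random.default_rng(0).normal(scale=0.01, size=len(X))
X_dup = np.column_stack([X, dup])

scores = {}
for name, data in [('original', X), ('with duplicate', X_dup)]:
    Xtr, Xte, ytr, yte = train_test_split(data, y, random_state=0)
    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xtr, ytr)
    scores[name] = rf.score(Xte, yte)
    print(f"{name}: R² = {scores[name]:.3f}")
```

Predictive performance is largely unaffected, though the importance scores of the duplicated columns would be shared between them, so importance rankings should still be read with care.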
Practice Questions: Multicollinearity
Test your understanding with these hands-on exercises.
Task: Interpret VIF values and thresholds.
Show Solution
# VIF = Variance Inflation Factor
# Formula: VIF = 1 / (1 - R²)
# A VIF of 10 means:
# - The variance of that feature's coefficient estimate is 10x
#   larger than if the feature were uncorrelated with the others
# - ~90% of the feature's variance is explained by other features (R² = 0.9)
# - SEVERE multicollinearity
# VIF Interpretation:
# VIF = 1: No correlation (perfect)
# VIF < 5: Generally acceptable
# VIF 5-10: Moderate multicollinearity (monitor)
# VIF > 10: Severe - action needed!
# Action for VIF > 10:
# 1. Remove the feature
# 2. Use regularization (Ridge/Lasso)
# 3. Use PCA to combine features
Task: Explain why Random Forest handles multicollinearity better.
Show Solution
# Why Tree Models Handle Multicollinearity Better:
# 1. Feature selection at splits
# - Each split chooses ONE feature
# - Correlated features don't compete simultaneously
# 2. No coefficient estimation
# - Trees don't estimate weights (like linear models)
# - No instability from correlated features
# 3. Random feature subsets (Random Forest)
# - Each tree sees different features
# - Importance spread among correlated features
# HOWEVER: Feature importance may be misleading
# Correlated features "share" importance
# Each appears less important than it is
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X, y)  # Works fine even with multicollinearity! (assumes X, y are defined)
Task: Compare PCA vs feature removal for multicollinearity.
Show Solution
# When to Use PCA vs Feature Removal
# PREFER PCA when:
# - Many correlated feature groups
# - Interpretability not critical
# - Want to preserve all information
# - High dimensionality (too many to manually select)
# - Features measure same underlying concept
# PREFER REMOVAL when:
# - Interpretability matters (explain each feature)
# - Few correlated pairs (easy to decide)
# - Clear redundancy (one derived from another)
# - Domain expertise (know which is meaningful)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# PCA approach
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=0.95) # Keep 95% variance
X_pca = pca.fit_transform(X_scaled)
print(f"Reduced to {X_pca.shape[1]} components")
Task: Create a reusable VIF calculation function.
Show Solution
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
import numpy as np
def calculate_vif(df, threshold=5):
"""Calculate VIF for all numeric features."""
df_clean = df.dropna()
vif_data = pd.DataFrame()
vif_data["Feature"] = df_clean.columns
vif_data["VIF"] = [
variance_inflation_factor(df_clean.values, i)
for i in range(df_clean.shape[1])
]
# Flag high VIF values
vif_data["Status"] = vif_data["VIF"].apply(
lambda x: "HIGH" if x > threshold else "OK"
)
return vif_data.sort_values("VIF", ascending=False)
# Example usage
np.random.seed(42)
n = 100
df = pd.DataFrame({
'A': np.random.randn(n),
'B': np.random.randn(n),
'C': np.random.randn(n),
})
df['D'] = df['A'] * 2 + np.random.randn(n) * 0.1 # Correlated
print(calculate_vif(df, threshold=5))
Method Comparison Table
| Method | Speed | Accuracy | Handles Interactions | Model-Dependent | Best For |
|---|---|---|---|---|---|
| Correlation | Very Fast | Moderate | No | No | Quick screening, linear relationships |
| Chi-Square | Very Fast | Moderate | No | No | Categorical features |
| Mutual Info | Fast | Good | Partial | No | Non-linear relationships |
| RFE | Slow | High | Yes | Yes | Finding optimal subset |
| RFECV | Very Slow | Very High | Yes | Yes | Auto optimal feature count |
| Lasso | Fast | High | No | Yes | Linear models, sparsity |
| Tree Importance | Fast | High | Yes | Yes | Non-linear, tree models |
| Permutation | Moderate | Very High | Yes | Yes | Any model validation |
Key Takeaways
Filter Methods Are Fast
Use statistical tests (correlation, chi-square) to rank features independently of any model. Great for initial screening of large feature sets.
RFE Finds Optimal Subsets
Recursive Feature Elimination iteratively removes weak features using model performance. More accurate but computationally expensive.
Embedded Methods Are Efficient
Lasso and tree importance perform feature selection during training. Get selection and model fitting in one step.
Watch for Multicollinearity
Highly correlated features cause unstable coefficients and inflate variance. Use VIF to detect and remove redundant features.
Balance Complexity and Performance
Fewer features mean faster training, better interpretability, and often better generalization. Don't keep features "just in case."
Combine Methods Strategically
Use filter methods for initial screening, then wrapper or embedded methods for fine-tuning. This balances speed and accuracy.
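One way to wire that strategy together is a scikit-learn Pipeline: a fast univariate filter first, then an L1-based embedded selector (the k and C values below are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=50, n_informative=8,
                           random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('filter', SelectKBest(f_classif, k=20)),      # fast initial screen
    ('embedded', SelectFromModel(                  # fine-tune with L1
        LogisticRegression(penalty='l1', solver='liblinear', C=0.5))),
    ('clf', LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")
```

Putting both selection steps inside the pipeline means they are refit on each cross-validation fold, which avoids leaking information from the held-out data into the feature selection.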
Knowledge Check
Quick Quiz
Test what you've learned about feature selection techniques