Assignment Overview
In this assignment, you will build a complete Customer Churn Prediction System using various classification algorithms. This comprehensive project requires you to apply ALL concepts from Module 3: Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), proper evaluation metrics for classification, and techniques for handling imbalanced datasets.
You will need pandas, numpy, matplotlib, seaborn, scikit-learn, and imbalanced-learn for this assignment.
Logistic Regression (3.1)
Binary classification, probability estimation, sigmoid function
Decision Trees (3.2)
Tree-based models, pruning, feature importance
Random Forest (3.3)
Ensemble learning, bagging, out-of-bag error
SVM (3.4)
Hyperplanes, kernels, margin maximization
The Scenario
TeleConnect Communications
You have been hired as a Machine Learning Engineer at TeleConnect Communications, a telecommunications company facing high customer churn rates. The VP of Customer Success has given you this task:
"We're losing customers at an alarming rate and need to predict who's likely to churn before they leave. We have historical customer data with usage patterns, billing info, and support interactions. Build multiple classification models, compare them thoroughly, and help us identify the most at-risk customers!"
Your Task
Create a Jupyter Notebook called churn_classification.ipynb that implements multiple
classification algorithms, handles the imbalanced nature of churn data, compares model performance,
and provides actionable insights for the business.
The Dataset
You will work with a Customer Churn dataset. Create this CSV file as shown below:
File: customer_churn.csv (Customer Data)
customer_id,tenure_months,monthly_charges,total_charges,contract_type,payment_method,num_support_tickets,avg_monthly_usage_gb,has_premium_support,has_online_backup,num_additional_services,age,is_senior,churn
C001,24,65.5,1572.0,Two Year,Credit Card,1,45.2,1,1,3,35,0,0
C002,3,89.0,267.0,Month-to-Month,Electronic Check,5,78.5,0,0,1,28,0,1
C003,48,45.0,2160.0,Two Year,Bank Transfer,0,32.1,1,1,2,52,0,0
C004,6,95.5,573.0,Month-to-Month,Electronic Check,4,92.3,0,0,0,24,0,1
C005,36,55.0,1980.0,One Year,Credit Card,2,41.5,1,1,2,45,0,0
C006,1,78.0,78.0,Month-to-Month,Electronic Check,3,65.8,0,0,1,31,0,1
C007,60,42.0,2520.0,Two Year,Bank Transfer,1,28.9,1,1,3,67,1,0
C008,12,72.5,870.0,One Year,Credit Card,2,55.2,0,1,2,38,0,0
C009,2,88.5,177.0,Month-to-Month,Electronic Check,6,85.1,0,0,0,22,0,1
C010,42,52.0,2184.0,Two Year,Bank Transfer,0,35.6,1,1,3,58,0,0
C011,4,92.0,368.0,Month-to-Month,Electronic Check,4,88.7,0,0,1,26,0,1
C012,30,58.5,1755.0,One Year,Credit Card,1,48.3,1,1,2,41,0,0
C013,8,85.0,680.0,Month-to-Month,Electronic Check,3,72.4,0,0,1,33,0,1
C014,54,48.0,2592.0,Two Year,Bank Transfer,0,30.2,1,1,3,71,1,0
C015,18,62.5,1125.0,One Year,Credit Card,2,52.8,1,1,2,39,0,0
C016,5,90.0,450.0,Month-to-Month,Electronic Check,5,82.6,0,0,0,27,0,1
C017,36,50.0,1800.0,One Year,Bank Transfer,1,38.4,1,1,2,48,0,0
C018,2,95.0,190.0,Month-to-Month,Electronic Check,4,91.2,0,0,1,23,0,1
C019,45,47.5,2137.5,Two Year,Credit Card,0,33.7,1,1,3,62,0,0
C020,15,68.0,1020.0,One Year,Credit Card,2,58.9,0,1,2,36,0,0
C021,3,87.5,262.5,Month-to-Month,Electronic Check,5,79.8,0,0,0,29,0,1
C022,28,54.0,1512.0,One Year,Bank Transfer,1,42.1,1,1,2,44,0,0
C023,7,82.0,574.0,Month-to-Month,Electronic Check,3,68.5,0,0,1,32,0,1
C024,52,44.0,2288.0,Two Year,Bank Transfer,0,29.8,1,1,3,69,1,0
C025,22,60.0,1320.0,One Year,Credit Card,2,50.6,1,1,2,40,0,0
Columns Explained
- customer_id - Unique identifier (string)
- tenure_months - Months as customer (integer)
- monthly_charges - Monthly bill amount (float)
- total_charges - Total amount paid (float)
- contract_type - Contract length (categorical: Month-to-Month/One Year/Two Year)
- payment_method - Payment type (categorical)
- num_support_tickets - Support tickets filed (integer)
- avg_monthly_usage_gb - Average data usage in GB (float)
- has_premium_support - Premium support subscriber (binary: 0/1)
- has_online_backup - Online backup subscriber (binary: 0/1)
- num_additional_services - Count of add-on services (integer)
- age - Customer age (integer)
- is_senior - Senior citizen status (binary: 0/1)
- churn - Customer churned (target: 0=No, 1=Yes)
Requirements
Your churn_classification.ipynb must implement ALL of the following functions.
Each function is mandatory and will be tested individually.
Load and Explore Data
Create a function load_and_explore(filename) that:
- Loads the CSV file using pandas
- Displays class distribution of the target variable
- Calculates class imbalance ratio
- Returns the DataFrame and exploration summary
def load_and_explore(filename):
    """Load dataset and analyze class distribution."""
    # Must return: (df, exploration_dict with 'imbalance_ratio')
    pass
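A minimal sketch of one way to fill in this stub. The imbalance ratio is taken here as majority count over minority count; the in-memory CSV in the demo is illustrative (pandas accepts any file-like object where a filename is expected):

```python
import io
import pandas as pd

def load_and_explore(filename):
    """Load dataset and analyze class distribution."""
    df = pd.read_csv(filename)
    counts = df["churn"].value_counts()
    print("Class distribution:\n", counts)
    # Imbalance ratio: majority class count divided by minority class count
    imbalance_ratio = counts.max() / counts.min()
    summary = {"n_rows": len(df), "imbalance_ratio": imbalance_ratio}
    return df, summary

# Demo with a tiny in-memory CSV (3 non-churners, 1 churner)
csv = io.StringIO(
    "customer_id,tenure_months,churn\n"
    "C001,24,0\nC002,3,1\nC003,48,0\nC004,6,0\n"
)
df, summary = load_and_explore(csv)
```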
Visualize Class Distribution
Create a function visualize_class_distribution(df, target='churn') that:
- Creates pie chart and bar chart of target classes
- Shows feature distributions by class
- Saves plots as class_distribution.png
def visualize_class_distribution(df, target='churn'):
    """Visualize target class distribution and features."""
    # Must save: class_distribution.png
    pass
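A simplified sketch covering the pie and bar charts (the per-feature distributions by class are omitted here). The Agg backend is used so the script runs headless; the demo DataFrame is made up:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; works without a display
import matplotlib.pyplot as plt
import pandas as pd

def visualize_class_distribution(df, target="churn"):
    """Pie and bar charts of the target class balance."""
    counts = df[target].value_counts()
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].pie(counts, labels=counts.index.astype(str), autopct="%1.1f%%")
    axes[0].set_title("Churn share")
    counts.plot(kind="bar", ax=axes[1], title="Churn counts")
    axes[1].set_xlabel(target)
    fig.savefig("class_distribution.png")
    plt.close(fig)

df = pd.DataFrame({"churn": [0, 0, 0, 1, 1]})
visualize_class_distribution(df)
```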
Preprocess Data
Create a function preprocess_data(df, target_col) that:
- Encodes categorical variables
- Scales numerical features
- Splits into train/test sets with stratification
- Returns processed data and preprocessing objects
def preprocess_data(df, target_col):
    """Preprocess data for classification."""
    # Return: (X_train, X_test, y_train, y_test, preprocessors)
    pass
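One possible implementation sketch. Note the scaler is fit on the training split only, to avoid test-set leakage; the toy DataFrame and its column subset are illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preprocess_data(df, target_col):
    """One-hot encode categoricals, scale features, stratified split."""
    X = df.drop(columns=[target_col, "customer_id"], errors="ignore")
    y = df[target_col]
    X = pd.get_dummies(X, drop_first=True)  # encode categorical columns
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42)
    scaler = StandardScaler()
    # Fit on the training split only, then apply the same transform to test
    X_train = pd.DataFrame(scaler.fit_transform(X_train),
                           columns=X.columns, index=X_train.index)
    X_test = pd.DataFrame(scaler.transform(X_test),
                          columns=X.columns, index=X_test.index)
    return X_train, X_test, y_train, y_test, {"scaler": scaler}

# Toy data: 8 customers, balanced target, one categorical column
df = pd.DataFrame({
    "customer_id": [f"C{i}" for i in range(8)],
    "tenure_months": [24, 3, 48, 6, 36, 1, 60, 12],
    "contract_type": ["Two Year", "Month-to-Month"] * 4,
    "churn": [0, 1, 0, 1, 0, 1, 0, 1],
})
X_tr, X_te, y_tr, y_te, prep = preprocess_data(df, "churn")
```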
Handle Class Imbalance
Create a function handle_imbalance(X_train, y_train, method='smote') that:
- Implements SMOTE for oversampling minority class
- Optionally supports random undersampling
- Returns balanced training data
- Prints before/after class distribution
def handle_imbalance(X_train, y_train, method='smote'):
    """Handle class imbalance using specified method."""
    # Return: (X_resampled, y_resampled)
    pass
Logistic Regression Classifier
Create a function train_logistic_regression(X_train, X_test, y_train, y_test) that:
- Trains logistic regression with class weights
- Returns model, predictions, and probabilities
- Extracts and displays feature coefficients
def train_logistic_regression(X_train, X_test, y_train, y_test):
    """Train Logistic Regression classifier."""
    # Return: (model, y_pred, y_proba, coefficients)
    pass
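A minimal sketch matching the required return signature, demoed on synthetic data rather than the churn CSV:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_logistic_regression(X_train, X_test, y_train, y_test):
    """Class-weighted logistic regression with coefficient readout."""
    model = LogisticRegression(class_weight="balanced", max_iter=1000)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]  # P(churn = 1)
    coefficients = model.coef_[0]  # one weight per feature, on the log-odds scale
    return model, y_pred, y_proba, coefficients

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
model, y_pred, y_proba, coefs = train_logistic_regression(X_tr, X_te, y_tr, y_te)
```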
Decision Tree Classifier
Create a function train_decision_tree(X_train, X_test, y_train, y_test, max_depth=5) that:
- Trains decision tree with specified max depth
- Visualizes the tree structure
- Returns model, predictions, and feature importance
def train_decision_tree(X_train, X_test, y_train, y_test, max_depth=5):
    """Train Decision Tree classifier."""
    # Return: (model, y_pred, feature_importance)
    pass
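A sketch on synthetic data. For brevity it renders the tree as text with `export_text`; in the notebook you would likely use `sklearn.tree.plot_tree` for the graphical version:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

def train_decision_tree(X_train, X_test, y_train, y_test, max_depth=5):
    """Depth-limited decision tree with feature importances."""
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    feature_importance = model.feature_importances_
    # Text rendering of the top of the tree (plot_tree gives the figure)
    print(export_text(model, max_depth=2))
    return model, y_pred, feature_importance

X, y = make_classification(n_samples=150, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
model, y_pred, importance = train_decision_tree(X_tr, X_te, y_tr, y_te)
```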
Random Forest Classifier
Create a function train_random_forest(X_train, X_test, y_train, y_test, n_estimators=100) that:
- Trains random forest ensemble
- Returns model, predictions, and feature importance
- Plots feature importance bar chart
def train_random_forest(X_train, X_test, y_train, y_test, n_estimators=100):
    """Train Random Forest classifier."""
    # Return: (model, y_pred, y_proba, feature_importance)
    pass
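A sketch of the ensemble step (the feature-importance bar chart is left out here; `feature_importance` is returned so it can be plotted separately):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train_random_forest(X_train, X_test, y_train, y_test, n_estimators=100):
    """Random forest ensemble with impurity-based feature importances."""
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   class_weight="balanced", random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    feature_importance = model.feature_importances_
    return model, y_pred, y_proba, feature_importance

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
model, y_pred, y_proba, importance = train_random_forest(X_tr, X_te, y_tr, y_te)
```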
Support Vector Machine Classifier
Create a function train_svm(X_train, X_test, y_train, y_test, kernel='rbf') that:
- Trains SVM with specified kernel (linear, rbf, poly)
- Uses probability estimation for ROC curve
- Returns model and predictions
def train_svm(X_train, X_test, y_train, y_test, kernel='rbf'):
    """Train SVM classifier with specified kernel."""
    # Return: (model, y_pred, y_proba)
    pass
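A sketch showing the one SVM-specific detail that often trips people up: `SVC` only exposes `predict_proba` (needed for the ROC curve) when constructed with `probability=True`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def train_svm(X_train, X_test, y_train, y_test, kernel="rbf"):
    """SVM with a chosen kernel and probability estimates enabled."""
    # probability=True fits an extra calibration step so predict_proba works
    model = SVC(kernel=kernel, probability=True,
                class_weight="balanced", random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    return model, y_pred, y_proba

X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
model, y_pred, y_proba = train_svm(X_tr, X_te, y_tr, y_te)
```

Remember that SVMs are sensitive to feature scale, so this assumes the inputs have already gone through `preprocess_data`.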
Calculate Classification Metrics
Create a function calculate_classification_metrics(y_true, y_pred, y_proba, model_name) that:
- Calculates accuracy, precision, recall, F1-score
- Generates confusion matrix
- Calculates ROC-AUC score
- Returns dictionary with all metrics
def calculate_classification_metrics(y_true, y_pred, y_proba, model_name):
    """Calculate and return classification metrics."""
    # Return: dict with 'accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'confusion_matrix'
    pass
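A direct sketch using scikit-learn's metric functions, demoed on hand-made arrays so the values are easy to check by eye:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

def calculate_classification_metrics(y_true, y_pred, y_proba, model_name):
    """Collect the standard binary-classification metrics in one dict."""
    return {
        "model": model_name,
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        # ROC-AUC needs scores/probabilities, not hard labels
        "roc_auc": roc_auc_score(y_true, y_proba),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])   # one false positive
y_proba = np.array([0.1, 0.6, 0.8, 0.9])
m = calculate_classification_metrics(y_true, y_pred, y_proba, "demo")
```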
Plot ROC Curves
Create a function plot_roc_curves(results_dict, y_test) that:
- Plots ROC curves for all models on same figure
- Includes AUC scores in legend
- Saves plot as roc_curves.png
def plot_roc_curves(results_dict, y_test):
    """Plot ROC curves for all models."""
    # Must save: roc_curves.png
    pass
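A sketch assuming `results_dict` follows the structure built in `main()` (each entry holds a `'probabilities'` array); the demo probabilities are made up:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def plot_roc_curves(results_dict, y_test):
    """Overlay ROC curves for every model, with AUC in the legend."""
    plt.figure(figsize=(8, 6))
    for name, res in results_dict.items():
        fpr, tpr, _ = roc_curve(y_test, res["probabilities"])
        auc = roc_auc_score(y_test, res["probabilities"])
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.3f})")
    plt.plot([0, 1], [0, 1], "k--", label="Chance")  # random-guess diagonal
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC Curves")
    plt.legend()
    plt.savefig("roc_curves.png")
    plt.close()

# Demo with made-up probabilities for two hypothetical models
y_test = np.array([0, 0, 1, 1, 0, 1])
results = {
    "Model A": {"probabilities": np.array([0.1, 0.4, 0.8, 0.9, 0.3, 0.7])},
    "Model B": {"probabilities": np.array([0.2, 0.5, 0.6, 0.8, 0.4, 0.3])},
}
plot_roc_curves(results, y_test)
```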
Hyperparameter Tuning
Create a function tune_best_model(X_train, y_train, model_type='random_forest') that:
- Uses GridSearchCV with cross-validation
- Tunes hyperparameters for specified model
- Returns best model and best parameters
def tune_best_model(X_train, y_train, model_type='random_forest'):
    """Tune hyperparameters using GridSearchCV."""
    # Return: (best_model, best_params, cv_results)
    pass
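A sketch for the random-forest case; the parameter grid is deliberately small for speed and is only an example of what you might search:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def tune_best_model(X_train, y_train, model_type="random_forest"):
    """Exhaustive grid search with cross-validation, scored by ROC-AUC."""
    if model_type == "random_forest":
        model = RandomForestClassifier(random_state=42)
        param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
    else:
        raise ValueError(f"unsupported model_type: {model_type}")
    grid = GridSearchCV(model, param_grid, cv=3, scoring="roc_auc")
    grid.fit(X_train, y_train)
    return grid.best_estimator_, grid.best_params_, grid.cv_results_

X, y = make_classification(n_samples=120, random_state=0)
best_model, best_params, cv_results = tune_best_model(X, y)
print(best_params)
```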
Compare All Models
Create a function compare_models(results_dict) that:
- Creates comparison table of all models
- Generates comparison bar charts for metrics
- Saves comparison as
model_comparison.png - Returns DataFrame with comparison
def compare_models(results_dict):
    """Compare all classification models."""
    # Return: comparison_df
    pass
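A sketch of the tabulation step only (the bar charts and the saved PNG are omitted). It assumes the `results_dict` shape from `main()`; the metric values in the demo are hypothetical:

```python
import pandas as pd

def compare_models(results_dict):
    """Build a model-by-metric comparison table from per-model results."""
    rows = {name: res["metrics"] for name, res in results_dict.items()}
    comparison_df = pd.DataFrame(rows).T  # models as rows, metrics as columns
    return comparison_df

# Hypothetical metric dicts for two models
results = {
    "Logistic Regression": {"metrics": {"accuracy": 0.81, "roc_auc": 0.88}},
    "Random Forest": {"metrics": {"accuracy": 0.85, "roc_auc": 0.91}},
}
comparison_df = compare_models(results)
print(comparison_df)
```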
Main Pipeline
Create a main() function that:
- Runs the complete classification pipeline
- Trains all model types and collects results
- Generates all required visualizations
- Prints final recommendation for best model
def main():
    # 1. Load and explore data
    df, summary = load_and_explore("customer_churn.csv")

    # 2. Visualize class distribution
    visualize_class_distribution(df)

    # 3. Preprocess data
    X_train, X_test, y_train, y_test, preprocessors = preprocess_data(df, 'churn')

    # 4. Handle imbalance
    X_train_balanced, y_train_balanced = handle_imbalance(X_train, y_train)

    # 5. Train all models
    results = {}

    # Logistic Regression
    lr_model, lr_pred, lr_proba, lr_coefs = train_logistic_regression(
        X_train_balanced, X_test, y_train_balanced, y_test)
    results['Logistic Regression'] = {
        'predictions': lr_pred,
        'probabilities': lr_proba,
        'metrics': calculate_classification_metrics(y_test, lr_pred, lr_proba, 'Logistic Regression')
    }

    # Decision Tree
    dt_model, dt_pred, dt_importance = train_decision_tree(
        X_train_balanced, X_test, y_train_balanced, y_test)
    dt_proba = dt_model.predict_proba(X_test)[:, 1]
    results['Decision Tree'] = {
        'predictions': dt_pred,
        'probabilities': dt_proba,
        'metrics': calculate_classification_metrics(y_test, dt_pred, dt_proba, 'Decision Tree')
    }

    # Random Forest
    rf_model, rf_pred, rf_proba, rf_importance = train_random_forest(
        X_train_balanced, X_test, y_train_balanced, y_test)
    results['Random Forest'] = {
        'predictions': rf_pred,
        'probabilities': rf_proba,
        'metrics': calculate_classification_metrics(y_test, rf_pred, rf_proba, 'Random Forest')
    }

    # SVM
    svm_model, svm_pred, svm_proba = train_svm(
        X_train_balanced, X_test, y_train_balanced, y_test)
    results['SVM'] = {
        'predictions': svm_pred,
        'probabilities': svm_proba,
        'metrics': calculate_classification_metrics(y_test, svm_pred, svm_proba, 'SVM')
    }

    # 6. Plot ROC curves
    plot_roc_curves(results, y_test)

    # 7. Compare all models
    comparison_df = compare_models(results)
    print(comparison_df)

    # 8. Tune best model
    best_model, best_params, cv_results = tune_best_model(X_train_balanced, y_train_balanced)
    print(f"Best Parameters: {best_params}")

    # 9. Recommendation
    best = comparison_df.loc[comparison_df['ROC_AUC'].idxmax()]
    print(f"\nRecommendation: {best.name} with ROC-AUC = {best['ROC_AUC']:.4f}")

if __name__ == "__main__":
    main()
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
customer-churn-classification
Required Files
customer-churn-classification/
├── churn_classification.ipynb # Your Jupyter Notebook with ALL 13 functions
├── customer_churn.csv # Input dataset (as provided or extended)
├── class_distribution.png # Class distribution visualizations
├── roc_curves.png # ROC curves for all models
├── model_comparison.png # Model comparison bar charts
├── predictions.csv # Test predictions from best model
└── README.md # REQUIRED - see contents below
README.md Must Include:
- Your full name and submission date
- Summary of all models trained and their metrics
- How you handled class imbalance
- Your recommendation for the best model and why
- Any challenges faced and how you solved them
- Instructions to run your notebook
Do Include
- All 13 functions implemented and working
- Docstrings for every function
- Clear visualizations with labels and titles
- Class imbalance handling with SMOTE
- Hyperparameter tuning with cross-validation
- README.md with all required sections
Do Not Include
- Any .pyc or __pycache__ files (use .gitignore)
- Virtual environment folders
- Large model pickle files
- Code that doesn't run without errors
- Hardcoded file paths
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Logistic Regression | 25 | Correct implementation with class weights and coefficient interpretation |
| Decision Trees | 25 | Proper tree training, visualization, and feature importance |
| Random Forest | 30 | Ensemble implementation with feature importance analysis |
| SVM | 25 | Correct kernel usage and probability estimation |
| Class Imbalance Handling | 25 | Proper use of SMOTE or other balancing techniques |
| Evaluation Metrics | 30 | ROC-AUC, confusion matrix, precision, recall, F1 calculations |
| Code Quality | 40 | Docstrings, comments, naming conventions, and clean organization |
| Total | 200 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
Logistic Regression (3.1)
Understanding probability estimation, odds ratios, and decision boundaries
Tree-Based Models (3.2-3.3)
Decision trees, random forests, feature importance, and ensemble learning
Support Vector Machines (3.4)
Kernel selection, margin maximization, and high-dimensional classification
Imbalanced Data
SMOTE, class weights, and evaluation metrics for imbalanced datasets
Pro Tips
Classification Best Practices
- Always check for class imbalance first
- Use stratified splits to maintain class ratio
- Scale features for SVM and Logistic Regression
- Use ROC-AUC for imbalanced data, not accuracy
Model Selection
- Start with Logistic Regression as baseline
- Random Forest often works well out-of-the-box
- SVM with RBF kernel for non-linear boundaries
- Consider business impact of false positives vs negatives
Metrics to Focus On
- ROC-AUC: Overall ranking ability
- Precision: When false positives are costly
- Recall: When false negatives are costly
- F1-Score: Balance of precision and recall
Common Mistakes
- Using accuracy on imbalanced datasets
- Applying SMOTE before train/test split
- Not tuning SVM kernel and C parameter
- Ignoring confusion matrix interpretation