Assignment 3-A

Classification Models: Customer Churn Prediction

Build a complete classification system that applies all Module 3 concepts: Logistic Regression, Decision Trees, Random Forest, Support Vector Machines, model comparison, hyperparameter tuning, and handling imbalanced datasets.

6-8 hours
Challenging
200 Points
Submit Assignment
What You'll Practice
  • Build logistic regression classifiers
  • Train decision trees & random forests
  • Implement SVM with different kernels
  • Handle imbalanced datasets
  • Evaluate with ROC-AUC & confusion matrix
Contents
01

Assignment Overview

In this assignment, you will build a complete Customer Churn Prediction System using various classification algorithms. This comprehensive project requires you to apply ALL concepts from Module 3: Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), proper evaluation metrics for classification, and techniques for handling imbalanced datasets.

Libraries Allowed: You may use pandas, numpy, matplotlib, seaborn, scikit-learn, and imbalanced-learn for this assignment.
Skills Applied: This assignment tests your understanding of Logistic Regression (Topic 3.1), Decision Trees (Topic 3.2), Ensemble Methods (Topic 3.3), and SVM (Topic 3.4) from Module 3.
Logistic Regression (3.1)

Binary classification, probability estimation, sigmoid function

Decision Trees (3.2)

Tree-based models, pruning, feature importance

Random Forest (3.3)

Ensemble learning, bagging, out-of-bag error

SVM (3.4)

Hyperplanes, kernels, margin maximization

02

The Scenario

TeleConnect Communications

You have been hired as a Machine Learning Engineer at TeleConnect Communications, a telecommunications company facing high customer churn rates. The VP of Customer Success has given you this task:

"We're losing customers at an alarming rate and need to predict who's likely to churn before they leave. We have historical customer data with usage patterns, billing info, and support interactions. Build multiple classification models, compare them thoroughly, and help us identify the most at-risk customers!"

Your Task

Create a Jupyter Notebook called churn_classification.ipynb that implements multiple classification algorithms, handles the imbalanced nature of churn data, compares model performance, and provides actionable insights for the business.

03

The Dataset

You will work with a Customer Churn dataset. Create this CSV file as shown below:

File: customer_churn.csv (Customer Data)

customer_id,tenure_months,monthly_charges,total_charges,contract_type,payment_method,num_support_tickets,avg_monthly_usage_gb,has_premium_support,has_online_backup,num_additional_services,age,is_senior,churn
C001,24,65.5,1572.0,Two Year,Credit Card,1,45.2,1,1,3,35,0,0
C002,3,89.0,267.0,Month-to-Month,Electronic Check,5,78.5,0,0,1,28,0,1
C003,48,45.0,2160.0,Two Year,Bank Transfer,0,32.1,1,1,2,52,0,0
C004,6,95.5,573.0,Month-to-Month,Electronic Check,4,92.3,0,0,0,24,0,1
C005,36,55.0,1980.0,One Year,Credit Card,2,41.5,1,1,2,45,0,0
C006,1,78.0,78.0,Month-to-Month,Electronic Check,3,65.8,0,0,1,31,0,1
C007,60,42.0,2520.0,Two Year,Bank Transfer,1,28.9,1,1,3,67,1,0
C008,12,72.5,870.0,One Year,Credit Card,2,55.2,0,1,2,38,0,0
C009,2,88.5,177.0,Month-to-Month,Electronic Check,6,85.1,0,0,0,22,0,1
C010,42,52.0,2184.0,Two Year,Bank Transfer,0,35.6,1,1,3,58,0,0
C011,4,92.0,368.0,Month-to-Month,Electronic Check,4,88.7,0,0,1,26,0,1
C012,30,58.5,1755.0,One Year,Credit Card,1,48.3,1,1,2,41,0,0
C013,8,85.0,680.0,Month-to-Month,Electronic Check,3,72.4,0,0,1,33,0,1
C014,54,48.0,2592.0,Two Year,Bank Transfer,0,30.2,1,1,3,71,1,0
C015,18,62.5,1125.0,One Year,Credit Card,2,52.8,1,1,2,39,0,0
C016,5,90.0,450.0,Month-to-Month,Electronic Check,5,82.6,0,0,0,27,0,1
C017,36,50.0,1800.0,One Year,Bank Transfer,1,38.4,1,1,2,48,0,0
C018,2,95.0,190.0,Month-to-Month,Electronic Check,4,91.2,0,0,1,23,0,1
C019,45,47.5,2137.5,Two Year,Credit Card,0,33.7,1,1,3,62,0,0
C020,15,68.0,1020.0,One Year,Credit Card,2,58.9,0,1,2,36,0,0
C021,3,87.5,262.5,Month-to-Month,Electronic Check,5,79.8,0,0,0,29,0,1
C022,28,54.0,1512.0,One Year,Bank Transfer,1,42.1,1,1,2,44,0,0
C023,7,82.0,574.0,Month-to-Month,Electronic Check,3,68.5,0,0,1,32,0,1
C024,52,44.0,2288.0,Two Year,Bank Transfer,0,29.8,1,1,3,69,1,0
C025,22,60.0,1320.0,One Year,Credit Card,2,50.6,1,1,2,40,0,0
Columns Explained
  • customer_id - Unique identifier (string)
  • tenure_months - Months as customer (integer)
  • monthly_charges - Monthly bill amount (float)
  • total_charges - Total amount paid (float)
  • contract_type - Contract length (categorical: Month-to-Month/One Year/Two Year)
  • payment_method - Payment type (categorical)
  • num_support_tickets - Support tickets filed (integer)
  • avg_monthly_usage_gb - Average data usage (float)
  • has_premium_support - Premium support subscriber (binary: 0/1)
  • has_online_backup - Online backup subscriber (binary: 0/1)
  • num_additional_services - Count of add-on services (integer)
  • age - Customer age (integer)
  • is_senior - Senior citizen status (binary: 0/1)
  • churn - Customer churned (target: 0=No, 1=Yes)
Note: The dataset is intentionally imbalanced (more non-churners than churners). You must handle this imbalance using appropriate techniques like SMOTE, class weights, or undersampling.
04

Requirements

Your churn_classification.ipynb must implement ALL of the following functions. Each function is mandatory and will be tested individually.

1
Load and Explore Data

Create a function load_and_explore(filename) that:

  • Loads the CSV file using pandas
  • Displays class distribution of the target variable
  • Calculates class imbalance ratio
  • Returns the DataFrame and exploration summary
def load_and_explore(filename):
    """Load dataset and analyze class distribution."""
    # Must return: (df, exploration_dict with 'imbalance_ratio')
    pass
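As a rough sketch (not the reference solution), the imbalance ratio can be defined as the majority-class count divided by the minority-class count; whatever definition you choose, state it in your README:

```python
import pandas as pd

def load_and_explore(filename):
    """Load dataset and analyze class distribution."""
    df = pd.read_csv(filename)
    counts = df["churn"].value_counts()
    print("Class distribution:\n", counts)
    exploration = {
        "n_rows": len(df),
        # Defined here as majority count / minority count (>= 1.0)
        "imbalance_ratio": counts.max() / counts.min(),
    }
    return df, exploration
```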
2
Visualize Class Distribution

Create a function visualize_class_distribution(df, target='churn') that:

  • Creates pie chart and bar chart of target classes
  • Shows feature distributions by class
  • Saves plots as class_distribution.png
def visualize_class_distribution(df, target='churn'):
    """Visualize target class distribution and features."""
    # Must save: class_distribution.png
    pass
3
Preprocess Data

Create a function preprocess_data(df, target_col) that:

  • Encodes categorical variables
  • Scales numerical features
  • Splits into train/test sets with stratification
  • Returns processed data and preprocessing objects
def preprocess_data(df, target_col):
    """Preprocess data for classification."""
    # Return: (X_train, X_test, y_train, y_test, preprocessors)
    pass
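A minimal sketch of the preprocessing step, assuming one-hot encoding via `pd.get_dummies`, a `StandardScaler`, and a stratified 80/20 split (all of these are design choices you should justify in your README):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preprocess_data(df, target_col):
    """Preprocess data for classification (sketch)."""
    X = df.drop(columns=[target_col, "customer_id"], errors="ignore")
    y = df[target_col]
    # One-hot encode categoricals; scaling matters for SVM and Logistic Regression
    X = pd.get_dummies(X, drop_first=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    scaler = StandardScaler()
    X_train = pd.DataFrame(scaler.fit_transform(X_train),
                           columns=X.columns, index=X_train.index)
    # Fit on train only, then transform test -- avoids data leakage
    X_test = pd.DataFrame(scaler.transform(X_test),
                          columns=X.columns, index=X_test.index)
    return X_train, X_test, y_train, y_test, {"scaler": scaler}
```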
4
Handle Class Imbalance

Create a function handle_imbalance(X_train, y_train, method='smote') that:

  • Implements SMOTE for oversampling minority class
  • Optionally supports random undersampling
  • Returns balanced training data
  • Prints before/after class distribution
def handle_imbalance(X_train, y_train, method='smote'):
    """Handle class imbalance using specified method."""
    # Return: (X_resampled, y_resampled)
    pass
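The assignment asks for SMOTE, which lives in `imbalanced-learn` (`imblearn.over_sampling.SMOTE`). As an interface-only fallback sketch that needs nothing beyond scikit-learn, plain random oversampling of the minority class looks like this; swap in SMOTE for your actual submission:

```python
import pandas as pd
from sklearn.utils import resample

def handle_imbalance(X_train, y_train, method="oversample"):
    """Balance classes by randomly oversampling the minority class.

    NOTE: this is an illustrative stand-in; the assignment expects SMOTE
    (imblearn.over_sampling.SMOTE), which synthesizes new minority samples
    instead of duplicating existing ones.
    """
    print("Before:", y_train.value_counts().to_dict())
    counts = y_train.value_counts()
    minority = counts.idxmin()
    n_needed = counts.max() - counts.min()
    minority_idx = y_train[y_train == minority].index
    # Draw duplicates (with replacement) until classes are balanced
    extra_idx = resample(list(minority_idx), replace=True,
                         n_samples=n_needed, random_state=42)
    X_res = pd.concat([X_train, X_train.loc[extra_idx]])
    y_res = pd.concat([y_train, y_train.loc[extra_idx]])
    print("After:", y_res.value_counts().to_dict())
    return X_res, y_res
```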
5
Logistic Regression Classifier

Create a function train_logistic_regression(X_train, X_test, y_train, y_test) that:

  • Trains logistic regression with class weights
  • Returns model, predictions, and probabilities
  • Extracts and displays feature coefficients
def train_logistic_regression(X_train, X_test, y_train, y_test):
    """Train Logistic Regression classifier."""
    # Return: (model, y_pred, y_proba, coefficients)
    pass
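A sketch under the assumption that `X_train` is a DataFrame (so coefficients can be labeled by feature name); `class_weight="balanced"` is one valid way to meet the class-weights requirement:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def train_logistic_regression(X_train, X_test, y_train, y_test):
    """Train Logistic Regression with class weights (sketch)."""
    model = LogisticRegression(class_weight="balanced", max_iter=1000)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]  # P(churn=1), needed for ROC-AUC
    # Coefficients on scaled features indicate direction/strength of each driver
    coefficients = pd.Series(model.coef_[0], index=X_train.columns).sort_values()
    print(coefficients)
    return model, y_pred, y_proba, coefficients
```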
6
Decision Tree Classifier

Create a function train_decision_tree(X_train, X_test, y_train, y_test, max_depth=5) that:

  • Trains decision tree with specified max depth
  • Visualizes the tree structure
  • Returns model, predictions, and feature importance
def train_decision_tree(X_train, X_test, y_train, y_test, max_depth=5):
    """Train Decision Tree classifier."""
    # Return: (model, y_pred, feature_importance)
    pass
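One possible shape for this function, using `sklearn.tree.plot_tree` for the visualization (the `decision_tree.png` filename is illustrative, not a required deliverable):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree

def train_decision_tree(X_train, X_test, y_train, y_test, max_depth=5):
    """Train a Decision Tree and plot its structure (sketch)."""
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    fig, ax = plt.subplots(figsize=(12, 6))
    plot_tree(model, feature_names=list(X_train.columns), filled=True, ax=ax)
    fig.savefig("decision_tree.png")  # filename is illustrative
    plt.close(fig)
    importance = pd.Series(model.feature_importances_, index=X_train.columns)
    return model, y_pred, importance.sort_values(ascending=False)
```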
7
Random Forest Classifier

Create a function train_random_forest(X_train, X_test, y_train, y_test, n_estimators=100) that:

  • Trains random forest ensemble
  • Returns model, predictions, and feature importance
  • Plots feature importance bar chart
def train_random_forest(X_train, X_test, y_train, y_test, n_estimators=100):
    """Train Random Forest classifier."""
    # Return: (model, y_pred, y_proba, feature_importance)
    pass
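A sketch of the random forest step; the `rf_feature_importance.png` filename is an assumption for illustration, and `class_weight="balanced"` is optional once you have resampled the training data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def train_random_forest(X_train, X_test, y_train, y_test, n_estimators=100):
    """Train a Random Forest and chart feature importance (sketch)."""
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   class_weight="balanced", random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    importance = pd.Series(model.feature_importances_,
                           index=X_train.columns).sort_values()
    ax = importance.plot.barh(title="Random Forest feature importance")
    ax.figure.savefig("rf_feature_importance.png")  # filename is illustrative
    plt.close(ax.figure)
    return model, y_pred, y_proba, importance
```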
8
Support Vector Machine Classifier

Create a function train_svm(X_train, X_test, y_train, y_test, kernel='rbf') that:

  • Trains SVM with specified kernel (linear, rbf, poly)
  • Uses probability estimation for ROC curve
  • Returns model and predictions
def train_svm(X_train, X_test, y_train, y_test, kernel='rbf'):
    """Train SVM classifier with specified kernel."""
    # Return: (model, y_pred, y_proba)
    pass
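A minimal sketch: the key detail is `probability=True`, which makes `SVC` fit an internal cross-validated probability model so `predict_proba` is available for the ROC curve (it noticeably slows training on larger data):

```python
from sklearn.svm import SVC

def train_svm(X_train, X_test, y_train, y_test, kernel="rbf"):
    """Train an SVM classifier with the specified kernel (sketch)."""
    # probability=True enables predict_proba via internal cross-validation,
    # which the ROC-curve step requires
    model = SVC(kernel=kernel, probability=True, class_weight="balanced",
                random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    return model, y_pred, y_proba
```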
9
Calculate Classification Metrics

Create a function calculate_classification_metrics(y_true, y_pred, y_proba, model_name) that:

  • Calculates accuracy, precision, recall, F1-score
  • Generates confusion matrix
  • Calculates ROC-AUC score
  • Returns dictionary with all metrics
def calculate_classification_metrics(y_true, y_pred, y_proba, model_name):
    """Calculate and return classification metrics."""
    # Return: dict with 'accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'confusion_matrix'
    pass
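This function maps almost directly onto `sklearn.metrics`; one sketch, noting that ROC-AUC must be computed from probabilities rather than hard labels:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

def calculate_classification_metrics(y_true, y_pred, y_proba, model_name):
    """Calculate and return classification metrics (sketch)."""
    metrics = {
        "model": model_name,
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_true, y_proba),  # probabilities, not labels
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }
    print(f"{model_name}: ROC-AUC={metrics['roc_auc']:.3f}, F1={metrics['f1']:.3f}")
    return metrics
```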
10
Plot ROC Curves

Create a function plot_roc_curves(results_dict, y_test) that:

  • Plots ROC curves for all models on same figure
  • Includes AUC scores in legend
  • Saves plot as roc_curves.png
def plot_roc_curves(results_dict, y_test):
    """Plot ROC curves for all models."""
    # Must save: roc_curves.png
    pass
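A sketch assuming `results_dict` has the structure built in `main()` below, i.e. each entry carries a `'probabilities'` array:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

def plot_roc_curves(results_dict, y_test):
    """Plot ROC curves for all models on one figure (sketch)."""
    fig, ax = plt.subplots(figsize=(8, 6))
    for name, result in results_dict.items():
        proba = result["probabilities"]
        fpr, tpr, _ = roc_curve(y_test, proba)
        auc = roc_auc_score(y_test, proba)
        ax.plot(fpr, tpr, label=f"{name} (AUC = {auc:.3f})")
    ax.plot([0, 1], [0, 1], "k--", label="Chance")  # random-guess baseline
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.set_title("ROC Curves")
    ax.legend(loc="lower right")
    fig.savefig("roc_curves.png")
    plt.close(fig)
```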
11
Hyperparameter Tuning

Create a function tune_best_model(X_train, y_train, model_type='random_forest') that:

  • Uses GridSearchCV with cross-validation
  • Tunes hyperparameters for specified model
  • Returns best model and best parameters
def tune_best_model(X_train, y_train, model_type='random_forest'):
    """Tune hyperparameters using GridSearchCV."""
    # Return: (best_model, best_params, cv_results)
    pass
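A sketch for the random forest case only, with a deliberately tiny grid (widen it for the real assignment); `scoring="roc_auc"` is a sensible choice here because accuracy is misleading on imbalanced data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def tune_best_model(X_train, y_train, model_type="random_forest"):
    """Tune hyperparameters with GridSearchCV (sketch, random forest only)."""
    # Deliberately small grid to keep runtime short; expand for real tuning
    param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        scoring="roc_auc",  # more informative than accuracy on imbalanced data
        cv=3,
    )
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_, search.cv_results_
```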
12
Compare All Models

Create a function compare_models(results_dict) that:

  • Creates comparison table of all models
  • Generates comparison bar charts for metrics
  • Saves comparison as model_comparison.png
  • Returns DataFrame with comparison
def compare_models(results_dict):
    """Compare all classification models."""
    # Return: comparison_df
    pass
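A sketch that assumes each `results_dict` entry carries the `'metrics'` dict produced by `calculate_classification_metrics` (as `main()` below builds it); note the `ROC_AUC` column name matches what the recommendation step in `main()` looks up:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch
import matplotlib.pyplot as plt
import pandas as pd

def compare_models(results_dict):
    """Build a metric comparison table and bar chart (sketch)."""
    rows = {}
    for name, result in results_dict.items():
        m = result["metrics"]
        rows[name] = {"Accuracy": m["accuracy"], "Precision": m["precision"],
                      "Recall": m["recall"], "F1": m["f1"],
                      "ROC_AUC": m["roc_auc"]}
    comparison_df = pd.DataFrame(rows).T  # models as rows, metrics as columns
    ax = comparison_df.plot.bar(figsize=(10, 6), title="Model comparison")
    ax.set_ylabel("Score")
    ax.figure.savefig("model_comparison.png")
    plt.close(ax.figure)
    return comparison_df
```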
13
Main Pipeline

Create a main() function that:

  • Runs the complete classification pipeline
  • Trains all model types and collects results
  • Generates all required visualizations
  • Prints final recommendation for best model
def main():
    # 1. Load and explore data
    df, summary = load_and_explore("customer_churn.csv")
    
    # 2. Visualize class distribution
    visualize_class_distribution(df)
    
    # 3. Preprocess data
    X_train, X_test, y_train, y_test, preprocessors = preprocess_data(df, 'churn')
    
    # 4. Handle imbalance
    X_train_balanced, y_train_balanced = handle_imbalance(X_train, y_train)
    
    # 5. Train all models
    results = {}
    
    # Logistic Regression
    lr_model, lr_pred, lr_proba, lr_coefs = train_logistic_regression(
        X_train_balanced, X_test, y_train_balanced, y_test)
    results['Logistic Regression'] = {
        'predictions': lr_pred, 
        'probabilities': lr_proba,
        'metrics': calculate_classification_metrics(y_test, lr_pred, lr_proba, 'Logistic Regression')
    }
    
    # Decision Tree
    dt_model, dt_pred, dt_importance = train_decision_tree(
        X_train_balanced, X_test, y_train_balanced, y_test)
    dt_proba = dt_model.predict_proba(X_test)[:, 1]
    results['Decision Tree'] = {
        'predictions': dt_pred,
        'probabilities': dt_proba,
        'metrics': calculate_classification_metrics(y_test, dt_pred, dt_proba, 'Decision Tree')
    }
    
    # Random Forest
    rf_model, rf_pred, rf_proba, rf_importance = train_random_forest(
        X_train_balanced, X_test, y_train_balanced, y_test)
    results['Random Forest'] = {
        'predictions': rf_pred,
        'probabilities': rf_proba,
        'metrics': calculate_classification_metrics(y_test, rf_pred, rf_proba, 'Random Forest')
    }
    
    # SVM
    svm_model, svm_pred, svm_proba = train_svm(
        X_train_balanced, X_test, y_train_balanced, y_test)
    results['SVM'] = {
        'predictions': svm_pred,
        'probabilities': svm_proba,
        'metrics': calculate_classification_metrics(y_test, svm_pred, svm_proba, 'SVM')
    }
    
    # 6. Plot ROC curves
    plot_roc_curves(results, y_test)
    
    # 7. Compare all models
    comparison_df = compare_models(results)
    print(comparison_df)
    
    # 8. Tune best model
    best_model, best_params, cv_results = tune_best_model(X_train_balanced, y_train_balanced)
    print(f"Best Parameters: {best_params}")
    
    # 9. Recommendation
    best = comparison_df.loc[comparison_df['ROC_AUC'].idxmax()]
    print(f"\nRecommendation: {best.name} with ROC-AUC = {best['ROC_AUC']:.4f}")

if __name__ == "__main__":
    main()
05

Submission

Create a public GitHub repository with the exact name shown below:

Required Repository Name
customer-churn-classification
github.com/<your-username>/customer-churn-classification
Required Files
customer-churn-classification/
├── churn_classification.ipynb  # Your Jupyter Notebook with ALL 13 functions
├── customer_churn.csv          # Input dataset (as provided or extended)
├── class_distribution.png      # Class distribution visualizations
├── roc_curves.png              # ROC curves for all models
├── model_comparison.png        # Model comparison bar charts
├── predictions.csv             # Test predictions from best model
└── README.md                   # REQUIRED - see contents below
README.md Must Include:
  • Your full name and submission date
  • Summary of all models trained and their metrics
  • How you handled class imbalance
  • Your recommendation for the best model and why
  • Any challenges faced and how you solved them
  • Instructions to run your notebook
Do Include
  • All 13 functions implemented and working
  • Docstrings for every function
  • Clear visualizations with labels and titles
  • Class imbalance handling with SMOTE
  • Hyperparameter tuning with cross-validation
  • README.md with all required sections
Do Not Include
  • Any .pyc or __pycache__ files (use .gitignore)
  • Virtual environment folders
  • Large model pickle files
  • Code that doesn't run without errors
  • Hardcoded file paths
Important: Before submitting, run all cells in your notebook to make sure it executes without errors and generates all output files correctly!
Submit Your Assignment

Enter your GitHub username - we'll verify your repository automatically

06

Grading Rubric

Your assignment will be graded on the following criteria:

Criteria                 | Points | Description
-------------------------|--------|-------------------------------------------------------------------------
Logistic Regression      | 25     | Correct implementation with class weights and coefficient interpretation
Decision Trees           | 25     | Proper tree training, visualization, and feature importance
Random Forest            | 30     | Ensemble implementation with feature importance analysis
SVM                      | 25     | Correct kernel usage and probability estimation
Class Imbalance Handling | 25     | Proper use of SMOTE or other balancing techniques
Evaluation Metrics       | 30     | ROC-AUC, confusion matrix, precision, recall, F1 calculations
Code Quality             | 40     | Docstrings, comments, naming conventions, and clean organization
Total                    | 200    |

07

What You Will Practice

Logistic Regression (3.1)

Understanding probability estimation, odds ratios, and decision boundaries

Tree-Based Models (3.2-3.3)

Decision trees, random forests, feature importance, and ensemble learning

Support Vector Machines (3.4)

Kernel selection, margin maximization, and high-dimensional classification

Imbalanced Data

SMOTE, class weights, and evaluation metrics for imbalanced datasets

08

Pro Tips

Classification Best Practices
  • Always check for class imbalance first
  • Use stratified splits to maintain class ratio
  • Scale features for SVM and Logistic Regression
  • Use ROC-AUC for imbalanced data, not accuracy
Model Selection
  • Start with Logistic Regression as baseline
  • Random Forest often works well out-of-the-box
  • SVM with RBF kernel for non-linear boundaries
  • Consider business impact of false positives vs negatives
Metrics to Focus On
  • ROC-AUC: Overall ranking ability
  • Precision: When false positives are costly
  • Recall: When false negatives are costly
  • F1-Score: Balance of precision and recall
Common Mistakes
  • Using accuracy on imbalanced datasets
  • Applying SMOTE before train/test split
  • Not tuning SVM kernel and C parameter
  • Ignoring confusion matrix interpretation
09

Pre-Submission Checklist

Code Requirements
Repository Requirements