Assignment Overview
In this assignment, you will build a complete Customer Churn Prediction System using various classification algorithms. This comprehensive project requires you to apply ALL concepts from Module 3: Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), proper evaluation metrics for classification, and techniques for handling imbalanced datasets.
You will need pandas, numpy, matplotlib, seaborn, scikit-learn, and imbalanced-learn for this assignment.
Logistic Regression (3.1)
Binary classification, probability estimation, sigmoid function
Decision Trees (3.2)
Tree-based models, pruning, feature importance
Random Forest (3.3)
Ensemble learning, bagging, out-of-bag error
SVM (3.4)
Hyperplanes, kernels, margin maximization
The Scenario
TeleConnect Communications
You have been hired as a Machine Learning Engineer at TeleConnect Communications, a telecommunications company facing high customer churn rates. The VP of Customer Success has given you this task:
"We're losing customers at an alarming rate and need to predict who's likely to churn before they leave. We have historical customer data with usage patterns, billing info, and support interactions. Build multiple classification models, compare them thoroughly, and help us identify the most at-risk customers!"
Your Task
Create a Jupyter Notebook called churn_classification.ipynb that implements multiple
classification algorithms, handles the imbalanced nature of churn data, compares model performance,
and provides actionable insights for the business.
The Dataset
You will work with a Customer Churn dataset. Create this CSV file as shown below:
File: customer_churn.csv (Customer Data)
customer_id,tenure_months,monthly_charges,total_charges,contract_type,payment_method,num_support_tickets,avg_monthly_usage_gb,has_premium_support,has_online_backup,num_additional_services,age,is_senior,churn
C001,24,65.5,1572.0,Two Year,Credit Card,1,45.2,1,1,3,35,0,0
C002,3,89.0,267.0,Month-to-Month,Electronic Check,5,78.5,0,0,1,28,0,1
C003,48,45.0,2160.0,Two Year,Bank Transfer,0,32.1,1,1,2,52,0,0
C004,6,95.5,573.0,Month-to-Month,Electronic Check,4,92.3,0,0,0,24,0,1
C005,36,55.0,1980.0,One Year,Credit Card,2,41.5,1,1,2,45,0,0
C006,1,78.0,78.0,Month-to-Month,Electronic Check,3,65.8,0,0,1,31,0,1
C007,60,42.0,2520.0,Two Year,Bank Transfer,1,28.9,1,1,3,67,1,0
C008,12,72.5,870.0,One Year,Credit Card,2,55.2,0,1,2,38,0,0
C009,2,88.5,177.0,Month-to-Month,Electronic Check,6,85.1,0,0,0,22,0,1
C010,42,52.0,2184.0,Two Year,Bank Transfer,0,35.6,1,1,3,58,0,0
C011,4,92.0,368.0,Month-to-Month,Electronic Check,4,88.7,0,0,1,26,0,1
C012,30,58.5,1755.0,One Year,Credit Card,1,48.3,1,1,2,41,0,0
C013,8,85.0,680.0,Month-to-Month,Electronic Check,3,72.4,0,0,1,33,0,1
C014,54,48.0,2592.0,Two Year,Bank Transfer,0,30.2,1,1,3,71,1,0
C015,18,62.5,1125.0,One Year,Credit Card,2,52.8,1,1,2,39,0,0
C016,5,90.0,450.0,Month-to-Month,Electronic Check,5,82.6,0,0,0,27,0,1
C017,36,50.0,1800.0,One Year,Bank Transfer,1,38.4,1,1,2,48,0,0
C018,2,95.0,190.0,Month-to-Month,Electronic Check,4,91.2,0,0,1,23,0,1
C019,45,47.5,2137.5,Two Year,Credit Card,0,33.7,1,1,3,62,0,0
C020,15,68.0,1020.0,One Year,Credit Card,2,58.9,0,1,2,36,0,0
C021,3,87.5,262.5,Month-to-Month,Electronic Check,5,79.8,0,0,0,29,0,1
C022,28,54.0,1512.0,One Year,Bank Transfer,1,42.1,1,1,2,44,0,0
C023,7,82.0,574.0,Month-to-Month,Electronic Check,3,68.5,0,0,1,32,0,1
C024,52,44.0,2288.0,Two Year,Bank Transfer,0,29.8,1,1,3,69,1,0
C025,22,60.0,1320.0,One Year,Credit Card,2,50.6,1,1,2,40,0,0
Columns Explained
- customer_id - Unique identifier (string)
- tenure_months - Months as customer (integer)
- monthly_charges - Monthly bill amount (float)
- total_charges - Total amount paid (float)
- contract_type - Contract length (categorical: Month-to-Month/One Year/Two Year)
- payment_method - Payment type (categorical)
- num_support_tickets - Support tickets filed (integer)
- avg_monthly_usage_gb - Average data usage in GB (float)
- has_premium_support - Premium support subscriber (binary: 0/1)
- has_online_backup - Online backup subscriber (binary: 0/1)
- num_additional_services - Count of add-on services (integer)
- age - Customer age (integer)
- is_senior - Senior citizen status (binary: 0/1)
- churn - Customer churned (target: 0=No, 1=Yes)
Requirements
Your churn_classification.ipynb must implement ALL of the following functions.
Each function is mandatory and will be tested individually.
Load and Explore Data
Create a function load_and_explore(filename) that:
- Loads the CSV file using pandas
- Displays class distribution of the target variable
- Calculates class imbalance ratio
- Returns the DataFrame and exploration summary
def load_and_explore(filename):
    """Load dataset and analyze class distribution."""
    # Must return: (df, exploration_dict with 'imbalance_ratio')
    pass
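A minimal sketch of one way to fill in this stub. The imbalance ratio is taken here as majority count over minority count; the in-memory CSV in the demo is illustrative (pandas accepts any file-like object where a filename is expected):

```python
import io
import pandas as pd

def load_and_explore(filename):
    """Load dataset and analyze class distribution."""
    df = pd.read_csv(filename)
    counts = df["churn"].value_counts()
    print("Class distribution:\n", counts)
    # Imbalance ratio: majority class count divided by minority class count
    imbalance_ratio = counts.max() / counts.min()
    summary = {"n_rows": len(df), "imbalance_ratio": imbalance_ratio}
    return df, summary

# Demo with a tiny in-memory CSV (3 non-churners, 1 churner)
csv = io.StringIO(
    "customer_id,tenure_months,churn\n"
    "C001,24,0\nC002,3,1\nC003,48,0\nC004,6,0\n"
)
df, summary = load_and_explore(csv)
```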
Visualize Class Distribution
Create a function visualize_class_distribution(df, target='churn') that:
- Creates pie chart and bar chart of target classes
- Shows feature distributions by class
- Saves plots as class_distribution.png
def visualize_class_distribution(df, target='churn'):
    """Visualize target class distribution and features."""
    # Must save: class_distribution.png
    pass
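A simplified sketch covering the pie and bar charts (the per-feature distributions by class are omitted here). The Agg backend is used so the script runs headless; the demo DataFrame is made up:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; works without a display
import matplotlib.pyplot as plt
import pandas as pd

def visualize_class_distribution(df, target="churn"):
    """Pie and bar charts of the target class balance."""
    counts = df[target].value_counts()
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].pie(counts, labels=counts.index.astype(str), autopct="%1.1f%%")
    axes[0].set_title("Churn share")
    counts.plot(kind="bar", ax=axes[1], title="Churn counts")
    axes[1].set_xlabel(target)
    fig.savefig("class_distribution.png")
    plt.close(fig)

df = pd.DataFrame({"churn": [0, 0, 0, 1, 1]})
visualize_class_distribution(df)
```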
Preprocess Data
Create a function preprocess_data(df, target_col) that:
- Encodes categorical variables
- Scales numerical features
- Splits into train/test sets with stratification
- Returns processed data and preprocessing objects
def preprocess_data(df, target_col):
    """Preprocess data for classification."""
    # Return: (X_train, X_test, y_train, y_test, preprocessors)
    pass
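One possible implementation sketch. Note the scaler is fit on the training split only, to avoid test-set leakage; the toy DataFrame and its column subset are illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preprocess_data(df, target_col):
    """One-hot encode categoricals, scale features, stratified split."""
    X = df.drop(columns=[target_col, "customer_id"], errors="ignore")
    y = df[target_col]
    X = pd.get_dummies(X, drop_first=True)  # encode categorical columns
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42)
    scaler = StandardScaler()
    # Fit on the training split only, then apply the same transform to test
    X_train = pd.DataFrame(scaler.fit_transform(X_train),
                           columns=X.columns, index=X_train.index)
    X_test = pd.DataFrame(scaler.transform(X_test),
                          columns=X.columns, index=X_test.index)
    return X_train, X_test, y_train, y_test, {"scaler": scaler}

# Toy data: 8 customers, balanced target, one categorical column
df = pd.DataFrame({
    "customer_id": [f"C{i}" for i in range(8)],
    "tenure_months": [24, 3, 48, 6, 36, 1, 60, 12],
    "contract_type": ["Two Year", "Month-to-Month"] * 4,
    "churn": [0, 1, 0, 1, 0, 1, 0, 1],
})
X_tr, X_te, y_tr, y_te, prep = preprocess_data(df, "churn")
```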
Handle Class Imbalance
Create a function handle_imbalance(X_train, y_train, method='smote') that:
- Implements SMOTE for oversampling minority class
- Optionally supports random undersampling
- Returns balanced training data
- Prints before/after class distribution
def handle_imbalance(X_train, y_train, method='smote'):
    """Handle class imbalance using specified method."""
    # Return: (X_resampled, y_resampled)
    pass
Logistic Regression Classifier
Create a function train_logistic_regression(X_train, X_test, y_train, y_test) that:
- Trains logistic regression with class weights
- Returns model, predictions, and probabilities
- Extracts and displays feature coefficients
def train_logistic_regression(X_train, X_test, y_train, y_test):
    """Train Logistic Regression classifier."""
    # Return: (model, y_pred, y_proba, coefficients)
    pass
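A minimal sketch matching the required return signature, demoed on synthetic data rather than the churn CSV:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_logistic_regression(X_train, X_test, y_train, y_test):
    """Class-weighted logistic regression with coefficient readout."""
    model = LogisticRegression(class_weight="balanced", max_iter=1000)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]  # P(churn = 1)
    coefficients = model.coef_[0]  # one weight per feature, on the log-odds scale
    return model, y_pred, y_proba, coefficients

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
model, y_pred, y_proba, coefs = train_logistic_regression(X_tr, X_te, y_tr, y_te)
```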
Decision Tree Classifier
Create a function train_decision_tree(X_train, X_test, y_train, y_test, max_depth=5) that:
- Trains decision tree with specified max depth
- Visualizes the tree structure
- Returns model, predictions, and feature importance
def train_decision_tree(X_train, X_test, y_train, y_test, max_depth=5):
    """Train Decision Tree classifier."""
    # Return: (model, y_pred, feature_importance)
    pass
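A sketch on synthetic data. For brevity it renders the tree as text with `export_text`; in the notebook you would likely use `sklearn.tree.plot_tree` for the graphical version:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

def train_decision_tree(X_train, X_test, y_train, y_test, max_depth=5):
    """Depth-limited decision tree with feature importances."""
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    feature_importance = model.feature_importances_
    # Text rendering of the top of the tree (plot_tree gives the figure)
    print(export_text(model, max_depth=2))
    return model, y_pred, feature_importance

X, y = make_classification(n_samples=150, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
model, y_pred, importance = train_decision_tree(X_tr, X_te, y_tr, y_te)
```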
Random Forest Classifier
Create a function train_random_forest(X_train, X_test, y_train, y_test, n_estimators=100) that:
- Trains random forest ensemble
- Returns model, predictions, and feature importance
- Plots feature importance bar chart
def train_random_forest(X_train, X_test, y_train, y_test, n_estimators=100):
    """Train Random Forest classifier."""
    # Return: (model, y_pred, y_proba, feature_importance)
    pass
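A sketch of the ensemble step (the feature-importance bar chart is left out here; `feature_importance` is returned so it can be plotted separately):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train_random_forest(X_train, X_test, y_train, y_test, n_estimators=100):
    """Random forest ensemble with impurity-based feature importances."""
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   class_weight="balanced", random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    feature_importance = model.feature_importances_
    return model, y_pred, y_proba, feature_importance

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
model, y_pred, y_proba, importance = train_random_forest(X_tr, X_te, y_tr, y_te)
```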
Support Vector Machine Classifier
Create a function train_svm(X_train, X_test, y_train, y_test, kernel='rbf') that:
- Trains SVM with specified kernel (linear, rbf, poly)
- Uses probability estimation for ROC curve
- Returns model and predictions
def train_svm(X_train, X_test, y_train, y_test, kernel='rbf'):
    """Train SVM classifier with specified kernel."""
    # Return: (model, y_pred, y_proba)
    pass
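A sketch showing the one SVM-specific detail that often trips people up: `SVC` only exposes `predict_proba` (needed for the ROC curve) when constructed with `probability=True`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def train_svm(X_train, X_test, y_train, y_test, kernel="rbf"):
    """SVM with a chosen kernel and probability estimates enabled."""
    # probability=True fits an extra calibration step so predict_proba works
    model = SVC(kernel=kernel, probability=True,
                class_weight="balanced", random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    return model, y_pred, y_proba

X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
model, y_pred, y_proba = train_svm(X_tr, X_te, y_tr, y_te)
```

Remember that SVMs are sensitive to feature scale, so this assumes the inputs have already gone through `preprocess_data`.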
Calculate Classification Metrics
Create a function calculate_classification_metrics(y_true, y_pred, y_proba, model_name) that:
- Calculates accuracy, precision, recall, F1-score
- Generates confusion matrix
- Calculates ROC-AUC score
- Returns dictionary with all metrics
def calculate_classification_metrics(y_true, y_pred, y_proba, model_name):
    """Calculate and return classification metrics."""
    # Return: dict with 'accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'confusion_matrix'
    pass
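A direct sketch using scikit-learn's metric functions, demoed on hand-made arrays so the values are easy to check by eye:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

def calculate_classification_metrics(y_true, y_pred, y_proba, model_name):
    """Collect the standard binary-classification metrics in one dict."""
    return {
        "model": model_name,
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        # ROC-AUC needs scores/probabilities, not hard labels
        "roc_auc": roc_auc_score(y_true, y_proba),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])   # one false positive
y_proba = np.array([0.1, 0.6, 0.8, 0.9])
m = calculate_classification_metrics(y_true, y_pred, y_proba, "demo")
```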
Plot ROC Curves
Create a function plot_roc_curves(results_dict, y_test) that:
- Plots ROC curves for all models on same figure
- Includes AUC scores in legend
- Saves plot as roc_curves.png
def plot_roc_curves(results_dict, y_test):
    """Plot ROC curves for all models."""
    # Must save: roc_curves.png
    pass
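A sketch assuming `results_dict` follows the structure built in `main()` (each entry holds a `'probabilities'` array); the demo probabilities are made up:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def plot_roc_curves(results_dict, y_test):
    """Overlay ROC curves for every model, with AUC in the legend."""
    plt.figure(figsize=(8, 6))
    for name, res in results_dict.items():
        fpr, tpr, _ = roc_curve(y_test, res["probabilities"])
        auc = roc_auc_score(y_test, res["probabilities"])
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.3f})")
    plt.plot([0, 1], [0, 1], "k--", label="Chance")  # random-guess diagonal
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC Curves")
    plt.legend()
    plt.savefig("roc_curves.png")
    plt.close()

# Demo with made-up probabilities for two hypothetical models
y_test = np.array([0, 0, 1, 1, 0, 1])
results = {
    "Model A": {"probabilities": np.array([0.1, 0.4, 0.8, 0.9, 0.3, 0.7])},
    "Model B": {"probabilities": np.array([0.2, 0.5, 0.6, 0.8, 0.4, 0.3])},
}
plot_roc_curves(results, y_test)
```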
Hyperparameter Tuning
Create a function tune_best_model(X_train, y_train, model_type='random_forest') that:
- Uses GridSearchCV with cross-validation
- Tunes hyperparameters for specified model
- Returns best model and best parameters
def tune_best_model(X_train, y_train, model_type='random_forest'):
    """Tune hyperparameters using GridSearchCV."""
    # Return: (best_model, best_params, cv_results)
    pass
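A sketch for the random-forest case; the parameter grid is deliberately small for speed and is only an example of what you might search:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def tune_best_model(X_train, y_train, model_type="random_forest"):
    """Exhaustive grid search with cross-validation, scored by ROC-AUC."""
    if model_type == "random_forest":
        model = RandomForestClassifier(random_state=42)
        param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
    else:
        raise ValueError(f"unsupported model_type: {model_type}")
    grid = GridSearchCV(model, param_grid, cv=3, scoring="roc_auc")
    grid.fit(X_train, y_train)
    return grid.best_estimator_, grid.best_params_, grid.cv_results_

X, y = make_classification(n_samples=120, random_state=0)
best_model, best_params, cv_results = tune_best_model(X, y)
print(best_params)
```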
Compare All Models
Create a function compare_models(results_dict) that:
- Creates comparison table of all models
- Generates comparison bar charts for metrics
- Saves comparison as
model_comparison.png - Returns DataFrame with comparison
def compare_models(results_dict):
    """Compare all classification models."""
    # Return: comparison_df
    pass
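A sketch of the tabulation step only (the bar charts and the saved PNG are omitted). It assumes the `results_dict` shape from `main()`; the metric values in the demo are hypothetical:

```python
import pandas as pd

def compare_models(results_dict):
    """Build a model-by-metric comparison table from per-model results."""
    rows = {name: res["metrics"] for name, res in results_dict.items()}
    comparison_df = pd.DataFrame(rows).T  # models as rows, metrics as columns
    return comparison_df

# Hypothetical metric dicts for two models
results = {
    "Logistic Regression": {"metrics": {"accuracy": 0.81, "roc_auc": 0.88}},
    "Random Forest": {"metrics": {"accuracy": 0.85, "roc_auc": 0.91}},
}
comparison_df = compare_models(results)
print(comparison_df)
```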
Main Pipeline
Create a main() function that:
- Runs the complete classification pipeline
- Trains all model types and collects results
- Generates all required visualizations
- Prints final recommendation for best model
def main():
    # 1. Load and explore data
    df, summary = load_and_explore("customer_churn.csv")

    # 2. Visualize class distribution
    visualize_class_distribution(df)

    # 3. Preprocess data
    X_train, X_test, y_train, y_test, preprocessors = preprocess_data(df, 'churn')

    # 4. Handle imbalance
    X_train_balanced, y_train_balanced = handle_imbalance(X_train, y_train)

    # 5. Train all models
    results = {}

    # Logistic Regression
    lr_model, lr_pred, lr_proba, lr_coefs = train_logistic_regression(
        X_train_balanced, X_test, y_train_balanced, y_test)
    results['Logistic Regression'] = {
        'predictions': lr_pred,
        'probabilities': lr_proba,
        'metrics': calculate_classification_metrics(y_test, lr_pred, lr_proba, 'Logistic Regression')
    }

    # Decision Tree
    dt_model, dt_pred, dt_importance = train_decision_tree(
        X_train_balanced, X_test, y_train_balanced, y_test)
    dt_proba = dt_model.predict_proba(X_test)[:, 1]
    results['Decision Tree'] = {
        'predictions': dt_pred,
        'probabilities': dt_proba,
        'metrics': calculate_classification_metrics(y_test, dt_pred, dt_proba, 'Decision Tree')
    }

    # Random Forest
    rf_model, rf_pred, rf_proba, rf_importance = train_random_forest(
        X_train_balanced, X_test, y_train_balanced, y_test)
    results['Random Forest'] = {
        'predictions': rf_pred,
        'probabilities': rf_proba,
        'metrics': calculate_classification_metrics(y_test, rf_pred, rf_proba, 'Random Forest')
    }

    # SVM
    svm_model, svm_pred, svm_proba = train_svm(
        X_train_balanced, X_test, y_train_balanced, y_test)
    results['SVM'] = {
        'predictions': svm_pred,
        'probabilities': svm_proba,
        'metrics': calculate_classification_metrics(y_test, svm_pred, svm_proba, 'SVM')
    }

    # 6. Plot ROC curves
    plot_roc_curves(results, y_test)

    # 7. Compare all models
    comparison_df = compare_models(results)
    print(comparison_df)

    # 8. Tune best model
    best_model, best_params, cv_results = tune_best_model(X_train_balanced, y_train_balanced)
    print(f"Best Parameters: {best_params}")

    # 9. Recommendation
    best = comparison_df.loc[comparison_df['ROC_AUC'].idxmax()]
    print(f"\nRecommendation: {best.name} with ROC-AUC = {best['ROC_AUC']:.4f}")

if __name__ == "__main__":
    main()
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
customer-churn-classification
Required Files
customer-churn-classification/
├── churn_classification.ipynb # Your Jupyter Notebook with ALL 13 functions
├── customer_churn.csv # Input dataset (as provided or extended)
├── class_distribution.png # Class distribution visualizations
├── roc_curves.png # ROC curves for all models
├── model_comparison.png # Model comparison bar charts
├── predictions.csv # Test predictions from best model
└── README.md # REQUIRED - see contents below
README.md Must Include:
- Your full name and submission date
- Summary of all models trained and their metrics
- How you handled class imbalance
- Your recommendation for the best model and why
- Any challenges faced and how you solved them
- Instructions to run your notebook
Do Include
- All 13 functions implemented and working
- Docstrings for every function
- Clear visualizations with labels and titles
- Class imbalance handling with SMOTE
- Hyperparameter tuning with cross-validation
- README.md with all required sections
Do Not Include
- Any .pyc or __pycache__ files (use .gitignore)
- Virtual environment folders
- Large model pickle files
- Code that doesn't run without errors
- Hardcoded file paths
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Logistic Regression | 25 | Correct implementation with class weights and coefficient interpretation |
| Decision Trees | 25 | Proper tree training, visualization, and feature importance |
| Random Forest | 30 | Ensemble implementation with feature importance analysis |
| SVM | 25 | Correct kernel usage and probability estimation |
| Class Imbalance Handling | 25 | Proper use of SMOTE or other balancing techniques |
| Evaluation Metrics | 30 | ROC-AUC, confusion matrix, precision, recall, F1 calculations |
| Code Quality | 40 | Docstrings, comments, naming conventions, and clean organization |
| Total | 200 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
Logistic Regression (3.1)
Understanding probability estimation, odds ratios, and decision boundaries
Tree-Based Models (3.2-3.3)
Decision trees, random forests, feature importance, and ensemble learning
Support Vector Machines (3.4)
Kernel selection, margin maximization, and high-dimensional classification
Imbalanced Data
SMOTE, class weights, and evaluation metrics for imbalanced datasets
Pro Tips
Classification Best Practices
- Always check for class imbalance first
- Use stratified splits to maintain class ratio
- Scale features for SVM and Logistic Regression
- Use ROC-AUC for imbalanced data, not accuracy
Model Selection
- Start with Logistic Regression as baseline
- Random Forest often works well out-of-the-box
- SVM with RBF kernel for non-linear boundaries
- Consider business impact of false positives vs negatives
Metrics to Focus On
- ROC-AUC: Overall ranking ability
- Precision: When false positives are costly
- Recall: When false negatives are costly
- F1-Score: Balance of precision and recall
Common Mistakes
- Using accuracy on imbalanced datasets
- Applying SMOTE before train/test split
- Not tuning SVM kernel and C parameter
- Ignoring confusion matrix interpretation