Capstone Project 6

End-to-End ML Pipeline

Build a complete, production-ready machine learning pipeline that takes raw data through preprocessing, feature engineering, model training, and deployment. Master scikit-learn Pipelines, model serialization, and Flask API development.

12-16 hours
Advanced
550 Points
What You Will Build
  • Preprocessing pipeline
  • ColumnTransformer workflow
  • Model training pipeline
  • Model serialization (joblib)
  • Flask prediction API
01

Project Overview

This advanced capstone project brings together everything you have learned in the Data Science course. You will work with a telecommunications customer churn dataset containing 120 customers with comprehensive service and billing information. Your goal is to build a production-ready ML pipeline that can be deployed and used for real-time predictions.

Skills Applied: This project tests your proficiency in scikit-learn (Pipelines, ColumnTransformer), model evaluation, serialization (joblib/pickle), and basic API development with Flask.
Learning Objectives
Pipeline Engineering

Master scikit-learn's Pipeline API to create modular, reproducible ML workflows that eliminate manual preprocessing steps and prevent data leakage.

  • Build ColumnTransformer for mixed data types
  • Chain preprocessing and modeling in single Pipeline
  • Use feature names propagation (get_feature_names_out)
  • Implement proper train-test isolation
Model Persistence

Learn production-grade model serialization techniques for saving, versioning, and loading trained models with full preprocessing pipelines intact.

  • Serialize complete pipelines with joblib
  • Create model metadata and versioning
  • Validate deserialized models work correctly
  • Handle model updates and rollbacks
API Development

Build RESTful APIs with Flask to serve ML predictions in real-time, implementing proper error handling and response formatting for production use.

  • Create Flask endpoints for predictions
  • Implement health checks and monitoring
  • Handle JSON input validation
  • Return predictions with confidence scores
Production Readiness

Develop ML systems following software engineering best practices: modularity, reproducibility, testability, and deployability for real-world applications.

  • Structure code into reusable modules
  • Write clear documentation and README
  • Create reproducible environments (requirements.txt)
  • Test API endpoints before deployment
Preprocessing

Build modular preprocessing with ColumnTransformer

Pipeline

Create end-to-end scikit-learn Pipeline

Serialization

Save and load models with joblib

Deployment

Build Flask API for predictions

02

Business Scenario

TeleConnect Communications

You have been hired as a Machine Learning Engineer at TeleConnect Communications, a major telecommunications provider with 2.5 million subscribers. The company is experiencing significant customer churn and losing $8.2 million in revenue each month. With an average customer lifetime value of $2,400 and acquisition costs of $450 per customer, reducing churn by just 5% would save $2.5 million annually.

"We need a system that our customer retention team can query in real-time when speaking with at-risk customers. The solution must be deployable and retrainable monthly with new data. Can you build us an ML pipeline with an API endpoint that returns churn probability within 200ms?"

Michael Rodriguez, VP of Customer Success

38%

Monthly churn rate (industry avg: 22%)

15 mins

Manual churn assessment per customer

$450

Cost to acquire new customer

Business Challenges
High Churn Rate

38% monthly churn vs 22% industry average. Month-to-month contracts have 58% churn vs 12% for 2-year contracts. Need to identify at-risk customers before they cancel.

Manual Process

Retention team manually reviews accounts (15 mins each). With 12,000+ at-risk customers monthly, only 8% can be contacted. Automated prediction needed.

Dynamic Patterns

Customer behavior changes quarterly (new competitors, pricing, services). Model must be retrained monthly with fresh data. Pipeline ensures reproducibility.

Technical Requirements

Pipeline
  • Handle mixed feature types (numeric + categorical)
  • Include all preprocessing in the pipeline
  • Enable easy retraining with new data
Model
  • Train a classification model (churn/no churn)
  • Evaluate with cross-validation
  • Compare at least two algorithms
Serialization
  • Save complete pipeline (not just model)
  • Include model metadata and versioning
  • Verify model loads and predicts correctly
API
  • Build Flask app with prediction endpoint
  • Add health check and error handling
  • Return prediction with probability
Pro Tip: Think like an ML engineer! Your pipeline should be modular, reproducible, and production-ready.
03

The Dataset

You will work with a telecommunications customer churn dataset. Download the CSV file and place it in your project's data/ folder:

Dataset Schema
Column Type Description
customer_id String Unique customer identifier
gender String Customer gender (Male/Female)
senior_citizen Integer Senior citizen status (0/1)
partner String Has partner (Yes/No)
dependents String Has dependents (Yes/No)
tenure_months Integer Months with company
phone_service String Phone service subscription
multiple_lines String Multiple phone lines
internet_service String Internet type (DSL/Fiber optic/No)
online_security String Online security add-on
online_backup String Online backup add-on
device_protection String Device protection plan
tech_support String Tech support add-on
streaming_tv String Streaming TV subscription
streaming_movies String Streaming movies subscription
contract String Contract type (Month-to-month/One year/Two year)
paperless_billing String Paperless billing enabled
payment_method String Payment method type
monthly_charges Float Monthly billing amount ($)
total_charges Float Total amount billed ($)
churn String Target: Customer churned (Yes/No)
Dataset Stats: 120 customers, 20 features, ~40% churn rate, mix of numeric and categorical variables
04

Project Requirements

Your project must include all of the following components. Structure your work with clear separation between notebooks, source code, and API.

1
Data Exploration Notebook

Explore the dataset to understand feature distributions, identify patterns related to churn, and uncover data quality issues before building your pipeline.

Key Exploration Tasks
Dataset Overview
  • Shape: 120 customers, 21 columns (20 features + 1 target)
  • Churn distribution: Calculate percentage of churned vs retained customers
  • Data types: Identify numeric (3) vs categorical (17) features
  • Missing values: Check for nulls in each column
Pattern Discovery
  • Tenure analysis: Do long-term customers churn less?
  • Contract type: Compare churn rates across contract types
  • Service patterns: Which services correlate with retention?
  • Pricing impact: Is churn higher for expensive plans?
Expected Findings

Your exploration should reveal key churn indicators:

  • Month-to-month contracts likely have 2-3x higher churn than long-term contracts
  • Tenure matters: Customers with <3 months tenure are highest risk (50%+ churn rate)
  • Service bundles: Customers with multiple services (internet + TV + phone) churn less
  • Fiber optic users: May have higher churn despite faster service (check pricing)
2
Preprocessing Pipeline

Build a ColumnTransformer that handles numeric and categorical features differently. This is the cornerstone of production ML - all preprocessing must happen inside the pipeline to prevent data leakage.

Pipeline Architecture
Numeric Features Pipeline

Features: tenure_months, monthly_charges, total_charges

Step 1 - Imputation: Fill missing values with median (robust to outliers)

Step 2 - Scaling: Apply StandardScaler to normalize ranges (mean=0, std=1)

Why: Linear models and distance-based algorithms need scaled features

Categorical Features Pipeline

Features: gender, partner, contract, internet_service, payment_method, etc. (17 total)

Step 1 - Imputation: Fill missing with 'Unknown' constant

Step 2 - Encoding: OneHotEncoder creates binary columns for each category

Why: ML models need numeric inputs; one-hot encoding preserves category independence

Critical: Data Leakage Prevention

Wrong approach: Scaling features before train-test split. The scaler learns statistics from test data, artificially boosting performance.

Right approach: Fit preprocessor only on training data. The pipeline ensures test data is transformed using training statistics.

Example: If training set monthly_charges has mean=$70, test data is scaled using $70, not test data's actual mean.

ColumnTransformer Workflow

ColumnTransformer applies different transformations to different column subsets in parallel:

  1. Define numeric transformer: Chain SimpleImputer (median) → StandardScaler
  2. Define categorical transformer: Chain SimpleImputer ('Unknown') → OneHotEncoder (handle_unknown='ignore')
  3. Combine in ColumnTransformer: Specify which columns get which transformer
  4. Result: Single preprocessor object that handles all feature types correctly
Feature Names Output

After OneHotEncoding, contract becomes contract_Month-to-month, contract_One year, contract_Two year. Use get_feature_names_out() to retrieve transformed feature names for interpretation and debugging.

3
Complete ML Pipeline

Combine preprocessing and model training into a single Pipeline object. This is the magic of scikit-learn - one object handles everything from raw data to predictions.

Pipeline Composition
Pipeline Structure:
┌─────────────────────────────────────┐
│ Step 1: ColumnTransformer           │
│   ├─ Numeric: Impute → Scale        │
│   └─ Categorical: Impute → Encode   │
├─────────────────────────────────────┤
│ Step 2: RandomForestClassifier      │
│   Fits on transformed features      │
└─────────────────────────────────────┘
Pipeline Benefits
  • No leakage: Preprocessing fitted only on training data
  • One-line training: pipeline.fit(X_train, y_train) does everything
  • One-line prediction: pipeline.predict(X_test) preprocesses automatically
  • Serialization: Save entire workflow with joblib.dump(pipeline)
  • Reproducibility: Same transformations applied in training and production
Classifier Selection
  • RandomForest: Handles non-linear patterns, robust to outliers, provides feature importance
  • LogisticRegression: Fast, interpretable coefficients, good baseline
  • GradientBoosting: Often highest accuracy, requires tuning
  • Recommendation: Start with RandomForest (n_estimators=100, random_state=42)
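Chaining the preprocessor and the recommended classifier might look like the sketch below. The DataFrame here is a tiny illustrative stand-in for the churn dataset, with only a few of its columns:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["tenure_months", "monthly_charges"]
categorical = ["contract"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="constant",
                                               fill_value="Unknown")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]),
     categorical),
])

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Toy data standing in for the churn dataset (values are illustrative)
X = pd.DataFrame({
    "tenure_months": [1, 24, 3, 48, 2, 60],
    "monthly_charges": [89.5, 45.0, 95.0, 30.0, 99.0, 25.0],
    "contract": ["Month-to-month", "Two year", "Month-to-month",
                 "Two year", "Month-to-month", "One year"],
})
y = ["Yes", "No", "Yes", "No", "Yes", "No"]

pipeline.fit(X, y)           # imputes, scales, encodes, then trains
preds = pipeline.predict(X)  # the same preprocessing replays automatically
```

One object now carries the whole workflow, which is exactly what gets serialized and deployed later.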
The Power of Pipelines

Without pipeline: You must manually impute, scale, encode, then train. In production, you must remember and replicate every step exactly. Any mistake causes train-serve skew.

With pipeline: Training: pipeline.fit(X_train, y_train). Production: pipeline.predict(new_data). All preprocessing happens automatically in correct order. Zero train-serve skew.

4
Model Evaluation

Rigorously evaluate your churn prediction model using classification metrics appropriate for imbalanced data and business context.

Evaluation Strategy
Train-Test Split

Method: Stratified split (80/20) - maintains churn ratio in both sets

Why stratified: With 40% churn rate, random split might give 45% in train and 35% in test - stratified ensures both have ~40%

Random state: Set random_state=42 for reproducibility

Cross-Validation

Method: 5-fold stratified CV on training data only

Purpose: Get robust performance estimate, detect overfitting

Interpretation: CV accuracy of 0.82 ± 0.04 means model consistently performs well (0.78-0.86 range)
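The split-plus-CV strategy can be sketched on synthetic stand-in data (in the real project, `X` and `y` come from the churn DataFrame):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for preprocessed churn features (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 5))
y = rng.choice(["Yes", "No"], size=120, p=[0.4, 0.6])

# Stratified 80/20 split preserves the ~40% churn ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 5-fold stratified CV on the training data only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train, y_train, cv=cv)
print(f"CV accuracy: {scores.mean():.2f} ± {scores.std():.2f}")
```

Note that the test set never touches cross-validation; it is held out for the final evaluation.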

Key Metrics for Churn Prediction
Metric Formula Business Meaning Target
Recall TP / (TP + FN) % of churners correctly identified - critical for retention budget >75%
Precision TP / (TP + FP) % of predictions that are actual churners - avoids wasting retention offers >65%
F1 Score 2 * (P * R) / (P + R) Balance between precision and recall - overall model quality >70%
AUC-ROC Area under ROC curve Ranking ability - can model distinguish churners from non-churners? >0.80
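Each of these metrics is a single call in `sklearn.metrics`. As a quick check, the confusion-matrix counts from the worked example later in this section (31 TP, 9 FN, 12 FP, 48 TN) reproduce the formulas above:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Rebuild label arrays from the counts (1 = churn, 0 = no churn)
y_true = [1] * 31 + [1] * 9 + [0] * 12 + [0] * 48
y_pred = [1] * 31 + [0] * 9 + [1] * 12 + [0] * 48

print(f"Recall:    {recall_score(y_true, y_pred):.3f}")     # 31/40 = 0.775
print(f"Precision: {precision_score(y_true, y_pred):.3f}")  # 31/43 = 0.721
print(f"F1:        {f1_score(y_true, y_pred):.3f}")         # 62/83 = 0.747
```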
Business Trade-off: Recall vs Precision

High Recall (80%): Catch most churners but send retention offers to non-churners too. Cost: $50 offer × false positives.

High Precision (85%): Offers only go to likely churners, but miss some at-risk customers. Cost: $2,400 LTV × missed churners.

Recommendation: Prioritize recall (minimize FN) - missing a churner costs 48x more than a wasted offer ($2,400 vs $50).

Model Comparison Framework

Train at least 2 algorithms and compare using a metrics table:

Model CV Accuracy Test Recall Test Precision F1 Score AUC-ROC Train Time
RandomForest 0.82 ± 0.03 0.78 0.71 0.74 0.85 2.1s
LogisticRegression 0.79 ± 0.04 0.75 0.68 0.71 0.82 0.3s

Selection: RandomForest wins - higher recall (catches more churners), better AUC, acceptable training time for monthly retraining.

Confusion Matrix Interpretation

For 100 test customers with 40% churn rate:

  • True Positives (31): Correctly predicted churners - send retention offer, save $2,400 each = $74,400 saved
  • False Negatives (9): Missed churners - lost customers, cost $2,400 each = $21,600 lost revenue
  • False Positives (12): Predicted churn but didn't - wasted offers, cost $50 each = $600 wasted
  • True Negatives (48): Correctly predicted retention - no action needed
5
Model Serialization

Save the trained pipeline to disk using joblib for deployment. This captures the entire workflow - preprocessing + model - in a single file.

Why Model Persistence Matters
Training Once

Training takes 2-5 minutes with 120 customers. In production, you can't retrain for every prediction. Save once, load instantly (50ms).

Deployment

API servers load the saved model at startup. Thousands of predictions use the same trained pipeline without retraining.

Versioning

Save models with timestamps (churn_model_2026_01.joblib). Rollback to previous version if new model underperforms.

Joblib vs Pickle
Format Speed Compression Best For Recommendation
joblib Fast for NumPy arrays Built-in compression ML models with large arrays Use This
pickle Slower for arrays Manual compression General Python objects Alternative
Model Metadata Best Practices

Create a JSON file alongside the model with critical information:

{
  "model_version": "1.0",
  "train_date": "2026-01-02",
  "algorithm": "RandomForestClassifier",
  "accuracy": 0.82,
  "recall": 0.78,
  "precision": 0.71,
  "auc_roc": 0.85,
  "training_samples": 96,
  "test_samples": 24,
  "features": ["tenure_months", "monthly_charges", ...],
  "hyperparameters": {
    "n_estimators": 100,
    "random_state": 42
  }
}
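Saving the pipeline and its metadata sidecar might look like the following sketch. The dummy pipeline and the trimmed-down metadata stand in for your fitted churn pipeline and the full JSON above:

```python
import json
from pathlib import Path

import joblib
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in fitted pipeline; in the project this is the full
# preprocessing + RandomForest pipeline
pipeline = Pipeline([("scale", StandardScaler()),
                     ("model", DummyClassifier(strategy="most_frequent"))])
pipeline.fit([[1.0], [2.0], [3.0]], ["No", "No", "Yes"])

metadata = {"model_version": "1.0",
            "algorithm": "DummyClassifier",
            "hyperparameters": {"strategy": "most_frequent"}}

Path("models").mkdir(exist_ok=True)
joblib.dump(pipeline, "models/churn_model.joblib")  # entire workflow, one file
with open("models/model_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)                # human-readable sidecar
```

Because the dump captures the whole pipeline, the API can later load one file and accept raw feature values directly.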
Verification Checklist

After saving, verify your model works correctly:

  1. Load test: Load the saved pipeline in a fresh Python session
  2. Prediction test: Make predictions on sample data, verify output format
  3. Probability test: Ensure predict_proba() returns values between 0-1
  4. Feature test: Confirm model accepts same feature columns as training data
  5. Performance test: Verify predictions match pre-save performance
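Checks 1-3 are easy to script. This sketch saves and reloads a stand-in pipeline just to show the pattern; in the project, you would load your actual `models/churn_model.joblib` in a fresh session:

```python
import joblib
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the real saved pipeline
pipeline = Pipeline([("scale", StandardScaler()),
                     ("model", DummyClassifier(strategy="most_frequent"))])
pipeline.fit([[1.0], [2.0], [3.0]], ["No", "No", "Yes"])
joblib.dump(pipeline, "churn_model_check.joblib")

# 1. Load test: reload as a fresh session would
loaded = joblib.load("churn_model_check.joblib")

# 2. Prediction test: output format matches the training labels
sample = [[2.5]]
assert loaded.predict(sample)[0] in ("Yes", "No")

# 3. Probability test: rows of predict_proba lie in [0, 1] and sum to 1
proba = loaded.predict_proba(sample)
assert np.all((proba >= 0) & (proba <= 1))
assert np.allclose(proba.sum(axis=1), 1.0)
```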
Security Warning

Pickle and joblib files can execute arbitrary code when loaded. Never load models from untrusted sources. In production, store models in secure locations with access controls and integrity checks (checksums).

6
Flask API for Real-Time Predictions

Build a RESTful API using Flask that loads your saved model and serves predictions via HTTP endpoints. This transforms your ML model into a production service.

API Architecture
Startup:
1. Import Flask, joblib, pandas
2. Create Flask app instance
3. Load saved model ONCE (at startup, not per request)
Request Flow:
Client → POST /predict → JSON payload → DataFrame → Pipeline → Prediction → JSON response
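A minimal Flask skeleton following this architecture is sketched below. Paths and the version string are illustrative, and `load_model` is a helper introduced here so the model loads exactly once at startup:

```python
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = None  # populated once by load_model() at startup, never per request


def load_model(path="models/churn_model.joblib"):
    """Load the serialized pipeline a single time, before serving requests."""
    global model
    model = joblib.load(path)


@app.route("/health")
def health():
    return jsonify({"status": "healthy", "model_version": "1.0"})


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True)
    if payload is None:
        return jsonify({"error": "Request body must be JSON"}), 400
    try:
        X = pd.DataFrame([payload])  # one-row frame; columns must match training
        churn_idx = list(model.classes_).index("Yes")
        proba = float(model.predict_proba(X)[0, churn_idx])
    except Exception as exc:
        return jsonify({"error": f"Prediction failed: {exc}"}), 400
    return jsonify({
        "prediction": "Churn" if proba >= 0.5 else "No Churn",
        "churn_probability": round(proba, 3),
    })


if __name__ == "__main__":
    load_model()
    app.run(port=5000)  # in production use Gunicorn/uWSGI, never debug=True
```

The JSON payload becomes a one-row DataFrame because the pipeline was trained on DataFrames with named columns.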
Required Endpoints
GET /health

Purpose: Health check for monitoring systems (Kubernetes, AWS ELB)

Response: JSON with status: 'healthy', model version, uptime

Use case: Load balancer pings every 30s to verify API is running

POST /predict

Purpose: Accept customer data, return churn prediction + probability

Input: JSON object with all 20 feature fields

Output: prediction ('Churn' or 'No Churn'), churn_probability (0.0-1.0)

Example API Interaction
Request Example

Method: POST

URL: http://localhost:5000/predict

Headers: Content-Type: application/json

Body:

{
  "tenure_months": 3,
  "monthly_charges": 89.50,
  "contract": "Month-to-month",
  "internet_service": "Fiber optic",
  ...
}
Response Example

Status: 200 OK

Body:

{
  "prediction": "Churn",
  "churn_probability": 0.78,
  "confidence": "High",
  "recommendation": "Contact customer"
}

Interpretation: 78% chance of churn - high priority for retention team

Error Handling Requirements
Error Type HTTP Code Response Example
Missing features 400 Error message listing missing fields "Missing required field: contract"
Invalid data type 400 Error describing type mismatch "tenure_months must be integer"
Model load failure 500 Internal server error "Model file not found"
Prediction failure 500 Error with details "Prediction failed: feature mismatch"
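One way to produce these 400 responses is a small validation helper that runs before prediction. The field list here is truncated and the exact messages are illustrative:

```python
# Required fields and their expected JSON types (truncated; extend with
# the full 20-feature schema from the dataset section)
REQUIRED_FIELDS = {
    "tenure_months": int,
    "monthly_charges": float,
    "contract": str,
    "internet_service": str,
}


def validate_payload(payload):
    """Return a list of error messages; an empty list means the payload is valid."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append(f"Missing required field: {field}")
            continue
        value = payload[field]
        if expected in (int, float):
            # Accept ints where floats are expected, but reject booleans
            ok = isinstance(value, (int, float)) and not isinstance(value, bool)
        else:
            ok = isinstance(value, str)
        if not ok:
            errors.append(f"{field} must be {expected.__name__}")
    return errors
```

In the `/predict` handler, a non-empty list would be returned as the body of a 400 response.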
Performance Optimization
  • Load model at startup: Loading takes 50-200ms. Do it once when API starts, not on every request.
  • Use pandas DataFrame: Pipeline expects DataFrame input with correct column names and order.
  • Return probabilities: Include predict_proba() result so caller can set custom thresholds (e.g., 0.6 instead of 0.5).
  • Add caching: For batch predictions, consider caching results with TTL to reduce compute.
Production Considerations
  • Never use debug=True in production: Exposes sensitive information and allows arbitrary code execution
  • Add request logging: Log all predictions with timestamps for audit trails
  • Set timeouts: Predictions should complete in <200ms; timeout after 5 seconds
  • Use WSGI server: For production, use Gunicorn or uWSGI instead of Flask's built-in server
  • Add authentication: Protect endpoint with API keys or OAuth tokens
Testing Your API

Create a test script that validates all endpoints:

  1. Health check test: GET /health returns 200 with 'healthy' status
  2. Valid prediction test: POST /predict with complete data returns prediction
  3. Missing field test: POST /predict without 'contract' returns 400 error
  4. Invalid type test: POST /predict with tenure_months='abc' returns 400
  5. Probability range test: Verify churn_probability is between 0.0 and 1.0
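These checks can be automated with Flask's built-in test client, which exercises the app without starting a server. The stub app below stands in for your real `api/app.py` (in the project you would instead import your app, e.g. via a hypothetical `from api.app import app`):

```python
from flask import Flask, jsonify, request

# Stub app with hard-coded responses, used only to demonstrate the test pattern
app = Flask(__name__)


@app.route("/health")
def health():
    return jsonify({"status": "healthy"})


@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json(silent=True) or {}
    if "contract" not in data:
        return jsonify({"error": "Missing required field: contract"}), 400
    return jsonify({"prediction": "Churn", "churn_probability": 0.78})


client = app.test_client()  # exercises routes without a running server

# 1. Health check returns 200 with 'healthy'
resp = client.get("/health")
assert resp.status_code == 200 and resp.get_json()["status"] == "healthy"

# 2. Valid payload returns a prediction with a probability in [0, 1]
resp = client.post("/predict", json={"contract": "Month-to-month",
                                     "tenure_months": 3})
assert resp.status_code == 200
assert 0.0 <= resp.get_json()["churn_probability"] <= 1.0

# 3. Missing required field returns 400
resp = client.post("/predict", json={"tenure_months": 3})
assert resp.status_code == 400
```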
05

Pipeline Specifications

Your pipeline must follow scikit-learn best practices and be production-ready. This section provides detailed specifications for each component.

What Makes a Pipeline "Production-Ready"?
  • Encapsulation: All preprocessing inside the pipeline - zero manual steps
  • Reproducibility: Same inputs always produce same outputs (set random_state)
  • Serialization: Can be saved, loaded, and used in different environments
  • Error handling: Gracefully handles missing values, unknown categories, edge cases
  • Documentation: Clear naming, comments, and README explaining usage
Preprocessing Requirements
  • Numeric: StandardScaler or MinMaxScaler
  • Categorical: OneHotEncoder with handle_unknown
  • Imputation: Median for numeric, constant for categorical
  • ColumnTransformer: Combine all transformers
  • Feature Names: Use get_feature_names_out()
Model Requirements
  • Algorithm: RandomForest, LogisticRegression, or XGBoost
  • Validation: 5-fold stratified cross-validation
  • Metrics: Accuracy, Precision, Recall, F1, AUC
  • Comparison: Compare at least 2 models
  • Selection: Choose best based on business needs
Serialization Requirements
  • Format: joblib (preferred) or pickle
  • Scope: Save complete pipeline, not just model
  • Metadata: Version, date, metrics, features
  • Verification: Load and test predictions
  • Location: models/ directory
API Requirements
  • Framework: Flask (or FastAPI)
  • Endpoints: /health, /predict
  • Input: JSON with customer features
  • Output: Prediction + probability
  • Errors: Proper error handling with messages
Best Practices
  • Include all preprocessing in the pipeline
  • Use ColumnTransformer for mixed data types
  • Version your models with timestamps
  • Add input validation in the API
  • Load model once at startup, not per request
Common Mistakes
  • Preprocessing outside the pipeline
  • Fitting scaler on test data (data leakage)
  • Hardcoding feature names
  • Missing error handling in API
  • Exposing debug mode in production
06

Required Visualizations

Create the following visualizations to understand your data and communicate model performance.

1. Distribution
Target Variable Distribution

Pie or bar chart showing churn vs. no churn

2. Confusion Matrix
Model Predictions

Heatmap of predicted vs. actual values

3. ROC Curve
Classifier Performance

ROC curve with AUC score

4. Feature Importance
Top Features

Bar chart of most important features

5. Model Comparison
Algorithm Performance

Compare metrics across models

Bonus
Precision-Recall Curve

For imbalanced classification

Visualization Guidelines
Confusion Matrix Best Practices
  • Heatmap format: Use Seaborn heatmap with annotations showing actual counts
  • Color scheme: Green gradient for clarity, darker = higher count
  • Labels: Clearly label axes as 'Actual' and 'Predicted'
  • Diagonal focus: High numbers on diagonal (TP, TN) = good performance
  • Business context: Annotate with cost implications of FP vs FN
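A sketch of such a heatmap, using the confusion-matrix counts from the evaluation section (the output path follows the required project structure):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from pathlib import Path

# Counts from the worked example (rows = actual, columns = predicted)
cm = np.array([[48, 12],   # actual No Churn: 48 TN, 12 FP
               [9, 31]])   # actual Churn:     9 FN, 31 TP

sns.heatmap(cm, annot=True, fmt="d", cmap="Greens",
            xticklabels=["No Churn", "Churn"],
            yticklabels=["No Churn", "Churn"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Churn Confusion Matrix")

Path("visualizations").mkdir(exist_ok=True)
plt.savefig("visualizations/confusion_matrix.png", dpi=150, bbox_inches="tight")
```

In the project, `cm` would come from `sklearn.metrics.confusion_matrix(y_test, y_pred)` rather than hard-coded counts.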
ROC Curve Interpretation
  • Diagonal line: Random guessing baseline (AUC = 0.5)
  • Perfect model: Top-left corner (AUC = 1.0)
  • Your goal: Curve bowed toward top-left (AUC > 0.80)
  • AUC meaning: Probability model ranks random churner higher than random non-churner
  • Display: Include AUC score in title for quick reference
Feature Importance Visualization
  • Chart type: Horizontal bar chart for easy feature name reading
  • Top N features: Show top 10-15 most important features only
  • Sort order: Highest importance at top
  • Insight extraction: Look for contract type, tenure, charges at top
  • Business value: Identifies which customer attributes drive churn risk
Model Comparison Chart
  • Chart type: Grouped bar chart comparing multiple metrics
  • Metrics shown: Accuracy, Recall, Precision, F1, AUC side-by-side
  • Models: Show RandomForest vs LogisticRegression performance
  • Decision making: Quickly identify which model wins on key metrics
  • Trade-offs: Visualize speed vs accuracy trade-offs
Visualization Library Choices

Seaborn: Use for confusion matrix heatmap, distribution plots. Better statistical defaults than Matplotlib.

Matplotlib: Use for ROC curves, feature importance bars, model comparisons. More control over customization.

Plotly: Bonus - create interactive confusion matrix or ROC curve that users can hover over for details.

07

Submission Requirements

Create a public GitHub repository with the exact name shown below:

Required Repository Name
ml-pipeline-project
github.com/<your-username>/ml-pipeline-project
Required Project Structure
ml-pipeline-project/
├── data/
│   └── telecom_churn.csv         # The dataset
├── notebooks/
│   ├── 01_data_exploration.ipynb # EDA notebook
│   └── 02_pipeline_training.ipynb # Pipeline development
├── src/
│   ├── __init__.py
│   ├── pipeline.py               # Pipeline code
│   └── preprocessing.py          # Preprocessing utilities
├── api/
│   ├── app.py                    # Flask application
│   └── test_api.py               # API test script
├── models/
│   ├── churn_model.joblib        # Trained pipeline
│   └── model_metadata.json       # Model metadata
├── visualizations/
│   ├── confusion_matrix.png
│   └── roc_curve.png
├── requirements.txt
└── README.md
README.md Must Include:
  • Your full name and submission date
  • Project overview and business context
  • Model performance metrics
  • API documentation with example requests
  • Instructions to run the project
  • Screenshots of visualizations
requirements.txt
pandas>=2.0.0
numpy>=1.24.0
scikit-learn>=1.3.0
matplotlib>=3.7.0
seaborn>=0.12.0
flask>=3.0.0
joblib>=1.3.0
jupyter>=1.0.0
Important: Test your Flask API locally before submitting. Ensure the model loads correctly and returns predictions.
Submit Your Project

Enter your GitHub username - we will verify your repository automatically

08

Grading Rubric

Your project will be graded on the following criteria. Total: 550 points.

Criteria Points Description
Data Preprocessing 100 Proper ColumnTransformer with numeric and categorical transformers
Pipeline Development 120 Complete end-to-end Pipeline, cross-validation, model comparison
Model Serialization 80 Correct joblib usage, metadata file, verification
API Development 130 Working Flask app with endpoints, error handling, testing
Visualizations 60 Required visualizations with professional formatting
Documentation 60 Comprehensive README, code comments, docstrings
Total 550

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.

09

Pre-Submission Checklist

Pipeline Requirements
API and Repository