Capstone Project 6

End-to-End ML Pipeline

Build a complete, production-ready machine learning pipeline that takes raw data through preprocessing, feature engineering, model training, and deployment. Master scikit-learn Pipelines, model serialization, and Flask API development.

12-16 hours
Advanced
550 Points
What You Will Build
  • Preprocessing pipeline
  • ColumnTransformer workflow
  • Model training pipeline
  • Model serialization (joblib)
  • Flask prediction API
01

Project Overview

This advanced capstone project brings together everything you have learned in the Data Science course. You will work with a telecommunications customer churn dataset containing 120 customers with comprehensive service and billing information. Your goal is to build a production-ready ML pipeline that can be deployed and used for real-time predictions.

Skills Applied: This project tests your proficiency in scikit-learn (Pipelines, ColumnTransformer), model evaluation, serialization (joblib/pickle), and basic API development with Flask.
Learning Objectives
Pipeline Engineering

Master scikit-learn's Pipeline API to create modular, reproducible ML workflows that eliminate manual preprocessing steps and prevent data leakage.

  • Build ColumnTransformer for mixed data types
  • Chain preprocessing and modeling in single Pipeline
  • Use feature names propagation (get_feature_names_out)
  • Implement proper train-test isolation
Model Persistence

Learn production-grade model serialization techniques for saving, versioning, and loading trained models with full preprocessing pipelines intact.

  • Serialize complete pipelines with joblib
  • Create model metadata and versioning
  • Validate deserialized models work correctly
  • Handle model updates and rollbacks
API Development

Build RESTful APIs with Flask to serve ML predictions in real-time, implementing proper error handling and response formatting for production use.

  • Create Flask endpoints for predictions
  • Implement health checks and monitoring
  • Handle JSON input validation
  • Return predictions with confidence scores
Production Readiness

Develop ML systems following software engineering best practices: modularity, reproducibility, testability, and deployability for real-world applications.

  • Structure code into reusable modules
  • Write clear documentation and README
  • Create reproducible environments (requirements.txt)
  • Test API endpoints before deployment
Preprocessing

Build modular preprocessing with ColumnTransformer

Pipeline

Create end-to-end scikit-learn Pipeline

Serialization

Save and load models with joblib

Deployment

Build Flask API for predictions

02

Business Scenario

TeleConnect Communications

You have been hired as a Machine Learning Engineer at TeleConnect Communications, a major telecommunications provider with 2.5 million subscribers. The company is experiencing significant customer churn and losing $8.2 million in revenue each month. With an average customer lifetime value of $2,400 and acquisition costs of $450 per customer, reducing churn by just 5% would save $2.5 million annually.

"We need a system that our customer retention team can query in real-time when speaking with at-risk customers. The solution must be deployable and retrainable monthly with new data. Can you build us an ML pipeline with an API endpoint that returns churn probability within 200ms?"

Michael Rodriguez, VP of Customer Success

38%

Monthly churn rate (industry avg: 22%)

15 mins

Manual churn assessment per customer

$450

Cost to acquire new customer

Business Challenges
High Churn Rate

38% monthly churn vs 22% industry average. Month-to-month contracts have 58% churn vs 12% for 2-year contracts. Need to identify at-risk customers before they cancel.

Manual Process

Retention team manually reviews accounts (15 mins each). With 12,000+ at-risk customers monthly, only 8% can be contacted. Automated prediction needed.

Dynamic Patterns

Customer behavior changes quarterly (new competitors, pricing, services). Model must be retrained monthly with fresh data. Pipeline ensures reproducibility.

Technical Requirements

Pipeline
  • Handle mixed feature types (numeric + categorical)
  • Include all preprocessing in the pipeline
  • Enable easy retraining with new data
Model
  • Train a classification model (churn/no churn)
  • Evaluate with cross-validation
  • Compare at least two algorithms
Serialization
  • Save complete pipeline (not just model)
  • Include model metadata and versioning
  • Verify model loads and predicts correctly
API
  • Build Flask app with prediction endpoint
  • Add health check and error handling
  • Return prediction with probability
Pro Tip: Think like an ML engineer! Your pipeline should be modular, reproducible, and production-ready.
03

The Dataset

You will work with a telecommunications customer churn dataset. Download the CSV file and place it in your project's data/ folder:

Dataset Schema
Column Type Description
customer_id String Unique customer identifier
gender String Customer gender (Male/Female)
senior_citizen Integer Senior citizen status (0/1)
partner String Has partner (Yes/No)
dependents String Has dependents (Yes/No)
tenure_months Integer Months with company
phone_service String Phone service subscription
multiple_lines String Multiple phone lines
internet_service String Internet type (DSL/Fiber optic/No)
online_security String Online security add-on
online_backup String Online backup add-on
device_protection String Device protection plan
tech_support String Tech support add-on
streaming_tv String Streaming TV subscription
streaming_movies String Streaming movies subscription
contract String Contract type (Month-to-month/One year/Two year)
paperless_billing String Paperless billing enabled
payment_method String Payment method type
monthly_charges Float Monthly billing amount ($)
total_charges Float Total amount billed ($)
churn String Target: Customer churned (Yes/No)
Dataset Stats: 120 customers, 20 features, ~40% churn rate, mix of numeric and categorical variables
04

Project Requirements

Your project must include all of the following components. Structure your work with clear separation between notebooks, source code, and API.

1
Data Exploration Notebook

Explore the dataset to understand feature distributions, identify patterns related to churn, and uncover data quality issues before building your pipeline.

Key Exploration Tasks
Dataset Overview
  • Shape: 120 customers, 21 columns (20 features + 1 target)
  • Churn distribution: Calculate percentage of churned vs retained customers
  • Data types: Identify numeric (3) vs categorical (17) features
  • Missing values: Check for nulls in each column
Pattern Discovery
  • Tenure analysis: Do long-term customers churn less?
  • Contract type: Compare churn rates across contract types
  • Service patterns: Which services correlate with retention?
  • Pricing impact: Is churn higher for expensive plans?
Expected Findings

Your exploration should reveal key churn indicators:

  • Month-to-month contracts likely have 2-3x higher churn than long-term contracts
  • Tenure matters: Customers with <3 months tenure are highest risk (50%+ churn rate)
  • Service bundles: Customers with multiple services (internet + TV + phone) churn less
  • Fiber optic users: May have higher churn despite faster service (check pricing)
2
Preprocessing Pipeline

Build a ColumnTransformer that handles numeric and categorical features differently. This is the cornerstone of production ML - all preprocessing must happen inside the pipeline to prevent data leakage.

Pipeline Architecture
Numeric Features Pipeline

Features: tenure_months, monthly_charges, total_charges

Step 1 - Imputation: Fill missing values with median (robust to outliers)

Step 2 - Scaling: Apply StandardScaler to normalize ranges (mean=0, std=1)

Why: Linear models and distance-based algorithms need scaled features

Categorical Features Pipeline

Features: gender, partner, contract, internet_service, payment_method, etc. (17 total)

Step 1 - Imputation: Fill missing with 'Unknown' constant

Step 2 - Encoding: OneHotEncoder creates binary columns for each category

Why: ML models need numeric inputs; one-hot encoding preserves category independence

Critical: Data Leakage Prevention

Wrong approach: Scaling features before train-test split. The scaler learns statistics from test data, artificially boosting performance.

Right approach: Fit preprocessor only on training data. The pipeline ensures test data is transformed using training statistics.

Example: If training set monthly_charges has mean=$70, test data is scaled using $70, not test data's actual mean.

ColumnTransformer Workflow

ColumnTransformer applies different transformations to different column subsets in parallel:

  1. Define numeric transformer: Chain SimpleImputer (median) → StandardScaler
  2. Define categorical transformer: Chain SimpleImputer ('Unknown') → OneHotEncoder (handle_unknown='ignore')
  3. Combine in ColumnTransformer: Specify which columns get which transformer
  4. Result: Single preprocessor object that handles all feature types correctly
Feature Names Output

After OneHotEncoding, contract becomes contract_Month-to-month, contract_One year, contract_Two year. Use get_feature_names_out() to retrieve transformed feature names for interpretation and debugging.

3
Complete ML Pipeline

Combine preprocessing and model training into a single Pipeline object. This is the magic of scikit-learn - one object handles everything from raw data to predictions.

Pipeline Composition
Pipeline Structure:
┌─────────────────────────────────────┐
│ Step 1: ColumnTransformer           │
│   ├─ Numeric: Impute → Scale        │
│   └─ Categorical: Impute → Encode   │
├─────────────────────────────────────┤
│ Step 2: RandomForestClassifier      │
│   Fits on transformed features      │
└─────────────────────────────────────┘
Pipeline Benefits
  • No leakage: Preprocessing fitted only on training data
  • One-line training: pipeline.fit(X_train, y_train) does everything
  • One-line prediction: pipeline.predict(X_test) preprocesses automatically
  • Serialization: Save entire workflow with joblib.dump(pipeline)
  • Reproducibility: Same transformations applied in training and production
Classifier Selection
  • RandomForest: Handles non-linear patterns, robust to outliers, provides feature importance
  • LogisticRegression: Fast, interpretable coefficients, good baseline
  • GradientBoosting: Often highest accuracy, requires tuning
  • Recommendation: Start with RandomForest (n_estimators=100, random_state=42)
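Chaining the preprocessor and the recommended classifier might look like the sketch below. The DataFrame here is a tiny illustrative stand-in for the churn dataset, with only a few of its columns:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["tenure_months", "monthly_charges"]
categorical = ["contract"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="constant",
                                               fill_value="Unknown")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]),
     categorical),
])

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Toy data standing in for the churn dataset (values are illustrative)
X = pd.DataFrame({
    "tenure_months": [1, 24, 3, 48, 2, 60],
    "monthly_charges": [89.5, 45.0, 95.0, 30.0, 99.0, 25.0],
    "contract": ["Month-to-month", "Two year", "Month-to-month",
                 "Two year", "Month-to-month", "One year"],
})
y = ["Yes", "No", "Yes", "No", "Yes", "No"]

pipeline.fit(X, y)           # imputes, scales, encodes, then trains
preds = pipeline.predict(X)  # the same preprocessing replays automatically
```

One object now carries the whole workflow, which is exactly what gets serialized and deployed later.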
The Power of Pipelines

Without pipeline: You must manually impute, scale, encode, then train. In production, you must remember and replicate every step exactly. Any mistake causes train-serve skew.

With pipeline: Training: pipeline.fit(X_train, y_train). Production: pipeline.predict(new_data). All preprocessing happens automatically in correct order. Zero train-serve skew.

4
Model Evaluation

Rigorously evaluate your churn prediction model using classification metrics appropriate for imbalanced data and business context.

Evaluation Strategy
Train-Test Split

Method: Stratified split (80/20) - maintains churn ratio in both sets

Why stratified: With 40% churn rate, random split might give 45% in train and 35% in test - stratified ensures both have ~40%

Random state: Set random_state=42 for reproducibility

Cross-Validation

Method: 5-fold stratified CV on training data only

Purpose: Get robust performance estimate, detect overfitting

Interpretation: CV accuracy of 0.82 ± 0.04 means model consistently performs well (0.78-0.86 range)
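The split-plus-CV strategy can be sketched on synthetic stand-in data (in the real project, `X` and `y` come from the churn DataFrame):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for preprocessed churn features (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 5))
y = rng.choice(["Yes", "No"], size=120, p=[0.4, 0.6])

# Stratified 80/20 split preserves the ~40% churn ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 5-fold stratified CV on the training data only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train, y_train, cv=cv)
print(f"CV accuracy: {scores.mean():.2f} ± {scores.std():.2f}")
```

Note that the test set never touches cross-validation; it is held out for the final evaluation.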

Key Metrics for Churn Prediction
Metric Formula Business Meaning Target
Recall TP / (TP + FN) % of churners correctly identified - critical for retention budget >75%
Precision TP / (TP + FP) % of predictions that are actual churners - avoids wasting retention offers >65%
F1 Score 2 * (P * R) / (P + R) Balance between precision and recall - overall model quality >70%
AUC-ROC Area under ROC curve Ranking ability - can model distinguish churners from non-churners? >0.80
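Each of these metrics is a single call in `sklearn.metrics`. As a quick check, the confusion-matrix counts from the worked example later in this section (31 TP, 9 FN, 12 FP, 48 TN) reproduce the formulas above:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Rebuild label arrays from the counts (1 = churn, 0 = no churn)
y_true = [1] * 31 + [1] * 9 + [0] * 12 + [0] * 48
y_pred = [1] * 31 + [0] * 9 + [1] * 12 + [0] * 48

print(f"Recall:    {recall_score(y_true, y_pred):.3f}")     # 31/40 = 0.775
print(f"Precision: {precision_score(y_true, y_pred):.3f}")  # 31/43 = 0.721
print(f"F1:        {f1_score(y_true, y_pred):.3f}")         # 62/83 = 0.747
```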
Business Trade-off: Recall vs Precision

High Recall (80%): Catch most churners but send retention offers to non-churners too. Cost: $50 offer × false positives.

High Precision (85%): Offers only go to likely churners, but miss some at-risk customers. Cost: $2,400 LTV × missed churners.

Recommendation: Prioritize recall (minimize FN) - missing a churner costs 48x more than a wasted offer ($2,400 vs $50).

Model Comparison Framework

Train at least 2 algorithms and compare using a metrics table:

Model CV Accuracy Test Recall Test Precision F1 Score AUC-ROC Train Time
RandomForest 0.82 ± 0.03 0.78 0.71 0.74 0.85 2.1s
LogisticRegression 0.79 ± 0.04 0.75 0.68 0.71 0.82 0.3s

Selection: RandomForest wins - higher recall (catches more churners), better AUC, acceptable training time for monthly retraining.

Confusion Matrix Interpretation

For 100 test customers with 40% churn rate:

  • True Positives (31): Correctly predicted churners - send retention offer, save $2,400 each = $74,400 saved
  • False Negatives (9): Missed churners - lost customers, cost $2,400 each = $21,600 lost revenue
  • False Positives (12): Predicted churn but didn't - wasted offers, cost $50 each = $600 wasted
  • True Negatives (48): Correctly predicted retention - no action needed
5
Model Serialization

Save the trained pipeline to disk using joblib for deployment. This captures the entire workflow - preprocessing + model - in a single file.

Why Model Persistence Matters
Training Once

Training takes 2-5 minutes with 120 customers. In production, you can't retrain for every prediction. Save once, load instantly (50ms).

Deployment

API servers load the saved model at startup. Thousands of predictions use the same trained pipeline without retraining.

Versioning

Save models with timestamps (churn_model_2026_01.joblib). Rollback to previous version if new model underperforms.

Joblib vs Pickle
Format Speed Compression Best For Recommendation
joblib Fast for NumPy arrays Built-in compression ML models with large arrays Use This
pickle Slower for arrays Manual compression General Python objects Alternative
Model Metadata Best Practices

Create a JSON file alongside the model with critical information:

{
  "model_version": "1.0",
  "train_date": "2026-01-02",
  "algorithm": "RandomForestClassifier",
  "accuracy": 0.82,
  "recall": 0.78,
  "precision": 0.71,
  "auc_roc": 0.85,
  "training_samples": 96,
  "test_samples": 24,
  "features": ["tenure_months", "monthly_charges", ...],
  "hyperparameters": {
    "n_estimators": 100,
    "random_state": 42
  }
}
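Saving the pipeline and its metadata sidecar might look like the following sketch. The dummy pipeline and the trimmed-down metadata stand in for your fitted churn pipeline and the full JSON above:

```python
import json
from pathlib import Path

import joblib
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in fitted pipeline; in the project this is the full
# preprocessing + RandomForest pipeline
pipeline = Pipeline([("scale", StandardScaler()),
                     ("model", DummyClassifier(strategy="most_frequent"))])
pipeline.fit([[1.0], [2.0], [3.0]], ["No", "No", "Yes"])

metadata = {"model_version": "1.0",
            "algorithm": "DummyClassifier",
            "hyperparameters": {"strategy": "most_frequent"}}

Path("models").mkdir(exist_ok=True)
joblib.dump(pipeline, "models/churn_model.joblib")  # entire workflow, one file
with open("models/model_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)                # human-readable sidecar
```

Because the dump captures the whole pipeline, the API can later load one file and accept raw feature values directly.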
Verification Checklist

After saving, verify your model works correctly:

  1. Load test: Load the saved pipeline in a fresh Python session
  2. Prediction test: Make predictions on sample data, verify output format
  3. Probability test: Ensure predict_proba() returns values between 0-1
  4. Feature test: Confirm model accepts same feature columns as training data
  5. Performance test: Verify predictions match pre-save performance
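Checks 1-3 are easy to script. This sketch saves and reloads a stand-in pipeline just to show the pattern; in the project, you would load your actual `models/churn_model.joblib` in a fresh session:

```python
import joblib
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the real saved pipeline
pipeline = Pipeline([("scale", StandardScaler()),
                     ("model", DummyClassifier(strategy="most_frequent"))])
pipeline.fit([[1.0], [2.0], [3.0]], ["No", "No", "Yes"])
joblib.dump(pipeline, "churn_model_check.joblib")

# 1. Load test: reload as a fresh session would
loaded = joblib.load("churn_model_check.joblib")

# 2. Prediction test: output format matches the training labels
sample = [[2.5]]
assert loaded.predict(sample)[0] in ("Yes", "No")

# 3. Probability test: rows of predict_proba lie in [0, 1] and sum to 1
proba = loaded.predict_proba(sample)
assert np.all((proba >= 0) & (proba <= 1))
assert np.allclose(proba.sum(axis=1), 1.0)
```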
Security Warning

Pickle and joblib files can execute arbitrary code when loaded. Never load models from untrusted sources. In production, store models in secure locations with access controls and integrity checks (checksums).

6
Flask API for Real-Time Predictions

Build a RESTful API using Flask that loads your saved model and serves predictions via HTTP endpoints. This transforms your ML model into a production service.

API Architecture
Startup:
1. Import Flask, joblib, pandas
2. Create Flask app instance
3. Load saved model ONCE (at startup, not per request)
Request Flow:
Client → POST /predict → JSON payload → DataFrame → Pipeline → Prediction → JSON response
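A minimal Flask skeleton following this architecture is sketched below. Paths and the version string are illustrative, and `load_model` is a helper introduced here so the model loads exactly once at startup:

```python
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = None  # populated once by load_model() at startup, never per request


def load_model(path="models/churn_model.joblib"):
    """Load the serialized pipeline a single time, before serving requests."""
    global model
    model = joblib.load(path)


@app.route("/health")
def health():
    return jsonify({"status": "healthy", "model_version": "1.0"})


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True)
    if payload is None:
        return jsonify({"error": "Request body must be JSON"}), 400
    try:
        X = pd.DataFrame([payload])  # one-row frame; columns must match training
        churn_idx = list(model.classes_).index("Yes")
        proba = float(model.predict_proba(X)[0, churn_idx])
    except Exception as exc:
        return jsonify({"error": f"Prediction failed: {exc}"}), 400
    return jsonify({
        "prediction": "Churn" if proba >= 0.5 else "No Churn",
        "churn_probability": round(proba, 3),
    })


if __name__ == "__main__":
    load_model()
    app.run(port=5000)  # in production use Gunicorn/uWSGI, never debug=True
```

The JSON payload becomes a one-row DataFrame because the pipeline was trained on DataFrames with named columns.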
Required Endpoints
GET /health

Purpose: Health check for monitoring systems (Kubernetes, AWS ELB)

Response: JSON with status: 'healthy', model version, uptime

Use case: Load balancer pings every 30s to verify API is running

POST /predict

Purpose: Accept customer data, return churn prediction + probability

Input: JSON object with all 20 feature fields

Output: prediction ('Churn' or 'No Churn'), churn_probability (0.0-1.0)

Example API Interaction
Request Example

Method: POST

URL: http://localhost:5000/predict

Headers: Content-Type: application/json

Body:

{
  "tenure_months": 3,
  "monthly_charges": 89.50,
  "contract": "Month-to-month",
  "internet_service": "Fiber optic",
  ...
}
Response Example

Status: 200 OK

Body:

{
  "prediction": "Churn",
  "churn_probability": 0.78,
  "confidence": "High",
  "recommendation": "Contact customer"
}

Interpretation: 78% chance of churn - high priority for retention team

Error Handling Requirements
Error Type HTTP Code Response Example
Missing features 400 Error message listing missing fields "Missing required field: contract"
Invalid data type 400 Error describing type mismatch "tenure_months must be integer"
Model load failure 500 Internal server error "Model file not found"
Prediction failure 500 Error with details "Prediction failed: feature mismatch"
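One way to produce these 400 responses is a small validation helper that runs before prediction. The field list here is truncated and the exact messages are illustrative:

```python
# Required fields and their expected JSON types (truncated; extend with
# the full 20-feature schema from the dataset section)
REQUIRED_FIELDS = {
    "tenure_months": int,
    "monthly_charges": float,
    "contract": str,
    "internet_service": str,
}


def validate_payload(payload):
    """Return a list of error messages; an empty list means the payload is valid."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append(f"Missing required field: {field}")
            continue
        value = payload[field]
        if expected in (int, float):
            # Accept ints where floats are expected, but reject booleans
            ok = isinstance(value, (int, float)) and not isinstance(value, bool)
        else:
            ok = isinstance(value, str)
        if not ok:
            errors.append(f"{field} must be {expected.__name__}")
    return errors
```

In the `/predict` handler, a non-empty list would be returned as the body of a 400 response.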
Performance Optimization
  • Load model at startup: Loading takes 50-200ms. Do it once when API starts, not on every request.
  • Use pandas DataFrame: Pipeline expects DataFrame input with correct column names and order.
  • Return probabilities: Include predict_proba() result so caller can set custom thresholds (e.g., 0.6 instead of 0.5).
  • Add caching: For batch predictions, consider caching results with TTL to reduce compute.
Production Considerations
  • Never use debug=True in production: Exposes sensitive information and allows arbitrary code execution
  • Add request logging: Log all predictions with timestamps for audit trails
  • Set timeouts: Predictions should complete in <200ms; timeout after 5 seconds
  • Use WSGI server: For production, use Gunicorn or uWSGI instead of Flask's built-in server
  • Add authentication: Protect endpoint with API keys or OAuth tokens
Testing Your API

Create a test script that validates all endpoints:

  1. Health check test: GET /health returns 200 with 'healthy' status
  2. Valid prediction test: POST /predict with complete data returns prediction
  3. Missing field test: POST /predict without 'contract' returns 400 error
  4. Invalid type test: POST /predict with tenure_months='abc' returns 400
  5. Probability range test: Verify churn_probability is between 0.0 and 1.0
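These checks can be automated with Flask's built-in test client, which exercises the app without starting a server. The stub app below stands in for your real `api/app.py` (in the project you would instead import your app, e.g. via a hypothetical `from api.app import app`):

```python
from flask import Flask, jsonify, request

# Stub app with hard-coded responses, used only to demonstrate the test pattern
app = Flask(__name__)


@app.route("/health")
def health():
    return jsonify({"status": "healthy"})


@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json(silent=True) or {}
    if "contract" not in data:
        return jsonify({"error": "Missing required field: contract"}), 400
    return jsonify({"prediction": "Churn", "churn_probability": 0.78})


client = app.test_client()  # exercises routes without a running server

# 1. Health check returns 200 with 'healthy'
resp = client.get("/health")
assert resp.status_code == 200 and resp.get_json()["status"] == "healthy"

# 2. Valid payload returns a prediction with a probability in [0, 1]
resp = client.post("/predict", json={"contract": "Month-to-month",
                                     "tenure_months": 3})
assert resp.status_code == 200
assert 0.0 <= resp.get_json()["churn_probability"] <= 1.0

# 3. Missing required field returns 400
resp = client.post("/predict", json={"tenure_months": 3})
assert resp.status_code == 400
```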
05

Pipeline Specifications

Your pipeline must follow scikit-learn best practices and be production-ready. This section provides detailed specifications for each component.

What Makes a Pipeline "Production-Ready"?
  • Encapsulation: All preprocessing inside the pipeline - zero manual steps
  • Reproducibility: Same inputs always produce same outputs (set random_state)
  • Serialization: Can be saved, loaded, and used in different environments
  • Error handling: Gracefully handles missing values, unknown categories, edge cases
  • Documentation: Clear naming, comments, and README explaining usage
Preprocessing Requirements
  • Numeric: StandardScaler or MinMaxScaler
  • Categorical: OneHotEncoder with handle_unknown
  • Imputation: Median for numeric, constant for categorical
  • ColumnTransformer: Combine all transformers
  • Feature Names: Use get_feature_names_out()
Model Requirements
  • Algorithm: RandomForest, LogisticRegression, or XGBoost
  • Validation: 5-fold stratified cross-validation
  • Metrics: Accuracy, Precision, Recall, F1, AUC
  • Comparison: Compare at least 2 models
  • Selection: Choose best based on business needs
Serialization Requirements
  • Format: joblib (preferred) or pickle
  • Scope: Save complete pipeline, not just model
  • Metadata: Version, date, metrics, features
  • Verification: Load and test predictions
  • Location: models/ directory
API Requirements
  • Framework: Flask (or FastAPI)
  • Endpoints: /health, /predict
  • Input: JSON with customer features
  • Output: Prediction + probability
  • Errors: Proper error handling with messages
Best Practices
  • Include all preprocessing in the pipeline
  • Use ColumnTransformer for mixed data types
  • Version your models with timestamps
  • Add input validation in the API
  • Load model once at startup, not per request
Common Mistakes
  • Preprocessing outside the pipeline
  • Fitting scaler on test data (data leakage)
  • Hardcoding feature names
  • Missing error handling in API
  • Exposing debug mode in production
06

Required Visualizations

Create the following visualizations to understand your data and communicate model performance.

1. Distribution
Target Variable Distribution

Pie or bar chart showing churn vs. no churn

2. Confusion Matrix
Model Predictions

Heatmap of predicted vs. actual values

3. ROC Curve
Classifier Performance

ROC curve with AUC score

4. Feature Importance
Top Features

Bar chart of most important features

5. Model Comparison
Algorithm Performance

Compare metrics across models

Bonus
Precision-Recall Curve

For imbalanced classification

Visualization Guidelines
Confusion Matrix Best Practices
  • Heatmap format: Use Seaborn heatmap with annotations showing actual counts
  • Color scheme: Green gradient for clarity, darker = higher count
  • Labels: Clearly label axes as 'Actual' and 'Predicted'
  • Diagonal focus: High numbers on diagonal (TP, TN) = good performance
  • Business context: Annotate with cost implications of FP vs FN
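A sketch of such a heatmap, using the confusion-matrix counts from the evaluation section (the output path follows the required project structure):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from pathlib import Path

# Counts from the worked example (rows = actual, columns = predicted)
cm = np.array([[48, 12],   # actual No Churn: 48 TN, 12 FP
               [9, 31]])   # actual Churn:     9 FN, 31 TP

sns.heatmap(cm, annot=True, fmt="d", cmap="Greens",
            xticklabels=["No Churn", "Churn"],
            yticklabels=["No Churn", "Churn"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Churn Confusion Matrix")

Path("visualizations").mkdir(exist_ok=True)
plt.savefig("visualizations/confusion_matrix.png", dpi=150, bbox_inches="tight")
```

In the project, `cm` would come from `sklearn.metrics.confusion_matrix(y_test, y_pred)` rather than hard-coded counts.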
ROC Curve Interpretation
  • Diagonal line: Random guessing baseline (AUC = 0.5)
  • Perfect model: Top-left corner (AUC = 1.0)
  • Your goal: Curve bowed toward top-left (AUC > 0.80)
  • AUC meaning: Probability model ranks random churner higher than random non-churner
  • Display: Include AUC score in title for quick reference
Feature Importance Visualization
  • Chart type: Horizontal bar chart for easy feature name reading
  • Top N features: Show top 10-15 most important features only
  • Sort order: Highest importance at top
  • Insight extraction: Look for contract type, tenure, charges at top
  • Business value: Identifies which customer attributes drive churn risk
Model Comparison Chart
  • Chart type: Grouped bar chart comparing multiple metrics
  • Metrics shown: Accuracy, Recall, Precision, F1, AUC side-by-side
  • Models: Show RandomForest vs LogisticRegression performance
  • Decision making: Quickly identify which model wins on key metrics
  • Trade-offs: Visualize speed vs accuracy trade-offs
Visualization Library Choices

Seaborn: Use for confusion matrix heatmap, distribution plots. Better statistical defaults than Matplotlib.

Matplotlib: Use for ROC curves, feature importance bars, model comparisons. More control over customization.

Plotly: Bonus - create interactive confusion matrix or ROC curve that users can hover over for details.

07

Submission Requirements

Create a public GitHub repository with the exact name shown below:

Required Repository Name
ml-pipeline-project
github.com/<your-username>/ml-pipeline-project
Required Project Structure
ml-pipeline-project/
├── data/
│   └── telecom_churn.csv         # The dataset
├── notebooks/
│   ├── 01_data_exploration.ipynb # EDA notebook
│   └── 02_pipeline_training.ipynb # Pipeline development
├── src/
│   ├── __init__.py
│   ├── pipeline.py               # Pipeline code
│   └── preprocessing.py          # Preprocessing utilities
├── api/
│   ├── app.py                    # Flask application
│   └── test_api.py               # API test script
├── models/
│   ├── churn_model.joblib        # Trained pipeline
│   └── model_metadata.json       # Model metadata
├── visualizations/
│   ├── confusion_matrix.png
│   └── roc_curve.png
├── requirements.txt
└── README.md
README.md Must Include:
  • Your full name and submission date
  • Project overview and business context
  • Model performance metrics
  • API documentation with example requests
  • Instructions to run the project
  • Screenshots of visualizations
requirements.txt
pandas>=2.0.0
numpy>=1.24.0
scikit-learn>=1.3.0
matplotlib>=3.7.0
seaborn>=0.12.0
flask>=3.0.0
joblib>=1.3.0
jupyter>=1.0.0
Important: Test your Flask API locally before submitting. Ensure the model loads correctly and returns predictions.
Submit Your Project

Enter your GitHub username - we will verify your repository automatically

08

Grading Rubric

Your project will be graded on the following criteria. Total: 550 points.

Criteria Points Description
Data Preprocessing 100 Proper ColumnTransformer with numeric and categorical transformers
Pipeline Development 120 Complete end-to-end Pipeline, cross-validation, model comparison
Model Serialization 80 Correct joblib usage, metadata file, verification
API Development 130 Working Flask app with endpoints, error handling, testing
Visualizations 60 Required visualizations with professional formatting
Documentation 60 Comprehensive README, code comments, docstrings
Total 550

Ready to Submit?

Make sure you have completed all requirements and reviewed the grading rubric above.

09

Pre-Submission Checklist

Pipeline Requirements
API and Repository