Project Overview
This advanced capstone project brings together everything you have learned in the Data Science course. You will work with a telecommunications customer churn dataset containing 120 customers with comprehensive service and billing information. Your goal is to build a production-ready ML pipeline that can be deployed and used for real-time predictions.
Learning Objectives
Pipeline Engineering
Master scikit-learn's Pipeline API to create modular, reproducible ML workflows that eliminate manual preprocessing steps and prevent data leakage.
- Build ColumnTransformer for mixed data types
- Chain preprocessing and modeling in single Pipeline
- Use feature names propagation (get_feature_names_out)
- Implement proper train-test isolation
Model Persistence
Learn production-grade model serialization techniques for saving, versioning, and loading trained models with full preprocessing pipelines intact.
- Serialize complete pipelines with joblib
- Create model metadata and versioning
- Validate deserialized models work correctly
- Handle model updates and rollbacks
API Development
Build RESTful APIs with Flask to serve ML predictions in real-time, implementing proper error handling and response formatting for production use.
- Create Flask endpoints for predictions
- Implement health checks and monitoring
- Handle JSON input validation
- Return predictions with confidence scores
Production Readiness
Develop ML systems following software engineering best practices: modularity, reproducibility, testability, and deployability for real-world applications.
- Structure code into reusable modules
- Write clear documentation and README
- Create reproducible environments (requirements.txt)
- Test API endpoints before deployment
- Preprocessing: Build modular preprocessing with ColumnTransformer
- Pipeline: Create an end-to-end scikit-learn Pipeline
- Serialization: Save and load models with joblib
- Deployment: Build a Flask API for predictions
Business Scenario
TeleConnect Communications
You have been hired as a Machine Learning Engineer at TeleConnect Communications, a major telecommunications provider with 2.5 million subscribers. The company is experiencing significant customer churn and losing $8.2 million in revenue each month. With an average customer lifetime value of $2,400 and acquisition costs of $450 per customer, reducing churn by just 5% would save $2.5 million annually.
"We need a system that our customer retention team can query in real-time when speaking with at-risk customers. The solution must be deployable and retrainable monthly with new data. Can you build us an ML pipeline with an API endpoint that returns churn probability within 200ms?"
- 38%: Monthly churn rate (industry avg: 22%)
- 15 mins: Manual churn assessment per customer
- $450: Cost to acquire a new customer
Business Challenges
High Churn Rate
38% monthly churn vs 22% industry average. Month-to-month contracts have 58% churn vs 12% for 2-year contracts. Need to identify at-risk customers before they cancel.
Manual Process
Retention team manually reviews accounts (15 mins each). With 12,000+ at-risk customers monthly, only 8% can be contacted. Automated prediction needed.
Dynamic Patterns
Customer behavior changes quarterly (new competitors, pricing, services). Model must be retrained monthly with fresh data. Pipeline ensures reproducibility.
Technical Requirements
- Handle mixed feature types (numeric + categorical)
- Include all preprocessing in the pipeline
- Enable easy retraining with new data
- Train a classification model (churn/no churn)
- Evaluate with cross-validation
- Compare at least two algorithms
- Save complete pipeline (not just model)
- Include model metadata and versioning
- Verify model loads and predicts correctly
- Build Flask app with prediction endpoint
- Add health check and error handling
- Return prediction with probability
The Dataset
You will work with a telecommunications customer churn dataset. Download the CSV file and place it in your project's data/ folder as telecom_churn.csv.
Dataset Schema
| Column | Type | Description |
|---|---|---|
| customer_id | String | Unique customer identifier |
| gender | String | Customer gender (Male/Female) |
| senior_citizen | Integer | Senior citizen status (0/1) |
| partner | String | Has partner (Yes/No) |
| dependents | String | Has dependents (Yes/No) |
| tenure_months | Integer | Months with company |
| phone_service | String | Phone service subscription |
| multiple_lines | String | Multiple phone lines |
| internet_service | String | Internet type (DSL/Fiber optic/No) |
| online_security | String | Online security add-on |
| online_backup | String | Online backup add-on |
| device_protection | String | Device protection plan |
| tech_support | String | Tech support add-on |
| streaming_tv | String | Streaming TV subscription |
| streaming_movies | String | Streaming movies subscription |
| contract | String | Contract type (Month-to-month/One year/Two year) |
| paperless_billing | String | Paperless billing enabled |
| payment_method | String | Payment method type |
| monthly_charges | Float | Monthly billing amount ($) |
| total_charges | Float | Total amount billed ($) |
| churn | String | Target: Customer churned (Yes/No) |
Project Requirements
Your project must include all of the following components. Structure your work with clear separation between notebooks, source code, and API.
Data Exploration Notebook
Explore the dataset to understand feature distributions, identify patterns related to churn, and uncover data quality issues before building your pipeline.
Key Exploration Tasks
Dataset Overview
- Shape: 120 customers, 21 columns (20 features + 1 target)
- Churn distribution: Calculate percentage of churned vs retained customers
- Data types: Identify numeric (3) vs categorical (17) features; note that senior_citizen is a 0/1 flag and customer_id is an identifier that should be dropped before modeling
- Missing values: Check for nulls in each column
Pattern Discovery
- Tenure analysis: Do long-term customers churn less?
- Contract type: Compare churn rates across contract types
- Service patterns: Which services correlate with retention?
- Pricing impact: Is churn higher for expensive plans?
Expected Findings
Your exploration should reveal key churn indicators:
- Month-to-month contracts likely have 2-3x higher churn than long-term contracts
- Tenure matters: Customers with <3 months tenure are highest risk (50%+ churn rate)
- Service bundles: Customers with multiple services (internet + TV + phone) churn less
- Fiber optic users: May have higher churn despite faster service (check pricing)
Preprocessing Pipeline
Build a ColumnTransformer that handles numeric and categorical features differently. This is the cornerstone of production ML - all preprocessing must happen inside the pipeline to prevent data leakage.
Pipeline Architecture
Numeric Features Pipeline
Features: tenure_months, monthly_charges, total_charges
Step 1 - Imputation: Fill missing values with median (robust to outliers)
Step 2 - Scaling: Apply StandardScaler to normalize ranges (mean=0, std=1)
Why: Linear models and distance-based algorithms need scaled features
Categorical Features Pipeline
Features: gender, partner, contract, internet_service, payment_method, etc. (17 total)
Step 1 - Imputation: Fill missing with 'Unknown' constant
Step 2 - Encoding: OneHotEncoder creates binary columns for each category
Why: ML models need numeric inputs; one-hot encoding preserves category independence
Critical: Data Leakage Prevention
Wrong approach: Scaling features before train-test split. The scaler learns statistics from test data, artificially boosting performance.
Right approach: Fit preprocessor only on training data. The pipeline ensures test data is transformed using training statistics.
Example: If training set monthly_charges has mean=$70, test data is scaled using $70, not test data's actual mean.
ColumnTransformer Workflow
ColumnTransformer applies different transformations to different column subsets in parallel:
- Define numeric transformer: Chain SimpleImputer (median) → StandardScaler
- Define categorical transformer: Chain SimpleImputer ('Unknown') → OneHotEncoder (handle_unknown='ignore')
- Combine in ColumnTransformer: Specify which columns get which transformer
- Result: Single preprocessor object that handles all feature types correctly
Feature Names Output
After OneHotEncoding, contract becomes contract_Month-to-month, contract_One year, contract_Two year. Use get_feature_names_out() to retrieve transformed feature names for interpretation and debugging.
Complete ML Pipeline
Combine preprocessing and model training into a single Pipeline object. This is the magic of scikit-learn - one object handles everything from raw data to predictions.
Pipeline Composition
Pipeline Benefits
- No leakage: Preprocessing fitted only on training data
- One-line training: pipeline.fit(X_train, y_train) does everything
- One-line prediction: pipeline.predict(X_test) preprocesses automatically
- Serialization: Save entire workflow with joblib.dump(pipeline)
- Reproducibility: Same transformations applied in training and production
Classifier Selection
- RandomForest: Handles non-linear patterns, robust to outliers, provides feature importance
- LogisticRegression: Fast, interpretable coefficients, good baseline
- GradientBoosting: Often highest accuracy, requires tuning
- Recommendation: Start with RandomForest (n_estimators=100, random_state=42)
The Power of Pipelines
Without pipeline: You must manually impute, scale, encode, then train. In production, you must remember and replicate every step exactly. Any mistake causes train-serve skew.
With pipeline: Training: pipeline.fit(X_train, y_train). Production: pipeline.predict(new_data). All preprocessing happens automatically in correct order. Zero train-serve skew.
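A sketch of the complete pipeline, fitted on randomly generated stand-in data (the real project uses the full feature list from the dataset; the two categorical columns here are just for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the 120-customer dataset
rng = np.random.default_rng(42)
n = 120
X = pd.DataFrame({
    "tenure_months": rng.integers(0, 72, n),
    "monthly_charges": rng.uniform(20, 120, n).round(2),
    "total_charges": rng.uniform(20, 8000, n).round(2),
    "contract": rng.choice(["Month-to-month", "One year", "Two year"], n),
    "internet_service": rng.choice(["DSL", "Fiber optic", "No"], n),
})
y = rng.choice(["No", "Yes"], n, p=[0.6, 0.4])

preprocessor = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]),
     ["tenure_months", "monthly_charges", "total_charges"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
                      ("encoder", OneHotEncoder(handle_unknown="ignore"))]),
     ["contract", "internet_service"]),
])

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])

pipeline.fit(X, y)            # one-line training: preprocessing + model
preds = pipeline.predict(X)   # one-line prediction: raw rows in, labels out
proba = pipeline.predict_proba(X)[:, 1]  # column 1 = P(class "Yes"), since classes_ are sorted
```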
Model Evaluation
Rigorously evaluate your churn prediction model using classification metrics appropriate for imbalanced data and business context.
Evaluation Strategy
Train-Test Split
Method: Stratified split (80/20) - maintains churn ratio in both sets
Why stratified: With 40% churn rate, random split might give 45% in train and 35% in test - stratified ensures both have ~40%
Random state: Set random_state=42 for reproducibility
Cross-Validation
Method: 5-fold stratified CV on training data only
Purpose: Get robust performance estimate, detect overfitting
Interpretation: CV accuracy of 0.82 ± 0.04 means model consistently performs well (0.78-0.86 range)
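The split and cross-validation strategy above can be sketched like this (random arrays stand in for the preprocessed features; in the project you pass the full pipeline to `cross_val_score` instead of a bare classifier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))                 # stand-in for features
y = rng.choice([0, 1], 120, p=[0.6, 0.4])     # ~40% churn

# Stratified 80/20 split keeps the churn ratio nearly identical in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 5-fold stratified CV on the training data only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=cv)
print(f"CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```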
Key Metrics for Churn Prediction
| Metric | Formula | Business Meaning | Target |
|---|---|---|---|
| Recall | TP / (TP + FN) | % of churners correctly identified - critical for retention budget | >75% |
| Precision | TP / (TP + FP) | % of predictions that are actual churners - avoids wasting retention offers | >65% |
| F1 Score | 2 * (P * R) / (P + R) | Balance between precision and recall - overall model quality | >70% |
| AUC-ROC | Area under ROC curve | Ranking ability - can model distinguish churners from non-churners? | >0.80 |
Business Trade-off: Recall vs Precision
High Recall (80%): Catch most churners but send retention offers to non-churners too. Cost: $50 offer × false positives.
High Precision (85%): Offers only go to likely churners, but miss some at-risk customers. Cost: $2,400 LTV × missed churners.
Recommendation: Prioritize recall (minimize FN) - missing a churner costs 48x more than a wasted offer ($2,400 vs $50).
Model Comparison Framework
Train at least 2 algorithms and compare using a metrics table:
| Model | CV Accuracy | Test Recall | Test Precision | F1 Score | AUC-ROC | Train Time |
|---|---|---|---|---|---|---|
| RandomForest | 0.82 ± 0.03 | 0.78 | 0.71 | 0.74 | 0.85 | 2.1s |
| LogisticRegression | 0.79 ± 0.04 | 0.75 | 0.68 | 0.71 | 0.82 | 0.3s |
Selection: RandomForest wins - higher recall (catches more churners), better AUC, acceptable training time for monthly retraining.
Confusion Matrix Interpretation
For 100 test customers with 40% churn rate:
- True Positives (31): Correctly predicted churners - send retention offer, save $2,400 each = $74,400 saved
- False Negatives (9): Missed churners - lost customers, cost $2,400 each = $21,600 lost revenue
- False Positives (12): Predicted churn but didn't - wasted offers, cost $50 each = $600 wasted
- True Negatives (48): Correctly predicted retention - no action needed
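The interpretation above can be reproduced with scikit-learn. The labels below are constructed to match the quoted counts exactly, and the dollar arithmetic mirrors the figures in the bullets rather than a formal cost model:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Reconstruct the 100-customer example: 40 churners, of whom 31 are caught
y_true = np.array([1] * 40 + [0] * 60)
y_pred = np.array([1] * 31 + [0] * 9    # churners: 31 TP, 9 FN
                  + [1] * 12 + [0] * 48)  # non-churners: 12 FP, 48 TN

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = recall_score(y_true, y_pred)        # 31 / 40 = 0.775
precision = precision_score(y_true, y_pred)  # 31 / 43 ~= 0.721

# Rough business value: saved churners minus wasted offers and missed churners
net_value = tp * 2400 - fp * 50 - fn * 2400
```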
Model Serialization
Save the trained pipeline to disk using joblib for deployment. This captures the entire workflow - preprocessing + model - in a single file.
Why Model Persistence Matters
Training Once
Training takes only seconds with 120 customers, but on realistic datasets it can take minutes or hours. Either way, in production you can't retrain for every prediction. Save once, load instantly (~50ms).
Deployment
API servers load the saved model at startup. Thousands of predictions use the same trained pipeline without retraining.
Versioning
Save models with timestamps (churn_model_2026_01.joblib). Rollback to previous version if new model underperforms.
Joblib vs Pickle
| Format | Speed | Compression | Best For | Recommendation |
|---|---|---|---|---|
| joblib | Fast for NumPy arrays | Built-in compression | ML models with large arrays | Use This |
| pickle | Slower for arrays | Manual compression | General Python objects | Alternative |
Model Metadata Best Practices
Create a JSON file alongside the model with critical information:
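A possible metadata file might look like the following. The field names here are a suggestion, not a required schema:

```python
import json
from datetime import date

# Illustrative metadata -- adapt the fields to what your pipeline actually records
metadata = {
    "model_version": "2026-01",
    "trained_on": str(date.today()),
    "algorithm": "RandomForestClassifier",
    "sklearn_version": "1.3.0",
    "features": ["tenure_months", "monthly_charges", "total_charges", "contract"],
    "metrics": {"cv_accuracy": 0.82, "test_recall": 0.78, "test_auc": 0.85},
}

# Write it next to the saved model so the API and retraining jobs can read it
with open("model_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```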
Verification Checklist
After saving, verify your model works correctly:
- Load test: Load the saved pipeline in a fresh Python session
- Prediction test: Make predictions on sample data, verify output format
- Probability test: Ensure predict_proba() returns values between 0-1
- Feature test: Confirm model accepts same feature columns as training data
- Performance test: Verify predictions match pre-save performance
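The save-load-verify cycle can be sketched as below. A small stand-in pipeline on synthetic numeric data is used here; in the project you would dump your full preprocessing + model pipeline:

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline trained on synthetic data
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(120, 3)),
                 columns=["tenure_months", "monthly_charges", "total_charges"])
y = rng.choice([0, 1], 120)

pipeline = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipeline.fit(X, y)
expected = pipeline.predict_proba(X)[:, 1]

joblib.dump(pipeline, "churn_model.joblib")   # save the complete pipeline
loaded = joblib.load("churn_model.joblib")    # reload as a fresh session would
reloaded_proba = loaded.predict_proba(X)[:, 1]

# Performance test: predictions must match pre-save output exactly
assert np.allclose(expected, reloaded_proba)
```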
Security Warning
Pickle and joblib files can execute arbitrary code when loaded. Never load models from untrusted sources. In production, store models in secure locations with access controls and integrity checks (checksums).
Flask API for Real-Time Predictions
Build a RESTful API using Flask that loads your saved model and serves predictions via HTTP endpoints. This transforms your ML model into a production service.
API Architecture
Required Endpoints
GET /health
Purpose: Health check for monitoring systems (Kubernetes, AWS ELB)
Response: JSON with status: 'healthy', model version, uptime
Use case: Load balancer pings every 30s to verify API is running
POST /predict
Purpose: Accept customer data, return churn prediction + probability
Input: JSON object with all 20 feature fields
Output: prediction ('Churn' or 'No Churn'), churn_probability (0.0-1.0)
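A minimal sketch of the two endpoints is shown below. To keep it self-contained, a stub object stands in for the joblib-loaded pipeline, and only a trimmed field list is validated; your real app would load the saved model at startup and validate all 20 features:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

REQUIRED_FIELDS = ["tenure_months", "monthly_charges", "contract"]  # trimmed for illustration

class StubModel:
    """Placeholder for the real pipeline (real app: joblib.load at startup)."""
    def predict_proba_one(self, record):
        # Toy heuristic standing in for pipeline.predict_proba
        return 0.78 if record["contract"] == "Month-to-month" else 0.15

model = StubModel()  # real app: model = joblib.load("models/churn_model.joblib")

@app.route("/health", methods=["GET"])
def health():
    return jsonify({"status": "healthy", "model_version": "2026-01"}), 200

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json(silent=True) or {}
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        return jsonify({"error": f"Missing required fields: {missing}"}), 400
    proba = model.predict_proba_one(data)
    return jsonify({
        "prediction": "Churn" if proba >= 0.5 else "No Churn",
        "churn_probability": round(proba, 3),
    }), 200
```

Run with `flask run` during development and a WSGI server such as Gunicorn in production, as noted later in this section.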
Example API Interaction
Request Example
Method: POST
URL: http://localhost:5000/predict
Headers: Content-Type: application/json
Body:
Response Example
Status: 200 OK
Body:
Interpretation: 78% chance of churn - high priority for retention team
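As an illustration of what the request and response bodies might look like (only a few of the 20 feature fields are shown, and the exact key names depend on your implementation):

```python
import json

# Hypothetical request body -- remaining feature fields omitted for brevity
request_body = {
    "tenure_months": 2,
    "monthly_charges": 89.50,
    "total_charges": 179.00,
    "contract": "Month-to-month",
    "internet_service": "Fiber optic",
}

# Hypothetical response body matching the interpretation above
response_body = {
    "prediction": "Churn",
    "churn_probability": 0.78,
}

print(json.dumps(request_body, indent=2))
print(json.dumps(response_body, indent=2))
```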
Error Handling Requirements
| Error Type | HTTP Code | Response | Example |
|---|---|---|---|
| Missing features | 400 | Error message listing missing fields | "Missing required field: contract" |
| Invalid data type | 400 | Error describing type mismatch | "tenure_months must be integer" |
| Model load failure | 500 | Internal server error | "Model file not found" |
| Prediction failure | 500 | Error with details | "Prediction failed: feature mismatch" |
Performance Optimization
- Load model at startup: Loading takes 50-200ms. Do it once when API starts, not on every request.
- Use pandas DataFrame: Pipeline expects DataFrame input with correct column names and order.
- Return probabilities: Include predict_proba() result so caller can set custom thresholds (e.g., 0.6 instead of 0.5).
- Add caching: For batch predictions, consider caching results with TTL to reduce compute.
Production Considerations
- Never use debug=True in production: Exposes sensitive information and allows arbitrary code execution
- Add request logging: Log all predictions with timestamps for audit trails
- Set timeouts: Predictions should complete in <200ms; timeout after 5 seconds
- Use WSGI server: For production, use Gunicorn or uWSGI instead of Flask's built-in server
- Add authentication: Protect endpoint with API keys or OAuth tokens
Testing Your API
Create a test script that validates all endpoints:
- Health check test: GET /health returns 200 with 'healthy' status
- Valid prediction test: POST /predict with complete data returns prediction
- Missing field test: POST /predict without 'contract' returns 400 error
- Invalid type test: POST /predict with tenure_months='abc' returns 400
- Probability range test: Verify churn_probability is between 0.0 and 1.0
Pipeline Specifications
Your pipeline must follow scikit-learn best practices and be production-ready. This section provides detailed specifications for each component.
What Makes a Pipeline "Production-Ready"?
- Encapsulation: All preprocessing inside the pipeline - zero manual steps
- Reproducibility: Same inputs always produce same outputs (set random_state)
- Serialization: Can be saved, loaded, and used in different environments
- Error handling: Gracefully handles missing values, unknown categories, edge cases
- Documentation: Clear naming, comments, and README explaining usage
- Numeric: StandardScaler or MinMaxScaler
- Categorical: OneHotEncoder with handle_unknown
- Imputation: Median for numeric, constant for categorical
- ColumnTransformer: Combine all transformers
- Feature Names: Use get_feature_names_out()
- Algorithm: RandomForest, LogisticRegression, or XGBoost
- Validation: 5-fold stratified cross-validation
- Metrics: Accuracy, Precision, Recall, F1, AUC
- Comparison: Compare at least 2 models
- Selection: Choose best based on business needs
- Format: joblib (preferred) or pickle
- Scope: Save complete pipeline, not just model
- Metadata: Version, date, metrics, features
- Verification: Load and test predictions
- Location: models/ directory
- Framework: Flask (or FastAPI)
- Endpoints: /health, /predict
- Input: JSON with customer features
- Output: Prediction + probability
- Errors: Proper error handling with messages
Best Practices
- Include all preprocessing in the pipeline
- Use ColumnTransformer for mixed data types
- Version your models with timestamps
- Add input validation in the API
- Load model once at startup, not per request
Common Mistakes
- Preprocessing outside the pipeline
- Fitting scaler on test data (data leakage)
- Hardcoding feature names
- Missing error handling in API
- Exposing debug mode in production
Required Visualizations
Create the following visualizations to understand your data and communicate model performance.
Target Variable Distribution
Pie or bar chart showing churn vs. no churn
Model Predictions
Heatmap of predicted vs. actual values
Classifier Performance
ROC curve with AUC score
Top Features
Bar chart of most important features
Algorithm Performance
Compare metrics across models
Precision-Recall Curve
For imbalanced classification
Visualization Guidelines
Confusion Matrix Best Practices
- Heatmap format: Use Seaborn heatmap with annotations showing actual counts
- Color scheme: Green gradient for clarity, darker = higher count
- Labels: Clearly label axes as 'Actual' and 'Predicted'
- Diagonal focus: High numbers on diagonal (TP, TN) = good performance
- Business context: Annotate with cost implications of FP vs FN
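The practices above can be sketched with Seaborn, using the counts from the worked 100-customer example:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Counts from the worked example: rows = actual, columns = predicted
cm = np.array([[48, 12],   # actual No Churn: TN, FP
               [9, 31]])   # actual Churn:    FN, TP

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Greens",
            xticklabels=["No Churn", "Churn"],
            yticklabels=["No Churn", "Churn"], ax=ax)
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
ax.set_title("Churn Confusion Matrix")
fig.savefig("confusion_matrix.png", dpi=150)
```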
ROC Curve Interpretation
- Diagonal line: Random guessing baseline (AUC = 0.5)
- Perfect model: Top-left corner (AUC = 1.0)
- Your goal: Curve bowed toward top-left (AUC > 0.80)
- AUC meaning: Probability model ranks random churner higher than random non-churner
- Display: Include AUC score in title for quick reference
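A sketch of the ROC plot, using synthetic scores in place of `pipeline.predict_proba(X_test)[:, 1]`:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic labels and scores standing in for real test-set probabilities
rng = np.random.default_rng(42)
y_test = rng.choice([0, 1], 200, p=[0.6, 0.4])
scores = np.clip(y_test * 0.35 + rng.normal(0.35, 0.2, 200), 0, 1)  # informative but noisy

fpr, tpr, _ = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

fig, ax = plt.subplots(figsize=(5, 4))
ax.plot(fpr, tpr, label=f"Model (AUC = {auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="Random (AUC = 0.50)")  # baseline
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title(f"ROC Curve (AUC = {auc:.2f})")   # AUC in title for quick reference
ax.legend()
fig.savefig("roc_curve.png", dpi=150)
```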
Feature Importance Visualization
- Chart type: Horizontal bar chart for easy feature name reading
- Top N features: Show top 10-15 most important features only
- Sort order: Highest importance at top
- Insight extraction: Look for contract type, tenure, charges at top
- Business value: Identifies which customer attributes drive churn risk
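A sketch of the horizontal bar chart. Synthetic data and generic feature names stand in here; in the project you would use your fitted RandomForest and the names from `preprocessor.get_feature_names_out()`:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the preprocessed training data
X, y = make_classification(n_samples=120, n_features=8, n_informative=4, random_state=42)
names = [f"feature_{i}" for i in range(8)]

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
order = np.argsort(model.feature_importances_)  # ascending, so top feature plots at the top

fig, ax = plt.subplots(figsize=(6, 4))
ax.barh([names[i] for i in order], model.feature_importances_[order])
ax.set_xlabel("Importance")
ax.set_title("Top Feature Importances")
fig.tight_layout()
fig.savefig("feature_importance.png", dpi=150)
```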
Model Comparison Chart
- Chart type: Grouped bar chart comparing multiple metrics
- Metrics shown: Accuracy, Recall, Precision, F1, AUC side-by-side
- Models: Show RandomForest vs LogisticRegression performance
- Decision making: Quickly identify which model wins on key metrics
- Trade-offs: Visualize speed vs accuracy trade-offs
Visualization Library Choices
Seaborn: Use for confusion matrix heatmap, distribution plots. Better statistical defaults than Matplotlib.
Matplotlib: Use for ROC curves, feature importance bars, model comparisons. More control over customization.
Plotly: Bonus - create interactive confusion matrix or ROC curve that users can hover over for details.
Submission Requirements
Create a public GitHub repository with the exact name shown below:
Required Repository Name
ml-pipeline-project
Required Project Structure
ml-pipeline-project/
├── data/
│ └── telecom_churn.csv # The dataset
├── notebooks/
│ ├── 01_data_exploration.ipynb # EDA notebook
│ └── 02_pipeline_training.ipynb # Pipeline development
├── src/
│ ├── __init__.py
│ ├── pipeline.py # Pipeline code
│ └── preprocessing.py # Preprocessing utilities
├── api/
│ ├── app.py # Flask application
│ └── test_api.py # API test script
├── models/
│ ├── churn_model.joblib # Trained pipeline
│ └── model_metadata.json # Model metadata
├── visualizations/
│ ├── confusion_matrix.png
│ └── roc_curve.png
├── requirements.txt
└── README.md
README.md Must Include:
- Your full name and submission date
- Project overview and business context
- Model performance metrics
- API documentation with example requests
- Instructions to run the project
- Screenshots of visualizations
requirements.txt
pandas>=2.0.0
numpy>=1.24.0
scikit-learn>=1.3.0
matplotlib>=3.7.0
seaborn>=0.12.0
flask>=3.0.0
joblib>=1.3.0
jupyter>=1.0.0
Enter your GitHub username - we will verify your repository automatically
Grading Rubric
Your project will be graded on the following criteria. Total: 550 points.
| Criteria | Points | Description |
|---|---|---|
| Data Preprocessing | 100 | Proper ColumnTransformer with numeric and categorical transformers |
| Pipeline Development | 120 | Complete end-to-end Pipeline, cross-validation, model comparison |
| Model Serialization | 80 | Correct joblib usage, metadata file, verification |
| API Development | 130 | Working Flask app with endpoints, error handling, testing |
| Visualizations | 60 | Required visualizations with professional formatting |
| Documentation | 60 | Comprehensive README, code comments, docstrings |
| Total | 550 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.