Module 1.2

The Machine Learning Workflow

Master the complete ML pipeline from problem definition to deployment. Learn the structured approach that professional data scientists use to build successful ML solutions.

35 min read
Beginner
Hands-on Examples
What You'll Learn
  • The 7-step ML workflow
  • Data preparation techniques
  • Train-test split and cross-validation
  • Model evaluation metrics
  • Deployment considerations
Contents
01

The ML Workflow Overview

Every successful ML project follows a structured workflow. Understanding this process is crucial before diving into algorithms. Let's explore the 7 key steps that transform raw data into production-ready ML solutions.

Remember: ML is iterative, not linear. You'll often loop back to earlier steps based on your findings. This is normal and expected!
1. Problem Definition: define the business problem, target variable, and success metrics clearly. (5-10% of project time)
2. Data Collection: gather data from databases, APIs, files, and external sources. (10-15% of project time)
3. Data Preparation: clean, handle missing values, remove duplicates, and split data. (40-60% of project time!)
4. Feature Engineering: create, transform, scale, and select the most predictive features. (10-20% of project time)
5. Model Training: train algorithms, tune hyperparameters, and use cross-validation. (10-15% of project time)
6. Evaluation: measure performance with metrics, confusion matrix, and validation. (5-10% of project time)
7. Deployment: deploy to production, create APIs, monitor, and plan retraining. (10-15% of project time)

Interactive: ML Workflow Explorer

Explore each phase of the ML pipeline. Each step covers key activities, essential tools, common pitfalls, and real-world tips; the Problem Definition step is shown in detail below as an example.

Problem Definition

Step 1 of 7

Define the business problem clearly. What are you trying to predict? What does success look like? This step sets the foundation for everything else.

Key Activities
  • Stakeholder interviews
  • Define target variable
  • Set success metrics
  • Assess feasibility
Common Tools
Requirements doc, stakeholder meetings, KPI definition
Common Pitfalls

Skipping this step, vague objectives, not involving stakeholders, choosing wrong success metrics.

Pro Tips

Start with a simple baseline. Define what "good enough" looks like before building anything.

Time Allocation
5-10%
of total project
Difficulty Level
Moderate
Key Deliverable
Problem Statement
02

Step 1: Problem Definition

Before writing any code, you must clearly understand the problem. A poorly defined problem leads to wasted effort and failed projects.

Questions to Ask
  • What exactly are we trying to predict or classify?
  • Is this a classification, regression, or clustering problem?
  • What data do we have available?
  • How will success be measured?
  • What are the business constraints (time, accuracy, interpretability)?
Example Problems
  • Classification: Will this customer churn? (Yes/No)
  • Regression: What will be the house price? ($X)
  • Clustering: What customer segments exist?
  • Ranking: Which products should we recommend?
  • Anomaly: Is this transaction fraudulent?
Problem Statement Template
# Problem Statement Template
problem = {
    "objective": "Predict customer churn within 30 days",
    "type": "Binary Classification",
    "target_variable": "churned (0 or 1)",
    "success_metric": "F1-score >= 0.85",
    "constraints": {
        "latency": "< 100ms per prediction",
        "interpretability": "Must explain top 3 factors",
        "refresh": "Retrain weekly"
    }
}
Template Fields Explained
objective

Clear, measurable goal statement. What exactly are you predicting and within what timeframe?

type

ML problem category: Classification (Binary/Multi), Regression, Clustering, Ranking, or Anomaly Detection.

target_variable

The column/feature you're predicting, including its data type and possible values (0/1, continuous, etc.).

success_metric

How you'll measure success. Always include metric name + threshold.

Examples: F1 ≥ 0.85, RMSE ≤ 10, AUC ≥ 0.90
constraints

Real-world limitations that affect model choices:

Latency, interpretability, refresh rate, compute budget
Pro Tip: Share this template with stakeholders BEFORE starting any coding. Getting alignment early prevents wasted effort and ensures everyone agrees on what "success" means.
03

Step 2: Data Collection

Data is the fuel for ML. The quality and quantity of your data directly impacts model performance. Garbage in, garbage out!

Internal Sources
  • Company databases
  • CRM systems
  • Transaction logs
  • User behavior data
  • Sensor/IoT data
External Sources
  • Public APIs
  • Open datasets (Kaggle, UCI)
  • Government data
  • Third-party providers
  • Web scraping
Considerations
  • Data privacy (GDPR, CCPA)
  • Data quality issues
  • Sampling bias
  • Licensing restrictions
  • Freshness/timeliness

Loading Data with pandas

import pandas as pd

# From CSV file
df = pd.read_csv('customers.csv')

# From Excel file
df = pd.read_excel('sales_data.xlsx', sheet_name='2024')

# From SQL database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM customers', conn)

# From API (JSON)
import requests
response = requests.get('https://api.example.com/data')
df = pd.DataFrame(response.json())

# Quick data overview
print(f"Shape: {df.shape}")  # (rows, columns)
print(f"Columns: {df.columns.tolist()}")
df.head()  # First 5 rows
pandas Data Loading Methods
read_csv()

Most common format. Comma-separated values.

.csv .tsv .txt
read_excel()

Excel workbooks. Specify sheet_name for multi-sheet files.

.xlsx .xls
read_sql()

Query databases directly. Requires connection object.

SQLite MySQL PostgreSQL
DataFrame()

From JSON/API responses. Convert dict/list to DataFrame.

REST APIs .json
Quick Data Overview Methods
df.shape → (rows, columns)
df.head() → First 5 rows
df.info() → Data types & nulls
df.describe() → Statistics
Best Practice: Always run df.info() and df.describe() immediately after loading. This reveals data types, missing values, and statistical distribution at a glance.
04

Step 3: Data Preparation

This is where you spend the biggest share of your time, often 40-60% (or 60-80% once feature engineering is included)! Real-world data is messy. You need to clean, transform, and prepare it before feeding it to ML algorithms.

Reality Check: Data scientists spend most of their time wrangling data, not building models. Master this step!

Common Data Issues

Missing Values
# Check for missing values
print(df.isnull().sum())

# Drop rows with missing values
df_clean = df.dropna()

# Fill with mean/median (inplace=True on a column is deprecated;
# assign the result back instead)
df['age'] = df['age'].fillna(df['age'].median())

# Fill with most frequent (mode)
df['category'] = df['category'].fillna(df['category'].mode()[0])

# Forward fill (time series) — fillna(method='ffill') is deprecated
df['price'] = df['price'].ffill()
Methods Explained
dropna()

Remove rows with any missing values. Use when missing data is minimal.

fillna(median)

Replace with median. Robust to outliers for numerical data.

fillna(mode)

Replace with most frequent value. Best for categorical columns.

ffill / bfill

Forward/backward fill. Use for time series data.

Warning: Never compute fill values on the full dataset before splitting! Calculate medians and modes from the training set only, then apply them to the test set.
Duplicates & Outliers
# Remove duplicates
df = df.drop_duplicates()

# Find duplicates
duplicates = df[df.duplicated()]
print(f"Found {len(duplicates)} duplicates")

# Detect outliers using IQR
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['price'] < Q1 - 1.5*IQR) | 
              (df['price'] > Q3 + 1.5*IQR)]

# Remove outliers
df = df[~df.index.isin(outliers.index)]
Methods Explained
drop_duplicates()

Removes exact duplicate rows. Keeps first occurrence by default.

duplicated()

Returns boolean mask. True for duplicate rows.

IQR Method

Interquartile Range: Values outside Q1 - 1.5×IQR to Q3 + 1.5×IQR are considered outliers. This method is robust and doesn't assume normal distribution.

Tip: Don't blindly remove outliers! Investigate first—they might be valid rare events or data entry errors.

Train-Test Split

Splitting your dataset is one of the most critical steps in ML. You need to evaluate your model on data it has never seen before to get an honest estimate of real-world performance.

The training set is used to teach your model patterns. The test set is kept completely separate and only used at the very end to evaluate how well your model generalizes to new, unseen data.

Critical Rule

Always split BEFORE preprocessing! If you scale, encode, or impute on the full dataset first, information from the test set leaks into training, giving overly optimistic results that won't hold in production.

from sklearn.model_selection import train_test_split

# Features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 20% for testing
    random_state=42,    # Reproducibility
    stratify=y          # Keep class proportions (for classification)
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
train_test_split() Parameters
X, y

X = Features (input columns). y = Target (what you predict). Separate them before splitting.

test_size

Fraction for testing (0.2 = 20%). Common values: 0.2, 0.25, 0.3. Larger datasets can use smaller test sizes.

random_state

Seed for reproducibility. Same number = same split every time. Use 42, 0, or any integer.

stratify

Preserves class proportions. If 30% positive in original, both splits have ~30%. Critical for imbalanced data!

Visual: 80/20 Split
Training Set (80%)
Test (20%)
Train: Model learns from this data
Test: Model is evaluated on this (unseen!)
Pro Tip: For small datasets, consider a validation set too (60/20/20 split) or use cross-validation to get more reliable performance estimates without wasting data.
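The 60/20/20 split can be produced with two successive calls to train_test_split. A minimal sketch on synthetic data (the feature matrix and labels here are made up purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic example data: 100 samples, 3 features, binary target
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# First split: hold out 40% for validation + test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)

# Second split: divide the held-out 40% in half -> 20% val, 20% test overall
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Passing stratify at both stages keeps the class proportions intact in all three sets.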
Data Leakage Warning: Never fit preprocessing (like scaling or encoding) on the entire dataset. Fit only on training data, then transform both train and test.
05

Step 4: Feature Engineering

Feature engineering is the art of creating and selecting the right input variables. Good features can make a simple model outperform a complex one!

Scaling Features
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# StandardScaler: mean=0, std=1
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use same scaler!

# MinMaxScaler: range [0, 1]
minmax = MinMaxScaler()
X_train_norm = minmax.fit_transform(X_train)
X_test_norm = minmax.transform(X_test)
Why Scale Features?

Many ML algorithms (like KNN, SVM, Neural Networks) are sensitive to the magnitude of features. If "age" ranges 0-100 and "income" ranges 0-1,000,000, the model will think income is more important just because it has bigger numbers!

StandardScaler

Transforms to mean=0, std=1. Best default for most algorithms. Sensitive to outliers (consider RobustScaler when outliers are present).

MinMaxScaler

Transforms to range [0, 1]. Good for neural networks. Sensitive to outliers.

fit_transform vs transform:
  • fit_transform(X_train) — Learn parameters (mean, std) from training data AND transform it
  • transform(X_test) — Use SAME learned parameters to transform test data. Never fit on test!
Encoding Categories
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd

# Ordinal categories: use an explicit mapping so the codes follow the
# real ranking. (LabelEncoder assigns codes alphabetically, which would
# give L=0, M=1, S=2 here — not the order you want.)
df['size_encoded'] = df['size'].map({'S': 0, 'M': 1, 'L': 2})

# One-Hot Encoding (nominal categories)
df_encoded = pd.get_dummies(df, columns=['color'])
# Creates: color_red, color_blue, color_green

# Or using sklearn
ohe = OneHotEncoder(sparse_output=False)  # 'sparse_output' replaces the old 'sparse' param (sklearn >= 1.2)
encoded = ohe.fit_transform(df[['color']])
Why Encode Categories?

ML algorithms work with numbers, not text! You must convert categorical columns like "color" or "size" into numerical format. The encoding method depends on whether categories have a natural order.

LabelEncoder

Ordinal data (has order): S→0, M→1, L→2 via an explicit mapping. Note that plain LabelEncoder assigns codes alphabetically, so map ordered categories yourself (or use OrdinalEncoder with explicit categories).

OneHotEncoder

Nominal data (no order): Red, Blue, Green become separate 0/1 columns.

Beginner Mistake: Using LabelEncoder on nominal data (like colors). The model might think Red(0) < Blue(1) < Green(2), which makes no sense! Use OneHot for nominal categories.

Creating New Features

Feature engineering is where data science becomes an art! Creating smart new features from existing data can dramatically improve your model's performance — sometimes more than changing the algorithm itself.

# Feature creation examples
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 50, 100], 
                         labels=['teen', 'young', 'middle', 'senior'])

# Date features
df['signup_date'] = pd.to_datetime(df['signup_date'])
df['signup_year'] = df['signup_date'].dt.year
df['signup_month'] = df['signup_date'].dt.month
df['signup_dayofweek'] = df['signup_date'].dt.dayofweek
df['days_since_signup'] = (pd.Timestamp.now() - df['signup_date']).dt.days

# Interaction features
df['price_per_sqft'] = df['price'] / df['sqft']
df['total_spend'] = df['quantity'] * df['unit_price']

# Log transform for skewed data
import numpy as np
df['log_income'] = np.log1p(df['income'])  # log(1+x) handles zeros
Feature Engineering Techniques Explained
pd.cut() Binning

Converts continuous numbers into categories. Groups ages 0-18 as "teen", 19-35 as "young", etc.

When to use: When exact values don't matter, but ranges do (age groups, income brackets, time periods)
.dt accessor Date Extraction

Extracts year, month, day, weekday from dates. Captures patterns like "more sales on weekends" or "seasonal trends".

Common extractions: .dt.year, .dt.month, .dt.dayofweek, .dt.hour, .dt.quarter
Column Math Interactions

Combines columns to create meaningful ratios. "Price per sqft" is more informative than price and sqft separately!

Ideas: ratios (A/B), products (A×B), differences (A-B), percentages (A/total×100)
np.log1p() Transform

Compresses skewed data (like income: few millionaires, many middle-class). Makes distribution more normal.

Why log1p? log(1+x) safely handles zero values. Use np.expm1() to reverse it.
Feature Engineering Gold: Domain knowledge creates the best features! Understanding your data's context (real estate, healthcare, e-commerce) helps you create features the algorithm can't discover on its own.
Beginner Tip: Start simple! Try basic ratios and date extractions first. Test if new features actually improve your model before creating dozens of them.
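One way to "test if new features actually improve your model" is to compare cross-validated scores with and without the candidate feature. A sketch on synthetic housing-style data (the columns and the sqft/rooms ratio are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: price driven by size and room count, plus noise
rng = np.random.default_rng(0)
n = 200
sqft = rng.uniform(500, 3000, n)
rooms = rng.integers(1, 6, n).astype(float)
price = 150 * sqft + 10000 * rooms + rng.normal(0, 20000, n)

# Baseline features vs. baseline + an engineered ratio
X_base = np.column_stack([sqft, rooms])
X_feat = np.column_stack([sqft, rooms, sqft / rooms])

base = cross_val_score(LinearRegression(), X_base, price, cv=5, scoring='r2').mean()
feat = cross_val_score(LinearRegression(), X_feat, price, cv=5, scoring='r2').mean()
print(f"baseline R2: {base:.3f}, with sqft/rooms: {feat:.3f}")
```

Keep the new feature only if the cross-validated score improves meaningfully; on this synthetic data the ratio adds little, which is exactly the kind of result the check is meant to catch.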
06

Step 5: Model Training

Now comes the exciting part - training your ML model! Start simple, then iterate to more complex models if needed.

Pro Tip: Always start with a simple baseline model (like Logistic Regression or Decision Tree). If it works well, you might not need deep learning!

The Basic Training Pattern

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Step 1: Choose a model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Step 2: Train (fit) on training data
model.fit(X_train, y_train)

# Step 3: Make predictions
y_pred = model.predict(X_test)

# Step 4: Get probability scores (if needed)
y_proba = model.predict_proba(X_test)[:, 1]  # Probability of class 1
Understanding Each Step
Step 1 Choose Model

Pick an algorithm that suits your problem type (classification, regression) and data size.

RandomForestClassifier
n_estimators = number of trees
random_state = reproducibility
Step 2 Train (Fit)

The model learns patterns from your training data. This is where the "magic" happens!

model.fit(X_train, y_train)
• X_train = features
• y_train = target labels
Step 3 Predict

Use trained model to make predictions on new, unseen test data.

model.predict(X_test)
• Returns class labels (0, 1)
• Hard predictions
Step 4 Probabilities

Get confidence scores instead of just yes/no. Useful for ranking or setting custom thresholds.

predict_proba(X_test)[:, 1]
• Returns 0.0 to 1.0
• [:, 1] gets positive class
Scikit-learn's Consistent API: Every model follows the same pattern: .fit() → .predict() → .score(). Once you learn it for one model, you know it for all 50+ models in sklearn!
Beginner tip: Start with LogisticRegression (classification) or LinearRegression (regression) as baselines
Common mistake: Never call .fit() on test data — that's cheating and causes data leakage!

Hyperparameter Tuning

Hyperparameters are settings you choose before training (unlike model parameters learned during training). Finding the best combination can significantly boost your model's performance!

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Grid Search (tries all combinations)
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,              # 5-fold cross-validation
    scoring='f1',      # Optimize for F1-score
    n_jobs=-1          # Use all CPU cores
)

grid_search.fit(X_train, y_train)

# Best parameters and score
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")

# Use best model
best_model = grid_search.best_estimator_
Understanding the Code
Parameter Grid

A dictionary defining which values to try for each hyperparameter. GridSearch will test all combinations:

n_estimators

Number of trees in forest

max_depth

How deep each tree grows

min_samples_split

Min samples to split node

min_samples_leaf

Min samples in leaf node

Total combinations: 3 × 4 × 3 × 3 = 108 models trained!

cv=5

Uses 5-fold cross-validation for each combination. More reliable than a single train-test split!

scoring='f1'

Metric to optimize. Options: 'accuracy', 'precision', 'recall', 'roc_auc'

n_jobs=-1

Use all CPU cores for parallel processing. Makes tuning much faster!

Accessing Results
.best_params_

The winning hyperparameter combination

.best_score_

Best cross-validation score achieved

.best_estimator_

The trained model with best params

GridSearchCV: Tries every combination. Thorough but slow. Best for small grids (<100 combinations).
RandomizedSearchCV: Samples random combinations. Faster for large search spaces. Use n_iter to control attempts.
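RandomizedSearchCV uses the same fit / best_params_ interface as GridSearchCV. A sketch on synthetic data (the grid values are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=10,          # sample only 10 of the 36 combinations
    cv=3,
    scoring='f1',
    random_state=42,    # reproducible sampling of combinations
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```

With n_iter you cap the number of models trained, which is why randomized search scales to large grids where exhaustive grid search would be too slow.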

Cross-Validation

Why Cross-Validation?
The Problem: A single train-test split can give misleading results. You might get "lucky" or "unlucky" with which samples end up in each set.
Single Split Problems
  • Results depend heavily on which samples are in test set
  • May overestimate or underestimate true performance
  • Hard to know if your model will generalize well
  • Wastes data - some samples never used for training
Cross-Validation Benefits
  • Uses ALL data for both training and validation
  • Provides mean AND standard deviation of performance
  • Detects instability (high variance across folds often signals overfitting)
  • More reliable estimate of real-world performance
How K-Fold Cross-Validation Works
5-Fold Example (K=5): Data split into 5 equal parts
Fold 1 (Test) Fold 2 Fold 3 Fold 4 Fold 5 → Score 1
Fold 1 Fold 2 (Test) Fold 3 Fold 4 Fold 5 → Score 2
Fold 1 Fold 2 Fold 3 (Test) Fold 4 Fold 5 → Score 3
Fold 1 Fold 2 Fold 3 Fold 4 (Test) Fold 5 → Score 4
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 (Test) → Score 5
Final Score = Average(Score 1, 2, 3, 4, 5) ± Std Dev
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
scores = cross_val_score(
    model, X_train, y_train, 
    cv=5, 
    scoring='accuracy'
)

print(f"Scores: {scores}")
print(f"Mean: {scores.mean():.4f}")
print(f"Std: {scores.std():.4f}")

# Interpretation:
# Mean = expected performance
# Low std = stable model (good!)
# High std = unstable model, possibly overfitting (bad!)
Understanding cross_val_score
model

Your ML model (unfitted)

cv=5

Number of folds

scoring

Metric to evaluate

scores

Array of 5 scores

Interpreting results: .mean() = average performance, .std() = consistency (lower is better, means stable across folds)
# Other CV strategies
from sklearn.model_selection import (
    StratifiedKFold,  # Preserves class balance
    LeaveOneOut,      # N folds for N samples
    TimeSeriesSplit   # For time series data
)

# Stratified K-Fold (recommended for classification)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)

# Time Series Split (respects temporal order)
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv)
CV Strategies Explained
RECOMMENDED StratifiedKFold

Ensures each fold has same ratio of classes. Essential for imbalanced data (e.g., 90% class A, 10% class B).

SMALL DATA LeaveOneOut

Uses N-1 samples for training, 1 for testing. Repeats N times. Thorough but very slow for large datasets.

TIME DATA TimeSeriesSplit

Never uses future data to predict past! Training expands forward, test is always the next period.

Visual: How Each Strategy Splits Data

K-Fold (5 folds): the test fold rotates through the data, so each sample is tested exactly once.
Stratified K-Fold: every fold keeps the same A:B class ratio, preserving the class distribution.
Time Series Split: the training window expands forward, and the test set is always the next period.

Rule of Thumb: Use 5 or 10 folds. More folds = more reliable but slower. For small datasets, use Leave-One-Out. For imbalanced classification, use StratifiedKFold.
07

Step 6: Model Evaluation

How do you know if your model is good? Evaluation metrics tell you how well your model performs on unseen data.

Classification Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, 
    recall_score, f1_score, 
    confusion_matrix, classification_report
)

# Basic metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# Detailed report
print(classification_report(y_test, y_pred))
Metrics Explained
Accuracy

% of correct predictions overall. Can be misleading with imbalanced data!

Precision

Of all "YES" predictions, how many were actually YES? (Avoid false alarms)

Recall

Of all actual YES cases, how many did we catch? (Don't miss anything!)

F1-Score

Harmonic mean of Precision & Recall. Best single metric for imbalanced data.

When to use what: Spam filter → High Precision (don't mark good emails as spam). Cancer detection → High Recall (don't miss any cancer cases).
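One practical way to favor precision or recall, per the guidance above, is to keep the model fixed and move the decision threshold on predict_proba. A sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data
rng = np.random.default_rng(7)
X = rng.normal(size=(500, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(0, 0.5, 500) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = LogisticRegression().fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of class 1

# Raising the threshold trades recall away for precision
results = {}
for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    results[threshold] = (precision_score(y_test, pred),
                          recall_score(y_test, pred))
    print(f"threshold={threshold}: precision={results[threshold][0]:.2f} "
          f"recall={results[threshold][1]:.2f}")
```

A spam filter would push the threshold up (fewer false alarms); a cancer screen would pull it down (miss fewer cases).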
Regression Metrics
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error,
    r2_score, root_mean_squared_error
)  # root_mean_squared_error requires scikit-learn >= 1.4

# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R-squared: {r2:.4f}")
Metrics Explained
MAE

Average absolute error. Easy to interpret: "Off by $X on average".

MSE

Average squared error. Penalizes large errors more heavily.

RMSE

Square root of MSE. Same units as target. Most commonly used!

R² Score

Usually 0-1. "Model explains X% of variance". 1.0 = perfect fit; can be negative when the model is worse than predicting the mean.

Quick guide: Use RMSE for general comparison. Use MAE when outliers shouldn't dominate. R² for "how good is my model?" (0.8+ is usually good).

Understanding the Confusion Matrix

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Predicted No', 'Predicted Yes'],
            yticklabels=['Actual No', 'Actual Yes'])
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
Code Breakdown
confusion_matrix()

Creates 2×2 array of TN, FP, FN, TP counts

sns.heatmap()

Visualizes matrix with colors and numbers

annot=True

Shows numbers in each cell

fmt='d'

Format as integers (not decimals)

Confusion Matrix Explained
TN (True Negative) Correctly predicted NO
FP (False Positive) Incorrectly predicted YES (Type I Error)
FN (False Negative) Incorrectly predicted NO (Type II Error)
TP (True Positive) Correctly predicted YES
Visual: How to Read the Matrix

                 Predicted NO           Predicted YES
Actual NO        TN (correct)           FP (Type I error)
Actual YES       FN (Type II error)     TP (correct)

Diagonal cells = correct predictions, off-diagonal cells = errors.

Real-World Examples:
False Positive (FP)

🔔 Fire alarm when there's no fire
📧 Good email marked as spam
🏥 Healthy patient told they're sick

False Negative (FN)

🔇 No alarm during actual fire!
📩 Spam lands in inbox
🏥 Sick patient told they're healthy!

Which error is worse? Depends on context! Medical diagnosis: FN is worse (missing disease). Spam filter: FP is worse (losing important email).

Metrics Calculated from Confusion Matrix
Accuracy

(TP + TN) / Total

Precision

TP / (TP + FP)

Recall

TP / (TP + FN)

Specificity

TN / (TN + FP)
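These four formulas can be computed directly by unpacking the matrix cells. A small sketch with hypothetical labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for illustration
y_test = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])

# For binary labels, ravel() unpacks the cells as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)
specificity = tn / (tn + fp)
print(f"acc={accuracy:.2f} prec={precision:.2f} "
      f"rec={recall:.2f} spec={specificity:.2f}")
```

This matches what classification_report gives you, but makes the formulas explicit.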

08

Step 7: Deployment

A model that stays on your laptop creates zero value. Deployment puts your model into production where it can make real predictions!

Save Model
import joblib
import pickle

# Save with joblib (recommended)
joblib.dump(model, 'model.joblib')

# Load model
loaded_model = joblib.load('model.joblib')

# Make prediction
prediction = loaded_model.predict(new_data)
REST API
# Flask API example
from flask import Flask, request
import joblib

app = Flask(__name__)
model = joblib.load('model.joblib')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    prediction = model.predict([data['features']])
    return {'prediction': int(prediction[0])}

if __name__ == '__main__':
    app.run(port=5000)
Monitor

After deployment, monitor:

  • Model accuracy over time
  • Data drift (input distribution changes)
  • Latency and throughput
  • Error rates
  • Business metrics impact
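A first-pass check for the data drift mentioned above compares incoming feature statistics against the training distribution. A minimal sketch (the 2-sigma threshold and the age feature are illustrative, not a production recipe; real monitoring systems typically use tests like PSI or Kolmogorov-Smirnov):

```python
import numpy as np

def drift_alert(train_col, live_col, threshold=2.0):
    """Flag drift when the live mean moves more than `threshold`
    training standard deviations away from the training mean."""
    shift = abs(live_col.mean() - train_col.mean()) / train_col.std()
    return bool(shift > threshold)

# Simulated training distribution vs. shifted incoming data
rng = np.random.default_rng(0)
train_age = rng.normal(40, 10, 1000)
live_age  = rng.normal(65, 10, 1000)

print(drift_alert(train_age, live_age))  # shifted distribution -> True
```

Run a check like this on each feature of every scoring batch; a triggered alert is usually the cue to investigate and possibly retrain.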
Remember: Models degrade over time as data patterns change. Plan for regular retraining and monitoring from the start!

Key Takeaways

Follow the Workflow

The 7-step ML workflow provides structure. Don't skip steps, especially problem definition and data prep

Data Prep Takes 60-80%

Most of your time goes into data cleaning and preparation. This is normal and expected

Always Split Data First

Split into train/test BEFORE preprocessing to prevent data leakage and get honest evaluation

Choose Right Metrics

Accuracy isn't everything. Use F1 for imbalanced data, RMSE for regression, and align with business goals

ML is Iterative

You'll loop back to earlier steps. Poor results? Go back to features or data. This is the process

Deployment is Essential

A model that isn't deployed creates zero value. Plan for production, monitoring, and retraining

Knowledge Check

Test your understanding of the ML workflow:

Question 1 of 6

What percentage of a data scientist's time is typically spent on data preparation?

Question 2 of 6

Why should you split data into train/test sets BEFORE preprocessing?

Question 3 of 6

Which metric would be BEST for evaluating a model that predicts house prices?

Question 4 of 6

What is the purpose of cross-validation?

Question 5 of 6

When encoding categorical variables, when should you use One-Hot Encoding vs Label Encoding?

Question 6 of 6

What is GridSearchCV used for?
