Module 1.3

Scikit-learn Introduction

Master Python's most popular machine learning library. Learn the consistent API pattern (fit, predict, score) that works across all models, explore built-in datasets, and build your first ML models with just a few lines of code.

40 min
Beginner
Hands-on
What You'll Learn
  • The sklearn API pattern
  • Built-in datasets & utilities
  • Estimators, transformers, pipelines
  • Your first classification model
  • Model evaluation basics
Contents
01

What is Scikit-learn?

Scikit-learn (sklearn) is Python's most widely used machine learning library. It provides simple and efficient tools for data analysis and modeling, built on NumPy, SciPy, and Matplotlib. Whether you're building a spam classifier, predicting house prices, or clustering customers, sklearn has you covered.

Key Concept

What is Scikit-learn?

Scikit-learn is an open-source machine learning library that provides a consistent interface for dozens of ML algorithms. From simple linear regression to complex ensemble methods, all models follow the same fit() → predict() → score() pattern.

Why it matters: Once you learn the sklearn API, you can easily switch between algorithms without learning new syntax. This makes experimentation fast and code maintainable.

Why Scikit-learn?

Consistent API

Every estimator has fit(), predict(), and score(). Learn once, use everywhere.

Batteries Included

50+ algorithms, preprocessing tools, model selection utilities, and sample datasets ready to use.

Excellent Docs

Comprehensive documentation with examples, tutorials, and a user guide that explains concepts.

Installing Scikit-learn

Scikit-learn is typically installed via pip or conda. It's included in most data science distributions like Anaconda.

# Install via pip
# pip install scikit-learn

# Import the library
import sklearn
print(f"Scikit-learn version: {sklearn.__version__}")

# Common imports you'll use frequently
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
Code Breakdown — Line by Line
import sklearn

Imports the top-level package (individual tools still live in submodules you import separately). The __version__ attribute tells you which version is installed (useful for debugging).

from sklearn.model_selection import train_test_split

Splits your data into training and testing sets. Essential for evaluating if your model generalizes to new data.

from sklearn.preprocessing import StandardScaler

Standardizes features to have mean=0 and std=1. Many algorithms work better with normalized data.

from sklearn.linear_model import LogisticRegression

A simple but powerful classification algorithm. Great starting point for binary/multi-class problems.

from sklearn.metrics import accuracy_score, classification_report

Functions to evaluate model performance. accuracy_score gives overall accuracy, classification_report gives precision, recall, F1.
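To make these two metrics concrete, here is a minimal sketch on made-up labels (the `y_true`/`y_pred` values are invented for illustration):

```python
from sklearn.metrics import accuracy_score, classification_report

# Toy example: 5 true labels vs. 5 predictions (4 of 5 correct)
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(accuracy_score(y_true, y_pred))         # 0.8
print(classification_report(y_true, y_pred))  # precision/recall/F1 per class
```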

Sklearn Module Structure — Complete Guide

Sklearn organizes its tools into modules by purpose. Here's the complete breakdown:

model_selection

Train/test split, cross-validation, grid search

preprocessing

Scalers, encoders, transformers

linear_model

Linear/Logistic regression, Ridge, Lasso

metrics

Accuracy, F1, confusion matrix, R²

tree

Decision trees for classification & regression

ensemble

Random Forest, Gradient Boosting, AdaBoost

cluster

KMeans, DBSCAN, hierarchical clustering

datasets

Built-in datasets for practice & testing

Pro Tip: You don't need to memorize all modules! Just remember the pattern: from sklearn.<category> import <Tool>. Use autocomplete or documentation to find what you need.
Beginner Tip: Start with just 4 imports for most projects: train_test_split (split data), StandardScaler (normalize), one model class (like LogisticRegression), and accuracy_score (evaluate). Add more as you need them!
02

The Sklearn API Pattern

The beauty of scikit-learn is its consistent API. Every model (called an "estimator") follows the same pattern. Learn it once, and you can use any of the 50+ algorithms without reading new documentation.

The Universal Pattern: fit → predict → score

1

.fit(X, y)

Train the model on your data. The model learns patterns from X (features) to predict y (target).

2

.predict(X)

Make predictions on new data. Returns predicted values based on learned patterns.

3

.score(X, y)

Evaluate performance. Returns accuracy (classification) or R² (regression).

# The universal sklearn pattern
from sklearn.linear_model import LogisticRegression

Import: Every sklearn algorithm lives in a specific module. linear_model contains logistic regression, linear regression, etc. Other modules: tree, ensemble, svm.

# Step 1: Create the model (estimator)
model = LogisticRegression()

Create Model: Instantiate the estimator object. At this point it's "empty" — no learning has happened. You can pass hyperparameters like LogisticRegression(C=0.1).

# Step 2: Train on data
model.fit(X_train, y_train)

Train (fit): This is where learning happens! The model analyzes X_train (features) and y_train (labels) to find patterns. May take time for large datasets.

# Step 3: Make predictions
predictions = model.predict(X_test)

Predict: Use the trained model on new data. Returns an array of predicted class labels (0, 1, 2...). For probabilities, use predict_proba(X_test).

# Step 4: Evaluate
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2%}")

Evaluate: Compare predictions to actual labels. Returns accuracy (0-1) for classifiers, R² for regressors. :.2% formats 0.95 as "95.00%".

Code Breakdown — Understanding Each Step
Step 1 model = LogisticRegression()

Create the estimator object. This initializes the model with default settings. At this point, the model is "empty" — it hasn't learned anything yet.

You can pass hyperparameters here, like LogisticRegression(C=0.1, max_iter=1000)

Step 2 model.fit(X_train, y_train)

Train the model. This is where the "learning" happens. The model analyzes X_train (features) and y_train (labels) to find patterns.

This step may take time for large datasets or complex models.

Step 3 predictions = model.predict(X_test)

Make predictions. The trained model uses the patterns it learned to predict labels for new data (X_test). Returns an array of predicted values.

predict() returns class labels. Use predict_proba() for probability scores.

Step 4 accuracy = model.score(X_test, y_test)

Evaluate performance. Compares predictions to actual labels. For classifiers, returns accuracy (0-1). For regressors, returns R² score.

:.2% formats 0.95 as "95.00%" for readability.
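Since `predict_proba()` comes up in Step 3, here is a short sketch of the difference between hard labels and probabilities, using the Iris data (exact probability values will vary):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# predict() returns hard class labels; predict_proba() returns one
# probability per class, and each row sums to 1
print(model.predict(X[:1]))        # [0] — first sample is setosa
print(model.predict_proba(X[:1]))  # e.g. something like [[0.98 0.02 0.00]]
```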

Why This Pattern is Powerful

The same 4 lines work for any sklearn model. Want to try a different algorithm? Just change the import and class name:

DecisionTreeClassifier()

Tree-based classification

RandomForestClassifier()

Ensemble of trees

SVC()

Support Vector Machine
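As a quick sketch of "just change the import and class name": the same four lines, with a DecisionTreeClassifier swapped in (any of the classifiers above would slot into the same spot):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier  # only this import changes

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = DecisionTreeClassifier(random_state=42)  # swap in any estimator here
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # same fit/score API, different algorithm
```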

Estimators vs Transformers

Sklearn has two main types of objects: Estimators (models that predict) and Transformers (objects that transform data).

Estimators (Predictors)

Models that learn and make predictions.

  • fit(X, y) — Learn from data
  • predict(X) — Make predictions
  • score(X, y) — Evaluate performance
Examples: LogisticRegression, RandomForest, KMeans
Transformers

Objects that transform/preprocess data.

  • fit(X) — Learn parameters from data
  • transform(X) — Apply transformation
  • fit_transform(X) — Both in one step
Examples: StandardScaler, OneHotEncoder, PCA
# Transformer example: StandardScaler
from sklearn.preprocessing import StandardScaler

Import: StandardScaler is in the preprocessing module. This module contains all data transformation tools (scalers, encoders, normalizers).

scaler = StandardScaler()

Create Transformer: Like models, transformers start "empty". The scaler doesn't know mean or std yet — it will learn these from data during fit().

# fit_transform on training data (learn mean/std, then transform)
X_train_scaled = scaler.fit_transform(X_train)

Fit + Transform (Training Data):

  • fit() — Calculate mean and std from X_train
  • transform() — Apply: (X - mean) / std
  • fit_transform() — Does both in one efficient step
# transform only on test data (use learned mean/std)
X_test_scaled = scaler.transform(X_test)

Transform Only (Test Data): Critical! Only use transform() on test data, NOT fit_transform()!

We use the mean/std learned from training data. This prevents "data leakage" — test data should never influence our preprocessing parameters.

# Now X_train_scaled and X_test_scaled have mean=0, std=1

Result: Both datasets are now standardized — each feature has mean ≈ 0 and std ≈ 1. This helps many ML algorithms converge faster and perform better.

Understanding Transformers — Step by Step
scaler = StandardScaler()

Create transformer. Like models, transformers start empty. No data has been seen yet.

scaler.fit_transform(X_train)

On training data: Learn the mean & std from X_train, then immediately apply the transformation. Combines two steps in one.

scaler.transform(X_test)

On test data: Use the same mean & std learned from training. Never call fit() on test data!

Result: Both datasets are now standardized using the same scale. Each feature will have approximately mean=0 and standard deviation=1.
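You can verify the "mean ≈ 0, std ≈ 1" claim directly. A minimal sketch on synthetic data (the random features here are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50, scale=10, size=(100, 3))  # synthetic features

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# After scaling, each column has mean ≈ 0 and std ≈ 1
print(X_scaled.mean(axis=0).round(6))  # ≈ [0. 0. 0.]
print(X_scaled.std(axis=0).round(6))   # ≈ [1. 1. 1.]
```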

Critical Rule: Always fit_transform() on training data, then only transform() on test data. Never fit on test data — that's data leakage and will give you overly optimistic results!
Common Mistake vs Correct Approach
Wrong
# Fitting on test data = DATA LEAKAGE!
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

This learns different parameters for test data, defeating the purpose of evaluation.

✓ Correct
# Same parameters applied to both
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Test data is transformed using training data's statistics — proper simulation of real-world deployment.

03

Built-in Datasets

Scikit-learn comes with several toy datasets perfect for learning and experimentation. No need to download anything — they're ready to use immediately.

Classification Datasets

from sklearn.datasets import load_iris, load_wine, load_breast_cancer

Import dataset loaders. Each function loads a different built-in dataset.

# Load the famous Iris dataset
iris = load_iris()

# It's a Bunch object (like a dictionary)
print(iris.keys())
# dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

load_iris() returns a Bunch object (like a dictionary). Use .keys() to see what's inside.

# Features (X) and target (y)
X = iris.data           # Shape: (150, 4)
y = iris.target         # Shape: (150,)

.data = features (150 rows × 4 columns). .target = labels (150 values: 0, 1, or 2).

# Feature names
print(iris.feature_names)
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

.feature_names = list of column names describing what each feature represents.

# Target classes
print(iris.target_names)
# ['setosa' 'versicolor' 'virginica']

.target_names = human-readable class labels. 0='setosa', 1='versicolor', 2='virginica'.

Understanding the Bunch Object

When you load a dataset, sklearn returns a Bunch object — a dictionary-like container with several important attributes:

iris.data

Features (X) — NumPy array with shape (samples, features). This is your input data for training.

iris.target

Labels (y) — NumPy array with class labels (0, 1, 2...). This is what you're trying to predict.

iris.feature_names

Column names — List of strings describing each feature column.

iris.target_names

Class names — Human-readable names for each target class.

iris.DESCR

Description — Full documentation about the dataset's origin and attributes.

iris.frame

DataFrame — Optional pandas DataFrame (if available in the loader).
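To populate the `.frame` attribute, pass `as_frame=True` to the loader. A quick sketch:

```python
from sklearn.datasets import load_iris

# as_frame=True fills in .frame with a pandas DataFrame
iris = load_iris(as_frame=True)
df = iris.frame                 # 4 feature columns + a 'target' column
print(df.shape)                 # (150, 5)
print(df.columns.tolist()[:2])  # first two feature names
```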

load_iris()

150 samples, 4 features, 3 classes

Classic beginner dataset. Predict flower species from petal/sepal measurements.

load_wine()

178 samples, 13 features, 3 classes

Wine recognition. Classify wines by chemical analysis (alcohol, ash, etc.).

load_breast_cancer()

569 samples, 30 features, 2 classes

Binary classification. Predict malignant vs benign tumors.

Regression Datasets

from sklearn.datasets import load_diabetes, fetch_california_housing

Import: load_diabetes comes bundled with sklearn. fetch_california_housing downloads from the internet on first use (larger dataset).

# Diabetes dataset (regression)
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

Load Diabetes: 442 samples, 10 features (age, BMI, blood pressure...). Returns a Bunch object, then extract features (X) and target (y) separately.

print(f"Features: {diabetes.feature_names}")
print(f"Target: disease progression (continuous)")

Explore Data: Check feature names. Target is disease progression one year after baseline — a continuous value (not categories like 0/1).

# California Housing (larger dataset)
housing = fetch_california_housing()
X, y = housing.data, housing.target

Load Housing: 20,640 samples, 8 features (income, house age, rooms...). Note: fetch_* downloads data on first use — may take a few seconds.

print(f"Samples: {X.shape[0]}, Features: {X.shape[1]}")
print(f"Target: median house value")

Explore Size: X.shape[0] = number of samples, X.shape[1] = number of features. Target is median house value in $100,000s.

Regression vs Classification Datasets

The key difference: regression datasets have continuous target values (numbers like price, score), while classification has discrete categories (0, 1, 2).

load_diabetes()

442 samples, 10 features (age, BMI, blood pressure...)

Target: Disease progression one year after baseline — a continuous measure.

fetch_california_housing()

20,640 samples, 8 features (income, age, rooms...)

Target: Median house value in $100,000s — predict housing prices.

Note: fetch_* functions download data from the internet on first use. load_* functions use bundled data — no download needed.

Generating Synthetic Datasets

Need a custom dataset for testing? Sklearn can generate synthetic data with known properties.

from sklearn.datasets import make_classification, make_regression, make_blobs

Import generators for classification, regression, and clustering data.

# Generate classification data
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,    # Features actually useful for classification
    n_classes=2,
    random_state=42
)
make_classification() Parameters
n_samples=1000

Total data points

n_features=20

Total columns

n_informative=10

Useful features

n_classes=2

Binary classification

# Generate regression data
X, y = make_regression(
    n_samples=1000,
    n_features=10,
    noise=10,            # Add some noise
    random_state=42
)
make_regression() Parameters
n_samples=1000

Number of data points

n_features=10

Number of input features

noise=10

Random noise (higher = harder)

# Generate clustering data (blobs)
X, y = make_blobs(
    n_samples=500,
    n_features=2,
    centers=4,           # 4 clusters
    random_state=42
)
make_blobs() Parameters
n_samples=500

Total points across all clusters

n_features=2

2D data (easy to visualize!)

centers=4

Number of clusters to create

Synthetic Dataset Parameters Explained
make_classification() — For Classification Tasks
n_samples

Total number of data points to generate

n_features

Total columns (informative + redundant + random)

n_informative

Features that actually help predict the class

n_classes

Number of target categories (2 = binary)

make_regression() — For Regression Tasks
noise=10

Random noise added to targets (higher = harder task)

n_informative

Features with actual relationship to target

bias

Intercept value added to all targets

make_blobs() — For Clustering Tasks
centers=4

Number of clusters/groups to create

cluster_std

Spread of each cluster (higher = more overlap)

random_state=42

Seed for reproducibility (always set this!)

When to use synthetic data: Testing algorithms, understanding model behavior, creating reproducible examples, or when you need specific data properties (number of features, noise level, etc.).
Beginner Tip: Synthetic data is great for learning because you know the ground truth. Start with 2 informative features and 2 classes, visualize the data with matplotlib, then train a model — you can literally see what the algorithm is learning!
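Putting that tip into practice, a minimal sketch: generate 2 informative features and 2 classes, then train a model on them (the sample counts and seed are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 2 informative features, 2 classes — simple enough to plot and inspect
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

model = LogisticRegression().fit(X, y)
print(f"Training accuracy: {model.score(X, y):.2%}")

# To see the data: plt.scatter(X[:, 0], X[:, 1], c=y) with matplotlib
```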
04

Your First Model

Let's build a complete machine learning model from start to finish. We'll use the Iris dataset to classify flower species using Logistic Regression.

Complete Workflow Example

# Step 1: Import everything we need
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
Step 1: Import Libraries

Import all tools we need: load_iris for data, train_test_split to split data, StandardScaler to normalize, LogisticRegression as our model, and metrics to evaluate results.

# Step 2: Load the data
iris = load_iris()
X, y = iris.data, iris.target
print(f"Dataset shape: {X.shape}")  # (150, 4)
Step 2: Load Dataset

load_iris() returns a Bunch object. We unpack it: X = features (150 samples × 4 features), y = target labels (0, 1, or 2 for each flower species).

# Step 3: Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,       # 20% for testing
    random_state=42,     # Reproducibility
    stratify=y           # Keep class balance
)
print(f"Training: {X_train.shape[0]}, Test: {X_test.shape[0]}")
Step 3: Split Data
test_size=0.2

20% for testing (30 samples), 80% for training (120 samples)

random_state=42

Seed for reproducibility — same split every run

stratify=y

Maintains class proportions in both sets

# Step 4: Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Step 4: Scale Features

fit_transform(X_train) learns mean/std from training data and transforms it. transform(X_test) uses the same parameters — never fit on test data!

# Step 5: Create and train the model
model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)
Step 5: Train Model

Create model object, then call .fit() with training data. The model analyzes 120 samples and learns decision boundaries to separate the 3 flower species.

# Step 6: Make predictions
y_pred = model.predict(X_test_scaled)
Step 6: Make Predictions

.predict() takes the 30 test samples and returns predicted class labels (0, 1, or 2). The model uses patterns learned during training to make these predictions.

# Step 7: Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Step 7: Evaluate Results

accuracy_score() compares predictions to actual labels. classification_report() shows precision, recall, and F1 for each class.

Expected result: ~97-100% accuracy. Iris is a simple dataset!

Complete Code Breakdown — What Each Part Does
Step 1: Imports
  • load_iris — Built-in dataset function
  • train_test_split — Splits data into train/test
  • StandardScaler — Normalizes features
  • LogisticRegression — Our classifier model
  • accuracy_score, classification_report — Evaluation tools
Step 2: Load Data

X, y = iris.data, iris.target

This "unpacks" the Bunch object. X gets all feature columns (150 rows × 4 columns), y gets the labels (150 integers: 0, 1, or 2).

Step 3: Train-Test Split — Key Parameters
test_size=0.2

20% of data for testing = 30 samples. 80% for training = 120 samples. Common ratios: 80/20, 70/30.

random_state=42

Seed for random splitting. Same seed = same split every time. Essential for reproducibility!

stratify=y

Maintains class proportions. If data is 33% each class, both train and test will be 33% each class.
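You can check the stratification claim with `np.bincount`. Iris has 50 samples per class, so an 80/20 stratified split lands exactly 40 per class in training and 10 per class in test:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# With stratify=y, both splits keep the original class balance
print(np.bincount(y_tr))  # [40 40 40]
print(np.bincount(y_te))  # [10 10 10]
```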

Step 4: Scaling

fit_transform on train learns mean/std, then transforms. transform on test uses those same values.

Step 5: Training

model.fit(X_train_scaled, y_train) — The model analyzes 120 samples and learns decision boundaries.

Step 6-7: Predict & Evaluate

predict() returns class labels. classification_report() shows precision, recall, F1 per class.

Expected Output:
Dataset shape: (150, 4)
Training: 120, Test: 30

Accuracy: 100.00%

Classification Report:
              precision    recall  f1-score   support
      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00        10
   virginica       1.00      1.00      1.00        10
    accuracy                           1.00        30

The Iris dataset is simple enough that many models achieve ~97-100% accuracy.

Step-by-Step Breakdown
Step 1-2
Import & Load

Import necessary modules and load the dataset. X = features, y = labels.

Step 3
Train-Test Split

stratify=y ensures each set has same class proportions.

Step 4
Scale Features

StandardScaler normalizes features. Fit on train, transform both.

Step 5
Train Model

model.fit() learns patterns from training data.

Step 6
Predict

model.predict() generates predictions on test data.

Step 7
Evaluate

Compare predictions to actual labels. Report shows precision, recall, F1.

Trying Different Models

The beauty of sklearn: switch models with just one line change!

# Try different classifiers
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

Import Multiple Classifiers: Each algorithm lives in its own module. tree for decision trees, ensemble for forests, svm for support vector machines, neighbors for KNN.

models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': SVC(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier()
}

Create Models Dictionary: Store models in a dictionary for easy comparison:

  • Keys: Human-readable names for display
  • Values: Actual model objects ready to train
  • random_state=42 — Ensures reproducible results (same results every run)
print("Model Comparison:")
print("-" * 40)

Print Header: Simple formatting. "-" * 40 creates a line of 40 dashes for visual separation in output.

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    accuracy = model.score(X_test_scaled, y_test)
    print(f"{name:25} Accuracy: {accuracy:.2%}")

Loop and Compare: The magic of sklearn's consistent API!

  • models.items() — Get both name and model from dictionary
  • model.fit(...) — Same method works for ALL models
  • model.score(...) — Same method returns accuracy for ALL models
  • {name:25} — Format name with 25 character width for alignment
  • {accuracy:.2%} — Display as percentage with 2 decimal places
Understanding the Model Comparison Code
models = {...}

A dictionary storing model names as keys and model objects as values. This lets us loop through all models easily.

for name, model in models.items():

Loop through each model. name = string like "Random Forest", model = the actual classifier object.

model.fit(...) → model.score(...)

Same two methods work for every model! This is the power of sklearn's consistent API. The loop trains 5 different algorithms with identical code.

Quick Guide to These Models
LogisticRegression

Best for: Binary/multi-class, linearly separable data. Fast, interpretable. Start here!

DecisionTreeClassifier

Best for: Interpretable models, non-linear relationships. Prone to overfitting.

RandomForestClassifier

Best for: General purpose, handles non-linearity. Ensemble of trees = more robust.

SVC

Best for: Complex decision boundaries. Works well with scaling. Slower on large data.

KNeighborsClassifier

Best for: Simple concept, no training. Classifies by nearest neighbors. Slow on large data.

Tip: No model is "best" — it depends on your data! Always try multiple.

Same Pattern, Different Algorithms: Notice how the code structure is identical for every model. This is the power of sklearn's consistent API — experiment freely without learning new syntax!
05

Preprocessing & Pipelines

Real-world ML involves multiple preprocessing steps before training. Sklearn Pipelines chain these steps together, preventing data leakage and making code cleaner.

Common Preprocessing Steps

Scaling
from sklearn.preprocessing import (
    StandardScaler,   # Mean=0, Std=1
    MinMaxScaler,     # Range [0, 1]
    RobustScaler      # Robust to outliers
)

# StandardScaler (most common)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

StandardScaler: Best default choice. MinMaxScaler: When you need bounded values. RobustScaler: When data has outliers.

Encoding
from sklearn.preprocessing import (
    LabelEncoder,      # For target labels
    OneHotEncoder,     # Nominal categories
    OrdinalEncoder     # Ordinal categories
)

# OneHotEncoder for categorical features
encoder = OneHotEncoder(sparse_output=False)
X_encoded = encoder.fit_transform(X_cat)

OneHotEncoder: For categories with no order (colors, countries). OrdinalEncoder: For ordered categories (low/medium/high).

When to Use Which Scaler?
StandardScaler

Use when: data is roughly normally distributed. Algorithms: Logistic Regression, SVM, Neural Networks.

MinMaxScaler

Use when: you need values in a specific range (0-1). Algorithms: KNN, Neural Networks with sigmoid.

RobustScaler

Use when: data has many outliers. Uses median/IQR instead of mean/std.
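To see why outliers matter, a small sketch comparing the two scalers on made-up data with one extreme value:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# One feature with an extreme outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

std_scaled = StandardScaler().fit_transform(X)
rob_scaled = RobustScaler().fit_transform(X)

# The outlier inflates StandardScaler's mean/std, squashing the normal
# points together; RobustScaler (median/IQR) keeps them spread out
print(std_scaled.ravel().round(2))
print(rob_scaled.ravel().round(2))  # median value 3.0 maps to exactly 0
```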

The Pipeline Solution

A Pipeline chains preprocessing and model into a single object. This ensures consistent preprocessing during training and prediction.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

Import: Pipeline chains steps together. StandardScaler preprocesses data. LogisticRegression is our classifier. All work together!

# Create a pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),         # Step 1: Scale
    ('classifier', LogisticRegression())  # Step 2: Classify
])

Create Pipeline: Pass a list of (name, object) tuples:

  • 'scaler' — Your custom name for this step (used to access it later)
  • StandardScaler() — The transformer/estimator object
  • Order matters! Each step's output becomes the next step's input
# Use it like any other model!
pipe.fit(X_train, y_train)              # Fits scaler, then classifier

Fit Pipeline: Internally this does:

  1. scaler.fit_transform(X_train) — Learn scaling params & transform
  2. classifier.fit(X_train_scaled, y_train) — Train model on scaled data
predictions = pipe.predict(X_test)       # Scales X_test, then predicts

Predict: Automatically does:

  1. scaler.transform(X_test) — Use learned params (NOT fit_transform!)
  2. classifier.predict(X_test_scaled) — Get predictions
accuracy = pipe.score(X_test, y_test)    # All in one

print(f"Pipeline Accuracy: {accuracy:.2%}")

Score: Scales test data and evaluates accuracy — all in one method call. The pipeline handles everything internally!

Pipeline Code Breakdown — How It Works
Pipeline([('name', Object()), ...])

A list of (name, transformer/estimator) tuples. Names are required — they're used to access steps later.

('scaler', StandardScaler())

"scaler" is the name (your choice). StandardScaler() is the transformer object. You can access it via pipe.named_steps['scaler'].

What Happens During pipe.fit(X_train, y_train):
  1. scaler.fit_transform(X_train) — Learn scaling params, transform training data
  2. Pass scaled data to next step
  3. classifier.fit(X_train_scaled, y_train) — Train the model on scaled data
What Happens During pipe.predict(X_test):
  1. scaler.transform(X_test) — Use saved params (no fit!)
  2. Pass scaled data to next step
  3. classifier.predict(X_test_scaled) — Get predictions
Why Use Pipelines?
No Data Leakage: Scaler only sees training data during fit
Cleaner Code: One object does everything
Easy Deployment: Save entire pipeline with joblib
Grid Search Compatible: Tune preprocessing params too
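The "easy deployment" point can be sketched with joblib (the filename `iris_pipeline.joblib` is an arbitrary choice): the saved file carries both the fitted scaler and the fitted model.

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('scaler', StandardScaler()),
                 ('classifier', LogisticRegression())])
pipe.fit(X, y)

# Save scaler + model as one file, reload, predict — no manual
# preprocessing steps to re-apply at deployment time
joblib.dump(pipe, 'iris_pipeline.joblib')
loaded = joblib.load('iris_pipeline.joblib')
print(loaded.predict(X[:3]))  # same predictions as the original pipeline
```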

ColumnTransformer for Mixed Data

Real datasets have both numerical and categorical columns. ColumnTransformer applies different transformations to different columns.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

Import All Tools: ColumnTransformer handles different columns differently, StandardScaler for numbers, OneHotEncoder for categories, Pipeline to chain everything, RandomForestClassifier as our model.

# Define column groups
numerical_cols = ['age', 'income', 'balance']
categorical_cols = ['gender', 'country', 'occupation']

Define Column Groups: List your numeric columns (numbers like age, income) and categorical columns (text labels like gender, country). This tells ColumnTransformer which columns get which treatment.

# Create preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ]
)

Create ColumnTransformer: The transformers parameter takes a list of tuples:

  • ('num', StandardScaler(), numerical_cols) — Name it 'num', apply StandardScaler, to numeric columns
  • ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols) — Name it 'cat', one-hot encode category columns
  • handle_unknown='ignore' — If test data has unknown categories, ignore them (don't crash)
# Full pipeline
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

Complete Pipeline: Combine the preprocessor (which handles all column transformations) with a classifier into one pipeline. Now you can do full_pipeline.fit(X_train, y_train) and it handles everything automatically!

# Now fit on raw data with mixed types!
full_pipeline.fit(X_train, y_train)
accuracy = full_pipeline.score(X_test, y_test)
ColumnTransformer — Line by Line Explanation
numerical_cols = ['age', 'income', 'balance']

Define which columns contain numbers. These will be scaled with StandardScaler.

categorical_cols = ['gender', 'country', ...]

Define which columns are categories. These will be one-hot encoded.

Understanding the transformers=[] Argument:
('num', StandardScaler(), numerical_cols)
  • 'num' — Name for this transformer (your choice)
  • StandardScaler() — The transformer to apply
  • numerical_cols — List of column names to transform
The Complete Flow:

1. ColumnTransformer splits data by columns → 2. Applies StandardScaler to numerical → 3. Applies OneHotEncoder to categorical → 4. Concatenates results → 5. Passes to RandomForestClassifier

Pro Tip: Use handle_unknown='ignore' in OneHotEncoder so the model doesn't crash if test data has categories not seen during training.
Beginner Tip: ColumnTransformer + Pipeline is the real-world pattern. Raw data goes in, predictions come out. No manual preprocessing steps to remember!
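To make the raw-data-in, predictions-out pattern concrete, here is a self-contained sketch on a tiny made-up DataFrame (the column names and values are invented for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up dataset: one numeric and one categorical column
df = pd.DataFrame({
    'age':     [22, 35, 47, 51, 29, 40],
    'country': ['US', 'DE', 'US', 'FR', 'DE', 'FR'],
})
y = [0, 1, 1, 1, 0, 1]

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['country']),
])
pipe = Pipeline([('preprocessor', preprocessor),
                 ('classifier', RandomForestClassifier(random_state=42))])

pipe.fit(df, y)          # raw mixed-type DataFrame in...
print(pipe.predict(df))  # ...predictions out
```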
06

Model Selection

Sklearn's model_selection module provides tools for splitting data, cross-validation, and hyperparameter tuning. These are essential for building robust models.

Cross-Validation

from sklearn.model_selection import cross_val_score, KFold

model = LogisticRegression()

Import CV tools and create a model. Cross-validation will test this model on multiple train/test splits.

# Simple 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")
5-Fold Cross-Validation
cv=5

Split into 5 parts, train on 4, test on 1, rotate

scoring='accuracy'

Metric to measure (accuracy, f1, roc_auc...)

scores.mean() ± std*2

Mean ± 2 standard deviations — a rough spread of the fold scores

# Custom KFold
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)
print(f"10-Fold CV Mean: {scores.mean():.3f}")
Custom KFold Parameters
n_splits=10

10 folds = more reliable estimate

shuffle=True

Randomize data order first

random_state=42

Reproducible shuffle

Cross-Validation — What It Does
cv=5

5-Fold CV: Splits data into 5 parts. Trains on 4, tests on 1, rotates 5 times. Returns 5 scores — one for each fold.

scoring='accuracy'

Metric to optimize: Options include 'accuracy', 'f1', 'precision', 'recall', 'roc_auc', 'neg_mean_squared_error', etc.

scores.mean() (+/- scores.std()*2)

Standard format: Mean score ± 2 standard deviations. This gives you a range where ~95% of results fall. A stable model has low std.

KFold Parameters:
n_splits=10

Number of folds. More folds = more reliable but slower.

shuffle=True

Randomize before splitting. Essential if data is ordered!

random_state=42

Seed for reproducibility when shuffling.

Why Cross-Validation? A single train/test split might be lucky or unlucky. CV tests on multiple test sets and averages results — giving you a more reliable performance estimate.
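If you want more than one metric per run, sklearn's cross_validate extends the cross_val_score pattern shown above. Here's a minimal sketch on the iris dataset (the metric names are standard sklearn scoring strings):

```python
# Sketch: cross_validate computes several metrics in a single CV run,
# returning a dict with one array of fold scores per metric.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

results = cross_validate(model, X, y, cv=5,
                         scoring=["accuracy", "f1_macro"])

# One score per fold, keyed as 'test_<metric>'
print(f"Accuracy: {results['test_accuracy'].mean():.3f}")
print(f"Macro F1: {results['test_f1_macro'].mean():.3f}")
```

This is handy when accuracy alone would hide class imbalance — you get accuracy and F1 from the same folds, so the numbers are directly comparable.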

Hyperparameter Tuning with GridSearchCV

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

Import GridSearchCV for hyperparameter tuning and RandomForest as our model to tune.

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10]
}
Parameter Grid — Values to Try
n_estimators

Number of trees: 50, 100, or 200

max_depth

Tree depth: 5, 10, 20, or unlimited

min_samples_split

Minimum samples to split: 2, 5, or 10

Total: 3 × 4 × 3 = 36 combinations to try!

# Create grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,           # Use all CPU cores
    verbose=1
)
GridSearchCV Parameters
cv=5

5-fold CV per combo

scoring='accuracy'

Metric to optimize

n_jobs=-1

Use all CPU cores

verbose=1

Show progress

# Fit to find best parameters
grid_search.fit(X_train, y_train)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.3f}")
Finding Best Parameters
.best_params_

Dictionary with optimal parameter values found

.best_score_

Best cross-validation score achieved

# Use best model
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(f"Test Accuracy: {test_accuracy:.3f}")
Using the Best Model

.best_estimator_ is the actual trained model with optimal parameters. Use it directly for predictions on new data!

GridSearchCV — Complete Code Breakdown
Step 1: Define the Parameter Grid
'n_estimators': [50, 100, 200]

Number of trees in the forest. More trees = slower but often better.

'max_depth': [5, 10, 20, None]

Max tree depth. None = unlimited. Shallower = less overfitting.

'min_samples_split': [2, 5, 10]

Minimum samples to split a node. Higher = more regularization.

Step 2: GridSearchCV Parameters
cv=5

5-fold cross-validation for each combination.

scoring='accuracy'

Metric to optimize. Pick based on your problem.

n_jobs=-1

Use all CPU cores for parallel training.

verbose=1

Print progress (0=silent, 1=progress, 2=detailed).

Step 3: Access Results
grid_search.best_params_

Dictionary of best parameter values found.

grid_search.best_score_

Best cross-validation score achieved.

grid_search.best_estimator_

The actual model object with best params.

Total combinations: 3 × 4 × 3 = 36 parameter combinations. With 5-fold CV: 36 × 5 = 180 models trained!
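When the grid gets large, trying every combination becomes expensive. RandomizedSearchCV is the common alternative: it samples a fixed number of combinations from the grid (or from distributions). A minimal sketch on iris, with an illustrative parameter space:

```python
# Sketch: RandomizedSearchCV samples n_iter combinations instead of
# exhaustively trying every one — much cheaper for large grids.
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

param_distributions = {
    "n_estimators": randint(50, 201),   # sample integers in [50, 200]
    "max_depth": [5, 10, 20, None],     # lists are sampled uniformly
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=10,            # only 10 sampled combos, not the full grid
    cv=5,                 # still 5-fold CV per combo: 10 x 5 = 50 fits
    random_state=42,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```

The results API is the same as GridSearchCV: .best_params_, .best_score_, and .best_estimator_ all work identically.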

Saving and Loading Models

import joblib

Import joblib: This library saves Python objects to disk and is faster than pickle for large NumPy arrays (which sklearn models contain internally). It's installed automatically as a scikit-learn dependency.

# Save model to file
joblib.dump(best_model, 'random_forest_model.joblib')

Save Model: joblib.dump(object, filename) serializes your trained model to a file. The .joblib extension is convention. All learned parameters are saved!

# Load model later
loaded_model = joblib.load('random_forest_model.joblib')

Load Model: joblib.load(filename) reads the file and recreates your model object. No retraining needed — it's ready to use immediately!

# Use loaded model
predictions = loaded_model.predict(X_new)

Use Loaded Model: Call .predict() on the loaded model just like normal. It works exactly like the original — all methods (predict, predict_proba, score) are available.

# Save entire pipeline (recommended!)
joblib.dump(full_pipeline, 'complete_pipeline.joblib')

Save Entire Pipeline: Best practice! Saving a pipeline includes all preprocessing steps (scalers, encoders). When you load it, you can directly pass raw data — the pipeline handles everything.

Model Persistence — Code Breakdown
joblib.dump(model, 'filename.joblib')

Save to disk. Serializes the model object and all its learned parameters to a file. Works with any sklearn object.

loaded_model = joblib.load('filename.joblib')

Load from disk. Deserializes the file back into a Python object. Ready to use immediately — no retraining needed!

Why joblib over pickle?
  • Faster for large NumPy arrays (common in ML models)
  • Optional compression (e.g. compress=3) for smaller files
  • Recommended by the sklearn docs for persisting models
Always Save Pipelines: Saving just the model means you need to manually preprocess new data. Saving the entire pipeline ensures preprocessing is included — just call pipeline.predict(raw_data)!
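The save/load round trip can be verified end to end. This self-contained sketch fits a small pipeline, persists it to a temporary file (the filename is illustrative), and checks that the loaded copy behaves identically:

```python
# Sketch: joblib round trip — dump a fitted pipeline, reload it,
# and confirm the loaded copy predicts exactly like the original.
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")
joblib.dump(pipe, path)      # serializes scaler AND model together
loaded = joblib.load(path)   # deserializes — no retraining needed

# The loaded pipeline is a faithful copy of the original
assert (loaded.predict(X) == pipe.predict(X)).all()
print("Loaded pipeline matches original")
```

Because the scaler is inside the pipeline, the loaded object accepts raw, unscaled features — exactly the "always save pipelines" advice above.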
Complete Deployment Workflow
1

Train
Build and validate your pipeline

2

Save
joblib.dump(pipe, 'model.joblib')

3

Deploy
Copy .joblib file to server

4

Predict
pipe.predict(new_data)

Practice: Sklearn Basics

Task: Load the wine dataset, split it 80/20, train a LogisticRegression model, and print the accuracy on test data.

Show Solution
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.2%}")

Task: Using the iris dataset, compare LogisticRegression, DecisionTree, and RandomForest using 5-fold cross-validation. Print mean and std for each.

Show Solution
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
models = {
    'Logistic Regression': LogisticRegression(max_iter=200),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier()
}

for name, model in models.items():
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")

Task: Create a Pipeline with StandardScaler and RandomForest. Use GridSearchCV to find the best n_estimators (50, 100, 200) and max_depth (5, 10, None). Print best parameters and test score.

Show Solution
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier(random_state=42))
])

param_grid = {
    'rf__n_estimators': [50, 100, 200],
    'rf__max_depth': [5, 10, None]
}

grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

print(f"Best params: {grid.best_params_}")
print(f"Test score: {grid.score(X_test, y_test):.3f}")

Key Takeaways

Consistent API

Every sklearn model follows fit() → predict() → score(). Learn once, use everywhere across 50+ algorithms.

Estimators vs Transformers

Estimators predict (predict()), Transformers preprocess (transform()). Both use fit() to learn from data.

Prevent Data Leakage

Always fit_transform() on train, then only transform() on test. Never fit preprocessors on test data!

Use Pipelines

Chain preprocessing and models into a Pipeline. It prevents leakage, simplifies code, and makes deployment easier.

Tune with GridSearchCV

Don't guess hyperparameters. Use GridSearchCV or RandomizedSearchCV to systematically find the best settings.

Save Complete Pipelines

Use joblib.dump(pipeline) to save everything. Loading later means you can predict on raw data immediately.

Knowledge Check

Test your understanding of Scikit-learn with this quick quiz.

Question 1 of 6

What method is used to train a model in scikit-learn?

Question 2 of 6

What does fit_transform() do on a StandardScaler?

Question 3 of 6

Which is the correct way to preprocess test data?

Question 4 of 6

What does stratify=y do in train_test_split?

Question 5 of 6

What is the main benefit of using a Pipeline?

Question 6 of 6

Which attribute gives you the best model from GridSearchCV?
