What is Scikit-learn?
Scikit-learn (sklearn) is Python's most widely used machine learning library. It provides simple and efficient tools for data analysis and modeling, built on NumPy, SciPy, and Matplotlib. Whether you're building a spam classifier, predicting house prices, or clustering customers, sklearn has you covered.
Scikit-learn is an open-source machine learning library that provides a consistent interface for dozens of ML algorithms. From simple linear regression to complex ensemble methods, all models follow the same fit() → predict() → score() pattern.
Why it matters: Once you learn the sklearn API, you can easily switch between algorithms without learning new syntax. This makes experimentation fast and code maintainable.
Why Scikit-learn?
Consistent API
Every estimator has fit(), predict(), and score(). Learn once, use everywhere.
Batteries Included
50+ algorithms, preprocessing tools, model selection utilities, and sample datasets ready to use.
Excellent Docs
Comprehensive documentation with examples, tutorials, and a user guide that explains concepts.
Installing Scikit-learn
Scikit-learn is typically installed via pip or conda. It's included in most data science distributions like Anaconda.
# Install via pip
# pip install scikit-learn
# Import the library
import sklearn
print(f"Scikit-learn version: {sklearn.__version__}")
# Common imports you'll use frequently
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
Code Breakdown — Line by Line
import sklearn
Imports the entire library. The __version__ attribute tells you which version is installed (useful for debugging).
from sklearn.model_selection import train_test_split
Splits your data into training and testing sets. Essential for evaluating if your model generalizes to new data.
from sklearn.preprocessing import StandardScaler
Standardizes features to have mean=0 and std=1. Many algorithms work better with normalized data.
from sklearn.linear_model import LogisticRegression
A simple but powerful classification algorithm. Great starting point for binary/multi-class problems.
from sklearn.metrics import accuracy_score, classification_report
Functions to evaluate model performance. accuracy_score gives overall accuracy, classification_report gives precision, recall, F1.
Sklearn Module Structure — Complete Guide
Sklearn organizes its tools into modules by purpose. Here's the complete breakdown:
model_selection
Train/test split, cross-validation, grid search
preprocessing
Scalers, encoders, transformers
linear_model
Linear/Logistic regression, Ridge, Lasso
metrics
Accuracy, F1, confusion matrix, R²
tree
Decision trees for classification & regression
ensemble
Random Forest, Gradient Boosting, AdaBoost
cluster
KMeans, DBSCAN, hierarchical clustering
datasets
Built-in datasets for practice & testing
Imports always follow the pattern from sklearn.<category> import <Tool>. Use autocomplete or the documentation to find what you need.
A minimal starter kit: train_test_split (split data), StandardScaler (normalize), one model class (like LogisticRegression), and accuracy_score (evaluate). Add more as you need them!
The Sklearn API Pattern
The beauty of scikit-learn is its consistent API. Every model (called an "estimator") follows the same pattern. Learn it once, and you can use any of the 50+ algorithms without reading new documentation.
The Universal Pattern: fit → predict → score
.fit(X, y)
Train the model on your data. The model learns patterns from X (features) to predict y (target).
.predict(X)
Make predictions on new data. Returns predicted values based on learned patterns.
.score(X, y)
Evaluate performance. Returns accuracy (classification) or R² (regression).
# The universal sklearn pattern
from sklearn.linear_model import LogisticRegression
Import: Every sklearn algorithm lives in a specific module. linear_model contains logistic regression, linear regression, etc. Other modules: tree, ensemble, svm.
# Step 1: Create the model (estimator)
model = LogisticRegression()
Create Model: Instantiate the estimator object. At this point it's "empty" — no learning has happened. You can pass hyperparameters like LogisticRegression(C=0.1).
# Step 2: Train on data
model.fit(X_train, y_train)
Train (fit): This is where learning happens! The model analyzes X_train (features) and y_train (labels) to find patterns. May take time for large datasets.
# Step 3: Make predictions
predictions = model.predict(X_test)
Predict: Use the trained model on new data. Returns an array of predicted class labels (0, 1, 2...). For probabilities, use predict_proba(X_test).
# Step 4: Evaluate
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2%}")
Evaluate: Compare predictions to actual labels. Returns accuracy (0-1) for classifiers, R² for regressors. :.2% formats 0.95 as "95.00%".
Code Breakdown — Understanding Each Step
model = LogisticRegression()
Create the estimator object. This initializes the model with default settings. At this point, the model is "empty" — it hasn't learned anything yet.
You can pass hyperparameters here, like LogisticRegression(C=0.1, max_iter=1000)
model.fit(X_train, y_train)
Train the model. This is where the "learning" happens. The model analyzes X_train (features) and y_train (labels) to find patterns.
This step may take time for large datasets or complex models.
predictions = model.predict(X_test)
Make predictions. The trained model uses the patterns it learned to predict labels for new data (X_test). Returns an array of predicted values.
predict() returns class labels. Use predict_proba() for probability scores.
accuracy = model.score(X_test, y_test)
Evaluate performance. Compares predictions to actual labels. For classifiers, returns accuracy (0-1). For regressors, returns R² score.
:.2% formats 0.95 as "95.00%" for readability.
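The four steps above can be run end-to-end on a built-in dataset. A minimal, self-contained sketch (max_iter is raised here because the features are left unscaled):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load a built-in dataset and split it
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)  # Step 1: create the estimator
model.fit(X_train, y_train)                # Step 2: learn from training data
predictions = model.predict(X_test)        # Step 3: predicted class labels
accuracy = model.score(X_test, y_test)     # Step 4: mean accuracy on test set
print(f"Accuracy: {accuracy:.2%}")
```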
Why This Pattern is Powerful
The same 4 lines work for any sklearn model. Want to try a different algorithm? Just change the import and class name:
DecisionTreeClassifier()
Tree-based classification
RandomForestClassifier()
Ensemble of trees
SVC()
Support Vector Machine
Estimators vs Transformers
Sklearn has two main types of objects: Estimators (models that predict) and Transformers (objects that transform data).
Estimators (Predictors)
Models that learn and make predictions.
fit(X, y) — Learn from data
predict(X) — Make predictions
score(X, y) — Evaluate performance
Transformers
Objects that transform/preprocess data.
fit(X) — Learn parameters from data
transform(X) — Apply transformation
fit_transform(X) — Both in one step
# Transformer example: StandardScaler
from sklearn.preprocessing import StandardScaler
Import: StandardScaler is in the preprocessing module. This module contains all data transformation tools (scalers, encoders, normalizers).
scaler = StandardScaler()
Create Transformer: Like models, transformers start "empty". The scaler doesn't know mean or std yet — it will learn these from data during fit().
# fit_transform on training data (learn mean/std, then transform)
X_train_scaled = scaler.fit_transform(X_train)
Fit + Transform (Training Data):
fit() — Calculate mean and std from X_train
transform() — Apply (X - mean) / std
fit_transform() — Does both in one efficient step
# transform only on test data (use learned mean/std)
X_test_scaled = scaler.transform(X_test)
Transform Only (Test Data): Critical! Only use transform() on test data, NOT fit_transform()!
We use the mean/std learned from training data. This prevents "data leakage" — test data should never influence our preprocessing parameters.
# Now X_train_scaled and X_test_scaled have mean=0, std=1
Result: Both datasets are now standardized — each feature has mean ≈ 0 and std ≈ 1. This helps many ML algorithms converge faster and perform better.
Understanding Transformers — Step by Step
scaler = StandardScaler()
Create transformer. Like models, transformers start empty. No data has been seen yet.
scaler.fit_transform(X_train)
On training data: Learn the mean & std from X_train, then immediately apply the transformation. Combines two steps in one.
scaler.transform(X_test)
On test data: Use the same mean & std learned from training. Never call fit() on test data!
Result: Both datasets are now standardized using the same scale. Each feature will have approximately mean=0 and standard deviation=1.
Remember: fit_transform() on training data, then only transform() on test data.
Never fit on test data — that's data leakage and will give you overly optimistic results!
Common Mistake vs Correct Approach
# Fitting on test data = DATA LEAKAGE!
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)
This learns different parameters for test data, defeating the purpose of evaluation.
# Same parameters applied to both
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Test data is transformed using training data's statistics — proper simulation of real-world deployment.
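The correct approach can be checked with a quick sketch. The toy arrays below are made up for illustration; the assertions at the end confirm that training features end up with mean ≈ 0 and std ≈ 1, and that the test set is scaled with the training statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50, scale=10, size=(100, 3))  # toy training features
X_test = rng.normal(loc=50, scale=10, size=(20, 3))    # toy test features

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std, then transforms
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std

print(X_train_scaled.mean(axis=0).round(6))  # ~[0, 0, 0]
print(X_train_scaled.std(axis=0).round(6))   # ~[1, 1, 1]
# The test set reuses the training statistics, so its mean is only
# approximately zero -- exactly as it should be.
```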
Built-in Datasets
Scikit-learn comes with several toy datasets perfect for learning and experimentation. No need to download anything — they're ready to use immediately.
Classification Datasets
from sklearn.datasets import load_iris, load_wine, load_breast_cancer
Import dataset loaders. Each function loads a different built-in dataset.
# Load the famous Iris dataset
iris = load_iris()
# It's a Bunch object (like a dictionary)
print(iris.keys())
# dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
load_iris() returns a Bunch object (like a dictionary). Use .keys() to see what's inside.
# Features (X) and target (y)
X = iris.data # Shape: (150, 4)
y = iris.target # Shape: (150,)
.data = features (150 rows × 4 columns). .target = labels (150 values: 0, 1, or 2).
# Feature names
print(iris.feature_names)
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
.feature_names = list of column names describing what each feature represents.
# Target classes
print(iris.target_names)
# ['setosa' 'versicolor' 'virginica']
.target_names = human-readable class labels. 0='setosa', 1='versicolor', 2='virginica'.
Understanding the Bunch Object
When you load a dataset, sklearn returns a Bunch object — a dictionary-like container with several important attributes:
iris.data
Features (X) — NumPy array with shape (samples, features). This is your input data for training.
iris.target
Labels (y) — NumPy array with class labels (0, 1, 2...). This is what you're trying to predict.
iris.feature_names
Column names — List of strings describing each feature column.
iris.target_names
Class names — Human-readable names for each target class.
iris.DESCR
Description — Full documentation about the dataset's origin and attributes.
iris.frame
DataFrame — Optional pandas DataFrame (if available in the loader).
load_iris()
150 samples, 4 features, 3 classes
Classic beginner dataset. Predict flower species from petal/sepal measurements.
load_wine()
178 samples, 13 features, 3 classes
Wine recognition. Classify wines by chemical analysis (alcohol, ash, etc.).
load_breast_cancer()
569 samples, 30 features, 2 classes
Binary classification. Predict malignant vs benign tumors.
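The three dataset cards above are easy to verify in a couple of lines; each loader returns a Bunch whose shapes match the figures listed:

```python
from sklearn.datasets import load_iris, load_wine, load_breast_cancer

# Print (samples, features) and number of classes for each dataset
for loader in (load_iris, load_wine, load_breast_cancer):
    data = loader()
    print(loader.__name__, data.data.shape, len(data.target_names))
```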
Regression Datasets
from sklearn.datasets import load_diabetes, fetch_california_housing
Import: load_diabetes comes bundled with sklearn. fetch_california_housing downloads the data from the internet on first use (it's a larger dataset).
# Diabetes dataset (regression)
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
Load Diabetes: 442 samples, 10 features (age, BMI, blood pressure...). Returns a Bunch object, then extract features (X) and target (y) separately.
print(f"Features: {diabetes.feature_names}")
print(f"Target: disease progression (continuous)")
Explore Data: Check feature names. Target is disease progression one year after baseline — a continuous value (not categories like 0/1).
# California Housing (larger dataset)
housing = fetch_california_housing()
X, y = housing.data, housing.target
Load Housing: 20,640 samples, 8 features (income, house age, rooms...). Note: fetch_* downloads data on first use — may take a few seconds.
print(f"Samples: {X.shape[0]}, Features: {X.shape[1]}")
print(f"Target: median house value")
Explore Size: X.shape[0] = number of samples, X.shape[1] = number of features. Target is median house value in $100,000s.
Regression vs Classification Datasets
The key difference: regression datasets have continuous target values (numbers like price, score), while classification has discrete categories (0, 1, 2).
load_diabetes()
442 samples, 10 features (age, BMI, blood pressure...)
Target: Disease progression one year after baseline — a continuous measure.
fetch_california_housing()
20,640 samples, 8 features (income, age, rooms...)
Target: Median house value in $100,000s — predict housing prices.
Note: fetch_* functions download data from the internet on first use. load_* functions use bundled data — no download needed.
Generating Synthetic Datasets
Need a custom dataset for testing? Sklearn can generate synthetic data with known properties.
from sklearn.datasets import make_classification, make_regression, make_blobs
Import generators for classification, regression, and clustering data.
# Generate classification data
X, y = make_classification(
n_samples=1000,
n_features=20,
n_informative=10, # Features actually useful for classification
n_classes=2,
random_state=42
)
make_classification() Parameters
n_samples=1000
Total data points
n_features=20
Total columns
n_informative=10
Useful features
n_classes=2
Binary classification
# Generate regression data
X, y = make_regression(
n_samples=1000,
n_features=10,
noise=10, # Add some noise
random_state=42
)
make_regression() Parameters
n_samples=1000
Number of data points
n_features=10
Number of input features
noise=10
Random noise (higher = harder)
# Generate clustering data (blobs)
X, y = make_blobs(
n_samples=500,
n_features=2,
centers=4, # 4 clusters
random_state=42
)
make_blobs() Parameters
n_samples=500
Total points across all clusters
n_features=2
2D data (easy to visualize!)
centers=4
Number of clusters to create
Synthetic Dataset Parameters Explained
make_classification() — For Classification Tasks
n_samples
Total number of data points to generate
n_features
Total columns (informative + redundant + random)
n_informative
Features that actually help predict the class
n_classes
Number of target categories (2 = binary)
make_regression() — For Regression Tasks
noise=10
Random noise added to targets (higher = harder task)
n_informative
Features with actual relationship to target
bias
Intercept value added to all targets
make_blobs() — For Clustering Tasks
centers=4
Number of clusters/groups to create
cluster_std
Spread of each cluster (higher = more overlap)
random_state=42
Seed for reproducibility (always set this!)
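All three generators can be combined into one self-contained sketch, using the same parameter values as the snippets above:

```python
from sklearn.datasets import make_classification, make_regression, make_blobs

# Classification: 20 features, only 10 of which carry signal
X_cls, y_cls = make_classification(n_samples=1000, n_features=20,
                                   n_informative=10, n_classes=2,
                                   random_state=42)
# Regression: continuous targets with added noise
X_reg, y_reg = make_regression(n_samples=1000, n_features=10,
                               noise=10, random_state=42)
# Clustering: 4 blobs in 2D, easy to visualize
X_blob, y_blob = make_blobs(n_samples=500, n_features=2,
                            centers=4, random_state=42)

print(X_cls.shape, X_reg.shape, X_blob.shape)
```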
Your First Model
Let's build a complete machine learning model from start to finish. We'll use the Iris dataset to classify flower species using Logistic Regression.
Complete Workflow Example
# Step 1: Import everything we need
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
Step 1: Import Libraries
Import all tools we need: load_iris for data, train_test_split to split data, StandardScaler to normalize, LogisticRegression as our model, and metrics to evaluate results.
# Step 2: Load the data
iris = load_iris()
X, y = iris.data, iris.target
print(f"Dataset shape: {X.shape}") # (150, 4)
Step 2: Load Dataset
load_iris() returns a Bunch object. We unpack it: X = features (150 samples × 4 features), y = target labels (0, 1, or 2 for each flower species).
# Step 3: Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% for testing
random_state=42, # Reproducibility
stratify=y # Keep class balance
)
print(f"Training: {X_train.shape[0]}, Test: {X_test.shape[0]}")
Step 3: Split Data
test_size=0.2
20% for testing (30 samples), 80% for training (120 samples)
random_state=42
Seed for reproducibility — same split every run
stratify=y
Maintains class proportions in both sets
# Step 4: Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Step 4: Scale Features
fit_transform(X_train) learns mean/std from training data and transforms it. transform(X_test) uses the same parameters — never fit on test data!
# Step 5: Create and train the model
model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)
Step 5: Train Model
Create model object, then call .fit() with training data. The model analyzes 120 samples and learns decision boundaries to separate the 3 flower species.
# Step 6: Make predictions
y_pred = model.predict(X_test_scaled)
Step 6: Make Predictions
.predict() takes the 30 test samples and returns predicted class labels (0, 1, or 2). The model uses patterns learned during training to make these predictions.
# Step 7: Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Step 7: Evaluate Results
accuracy_score() compares predictions to actual labels. classification_report() shows precision, recall, and F1 for each class.
Expected result: ~97-100% accuracy. Iris is a simple dataset!
Complete Code Breakdown — What Each Part Does
load_iris — Built-in dataset function
train_test_split — Splits data into train/test
StandardScaler — Normalizes features
LogisticRegression — Our classifier model
accuracy_score, classification_report — Evaluation tools
X, y = iris.data, iris.target
This "unpacks" the Bunch object. X gets all feature columns (150 rows × 4 columns), y gets the labels (150 integers: 0, 1, or 2).
test_size=0.2
20% of data for testing = 30 samples. 80% for training = 120 samples. Common ratios: 80/20, 70/30.
random_state=42
Seed for random splitting. Same seed = same split every time. Essential for reproducibility!
stratify=y
Maintains class proportions. If data is 33% each class, both train and test will be 33% each class.
fit_transform on train learns mean/std, then transforms. transform on test uses those same values.
model.fit(X_train_scaled, y_train) — The model analyzes 120 samples and learns decision boundaries.
predict() returns class labels. classification_report() shows precision, recall, F1 per class.
Dataset shape: (150, 4)
Training: 120, Test: 30
Accuracy: 100.00%
Classification Report:
precision recall f1-score support
setosa 1.00 1.00 1.00 10
versicolor 1.00 1.00 1.00 10
virginica 1.00 1.00 1.00 10
accuracy 1.00 30
The Iris dataset is simple enough that many models achieve ~97-100% accuracy.
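The effect of stratify=y is easy to check directly. Iris has 50 samples per class, so a stratified 80/20 split gives exactly 40 of each class for training and 10 for testing:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 50 samples of each of 3 classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(np.bincount(y_train))  # per-class counts in the training set
print(np.bincount(y_test))   # per-class counts in the test set
```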
Step-by-Step Breakdown
Import & Load
Import necessary modules and load the dataset. X = features, y = labels.
Train-Test Split
stratify=y ensures each set has same class proportions.
Scale Features
StandardScaler normalizes features. Fit on train, transform both.
Train Model
model.fit() learns patterns from training data.
Predict
model.predict() generates predictions on test data.
Evaluate
Compare predictions to actual labels. Report shows precision, recall, F1.
Trying Different Models
The beauty of sklearn: switch models with just one line change!
# Try different classifiers
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
Import Multiple Classifiers: Each algorithm lives in its own module. tree for decision trees, ensemble for forests, svm for support vector machines, neighbors for KNN.
models = {
'Logistic Regression': LogisticRegression(random_state=42),
'Decision Tree': DecisionTreeClassifier(random_state=42),
'Random Forest': RandomForestClassifier(random_state=42),
'SVM': SVC(random_state=42),
'K-Nearest Neighbors': KNeighborsClassifier()
}
Create Models Dictionary: Store models in a dictionary for easy comparison:
- Keys: Human-readable names for display
- Values: Actual model objects ready to train
random_state=42 — Ensures reproducible results (same results every run)
print("Model Comparison:")
print("-" * 40)
Print Header: Simple formatting. "-" * 40 creates a line of 40 dashes for visual separation in output.
for name, model in models.items():
model.fit(X_train_scaled, y_train)
accuracy = model.score(X_test_scaled, y_test)
print(f"{name:25} Accuracy: {accuracy:.2%}")
Loop and Compare: The magic of sklearn's consistent API!
models.items() — Get both name and model from dictionary
model.fit(...) — Same method works for ALL models
model.score(...) — Same method returns accuracy for ALL models
{name:25} — Format name with 25 character width for alignment
{accuracy:.2%} — Display as percentage with 2 decimal places
Understanding the Model Comparison Code
models = {...}
A dictionary storing model names as keys and model objects as values. This lets us loop through all models easily.
for name, model in models.items():
Loop through each model. name = string like "Random Forest", model = the actual classifier object.
model.fit(...) → model.score(...)
Same two methods work for every model! This is the power of sklearn's consistent API. The loop trains 5 different algorithms with identical code.
Quick Guide to These Models
LogisticRegression
Best for: Binary/multi-class, linearly separable data. Fast, interpretable. Start here!
DecisionTreeClassifier
Best for: Interpretable models, non-linear relationships. Prone to overfitting.
RandomForestClassifier
Best for: General purpose, handles non-linearity. Ensemble of trees = more robust.
SVC
Best for: Complex decision boundaries. Works well with scaling. Slower on large data.
KNeighborsClassifier
Best for: Simple concept, no training. Classifies by nearest neighbors. Slow on large data.
Tip: No model is "best" — it depends on your data! Always try multiple.
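Putting the comparison loop together with its own data loading makes it runnable as-is (this sketch reuses the Iris split and scaling from earlier sections):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': SVC(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(),
}
results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)     # same method for every model
    results[name] = model.score(X_test_scaled, y_test)
    print(f"{name:25} Accuracy: {results[name]:.2%}")
```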
Preprocessing & Pipelines
Real-world ML involves multiple preprocessing steps before training. Sklearn Pipelines chain these steps together, preventing data leakage and making code cleaner.
Common Preprocessing Steps
Scaling
from sklearn.preprocessing import (
StandardScaler, # Mean=0, Std=1
MinMaxScaler, # Range [0, 1]
RobustScaler # Robust to outliers
)
# StandardScaler (most common)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
StandardScaler: Best default choice. MinMaxScaler: When you need bounded values. RobustScaler: When data has outliers.
Encoding
from sklearn.preprocessing import (
LabelEncoder, # For target labels
OneHotEncoder, # Nominal categories
OrdinalEncoder # Ordinal categories
)
# OneHotEncoder for categorical features
encoder = OneHotEncoder(sparse_output=False)
X_encoded = encoder.fit_transform(X_cat)
OneHotEncoder: For categories with no order (colors, countries). OrdinalEncoder: For ordered categories (low/medium/high).
When to Use Which Scaler?
StandardScaler
Use when: data is roughly normally distributed. Algorithms: Logistic Regression, SVM, Neural Networks.
MinMaxScaler
Use when: you need values in a specific range (0-1). Algorithms: KNN, Neural Networks with sigmoid.
RobustScaler
Use when: data has many outliers. Uses median/IQR instead of mean/std.
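The difference between the three scalers shows up clearly on data with an outlier. A toy sketch (the single 1000.0 value is the outlier):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Toy feature column with one extreme outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    X_scaled = scaler.fit_transform(X)
    # StandardScaler and MinMaxScaler squash the normal values together;
    # RobustScaler (median/IQR) keeps them spread out
    print(type(scaler).__name__, X_scaled.ravel().round(3))
```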
The Pipeline Solution
A Pipeline chains preprocessing and model into a single object. This ensures consistent preprocessing during training and prediction.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
Import: Pipeline chains steps together. StandardScaler preprocesses data. LogisticRegression is our classifier. All work together!
# Create a pipeline
pipe = Pipeline([
('scaler', StandardScaler()), # Step 1: Scale
('classifier', LogisticRegression()) # Step 2: Classify
])
Create Pipeline: Pass a list of (name, object) tuples:
'scaler' — Your custom name for this step (used to access it later)
StandardScaler() — The transformer/estimator object
Order matters! Each step's output becomes the next step's input
# Use it like any other model!
pipe.fit(X_train, y_train) # Fits scaler, then classifier
Fit Pipeline: Internally this does:
scaler.fit_transform(X_train) — Learn scaling params & transform
classifier.fit(X_train_scaled, y_train) — Train model on scaled data
predictions = pipe.predict(X_test) # Scales X_test, then predicts
Predict: Automatically does:
scaler.transform(X_test) — Use learned params (NOT fit_transform!)
classifier.predict(X_test_scaled) — Get predictions
accuracy = pipe.score(X_test, y_test) # All in one
print(f"Pipeline Accuracy: {accuracy:.2%}")
Score: Scales test data and evaluates accuracy — all in one method call. The pipeline handles everything internally!
Pipeline Code Breakdown — How It Works
Pipeline([('name', Object()), ...])
A list of (name, transformer/estimator) tuples. Names are required — they're used to access steps later.
('scaler', StandardScaler())
"scaler" is the name (your choice). StandardScaler() is the transformer object. You can access it via pipe.named_steps['scaler'].
What Happens During pipe.fit(X_train, y_train):
scaler.fit_transform(X_train) — Learn scaling params, transform training data
Pass scaled data to next step
classifier.fit(X_train_scaled, y_train) — Train the model on scaled data
What Happens During pipe.predict(X_test):
scaler.transform(X_test) — Use saved params (no fit!)
Pass scaled data to next step
classifier.predict(X_test_scaled) — Get predictions
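The whole pipeline flow above fits in a short, self-contained sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ('scaler', StandardScaler()),          # step 1: scale
    ('classifier', LogisticRegression()),  # step 2: classify
])
pipe.fit(X_train, y_train)              # fit_transform scaler, then fit classifier
accuracy = pipe.score(X_test, y_test)   # transform test data, then evaluate
print(f"Pipeline Accuracy: {accuracy:.2%}")

# Individual steps stay accessible by name
print(pipe.named_steps['scaler'].mean_.round(2))
```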
Why Use Pipelines?
Pipelines prevent data leakage (test data is only ever transformed, never fit), bundle preprocessing and modeling into one object, and make your code shorter and easier to maintain.
ColumnTransformer for Mixed Data
Real datasets have both numerical and categorical columns. ColumnTransformer applies different transformations to different columns.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
Import All Tools: ColumnTransformer handles different columns differently, StandardScaler for numbers, OneHotEncoder for categories, Pipeline to chain everything, RandomForestClassifier as our model.
# Define column groups
numerical_cols = ['age', 'income', 'balance']
categorical_cols = ['gender', 'country', 'occupation']
Define Column Groups: List your numeric columns (numbers like age, income) and categorical columns (text labels like gender, country). This tells ColumnTransformer which columns get which treatment.
# Create preprocessor
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), numerical_cols),
('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
]
)
Create ColumnTransformer: The transformers parameter takes a list of tuples:
('num', StandardScaler(), numerical_cols) — Name it 'num', apply StandardScaler to the numeric columns
('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols) — Name it 'cat', one-hot encode the category columns
handle_unknown='ignore' — If test data has unknown categories, ignore them (don't crash)
# Full pipeline
full_pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier())
])
Complete Pipeline: Combine the preprocessor (which handles all column transformations) with a classifier into one pipeline. Now you can do full_pipeline.fit(X_train, y_train) and it handles everything automatically!
ColumnTransformer — Line by Line Explanation
numerical_cols = ['age', 'income', 'balance']
Define which columns contain numbers. These will be scaled with StandardScaler.
categorical_cols = ['gender', 'country', ...]
Define which columns are categories. These will be one-hot encoded.
Understanding the transformers=[] Argument:
('num', StandardScaler(), numerical_cols)
- 'num' — Name for this transformer (your choice)
- StandardScaler() — The transformer to apply
- numerical_cols — List of column names to transform
The Complete Flow:
1. ColumnTransformer splits data by columns → 2. Applies StandardScaler to numerical → 3. Applies OneHotEncoder to categorical → 4. Concatenates results → 5. Passes to RandomForestClassifier
handle_unknown='ignore' in OneHotEncoder so the model doesn't crash if test data has categories not seen during training.
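As a runnable sketch of the complete flow: the column names come from the example above, but the rows and labels below are made up purely for illustration (pandas is assumed to be installed):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up dataset matching the column names above
df = pd.DataFrame({
    'age':        [25, 40, 31, 58, 22, 45],
    'income':     [30000, 80000, 52000, 110000, 28000, 95000],
    'balance':    [1000, 5000, 2500, 12000, 300, 7000],
    'gender':     ['F', 'M', 'F', 'M', 'M', 'F'],
    'country':    ['US', 'DE', 'US', 'FR', 'DE', 'US'],
    'occupation': ['eng', 'doc', 'eng', 'law', 'doc', 'eng'],
})
y = [0, 1, 0, 1, 0, 1]  # made-up binary labels

numerical_cols = ['age', 'income', 'balance']
categorical_cols = ['gender', 'country', 'occupation']

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numerical_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])

full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42)),
])
full_pipeline.fit(df, y)               # raw DataFrame in, everything handled
predictions = full_pipeline.predict(df)
print(predictions)
```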
Model Selection
Sklearn's model_selection module provides tools for splitting data, cross-validation, and hyperparameter tuning. These are essential for building robust models.
Cross-Validation
from sklearn.model_selection import cross_val_score, KFold
model = LogisticRegression()
Import CV tools and create a model. Cross-validation will test this model on multiple train/test splits.
# Simple 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")
5-Fold Cross-Validation
cv=5
Split into 5 parts, train on 4, test on 1, rotate
scoring='accuracy'
Metric to measure (accuracy, f1, roc_auc...)
scores.mean() ± std*2
Average ± 95% confidence interval
# Custom KFold
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)
print(f"10-Fold CV Mean: {scores.mean():.3f}")
Custom KFold Parameters
n_splits=10
10 folds = more reliable estimate
shuffle=True
Randomize data order first
random_state=42
Reproducible shuffle
Cross-Validation — What It Does
cv=5
5-Fold CV: Splits data into 5 parts. Trains on 4, tests on 1, rotates 5 times. Returns 5 scores — one for each fold.
scoring='accuracy'
Metric to optimize: Options include 'accuracy', 'f1', 'precision', 'recall', 'roc_auc', 'neg_mean_squared_error', etc.
scores.mean() (+/- scores.std()*2)
Standard format: Mean score ± 2 standard deviations. This gives you a range where ~95% of results fall. A stable model has low std.
KFold Parameters:
n_splits=10
Number of folds. More folds = more reliable but slower.
shuffle=True
Randomize before splitting. Essential if data is ordered!
random_state=42
Seed for reproducibility when shuffling.
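Both cross-validation snippets above can be run as one self-contained sketch:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Simple 5-fold cross-validation: one score per fold
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(scores)
print(f"Mean: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

# Custom KFold: shuffle first (essential here, since Iris is sorted by class)
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
scores10 = cross_val_score(model, X, y, cv=kfold)
print(f"10-Fold CV Mean: {scores10.mean():.3f}")
```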
Hyperparameter Tuning with GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
Import GridSearchCV for hyperparameter tuning and RandomForest as our model to tune.
# Define parameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [5, 10, 20, None],
'min_samples_split': [2, 5, 10]
}
Parameter Grid — Values to Try
n_estimators
Number of trees: 50, 100, or 200
max_depth
Tree depth: 5, 10, 20, or unlimited
min_samples_split
Minimum samples to split: 2, 5, or 10
Total: 3 × 4 × 3 = 36 combinations to try!
# Create grid search
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1, # Use all CPU cores
verbose=1
)
GridSearchCV Parameters
cv=5
5-fold CV per combo
scoring='accuracy'
Metric to optimize
n_jobs=-1
Use all CPU cores
verbose=1
Show progress
# Fit to find best parameters
grid_search.fit(X_train, y_train)
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.3f}")
Finding Best Parameters
.best_params_
Dictionary with optimal parameter values found
.best_score_
Best cross-validation score achieved
# Use best model
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(f"Test Accuracy: {test_accuracy:.3f}")
Using the Best Model
.best_estimator_ is the actual trained model with optimal parameters. Use it directly for predictions on new data!
GridSearchCV — Complete Code Breakdown
Step 1: Define the Parameter Grid
'n_estimators': [50, 100, 200]
Number of trees in the forest. More trees = slower but often better.
'max_depth': [5, 10, 20, None]
Max tree depth. None = unlimited. Shallower = less overfitting.
'min_samples_split': [2, 5, 10]
Minimum samples to split a node. Higher = more regularization.
Step 2: GridSearchCV Parameters
cv=5
5-fold cross-validation for each combination.
scoring='accuracy'
Metric to optimize. Pick based on your problem.
n_jobs=-1
Use all CPU cores for parallel training.
verbose=1
Print progress (0=silent, 1=progress, 2=detailed).
Step 3: Access Results
grid_search.best_params_
Dictionary of best parameter values found.
grid_search.best_score_
Best cross-validation score achieved.
grid_search.best_estimator_
The actual model object with best params.
Total combinations: 3 × 4 × 3 = 36 parameter combinations. With 5-fold CV: 36 × 5 = 180 models trained!
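When the grid grows much larger than 36 combinations, exhaustive search becomes expensive. RandomizedSearchCV samples a fixed number of combinations instead of trying them all — a sketch using the same illustrative grid:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10]
}

# n_iter=10 tries 10 random combinations instead of all 36
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=10,
    cv=5,
    random_state=42,
    n_jobs=-1
)
search.fit(X, y)
print(f"Best params: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.3f}")
```

The results interface is identical to GridSearchCV: best_params_, best_score_, and best_estimator_ all work the same way.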
Saving and Loading Models
import joblib
Import joblib: This library saves Python objects to disk. It's faster than pickle for NumPy arrays (which sklearn uses internally). Pre-installed with sklearn.
# Save model to file
joblib.dump(best_model, 'random_forest_model.joblib')
Save Model: joblib.dump(object, filename) serializes your trained model to a file. The .joblib extension is convention. All learned parameters are saved!
# Load model later
loaded_model = joblib.load('random_forest_model.joblib')
Load Model: joblib.load(filename) reads the file and recreates your model object. No retraining needed — it's ready to use immediately!
# Use loaded model
predictions = loaded_model.predict(X_new)
Use Loaded Model: Call .predict() on the loaded model just like normal. It works exactly like the original — all methods (predict, predict_proba, score) are available.
# Save entire pipeline (recommended!)
joblib.dump(full_pipeline, 'complete_pipeline.joblib')
Save Entire Pipeline: Best practice! Saving a pipeline includes all preprocessing steps (scalers, encoders). When you load it, you can directly pass raw data — the pipeline handles everything.
Model Persistence — Code Breakdown
joblib.dump(model, 'filename.joblib')
Save to disk. Serializes the model object and all its learned parameters to a file. Works with any sklearn object.
loaded_model = joblib.load('filename.joblib')
Load from disk. Deserializes the file back into a Python object. Ready to use immediately — no retraining needed!
Why joblib over pickle?
- Faster for large NumPy arrays (common in ML models)
- Compressed output — smaller file sizes
- Designed specifically for sklearn objects
Bonus: a saved pipeline bundles its preprocessing steps, so after loading you can call pipeline.predict(raw_data) directly!
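A round-trip sketch showing that a loaded model behaves identically to the original — joblib.dump also accepts a compress argument (0–9) that trades a little save time for a smaller file:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# compress=3 is a common middle ground: noticeably smaller files,
# only slightly slower to write
joblib.dump(model, 'model.joblib', compress=3)

loaded = joblib.load('model.joblib')
print(f"Original: {model.score(X, y):.3f}")
print(f"Loaded:   {loaded.score(X, y):.3f}")  # identical learned parameters
```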
Complete Deployment Workflow
Train
Build and validate your pipeline
Save
joblib.dump(pipe, 'model.joblib')
Deploy
Copy .joblib file to server
Predict
pipe.predict(new_data)
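The four steps above, condensed into one runnable sketch (iris and RandomForest stand in for your real data and model):

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 1. Train: build and validate the pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier(random_state=42))
])
pipe.fit(X_train, y_train)
print(f"Validation accuracy: {pipe.score(X_test, y_test):.3f}")

# 2. Save: serialize the whole pipeline, preprocessing included
joblib.dump(pipe, 'complete_pipeline.joblib')

# 3. Deploy: copy the .joblib file to the server, then
# 4. Predict: load it and feed raw (unscaled) data directly
served = joblib.load('complete_pipeline.joblib')
predictions = served.predict(X_test)  # scaling happens inside the pipeline
```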
Practice: Sklearn Basics
Task: Load the wine dataset, split it 80/20, train a LogisticRegression model, and print the accuracy on test data.
Show Solution
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.2%}")
Task: Using the iris dataset, compare LogisticRegression, DecisionTree, and RandomForest using 5-fold cross-validation. Print mean and std for each.
Show Solution
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
iris = load_iris()
models = {
'Logistic Regression': LogisticRegression(max_iter=200),
'Decision Tree': DecisionTreeClassifier(),
'Random Forest': RandomForestClassifier()
}
for name, model in models.items():
scores = cross_val_score(model, iris.data, iris.target, cv=5)
print(f"{name}: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")
Task: Create a Pipeline with StandardScaler and RandomForest. Use GridSearchCV to find the best n_estimators (50, 100, 200) and max_depth (5, 10, None). Print best parameters and test score.
Show Solution
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42)
pipe = Pipeline([
('scaler', StandardScaler()),
('rf', RandomForestClassifier(random_state=42))
])
param_grid = {
'rf__n_estimators': [50, 100, 200],
'rf__max_depth': [5, 10, None]
}
grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Test score: {grid.score(X_test, y_test):.3f}")
Key Takeaways
Consistent API
Every sklearn model follows fit() → predict() → score(). Learn once, use everywhere across 50+ algorithms.
Estimators vs Transformers
Estimators predict (predict()), Transformers preprocess (transform()). Both use fit() to learn from data.
Prevent Data Leakage
Always fit_transform() on train, then only transform() on test. Never fit preprocessors on test data!
Use Pipelines
Chain preprocessing and models into a Pipeline. It prevents leakage, simplifies code, and makes deployment easier.
Tune with GridSearchCV
Don't guess hyperparameters. Use GridSearchCV or RandomizedSearchCV to systematically find the best settings.
Save Complete Pipelines
Use joblib.dump(pipeline) to save everything. Loading later means you can predict on raw data immediately.
Knowledge Check
Test your understanding of Scikit-learn with this quick quiz.
What method is used to train a model in scikit-learn?
What does fit_transform() do on a StandardScaler?
Which is the correct way to preprocess test data?
What does stratify=y do in train_test_split?
What is the main benefit of using a Pipeline?
Which attribute gives you the best model from GridSearchCV?