Module 3.3

SVM & Neural Networks

Discover the power of Support Vector Machines and Neural Networks! Learn how SVMs find optimal decision boundaries using kernel tricks, and how multi-layer perceptrons learn complex patterns through backpropagation.

45 min read
Intermediate
Hands-on Examples
What You'll Learn
  • SVM classification and margin maximization
  • Kernel tricks (Linear, RBF, Polynomial)
  • SVM hyperparameter tuning (C, gamma)
  • Multi-layer Perceptrons with MLPClassifier
  • Deep learning introduction and when to use it
Contents
01

Support Vector Machines

Support Vector Machines (SVMs) are powerful classifiers that find the optimal hyperplane to separate classes. Unlike other classifiers that just find any separating boundary, SVMs find the boundary with the maximum margin - the largest possible distance between the boundary and the nearest data points from each class. This makes SVMs robust and great at generalizing to new data.

Road Analogy: Imagine drawing a road between two towns (classes). Other algorithms might draw any road that separates them. SVM draws the widest possible road - maximizing the "buffer zone" on each side. The wider the road, the less likely a new house will be misclassified as belonging to the wrong town.

What is a Hyperplane?

A hyperplane is a decision boundary that separates different classes. In 2D, it's a line. In 3D, it's a plane. In higher dimensions, we call it a hyperplane. SVM finds the hyperplane that maximizes the margin between classes.
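As a minimal sketch of this idea (using a synthetic two-blob dataset, not one from this module), a fitted linear SVC exposes its hyperplane through `coef_` and `intercept_`, and `decision_function` returns each point's signed score relative to that hyperplane:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs, so a linear hyperplane exists
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

# The hyperplane is w . x + b = 0
w, b = clf.coef_[0], clf.intercept_[0]
print(f"w = {w}, b = {b:.3f}")

# decision_function computes w . x + b for each sample;
# its sign determines which side of the hyperplane (which class) a point is on
scores = clf.decision_function(X)
print((scores > 0).astype(int)[:5])
print(clf.predict(X)[:5])
```

Points with positive scores fall on one side of the hyperplane and points with negative scores on the other, which is exactly what `predict` reports.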

Support Vectors

The data points closest to the decision boundary. These "support" the hyperplane - if you remove other points, the boundary stays the same. Only support vectors matter for the final model.

Margin

The distance between the hyperplane and the nearest support vectors. SVM maximizes this margin, creating a "safety buffer" that improves generalization to new data.
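For a linear SVM, the margin width can be computed directly from the learned weights as 2 / ||w||. A quick sketch on synthetic separable data (the large C here approximates a hard margin; the dataset is illustrative, not one used elsewhere in this module):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Linearly separable toy data
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=42)

clf = SVC(kernel='linear', C=1000)  # very large C ~ hard margin
clf.fit(X, y)

# For a linear SVM the full margin width is 2 / ||w||
w = clf.coef_[0]
margin = 2 / np.linalg.norm(w)
print(f"Margin width: {margin:.3f}")

# Only a handful of points end up as support vectors
print(f"Support vectors per class: {clf.n_support_}")
```

Note how few support vectors remain: the rest of the training points could be deleted without moving the boundary.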

Hard Margin vs Soft Margin

In a perfect world, data is linearly separable and we can find a hyperplane with no misclassifications. This is called hard margin SVM. But real data is messy - some points may be on the wrong side. Soft margin SVM allows some misclassifications controlled by the C parameter.

# Basic SVM Classification
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load and prepare data
iris = load_iris()
X, y = iris.data[:, :2], iris.target  # Use first 2 features for visualization

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# SVM requires feature scaling! Fit the scaler on the training split only,
# so no test-set statistics leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train SVM with linear kernel
svm_clf = SVC(kernel='linear', C=1.0, random_state=42)
svm_clf.fit(X_train, y_train)

print(f"Training Accuracy: {svm_clf.score(X_train, y_train):.2%}")
print(f"Test Accuracy: {svm_clf.score(X_test, y_test):.2%}")
print(f"Number of Support Vectors: {svm_clf.n_support_}")
Critical: Always Scale Your Features! SVM is sensitive to feature scales because it measures distances. A feature ranging from 0-1000 will dominate one ranging from 0-1. Always use StandardScaler or MinMaxScaler before training SVM.
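To see why this matters, here is a rough before/after comparison on the wine dataset (chosen here because its features span wildly different ranges, e.g. proline is in the thousands while others are below 10; the exact accuracy gap will vary):

```python
from sklearn.svm import SVC
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Wine features have very different scales, so distances are dominated
# by the largest-magnitude feature unless we scale
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

# Without scaling
svm_raw = SVC(kernel='rbf', random_state=42).fit(X_train, y_train)
print(f"Unscaled: {svm_raw.score(X_test, y_test):.2%}")

# With scaling (fit the scaler on training data only)
scaler = StandardScaler()
svm_scaled = SVC(kernel='rbf', random_state=42).fit(
    scaler.fit_transform(X_train), y_train
)
print(f"Scaled:   {svm_scaled.score(scaler.transform(X_test), y_test):.2%}")
```

The scaled model typically wins by a wide margin on datasets like this.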

The C Parameter

The C parameter controls the trade-off between having a smooth decision boundary and classifying training points correctly. Think of C as "how much you care about mistakes":

Low C (e.g., 0.01)
  • Wider margin (more tolerance for errors)
  • More regularization
  • Simpler decision boundary
  • May underfit if too low
  • Better for noisy data
High C (e.g., 100)
  • Narrower margin (less tolerance)
  • Less regularization
  • Complex decision boundary
  • May overfit if too high
  • Better for clean data
# Effect of C parameter
from sklearn.svm import SVC

# Low C - wider margin, more misclassifications allowed
svm_low_c = SVC(kernel='linear', C=0.01, random_state=42)
svm_low_c.fit(X_train, y_train)
print(f"Low C (0.01): Train={svm_low_c.score(X_train, y_train):.2%}, Test={svm_low_c.score(X_test, y_test):.2%}")

# High C - narrow margin, fewer misclassifications
svm_high_c = SVC(kernel='linear', C=100, random_state=42)
svm_high_c.fit(X_train, y_train)
print(f"High C (100): Train={svm_high_c.score(X_train, y_train):.2%}, Test={svm_high_c.score(X_test, y_test):.2%}")

Practice Questions: SVM Basics

Test your understanding with these coding challenges.

Task: Load the breast cancer dataset, scale features, train a linear SVM with C=1.0, and print accuracy.

Show Solution
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

svm = SVC(kernel='linear', C=1.0, random_state=42)
svm.fit(X_train_scaled, y_train)

print(f"Accuracy: {svm.score(X_test_scaled, y_test):.2%}")

Task: Train SVMs with C values [0.001, 0.1, 1, 10, 100] and compare train/test accuracy.

Show Solution
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

C_values = [0.001, 0.1, 1, 10, 100]
for C in C_values:
    svm = SVC(kernel='linear', C=C, random_state=42)
    svm.fit(X_train_scaled, y_train)
    train_acc = svm.score(X_train_scaled, y_train)
    test_acc = svm.score(X_test_scaled, y_test)
    print(f"C={C:6.3f}: Train={train_acc:.2%}, Test={test_acc:.2%}")

Task: For C values [0.01, 1, 100], print the number of support vectors. What pattern do you notice?

Show Solution
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

for C in [0.01, 1, 100]:
    svm = SVC(kernel='linear', C=C, random_state=42)
    svm.fit(X_train_scaled, y_train)
    n_sv = sum(svm.n_support_)
    print(f"C={C:5.2f}: {n_sv} support vectors")
# Lower C -> more support vectors (wider margin includes more points)
02

Kernel Tricks

What if your data isn't linearly separable? You can't draw a straight line (or hyperplane) to separate the classes. The kernel trick is SVM's secret weapon - it transforms data into a higher-dimensional space where it becomes linearly separable, without actually computing the transformation!

The Magic of Kernels: Imagine you have red and blue dots in a circle pattern - red in the center, blue around it. No straight line can separate them in 2D. But if you "lift" the center dots up (add a third dimension based on distance from center), suddenly a flat plane can separate them! Kernels do this mathematically without computing the actual transformation.
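The "lifting" in that analogy can be made concrete. In this sketch we build the circle pattern with make_circles, watch a linear SVM fail in 2D, then manually add a third feature (squared distance from the origin) and watch the same linear SVM succeed (the RBF kernel achieves this effect implicitly, without ever building the extra column):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Red dots in the center, blue dots around them
X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=42)

# Linear SVM on the raw 2D circles: no straight line separates them
linear_2d = SVC(kernel='linear').fit(X, y)
print(f"Linear, 2D:        {linear_2d.score(X, y):.2%}")

# "Lift" the data: add a third feature = squared distance from the origin
r2 = (X ** 2).sum(axis=1, keepdims=True)
X_lifted = np.hstack([X, r2])

# In 3D a flat plane (a fixed radius) now separates the classes
linear_3d = SVC(kernel='linear').fit(X_lifted, y)
print(f"Linear, lifted 3D: {linear_3d.score(X_lifted, y):.2%}")
```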

Common Kernel Types

Linear Kernel

No transformation. Works when data is already linearly separable. Fastest to train.

kernel='linear'
RBF Kernel

Radial Basis Function. The most popular choice. Creates smooth, flexible non-linear decision boundaries. A good default.

kernel='rbf'
Polynomial Kernel

Creates polynomial decision boundaries. Degree parameter controls complexity.

kernel='poly'
# Comparing Different Kernels
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Generate non-linear data
X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Test different kernels
kernels = ['linear', 'rbf', 'poly']
for kernel in kernels:
    svm = SVC(kernel=kernel, random_state=42)
    svm.fit(X_train_scaled, y_train)
    acc = svm.score(X_test_scaled, y_test)
    print(f"{kernel:8s} kernel: {acc:.2%}")

The Gamma Parameter (RBF Kernel)

For the RBF kernel, the gamma parameter controls how far the influence of a single training example reaches. Think of it as how "local" vs "global" the decision boundary is:

Low Gamma (e.g., 0.01)
  • Each point has wide influence
  • Smoother, more global decision boundary
  • May underfit (too simple)
  • Less sensitive to individual points
High Gamma (e.g., 100)
  • Each point has narrow influence
  • Wiggly, more local decision boundary
  • May overfit (too complex)
  • Very sensitive to each point
# Effect of Gamma
from sklearn.svm import SVC

# Low gamma - smooth boundary
svm_low_gamma = SVC(kernel='rbf', gamma=0.01, random_state=42)
svm_low_gamma.fit(X_train_scaled, y_train)
print(f"Low gamma (0.01): {svm_low_gamma.score(X_test_scaled, y_test):.2%}")

# High gamma - wiggly boundary  
svm_high_gamma = SVC(kernel='rbf', gamma=100, random_state=42)
svm_high_gamma.fit(X_train_scaled, y_train)
print(f"High gamma (100): {svm_high_gamma.score(X_test_scaled, y_test):.2%}")

# Default gamma='scale' often works well
svm_auto = SVC(kernel='rbf', gamma='scale', random_state=42)
svm_auto.fit(X_train_scaled, y_train)
print(f"Auto gamma: {svm_auto.score(X_test_scaled, y_test):.2%}")
Pro Tip: Start with gamma='scale' (default) which uses 1 / (n_features * X.var()). This adapts to your data automatically and is a good baseline before tuning.
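You can compute the value that gamma='scale' resolves to yourself and verify that passing it explicitly produces the same model (this sketch reuses the scaled digits data; the equivalence assumes a dense array, where scikit-learn uses the formula above):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()
X = StandardScaler().fit_transform(digits.data)

# What gamma='scale' resolves to for this data: 1 / (n_features * X.var())
gamma_scale = 1 / (X.shape[1] * X.var())
print(f"gamma='scale' -> {gamma_scale:.5f}")

# Passing the computed value explicitly trains an identical model
svm_scale = SVC(kernel='rbf', gamma='scale').fit(X, digits.target)
svm_manual = SVC(kernel='rbf', gamma=gamma_scale).fit(X, digits.target)
print(np.array_equal(svm_scale.predict(X), svm_manual.predict(X)))
```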

Practice Questions: Kernel Tricks

Test your understanding with these coding challenges.

Task: Load digits dataset, scale features, train SVM with RBF kernel, print accuracy.

Show Solution
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

svm = SVC(kernel='rbf', random_state=42)
svm.fit(X_train_scaled, y_train)

print(f"Accuracy: {svm.score(X_test_scaled, y_test):.2%}")

Task: Use make_circles() to generate data, compare linear, rbf, and poly kernels.

Show Solution
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_circles(n_samples=200, noise=0.1, factor=0.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

for kernel in ['linear', 'rbf', 'poly']:
    svm = SVC(kernel=kernel, random_state=42)
    svm.fit(X_train_scaled, y_train)
    print(f"{kernel:8s}: {svm.score(X_test_scaled, y_test):.2%}")
# RBF will perform best on circular data!

Task: Use cross_val_score to find the best gamma from [0.001, 0.01, 0.1, 1, 10].

Show Solution
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
import numpy as np

digits = load_digits()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(digits.data)

gamma_values = [0.001, 0.01, 0.1, 1, 10]
best_gamma, best_score = None, 0

for gamma in gamma_values:
    svm = SVC(kernel='rbf', gamma=gamma, random_state=42)
    scores = cross_val_score(svm, X_scaled, digits.target, cv=5)
    mean_score = scores.mean()
    print(f"gamma={gamma:5.3f}: {mean_score:.2%} (+/- {scores.std()*2:.2%})")
    if mean_score > best_score:
        best_gamma, best_score = gamma, mean_score

print(f"\nBest: gamma={best_gamma} with {best_score:.2%}")
03

SVM Hyperparameter Tuning

SVM has two main hyperparameters to tune: C (regularization) and gamma (kernel width for RBF). Finding the right combination is crucial for good performance. Grid Search and Random Search are the standard approaches, and scikit-learn makes this easy with GridSearchCV.

Grid Search for SVM

Grid Search exhaustively tries all combinations of hyperparameters you specify. For SVM, we typically search over C and gamma values on a logarithmic scale:

# Grid Search for SVM Hyperparameters
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

digits = load_digits()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(digits.data)

# Define parameter grid (logarithmic scale)
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']
}

# Grid Search with 5-fold cross-validation
svm = SVC(random_state=42)
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_scaled, digits.target)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.2%}")

Visualizing the Parameter Space

# Visualize Grid Search Results
import pandas as pd

results = pd.DataFrame(grid_search.cv_results_)
pivot = results.pivot(index='param_C', columns='param_gamma', values='mean_test_score')
print("\nAccuracy for each C/gamma combination:")
print(pivot.round(3))
Tuning Strategy: Start with a coarse grid (powers of 10: 0.01, 0.1, 1, 10, 100) to find the right region, then zoom in with a finer grid around the best values.
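A sketch of that two-stage strategy on the breast cancer dataset (the x3 zoom factors around the coarse winner are illustrative, not a fixed rule):

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)

# Stage 1: coarse grid on a logarithmic scale
coarse = GridSearchCV(
    SVC(kernel='rbf'),
    {'C': [0.01, 0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]},
    cv=5, n_jobs=-1
)
coarse.fit(X, data.target)
best_C = coarse.best_params_['C']
best_gamma = coarse.best_params_['gamma']
print(f"Coarse best: C={best_C}, gamma={best_gamma}")

# Stage 2: finer grid zoomed in around the coarse winner
fine = GridSearchCV(
    SVC(kernel='rbf'),
    {'C': [best_C / 3, best_C, best_C * 3],
     'gamma': [best_gamma / 3, best_gamma, best_gamma * 3]},
    cv=5, n_jobs=-1
)
fine.fit(X, data.target)
print(f"Fine best:   {fine.best_params_}, CV score {fine.best_score_:.2%}")
```

Because the fine grid still contains the coarse winner, the second stage can only match or improve the cross-validation score.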

Using the Best Model

# Use the best model for predictions
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, digits.target, test_size=0.2, random_state=42
)

# Refit the search on the training split only, so the
# held-out test set stays truly unseen
grid_search.fit(X_train, y_train)

# best_estimator_ is already refitted on the full training split
best_svm = grid_search.best_estimator_
print(f"Test Accuracy: {best_svm.score(X_test, y_test):.2%}")

# Or create a new model with the best params
final_svm = SVC(**grid_search.best_params_, random_state=42)
final_svm.fit(X_train, y_train)
print(f"Final Model Accuracy: {final_svm.score(X_test, y_test):.2%}")

Practice Questions: Hyperparameter Tuning

Test your understanding with these coding challenges.

Task: Use GridSearchCV to find optimal C from [0.1, 1, 10] for linear SVM.

Show Solution
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data.data)

param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear']}
grid = GridSearchCV(SVC(random_state=42), param_grid, cv=5)
grid.fit(X_scaled, data.target)

print(f"Best C: {grid.best_params_['C']}")
print(f"Best Score: {grid.best_score_:.2%}")

Task: Include both kernels in grid search. For RBF, also tune gamma.

Show Solution
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data.data)

# Different params for different kernels
param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
]

grid = GridSearchCV(SVC(random_state=42), param_grid, cv=5)
grid.fit(X_scaled, data.target)

print(f"Best Params: {grid.best_params_}")
print(f"Best Score: {grid.best_score_:.2%}")

Task: Use RandomizedSearchCV with loguniform distributions for C and gamma.

Show Solution
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from scipy.stats import loguniform

digits = load_digits()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(digits.data)

param_dist = {
    'C': loguniform(0.01, 100),
    'gamma': loguniform(0.001, 10),
    'kernel': ['rbf']
}

random_search = RandomizedSearchCV(
    SVC(random_state=42), param_dist, 
    n_iter=20, cv=5, random_state=42, n_jobs=-1
)
random_search.fit(X_scaled, digits.target)

print(f"Best Params: {random_search.best_params_}")
print(f"Best Score: {random_search.best_score_:.2%}")
04

Multi-layer Perceptrons

Multi-layer Perceptrons (MLPs) are the simplest form of neural networks. They consist of layers of interconnected neurons that learn to recognize patterns through a process called backpropagation. Scikit-learn provides MLPClassifier for easy neural network training without needing deep learning frameworks like TensorFlow or PyTorch.

Brain Analogy: Think of an MLP like a simplified brain. Input neurons receive data (like your eyes seeing an image). Hidden layers process and combine this information (like your visual cortex recognizing shapes). Output neurons give the final answer (like your brain saying "that's a cat"). During training, the network adjusts its connections to get better at the task.

MLP Architecture

Input Layer

One neuron per feature. Receives raw data. No activation function - just passes data through.

Hidden Layers

Where the "learning" happens. Each neuron combines inputs with weights and applies an activation function (ReLU, tanh).

Output Layer

One neuron per class (for classification). Uses softmax to output probabilities that sum to 1.
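You can inspect this softmax output directly via predict_proba. A quick sketch on the Iris data (the small 20-neuron layer here is arbitrary, chosen only to keep training fast):

```python
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)

mlp = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=42)
mlp.fit(X, iris.target)

# One probability per class (3 columns for Iris), each row summing to 1
proba = mlp.predict_proba(X[:3])
print(proba.round(3))
print(proba.sum(axis=1))
```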

# Basic MLP Classifier
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Neural networks need scaled data!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# MLP with 2 hidden layers (100 and 50 neurons)
mlp = MLPClassifier(
    hidden_layer_sizes=(100, 50),  # Two hidden layers
    activation='relu',              # ReLU activation function
    max_iter=500,                   # Maximum training iterations
    random_state=42
)
mlp.fit(X_train_scaled, y_train)

print(f"Training Accuracy: {mlp.score(X_train_scaled, y_train):.2%}")
print(f"Test Accuracy: {mlp.score(X_test_scaled, y_test):.2%}")
print(f"Number of iterations: {mlp.n_iter_}")

Key Hyperparameters

hidden_layer_sizes

Tuple defining the number of neurons in each hidden layer.

  • (100,) - One layer with 100 neurons
  • (100, 50) - Two layers: 100 then 50
  • (100, 100, 100) - Three layers of 100 each
activation

Activation function for hidden layers.

  • 'relu' - Default, fast, works well (ReLU)
  • 'tanh' - Outputs between -1 and 1
  • 'logistic' - Sigmoid, outputs 0 to 1
# Experimenting with Architectures
architectures = [
    (50,),           # Shallow: 1 layer, 50 neurons
    (100, 50),       # Medium: 2 layers
    (100, 100, 50),  # Deep: 3 layers
]

for arch in architectures:
    mlp = MLPClassifier(hidden_layer_sizes=arch, max_iter=500, random_state=42)
    mlp.fit(X_train_scaled, y_train)
    acc = mlp.score(X_test_scaled, y_test)
    print(f"Architecture {str(arch):20s}: {acc:.2%}")

Regularization and Learning Rate

# Preventing Overfitting with Regularization
mlp_reg = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    alpha=0.01,              # L2 regularization (higher = more regularization)
    learning_rate='adaptive', # Adjust learning rate during training
    learning_rate_init=0.001, # Initial learning rate
    early_stopping=True,      # Stop when validation score stops improving
    validation_fraction=0.1,  # Use 10% of training data for validation
    n_iter_no_change=10,      # Stop after 10 iterations without improvement
    max_iter=500,
    random_state=42
)
mlp_reg.fit(X_train_scaled, y_train)

print(f"Accuracy: {mlp_reg.score(X_test_scaled, y_test):.2%}")
print(f"Stopped at iteration: {mlp_reg.n_iter_}")
Common Pitfalls:
  • Not scaling data: MLPs are very sensitive to feature scales
  • Not enough iterations: Increase max_iter if you see convergence warnings
  • Too many neurons: Start small and increase if underfitting
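The "not enough iterations" pitfall is easy to reproduce. This sketch deliberately sets max_iter far too low so training stops before the loss converges and scikit-learn emits a ConvergenceWarning (captured here just to show it fired):

```python
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X = StandardScaler().fit_transform(digits.data)

# Deliberately too few iterations: the optimizer hits the cap mid-descent
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", ConvergenceWarning)
    mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=5, random_state=42)
    mlp.fit(X, digits.target)

print(f"ConvergenceWarnings raised: {len(caught)}")
print(f"Stopped at iteration {mlp.n_iter_} of max_iter=5")
```

When you see this warning in practice, raise max_iter (or enable early_stopping) rather than ignoring it.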

Practice Questions: Multi-layer Perceptrons

Test your understanding with these coding challenges.

Task: Train MLPClassifier with one hidden layer of 50 neurons on scaled Iris data.

Show Solution
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=42)
mlp.fit(X_train_scaled, y_train)

print(f"Accuracy: {mlp.score(X_test_scaled, y_test):.2%}")

Task: Compare 'relu', 'tanh', and 'logistic' activations on the digits dataset.

Show Solution
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

for activation in ['relu', 'tanh', 'logistic']:
    mlp = MLPClassifier(
        hidden_layer_sizes=(100,), activation=activation,
        max_iter=500, random_state=42
    )
    mlp.fit(X_train_scaled, y_train)
    print(f"{activation:10s}: {mlp.score(X_test_scaled, y_test):.2%}")

Task: Train MLP with early_stopping=True and print the loss curve length.

Show Solution
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

mlp = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=10,
    max_iter=1000,
    random_state=42
)
mlp.fit(X_train_scaled, y_train)

print(f"Accuracy: {mlp.score(X_test_scaled, y_test):.2%}")
print(f"Iterations: {mlp.n_iter_}")
print(f"Loss curve length: {len(mlp.loss_curve_)}")
print(f"Final loss: {mlp.loss_curve_[-1]:.4f}")
05

Deep Learning Introduction

Deep learning is a subset of machine learning that uses neural networks with many layers (hence "deep"). While MLPClassifier is great for learning, real-world deep learning uses specialized frameworks like TensorFlow, PyTorch, or Keras that offer GPU acceleration, more layer types, and advanced architectures.

When to Use Deep Learning

Use Deep Learning When
  • Large datasets (100,000+ samples)
  • Image, audio, or text data
  • Complex patterns that simpler models miss
  • You have GPU resources
  • State-of-the-art accuracy is critical
Use Traditional ML When
  • Small to medium datasets
  • Tabular/structured data
  • Interpretability is important
  • Limited compute resources
  • Quick prototyping needed

Deep Learning Frameworks

TensorFlow / Keras

Google's framework. Keras provides high-level API. Great for production deployment.

PyTorch

Meta's (formerly Facebook's) framework. Preferred in research. Dynamic computation graphs.

Scikit-learn MLP

Great for learning. Simple API. CPU only. Good for small-medium problems.

Common Neural Network Architectures

CNN (Convolutional Neural Networks)

Specialized for images. Uses filters to detect edges, shapes, and patterns. Powers image classification, object detection, and facial recognition.

RNN/LSTM (Recurrent Neural Networks)

Specialized for sequences. Has memory of previous inputs. Powers language models, translation, and time series forecasting.

Transformer

The architecture behind GPT, BERT, and modern LLMs. Uses attention mechanism. Revolutionary for NLP and increasingly for vision.

GAN (Generative Adversarial Networks)

Two networks compete: generator creates fake data, discriminator detects fakes. Powers image generation, style transfer, and deepfakes.

SVM vs MLP vs Deep Learning

Aspect            | SVM                              | MLP (sklearn)                   | Deep Learning
Best for          | Small-medium data, clear margins | Learning, quick experiments     | Large data, images, text, audio
Training speed    | Medium (depends on kernel)       | Fast (CPU)                      | Slow (but GPU accelerated)
Interpretability  | Medium (support vectors)         | Low                             | Very low (black box)
Hyperparameters   | C, gamma, kernel                 | Layers, neurons, learning rate  | Many (architecture, optimizers, etc.)
Data requirements | Works with small data            | Medium data                     | Needs large data
Practical Advice: For tabular data and structured problems, start with traditional ML (Random Forest, XGBoost, SVM). They're faster to train, easier to interpret, and often perform just as well. Save deep learning for images, text, and truly large datasets where it shines.

Practice Questions: Algorithm Comparison

Test your understanding with these coding challenges.

Task: Train both SVM (RBF) and MLP on digits dataset and compare accuracy.

Show Solution
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# SVM
svm = SVC(kernel='rbf', random_state=42)
svm.fit(X_train_scaled, y_train)
print(f"SVM Accuracy: {svm.score(X_test_scaled, y_test):.2%}")

# MLP
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42)
mlp.fit(X_train_scaled, y_train)
print(f"MLP Accuracy: {mlp.score(X_test_scaled, y_test):.2%}")

Task: Time the training of SVM, MLP, and Random Forest on the same data.

Show Solution
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import time

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    'SVM': SVC(random_state=42),
    'MLP': MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

for name, model in models.items():
    start = time.time()
    model.fit(X_train_scaled, y_train)
    elapsed = time.time() - start
    acc = model.score(X_test_scaled, y_test)
    print(f"{name:15s}: {acc:.2%} (trained in {elapsed:.3f}s)")

Task: Create a Pipeline with StandardScaler and MLP, then use GridSearchCV to tune hidden_layer_sizes.

Show Solution
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_digits

digits = load_digits()

# Create pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('mlp', MLPClassifier(max_iter=500, random_state=42))
])

# Grid search over architectures
param_grid = {
    'mlp__hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50)]
}

grid = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1)
grid.fit(digits.data, digits.target)

print(f"Best architecture: {grid.best_params_}")
print(f"Best score: {grid.best_score_:.2%}")

Key Takeaways

SVM Margin Maximization

SVMs find the hyperplane with maximum margin between classes, making them robust and great at generalizing

Kernel Trick

Kernels transform data to higher dimensions where it becomes linearly separable without computing the transformation

C and Gamma Tuning

C controls regularization (margin width), gamma controls kernel width. Use GridSearchCV to find optimal values

Neural Networks

MLPs learn through layers of neurons and backpropagation. hidden_layer_sizes defines the architecture

Scale Your Data

Both SVM and MLP are sensitive to feature scales. Always use StandardScaler before training

Deep Learning When Needed

Use deep learning for images, text, and large datasets. For tabular data, traditional ML often works better

Knowledge Check

Test your understanding of SVM and Neural Networks:

Question 1 of 9

What is the main goal of Support Vector Machines?

Question 2 of 9

What happens when you increase the C parameter in SVM?

Question 3 of 9

What does the kernel trick allow SVM to do?

Question 4 of 9

What is the effect of a high gamma value in the RBF kernel?

Question 5 of 9

Why is feature scaling important for SVM?

Question 6 of 9

What does hidden_layer_sizes=(100, 50) mean in MLPClassifier?

Question 7 of 9

What is the purpose of the alpha parameter in MLPClassifier?

Question 8 of 9

When should you prefer traditional ML over deep learning?

Question 9 of 9

What does early_stopping do in MLPClassifier?