Module 3.2

Convolutional Neural Networks

This guide explains Convolutional Neural Networks (CNNs) in clear terms and practical steps. CNNs are neural networks specialized for images; they learn small pattern detectors called filters and combine them to recognize complex shapes. You will learn how images are represented as arrays, how convolution and pooling extract useful features, and how to use common architectures such as LeNet, VGG, and ResNet. The page includes practical Python examples (NumPy, TensorFlow/Keras) and clear preprocessing guidance so you can follow along with minimal prerequisites.

60 min read
Intermediate
Hands-on
What You'll Learn
  • Digital image representation and preprocessing
  • Convolution operations and feature detection
  • Pooling layers for spatial reduction
  • Classic CNN architectures (LeNet, VGG, ResNet)
  • Transfer learning with pre-trained models
01

Image Processing Fundamentals

Computers store images as grids of numbers called arrays. Each pixel contains one or more numeric channels (for example, three values for RGB color) and these arrays are the input that CNNs learn from. In this section we explain how to read image arrays, examine pixel ranges and data types, convert between color spaces, and apply simple preprocessing steps like resizing and normalization. These preparations ensure models train reliably and produce useful results.

Key Concept

Digital Image Representation

A digital image is a 3-dimensional array (tensor) with dimensions (Height, Width, Channels). For RGB images, there are 3 channels representing Red, Green, and Blue intensities. Each pixel value typically ranges from 0-255 (8-bit) or 0.0-1.0 (normalized). Grayscale images have a single channel, reducing them to 2D arrays.
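These shape conventions can be verified directly in NumPy with small synthetic arrays; no image file is needed, and the sizes below are arbitrary:

```python
import numpy as np

# Synthetic RGB image: (Height, Width, Channels) = (4, 6, 3), 8-bit pixels
rgb = np.zeros((4, 6, 3), dtype=np.uint8)
rgb[:, :3] = [255, 0, 0]  # paint the left half pure red

# A grayscale image drops the channel axis entirely: a 2D array
gray = np.zeros((4, 6), dtype=np.uint8)

# Normalizing converts 0-255 integers to 0.0-1.0 floats
rgb_norm = rgb.astype(np.float32) / 255.0

print(rgb.shape)       # (4, 6, 3)
print(rgb.dtype)       # uint8
print(gray.ndim)       # 2
print(rgb_norm.max())  # 1.0
```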

Images as NumPy Arrays

In Python, images are represented as NumPy arrays. Understanding this representation is crucial for building image processing pipelines. Let us explore how to load and inspect images programmatically.

# Loading and inspecting an image
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

# Load an image
img = Image.open('cat.jpg')
img_array = np.array(img)

# Inspect shape: (Height, Width, Channels)
print(f"Image shape: {img_array.shape}")
# Output: Image shape: (224, 224, 3)

print(f"Data type: {img_array.dtype}")
# Output: Data type: uint8

print(f"Pixel value range: {img_array.min()} to {img_array.max()}")
# Output: Pixel value range: 0 to 255

# Access a single pixel (row 100, column 50)
pixel = img_array[100, 50]
print(f"RGB values: R={pixel[0]}, G={pixel[1]}, B={pixel[2]}")
Code Walkthrough

Images as NumPy Arrays — Line by line

  • Imports: import numpy as np, from PIL import Image, and import matplotlib.pyplot as plt bring in the libraries used to load, inspect, and display images. PIL (Pillow) loads many image formats; NumPy represents image data as arrays; Matplotlib visualizes images.
  • Load image: img = Image.open('cat.jpg') opens the file and returns a PIL Image object. If the file path is wrong an error will be raised—use a small sample image while experimenting.
  • Convert to array: img_array = np.array(img) turns the PIL Image into a NumPy array with shape (height, width, channels). For RGB color images the channels dimension is 3; grayscale images have no channel axis (shape (h, w)).
  • Inspect shape: img_array.shape shows the image dimensions and channel count. This is the first thing to check—many bugs are caused by unexpected shapes (for example, an extra alpha channel).
  • Inspect dtype: img_array.dtype reveals the numeric type—commonly uint8 for 0–255 integers. When training models you often convert to float32 and normalize values to 0–1.
  • Pixel range: img_array.min() and img_array.max() confirm the actual pixel value range. If values are in 0–255, divide by 255.0 to normalize; if already in 0–1 floats, do not rescale again.
  • Access a pixel: pixel = img_array[100, 50] reads the pixel at row 100, column 50. For RGB this returns a length-3 array [R, G, B]; if you see 4 values the image includes an alpha channel (RGBA).
  • Common pitfalls: note the channel order—Pillow and Matplotlib use RGB, while OpenCV uses BGR. When switching between libraries, convert the channel order with img[:, :, ::-1] or the appropriate conversion function.
  • Quick normalization example: img_norm = img_array.astype('float32') / 255.0 converts data to floats in range [0, 1], ready for most TensorFlow/Keras models.
    Color Spaces and Channels

    Different color spaces represent images in various ways. RGB is the most common for display, but other spaces like grayscale, HSV, and LAB are useful for specific tasks. CNNs can learn from any color representation.

    # Color space conversions
    import cv2
    import numpy as np
    
    # Load image in BGR (OpenCV default)
    img_bgr = cv2.imread('photo.jpg')
    
    # Convert to RGB for display
    img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
    
    # Convert to grayscale (single channel)
    img_gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    print(f"Grayscale shape: {img_gray.shape}")
    # Output: Grayscale shape: (224, 224)
    
    # Convert to HSV (Hue, Saturation, Value)
    img_hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    print(f"HSV shape: {img_hsv.shape}")
    # Output: HSV shape: (224, 224, 3)
    
    # Expand grayscale to 3 channels for CNN input
    img_gray_3ch = np.stack([img_gray, img_gray, img_gray], axis=-1)
    print(f"Expanded grayscale shape: {img_gray_3ch.shape}")
    # Output: Expanded grayscale shape: (224, 224, 3)
    Preprocessing

    Image Normalization

    Normalization scales pixel values to a standard range, typically [0, 1] or [-1, 1]. This ensures stable gradient flow during training and allows the network to converge faster. Pre-trained models often require specific normalization using ImageNet mean and standard deviation values.
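A minimal NumPy sketch of the normalization schemes mentioned above; the mean and std constants are the standard ImageNet statistics, shown here for a single channel:

```python
import numpy as np

pixels = np.array([0.0, 128.0, 255.0], dtype=np.float32)  # raw 8-bit values

# Scheme 1: scale to [0, 1]
unit = pixels / 255.0

# Scheme 2: scale to [-1, 1]
signed = pixels / 127.5 - 1.0

# Scheme 3: ImageNet standardization, applied after [0, 1] scaling
# (using the ImageNet red-channel mean and std as an example)
mean, std = 0.485, 0.229
standardized = (unit - mean) / std

print(unit.min(), unit.max())      # 0.0 1.0
print(signed.min(), signed.max())  # -1.0 1.0
```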

    Image Preprocessing Pipeline

    Raw images need preprocessing before feeding into CNNs. Common operations include resizing, normalization, and data augmentation. A consistent preprocessing pipeline ensures reproducible results.

    # Complete preprocessing pipeline
    import tensorflow as tf
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    
    def preprocess_image(image_path, target_size=(224, 224)):
        """Standard preprocessing for CNN input."""
        # Load image
        img = tf.io.read_file(image_path)
        img = tf.image.decode_jpeg(img, channels=3)
        
        # Resize to target dimensions
        img = tf.image.resize(img, target_size)
        
        # Normalize to [0, 1]
        img = img / 255.0
        
        # ImageNet normalization (for pre-trained models)
        mean = [0.485, 0.456, 0.406]
        std = [0.229, 0.224, 0.225]
        img = (img - mean) / std
        
        return img
    
    # Using Keras ImageDataGenerator for batch preprocessing
    datagen = ImageDataGenerator(
        rescale=1.0 / 255,           # Normalize to [0, 1]
        rotation_range=20,            # Random rotation
        width_shift_range=0.2,        # Horizontal shift
        height_shift_range=0.2,       # Vertical shift
        horizontal_flip=True,         # Random horizontal flip
        zoom_range=0.15,              # Random zoom
        fill_mode='nearest'           # Fill mode for new pixels
    )
    
    # Load images from directory with preprocessing
    train_generator = datagen.flow_from_directory(
        'data/train/',
        target_size=(224, 224),
        batch_size=32,
        class_mode='categorical'
    )

    Understanding Spatial Relationships

    Images contain spatial information: neighboring pixels are often correlated, and a cat's ear always appears near its head. CNNs exploit this locality through convolution operations that process small regions at a time, preserving spatial relationships throughout the network.

    # Visualizing local patterns in images
    import numpy as np
    import matplotlib.pyplot as plt
    
    # Simulate a simple edge pattern
    edge_pattern = np.array([
        [0, 0, 0, 255, 255, 255],
        [0, 0, 0, 255, 255, 255],
        [0, 0, 0, 255, 255, 255],
        [0, 0, 0, 255, 255, 255],
        [0, 0, 0, 255, 255, 255],
        [0, 0, 0, 255, 255, 255]
    ], dtype=np.uint8)
    
    # Simple edge detection with a filter
    vertical_edge_filter = np.array([
        [-1, 0, 1],
        [-1, 0, 1],
        [-1, 0, 1]
    ])
    
    # Manual convolution demonstration
    def simple_convolve(image, kernel):
        """Apply a 3x3 kernel to detect patterns."""
        h, w = image.shape
        kh, kw = kernel.shape
        output = np.zeros((h - kh + 1, w - kw + 1))
        
        for i in range(output.shape[0]):
            for j in range(output.shape[1]):
                region = image[i:i+kh, j:j+kw]
                output[i, j] = np.sum(region * kernel)
        
        return output
    
    edge_detected = simple_convolve(edge_pattern.astype(float), vertical_edge_filter)
    print("Edge detection output:")
    print(edge_detected)
    Why Preprocessing Matters: Proper preprocessing can make a substantial difference in model accuracy. Always resize images to the same dimensions, normalize pixel values, and apply consistent augmentations during training. When using pre-trained models, match the preprocessing used during their original training.

    Practice: Image Data Fundamentals

    Question: How many individual values are stored in a 640x480 RGB image?

    Answer: 921,600 values (640 * 480 * 3 = 921,600). Each of the 307,200 pixels has 3 color channel values.

    Question: Why should you apply ImageNet normalization when using a model pre-trained on ImageNet?

    Answer: Pre-trained models learned their weights with specific input distributions. ImageNet normalization ensures your input data matches the distribution the model was trained on. Using different normalization would cause a distribution shift, degrading the pre-trained features and requiring more fine-tuning to adapt.

    Question: When would you convert images to grayscale instead of using RGB?

    Answer: Grayscale images reduce computational cost (1 channel vs 3), decrease model parameters, and sometimes improve generalization. For tasks where color is irrelevant (document OCR, X-ray analysis, edge detection), grayscale removes noise from color variations. This forces the model to focus on structural features rather than color-based shortcuts, often improving robustness.

    02

    Convolutional and Pooling Layers

    The convolution operation is the heart of CNNs. Instead of processing every pixel independently like fully connected layers, convolution applies learnable filters that slide across images to detect local patterns. Combined with pooling layers for dimensionality reduction, these operations enable CNNs to efficiently extract hierarchical features from images.

    Key Concept

    Convolution Operation

    A convolution is a mathematical operation that slides a small learnable matrix (kernel/filter) across an input image. At each position, it computes the element-wise product between the filter and the overlapping input region, then sums the results to produce one output value. This creates a feature map that highlights where the filter's pattern appears in the image.

    How Convolution Works

    A convolution filter (also called a kernel) is typically a small 3x3 or 5x5 matrix of learnable weights. The filter slides across the image with a specified stride, computing dot products to create an output feature map. Different filters detect different features like edges, textures, or shapes.

    # Understanding convolution step by step
    import numpy as np
    
    # Sample 5x5 grayscale image patch
    image = np.array([
        [10, 10, 10, 0, 0],
        [10, 10, 10, 0, 0],
        [10, 10, 10, 0, 0],
        [10, 10, 10, 0, 0],
        [10, 10, 10, 0, 0]
    ], dtype=np.float32)
    
    # Vertical edge detection kernel (3x3)
    kernel = np.array([
        [-1, 0, 1],
        [-1, 0, 1],
        [-1, 0, 1]
    ], dtype=np.float32)
    
    # Manual convolution with stride=1, no padding
    def convolve2d(image, kernel, stride=1):
        h, w = image.shape
        kh, kw = kernel.shape
        out_h = (h - kh) // stride + 1
        out_w = (w - kw) // stride + 1
        output = np.zeros((out_h, out_w))
        
        for i in range(0, out_h):
            for j in range(0, out_w):
                row = i * stride
                col = j * stride
                region = image[row:row+kh, col:col+kw]
                output[i, j] = np.sum(region * kernel)
        
        return output
    
    feature_map = convolve2d(image, kernel)
    print("Feature map (edge detection):")
    print(feature_map)
    # The vertical edge is detected in the middle column

    Convolution Parameters

    Several parameters control how convolution operates: filter size, stride, padding, and number of filters. Understanding these parameters is essential for designing CNN architectures.

    # Keras Conv2D layer with all parameters
    from tensorflow.keras.layers import Conv2D, Input
    from tensorflow.keras.models import Model
    
    # Input: 224x224 RGB image
    input_layer = Input(shape=(224, 224, 3))
    
    # Convolution layer parameters explained
    conv_layer = Conv2D(
        filters=32,              # Number of output feature maps
        kernel_size=(3, 3),      # Filter dimensions (3x3 is common)
        strides=(1, 1),          # Step size for sliding (1 = move one pixel)
        padding='same',          # 'same' preserves spatial size, 'valid' shrinks
        activation='relu',       # Apply ReLU after convolution
        use_bias=True,           # Add learnable bias term
        kernel_initializer='he_normal'  # Weight initialization
    )(input_layer)
    
    # Output shape calculation for 'valid' padding:
    # output_size = (input_size - kernel_size) / stride + 1
    # For 224x224 with 3x3 kernel, stride 1: (224-3)/1 + 1 = 222
    
    # Output shape for 'same' padding:
    # output_size = input_size / stride = 224/1 = 224
    
    model = Model(inputs=input_layer, outputs=conv_layer)
    model.summary()  # summary() prints the model itself and returns None
    Key Concept

    Pooling Operation

    Pooling reduces the spatial dimensions (height and width) of feature maps while retaining the most important information. Max pooling takes the maximum value in each region, while average pooling computes the mean. Pooling provides translation invariance and reduces computation for deeper layers.

    Max Pooling and Average Pooling

    Pooling layers downsample feature maps by aggregating values in local regions. Max pooling is most common because it preserves the strongest activations (detected features), while average pooling is sometimes used in final layers.

    # Pooling operations explained
    import numpy as np
    from tensorflow.keras.layers import MaxPooling2D, AveragePooling2D, GlobalAveragePooling2D
    
    # Sample 4x4 feature map
    feature_map = np.array([
        [1, 3, 2, 4],
        [5, 6, 1, 2],
        [3, 2, 8, 1],
        [4, 7, 3, 9]
    ], dtype=np.float32)
    
    # Manual 2x2 max pooling with stride 2
    def max_pool_2d(feature_map, pool_size=2, stride=2):
        h, w = feature_map.shape
        out_h = h // stride
        out_w = w // stride
        output = np.zeros((out_h, out_w))
        
        for i in range(out_h):
            for j in range(out_w):
                row = i * stride
                col = j * stride
                region = feature_map[row:row+pool_size, col:col+pool_size]
                output[i, j] = np.max(region)
        
        return output
    
    pooled = max_pool_2d(feature_map)
    print("Max pooling result (4x4 -> 2x2):")
    print(pooled)
    # [[6, 4],
    #  [7, 9]]
    
    # Keras pooling layers
    from tensorflow.keras.layers import Input
    from tensorflow.keras.models import Model
    
    input_layer = Input(shape=(28, 28, 64))
    
    # Max pooling: keeps strongest activations
    max_pool = MaxPooling2D(pool_size=(2, 2), strides=(2, 2))(input_layer)
    # Output: (14, 14, 64) - spatial dimensions halved
    
    # Average pooling: smoother downsampling
    avg_pool = AveragePooling2D(pool_size=(2, 2))(input_layer)
    
    # Global average pooling: reduces to single value per channel
    # Often used before final dense layer
    gap = GlobalAveragePooling2D()(input_layer)
    # Output: (64,) - one value per feature map

    Padding and Stride

    Padding adds zeros around the input to control output size. Stride determines how far the filter moves at each step. Together, they give precise control over the spatial dimensions of feature maps.

    # Padding and stride effects on output size
    import tensorflow as tf
    from tensorflow.keras.layers import Conv2D
    
    # Input: 32x32 image
    input_shape = (32, 32, 3)
    
    # Case 1: No padding ('valid'), stride 1, 3x3 kernel
    # Output: (32-3)/1 + 1 = 30x30
    conv_valid = Conv2D(16, (3, 3), strides=1, padding='valid')
    
    # Case 2: Same padding, stride 1, 3x3 kernel
    # Output: 32/1 = 32x32 (padding preserves size)
    conv_same = Conv2D(16, (3, 3), strides=1, padding='same')
    
    # Case 3: Same padding, stride 2, 3x3 kernel
    # Output: 32/2 = 16x16 (halved by stride)
    conv_stride2 = Conv2D(16, (3, 3), strides=2, padding='same')
    
    # Case 4: Valid padding, stride 2, 5x5 kernel
    # Output: (32-5)/2 + 1 = 14x14
    conv_large = Conv2D(16, (5, 5), strides=2, padding='valid')
    
    # Output size formula:
    # Valid: floor((input - kernel) / stride) + 1
    # Same:  ceil(input / stride)
    
    # Test the layers
    x = tf.random.normal((1, 32, 32, 3))
    print(f"Input shape: {x.shape}")
    print(f"Valid padding: {conv_valid(x).shape}")   # (1, 30, 30, 16)
    print(f"Same padding: {conv_same(x).shape}")     # (1, 32, 32, 16)
    print(f"Stride 2: {conv_stride2(x).shape}")      # (1, 16, 16, 16)
    print(f"5x5 kernel stride 2: {conv_large(x).shape}")  # (1, 14, 14, 16)
    Parameter sharing — the CNN advantage: A 3x3 filter has only 9 learnable weights (plus 1 bias). The same filter is applied across the entire image, detecting the pattern wherever it appears. In contrast, a fully connected layer mapping a 224×224×3 input to 1000 neurons would need on the order of 150 million weights. Parameter sharing drastically reduces parameters and improves generalization.
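The numbers in the paragraph above can be verified with simple arithmetic: a sketch comparing one convolutional layer's parameter count against the fully connected alternative (here the conv layer is assumed to have 64 filters over a 3-channel input, so each filter has 3x3x3 weights plus a bias):

```python
# Convolutional layer: 64 filters of 3x3 over a 3-channel input
kh, kw, in_ch, n_filters = 3, 3, 3, 64
conv_params = (kh * kw * in_ch + 1) * n_filters  # +1 bias per filter
print(conv_params)  # 1792 — independent of the image size

# Fully connected layer: 224x224x3 input mapped to 1000 neurons
fc_params = (224 * 224 * 3) * 1000 + 1000  # weights + biases
print(f"{fc_params:,}")  # 150,529,000 — on the order of 150 million
```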

    Practice: Convolution and Pooling

    Question: A 28x28 input is convolved with a 5x5 kernel, stride 1, and no padding. What is the output size?

    Answer: 24x24. Using the formula: (input - kernel) / stride + 1 = (28 - 5) / 1 + 1 = 24.

    Question: How many parameters does a Conv2D layer with 64 filters of size 3x3 have, given a 32-channel input?

    Answer: 18,496 parameters. Each filter is 3x3x32 = 288 weights, plus 1 bias = 289 per filter. With 64 filters: 289 x 64 = 18,496 parameters.

    Question: What are the trade-offs between strided convolution and max pooling for downsampling?

    Answer: Strided convolution learns how to downsample, potentially preserving more useful information than the fixed max operation. It combines feature extraction and downsampling in one step, reducing computation. However, max pooling provides stronger translation invariance and has no learnable parameters. Modern architectures like ResNet use strided convolution, while classic architectures use max pooling. Strided convolution may slightly increase overfitting risk.
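To make the strided-convolution-versus-pooling comparison concrete, this NumPy sketch (reusing the manual convolve2d and max_pool_2d definitions from this section) shows that both operations halve an 8x8 map to 4x4, but only the convolution has weights that training could adjust:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Valid convolution, as defined earlier in this section."""
    h, w = image.shape
    kh, kw = kernel.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            output[i, j] = np.sum(region * kernel)
    return output

def max_pool_2d(fm, pool_size=2, stride=2):
    """2x2 max pooling, as defined earlier in this section."""
    out = np.zeros((fm.shape[0] // stride, fm.shape[1] // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fm[i*stride:i*stride+pool_size,
                           j*stride:j*stride+pool_size].max()
    return out

x = np.arange(64, dtype=np.float32).reshape(8, 8)

# Max pooling: fixed rule, zero learnable parameters, 8x8 -> 4x4
pooled = max_pool_2d(x)
print(pooled.shape)  # (4, 4)

# Stride-2 convolution: a 2x2 learnable kernel, also 8x8 -> 4x4
# (fixed here to an averaging kernel purely for illustration)
kernel = np.full((2, 2), 0.25)
strided = convolve2d(x, kernel, stride=2)
print(strided.shape)  # (4, 4)
```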

    03

    Building CNN Architectures

    CNN architectures have evolved dramatically since LeNet-5 in 1998. Modern architectures like VGG, ResNet, and EfficientNet achieve superhuman accuracy on image classification. Understanding these landmark architectures helps you design networks for your own tasks and choose the right pre-trained model for transfer learning.

    Key Concept

    CNN Architecture Patterns

    Most CNN architectures follow a pattern: Feature extraction (convolutional and pooling layers that progressively reduce spatial dimensions while increasing channels) followed by Classification (fully connected layers that map features to class probabilities). Deeper networks learn more abstract features but are harder to train without techniques like batch normalization and skip connections.

    LeNet-5: The Pioneer

    LeNet-5, developed by Yann LeCun in 1998, was the first successful CNN for digit recognition. Its simple architecture established the fundamental CNN pattern still used today: alternating convolution and pooling layers followed by fully connected layers.

    # LeNet-5 architecture (1998)
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, AveragePooling2D, Flatten, Dense
    
    def create_lenet5(input_shape=(32, 32, 1), num_classes=10):
        """
        LeNet-5: The original CNN for digit recognition.
        Total params: ~60,000
        """
        model = Sequential([
            # C1: 6 filters of 5x5, output: 28x28x6
            Conv2D(6, (5, 5), activation='tanh', input_shape=input_shape),
            
            # S2: Average pooling 2x2, output: 14x14x6
            AveragePooling2D(pool_size=(2, 2)),
            
            # C3: 16 filters of 5x5, output: 10x10x16
            Conv2D(16, (5, 5), activation='tanh'),
            
            # S4: Average pooling 2x2, output: 5x5x16
            AveragePooling2D(pool_size=(2, 2)),
            
            # Flatten: 5*5*16 = 400
            Flatten(),
            
            # C5: Fully connected, 120 neurons
            Dense(120, activation='tanh'),
            
            # F6: Fully connected, 84 neurons
            Dense(84, activation='tanh'),
            
            # Output: 10 classes
            Dense(num_classes, activation='softmax')
        ])
        return model
    
    lenet = create_lenet5()
    lenet.summary()
    # Total params: 61,706

    VGGNet: Going Deeper with 3x3

    VGG (2014) demonstrated that very deep networks with small 3x3 filters outperform shallow networks with large filters. VGG-16 has 16 weight layers and established the practice of stacking multiple convolutions before pooling.

    # VGG-16 style architecture
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
    
    def create_vgg_block(model, filters, num_convs):
        """VGG block: multiple 3x3 convs followed by max pooling."""
        for _ in range(num_convs):
            model.add(Conv2D(filters, (3, 3), padding='same', activation='relu'))
        model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    
    def create_vgg16(input_shape=(224, 224, 3), num_classes=1000):
        """
        VGG-16: Deep network with 3x3 filters.
        Total params: ~138 million
        """
        model = Sequential()
        model.add(Conv2D(64, (3, 3), padding='same', activation='relu', 
                         input_shape=input_shape))
        
        # Block 1: 64 filters, 2 convs
        create_vgg_block(model, 64, 1)  # Already added first conv
        
        # Block 2: 128 filters, 2 convs  
        create_vgg_block(model, 128, 2)
        
        # Block 3: 256 filters, 3 convs
        create_vgg_block(model, 256, 3)
        
        # Block 4: 512 filters, 3 convs
        create_vgg_block(model, 512, 3)
        
        # Block 5: 512 filters, 3 convs
        create_vgg_block(model, 512, 3)
        
        # Classifier
        model.add(Flatten())
        model.add(Dense(4096, activation='relu'))
        model.add(Dropout(0.5))
        model.add(Dense(4096, activation='relu'))
        model.add(Dropout(0.5))
        model.add(Dense(num_classes, activation='softmax'))
        
        return model
    
    # Use pre-trained VGG instead of training from scratch
    from tensorflow.keras.applications import VGG16
    
    vgg16_pretrained = VGG16(
        weights='imagenet',
        include_top=True,
        input_shape=(224, 224, 3)
    )
    print(f"VGG16 params: {vgg16_pretrained.count_params():,}")
    Architecture Pattern

    Skip Connections (Residual Learning)

    Skip connections add the input of a block directly to its output, allowing gradients to flow through the network without degradation. Instead of learning a mapping H(x), the network learns the residual F(x) = H(x) - x, then computes H(x) = F(x) + x. This simple modification enabled training networks with 100+ layers.

    ResNet: Skip Connections Revolution

    ResNet (2015) solved the degradation problem in very deep networks through skip connections. The insight was that if layers are not needed, they can learn identity mappings. ResNet-50 and ResNet-152 became the go-to architectures for transfer learning.

    # ResNet residual block implementation
    from tensorflow.keras.layers import (Conv2D, BatchNormalization, Activation, 
                                          Add, Input, GlobalAveragePooling2D, Dense)
    from tensorflow.keras.models import Model
    
    def residual_block(x, filters, stride=1, downsample=False):
        """
        ResNet residual block with skip connection.
        If dimensions change (downsample=True), adjust skip with 1x1 conv.
        """
        # Save input for skip connection
        shortcut = x
        
        # First convolution
        x = Conv2D(filters, (3, 3), strides=stride, padding='same')(x)
        x = BatchNormalization()(x)
        x = Activation('relu')(x)
        
        # Second convolution
        x = Conv2D(filters, (3, 3), strides=1, padding='same')(x)
        x = BatchNormalization()(x)
        
        # Adjust shortcut dimensions if needed
        if downsample:
            shortcut = Conv2D(filters, (1, 1), strides=stride, padding='same')(shortcut)
            shortcut = BatchNormalization()(shortcut)
        
        # Add skip connection (the key innovation!)
        x = Add()([x, shortcut])
        x = Activation('relu')(x)
        
        return x
    
    def create_simple_resnet(input_shape=(224, 224, 3), num_classes=10):
        """Simplified ResNet-18 style architecture."""
        inputs = Input(shape=input_shape)
        
        # Initial convolution
        x = Conv2D(64, (7, 7), strides=2, padding='same')(inputs)
        x = BatchNormalization()(x)
        x = Activation('relu')(x)
        
        # Residual blocks (simplified)
        x = residual_block(x, 64)
        x = residual_block(x, 64)
        x = residual_block(x, 128, stride=2, downsample=True)
        x = residual_block(x, 128)
        x = residual_block(x, 256, stride=2, downsample=True)
        x = residual_block(x, 256)
        
        # Classification head
        x = GlobalAveragePooling2D()(x)
        outputs = Dense(num_classes, activation='softmax')(x)
        
        return Model(inputs, outputs)
    
    # Use pre-trained ResNet
    from tensorflow.keras.applications import ResNet50
    
    resnet = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
    print(f"ResNet50 feature extractor output: {resnet.output_shape}")

    Building a Custom CNN from Scratch

    Understanding architecture patterns allows you to design custom CNNs for specific tasks. Start simple, then add complexity based on dataset size and task difficulty. Always use batch normalization and consider skip connections for deeper networks.

    # Custom CNN for CIFAR-10 classification
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import (Conv2D, BatchNormalization, Activation,
                                          MaxPooling2D, Dropout, Flatten, Dense)
    from tensorflow.keras.regularizers import l2
    
    def create_custom_cnn(input_shape=(32, 32, 3), num_classes=10):
        """
        Custom CNN with modern best practices:
        - BatchNorm after every conv
        - Dropout for regularization  
        - L2 weight regularization
        - Increasing filters with depth
        """
        model = Sequential([
            # Block 1: 32 filters
            Conv2D(32, (3, 3), padding='same', kernel_regularizer=l2(1e-4),
                   input_shape=input_shape),
            BatchNormalization(),
            Activation('relu'),
            Conv2D(32, (3, 3), padding='same', kernel_regularizer=l2(1e-4)),
            BatchNormalization(),
            Activation('relu'),
            MaxPooling2D(pool_size=(2, 2)),
            Dropout(0.25),
            
            # Block 2: 64 filters
            Conv2D(64, (3, 3), padding='same', kernel_regularizer=l2(1e-4)),
            BatchNormalization(),
            Activation('relu'),
            Conv2D(64, (3, 3), padding='same', kernel_regularizer=l2(1e-4)),
            BatchNormalization(),
            Activation('relu'),
            MaxPooling2D(pool_size=(2, 2)),
            Dropout(0.25),
            
            # Block 3: 128 filters
            Conv2D(128, (3, 3), padding='same', kernel_regularizer=l2(1e-4)),
            BatchNormalization(),
            Activation('relu'),
            Conv2D(128, (3, 3), padding='same', kernel_regularizer=l2(1e-4)),
            BatchNormalization(),
            Activation('relu'),
            MaxPooling2D(pool_size=(2, 2)),
            Dropout(0.25),
            
            # Classifier
            Flatten(),
            Dense(256, kernel_regularizer=l2(1e-4)),
            BatchNormalization(),
            Activation('relu'),
            Dropout(0.5),
            Dense(num_classes, activation='softmax')
        ])
        
        return model
    
    model = create_custom_cnn()
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    model.summary()
    Modern architecture guidelines:
    • Prefer 3x3 convolutional kernels; stack multiple 3x3 convs instead of using larger kernels.
    • Apply BatchNormalization after each convolutional layer.
    • Use ReLU (or its variants) for activation functions.
    • When halving spatial dimensions, double the number of filters to preserve representational capacity.
    • Add skip connections for networks deeper than ~10 layers and use Global Average Pooling instead of flattening before the classifier to reduce overfitting.

    Practice: CNN Architectures

    Question: Why do modern architectures stack three 3x3 convolutions instead of using a single 7x7 convolution?

    Answer: Three 3x3 convolutions have the same receptive field as one 7x7 (3+3+3-2 = 7), but with fewer parameters (3 * 3^2 = 27 vs 7^2 = 49 per channel pair) and more non-linearities (3 ReLU activations vs 1). This increases the network's expressive power while reducing computation.

    Question: Why do skip connections make very deep networks trainable?

    Answer: Skip connections solve the degradation problem where deeper networks paradoxically have higher training error than shallower ones. This happens because gradients vanish as they backpropagate through many layers. By adding the input directly to the output, gradients have a direct path to flow backward. If a layer is not useful, the network can easily learn the identity mapping (set F(x) to zero, leaving just the skip connection).

    Question: Why is Global Average Pooling often preferred over Flatten before the final classifier?

    Answer: Global Average Pooling (GAP) reduces each feature map to a single value, creating a vector of length equal to the number of channels. This has three benefits: (1) Dramatically fewer parameters compared to flattening (e.g., 7x7x512 = 25,088 values vs 512 values), reducing overfitting. (2) Spatial invariance since the average is the same regardless of where the feature appears. (3) More interpretable since each value corresponds to a specific feature detector. The VGG classifier has 123M of its 138M parameters in the Dense layers, showing the difference.
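The Flatten-versus-GAP savings can be checked with quick arithmetic, assuming a 7x7x512 feature map feeding a 1000-way Dense classifier:

```python
# Flatten vs. Global Average Pooling before a 1000-class Dense layer
h, w, channels, classes = 7, 7, 512, 1000

flat_features = h * w * channels  # 25,088 values after Flatten
gap_features = channels           # 512 values after GAP

dense_after_flatten = flat_features * classes + classes  # weights + biases
dense_after_gap = gap_features * classes + classes

print(f"{dense_after_flatten:,}")  # 25,089,000
print(f"{dense_after_gap:,}")      # 513,000
print(dense_after_flatten // dense_after_gap)  # roughly 48x fewer parameters
```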

    04

    Transfer Learning with Pre-trained Models

    Training CNNs from scratch requires massive datasets and computational resources. Transfer learning solves this by reusing features learned on large datasets like ImageNet. Pre-trained models have already learned universal features like edges, textures, and shapes. You can fine-tune these models for your specific task with a fraction of the data and training time.

    Key Concept

    Transfer Learning

    Transfer learning applies knowledge from one task (source domain) to a different but related task (target domain). In computer vision, models pre-trained on ImageNet (1.2 million images, 1000 classes) have learned general visual features that transfer well to most image tasks. Two main approaches: Feature extraction (freeze pre-trained layers, train new classifier) and Fine-tuning (unfreeze some layers, train with small learning rate).

    Loading Pre-trained Models

    Keras and PyTorch provide pre-trained models with a single line of code. These models come with ImageNet weights and can be used immediately for inference or adapted for your task.

    # Loading pre-trained models in Keras
    from tensorflow.keras.applications import (
        VGG16, VGG19, ResNet50, ResNet101,
        InceptionV3, MobileNetV2, EfficientNetB0
    )
    
    # Load ResNet50 with ImageNet weights
    # include_top=True: includes final classifier (1000 ImageNet classes)
    # include_top=False: only feature extraction layers
    resnet = ResNet50(
        weights='imagenet',      # Pre-trained on ImageNet
        include_top=False,       # Remove classifier for transfer learning
        input_shape=(224, 224, 3)
    )
    
    # Model comparison (ImageNet top-1 accuracy)
    models_info = {
        'VGG16': {'params': '138M', 'accuracy': '71.3%', 'size': '528MB'},
        'ResNet50': {'params': '25.6M', 'accuracy': '74.9%', 'size': '98MB'},
        'InceptionV3': {'params': '23.9M', 'accuracy': '77.9%', 'size': '92MB'},
        'MobileNetV2': {'params': '3.5M', 'accuracy': '71.3%', 'size': '14MB'},
        'EfficientNetB0': {'params': '5.3M', 'accuracy': '77.1%', 'size': '29MB'},
    }
    
    # EfficientNet: Best accuracy per parameter
    efficientnet = EfficientNetB0(weights='imagenet', include_top=False)
    
    # MobileNetV2: Best for mobile/edge deployment
    mobilenet = MobileNetV2(weights='imagenet', include_top=False)
    
    print(f"ResNet50 output shape: {resnet.output_shape}")
    # Output: (None, 7, 7, 2048)

    Feature Extraction (Frozen Layers)

    The simplest transfer learning approach freezes all pre-trained layers and adds a new classifier on top. This works well when your dataset is small or very similar to ImageNet.

    # Feature extraction with frozen base model
    from tensorflow.keras.applications import MobileNetV2
    from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout
    from tensorflow.keras.models import Model
    
    def create_feature_extractor(num_classes, input_shape=(224, 224, 3)):
        """
        Transfer learning with frozen pre-trained layers.
        Train only the new classifier head.
        """
        # Load pre-trained base (without top classifier)
        base_model = MobileNetV2(
            weights='imagenet',
            include_top=False,
            input_shape=input_shape
        )
        
        # FREEZE all base model layers
        base_model.trainable = False
        
        # Add custom classifier
        x = base_model.output
        x = GlobalAveragePooling2D()(x)      # (7, 7, 1280) -> (1280,)
        x = Dense(256, activation='relu')(x)
        x = Dropout(0.5)(x)
        outputs = Dense(num_classes, activation='softmax')(x)
        
        model = Model(inputs=base_model.input, outputs=outputs)
        
        # Check trainable parameters
        trainable_params = sum(p.numpy().size for p in model.trainable_weights)
        total_params = model.count_params()
        print(f"Trainable: {trainable_params:,} / {total_params:,} params")
        # Trainable: ~330K / ~2.6M params
        
        return model
    
    # Create model for 5-class classification
    model = create_feature_extractor(num_classes=5)
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    
    # Train only the new layers (fast, prevents overfitting)
    # history = model.fit(train_data, epochs=10)
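The "~330K trainable" figure in the comments can be checked by hand. A hedged back-of-the-envelope calculation, assuming MobileNetV2's final feature depth of 1280, the head sizes from the sketch above, and 5 classes:

```python
# Head parameters only — the frozen base contributes nothing trainable
gap_features = 1280                     # MobileNetV2 output channels
dense_256 = gap_features * 256 + 256    # Dense(256): weights + biases
classifier = 256 * 5 + 5                # Dense(5) softmax head
trainable = dense_256 + classifier
print(trainable)  # 329221, i.e. ~330K
```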

    Fine-tuning Pre-trained Layers

    Fine-tuning unfreezes some pre-trained layers and trains them with a small learning rate. This adapts the learned features to your specific task. Always start with feature extraction, then fine-tune for best results.

    # Fine-tuning: Unfreeze top layers of base model
    from tensorflow.keras.applications import ResNet50
    from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout
    from tensorflow.keras.models import Model
    from tensorflow.keras.optimizers import Adam
    
    def create_finetune_model(num_classes, input_shape=(224, 224, 3)):
        """
        Fine-tuning: train classifier first, then unfreeze top layers.
        Use small learning rate for pre-trained weights.
        """
        base_model = ResNet50(
            weights='imagenet',
            include_top=False,
            input_shape=input_shape
        )
        
        # Add classifier head
        x = base_model.output
        x = GlobalAveragePooling2D()(x)
        x = Dense(256, activation='relu')(x)
        x = Dropout(0.5)(x)
        outputs = Dense(num_classes, activation='softmax')(x)
        
        model = Model(inputs=base_model.input, outputs=outputs)
        return model, base_model
    
    model, base_model = create_finetune_model(num_classes=10)
    
    # STEP 1: Train classifier with frozen base (feature extraction)
    base_model.trainable = False
    model.compile(
        optimizer=Adam(learning_rate=1e-3),
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    # model.fit(train_data, epochs=5)
    
    # STEP 2: Unfreeze top layers and fine-tune
    base_model.trainable = True
    
    # Freeze all layers except the last 20
    for layer in base_model.layers[:-20]:
        layer.trainable = False
    
    # Use SMALLER learning rate for fine-tuning
    model.compile(
        optimizer=Adam(learning_rate=1e-5),  # 100x smaller!
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    
    # Count trainable layers
    trainable_layers = sum(1 for layer in model.layers if layer.trainable)
    print(f"Trainable layers: {trainable_layers}")
    
    # model.fit(train_data, epochs=10)

    Feature Extraction vs Fine-tuning

    Choose feature extraction (frozen layers) when: you have a small dataset (under 1000 images), limited compute, or your task is very similar to ImageNet. Choose fine-tuning when: you have more data (10k+ images), your domain differs from natural images (medical, satellite), or you need maximum accuracy. Always start with feature extraction, then optionally fine-tune.
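The decision rules above can be condensed into a rough heuristic. This is an illustrative sketch of the guidance in this section, not a library API; the thresholds mirror the ones stated in the text:

```python
def choose_strategy(num_images, similar_to_imagenet):
    """Rule of thumb for picking a transfer learning strategy."""
    if num_images < 1000:
        # Too little data to safely update pre-trained weights
        return "feature extraction"
    if num_images >= 10_000 and not similar_to_imagenet:
        # Enough data, different domain: adapt more layers
        return "fine-tuning"
    # Middle ground: extract features first, fine-tune top layers after
    return "feature extraction, then fine-tune"

print(choose_strategy(500, True))      # feature extraction
print(choose_strategy(50_000, False))  # fine-tuning
```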

    Complete Transfer Learning Pipeline

    A complete transfer learning pipeline includes data augmentation, proper preprocessing, learning rate scheduling, and callbacks for early stopping. The example below ties these pieces together and is a solid starting point for custom datasets.

    # Complete transfer learning pipeline
    import tensorflow as tf
    from tensorflow.keras.applications import EfficientNetB0
    from tensorflow.keras.applications.efficientnet import preprocess_input
    from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout
    from tensorflow.keras.models import Model
    from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    
    # Data augmentation for training
    train_datagen = ImageDataGenerator(
        preprocessing_function=preprocess_input,  # Model-specific preprocessing
        rotation_range=20,
        width_shift_range=0.2,
        height_shift_range=0.2,
        horizontal_flip=True,
        zoom_range=0.15,
        validation_split=0.2  # 20% for validation
    )
    
    # Load training data
    train_generator = train_datagen.flow_from_directory(
        'data/train',
        target_size=(224, 224),
        batch_size=32,
        class_mode='categorical',
        subset='training'
    )
    
    val_generator = train_datagen.flow_from_directory(
        'data/train',
        target_size=(224, 224),
        batch_size=32,
        class_mode='categorical',
        subset='validation'
    )
    
    # Build model
    base_model = EfficientNetB0(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
    base_model.trainable = False
    
    x = GlobalAveragePooling2D()(base_model.output)
    x = Dense(256, activation='relu')(x)
    x = Dropout(0.5)(x)
    outputs = Dense(train_generator.num_classes, activation='softmax')(x)
    
    model = Model(inputs=base_model.input, outputs=outputs)
    
    # Callbacks for training
    callbacks = [
        EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
        ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3),
        ModelCheckpoint('best_model.keras', monitor='val_accuracy', save_best_only=True)
    ]
    
    # Compile and train
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    
    history = model.fit(
        train_generator,
        validation_data=val_generator,
        epochs=20,
        callbacks=callbacks
    )
    
    # Save final model
    model.save('flower_classifier.keras')

    Transfer learning best practices:
    • Match preprocessing to the pre-trained model (use the model's preprocess_input).
    • Apply data augmentation to the training set only; keep validation/test data unchanged.
    • Train new classifier layers with a standard learning rate, then fine-tune with a 10–100× smaller learning rate.
    • Use callbacks (EarlyStopping, ReduceLROnPlateau, ModelCheckpoint) to stabilize training.
    • For very different domains (e.g., medical or satellite imagery), unfreeze and fine-tune more layers.
    • Consider efficient architectures (MobileNet, EfficientNet) for a good accuracy-to-compute trade-off.

    Practice: Transfer Learning

    Answer: It freezes all the weights in the base model, preventing them from being updated during training. Only the new layers (classifier head) will be trained. This preserves the pre-trained features and prevents overfitting when you have limited data.

    Answer: Pre-trained weights already contain useful learned features. A large learning rate would cause large weight updates that could destroy these carefully learned representations. A small learning rate (typically 10-100x smaller) allows gentle adjustments that adapt features to your task while preserving the general knowledge. New layers need larger updates since they start from random initialization.

    Answer: You would freeze fewer layers (fine-tune more layers) for satellite images. ImageNet consists of natural photos with objects like cats, dogs, and cars. Pet breeds are very similar to ImageNet content, so early and middle layer features (edges, textures, animal parts) transfer directly. Satellite images have a different visual domain with overhead perspectives, different color distributions, and unique features (roads, buildings from above). You need to fine-tune more layers to adapt the mid-level and high-level features to this different domain.

    Answer: Validation data should represent real-world deployment conditions where images arrive unaugmented. Augmentation artificially expands training data diversity to improve generalization, but we need consistent, unmodified validation data to reliably measure model performance. Augmenting validation data would give misleading metrics and make it harder to compare training runs or detect overfitting.

    Interactive: Convolution Filter Visualizer

    See how different convolution filters detect features in images. Select a filter type and watch it slide across the input to produce the output feature map.

    The demo convolves an 8x8 input with a selectable 3x3 filter to produce a 6x6 output feature map.

    Filter Purpose

    Vertical edges are detected by filters that respond to intensity changes from left to right
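To make this concrete, here is a minimal NumPy sketch of a vertical-edge (Sobel-style) filter sliding over a synthetic image with a dark-to-bright boundary. The loop-based convolution is written for clarity, not speed, and (as in deep learning frameworks) it computes cross-correlation rather than flipped convolution:

```python
import numpy as np

# Sobel-x kernel: negative weights left, positive right ->
# responds to intensity increasing from left to right
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)

def conv2d_valid(image, k):
    """Naive 'valid' 2D convolution (no padding, stride 1)."""
    kh, kw = k.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out

# 8x8 image: dark left half, bright right half
img = np.zeros((8, 8))
img[:, 4:] = 1.0

edges = conv2d_valid(img, kernel)
print(edges.shape)  # (6, 6): 8x8 input, 3x3 filter, stride 1, no padding
print(edges.max())  # strongest response sits on the vertical boundary
```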

    Key Takeaways

    Images as Tensors

    Digital images are 3D tensors (height, width, channels) that CNNs process through hierarchical feature extraction

    Convolution Operations

    Learnable filters slide across images to detect features like edges, textures, and complex patterns

    Pooling for Reduction

    Pooling layers reduce spatial dimensions while preserving important features and adding translation invariance

    Deep Architectures

    Modern CNNs stack many layers with skip connections (ResNet) to learn increasingly abstract representations

    Transfer Learning

    Pre-trained models on ImageNet can be fine-tuned for new tasks, dramatically reducing training time and data needs

    Feature Hierarchies

    Early layers detect edges and colors, middle layers find textures and parts, deep layers recognize objects

    Knowledge Check

    Test your understanding of Convolutional Neural Networks:

    1 What is the main advantage of using convolution instead of fully connected layers for image processing?
    2 What is the purpose of pooling layers in a CNN?
    3 If you apply a 3x3 filter with stride 1 and no padding to a 32x32 image, what is the output size?
    4 What innovation did ResNet introduce to enable training very deep networks?
    5 In transfer learning, what does "freezing" layers mean?
    6 Which layer type is typically used at the end of a CNN for classification?