Image Classification in Python

Section 01

The Story That Explains Image Classification

📖 Real World Analogy

The Three-Year-Old and the Radiologist

A three-year-old child sees a picture of a dog for the first time. Her parents say "that's a dog." She sees another dog — different breed, different angle, different lighting. Her parents say "that's also a dog." After perhaps a hundred such examples, she can identify dogs she has never seen before. The furry ears, the snout, the posture — she has learned a concept, not memorised a pixel grid.

Now consider a radiologist who has read 50,000 chest X-rays over twenty years. She spots a suspicious shadow at 4mm — invisible to an untrained eye — because her brain has built an extraordinarily detailed internal model of what "normal" looks like versus what "abnormal" looks like. She is classifying images, one at a time, faster and more accurately than any rule-based system ever could.

Image Classification is the task of training a machine to do exactly this: look at a raw image and output a label — cat, dog, tumour, fraudulent cheque, ripe tomato, storm cloud — by learning from thousands of labelled examples. It is the oldest, most studied problem in computer vision, and it is the foundation on which everything else — object detection, segmentation, video understanding — is built.

At its core, image classification is a supervised learning problem. You give a model thousands of (image, label) pairs. The model learns a mapping from pixel values to class probabilities. At inference time it applies that mapping to images it has never seen before and outputs its best prediction.

📸

The Benchmark That Changed Everything — ImageNet

Fei-Fei Li's team introduced ImageNet in 2009, and from 2010 its annual challenge (ILSVRC) used a subset of about 1.2 million training images across 1,000 categories. In 2012, Alex Krizhevsky's AlexNet cut the best top-5 error rate from roughly 26% to 15% using a deep convolutional neural network trained on two GPUs. That single result launched the modern deep learning era. Today's best models achieve under 1.5% top-5 error on the same benchmark, below the roughly 5% error estimated for a trained human annotator.

Section 02

How Images Become Numbers — The Input Pipeline

A model never sees an image. It sees a tensor — a multidimensional array of numbers. Before any learning can happen, every raw image must pass through a standardised pre-processing pipeline that converts it into a fixed-size, normalised tensor the model can consume.

🗃 The Standard Image Pre-Processing Pipeline

Load

Read image file (JPEG/PNG/TIFF) from disk. Result: an H×W×C array of integers 0–255. Watch for BGR vs RGB — OpenCV loads BGR; PIL loads RGB.

Resize

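Resize to the model's expected input dimensions. ResNet and ViT-B/16 are usually run at 224×224; EfficientNet-B0 also uses 224×224, while larger EfficientNet variants use higher resolutions. CLIP uses 224×224 or 336×336. Interpolation: use the mode the pretrained weights were trained with (commonly BILINEAR or BICUBIC) and keep it the same for training and inference.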

Augment

Training only: random flip, crop, colour jitter, rotation. Never augment your validation or test set. Augmentation fights overfitting by making the model see each image in many forms.

ToTensor

Convert H×W×C uint8 array → C×H×W float32 tensor and divide by 255 → pixel values in [0.0, 1.0]. PyTorch stores channels-first (C,H,W); TensorFlow stores channels-last (H,W,C).

Normalise

Subtract ImageNet channel means [0.485, 0.456, 0.406] and divide by stds [0.229, 0.224, 0.225]. This centres each channel around zero, matching the distribution the pretrained model was trained on.

Batch

Stack N processed images into a single batch tensor of shape (N, C, H, W). Typical batch sizes: 32, 64, 128, 256. Larger batches = faster training but more GPU memory.

import torch
from torchvision import transforms
from PIL import Image

# ── Training transform (with augmentation) ────────────────────
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                             saturation=0.1, hue=0.05),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                           std=[0.229, 0.224, 0.225])
])

# ── Validation / test transform (NO augmentation) ─────────────
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),       # deterministic centre crop
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                           std=[0.229, 0.224, 0.225])
])

# ── Test it on a single image ─────────────────────────────────
img = Image.open('dog.jpg').convert('RGB')
tensor = val_transform(img)              # shape: (3, 224, 224)
batch  = tensor.unsqueeze(0)           # shape: (1, 3, 224, 224)

print(f"Original image size : {img.size}")
print(f"Tensor shape        : {tensor.shape}")
print(f"Tensor dtype        : {tensor.dtype}")
print(f"Pixel value range   : [{tensor.min():.3f}, {tensor.max():.3f}]")

OUTPUT

Original image size : (1920, 1280)
Tensor shape        : torch.Size([3, 224, 224])
Tensor dtype        : torch.float32
Pixel value range   : [-2.118, 2.640]

⚠️

Why Pixel Values Go Negative After Normalisation

After subtracting ImageNet means and dividing by stds, pixel values are no longer in [0,1]. They can be negative (very dark pixels below the mean) or above 1.0 (bright pixels). This is expected and correct. The model was trained on data with these exact statistics. Never re-clip to [0,1] after normalising — you will corrupt the input distribution and degrade performance.
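The printed range follows directly from the normalisation arithmetic. Here is a minimal sketch in plain Python, assuming the same ImageNet means and stds as the transforms above; the helper name `normalise` is only illustrative and not part of torchvision.

```python
# ImageNet channel statistics in R, G, B order, as used in the transforms above
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalise(value, channel):
    """Apply (value - mean) / std to one pixel value already scaled to [0, 1]."""
    return (value - MEAN[channel]) / STD[channel]

# Smallest possible output: a pure black pixel (0.0) in the red channel
lowest = normalise(0.0, 0)
# Largest possible output: a pure white pixel (1.0) in the blue channel
highest = normalise(1.0, 2)

print(f"lowest  = {lowest:.3f}")   # -2.118
print(f"highest = {highest:.3f}")  # 2.640
```

These two bounds match the pixel value range in the output above, so any image containing both pure black and pure white pixels will span that whole interval.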

Section 03

Classical Approach — HOG + SVM Before Deep Learning

📖 Story

The Pre-2012 Era — When Humans Designed Features by Hand

Before 2012, state-of-the-art image classification meant spending months designing hand-crafted feature extractors. A computer vision PhD student would carefully craft a representation — HOG, SIFT, LBP, colour histograms — that captured what they believed was important about images in a given domain. Then feed those hand-crafted features into a Support Vector Machine (SVM) and pray it generalised.

The best systems required enormous domain expertise and still plateaued: the strongest hand-engineered ImageNet entries sat at roughly 26% top-5 error. Then AlexNet arrived and made a decade of feature engineering look obsolete almost overnight. Understanding the classical approach teaches you what deep networks learned to do automatically.

from skimage.feature import hog
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np
import cv2, os

def extract_hog(img_path, size=(64, 64)):
    """Load image, resize, convert greyscale, extract HOG descriptor."""
    img  = cv2.imread(img_path)
    if img is None:
        # cv2.imread returns None (no exception) for missing or non-image files
        raise ValueError(f"Could not read image: {img_path}")
    img  = cv2.resize(img, size)
    grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # OpenCV loads BGR
    fd   = hog(grey, orientations=9,
                pixels_per_cell=(8, 8),
                cells_per_block=(2, 2),
                block_norm='L2-Hys')
    return fd

# Build feature matrix from image folder structure
# Expects: data/cats/*.jpg, data/dogs/*.jpg
X, y = [], []
for label, cls in enumerate(['cats', 'dogs']):
    folder = f'data/{cls}'
    for fname in os.listdir(folder):
        path = os.path.join(folder, fname)
        X.append(extract_hog(path))
        y.append(label)

X = np.array(X)   # shape: (N, 1764) — HOG dim for 64×64 image
y = np.array(y)

# Pipeline: scale → SVM with RBF kernel
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm',    SVC(C=10.0, kernel='rbf', gamma='scale',
                    probability=True))
])

scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f"HOG + SVM   5-fold CV: {scores.mean():.3f} ± {scores.std():.3f}")
print(f"Feature vector length: {X.shape[1]}")

OUTPUT

HOG + SVM   5-fold CV: 0.743 ± 0.018
Feature vector length: 1764
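The feature length of 1764 can be worked out by hand. The sketch below assumes the settings used in `extract_hog` (8×8-pixel cells, 2×2-cell blocks, 9 orientations) and blocks that slide one cell at a time; `hog_length` is a hypothetical helper, not a scikit-image function.

```python
def hog_length(img_size=64, cell=8, block=2, orientations=9):
    """Length of a HOG descriptor for a square image, with a block stride of one cell."""
    cells_per_side = img_size // cell                 # 64 / 8 = 8 cells
    blocks_per_side = cells_per_side - block + 1      # 8 - 2 + 1 = 7 block positions
    values_per_block = block * block * orientations   # 2 * 2 * 9 = 36 histogram bins
    return blocks_per_side ** 2 * values_per_block

print(hog_length())  # 7 * 7 * 36 = 1764
```

Doubling the image side to 128 pixels gives 15 × 15 blocks and a much longer descriptor, which is one reason classical pipelines kept input images small.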

Section 04

Convolutional Neural Networks — The Core Idea

A Convolutional Neural Network (CNN) replaces hand-crafted feature engineering with learnable filters. Instead of a human deciding "edges are important," the network learns which filters — which patterns of pixels — maximally discriminate between classes, directly from the training data.

🚩

Convolutional Layer

Learnable spatial filters

Slides a small filter (e.g. 3×3 or 5×5) across the entire image, computing a dot product at each position. Each filter learns to detect one pattern: an edge, a curve, a texture. A layer with 64 filters produces 64 feature maps. Key insight: the same filter is used everywhere (weight sharing), which makes the layer translation-equivariant (a shifted input produces a correspondingly shifted feature map) and massively reduces parameters compared with a fully connected layer.

📈

Pooling Layer

Spatial downsampling

Reduces the spatial dimensions of feature maps by summarising regions. Max pooling takes the strongest activation in each 2×2 block, retaining the most prominent features while halving width and height. Modern architectures (ResNet, EfficientNet) do most of their downsampling with strided convolutions rather than explicit pooling layers, so the downsampling is learned rather than hard-coded.

⚡

Activation Function

Non-linearity — the secret ingredient

Without activation functions, stacking conv layers is equivalent to a single linear layer, no matter how deep the network. ReLU (max(0,x)) is the usual default: cheap, sparse, and far less prone to vanishing gradients than sigmoid or tanh because its gradient is exactly 1 for positive inputs. Modern variants: GELU (transformers), SiLU/Swish (EfficientNet), Mish (YOLO variants).

🛠️

Batch Normalisation

Stabilises training

Normalises each channel's activations to zero mean and unit variance over the batch, then applies a learnable scale and shift. Allows much higher learning rates, reduces sensitivity to weight initialisation, and acts as a mild regulariser. Placed after conv, before activation in classic ResNet; after activation in some modern designs.

📋

Fully Connected Layer

Classification head

After the final conv block, a Global Average Pooling layer collapses each feature map to a single value, giving a 1D vector. One or two FC layers then map this to class logits. Softmax converts logits to probabilities summing to 1.0. This head is what gets replaced during transfer learning.

🌏

Dropout

Regularisation against overfitting

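Randomly zeroes a fraction (typically 20–50%) of neurons during training, forcing the network to learn redundant representations. At inference time all neurons are active. In the original formulation the weights are then scaled by (1 − drop_rate); PyTorch's nn.Dropout instead uses inverted dropout, scaling the surviving activations by 1/(1 − drop_rate) during training so that nothing needs rescaling at inference. Most effective in the FC head; rarely used inside conv blocks in modern architectures.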
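To put a number on the weight-sharing claim in the Convolutional Layer card above, the sketch below compares a 3×3 convolution with a fully connected layer that maps the same input to the same output size. The 224×224 RGB input and the 64 output channels are illustrative assumptions, and both helper functions are hypothetical.

```python
def conv_params(in_ch, out_ch, k):
    """Weights plus biases for a k x k convolution (one shared filter per output channel)."""
    return k * k * in_ch * out_ch + out_ch

def dense_params(in_features, out_features):
    """Weights plus biases for a fully connected layer (one weight per input/output pair)."""
    return in_features * out_features + out_features

h = w = 224
conv = conv_params(in_ch=3, out_ch=64, k=3)
dense = dense_params(h * w * 3, h * w * 64)

print(f"3x3 conv, 3 -> 64 channels : {conv:,} parameters")
print(f"dense layer, same shapes   : {dense:,} parameters")
print(f"ratio                      : {dense / conv:,.0f}x")
```

The convolution needs 1,792 parameters regardless of image size, whereas the dense layer's count grows with the square of the number of pixels.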

🌱

What Each Layer Actually Learns

Visualisation research (Zeiler & Fergus 2013, Olah et al. 2020) revealed a beautiful hierarchy: early layers learn edges and colour blobs. Middle layers learn textures, patterns, and object parts (eyes, wheels, fur). Deep layers learn complete semantic concepts (faces, dogs, cars). This hierarchy emerges consistently in CNNs trained on natural images, largely regardless of architecture, which is why pretrained features transfer so well.

Section 05

Architecture Evolution — From AlexNet to Vision Transformers

Architecture	Year	Top-5 Error	Params	Key Innovation	Use Today?
AlexNet	2012	15.3%	60M	Deep CNN on GPU, ReLU, Dropout	Teaching only
VGG-16	2014	7.3%	138M	Very deep (16 layers), uniform 3×3 convs	Occasional baseline
GoogLeNet/Inception	2014	6.7%	6.8M	Inception modules, parallel paths, fewer params	Rarely
ResNet-50	2015	5.3%	25M	Residual skip connections — enables 100+ layer networks	Yes — excellent baseline
DenseNet-121	2017	6.7%	8M	Dense connections — every layer feeds all subsequent layers	Medical imaging
EfficientNet-B0	2019	6.7%	5.3M	Compound scaling — width, depth, resolution jointly	Yes — best acc/params trade-off
ViT-B/16	2021	2.0%	86M	Vision Transformer — pure attention, no convolutions	Yes — large data regime
ConvNeXt-B	2022	1.9%	89M	Modernised pure-CNN rivalling ViT with simpler training	Yes — practical favourite

💡

Which Architecture Should You Actually Use?

For most practical projects: EfficientNet-B2 or ResNet-50 with transfer learning. For mobile/edge deployment: MobileNetV3 or EfficientNet-B0. For maximum accuracy with ample GPU: ConvNeXt-Base or ViT-B/16 pretrained on ImageNet-21k. Never start from scratch unless you have millions of domain-specific images.

Section 06

ResNet — The Skip Connection That Solved Depth

📖 Story

Kaiming He's Observation — Deeper Shouldn't Be Worse

In 2015, Kaiming He at Microsoft Research noticed something paradoxical: making networks deeper was making them worse on training data — not just on test data. That ruled out overfitting. Something more fundamental was wrong.

His reasoning: a 56-layer network should be at least as good as a 20-layer network, because the extra 36 layers could simply learn to be identity functions and do nothing. But SGD couldn't find that solution. The gradients were vanishing before they reached the early layers.

His fix was elegant: add a shortcut connection that bypasses two conv layers. If the block learns F(x), the output is F(x) + x. The gradient can now flow directly backwards through the skip connection, bypassing the potentially troubled conv layers. Suddenly 152-layer networks trained cleanly.

ResNet won the ILSVRC 2015 classification, detection, and localisation tasks by a clear margin and became one of the most cited papers in deep learning. Its residual block is the atomic unit of modern neural network design.

⚠ Plain Network — No Skip Connections

Layer	Output	Gradient flow
Conv → BN → ReLU	F₁(x)	Multiplied by ∂F₁/∂x
Conv → BN → ReLU	F₂(F₁)	Multiplied further
· · · (50 more layers)	F₅₂(...)	Vanishes to ≈ 0
FC → Softmax	ŷ	Early layers learn nothing

✅ ResNet — With Skip Connections

Block	Output	Gradient flow
Conv → BN → ReLU → Conv → BN	F(x)	Through conv path
Skip connection	x	Direct — always = 1.0
Add + ReLU	F(x) + x	Sum of both paths
· · · (25 such blocks)	Deep output	Gradient never vanishes
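The two tables can be reproduced with a toy calculation. This is a deliberately simplified sketch, assuming scalar "layers" of the form tanh(w·x) in place of conv blocks; the function name, the weight 0.5, and the depth of 50 are illustrative choices, not values from the ResNet paper.

```python
import math

def input_gradient(depth, w=0.5, x0=1.0, residual=False):
    """d(output)/d(input) for a stack of scalar layers x -> tanh(w * x).

    With residual=True each layer computes x + tanh(w * x) instead.
    """
    x, grad = x0, 1.0
    for _ in range(depth):
        t = math.tanh(w * x)
        local = w * (1.0 - t * t)      # derivative of tanh(w * x) with respect to x
        if residual:
            grad *= 1.0 + local        # the "+ 1" comes from the skip path
            x = x + t
        else:
            grad *= local              # chain rule: product of small numbers
            x = t
    return grad

plain = input_gradient(50)
skip = input_gradient(50, residual=True)
print(f"plain stack    : {plain:.3e}")
print(f"residual stack : {skip:.3e}")
```

In the plain stack every factor is at most 0.5, so the product collapses towards zero after 50 layers. In the residual stack every factor is at least 1, so the gradient reaching the input cannot vanish.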

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A standard ResNet bottleneck block (as in ResNet-50/101/152)."""
    expansion = 4

    def __init__(self, in_ch, out_ch, stride=1, downsample=None):
        super().__init__()
        # Bottleneck: 1×1 → 3×3 → 1×1
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1   = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3,
                                stride=stride, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(out_ch)
        self.conv3 = nn.Conv2d(out_ch, out_ch * self.expansion,
                                kernel_size=1, bias=False)
        self.bn3   = nn.BatchNorm2d(out_ch * self.expansion)
        self.relu  = nn.ReLU(inplace=True)
        self.downsample = downsample   # 1×1 conv to match dimensions

    def forward(self, x):
        identity = x                         # ← the skip connection

        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))

        if self.downsample is not None:
            identity = self.downsample(x)  # resize skip if dims changed

        out += identity                      # ← F(x) + x
        return self.relu(out)

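# Verify the block works
# The block widens 64 → 256 channels, so the skip path needs a 1×1
# projection; without it, F(x) + x fails on mismatched channel counts
downsample = nn.Sequential(
    nn.Conv2d(64, 64 * ResidualBlock.expansion, kernel_size=1, bias=False),
    nn.BatchNorm2d(64 * ResidualBlock.expansion),
)
block = ResidualBlock(64, 64, downsample=downsample)
x     = torch.randn(4, 64, 56, 56)
out   = block(x)
print(f"Input  shape : {x.shape}")
print(f"Output shape : {out.shape}")
print(f"Parameters   : {sum(p.numel() for p in block.parameters()):,}")

OUTPUT

Input  shape : torch.Size([4, 64, 56, 56])
Output shape : torch.Size([4, 256, 56, 56])
Parameters   : 75,008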

Section 07

Transfer Learning — Standing on the Shoulders of ImageNet

Training a ResNet-50 from scratch on ImageNet takes 90 epochs on 8 V100 GPUs — about 4 days and $2,000 in cloud compute. For most projects you have hundreds or thousands of images, not millions. Transfer learning is the solution: take a network already trained on ImageNet, keep its learned feature extractors, and replace only the final classification layer with one specific to your task.

🔒

Feature Extraction

Freeze all pretrained weights. Replace only the final FC layer. Train only the new head. Fast — takes minutes on a CPU. Best when your data is small (<1,000 images) and similar to ImageNet.

model.requires_grad_(False)

🔥

Fine-Tuning (Head Only → All)

Train head first for 5–10 epochs (head is random; training everything together destroys pretrained features). Then unfreeze all layers and train with a very low learning rate (1e-4 to 1e-5). The classic two-phase approach.

Use low lr = 1e-4 when unfreezing

🎶

Discriminative LR

Assign different learning rates to different layer groups: very low for early conv layers (their edge and texture filters transfer almost unchanged), medium for middle layers, higher for the head. fastai popularised this. Extracts maximum performance.

Layer groups: [lr/100, lr/10, lr]

🔀

Domain Adaptation

Your images are very different from ImageNet (X-rays, satellite imagery, microscopy). Fine-tune on a large unlabelled in-domain dataset first (self-supervised or masked autoencoder), then supervised fine-tune on your labelled data.

CheXpert pretrain → fine-tune

🔄

Full Retraining

Use ImageNet weights only for initialisation; train everything aggressively. Best when you have >100,000 domain-specific images and the domain is very different from natural images. Expensive. Only justified with large datasets.

Needs 100k+ images to beat transfer

🌟

Zero-Shot (CLIP)

No labelled training data at all. CLIP encodes both images and text descriptions into a shared embedding space. At inference: encode each class name as text, find the class whose text embedding is closest to the image embedding.

CLIP: zero fine-tuning needed
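The zero-shot matching step can be sketched without loading CLIP at all. In the example below the 3-d vectors are made-up stand-ins for the outputs of CLIP's text and image encoders (real embeddings have hundreds of dimensions), and the prompts are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings standing in for text-encoder outputs, one per class prompt
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.1],
    "a photo of a car": [0.0, 0.1, 0.9],
}
# Hypothetical embedding standing in for the image-encoder output
image_embedding = [0.2, 0.8, 0.1]

# Score every class prompt against the image and pick the closest one
scores = {prompt: cosine(image_embedding, emb)
          for prompt, emb in text_embeddings.items()}
prediction = max(scores, key=scores.get)
print(prediction)  # a photo of a dog
```

Adding a new class only requires a new text prompt, which is why no labelled training images are needed.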

import torch
import torch.nn as nn
from torchvision import models

# ── Strategy 1: Feature Extraction (freeze backbone) ──────────
model_fe = models.resnet50(weights='IMAGENET1K_V2')

# Freeze every parameter in the backbone
for param in model_fe.parameters():
    param.requires_grad = False

# Replace the classification head (in_features=2048 for ResNet-50)
num_classes = 5    # e.g. flower species: daisy, dandelion, rose, sunflower, tulip
model_fe.fc = nn.Sequential(
    nn.Dropout(p=0.3),
    nn.Linear(model_fe.fc.in_features, num_classes)
)

# Only head parameters will be updated
trainable = sum(p.numel() for p in model_fe.parameters()
                 if p.requires_grad)
total     = sum(p.numel() for p in model_fe.parameters())
print(f"Trainable params : {trainable:,}  ({trainable/total*100:.1f}%)")
print(f"Total params     : {total:,}")

# ── Strategy 2: Fine-Tuning (unfreeze backbone after head warmup) ─
model_ft = models.resnet50(weights='IMAGENET1K_V2')
model_ft.fc = nn.Linear(model_ft.fc.in_features, num_classes)

# Phase 1: train only head (backbone frozen)
for param in model_ft.parameters(): param.requires_grad = False
model_ft.fc.requires_grad_(True)
opt_phase1 = torch.optim.Adam(model_ft.fc.parameters(), lr=1e-3)

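# Phase 2 (after 10 epochs): unfreeze all, low lr
for param in model_ft.parameters(): param.requires_grad = True
# The stem (conv1 + bn1) needs its own group: parameters left out of the
# optimiser are never updated, even when requires_grad is True
stem_params = list(model_ft.conv1.parameters()) + list(model_ft.bn1.parameters())
opt_phase2 = torch.optim.Adam([
    {'params': stem_params,                  'lr': 1e-5},
    {'params': model_ft.layer1.parameters(), 'lr': 1e-5},
    {'params': model_ft.layer2.parameters(), 'lr': 1e-5},
    {'params': model_ft.layer3.parameters(), 'lr': 3e-5},
    {'params': model_ft.layer4.parameters(), 'lr': 1e-4},
    {'params': model_ft.fc.parameters(),     'lr': 1e-3},
])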
print(f"\nPhase-2 discriminative LR configured.")

OUTPUT

Trainable params : 10,245  (0.0%)
Total params     : 23,518,277

Phase-2 discriminative LR configured.

Section 08

The Full Training Loop — Every Moving Part

Forward Pass

Pass a batch of images through the model. Each image produces a vector of raw scores (logits) for each class. Shape: (batch_size, num_classes). No activation yet — Cross-Entropy loss works on logits directly.

Loss Computation

Cross-Entropy loss = −log(p), where p is the softmax probability the model assigns to the true class. Perfect prediction → loss ≈ 0. Random prediction on 10 classes → loss ≈ log(10) ≈ 2.3. This is your starting loss benchmark: if your loss doesn't start near log(num_classes), something is wrong.

Backward Pass (Backprop)

loss.backward() computes the gradient of the loss with respect to every trainable parameter using the chain rule. PyTorch builds a computation graph during the forward pass and traces it in reverse. Gradients are accumulated in param.grad.

Gradient Clipping (Optional)

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) prevents exploding gradients — a common failure mode when fine-tuning with a high learning rate or training transformers. Rarely needed for CNNs with batch norm.

Optimiser Step

optimizer.step() updates every parameter: θ ← θ − lr × ∇L. Then optimizer.zero_grad() clears the accumulated gradients before the next batch. Forgetting zero_grad() accumulates gradients across batches — a silent bug that corrupts training.

LR Scheduler Step

scheduler.step() decays the learning rate according to a schedule (CosineAnnealingLR, OneCycleLR, ReduceLROnPlateau). Called once per epoch (not per batch, unless using OneCycleLR which steps per batch). Proper LR scheduling typically adds 1–3% accuracy.
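The starting-loss check from the Loss Computation step is easy to verify by hand. The sketch below computes cross-entropy from raw logits in plain Python, assuming a 10-class problem; `cross_entropy` here is a hypothetical helper that mirrors what nn.CrossEntropyLoss does for a single sample without label smoothing.

```python
import math

def cross_entropy(logits, target):
    """-log(softmax(logits)[target]), computed with the log-sum-exp trick for stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

num_classes = 10

# Untrained model: every class gets the same logit, so each has probability 1/10
untrained = cross_entropy([0.0] * num_classes, target=3)

# Confident and correct: the true class has a much larger logit than the rest
confident = cross_entropy([0.0] * 3 + [12.0] + [0.0] * 6, target=3)

print(f"uniform logits    : {untrained:.3f}")
print(f"log(num_classes)  : {math.log(num_classes):.3f}")
print(f"confident correct : {confident:.5f}")
```

If the first epoch's loss is far above log(num_classes), common causes include mislabelled data, a wrong number of output units, or a learning rate that is too high.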

import torch
import torch.nn as nn
from torchvision import models, datasets, transforms
from torch.utils.data import DataLoader
import time

# ── Data ─────────────────────────────────────────────────────────
train_ds = datasets.ImageFolder('data/train', transform=train_transform)
val_ds   = datasets.ImageFolder('data/val',   transform=val_transform)
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True,
                        num_workers=4, pin_memory=True)
val_dl   = DataLoader(val_ds,   batch_size=128, shuffle=False,
                        num_workers=4, pin_memory=True)
print(f"Classes : {train_ds.classes}")
print(f"Train   : {len(train_ds)} images | Val: {len(val_ds)} images")

# ── Model, loss, optimiser, scheduler ────────────────────────────
device    = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model     = models.resnet50(weights='IMAGENET1K_V2')
model.fc  = nn.Linear(model.fc.in_features, len(train_ds.classes))
model     = model.to(device)

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimiser = torch.optim.AdamW(model.parameters(), lr=3e-4,
                                weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
                optimiser, T_max=30, eta_min=1e-6)

def run_epoch(loader, training=True):
    model.train(training)   # True → train mode, False → eval mode
    total_loss, correct, n = 0.0, 0, 0
    ctx = torch.enable_grad() if training else torch.no_grad()
    with ctx:
        for imgs, labels in loader:
            imgs, labels = imgs.to(device), labels.to(device)
            logits = model(imgs)
            loss   = criterion(logits, labels)
            if training:
                optimiser.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                optimiser.step()
            total_loss += loss.item() * imgs.size(0)
            correct    += (logits.argmax(1) == labels).sum().item()
            n          += imgs.size(0)
    return total_loss / n, correct / n

# ── Training loop ─────────────────────────────────────────────────
best_val_acc = 0.0
for epoch in range(1, 31):
    t0              = time.time()
    tr_loss, tr_acc = run_epoch(train_dl, training=True)
    vl_loss, vl_acc = run_epoch(val_dl,   training=False)
    scheduler.step()
    elapsed = time.time() - t0
    if vl_acc > best_val_acc:
        best_val_acc = vl_acc
        torch.save(model.state_dict(), 'best_model.pth')
    print(f"Ep {epoch:02d} | tr_loss={tr_loss:.3f} tr_acc={tr_acc:.3f} | "
          f"vl_loss={vl_loss:.3f} vl_acc={vl_acc:.3f} | {elapsed:.0f}s")

Section 09

Data Augmentation — More Data for Free

📖 Story

The Hospital That Had 500 X-rays — and Trained on 50,000

A radiology startup needed to train a pneumonia detector. They had 500 labelled chest X-rays — barely enough for a statistician, nowhere near enough for a deep network. Their solution: every time the training loop saw an X-ray, it randomly rotated it, flipped it horizontally, shifted it, changed the contrast slightly, and zoomed in at a random crop. The model never saw the exact same image twice.

By the end of training it had effectively "seen" around 50,000 image variants, all derived from 500 originals. The resulting model generalised to unseen patient X-rays at 92% sensitivity, which is clinically useful. Augmentation is not a trick. It is the primary tool against overfitting when data is scarce.

↔

Geometric

Flip · Rotate · Crop · Zoom · Shear

The cheapest and most reliable augmentations. Horizontal flip is almost always safe. Vertical flip only for data without up/down semantics (satellite, microscopy). Rotation >30° often hurts for objects with canonical orientation.

☀️

Colour / Photometric

Brightness · Contrast · Saturation · Hue

Simulates different lighting conditions and camera sensors. Hue jitter is riskier for datasets where colour is discriminative (ripeness classification, lesion type). Keep ranges modest (±10–20%) to avoid unrealistic artefacts.

🅑

CutOut / Random Erasing

Occlusion simulation

Randomly erase a rectangle of pixels (replace with mean colour or noise). Forces the model to use the whole object for classification rather than one dominant patch. Significant regularisation effect. DeVries & Taylor, 2017.

🔀

MixUp

Convex combination of images+labels

Create a new training sample as a weighted average of two images AND their labels simultaneously: x̃ = λx₁ + (1−λ)x₂, ỹ = λy₁ + (1−λ)y₂. Produces "ghost" images that force calibrated predictions. Zhang et al., 2018.

✏️

CutMix

Patch from image B pasted into image A

Cut a random rectangular patch from one image, paste it into another, mix labels proportionally to the area. More natural than MixUp (no ghosting). Top augmentation strategy in recent ImageNet benchmarks. Yun et al., 2019.

🌈

RandAugment

Automated augmentation without a policy search

Randomly samples N operations from a fixed set (rotate, solarise, posterise, autocontrast, sharpness…) each with magnitude M. Removes the need to manually design or search for augmentation policies, leaving only N and M to tune. Used in many modern training recipes, including later EfficientNet variants. Cubuk et al., 2020.

import torch
from torchvision import transforms
import numpy as np

# ── Modern augmentation stack (state-of-the-art 2024) ─────────
strong_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),  # inception crop
    transforms.RandomHorizontalFlip(),
    transforms.TrivialAugmentWide(),           # state-of-the-art auto-augment
    transforms.ToTensor(),
    transforms.Normalize([0.485,0.456,0.406], [0.229,0.224,0.225]),
    transforms.RandomErasing(p=0.25)           # CutOut — applied to tensor
])

# ── MixUp implementation ───────────────────────────────────────
def mixup_batch(imgs, labels, num_classes, alpha=0.4):
    """Apply MixUp to a batch. Returns mixed images and soft labels."""
    lam    = float(np.random.beta(alpha, alpha))
    idx    = torch.randperm(imgs.size(0), device=imgs.device)
    mixed  = lam * imgs + (1 - lam) * imgs[idx]
    # One-hot encode labels (stays on the labels' device, CPU or GPU) then mix
    y_a    = torch.nn.functional.one_hot(labels, num_classes).float()
    y_b    = y_a[idx]
    mixed_labels = lam * y_a + (1 - lam) * y_b
    return mixed, mixed_labels

# ── Usage in training loop with soft labels ────────────────────
def soft_cross_entropy(pred, soft_targets):
    """Cross-entropy that accepts soft (non-integer) target labels."""
    log_prob = torch.nn.functional.log_softmax(pred, dim=1)
    return -(soft_targets * log_prob).sum(dim=1).mean()

# Simulate one training step with MixUp
imgs   = torch.randn(32, 3, 224, 224)
labels = torch.randint(0, 5, (32,))
mixed_imgs, mixed_labels = mixup_batch(imgs, labels, num_classes=5)
print(f"Mixed images shape : {mixed_imgs.shape}")
print(f"Soft label sample  : {mixed_labels[0].numpy().round(3)}")

OUTPUT

Mixed images shape : torch.Size([32, 3, 224, 224])
Soft label sample  : [0.    0.    0.623 0.    0.377]
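CutMix, described in the cards above, can be sketched in the same style as the MixUp helper. This is a minimal version under stated assumptions: one box shared by the whole batch, `alpha=1.0` as the Beta parameter, and a hypothetical function name `cutmix_batch`. It is not the reference implementation from Yun et al.

```python
import torch

def cutmix_batch(imgs, labels, num_classes, alpha=1.0):
    """Paste a random box from a shuffled copy of the batch; mix labels by area."""
    n, _, h, w = imgs.shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(n, device=imgs.device)

    # Box covering roughly (1 - lam) of the image, centred at a random point
    cut_h = int(h * (1 - lam) ** 0.5)
    cut_w = int(w * (1 - lam) ** 0.5)
    cy = torch.randint(h, (1,)).item()
    cx = torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)

    # Copy the box from the shuffled images into a clone of the originals
    mixed = imgs.clone()
    mixed[:, :, y1:y2, x1:x2] = imgs[idx, :, y1:y2, x1:x2]

    # Recompute lam from the clipped box so labels match the pasted area exactly
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    soft_labels = lam * one_hot + (1 - lam) * one_hot[idx]
    return mixed, soft_labels

torch.manual_seed(0)
demo_imgs = torch.randn(8, 3, 32, 32)
demo_labels = torch.randint(0, 5, (8,))
cut_imgs, cut_labels = cutmix_batch(demo_imgs, demo_labels, num_classes=5)
print(cut_imgs.shape, cut_labels.sum(dim=1))
```

The soft labels can be fed to the same `soft_cross_entropy` function used for MixUp, since each row still sums to 1.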

Section 10

Evaluation Metrics — Beyond Simple Accuracy

On a balanced 10-class dataset, 90% accuracy sounds great. On a medical dataset where 99% of samples are "healthy" and 1% are "cancerous," a model that always predicts "healthy" achieves 99% accuracy — and misses every single cancer case. Accuracy is a dangerous single number when classes are imbalanced.

📈

Top-1 Accuracy

Acc

Fraction of test images where the model's highest-probability prediction matches the true label. The standard metric for balanced classification benchmarks. Use only when classes are balanced.

🎯

Top-5 Accuracy

Top-5

Fraction where the true label appears in the model's 5 highest-probability predictions. Used in ImageNet benchmarking. Useful when class boundaries are fuzzy (e.g. 1,000 fine-grained categories).

⚖

Precision / Recall

P / R

Precision = TP/(TP+FP): of all predicted positives, how many are real? Recall = TP/(TP+FN): of all real positives, how many did we catch? There is usually a trade-off: raising the decision threshold tends to increase precision and lower recall, and lowering it does the opposite.

🌟

F1 Score

Harmonic mean of precision and recall: 2PR/(P+R). Gives equal weight to both. Use macro-F1 for imbalanced datasets: average F1 across all classes, giving each class equal weight regardless of frequency.

📊

AUC-ROC

AUC

Area Under the ROC Curve. Measures the model's ability to discriminate between classes at all possible thresholds. 0.5 = random, 1.0 = perfect. Threshold-independent — ideal for medical diagnosis evaluation.

📉

Confusion Matrix

N×N matrix where entry (i,j) = number of class-i images predicted as class-j. Reveals exactly which classes are being confused with each other. The single most informative diagnostic tool for a multi-class classifier.
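The imbalance problem from the start of this section can be made concrete with a few lines of plain Python. The dataset below (990 healthy, 10 cancerous) and the two sets of predictions are invented for illustration, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1}, with 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}

# 10 cancerous cases (label 1) followed by 990 healthy cases (label 0)
y_true = [1] * 10 + [0] * 990

# Model A: always predicts "healthy"
lazy = binary_metrics(y_true, [0] * 1000)

# Model B: catches 8 of the 10 cancers but raises 40 false alarms
screen = binary_metrics(y_true, [1] * 8 + [0] * 2 + [1] * 40 + [0] * 950)

for name, m in [("always healthy", lazy), ("screening model", screen)]:
    print(f"{name:16s} acc={m['accuracy']:.3f} "
          f"P={m['precision']:.3f} R={m['recall']:.3f} F1={m['f1']:.3f}")
```

Model A wins on accuracy (0.990 against 0.958) yet has zero recall, while Model B finds 80% of the cancers. Only the precision, recall, and F1 columns reveal which model is useful.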

import torch
import numpy as np
from sklearn.metrics import (classification_report, confusion_matrix,
                               roc_auc_score, top_k_accuracy_score)

def evaluate_model(model, loader, device, class_names):
    model.eval()
    all_labels, all_probs = [], []

    with torch.no_grad():
        for imgs, labels in loader:
            imgs   = imgs.to(device)
            logits = model(imgs)
            probs  = torch.softmax(logits, dim=1).cpu()
            all_probs.append(probs)
            all_labels.append(labels)

    all_probs  = torch.cat(all_probs).numpy()    # (N, C) float
    all_labels = torch.cat(all_labels).numpy()   # (N,) int
    all_preds  = all_probs.argmax(axis=1)

    # Top-1 and Top-3 accuracy
    top1 = top_k_accuracy_score(all_labels, all_probs, k=1)
    top3 = top_k_accuracy_score(all_labels, all_probs, k=3)

    # Per-class precision / recall / F1
    report = classification_report(all_labels, all_preds,
                                      target_names=class_names, digits=3)
    # Macro AUC (OvR strategy for multiclass)
    auc = roc_auc_score(all_labels, all_probs, multi_class='ovr',
                          average='macro')
    # Confusion matrix
    cm  = confusion_matrix(all_labels, all_preds)

    print(f"Top-1 Acc : {top1:.4f}  |  Top-3 Acc : {top3:.4f}  |  AUC : {auc:.4f}")
    print(f"\n{report}")
    return cm

# Simulated output for 5-class flower dataset
print("Top-1 Acc : 0.929  |  Top-3 Acc : 0.984  |  AUC : 0.987")

OUTPUT

Top-1 Acc : 0.929  |  Top-3 Acc : 0.984  |  AUC : 0.987

              precision    recall  f1-score   support

       daisy      0.934     0.921     0.927       178
   dandelion      0.961     0.943     0.952       186
        rose      0.883     0.897     0.890       152
   sunflower      0.951     0.968     0.959       190
       tulip      0.901     0.911     0.906       156

    accuracy                          0.929       862
   macro avg      0.926     0.928     0.927       862

Section 11

Vision Transformers — Attention Is All You Need (For Images Too)

📖 Story

Dosovitskiy's Bet — Can Transformers Beat CNNs Without Convolution?

In 2020, Alexey Dosovitskiy at Google Brain made a bold move. He took the transformer architecture — invented for NLP — and applied it directly to images, with almost no modifications. His argument: CNNs have inductive biases baked in (translation equivariance, locality). What if we just removed those biases and let the model learn everything from data?

The trick: split an image into non-overlapping 16×16 patches. Flatten each patch into a 1D vector. Treat those vectors exactly like tokens in a sentence. Feed them through a standard transformer encoder. Let self-attention learn which patches "attend to" which other patches — relationships that can span the entire image, something CNNs struggle with without very deep stacking.

Pretrained on very large datasets (ImageNet-21k, JFT-300M), ViT matched or beat the best CNNs of the time. Trained on mid-sized data such as ImageNet-1k alone, it underperformed comparable ResNets, because it lacks their helpful inductive biases. This tension between learning everything from scratch versus building in known structure is the central debate of modern computer vision.

Image Patchification

x ∈ R^(H×W×C) → N patches of size P×P

N = (H/P) × (W/P) is the number of "tokens". For ViT-B/16 on 224×224: N = 14×14 = 196 patches. Each patch is flattened (16×16×3 = 768 values) and linearly projected to a 768-d embedding. Adding one [CLS] token gives 197 tokens in total.
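The patch arithmetic above can be checked with a short sketch. It assumes a random tensor in place of real images; the helper name `patchify`, the `embed` layer, and the zero `cls_token` are illustrative stand-ins, not torchvision APIs or trained parameters.

```python
import torch

def patchify(x: torch.Tensor, patch: int) -> torch.Tensor:
    """Split (B, C, H, W) images into flattened non-overlapping patches.

    Returns shape (B, N, C * patch * patch) with N = (H / patch) * (W / patch).
    """
    b, c, h, w = x.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    gh, gw = h // patch, w // patch
    # (B, C, gh, P, gw, P) -> (B, gh, gw, C, P, P) -> (B, N, C*P*P)
    x = x.reshape(b, c, gh, patch, gw, patch)
    x = x.permute(0, 2, 4, 1, 3, 5)
    return x.reshape(b, gh * gw, c * patch * patch)

images = torch.randn(2, 3, 224, 224)           # stand-in for a batch of images
patches = patchify(images, patch=16)           # (2, 196, 768)

# Hypothetical stand-ins for ViT's learned parameters
embed = torch.nn.Linear(16 * 16 * 3, 768)      # linear patch projection E
cls_token = torch.zeros(1, 1, 768)             # a learnable parameter in a real model

tokens = torch.cat([cls_token.expand(2, -1, -1), embed(patches)], dim=1)
print(patches.shape)   # torch.Size([2, 196, 768])
print(tokens.shape)    # torch.Size([2, 197, 768])
```

Real implementations usually fuse the split and the projection into a single strided convolution; the explicit reshape here is only meant to make the token count visible.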

Self-Attention

Attention(Q,K,V) = softmax(QKᵀ/√d_k)V

Each patch queries every other patch. The attention weight between patch i and patch j measures how much information patch j contributes to the updated representation of patch i. Global receptive field from layer 1.
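To make the formula concrete, here is a minimal single-head sketch. It assumes the queries, keys and values are the raw tokens themselves; a real ViT first applies learned projections to obtain Q, K and V, and runs 12 such heads in parallel. The width of 64 corresponds to 768 / 12 for ViT-B/16.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for tensors of shape (B, N, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (B, N, N) pairwise scores
    weights = scores.softmax(dim=-1)                     # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
tokens = torch.randn(1, 197, 64)      # 196 patches + [CLS], one head of width 64
out, attn = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)    # torch.Size([1, 197, 64])
print(attn.shape)   # torch.Size([1, 197, 197])
```

Row i of `attn` holds the weights patch i places on every token, which is why the receptive field is global from the first layer.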

Positional Encoding

z = E·x + pos_embed

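Unlike CNNs, transformers have no built-in notion of spatial structure: self-attention treats the patches as an unordered set. Learnable position embeddings are added to each patch embedding to tell the model where in the image each patch came from. They are kept at inference; to handle a different input resolution, the pretrained position embeddings are interpolated in 2D to the new patch grid.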
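As a rough sketch of how learned position embeddings can be carried over to a different input resolution, the snippet below resizes a 14×14 grid of embeddings to 24×24 (a 384×384 input with 16×16 patches) using bicubic interpolation. The function name `resize_pos_embed` and the random tensor are illustrative assumptions; this covers only the embedding step, not the other changes a model needs to accept larger inputs.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Interpolate (1, 1 + g*g, D) position embeddings to a new_grid x new_grid layout."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_pos.shape[1] ** 0.5)
    dim = patch_pos.shape[-1]
    # (1, g*g, D) -> (1, D, g, g) so that interpolate treats the grid like an image
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    # Back to (1, new_grid*new_grid, D); the [CLS] position is left untouched
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

pos_embed = torch.randn(1, 197, 768)                 # stand-in for a trained 14x14 grid + [CLS]
resized = resize_pos_embed(pos_embed, new_grid=24)   # 384 / 16 = 24 patches per side
print(resized.shape)   # torch.Size([1, 577, 768])
```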

Classification

ŷ = MLP(z_CLS)

Only the [CLS] token's output is passed to the classification head. This token attends to all 196 image patches across all layers, aggregating global context. It plays a role analogous to Global Average Pooling in CNNs: both collapse the spatial features into a single vector for the classifier.
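The two read-out strategies can be compared side by side in a few lines. This is a sketch on a random tensor standing in for the encoder output; the 5-class `head` is a hypothetical layer, and the averaged variant is shown only for contrast with the [CLS] read-out described above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
encoded = torch.randn(4, 197, 768)    # stand-in for encoder output: [CLS] + 196 patches
head = nn.Linear(768, 5)              # hypothetical 5-class classification head

cls_logits = head(encoded[:, 0])                 # ViT-style: read the [CLS] token only
gap_logits = head(encoded[:, 1:].mean(dim=1))    # CNN-style: average the patch tokens
print(cls_logits.shape, gap_logits.shape)        # torch.Size([4, 5]) torch.Size([4, 5])
```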

import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load ViT-B/16 with torchvision's IMAGENET1K_V1 weights (trained on ImageNet-1k)
weights = ViT_B_16_Weights.IMAGENET1K_V1
model   = vit_b_16(weights=weights)
model.eval()

# Inspect architecture
print("Patch size  :", model.patch_size)           # 16
print("Hidden dim  :", model.hidden_dim)           # 768
print("Num heads   :", model.encoder.layers[0].num_heads)  # 12
print("Num layers  :", len(model.encoder.layers))  # 12
print("Total params:",
      f"{sum(p.numel() for p in model.parameters()):,}")  # 86M

# Replace head for 5-class task
import torch.nn as nn
in_features = model.heads.head.in_features    # 768
model.heads.head = nn.Linear(in_features, 5)

# Inference on a batch
batch = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    out = model(batch)
print(f"\nInput  : {batch.shape}")
print(f"Output : {out.shape}")     # (4, 5) — one logit per class per image

OUTPUT

Patch size  : 16
Hidden dim  : 768
Num heads   : 12
Num layers  : 12
Total params: 86,567,656

Input  : torch.Size([4, 3, 224, 224])
Output : torch.Size([4, 5])

Section 12

CLIP — Zero-Shot Classification with Text Prompts

OpenAI's CLIP (Contrastive Language-Image Pretraining, 2021) changed the rules. Trained on 400 million (image, caption) pairs from the web, CLIP learns a joint embedding space where similar images and texts are pulled together. At inference, you describe your classes in natural language — no labelled images required.

⚡

How CLIP Zero-Shot Works

Encode your image with CLIP's visual encoder → a 512-d vector. Encode each class name as "a photo of a {class}" with the text encoder → 512-d vectors. Compute cosine similarity between the image vector and each class vector. The class with highest similarity is the prediction. No fine-tuning, no labelled examples — just natural language descriptions.

# pip install git+https://github.com/openai/CLIP.git   (official repo; also needs ftfy, regex, tqdm)
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load CLIP ViT-B/32
model, preprocess = clip.load("ViT-B/32", device=device)

# Load and preprocess image
image = preprocess(Image.open("mystery_flower.jpg")).unsqueeze(0).to(device)

# Define class labels as natural language prompts
class_names = ["daisy", "dandelion", "rose", "sunflower", "tulip"]
text_prompts = clip.tokenize(
    [f"a photograph of a {c} flower" for c in class_names]
).to(device)

with torch.no_grad():
    img_features  = model.encode_image(image)
    txt_features  = model.encode_text(text_prompts)

    # Normalise to unit sphere
    img_features /= img_features.norm(dim=-1, keepdim=True)
    txt_features /= txt_features.norm(dim=-1, keepdim=True)

    # Cosine similarity → probabilities
    similarity = (100.0 * img_features @ txt_features.T)
    probs = similarity.softmax(dim=-1)[0]

print("Zero-Shot CLIP Predictions (no fine-tuning):")
for cls, prob in zip(class_names, probs):
    bar = "█" * int(prob.item() * 40)
    print(f"  {cls:12s}: {prob.item()*100:5.1f}%  {bar}")

OUTPUT

Zero-Shot CLIP Predictions (no fine-tuning):
  daisy       :   3.1%  █
  dandelion   :   4.7%  █
  rose        :  82.4%  █████████████████████████████████
  sunflower   :   6.2%  ██
  tulip       :   3.6%  █

Section 13

Full Pipeline Comparison — Choosing Your Approach

Approach	Data Needed	GPU Required	Typical Accuracy	Training Time	Best For
HOG + SVM	~500 images	No	65–78%	Minutes	Extremely constrained environments, baseline
Transfer Learning (frozen)	~200–2k images	Optional	82–91%	5–20 min	Small datasets, prototyping, domains close to ImageNet
Transfer Learning (fine-tuned)	1k–50k images	Recommended	88–95%	1–4 hours	Most real-world projects — the standard approach
EfficientNet-B4 fine-tuned	5k–100k images	Yes	93–97%	4–12 hours	Production systems, Kaggle competitions
ViT fine-tuned (21k pretrain)	10k+ images	Yes (A100+)	94–98%	1–3 days	Large datasets, highest accuracy requirement
CLIP zero-shot	0 images	Optional	60–80% (domain-dependent)	Seconds (no training)	New categories, rapid prototyping, open-world recognition

Section 14

Common Failure Modes — What Goes Wrong and Why

🚫 Dataset Leakage

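Validation or test images accidentally end up in the training set, often as duplicates or near-duplicates. The model scores 99% because it has already seen the images it is evaluated on. Always split before any preprocessing, check for duplicates across splits, and never augment the validation set.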

⚠️ Class Imbalance

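1,000 cats and 10 dogs. The model learns to always predict "cat" and still reaches 99% accuracy. Solutions: a class-weighted loss (class_weight='balanced' in scikit-learn, the weight argument of nn.CrossEntropyLoss in PyTorch), focal loss, oversampling the minority class, or a WeightedRandomSampler in the DataLoader.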
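Two of these remedies can be sketched on a toy dataset with the same 1,000-to-10 split. The random features and the inverse-frequency weighting are assumptions made for illustration; in practice you would pick one of the two options, since combining them corrects for the imbalance twice.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)
# Toy stand-in dataset: 1,000 "cats" (label 0) and 10 "dogs" (label 1)
labels = torch.cat([torch.zeros(1000, dtype=torch.long),
                    torch.ones(10, dtype=torch.long)])
features = torch.randn(len(labels), 8)
dataset = TensorDataset(features, labels)

class_counts = torch.bincount(labels)            # tensor([1000,   10])
class_weights = 1.0 / class_counts.float()       # rarer class gets a larger weight

# Option 1: oversample the minority class at the DataLoader level
sample_weights = class_weights[labels]           # one sampling weight per example
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels),
                                replacement=True)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

# Option 2: keep normal sampling and reweight the loss instead
criterion = torch.nn.CrossEntropyLoss(weight=class_weights / class_weights.sum())

# With option 1, roughly half of the samples drawn in an epoch are dogs
seen = torch.cat([y for _, y in loader])
dog_fraction = (seen == 1).float().mean().item()
print(f"dog fraction per epoch: {dog_fraction:.2f}")
```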

🚫 Forgetting to zero_grad()

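Gradients accumulate across batches instead of resetting, so each update uses the sum of all previous gradients and the loss typically diverges after a few steps. Call optimizer.zero_grad() once per iteration, before loss.backward(). In PyTorch 2.0 and later, set_to_none=True is already the default; on older versions, pass optimizer.zero_grad(set_to_none=True) explicitly for efficiency.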
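The accumulation is easy to observe directly. The sketch below uses a toy linear model and random data as stand-ins for a real classifier and DataLoader; it first shows the gradient doubling when zero_grad() is skipped, then the usual ordering of a training step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 3)                       # toy model standing in for a classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(16, 8), torch.randint(0, 3, (16,))

# Without zero_grad(): a second backward pass adds to the stored gradient
criterion(model(x), y).backward()
g1 = model.weight.grad.clone()
criterion(model(x), y).backward()
g2 = model.weight.grad.clone()
print(torch.allclose(g2, 2 * g1))             # True: the gradient has doubled

# Correct ordering of one training step
for _ in range(3):
    optimizer.zero_grad()                     # reset gradients from the previous step
    loss = criterion(model(x), y)
    loss.backward()                           # compute fresh gradients
    optimizer.step()                          # update the weights
```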

⚠️ Wrong Normalisation Stats

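Blindly reusing ImageNet mean/std for medical images or satellite imagery when training from scratch or with a domain-pretrained backbone. These domains have completely different pixel distributions, so compute mean/std from your own training set in those cases; with an ImageNet-pretrained backbone, keep the ImageNet statistics the weights were trained with. Wrong stats can silently reduce accuracy by 3–8%.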
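Computing your own statistics takes one pass over the training set. The sketch below assumes a loader that yields (B, C, H, W) tensors scaled to [0, 1] with no Normalize and no augmentation applied; the helper name `channel_mean_std` and the random images are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def channel_mean_std(loader):
    """Per-channel mean and std over every pixel yielded by the loader."""
    n_pixels = 0
    channel_sum = 0.0
    channel_sq_sum = 0.0
    for imgs, _ in loader:
        imgs = imgs.double()                              # avoid float32 round-off
        n_pixels += imgs.shape[0] * imgs.shape[2] * imgs.shape[3]
        channel_sum = channel_sum + imgs.sum(dim=(0, 2, 3))
        channel_sq_sum = channel_sq_sum + (imgs ** 2).sum(dim=(0, 2, 3))
    mean = channel_sum / n_pixels
    # Var[x] = E[x^2] - E[x]^2 (population variance over all pixels)
    std = (channel_sq_sum / n_pixels - mean ** 2).clamp(min=0).sqrt()
    return mean.float(), std.float()

torch.manual_seed(0)
fake_images = torch.rand(32, 3, 64, 64)       # stand-in for the training images
fake_labels = torch.zeros(32, dtype=torch.long)
loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=8)

mean, std = channel_mean_std(loader)
print(mean, std)
```

Run it on the training split only, so that no information from validation or test images leaks into the preprocessing.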

🚫 Augmenting the Val/Test Set

Applying RandomHorizontalFlip or ColorJitter to validation data. This makes evaluation results non-deterministic and misleading: scores change from run to run and are no longer comparable between checkpoints. The validation transform should only be: Resize → CenterCrop → ToTensor → Normalize.

⚠️ Learning Rate Too High

Fine-tuning a pretrained model with lr=1e-3 for the full backbone destroys the pretrained weights in the first epoch. For the backbone, use lr=1e-5 to 1e-4; for the head only, lr=1e-3. Use an LR range test (e.g. fastai's lr_find) if unsure.
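One way to apply different learning rates to the backbone and the head is through optimizer parameter groups. In the sketch below the tiny `backbone` is a stand-in for a pretrained network, and the choice of AdamW with weight_decay=0.01 is an assumption, not something prescribed above.

```python
import torch
import torch.nn as nn

# Toy stand-in: in practice `backbone` would be a pretrained network
backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(8, 5)                         # freshly initialised classifier
model = nn.Sequential(backbone, head)

optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-5},   # gentle updates for pretrained weights
    {"params": head.parameters(), "lr": 1e-3},       # larger steps for the new head
], weight_decay=0.01)

for group in optimizer.param_groups:
    n_params = sum(p.numel() for p in group["params"])
    print(f"lr={group['lr']:.0e}  params={n_params}")
```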

Section 15

Golden Rules

📸 Image Classification — Non-Negotiable Rules

Never train from scratch unless you have more than 100,000 images. Pretrained ImageNet weights — even on tasks completely unlike ImageNet — almost always outperform random initialisation. The universal feature hierarchy (edges → textures → shapes → objects) transfers to nearly every visual domain.

Split your data before touching a single pixel. Create train / validation / test splits first. Lock the test set — never look at it, never tune on it. Use validation loss for all decisions. Evaluate on test exactly once, at the very end. Peeking at test data during development is data leakage.
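A minimal sketch of such a split, using scikit-learn's train_test_split twice. The file names, the 5 balanced classes and the 70 / 15 / 15 ratio are assumptions chosen for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical file list and labels; in practice these come from your dataset folder
paths = [f"img_{i:04d}.jpg" for i in range(1000)]
labels = [i % 5 for i in range(1000)]

# 70 / 15 / 15 split, stratified so every split keeps the class proportions
train_p, rest_p, train_y, rest_y = train_test_split(
    paths, labels, test_size=0.30, stratify=labels, random_state=42)
val_p, test_p, val_y, test_y = train_test_split(
    rest_p, rest_y, test_size=0.50, stratify=rest_y, random_state=42)

print(len(train_p), len(val_p), len(test_p))   # 700 150 150

# Sanity check: no file appears in more than one split
assert not (set(train_p) & set(val_p))
assert not (set(train_p) & set(test_p))
assert not (set(val_p) & set(test_p))
```

Saving the three file lists to disk once and reusing them for every experiment keeps the test set truly locked.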

Normalise with the statistics of the dataset your backbone was pretrained on. For ImageNet-pretrained models: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]. For models pretrained on your own domain (medical, satellite), compute and use your domain's statistics. Wrong normalisation silently degrades performance.

When fine-tuning, always warm up the head before unfreezing the backbone. The new classification head is randomly initialised — its gradients are wild. If the backbone is also trainable from the start, those wild gradients will corrupt the pretrained weights in the first epoch. Train only the head for 5–10 epochs first, then unfreeze everything with a 10× lower learning rate.
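The two phases can be sketched as follows. The toy `backbone`, the helper `set_trainable` and the specific learning rates (1e-3, then 1e-4) are illustrative assumptions; the training loops themselves are omitted.

```python
import torch
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Toy stand-in: in practice `backbone` would be a pretrained network
backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(8, 5)
model = nn.Sequential(backbone, head)

# Phase 1: freeze the backbone and train only the new head
set_trainable(backbone, False)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
phase1_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
# ... train the head for a few epochs here ...

# Phase 2: unfreeze everything and continue with a 10x lower learning rate
set_trainable(backbone, True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
phase2_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
# ... fine-tune the whole model here ...

print(phase1_trainable, phase2_trainable)   # 45 269
```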

Use label smoothing (0.05–0.15) for cross-entropy loss. It penalises over-confident predictions, improves calibration, and typically adds 0.3–1% accuracy. Set with nn.CrossEntropyLoss(label_smoothing=0.1). Especially important when your training set is small and overfitting is a concern.
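A quick way to see what label smoothing does is to compare the two losses on a single very confident, correct prediction. The logits below are made up for illustration.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[8.0, 0.0, 0.0]])    # made-up, very confident prediction for class 0
target = torch.tensor([0])

plain = nn.CrossEntropyLoss()(logits, target)
smooth = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, target)

print(f"plain={plain.item():.4f}  smoothed={smooth.item():.4f}")
```

The plain loss is already close to zero, so the model has no reason to stop pushing the logit higher. With smoothing, a small share of the target mass sits on the other classes, so extreme confidence is penalised even when the prediction is right.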

Report macro-F1 and per-class metrics, not just accuracy. A single accuracy number hides everything. Always print the full classification_report and inspect the confusion matrix. You need to know which classes are hard, which are easy, and which are being confused with each other — accuracy tells you none of that.

Validate augmentation visually before training. Call train_ds[i] several times for the same image, convert back to uint8 and display. If any augmented version looks unrealistic or wrong (text upside down, faces mirrored nonsensically, colours completely unnatural), it will hurt the model. Eyes catch bad augmentations; loss curves do not.
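Converting a normalised tensor back to a displayable image might look like the sketch below. It assumes ImageNet normalisation was used in the training transform, and the random tensor stands in for the image returned by `train_ds[i]`; the helper name `to_uint8_image` is illustrative.

```python
import torch

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def to_uint8_image(x: torch.Tensor) -> torch.Tensor:
    """Undo Normalize on a (3, H, W) tensor and return an (H, W, 3) uint8 image."""
    x = x * IMAGENET_STD + IMAGENET_MEAN                  # back to the [0, 1] range
    x = (x.clamp(0, 1) * 255).round().to(torch.uint8)
    return x.permute(1, 2, 0)                             # channels last, for plotting

raw = torch.rand(3, 32, 32)                               # stand-in for an image tensor
normalised = (raw - IMAGENET_MEAN) / IMAGENET_STD         # what the transform would output
restored = to_uint8_image(normalised)
print(restored.shape, restored.dtype)   # torch.Size([32, 32, 3]) torch.uint8
```

The resulting array can be passed to any image viewer, for example after `.numpy()` with matplotlib's imshow. Repeating this for several draws of the same index shows the range of augmentations the model will actually see.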