
Image Classification in Python

A comprehensive, story-driven tutorial covering image classification from first principles to state-of-the-art practice. Opens with the "Three-Year-Old and the Radiologist" analogy, then walks through every level of the stack: the pre-processing pipeline, classical HOG+SVM, convolutional neural networks layer by layer,

Section 01

The Story That Explains Image Classification

The Three-Year-Old and the Radiologist
A three-year-old child sees a picture of a dog for the first time. Her parents say "that's a dog." She sees another dog — different breed, different angle, different lighting. Her parents say "that's also a dog." After perhaps a hundred such examples, she can identify dogs she has never seen before. The furry ears, the snout, the posture — she has learned a concept, not memorised a pixel grid.

Now consider a radiologist who has read 50,000 chest X-rays over twenty years. She spots a suspicious shadow at 4mm — invisible to an untrained eye — because her brain has built an extraordinarily detailed internal model of what "normal" looks like versus what "abnormal" looks like. She is classifying images, one at a time, faster and more accurately than any rule-based system ever could.

Image Classification is the task of training a machine to do exactly this: look at a raw image and output a label — cat, dog, tumour, fraudulent cheque, ripe tomato, storm cloud — by learning from thousands of labelled examples. It is the oldest, most studied problem in computer vision, and it is the foundation on which everything else — object detection, segmentation, video understanding — is built.

At its core, image classification is a supervised learning problem. You give a model thousands of (image, label) pairs. The model learns a mapping from pixel values to class probabilities. At inference time it applies that mapping to images it has never seen before and outputs its best prediction.

📸
The Benchmark That Changed Everything — ImageNet

In 2009, Fei-Fei Li's team introduced ImageNet. Its annual challenge (ILSVRC), which started in 2010, uses a subset of about 1.2 million training images across 1,000 categories. In 2012, Alex Krizhevsky's AlexNet cut the best top-5 error rate from about 26% to about 15% using a deep convolutional neural network trained on two GPUs. That single result launched the modern deep learning era. Today's best models achieve under 1.5% top-5 error on the same benchmark, which is better than most humans.


Section 02

How Images Become Numbers — The Input Pipeline

A model never sees an image. It sees a tensor — a multidimensional array of numbers. Before any learning can happen, every raw image must pass through a standardised pre-processing pipeline that converts it into a fixed-size, normalised tensor the model can consume.

🗃 The Standard Image Pre-Processing Pipeline
Load
Read image file (JPEG/PNG/TIFF) from disk. Result: an H×W×C array of integers 0–255. Watch for BGR vs RGB — OpenCV loads BGR; PIL loads RGB.
Resize
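Resize to the model's expected input dimensions. ResNet and EfficientNet-B0 expect 224×224 (larger EfficientNet variants use higher resolutions). ViT-B/16 expects 224×224. CLIP uses 224×224 or 336×336. Interpolation: BILINEAR and BICUBIC are both common; match whatever the pretrained weights were trained with, and keep it consistent between training and inference.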
Augment
Training only: random flip, crop, colour jitter, rotation. Never augment your validation or test set. Augmentation fights overfitting by making the model see each image in many forms.
ToTensor
Convert H×W×C uint8 array → C×H×W float32 tensor and divide by 255 → pixel values in [0.0, 1.0]. PyTorch stores channels-first (C,H,W); TensorFlow stores channels-last (H,W,C).
Normalise
Subtract ImageNet channel means [0.485, 0.456, 0.406] and divide by stds [0.229, 0.224, 0.225]. This centres each channel around zero, matching the distribution the pretrained model was trained on.
Batch
Stack N processed images into a single batch tensor of shape (N, C, H, W). Typical batch sizes: 32, 64, 128, 256. Larger batches = faster training but more GPU memory.
import torch
from torchvision import transforms
from PIL import Image

# ── Training transform (with augmentation) ────────────────────
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                             saturation=0.1, hue=0.05),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                           std=[0.229, 0.224, 0.225])
])

# ── Validation / test transform (NO augmentation) ─────────────
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),       # deterministic centre crop
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                           std=[0.229, 0.224, 0.225])
])

# ── Test it on a single image ─────────────────────────────────
img = Image.open('dog.jpg').convert('RGB')
tensor = val_transform(img)              # shape: (3, 224, 224)
batch  = tensor.unsqueeze(0)           # shape: (1, 3, 224, 224)

print(f"Original image size : {img.size}")
print(f"Tensor shape        : {tensor.shape}")
print(f"Tensor dtype        : {tensor.dtype}")
print(f"Pixel value range   : [{tensor.min():.3f}, {tensor.max():.3f}]")
OUTPUT
Original image size : (1920, 1280)
Tensor shape        : torch.Size([3, 224, 224])
Tensor dtype        : torch.float32
Pixel value range   : [-2.118, 2.640]
⚠️
Why Pixel Values Go Negative After Normalisation

After subtracting ImageNet means and dividing by stds, pixel values are no longer in [0,1]. They can be negative (very dark pixels below the mean) or above 1.0 (bright pixels). This is expected and correct. The model was trained on data with these exact statistics. Never re-clip to [0,1] after normalising — you will corrupt the input distribution and degrade performance.
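The range printed in the output above follows directly from the arithmetic. Here is a minimal sketch in plain Python, assuming pixel values already scaled to [0, 1]. The helper names `normalise` and `denormalise` are illustrative only and are not torchvision functions.

```python
# ImageNet channel statistics used throughout this tutorial (R, G, B).
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalise(pixel):
    """Normalise one RGB pixel whose values are already in [0, 1]."""
    return [(v - m) / s for v, m, s in zip(pixel, MEAN, STD)]

def denormalise(pixel):
    """Invert the normalisation, e.g. before displaying an image."""
    return [v * s + m for v, m, s in zip(pixel, MEAN, STD)]

black = normalise([0.0, 0.0, 0.0])   # darkest possible pixel
white = normalise([1.0, 1.0, 1.0])   # brightest possible pixel

lowest = min(black)     # red channel:  (0 - 0.485) / 0.229
highest = max(white)    # blue channel: (1 - 0.406) / 0.225
print(f"lowest  = {lowest:.3f}")     # lowest  = -2.118
print(f"highest = {highest:.3f}")    # highest = 2.640
```

If you need to plot a normalised tensor, undo the normalisation first (as `denormalise` does here) rather than clipping the normalised values.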


Section 03

Classical Approach — HOG + SVM Before Deep Learning

The Pre-2012 Era — When Humans Designed Features by Hand
Before 2012, state-of-the-art image classification meant spending months designing hand-crafted feature extractors. A computer vision PhD student would carefully craft a representation — HOG, SIFT, LBP, colour histograms — that captured what they believed was important about images in a given domain. Then feed those hand-crafted features into a Support Vector Machine (SVM) and pray it generalised.

The best systems required enormous domain expertise, ran on single images at a time, and barely approached 75% accuracy on toy benchmarks. Then AlexNet arrived and rendered a decade of feature engineering obsolete almost overnight. Understanding the classical approach teaches you what deep networks learned to do automatically.
from skimage.feature import hog
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np
import cv2, os

def extract_hog(img_path, size=(64, 64)):
    """Load image, resize, convert greyscale, extract HOG descriptor."""
    img  = cv2.imread(img_path)
    if img is None:                      # cv2.imread returns None on failure
        raise ValueError(f"Could not read image: {img_path}")
    img  = cv2.resize(img, size)
    grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    fd   = hog(grey, orientations=9,
                pixels_per_cell=(8, 8),
                cells_per_block=(2, 2),
                block_norm='L2-Hys')
    return fd

# Build feature matrix from image folder structure
# Expects: data/cats/*.jpg, data/dogs/*.jpg
X, y = [], []
for label, cls in enumerate(['cats', 'dogs']):
    folder = f'data/{cls}'
    for fname in os.listdir(folder):
        path = os.path.join(folder, fname)
        X.append(extract_hog(path))
        y.append(label)

X = np.array(X)   # shape: (N, 1764) — HOG dim for 64×64 image
y = np.array(y)

# Pipeline: scale → SVM with RBF kernel
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm',    SVC(C=10.0, kernel='rbf', gamma='scale',
                    probability=True))
])

scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f"HOG + SVM   5-fold CV: {scores.mean():.3f} ± {scores.std():.3f}")
print(f"Feature vector length: {X.shape[1]}")
OUTPUT
HOG + SVM   5-fold CV: 0.743 ± 0.018
Feature vector length: 1764
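The feature vector length of 1764 is not arbitrary. It can be derived from the HOG settings used above, as this small sketch shows. It assumes a square image and blocks that slide one cell at a time, which matches the parameters in the listing; `hog_length` is a hypothetical helper, not part of scikit-image.

```python
def hog_length(image_size, cell=8, block=2, orientations=9):
    """Descriptor length for a square image, block stride of one cell."""
    cells_per_side = image_size // cell               # 64 // 8 = 8
    blocks_per_side = cells_per_side - block + 1      # 8 - 2 + 1 = 7
    values_per_block = block * block * orientations   # 2 * 2 * 9 = 36
    return blocks_per_side ** 2 * values_per_block

print(hog_length(64))    # 7 * 7 * 36 = 1764
print(hog_length(128))   # 15 * 15 * 36 = 8100
```

Doubling the image side more than quadruples the descriptor, which is one reason classical pipelines worked on small, fixed-size crops.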

Section 04

Convolutional Neural Networks — The Core Idea

A Convolutional Neural Network (CNN) replaces hand-crafted feature engineering with learnable filters. Instead of a human deciding "edges are important," the network learns which filters — which patterns of pixels — maximally discriminate between classes, directly from the training data.

🚩
Convolutional Layer
Learnable spatial filters
Slides a small filter (e.g. 3×3 or 5×5) across the entire image, computing a dot product at each position. Each filter learns to detect one pattern: an edge, a curve, a texture. A layer with 64 filters produces 64 feature maps. Key insight: the same filter is used everywhere (weight sharing), which makes the layer translation-equivariant (a shifted input produces a correspondingly shifted feature map) and massively reduces parameters vs a fully connected layer.
📈
Pooling Layer
Spatial downsampling
Reduces the spatial dimensions of feature maps by summarising regions. Max pooling takes the strongest activation in each 2×2 block, retaining the most prominent features while halving width and height. Modern architectures (ResNet, EfficientNet) do most of their downsampling with strided convolutions rather than explicit pooling layers, learning the downsampling instead of hard-coding it.
Activation Function
Non-linearity — the secret ingredient
Without activation functions, stacking conv layers is equivalent to a single linear layer, no matter how deep the network. ReLU (max(0,x)) is the usual default: cheap, sparse, and non-saturating for positive inputs, which greatly reduces the vanishing-gradient problem seen with sigmoid and tanh. Modern variants: GELU (transformers), SiLU/Swish (EfficientNet), Mish (YOLO variants).
🛠️
Batch Normalisation
Stabilises training
Normalises the output of each layer to zero mean and unit variance across the batch dimension. Allows much higher learning rates, reduces sensitivity to weight initialisation, and acts as a mild regulariser. Placed after conv, before activation in classic ResNet; after activation in some modern designs.
📋
Fully Connected Layer
Classification head
After the final conv block, a Global Average Pooling layer collapses each feature map to a single value, giving a 1D vector. One or two FC layers then map this to class logits. Softmax converts logits to probabilities summing to 1.0. This head is what gets replaced during transfer learning.
🌏
Dropout
Regularisation against overfitting
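Randomly zeroes a fraction (typically 20–50%) of neurons during training, forcing the network to learn redundant representations. At inference time all neurons are active. In the original formulation the activations are then scaled by (1 − drop_rate); PyTorch's nn.Dropout uses the equivalent "inverted" form, scaling the surviving activations by 1/(1 − drop_rate) during training so that inference needs no rescaling. Most effective in the FC head; rarely used inside conv blocks in modern architectures.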
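To make the convolution, activation and pooling steps concrete, here is a toy sketch in plain Python on a 6×6 single-channel image. It is an illustration under simplifying assumptions (no padding, stride 1, a hand-written filter instead of a learned one), and all function names are made up for this example. Like most deep learning frameworks, it computes a cross-correlation (the kernel is not flipped).

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), no padding, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def relu(fmap):
    """Element-wise max(0, x)."""
    return [[max(0, v) for v in row] for row in fmap]

def max_pool_2x2(fmap):
    """Keep the strongest activation in each non-overlapping 2x2 block."""
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]

def conv_params(kernel_size, in_ch, out_ch):
    """Weights plus one bias per filter; independent of image size."""
    return kernel_size * kernel_size * in_ch * out_ch + out_ch

# 6x6 image: dark left half (0), bright right half (1)
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# Vertical-edge filter: responds where dark turns to bright, left to right
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

fmap = relu(conv2d_valid(image, kernel))   # 4x4 map, every row [0, 3, 3, 0]
pooled = max_pool_2x2(fmap)                # 2x2 map, [[3, 3], [3, 3]]
print(fmap[0], pooled)
print(conv_params(3, 3, 64))               # 1792 parameters for 64 3x3 RGB filters
```

The filter fires only around the edge in the middle of the image, and the same nine weights are reused at every position, which is why the parameter count does not depend on the image size.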
🌱
What Each Layer Actually Learns

Visualisation research (Zeiler & Fergus 2013, Olah et al. 2020) revealed a beautiful hierarchy: early layers learn edges and colour blobs. Middle layers learn textures, patterns, and object parts (eyes, wheels, fur). Deep layers learn complete semantic concepts (faces, dogs, cars). This hierarchy appears consistently in CNNs trained on natural images, largely regardless of architecture, and is why pretrained features transfer so well.


Section 05

Architecture Evolution — From AlexNet to Vision Transformers

Architecture | Year | Top-5 Error | Params | Key Innovation | Use Today?
AlexNet | 2012 | 15.3% | 60M | Deep CNN on GPU, ReLU, Dropout | Teaching only
VGG-16 | 2014 | 7.3% | 138M | Very deep (16 layers), uniform 3×3 convs | Occasional baseline
GoogLeNet/Inception | 2014 | 6.7% | 6.8M | Inception modules, parallel paths, fewer params | Rarely
ResNet-50 | 2015 | 5.3% | 25M | Residual skip connections, enabling 100+ layer networks | Yes, excellent baseline
DenseNet-121 | 2017 | 7.7% | 8M | Dense connections: every layer feeds all subsequent layers | Medical imaging
EfficientNet-B0 | 2019 | 6.7% | 5.3M | Compound scaling of width, depth and resolution jointly | Yes, best acc/params trade-off
ViT-B/16 | 2021 | 2.0% | 86M | Vision Transformer: pure attention, no convolutions | Yes, large data regime
ConvNeXt-B | 2022 | 1.9% | 89M | Modernised pure CNN rivalling ViT with simpler training | Yes, practical favourite
Note: error figures are approximate and come from differing evaluation protocols (single-crop vs multi-crop, with or without extra pretraining data), so compare rows loosely.
💡
Which Architecture Should You Actually Use?

For most practical projects: EfficientNet-B2 or ResNet-50 with transfer learning. For mobile/edge deployment: MobileNetV3 or EfficientNet-B0. For maximum accuracy with ample GPU: ConvNeXt-Base or ViT-B/16 pretrained on ImageNet-21k. Never start from scratch unless you have millions of domain-specific images.


Section 06

ResNet — The Skip Connection That Solved Depth

Kaiming He's Observation — Deeper Shouldn't Be Worse
In 2015, Kaiming He at Microsoft Research noticed something paradoxical: making networks deeper was making them worse on training data — not just on test data. That ruled out overfitting. Something more fundamental was wrong.

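His reasoning: a 56-layer network should be at least as good as a 20-layer network, because the extra 36 layers could simply learn to be identity functions and do nothing. But SGD couldn't find that solution: the learning signal reaching the early layers degraded with depth. (The original paper frames this as an optimisation difficulty rather than classic vanishing gradients, since batch normalisation was already in use.)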

His fix was elegant: add a shortcut connection that bypasses two conv layers. If the block learns F(x), the output is F(x) + x. The gradient can now flow directly backwards through the skip connection, bypassing the potentially troubled conv layers. Suddenly 152-layer networks trained cleanly.

ResNet won the ILSVRC 2015 classification, detection and localisation tasks by a clear margin, and the paper became one of the most cited in deep learning. Its residual block is the atomic unit of modern neural network design.
⚠ Plain Network — No Skip Connections
Layer | Output | Gradient flow
Conv → BN → ReLU | F₁(x) | Multiplied by ∂F₁/∂x
Conv → BN → ReLU | F₂(F₁) | Multiplied further
· · · (50 more layers) | F₅₂(...) | Vanishes to ≈ 0
FC → Softmax | ŷ | Early layers learn nothing
✅ ResNet — With Skip Connections
Block | Output | Gradient flow
Conv → BN → ReLU → Conv → BN | F(x) | Through conv path
Skip connection | x | Direct: always = 1.0
Add + ReLU | F(x) + x | Sum of both paths
· · · (25 such blocks) | Deep output | Gradient never vanishes
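The contrast between the two tables can be illustrated with a toy scalar calculation. This is a deliberately simplified sketch: it treats each layer's derivative as a single number (real networks have Jacobian matrices and non-linearities), and the value 0.01 is an arbitrary assumption chosen to make the effect visible.

```python
import math

def plain_gradient(local_grads):
    """Chain rule through a plain stack: product of per-layer derivatives."""
    return math.prod(local_grads)

def residual_gradient(local_grads):
    """Each block outputs F(x) + x, so its derivative is dF/dx + 1."""
    return math.prod(g + 1.0 for g in local_grads)

# Same small local derivative dF/dx = 0.01 in all 25 blocks of both networks
local = [0.01] * 25
plain = plain_gradient(local)       # 0.01 ** 25, about 1e-50
resid = residual_gradient(local)    # 1.01 ** 25, about 1.28

print(f"plain    : {plain:.2e}")
print(f"residual : {resid:.3f}")
```

The "+ 1" contributed by the identity path keeps the product close to one even when the convolutional path contributes almost nothing, which is the intuition behind the skip connection.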
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A standard ResNet bottleneck block (as in ResNet-50/101/152)."""
    expansion = 4

    def __init__(self, in_ch, out_ch, stride=1, downsample=None):
        super().__init__()
        # Bottleneck: 1×1 → 3×3 → 1×1
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1   = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3,
                                stride=stride, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(out_ch)
        self.conv3 = nn.Conv2d(out_ch, out_ch * self.expansion,
                                kernel_size=1, bias=False)
        self.bn3   = nn.BatchNorm2d(out_ch * self.expansion)
        self.relu  = nn.ReLU(inplace=True)
        self.downsample = downsample   # 1×1 conv to match dimensions

    def forward(self, x):
        identity = x                         # ← the skip connection

        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))

        if self.downsample is not None:
            identity = self.downsample(x)  # resize skip if dims changed

        out += identity                      # ← F(x) + x
        return self.relu(out)

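# Verify the block works. The block expands 64 → 256 channels, so the
# skip path needs a 1×1 projection before the two tensors can be added.
downsample = nn.Sequential(
    nn.Conv2d(64, 64 * ResidualBlock.expansion, kernel_size=1, bias=False),
    nn.BatchNorm2d(64 * ResidualBlock.expansion),
)
block = ResidualBlock(64, 64, downsample=downsample)
x     = torch.randn(4, 64, 56, 56)
out   = block(x)
print(f"Input  shape : {x.shape}")
print(f"Output shape : {out.shape}")
print(f"Parameters   : {sum(p.numel() for p in block.parameters()):,}")
OUTPUT
Input  shape : torch.Size([4, 64, 56, 56])
Output shape : torch.Size([4, 256, 56, 56])
Parameters   : 75,008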

Section 07

Transfer Learning — Standing on the Shoulders of ImageNet

Training a ResNet-50 from scratch on ImageNet takes 90 epochs on 8 V100 GPUs — about 4 days and $2,000 in cloud compute. For most projects you have hundreds or thousands of images, not millions. Transfer learning is the solution: take a network already trained on ImageNet, keep its learned feature extractors, and replace only the final classification layer with one specific to your task.

🔒
Feature Extraction
Freeze all pretrained weights. Replace only the final FC layer. Train only the new head. Fast — takes minutes on a CPU. Best when your data is small (<1,000 images) and similar to ImageNet.
model.requires_grad_(False)
🔥
Fine-Tuning (Head Only → All)
Train head first for 5–10 epochs (head is random; training everything together destroys pretrained features). Then unfreeze all layers and train with a very low learning rate (1e-4 to 1e-5). The classic two-phase approach.
Use low lr = 1e-4 when unfreezing
🎶
Discriminative LR
Assign different learning rates to different layer groups: very low for early conv layers (they already encode generic edges and textures), medium for middle layers, higher for the head. fastai pioneered this. Extracts maximum performance.
Layer groups: [lr/100, lr/10, lr]
🔀
Domain Adaptation
Your images are very different from ImageNet (X-rays, satellite imagery, microscopy). Fine-tune on a large unlabelled in-domain dataset first (self-supervised or masked autoencoder), then supervised fine-tune on your labelled data.
CheXpert pretrain → fine-tune
🔄
Full Retraining
Use ImageNet weights only for initialisation; train everything aggressively. Best when you have >100,000 domain-specific images and the domain is very different from natural images. Expensive. Only justified with large datasets.
Needs 100k+ images to beat transfer
🌟
Zero-Shot (CLIP)
No labelled training data at all. CLIP encodes both images and text descriptions into a shared embedding space. At inference: encode each class name as text, find the class whose text embedding is closest to the image embedding.
CLIP: zero fine-tuning needed
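The "[lr/100, lr/10, lr]" rule from the Discriminative LR card is easy to generate for any number of layer groups. The sketch below is one possible way to do it; the function name `discriminative_lrs` and the default factor of 10 are assumptions for illustration, not a fastai or PyTorch API.

```python
def discriminative_lrs(head_lr, num_groups, factor=10.0):
    """Return one learning rate per layer group, earliest group first.

    Each step towards the input divides the rate by `factor`.
    """
    return [head_lr / factor ** (num_groups - 1 - i) for i in range(num_groups)]

lrs = discriminative_lrs(head_lr=1e-3, num_groups=3)
print([f"{lr:.0e}" for lr in lrs])   # ['1e-05', '1e-04', '1e-03']
```

Each value can then be passed as the 'lr' of one optimiser parameter group, with the earliest layers receiving the smallest rate.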
import torch
import torch.nn as nn
from torchvision import models

# ── Strategy 1: Feature Extraction (freeze backbone) ──────────
model_fe = models.resnet50(weights='IMAGENET1K_V2')

# Freeze every parameter in the backbone
for param in model_fe.parameters():
    param.requires_grad = False

# Replace the classification head (in_features=2048 for ResNet-50)
num_classes = 5    # e.g. flower species: daisy, dandelion, rose, sunflower, tulip
model_fe.fc = nn.Sequential(
    nn.Dropout(p=0.3),
    nn.Linear(model_fe.fc.in_features, num_classes)
)

# Only head parameters will be updated
trainable = sum(p.numel() for p in model_fe.parameters()
                 if p.requires_grad)
total     = sum(p.numel() for p in model_fe.parameters())
print(f"Trainable params : {trainable:,}  ({trainable/total*100:.1f}%)")
print(f"Total params     : {total:,}")

# ── Strategy 2: Fine-Tuning (unfreeze backbone after head warmup) ─
model_ft = models.resnet50(weights='IMAGENET1K_V2')
model_ft.fc = nn.Linear(model_ft.fc.in_features, num_classes)

# Phase 1: train only head (backbone frozen)
for param in model_ft.parameters(): param.requires_grad = False
model_ft.fc.requires_grad_(True)
opt_phase1 = torch.optim.Adam(model_ft.fc.parameters(), lr=1e-3)

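# Phase 2 (after 10 epochs): unfreeze all, low lr
for param in model_ft.parameters(): param.requires_grad = True
# Include the stem (conv1 + bn1); parameters left out of the optimiser are never updated
stem_params = list(model_ft.conv1.parameters()) + list(model_ft.bn1.parameters())
opt_phase2 = torch.optim.Adam([
    {'params': stem_params,                  'lr': 1e-5},
    {'params': model_ft.layer1.parameters(), 'lr': 1e-5},
    {'params': model_ft.layer2.parameters(), 'lr': 1e-5},
    {'params': model_ft.layer3.parameters(), 'lr': 3e-5},
    {'params': model_ft.layer4.parameters(), 'lr': 1e-4},
    {'params': model_ft.fc.parameters(),     'lr': 1e-3},
])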
print(f"\nPhase-2 discriminative LR configured.")
OUTPUT
Trainable params : 10,245  (0.0%)
Total params     : 23,518,277

Phase-2 discriminative LR configured.

Section 08

The Full Training Loop — Every Moving Part

01
Forward Pass
Pass a batch of images through the model. Each image produces a vector of raw scores (logits) for each class. Shape: (batch_size, num_classes). No activation yet — Cross-Entropy loss works on logits directly.
02
Loss Computation
Cross-Entropy loss = −log(p_true), where p_true is the softmax probability the model assigns to the true class. Perfect prediction → loss ≈ 0. Random prediction on 10 classes → loss ≈ log(10) ≈ 2.3. This is your starting loss benchmark: if a freshly initialised head doesn't start near log(num_classes), something is wrong.
03
Backward Pass (Backprop)
loss.backward() computes the gradient of the loss with respect to every trainable parameter using the chain rule. PyTorch builds a computation graph during the forward pass and traces it in reverse. Gradients are accumulated in param.grad.
04
Gradient Clipping (Optional)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) prevents exploding gradients — a common failure mode when fine-tuning with a high learning rate or training transformers. Rarely needed for CNNs with batch norm.
05
Optimiser Step
optimizer.step() updates every parameter. For plain SGD the rule is θ ← θ − lr × ∇L; Adam-style optimisers rescale the gradient first. optimizer.zero_grad() clears the accumulated gradients and must run once per batch, before the next loss.backward(). Forgetting zero_grad() accumulates gradients across batches, a silent bug that corrupts training.
06
LR Scheduler Step
scheduler.step() decays the learning rate according to a schedule (CosineAnnealingLR, OneCycleLR, ReduceLROnPlateau). Called once per epoch (not per batch, unless using OneCycleLR which steps per batch). Proper LR scheduling typically adds 1–3% accuracy.
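The log(num_classes) sanity check from step 02 can be verified by hand. This is a minimal sketch in plain Python for a single example, without label smoothing; `softmax` and `cross_entropy` here are illustrative helpers, not the PyTorch functions.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_class):
    """Negative log of the softmax probability given to the true class."""
    return -math.log(softmax(logits)[true_class])

# Untrained model: all 10 logits equal, so every class gets probability 0.1
uniform = cross_entropy([0.0] * 10, true_class=3)
# Confident and correct: the true class has a much larger logit
confident = cross_entropy([0.0] * 3 + [20.0] + [0.0] * 6, true_class=3)

print(f"uniform   : {uniform:.4f}")     # 2.3026, i.e. log(10)
print(f"confident : {confident:.2e}")   # close to zero
```

If the first training batches report a loss far above log(num_classes), typical suspects are a wrong label mapping, missing normalisation, or a learning rate that is too high.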
import torch
import torch.nn as nn
from torchvision import models, datasets, transforms
from torch.utils.data import DataLoader
import time

# ── Data ─────────────────────────────────────────────────────────
train_ds = datasets.ImageFolder('data/train', transform=train_transform)
val_ds   = datasets.ImageFolder('data/val',   transform=val_transform)
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True,
                        num_workers=4, pin_memory=True)
val_dl   = DataLoader(val_ds,   batch_size=128, shuffle=False,
                        num_workers=4, pin_memory=True)
print(f"Classes : {train_ds.classes}")
print(f"Train   : {len(train_ds)} images | Val: {len(val_ds)} images")

# ── Model, loss, optimiser, scheduler ────────────────────────────
device    = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model     = models.resnet50(weights='IMAGENET1K_V2')
model.fc  = nn.Linear(model.fc.in_features, len(train_ds.classes))
model     = model.to(device)

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimiser = torch.optim.AdamW(model.parameters(), lr=3e-4,
                                weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
                optimiser, T_max=30, eta_min=1e-6)

def run_epoch(loader, training=True):
    model.train(training)   # train mode if training is True, eval mode otherwise
    total_loss, correct, n = 0.0, 0, 0
    ctx = torch.enable_grad() if training else torch.no_grad()
    with ctx:
        for imgs, labels in loader:
            imgs, labels = imgs.to(device), labels.to(device)
            logits = model(imgs)
            loss   = criterion(logits, labels)
            if training:
                optimiser.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                optimiser.step()
            total_loss += loss.item() * imgs.size(0)
            correct    += (logits.argmax(1) == labels).sum().item()
            n          += imgs.size(0)
    return total_loss / n, correct / n

# ── Training loop ─────────────────────────────────────────────────
best_val_acc = 0.0
for epoch in range(1, 31):
    t0              = time.time()
    tr_loss, tr_acc = run_epoch(train_dl, training=True)
    vl_loss, vl_acc = run_epoch(val_dl,   training=False)
    scheduler.step()
    elapsed = time.time() - t0
    if vl_acc > best_val_acc:
        best_val_acc = vl_acc
        torch.save(model.state_dict(), 'best_model.pth')
    print(f"Ep {epoch:02d} | tr_loss={tr_loss:.3f} tr_acc={tr_acc:.3f} | "
          f"vl_loss={vl_loss:.3f} vl_acc={vl_acc:.3f} | {elapsed:.0f}s")
OUTPUT
Classes : ['daisy', 'dandelion', 'rose', 'sunflower', 'tulip']
Train   : 3457 images | Val: 862 images
Ep 01 | tr_loss=1.421 tr_acc=0.512 | vl_loss=0.834 vl_acc=0.761 | 47s
Ep 02 | tr_loss=0.891 tr_acc=0.692 | vl_loss=0.671 vl_acc=0.812 | 45s
Ep 05 | tr_loss=0.644 tr_acc=0.782 | vl_loss=0.541 vl_acc=0.854 | 46s
Ep 10 | tr_loss=0.512 tr_acc=0.831 | vl_loss=0.448 vl_acc=0.889 | 44s
Ep 20 | tr_loss=0.398 tr_acc=0.868 | vl_loss=0.391 vl_acc=0.912 | 45s
Ep 30 | tr_loss=0.361 tr_acc=0.882 | vl_loss=0.378 vl_acc=0.921 | 44s
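To see what the cosine schedule in the listing does to the learning rate over those 30 epochs, the closed-form expression can be evaluated directly. This sketch assumes the same settings as the training script (base rate 3e-4, eta_min 1e-6, T_max 30); `cosine_lr` is a hypothetical helper that mirrors the formula documented for CosineAnnealingLR rather than calling PyTorch.

```python
import math

def cosine_lr(epoch, base_lr=3e-4, eta_min=1e-6, t_max=30):
    """Learning rate after `epoch` scheduler steps (closed form)."""
    cos = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cos)

for epoch in (0, 15, 30):
    print(f"epoch {epoch:2d}: lr = {cosine_lr(epoch):.3e}")
# epoch  0: lr = 3.000e-04
# epoch 15: lr = 1.505e-04
# epoch 30: lr = 1.000e-06
```

The rate falls slowly at first, fastest in the middle, and flattens out near eta_min, which is why the last few epochs change the weights very little.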

Section 09

Data Augmentation — More Data for Free

The Hospital That Had 500 X-rays — and Trained on 50,000
A radiology startup needed to train a pneumonia detector. They had 500 labelled chest X-rays — barely enough for a statistician, nowhere near enough for a deep network. Their solution: every time the training loop saw an X-ray, it randomly rotated it, flipped it horizontally, shifted it, changed the contrast slightly, and zoomed in at a random crop. The model never saw the exact same image twice.

By the end of training it had effectively "seen" tens of thousands of distinct images, all derived from 500 originals. The resulting model generalised to unseen patient X-rays at 92% sensitivity, which is clinically useful. Augmentation is not a trick. It is the primary tool against overfitting when data is scarce.
Geometric
Flip · Rotate · Crop · Zoom · Shear
The cheapest and most reliable augmentations. Horizontal flip is almost always safe. Vertical flip only for data without up/down semantics (satellite, microscopy). Rotation >30° often hurts for objects with canonical orientation.
☀️
Colour / Photometric
Brightness · Contrast · Saturation · Hue
Simulates different lighting conditions and camera sensors. Hue jitter is riskier for datasets where colour is discriminative (ripeness classification, lesion type). Keep ranges modest (±10–20%) to avoid unrealistic artefacts.
🅑
CutOut / Random Erasing
Occlusion simulation
Randomly erase a rectangle of pixels (replace with mean colour or noise). Forces the model to use the whole object for classification rather than one dominant patch. Significant regularisation effect. DeVries & Taylor, 2017.
🔀
MixUp
Convex combination of images+labels
Create a new training sample as a weighted average of two images AND their labels simultaneously: x̃ = λx₁ + (1−λ)x₂, ỹ = λy₁ + (1−λ)y₂. Produces "ghost" images that force calibrated predictions. Zhang et al., 2018.
✏️
CutMix
Patch from image B pasted into image A
Cut a random rectangular patch from one image, paste it into another, mix labels proportionally to the area. More natural than MixUp (no ghosting). Top augmentation strategy in recent ImageNet benchmarks. Yun et al., 2019.
🌈
RandAugment
Automated augmentation without policy search
Randomly samples N operations from a fixed set (rotate, solarise, posterise, autocontrast, sharpness…) each with magnitude M. Removes the need to search for or hand-design an augmentation policy; only N and M are tuned. Used in EfficientNet training. Cubuk et al., 2020.
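The CutMix card describes mixing labels "proportionally to the area". The sketch below shows one way to sample the patch and recompute the label weight, using only the standard library. It assumes λ is drawn from a Beta(1, 1) distribution and that the patch is clipped at the image border; the function names are hypothetical and the pasting of pixels itself is left out.

```python
import math
import random

def cutmix_box(height, width, lam, rng=random):
    """Sample a patch covering roughly (1 - lam) of the image area."""
    cut_ratio = math.sqrt(1.0 - lam)
    cut_h, cut_w = int(height * cut_ratio), int(width * cut_ratio)
    cy, cx = rng.randrange(height), rng.randrange(width)   # patch centre
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, height)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, width)
    return top, bottom, left, right

def cutmix_lambda(box, height, width):
    """Label weight for image A, from the area actually pasted after clipping."""
    top, bottom, left, right = box
    return 1.0 - (bottom - top) * (right - left) / (height * width)

rng = random.Random(0)                 # seeded only to make the demo repeatable
lam = rng.betavariate(1.0, 1.0)        # initial mixing ratio
box = cutmix_box(224, 224, lam, rng)
lam_adj = cutmix_lambda(box, 224, 224)

# Mixed label would be: lam_adj * y_a + (1 - lam_adj) * y_b
print(f"box = {box}, sampled lam = {lam:.3f}, area-adjusted lam = {lam_adj:.3f}")
```

Recomputing λ from the clipped box matters because a patch centred near the border loses part of its area, so the pasted image deserves a smaller share of the label.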
import torch
from torchvision import transforms
import numpy as np

# ── Modern augmentation stack (state-of-the-art 2024) ─────────
strong_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),  # inception crop
    transforms.RandomHorizontalFlip(),
    transforms.TrivialAugmentWide(),           # state-of-the-art auto-augment
    transforms.ToTensor(),
    transforms.Normalize([0.485,0.456,0.406], [0.229,0.224,0.225]),
    transforms.RandomErasing(p=0.25)           # CutOut — applied to tensor
])

# ── MixUp implementation ───────────────────────────────────────
def mixup_batch(imgs, labels, num_classes, alpha=0.4):
    """Apply MixUp to a batch. Returns mixed images and soft labels."""
    lam    = np.random.beta(alpha, alpha)
    idx    = torch.randperm(imgs.size(0))
    mixed  = lam * imgs + (1 - lam) * imgs[idx]
    # One-hot encode labels then mix
    y_a    = torch.zeros(imgs.size(0), num_classes, device=labels.device).scatter_(1, labels.unsqueeze(1), 1.0)
    y_b    = y_a[idx]
    mixed_labels = lam * y_a + (1 - lam) * y_b
    return mixed, mixed_labels

# ── Usage in training loop with soft labels ────────────────────
def soft_cross_entropy(pred, soft_targets):
    """Cross-entropy that accepts soft (non-integer) target labels."""
    log_prob = torch.nn.functional.log_softmax(pred, dim=1)
    return -(soft_targets * log_prob).sum(dim=1).mean()

# Simulate one training step with MixUp
imgs   = torch.randn(32, 3, 224, 224)
labels = torch.randint(0, 5, (32,))
mixed_imgs, mixed_labels = mixup_batch(imgs, labels, num_classes=5)
print(f"Mixed images shape : {mixed_imgs.shape}")
print(f"Soft label sample  : {mixed_labels[0].numpy().round(3)}")
OUTPUT
Mixed images shape : torch.Size([32, 3, 224, 224])
Soft label sample  : [0. 0. 0.623 0. 0.377]

Section 10

Evaluation Metrics — Beyond Simple Accuracy

On a balanced 10-class dataset, 90% accuracy sounds great. On a medical dataset where 99% of samples are "healthy" and 1% are "cancerous," a model that always predicts "healthy" achieves 99% accuracy — and misses every single cancer case. Accuracy is a dangerous single number when classes are imbalanced.
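The accuracy trap described above takes only a few lines to reproduce. This is a synthetic illustration with made-up counts (10 positive cases out of 1,000), not real clinical data.

```python
labels = [1] * 10 + [0] * 990      # 1 = cancerous (1%), 0 = healthy (99%)
preds = [0] * len(labels)          # a "model" that always predicts healthy

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
true_pos = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
recall = true_pos / sum(labels)    # fraction of real cancer cases caught

print(f"accuracy = {accuracy:.2f}")   # accuracy = 0.99
print(f"recall   = {recall:.2f}")     # recall   = 0.00
```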

📈
Top-1 Accuracy
Acc
Fraction of test images where the model's highest-probability prediction matches the true label. The standard metric for balanced classification benchmarks. Use only when classes are balanced.
🎯
Top-5 Accuracy
Top-5
Fraction where the true label appears in the model's 5 highest-probability predictions. Used in ImageNet benchmarking. Useful when class boundaries are fuzzy (e.g. 1,000 fine-grained categories).
Precision / Recall
P / R
Precision = TP/(TP+FP): of all predicted positives, how many are real? Recall = TP/(TP+FN): of all real positives, how many did we catch? The two usually trade off against each other; adjust the decision threshold to tune the balance.
🌟
F1 Score
F1
Harmonic mean of precision and recall: 2PR/(P+R). Gives equal weight to both. Use macro-F1 for imbalanced datasets: average F1 across all classes, giving each class equal weight regardless of frequency.
📊
AUC-ROC
AUC
Area Under the ROC Curve. Measures the model's ability to discriminate between classes at all possible thresholds. 0.5 = random, 1.0 = perfect. Threshold-independent — ideal for medical diagnosis evaluation.
📉
Confusion Matrix
CM
N×N matrix where entry (i,j) = number of class-i images predicted as class-j. Reveals exactly which classes are being confused with each other. The single most informative diagnostic tool for a multi-class classifier.
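Most of the metrics above can be read straight off the confusion matrix. The following sketch computes per-class precision, recall and F1, plus macro-F1 and accuracy, for a small made-up 3-class matrix. It uses the same (i, j) convention as the card above; `per_class_metrics` is an illustrative helper, not a scikit-learn function.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of class-i samples predicted as class-j."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp   # column k minus diagonal
        fn = sum(cm[k]) - tp                        # row k minus diagonal
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append((precision, recall, f1))
    return results

# Made-up counts for three classes (rows = true class, columns = prediction)
cm = [[8, 2, 0],
      [1, 9, 0],
      [0, 1, 4]]

metrics = per_class_metrics(cm)
macro_f1 = sum(f1 for _, _, f1 in metrics) / len(metrics)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

for k, (p, r, f1) in enumerate(metrics):
    print(f"class {k}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print(f"macro-F1 = {macro_f1:.3f} | accuracy = {accuracy:.3f}")
```

Macro-F1 averages the per-class scores without weighting by support, so the smallest class (class 2, five samples) counts as much as the largest.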
import torch
import numpy as np
from sklearn.metrics import (classification_report, confusion_matrix,
                               roc_auc_score, top_k_accuracy_score)

def evaluate_model(model, loader, device, class_names):
    model.eval()
    all_labels, all_probs = [], []

    with torch.no_grad():
        for imgs, labels in loader:
            imgs   = imgs.to(device)
            logits = model(imgs)
            probs  = torch.softmax(logits, dim=1).cpu()
            all_probs.append(probs)
            all_labels.append(labels)

    all_probs  = torch.cat(all_probs).numpy()    # (N, C) float
    all_labels = torch.cat(all_labels).numpy()   # (N,) int
    all_preds  = all_probs.argmax(axis=1)

    # Top-1 and Top-3 accuracy
    top1 = top_k_accuracy_score(all_labels, all_probs, k=1)
    top3 = top_k_accuracy_score(all_labels, all_probs, k=3)

    # Per-class precision / recall / F1
    report = classification_report(all_labels, all_preds,
                                      target_names=class_names, digits=3)
    # Macro AUC (OvR strategy for multiclass)
    auc = roc_auc_score(all_labels, all_probs, multi_class='ovr',
                          average='macro')
    # Confusion matrix
    cm  = confusion_matrix(all_labels, all_preds)

    print(f"Top-1 Acc : {top1:.4f}  |  Top-3 Acc : {top3:.4f}  |  AUC : {auc:.4f}")
    print(f"\n{report}")
    return cm

# Simulated output for 5-class flower dataset
print("Top-1 Acc : 0.929  |  Top-3 Acc : 0.984  |  AUC : 0.987")
OUTPUT
Top-1 Acc : 0.929  |  Top-3 Acc : 0.984  |  AUC : 0.987

              precision    recall  f1-score   support

       daisy      0.934     0.921     0.927       178
   dandelion      0.961     0.943     0.952       186
        rose      0.883     0.897     0.890       152
   sunflower      0.951     0.968     0.959       190
       tulip      0.901     0.911     0.906       156

    accuracy                          0.929       862
   macro avg      0.926     0.928     0.927       862

Section 11

Vision Transformers — Attention Is All You Need (For Images Too)

Dosovitskiy's Bet — Can Transformers Beat CNNs Without Convolution?
In 2020, Alexey Dosovitskiy and colleagues at Google Brain made a bold move. They took the transformer architecture, invented for NLP, and applied it directly to images with almost no modifications. Their argument: CNNs have inductive biases baked in (translation equivariance, locality). What if we just removed those biases and let the model learn everything from data?

The trick: split an image into non-overlapping 16×16 patches. Flatten each patch into a 1D vector. Treat those vectors exactly like tokens in a sentence. Feed them through a standard transformer encoder. Let self-attention learn which patches "attend to" which other patches — relationships that can span the entire image, something CNNs struggle with without very deep stacking.

With enough pretraining data and compute (hundreds of millions of images), ViT matched or beat the best CNNs of its time. With modest data it underperforms ResNets, because it lacks their helpful inductive biases. This tension between learning everything from scratch versus building in known structure is the central debate of modern computer vision.
Image Patchification
x ∈ R^(H×W×C) → N patches of size P×P
N = (H/P) × (W/P) = number of "tokens". For ViT-B/16 on 224×224: N = 14×14 = 196 patches. Each patch is flattened (16×16×3 = 768 values) and linearly projected to a 768-d embedding. Plus one [CLS] token = 197 total.
Self-Attention
Attention(Q,K,V) = softmax(QKᵀ/√d_k)V
Each patch queries every other patch. The attention weight between patch i and patch j measures how much information patch j contributes to the updated representation of patch i. Global receptive field from layer 1.
Positional Encoding
z = E·x + pos_embed
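Unlike CNNs, transformers have no built-in notion of spatial order: self-attention treats its input as an unordered set of tokens. Learnable position embeddings are added to each patch embedding to tell the model where in the image each patch came from. They are kept at inference; to handle a different input resolution, the trained embeddings are interpolated to the new patch grid.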
Classification
ŷ = MLP(z_CLS)
Only the [CLS] token's output is passed to the classification head. This token attends to all 196 image patches in every layer, aggregating global context. It plays a role analogous to Global Average Pooling in CNNs: both reduce the spatial grid to a single vector per image.
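The four steps above can be traced end to end in a few lines of PyTorch. This is a deliberately simplified sketch with randomly initialised weights: it uses a single attention head and a single layer, and it omits the LayerNorm, multi-head split and MLP blocks of a real ViT encoder. All variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, C, H, W = 2, 3, 224, 224        # batch, channels, height, width
P, D, num_classes = 16, 768, 5     # patch size, embedding dim, classes

x = torch.randn(B, C, H, W)

# 1) Patchify: (B, C, H, W) -> (B, N, C*P*P)
patches = x.unfold(2, P, P).unfold(3, P, P)       # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)       # (B, H/P, W/P, C, P, P)
patches = patches.reshape(B, -1, C * P * P)       # (B, 196, 768)
N = patches.shape[1]

# 2) Linear projection E, [CLS] token and learnable position embeddings
embed     = nn.Linear(C * P * P, D)
cls_token = nn.Parameter(torch.zeros(1, 1, D))
pos_embed = nn.Parameter(torch.randn(1, N + 1, D) * 0.02)

z = embed(patches)                                        # (B, 196, D)
z = torch.cat([cls_token.expand(B, -1, -1), z], dim=1)    # (B, 197, D)
z = z + pos_embed

# 3) Single-head self-attention: softmax(Q K^T / sqrt(d_k)) V
W_q, W_k, W_v = nn.Linear(D, D), nn.Linear(D, D), nn.Linear(D, D)
Q, K, V = W_q(z), W_k(z), W_v(z)
attn = torch.softmax(Q @ K.transpose(-2, -1) / D ** 0.5, dim=-1)  # (B, 197, 197)
z = z + attn @ V                                          # residual connection

# 4) Classify from the [CLS] token only (index 0 along the token axis)
head   = nn.Linear(D, num_classes)
logits = head(z[:, 0])

print("Tokens per image :", N + 1)
print("Attention matrix :", tuple(attn.shape))
print("Logits           :", tuple(logits.shape))
```

Each row of attn sums to 1, so every token's update is a weighted average over all 197 tokens, which is what gives the model a global receptive field from the first layer.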
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load ViT-B/16 with torchvision's IMAGENET1K_V1 weights (trained on ImageNet-1k)
weights = ViT_B_16_Weights.IMAGENET1K_V1
model   = vit_b_16(weights=weights)
model.eval()

# Inspect architecture
print("Patch size  :", model.patch_size)           # 16
print("Hidden dim  :", model.hidden_dim)           # 768
print("Num heads   :", model.encoder.layers[0].num_heads)  # 12
print("Num layers  :", len(model.encoder.layers))  # 12
print("Total params:",
      f"{sum(p.numel() for p in model.parameters()):,}")  # 86M

# Replace head for 5-class task
import torch.nn as nn
in_features = model.heads.head.in_features    # 768
model.heads.head = nn.Linear(in_features, 5)

# Inference on a batch
batch = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    out = model(batch)
print(f"\nInput  : {batch.shape}")
print(f"Output : {out.shape}")     # (4, 5) — one logit per class per image
OUTPUT
Patch size  : 16
Hidden dim  : 768
Num heads   : 12
Num layers  : 12
Total params: 86,567,656

Input  : torch.Size([4, 3, 224, 224])
Output : torch.Size([4, 5])
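When a ViT is run at a resolution other than the one it was trained at, the number of patches changes, so the learned position embeddings no longer line up with the tokens. A common remedy is to resize the patch position embeddings by 2D interpolation while leaving the [CLS] embedding untouched. The sketch below assumes a square patch grid and a (1, 1 + N, D) embedding layout with [CLS] first; the function name resize_pos_embed is hypothetical.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Resize ViT position embeddings from an old_grid x old_grid patch grid
    to a new_grid x new_grid one.

    pos_embed: tensor of shape (1, 1 + old_grid**2, D), [CLS] embedding first.
    """
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    # (1, N, D) -> (1, D, grid, grid) so it can be treated as a 2D feature map
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    # Back to token layout: (1, new_grid**2, D)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

# 224x224 with 16x16 patches -> 14x14 grid; 384x384 -> 24x24 grid
pos_224 = torch.randn(1, 1 + 14 * 14, 768)   # stand-in for trained embeddings
pos_384 = resize_pos_embed(pos_224, old_grid=14, new_grid=24)
print(pos_384.shape)   # torch.Size([1, 577, 768])
```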

Section 12

CLIP — Zero-Shot Classification with Text Prompts

OpenAI's CLIP (Contrastive Language-Image Pretraining, 2021) changed the rules. Trained on 400 million (image, caption) pairs from the web, CLIP learns a joint embedding space where similar images and texts are pulled together. At inference, you describe your classes in natural language — no labelled images required.

How CLIP Zero-Shot Works

Encode your image with CLIP's visual encoder → a 512-d vector. Encode each class name as "a photo of a {class}" with the text encoder → 512-d vectors. Compute cosine similarity between the image vector and each class vector. The class with highest similarity is the prediction. No fine-tuning, no labelled examples — just natural language descriptions.

# pip install ftfy regex tqdm
# pip install git+https://github.com/openai/CLIP.git
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load CLIP ViT-B/32
model, preprocess = clip.load("ViT-B/32", device=device)

# Load and preprocess image
image = preprocess(Image.open("mystery_flower.jpg")).unsqueeze(0).to(device)

# Define class labels as natural language prompts
class_names = ["daisy", "dandelion", "rose", "sunflower", "tulip"]
text_prompts = clip.tokenize(
    [f"a photograph of a {c} flower" for c in class_names]
).to(device)

with torch.no_grad():
    img_features  = model.encode_image(image)
    txt_features  = model.encode_text(text_prompts)

    # Normalise to unit sphere
    img_features /= img_features.norm(dim=-1, keepdim=True)
    txt_features /= txt_features.norm(dim=-1, keepdim=True)

    # Cosine similarity → probabilities
    similarity = (100.0 * img_features @ txt_features.T)
    probs = similarity.softmax(dim=-1)[0]

print("Zero-Shot CLIP Predictions (no fine-tuning):")
for cls, prob in zip(class_names, probs):
    bar = "█" * int(prob.item() * 40)
    print(f"  {cls:12s}: {prob.item()*100:5.1f}%  {bar}")
OUTPUT
Zero-Shot CLIP Predictions (no fine-tuning):
  daisy       :   3.1%  █
  dandelion   :   4.7%  █
  rose        :  82.4%  █████████████████████████████████
  sunflower   :   6.2%  ██
  tulip       :   3.6%  █
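The factor of 100.0 applied before the softmax acts as a temperature (logit scale). Cosine similarities live in [-1, 1] and, for an image compared against several plausible prompts, tend to sit close together, so a softmax over the raw values is almost uniform. The toy example below uses made-up similarity values to show the effect; it does not call CLIP at all.

```python
import torch

# Hypothetical cosine similarities between one image and five class prompts
cos_sim = torch.tensor([0.21, 0.22, 0.29, 0.23, 0.21])

unscaled = cos_sim.softmax(dim=-1)             # almost uniform
scaled   = (100.0 * cos_sim).softmax(dim=-1)   # sharply peaked on the best match

print("Unscaled max prob:", round(unscaled.max().item(), 3))
print("Scaled max prob  :", round(scaled.max().item(), 3))
print("Same winner      :", unscaled.argmax().item() == scaled.argmax().item())
```

Scaling never changes which class wins, only how confident the resulting probabilities look, so the percentages printed above should be read as relative scores rather than calibrated probabilities.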

Section 13

Full Pipeline Comparison — Choosing Your Approach

| Approach | Data Needed | GPU Required | Typical Accuracy | Training Time | Best For |
|---|---|---|---|---|---|
| HOG + SVM | ~500 images | No | 65–78% | Minutes | Extremely constrained environments, baseline |
| Transfer Learning (frozen) | ~200–2k images | Optional | 82–91% | 5–20 min | Small datasets, prototyping, domains close to ImageNet |
| Transfer Learning (fine-tuned) | 1k–50k images | Recommended | 88–95% | 1–4 hours | Most real-world projects (the standard approach) |
| EfficientNet-B4 fine-tuned | 5k–100k images | Yes | 93–97% | 4–12 hours | Production systems, Kaggle competitions |
| ViT fine-tuned (21k pretrain) | 10k+ images | Yes (A100+) | 94–98% | 1–3 days | Large datasets, highest accuracy requirement |
| CLIP zero-shot | 0 images | Optional | 60–80% (domain-dependent) | Seconds (no training) | New categories, rapid prototyping, open-world recognition |

Section 14

Common Failure Modes — What Goes Wrong and Why

🚫 Dataset Leakage
Validation or test images accidentally included in the training set. The model scores 99% because it has already seen the images it is being evaluated on. Always split before any preprocessing; never augment the validation set.
⚠️ Class Imbalance
1,000 cats and 10 dogs. Model learns to always predict "cat" for 99% accuracy. Solution: class_weight='balanced', focal loss, oversample the minority class, or use weighted random sampler in DataLoader.
🚫 Forgetting to zero_grad()
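Gradients accumulate across batches instead of resetting, so each update uses the sum of all previous gradients and the loss explodes after a few steps. Call optimizer.zero_grad() once per iteration, before loss.backward(). Passing set_to_none=True frees the gradient tensors instead of filling them with zeros; it is the default from PyTorch 2.0 onwards.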
⚠️ Wrong Normalisation Stats
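Normalising with statistics that do not match what the network expects. An ImageNet-pretrained backbone expects ImageNet mean/std, even when the inputs are medical or satellite images. A model trained from scratch, or pretrained on your own domain, should use mean/std computed from your own training set, because those images have completely different pixel distributions. Mismatched stats can reduce accuracy by 3–8% silently.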
🚫 Augmenting the Val/Test Set
Applying RandomHorizontalFlip or ColorJitter to validation data. Makes evaluation results non-deterministic, so scores are noisy and cannot be compared between epochs or runs. Validation transform should only be: Resize → CenterCrop → ToTensor → Normalize.
⚠️ Learning Rate Too High
Fine-tuning a pretrained model with lr=1e-3 for the full backbone. Destroys pretrained weights in the first epoch. Fine-tuning backbone: lr=1e-5 to 1e-4. Head only: lr=1e-3. Use LR range test (fastai) if unsure.
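As an illustration of the class-imbalance fix, the sketch below rebalances the 1,000-cat / 10-dog example with PyTorch's WeightedRandomSampler, and also shows the class-weighted loss alternative. The dataset is random noise standing in for real images, and the inverse-frequency weighting is one common choice rather than the only one.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical imbalanced dataset: 1,000 "cat" (label 0) and 10 "dog" (label 1)
labels  = torch.cat([torch.zeros(1000, dtype=torch.long),
                     torch.ones(10, dtype=torch.long)])
images  = torch.randn(len(labels), 3, 8, 8)        # tiny stand-in images
dataset = TensorDataset(images, labels)

# Inverse-frequency weight per class, then one weight per sample
class_counts   = torch.bincount(labels)            # tensor([1000, 10])
class_weights  = 1.0 / class_counts.float()
sample_weights = class_weights[labels]

# Sampling with replacement draws each class roughly equally often
sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels),
                                replacement=True)
loader  = DataLoader(dataset, batch_size=64, sampler=sampler)

seen = torch.cat([y for _, y in loader])
dog_fraction = (seen == 1).float().mean().item()
print(f"Fraction of dogs seen in one epoch: {dog_fraction:.2f}")

# Alternative: keep the natural sampling and weight the loss instead
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights / class_weights.sum())
```

Oversampling ten dog images this heavily means the model sees each of them about fifty times per epoch, so it is usually paired with augmentation to limit memorisation.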

Section 15

Golden Rules

📸 Image Classification — Non-Negotiable Rules
1
Never train from scratch unless you have more than 100,000 images. Pretrained ImageNet weights — even on tasks completely unlike ImageNet — almost always outperform random initialisation. The universal feature hierarchy (edges → textures → shapes → objects) transfers to nearly every visual domain.
2
Split your data before touching a single pixel. Create train / validation / test splits first. Lock the test set — never look at it, never tune on it. Use validation loss for all decisions. Evaluate on test exactly once, at the very end. Peeking at test data during development is data leakage.
3
Normalise with the statistics of the dataset your backbone was pretrained on. For ImageNet-pretrained models: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]. For models pretrained on your own domain (medical, satellite), compute and use your domain's statistics. Wrong normalisation silently degrades performance.
4
When fine-tuning, always warm up the head before unfreezing the backbone. The new classification head is randomly initialised — its gradients are wild. If the backbone is also trainable from the start, those wild gradients will corrupt the pretrained weights in the first epoch. Train only the head for 5–10 epochs first, then unfreeze everything with a 10× lower learning rate.
5
Use label smoothing (0.05–0.15) for cross-entropy loss. It penalises over-confident predictions, improves calibration, and typically adds 0.3–1% accuracy. Set with nn.CrossEntropyLoss(label_smoothing=0.1). Especially important when your training set is small and overfitting is a concern.
6
Report macro-F1 and per-class metrics, not just accuracy. A single accuracy number hides everything. Always print the full classification_report and inspect the confusion matrix. You need to know which classes are hard, which are easy, and which are being confused with each other — accuracy tells you none of that.
7
Validate augmentation visually before training. Call train_ds[i] several times for the same image, convert back to uint8 and display. If any augmented version looks unrealistic or wrong (text upside down, faces mirrored nonsensically, colours completely unnatural), it will hurt the model. Eyes catch bad augmentations; loss curves do not.
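Rules 4 and 5 can be combined in a short two-phase training sketch. To keep it self-contained, a tiny randomly initialised network stands in for a pretrained backbone, and a single random batch stands in for a DataLoader; the learning rates and step counts are illustrative values, not tuned recommendations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a pretrained network: a small "backbone" plus a fresh 5-class head
backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
head  = nn.Linear(8, 5)
model = nn.Sequential(backbone, head)

loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)        # rule 5
imgs, labels = torch.randn(16, 3, 32, 32), torch.randint(0, 5, (16,))

def train_step(optimizer):
    """One optimisation step on the stand-in batch; returns the loss value."""
    optimizer.zero_grad(set_to_none=True)
    loss = loss_fn(model(imgs), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Phase 1: freeze the backbone and warm up the head only
for p in backbone.parameters():
    p.requires_grad = False
backbone_before = [p.clone() for p in backbone.parameters()]

opt_head = torch.optim.AdamW(head.parameters(), lr=1e-3)
for _ in range(3):
    train_step(opt_head)

backbone_frozen = all(torch.equal(a, b)
                      for a, b in zip(backbone_before, backbone.parameters()))
print("Backbone unchanged after phase 1:", backbone_frozen)

# Phase 2: unfreeze everything, with lower learning rates via parameter groups
for p in backbone.parameters():
    p.requires_grad = True
opt_all = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-5},
    {"params": head.parameters(),     "lr": 1e-4},
])
print("Phase 2 loss:", round(train_step(opt_all), 4))
```

Creating a fresh optimizer for phase 2 is a design choice: it discards the head's optimizer state from phase 1 but keeps the two phases easy to reason about.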