The Story That Explains Image Classification
Now consider a radiologist who has read 50,000 chest X-rays over twenty years. She spots a suspicious shadow at 4mm — invisible to an untrained eye — because her brain has built an extraordinarily detailed internal model of what "normal" looks like versus what "abnormal" looks like. She is classifying images, one at a time, faster and more accurately than any rule-based system ever could.
Image Classification is the task of training a machine to do exactly this: look at a raw image and output a label — cat, dog, tumour, fraudulent cheque, ripe tomato, storm cloud — by learning from thousands of labelled examples. It is the oldest, most studied problem in computer vision, and it is the foundation on which everything else — object detection, segmentation, video understanding — is built.
At its core, image classification is a supervised learning problem. You give a model thousands of (image, label) pairs. The model learns a mapping from pixel values to class probabilities. At inference time it applies that mapping to images it has never seen before and outputs its best prediction.
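To make "class probabilities" concrete, here is a minimal sketch in plain Python of the last step of a classifier: turning raw scores (logits) into probabilities with a softmax and picking the most likely label. The class names and logit values are made up for illustration; a real model would produce the logits from the pixels.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical classes and logits, for illustration only
class_names = ["cat", "dog", "tomato"]
logits = [2.0, 0.5, -1.0]

probs = softmax(logits)
prediction = class_names[probs.index(max(probs))]
print(dict(zip(class_names, (round(p, 3) for p in probs))), "->", prediction)
```

The largest logit always gets the largest probability, so the softmax changes how confident the prediction looks, not which class wins.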
In 2009, Fei-Fei Li's team (then at Princeton) introduced ImageNet; the subset used for the ILSVRC challenge from 2010 onwards contains about 1.2 million training images across 1,000 categories. In 2012, Alex Krizhevsky's AlexNet cut the best top-5 error rate from roughly 26% to 15% using a deep convolutional neural network trained on two GPUs. That single result launched the modern deep learning era. Today's best models achieve under 1.5% top-5 error on the same benchmark, which is better than most humans manage.
How Images Become Numbers — The Input Pipeline
A model never sees an image. It sees a tensor — a multidimensional array of numbers. Before any learning can happen, every raw image must pass through a standardised pre-processing pipeline that converts it into a fixed-size, normalised tensor the model can consume.
import torch
from torchvision import transforms
from PIL import Image
# ── Training transform (with augmentation) ────────────────────
train_transform = transforms.Compose([
transforms.Resize(256),
transforms.RandomCrop(224),
transforms.RandomHorizontalFlip(p=0.5),
transforms.ColorJitter(brightness=0.2, contrast=0.2,
saturation=0.1, hue=0.05),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
# ── Validation / test transform (NO augmentation) ─────────────
val_transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224), # deterministic centre crop
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
# ── Test it on a single image ─────────────────────────────────
img = Image.open('dog.jpg').convert('RGB')
tensor = val_transform(img) # shape: (3, 224, 224)
batch = tensor.unsqueeze(0) # shape: (1, 3, 224, 224)
print(f"Original image size : {img.size}")
print(f"Tensor shape : {tensor.shape}")
print(f"Tensor dtype : {tensor.dtype}")
print(f"Pixel value range : [{tensor.min():.3f}, {tensor.max():.3f}]")
After subtracting ImageNet means and dividing by stds, pixel values are no longer in [0,1]. They can be negative (very dark pixels below the mean) or above 1.0 (bright pixels). This is expected and correct. The model was trained on data with these exact statistics. Never re-clip to [0,1] after normalising — you will corrupt the input distribution and degrade performance.
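The arithmetic behind that range is easy to check. The sketch below uses NumPy rather than torchvision: it applies the same per-channel mean and std to the darkest and brightest possible pixels, then inverts the operation, which is what you would do before displaying a normalised tensor. The helper names `normalise` and `denormalise` are illustrative and not part of torchvision.

```python
import numpy as np

MEAN = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
STD = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)

def normalise(img):
    """img: float array of shape (3, H, W) with values in [0, 1]."""
    return (img - MEAN) / STD

def denormalise(x):
    """Invert normalise() so the image can be displayed again."""
    return x * STD + MEAN

black = np.zeros((3, 2, 2))
white = np.ones((3, 2, 2))

lo = normalise(black).min()   # (0 - 0.485) / 0.229, about -2.118
hi = normalise(white).max()   # (1 - 0.406) / 0.225, about  2.640
print(f"normalised range: [{lo:.3f}, {hi:.3f}]")

restored = denormalise(normalise(white))   # round-trips back to all ones
```

Clipping the normalised values to [0, 1] would throw away everything outside that interval, which is why the inverse transform, not a clip, is the right way to get back to displayable pixels.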
Classical Approach — HOG + SVM Before Deep Learning
The best systems required enormous domain expertise, ran on single images at a time, and barely approached 75% accuracy on toy benchmarks. Then AlexNet arrived and rendered a decade of feature engineering obsolete in one afternoon. Understanding the classical approach teaches you what deep networks learned to do automatically.
from skimage.feature import hog
from skimage import color
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np
import cv2, os
def extract_hog(img_path, size=(64, 64)):
    """Load image, resize, convert to greyscale, extract HOG descriptor."""
    img = cv2.imread(img_path)
    if img is None:  # cv2.imread returns None instead of raising on unreadable files
        raise ValueError(f"Could not read image: {img_path}")
    img = cv2.resize(img, size)
    grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    fd = hog(grey, orientations=9,
             pixels_per_cell=(8, 8),
             cells_per_block=(2, 2),
             block_norm='L2-Hys')
    return fd
# Build feature matrix from image folder structure
# Expects: data/cats/*.jpg, data/dogs/*.jpg
X, y = [], []
for label, cls in enumerate(['cats', 'dogs']):
    folder = f'data/{cls}'
    for fname in sorted(os.listdir(folder)):
        path = os.path.join(folder, fname)
        X.append(extract_hog(path))
        y.append(label)
X = np.array(X)   # shape: (N, 1764), the HOG length for a 64×64 image
y = np.array(y)
# Pipeline: scale → SVM with RBF kernel
pipe = Pipeline([
('scaler', StandardScaler()),
('svm', SVC(C=10.0, kernel='rbf', gamma='scale',
probability=True))
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f"HOG + SVM 5-fold CV: {scores.mean():.3f} ± {scores.std():.3f}")
print(f"Feature vector length: {X.shape[1]}")
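The feature length of 1764 quoted in the listing follows directly from the HOG settings. As a small worked example, assuming square images and blocks that slide one cell at a time (the layout used with the parameters above), the length can be computed without loading any image. The function name `hog_length` is illustrative.

```python
def hog_length(image_size, cell=8, block=2, orientations=9):
    """Descriptor length for a square image, with blocks sliding one cell at a time."""
    cells_per_side = image_size // cell
    blocks_per_side = cells_per_side - block + 1
    values_per_block = block * block * orientations
    return blocks_per_side ** 2 * values_per_block

print(hog_length(64))    # 7 * 7 blocks * 36 values = 1764
print(hog_length(128))   # 15 * 15 blocks * 36 values = 8100
```

Doubling the image side more than quadruples the descriptor, which is one reason classical pipelines worked on small, fixed-size crops.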
Convolutional Neural Networks — The Core Idea
A Convolutional Neural Network (CNN) replaces hand-crafted feature engineering with learnable filters. Instead of a human deciding "edges are important," the network learns which filters — which patterns of pixels — maximally discriminate between classes, directly from the training data.
Visualisation research (Zeiler & Fergus 2013, Olah et al. 2020) revealed a beautiful hierarchy: early layers learn edges and colour blobs. Middle layers learn textures, patterns, and object parts (eyes, wheels, fur). Deep layers learn complete semantic concepts (faces, dogs, cars). This hierarchy has been observed across a wide range of CNN architectures trained on natural images, and it is a large part of why pretrained features transfer so well.
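To see what "a filter" means in practice, the sketch below applies a hand-written vertical-edge (Sobel) kernel to a tiny synthetic image using NumPy only. A convolutional layer performs the same sliding-window operation, except that the nine kernel values are learned from data rather than chosen by hand. The function `conv2d_valid` is an illustrative helper, not a library call.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1), as CNN layers do."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# 6x6 image: dark left half (0), bright right half (1)
image = np.zeros((6, 6))
image[:, 3:] = 1.0

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

response = conv2d_valid(image, sobel_x)
print(response)   # each row is [0, 4, 4, 0]: non-zero only where the window spans the edge
```

The output is zero over the flat regions and large where the window straddles the dark-to-bright boundary, which is the kind of edge detector that early CNN layers tend to learn on their own.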
Architecture Evolution — From AlexNet to Vision Transformers
| Architecture | Year | Top-5 Error | Params | Key Innovation | Use Today? |
|---|---|---|---|---|---|
| AlexNet | 2012 | 15.3% | 60M | Deep CNN on GPU, ReLU, Dropout | Teaching only |
| VGG-16 | 2014 | 7.3% | 138M | Very deep (16 layers), uniform 3×3 convs | Occasional baseline |
| GoogLeNet/Inception | 2014 | 6.7% | 6.8M | Inception modules, parallel paths, fewer params | Rarely |
| ResNet-50 | 2015 | 5.3% | 25M | Residual skip connections — enables 100+ layer networks | Yes — excellent baseline |
| DenseNet-121 | 2017 | 7.7% | 8M | Dense connections: every layer feeds all subsequent layers | Medical imaging |
| EfficientNet-B0 | 2019 | 6.7% | 5.3M | Compound scaling of width, depth, and resolution jointly | Yes, best acc/params trade-off |
| ViT-B/16 | 2021 | 2.0% | 86M | Vision Transformer — pure attention, no convolutions | Yes — large data regime |
| ConvNeXt-B | 2022 | 1.9% | 89M | Modernised pure-CNN rivalling ViT with simpler training | Yes — practical favourite |
For most practical projects: EfficientNet-B2 or ResNet-50 with transfer learning. For mobile/edge deployment: MobileNetV3 or EfficientNet-B0. For maximum accuracy with ample GPU: ConvNeXt-Base or ViT-B/16 pretrained on ImageNet-21k. Never start from scratch unless you have millions of domain-specific images.
ResNet — The Skip Connection That Solved Depth
Kaiming He's reasoning behind ResNet: a 56-layer network should be at least as good as a 20-layer network, because the extra 36 layers could simply learn to be identity functions and do nothing. But SGD couldn't find that solution: in practice the deeper plain network ended up with higher error. The usual explanation is that the gradient signal weakens as it is multiplied through dozens of layers on its way back to the early ones.
His fix was elegant: add a shortcut connection that bypasses a small stack of conv layers (two in the basic block, three in the bottleneck variant). If the block learns F(x), the output is F(x) + x. The gradient can now flow directly backwards through the skip connection, bypassing the potentially troubled conv layers. Suddenly 152-layer networks trained cleanly.
ResNet won the classification, detection, and localisation tracks of ImageNet 2015 and became one of the most cited deep learning papers in history. Its residual block is the atomic unit of modern neural network design.
Plain deep network (no skip connections):
| Layer | Output | Gradient flow |
|---|---|---|
| Conv → BN → ReLU | F₁(x) | Multiplied by ∂F₁/∂x |
| Conv → BN → ReLU | F₂(F₁) | Multiplied further |
| · · · (50 more layers) | F₅₂(...) | Vanishes to ≈ 0 |
| FC → Softmax | ŷ | Early layers learn nothing |
Residual network (with skip connections):
| Block | Output | Gradient flow |
|---|---|---|
| Conv → BN → ReLU → Conv → BN | F(x) | Through conv path |
| Skip connection | x | Direct — always = 1.0 |
| Add + ReLU | F(x) + x | Sum of both paths |
| · · · (25 such blocks) | Deep output | Gradient never vanishes |
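The contrast between the two gradient paths can be turned into a toy calculation. The sketch below assumes, purely for illustration, that every layer has the same scalar local derivative of 0.1. It compares the gradient reaching the first layer of a plain stack (a product of the local derivatives) with a residual stack, where each block contributes F'(x) + 1 because of the identity path. Real networks have matrix-valued derivatives that vary from layer to layer, so treat this as intuition rather than a proof.

```python
def plain_gradient(local_derivative, num_layers):
    """Gradient through a plain stack: product of per-layer derivatives."""
    grad = 1.0
    for _ in range(num_layers):
        grad *= local_derivative
    return grad

def residual_gradient(local_derivative, num_blocks):
    """Gradient through residual blocks: y = F(x) + x has derivative F'(x) + 1."""
    grad = 1.0
    for _ in range(num_blocks):
        grad *= (local_derivative + 1.0)
    return grad

plain = plain_gradient(0.1, 50)      # 0.1 ** 50
resid = residual_gradient(0.1, 25)   # 1.1 ** 25

print(f"plain, 50 layers   : {plain:.3e}")
print(f"residual, 25 blocks: {resid:.3e}")
```

The identity term is what keeps the product from collapsing towards zero. In a real network F'(x) takes both signs and normalisation layers help keep the overall scale in check.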
import torch
import torch.nn as nn
class ResidualBlock(nn.Module):
    """A standard ResNet bottleneck block (as in ResNet-50/101/152)."""
    expansion = 4
    def __init__(self, in_ch, out_ch, stride=1, downsample=None):
        super().__init__()
        # Bottleneck: 1×1 → 3×3 → 1×1
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv3 = nn.Conv2d(out_ch, out_ch * self.expansion,
                               kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_ch * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample  # 1×1 conv to match dimensions
    def forward(self, x):
        identity = x  # ← the skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        if self.downsample is not None:
            identity = self.downsample(x)  # resize skip if dims changed
        out += identity  # ← F(x) + x
        return self.relu(out)
# Verify the block works. The block outputs out_ch * expansion = 256 channels,
# so without a downsample layer the input must also have 256 channels.
block = ResidualBlock(256, 64)
x = torch.randn(4, 256, 56, 56)
out = block(x)
print(f"Input shape  : {x.shape}")
print(f"Output shape : {out.shape}")
print(f"Parameters   : {sum(p.numel() for p in block.parameters()):,}")
Transfer Learning — Standing on the Shoulders of ImageNet
Training a ResNet-50 from scratch on ImageNet takes 90 epochs on 8 V100 GPUs — about 4 days and $2,000 in cloud compute. For most projects you have hundreds or thousands of images, not millions. Transfer learning is the solution: take a network already trained on ImageNet, keep its learned feature extractors, and replace only the final classification layer with one specific to your task.
import torch
import torch.nn as nn
from torchvision import models
# ── Strategy 1: Feature Extraction (freeze backbone) ──────────
model_fe = models.resnet50(weights='IMAGENET1K_V2')
# Freeze every parameter in the backbone
for param in model_fe.parameters():
    param.requires_grad = False
# Replace the classification head (in_features=2048 for ResNet-50)
num_classes = 5 # e.g. flower species: daisy, dandelion, rose, sunflower, tulip
model_fe.fc = nn.Sequential(
nn.Dropout(p=0.3),
nn.Linear(model_fe.fc.in_features, num_classes)
)
# Only head parameters will be updated
trainable = sum(p.numel() for p in model_fe.parameters()
if p.requires_grad)
total = sum(p.numel() for p in model_fe.parameters())
print(f"Trainable params : {trainable:,} ({trainable/total*100:.1f}%)")
print(f"Total params : {total:,}")
# ── Strategy 2: Fine-Tuning (unfreeze backbone after head warmup) ─
model_ft = models.resnet50(weights='IMAGENET1K_V2')
model_ft.fc = nn.Linear(model_ft.fc.in_features, num_classes)
# Phase 1: train only head (backbone frozen)
for param in model_ft.parameters(): param.requires_grad = False
model_ft.fc.requires_grad_(True)
opt_phase1 = torch.optim.Adam(model_ft.fc.parameters(), lr=1e-3)
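# Phase 2 (after 10 epochs): unfreeze all, low lr
for param in model_ft.parameters(): param.requires_grad = True
# The stem (conv1 + bn1) gets its own group so every unfrozen parameter is actually optimised
stem_params = list(model_ft.conv1.parameters()) + list(model_ft.bn1.parameters())
opt_phase2 = torch.optim.Adam([
    {'params': stem_params,                   'lr': 1e-5},
    {'params': model_ft.layer1.parameters(), 'lr': 1e-5},
    {'params': model_ft.layer2.parameters(), 'lr': 1e-5},
    {'params': model_ft.layer3.parameters(), 'lr': 3e-5},
    {'params': model_ft.layer4.parameters(), 'lr': 1e-4},
    {'params': model_ft.fc.parameters(),     'lr': 1e-3},
])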
print(f"\nPhase-2 discriminative LR configured.")
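The trainable share printed by Strategy 1 can be predicted by hand. The sketch below assumes a rounded figure of about 23.5 million parameters for the ResNet-50 backbone without its final layer (an approximation, not read from the model) and counts the new head exactly: a linear layer has in_features × out_features weights plus out_features biases, and dropout adds none.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

BACKBONE_PARAMS = 23_500_000    # assumption: approximate ResNet-50 backbone size
head = linear_params(2048, 5)   # new 5-class head on 2048-d features
total = BACKBONE_PARAMS + head

print(f"Head params     : {head:,}")
print(f"Trainable share : {head / total * 100:.3f}%")
```

With only a few hundredths of a percent of the weights trainable, a one-decimal format such as `.1f` will display the share as 0.0%, which is expected rather than a bug.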
The Full Training Loop — Every Moving Part
import torch
import torch.nn as nn
from torchvision import models, datasets, transforms
from torch.utils.data import DataLoader
import time
# ── Data ─────────────────────────────────────────────────────────
train_ds = datasets.ImageFolder('data/train', transform=train_transform)
val_ds = datasets.ImageFolder('data/val', transform=val_transform)
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True,
num_workers=4, pin_memory=True)
val_dl = DataLoader(val_ds, batch_size=128, shuffle=False,
num_workers=4, pin_memory=True)
print(f"Classes : {train_ds.classes}")
print(f"Train : {len(train_ds)} images | Val: {len(val_ds)} images")
# ── Model, loss, optimiser, scheduler ────────────────────────────
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = models.resnet50(weights='IMAGENET1K_V2')
model.fc = nn.Linear(model.fc.in_features, len(train_ds.classes))
model = model.to(device)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimiser = torch.optim.AdamW(model.parameters(), lr=3e-4,
weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimiser, T_max=30, eta_min=1e-6)
def run_epoch(loader, training=True):
    model.train(training)  # train(True) = training mode, train(False) = eval mode
    total_loss, correct, n = 0.0, 0, 0
    ctx = torch.enable_grad() if training else torch.no_grad()
    with ctx:
        for imgs, labels in loader:
            imgs, labels = imgs.to(device), labels.to(device)
            logits = model(imgs)
            loss = criterion(logits, labels)
            if training:
                optimiser.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                optimiser.step()
            total_loss += loss.item() * imgs.size(0)
            correct += (logits.argmax(1) == labels).sum().item()
            n += imgs.size(0)
    return total_loss / n, correct / n
# ── Training loop ─────────────────────────────────────────────────
best_val_acc = 0.0
for epoch in range(1, 31):
    t0 = time.time()
    tr_loss, tr_acc = run_epoch(train_dl, training=True)
    vl_loss, vl_acc = run_epoch(val_dl, training=False)
    scheduler.step()
    elapsed = time.time() - t0
    if vl_acc > best_val_acc:
        best_val_acc = vl_acc
        torch.save(model.state_dict(), 'best_model.pth')
    print(f"Ep {epoch:02d} | tr_loss={tr_loss:.3f} tr_acc={tr_acc:.3f} | "
          f"vl_loss={vl_loss:.3f} vl_acc={vl_acc:.3f} | {elapsed:.0f}s")
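The learning-rate curve produced by the cosine scheduler in the listing above can be reproduced with a few lines of plain Python. This is a sketch of the closed-form cosine annealing formula using the same values as the listing (base rate 3e-4, floor 1e-6, 30 epochs). The function name `cosine_lr` is illustrative and not a PyTorch API.

```python
import math

def cosine_lr(epoch, base_lr=3e-4, eta_min=1e-6, t_max=30):
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

schedule = [cosine_lr(e) for e in range(31)]
for epoch in (0, 10, 15, 20, 30):
    print(f"epoch {epoch:2d}: lr = {schedule[epoch]:.2e}")
```

The rate falls slowly at first, fastest around the midpoint, and flattens out again near the floor, which gives the model large steps early and small refinements at the end.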
Data Augmentation — More Data for Free
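In this example, a model was trained on just 500 original X-rays, with a fresh random augmentation applied each time an image was loaded. By the end of training it had effectively "seen" over 80,000 images, all derived from those 500 originals. The resulting model generalised to unseen patient X-rays at 92% sensitivity, which is clinically useful. Augmentation is not a trick. It is the primary tool against overfitting when data is scarce.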
import torch
from torchvision import transforms
import numpy as np
# ── Modern augmentation stack (state-of-the-art 2024) ─────────
strong_transform = transforms.Compose([
transforms.RandomResizedCrop(224, scale=(0.08, 1.0)), # inception crop
transforms.RandomHorizontalFlip(),
transforms.TrivialAugmentWide(), # state-of-the-art auto-augment
transforms.ToTensor(),
transforms.Normalize([0.485,0.456,0.406], [0.229,0.224,0.225]),
transforms.RandomErasing(p=0.25) # CutOut — applied to tensor
])
# ── MixUp implementation ───────────────────────────────────────
def mixup_batch(imgs, labels, num_classes, alpha=0.4):
    """Apply MixUp to a batch. Returns mixed images and soft labels."""
    lam = float(np.random.beta(alpha, alpha))
    # Create the permutation on the same device as the batch (CPU or GPU)
    idx = torch.randperm(imgs.size(0), device=imgs.device)
    mixed = lam * imgs + (1 - lam) * imgs[idx]
    # One-hot encode labels then mix
    y_a = torch.nn.functional.one_hot(labels, num_classes).float()
    y_b = y_a[idx]
    mixed_labels = lam * y_a + (1 - lam) * y_b
    return mixed, mixed_labels
# ── Usage in training loop with soft labels ────────────────────
def soft_cross_entropy(pred, soft_targets):
    """Cross-entropy that accepts soft (non-integer) target labels."""
    log_prob = torch.nn.functional.log_softmax(pred, dim=1)
    return -(soft_targets * log_prob).sum(dim=1).mean()
# Simulate one training step with MixUp
imgs = torch.randn(32, 3, 224, 224)
labels = torch.randint(0, 5, (32,))
mixed_imgs, mixed_labels = mixup_batch(imgs, labels, num_classes=5)
print(f"Mixed images shape : {mixed_imgs.shape}")
print(f"Soft label sample : {mixed_labels[0].numpy().round(3)}")
Evaluation Metrics — Beyond Simple Accuracy
On a balanced 10-class dataset, 90% accuracy sounds great. On a medical dataset where 99% of samples are "healthy" and 1% are "cancerous," a model that always predicts "healthy" achieves 99% accuracy — and misses every single cancer case. Accuracy is a dangerous single number when classes are imbalanced.
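The imbalance trap takes only a few lines to reproduce. The sketch below builds a hypothetical set of 1,000 patients with the 99-to-1 split described above and scores a "model" that always predicts healthy. The numbers are synthetic and chosen only to match the example.

```python
# 1,000 hypothetical patients: 990 healthy (0), 10 cancerous (1)
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000            # a "model" that always predicts healthy

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
false_neg = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = true_pos / (true_pos + false_neg)   # share of cancers actually caught

print(f"accuracy = {accuracy:.2f}, cancer recall = {recall:.2f}")
```

Accuracy comes out at 0.99 while recall on the cancer class is 0.00, which is why per-class metrics belong next to any headline accuracy figure.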
import torch
import numpy as np
from sklearn.metrics import (classification_report, confusion_matrix,
roc_auc_score, top_k_accuracy_score)
def evaluate_model(model, loader, device, class_names):
    model.eval()
    all_labels, all_probs = [], []
    with torch.no_grad():
        for imgs, labels in loader:
            imgs = imgs.to(device)
            logits = model(imgs)
            probs = torch.softmax(logits, dim=1).cpu()
            all_probs.append(probs)
            all_labels.append(labels)
    all_probs = torch.cat(all_probs).numpy()    # (N, C) float
    all_labels = torch.cat(all_labels).numpy()  # (N,) int
    all_preds = all_probs.argmax(axis=1)
    # Top-1 and Top-3 accuracy
    top1 = top_k_accuracy_score(all_labels, all_probs, k=1)
    top3 = top_k_accuracy_score(all_labels, all_probs, k=3)
    # Per-class precision / recall / F1
    report = classification_report(all_labels, all_preds,
                                   target_names=class_names, digits=3)
    # Macro AUC (OvR strategy for multiclass)
    auc = roc_auc_score(all_labels, all_probs, multi_class='ovr',
                        average='macro')
    # Confusion matrix
    cm = confusion_matrix(all_labels, all_preds)
    print(f"Top-1 Acc : {top1:.4f} | Top-3 Acc : {top3:.4f} | AUC : {auc:.4f}")
    print(f"\n{report}")
    return cm
# Simulated output for 5-class flower dataset
print("Top-1 Acc : 0.921 | Top-3 Acc : 0.984 | AUC : 0.987")
Vision Transformers — Attention Is All You Need (For Images Too)
The trick: split an image into non-overlapping 16×16 patches. Flatten each patch into a 1D vector. Treat those vectors exactly like tokens in a sentence. Feed them through a standard transformer encoder. Let self-attention learn which patches "attend to" which other patches — relationships that can span the entire image, something CNNs struggle with without very deep stacking.
With enough data and compute, ViT obliterated the competition. With modest data it underperforms ResNets — it lacks the helpful inductive biases. This tension between learning everything from scratch versus building in known structure is the central debate of modern computer vision.
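The patch-splitting step described above is mostly a reshape. As a minimal NumPy sketch, assuming a 224×224 RGB image in height-width-channel layout, the function below cuts the image into non-overlapping 16×16 patches and flattens each one. The name `patchify` is illustrative. In the real model each flattened patch then goes through a learned linear projection and receives a position embedding, and those steps are omitted here.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patch vectors."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of the patch size")
    rows, cols = h // patch, w // patch
    patches = image.reshape(rows, patch, cols, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4)        # (rows, cols, patch, patch, c)
    return patches.reshape(rows * cols, patch * patch * c)

image = np.random.rand(224, 224, 3)
tokens = patchify(image)
print(tokens.shape)   # (196, 768): 14 x 14 patches, each 16 * 16 * 3 values
```

A 224×224 image therefore becomes a "sentence" of 196 tokens, and self-attention can relate any patch to any other in a single layer.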
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights
# Load ViT-B/16 with torchvision's IMAGENET1K_V1 weights (trained on ImageNet-1k)
weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights)
model.eval()
# Inspect architecture
print("Patch size :", model.patch_size) # 16
print("Hidden dim :", model.hidden_dim) # 768
print("Num heads :", model.encoder.layers[0].num_heads) # 12
print("Num layers :", len(model.encoder.layers)) # 12
print("Total params:",
f"{sum(p.numel() for p in model.parameters()):,}") # 86M
# Replace head for 5-class task
import torch.nn as nn
in_features = model.heads.head.in_features # 768
model.heads.head = nn.Linear(in_features, 5)
# Inference on a batch
batch = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    out = model(batch)
print(f"\nInput : {batch.shape}")
print(f"Output : {out.shape}") # (4, 5) — one logit per class per image
CLIP — Zero-Shot Classification with Text Prompts
OpenAI's CLIP (Contrastive Language-Image Pretraining, 2021) changed the rules. Trained on 400 million (image, caption) pairs from the web, CLIP learns a joint embedding space where similar images and texts are pulled together. At inference, you describe your classes in natural language — no labelled images required.
Encode your image with CLIP's visual encoder → a 512-d vector. Encode each class name as "a photo of a {class}" with the text encoder → 512-d vectors. Compute cosine similarity between the image vector and each class vector. The class with highest similarity is the prediction. No fine-tuning, no labelled examples — just natural language descriptions.
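The similarity-and-softmax step can be tried without downloading CLIP. The sketch below uses made-up 4-dimensional vectors in place of the real 512-d embeddings and an assumed scale factor of 100 applied to the cosine similarities before the softmax. The function name `zero_shot_probs` and all vector values are illustrative.

```python
import numpy as np

def zero_shot_probs(image_vec, text_vecs, scale=100.0):
    """Cosine similarity between one image vector and each text vector, then softmax."""
    image_vec = image_vec / np.linalg.norm(image_vec)
    text_vecs = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    logits = scale * text_vecs @ image_vec
    logits = logits - logits.max()            # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Made-up 4-d embeddings standing in for CLIP's 512-d vectors
class_names = ["daisy", "rose", "tulip"]
text_vecs = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0]])
image_vec = np.array([0.2, 0.9, 0.1, 0.3])

probs = zero_shot_probs(image_vec, text_vecs)
predicted = class_names[int(probs.argmax())]
print(predicted, probs.round(3))
```

Because cosine similarities live in [-1, 1], the scale factor is what turns small differences in similarity into a confident probability distribution.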
# pip install git+https://github.com/openai/CLIP.git   (official repository)
import clip
import torch
from PIL import Image
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load CLIP ViT-B/32
model, preprocess = clip.load("ViT-B/32", device=device)
# Load and preprocess image
image = preprocess(Image.open("mystery_flower.jpg")).unsqueeze(0).to(device)
# Define class labels as natural language prompts
class_names = ["daisy", "dandelion", "rose", "sunflower", "tulip"]
text_prompts = clip.tokenize(
[f"a photograph of a {c} flower" for c in class_names]
).to(device)
with torch.no_grad():
    img_features = model.encode_image(image)
    txt_features = model.encode_text(text_prompts)
    # Normalise to unit sphere
    img_features /= img_features.norm(dim=-1, keepdim=True)
    txt_features /= txt_features.norm(dim=-1, keepdim=True)
    # Cosine similarity → probabilities
    similarity = (100.0 * img_features @ txt_features.T)
    probs = similarity.softmax(dim=-1)[0]
print("Zero-Shot CLIP Predictions (no fine-tuning):")
for cls, prob in zip(class_names, probs):
    bar = "█" * int(prob.item() * 40)
    print(f"  {cls:12s}: {prob.item()*100:5.1f}%  {bar}")
Full Pipeline Comparison — Choosing Your Approach
| Approach | Data Needed | GPU Required | Typical Accuracy | Training Time | Best For |
|---|---|---|---|---|---|
| HOG + SVM | ~500 images | No | 65–78% | Minutes | Extremely constrained environments, baseline |
| Transfer Learning (frozen) | ~200–2k images | Optional | 82–91% | 5–20 min | Small datasets, prototyping, domain-close to ImageNet |
| Transfer Learning (fine-tuned) | 1k–50k images | Recommended | 88–95% | 1–4 hours | Most real-world projects — the standard approach |
| EfficientNet-B4 fine-tuned | 5k–100k images | Yes | 93–97% | 4–12 hours | Production systems, Kaggle competitions |
| ViT fine-tuned (21k pretrain) | 10k+ images | Yes (A100+) | 94–98% | 1–3 days | Large datasets, highest accuracy requirement |
| CLIP zero-shot | 0 images | Optional | 60–80% (domain-dependent) | Seconds (no training) | New categories, rapid prototyping, open-world recognition |
Common Failure Modes — What Goes Wrong and Why
Golden Rules
Use label smoothing: nn.CrossEntropyLoss(label_smoothing=0.1). Especially important when your training set is small and overfitting is a concern.
Never rely on accuracy alone: run classification_report and inspect the confusion matrix. You need to know which classes are hard, which are easy, and which are being confused with each other; accuracy tells you none of that.
Look at your augmented images: sample train_ds[i] several times for the same image, undo the normalisation, convert back to uint8 and display. If any augmented version looks unrealistic or wrong (text upside down, faces mirrored nonsensically, colours completely unnatural), it will hurt the model. Eyes catch bad augmentations; loss curves do not.
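For the label-smoothing rule, it helps to see what the loss is actually asked to match. The sketch below assumes the common convention, which is also how PyTorch documents its `label_smoothing` argument, of mixing the one-hot target with a uniform distribution over all classes. The function name `smoothed_target` is illustrative.

```python
def smoothed_target(true_class, num_classes, smoothing=0.1):
    """Mix the one-hot target with a uniform distribution over all classes."""
    uniform = smoothing / num_classes
    target = [uniform] * num_classes
    target[true_class] += 1.0 - smoothing
    return target

target = smoothed_target(true_class=2, num_classes=5)
print([round(t, 2) for t in target])   # [0.02, 0.02, 0.92, 0.02, 0.02]
```

The model is no longer rewarded for pushing the true-class probability all the way to 1.0, which discourages the over-confident logits that tend to accompany overfitting on small datasets.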