Computer Vision Deep Dive: CNNs, Object Detection

Section 01

The Story Behind Computer Vision

📖 Real World Analogy

Teaching a Baby to Recognise a Cat

A newborn has no concept of a "cat." Over months, parents point at cats, say the word, show pictures — the brain builds a model. By age two, the child spots a cat instantly in a crowded photo, even in the dark, even from an unusual angle.

That exact journey — from raw pixel signals to "I see a cat" — is what computer vision recreates in software. We show a neural network millions of images labelled "cat" and "not cat." It learns which combinations of edges, textures, and shapes scream feline. After training, it identifies cats in photos it has never seen — sometimes better than humans.

Computer vision is, at its core, statistical pattern recognition over pixels.

Computer vision (CV) is a field of artificial intelligence that trains computers to interpret and understand visual data — images, video frames, depth maps, point clouds — extracting structured information from them. It powers face unlock on your phone, quality inspection on factory floors, cancer detection in radiology, and self-driving cars navigating city streets at 60 km/h.

👁

Why Vision Is Hard for Machines

A 224×224 colour image contains 150,528 raw numbers (pixels × 3 channels). The same object looks entirely different under different lighting, at different scales, from different angles, partially occluded, or photographed with motion blur. Classical programming cannot enumerate every possible variation — learned representations can.

Section 02

The Visual Cortex — How Humans vs. Machines See

🧠 Human Visual Cortex

Layer	What It Detects
V1 (Primary)	Edges, orientations, contrast
V2	Simple shapes, colour boundaries
V4	Colour, curvature, form
IT (Inferotemporal)	Objects, faces, categories
PFC	Context, memory, attention

🤖 Convolutional Neural Network

Layer	What It Detects
Conv 1–2	Edges, colour gradients
Conv 3–4	Textures, corners, patterns
Conv 5–6	Parts (eyes, wheels, fins)
Fully Connected	Whole objects, classes
Softmax Output	Probability per class

This analogy is no coincidence: early deep learning researchers were inspired by neuroscience. Hubel and Wiesel's cat-cortex experiments (late 1950s and 1960s, recognised with a Nobel Prize in 1981) inspired Fukushima's Neocognitron (1980) and, through it, Yann LeCun's convolutional networks, culminating in LeNet-5 (1998). Biology and AI developed side by side.

💡

The Hierarchy Principle

Both biological and artificial vision systems are hierarchical: low-level features (edges) combine into mid-level features (shapes), which combine into high-level concepts (objects). Each layer in a CNN learns increasingly abstract representations. This is why deep networks outperform shallow ones on vision tasks.

Section 03

The Image as Data — Pixels, Channels, and Tensors

Before any algorithm runs, you need to understand what an image actually is as data.

📷 Anatomy of a Digital Image

Pixel

The smallest unit of an image. A single coloured dot. A 640×480 image contains 307,200 pixels.

Channel

A grayscale image has 1 channel (intensity 0–255). A colour image has 3 channels — Red, Green, Blue (RGB). Medical CT scans may have 1 or more specialised channels.

Tensor

An image is stored as a 3D tensor of shape (H, W, C): Height × Width × Channels. A batch of images is a 4D tensor: (N, H, W, C) in channels-last libraries such as TensorFlow, or (N, C, H, W) in PyTorch.

Dtype

Raw pixels: uint8 (0–255). Neural networks expect: float32 (0.0–1.0 or normalised). Always divide by 255 before feeding a network.

Image Type	Shape	Value Range	Use Case
Grayscale	(H, W, 1)	0–255	OCR, X-rays, depth maps
RGB Colour	(H, W, 3)	0–255 per channel	Natural photos, screenshots
RGBA (with alpha)	(H, W, 4)	0–255	UI elements, transparent PNGs
HSV Colour Space	(H, W, 3)	H:0–179, S:0–255, V:0–255	Colour-based segmentation
Float Tensor	(N, C, H, W)	0.0–1.0 (normalised)	Neural network input

import cv2
import numpy as np
import torch

# Load an image (OpenCV loads as BGR by default)
img_bgr = cv2.imread('photo.jpg')
if img_bgr is None:                  # cv2.imread returns None instead of raising
    raise FileNotFoundError('photo.jpg could not be read')
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)

print(f"Shape: {img_rgb.shape}")     # (H, W, 3)
print(f"Dtype: {img_rgb.dtype}")     # uint8
print(f"Min/Max: {img_rgb.min()}/{img_rgb.max()}")  # within 0/255

# Normalise for neural network input
img_float = img_rgb.astype(np.float32) / 255.0

# Convert to PyTorch tensor (C, H, W)
tensor = torch.from_numpy(img_float).permute(2, 0, 1)
print(f"Tensor shape: {tensor.shape}")  # torch.Size([3, H, W])

OUTPUT

Shape: (480, 640, 3)
Dtype: uint8
Min/Max: 0/255
Tensor shape: torch.Size([3, 480, 640])

Section 04

Core Computer Vision Tasks — The Taxonomy

Computer vision is not a single problem — it is a family of related tasks. Understanding the taxonomy is essential before choosing an architecture or dataset.

🏳️

Image Classification

What is in this image?

Assigns a single label to an entire image. Input: image tensor. Output: class probabilities. Examples: cat vs dog, benign vs malignant tumour.

✓ Simple, fast, well-studied

✕ No location information

🔥

Object Detection

Where are objects and what are they?

Detects multiple objects and returns bounding boxes + class labels + confidence scores. Examples: YOLO, Faster R-CNN. Used in security cameras, autonomous vehicles.

✓ Location + class in one pass

✕ Heavier to train

🎯

Semantic Segmentation

Label every pixel by class

Assigns a class label to every pixel. Two cars become one "car" blob. Examples: DeepLab, SegNet. Used in medical imaging, satellite analysis.

✓ Pixel-precise understanding

✕ No instance separation

🧰

Instance Segmentation

Separate mask per object

Like semantic segmentation but separates individual instances. Car 1 vs Car 2 are distinct masks. Examples: Mask R-CNN, YOLACT. Robotics, AR applications.

✓ Individual object masks

✕ Most complex to train

👨‍💻

Pose Estimation

Detect body / hand keypoints

Locates skeletal keypoints (joints, landmarks). 2D or 3D. Examples: OpenPose, MediaPipe. Used in sports analytics, physiotherapy, AR filters.

✓ Structural understanding

✕ Struggles with occlusion

📺

Optical Flow / Video

How do pixels move over time?

Estimates motion vectors between consecutive frames. Tracks objects, detects anomalies in video. Examples: RAFT, FlowNet. Used in action recognition, surveillance.

✓ Temporal dynamics

✕ Computationally expensive

Section 05

The Convolutional Neural Network — Deep Dive

📖 Story

The Sliding Magnifying Glass

Imagine you are asked to find all phone numbers in a newspaper. You do not read the whole page at once — you run a small window (your mental "filter") across the page, hunting for a specific pattern: digits followed by dashes. When the window matches the pattern, you record the location and move on.

A convolutional layer does exactly this. A small matrix called a kernel (typically 3×3 or 5×5) slides across the entire image, computing a dot product at each position. If a 3×3 kernel is tuned to detect vertical edges, it fires strongly wherever vertical edges appear. A bank of 64 different kernels detects 64 different patterns — simultaneously. That is one convolutional layer. Stack many of them and you build a feature hierarchy.
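The sliding-window idea above can be sketched in a few lines of NumPy. This is an illustrative toy, not how frameworks implement it: the helper name `conv2d_valid`, the 5×6 test image, and the Sobel-style kernel are all assumptions chosen for the example. Note that what deep learning libraries call "convolution" is technically cross-correlation (the kernel is not flipped), and that is what this sketch computes.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding), taking a dot product at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]   # the patch currently under the kernel
            out[i, j] = np.sum(window * kernel)  # element-wise product, then sum
    return out

# Toy grayscale image: dark left half, bright right half, so one vertical edge
image = np.zeros((5, 6), dtype=np.float32)
image[:, 3:] = 1.0

# Sobel-style kernel that responds to dark-to-bright transitions from left to right
vertical_edge = np.array([[-1, 0, 1],
                          [-2, 0, 2],
                          [-1, 0, 1]], dtype=np.float32)

response = conv2d_valid(image, vertical_edge)
print(response.shape)   # (3, 4)
print(response[0])      # [0. 4. 4. 0.]
```

The response is zero over the flat regions and peaks where the window straddles the edge, which is the "firing" behaviour described above. A real layer learns the kernel values instead of hard-coding them.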

Input Layer

Raw image tensor enters as shape (N, 3, 224, 224). Pixels normalised to [0,1] and optionally standardised using ImageNet mean/std: μ = [0.485, 0.456, 0.406], σ = [0.229, 0.224, 0.225].

Convolutional Layer (Conv2d)

Applies K learnable kernels of size (k×k) via sliding window. Output shape: (N, K, H', W'). Learns local spatial patterns. Parameters: kernel weights + bias. Uses shared weights → massive parameter efficiency vs. fully connected.

Batch Normalisation

Normalises activations across the batch dimension after each conv layer. Stabilises training, allows higher learning rates, acts as mild regularisation. Almost always used in modern CNNs between Conv and ReLU.

Activation (ReLU / GELU)

ReLU: max(0, x) — kills negative activations, introduces non-linearity. Without activation functions, stacking linear layers is still just one linear transformation. GELU is preferred in transformers; ReLU remains dominant in CNNs.

Pooling (MaxPool / AvgPool)

Reduces spatial dimensions. MaxPool(2×2) halves H and W, retaining the strongest activation in each 2×2 block. Provides a degree of translation invariance: a slightly shifted cat still activates the same cat features. Modern networks sometimes use strided convolutions instead.

Flatten + Fully Connected

After several conv+pool blocks, the spatial feature map is flattened into a 1D vector. Fully connected (Linear) layers combine all learned features into class scores. Dropout is applied here to prevent overfitting.

Output + Loss

Softmax converts raw logits to probabilities summing to 1. Cross-entropy loss measures the difference from the one-hot ground truth. In PyTorch, nn.CrossEntropyLoss applies log-softmax internally, so the model should output raw logits. During backpropagation, gradients flow back through all layers, and the optimiser (Adam / SGD) uses them to update every weight.
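The shape bookkeeping in the layer descriptions above follows one standard formula: output size = floor((input + 2·padding − kernel) / stride) + 1. The sketch below applies it to a 32×32 input passed through three blocks of 3×3 convolution (padding 1) followed by 2×2 max-pooling. The helper names `conv_output_size` and `conv_params` are illustrative, not part of any library.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_params(in_channels, out_channels, kernel):
    """Learnable parameters in a Conv2d layer: shared kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * out_channels + out_channels

size = 32
for block in range(3):
    size = conv_output_size(size, kernel=3, stride=1, padding=1)  # 3x3 conv, padding 1: size unchanged
    size = conv_output_size(size, kernel=2, stride=2)             # 2x2 max-pool: size halved
    print(f"after block {block + 1}: {size}x{size}")
# after block 1: 16x16
# after block 2: 8x8
# after block 3: 4x4

print(conv_params(3, 32, 3))   # 896
```

The parameter count does not depend on the image size, which is the weight-sharing efficiency mentioned in the convolutional layer description.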

Section 06

Building a CNN from Scratch in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    '''
    A clean 3-block CNN for 32×32 images (e.g. CIFAR-10, 10 classes).
    Architecture: [Conv→BN→ReLU→Pool] × 3 → Flatten → FC → FC
    '''
    def __init__(self, num_classes=10):
        super().__init__()

        # Block 1: 3 → 32 channels, 32×32 → 16×16
        self.block1 = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2)            # 32×32 → 16×16
        )

        # Block 2: 32 → 64 channels, 16×16 → 8×8
        self.block2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2)            # 16×16 → 8×8
        )

        # Block 3: 64 → 128 channels, 8×8 → 4×4
        self.block3 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2)            # 8×8 → 4×4
        )

        # Classifier head
        self.classifier = nn.Sequential(
            nn.Flatten(),              # 128×4×4 = 2048
            nn.Linear(2048, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        return self.classifier(x)

# Sanity-check with a random batch
model = SimpleCNN(num_classes=10)
dummy = torch.randn(4, 3, 32, 32)   # batch of 4 CIFAR images
logits = model(dummy)
print(f"Output shape: {logits.shape}")   # (4, 10)

total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")

OUTPUT

Output shape: torch.Size([4, 10])
Total parameters: 1,147,914

Section 07

Landmark CNN Architectures — The Evolution

Architecture	Year	Depth	Top-1 (ImageNet)	Key Innovation
LeNet-5	1998	5 layers	~98% on MNIST	First practical CNN; convolution + pooling concept
AlexNet	2012	8 layers	63.3%	GPU training, ReLU, Dropout — started the deep learning revolution
VGGNet-16	2014	16 layers	71.5%	Very deep with only 3×3 convolutions; elegant simplicity
GoogLeNet	2014	22 layers	74.8%	Inception modules (parallel convolutions), 12× fewer parameters than AlexNet
ResNet-50	2015	50 layers	76.1%	Residual skip connections — solved vanishing gradient for 100+ layers
DenseNet	2017	121–264	77.4%	Each layer receives features from ALL previous layers
EfficientNet-B7	2019	—	84.4%	Neural architecture search; compound scaling of width/depth/resolution
Vision Transformer (ViT)	2021	—	88.5%	Transformer (attention) applied to image patches — no convolution at all
ConvNeXt	2022	—	87.8%	Pure CNN redesigned with transformer design principles

🧐

The Residual Skip Connection — The Most Important Idea in Deep Learning

ResNet introduced skip connections: instead of learning the desired mapping H(x) directly, a block learns the residual F(x) = H(x) − x and outputs F(x) + x. If the optimal mapping is close to the identity, the block only needs to learn F(x) ≈ 0, which is much easier. Gradients also flow back directly through the skip path, which mitigates the vanishing gradient problem. Before this, very deep plain networks were hard to optimise: adding layers beyond a few dozen often made training error worse.
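A minimal sketch of a residual block in PyTorch, assuming equal input and output channels so the skip path needs no projection. The class name `ResidualBlock` and the layer sizes are illustrative; real ResNet blocks also handle stride and channel changes.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x), with F = conv-BN-ReLU-conv-BN."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The skip connection adds the unmodified input back onto the learned residual
        return self.relu(self.body(x) + x)

block = ResidualBlock(16)
x = torch.randn(2, 16, 8, 8)
out = block(x)
print(out.shape)   # torch.Size([2, 16, 8, 8])
```

If the body's output is driven to zero, the block reduces to ReLU(x), which is why an unneeded block can cheaply fall back to a near-identity mapping.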

Section 08

Transfer Learning — The Greatest Productivity Hack in Vision

📖 Story

The Expert Who Already Knows How to See

A radiologist trained on 500,000 chest X-rays already understands textures, densities, and shapes in medical images. If you then ask them to read bone scans — a slightly different domain — they don't start from scratch. They already know how to interpret visual patterns; they just need to adapt to the new context. Two weeks of bone scan training, not five years.

Transfer learning does exactly this with neural networks. A ResNet-50 trained on ImageNet's 1.2 million images has already learned to detect edges, textures, shapes, and object parts. Freeze those learned representations, add a new classification head for your 200-image dataset, fine-tune for a few epochs, and you can often reach high accuracy on a problem where training from scratch on so little data would overfit badly.

🔒 Feature Extraction

Frozen backbone

Freeze ALL pretrained weights. Only train the new classification head you added. Use when your dataset is tiny (<500 images) or very similar to the source domain.

✓ Fast, low risk of overfitting, minimal GPU needed

✕ Backbone cannot adapt to domain differences

🔧 Fine-Tuning

Unfreeze upper layers

Unfreeze the last few conv blocks and train end-to-end with a very small learning rate (1e-4 to 1e-5). Use when you have >1,000 images and a different but related domain.

✓ Backbone adapts to your data; best accuracy

✕ Can overfit if dataset is tiny; slower training

⚡ Full Fine-Tuning

All layers trainable

Unfreeze everything. Train the entire network with a discriminative learning rate (lower LR for early layers, higher for later layers). Requires large, labelled domain-specific dataset (>10k images).

✓ Maximum flexibility and accuracy

✕ Risk of catastrophic forgetting; expensive

import torch
import torch.nn as nn
from torchvision import models, transforms
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# ── 1. Load pretrained ResNet50 ────────────────────────────────
backbone = models.resnet50(weights='IMAGENET1K_V2')

# ── 2. Freeze ALL backbone parameters ─────────────────────────
for param in backbone.parameters():
    param.requires_grad = False

# ── 3. Replace the final FC layer for your task (5 classes) ───
in_feats = backbone.fc.in_features           # 2048
backbone.fc = nn.Sequential(
    nn.Dropout(0.4),
    nn.Linear(in_feats, 256),
    nn.ReLU(),
    nn.Linear(256, 5)                   # 5 custom classes
)

# Only the new head has gradients
trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
print(f"Trainable params: {trainable:,}")   # 525,829

# ── 4. ImageNet normalisation for pretrained models ────────────
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

# ── 5. Training loop ──────────────────────────────────────────
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = backbone.to(device)
# Pass only the trainable (unfrozen) parameters to the optimiser
optimiser = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=3e-4)
criterion = nn.CrossEntropyLoss()

# Assume train_loader is a DataLoader over your image folder
model.train()
for epoch in range(10):
    for imgs, labels in train_loader:
        imgs, labels = imgs.to(device), labels.to(device)
        optimiser.zero_grad()
        loss = criterion(model(imgs), labels)
        loss.backward()
        optimiser.step()

🏆

The Rule of Thumb for Transfer Learning

Dataset small + similar domain → Feature extraction only.
Dataset medium + different domain → Unfreeze last 2–3 blocks, use LR 1e-4.
Dataset large + very different domain → Full fine-tune with discriminative LRs.
Always use ImageNet normalisation statistics when loading pretrained torchvision models. Never skip it.

Section 09

Data Augmentation — Making One Image into Many

A model that only ever sees one orientation of a cat will fail on an upside-down cat. Data augmentation artificially expands your training set by applying random transformations at training time. At inference time, no augmentation is applied — only normalisation.

🔄

Geometric

Flip, Rotate, Crop, Scale

Random horizontal flip (50% chance), rotation ±15°, random crop of 80–100% of image area. Teaches invariance to orientation and scale.

🌞

Colour Jitter

Brightness, Contrast, Hue

Randomly perturbs brightness ±30%, contrast ±30%, saturation ±30%, hue ±10%. Teaches invariance to lighting conditions and camera settings.

🧿

CutOut / RandomErasing

Mask a rectangle to zero

Randomly zeros out a rectangular patch (10–33% of image area). Forces the network to use context rather than single discriminative patches. Strong regulariser.

🎬

MixUp

Blend two images + labels

Creates synthetic training samples: img = λ·img1 + (1-λ)·img2, label = λ·y1 + (1-λ)·y2. Smooths the decision boundary, significantly reduces overfitting.

🏳

CutMix

Paste a patch from another image

Cuts a rectangle from image B and pastes onto image A. Labels are mixed proportionally to the area of each image visible. Outperforms MixUp on most benchmarks.

🤖

AutoAugment / RandAugment

Learned augmentation policy

AutoAugment uses reinforcement learning to search for augmentation policies per dataset. RandAugment simplifies this by randomly choosing N transforms at magnitude M, which removes the search. Both are widely used in strong ImageNet training recipes.
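Of the techniques above, MixUp is the easiest to sketch directly from its formula. The version below mixes each sample with a randomly chosen partner from the same batch and draws λ from a Beta(α, α) distribution, as in the original MixUp formulation. The function name `mixup`, the value α = 0.2, and the toy batch are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def mixup(images, labels, num_classes, alpha=0.2):
    """Blend every sample (and its one-hot label) with a random partner from the batch."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # mixing coefficient in [0, 1]
    perm = torch.randperm(images.size(0))                         # partner index for each sample
    y = F.one_hot(labels, num_classes).float()
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_labels = lam * y + (1 - lam) * y[perm]                  # soft labels, rows sum to 1
    return mixed_images, mixed_labels

images = torch.rand(8, 3, 32, 32)          # stand-in for a real training batch
labels = torch.randint(0, 10, (8,))
mixed_images, mixed_labels = mixup(images, labels, num_classes=10)
print(mixed_images.shape, mixed_labels.shape)
```

Recent PyTorch versions accept such soft labels as the target of nn.CrossEntropyLoss; on older versions the loss would need to be computed as a λ-weighted sum of two ordinary cross-entropy terms.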

import torch
from torchvision.transforms import v2

# Modern torchvision v2 augmentation pipeline
train_transform = v2.Compose([
    v2.ToImage(),                          # convert PIL/ndarray input to a tensor image first
    v2.RandomResizedCrop(224, scale=(0.6, 1.0)),
    v2.RandomHorizontalFlip(p=0.5),
    v2.ColorJitter(brightness=0.3, contrast=0.3,
                   saturation=0.3, hue=0.1),
    v2.RandomRotation(degrees=15),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406],
                  std=[0.229, 0.224, 0.225]),
    v2.RandomErasing(p=0.25, scale=(0.02, 0.20)),  # works on tensors, so it runs last
])

# Validation: NO augmentation, only resize + normalise
val_transform = v2.Compose([
    v2.Resize((224, 224)),
    v2.ToImage(),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406],
                  std=[0.229, 0.224, 0.225])
])

Section 10

Object Detection — YOLO Architecture Deep Dive

📖 Story

The Security Guard Scanning a Crowd

A security guard doesn't scan a crowd one face at a time. They look at the entire crowd simultaneously, pattern-matching against a mental model of "suspicious behaviour." In one glance, they identify three people of interest and their approximate locations.

YOLO (You Only Look Once) is the computer vision equivalent. Unlike two-stage detectors (which first propose regions, then classify them), YOLO makes a single forward pass through the network and directly predicts all bounding boxes and class labels simultaneously. This is why it runs at 30–100+ FPS on modern hardware — fast enough for real-time video.

🧐 How YOLO Detection Works — Step by Step

Grid

Image is divided into an S×S grid (e.g. 13×13 for 416×416 input). Each cell is responsible for detecting objects whose centre falls within it.

Anchors

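Each cell predicts B bounding boxes relative to predefined anchor priors (YOLOv2 onwards). Anchors are pre-clustered from training set bounding box shapes using k-means; the network learns offsets to them, not the anchors themselves. Each prediction: (x, y, w, h, confidence). Newer versions such as YOLOv8 use an anchor-free head and predict boxes directly.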

Prediction

For each anchor: 5 values (tx, ty, tw, th, objectness) + C class probabilities. Total output tensor: S × S × B × (5 + C).

IoU

Intersection over Union: measures overlap between predicted and ground-truth boxes. IoU = Area(Intersection) / Area(Union). IoU ≥ 0.5 = true positive. Used in NMS and loss calculation.

NMS

Non-Maximum Suppression: multiple anchors often detect the same object. NMS keeps the highest-confidence box for a class, suppresses every other box of that class whose IoU with it exceeds a threshold (typically 0.45), and repeats with the boxes that remain.
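The IoU and NMS steps above can be sketched in plain Python. This is a simple greedy, single-class version for illustration; production detectors use vectorised implementations. The function names, the [x1, y1, x2, y2] box format, and the three example boxes are assumptions.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # zero if the boxes do not overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS for one class; returns indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)            # highest-confidence box still in play
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [[100, 100, 200, 200], [110, 110, 210, 210], [300, 300, 400, 400]]
scores = [0.90, 0.75, 0.80]

print(round(iou(boxes[0], boxes[1]), 3))   # 0.681
print(nms(boxes, scores))                  # [0, 2]
```

The second box overlaps the first with IoU ≈ 0.68, above the 0.45 threshold, so it is suppressed; the third box does not overlap and survives.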

from ultralytics import YOLO

# Load pretrained YOLOv8 nano (fastest model)
model = YOLO('yolov8n.pt')

# ── Inference on a single image ────────────────────────────────
results = model.predict(
    source='street.jpg',
    conf=0.45,      # confidence threshold
    iou=0.45,       # NMS IoU threshold
    device='cuda'
)

for r in results:
    for box in r.boxes:
        cls   = r.names[int(box.cls)]
        conf  = float(box.conf)
        xyxy  = box.xyxy[0].tolist()     # [x1,y1,x2,y2]
        print(f"{cls}: {conf:.2f} @ {[int(v) for v in xyxy]}")

# ── Fine-tune on custom dataset ────────────────────────────────
# Requires dataset.yaml defining: path, train, val, nc, names
model.train(
    data='dataset.yaml',
    epochs=50,
    imgsz=640,
    batch=16,
    lr0=0.01,
    patience=10,    # early stopping
    device='0'
)

# ── Evaluate — prints mAP50, mAP50-95 ─────────────────────────
metrics = model.val()
print(f"mAP50: {metrics.box.map50:.3f}")

OUTPUT — INFERENCE

person: 0.91 @ [134, 210, 287, 498]
car: 0.87 @ [45, 310, 390, 480]
car: 0.72 @ [510, 295, 630, 460]
bicycle: 0.68 @ [205, 370, 295, 495]

Section 11

Evaluation Metrics for Vision Models

Metric	Formula	Task	Notes
Accuracy	(TP+TN) / Total	Classification	Misleading on imbalanced datasets
Precision	TP / (TP+FP)	Classification / Detection	How many positive predictions were correct?
Recall	TP / (TP+FN)	Classification / Detection	How many actual positives did we find?
F1 Score	2·P·R / (P+R)	Classification	Balances precision and recall — preferred for imbalanced classes
IoU	Intersect / Union	Detection / Segmentation	Measures box overlap. IoU ≥ 0.5 = correct detection
mAP@50	mean AP over classes	Object Detection	Primary detection metric. AP = area under Precision-Recall curve at IoU=0.5
mAP@50-95	mean over IoU 0.5:0.05:0.95	Object Detection	Stricter metric used in COCO benchmark — harder to game
mIoU	mean IoU over classes	Segmentation	Primary segmentation metric. Averages per-class IoU across all semantic classes
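As a small worked example of the mIoU row above, the sketch below computes per-class IoU between a predicted and a ground-truth label map and averages the result. The helper name `mean_iou` and the 2×4 toy label maps are assumptions; skipping classes absent from both maps is one common convention, not the only one.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Per-class IoU and their mean for two integer label maps of equal shape."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue   # class absent from both maps: skipped in this sketch
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(float(inter / union))
    return ious, float(np.mean(ious))

target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred = np.array([[0, 0, 0, 1],      # one class-1 pixel mislabelled as class 0
                 [0, 0, 1, 1]])

per_class, miou = mean_iou(pred, target, num_classes=2)
print(per_class)          # [0.8, 0.75]
print(round(miou, 3))     # 0.775
```

A single wrong pixel lowers the IoU of both classes, because it counts as a false positive for one and a false negative for the other.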

⚠️

Never Report Accuracy Alone on Vision Tasks

A skin cancer classifier on a dataset with 95% benign samples achieves 95% accuracy by predicting "benign" every time — while missing every single cancer. Always report F1, AUC-ROC, and per-class precision/recall for imbalanced classification tasks. For detection, mAP50 is the minimum you must report.
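The warning above is easy to reproduce numerically. The sketch below builds a hypothetical dataset of 1,000 samples with 95% benign labels and scores a "model" that always predicts benign, using the formulas from the metrics table.

```python
# Hypothetical data: 950 benign (0) and 50 malignant (1) samples
y_true = [0] * 950 + [1] * 50
y_pred = [0] * 1000            # a classifier that always answers "benign"

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0   # undefined without positive predictions; 0 by convention here
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.95 recall=0.00 f1=0.00
```

Accuracy looks excellent while recall and F1 on the malignant class are zero, which is exactly the failure the single number hides.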

Section 12

Vision Transformers (ViT) — Attention Meets Images

📖 Story

The Art Critic Who Reads the Whole Painting at Once

A convolutional network examines an image locally — it sees small patches first, then gradually integrates them. Like reading a sentence word by word, never jumping ahead.

A Vision Transformer works differently. It cuts the image into 16×16 pixel patches, treats each patch as a "word," and runs the transformer's attention mechanism over all patches simultaneously. Every patch can directly attend to every other patch in a single layer. A patch showing an eye can immediately attend to the patch showing a nose across the image — no need to wait for information to propagate through spatial layers. This global receptive field from layer one is ViT's fundamental advantage over CNNs.

🔭 ViT Architecture — Patch by Patch

Patch

Image (224×224) is split into P×P non-overlapping patches. Default P=16 → 196 patches, each 16×16×3 = 768-dim vector.

Embed

Each patch flattened and projected via a linear layer to embedding dim D (e.g. 768). A learnable [CLS] token is prepended — its final representation is used for classification.

Pos. Enc.

Learnable positional embeddings are added to each patch embedding. Without this, the model has no notion of spatial order — patch at (0,0) looks identical to patch at (10,10).

Attention

Multi-head self-attention: each patch queries all other patches. The attention weights show which other patches a given patch draws information from. Attention maps can be visualised, though they are only a partial explanation of the model's decision.

Classify

After L transformer blocks, the [CLS] token embedding is fed through an MLP head → class probabilities. The largest model in the original paper, ViT-H/14 pretrained on JFT-300M, reaches 88.5% top-1 on ImageNet.
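The Patch, Embed, and Pos. Enc. steps above can be sketched as a single PyTorch module. This is a simplified illustration under the sizes quoted in the text (224×224 input, P = 16, D = 768); the class name `PatchEmbedding` is an assumption, and zero initialisation of the learnable tokens is used only to keep the sketch short.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Turn an image into a sequence of patch tokens with a [CLS] token and positions."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # 14 * 14 = 196
        # A conv with kernel = stride = patch size is equivalent to flattening each
        # patch and applying one shared linear projection
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):
        x = self.proj(x)                                 # (N, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)                 # (N, 196, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)   # one [CLS] token per image
        x = torch.cat([cls, x], dim=1)                   # (N, 197, dim)
        return x + self.pos_embed                        # add learnable positions

embed = PatchEmbedding()
tokens = embed(torch.randn(2, 3, 224, 224))
print(tokens.shape)   # torch.Size([2, 197, 768])
```

The resulting sequence of 197 tokens is what the stack of transformer blocks operates on; the first token is the one read out for classification.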

📊

CNN vs. ViT — When to Use Which

Use CNNs when: your dataset is small (<50k images), you need fast inference on edge devices, or strong inductive biases (locality, translation invariance) help your task.

Use ViTs when: you have large datasets (>1M images or large pretrained checkpoints like CLIP/DINO), you need global context reasoning, or you're building multimodal systems (vision + language). Hybrid models (CNN stem + transformer body) like CvT often give the best of both worlds.

Section 13

Real-World CV Pipeline — End to End

Data Collection & Labelling

Define your classes precisely. Collect raw images from multiple sources (web scraping, cameras, open datasets). Use labelling tools: LabelImg (bounding boxes), LabelMe (polygons), Roboflow (managed pipeline). Target ≥500 images per class for classification, ≥300 annotated instances per class for detection.

Exploratory Data Analysis

Check class balance (imbalance → use weighted loss or oversampling). Verify image quality — blur, resolution, exposure. Examine label quality — incorrect or inconsistent labels destroy model performance. Plot image size distribution; standardise via resizing strategy.

Preprocessing & Augmentation

Resize to model input dimensions. Apply training augmentation (geometric + colour). Build reproducible DataLoader with fixed random seed. Use torchvision v2 transforms or Albumentations for advanced augmentation including bounding-box-aware transforms.

Model Selection & Baseline

Start with the smallest pretrained model that fits your constraints (ResNet-18 or MobileNetV3 for edge, EfficientNet-B3 for accuracy). Get a working baseline first — simple is fast to iterate. Record baseline mAP or accuracy before any hyperparameter tuning.

Training & Monitoring

Log loss curves, val accuracy, and sample predictions to TensorBoard or Weights & Biases. Watch for: divergence (LR too high), underfitting (model too small), overfitting (train loss ↓ but val loss ↑). Use learning rate schedulers (CosineAnnealing or OneCycleLR).

Evaluation & Error Analysis

Compute full metric suite: confusion matrix, per-class F1, mAP. Visualise worst predictions — look for systematic errors (certain lighting, angle, background). Use GradCAM to see what the model attends to. Fix data issues before tuning hyperparameters.

Deployment & Optimisation

Export to ONNX or TorchScript for production. Apply post-training quantisation (INT8), which often gives a 2–4× speedup with a small accuracy loss (frequently under 1%, but verify on your own validation set). Use TensorRT for NVIDIA GPU deployment. Profile inference latency on target hardware. Set up monitoring for distribution shift in production.

Section 14

Model Interpretability — GradCAM Explained

One common criticism of deep learning is the "black box" problem. GradCAM (Gradient-weighted Class Activation Mapping) addresses this for CNNs by visualising which spatial regions contributed most to a prediction.

📖 Intuition

Asking the Network "What Made You Say Cat?"

After a forward pass predicting "cat," you ask: which neurons in the final conv layer were most activated when predicting this class? GradCAM computes the gradient of the class score with respect to each feature map in the last conv layer. Regions where gradients are large produced strong features for that class. A weighted sum of feature maps produces a heatmap overlaid on the original image — bright regions are where the model "looked" to decide.

import torch
import torch.nn.functional as F
import numpy as np
import cv2
from torchvision import models

class GradCAM:
    def __init__(self, model, target_layer):
        self.model = model
        self.target_layer = target_layer
        self.gradients = None
        self.activations = None

        target_layer.register_forward_hook(self._save_activation)
        # register_backward_hook is deprecated; use the full variant
        target_layer.register_full_backward_hook(self._save_gradient)

    def _save_activation(self, module, input, output):
        self.activations = output.detach()

    def _save_gradient(self, module, grad_input, grad_output):
        self.gradients = grad_output[0].detach()

    def generate(self, input_tensor, class_idx=None):
        logits = self.model(input_tensor)
        # Compare against None so that class index 0 is not treated as "unset"
        if class_idx is None:
            class_idx = int(logits[0].argmax())
        target = logits[0, class_idx]
        self.model.zero_grad()
        target.backward()

        # Global average pooling of gradients → channel weights
        weights = self.gradients.mean(dim=(2, 3), keepdim=True)

        # Weighted combination of activation maps
        cam = (weights * self.activations).sum(dim=1, keepdim=True)
        cam = F.relu(cam)  # only positive contributions
        cam = F.interpolate(cam, size=input_tensor.shape[2:],
                            mode='bilinear', align_corners=False)
        cam = cam.squeeze().cpu().numpy()  # .cpu() so this also works on GPU

        # Normalise to 0–1
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return cam

# Usage on a ResNet50
# Assumes img_tensor is a normalised (1, 3, H, W) tensor and
# original_bgr is the corresponding uint8 BGR image loaded with OpenCV
model = models.resnet50(weights='IMAGENET1K_V2').eval()
gcam = GradCAM(model, model.layer4[2].conv3)
heatmap = gcam.generate(img_tensor, class_idx=281)  # 281=tabby cat

# Resize the heatmap to the original image size, then overlay as colour map
heatmap = cv2.resize(heatmap, (original_bgr.shape[1], original_bgr.shape[0]))
heatmap_bgr = cv2.applyColorMap(np.uint8(255 * heatmap), cv2.COLORMAP_JET)
overlay = cv2.addWeighted(original_bgr, 0.6, heatmap_bgr, 0.4, 0)
cv2.imwrite('gradcam_output.jpg', overlay)

Section 15

Golden Rules for Computer Vision Projects

👀 Computer Vision — Non-Negotiable Rules

Never train without normalisation. Always normalise pixel values to [0,1] (divide by 255) and then standardise using the mean and std of your training set (or ImageNet stats for pretrained models). Raw uint8 values in neural networks cause terrible gradient behaviour.

Only augment training data — never validation or test data. Augmenting validation gives optimistically biased metrics. Validation transform: resize and normalise only. Test transform: identical to validation.

Always use a pretrained backbone as your starting point unless you have 100k+ labelled images and a very different domain. Training from scratch on <50k images almost always underperforms transfer learning. The ImageNet backbone is a gift — use it.

Fix your data before tuning your model. If your confusion matrix shows systematic class confusion, the cause is usually mislabelled data, inconsistent label criteria, or class imbalance — not a suboptimal learning rate. Garbage data defeats any architecture.

Use a learning rate scheduler. A constant LR is suboptimal. Use CosineAnnealingLR for standard training or OneCycleLR for super-convergence. Warmup for the first 5 epochs is strongly recommended when fine-tuning large backbones.

Benchmark on multiple metrics. For classification: accuracy + F1 + per-class recall. For detection: mAP50 + mAP50-95 + FPS. For segmentation: mIoU per class. A single number hides failure modes on minority classes.

Profile before optimising. Before converting to INT8 or TensorRT, measure baseline latency. Many models are already fast enough. Quantisation can reduce accuracy significantly on fine-grained tasks. Always validate quantised model accuracy on the full validation set, not just a sample.
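The scheduler rule above (warmup followed by cosine decay) can be sketched with a single LambdaLR, which multiplies the base learning rate by a function of the epoch. The function name `warmup_cosine`, the 5-epoch warmup, the 50-epoch budget, and the base rate of 0.01 are assumptions for illustration; PyTorch also ships CosineAnnealingLR and OneCycleLR as ready-made alternatives.

```python
import math
import torch

def warmup_cosine(epoch, warmup_epochs=5, total_epochs=50):
    """LR multiplier: linear warmup to 1.0, then cosine decay towards zero."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(10, 2)   # stand-in for a real network
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimiser, lr_lambda=warmup_cosine)

lrs = []
for epoch in range(50):
    lrs.append(optimiser.param_groups[0]["lr"])
    # ... one epoch of training would run here ...
    optimiser.step()             # optimiser steps before the scheduler
    scheduler.step()

print(f"first={lrs[0]:.4f} peak={max(lrs):.4f} last={lrs[-1]:.6f}")
```

The learning rate starts at a fifth of the base value, reaches the base value at the end of warmup, and then decays smoothly, so early updates to pretrained weights stay small.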