The Story Behind Computer Vision
The journey from raw pixel signals to "I see a cat" is what computer vision recreates in software. We show a neural network millions of images labelled "cat" and "not cat." It learns which combinations of edges, textures, and shapes scream feline. After training, it identifies cats in photos it has never seen, sometimes more accurately than humans.
Computer vision is, at its core, statistical pattern recognition over pixels.
Computer vision (CV) is a field of artificial intelligence that trains computers to interpret and understand visual data — images, video frames, depth maps, point clouds — extracting structured information from them. It powers face unlock on your phone, quality inspection on factory floors, cancer detection in radiology, and self-driving cars navigating city streets at 60 km/h.
A 224×224 colour image contains 150,528 raw numbers (pixels × 3 channels). The same object looks entirely different under different lighting, at different scales, from different angles, partially occluded, or photographed with motion blur. Classical programming cannot enumerate every possible variation — learned representations can.
The Visual Cortex — How Humans vs. Machines See
| Human visual cortex area | What It Detects |
|---|---|
| V1 (Primary) | Edges, orientations, contrast |
| V2 | Simple shapes, colour boundaries |
| V4 | Colour, curvature, form |
| IT (Inferotemporal) | Objects, faces, categories |
| PFC | Context, memory, attention |

| CNN layer | What It Detects |
|---|---|
| Conv 1–2 | Edges, colour gradients |
| Conv 3–4 | Textures, corners, patterns |
| Conv 5–6 | Parts (eyes, wheels, fins) |
| Fully Connected | Whole objects, classes |
| Softmax Output | Probability per class |
This analogy is not a coincidence: early deep learning researchers were inspired by neuroscience. Hubel and Wiesel's cat-cortex experiments of the late 1950s and 1960s, which earned them the 1981 Nobel Prize, informed the design of early convolutional networks, including Yann LeCun's 1998 LeNet-5. Biology and AI evolved in parallel.
Both biological and artificial vision systems are hierarchical: low-level features (edges) combine into mid-level features (shapes), which combine into high-level concepts (objects). Each layer in a CNN learns increasingly abstract representations. This is why deep networks outperform shallow ones on vision tasks.
The Image as Data — Pixels, Channels, and Tensors
Before any algorithm runs, you need to understand what an image actually is as data.
| Image Type | Shape | Value Range | Use Case |
|---|---|---|---|
| Grayscale | (H, W, 1) | 0–255 | OCR, X-rays, depth maps |
| RGB Colour | (H, W, 3) | 0–255 per channel | Natural photos, screenshots |
| RGBA (with alpha) | (H, W, 4) | 0–255 | UI elements, transparent PNGs |
| HSV Colour Space | (H, W, 3) | H:0–179, S:0–255, V:0–255 | Colour-based segmentation |
| Float Tensor | (N, C, H, W) | 0.0–1.0 (normalised) | Neural network input |
import cv2
import numpy as np
import torch
# Load an image (OpenCV loads as BGR by default)
img_bgr = cv2.imread('photo.jpg')
if img_bgr is None:  # cv2.imread returns None instead of raising on an unreadable path
    raise FileNotFoundError("photo.jpg could not be read")
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
print(f"Shape: {img_rgb.shape}") # (H, W, 3)
print(f"Dtype: {img_rgb.dtype}") # uint8
print(f"Min/Max: {img_rgb.min()}/{img_rgb.max()}") # at most 0 / 255
# Normalise for neural network input
img_float = img_rgb.astype(np.float32) / 255.0
# Convert to PyTorch tensor: (H, W, C) → (C, H, W)
tensor = torch.from_numpy(img_float).permute(2, 0, 1)
print(f"Tensor shape: {tensor.shape}") # torch.Size([3, H, W])
Core Computer Vision Tasks — The Taxonomy
Computer vision is not a single problem — it is a family of related tasks. Understanding the taxonomy is essential before choosing an architecture or dataset.
The Convolutional Neural Network — Deep Dive
A convolutional layer slides a small pattern detector across the image. The detector is a small matrix called a kernel (typically 3×3 or 5×5), and at each position the layer computes a dot product between the kernel and the pixels beneath it. If a 3×3 kernel is tuned to detect vertical edges, it fires strongly wherever vertical edges appear. A bank of 64 different kernels detects 64 different patterns simultaneously. That is one convolutional layer. Stack many of them and you build a feature hierarchy.
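The sliding dot product can be sketched in a few lines of NumPy. This is an illustration only: the toy image, the hand-written Sobel-like kernel, and the helper name `conv2d_valid` are assumptions made for the example, and a real layer learns its kernel weights and runs many kernels over many channels at once.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over a 2-D `image` (stride 1, no padding),
    taking a dot product at every position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Toy 6×6 image: dark left half (0), bright right half (1), so one vertical edge
image = np.zeros((6, 6), dtype=np.float32)
image[:, 3:] = 1.0

# Hand-written vertical-edge kernel (Sobel-like); a trained CNN learns such weights
vertical_edge = np.array([[-1, 0, 1],
                          [-2, 0, 2],
                          [-1, 0, 1]], dtype=np.float32)

response = conv2d_valid(image, vertical_edge)
print(response)  # every row is [0, 4, 4, 0]: strong response only around the edge
```

Like deep learning frameworks, the sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.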
Building a CNN from Scratch in PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
class SimpleCNN(nn.Module):
    '''
    A clean 3-block CNN for 32×32 images (e.g. CIFAR-10, 10 classes).
    Architecture: [Conv→BN→ReLU→Pool] × 3 → Flatten → FC → FC
    '''
    def __init__(self, num_classes=10):
        super().__init__()
        # Block 1: 3 → 32 channels, 32×32 → 16×16
        self.block1 = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2)  # 32×32 → 16×16
        )
        # Block 2: 32 → 64 channels, 16×16 → 8×8
        self.block2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2)  # 16×16 → 8×8
        )
        # Block 3: 64 → 128 channels, 8×8 → 4×4
        self.block3 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2)  # 8×8 → 4×4
        )
        # Classifier head
        self.classifier = nn.Sequential(
            nn.Flatten(),  # 128×4×4 = 2048
            nn.Linear(2048, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        return self.classifier(x)
# Sanity-check with a random batch
model = SimpleCNN(num_classes=10)
dummy = torch.randn(4, 3, 32, 32) # batch of 4 CIFAR images
logits = model(dummy)
print(f"Output shape: {logits.shape}") # (4, 10)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
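The spatial sizes in the comments above (32 → 16 → 8 → 4) follow from the usual output-size rule for convolution and pooling, floor((size + 2·padding − kernel) / stride) + 1. The short check below applies that rule to the three blocks; the helper name `conv_out` is made up for this sketch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32
for block in range(1, 4):
    size = conv_out(size, kernel=3, stride=1, padding=1)  # 3×3 conv with padding=1 keeps the size
    size = conv_out(size, kernel=2, stride=2)             # 2×2 max-pool halves it
    print(f"after block {block}: {size}×{size}")

flattened = 128 * size * size  # channels × height × width entering the first Linear layer
print(flattened)               # 2048
```

If you change the input resolution or the number of blocks, the same arithmetic gives the value the first `nn.Linear` layer needs.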
Landmark CNN Architectures — The Evolution
| Architecture | Year | Depth | Top-1 (ImageNet) | Key Innovation |
|---|---|---|---|---|
| LeNet-5 | 1998 | 5 layers | ~99% (MNIST, not ImageNet) | First practical CNN; convolution + pooling concept |
| AlexNet | 2012 | 8 layers | 63.3% | GPU training, ReLU, Dropout — started the deep learning revolution |
| VGGNet-16 | 2014 | 16 layers | 71.5% | Very deep with only 3×3 convolutions; elegant simplicity |
| GoogLeNet | 2014 | 22 layers | 74.8% | Inception modules (parallel convolutions), 12× fewer parameters than AlexNet |
| ResNet-50 | 2015 | 50 layers | 76.1% | Residual skip connections — solved vanishing gradient for 100+ layers |
| DenseNet | 2017 | 121–264 | 77.4% | Each layer receives features from ALL previous layers |
| EfficientNet-B7 | 2019 | — | 84.4% | Neural architecture search; compound scaling of width/depth/resolution |
| Vision Transformer (ViT) | 2021 | — | 88.5% | Transformer (attention) applied to image patches — no convolution at all |
| ConvNeXt | 2022 | — | 87.8% | Pure CNN redesigned with transformer design principles |
ResNet introduced skip connections: instead of learning the desired mapping H(x) directly, a block learns the residual F(x) = H(x) − x and outputs F(x) + x. If the optimal function is close to the identity, the block only needs to learn F(x) ≈ 0, which is much easier. Gradients also flow back through the skip path directly, which mitigates the vanishing gradient problem. Before this, plain networks much deeper than about 20 layers were hard to train and often performed worse than shallower ones.
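A minimal residual block might look like the sketch below. It is a simplified version of the basic two-convolution block, not the bottleneck block ResNet-50 actually uses, and the class name and channel count are chosen for illustration. The last lines force F(x) to zero to show that the block then passes a non-negative input straight through.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x), shape unchanged."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))  # first half of F(x)
        residual = self.bn2(self.conv2(residual))      # second half of F(x)
        return self.relu(residual + x)                 # skip connection adds the input back

block = ResidualBlock(channels=16).eval()

# Demonstration only: zero the last BatchNorm scale so that F(x) = 0 everywhere
nn.init.zeros_(block.bn2.weight)
x = torch.rand(2, 16, 8, 8)  # non-negative input, so ReLU(x) == x
with torch.no_grad():
    y = block(x)
print(torch.allclose(y, x))  # True: with F(x) = 0 the block is an identity
```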
Transfer Learning — The Greatest Productivity Hack in Vision
Transfer learning reuses what a network has already learned on one task as the starting point for another. A ResNet-50 trained on ImageNet's 1.2 million images has already learned to detect edges, textures, shapes, and object parts. Freeze those learned representations, add a new classification head for your 200-image dataset, train it for a few epochs, and you can often reach 90%+ accuracy on a problem where training from scratch might give you around 40%; the exact figures depend on the task.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import models, transforms
from torchvision.datasets import ImageFolder  # ImageFolder lives in torchvision, not torch.utils.data
# ── 1. Load pretrained ResNet50 ────────────────────────────────
backbone = models.resnet50(weights='IMAGENET1K_V2')
# ── 2. Freeze ALL backbone parameters ─────────────────────────
for param in backbone.parameters():
    param.requires_grad = False
# ── 3. Replace the final FC layer for your task (5 classes) ───
in_feats = backbone.fc.in_features # 2048
backbone.fc = nn.Sequential(
    nn.Dropout(0.4),
    nn.Linear(in_feats, 256),
    nn.ReLU(),
    nn.Linear(256, 5) # 5 custom classes
)
# Only the new head has gradients (new layers default to requires_grad=True)
trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
print(f"Trainable params: {trainable:,}") # 525,829
# ── 4. ImageNet normalisation for pretrained models ────────────
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])
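# ── 5. Training loop ──────────────────────────────────────────
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = backbone.to(device)
# Hand only the trainable (head) parameters to the optimiser
optimiser = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=3e-4)
criterion = nn.CrossEntropyLoss()
# train_loader is a DataLoader over your image folder.
# 'data/train' is a placeholder path with one sub-folder per class.
train_loader = DataLoader(ImageFolder('data/train', transform=transform),
                          batch_size=32, shuffle=True)
model.train()
for epoch in range(10):
    for imgs, labels in train_loader:
        imgs, labels = imgs.to(device), labels.to(device)
        optimiser.zero_grad()
        loss = criterion(model(imgs), labels)
        loss.backward()
        optimiser.step()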
Dataset small + similar domain → Feature extraction only.
Dataset medium + different domain → Unfreeze last 2–3 blocks, use LR 1e-4.
Dataset large + very different domain → Full fine-tune with discriminative LRs.
Always use ImageNet normalisation statistics when loading ImageNet-pretrained torchvision models. Never skip it.
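Partial unfreezing and discriminative learning rates both come down to optimiser parameter groups. The sketch below uses a tiny hypothetical stand-in model (the stage names `early`, `late`, and `head` are invented here) so that it runs without downloading weights; the 1e-4 rate follows the rule of thumb above, while the 1e-3 rate for the new head is an assumption.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained network: two backbone stages plus a new head
model = nn.ModuleDict({
    "early": nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()),
    "late": nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU()),
    "head": nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 5)),
})

# Medium dataset, different domain: keep early layers frozen, unfreeze the later ones
for p in model["early"].parameters():
    p.requires_grad = False

# Discriminative learning rates: small steps for pretrained layers, larger for the new head
optimiser = torch.optim.Adam([
    {"params": model["late"].parameters(), "lr": 1e-4},
    {"params": model["head"].parameters(), "lr": 1e-3},
])

for group in optimiser.param_groups:
    n_params = sum(p.numel() for p in group["params"])
    print(f"lr={group['lr']:.0e}  params={n_params}")
```

With torchvision's ResNet-50 the analogous groups would be `layer3`, `layer4`, and `fc`, with everything earlier left frozen.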
Data Augmentation — Making One Image into Many
A model that only ever sees one orientation of a cat will fail on an upside-down cat. Data augmentation artificially expands your training set by applying random transformations at training time. At inference time, no random augmentation is applied, only deterministic resizing and normalisation.
import torch
from torchvision.transforms import v2
# Modern torchvision v2 augmentation pipeline
train_transform = v2.Compose([
    v2.ToImage(),  # convert PIL/ndarray input to a tensor image first
    v2.RandomResizedCrop(224, scale=(0.6, 1.0)),
    v2.RandomHorizontalFlip(p=0.5),
    v2.ColorJitter(brightness=0.3, contrast=0.3,
                   saturation=0.3, hue=0.1),
    v2.RandomRotation(degrees=15),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406],
                 std=[0.229, 0.224, 0.225]),
    v2.RandomErasing(p=0.25, scale=(0.02, 0.20)),  # applied last, on the normalised tensor
])
# Validation: NO augmentation, only resize + normalise
val_transform = v2.Compose([
    v2.ToImage(),
    v2.Resize((224, 224)),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406],
                 std=[0.229, 0.224, 0.225])
])
Object Detection — YOLO Architecture Deep Dive
YOLO (You Only Look Once) is a single-stage object detector. Unlike two-stage detectors (which first propose regions, then classify them), YOLO makes a single forward pass through the network and directly predicts all bounding boxes and class labels simultaneously. This is why it can run at 30–100+ FPS on modern hardware, fast enough for real-time video.
from ultralytics import YOLO
# Load pretrained YOLOv8 nano (fastest model)
model = YOLO('yolov8n.pt')
# ── Inference on a single image ────────────────────────────────
results = model.predict(
    source='street.jpg',
    conf=0.45,  # confidence threshold
    iou=0.45,   # NMS IoU threshold
    device='cuda'
)
for r in results:
    for box in r.boxes:
        cls = r.names[int(box.cls)]
        conf = float(box.conf)
        xyxy = box.xyxy[0].tolist()  # [x1,y1,x2,y2]
        print(f"{cls}: {conf:.2f} @ {[int(v) for v in xyxy]}")
# ── Fine-tune on custom dataset ────────────────────────────────
# Requires dataset.yaml defining: path, train, val, nc, names
model.train(
    data='dataset.yaml',
    epochs=50,
    imgsz=640,
    batch=16,
    lr0=0.01,
    patience=10,  # early stopping
    device='0'
)
# ── Evaluate — prints mAP50, mAP50-95 ─────────────────────────
metrics = model.val()
print(f"mAP50: {metrics.box.map50:.3f}")
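The `iou` argument in the inference call is the threshold for non-maximum suppression (NMS), the step that removes near-duplicate boxes for the same object. The NumPy sketch below shows the greedy idea only. It is class-agnostic and is not the Ultralytics implementation; the function name and toy boxes are made up for the example.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy non-maximum suppression.
    boxes: (N, 4) array of [x1, y1, x2, y2]; returns indices of the boxes kept."""
    order = np.argsort(scores)[::-1]  # highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Intersection of the best box with every remaining box
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]  # drop boxes overlapping the kept one too much
    return keep

boxes = np.array([[10, 10, 110, 110],     # detection A
                  [12, 12, 112, 112],     # near-duplicate of A
                  [200, 200, 300, 300]],  # a separate object
                 dtype=np.float32)
scores = np.array([0.90, 0.80, 0.75])
kept = nms(boxes, scores)
print(kept)  # [0, 2]: the duplicate is suppressed
```

Production detectors usually run NMS per class and on the GPU, so treat this as a picture of the logic rather than a drop-in replacement.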
Evaluation Metrics for Vision Models
| Metric | Formula | Task | Notes |
|---|---|---|---|
| Accuracy | (TP+TN) / Total | Classification | Misleading on imbalanced datasets |
| Precision | TP / (TP+FP) | Classification / Detection | How many positive predictions were correct? |
| Recall | TP / (TP+FN) | Classification / Detection | How many actual positives did we find? |
| F1 Score | 2·P·R / (P+R) | Classification | Balances precision and recall — preferred for imbalanced classes |
| IoU | Intersection / Union | Detection / Segmentation | Measures box overlap. IoU ≥ 0.5 is the usual threshold for a correct detection |
| mAP@50 | mean AP over classes | Object Detection | Primary detection metric. AP = area under Precision-Recall curve at IoU=0.5 |
| mAP@50-95 | mean over IoU 0.5:0.05:0.95 | Object Detection | Stricter metric used in COCO benchmark — harder to game |
| mIoU | mean IoU over classes | Segmentation | Primary segmentation metric. Averages per-class IoU across all semantic classes |
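The IoU row can be made concrete with a few lines of plain Python. This is a sketch for axis-aligned boxes in (x1, y1, x2, y2) format; the function name and the two example boxes are illustrative.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # zero if the boxes do not overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

ground_truth = (0, 0, 100, 100)
prediction = (50, 0, 150, 100)  # same size, shifted right by half a box width
iou = box_iou(ground_truth, prediction)
print(f"IoU = {iou:.3f}")  # IoU = 0.333
```

A prediction that covers half of the ground-truth box scores only 0.333, so it would not count as a correct detection at the 0.5 threshold.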
A skin cancer classifier on a dataset with 95% benign samples achieves 95% accuracy by predicting "benign" every time — while missing every single cancer. Always report F1, AUC-ROC, and per-class precision/recall for imbalanced classification tasks. For detection, mAP50 is the minimum you must report.
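The accuracy trap described above is easy to reproduce. The sketch below builds a toy dataset of 95 benign and 5 malignant labels and scores a "classifier" that always predicts benign; the helper name and the convention of returning 0.0 when a ratio is undefined are choices made for this example.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # undefined case reported as 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

y_true = [0] * 95 + [1] * 5  # 95 benign, 5 malignant
y_pred = [0] * 100           # always predict "benign"
acc, prec, rec, f1 = classification_metrics(y_true, y_pred)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
# accuracy=0.95 precision=0.00 recall=0.00 f1=0.00
```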
Vision Transformers (ViT) — Attention Meets Images
A Vision Transformer works differently from a CNN. It cuts the image into 16×16 pixel patches, treats each patch as a "word," and runs the transformer's attention mechanism over all patches simultaneously. Every patch can directly attend to every other patch in a single layer. A patch showing an eye can immediately attend to the patch showing a nose across the image, with no need to wait for information to propagate through spatial layers. This global receptive field from layer one is ViT's fundamental advantage over CNNs.
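The patch-as-word idea can be sketched in NumPy. The example assumes a 224×224 RGB input and uses random, untrained projection matrices purely to show the shapes; a real ViT also adds a learned patch embedding, position embeddings, a class token, value projections, and multiple heads, all omitted here. The function name `patchify` is made up for the sketch.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patch×patch patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)        # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)  # (num_patches, patch*patch*C)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3)).astype(np.float32)
tokens = patchify(image)  # one "word" per patch
print(tokens.shape)       # (196, 768)

# Single-head attention weights with random (untrained) query/key projections
d = 64
w_q = (rng.standard_normal((tokens.shape[1], d)) * 0.02).astype(np.float32)
w_k = (rng.standard_normal((tokens.shape[1], d)) * 0.02).astype(np.float32)
q, k = tokens @ w_q, tokens @ w_k
scores = q @ k.T / np.sqrt(d)
scores -= scores.max(axis=1, keepdims=True)  # subtract row max for numerical stability
attn = np.exp(scores)
attn /= attn.sum(axis=1, keepdims=True)      # softmax over all patches
print(attn.shape)                            # (196, 196)
```

Row i of `attn` holds how strongly patch i attends to each of the 196 patches, which is the global receptive field described above.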
Use CNNs when: your dataset is small (<50k images), you need fast inference on edge devices, or strong inductive biases (locality, translation invariance) help your task.
Use ViTs when: you have large datasets (>1M images or large pretrained checkpoints like CLIP/DINO), you need global context reasoning, or you're building multimodal systems (vision + language).
Hybrid models (CNN stem + transformer body) like CvT often give the best of both worlds.
Real-World CV Pipeline — End to End
Model Interpretability — GradCAM Explained
One common criticism of deep learning is the "black box" problem. GradCAM (Gradient-weighted Class Activation Mapping) addresses this in part for CNNs by visualising which spatial regions contributed most to a prediction.
import cv2
import numpy as np
import torch
import torch.nn.functional as F
from torchvision import models

class GradCAM:
    def __init__(self, model, target_layer):
        self.model = model
        self.target_layer = target_layer
        self.gradients = None
        self.activations = None
        target_layer.register_forward_hook(self._save_activation)
        # register_backward_hook is deprecated; use the "full" variant
        target_layer.register_full_backward_hook(self._save_gradient)

    def _save_activation(self, module, input, output):
        self.activations = output.detach()

    def _save_gradient(self, module, grad_input, grad_output):
        self.gradients = grad_output[0].detach()

    def generate(self, input_tensor, class_idx=None):
        logits = self.model(input_tensor)
        # Compare with None so that class index 0 is not mistaken for "unset"
        if class_idx is None:
            class_idx = int(logits.argmax(dim=1)[0])
        target = logits[0, class_idx]
        self.model.zero_grad()
        target.backward()
        # Global average pooling of gradients → channel weights
        weights = self.gradients.mean(dim=(2, 3), keepdim=True)
        # Weighted combination of activation maps
        cam = (weights * self.activations).sum(dim=1, keepdim=True)
        cam = F.relu(cam)  # only positive contributions
        cam = F.interpolate(cam, size=input_tensor.shape[2:],
                            mode='bilinear', align_corners=False)
        cam = cam.squeeze().cpu().numpy()  # .cpu() so this also works on GPU
        # Normalise to 0–1
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return cam

# Usage on a ResNet50
# Assumed inputs (not defined here): img_tensor is a normalised (1, 3, H, W)
# float tensor, and original_bgr is the same image as a uint8 BGR array of
# size H×W, so that the heatmap and the image can be blended.
model = models.resnet50(weights='IMAGENET1K_V2').eval()
gcam = GradCAM(model, model.layer4[2].conv3)
heatmap = gcam.generate(img_tensor, class_idx=281)  # 281=tabby cat
# Overlay as colour map
heatmap_bgr = cv2.applyColorMap(np.uint8(255 * heatmap), cv2.COLORMAP_JET)
overlay = cv2.addWeighted(original_bgr, 0.6, heatmap_bgr, 0.4, 0)
cv2.imwrite('gradcam_output.jpg', overlay)
Golden Rules for Computer Vision Projects
Use a learning-rate schedule: CosineAnnealingLR for standard training, or OneCycleLR for super-convergence.
Warmup for the first 5 epochs is strongly recommended when fine-tuning large backbones.
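Warmup followed by cosine annealing can be written as a single function of the epoch. The sketch below is plain Python so the shape of the schedule is easy to inspect; the function name and the default values (50 epochs, base rate 1e-3, floor 1e-6) are assumptions, not recommendations from this guide.

```python
import math

def lr_at_epoch(epoch, total_epochs=50, warmup_epochs=5, base_lr=1e-3, min_lr=1e-6):
    """Linear warmup, then cosine annealing; `epoch` is 0-indexed."""
    if epoch < warmup_epochs:
        # Ramp linearly from base_lr / warmup_epochs up to base_lr
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warmup phase completed, from 0.0 to 1.0
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs - 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [lr_at_epoch(e) for e in range(50)]
for e in (0, 4, 5, 27, 49):
    print(f"epoch {e:2d}: lr = {schedule[e]:.6f}")
```

In PyTorch the same shape can be built by chaining `torch.optim.lr_scheduler.LinearLR` and `CosineAnnealingLR` with `SequentialLR`, stepping the scheduler once per epoch.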