Segmentation in Computer Vision: Semantic

Section 01

The Story That Explains Segmentation

📖 Real World Analogy

The Surgeon and the Map

Imagine a surgeon about to operate on a brain tumour. She cannot simply look at an X-ray and say "there's something wrong in there". She needs a precise map — exactly which pixels belong to the tumour and which belong to healthy tissue. One millimetre of error can cost a life.

That is what segmentation does: it does not just find objects, it draws their exact outlines, pixel by pixel. Classification says "this image contains a cat". Detection says "the cat is in this bounding box". Segmentation says "these exact 14,308 pixels are the cat."

From autonomous cars avoiding pedestrians to satellite imagery delineating flood zones, segmentation is the discipline that turns vague detection into surgical precision.

Image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels), where each segment shares some characteristic — colour, texture, object class, or instance identity. The goal is to simplify or change the representation of an image into something more meaningful and easier to analyse.

🌿

The Core Insight

Every pixel in an image is a data point. Segmentation assigns a label to every single pixel. A 1920×1080 image has 2,073,600 pixels — segmentation is essentially a per-pixel classification problem at massive scale. This is what makes it computationally harder — and more powerful — than simple classification.
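To make "per-pixel classification" concrete, here is a minimal sketch with made-up class scores for a tiny 2×3 image. The score values and the (classes, height, width) layout are assumptions chosen for illustration, not the output of any particular model.

```python
import numpy as np

# Hypothetical per-pixel class scores for a tiny 2x3 image and 3 classes.
# Shape convention assumed here: (num_classes, height, width).
scores = np.array([
    [[2.0, 0.1, 0.3], [0.2, 0.1, 1.5]],   # class 0
    [[0.5, 3.0, 0.2], [0.1, 2.2, 0.4]],   # class 1
    [[0.1, 0.2, 4.0], [1.9, 0.3, 0.2]],   # class 2
])

# One label per pixel: pick the highest-scoring class along the class axis.
label_map = scores.argmax(axis=0)

print(label_map.shape)   # (2, 3): same spatial size as the image
print(label_map)
```

The label map has the same height and width as the image, which is why the cost of segmentation grows with resolution in a way that whole-image classification does not.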

Section 02

The Three Pillars — Types of Segmentation

There are three fundamental types of segmentation in computer vision, each progressively more demanding in both labelling effort and computational complexity.

🎨

Semantic Segmentation

class-level labelling

Every pixel is assigned a class label (e.g. sky, road, person, tree). Two different people standing next to each other get the same label: person. It cannot distinguish between individual instances of the same class.

✔ Simpler, faster to train

✘ Merges all instances of same class

👤

Instance Segmentation

object-level labelling

Every pixel is labelled AND every individual instance is separately identified. Person #1 gets a different mask from Person #2, even though both are labelled "person". Requires detecting and segmenting each object independently.

✔ Distinguishes individual objects

✘ Harder, slower, needs detection step

🌐

Panoptic Segmentation

unified full-scene labelling

Combines semantic and instance segmentation into one unified task. Every pixel gets both a semantic class and (for countable "thing" classes) an instance ID. Background "stuff" (sky, grass) is handled semantically; foreground "things" (cars, people) get individual IDs.

✔ Complete scene understanding

✘ Most complex, most expensive

Property	Semantic	Instance	Panoptic
Assigns class to every pixel	✔ Yes	Partial (only detected objects)	✔ Yes
Distinguishes individual instances	✘ No	✔ Yes	✔ Yes
Handles background "stuff" classes	✔ Yes	✘ No	✔ Yes
Two people → same or different mask?	Same mask	Different masks	Different masks
Typical use case	Road scene parsing	Object counting, robotics	Autonomous driving (full scene)
Computational cost	Low	Medium–High	High
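The difference between the three outputs can be sketched on a toy scene with two people against sky. The class ids and the `class_id * 1000 + instance_id` encoding below are assumptions made for this illustration; real datasets define their own conventions.

```python
import numpy as np

# Toy 4x6 scene. Assumed class ids: 0 = sky ("stuff"), 1 = person ("thing").
semantic = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

# Instance ids: 0 = no instance, 1 and 2 = the two people.
instance = np.zeros_like(semantic)
instance[1:3, 1:3] = 1
instance[1:3, 4:6] = 2

# One possible panoptic encoding (an assumed convention for this sketch):
# class_id * 1000 + instance_id, so every segment gets a unique integer.
panoptic = semantic * 1000 + instance

semantic_regions = np.unique(semantic).size                 # sky, person
person_instances = np.unique(instance[semantic == 1]).size  # person #1, person #2
panoptic_segments = np.unique(panoptic).size                # sky, person #1, person #2

print(semantic_regions, person_instances, panoptic_segments)
```

The semantic map alone cannot tell the two people apart, the instance map says nothing about the sky, and the panoptic map carries both pieces of information in one array.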

Section 03

Classical Segmentation Methods — Before Deep Learning

📖 Story

The Geographer's Map — Watershed Segmentation

Imagine pouring water over a topographic landscape. Water collects in valleys — the catchment basins. The ridges between basins — the watershed lines — naturally divide the landscape into regions. Early computer vision researchers had the same insight: treat the gradient of an image as a topographic surface, pour virtual water on it, and wherever water collects becomes a region. Wherever ridges form becomes a boundary. This is the Watershed Algorithm, and it worked surprisingly well on well-contrasted medical images long before neural networks existed.

Classical segmentation approaches rely on handcrafted rules about pixel colour, intensity, gradient, or texture. They are fast, interpretable, and require no training data — but they fail catastrophically on complex, high-variance real-world scenes.

📊

Thresholding

intensity-based

The simplest method. Every pixel with intensity above a threshold T is assigned class "foreground"; below is "background". Otsu's method finds the optimal T automatically by minimising intra-class variance.
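As a rough illustration of how Otsu's method picks T, the sketch below scans all 256 candidate thresholds on a synthetic bimodal image. It maximises between-class variance, which is equivalent to minimising intra-class variance. The function name `otsu_threshold` and the synthetic data are assumptions for this example; in practice `cv2.threshold` with `cv2.THRESH_OTSU` does this for you.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold t (0-255) that maximises between-class variance.

    Pixels with value <= t form one class, pixels > t the other.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)

    w0 = np.cumsum(prob)                 # weight of class "<= t"
    w1 = 1.0 - w0                        # weight of class "> t"
    cum_mean = np.cumsum(prob * levels)  # unnormalised mean of class "<= t"
    total_mean = cum_mean[-1]

    # Between-class variance for every candidate t; empty classes score 0.
    valid = (w0 > 0) & (w1 > 0)
    sigma_b = np.zeros(256)
    sigma_b[valid] = ((total_mean * w0[valid] - cum_mean[valid]) ** 2
                      / (w0[valid] * w1[valid]))
    return int(np.argmax(sigma_b))

# Synthetic bimodal image: dark background near 50, bright object near 200.
rng = np.random.default_rng(0)
dark = rng.integers(40, 61, size=600)
bright = rng.integers(190, 211, size=400)
gray = np.concatenate([dark, bright]).astype(np.uint8).reshape(25, 40)

t = otsu_threshold(gray)
foreground = gray > t
print(f"threshold={t}, foreground pixels={foreground.sum()}")
```

On cleanly separated data like this, any threshold between the two clusters gives the same split; real images with overlapping histograms are where the variance criterion earns its keep.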

🌊

Watershed

gradient topology

Treats the gradient magnitude image as a topographic surface. Simulates flooding from local minima. Where different flood basins would merge, watershed lines are drawn — forming segment boundaries. Excellent for touching objects.

🎯

K-Means Clustering

colour clustering

Treats each pixel as a point in colour space (R, G, B). Groups pixels into K clusters by colour similarity. Each cluster becomes a segment. Fast and simple, but number of segments K must be chosen in advance.
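The idea can be sketched with a few lines of NumPy. This is a bare-bones Lloyd's k-means under simple assumptions (random initialisation, fixed iteration count); the helper name `kmeans_segment` is made up for this example, and production code would more likely use `cv2.kmeans` or scikit-learn.

```python
import numpy as np

def kmeans_segment(image, k, iters=10, seed=0):
    """Cluster pixels by colour with plain k-means; return an (H, W) label map."""
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(np.float64)
    rng = np.random.default_rng(seed)
    # Initialise centres from k randomly chosen pixels.
    centres = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(iters):
        # Squared colour distance from every pixel to every centre: shape (N, k).
        dists = ((pixels[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = pixels[labels == j]
            if len(members) > 0:   # an empty cluster keeps its old centre
                centres[j] = members.mean(axis=0)
    return labels.reshape(h, w)

# Toy image: reddish left half, bluish right half.
image = np.zeros((4, 6, 3), dtype=np.uint8)
image[:, :3] = (200, 30, 30)
image[:, 3:] = (30, 30, 200)

labels = kmeans_segment(image, k=2)
print(labels)
```

Note that the clustering uses colour only. Two regions of the same colour on opposite sides of the image end up in the same segment, which is one reason spatial coordinates are sometimes appended to the feature vector.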

✂️

Graph Cut

energy minimisation

Models the image as a graph where pixels are nodes and edges encode similarity. Segmentation becomes a minimum-cut problem — finding the cheapest way to separate foreground from background nodes. GrabCut is a famous variant.

🔷

Active Contours (Snakes)

iterative boundary fitting

Places an elastic contour near an object and lets it "snake" toward edges, minimising an energy function balancing smoothness vs edge strength. Requires manual initialisation but gives very precise boundaries for medical imaging.

🌀

SLIC Superpixels

compact clustering

Segments an image into compact, perceptually uniform regions (superpixels) that respect image boundaries. Often used as a pre-processing step to reduce the pixel count before more expensive algorithms run.

⚠️

The Fatal Flaw of Classical Methods

Classical methods have no concept of semantics. They cannot distinguish a road from a patch of grey sky if both have the same pixel intensity. They cannot recognise that two differently-lit images of the same object are related. The moment you need to understand what something is — not just where its edges are — classical methods collapse. That gap is exactly what deep learning fills.

Classical Segmentation: Python Example (Watershed)

import cv2
import numpy as np
from matplotlib import pyplot as plt

# Load image
img = cv2.imread('coins.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Step 1: Threshold → binary image
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Step 2: Remove noise with morphological opening
kernel = np.ones((3, 3), np.uint8)
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=2)

# Step 3: Sure background via dilation
sure_bg = cv2.dilate(opening, kernel, iterations=3)

# Step 4: Sure foreground via distance transform
dist_transform = cv2.distanceTransform(opening, cv2.DIST_L2, 5)
_, sure_fg = cv2.threshold(dist_transform, 0.7 * dist_transform.max(), 255, cv2.THRESH_BINARY)
sure_fg = np.uint8(sure_fg)

# Step 5: Unknown region (border between fg and bg)
unknown = cv2.subtract(sure_bg, sure_fg)

# Step 6: Marker labelling for watershed
_, markers = cv2.connectedComponents(sure_fg)
markers = markers + 1
markers[unknown == 255] = 0

# Step 7: Apply watershed
markers = cv2.watershed(img, markers)
img[markers == -1] = [0, 0, 255]  # Mark boundaries red (OpenCV images are BGR)

plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
plt.title('Watershed Segmentation — Coins')
plt.axis('off')
plt.show()

OUTPUT

Detected 12 individual coins separated by red watershed boundaries.
Each coin = one connected component with a unique integer marker label.
Execution time on a 512×512 image: ~18 ms (CPU only)

Section 04

The Architecture Revolution — Fully Convolutional Networks

The breakthrough that enabled modern segmentation was deceptively simple: replace the fully-connected layers at the end of a classification network with convolutional layers. This creates a network that accepts an image of any size and outputs a spatial map of predictions — one prediction per pixel instead of one prediction per image.

Input Image (H × W × 3)

A full-resolution image enters the network. For segmentation, spatial information must be preserved — every pixel's location matters for the final per-pixel label.

Encoder — Downsampling Path

A stack of convolutional + pooling layers progressively reduce spatial resolution (H/2, H/4, H/8, H/16) while increasing feature depth (64, 128, 256, 512 channels). The encoder captures what is in the image — semantic content.

Bottleneck — Deepest Feature Representation

At the smallest spatial resolution (H/32 for VGG-like encoders), the network has the richest semantic understanding. Each feature map cell has a large receptive field covering most of the original image.

Decoder — Upsampling Path

Transposed convolutions (or bilinear upsampling + conv) progressively restore spatial resolution back to H × W. Skip connections from the encoder inject fine-grained spatial detail lost during downsampling.

Output — Segmentation Map (H × W × C)

A 1×1 convolution reduces channels to C (number of classes). A softmax over channels gives per-pixel class probabilities. Argmax gives the final label for each pixel.
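The output stage described above can be sketched in NumPy. A 1×1 convolution is just a per-pixel linear map over channels, so it is written here as an `einsum`; the feature map and weights are random placeholders, not values from a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, channels, height, width = 3, 8, 4, 5

# Hypothetical decoder output: one 8-channel feature vector per pixel.
features = rng.normal(size=(channels, height, width))

# A 1x1 convolution applies the same linear map at every pixel.
weight = rng.normal(size=(num_classes, channels))
bias = np.zeros(num_classes)
logits = np.einsum("kc,chw->khw", weight, features) + bias[:, None, None]

# Softmax over the class axis (shifted by the max for numerical stability).
shifted = logits - logits.max(axis=0, keepdims=True)
exp = np.exp(shifted)
probs = exp / exp.sum(axis=0, keepdims=True)

# Argmax over classes gives the final H x W label map.
label_map = probs.argmax(axis=0)
print(probs.shape, label_map.shape)
```

Because softmax is monotonic, the argmax of the probabilities equals the argmax of the logits; the softmax matters for training losses and for reporting confidence, not for the final label.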

🔑

Skip Connections — The Secret Weapon

When downsampling, the network loses precise location information. Skip connections "skip" over the bottleneck and feed early, high-resolution feature maps directly to the decoder. This gives the decoder both what (semantic context from the bottleneck) and where (spatial precision from early layers). Without them, segmentation boundaries are blurry and inaccurate.

Section 05

U-Net — The Architecture That Changed Medical Imaging

📖 Story

One Paper, One Architecture, A Thousand Applications

In 2015, Olaf Ronneberger, Philipp Fischer, and Thomas Brox at the University of Freiburg were frustrated. Existing deep learning segmentation models required thousands of annotated images — a luxury in medical imaging where labelling a single brain MRI scan might take a trained radiologist four hours. They designed an architecture that could train on as few as 30 annotated images and still generalise to unseen cases.

The trick was a symmetric encoder-decoder structure with skip connections at every scale: each level of the encoder is wired directly to the corresponding level of the decoder. They called it U-Net because the architecture diagram looks like a U. It won the ISBI cell tracking challenge in 2015 by a wide margin and, a decade later, remains the dominant architecture in medical image segmentation.

Encoder (Contracting Path)

Input → [Conv-ReLU × 2 → MaxPool] × 4

Each block doubles the feature channels and halves the spatial dimensions. Creates 5 levels of feature maps at different scales.

Decoder (Expanding Path)

[UpConv → Concat(skip) → Conv-ReLU × 2] × 4

Each block upsamples, concatenates the corresponding encoder skip connection, and refines with two convolutions. Restores full resolution.

Skip Connection

Concat(encoder_level_i, decoder_level_i)

Channel-wise concatenation of encoder feature maps onto decoder feature maps at the same spatial resolution. Preserves fine-grained location info.

Output Layer

1×1 Conv → Softmax → Argmax

Maps the final feature map to C class channels. Softmax gives class probabilities per pixel; argmax gives the predicted class label.
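To see how channels and resolution move through the four blocks described above, the following sketch traces shapes only (no tensors, no learning). It assumes "same" padding so that convolutions keep the spatial size, and an input size divisible by 16; the helper name `unet_shape_trace` is made up for this illustration.

```python
def unet_shape_trace(size=256, features=(64, 128, 256, 512)):
    """Trace (channels, spatial size) through a U-Net with 'same' padding.

    Assumes the input size is divisible by 2 ** len(features).
    """
    encoder = []
    for f in features:
        encoder.append((f, size))   # after the double conv at this level
        size //= 2                  # after 2x2 max pooling
    bottleneck = (features[-1] * 2, size)

    decoder = []
    for f, skip_size in reversed(encoder):
        # The up-convolution halves channels and doubles resolution; concatenating
        # the skip doubles channels again, then the double conv reduces them to f.
        decoder.append({"after_upconv": (f, skip_size),
                        "after_concat": (2 * f, skip_size),
                        "after_convs": (f, skip_size)})
    return encoder, bottleneck, decoder

encoder, bottleneck, decoder = unet_shape_trace()
print("encoder   :", encoder)
print("bottleneck:", bottleneck)
for level in decoder:
    print("decoder   :", level)
```

The concatenation step is why each decoder double-conv takes `2 * f` input channels: half come from the upsampled path and half from the encoder skip.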

U-Net Implementation with PyTorch

import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two consecutive Conv2d → BatchNorm → ReLU blocks."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class UNet(nn.Module):
    def __init__(self, in_channels=1, num_classes=2, features=(64, 128, 256, 512)):
        super().__init__()
        self.downs = nn.ModuleList()
        self.ups   = nn.ModuleList()
        self.pool  = nn.MaxPool2d(2, 2)

        # Encoder
        ch = in_channels
        for f in features:
            self.downs.append(DoubleConv(ch, f))
            ch = f

        # Bottleneck
        self.bottleneck = DoubleConv(features[-1], features[-1] * 2)

        # Decoder
        for f in reversed(features):
            self.ups.append(nn.ConvTranspose2d(f * 2, f, 2, 2))
            self.ups.append(DoubleConv(f * 2, f))

        self.final = nn.Conv2d(features[0], num_classes, 1)

    def forward(self, x):
        skips = []
        for down in self.downs:
            x = down(x)
            skips.append(x)
            x = self.pool(x)

        x = self.bottleneck(x)
        skips = skips[::-1]  # reverse for decoder

        for i in range(0, len(self.ups), 2):
            x = self.ups[i](x)
            skip = skips[i // 2]
            if x.shape != skip.shape:
                x = nn.functional.interpolate(x, size=skip.shape[2:])
            x = torch.cat([skip, x], dim=1)
            x = self.ups[i + 1](x)

        return self.final(x)

# Verify architecture with a dummy forward pass
model = UNet(in_channels=1, num_classes=2)
dummy  = torch.randn(2, 1, 256, 256)   # batch=2, C=1, H=256, W=256
output = model(dummy)
print(f"Input : {dummy.shape}")
print(f"Output: {output.shape}")

OUTPUT

Input : torch.Size([2, 1, 256, 256])
Output: torch.Size([2, 2, 256, 256])

With the default configuration above, the model has roughly 31 million trainable parameters.

Section 06

DeepLab and Atrous Convolutions — Seeing Without Downsampling

U-Net solves the spatial precision problem by restoring resolution in the decoder. DeepLab (Google, 2015–2018) takes a different approach: avoid throwing away so much resolution in the first place. It uses atrous convolution (also called dilated convolution), a convolution whose kernel has gaps (holes) between its weights, so a 3×3 kernel covers a larger area without adding parameters or reducing resolution.

⬛ Standard 3×3 Convolution

Property	Value
Receptive field	3×3 pixels
Dilation rate	1 (no gaps)
Stride	1 or 2
Spatial output	Halved if stride=2
Parameters	3×3×C_in×C_out

🔷 Atrous 3×3 Convolution (rate=2)

Property	Value
Receptive field	5×5 pixels (with gaps)
Dilation rate	2 (one gap between each weight)
Stride	1 (resolution preserved!)
Spatial output	Same as input
Parameters	3×3×C_in×C_out (identical!)
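The numbers in the two tables follow from a pair of standard formulas, sketched below in plain Python. The function names are made up for this example; the formulas are the usual effective-kernel-size and convolution-output-size expressions.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span of a dilated kernel: k + (k - 1) * (dilation - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def conv_output_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one axis, using the usual convolution size formula."""
    k_eff = effective_kernel_size(kernel_size, dilation)
    return (size + 2 * padding - k_eff) // stride + 1

def conv_weight_count(kernel_size, c_in, c_out):
    """Number of weights (no bias); it does not depend on the dilation rate."""
    return kernel_size * kernel_size * c_in * c_out

for rate in (1, 2, 6, 12, 18):
    k_eff = effective_kernel_size(3, rate)
    # padding = rate keeps a 3x3 dilated convolution at the input resolution
    out = conv_output_size(64, 3, stride=1, padding=rate, dilation=rate)
    print(f"rate={rate:2d}  effective kernel={k_eff}x{k_eff}  output size={out}")
```

Setting `padding` equal to the dilation rate is what keeps the output the same size as the input for a 3×3 kernel, which is the choice the ASPP code in this section makes.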

🎯

ASPP — Atrous Spatial Pyramid Pooling

ASPP was introduced in DeepLabV2 and refined in DeepLabV3, which added batch normalisation and an image-level pooling branch. It runs multiple atrous convolutions in parallel with different dilation rates (e.g. 6, 12, 18). Each rate captures context at a different scale: small objects need small receptive fields, large objects need large ones. ASPP concatenates all the branches, giving the network multi-scale context without further reducing resolution.

ASPP Module Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

class AtrousConvBNReLU(nn.Module):
    def __init__(self, in_ch, out_ch, rate):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3,
                               padding=rate, dilation=rate, bias=False)
        self.bn  = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))

class ASPP(nn.Module):
    def __init__(self, in_channels=2048, out_channels=256):
        super().__init__()
        # 1×1 conv branch (pointwise, no dilation)
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU())
        # Atrous convolutions at 3 dilation rates
        self.atrous6  = AtrousConvBNReLU(in_channels, out_channels, rate=6)
        self.atrous12 = AtrousConvBNReLU(in_channels, out_channels, rate=12)
        self.atrous18 = AtrousConvBNReLU(in_channels, out_channels, rate=18)
        # Global average pooling branch
        self.gap = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU())
        # Projection after concat (5 branches × out_channels)
        self.project = nn.Sequential(
            nn.Conv2d(out_channels * 5, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU(),
            nn.Dropout(0.5))

    def forward(self, x):
        h, w = x.shape[2:]
        gap_out = F.interpolate(self.gap(x), size=(h, w), mode='bilinear', align_corners=False)
        branches = [self.conv1(x), self.atrous6(x),
                    self.atrous12(x), self.atrous18(x), gap_out]
        return self.project(torch.cat(branches, dim=1))

Section 07

Mask R-CNN — Instance Segmentation

📖 Story

The Assembly Line — Detect, Then Segment

Imagine a factory assembly line. First, a worker spots all the objects on the conveyor belt and puts a label on each. Then a second worker, knowing exactly where each object is, cuts out its precise outline. That is exactly how Mask R-CNN works.

Facebook AI Research (He et al., 2017) extended Faster R-CNN — a powerful object detector — by adding a third head: a small fully convolutional network that predicts a binary mask (foreground / background) for each detected bounding box. The detector tells you where things are; the mask head tells you exactly which pixels inside that box belong to the object. The result: state-of-the-art instance segmentation that is surprisingly fast for what it does.

🔧 Mask R-CNN — Three Heads, One Forward Pass

Stage 1

Feature Pyramid Network (FPN) — A ResNet-50/101 backbone extracts feature maps at 5 scales (P2–P6). Each scale specialises in objects of different sizes.

Stage 2

Region Proposal Network (RPN) — Proposes candidate bounding boxes and filters them with NMS, keeping anywhere from a few hundred to about 2,000 of the most promising per image depending on the configuration. Each proposal says "there might be an object here."

Stage 3

RoI Align — Extracts a fixed-size (7×7) feature map for each proposal without quantisation artifacts. This is the key upgrade over the original RoI Pooling used in Faster R-CNN.

Head 1

Classification Head — Predicts the class label (person, car, dog…) for each proposal using a fully connected layer.

Head 2

Box Regression Head — Refines the bounding box coordinates (Δx, Δy, Δw, Δh) to tightly fit the detected object.

Head 3

Mask Head — A 14×14 → 28×28 fully convolutional network predicts K binary masks (one per class) for each detected instance. The mask for the predicted class is resized to the bounding box and pasted into the image.
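The last step of the mask head, picking the predicted class's mask and pasting it into the image, can be sketched as follows. This is a simplified illustration: it uses nearest-neighbour resizing for brevity, whereas real implementations typically interpolate more carefully, and the helper name `paste_mask` is made up for this example.

```python
import numpy as np

def paste_mask(mask_probs, class_id, box, image_shape, threshold=0.5):
    """Place one instance's class-specific mask into a full-size boolean mask.

    mask_probs  : (K, m, m) per-class mask probabilities for one detection
    box         : (x0, y0, x1, y1) integer pixel coordinates, x1/y1 exclusive
    image_shape : (H, W)
    """
    x0, y0, x1, y1 = box
    box_h, box_w = y1 - y0, x1 - x0
    small = mask_probs[class_id]     # keep only the predicted class's mask
    m = small.shape[0]

    # Map every pixel inside the box back to a cell of the m x m mask.
    rows = np.arange(box_h) * m // box_h
    cols = np.arange(box_w) * m // box_w
    resized = small[rows[:, None], cols[None, :]]

    full = np.zeros(image_shape, dtype=bool)
    full[y0:y1, x0:x1] = resized > threshold
    return full

# Hypothetical detection: 3 classes, 28x28 masks, class 1 predicted.
mask_probs = np.zeros((3, 28, 28))
mask_probs[1, :, :14] = 0.9          # object fills the left half of the box
full_mask = paste_mask(mask_probs, class_id=1, box=(10, 20, 50, 60),
                       image_shape=(100, 120))
print(full_mask.sum())
```

Predicting one mask per class and selecting by the classifier's output is what lets the mask head avoid competition between classes inside a single box.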

Running Mask R-CNN with Detectron2

from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2 import model_zoo
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog
import cv2

# ── 1. Configure the model ───────────────────────────────────
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # confidence threshold
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")

# ── 2. Build predictor ───────────────────────────────────────
predictor = DefaultPredictor(cfg)

# ── 3. Run inference ─────────────────────────────────────────
img = cv2.imread("street.jpg")
outputs = predictor(img)

# ── 4. Extract instance masks ────────────────────────────────
instances = outputs["instances"].to("cpu")
masks      = instances.pred_masks   # shape: [N, H, W] bool tensor
boxes      = instances.pred_boxes   # Boxes object wrapping an [N, 4] tensor
classes    = instances.pred_classes # [N] class indices
scores     = instances.scores       # [N] confidence

print(f"Detected {len(instances)} instances")
for i in range(len(instances)):
    cls   = MetadataCatalog.get(cfg.DATASETS.TRAIN[0]).thing_classes[classes[i]]
    score = scores[i].item()
    pixels = masks[i].sum().item()
    print(f"  {cls:12s} | conf={score:.3f} | pixels={pixels:,}")

# ── 5. Visualise ─────────────────────────────────────────────
v = Visualizer(img[:, :, ::-1],
                metadata=MetadataCatalog.get(cfg.DATASETS.TRAIN[0]),
                scale=1.2)
out = v.draw_instance_predictions(instances)
cv2.imwrite("output_segmented.jpg", out.get_image()[:, :, ::-1])

Section 08

Loss Functions for Segmentation

Standard cross-entropy loss treats every pixel equally. In segmentation, this creates a critical problem: background pixels vastly outnumber foreground pixels. If 95% of your image is background, a model can get 95% accuracy by predicting "background" for every pixel — while being completely useless. Purpose-built segmentation losses address this.

Loss Function	Formula (simplified)	When to Use	Handles Imbalance?
Cross-Entropy	−Σ y·log(ŷ)	Balanced classes, baselines	✘ No
Dice Loss	1 − 2·|A∩B| / (|A|+|B|)	Medical imaging, rare foreground	✔ Yes (normalised by set size)
IoU / Jaccard Loss	1 − |A∩B| / |A∪B|	When IoU metric is the target	✔ Yes
Focal Loss	−(1−ŷ)^γ · y·log(ŷ)	Extreme class imbalance, small objects	✔ Focuses on hard examples
Tversky Loss	1 − TP / (TP + α·FP + β·FN)	When false negatives more costly (medical)	✔ Customisable FP/FN penalty
BCE + Dice (Combined)	BCE + λ·DiceLoss	General-purpose best practice	✔ Best of both
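To show how the Tversky row relates to the Dice row, here is a small NumPy sketch of both losses on soft predictions. With α = β = 0.5 the Tversky loss reduces to the Dice loss; raising β penalises false negatives more. The function names and the toy values are assumptions for this illustration.

```python
import numpy as np

def soft_dice_loss(probs, target, eps=1e-7):
    """1 - 2|A∩B| / (|A| + |B|), computed on soft (probabilistic) predictions."""
    inter = (probs * target).sum()
    return 1.0 - (2.0 * inter + eps) / (probs.sum() + target.sum() + eps)

def tversky_loss(probs, target, alpha=0.5, beta=0.5, eps=1e-7):
    """1 - TP / (TP + alpha*FP + beta*FN); alpha weights FP, beta weights FN."""
    tp = (probs * target).sum()
    fp = (probs * (1.0 - target)).sum()
    fn = ((1.0 - probs) * target).sum()
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

# Toy prediction that misses part of the foreground (more FN than FP).
probs = np.array([0.9, 0.8, 0.3, 0.1])
target = np.array([1.0, 1.0, 1.0, 0.0])

dice = soft_dice_loss(probs, target)
tversky_balanced = tversky_loss(probs, target)                     # same as Dice
tversky_fn_heavy = tversky_loss(probs, target, alpha=0.3, beta=0.7)
tversky_fp_heavy = tversky_loss(probs, target, alpha=0.7, beta=0.3)

print(f"dice={dice:.4f}  tversky(0.5,0.5)={tversky_balanced:.4f}")
print(f"tversky(0.3,0.7)={tversky_fn_heavy:.4f}  tversky(0.7,0.3)={tversky_fp_heavy:.4f}")
```

Because this prediction under-segments the object, the FN-heavy setting produces the larger loss, which is the behaviour you would want when missing a lesion is worse than over-marking one.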

Dice Loss Implementation

import torch
import torch.nn as nn

class DiceLoss(nn.Module):
    """Soft Dice loss for binary segmentation; expects raw logits."""
    def __init__(self, smooth=1.0):
        super().__init__()
        self.smooth = smooth

    def forward(self, preds, targets):
        # preds:   [B, 1, H, W] raw logits (sigmoid is applied below)
        # targets: [B, 1, H, W] binary mask with values in {0, 1}
        preds   = torch.sigmoid(preds)
        preds_f = preds.reshape(-1)
        tgts_f  = targets.reshape(-1).float()

        intersection = (preds_f * tgts_f).sum()
        dice = (2.0 * intersection + self.smooth) \
             / (preds_f.sum() + tgts_f.sum() + self.smooth)
        return 1.0 - dice

class CombinedLoss(nn.Module):
    """BCE + Dice — the standard combo for medical/binary segmentation."""
    def __init__(self, dice_weight=0.5):
        super().__init__()
        self.bce  = nn.BCEWithLogitsLoss()
        self.dice = DiceLoss()
        self.w = dice_weight

    def forward(self, logits, targets):
        return (1 - self.w) * self.bce(logits, targets) \
             + self.w     * self.dice(logits, targets)

# Usage
criterion = CombinedLoss(dice_weight=0.5)
loss = criterion(model_output, ground_truth_mask)
loss.backward()

Section 09

Evaluation Metrics — How Do We Measure Good Segmentation?

Pixel accuracy ("what fraction of pixels are correctly labelled?") is almost useless in segmentation because background dominates. These metrics are the industry standard.

IoU

Intersection over Union

|Pred ∩ GT| / |Pred ∪ GT|

The gold standard. Measures the overlap between prediction and ground truth as a fraction of their union. 1.0 = perfect, 0.0 = no overlap. Also called Jaccard Index.

mIoU

Mean IoU

mean(IoU per class)

Average IoU across all classes. Treats each class equally regardless of pixel count. The standard benchmark metric for semantic segmentation (PASCAL VOC, Cityscapes).

Dice

Dice Coefficient (F1)

2·|Pred ∩ GT| / (|Pred| + |GT|)

Equivalent to the F1 score. Preferred in medical imaging. Related to IoU: Dice = 2·IoU / (1 + IoU). More forgiving than IoU when overlap is small.
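The stated relationship between Dice and IoU follows directly from the set identity |Pred| + |GT| = |Pred ∪ GT| + |Pred ∩ GT|. A short derivation, using I and U as shorthand introduced here for the intersection and union sizes:

```latex
\begin{aligned}
I &= |\mathrm{Pred} \cap \mathrm{GT}|, \qquad
U = |\mathrm{Pred} \cup \mathrm{GT}|, \qquad
|\mathrm{Pred}| + |\mathrm{GT}| = U + I \\[4pt]
\mathrm{Dice} &= \frac{2I}{|\mathrm{Pred}| + |\mathrm{GT}|}
  = \frac{2I}{U + I}
  = \frac{2\,(I/U)}{1 + I/U}
  = \frac{2\,\mathrm{IoU}}{1 + \mathrm{IoU}} \\[4pt]
\mathrm{Dice} - \mathrm{IoU} &= \frac{\mathrm{IoU}\,(1 - \mathrm{IoU})}{1 + \mathrm{IoU}} \;\ge\; 0
\end{aligned}
```

So Dice is never lower than IoU, and the two agree only at 0 and 1. For example, IoU = 0.5 corresponds to Dice ≈ 0.667, which is why the same prediction looks better when reported as Dice.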

Benchmark Dataset	Task	Primary Metric	State-of-Art Score
PASCAL VOC 2012	Semantic (21 classes)	mIoU	~90.5 mIoU
Cityscapes	Semantic (19 classes, urban)	mIoU	~84.8 mIoU
COCO (val)	Instance segmentation	Mask AP	~58.1 AP
COCO Panoptic	Panoptic segmentation	PQ	~58.8 PQ
MedSeg / BraTS	Brain tumour (3 classes)	Dice	~91.2 Dice
ADE20K	Semantic (150 classes)	mIoU	~62.1 mIoU

Computing IoU and Dice in Python

import numpy as np

def compute_iou(pred_mask, gt_mask):
    """Compute IoU for binary masks (numpy boolean arrays)."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union        = np.logical_or (pred_mask, gt_mask).sum()
    return intersection / (union + 1e-8)

def compute_dice(pred_mask, gt_mask):
    """Compute Dice coefficient for binary masks."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    return 2 * intersection / (pred_mask.sum() + gt_mask.sum() + 1e-8)

def mean_iou(pred_labels, gt_labels, num_classes):
    """Compute mean IoU across all classes for semantic segmentation."""
    ious = []
    for cls in range(num_classes):
        pred = (pred_labels == cls)
        gt   = (gt_labels   == cls)
        if gt.sum() == 0 and pred.sum() == 0:
            continue  # skip absent classes
        ious.append(compute_iou(pred, gt))
    return np.mean(ious)

# Example
pred = np.array([[0,0,1,1], [0,1,1,1], [0,0,1,0]])
gt   = np.array([[0,0,1,1], [0,0,1,1], [0,0,0,0]])
print(f"IoU  = {compute_iou(pred, gt):.4f}")
print(f"Dice = {compute_dice(pred, gt):.4f}")

OUTPUT

IoU  = 0.6667
Dice = 0.8000
Interpretation:
IoU = 0.67 → "moderate overlap" (threshold for detection: IoU > 0.5)
Dice = 0.80 → "good segmentation" (competitive in clinical settings > 0.70)

Section 10

Modern Architecture Comparison

Architecture	Year	Task	Key Innovation	Speed	Accuracy
FCN	2015	Semantic	First end-to-end conv segmentation network	Fast	Moderate
U-Net	2015	Semantic / Medical	Symmetric encoder-decoder with skip connections at every level	Fast	High (medical)
DeepLabV3+	2018	Semantic	ASPP + decoder; atrous convolutions preserve resolution	Medium	Very High
Mask R-CNN	2017	Instance	Extends Faster R-CNN with per-instance mask head + RoI Align	Medium	High
Panoptic FPN	2019	Panoptic	Unified FPN backbone for both semantic and instance heads	Medium	High
SegFormer	2021	Semantic	Hierarchical Vision Transformer encoder; no positional encoding	Fast	SOTA
Segment Anything (SAM)	2023	Universal	Promptable segmentation; trained on 1B masks; zero-shot generalisation	Medium	SOTA (zero-shot)

Section 11

Segment Anything Model (SAM) — The Foundation Model for Segmentation

📖 Story

GPT for Segmentation

In 2023, Meta AI did for segmentation what OpenAI did for language: it trained a single, massive model on an unprecedented dataset (over 1 billion masks across 11 million images) that could segment anything in a new image given a simple prompt. You click a point or draw a box, and SAM returns a precise mask; free-text prompts were explored in the paper but are not part of the released model. No fine-tuning. No task-specific training.

For the first time, segmentation became accessible to non-experts. Architects, archaeologists, surgeons, farmers — anyone who can click a mouse can now segment objects in images at professional quality. SAM democratised segmentation the way ChatGPT democratised NLP.

🖼️

Image Encoder

ViT-H (huge Vision Transformer)

A large Vision Transformer processes the image once and produces a high-dimensional image embedding. This is the expensive part — run once per image, cached for all prompts on that image.

🎯

Prompt Encoder

points / boxes / text / masks

Encodes any combination of prompts: sparse inputs (points, bounding boxes) via positional encodings, and dense inputs (masks) via convolutional embeddings. In the paper, text prompts are encoded with a CLIP text encoder, although the publicly released model does not expose text prompting.

⚡

Mask Decoder

lightweight transformer, ~50ms

A lightweight 2-layer transformer cross-attends image embeddings with prompt embeddings and predicts 3 candidate masks (ambiguity-aware) in under 50ms. Each candidate comes with a predicted IoU score, which is used to rank the masks and select the best one.

Using SAM with a Point Prompt

import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# ── Load SAM model ───────────────────────────────────────────
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")
predictor = SamPredictor(sam)

# ── Prepare image ────────────────────────────────────────────
image = cv2.cvtColor(cv2.imread("dog.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # Encodes once — O(sec) but cached

# ── Point prompt: click on the dog ──────────────────────────
# Format: [[x, y]] coordinates in image pixels
input_point = np.array([[500, 375]])   # click on dog's body
input_label = np.array([1])             # 1 = foreground point

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,  # returns 3 candidate masks
)

# ── Select best mask by confidence score ────────────────────
best_idx  = np.argmax(scores)
best_mask = masks[best_idx]

print(f"Mask scores: {scores}")
print(f"Best mask pixels: {best_mask.sum():,}")

# ── Overlay on image ─────────────────────────────────────────
overlay = image.copy()
overlay[best_mask] = (overlay[best_mask] * 0.5
                      + np.array([0, 120, 255]) * 0.5).astype(np.uint8)

OUTPUT

Mask scores: [0.9843, 0.9512, 0.8734]
Best mask pixels: 48,392
Inference time (mask decoder only): 48ms on RTX 3090
Image encoding time: 2.3s (run once, then free for any prompt)

Section 12

Real-World Applications

🏥

Medical Imaging

Tumour delineation in MRI/CT; organ segmentation for radiation therapy planning; histopathology slide analysis; retinal vessel mapping; polyp detection in endoscopy. U-Net variants dominate this domain.

🚗

Autonomous Driving

Real-time road scene parsing: driveable area, lane markings, pedestrians, cyclists, traffic signs. Cityscapes benchmark. Models: DeepLabV3+, NVIDIA DRIVENET. Panoptic segmentation for full scene understanding.

🛰️

Satellite / Remote Sensing

Land cover classification, flood extent mapping, deforestation monitoring, building footprint extraction, agricultural field delineation. Multispectral inputs with U-Net or SegFormer encoders.

🤖

Robotics & Manipulation

Instance segmentation tells a robot arm exactly which pixels belong to "the red cup" so it can plan a grasp. Depth + segmentation fusion for 6-DoF pose estimation. Real-time requirements drive efficient architectures.

👗

Fashion & E-commerce

Background removal for product photos, virtual try-on (segment clothing → warp to new body pose), style transfer restricted to garment region only. SAM + matting algorithms used in production pipelines.

🌾

Precision Agriculture

Drone imagery segmentation to count crops, detect disease (leaf blight shows as distinct pixel colour), estimate yield, identify weeds vs. crops. Multispectral UAV cameras with U-Net-style models.

Section 13

Complete Training Pipeline — Binary Segmentation with U-Net

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2
import numpy as np
from pathlib import Path

# UNet (Section 05) and CombinedLoss (Section 08) are assumed to be defined or imported.

# ── Dataset ───────────────────────────────────────────────────
class SegDataset(Dataset):
    def __init__(self, img_dir, mask_dir, transform=None):
        self.imgs  = sorted(Path(img_dir).glob("*.png"))
        self.masks = sorted(Path(mask_dir).glob("*.png"))
        self.tfm   = transform

    def __len__(self):  return len(self.imgs)

    def __getitem__(self, i):
        img  = cv2.cvtColor(cv2.imread(str(self.imgs[i])), cv2.COLOR_BGR2RGB)
        mask = cv2.imread(str(self.masks[i]), cv2.IMREAD_GRAYSCALE)
        mask = (mask > 127).astype(np.float32)
        if self.tfm:
            aug  = self.tfm(image=img, mask=mask)
            img, mask = aug["image"], aug["mask"].unsqueeze(0)
        return img, mask

# ── Augmentation pipeline ─────────────────────────────────────
train_tfm = A.Compose([
    A.Resize(256, 256),
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
    A.ElasticTransform(p=0.2),    # crucial for medical imaging
    A.GaussNoise(p=0.2),
    A.Normalize(mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]),
    ToTensorV2(),
])

# ── Training loop ─────────────────────────────────────────────
def train_one_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss, total_iou = 0, 0
    for imgs, masks in loader:
        imgs, masks = imgs.to(device), masks.to(device)
        optimizer.zero_grad()
        preds = model(imgs)
        loss  = criterion(preds, masks)
        loss.backward()
        optimizer.step()
        # Compute batch IoU
        pred_bin = (torch.sigmoid(preds) > 0.5).float()
        inter    = (pred_bin * masks).sum(dim=[1,2,3])
        union    = (pred_bin + masks - pred_bin * masks).sum(dim=[1,2,3])
        total_iou  += (inter / (union + 1e-8)).mean().item()
        total_loss += loss.item()
    n = len(loader)
    return total_loss / n, total_iou / n

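# ── Main training script ──────────────────────────────────────
DEVICE    = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder paths and batch size: point these at your own image and mask folders.
train_ds     = SegDataset("data/images", "data/masks", transform=train_tfm)
train_loader = DataLoader(train_ds, batch_size=8, shuffle=True)

model     = UNet(in_channels=3, num_classes=1).to(DEVICE)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
criterion = CombinedLoss(dice_weight=0.5)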

for epoch in range(1, 51):
    loss, iou = train_one_epoch(model, train_loader, optimizer, criterion, DEVICE)
    scheduler.step()
    if epoch % 10 == 0:
        print(f"Epoch {epoch:3d} | Loss: {loss:.4f} | Train IoU: {iou:.4f}")

Section 14

Golden Rules — Segmentation Best Practices

🌿 Segmentation — Non-Negotiable Rules

Never use pixel accuracy as your primary metric. In a road scene with 90% background pixels, a model that predicts "background" for every pixel scores 90% accuracy but is worthless. Always evaluate with mIoU or Dice, which are not dominated by the majority class.

Use BCE + Dice as your default loss. Cross-entropy alone is blind to class imbalance. Dice alone can be unstable when the foreground is very small. Their combination is robust, widely tested, and rarely needs tuning. If you have extreme imbalance (>99% background), add Focal Loss weighting.

Augment aggressively for small datasets. Segmentation models are hungry. If you have fewer than 1,000 images, elastic transforms, random crops, rotations, colour jitter, and mixup are not optional — they are the difference between 0.65 and 0.85 IoU. Use Albumentations over torchvision for joint image-and-mask transforms.

Pre-train your encoder whenever possible. A U-Net with an ImageNet-pretrained ResNet-50 encoder often outperforms a randomly-initialised U-Net, sometimes by 8–15 IoU points, even when the target domain is very different (e.g. medical images). The gain is largest when labelled data is scarce. Use timm or segmentation-models-pytorch to access these backbones easily.

Check your masks before training. Annotation errors in segmentation are silent killers — a single mislabelled mask poisons a training batch. Visualise 10–20 random image-mask pairs before writing a single line of model code. Mask gaps, off-by-one errors, and swapped labels are extremely common.

Use mixed precision training (torch.amp; torch.cuda.amp in older PyTorch releases). Segmentation networks process full-resolution feature maps, so memory is usually the bottleneck rather than compute. FP16 mixed precision can cut memory use by around 40% and speed up training by 1.5–2× on GPUs with tensor cores, usually with no measurable accuracy loss when paired with a GradScaler.

For new tasks: try SAM zero-shot before training from scratch. If your objects are visually distinct and you have sparse labels, SAM with point or box prompts may give you 0.75+ IoU for free. Use this as a data annotation assistant (auto-label with SAM → human verify) rather than only as a final predictor.
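The first rule is easy to verify numerically. The sketch below builds a 90%-background ground truth and scores a degenerate "model" that predicts background everywhere; the helper functions are minimal versions written for this example.

```python
import numpy as np

def pixel_accuracy(pred, gt):
    """Fraction of pixels whose predicted label matches the ground truth."""
    return float((pred == gt).mean())

def mean_iou(pred, gt, num_classes):
    """Average per-class IoU, skipping classes absent from both maps."""
    ious = []
    for cls in range(num_classes):
        p, g = pred == cls, gt == cls
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

# Ground truth: 90% background (class 0), 10% foreground (class 1).
gt = np.zeros((10, 10), dtype=int)
gt[:, 0] = 1

# A degenerate "model" that predicts background everywhere.
pred = np.zeros_like(gt)

acc = pixel_accuracy(pred, gt)
miou = mean_iou(pred, gt, num_classes=2)
print(f"pixel accuracy = {acc:.2f}")   # 0.90
print(f"mIoU           = {miou:.2f}")  # (0.90 + 0.00) / 2 = 0.45
```

Pixel accuracy rewards the model for the 90 easy background pixels, while mIoU exposes that the foreground class has an IoU of zero.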