Object Detection in Python: YOLO, Faster R-CNN, DETR & mAP

Section 01

The Story That Explains Object Detection

📖 Real World Analogy

The Air Traffic Controller and Her Radar Screen

Maria has controlled air traffic for twenty-two years. When she looks at her radar screen, she does not simply see whether an aircraft is present. She sees where each aircraft is on the screen, what kind of aircraft it is — commercial airliner, cargo freighter, private jet — and she tracks multiple aircraft simultaneously, each with its own bounding region on her display.

She is doing, in real time, exactly what an object detection model does: locate every object of interest, classify what it is, and output a confidence score for each detection — all in a single pass over the visual scene.

This is the critical difference between image classification and object detection. Classification answers: "Is there a cat in this image?" Detection answers: "There are three cats. The first is at coordinates (124, 88), 83 pixels wide, 91 pixels tall — confidence 97%. The second is at (340, 210), 71 × 67 pixels — confidence 91%. The third…"

Every self-driving car, every security camera with smart alerts, every medical imaging system that flags lesions, every retail shelf-monitoring robot — they all run an object detection model at their core.

Object Detection is the computer vision task of simultaneously identifying what objects are present in an image and where they are, expressed as axis-aligned bounding boxes. Unlike classification (one label per image) or segmentation (pixel-level masks), detection must handle an arbitrary number of objects per image, at varying scales, with possible occlusion and overlap.

📷

Detection vs Classification vs Segmentation

Classification: "This image contains a dog." (one label, no location)
Detection: "There is a dog at [x1,y1,x2,y2] with confidence 0.95, and a cat at [x1,y1,x2,y2] with confidence 0.88." (multiple boxes + labels)
Instance Segmentation: same as detection, but also produces a pixel-level mask for each detected object.
Detection is the foundation; segmentation adds a mask prediction head on top.

Section 02

The Bounding Box — The Language of Detection

Every detection output is a bounding box (a rectangle around an object) paired with a class label and a confidence score. Understanding the three coordinate formats is essential because different frameworks use different conventions, and mixing them is one of the most common sources of bugs in detection pipelines.

Format 1 — Corner Coordinates (xyxy)

[x_min, y_min, x_max, y_max]

Top-left corner (x_min, y_min) and bottom-right corner (x_max, y_max) in pixels. Used by: PyTorch torchvision detection models and PASCAL VOC annotations. Width = x_max − x_min. Height = y_max − y_min.

Format 2 — Centre + WH (xywh)

[x_centre, y_centre, width, height]

Centre of the box plus its width and height. Used by: YOLO family (normalised to [0,1] relative to image size). Most natural for regression targets because the centre is more stable to predict than corners.

Format 3 — Top-Left + WH (xywh COCO)

[x_min, y_min, width, height]

COCO annotation format. Top-left corner plus width and height. Confusingly shares the name "xywh" with Format 2 — always check the source. x_max = x_min + width, y_max = y_min + height.

IoU — Intersection over Union

IoU = Area(A ∩ B) / Area(A ∪ B)

The universal metric for box quality. IoU = 1.0 means perfect overlap. IoU = 0 means no overlap. A predicted box with IoU ≥ 0.5 vs ground truth is typically considered a correct detection (True Positive).

import numpy as np

def xyxy_to_xywh(box):
    """Convert [x_min,y_min,x_max,y_max] → [x_c,y_c,w,h] in pixels (divide by image size for normalised YOLO values)."""
    x1, y1, x2, y2 = box
    return [(x1+x2)/2, (y1+y2)/2, x2-x1, y2-y1]

def compute_iou(boxA, boxB):
    """Compute IoU between two [x1,y1,x2,y2] boxes."""
    xA = max(boxA[0], boxB[0]); yA = max(boxA[1], boxB[1])
    xB = min(boxA[2], boxB[2]); yB = min(boxA[3], boxB[3])
    inter = max(0, xB-xA) * max(0, yB-yA)
    areaA = (boxA[2]-boxA[0]) * (boxA[3]-boxA[1])
    areaB = (boxB[2]-boxB[0]) * (boxB[3]-boxB[1])
    return inter / (areaA + areaB - inter + 1e-6)

# Example: ground truth vs predicted box
gt_box   = [100, 80, 220, 180]   # [x1,y1,x2,y2]
pred_box = [110, 75, 230, 185]

iou = compute_iou(gt_box, pred_box)
print(f"Ground truth box : {gt_box}")
print(f"Predicted box    : {pred_box}")
print(f"IoU              : {iou:.4f}  ({'TP ✓' if iou >= 0.5 else 'FP ✗'})")
print(f"Centre xywh (px) : {[round(v,3) for v in xyxy_to_xywh(pred_box)]}")

OUTPUT

Ground truth box : [100, 80, 220, 180]
Predicted box    : [110, 75, 230, 185]
IoU              : 0.7746  (TP ✓)
Centre xywh (px) : [170.0, 130.0, 120, 110]
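The snippet above covers only the corner-to-centre conversion in pixels. The sketch below fills in the other directions described in the three format cards: COCO top-left boxes to corners and back, and normalised YOLO values to pixel corners and back. The function names and the 640×480 image size are illustrative assumptions, not part of any library.

```python
def coco_to_xyxy(box):
    """[x_min, y_min, w, h] (COCO annotation) -> [x_min, y_min, x_max, y_max]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]


def xyxy_to_coco(box):
    """[x_min, y_min, x_max, y_max] -> [x_min, y_min, w, h]."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]


def xyxy_to_yolo(box, img_w, img_h):
    """Pixel xyxy -> [x_c, y_c, w, h], each divided by the image size."""
    x1, y1, x2, y2 = box
    return [(x1 + x2) / 2 / img_w, (y1 + y2) / 2 / img_h,
            (x2 - x1) / img_w, (y2 - y1) / img_h]


def yolo_to_xyxy(box, img_w, img_h):
    """Normalised [x_c, y_c, w, h] -> pixel xyxy."""
    xc, yc, w, h = box
    return [(xc - w / 2) * img_w, (yc - h / 2) * img_h,
            (xc + w / 2) * img_w, (yc + h / 2) * img_h]


pred_box = [110, 75, 230, 185]            # same predicted box as above
print(xyxy_to_coco(pred_box))             # [110, 75, 120, 110]

# Hypothetical 640x480 image, used only to show the normalisation round trip
yolo_box = xyxy_to_yolo(pred_box, img_w=640, img_h=480)
round_trip = yolo_to_xyxy(yolo_box, img_w=640, img_h=480)
print([round(v, 1) for v in round_trip])  # [110.0, 75.0, 230.0, 185.0]
```

Keeping each conversion as a named function makes the format explicit at every call site, which is where most mix-ups happen.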

Section 03

The Two Families — Two-Stage vs One-Stage Detectors

📖 Story

The Police Artist vs The Instant Scanner

A police artist works in two stages. First, a witness describes where in the scene the suspect stood — a sketch of the location. Then the artist refines the suspect's features in detail. Two stages: propose a region, then classify it. This is a Two-Stage detector — careful, accurate, slower.

Now imagine a security gate with a camera that fires one neural network pass and simultaneously outputs: "Face detected at (x,y,w,h) — employee ID: Sarah Chen — confidence 99%." No separate region proposal. Everything in one shot. This is a One-Stage detector — faster, lighter, slightly less accurate on tiny objects — but fast enough for real-time, and now dominant in production systems.

🔍 Two-Stage Detectors

Property	Value
How it works	Stage 1: Region Proposal Network (RPN) generates candidate boxes. Stage 2: Per-region CNN classifies each proposal.
Key models	R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN, Cascade R-CNN
Accuracy	Excellent on small objects
Speed	5–15 FPS (GPU)
Best for	High-accuracy offline tasks: medical imaging, satellite analysis

⚡ One-Stage Detectors

Property	Value
How it works	Single network pass predicts all boxes and class scores simultaneously over a dense grid of anchors or query points.
Key models	YOLO family, SSD, RetinaNet, FCOS, DETR, RT-DETR
Accuracy	Slightly lower on tiny objects
Speed	30–300+ FPS (GPU)
Best for	Real-time video, edge deployment, robotics, drones

Section 04

Faster R-CNN — The Two-Stage Classic

Backbone CNN (Feature Extractor)

A pretrained ResNet-50 or ResNet-101 processes the entire input image and produces a rich feature map. The feature map is typically 1/32nd the spatial resolution of the input — a 640×640 image produces a 20×20 feature map with 2048 channels. The backbone is shared between the RPN and detection head.

Feature Pyramid Network (FPN)

FPN adds a top-down pathway with lateral connections to the backbone, creating feature maps at multiple scales (P2–P5). This is crucial for detecting objects at different sizes: P2 has high spatial resolution for small objects; P5 has semantic richness for large objects. FPN substantially improved small-object recall.

Region Proposal Network (RPN)

A small convolutional network slides over every spatial position on the feature map. At each position it evaluates k=9 anchors (3 scales × 3 aspect ratios). For each anchor it predicts: (1) an objectness score (is there anything here?) and (2) box regression offsets (how to adjust the anchor to fit the object better). After NMS, roughly the top 2000 proposals are passed to stage 2 during training; fewer are kept at inference.

RoI Pooling / RoI Align

Each region proposal maps to a region of the feature map. RoI Align extracts a fixed-size feature vector (7×7) for each proposal using bilinear interpolation — eliminating the quantisation errors of the original RoI Pooling. This fixed-size representation feeds the detection head.

Detection Head (FC Layers + NMS)

Two fully-connected layers process each 7×7 feature patch. Output heads: (1) class scores over C+1 classes (including background), (2) refined box coordinates. Non-Maximum Suppression (NMS) filters overlapping boxes with IoU > threshold, keeping only the highest-confidence detection per object.
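To make the backbone stride and the RPN's k=9 anchors concrete, here is a minimal sketch that generates the anchors for one feature-map position and counts how many the RPN would score on a single stride-32 map. The scales (128, 256, 512) and ratios (0.5, 1, 2) are the defaults usually quoted for the original Faster R-CNN; treat them, and the helper name `make_anchors`, as assumptions for illustration.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchors centred at (cx, cy), as xyxy."""
    anchors = []
    for s in scales:
        for r in ratios:                 # r = height / width
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)         # w * h == s**2, so area depends only on scale
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return anchors


stride = 32                              # backbone downsampling factor
img_size = 640
fmap = img_size // stride                # 20 positions per side
anchors = make_anchors(cx=stride / 2, cy=stride / 2)   # anchors for the first cell
total = fmap * fmap * len(anchors)

print(fmap, len(anchors), total)         # 20 9 3600
```

With an FPN the same idea is applied per pyramid level, so the total anchor count is considerably larger than this single-map figure.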

import torch
from torchvision.models.detection import (fasterrcnn_resnet50_fpn_v2,
                                              FasterRCNN_ResNet50_FPN_V2_Weights)
from torchvision.utils import draw_bounding_boxes
from PIL import Image
import torchvision.transforms.functional as F

# ── Load pretrained Faster R-CNN (COCO, 91 classes) ──────────────
weights = FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT
model   = fasterrcnn_resnet50_fpn_v2(weights=weights)
model.eval()

# ── Preprocess: image → normalised tensor ────────────────────────
transform = weights.transforms()
img_pil   = Image.open('street.jpg').convert('RGB')
img_tensor = transform(img_pil)     # shape: (3, H, W), float32 [0,1]

# ── Run inference ─────────────────────────────────────────────────
with torch.no_grad():
    predictions = model([img_tensor])  # list of dicts, one per image

pred = predictions[0]
print("Keys in prediction dict:", pred.keys())
print(f"Boxes  shape: {pred['boxes'].shape}")   # (N, 4) xyxy format
print(f"Labels shape: {pred['labels'].shape}")   # (N,) class indices
print(f"Scores shape: {pred['scores'].shape}")   # (N,) confidence [0,1]

# ── Filter by confidence threshold ───────────────────────────────
threshold   = 0.60
keep        = pred['scores'] > threshold
boxes       = pred['boxes'][keep]
labels      = pred['labels'][keep]
scores      = pred['scores'][keep]
label_names = weights.meta['categories']

print(f"\nDetections above {threshold:.0%} confidence:")
for box, lbl, scr in zip(boxes, labels, scores):
    name = label_names[lbl.item()]
    x1,y1,x2,y2 = [int(v) for v in box.tolist()]
    print(f"  {name:15s} | conf={scr:.3f} | box=[{x1},{y1},{x2},{y2}]")

Section 05

YOLO — You Only Look Once

📖 Story

Joseph Redmon's Single Glance — 2015

In 2015, a PhD student named Joseph Redmon at the University of Washington had an idea as simple as it was radical: what if a neural network looked at the entire image exactly once and simultaneously predicted all boxes and all class labels in a single forward pass? No separate proposal stage. No region-by-region processing. One look.

The existing wisdom said this was impossible — you needed the two-stage pipeline to achieve useful accuracy. Redmon proved the wisdom wrong. YOLO v1 ran at 45 FPS — real-time — while Faster R-CNN ran at 7 FPS. The accuracy gap was real, but shrinking with every version.

By 2023 and 2024, YOLOv8 and YOLOv10 had largely closed the accuracy gap with two-stage methods on standard benchmarks while remaining much faster. The computer vision world shifted. Today, when someone says "deploy a detection model," they very often mean a YOLO variant.

⚡ How YOLO Works — The Core Mechanism

Grid

Divide the image into an S×S grid (e.g. 20×20). Each cell is responsible for detecting objects whose centre falls in that cell. This is the key assignment rule — the centre cell "owns" the detection.

Anchors

Each cell predicts B anchor boxes (e.g. 3 per cell in YOLOv3). Anchors are prior box shapes determined by k-means clustering of ground-truth box sizes on the training set. Each anchor predicts: [x, y, w, h, objectness, class_scores]. Newer variants such as YOLOv8 drop anchors and predict boxes directly at each cell.

Multi-Scale

Modern YOLO variants (v3+) predict at 3 feature map scales: the high-resolution head (stride 8) for small objects, the stride-16 head for medium objects, and the low-resolution head (stride 32) for large objects. The feature pyramid produces features at 1/8, 1/16, and 1/32 of the input resolution.

Loss

Multi-task loss: box regression (IoU loss / CIoU), objectness (binary cross-entropy), classification (cross-entropy or BCE for multi-label). Each component is weighted. Box loss is typically weighted highest.

NMS

After the forward pass, the network outputs thousands of raw predictions. Post-processing first discards low-confidence boxes (score < threshold); Non-Maximum Suppression then removes overlapping detections of the same class (IoU > NMS threshold), keeping the highest-confidence box.
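The grid rule and the "thousands of raw predictions" figure can be checked with a few lines of arithmetic. This is a rough sketch, assuming a 640×640 input and strides 8, 16 and 32; `assign_cell` and `count_predictions` are hypothetical helper names, not Ultralytics functions.

```python
def assign_cell(box_xyxy, stride=32):
    """Return (row, col, dx, dy): the cell that owns the box centre and the
    centre's fractional offset inside that cell."""
    x1, y1, x2, y2 = box_xyxy
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    col, row = int(cx // stride), int(cy // stride)
    dx, dy = cx / stride - col, cy / stride - row      # both in [0, 1)
    return row, col, dx, dy


def count_predictions(img_size=640, strides=(8, 16, 32), per_cell=1):
    """Number of raw box predictions summed over all detection scales."""
    return sum((img_size // s) ** 2 * per_cell for s in strides)


# Box centre is (170, 130); on the stride-32 grid that is row 4, column 5
print(assign_cell([110, 75, 230, 185]))   # (4, 5, 0.3125, 0.0625)

print(count_predictions())                # 6400 + 1600 + 400 = 8400
print(count_predictions(per_cell=3))      # 25200 with 3 anchors per cell
```

These counts explain why a confidence filter followed by NMS is needed before the output is usable.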

# pip install ultralytics
from ultralytics import YOLO
import cv2, time

# ── Load pretrained YOLOv8n (nano — fastest) ─────────────────────
model = YOLO('yolov8n.pt')   # downloads on first run

# ── Inference on single image ─────────────────────────────────────
results = model.predict(
    source='street.jpg',
    conf=0.25,          # confidence threshold
    iou=0.45,           # NMS IoU threshold
    imgsz=640,          # inference size
    verbose=False
)

r = results[0]  # first (only) image result
print(f"Detections : {len(r.boxes)}")
print(f"Box format : xyxy\n")

for box in r.boxes:
    cls_id  = int(box.cls[0])
    conf    = box.conf[0].item()
    xyxy    = [int(v) for v in box.xyxy[0].tolist()]
    name    = model.names[cls_id]
    print(f"  {name:14s} conf={conf:.3f}  box={xyxy}")

# ── Reference figures for YOLOv8 variants (hard-coded, not measured by this script) ──
print(f"\n{'Model':12} | {'mAP50-95':>10} | {'Params':>8} | {'ms/img':>8}")
print("-" * 46)
stats = [
    ('YOLOv8n', '37.3%', '3.2M', '1.47ms'),
    ('YOLOv8s', '44.9%', '11.2M', '2.88ms'),
    ('YOLOv8m', '50.2%', '25.9M', '5.09ms'),
]
for name, ap, par, ms in stats:
    print(f"  {name:10} | {ap:>10} | {par:>8} | {ms:>8}")

OUTPUT

Detections : 9
Box format : xyxy

  person         conf=0.917  box=[142, 88, 318, 412]
  person         conf=0.891  box=[421, 104, 589, 398]
  car            conf=0.954  box=[34, 192, 287, 344]
  car            conf=0.938  box=[512, 188, 741, 330]
  traffic light  conf=0.872  box=[301, 41, 331, 98]
  bicycle        conf=0.801  box=[618, 201, 738, 384]
  dog            conf=0.763  box=[218, 310, 304, 412]
  backpack       conf=0.621  box=[155, 121, 224, 198]
  handbag        conf=0.582  box=[430, 274, 498, 350]

Model        |   mAP50-95 |   Params |   ms/img
----------------------------------------------
  YOLOv8n    |      37.3% |     3.2M |   1.47ms
  YOLOv8s    |      44.9% |    11.2M |   2.88ms
  YOLOv8m    |      50.2% |    25.9M |   5.09ms

Section 06

Anchor-Free Detection — FCOS and DETR

Anchors are powerful but introduce significant complexity: you must tune their scales and aspect ratios to your dataset, they create massive class imbalance (most anchors contain no objects), and they add a non-differentiable matching step. Two families of anchor-free detectors solved these problems in fundamentally different ways.

FCOS

Fully Convolutional One-Stage (2019)

Eliminates anchors entirely. For each feature-map location (i,j) that falls inside a ground-truth box, predicts 4 distances from that location to the box edges: (left, top, right, bottom). Also predicts a "centerness" score that down-weights detections far from box centres, reducing false positives. Simpler, competitive accuracy, no anchor hyperparameters.
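A small sketch of the FCOS regression targets may help here. For one location inside a box it computes the four edge distances and the centerness value, using the square-root-of-ratios form from the FCOS paper. The function name `fcos_targets` is hypothetical, and the real model additionally assigns locations to pyramid levels by box size, which is omitted.

```python
import math


def fcos_targets(px, py, box_xyxy):
    """Return (l, t, r, b, centerness) for location (px, py),
    or None when the location lies outside (or on the edge of) the box."""
    x1, y1, x2, y2 = box_xyxy
    l, t, r, b = px - x1, py - y1, x2 - px, y2 - py
    if min(l, t, r, b) <= 0:
        return None
    # 1.0 at the exact box centre, falling towards 0 near the edges
    centerness = math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
    return l, t, r, b, centerness


box = [100, 80, 220, 180]
print(fcos_targets(160, 130, box))   # (60, 50, 60, 50, 1.0) at the centre
print(fcos_targets(110, 90, box))    # small centerness near the top-left corner
print(fcos_targets(50, 50, box))     # None: location is outside the box
```

At inference the centerness is multiplied into the class score, so off-centre locations rank lower before NMS.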

DETR

Detection Transformer (Facebook, 2020)

Treats detection as a set prediction problem. A CNN backbone extracts features; a Transformer encoder-decoder processes them. N=100 learnable "object queries" each decode one detection slot. Uses bipartite matching (Hungarian algorithm) to assign predictions to ground truth — eliminating NMS entirely. Elegant but slow to train.
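The one-to-one assignment at the heart of DETR can be illustrated with a brute-force matcher. This is a deliberately simplified sketch: the cost here is only (1 − IoU), whereas DETR's actual matching cost also includes class probability and box-distance terms, and a real implementation would use a Hungarian solver such as `scipy.optimize.linear_sum_assignment` rather than enumerating permutations. The names `box_iou_xyxy` and `match_predictions` are assumptions for illustration.

```python
from itertools import permutations


def box_iou_xyxy(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_predictions(pred_boxes, gt_boxes):
    """Return {gt_index: pred_index} minimising the total (1 - IoU) cost.
    Each ground-truth box gets exactly one prediction; leftovers are 'no object'."""
    if len(pred_boxes) < len(gt_boxes):
        raise ValueError("need at least as many predictions as ground-truth boxes")
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(1 - box_iou_xyxy(pred_boxes[p], gt_boxes[g])
                   for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return {g: p for g, p in enumerate(best_perm)}


preds = [[300, 150, 420, 260], [0, 0, 10, 10], [105, 82, 225, 178]]
gts = [[100, 80, 220, 180], [306, 152, 416, 258]]
print(match_predictions(preds, gts))   # {0: 2, 1: 0}; prediction 1 is unmatched
```

Because every ground-truth object is matched to exactly one query during training, duplicate boxes are penalised directly and no NMS is needed afterwards.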

RT-DETR

Real-Time Detection Transformer (2023)

Baidu's hybrid: an efficient hybrid encoder (attention applied only to the highest-level features, plus CNN-based cross-scale fusion), IoU-aware query selection, and a transformer decoder with auxiliary prediction heads. Achieves DETR-level accuracy at YOLO-level speed, without NMS. RT-DETR-L reports about 53.0% mAP on COCO at over 100 FPS on a T4 GPU, which put it on the accuracy/speed frontier at release.

YOLOv10

NMS-Free YOLO (2024)

Dual label assignment during training (one-to-many for rich gradients + one-to-one for NMS elimination at inference). Uses a consistent matching strategy that makes NMS unnecessary. Reduces end-to-end latency by 30–50% vs YOLOv8 with similar accuracy. State of the art for edge deployment at its release.

DINO

DETR with Improved DeNoising Anchor Boxes (2022)

State-of-the-art two-stage transformer at its release: 63.3% COCO mAP with a Swin-L backbone. Introduces contrastive denoising training, mixed query selection, and look-forward-twice box regression. Among the most accurate single-model COCO detectors of its time.

SAM + Grounding

Segment Anything Model (2023)

Meta's SAM generates masks for any object given a point/box prompt. Grounding DINO combines an open-vocabulary text encoder with DINO to detect any object described in natural language — "detect all red fire extinguishers" — without task-specific training. Zero-shot open-vocabulary detection.

Section 07

Evaluation — Mean Average Precision (mAP) Explained

📖 Story

The Sommelier's Score Card — Why One Number Hides Everything

A wine competition publishes a single score for each entry. But that score comes out of a structured process: the panel is tested at several levels of strictness (say a 50-point bar, a 75-point bar, and a 90-point bar), and at each bar the judges vote yes or no. The final score averages how often the yes votes were right (precision) across every level of completeness (recall), so it reflects how well the panel discriminated at every possible threshold rather than at one.

This is precisely how mAP works in object detection. For each class, the system sweeps over every possible confidence threshold, computes precision and recall at each, and calculates the area under the resulting precision-recall curve (Average Precision). Mean Average Precision is the average of this AP across all classes. It rewards models that are both precise and complete across the full confidence spectrum — not just at one cherry-picked threshold.

🎯

Precision

TP / (TP+FP)

Of all boxes the model predicted, what fraction were correct? A model that only fires on very obvious objects has high precision — it rarely fires wrong — but may miss many real objects.

📈

Recall

TP / (TP+FN)

Of all real objects in the images, what fraction did the model find? High recall means "found almost everything." A model that predicts boxes everywhere has high recall but terrible precision.

📊

AP (per class)

AUC(PR curve)

Average Precision: area under the precision-recall curve for one class. Sweeps confidence threshold from 0 to 1, computing P and R at each step. Perfect detector: AP = 1.0. Random: AP ≈ class frequency.

🌟

mAP@50

mean AP | IoU≥0.5

Mean AP across all classes using IoU=0.5 as the match threshold (a predicted box is TP if IoU ≥ 0.5 with a ground-truth box). Classic PASCAL VOC metric. More lenient — rewards finding objects even if the box is imprecise.

📉

mAP@50-95

mean over 10 IoU thresholds

COCO metric: mAP averaged over 10 IoU thresholds from 0.50 to 0.95 in steps of 0.05. Much harder than mAP@50 because it rewards tight box precision. YOLOv8x: 53.9% mAP@50-95 on COCO val.

🔍

AR (Average Recall)

mean Recall @ max dets

Average Recall across IoU thresholds and object sizes (small/medium/large). Useful for comparing proposal quality in two-stage detectors. AR@100 = average recall when allowing up to 100 detections per image.
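To connect the definitions above, the following sketch builds a precision-recall curve from a ranked list of detections and integrates it into AP. It assumes each detection has already been marked TP or FP by IoU matching against ground truth, and it uses all-point interpolation (the monotone precision envelope) rather than the 11-point variant. The function names and the toy data are illustrative only.

```python
def pr_curve(scores, is_tp, num_gt):
    """Return (recalls, precisions) after each detection, in descending score order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)          # TP / (TP + FN)
        precisions.append(tp / (tp + fp))    # TP / (TP + FP)
    return recalls, precisions


def average_precision(recalls, precisions):
    """Area under the PR curve using the monotone (non-increasing) precision envelope."""
    envelope = list(precisions)
    for k in range(len(envelope) - 2, -1, -1):
        envelope[k] = max(envelope[k], envelope[k + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, envelope):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Toy example: 5 detections for one class, 4 ground-truth objects in total
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
is_tp = [True, False, True, True, False]
recalls, precisions = pr_curve(scores, is_tp, num_gt=4)
print(recalls)                                   # [0.25, 0.25, 0.5, 0.75, 0.75]
print(average_precision(recalls, precisions))    # 0.625
```

One of the four objects is never found, so recall tops out at 0.75 and AP cannot reach 1.0 regardless of how the confidence threshold is set.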

# pip install pycocotools
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
import json

# ── Manual AP for a single class (average over classes to get mAP) ──
import numpy as np

def compute_ap(recalls, precisions):
    """Approximate the area under the precision-recall curve with 11-point interpolation."""
    recalls, precisions = np.asarray(recalls), np.asarray(precisions)  # accept plain lists
    ap = 0.0
    for thr in np.linspace(0, 1, 11):        # 0.0, 0.1, ... 1.0
        prec_at_thr = precisions[recalls >= thr]
        ap += prec_at_thr.max() if prec_at_thr.size > 0 else 0.0
    return ap / 11

# ── Official COCO evaluation (most common in practice) ────────────
def coco_evaluate(gt_annotation_file, pred_results_file):
    """
    gt_annotation_file  : COCO-format ground truth JSON
    pred_results_file   : COCO-format predictions JSON
                          [{"image_id":1,"category_id":1,"bbox":[x,y,w,h],"score":0.9}, ...]
    """
    coco_gt   = COCO(gt_annotation_file)
    coco_pred = coco_gt.loadRes(pred_results_file)

    evaluator = COCOeval(coco_gt, coco_pred, iouType='bbox')
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()

    metrics = {
        'mAP@50-95' : evaluator.stats[0],
        'mAP@50'    : evaluator.stats[1],
        'mAP@75'    : evaluator.stats[2],
        'AR@1'      : evaluator.stats[6],
        'AR@10'     : evaluator.stats[7],
        'AR@100'    : evaluator.stats[8],
    }
    return metrics

# ── Example (simulated output for illustration) ───────────────────
print("COCO Evaluation Results (YOLOv8m on val2017):")
print(f"  mAP@50-95 : 0.502")
print(f"  mAP@50    : 0.673")
print(f"  mAP@75    : 0.541")
print(f"  AR@100    : 0.638")

OUTPUT

COCO Evaluation Results (YOLOv8m on val2017):
  mAP@50-95 : 0.502
  mAP@50    : 0.673
  mAP@75    : 0.541
  AR@100    : 0.638

Section 08

Training Your Own Detector with YOLOv8

📋 Dataset Preparation — YOLO Format

Structure

Directory layout: dataset/images/train/, dataset/images/val/, dataset/labels/train/, dataset/labels/val/. One .txt label file per image, same filename as the image.

Label file

Each row: class_id x_c y_c width height — all normalised 0–1 relative to image dimensions. Multiple rows = multiple objects. Class IDs are zero-indexed integers.

YAML config

A data.yaml file specifying: path (dataset root), train (relative train images path), val (relative val images path), nc (number of classes), names (list of class names).

Tooling

Use Roboflow, Label Studio, or CVAT for annotation. Roboflow exports directly in YOLO format with train/val/test split and optional augmentation. Use Albumentations for custom augmentation pipelines.
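Label files are easy to get subtly wrong, so a quick sanity check before training can save a run. The sketch below validates the text of one YOLO label file against the rules listed above (five fields per row, integer zero-indexed class ID, coordinates in 0–1). `validate_label_rows` is a hypothetical helper, not part of Ultralytics.

```python
def validate_label_rows(text, num_classes):
    """Return a list of (line_number, problem) tuples for one label file's text."""
    problems = []
    for n, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue                                  # ignore blank lines
        parts = line.split()
        if len(parts) != 5:
            problems.append((n, "expected 5 fields"))
            continue
        try:
            class_id = int(parts[0])                  # must be an integer, e.g. "0" not "0.0"
            coords = [float(v) for v in parts[1:]]
        except ValueError:
            problems.append((n, "malformed number"))
            continue
        if not 0 <= class_id < num_classes:
            problems.append((n, "class id out of range"))
        if any(not 0.0 <= v <= 1.0 for v in coords):
            problems.append((n, "coordinate outside [0, 1]"))
    return problems


sample = (
    "0 0.5 0.5 0.2 0.3\n"      # valid row
    "3 0.4 0.4 0.1 0.1\n"      # class 3 does not exist when nc = 3
    "1 120 80 40 60\n"         # pixel values instead of normalised values
)
print(validate_label_rows(sample, num_classes=3))
# [(2, 'class id out of range'), (3, 'coordinate outside [0, 1]')]
```

The third row shows the most common mistake: exporting pixel coordinates where normalised values are expected.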

# data.yaml — saved to dataset root
# ────────────────────────────────────────
# path  : /absolute/path/to/dataset
# train : images/train
# val   : images/val
# nc    : 3
# names : ['hardhat', 'vest', 'person']

from ultralytics import YOLO

# ── Start from pretrained COCO weights ────────────────────────────
model = YOLO('yolov8m.pt')      # medium model — good accuracy/speed trade-off

# ── Train ─────────────────────────────────────────────────────────
results = model.train(
    data='data.yaml',
    epochs=100,
    imgsz=640,
    batch=16,
    lr0=0.01,              # initial learning rate
    lrf=0.01,              # final lr = lr0 × lrf (linear decay by default; cos_lr=True for cosine)
    warmup_epochs=3,       # linear warmup for first 3 epochs
    weight_decay=0.0005,
    mosaic=1.0,            # mosaic augmentation probability
    mixup=0.1,             # mixup augmentation probability
    degrees=10.0,          # rotation augmentation ± degrees
    flipud=0.0,            # vertical flip (0 = off for upright objects)
    fliplr=0.5,            # horizontal flip probability
    hsv_h=0.015,           # hue augmentation
    hsv_s=0.7,             # saturation augmentation
    hsv_v=0.4,             # value (brightness) augmentation
    device=0,              # GPU device (0 = first GPU, 'cpu' for CPU)
    patience=20,           # early stopping — stop if no gain after 20 epochs
    save_period=10,        # save checkpoint every 10 epochs
    project='runs/train',
    name='hardhat_detector'
)

print(f"Best mAP50   : {results.results_dict['metrics/mAP50(B)']:.4f}")
print(f"Best mAP50-95: {results.results_dict['metrics/mAP50-95(B)']:.4f}")

OUTPUT

Epoch 100/100 ...
    Class   Images  Instances      P      R  mAP50  mAP50-95
      all     1247       8412  0.912  0.887  0.932     0.681
  hardhat     1247       3108  0.941  0.918  0.961     0.712
     vest     1247       2987  0.923  0.891  0.942     0.683
   person     1247       2317  0.872  0.852  0.893     0.648
Best mAP50   : 0.9320
Best mAP50-95: 0.6810

Section 09

Non-Maximum Suppression — The Final Filter

Most detectors output hundreds to thousands of raw box proposals. NMS is the algorithm that turns this noisy cloud of overlapping boxes into a clean list of final detections. Understanding NMS is critical for debugging false positives and missed detections.

☑

Greedy NMS (Standard)

Sort boxes by confidence descending. Take the highest-confidence box, suppress all boxes with IoU > threshold against it, repeat on remaining boxes. Fast but can suppress valid adjacent detections in crowded scenes.

torchvision.ops.nms()

🚩

Soft-NMS

Instead of hard suppression, decays the confidence of overlapping boxes as a function of IoU. With the Gaussian rule below and σ = 0.5, a box with IoU = 0.7 keeps about 38% of its score and one with IoU = 0.9 keeps about 20%. Often better on crowded scenes (pedestrian counting, crowd analysis).

score ← score × exp(−IoU²/σ)

👥

Class-Aware NMS

Apply NMS independently per class. A car and a bus at the same location will not suppress each other — correct, since they are different objects. Always use class-aware NMS in multi-class detection. Torchvision batched_nms does this natively.

torchvision.ops.batched_nms()

⚡

NMS-Free (YOLOv10)

Dual label assignment during training with one-to-one matching at inference eliminates NMS entirely. Reduces latency by removing the non-parallelisable NMS post-processing step. Critical for edge deployment where even 1ms matters.

One-to-one matching at inference

🌐

DIoU-NMS

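Replaces IoU with Distance-IoU in the suppression criterion. DIoU subtracts a normalised centre-distance term from IoU, so two boxes with heavy overlap but clearly separated centres are less likely to suppress each other. This helps preserve detections of side-by-side objects of the same class (e.g. parallel parked cars).

DIoU = IoU − d²/c²  (d = distance between box centres, c = diagonal of the smallest box enclosing both)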

📈

WBF — Weighted Box Fusion

Merges overlapping boxes by averaging their coordinates, weighted by confidence score. Used for ensemble detection: combine predictions from multiple models. Reported to outperform NMS-based ensemble merging on COCO.

ensemble-boxes library
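The DIoU criterion from the DIoU-NMS card is short enough to write out. This sketch computes it for two xyxy boxes, with d as the distance between centres and c as the diagonal of the smallest enclosing box; the function name `diou` and the example boxes are illustrative assumptions.

```python
def diou(a, b):
    """Distance-IoU of two [x1, y1, x2, y2] boxes: IoU - d^2 / c^2."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    iou = inter / union

    # d^2: squared distance between the two box centres
    d2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
       + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    # c^2: squared diagonal of the smallest box enclosing both
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
       + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou - d2 / c2


side_by_side = diou([0, 0, 100, 100], [40, 0, 140, 100])   # centres 40 px apart
concentric = diou([0, 0, 100, 100], [10, 10, 90, 90])      # same centre

print(round(side_by_side, 4))   # 0.3745 (plain IoU would be 0.4286)
print(round(concentric, 4))     # 0.64   (no penalty: centres coincide)
```

In DIoU-NMS a box is suppressed only when this value exceeds the threshold, so the side-by-side pair is less likely to be merged than under plain IoU.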

import torch
from torchvision.ops import nms, batched_nms

# ── Standard NMS ──────────────────────────────────────────────────
boxes  = torch.tensor([
    [100, 80,  220, 180],  # box A
    [108, 85,  228, 184],  # box B — overlaps A heavily
    [300, 150, 420, 260],  # box C — separate object
    [306, 152, 416, 258],  # box D — overlaps C
], dtype=torch.float32)

scores = torch.tensor([0.95, 0.87, 0.91, 0.72])

keep = nms(boxes, scores, iou_threshold=0.5)
print("Standard NMS kept indices:", keep.tolist())
print("Kept boxes:")
for i in keep:
    print(f"  Box {i.item()} | score={scores[i]:.2f} | {boxes[i].int().tolist()}")

# ── Class-aware NMS ───────────────────────────────────────────────
class_ids = torch.tensor([0, 0, 1, 1])  # boxes A,B = class 0; C,D = class 1
keep_batched = batched_nms(boxes, scores, class_ids, iou_threshold=0.5)
print(f"\nClass-aware NMS kept: {keep_batched.tolist()}")

# ── Soft-NMS ──────────────────────────────────────────────────────
from torchvision.ops import box_iou

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.01):
    """Gaussian Soft-NMS: decay scores of overlapping boxes instead of removing them."""
    soft_scores = scores.clone()
    remaining = torch.arange(len(boxes))
    keep = []
    while len(remaining) > 0:
        # Re-select the best remaining box each round, because decay can change the ranking
        top = int(soft_scores[remaining].argmax())
        i = remaining[top].item()
        keep.append(i)
        remaining = torch.cat([remaining[:top], remaining[top + 1:]])
        if len(remaining) == 0:
            break
        ious = box_iou(boxes[i].unsqueeze(0), boxes[remaining])[0]
        soft_scores[remaining] *= torch.exp(-(ious ** 2) / sigma)
        remaining = remaining[soft_scores[remaining] > score_thresh]
    return keep, soft_scores

kept, new_scores = soft_nms(boxes, scores)
print(f"\nSoft-NMS kept: {kept}")

OUTPUT

Standard NMS kept indices: [0, 2]
Kept boxes:
  Box 0 | score=0.95 | [100, 80, 220, 180]
  Box 2 | score=0.91 | [300, 150, 420, 260]

Class-aware NMS kept: [0, 2]

Soft-NMS kept: [0, 2, 1, 3]
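The torchvision call hides the loop, so here is a plain-Python sketch of the greedy procedure described in the Greedy NMS card, run on the same four boxes. It is meant as an illustration of the algorithm, not a replacement for `torchvision.ops.nms`; the helper names are assumptions.

```python
def box_iou_xyxy(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return kept indices, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                 # highest-scoring box still alive
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much
        order = [j for j in order
                 if box_iou_xyxy(boxes[best], boxes[j]) <= iou_threshold]
    return keep


boxes_list = [
    [100, 80, 220, 180],    # box A
    [108, 85, 228, 184],    # box B, overlaps A heavily
    [300, 150, 420, 260],   # box C, separate object
    [306, 152, 416, 258],   # box D, overlaps C
]
scores_list = [0.95, 0.87, 0.91, 0.72]

print(greedy_nms(boxes_list, scores_list))   # [0, 2]
```

The nested comparison is quadratic in the number of boxes in the worst case, which is one reason detectors apply a confidence filter before NMS.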

Section 10

Data Augmentation for Detection

Detection augmentation is harder than classification augmentation because box coordinates must be transformed alongside the image. Flipping an image flips all boxes. Cropping an image removes boxes whose centres fall outside the crop. Rotation requires rotating box corners and recomputing the axis-aligned bounding box. The Albumentations library handles all of this automatically.
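To see what "transforming the boxes alongside the image" means in practice, here is a minimal sketch of two of the cases mentioned above: a horizontal flip and a crop. The crop uses a visibility-fraction rule (drop a box when too little of its area survives), which is one common policy and an assumption here; a centre-inside rule is another. Function names are illustrative.

```python
def hflip_boxes(boxes_xyxy, img_w):
    """Mirror pixel xyxy boxes about the vertical centre line of an image img_w wide."""
    return [[img_w - x2, y1, img_w - x1, y2] for x1, y1, x2, y2 in boxes_xyxy]


def crop_boxes(boxes_xyxy, crop_xyxy, min_visibility=0.3):
    """Clip boxes to the crop window and shift them into crop coordinates.
    Boxes keeping less than min_visibility of their area are dropped."""
    cx1, cy1, cx2, cy2 = crop_xyxy
    out = []
    for x1, y1, x2, y2 in boxes_xyxy:
        nx1, ny1 = max(x1, cx1), max(y1, cy1)
        nx2, ny2 = min(x2, cx2), min(y2, cy2)
        kept_area = max(0, nx2 - nx1) * max(0, ny2 - ny1)
        full_area = (x2 - x1) * (y2 - y1)
        if full_area > 0 and kept_area / full_area >= min_visibility:
            out.append([nx1 - cx1, ny1 - cy1, nx2 - cx1, ny2 - cy1])
    return out


boxes = [[110, 75, 230, 185], [600, 400, 640, 480]]

print(hflip_boxes(boxes, img_w=640))
# [[410, 75, 530, 185], [0, 400, 40, 480]]

print(crop_boxes(boxes, crop_xyxy=[100, 50, 400, 300]))
# [[10, 25, 130, 135]]  (second box lies outside the crop and is dropped)
```

Note that the flip swaps which edge is x_min and which is x_max; forgetting that swap produces boxes with negative width.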

🚩

Mosaic

YOLOv4's killer augmentation

Stitch 4 training images into one 2×2 mosaic. The model sees 4 different contexts in each forward pass, encounters small objects in large-image contexts, and the number of scenes seen per batch roughly quadruples. Widely credited as one of the most effective augmentations in the YOLO family.

✏️

Copy-Paste

Rare object oversampling

Cut objects from one image and paste them (with their masks) into another. Dramatically increases the frequency of rare object classes without additional labelling. Key technique for long-tail datasets (e.g. "fire hydrant" in COCO appears far less than "person").

📈

Scale Jitter

Multi-scale training

Resize images to random scales before each batch. With multi-scale training enabled, YOLOv8 samples from {320, 352, ..., 640, ..., 960, 992} per batch. Forces the model to detect objects at sizes not seen at a fixed resolution. Especially helpful for small-object performance.

🌀

Random Erase / CutOut

Occlusion simulation

Randomly zero out rectangular patches of the image. Teaches the model to detect partially occluded objects, which is critical for real-world scenarios where people stand behind cars, boxes stack in front of products, etc. Choose erase patches so that they do not fully cover any ground-truth box; otherwise the label describes an object the model cannot see.

☀️

Colour Jitter

HSV augmentation

Randomly vary hue, saturation, and value. YOLOv8 defaults: hsv_h=0.015, hsv_s=0.7, hsv_v=0.4. Simulates different lighting conditions. Does not require box coordinate updates — only pixel values change. Very low cost, high benefit.

🎲

MixUp for Detection

Convex combination of image pairs

Blend two images linearly (λ·img1 + (1−λ)·img2) and concatenate their label lists (the labels are concatenated, not blended). The model must predict all boxes from both images simultaneously from the blended input. Often improves generalisation, and is usually applied with a low probability.
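For Mosaic, the label bookkeeping is the part that tends to go wrong, so here is a simplified sketch of it. It assumes a fixed 2×2 layout of equally sized square tiles with no random centre, cropping or rescaling, which real implementations add; `mosaic_boxes` is a hypothetical helper.

```python
def mosaic_boxes(tiles, tile_size):
    """Shift per-tile xyxy boxes into the coordinates of a 2x2 mosaic canvas.

    tiles: four lists of boxes, ordered top-left, top-right, bottom-left,
           bottom-right, each in its own tile's pixel coordinates.
    """
    offsets = [(0, 0), (tile_size, 0), (0, tile_size), (tile_size, tile_size)]
    merged = []
    for boxes, (dx, dy) in zip(tiles, offsets):
        for x1, y1, x2, y2 in boxes:
            merged.append([x1 + dx, y1 + dy, x2 + dx, y2 + dy])
    return merged


tiles = [
    [[10, 20, 110, 120]],                       # top-left image: one object
    [[0, 0, 50, 50], [200, 200, 300, 300]],     # top-right image: two objects
    [],                                         # bottom-left image: no objects
    [[100, 100, 320, 320]],                     # bottom-right image: one object
]

for box in mosaic_boxes(tiles, tile_size=320):
    print(box)
# [10, 20, 110, 120]
# [320, 0, 370, 50]
# [520, 200, 620, 300]
# [420, 420, 640, 640]
```

The merged label list is simply the concatenation of the shifted boxes, the same principle that MixUp applies when it combines the labels of two blended images.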

# pip install albumentations
import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2, numpy as np

# ── Detection-specific transform pipeline ────────────────────────
# Boxes are passed in YOLO format: [x_c, y_c, w, h] normalised to [0,1] (see bbox_params below)
train_aug = A.Compose([
    A.LongestMaxSize(max_size=640),
    A.PadIfNeeded(640, 640, border_mode=cv2.BORDER_CONSTANT, value=114),
    A.HorizontalFlip(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2,
                        rotate_limit=10, p=0.5,
                        border_mode=cv2.BORDER_CONSTANT, value=114),
    A.ColorJitter(brightness=0.4, contrast=0.4,
                  saturation=0.7, hue=0.015, p=0.8),
    A.GaussNoise(var_limit=(10.0, 50.0), p=0.3),
    A.RandomShadow(p=0.2),
    A.CoarseDropout(max_holes=8, max_height=64, max_width=64,
                    fill_value=114, p=0.3),          # CutOut
    A.Normalize(mean=[0.485,0.456,0.406],
                std=[0.229,0.224,0.225]),
    ToTensorV2()
],
# Tell Albumentations how boxes are encoded so it can transform them
bbox_params=A.BboxParams(
    format='yolo',           # [x_c, y_c, w, h] normalised
    min_visibility=0.3,     # drop boxes with <30% area remaining after crop
    label_fields=['class_ids']
))

# Apply to a sample
image    = cv2.cvtColor(cv2.imread('sample.jpg'), cv2.COLOR_BGR2RGB)  # OpenCV loads BGR; Normalize stats are in RGB order
bboxes   = [[0.45, 0.38, 0.28, 0.52]]   # one box: [x_c, y_c, w, h] normalised
class_ids = [0]
aug = train_aug(image=image, bboxes=bboxes, class_ids=class_ids)
print(f"Augmented tensor shape : {aug['image'].shape}")
print(f"Transformed boxes      : {aug['bboxes']}")

OUTPUT

Augmented tensor shape : torch.Size([3, 640, 640])
Transformed boxes      : [(0.4821, 0.3714, 0.2654, 0.4987)]

Section 11

Architecture Comparison — Choosing Your Detector

Model	Year	mAP@50-95 (COCO)	Params	Latency (as reported; hardware varies)	Family	Best For
YOLOv8n	2023	37.3%	3.2M	1.47 ms	One-stage	Edge devices, IoT, real-time on CPU
YOLOv8m	2023	50.2%	25.9M	5.09 ms	One-stage	Best balance: speed + accuracy for production
YOLOv8x	2023	53.9%	68.2M	12.82 ms	One-stage	Max accuracy one-stage, GPU available
YOLOv10-M	2024	51.3%	15.4M	4.74 ms	One-stage NMS-free	Faster than YOLOv8m, similar accuracy
RT-DETR-L	2023	53.0%	32M	9.2 ms	Transformer one-stage	High accuracy, transformer-based, GPU
Faster R-CNN R101-FPN	2017	42.0%	60M	65 ms	Two-stage	Medical, small-object dense scenes
DINO-Swin-L	2022	63.3%	218M	200+ ms	Two-stage transformer	Max accuracy, offline batch processing
Grounding DINO	2023	open-vocab	172M	~150 ms	Open-vocabulary	Zero-shot: detect any class from text prompt

💡

The Practitioner's Decision Framework

Edge / mobile / real-time video: YOLOv10-N or YOLOv8n (roughly 2 to 3M parameters, CPU-capable). Production server (GPU), balanced: YOLOv8m or YOLOv10-M. Maximum accuracy, batch processing: YOLOv8x or RT-DETR-L. New categories, no training data: Grounding DINO zero-shot. Dense small-object scenes (satellite, microscopy): Faster R-CNN with FPN.

Section 12

Real-Time Video Detection Pipeline

from ultralytics import YOLO
import cv2, time, collections

def run_video_detection(source=0, model_path='yolov8n.pt',
                          conf=0.4, show=True):
    """
    Real-time detection on webcam (source=0) or video file.
    Displays FPS, class counts, and annotated frames.
    """
    model  = YOLO(model_path)
    cap    = cv2.VideoCapture(source)
    fps_buf = collections.deque(maxlen=30)   # rolling 30-frame FPS average

    while cap.isOpened():
        t0  = time.perf_counter()
        ret, frame = cap.read()
        if not ret: break

        # ── Run YOLO inference ────────────────────────────────────
        results = model.predict(frame, conf=conf, verbose=False)[0]
        annotated = results.plot()               # draw boxes on frame

        # ── Count detections per class ────────────────────────────
        class_counts = {}
        for box in results.boxes:
            name = model.names[int(box.cls[0])]
            class_counts[name] = class_counts.get(name, 0) + 1

        # ── Overlay FPS and class counts ──────────────────────────
        fps_buf.append(1.0 / (time.perf_counter() - t0 + 1e-9))
        avg_fps = sum(fps_buf) / len(fps_buf)
        cv2.putText(annotated, f"FPS: {avg_fps:.1f}",
                    (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0,255,0), 2)
        y = 60
        for cls_name, cnt in class_counts.items():
            cv2.putText(annotated, f"{cls_name}: {cnt}",
                        (10, y), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255,255,0), 2)
            y += 28

        if show:
            cv2.imshow('YOLOv8 Detection', annotated)
            if cv2.waitKey(1) & 0xFF == ord('q'): break

    cap.release()
    cv2.destroyAllWindows()

# ── Run on webcam ─────────────────────────────────────────────────
run_video_detection(source=0, model_path='yolov8n.pt', conf=0.4)

CONSOLE OUTPUT (while running)

FPS: 47.3 | Detections this frame: 4
person : 2
car    : 1
bicycle: 1
[Press Q to quit]

📷

Optimising Real-Time Detection

For maximum FPS: use half-precision inference (model.predict(half=True)), reduce imgsz to 320 or 416, and use TensorRT export (model.export(format='engine')) for NVIDIA GPUs — typically 2–4× faster than PyTorch inference. For CPU-only: export to ONNX and run with OpenCV DNN backend or ONNX Runtime.

Section 13

End-to-End Project — Construction Safety Gear Detection

"""
Full pipeline: dataset creation → training → evaluation → export
Task: Detect hardhats, safety vests, and people on construction sites
Dataset: ~2,000 annotated images (Roboflow export in YOLO format)
"""
import yaml
from ultralytics import YOLO
from pathlib import Path

# ── Step 1: Verify dataset structure ─────────────────────────────
dataset_root = Path('safety_gear_dataset')
for split in ['train', 'val']:
    imgs = list((dataset_root / 'images' / split).glob('*.jpg'))
    lbls = list((dataset_root / 'labels' / split).glob('*.txt'))
    print(f"{split:5s}: {len(imgs):,} images, {len(lbls):,} labels")

# ── Step 2: Create data.yaml ──────────────────────────────────────
data_config = {
    'path'  : str(dataset_root.resolve()),
    'train' : 'images/train',
    'val'   : 'images/val',
    'nc'    : 3,
    'names' : ['hardhat', 'safety_vest', 'person']
}
with open('safety_gear.yaml', 'w') as f:
    yaml.dump(data_config, f)

# ── Step 3: Train YOLOv8m from COCO weights ───────────────────────
model = YOLO('yolov8m.pt')
results = model.train(
    data='safety_gear.yaml',
    epochs=100, imgsz=640, batch=16,
    lr0=0.01, lrf=0.01, momentum=0.937,
    weight_decay=0.0005, warmup_epochs=3,
    mosaic=1.0, mixup=0.15, copy_paste=0.3,
    device=0, patience=20,
    project='runs', name='safety_gear_v1'
)

# ── Step 4: Evaluate on validation set ───────────────────────────
val_results = model.val(data='safety_gear.yaml')
print(f"\nValidation Results:")
print(f"  mAP50    : {val_results.box.map50:.4f}")
print(f"  mAP50-95 : {val_results.box.map:.4f}")

# ── Step 5: Export for deployment ────────────────────────────────
model.export(format='onnx',    imgsz=640)   # ONNX for cross-platform
model.export(format='engine',  imgsz=640)   # TensorRT for NVIDIA GPU
model.export(format='coreml', imgsz=640)   # CoreML for iPhone/iPad
print("Export complete. Model ready for deployment.")

OUTPUT

train: 1,843 images, 1,843 labels
val  : 412 images, 412 labels
Validation Results:
  mAP50    : 0.9380
  mAP50-95 : 0.6920
Class         P      R      mAP50  mAP50-95
all           0.921  0.904  0.938  0.692
hardhat       0.951  0.934  0.964  0.721
safety_vest   0.933  0.912  0.948  0.698
person        0.879  0.866  0.902  0.657
Export complete. Model ready for deployment.

Section 14

Common Failure Modes — What Goes Wrong and Why

🔍 Missing Small Objects

Symptom: Large objects detected well; small ones ignored. Cause: Feature map resolution too low; anchor sizes not tuned for small objects. Fix: Use FPN for multi-scale features; add a P2 head (higher resolution); increase imgsz to 1280; use tiling inference for very small objects.
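The tiling inference mentioned above can be sketched as follows. This is an illustrative outline only: the helper names (tile_origins, make_tiles, to_full_image) and the 640-pixel tile with 128-pixel overlap are assumptions, and the detector call itself is left out. Each tile would be passed to the model separately, and the per-tile boxes shifted back into full-image coordinates before a final NMS pass merges duplicates in the overlap regions.

```python
def tile_origins(length, tile, overlap):
    """Start offsets of tiles covering [0, length); requires overlap < tile."""
    if length <= tile:
        return [0]
    stride = tile - overlap
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)          # last tile sits flush with the edge
    return origins

def make_tiles(img_w, img_h, tile=640, overlap=128):
    """Return (x1, y1, x2, y2) crop windows that cover the whole image."""
    return [
        (x, y, min(x + tile, img_w), min(y + tile, img_h))
        for y in tile_origins(img_h, tile, overlap)
        for x in tile_origins(img_w, tile, overlap)
    ]

def to_full_image(box, tile_box):
    """Shift a tile-local [x1, y1, x2, y2] box into full-image coordinates."""
    x0, y0 = tile_box[0], tile_box[1]
    return [box[0] + x0, box[1] + y0, box[2] + x0, box[3] + y0]

tiles = make_tiles(1920, 1080)
print(len(tiles))     # 8 tiles for a 1920x1080 frame
print(tiles[-1])      # (1280, 440, 1920, 1080)
```

Because each tile is processed at full resolution, a 20-pixel object keeps its 20 pixels instead of being shrunk when the whole frame is resized to 640.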

⚠️ Duplicate Boxes

Symptom: Same object gets 3–5 overlapping boxes. Cause: NMS IoU threshold too high — overlapping boxes are not suppressed. Fix: Lower the NMS IoU threshold (try 0.30–0.45). Or switch to Soft-NMS for crowded scenes. Check if class-aware NMS is enabled.
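To see how the NMS IoU threshold produces or removes duplicates, here is a minimal pure-Python sketch of greedy NMS plus a per-class wrapper. The box coordinates and scores are made-up illustration data, and the function names are hypothetical; production code would normally rely on the framework's built-in NMS.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh):
    """Greedy NMS: keep a box unless a higher-scoring kept box overlaps it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep

def class_aware_nms(boxes, scores, class_ids, iou_thresh):
    """Run NMS separately for each class so different classes never suppress each other."""
    keep = []
    for cls in sorted(set(class_ids)):
        idx = [i for i, c in enumerate(class_ids) if c == cls]
        kept = nms([boxes[i] for i in idx], [scores[i] for i in idx], iou_thresh)
        keep.extend(idx[k] for k in kept)
    return sorted(keep, key=lambda i: scores[i], reverse=True)

# Three overlapping boxes on one object plus one separate object (illustrative data)
boxes = [[100, 100, 200, 200], [110, 105, 210, 205],
         [120, 110, 220, 215], [400, 400, 500, 500]]
scores = [0.90, 0.80, 0.70, 0.85]

print(nms(boxes, scores, 0.80))   # [0, 3, 1, 2]  threshold too high: duplicates survive
print(nms(boxes, scores, 0.45))   # [0, 3]        duplicates suppressed

# A car and a bus predicted at nearly the same location
same_spot = [[0, 0, 100, 100], [2, 2, 100, 100]]
print(nms(same_spot, [0.9, 0.8], 0.45))                      # [0]
print(class_aware_nms(same_spot, [0.9, 0.8], [0, 1], 0.45))  # [0, 1]
```

The last two lines show the trade-off: global NMS removes the second class entirely, while class-aware NMS keeps both, which is desirable for genuinely different objects but can leave two labels on one object if the classifier is unsure.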

🚫 Low Recall on Rare Classes

Symptom: Common classes detected well; rare classes missed. Cause: Class imbalance — model never sees enough examples of rare classes. Fix: Copy-Paste augmentation for rare classes; oversample rare-class images; reduce confidence threshold for rare classes; class-specific weights in loss function.

🎯 Confidence Calibration Issues

Symptom: Model scores 0.98 on wrong predictions; scores 0.55 on correct ones. Cause: Overconfident model, typically trained without label smoothing and never calibrated after training. Fix: Add label smoothing (0.1); train with focal loss; apply temperature scaling post-training; calibrate with Platt scaling on the validation set.
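Post-training temperature scaling can be sketched as below. The sketch assumes each detection's confidence is the sigmoid of a single logit and that validation detections have been labelled 1 (matched a ground-truth box) or 0 (false positive); both the data and the grid search are illustrative assumptions, not a specific library's calibration API.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the labels after dividing logits by T."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z / temperature)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(logits)

def fit_temperature(logits, labels):
    """Pick the temperature in [0.5, 5.0] that minimises validation NLL."""
    grid = [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda t: nll(logits, labels, t))

# Illustrative validation data: every detection scores ~0.98, only 70% are correct
val_logits = [4.0] * 10
val_labels = [1] * 7 + [0] * 3

T = fit_temperature(val_logits, val_labels)
raw_conf = sigmoid(4.0)
calibrated_conf = sigmoid(4.0 / T)
print(f"T = {T:.2f}, raw = {raw_conf:.2f}, calibrated = {calibrated_conf:.2f}")
```

A fitted temperature above 1 softens the scores toward the observed hit rate. Because dividing by a positive constant preserves the ranking of detections, temperature scaling changes the confidence values but not mAP.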

📈 Wrong Aspect Ratio Anchors

Symptom: Poor recall on elongated objects (poles, buses, pedestrians). Cause: COCO-default anchors are tuned for COCO, not for your domain (e.g. aerial vehicle detection has very different aspect ratios). Fix: For anchor-based detectors, re-run k-means anchor clustering on the box widths and heights of your dataset (YOLOv5 runs an autoanchor check at the start of training), or switch to an anchor-free detector such as FCOS or YOLOv8.
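The k-means anchor clustering mentioned above can be sketched in pure Python. This is a simplified illustration, assuming 1 - IoU between box sizes as the distance and plain mean updates; real frameworks add refinements such as genetic mutation, and the box sizes below are made-up data standing in for the (width, height) pairs of your training labels.

```python
import random

def wh_iou(wh, anchor):
    """IoU of two boxes aligned at the same corner, so only sizes matter."""
    inter = min(wh[0], anchor[0]) * min(wh[1], anchor[1])
    union = wh[0] * wh[1] + anchor[0] * anchor[1] - inter
    return inter / union

def kmeans_anchors(box_sizes, k, iters=50, seed=0):
    """Cluster (w, h) pairs into k anchors, assigning each box to its highest-IoU anchor."""
    rng = random.Random(seed)
    anchors = rng.sample(box_sizes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for wh in box_sizes:
            best = max(range(k), key=lambda j: wh_iou(wh, anchors[j]))
            clusters[best].append(wh)
        new_anchors = []
        for j, members in enumerate(clusters):
            if members:
                mean_w = sum(m[0] for m in members) / len(members)
                mean_h = sum(m[1] for m in members) / len(members)
                new_anchors.append((mean_w, mean_h))
            else:
                new_anchors.append(anchors[j])   # keep the anchor of an empty cluster
        if new_anchors == anchors:
            break                                # assignments have stabilised
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])

# Illustrative label sizes in pixels: tall poles and wide buses
box_sizes = [(10, 100), (12, 110), (8, 90), (120, 40), (110, 38), (130, 42)]
anchors = kmeans_anchors(box_sizes, k=2)
print(anchors)
```

On this toy data the two anchors settle on one tall, narrow shape and one wide, flat shape, which square-ish default anchors would match poorly.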

⛺️ Resolution Mismatch

Symptom: Great validation mAP; terrible real-world performance. Cause: Training images are 640×640; real-world camera is 1920×1080 with different aspect ratio — padding/stretching introduces a distribution shift. Fix: Train with letterbox padding (preserve aspect ratio, pad to square); match inference preprocessing exactly to training preprocessing.
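The letterbox arithmetic behind this fix is short enough to sketch directly. The function names are hypothetical and the snippet only computes the geometry (scale and padding); the actual resize and pad would be done with an image library, and frameworks such as Ultralytics apply their own letterboxing internally.

```python
def letterbox_params(src_w, src_h, dst=640):
    """Scale and padding that fit a src_w x src_h image into a dst x dst square."""
    scale = min(dst / src_w, dst / src_h)          # preserve aspect ratio
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = (dst - new_w) / 2, (dst - new_h) / 2
    return scale, new_w, new_h, pad_x, pad_y

def box_to_letterbox(box, scale, pad_x, pad_y):
    """Map an original-image [x1, y1, x2, y2] box into letterboxed coordinates."""
    return [box[0] * scale + pad_x, box[1] * scale + pad_y,
            box[2] * scale + pad_x, box[3] * scale + pad_y]

def box_from_letterbox(box, scale, pad_x, pad_y):
    """Map a predicted box back to original-image coordinates."""
    return [(box[0] - pad_x) / scale, (box[1] - pad_y) / scale,
            (box[2] - pad_x) / scale, (box[3] - pad_y) / scale]

scale, new_w, new_h, pad_x, pad_y = letterbox_params(1920, 1080)
print(new_w, new_h, pad_x, pad_y)   # 640 360 0.0 140.0

original = [600, 300, 900, 750]
inside = box_to_letterbox(original, scale, pad_x, pad_y)
restored = box_from_letterbox(inside, scale, pad_x, pad_y)
```

A 1920×1080 frame becomes a 640×360 image with 140 pixels of padding above and below. Stretching the same frame straight to 640×640 would instead distort every object's aspect ratio, which is the distribution shift described above.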

Section 15

Golden Rules

📷 Object Detection — Non-Negotiable Rules

Always start from pretrained COCO weights, never from scratch. A YOLOv8m model pretrained on COCO's 80 classes has already learned to detect edges, shapes, object parts, and common objects. Fine-tuning on your 3-class custom dataset takes hours. Training from random weights takes days and consistently produces worse results unless you have over 50,000 labelled images.

Use mAP@50-95, not just mAP@50, as your primary metric. mAP@50 rewards finding objects even if the box is imprecise (any 50% overlap counts). mAP@50-95 rewards tight bounding boxes. Production systems care about box precision — a loose box around a person cannot trigger a safety alert accurately. Always report both; optimise for @50-95 in production.

Match preprocessing at inference exactly to training preprocessing. If you trained with letterboxing (padding to preserve aspect ratio), inference must also use letterboxing. If you trained with imgsz=640, test with imgsz=640. Every difference between train and inference preprocessing is a silent distribution shift that degrades real-world performance without affecting validation metrics.

Use class-aware NMS (batched_nms), not global NMS. Global NMS suppresses a car detection because a bus detection at the same location has a higher confidence — they are different objects. Always run NMS independently per class. In torchvision, use ops.batched_nms(boxes, scores, class_ids, iou_thresh). YOLO frameworks handle this automatically.

Enable mosaic augmentation for all detection training. Mosaic — stitching 4 training images into one — is the single most impactful augmentation in the YOLO family, responsible for 3–7% mAP improvement on COCO. It exposes the model to contextual diversity and small objects simultaneously. Disable only in the final 10–15 epochs (close_mosaic=10 in YOLOv8) to allow stable convergence.

Run anchor clustering on your domain data, not COCO defaults, whenever your detector uses anchors. COCO anchors are optimised for COCO's object size and aspect ratio distribution. Drone footage, medical images, retail shelf images, and industrial inspection images all have dramatically different distributions. Re-cluster anchors with k-means on your training set. YOLOv5 runs an autoanchor check at the start of training; YOLOv8 is anchor-free, so this step does not apply to it.

Validate on images from the actual deployment environment, not benchmark splits. A model trained on daytime street images can report 95% mAP on a daytime validation set and fail completely on night-time footage or rainy conditions. Always collect and annotate a holdout set that reflects your actual deployment distribution — different lighting, weather, camera angles, and object density from your training data.
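The difference between mAP@50 and mAP@50-95 in the rules above comes down to which IoU thresholds a predicted box clears. The sketch below is not a full mAP computation; it only counts, for one made-up ground-truth box and two hypothetical predictions, how many of the ten thresholds from 0.50 to 0.95 each prediction would pass as a true positive.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

# The ten IoU thresholds averaged by mAP@50-95: 0.50, 0.55, ..., 0.95
THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]

def thresholds_passed(pred, gt):
    """IoU thresholds at which this prediction counts as a true positive."""
    value = iou(pred, gt)
    return [round(t, 2) for t in THRESHOLDS if value >= t]

gt_person = [100, 100, 200, 300]    # illustrative ground-truth box
tight_pred = [102, 104, 198, 297]   # hugs the person closely
loose_pred = [80, 80, 220, 330]     # contains the person with a wide margin

print(round(iou(tight_pred, gt_person), 3), thresholds_passed(tight_pred, gt_person))
print(round(iou(loose_pred, gt_person), 3), thresholds_passed(loose_pred, gt_person))
```

Both predictions count as hits at IoU 0.50, so mAP@50 cannot tell them apart. The tight box passes nine of the ten thresholds while the loose box passes only two, and that gap is what mAP@50-95 measures.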