The Story That Explains Object Detection
She is doing, in real time, exactly what an object detection model does: locate every object of interest, classify what it is, and output a confidence score for each detection — all in a single pass over the visual scene.
This is the critical difference between image classification and object detection. Classification answers: "Is there a cat in this image?" Detection answers: "There are three cats. The first is at coordinates (124, 88), 83 pixels wide, 91 pixels tall — confidence 97%. The second is at (340, 210), 71 × 67 pixels — confidence 91%. The third…"
Every self-driving car, every security camera with smart alerts, every medical imaging system that flags lesions, every retail shelf-monitoring robot — they all run an object detection model at their core.
Object Detection is the computer vision task of simultaneously identifying what objects are present in an image and where they are, expressed as axis-aligned bounding boxes. Unlike classification (one label per image) or segmentation (pixel-level masks), detection must handle an arbitrary number of objects per image, at varying scales, with possible occlusion and overlap.
Classification: "This image contains a dog." (one label, no location)
Detection: "There is a dog at [x1,y1,x2,y2] with confidence 0.95, and a cat at [x1,y1,x2,y2] with confidence 0.88." (multiple boxes + labels)
Instance Segmentation: Same as detection, but also produces a pixel-level mask for each detected object. Detection is the foundation; segmentation adds a mask prediction head on top.
The Bounding Box — The Language of Detection
Every detection output is a bounding box (a rectangle around an object) paired with a class label and a confidence score. Three coordinate formats are in common use: corner format [x_min, y_min, x_max, y_max], COCO format [x_min, y_min, w, h], and YOLO format [x_c, y_c, w, h] normalised by image size. Understanding them is essential because different frameworks use different conventions, and mixing them is a common source of bugs in detection pipelines.
import numpy as np
def xyxy_to_xywh(box):
    """Convert [x_min,y_min,x_max,y_max] → [x_c,y_c,w,h] (centre format, still in pixels).

    YOLO labels additionally divide x_c and w by the image width, and y_c and h by the image height.
    """
    x1, y1, x2, y2 = box
    return [(x1+x2)/2, (y1+y2)/2, x2-x1, y2-y1]

def compute_iou(boxA, boxB):
    """Compute IoU between two [x1,y1,x2,y2] boxes."""
    xA = max(boxA[0], boxB[0]); yA = max(boxA[1], boxB[1])
    xB = min(boxA[2], boxB[2]); yB = min(boxA[3], boxB[3])
    inter = max(0, xB-xA) * max(0, yB-yA)
    areaA = (boxA[2]-boxA[0]) * (boxA[3]-boxA[1])
    areaB = (boxB[2]-boxB[0]) * (boxB[3]-boxB[1])
    return inter / (areaA + areaB - inter + 1e-6)  # epsilon avoids division by zero
# Example: ground truth vs predicted box
gt_box = [100, 80, 220, 180] # [x1,y1,x2,y2]
pred_box = [110, 75, 230, 185]
iou = compute_iou(gt_box, pred_box)
print(f"Ground truth box : {gt_box}")
print(f"Predicted box : {pred_box}")
print(f"IoU : {iou:.4f} ({'TP ✓' if iou >= 0.5 else 'FP ✗'})")
print(f"Centre xywh (px) : {[round(v,3) for v in xyxy_to_xywh(pred_box)]}")
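To make the format differences concrete, here is a small sketch that converts one corner-format box into the other two conventions. It assumes the usual trio (corner xyxy, COCO-style top-left plus size, YOLO-style normalised centre); the helper names and the 640×480 image size are illustrative and do not come from any particular library.

```python
def xyxy_to_coco(box):
    """[x_min, y_min, x_max, y_max] -> [x_min, y_min, w, h], all in pixels."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]

def xyxy_to_yolo(box, img_w, img_h):
    """Pixel corner box -> [x_c, y_c, w, h], each divided by the image size."""
    x1, y1, x2, y2 = box
    return [(x1 + x2) / 2 / img_w, (y1 + y2) / 2 / img_h,
            (x2 - x1) / img_w, (y2 - y1) / img_h]

def yolo_to_xyxy(box, img_w, img_h):
    """Inverse of xyxy_to_yolo: normalised centre box -> pixel corner box."""
    xc, yc, w, h = box
    return [(xc - w / 2) * img_w, (yc - h / 2) * img_h,
            (xc + w / 2) * img_w, (yc + h / 2) * img_h]

IMG_W, IMG_H = 640, 480          # assumed image size for the example
pred_box = [110, 75, 230, 185]   # corner format, pixels

print("COCO :", xyxy_to_coco(pred_box))
print("YOLO :", [round(v, 4) for v in xyxy_to_yolo(pred_box, IMG_W, IMG_H)])
print("Back :", [round(v, 2) for v in yolo_to_xyxy(xyxy_to_yolo(pred_box, IMG_W, IMG_H), IMG_W, IMG_H)])
```

Note that the YOLO conversion needs the image size while the other two do not, which is why a box copied between pipelines without its image dimensions cannot be converted reliably.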
The Two Families — Two-Stage vs One-Stage Detectors
Now imagine a security gate with a camera that fires one neural network pass and simultaneously outputs: "Face detected at (x,y,w,h) — employee ID: Sarah Chen — confidence 99%." No separate region proposal. Everything in one shot. This is a One-Stage detector — faster, lighter, slightly less accurate on tiny objects — but fast enough for real-time, and now dominant in production systems.
Two-stage detectors:

| Property | Value |
|---|---|
| How it works | Stage 1: Region Proposal Network (RPN) generates candidate boxes. Stage 2: Per-region CNN classifies each proposal. |
| Key models | R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN, Cascade R-CNN |
| Accuracy | Excellent on small objects |
| Speed | 5–15 FPS (GPU) |
| Best for | High-accuracy offline tasks: medical imaging, satellite analysis |

One-stage detectors:

| Property | Value |
|---|---|
| How it works | A single network pass predicts all boxes and class scores simultaneously over a dense grid of anchors or query points. |
| Key models | YOLO family, SSD, RetinaNet, FCOS, DETR, RT-DETR |
| Accuracy | Slightly lower on tiny objects |
| Speed | 30–300+ FPS (GPU) |
| Best for | Real-time video, edge deployment, robotics, drones |
Faster R-CNN — The Two-Stage Classic
import torch
from torchvision.models.detection import (fasterrcnn_resnet50_fpn_v2,
FasterRCNN_ResNet50_FPN_V2_Weights)
from torchvision.utils import draw_bounding_boxes
from PIL import Image
import torchvision.transforms.functional as F
# ── Load pretrained Faster R-CNN (COCO, 91 classes) ──────────────
weights = FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn_v2(weights=weights)
model.eval()
# ── Preprocess: image → normalised tensor ────────────────────────
transform = weights.transforms()
img_pil = Image.open('street.jpg').convert('RGB')
img_tensor = transform(img_pil) # shape: (3, H, W), float32 [0,1]
# ── Run inference ─────────────────────────────────────────────────
with torch.no_grad():
    predictions = model([img_tensor]) # list of dicts, one per image
pred = predictions[0]
print("Keys in prediction dict:", pred.keys())
print(f"Boxes shape: {pred['boxes'].shape}") # (N, 4) xyxy format
print(f"Labels shape: {pred['labels'].shape}") # (N,) class indices
print(f"Scores shape: {pred['scores'].shape}") # (N,) confidence [0,1]
# ── Filter by confidence threshold ───────────────────────────────
threshold = 0.60
keep = pred['scores'] > threshold
boxes = pred['boxes'][keep]
labels = pred['labels'][keep]
scores = pred['scores'][keep]
label_names = weights.meta['categories']
print(f"\nDetections above {threshold:.0%} confidence:")
for box, lbl, scr in zip(boxes, labels, scores):
    name = label_names[lbl.item()]
    x1,y1,x2,y2 = [int(v) for v in box.tolist()]
    print(f" {name:15s} | conf={scr:.3f} | box=[{x1},{y1},{x2},{y2}]")
YOLO — You Only Look Once
The existing wisdom said this was impossible — you needed the two-stage pipeline to achieve useful accuracy. Redmon proved the wisdom wrong. YOLO v1 ran at 45 FPS — real-time — while Faster R-CNN ran at 7 FPS. The accuracy gap was real, but shrinking with every version.
By 2023–2024, YOLOv8 and YOLOv10 had essentially closed the accuracy gap with two-stage methods on standard benchmarks while remaining dramatically faster. The computer vision world shifted. Today, when someone says "deploy a detection model," they often mean a YOLO variant.
# pip install ultralytics
from ultralytics import YOLO
import cv2, time
# ── Load pretrained YOLOv8n (nano — fastest) ─────────────────────
model = YOLO('yolov8n.pt') # downloads on first run
# ── Inference on single image ─────────────────────────────────────
results = model.predict(
source='street.jpg',
conf=0.25, # confidence threshold
iou=0.45, # NMS IoU threshold
imgsz=640, # inference size
verbose=False
)
r = results[0] # first (only) image result
print(f"Detections : {len(r.boxes)}")
print(f"Box format : xyxy\n")
for box in r.boxes:
    cls_id = int(box.cls[0])
    conf = box.conf[0].item()
    xyxy = [int(v) for v in box.xyxy[0].tolist()]
    name = model.names[cls_id]
    print(f" {name:14s} conf={conf:.3f} box={xyxy}")
# ── Reference figures for YOLOv8 variants (hardcoded, not measured by this script) ──
print(f"\n{'Model':12} | {'mAP50-95':>10} | {'Params':>8} | {'ms/img':>8}")
print("-" * 46)
stats = [
    ('YOLOv8n', '37.3%', '3.2M', '1.47ms'),
    ('YOLOv8s', '44.9%', '11.2M', '2.88ms'),
    ('YOLOv8m', '50.2%', '25.9M', '5.09ms'),
]
for name, ap, par, ms in stats:
    print(f" {name:10} | {ap:>10} | {par:>8} | {ms:>8}")
Anchor-Free Detection — FCOS and DETR
Anchors are powerful but introduce significant complexity: you must tune their scales and aspect ratios to your dataset, they create massive class imbalance (most anchors contain no objects), and they add a non-differentiable matching step. Two families of anchor-free detectors solved these problems in fundamentally different ways.
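To make the anchor-free idea concrete, here is a minimal sketch of an FCOS-style training target. Instead of matching anchors to objects, every location that falls inside a ground-truth box regresses its distances to the four box sides, and a centre-ness score down-weights locations near the edge. The function name and sample points are illustrative; this follows the published FCOS formulation and is not the code of any specific library.

```python
import math

def fcos_target(point, box):
    """Return (l, t, r, b, centerness) for a point inside an [x1, y1, x2, y2] box.

    Returns None when the point is outside the box (a negative location).
    """
    x, y = point
    x1, y1, x2, y2 = box
    l, t, r, b = x - x1, y - y1, x2 - x, y2 - y   # distances to the four sides
    if min(l, t, r, b) <= 0:
        return None
    # Centre-ness is 1.0 at the box centre and decays towards 0 near the edges
    centerness = math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
    return l, t, r, b, centerness

gt_box = [100, 80, 220, 180]
print(fcos_target((160, 130), gt_box))   # box centre
print(fcos_target((110, 90), gt_box))    # near the top-left corner
print(fcos_target((50, 50), gt_box))     # outside the box
```

No anchor scales or aspect ratios appear anywhere in this target, which is the tuning burden the anchor-free approach removes.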
Evaluation — Mean Average Precision (mAP) Explained
This is precisely how mAP works in object detection. For each class, the system sweeps over every possible confidence threshold, computes precision and recall at each, and calculates the area under the resulting precision-recall curve (Average Precision). Mean Average Precision is the average of this AP across all classes. It rewards models that are both precise and complete across the full confidence spectrum — not just at one cherry-picked threshold.
# pip install pycocotools
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
import json
# ── Manual mAP@50 for a single class ─────────────────────────────
import numpy as np
def compute_ap(recalls, precisions):
    """Compute AP as area under the precision-recall curve (11-point interpolation).

    recalls and precisions must be NumPy arrays of equal length.
    """
    ap = 0.0
    for thr in np.linspace(0, 1, 11): # 0.0, 0.1, ... 1.0
        prec_at_thr = precisions[recalls >= thr]
        ap += max(prec_at_thr) if len(prec_at_thr) > 0 else 0.0
    return ap / 11
# ── Official COCO evaluation (most common in practice) ────────────
def coco_evaluate(gt_annotation_file, pred_results_file):
    """
    gt_annotation_file : COCO-format ground truth JSON
    pred_results_file : COCO-format predictions JSON
    [{"image_id":1,"category_id":1,"bbox":[x,y,w,h],"score":0.9}, ...]
    """
    coco_gt = COCO(gt_annotation_file)
    coco_pred = coco_gt.loadRes(pred_results_file)
    evaluator = COCOeval(coco_gt, coco_pred, iouType='bbox')
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()
    metrics = {
        'mAP@50-95' : evaluator.stats[0],
        'mAP@50' : evaluator.stats[1],
        'mAP@75' : evaluator.stats[2],
        'AR@1' : evaluator.stats[6],
        'AR@10' : evaluator.stats[7],
        'AR@100' : evaluator.stats[8],
    }
    return metrics
# ── Example (simulated output for illustration) ───────────────────
print("COCO Evaluation Results (YOLOv8m on val2017):")
print(f" mAP@50-95 : 0.502")
print(f" mAP@50 : 0.673")
print(f" mAP@75 : 0.541")
print(f" AR@100 : 0.638")
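The threshold sweep described at the start of this section can be sketched in a few lines of plain Python. The example below assumes each detection has already been marked as a true or false positive at a fixed IoU threshold, and it integrates the full precision-recall curve rather than using the 11-point shortcut. The detection lists are made-up illustrative data, and the function name is hypothetical.

```python
def average_precision(detections, num_gt):
    """AP for ONE class.

    detections : list of (confidence, is_true_positive)
    num_gt     : number of ground-truth boxes of this class
    """
    dets = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in dets:             # lower the threshold one detection at a time
        tp += is_tp
        fp += not is_tp
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing as recall grows (the interpolated envelope)
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the curve: add a rectangle wherever recall increases
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# Illustrative data: two classes
cats = [(0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, False)]
dogs = [(0.90, True)]
ap_cat = average_precision(cats, num_gt=4)   # one cat is never detected
ap_dog = average_precision(dogs, num_gt=1)
print(f"AP(cat) = {ap_cat:.4f}")
print(f"AP(dog) = {ap_dog:.4f}")
print(f"mAP     = {(ap_cat + ap_dog) / 2:.4f}")
```

The missed cat caps recall at 0.75, so its AP cannot reach 1.0 no matter how confident the remaining detections are; this is how mAP penalises incompleteness as well as false alarms.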
Training Your Own Detector with YOLOv8
Labels: one .txt label file per image, with the same filename as the image.
data.yaml fields: path (dataset root), train (relative train images path), val (relative val images path), nc (number of classes), names (list of class names).
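Malformed label files are an easy way to waste a training run, so a quick sanity check before training can help. The sketch below assumes the standard YOLO detection label layout, where each line is `class_id x_c y_c w h` with coordinates normalised to [0, 1]. The function name and the sample label text are hypothetical.

```python
def check_yolo_labels(text, num_classes):
    """Return a list of problems found in the text of one YOLO label file."""
    problems = []
    for n, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue                              # ignore blank lines
        parts = line.split()
        if len(parts) != 5:
            problems.append(f"line {n}: expected 5 fields, got {len(parts)}")
            continue
        try:
            cls = int(parts[0])
            coords = [float(v) for v in parts[1:]]
        except ValueError:
            problems.append(f"line {n}: non-numeric field")
            continue
        if not 0 <= cls < num_classes:
            problems.append(f"line {n}: class id {cls} outside 0..{num_classes - 1}")
        if any(not 0.0 <= v <= 1.0 for v in coords):
            problems.append(f"line {n}: coordinates must be normalised to [0, 1]")
    return problems

sample = """0 0.359375 0.222222 0.093750 0.166667
2 0.5 0.5 0.2 0.6
3 0.5 0.5 0.2 0.6
1 460 160 120 120
"""
for problem in check_yolo_labels(sample, num_classes=3):
    print(problem)
```

Here line 3 uses a class id that does not exist for nc = 3, and line 4 contains pixel coordinates that were never normalised.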
# data.yaml — saved to dataset root
# ────────────────────────────────────────
# path : /absolute/path/to/dataset
# train : images/train
# val : images/val
# nc : 3
# names : ['hardhat', 'vest', 'person']
from ultralytics import YOLO
# ── Start from pretrained COCO weights ────────────────────────────
model = YOLO('yolov8m.pt') # medium model — good accuracy/speed trade-off
# ── Train ─────────────────────────────────────────────────────────
results = model.train(
data='data.yaml',
epochs=100,
imgsz=640,
batch=16,
lr0=0.01, # initial learning rate
lrf=0.01, # final lr = lr0 × lrf (linear decay by default; set cos_lr=True for cosine decay)
warmup_epochs=3, # linear warmup for first 3 epochs
weight_decay=0.0005,
mosaic=1.0, # mosaic augmentation probability
mixup=0.1, # mixup augmentation probability
degrees=10.0, # rotation augmentation ± degrees
flipud=0.0, # vertical flip (0 = off for upright objects)
fliplr=0.5, # horizontal flip probability
hsv_h=0.015, # hue augmentation
hsv_s=0.7, # saturation augmentation
hsv_v=0.4, # value (brightness) augmentation
device=0, # GPU device (0 = first GPU, 'cpu' for CPU)
patience=20, # early stopping — stop if no gain after 20 epochs
save_period=10, # save checkpoint every 10 epochs
project='runs/train',
name='hardhat_detector'
)
print(f"Best mAP50 : {results.results_dict['metrics/mAP50(B)']:.4f}")
print(f"Best mAP50-95: {results.results_dict['metrics/mAP50-95(B)']:.4f}")
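To build intuition for how lr0, lrf, and warmup_epochs interact, here is a simplified per-epoch sketch of the schedule shape: a linear warmup followed by decay from lr0 down to lr0 × lrf, in either a linear or a cosine form. This is an approximation for illustration only. The real trainer updates the rate per batch and handles warmup differently, and the function name is hypothetical.

```python
import math

def lr_at_epoch(epoch, epochs=100, lr0=0.01, lrf=0.01,
                warmup_epochs=3, cosine=False):
    """Approximate learning rate at the start of a given epoch."""
    if cosine:
        # Cosine decay from 1.0 down to lrf
        factor = ((1 - math.cos(epoch * math.pi / epochs)) / 2) * (lrf - 1) + 1
    else:
        # Linear decay from 1.0 down to lrf
        factor = (1 - epoch / epochs) * (1 - lrf) + lrf
    lr = lr0 * factor
    if epoch < warmup_epochs:
        lr *= (epoch + 1) / warmup_epochs   # simplified linear ramp-up
    return lr

for e in (0, 1, 3, 25, 50, 100):
    print(f"epoch {e:3d} | linear {lr_at_epoch(e):.6f} | "
          f"cosine {lr_at_epoch(e, cosine=True):.6f}")
```

Both variants end at lr0 × lrf; the cosine form keeps the rate higher during the first half of training and drops faster in the second half.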
Non-Maximum Suppression — The Final Filter
Most detectors output hundreds to thousands of raw candidate boxes (NMS-free designs such as YOLOv10 and DETR are the exception). NMS is the algorithm that turns this noisy cloud of overlapping boxes into a clean list of final detections. Understanding NMS is critical for debugging false positives and missed detections.
import torch
from torchvision.ops import nms, batched_nms
# ── Standard NMS ──────────────────────────────────────────────────
boxes = torch.tensor([
[100, 80, 220, 180], # box A
[108, 85, 228, 184], # box B — overlaps A heavily
[300, 150, 420, 260], # box C — separate object
[306, 152, 416, 258], # box D — overlaps C
], dtype=torch.float32)
scores = torch.tensor([0.95, 0.87, 0.91, 0.72])
keep = nms(boxes, scores, iou_threshold=0.5)
print("Standard NMS kept indices:", keep.tolist())
print("Kept boxes:")
for i in keep:
    print(f" Box {i.item()} | score={scores[i]:.2f} | {boxes[i].int().tolist()}")
# ── Class-aware NMS ───────────────────────────────────────────────
class_ids = torch.tensor([0, 0, 1, 1]) # boxes A,B = class 0; C,D = class 1
keep_batched = batched_nms(boxes, scores, class_ids, iou_threshold=0.5)
print(f"\nClass-aware NMS kept: {keep_batched.tolist()}")
# ── Soft-NMS ──────────────────────────────────────────────────────
from torchvision.ops import box_iou

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.01):
    """Gaussian Soft-NMS: decay scores of overlapping boxes instead of removing them."""
    keep, soft_scores = [], scores.clone()
    order = soft_scores.argsort(descending=True)
    while len(order) > 0:
        i = order[0].item()
        keep.append(i); order = order[1:]
        if len(order) == 0: break
        # IoU of the selected box against every remaining box
        ious = box_iou(boxes[i].unsqueeze(0), boxes[order]).squeeze(0)
        soft_scores[order] *= torch.exp(-(ious ** 2) / sigma)
        order = order[soft_scores[order] > score_thresh]
        # Re-sort: the decay can change which remaining box scores highest
        order = order[soft_scores[order].argsort(descending=True)]
    return keep, soft_scores
kept, new_scores = soft_nms(boxes, scores)
print(f"\nSoft-NMS kept: {kept}")
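For debugging it helps to see what the library calls do internally. Below is a dependency-free sketch of greedy NMS and a class-aware variant that simply runs NMS once per class. The function names and the bus/car boxes are illustrative, and the suppression rule (drop boxes whose IoU with the kept box exceeds the threshold) is the standard greedy formulation.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the best box, drop everything overlapping it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

def class_aware_nms(boxes, scores, class_ids, iou_threshold=0.5):
    """Run greedy NMS independently for each class id."""
    keep = []
    for c in sorted(set(class_ids)):
        idx = [i for i, cid in enumerate(class_ids) if cid == c]
        kept = greedy_nms([boxes[i] for i in idx], [scores[i] for i in idx],
                          iou_threshold)
        keep += [idx[k] for k in kept]
    return sorted(keep, key=lambda i: scores[i], reverse=True)

# A bus and a car that overlap heavily (illustrative boxes)
boxes = [[100, 80, 220, 180], [108, 85, 228, 184]]
scores = [0.95, 0.87]
class_ids = [0, 1]   # 0 = bus, 1 = car

print("Global NMS keeps     :", greedy_nms(boxes, scores))
print("Class-aware NMS keeps:", class_aware_nms(boxes, scores, class_ids))
```

Global NMS discards the car because it overlaps the higher-scoring bus, while the class-aware version keeps both, since boxes only compete within their own class.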
Data Augmentation for Detection
Detection augmentation is harder than classification augmentation because box coordinates must be transformed alongside the image. Flipping an image flips all boxes. Cropping an image removes boxes whose centres fall outside the crop. Rotation requires rotating box corners and recomputing the axis-aligned bounding box. The Albumentations library handles all of this automatically.
# pip install albumentations
import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2, numpy as np
# ── Detection-specific transform pipeline ────────────────────────
# Boxes in YOLO format: [x_c, y_c, w, h], normalised to [0,1] (matches format='yolo' below)
train_aug = A.Compose([
A.LongestMaxSize(max_size=640),
A.PadIfNeeded(640, 640, border_mode=cv2.BORDER_CONSTANT, value=114),
A.HorizontalFlip(p=0.5),
A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2,
rotate_limit=10, p=0.5,
border_mode=cv2.BORDER_CONSTANT, value=114),
A.ColorJitter(brightness=0.4, contrast=0.4,
saturation=0.7, hue=0.015, p=0.8),
A.GaussNoise(var_limit=(10.0, 50.0), p=0.3),
A.RandomShadow(p=0.2),
A.CoarseDropout(max_holes=8, max_height=64, max_width=64,
fill_value=114, p=0.3), # CutOut
A.Normalize(mean=[0.485,0.456,0.406],
std=[0.229,0.224,0.225]),
ToTensorV2()
],
# Tell Albumentations how boxes are encoded so it can transform them
bbox_params=A.BboxParams(
format='yolo', # [x_c, y_c, w, h] normalised
min_visibility=0.3, # drop boxes with <30% area remaining after crop
label_fields=['class_ids']
))
# Apply to a sample
image = cv2.imread('sample.jpg')
bboxes = [[0.45, 0.38, 0.28, 0.52]] # one box: [x_c, y_c, w, h] normalised
class_ids = [0]
aug = train_aug(image=image, bboxes=bboxes, class_ids=class_ids)
print(f"Augmented tensor shape : {aug['image'].shape}")
print(f"Transformed boxes : {aug['bboxes']}")
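For intuition about what a library has to do behind the scenes, here is a hand-written sketch of two of the box transforms discussed in this section: a horizontal flip and a crop. The crop keeps a box only if enough of its area survives, mirroring the idea behind the min_visibility setting rather than a centre-based rule. The function names and the example boxes are illustrative.

```python
def hflip_box(box, img_w):
    """Mirror an [x1, y1, x2, y2] pixel box around the vertical image axis."""
    x1, y1, x2, y2 = box
    return [img_w - x2, y1, img_w - x1, y2]   # left and right edges swap roles

def crop_box(box, crop, min_visibility=0.3):
    """Clip a box to a crop window [cx1, cy1, cx2, cy2] and shift it into
    crop coordinates. Returns None if too little of the box remains."""
    x1, y1, x2, y2 = box
    cx1, cy1, cx2, cy2 = crop
    nx1, ny1 = max(x1, cx1), max(y1, cy1)
    nx2, ny2 = min(x2, cx2), min(y2, cy2)
    if nx2 <= nx1 or ny2 <= ny1:
        return None                            # box lies entirely outside the crop
    visible = (nx2 - nx1) * (ny2 - ny1) / ((x2 - x1) * (y2 - y1))
    if visible < min_visibility:
        return None                            # mostly cut off: drop the label
    return [nx1 - cx1, ny1 - cy1, nx2 - cx1, ny2 - cy1]

box = [100, 80, 220, 180]
print("Flipped      :", hflip_box(box, img_w=640))
print("Crop at x=150:", crop_box(box, [150, 0, 640, 480]))
print("Crop at x=200:", crop_box(box, [200, 0, 640, 480]))
```

The second crop leaves only about 17% of the box visible, so the label is dropped rather than kept as a sliver that no longer resembles the object.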
Architecture Comparison — Choosing Your Detector
| Model | Year | mAP@50-95 (COCO) | Params | Speed (A100) | Family | Best For |
|---|---|---|---|---|---|---|
| YOLOv8n | 2023 | 37.3% | 3.2M | 1.47 ms | One-stage | Edge devices, IoT, real-time on CPU |
| YOLOv8m | 2023 | 50.2% | 25.9M | 5.09 ms | One-stage | Best balance: speed + accuracy for production |
| YOLOv8x | 2023 | 53.9% | 68.2M | 12.82 ms | One-stage | Max accuracy one-stage, GPU available |
| YOLOv10-M | 2024 | 51.3% | 15.4M | 4.74 ms | One-stage NMS-free | Faster than YOLOv8m, similar accuracy |
| RT-DETR-L | 2023 | 53.0% | 32M | 9.2 ms | Transformer one-stage | High accuracy, transformer-based, GPU |
| Faster R-CNN R101-FPN | 2017 | 42.0% | 60M | 65 ms | Two-stage | Medical, small-object dense scenes |
| DINO-Swin-L | 2022 | 63.3% | 218M | 200+ ms | Two-stage transformer | Max accuracy, offline batch processing |
| Grounding DINO | 2023 | open-vocab | 172M | ~150 ms | Open-vocabulary | Zero-shot: detect any class from text prompt |
Edge / mobile / real-time video: YOLOv10-N or YOLOv8n (only a few MB, CPU-capable).
Production server (GPU), balanced: YOLOv8m or YOLOv10-M.
Maximum accuracy, batch processing: YOLOv8x or RT-DETR-L.
New categories, no training data: Grounding DINO zero-shot.
Dense small-object scenes (satellite, microscopy): Faster R-CNN with FPN.
Real-Time Video Detection Pipeline
from ultralytics import YOLO
import cv2, time, collections
def run_video_detection(source=0, model_path='yolov8n.pt',
                        conf=0.4, show=True):
    """
    Real-time detection on webcam (source=0) or video file.
    Displays FPS, class counts, and annotated frames.
    """
    model = YOLO(model_path)
    cap = cv2.VideoCapture(source)
    fps_buf = collections.deque(maxlen=30) # rolling 30-frame FPS average
    while cap.isOpened():
        t0 = time.perf_counter()
        ret, frame = cap.read()
        if not ret: break
        # ── Run YOLO inference ────────────────────────────────────
        results = model.predict(frame, conf=conf, verbose=False)[0]
        annotated = results.plot() # draw boxes on frame
        # ── Count detections per class ────────────────────────────
        class_counts = {}
        for box in results.boxes:
            name = model.names[int(box.cls[0])]
            class_counts[name] = class_counts.get(name, 0) + 1
        # ── Overlay FPS and class counts ──────────────────────────
        fps_buf.append(1.0 / (time.perf_counter() - t0 + 1e-9))
        avg_fps = sum(fps_buf) / len(fps_buf)
        cv2.putText(annotated, f"FPS: {avg_fps:.1f}",
                    (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0,255,0), 2)
        y = 60
        for cls_name, cnt in class_counts.items():
            cv2.putText(annotated, f"{cls_name}: {cnt}",
                        (10, y), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255,255,0), 2)
            y += 28
        if show:
            cv2.imshow('YOLOv8 Detection', annotated)
            if cv2.waitKey(1) & 0xFF == ord('q'): break
    cap.release()
    cv2.destroyAllWindows()
# ── Run on webcam ─────────────────────────────────────────────────
run_video_detection(source=0, model_path='yolov8n.pt', conf=0.4)
For maximum FPS: use half-precision inference (model.predict(half=True)), reduce imgsz to 320 or 416, and use TensorRT export (model.export(format='engine')) for NVIDIA GPUs, which is typically 2–4× faster than PyTorch inference. For CPU-only: export to ONNX and run with OpenCV DNN backend or ONNX Runtime.
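When comparing the effect of these speed-ups, it helps to measure FPS consistently. Averaging per-frame 1/dt values, as a simple overlay often does, overstates throughput whenever frame times vary, because a few very fast frames dominate the mean. The sketch below is one alternative under that assumption: it divides the number of frames in the window by their total duration. The class name is hypothetical and the timings are made-up illustrative values.

```python
from collections import deque

class FPSMeter:
    """Rolling FPS over the last `window` frames, as frames / total time."""

    def __init__(self, window=30):
        self.durations = deque(maxlen=window)

    def update(self, frame_seconds):
        """Record one frame duration and return the current rolling FPS."""
        self.durations.append(frame_seconds)
        return len(self.durations) / max(sum(self.durations), 1e-9)

frame_times = [0.010, 0.010, 0.100]   # two fast frames, one slow frame

meter = FPSMeter(window=30)
for dt in frame_times:
    fps = meter.update(dt)

mean_of_rates = sum(1.0 / dt for dt in frame_times) / len(frame_times)
print(f"frames / total time : {fps:.1f} FPS")
print(f"mean of 1/dt values : {mean_of_rates:.1f} FPS")
```

Three frames in 0.12 seconds is 25 FPS of real throughput, while the mean of the instantaneous rates reports 70 FPS for the same frames.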
End-to-End Project — Construction Safety Gear Detection
"""
Full pipeline: dataset creation → training → evaluation → export
Task: Detect hardhats, safety vests, and people on construction sites
Dataset: ~2,000 annotated images (Roboflow export in YOLO format)
"""
import yaml, os
from ultralytics import YOLO
from pathlib import Path
# ── Step 1: Verify dataset structure ─────────────────────────────
dataset_root = Path('safety_gear_dataset')
for split in ['train', 'val']:
    imgs = list((dataset_root / 'images' / split).glob('*.jpg'))
    lbls = list((dataset_root / 'labels' / split).glob('*.txt'))
    print(f"{split:5s}: {len(imgs):,} images, {len(lbls):,} labels")
# ── Step 2: Create data.yaml ──────────────────────────────────────
data_config = {
'path' : str(dataset_root.resolve()),
'train' : 'images/train',
'val' : 'images/val',
'nc' : 3,
'names' : ['hardhat', 'safety_vest', 'person']
}
with open('safety_gear.yaml', 'w') as f:
    yaml.dump(data_config, f)
# ── Step 3: Train YOLOv8m from COCO weights ───────────────────────
model = YOLO('yolov8m.pt')
results = model.train(
data='safety_gear.yaml',
epochs=100, imgsz=640, batch=16,
lr0=0.01, lrf=0.01, momentum=0.937,
weight_decay=0.0005, warmup_epochs=3,
mosaic=1.0, mixup=0.15, copy_paste=0.3,
device=0, patience=20,
project='runs', name='safety_gear_v1'
)
# ── Step 4: Evaluate on validation set ───────────────────────────
val_results = model.val(data='safety_gear.yaml')
print(f"\nValidation Results:")
print(f" mAP50 : {val_results.box.map50:.4f}")
print(f" mAP50-95 : {val_results.box.map:.4f}")
# ── Step 5: Export for deployment ────────────────────────────────
model.export(format='onnx', imgsz=640) # ONNX for cross-platform
model.export(format='engine', imgsz=640) # TensorRT for NVIDIA GPU
model.export(format='coreml', imgsz=640) # CoreML for iPhone/iPad
print("Export complete. Model ready for deployment.")
Common Failure Modes — What Goes Wrong and Why
model.train(auto_augment='anchoropt') or use FCOS (anchor-free).
Golden Rules
Use class-aware NMS (batched_nms), not global NMS. Global NMS can suppress a car detection because a bus detection at the same location has a higher confidence, even though they are different objects. Always run NMS independently per class. In torchvision, use ops.batched_nms(boxes, scores, class_ids, iou_thresh). YOLO frameworks handle this automatically.
Disable mosaic augmentation for the final epochs (close_mosaic=10 in YOLOv8) to allow stable convergence.
auto_augment='anchoropt'.