The Story That Explains YOLO
YOLO is the guard who glances once and instantly says: "There's a person in aisle 3, a handbag on shelf 2, and a suspicious package near exit B." One look. One pass. All detections simultaneously.
That single glance — one forward pass through a neural network — is the entire philosophy behind YOLO: You Only Look Once.
YOLO is a family of real-time object detection models that frames detection as a single regression problem — directly predicting bounding boxes and class probabilities from full images in one evaluation. First introduced by Joseph Redmon et al. in 2015, it revolutionised computer vision by making detection fast enough to run on live video streams.
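Before YOLO, the leading detection pipelines were two-stage: first propose regions of interest, then classify them (R-CNN, Fast R-CNN, Faster R-CNN). YOLO collapsed this into a single-stage approach: one neural network, one pass, all predictions at once. The original model gave up some localisation accuracy in exchange for speed. The YOLO paper reports 45 fps for the base model, against about 7 fps for Faster R-CNN and roughly 0.5 fps for Fast R-CNN, a speedup of close to two orders of magnitude over the latter.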
The Problem YOLO Solves — Object Detection 101
Object detection is the task of simultaneously answering two questions for every object in an image: What is it? (classification) and Where exactly is it? (localisation via a bounding box).
| Detection Approach | Key Idea | Speed | Accuracy | Example |
|---|---|---|---|---|
| Two-stage (Region-Based) | Propose regions → then classify | Slow (5–7 fps) | High | R-CNN, Faster R-CNN |
| One-stage (Anchor-Based) | Predict boxes directly on grid | Fast (45–140 fps) | Good | YOLO, SSD |
| One-stage (Anchor-Free) | Predict centre point directly | Very Fast | SOTA | YOLOv8, FCOS |
| Transformer-Based | Global attention on image tokens | Medium | SOTA | DETR, RT-DETR |
How YOLO Works — The Grid Trick
Intersection over Union (IoU) measures how much two bounding boxes overlap. IoU = Area of Overlap / Area of Union. A perfect prediction has IoU = 1.0. IoU > 0.5 is generally considered a correct detection. IoU is also called the Jaccard Index.
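The IoU formula above can be sketched in a few lines. This is a minimal illustration, assuming boxes are given as corner coordinates (x1, y1, x2, y2); the function name `iou` is chosen here for the example and is not taken from any library.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    # Corners of the overlap rectangle (may be empty)
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes give zero overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes offset by 5 pixels: overlap 25, union 175
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
```

Two boxes that look "half overlapping" to the eye score well under 0.5 here, which is why IoU thresholds feel stricter than intuition suggests.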
Anchor Boxes — YOLO's Secret Weapon
Raw x, y, w, h predictions are unstable to train. From YOLOv2 until the anchor-free releases, YOLO uses anchor boxes (also called prior boxes): pre-defined box shapes computed via k-means clustering on the training set's ground-truth bounding boxes. The network learns offsets from these anchors, not absolute coordinates.
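The clustering step can be sketched as follows. This is an illustrative version, not the code of any particular YOLO release: it assumes boxes are reduced to (width, height) pairs, assigns each box to the anchor it overlaps most (equivalent to a distance of 1 − IoU), and updates each anchor to the mean of its members. The names `wh_iou` and `kmeans_anchors` are hypothetical.

```python
import random


def wh_iou(box, anchor):
    """IoU of two (w, h) boxes, assuming they share the same centre."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs into k anchor shapes, sorted by area."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)  # start from k randomly chosen boxes
    for _ in range(iters):
        # Assignment: each box joins the anchor with the highest IoU
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: wh_iou(b, anchors[i]))
            clusters[best].append(b)

        # Update: each anchor becomes the mean (w, h) of its members
        new_anchors = []
        for i, members in enumerate(clusters):
            if not members:  # keep an empty cluster's anchor unchanged
                new_anchors.append(anchors[i])
                continue
            mean_w = sum(m[0] for m in members) / len(members)
            mean_h = sum(m[1] for m in members) / len(members)
            new_anchors.append((mean_w, mean_h))

        if new_anchors == anchors:  # converged
            break
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])


# Toy data: three small, roughly square boxes and three tall, large boxes
boxes = [(10, 12), (12, 10), (11, 11), (100, 200), (110, 190), (105, 195)]
print(kmeans_anchors(boxes, k=2))
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, since an error of a few pixels matters far more for a small box than for a large one.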
| YOLO Version | Year | Anchors Per Cell | Grid Scales | Backbone | mAP (approx.; COCO unless noted, metric varies by version) |
|---|---|---|---|---|---|
| YOLOv1 | 2015 | None (2 boxes per cell) | 1 (7×7) | Custom CNN | ~63% (VOC 2007) |
| YOLOv2 | 2016 | 5 | 1 | Darknet-19 | ~48% |
| YOLOv3 | 2018 | 3 × 3 scales | 3 (multi-scale) | Darknet-53 | ~55% |
| YOLOv4 | 2020 | 3 × 3 scales | 3 | CSPDarknet-53 | ~65% |
| YOLOv5 | 2020 | 3 × 3 scales | 3 | CSP + Focus | ~56% (val) |
| YOLOv8 | 2023 | Anchor-Free | 3 | C2f + CSP | ~53% (val) |
| YOLOv10 | 2024 | Anchor-Free | 3 | CSPNet | ~54% (val) |
The YOLO Architecture — Visual Diagram
Every YOLO model is built from three conceptual components: the Backbone, the Neck, and the Head. Understanding this structure helps you know where to customise the model.
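One concrete consequence of this structure is the number of predictions the Head emits. The sketch below assumes the three-scale layout with strides 8, 16 and 32 that YOLOv8 detection models use; the stride values and the function name are assumptions for illustration and are not stated in this article.

```python
def head_grid_sizes(imgsz=640, strides=(8, 16, 32)):
    """Side length of the prediction grid at each detection scale."""
    return [imgsz // s for s in strides]


sizes = head_grid_sizes()
total = sum(g * g for g in sizes)  # one prediction per grid cell per scale
print(sizes, total)  # [80, 40, 20] 8400
```

The fine 80×80 grid is the one responsible for small objects, and the coarse 20×20 grid handles large ones. The total of 8400 is the last dimension you will see on the raw output of an exported 640×640 model.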
YOLO Loss Function — What It's Learning
Training YOLO requires a carefully designed multi-part loss. The network must simultaneously learn to localise boxes accurately and classify objects correctly, while suppressing false positives.
In any image, the vast majority of grid cells contain no object. If every cell were weighted equally, the model would learn to always predict "background." The original YOLO addresses this with loss weighting: it down-weights the confidence loss of empty cells (λ_noobj = 0.5) and up-weights the box-coordinate loss (λ_coord = 5). Modern versions use focal loss or task-aligned assignment strategies to focus learning on hard examples.
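The effect of the two weights can be seen in a toy calculation. This is a simplified sketch in the spirit of the original YOLO loss, not a faithful reimplementation: it assumes one box per cell and takes the per-cell squared errors as already computed (the real loss also uses square roots of width and height). The dictionary keys are hypothetical names chosen for this example.

```python
LAMBDA_COORD = 5.0   # up-weights box-coordinate error in cells with an object
LAMBDA_NOOBJ = 0.5   # down-weights confidence error in empty cells


def weighted_loss(cells):
    """Sum the loss over grid cells.

    Each cell is a dict with 'has_obj' and squared-error terms
    'box_err', 'conf_err' and 'cls_err'.
    """
    total = 0.0
    for c in cells:
        if c["has_obj"]:
            total += LAMBDA_COORD * c["box_err"] + c["conf_err"] + c["cls_err"]
        else:
            # Empty cells contribute only a damped confidence term
            total += LAMBDA_NOOBJ * c["conf_err"]
    return total


cells = [
    {"has_obj": True, "box_err": 0.02, "conf_err": 0.09, "cls_err": 0.04},
    {"has_obj": False, "box_err": 0.0, "conf_err": 0.01, "cls_err": 0.0},
    {"has_obj": False, "box_err": 0.0, "conf_err": 0.01, "cls_err": 0.0},
    {"has_obj": False, "box_err": 0.0, "conf_err": 0.01, "cls_err": 0.0},
]
print(round(weighted_loss(cells), 4))
```

With only three empty cells the damping looks minor, but on a real grid the empty cells outnumber the occupied ones many times over, so halving their contribution materially changes which errors drive the gradient.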
Python Implementation — YOLOv8 with Ultralytics
Installation
# Install the ultralytics package (includes YOLOv8, YOLOv9, YOLOv10); run in a shell
pip install ultralytics

# Verify GPU availability; run in Python
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")
Running Inference on an Image
from ultralytics import YOLO

# Load a pre-trained YOLOv8 nano model (fastest, smallest)
# Options: yolov8n, yolov8s, yolov8m, yolov8l, yolov8x
model = YOLO('yolov8n.pt')  # auto-downloads on first run

# Run inference on a single image
# conf = minimum confidence to keep a box, iou = NMS overlap threshold
results = model('street.jpg', conf=0.25, iou=0.45)

# Access detection results (one Results object per input image)
for r in results:
    boxes = r.boxes        # Boxes object with xyxy, conf, cls
    masks = r.masks        # Segmentation masks (None unless a seg model)
    keypts = r.keypoints   # Pose keypoints (None unless a pose model)
    print(f"Detected {len(boxes)} objects")
    for box in boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        conf = box.conf[0].item()
        cls_id = int(box.cls[0].item())
        label = model.names[cls_id]
        print(f"  {label:15s} conf={conf:.2f} box=[{x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f}]")

# Save annotated image
results[0].save(filename='output.jpg')
print("Saved to output.jpg")
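The `conf` and `iou` arguments in the call above control the post-processing step that turns many raw candidate boxes into a few final detections. The sketch below shows the idea with a plain greedy non-maximum suppression; it is class-agnostic and written for clarity, so treat it as an illustration of the technique rather than what the library does internally. The function names are hypothetical.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, conf_thres=0.25, iou_thres=0.45):
    """Greedy NMS over (x1, y1, x2, y2, conf) tuples."""
    # 1. Drop low-confidence candidates, then visit the rest best-first
    candidates = sorted(
        (d for d in detections if d[4] >= conf_thres),
        key=lambda d: d[4],
        reverse=True,
    )
    kept = []
    for det in candidates:
        # 2. Keep a box only if it does not overlap a kept box too much
        if all(iou(det[:4], k[:4]) <= iou_thres for k in kept):
            kept.append(det)
    return kept


dets = [
    (0, 0, 10, 10, 0.9),    # best box on object A
    (1, 1, 11, 11, 0.8),    # near-duplicate of the first (IoU about 0.68)
    (50, 50, 60, 60, 0.7),  # separate object B
    (0, 0, 10, 10, 0.1),    # below the confidence threshold
]
print(nms(dets))
```

Lowering `iou_thres` makes suppression more aggressive, so fewer overlapping boxes survive; raising it lets closely packed objects keep their own boxes at the cost of more duplicates.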
Real-Time Webcam Detection
import time

import cv2
from ultralytics import YOLO

model = YOLO('yolov8n.pt')
cap = cv2.VideoCapture(0)  # 0 = default webcam

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Run detection + tracking on the frame and time the call
    t0 = time.perf_counter()
    results = model.track(frame, persist=True, conf=0.3)
    elapsed = time.perf_counter() - t0

    # plot() returns a copy of the frame with boxes and labels drawn on it
    annotated = results[0].plot()

    # Display FPS based on the wall-clock time of the model call
    fps = 1.0 / elapsed if elapsed > 0 else 0.0
    cv2.putText(annotated, f'FPS: {fps:.0f}',
                (10, 30), cv2.FONT_HERSHEY_SIMPLEX,
                1, (0, 255, 0), 2)
    cv2.imshow('YOLOv8 Live Detection', annotated)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
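yolov8n (nano) can reach roughly 200 FPS on a modern GPU and around 30 FPS on a modern CPU. yolov8x (extra-large) gives the best accuracy but drops to about 25 FPS on GPU. These figures depend heavily on hardware, input size, and batch size, so benchmark on your own target. For edge devices (Raspberry Pi, Jetson Nano), prefer the nano or small variants. Exporting to ONNX or TensorRT can give a further 2–5× speedup.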
Training YOLO on a Custom Dataset
Step 1 — Dataset Structure (YOLO Format)
# Required folder structure for YOLO training
dataset/
├── images/
│   ├── train/    # Training images (.jpg, .png)
│   └── val/      # Validation images
├── labels/
│   ├── train/    # Annotation .txt files (one per image)
│   └── val/
└── data.yaml     # Dataset config file

# Each .txt label file: one row per object
# Format: class_id cx cy w h  (cx, cy, w, h normalised to 0-1)
# Example label file for an image with 2 objects
# (the trailing comments are for illustration; real label files contain only the numbers):
0 0.512 0.348 0.124 0.210   # class 0 (cracked_cap)
2 0.731 0.512 0.088 0.175   # class 2 (missing_label)
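Most annotation tools give boxes as pixel corners, so a conversion to the normalised centre format is usually needed. A minimal sketch, assuming corner coordinates (x1, y1, x2, y2) in pixels; the helper name `to_yolo_label` is hypothetical.

```python
def to_yolo_label(class_id, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel-space corner box into one YOLO label row."""
    cx = (x1 + x2) / 2 / img_w   # box centre, as a fraction of image width
    cy = (y1 + y2) / 2 / img_h   # box centre, as a fraction of image height
    w = (x2 - x1) / img_w        # box width, as a fraction of image width
    h = (y2 - y1) / img_h        # box height, as a fraction of image height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# A 200x200 pixel box in a 640x480 image
print(to_yolo_label(0, 100, 50, 300, 250, 640, 480))
# 0 0.312500 0.312500 0.312500 0.416667
```

Because x and y are normalised by different dimensions, a square box in pixels generally does not have equal w and h in the label file.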
Step 2 — data.yaml Configuration
# data.yaml: tells YOLO where data is and what classes exist
path: /path/to/dataset   # absolute root
train: images/train
val: images/val
nc: 3                    # number of classes
names:
  0: cracked_cap
  1: unfilled_bottle
  2: missing_label
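A mismatch between the label files and the class list in data.yaml is a common cause of confusing training failures, so a quick sanity check before training can save time. The sketch below validates a single label row against the number of classes; the function name and the exact checks are illustrative assumptions, not part of any library.

```python
def check_label_line(line, nc):
    """Return a list of problems found in one label row (empty list means OK)."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    try:
        class_id = int(parts[0])
        coords = [float(p) for p in parts[1:]]
    except ValueError:
        return ["non-numeric field"]

    problems = []
    if not 0 <= class_id < nc:
        problems.append(f"class id {class_id} outside 0..{nc - 1}")
    if any(not 0.0 <= c <= 1.0 for c in coords):
        problems.append("coordinates must be normalised to 0-1")
    return problems


print(check_label_line("0 0.512 0.348 0.124 0.210", nc=3))  # []
print(check_label_line("3 0.5 0.5 0.1 0.1", nc=3))          # class id out of range
```

Running this over every row of every file in labels/train and labels/val catches off-by-one class ids and pixel coordinates that were never normalised.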
Step 3 — Training
from ultralytics import YOLO

# Load pre-trained model (transfer learning from COCO weights)
model = YOLO('yolov8s.pt')  # small model for balance of speed/accuracy

# Train on custom dataset
results = model.train(
    data='dataset/data.yaml',
    epochs=100,            # training epochs
    imgsz=640,             # input image size
    batch=16,              # batch size (reduce if OOM)
    optimizer='SGD',       # set explicitly: the default 'auto' picks lr0 and momentum itself
    lr0=0.01,              # initial learning rate
    lrf=0.001,             # final LR as a fraction of lr0 (final LR = lr0 * lrf)
    momentum=0.937,
    weight_decay=0.0005,
    warmup_epochs=3,       # gradual LR warmup
    mosaic=1.0,            # mosaic probability (training augmentation is on by default)
    fliplr=0.5,            # horizontal flip probability
    device='0',            # GPU device ('cpu' for CPU training)
    project='runs/detect',
    name='bottle_defect_v1',
    patience=20,           # early stopping patience
    save_period=10,        # save checkpoint every N epochs
)

print(f"Best mAP50: {results.results_dict['metrics/mAP50(B)']:.4f}")
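To make the learning-rate settings concrete, the sketch below shows how the rate could evolve over training. It assumes that `lrf` is a multiplier on `lr0` (so the final rate is lr0 × lrf) and that the decay is linear, which matches my understanding of the Ultralytics default; warmup is ignored and the function name is hypothetical.

```python
def linear_lr(epoch, epochs=100, lr0=0.01, lrf=0.001):
    """Learning rate at a given epoch, decaying linearly from lr0 to lr0 * lrf."""
    frac = max(1 - epoch / epochs, 0) * (1.0 - lrf) + lrf
    return lr0 * frac


for epoch in (0, 50, 100):
    print(epoch, linear_lr(epoch))
```

Under these assumptions, lr0=0.01 with lrf=0.001 ends training at a rate of 0.00001, far lower than the value 0.001 might suggest at first glance.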
Evaluation — Understanding mAP
The standard metric for object detection is mAP (mean Average Precision). Understanding it is essential for knowing whether your model is actually good.
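Average Precision for one class is the area under its precision-recall curve, and mAP is the mean of that value over classes. The sketch below uses all-point interpolation in the PASCAL VOC style as an illustration; COCO-style evaluation differs in detail (it samples the curve at fixed recall points and averages over several IoU thresholds). The function names are hypothetical.

```python
def average_precision(is_tp, n_gt):
    """AP for one class.

    is_tp: list of bools for detections sorted by descending confidence,
           True where the detection matched a ground-truth box
    n_gt:  number of ground-truth boxes for the class
    """
    if n_gt == 0:
        return 0.0

    # Build the precision-recall curve, with sentinel points at both ends
    recalls, precisions = [0.0], [0.0]
    tp = fp = 0
    for hit in is_tp:
        if hit:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / n_gt)
        precisions.append(tp / (tp + fp))
    recalls.append(1.0)
    precisions.append(0.0)

    # Make precision monotonically non-increasing (the "envelope")
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the envelope
    return sum(
        (recalls[i] - recalls[i - 1]) * precisions[i]
        for i in range(1, len(recalls))
    )


def mean_average_precision(per_class_ap):
    """mAP is the plain mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


# Five detections, four ground-truth objects: hits at ranks 1, 2 and 4
print(average_precision([True, True, False, True, False], n_gt=4))  # 0.6875
```

mAP50 applies this with a match defined as IoU ≥ 0.5, while mAP50-95 averages the result over IoU thresholds from 0.5 to 0.95, which rewards tighter boxes.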
from ultralytics import YOLO

# Load trained model
model = YOLO('runs/detect/bottle_defect_v1/weights/best.pt')

# Evaluate on validation set
metrics = model.val(data='dataset/data.yaml', conf=0.001, iou=0.6)
print(f"mAP50:     {metrics.box.map50:.4f}")
print(f"mAP50-95:  {metrics.box.map:.4f}")
print(f"Precision: {metrics.box.mp:.4f}")
print(f"Recall:    {metrics.box.mr:.4f}")

# Per-class breakdown
# ap50 is ordered by ap_class_index (the classes present in the validation set),
# which is not necessarily the same as the class id
for i, cls_id in enumerate(metrics.box.ap_class_index):
    cls_name = model.names[int(cls_id)]
    ap50 = metrics.box.ap50[i]
    print(f"  {cls_name:20s}: AP50 = {ap50:.4f}")
| mAP50 Score | Interpretation | Typical Use Case |
|---|---|---|
| < 0.40 | Poor — model barely learns to detect | More data or architectural fix needed |
| 0.40 – 0.60 | Moderate — detects most objects with many misses | Simple use cases, good starting point |
| 0.60 – 0.80 | Good — reliable in controlled conditions | Production-grade for many applications |
| > 0.80 | Excellent — near-human level for the domain | Safety-critical or high-accuracy use cases |
Key Hyperparameters for Training YOLO
| Parameter | Default | What It Controls | Tuning Advice |
|---|---|---|---|
| imgsz | 640 | Input image resolution (must be multiple of 32) | Start with 640. Try 1280 for small object detection. |
| epochs | 100 | Total training epochs | Set a generous count and rely on patience (early stopping) rather than a fixed number. |
| batch | 16 | Images per gradient update | Maximise to fit GPU. -1 = auto-batch (targets about 60% of GPU VRAM). |
| lr0 | 0.01 | Initial learning rate | Reduce to 0.001 for fine-tuning a pre-trained model. |
| augment | False | Test-time augmentation at prediction. Training augmentation (mosaic, flip, HSV) is on by default and set through mosaic, fliplr, hsv_h, hsv_s, hsv_v | Keep the training augmentations enabled. They are often among the biggest accuracy boosters. |
| degrees | 0.0 | Random rotation augmentation range | Enable (10–15) for aerial/satellite images where rotation matters. |
| mosaic | 1.0 | Mosaic augmentation probability | Disable for the last few epochs to stabilise training (the close_mosaic setting does this, 10 epochs by default). |
| conf | 0.25 | Inference confidence threshold | Lower (0.1) for recall-focused tasks, higher (0.5) for precision. |
| iou | 0.7 | NMS IoU threshold | Raise for dense scenes with overlapping objects; lower to suppress more duplicate detections. |
YOLO Variants — Choosing the Right Model
Model Export and Deployment
Training in PyTorch gives you a flexible .pt file. For production deployment, export to optimised formats that can run without Python or PyTorch installed.
from ultralytics import YOLO

model = YOLO('runs/detect/bottle_defect_v1/weights/best.pt')

# Export to ONNX (cross-platform, runs on CPU/GPU/mobile)
# export() returns the path of the exported file, written next to the .pt weights
onnx_path = model.export(format='onnx', opset=17, simplify=True)

# Export to TensorRT (NVIDIA GPU only; typically the fastest option there)
model.export(format='engine', half=True)  # FP16 precision

# Export to CoreML (Apple devices: iPhone, Mac)
model.export(format='coreml', nms=True)

# Export to TFLite (Android / embedded Linux)
model.export(format='tflite', int8=True)  # INT8 quantisation

# Run inference with the exported ONNX model
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(onnx_path, providers=['CPUExecutionProvider'])
input_name = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name

# Random input, only to check that the model loads and runs
dummy_input = np.random.rand(1, 3, 640, 640).astype(np.float32)
outputs = sess.run([output_name], {input_name: dummy_input})

# (1, 7, 8400): 4 box values + 3 class scores for each of 8400 candidate boxes
print("ONNX output shape:", outputs[0].shape)
| Export Format | Target Platform | Speed Gain | Precision | Use Case |
|---|---|---|---|---|
| .pt (PyTorch) | Python server | Baseline | FP32 | Development, research |
| ONNX | Any platform | 1.5–2× | FP32 | Cross-platform production |
| TensorRT | NVIDIA GPU | 3–5× | FP16/INT8 | Max GPU throughput |
| CoreML | Apple devices | 2–4× | FP16 | iOS / macOS apps |
| TFLite INT8 | Mobile / Edge | 1.5–3× | INT8 | Android, Raspberry Pi |
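Once the model runs outside the Ultralytics wrapper, preprocessing becomes your responsibility. The ONNX snippet above feeds random noise; a real image has to be resized to the model's input size without distorting its aspect ratio, usually by scaling and padding ("letterboxing"). The sketch below computes the scale and padding and maps a predicted point back to the original image. It assumes centred padding and a square 640 input; the function names and the rounding behaviour are illustrative assumptions.

```python
def letterbox_params(img_w, img_h, target=640):
    """Scale factor, resized size and per-side padding for a square input."""
    scale = min(target / img_w, target / img_h)   # fit the longer side
    new_w, new_h = round(img_w * scale), round(img_h * scale)
    pad_x = (target - new_w) / 2                  # padding on left and right
    pad_y = (target - new_h) / 2                  # padding on top and bottom
    return scale, (new_w, new_h), (pad_x, pad_y)


def to_original(x, y, scale, pad):
    """Map a point from letterboxed model space back to the source image."""
    return (x - pad[0]) / scale, (y - pad[1]) / scale


scale, size, pad = letterbox_params(1920, 1080)
print(size, pad)                          # (640, 360) (0.0, 140.0)
print(to_original(320, 320, scale, pad))  # centre of the input maps to the frame centre
```

Forgetting the inverse mapping is a frequent deployment bug: boxes look plausible but sit shifted and shrunk relative to the objects.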
A Complete End-to-End Example — Traffic Monitoring
from collections import defaultdict

import cv2
from ultralytics import YOLO

# Load model: YOLOv8m for good speed/accuracy on RTX 3060
model = YOLO('yolov8m.pt')

# Vehicle classes in COCO dataset
VEHICLE_CLASSES = {2: 'car', 3: 'motorcycle', 5: 'bus', 7: 'truck'}

# Counting line (horizontal, at y = 540 for 1080p video)
LINE_Y = 540

track_hist = defaultdict(list)   # store track paths
vehicle_cnt = defaultdict(int)   # cumulative counts
counted_ids = set()              # track IDs already counted

cap = cv2.VideoCapture('intersection.mp4')

# Match the writer to the source video; a frame-size mismatch gives an unreadable file
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS) or 30
out = cv2.VideoWriter('output_counted.mp4',
                      cv2.VideoWriter_fourcc(*'mp4v'), fps, (width, height))

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Run tracking (ByteTrack maintains object IDs across frames)
    results = model.track(frame, persist=True, tracker='bytetrack.yaml',
                          classes=list(VEHICLE_CLASSES.keys()), conf=0.4)

    if results[0].boxes.id is not None:
        boxes = results[0].boxes.xyxy.cpu().numpy()
        track_ids = results[0].boxes.id.int().cpu().tolist()
        cls_ids = results[0].boxes.cls.int().cpu().tolist()

        for box, tid, cid in zip(boxes, track_ids, cls_ids):
            x1, y1, x2, y2 = box
            cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)

            # Store trajectory (keep the last 30 centre points)
            track_hist[tid].append((cx, cy))
            if len(track_hist[tid]) > 30:
                track_hist[tid].pop(0)

            # Count once when the centre crosses the line moving down the frame
            # (north→south, i.e. y increasing)
            hist = track_hist[tid]
            if (len(hist) >= 2 and
                    hist[-2][1] < LINE_Y <= hist[-1][1] and
                    tid not in counted_ids):
                vehicle_cnt[VEHICLE_CLASSES.get(cid, 'other')] += 1
                counted_ids.add(tid)

    # Draw counting line and totals
    cv2.line(frame, (0, LINE_Y), (width, LINE_Y), (0, 255, 255), 2)
    y_off = 50
    for vtype, count in vehicle_cnt.items():
        cv2.putText(frame, f"{vtype}: {count}",
                    (20, y_off), cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 3)
        y_off += 45

    out.write(frame)

cap.release()
out.release()
print("Vehicle counts:", dict(vehicle_cnt))
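The counting rule in the loop above is easy to get subtly wrong, so it helps to isolate it and test it without a video. The sketch below reproduces the same condition on a list of centre-point y values; the function names are hypothetical and the upward variant is an addition for illustration.

```python
def crossed_downward(prev_y, cur_y, line_y):
    """True when a point moves from above the line to on or below it."""
    return prev_y < line_y <= cur_y


def crossed_upward(prev_y, cur_y, line_y):
    """Mirror case: from below the line to on or above it."""
    return prev_y > line_y >= cur_y


def count_downward_crossings(ys, line_y):
    """Number of downward crossings along one track's y positions."""
    return sum(
        crossed_downward(prev, cur, line_y)
        for prev, cur in zip(ys, ys[1:])
    )


track = [500, 520, 538, 545, 560]  # centre y per frame, moving down the image
print(count_downward_crossings(track, line_y=540))  # 1
```

A track that jitters around the line can cross it several times, which is why the full example also remembers the track IDs it has already counted.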
When to Use YOLO — and When Not To
YOLO vs Other Detectors — Comparison
| Property | Two-stage detector (Faster R-CNN style) | YOLO |
|---|---|---|
| Pipeline | RPN → ROI Pooling → Classifier | Backbone → Neck → Head (1 pass) |
| Speed | ~5 fps (GPU) | ~80 fps (GPU) |
| mAP (COCO) | ~37.9 | ~50.2 |
| Small Objects | Excellent | Good (with SAHI) |
| Deployment | Complex | Simple (single model file) |
| Best For | Highest accuracy, no latency constraint | Real-time, production, edge |
Golden Rules — YOLO in Production
Keep the default training augmentation, including mosaic, enabled. YOLO's default augmentation pipeline (mosaic, random flip, HSV jitter, scale/translate) is highly tuned and often provides a more effective training signal than adding more raw images.