
YOLO in Computer Vision

YOLO (You Only Look Once) is the gold standard for real-time object detection. This tutorial covers how YOLO works (the grid, anchors, loss functions, and multi-scale detection), the evolution from YOLOv1 to YOLOv10, hands-on Python code using Ultralytics for inference, custom training, evaluation, and deployment — all explained through stories, diagrams, and real-world examples.

Section 01

The Story That Explains YOLO

The Security Guard vs The Forensic Analyst
Imagine a shopping mall security guard watching a live camera feed. The old way of doing computer vision was like sending the footage to a forensic lab — they'd scan the whole image multiple times, propose hundreds of suspicious zones, then carefully classify each one. It worked, but it took minutes.

YOLO is the guard who glances once and instantly says: "There's a person in aisle 3, a handbag on shelf 2, and a suspicious package near exit B." One look. One pass. All detections simultaneously.

That single glance — one forward pass through a neural network — is the entire philosophy behind YOLO: You Only Look Once.

YOLO is a family of real-time object detection models that frames detection as a single regression problem — directly predicting bounding boxes and class probabilities from full images in one evaluation. First introduced by Joseph Redmon et al. in 2015, it revolutionised computer vision by making detection fast enough to run on live video streams.

🏦
The Core Insight

Before YOLO, the leading detection pipelines were two-stage: first propose regions of interest, then classify them (R-CNN, Fast R-CNN, Faster R-CNN). YOLO collapsed this into a single-stage approach: one neural network, one pass, all predictions at once. It gave up some accuracy in exchange for a speedup of roughly one to two orders of magnitude over those pipelines.


Section 02

The Problem YOLO Solves — Object Detection 101

Object detection is the task of simultaneously answering two questions for every object in an image: What is it? (classification) and Where exactly is it? (localisation via a bounding box).

🎯 What Every Object Detector Must Output Per Object
x, y
Centre coordinates of the bounding box, normalised to image width and height (0 to 1)
w, h
Width and height of the bounding box, normalised relative to the image dimensions
conf
Objectness confidence score — how certain the model is that an object exists in that box
class
Probability distribution over all possible class labels (e.g. person = 0.97, car = 0.02)
Detection Approach | Key Idea | Speed | Accuracy | Example
Two-stage (Region-Based) | Propose regions → then classify | Slow (5–7 fps) | High | R-CNN, Faster R-CNN
One-stage (Anchor-Based) | Predict boxes directly on grid | Fast (45–140 fps) | Good | YOLO, SSD
One-stage (Anchor-Free) | Predict centre point directly | Very Fast | SOTA | YOLOv8, FCOS
Transformer-Based | Global attention on image tokens | Medium | SOTA | DETR, RT-DETR

Section 03

How YOLO Works — The Grid Trick

The Newspaper Grid
Think of a city map divided into a grid of squares — say 13 × 13. Each square is responsible for reporting objects whose centre falls within that square. If a car's centre is in square (row 4, col 7), then that grid cell and only that cell is responsible for detecting the car. Every cell simultaneously reports its findings. The grid answers all at once — no waiting, no two-stage pipeline.
01
Divide Image into an S × S Grid
The input image (e.g. 416×416 px) is divided into a grid — say 13×13. Each cell covers a ~32×32 pixel region. Every cell is simultaneously responsible for detecting objects whose centres fall within it.
02
Each Cell Predicts B Bounding Boxes
Each grid cell predicts B bounding boxes (e.g. B=3). Each box has 5 values: [x, y, w, h, objectness_confidence]. The x, y are offsets from the cell corner; w, h are relative to anchor dimensions.
03
Each Cell Predicts C Class Probabilities
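In YOLOv1, each cell outputs a single vector of C class probabilities shared by all of its boxes. From YOLOv2 onwards, every anchor box predicts its own class vector (e.g. C=80 for COCO). These probabilities are conditional on an object being present. Multiply by objectness confidence to get the final class-specific confidence.
04
Total Output Tensor: S × S × (B×5 + C)
This is the YOLOv1 formula: for a 7×7 grid, B=2, C=20 (PASCAL VOC), the output is 7×7×(2×5+20) = 7×7×30. With per-anchor class vectors (YOLOv2/v3) the shape becomes S × S × B×(5 + C): for a 13×13 grid, B=3, C=80, that is 13×13×(3×85) = 13×13×255. Either way, the entire prediction is produced in one forward pass through the network.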
05
Non-Maximum Suppression (NMS)
Multiple cells may predict the same object. NMS removes duplicate detections by keeping the box with the highest confidence and suppressing overlapping boxes with IoU above a threshold (e.g. 0.45).
🔑
What is IoU?

Intersection over Union (IoU) measures how much two bounding boxes overlap. IoU = Area of Overlap / Area of Union. A perfect prediction has IoU = 1.0. IoU > 0.5 is generally considered a correct detection. IoU is also called the Jaccard Index.
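The IoU definition and the NMS step described above fit in a few lines of plain Python. This is a minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates; the function names `iou` and `nms` are illustrative and not part of any library.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thres=0.45):
    """Greedy NMS: return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[j]) <= iou_thres for j in keep):
            keep.append(i)
    return keep


boxes = [(100, 100, 200, 200), (110, 110, 210, 210), (300, 300, 400, 400)]
scores = [0.90, 0.75, 0.80]
print(round(iou(boxes[0], boxes[1]), 3))  # overlap of the two near-duplicate boxes
print(nms(boxes, scores))                 # indices that survive suppression
```

The second box overlaps the first with an IoU of about 0.68, which is above the 0.45 threshold, so it is suppressed and only the first and third boxes are kept.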


Section 04

Anchor Boxes — YOLO's Secret Weapon

Raw x, y, w, h predictions are unstable to train. YOLO uses anchor boxes (also called prior boxes) — pre-defined box shapes computed via k-means clustering on the training set's ground-truth bounding boxes. The network learns offsets from these anchors, not absolute coordinates.
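To make the clustering step concrete, here is a small sketch of k-means over box shapes. It assumes the common choice of 1 − IoU as the distance (so the cluster with the highest IoU wins) and a naive initialisation from the first k boxes; `wh_iou`, `kmeans_anchors` and the sample data are hypothetical names and values chosen for illustration.

```python
def wh_iou(box, anchor):
    """IoU of two (w, h) shapes aligned at a common corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    return inter / (box[0] * box[1] + anchor[0] * anchor[1] - inter)


def kmeans_anchors(boxes, k, iters=50):
    """Cluster (w, h) pairs using 1 - IoU as the distance; return k anchors."""
    anchors = list(boxes[:k])  # assumption: naive init from the first k boxes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: wh_iou(b, anchors[i]))
            clusters[best].append(b)
        # new anchor = mean width and height of its cluster (unchanged if empty)
        new = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else anchors[i]
            for i, c in enumerate(clusters)
        ]
        if new == anchors:
            break
        anchors = new
    return sorted(anchors, key=lambda a: a[0] * a[1])  # smallest area first


# Made-up ground-truth box sizes in pixels: three tall-narrow, three wide
gt_sizes = [(10, 20), (100, 50), (12, 22), (90, 60), (11, 18), (110, 55)]
print(kmeans_anchors(gt_sizes, k=2))
```

On this toy data the two anchors settle at the mean shape of each group, one small and tall, one large and wide.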

Predicted X (centre)
bx = σ(tx) + cx
σ(tx) is sigmoid of network output, cx is grid cell x-offset. Keeps centre inside the cell.
Predicted Y (centre)
by = σ(ty) + cy
cy is grid cell y-offset. The sigmoid keeps σ(tx) and σ(ty) within (0, 1), so the predicted centre (bx, by) cannot leave its cell.
Predicted Width
bw = pw · e^tw
pw is anchor prior width. Exponential ensures width is always positive and anchored to prior.
Predicted Height
bh = ph · e^th
ph is anchor prior height. Network learns log-scale offsets, making large box predictions stable.
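The four equations above can be checked with a short decoding function. This is a sketch under stated assumptions: the anchor sizes pw and ph are in pixels, and a `stride` argument (pixels per grid cell, not part of the equations) converts the centre from grid units to pixels. The name `decode_box` is illustrative.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    """Map raw network outputs to (centre x, centre y, width, height) in pixels."""
    bx = (sigmoid(tx) + cx) * stride  # bx = sigma(tx) + cx, then grid units -> pixels
    by = (sigmoid(ty) + cy) * stride  # by = sigma(ty) + cy
    bw = pw * math.exp(tw)            # bw = pw * e^tw, always positive
    bh = ph * math.exp(th)            # bh = ph * e^th
    return bx, by, bw, bh


# All-zero outputs place the box at the centre of cell (7, 4) with the anchor's size
print(decode_box(0, 0, 0, 0, cx=7, cy=4, pw=116, ph=90, stride=32))
```

With zero offsets the sigmoid returns 0.5 and the exponential returns 1, so the box sits in the middle of its cell and has exactly the anchor's shape. Training only has to learn small corrections around that starting point.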
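YOLO Version | Year | Anchors Per Cell | Grid Scales | Backbone | mAP (COCO)
YOLOv1 | 2015 | 2 (no anchors) | 1 (7×7) | Custom CNN | ~45%
YOLOv2 | 2016 | 5 | 1 | Darknet-19 | ~48%
YOLOv3 | 2018 | 3 × 3 scales | 3 (multi-scale) | Darknet-53 | ~55%
YOLOv4 | 2020 | 3 × 3 scales | 3 | CSPDarknet-53 | ~65%
YOLOv5 | 2020 | 3 × 3 scales | 3 | CSP + Focus | ~56% (val)
YOLOv8 | 2023 | Anchor-Free | 3 | C2f + CSP | ~53% (val)
YOLOv10 | 2024 | Anchor-Free | 3 | CSPNet | ~54% (val)
The accuracy figures are approximate and come from different evaluation settings (model size, image size, AP at IoU 0.5 versus AP averaged over IoU 0.50–0.95), so treat them as indicative rather than directly comparable.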

Section 05

The YOLO Architecture — Visual Diagram

Every YOLO model is built from three conceptual components: the Backbone, the Neck, and the Head. Understanding this structure helps you know where to customise the model.

🧠
Backbone — Feature Extraction
CSPDarknet / EfficientNet / ResNet
A deep CNN that processes the raw image and extracts hierarchical feature maps at multiple resolutions. Early layers detect edges and textures; deep layers detect semantic concepts (faces, wheels, text). The backbone is the heavy compute stage.
🔁
Neck — Feature Aggregation
FPN / PAN / BiFPN
Combines feature maps from different backbone layers to create rich multi-scale representations. The Feature Pyramid Network (FPN) merges deep (semantic) and shallow (spatial) features. The PAN adds a bottom-up path for stronger localisation signals.
🎯
Head — Prediction
Decoupled / Coupled / Anchor-Free
Takes multi-scale feature maps from the neck and outputs the final detection tensors. Modern YOLO (v8+) uses a decoupled head — separate branches for classification and box regression — which significantly improves accuracy over the old coupled approach.
🏭 Data Flow Through a YOLOv8 Model
Input
640×640×3 image tensor → normalised to [0, 1], no mean subtraction needed
Backbone
C2f blocks + SPPF module → outputs P3 (80×80), P4 (40×40), P5 (20×20) feature maps
Neck (PAN)
Upsample P5 → fuse with P4 → fuse with P3. Downsample P3 → fuse with P4 → fuse with P5
Head
Three detection heads at 80×80 (small), 40×40 (medium), 20×20 (large) scales
Output
Per grid location (80×80 + 40×40 + 20×20 = 8,400 in total): [x, y, w, h, cls×80], with no separate objectness score in YOLOv8 → NMS → final detections as list of (box, score, class)
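The sizes in this data flow follow from simple arithmetic. The sketch below assumes the three heads run at strides of 8, 16 and 32 pixels (which is what the 80×80, 40×40 and 20×20 grids imply for a 640-pixel input) and that each location carries 4 box values plus one score per class, matching the (1, 7, 8400) ONNX output reported later for a 3-class model. `prediction_grid` is a hypothetical helper name.

```python
def prediction_grid(imgsz=640, strides=(8, 16, 32), num_classes=80):
    """Return per-scale grid sizes, total locations, and values per location."""
    grids = [imgsz // s for s in strides]
    total = sum(g * g for g in grids)
    channels = 4 + num_classes  # assumption: box + class scores, no objectness
    return grids, total, channels


print(prediction_grid())               # COCO model at 640 px
print(prediction_grid(num_classes=3))  # 3-class custom model
```

Raising `imgsz` to 1280 quadruples the number of candidate locations, which is one reason larger inputs help with small objects but cost more compute.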

Section 06

YOLO Loss Function — What It's Learning

Training YOLO requires a carefully designed multi-part loss. The network must simultaneously learn to localise boxes accurately and classify objects correctly, while suppressing false positives.

📈
Box Regression Loss
CIoU / DIoU / GIoU
Penalises the difference between predicted and ground-truth boxes. Modern YOLO uses CIoU loss which accounts for overlap, centre distance, and aspect ratio simultaneously — far better than plain MSE on coordinates.
🎧
Objectness / Confidence Loss
Binary Cross-Entropy
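Trains the objectness score: cells with objects should output confidence → 1.0, cells without objects → 0.0. Background cells outnumber object cells many times over in a typical image (on the order of 50:1), so their contribution to the loss is down-weighted to stop them dominating training. Anchor-free versions such as YOLOv8 drop the separate objectness branch and rely on the class scores alone.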
🌟
Classification Loss
BCE with Logits (YOLOv8)
Trains the class probability vector. YOLOv8 uses binary cross-entropy per class instead of softmax, enabling multi-label detection (an object can be both "person" and "athlete" simultaneously). Label smoothing can optionally be applied to reduce overconfidence.
⚠️
Class Imbalance in Detection

In any image, the vast majority of grid cells contain no object. If treated equally, the model would learn to always predict "background." YOLOv1 addressed this with loss weighting (λ_noobj = 0.5 to down-weight empty cells, λ_coord = 5 to up-weight box coordinates), and modern versions use focal loss or task-aligned assignment strategies to focus learning on hard examples.
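To show how CIoU combines overlap, centre distance and aspect ratio, here is a plain-Python sketch of the commonly published formula: CIoU = IoU − ρ²/c² − αv, with the loss taken as 1 − CIoU. It assumes (x1, y1, x2, y2) boxes and is not the Ultralytics implementation, which works on tensors.

```python
import math


def ciou(box_p, box_g, eps=1e-9):
    """Complete IoU of two (x1, y1, x2, y2) boxes; the loss is 1 - CIoU."""
    # overlap term: plain IoU
    iw = max(0.0, min(box_p[2], box_g[2]) - max(box_p[0], box_g[0]))
    ih = max(0.0, min(box_p[3], box_g[3]) - max(box_p[1], box_g[1]))
    inter = iw * ih
    wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
    wg, hg = box_g[2] - box_g[0], box_g[3] - box_g[1]
    iou = inter / (wp * hp + wg * hg - inter + eps)
    # distance term: squared centre distance over squared enclosing-box diagonal
    cw = max(box_p[2], box_g[2]) - min(box_p[0], box_g[0])
    ch = max(box_p[3], box_g[3]) - min(box_p[1], box_g[1])
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((box_p[0] + box_p[2] - box_g[0] - box_g[2]) ** 2 +
            (box_p[1] + box_p[3] - box_g[1] - box_g[3]) ** 2) / 4
    # aspect-ratio term
    v = (4 / math.pi ** 2) * (math.atan(wg / (hg + eps)) - math.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v


gt = (0, 0, 10, 10)
print(ciou(gt, gt))                # identical boxes: close to 1
print(ciou((20, 20, 30, 30), gt))  # disjoint boxes: negative, so the loss exceeds 1
```

Unlike plain IoU, which is zero for every pair of non-overlapping boxes, the distance term still tells the optimiser which direction to move a box that misses its target entirely.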


Section 07

Python Implementation — YOLOv8 with Ultralytics

Installation

# Install the ultralytics package (includes YOLOv8, YOLOv9, YOLOv10)
# Run this line in a terminal, not inside Python
pip install ultralytics

# Verify GPU availability (run the lines below in Python)
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")

Running Inference on an Image

from ultralytics import YOLO
import cv2

# Load a pre-trained YOLOv8 nano model (fastest, smallest)
# Options: yolov8n, yolov8s, yolov8m, yolov8l, yolov8x
model = YOLO('yolov8n.pt')   # auto-downloads on first run

# Run inference on a single image
results = model('street.jpg', conf=0.25, iou=0.45)

# Access detection results
for r in results:
    boxes  = r.boxes           # Boxes object with xyxy, conf, cls
    masks  = r.masks           # Segmentation masks (if seg model)
    keypts = r.keypoints       # Pose keypoints (if pose model)

    print(f"Detected {len(boxes)} objects")

    for box in boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        conf   = box.conf[0].item()
        cls_id = int(box.cls[0].item())
        label  = model.names[cls_id]
        print(f"  {label:15s} conf={conf:.2f}  box=[{x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f}]")

# Save annotated image
results[0].save('output.jpg')
print("Saved to output.jpg")
OUTPUT
Detected 5 objects
  person          conf=0.92  box=[120,45,310,480]
  car             conf=0.88  box=[400,200,750,420]
  car             conf=0.81  box=[50,190,280,380]
  traffic light   conf=0.76  box=[510,30,560,130]
  bicycle         conf=0.67  box=[630,310,710,460]
Saved to output.jpg

Real-Time Webcam Detection

from ultralytics import YOLO
import cv2
import time

model = YOLO('yolov8n.pt')
cap   = cv2.VideoCapture(0)  # 0 = default webcam

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Run YOLO tracking on the frame and time the call
    start = time.perf_counter()
    results = model.track(frame, persist=True, conf=0.3)
    elapsed = time.perf_counter() - start
    annotated = results[0].plot()  # copy of the frame with boxes and track IDs drawn

    # Display FPS based on the measured time for this frame
    cv2.putText(annotated, f'FPS: {1 / max(elapsed, 1e-6):.0f}',
                (10, 30), cv2.FONT_HERSHEY_SIMPLEX,
                1, (0, 255, 0), 2)

    cv2.imshow('YOLOv8 Live Detection', annotated)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
Speed Tip — Model Size vs FPS

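yolov8n (nano) can reach roughly 200 FPS on a recent GPU and around 30 FPS on a modern CPU. yolov8x (extra-large) gives the best accuracy but drops to about 25 FPS on GPU. Exact figures depend on the hardware, image size and batch size. For edge devices (Raspberry Pi, Jetson Nano), always use nano or small variants. Export to ONNX or TensorRT for a further speedup, often in the 2–5× range.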


Section 08

Training YOLO on a Custom Dataset

Detecting Defective Bottles on a Production Line
A beverage factory wants to detect cracked caps, unfilled bottles, and missing labels on a high-speed conveyor belt at 120 bottles per minute. Pre-trained YOLO on COCO knows nothing about bottle defects. We need to fine-tune it on factory images. Here is exactly how.

Step 1 — Dataset Structure (YOLO Format)

# Required folder structure for YOLO training
dataset/
├── images/
│   ├── train/    # Training images (.jpg, .png)
│   └── val/      # Validation images
├── labels/
│   ├── train/    # Annotation .txt files (one per image)
│   └── val/
└── data.yaml     # Dataset config file

# Each .txt label file: one row per object
# Format: class_id  cx  cy  w  h  (all normalised 0-1)
# Example label file for an image with 2 objects
# (class 0 = cracked_cap, class 2 = missing_label; the file itself holds only the numbers)
0 0.512 0.348 0.124 0.210
2 0.731 0.512 0.088 0.175
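Annotation tools often export boxes as pixel corner coordinates, so a conversion to the normalised centre format above is usually needed. The helper below is a minimal sketch; `to_yolo_label` is a hypothetical name and the six-decimal formatting is just a convenient choice.

```python
def to_yolo_label(class_id, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel-space corner box to one normalised YOLO label line."""
    cx = (x1 + x2) / 2 / img_w  # box centre, as a fraction of image width
    cy = (y1 + y2) / 2 / img_h  # box centre, as a fraction of image height
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# A 200x200 px box with top-left corner (100, 200) in a 1000x800 image
print(to_yolo_label(0, 100, 200, 300, 400, img_w=1000, img_h=800))
```

Each returned string is one row of the label file; write one row per object and name the file after its image (for example `img_001.jpg` pairs with `img_001.txt`).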

Step 2 — data.yaml Configuration

# data.yaml — tells YOLO where data is and what classes exist
path: /path/to/dataset    # absolute root
train: images/train
val:   images/val

nc: 3                      # number of classes
names:
  0: cracked_cap
  1: unfilled_bottle
  2: missing_label

Step 3 — Training

from ultralytics import YOLO

# Load pre-trained model (transfer learning from COCO weights)
model = YOLO('yolov8s.pt')   # small model for balance of speed/accuracy

# Train on custom dataset
results = model.train(
    data='dataset/data.yaml',
    epochs=100,           # training epochs
    imgsz=640,            # input image size
    batch=16,             # batch size (reduce if OOM)
    lr0=0.01,             # initial learning rate
    lrf=0.001,            # final LR as a fraction of lr0 (final LR = lr0 * lrf)
    momentum=0.937,
    weight_decay=0.0005,
    warmup_epochs=3,      # gradual LR warmup
    mosaic=1.0,           # mosaic probability; flip and HSV augmentation are on by default
    device='0',           # GPU device ('cpu' for CPU training)
    project='runs/detect',
    name='bottle_defect_v1',
    patience=20,          # early stopping patience
    save_period=10,       # save checkpoint every N epochs
)

print(f"Best mAP50: {results.results_dict['metrics/mAP50(B)']:.4f}")
TRAINING OUTPUT (last 5 epochs)
Epoch    GPU_mem  box_loss  cls_loss  dfl_loss  Instances  Size
96/100   4.12G    0.8312    0.4109    1.0821    28         640
97/100   4.12G    0.8187    0.4051    1.0794    31         640
98/100   4.12G    0.8104    0.3998    1.0761    26         640
99/100   4.12G    0.8063    0.3971    1.0740    29         640
100/100  4.12G    0.8021    0.3940    1.0715    27         640
Results saved to runs/detect/bottle_defect_v1
Best mAP50: 0.8934

Section 09

Evaluation — Understanding mAP

The standard metric for object detection is mAP (mean Average Precision). Understanding it is essential for knowing whether your model is actually good.

📊 mAP Calculation — Step by Step
Step 1
For each class, collect all predictions across the validation set, sorted by confidence score (highest first)
Step 2
Label each prediction as TP (IoU above the threshold with a ground-truth box that has not already been matched) or FP (duplicate match or IoU too low)
Step 3
Compute cumulative Precision and Recall at each confidence threshold → plot the PR curve
Step 4
AP = area under the PR curve for one class. mAP = mean AP across all classes
mAP50
AP computed at IoU threshold 0.50 — lenient, commonly reported. Used in PASCAL VOC challenge.
mAP50-95
Average AP across IoU thresholds 0.50 to 0.95 (step 0.05). Stricter, used in COCO benchmark.
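Steps 1 to 4 can be sketched for a single class as follows. The sketch assumes each prediction has already been labelled TP or FP (step 2) and uses all-point interpolation of the PR curve; COCO-style evaluation samples the curve at 101 recall points instead, so numbers from real tools can differ slightly. `average_precision` is an illustrative name.

```python
def average_precision(preds, num_gt):
    """preds: list of (confidence, is_true_positive); num_gt: ground-truth count."""
    preds = sorted(preds, key=lambda p: p[0], reverse=True)  # step 1
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in preds:  # step 3: cumulative precision and recall
        tp += is_tp
        fp += not is_tp
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # make precision non-increasing from right to left (the PR envelope)
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # step 4: area under the curve, summed wherever recall increases
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap


# Two ground-truth objects; the middle prediction is a false positive
preds = [(0.9, True), (0.8, False), (0.7, True)]
print(round(average_precision(preds, num_gt=2), 4))
```

mAP is then the mean of this value over all classes, and mAP50-95 repeats the whole procedure at each IoU threshold from 0.50 to 0.95 before averaging.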
from ultralytics import YOLO

# Load trained model
model = YOLO('runs/detect/bottle_defect_v1/weights/best.pt')

# Evaluate on validation set
metrics = model.val(data='dataset/data.yaml', conf=0.001, iou=0.6)

print(f"mAP50:        {metrics.box.map50:.4f}")
print(f"mAP50-95:     {metrics.box.map:.4f}")
print(f"Precision:    {metrics.box.mp:.4f}")
print(f"Recall:       {metrics.box.mr:.4f}")

# Per-class breakdown
for i, cls_name in model.names.items():
    ap50 = metrics.box.ap50[i]
    print(f"  {cls_name:20s}: AP50 = {ap50:.4f}")
EVALUATION OUTPUT
mAP50:        0.8934
mAP50-95:     0.6712
Precision:    0.9102
Recall:       0.8741
  cracked_cap         : AP50 = 0.9214
  unfilled_bottle     : AP50 = 0.8801
  missing_label       : AP50 = 0.8787
mAP50 Score | Interpretation | Typical Use Case
< 0.40 | Poor: model barely learns to detect | More data or architectural fix needed
0.40 – 0.60 | Moderate: detects most objects with many misses | Simple use cases, good starting point
0.60 – 0.80 | Good: reliable in controlled conditions | Production-grade for many applications
> 0.80 | Excellent: near-human level for the domain | Safety-critical or high-accuracy use cases

Section 10

Key Hyperparameters for Training YOLO

Parameter | Default | What It Controls | Tuning Advice
imgsz | 640 | Input image resolution (must be multiple of 32) | Use 640 for start. Try 1280 for small object detection.
epochs | 100 | Total training epochs | Use patience (early stopping) rather than a fixed count.
batch | 16 | Images per gradient update | Maximise to fit GPU. -1 = auto-batch (fills 60% of GPU VRAM).
lr0 | 0.01 | Initial learning rate | Reduce to 0.001 for fine-tuning a pre-trained model.
augment | False | Test-time augmentation for predict/val. Training augmentation is controlled separately by mosaic, fliplr and the hsv_* settings, which are on by default. | Leave the training augmentations enabled. They are among the biggest accuracy boosters.
degrees | 0.0 | Random rotation augmentation range | Enable (10–15) for aerial/satellite images where rotation matters.
mosaic | 1.0 | Mosaic augmentation probability | Turn off for the last few epochs to stabilise training (the close_mosaic setting does this automatically).
conf | 0.25 | Inference confidence threshold | Lower (0.1) for recall-focused tasks, higher (0.5) for precision.
iou | 0.7 (the examples here pass 0.45) | NMS IoU threshold | Lower to suppress more overlapping duplicates. Raise for dense scenes where real objects overlap.

Section 11

YOLO Variants — Choosing the Right Model

🚀
YOLOv8n / v8s (Nano / Small)
Fastest inference, smallest model. Use for edge devices, real-time apps on CPU, embedded systems (Raspberry Pi, mobile). Accuracy sacrifice is acceptable for most non-critical use cases.
~200 FPS GPU | ~30 FPS CPU
⚖️
YOLOv8m (Medium)
The production sweet spot. Best balance of speed and accuracy. Runs at 60+ FPS on GPU. If you have a modern NVIDIA card and don't need extreme speed, start here.
~80 FPS GPU
🏆
YOLOv8l / v8x (Large / XL)
Maximum accuracy. Use for batch processing, medical imaging, satellite/aerial detection where false negatives are costly. Not suitable for real-time applications on consumer hardware.
~25–45 FPS GPU
🪣
YOLOv8-seg (Instance Segmentation)
Outputs per-pixel masks in addition to bounding boxes. Identifies the exact shape of each object, not just the box around it. Use for robotics, medical analysis, autonomous vehicles.
Mask + Box output
🦥
YOLOv8-pose (Keypoint Detection)
Detects human skeletal keypoints (17 joints: nose, eyes, shoulders, elbows, wrists, hips, knees, ankles). Use for sports analytics, physiotherapy, gesture control.
17 keypoints per person
📷
YOLOv8-obb (Oriented Bounding Boxes)
Predicts rotated bounding boxes with an angle parameter. Critical for satellite imagery where ships, planes, and vehicles appear at arbitrary orientations. Standard boxes are not sufficient.
x, y, w, h, θ output

Section 12

Model Export and Deployment

Training in PyTorch gives you a flexible .pt file. For production deployment, export to optimised formats that can run without Python or PyTorch installed.

from ultralytics import YOLO

model = YOLO('runs/detect/bottle_defect_v1/weights/best.pt')

# Export to ONNX (cross-platform, runs on CPU/GPU/mobile)
model.export(format='onnx', opset=17, simplify=True)

# Export to TensorRT (NVIDIA GPU — fastest possible inference)
model.export(format='engine', half=True)   # FP16 precision

# Export to CoreML (Apple devices — iPhone, Mac)
model.export(format='coreml', nms=True)

# Export to TFLite (Android / embedded Linux)
model.export(format='tflite', int8=True)   # INT8 quantisation

# Run inference with the exported ONNX model
# (export() writes best.onnx next to best.pt and returns its path)
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession('runs/detect/bottle_defect_v1/weights/best.onnx')
input_name  = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name

dummy_input = np.random.rand(1, 3, 640, 640).astype(np.float32)
outputs = sess.run([output_name], {input_name: dummy_input})
print("ONNX output shape:", outputs[0].shape)  # (1, 7, 8400): 4 box values + 3 class scores
Export Format | Target Platform | Speed Gain | Precision | Use Case
.pt (PyTorch) | Python server | Baseline | FP32 | Development, research
ONNX | Any platform | 1.5–2× | FP32 | Cross-platform production
TensorRT | NVIDIA GPU | 3–5× | FP16/INT8 | Max GPU throughput
CoreML | Apple devices | 2–4× | FP16 | iOS / macOS apps
TFLite INT8 | Mobile / Edge | 1.5–3× | INT8 | Android, Raspberry Pi
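Outside the Ultralytics wrapper, the raw ONNX output still has to be turned into detections. The sketch below shows one way to read a (4 + classes, N) array such as `outputs[0][0]` from the snippet above. It assumes the export has no built-in NMS, that boxes come out as centre x, centre y, width, height in input-pixel units, and that class scores are already probabilities. These are assumptions to verify against your own exported model, and `decode_predictions` is a hypothetical helper.

```python
def decode_predictions(pred, conf_thres=0.25):
    """pred: rows [cx, cy, w, h, cls_0, ...], one column per candidate location."""
    num_classes = len(pred) - 4
    detections = []
    for i in range(len(pred[0])):
        scores = [pred[4 + c][i] for c in range(num_classes)]
        cls_id = max(range(num_classes), key=lambda c: scores[c])
        if scores[cls_id] < conf_thres:
            continue  # drop low-confidence candidates before NMS
        cx, cy, w, h = (pred[k][i] for k in range(4))
        box = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)  # to corner format
        detections.append((box, scores[cls_id], cls_id))
    return detections


# Tiny stand-in for a (7, 2) output: 2 candidate locations, 3 classes
pred = [
    [100, 50],    # cx
    [100, 50],    # cy
    [20, 10],     # w
    [40, 10],     # h
    [0.9, 0.1],   # class 0 score
    [0.05, 0.2],  # class 1 score
    [0.05, 0.1],  # class 2 score
]
print(decode_predictions(pred))
```

The surviving boxes still need NMS and, if the input image was resized or padded, rescaling back to the original image coordinates.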

Section 13

A Complete End-to-End Example — Traffic Monitoring

Counting Vehicles at an Intersection
A city council wants to automatically count the number of cars, trucks, and buses passing through a busy intersection per hour to inform traffic light timing decisions. The camera captures 1080p video at 30 fps. The system must run on a small edge server with a consumer GPU (RTX 3060). Here is the full pipeline.
from ultralytics import YOLO
from collections import defaultdict
import cv2, numpy as np

# Load model — YOLOv8m for good speed/accuracy on RTX 3060
model = YOLO('yolov8m.pt')

# Vehicle classes in COCO dataset
VEHICLE_CLASSES = {2: 'car', 3: 'motorcycle', 5: 'bus', 7: 'truck'}

# Counting line (horizontal, at y = 540 for 1080p video)
LINE_Y      = 540
track_hist  = defaultdict(lambda: [])  # store track paths
vehicle_cnt = defaultdict(int)         # cumulative counts
counted_ids = set()                   # track IDs already counted

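cap = cv2.VideoCapture('intersection.mp4')
# Match the writer to the source video so frames are not silently dropped
fps  = cap.get(cv2.CAP_PROP_FPS) or 30
size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
out = cv2.VideoWriter('output_counted.mp4',
        cv2.VideoWriter_fourcc(*'mp4v'), fps, size)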

while cap.isOpened():
    ret, frame = cap.read()
    if not ret: break

    # Run tracking (ByteTrack — maintains object IDs across frames)
    results = model.track(frame, persist=True, tracker='bytetrack.yaml',
                          classes=list(VEHICLE_CLASSES.keys()), conf=0.4)

    if results[0].boxes.id is not None:
        boxes    = results[0].boxes.xyxy.cpu().numpy()
        track_ids= results[0].boxes.id.int().cpu().tolist()
        cls_ids  = results[0].boxes.cls.int().cpu().tolist()

        for box, tid, cid in zip(boxes, track_ids, cls_ids):
            x1, y1, x2, y2 = box
            cx, cy = int((x1+x2)/2), int((y1+y2)/2)

            # Store trajectory
            track_hist[tid].append((cx, cy))
            if len(track_hist[tid]) > 30:
                track_hist[tid].pop(0)

            # Count if vehicle crosses the line (north→south)
            hist = track_hist[tid]
            if (len(hist) >= 2 and
                    hist[-2][1] < LINE_Y <= hist[-1][1] and
                    tid not in counted_ids):
                vehicle_cnt[VEHICLE_CLASSES.get(cid, 'other')] += 1
                counted_ids.add(tid)

    # Draw counting line and totals
    cv2.line(frame, (0, LINE_Y), (1920, LINE_Y), (0, 255, 255), 2)
    y_off = 50
    for vtype, count in vehicle_cnt.items():
        cv2.putText(frame, f"{vtype}: {count}",
                   (20, y_off), cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0,255,0), 3)
        y_off += 45

    out.write(frame)

cap.release()
out.release()
print("Vehicle counts:", dict(vehicle_cnt))
print("Video saved to output_counted.mp4")
OUTPUT
Vehicle counts: {'car': 142, 'truck': 31, 'bus': 18, 'motorcycle': 9}
Video saved to output_counted.mp4

Section 14

When to Use YOLO — and When Not To

Real-Time Video Streams
YOLO's single-pass design is built for live video. Security cameras, dashcams, drone feeds, sports tracking — any application where latency under 50ms per frame matters.
surveillance, autonomous driving, drones
Multi-Object Scenes
YOLO detects all objects simultaneously in one pass — unlike two-stage detectors that process each region separately. Crowded scenes with 50+ objects per frame are YOLO's native territory.
crowd analysis, warehouse management
Edge Deployment
With the nano/small variants and INT8 quantisation, YOLO runs on Raspberry Pi, Jetson Nano, and mobile devices. No cloud required — data stays local, latency stays low.
IoT, manufacturing QC, retail analytics
Very Small Objects in Large Images
YOLO resizes images to 640×640 by default, which crushes tiny objects. For large satellite or aerial images where each object spans only a few pixels, use SAHI (Slicing Aided Hyper Inference) to tile the image first.
aerial survey, microscopy
Maximum Accuracy Over Speed
If you have a batch job with no latency requirement and need the absolute best accuracy (e.g. medical imaging, forensics), a two-stage detector like Cascade RCNN may outperform YOLO.
pathology, forensics, scientific imaging
Very Few Training Images (<50)
YOLO is a data-hungry model. With fewer than roughly 50–100 images per class, fine-tuning often overfits or fails to converge. Consider few-shot detection methods (Detic, GLIP) or extensive augmentation strategies.
rare defect detection, novel classes
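The tiling idea behind SAHI, mentioned above for very small objects, comes down to covering a large image with overlapping windows and running the detector on each one. The sketch below only computes the window coordinates. It is not the SAHI library's API, and `tile_windows`, the 640-pixel tile and the 128-pixel overlap are illustrative assumptions.

```python
def tile_windows(img_w, img_h, tile=640, overlap=128):
    """Return (x1, y1, x2, y2) windows that cover the image with overlap."""
    step = tile - overlap

    def starts(size):
        if size <= tile:
            return [0]  # image is smaller than one tile along this axis
        s = list(range(0, size - tile, step))
        s.append(size - tile)  # last tile sits flush with the image edge
        return s

    return [(x, y, min(x + tile, img_w), min(y + tile, img_h))
            for y in starts(img_h) for x in starts(img_w)]


windows = tile_windows(1920, 1080)
print(len(windows), windows[0], windows[-1])
```

Each window is cropped and passed to the model at full resolution. The per-tile boxes are then shifted by the window's (x1, y1) offset and merged with NMS so that objects in the overlap are not counted twice.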

Section 15

YOLO vs Other Detectors — Comparison

📰 Two-Stage (Faster R-CNN)
Property | Value
Pipeline | RPN → ROI Pooling → Classifier
Speed | ~5 fps (GPU)
mAP (COCO) | ~37.9
Small Objects | Excellent
Deployment | Complex
Best For | Highest accuracy when there is no latency requirement
⚡ One-Stage (YOLOv8m)
Property | Value
Pipeline | Backbone → Neck → Head (1 pass)
Speed | ~80 fps (GPU)
mAP (COCO) | ~50.2
Small Objects | Good (with SAHI)
Deployment | Simple (single model file)
Best For | Real-time, production, edge

Section 16

Golden Rules — YOLO in Production

🎯 YOLO — Non-Negotiable Rules
1
Always start with a pre-trained model — never train from scratch. Transfer learning from COCO weights gives you a massive head start. Fine-tuning requires 10–100× fewer images and converges in hours, not days.
2
Image quality beats quantity. 500 clean, well-labelled, diverse images outperform 5,000 blurry, poorly-labelled ones. Bad labels directly cap your mAP ceiling — there is no model architecture fix for incorrect ground truth.
3
Match your training data distribution to your deployment environment. If your camera captures images at night, train on night images. If it's mounted at 45°, include tilted examples. Domain shift is the number one cause of production failures.
4
Keep YOLO's training augmentations switched on. The default pipeline (mosaic, random flip, HSV jitter, scale/translate) is highly tuned and often provides more effective training signal than adding more raw images.
5
Tune confidence and IoU thresholds for your use case. The default conf=0.25 is a starting point. For a security system where false negatives are dangerous, lower it to 0.1 and accept more false positives. For a product recommendation engine, raise it to 0.6 for precision.
6
Export to TensorRT or ONNX before shipping to production. Running .pt files via PyTorch in production wastes 3–5× compute compared to a compiled TensorRT engine. Export is a one-time cost; the inference speedup is permanent.
7
Monitor production performance continuously. Object distributions shift over time (seasonal changes, new product SKUs, different lighting). Set up automated alerts if mAP on a held-out benchmark drops more than 5%. Retrain quarterly at minimum.