YOLO Object Detection Tutorial

Section 01

The Story That Explains YOLO

📖 Real World Analogy

The Security Guard vs The Forensic Analyst

Imagine a shopping mall security guard watching a live camera feed. The old way of doing computer vision was like sending the footage to a forensic lab: they'd scan the whole image multiple times, propose hundreds of suspicious zones, then carefully classify each one. It worked, but it was far too slow for live video.

YOLO is the guard who glances once and instantly says: "There's a person in aisle 3, a handbag on shelf 2, and a suspicious package near exit B." One look. One pass. All detections simultaneously.

That single glance — one forward pass through a neural network — is the entire philosophy behind YOLO: You Only Look Once.

YOLO is a family of real-time object detection models that frames detection as a single regression problem — directly predicting bounding boxes and class probabilities from full images in one evaluation. First introduced by Joseph Redmon et al. in 2015, it revolutionised computer vision by making detection fast enough to run on live video streams.

🏦

The Core Insight

Before YOLO, the leading detection pipelines were two-stage: first propose regions of interest, then classify them (R-CNN, Fast R-CNN). YOLO collapsed this into a single-stage approach: one neural network, one pass, all predictions at once. It gave up some accuracy in exchange for a large speedup; the original paper reports 45 fps for YOLO against 0.5 fps for Fast R-CNN and 7 fps for Faster R-CNN.

Section 02

The Problem YOLO Solves — Object Detection 101

Object detection is the task of simultaneously answering two questions for every object in an image: What is it? (classification) and Where exactly is it? (localisation via a bounding box).

🎯 What Every Object Detector Must Output Per Object

x, y

Centre coordinates of the bounding box, normalised to image width and height (0 to 1)

w, h

Width and height of the bounding box, normalised relative to the image dimensions

conf

Objectness confidence score — how certain the model is that an object exists in that box

class

Probability distribution over all possible class labels (e.g. person = 0.97, car = 0.02)
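
To make the normalised box format concrete, here is a minimal sketch that converts one (x, y, w, h) prediction into pixel corner coordinates. The function name and the example image size are illustrative choices, not part of any library.

```python
def yolo_to_pixel_xyxy(cx, cy, w, h, img_w, img_h):
    """Convert a normalised centre-format box to pixel corner coordinates."""
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return x1, y1, x2, y2


# A box centred in a 640x480 image, a quarter as wide and half as tall
box = yolo_to_pixel_xyxy(0.5, 0.5, 0.25, 0.5, img_w=640, img_h=480)
print(box)  # (240.0, 120.0, 400.0, 360.0)
```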

Detection Approach	Key Idea	Speed	Accuracy	Example
Two-stage (Region-Based)	Propose regions → then classify	Slow (5–7 fps)	High	R-CNN, Faster R-CNN
One-stage (Anchor-Based)	Predict boxes directly on grid	Fast (45–140 fps)	Good	YOLO, SSD
One-stage (Anchor-Free)	Predict centre point directly	Very Fast	SOTA	YOLOv8, FCOS
Transformer-Based	Global attention on image tokens	Medium	SOTA	DETR, RT-DETR

Section 03

How YOLO Works — The Grid Trick

📖 Story

The Newspaper Grid

Think of a city map divided into a grid of squares — say 13 × 13. Each square is responsible for reporting objects whose centre falls within that square. If a car's centre is in square (row 4, col 7), then that grid cell and only that cell is responsible for detecting the car. Every cell simultaneously reports its findings. The grid answers all at once — no waiting, no two-stage pipeline.

Divide Image into an S × S Grid

The input image (e.g. 416×416 px) is divided into a grid — say 13×13. Each cell covers a ~32×32 pixel region. Every cell is simultaneously responsible for detecting objects whose centres fall within it.

Each Cell Predicts B Bounding Boxes

Each grid cell predicts B bounding boxes (e.g. B=3). Each box has 5 values: [x, y, w, h, objectness_confidence]. The x, y are offsets from the cell corner; w, h are relative to anchor dimensions.

Each Cell Predicts C Class Probabilities

Each cell also outputs a vector of C class probabilities (e.g. C=80 for COCO). These are conditional on an object being present. Multiply by objectness confidence to get the final class-specific confidence.

Total Output Tensor: S × S × (B×5 + C)

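With this YOLOv1-style formula, a 13×13 grid with B=3 and C=80 gives 13×13×(3×5+80) = 13×13×95 (YOLOv1 itself used S=7, B=2, C=20, giving 7×7×30). Anchor-based versions such as YOLOv3 predict class scores per box rather than per cell, so the same settings give 13×13×(3×(5+80)) = 13×13×255. Either way, the entire prediction is produced in one forward pass through the network.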
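
The grid bookkeeping can be sketched in a few lines. The helper names below are hypothetical; the sketch assumes a square 416×416 input and shows both the per-cell class layout (YOLOv1 style) and the per-box class layout used by anchor-based versions.

```python
def responsible_cell(cx_px, cy_px, img_size=416, grid=13):
    """Return (row, col) of the grid cell that contains an object's centre."""
    cell = img_size / grid                      # 32 px per cell for 416 / 13
    col = min(int(cx_px // cell), grid - 1)
    row = min(int(cy_px // cell), grid - 1)
    return row, col


def channels_shared_classes(b, c):
    """YOLOv1 style: B boxes of 5 values plus one class vector per cell."""
    return b * 5 + c


def channels_per_box_classes(b, c):
    """Anchor-based style: every box carries its own class vector."""
    return b * (5 + c)


print(responsible_cell(240, 150))        # (4, 7)
print(channels_shared_classes(3, 80))    # 95
print(channels_per_box_classes(3, 80))   # 255
```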

Non-Maximum Suppression (NMS)

Multiple cells may predict the same object. NMS removes duplicate detections by keeping the box with the highest confidence and suppressing overlapping boxes with IoU above a threshold (e.g. 0.45).

🔑

What is IoU?

Intersection over Union (IoU) measures how much two bounding boxes overlap. IoU = Area of Overlap / Area of Union. A perfect prediction has IoU = 1.0. IoU > 0.5 is generally considered a correct detection. IoU is also called the Jaccard Index.
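
IoU and greedy NMS are short enough to write out by hand. The sketch below is class-agnostic and uses plain Python tuples in (x1, y1, x2, y2) format; production implementations usually run NMS per class and on tensors, so treat this as an illustration of the idea only.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


boxes = [(100, 100, 200, 200), (110, 110, 210, 210), (300, 300, 400, 400)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, which is above the 0.45 threshold, so it is suppressed.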

Section 04

Anchor Boxes — YOLO's Secret Weapon

Raw x, y, w, h predictions are unstable to train. From YOLOv2 onward, the anchor-based YOLO versions use anchor boxes (also called prior boxes): pre-defined box shapes computed via k-means clustering on the training set's ground-truth bounding boxes. The network learns offsets from these anchors, not absolute coordinates.

Predicted X (centre)

bx = σ(tx) + cx

σ(tx) is sigmoid of network output, cx is grid cell x-offset. Keeps centre inside the cell.

Predicted Y (centre)

by = σ(ty) + cy

cy is grid cell y-offset. Sigmoid ensures bx, by stay within [0, 1] relative to cell.

Predicted Width

bw = pw · e^tw

pw is anchor prior width. Exponential ensures width is always positive and anchored to prior.

Predicted Height

bh = ph · e^th

ph is anchor prior height. Network learns log-scale offsets, making large box predictions stable.
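
The four decoding equations above translate directly into code. This is a minimal sketch; the anchor size and cell indices in the example are made-up values, and the results are in grid-cell units (multiply by the stride to get pixels).

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Apply bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy,
    bw = pw * e^tw, bh = ph * e^th."""
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh


# Zero raw outputs place the centre mid-cell and reproduce the anchor size
print(decode_box(0.0, 0.0, 0.0, 0.0, cx=7, cy=4, pw=3.5, ph=2.0))
# (7.5, 4.5, 3.5, 2.0)
```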

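YOLO Version	Year	Anchors Per Cell	Grid Scales	Backbone	Reported mAP (approx., COCO unless noted)
YOLOv1	2015	None (2 boxes per cell)	1 (7×7)	Custom CNN	~63% (PASCAL VOC 2007)
YOLOv2	2016	5	1	Darknet-19	~48%
YOLOv3	2018	3 × 3 scales	3 (multi-scale)	Darknet-53	~55%
YOLOv4	2020	3 × 3 scales	3	CSPDarknet-53	~65%
YOLOv5	2020	3 × 3 scales	3	CSP + Focus	~56% (val)
YOLOv8	2023	Anchor-Free	3	C2f + CSP	~53% (val)
YOLOv10	2024	Anchor-Free	3	CSPNet	~54% (val)

The mAP figures are indicative only. They come from different papers, model sizes, and metrics (the YOLOv3 and YOLOv4 values are close to AP50, while the val figures from YOLOv5 onward are closer to mAP50-95), so they should not be compared row against row.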

Section 05

The YOLO Architecture — Visual Diagram

Every YOLO model is built from three conceptual components: the Backbone, the Neck, and the Head. Understanding this structure helps you know where to customise the model.

🧠

Backbone — Feature Extraction

CSPDarknet / EfficientNet / ResNet

A deep CNN that processes the raw image and extracts hierarchical feature maps at multiple resolutions. Early layers detect edges and textures; deep layers detect semantic concepts (faces, wheels, text). The backbone is the heavy compute stage.

🔁

Neck — Feature Aggregation

FPN / PAN / BiFPN

Combines feature maps from different backbone layers to create rich multi-scale representations. The Feature Pyramid Network (FPN) merges deep (semantic) and shallow (spatial) features. The PAN adds a bottom-up path for stronger localisation signals.

🎯

Head — Prediction

Decoupled / Coupled / Anchor-Free

Takes multi-scale feature maps from the neck and outputs the final detection tensors. Modern YOLO (v8+) uses a decoupled head — separate branches for classification and box regression — which significantly improves accuracy over the old coupled approach.

🏭 Data Flow Through a YOLOv8 Model

Input

640×640×3 image tensor → normalised to [0, 1], no mean subtraction needed

Backbone

C2f blocks + SPPF module → outputs P3 (80×80), P4 (40×40), P5 (20×20) feature maps

Neck (PAN)

Upsample P5 → fuse with P4 → fuse with P3. Downsample P3 → fuse with P4 → fuse with P5

Head

Three detection heads at 80×80 (small), 40×40 (medium), 20×20 (large) scales

Output

Per grid location (8,400 across the three scales): [x, y, w, h, cls×80], with no separate objectness score in YOLOv8 → NMS → final detections as list of (box, score, class)
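
The feature-map sizes in this data flow follow from the strides of the three detection levels. A small sketch, assuming the usual strides of 8, 16, and 32 for P3, P4, and P5:

```python
def detection_grid_sizes(imgsz=640, strides=(8, 16, 32)):
    """Return the side length of each detection grid and the total location count."""
    sizes = [imgsz // s for s in strides]
    total = sum(n * n for n in sizes)
    return sizes, total


print(detection_grid_sizes(640))   # ([80, 40, 20], 8400)
```

The total of 8,400 locations is the last dimension of the raw output tensor for a 640×640 input.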

Section 06

YOLO Loss Function — What It's Learning

Training YOLO requires a carefully designed multi-part loss. The network must simultaneously learn to localise boxes accurately and classify objects correctly, while suppressing false positives.

📈

Box Regression Loss

CIoU / DIoU / GIoU

Penalises the difference between predicted and ground-truth boxes. Modern YOLO uses CIoU loss which accounts for overlap, centre distance, and aspect ratio simultaneously — far better than plain MSE on coordinates.

🎧

Objectness / Confidence Loss

Binary Cross-Entropy

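Trains the objectness score: cells with objects should output confidence → 1.0, cells without objects → 0.0. Because background cells outnumber object cells by ~50:1 in a typical image, their contribution to the loss is down-weighted so they do not swamp the signal from object cells. YOLOv8's anchor-free head drops the separate objectness branch and relies on the class scores alone.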

🌟

Classification Loss

BCE with Logits (YOLOv8)

Trains the class probability vector. YOLOv8 uses binary cross-entropy per class instead of softmax, enabling multi-label detection (an object can be both "person" and "athlete" simultaneously). Label smoothing can optionally be applied to prevent overconfidence.
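
To show what the box regression term measures, here is a minimal pure-Python sketch of the CIoU loss for a single pair of boxes in (x1, y1, x2, y2) format. It follows the standard CIoU definition (overlap, centre distance, aspect ratio); real training code operates on batched tensors, and the `eps` constant is an assumption added for numerical safety.

```python
import math


def ciou_loss(pred, gt, eps=1e-9):
    """CIoU loss: 1 - IoU + rho^2 / c^2 + alpha * v."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Overlap term (plain IoU)
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = pw * ph + gw * gh - inter + eps
    iou = inter / union

    # Squared centre distance, normalised by the enclosing-box diagonal
    rho2 = (((px1 + px2) - (gx1 + gx2)) ** 2 + ((py1 + py2) - (gy1 + gy2)) ** 2) / 4
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


# Disjoint boxes: IoU is 0, but the centre-distance term still gives a gradient
print(round(ciou_loss((0, 0, 10, 10), (20, 0, 30, 10)), 3))  # 1.4
```

Unlike plain IoU, the loss keeps changing as non-overlapping boxes move closer together, which is what makes it usable early in training.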

⚠️

Class Imbalance in Detection

In any image, the vast majority of grid cells contain no object. If treated equally, the model would learn to always predict "background." The original YOLO addresses this with loss weighting (λ_noobj = 0.5 for empty cells, λ_coord = 5 for box coordinates), and modern versions use focal loss or task-aligned assignment strategies to focus learning on hard examples.
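
The effect of focal loss on easy background examples can be seen with a few lines of arithmetic. This sketch uses the commonly cited defaults gamma = 2 and alpha = 0.25 as assumptions; the two example predictions are invented for illustration.

```python
import math


def bce(p, y):
    """Binary cross-entropy for one prediction p in (0, 1) and label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def focal(p, y, gamma=2.0, alpha=0.25):
    """Focal loss: BCE scaled by alpha_t * (1 - p_t)^gamma so easy examples count less."""
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return alpha_t * (1 - p_t) ** gamma * bce(p, y)


examples = {
    "easy background": (0.02, 0),  # empty cell, confidently predicted as empty
    "hard object": (0.30, 1),      # object cell the model is unsure about
}
for name, (p, y) in examples.items():
    print(f"{name:16s} bce={bce(p, y):.4f}  focal={focal(p, y):.6f}")
```

The easy background example keeps only a tiny fraction of its BCE value, while the hard object example keeps a much larger share, so thousands of empty cells no longer dominate the gradient.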

Section 07

Python Implementation — YOLOv8 with Ultralytics

Installation

# Install the ultralytics package (run in a shell; includes YOLOv8, YOLOv9, YOLOv10)
pip install ultralytics

# Verify GPU availability (run in Python)
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")



Running Inference on an Image

from ultralytics import YOLO
import cv2

# Load a pre-trained YOLOv8 nano model (fastest, smallest)
# Options: yolov8n, yolov8s, yolov8m, yolov8l, yolov8x
model = YOLO('yolov8n.pt')   # auto-downloads on first run

# Run inference on a single image
results = model('street.jpg', conf=0.25, iou=0.45)

# Access detection results
for r in results:
    boxes  = r.boxes           # Boxes object with xyxy, conf, cls
    masks  = r.masks           # Segmentation masks (if seg model)
    keypts = r.keypoints       # Pose keypoints (if pose model)

    print(f"Detected {len(boxes)} objects")

    for box in boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        conf   = box.conf[0].item()
        cls_id = int(box.cls[0].item())
        label  = model.names[cls_id]
        print(f"  {label:15s} conf={conf:.2f}  box=[{x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f}]")

# Save annotated image
results[0].save('output.jpg')
print("Saved to output.jpg")


OUTPUT
Detected 5 objects
  person          conf=0.92  box=[120,45,310,480]
  car             conf=0.88  box=[400,200,750,420]
  car             conf=0.81  box=[50,190,280,380]
  traffic light   conf=0.76  box=[510,30,560,130]
  bicycle         conf=0.67  box=[630,310,710,460]
Saved to output.jpg

Real-Time Webcam Detection

from ultralytics import YOLO
import cv2
import time

model = YOLO('yolov8n.pt')
cap   = cv2.VideoCapture(0)  # 0 = default webcam

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Run YOLO with tracking on each frame and time the call
    start = time.perf_counter()
    results = model.track(frame, persist=True, conf=0.3)
    elapsed = time.perf_counter() - start
    annotated = results[0].plot()   # draw boxes and labels on a copy of the frame

    # Display FPS (based on wall-clock time of the model call)
    cv2.putText(annotated, f'FPS: {1 / max(elapsed, 1e-6):.0f}',
                (10, 30), cv2.FONT_HERSHEY_SIMPLEX,
                1, (0, 255, 0), 2)

    cv2.imshow('YOLOv8 Live Detection', annotated)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()



  ⚡
  
    Speed Tip — Model Size vs FPS
    
      yolov8n (nano) runs at ~200 FPS on GPU, ~30 FPS on modern CPU.
      yolov8x (extra-large) gives best accuracy but ~25 FPS on GPU.
      For edge devices (Raspberry Pi, Jetson Nano), always use nano or small variants.
      Export to ONNX or TensorRT for another 2–5× speedup.
    
  




Section 08
Training YOLO on a Custom Dataset


  📖 Real World Scenario
  Detecting Defective Bottles on a Production Line
  
    A beverage factory wants to detect cracked caps, unfilled bottles, and missing labels on a
    high-speed conveyor belt at 120 bottles per minute. Pre-trained YOLO on COCO knows nothing
    about bottle defects. We need to fine-tune it on factory images. Here is exactly how.
  


Step 1 — Dataset Structure (YOLO Format)

# Required folder structure for YOLO training
dataset/
├── images/
│   ├── train/    # Training images (.jpg, .png)
│   └── val/      # Validation images
├── labels/
│   ├── train/    # Annotation .txt files (one per image)
│   └── val/
└── data.yaml     # Dataset config file

# Each .txt label file: one row per object
# Format: class_id  cx  cy  w  h  (all normalised 0-1)
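# Example label file for an image with 2 objects
# (the trailing comments are for illustration only; real label files contain just the five numbers):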
0 0.512 0.348 0.124 0.210   # class 0 (cracked_cap)
2 0.731 0.512 0.088 0.175   # class 2 (missing_label)
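
If your annotation tool exports pixel corner coordinates, a small helper can produce label rows in this format. The function below is a hypothetical sketch, not part of the Ultralytics package.

```python
def to_yolo_label(class_id, x1, y1, x2, y2, img_w, img_h):
    """Format one pixel-space box (x1, y1, x2, y2) as a YOLO label row."""
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    for value in (cx, cy, w, h):
        if not 0.0 <= value <= 1.0:
            raise ValueError("box lies outside the image")
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# A box covering the central quarter of a 1080p frame
print(to_yolo_label(0, 480, 270, 1440, 810, img_w=1920, img_h=1080))
# 0 0.500000 0.500000 0.500000 0.500000
```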


Step 2 — data.yaml Configuration

# data.yaml — tells YOLO where data is and what classes exist
path: /path/to/dataset    # absolute root
train: images/train
val:   images/val

nc: 3                      # number of classes
names:
  0: cracked_cap
  1: unfilled_bottle
  2: missing_label
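
Before training, it can save time to check that the folder layout and label files match what the configuration promises. The following is a minimal sketch using only the standard library; the function name and the set of accepted image extensions are assumptions.

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def find_label_problems(root, split, num_classes):
    """Return human-readable problems for one split of a YOLO-format dataset."""
    root = Path(root)
    problems = []
    for img in sorted((root / "images" / split).iterdir()):
        if img.suffix.lower() not in IMAGE_EXTS:
            continue
        label = root / "labels" / split / f"{img.stem}.txt"
        if not label.exists():
            problems.append(f"{img.name}: no label file")
            continue
        for n, line in enumerate(label.read_text().splitlines(), start=1):
            parts = line.split()
            if not parts:
                continue  # ignore blank lines
            if len(parts) != 5:
                problems.append(f"{label.name}:{n}: expected 5 values, got {len(parts)}")
                continue
            cls, coords = int(parts[0]), [float(v) for v in parts[1:]]
            if not 0 <= cls < num_classes:
                problems.append(f"{label.name}:{n}: class id {cls} out of range")
            if any(not 0.0 <= v <= 1.0 for v in coords):
                problems.append(f"{label.name}:{n}: coordinate outside 0-1")
    return problems
```

For the layout above you would call `find_label_problems("dataset", "train", num_classes=3)` and expect an empty list.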


Step 3 — Training

from ultralytics import YOLO

# Load pre-trained model (transfer learning from COCO weights)
model = YOLO('yolov8s.pt')   # small model for balance of speed/accuracy

# Train on custom dataset
results = model.train(
    data='dataset/data.yaml',
    epochs=100,           # training epochs
    imgsz=640,            # input image size
    batch=16,             # batch size (reduce if OOM)
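    lr0=0.01,             # initial learning rate
    lrf=0.001,            # final LR as a fraction of lr0 (final LR = lr0 * lrf = 1e-5)
    momentum=0.937,
    weight_decay=0.0005,
    warmup_epochs=3,      # gradual LR warmup
    mosaic=1.0,           # training augmentation (mosaic, flip, HSV) is on by default; tune via mosaic, fliplr, hsv_h/s/v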
    device='0',           # GPU device ('cpu' for CPU training)
    project='runs/detect',
    name='bottle_defect_v1',
    patience=20,          # early stopping patience
    save_period=10,       # save checkpoint every N epochs
)

print(f"Best mAP50: {results.results_dict['metrics/mAP50(B)']:.4f}")


TRAINING OUTPUT (last 5 epochs)
   Epoch   GPU_mem  box_loss  cls_loss  dfl_loss   Instances   Size
  96/100   4.12G     0.8312    0.4109    1.0821        28     640
  97/100   4.12G     0.8187    0.4051    1.0794        31     640
  98/100   4.12G     0.8104    0.3998    1.0761        26     640
  99/100   4.12G     0.8063    0.3971    1.0740        29     640
 100/100   4.12G     0.8021    0.3940    1.0715        27     640

Results saved to runs/detect/bottle_defect_v1
Best mAP50: 0.8934



Section 09
Evaluation — Understanding mAP


  The standard metric for object detection is mAP (mean Average Precision).
  Understanding it is essential for knowing whether your model is actually good.



  📊 mAP Calculation — Step by Step
  
    
      Step 1
      For each class, collect all predictions across the validation set, sorted by confidence score (highest first)
    
    
      Step 2
      Label each prediction as TP (IoU > threshold, matched to unmatched GT) or FP (duplicate or IoU too low)
    
    
      Step 3
      Compute cumulative Precision and Recall at each confidence threshold → plot the PR curve
    
    
      Step 4
      AP = area under the PR curve for one class. mAP = mean AP across all classes
    
    
      mAP50
      AP computed at IoU threshold 0.50 — lenient, commonly reported. Used in PASCAL VOC challenge.
    
    
      mAP50-95
      Average AP across IoU thresholds 0.50 to 0.95 (step 0.05). Stricter, used in COCO benchmark.
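
Steps 3 and 4 can be sketched for a single class as follows. The input is a list of TP/FP flags already sorted by descending confidence, as produced by steps 1 and 2. This version integrates the full precision-recall envelope; benchmark toolkits differ in the exact interpolation they use, so the numbers will not match every evaluator to the last digit.

```python
def average_precision(is_tp, num_gt):
    """Area under the precision-recall curve for one class.

    is_tp  -- TP/FP flags for predictions sorted by descending confidence
    num_gt -- number of ground-truth objects of this class
    """
    precisions, recalls = [], []
    tp = fp = 0
    for flag in is_tp:
        if flag:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)

    # Make precision non-increasing from left to right (the PR-curve envelope)
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas wherever recall increases
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


flags = [True, True, False, True, False]   # 5 predictions, best first
print(average_precision(flags, num_gt=4))  # 0.6875
```

mAP is then the mean of this value over all classes, and mAP50-95 repeats the whole procedure for each IoU threshold before averaging.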
    
  


from ultralytics import YOLO

# Load trained model
model = YOLO('runs/detect/bottle_defect_v1/weights/best.pt')

# Evaluate on validation set
metrics = model.val(data='dataset/data.yaml', conf=0.001, iou=0.6)

print(f"mAP50:        {metrics.box.map50:.4f}")
print(f"mAP50-95:     {metrics.box.map:.4f}")
print(f"Precision:    {metrics.box.mp:.4f}")
print(f"Recall:       {metrics.box.mr:.4f}")

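# Per-class breakdown (ap50 is ordered by ap_class_index, which lists only classes present in the validation set)
for i, cls_id in enumerate(metrics.box.ap_class_index):
    ap50 = metrics.box.ap50[i]
    print(f"  {model.names[int(cls_id)]:20s}: AP50 = {ap50:.4f}")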


EVALUATION OUTPUT
mAP50:        0.8934
mAP50-95:     0.6712
Precision:    0.9102
Recall:       0.8741

  cracked_cap         : AP50 = 0.9214
  unfilled_bottle     : AP50 = 0.8801
  missing_label       : AP50 = 0.8787


  
    
      
mAP50 Score	Interpretation	Typical Use Case
< 0.40	Poor: model barely learns to detect	More data or architectural fix needed
0.40 – 0.60	Moderate: detects most objects with many misses	Simple use cases, good starting point
0.60 – 0.80	Good: reliable in controlled conditions	Production-grade for many applications
> 0.80	Excellent: near-human level for the domain	Safety-critical or high-accuracy use cases
    
  




Section 10
Key Hyperparameters for Training YOLO


  
    
      
Parameter	Default	What It Controls	Tuning Advice
imgsz	640	Input image resolution (must be multiple of 32)	Start with 640. Try 1280 for small object detection.
epochs	100	Total training epochs	Use patience (early stopping) rather than a fixed count.
batch	16	Images per gradient update	Maximise to fit GPU. -1 = auto-batch (targets about 60% of GPU VRAM).
lr0	0.01	Initial learning rate	Reduce to 0.001 for fine-tuning a pre-trained model.
augment	False	Test-time augmentation at prediction (training augmentation is on by default and is controlled by mosaic, fliplr, hsv_h, hsv_s, hsv_v, etc.)	Leave off for speed. Tune the individual training augmentation values rather than disabling them.
degrees	0.0	Random rotation augmentation range	Enable (10–15) for aerial/satellite images where rotation matters.
mosaic	1.0	Mosaic augmentation probability	Disable for the last few epochs to stabilise training (close_mosaic, default 10, does this automatically).
conf	0.25	Inference confidence threshold	Lower (0.1) for recall-focused tasks, higher (0.5) for precision.
iou	0.7	NMS IoU threshold	Lower it to suppress more duplicate boxes; raise it for dense scenes where real objects overlap.
    
  




Section 11
YOLO Variants — Choosing the Right Model


  
    🚀
    YOLOv8n / v8s (Nano / Small)
    Fastest inference, smallest model. Use for edge devices, real-time apps on CPU, embedded systems (Raspberry Pi, mobile). Accuracy sacrifice is acceptable for most non-critical use cases.
    ~200 FPS GPU | ~30 FPS CPU
  
  
    ⚖️
    YOLOv8m (Medium)
    The production sweet spot. Best balance of speed and accuracy. Runs at 60+ FPS on GPU. If you have a modern NVIDIA card and don't need extreme speed, start here.
    ~80 FPS GPU
  
  
    🏆
    YOLOv8l / v8x (Large / XL)
    Maximum accuracy. Use for batch processing, medical imaging, satellite/aerial detection where false negatives are costly. Not suitable for real-time applications on consumer hardware.
    ~25–45 FPS GPU
  
  
    🪣
    YOLOv8-seg (Instance Segmentation)
    Outputs per-pixel masks in addition to bounding boxes. Identifies the exact shape of each object, not just the box around it. Use for robotics, medical analysis, autonomous vehicles.
    Mask + Box output
  
  
    🦥
    YOLOv8-pose (Keypoint Detection)
    Detects human skeletal keypoints (17 joints: nose, eyes, shoulders, elbows, wrists, hips, knees, ankles). Use for sports analytics, physiotherapy, gesture control.
    17 keypoints per person
  
  
    📷
    YOLOv8-obb (Oriented Bounding Boxes)
    Predicts rotated bounding boxes with an angle parameter. Critical for satellite imagery where ships, planes, and vehicles appear at arbitrary orientations. Standard boxes are not sufficient.
    x, y, w, h, θ output
  




Section 12
Model Export and Deployment


  Training in PyTorch gives you a flexible .pt file. For production deployment, export
  to optimised formats that can run without Python or PyTorch installed.


from ultralytics import YOLO

model = YOLO('runs/detect/bottle_defect_v1/weights/best.pt')

# Export to ONNX (cross-platform, runs on CPU/GPU/mobile)
model.export(format='onnx', opset=17, simplify=True)

# Export to TensorRT (NVIDIA GPU — fastest possible inference)
model.export(format='engine', half=True)   # FP16 precision

# Export to CoreML (Apple devices — iPhone, Mac)
model.export(format='coreml', nms=True)

# Export to TFLite (Android / embedded Linux)
model.export(format='tflite', int8=True)   # INT8 quantisation

# Run inference with the exported ONNX model
# (export writes best.onnx next to best.pt)
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession('runs/detect/bottle_defect_v1/weights/best.onnx')
input_name  = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name

dummy_input = np.random.rand(1, 3, 640, 640).astype(np.float32)
outputs = sess.run([output_name], {input_name: dummy_input})
print("ONNX output shape:", outputs[0].shape)  # (1, 7, 8400)
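
The raw ONNX output still needs decoding before it is useful. The sketch below assumes the layout implied by the (1, 7, 8400) shape: for each of the 8,400 locations, four box values (cx, cy, w, h in input-image pixels) followed by one score per class. It is written in plain Python so it runs without any dependencies, and it also accepts `outputs[0][0]` from the session above; NMS still has to be applied to the result.

```python
def decode_predictions(pred, conf_thresh=0.25):
    """Turn a (4 + num_classes, N) prediction grid into candidate detections.

    Returns (x1, y1, x2, y2, score, class_id) tuples, before NMS.
    """
    num_preds = len(pred[0])
    detections = []
    for i in range(num_preds):
        class_scores = [float(row[i]) for row in pred[4:]]
        score = max(class_scores)
        if score < conf_thresh:
            continue
        cx, cy, w, h = (float(pred[k][i]) for k in range(4))
        detections.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2,
                           score, class_scores.index(score)))
    return detections


# Hand-made stand-in for outputs[0][0]: 3 classes, 2 candidate locations
fake = [
    [320.0, 100.0],   # cx
    [240.0, 100.0],   # cy
    [100.0, 50.0],    # w
    [80.0, 50.0],     # h
    [0.05, 0.01],     # class 0 score
    [0.10, 0.02],     # class 1 score
    [0.90, 0.03],     # class 2 score
]
print(decode_predictions(fake))
# [(270.0, 200.0, 370.0, 280.0, 0.9, 2)]
```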



  
    
      
Export Format	Target Platform	Speed Gain	Precision	Use Case
.pt (PyTorch)	Python server	Baseline	FP32	Development, research
ONNX	Any platform	1.5–2×	FP32	Cross-platform production
TensorRT	NVIDIA GPU	3–5×	FP16/INT8	Max GPU throughput
CoreML	Apple devices	2–4×	FP16	iOS / macOS apps
TFLite INT8	Mobile / Edge	1.5–3×	INT8	Android, Raspberry Pi
    
  




Section 13
A Complete End-to-End Example — Traffic Monitoring


  📖 Real World Application
  Counting Vehicles at an Intersection
  
    A city council wants to automatically count the number of cars, trucks, and buses
    passing through a busy intersection per hour to inform traffic light timing decisions.
    The camera captures 1080p video at 30 fps. The system must run on a small edge server
    with a consumer GPU (RTX 3060). Here is the full pipeline.
  


from ultralytics import YOLO
from collections import defaultdict
import cv2, numpy as np

# Load model — YOLOv8m for good speed/accuracy on RTX 3060
model = YOLO('yolov8m.pt')

# Vehicle classes in COCO dataset
VEHICLE_CLASSES = {2: 'car', 3: 'motorcycle', 5: 'bus', 7: 'truck'}

# Counting line (horizontal, at y = 540 for 1080p video)
LINE_Y      = 540
track_hist  = defaultdict(lambda: [])  # store track paths
vehicle_cnt = defaultdict(int)         # cumulative counts
counted_ids = set()                   # track IDs already counted

cap = cv2.VideoCapture('intersection.mp4')
out = cv2.VideoWriter('output_counted.mp4',
        cv2.VideoWriter_fourcc(*'mp4v'), 30, (1920, 1080))

while cap.isOpened():
    ret, frame = cap.read()
    if not ret: break

    # Run tracking (ByteTrack — maintains object IDs across frames)
    results = model.track(frame, persist=True, tracker='bytetrack.yaml',
                          classes=list(VEHICLE_CLASSES.keys()), conf=0.4)

    if results[0].boxes.id is not None:
        boxes    = results[0].boxes.xyxy.cpu().numpy()
        track_ids= results[0].boxes.id.int().cpu().tolist()
        cls_ids  = results[0].boxes.cls.int().cpu().tolist()

        for box, tid, cid in zip(boxes, track_ids, cls_ids):
            x1, y1, x2, y2 = box
            cx, cy = int((x1+x2)/2), int((y1+y2)/2)

            # Store trajectory
            track_hist[tid].append((cx, cy))
            if len(track_hist[tid]) > 30:
                track_hist[tid].pop(0)

            # Count if vehicle crosses the line (north→south)
            hist = track_hist[tid]
            if (len(hist) >= 2 and
                    hist[-2][1] < LINE_Y <= hist[-1][1] and
                    tid not in counted_ids):
                vehicle_cnt[VEHICLE_CLASSES.get(cid, 'other')] += 1
                counted_ids.add(tid)

    # Draw counting line and totals
    cv2.line(frame, (0, LINE_Y), (1920, LINE_Y), (0, 255, 255), 2)
    y_off = 50
    for vtype, count in vehicle_cnt.items():
        cv2.putText(frame, f"{vtype}: {count}",
                   (20, y_off), cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0,255,0), 3)
        y_off += 45

    out.write(frame)

cap.release()
out.release()
print("Vehicle counts:", dict(vehicle_cnt))
print("Video saved to output_counted.mp4")


OUTPUT
Vehicle counts: {'car': 142, 'truck': 31, 'bus': 18, 'motorcycle': 9}
Video saved to output_counted.mp4
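
The counting rule at the heart of this pipeline can be isolated and tested without any video. The sketch below reproduces the same condition (the previous centre is above the line, the current centre is on or below it) on hand-written track histories; the helper names and sample tracks are invented for illustration.

```python
def crossed_downward(prev_y, curr_y, line_y):
    """True when a centre point moves from above the line to on or below it."""
    return prev_y < line_y <= curr_y


def count_crossings(tracks, line_y):
    """Count each track at most once, given {track_id: [y0, y1, ...]} histories."""
    counted = set()
    for tid, ys in tracks.items():
        for prev_y, curr_y in zip(ys, ys[1:]):
            if crossed_downward(prev_y, curr_y, line_y):
                counted.add(tid)
                break
    return len(counted)


tracks = {
    1: [500, 530, 545, 560],   # crosses y=540 once
    2: [600, 580, 560],        # moving upward, never counted
    3: [530, 545, 530, 545],   # jitters across the line, counted once
}
print(count_crossings(tracks, line_y=540))  # 2
```

Counting each track ID only once is what protects the totals from detections that jitter back and forth across the line.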



Section 14
When to Use YOLO — and When Not To


  
    ✅
    Real-Time Video Streams
    YOLO's single-pass design is built for live video. Security cameras, dashcams, drone feeds, sports tracking — any application where latency under 50ms per frame matters.
    surveillance, autonomous driving, drones
  
  
    ✅
    Multi-Object Scenes
    YOLO detects all objects simultaneously in one pass — unlike two-stage detectors that process each region separately. Crowded scenes with 50+ objects per frame are YOLO's native territory.
    crowd analysis, warehouse management
  
  
    ✅
    Edge Deployment
    With the nano/small variants and INT8 quantisation, YOLO runs on Raspberry Pi, Jetson Nano, and mobile devices. No cloud required — data stays local, latency stays low.
    IoT, manufacturing QC, retail analytics
  
  
    ❌
    Very Small Objects in Large Images
    YOLO resizes images to 640×640 by default, which crushes tiny objects. For satellite imagery where objects span only a few pixels, use SAHI (Slicing Aided Hyper Inference) to tile the image first.
    aerial survey, microscopy
  
  
    ❌
    Maximum Accuracy Over Speed
    If you have a batch job with no latency requirement and need the absolute best accuracy (e.g. medical imaging, forensics), a two-stage detector like Cascade RCNN may outperform YOLO.
    pathology, forensics, scientific imaging
  
  
    ❌
    Very Few Training Images (<100 per Class)
    YOLO is a data-hungry model. With fewer than ~100 images per class, fine-tuning may not converge. Consider few-shot detection methods (Detic, GLIP) or extensive augmentation strategies.
    rare defect detection, novel classes
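
The tiling idea mentioned above for small objects comes down to computing overlapping windows over the large image and running the detector on each one. This is a generic sketch of the slicing step only, not the SAHI library's API; the tile size and overlap are assumed values.

```python
def tile_origins(length, tile, overlap):
    """Start offsets of overlapping tiles covering [0, length) along one axis."""
    if length <= tile:
        return [0]
    step = tile - overlap
    starts = list(range(0, length - tile, step))
    starts.append(length - tile)   # final tile sits flush with the edge
    return starts


def tile_boxes(img_w, img_h, tile=640, overlap=128):
    """Return (x1, y1, x2, y2) windows that together cover the whole image."""
    return [(x, y, min(x + tile, img_w), min(y + tile, img_h))
            for y in tile_origins(img_h, tile, overlap)
            for x in tile_origins(img_w, tile, overlap)]


tiles = tile_boxes(1920, 1080)
print(len(tiles), tiles[0], tiles[-1])
# 8 (0, 0, 640, 640) (1280, 440, 1920, 1080)
```

Detections from each tile are then shifted by the tile's (x1, y1) origin and merged with NMS, so objects that straddle a tile border are not reported twice.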
  




Section 15
YOLO vs Other Detectors — Comparison


  
Property	📰 Two-Stage (Faster R-CNN)	⚡ One-Stage (YOLOv8m)
Pipeline	RPN → ROI Pooling → Classifier	Backbone → Neck → Head (1 pass)
Speed	~5 fps (GPU)	~80 fps (GPU)
mAP (COCO)	~37.9	~50.2
Small Objects	Excellent	Good (with SAHI)
Deployment	Complex	Simple (single model file)
Best For	Highest accuracy, no latency	Real-time, production, edge
    
  




Section 16
Golden Rules — YOLO in Production


  🎯 YOLO — Non-Negotiable Rules

  
    1
    
      Always start with a pre-trained model — never train from scratch.
      Transfer learning from COCO weights gives you a massive head start.
      Fine-tuning requires 10–100× fewer images and converges in hours, not days.
    
  
  
    2
    
      Image quality beats quantity. 500 clean, well-labelled, diverse images
      outperform 5,000 blurry, poorly-labelled ones. Bad labels directly cap your mAP ceiling —
      there is no model architecture fix for incorrect ground truth.
    
  
  
    3
    
      Match your training data distribution to your deployment environment.
      If your camera captures images at night, train on night images. If it's mounted at 45°,
      include tilted examples. Domain shift is the number one cause of production failures.
    
  
  
    4
    
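      Keep the default training augmentations enabled, including mosaic. YOLO's default
      augmentation pipeline (mosaic, random flip, HSV jitter, scale/translate) is highly
      tuned and often provides more effective training signal than adding more raw images.
      In Ultralytics these are controlled by individual settings such as mosaic, fliplr, and hsv_h;
      the augment flag applies to prediction, not training.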
    
  
  
    5
    
      Tune confidence and IoU thresholds for your use case.
      The default conf=0.25 is a starting point. For a security system where false negatives
      are dangerous, lower it to 0.1 and accept more false positives. For a product
      recommendation engine, raise it to 0.6 for precision.
    
  
  
    6
    
      Export to TensorRT or ONNX before shipping to production.
      Running .pt files via PyTorch in production wastes 3–5× compute compared to a compiled
      TensorRT engine. Export is a one-time cost; the inference speedup is permanent.
    
  
  
    7
    
      Monitor production performance continuously. Object distributions shift
      over time (seasonal changes, new product SKUs, different lighting). Set up automated
      alerts if mAP on a held-out benchmark drops more than 5%. Retrain quarterly at minimum.