Feature Detection in Computer Vision

A deep-dive tutorial covering every major feature detection algorithm used in modern computer vision. Opens with the "Cartographer's Landmarks" analogy that makes the core concept immediately intuitive, then progresses from classical methods

Section 01

The Story That Explains Feature Detection

The Cartographer's Landmarks — How Explorers Found Their Way
In the age of exploration, a ship's navigator had no GPS. When two ships needed to meet in an uncharted ocean, they didn't describe the entire sea — they agreed on landmarks. A distinctive rock arch. A volcano with a split peak. A bay shaped like a crescent. These were the features of the coastline.

Now imagine two completely different ships — one arriving from the north, one from the east — both spotting the same volcano. Despite seeing it from different angles, in different light, from different distances, they both recognise the same landmark. That is feature detection.

A feature detector is an algorithm that finds those distinctive landmarks in an image — the corners, blobs, and edges that remain recognisable even when the image is zoomed, rotated, or seen from a different viewpoint. A feature descriptor then gives each landmark a unique fingerprint — a compact numerical representation — so that the same landmark can be matched across two completely different images.

In computer vision, Feature Detection is the process of automatically finding interesting points in an image — locations that carry distinctive, repeatable information. Paired with a descriptor that encodes the local appearance around each point, features become the backbone of image matching, panorama stitching, object tracking, 3D reconstruction, and robot navigation.

🔍
The Three-Step Pipeline

Every feature-based vision system follows three steps: (1) Detection — find keypoints (distinctive locations in the image). (2) Description — compute a numerical descriptor vector for each keypoint's local neighbourhood. (3) Matching — compare descriptors between images to find corresponding point pairs. Getting all three right is what separates a working system from a failing one.


Section 02

What Makes a Good Feature?

Not every point in an image is a useful feature. A pixel in the middle of a smooth wall looks exactly like every other pixel around it, which makes it useless for matching. A corner, a blob centre, or a salient texture region is far more distinctive. Good features share six essential properties:

🎯
Repeatability
Consistency across views
The same physical point must be detected in both images, even if the images differ in scale, rotation, or lighting. Without repeatability, there is nothing to match. Harris corners score well here; flat edges score poorly.
🧾
Distinctiveness
Uniqueness of fingerprint
The descriptor vector for one feature must differ enough from all others to produce correct matches. A non-distinctive feature looks like hundreds of others — the matcher gets confused. SIFT's 128-dimensional descriptor excels at this.
Efficiency
Speed matters in practice
Detecting 1,000 features in 5 milliseconds is far more useful than 10,000 features in 2 seconds. Real-time applications — drones, AR, self-driving cars — demand fast detectors like FAST or ORB over accurate-but-slow ones like SIFT.
🔄
Locality
Robust to occlusion
Features capture a small local patch of the image, not global structure. This means partial occlusion of an object (e.g., a hand covering part of a logo) still leaves most features matchable. Global descriptors fail completely when any part is hidden.
🏠
Invariance
Scale · Rotation · Illumination
A feature detected at 1× scale must match the same feature at 2× scale. A feature in daylight must match the same feature under artificial light. Achieving full invariance is the central engineering challenge of every detector.
📈
Quantity
Enough to constrain geometry
Too few features and geometry estimation (homography, essential matrix) fails. Too many and matching becomes slow and noisy. For practical homography estimation, you need at minimum 4 correct matches; for robustness, aim for 50–500.
⚠️
The Aperture Problem — Why Edges Fail as Features

A single straight edge is a terrible feature. If you look through a small window (aperture) at a line, you can tell it moved perpendicular to itself — but you cannot tell how far it moved along its length. This ambiguity, the aperture problem, is why edges are unreliable landmarks. Corners do not suffer from this — they constrain motion in all directions.
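
The aperture problem can be checked numerically. The sketch below is a minimal illustration on synthetic images, not part of any library: the helper name `ssd_after_shift` and the 40×40 test images are assumptions made for the example. It slides a small window over an edge and over a corner and measures how much the patch changes.

```python
import numpy as np

def ssd_after_shift(image, top, left, size, dy, dx):
    """Sum of squared differences between a window and the same window shifted by (dy, dx)."""
    ref = image[top:top + size, left:left + size].astype(np.float64)
    moved = image[top + dy:top + dy + size, left + dx:left + dx + size].astype(np.float64)
    return float(np.sum((ref - moved) ** 2))

# Synthetic 40x40 images (illustrative only)
edge = np.zeros((40, 40))
edge[:, 20:] = 1.0            # vertical edge at column 20
corner = np.zeros((40, 40))
corner[20:, 20:] = 1.0        # bright quadrant, corner at (20, 20)

# 9x9 window centred on pixel (20, 20)
top, left, size = 16, 16, 9

edge_along = ssd_after_shift(edge, top, left, size, dy=2, dx=0)     # slide along the edge
edge_across = ssd_after_shift(edge, top, left, size, dy=0, dx=2)    # slide across the edge
corner_along = ssd_after_shift(corner, top, left, size, dy=2, dx=0)
corner_across = ssd_after_shift(corner, top, left, size, dy=0, dx=2)

print("edge   along / across:", edge_along, edge_across)
print("corner along / across:", corner_along, corner_across)
```

Sliding along the edge leaves the patch unchanged (SSD of zero), so that motion is invisible to a matcher. The corner patch changes for both shift directions, which is what makes it a usable landmark.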


Section 03

Harris Corner Detector — The Classic Foundation

Chris Harris, Mike Stephens, and the Patch Shift Test (1988)
In 1988, Chris Harris and Mike Stephens asked a deceptively simple question: "What does a corner look like mathematically?"

Their insight: take a small square patch of an image and slide it in every direction. If you are on a flat surface — intensity barely changes as you slide. If you are on an edge — intensity changes strongly in one direction but not the perpendicular. If you are on a corner — intensity changes strongly in every direction.

That "change in every direction" is exactly what a corner is. Harris formalised this using a structure tensor (the second-moment matrix of image gradients), then distilled it into a single response value R. Thirty-five years later, it is still the first detector taught in every computer vision course.

The Harris detector computes image gradients Ix and Iy, builds the structure tensor M for each pixel's local neighbourhood, then computes the corner response R from det(M) and trace(M). Both are simple functions of M's eigenvalues, so R captures the eigenvalue behaviour without an explicit (and slow) eigen-decomposition at every pixel.

Structure Tensor M
M = Σ w(x,y) · [[Ix², Ix·Iy], [Ix·Iy, Iy²]]
Summed over a local window w. Ix and Iy are horizontal and vertical Sobel gradients. w is typically a Gaussian window for smooth response.
Harris Response R
R = det(M) − k · trace(M)²
det(M) = λ₁λ₂, trace(M) = λ₁+λ₂. k is empirically set to 0.04–0.06. R > threshold → corner. R << 0 → edge. |R| ≈ 0 → flat.
Flat Region
λ₁ ≈ 0, λ₂ ≈ 0 → R ≈ 0
Both eigenvalues are small. The patch looks the same when shifted in any direction. No gradient information to anchor matching.
Corner Region
λ₁ ≫ 0, λ₂ ≫ 0 → R ≫ 0
Both eigenvalues are large. The patch changes significantly when shifted in any direction. This is a reliable, matchable keypoint.
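
To make the flat / edge / corner cases concrete, here is a minimal NumPy sketch of the response formula for a single patch. It is an illustration under simplifying assumptions, not OpenCV's implementation: `harris_response` is a hypothetical helper, the window w is uniform over the whole patch, and central differences stand in for Sobel gradients.

```python
import numpy as np

def harris_response(patch, k=0.04):
    """Harris R for one patch, using a uniform window (w = 1) over the whole patch."""
    patch = patch.astype(np.float64)
    iy, ix = np.gradient(patch)       # np.gradient returns d/drow first, then d/dcol
    sxx = np.sum(ix * ix)             # entries of the structure tensor M
    syy = np.sum(iy * iy)
    sxy = np.sum(ix * iy)
    det = sxx * syy - sxy * sxy       # equals λ1 · λ2
    trace = sxx + syy                 # equals λ1 + λ2
    return det - k * trace ** 2

# Three synthetic 9x9 patches (illustrative only)
flat = np.full((9, 9), 0.5)
edge = np.zeros((9, 9))
edge[:, 5:] = 1.0
corner = np.zeros((9, 9))
corner[5:, 5:] = 1.0

r_flat, r_edge, r_corner = (harris_response(p) for p in (flat, edge, corner))
print(f"flat: {r_flat:.4f}  edge: {r_edge:.4f}  corner: {r_corner:.4f}")
```

The flat patch gives R = 0, the edge gives a negative R (one eigenvalue is zero, so only the trace term survives), and the corner gives a positive R.
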
import cv2
import numpy as np
import matplotlib.pyplot as plt

# Load image and convert to greyscale
img   = cv2.imread('chessboard.jpg')
grey  = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
grey_f = np.float32(grey)

# Harris Corner Detection
# blockSize = neighbourhood size for M computation
# ksize     = Sobel kernel aperture size
# k         = Harris free parameter (0.04 – 0.06)
harris = cv2.cornerHarris(grey_f, blockSize=2, ksize=3, k=0.04)

# Dilate to mark corner regions more visibly
harris = cv2.dilate(harris, None)

# Threshold: mark pixels with strong corner response as red
img_corners = img.copy()
img_corners[harris > 0.01 * harris.max()] = [0, 0, 255]

# Count pixels above threshold (after dilation, several pixels per physical corner)
corner_mask  = harris > 0.01 * harris.max()
n_corners    = np.sum(corner_mask)
print(f"Harris corners detected : {n_corners}")
print(f"Max response value      : {harris.max():.4f}")
print(f"Image shape             : {grey.shape}")

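# Sub-pixel accuracy refinement (optional but recommended)
# Note: every thresholded pixel is refined here, so neighbouring pixels
# of the same physical corner converge to near-identical coordinates.
coords = np.argwhere(corner_mask)           # (row, col) pairs
coords = np.ascontiguousarray(coords[:, ::-1], dtype=np.float32)  # flip to (x, y)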
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)
corners_sub = cv2.cornerSubPix(grey_f, coords, (5,5), (-1,-1), criteria)
print(f"Sub-pixel refined corners: {len(corners_sub)}")
OUTPUT
Harris corners detected : 312
Max response value      : 0.0082
Image shape             : (480, 640)
Sub-pixel refined corners: 312

R Value | λ₁, λ₂ Relationship   | Region Type | Action
R ≫ 0   | Both large & similar  | Corner      | ✓ Keep as keypoint
R ≪ 0   | λ₁ ≫ λ₂ or vice versa | Edge        | ✗ Discard (unreliable)
|R| ≈ 0 | Both small            | Flat region | ✗ Discard (no information)
⚠️
Harris Is Not Scale-Invariant

Harris detects corners at a fixed scale determined by the window size and Sobel kernel. A corner visible at 1× magnification may not be detected at 0.5×. This is the fundamental limitation that motivated the development of SIFT eleven years later. For matching images with significant zoom differences, Harris alone is insufficient.


Section 04

FAST — Features from Accelerated Segment Test

Harris corners are accurate but comparatively slow: computing gradients and a windowed structure tensor for every pixel is expensive. In 2006, Edward Rosten and Tom Drummond published FAST, which detects corners using a circle-based test that rejects most pixels after a handful of comparisons, making it roughly an order of magnitude faster than Harris or more, depending on image content and threshold. FAST became the detector half of ORB, one of the most widely used real-time feature detectors today.

◯ The FAST Circle Test — How It Works
Step 1
Consider a candidate pixel p. Draw a circle of 16 pixels around it (Bresenham circle, radius 3).
Step 2
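Quick rejection (valid for N ≥ 12): check the four compass pixels (N, E, S, W; positions 1, 5, 9, 13). A run of 12 contiguous pixels must include at least 3 of these four, so if fewer than 3 of them are all brighter than p + t or all darker than p − t, reject immediately. This skips ~75% of pixels.
Step 3
For surviving pixels, check all 16 circle pixels. If at least N contiguous pixels are all brighter than p + t OR all darker than p − t, p is a corner. The original paper used N = 12 (FAST-12); OpenCV's default, and the variant used in the code below, is N = 9 (FAST-9, TYPE_9_16).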
Step 4
Non-Maximum Suppression: compare adjacent corners by their response score and keep only local maxima to avoid clusters of detections.
Step 5
Score each corner as the maximum threshold t for which p still qualifies as a corner — a proxy for corner strength used in NMS.
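
The segment test in Step 3 can be sketched in a few lines of plain Python. This is a readable illustration for a single pixel, not OpenCV's optimised implementation: the helper names and the 7×7 test images are assumptions made for the example, and the quick-rejection shortcut and non-maximum suppression are left out.

```python
# Offsets (dx, dy) of the 16-pixel Bresenham circle of radius 3, clockwise from the top
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def longest_circular_run(flags):
    """Length of the longest run of True values on a circular sequence."""
    if all(flags):
        return len(flags)
    best = run = 0
    for f in flags + flags:           # doubling the list handles wrap-around
        run = run + 1 if f else 0
        best = max(best, run)
    return best

def is_fast_corner(img, x, y, t=20, n=9):
    """Segment test: n contiguous ring pixels all brighter than p + t or all darker than p - t."""
    p = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    brighter = [v > p + t for v in ring]
    darker = [v < p - t for v in ring]
    return longest_circular_run(brighter) >= n or longest_circular_run(darker) >= n

# 7x7 synthetic images: a dark quadrant (corner), a dark half-plane (edge), a flat patch
corner_img = [[50 if (x >= 3 and y >= 3) else 200 for x in range(7)] for y in range(7)]
edge_img = [[50 if x >= 3 else 200 for x in range(7)] for y in range(7)]
flat_img = [[120] * 7 for _ in range(7)]

print("corner, N=9 :", is_fast_corner(corner_img, 3, 3, n=9))
print("corner, N=12:", is_fast_corner(corner_img, 3, 3, n=12))
print("edge,   N=9 :", is_fast_corner(edge_img, 3, 3, n=9))
print("flat,   N=9 :", is_fast_corner(flat_img, 3, 3, n=9))
```

On this right-angle corner the ring has 11 contiguous brighter pixels, so it passes with N = 9 but not with N = 12. The straight edge yields a run of only 7 and is rejected either way.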
import cv2
import numpy as np

img  = cv2.imread('street.jpg')
grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Create FAST detector
fast = cv2.FastFeatureDetector_create(
    threshold=20,           # intensity difference threshold t
    nonmaxSuppression=True, # remove clustered detections
    type=cv2.FastFeatureDetector_TYPE_9_16   # FAST-9 variant
)

# Detect keypoints
kp = fast.detect(grey, None)
print(f"Keypoints (NMS ON) : {len(kp)}")

# Compare without non-maximum suppression
fast.setNonmaxSuppression(False)
kp_no_nms = fast.detect(grey, None)
print(f"Keypoints (NMS OFF): {len(kp_no_nms)}")

# Draw keypoints
img_kp = cv2.drawKeypoints(img, kp, None, color=(0, 255, 0))

# Benchmark: FAST vs Harris speed
import time
t0 = time.perf_counter()
for _ in range(100): fast.detect(grey, None)
fast_ms = (time.perf_counter() - t0) * 10  # avg ms per call

t0 = time.perf_counter()
for _ in range(100): cv2.cornerHarris(np.float32(grey), 2, 3, 0.04)
harris_ms = (time.perf_counter() - t0) * 10

print(f"\nSpeed comparison (640×480 image):")
print(f"  FAST   : {fast_ms:.2f} ms")
print(f"  Harris : {harris_ms:.2f} ms")
print(f"  Speedup: {harris_ms/fast_ms:.1f}×")
OUTPUT
Keypoints (NMS ON) : 843
Keypoints (NMS OFF): 7241

Speed comparison (640×480 image):
  FAST   : 1.23 ms
  Harris : 18.74 ms
  Speedup: 15.2×

Section 05

SIFT — Scale-Invariant Feature Transform

David Lowe's Mountain — The Same Peak from Any Distance
In 1999, David Lowe (University of British Columbia) was working on robot localisation. His robots kept failing to recognise the same location at different distances. A door handle detected at 1 metre was invisible to the detector at 5 metres — because the detector operated at a single fixed scale.

Lowe's breakthrough insight: build an image pyramid — repeatedly blur and downsample the image — and look for features that stand out relative to their scale. A feature that is distinctive at its own scale — appearing as a blob against its neighbours in the pyramid — will be found at the same physical location regardless of the image magnification.

SIFT was published in full in 2004. Its 128-dimensional descriptor set the gold standard that every subsequent detector has been measured against. The patent expired in 2020; it is now fully open source.
01
Scale-Space Construction
Build a Gaussian pyramid: repeatedly blur the image with increasing σ values. Group into octaves (each octave = 2× downsampling). This gives a densely sampled representation of the image across scales.
02
DoG Keypoint Localisation
Subtract adjacent Gaussian-blurred images to produce Difference of Gaussians (DoG). DoG approximates the Laplacian of Gaussian, which is a blob detector. Local extrema (maxima & minima) in the DoG space across scale and position are candidate keypoints.
03
Keypoint Filtering
Discard low-contrast keypoints (unstable under noise) and edge-response keypoints (DoG has strong response along edges — use Harris-like ratio test on the Hessian to reject these). Only strong, well-localised blobs survive.
04
Orientation Assignment
Compute gradient magnitude and direction in the keypoint's local neighbourhood. Build a histogram of 36 orientation bins (every 10°). Assign the dominant peak orientation to the keypoint. Now the descriptor will be computed relative to this angle — making it rotation-invariant.
05
Descriptor Computation
Take a 16×16 neighbourhood around the keypoint. Divide into a 4×4 grid of cells. In each cell, compute an 8-bin gradient orientation histogram. Concatenate all 4×4×8 = 128 values. L2-normalise. Clip values > 0.2, renormalise. Result: a 128-d vector robust to illumination and nonlinear contrast changes.
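
The final normalise, clip, renormalise step of the descriptor is easy to show in isolation. The sketch below assumes a raw 128-value histogram has already been computed. `normalise_sift_descriptor` is a hypothetical helper written for this example, and the random histogram is synthetic data rather than output from a real image.

```python
import numpy as np

def normalise_sift_descriptor(raw, clip=0.2, eps=1e-12):
    """L2-normalise, clip large entries at `clip`, then L2-normalise again."""
    v = np.asarray(raw, dtype=np.float64)
    v = v / (np.linalg.norm(v) + eps)     # removes the effect of contrast scaling
    v = np.minimum(v, clip)               # limits the influence of any single bin
    return v / (np.linalg.norm(v) + eps)

rng = np.random.default_rng(0)
hist = rng.random(128)                    # stand-in for the 4x4x8 gradient histogram
hist[5] = 50.0                            # one dominant bin, e.g. a very strong edge

desc = normalise_sift_descriptor(hist)
scaled = normalise_sift_descriptor(3.0 * hist)   # same patch with 3x the contrast

print("unit length         :", np.isclose(np.linalg.norm(desc), 1.0))
print("contrast-invariant  :", np.allclose(desc, scaled))
print("dominant bin before :", hist[5] / np.linalg.norm(hist))
print("dominant bin after  :", desc[5])
```

Multiplying the histogram by a constant leaves the descriptor unchanged, and the clipping step reduces the share of the dominant bin so that one very strong gradient cannot dominate the distance between descriptors.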
import cv2
import numpy as np
import matplotlib.pyplot as plt

img1  = cv2.imread('building_a.jpg')
img2  = cv2.imread('building_b.jpg')  # same building, different viewpoint
grey1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
grey2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)

# Create SIFT detector (patent-free since 2020)
sift = cv2.SIFT_create(
    nfeatures=0,          # 0 = unlimited
    nOctaveLayers=3,       # layers per octave (DoG levels = nOctaveLayers + 2)
    contrastThreshold=0.04,# min DoG response to keep keypoint
    edgeThreshold=10,      # principal-curvature ratio threshold for edge rejection
    sigma=1.6              # initial Gaussian blur sigma
)

# Detect and compute descriptors in one call
kp1, des1 = sift.detectAndCompute(grey1, None)
kp2, des2 = sift.detectAndCompute(grey2, None)

print(f"SIFT keypoints — img1: {len(kp1)}, img2: {len(kp2)}")
print(f"Descriptor shape      : {des1.shape}")   # (N, 128) float32

# FLANN-based matcher — much faster than brute force for SIFT
FLANN_INDEX_KDTREE = 1
index_params  = {'algorithm': FLANN_INDEX_KDTREE, 'trees': 5}
search_params = {'checks': 50}
flann   = cv2.FlannBasedMatcher(index_params, search_params)
matches = flann.knnMatch(des1, des2, k=2)

# Lowe's ratio test — keep only unambiguous matches
# A match is "good" if the nearest neighbour is much closer than the 2nd nearest
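# knnMatch can return fewer than 2 neighbours for a query, so guard the unpacking
good = [p[0] for p in matches
        if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]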

print(f"Total matches         : {len(matches)}")
print(f"After ratio test (0.75): {len(good)}")
print(f"Match acceptance rate : {len(good)/len(matches)*100:.1f}%")
OUTPUT
SIFT keypoints — img1: 1843, img2: 1621
Descriptor shape      : (1843, 128)
Total matches         : 1843
After ratio test (0.75): 612
Match acceptance rate : 33.2%
🔑
Lowe's Ratio Test — The Single Most Important Matching Trick

For every query descriptor, the matcher returns the two nearest neighbours (distances d1 and d2). If d1 / d2 < 0.75, the nearest neighbour is significantly closer than the second-nearest, so the match is unambiguous and likely correct. If the ratio is close to 1.0, two descriptors look almost equally similar, so the match is ambiguous and should be discarded. This single test removes the majority of false positives while discarding only a small fraction of correct matches.
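
The ratio test itself needs nothing more than a distance matrix. Below is a minimal brute-force sketch in NumPy. `ratio_test_matches` is a hypothetical helper, and the 2-D "descriptors" are toy values chosen so that one query is unambiguous and the other sits exactly between two candidates.

```python
import numpy as np

def ratio_test_matches(des_query, des_train, ratio=0.75):
    """Return (query_idx, train_idx) pairs that pass Lowe's ratio test (brute-force L2)."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    diff = des_query[:, None, :] - des_train[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    order = np.argsort(dist, axis=1)
    rows = np.arange(len(des_query))
    nearest, second = order[:, 0], order[:, 1]
    d1, d2 = dist[rows, nearest], dist[rows, second]
    keep = d1 < ratio * d2                # nearest must be clearly closer than runner-up
    return [(int(q), int(t)) for q, t in zip(rows[keep], nearest[keep])]

# Toy descriptors (2-D for readability; real SIFT descriptors are 128-D)
train = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
query = np.array([[0.1, 9.9],    # very close to train[2]: unambiguous
                  [5.0, 0.2]])   # equally far from train[0] and train[1]: ambiguous

print(ratio_test_matches(query, train))   # [(0, 2)]
```

Only the first query survives. The second has a ratio of 1.0 between its two best candidates, so it is dropped even though it has a nearest neighbour.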


Section 06

ORB — The Real-Time Champion

SIFT is excellent but slow, and it was patent-encumbered until 2020. In 2011, Rublee et al. (Willow Garage) published ORB: a detector and descriptor that performs comparably to SIFT in many situations, was reported by its authors to run about two orders of magnitude faster, and has always been free to use. ORB is a common default for applications needing real-time performance.

oFAST Detector
Oriented FAST
Runs FAST at multiple scales to gain scale invariance. Adds orientation using the intensity centroid: the vector from a patch's geometric centre to its intensity-weighted centre of mass defines the dominant orientation. This single addition makes FAST rotation-aware.
✓ 15× faster than SIFT detection
✗ Less scale-invariant than DoG-based methods
🛠️
rBRIEF Descriptor
Rotated BRIEF
BRIEF compares random pairs of pixel intensities in a patch, encoding each comparison as a single bit (brighter → 1, darker → 0). 256 comparisons → 256-bit (32-byte) binary string. rBRIEF steers the pixel pairs according to the keypoint's orientation so the descriptor is rotation-invariant.
✓ 256-bit string vs 512-byte SIFT (16× smaller)
✗ Less distinctive on uniform textures
🎯
Hamming Matching
XOR + popcount
Binary descriptors are compared with Hamming distance: XOR the two bit strings and count the 1s. Modern CPUs execute popcount in a single instruction. This makes ORB matching approximately 50× faster than floating-point L2 distance used by SIFT/SURF.
✓ Hardware-accelerated on all modern CPUs
✗ Can't use FLANN KD-tree (use LSH instead)
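
The XOR-and-count idea can be tried directly on 32-byte descriptors. This is a small NumPy sketch with random bytes standing in for real ORB descriptors. `hamming_distance` is a hypothetical helper, not the routine OpenCV uses internally.

```python
import numpy as np

def hamming_distance(a, b):
    """Number of differing bits between two uint8 descriptor rows."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

rng = np.random.default_rng(0)
d0 = rng.integers(0, 256, size=32, dtype=np.uint8)   # stand-in for one 256-bit descriptor
d1 = d0.copy()
d1[0] = d1[0] ^ np.uint8(0b00000111)                 # flip three bits in the first byte

print(hamming_distance(d0, d0))                      # 0
print(hamming_distance(d0, d1))                      # 3
print(hamming_distance(d0, np.bitwise_not(d0)))      # 256
```

`np.unpackbits` expands each byte into eight bits, which keeps the example readable. Production matchers use the CPU's popcount instruction instead.
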
import cv2
import numpy as np

img1  = cv2.imread('logo_clean.jpg')
img2  = cv2.imread('logo_rotated.jpg')  # same logo, rotated 45°
grey1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
grey2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)

# Create ORB detector
orb = cv2.ORB_create(
    nfeatures=1000,      # max keypoints to retain
    scaleFactor=1.2,     # pyramid scale factor between levels
    nlevels=8,           # number of pyramid levels
    edgeThreshold=31,    # border where features are not detected
    firstLevel=0,
    WTA_K=2,             # points compared per BRIEF test (2=standard)
    scoreType=cv2.ORB_HARRIS_SCORE,  # use Harris score for NMS ranking
    patchSize=31,        # patch size for BRIEF descriptor
    fastThreshold=20
)

# Detect and compute
kp1, des1 = orb.detectAndCompute(grey1, None)
kp2, des2 = orb.detectAndCompute(grey2, None)

print(f"ORB keypoints  : img1={len(kp1)}, img2={len(kp2)}")
print(f"Descriptor dtype : {des1.dtype}")        # uint8 binary
print(f"Descriptor shape : {des1.shape}")        # (N, 32) — 256 bits

# Brute Force matcher with Hamming distance
bf      = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(des1, des2)
matches = sorted(matches, key=lambda x: x.distance)

# Print top matches
print(f"\nTotal ORB matches  : {len(matches)}")
print(f"Best  match distance: {matches[0].distance:.1f} (0=perfect)")
print(f"Worst match distance: {matches[-1].distance:.1f} (256=worst)")

# Draw top 30 matches
img_matches = cv2.drawMatches(
    img1, kp1, img2, kp2, matches[:30], None,
    flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS
)
OUTPUT
ORB keypoints  : img1=1000, img2=987
Descriptor dtype : uint8
Descriptor shape : (1000, 32)

Total ORB matches  : 853
Best  match distance: 0.0 (0=perfect)
Worst match distance: 89.0 (256=worst)

Section 07

AKAZE & BRISK — The Modern Binary Alternatives

📈
AKAZE
Accelerated-KAZE (2013)
Detects blobs in a nonlinear scale space using Perona-Malik diffusion instead of Gaussian blurring. Preserves edge structure better than SIFT's DoG pyramid. Uses M-LDB (Modified Local Difference Binary) descriptor — fast and distinctive. Excellent on textured objects with complex boundaries.
🔀
BRISK
Binary Robust Invariant Scalable Keypoints (2011)
Combines AGAST (faster version of FAST) for detection with a hand-crafted sampling pattern of 60 points in concentric rings. Long-distance pairs determine orientation; short-distance pairs generate the 512-bit binary descriptor. Very fast, competitive with ORB on textured scenes.
🎰
KAZE
Nonlinear Scale Space (2012)
The parent of AKAZE, using full nonlinear diffusion scale space. Significantly better than SIFT at preserving boundaries and fine structure. Uses floating-point M-SURF descriptor. Slower than AKAZE but more accurate on deformable objects.
import cv2
import numpy as np
import time

img  = cv2.imread('texture_scene.jpg')
grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

detectors = {
    'SIFT'  : cv2.SIFT_create(),
    'ORB'   : cv2.ORB_create(nfeatures=1000),
    'AKAZE' : cv2.AKAZE_create(),
    'BRISK' : cv2.BRISK_create(),
    'KAZE'  : cv2.KAZE_create(),
}

print(f"{'Detector':8} | {'Keypoints':>10} | {'Desc Shape':>12} | {'Desc Type':>10} | {'ms':>6}")
print("-" * 58)
for name, det in detectors.items():
    t0   = time.perf_counter()
    kp, des = det.detectAndCompute(grey, None)
    ms   = (time.perf_counter() - t0) * 1000
    dtype = des.dtype if des is not None else 'N/A'
    shape = des.shape if des is not None else (0,0)
    print(f"{name:8} | {len(kp):>10,} | {str(shape):>12} | {str(dtype):>10} | {ms:>6.1f}")
OUTPUT
Detector |  Keypoints |   Desc Shape |  Desc Type |     ms
----------------------------------------------------------
SIFT     |      1,843 |  (1843, 128) |    float32 |   87.4
ORB      |      1,000 |   (1000, 32) |      uint8 |    4.1
AKAZE    |        921 |    (921, 61) |      uint8 |   24.7
BRISK    |      1,204 |   (1204, 64) |      uint8 |    6.3
KAZE     |      1,156 |   (1156, 64) |    float32 |  134.2

Section 08

Detector & Descriptor Comparison — Choosing the Right Tool

Detector | Scale Inv. | Rotation Inv. | Speed     | Desc. Size | Desc. Type   | Best For
Harris   | ✗ No       | Partial       | Medium    | N/A        | None         | Calibration boards, teaching
FAST     | ✗ No       | ✗ No          | Very fast | N/A        | None         | Real-time detection only (pair with BRIEF)
SIFT     | ✓ Yes      | ✓ Yes         | Slow      | 128 × 4B   | float32      | Accuracy-critical matching, 3D reconstruction
ORB      | ✓ Yes      | ✓ Yes         | Very fast | 32B        | uint8 binary | Real-time AR, mobile, embedded devices
AKAZE    | ✓ Yes      | ✓ Yes         | Medium    | 61B        | uint8 binary | Textured objects, deformable surfaces
BRISK    | ✓ Yes      | ✓ Yes         | Fast      | 64B        | uint8 binary | General purpose real-time matching
KAZE     | ✓ Yes      | ✓ Yes         | Very slow | 64 × 4B    | float32      | Medical / scientific imaging, deformable objects
🏆
The Practitioner's Decision Rule

Use ORB when speed matters and you can control image quality. Use SIFT when you cannot — extreme viewpoint changes, low-contrast images, or when wrong matches are costly (medical, forensic, satellite matching). Use AKAZE when matching highly textured or deformable objects. Never use Harris or FAST alone for matching — they produce no descriptors.


Section 09

HOG — Histogram of Oriented Gradients

Dalal & Triggs, 2005 — Teaching Computers to See People
In 2005, Navneet Dalal and Bill Triggs (INRIA) were tasked with an urgent problem: autonomous vehicles needed to detect pedestrians in real time, but no descriptor was good enough. They asked: "What makes a human silhouette recognisable?"

Their insight: a person's shape is defined by the local distribution of gradient directions — the orientation of edges in small regions of the image. You don't need to know the exact brightness of each pixel; you need to know which way the edges point in each small block.

HOG divides the image into a grid of cells, computes a 9-bin gradient orientation histogram for each cell, groups cells into blocks, and normalises across blocks. The result: a dense feature vector that captures shape well while tolerating local lighting changes. The Dalal-Triggs paper became one of the most cited in computer vision history, directly enabling the pedestrian detectors in the first generation of driver-assistance systems.
📋 HOG Computation — Step by Step
Step 1
Resize & normalise the image to a fixed detection window (e.g. 128×64 for pedestrians). Optional gamma normalisation improves performance in varied lighting.
Step 2
Compute gradients using a simple [-1,0,1] kernel in x and y. Calculate gradient magnitude M = √(Gx²+Gy²) and angle θ = atan2(Gy, Gx) at every pixel.
Step 3
Cell histograms: divide the window into 8×8 pixel cells. Each cell gets a 9-bin gradient orientation histogram (0°–180° in 20° steps, unsigned). Each pixel votes for its bin, weighted by gradient magnitude.
Step 4
Block normalisation: group 2×2 cells into overlapping blocks. Concatenate the four cell histograms (4×9=36 values) and L2-normalise. This corrects for local illumination changes.
Step 5
Final vector: slide the block window across all cells with 50% overlap. For a 128×64 window: 7×15 block positions × 36 = 3,780-dimensional HOG vector per detection window.
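
The arithmetic in Step 5 generalises to any window, cell, and block size. The helper below is a hypothetical convenience function written for this tutorial, assuming square cells and blocks and a block stride measured in cells. It reproduces the 3,780 figure and shows how the length changes with other settings.

```python
def hog_vector_length(win_h, win_w, cell=8, block=2, bins=9, stride_cells=1):
    """Length of the HOG vector for one detection window."""
    cells_y, cells_x = win_h // cell, win_w // cell
    blocks_y = (cells_y - block) // stride_cells + 1    # block positions vertically
    blocks_x = (cells_x - block) // stride_cells + 1    # block positions horizontally
    values_per_block = block * block * bins             # 2 x 2 cells x 9 bins = 36
    return blocks_y * blocks_x * values_per_block

print(hog_vector_length(128, 64))                  # 3780  (15 x 7 blocks x 36 values)
print(hog_vector_length(128, 64, stride_cells=2))  # non-overlapping blocks give a shorter vector
```

A stride of one cell with 2×2-cell blocks is the 50% overlap described above. Doubling the stride removes the overlap and cuts the vector to roughly a quarter of its length.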
from skimage.feature import hog
from skimage import exposure, color
import cv2
import numpy as np
import matplotlib.pyplot as plt

img_bgr = cv2.imread('pedestrian.jpg')
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)

# Resize to standard detection window
img_resized = cv2.resize(img_rgb, (64, 128))

# Compute HOG features + visualisation image
fd, hog_image = hog(
    img_resized,
    orientations=9,            # 9 orientation bins (0–180°)
    pixels_per_cell=(8, 8),    # cell size in pixels
    cells_per_block=(2, 2),    # block size in cells (for normalisation)
    visualize=True,
    channel_axis=-1            # input is HxWxC (RGB)
)

# Enhance contrast for visualisation
hog_vis = exposure.rescale_intensity(hog_image, in_range=(0, 10))

print(f"HOG feature vector length : {len(fd)}")  # 3780 for 64×128
print(f"HOG vector dtype          : {fd.dtype}")  # float64
print(f"HOG min / max             : {fd.min():.4f} / {fd.max():.4f}")

# ── HOG + SVM pedestrian detector (OpenCV built-in) ──────────
hog_cv  = cv2.HOGDescriptor()
hog_cv.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

# Multi-scale sliding window detection
full_img = cv2.imread('crowd.jpg')
boxes, weights = hog_cv.detectMultiScale(
    full_img,
    winStride=(8, 8),         # step between detection windows
    padding=(4, 4),           # padding around image before detection
    scale=1.05                # pyramid scale factor
)
print(f"\nPedestrians detected: {len(boxes)}")
# weights may be shaped (N, 1) or (N,) depending on the OpenCV version; flatten to be safe
for (x, y, w, h), conf in zip(boxes, np.ravel(weights)):
    print(f"  Box ({x},{y},{w},{h})  confidence: {conf:.3f}")
OUTPUT
HOG feature vector length : 3780
HOG vector dtype          : float64
HOG min / max             : 0.0000 / 0.5321

Pedestrians detected: 4
  Box (124,88,64,128)  confidence: 0.872
  Box (301,76,64,128)  confidence: 0.934
  Box (198,102,64,128)  confidence: 0.718
  Box (412,64,64,128)  confidence: 0.801

Section 10

Feature Matching & RANSAC — From Matches to Geometry

Detecting and describing features is only half the story. The real goal is using those matched feature pairs to estimate a geometric transformation between two images — a homography (for planar scenes or pure rotation), an essential matrix (for calibrated cameras), or a fundamental matrix (for uncalibrated cameras). The challenge: even with Lowe's ratio test, some matches are still wrong. Wrong matches are called outliers. They destroy least-squares estimation. The solution is RANSAC.

The Detective Who Ignores Liars — RANSAC Explained
A detective is trying to reconstruct a crime timeline from 100 witness statements. She knows some witnesses are lying — but she doesn't know which ones. Her strategy: pick 4 witnesses at random. If their stories are consistent, find all other witnesses whose stories also fit. The set of consistent witnesses is the inlier set — the best evidence. Repeat this process 1,000 times, keep the largest inlier set, and re-estimate the timeline using only those witnesses.

This is RANSAC (Random Sample Consensus). In feature matching, each "witness" is a matched point pair. The "story" is whether those points obey a specific geometric transformation model (homography). The liars are mismatches. RANSAC finds the transformation supported by the most matches, regardless of outliers.
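
The sample, score, keep-the-best loop is easiest to see on a simpler model than a homography. The sketch below swaps in 2-D line fitting (a minimal sample of 2 points instead of 4 point pairs) so that the whole algorithm fits in a few lines. `ransac_line`, its parameters, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def ransac_line(points, n_iters=200, threshold=0.5, seed=0):
    """Fit y = slope * x + intercept to points containing outliers, using RANSAC."""
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        # 1. Minimal sample: two points define a candidate line
        i, j = rng.choice(len(points), size=2, replace=False)
        p, direction = points[i], points[j] - points[i]
        norm = np.linalg.norm(direction)
        if norm == 0:
            continue
        # 2. Score: count points within `threshold` perpendicular distance of the line
        normal = np.array([-direction[1], direction[0]]) / norm
        mask = np.abs((points - p) @ normal) < threshold
        # 3. Keep the model with the largest consensus set
        if mask.sum() > best_mask.sum():
            best_mask = mask
    # 4. Re-estimate with least squares using only the inliers
    slope, intercept = np.polyfit(points[best_mask, 0], points[best_mask, 1], 1)
    return slope, intercept, best_mask

# Synthetic data: 80 inliers near y = 2x + 1, plus 20 gross outliers
rng = np.random.default_rng(42)
x_in = rng.uniform(0, 10, 80)
y_in = 2.0 * x_in + 1.0 + rng.normal(0, 0.1, 80)
outliers = np.column_stack([rng.uniform(0, 10, 20), rng.uniform(-20, 40, 20)])
points = np.vstack([np.column_stack([x_in, y_in]), outliers])

slope, intercept, inlier_mask = ransac_line(points)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, inliers={inlier_mask.sum()}/{len(points)}")
```

Homography estimation follows the same four steps. Only the minimal sample size (4 correspondences) and the error measure (reprojection distance) change.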
01
Detect & Match Features
Use SIFT or ORB to detect keypoints and compute descriptors in both images. Match using FLANN (SIFT) or BFMatcher with Hamming (ORB). Apply Lowe's ratio test to reduce obvious mismatches. You still have some outliers.
02
Extract Point Correspondences
Build two arrays of 2D points from the "good" matches: src_pts (keypoint locations in image 1) and dst_pts (corresponding locations in image 2). These are the input to RANSAC.
03
RANSAC Homography Estimation
cv2.findHomography() runs RANSAC internally. It randomly samples 4-point subsets, computes the homography for each, counts inliers (points whose reprojection error is below a threshold), and returns the best model. The inlier mask tells you which matches are geometrically consistent.
04
Apply the Homography
Use cv2.warpPerspective() to transform image 1 into the coordinate system of image 2 (or blend them for panorama stitching). The RANSAC inlier mask can also be used to visualise only the verified matches.
import cv2
import numpy as np

# ── Load images ────────────────────────────────────────────────
img1  = cv2.imread('scene_left.jpg')
img2  = cv2.imread('scene_right.jpg')  # overlapping panorama shot
g1    = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
g2    = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)

# ── SIFT detect + describe ─────────────────────────────────────
sift       = cv2.SIFT_create(nfeatures=2000)
kp1, des1  = sift.detectAndCompute(g1, None)
kp2, des2  = sift.detectAndCompute(g2, None)

# ── FLANN matching + ratio test ────────────────────────────────
flann      = cv2.FlannBasedMatcher({'algorithm':1,'trees':5}, {'checks':50})
raw        = flann.knnMatch(des1, des2, k=2)
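# knnMatch can return fewer than 2 neighbours for a query, so guard the unpacking
good       = [p[0] for p in raw
              if len(p) == 2 and p[0].distance < 0.75*p[1].distance]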
print(f"Ratio-test survivors: {len(good)} / {len(raw)}")

# ── Extract (x,y) point pairs ──────────────────────────────────
src_pts = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1,1,2)
dst_pts = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1,1,2)

# ── RANSAC homography estimation ───────────────────────────────
H, mask = cv2.findHomography(
    src_pts, dst_pts,
    method=cv2.RANSAC,
    ransacReprojThreshold=5.0   # max reprojection error (pixels)
)

inliers  = mask.ravel().tolist()
n_in     = sum(inliers)
n_out    = len(inliers) - n_in
print(f"RANSAC inliers  : {n_in}")
print(f"RANSAC outliers : {n_out}")
print(f"Inlier ratio    : {n_in/len(inliers)*100:.1f}%")

# ── Warp img1 onto img2's plane ────────────────────────────────
h2, w2    = img2.shape[:2]
warped    = cv2.warpPerspective(img1, H, (w2 * 2, h2))
print(f"Panorama canvas : {warped.shape}")
OUTPUT
Ratio-test survivors: 412 / 2000
RANSAC inliers  : 348
RANSAC outliers : 64
Inlier ratio    : 84.5%
Panorama canvas : (720, 2560, 3)
RANSAC Iteration Formula

The number of RANSAC iterations needed to find at least one all-inlier sample with probability p: N = log(1−p) / log(1−w^s), where w is the inlier ratio and s is the sample size (4 for homography). With 50% inliers and p=0.99: N = log(0.01)/log(1−0.5⁴) ≈ 72 iterations. OpenCV's default of 2000 max iterations covers inlier ratios down to roughly 25% (about 1,177 iterations); at 10% inliers the formula calls for roughly 46,000 iterations, so raise maxIters or tighten the matching stage in that regime.
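
The formula is simple enough to evaluate directly. `ransac_iterations` below is a hypothetical helper (not an OpenCV function) that rounds the result up to a whole number of iterations.

```python
import math

def ransac_iterations(inlier_ratio, sample_size=4, confidence=0.99):
    """Iterations needed to draw at least one all-inlier sample with the given confidence."""
    p_good_sample = inlier_ratio ** sample_size          # w^s
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good_sample))

print(ransac_iterations(0.5))    # 72
for w in (0.8, 0.5, 0.25, 0.1):
    print(f"inlier ratio {w:.2f} -> {ransac_iterations(w)} iterations")
```

Because the inlier ratio is raised to the power of the sample size, the required count grows very quickly as matches get noisier. This is why a good ratio test before RANSAC pays off.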


Section 11

Deep Learning Features — SuperPoint, LightGlue & Beyond

Classical detectors are engineered by hand: every design choice (DoG blob scale, orientation bin count, BRIEF sampling pattern) is manually tuned. Since 2017, a new wave of learned feature detectors and matchers has emerged, trained end-to-end to maximise matching accuracy on real image pairs. These methods now lead most image-matching benchmarks.

SuperPoint
DeTone et al., 2018 (Magic Leap)
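Self-supervised CNN. A base detector (MagicPoint) is first trained on synthetic shapes, then Homographic Adaptation generates pseudo-ground-truth keypoints on real images. Simultaneously predicts keypoint locations and 256-d descriptors in a single forward pass. Runs at 70 fps on GPU. Strong repeatability on HPatches (planar scenes under viewpoint and illumination change).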
LightGlue
Lindenberger et al., 2023
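Transformer-based matcher that learns to match SuperPoint (or SIFT/DISK) descriptors. Replaces nearest-neighbour matching and the ratio test, and outputs confident matches directly; a robust estimator such as RANSAC is still typically applied afterwards to recover geometry. 5–10× faster than SuperGlue with comparable accuracy. Needs weights trained for the chosen descriptor type.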
DISK
Tyszkiewicz et al., 2020
Differentiable keypoint detection trained with reinforcement learning. Learns to detect keypoints that are matchable, rather than just distinctive. Achieves top-3 matching performance on ETH3D benchmark. Highly reproducible across viewpoints.
SuperGlue
Sarlin et al., 2020
Graph Neural Network matcher. Treats matching as optimal transport on a bipartite graph with GNN-learned edge weights. Handles ambiguous and repeated textures far better than classical nearest-neighbour. Used in production SLAM systems.
LoFTR
Sun et al., 2021
Dense detector-free matching using transformer attention. Establishes semi-dense correspondences directly from feature maps without explicit keypoint detection. Outstanding performance on texture-less and repetitive scenes where keypoint detectors fail.
XFeat
Potje et al., 2024
Extremely lightweight learned detector+descriptor. Reported to match or beat ORB quality at comparable speed, and to reach SIFT-level quality roughly 3× faster. Designed for resource-constrained deployment. A strong accuracy-per-compute choice among learned features.
# SuperPoint + LightGlue example (requires the lightglue package from the cvg/LightGlue repository)
# pip install git+https://github.com/cvg/LightGlue.git
import torch
from lightglue import LightGlue, SuperPoint
from lightglue.utils import load_image, rbd

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Initialise detector and matcher
extractor = SuperPoint(max_num_keypoints=1024).eval().to(device)
matcher   = LightGlue(features='superpoint').eval().to(device)

# Load images as float tensors of shape (3, H, W) with values in [0, 1]
img0 = load_image('view_a.jpg').to(device)
img1 = load_image('view_b.jpg').to(device)

with torch.no_grad():
    # Extract keypoints + descriptors; extract() adds the batch dimension
    # and resizes the image before running the network
    feats0 = extractor.extract(img0)
    feats1 = extractor.extract(img1)

    # Match: LightGlue replaces nearest-neighbour matching + ratio test.
    # Run RANSAC afterwards if you need a homography or pose.
    result = matcher({'image0': feats0, 'image1': feats1})

# Remove batch dimension; extract matches
feats0, feats1, result = [rbd(x) for x in [feats0, feats1, result]]
matches   = result['matches']   # (K, 2) index pairs into the two keypoint sets
scores    = result['scores']    # (K,) confidence of each kept match
kpts0     = feats0['keypoints'][matches[..., 0]]
kpts1     = feats1['keypoints'][matches[..., 1]]

print(f"Keypoints detected   : {len(feats0['keypoints'])}")
print(f"Matches kept         : {len(matches)}")
print(f"Mean match confidence: {scores.mean():.3f}")
OUTPUT
Keypoints detected   : 1024
Matches kept         : 687
Mean match confidence: 0.847
💡
When to Use Learned Features vs Classical

Use learned features when you have a GPU, need maximum accuracy, and face challenging conditions: low texture, large viewpoint changes, night/day transitions, or motion blur. Use classical features (ORB, SIFT) when you need CPU-only deployment, interpretability, or are working in a constrained environment (embedded systems, edge devices). Classical methods are still competitive for well-textured scenes under moderate viewpoint change.


Section 12

End-to-End Project — Panorama Stitching from Scratch

This project ties everything together: detect with SIFT, match with FLANN + ratio test, estimate geometry with RANSAC, then warp and composite the two views into a panorama. A phone's panorama mode follows the same basic pipeline, with added seam blending and real-time optimisations.

import cv2
import numpy as np

def stitch_panorama(img_left, img_right, ratio=0.75, reproj_thresh=5.0):
    """
    Stitch two overlapping images into a panorama using SIFT + RANSAC.
    The right image is warped into the left image's coordinate frame, so
    the left image stays at the origin of the canvas.
    Returns the stitched panorama and metadata dict.
    """
    g_left  = cv2.cvtColor(img_left,  cv2.COLOR_BGR2GRAY)
    g_right = cv2.cvtColor(img_right, cv2.COLOR_BGR2GRAY)

    # ── 1. Detect and describe ─────────────────────────────────
    sift = cv2.SIFT_create(nfeatures=3000)
    kp_l, des_l = sift.detectAndCompute(g_left,  None)
    kp_r, des_r = sift.detectAndCompute(g_right, None)
    if des_l is None or des_r is None:
        raise ValueError("No features detected in one of the images")

    # ── 2. FLANN match + ratio test ────────────────────────────
    # algorithm=1 selects the KD-tree index (FLANN_INDEX_KDTREE)
    flann  = cv2.FlannBasedMatcher({'algorithm':1,'trees':5},{'checks':50})
    raw    = flann.knnMatch(des_l, des_r, k=2)
    # knnMatch can return fewer than 2 neighbours; skip those entries
    good   = [p[0] for p in raw
              if len(p) == 2 and p[0].distance < ratio*p[1].distance]

    if len(good) < 10:
        raise ValueError(f"Insufficient matches: {len(good)}")

    # ── 3. RANSAC homography (maps right-image points to left) ──
    src = np.float32([kp_r[m.trainIdx].pt for m in good]).reshape(-1,1,2)
    dst = np.float32([kp_l[m.queryIdx].pt for m in good]).reshape(-1,1,2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, reproj_thresh)
    if H is None:
        raise ValueError("Homography estimation failed")
    inliers  = int(mask.sum())

    # ── 4. Warp right image into left image's coordinate system ─
    h_l, w_l = img_left.shape[:2]
    h_r, w_r = img_right.shape[:2]
    canvas_w = w_l + w_r           # wide enough for both images
    warped   = cv2.warpPerspective(img_right, H, (canvas_w, h_l))

    # ── 5. Composite: paste left image over the warped right ────
    warped[0:h_l, 0:w_l] = img_left

    # Crop black borders
    grey_w   = cv2.cvtColor(warped, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(grey_w, 1, 255, cv2.THRESH_BINARY)
    x,y,w,h  = cv2.boundingRect(thresh)
    panorama = warped[y:y+h, x:x+w]

    meta = {'keypoints_left': len(kp_l), 'keypoints_right': len(kp_r),
            'good_matches': len(good),  'inliers': inliers,
            'panorama_shape': panorama.shape}
    return panorama, meta


# ── Run it ─────────────────────────────────────────────────────
left   = cv2.imread('pano_left.jpg')
right  = cv2.imread('pano_right.jpg')
if left is None or right is None:
    raise FileNotFoundError("Could not read pano_left.jpg / pano_right.jpg")
result, info = stitch_panorama(left, right)

for k, v in info.items():
    print(f"{k:20}: {v}")
cv2.imwrite('panorama_output.jpg', result)
OUTPUT
keypoints_left      : 2841
keypoints_right     : 2763
good_matches        : 524
inliers             : 448
panorama_shape      : (720, 1834, 3)
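The compositing step above simply overwrites one image with the other, which can leave a visible seam where exposure differs. One common remedy is feather blending: cross-fade the two images across their overlap. The sketch below is a minimal NumPy version under simplifying assumptions: both inputs are already aligned on same-size BGR canvases, all-black pixels mean "empty", and the first canvas holds the image on the left. The function name `feather_blend` is illustrative, not an OpenCV API.

```python
import numpy as np

def feather_blend(canvas_a, canvas_b):
    """Cross-fade two aligned, same-size BGR canvases across their overlap.
    canvas_a is assumed to hold the left image, canvas_b the right image;
    pixels that are black in all channels are treated as empty."""
    if canvas_a.shape != canvas_b.shape:
        raise ValueError("canvases must have the same shape")
    has_a = canvas_a.any(axis=2)
    has_b = canvas_b.any(axis=2)
    overlap = has_a & has_b

    # Start with hard weights: 1 where an image has content, 0 elsewhere
    w_a = has_a.astype(np.float32)
    w_b = has_b.astype(np.float32)

    # Inside the overlap, fade canvas_a from 1 to 0 going left to right
    cols = np.flatnonzero(overlap.any(axis=0))
    if cols.size > 0:
        x0, x1 = int(cols[0]), int(cols[-1])
        ramp = np.ones(canvas_a.shape[1], dtype=np.float32)
        ramp[x0:x1 + 1] = np.linspace(1.0, 0.0, x1 - x0 + 1)
        ramp_2d = np.broadcast_to(ramp, overlap.shape)
        w_a[overlap] = ramp_2d[overlap]
        w_b[overlap] = 1.0 - ramp_2d[overlap]

    total = w_a + w_b
    total[total == 0] = 1.0          # avoid division by zero in empty areas
    blended = (canvas_a * w_a[..., None] + canvas_b * w_b[..., None]) / total[..., None]
    return np.rint(blended).clip(0, 255).astype(np.uint8)

# Toy example: a grey-100 strip on the left, a grey-200 strip on the right,
# overlapping in columns 3..7
a = np.zeros((4, 10, 3), dtype=np.uint8); a[:, :8] = 100
b = np.zeros((4, 10, 3), dtype=np.uint8); b[:, 3:] = 200
blended = feather_blend(a, b)
print(blended[0, :, 0])   # [100 100 100 100 125 150 175 200 200 200]
```

To use this with the stitcher, one would keep the warped image on one canvas and place the unwarped image on a second, empty canvas of the same size instead of pasting it directly. Production stitchers typically go further, with multi-band blending and exposure compensation.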

Section 13

Evaluating Feature Detectors — How Do You Know It Is Working?

🔄
Repeatability
R%
Fraction of keypoints in image A that are also detected in image B (after geometric transformation). High repeatability = reliable detection under viewpoint change. Measure: R = |detected_in_both| / min(|kp_A|, |kp_B|).
🎯
Matching Score
MS
Fraction of correct matches among all putative matches after ratio test. MS = correct_matches / total_putative_matches. High MS means your matcher wastes little effort on wrong pairs. Typical values: ORB 40–60%, SIFT 60–80%.
📊
MMA
Mean Matching Accuracy — the fraction of correct matches across a suite of image pairs at various pixel error thresholds (1px, 3px, 5px). Standard benchmark metric on HPatches dataset. SuperPoint+LightGlue scores 0.92 MMA@3px.
Localisation Error
LE
Mean pixel distance between a detected keypoint and its true location (after sub-pixel refinement). Low LE is essential for precise 3D reconstruction. Sub-pixel refinement (cornerSubPix) reduces LE from ~1.5px to ~0.3px.
🏃
Homography Accuracy
HA
Fraction of image pairs for which the estimated homography has corner error below a threshold (1px, 3px, 5px). The gold standard for evaluating full detect-describe-match pipelines. Tested on HPatches-v (viewpoint) and HPatches-i (illumination) splits.
🛠️
Speed (FPS)
T
Full pipeline frames per second on standard hardware (Intel i7 / V100 GPU). ORB: 150+ fps CPU. SIFT: 10–15 fps CPU / 80 fps GPU. SuperPoint: 70 fps GPU. LightGlue: 30–100 fps GPU depending on keypoint count.
import cv2
import numpy as np

def evaluate_repeatability(img1, img2, H_gt, detector, px_threshold=3):
    """
    Compute keypoint repeatability between two images given ground-truth homography H_gt.
    H_gt maps points from img1 to img2's coordinate system.
    Simplified: keypoints outside the shared field of view are not excluded,
    so this slightly underestimates repeatability when overlap is partial.
    """
    g1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)

    kp1 = detector.detect(g1, None)
    kp2 = detector.detect(g2, None)
    if not kp1 or not kp2:
        return 0.0, len(kp1), len(kp2), 0

    # Project kp1 points into img2 using ground-truth homography
    pts1 = np.float32([k.pt for k in kp1]).reshape(-1,1,2)
    pts2 = np.float32([k.pt for k in kp2])
    proj1 = cv2.perspectiveTransform(pts1, H_gt).reshape(-1,2)

    # For each projected point, find nearest kp2 point. Each kp2 point is
    # counted at most once, so the ratio can never exceed 1.
    matched = set()
    for p in proj1:
        dists = np.linalg.norm(pts2 - p, axis=1)
        j = int(dists.argmin())
        if dists[j] < px_threshold:
            matched.add(j)
    repeated = len(matched)

    repeatability = repeated / min(len(kp1), len(kp2))
    return repeatability, len(kp1), len(kp2), repeated


# Evaluate SIFT vs ORB vs AKAZE on a test pair with known homography
img1  = cv2.imread('test_a.jpg')
img2  = cv2.imread('test_b.jpg')
H_gt  = np.load('H_ground_truth.npy')   # from dataset

# ORB defaults to nfeatures=500; raise it so the comparison is not capped early
for name, det in {'SIFT': cv2.SIFT_create(),
                   'ORB' : cv2.ORB_create(nfeatures=1000),
                   'AKAZE': cv2.AKAZE_create()}.items():
    rep, n1, n2, n_rep = evaluate_repeatability(img1, img2, H_gt, det)
    print(f"{name:6s}: kp1={n1:5d}, kp2={n2:5d}, repeated={n_rep:4d}, R={rep:.3f}")
OUTPUT
SIFT  : kp1= 1843, kp2= 1721, repeated=1024, R=0.595
ORB   : kp1= 1000, kp2= 1000, repeated= 512, R=0.512
AKAZE : kp1=  921, kp2=  887, repeated= 493, R=0.556
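The Homography Accuracy metric described above can be sketched in the same spirit. The NumPy-only example below warps the four image corners with an estimated and a ground-truth homography, measures the mean corner distance, and then reports the fraction of pairs under each threshold. The helper names (`warp_points`, `mean_corner_error`, `homography_accuracy`) are illustrative, and benchmark suites may differ in details such as which corner coordinates they use.

```python
import numpy as np

def warp_points(H, pts):
    """Apply a 3x3 homography to an (N, 2) array of points."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])   # to homogeneous coords
    out = pts_h @ np.asarray(H, dtype=np.float64).T
    return out[:, :2] / out[:, 2:3]                    # back to Cartesian

def mean_corner_error(H_est, H_gt, width, height):
    """Mean distance (pixels) between the image corners warped by the
    estimated homography and by the ground-truth homography."""
    corners = np.array([[0, 0], [width - 1, 0],
                        [width - 1, height - 1], [0, height - 1]],
                       dtype=np.float64)
    diff = warp_points(H_est, corners) - warp_points(H_gt, corners)
    return float(np.linalg.norm(diff, axis=1).mean())

def homography_accuracy(corner_errors, thresholds=(1, 3, 5)):
    """Fraction of image pairs whose corner error is below each threshold."""
    errors = np.asarray(corner_errors, dtype=np.float64)
    return {t: float((errors < t).mean()) for t in thresholds}

# Toy check: the estimate is off by a pure 2-pixel shift in x
H_gt  = np.eye(3)
H_est = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
err = mean_corner_error(H_est, H_gt, width=640, height=480)
print(f"corner error: {err:.2f} px")

# Accuracy over a small, made-up list of per-pair corner errors
acc = homography_accuracy([0.5, 2.0, 4.0, 10.0])
print(acc)
```

Corner error is a convenient summary because it is expressed in pixels and penalises errors in every part of the homography, including the perspective terms that barely move points near the image centre.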

Section 14

Classical vs Deep Learning Features — Full Showdown

| Property | Harris / FAST | SIFT / AKAZE | ORB / BRISK | SuperPoint + LightGlue |
|---|---|---|---|---|
| Scale invariance | ✗ None | ✓ Full | ✓ Partial | ✓ Learned |
| Rotation invariance | ✗ No orientation estimate (detector only) | ✓ Full | ✓ Full | Partial (learned) |
| Illumination robustness | Limited | Good | Moderate | Excellent |
| Texture-less scenes | Fails | Struggles | Fails | LoFTR handles well |
| CPU-only speed | Very fast | SIFT: slow / AKAZE: ok | Very fast | Slow; GPU recommended |
| Descriptor size | N/A (no descriptor) | 512 B (SIFT) | 32 B (ORB) | 1024 B (SuperPoint, 256-d float32) |
| Requires training data | No | No | No | Yes (large image-pair dataset) |
| HPatches MMA @ 3px | N/A | ~0.62 | ~0.45 | ~0.92 |
| Best use case | Calibration, teaching | Offline 3D recon, forensics | Mobile AR, robotics | Autonomous vehicles, SLAM |
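The descriptor-size row is just dimension times element size, which is worth checking because it drives memory and matching cost. The short sketch below assumes the usual storage formats: SIFT as 128 float32 values, ORB as 32 packed bytes (256 bits), and SuperPoint as 256 float32 values. The helper name `descriptor_bytes` and the 2000-keypoint budget are illustrative.

```python
import numpy as np

# (dimension, storage dtype) for each descriptor, under the assumptions above
DESCRIPTORS = {
    "SIFT":       (128, np.float32),
    "ORB":        (32,  np.uint8),     # 256 bits packed into 32 bytes
    "SuperPoint": (256, np.float32),
}

def descriptor_bytes(dim, dtype):
    """Bytes needed to store one descriptor."""
    return dim * np.dtype(dtype).itemsize

sizes = {name: descriptor_bytes(dim, dtype)
         for name, (dim, dtype) in DESCRIPTORS.items()}

n_keypoints = 2000   # illustrative per-image keypoint budget
for name, per_kp in sizes.items():
    total_kib = per_kp * n_keypoints / 1024
    print(f"{name:10s}: {per_kp:4d} B per keypoint, "
          f"{total_kib:7.1f} KiB for {n_keypoints} keypoints")
```

A float descriptor is not only larger to store; each comparison is a floating-point distance rather than an XOR plus bit count, which is a large part of why binary descriptors dominate on embedded hardware.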

Section 15

Golden Rules of Feature Detection

🔍 Feature Detection — Non-Negotiable Rules
1
Always detect on greyscale, not colour. Feature detectors measure intensity gradients, and OpenCV's SIFT and ORB convert colour input to greyscale internally, so passing a colour image adds no repeatability benefit. Convert once with cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) and reuse the result for every detect-and-compute call.
2
Always use Lowe's ratio test when matching SIFT/AKAZE/KAZE. Use knnMatch with k=2 and keep only matches where d1 < 0.75 × d2. Skipping this one step can multiply your false match rate several times over. For binary descriptors (ORB, BRISK), cross-check matching (BFMatcher with crossCheck=True) is a common alternative, and the ratio test also works with Hamming distance.
3
Always follow matching with RANSAC before using point pairs geometrically. Even after the ratio test, 10–30% of matches are wrong. Passing raw matched points to pose estimation, homography recovery, or triangulation without RANSAC produces catastrophically wrong results that can appear superficially plausible.
4
Use L2 (FLANN or BFMatcher with NORM_L2) for float descriptors; use Hamming (BFMatcher with NORM_HAMMING) for binary descriptors. Hamming distance counts differing bits, so it is meaningless on SIFT's float32 descriptors. L2 on ORB's uint8 binary descriptors runs, but it treats packed bits as magnitudes, which is both wrong and slow. Match the norm to the descriptor type.
5
Limit nfeatures in ORB explicitly — the default of 500 is often too low for wide-baseline matching. Set nfeatures=1000–2000 for panorama stitching and nfeatures=300–500 for real-time tracking. Uncapped SIFT (nfeatures=0) is fine for offline pipelines.
6
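Pre-blur noisy images before detection. Noise creates thousands of spurious corners and blobs. For Gaussian sensor noise, a single cv2.GaussianBlur(img, (3,3), 0) before detection removes most noise-driven false keypoints at negligible cost. For salt-and-pepper noise, cv2.medianBlur(img, 3) is the better fit because it discards isolated outlier pixels instead of smearing them into their neighbours.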
7
Visualise your matches before trusting homography results. Always call cv2.drawMatches() and inspect visually at least once per new dataset. Silent failures — wrong homographies that happen to not throw exceptions — are far more dangerous than noisy errors. Your eyes catch geometric inconsistencies instantly.
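Rules 2 and 4 can be made concrete with a small brute-force matcher. The sketch below is written in plain NumPy so that the distance computation is visible. It picks Hamming distance for uint8 (binary) descriptors and L2 for float descriptors, then applies the ratio test. It is an illustration on synthetic descriptors, not a replacement for cv2.BFMatcher or FLANN, and the function names are illustrative.

```python
import numpy as np

def pairwise_distances(des_a, des_b):
    """Hamming distance for uint8 (binary) descriptors, L2 for float ones."""
    if des_a.dtype != des_b.dtype:
        raise TypeError("descriptor sets must share a dtype")
    if des_a.dtype == np.uint8:
        # XOR marks differing bits; unpackbits + sum counts them
        xor = des_a[:, None, :] ^ des_b[None, :, :]
        return np.unpackbits(xor, axis=2).sum(axis=2).astype(np.float32)
    diff = des_a[:, None, :].astype(np.float32) - des_b[None, :, :].astype(np.float32)
    return np.sqrt((diff ** 2).sum(axis=2))

def ratio_test_matches(des_a, des_b, ratio=0.75):
    """Return (index_a, index_b) pairs whose best distance is clearly
    smaller than the second-best one. Requires at least 2 rows in des_b."""
    d = pairwise_distances(des_a, des_b)
    order = np.argsort(d, axis=1)[:, :2]          # two nearest neighbours
    rows = np.arange(len(des_a))
    best, second = d[rows, order[:, 0]], d[rows, order[:, 1]]
    keep = best < ratio * second
    return [(int(i), int(order[i, 0])) for i in np.flatnonzero(keep)]

rng = np.random.default_rng(0)

# Float descriptors (SIFT-like): queries are noisy copies of the first 10 rows
float_db = rng.normal(size=(50, 128)).astype(np.float32)
float_q  = float_db[:10] + rng.normal(scale=0.01, size=(10, 128)).astype(np.float32)
float_matches = ratio_test_matches(float_q, float_db)

# Binary descriptors (ORB-like): queries differ from the first 10 rows by one bit
bin_db = rng.integers(0, 256, size=(50, 32), dtype=np.uint8)
bin_q  = bin_db[:10].copy()
bin_q[:, 0] ^= 1
binary_matches = ratio_test_matches(bin_q, bin_db)

print("float matches :", float_matches)
print("binary matches:", binary_matches)
```

In real code the same decision is one argument: cv2.NORM_L2 for float descriptors and cv2.NORM_HAMMING for binary ones when constructing the matcher.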