
Image Processing in Python

A comprehensive, hands-on guide to image processing using Python, OpenCV, and Pillow. Covers how images are stored as pixel arrays, colour space conversions (RGB, HSV, LAB), smoothing and filtering techniques, thresholding,

Section 01

The Story That Explains Image Processing

The Darkroom Developer — From Film to Insight
Picture a detective in 1940s noir fiction. She receives a blurry photograph — the only piece of evidence from the crime scene. She takes it to her darkroom, adjusts the exposure, applies a sharpening filter, and enhances the contrast. What was a useless smear of grey becomes a crisp image of a licence plate.

That detective is you. That darkroom is your Python environment. The chemicals are your algorithms. And the photograph is any image — a chest X-ray, a satellite map, a product photo, or a face in a crowd.

Image Processing is the science of transforming raw pixel data into meaningful, usable information. It is the foundation of computer vision, medical imaging, self-driving cars, and every Instagram filter you have ever used.

At its core, a digital image is nothing more than a matrix of numbers. Each number (called a pixel value) represents the intensity of light at that position. Image processing is the art of mathematically manipulating those numbers to reveal, enhance, or extract information.
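As a minimal sketch of that idea, a greyscale image can be built and manipulated as a plain NumPy array. The 3×3 values and the +100 brightness offset below are made-up illustration values, not taken from any real image.

```python
import numpy as np

# A made-up 3x3 greyscale "image": each entry is one pixel intensity (0-255).
img = np.array([[  0,  64, 128],
                [ 64, 128, 192],
                [128, 192, 255]], dtype=np.uint8)

# Invert: dark pixels become bright and vice versa.
inverted = 255 - img

# Brighten by 100. uint8 arithmetic wraps around past 255, so widen the
# dtype first, clip to the valid range, then convert back to uint8.
brighter = np.clip(img.astype(np.int16) + 100, 0, 255).astype(np.uint8)

print(inverted)
print(brighter)
```

Widening before the addition is the important design choice here: adding directly in uint8 would silently wrap bright pixels around to dark values.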

📷
Why Image Processing Matters in 2025

By commonly cited estimates, more than 3.2 billion images are shared online every day. Medical AI reads 400 million radiology scans per year. Autonomous vehicles process 40–100 camera frames per second. The ability to transform raw pixels into structured knowledge is one of the most economically valuable skills in modern data science.


Section 02

How Images Are Stored — Pixels, Channels & Arrays

Before you can process an image, you must understand how a computer sees one. There are no "pictures" inside a CPU — only arrays of integers.

🗃 Anatomy of a Digital Image
Pixel
The smallest unit. One dot of colour. Represented as an integer, typically in range 0–255 (8-bit).
Channel
A single colour component. A greyscale image has 1 channel. RGB has 3 channels. RGBA has 4 channels.
Shape
NumPy shape is (Height, Width, Channels). A 1080p RGB image is shape (1080, 1920, 3).
Bit Depth
Number of bits per channel. 8-bit = 256 levels. 16-bit = 65,536 levels (used in medical, scientific imaging).
Colour Space
The coordinate system for colour: RGB, BGR (OpenCV default), HSV, LAB, Greyscale.
Greyscale Image
Shape: (H, W)
Each pixel is a single intensity value 0–255. Black=0, White=255. Used in medical imaging, edge detection, and most classical algorithms. Memory-efficient.
🌀
RGB Image
Shape: (H, W, 3)
Three channels: Red, Green, Blue. Each pixel is a triplet like (255, 128, 0). Default in PIL/Pillow and most web formats. OpenCV stores as BGR — be careful!
🌟
HSV Image
Shape: (H, W, 3)
Hue, Saturation, Value. Far more intuitive for colour-based filtering than RGB. Hue=colour type, Saturation=colour purity, Value=brightness. Preferred for colour segmentation.
import numpy as np
import cv2
from PIL import Image
import matplotlib.pyplot as plt

# Load an image using OpenCV (reads as BGR)
img_bgr = cv2.imread('cat.jpg')

# Convert BGR → RGB for display
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)

# Inspect the array
print(f"Shape : {img_rgb.shape}")   # (H, W, 3)
print(f"Dtype : {img_rgb.dtype}")   # uint8
print(f"Min   : {img_rgb.min()}")   # 0
print(f"Max   : {img_rgb.max()}")   # 255

# Convert to greyscale
img_grey = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
print(f"Grey shape: {img_grey.shape}")   # (H, W) — no channel dimension

# Convert to HSV colour space
img_hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
print(f"H range: 0–179  |  S range: 0–255  |  V range: 0–255")
OUTPUT
Shape : (480, 640, 3)
Dtype : uint8
Min   : 0
Max   : 255
Grey shape: (480, 640)
H range: 0–179  |  S range: 0–255  |  V range: 0–255
⚠️
OpenCV vs PIL — The Channel Order Trap

OpenCV reads images in BGR order, not RGB. This trips up every beginner. If you load with OpenCV and display with matplotlib (which expects RGB), your reds will look blue. Always convert with cv2.cvtColor(img, cv2.COLOR_BGR2RGB) before displaying, or use PIL/Pillow which reads natively in RGB.
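The channel swap itself is nothing more than reversing the last axis of the array. The sketch below uses a made-up two-pixel image (no file is loaded) to show what the conversion does to the stored numbers.

```python
import numpy as np

# A made-up 1x2 image stored in BGR order:
# one pure-blue pixel followed by one pure-red pixel.
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the channel axis swaps B and R, which yields RGB order.
# .copy() makes the result a contiguous array rather than a reversed view.
rgb = bgr[..., ::-1].copy()

print(rgb[0, 0])  # the blue pixel, now written as (R, G, B)
print(rgb[0, 1])  # the red pixel, now written as (R, G, B)
```

The explicit `.copy()` is a cautious habit rather than a requirement: a reversed view shares memory with the original, and some downstream functions may expect a contiguous array.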


Section 03

Colour Space Conversions — Seeing the World Differently

The Sommelier and the Spectrometer
A master sommelier can identify a wine by colour alone — a task impossible for most people. He doesn't see "red wine." He sees a specific hue, a particular saturation, a precise value. He has mentally converted the image from the RGB colour space — which mixes everything together — into the HSV space, where colour properties are separated.

This is exactly why we convert colour spaces in image processing. Different tasks need different "views" of the same pixel data. Segmenting a ripe tomato by colour is trivial in HSV, but brutal in RGB.
Colour Space | Channels | Best Used For | Notes
RGB | Red, Green, Blue | Display, web, general use | Pillow default. Human-intuitive.
BGR | Blue, Green, Red | OpenCV internal format | Easy to forget; causes colour swaps!
Greyscale | Intensity only | Edge detection, thresholding, feature extraction | 3× smaller memory than RGB.
HSV | Hue, Saturation, Value | Colour-based segmentation, tracking | Hue is far less sensitive to brightness changes than raw RGB values.
LAB | Lightness, A (green↔red), B (blue↔yellow) | Perceptually uniform comparisons | Closest to human vision. Used in colour difference metrics.
YCrCb | Luma, Chroma red, Chroma blue | Skin detection, video compression | Separates brightness from colour. Used in JPEG.
# Real-world example: Detect a red ball using HSV masking
import cv2
import numpy as np

img = cv2.imread('playground.jpg')
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Red hue wraps around 0/180 in OpenCV's HSV
# Define lower and upper bounds for red
lower_red1 = np.array([0,   120, 70])
upper_red1 = np.array([10,  255, 255])
lower_red2 = np.array([170, 120, 70])
upper_red2 = np.array([180, 255, 255])

mask1 = cv2.inRange(hsv, lower_red1, upper_red1)
mask2 = cv2.inRange(hsv, lower_red2, upper_red2)
mask  = cv2.bitwise_or(mask1, mask2)

# Apply mask to original image
result = cv2.bitwise_and(img, img, mask=mask)
print(f"Red pixels found: {np.count_nonzero(mask)}")
OUTPUT
Red pixels found: 14382
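To see why the red mask above needs two hue ranges, it helps to compute the hue of two slightly different reds. The sketch below uses only the standard-library `colorsys` module; `opencv_hue` is a hypothetical helper that rescales hue to the 0–179 range OpenCV uses for 8-bit images, and it truncates rather than rounds, so treat its values as approximate.

```python
import colorsys

def opencv_hue(r, g, b):
    """Approximate the 8-bit OpenCV hue (0-179) for an RGB triplet in 0-255."""
    h, _, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)  # h lies in [0, 1)
    return int(h * 180) % 180

orange_red = opencv_hue(255, 30, 0)   # red leaning slightly towards orange
pink_red = opencv_hue(255, 0, 30)     # red leaning slightly towards magenta

print(orange_red, pink_red)
```

Both colours look red to a human, yet one lands at the bottom of the hue scale and the other at the top. That wrap-around is why a single `inRange` call cannot capture red.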

Section 04

Image Filtering & Smoothing — Taming the Noise

Real-world images are noisy. Camera sensors are imperfect. Transmission introduces artefacts. Lighting is uneven. Before any analysis, you often need to smooth the image — reduce noise without destroying important features. This is done by convolution: sliding a small matrix (called a kernel or filter) across the image and computing a weighted average at each position.

📋
What Is Convolution?

For every pixel, you place your kernel (e.g. a 3×3 matrix) centred on it. You multiply each kernel value by the corresponding pixel value, sum all the products, and the result becomes the new pixel value. Slide this across the entire image. The kernel's values determine what the filter does: smooth, sharpen, detect edges.
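The sliding-window procedure described above can be sketched in a few lines of NumPy. `filter2d` is a hypothetical helper written for illustration, and replicating the border pixels is an assumption made here to keep the output the same size as the input. Strictly speaking the kernel is not flipped, so this is correlation; for symmetric kernels such as the box filter the result is identical to convolution.

```python
import numpy as np

def filter2d(image, kernel):
    """Slide `kernel` over a 2-D image; borders are handled by edge replication."""
    kh, kw = kernel.shape
    pad_h, pad_w = kh // 2, kw // 2
    padded = np.pad(image.astype(np.float64),
                    ((pad_h, pad_h), (pad_w, pad_w)), mode="edge")
    out = np.zeros(image.shape, dtype=np.float64)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            window = padded[y:y + kh, x:x + kw]   # neighbourhood centred on (y, x)
            out[y, x] = np.sum(window * kernel)   # multiply, then sum the products
    return out

# A single bright pixel in a dark image, filtered with a 3x3 box (mean) kernel.
image = np.zeros((5, 5))
image[2, 2] = 9.0
box = np.full((3, 3), 1 / 9)

blurred = filter2d(image, box)
print(blurred)
```

The bright pixel is spread evenly over its 3×3 neighbourhood while the total intensity stays the same, which is what a mean filter does. The explicit Python loops are for readability only; library implementations are far faster.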

🌀
Box (Mean) Filter
Replaces each pixel with the average of its neighbours. Simple and fast. Blurs edges too aggressively. Good for quick noise removal.
cv2.blur(img, (5,5))
📈
Gaussian Filter
Uses a Gaussian (bell-curve) weighted average. Central pixels contribute more. Smoother result than box blur. The gold standard for pre-processing before edge detection.
cv2.GaussianBlur(img, (5,5), 0)
Median Filter
Replaces each pixel with the median of its neighbours. Excellent against salt-and-pepper noise. Preserves edges far better than Gaussian. Non-linear.
cv2.medianBlur(img, 5)
🌌
Bilateral Filter
Smooths while preserving edges. Considers both spatial proximity and pixel value similarity. Much slower but produces beautiful results. Used in photo editing.
cv2.bilateralFilter(img, 9, 75, 75)
✏️
Sharpening Filter
Uses a kernel that amplifies differences from neighbours. Enhances edges. Can amplify noise too — apply after smoothing. Used in print and photo restoration.
cv2.filter2D(img, -1, kernel)
🎲
Custom Kernel
Define your own kernel matrix for specialised effects. Embossing, motion blur, and unsharp masking are all custom kernels. Full creative and analytical control.
kernel = np.array([[...]])
import cv2
import numpy as np
import matplotlib.pyplot as plt

img = cv2.imread('noisy_photo.jpg')
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# 1. Box (Mean) filter
box_blur = cv2.blur(img_rgb, (5, 5))

# 2. Gaussian blur
gaussian = cv2.GaussianBlur(img_rgb, (5, 5), sigmaX=0)

# 3. Median blur — best for salt-and-pepper noise
median = cv2.medianBlur(img_rgb, 5)

# 4. Bilateral filter — edge-preserving smooth
bilateral = cv2.bilateralFilter(img_rgb, d=9, sigmaColor=75, sigmaSpace=75)

# 5. Sharpening with custom kernel
sharp_kernel = np.array([[0, -1,  0],
                           [-1,  5, -1],
                           [0, -1,  0]])
sharpened = cv2.filter2D(img_rgb, -1, sharp_kernel)

# Plot all results
fig, axes = plt.subplots(2, 3, figsize=(14, 8))
titles = ['Original', 'Box Blur', 'Gaussian', 'Median', 'Bilateral', 'Sharpened']
images = [img_rgb, box_blur, gaussian, median, bilateral, sharpened]
for ax, im, t in zip(axes.flat, images, titles):   # axes.flat is an iterator attribute, not a method
    ax.imshow(im)
    ax.set_title(t)
    ax.axis('off')
plt.tight_layout()
plt.show()
💡
Choosing the Right Filter

Use Gaussian before Canny edge detection. Use Median for scanner noise or old photographs. Use Bilateral when you need smooth skin tones but sharp edges in portraits. Never sharpen before smoothing — it amplifies noise catastrophically.
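A small synthetic test illustrates why the median filter is preferred for salt-and-pepper noise. The sketch assumes NumPy 1.20 or newer (for `sliding_window_view`) and uses a made-up flat image with a single "salt" pixel.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# A flat grey image with one corrupted "salt" pixel in the middle.
flat = np.full((5, 5), 100.0)
noisy = flat.copy()
noisy[2, 2] = 255.0

# Every 3x3 neighbourhood, with replicated borders: shape (5, 5, 3, 3).
padded = np.pad(noisy, 1, mode="edge")
windows = sliding_window_view(padded, (3, 3))

mean_filtered = windows.mean(axis=(-2, -1))
median_filtered = np.median(windows, axis=(-2, -1))

print(mean_filtered[2, 2])    # the outlier is diluted but still visible
print(median_filtered[2, 2])  # the outlier is discarded entirely
```

The mean filter smears the outlier into all nine neighbouring outputs, whereas the median ignores it because a single extreme value never reaches the middle of the sorted neighbourhood.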


Section 05

Thresholding & Segmentation — Separating What Matters

The Customs Officer at the Border
A customs officer scans X-rays of luggage. His job is simple in principle: identify what is suspicious and what is safe. He does this by looking for areas with unusual density — regions that appear as bright whites or deep blacks on the scan — and ignoring everything in the grey middle.

He is performing thresholding — the most fundamental segmentation technique in image processing. Any pixel above a brightness threshold becomes "foreground" (1). Everything else becomes "background" (0). The resulting image is purely binary: black or white.
GLOBAL
Simple Threshold
One fixed threshold value for the whole image. Fast. Fails on uneven lighting. pixel > T → 255, else → 0.
🎮
ADAPTIVE
Local Threshold
Threshold computed for small neighbourhoods. Handles shadows and uneven illumination. Uses mean or Gaussian weighted average of local region.
🌟
OTSU
Automatic Threshold
Automatically finds the optimal global threshold by minimising intra-class variance. Works best when the histogram is bimodal (clear foreground/background).
import cv2
import numpy as np

img = cv2.imread('document.jpg', cv2.IMREAD_GRAYSCALE)

# 1. Global threshold: pixels > 127 → white, else black
ret1, thresh_global = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)

# 2. Otsu's method: auto-calculate optimal threshold
ret2, thresh_otsu = cv2.threshold(
    img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
)
print(f"Otsu's optimal threshold: {ret2:.1f}")

# 3. Adaptive threshold — handles uneven lighting in documents
thresh_adapt = cv2.adaptiveThreshold(
    img,
    maxValue=255,
    adaptiveMethod=cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    thresholdType=cv2.THRESH_BINARY,
    blockSize=11,    # neighbourhood size (must be odd)
    C=2             # constant subtracted from mean
)

# Count foreground pixels in each method
for name, t in {"Global": thresh_global,
                "Otsu": thresh_otsu,
                "Adaptive": thresh_adapt}.items():
    fg = np.sum(t == 255)
    total = t.size
    print(f"{name:10s}: {fg:7,d} foreground px ({fg/total*100:.1f}%)")
OUTPUT
Otsu's optimal threshold: 142.0
Global    :  94,302 foreground px (30.7%)
Otsu      :  87,156 foreground px (28.4%)
Adaptive  : 112,450 foreground px (36.6%)
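Otsu's criterion can also be written out directly from the histogram. The sketch below is an illustrative NumPy version, not OpenCV's implementation: `otsu_threshold` is a hypothetical helper that maximises the between-class variance, which is equivalent to minimising the intra-class variance. The test image is a made-up bimodal array.

```python
import numpy as np

def otsu_threshold(grey):
    """Return the level t that best separates pixels <= t from pixels > t."""
    hist = np.bincount(grey.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)

    w0 = np.cumsum(prob)              # weight of the class with values <= t
    w1 = 1.0 - w0                     # weight of the class with values > t
    mu_cum = np.cumsum(prob * levels)
    mu_total = mu_cum[-1]

    denom = w0 * w1
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (mu_total * w0 - mu_cum) ** 2 / denom
    between[denom == 0] = 0.0         # one class is empty: no separation
    return int(np.argmax(between))

# Made-up bimodal image: half the pixels are dark (50), half are bright (200).
img = np.array([50] * 8 + [200] * 8, dtype=np.uint8).reshape(4, 4)
t = otsu_threshold(img)
mask = img > t
print(t, int(mask.sum()))
```

On a cleanly bimodal histogram any level between the two peaks separates them equally well, so the exact value returned matters less than the fact that it falls between the modes.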

Section 06

Morphological Operations — Sculpting the Binary Image

After thresholding you often have a "rough" binary image: small holes inside objects, tiny isolated noise pixels, or slightly disconnected regions. Morphological operations fix these problems by expanding or shrinking the white regions using a structuring element.

Operation | Effect | Typical Use | OpenCV Function
Erosion | Shrinks white regions. Removes thin protrusions and small blobs. | Remove salt noise, thin lines | cv2.erode()
Dilation | Expands white regions. Fills small holes and connects nearby blobs. | Fill gaps, join broken lines | cv2.dilate()
Opening | Erosion then Dilation. Removes small noise without shrinking objects. | Clean background noise | cv2.morphologyEx(MORPH_OPEN)
Closing | Dilation then Erosion. Fills holes without expanding objects. | Fill interior gaps in text/shapes | cv2.morphologyEx(MORPH_CLOSE)
Gradient | Dilation minus Erosion. Highlights object boundaries. | Find edges in binary images | cv2.morphologyEx(MORPH_GRADIENT)
Top Hat | Original minus Opening. Reveals small bright spots on dark background. | Cell detection, bright defects | cv2.morphologyEx(MORPH_TOPHAT)
import cv2
import numpy as np

# Load a binary image (e.g., after Otsu thresholding)
_, binary = cv2.threshold(
    cv2.imread('text.jpg', cv2.IMREAD_GRAYSCALE),
    0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
)

# Define structuring element (kernel)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))

# Apply each operation
eroded   = cv2.erode(binary, kernel, iterations=1)
dilated  = cv2.dilate(binary, kernel, iterations=1)
opened   = cv2.morphologyEx(binary, cv2.MORPH_OPEN,     kernel)
closed   = cv2.morphologyEx(binary, cv2.MORPH_CLOSE,    kernel)
gradient = cv2.morphologyEx(binary, cv2.MORPH_GRADIENT, kernel)

# Closing is often the final step to clean up OCR pre-processing
print("Binary image cleaned and ready for OCR.")
OUTPUT
Binary image cleaned and ready for OCR.
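Erosion and dilation with a 3×3 square structuring element reduce to a neighbourhood minimum and maximum, which makes opening and closing easy to verify on a toy image. The sketch below is an illustrative NumPy version (assuming NumPy 1.20 or newer), with hypothetical helper names. The border padding values are an assumption chosen so that the image edge neither erodes nor dilates anything.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def erode(binary):
    # Pad with white so the image border never removes foreground by itself.
    padded = np.pad(binary, 1, mode="constant", constant_values=255)
    return sliding_window_view(padded, (3, 3)).min(axis=(-2, -1))

def dilate(binary):
    # Pad with black so the image border never adds foreground by itself.
    padded = np.pad(binary, 1, mode="constant", constant_values=0)
    return sliding_window_view(padded, (3, 3)).max(axis=(-2, -1))

def opening(binary):
    return dilate(erode(binary))   # erode first, then dilate

def closing(binary):
    return erode(dilate(binary))   # dilate first, then erode

# A clean 5x5 white square on a 9x9 black background.
clean = np.zeros((9, 9), dtype=np.uint8)
clean[2:7, 2:7] = 255

speckled = clean.copy()
speckled[0, 8] = 255               # isolated noise pixel in the corner

holed = clean.copy()
holed[4, 4] = 0                    # one-pixel hole inside the square

print(np.array_equal(opening(speckled), clean))
print(np.array_equal(closing(holed), clean))
```

Opening deletes the isolated speck and restores the square to its original size; closing fills the hole without growing the square.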
Practical Rule: OCR Pre-processing Pipeline

For document digitisation tasks: Greyscale → Gaussian Blur → Otsu Threshold → Morphological Closing. This four-step pipeline can substantially improve Tesseract OCR accuracy on real-world scanned documents; gains from roughly 60% to over 95% recognition are possible on typical business documents, although the result depends heavily on scan quality.


Section 07

Edge Detection — Finding Where Things Begin and End

An edge is a location where pixel intensity changes sharply. Edges correspond to object boundaries, shadows, and surface discontinuities — they carry most of the structural information in an image. Detecting edges is the prerequisite for shape recognition, object counting, and feature extraction.

Sobel X (vertical edges)
[[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
Measures the horizontal gradient, so it responds to vertical lines. The centre row is weighted 2× for smoothing along the perpendicular axis.
Sobel Y (horizontal edges)
[[-1,-2,-1], [0,0,0], [1, 2, 1]]
Measures the vertical gradient, so it responds to horizontal lines. Combine Sobel X and Y as √(Gx²+Gy²) for full edge magnitude.
Laplacian (all directions)
[[0,1,0], [1,-4,1], [0,1,0]]
Second derivative of intensity. Detects edges in all orientations simultaneously. More sensitive to noise than Sobel.
Canny (optimal)
Gaussian → Sobel → NMS → Hysteresis
Four-stage pipeline. Produces thin, connected, noise-robust edges. The industry standard for most computer vision tasks.
import cv2
import numpy as np
import matplotlib.pyplot as plt

img = cv2.imread('building.jpg', cv2.IMREAD_GRAYSCALE)

# 1. Sobel edges (X and Y)
sobelx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)   # X gradient
sobely = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)   # Y gradient
sobel_mag = np.sqrt(sobelx**2 + sobely**2)              # Combined magnitude
sobel_mag = np.uint8(np.clip(sobel_mag, 0, 255))

# 2. Laplacian edges
laplacian = cv2.Laplacian(img, cv2.CV_64F)
# convertScaleAbs takes |x| and saturates at 255; a bare np.uint8 cast would wrap around
laplacian = cv2.convertScaleAbs(laplacian)

# 3. Canny edge detector — the gold standard
# GaussianBlur first to reduce noise
blurred = cv2.GaussianBlur(img, (5, 5), 0)
canny = cv2.Canny(
    blurred,
    threshold1=50,    # lower hysteresis threshold
    threshold2=150    # upper hysteresis threshold
)

# Auto-calculate Canny thresholds using median pixel value
median_val = np.median(blurred)
lower = int(max(0,   0.67 * median_val))
upper = int(min(255, 1.33 * median_val))
canny_auto = cv2.Canny(blurred, lower, upper)
print(f"Auto Canny thresholds: lower={lower}, upper={upper}")
OUTPUT
Auto Canny thresholds: lower=81, upper=161
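The direction each Sobel kernel responds to is easy to confirm on a synthetic image. The sketch below applies both kernels with plain NumPy (assuming NumPy 1.20 or newer) to a made-up image containing a single vertical edge; border replication is an assumption made to keep the output the same size.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)
sobel_y = sobel_x.T   # [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

# Synthetic image: dark left half, bright right half, i.e. one vertical edge.
img = np.zeros((5, 6), dtype=np.float64)
img[:, 3:] = 100.0

# All 3x3 neighbourhoods with replicated borders, then a weighted sum per kernel.
windows = sliding_window_view(np.pad(img, 1, mode="edge"), (3, 3))
gx = (windows * sobel_x).sum(axis=(-2, -1))
gy = (windows * sobel_y).sum(axis=(-2, -1))

magnitude = np.sqrt(gx ** 2 + gy ** 2)
print(gx[2])
print(np.abs(gy).max())
```

Only the X kernel fires, and only in the two columns that straddle the brightness step. The Y kernel stays at zero because nothing changes from row to row.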

Section 08

Contours & Shape Analysis — Recognising What You See

A contour is a continuous curve along a boundary where pixels change from background to foreground. After finding contours, you can calculate properties: area, perimeter, bounding box, shape similarity, centroid position, and more. This is how industrial quality control systems automatically detect defective parts.

01. Greyscale + Blur
Convert to greyscale and apply Gaussian blur to reduce noise that could create spurious contours. A 3×3 or 5×5 kernel usually suffices.
02. Threshold or Canny
Create a binary image. Contours are found on binary images. Either threshold directly, or use Canny edges. Both approaches work; choice depends on task.
03. Find Contours
cv2.findContours() returns a list of contours (each is a NumPy array of points) and a hierarchy describing parent-child relationships between nested contours.
04. Filter by Area
Discard tiny contours (noise) using cv2.contourArea(c) > min_area. For coin counting, this step alone typically removes most of the false positives caused by small specks.
05. Analyse Shape Features
Compute bounding box, min enclosing circle, convex hull, moments, circularity, aspect ratio. These become features for classification or rule-based logic.
import cv2
import numpy as np

img = cv2.imread('coins.jpg')
grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(grey, (7, 7), 0)
_, binary = cv2.threshold(blurred, 0, 255,
                            cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Find external contours only
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
print(f"Total raw contours: {len(contours)}")

# Filter and analyse each contour
min_area = 500
valid_coins = []

for cnt in contours:
    area = cv2.contourArea(cnt)
    if area < min_area:
        continue

    perimeter = cv2.arcLength(cnt, True)
    circularity = 4 * np.pi * area / (perimeter ** 2)  # 1.0 = perfect circle

    x, y, w, h = cv2.boundingRect(cnt)
    aspect_ratio = w / h

    # Coins should be near-circular
    if circularity > 0.75 and 0.8 < aspect_ratio < 1.2:
        valid_coins.append({'area': area, 'circularity': circularity,
                              'bbox': (x, y, w, h)})
        cv2.drawContours(img, [cnt], -1, (0, 255, 0), 2)

print(f"Coins detected: {len(valid_coins)}")
for i, coin in enumerate(valid_coins, 1):
    print(f"  Coin {i}: area={coin['area']:.0f}px², circularity={coin['circularity']:.3f}")
OUTPUT
Total raw contours: 47
Coins detected: 8
  Coin 1: area=4821px², circularity=0.932
  Coin 2: area=5103px², circularity=0.918
  Coin 3: area=3204px², circularity=0.904
  Coin 4: area=4956px², circularity=0.941
  Coin 5: area=5012px², circularity=0.927
  Coin 6: area=4788px², circularity=0.935
  Coin 7: area=3198px², circularity=0.912
  Coin 8: area=5089px², circularity=0.929
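The circularity measure used in the loop above, 4πA/P², is worth checking on ideal shapes to get a feel for sensible cut-offs. The sketch below uses exact geometric areas and perimeters rather than contours from an image, so real contours will score somewhat differently; the shape sizes are arbitrary.

```python
import math

def circularity(area, perimeter):
    """4*pi*A / P^2: equals 1.0 for a perfect circle, lower for other shapes."""
    return 4 * math.pi * area / perimeter ** 2

r = 40.0
circle = circularity(math.pi * r ** 2, 2 * math.pi * r)

s = 40.0
square = circularity(s * s, 4 * s)

# A 4:1 rectangle with sides 4s and s.
rectangle = circularity(4 * s * s, 2 * (4 * s + s))

print(f"circle={circle:.3f} square={square:.3f} rectangle={rectangle:.3f}")
```

An ideal square scores π/4, which is just above 0.75, so a circularity threshold of 0.75 on its own would not reject square objects. Raising the threshold or adding another shape feature may be needed when squares appear in the scene.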

Section 09

Geometric Transformations — Repositioning Pixels

Sometimes the problem isn't the pixel values — it's where the pixels are. A scanned document might be rotated. A product photo needs cropping. A wide-angle camera introduces barrel distortion. Geometric transformations reposition pixels in 2D space to correct these issues or prepare images for model training.

Transformation | Degrees of Freedom | Preserves | Use Case
Translation | Shift x, y | Shape, size, angles | Moving/centering objects
Rotation | Angle θ, centre | Shape, size | Deskewing text documents
Scaling | Scale x, y | Shape (if uniform) | Resize for neural network input
Affine | 6 DOF (translation + rotation + scale + shear) | Parallelism | Correct mild perspective distortion
Perspective | 8 DOF (homography) | Straight lines | Correct full perspective distortion, receipts, whiteboards
Undistortion | Camera matrix + distortion coefficients | True scene geometry | Fisheye / wide-angle lens correction
import cv2
import numpy as np

img = cv2.imread('whiteboard.jpg')
h, w = img.shape[:2]

# ── 1. Resize ──────────────────────────────────────────────
resized = cv2.resize(img, (224, 224), interpolation=cv2.INTER_LINEAR)

# ── 2. Rotation around centre ──────────────────────────────
centre = (w // 2, h // 2)
M_rot = cv2.getRotationMatrix2D(centre, angle=15, scale=1.0)
rotated = cv2.warpAffine(img, M_rot, (w, h))

# ── 3. Perspective transform (straighten a tilted document) ─
# Source points: four corners of the document in the original image
src_pts = np.float32([[120, 90], [510, 60],
                        [545, 420], [75, 440]])
# Destination: what we want those corners to map to (a clean rectangle)
dst_pts = np.float32([[0, 0], [400, 0],
                        [400, 500], [0, 500]])
M_persp = cv2.getPerspectiveTransform(src_pts, dst_pts)
warped  = cv2.warpPerspective(img, M_persp, (400, 500))
print(f"Warped document shape: {warped.shape}")
OUTPUT
Warped document shape: (500, 400, 3)
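The 2×3 matrix behind a rotation about an arbitrary centre can be built by hand to see where each pixel coordinate ends up. The sketch below follows the formula documented for cv2.getRotationMatrix2D, but `rotation_matrix_2d` and `apply_affine` are hypothetical helpers written for illustration, and the centre and test points are made up.

```python
import numpy as np

def rotation_matrix_2d(centre, angle_deg, scale=1.0):
    """2x3 affine matrix rotating by angle_deg about `centre` (x, y)."""
    cx, cy = centre
    a = scale * np.cos(np.deg2rad(angle_deg))
    b = scale * np.sin(np.deg2rad(angle_deg))
    # The third column shifts the result so that the centre maps to itself.
    return np.array([[ a, b, (1 - a) * cx - b * cy],
                     [-b, a, b * cx + (1 - a) * cy]])

def apply_affine(M, point):
    """Map one (x, y) point through a 2x3 affine matrix."""
    x, y = point
    return M @ np.array([x, y, 1.0])

M = rotation_matrix_2d(centre=(50, 40), angle_deg=90)

print(apply_affine(M, (50, 40)))  # the centre stays where it is
print(apply_affine(M, (60, 40)))  # a point to the right of the centre
```

With image coordinates (y pointing down), a positive angle moves the point to the right of the centre to a position above it, so the rotation appears counter-clockwise on screen.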

Section 10

Histogram & Intensity Adjustments — Correcting Exposure

The Darkroom Photographer's Trick
Before digital cameras, a master photographer would develop a print, study its histogram — the distribution of dark to light tones — and then manually adjust exposure time and chemical concentrations in the darkroom. An underexposed print had a histogram bunched on the left. An overexposed one bunched on the right. The perfect print had a histogram spread evenly across the full tonal range.

This is histogram equalisation: automatically redistributing pixel values so that every intensity level is used as uniformly as possible. The result? Maximum contrast, maximum information in the image.
⚠ Low Contrast Image
Property | Value
Pixel range used | 80–180 (100 levels of 256)
Histogram shape | Narrow spike in mid-range
Visual appearance | Flat, washed-out, grey
Contrast | Poor; hard to distinguish features
✅ After Equalisation
Property | Value
Pixel range used | 0–255 (full 256 levels)
Histogram shape | Approximately uniform
Visual appearance | High contrast, crisp details
Contrast | Excellent; features clearly visible
import cv2
import numpy as np

img = cv2.imread('dark_xray.jpg', cv2.IMREAD_GRAYSCALE)

# 1. Standard histogram equalisation
eq_global = cv2.equalizeHist(img)

# 2. CLAHE — Contrast Limited Adaptive Histogram Equalisation
# Better than global EQ: avoids over-amplifying noise in uniform regions
clahe = cv2.createCLAHE(
    clipLimit=2.0,         # contrast amplification limit
    tileGridSize=(8, 8)   # tile size for local histogram
)
eq_clahe = clahe.apply(img)

# 3. Gamma correction — brighten dark images non-linearly
def gamma_correct(image, gamma=1.5):
    inv_gamma = 1.0 / gamma
    lut = np.array([
        ((i / 255.0) ** inv_gamma) * 255
        for i in range(256)
    ]).astype('uint8')
    return cv2.LUT(image, lut)

brightened = gamma_correct(img, gamma=1.8)

# Compare mean brightness
print(f"Original   mean: {img.mean():.1f}")
print(f"Global EQ  mean: {eq_global.mean():.1f}")
print(f"CLAHE      mean: {eq_clahe.mean():.1f}")
print(f"Gamma 1.8  mean: {brightened.mean():.1f}")
OUTPUT
Original   mean: 72.4
Global EQ  mean: 127.8
CLAHE      mean: 121.3
Gamma 1.8  mean: 148.6
💡
CLAHE vs Global Equalisation

Global histogram equalisation can over-enhance noise in uniform regions (e.g. a clear sky becomes speckled). CLAHE fixes this by computing localised histograms and clipping the amplification. For medical imaging (retinal scans, X-rays, MRIs), always prefer CLAHE over global equalisation.
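Global equalisation itself is a lookup table built from the cumulative histogram. The sketch below is an illustrative NumPy version, not OpenCV's code: `equalise` is a hypothetical helper, and the test image is a made-up low-contrast ramp that uses only levels 80 to 180.

```python
import numpy as np

def equalise(grey):
    """Map intensities through the normalised cumulative histogram (CDF)."""
    hist = np.bincount(grey.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # CDF value at the darkest level present
    total = grey.size
    # Rescale so the darkest level maps to 0 and the brightest to 255.
    lut = np.round((cdf - cdf_min) / max(total - cdf_min, 1) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[grey]                   # apply the lookup table per pixel

# Made-up low-contrast image using only levels 80..180.
low_contrast = np.tile(np.arange(80, 181, dtype=np.uint8), (4, 1))
stretched = equalise(low_contrast)

print(low_contrast.min(), low_contrast.max())
print(stretched.min(), stretched.max())
```

Because the mapping is driven by how many pixels sit at each level, crowded intensity ranges are spread apart while sparse ranges are compressed. That is also why large uniform regions can have their noise amplified.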


Section 11

Feature Detection — Finding Reliable Keypoints

A feature is a small, distinctive region in an image that can be reliably detected and described — even if the image is rescaled, rotated, or slightly blurred. Features are the basis of image stitching (panoramas), visual SLAM (robot navigation), and matching objects across frames in video.

SIFT
Scale-Invariant Feature Transform
Gold standard descriptor. Invariant to scale and rotation. Produces a 128-dimensional descriptor per keypoint. The patent expired in 2020, so it is now free to use. Slower than binary alternatives such as ORB.
ORB
Oriented FAST + Rotated BRIEF
Fast, free, patent-free alternative to SIFT. Uses binary descriptors (Hamming distance matching). 10–100× faster than SIFT. Great for real-time apps.
AKAZE
Accelerated-KAZE
Nonlinear scale space feature detector. More robust to noise than SIFT on textured surfaces. Binary descriptor. Good balance of speed and accuracy.
Harris
Corner Detector
Detects corners based on local intensity changes in all directions. Very fast. Not scale-invariant. Used for image alignment and optical flow initialisation.
HOG
Histogram of Oriented Gradients
Describes gradient direction distributions over local cells. The foundation of classical pedestrian detection (DPM, SVM-HOG). Dense descriptor, not sparse keypoints.
LBP
Local Binary Pattern
Compares each pixel to its neighbours; encodes as a binary number. Extremely fast. Texture descriptor. Classic for face recognition (Viola-Jones era).
import cv2
import numpy as np

img1 = cv2.imread('scene_a.jpg', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('scene_b.jpg', cv2.IMREAD_GRAYSCALE)

# ── ORB (fast, patent-free) ────────────────────────────────
orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Match descriptors using Brute Force + Hamming (for binary descriptors)
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(des1, des2)
matches = sorted(matches, key=lambda x: x.distance)

print(f"ORB keypoints in img1: {len(kp1)}")
print(f"ORB keypoints in img2: {len(kp2)}")
print(f"Good matches (top 50): {min(50, len(matches))}")
print(f"Best match distance:   {matches[0].distance:.1f}")

# ── SIFT (higher quality) ─────────────────────────────────
sift = cv2.SIFT_create()
kp_s1, des_s1 = sift.detectAndCompute(img1, None)
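# Use FLANN for efficient SIFT matching (L2 norm, float descriptors).
# algorithm=1 selects the KD-tree index; checks=50 bounds the search effort.
# The matcher is only constructed here; matching needs SIFT descriptors from a second image.
flann = cv2.FlannBasedMatcher(
    {'algorithm': 1, 'trees': 5},
    {'checks': 50}
)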
print(f"SIFT keypoints in img1: {len(kp_s1)}")
OUTPUT
ORB keypoints in img1: 500
ORB keypoints in img2: 500
Good matches (top 50): 50
Best match distance:   18.0
SIFT keypoints in img1: 1243
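The "distance" reported for ORB matches is a Hamming distance: the number of bits that differ between two binary descriptors. The sketch below computes it by hand for two made-up 32-byte descriptors (the length ORB produces); `hamming_distance` is a hypothetical helper, not an OpenCV function.

```python
import numpy as np

def hamming_distance(d1, d2):
    """Count differing bits between two uint8 descriptor arrays."""
    differing = np.bitwise_xor(d1, d2)            # a 1-bit wherever they disagree
    return int(np.unpackbits(differing).sum())    # count those bits

# Two made-up 32-byte (256-bit) descriptors that differ in five bits.
a = np.zeros(32, dtype=np.uint8)
b = a.copy()
b[0] = 0b00001111    # four bits differ in the first byte
b[31] = 0b10000000   # one bit differs in the last byte

print(hamming_distance(a, b))
print(hamming_distance(a, a))
```

XOR followed by a bit count is very cheap, which is a large part of why binary descriptors match so much faster than float descriptors compared with an L2 norm.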

Section 12

Image Augmentation — Multiplying Your Training Data

Deep learning models need vast amounts of training data. But collecting and labelling thousands of images for every class is expensive and slow. Image augmentation artificially expands your dataset by applying random but realistic transformations to existing images. A single labelled photo of a cat can become 50+ training examples — each slightly different — all still cats.

🎯
Why Augmentation Dramatically Improves Models

Without augmentation, a model trained on upright cats may completely fail to recognise a sideways cat — because it memorised specific pixel patterns rather than the concept. Augmentation forces the model to learn invariant representations: "catness" that persists across rotations, flips, brightness changes, and crops.

Flip
Horizontal / Vertical
Mirror the image left-right or top-bottom. Free augmentation for natural images. Avoid vertical flips for ground-level scenes where sky=up is semantic; they are fine for aerial or satellite imagery, which has no canonical orientation.
🔃
Rotation
±15°, ±30°, 90°, 180°
Rotate by random angle. Use small angles (±15°) for natural images, full 90°/180° for satellite or pathology images with no canonical orientation.
☀️
Brightness/Contrast
Random jitter
Randomly adjust brightness, contrast, saturation, hue. Simulates different lighting conditions and camera settings. Essential for outdoor datasets.
📷
Crop / Zoom
Random crop + resize
Take a random crop and resize back. Forces the model to recognise objects from partial views. Standard in ImageNet training since AlexNet (2012).
🏈
Noise Injection
Gaussian, salt-and-pepper
Add random noise to simulate real sensor imperfections. Makes models robust to low-quality input at inference. Crucial for medical imaging models.
📈
Cutout / GridMask
Structured occlusion
Randomly zero out rectangular patches. Forces the model to use multiple discriminative features rather than relying on one region. Reduces overfitting.
from PIL import Image, ImageEnhance, ImageFilter
import numpy as np
import random

def augment_image(pil_img):
    """Apply a random chain of augmentations to a PIL image."""
    img = pil_img.copy()

    # Random horizontal flip
    if random.random() > 0.5:
        img = img.transpose(Image.FLIP_LEFT_RIGHT)

    # Random rotation
    angle = random.uniform(-15, 15)
    img = img.rotate(angle, expand=False, fillcolor=(128, 128, 128))

    # Random brightness
    factor = random.uniform(0.7, 1.3)
    img = ImageEnhance.Brightness(img).enhance(factor)

    # Random contrast
    factor = random.uniform(0.8, 1.4)
    img = ImageEnhance.Contrast(img).enhance(factor)

    # Occasional Gaussian blur (simulates motion / focus issues)
    if random.random() > 0.7:
        img = img.filter(ImageFilter.GaussianBlur(radius=1))

    return img

# Generate 50 augmented versions of one image
original = Image.open('product.jpg').convert('RGB')   # RGB mode so the 3-tuple fillcolor is valid
augmented_set = [augment_image(original) for _ in range(50)]
print(f"Dataset grew from 1 image to {len(augmented_set)+1} images.")
OUTPUT
Dataset grew from 1 image to 51 images.
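Cutout is not covered by the Pillow function above, but it is short enough to sketch in NumPy. `cutout` below is a hypothetical helper. Keeping the patch fully inside the image and filling it with zeros are simplifying assumptions, and the all-white input image is made up.

```python
import numpy as np

def cutout(image, size, rng):
    """Return a copy of `image` with one random size x size patch zeroed out."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - size + 1)    # upper bound is exclusive
    left = rng.integers(0, w - size + 1)
    out = image.copy()                     # never modify the original sample
    out[top:top + size, left:left + size] = 0
    return out

rng = np.random.default_rng(seed=0)
image = np.full((32, 32, 3), 255, dtype=np.uint8)   # made-up all-white RGB image

augmented = cutout(image, size=8, rng=rng)
print(np.count_nonzero(augmented == 0))
```

Passing the random generator in explicitly keeps the augmentation reproducible, which is useful when debugging a training run.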

Section 13

Deep Learning Integration — From Pixels to Predictions

Modern image processing pipelines end where deep learning begins. The classical operations you have learned — denoising, normalisation, resizing, augmentation — are the pre-processing layer that feeds Convolutional Neural Networks (CNNs). A CNN then automatically learns the most powerful filters and features for your specific task.

🤖 CNN Classification Pipeline — End-to-End
Input
Raw image file (JPEG, PNG, TIFF) — any size, any colour space.
Pre-process
Resize to model input size (e.g. 224×224). Convert to RGB. Convert to float32. Normalise pixel values to [0,1] or ImageNet mean/std.
Augment
During training only: apply random flip, rotation, crop, colour jitter. During inference: use centre crop or test-time augmentation (TTA).
Conv Layers
Learnable 3×3 convolutional kernels extract low-level features (edges, textures) in early layers, high-level concepts (eyes, wheels) in deep layers.
Pooling
Max pooling reduces spatial dimensions, retaining the strongest activations. Global average pooling in modern architectures (ResNet, EfficientNet).
Output
Softmax probabilities per class (classification), bounding boxes (detection), or pixel-wise masks (segmentation).
import torch
import torchvision.transforms as T
import torchvision.models as models
from PIL import Image

# ── Pre-processing pipeline for ImageNet-pretrained models ──
preprocess = T.Compose([
    T.Resize(256),                        # Resize shorter side to 256
    T.CenterCrop(224),                     # Crop centre 224×224
    T.ToTensor(),                           # PIL → float32 tensor [0,1]
    T.Normalize(
        mean=[0.485, 0.456, 0.406],       # ImageNet channel means
        std= [0.229, 0.224, 0.225]        # ImageNet channel stds
    )
])

# Load a pretrained ResNet-50
model = models.resnet50(weights='IMAGENET1K_V2')
model.eval()

# Inference on a single image
img = Image.open('dog.jpg').convert('RGB')
tensor = preprocess(img).unsqueeze(0)   # add batch dimension → (1, 3, 224, 224)

with torch.no_grad():
    logits = model(tensor)                   # shape: (1, 1000)
    probs  = torch.softmax(logits, dim=1)
    top5   = torch.topk(probs, 5)

print("Top-5 Predictions:")
for score, idx in zip(top5.values[0], top5.indices[0]):
    print(f"  Class {idx.item():4d}: {score.item()*100:.2f}%")
OUTPUT
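Top-5 Predictions:
  Class  207: 82.43%  ← golden_retriever
  Class  208: 9.12%  ← Labrador_retriever
  Class  209: 3.77%  ← Chesapeake_Bay_retriever
  Class  151: 1.22%  ← Chihuahua
  Class  248: 0.89%  ← Eskimo_dog
(The script prints only the class indices and scores; the class names after the arrows are annotations.)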
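The ToTensor and Normalize steps in the pipeline above amount to a few array operations. The sketch below reproduces them in NumPy for an image that is already 224×224. `to_model_input` is a hypothetical helper, and the all-black input is made up so the result is easy to check by hand.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_model_input(rgb_uint8):
    """HWC uint8 RGB image -> normalised float32 batch of shape (1, 3, H, W)."""
    x = rgb_uint8.astype(np.float32) / 255.0    # scale to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD      # per-channel standardisation
    x = np.transpose(x, (2, 0, 1))              # HWC -> CHW
    return x[np.newaxis, ...]                   # add the batch dimension

black = np.zeros((224, 224, 3), dtype=np.uint8)
batch = to_model_input(black)

print(batch.shape, batch.dtype)
print(batch[0, :, 0, 0])   # one value per channel: (0 - mean) / std
```

Normalisation happens while the channels are still the last axis so that the three-element mean and std arrays broadcast across every pixel; the transpose to channel-first order comes afterwards.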

Section 14

Classical vs Deep Learning — Full Task Comparison

Task | Classical Approach | Deep Learning Approach | Recommended
Noise Removal | Median / Bilateral Filter | DnCNN, Noise2Noise | Classical (fast, no training data needed)
Edge Detection | Canny, Sobel | HED, RCF (learned boundaries) | Classical for most uses. DL for complex scenes.
Image Classification | HOG + SVM (limited accuracy) | ResNet, EfficientNet, ViT | Deep Learning always wins here
Object Detection | HOG+SVM sliding window (slow) | YOLO, Faster R-CNN, DETR | Deep Learning, by a wide margin
Semantic Segmentation | GrabCut, Watershed (manual) | U-Net, SegFormer, SAM | Deep Learning for accuracy, GrabCut for quick prototypes
Image Stitching | ORB/SIFT + RANSAC Homography | DeepPano, UDIS++ | Classical is battle-tested and reliable
OCR Pre-processing | Threshold + Morph ops | End-to-end STR (Scene Text Recognition) | Classical pipeline + Tesseract for most tasks
Face Detection | Viola-Jones Haar Cascades | MTCNN, RetinaFace, InsightFace | Deep Learning (much higher accuracy)

Section 15

Golden Rules

📷 Image Processing — Non-Negotiable Rules
1. Always check dtype and value range first. OpenCV loads as uint8 [0–255]. PyTorch expects float32 [0,1]. TensorFlow may expect [0,1] or [-1,1]. Forgetting this causes invisible bugs where images display black or models produce garbage.
2. OpenCV reads BGR, not RGB. Every time you load with cv2.imread() and pass to a library expecting RGB (matplotlib, PIL, PyTorch), convert explicitly with cv2.cvtColor(img, cv2.COLOR_BGR2RGB). This is the #1 beginner bug.
3. Smooth before you sharpen or detect edges. Canny edge detection preceded by a Gaussian blur will usually produce cleaner results than running Canny on the raw image. Noise looks like tiny edges, so eliminate it first.
4. Use CLAHE, not global histogram equalisation, for medical or scientific images. Global EQ over-amplifies noise in flat regions. CLAHE applies local equalisation with a contrast clip, preserving diagnostic detail where it matters most.
5. For colour segmentation, convert to HSV first. Selecting a range of red or green in RGB requires tightly coupled bounds on all three channels. In HSV, the selection is driven mainly by hue, with loose bounds on saturation and value. Your masks will be more accurate and far easier to tune.
6. Morphological Opening removes noise; Closing fills holes. Open = erode then dilate (removes salt noise without shrinking objects). Close = dilate then erode (fills pepper holes without expanding objects). Memorise this order: swapping them gives the wrong result every time.
7. Always augment training data but never augment test data (except for test-time augmentation ensembles). Augmenting test data changes the distribution you're evaluating on and invalidates your benchmark. Keep test sets frozen, representative, and pristine.