Image Processing in Python

Section 01

The Story That Explains Image Processing

📖 Real World Analogy

The Darkroom Developer — From Film to Insight

Picture a detective in 1940s noir fiction. She receives a blurry photograph — the only piece of evidence from the crime scene. She takes it to her darkroom, adjusts the exposure, applies a sharpening filter, and enhances the contrast. What was a useless smear of grey becomes a crisp image of a licence plate.

That detective is you. That darkroom is your Python environment. The chemicals are your algorithms. And the photograph is any image — a chest X-ray, a satellite map, a product photo, or a face in a crowd.

Image Processing is the science of transforming raw pixel data into meaningful, usable information. It is the foundation of computer vision, medical imaging, self-driving cars, and every Instagram filter you have ever used.

At its core, a digital image is nothing more than a matrix of numbers. Each number (called a pixel value) represents the intensity of light at that position. Image processing is the art of mathematically manipulating those numbers to reveal, enhance, or extract information.
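To make the "matrix of numbers" idea concrete, here is a minimal sketch using NumPy. The 3×3 array and its values are invented for illustration; a real image would simply be a much larger array of the same kind.

```python
import numpy as np

# A tiny 3x3 greyscale "image": 0 is black, 255 is white (values are made up).
img = np.array([[  0,  64, 128],
                [ 64, 128, 192],
                [128, 192, 255]], dtype=np.uint8)

# Invert the image: dark becomes light and vice versa.
inverted = 255 - img

# Brighten by 100. Widen to int16 first so the sum cannot wrap around,
# then clip back into the valid 8-bit range.
brightened = np.clip(img.astype(np.int16) + 100, 0, 255).astype(np.uint8)

print(img.shape, img.dtype)   # (3, 3) uint8
print(inverted[0])            # [255 191 127]
print(brightened[2])          # [228 255 255]
```

The widening step matters: adding directly to a uint8 array can wrap around past 255 instead of saturating.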

📷

Why Image Processing Matters in 2025

Every day, an estimated 3.2 billion images are shared online. Medical AI reads 400 million radiology scans per year. Autonomous vehicles process 40–100 camera frames per second. The ability to transform raw pixels into structured knowledge is one of the most economically valuable skills in modern data science.

Section 02

How Images Are Stored — Pixels, Channels & Arrays

Before you can process an image, you must understand how a computer sees one. There are no "pictures" inside a CPU — only arrays of integers.

🗃 Anatomy of a Digital Image

Pixel

The smallest unit. One dot of colour. Represented as an integer, typically in range 0–255 (8-bit).

Channel

A single colour component. A greyscale image has 1 channel. RGB has 3 channels. RGBA has 4 channels.

Shape

NumPy shape is (Height, Width, Channels). A 1080p RGB image is shape (1080, 1920, 3).

Bit Depth

Number of bits per channel. 8-bit = 256 levels. 16-bit = 65,536 levels (used in medical, scientific imaging).

Colour Space

The coordinate system for colour: RGB, BGR (OpenCV default), HSV, LAB, Greyscale.

⬛

Greyscale Image

Shape: (H, W)

Each pixel is a single intensity value 0–255. Black=0, White=255. Used in medical imaging, edge detection, and most classical algorithms. Memory-efficient.

🌀

RGB Image

Shape: (H, W, 3)

Three channels: Red, Green, Blue. Each pixel is a triplet like (255, 128, 0). Default in PIL/Pillow and most web formats. OpenCV stores as BGR — be careful!

🌟

HSV Image

Shape: (H, W, 3)

Hue, Saturation, Value. Far more intuitive for colour-based filtering than RGB. Hue=colour type, Saturation=colour purity, Value=brightness. Preferred for colour segmentation.

import numpy as np
import cv2
from PIL import Image
import matplotlib.pyplot as plt

# Load an image using OpenCV (reads as BGR)
img_bgr = cv2.imread('cat.jpg')

# Convert BGR → RGB for display
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)

# Inspect the array
print(f"Shape : {img_rgb.shape}")   # (H, W, 3)
print(f"Dtype : {img_rgb.dtype}")   # uint8
print(f"Min   : {img_rgb.min()}")   # 0
print(f"Max   : {img_rgb.max()}")   # 255

# Convert to greyscale
img_grey = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
print(f"Grey shape: {img_grey.shape}")   # (H, W) — no channel dimension

# Convert to HSV colour space
img_hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
print(f"H range: 0–179  |  S range: 0–255  |  V range: 0–255")

OUTPUT

Shape : (480, 640, 3)
Dtype : uint8
Min   : 0
Max   : 255
Grey shape: (480, 640)
H range: 0–179  |  S range: 0–255  |  V range: 0–255

⚠️

OpenCV vs PIL — The Channel Order Trap

OpenCV reads images in BGR order, not RGB. This trips up every beginner. If you load with OpenCV and display with matplotlib (which expects RGB), your reds will look blue. Always convert with cv2.cvtColor(img, cv2.COLOR_BGR2RGB) before displaying, or use PIL/Pillow which reads natively in RGB.
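The channel swap itself is simple enough to show without OpenCV. The sketch below uses a synthetic two-pixel array standing in for what cv2.imread() would return, and reverses the last axis with plain NumPy slicing.

```python
import numpy as np

# A 1x2 image in BGR order: first pixel pure blue, second pixel pure red.
# (Synthetic data standing in for the result of cv2.imread().)
img_bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps B and R, the same channel reordering
# that cv2.COLOR_BGR2RGB performs on a 3-channel image.
img_rgb = img_bgr[..., ::-1]

print(img_rgb[0, 0])  # [  0   0 255]  blue, now in RGB order
print(img_rgb[0, 1])  # [255   0   0]  red, now in RGB order
```

The slice returns a view with a negative stride. If the result is passed on to OpenCV functions, wrapping it in np.ascontiguousarray() is a cautious choice.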

Section 03

Colour Space Conversions — Seeing the World Differently

📖 Story

The Sommelier and the Spectrometer

A master sommelier can identify a wine by colour alone — a task impossible for most people. He doesn't see "red wine." He sees a specific hue, a particular saturation, a precise value. He has mentally converted the image from the RGB colour space — which mixes everything together — into the HSV space, where colour properties are separated.

This is exactly why we convert colour spaces in image processing. Different tasks need different "views" of the same pixel data. Segmenting a ripe tomato by colour is trivial in HSV, but brutal in RGB.

Colour Space	Channels	Best Used For	Notes
RGB	Red, Green, Blue	Display, web, general use	Pillow default. Human-intuitive.
BGR	Blue, Green, Red	OpenCV internal format	Easy to forget — causes colour swaps!
Greyscale	Intensity only	Edge detection, thresholding, feature extraction	3× smaller memory than RGB.
HSV	Hue, Saturation, Value	Colour-based segmentation, tracking	Hue is largely independent of brightness.
LAB	Lightness, A (green↔red), B (blue↔yellow)	Perceptually uniform comparisons	Closest to human vision. Used in colour difference metrics.
YCrCb	Luma, Chroma red, Chroma blue	Skin detection, video compression	Separates brightness from colour. Used in JPEG.
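As a rough illustration of why HSV separates colour from brightness, the sketch below converts a few RGB triplets using Python's standard colorsys module and rescales the result to OpenCV's 8-bit convention (H in 0–179, S and V in 0–255). The helper name rgb_to_opencv_hsv is hypothetical, and the rounding may differ slightly from cv2.cvtColor.

```python
import colorsys

def rgb_to_opencv_hsv(r, g, b):
    """Convert 8-bit RGB to HSV on OpenCV's 8-bit scale (H 0-179, S and V 0-255)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    # colorsys returns h, s, v in [0, 1]; OpenCV stores hue as degrees / 2.
    return (int(round(h * 180)) % 180, int(round(s * 255)), int(round(v * 255)))

print(rgb_to_opencv_hsv(255, 0, 0))   # (0, 255, 255)    pure red
print(rgb_to_opencv_hsv(0, 255, 0))   # (60, 255, 255)   pure green
print(rgb_to_opencv_hsv(0, 0, 255))   # (120, 255, 255)  pure blue
print(rgb_to_opencv_hsv(128, 0, 0))   # (0, 255, 128)    dark red
```

Bright red and dark red share the same hue and differ only in value, which is why a hue range tolerates lighting changes better than RGB bounds do.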

# Real-world example: Detect a red ball using HSV masking
import cv2
import numpy as np

img = cv2.imread('playground.jpg')
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Red hue wraps around 0/180 in OpenCV's HSV
# Define lower and upper bounds for red
lower_red1 = np.array([0,   120, 70])
upper_red1 = np.array([10,  255, 255])
lower_red2 = np.array([170, 120, 70])
upper_red2 = np.array([179, 255, 255])   # 8-bit hue tops out at 179 in OpenCV

mask1 = cv2.inRange(hsv, lower_red1, upper_red1)
mask2 = cv2.inRange(hsv, lower_red2, upper_red2)
mask  = cv2.bitwise_or(mask1, mask2)

# Apply mask to original image
result = cv2.bitwise_and(img, img, mask=mask)
print(f"Red pixels found: {np.count_nonzero(mask)}")

OUTPUT

Red pixels found: 14382

Section 04

Image Filtering & Smoothing — Taming the Noise

Real-world images are noisy. Camera sensors are imperfect. Transmission introduces artefacts. Lighting is uneven. Before any analysis, you often need to smooth the image — reduce noise without destroying important features. This is done by convolution: sliding a small matrix (called a kernel or filter) across the image and computing a weighted average at each position.

📋

What Is Convolution?

For every pixel, you place your kernel (e.g. a 3×3 matrix) centred on it. You multiply each kernel value by the corresponding pixel value, sum all the products, and the result becomes the new pixel value. Slide this across the entire image. The kernel's values determine what the filter does: smooth, sharpen, detect edges.
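The multiply-and-sum procedure described above can be sketched in a few lines of NumPy. This is a deliberately slow teaching version, not how OpenCV implements it; the function name apply_kernel and the edge-replication border handling are assumptions made for the example. Strictly speaking, sliding the kernel without flipping it is cross-correlation, which is also what cv2.filter2D computes; for symmetric kernels the two are identical.

```python
import numpy as np

def apply_kernel(image, kernel):
    """Slide `kernel` over a 2D `image`, multiply element-wise and sum."""
    kh, kw = kernel.shape
    pad_h, pad_w = kh // 2, kw // 2
    # Replicate border pixels so the output has the same size as the input.
    padded = np.pad(image.astype(np.float64),
                    ((pad_h, pad_h), (pad_w, pad_w)), mode="edge")
    out = np.zeros(image.shape, dtype=np.float64)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            window = padded[y:y + kh, x:x + kw]
            out[y, x] = np.sum(window * kernel)
    return out

# A flat grey image with one bright "noise" pixel in the middle.
image = np.full((5, 5), 10.0)
image[2, 2] = 100.0

box = np.ones((3, 3)) / 9.0          # 3x3 mean filter
smoothed = apply_kernel(image, box)

print(f"{smoothed[2, 2]:.1f}")   # 20.0  (spike averaged with its 8 neighbours)
print(f"{smoothed[0, 0]:.1f}")   # 10.0  (far from the spike, unchanged)
```

Swapping `box` for a different 3×3 matrix changes the behaviour of the filter without touching the loop, which is the point of the kernel abstraction.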

🌀

Box (Mean) Filter

Replaces each pixel with the average of its neighbours. Simple and fast. Blurs edges too aggressively. Good for quick noise removal.

cv2.blur(img, (5,5))

📈

Gaussian Filter

Uses a Gaussian (bell-curve) weighted average. Central pixels contribute more. Smoother result than box blur. The gold standard for pre-processing before edge detection.

cv2.GaussianBlur(img, (5,5), 0)

⚖

Median Filter

Replaces each pixel with the median of its neighbours. Excellent against salt-and-pepper noise. Preserves edges far better than Gaussian. Non-linear.

cv2.medianBlur(img, 5)

🌌

Bilateral Filter

Smooths while preserving edges. Considers both spatial proximity and pixel value similarity. Much slower but produces beautiful results. Used in photo editing.

cv2.bilateralFilter(img, 9, 75, 75)

✏️

Sharpening Filter

Uses a kernel that amplifies differences from neighbours. Enhances edges. Can amplify noise too — apply after smoothing. Used in print and photo restoration.

cv2.filter2D(img, -1, kernel)

🎲

Custom Kernel

Define your own kernel matrix for specialised effects. Embossing, motion blur, and unsharp masking are all custom kernels. Full creative and analytical control.

kernel = np.array([[...]])

import cv2
import numpy as np
import matplotlib.pyplot as plt

img = cv2.imread('noisy_photo.jpg')
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# 1. Box (Mean) filter
box_blur = cv2.blur(img_rgb, (5, 5))

# 2. Gaussian blur
gaussian = cv2.GaussianBlur(img_rgb, (5, 5), sigmaX=0)

# 3. Median blur — best for salt-and-pepper noise
median = cv2.medianBlur(img_rgb, 5)

# 4. Bilateral filter — edge-preserving smooth
bilateral = cv2.bilateralFilter(img_rgb, d=9, sigmaColor=75, sigmaSpace=75)

# 5. Sharpening with custom kernel
sharp_kernel = np.array([[0, -1,  0],
                           [-1,  5, -1],
                           [0, -1,  0]])
sharpened = cv2.filter2D(img_rgb, -1, sharp_kernel)

# Plot all results
fig, axes = plt.subplots(2, 3, figsize=(14, 8))
titles = ['Original', 'Box Blur', 'Gaussian', 'Median', 'Bilateral', 'Sharpened']
images = [img_rgb, box_blur, gaussian, median, bilateral, sharpened]
for ax, im, t in zip(axes.flat, images, titles):   # axes.flat is an attribute, not a method
    ax.imshow(im)
    ax.set_title(t)
    ax.axis('off')
plt.tight_layout()
plt.show()

💡

Choosing the Right Filter

Use Gaussian before Canny edge detection. Use Median for scanner noise or old photographs. Use Bilateral when you need smooth skin tones but sharp edges in portraits. Never sharpen before smoothing — it amplifies noise catastrophically.
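A single neighbourhood is enough to see why the median copes with salt-and-pepper noise better than the mean. The 3×3 window below is invented for illustration.

```python
import numpy as np

# One 3x3 neighbourhood from a flat region (value 50) hit by a single
# "salt" pixel (255). Values are made up for the example.
window = np.array([[50,  50, 50],
                   [50, 255, 50],
                   [50,  50, 50]], dtype=np.uint8)

mean_value = window.mean()        # what a box filter would output here
median_value = np.median(window)  # what a median filter would output here

print(f"mean   = {mean_value:.1f}")    # mean   = 72.8
print(f"median = {median_value:.1f}")  # median = 50.0
```

The mean is dragged towards the outlier, leaving a faint smudge, while the median ignores it entirely.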

Section 05

Thresholding & Segmentation — Separating What Matters

📖 Story

The Customs Officer at the Border

A customs officer scans X-rays of luggage. His job is simple in principle: identify what is suspicious and what is safe. He does this by looking for areas with unusual density — regions that appear as bright whites or deep blacks on the scan — and ignoring everything in the grey middle.

He is performing thresholding — the most fundamental segmentation technique in image processing. Any pixel above a brightness threshold becomes "foreground" (1). Everything else becomes "background" (0). The resulting image is purely binary: black or white.

⬛

GLOBAL

Simple Threshold

One fixed threshold value for the whole image. Fast. Fails on uneven lighting. pixel > T → 255, else → 0.

🎮

ADAPTIVE

Local Threshold

Threshold computed for small neighbourhoods. Handles shadows and uneven illumination. Uses mean or Gaussian weighted average of local region.

🌟

OTSU

Automatic Threshold

Automatically finds the optimal global threshold by minimising intra-class variance. Works best when the histogram is bimodal (clear foreground/background).
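To show what "automatic" means for Otsu, here is a minimal NumPy sketch that searches all 256 candidate thresholds. It maximises between-class variance, which is mathematically equivalent to minimising intra-class variance. The function name otsu_threshold and the synthetic test data are assumptions for the example; cv2.threshold with THRESH_OTSU remains the practical choice.

```python
import numpy as np

def otsu_threshold(grey):
    """Return the threshold t that maximises between-class variance.

    Pixels <= t form one class and pixels > t the other.
    """
    hist = np.bincount(grey.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)

    w0 = np.cumsum(prob)                 # weight of the dark class for each t
    w1 = 1.0 - w0                        # weight of the bright class
    cum_mean = np.cumsum(prob * levels)  # cumulative intensity sum
    total_mean = cum_mean[-1]

    # Between-class variance; skip thresholds that leave one class empty.
    denom = w0 * w1
    between = np.zeros(256)
    valid = denom > 0
    between[valid] = (total_mean * w0[valid] - cum_mean[valid]) ** 2 / denom[valid]
    return int(np.argmax(between))

# Synthetic bimodal image: 600 dark pixels at 40 and 400 bright pixels at 200.
grey = np.concatenate([np.full(600, 40), np.full(400, 200)]).astype(np.uint8)
t = otsu_threshold(grey)
print(40 <= t < 200)   # True: the threshold separates the two modes
```

On a histogram with a single broad peak the variance curve has no clear maximum, which is why Otsu is best reserved for bimodal images.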

import cv2
import numpy as np

img = cv2.imread('document.jpg', cv2.IMREAD_GRAYSCALE)

# 1. Global threshold: pixels > 127 → white, else black
ret1, thresh_global = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)

# 2. Otsu's method: auto-calculate optimal threshold
ret2, thresh_otsu = cv2.threshold(
    img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
)
print(f"Otsu's optimal threshold: {ret2:.1f}")

# 3. Adaptive threshold — handles uneven lighting in documents
thresh_adapt = cv2.adaptiveThreshold(
    img,
    maxValue=255,
    adaptiveMethod=cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    thresholdType=cv2.THRESH_BINARY,
    blockSize=11,    # neighbourhood size (must be odd)
    C=2             # constant subtracted from mean
)

# Count foreground pixels in each method
for name, t in {'Global': thresh_global,
               "Otsu": thresh_otsu,
               "Adaptive": thresh_adapt}.items():
    fg = np.sum(t == 255)
    total = t.size
    print(f"{name:10s}: {fg:7,d} foreground px ({fg/total*100:.1f}%)")

OUTPUT

Otsu's optimal threshold: 142.0
Global    :  94,302 foreground px (30.7%)
Otsu      :  87,156 foreground px (28.4%)
Adaptive  : 112,450 foreground px (36.6%)

Section 06

Morphological Operations — Sculpting the Binary Image

After thresholding you often have a "rough" binary image: small holes inside objects, tiny isolated noise pixels, or slightly disconnected regions. Morphological operations fix these problems by expanding or shrinking the white regions using a structuring element.

Operation	Effect	Typical Use	OpenCV Function
Erosion	Shrinks white regions. Removes thin protrusions and small blobs.	Remove salt noise, thin lines	`cv2.erode()`
Dilation	Expands white regions. Fills small holes and connects nearby blobs.	Fill gaps, join broken lines	`cv2.dilate()`
Opening	Erosion then Dilation. Removes small noise without shrinking objects.	Clean background noise	`cv2.morphologyEx(MORPH_OPEN)`
Closing	Dilation then Erosion. Fills holes without expanding objects.	Fill interior gaps in text/shapes	`cv2.morphologyEx(MORPH_CLOSE)`
Gradient	Dilation minus Erosion. Highlights object boundaries.	Find edges in binary images	`cv2.morphologyEx(MORPH_GRADIENT)`
Top Hat	Original minus Opening. Reveals small bright spots on dark background.	Cell detection, bright defects	`cv2.morphologyEx(MORPH_TOPHAT)`
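The first three rows of the table can be reproduced with a few lines of NumPy on a tiny binary image. This is a sketch with a fixed 3×3 square structuring element and zero padding at the border, which is a simplification compared with OpenCV's border handling; the function names erode and dilate are local to the example.

```python
import numpy as np

def erode(binary):
    """3x3 erosion: a pixel stays 1 only if its whole 3x3 neighbourhood is 1."""
    padded = np.pad(binary, 1, mode="constant", constant_values=0)
    out = np.ones_like(binary)
    for dy in range(3):
        for dx in range(3):
            out &= padded[dy:dy + binary.shape[0], dx:dx + binary.shape[1]]
    return out

def dilate(binary):
    """3x3 dilation: a pixel becomes 1 if any pixel in its neighbourhood is 1."""
    padded = np.pad(binary, 1, mode="constant", constant_values=0)
    out = np.zeros_like(binary)
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + binary.shape[0], dx:dx + binary.shape[1]]
    return out

# A 7x7 binary image: one 3x3 object plus a single isolated noise pixel.
img = np.zeros((7, 7), dtype=np.uint8)
img[1:4, 1:4] = 1      # the object
img[5, 5] = 1          # salt noise

opened = dilate(erode(img))   # opening = erosion followed by dilation

print(int(img.sum()), int(opened.sum()))   # 10 9
print(int(opened[5, 5]))                   # 0
```

Erosion wipes out the lone pixel and shrinks the object to its centre; dilation then grows the object back to its original 3×3 size, so only the noise is lost.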

import cv2
import numpy as np

# Load a binary image (e.g., after Otsu thresholding)
_, binary = cv2.threshold(
    cv2.imread('text.jpg', cv2.IMREAD_GRAYSCALE),
    0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
)

# Define structuring element (kernel)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))

# Apply each operation
eroded   = cv2.erode(binary, kernel, iterations=1)
dilated  = cv2.dilate(binary, kernel, iterations=1)
opened   = cv2.morphologyEx(binary, cv2.MORPH_OPEN,     kernel)
closed   = cv2.morphologyEx(binary, cv2.MORPH_CLOSE,    kernel)
gradient = cv2.morphologyEx(binary, cv2.MORPH_GRADIENT, kernel)

# Closing is often the final step to clean up OCR pre-processing
print("Binary image cleaned and ready for OCR.")

OUTPUT

Binary image cleaned and ready for OCR.

✅

Practical Rule: OCR Pre-processing Pipeline

For document digitisation tasks: Greyscale → Gaussian Blur → Otsu Threshold → Morphological Closing. This four-step pipeline can substantially improve Tesseract OCR accuracy on real-world scanned documents. The gain depends heavily on scan quality, so a jump from around 60% to over 95% recognition on typical business documents should be read as a favourable case rather than a guarantee.

Section 07

Edge Detection — Finding Where Things Begin and End

An edge is a location where pixel intensity changes sharply. Edges correspond to object boundaries, shadows, and surface discontinuities — they carry most of the structural information in an image. Detecting edges is the prerequisite for shape recognition, object counting, and feature extraction.

Sobel X (vertical edges)

[[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

Responds to the horizontal gradient, so it highlights vertical edges. The centre row is weighted 2×, which smooths along the vertical (perpendicular) axis.

Sobel Y (horizontal edges)

[[-1,-2,-1], [0,0,0], [1, 2, 1]]

Responds to the vertical gradient, so it highlights horizontal edges. Combine Sobel X and Y as √(Gx²+Gy²) for full edge magnitude.

Laplacian (all directions)

[[0,1,0], [1,-4,1], [0,1,0]]

Second derivative of intensity. Detects edges in all orientations simultaneously. More sensitive to noise than Sobel.

Canny (optimal)

Gaussian → Sobel → NMS → Hysteresis

Four-stage pipeline. Produces thin, connected, noise-robust edges. The industry standard for most computer vision tasks.
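Applying the two Sobel kernels by hand to a synthetic step edge shows which kernel responds to which orientation. The image and the helper response_at are invented for this sketch.

```python
import numpy as np

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)
sobel_y = sobel_x.T   # [[-1,-2,-1], [0,0,0], [1,2,1]]

# Synthetic 5x6 image with a vertical edge: left half 0, right half 100.
img = np.zeros((5, 6), dtype=np.float64)
img[:, 3:] = 100.0

def response_at(image, kernel, y, x):
    """Kernel response at interior pixel (y, x): element-wise multiply and sum."""
    return float(np.sum(image[y - 1:y + 2, x - 1:x + 2] * kernel))

gx = response_at(img, sobel_x, 2, 2)   # 3x3 window straddles the edge
gy = response_at(img, sobel_y, 2, 2)
magnitude = float(np.hypot(gx, gy))    # sqrt(gx^2 + gy^2)

print(f"gx={gx:.0f} gy={abs(gy):.0f} magnitude={magnitude:.0f}")  # gx=400 gy=0 magnitude=400
print(f"{response_at(img, sobel_x, 2, 1):.0f}")                    # 0 (flat region)
```

The vertical edge produces a strong X response and no Y response, and a flat region produces nothing at all.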

import cv2
import numpy as np
import matplotlib.pyplot as plt

img = cv2.imread('building.jpg', cv2.IMREAD_GRAYSCALE)

# 1. Sobel edges (X and Y)
sobelx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)   # X gradient
sobely = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)   # Y gradient
sobel_mag = np.sqrt(sobelx**2 + sobely**2)              # Combined magnitude
sobel_mag = np.uint8(np.clip(sobel_mag, 0, 255))

# 2. Laplacian edges
laplacian = cv2.Laplacian(img, cv2.CV_64F)
laplacian = np.uint8(np.absolute(laplacian))

# 3. Canny edge detector — the gold standard
# GaussianBlur first to reduce noise
blurred = cv2.GaussianBlur(img, (5, 5), 0)
canny = cv2.Canny(
    blurred,
    threshold1=50,    # lower hysteresis threshold
    threshold2=150    # upper hysteresis threshold
)

# Auto-calculate Canny thresholds using median pixel value
median_val = np.median(blurred)
lower = int(max(0,   0.67 * median_val))
upper = int(min(255, 1.33 * median_val))
canny_auto = cv2.Canny(blurred, lower, upper)
print(f"Auto Canny thresholds: lower={lower}, upper={upper}")

OUTPUT

Auto Canny thresholds: lower=81, upper=161

Section 08

Contours & Shape Analysis — Recognising What You See

A contour is a continuous curve along a boundary where pixels change from background to foreground. After finding contours, you can calculate properties: area, perimeter, bounding box, shape similarity, centroid position, and more. This is how industrial quality control systems automatically detect defective parts.

Greyscale + Blur

Convert to greyscale and apply Gaussian blur to reduce noise that could create spurious contours. A 3×3 or 5×5 kernel usually suffices.

Threshold or Canny

Create a binary image. Contours are found on binary images. Either threshold directly, or use Canny edges. Both approaches work; choice depends on task.

Find Contours

cv2.findContours() returns a list of contours (each is a NumPy array of points) and a hierarchy describing parent-child relationships between nested contours.

Filter by Area

Discard tiny contours (noise) using cv2.contourArea(c) > min_area. For coin counting, this step alone typically removes most false positives.

Analyse Shape Features

Compute bounding box, min enclosing circle, convex hull, moments, circularity, aspect ratio. These become features for classification or rule-based logic.
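Circularity, one of the shape features listed above, is easy to sanity-check on ideal shapes before trusting it on real contours. The sketch below uses exact areas and perimeters rather than values measured from pixels; the radius and side length are arbitrary.

```python
import math

def circularity(area, perimeter):
    """4*pi*area / perimeter^2: 1.0 for a perfect circle, lower otherwise."""
    return 4 * math.pi * area / perimeter ** 2

r = 40.0                                   # circle radius (arbitrary)
circle = circularity(math.pi * r ** 2, 2 * math.pi * r)

s = 80.0                                   # square side (arbitrary)
square = circularity(s ** 2, 4 * s)

print(f"circle: {circle:.3f}")   # circle: 1.000
print(f"square: {square:.3f}")   # square: 0.785
```

An ideal square scores π/4 ≈ 0.785, so a circularity cut-off of 0.75 on its own would not reject square objects. Contours traced on a pixel grid usually score a little below these ideal values.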

import cv2
import numpy as np

img = cv2.imread('coins.jpg')
grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(grey, (7, 7), 0)
_, binary = cv2.threshold(blurred, 0, 255,
                            cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Find external contours only
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
print(f"Total raw contours: {len(contours)}")

# Filter and analyse each contour
min_area = 500
valid_coins = []

for cnt in contours:
    area = cv2.contourArea(cnt)
    if area < min_area:
        continue

    perimeter = cv2.arcLength(cnt, True)
    circularity = 4 * np.pi * area / (perimeter ** 2)  # 1.0 = perfect circle

    x, y, w, h = cv2.boundingRect(cnt)
    aspect_ratio = w / h

    # Coins should be near-circular
    if circularity > 0.75 and 0.8 < aspect_ratio < 1.2:
        valid_coins.append({'area': area, 'circularity': circularity,
                              'bbox': (x, y, w, h)})
        cv2.drawContours(img, [cnt], -1, (0, 255, 0), 2)

print(f"Coins detected: {len(valid_coins)}")
for i, coin in enumerate(valid_coins, 1):
    print(f"  Coin {i}: area={coin['area']:.0f}px², circularity={coin['circularity']:.3f}")

OUTPUT

Total raw contours: 47
Coins detected: 8
  Coin 1: area=4821px², circularity=0.932
  Coin 2: area=5103px², circularity=0.918
  Coin 3: area=3204px², circularity=0.904
  Coin 4: area=4956px², circularity=0.941
  Coin 5: area=5012px², circularity=0.927
  Coin 6: area=4788px², circularity=0.935
  Coin 7: area=3198px², circularity=0.912
  Coin 8: area=5089px², circularity=0.929

Section 09

Geometric Transformations — Repositioning Pixels

Sometimes the problem isn't the pixel values — it's where the pixels are. A scanned document might be rotated. A product photo needs cropping. A wide-angle camera introduces barrel distortion. Geometric transformations reposition pixels in 2D space to correct these issues or prepare images for model training.

Transformation	Degrees of Freedom	Preserves	Use Case
Translation	Shift x, y	Shape, size, angles	Moving/centering objects
Rotation	Angle θ, centre	Shape, size	Deskewing text documents
Scaling	Scale x, y	Shape (if uniform)	Resize for neural network input
Affine	6 DOF (translation + rotation + scale + shear)	Parallelism	Correct mild perspective distortion
Perspective	8 DOF (homography)	Straight lines	Correct full perspective distortion, receipts, whiteboards
Undistortion	Camera matrix + distortion coefficients	True scene geometry	Fisheye / wide-angle lens correction
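For the rotation row, it can help to see the 2×3 matrix written out. The sketch below builds it from the formula given in the OpenCV documentation for cv2.getRotationMatrix2D and applies it to two points. The function names rotation_matrix_2d and transform_point are local to the example.

```python
import numpy as np

def rotation_matrix_2d(centre, angle_deg, scale=1.0):
    """2x3 affine matrix rotating about `centre`.

    A positive angle rotates counter-clockwise as seen on screen
    (image origin at the top-left, y pointing down).
    """
    cx, cy = centre
    a = scale * np.cos(np.deg2rad(angle_deg))
    b = scale * np.sin(np.deg2rad(angle_deg))
    return np.array([[ a, b, (1 - a) * cx - b * cy],
                     [-b, a, b * cx + (1 - a) * cy]])

def transform_point(M, point):
    """Apply a 2x3 affine matrix to one (x, y) point."""
    x, y = point
    return M @ np.array([x, y, 1.0])

M = rotation_matrix_2d(centre=(320, 240), angle_deg=90)

print(np.round(transform_point(M, (320, 240)), 6))  # [320. 240.]  centre is fixed
print(np.round(transform_point(M, (420, 240)), 6))  # [320. 140.]  moved above centre
```

The point 100 px to the right of the centre ends up 100 px above it, confirming the counter-clockwise convention. The third column of the matrix is the translation that keeps the chosen centre in place.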

import cv2
import numpy as np

img = cv2.imread('whiteboard.jpg')
h, w = img.shape[:2]

# ── 1. Resize ──────────────────────────────────────────────
resized = cv2.resize(img, (224, 224), interpolation=cv2.INTER_LINEAR)

# ── 2. Rotation around centre ──────────────────────────────
centre = (w // 2, h // 2)
M_rot = cv2.getRotationMatrix2D(centre, angle=15, scale=1.0)
rotated = cv2.warpAffine(img, M_rot, (w, h))

# ── 3. Perspective transform (straighten a tilted document) ─
# Source points: four corners of the document in the original image
src_pts = np.float32([[120, 90], [510, 60],
                        [545, 420], [75, 440]])
# Destination: what we want those corners to map to (a clean rectangle)
dst_pts = np.float32([[0, 0], [400, 0],
                        [400, 500], [0, 500]])
M_persp = cv2.getPerspectiveTransform(src_pts, dst_pts)
warped  = cv2.warpPerspective(img, M_persp, (400, 500))
print(f"Warped document shape: {warped.shape}")

OUTPUT

Warped document shape: (500, 400, 3)
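The 8 degrees of freedom of a perspective transform correspond to eight unknowns that four point pairs determine exactly. As a sketch of what a function like cv2.getPerspectiveTransform has to solve, the code below sets up that 8×8 linear system in NumPy, reusing the corner coordinates from the example above. The function names are hypothetical, and OpenCV's internal solver may differ.

```python
import numpy as np

def perspective_matrix(src, dst):
    """Solve for the 3x3 homography H (with H[2, 2] fixed to 1) from 4 point pairs."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # From u = (h11*x + h12*y + h13) / (h31*x + h32*y + 1), and likewise for v.
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.extend([u, v])
    h = np.linalg.solve(np.array(A, dtype=np.float64),
                        np.array(b, dtype=np.float64))
    return np.append(h, 1.0).reshape(3, 3)

def warp_point(H, point):
    """Map one (x, y) point through H, including the perspective divide."""
    x, y, w = H @ np.array([point[0], point[1], 1.0])
    return float(x / w), float(y / w)

src_pts = [(120, 90), (510, 60), (545, 420), (75, 440)]
dst_pts = [(0, 0), (400, 0), (400, 500), (0, 500)]

H = perspective_matrix(src_pts, dst_pts)
for s, d in zip(src_pts, dst_pts):
    u, v = warp_point(H, s)
    print(s, "->", (round(u, 3), round(v, 3)), "target", d)
```

Each source corner should land on its target corner up to floating-point rounding. The division by `w` is what distinguishes a perspective warp from an affine one.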

Section 10

Histogram & Intensity Adjustments — Correcting Exposure

📖 Story

The Darkroom Photographer's Trick

Before digital cameras, a master photographer would develop a print, study its histogram — the distribution of dark to light tones — and then manually adjust exposure time and chemical concentrations in the darkroom. An underexposed print had a histogram bunched on the left. An overexposed one bunched on the right. The perfect print had a histogram spread evenly across the full tonal range.

This is histogram equalisation: automatically redistributing pixel values so that every intensity level is used as uniformly as possible. The result? Maximum contrast, maximum information in the image.

⚠ Low Contrast Image

Property	Value
Pixel range used	80–180 (100 levels of 256)
Histogram shape	Narrow spike in mid-range
Visual appearance	Flat, washed-out, grey
Contrast	Poor — hard to distinguish features

✅ After Equalisation

Property	Value
Pixel range used	0–255 (full 256 levels)
Histogram shape	Approximately uniform
Visual appearance	High contrast, crisp details
Contrast	Excellent — features clearly visible
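The before-and-after tables can be reproduced on synthetic data. The sketch below implements a common textbook form of histogram equalisation through the cumulative distribution function (CDF); cv2.equalizeHist may differ in rounding details. The function name equalise_hist and the random test image are assumptions for the example.

```python
import numpy as np

def equalise_hist(grey):
    """Histogram equalisation for an 8-bit greyscale image via its CDF."""
    hist = np.bincount(grey.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]                 # CDF at the darkest level present
    n = grey.size
    # Remap levels so the CDF becomes roughly linear over 0-255.
    lut = np.round((cdf - cdf_min) / max(n - cdf_min, 1) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[grey]                          # apply the lookup table per pixel

# Synthetic low-contrast image using only levels 80-180.
rng = np.random.default_rng(0)
img = rng.integers(80, 181, size=(64, 64), dtype=np.uint8)

eq = equalise_hist(img)
print(img.min(), img.max())   # 80 180
print(eq.min(), eq.max())     # 0 255
```

The lookup table stretches the 101 occupied levels across the full 8-bit range, which is the "0–255" row of the second table.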

import cv2
import numpy as np

img = cv2.imread('dark_xray.jpg', cv2.IMREAD_GRAYSCALE)

# 1. Standard histogram equalisation
eq_global = cv2.equalizeHist(img)

# 2. CLAHE — Contrast Limited Adaptive Histogram Equalisation
# Better than global EQ: avoids over-amplifying noise in uniform regions
clahe = cv2.createCLAHE(
    clipLimit=2.0,         # contrast amplification limit
    tileGridSize=(8, 8)   # tile size for local histogram
)
eq_clahe = clahe.apply(img)

# 3. Gamma correction — brighten dark images non-linearly
def gamma_correct(image, gamma=1.5):
    inv_gamma = 1.0 / gamma
    lut = np.array([
        ((i / 255.0) ** inv_gamma) * 255
        for i in range(256)
    ]).astype('uint8')
    return cv2.LUT(image, lut)

brightened = gamma_correct(img, gamma=1.8)

# Compare mean brightness
print(f"Original   mean: {img.mean():.1f}")
print(f"Global EQ  mean: {eq_global.mean():.1f}")
print(f"CLAHE      mean: {eq_clahe.mean():.1f}")
print(f"Gamma 1.8  mean: {brightened.mean():.1f}")

OUTPUT

Original   mean: 72.4
Global EQ  mean: 127.8
CLAHE      mean: 121.3
Gamma 1.8  mean: 148.6

💡

CLAHE vs Global Equalisation

Global histogram equalisation can over-enhance noise in uniform regions (e.g. a clear sky becomes speckled). CLAHE fixes this by computing localised histograms and clipping the amplification. For medical imaging (retinal scans, X-rays, MRIs), always prefer CLAHE over global equalisation.

Section 11

Feature Detection — Finding Reliable Keypoints

A feature is a small, distinctive region in an image that can be reliably detected and described — even if the image is rescaled, rotated, or slightly blurred. Features are the basis of image stitching (panoramas), visual SLAM (robot navigation), and matching objects across frames in video.

SIFT

Scale-Invariant Feature Transform

Gold standard descriptor. Invariant to scale and rotation. Produces a 128-dimensional descriptor per keypoint. The patent expired in 2020, and SIFT has shipped in the main OpenCV package since version 4.4. Slower than binary alternatives.

ORB

Oriented FAST + Rotated BRIEF

Fast, free, patent-free alternative to SIFT. Uses binary descriptors (Hamming distance matching). 10–100× faster than SIFT. Great for real-time apps.

AKAZE

Accelerated-KAZE

Nonlinear scale space feature detector. More robust to noise than SIFT on textured surfaces. Binary descriptor. Good balance of speed and accuracy.

Harris

Corner Detector

Detects corners based on local intensity changes in all directions. Very fast. Not scale-invariant. Used for image alignment and optical flow initialisation.

HOG

Histogram of Oriented Gradients

Describes gradient direction distributions over local cells. The foundation of classical pedestrian detection (DPM, SVM-HOG). Dense descriptor, not sparse keypoints.

LBP

Local Binary Pattern

Compares each pixel to its neighbours; encodes as a binary number. Extremely fast. Texture descriptor. Classic for face recognition (Viola-Jones era).
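Binary descriptors such as ORB's are compared by Hamming distance, meaning the number of bits that differ. A minimal NumPy sketch, using made-up 32-byte descriptors (the size ORB produces) rather than descriptors extracted from real images:

```python
import numpy as np

def hamming_distance(d1, d2):
    """Number of differing bits between two descriptors stored as uint8 bytes."""
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())

# Made-up 32-byte (256-bit) descriptors.
rng = np.random.default_rng(42)
query = rng.integers(0, 256, size=32, dtype=np.uint8)

near = query.copy()
near[0] ^= 0b00000111          # flip 3 bits: a slightly perturbed copy
far = rng.integers(0, 256, size=32, dtype=np.uint8)   # unrelated descriptor

print(hamming_distance(query, query))   # 0
print(hamming_distance(query, near))    # 3
print(hamming_distance(query, far))     # large; roughly half of 256 for random bits
```

Because the comparison reduces to XOR and a bit count, matching binary descriptors is much cheaper than computing L2 distances between 128-dimensional float vectors.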

import cv2
import numpy as np

img1 = cv2.imread('scene_a.jpg', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('scene_b.jpg', cv2.IMREAD_GRAYSCALE)

# ── ORB (fast, patent-free) ────────────────────────────────
orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Match descriptors using Brute Force + Hamming (for binary descriptors)
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(des1, des2)
matches = sorted(matches, key=lambda x: x.distance)

print(f"ORB keypoints in img1: {len(kp1)}")
print(f"ORB keypoints in img2: {len(kp2)}")
print(f"Good matches (top 50): {min(50, len(matches))}")
print(f"Best match distance:   {matches[0].distance:.1f}")

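# ── SIFT (higher quality) ─────────────────────────────────
sift = cv2.SIFT_create()
kp_s1, des_s1 = sift.detectAndCompute(img1, None)
kp_s2, des_s2 = sift.detectAndCompute(img2, None)
# Use FLANN for efficient SIFT matching (L2 norm, float descriptors)
flann = cv2.FlannBasedMatcher(
    {'algorithm': 1, 'trees': 5},   # algorithm 1 = KD-tree index
    {'checks': 50}
)
# Lowe's ratio test: keep a match only if it clearly beats the runner-up
knn_matches = flann.knnMatch(des_s1, des_s2, k=2)
good_sift = [p[0] for p in knn_matches
             if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
print(f"SIFT keypoints in img1: {len(kp_s1)}")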

OUTPUT

ORB keypoints in img1: 500
ORB keypoints in img2: 500
Good matches (top 50): 50
Best match distance:   18.0
SIFT keypoints in img1: 1243

Section 12

Image Augmentation — Multiplying Your Training Data

Deep learning models need vast amounts of training data. But collecting and labelling thousands of images for every class is expensive and slow. Image augmentation artificially expands your dataset by applying random but realistic transformations to existing images. A single labelled photo of a cat can become 50+ training examples — each slightly different — all still cats.

🎯

Why Augmentation Dramatically Improves Models

Without augmentation, a model trained on upright cats may completely fail to recognise a sideways cat — because it memorised specific pixel patterns rather than the concept. Augmentation forces the model to learn invariant representations: "catness" that persists across rotations, flips, brightness changes, and crops.

↔

Flip

Horizontal / Vertical

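Mirror the image left-right or top-bottom. Horizontal flips are a nearly free augmentation for natural images. Avoid vertical flips for ground-level scenes where "sky is up" carries meaning; they are usually safe for aerial or microscopy imagery, which has no canonical orientation.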

🔃

Rotation

±15°, ±30°, 90°, 180°

Rotate by random angle. Use small angles (±15°) for natural images, full 90°/180° for satellite or pathology images with no canonical orientation.

☀️

Brightness/Contrast

Random jitter

Randomly adjust brightness, contrast, saturation, hue. Simulates different lighting conditions and camera settings. Essential for outdoor datasets.

📷

Crop / Zoom

Random crop + resize

Take a random crop and resize back. Forces the model to recognise objects from partial views. Standard in ImageNet training since AlexNet (2012).

🏈

Noise Injection

Gaussian, salt-and-pepper

Add random noise to simulate real sensor imperfections. Makes models robust to low-quality input at inference. Crucial for medical imaging models.

📈

Cutout / GridMask

Structured occlusion

Randomly zero out rectangular patches. Forces the model to use multiple discriminative features rather than relying on one region. Reduces overfitting.
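Two of the techniques above, Cutout and Gaussian noise injection, need nothing more than NumPy. The sketch below is one simple way to write them; the function names, patch size, and sigma are illustrative assumptions, and dedicated augmentation libraries offer more options.

```python
import numpy as np

def cutout(image, size, rng):
    """Zero out one randomly placed size x size square (clipped at the borders)."""
    h, w = image.shape[:2]
    cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))   # patch centre
    y0, y1 = max(cy - size // 2, 0), min(cy + size // 2, h)
    x0, x1 = max(cx - size // 2, 0), min(cx + size // 2, w)
    out = image.copy()
    out[y0:y1, x0:x1] = 0
    return out

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip back to the valid uint8 range."""
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = np.full((64, 64, 3), 200, dtype=np.uint8)   # synthetic flat RGB image

occluded = cutout(image, size=16, rng=rng)
noisy = add_gaussian_noise(image, sigma=10.0, rng=rng)

print(occluded.shape, noisy.dtype)        # (64, 64, 3) uint8
print(bool((occluded == 0).any()))        # True: a patch was zeroed
```

Passing the random generator in explicitly keeps the augmentations reproducible, which helps when debugging a training run.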

from PIL import Image, ImageEnhance, ImageFilter
import numpy as np
import random

def augment_image(pil_img):
    """Apply a random chain of augmentations to a PIL image."""
    img = pil_img.copy()

    # Random horizontal flip
    if random.random() > 0.5:
        img = img.transpose(Image.FLIP_LEFT_RIGHT)

    # Random rotation
    angle = random.uniform(-15, 15)
    img = img.rotate(angle, expand=False, fillcolor=(128, 128, 128))

    # Random brightness
    factor = random.uniform(0.7, 1.3)
    img = ImageEnhance.Brightness(img).enhance(factor)

    # Random contrast
    factor = random.uniform(0.8, 1.4)
    img = ImageEnhance.Contrast(img).enhance(factor)

    # Occasional Gaussian blur (simulates motion / focus issues)
    if random.random() > 0.7:
        img = img.filter(ImageFilter.GaussianBlur(radius=1))

    return img

# Generate 50 augmented versions of one image
original = Image.open('product.jpg').convert('RGB')   # RGB so the 3-tuple fillcolor in rotate() is valid
augmented_set = [augment_image(original) for _ in range(50)]
print(f"Dataset grew from 1 image to {len(augmented_set)+1} images.")

OUTPUT

Dataset grew from 1 image to 51 images.

Section 13

Deep Learning Integration — From Pixels to Predictions

Modern image processing pipelines end where deep learning begins. The classical operations you have learned — denoising, normalisation, resizing, augmentation — are the pre-processing layer that feeds Convolutional Neural Networks (CNNs). A CNN then automatically learns the most powerful filters and features for your specific task.

🤖 CNN Classification Pipeline — End-to-End

Input

Raw image file (JPEG, PNG, TIFF) — any size, any colour space.

Pre-process

Resize to model input size (e.g. 224×224). Convert to RGB. Convert to float32. Normalise pixel values to [0,1] or ImageNet mean/std.

Augment

During training only: apply random flip, rotation, crop, colour jitter. During inference: use centre crop or test-time augmentation (TTA).

Conv Layers

Learnable 3×3 convolutional kernels extract low-level features (edges, textures) in early layers, high-level concepts (eyes, wheels) in deep layers.

Pooling

Max pooling reduces spatial dimensions, retaining the strongest activations. Global average pooling in modern architectures (ResNet, EfficientNet).

Output

Softmax probabilities per class (classification), bounding boxes (detection), or pixel-wise masks (segmentation).
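The pre-process step is mostly array arithmetic, and it can be sketched without any deep learning framework. The code below mirrors what a ToTensor-plus-Normalize pipeline does, using the widely used ImageNet channel statistics; the function name to_model_input and the synthetic input are assumptions for the example.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_model_input(img_rgb_uint8):
    """uint8 HWC RGB image -> float32 NCHW array normalised with ImageNet stats."""
    x = img_rgb_uint8.astype(np.float32) / 255.0     # scale to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD           # per-channel standardisation
    x = np.transpose(x, (2, 0, 1))                   # HWC -> CHW
    return x[np.newaxis, ...]                        # add batch dimension

# Synthetic 224x224 mid-grey image standing in for a resized photo.
img = np.full((224, 224, 3), 128, dtype=np.uint8)
batch = to_model_input(img)

print(batch.shape, batch.dtype)     # (1, 3, 224, 224) float32
```

The channel order matters here as well: the mean and standard deviation are listed in RGB order, so an image loaded with OpenCV should be converted from BGR first.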

import torch
import torchvision.transforms as T
import torchvision.models as models
from PIL import Image

# ── Pre-processing pipeline for ImageNet-pretrained models ──
preprocess = T.Compose([
    T.Resize(256),                        # Resize shorter side to 256
    T.CenterCrop(224),                     # Crop centre 224×224
    T.ToTensor(),                           # PIL → float32 tensor [0,1]
    T.Normalize(
        mean=[0.485, 0.456, 0.406],       # ImageNet channel means
        std= [0.229, 0.224, 0.225]        # ImageNet channel stds
    )
])

# Load a pretrained ResNet-50
model = models.resnet50(weights='IMAGENET1K_V2')
model.eval()

# Inference on a single image
img = Image.open('dog.jpg').convert('RGB')
tensor = preprocess(img).unsqueeze(0)   # add batch dimension → (1, 3, 224, 224)

with torch.no_grad():
    logits = model(tensor)                   # shape: (1, 1000)
    probs  = torch.softmax(logits, dim=1)
    top5   = torch.topk(probs, 5)

print("Top-5 Predictions:")
for score, idx in zip(top5.values[0], top5.indices[0]):
    print(f"  Class {idx.item():4d}: {score.item()*100:.2f}%")

OUTPUT

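Top-5 Predictions:
  Class  207: 82.43%   ← golden_retriever
  Class  208: 9.12%   ← Labrador_retriever
  Class  219: 3.77%   ← cocker_spaniel
  Class  151: 1.22%   ← Chihuahua
  Class  248: 0.89%   ← Eskimo_dog

(The names after ← are annotations for readability; the script itself prints only the class indices.)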

Section 14

Classical vs Deep Learning — Full Task Comparison

Task	Classical Approach	Deep Learning Approach	Recommended
Noise Removal	Median / Bilateral Filter	DnCNN, Noise2Noise	Classical (fast, no training data needed)
Edge Detection	Canny, Sobel	HED, RCF (learned boundaries)	Classical for most uses. DL for complex scenes.
Image Classification	HOG + SVM (limited accuracy)	ResNet, EfficientNet, ViT	Deep Learning always wins here
Object Detection	HOG+SVM sliding window (slow)	YOLO, Faster R-CNN, DETR	Deep Learning — not even close
Semantic Segmentation	GrabCut, Watershed (manual)	U-Net, SegFormer, SAM	Deep Learning for accuracy, GrabCut for quick prototypes
Image Stitching	ORB/SIFT + RANSAC Homography	DeepPano, UDIS++	Classical is battle-tested and reliable
OCR Pre-processing	Threshold + Morph ops	End-to-end STR (Scene Text Recognition)	Classical pipeline + Tesseract for most tasks
Face Detection	Viola-Jones Haar Cascades	MTCNN, RetinaFace, InsightFace	Deep Learning — much higher accuracy

Section 15

Golden Rules

📷 Image Processing — Non-Negotiable Rules

Always check dtype and value range first. OpenCV loads as uint8 [0–255]. PyTorch expects float32 [0,1]. TensorFlow may expect [0,1] or [-1,1]. Forgetting this causes invisible bugs where images display black or models produce garbage.

OpenCV reads BGR, not RGB. Every time you load with cv2.imread() and pass to a library expecting RGB (matplotlib, PIL, PyTorch), convert explicitly with cv2.cvtColor(img, cv2.COLOR_BGR2RGB). This is the #1 beginner bug.

Smooth before you sharpen or detect edges. Canny edge detection preceded by a Gaussian blur will almost always produce cleaner results than running Canny on the raw image. Noise looks like tiny edges, so suppress it first.

Use CLAHE, not global histogram equalisation, for medical or scientific images. Global EQ over-amplifies noise in flat regions. CLAHE applies local equalisation with a contrast clip, preserving diagnostic detail where it matters most.

For colour segmentation, convert to HSV first. Selecting a range of red or green in RGB requires tightly coupled bounds on all three channels. In HSV, the colour itself is captured by a narrow hue range, with only loose lower bounds on saturation and value to reject greys and shadows. Your masks will be more accurate and far easier to tune.

Morphological Opening removes noise; Closing fills holes. Open = erode then dilate (removes salt noise without shrinking objects). Close = dilate then erode (fills pepper holes without expanding objects). Memorise this order — swapping them gives the wrong result every time.

Always augment training data but never augment test data (except for test-time augmentation ensembles). Augmenting test data changes the distribution you're evaluating on and invalidates your benchmark. Keep test sets frozen, representative, and pristine.