The Story That Explains Image Processing
That detective is you. That darkroom is your Python environment. The chemicals are your algorithms. And the photograph is any image — a chest X-ray, a satellite map, a product photo, or a face in a crowd.
Image Processing is the science of transforming raw pixel data into meaningful, usable information. It is the foundation of computer vision, medical imaging, self-driving cars, and every Instagram filter you have ever used.
At its core, a digital image is nothing more than a matrix of numbers. Each number (called a pixel value) represents the intensity of light at that position. Image processing is the art of mathematically manipulating those numbers to reveal, enhance, or extract information.
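To make the "matrix of numbers" idea concrete, here is a minimal sketch that builds a tiny greyscale image by hand with NumPy and inverts it. The 3×3 array and its values are made up purely for illustration.

```python
import numpy as np

# A hypothetical 3x3 greyscale "image": 0 is black, 255 is white.
tiny = np.array([[  0,  64, 128],
                 [ 64, 128, 192],
                 [128, 192, 255]], dtype=np.uint8)

# Inverting the image is plain arithmetic on the matrix.
negative = 255 - tiny

print(tiny.shape)         # (3, 3): height x width
print(negative.tolist())  # the black corner is now 255, the white corner 0
```

Every operation later in this guide is a more elaborate version of the same idea: arithmetic on an array.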
By commonly cited estimates, over 3.2 billion images are shared online every day. Medical AI reads 400 million radiology scans per year. Autonomous vehicles process 40–100 camera frames per second. The ability to transform raw pixels into structured knowledge is one of the most economically valuable skills in modern data science.
How Images Are Stored — Pixels, Channels & Arrays
Before you can process an image, you must understand how a computer sees one. There are no "pictures" inside a CPU — only arrays of integers.
import numpy as np
import cv2
from PIL import Image
import matplotlib.pyplot as plt
# Load an image using OpenCV (reads as BGR)
img_bgr = cv2.imread('cat.jpg')
# Convert BGR → RGB for display
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
# Inspect the array
print(f"Shape : {img_rgb.shape}") # (H, W, 3)
print(f"Dtype : {img_rgb.dtype}") # uint8
print(f"Min : {img_rgb.min()}") # 0
print(f"Max : {img_rgb.max()}") # 255
# Convert to greyscale
img_grey = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
print(f"Grey shape: {img_grey.shape}") # (H, W) — no channel dimension
# Convert to HSV colour space
img_hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
print(f"H range: 0–179 | S range: 0–255 | V range: 0–255")
OpenCV reads images in BGR order, not RGB. This trips up every beginner.
If you load with OpenCV and display with matplotlib (which expects RGB), your reds will look
blue. Always convert with cv2.cvtColor(img, cv2.COLOR_BGR2RGB) before displaying,
or use PIL/Pillow which reads natively in RGB.
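The channel swap itself is nothing more than a reversal of the last array axis. The sketch below uses a made-up 1×1 image to show what happens to a pure red pixel; for a 3-channel image, the slice gives the same channel order as the COLOR_BGR2RGB conversion.

```python
import numpy as np

# A hypothetical 1x1 image holding pure red, stored in OpenCV's BGR order.
pixel_bgr = np.array([[[0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps the B and R channels.
pixel_rgb = pixel_bgr[..., ::-1]

print(pixel_bgr[0, 0].tolist())  # [0, 0, 255]
print(pixel_rgb[0, 0].tolist())  # [255, 0, 0]
```

The slice returns a view rather than a copy, so call .copy() on it if a downstream function needs a contiguous array.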
Colour Space Conversions — Seeing the World Differently
Different tasks need different "views" of the same pixel data, which is exactly why we convert colour spaces in image processing. Segmenting a ripe tomato by colour is trivial in HSV, but brutal in RGB.
| Colour Space | Channels | Best Used For | Notes |
|---|---|---|---|
| RGB | Red, Green, Blue | Display, web, general use | Pillow default. Human-intuitive. |
| BGR | Blue, Green, Red | OpenCV internal format | Easy to forget — causes colour swaps! |
| Greyscale | Intensity only | Edge detection, thresholding, feature extraction | One third the memory of RGB. |
| HSV | Hue, Saturation, Value | Colour-based segmentation, tracking | Hue is far less sensitive to lighting changes than raw RGB values. |
| LAB | Lightness, A (green↔red), B (blue↔yellow) | Perceptually uniform comparisons | Closest to human vision. Used in colour difference metrics. |
| YCrCb | Luma, Chroma red, Chroma blue | Skin detection, video compression | Separates brightness from colour. The closely related YCbCr is used in JPEG. |
# Real-world example: Detect a red ball using HSV masking
import cv2
import numpy as np
img = cv2.imread('playground.jpg')
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
# Red hue wraps around 0/180 in OpenCV's HSV
# Define lower and upper bounds for red
lower_red1 = np.array([0, 120, 70])
upper_red1 = np.array([10, 255, 255])
lower_red2 = np.array([170, 120, 70])
upper_red2 = np.array([180, 255, 255])
mask1 = cv2.inRange(hsv, lower_red1, upper_red1)
mask2 = cv2.inRange(hsv, lower_red2, upper_red2)
mask = cv2.bitwise_or(mask1, mask2)
# Apply mask to original image
result = cv2.bitwise_and(img, img, mask=mask)
print(f"Red pixels found: {np.count_nonzero(mask)}")
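Why two masks for one colour? Hue is an angle, and red sits right at the point where the angle wraps around. The sketch below uses Python's standard colorsys module, not OpenCV, to approximate hue on OpenCV's 0–179 scale; the helper name opencv_hue and the two sample colours are assumptions made for illustration, and OpenCV's own rounding may differ by a unit.

```python
import colorsys

def opencv_hue(r, g, b):
    """Approximate hue of an RGB colour (components in 0..1) on a 0-179 scale."""
    h, _, _ = colorsys.rgb_to_hsv(r, g, b)   # h is a fraction of a full turn
    return round(h * 180) % 180

orange_red = opencv_hue(1.0, 0.1, 0.0)   # red leaning towards orange
pink_red = opencv_hue(1.0, 0.0, 0.1)     # red leaning towards magenta

print(orange_red)  # a small value just above 0
print(pink_red)    # a large value just below 180
```

Both colours look red to a human, yet their hue values land at opposite ends of the scale, which is why a single inRange call cannot capture them together.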
Image Filtering & Smoothing — Taming the Noise
Real-world images are noisy. Camera sensors are imperfect. Transmission introduces artefacts. Lighting is uneven. Before any analysis, you often need to smooth the image — reduce noise without destroying important features. This is done by convolution: sliding a small matrix (called a kernel or filter) across the image and computing a weighted average at each position.
For every pixel, you place your kernel (e.g. a 3×3 matrix) centred on it. You multiply each kernel value by the corresponding pixel value, sum all the products, and the result becomes the new pixel value. Slide this across the entire image. The kernel's values determine what the filter does: smooth, sharpen, detect edges.
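The sliding-kernel procedure just described can be written out in a few lines of NumPy. This is a deliberately slow teaching sketch, not how OpenCV implements it: the function name filter2d is hypothetical, it assumes an odd-sized kernel, and it pads by repeating edge pixels, which is only one of several border strategies (OpenCV's default border mode differs).

```python
import numpy as np

def filter2d(image, kernel):
    """Slide `kernel` over a 2-D `image` and return the weighted sums."""
    kh, kw = kernel.shape
    pad_y, pad_x = kh // 2, kw // 2
    # Repeat edge pixels so the kernel can be centred on border pixels too.
    padded = np.pad(image.astype(np.float64),
                    ((pad_y, pad_y), (pad_x, pad_x)), mode="edge")
    out = np.zeros(image.shape, dtype=np.float64)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            window = padded[y:y + kh, x:x + kw]   # neighbourhood of (y, x)
            out[y, x] = np.sum(window * kernel)   # multiply, then sum
    return out

box = np.full((3, 3), 1 / 9)   # 3x3 mean filter
image = np.zeros((5, 5))
image[2, 2] = 9.0              # a single bright pixel
blurred = filter2d(image, box)
print(np.round(blurred, 2))
```

The single bright pixel is spread evenly over its 3×3 neighbourhood, which is precisely what "blur" means. Swapping in the sharpening kernel from the listing that follows changes the behaviour without changing a line of the loop.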
import cv2
import numpy as np
import matplotlib.pyplot as plt
img = cv2.imread('noisy_photo.jpg')
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
# 1. Box (Mean) filter
box_blur = cv2.blur(img_rgb, (5, 5))
# 2. Gaussian blur
gaussian = cv2.GaussianBlur(img_rgb, (5, 5), sigmaX=0)
# 3. Median blur — best for salt-and-pepper noise
median = cv2.medianBlur(img_rgb, 5)
# 4. Bilateral filter — edge-preserving smooth
bilateral = cv2.bilateralFilter(img_rgb, d=9, sigmaColor=75, sigmaSpace=75)
# 5. Sharpening with custom kernel
sharp_kernel = np.array([[0, -1, 0],
[-1, 5, -1],
[0, -1, 0]])
sharpened = cv2.filter2D(img_rgb, -1, sharp_kernel)
# Plot all results
fig, axes = plt.subplots(2, 3, figsize=(14, 8))
titles = ['Original', 'Box Blur', 'Gaussian', 'Median', 'Bilateral', 'Sharpened']
images = [img_rgb, box_blur, gaussian, median, bilateral, sharpened]
# axes.flat is an attribute (a flat iterator), not a method
for ax, im, t in zip(axes.flat, images, titles):
    ax.imshow(im)
    ax.set_title(t)
    ax.axis('off')
plt.tight_layout()
plt.show()
Use Gaussian before Canny edge detection. Use Median for scanner noise or old photographs. Use Bilateral when you need smooth skin tones but sharp edges in portraits. Never sharpen before smoothing — it amplifies noise catastrophically.
Thresholding & Segmentation — Separating What Matters
Thresholding is the most fundamental segmentation technique in image processing. Any pixel above a brightness threshold becomes "foreground" (1, stored as 255 in OpenCV's 8-bit masks). Everything else becomes "background" (0). The resulting image is purely binary: black or white.
import cv2
import numpy as np
img = cv2.imread('document.jpg', cv2.IMREAD_GRAYSCALE)
# 1. Global threshold: pixels > 127 → white, else black
ret1, thresh_global = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)
# 2. Otsu's method: auto-calculate optimal threshold
ret2, thresh_otsu = cv2.threshold(
img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
)
print(f"Otsu's optimal threshold: {ret2:.1f}")
# 3. Adaptive threshold — handles uneven lighting in documents
thresh_adapt = cv2.adaptiveThreshold(
img,
maxValue=255,
adaptiveMethod=cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
thresholdType=cv2.THRESH_BINARY,
blockSize=11, # neighbourhood size (must be odd)
C=2 # constant subtracted from mean
)
# Count foreground pixels in each method
for name, t in {"Global": thresh_global,
                "Otsu": thresh_otsu,
                "Adaptive": thresh_adapt}.items():
    fg = np.sum(t == 255)
    total = t.size
    print(f"{name:10s}: {fg:7,d} foreground px ({fg/total*100:.1f}%)")
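How does Otsu's method pick its threshold? It tries every possible cut and keeps the one that best separates the two resulting groups of pixels, measured by the between-class variance. The NumPy sketch below illustrates that idea under simplifying assumptions; the function name otsu_threshold is hypothetical, and OpenCV's implementation may break ties differently and return a slightly different value.

```python
import numpy as np

def otsu_threshold(grey):
    """Return the cut t (pixels <= t are background) that maximises
    the between-class variance of an 8-bit greyscale image."""
    hist = np.bincount(grey.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)

    w0 = np.cumsum(prob)                 # share of pixels at or below each level
    w1 = 1.0 - w0                        # share of pixels above each level
    cum_mean = np.cumsum(prob * levels)  # running intensity sum
    total_mean = cum_mean[-1]

    denom = w0 * w1
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (total_mean * w0 - cum_mean) ** 2 / denom
    between[denom == 0] = 0.0            # one class is empty: not a valid cut
    return int(np.argmax(between))

# Synthetic bimodal image: dark paper (50) and bright ink (200).
page = np.full((10, 10), 50, dtype=np.uint8)
page[:, 5:] = 200
t = otsu_threshold(page)
print(t)
```

Any cut between the two clusters scores equally well on this synthetic image, and the sketch returns the first of them.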
Morphological Operations — Sculpting the Binary Image
After thresholding you often have a "rough" binary image: small holes inside objects, tiny isolated noise pixels, or slightly disconnected regions. Morphological operations fix these problems by expanding or shrinking the white regions using a structuring element.
| Operation | Effect | Typical Use | OpenCV Function |
|---|---|---|---|
| Erosion | Shrinks white regions. Removes thin protrusions and small blobs. | Remove salt noise, thin lines | cv2.erode() |
| Dilation | Expands white regions. Fills small holes and connects nearby blobs. | Fill gaps, join broken lines | cv2.dilate() |
| Opening | Erosion then Dilation. Removes small noise without shrinking objects. | Clean background noise | cv2.morphologyEx(MORPH_OPEN) |
| Closing | Dilation then Erosion. Fills holes without expanding objects. | Fill interior gaps in text/shapes | cv2.morphologyEx(MORPH_CLOSE) |
| Gradient | Dilation minus Erosion. Highlights object boundaries. | Find edges in binary images | cv2.morphologyEx(MORPH_GRADIENT) |
| Top Hat | Original minus Opening. Reveals small bright spots on dark background. | Cell detection, bright defects | cv2.morphologyEx(MORPH_TOPHAT) |
import cv2
import numpy as np
# Load a binary image (e.g., after Otsu thresholding)
_, binary = cv2.threshold(
cv2.imread('text.jpg', cv2.IMREAD_GRAYSCALE),
0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
)
# Define structuring element (kernel)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
# Apply each operation
eroded = cv2.erode(binary, kernel, iterations=1)
dilated = cv2.dilate(binary, kernel, iterations=1)
opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
gradient = cv2.morphologyEx(binary, cv2.MORPH_GRADIENT, kernel)
# Closing is often the final step to clean up OCR pre-processing
print("Binary image cleaned and ready for OCR.")
For document digitisation tasks: Greyscale → Gaussian Blur → Otsu Threshold → Morphological Closing. This four-step pipeline dramatically improves Tesseract OCR accuracy on real-world scanned documents — often taking recognition from 60% to over 95% on typical business documents.
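Under the hood, erosion and dilation are just a minimum and a maximum taken over each pixel's neighbourhood. The NumPy sketch below shows this for a square structuring element; the helper names are hypothetical, and the border handling (white padding for erosion, black for dilation) is an assumption chosen so that the image border does not distort the result.

```python
import numpy as np

def _windows(binary, k, border):
    """Stack the k*k shifted copies of the image that make up each neighbourhood."""
    pad = k // 2
    padded = np.pad(binary, pad, mode="constant", constant_values=border)
    h, w = binary.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(k) for dx in range(k)])

def erode(binary, k=3):
    return _windows(binary, k, border=255).min(axis=0)   # any black neighbour wins

def dilate(binary, k=3):
    return _windows(binary, k, border=0).max(axis=0)     # any white neighbour wins

def opening(binary, k=3):
    return dilate(erode(binary, k), k)                   # erosion, then dilation

mask = np.zeros((9, 9), dtype=np.uint8)
mask[2:7, 2:7] = 255   # a 5x5 object
mask[0, 8] = 255       # one isolated noise pixel
cleaned = opening(mask)
print(np.count_nonzero(mask), np.count_nonzero(cleaned))
```

The isolated pixel disappears during erosion and never comes back, while the 5×5 object shrinks and is then restored to its original size.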
Edge Detection — Finding Where Things Begin and End
An edge is a location where pixel intensity changes sharply. Edges correspond to object boundaries, shadows, and surface discontinuities — they carry most of the structural information in an image. Detecting edges is the prerequisite for shape recognition, object counting, and feature extraction.
import cv2
import numpy as np
import matplotlib.pyplot as plt
img = cv2.imread('building.jpg', cv2.IMREAD_GRAYSCALE)
# 1. Sobel edges (X and Y)
sobelx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3) # X gradient
sobely = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3) # Y gradient
sobel_mag = np.sqrt(sobelx**2 + sobely**2) # Combined magnitude
sobel_mag = np.uint8(np.clip(sobel_mag, 0, 255))
# 2. Laplacian edges
laplacian = cv2.Laplacian(img, cv2.CV_64F)
laplacian = cv2.convertScaleAbs(laplacian) # absolute value, saturated to uint8 (no wrap-around above 255)
# 3. Canny edge detector — the gold standard
# GaussianBlur first to reduce noise
blurred = cv2.GaussianBlur(img, (5, 5), 0)
canny = cv2.Canny(
blurred,
threshold1=50, # lower hysteresis threshold
threshold2=150 # upper hysteresis threshold
)
# Auto-calculate Canny thresholds using median pixel value
median_val = np.median(blurred)
lower = int(max(0, 0.67 * median_val))
upper = int(min(255, 1.33 * median_val))
canny_auto = cv2.Canny(blurred, lower, upper)
print(f"Auto Canny thresholds: lower={lower}, upper={upper}")
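To see what a Sobel gradient actually measures, it helps to compute one by hand on a synthetic step edge. The sketch below applies the standard 3×3 Sobel weights with plain array slicing and, as a simplification, only evaluates interior pixels, so the output is two pixels smaller in each dimension; the function name sobel_valid is hypothetical.

```python
import numpy as np

def sobel_valid(img):
    """Sobel X and Y gradients for interior pixels only (no border padding)."""
    img = img.astype(np.float64)
    # Right column minus left column, centre row weighted by 2.
    gx = ((img[:-2, 2:] + 2 * img[1:-1, 2:] + img[2:, 2:])
          - (img[:-2, :-2] + 2 * img[1:-1, :-2] + img[2:, :-2]))
    # Bottom row minus top row, centre column weighted by 2.
    gy = ((img[2:, :-2] + 2 * img[2:, 1:-1] + img[2:, 2:])
          - (img[:-2, :-2] + 2 * img[:-2, 1:-1] + img[:-2, 2:]))
    return gx, gy

# Synthetic image: dark left half, bright right half (a vertical edge).
step = np.zeros((5, 6))
step[:, 3:] = 100
gx, gy = sobel_valid(step)
magnitude = np.hypot(gx, gy)
print(gx[0].tolist())  # [0.0, 400.0, 400.0, 0.0]
```

The X gradient fires only where the window straddles the edge, and the Y gradient stays at zero because nothing changes from top to bottom.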
Contours & Shape Analysis — Recognising What You See
A contour is a continuous curve along a boundary where pixels change from background to foreground. After finding contours, you can calculate properties: area, perimeter, bounding box, shape similarity, centroid position, and more. This is how industrial quality control systems automatically detect defective parts.
cv2.findContours() returns a list of contours (each is a NumPy array of points) and a hierarchy describing parent-child relationships between nested contours.
Filter out noise by keeping only contours with cv2.contourArea(c) > min_area. For coin counting, this step alone removes 90% of false positives.
import cv2
import numpy as np
img = cv2.imread('coins.jpg')
grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(grey, (7, 7), 0)
_, binary = cv2.threshold(blurred, 0, 255,
cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
# Find external contours only
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
print(f"Total raw contours: {len(contours)}")
# Filter and analyse each contour
min_area = 500
valid_coins = []
for cnt in contours:
    area = cv2.contourArea(cnt)
    if area < min_area:
        continue
    perimeter = cv2.arcLength(cnt, True)
    circularity = 4 * np.pi * area / (perimeter ** 2) # 1.0 = perfect circle
    x, y, w, h = cv2.boundingRect(cnt)
    aspect_ratio = w / h
    # Coins should be near-circular
    if circularity > 0.75 and 0.8 < aspect_ratio < 1.2:
        valid_coins.append({'area': area, 'circularity': circularity,
                            'bbox': (x, y, w, h)})
        cv2.drawContours(img, [cnt], -1, (0, 255, 0), 2)
print(f"Coins detected: {len(valid_coins)}")
for i, coin in enumerate(valid_coins, 1):
    print(f"  Coin {i}: area={coin['area']:.0f}px², circularity={coin['circularity']:.3f}")
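The circularity score used in the filter above is easier to trust once you have seen it on ideal shapes. The sketch below evaluates the same formula, 4π·area / perimeter², on a perfect circle, a square, and a 2:1 rectangle; the dimensions are arbitrary, and real pixel contours will score somewhat differently because their perimeters are jagged.

```python
import math

def circularity(area, perimeter):
    """4*pi*area / perimeter^2: 1.0 for a perfect circle, lower for anything else."""
    return 4 * math.pi * area / perimeter ** 2

r = 10.0
circle = circularity(math.pi * r ** 2, 2 * math.pi * r)
square = circularity(10.0 * 10.0, 4 * 10.0)
rectangle = circularity(20.0 * 10.0, 2 * (20.0 + 10.0))

print(f"circle:    {circle:.3f}")     # 1.000
print(f"square:    {square:.3f}")     # 0.785
print(f"rectangle: {rectangle:.3f}")  # 0.698
```

Note that an ideal square scores about 0.785, just above the 0.75 cut-off, and also passes the aspect-ratio check. If square objects can appear in your images, a stricter circularity threshold may be worth testing.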
Geometric Transformations — Repositioning Pixels
Sometimes the problem isn't the pixel values — it's where the pixels are. A scanned document might be rotated. A product photo needs cropping. A wide-angle camera introduces barrel distortion. Geometric transformations reposition pixels in 2D space to correct these issues or prepare images for model training.
| Transformation | Degrees of Freedom | Preserves | Use Case |
|---|---|---|---|
| Translation | Shift x, y | Shape, size, angles | Moving/centering objects |
| Rotation | Angle θ, centre | Shape, size | Deskewing text documents |
| Scaling | Scale x, y | Shape (if uniform) | Resize for neural network input |
| Affine | 6 DOF (rotation + scale + shear) | Parallelism | Correct mild perspective distortion |
| Perspective | 8 DOF (homography) | Straight lines | Correct full perspective distortion, receipts, whiteboards |
| Undistortion | Camera matrix + distortion coefficients | True scene geometry | Fisheye / wide-angle lens correction |
import cv2
import numpy as np
img = cv2.imread('whiteboard.jpg')
h, w = img.shape[:2]
# ── 1. Resize ──────────────────────────────────────────────
resized = cv2.resize(img, (224, 224), interpolation=cv2.INTER_LINEAR)
# ── 2. Rotation around centre ──────────────────────────────
centre = (w // 2, h // 2)
M_rot = cv2.getRotationMatrix2D(centre, angle=15, scale=1.0)
rotated = cv2.warpAffine(img, M_rot, (w, h))
# ── 3. Perspective transform (straighten a tilted document) ─
# Source points: four corners of the document in the original image
src_pts = np.float32([[120, 90], [510, 60],
[545, 420], [75, 440]])
# Destination: what we want those corners to map to (a clean rectangle)
dst_pts = np.float32([[0, 0], [400, 0],
[400, 500], [0, 500]])
M_persp = cv2.getPerspectiveTransform(src_pts, dst_pts)
warped = cv2.warpPerspective(img, M_persp, (400, 500))
print(f"Warped document shape: {warped.shape}")
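A rotation about the image centre is just a 2×3 matrix applied to every pixel coordinate. The sketch below builds that matrix by hand, following the formula given in the OpenCV documentation for getRotationMatrix2D, and applies it to two points; the helper names and the centre coordinates are assumptions made for illustration.

```python
import numpy as np

def rotation_matrix(centre, angle_deg, scale=1.0):
    """2x3 affine matrix for a rotation by angle_deg about `centre`."""
    cx, cy = centre
    a = scale * np.cos(np.radians(angle_deg))
    b = scale * np.sin(np.radians(angle_deg))
    # The third column shifts the result so that `centre` maps to itself.
    return np.array([[ a, b, (1 - a) * cx - b * cy],
                     [-b, a, b * cx + (1 - a) * cy]])

def apply_affine(M, point):
    """Map one (x, y) point through a 2x3 affine matrix."""
    x, y = point
    return M @ np.array([x, y, 1.0])

M = rotation_matrix((50, 40), angle_deg=90)
print(np.round(apply_affine(M, (50, 40)), 6))  # the centre stays put
print(np.round(apply_affine(M, (51, 40)), 6))  # one pixel right of centre moves to one pixel above it
```

Because image y coordinates grow downwards, a positive angle rotates the content counter-clockwise on screen.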
Histogram & Intensity Adjustments — Correcting Exposure
Histogram equalisation automatically redistributes pixel values so that every intensity level is used as uniformly as possible. The result? Maximum contrast, maximum information in the image.
Before equalisation (low-contrast image):
| Property | Value |
|---|---|
| Pixel range used | 80–180 (100 levels of 256) |
| Histogram shape | Narrow spike in mid-range |
| Visual appearance | Flat, washed-out, grey |
| Contrast | Poor — hard to distinguish features |

After equalisation:
| Property | Value |
|---|---|
| Pixel range used | 0–255 (full 256 levels) |
| Histogram shape | Approximately uniform |
| Visual appearance | High contrast, crisp details |
| Contrast | Excellent — features clearly visible |
import cv2
import numpy as np
img = cv2.imread('dark_xray.jpg', cv2.IMREAD_GRAYSCALE)
# 1. Standard histogram equalisation
eq_global = cv2.equalizeHist(img)
# 2. CLAHE — Contrast Limited Adaptive Histogram Equalisation
# Better than global EQ: avoids over-amplifying noise in uniform regions
clahe = cv2.createCLAHE(
clipLimit=2.0, # contrast amplification limit
tileGridSize=(8, 8) # tile size for local histogram
)
eq_clahe = clahe.apply(img)
# 3. Gamma correction — brighten dark images non-linearly
def gamma_correct(image, gamma=1.5):
    inv_gamma = 1.0 / gamma
    lut = np.array([
        ((i / 255.0) ** inv_gamma) * 255
        for i in range(256)
    ]).astype('uint8')
    return cv2.LUT(image, lut)
brightened = gamma_correct(img, gamma=1.8)
# Compare mean brightness
print(f"Original mean: {img.mean():.1f}")
print(f"Global EQ mean: {eq_global.mean():.1f}")
print(f"CLAHE mean: {eq_clahe.mean():.1f}")
print(f"Gamma 1.8 mean: {brightened.mean():.1f}")
Global histogram equalisation can over-enhance noise in uniform regions (e.g. a clear sky becomes speckled). CLAHE fixes this by computing localised histograms and clipping the amplification. For medical imaging (retinal scans, X-rays, MRIs), always prefer CLAHE over global equalisation.
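Global equalisation is simpler than it sounds: build the cumulative histogram, rescale it to 0–255, and use it as a lookup table. The NumPy sketch below illustrates the idea on a synthetic low-contrast image; the function name equalise is hypothetical, and the rounding may differ slightly from cv2.equalizeHist.

```python
import numpy as np

def equalise(grey):
    """Global histogram equalisation of an 8-bit greyscale image via its CDF."""
    hist = np.bincount(grey.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]   # CDF value at the darkest level present
    total = grey.size
    if total == cdf_min:        # constant image: nothing to stretch
        return grey.copy()
    # Rescale the CDF so the darkest level maps to 0 and the brightest to 255.
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255)
    lut = lut.clip(0, 255).astype(np.uint8)
    return lut[grey]            # look up every pixel's new value

# Synthetic washed-out image that only uses levels 80..180.
flat = np.linspace(80, 180, 10_000).astype(np.uint8).reshape(100, 100)
stretched = equalise(flat)
print(flat.min(), flat.max())            # 80 180
print(stretched.min(), stretched.max())  # 0 255
```

CLAHE applies the same lookup idea tile by tile, with a cap on how steep each tile's mapping is allowed to become.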
Feature Detection — Finding Reliable Keypoints
A feature is a small, distinctive region in an image that can be reliably detected and described — even if the image is rescaled, rotated, or slightly blurred. Features are the basis of image stitching (panoramas), visual SLAM (robot navigation), and matching objects across frames in video.
import cv2
import numpy as np
img1 = cv2.imread('scene_a.jpg', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('scene_b.jpg', cv2.IMREAD_GRAYSCALE)
# ── ORB (fast, patent-free) ────────────────────────────────
orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
# Match descriptors using Brute Force + Hamming (for binary descriptors)
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(des1, des2)
matches = sorted(matches, key=lambda x: x.distance)
print(f"ORB keypoints in img1: {len(kp1)}")
print(f"ORB keypoints in img2: {len(kp2)}")
print(f"Good matches (top 50): {min(50, len(matches))}")
print(f"Best match distance: {matches[0].distance:.1f}")
# ── SIFT (higher quality) ─────────────────────────────────
sift = cv2.SIFT_create()
kp_s1, des_s1 = sift.detectAndCompute(img1, None)
# Use FLANN for efficient SIFT matching (L2 norm, float descriptors)
flann = cv2.FlannBasedMatcher(
{'algorithm': 1, 'trees': 5},
{'checks': 50}
)
print(f"SIFT keypoints in img1: {len(kp_s1)}")
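ORB descriptors are short bit strings, and "matching" them means counting how many bits differ. The sketch below shows that Hamming distance and a naive nearest-neighbour search in NumPy; the descriptors are random stand-ins rather than real ORB output, the function names are hypothetical, and the cross-check step used in the listing above is left out for brevity.

```python
import numpy as np

def hamming(d1, d2):
    """Number of differing bits between two uint8 descriptor vectors."""
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())

def match_brute_force(des_a, des_b):
    """For each descriptor in des_a, return (best index in des_b, distance)."""
    results = []
    for d in des_a:
        dists = [hamming(d, e) for e in des_b]
        j = int(np.argmin(dists))
        results.append((j, dists[j]))
    return results

rng = np.random.default_rng(0)
# Four made-up 32-byte (256-bit) descriptors, the size ORB produces.
des_b = rng.integers(0, 256, size=(4, 32), dtype=np.uint8)
# Two query descriptors: copies of rows 2 and 0, the first with three bits flipped.
des_a = des_b[[2, 0]].copy()
des_a[0, 0] ^= 0b00000111

matches = match_brute_force(des_a, des_b)
print(matches)  # [(2, 3), (0, 0)]
```

A lower distance means a closer match, which is why the listing above sorts matches by distance before taking the top 50.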
Image Augmentation — Multiplying Your Training Data
Deep learning models need vast amounts of training data. But collecting and labelling thousands of images for every class is expensive and slow. Image augmentation artificially expands your dataset by applying random but realistic transformations to existing images. A single labelled photo of a cat can become 50+ training examples — each slightly different — all still cats.
Without augmentation, a model trained on upright cats may completely fail to recognise a sideways cat — because it memorised specific pixel patterns rather than the concept. Augmentation forces the model to learn invariant representations: "catness" that persists across rotations, flips, brightness changes, and crops.
from PIL import Image, ImageEnhance, ImageFilter
import numpy as np
import random
def augment_image(pil_img):
    """Apply a random chain of augmentations to a PIL image."""
    img = pil_img.copy()
    # Random horizontal flip
    if random.random() > 0.5:
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
    # Random rotation
    angle = random.uniform(-15, 15)
    img = img.rotate(angle, expand=False, fillcolor=(128, 128, 128))
    # Random brightness
    factor = random.uniform(0.7, 1.3)
    img = ImageEnhance.Brightness(img).enhance(factor)
    # Random contrast
    factor = random.uniform(0.8, 1.4)
    img = ImageEnhance.Contrast(img).enhance(factor)
    # Occasional Gaussian blur (simulates motion / focus issues)
    if random.random() > 0.7:
        img = img.filter(ImageFilter.GaussianBlur(radius=1))
    return img
# Generate 50 augmented versions of one image
original = Image.open('product.jpg').convert('RGB') # RGB so the 3-tuple fillcolor in rotate() is valid
augmented_set = [augment_image(original) for _ in range(50)]
print(f"Dataset grew from 1 image to {len(augmented_set)+1} images.")
Deep Learning Integration — From Pixels to Predictions
Modern image processing pipelines end where deep learning begins. The classical operations you have learned — denoising, normalisation, resizing, augmentation — are the pre-processing layer that feeds Convolutional Neural Networks (CNNs). A CNN then automatically learns the most powerful filters and features for your specific task.
import torch
import torchvision.transforms as T
import torchvision.models as models
from PIL import Image
# ── Pre-processing pipeline for ImageNet-pretrained models ──
preprocess = T.Compose([
T.Resize(256), # Resize shorter side to 256
T.CenterCrop(224), # Crop centre 224×224
T.ToTensor(), # PIL → float32 tensor [0,1]
T.Normalize(
mean=[0.485, 0.456, 0.406], # ImageNet channel means
std= [0.229, 0.224, 0.225] # ImageNet channel stds
)
])
# Load a pretrained ResNet-50
model = models.resnet50(weights='IMAGENET1K_V2')
model.eval()
# Inference on a single image
img = Image.open('dog.jpg').convert('RGB')
tensor = preprocess(img).unsqueeze(0) # add batch dimension → (1, 3, 224, 224)
with torch.no_grad():
    logits = model(tensor) # shape: (1, 1000)
    probs = torch.softmax(logits, dim=1)
    top5 = torch.topk(probs, 5)
print("Top-5 Predictions:")
for score, idx in zip(top5.values[0], top5.indices[0]):
    print(f"  Class {idx.item():4d}: {score.item()*100:.2f}%")
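What do the ToTensor and Normalize steps actually do to the numbers? The NumPy sketch below reproduces the arithmetic on a made-up all-white image: scale to [0, 1], subtract the per-channel mean, divide by the per-channel standard deviation, and move the channel axis to the front. The function name to_model_input is hypothetical, and resizing and cropping are left out.

```python
import numpy as np

# ImageNet channel statistics, as used in the pipeline above (RGB order).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_model_input(img_rgb_uint8):
    """uint8 (H, W, 3) RGB image -> normalised float32 (3, H, W) array."""
    scaled = img_rgb_uint8.astype(np.float32) / 255.0      # [0, 255] -> [0, 1]
    normalised = (scaled - IMAGENET_MEAN) / IMAGENET_STD   # per-channel shift and scale
    return normalised.transpose(2, 0, 1)                   # HWC -> CHW

white = np.full((224, 224, 3), 255, dtype=np.uint8)
x = to_model_input(white)
print(x.shape, x.dtype)  # (3, 224, 224) float32
print(x[:, 0, 0])        # one value per channel, each a little above 2
```

If you feed a BGR array from cv2.imread() into this function, the means and standard deviations are applied to the wrong channels, which is one way the channel-order bug silently degrades predictions.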
Classical vs Deep Learning — Full Task Comparison
| Task | Classical Approach | Deep Learning Approach | Recommended |
|---|---|---|---|
| Noise Removal | Median / Bilateral Filter | DnCNN, Noise2Noise | Classical (fast, no training data needed) |
| Edge Detection | Canny, Sobel | HED, RCF (learned boundaries) | Classical for most uses. DL for complex scenes. |
| Image Classification | HOG + SVM (limited accuracy) | ResNet, EfficientNet, ViT | Deep Learning always wins here |
| Object Detection | HOG+SVM sliding window (slow) | YOLO, Faster R-CNN, DETR | Deep Learning — not even close |
| Semantic Segmentation | GrabCut, Watershed (manual) | U-Net, SegFormer, SAM | Deep Learning for accuracy, GrabCut for quick prototypes |
| Image Stitching | ORB/SIFT + RANSAC Homography | DeepPano, UDIS++ | Classical is battle-tested and reliable |
| OCR Pre-processing | Threshold + Morph ops | End-to-end STR (Scene Text Recognition) | Classical pipeline + Tesseract for most tasks |
| Face Detection | Viola-Jones Haar Cascades | MTCNN, RetinaFace, InsightFace | Deep Learning — much higher accuracy |
Golden Rules
Know your data type and range. OpenCV and Pillow load 8-bit images as uint8 [0–255]. PyTorch expects float32 [0,1]. TensorFlow may expect [0,1] or [-1,1]. Forgetting this causes invisible bugs where images display black or models produce garbage.
Mind the channel order. If you load with cv2.imread() and pass the result to a library expecting RGB (matplotlib, PIL, PyTorch), convert explicitly with cv2.cvtColor(img, cv2.COLOR_BGR2RGB). This is the #1 beginner bug.