
Pooling & Spatial Hierarchy in CNNs

Pooling layers slide a window across a feature map and replace each neighbourhood with a single number: the maximum (max pooling) or the mean (average pooling). This shrinks spatial dimensions, adds tolerance to small translations, grows the receptive field, and builds a pyramid of meaning where each successive layer detects progressively larger, more abstract patterns. The tutorial covers the mechanics, animated diagrams, worked numericals, the receptive field formula, and a full PyTorch implementation.

Section 01

The Story — Why Pooling Exists

Reading a Newspaper From Across the Room
Imagine reading a newspaper. Up close, you see every pixel of every letter. Step back five metres and you can no longer read individual letters — but you can still instantly spot the headline, the photo, and the section borders. Step back twenty metres and you perceive only the rough layout — two columns, a big image on the left.

At no distance did you lose the meaning of the page. You lost resolution, but gained the ability to see structure at multiple scales simultaneously. Your brain pooled local detail into progressively coarser summaries — exactly what pooling layers do inside a CNN.

After a convolution produces a feature map, the network needs to reduce its spatial size for three reasons: (1) cut computation, (2) limit parameters in later layers, and (3) make the network robust to small shifts in position. Pooling achieves all three with zero learnable parameters — a fixed mathematical operation that simply summarises each small neighbourhood into one number.

💡
Pooling in One Sentence

Slide a small window across the feature map; replace every window with a single summary statistic (the maximum or the average). The result is a smaller map that retains the essence of what was detected without caring exactly where inside the window it appeared.


Section 02

Max Pooling vs Average Pooling

Both operations use the same sliding-window mechanics as convolution — a window size and a stride — but replace the dot product with a simpler rule.

▲ Max Pooling — Keep the Loudest Signal
Property | Detail
Rule | Output = maximum value in each window
Effect | Asks: "Did this feature appear anywhere here?"
Keeps | The strongest activation (presence detection)
Best for | Detecting sharp features: edges, textures, corners
Used in | AlexNet, VGG, ResNet (virtually every classification CNN)
≈ Average Pooling — Blend the Neighbourhood
Property | Detail
Rule | Output = mean value across the window
Effect | Asks: "How active is this region overall?"
Keeps | The distributed signal (background / diffuse features)
Best for | Global context, final layer summarisation
Used in | GoogLeNet GAP, MobileNet, NLP token pooling
📊 Numerical Example — 4×4 Feature Map, 2×2 Pool, Stride 2
Input
Row 0:  1  3  2  4
Row 1:  5  6  1  2
Row 2:  3  2  4  7
Row 3:  1  0  6  3
Window [0:2, 0:2]
Values: 1, 3, 5, 6  →  Max = 6  |  Avg = 3.75
Window [0:2, 2:4]
Values: 2, 4, 1, 2  →  Max = 4  |  Avg = 2.25
Window [2:4, 0:2]
Values: 3, 2, 1, 0  →  Max = 3  |  Avg = 1.50
Window [2:4, 2:4]
Values: 4, 7, 6, 3  →  Max = 7  |  Avg = 5.00
Max Output
4×4 → 2×2  :  [[6, 4], [3, 7]] — halved both dims, 75% fewer values
Avg Output
4×4 → 2×2  :  [[3.75, 2.25], [1.50, 5.00]] — softer, all values influence output
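The worked example above can be checked in a few lines of NumPy. This is a minimal sketch that assumes non-overlapping windows (stride equal to window size) and input dimensions divisible by the window; the helper name `pool_nonoverlap` is illustrative and not part of any library.

```python
import numpy as np

def pool_nonoverlap(x, size=2, mode="max"):
    """Pool a 2-D array with a size x size window and stride == size."""
    h, w = x.shape
    if h % size or w % size:
        raise ValueError("input dimensions must be divisible by the window size")
    # Split each axis into (block index, position inside block),
    # so every window ends up with its own pair of axes (1 and 3).
    blocks = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "avg":
        return blocks.mean(axis=(1, 3))
    raise ValueError(f"unknown mode: {mode!r}")

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [3, 2, 4, 7],
                        [1, 0, 6, 3]], dtype=float)

max_out = pool_nonoverlap(feature_map, mode="max")  # [[6, 4], [3, 7]]
avg_out = pool_nonoverlap(feature_map, mode="avg")  # [[3.75, 2.25], [1.5, 5.0]]
print(max_out)
print(avg_out)
```

The reshape trick only works when windows do not overlap; for a stride smaller than the window, an explicit loop or a framework layer is the safer choice.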
Max Pooling — 2×2 window, stride 2
[Animation: a 2×2 window slides across the 4×4 input in four steps. At each position it keeps only the maximum value, which becomes one cell of the 2×2 output.]
Average Pooling — 2×2 window, stride 2
Same sliding window, but each output cell is the mean of the four values in the window. Notice how the output values are smoother — no single extreme dominates.
Global Average Pooling (GAP) — the entire map becomes one number per channel
Instead of a sliding window, GAP takes every single value in the feature map and averages them into one scalar. A 7×7 feature map → 1 number. Used right before the classifier in modern CNNs (GoogLeNet, MobileNet) to eliminate fully-connected layers.
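In array terms, GAP is just a mean over the two spatial axes. The sketch below assumes a channels-first layout `(C, H, W)` and uses random values in place of real activations; `global_average_pool` is an illustrative name.

```python
import numpy as np

def global_average_pool(features):
    """features: (C, H, W) array -> (C,) vector of per-channel means."""
    return features.mean(axis=(1, 2))

rng = np.random.default_rng(0)
features = rng.normal(size=(512, 7, 7))   # stand-in for a 7x7 map with 512 channels
pooled = global_average_pool(features)
print(features.shape, "->", pooled.shape)  # (512, 7, 7) -> (512,)
```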
🔑
The Key Difference in One Line

Max pooling answers "was this feature present?" while average pooling answers "how strongly present was this feature on average?" Max pooling is usually preferred inside the network, where the question is whether a feature fired at all. Average pooling (especially Global Average Pooling) is preferred at the end of a network where you want a holistic summary before classification.


Section 03

Translation Invariance — Why CNNs Don't Care Where You Are

The Dog Detector
You train a CNN to detect dogs. In training, every dog photo has the dog roughly centred. At inference, a dog appears in the top-left corner. Should the network fail?

Without pooling — almost yes. Each neuron is tied to a fixed spatial position. Shift the dog two pixels right and a different set of neurons fires. With pooling, the story changes: the detector neuron fires strongly somewhere inside the 2×2 window. Max pooling says "I don't care which pixel inside this window triggered — the maximum tells me the feature is present in this region." A small shift moves the dog within the same window — the maximum is unchanged. Invariance emerges.
🔍
Local Invariance
From pooling
A 2×2 max pool with stride 2 leaves its output unchanged when an activation moves by one pixel, as long as it stays inside the same window. Small jitter, noise, or minor misalignment therefore often has no effect on the output; a shift that crosses a window boundary still changes it.
✓ Robust to minor positional shifts
🌎
Global Invariance
From stacked layers
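Stack three 2×2 pooling layers and each output position summarises an 8×8 block of the input (2×2×2 = 8 per side), so shifts of a few pixels rarely change the output. The deeper the network, the larger the region in which a single output neuron "ignores" exact position. This tolerance is approximate, not exact.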
✓ Robust to large positional changes
⚠️
Equivariance ≠ Invariance
Common confusion
Convolution is equivariant — shift the input, the feature map shifts identically. Pooling adds invariance — small shifts stop affecting the output. You need both: conv to detect, pool to ignore where.
✗ Full invariance needs data augmentation too
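The limits of local invariance are easy to see on a toy input. This sketch places a single activation in an otherwise empty 4×4 map and compares the pooled output before and after a shift; the helper names are illustrative, and the 2×2, stride-2 pool is implemented with a NumPy reshape.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pool with stride 2 on a 2-D array with even side lengths."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def single_activation(row, col, shape=(4, 4)):
    """A feature map that is zero everywhere except one active cell."""
    x = np.zeros(shape)
    x[row, col] = 1.0
    return x

base = max_pool_2x2(single_activation(0, 0))
shift_inside = max_pool_2x2(single_activation(0, 1))  # still in the top-left window
shift_across = max_pool_2x2(single_activation(0, 2))  # now in the top-right window

print(np.array_equal(base, shift_inside))  # True: same window, same output
print(np.array_equal(base, shift_across))  # False: the shift crossed a window boundary
```

Pooling therefore gives tolerance to small shifts, not a guarantee, which is one reason data augmentation is still needed.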

Section 04

Spatial Hierarchy — The Pyramid of Meaning

Stacking convolution + pooling repeatedly creates a spatial hierarchy: each layer looks at a larger portion of the original image through fewer, richer neurons. Early layers see fine texture. Deep layers see semantic objects.

🏢 Spatial Hierarchy in a Typical CNN (224×224 RGB input)
Layer 1 — Conv
224×224×64  →  each neuron sees a 3×3 patch of the original image. Detects: edges, colour blobs, simple gradients.
Pool 1
224×224 → 112×112. Spatial size halved. Receptive field of each neuron now covers 4×4 of the original.
Layer 2 — Conv
112×112×128  →  neurons combine edge responses from Pool 1. Detects: corners, simple curves, local textures.
Pool 2
112×112 → 56×56. With one 3×3 conv per stage, each neuron now "sees" 10×10 of the original (4 after Pool 1, 8 after the second conv, 10 after this pool). Hierarchy grows.
Layer 3–5 — Conv
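56×56 → 28×28 → 14×14 → 7×7×512. Receptive fields now span tens to hundreds of pixels of the original image; the exact figures depend on how many convolutions each stage stacks (in VGG-16 the last conv layer reaches 196×196). Detects: object parts, faces, wheels, text blocks.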
GAP
7×7 → 1×1×512. Global Average Pool collapses all spatial information. One 512-D vector summarises the entire image — fed to the classifier.
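The shape bookkeeping in the list above can be traced with a short loop. This sketch assumes one "same"-padded convolution per stage followed by a 2×2, stride-2 pool, and it assumes channel widths of 256, 512, 512 for layers 3 to 5 (the list only states 64, 128 and the final 512). `trace_shapes` is an illustrative helper.

```python
def trace_shapes(input_size=224, channels=(64, 128, 256, 512, 512)):
    """Return (name, height, width, channels) for each stage of a conv+pool stack."""
    stages = []
    size = input_size
    for idx, ch in enumerate(channels, start=1):
        stages.append((f"conv{idx}", size, size, ch))  # 'same' padding keeps H and W
        size //= 2                                     # 2x2 pool, stride 2
        stages.append((f"pool{idx}", size, size, ch))
    stages.append(("gap", 1, 1, channels[-1]))         # global average pool
    return stages

stages = trace_shapes()
for name, h, w, c in stages:
    print(f"{name:6s} {h}x{w}x{c}")
```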
🎯
The Hierarchy Insight

No single layer "understands" an object. Layer 1 fires on edges, Layer 3 fires on curve-combinations that look like an eye, Layer 5 fires on the full face structure. The CNN doesn't see a face — it sees that this pattern of edge activations typically co-occurs with face labels. Pooling makes this hierarchy stable by suppressing the exact pixel-level positions as information propagates upward.


Section 05

Receptive Field — How Much of the Image Does a Neuron See?

The receptive field of a neuron is the region of the original input that can influence its activation. Pooling dramatically grows the receptive field without adding parameters.

After one Conv (kernel F, stride 1)
RF = F
A 3×3 conv layer gives every output neuron a 3×3 receptive field in the input — it sees exactly 9 pixels.
After stacking L conv layers (all F=3, S=1)
RF = 2L + 1
Two 3×3 layers → RF=5. Five 3×3 layers → RF=11. This is why deep networks with small kernels can rival single large-kernel layers.
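After one 2×2 Pool (stride 2)
RF grows by ∏S, then all later growth doubles
The pool itself enlarges the RF only slightly: a pool after RF=5 (all earlier strides 1) gives RF=6. Its real effect is on what follows. Pooling halves the spatial map, so every subsequent layer adds twice as many input pixels to the RF as it would have without the pool.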
General receptive field formula
RF_l = RF_(l-1) + (F_l − 1) × ∏S
Where ∏S is the product of all strides in prior layers. This is why pooling (stride 2) multiplies all subsequent RF growth — it's a stride that compounds.
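The general formula translates directly into a running update. The sketch below assumes each layer is described only by its kernel size and stride (padding and dilation are ignored), and `receptive_fields` is an illustrative helper name.

```python
def receptive_fields(layers):
    """layers: iterable of (kernel_size, stride). Returns the RF after each layer."""
    rf = 1      # a single input pixel
    jump = 1    # product of the strides of all previous layers
    history = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # RF_l = RF_(l-1) + (F_l - 1) * prod(S)
        jump *= stride              # this layer's stride affects later layers only
        history.append(rf)
    return history

# conv 3x3, conv 3x3, pool 2x2/2, conv 3x3, pool 2x2/2
stack = [(3, 1), (3, 1), (2, 2), (3, 1), (2, 2)]
print(receptive_fields(stack))   # [3, 5, 6, 10, 12]
```

Note that a layer's own stride never enlarges its own receptive field; it only scales what later layers add.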
📈 Worked Receptive Field — VGG-style stack
Conv 3×3, S=1
RF = 3×3: sees 9 pixels of the original image
Conv 3×3, S=1
RF = 5×5: each extra 3×3 conv at stride 1 adds 2 to the width and height (1 pixel on each side)
MaxPool 2×2, S=2
RF = 6×6: (2−1)×1 = 1 added. The stride product is now 2, so all future additions are doubled
Conv 3×3, S=1
RF = 10×10: (3−1)×2 = 4 added, scaled by the prior stride of 2
MaxPool 2×2, S=2
RF = 12×12: (2−1)×2 = 2 added. The stride product is now 4. One neuron now "sees" about 5% of the width of a 224-wide image.
⚠️
Theoretical vs Effective Receptive Field

The formula above gives the theoretical receptive field — the maximum area that could influence a neuron. In practice, central pixels contribute far more than border pixels (because of how gradients accumulate), so the effective receptive field is roughly Gaussian-shaped and much smaller than theory suggests. For most practical networks, doubling the theoretical RF only increases the effective RF by about 40%.
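The "about 40%" figure can be illustrated with a deliberately simplified model. The sketch below assumes a 1-D stack of 3-tap convolutions with uniform weights and no nonlinearities, and uses the standard deviation of the resulting influence profile as a stand-in for effective RF width. It is a toy calculation under those assumptions, not a measurement on a trained network.

```python
import numpy as np

def influence_profile(num_layers, kernel_size=3):
    """Relative influence of each input position on one output unit (1-D, uniform weights)."""
    kernel = np.ones(kernel_size) / kernel_size
    profile = np.array([1.0])
    for _ in range(num_layers):
        # Each layer widens the profile by kernel_size - 1 positions.
        profile = np.convolve(profile, kernel)
    return profile

def spread(profile):
    """Standard deviation of the profile around its centre."""
    positions = np.arange(len(profile)) - (len(profile) - 1) / 2
    return float(np.sqrt(np.sum(profile * positions ** 2)))

shallow = influence_profile(5)
deep = influence_profile(10)
print(len(shallow), len(deep))          # theoretical RF: 11 and 21
print(spread(deep) / spread(shallow))   # about 1.41
```

Doubling the depth nearly doubles the theoretical RF (11 to 21) but widens the bell-shaped influence profile only by a factor of about √2.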


Section 06

Output Dimension Formula

Pooling follows the exact same output-size formula as convolution — because it is the same sliding-window operation, just with a different reduction rule.

General output size (H or W)
O = ⌊(N − F) / S⌋ + 1
N = input size, F = pool window size, S = stride. Standard pooling layers apply no padding by default.
Common case — 2×2 pool, stride 2
O = N / 2
This is why every 2×2, stride-2 max-pool halves the spatial dimensions (odd N rounds down). 224→112→56→28→14→7 in VGG.
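3×3 pool, stride 2 (overlapping)
O = ⌊(N − 3) / 2⌋ + 1
AlexNet uses this. Because the window is wider than the stride, neighbouring windows share a row or column. For N=27: O = ⌊24/2⌋+1 = 13. For odd N this matches the 2×2, stride-2 result; for even N it is one smaller.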
Global Average Pooling
O = 1×1
Window equals the full spatial size. Output is always 1×1×C regardless of input size — the network becomes input-size agnostic.
Input | Pool | Stride | Output | Reduction
224×224 | 2×2 | 2 | 112×112 | 75% fewer values
112×112 | 2×2 | 2 | 56×56 | 75% fewer values
56×56 | 2×2 | 2 | 28×28 | 75% fewer values
28×28 | 2×2 | 2 | 14×14 | 75% fewer values
14×14 | 2×2 | 2 | 7×7 | 75% fewer values
7×7 | 7×7 GAP | - | 1×1 | 98% fewer values
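The output-size formula is a one-liner in code. This sketch assumes no padding and floor rounding, matching the formula in this section; `pool_output_size` is an illustrative helper, not a library function.

```python
def pool_output_size(n, window, stride):
    """O = floor((N - F) / S) + 1 for an unpadded pooling layer."""
    if n < window:
        raise ValueError("window is larger than the input")
    return (n - window) // stride + 1

# Five 2x2, stride-2 pools applied to a 224-wide input
size = 224
sizes = [size]
for _ in range(5):
    size = pool_output_size(size, window=2, stride=2)
    sizes.append(size)

print(sizes)                                      # [224, 112, 56, 28, 14, 7]
print(pool_output_size(27, window=3, stride=2))   # 13
print(pool_output_size(7, window=7, stride=1))    # 1 (window covers the whole map)
```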

Section 07

Python Implementation

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# ── 1. From scratch: Max and Average Pooling (2-D) ─────────────
def pool2d_scratch(x, pool_size=2, stride=2, mode='max'):
    """x: (H, W) numpy array"""
    H, W = x.shape
    OH = (H - pool_size) // stride + 1
    OW = (W - pool_size) // stride + 1
    out = np.zeros((OH, OW))
    for i in range(OH):
        for j in range(OW):
            patch = x[i*stride : i*stride+pool_size,
                      j*stride : j*stride+pool_size]
            out[i, j] = patch.max() if mode == 'max' else patch.mean()
    return out

inp = np.array([[1,3,2,4],[5,6,1,2],[3,2,4,7],[1,0,6,3]], dtype=float)

print("Max pool:", pool2d_scratch(inp, mode='max'))
print("Avg pool:", pool2d_scratch(inp, mode='avg'))

# ── 2. PyTorch layers ───────────────────────────────────────────
x_t = torch.tensor(inp).unsqueeze(0).unsqueeze(0).float()  # (1,1,4,4)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
gap      = nn.AdaptiveAvgPool2d((1, 1))     # Global Average Pool

print("MaxPool2d: ", max_pool(x_t).squeeze())
print("AvgPool2d: ", avg_pool(x_t).squeeze())
print("GAP:       ", gap(x_t).squeeze())

# ── 3. Building a mini-CNN with pooling layers ──────────────────
class MiniCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),  # 28×28×32
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                             # 14×14×32
            nn.Conv2d(32, 64, kernel_size=3, padding=1), # 14×14×64
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                             # 7×7×64
        )
        self.gap  = nn.AdaptiveAvgPool2d((1, 1))           # 1×1×64
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = self.gap(x).flatten(1)  # (B, 64)
        return self.head(x)

model = MiniCNN()
dummy = torch.randn(4, 1, 28, 28)  # batch of 4 MNIST images
print("Output shape:", model(dummy).shape)  # → (4, 10)
OUTPUT
Max pool: [[6. 4.]
 [3. 7.]]
Avg pool: [[3.75 2.25]
 [1.5  5.  ]]
MaxPool2d:  tensor([[6., 4.],
        [3., 7.]])
AvgPool2d:  tensor([[3.7500, 2.2500],
        [1.5000, 5.0000]])
GAP:        tensor(3.1250)
Output shape: torch.Size([4, 10])
📌
Why AdaptiveAvgPool2d((1,1)) is the Modern Standard

AdaptiveAvgPool2d accepts any input spatial size and, with a target of (1,1), always outputs exactly 1×1 per channel. This means a CNN trained on 224×224 images also runs on a 320×480 image at test time without any code changes (how well it performs at a new resolution is a separate question). This is why many modern architectures (ResNet, EfficientNet, MobileNet) end with it instead of a fixed-size pooling window or a flatten layer.
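Why the classifier head stops caring about input size can be shown without PyTorch. The sketch below is a NumPy stand-in for a global average pool followed by a linear layer: the weights are random placeholders, and the 7×7 and 10×15 maps are assumed to be what a stride-32 backbone would produce from 224×224 and 320×480 inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
num_channels, num_classes = 64, 10
weights = rng.normal(size=(num_classes, num_channels))  # placeholder for a trained head
bias = np.zeros(num_classes)

def classify(features):
    """features: (C, H, W) with any H and W -> (num_classes,) logits."""
    pooled = features.mean(axis=(1, 2))   # global average pool: always (C,)
    return weights @ pooled + bias        # the linear head only ever sees C values

small = classify(rng.normal(size=(num_channels, 7, 7)))
large = classify(rng.normal(size=(num_channels, 10, 15)))
print(small.shape, large.shape)   # (10,) (10,)
```

A flatten layer in the same position would produce 64×7×7 values for one input and 64×10×15 for the other, so a fixed-size linear layer could not accept both.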


Section 08

Quick Reference — Everything in One Table

Concept | Max Pool | Average Pool | Global Avg Pool
Operation | max(window) | mean(window) | mean(entire map)
Parameters | Zero | Zero | Zero
Preserves | Strongest activation (presence) | Overall energy (distribution) | Channel-level global average
Translation invariance | Strong | Moderate | Full spatial invariance
Typical use | After conv blocks in backbone | Intermediate or final layers | Replaces flatten + FC layers
Output formula | O = ⌊(N − F) / S⌋ + 1 | O = ⌊(N − F) / S⌋ + 1 | Always 1×1×C
⚡ Pooling — Five Things to Never Forget
1
Pooling has zero learnable parameters. It is a fixed mathematical reduction — there is nothing to train, nothing to overfit. The computation cost is negligible.
2
Max pooling dominates inside the backbone. It preserves the presence signal of whatever was detected and provides strong local invariance. Average pooling lives mostly at the end.
3
Every 2×2 pool with stride 2 halves spatial dimensions and doubles the rate at which the receptive field grows in every subsequent layer, which makes it a cheap way to grow context.
4
Global Average Pooling replaces fully-connected layers in modern CNNs. It is input-size agnostic, reduces parameters by millions, and acts as a strong regulariser against overfitting.
5
The spatial hierarchy is the core power of CNNs: Layer 1 = pixels → edges. Layer 3 = edges → textures. Layer 5 = textures → object parts. Layer 7 = parts → objects. Pooling makes each transition stable.
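The "millions of parameters" in point 4 can be made concrete with simple arithmetic. The sketch below compares a flatten + fully-connected layer against a GAP + classifier head on a 7×7×512 feature map; the hidden width of 4096 and the 1000 classes are assumed VGG-style sizes chosen for illustration.

```python
def linear_params(in_features, out_features):
    """Parameter count of a dense layer: weights plus biases."""
    return in_features * out_features + out_features

channels, height, width = 512, 7, 7
hidden, num_classes = 4096, 1000   # assumed sizes, for illustration only

flatten_fc = linear_params(channels * height * width, hidden)  # flatten -> first FC layer
gap_head = linear_params(channels, num_classes)                # GAP -> classifier

print(flatten_fc)   # 102764544
print(gap_head)     # 513000
```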