
CNN Fully Solved Numericals

This article solves the complete CNN building block — Convolution, ReLU, Max Pooling — twice in full. Numerical 1 uses a 5×5 input with a vertical edge detector kernel, computing all 9 dot products by hand, zeroing negatives with ReLU, then applying a 2×2 stride-1 pool. Numerical 2 uses a 4×4 input with a sharpening kernel, computing all 4 dot products, applying ReLU, then a 2×2 stride-2 (non-overlapping) pool that collapses the result to a single scalar. Every number is shown, every step is wor

Section 01

The Full Pipeline — Every CNN Block in Order

Raw Material → Inspection → Rejection → Packaging
Imagine a factory assembly line. Raw steel sheets come in (the input image). A stamping press shapes them with a mould (the convolution kernel) — every region gets pressed, producing a new shaped sheet (the feature map). A quality inspector then discards every warped or bent piece below zero (the ReLU). Finally, a packing machine groups every 2×2 batch of pieces and keeps only the best one (the max pool). What arrives at the warehouse is a compact, high-quality summary of the original sheet.

That is exactly what happens in one CNN block — in that order, every time.
Input (5×5 image) → Conv 2D (kernel 3×3) → Feature Map (3×3) → ReLU (max(0, x)) → Max Pool (2×2, S=1) → Output (2×2)
📋
What You Will Compute By Hand

Two complete numericals. Each one goes through Step 1 — Convolution (every dot product, position by position), Step 2 — ReLU (zero out every negative), and Step 3 — Max Pooling (slide a 2×2 window and keep the maximum). Nothing skipped, nothing assumed.


Section 02

Numerical 1 — Full Pipeline: Conv → ReLU → Max Pool

Given: A 5×5 input image and a 3×3 kernel. No padding. Stride 1 for both conv and pool (2×2 pool, stride 1).

🖼 Input Image (5×5)

1  2  3  0  1
4  5  6  1  2
7  8  9  0  3
2  1  0  4  5
6  3  2  1  0
⚙ Kernel (3×3)

1  0  -1
1  0  -1
1  0  -1
🔎
What This Kernel Does

This kernel has +1 in the left column, 0 in the middle, −1 in the right column. It subtracts the right side from the left side of every 3×3 patch — a classic vertical edge detector. Bright on the left, dark on the right → large positive output.
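To make the sign behaviour concrete, here is a minimal sketch that applies the same kernel to two patches. Both patches (`bright_left` and `flat`) are hypothetical values chosen for illustration; they are not taken from the numerical.

```python
import numpy as np

# Vertical edge kernel from the numerical: +1 left column, -1 right column.
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

# Hypothetical patches, for illustration only.
bright_left = np.array([[9, 5, 1],
                        [9, 5, 1],
                        [9, 5, 1]])
flat = np.full((3, 3), 4)

# Element-wise product then sum = (sum of left column) - (sum of right column).
response_edge = int(np.sum(bright_left * kernel))
response_flat = int(np.sum(flat * kernel))

print(response_edge)  # 24 (27 on the left minus 3 on the right)
print(response_flat)  # 0  (no left/right difference, so no edge)
```

A uniform patch gives zero because the left and right columns cancel; only a horizontal change in brightness produces a non-zero response.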

📐 Output Size after Convolution
Formula
O = ⌊(N − F + 2P) / S⌋ + 1 = ⌊(5 − 3 + 0) / 1⌋ + 1 = 3 → Feature map is 3×3
After Pool
O = ⌊(3 − 2) / 1⌋ + 1 = 2 → Final output is 2×2
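The same formula can be wrapped in a small helper so the dimensions are checked before any arithmetic starts. The function name `output_size` is an assumption made for this sketch, not something defined elsewhere in the article.

```python
def output_size(n: int, f: int, p: int = 0, s: int = 1) -> int:
    """Output width for input width n, window width f, padding p, stride s."""
    return (n - f + 2 * p) // s + 1

conv_out = output_size(5, 3, p=0, s=1)    # 3x3 kernel over the 5x5 input
pool_out = output_size(conv_out, 2, s=1)  # 2x2 pool, stride 1, no padding

print(conv_out, pool_out)  # 3 2
```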

① Step 1 — Convolution (9 dot products)

The 3×3 kernel slides across the 5×5 input with stride 1. Every position produces one value. Here are all 9:

Position [0,0] — top-left patch
Patch  = [1,2,3 / 4,5,6 / 7,8,9]
Kernel = [1,0,−1 / 1,0,−1 / 1,0,−1]
(1×1)+(2×0)+(3×−1) + (4×1)+(5×0)+(6×−1) + (7×1)+(8×0)+(9×−1)
= (1+0−3) + (4+0−6) + (7+0−9)
= −2 + (−2) + (−2) = −6   →  FM[0,0] = −6
Position [0,1] — shift right by 1
Patch  = [2,3,0 / 5,6,1 / 8,9,0]
Kernel = [1,0,−1 / 1,0,−1 / 1,0,−1]
(2×1)+(3×0)+(0×−1) + (5×1)+(6×0)+(1×−1) + (8×1)+(9×0)+(0×−1)
= (2+0+0) + (5+0−1) + (8+0+0)
= 2 + 4 + 8 = 14   →  FM[0,1] = 14
Position [0,2] — shift right again
Patch  = [3,0,1 / 6,1,2 / 9,0,3]
Kernel = [1,0,−1 / 1,0,−1 / 1,0,−1]
(3×1)+(0×0)+(1×−1) + (6×1)+(1×0)+(2×−1) + (9×1)+(0×0)+(3×−1)
= (3+0−1) + (6+0−2) + (9+0−3)
= 2 + 4 + 6 = 12   →  FM[0,2] = 12
Position [1,0] — move down to row 1
Patch  = [4,5,6 / 7,8,9 / 2,1,0]
Kernel = [1,0,−1 / 1,0,−1 / 1,0,−1]
(4×1)+(5×0)+(6×−1) + (7×1)+(8×0)+(9×−1) + (2×1)+(1×0)+(0×−1)
= (4+0−6) + (7+0−9) + (2+0+0)
= −2 + (−2) + 2 = −2   →  FM[1,0] = −2
Position [1,1] — centre of feature map
Patch  = [5,6,1 / 8,9,0 / 1,0,4]
Kernel = [1,0,−1 / 1,0,−1 / 1,0,−1]
(5×1)+(6×0)+(1×−1) + (8×1)+(9×0)+(0×−1) + (1×1)+(0×0)+(4×−1)
= (5+0−1) + (8+0+0) + (1+0−4)
= 4 + 8 + (−3) = 9   →  FM[1,1] = 9
Position [1,2]
Patch  = [6,1,2 / 9,0,3 / 0,4,5]
Kernel = [1,0,−1 / 1,0,−1 / 1,0,−1]
(6×1)+(1×0)+(2×−1) + (9×1)+(0×0)+(3×−1) + (0×1)+(4×0)+(5×−1)
= (6+0−2) + (9+0−3) + (0+0−5)
= 4 + 6 + (−5) = 5   →  FM[1,2] = 5
Position [2,0] — bottom row, left
Patch  = [7,8,9 / 2,1,0 / 6,3,2]
Kernel = [1,0,−1 / 1,0,−1 / 1,0,−1]
(7×1)+(8×0)+(9×−1) + (2×1)+(1×0)+(0×−1) + (6×1)+(3×0)+(2×−1)
= (7+0−9) + (2+0+0) + (6+0−2)
= −2 + 2 + 4 = 4   →  FM[2,0] = 4
Position [2,1]
Patch  = [8,9,0 / 1,0,4 / 3,2,1]
Kernel = [1,0,−1 / 1,0,−1 / 1,0,−1]
(8×1)+(9×0)+(0×−1) + (1×1)+(0×0)+(4×−1) + (3×1)+(2×0)+(1×−1)
= (8+0+0) + (1+0−4) + (3+0−1)
= 8 + (−3) + 2 = 7   →  FM[2,1] = 7
Position [2,2] — bottom-right patch
Patch  = [9,0,3 / 0,4,5 / 2,1,0]
Kernel = [1,0,−1 / 1,0,−1 / 1,0,−1]
(9×1)+(0×0)+(3×−1) + (0×1)+(4×0)+(5×−1) + (2×1)+(1×0)+(0×−1)
= (9+0−3) + (0+0−5) + (2+0+0)
= 6 + (−5) + 2 = 3   →  FM[2,2] = 3
📈 Feature Map after Convolution (3×3)

−6  14  12
−2   9   5
 4   7   3
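All nine dot products can also be checked in one shot. The sketch below is one possible vectorised check, assuming NumPy 1.20 or newer (where `sliding_window_view` is available); it is an alternative to the hand calculation, not a replacement for it.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

image = np.array([[1, 2, 3, 0, 1],
                  [4, 5, 6, 1, 2],
                  [7, 8, 9, 0, 3],
                  [2, 1, 0, 4, 5],
                  [6, 3, 2, 1, 0]])
kernel = np.array([[1, 0, -1]] * 3)

# patches has shape (3, 3, 3, 3): patches[i, j] is the 3x3 patch at position [i, j].
patches = sliding_window_view(image, (3, 3))

# Multiply every patch by the kernel, then sum over each patch's own two axes.
feature_map = (patches * kernel).sum(axis=(2, 3))

print(patches[1, 1])  # the centre patch used for FM[1,1]
print(feature_map)
```

Indexing `patches[i, j]` gives exactly the patch listed under "Position [i,j]" above, which makes it easy to spot a mis-copied row.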

② Step 2 — ReLU Activation: max(0, x)

Apply ReLU element-wise. Every negative value becomes 0. Every positive value stays unchanged.

Position | Conv Output | ReLU Rule | Result
--- | --- | --- | ---
[0,0] | −6 | max(0, −6) | 0
[0,1] | 14 | max(0, 14) | 14
[0,2] | 12 | max(0, 12) | 12
[1,0] | −2 | max(0, −2) | 0
[1,1] | 9 | max(0, 9) | 9
[1,2] | 5 | max(0, 5) | 5
[2,0] | 4 | max(0, 4) | 4
[2,1] | 7 | max(0, 7) | 7
[2,2] | 3 | max(0, 3) | 3
⚡ After ReLU (3×3)

0  14  12
0   9   5
4   7   3
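As a quick check of this step, the element-wise rule max(0, x) maps directly onto `np.maximum`. The variable names here are chosen for this sketch only.

```python
import numpy as np

feature_map = np.array([[-6, 14, 12],
                        [-2,  9,  5],
                        [ 4,  7,  3]])

# ReLU: compare every element with 0 and keep the larger of the two.
activated = np.maximum(feature_map, 0)

# Count how many entries were negative and therefore zeroed.
num_zeroed = int((feature_map < 0).sum())

print(activated)
print(num_zeroed)  # 2
```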

③ Step 3 — Max Pooling: 2×2 window, Stride 1

Output size: O = ⌊(3 − 2)/1⌋ + 1 = 2 → 2×2 output. Slide the 2×2 window over the ReLU map:

Window [0:2, 0:2] → Out[0,0]
0  14
0   9
max(0, 14, 0, 9) = 14
Window [0:2, 1:3] → Out[0,1]
14  12
 9   5
max(14, 12, 9, 5) = 14
Window [1:3, 0:2] → Out[1,0]
0  9
4  7
max(0, 9, 4, 7) = 9
Window [1:3, 1:3] → Out[1,1]
9  5
7  3
max(9, 5, 7, 3) = 9
🏆 Final Output after Max Pool (2×2)

14  14
 9   9
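The four overlapping windows can be reproduced with a short sketch. It assumes NumPy 1.20 or newer for `sliding_window_view`, which steps one cell at a time and therefore matches a stride-1 pool.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

relu_map = np.array([[0, 14, 12],
                     [0,  9,  5],
                     [4,  7,  3]])

# windows[i, j] is the 2x2 window whose top-left corner is at [i, j].
windows = sliding_window_view(relu_map, (2, 2))

# Max pooling keeps the largest value inside each window.
pooled = windows.max(axis=(2, 3))

print(windows[0, 1])  # the window for Out[0,1]
print(pooled)
```

Because the windows overlap, a single strong value can be picked more than once: 14 sits in both top windows, and 9 sits in both bottom windows.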
🎯
Numerical 1 — Full Summary

Input 5×5 → Conv (3×3 kernel, S=1, P=0) → Feature Map 3×3 [−6,14,12 / −2,9,5 / 4,7,3] → ReLU [0,14,12 / 0,9,5 / 4,7,3] → MaxPool (2×2, S=1) → Final [[14,14],[9,9]]. The two negatives (−6 and −2) were zeroed by ReLU. Because the stride-1 windows overlap, the strongest response (14, the edge response at [0,1]) falls inside both top windows and fills both top cells; likewise 9 at [1,1] fills both bottom cells.


Section 03

Numerical 2 — Different Kernel, Stride 2 Pool

Given: A 4×4 input image and a 3×3 sharpening kernel. No padding. Stride 1 conv, then 2×2 MaxPool with stride 2 (non-overlapping).

Input (4×4) → Conv 3×3 (S=1, P=0) → Feature Map (2×2) → ReLU (max(0, x)) → MaxPool 2×2 (Stride 2) → Output (1×1)
🖼 Input Image (4×4)

2  4  1  3
5  8  2  6
1  3  7  4
0  2  5  9
⚙ Kernel (3×3) — Sharpening

 0  -1   0
-1   5  -1
 0  -1   0
📐 Output Sizes
After Conv
O = ⌊(4 − 3 + 0) / 1⌋ + 1 = 2 → Feature map is 2×2
After Pool
O = ⌊(2 − 2) / 2⌋ + 1 = 1 → Final output is 1×1 (a single number!)

① Step 1 — Convolution (4 dot products on the 4×4 input)

Position [0,0]
Patch  = [2,4,1 / 5,8,2 / 1,3,7]
Kernel = [0,−1,0 / −1,5,−1 / 0,−1,0]
(2×0)+(4×−1)+(1×0) + (5×−1)+(8×5)+(2×−1) + (1×0)+(3×−1)+(7×0)
= (0−4+0) + (−5+40−2) + (0−3+0)
= −4 + 33 + (−3) = 26   →  FM[0,0] = 26
Position [0,1]
Patch  = [4,1,3 / 8,2,6 / 3,7,4]
Kernel = [0,−1,0 / −1,5,−1 / 0,−1,0]
(4×0)+(1×−1)+(3×0) + (8×−1)+(2×5)+(6×−1) + (3×0)+(7×−1)+(4×0)
= (0−1+0) + (−8+10−6) + (0−7+0)
= −1 + (−4) + (−7) = −12   →  FM[0,1] = −12
Position [1,0]
Patch  = [5,8,2 / 1,3,7 / 0,2,5]
Kernel = [0,−1,0 / −1,5,−1 / 0,−1,0]
(5×0)+(8×−1)+(2×0) + (1×−1)+(3×5)+(7×−1) + (0×0)+(2×−1)+(5×0)
= (0−8+0) + (−1+15−7) + (0−2+0)
= −8 + 7 + (−2) = −3   →  FM[1,0] = −3
Position [1,1]
Patch  = [8,2,6 / 3,7,4 / 2,5,9]
Kernel = [0,−1,0 / −1,5,−1 / 0,−1,0]
(8×0)+(2×−1)+(6×0) + (3×−1)+(7×5)+(4×−1) + (2×0)+(5×−1)+(9×0)
= (0−2+0) + (−3+35−4) + (0−5+0)
= −2 + 28 + (−5) = 21   →  FM[1,1] = 21
📈 Feature Map after Convolution (2×2)

26  −12
−3   21
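Since the sharpening kernel is zero at the four corners, each output reduces to five times the centre pixel minus its four direct neighbours. The sketch below uses that shortcut to cross-check the four values; `sharpen_at` is a hypothetical helper written for this illustration.

```python
import numpy as np

image = np.array([[2, 4, 1, 3],
                  [5, 8, 2, 6],
                  [1, 3, 7, 4],
                  [0, 2, 5, 9]])

def sharpen_at(img, r, c):
    """5 * centre minus the up, down, left and right neighbours of interior pixel (r, c)."""
    neighbours = img[r - 1, c] + img[r + 1, c] + img[r, c - 1] + img[r, c + 1]
    return int(5 * img[r, c] - neighbours)

# With no padding, the four kernel positions are centred on the four interior pixels.
feature_map = [[sharpen_at(image, r, c) for c in (1, 2)] for r in (1, 2)]

print(feature_map)  # [[26, -12], [-3, 21]]
```

The positive outputs come from centres 8 and 7, which are large relative to their neighbours; the negative outputs come from centres 2 and 3, which are smaller than their surroundings.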

② Step 2 — ReLU Activation

Position | Conv Output | ReLU Rule | Result
--- | --- | --- | ---
[0,0] | 26 | max(0, 26) | 26
[0,1] | −12 | max(0, −12) | 0
[1,0] | −3 | max(0, −3) | 0
[1,1] | 21 | max(0, 21) | 21
⚡ After ReLU (2×2)

26   0
 0  21

③ Step 3 — Max Pooling: 2×2 window, Stride 2

Output size: O = ⌊(2 − 2)/2⌋ + 1 = 1 → single scalar output. Only one window — it covers the entire 2×2 ReLU map:

Window [0:2, 0:2] — the entire ReLU map → Out[0,0]
26   0
 0  21
max(26, 0, 0, 21) = 26
🏆 Final Output after Max Pool (1×1)

26
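For the non-overlapping case (stride equal to pool size), one common NumPy idiom is to reshape the map into 2×2 blocks and take the maximum of each block. The sketch below assumes both dimensions are even; `max_pool_2x2_nonoverlap` and the 4×4 `demo` array are illustrative names and values, not part of the numerical.

```python
import numpy as np

def max_pool_2x2_nonoverlap(x):
    """2x2 max pool with stride 2; assumes both dimensions of x are even."""
    h, w = x.shape
    # After the reshape, axes 1 and 3 index the position inside each 2x2 block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

relu_map = np.array([[26, 0],
                     [0, 21]])
pooled = max_pool_2x2_nonoverlap(relu_map)
print(pooled)  # [[26]]

# Hypothetical 4x4 map, only to show that four separate blocks are produced.
demo = np.arange(16).reshape(4, 4)
demo_pooled = max_pool_2x2_nonoverlap(demo)
print(demo_pooled)  # [[ 5  7]
                    #  [13 15]]
```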
🎯
Numerical 2 — Full Summary

Input 4×4 → Conv (sharpening 3×3, S=1, P=0) → Feature Map 2×2 [26,−12 / −3,21] → ReLU [26,0 / 0,21] → MaxPool (2×2, S=2) → single value 26. The sharpening kernel gave large positive responses at the two patches whose centre pixel (8 and 7) dominates its four direct neighbours, and negative responses where the centre (2 and 3) is weaker than its neighbours. ReLU removed the two negative responses. Max pool selected the strongest: 26.


Section 04

Side-by-Side Pipeline Summary

Stage | Numerical 1 (5×5 input) | Numerical 2 (4×4 input)
--- | --- | ---
Input | 5×5 = 25 values | 4×4 = 16 values
Kernel | 3×3 vertical edge detector [1,0,−1 / 1,0,−1 / 1,0,−1] | 3×3 sharpening [0,−1,0 / −1,5,−1 / 0,−1,0]
After Conv | 3×3 feature map [−6,14,12 / −2,9,5 / 4,7,3] | 2×2 feature map [26,−12 / −3,21]
Negatives | 2 values (−6, −2) | 2 values (−12, −3)
After ReLU | [0,14,12 / 0,9,5 / 4,7,3] | [26,0 / 0,21]
Pool Config | 2×2, Stride 1 → overlapping | 2×2, Stride 2 → non-overlapping
Final Output | [[14,14],[9,9]] (2×2) | [[26]] (single scalar)

Section 05

Python — Verify Both Pipelines

import numpy as np

# ── Convolution (cross-correlation, no flip) ──────────────────
def conv2d(x, k):
    """No padding, stride 1."""
    KH, KW = k.shape
    OH, OW = x.shape[0]-KH+1, x.shape[1]-KW+1
    out = np.zeros((OH, OW))
    for i in range(OH):
        for j in range(OW):
            out[i, j] = np.sum(x[i:i+KH, j:j+KW] * k)
    return out

# ── ReLU ──────────────────────────────────────────────────────
def relu(x):
    return np.maximum(0, x)

# ── Max Pool ──────────────────────────────────────────────────
def max_pool(x, pool=2, stride=2):
    OH = (x.shape[0] - pool) // stride + 1
    OW = (x.shape[1] - pool) // stride + 1
    out = np.zeros((OH, OW))
    for i in range(OH):
        for j in range(OW):
            out[i, j] = x[i*stride:i*stride+pool, j*stride:j*stride+pool].max()
    return out

# ═══════════════════════════════════════════════
# NUMERICAL 1 — 5×5 input, vertical edge kernel
# ═══════════════════════════════════════════════
inp1 = np.array([
    [1,2,3,0,1],
    [4,5,6,1,2],
    [7,8,9,0,3],
    [2,1,0,4,5],
    [6,3,2,1,0]
])
k1 = np.array([[1,0,-1],[1,0,-1],[1,0,-1]])

fm1 = conv2d(inp1, k1)
r1  = relu(fm1)
p1  = max_pool(r1, pool=2, stride=1)

print("N1 Feature Map:\n", fm1)
print("N1 After ReLU:\n",  r1)
print("N1 Max Pool output:\n", p1)

# ═══════════════════════════════════════════════
# NUMERICAL 2 — 4×4 input, sharpening kernel
# ═══════════════════════════════════════════════
inp2 = np.array([
    [2,4,1,3],
    [5,8,2,6],
    [1,3,7,4],
    [0,2,5,9]
])
k2 = np.array([[0,-1,0],[-1,5,-1],[0,-1,0]])

fm2 = conv2d(inp2, k2)
r2  = relu(fm2)
p2  = max_pool(r2, pool=2, stride=2)

print("N2 Feature Map:\n", fm2)
print("N2 After ReLU:\n",  r2)
print("N2 Max Pool output:\n", p2)
OUTPUT
N1 Feature Map:
[[ -6. 14. 12.]
 [ -2. 9. 5.]
 [ 4. 7. 3.]]
N1 After ReLU:
[[ 0. 14. 12.]
 [ 0. 9. 5.]
 [ 4. 7. 3.]]
N1 Max Pool output:
[[14. 14.]
 [ 9. 9.]]
N2 Feature Map:
[[ 26. -12.]
 [ -3. 21.]]
N2 After ReLU:
[[26. 0.]
 [ 0. 21.]]
N2 Max Pool output:
[[26.]]

Section 06

Golden Rules — The Three-Step Sequence

⚡ Conv → ReLU → MaxPool — What Every Student Must Internalise
1. Always compute output size before you start. O = ⌊(N − F + 2P)/S⌋ + 1. Know your dimensions at every stage; an error here means all subsequent numbers are wrong.
2. Each convolution output is a dot product, not a matrix multiplication. Multiply element-wise, then sum all 9 (or 4, or 25) products to get one number. Do not multiply entire rows or columns.
3. ReLU is trivial but critical. Every negative → 0, every positive stays. It is the non-linearity that lets the network learn non-linear decision boundaries. Without it, stacking convolutions is just one big linear transform.
4. Max pooling uses the ReLU output, not the raw feature map. The conventional order is Conv → ReLU → Pool, and both numericals follow it. For max pooling specifically, applying ReLU after the pool gives the same numbers, because ReLU is monotonic: max(0, max(window)) equals the maximum over the window of max(0, x). This does not hold for average pooling, where negatives would lower the average before being clipped.
5. Pool stride controls spatial compression. With a 2×2 window, stride 1 shrinks each dimension by only 1 (3×3 → 2×2 in Numerical 1), while stride 2 roughly halves it (2×2 → 1×1 in Numerical 2). When stride equals pool size, the windows do not overlap.
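The point about ReLU being the non-linearity can be checked numerically. The sketch below repeats a loop-based `conv2d` so that it runs on its own; the second input `y` is simply the transpose of the Numerical 1 image, and the vectors `a` and `b` are arbitrary values picked for illustration.

```python
import numpy as np

def conv2d(x, k):
    """Cross-correlation with no padding and stride 1."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def relu(v):
    return np.maximum(v, 0)

x = np.array([[1, 2, 3, 0, 1],
              [4, 5, 6, 1, 2],
              [7, 8, 9, 0, 3],
              [2, 1, 0, 4, 5],
              [6, 3, 2, 1, 0]], dtype=float)
y = x.T  # arbitrary second input, for illustration only
k = np.array([[1, 0, -1]] * 3, dtype=float)

# Convolution is linear: conv(x + y) equals conv(x) + conv(y).
conv_is_additive = np.allclose(conv2d(x + y, k), conv2d(x, k) + conv2d(y, k))

# ReLU is not: relu(a + b) = [0, 0] but relu(a) + relu(b) = [1, 2].
a = np.array([-3.0, 2.0])
b = np.array([1.0, -5.0])
relu_is_additive = np.allclose(relu(a + b), relu(a) + relu(b))

print(conv_is_additive)  # True
print(relu_is_additive)  # False
```

Additivity holding for the convolution but failing for ReLU is what prevents a stack of Conv → ReLU blocks from collapsing into a single linear filter.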