The Story That Makes Backpropagation Click
The coach doesn't just blame the last archer. She walks backwards down the line, asking: how much did each archer contribute to that final error? Archer 3 contributed 60% of the drift. Archer 2's bad stance caused 30%. Archer 1's grip caused the remaining 10%. Each archer gets a personalised correction — not a generic one.
That is backpropagation. The network fires forward, produces an error, then the coach (the algorithm) walks backwards through every layer, assigning a precise share of blame to every single weight — then corrects them all in one efficient backward pass.
Backpropagation (short for "backward propagation of errors") is the algorithm that makes neural network learning possible. It computes how much each weight in the network contributed to the final loss, using the chain rule of calculus applied efficiently through the computation graph — all weights updated in a single backward sweep.
Before backpropagation (formalised by Rumelhart, Hinton & Williams in 1986), there was no efficient way to train deep networks. Computing gradients by perturbing each weight individually requires two forward passes per weight — absurdly slow for millions of parameters. Backprop does it in one forward + one backward pass, regardless of network depth. This is the algorithmic core of every modern deep learning system.
The Four Core Concepts
The Computation Graph — Your Network's Blueprint
Every neural network operation can be drawn as a graph of primitive computations. During the forward pass, data flows left to right. During the backward pass, gradients flow right to left. The graph makes it possible to apply the chain rule mechanically, node by node.
Each node stores its output during the forward pass so the backward pass can reuse it. This is why backprop only needs one forward + one backward sweep.
The Chain Rule — The Mathematical Engine
The answer is the product of each gear ratio along the chain. If A→B has ratio 2×, and B→C has ratio 3×, then A→C is 6×. You multiply the local rates. The chain rule is precisely this — multiply local derivatives along a path in the computation graph.
The chain rule states: if L depends on z through a, then:
At each node, backprop reuses the stored forward-pass values (activations and pre-activations). It computes the local gradient and multiplies it by the incoming gradient — then passes the result to the next node upstream. Every weight gets its gradient in exactly one pass. No redundant computation. This is the genius of the algorithm.
Full Numerical Example — Step-by-Step Mathematics
We will build a tiny neural network with 2 inputs → 1 hidden neuron → 1 output neuron and walk through every single number in both the forward and backward pass. No hand-waving, no skipping — every calculation shown.
Input: x = [0.5, 0.8] | True label: y = 1.0 | Activation: Sigmoid σ(z) = 1/(1+e⁻ᶻ) | Loss: MSE = ½(ŷ − y)²
👉 Step 1 — Forward Pass
z₁ = W₁₁·x₁ + W₁₂·x₂ + b₁
z₁ = (0.4)(0.5) + (0.6)(0.8) + 0.1
z₁ = 0.20 + 0.48 + 0.10 = 0.68
a₁ = 1 / (1 + e⁻⁰·⁶⁸) = 1 / (1 + 0.5066) = 1 / 1.5066 = 0.6637
z₂ = W₂·a₁ + b₂
z₂ = (0.9)(0.6637) + 0.1 = 0.5973 + 0.1 = 0.6973
ŷ = 1 / (1 + e⁻⁰·⁶⁹⁷³) = 1 / (1 + 0.4977) = 0.6682
L = ½(0.6682 − 1.0)² = ½(−0.3318)² = ½(0.1101) = 0.0550
👉 Step 2 — Backward Pass
Now we work backwards from the loss, computing gradients layer by layer using the chain rule. We start at the output and propagate the error signal upstream.
Error signal δ flows backward. Each layer multiplies the incoming δ by its own local derivative (chain rule), then passes the result upstream.
L = ½(ŷ − y)² → dL/dŷ = ŷ − y = 0.6682 − 1.0 = −0.3318
σ'(z) = σ(z)·(1 − σ(z)) = ŷ·(1 − ŷ)
σ'(z₂) = 0.6682 × (1 − 0.6682) = 0.6682 × 0.3318 = 0.2217
δ₂ = (dL/dŷ) × σ'(z₂) = (−0.3318) × 0.2217 = −0.0736
dL/dW₂ = δ₂ × a₁ = (−0.0736) × 0.6637 = −0.0488
dL/db₂ = δ₂ = −0.0736
dL/da₁ = δ₂ × W₂ = (−0.0736) × 0.9 = −0.0662
σ'(z₁) = a₁ × (1 − a₁) = 0.6637 × (1 − 0.6637) = 0.6637 × 0.3363 = 0.2232
δ₁ = (dL/da₁) × σ'(z₁) = (−0.0662) × 0.2232 = −0.0148
dL/dW₁₁ = δ₁ × x₁ = (−0.0148) × 0.5 = −0.0074
dL/dW₁₂ = δ₁ × x₂ = (−0.0148) × 0.8 = −0.0118
dL/db₁ = δ₁ = −0.0148
👉 Step 3 — Weight Update (Gradient Descent)
All gradients were negative, so all weights increased slightly. Since the prediction (0.668) was below the true label (1.0), the network needed to output a higher value next time — and increasing the weights achieves exactly that. Gradient descent nudged every weight in the right direction, all at once.
Layer-by-Layer Diagram — What Each Layer Computes
Complete Gradient Summary
| Parameter | Old Value | Gradient (dL/dθ) | Update (η=0.5) | New Value | Direction |
|---|---|---|---|---|---|
| W₂ (output weight) | 0.9000 | −0.0488 | +0.0244 | 0.9244 | ↑ increase |
| b₂ (output bias) | 0.1000 | −0.0736 | +0.0368 | 0.1368 | ↑ increase |
| W₁₁ (hidden weight 1) | 0.4000 | −0.0074 | +0.0037 | 0.4037 | ↑ increase |
| W₁₂ (hidden weight 2) | 0.6000 | −0.0118 | +0.0059 | 0.6059 | ↑ increase |
| b₁ (hidden bias) | 0.1000 | −0.0148 | +0.0074 | 0.1074 | ↑ increase |
All gradients are negative because the loss decreases when all weights increase (our prediction was below the target). Gradient descent subtracts the gradient, so W_new = W_old − η × (negative) = W_old + positive → weights go up. The network learns to predict higher, closer to the true label of 1.0.
Python Implementation — From Scratch
Below is a clean, well-commented Python implementation of the exact network we computed by hand above. It verifies all our numbers programmatically.
import numpy as np
# ── Network weights (matching our manual example) ──────────
W1 = np.array([[0.4, 0.6]]) # shape (1, 2) — 1 hidden neuron, 2 inputs
b1 = np.array([[0.1]]) # shape (1, 1)
W2 = np.array([[0.9]]) # shape (1, 1) — 1 output, 1 hidden neuron
b2 = np.array([[0.1]]) # shape (1, 1)
x = np.array([[0.5], [0.8]]) # shape (2, 1) — column vector
y = np.array([[1.0]]) # true label
# ── Activation functions ───────────────────────────────────
def sigmoid(z):
return 1 / (1 + np.exp(-z))
def sigmoid_deriv(z):
s = sigmoid(z)
return s * (1 - s) # σ'(z) = σ(z)(1 - σ(z))
# ── FORWARD PASS ──────────────────────────────────────────
z1 = W1 @ x + b1 # pre-activation hidden: (1,1)
a1 = sigmoid(z1) # activation hidden
z2 = W2 @ a1 + b2 # pre-activation output: (1,1)
y_hat = sigmoid(z2) # prediction ŷ
loss = 0.5 * (y_hat - y)**2 # MSE loss
print(f"z1 = {z1[0,0]:.4f}")
print(f"a1 = {a1[0,0]:.4f}")
print(f"z2 = {z2[0,0]:.4f}")
print(f"y_hat = {y_hat[0,0]:.4f}")
print(f"Loss = {loss[0,0]:.4f}")
# ── BACKWARD PASS ─────────────────────────────────────────
# Output layer
dL_dyhat = y_hat - y # dL/dŷ
delta2 = dL_dyhat * sigmoid_deriv(z2) # δ₂ = dL/dz₂
dL_dW2 = delta2 @ a1.T # dL/dW₂
dL_db2 = delta2 # dL/db₂
# Hidden layer
dL_da1 = W2.T @ delta2 # propagate error upstream
delta1 = dL_da1 * sigmoid_deriv(z1) # δ₁ = dL/dz₁
dL_dW1 = delta1 @ x.T # dL/dW₁ shape (1,2)
dL_db1 = delta1 # dL/db₁
print(f"\nGradients:")
print(f"delta2 = {delta2[0,0]:.4f} (output error signal)")
print(f"dL/dW2 = {dL_dW2[0,0]:.4f}")
print(f"dL/db2 = {dL_db2[0,0]:.4f}")
print(f"delta1 = {delta1[0,0]:.4f} (hidden error signal)")
print(f"dL/dW11 = {dL_dW1[0,0]:.4f}")
print(f"dL/dW12 = {dL_dW1[0,1]:.4f}")
print(f"dL/db1 = {dL_db1[0,0]:.4f}")
# ── GRADIENT DESCENT UPDATE ───────────────────────────────
lr = 0.5 # learning rate η
W2 = W2 - lr * dL_dW2
b2 = b2 - lr * dL_db2
W1 = W1 - lr * dL_dW1
b1 = b1 - lr * dL_db1
print(f"\nUpdated Weights:")
print(f"W2 = {W2[0,0]:.4f}")
print(f"b2 = {b2[0,0]:.4f}")
print(f"W11 = {W1[0,0]:.4f} W12 = {W1[0,1]:.4f}")
print(f"b1 = {b1[0,0]:.4f}")
Every number from the hand-calculation appears in the program output. This confirms the correctness of the chain rule application and the backpropagation logic. In practice you would run this in a loop over thousands of training examples and iterations — that loop is training.
The Full Training Loop — 1000 Iterations
import numpy as np
import matplotlib.pyplot as plt
# ── Data: XOR problem ─────────────────────────────────────
X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float).T # (2,4)
Y = np.array([[0,1,1,0]], dtype=float) # (1,4)
# ── Initialise weights with small random values ───────────
np.random.seed(42)
W1 = np.random.randn(4, 2) * 0.5 # 4 hidden neurons
b1 = np.zeros((4, 1))
W2 = np.random.randn(1, 4) * 0.5
b2 = np.zeros((1, 1))
lr = 2.0
losses = []
for epoch in range(5000):
# ── Forward ────────────────────────────────────────────
z1 = W1 @ X + b1 # (4,4)
a1 = 1 / (1 + np.exp(-z1)) # sigmoid
z2 = W2 @ a1 + b2 # (1,4)
y_hat = 1 / (1 + np.exp(-z2))
loss = np.mean(0.5 * (y_hat - Y)**2)
losses.append(loss)
# ── Backward ───────────────────────────────────────────
m = X.shape[1] # number of samples = 4
delta2 = (y_hat - Y) * y_hat * (1 - y_hat) / m
dW2 = delta2 @ a1.T
db2 = np.sum(delta2, axis=1, keepdims=True)
dA1 = W2.T @ delta2
delta1 = dA1 * a1 * (1 - a1)
dW1 = delta1 @ X.T
db1 = np.sum(delta1, axis=1, keepdims=True)
# ── Update ─────────────────────────────────────────────
W2 -= lr * dW2; b2 -= lr * db2
W1 -= lr * dW1; b1 -= lr * db1
if epoch % 1000 == 0:
print(f"Epoch {epoch:5d} Loss: {loss:.5f}")
# ── Final predictions ─────────────────────────────────────
print(f"\nFinal predictions:")
print(np.round(y_hat, 2), " (target: [0,1,1,0])")
The network learned the XOR function — a classic non-linear problem — purely through repeated forward + backward passes. No magic: just the chain rule, applied thousands of times.
Common Pitfalls in Backpropagation
Each layer multiplies the gradient by σ'(z) ≤ 0.25. With 10 layers: 0.25¹⁰ ≈ 0.000001. The gradient at layer 1 is a millionth of the gradient at layer 10. Early layers learn nothing. Fix: use ReLU (derivative is 1 for positive z), batch normalisation, or residual connections.
The opposite problem: if weights are large, gradients blow up exponentially through layers. The loss oscillates wildly or returns NaN. Fix: gradient clipping (cap the norm of the gradient before the update), or careful weight initialisation (Xavier, He).
A neuron using ReLU outputs 0 for any negative input. Its gradient is also 0. If a neuron always gets a negative pre-activation, it will never update — it is "dead." Fix: use Leaky ReLU (small slope for negatives) or ELU, or lower the learning rate to prevent neurons from drifting into permanent negative territory.
Golden Rules of Backpropagation
dL/db = δ.
No input term. This is because bias has coefficient 1 in the linear transform —
∂(Wx + b)/∂b = 1. Beginners often overcomplicate bias gradients.
optimizer.zero_grad()
in PyTorch, or tape management in TensorFlow.
Accumulated gradients across batches will cause incorrect updates.
Activation Functions & Their Derivatives
| Activation | Formula σ(z) | Derivative σ'(z) | Range | Gradient Issue | Use Case |
|---|---|---|---|---|---|
| Sigmoid | 1 / (1 + e⁻ᶻ) | σ(z)(1 − σ(z)) | (0, 1) | Vanishes for |z|>5 | Binary output |
| Tanh | (eᶻ − e⁻ᶻ)/(eᶻ + e⁻ᶻ) | 1 − tanh²(z) | (−1, 1) | Less severe vanishing | Hidden layers |
| ReLU | max(0, z) | 1 if z>0 else 0 | [0, ∞) | Dead neurons (z<0) | Deep hidden layers |
| Leaky ReLU | z if z>0 else 0.01z | 1 if z>0 else 0.01 | (−∞, ∞) | No dead neurons | Default hidden |
| Softmax | eᶻᵢ / Σeᶻⱼ | Jacobian matrix | (0, 1) | None (output only) | Multi-class output |
PyTorch — Automatic Backpropagation
In practice, frameworks like PyTorch compute all gradients automatically. But understanding the manual derivation is essential — it tells you why the API works the way it does.
import torch
import torch.nn as nn
# ── Reproduce our manual example in PyTorch ───────────────
x = torch.tensor([[0.5], [0.8]], dtype=torch.float32)
y = torch.tensor([[1.0]])
# Weights — require_grad=True tracks the computation graph
W1 = torch.tensor([[0.4, 0.6]], requires_grad=True)
b1 = torch.tensor([[0.1]], requires_grad=True)
W2 = torch.tensor([[0.9]], requires_grad=True)
b2 = torch.tensor([[0.1]], requires_grad=True)
# ── Forward pass (PyTorch builds the computation graph) ───
z1 = W1 @ x + b1
a1 = torch.sigmoid(z1)
z2 = W2 @ a1 + b2
y_hat = torch.sigmoid(z2)
loss = 0.5 * (y_hat - y)**2
# ── Backward pass (ONE LINE — PyTorch does all the chain rule)
loss.backward()
# ── Gradients are now in .grad attributes ─────────────────
print(f"dL/dW2 = {W2.grad.item():.4f}")
print(f"dL/db2 = {b2.grad.item():.4f}")
print(f"dL/dW11 = {W1.grad[0,0].item():.4f}")
print(f"dL/dW12 = {W1.grad[0,1].item():.4f}")
print(f"dL/db1 = {b1.grad.item():.4f}")
# ── Gradient descent update ───────────────────────────────
with torch.no_grad():
W2 -= 0.5 * W2.grad
b2 -= 0.5 * b2.grad
W1 -= 0.5 * W1.grad
b1 -= 0.5 * b1.grad
print(f"\nUpdated: W2={W2.item():.4f} b2={b2.item():.4f}")
print(f" W1={W1.detach().numpy()} b1={b1.item():.4f}")
loss.backward() does in one line what we computed across 10 manual steps.
Behind the scenes, PyTorch's autograd engine traverses the computation graph
in reverse, applying the chain rule at every node — exactly the algorithm we implemented
by hand. Knowing the manual version means you understand what every framework does internally.