The Story — How a Neural Network Learns from Mistakes
Backpropagation works like this: the neural network makes a prediction, compares it to the true answer, and then the algorithm walks backwards through every weight, measuring how much each one contributed to the error and nudging it in proportion to that contribution, scaled by the learning rate.
In this tutorial we solve a complete 2-input → 2-hidden → 2-output network step by step, exactly as written in the handwritten notes — every number, every formula, every chain rule term.
This network has 2 inputs (x₁, x₂), 2 hidden neurons (H₁, H₂), and 2 output neurons (Y₁, Y₂). All 9 weights are given. We compute the forward pass, the total error, and then backpropagate to update weights w₅ and w₁ in full detail.
Inputs: x₁ = 0.05 | x₂ = 0.10
Weights to hidden: w₁=0.15, w₂=0.20, w₃=0.25, w₄=0.30
Weights to output: w₅=0.40, w₆=0.45, w₇=0.50, w₈=0.55 (w₉=0.85)
Targets: T₁ = 0.01 | T₂ = 0.99
Learning rate η = 0.5
Forward Pass — Every Calculation on Paper
① Hidden Layer Inputs (net values)
net_H₁ = w₁·x₁ + w₂·x₂ + b₁
= (0.15)(0.05) + (0.20)(0.10) + b₁
= 0.0075 + 0.020 + b₁ = 0.3825 (including bias)
net_H₂ = w₃·x₁ + w₄·x₂ + b₂
= (0.25)(0.05) + (0.30)(0.10) + b₂
= 0.0125 + 0.030 + b₂ = 0.3900 (including bias)
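The notes state each net value with the bias already folded in, but never give b₁ and b₂ themselves. The sketch below recovers them by subtraction, assuming the stated totals 0.3825 and 0.3900 are exact. The names `weighted_h1`, `b1_implied` and so on are illustrative and do not appear in the notes.

```python
# Inputs and input-to-hidden weights, as given in the notes.
x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30

# Net totals as stated in the notes (bias already included).
net_H1, net_H2 = 0.3825, 0.3900

# Weighted sums without the bias term.
weighted_h1 = w1 * x1 + w2 * x2   # 0.0075 + 0.020
weighted_h2 = w3 * x1 + w4 * x2   # 0.0125 + 0.030

# Assumption: the bias is whatever closes the gap to the stated total.
b1_implied = net_H1 - weighted_h1
b2_implied = net_H2 - weighted_h2

print(f"weighted sums : {weighted_h1:.4f}, {weighted_h2:.4f}")
print(f"implied biases: b1 = {b1_implied:.4f}, b2 = {b2_implied:.4f}")
```

Under that assumption the two hidden neurons carry slightly different biases, which is worth knowing before re-running the example with other inputs.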
② Hidden Layer Outputs — Sigmoid Activation
out_H₁ = σ(net_H₁) = 1 / (1 + e^(−0.3825)) = 0.5944
out_H₂ = σ(net_H₂) = 1 / (1 + e^(−0.3900)) = 0.5963
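The two activations can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `sigmoid` is a choice made here, not something taken from the notes.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


# Net inputs of the hidden neurons, as stated in the notes.
out_H1 = sigmoid(0.3825)
out_H2 = sigmoid(0.3900)

print(f"out_H1 = {out_H1:.4f}, out_H2 = {out_H2:.4f}")
```

Any difference in the fourth decimal place compared with the hand calculation comes from rounding versus truncating the last digit.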
③ Output Layer — net and activation
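net_Y₁ = w₅·out_H₁ + w₇·out_H₂
= (0.40)(0.5944) + (0.50)(0.5963)
= 0.2378 + 0.2982 = 0.5359
net_Y₂ = w₆·out_H₁ + w₈·out_H₂
= (0.45)(0.5944) + (0.55)(0.5963)
= 0.2675 + 0.3280 = 0.5955
ŷ₁ = out_Y₁ = 0.757 and ŷ₂ = out_Y₂ = 0.768. These activations are consistent with an output bias of about 0.60 added to each net before the sigmoid: σ(0.5359 + 0.60) ≈ 0.757, whereas σ(0.5359) alone is about 0.631. The notes do not show this bias term.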
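The weighted sums above do not by themselves produce the output activations 0.757 and 0.768 used in the rest of the tutorial. The sketch below shows the gap and one way to close it. The output bias `b_out = 0.60` is an assumption inferred from those activations; it is not stated in the notes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


out_H1, out_H2 = 0.5944, 0.5963            # hidden activations from step 2
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55    # hidden-to-output weights

net_Y1 = w5 * out_H1 + w7 * out_H2         # weighted sum only
net_Y2 = w6 * out_H1 + w8 * out_H2

# ASSUMPTION: output bias, inferred because it reproduces 0.757 and 0.768.
b_out = 0.60

for name, net in (("Y1", net_Y1), ("Y2", net_Y2)):
    print(f"{name}: net = {net:.4f}, "
          f"sigmoid(net) = {sigmoid(net):.3f}, "
          f"sigmoid(net + b_out) = {sigmoid(net + b_out):.3f}")
```

Only the biased version matches the activations that the error and gradient calculations rely on, so any code meant to reproduce the notes needs a bias of this size on the output layer.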
④ Total Error — MSE Loss
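E_Y₁ = ½ (T₁ − ŷ₁)² = ½ × (0.01 − 0.757)²
= ½ × (−0.747)² = ½ × 0.5580 = 0.279
E_Y₂ = ½ (T₂ − ŷ₂)² = ½ × (0.99 − 0.768)²
= ½ × (0.222)² = ½ × 0.0493 = 0.025
E_total = E_Y₁ + E_Y₂ = 0.279 + 0.025 = 0.304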
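The error arithmetic is easy to reproduce. A short sketch, using the rounded outputs 0.757 and 0.768 from the notes (the variable names are chosen here for illustration):

```python
T1, T2 = 0.01, 0.99          # targets
y1, y2 = 0.757, 0.768        # rounded network outputs used in the notes

# Half squared error per output neuron.
E_Y1 = 0.5 * (T1 - y1) ** 2
E_Y2 = 0.5 * (T2 - y2) ** 2
E_total = E_Y1 + E_Y2

print(f"E_Y1 = {E_Y1:.3f}, E_Y2 = {E_Y2:.3f}, E_total = {E_total:.3f}")
```

The factor ½ is a convenience: it cancels the 2 that appears when the square is differentiated, which is why the first backpropagation term is simply ŷ − T.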
| Neuron | Net Input (z) | Activation σ(z) | Target | Error |
|---|---|---|---|---|
| H₁ | 0.3825 | 0.5944 | — | hidden |
| H₂ | 0.3900 | 0.5963 | — | hidden |
| Y₁ | 0.5359 | 0.757 | 0.01 | 0.279 |
| Y₂ | 0.5955 | 0.768 | 0.99 | 0.025 |
| E_total | — | — | — | 0.304 |
Backward Pass — Updating w₅ (Output Layer Weight)
We want to find ∂E_total/∂w₅ — how much the total error changes when we tweak w₅. By the chain rule this decomposes into three terms, each answering a specific question about sensitivity.
w₅_new = w₅ − η · (∂E_total/∂w₅)
∂E_total/∂w₅ = ∂E_Y₁/∂out_Y₁ × ∂out_Y₁/∂net_Y₁ × ∂net_Y₁/∂w₅
= (ŷ₁ − T₁) × ŷ₁(1−ŷ₁) × out_H₁
∂E_Y₁/∂out_Y₁
= ŷ₁ − T₁ = 0.757 − 0.01 = 0.747
∂out_Y₁/∂net_Y₁
= σ'(net_Y₁) = ŷ₁ × (1 − ŷ₁)
= 0.757 × (1 − 0.757) = 0.757 × 0.243 = 0.1840
∂net_Y₁/∂w₅
net_Y₁ = w₅·out_H₁ + w₇·out_H₂ → ∂/∂w₅ = out_H₁
= 0.5944
∂E/∂w₅
= 0.747 × 0.1840 × 0.5944
= 0.1374 × 0.5944 = 0.0817 (the notes truncate this to 0.081)
w₅_new = w₅ − η × 0.081 = 0.40 − 0.5 × 0.081 = 0.40 − 0.0405
= 0.3595
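A finite-difference check is a common way to confirm a hand-derived gradient. The sketch below compares the three-term product for w₅ with a numerical slope of E_Y₁. It assumes an output bias `b_out = 0.60`, which is not stated in the notes but is needed to reproduce ŷ₁ ≈ 0.757; the helper `error_Y1` is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


out_H1, out_H2 = 0.5944, 0.5963   # hidden activations from the forward pass
w7 = 0.50
T1 = 0.01
lr = 0.5
b_out = 0.60                      # ASSUMPTION: output bias, not given in the notes


def error_Y1(w5):
    """E_Y1 as a function of w5 alone (E_Y2 does not depend on w5)."""
    y1 = sigmoid(w5 * out_H1 + w7 * out_H2 + b_out)
    return 0.5 * (T1 - y1) ** 2


w5 = 0.40
y1 = sigmoid(w5 * out_H1 + w7 * out_H2 + b_out)

# Chain rule: (y1 - T1) * y1 * (1 - y1) * out_H1
analytic = (y1 - T1) * y1 * (1 - y1) * out_H1

# Central difference approximation of the same derivative.
eps = 1e-6
numeric = (error_Y1(w5 + eps) - error_Y1(w5 - eps)) / (2 * eps)

w5_new = w5 - lr * analytic
print(f"analytic = {analytic:.6f}, numeric = {numeric:.6f}, w5_new = {w5_new:.4f}")
```

If the two gradients agree to many decimal places, the chain rule terms were assembled correctly. Small differences from the hand result come from rounding intermediate values.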
Backward Pass — Updating w₁ (Hidden Layer Weight)
Updating a hidden-layer weight is harder: the error must propagate backwards through the output neurons first. The handwritten notes use four chain rule terms (A, B, C, D), and the steps here follow them.
w₁ connects x₁ to H₁. Changing w₁ affects H₁'s output, which affects both Y₁ and Y₂, which affects the total error. Along the Y₁ path the chain of influence is: w₁ → net_H₁ → out_H₁ → net_Y₁ → out_Y₁ → E, and each arrow contributes one factor. The notes' four-term product keeps only this Y₁ path and leaves out the factor for net_H₁ → out_H₁, which is σ'(net_H₁) = out_H₁(1 − out_H₁) ≈ 0.241. The complete gradient includes that factor and adds the matching Y₂ path through w₆.
∂E_total/∂y₁
= ŷ₁ − T₁ = 0.75137 − 0.01 = 0.74137
(the notes switch to ŷ₁ = 0.75137 at this step rather than the 0.757 from the forward pass; the calculation keeps their value)
∂y₁/∂net_Y₁
= ŷ₁ × (1 − ŷ₁) = 0.75137 × (1 − 0.75137)
= 0.75137 × 0.24863 = 0.18681
∂net_Y₁/∂out_H₁
net_Y₁ = w₅·out_H₁ + ... → ∂/∂out_H₁ = w₅
= 0.40
∂net_H₁/∂w₁
net_H₁ = w₁·x₁ + w₂·x₂ + b → ∂/∂w₁ = x₁
= 0.05
∂E/∂w₁
= 0.74137 × 0.18681 × 0.40 × 0.05
= 0.13850 × 0.40 × 0.05 = 0.055400 × 0.05
= 0.00277
w₁_new = w₁ − η × 0.00277 = 0.15 − 0.5 × 0.00277 = 0.15 − 0.001385
= 0.148615
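The four-term product used in the notes follows only the path through Y₁ and contains no σ'(net_H₁) factor. The sketch below compares it with the full chain rule (both output paths, plus the hidden sigmoid derivative) and checks both against a finite-difference slope. The biases `b1`, `b2` and `b_out` are assumptions inferred from the stated net values and output activations; they are not given explicitly in the notes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


x1, x2 = 0.05, 0.10
T1, T2 = 0.01, 0.99
w2, w3, w4 = 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
# ASSUMPTIONS: biases chosen to reproduce net_H1 = 0.3825, net_H2 = 0.39
# and output activations of roughly 0.757 and 0.768.
b1, b2, b_out = 0.355, 0.3475, 0.60


def forward(w1):
    h1 = sigmoid(w1 * x1 + w2 * x2 + b1)
    h2 = sigmoid(w3 * x1 + w4 * x2 + b2)
    y1 = sigmoid(w5 * h1 + w7 * h2 + b_out)
    y2 = sigmoid(w6 * h1 + w8 * h2 + b_out)
    return h1, y1, y2


def total_error(w1):
    _, y1, y2 = forward(w1)
    return 0.5 * (T1 - y1) ** 2 + 0.5 * (T2 - y2) ** 2


w1 = 0.15
h1, y1, y2 = forward(w1)

# Error signal at each output neuron: (y - T) * sigmoid'(net).
delta_Y1 = (y1 - T1) * y1 * (1 - y1)
delta_Y2 = (y2 - T2) * y2 * (1 - y2)

# Product of the four terms A, B, C, D: Y1 path only.
four_term = delta_Y1 * w5 * x1

# Full chain rule: both output paths, then sigmoid'(net_H1), then x1.
full = (delta_Y1 * w5 + delta_Y2 * w6) * h1 * (1 - h1) * x1

# Central difference on the total error.
eps = 1e-6
numeric = (total_error(w1 + eps) - total_error(w1 - eps)) / (2 * eps)

print(f"four-term product = {four_term:.6f}")
print(f"full chain rule   = {full:.6f}")
print(f"finite difference = {numeric:.6f}")
```

With these assumed biases the four-term product comes out at roughly 0.0027, while the full gradient and the finite-difference slope agree at roughly 0.00045. The Y₂ path pulls in the opposite direction because ŷ₂ is below its target, and the hidden sigmoid derivative scales everything by about 0.24.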
All Weight Updates Summary
The two weights solved in the handwritten notes, w₅ and w₁, show the pattern to apply to all remaining weights. Every weight follows the same chain rule; only the path through the network differs.
Y₁'s prediction (0.757) is far above its target (0.01), so ŷ₁ − T₁ is positive. Because the sigmoid derivative and the hidden activations are also positive, the gradients for the weights feeding Y₁ (w₅ and w₇) are positive. Gradient descent subtracts a positive number, so these weights decrease and the network outputs a smaller value for Y₁ next time.
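The same three-term pattern can be applied to every output-layer weight. The sketch below does so with the rounded values from the notes; the dictionaries `wiring` and `updated` are illustrative names, and the wiring (w₅ and w₇ feed Y₁, w₆ and w₈ feed Y₂) is read off the forward-pass formulas.

```python
out_H1, out_H2 = 0.5944, 0.5963
y1, y2 = 0.757, 0.768
T1, T2 = 0.01, 0.99
lr = 0.5

weights = {"w5": 0.40, "w6": 0.45, "w7": 0.50, "w8": 0.55}

# Error signal per output neuron: (y - T) * y * (1 - y).
delta = {
    "Y1": (y1 - T1) * y1 * (1 - y1),
    "Y2": (y2 - T2) * y2 * (1 - y2),
}

# Each weight: (hidden activation it multiplies, output neuron it feeds).
wiring = {
    "w5": (out_H1, "Y1"),
    "w6": (out_H1, "Y2"),
    "w7": (out_H2, "Y1"),
    "w8": (out_H2, "Y2"),
}

updated = {}
for name, (hidden_out, target_neuron) in wiring.items():
    grad = delta[target_neuron] * hidden_out
    updated[name] = weights[name] - lr * grad
    print(f"{name}: grad = {grad:+.4f}, {weights[name]:.2f} -> {updated[name]:.4f}")
```

The weights into Y₁ shrink because ŷ₁ overshoots its target, while the weights into Y₂ grow because ŷ₂ falls short of 0.99.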
The Four Chain Rule Questions — Answered
The handwritten notes frame backpropagation as four distinct questions. Each becomes one term in the chain rule product. Here they are explained individually.
∂E/∂out_Y₁ = ŷ₁ − T₁ = 0.757 − 0.01 = 0.747
This term measures how wrong Y₁ is. If ŷ₁ equals T₁ perfectly, this term is 0 and the weight update is 0 — no learning needed.
σ'(z) = σ(z) × (1 − σ(z)) = ŷ₁ × (1 − ŷ₁) = 0.757 × 0.243 = 0.184
This tells you how steep the sigmoid curve is at the current value of Y₁. Near 0 or 1 the sigmoid is flat (derivative ≈ 0) — the vanishing gradient problem. Near 0.5 it's steepest (max derivative = 0.25).
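To see how quickly the sigmoid slope flattens, it helps to evaluate it at a few points. A minimal sketch follows; the sample inputs are arbitrary, apart from 1.136, which is roughly the net input that yields ŷ₁ ≈ 0.757.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Slope of the sigmoid, written in terms of its own output."""
    s = sigmoid(z)
    return s * (1.0 - s)


for z in (-6.0, -2.0, 0.0, 1.136, 2.0, 6.0):
    print(f"z = {z:+.3f}  sigmoid = {sigmoid(z):.4f}  slope = {sigmoid_prime(z):.4f}")
```

The slope peaks at z = 0, where the output is 0.5, and is symmetric around that point. For inputs far from zero it becomes very small, so a saturated neuron passes back very little gradient.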
net_Y₁ = w₅·out_H₁ + w₇·out_H₂. Differentiating with respect to w₅:
∂net_Y₁/∂w₅ = out_H₁ = 0.5944
The derivative of a linear function w.r.t. a weight is just the activation that multiplies it. This is why gradients are larger for strongly-activated neurons.
net_H₁ = w₁·x₁ + w₂·x₂ + b₁. Differentiating with respect to w₁:
∂net_H₁/∂w₁ = x₁ = 0.05
The input value itself. The gradient for the weight is scaled by the input, so a very small input such as 0.05 shrinks the update to w₁, while a larger input gives a proportionally larger weight gradient and a bigger step per update.
Python Verification — Confirms Every Number
```python
import numpy as np

# ── Network from handwritten notes ────────────────────────
x1, x2 = 0.05, 0.10
T1, T2 = 0.01, 0.99
lr = 0.5

# Weights: inputs to hidden layer
w1, w2 = 0.15, 0.20   # x→H1
w3, w4 = 0.25, 0.30   # x→H2

# Weights: hidden layer to output layer
w5, w6 = 0.40, 0.45   # H1→Y1, H1→Y2
w7, w8 = 0.50, 0.55   # H2→Y1, H2→Y2

# Hidden biases are folded into the net values (as per notes):
# the notes give net_H1=0.3825, net_H2=0.390 directly.
# ASSUMPTION: the notes do not state an output bias. 0.60 is inferred
# because it reproduces the tabulated outputs 0.757 and 0.768; without
# it, sig(0.5359) is only about 0.631.
b_out = 0.60


def sig(z):
    """Logistic sigmoid."""
    return 1 / (1 + np.exp(-z))


def sigD(z):
    """Sigmoid derivative, written in terms of the sigmoid output."""
    s = sig(z)
    return s * (1 - s)


# ── FORWARD PASS ──────────────────────────────────────────
net_H1, net_H2 = 0.3825, 0.390   # from notes (include biases)
out_H1 = sig(net_H1)
out_H2 = sig(net_H2)
net_Y1 = w5*out_H1 + w7*out_H2 + b_out
net_Y2 = w6*out_H1 + w8*out_H2 + b_out
out_Y1 = sig(net_Y1)
out_Y2 = sig(net_Y2)
E_Y1 = 0.5*(T1 - out_Y1)**2
E_Y2 = 0.5*(T2 - out_Y2)**2
E_tot = E_Y1 + E_Y2

print("=== FORWARD PASS ===")
print(f"out_H1 = {out_H1:.4f}  out_H2 = {out_H2:.4f}")
print(f"out_Y1 = {out_Y1:.4f}  out_Y2 = {out_Y2:.4f}")
print(f"E_Y1   = {E_Y1:.4f}  E_Y2   = {E_Y2:.4f}")
print(f"E_total= {E_tot:.4f}")

# ── BACKWARD: w5 ──────────────────────────────────────────
# ∂E/∂w5 = (ŷ1−T1) × ŷ1(1−ŷ1) × out_H1
A_w5 = out_Y1 - T1        # term A
B_w5 = sigD(net_Y1)       # term B = ŷ1(1-ŷ1)
C_w5 = out_H1             # term C
grad_w5 = A_w5 * B_w5 * C_w5
w5_new = w5 - lr * grad_w5

print("\n=== BACKWARD: w5 ===")
print(f"A (ŷ1−T1)   = {A_w5:.4f}")
print(f"B σ'(netY1) = {B_w5:.4f}")
print(f"C (out_H1)  = {C_w5:.4f}")
print(f"∂E/∂w5      = {grad_w5:.4f}")
print(f"w5_new      = {w5_new:.4f}  (notes: 0.3595)")

# ── BACKWARD: w1 ──────────────────────────────────────────
# Four-term product as in the notes: (ŷ1−T1) × ŷ1(1−ŷ1) × w5 × x1.
# It follows the Y1 path only and has no σ'(net_H1) factor, so it is
# not the complete gradient of E_total with respect to w1.
A_w1 = out_Y1 - T1        # term A
B_w1 = sigD(net_Y1)       # term B
C_w1 = w5                 # term C = ∂netY1/∂outH1
D_w1 = x1                 # term D = ∂netH1/∂w1 = x1
grad_w1 = A_w1 * B_w1 * C_w1 * D_w1
w1_new = w1 - lr * grad_w1

print("\n=== BACKWARD: w1 ===")
print(f"A × B       = {A_w1*B_w1:.5f}")
print(f"× C (w5)    = {A_w1*B_w1*C_w1:.5f}")
print(f"× D (x1)    = {grad_w1:.5f}")
print(f"∂E/∂w1      = {grad_w1:.4f}")
# The notes report 0.13615, but 0.055388 × 0.05 is 0.00277 (not 0.0277),
# so the four-term product gives w1_new ≈ 0.1486.
print(f"w1_new      = {w1_new:.5f}  (four-term product: ≈ 0.1486)")
```
The handwritten notes use rounded intermediate values (e.g., ŷ₁=0.757 instead of 0.7569, out_H1=0.594 instead of 0.5944). Each rounding carries forward into the next calculation. This is completely normal in hand computation — the method is identical, the tiny differences come purely from rounding in intermediate steps. The final answers match to 2–3 significant figures.