Forward Propagation in Neural Networks

Section 01

The Story: A Whisper Telephone Through Many Rooms

📖 Real-World Analogy

The Translated Message

Imagine you stand at the entrance of a building with many rooms in series. You whisper a message — say, a photo of a cat — into Room 1. The workers in Room 1 don't understand the raw photo. Instead, they each take a weighted blend of every pixel, add their own bias, and pass their result to Room 2. Room 2 does the same, then Room 3, and so on.

By the last room, the original pixels have been transformed into something much more abstract: a sentence of probabilities — "90% cat, 7% fox, 3% dog."

That journey — input → weighted sums → activations → output — is forward propagation. Nothing learns yet. It is pure, deterministic arithmetic flowing in one direction.

🧠

One-Line Definition

Forward propagation is the process of passing an input through every layer of a neural network — computing weighted sums and applying activations — to produce a final prediction. No weights change during the forward pass.

Section 02

The Computation Graph — Animated Flow

Each layer is a station. Data flows strictly left → right. Every station performs two operations: an affine transformation and an activation. The graph below animates the full forward pass.

Section 03

The Four Core Operations

⚖️

Affine Transformation

z = Wx + b

Each neuron computes a weighted sum of every input, then adds a bias. W applies a linear map (it can scale, rotate, shear, or change the dimension of the input); b shifts the result. Pure linear algebra.
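As a minimal NumPy sketch of the affine step, assume a layer with three inputs and two neurons. The values of `W`, `x`, and `b` below are made up for illustration and do not come from the worked examples.

```python
import numpy as np

# Hypothetical layer: 2 neurons, 3 inputs (values chosen only for illustration)
W = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.5,  0.5]])   # shape (2, 3): one row of weights per neuron
x = np.array([2.0, 4.0, 6.0])      # shape (3,): the input vector
b = np.array([1.0, -1.0])          # shape (2,): one bias per neuron

z = W @ x + b                      # weighted sums plus bias, shape (2,)
print(z)
```

Each row of `W` belongs to one neuron, so the output has one entry per neuron regardless of how many inputs there are.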

⚡

Activation Function

a = f(z)

Applied element-wise after the affine step. Introduces non-linearity so the network can learn curved decision boundaries, not just straight lines.
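To see why the non-linearity matters, the sketch below stacks two affine layers with no activation between them and checks that the result equals a single affine layer. All weights and biases here are illustrative values, and the names `W_eq` and `b_eq` are introduced only for this check.

```python
import numpy as np

# Illustrative parameters (assumed values, not trained)
W1 = np.array([[0.1, 0.2], [0.3, 0.4]])
b1 = np.array([0.5, -0.5])
W2 = np.array([[0.5, -0.3], [-0.1, 0.6]])
b2 = np.array([0.1, 0.2])
x = np.array([1.0, 2.0])

# Two affine layers applied back to back, with no activation in between
two_layers = W2 @ (W1 @ x + b1) + b2

# The same map written as one affine layer
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
one_layer = W_eq @ x + b_eq

print(np.allclose(two_layers, one_layer))
```

Placing an activation such as ReLU between the two layers breaks this equivalence, which is what lets depth add expressive power.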

🔁

Layer-by-Layer

aˡ = f(Wˡ aˡ⁻¹ + bˡ)

Output of one layer becomes the input to the next. Depth lets each layer learn increasingly abstract representations of the data.
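The recurrence above can be sketched as a simple loop. This is one possible arrangement, assuming each layer is stored as a `(W, b, f)` tuple; the `forward` helper and the example weights are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward(x, layers):
    """Apply a = f(W @ a_prev + b) for each (W, b, f) tuple, in order."""
    a = x                          # the input plays the role of a^0
    for W, b, f in layers:
        a = f(W @ a + b)           # this layer's output feeds the next layer
    return a

# Hypothetical two-layer stack; the last layer uses an identity activation
layers = [
    (np.array([[1.0, -1.0], [2.0, 0.0]]), np.array([0.0, 1.0]), relu),
    (np.array([[1.0, 1.0]]),              np.array([0.5]),      lambda z: z),
]

out = forward(np.array([1.0, 3.0]), layers)
print(out)
```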

🎯

Softmax Output

ŷᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)

Converts raw output scores (logits) into a valid probability distribution: all values between 0 and 1, and they sum to exactly 1.
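A short sketch of softmax in NumPy, under the assumption that the logits arrive as a 1-D array. It also compares a naive version with one that subtracts the largest logit first, a common stabilising step; the logit values and the name `softmax_naive` are illustrative.

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)                  # can overflow for large logits
    return e / e.sum()

def softmax(z):
    e = np.exp(z - np.max(z))      # shifting by max(z) leaves the ratios unchanged
    return e / e.sum()

logits = np.array([-0.08, 0.61])   # illustrative logits
probs = softmax(logits)
print(probs, probs.sum())

big = np.array([1000.0, 1001.0])   # large logits to provoke overflow
with np.errstate(over="ignore", invalid="ignore"):
    naive_big = softmax_naive(big) # exp(1000) overflows to inf, so inf/inf gives nan
stable_big = softmax(big)
print(naive_big, stable_big)
```

For moderate logits both versions agree; for large ones only the shifted version returns usable probabilities.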

Common Activation Functions

ReLU

f(z) = max(0, z)

Zero output and zero gradient for z < 0 (dead neuron risk)

Sigmoid

f(z) = 1/(1+e⁻ᶻ)

Output ∈ (0,1); gradients vanish for large |z|

Tanh

f(z) = (eᶻ − e⁻ᶻ)/(eᶻ + e⁻ᶻ)

Output ∈ (−1,1) — zero-centred

Softmax

f(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)

Typically the output layer; outputs sum to 1
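The element-wise activations listed above can be compared on a few sample values. This is a rough sketch; the input array `z` is arbitrary.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])     # arbitrary sample pre-activations

relu_out = relu(z)                 # negatives are clipped to 0
sigmoid_out = sigmoid(z)           # squashed into (0, 1)
tanh_out = np.tanh(z)              # squashed into (-1, 1), symmetric around 0

print("relu   ", relu_out)
print("sigmoid", sigmoid_out)
print("tanh   ", tanh_out)
```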

Section 04

Numerical 1 — Single Neuron, One Layer

📐

Setup

A neuron receives inputs x = [2, 3]ᵀ, weights W = [0.5, −0.4], bias b = 1. Activation: ReLU.

🔢 Step-by-Step Computation

Step 1

Affine: z = (0.5×2) + (−0.4×3) + 1 = 1.0 − 1.2 + 1.0 = 0.8

Step 2

ReLU: a = max(0, 0.8) = 0.8

Output

Neuron fires with a = 0.8
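The same single-neuron computation can be checked in a few lines of NumPy, using the values from the setup above.

```python
import numpy as np

x = np.array([2.0, 3.0])           # inputs from the setup
W = np.array([0.5, -0.4])          # weights from the setup
b = 1.0                            # bias from the setup

z = W @ x + b                      # affine step: dot product plus bias
a = np.maximum(0, z)               # ReLU

print(f"z = {z:.2f}, a = {a:.2f}")
```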

Section 05

Numerical 2 — Full 2-Layer Network + Softmax

🏗️

Network Architecture

Input: 2 neurons | Hidden: 2 neurons (ReLU) | Output: 2 neurons (Softmax) — binary classification.

📊 Given Values

Input: x = [1, 2]ᵀ

Layer 1: W¹ = [[0.1, 0.2], [0.3, 0.4]], b¹ = [0, 0]ᵀ

Layer 2: W² = [[0.5, −0.3], [−0.1, 0.6]], b² = [0, 0]ᵀ

Layer 1 — Affine: z¹ = W¹x + b¹

z¹₁ = 0.1×1 + 0.2×2 = 0.1 + 0.4 = 0.5
z¹₂ = 0.3×1 + 0.4×2 = 0.3 + 0.8 = 1.1
∴ z¹ = [0.5, 1.1]ᵀ

Layer 1 — Activation: a¹ = ReLU(z¹)

a¹₁ = ReLU(0.5) = 0.5
a¹₂ = ReLU(1.1) = 1.1
∴ a¹ = [0.5, 1.1]ᵀ (both positive, unchanged)

Layer 2 — Affine: z² = W²a¹ + b²

z²₁ = 0.5×0.5 + (−0.3)×1.1 = 0.25 − 0.33 = −0.08
z²₂ = (−0.1)×0.5 + 0.6×1.1 = −0.05 + 0.66 = 0.61
∴ z² = [−0.08, 0.61]ᵀ

Output — Softmax: ŷ = softmax(z²)

e^(−0.08) ≈ 0.923, e^(0.61) ≈ 1.840, Sum ≈ 2.763
ŷ₁ = 0.923 ÷ 2.763 ≈ 0.334 → 33.4%
ŷ₂ = 1.840 ÷ 2.763 ≈ 0.666 → 66.6%
✅ Sum = 1.000 — valid probability distribution

Softmax Output — Probability Distribution

Class 0: 33.4%

Class 1: 66.6% ← Predicted

✅

Verdict

The network predicts Class 1 with a probability of 66.6%. These weights are arbitrary, untrained values chosen for illustration; no learning has happened yet. Training will later use backpropagation to compute gradients and adjust W¹, W², b¹, b² to improve this output.

Section 06

Python Implementation

import numpy as np

# ── Inputs and Weights ────────────────────────────────
x  = np.array([1, 2], dtype=float)

W1 = np.array([[0.1, 0.2],
               [0.3, 0.4]])
b1 = np.zeros(2)

W2 = np.array([[ 0.5, -0.3],
               [-0.1,  0.6]])
b2 = np.zeros(2)

# ── Activation helpers ────────────────────────────────
def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

# ── Forward Propagation ───────────────────────────────
z1    = W1 @ x + b1        # Layer 1 affine
a1    = relu(z1)            # Layer 1 activation

z2    = W2 @ a1 + b2       # Layer 2 affine
y_hat = softmax(z2)         # Softmax output

print(f"z1    = {z1}")
print(f"a1    = {a1}")
print(f"z2    = {z2}")
print(f"y_hat = {y_hat}")
print(f"Pred  = Class {np.argmax(y_hat)}")

OUTPUT

z1    = [0.5 1.1]
a1    = [0.5 1.1]
z2    = [-0.08  0.61]
y_hat = [0.334 0.666]   (rounded to three decimals; NumPy prints more digits)
Pred  = Class 1
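As a possible extension, the same network can process several inputs at once by stacking them as rows of a matrix. This is a sketch, not part of the original implementation: the `forward_batch` helper and the second sample are assumptions added for illustration.

```python
import numpy as np

W1 = np.array([[0.1, 0.2], [0.3, 0.4]])
b1 = np.zeros(2)
W2 = np.array([[0.5, -0.3], [-0.1, 0.6]])
b2 = np.zeros(2)

def forward_batch(X):
    """Forward pass for a batch X of shape (n_samples, 2)."""
    Z1 = X @ W1.T + b1                                # row i holds z1 for sample i
    A1 = np.maximum(0, Z1)                            # ReLU, element-wise
    Z2 = A1 @ W2.T + b2
    E = np.exp(Z2 - Z2.max(axis=1, keepdims=True))    # row-wise stable softmax
    return E / E.sum(axis=1, keepdims=True)

X = np.array([[1.0, 2.0],      # the sample from the worked example
              [0.0, 0.0]])     # hypothetical second sample
Y_hat = forward_batch(X)
print(Y_hat)
print(Y_hat.argmax(axis=1))
```

The first row reproduces the single-sample result; each row is normalised separately, so every row sums to 1.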

Section 07

Golden Rules

⚡ Forward Propagation — Non-Negotiable Rules

No weights change during the forward pass. It is purely arithmetic: multiply, add, activate, repeat. Learning happens afterwards, when backpropagation computes gradients and the optimizer updates the weights.

Every hidden layer must have a non-linear activation. Without it, stacking layers is pointless — a chain of purely linear transforms collapses into a single linear transform.

For single-label multi-class output, the standard choice is Softmax + Cross-Entropy loss. Softmax ensures outputs are valid probabilities; Cross-Entropy measures how far they are from the true label.

When computing Softmax always subtract the max logit first: e^(z − max(z)). This prevents numerical overflow with zero effect on the final probabilities.

The forward pass is nearly identical at test time. The same affine-activation chain runs; only layers such as Dropout and BatchNorm behave differently between training and inference.
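To make the Softmax + Cross-Entropy rule above concrete, here is a small sketch that scores the softmax output from Numerical 2 against a true label. The true label is not given in the example, so both possibilities are shown as assumptions; `cross_entropy` is a hypothetical helper for a single sample.

```python
import numpy as np

def cross_entropy(y_hat, true_class):
    """Negative log of the probability assigned to the true class."""
    return -np.log(y_hat[true_class])

y_hat = np.array([0.334, 0.666])            # softmax output from Numerical 2

loss_if_class1 = cross_entropy(y_hat, 1)    # assumed case: prediction is right
loss_if_class0 = cross_entropy(y_hat, 0)    # assumed case: prediction is wrong

print(f"{loss_if_class1:.3f} {loss_if_class0:.3f}")
```

The loss is smaller when the true class receives the higher probability, which is the signal training uses to improve the weights.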