
Forward Propagation in Neural Networks

Forward propagation is the process by which input data flows through a neural network, layer by layer, using weighted sums, bias additions, and activation functions, until it reaches the output layer. In a classifier with a softmax output, that result is a probability distribution. It is the network's prediction engine: no learning takes place during the forward pass.

Section 01

The Story: A Whisper Telephone Through Many Rooms

The Translated Message
Imagine you stand at the entrance of a building with many rooms in series. You whisper a message — say, a photo of a cat — into Room 1. The workers in Room 1 don't understand the raw photo. Instead, they each take a weighted blend of every pixel, add their own bias, and pass their result to Room 2. Room 2 does the same, then Room 3, and so on.

By the last room, the original pixels have been transformed into something much more abstract: a sentence of probabilities — "90% cat, 7% fox, 3% dog."

That journey — input → weighted sums → activations → output — is forward propagation. Nothing learns yet. It is pure, deterministic arithmetic flowing in one direction.
🧠
One-Line Definition

Forward propagation is the process of passing an input through every layer of a neural network — computing weighted sums and applying activations — to produce a final prediction. No weights change during the forward pass.


Section 02

The Computation Graph — Animated Flow

Each layer is a station. Data flows strictly left → right. Every station performs two operations: an affine transformation and an activation. The graph below animates the full forward pass.

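[Animated computation graph: five stations, left to right]
INPUT (raw features): x, shape [2×1]
AFFINE 1 (weighted sum): z¹ = W¹x + b¹, with W¹: [3×2] and b¹: [3×1]
ACTIVATE (non-linearity): a¹ = σ(z¹), for example ReLU or tanh
AFFINE 2 (output weights): z² = W²a¹ + b², with W²: [2×3] and b²: [2×1]
SOFTMAX (probabilities): ŷ = softmax(z²), Σ ŷᵢ = 1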
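The shapes in the graph can be traced with a short NumPy sketch. The weight values here are random placeholders from a seeded generator (an assumption for illustration, not values from this article), and ReLU is picked as the hidden activation.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded so the sketch is repeatable

# Shapes come from the graph; the values themselves are random placeholders.
x = rng.normal(size=(2, 1))      # input x: [2x1]
W1 = rng.normal(size=(3, 2))     # W1: [3x2]
b1 = np.zeros((3, 1))            # b1: [3x1]
W2 = rng.normal(size=(2, 3))     # W2: [2x3]
b2 = np.zeros((2, 1))            # b2: [2x1]

z1 = W1 @ x + b1                 # affine 1 -> [3x1]
a1 = np.maximum(0, z1)           # ReLU keeps the shape -> [3x1]
z2 = W2 @ a1 + b2                # affine 2 -> [2x1]

e = np.exp(z2 - z2.max())        # softmax, max subtracted for stability
y_hat = e / e.sum()              # probabilities -> [2x1]

for name, arr in [("x", x), ("z1", z1), ("a1", a1), ("z2", z2), ("y_hat", y_hat)]:
    print(f"{name:5s} shape = {arr.shape}")
```

Each matrix product is only defined when the column count of W matches the row count of the incoming vector, which is why W¹ is [3×2] for a 2-element input and W² is [2×3] for a 3-element hidden layer.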

Section 03

The Four Core Operations

⚖️
Affine Transformation
z = Wx + b
Each neuron computes a weighted sum of every input, then adds a bias. W applies a linear map (scaling, rotating, or shearing the input); b shifts the result. Pure linear algebra.
Activation Function
a = f(z)
Applied element-wise after the affine step. Introduces non-linearity so the network can learn curved decision boundaries, not just straight lines.
🔁
Layer-by-Layer
aˡ = f(Wˡ aˡ⁻¹ + bˡ)
Output of one layer becomes the input to the next. Depth lets each layer learn increasingly abstract representations of the data.
🎯
Softmax Output
ŷᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
Converts raw output scores (logits) into a valid probability distribution: all values between 0 and 1, and they sum to exactly 1.
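The four operations can be combined into one loop. The following is a minimal sketch, assuming each layer is stored as a (W, b, f) tuple; the `forward` helper and the numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - np.max(z))    # subtract max for numerical stability
    return e / e.sum()

def forward(x, layers):
    """Apply a = f(W @ a_prev + b) for each (W, b, f) in `layers`."""
    a = x
    for W, b, f in layers:
        z = W @ a + b            # affine transformation
        a = f(z)                 # activation
    return a

# Illustrative values only (not taken from the article).
layers = [
    (np.array([[1.0, -1.0], [0.5, 0.5]]), np.array([0.0, 0.1]), relu),
    (np.array([[1.0, 0.0], [0.0, 1.0]]), np.zeros(2), softmax),
]
y_hat = forward(np.array([2.0, 1.0]), layers)
print("y_hat =", y_hat)
print("sum   =", y_hat.sum())
```

Storing the activation alongside each layer's weights keeps the loop body identical for hidden and output layers; only the final tuple uses softmax.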
Common Activation Functions
ReLU
f(z) = max(0, z)
Dead neuron risk for z<0
Sigmoid
f(z) = 1/(1+e⁻ᶻ)
Output ∈ (0,1) — vanishing gradient
Tanh
f(z) = (eᶻ − e⁻ᶻ)/(eᶻ + e⁻ᶻ)
Output ∈ (−1,1) — zero-centred
Softmax
ŷᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
Output layer only — Σ = 1.0
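The four activations above can be written directly in NumPy. This is a small sketch; the sample inputs are arbitrary, and `tanh_formula` is a hypothetical helper name that spells out the formula rather than calling `np.tanh`.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh_formula(z):
    # Same formula as above; np.tanh computes the same thing.
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))    # subtract max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 2.0])   # sample inputs, chosen arbitrarily

print("relu   :", relu(z))
print("sigmoid:", sigmoid(z))
print("tanh   :", tanh_formula(z))
print("softmax:", softmax(z), "sum =", softmax(z).sum())
```

ReLU, sigmoid, and tanh act on each element independently, while softmax normalises across the whole vector, so its outputs depend on every logit at once.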

Section 04

Numerical 1 — Single Neuron, One Layer

📐
Setup

A neuron receives inputs x = [2, 3]ᵀ, weights W = [0.5, −0.4], bias b = 1. Activation: ReLU.

🔢 Step-by-Step Computation
Step 1
Affine: z = (0.5×2) + (−0.4×3) + 1 = 1.0 − 1.2 + 1.0 = 0.8
Step 2
ReLU: a = max(0, 0.8) = 0.8
Output
Neuron fires with a = 0.8
[Diagram: inputs x₁ = 2 (w₁ = 0.5) and x₂ = 3 (w₂ = −0.4) plus bias b = 1 feed the sum Σ, giving z = 0.8; ReLU then outputs a = 0.8 ✓]
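The hand computation can be checked in a few lines of NumPy, using the same x, W, and b as the setup above. This is only a verification sketch; the values are formatted to one decimal because floating-point arithmetic leaves a tiny rounding error in z.

```python
import numpy as np

x = np.array([2.0, 3.0])     # inputs
W = np.array([0.5, -0.4])    # weights
b = 1.0                      # bias

z = W @ x + b                # affine: 0.5*2 + (-0.4)*3 + 1
a = max(0.0, z)              # ReLU

print(f"z = {z:.1f}, a = {a:.1f}")   # prints: z = 0.8, a = 0.8
```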

Section 05

Numerical 2 — Full 2-Layer Network + Softmax

🏗️
Network Architecture

Input: 2 neurons  |  Hidden: 2 neurons (ReLU)  |  Output: 2 neurons (Softmax) — binary classification.

📊 Given Values
Input
x = [1, 2]ᵀ
W¹ = [[0.1, 0.2], [0.3, 0.4]]   b¹ = [0, 0]ᵀ
W² = [[0.5, −0.3], [−0.1, 0.6]]   b² = [0, 0]ᵀ
L1
Layer 1 — Affine: z¹ = W¹x + b¹
z¹₁ = 0.1×1 + 0.2×2 = 0.1 + 0.4 = 0.5
z¹₂ = 0.3×1 + 0.4×2 = 0.3 + 0.8 = 1.1
∴ z¹ = [0.5, 1.1]ᵀ
A1
Layer 1 — Activation: a¹ = ReLU(z¹)
a¹₁ = ReLU(0.5) = 0.5
a¹₂ = ReLU(1.1) = 1.1
∴ a¹ = [0.5, 1.1]ᵀ  (both positive, unchanged)
L2
Layer 2 — Affine: z² = W²a¹ + b²
z²₁ = 0.5×0.5 + (−0.3)×1.1 = 0.25 − 0.33 = −0.08
z²₂ = (−0.1)×0.5 + 0.6×1.1 = −0.05 + 0.66 = 0.61
∴ z² = [−0.08, 0.61]ᵀ
SM
Output — Softmax: ŷ = softmax(z²)
e^(−0.08) ≈ 0.923    e^(0.61) ≈ 1.840    Sum = 2.763
ŷ₁ = 0.923 ÷ 2.763 ≈ 0.334 → 33.4%
ŷ₂ = 1.840 ÷ 2.763 ≈ 0.666 → 66.6%
✅ Sum = 1.000 — valid probability distribution
Softmax Output: Probability Distribution
Class 0: 33.4%
Class 1: 66.6% ← Predicted
Verdict

The network predicts Class 1 with a probability of 66.6%. The weights are arbitrary, untrained values, so no learning has happened yet. Training will later use backpropagation to compute gradients and adjust W¹, W², b¹, b² to improve this output.


Section 06

Python Implementation

import numpy as np

# ── Inputs and Weights ────────────────────────────────
x  = np.array([1, 2], dtype=float)

W1 = np.array([[0.1, 0.2],
               [0.3, 0.4]])
b1 = np.zeros(2)

W2 = np.array([[ 0.5, -0.3],
               [-0.1,  0.6]])
b2 = np.zeros(2)

# ── Activation helpers ────────────────────────────────
def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

# ── Forward Propagation ───────────────────────────────
z1    = W1 @ x + b1        # Layer 1 affine
a1    = relu(z1)            # Layer 1 activation

z2    = W2 @ a1 + b2       # Layer 2 affine
y_hat = softmax(z2)         # Softmax output

print(f"z1    = {z1}")
print(f"a1    = {a1}")
print(f"z2    = {z2}")
print(f"y_hat = {y_hat}")
print(f"Pred  = Class {np.argmax(y_hat)}")
OUTPUT (y_hat rounded to three decimals)
z1    = [0.5 1.1]
a1    = [0.5 1.1]
z2    = [-0.08  0.61]
y_hat = [0.334 0.666]
Pred  = Class 1

Section 07

Golden Rules

⚡ Forward Propagation — Non-Negotiable Rules
1
No weights change during the forward pass. It is purely arithmetic: multiply, add, activate, repeat. Learning happens afterwards, when backpropagation computes gradients and the optimizer updates the weights.
2
Every hidden layer must have a non-linear activation. Without it, stacking layers is pointless — a chain of purely linear transforms collapses into a single linear transform.
3
For single-label multi-class output, the standard pairing is Softmax + Cross-Entropy loss. Softmax ensures outputs are valid probabilities; Cross-Entropy measures how wrong they are.
4
When computing Softmax always subtract the max logit first: e^(z − max(z)). This prevents numerical overflow with zero effect on the final probabilities.
5
The forward pass runs the same affine-activation chain at test time. The exceptions are layers such as Dropout and BatchNorm, which behave differently between training and inference.
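Rules 2 and 4 can both be checked numerically. The sketch below uses random placeholder weights from a seeded generator and two hypothetical helpers, `softmax_naive` and `softmax_stable`, to contrast the two ways of computing softmax; none of these values come from the article apart from the small logits [−0.08, 0.61].

```python
import numpy as np

# Rule 2: two affine layers with no activation equal one affine layer.
rng = np.random.default_rng(1)           # arbitrary seed, placeholder weights
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
x = rng.normal(size=2)

stacked = W2 @ (W1 @ x + b1) + b2        # layer 2 applied to layer 1
W_eff, b_eff = W2 @ W1, W2 @ b1 + b2     # the single equivalent layer
collapsed = W_eff @ x + b_eff
print("linear stack collapses:", np.allclose(stacked, collapsed))

# Rule 4: subtracting the max logit avoids overflow in exp.
def softmax_naive(z):
    e = np.exp(z)
    return e / e.sum()

def softmax_stable(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

small = np.array([-0.08, 0.61])          # logits from Numerical 2
large = np.array([1000.0, 1001.0])       # large enough to overflow exp

print("same on small logits:",
      np.allclose(softmax_naive(small), softmax_stable(small)))

# Silence the expected overflow/invalid warnings for the naive version.
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)

print("naive on large logits :", naive_large)
print("stable on large logits:", softmax_stable(large))
```

On the large logits the naive version divides infinity by infinity and yields NaN, while the stable version only ever exponentiates values at or below zero.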