Why Neurons Need to Be Non-Linear
The fix? Introduce a small non-linear "kink" at every neuron — an activation function. Suddenly your stack of glass becomes a prism, then a kaleidoscope. Each layer now transforms space in ways the next layer can exploit. That is what makes deep networks deep.
An activation function takes a neuron's raw weighted sum z = Wx + b
and applies a non-linear squeeze, shift, or gate before passing it forward.
Without this step, every layer collapses into one — and depth becomes meaningless.
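To make the collapse concrete, here is a minimal NumPy sketch. The weights, biases, and input are made-up values chosen for illustration. It shows that two stacked linear layers equal one merged linear layer, and that putting a ReLU between them breaks the equivalence.

```python
import numpy as np

# Illustrative (made-up) weights for two tiny layers
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
b1 = np.array([0.5, -0.5])
W2 = np.array([[2.0, 1.0]])
b2 = np.array([0.0])
x = np.array([1.0, 0.0])

# Two layers with NO activation in between
two_layers = W2 @ (W1 @ x + b1) + b2

# The same computation folded into a single linear layer
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
one_layer = W_merged @ x + b_merged

# Same two layers, but with a ReLU kink after the first one
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2

print("two linear layers :", two_layers)
print("one merged layer  :", one_layer)
print("with ReLU between :", with_relu)
```

The first two results agree for any input, because `W2 @ (W1 @ x + b1) + b2` simplifies algebraically to `(W2 @ W1) @ x + (W2 @ b1 + b2)`. The third does not, which is the extra expressive power the activation buys.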
A good activation function must be differentiable almost everywhere so gradients can flow backwards during training. The shape of that derivative (whether it saturates, clips, or stays linear) largely determines how well your network learns.
Sigmoid & tanh — The Saturation Trap
Yellow dashed = derivative (gradient). Both functions saturate at extremes — gradient ≈ 0 ⟹ vanishing gradient problem.
When gradients pass through many saturated sigmoid/tanh neurons, they shrink exponentially with depth: the sigmoid derivative never exceeds 0.25, so each layer multiplies the signal by a small factor. By the time they reach the first layers of a deep network, they are close to zero, so those weights barely update and the early layers learn very little. This is a major reason deep networks were so hard to train before the early 2010s.
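A rough back-of-the-envelope sketch of that shrinkage follows. It looks only at the activation's own derivative and ignores the weight matrices, which also scale the gradient in a real network, so treat the numbers as an illustration rather than a prediction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)

best_case = sigmoid_grad(0.0)    # the largest the derivative can ever be
saturated = sigmoid_grad(5.0)    # a neuron pushed far into the flat region

for depth in (5, 10, 20):
    print(f"depth {depth:2d} | best case {best_case ** depth:.2e}"
          f" | saturated {saturated ** depth:.2e}")
```

Even in the best case, where every neuron sits exactly at z = 0, ten layers scale the gradient by 0.25 ** 10, which is below one in a million.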
ReLU — The Revolution & Its Fatal Flaw
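This simplicity is ReLU's superpower: the derivative for positive inputs is exactly 1, so the activation itself never shrinks the gradient, however many layers it passes through. In the AlexNet paper (Krizhevsky et al., 2012), a four-layer CNN with ReLUs reached 25% training error on CIFAR-10 six times faster than the same network with tanh units. Deep networks finally worked. The 2010s deep learning renaissance was largely built on this one dumb function.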
But the bouncer has a problem: once a neuron's input goes negative and stays negative, it gets frozen at zero permanently. The gradient is zero, no update happens, it's dead forever. This is the dying neuron problem.
Red dashed = zero-gradient (dead) zone. Positive z flows freely with gradient = 1. A high learning rate can kill entire layers.
If a large gradient update pushes a neuron's weights so that z is
always negative for every training sample, that neuron outputs zero and receives
zero gradient forever. It's dead — contributing nothing.
With aggressive learning rates or poor initialisation, a large share of a ReLU
network's units (figures as high as 40% are often quoted) can die before training is complete.
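The mechanism can be sketched for a single neuron in NumPy. The weights, the strongly negative bias, and the upstream gradient of 1 are all assumptions chosen to force the dead state; they stand in for what a bad update might leave behind.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 3))   # 100 samples, inputs in [-1, 1]

# Hypothetical parameters after a bad update: the bias is so negative
# that z = X @ w + b is below zero for every possible input in range.
w = np.array([0.5, -0.3, 0.2])
b = -5.0

z = X @ w + b
a = np.maximum(0, z)                        # ReLU output

# Backward pass for this neuron, assuming an upstream gradient of 1 per sample
upstream = np.ones_like(z)
relu_grad = (z > 0).astype(float)           # 1 where z > 0, else 0
grad_w = (upstream * relu_grad) @ X
grad_b = (upstream * relu_grad).sum()

print("largest z       :", z.max())
print("non-zero outputs:", int((a > 0).sum()))
print("grad_w, grad_b  :", grad_w, grad_b)
```

Because every `z` is negative, the ReLU derivative is zero for every sample, so `grad_w` and `grad_b` are exactly zero and gradient descent has nothing to move the neuron back with.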
Leaky ReLU, ELU & GELU — Fixing What ReLU Broke
Three activations emerged to solve the dying neuron problem — each with a
different philosophy about what happens when z < 0.
GELU dips slightly below zero, reaching a minimum of about −0.17 near z ≈ −0.75 before rising back towards zero, creating a soft gating effect. ELU saturates at −α. Leaky ReLU stays linear with a small slope.
| Activation | Dying Neurons | Zero-Centred | Smooth | Compute Cost | Best Use Case |
|---|---|---|---|---|---|
| ReLU | Yes — common | No | No (kink at 0) | Very low | CNNs, fast training baseline |
| Leaky ReLU | No | No | No | Very low | CNNs, GANs |
| ELU | No | Yes (≈) | Yes | Medium (exp) | Deep MLPs, regression tasks |
| GELU | No | Yes (≈) | Yes | Medium | Transformers, LLMs, BERT, GPT |
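To compare the three philosophies on the negative side, the sketch below evaluates each function at a few negative inputs and locates GELU's minimum on a fine grid. It uses the exact GELU definition z·Φ(z) via `math.erf` rather than the tanh approximation; the grid range and step are arbitrary choices.

```python
import numpy as np
from math import erf, sqrt

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def gelu_exact(z):
    # z * Phi(z), where Phi is the standard normal CDF
    return np.array([0.5 * v * (1 + erf(v / sqrt(2))) for v in z])

probe = np.array([-6.0, -2.0, -0.5])
print("z          :", probe)
print("Leaky ReLU :", np.round(leaky_relu(probe), 4))  # keeps falling linearly
print("ELU        :", np.round(elu(probe), 4))         # flattens towards -alpha
print("GELU       :", np.round(gelu_exact(probe), 4))  # returns towards 0

# Locate the bottom of GELU's dip on a grid with step 0.001
grid = np.linspace(-6.0, 0.0, 6001)
vals = gelu_exact(grid)
i = vals.argmin()
gelu_min_z, gelu_min_val = grid[i], vals[i]
print(f"GELU minimum is about {gelu_min_val:.3f} at z = {gelu_min_z:.3f}")
```

All three keep a non-zero gradient for moderately negative inputs, which is what lets a neuron recover; they differ in whether the negative output is unbounded (Leaky ReLU), bounded at −α (ELU), or fades back to zero (GELU).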
Softmax — Output Distributions
In a neural network, the final layer produces raw scores called logits. Softmax turns these into a probability distribution over all classes. The class with the highest probability is your prediction.
Softmax amplifies differences: with logits [3.2, 1.8, 0.5], the 1.4-point lead of the first class becomes a probability gap of roughly 57 percentage points (about 0.76 versus 0.19). It is used in the final classification layer, not as a hidden-layer activation.
Softmax is not used as a hidden-layer activation: it would force the neurons in a layer to compete in a zero-sum probability game, destroying the independent signal in each neuron. (Attention layers do apply softmax internally, but to normalise attention weights rather than as a neuron activation.) It belongs at the final step, converting logits to a probability distribution for multi-class classification. For binary classification, a single sigmoid output is preferred.
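One way to see why a single sigmoid is enough for the binary case: a two-class softmax over the logits [z, 0] gives exactly sigmoid(z) for the first class. The sketch below checks this numerically; the logit values are arbitrary examples.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Multi-class: logits become probabilities that sum to 1
logits = np.array([3.2, 1.8, 0.5])
probs = softmax(logits)
print("probs:", np.round(probs, 3), "sum:", probs.sum())

# Binary: softmax over [z, 0] matches sigmoid(z)
z = 1.4
two_class = softmax(np.array([z, 0.0]))[0]
print("two-class softmax:", two_class)
print("sigmoid          :", sigmoid(z))
```

The second logit carries no extra information in the binary case, since only the difference between the two logits matters, so one sigmoid output does the same job with fewer parameters.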
Python — All Activations From Scratch
import numpy as np
import matplotlib.pyplot as plt

# ── Define all activation functions ───────────────────────
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def gelu(z):
    # Approximation used in practice (Hendrycks & Gimpel 2016)
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def softmax(z):
    # Numerically stable version: subtract max before exp
    e = np.exp(z - np.max(z))
    return e / e.sum()

# ── Quick demo ─────────────────────────────────────────────
z = np.array([3.2, 1.8, 0.5])  # raw logits
print("Logits: ", z)
print("Softmax: ", np.round(softmax(z), 3))

z_range = np.linspace(-5, 5, 400)
fns = {
    "Sigmoid": sigmoid(z_range),
    "tanh": tanh(z_range),
    "ReLU": relu(z_range),
    "Leaky ReLU": leaky_relu(z_range),
    "ELU": elu(z_range),
    "GELU": gelu(z_range),
}
for name, vals in fns.items():
    print(f"{name:12s} | min={vals.min():.3f} max={vals.max():.3f}")
PyTorch — Activations in a Real Network
import torch
import torch.nn as nn

# ── Model with GELU (modern default) ──────────────────────
class DeepMLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.GELU(),                   # ← modern choice
            nn.LayerNorm(hidden),
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.LayerNorm(hidden),
            nn.Linear(hidden, out_dim),  # raw logits
        )

    def forward(self, x):
        return self.net(x)               # Softmax in loss fn

# ── Comparing activations on gradient flow ─────────────────
activations = {
    "ReLU": nn.ReLU(),
    "LeakyReLU": nn.LeakyReLU(0.01),
    "ELU": nn.ELU(1.0),
    "GELU": nn.GELU(),
}

# Ten probe inputs spread over [-3, 3]; each stands in for one neuron's z
x = torch.linspace(-3, 3, 10, requires_grad=True)
for name, act in activations.items():
    y = act(x).sum()
    y.backward()
    # Inputs whose gradient is exactly zero would receive no update
    zero_grad = (x.grad == 0).sum().item()
    print(f"{name:12s} | zero-gradient inputs: {zero_grad}/10")
    x.grad.zero_()                       # reset before the next activation

# ── Loss function: Softmax is baked in here ────────────────
# Use CrossEntropyLoss: it applies LogSoftmax + NLLLoss
# NEVER add a Softmax layer before CrossEntropyLoss in PyTorch!
criterion = nn.CrossEntropyLoss()
Never add nn.Softmax() before
nn.CrossEntropyLoss() in PyTorch. The loss function already
applies log-softmax internally, so adding your own softmax applies it twice:
the loss then sees values squashed into [0, 1], which flattens the predicted
distribution and weakens the gradients. Output raw logits from
your final linear layer and let the loss handle the rest.
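The effect of the double application can be sketched without PyTorch. The NumPy code below is a hand-written stand-in for log-softmax followed by negative log-likelihood, not PyTorch's actual implementation, and the logits and targets are made-up values.

```python
import numpy as np

def log_softmax(z):
    # Row-wise, numerically stable log-softmax
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def cross_entropy_from_logits(logits, targets):
    # Log-softmax followed by mean negative log-likelihood
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets].mean()

logits = np.array([[3.2, 1.8, 0.5],
                   [0.1, 2.5, -1.0]])
targets = np.array([0, 1])          # both rows are classified correctly

correct = cross_entropy_from_logits(logits, targets)

# The mistake: apply softmax yourself, then hand the result to the loss
probs = np.exp(log_softmax(logits))
double = cross_entropy_from_logits(probs, targets)

# With inputs confined to [0, 1], the best achievable loss for C classes
# is log(1 + (C - 1) / e), no matter how confident the model is
floor = np.log(1 + (logits.shape[1] - 1) / np.e)

print("loss from raw logits :", correct)
print("loss after softmax x2:", double)
print("floor for 3 classes  :", floor)
```

With softmax applied twice the loss can never drop below the floor, so even a perfect model looks uncertain and the gradients pushing it towards confident predictions are muted.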
How to Choose — The Decision Map
When in doubt: use GELU in hidden layers + nn.CrossEntropyLoss (which handles softmax) for classification. GELU is the activation used in BERT and the GPT family, and it is a solid default for deep networks, though the gain over ReLU depends on the task and is worth checking on your own data.
Golden Rules
nn.CrossEntropyLoss
handles log-softmax internally. Adding a softmax layer before it applies softmax
twice, which weakens your gradients and slows training.