
Activation Functions in Deep Learning

A visual, story-driven guide to non-linear activations — how Sigmoid, tanh, ReLU, Leaky ReLU, ELU, GELU and Softmax work, where they break, and how to choose the right one.

Section 01

Why Neurons Need to Be Non-Linear

100 Layers of Glass — Still Just Glass
Imagine stacking 100 sheets of clear glass on top of each other. No matter how many layers you add, light still passes straight through — it's still just glass. A stack of linear layers behaves the same way: no matter how deep your network, it can only learn a single linear transformation. It can never separate spirals, model speech, or understand images.

The fix? Introduce a small non-linear "kink" at every neuron — an activation function. Suddenly your stack of glass becomes a prism, then a kaleidoscope. Each layer now transforms space in ways the next layer can exploit. That is what makes deep networks deep.

An activation function takes a neuron's raw weighted sum z = Wx + b and applies a non-linear squeeze, shift, or gate before passing it forward. Without this step, every layer collapses into one — and depth becomes meaningless.
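The collapse can be checked numerically. The sketch below assumes NumPy and uses small, arbitrary layer shapes (the names `W1`, `W2`, `W_merged` and friends are illustrative, not from any library): two stacked linear layers with no activation between them give the same output as one merged linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked linear layers with no activation in between (shapes are arbitrary).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer.
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
one_layer = W_merged @ x + b_merged

collapsed = np.allclose(two_layers, one_layer)
print(collapsed)
```

The same algebra applies to any number of stacked linear layers, which is why depth alone adds no expressive power until a non-linearity sits between them.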

🔑
The One Rule

A good activation function must be differentiable almost everywhere so gradients can flow backwards during training. The shape of that derivative — whether it saturates, clips, or stays linear — determines everything about how well your network learns.

📊 Linear vs Non-Linear — Decision Space
[Figure: two decision-space panels. Left, no activation (linear): can only draw a straight line and misclassifies spirals. Right, with activation (non-linear): can draw curves and separates complex distributions.]

Section 02

Sigmoid & tanh — The Saturation Trap

The Volume Knob That Gets Stuck
Imagine a volume knob that only goes from 0 to 1. At the extremes it's glued to the floor or ceiling — no matter how hard you turn it, nothing changes. That is exactly what happens when a sigmoid neuron receives very large or very small inputs. Its output flatlines, its gradient vanishes to near zero, and the neuron stops learning entirely. In a 50-layer network, this silence echoes backward — nothing moves.
📈 Sigmoid σ(z) and tanh(z) — Curves & Gradients
σ(z) = 1 / (1 + e⁻ᶻ), output between 0 and 1, saturated at both extremes
tanh(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ), output between −1 and +1
[Figure: curves of σ(z) and tanh(z) with their gradients overlaid.]

Yellow dashed = derivative (gradient). Both functions saturate at extremes — gradient ≈ 0 ⟹ vanishing gradient problem.

Sigmoid Output Range
σ(z) ∈ (0, 1)
Always positive output → outputs are not zero-centred → zig-zag gradient updates
tanh Output Range
tanh(z) ∈ (−1, +1)
Zero-centred → better gradient flow than sigmoid, but still saturates
Sigmoid Max Gradient
σ′(z) ≤ 0.25
Each sigmoid layer scales the gradient by at most 0.25, so 10 layers contribute a factor of at most 0.25¹⁰ ≈ 10⁻⁶
tanh Max Gradient
tanh′(z) ≤ 1.0
Better than sigmoid but still collapses at large |z| values
⚠️
The Vanishing Gradient Problem

When gradients pass through many saturated sigmoid/tanh neurons, they shrink exponentially. By the time they reach the first layers of a deep network, they are essentially zero: those weights barely update, and the early layers never learn. This is a major reason deep networks were so hard to train before ReLU and better initialisation became common in the early 2010s.
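The arithmetic behind the shrinkage is easy to reproduce. This is a minimal sketch using only the standard library; it looks at the activation's derivative alone and ignores the weight matrices, which is a simplifying assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(z)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)        # the largest value the derivative can take
saturated = sigmoid_grad(6.0)   # the derivative once the neuron has saturated

# Best-case factor contributed by the activations alone across 10 sigmoid layers.
best_case_10_layers = peak ** 10

print(f"peak={peak}, saturated={saturated:.4f}, ten layers={best_case_10_layers:.2e}")
```

Even in the best case (every neuron sitting exactly at z = 0) the factor after ten layers is below one in a million; saturated neurons make it far smaller.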


Section 03

ReLU — The Revolution & Its Fatal Flaw

The Bouncer Who Lets Everyone In — Except the Negatives
ReLU (Rectified Linear Unit) has a brutally simple rule: if you are positive, pass through unchanged. If you are negative, you get set to zero. That's it. No sigmoid squeeze. No tanh drama. Just a single fold at zero.

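This simplicity is ReLU's superpower: the gradient for any positive activation is exactly 1, so it can pass through 100 layers without shrinking. The speed-up was dramatic; the AlexNet paper (Krizhevsky et al., 2012) reported a ReLU network reaching 25% training error on CIFAR-10 about six times faster than an equivalent tanh network. Deep networks finally worked. The 2010s deep learning renaissance was largely built on this one dumb function.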

But the bouncer has a problem: once a neuron's input goes negative and stays negative, it gets frozen at zero permanently. The gradient is zero, no update happens, it's dead forever. This is the dying neuron problem.
📈 ReLU(z) = max(0, z) — Curve & Dying Neuron
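[Figure: ReLU(z) = max(0, z), with gradient = 0 for z < 0 and gradient = 1 for z > 0. Inset, the dying neuron problem: an input of z = +2.1 passes signal through, while z = −3.4 gives ReLU → 0 and ∇ = 0, so no signal flows back. A high learning rate can push the weights until z < 0 for every input, freezing the neuron; aggressive training can lose a large share of a network's neurons this way.]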

Red dashed = zero-gradient (dead) zone. Positive z flows freely with gradient = 1. A high learning rate can kill entire layers.

💀
The Dying Neuron Problem

If a large gradient update pushes a neuron's weights so that z is always negative for every training sample, that neuron outputs zero and receives zero gradient forever. It's dead — contributing nothing. With aggressive learning rates or poor initialisation, up to 40% of a ReLU network can die before training is complete.
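A dead unit can be told apart from ordinary ReLU sparsity by checking whether it is inactive on every sample. The sketch below assumes NumPy; the layer shape, the batch, and the −50 bias (a stand-in for a bad update) are all hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))   # hypothetical batch: 1000 samples, 5 features
W = rng.normal(size=(5, 8))      # hypothetical layer with 8 ReLU units
b = np.zeros(8)
b[:2] = -50.0                    # stand-in for an update that drove two biases very negative

Z = X @ W + b                    # pre-activations, shape (1000, 8)

# ReLU's gradient is 1 where z > 0 and 0 elsewhere.
active = Z > 0

# A unit is dead when it is inactive for every sample: no gradient ever reaches its weights.
dead_units = ~active.any(axis=0)
dead_fraction = float(dead_units.mean())

# Ordinary sparsity is different: healthy units are also zero on roughly half the samples.
zero_fraction = float((~active).mean())

print(f"dead units: {dead_fraction:.2f}, zero activations: {zero_fraction:.2f}")
```

The two numbers differ on purpose. A high fraction of zero activations is normal for ReLU; what matters for the dying neuron problem is the fraction of units that never fire at all.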


Section 04

Leaky ReLU, ELU & GELU — Fixing What ReLU Broke

Three activations emerged to solve the dying neuron problem — each with a different philosophy about what happens when z < 0.

📉
Leaky ReLU
f(z) = max(αz, z), α ≈ 0.01
Instead of a hard zero for negatives, it allows a small slope α (typically 0.01). Neurons can never fully die, because they always receive at least a tiny gradient. Fast to compute. Works as well as or better than ReLU in many vision tasks.
〰️
ELU
f(z) = z if z>0 else α(eᶻ − 1)
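Exponential Linear Unit. Negative inputs curve smoothly toward −α instead of being clipped to zero. Mean activations sit closer to zero than with ReLU, and with α = 1 the gradient is continuous at z = 0. Slightly more expensive due to the exp; the original ELU paper reported gains over ReLU and Leaky ReLU on networks deeper than about five layers.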
🔔
GELU
f(z) = z · Φ(z)
Gaussian Error Linear Unit. Multiplies z by Φ(z), the standard normal CDF: the probability that a standard Gaussian sample falls below z. The result is a smooth, deterministic version of a stochastic gate. Used in BERT, GPT, and transformers broadly. Smooth and differentiable everywhere. A common default in transformer language models.
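What separates these three from ReLU is the gradient on the negative side. The sketch below uses only the standard library and evaluates each derivative at one hypothetical input, z = −2; the function names are illustrative, and the derivatives are written out by hand from the formulas above.

```python
import math

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

def elu_grad(z, alpha=1.0):
    # d/dz of alpha * (exp(z) - 1) for z <= 0
    return 1.0 if z > 0 else alpha * math.exp(z)

def gelu_grad(z):
    # d/dz [z * Phi(z)] = Phi(z) + z * phi(z), with phi the standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return cdf + z * pdf

z = -2.0
grads = {
    "ReLU": relu_grad(z),
    "Leaky ReLU": leaky_relu_grad(z),
    "ELU": elu_grad(z),
    "GELU": gelu_grad(z),
}
for name, g in grads.items():
    print(f"{name:12s} gradient at z={z}: {g:+.4f}")
```

Only ReLU returns exactly zero here. The GELU gradient is slightly negative at this point, a consequence of the small dip below zero in its curve.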
📈 ReLU vs Leaky ReLU vs ELU vs GELU — Side by Side
[Figure: ReLU, Leaky ReLU, ELU and GELU plotted together for z from −3 to +3.]

GELU dips slightly below zero, reaching a minimum of about −0.17 near z ≈ −0.75 before rising, which creates a soft gating effect. ELU saturates at −α. Leaky ReLU stays linear with a small slope.
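The location and depth of the GELU dip can be found with a simple grid search. This is a minimal sketch using the standard library's `math.erf` for the exact form z · Φ(z); the grid resolution is an arbitrary choice, and the tanh form is the approximation commonly used in practice.

```python
import math

def gelu_exact(z):
    # z * Phi(z), with the standard normal CDF written via erf
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    # The tanh approximation
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))

# Search the negative side on a 0.001 grid for the lowest point of the curve.
grid = [i / 1000.0 for i in range(-3000, 1)]
z_min = min(grid, key=gelu_exact)
value_min = gelu_exact(z_min)

# Largest disagreement between the exact form and the approximation on this grid.
max_gap = max(abs(gelu_exact(z) - gelu_tanh(z)) for z in grid)

print(f"minimum near z={z_min:.3f}, value={value_min:.4f}, max gap={max_gap:.5f}")
```

The minimum lands close to z = −0.75 with a value of roughly −0.17, and the two formulas agree closely across the whole range.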

Activation | Dying Neurons | Zero-Centred | Smooth | Compute Cost | Best Use Case
ReLU | Yes (common) | No | No (kink at 0) | Very low | CNNs, fast training baseline
Leaky ReLU | No | No | No | Very low | CNNs, GANs
ELU | No | Yes (≈) | Yes | Medium (exp) | Deep MLPs, regression tasks
GELU | No | Yes (≈) | Yes | Medium | Transformers, LLMs, BERT, GPT

Section 05

Softmax — Output Distributions

The Ballot Counter
After an election, you have raw vote counts: Cat got 400 votes, Dog got 300, Fish got 300. A ballot counter converts these into percentages that sum to 100%: Cat 40%, Dog 30%, Fish 30%. Softmax does the same job with one twist: it exponentiates each score before normalising, so gaps between scores are amplified rather than preserved proportionally. The largest score wins the most probability mass, and the smallest is never completely silenced; it still gets some probability.

In a neural network, the final layer produces raw scores called logits. Softmax turns these into a probability distribution over all classes. The class with the highest probability is your prediction.
📊 Softmax — Logits to Probability Distribution
[Figure: Softmax, eᶻᵢ / Σeᶻⱼ, maps logits to probabilities. Logits (z): Cat 3.2, Dog 1.8, Fish 0.5. Probabilities: Cat 76%, Dog 19%, Fish 5%, Σ = 100%. Outputs are never negative and always sum to 1.]

Softmax amplifies differences: here a logit advantage of 1.4 (Cat over Dog) becomes a gap of about 57 percentage points in probability. Used in the final classification layer, not as a hidden-layer activation.
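Working the formula through by hand for the logits 3.2, 1.8 and 0.5 makes the amplification concrete. This is a minimal standard-library sketch; the dictionary layout is just a convenience for labelling the classes.

```python
import math

logits = {"Cat": 3.2, "Dog": 1.8, "Fish": 0.5}

# Softmax: exponentiate each logit, then divide by the sum of the exponentials.
exps = {name: math.exp(z) for name, z in logits.items()}
total = sum(exps.values())
probs = {name: e / total for name, e in exps.items()}

for name, p in probs.items():
    print(f"{name}: {p:.3f}")

# Gap between the top two classes, in probability.
gap = probs["Cat"] - probs["Dog"]
```

This gives roughly 0.761 for Cat, 0.188 for Dog and 0.051 for Fish, so a logit lead of 1.4 turns into a probability lead of about 0.57.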

💡
Softmax is Only for the Output Layer

As a hidden-layer activation, softmax is a poor fit: it would force every neuron in a layer to compete in a zero-sum probability game, destroying the independent signal in each neuron. (Attention layers do use softmax internally, but to normalise attention weights rather than as a neuron activation.) It belongs at the final step, converting logits to a probability distribution for multi-class classification. For binary classification, a single sigmoid output is preferred.
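A single sigmoid is enough for the binary case because it equals a two-class softmax in which the second logit is pinned at zero. The sketch below checks this with the standard library; the logit value 1.3 is an arbitrary example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

z = 1.3  # hypothetical logit for the positive class
p_sigmoid = sigmoid(z)
p_softmax = softmax([z, 0.0])[0]

print(p_sigmoid, p_softmax)
```

Both calls return the same probability, so a second output neuron would add parameters without adding information.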


Section 06

Python — All Activations From Scratch

import numpy as np

# ── Define all activation functions ───────────────────────

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def gelu(z):
    # Approximation used in practice (Hendrycks & Gimpel 2016)
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def softmax(z):
    # Numerically stable version — subtract max before exp
    e = np.exp(z - np.max(z))
    return e / e.sum()

# ── Quick demo ─────────────────────────────────────────────

z = np.array([3.2, 1.8, 0.5])  # raw logits
print("Logits:  ", z)
print("Softmax: ", np.round(softmax(z), 3))

z_range = np.linspace(-5, 5, 400)
fns = {
    "Sigmoid":    sigmoid(z_range),
    "tanh":       tanh(z_range),
    "ReLU":       relu(z_range),
    "Leaky ReLU": leaky_relu(z_range),
    "ELU":        elu(z_range),
    "GELU":       gelu(z_range),
}

for name, vals in fns.items():
    print(f"{name:12s} | min={vals.min():.3f}  max={vals.max():.3f}")
OUTPUT
Logits:   [3.2 1.8 0.5]
Softmax:  [0.761 0.188 0.051]
Sigmoid      | min=0.007  max=0.993
tanh         | min=-1.000  max=1.000
ReLU         | min=0.000  max=5.000
Leaky ReLU   | min=-0.050  max=5.000
ELU          | min=-0.993  max=5.000
GELU         | min=-0.170  max=5.000
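The `softmax` above normalises over the whole array, which is right for a single vector of logits. For a batch of rows, a variant that works along one axis is needed. The following is one possible sketch, assuming NumPy; `softmax_batch` is a hypothetical helper name, and the second row is chosen to show why the max is subtracted first.

```python
import numpy as np

def softmax_batch(z, axis=-1):
    """Softmax along one axis, so each row of a batch sums to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z, axis=axis, keepdims=True)   # avoids overflow in exp
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

batch = np.array([
    [3.2, 1.8, 0.5],
    [1000.0, 999.0, 998.0],   # exp(1000) would overflow without the shift
])
probs = softmax_batch(batch)
print(np.round(probs, 3))
```

Subtracting the row maximum leaves the result unchanged, because softmax depends only on differences between logits.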

Section 07

PyTorch — Activations in a Real Network

import torch
import torch.nn as nn

# ── Model with GELU (modern default) ──────────────────────
class DeepMLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.GELU(),                  # ← modern choice
            nn.LayerNorm(hidden),
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.LayerNorm(hidden),
            nn.Linear(hidden, out_dim),   # raw logits
        )

    def forward(self, x):
        return self.net(x)             # Softmax in loss fn

# ── Comparing activations on gradient flow ─────────────────
activations = {
    "ReLU":       nn.ReLU(),
    "LeakyReLU":  nn.LeakyReLU(0.01),
    "ELU":        nn.ELU(1.0),
    "GELU":       nn.GELU(),
}

x = torch.linspace(-3, 3, 10, requires_grad=True)

for name, act in activations.items():
    y = act(x).sum()
    y.backward()
    dead = (x.grad == 0).sum().item()
    print(f"{name:12s} | dead neurons: {dead}/10")
    x.grad.zero_()

# ── Loss function — Softmax is baked in here ───────────────
# Use CrossEntropyLoss: it applies LogSoftmax + NLLLoss
# NEVER add a Softmax layer before CrossEntropyLoss in PyTorch!
criterion = nn.CrossEntropyLoss()
OUTPUT
ReLU         | dead neurons: 5/10   ← 50% of neurons have zero gradient
LeakyReLU    | dead neurons: 0/10
ELU          | dead neurons: 0/10
GELU         | dead neurons: 0/10
⚠️
PyTorch Softmax Trap

Never add nn.Softmax() before nn.CrossEntropyLoss() in PyTorch. The loss function already applies log-softmax internally, so adding your own softmax applies the normalisation twice. The loss then sees values squeezed into (0, 1) instead of raw logits, which flattens the gradients and stalls training. Output raw logits from your final linear layer and let the loss handle the rest.
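One concrete symptom of the double application is a loss that can no longer approach zero. The sketch below is plain Python rather than PyTorch: `cross_entropy` is a hand-written stand-in that mirrors log-softmax followed by negative log-likelihood for a single sample, and the logits are a made-up, very confident prediction.

```python
import math

def log_softmax(scores):
    # Stable log-softmax via the log-sum-exp trick.
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]

def softmax(scores):
    return [math.exp(v) for v in log_softmax(scores)]

def cross_entropy(logits, target):
    # Log-softmax followed by negative log-likelihood, for one sample.
    return -log_softmax(logits)[target]

logits = [12.0, 0.0, 0.0]   # hypothetical: confident and correct
target = 0

loss_correct = cross_entropy(logits, target)            # raw logits into the loss
loss_double = cross_entropy(softmax(logits), target)    # softmax applied twice

print(f"raw logits: {loss_correct:.6f}   with extra softmax: {loss_double:.4f}")
```

With raw logits the loss is close to zero. With the extra softmax the inputs to the loss are confined to (0, 1), so for three classes the loss cannot drop below about 0.55 however confident the model becomes.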


Section 08

How to Choose — The Decision Map

🗺️ Activation Function Decision Guide
Output Layer — Multi-class
Use Softmax. Converts logits to a probability distribution. Never use in hidden layers.
Output Layer — Binary
Use Sigmoid. Single probability in (0,1). Threshold at 0.5 for class decision.
Transformers / LLMs
Use GELU. Smooth and differentiable everywhere. It is the activation used in BERT and the GPT-2/GPT-3 family; several newer LLMs use closely related gated variants such as SwiGLU.
CNNs / Vision models
Start with ReLU. If more than 10% of neurons die, switch to Leaky ReLU. Batch normalisation also helps keep neurons alive.
Deep MLP / Tabular
Try ELU or GELU. Zero-centred, smooth gradients. Better than ReLU on deep feed-forward nets.
GANs / Discriminator
Use Leaky ReLU (α = 0.2). Lets gradients flow for both positive and negative activations. Standard GAN practice.
🏆
The Modern Practitioner Default

When in doubt: use GELU in hidden layers + nn.CrossEntropyLoss (which handles softmax) for classification. This mirrors the setup of many transformer-based models and is a strong starting point for deeper networks, though it is worth benchmarking against ReLU on your own task.


Section 09

Golden Rules

⚡ Activation Functions — Non-Negotiable Rules
1
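Avoid sigmoid and tanh as hidden-layer activations in deep feed-forward networks. Their gradients saturate, and your early layers will stop learning. Reserve sigmoid for binary output neurons and for gating mechanisms (as in LSTMs), where squashing to (0, 1) is the point.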
2
Monitor your dead neuron rate. If a ReLU network's accuracy plateaus early, log the fraction of units per layer that output zero for every sample in a batch. A unit that is zero on only some samples is ordinary sparsity, not death. Above 20% dead is a red flag: switch to Leaky ReLU or lower your learning rate.
3
Softmax is only for the final output layer. It creates a zero-sum competition between neurons. Hidden layers need independent signals, not a probability race.
4
In PyTorch, output raw logits. nn.CrossEntropyLoss handles log-softmax internally. Adding a softmax layer before it applies softmax twice, which compresses the logits, weakens the gradients, and slows or stalls training.
5
GELU is the modern default for deep networks. It is smooth, zero-centred on average, and produces better gradient flow than ReLU on anything with residual connections or attention. Use it unless you have a strong reason not to.
6
Pair activations with appropriate initialisations. ReLU works best with He (Kaiming) initialisation. Sigmoid/tanh with Glorot (Xavier). Wrong initialisation + wrong activation = vanishing or exploding gradients before training even starts.
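The initialisation rule can be seen by pushing a signal through a stack of ReLU layers. The sketch below assumes NumPy and uses made-up sizes (`width`, `depth`, `batch`); it draws fresh Gaussian weights for every layer with the He standard deviation sqrt(2 / fan_in) or the Glorot standard deviation sqrt(2 / (fan_in + fan_out)), and compares the size of the output.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth, batch = 512, 20, 200      # hypothetical sizes, chosen only for the demo
x = rng.normal(size=(batch, width))     # unit-variance inputs

def rms_after_relu_stack(std):
    """Push x through `depth` freshly initialised ReLU layers; return the output RMS."""
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, std, size=(width, width))
        h = np.maximum(0.0, h @ W)
    return float(np.sqrt(np.mean(h ** 2)))

he_std = np.sqrt(2.0 / width)                 # He (Kaiming) normal
xavier_std = np.sqrt(2.0 / (width + width))   # Glorot (Xavier) normal, fan_in = fan_out

rms_he = rms_after_relu_stack(he_std)
rms_xavier = rms_after_relu_stack(xavier_std)

print(f"He: {rms_he:.3f}   Xavier: {rms_xavier:.5f}")
```

With He initialisation the signal keeps roughly its original scale after 20 ReLU layers. With Glorot initialisation each ReLU layer halves the second moment, so the signal shrinks by orders of magnitude, and the backward pass suffers the same way.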