The Story That Explains RNNs
A standard feedforward neural network is the photo-recogniser. Each input is a self-contained snapshot processed in isolation. It sees the pixels, fires some neurons, gives an answer. There is no memory of "what came before" because nothing came before — each input arrives alone.
But language does not work that way. The word "bank" means something completely different in "I walked to the river bank" versus "I deposited money in the bank." The meaning of every word depends on what preceded it. Understanding language requires memory of context.
A Recurrent Neural Network (RNN) is the novel reader. At every word, it maintains a hidden state — a condensed memory of everything it has seen so far — and uses both that memory and the current word to decide what to do next. The nervousness of the butler mentioned on page one still echoes faintly in the hidden state by the final chapter.
An RNN is a neural network that feeds its own output back as an additional input at the next time step — creating a loop of memory through time.
What Is a Recurrent Neural Network?
Before RNNs, the dominant approach to sequence data was to flatten everything into a fixed-size vector and feed it to a standard dense network. This broke the temporal structure of language completely. RNNs were invented to respect the natural order of sequences.
The key insight is simple: at each time step t, the RNN receives two inputs, the current word/token x_t and the previous hidden state h_{t-1}, and produces a new hidden state h_t. This hidden state is simultaneously the output of the current step and the memory passed to the next.
RNN Architecture — Folded & Unrolled
An RNN is typically drawn in two ways: folded (compact, showing the self-loop) and unrolled through time (expanded, showing each time step as a separate column). Both represent exactly the same computation. The unrolled view is far easier to reason about.
The amber arrows show the recurrent connection — each hidden state is built from the current input and the previous hidden state. Indigo = input flow. Green = output (not always present at every step).
The "folded" diagram shows a single box with an arrow pointing back to itself. The "unrolled" diagram copies that box once per time step. The weights (Wx, Wh, Wy and the bias terms) are shared across every time step; there are no separate weights per step. This is called parameter sharing, and it is what allows an RNN to process sequences of any length.
The Mathematics of a Vanilla RNN
The vanilla RNN has two equations. Everything the network does collapses to these two lines. Once you understand them, you understand the entire forward pass.
Wx: input-to-hidden weights, which process the current token.
Wh: hidden-to-hidden weights, which carry the previous hidden state into the current step.
Wy: hidden-to-output weights, which make predictions. All three are shared across every time step.
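As a rough illustration of the two update equations, here is a minimal NumPy sketch of the forward pass. It assumes the common formulation h_t = tanh(Wx·x_t + Wh·h_{t-1} + b_h) and y_t = Wy·h_t + b_y; the bias names b_h and b_y, the toy dimensions, and the random initialisation are assumptions made for the example, not values from the text.

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, Wy, bh, by):
    """Run a vanilla RNN over one sequence.

    xs has shape (seq_len, input_dim). Returns the hidden states
    (seq_len, hidden_dim) and the outputs (seq_len, output_dim).
    """
    h = np.zeros(Wh.shape[0])                  # h_0: empty memory
    hs, ys = [], []
    for x_t in xs:                             # the same weights are reused at every step
        h = np.tanh(Wx @ x_t + Wh @ h + bh)    # equation 1: new hidden state
        ys.append(Wy @ h + by)                 # equation 2: output (unnormalised scores)
        hs.append(h)
    return np.stack(hs), np.stack(ys)

# Toy sizes, chosen only for illustration
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, seq_len = 4, 8, 3, 5
Wx = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
Wh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
Wy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
bh, by = np.zeros(hidden_dim), np.zeros(output_dim)

xs = rng.normal(size=(seq_len, input_dim))
hs, ys = rnn_forward(xs, Wx, Wh, Wy, bh, by)
print(hs.shape, ys.shape)  # (5, 8) (5, 3)
```

Because the loop reuses the same three matrices at every step, the function accepts a sequence of any length without any change to the parameters.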
From Words to Vectors: The Text Preprocessing Pipeline
Neural networks cannot process raw text. Every word must become a vector of numbers before an RNN can touch it. This pipeline is the same regardless of which RNN variant you use.
After this pipeline, the RNN receives a 3D tensor of shape [batch_size, sequence_length, embed_dim] — e.g. [32, 100, 128] means 32 reviews, each 100 tokens long, each token represented by a 128-dimensional vector. At each of the 100 time steps, the RNN processes one 128-dimensional slice.
Backpropagation Through Time (BPTT)
Training an RNN requires computing gradients. Because the network is unrolled through time, the gradients must flow backwards through every time step. This is called Backpropagation Through Time (BPTT).
Full BPTT over very long sequences (thousands of steps) is computationally expensive and memory-intensive. In practice, Truncated BPTT splits the sequence into chunks of length K (e.g., 35 steps), runs forward and backward within each chunk, and passes the hidden state across chunk boundaries but stops gradient flow at the boundary. This is the standard approach in language model training.
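The chunking scheme described above can be sketched in PyTorch roughly as follows. The toy data, the layer sizes, the MSE loss and the SGD optimiser are all assumptions made for the example; the part that matters is the detach() call, which keeps the hidden state's value but cuts the computation graph at the chunk boundary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
K = 35                                    # chunk length (truncation window)
seq = torch.randn(1, 140, 8)              # toy data: (batch, time, features)
targets = torch.randn(1, 140, 1)

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

hidden = None                             # nn.RNN starts from zeros when given None
num_chunks = 0
for start in range(0, seq.size(1), K):
    x = seq[:, start:start + K]
    y = targets[:, start:start + K]

    out, hidden = rnn(x, hidden)          # forward pass within the chunk
    loss = nn.functional.mse_loss(head(out), y)

    optimizer.zero_grad()
    loss.backward()                       # gradients flow back only to the chunk start
    optimizer.step()

    hidden = hidden.detach()              # carry the state forward, stop gradient flow
    num_chunks += 1

print(num_chunks, hidden.requires_grad)   # 4 False
```

Without the detach(), the second call to backward() would try to walk back into the first chunk's graph, which has already been freed.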
The Vanishing Gradient Problem
The vanishing gradient problem arises during BPTT, where the gradient signal is multiplied by the Jacobian matrix of the hidden-state update at every time step. If the spectral radius of that matrix is slightly less than 1 (which is common with tanh, whose derivative never exceeds 1), then multiplying it over 50 or 100 steps drives the gradient exponentially toward zero. The weights receive almost no gradient from events near step 1, so the influence of early tokens is never learned. The network becomes effectively blind to long-range dependencies.
Short sequence: every token is close to the output.

| Token | Distance to Output | Gradient Signal |
|---|---|---|
| "terrible" | 2 steps | Strong ✓ |
| "movie" | 1 step | Strong ✓ |
| [output] | 0 | Negative sentiment ✓ |

Long sequence: the opening tokens are 80 steps from the output.

| Token | Distance to Output | Gradient Signal |
|---|---|---|
| "The food was great" | 80 steps | ≈ 0 (vanished) ✗ |
| "but the service was" | 4 steps | Weak |
| "awful" | 1 step | Strong ✓ |
| [output] | 0 | Incorrectly negative ✗ |
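
The exponential decay can be checked numerically. The sketch below is an illustration under stated assumptions: a tanh RNN with a random hidden-to-hidden matrix (called Wh here) rescaled so that its largest singular value is 0.9, and random inputs. It accumulates the product of per-step Jacobians and prints the norm of the result.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, steps = 16, 80

Wh = rng.normal(size=(hidden_dim, hidden_dim))
Wh *= 0.9 / np.linalg.norm(Wh, 2)         # rescale: largest singular value = 0.9

h = np.zeros(hidden_dim)
grad = np.eye(hidden_dim)                 # d h_0 / d h_0 = identity
norms = []
for t in range(steps):
    x_t = rng.normal(size=hidden_dim)     # input, assumed already projected to hidden size
    h = np.tanh(Wh @ h + x_t)
    jacobian = np.diag(1.0 - h**2) @ Wh   # d h_t / d h_{t-1} for a tanh RNN
    grad = jacobian @ grad                # chain rule across one more step
    norms.append(np.linalg.norm(grad, 2))

for t in (1, 10, 40, 80):
    print(f"after {t:2d} steps: gradient norm = {norms[t - 1]:.2e}")
```

Each Jacobian has norm at most 0.9 in this setup, so the accumulated norm is bounded by 0.9^t and can only shrink; after 80 steps it is below 0.9^80 ≈ 2e-4.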
Exploding gradients are the opposite failure. When the Jacobian's spectral radius is greater than 1, gradients grow exponentially: weights receive gigantic updates and training diverges. The fix is simple: gradient clipping, which caps the gradient norm at a threshold (typically 1.0 or 5.0) before applying the update. Vanishing gradients are far harder to fix, which is why LSTM and GRU were invented.
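Clipping by global norm is short enough to write out by hand. The helper below, clip_by_global_norm, is a hypothetical name used for this sketch; it mirrors the idea behind framework utilities such as PyTorch's clip_grad_norm_ but is not their implementation.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient arrays so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))   # never scale gradients up
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]       # joint norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g**2) for g in clipped))
print(norm_before, round(norm_after, 4))               # 13.0 1.0
```

All gradients are multiplied by the same factor, so the update direction is preserved and only its length changes.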
LSTM — Long Short-Term Memory
Sepp Hochreiter and Jürgen Schmidhuber introduced the LSTM in 1997 to solve the vanishing gradient problem. The key innovation: instead of relying on tanh to pass gradients through time (which squashes them), LSTM introduces a cell state — a separate memory channel that can carry information across hundreds of steps almost unchanged, because it uses addition (not multiplication) to update.
1. Erase some old notes (forget gate — selectively wipe irrelevant history).
2. Write new notes (input gate — add new relevant information).
3. Read from the notebook to act (output gate — decide what to show to the outside world).
The notebook itself persists intact across long conversations — you do not rewrite it every time you speak. Only targeted edits happen at each step, so old information can survive for hundreds of steps without degrading.
The amber highway at the top is the cell state — it flows with minimal interference, solving the vanishing gradient problem. Each gate (Forget, Input, Output) uses a sigmoid (σ) to produce values between 0 and 1, acting as a soft on/off switch.
The Four LSTM Equations
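One common way to write the LSTM step is sketched below in NumPy. The stacked weight layout (W, U, b split into forget, input, output and candidate blocks) and all names are assumptions made for this example, and authors group the equations differently. In the demo the gates are pinned by hand (forget gate fully open, input gate fully shut) purely to isolate the additive cell-state path.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4n, d), U: (4n, n), b: (4n,), stacked as [f, i, o, g]."""
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0:n])              # forget gate: how much of c_prev to keep
    i = sigmoid(z[n:2 * n])          # input gate: how much new content to write
    o = sigmoid(z[2 * n:3 * n])      # output gate: how much of the cell to expose
    g = np.tanh(z[3 * n:4 * n])      # candidate values to write
    c_t = f * c_prev + i * g         # additive cell-state update
    h_t = o * np.tanh(c_t)           # hidden state read out from the cell
    return h_t, c_t

n, d = 4, 3
W, U, b = np.zeros((4 * n, d)), np.zeros((4 * n, n)), np.zeros(4 * n)
b[0:n] = 50.0                        # forget gate saturated open: keep everything
b[n:2 * n] = -50.0                   # input gate saturated shut: write nothing

c = np.array([0.5, -1.0, 2.0, 0.1])  # something stored in the cell at step 0
c_start = c.copy()
h = np.zeros(n)
rng = np.random.default_rng(0)
for _ in range(200):
    h, c = lstm_step(rng.normal(size=d), h, c, W, U, b)
print(np.allclose(c, c_start))       # True
```

A trained network learns gate values between these extremes, but the example shows why the cell state can carry information across hundreds of steps: when the forget gate is near 1 and the input gate near 0, the update leaves the stored values essentially untouched.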
GRU — Gated Recurrent Unit
Kyunghyun Cho et al. introduced the GRU in 2014 as a simpler alternative to LSTM. The GRU merges the cell state and hidden state into one, and uses only two gates instead of three — reducing the parameter count by roughly 25% with comparable performance on most tasks.
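A GRU step can be sketched the same way. The version below follows the convention h_t = z·h_{t-1} + (1 − z)·h̃_t; some libraries swap the roles of z and 1 − z, so treat the orientation, the stacked weight layout and the names as assumptions of this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step. W: (3n, d), U: (3n, n), b: (3n,), stacked as [z, r, candidate]."""
    n = h_prev.shape[0]
    Wz, Wr, Wc = W[:n], W[n:2 * n], W[2 * n:]
    Uz, Ur, Uc = U[:n], U[n:2 * n], U[2 * n:]
    bz, br, bc = b[:n], b[n:2 * n], b[2 * n:]

    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)               # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)               # reset gate
    h_tilde = np.tanh(Wc @ x_t + Uc @ (r * h_prev) + bc)   # candidate state
    return z * h_prev + (1.0 - z) * h_tilde                # single merged state

n, d = 4, 3
W, U, b = np.zeros((3 * n, d)), np.zeros((3 * n, n)), np.zeros(3 * n)
b[:n] = 50.0                         # update gate pinned to 1: keep the old state

h = np.array([0.3, -0.7, 0.9, 0.0])
h_start = h.copy()
rng = np.random.default_rng(0)
for _ in range(200):
    h = gru_step(rng.normal(size=d), h, W, U, b)
print(np.allclose(h, h_start))       # True
```

There is no separate cell state here: the update gate alone decides how much of the previous hidden state survives, which is where the saving of one gate's worth of parameters comes from.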
LSTM vs GRU — Side-by-Side
| Property | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (update, reset) |
| State vectors | 2 (h_t, C_t) | 1 (h_t only) |
| Parameters | 4 × (n² + n·d) | 3 × (n² + n·d) — ~25% fewer |
| Training speed | Slower | Faster |
| Long sequences | Slightly better — separate C_t helps | Can struggle past ~500 steps |
| Short-medium sequences | Excellent | Excellent — often matches LSTM |
| When to prefer | Long sequences, complex temporal patterns | Faster prototyping, limited compute |
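
The parameter formulas in the table can be checked with a few lines of arithmetic. Here n is the hidden size and d the input size; the table omits biases, so the sketch adds one bias vector per gate as an assumption. Frameworks differ on this point (some keep two bias vectors per gate), so exact counts in a model summary may be slightly higher.

```python
def lstm_params(n, d, bias=True):
    """Four weight blocks: forget, input, output gates plus the candidate."""
    return 4 * (n * n + n * d + (n if bias else 0))

def gru_params(n, d, bias=True):
    """Three weight blocks: update and reset gates plus the candidate."""
    return 3 * (n * n + n * d + (n if bias else 0))

n, d = 128, 128
lstm, gru = lstm_params(n, d), gru_params(n, d)
print(lstm, gru, f"{1 - gru / lstm:.0%} fewer")   # 131584 98688 25% fewer
```

The ratio is exactly 3/4 whatever the sizes, which is where the "~25% fewer" figure comes from.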
Start with GRU: it trains faster, performs comparably on most NLP tasks, and is less likely to overfit on small datasets because it has fewer parameters. Upgrade to LSTM if your sequences are long (>200 tokens) or if GRU plateaus and you have compute budget to spare. In practice the accuracy gap between the two is often small (around 1% or less), so measure both on your own data before committing.
RNN Architecture Patterns
The same RNN cell can be wired in fundamentally different ways depending on your task. Andrej Karpathy's famous taxonomy describes five patterns, and understanding which to use is one of the most important design decisions in sequence modelling.
Blue = inputs, Amber = hidden states (RNN cells), Green = outputs, Purple = context vector (encoder summary). The pattern you choose depends entirely on your task's input/output structure.
Complete Text Preprocessing Pipeline — Code
Before building the model, you must convert raw text into padded integer sequences. This code is universal — it works the same whether you use a vanilla RNN, LSTM, or GRU.
```python
import numpy as np
from collections import Counter
import re

# ── 1. Raw data ───────────────────────────────────────────
reviews = [
    'The film was absolutely brilliant and deeply moving',
    'Terrible acting, boring script, complete waste of time',
    'A masterpiece of modern cinema — breathtaking visuals',
    'I fell asleep halfway through, incredibly dull',
    'Outstanding performances by the entire cast',
]
labels = [1, 0, 1, 0, 1]  # 1 = positive, 0 = negative

# ── 2. Tokenise (simple word-level) ───────────────────────
def tokenise(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  # strip punctuation
    return text.split()

tokenised = [tokenise(r) for r in reviews]

# ── 3. Build vocabulary ───────────────────────────────────
VOCAB_SIZE = 10_000
PAD_TOKEN = '<PAD>'
OOV_TOKEN = '<OOV>'
all_words = [word for tokens in tokenised for word in tokens]
word_freq = Counter(all_words)
vocab = [PAD_TOKEN, OOV_TOKEN] + [w for w, _ in word_freq.most_common(VOCAB_SIZE - 2)]
word2idx = {w: i for i, w in enumerate(vocab)}
print(f'Vocabulary size: {len(vocab)}')

# ── 4. Encode sequences ───────────────────────────────────
def encode(tokens, word2idx):
    return [word2idx.get(t, 1) for t in tokens]  # 1 = OOV index

encoded = [encode(t, word2idx) for t in tokenised]

# ── 5. Pad / truncate to fixed length ────────────────────
MAX_LEN = 50

def pad_sequence(seq, max_len, pad_val=0):
    seq = seq[:max_len]                            # truncate if too long
    seq = seq + [pad_val] * (max_len - len(seq))   # pad if too short
    return seq

X = np.array([pad_sequence(s, MAX_LEN) for s in encoded])
y = np.array(labels)
print(f'X shape: {X.shape}')  # (5, 50)
print(f'First review encoded:\n{X[0]}')
```
Sentiment Analysis with Keras LSTM
Now we build a real many-to-one LSTM classifier for sentiment analysis on the IMDB movie reviews dataset. The model reads the full review (many words) and outputs a single probability (positive or negative).
```python
import re

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# ── 1. Load IMDB dataset (built into Keras) ───────────────
VOCAB_SIZE = 20_000
MAX_LEN = 200
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=VOCAB_SIZE)
print(f'Training samples : {len(X_train):,}')
print(f'Test samples     : {len(X_test):,}')

# ── 2. Pad sequences ──────────────────────────────────────
X_train = pad_sequences(X_train, maxlen=MAX_LEN, padding='post', truncating='post')
X_test = pad_sequences(X_test, maxlen=MAX_LEN, padding='post', truncating='post')
print(f'X_train shape : {X_train.shape}')  # (25000, 200)

# ── 3. Build the LSTM model ───────────────────────────────
model = keras.Sequential([
    # Embedding: integer IDs → dense vectors.
    # input_length is left out: recent Keras releases deprecate it.
    layers.Embedding(
        input_dim=VOCAB_SIZE,
        output_dim=128,      # each word → 128-dim vector
        mask_zero=True,      # ignore PAD tokens in LSTM
    ),
    # Stacked LSTM layers
    layers.LSTM(128, return_sequences=True,    # return all hidden states
                dropout=0.2, recurrent_dropout=0.2),
    layers.LSTM(64, return_sequences=False,    # return only final state
                dropout=0.2, recurrent_dropout=0.2),
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(1, activation='sigmoid'),     # binary output
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss='binary_crossentropy',
    metrics=['accuracy'],
)
model.summary()

# ── 4. Train ──────────────────────────────────────────────
callbacks = [
    keras.callbacks.EarlyStopping(monitor='val_loss', patience=3,
                                  restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(patience=2, factor=0.5),
]
history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=64,
    validation_split=0.1,
    callbacks=callbacks,
    verbose=1,
)

# ── 5. Evaluate on held-out test set ─────────────────────
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f'\nTest Accuracy : {acc:.4f}')
print(f'Test Loss     : {loss:.4f}')

# ── 6. Predict on new text ────────────────────────────────
word_index = imdb.get_word_index()
INDEX_OFFSET = 3          # load_data reserves 0 = PAD, 1 = START, 2 = OOV
START_ID, OOV_ID = 1, 2

def predict_sentiment(text, model, word_index, max_len=MAX_LEN):
    tokens = re.findall(r"[a-z0-9']+", text.lower())  # lowercase, drop punctuation
    seq = [START_ID]                                  # load_data prepends this marker
    for w in tokens:
        idx = word_index.get(w)
        # Shift by the same offset as load_data; unknown or rare words become OOV
        if idx is None or idx + INDEX_OFFSET >= VOCAB_SIZE:
            seq.append(OOV_ID)
        else:
            seq.append(idx + INDEX_OFFSET)
    seq = pad_sequences([seq], maxlen=max_len, padding='post', truncating='post')
    prob = model.predict(seq, verbose=0)[0][0]
    label = 'POSITIVE' if prob > 0.5 else 'NEGATIVE'
    return label, float(prob)

label, confidence = predict_sentiment(
    'This film was a masterpiece, I loved every second of it',
    model, word_index,
)
print(f'Sentiment : {label} ({confidence:.2%})')
```
Why return_sequences=True on the First LSTM?
When stacking LSTM layers, the first layer must output a hidden state at every time step, not just the last, so the second LSTM has a full sequence to process. Set return_sequences=True on all LSTM layers except the final one. The last LSTM uses the default return_sequences=False to produce a single summary vector for the dense classifier.
Sentiment Analysis with PyTorch LSTM
PyTorch gives you more control over the LSTM — you write the forward pass explicitly, which is invaluable when you need custom architectures like attention over the hidden states.
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np

# ── 1. Custom Dataset ─────────────────────────────────────
class TextDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.long)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# ── 2. LSTM Classifier Model ──────────────────────────────
class LSTMSentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim,
                 num_layers, num_classes, dropout=0.3, pad_idx=0):
        super().__init__()
        # Embedding layer: the PAD row stays at zero and receives no gradient
        self.embedding = nn.Embedding(
            vocab_size, embed_dim, padding_idx=pad_idx
        )
        # Stacked, bidirectional LSTM
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,      # input: (batch, seq, features)
            bidirectional=True,    # run forwards AND backwards
            dropout=dropout if num_layers > 1 else 0.0,
        )
        # Classifier head: ×2 for bidirectional
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, 64),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(64, num_classes),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, seq_len) → integer token IDs
        embedded = self.dropout(self.embedding(x))
        # embedded: (batch, seq_len, embed_dim)
        # Note: the LSTM still steps over PAD positions here; pack the batch
        # with nn.utils.rnn.pack_padded_sequence to skip them.
        output, (hidden, cell) = self.lstm(embedded)
        # hidden: (num_layers * 2, batch, hidden_dim) ← bidirectional
        # Concatenate final forward and backward hidden states
        fwd = hidden[-2, :, :]   # last layer, forward direction
        bwd = hidden[-1, :, :]   # last layer, backward direction
        combined = torch.cat([fwd, bwd], dim=1)   # (batch, hidden*2)
        logits = self.classifier(combined)        # (batch, num_classes)
        return logits

# Helper used by the training loop below: mean batch loss and accuracy
def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for X_batch, y_batch in loader:
            X_batch, y_batch = X_batch.to(device), y_batch.long().to(device)
            logits = model(X_batch)
            total_loss += criterion(logits, y_batch).item()
            correct += (logits.argmax(dim=1) == y_batch).sum().item()
            total += y_batch.size(0)
    return total_loss / max(len(loader), 1), correct / max(total, 1)

# ── 3. Training loop ──────────────────────────────────────
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
VOCAB_SIZE = 20_000
EMBED_DIM = 128
HIDDEN_DIM = 256
NUM_LAYERS = 2
BATCH_SIZE = 64
EPOCHS = 10
LR = 1e-3

model = LSTMSentimentClassifier(
    vocab_size=VOCAB_SIZE, embed_dim=EMBED_DIM,
    hidden_dim=HIDDEN_DIM, num_layers=NUM_LAYERS, num_classes=2
).to(DEVICE)
optimizer = optim.Adam(model.parameters(), lr=LR, weight_decay=1e-5)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=2)
criterion = nn.CrossEntropyLoss()

# Assume train_loader, val_loader, test_loader are defined
best_val_acc = 0.0
for epoch in range(1, EPOCHS + 1):
    model.train()
    total_loss, correct, total = 0.0, 0, 0
    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(DEVICE), y_batch.long().to(DEVICE)
        optimizer.zero_grad()
        logits = model(X_batch)
        loss = criterion(logits, y_batch)
        loss.backward()
        # Gradient clipping: crucial for RNNs
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        total_loss += loss.item()
        preds = logits.argmax(dim=1)
        correct += (preds == y_batch).sum().item()
        total += y_batch.size(0)
    train_acc = correct / total
    val_loss, val_acc = evaluate(model, val_loader, criterion, DEVICE)
    scheduler.step(val_loss)
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), 'best_lstm.pt')
    print(f'Epoch {epoch:02d} | train_acc={train_acc:.4f} | val_acc={val_acc:.4f}')

# ── 4. Test evaluation ────────────────────────────────────
model.load_state_dict(torch.load('best_lstm.pt', map_location=DEVICE))
_, test_acc = evaluate(model, test_loader, criterion, DEVICE)
print(f'\nFinal Test Accuracy: {test_acc:.4f}')
```
The line clip_grad_norm_(model.parameters(), max_norm=1.0) is not optional for RNNs — it prevents exploding gradients from destabilising training. Always clip before the optimizer step. A max_norm of 1.0–5.0 is the standard range; monitor your gradient norms (via torch.nn.utils.clip_grad_norm_'s return value) to tune this. If it rarely triggers, the norm is fine. If it always triggers, reduce your learning rate.
Bidirectional RNNs
A standard RNN processes text left-to-right. But for many NLP tasks, the words after a target word are just as informative as the words before. Bidirectional RNNs solve this by running two RNNs in parallel: one forward (→), one backward (←), then concatenating their hidden states.
In the sentence "The bank by the river was flooded", the word "bank" is ambiguous if you only read left-to-right. But reading the full context in both directions, "river" (right context) clearly disambiguates "bank" as a riverbank, not a financial institution. Bidirectional RNNs capture both left and right context simultaneously at every position.
Bidirectional RNNs require the complete sequence to exist before processing. They cannot generate text autoregressively (word by word) because the backward pass needs future tokens that do not exist yet. Use bidirectional for understanding tasks (classification, tagging, QA). Use unidirectional for generation tasks (language models, chatbots).
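The two-pass idea can be sketched with a plain tanh RNN in NumPy. The helper names run_rnn and bidirectional, the toy sizes and the random weights are assumptions for this example. The last lines change only the final token and compare the representation at position 0 before and after.

```python
import numpy as np

def run_rnn(xs, Wx, Wh, b):
    """Hidden state at every step of a simple tanh RNN; xs is (seq_len, d)."""
    h = np.zeros(Wh.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(Wx @ x_t + Wh @ h + b)
        hs.append(h)
    return np.stack(hs)

def bidirectional(xs, fwd_params, bwd_params):
    h_fwd = run_rnn(xs, *fwd_params)                # reads left to right
    h_bwd = run_rnn(xs[::-1], *bwd_params)[::-1]    # reads right to left, then re-aligned
    return np.concatenate([h_fwd, h_bwd], axis=1)   # (seq_len, 2 * hidden_dim)

rng = np.random.default_rng(0)
seq_len, d, H = 6, 4, 8

def init_params():
    return (rng.normal(scale=0.5, size=(H, d)),
            rng.normal(scale=0.5, size=(H, H)),
            np.zeros(H))

fwd_params, bwd_params = init_params(), init_params()
xs = rng.normal(size=(seq_len, d))
out = bidirectional(xs, fwd_params, bwd_params)
print(out.shape)                                    # (6, 16)

# Change only the LAST token and look at position 0
xs_changed = xs.copy()
xs_changed[-1] += 1.0
out_changed = bidirectional(xs_changed, fwd_params, bwd_params)
fwd_same = np.array_equal(out[0, :H], out_changed[0, :H])
bwd_same = np.array_equal(out[0, H:], out_changed[0, H:])
print(fwd_same, bwd_same)
```

The forward half at position 0 cannot see the changed token, so it stays identical, while the backward half has already read the whole sequence and shifts. That dependence on future tokens is also why the backward pass rules out word-by-word generation.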
RNN vs Transformer — When to Use Which
Transformers (BERT, GPT, T5) have largely superseded RNNs for high-resource NLP tasks. But RNNs are far from obsolete — they remain the better choice in many real-world scenarios. Understanding the trade-offs is essential for building systems that are correct, not just trendy.
| Property | RNN / LSTM / GRU | Transformer (BERT / GPT) |
|---|---|---|
| Sequence handling | Sequential — one step at a time | Parallel — all tokens at once |
| Training speed | Slower — cannot parallelise across time | Faster — full GPU utilisation |
| Long-range dependencies | Degrades after ~500 steps (even LSTM) | Self-attention reaches any distance equally |
| Memory footprint | Small: fixed-size state at inference, linear in sequence length during training | Quadratic in sequence length (attention matrix) |
| Online / streaming inference | Natural — process token by token as it arrives | Awkward — needs full context window |
| Small datasets | Better — fewer parameters, less overfit risk | Worse — data-hungry, needs pre-training to shine |
| On-device / edge deployment | Excellent — tiny model size | Difficult — BERT-base = 110M params |
| Interpretability | Moderate — hidden states hard to interpret | Moderate — attention weights can mislead |
| State-of-the-art NLP accuracy | Good — but not SOTA since ~2019 | Best — across almost all NLP benchmarks |
| Time series / real-time signals | Natural fit — causal, step-by-step | Requires adaptations (Temporal Fusion Transformer) |
Choose LSTM/GRU when: you have <50K training examples, you're deploying on constrained hardware, your sequence is streaming in real time, or you're working on time series rather than natural language. Choose Transformers when: you have large data (or can use pre-trained weights), you need state-of-the-art accuracy, and you have access to substantial GPU compute. In many production systems, a well-tuned bidirectional LSTM is the better overall choice than a fine-tuned BERT because of its lower latency and memory cost, even when BERT scores somewhat higher on accuracy. Not every task needs a 110-million-parameter model.
Golden Rules for RNNs in Production
1. Clip gradients. Use torch.nn.utils.clip_grad_norm_(params, max_norm=1.0) in PyTorch or clipnorm=1.0 on the optimizer in Keras. Exploding gradients are one of the most common reasons RNN training diverges; clipping costs almost nothing and prevents catastrophic failures.
2. Mask padding: mask_zero=True / padding_idx=0. Padding tokens must not contribute to the hidden state updates or the loss. In Keras, set mask_zero=True on the Embedding so that downstream layers skip padded steps. In PyTorch, padding_idx=0 in nn.Embedding only keeps the PAD embedding at zero and excludes it from gradient updates; the LSTM still steps over padded positions unless you also pack the batch with nn.utils.rnn.pack_padded_sequence. Ignoring this causes the network to "learn" from artificial zero tokens.
3. Use both kinds of dropout. In Keras, apply dropout on the inter-layer connections AND recurrent_dropout on the recurrent connections (the hidden-to-hidden weights). In PyTorch, nn.LSTM only applies dropout between layers; recurrent dropout requires a custom cell. Missing recurrent dropout leaves a major overfitting path open.
4. Set batch_first=True in PyTorch. PyTorch LSTM defaults to (seq_len, batch, features), the reverse of the batch-first layout that Keras and most datasets use. Always pass batch_first=True to match the standard (batch, seq_len, features) convention. Shape mismatches from forgetting this are silent: the code runs but produces nonsense.
5. When starting from pre-trained embeddings, set trainable=False initially, then unfreeze after a few epochs. This single change can add 3–5% accuracy at almost no compute cost.
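To make the masking rule concrete, here is a small NumPy sketch of what masking does to the hidden state. The function masked_rnn_final_state is a hypothetical helper written for this illustration; Keras masking and PyTorch packed sequences achieve roughly the same effect inside the framework.

```python
import numpy as np

def masked_rnn_final_state(xs, mask, Wx, Wh, b):
    """Final hidden state of a tanh RNN that skips steps where mask is False."""
    h = np.zeros(Wh.shape[0])
    for x_t, keep in zip(xs, mask):
        h_new = np.tanh(Wx @ x_t + Wh @ h + b)
        h = h_new if keep else h          # padded step: carry the state forward unchanged
    return h

rng = np.random.default_rng(0)
d, H = 4, 8
Wx = rng.normal(scale=0.5, size=(H, d))
Wh = rng.normal(scale=0.5, size=(H, H))
b = rng.normal(scale=0.1, size=H)

real = rng.normal(size=(4, d))                    # 4 real tokens
padded = np.vstack([real, np.zeros((6, d))])      # post-padded to length 10
mask = np.array([True] * 4 + [False] * 6)

h_real = masked_rnn_final_state(real, np.ones(4, dtype=bool), Wx, Wh, b)
h_masked = masked_rnn_final_state(padded, mask, Wx, Wh, b)
h_unmasked = masked_rnn_final_state(padded, np.ones(10, dtype=bool), Wx, Wh, b)

print(np.array_equal(h_real, h_masked))           # True: padding had no effect
print(np.array_equal(h_real, h_unmasked))
```

With the mask, the padded sequence ends in exactly the same state as the unpadded one. Without it, six extra updates on zero vectors keep rewriting the state, so the classifier sees a summary that depends on how much padding the batch happened to need.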