Recurrent Neural Networks (RNN, LSTM, GRU)

Section 01

The Story That Explains RNNs

📖 Real World Analogy

Reading a Novel vs. Looking at a Photograph

Imagine two tasks. First: you glance at a photo of a cat and immediately recognise it as a cat — instant, no context required. The photo is complete in itself. Second: you are reading a detective novel and the final sentence is "And so, he did it." — without reading the previous three hundred pages, those five words are completely meaningless.

A standard feedforward neural network is the photo-recogniser. Each input is a self-contained snapshot processed in isolation. It sees the pixels, fires some neurons, gives an answer. There is no memory of "what came before" because nothing came before — each input arrives alone.

But language does not work that way. The word "bank" means something completely different in "I walked to the river bank" versus "I deposited money in the bank." The meaning of every word depends on what preceded it. Understanding language requires memory of context.

A Recurrent Neural Network (RNN) is the novel reader. At every word, it maintains a hidden state — a condensed memory of everything it has seen so far — and uses both that memory and the current word to decide what to do next. The nervousness of the butler mentioned on page one still echoes faintly in the hidden state by the final chapter.

🧠

The One-Line Definition

An RNN is a neural network that feeds its own hidden state back in as an additional input at the next time step, creating a loop of memory through time.

Section 02

What Is a Recurrent Neural Network?

Before RNNs, the dominant approach to sequence data was to flatten everything into a fixed-size vector and feed it to a standard dense network. This broke the temporal structure of language completely. RNNs were invented to respect the natural order of sequences.

📷

Feedforward Network

No memory

Every input is treated independently. There is no concept of "what came before." Works perfectly for images, tabular data — anything where the input is complete and self-contained.

✗ Blind to sequence order

🔁

Recurrent Network (RNN)

Hidden state memory

Processes inputs one step at a time. After each step, it passes a hidden state to the next step — like handing a note to your future self. Designed for sequences: text, audio, time series.

✓ Captures temporal dependencies

🔊

Typical Sequence Tasks

Applications

Sentiment analysis, machine translation, speech recognition, text generation, named entity recognition, time series forecasting, music generation.

✓ Order matters for all of these

The key insight is simple: at each time step t, the RNN receives two inputs — the current word/token x_t and the previous hidden state h_t-1 — and produces a new hidden state h_t. This hidden state is simultaneously the output of the current step and the memory passed to the next.

Section 03

RNN Architecture — Folded & Unrolled

An RNN is typically drawn in two ways: folded (compact, showing the self-loop) and unrolled through time (expanded, showing each time step as a separate column). Both represent exactly the same computation. The unrolled view is far easier to reason about.

▶ Unrolled RNN — Three Time Steps

The amber arrows show the recurrent connection — each hidden state is built from the current input and the previous hidden state. Indigo = input flow. Green = output (not always present at every step).

🔑

Folded vs Unrolled — Same Network

The "folded" diagram shows a single box with an arrow pointing back to itself. The "unrolled" diagram copies that box once per time step. The weights (W, U, b) are shared across every time step — there are not separate weights per step. This is called parameter sharing and it is what allows an RNN to process sequences of any length.
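Parameter sharing can be checked directly. The sketch below is a minimal illustration that assumes PyTorch's nn.RNN and toy sizes (8 input features, 16 hidden units) chosen here, not taken from the article. It counts the weights once and then runs the same layer over a 5-step and a 50-step sequence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One recurrent layer: 8 input features, 16 hidden units (illustrative sizes).
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

# W_x (16x8) + W_h (16x16) + two bias vectors of 16 (PyTorch keeps two biases).
n_params = sum(p.numel() for p in rnn.parameters())
print(n_params)  # 416

# The same weights handle a 5-step and a 50-step sequence.
short_out, _ = rnn(torch.randn(1, 5, 8))
long_out, _ = rnn(torch.randn(1, 50, 8))
print(short_out.shape, long_out.shape)
```

The parameter count depends only on the input and hidden sizes, never on the sequence length.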

Section 04

The Mathematics of a Vanilla RNN

The vanilla RNN has two equations. Everything the network does collapses to these two lines. Once you understand them, you understand the entire forward pass.

Hidden State Update

hₜ = tanh(Wₕ · hₜ₋₁ + Wₓ · xₜ + bₕ)

Combine the previous memory (h_t-1) with the new input (x_t), then apply a non-linearity. tanh squashes each component into (−1, 1), which keeps the hidden activations bounded. This is the memory write step.

Output at Each Step

yₜ = softmax(W_y · hₜ + b_y)

Translate the hidden state into a prediction. For many tasks (like sentiment analysis), you only use the final y_T. For tasks like translation, you need every y_t. The output layer is a standard dense layer.

Weight Matrices

Wₕ, Wₓ, W_y

W_h: hidden-to-hidden (recurrent) weights — this is the "memory" matrix.
W_x: input-to-hidden weights — processes the current token.
W_y: hidden-to-output weights — makes predictions. All three are shared across every time step.

Initial Hidden State

h₀ = zeros(hidden_dim)

Before the first word, the network has no memory. h₀ is conventionally initialised to a vector of zeros. Some architectures learn h₀ as a trainable parameter, which can marginally improve performance on short sequences.

🤖 Worked Example — Processing "I love Paris"

t = 1

Input: x₁ = embed("I") — a vector of numbers representing the word "I". Hidden: h₁ = tanh(Wₕ·h₀ + Wₓ·x₁ + bₕ). Since h₀ = 0, this is just processing "I" alone.

t = 2

Input: x₂ = embed("love"). Hidden: h₂ = tanh(Wₕ·h₁ + Wₓ·x₂ + bₕ). Now h₁ carries information about "I" — so "love" is processed in the context of "I".

t = 3

Input: x₃ = embed("Paris"). Hidden: h₃ = tanh(Wₕ·h₂ + Wₓ·x₃ + bₕ). h₂ carries a compressed trace of "I love" — so "Paris" is understood as a loved city, not a random noun.

Output

For sentiment analysis: y = softmax(W_y · h₃ + b_y) → [0.04, 0.96] → class: Positive. The final hidden state h₃ summarises the entire sequence.
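The worked example above can be sketched in a few lines of NumPy. This is a toy forward pass under stated assumptions: the embeddings and weights are random stand-ins (nothing is trained), and the sizes (4-dimensional embeddings, 3 hidden units, 2 classes) are chosen only to keep the arrays readable.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim, num_classes = 4, 3, 2  # toy sizes, chosen for illustration

# Stand-in embeddings for "I", "love", "Paris" (random, not learned).
xs = rng.normal(size=(3, embed_dim))

# Shared parameters, reused at every time step.
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.5, size=(hidden_dim, embed_dim))
b_h = np.zeros(hidden_dim)
W_y = rng.normal(scale=0.5, size=(num_classes, hidden_dim))
b_y = np.zeros(num_classes)

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

h = np.zeros(hidden_dim)     # h_0 = 0
for x_t in xs:
    h = np.tanh(W_h @ h + W_x @ x_t + b_h)   # hidden state update

y = softmax(W_y @ h + b_y)   # many-to-one: predict from the final state only
print(h, y)
```

Because the weights are random, the class probabilities are meaningless here; the point is the data flow, with one loop body reused for every token.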

Section 05

From Words to Vectors: The Text Preprocessing Pipeline

Neural networks cannot process raw text. Every word must become a vector of numbers before an RNN can touch it. This pipeline is the same regardless of which RNN variant you use.

Tokenisation

Split the raw text into tokens, usually words or subwords. For the sentence "I love Paris!" word-level tokenisation gives ["I", "love", "Paris", "!"]. Subword tokenisation splits further: WordPiece (used by BERT) might give ["I", "love", "Par", "##is", "!"], while GPT models use byte-pair encoding, which follows the same idea with different markers. For vanilla RNNs, word-level is most common.

Build a Vocabulary

Assign a unique integer to each token in your training corpus. {"<PAD>": 0, "<OOV>": 1, "I": 2, "love": 3, "Paris": 4, …}. Limit vocabulary to the top N most frequent words (e.g., 20,000) to manage memory. All unseen words map to the <OOV> (out-of-vocabulary) token.

Integer Encoding

Convert token strings to their vocabulary integers. "I love Paris" → [2, 3, 4]. This produces a sequence of integers — one per token.

Padding & Truncation

Sequences in a mini-batch must be the same length for efficient GPU computation. Short sequences get <PAD> (0) tokens appended (post-padding). Long sequences get clipped. A typical max length is 128–512 tokens.

Word Embeddings

Each integer is looked up in an Embedding matrix — a learned table of size [vocab_size × embed_dim]. Integer 4 (Paris) → a dense 128-dimensional vector. Initially random, these vectors are learned during training so that semantically similar words end up close in vector space. You can also load pre-trained embeddings (GloVe, Word2Vec, FastText) to jumpstart the network.

💡

What the RNN Actually Sees

After this pipeline, the RNN receives a 3D tensor of shape [batch_size, sequence_length, embed_dim] — e.g. [32, 100, 128] means 32 reviews, each 100 tokens long, each token represented by a 128-dimensional vector. At each of the 100 time steps, the RNN processes one 128-dimensional slice.
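The embedding lookup that produces this tensor is just row indexing. The NumPy sketch below assumes a small vocabulary of 1,000 entries and random token IDs standing in for real reviews; only the shapes mirror the [32, 100, 128] example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 1_000, 128          # small vocabulary to keep the sketch light
batch_size, seq_len = 32, 100

# The embedding matrix: one row per vocabulary entry (random here, learned in practice).
embedding = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
embedding[0] = 0.0                          # row 0 is <PAD>

# A batch of integer-encoded reviews (random IDs standing in for real data).
token_ids = rng.integers(low=1, high=vocab_size, size=(batch_size, seq_len))

# The lookup is plain row indexing: every integer is replaced by its vector.
batch = embedding[token_ids]
print(batch.shape)        # (32, 100, 128)

# The slice the RNN consumes at time step t = 0.
step_0 = batch[:, 0, :]
print(step_0.shape)       # (32, 128)
```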

Section 06

Backpropagation Through Time (BPTT)

Training an RNN requires computing gradients. Because the network is unrolled through time, the gradients must flow backwards through every time step. This is called Backpropagation Through Time (BPTT).

📈 BPTT — How Gradients Flow Backward

Forward

Run the RNN forward: x₁ → h₁ → x₂ → h₂ → … → h_T → ŷ. Compute loss L = CrossEntropy(ŷ, y).

∂L/∂h_T

Start at the output. Compute gradient of loss with respect to the final hidden state h_T. Standard backprop here — nothing special yet.

Chain Rule

∂L/∂h_{t} = (∂L/∂h_{t+1}) · (∂h_{t+1}/∂h_t). This requires multiplying the gradient by the weight matrix Wₕ and the tanh derivative at every step. For a sequence of length T, you multiply this T times.

Weight Update

Gradient for shared weights Wₕ is the sum of gradients accumulated across all time steps. W ← W − η · Σ(∂L/∂W_t). Update once after the full backward pass.

⚠️

Truncated BPTT — Practical Training Trick

Full BPTT over very long sequences (thousands of steps) is computationally expensive and memory-intensive. In practice, Truncated BPTT splits the sequence into chunks of length K (e.g., 35 steps), runs forward and backward within each chunk, and passes the hidden state across chunk boundaries but stops gradient flow at the boundary. This is the standard approach in language model training.
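A minimal sketch of truncated BPTT in PyTorch, assuming a toy regression task on random data (the 140-step sequence, layer sizes, and SGD settings are illustrative choices, not from the article). The key line is hidden.detach(), which carries the state values across the chunk boundary while cutting the gradient path.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
K = 35                                   # chunk length, as in the text
seq = torch.randn(1, 140, 8)             # one toy sequence: 140 steps, 8 features
target = torch.randn(1, 140, 1)          # toy regression target per step

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
params = list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=0.01)

hidden = None                            # None means "start from zeros"
losses = []
for start in range(0, seq.size(1), K):
    x = seq[:, start:start + K]
    y = target[:, start:start + K]

    out, hidden = rnn(x, hidden)         # forward within the chunk
    loss = nn.functional.mse_loss(head(out), y)

    opt.zero_grad()
    loss.backward()                      # backward only reaches this chunk
    nn.utils.clip_grad_norm_(params, max_norm=1.0)
    opt.step()

    hidden = hidden.detach()             # keep the values, cut the gradient path
    losses.append(loss.item())

print(len(losses))  # 4 chunks of 35 steps
```

Without the detach call, the second backward pass would try to propagate into the first chunk's graph, which has already been freed.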

Section 07

The Vanishing Gradient Problem

📖 Story

The Broken Telephone Game

Remember the children's game where you whisper a message down a long line of people? By the time it reaches the last person, the original message is unrecognisably garbled. Each person introduces a tiny distortion. Multiply tiny distortions over 50 people and the signal is gone.

The vanishing gradient problem is exactly this. During BPTT, the gradient signal is multiplied by the Jacobian matrix at every time step. If the spectral radius of that matrix is slightly less than 1 (which it almost always is with tanh), then multiplying it over 50 or 100 steps drives the gradient exponentially toward zero. The weights near step 1 receive a gradient so small it may as well be zero — they never learn. The network becomes effectively blind to long-range dependencies.

🔴 Short Dependency — RNN Succeeds

Token	Distance to Output	Gradient Signal
"terrible"	2 steps	Strong ✓
"movie"	1 step	Strong ✓
[output]	0	Negative sentiment ✓

🔴 Long Dependency — RNN Fails

Token	Distance to Output	Gradient Signal
"The food was great"	80 steps	≈ 0 (vanished) ✗
"but the service was"	4 steps	Weak
"awful"	1 step	Strong ✓
[output]	0	Incorrectly negative ✗
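The shrinkage can be reproduced numerically. The NumPy sketch below is an illustration under one deliberate assumption: the recurrent matrix is built so that every singular value is 0.9, which guarantees the per-step Jacobian has norm below 1. It pushes a unit-norm gradient back through 50 steps and records its norm after each step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 16, 50

# Recurrent matrix with every singular value equal to 0.9 (an illustrative choice).
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
W_h = 0.9 * Q

# Forward pass with random inputs, remembering each hidden state.
h = np.zeros(n)
states = []
for _ in range(T):
    h = np.tanh(W_h @ h + rng.normal(size=n))
    states.append(h)

# Backward pass: push a unit-norm gradient from step T back to step 1.
grad = np.ones(n) / np.sqrt(n)
norms = []
for h_t in reversed(states):
    # Jacobian of h_t w.r.t. h_{t-1} is diag(1 - h_t^2) @ W_h; apply its transpose.
    grad = W_h.T @ ((1.0 - h_t ** 2) * grad)
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[9], norms[-1])
```

Each step multiplies the norm by at most 0.9, and the tanh derivative shrinks it further, so after 50 steps the gradient is bounded by 0.9^50 (about 0.005) and is usually far smaller.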

💥

Exploding Gradients — The Opposite Problem

When the Jacobian's spectral radius is greater than 1, gradients grow exponentially — weights receive gigantic updates and training diverges. The fix is simple: gradient clipping — cap the gradient norm at a threshold (typically 1.0 or 5.0) before applying the update. Vanishing gradients are far harder to fix, which is why LSTM and GRU were invented.
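Norm-based clipping is a single rescaling step. The helper below is a hypothetical NumPy function written for illustration (it is not a library API): it measures the joint L2 norm of all gradient arrays and scales them down together when that norm exceeds the threshold, so the update direction is preserved.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # joint norm = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Gradients whose norm is already below the threshold pass through unchanged, because the scale factor is capped at 1.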

Section 08

LSTM — Long Short-Term Memory

Sepp Hochreiter and Jürgen Schmidhuber introduced the LSTM in 1997 to solve the vanishing gradient problem. The key innovation: instead of relying on tanh to pass gradients through time (which squashes them), LSTM introduces a cell state — a separate memory channel that can carry information across hundreds of steps almost unchanged, because it uses addition (not multiplication) to update.

📖 Analogy

A Notebook with Selective Amnesia

Imagine you have a notebook (the cell state) and a pen. At each step you can:

1. Erase some old notes (forget gate — selectively wipe irrelevant history).
2. Write new notes (input gate — add new relevant information).
3. Read from the notebook to act (output gate — decide what to show to the outside world).

The notebook itself persists intact across long conversations — you do not rewrite it every time you speak. Only targeted edits happen at each step, so old information can survive for hundreds of steps without degrading.

▶ LSTM Cell — Internal Architecture

The amber highway at the top is the cell state — it flows with minimal interference, solving the vanishing gradient problem. Each gate (Forget, Input, Output) uses a sigmoid (σ) to produce values between 0 and 1, acting as a soft on/off switch.

The Four LSTM Equations

Forget Gate

fₜ = σ(W_f · [hₜ₋₁, xₜ] + b_f)

Output ∈ [0,1] for each cell value. 0 = completely erase, 1 = completely keep. Learns to forget irrelevant history — e.g., after reading a full-stop, forget the previous sentence's subject.

Input Gate + Candidate

iₜ = σ(Wᵢ · [hₜ₋₁, xₜ] + bᵢ)
c̃ₜ = tanh(W_c · [hₜ₋₁, xₜ] + b_c)

i_t decides how much to write; c̃_t is what to write. Together they perform a selective write to the cell state.

Cell State Update

Cₜ = fₜ ⊗ Cₜ₋₁ + iₜ ⊗ c̃ₜ

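The key equation. The old cell state is selectively forgotten (⊗ f_t), then the new candidate is selectively added (⊗ i_t). Because the update is additive, the gradient flowing from C_t back to C_{t-1} is scaled only by the forget gate f_t, with no weight matrix or tanh derivative in the way. When f_t stays close to 1, the gradient passes through almost unchanged.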

Output Gate + Hidden State

oₜ = σ(Wₒ · [hₜ₋₁, xₜ] + bₒ)
hₜ = oₜ ⊗ tanh(Cₜ)

The output gate decides what part of the cell state to expose as the hidden state. h_t is the "working memory" passed to the next step and used for predictions.
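The LSTM equations above translate almost line for line into NumPy. This is a sketch under stated assumptions: the four weight matrices are stacked into one matrix W (rows ordered forget, input, candidate, output), the weights are random, and the sizes are toy values. None of that layout comes from the article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4n, n + d); its rows are stacked as f, i, c~, o."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b    # all four pre-activations at once
    f = sigmoid(z[0 * n:1 * n])                   # forget gate
    i = sigmoid(z[1 * n:2 * n])                   # input gate
    c_tilde = np.tanh(z[2 * n:3 * n])             # candidate values
    o = sigmoid(z[3 * n:4 * n])                   # output gate
    c_t = f * c_prev + i * c_tilde                # cell state update
    h_t = o * np.tanh(c_t)                        # exposed hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
n, d = 5, 3                                       # hidden size, input size (toy values)
W = rng.normal(scale=0.3, size=(4 * n, n + d))
b = np.zeros(4 * n)

h, c = np.zeros(n), np.zeros(n)
for x_t in rng.normal(size=(6, d)):               # six random time steps
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

Setting the forget-gate bias very high and the input-gate bias very low makes the step copy the cell state forward untouched, which is the "notebook persists" behaviour in miniature.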

Section 09

GRU — Gated Recurrent Unit

Kyunghyun Cho et al. introduced the GRU in 2014 as a simpler alternative to LSTM. The GRU merges the cell state and hidden state into one, and uses only two gates instead of three — reducing the parameter count by roughly 25% with comparable performance on most tasks.

Update Gate

zₜ = σ(W_z · [hₜ₋₁, xₜ])

Controls how much of the old hidden state to keep vs how much of the new candidate to accept. 0 = keep old state entirely (skip this step), 1 = fully replace with candidate. Analogous to LSTM's forget + input gates combined.

Reset Gate

rₜ = σ(Wᵣ · [hₜ₋₁, xₜ])

Controls how much of the previous hidden state to use when computing the candidate. 0 = ignore the past entirely (fresh start), 1 = use the full past. This lets the GRU "reset" its context — useful for topic changes.

Candidate Hidden State

h̃ₜ = tanh(W · [rₜ ⊗ hₜ₋₁, xₜ])

Compute the proposed new state using the reset-filtered past and the current input. This is the candidate for the new hidden state — how much of it to accept is decided by the update gate.

Final Hidden State

hₜ = (1 − zₜ) ⊗ hₜ₋₁ + zₜ ⊗ h̃ₜ

Linear interpolation between old state and candidate. If z_t ≈ 0: h_t ≈ h_{t-1} (copy forward, ignore current input — used for filler words). If z_t ≈ 1: h_t ≈ h̃_t (fully update — used for important new words).
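A minimal NumPy sketch of one GRU step, following the convention of the final-state equation above (z_t ≈ 1 means "fully update"). Some libraries, PyTorch among them, swap the roles of z_t and 1 − z_t, so compare against a library's documentation before porting weights. The weight names and toy sizes below are assumptions made for illustration, and biases are omitted as in the equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step; each weight matrix has shape (n, n + d)."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx)                                   # update gate
    r = sigmoid(W_r @ hx)                                   # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                 # interpolate old vs new

rng = np.random.default_rng(0)
n, d = 5, 3                                                 # toy hidden and input sizes
W_z, W_r, W_h = (rng.normal(scale=0.3, size=(n, n + d)) for _ in range(3))

h = np.zeros(n)
for x_t in rng.normal(size=(6, d)):                         # six random time steps
    h = gru_step(x_t, h, W_z, W_r, W_h)
print(h.shape)
```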

LSTM vs GRU — Side-by-Side

Property	LSTM	GRU
Gates	3 (forget, input, output)	2 (update, reset)
State vectors	2 (h_t, C_t)	1 (h_t only)
Parameters	4 × (n² + n·d)	3 × (n² + n·d) — ~25% fewer
Training speed	Slower	Faster
Long sequences	Slightly better — separate C_t helps	Can struggle past ~500 steps
Short-medium sequences	Excellent	Excellent — often matches LSTM
When to prefer	Long sequences, complex temporal patterns	Faster prototyping, limited compute
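The parameter row of the table can be checked against a real implementation. The sketch below assumes PyTorch and illustrative sizes (d = 128 inputs, n = 256 hidden units). PyTorch keeps two bias vectors per gate block, so its counts include a 2·n bias term that the table's formula leaves out.

```python
import torch.nn as nn

d, n = 128, 256   # input size and hidden size (illustrative)

def count(module):
    return sum(p.numel() for p in module.parameters())

lstm_params = count(nn.LSTM(input_size=d, hidden_size=n))
gru_params = count(nn.GRU(input_size=d, hidden_size=n))

# PyTorch stores two bias vectors per gate block, hence the 2 * n term.
assert lstm_params == 4 * (n * n + n * d + 2 * n)
assert gru_params == 3 * (n * n + n * d + 2 * n)
print(lstm_params, gru_params, gru_params / lstm_params)
```

The ratio comes out at exactly 3/4, which is where the "~25% fewer" figure comes from.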

🎯

Which to Use in Practice?

Start with GRU — it trains faster, performs comparably on most NLP tasks, and is less likely to overfit on small datasets due to fewer parameters. Upgrade to LSTM if your sequences are long (>200 tokens) or if GRU plateaus and you have compute budget to spare. In most production systems, the difference is under 1% accuracy.

Section 10

RNN Architecture Patterns

The same RNN cell can be wired in fundamentally different ways depending on your task. Andrej Karpathy's famous taxonomy describes five patterns, and understanding which to use is one of the most important design decisions in sequence modelling.

▶ Five RNN Wiring Patterns

Blue = inputs, Amber = hidden states (RNN cells), Green = outputs, Purple = context vector (encoder summary). The pattern you choose depends entirely on your task's input/output structure.

📷

One-to-One

Standard NN

Single input → single output. No recurrence needed. Image classification, single-value regression.

🎙️

One-to-Many

Sequence Generator

Single seed → sequence of outputs. Image captioning (one image → many caption words), music generation from a seed note.

💬

Many-to-One

Sequence Classifier

Sequence of inputs → single output. Sentiment analysis (many words → positive/negative), document classification.

🏷

Many-to-Many (Synced)

Token Tagger

Each input timestep gets an output. Named Entity Recognition (each word → entity tag), part-of-speech tagging.

🌐

Seq2Seq (Async)

Encoder-Decoder

Variable-length input → variable-length output. Machine translation, text summarisation, chatbots. Encoder reads, decoder generates.

📋

Bidirectional RNN

Context from Both Sides

Two RNNs: one reads left-to-right, one right-to-left. Their hidden states are concatenated. Essential for tasks where future context matters — e.g., NER, question answering.
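In code, the difference between the many-to-one and synced many-to-many patterns is mostly a question of which tensor feeds the output layer. The PyTorch sketch below uses toy sizes and two hypothetical heads (a 2-class sentence classifier and a 5-tag token tagger) purely to show the shapes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, d, n = 4, 10, 8, 16        # toy sizes
x = torch.randn(batch, seq_len, d)

gru = nn.GRU(input_size=d, hidden_size=n, batch_first=True)
outputs, h_last = gru(x)                   # outputs: every step; h_last: final step

# Many-to-one (e.g. sentiment): classify from the final hidden state only.
to_class = nn.Linear(n, 2)
sentence_logits = to_class(h_last[-1])     # (batch, 2)

# Many-to-many, synced (e.g. NER): one prediction per input token.
to_tag = nn.Linear(n, 5)                   # 5 hypothetical entity tags
token_logits = to_tag(outputs)             # (batch, seq_len, 5)

print(sentence_logits.shape, token_logits.shape)
```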

Section 11

Complete Text Preprocessing Pipeline — Code

Before building the model, you must convert raw text into padded integer sequences. This code is universal — it works the same whether you use a vanilla RNN, LSTM, or GRU.

import numpy as np
from collections import Counter
import re

# ── 1. Raw data ───────────────────────────────────────────
reviews = [
    'The film was absolutely brilliant and deeply moving',
    'Terrible acting, boring script, complete waste of time',
    'A masterpiece of modern cinema — breathtaking visuals',
    'I fell asleep halfway through, incredibly dull',
    'Outstanding performances by the entire cast',
]
labels = [1, 0, 1, 0, 1]  # 1 = positive, 0 = negative

# ── 2. Tokenise (simple word-level) ───────────────────────
def tokenise(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  # strip punctuation
    return text.split()

tokenised = [tokenise(r) for r in reviews]

# ── 3. Build vocabulary ───────────────────────────────────
VOCAB_SIZE = 10_000
PAD_TOKEN  = '<PAD>'
OOV_TOKEN  = '<OOV>'

all_words = [word for tokens in tokenised for word in tokens]
word_freq = Counter(all_words)
vocab = [PAD_TOKEN, OOV_TOKEN] + [w for w, _ in word_freq.most_common(VOCAB_SIZE - 2)]
word2idx = {w: i for i, w in enumerate(vocab)}

print(f'Vocabulary size: {len(vocab)}')

# ── 4. Encode sequences ───────────────────────────────────
def encode(tokens, word2idx):
    return [word2idx.get(t, 1) for t in tokens]  # 1 = OOV index

encoded = [encode(t, word2idx) for t in tokenised]

# ── 5. Pad / truncate to fixed length ────────────────────
MAX_LEN = 50

def pad_sequence(seq, max_len, pad_val=0):
    seq = seq[:max_len]            # truncate if too long
    seq = seq + [pad_val] * (max_len - len(seq))  # pad if too short
    return seq

X = np.array([pad_sequence(s, MAX_LEN) for s in encoded])
y = np.array(labels)

print(f'X shape: {X.shape}')   # (5, 50)
print(f'First review encoded:\n{X[0]}')

OUTPUT

Vocabulary size: 42 X shape: (5, 50) First review encoded: [ 4 20 3 7 5 12 2 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Section 12

Sentiment Analysis with Keras LSTM

Now we build a real many-to-one LSTM classifier for sentiment analysis on the IMDB movie reviews dataset. The model reads the full review (many words) and outputs a single probability (positive or negative).

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# ── 1. Load IMDB dataset (built into Keras) ───────────────
VOCAB_SIZE = 20_000
MAX_LEN    = 200

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=VOCAB_SIZE)
print(f'Training samples : {len(X_train):,}')
print(f'Test samples     : {len(X_test):,}')

# ── 2. Pad sequences ──────────────────────────────────────
X_train = pad_sequences(X_train, maxlen=MAX_LEN, padding='post', truncating='post')
X_test  = pad_sequences(X_test,  maxlen=MAX_LEN, padding='post', truncating='post')
print(f'X_train shape    : {X_train.shape}')  # (25000, 200)

# ── 3. Build the LSTM model ───────────────────────────────
model = keras.Sequential([
    # Embedding: integer IDs → dense vectors
    layers.Embedding(
        input_dim=VOCAB_SIZE,
        output_dim=128,      # each word → 128-dim vector
        input_length=MAX_LEN,
        mask_zero=True,       # ignore PAD tokens in LSTM
    ),
    # Stacked LSTM layers
    layers.LSTM(128, return_sequences=True,  # return all hidden states
               dropout=0.2, recurrent_dropout=0.2),
    layers.LSTM(64,  return_sequences=False, # return only final state
               dropout=0.2, recurrent_dropout=0.2),
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(1,  activation='sigmoid'),  # binary output
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss='binary_crossentropy',
    metrics=['accuracy']
)
model.summary()

# ── 4. Train ──────────────────────────────────────────────
callbacks = [
    keras.callbacks.EarlyStopping(monitor='val_loss', patience=3,
                                   restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(patience=2, factor=0.5),
]

history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=64,
    validation_split=0.1,
    callbacks=callbacks,
    verbose=1
)

# ── 5. Evaluate on held-out test set ─────────────────────
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f'\nTest Accuracy : {acc:.4f}')
print(f'Test Loss     : {loss:.4f}')

# ── 6. Predict on new text ────────────────────────────────
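word_index = imdb.get_word_index()
def predict_sentiment(text, model, word_index, max_len=MAX_LEN):
    # imdb.load_data() shifts every word index by 3 (0 = PAD, 1 = START, 2 = OOV),
    # so the raw get_word_index() values need the same offset here.
    tokens = [w.strip('.,!?;:"()') for w in text.lower().split()]
    seq = [1]  # START marker, as prepended by imdb.load_data()
    for w in tokens:
        idx = word_index.get(w, -1) + 3
        seq.append(idx if 3 <= idx < VOCAB_SIZE else 2)  # 2 = OOV
    seq = pad_sequences([seq], maxlen=max_len, padding='post', truncating='post')
    prob = model.predict(seq, verbose=0)[0][0]
    label = 'POSITIVE' if prob > 0.5 else 'NEGATIVE'
    return label, float(prob)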

label, confidence = predict_sentiment(
    'This film was a masterpiece, I loved every second of it',
    model, word_index
)
print(f'Sentiment : {label}  ({confidence:.2%})')

OUTPUT

Training samples : 25,000
Test samples     : 25,000
X_train shape    : (25000, 200)
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 embedding (Embedding)       (None, 200, 128)          2,560,000
 lstm (LSTM)                 (None, 200, 128)          131,584
 lstm_1 (LSTM)               (None, 64)                49,408
 dense (Dense)               (None, 32)                2,080
 dropout (Dropout)           (None, 32)                0
 dense_1 (Dense)             (None, 1)                 33
=================================================================
Total params: 2,743,105 (10.46 MB)
_________________________________________________________________
Epoch 1/10 - loss: 0.5821 - accuracy: 0.6814 - val_accuracy: 0.8256
Epoch 2/10 - loss: 0.3512 - accuracy: 0.8615 - val_accuracy: 0.8634
Epoch 3/10 - loss: 0.2793 - accuracy: 0.8912 - val_accuracy: 0.8802
Epoch 4/10 - loss: 0.2208 - accuracy: 0.9145 - val_accuracy: 0.8810
Epoch 5/10 - loss: 0.1844 - accuracy: 0.9314 - val_accuracy: 0.8801
[EarlyStopping] Restoring best weights from epoch 4.

Test Accuracy : 0.8824
Test Loss     : 0.2891
Sentiment : POSITIVE  (94.71%)

🎯

Why return_sequences=True on the First LSTM?

When stacking LSTM layers, the first layer must output a hidden state at every time step — not just the last — so the second LSTM has a full sequence to process. Set return_sequences=True on all LSTM layers except the final one. The last LSTM uses the default return_sequences=False to produce a single summary vector for the dense classifier.

Section 13

Sentiment Analysis with PyTorch LSTM

PyTorch gives you more control over the LSTM — you write the forward pass explicitly, which is invaluable when you need custom architectures like attention over the hidden states.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np

# ── 1. Custom Dataset ─────────────────────────────────────
class TextDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.long)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# ── 2. LSTM Classifier Model ──────────────────────────────
class LSTMSentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim,
                 num_layers, num_classes, dropout=0.3, pad_idx=0):
        super().__init__()

        # Embedding layer — pad tokens contribute no gradient
        self.embedding = nn.Embedding(
            vocab_size, embed_dim, padding_idx=pad_idx
        )

        # Stacked, bidirectional LSTM
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,       # input: (batch, seq, features)
            bidirectional=True,     # run forwards AND backwards
            dropout=dropout if num_layers > 1 else 0.0
        )

        # Classifier head — ×2 for bidirectional
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, 64),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(64, num_classes),
        )

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, seq_len)  → integer token IDs
        embedded = self.dropout(self.embedding(x))
        # embedded: (batch, seq_len, embed_dim)

        output, (hidden, cell) = self.lstm(embedded)
        # hidden: (num_layers * 2, batch, hidden_dim)  ← bidirectional

        # Concatenate final forward and backward hidden states
        fwd  = hidden[-2, :, :]  # last layer, forward direction
        bwd  = hidden[-1, :, :]  # last layer, backward direction
        combined = torch.cat([fwd, bwd], dim=1)  # (batch, hidden*2)

        logits = self.classifier(combined)        # (batch, num_classes)
        return logits

# ── 3. Training loop ──────────────────────────────────────
DEVICE     = 'cuda' if torch.cuda.is_available() else 'cpu'
VOCAB_SIZE = 20_000
EMBED_DIM  = 128
HIDDEN_DIM = 256
NUM_LAYERS = 2
BATCH_SIZE = 64
EPOCHS     = 10
LR         = 1e-3

model = LSTMSentimentClassifier(
    vocab_size=VOCAB_SIZE, embed_dim=EMBED_DIM,
    hidden_dim=HIDDEN_DIM, num_layers=NUM_LAYERS, num_classes=2
).to(DEVICE)

optimizer  = optim.Adam(model.parameters(), lr=LR, weight_decay=1e-5)
scheduler  = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=2)
criterion  = nn.CrossEntropyLoss()

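# Assumes train_loader, val_loader and test_loader are DataLoaders built from
# TextDataset (their construction is not shown here).

def evaluate(model, loader, criterion, device):
    """Return (mean loss per batch, accuracy) over a DataLoader, without updating weights."""
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for X_batch, y_batch in loader:
            X_batch, y_batch = X_batch.to(device), y_batch.long().to(device)
            logits = model(X_batch)
            total_loss += criterion(logits, y_batch).item()
            correct += (logits.argmax(dim=1) == y_batch).sum().item()
            total += y_batch.size(0)
    return total_loss / len(loader), correct / total
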
best_val_acc = 0.0

for epoch in range(1, EPOCHS + 1):
    model.train()
    total_loss, correct, total = 0.0, 0, 0

    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(DEVICE), y_batch.long().to(DEVICE)

        optimizer.zero_grad()
        logits = model(X_batch)
        loss   = criterion(logits, y_batch)
        loss.backward()

        # Gradient clipping — crucial for RNNs
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

        total_loss += loss.item()
        preds = logits.argmax(dim=1)
        correct += (preds == y_batch).sum().item()
        total   += y_batch.size(0)

    train_acc  = correct / total
    val_loss, val_acc = evaluate(model, val_loader, criterion, DEVICE)
    scheduler.step(val_loss)

    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), 'best_lstm.pt')

    print(f'Epoch {epoch:02d} | train_acc={train_acc:.4f} | val_acc={val_acc:.4f}')

# ── 4. Test evaluation ────────────────────────────────────
model.load_state_dict(torch.load('best_lstm.pt'))
_, test_acc = evaluate(model, test_loader, criterion, DEVICE)
print(f'\nFinal Test Accuracy: {test_acc:.4f}')


🔨

Why Gradient Clipping Is Essential for RNNs

The line clip_grad_norm_(model.parameters(), max_norm=1.0) is not optional for RNNs — it prevents exploding gradients from destabilising training. Always clip before the optimizer step. A max_norm of 1.0–5.0 is the standard range; monitor your gradient norms (via torch.nn.utils.clip_grad_norm_'s return value) to tune this. If it rarely triggers, the norm is fine. If it always triggers, reduce your learning rate.

Section 14

Bidirectional RNNs

A standard RNN processes text left-to-right. But for many NLP tasks, the words after a target word are just as informative as the words before. Bidirectional RNNs solve this by running two RNNs in parallel: one forward (→), one backward (←), then concatenating their hidden states.

📗

Why Bidirectionality Matters

In the sentence "The bank by the river was flooded", the word "bank" is ambiguous if you only read left-to-right. But reading the full context in both directions, "river" (right context) clearly disambiguates "bank" as a riverbank, not a financial institution. Bidirectional RNNs capture both left and right context simultaneously at every position.

➡️

Forward RNN

left → right

Reads tokens in normal order. Hidden state h→_t encodes everything seen up to and including position t. Good for the "what has been said so far" context.

⬅️

Backward RNN

right → left

Reads tokens in reverse order. Hidden state h←_t encodes everything from the end of the sequence back to position t. Good for the "what comes after" context.

🔗

Concatenated Output

h = [h→ ‖ h←]

At each position, concatenate forward and backward hidden states: h_t = [h→_t; h←_t]. This doubles the hidden dimension. For classification, use the concatenated final states from both directions.

✓ Used in: NER, QA, BiLSTM-CRF, ELMo
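The concatenation can be inspected on a real layer. The sketch below assumes PyTorch's nn.LSTM with bidirectional=True and toy sizes. It shows that the output width doubles, and that the "final" state of each direction sits at opposite ends of the sequence: the forward direction finishes at the last position, the backward direction at the first.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
H = 16
lstm = nn.LSTM(input_size=8, hidden_size=H, batch_first=True, bidirectional=True)
x = torch.randn(2, 7, 8)                    # (batch, seq_len, features), toy sizes

out, (h_n, _) = lstm(x)
print(out.shape)                            # torch.Size([2, 7, 32])

fwd_final = out[:, -1, :H]   # forward direction has seen the whole sequence at the last step
bwd_final = out[:, 0, H:]    # backward direction has seen the whole sequence at the first step

# A sentence-level vector for classification: both final states side by side.
sentence_vec = torch.cat([fwd_final, bwd_final], dim=1)   # (batch, 2 * H)
print(sentence_vec.shape)
```

Taking out[:, -1, :] as a whole would pair the forward summary with a backward state that has seen only one token, a common source of quietly weaker classifiers.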

⚠️

Bidirectional ≠ Usable for Language Generation

Bidirectional RNNs require the complete sequence to exist before processing. They cannot generate text autoregressively (word by word) because the backward pass needs future tokens that do not exist yet. Use bidirectional for understanding tasks (classification, tagging, QA). Use unidirectional for generation tasks (language models, chatbots).

Section 15

RNN vs Transformer — When to Use Which

Transformers (BERT, GPT, T5) have largely superseded RNNs for high-resource NLP tasks. But RNNs are far from obsolete — they remain the better choice in many real-world scenarios. Understanding the trade-offs is essential for building systems that are correct, not just trendy.

Property	RNN / LSTM / GRU	Transformer (BERT / GPT)
Sequence handling	Sequential — one step at a time	Parallel — all tokens at once
Training speed	Slower — cannot parallelise across time	Faster — full GPU utilisation
Long-range dependencies	Degrades after ~500 steps (even LSTM)	Self-attention reaches any distance equally
Memory footprint	Small — linear in sequence length	Quadratic in sequence length (attention matrix)
Online / streaming inference	Natural — process token by token as it arrives	Awkward — needs full context window
Small datasets	Better — fewer parameters, less overfit risk	Worse — data-hungry, needs pre-training to shine
On-device / edge deployment	Excellent — tiny model size	Difficult — BERT-base = 110M params
Interpretability	Moderate — hidden states hard to interpret	Moderate — attention weights can mislead
State-of-the-art NLP accuracy	Good — but not SOTA since ~2019	Best — across almost all NLP benchmarks
Time series / real-time signals	Natural fit — causal, step-by-step	Requires adaptations (Temporal Fusion Transformer)

🏆

The Practitioner's Decision Rule

Choose LSTM/GRU when: you have <50K training examples, you're deploying on constrained hardware, your sequence is streaming in real time, or you're working on time series rather than natural language. Choose Transformers when: you have large data (or can use pre-trained weights), you need state-of-the-art accuracy, and you have access to substantial GPU compute. In many production systems, a well-tuned bidirectional LSTM is the more practical choice than a fine-tuned BERT because of its lower latency and memory cost, even if it gives up some accuracy: not every task needs a 110-million-parameter model.

Section 16

Golden Rules for RNNs in Production

🌟 RNN & LSTM — Non-Negotiable Rules

Always clip gradients. Use torch.nn.utils.clip_grad_norm_(params, max_norm=1.0) in PyTorch or clipnorm=1.0 in Keras. Exploding gradients are the most common reason RNN training diverges — clipping costs nothing and prevents catastrophic failures.

Start with GRU, not vanilla RNN. In a vanilla tanh RNN, gradients can vanish in as few as 10 steps on real sequences. GRU largely avoids this at modest extra cost. Only drop to vanilla RNN if you have strict parameter budget constraints or are running on extreme edge hardware.

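Use mask_zero=True / padding_idx=0. Padding tokens must not contribute to the hidden state updates or the loss. In Keras, set mask_zero=True on the Embedding; the mask is passed on to the recurrent layers, which then skip padded steps. In PyTorch, padding_idx=0 in nn.Embedding only pins the pad embedding at zero and excludes it from gradient updates. The LSTM still steps over padded positions, so also pack the batch with torch.nn.utils.rnn.pack_padded_sequence (or index the last real time step yourself). Ignoring this causes the network to "learn" from artificial zero tokens.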

Apply both dropout variants in LSTM. Use dropout on the inter-layer connections AND recurrent_dropout on the recurrent connections (the hidden-to-hidden weights). In PyTorch, nn.LSTM only applies dropout between layers; recurrent dropout requires a custom cell. Missing recurrent dropout leaves a major overfitting path open.

Set batch_first=True in PyTorch. PyTorch LSTM defaults to (seq_len, batch, features) — the reverse of what NumPy, Keras, and most datasets use. Always pass batch_first=True to match the standard (batch, seq_len, features) convention. Shape mismatches from forgetting this are silent — the code runs but produces nonsense.

Tune hidden size before number of layers. A single well-sized LSTM layer (256–512 units) almost always beats two small ones (64 units each). Adding layers increases vanishing gradient risk. Start with 1–2 layers. Only go deeper if validation loss clearly improves, and always increase dropout in proportion to depth.

Pre-train embeddings for small datasets. With under 20K training examples, random embedding initialisation almost always underperforms. Load GloVe (50d or 100d) or FastText embeddings, set trainable=False initially, then unfreeze after a few epochs. This single change often adds 3–5% accuracy at zero compute cost.

For generation, always use a causal (unidirectional) RNN. Bidirectional models cannot generate text autoregressively — they require the full input sequence before producing output. Use bidirectional for classification, tagging, and understanding. Use left-to-right unidirectional for language modelling, dialogue, and generation.
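The padding rule above can be made concrete in PyTorch. The sketch below assumes a toy vocabulary of 10 tokens and a single sequence with three real tokens followed by two pads. It compares three final hidden states: the LSTM run over the padded input, the LSTM run over a packed input, and a reference run over the real tokens only.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
emb = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)
lstm = nn.LSTM(input_size=4, hidden_size=6, batch_first=True)

tokens = torch.tensor([[5, 3, 7, 0, 0]])      # one sequence: 3 real tokens, 2 pads
lengths = torch.tensor([3])                    # true length, kept on the CPU

with torch.no_grad():
    # Unpacked: the LSTM also steps over the two pad positions.
    _, (h_padded, _) = lstm(emb(tokens))

    # Packed: the LSTM stops after the last real token.
    packed = pack_padded_sequence(emb(tokens), lengths, batch_first=True,
                                  enforce_sorted=False)
    _, (h_packed, _) = lstm(packed)

    # Reference: run on the three real tokens only.
    _, (h_ref, _) = lstm(emb(tokens[:, :3]))

print(torch.allclose(h_packed, h_ref, atol=1e-6))   # True
print(torch.allclose(h_padded, h_ref, atol=1e-6))
```

The packed run matches the reference, while the unpacked run drifts away from it: a zero embedding still passes through the LSTM's biases and recurrent weights at every padded step.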