
Encoder-Decoder Architecture

A comprehensive guide to the Encoder-Decoder (seq2seq) Transformer architecture, covering how the encoder builds contextual representations via self-attention and how the decoder generates output autoregressively via masked self-attention and cross-attention.

Section 01

The Story That Explains Encoder-Decoder

The United Nations Interpreter
Imagine a UN summit. A French diplomat delivers a speech. A human interpreter in the booth listens to every word, builds a complete mental model of the meaning — the intent, the emotion, the diplomatic nuance — and only then speaks it out in English for the audience.

The interpreter does not translate word-by-word. They first encode the full French speech into an internal "meaning representation" inside their head, then decode that meaning into English, one word at a time, attending to what has already been said.

That is exactly what an Encoder-Decoder neural network does. The Encoder reads the entire input and compresses it into a rich internal representation. The Decoder reads that representation and produces the output, step by step, attending to the most relevant parts as it goes.

The Encoder-Decoder architecture (also called sequence-to-sequence or seq2seq) is the backbone of modern machine translation, summarisation, speech recognition, and image captioning. Its two halves also live on separately: encoder-only models such as BERT and decoder-only models such as GPT each keep one stack, while T5 and Whisper use both. Understanding the full architecture therefore explains how all four work under the hood.

🌸
The Core Insight

An Encoder-Decoder solves the fundamental challenge of variable-length input → variable-length output. A plain feed-forward network maps a fixed-size input to a fixed-size output. Seq2seq breaks the problem into two stages: first understand (encode), then generate (decode), with each stage independently flexible in length.


Section 02

The Big Picture — Architecture Overview

Before diving into components, here is the full data flow from input tokens to output tokens.

📋 Full Encoder-Decoder Data Flow
Step 1
Tokenisation — Raw text is split into tokens (words or subwords). Each token gets an integer ID.
Step 2
Embedding — Token IDs are mapped to dense vectors of size d_model (e.g., 512 dimensions).
Step 3
Positional Encoding — Since transformers have no recurrence, position information is injected into embeddings via sinusoidal functions.
Step 4
Encoder Stack — N layers of Self-Attention + Feed-Forward transform the input into a contextual representation matrix.
Step 5
Decoder Stack — N layers of Masked Self-Attention + Cross-Attention + Feed-Forward generate output tokens one at a time.
Step 6
Linear + Softmax — Final decoder output is projected to vocabulary size, giving a probability distribution over the next token.
⚙️ Architecture Diagram — Encoder-Decoder (Transformer)
[Diagram: Encoder-Decoder Transformer]
Encoder: Input tokens ("The cat sat") → Embedding + Positional Encoding → Encoder Layer × N (Self-Attention → Add & Norm → Feed-Forward → Add & Norm) → Context representation, a [seq_len × d_model] matrix that supplies the cross-attention Keys and Values.
Decoder: Output shifted right (<sos> Le chat) → Embedding + Positional Encoding → Decoder Layer × N (Masked Self-Attention → Add & Norm → Cross-Attention → Add & Norm → Feed-Forward → Add & Norm) → Linear Layer + Softmax (probability over vocab) → Output tokens: "Le" → "chat" → "s'est"

ⓘ The encoder runs once; the decoder runs autoregressively — each generated token feeds back as the next input.


Section 03

The Encoder — Reading and Understanding

The Speed-Reader Who Builds a Mind Map
Think of a lawyer reading a 200-page contract before a negotiation. She does not read it once and forget each sentence. Instead, as she reads, every sentence changes her understanding of every other sentence. The word "termination" in clause 3 suddenly recontextualises clause 47. By the end, she holds a rich, interconnected understanding of the whole document simultaneously — not a sequence of facts, but a web of meaning.

The Encoder does the same thing. Through Self-Attention, every token simultaneously attends to every other token. The result is not a single summary vector, but a matrix where every position has been enriched by full context.

Self-Attention — How Every Token Talks to Every Other Token

Self-Attention is the heart of the Encoder. For each token, it computes three vectors: Query (Q), Key (K), and Value (V). Think of it as a soft database lookup: the Query asks a question, the Keys determine relevance, and the Values hold the content.

Query
Q = X · W_Q
What am I looking for? Each token projects itself into a query vector.
Key
K = X · W_K
What do I offer? Each token projects itself into a key vector for others to match against.
Value
V = X · W_V
What information do I carry? The actual content retrieved once relevance is established.
Attention Output
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
Divide the dot-products by √d_k so the softmax does not saturate, apply softmax to get the attention weights, then take the weighted sum of the values.
🔑
Why Divide by √d_k?

With a large key dimension d_k, dot products grow large in magnitude: if the components of q and k are independent with unit variance, q·k has variance d_k. Large scores push softmax into regions with near-zero gradients. Dividing by √d_k (the square root of the key dimension) brings the scores back to roughly unit variance and stabilises training. The original "Attention Is All You Need" paper (Vaswani et al., 2017) introduced this trick.
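A quick numerical check makes the effect concrete. The sketch below assumes query and key components are independent with unit variance (an idealisation, not a property of trained models) and measures the spread of raw and scaled dot products. The helper name `dot_product_std` is hypothetical.

```python
import math
import torch

torch.manual_seed(0)

def dot_product_std(d_k, n=10_000):
    """Empirical std of q·k for random unit-variance vectors, raw and scaled."""
    q = torch.randn(n, d_k)
    k = torch.randn(n, d_k)
    raw = (q * k).sum(dim=-1)        # n independent dot products
    scaled = raw / math.sqrt(d_k)    # the 1/sqrt(d_k) scaling from the formula
    return raw.std().item(), scaled.std().item()

for d_k in (16, 64, 512):
    raw_std, scaled_std = dot_product_std(d_k)
    print(f"d_k={d_k:4d}  raw std={raw_std:6.2f}  scaled std={scaled_std:4.2f}")
```

The raw spread should grow roughly like √d_k, while the scaled spread should stay close to 1 regardless of d_k.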

Multi-Head Attention — Many Perspectives at Once

A single attention head produces one set of attention weights per token, so it tends to capture one kind of relationship at a time. Multi-Head Attention runs h attention heads in parallel, each with its own Q, K, V projections of dimension d_k = d_model / h. Their outputs are concatenated and projected back to d_model.

👁
Head 1
Syntactic Dependency
Might learn subject-verb agreement: "The cats that live next door eat loudly" — links "cats" to "eat".
🔌
Head 2
Coreference
Might learn pronoun resolution: "Mary saw Jane. She smiled." Here the head links "She" to its candidate antecedents, "Mary" or "Jane".
📍
Head 3…h
Semantic Role / Position
Other heads may capture positional proximity, named-entity spans, or domain-specific semantic patterns — all simultaneously.
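To see that each head really keeps its own attention pattern, the sketch below uses PyTorch's built-in `nn.MultiheadAttention` rather than a hand-written layer. The sizes (d_model=64, 8 heads, 5 tokens) are arbitrary illustration values, and the weights are random, so the patterns carry no linguistic meaning.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, num_heads = 64, 8          # illustrative sizes, not from the article
mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
mha.eval()

x = torch.randn(2, 5, d_model)      # [batch, seq_len, d_model]

with torch.no_grad():
    # average_attn_weights=False keeps one weight matrix per head
    out, weights = mha(x, x, x, need_weights=True, average_attn_weights=False)

print(out.shape)      # [batch, seq_len, d_model]
print(weights.shape)  # [batch, num_heads, seq_len, seq_len]
```

Each of the 8 slices along the head dimension is a separate seq_len × seq_len distribution; after training, different slices can specialise in different relationships.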

Encoder Layer — Full Walkthrough

01
Multi-Head Self-Attention
All tokens attend to all tokens. Input shape: [batch, seq_len, d_model]. Output shape: same. This is where context is woven into each position.
02
Add & LayerNorm (Residual)
The original input is added back to the attention output (residual connection), then Layer Normalisation is applied. This prevents gradient vanishing and eases training.
03
Position-wise Feed-Forward Network (FFN)
Two linear layers with a ReLU (or GELU) activation in between. Applied identically to each position: FFN(x) = max(0, xW₁ + b₁)W₂ + b₂. Inner dimension is typically 4× d_model.
04
Add & LayerNorm (again)
Another residual + normalisation around the FFN. This entire block (steps 1–4) repeats N times (e.g., N=6 in the original Transformer). Depth gives the model power to build hierarchical representations.
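It can help to see where the parameters of one encoder layer sit. The sketch below counts them on PyTorch's built-in `nn.TransformerEncoderLayer`, using the original Transformer's sizes (d_model=512, 8 heads, d_ff=2048). `count_params` is a hypothetical helper, and the built-in layer is a stand-in for the block described above, not this article's own implementation.

```python
import torch.nn as nn

def count_params(module):
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())

d_model, num_heads, d_ff = 512, 8, 2048
layer = nn.TransformerEncoderLayer(
    d_model, num_heads, dim_feedforward=d_ff, batch_first=True
)

attn_params = count_params(layer.self_attn)                            # W_Q, W_K, W_V, W_O
ffn_params = count_params(layer.linear1) + count_params(layer.linear2)  # two FFN linears
norm_params = count_params(layer.norm1) + count_params(layer.norm2)     # two LayerNorms

print(f"attention:  {attn_params:,}")
print(f"ffn:        {ffn_params:,}")
print(f"layer norm: {norm_params:,}")
print(f"total:      {count_params(layer):,}")
```

With these sizes the feed-forward network holds about two thirds of the layer's parameters, because its inner dimension is 4× d_model.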

Section 04

The Decoder — Generating Output

The Author Writing a Novel, One Word at a Time
Imagine an author who has read a detailed plot outline (the encoder's context), and is now writing the novel word by word. As she writes each word, she cannot look ahead at what she will write next — only what she has written so far. But she constantly glances back at the plot outline to stay on track. Each new word depends on: (1) all the words she has written so far, and (2) the relevant parts of the outline.

This is the Decoder: it attends to its own past outputs (Masked Self-Attention) and to the encoder's context (Cross-Attention), then predicts the next token.

Three Sub-Layers of the Decoder

🔕
1. Masked Self-Attention
Causal / Autoregressive
Identical to encoder self-attention, but with a causal mask applied. Position i can only attend to positions 0…i. This prevents the model from "cheating" by seeing future tokens during training.
💌
2. Cross-Attention
Encoder–Decoder Bridge
Queries come from the decoder (what we want to express next); Keys and Values come from the encoder output (the source context). This is how the decoder "reads" the source input.
🔁
3. Feed-Forward + Norm
Same as Encoder FFN
Position-wise two-layer MLP applied after cross-attention, followed by Add & LayerNorm. This projects the cross-attention output into the space needed for the final token prediction.
⚠️
The Causal Mask — Critical for Training

During training, we feed the entire target sequence to the decoder at once for efficiency. Without the causal mask, position 3 could attend to position 7, so it would "see the answer." The mask sets the attention scores for all positions to the right of the current one to −∞ before softmax, so they receive zero weight in the weighted sum. This forces the model to predict each token using only past context.

🔆 Causal Attention Mask Visualisation — 4-Token Sequence
Encoder Self-Attention (Full): every token attends to all others
(rows = attending token, columns = attended token)
     | The  | cat  | sat  | down
The  | 0.55 | 0.25 | 0.12 | 0.08
cat  | 0.18 | 0.57 | 0.18 | 0.07
sat  | 0.12 | 0.29 | 0.42 | 0.17
down | 0.09 | 0.14 | 0.31 | 0.46

Decoder Masked Self-Attention: future tokens are blocked (masked)
(rows = attending token, columns = attended token)
      | Le   | chat   | s'est  | assis
Le    | 1.00 | masked | masked | masked
chat  | 0.31 | 0.69   | masked | masked
s'est | 0.18 | 0.31   | 0.51   | masked
assis | 0.12 | 0.22   | 0.31   | 0.35

ⓘ Cells marked "masked" are future positions. The decoder at position 1 ("chat") can only see positions 0–1 ("Le", "chat"), never positions 2+ ("s'est", "assis").
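The masking step described above can be sketched in a few lines. The scores here are random placeholders for a 4-token sequence, so only the zero pattern is meaningful, not the individual weights.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len = 4
scores = torch.randn(seq_len, seq_len)   # placeholder attention scores

# Lower-triangular boolean mask: True where attention is allowed (j <= i)
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Future positions get -inf, which softmax turns into exactly zero weight
masked_scores = scores.masked_fill(~causal, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)

print(weights)
```

Every row still sums to 1, the first row puts all its weight on position 0, and everything above the diagonal is zero.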


Section 05

Cross-Attention — The Bridge Between Encoder and Decoder

Cross-Attention is the mechanism that lets the decoder "consult" the encoder. It is the most important part of the Encoder-Decoder architecture — without it, the decoder would generate output in a vacuum.

Without Cross-Attention
Problem | Effect
Decoder has no source access | Hallucination
Output ignores input meaning | Random text
Bottleneck: single context vector | Information loss
With Cross-Attention
Capability | Effect
Decoder queries encoder at each step | Grounded output
Attention over all source positions | No bottleneck
Dynamic focus per output token | Accurate alignment
💌 Cross-Attention — English to French Translation
Generating "chat": cross-attention weights over the source tokens (Encoder Keys/Values)
Source token | The | black | cat | sat | down | quietly
Attention weight | 0.04 | 0.32 | 0.55 | 0.05 | 0.02 | 0.02
"chat" mostly attends to "cat" (0.55) and "black" (0.32) in the source.
💡
Alignment is Learned, Not Programmed

The model learns these attention patterns entirely from data. Nobody told it that "chat" corresponds to "cat." Through backpropagation on millions of translation pairs, the cross-attention weights organically learn the alignment between source and target languages. This replaces the hand-crafted alignment tables used in older statistical machine translation systems.
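The mechanics behind those weights are the same attention formula with the inputs split between the two stacks. Below is a minimal single-head sketch with untrained, randomly initialised projections; the sizes (6 source positions, 3 target positions, d_model=32) are arbitrary illustration values.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

d_model = 32
W_Q, W_K, W_V = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))

enc_out = torch.randn(6, d_model)     # encoder output: one row per source token
dec_state = torch.randn(3, d_model)   # decoder states: one row per target token so far

with torch.no_grad():
    Q = W_Q(dec_state)                # queries come from the decoder
    K = W_K(enc_out)                  # keys come from the encoder
    V = W_V(enc_out)                  # values come from the encoder
    weights = F.softmax(Q @ K.T / math.sqrt(d_model), dim=-1)  # [tgt_len, src_len]
    context = weights @ V                                      # [tgt_len, d_model]

print(weights.shape, context.shape)
```

Each row of `weights` is one target position's distribution over the source tokens, which is exactly the kind of row shown in the "chat" example above.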


Section 06

Positional Encoding — Giving the Model a Sense of Order

The Shuffled Manuscript
A transformer processes all tokens simultaneously — no step-by-step recurrence. Imagine handing an editor a manuscript with all pages in a random pile. Without page numbers, they cannot tell if "Once upon a time" is the beginning or the end. Positional Encoding is the page number — a signal injected into each token's embedding that says "you are at position 5 in the sequence," without disrupting the semantic meaning the embedding already carries.

The original Transformer uses sinusoidal positional encoding:

Even Dimensions
PE(pos,2i) = sin(pos / 10000^(2i/d))
Sine wave with frequency depending on dimension index i. Each dimension encodes a different "frequency" of position.
Odd Dimensions
PE(pos,2i+1) = cos(pos / 10000^(2i/d))
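Cosine wave paired with the sine. Together, the sine-cosine pairs across all dimensions give each position a distinct signature, and the formula can be evaluated for any position, including ones longer than those seen in training.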
📈
Why Sinusoidal?
Relative Position
For any fixed offset k, PE(pos+k) is a linear function of PE(pos). This helps the model learn relative positions: "this token is 3 steps after that one."
🔧
Learned Positional Encoding
BERT / GPT Style
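Models like BERT and GPT-2 learn positional embeddings from data instead of using the sinusoidal formula. The original Transformer paper reported nearly identical results for the two approaches, but learned embeddings cannot represent positions beyond the maximum length they were trained with.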
🔄
Rotary (RoPE) & ALiBi
LLaMA / PaLM Style
Newer architectures use Rotary Position Embedding (RoPE) or Attention with Linear Biases (ALiBi), which generalise better to sequences longer than those seen during training.
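The "linear function" claim for sinusoidal encodings can be checked numerically. For each sine-cosine pair with frequency w, shifting by k is a 2×2 rotation by angle k·w, so the full shift is a block-diagonal matrix that does not depend on the position. The sketch below builds that matrix and verifies it; `sinusoidal_pe` and `shift_matrix` are hypothetical helper names, and the sizes are arbitrary.

```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    """Sinusoidal encodings [max_len, d_model] plus the per-pair frequencies."""
    pos = torch.arange(max_len, dtype=torch.float64).unsqueeze(1)
    freq = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float64) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model, dtype=torch.float64)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe, freq

def shift_matrix(freq, k):
    """Block-diagonal matrix M_k such that PE(pos + k) = M_k @ PE(pos)."""
    d_model = 2 * freq.numel()
    M = torch.zeros(d_model, d_model, dtype=torch.float64)
    for i, w in enumerate(freq):
        c, s = torch.cos(k * w), torch.sin(k * w)
        # sin((p+k)w) =  cos(kw) sin(pw) + sin(kw) cos(pw)
        # cos((p+k)w) = -sin(kw) sin(pw) + cos(kw) cos(pw)
        M[2 * i, 2 * i], M[2 * i, 2 * i + 1] = c, s
        M[2 * i + 1, 2 * i], M[2 * i + 1, 2 * i + 1] = -s, c
    return M

k = 3
pe, freq = sinusoidal_pe(max_len=50, d_model=16)
M = shift_matrix(freq, k)

shifted = pe[:-k] @ M.T                      # apply M_k to every position at once
print(torch.allclose(shifted, pe[k:]))       # same matrix works for all positions
```

Because one fixed matrix maps every position to the position k steps later, a linear layer can in principle pick up relative offsets from these encodings.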

Section 07

Python Implementation — From Scratch with PyTorch

Step 1 — Scaled Dot-Product Attention

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: [batch, heads, q_len, d_k];  K, V: [batch, heads, k_len, d_k]
    mask: broadcastable to [batch, heads, q_len, k_len]; nonzero = keep, 0 = mask
    """
    d_k = Q.size(-1)

    # Step 1: Dot-product scores and scale
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # scores: [batch, heads, seq_len, seq_len]

    # Step 2: Apply mask (large negative score → ~0 weight after softmax)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # Step 3: Softmax to get attention weights
    attn_weights = F.softmax(scores, dim=-1)

    # Step 4: Weighted sum of values
    output = torch.matmul(attn_weights, V)
    return output, attn_weights

Step 2 — Multi-Head Attention

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model    = d_model
        self.num_heads  = num_heads
        self.d_k        = d_model // num_heads  # dimension per head

        # Learned projection matrices for Q, K, V and the output
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        # x: [batch, seq_len, d_model]
        # → [batch, num_heads, seq_len, d_k]
        batch, seq_len, _ = x.size()
        x = x.view(batch, seq_len, self.num_heads, self.d_k)
        return x.transpose(1, 2)

    def forward(self, Q_in, K_in, V_in, mask=None):
        # Project inputs
        Q = self.split_heads(self.W_Q(Q_in))
        K = self.split_heads(self.W_K(K_in))
        V = self.split_heads(self.W_V(V_in))

        # Attention across all heads in parallel
        attn_out, _ = scaled_dot_product_attention(Q, K, V, mask)

        # Merge heads: [batch, heads, seq_len, d_k] → [batch, seq_len, d_model]
        batch, _, seq_len, _ = attn_out.size()
        attn_out = attn_out.transpose(1, 2).contiguous().view(batch, seq_len, self.d_model)

        return self.W_O(attn_out)  # Final output projection

Step 3 — Encoder Layer

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn  = MultiHeadAttention(d_model, num_heads)
        self.ffn        = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.norm1   = nn.LayerNorm(d_model)
        self.norm2   = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        # Self-attention sub-layer (Q = K = V = x)
        attn_out = self.self_attn(x, x, x, src_mask)
        x = self.norm1(x + self.dropout(attn_out))   # Residual + Norm

        # Feed-forward sub-layer
        ffn_out = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_out))     # Residual + Norm
        return x

Step 4 — Decoder Layer

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.masked_self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn       = MultiHeadAttention(d_model, num_heads)
        self.ffn              = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(d_ff, d_model)
        )
        self.norm1   = nn.LayerNorm(d_model)
        self.norm2   = nn.LayerNorm(d_model)
        self.norm3   = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask=None, tgt_mask=None):
        # 1. Masked Self-Attention on decoder's own past outputs
        ma_out = self.masked_self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(ma_out))

        # 2. Cross-Attention: Q from decoder, K & V from encoder
        ca_out = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(ca_out))

        # 3. Feed-Forward
        ffn_out = self.ffn(x)
        x = self.norm3(x + self.dropout(ffn_out))
        return x

Step 5 — Full Transformer

class Transformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=512, N=6,
                 num_heads=8, d_ff=2048, dropout=0.1, max_len=5000):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.pos_enc   = PositionalEncoding(d_model, max_len)  # See below

        self.encoder_layers = nn.ModuleList(
            [EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(N)]
        )
        self.decoder_layers = nn.ModuleList(
            [DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(N)]
        )
        self.fc_out = nn.Linear(d_model, tgt_vocab)

    def encode(self, src, src_mask):
        x = self.pos_enc(self.src_embed(src))
        for layer in self.encoder_layers:
            x = layer(x, src_mask)
        return x  # [batch, src_len, d_model]

    def decode(self, tgt, enc_out, src_mask, tgt_mask):
        x = self.pos_enc(self.tgt_embed(tgt))
        for layer in self.decoder_layers:
            x = layer(x, enc_out, src_mask, tgt_mask)
        return x

    def forward(self, src, tgt, src_mask, tgt_mask):
        enc_out = self.encode(src, src_mask)
        dec_out = self.decode(tgt, enc_out, src_mask, tgt_mask)
        return self.fc_out(dec_out)   # [batch, tgt_len, tgt_vocab]

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

Step 6 — Quick Sanity Check

# Sanity check — forward pass with random data
model = Transformer(src_vocab=10000, tgt_vocab=10000)
model.eval()

batch, src_len, tgt_len = 2, 12, 10
src = torch.randint(0, 10000, (batch, src_len))
tgt = torch.randint(0, 10000, (batch, tgt_len))

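# Source mask: no padding (all ones). Shape [batch, 1, 1, src_len] broadcasts over
# heads and query positions, so it fits both encoder self-attention and cross-attention.
src_mask = torch.ones(batch, 1, 1, src_len)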

# Target mask: causal (lower triangular)
tgt_mask = torch.tril(torch.ones(tgt_len, tgt_len)).unsqueeze(0).unsqueeze(0)

with torch.no_grad():
    logits = model(src, tgt, src_mask, tgt_mask)

print(f"Input shape:  {src.shape}")
print(f"Target shape: {tgt.shape}")
print(f"Output shape: {logits.shape}")
OUTPUT
Input shape:  torch.Size([2, 12])
Target shape: torch.Size([2, 10])
Output shape: torch.Size([2, 10, 10000])
✓ Logits: [batch=2, tgt_len=10, vocab_size=10000], ready for cross-entropy loss
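The sanity check above is a single teacher-forced pass with the whole target supplied. At inference there is no target, so the decoder is called in a loop and fed its own predictions. The sketch below shows that loop with PyTorch's built-in `nn.Transformer` as a stand-in for the from-scratch class. The toy sizes, the shared embedding, the special-token IDs (`sos_id`, `eos_id`) and the omission of positional encoding are all simplifying assumptions, and the model is untrained, so the generated IDs are meaningless.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab, d_model, sos_id, eos_id = 50, 32, 1, 2   # hypothetical toy values

embed = nn.Embedding(vocab, d_model)            # shared source/target embedding (toy)
core = nn.Transformer(d_model=d_model, nhead=4, num_encoder_layers=2,
                      num_decoder_layers=2, dim_feedforward=64,
                      dropout=0.0, batch_first=True)
to_vocab = nn.Linear(d_model, vocab)
core.eval()

@torch.no_grad()
def greedy_decode(src_ids, max_len=10):
    memory = core.encoder(embed(src_ids))       # the encoder runs exactly once
    out = torch.full((src_ids.size(0), 1), sos_id, dtype=torch.long)
    for _ in range(max_len - 1):
        tgt_mask = core.generate_square_subsequent_mask(out.size(1))  # causal mask
        dec = core.decoder(embed(out), memory, tgt_mask=tgt_mask)
        next_id = to_vocab(dec[:, -1]).argmax(dim=-1, keepdim=True)   # greedy pick
        out = torch.cat([out, next_id], dim=1)  # feed the prediction back in
        if (next_id == eos_id).all():           # stop once every sequence ended
            break
    return out

src = torch.randint(3, vocab, (2, 7))
print(greedy_decode(src, max_len=6))
```

This version re-runs the decoder over the whole prefix at every step, which is simple but wasteful; caching keys and values removes the repeated work.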

Section 08

Decoding Strategies — How We Generate Text

The raw model output is a probability distribution over the vocabulary. How we sample from it massively affects the quality, creativity, and coherence of the generated text.

🎯
Greedy Decoding
Always pick the highest-probability token. Fast and deterministic, but prone to getting stuck in loops and missing globally better sequences.
argmax(logits)
🌚
Beam Search
Keep the top-k sequences (beams) at each step. Balance between greedy (k=1) and exhaustive. Standard for translation and summarisation.
k=4 to 10 beams
🎲
Temperature Sampling
Divide logits by temperature T before softmax. T<1 = sharper (conservative). T>1 = flatter (creative, more random). T=1 = standard sampling.
logits / T
📊
Top-k Sampling
Restrict sampling to only the k most probable tokens. Prevents sampling from extremely unlikely tokens. Commonly k=40–50. Used in GPT-2.
k=50 typical
🔄
Top-p (Nucleus) Sampling
Sample from the smallest set of tokens whose cumulative probability ≥ p. Dynamic — uses few tokens on confident steps, many on uncertain ones.
p=0.9 typical
🔥
Repetition Penalty
Down-weight the probability of tokens already generated. Prevents the model from repeating the same phrase endlessly. Common in chatbots and story generation.
penalty=1.2 typical
🏆
Practitioner's Decoding Guide

For translation/summarisation: use beam search (k=4–8), no temperature. For creative writing/chat: use Top-p (0.9) + temperature (0.7–1.0) + repetition penalty (1.1–1.3). For code generation: low temperature (0.2–0.4), greedy or small beam, no repetition penalty.
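Temperature, Top-k and Top-p compose naturally into one sampling function. The following is a minimal sketch for a single 1-D logits vector, not the implementation used by any particular library; `sample_next_token` is a hypothetical name, and the logits in the usage lines are made up.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Sample one token id from a 1-D logits tensor of shape [vocab]."""
    logits = logits / temperature                 # T<1 sharpens, T>1 flattens

    if top_k > 0:                                 # keep only the k largest logits
        k = min(top_k, logits.numel())
        kth_value = torch.topk(logits, k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))

    if top_p < 1.0:                               # nucleus: smallest set with mass >= p
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        sorted_probs = F.softmax(sorted_logits, dim=-1)
        # Probability mass strictly before each token in sorted order;
        # the token that crosses the threshold is kept.
        mass_before = torch.cumsum(sorted_probs, dim=-1) - sorted_probs
        remove_sorted = mass_before >= top_p
        remove = torch.zeros_like(remove_sorted)
        remove[sorted_idx] = remove_sorted        # map back to vocabulary order
        logits = logits.masked_fill(remove, float("-inf"))

    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

torch.manual_seed(0)
toy_logits = torch.tensor([5.0, 4.0, 0.0, -1.0])  # made-up 4-token vocabulary
print([sample_next_token(toy_logits, temperature=0.8, top_p=0.9) for _ in range(10)])
```

With these toy logits the two most likely tokens already cover more than 90% of the probability mass, so Top-p with p=0.9 never samples the other two.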


Section 09

Encoder-Only, Decoder-Only, Encoder-Decoder — When to Use Which

The full Encoder-Decoder is just one of three variants. Understanding which to use for your task is critical.

Architecture | Examples | Best For | Attention Type | Output
Encoder-Only | BERT, RoBERTa, DistilBERT | Classification, NER, Q&A (extractive), Embeddings | Bidirectional self-attention | Contextual embeddings, not generation
Decoder-Only | GPT-2, GPT-4, LLaMA, Mistral | Text generation, Chatbots, Code, Completion | Causal (masked) self-attention | Next token prediction, generation
Encoder-Decoder | T5, BART, Whisper, mBART | Translation, Summarisation, Speech-to-Text, Q&A (abstractive) | Bidirectional enc + causal dec + cross-attn | Variable-length sequence from input
🔨 Use Encoder-Decoder When…
Input and output are different sequences (translation, summarisation)
You need the model to read first, then write
Output length differs greatly from input length
Strong alignment between source and target is critical
🚫 Avoid Encoder-Decoder When…
You only need classification — use encoder-only
You only need open-ended generation — use decoder-only
Compute budget is very tight: two stacks cost roughly 2× the parameters of a single stack of the same depth
Your task is simply next-token prediction — decoder-only is optimal

Section 10

Real-World Applications

🌐
Machine Translation
The original motivation. English → French, Chinese → Spanish. Models like mBART and M2M-100 support 100+ language pairs using a single Encoder-Decoder model.
Model: Helsinki-NLP/opus-mt-en-fr
📋
Text Summarisation
Long articles → short summaries. BART and PEGASUS are fine-tuned encoder-decoder models that generate abstractive (not copy-paste) summaries.
Model: facebook/bart-large-cnn
🎤
Speech Recognition (ASR)
Audio waveform → transcript. Whisper encodes mel-spectrograms with a CNN+Transformer encoder, then decodes text with a standard transformer decoder.
Model: openai/whisper-large-v3
🔍
Question Answering (Abstractive)
T5 ("Text-to-Text Transfer Transformer") treats every NLP task as seq2seq. QA: input "question: ... context: ..." → output is a generated answer string.
Model: google/flan-t5-large
📷
Image Captioning
A CNN or Vision Transformer encodes the image into a feature map (the "context"), which a transformer decoder attends to via cross-attention to generate captions.
Model: Salesforce/blip-image-captioning-large
💻
Code Generation
Natural language docstring → Python function. Encoder-decoder code models such as CodeT5 map intent to code, attending to the docstring while generating. (StarCoder, by contrast, is a decoder-only model.)
Model: Salesforce/codet5-base

Section 11

Using Hugging Face — Production-Ready in 10 Lines

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# ── Translation: English → French using Helsinki-NLP ──────
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
model     = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

src_text = "The encoder-decoder architecture revolutionised natural language processing."
inputs   = tokenizer(src_text, return_tensors="pt", padding=True)

# Generate with beam search
with torch.no_grad():
    generated_ids = model.generate(
        **inputs,  # pass input_ids and attention_mask together
        num_beams=5,
        max_length=100,
        early_stopping=True
    )

translation = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(f"EN: {src_text}")
print(f"FR: {translation}")
OUTPUT
EN: The encoder-decoder architecture revolutionised natural language processing.
FR: L'architecture encodeur-décodeur a révolutionné le traitement du langage naturel.
# ── Summarisation with BART ───────────────────────────────
from transformers import pipeline

summariser = pipeline("summarization", model="facebook/bart-large-cnn")

article = """
    The Transformer architecture, introduced in the landmark 2017 paper
    'Attention Is All You Need' by Vaswani et al., replaced recurrent neural
    networks with self-attention mechanisms. This allowed massively parallel
    computation during training and eliminated the vanishing gradient problem
    that plagued RNNs. The architecture consists of an encoder stack that
    processes the input, and a decoder stack that generates the output,
    connected by cross-attention. Within three years, nearly every
    state-of-the-art NLP system was built on this foundation.
"""

result = summariser(article, max_length=60, min_length=20, do_sample=False)
print(result[0]["summary_text"])
OUTPUT
The Transformer architecture was introduced in 2017 by Vaswani et al. It replaced recurrent networks with self-attention, enabling parallel training and solving the vanishing gradient problem. Nearly every modern NLP system is built on this foundation.

Section 12

Common Failure Modes and How to Fix Them

Failure Mode | Symptom | Root Cause | Fix
Exposure Bias | Good at training, poor at generation | Trained on gold tokens, generates with predicted tokens | Scheduled Sampling or Reinforcement Learning from Human Feedback (RLHF)
Over-generation | Output is repetitive or too long | Likelihood-based search favours high-probability, common tokens and can loop on them | Length penalty, repetition penalty, no-repeat-ngram-size
Hallucination | Output contradicts input facts | Decoder over-relies on language priors, not encoder | Faithfulness loss, copy mechanism, RAG
Attention Collapse | Model ignores most of input | Cross-attention weight saturates on <eos> token | Coverage mechanism, attention supervision, label smoothing
Slow Inference | Generates 1 token per 200ms | Autoregressive decoding is sequential by nature | KV cache, speculative decoding, model distillation
OOM on Long Sequences | GPU OOM for seq_len > 512 | Attention is O(n²) in memory | Flash Attention, Sliding Window Attention, chunked cross-attention
🔭
KV Cache — The Critical Inference Optimisation

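At inference, the decoder generates one token at a time. Without caching, it recomputes Keys and Values for all previous tokens at every step, which adds up to O(n²) projections over n steps. With KV caching, the K and V tensors from previous steps are stored and reused, so each step projects only the newest token: per-step K/V projection work drops from O(n) to O(1). The new query still attends over all cached positions, so the attention itself remains O(n) per step. In an encoder-decoder model, the cross-attention Keys and Values depend only on the encoder output and can be computed once per input. KV caching is among the most impactful inference optimisations for autoregressive models.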
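A minimal single-head sketch shows that caching changes the amount of work but not the result. The projections are random and untrained, the sizes are arbitrary, and `attend` is a hypothetical helper; real implementations keep one cache per layer and per head.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

d_model, n = 16, 6
W_Q, W_K, W_V = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))
x = torch.randn(n, d_model)           # decoder inputs for n generated positions

def attend(q, K, V):
    """Attention of one query row over all keys/values available so far."""
    w = F.softmax(q @ K.T / math.sqrt(d_model), dim=-1)
    return w @ V

with torch.no_grad():
    # Without a cache: re-project every previous token at each step.
    no_cache = [attend(W_Q(x[t:t + 1]), W_K(x[:t + 1]), W_V(x[:t + 1]))
                for t in range(n)]

    # With a cache: project only the newest token, then append it.
    K_cache = torch.empty(0, d_model)
    V_cache = torch.empty(0, d_model)
    cached = []
    for t in range(n):
        x_t = x[t:t + 1]
        K_cache = torch.cat([K_cache, W_K(x_t)])
        V_cache = torch.cat([V_cache, W_V(x_t)])
        cached.append(attend(W_Q(x_t), K_cache, V_cache))

same = all(torch.allclose(a, b, atol=1e-5) for a, b in zip(no_cache, cached))
print(same)
```

The cached loop performs one K and one V projection per step, while the uncached loop performs t+1 of each at step t, yet both produce the same outputs up to floating-point rounding.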


Section 13

Golden Rules for Encoder-Decoder Models

🪝 Encoder-Decoder — Non-Negotiable Rules
1
Always use a pre-trained backbone (T5, BART, mBART) unless you have hundreds of millions of training pairs. Training a transformer from scratch requires enormous data. Fine-tune instead — it is faster, cheaper, and almost always better.
2
Match your tokeniser to your model. Never use BERT's tokeniser with T5. Each model has a vocabulary built specifically for it. Mixing them gives meaningless token IDs and garbage outputs.
3
For generation, always enable num_beams ≥ 4 for translation and summarisation. Greedy decoding is faster but measurably lower quality on these tasks. Use sampling (Top-p) for creative tasks.
4
Use label smoothing (ε=0.1) during training. Instead of target probabilities of exactly 0 or 1, distribute a small probability mass ε to non-target tokens. This regularises the model and prevents overconfident predictions.
5
Use the warmup learning rate schedule. The original Transformer paper uses lr = d_model^{-0.5} · min(step^{-0.5}, step · warmup_steps^{-1.5}). Without warmup, training often becomes unstable or diverges within the first few hundred steps, because transformers are sensitive to the early learning rate.
6
Pad on the right for both the source and the target. Always use attention_mask to ignore padding tokens. Cross-attention to padding positions is wasted compute and degrades alignment quality.
7
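Enable Flash Attention 2 if your GPU supports it (Ampere+). It implements exact attention in O(n) memory instead of O(n²), which can make training on sequences of 16k–32k tokens feasible on a single GPU, depending on model size and available memory. For models that support it, this is a one-line change, provided the flash-attn package is installed and the model is loaded in fp16 or bf16: model = AutoModel.from_pretrained(..., attn_implementation="flash_attention_2").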
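Rules 4 and 5 are short enough to sketch directly. The schedule below is the formula quoted in rule 5, with warmup_steps=4000 taken from the original Transformer paper; `noam_lr` is a hypothetical function name. Label smoothing uses the `label_smoothing` argument of PyTorch's `F.cross_entropy`; note that PyTorch spreads ε uniformly over all classes, including the target, which differs slightly from the "non-target only" description. The logits and targets are random placeholders.

```python
import torch
import torch.nn.functional as F

def noam_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate from the rule-5 schedule; step counts from 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 20000):
    print(f"step {step:6d}  lr = {noam_lr(step):.6f}")

# Label smoothing (rule 4) on placeholder data
torch.manual_seed(0)
logits = torch.randn(4, 10)                 # [batch, vocab], random stand-in
targets = torch.tensor([1, 0, 3, 9])

hard_loss = F.cross_entropy(logits, targets)
smooth_loss = F.cross_entropy(logits, targets, label_smoothing=0.1)
print(f"hard: {hard_loss.item():.4f}  smoothed: {smooth_loss.item():.4f}")
```

The learning rate rises linearly until step == warmup_steps and then decays proportionally to step^-0.5, so the peak sits exactly at the end of warmup.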