Encoder-Decoder Architecture

Section 01

The Story That Explains Encoder-Decoder

📖 Real World Analogy

The United Nations Interpreter

Imagine a UN summit. A French diplomat delivers a speech. A human interpreter in the booth listens to every word, builds a complete mental model of the meaning — the intent, the emotion, the diplomatic nuance — and only then speaks it out in English for the audience.

The interpreter does not translate word-by-word. They first encode the full French speech into an internal "meaning representation" inside their head, then decode that meaning into English, one word at a time, attending to what has already been said.

That is exactly what an Encoder-Decoder neural network does. The Encoder reads the entire input and compresses it into a rich internal representation. The Decoder reads that representation and produces the output, step by step, attending to the most relevant parts as it goes.

The Encoder-Decoder architecture (also called sequence-to-sequence or seq2seq) is the backbone of modern machine translation, summarisation, speech recognition, and image captioning, and its two halves underlie nearly every large language model in use today. T5 and Whisper use the full architecture, BERT keeps only the encoder stack, and GPT keeps only the decoder stack, so understanding it deeply means understanding how all four work under the hood.

🌸

The Core Insight

An Encoder-Decoder solves the fundamental challenge of variable-length input → variable-length output. Plain feed-forward networks expect fixed-size inputs and outputs. Seq2seq breaks the problem into two stages: first understand (encode), then generate (decode), with each stage independently flexible in length.

Section 02

The Big Picture — Architecture Overview

Before diving into components, here is the full data flow from input tokens to output tokens.

📋 Full Encoder-Decoder Data Flow

Step 1

Tokenisation — Raw text is split into tokens (words or subwords). Each token gets an integer ID.

Step 2

Embedding — Token IDs are mapped to dense vectors of size d_model (e.g., 512 dimensions).

Step 3

Positional Encoding — Since transformers have no recurrence, position information is injected into embeddings via sinusoidal functions.

Step 4

Encoder Stack — N layers of Self-Attention + Feed-Forward transform the input into a contextual representation matrix.

Step 5

Decoder Stack — N layers of Masked Self-Attention + Cross-Attention + Feed-Forward generate output tokens one at a time.

Step 6

Linear + Softmax — Final decoder output is projected to vocabulary size, giving a probability distribution over the next token.
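
To make the tensor shapes in the six steps concrete, here is a minimal sketch that follows a toy sentence through tokenisation, embedding, and the final projection. The six-word vocabulary, d_model=16, and the variable names are assumptions chosen for illustration. The encoder and decoder stacks are skipped, since they preserve the [batch, seq_len, d_model] shape.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Step 1: toy whitespace tokeniser over a hypothetical 6-word vocabulary
vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "the": 3, "cat": 4, "sat": 5}
token_ids = torch.tensor([[vocab[w] for w in "the cat sat".split()]])  # [1, 3]

# Step 2: embedding lookup, IDs -> dense vectors of size d_model
d_model = 16
embed = nn.Embedding(len(vocab), d_model)
x = embed(token_ids)                          # [1, 3, d_model]

# Steps 3-5 (positional encoding, encoder, decoder) keep the last dimension
# at d_model, so `x` stands in for the final decoder output here.

# Step 6: project to vocabulary size, then softmax over the vocabulary
to_vocab = nn.Linear(d_model, len(vocab))
probs = torch.softmax(to_vocab(x), dim=-1)    # [1, 3, vocab_size]

print(token_ids.shape, x.shape, probs.shape)
```

Each row of probs is a distribution over the whole vocabulary, which is what the decoding strategies later in this guide operate on.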

⚙️ Architecture Diagram — Encoder-Decoder (Transformer)

ⓘ The encoder runs once; the decoder runs autoregressively — each generated token feeds back as the next input.

Section 03

The Encoder — Reading and Understanding

📖 Story

The Speed-Reader Who Builds a Mind Map

Think of a lawyer reading a 200-page contract before a negotiation. She does not read it once and forget each sentence. Instead, as she reads, every sentence changes her understanding of every other sentence. The word "termination" in clause 3 suddenly recontextualises clause 47. By the end, she holds a rich, interconnected understanding of the whole document simultaneously — not a sequence of facts, but a web of meaning.

The Encoder does the same thing. Through Self-Attention, every token simultaneously attends to every other token. The result is not a single summary vector, but a matrix where every position has been enriched by full context.

Self-Attention — How Every Token Talks to Every Other Token

Self-Attention is the heart of the Encoder. For each token, it computes three vectors: Query (Q), Key (K), and Value (V). Think of it as a soft database lookup: the Query asks a question, the Keys determine relevance, and the Values hold the content.

Query

Q = X · W_Q

What am I looking for? Each token projects itself into a query vector.

Key

K = X · W_K

What do I offer? Each token projects itself into a key vector for others to match against.

Value

V = X · W_V

What information do I carry? The actual content retrieved once relevance is established.

Attention Score

softmax(QKᵀ / √d_k) · V

Divide the dot products by √d_k to keep softmax out of its saturated, near-zero-gradient regime, apply softmax, then use the result to weight the values.

🔑

Why Divide by √d_k?

With large embedding dimensions, dot products grow very large in magnitude, pushing softmax into regions with near-zero gradients. Dividing by √d_k (square root of key dimension) keeps the scores in a healthy range and stabilises training. The original "Attention Is All You Need" paper (Vaswani et al., 2017) introduced this trick.
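
The effect of the scaling can be checked numerically. The sketch below assumes queries and keys with independent unit-variance components (an idealisation, not something trained models guarantee) and compares dot products before and after dividing by √d_k. All variable names are illustrative.

```python
import math
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(10_000, d_k)   # unit-variance queries (an assumption)
k = torch.randn(10_000, d_k)   # unit-variance keys (an assumption)

raw = (q * k).sum(dim=-1)          # 10,000 unscaled dot products
scaled = raw / math.sqrt(d_k)

raw_std = raw.std().item()         # expected to be close to sqrt(d_k)
scaled_std = scaled.std().item()   # expected to be close to 1
print(f"std unscaled: {raw_std:.1f}, std scaled: {scaled_std:.2f}")

# Effect on softmax: one query scored against 8 keys, with and without scaling
scores = k[:8] @ q[0]
peak_raw = torch.softmax(scores, dim=-1).max().item()
peak_scaled = torch.softmax(scores / math.sqrt(d_k), dim=-1).max().item()
print(f"largest weight unscaled: {peak_raw:.3f}, scaled: {peak_scaled:.3f}")
```

Under these assumptions the unscaled scores have a standard deviation near √d_k, so softmax puts most of its weight on a single key. The scaled scores stay near unit variance and give a softer distribution.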

Multi-Head Attention — Many Perspectives at Once

A single attention head can only focus on one type of relationship at a time. Multi-Head Attention runs h attention heads in parallel, each with its own Q, K, V projections. Their outputs are concatenated and projected back to d_model.

👁

Head 1

Syntactic Dependency

Might learn subject-verb agreement: "The cats that live next door eat loudly" — links "cats" to "eat".

🔌

Head 2

Coreference

Might learn pronoun resolution: "Mary saw Jane. She smiled." — links "she" to "Mary" or "Jane".

📍

Head 3…h

Semantic Role / Position

Other heads may capture positional proximity, named-entity spans, or domain-specific semantic patterns — all simultaneously.

Encoder Layer — Full Walkthrough

Multi-Head Self-Attention

All tokens attend to all tokens. Input shape: [batch, seq_len, d_model]. Output shape: same. This is where context is woven into each position.

Add & LayerNorm (Residual)

The original input is added back to the attention output (residual connection), then Layer Normalisation is applied. The residual path lets gradients flow directly to earlier layers, which mitigates vanishing gradients and eases training.

Position-wise Feed-Forward Network (FFN)

Two linear layers with a ReLU (or GELU) activation in between. Applied identically to each position: FFN(x) = max(0, xW₁ + b₁)W₂ + b₂. Inner dimension is typically 4× d_model.

Add & LayerNorm (again)

Another residual + normalisation around the FFN. This entire block (steps 1–4) repeats N times (e.g., N=6 in the original Transformer). Depth gives the model power to build hierarchical representations.

Section 04

The Decoder — Generating Output

📖 Story

The Author Writing a Novel, One Word at a Time

Imagine an author who has read a detailed plot outline (the encoder's context), and is now writing the novel word by word. As she writes each word, she cannot look ahead at what she will write next — only what she has written so far. But she constantly glances back at the plot outline to stay on track. Each new word depends on: (1) all the words she has written so far, and (2) the relevant parts of the outline.

This is the Decoder: it attends to its own past outputs (Masked Self-Attention) and to the encoder's context (Cross-Attention), then predicts the next token.

Three Sub-Layers of the Decoder

🔕

1. Masked Self-Attention

Causal / Autoregressive

Identical to encoder self-attention, but with a causal mask applied. Position i can only attend to positions 0…i. This prevents the model from "cheating" by seeing future tokens during training.

💌

2. Cross-Attention

Encoder–Decoder Bridge

Queries come from the decoder (what we want to express next); Keys and Values come from the encoder output (the source context). This is how the decoder "reads" the source input.

🔁

3. Feed-Forward + Norm

Same as Encoder FFN

Position-wise two-layer MLP applied after cross-attention, followed by Add & LayerNorm. This projects the cross-attention output into the space needed for the final token prediction.

⚠️

The Causal Mask — Critical for Training

During training, we feed the entire target sequence to the decoder at once for efficiency. Without the causal mask, position 3 could attend to position 7 and "see the answer." The mask sets the attention scores of all positions to the right to −∞ before softmax, so they receive zero weight in the weighted sum. This forces the model to predict each token using only past context.

🔆 Causal Attention Mask Visualisation — 4-Token Sequence

ⓘ The red cells show masked future positions. With zero-based indexing, the decoder at position 1 ("chat") can only see positions 0–1, never positions 2+ ("s'est", "assis").
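
The masking step can be reproduced in a few lines. This is a minimal sketch on a 4-position sequence with uniform raw scores, so the effect of the mask is easy to read. The variable names are illustrative.

```python
import torch

seq_len = 4
# Lower-triangular matrix: row i is True at columns 0..i (True = keep)
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

scores = torch.zeros(seq_len, seq_len)                    # uniform raw scores
masked = scores.masked_fill(~causal_mask, float("-inf"))  # hide future positions
weights = torch.softmax(masked, dim=-1)
print(weights)
```

Row i of weights spreads its mass evenly over positions 0..i and assigns exactly zero to every later position. That matches the rule that position i may only attend to positions 0…i.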

Section 05

Cross-Attention — The Bridge Between Encoder and Decoder

Cross-Attention is the mechanism that lets the decoder "consult" the encoder. It is the most important part of the Encoder-Decoder architecture — without it, the decoder would generate output in a vacuum.

Without Cross-Attention

Problem	Effect
Decoder has no source access	Hallucination
Output ignores input meaning	Random text
Bottleneck: single context vector	Information loss

With Cross-Attention

Capability	Effect
Decoder queries encoder at each step	Grounded output
Attention over all source positions	No bottleneck
Dynamic focus per output token	Accurate alignment
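
As a sketch of the mechanism described in this section, the snippet below runs single-head cross-attention with random tensors standing in for real encoder and decoder states. The dimensions and names (enc_out, dec_x) are assumptions chosen for illustration.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, src_len, tgt_len = 32, 6, 4

enc_out = torch.randn(1, src_len, d_model)   # stand-in for encoder output
dec_x = torch.randn(1, tgt_len, d_model)     # stand-in for decoder hidden states

W_Q = nn.Linear(d_model, d_model, bias=False)
W_K = nn.Linear(d_model, d_model, bias=False)
W_V = nn.Linear(d_model, d_model, bias=False)

Q = W_Q(dec_x)      # queries from the decoder: [1, tgt_len, d_model]
K = W_K(enc_out)    # keys from the encoder:    [1, src_len, d_model]
V = W_V(enc_out)    # values from the encoder:  [1, src_len, d_model]

weights = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d_model), dim=-1)
context = weights @ V

print(weights.shape)   # one row per target position, one column per source position
print(context.shape)   # same shape as the decoder input
```

The weights matrix is rectangular, [tgt_len, src_len]: each output position gets its own distribution over all source positions, which is why no single context vector becomes a bottleneck.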

💌 Cross-Attention — English to French Translation

💡

Alignment is Learned, Not Programmed

The model learns these attention patterns entirely from data. Nobody told it that "chat" corresponds to "cat." Through backpropagation on millions of translation pairs, the cross-attention weights organically learn the alignment between source and target languages. This replaces the hand-crafted alignment tables used in older statistical machine translation systems.

Section 06

Positional Encoding — Giving the Model a Sense of Order

📖 Story

The Shuffled Manuscript

A transformer processes all tokens simultaneously — no step-by-step recurrence. Imagine handing an editor a manuscript with all pages in a random pile. Without page numbers, they cannot tell if "Once upon a time" is the beginning or the end. Positional Encoding is the page number — a signal injected into each token's embedding that says "you are at position 5 in the sequence," without disrupting the semantic meaning the embedding already carries.

The original Transformer uses sinusoidal positional encoding:

Even Dimensions

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

Sine wave with frequency depending on dimension index i. Each dimension encodes a different "frequency" of position.

Odd Dimensions

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Cosine wave paired with the sine at the same frequency. Because the wavelengths range geometrically from 2π to 10000·2π, the pairs together give each position a distinct signature across the sequence lengths used in practice.

📈

Why Sinusoidal?

Relative Position

For any fixed offset k, PE(pos+k) is a linear function of PE(pos). This helps the model learn relative positions: "this token is 3 steps after that one."

🔧

Learned Positional Encoding

BERT / GPT Style

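Models such as BERT and GPT learn their positional embeddings from data instead of using the sinusoidal formula. The original Transformer paper reported nearly identical results for the two approaches. The main drawback of learned embeddings is that they cannot represent positions beyond the maximum length seen in training.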

🔄

Rotary (RoPE) & ALiBi

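LLaMA / PaLM / BLOOM Style

Newer architectures use Rotary Position Embedding (RoPE), as in LLaMA and PaLM, or Attention with Linear Biases (ALiBi), as in BLOOM. Both encode relative position directly inside the attention computation, and ALiBi in particular was designed to extrapolate to sequences longer than those seen during training.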
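
The relative-position property claimed for sinusoidal encodings can be verified numerically. For each sine/cosine pair, advancing by k positions is a rotation that depends only on k, not on pos. The sketch below uses only the standard library; the helper names positional_encoding and shift are hypothetical.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def shift(pe, k, d_model):
    """Apply the position-independent linear map that advances an encoding by k."""
    out = []
    for i in range(d_model // 2):
        w = 1.0 / (10000 ** (2 * i / d_model))   # angular frequency of pair i
        s, c = pe[2 * i], pe[2 * i + 1]
        out.append(s * math.cos(w * k) + c * math.sin(w * k))   # sin(w(pos+k))
        out.append(c * math.cos(w * k) - s * math.sin(w * k))   # cos(w(pos+k))
    return out

d_model, pos, k = 8, 5, 3
shifted = shift(positional_encoding(pos, d_model), k, d_model)
direct = positional_encoding(pos + k, d_model)
max_err = max(abs(a - b) for a, b in zip(shifted, direct))
print(f"max difference: {max_err:.2e}")
```

The coefficients inside shift involve only k and the fixed frequencies, so the same linear map works for every starting position. That is the sense in which PE(pos+k) is a linear function of PE(pos).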

Section 07

Python Implementation — From Scratch with PyTorch

Step 1 — Scaled Dot-Product Attention

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q:       [batch, heads, q_len, d_k]
    K, V:    [batch, heads, k_len, d_k]  (k_len == q_len for self-attention)
    mask:    broadcastable to [batch, heads, q_len, k_len]; nonzero = keep, 0 = mask
    """
    d_k = Q.size(-1)

    # Step 1: Dot-product scores and scale
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # scores: [batch, heads, seq_len, seq_len]

    # Step 2: Apply mask (masked positions get a large negative score, so
    # their weight is ~0 after softmax)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # Step 3: Softmax to get attention weights
    attn_weights = F.softmax(scores, dim=-1)

    # Step 4: Weighted sum of values
    output = torch.matmul(attn_weights, V)
    return output, attn_weights

Step 2 — Multi-Head Attention

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model    = d_model
        self.num_heads  = num_heads
        self.d_k        = d_model // num_heads  # dimension per head

        # Learned projection matrices for Q, K, V and the output
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        # x: [batch, seq_len, d_model]
        # → [batch, num_heads, seq_len, d_k]
        batch, seq_len, _ = x.size()
        x = x.view(batch, seq_len, self.num_heads, self.d_k)
        return x.transpose(1, 2)

    def forward(self, Q_in, K_in, V_in, mask=None):
        # Project inputs
        Q = self.split_heads(self.W_Q(Q_in))
        K = self.split_heads(self.W_K(K_in))
        V = self.split_heads(self.W_V(V_in))

        # Attention across all heads in parallel
        attn_out, _ = scaled_dot_product_attention(Q, K, V, mask)

        # Merge heads: [batch, heads, seq_len, d_k] → [batch, seq_len, d_model]
        batch, _, seq_len, _ = attn_out.size()
        attn_out = attn_out.transpose(1, 2).contiguous().view(batch, seq_len, self.d_model)

        return self.W_O(attn_out)  # Final output projection

Step 3 — Encoder Layer

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn  = MultiHeadAttention(d_model, num_heads)
        self.ffn        = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.norm1   = nn.LayerNorm(d_model)
        self.norm2   = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        # Self-attention sub-layer (Q = K = V = x)
        attn_out = self.self_attn(x, x, x, src_mask)
        x = self.norm1(x + self.dropout(attn_out))   # Residual + Norm

        # Feed-forward sub-layer
        ffn_out = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_out))     # Residual + Norm
        return x

Step 4 — Decoder Layer

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.masked_self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn       = MultiHeadAttention(d_model, num_heads)
        self.ffn              = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(d_ff, d_model)
        )
        self.norm1   = nn.LayerNorm(d_model)
        self.norm2   = nn.LayerNorm(d_model)
        self.norm3   = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask=None, tgt_mask=None):
        # 1. Masked Self-Attention on decoder's own past outputs
        ma_out = self.masked_self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(ma_out))

        # 2. Cross-Attention: Q from decoder, K & V from encoder
        ca_out = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(ca_out))

        # 3. Feed-Forward
        ffn_out = self.ffn(x)
        x = self.norm3(x + self.dropout(ffn_out))
        return x

Step 5 — Full Transformer

class Transformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=512, N=6,
                 num_heads=8, d_ff=2048, dropout=0.1, max_len=5000):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.pos_enc   = PositionalEncoding(d_model, max_len)  # See below

        self.encoder_layers = nn.ModuleList(
            [EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(N)]
        )
        self.decoder_layers = nn.ModuleList(
            [DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(N)]
        )
        self.fc_out = nn.Linear(d_model, tgt_vocab)

    def encode(self, src, src_mask):
        x = self.pos_enc(self.src_embed(src))
        for layer in self.encoder_layers:
            x = layer(x, src_mask)
        return x  # [batch, src_len, d_model]

    def decode(self, tgt, enc_out, src_mask, tgt_mask):
        x = self.pos_enc(self.tgt_embed(tgt))
        for layer in self.decoder_layers:
            x = layer(x, enc_out, src_mask, tgt_mask)
        return x

    def forward(self, src, tgt, src_mask, tgt_mask):
        enc_out = self.encode(src, src_mask)
        dec_out = self.decode(tgt, enc_out, src_mask, tgt_mask)
        return self.fc_out(dec_out)   # [batch, tgt_len, tgt_vocab]

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

Step 6 — Quick Sanity Check

# Sanity check — forward pass with random data
model = Transformer(src_vocab=10000, tgt_vocab=10000)
model.eval()

batch, src_len, tgt_len = 2, 12, 10
src = torch.randint(0, 10000, (batch, src_len))
tgt = torch.randint(0, 10000, (batch, tgt_len))

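# Source mask: no padding (all ones). Shape [batch, 1, 1, src_len] so it
# broadcasts over heads and query positions in both encoder self-attention
# ([.., src_len, src_len]) and decoder cross-attention ([.., tgt_len, src_len])
src_mask = torch.ones(batch, 1, 1, src_len)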

# Target mask: causal (lower triangular)
tgt_mask = torch.tril(torch.ones(tgt_len, tgt_len)).unsqueeze(0).unsqueeze(0)

with torch.no_grad():
    logits = model(src, tgt, src_mask, tgt_mask)

print(f"Input shape:  {src.shape}")
print(f"Target shape: {tgt.shape}")
print(f"Output shape: {logits.shape}")

OUTPUT

Input shape:  torch.Size([2, 12])
Target shape: torch.Size([2, 10])
Output shape: torch.Size([2, 10, 10000])

✓ Logits: [batch=2, tgt_len=10, vocab_size=10000], ready for cross-entropy loss
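
To connect these logits to a training loss, the usual recipe is teacher forcing: the decoder receives the target shifted right by one position and is scored against the target shifted left. The sketch below uses random logits as a stand-in for the model call, and PAD_ID=0, the vocabulary size, and the token IDs are assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
PAD_ID, vocab_size = 0, 100          # hypothetical padding ID and vocabulary size

# Two target sequences, right-padded to length 6 (IDs are arbitrary)
tgt = torch.tensor([[1, 17, 42, 8, 2, 0],
                    [1, 23, 5, 2, 0, 0]])

decoder_input = tgt[:, :-1]   # what the decoder sees:  tokens 0 .. T-2
labels = tgt[:, 1:]           # what it must predict:   tokens 1 .. T-1

# Stand-in for model(src, decoder_input, src_mask, tgt_mask)
logits = torch.randn(tgt.size(0), decoder_input.size(1), vocab_size)

loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),   # [batch * (T-1), vocab]
    labels.reshape(-1),               # [batch * (T-1)]
    ignore_index=PAD_ID,              # padding positions do not contribute
)
print(decoder_input.shape, labels.shape, loss.item())
```

Together with the causal mask, the one-position shift means every position is trained to predict its successor in a single parallel forward pass.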

Section 08

Decoding Strategies — How We Generate Text

The raw model output is a probability distribution over the vocabulary. How we sample from it massively affects the quality, creativity, and coherence of the generated text.

🎯

Greedy Decoding

Always pick the highest-probability token. Fast and deterministic, but prone to getting stuck in loops and missing globally better sequences.

argmax(logits)

🌚

Beam Search

Keep the top-k sequences (beams) at each step. A middle ground between greedy decoding (k=1) and exhaustive search. Standard for translation and summarisation.

k=4 to 10 beams

🎲

Temperature Sampling

Divide logits by temperature T before softmax. T<1 = sharper (conservative). T>1 = flatter (creative, more random). T=1 = standard sampling.

logits / T

📊

Top-k Sampling

Restrict sampling to only the k most probable tokens. Prevents sampling from extremely unlikely tokens. Commonly k=40–50. Used in GPT-2.

k=50 typical

🔄

Top-p (Nucleus) Sampling

Sample from the smallest set of tokens whose cumulative probability ≥ p. Dynamic — uses few tokens on confident steps, many on uncertain ones.

p=0.9 typical

🔥

Repetition Penalty

Down-weight the probability of tokens already generated. Prevents the model from repeating the same phrase endlessly. Common in chatbots and story generation.

penalty=1.2 typical

🏆

Practitioner's Decoding Guide

For translation/summarisation: use beam search (k=4–8), no temperature. For creative writing/chat: use Top-p (0.9) + temperature (0.7–1.0) + repetition penalty (1.1–1.3). For code generation: low temperature (0.2–0.4), greedy or small beam, no repetition penalty.
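
The sampling strategies above compose naturally: apply temperature, then the top-k filter, then the top-p filter, then draw a token. The function below is a minimal sketch for a single 1-D logits vector. The name sample_next_token and its defaults are hypothetical, and production libraries handle batching and edge cases this version ignores.

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Sample one token ID from a 1-D logits vector of shape [vocab]."""
    logits = logits / temperature          # T < 1 sharpens, T > 1 flattens

    if top_k > 0:
        # Keep only the k largest logits (ties at the threshold are kept)
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))

    if top_p < 1.0:
        sorted_probs, sorted_idx = torch.sort(
            torch.softmax(logits, dim=-1), descending=True
        )
        # Drop a token once the probability mass *before* it already reaches top_p
        mass_before = sorted_probs.cumsum(dim=-1) - sorted_probs
        logits[sorted_idx[mass_before >= top_p]] = float("-inf")

    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])   # toy 5-token vocabulary
samples = [sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9)
           for _ in range(1000)]
print(sorted(set(samples)))
```

Greedy decoding is the limiting case of this function as the temperature approaches zero, where nearly all probability mass collapses onto the argmax token.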

Section 09

Encoder-Only, Decoder-Only, Encoder-Decoder — When to Use Which

The full Encoder-Decoder is just one of three variants. Understanding which to use for your task is critical.

Architecture	Examples	Best For	Attention Type	Output
Encoder-Only	BERT, RoBERTa, DistilBERT	Classification, NER, Q&A (extractive), Embeddings	Bidirectional self-attention	Contextual embeddings, not generation
Decoder-Only	GPT-2, GPT-4, LLaMA, Mistral	Text generation, Chatbots, Code, Completion	Causal (masked) self-attention	Next token prediction, generation
Encoder-Decoder	T5, BART, Whisper, mBART	Translation, Summarisation, Speech-to-Text, Q&A (abstractive)	Bidirectional enc + causal dec + cross-attn	Variable-length sequence from input

🔨 Use Encoder-Decoder When…

✓	Input and output are different sequences (translation, summarisation)
✓	You need the model to read first, then write
✓	Output length differs greatly from input length
✓	Strong alignment between source and target is critical

🚫 Avoid Encoder-Decoder When…

✗	You only need classification — use encoder-only
✗	You only need open-ended generation — use decoder-only
✗	Compute budget is very tight — two stacks cost roughly 2× the parameters of a single stack of the same depth
✗	Your task is simply next-token prediction — decoder-only is optimal

Section 10

Real-World Applications

🌐

Machine Translation

The original motivation. English → French, Chinese → Spanish. Models like mBART and M2M-100 support 100+ language pairs using a single Encoder-Decoder model.

Model: Helsinki-NLP/opus-mt-en-fr

📋

Text Summarisation

Long articles → short summaries. BART and PEGASUS are fine-tuned encoder-decoder models that generate abstractive (not copy-paste) summaries.

Model: facebook/bart-large-cnn

🎤

Speech Recognition (ASR)

Audio waveform → transcript. Whisper encodes mel-spectrograms with a CNN+Transformer encoder, then decodes text with a standard transformer decoder.

Model: openai/whisper-large-v3

🔍

Question Answering (Abstractive)

T5 ("Text-to-Text Transfer Transformer") treats every NLP task as seq2seq. QA: input "question: ... context: ..." → output is a generated answer string.

Model: google/flan-t5-large

📷

Image Captioning

A CNN or Vision Transformer encodes the image into a feature map (the "context"), which a transformer decoder attends to via cross-attention to generate captions.

Model: Salesforce/blip-image-captioning-large

💻

Code Generation

Natural language docstring → Python function. CodeT5 uses an encoder-decoder to map intent to code, attending to the docstring while generating. (StarCoder, often mentioned alongside it, is a decoder-only model.)

Model: Salesforce/codet5-base

Section 11

Using Hugging Face — Production-Ready in 10 Lines

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# ── Translation: English → French using Helsinki-NLP ──────
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
model     = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

src_text = "The encoder-decoder architecture revolutionised natural language processing."
inputs   = tokenizer(src_text, return_tensors="pt", padding=True)

# Generate with beam search
with torch.no_grad():
    generated_ids = model.generate(
        **inputs,               # passes input_ids and attention_mask together
        num_beams=5,
        max_length=100,
        early_stopping=True
    )

translation = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(f"EN: {src_text}")
print(f"FR: {translation}")

OUTPUT

EN: The encoder-decoder architecture revolutionised natural language processing.
FR: L'architecture encodeur-décodeur a révolutionné le traitement du langage naturel.

# ── Summarisation with BART ───────────────────────────────
from transformers import pipeline

summariser = pipeline("summarization", model="facebook/bart-large-cnn")

article = """
    The Transformer architecture, introduced in the landmark 2017 paper
    'Attention Is All You Need' by Vaswani et al., replaced recurrent neural
    networks with self-attention mechanisms. This allowed massively parallel
    computation during training and eliminated the vanishing gradient problem
    that plagued RNNs. The architecture consists of an encoder stack that
    processes the input, and a decoder stack that generates the output,
    connected by cross-attention. Within three years, nearly every
    state-of-the-art NLP system was built on this foundation.
"""

result = summariser(article, max_length=60, min_length=20, do_sample=False)
print(result[0]["summary_text"])

OUTPUT

The Transformer architecture was introduced in 2017 by Vaswani et al. It replaced recurrent networks with self-attention, enabling parallel training and solving the vanishing gradient problem. Nearly every modern NLP system is built on this foundation.

Section 12

Common Failure Modes and How to Fix Them

Failure Mode	Symptom	Root Cause	Fix
Exposure Bias	Good at training, poor at generation	Trained on gold tokens, generates with predicted tokens	Scheduled Sampling or Reinforcement Learning from Human Feedback (RLHF)
Over-generation	Output is repetitive or too long	Likelihood-based search favours generic, high-probability tokens and has no built-in length control	Length penalty, repetition penalty, no-repeat-ngram-size
Hallucination	Output contradicts input facts	Decoder over-relies on language priors, not encoder	Faithfulness loss, copy mechanism, RAG
Attention Collapse	Model ignores most of input	Cross-attention weight saturates on <eos> token	Coverage mechanism, attention supervision, label smoothing
Slow Inference	Generates 1 token per 200ms	Autoregressive decoding is sequential by nature	KV cache, speculative decoding, model distillation
OOM on Long Sequences	GPU OOM for seq_len > 512	Attention is O(n²) in memory	Flash Attention, Sliding Window Attention, chunked cross-attention

🔭

KV Cache — The Critical Inference Optimisation

At inference, the decoder generates one token at a time. Without caching, it re-projects Keys and Values for all previous tokens at every step, so generating n tokens costs O(n²) projections in total. With KV caching, the K and V tensors from previous steps are stored and reused, and only the newest token is projected: the per-step projection cost drops from O(n) to O(1). The new query still attends over all cached positions, so the attention itself remains O(n) per step. It is among the most impactful inference optimisations for autoregressive models.
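
The caching idea can be sketched with a single attention head and random weights. The snippet compares recomputing K and V from scratch at every step against appending one new row to a cache, and checks that both give the same outputs. The dimensions, weight matrices, and the attend helper are illustrative assumptions, not a full decoder.

```python
import math
import torch

torch.manual_seed(0)
d_model, n_steps = 16, 5
W_Q, W_K, W_V = (torch.randn(d_model, d_model) / math.sqrt(d_model) for _ in range(3))
tokens = torch.randn(n_steps, d_model)     # stand-in hidden states, one per step

def attend(q, K, V):
    """Single-head attention for one query vector against all given positions."""
    weights = torch.softmax(q @ K.T / math.sqrt(d_model), dim=-1)
    return weights @ V

# Without a cache: re-project every previous token at every step
no_cache_out = []
for t in range(n_steps):
    K = tokens[: t + 1] @ W_K              # t+1 projections recomputed each step
    V = tokens[: t + 1] @ W_V
    no_cache_out.append(attend(tokens[t] @ W_Q, K, V))

# With a cache: project only the newest token and append it
K_cache = torch.empty(0, d_model)
V_cache = torch.empty(0, d_model)
cache_out = []
for t in range(n_steps):
    K_cache = torch.cat([K_cache, (tokens[t] @ W_K).unsqueeze(0)])
    V_cache = torch.cat([V_cache, (tokens[t] @ W_V).unsqueeze(0)])
    cache_out.append(attend(tokens[t] @ W_Q, K_cache, V_cache))

max_diff = max((a - b).abs().max().item() for a, b in zip(no_cache_out, cache_out))
print(f"max difference: {max_diff:.2e}")
```

Caching is valid because of the causal mask: a past position's keys and values never depend on tokens generated after it, so they can be computed once and reused.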

Section 13

Golden Rules for Encoder-Decoder Models

🪝 Encoder-Decoder — Non-Negotiable Rules

Always use a pre-trained backbone (T5, BART, mBART) unless you have hundreds of millions of training pairs. Training a transformer from scratch requires enormous data. Fine-tune instead — it is faster, cheaper, and almost always better.

Match your tokeniser to your model. Never use BERT's tokeniser with T5. Each model has a vocabulary built specifically for it. Mixing them gives meaningless token IDs and garbage outputs.

For generation, always enable num_beams ≥ 4 for translation and summarisation. Greedy decoding is faster but measurably lower quality on these tasks. Use sampling (Top-p) for creative tasks.

Use label smoothing (ε=0.1) during training. Instead of target probabilities of exactly 0 or 1, distribute a small probability mass ε to non-target tokens. This regularises the model and prevents overconfident predictions.

Use the warmup learning rate schedule. The original Transformer paper uses lr = d_model^{-0.5} · min(step^{-0.5}, step · warmup_steps^{-1.5}). Without warmup, training can diverge in the first few hundred steps, because transformers are sensitive to the early learning rate.

Pad on the right for both the source and the target, and always pass attention_mask so that padding tokens are ignored. Cross-attention to padding positions is wasted compute and degrades alignment quality.

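Enable Flash Attention 2 if your GPU supports it (Ampere+). It computes exact attention with memory that grows linearly in sequence length instead of O(n²), which makes much longer sequences (tens of thousands of tokens, depending on model size and GPU memory) practical on a single GPU. It requires the flash-attn package and fp16 or bf16 weights. One line change: model = AutoModel.from_pretrained(..., attn_implementation="flash_attention_2").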
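
The warmup formula quoted in the rules above translates directly into code. This is a minimal sketch: the function name noam_lr is hypothetical, and warmup_steps=4000 is the value used in the original Transformer paper rather than a universal default.

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate from the schedule quoted above; `step` starts at 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000, 64000):
    print(f"step {step:>6}: lr = {noam_lr(step):.6f}")
```

The rate rises linearly until step == warmup_steps, where the two arguments of min coincide, and then decays in proportion to step^{-0.5}. With PyTorch, a function like this could be wrapped in torch.optim.lr_scheduler.LambdaLR using a base learning rate of 1.0.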