The Story That Explains Encoder-Decoder
Picture a human interpreter rendering a French speech into English. The interpreter does not translate word by word. The interpreter first encodes the full French speech into an internal "meaning representation", then decodes that meaning into English, one word at a time, attending to what has already been said.
That is exactly what an Encoder-Decoder neural network does. The Encoder reads the entire input and compresses it into a rich internal representation. The Decoder reads that representation and produces the output, step by step, attending to the most relevant parts as it goes.
The Encoder-Decoder architecture (also called sequence-to-sequence or seq2seq) is the backbone of modern machine translation, summarisation, speech recognition, and image captioning, and its two halves are the building blocks of nearly every large language model in use today. Understanding it deeply means understanding how T5 and Whisper (full encoder-decoder), BERT (encoder only), and GPT (decoder only) work under the hood.
An Encoder-Decoder solves the fundamental challenge of mapping a variable-length input to a variable-length output. Plain feed-forward networks expect fixed-size inputs and outputs. Seq2seq breaks the problem into two stages, first understand (encode), then generate (decode), and each stage handles its own sequence length.
The Big Picture — Architecture Overview
Before diving into components, here is the full data flow from input tokens to output tokens.
ⓘ The encoder runs once; the decoder runs autoregressively — each generated token feeds back as the next input.
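The "encode once, decode step by step" flow can be sketched with a toy model. `ToySeq2Seq` below is a hypothetical stand-in (not a real Transformer, and not the model built later in this article); its only purpose is to make the control flow of greedy autoregressive decoding visible. The token IDs for BOS and EOS are assumptions.

```python
import torch
import torch.nn as nn


class ToySeq2Seq(nn.Module):
    """Hypothetical stand-in model: just enough structure to show the decode loop."""

    def __init__(self, vocab=20, d_model=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.out = nn.Linear(d_model, vocab)
        self.encode_calls = 0  # counts how often the encoder is run

    def encode(self, src):
        self.encode_calls += 1
        return self.embed(src)  # [batch, src_len, d_model]

    def decode(self, tgt, memory):
        # Mix the mean of the encoder output into every target position
        h = self.embed(tgt) + memory.mean(dim=1, keepdim=True)
        return self.out(h)  # [batch, tgt_len, vocab]


def greedy_decode(model, src, bos_id, eos_id, max_len=10):
    memory = model.encode(src)  # the encoder runs exactly once
    ys = torch.full((src.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = model.decode(ys, memory)  # decoder sees all tokens so far
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_tok], dim=1)  # feed the new token back in
        if (next_tok == eos_id).all():
            break
    return ys


torch.manual_seed(0)
model = ToySeq2Seq().eval()
src = torch.randint(2, 20, (1, 5))
with torch.no_grad():
    out = greedy_decode(model, src, bos_id=0, eos_id=1)
print(out.shape, model.encode_calls)
```

However many tokens are generated, `encode_calls` stays at 1, while `decode` is called once per generated token.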
The Encoder — Reading and Understanding
Like the interpreter, the Encoder takes in the whole input before anything is produced. Through Self-Attention, every token attends to every other token in parallel. The result is not a single summary vector but a matrix with one vector per input position, each enriched by the full context.
Self-Attention — How Every Token Talks to Every Other Token
Self-Attention is the heart of the Encoder. For each token, it computes three vectors: Query (Q), Key (K), and Value (V). Think of it as a soft database lookup: the Query asks a question, the Keys determine relevance, and the Values hold the content.
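Written as a formula, the lookup described above is the scaled dot-product attention of Vaswani et al. (2017), where d_k is the dimension of each key vector:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```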
When the key dimension d_k is large, dot products grow large in magnitude: for components with unit variance, the variance of a dot product scales with d_k. Large scores push softmax into regions with near-zero gradients. Dividing by √d_k (the square root of the key dimension) keeps the scores in a healthy range and stabilises training. The original "Attention Is All You Need" paper (Vaswani et al., 2017) introduced this trick.
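A small numerical experiment can make the scaling argument concrete. This is only an illustrative sketch: it assumes queries and keys with independent, unit-variance components, which real trained projections need not satisfy.

```python
import math

import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(10_000, d_k)  # 10,000 random queries, unit-variance components
k = torch.randn(10_000, d_k)  # 10,000 random keys, unit-variance components

raw = (q * k).sum(dim=-1)      # one dot product per (query, key) pair
scaled = raw / math.sqrt(d_k)  # the same scores after dividing by sqrt(d_k)

print(f"std of raw scores:    {raw.std().item():.2f}")
print(f"std of scaled scores: {scaled.std().item():.2f}")
```

Under these assumptions the raw scores have a standard deviation of roughly √512 ≈ 22.6, while the scaled scores stay close to 1.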
Multi-Head Attention — Many Perspectives at Once
A single attention head can only focus on one type of relationship at a time. Multi-Head Attention runs h attention heads in parallel, each with its own Q, K, V projections. Their outputs are concatenated and projected back to d_model.
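In the notation of Vaswani et al. (2017), with h heads and per-head dimension d_k = d_model / h, this reads:

```latex
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right)
```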
Encoder Layer — Full Walkthrough
The Decoder — Generating Output
This is the Decoder: it attends to its own past outputs (Masked Self-Attention) and to the encoder's context (Cross-Attention), then predicts the next token.
Three Sub-Layers of the Decoder
During training, we feed the entire target sequence to the decoder at once for efficiency. Without the causal mask, position 3 could attend to position 7 and would "see the answer." The mask sets the attention scores for all positions to the right of the current one to −∞ before softmax, so they receive zero weight in the weighted sum. This forces the model to predict each token using only past context.
ⓘ The red cells show masked future positions. The decoder at position 2 ("chat") can only see positions 0–2, never positions 3+ ("s'est", "assis").
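The effect of the causal mask can be checked on a small example. The sketch below uses random scores for a hypothetical 5-token target sequence; only the masking pattern matters, not the values.

```python
import torch
import torch.nn.functional as F

seq_len = 5
# Lower-triangular boolean mask: True = position may be attended to
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

torch.manual_seed(0)
scores = torch.randn(seq_len, seq_len)  # stand-in attention scores

# Future positions get -inf, so softmax assigns them exactly zero weight
masked_scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)

print(causal_mask.int())
print(weights)
```

Row i of `weights` is non-zero only in columns 0 to i, and each row still sums to 1.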
Cross-Attention — The Bridge Between Encoder and Decoder
Cross-Attention is the mechanism that lets the decoder "consult" the encoder. It is the most important part of the Encoder-Decoder architecture — without it, the decoder would generate output in a vacuum.
Without cross-attention:

| Problem | Effect |
|---|---|
| Decoder has no source access | Hallucination |
| Output ignores input meaning | Random text |
| Bottleneck: single context vector | Information loss |

With cross-attention:

| Capability | Effect |
|---|---|
| Decoder queries encoder at each step | Grounded output |
| Attention over all source positions | No bottleneck |
| Dynamic focus per output token | Accurate alignment |
The model learns these attention patterns entirely from data. Nobody told it that "chat" corresponds to "cat." Through backpropagation on millions of translation pairs, the cross-attention weights organically learn the alignment between source and target languages. This replaces the hand-crafted alignment tables used in older statistical machine translation systems.
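The shapes involved in cross-attention can be traced in a few lines. This is a single-head sketch with randomly initialised (untrained) projections, so the weights carry no meaningful alignment; the tensor names and sizes are illustrative assumptions.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch, src_len, tgt_len, d_model = 1, 6, 4, 8

enc_output = torch.randn(batch, src_len, d_model)  # encoder states: source of K and V
dec_states = torch.randn(batch, tgt_len, d_model)  # decoder states: source of Q

W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

with torch.no_grad():
    Q = W_q(dec_states)  # [batch, tgt_len, d_model]
    K = W_k(enc_output)  # [batch, src_len, d_model]
    V = W_v(enc_output)  # [batch, src_len, d_model]

    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_model)
    weights = F.softmax(scores, dim=-1)  # [batch, tgt_len, src_len]
    context = weights @ V                # [batch, tgt_len, d_model]

print(weights.shape, context.shape)
```

Each of the `tgt_len` rows in `weights` is a distribution over the `src_len` source positions; in a trained translation model these rows are what gets visualised as an alignment map.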
Positional Encoding — Giving the Model a Sense of Order
The original Transformer uses sinusoidal positional encoding:
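As defined in Vaswani et al. (2017), for position pos and dimension index i the encoding is:

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right)
```

Even dimensions use the sine and odd dimensions the cosine, with wavelengths forming a geometric progression across the embedding.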
Python Implementation — From Scratch with PyTorch
Step 1 — Scaled Dot-Product Attention
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: [batch, heads, q_len, d_k]; K, V: [batch, heads, k_len, d_k]
    mask: broadcastable to [batch, heads, q_len, k_len]; 1/True = keep, 0/False = mask
    """
    d_k = Q.size(-1)

    # Step 1: Dot-product scores and scale
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # scores: [batch, heads, q_len, k_len]

    # Step 2: Apply mask (large negative score → ~0 weight after softmax)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # Step 3: Softmax to get attention weights
    attn_weights = F.softmax(scores, dim=-1)

    # Step 4: Weighted sum of values
    output = torch.matmul(attn_weights, V)
    return output, attn_weights
Step 2 — Multi-Head Attention
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # dimension per head

        # Learned projection matrices for Q, K, V and the output
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        # x: [batch, seq_len, d_model]
        # → [batch, num_heads, seq_len, d_k]
        batch, seq_len, _ = x.size()
        x = x.view(batch, seq_len, self.num_heads, self.d_k)
        return x.transpose(1, 2)

    def forward(self, Q_in, K_in, V_in, mask=None):
        # Project inputs
        Q = self.split_heads(self.W_Q(Q_in))
        K = self.split_heads(self.W_K(K_in))
        V = self.split_heads(self.W_V(V_in))

        # Attention across all heads in parallel
        attn_out, _ = scaled_dot_product_attention(Q, K, V, mask)

        # Merge heads: [batch, heads, seq_len, d_k] → [batch, seq_len, d_model]
        batch, _, seq_len, _ = attn_out.size()
        attn_out = attn_out.transpose(1, 2).contiguous().view(batch, seq_len, self.d_model)
        return self.W_O(attn_out)  # Final output projection
Step 3 — Encoder Layer
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        # Self-attention sub-layer (Q = K = V = x)
        attn_out = self.self_attn(x, x, x, src_mask)
        x = self.norm1(x + self.dropout(attn_out))  # Residual + Norm

        # Feed-forward sub-layer
        ffn_out = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_out))  # Residual + Norm
        return x
Step 4 — Decoder Layer
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.masked_self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask=None, tgt_mask=None):
        # 1. Masked Self-Attention on decoder's own past outputs
        ma_out = self.masked_self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(ma_out))

        # 2. Cross-Attention: Q from decoder, K & V from encoder
        ca_out = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(ca_out))

        # 3. Feed-Forward
        ffn_out = self.ffn(x)
        x = self.norm3(x + self.dropout(ffn_out))
        return x
Step 5 — Full Transformer
class Transformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=512, N=6,
                 num_heads=8, d_ff=2048, dropout=0.1, max_len=5000):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.pos_enc = PositionalEncoding(d_model, max_len)  # See below
        self.encoder_layers = nn.ModuleList(
            [EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(N)]
        )
        self.decoder_layers = nn.ModuleList(
            [DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(N)]
        )
        self.fc_out = nn.Linear(d_model, tgt_vocab)

    def encode(self, src, src_mask):
        x = self.pos_enc(self.src_embed(src))
        for layer in self.encoder_layers:
            x = layer(x, src_mask)
        return x  # [batch, src_len, d_model]

    def decode(self, tgt, enc_out, src_mask, tgt_mask):
        x = self.pos_enc(self.tgt_embed(tgt))
        for layer in self.decoder_layers:
            x = layer(x, enc_out, src_mask, tgt_mask)
        return x

    def forward(self, src, tgt, src_mask, tgt_mask):
        enc_out = self.encode(src, src_mask)
        dec_out = self.decode(tgt, enc_out, src_mask, tgt_mask)
        return self.fc_out(dec_out)  # [batch, tgt_len, tgt_vocab]


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)
Step 6 — Quick Sanity Check
# Sanity check: forward pass with random data
model = Transformer(src_vocab=10000, tgt_vocab=10000)
model.eval()

batch, src_len, tgt_len = 2, 12, 10
src = torch.randint(0, 10000, (batch, src_len))
tgt = torch.randint(0, 10000, (batch, tgt_len))

# Source mask: no padding (all ones).
# Shape [batch, 1, 1, src_len] so it broadcasts over heads and query positions
# in both encoder self-attention and decoder cross-attention.
src_mask = torch.ones(batch, 1, 1, src_len)

# Target mask: causal (lower triangular), shape [1, 1, tgt_len, tgt_len]
tgt_mask = torch.tril(torch.ones(tgt_len, tgt_len)).unsqueeze(0).unsqueeze(0)

with torch.no_grad():
    logits = model(src, tgt, src_mask, tgt_mask)

print(f"Input shape: {src.shape}")      # torch.Size([2, 12])
print(f"Target shape: {tgt.shape}")     # torch.Size([2, 10])
print(f"Output shape: {logits.shape}")  # torch.Size([2, 10, 10000])
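The sanity check uses an all-ones source mask because the random batch has no padding. With real batches, masks are usually derived from the token IDs. The helpers below are a minimal sketch of one way to do that; `PAD_ID = 0` and the function names are assumptions for illustration, and the actual padding ID depends on the tokenizer.

```python
import torch

PAD_ID = 0  # assumed padding token id; check your tokenizer


def make_src_mask(src, pad_id=PAD_ID):
    """Return a [batch, 1, 1, src_len] mask: True = real token, False = padding."""
    return (src != pad_id).unsqueeze(1).unsqueeze(2)


def make_tgt_mask(tgt, pad_id=PAD_ID):
    """Combine padding and causal masks into [batch, 1, tgt_len, tgt_len]."""
    tgt_len = tgt.size(1)
    pad = (tgt != pad_id).unsqueeze(1).unsqueeze(2)  # [batch, 1, 1, tgt_len]
    causal = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.bool))
    return pad & causal  # broadcasts to [batch, 1, tgt_len, tgt_len]


src = torch.tensor([[5, 7, 9, 0, 0],
                    [3, 4, 0, 0, 0]])
tgt = torch.tensor([[1, 6, 8, 0]])

src_mask = make_src_mask(src)
tgt_mask = make_tgt_mask(tgt)
print(src_mask.shape, tgt_mask.shape)
```

Both masks use the "True = keep" convention, so they can be passed to an attention function that fills positions where `mask == 0` with a large negative score.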
Decoding Strategies — How We Generate Text
The raw model output is a vector of logits over the vocabulary, which softmax turns into a probability distribution. How we pick the next token from that distribution strongly affects the quality, creativity, and coherence of the generated text.
For translation/summarisation: use beam search (k=4–8), no temperature. For creative writing/chat: use Top-p (0.9) + temperature (0.7–1.0) + repetition penalty (1.1–1.3). For code generation: low temperature (0.2–0.4), greedy or small beam, no repetition penalty.
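To make the Top-p and temperature settings concrete, here is a minimal sketch of nucleus sampling for a single decoding step. The function name `sample_top_p` and the example logits are illustrative assumptions; production libraries implement the same idea with batching and extra safeguards.

```python
import torch
import torch.nn.functional as F


def sample_top_p(logits, top_p=0.9, temperature=0.8):
    """Sample one token id from a 1-D logits vector (temperature + nucleus filtering)."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    probs = F.softmax(logits / temperature, dim=-1)

    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)

    # Drop a token only if the mass *before* it already reaches top_p,
    # so the most likely token is always kept
    remove = (cumulative - sorted_probs) >= top_p
    sorted_probs = sorted_probs.masked_fill(remove, 0.0)
    sorted_probs = sorted_probs / sorted_probs.sum()  # renormalise the nucleus

    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_ids[choice].item()


torch.manual_seed(0)
logits = torch.tensor([4.0, 3.5, 1.0, -2.0, -5.0])
samples = [sample_top_p(logits) for _ in range(200)]
print(sorted(set(samples)))
```

With these logits the two most likely tokens already cover more than 90% of the probability mass, so only token IDs 0 and 1 are ever sampled.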
Encoder-Only, Decoder-Only, Encoder-Decoder — When to Use Which
The full Encoder-Decoder is just one of three variants. Understanding which to use for your task is critical.
| Architecture | Examples | Best For | Attention Type | Output |
|---|---|---|---|---|
| Encoder-Only | BERT, RoBERTa, DistilBERT | Classification, NER, Q&A (extractive), Embeddings | Bidirectional self-attention | Contextual embeddings, not generation |
| Decoder-Only | GPT-2, GPT-4, LLaMA, Mistral | Text generation, Chatbots, Code, Completion | Causal (masked) self-attention | Next token prediction, generation |
| Encoder-Decoder | T5, BART, Whisper, mBART | Translation, Summarisation, Speech-to-Text, Q&A (abstractive) | Bidirectional enc + causal dec + cross-attn | Variable-length sequence from input |

When to choose an Encoder-Decoder (✓) and when not to (✗):

| Fit | Situation |
|---|---|
| ✓ | Input and output are different sequences (translation, summarisation) |
| ✓ | You need the model to read first, then write |
| ✓ | Output length differs greatly from input length |
| ✓ | Strong alignment between source and target is critical |
| ✗ | You only need classification: use encoder-only |
| ✗ | You only need open-ended generation: use decoder-only |
| ✗ | Compute budget is very tight: two stacks cost roughly 2× the parameters of one |
| ✗ | Your task is simply next-token prediction: decoder-only is the natural fit |
Real-World Applications
Using Hugging Face — Production-Ready in 10 Lines
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# ── Translation: English → French using Helsinki-NLP ──────
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

src_text = "The encoder-decoder architecture revolutionised natural language processing."
inputs = tokenizer(src_text, return_tensors="pt", padding=True)

# Generate with beam search; **inputs passes both input_ids and attention_mask
with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        num_beams=5,
        max_length=100,
        early_stopping=True
    )

translation = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(f"EN: {src_text}")
print(f"FR: {translation}")

# ── Summarisation with BART ───────────────────────────────
from transformers import pipeline

summariser = pipeline("summarization", model="facebook/bart-large-cnn")

article = """
The Transformer architecture, introduced in the landmark 2017 paper
'Attention Is All You Need' by Vaswani et al., replaced recurrent neural
networks with self-attention mechanisms. This allowed massively parallel
computation during training and eliminated the vanishing gradient problem
that plagued RNNs. The architecture consists of an encoder stack that
processes the input, and a decoder stack that generates the output,
connected by cross-attention. Within three years, nearly every
state-of-the-art NLP system was built on this foundation.
"""

result = summariser(article, max_length=60, min_length=20, do_sample=False)
print(result[0]["summary_text"])
Common Failure Modes and How to Fix Them
| Failure Mode | Symptom | Root Cause | Fix |
|---|---|---|---|
| Exposure Bias | Good at training, poor at generation | Trained on gold tokens, generates with predicted tokens | Scheduled Sampling or Reinforcement Learning from Human Feedback (RLHF) |
| Over-generation | Output is repetitive or too long | Beam search favours short common tokens | Length penalty, repetition penalty, no-repeat-ngram-size |
| Hallucination | Output contradicts input facts | Decoder over-relies on language priors, not encoder | Faithfulness loss, copy mechanism, RAG |
| Attention Collapse | Model ignores most of input | Cross-attention weight saturates on <eos> token | Coverage mechanism, attention supervision, label smoothing |
| Slow Inference | Generates 1 token per 200ms | Autoregressive decoding is sequential by nature | KV cache, speculative decoding, model distillation |
| OOM on Long Sequences | GPU OOM for seq_len > 512 | Attention is O(n²) in memory | Flash Attention, Sliding Window Attention, chunked cross-attention |
At inference, the decoder generates one token at a time. Without caching, it recomputes the Keys and Values for all previous tokens at every step, so the projection work alone is O(n²) over a sequence of n tokens. With KV caching, the K and V tensors from previous steps are stored and reused, so each step only projects the newest token: per-step K/V computation drops from O(n) to O(1). The new token's query still attends over all cached positions. KV caching is among the most impactful inference optimisations for autoregressive models.
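The idea can be illustrated with a single attention head. The sketch below is a simplification under stated assumptions: one layer, one head, random untrained weights, and per-step inputs `x[t]` that are fixed in advance (as they would be for the first decoder layer, where inputs are token embeddings). It compares recomputing K and V at every step against appending to a cache.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model, steps = 8, 5
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)
x = torch.randn(steps, d_model)  # hypothetical decoder input for each step


def attend(q, K, V):
    """Attention of one query vector over all available keys and values."""
    weights = F.softmax(q @ K.T / math.sqrt(d_model), dim=-1)
    return weights @ V


with torch.no_grad():
    # Without cache: re-project every previous token at every step
    no_cache = [
        attend(W_q(x[t]), W_k(x[: t + 1]), W_v(x[: t + 1])) for t in range(steps)
    ]

    # With cache: project only the newest token, append to the stored K and V
    K_cache = torch.empty(0, d_model)
    V_cache = torch.empty(0, d_model)
    cached = []
    for t in range(steps):
        K_cache = torch.cat([K_cache, W_k(x[t : t + 1])], dim=0)
        V_cache = torch.cat([V_cache, W_v(x[t : t + 1])], dim=0)
        cached.append(attend(W_q(x[t]), K_cache, V_cache))

same = all(torch.allclose(a, b, atol=1e-6) for a, b in zip(no_cache, cached))
print(same, K_cache.shape)
```

Both paths produce the same attention outputs; the cached path simply avoids re-running `W_k` and `W_v` on tokens it has already seen, at the cost of keeping `K_cache` and `V_cache` in memory.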
Golden Rules for Encoder-Decoder Models
Use beam search with num_beams ≥ 4 for translation and summarisation.
Greedy decoding is faster but typically gives lower quality on these tasks. Use sampling (Top-p) for creative tasks.
Use learning-rate warmup: lr = d_model^{-0.5} · min(step^{-0.5}, step · warmup_steps^{-1.5}).
Skip warmup and your model may diverge in the first few hundred steps, because transformers are sensitive to the early learning rate.
Always pass attention_mask to ignore padding tokens. Cross-attention to padding positions is wasted compute and degrades alignment quality.
Enable Flash Attention where your hardware and library versions support it: model = AutoModel.from_pretrained(..., attn_implementation="flash_attention_2").
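The warmup schedule quoted in the rules above can be written as a small function. This is a plain-Python sketch of that formula; the defaults `d_model=512` and `warmup_steps=4000` are assumptions taken from the base configuration in Vaswani et al. (2017), and the function name `noam_lr` is only a common nickname for this schedule.

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate: linear warmup for warmup_steps, then inverse-sqrt decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for s in (1, 1000, 4000, 16000):
    print(s, f"{noam_lr(s):.6f}")
```

The rate rises linearly until `step == warmup_steps`, where both arguments of `min` are equal, and decays proportionally to `step ** -0.5` afterwards. To plug it into PyTorch, one option is `torch.optim.lr_scheduler.LambdaLR` with a base learning rate of 1.0, so that the lambda's return value becomes the effective rate.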