Transformer Architecture Explained: Encoder & Decoder

Section 01

The Story That Explains Transformers

📖 Real World Analogy

The UN Simultaneous Interpreter

Imagine you are an interpreter at the United Nations. A diplomat says:

"The bank by the river was steep — but the bank refused my loan anyway."

The word "bank" appears twice. As a human, you instantly understand that the first means a riverbank and the second means a financial institution — because you relate each word to every other word in the sentence simultaneously. You don't read left-to-right and forget the beginning; you hold the whole sentence in your mind and resolve ambiguity through context.

That is exactly what a Transformer does — and it was a radical departure from every AI language model that came before it. Old models (RNNs, LSTMs) read word-by-word, like a person reading aloud. Transformers read the whole sentence at once, attending to every word in relation to every other. That single idea — self-attention — changed everything.

The Transformer architecture was introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al. at Google Brain. In just a few years it replaced RNNs and LSTMs as the dominant architecture for natural language processing, and has since conquered computer vision, protein folding, audio generation, and more.

⚡

Why Transformers Won

RNNs process sequences one token at a time, so information from step 1 must travel through every intermediate step to reach step 100. Gradients tend to vanish (or explode) along that path, which makes long-range dependencies hard to learn; LSTMs ease the problem but do not remove it. Transformers eliminated this bottleneck: any token can attend directly to any other token in a single step, regardless of how far apart they are. And because the positions are not processed sequentially, the whole sequence can be processed in parallel on GPUs during training, which made training dramatically faster.

Section 02

The Bird's-Eye View — Encoder & Decoder

The original Transformer is an encoder–decoder architecture designed for sequence-to-sequence tasks like machine translation. Think of it as two specialists passing a baton:

👁️

The Encoder

Understanding

Reads the entire input sequence and builds a rich, context-aware representation of its meaning. It doesn't produce output words — it produces a deeply understood summary of what was said.

▶️

The Connection

Cross-Attention

The encoder's output — a matrix of vectors called the context — is handed to the decoder. The decoder reads it via cross-attention, allowing every output word to look at every input word.

🖊️

The Decoder

Generating

Generates the output sequence one token at a time. It attends to previously generated tokens (self-attention) and to the encoder's context (cross-attention) to decide what comes next.

🏭 Full Encoder–Decoder Pipeline — English to French Translation

Input

Raw English text: "The cat sat on the mat" → tokenised to IDs → embeddings added with positional encoding

Encoder

6 identical encoder blocks process the embeddings in parallel → each token learns context from all other tokens via self-attention

Context

The encoder outputs one rich context vector per input token, stacked into a matrix. Every decoder layer receives this matrix and projects it into its own Keys and Values (K, V) for cross-attention

Decoder

6 decoder blocks generate the French translation token-by-token, using masked self-attention (can't see future tokens) + cross-attention over encoder output

Output

Final linear layer + softmax produces a probability distribution over the vocabulary at each step → argmax (greedy decoding) picks the next token, and the loop repeats until an end-of-sequence token: "Le chat était assis sur le tapis"
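As a minimal sketch of that last step, the snippet below turns a vector of logits into probabilities with softmax and picks the most likely token. The five-word vocabulary and the logit values are invented for illustration; a real decoder produces one such vector per step and repeats the choice until it emits an end-of-sequence token.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and decoder logits for ONE decoding step
vocab  = ["<eos>", "Le", "chat", "tapis", "sur"]
logits = [0.1, 3.2, 1.0, -0.5, 0.3]

probs = softmax(logits)
next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax = greedy choice
next_token = vocab[next_id]
print(next_token, round(probs[next_id], 3))
```

Greedy argmax is only one decoding strategy; beam search or sampling would replace the `max(...)` line while leaving the softmax step unchanged.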

Section 03

Input Embeddings & Positional Encoding

📖 Story

Words Don't Have Addresses — Until You Give Them One

A neural network sees numbers, not words. Every word is converted to a dense vector (an embedding), typically of 512 or 768 dimensions. But here's the problem: if you feed a Transformer the sentences "Dog bites man" and "Man bites dog", and the embeddings are just looked up in a table, the same three vectors appear in both cases, only in a different order, and self-attention on its own has no way to tell the two orderings apart.

The solution: add a positional encoding to each embedding — a unique mathematical "address" that encodes not just what a word is, but where it sits in the sentence. Position 0 gets a different sinusoidal pattern added to it than position 1, 2, or 100. Now the model knows the order without ever processing tokens sequentially.
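To make the ordering problem concrete, here is a minimal sketch of self-attention without any positional signal. It assumes identity projections (Q = K = V = the raw embeddings) and made-up 2-d embeddings, both chosen only to keep the example tiny. Under those assumptions, "dog" receives exactly the same output vector in both sentences.

```python
import math

def self_attention(xs):
    """Single-head self-attention with identity projections (Q = K = V = xs)."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi / z * v[j] for wi, v in zip(w, xs)) for j in range(d)])
    return out

# Hypothetical 2-d embeddings, chosen only for illustration
emb = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [0.5, 0.5]}

a = self_attention([emb[w] for w in ["dog", "bites", "man"]])
b = self_attention([emb[w] for w in ["man", "bites", "dog"]])

# "dog" sits at index 0 in the first sentence and index 2 in the second
dog_a, dog_b = a[0], b[2]
print(dog_a, dog_b)
```

Because the outputs match up to floating-point rounding, nothing downstream can recover who bit whom; adding a position-dependent vector to each embedding breaks this symmetry.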

Positional Encoding — Even Dimension

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

For even dimensions, a sine wave is used. Different frequencies encode different positions uniquely.

Positional Encoding — Odd Dimension

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

For odd dimensions, a cosine wave with the same frequency. Together, sine + cosine make each position a unique fingerprint.

Final Token Representation

x = WordEmbedding(token) + PE(position)

The word embedding and positional encoding are simply summed. The model learns to separate meaning from position during training.

Embedding Scale Factor

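x = WordEmbedding(token) × √d_model + PE(position)

In the original paper the word embeddings are multiplied by √d_model before the PE is added (the PE itself is not scaled), which keeps the PE signal from overwhelming the word meaning. Many implementations follow this convention, but not all.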

import torch
import math

class PositionalEncoding(torch.nn.Module):
    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = torch.nn.Dropout(p=dropout)

        # Create a (max_len, d_model) matrix of positional encodings
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)   # even dims → sin
        pe[:, 1::2] = torch.cos(position * div_term)   # odd  dims → cos

        pe = pe.unsqueeze(0)  # shape: (1, max_len, d_model) — batch dimension
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x shape: (batch, seq_len, d_model)
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

💡

Why Sinusoids? Why Not Just Integers?

Using integers (0, 1, 2 …) would give the model an unbounded signal that grows with sequence length. Sinusoids stay between –1 and +1 regardless of sequence length, and the paper's authors hypothesised that their regular structure would help the model generalise to sequences longer than those seen in training. Many later models (BERT, GPT-2 and their derivatives) learn positional embeddings as parameters instead, which is simpler but ties the model to a fixed maximum length.
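The two properties mentioned here can be checked numerically. The sketch below re-implements the sinusoidal formulas in plain Python (assuming an even d_model) and verifies that the values stay bounded and that the dot product between two encodings depends only on the gap between their positions, not on where they sit in the sequence.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position (d_model assumed even)."""
    pe = []
    for i in range(0, d_model, 2):          # i plays the role of "2i" in the formula
        angle = pos / (10000 ** (i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_model = 16

# Bounded: every component stays in [-1, 1], even at a very large position
far_pe = positional_encoding(100_000, d_model)
assert all(-1.0 <= v <= 1.0 for v in far_pe)

# Relative offsets: the similarity of two positions depends only on the gap k
k = 7
sim_near = dot(positional_encoding(3, d_model), positional_encoding(3 + k, d_model))
sim_far  = dot(positional_encoding(500, d_model), positional_encoding(500 + k, d_model))
print(round(sim_near, 6), round(sim_far, 6))
```

The two similarities agree because sin(a)·sin(b) + cos(a)·cos(b) = cos(a − b) for each frequency, so the absolute position cancels out.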

Section 04

The Heart of It — Scaled Dot-Product Attention

📖 Story

The Library with Post-it Notes

Imagine a vast library. Every book has two things: a label on the spine (the Key) describing what's inside, and a summary card on the shelf (the Value) holding the actual knowledge. You walk in with a question (the Query). You match your question against every spine label, score how relevant each book is, then pull a weighted blend of all the summary cards proportional to those relevance scores.

That is attention. Query × Key gives relevance. Softmax normalises it to probabilities. Those probabilities weight the Values. The result is a context-aware blend of information from every position in the sequence — for every position simultaneously.

Attention Formula

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

The full formula in one line. Q, K, V are projections of the input. d_k is the key dimension. The √d_k scaling prevents the dot products from growing too large.

Why √d_k Scaling?

Var(q · k) = d_k → scale by 1/√d_k

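If the components of q and k are independent with mean 0 and variance 1, their dot product has variance d_k. Without scaling, a large d_k therefore produces large dot products, pushing softmax into regions with tiny gradients. Dividing by √d_k restores unit variance and keeps gradients stable.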
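The variance claim is easy to check empirically. This sketch assumes the components of q and k are independent standard normal values (the same assumption behind the formula) and estimates the variance of their dot product from random samples; the seed and sample count are arbitrary choices.

```python
import random

random.seed(0)

def dot_product_variance(d_k, n_samples=5000):
    """Empirical variance of q.k for q, k with i.i.d. N(0, 1) components."""
    dots = []
    for _ in range(n_samples):
        q = [random.gauss(0, 1) for _ in range(d_k)]
        k = [random.gauss(0, 1) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / n_samples
    return sum((x - mean) ** 2 for x in dots) / n_samples

d_k = 64
raw_var = dot_product_variance(d_k)
scaled_var = raw_var / d_k      # Var(x / sqrt(d_k)) = Var(x) / d_k
print(f"raw: {raw_var:.1f}  scaled: {scaled_var:.2f}")
```

The raw variance should land near d_k = 64 and the scaled one near 1, up to sampling noise.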

🔑 Attention — Step by Step with a 4-Word Sentence

Step 1

Each input embedding (shape d_model) is linearly projected into three vectors: Q (query), K (key), V (value) — each of dimension d_k = d_model / h where h is the number of heads

Step 2

Compute attention scores: the dot product of Q with all K vectors. For a 4-word sequence, this produces a 4×4 score matrix — every word scored against every word

Step 3

Divide scores by √d_k to stabilise gradients, then apply softmax row-by-row to get attention weights that sum to 1 per row

Step 4

Multiply the attention weight matrix by the V matrix → each position's output is a weighted blend of all value vectors, weighted by relevance

Output

A new matrix, same shape as input, where each vector is now context-aware — it knows about every other token in the sequence

import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: (batch, heads, seq_len, d_k)
    K: (batch, heads, seq_len, d_k)
    V: (batch, heads, seq_len, d_v)
    Returns: (batch, heads, seq_len, d_v), attention_weights
    """
    d_k = Q.size(-1)

    # Step 1: compute raw scores → (batch, heads, seq, seq)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    # Step 2: apply mask (decoder causal mask or padding mask)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Step 3: softmax over last dimension (which key each query attends to)
    attn_weights = F.softmax(scores, dim=-1)   # (batch, heads, seq, seq)

    # Step 4: weighted sum of values
    output = torch.matmul(attn_weights, V)        # (batch, heads, seq, d_v)

    return output, attn_weights


# Quick numerical demo — 1 head, 4 tokens, d_k=8
batch, heads, seq, d_k = 1, 1, 4, 8
Q = torch.randn(batch, heads, seq, d_k)
K = torch.randn(batch, heads, seq, d_k)
V = torch.randn(batch, heads, seq, d_k)

out, weights = scaled_dot_product_attention(Q, K, V)
print(f"Output shape:  {out.shape}")       # (1, 1, 4, 8)
print(f"Weights shape: {weights.shape}")   # (1, 1, 4, 4) — 4×4 attention map
print(f"Weights sum:   {weights[0,0,0].sum():.4f}")  # 1.0000 — softmax guarantee

OUTPUT

Output shape:  torch.Size([1, 1, 4, 8])
Weights shape: torch.Size([1, 1, 4, 4])
Weights sum:   1.0000

Section 05

Multi-Head Attention — Why One Perspective Isn't Enough

📖 Story

The Panel of Experts

When a journalist interviews a politician, one reporter might focus on economic policy, another on foreign affairs, another on personality and leadership. Each asks different questions and comes away with a different picture. Together, their reporting is richer than any single perspective could be.

Multi-head attention does exactly this. Instead of computing attention once with d_model-dimensional Q, K, V matrices, it splits into h heads, each with its own learned projection. Each head learns to attend to a different type of relationship: syntactic dependencies, co-reference, semantic similarity, position proximity. The results are concatenated and projected back, giving the layer a multi-dimensional understanding of context.

Multi-Head Attention

MultiHead(Q,K,V) = Concat(head₁,…,headₕ)·W^O

The outputs of all h heads are concatenated and passed through a learned output projection W^O to restore the original d_model dimension.

Each Head

headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V)

Each head projects Q, K, V with its own weight matrices, then runs standard scaled dot-product attention. d_k = d_model / h (typically 64 when d_model=512, h=8).

Component	Shape	Purpose
Input X	(batch, seq, d_model)	Raw embeddings + positional encoding
W^Q, W^K, W^V per head	(d_model, d_k)	Learned projections, one set per head
Each head output	(batch, seq, d_k)	Context-aware representation from one perspective
Concatenated heads	(batch, seq, h×d_k) = (batch, seq, d_model)	All perspectives merged
W^O projection	(d_model, d_model)	Mixes information across heads
Final output	(batch, seq, d_model)	Same shape as input — ready for next layer

import torch
import torch.nn as nn
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model    = d_model
        self.num_heads  = num_heads
        self.d_k        = d_model // num_heads   # dimension per head

        # Single weight matrices — we split them per head inside forward()
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def split_heads(self, x, batch_size):
        # x: (batch, seq, d_model) → (batch, heads, seq, d_k)
        x = x.view(batch_size, -1, self.num_heads, self.d_k)
        return x.transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)

        # Project and split into h heads
        Q = self.split_heads(self.W_q(Q), batch_size)
        K = self.split_heads(self.W_k(K), batch_size)
        V = self.split_heads(self.W_v(V), batch_size)

        # Attention on all heads simultaneously
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = torch.nn.functional.softmax(scores, dim=-1)
        context = torch.matmul(attn, V)             # (batch, heads, seq, d_k)

        # Merge heads: (batch, heads, seq, d_k) → (batch, seq, d_model)
        context = context.transpose(1, 2).contiguous()
        context = context.view(batch_size, -1, self.d_model)

        return self.W_o(context)   # final linear projection


# Test: batch=2, seq=10, d_model=512, 8 heads → d_k = 64 per head
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)
out = mha(x, x, x)
print(f"Input:  {x.shape}")    # torch.Size([2, 10, 512])
print(f"Output: {out.shape}")   # torch.Size([2, 10, 512]) — same shape!

OUTPUT

Input:  torch.Size([2, 10, 512])
Output: torch.Size([2, 10, 512])

Section 06

The Encoder Block — Full Architecture

A single encoder block has just two sub-layers, each followed by a residual connection and layer normalisation. The full encoder stacks N of these blocks (N=6 in the original paper). Each block deepens the model's understanding.

Multi-Head Self-Attention

The sequence attends to itself — every token builds a context-aware representation by attending to all other tokens. Q, K, V are all derived from the same input (hence "self"). This is the core understanding step.

Add & Norm (Residual Connection)

Output = LayerNorm(x + Sublayer(x)). The residual connection adds the original input back, which combats vanishing gradients and lets each sub-layer learn a "modification" of its input rather than a completely new representation. LayerNorm normalises each token's vector on its own, so its behaviour does not depend on batch size or sequence length.

Position-wise Feed-Forward Network (FFN)

Two linear transformations with a ReLU (or GELU in modern models) in between: FFN(x) = max(0, xW₁ + b₁)W₂ + b₂. Applied independently to each position. While attention mixes information across positions, the FFN processes each position separately and deeply — think of it as the "thinking" step after "listening". The inner dimension is typically 4× d_model (2048 in the original paper).

Add & Norm (Second Residual)

Another residual + LayerNorm after the FFN. The block outputs a matrix of the same shape as its input — ready to feed into the next encoder block or be passed to the decoder as context.
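The Add & Norm step can be written out by hand for a single token. The sketch below is a simplified version that omits LayerNorm's learned gain and bias; the input vector and the sub-layer output are made-up numbers.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise one token vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Post-norm residual step: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

x            = [2.0, -1.0, 0.5, 4.0]     # hypothetical token vector (d_model = 4)
sublayer_out = [0.1, 0.3, -0.2, 0.0]     # hypothetical attention or FFN output

y = add_and_norm(x, sublayer_out)
print([round(v, 3) for v in y])
```

The statistics are computed over the feature dimension of one token only, which is why the result is the same whatever else is in the batch.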

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()

        # Sub-layer 1: Multi-Head Self-Attention
        self.self_attn  = MultiHeadAttention(d_model, num_heads)
        self.norm1      = nn.LayerNorm(d_model)
        self.dropout1   = nn.Dropout(dropout)

        # Sub-layer 2: Position-wise FFN
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.norm2    = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Sub-layer 1: self-attention + residual + norm
        attn_out = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_out))

        # Sub-layer 2: FFN + residual + norm
        ffn_out = self.ffn(x)
        x = self.norm2(x + self.dropout2(ffn_out))

        return x   # shape unchanged: (batch, seq, d_model)


class Encoder(nn.Module):
    def __init__(self, num_blocks: int, d_model: int, num_heads: int, d_ff: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            [EncoderBlock(d_model, num_heads, d_ff) for _ in range(num_blocks)]
        )

    def forward(self, x, mask=None):
        for block in self.blocks:
            x = block(x, mask)
        return x   # final encoder output → passed to decoder


# 6-layer encoder: batch=2, seq=10, d_model=512
encoder = Encoder(num_blocks=6, d_model=512, num_heads=8, d_ff=2048)
x = torch.randn(2, 10, 512)
enc_out = encoder(x)
print(f"Encoder output: {enc_out.shape}")   # torch.Size([2, 10, 512])

OUTPUT

Encoder output: torch.Size([2, 10, 512])

💡

Residual Connections — The Secret to Training Deep Networks

Without residual connections, deep stacks are much harder to train because gradients shrink as they pass back through each sub-layer. The formula x + Sublayer(x) means that even if the sub-layer contributes almost nothing (for example, early in training), the gradient still flows through the identity path. The idea is borrowed from ResNet (2015) and is now standard in most deep architectures.
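A toy calculation shows the effect of the identity path. The sketch below uses scalar "layers" of the form f(h) = w · tanh(h) with a deliberately small weight w = 0.1; this is an illustrative assumption, not a model of real Transformer weights. It multiplies the local derivatives through a 6-layer stack with and without the residual connection.

```python
import math

def gradient_through_stack(x, depth=6, w=0.1, residual=True):
    """d(output)/d(input) for a stack of scalar layers f(h) = w * tanh(h).

    With residual=True each layer computes h + f(h); otherwise just f(h).
    """
    grad, h = 1.0, x
    for _ in range(depth):
        local = w * (1.0 - math.tanh(h) ** 2)   # f'(h)
        if residual:
            grad *= 1.0 + local                  # the identity path contributes the 1
            h = h + w * math.tanh(h)
        else:
            grad *= local
            h = w * math.tanh(h)
    return grad

plain = gradient_through_stack(1.0, residual=False)
skip  = gradient_through_stack(1.0, residual=True)
print(f"without residual: {plain:.2e}   with residual: {skip:.2f}")
```

Without the skip path the gradient is a product of six small numbers; with it, each factor is at least 1, so the signal reaching the input does not collapse.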

Section 07

The Decoder Block — Three Sub-Layers

📖 Story

The Author Writing a Novel

Imagine a novelist writing a historical fiction book. For every new sentence they do three things, one after another:

1. They look at what they've already written — the previous chapters — to stay consistent in voice and plot (Masked Self-Attention).
2. They consult their research notes and source material — facts about the era, historical events — to ensure accuracy (Cross-Attention over the encoder output).
3. They think deeply and creatively about how to construct the next sentence given what they know (Feed-Forward Network).

The decoder block does all three — in exactly this order — for every token it generates.

Masked Multi-Head Self-Attention

The decoder attends to its own previously generated tokens. The word "masked" means a causal mask is applied — position i can only attend to positions 0 through i (no peeking at future tokens). This ensures the model generates autoregressively and can't "cheat" during training.

Cross-Attention (Encoder–Decoder Attention)

The Q (Query) comes from the decoder; the K and V come from the encoder output. Every decoder position can attend to any encoder position. This is the bridge between the two halves — the decoder "reads" the encoder's deep understanding of the input to inform generation.

Position-wise FFN + Add & Norm

Identical to the encoder FFN. Each position is processed independently through the two-layer MLP. Residual connections and LayerNorm follow both the attention sub-layers and the FFN. The decoder also has 6 identical blocks stacked.

class DecoderBlock(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()

        # 1. Masked self-attention (decoder attends to itself)
        self.self_attn  = MultiHeadAttention(d_model, num_heads)
        self.norm1      = nn.LayerNorm(d_model)

        # 2. Cross-attention (decoder Q, encoder K and V)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.norm2      = nn.LayerNorm(d_model)

        # 3. FFN
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.norm3   = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, tgt, enc_output, tgt_mask=None, src_mask=None):
        # tgt:        decoder input (batch, tgt_seq, d_model)
        # enc_output: encoder output (batch, src_seq, d_model)

        # 1. Masked self-attention
        sa_out = self.self_attn(tgt, tgt, tgt, tgt_mask)
        tgt    = self.norm1(tgt + self.dropout(sa_out))

        # 2. Cross-attention: Q from decoder, K/V from encoder
        ca_out = self.cross_attn(tgt, enc_output, enc_output, src_mask)
        tgt    = self.norm2(tgt + self.dropout(ca_out))

        # 3. FFN
        ffn_out = self.ffn(tgt)
        tgt     = self.norm3(tgt + self.dropout(ffn_out))

        return tgt

⚠️

The Causal Mask Is Non-Negotiable

Without the causal mask in the decoder's self-attention, the model at position i could directly see the token at position i+1 — the answer it's supposed to predict. Training would be trivially solved by copying the next token. The mask fills future positions with −∞ before softmax, making those attention weights exactly zero. Removing the mask is one of the most common bugs in Transformer implementations.
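The effect of the mask on the attention weights can be seen in a few lines. This plain-Python sketch uses the same convention as the code in this article (0 means blocked) and uniform raw scores, which are an arbitrary choice so that the pattern is easy to read. In PyTorch the same lower-triangular pattern is typically built with torch.tril(torch.ones(n, n)).

```python
import math

def causal_mask(n):
    """n x n lower-triangular mask: 1 = may attend, 0 = blocked (future position)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, with blocked entries set to -inf first."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask_row)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]   # exp(-inf) evaluates to 0.0
    z = sum(exps)
    return [e / z for e in exps]

n = 4
mask = causal_mask(n)
scores = [[0.5] * n for _ in range(n)]         # hypothetical uniform raw scores
weights = [masked_softmax(r, m) for r, m in zip(scores, mask)]
for row in weights:
    print([round(w, 2) for w in row])
```

Row i spreads its weight over positions 0 to i only: the first row puts everything on position 0, and the last row is a uniform 0.25 across all four positions.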

Section 08

The Full Transformer — Putting It Together

Component	Location	Purpose	Trainable Params (d_model=512)
Input Embedding	Encoder + Decoder	Map token IDs → dense vectors	vocab × 512
Positional Encoding	Encoder + Decoder	Inject position information	0 (sinusoidal, fixed)
Multi-Head Self-Attention	Encoder (×6) + Decoder (×6)	Context across all positions	4 × (512 × 512) per block
Cross-Attention	Decoder only (×6)	Bridge encoder understanding to decoder	4 × (512 × 512) per block
FFN	Encoder (×6) + Decoder (×6)	Deep per-position reasoning	2 × (512 × 2048) per block
Layer Norm	Every sub-layer	Stabilise activations	2 × 512 per instance
Output Linear + Softmax	Decoder top	Project to vocabulary probabilities	512 × vocab
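The per-component figures in the table can be added up directly. The sketch below is a back-of-the-envelope count under explicit assumptions: attention projections without biases, FFN and output layers with biases, a 37,000-token vocabulary, and either separate (untied) or fully shared (tied) embedding and output matrices. The function name and its `tied` flag are hypothetical helpers for this illustration.

```python
def transformer_param_count(vocab=37000, d_model=512, d_ff=2048, n_blocks=6, tied=False):
    """Parameter count under the assumptions stated above (not the paper's exact recipe)."""
    attn = 4 * d_model * d_model                                 # W_q, W_k, W_v, W_o, no biases
    ffn  = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # two linear layers with biases
    norm = 2 * d_model                                           # LayerNorm gain + bias

    enc_block = attn + ffn + 2 * norm          # self-attention, FFN, 2 norms
    dec_block = 2 * attn + ffn + 3 * norm      # self-attention, cross-attention, FFN, 3 norms
    blocks = n_blocks * (enc_block + dec_block)

    embedding = vocab * d_model
    if tied:
        # one matrix shared by source embedding, target embedding and output layer
        return blocks + embedding
    out_proj = d_model * vocab + vocab         # untied output layer with bias
    return blocks + 2 * embedding + out_proj

untied_total = transformer_param_count()
tied_total = transformer_param_count(tied=True)
print(f"untied: {untied_total:,}")
print(f"tied:   {tied_total:,}")
```

The gap between the two totals comes entirely from the vocabulary-sized matrices, which is why sharing them brings the count down to the region of the paper's roughly 65M "base" figure.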

class Transformer(nn.Module):
    def __init__(self, src_vocab: int, tgt_vocab: int,
                 d_model=512, num_heads=8, num_blocks=6,
                 d_ff=2048, max_len=5000, dropout=0.1):
        super().__init__()

        # Separate source and target embeddings (the paper shares one matrix)
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.pos_enc   = PositionalEncoding(d_model, max_len, dropout)

        # Encoder stack
        self.encoder   = Encoder(num_blocks, d_model, num_heads, d_ff)

        # Decoder stack
        self.decoder   = nn.ModuleList(
            [DecoderBlock(d_model, num_heads, d_ff, dropout) for _ in range(num_blocks)]
        )

        # Output projection: d_model → tgt_vocab
        self.out_proj  = nn.Linear(d_model, tgt_vocab)

    def encode(self, src, src_mask):
        src = self.pos_enc(self.src_embed(src) * (self.src_embed.embedding_dim ** 0.5))
        return self.encoder(src, src_mask)

    def decode(self, tgt, enc_out, tgt_mask, src_mask):
        tgt = self.pos_enc(self.tgt_embed(tgt) * (self.tgt_embed.embedding_dim ** 0.5))
        for block in self.decoder:
            tgt = block(tgt, enc_out, tgt_mask, src_mask)
        return tgt

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        enc_out = self.encode(src, src_mask)
        dec_out = self.decode(tgt, enc_out, tgt_mask, src_mask)
        return self.out_proj(dec_out)   # logits: (batch, tgt_seq, tgt_vocab)


# Instantiate the original "base" model from the paper
model = Transformer(src_vocab=37000, tgt_vocab=37000,
                     d_model=512, num_heads=8, num_blocks=6, d_ff=2048)

total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")

OUTPUT

Total parameters: 100,970,632 ← ~101M as written (separate embeddings, untied output layer); the paper's ~65M "base" figure relies on sharing the embedding and output weights

Section 09

Encoder-Only vs Decoder-Only vs Encoder–Decoder

The original Transformer used both encoder and decoder. But the research community quickly discovered that different tasks benefit from different subsets of the architecture. This led to three dominant families of modern models, each a specialised evolution:

👁️

Encoder-Only

e.g. BERT, RoBERTa

Processes the entire input bidirectionally: every token attends to every other token, in both directions. Excellent for understanding tasks: classification, NER, extractive question answering. The model sees full context; there is no causal mask (only padding is masked). Downside: not designed for open-ended text generation.

🖊️

Decoder-Only

e.g. GPT-4, Claude, LLaMA

Only uses the decoder stack with its causal (left-to-right) mask. Trained to predict the next token given all previous tokens. Excels at text generation, completion, instruction-following, code, reasoning chains. The dominant architecture for large language models today.

⇄️

Encoder–Decoder

e.g. T5, BART, mT5

The original full architecture. Best for sequence-to-sequence tasks where the input and output are distinct sequences: translation, summarisation, document Q&A, code generation from spec. The encoder deeply understands the input; the decoder generates conditioned on that understanding.

Property	Encoder-Only (BERT)	Decoder-Only (GPT)	Encoder–Decoder (T5)
Attention direction	Bidirectional	Causal (left→right)	Enc: bi / Dec: causal
Can generate text?	No	Yes	Yes
Best for	Classification, NER, QA	Chat, completion, LLMs	Translation, summarisation
Training objective	Masked LM (MLM)	Causal LM (CLM)	Span corruption / seq2seq
Uses cross-attention?	No	No	Yes

Section 10

Training a Transformer — The Key Techniques

⚙️ Critical Training Details from the Original Paper

Learning Rate Schedule: Warmup then Decay. The original paper uses a custom schedule: LR increases linearly for the first warmup_steps steps (4,000 in the paper), then decays proportionally to step^(−0.5). Without warmup, training at the peak LR is prone to diverging in the first few thousand steps. Formula: lr = d_model^(−0.5) · min(step^(−0.5), step · warmup_steps^(−1.5))

Label Smoothing. Instead of a hard one-hot target (probability 1.0 for the correct token), use a smoothed target: with ε=0.1, a total probability mass of 0.1 is spread uniformly over the vocabulary and the rest stays on the correct token. This discourages overconfidence; the paper reports that it hurts perplexity but improves accuracy and BLEU score. Implemented as nn.CrossEntropyLoss(label_smoothing=0.1) in modern PyTorch.

Dropout Everywhere. The paper applies dropout (p=0.1 for the base model) to the output of every sub-layer before the residual addition and to the sums of embeddings and positional encodings. Many implementations, including the code in this article, also add dropout inside the FFN. The paper's ablations report that dropout is very helpful in avoiding over-fitting.

Adam Optimiser with β₁=0.9, β₂=0.98, ε=10⁻⁹. These differ from the usual Adam defaults (β₂=0.999, ε=10⁻⁸). The lower β₂ shortens the window over which squared gradients are averaged, so the optimiser reacts faster to changes in gradient scale, which is useful while the learning rate is still ramping up. Leaving β₂ at its default is a common source of training instability with this schedule.

Weight Tying (optional but common). The input embedding matrix and the output projection matrix share weights. This reduces parameters significantly and often improves performance — the model must use the same representation for "meaning of a word" when reading it and when generating it.
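The warmup-then-decay formula from the list above is short enough to evaluate by hand. This sketch is a direct transcription of that formula with the paper's base settings as defaults; the function name is a hypothetical helper.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), for step >= 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000, 100000):
    print(f"step {step:>6}: lr = {transformer_lr(step):.6f}")

peak = transformer_lr(4000)   # the two branches of min() meet at step == warmup_steps
```

The rate climbs linearly until step 4,000, peaks at roughly 7 × 10⁻⁴ for d_model = 512, and then falls off with the inverse square root of the step, so quadrupling the step count halves the rate.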

import torch
import torch.nn as nn

class WarmupScheduler(torch.optim.lr_scheduler._LRScheduler):
    def __init__(self, optimizer, d_model: int, warmup_steps: int):
        self.d_model      = d_model
        self.warmup_steps = warmup_steps
        super().__init__(optimizer)

    def get_lr(self):
        step = max(1, self._step_count)
        scale = self.d_model ** (-0.5) * min(
            step ** (-0.5),
            step * self.warmup_steps ** (-1.5)
        )
        return [scale for _ in self.base_lrs]


# Training setup — original paper configuration
model     = Transformer(src_vocab=37000, tgt_vocab=37000)
optimizer = torch.optim.Adam(model.parameters(),
                              lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = WarmupScheduler(optimizer, d_model=512, warmup_steps=4000)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=0)

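# Training step (simplified; assumes PAD token id 0, matching ignore_index above)
def train_step(src, tgt):
    model.train()
    tgt_in  = tgt[:, :-1]   # input:  "BOS I am happy"
    tgt_out = tgt[:, 1:]    # target: "I am happy EOS"

    # Padding masks, broadcastable to (batch, heads, query, key); False = blocked
    src_mask = (src != 0).unsqueeze(1).unsqueeze(2)       # (batch, 1, 1, src_seq)
    pad_mask = (tgt_in != 0).unsqueeze(1).unsqueeze(2)    # (batch, 1, 1, tgt_seq)

    # Causal mask: position i may attend only to positions 0..i
    T        = tgt_in.size(1)
    causal   = torch.tril(torch.ones(T, T, device=tgt.device)).bool()
    tgt_mask = pad_mask & causal                          # (batch, 1, tgt_seq, tgt_seq)

    logits = model(src, tgt_in, src_mask, tgt_mask)   # (batch, tgt_seq, vocab)
    logits = logits.reshape(-1, logits.size(-1))      # flatten batch × seq
    # reshape (not view): the sliced target tensor is not contiguous
    loss   = criterion(logits, tgt_out.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    return loss.item()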
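For reference, this is the target distribution that label smoothing produces for a single token. The sketch assumes the mixture convention documented for PyTorch's label_smoothing argument, (1 − ε) · one-hot + ε · uniform, and uses a hypothetical five-token vocabulary.

```python
def smoothed_target(true_id, num_classes, eps=0.1):
    """Target distribution (1 - eps) * one_hot + eps / num_classes."""
    uniform = eps / num_classes
    return [uniform + (1.0 - eps if i == true_id else 0.0) for i in range(num_classes)]

# Toy 5-token vocabulary (hypothetical); the correct token has id 2
target = smoothed_target(true_id=2, num_classes=5)
print([round(p, 2) for p in target])
```

With ε = 0.1 and five classes, the correct token keeps 0.92 and every other token receives 0.02, so the loss never rewards pushing a single probability all the way to 1.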

Section 11

Transformer Variants — The Family Tree

The original 2017 Transformer spawned an entire dynasty of architectures. Understanding where each fits helps you choose the right tool for any task.

Model	Year	Type	Key Innovation	Best Use
Transformer	2017	Enc–Dec	The original — attention is all you need	Translation
BERT	2018	Encoder	Bidirectional + masked language modelling pre-training	Classification, NER, QA
GPT-2/3/4	2019–23	Decoder	Scale + RLHF alignment; emergent few-shot ability	Chat, generation, agents
T5	2020	Enc–Dec	"Text-to-text" — every NLP task framed as seq2seq	Summarisation, translation
Vision Transformer (ViT)	2020	Encoder	Patches as tokens — Transformer conquers images	Image classification
LLaMA / Mistral	2023	Decoder	RoPE, GQA, SwiGLU — efficient open-weight LLMs	Open-source LLM apps

📚

Modern Improvements Over the 2017 Design

Modern LLMs revise the original design in several ways: RoPE (Rotary Positional Embedding) replaces sinusoidal PE and encodes relative position directly in the attention scores. RMSNorm replaces LayerNorm, dropping the mean subtraction for efficiency. SwiGLU/GeGLU gated activations replace ReLU in the FFN. Grouped Query Attention (GQA) reduces the KV cache memory by sharing key/value heads across query heads. Flash Attention rewrites the attention kernel for hardware efficiency. The core attention principle, however, has not changed.
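Of these changes, RMSNorm is the simplest to show in code. The sketch below is a plain-Python version for one token vector, assuming a gain of 1 for every feature when none is supplied; the input values are made up.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """RMSNorm: divide by the root-mean-square of x; no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain or [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]

x = [2.0, -1.0, 0.5, 4.0]          # hypothetical token vector
y = rms_norm(x)
print([round(v, 3) for v in y])
```

Unlike LayerNorm, the output keeps a non-zero mean: only the scale is normalised, which saves one reduction per token.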

Section 12

Computational Complexity — The Quadratic Problem

⚠️

The Achilles' Heel of Attention: O(n²) Memory and Compute

The attention score matrix has shape (seq_len × seq_len). For a sequence of 1,000 tokens, that's 1,000,000 values. For 10,000 tokens — a long document — it's 100,000,000 values. Memory and compute scale quadratically with sequence length. This is why standard Transformers were limited to 512–2048 tokens for years, and why a large chunk of modern AI research is dedicated to solving this bottleneck.

Sequence Length	Attention Matrix Size	Memory (float32, one head, one sequence)	Practical?
512 tokens	262,144 entries	~1 MB	Yes, standard
2,048 tokens	4,194,304 entries	~16 MB	Yes, GPT-3 context
8,192 tokens	67,108,864 entries	~256 MB	Marginal, needs Flash Attention
128,000 tokens	16,384,000,000 entries	~61 GB	Impossible naively, requires memory-efficient or sparse attention
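The table's figures follow from a single multiplication. This sketch counts the bytes for one score matrix, assuming 4-byte float32 values, one attention head and one sequence; real models multiply this by the number of heads and the batch size.

```python
def attention_matrix_bytes(seq_len, bytes_per_value=4):
    """Bytes for one seq_len x seq_len score matrix (one head, one sequence)."""
    return seq_len * seq_len * bytes_per_value

for n in (512, 2048, 8192, 128_000):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"{n:>7} tokens: {n * n:>14,} entries, {mib:>10,.0f} MiB")
```

Each 4× increase in sequence length costs 16× the memory, which is the quadratic growth the section describes.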

⚡

Flash Attention

Dao et al. 2022

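Rewrites attention to compute in tiles that fit in on-chip SRAM (small, fast GPU memory) instead of materialising the full score matrix in HBM (larger but slower GPU memory). Compute is still O(n²), but the extra memory grows only linearly with sequence length and wall-clock time drops substantially. Now widely used in LLM training and inference stacks.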

📍

Sparse / Local Attention

Longformer, BigBird

Each token only attends to a local window of nearby tokens plus a few global tokens. Reduces complexity to O(n·k) where k is window size. Enables processing of very long documents with manageable memory.
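The saving from a local window can be estimated by counting which (query, key) pairs survive. The sketch assumes a symmetric window of `window` tokens on each side and ignores the handful of global tokens that models such as Longformer add; the sequence length and window size are arbitrary example values.

```python
def sliding_window_pairs(n, window):
    """Number of (query, key) pairs when each token sees `window` tokens on each side."""
    return sum(min(n - 1, i + window) - max(0, i - window) + 1 for i in range(n))

n, window = 4096, 256
full  = n * n
local = sliding_window_pairs(n, window)
print(f"full: {full:,}  local: {local:,}  ratio: {full / local:.1f}x")
```

The local count is bounded by n · (2 · window + 1), so doubling the sequence length doubles the work instead of quadrupling it.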

💾

KV Cache (Inference)

Autoregressive Generation

During generation, key and value matrices for previously generated tokens are cached rather than recomputed. Each new token only computes new K and V and attends to the cache. Enables efficient decoding without O(n²) recomputation per step.
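A small simulation shows why caching is safe: the output for the newest token is the same whether the keys and values come from a cache or are recomputed. The sketch uses a single head and made-up per-token (q, k, v) vectors; in a real model these would come from the W_q, W_k and W_v projections.

```python
import math

def attend(q, keys, values):
    """Attention output for one query over the given keys/values (single head)."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi / z * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]

# Hypothetical per-token (q, k, v) vectors
tokens = [
    ([1.0, 0.0], [0.5, 0.1], [1.0, 2.0]),
    ([0.0, 1.0], [0.2, 0.9], [0.0, 1.0]),
    ([1.0, 1.0], [0.7, 0.3], [3.0, 0.5]),
]

k_cache, v_cache, cached_outputs = [], [], []
for q, k, v in tokens:                 # one decoding step per token
    k_cache.append(k)                  # only the NEW key/value is stored each step
    v_cache.append(v)
    cached_outputs.append(attend(q, k_cache, v_cache))

# Reference: recompute attention from scratch for the last position
full_last = attend(tokens[-1][0], [t[1] for t in tokens], [t[2] for t in tokens])
print(cached_outputs[-1], full_last)
```

Each step does work proportional to the current length only; the trade-off is that the cache itself grows with every generated token.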

Section 13

Using a Pre-trained Transformer — Practical Guide

In practice, you almost never train a Transformer from scratch. You use a pre-trained model and either fine-tune it for your task or use it directly via the Hugging Face ecosystem.

# pip install transformers torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

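# ── ENCODER-ONLY: Text Classification with BERT ──────────────
# Note: 'bert-base-uncased' ships without a trained classification head, so the
# head created here is randomly initialised and its predictions are near chance
# until the model is fine-tuned on labelled data.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model     = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2
)
model.eval()   # disable dropout for inference

texts = [
    "The Transformer architecture revolutionised NLP.",
    "I had a terrible experience with this product."
]

inputs  = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():   # no gradients needed for inference
    outputs = model(**inputs)
probs   = torch.nn.functional.softmax(outputs.logits, dim=-1)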

for text, prob in zip(texts, probs):
    pred = "POSITIVE" if prob.argmax() == 1 else "NEGATIVE"
    print(f"{pred} ({prob.max():.2%}) — {text[:50]}")

# ── DECODER-ONLY: Text Generation with GPT-2 ─────────────────
from transformers import pipeline

gen  = pipeline('text-generation', model='gpt2')
out  = gen("The Transformer architecture works by",
           max_new_tokens=50, num_return_sequences=1)
print(out[0]['generated_text'])

OUTPUT

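POSITIVE (96.42%) — The Transformer architecture revolutionised NLP.
NEGATIVE (88.15%) — I had a terrible experience with this product.
The Transformer architecture works by breaking text into tokens, mapping each token to a high-dimensional vector, then applying self-attention layers to build context-aware representations...

Illustrative only: confident scores like these come from a classifier that has already been fine-tuned for sentiment. With the freshly initialised head on bert-base-uncased, labels and probabilities will be close to chance, and GPT-2's continuation changes from run to run and is usually less coherent.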

🏆

Fine-Tuning vs Zero/Few-Shot Prompting

For encoder models (BERT): add a classification head and fine-tune all weights on your labelled dataset — 1,000 examples is often sufficient. For decoder models (GPT, Claude): try prompting first with 3–5 examples (few-shot) before fine-tuning. Fine-tune only if accuracy is critical and you have thousands of labelled examples. Large models (7B+ parameters) are better prompted; small models (125M–1B) benefit more from fine-tuning.

Section 14

Golden Rules — Transformers in Practice

⚡ Transformer — Non-Negotiable Rules

Never train from scratch unless you have billions of tokens and a research budget. Pre-trained models encode knowledge from billions of tokens (BERT-base) to trillions (LLaMA-3). Fine-tuning a pre-trained model on your task beats a from-scratch model trained on your data by a wide margin in almost every scenario.

Warmup your learning rate. The Transformer is notoriously sensitive to learning rate during the first few thousand steps. A linear warmup to your target LR over 4,000–10,000 steps prevents early divergence. Using Adam with a constant LR is a common cause of inexplicable training instability.

Match model type to task. Classification/NER → encoder-only (BERT family). Long-form generation / chat / agents → decoder-only (GPT/LLaMA family). Translation / summarisation → encoder–decoder (T5 / BART family). Using the wrong variant for the task is a fundamental architectural mismatch.

Watch your context length. Attention is O(n²). Passing 16,000 tokens to a model not designed for it will either exceed its positional range or exhaust GPU memory (OOM). Know your model's maximum context window and stay within it. Use Flash Attention or sliding window attention for long-document tasks.

Use the right tokeniser. Transformers are deeply tied to their tokenisers — a BERT tokeniser with a GPT-2 model will produce garbage. Always load the tokeniser paired with the model: AutoTokenizer.from_pretrained('model-name'). Mismatched tokeniser is a silent, hard-to-debug failure mode.

Gradient clipping is not optional. Clip gradients to max_norm=1.0 before every optimiser step. Attention layers can produce exploding gradients — especially at the start of training. One unclipped update can destroy days of training progress.

The causal mask in decoders must cover the padding too. When batch-training decoders, you have two masks to apply: the causal mask (no peeking future) AND the padding mask (ignore PAD tokens). Many implementations forget the padding mask — this silently hurts performance and is hard to catch because loss still decreases.
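Combining the two masks is a logical AND. The sketch below builds the combined mask for one sequence in plain Python, assuming PAD has token id 0 and the convention used in this article's code (0 means blocked); the token ids are made up.

```python
def combined_mask(token_ids, pad_id=0):
    """mask[i][j] = 1 only if j <= i (causal) AND token j is not padding."""
    n = len(token_ids)
    return [[1 if (j <= i and token_ids[j] != pad_id) else 0 for j in range(n)]
            for i in range(n)]

# Hypothetical sequence: three real tokens followed by two PAD tokens (id 0)
ids = [101, 7, 42, 0, 0]
mask = combined_mask(ids)
for row in mask:
    print(row)
```

The PAD columns are zero in every row, so no position ever reads from padding. The rows that belong to PAD positions still attend to the real tokens, which avoids an all-blocked row (and a NaN after softmax); their outputs are discarded by ignoring PAD targets in the loss.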