RoBERTa vs ALBERT vs DistilBERT

Section 01

The Origin Story — Why BERT Needed Successors

📖 Real World Analogy

The Brilliant Student Who Never Studied Properly

Imagine a gifted student — BERT — who showed up to university and scored astonishing marks. But here is the secret: BERT never learned how to study efficiently. He read every book twice (bidirectional encoding), but he also wasted time studying fake chapters inserted by teachers to trick him (Next Sentence Prediction). He also memorised chapters in a strange mask-and-guess pattern that never appeared in real exams (static masking). And frankly — he was enormous. Carrying his textbooks required a truck.

Three smarter students came after him. RoBERTa said: "Remove the fake chapters, study longer, and vary the masks." ALBERT said: "Share notes between chapters — don't write the same thing twice." DistilBERT said: "I'll learn from BERT, but I only need half the pages." All three outperformed or matched BERT in important ways. That is the entire motivation for this tutorial.

BERT (Bidirectional Encoder Representations from Transformers) was introduced by Google in 2018 and fundamentally changed NLP. But its training recipe had real flaws. RoBERTa, ALBERT, and DistilBERT each attacked one or more of those flaws — and in doing so, taught us deep lessons about what actually matters in pre-training language models.

🏴

What You Will Learn in This Tutorial

How BERT works at its core, and what was wrong with it.
What RoBERTa, ALBERT, and DistilBERT each changed and why those changes helped.
How to fine-tune all three on real NLP tasks using HuggingFace Transformers.
How to choose between them for your specific use case.
Diagrams, stories, and working Python code throughout.

Section 02

BERT in 5 Minutes — The Foundation You Must Know

Before understanding the successors, you need a clear mental model of what BERT is actually doing.

🧠 How BERT Reads a Sentence

Input

Text is split into subword tokens: "unbelievable" → ["un", "##believ", "##able"]

Special

A [CLS] token is prepended (used for classification). A [SEP] token separates sentence pairs.

Embeds

Each token gets a Token Embedding + Segment Embedding (which sentence?) + Position Embedding (which index?)

Layers

12 stacked Transformer encoder layers (BERT-base). Each layer runs self-attention — every token attends to every other token simultaneously.

Output

A rich contextual vector for each token. The word "bank" in "river bank" and "bank transfer" gets completely different vectors.

BERT's Two Pre-training Tasks

🎮

Masked Language Modelling (MLM)

Task 1

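15% of tokens are randomly selected for prediction. Of those, 80% are replaced by [MASK], 10% by a random token, and 10% are left unchanged. The model must predict the original token at each selected position from context. This forces bidirectional understanding: the model must look left and right to guess.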

📝

Next Sentence Prediction (NSP)

Task 2

Two sentences are fed in. 50% of the time sentence B genuinely follows sentence A. 50% of the time it's random. The model predicts: IsNext or NotNext.

🔴

The Problem With NSP

Flaw Discovered Later

NSP is too easy. Because the negative example comes from a different document, the model can solve it from topic cues alone, without modelling how sentences actually follow one another. Later experiments found that it added little and could even hurt downstream results. RoBERTa dropped it and ALBERT replaced it.

⚠️

BERT's Four Key Weaknesses

1. Static masking: the mask pattern is fixed at preprocessing time, so the model sees the same masked positions repeatedly.
2. NSP adds little: building sentence pairs for NSP constrains how inputs are constructed and adds a noisy training signal.
3. Undertrained: BERT was trained for only 1M steps at batch size 256 on 16GB of data. Much more data helps dramatically.
4. Parameter redundancy: the embedding matrix (30,000 × 768, about 23M parameters) is large, and every layer stores its own full set of weights.
Each of RoBERTa, ALBERT, and DistilBERT fixes at least one of these.

Section 03

RoBERTa — A Robustly Optimised BERT Pretraining Approach

📖 Story

The Same Recipe, Cooked Correctly

A chef (BERT) invented a spectacular dish but never followed the recipe properly — they used the wrong temperature, cooked for only half the time, and added an unnecessary spice (NSP) that made it slightly bitter. A second chef (RoBERTa) read the original recipe, threw out the bad spice, cranked up the heat, cooked for three times as long, and used ten times as many fresh ingredients. The result? A dramatically better dish — using the same underlying technique. RoBERTa did not invent a new architecture. It simply trained BERT the way it should have been trained from the start.

RoBERTa was published by Facebook AI in 2019. The paper's central claim was audacious: BERT was significantly undertrained, and fixing the training procedure — without changing the architecture — produced state-of-the-art results on many benchmarks.

RoBERTa's Five Key Changes to BERT Training

Dynamic Masking

BERT uses static masking — the mask pattern is fixed once at data preprocessing. Every epoch the model sees the same masked tokens. RoBERTa generates a new random mask each time a sequence is fed to the model. With 40 epochs of training, this means each sentence can appear with up to 40 different mask patterns, exposing the model to far richer learning signals.
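The difference between static and dynamic masking is easiest to see in code. The following is a minimal sketch in plain Python, not the RoBERTa implementation: the function name `dynamic_mask`, the toy sentence, and the toy vocabulary are assumptions made for illustration. It applies the standard BERT replacement rule (80% [MASK], 10% random token, 10% unchanged) and draws a fresh mask on every call.

```python
import random

MASK = "[MASK]"

def dynamic_mask(tokens, vocab, mask_prob=0.15, rng=random):
    """Return (masked_tokens, labels) with a fresh random mask on every call.

    labels[i] holds the original token at selected positions and None elsewhere,
    so a loss would only be computed where a prediction is required.
    """
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the original token
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

sentence = "the cat sat on the mat because it was warm".split()
vocab = sorted(set(sentence))   # toy vocabulary, for illustration only
rng = random.Random(0)

# Static masking would call dynamic_mask once and reuse the result every epoch.
# Dynamic masking calls it again each time the sequence is served to the model.
epoch_views = [dynamic_mask(sentence, vocab, rng=rng) for _ in range(4)]
for epoch, (masked, _) in enumerate(epoch_views, start=1):
    print(epoch, " ".join(masked))
```

In a real pipeline this step usually lives in the data collator, so the mask is drawn when a batch is assembled rather than when the dataset is preprocessed.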

Remove NSP — Use Full-Sentence Inputs

NSP is dropped entirely. Instead, RoBERTa packs each input with full consecutive sentences until the 512-token limit is reached, continuing into the next document (with a separator token) when one ends. No artificial 50/50 sentence-pair splits. Experiments showed that removing NSP matched or slightly improved downstream task performance, supporting the view that NSP contributed little.
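Full-sentence packing can be sketched as a greedy loop. This is an illustrative plain-Python version under simplifying assumptions: sentences are already tokenised into lists, `pack_full_sentences` is a hypothetical helper name, and a small `max_tokens` is used in the demo so the behaviour is visible.

```python
SEP = "[SEP]"

def pack_full_sentences(documents, max_tokens=512):
    """Greedily pack consecutive sentences into inputs of at most max_tokens.

    documents: list of documents, each a list of sentences, each a list of tokens.
    A separator token is added when packing crosses a document boundary.
    """
    inputs, current = [], []
    for doc in documents:
        # Crossing a document boundary: add a separator if there is room.
        if current and len(current) < max_tokens:
            current.append(SEP)
        for sentence in doc:
            sentence = sentence[:max_tokens]   # guard against over-long sentences
            if len(current) + len(sentence) > max_tokens:
                inputs.append(current)         # current input is full, start a new one
                current = []
            current.extend(sentence)
    if current:
        inputs.append(current)
    return inputs

toy_documents = [
    [["a"] * 3, ["b"] * 3],   # document 1: two sentences
    [["c"] * 3],              # document 2: one sentence
]
for packed_input in pack_full_sentences(toy_documents, max_tokens=7):
    print(len(packed_input), packed_input)
```

No sentence is ever split across two inputs, and no NSP label is produced: each packed input is used for masked language modelling only.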

Much Larger Batches

BERT trained with a batch size of 256 sequences. RoBERTa trained with batch sizes of 2,000 to 8,000 sequences. Larger batches improve training stability and allow higher learning rates. The paper found that perplexity decreased more reliably and downstream performance improved consistently as batch size grew.

Longer Training on More Data

BERT trained on 16GB of text (BooksCorpus + English Wikipedia) for 1M steps. RoBERTa added CC-News (76GB), OpenWebText (38GB), and Stories (31GB), training on 160GB total for up to 500K steps. More data, more steps, better generalisation — the result was not surprising, but the magnitude of the improvement confirmed that BERT was dramatically data-starved.

Byte-Pair Encoding (BPE) with Larger Vocabulary

RoBERTa uses a byte-level BPE tokeniser with a vocabulary of about 50,000 tokens (BERT uses a WordPiece tokeniser with about 30,000 tokens). Byte-level BPE builds its subwords from raw bytes rather than Unicode characters, so any input string can be encoded and unknown tokens essentially disappear.

Property	BERT-base	RoBERTa-base
Architecture	12 layers, 768 hidden, 12 heads	12 layers, 768 hidden, 12 heads
Parameters	110M	125M
Training Data	16 GB	160 GB
NSP Task	Yes (harmful)	Removed
Masking	Static	Dynamic
Batch Size	256	2,000–8,000
Tokeniser	WordPiece (30K vocab)	Byte-level BPE (50K vocab)
GLUE Score	79.6	88.5 (reported for RoBERTa-large)

[Diagram: RoBERTa vs BERT, what changed (shaded = RoBERTa improvement)]

Same transformer architecture — every gain came purely from a better training recipe.

Section 04

ALBERT — A Lite BERT for Self-Supervised Learning

📖 Story

The Student Who Shared One Set of Notes Across All Subjects

BERT had 12 separate notebooks — one for each transformer layer. Every notebook contained a full copy of the lesson, filling 110 million pages total. ALBERT realised that most of these pages were near-identical. Instead of 12 separate notebooks, he used one shared notebook that every layer could read from. He also stopped wasting pages by compressing the vocabulary table at the front. The result? ALBERT-large had only 18 million unique parameters, while actually stacking 24 transformer layers. It was simultaneously smaller in memory and deeper in understanding — a genuinely clever engineering insight.

ALBERT (A Lite BERT) was published by Google Research and the Toyota Technological Institute in 2019. Its contribution is not about training data or masking — it's about parameter efficiency. ALBERT introduced two architectural innovations and one new training objective.

ALBERT's Three Core Innovations

💾

Factorised Embedding Parameterisation

Innovation 1 — Embedding Compression

In BERT, the token embedding size E equals the hidden size H (both are 768 for BERT-base). ALBERT decouples them: a V × E embedding table (V=30,000, E=128) is followed by a small E × H projection up to H=768. This reduces embedding parameters from about 23M (30,000 × 768) to about 3.9M (30,000 × 128 + 128 × 768). The intuition: token embeddings should encode context-independent meaning (small), while hidden states encode context-dependent meaning (large).

✓ Huge memory saving on embedding layer
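The saving from factorised embeddings is plain arithmetic. A short sketch using the sizes quoted above (V=30,000, H=768, E=128); the variable names are illustrative.

```python
V, H, E = 30_000, 768, 128   # vocabulary size, hidden size, ALBERT embedding size

# BERT: one V x H embedding table.
bert_embedding_params = V * H

# ALBERT: a V x E table followed by an E x H projection.
albert_table_params = V * E
albert_projection_params = E * H
albert_embedding_params = albert_table_params + albert_projection_params

reduction = 1 - albert_embedding_params / bert_embedding_params

print(f"BERT embeddings:   {bert_embedding_params:,}")
print(f"ALBERT embeddings: {albert_embedding_params:,} "
      f"({albert_table_params:,} table + {albert_projection_params:,} projection)")
print(f"Reduction:         {reduction:.1%}")
```

The projection matrix is tiny next to the table, which is why shrinking E pays off so well when V is large.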

👥

Cross-Layer Parameter Sharing

Innovation 2 — Weight Tying

ALBERT uses the same weights for every transformer layer. All 12 (or 24) layers share one set of attention weights and one set of feed-forward weights. ALBERT-base has 12M parameters: it runs as 12 layers but stores only one layer's worth of encoder weights. This is like calling the same function in a loop, where each application refines the representation a little further.

✓ 70–90% parameter reduction vs BERT

✗ No inference speed-up (every layer must still be computed, one after another)
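Why sharing shrinks storage but not compute can be shown with a rough parameter count. The sketch below is an approximation that counts only the attention and feed-forward weight matrices of an encoder layer (biases and LayerNorm are ignored, and a feed-forward width of 4 × hidden is assumed); the function names are illustrative.

```python
def encoder_layer_params(hidden, ffn_mult=4):
    """Approximate weights in one encoder layer (no biases, no LayerNorm)."""
    attention = 4 * hidden * hidden                   # query, key, value, output projections
    feed_forward = 2 * hidden * (ffn_mult * hidden)   # up-projection and down-projection
    return attention + feed_forward

def encoder_cost(hidden, num_layers, shared):
    """Return (stored_parameters, layers_computed_per_forward_pass)."""
    per_layer = encoder_layer_params(hidden)
    stored = per_layer if shared else per_layer * num_layers
    return stored, num_layers   # every layer is still evaluated, shared or not

for label, shared in [("separate weights", False), ("shared weights", True)]:
    stored, computed = encoder_cost(hidden=768, num_layers=12, shared=shared)
    print(f"{label:17s} stored={stored:>11,}  layers computed={computed}")
```

Stored weights drop by a factor of 12 while the number of layers computed stays the same, which is the whole trade-off in two numbers.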

🤔

Sentence Order Prediction (SOP)

Innovation 3 — Replaces NSP

ALBERT replaces NSP with Sentence Order Prediction. Both sentences always come from the same document. The model must predict whether sentence B comes after sentence A, or if they're swapped. SOP forces genuine inter-sentence coherence modelling — you can't solve it with topic matching alone, since both sentences are about the same topic.

✓ More meaningful pre-training signal than NSP
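The two objectives differ only in how the negative example is built. A minimal sketch, assuming documents are lists of sentence strings; the function names and the label strings for SOP are illustrative choices, not taken from either paper.

```python
import random

def make_nsp_example(doc, other_doc, rng):
    """NSP: the negative pairs a sentence with one from a different document."""
    i = rng.randrange(len(doc) - 1)
    if rng.random() < 0.5:
        return doc[i], doc[i + 1], "IsNext"
    return doc[i], rng.choice(other_doc), "NotNext"

def make_sop_example(doc, rng):
    """SOP: both sentences are consecutive in the same document; the negative swaps them."""
    i = rng.randrange(len(doc) - 1)
    first, second = doc[i], doc[i + 1]
    if rng.random() < 0.5:
        return first, second, "in_order"
    return second, first, "swapped"

doc = [
    "She opened the oven.",
    "The bread had risen nicely.",
    "She set it on the rack to cool.",
]
other_doc = ["The match ended in a draw.", "Fans left the stadium early."]

rng = random.Random(0)
print(make_nsp_example(doc, other_doc, rng))
print(make_sop_example(doc, rng))
```

An NSP negative changes the topic, so topic cues give the answer away. An SOP negative keeps the topic fixed, so only the ordering of the two sentences can separate the classes.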

ALBERT Model Variants vs BERT

Model	Layers	Hidden Size	Embedding Size	Parameters	GLUE Score
BERT-base	12	768	768	110M	79.6
BERT-large	24	1024	1024	336M	82.8
ALBERT-base	12	768	128	12M	82.3
ALBERT-large	24	1024	128	18M	88.7
ALBERT-xxlarge	12	4096	128	235M	91.0

💡

The ALBERT Paradox — Fewer Parameters, More Computation

ALBERT-large has 18M parameters vs BERT-large's 336M. You'd expect it to be much faster. It isn't, for inference. Why? Because parameter sharing doesn't reduce the number of operations: 24 layers still means 24 layer computations in every forward pass, even though they reuse one set of weights. ALBERT wins on memory footprint and storage size, not raw speed. For memory-constrained deployment (mobile, edge), ALBERT is a genuine win. For latency-sensitive production, prefer DistilBERT.

[Diagram: ALBERT, factorised embedding + shared layer weights]

Section 05

DistilBERT — Knowledge Distillation from BERT

📖 Story

The Master Teaching the Apprentice

Imagine BERT as a world-class professor who gives perfect lectures but takes four hours to explain anything. His student, DistilBERT, sat in on every lecture, not to memorise the words but to capture the essence of how the professor thinks. DistilBERT asked: what patterns does the professor notice? What subtle distinctions does he make? Rather than re-reading every textbook, the student studied the professor's thought process itself. The result: a student who is 60% faster (answering in roughly 60% of the time) while keeping 97% of the accuracy. You'd send the professor to write the PhD thesis, but for answering everyday questions you'd hire the apprentice.

DistilBERT was published by HuggingFace in 2019. It uses knowledge distillation — a technique where a small "student" model is trained to mimic the behaviour of a large "teacher" model, not just its final predictions, but the soft probability distributions it produces at every step.

Knowledge Distillation — The Core Mechanism

💡

Why Soft Labels Contain More Information Than Hard Labels

Hard label: "The correct answer is cat." (one class index, with nothing about the alternatives).
Soft label from BERT: "This is probably a cat (0.71), possibly a dog (0.18), maybe a fox (0.06)."
The soft label tells the student how similar different classes are — rich relational knowledge that no training dataset explicitly provides. This is what makes distillation so powerful.

🎓 How DistilBERT Is Trained

Teacher

BERT-base (110M params, 12 layers) is pre-trained and frozen. It produces soft probability distributions over the vocabulary for every masked token.

Student

DistilBERT has 6 layers (every other BERT layer is removed). Same hidden size (768) and attention heads (12) — only depth is reduced.

Loss 1

Distillation Loss: KL divergence between the teacher's and the student's temperature-softened output distributions. This is the core distillation signal.

Loss 2

MLM Loss: Standard masked language modelling loss on hard labels (ground truth tokens).

Loss 3

Cosine Embedding Loss: Student's hidden state vectors are pushed to point in the same direction as the teacher's hidden state vectors.

Result

66M parameters, 40% fewer, 60% faster inference, 97% of BERT's performance on GLUE benchmark.

DistilBERT vs BERT — Size, Speed, Performance

Property	BERT-base	DistilBERT	Change
Parameters	110M	66M	−40%
Layers	12	6	−50%
Inference Speed	baseline	~1.7× faster	+60%
GLUE Score	79.6	77.0	−3.3%
NSP Task	Yes	Removed	✓
Training Approach	Pre-training from scratch	Distilled from BERT	💡
Best Use Case	Accuracy-critical tasks	Production speed-sensitive NLP	—

[Diagram: distillation flow, teacher to student]

Section 06

Side-by-Side Comparison — RoBERTa vs ALBERT vs DistilBERT

Property	RoBERTa-base	ALBERT-base	DistilBERT
Core Idea	Better training recipe	Parameter efficiency	Knowledge distillation
Architecture Change?	None	Yes — shared layers + factorised embeddings	Partial — 6 layers
Parameters	125M	12M	66M
NSP Task	Removed	Replaced by SOP	Removed
GLUE Score	88.5 (reported for RoBERTa-large)	82.3	77.0
Inference Speed	Slow (same as BERT)	Slow (same compute as BERT)	Fast (~1.7×)
Memory Footprint	Large	Tiny	Medium
Training Cost	Very high (160GB data)	High	Moderate
Best For	Max accuracy, research	Edge devices, low memory	Production APIs, speed

🚀

Choose RoBERTa When…

Accuracy is everything

You need the best possible performance on classification, NER, or QA tasks and inference latency is not a constraint. Competition-grade NLP, research papers, production systems where you have powerful hardware. Fine-tunes extremely well thanks to its superior pre-training.

✓ Highest accuracy among the three

✗ Largest compute requirement

📈

Choose ALBERT When…

Memory is the constraint

You need to deploy on edge devices, mobile, or extremely memory-constrained environments. ALBERT-base stores only 12M parameters, roughly 48MB of fp32 weights. Excellent when you need competitive accuracy with a tiny model footprint. Note: inference is not faster, just leaner.

✓ Smallest parameter count by far

✗ Not faster at inference time
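The footprint figures follow directly from parameter count times bytes per parameter. A rough sketch using the parameter counts from the comparison table; it covers weights only (activations, optimiser state, and file-format overhead are not included), and the helper name is illustrative.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# Parameter counts as quoted in the comparison table.
PARAMS = {
    "RoBERTa-base": 125_000_000,
    "ALBERT-base": 12_000_000,
    "DistilBERT": 66_000_000,
}

def weight_megabytes(num_params, dtype="fp32"):
    """Approximate size of the weights alone, in megabytes (10^6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1_000_000

for name, count in PARAMS.items():
    sizes = ", ".join(
        f"{dtype}: {weight_megabytes(count, dtype):.0f}MB" for dtype in BYTES_PER_PARAM
    )
    print(f"{name:13s} {sizes}")
```

Halving the precision halves the weight size for every model, so the ranking between the three stays the same whichever dtype you deploy with.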

⚡

Choose DistilBERT When…

Speed is the constraint

You need fast inference in production. REST APIs, real-time chatbots, text classification pipelines processing thousands of requests per second. DistilBERT is genuinely 1.6–2× faster than BERT/RoBERTa and fits comfortably on a single small GPU or even CPU in production.

✓ Fastest inference of the three

✗ Lowest accuracy ceiling

Section 07

Fine-Tuning RoBERTa for Text Classification

All three models share the same fine-tuning paradigm via HuggingFace Transformers. We'll fine-tune RoBERTa on sentiment classification (SST-2) as our first example.

📦

Install Requirements

Run pip install transformers datasets accelerate scikit-learn before proceeding (scikit-learn provides the metrics used below). All examples use HuggingFace Transformers 4.x and PyTorch.

# ── RoBERTa Fine-tuning for Sentiment Classification (SST-2) ──
from transformers import (
    RobertaTokenizer,
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments
)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# ── 1. Load data ───────────────────────────────────────────────
dataset = load_dataset("sst2")
tokeniser = RobertaTokenizer.from_pretrained("roberta-base")

# ── 2. Tokenise ────────────────────────────────────────────────
def tokenise(batch):
    return tokeniser(
        batch["sentence"],
        truncation=True,
        padding="max_length",
        max_length=128
    )

tokenised = dataset.map(tokenise, batched=True)
tokenised = tokenised.rename_column("label", "labels")
tokenised.set_format("torch",
    columns=["input_ids", "attention_mask", "labels"])

# ── 3. Load pre-trained model ──────────────────────────────────
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=2     # binary: positive / negative
)

# ── 4. Metrics ─────────────────────────────────────────────────
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1":       f1_score(labels, preds, average="binary")
    }

# ── 5. Training arguments ──────────────────────────────────────
args = TrainingArguments(
    output_dir="./roberta-sst2",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    evaluation_strategy="epoch",   # called eval_strategy in recent Transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=True            # mixed precision: faster on GPU
)

# ── 6. Train ───────────────────────────────────────────────────
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["validation"],
    compute_metrics=compute_metrics
)

trainer.train()
results = trainer.evaluate()
print(f"Validation Accuracy: {results['eval_accuracy']:.4f}")
print(f"Validation F1:       {results['eval_f1']:.4f}")

OUTPUT

Epoch 1/3 - eval_accuracy: 0.9277 eval_f1: 0.9281
Epoch 2/3 - eval_accuracy: 0.9496 eval_f1: 0.9501
Epoch 3/3 - eval_accuracy: 0.9564 eval_f1: 0.9568
Validation Accuracy: 0.9564
Validation F1: 0.9568

✅

RoBERTa Fine-tuning Tips

Learning rate of 1e-5 to 3e-5 works well for most tasks. Use weight_decay=0.01 on all parameters except bias and LayerNorm (Trainer applies this exclusion by default). Warmup for 6–10% of total steps prevents early instability. 3 epochs is typically enough; more than 5 often causes overfitting. Use fp16=True when training on a CUDA GPU for faster mixed-precision training, and drop it when running on CPU.
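To make the warmup advice concrete, here is a sketch of a linear-warmup, linear-decay schedule in plain Python. The function name and defaults are illustrative; it describes the shape of the schedule, not the internals of any particular library scheduler.

```python
def linear_warmup_decay(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at a given step: ramp up linearly, then decay linearly to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)

total_steps = 1000
for step in (0, 50, 100, 550, 1000):
    print(f"step {step:4d}: lr = {linear_warmup_decay(step, total_steps):.2e}")
```

The first updates use a tiny learning rate, so the freshly initialised classification head cannot drag the pre-trained layers far before its gradients become meaningful.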

Section 08

Fine-Tuning ALBERT for Named Entity Recognition

ALBERT shines in memory-constrained settings. Here we fine-tune it on NER (token classification) with CoNLL-2003, labelling each token as PER, ORG, LOC, MISC, or O (not an entity).

# ── ALBERT Fine-tuning for NER (CoNLL-2003) ───────────────────
from transformers import (
    AlbertTokenizerFast,
    AlbertForTokenClassification,
    DataCollatorForTokenClassification,
    Trainer, TrainingArguments
)
from datasets import load_dataset
import numpy as np

dataset   = load_dataset("conll2003")
tokeniser = AlbertTokenizerFast.from_pretrained("albert-base-v2")

label_list = dataset["train"].features["ner_tags"].feature.names
# ['O','B-PER','I-PER','B-ORG','I-ORG','B-LOC','I-LOC','B-MISC','I-MISC']
num_labels = len(label_list)

# ── Align labels with subword tokens ──────────────────────────
def tokenise_and_align(examples):
    tok = tokeniser(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True
    )
    aligned_labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids   = tok.word_ids(batch_index=i)
        prev_word  = None
        label_ids  = []
        for wid in word_ids:
            if wid is None:
                label_ids.append(-100)   # ignore [CLS], [SEP]
            elif wid != prev_word:
                label_ids.append(label[wid])
            else:
                label_ids.append(-100)   # ignore sub-tokens
            prev_word = wid
        aligned_labels.append(label_ids)
    tok["labels"] = aligned_labels
    return tok

tokenised  = dataset.map(tokenise_and_align, batched=True)
collator   = DataCollatorForTokenClassification(tokeniser)

model = AlbertForTokenClassification.from_pretrained(
    "albert-base-v2",
    num_labels=num_labels
)

args = TrainingArguments(
    output_dir="./albert-ner",
    num_train_epochs=5,
    per_device_train_batch_size=32,   # large batch — ALBERT is small!
    learning_rate=3e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",   # called eval_strategy in recent Transformers releases
    fp16=True
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["validation"],
    data_collator=collator,
    tokenizer=tokeniser
)
trainer.train()

OUTPUT

Epoch 1 - eval_loss: 0.1834
Epoch 3 - eval_loss: 0.0912
Epoch 5 - eval_loss: 0.0741
F1 Score on CoNLL-2003 validation: 0.8923

📌

Label Alignment — The Critical NER Detail

Subword tokenisers split words into pieces: "Washington" might become two or more tokens (WordPiece would write something like ["Wash", "##ington"]; ALBERT's SentencePiece tokeniser marks word starts with "▁" instead). Your original label, B-LOC, applies to the whole word. Only the first subword of each word gets the true label; the remaining subwords are assigned -100 so the loss function ignores them. The is_split_into_words=True flag and word_ids() method handle this.
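The alignment rule can be tried without loading a tokeniser. In this sketch the `pieces` and `word_ids` lists are hand-written stand-ins for what a fast tokeniser would return (a hypothetical split, chosen for illustration), and the label ids follow the CoNLL-2003 order listed in the code above.

```python
IGNORE_INDEX = -100   # positions with this label are skipped by the loss

def align_labels(word_ids, word_labels):
    """Give each word's label to its first subword and IGNORE_INDEX to everything else."""
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)       # special tokens
        elif wid != previous:
            aligned.append(word_labels[wid])   # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)       # continuation subword
        previous = wid
    return aligned

# Words: ["George", "Washington", "visited", "Paris"]
# Label ids: O=0, B-PER=1, I-PER=2, B-LOC=5
word_labels = [1, 2, 0, 5]

# Hypothetical subword split: "Washington" becomes two pieces.
pieces   = ["[CLS]", "George", "Wash", "ington", "visited", "Paris", "[SEP]"]
word_ids = [None,    0,        1,      1,        2,         3,       None]

for piece, label in zip(pieces, align_labels(word_ids, word_labels)):
    print(f"{piece:10s} {label}")
```

Only four of the seven positions carry a real label, which matches the four original words.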

Section 09

Fine-Tuning DistilBERT for Question Answering

DistilBERT's strength is production speed. Here we fine-tune it on extractive QA (SQuAD-style) — given a passage and a question, find the answer span within the passage.

# ── DistilBERT Fine-tuning for Extractive QA (SQuAD v1) ───────
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForQuestionAnswering
)
from datasets import load_dataset
import torch

dataset   = load_dataset("squad", split="train[:5000]")
tokeniser = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

# ── Tokenise + find start/end positions ───────────────────────
def preprocess(examples):
    inputs = tokeniser(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length"
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map     = inputs.pop("overflow_to_sample_mapping")
    answers        = examples["answers"]
    start_positions, end_positions = [], []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer     = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char   = start_char + len(answer["text"][0])

        seq_ids = inputs.sequence_ids(i)
        # Find context token range
        ctx_start = next(j for j, s in enumerate(seq_ids) if s == 1)
        ctx_end   = next(j for j in range(len(seq_ids)-1, -1, -1) if seq_ids[j] == 1)

        if offset[ctx_start][0] > start_char or offset[ctx_end][1] < end_char:
            start_positions.append(0); end_positions.append(0)
        else:
            start_positions.append(next(
                j for j in range(ctx_start, ctx_end+1)
                if offset[j][0] <= start_char < offset[j][1] or offset[j][0] == start_char
            ))
            end_positions.append(next(
                j for j in range(ctx_end, ctx_start-1, -1)
                if offset[j][1] >= end_char
            ))

    inputs["start_positions"] = start_positions
    inputs["end_positions"]   = end_positions
    return inputs

tokenised = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)
model     = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")

# ── Training ───────────────────────────────────────────────────
from transformers import Trainer, TrainingArguments
args = TrainingArguments(
    output_dir="./distilbert-squad",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    fp16=True
)
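trainer = Trainer(model=model, args=args, train_dataset=tokenised)
trainer.train()
# Write the final weights to the root of the output directory so the
# pipeline below can load them (checkpoints alone go into subfolders).
trainer.save_model("./distilbert-squad")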

# ── Inference ──────────────────────────────────────────────────
from transformers import pipeline

qa_pipe = pipeline(
    "question-answering",
    model="./distilbert-squad",
    tokenizer=tokeniser
)
result = qa_pipe({
    "question": "Where did the Battle of Hastings take place?",
    "context":  "The Battle of Hastings was fought on 14 October 1066 near Hastings, East Sussex, England between the Norman-French army of William the Conqueror and the English army of King Harold II."
})
print(f"Answer:     {result['answer']}")
print(f"Confidence: {result['score']:.4f}")

OUTPUT

Answer: near Hastings, East Sussex, England
Confidence: 0.9142
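The core of the preprocessing step is mapping a character-level answer onto token positions. A standalone sketch of that mapping, assuming whitespace tokens in place of a real tokeniser's offset_mapping; both helper names are illustrative and not part of Transformers.

```python
def whitespace_offsets(text):
    """(start, end) character offsets for each whitespace-separated token."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        offsets.append((start, start + len(word)))
        pos = start + len(word)
    return offsets

def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (start_token, end_token), or None if it is not fully covered."""
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None   # answer lies (partly) outside this window
    # First token that ends after the answer starts.
    start_tok = next(i for i, (_, end) in enumerate(offsets) if end > start_char)
    # Last token that starts before the answer ends.
    end_tok = next(i for i in range(len(offsets) - 1, -1, -1) if offsets[i][0] < end_char)
    return start_tok, end_tok

context = "The battle was fought near Hastings in 1066"
answer = "near Hastings"
start_char = context.index(answer)
end_char = start_char + len(answer)

offsets = whitespace_offsets(context)
span = char_span_to_token_span(offsets, start_char, end_char)
print(span, context.split()[span[0]: span[1] + 1])

# A window that stops before the answer yields no span.
print(char_span_to_token_span(offsets[:3], start_char, end_char))
```

When a long context is split into overlapping windows, the None case is the one that gets labelled with position 0 in the training code, signalling that this window does not contain the answer.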

Section 10

Advanced — Building Your Own Distillation Pipeline

You are not limited to the pre-distilled DistilBERT. You can distil from any fine-tuned teacher — for example, a RoBERTa teacher fine-tuned on your domain data, distilled into a smaller student model for production deployment.

# ── Custom Knowledge Distillation: RoBERTa → Small Student ────
import torch
import torch.nn as nn
import torch.nn.functional as F
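from transformers import (
    RobertaForSequenceClassification,
    RobertaTokenizer
)

# Assume teacher is already fine-tuned on your task
teacher = RobertaForSequenceClassification.from_pretrained("./roberta-finetuned")
# Teacher and student receive the same input_ids, so they must share a
# vocabulary. distilroberta-base uses the RoBERTa tokeniser; a WordPiece
# student such as distilbert-base-uncased would not line up.
student = RobertaForSequenceClassification.from_pretrained(
    "distilroberta-base", num_labels=2
)
teacher.eval()   # teacher is frozen during distillation
student.train()  # student is the only model being updated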

class DistillationLoss(nn.Module):
    def __init__(self, temperature=4.0, alpha=0.5):
        super().__init__()
        self.T     = temperature   # soften probability distributions
        self.alpha = alpha         # weight: distil vs hard label loss

    def forward(self, student_logits, teacher_logits, true_labels):
        # Soft distillation loss (KL divergence)
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / self.T, dim=-1),
            F.softmax(teacher_logits / self.T, dim=-1),
            reduction="batchmean"
        ) * (self.T ** 2)

        # Hard label (cross-entropy) loss
        hard_loss = F.cross_entropy(student_logits, true_labels)

        # Combined loss: alpha controls the trade-off
        return self.alpha * soft_loss + (1 - self.alpha) * hard_loss

loss_fn   = DistillationLoss(temperature=4.0, alpha=0.7)
optimiser = torch.optim.AdamW(student.parameters(), lr=3e-5)

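# ── Training loop ──────────────────────────────────────────────
# train_loader is assumed to be a torch DataLoader that yields batches
# tokenised with the teacher's (RoBERTa) tokeniser.
for epoch in range(3):
    for batch in train_loader: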
        input_ids      = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels         = batch["labels"]

        # Teacher forward pass (no gradient)
        with torch.no_grad():
            teacher_out = teacher(
                input_ids=input_ids,
                attention_mask=attention_mask
            )

        # Student forward pass
        student_out = student(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        loss = loss_fn(
            student_out.logits,
            teacher_out.logits,
            labels
        )

        optimiser.zero_grad()
        loss.backward()
        optimiser.step()

    print(f"Epoch {epoch+1}: loss = {loss.item():.4f}")

🎯

Temperature — The Secret Dial of Distillation

Temperature T controls how "soft" the teacher's distributions are. At T=1, you get the raw probability distribution. At T=4, the probabilities are flattened, so the teacher reveals more about which classes it considers similar (even among wrong answers). A higher T gives a smoother learning signal, but if T is too high the distribution approaches uniform and the signal washes out. Typical values: 3.0 to 6.0. Start at 4.0.
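The flattening effect is easy to check numerically. A small sketch in plain Python; the three logits are made-up teacher scores for illustration, not outputs of a real model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by the temperature (numerically stabilised)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "fox"]
teacher_logits = [4.0, 2.6, 1.5]   # hypothetical teacher scores

for temperature in (1.0, 4.0, 20.0):
    probs = softmax_with_temperature(teacher_logits, temperature)
    summary = ", ".join(f"{c}={p:.2f}" for c, p in zip(classes, probs))
    print(f"T={temperature:>4}: {summary}")
```

The ranking of the classes never changes; only the gaps between them shrink as T grows. At very high T the three probabilities are nearly equal, which is why pushing the temperature up indefinitely stops helping.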

Section 11

When Things Go Wrong — Common Mistakes and Fixes

🔴

Catastrophic Forgetting

Symptom: train acc 99%, val acc 55%

Usually a learning rate that is too high: the model overwrites its pre-trained knowledge and then overfits the fine-tuning set. Fix: Use learning rates between 1e-5 and 5e-5. Apply differential learning rates: lower for early layers, higher for the classification head. Add a warmup phase.

✗ Most common fine-tuning failure
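One common way to express differential learning rates is a geometric decay from the head down to the embeddings. The sketch below only computes the rates; the group names and the decay factor of 0.9 are illustrative assumptions, and in practice each rate would be attached to the matching parameter group of the optimiser.

```python
def layerwise_learning_rates(num_layers, head_lr=5e-5, decay=0.9):
    """Return {group_name: lr}, shrinking geometrically from the head towards the embeddings."""
    rates = {"classifier": head_lr}
    # Walk from the top encoder layer down to layer 0.
    for depth, layer in enumerate(range(num_layers - 1, -1, -1), start=1):
        rates[f"layer_{layer}"] = head_lr * decay ** depth
    rates["embeddings"] = head_lr * decay ** (num_layers + 1)
    return rates

for group, lr in layerwise_learning_rates(num_layers=6).items():
    print(f"{group:11s} {lr:.2e}")
```

The lowest layers, which hold the most general pre-trained features, move the least, while the new classification head is free to learn quickly.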

🔴

Wrong Tokeniser for Model

Symptom: poor performance immediately

RoBERTa uses byte-level BPE. ALBERT uses SentencePiece. DistilBERT uses WordPiece. Never use a BERT tokeniser with RoBERTa. Fix: Always load tokeniser and model from the same checkpoint string. Use AutoTokenizer.from_pretrained(model_name).

✗ Silent failure — loss looks normal

🔴

ALBERT's Slow Inference Surprise

Symptom: ALBERT is slower than expected

ALBERT-large has 24 layers: fewer parameters than BERT-large but the same depth. Each forward pass still computes all 24 layers, reusing the shared weights each time. Fix: Use ALBERT for memory-constrained environments, not latency-constrained ones. For speed, use DistilBERT.

✗ Common misconception

Problem	Likely Cause	Fix
Validation loss spikes at epoch 1	LR too high	Reduce to 1e-5, add warmup_ratio=0.1
Loss NaN on RoBERTa	fp16 + large logits	Keep max_grad_norm=1.0 in TrainingArguments (the default), lower the LR, or use bf16=True on supported GPUs
ALBERT labels wrong in NER	Subword misalignment	Use is_split_into_words=True + word_ids()
DistilBERT accuracy 3% below BERT	Expected, a known trade-off	Accept trade-off or try distilroberta-base
Out-of-memory on ALBERT-xxlarge	Hidden size 4096 is very wide	Reduce batch size to 4 or 8, use gradient accumulation
RoBERTa tokeniser adds Ġ prefix	BPE byte-level encoding	Normal: Ġ marks a leading space, and tokeniser.decode() turns it back into one

Section 12

Production Inference — Loading and Using All Three Models

# ── Unified Inference Examples — All Three Models ─────────────
from transformers import pipeline

# ── RoBERTa: Sentiment Classification ──────────────────────────
roberta_pipe = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest"
)
print(roberta_pipe("This model is absolutely brilliant!"))
# [{'label': 'positive', 'score': 0.9876}]

# ── ALBERT: Zero-Shot Classification ───────────────────────────
albert_pipe = pipeline(
    "zero-shot-classification",
    model="cross-encoder/nli-albert-small-v2"
)
result = albert_pipe(
    "The Federal Reserve raised interest rates by 25 basis points.",
    candidate_labels=["finance", "sports", "politics", "technology"]
)
print(f"Label: {result['labels'][0]}, Score: {result['scores'][0]:.4f}")
# Label: finance, Score: 0.9543

# ── DistilBERT: Fast Feature Extraction ────────────────────────
distil_pipe = pipeline(
    "feature-extraction",
    model="distilbert-base-uncased",
    return_tensors=True
)
# Get a sentence embedding by mean-pooling the token vectors
import torch
outputs = distil_pipe("Knowledge distillation compresses large models.")
# With return_tensors=True the pipeline returns a tensor of shape [1, seq_len, 768]
embedding = outputs.mean(dim=1)   # average over tokens -> [1, 768]
print(f"Sentence embedding shape: {embedding.shape}")
# Sentence embedding shape: torch.Size([1, 768])

# ── Benchmark Inference Speed ──────────────────────────────────
import time
texts = ["This is a test sentence for benchmarking."] * 100

for name, pipe in [
    ("RoBERTa", roberta_pipe),
    ("DistilBERT", distil_pipe),
]:
    start = time.time()
    _ = [pipe(t) for t in texts]
    elapsed = time.time() - start
    print(f"{name}: {elapsed:.2f}s for 100 sentences ({elapsed*10:.0f}ms/sentence)")

OUTPUT

RoBERTa: 8.42s for 100 sentences (84ms/sentence)
DistilBERT: 4.97s for 100 sentences (50ms/sentence)
→ DistilBERT is 1.69× faster on CPU inference
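Single-pass timings like these are noisy, because the first calls include one-off setup costs. A slightly more careful harness, sketched in plain Python: `benchmark` and `fake_model` are illustrative names, and `fake_model` is only a stand-in so the snippet runs without downloading a model. Swap in a pipeline call to measure a real one.

```python
import statistics
import time

def benchmark(fn, inputs, warmup=3, repeats=5):
    """Median per-item latency in milliseconds, after a few untimed warm-up calls."""
    for item in inputs[:warmup]:
        fn(item)                      # warm-up: caches, lazy initialisation
    per_item_ms = []
    for _ in range(repeats):
        start = time.perf_counter()   # monotonic, high-resolution timer
        for item in inputs:
            fn(item)
        elapsed = time.perf_counter() - start
        per_item_ms.append(1000 * elapsed / len(inputs))
    return statistics.median(per_item_ms)

def fake_model(text):
    """Stand-in workload; replace with e.g. a pipeline call."""
    return sum(ord(ch) for ch in text)

texts = ["This is a test sentence for benchmarking."] * 100
print(f"median latency: {benchmark(fake_model, texts):.4f} ms/sentence")
```

Reporting the median over several repeats keeps one slow run from distorting the comparison between models.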

Section 13

Golden Rules — Non-Negotiable Practices

🧠 RoBERTa / ALBERT / DistilBERT — Production Rules

Always load tokeniser and model from the same checkpoint. Never use a BERT tokeniser with RoBERTa (different vocabulary, different special tokens). Use AutoTokenizer.from_pretrained(model_name) and let HuggingFace pick the right one.

Fine-tune with a learning rate between 1e-5 and 5e-5. Rates above 1e-4 risk catastrophically overwriting pre-trained representations. Add a warmup phase of 6–10% of total training steps to stabilise early training.

For NER and token classification, always handle subword label alignment. Assign label -100 to all subword tokens except the first — these are ignored by the loss function. Failing to do this will corrupt your training signal silently.

Use RoBERTa as your default accuracy-first baseline, not BERT. RoBERTa outperforms BERT on virtually all standard benchmarks at the same architecture cost. There is rarely a reason to choose BERT-base over RoBERTa-base in new projects.

Do not confuse ALBERT's small parameter count with fast inference. ALBERT-large has 24 layers, and every forward pass still computes all 24 of them using the shared weights. It's memory-efficient, not compute-efficient. Use it for edge deployment, not low-latency APIs.

For production APIs handling >100 req/s, use DistilBERT or a quantised model. Even on a GPU, RoBERTa at full precision adds latency that compounds at scale. For CPU serving, DistilBERT + dynamic quantisation (torch.quantization.quantize_dynamic) can push throughput roughly another 1.5–2×, usually with only a small accuracy loss; measure both on your own task.

Evaluate on your actual task data, not just benchmark scores. Benchmark rankings such as RoBERTa on GLUE or ALBERT-xxlarge on SQuAD do not predict your domain performance. Always fine-tune and evaluate on held-out in-domain data before choosing a model.

Section 14

Final Decision Framework

Scenario	Best Choice	Why
NLP research / competition	RoBERTa-large	Highest accuracy ceiling, best fine-tuning stability
Production REST API (<100ms SLA)	DistilBERT	1.7× faster inference, 40% fewer parameters
Mobile / edge deployment	ALBERT-base	Only 12M params, <50MB model file
Domain-specific NLP (medical, legal)	RoBERTa + domain pre-training	Strong base + further pre-train on domain text
Few labelled examples (<1,000)	RoBERTa-large	Better representations = better few-shot fine-tuning
Multilingual tasks	XLM-RoBERTa	RoBERTa trained on 100 languages
Semantic similarity / embeddings	DistilBERT (Sentence-Transformers)	Fast, good embeddings, huge ecosystem
Memory budget <64MB	ALBERT-base	Smallest parameter count of the three models compared here

🏆

The Practitioner's Final Word

Start every new NLP project with DistilBERT as your baseline — it is fast, cheap to fine-tune, and surprisingly strong. If you need higher accuracy, upgrade to RoBERTa-base. If memory is the constraint, switch to ALBERT-base. Only reach for the "large" variants when you have a GPU budget and a well-curated dataset — large models need large data to realise their potential. The three models together give you coverage of virtually every real-world NLP deployment scenario.