The Origin Story — Why BERT Needed Successors
Three smarter students came after BERT. RoBERTa said: "Remove the fake chapters, study longer, and vary the masks." ALBERT said: "Share notes between chapters; don't write the same thing twice." DistilBERT said: "I'll learn from BERT, but I only need half the pages." Each matched or beat BERT on at least one axis: accuracy, memory, or speed. That is the entire motivation for this tutorial.
BERT (Bidirectional Encoder Representations from Transformers) was introduced by Google in 2018 and fundamentally changed NLP. But its training recipe had real flaws. RoBERTa, ALBERT, and DistilBERT each attacked one or more of those flaws — and in doing so, taught us deep lessons about what actually matters in pre-training language models.
How BERT works at its core, and what was wrong with it.
What RoBERTa, ALBERT, and DistilBERT each changed and why those changes helped.
How to fine-tune all three on real NLP tasks using HuggingFace Transformers.
How to choose between them for your specific use case.
Diagrams, stories, and working Python code throughout.
BERT in 5 Minutes — The Foundation You Must Know
Before understanding the successors, you need a clear mental model of what BERT is actually doing.
BERT's Two Pre-training Tasks
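BERT is pre-trained on two tasks: masked language modelling (MLM), which predicts randomly masked tokens, and next sentence prediction (NSP), which classifies whether two segments are consecutive. The original recipe has four weaknesses:
1. Static masking: the mask is chosen once during preprocessing, so the model sees the same masked positions epoch after epoch.
2. NSP adds little: the RoBERTa authors found that dropping NSP matched or slightly improved downstream results, and the segment-pair input format constrains how batches are built.
3. Undertrained: BERT was trained for 1M steps on only 16GB of text. Longer training on much more data helps dramatically.
4. Parameter redundancy: the embedding matrix (30,000 × 768, about 23M parameters) accounts for roughly a fifth of BERT-base's 110M parameters.
Each of RoBERTa, ALBERT, and DistilBERT fixes at least one of these.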
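The difference between static and dynamic masking can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual preprocessing: the helper `mask_tokens` and the 30% masking rate are assumptions chosen to make the effect visible on a short sentence, and the real recipe's 80/10/10 replacement rule is omitted.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob, rng):
    """Return a copy of `tokens` with each token replaced by [MASK] with probability `mask_prob`."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]

tokens = "the quick brown fox jumps over the lazy dog".split()

# Static masking: the mask is drawn once during preprocessing
# and the same masked sentence is reused in every epoch.
static_view = mask_tokens(tokens, mask_prob=0.3, rng=random.Random(0))
static_epochs = [static_view for _ in range(3)]

# Dynamic masking: a fresh mask is drawn every time the sentence is served,
# so the model is asked to predict different positions across epochs.
dynamic_rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, mask_prob=0.3, rng=dynamic_rng) for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs), start=1):
    print(f"epoch {epoch} static : {' '.join(s)}")
    print(f"epoch {epoch} dynamic: {' '.join(d)}")
```

With static masking every epoch prints the same line; with dynamic masking the masked positions usually change from epoch to epoch.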
RoBERTa — A Robustly Optimised BERT Pretraining Approach
RoBERTa was published by Facebook AI in 2019. The paper's central claim was audacious: BERT was significantly undertrained, and fixing the training procedure — without changing the architecture — produced state-of-the-art results on many benchmarks.
RoBERTa's Five Key Changes to BERT Training
| Property | BERT-base | RoBERTa-base |
|---|---|---|
| Architecture | 12 layers, 768 hidden, 12 heads | 12 layers, 768 hidden, 12 heads |
| Parameters | 110M | 125M |
| Training Data | 16 GB | 160 GB |
| NSP Task | Yes (harmful) | Removed |
| Masking | Static | Dynamic |
| Batch Size | 256 | 2,000–8,000 |
| Tokeniser | WordPiece (30K vocab) | Byte-level BPE (50K vocab) |
| GLUE Score | 79.6 | 88.5 (reported for RoBERTa-large on the GLUE leaderboard) |
RoBERTa Architecture Diagram
ALBERT — A Lite BERT for Self-Supervised Learning
ALBERT (A Lite BERT) was published by Google Research and the Toyota Technological Institute in 2019. Its contribution is not about training data or masking — it's about parameter efficiency. ALBERT introduced two architectural innovations and one new training objective.
ALBERT's Three Core Innovations
ALBERT Model Variants vs BERT
| Model | Layers | Hidden Size | Embedding Size | Parameters | GLUE Score |
|---|---|---|---|---|---|
| BERT-base | 12 | 768 | 768 | 110M | 79.6 |
| BERT-large | 24 | 1024 | 1024 | 336M | 82.8 |
| ALBERT-base | 12 | 768 | 128 | 12M | 82.3 |
| ALBERT-large | 24 | 1024 | 128 | 18M | 88.7 |
| ALBERT-xxlarge | 12 | 4096 | 128 | 235M | 91.0 |
ALBERT-large has 18M parameters vs BERT-large's 336M. You'd expect it to be much faster at inference. It isn't. Why? Because parameter sharing reduces the number of stored weights, not the number of operations: a 24-layer ALBERT still runs 24 layer computations in every forward pass, it just reuses the same weights for each one. ALBERT wins on memory footprint and storage size, not raw speed. For memory-constrained deployment (mobile, edge), ALBERT is a genuine win. For latency-sensitive production, prefer DistilBERT.
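The saving from factorised embeddings is simple arithmetic. The sketch below uses the 30,000-token vocabulary and the hidden and embedding sizes quoted in this tutorial; the variable names are illustrative, and the count covers the token embedding matrices only.

```python
# Sizes taken from the tables above: vocabulary V, hidden size H, ALBERT embedding size E.
V, H, E = 30_000, 768, 128

# BERT: one V x H embedding matrix.
bert_embedding_params = V * H

# ALBERT: a small V x E embedding followed by an E x H projection.
albert_embedding_params = V * E + E * H

print(f"BERT embedding parameters:   {bert_embedding_params:,}")
print(f"ALBERT embedding parameters: {albert_embedding_params:,}")
print(f"Reduction factor: {bert_embedding_params / albert_embedding_params:.2f}x")
```

Cross-layer sharing then removes most of the remaining weights, because one set of Transformer layer parameters is reused at every depth.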
ALBERT Architecture Diagram
DistilBERT — Knowledge Distillation from BERT
DistilBERT was published by HuggingFace in 2019. It uses knowledge distillation — a technique where a small "student" model is trained to mimic the behaviour of a large "teacher" model, not just its final predictions, but the soft probability distributions it produces at every step.
Knowledge Distillation — The Core Mechanism
Hard label: "The correct answer is cat." (one class, and nothing about the alternatives).
Soft label from BERT: "This is probably a cat (0.71), possibly a dog (0.18), maybe a fox (0.06)."
The soft label tells the student how similar different classes are — rich relational knowledge
that no training dataset explicitly provides. This is what makes distillation so powerful.
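To make the contrast concrete, here is a small sketch comparing a hard label with a soft label computed from teacher logits. The class names and logit values are hypothetical, chosen only for illustration; entropy is used as a rough measure of how much the label says beyond the winning class.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

classes = ["cat", "dog", "fox", "car"]
teacher_logits = [4.0, 2.6, 1.5, -1.0]  # hypothetical teacher outputs

hard_label = [1.0 if c == "cat" else 0.0 for c in classes]
soft_label = softmax(teacher_logits)

for c, h, s in zip(classes, hard_label, soft_label):
    print(f"{c:>4}: hard={h:.2f} soft={s:.3f}")
print(f"hard-label entropy: {entropy_bits(hard_label):.3f} bits")
print(f"soft-label entropy: {entropy_bits(soft_label):.3f} bits")
```

The soft label ranks "dog" and "fox" well above "car", which is exactly the similarity information a one-hot label throws away.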
DistilBERT vs BERT — Size, Speed, Performance
| Property | BERT-base | DistilBERT | Change |
|---|---|---|---|
| Parameters | 110M | 66M | −40% |
| Layers | 12 | 6 | −50% |
| Inference Speed | baseline | ~1.6× faster | +60% |
| GLUE Score | 79.6 | 77.0 | −3.3% |
| NSP Task | Yes | Removed | ✓ |
| Training Approach | Pre-training from scratch | Distilled from BERT | 💡 |
| Best Use Case | Accuracy-critical tasks | Production speed-sensitive NLP | — |
Side-by-Side Comparison — RoBERTa vs ALBERT vs DistilBERT
| Property | RoBERTa-base | ALBERT-base | DistilBERT |
|---|---|---|---|
| Core Idea | Better training recipe | Parameter efficiency | Knowledge distillation |
| Architecture Change? | None | Yes — shared layers + factorised embeddings | Partial — 6 layers |
| Parameters | 125M | 12M | 66M |
| NSP Task | Removed | Replaced by SOP | Removed |
| GLUE Score | 88.5 (reported for RoBERTa-large) | 82.3 | 77.0 |
| Inference Speed | Same as BERT | Same as BERT (weight sharing does not cut compute) | Fast (~1.6×) |
| Memory Footprint | Large | Tiny | Medium |
| Training Cost | Very high (160GB data) | High | Moderate |
| Best For | Max accuracy, research | Edge devices, low memory | Production APIs, speed |
Fine-Tuning RoBERTa for Text Classification
All three models share the same fine-tuning paradigm via HuggingFace Transformers. We'll fine-tune RoBERTa on sentiment classification (SST-2) as our first example.
Run pip install transformers datasets accelerate scikit-learn before proceeding. All examples use HuggingFace Transformers 4.x and PyTorch.
# ── RoBERTa Fine-tuning for Sentiment Classification (SST-2) ──
from transformers import (
    RobertaTokenizer,
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments
)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# ── 1. Load data ───────────────────────────────────────────────
dataset = load_dataset("sst2")
tokeniser = RobertaTokenizer.from_pretrained("roberta-base")

# ── 2. Tokenise ────────────────────────────────────────────────
def tokenise(batch):
    return tokeniser(
        batch["sentence"],
        truncation=True,
        padding="max_length",
        max_length=128
    )

tokenised = dataset.map(tokenise, batched=True)
tokenised = tokenised.rename_column("label", "labels")
tokenised.set_format(
    "torch",
    columns=["input_ids", "attention_mask", "labels"]
)

# ── 3. Load pre-trained model ──────────────────────────────────
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=2  # binary: positive / negative
)

# ── 4. Metrics ─────────────────────────────────────────────────
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="binary")
    }

# ── 5. Training arguments ──────────────────────────────────────
args = TrainingArguments(
    output_dir="./roberta-sst2",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=True  # mixed precision: faster on GPU
)

# ── 6. Train ───────────────────────────────────────────────────
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["validation"],
    compute_metrics=compute_metrics
)
trainer.train()

results = trainer.evaluate()
print(f"Validation Accuracy: {results['eval_accuracy']:.4f}")
print(f"Validation F1: {results['eval_f1']:.4f}")
Learning rate of 1e-5 to 3e-5 works well for most tasks.
Use weight_decay=0.01 on all parameters except bias and LayerNorm.
Warmup for 6–10% of total steps prevents early instability.
3 epochs is typically enough; more than 5 often causes overfitting.
Use fp16=True on a supported GPU for a substantial speed-up; if the loss turns NaN, try bf16=True or fall back to full precision.
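The warmup tip is easier to reason about with the schedule written out. The function below is a minimal sketch of linear warmup followed by linear decay; `lr_at_step` is a hypothetical helper written for this tutorial, not a Transformers API, and the exact rounding of warmup steps in the library may differ slightly.

```python
def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup to `peak_lr`, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr over the warmup phase.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

total = 1000
for step in (0, 50, 100, 550, 1000):
    print(f"step {step:>4}: lr = {lr_at_step(step, total):.2e}")
```

During the first 10% of steps the randomly initialised classification head receives small updates, which is what keeps early training stable.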
Fine-Tuning ALBERT for Named Entity Recognition
ALBERT shines in memory-constrained settings. Here we fine-tune it on NER (token classification) with CoNLL-2003, labelling each token as PER, ORG, LOC, MISC, or O (not an entity).
# ── ALBERT Fine-tuning for NER (CoNLL-2003) ───────────────────
from transformers import (
    AlbertTokenizerFast,
    AlbertForTokenClassification,
    DataCollatorForTokenClassification,
    Trainer, TrainingArguments
)
from datasets import load_dataset
import numpy as np

dataset = load_dataset("conll2003")
tokeniser = AlbertTokenizerFast.from_pretrained("albert-base-v2")
label_list = dataset["train"].features["ner_tags"].feature.names
# ['O','B-PER','I-PER','B-ORG','I-ORG','B-LOC','I-LOC','B-MISC','I-MISC']
num_labels = len(label_list)

# ── Align labels with subword tokens ──────────────────────────
def tokenise_and_align(examples):
    tok = tokeniser(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True
    )
    aligned_labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tok.word_ids(batch_index=i)
        prev_word = None
        label_ids = []
        for wid in word_ids:
            if wid is None:
                label_ids.append(-100)  # ignore [CLS], [SEP]
            elif wid != prev_word:
                label_ids.append(label[wid])  # first subword keeps the word's label
            else:
                label_ids.append(-100)  # ignore sub-tokens
            prev_word = wid
        aligned_labels.append(label_ids)
    tok["labels"] = aligned_labels
    return tok

tokenised = dataset.map(tokenise_and_align, batched=True)
collator = DataCollatorForTokenClassification(tokeniser)

model = AlbertForTokenClassification.from_pretrained(
    "albert-base-v2",
    num_labels=num_labels
)

args = TrainingArguments(
    output_dir="./albert-ner",
    num_train_epochs=5,
    per_device_train_batch_size=32,  # a large batch fits because ALBERT is small
    learning_rate=3e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    fp16=True
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["validation"],
    data_collator=collator,
    tokenizer=tokeniser
)
trainer.train()
Subword tokenisers split words into pieces: with BERT's WordPiece, "Washington" might become
["Wash", "##ington"]. ALBERT's SentencePiece tokeniser splits words in a similar way but marks word starts with "▁" instead.
Your original label applies to the whole word: B-LOC.
Only the first subword of each word gets the true label.
Every other subword, and every special token, gets -100 so the loss function ignores it.
The is_split_into_words=True flag and word_ids() method handle this.
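The alignment rule can be checked on a toy input without downloading a model. In this sketch the `word_ids` list is written by hand to imitate what a fast tokeniser's `word_ids()` would return for a three-word sentence whose first word splits into two pieces; `align_labels` is an illustrative helper, and label id 5 corresponds to B-LOC in the CoNLL-2003 label list shown earlier.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first subword; mask special tokens and later subwords."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:            # special token such as [CLS] or [SEP]
            aligned.append(ignore_index)
        elif wid != prev:          # first subword of a new word
            aligned.append(word_labels[wid])
        else:                      # continuation subword of the same word
            aligned.append(ignore_index)
        prev = wid
    return aligned

# Words: ["Washington", "is", "rainy"] with labels [B-LOC, O, O] = [5, 0, 0].
# Hand-written word_ids: [CLS], "Wash", "ington", "is", "rainy", [SEP]
word_ids = [None, 0, 0, 1, 2, None]
aligned = align_labels(word_ids, [5, 0, 0])
print(aligned)  # [-100, 5, -100, 0, 0, -100]
```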
Fine-Tuning DistilBERT for Question Answering
DistilBERT's strength is production speed. Here we fine-tune it on extractive QA (SQuAD-style) — given a passage and a question, find the answer span within the passage.
# ── DistilBERT Fine-tuning for Extractive QA (SQuAD v1) ───────
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForQuestionAnswering,
    Trainer,
    TrainingArguments,
    pipeline
)
from datasets import load_dataset

dataset = load_dataset("squad", split="train[:5000]")
tokeniser = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

# ── Tokenise + find start/end positions ───────────────────────
def preprocess(examples):
    inputs = tokeniser(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length"
    )
    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions, end_positions = [], []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])
        seq_ids = inputs.sequence_ids(i)

        # Find context token range (sequence id 1 marks context tokens)
        ctx_start = next(j for j, s in enumerate(seq_ids) if s == 1)
        ctx_end = next(j for j in range(len(seq_ids) - 1, -1, -1) if seq_ids[j] == 1)

        if offset[ctx_start][0] > start_char or offset[ctx_end][1] < end_char:
            # Answer is not fully inside this window: point both labels at position 0
            start_positions.append(0)
            end_positions.append(0)
        else:
            start_positions.append(next(
                j for j in range(ctx_start, ctx_end + 1)
                if offset[j][0] <= start_char < offset[j][1] or offset[j][0] == start_char
            ))
            end_positions.append(next(
                j for j in range(ctx_end, ctx_start - 1, -1)
                if offset[j][1] >= end_char
            ))

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

tokenised = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")

# ── Training ───────────────────────────────────────────────────
args = TrainingArguments(
    output_dir="./distilbert-squad",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    fp16=True
)
trainer = Trainer(model=model, args=args, train_dataset=tokenised)
trainer.train()
# Write the final weights to the directory the pipeline loads from;
# without this, output_dir only contains intermediate checkpoints.
trainer.save_model("./distilbert-squad")

# ── Inference ──────────────────────────────────────────────────
qa_pipe = pipeline(
    "question-answering",
    model="./distilbert-squad",
    tokenizer=tokeniser
)
result = qa_pipe(
    question="Where did the Battle of Hastings take place?",
    context="The Battle of Hastings was fought on 14 October 1066 near Hastings, East Sussex, England between the Norman-French army of William the Conqueror and the English army of King Harold II."
)
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['score']:.4f}")
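The core of the preprocessing above is converting a character span into a token span using offsets. The sketch below isolates that step on a toy sentence. It assumes a whitespace tokeniser as a stand-in for the real tokeniser's `offset_mapping`, and `char_span_to_token_span` is an illustrative helper rather than a library function.

```python
import re

def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span [start_char, end_char) to inclusive token indices."""
    start_tok = next(i for i, (s, e) in enumerate(offsets) if s <= start_char < e)
    end_tok = next(
        i for i in range(len(offsets) - 1, -1, -1)
        if offsets[i][0] < end_char <= offsets[i][1]
    )
    return start_tok, end_tok

context = "The battle was fought near Hastings in 1066."
answer = "Hastings"

# Stand-in for offset_mapping: (start, end) character offsets of each whitespace token.
matches = list(re.finditer(r"\S+", context))
tokens = [m.group() for m in matches]
offsets = [m.span() for m in matches]

start_char = context.index(answer)
end_char = start_char + len(answer)
span = char_span_to_token_span(offsets, start_char, end_char)
print(span, tokens[span[0]:span[1] + 1])
```

The model is trained to predict these two token indices, which is why answers that fall outside a truncated window need a fallback label.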
Advanced — Building Your Own Distillation Pipeline
You are not limited to the pre-distilled DistilBERT. You can distil from any fine-tuned teacher — for example, a RoBERTa teacher fine-tuned on your domain data, distilled into a smaller student model for production deployment.
# ── Custom Knowledge Distillation: RoBERTa → Small Student ────
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import RobertaForSequenceClassification

# Assume teacher is already fine-tuned on your task
teacher = RobertaForSequenceClassification.from_pretrained("./roberta-finetuned")

# Teacher and student receive the same input_ids, so they must share a vocabulary.
# distilroberta-base uses the RoBERTa tokeniser; distilbert-base-uncased
# (WordPiece, 30K vocab) would misread RoBERTa's byte-level BPE ids.
student = RobertaForSequenceClassification.from_pretrained(
    "distilroberta-base", num_labels=2
)
teacher.eval()   # teacher is frozen during distillation
student.train()

class DistillationLoss(nn.Module):
    def __init__(self, temperature=4.0, alpha=0.5):
        super().__init__()
        self.T = temperature  # soften probability distributions
        self.alpha = alpha    # weight: distil vs hard label loss

    def forward(self, student_logits, teacher_logits, true_labels):
        # Soft distillation loss (KL divergence), scaled by T^2 so its
        # gradients stay comparable to the hard-label term
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / self.T, dim=-1),
            F.softmax(teacher_logits / self.T, dim=-1),
            reduction="batchmean"
        ) * (self.T ** 2)
        # Hard label (cross-entropy) loss
        hard_loss = F.cross_entropy(student_logits, true_labels)
        # Combined loss: alpha controls the trade-off
        return self.alpha * soft_loss + (1 - self.alpha) * hard_loss

loss_fn = DistillationLoss(temperature=4.0, alpha=0.7)
optimiser = torch.optim.AdamW(student.parameters(), lr=3e-5)

# ── Training loop ──────────────────────────────────────────────
# train_loader is assumed to be a DataLoader you have built over batches
# tokenised with the RoBERTa tokeniser; it is not defined in this snippet.
for epoch in range(3):
    for batch in train_loader:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["labels"]

        # Teacher forward pass (no gradient)
        with torch.no_grad():
            teacher_out = teacher(
                input_ids=input_ids,
                attention_mask=attention_mask
            )

        # Student forward pass
        student_out = student(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        loss = loss_fn(
            student_out.logits,
            teacher_out.logits,
            labels
        )
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()

    # Loss of the last batch in this epoch
    print(f"Epoch {epoch+1}: loss = {loss.item():.4f}")
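Temperature T controls how "soft" the teacher's distributions are. At T=1 you get the teacher's ordinary softmax output. At T=4 the probabilities are flattened, so lower-ranked classes carry more weight and the student sees more clearly which wrong answers the teacher considers plausible. A higher T gives a smoother learning signal, but a very high T pushes everything towards uniform and washes the signal out. Values between 3.0 and 6.0 are a reasonable range to try; start at 4.0. The soft loss is multiplied by T² so that its gradient magnitude stays comparable to the hard-label term.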
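The flattening effect of temperature is easy to see numerically. The following sketch applies a temperature-scaled softmax to a set of hypothetical logits; the values and the helper name `softmax_with_temperature` are illustrative assumptions.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [4.0, 2.6, 1.5, -1.0]  # hypothetical teacher outputs

p_t1 = softmax_with_temperature(teacher_logits, 1.0)
p_t4 = softmax_with_temperature(teacher_logits, 4.0)

print("T=1:", [round(p, 3) for p in p_t1])
print("T=4:", [round(p, 3) for p in p_t4])
```

At T=4 the top class loses probability mass and the tail classes gain it, while the ranking of the classes stays the same.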
When Things Go Wrong — Common Mistakes and Fixes
Always load the tokeniser that matches your checkpoint with AutoTokenizer.from_pretrained(model_name).
| Problem | Likely Cause | Fix |
|---|---|---|
| Validation loss spikes at epoch 1 | LR too high | Reduce to 1e-5, add warmup_ratio=0.1 |
| Loss NaN on RoBERTa | fp16 + large logits | Keep max_grad_norm=1.0 in TrainingArguments (the default), lower the learning rate, or switch to bf16=True / disable fp16 |
| ALBERT labels wrong in NER | Subword misalignment | Use is_split_into_words=True + word_ids() |
| DistilBERT accuracy 3% below BERT | Expected: this is the known trade-off | Accept the trade-off or try distilroberta-base |
| Out-of-memory on ALBERT-xxlarge | Hidden size 4096 is very wide | Reduce batch size to 4 or 8, use gradient accumulation |
| RoBERTa tokeniser adds Ġ prefix | BPE byte-level encoding | Normal: Ġ marks a preceding space, and tokeniser.decode() turns it back into one. |
Production Inference — Loading and Using All Three Models
# ── Unified Inference Examples — All Three Models ─────────────
import time

import torch
from transformers import pipeline

# ── RoBERTa: Sentiment Classification ──────────────────────────
roberta_pipe = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest"
)
print(roberta_pipe("This model is absolutely brilliant!"))
# [{'label': 'positive', 'score': 0.9876}]

# ── ALBERT: Zero-Shot Classification ───────────────────────────
albert_pipe = pipeline(
    "zero-shot-classification",
    model="cross-encoder/nli-albert-small-v2"
)
result = albert_pipe(
    "The Federal Reserve raised interest rates by 25 basis points.",
    candidate_labels=["finance", "sports", "politics", "technology"]
)
print(f"Label: {result['labels'][0]}, Score: {result['scores'][0]:.4f}")
# Label: finance, Score: 0.9543

# ── DistilBERT: Fast Feature Extraction ────────────────────────
distil_pipe = pipeline(
    "feature-extraction",
    model="distilbert-base-uncased",
    return_tensors=True
)
# Get a sentence embedding by mean-pooling over all token vectors
outputs = distil_pipe("Knowledge distillation compresses large models.")
# outputs has shape [1, seq_len, 768]; average over the token axis
embedding = torch.as_tensor(outputs).mean(dim=1)  # [1, 768]
print(f"Sentence embedding shape: {embedding.shape}")
# Sentence embedding shape: torch.Size([1, 768])

# ── Benchmark Inference Speed ──────────────────────────────────
texts = ["This is a test sentence for benchmarking."] * 100
for name, pipe in [("RoBERTa", roberta_pipe), ("DistilBERT", distil_pipe)]:
    start = time.time()
    _ = [pipe(t) for t in texts]
    elapsed = time.time() - start
    print(f"{name}: {elapsed:.2f}s for 100 sentences ({elapsed*10:.0f}ms/sentence)")
Golden Rules — Non-Negotiable Practices
Always load the tokeniser with AutoTokenizer.from_pretrained(model_name) and let HuggingFace pick the right one.
For token classification, assign -100 to all subword tokens except the first; these are ignored
by the loss function. Failing to do this will corrupt your training signal silently.
Dynamic quantisation (torch.quantization.quantize_dynamic)
can push throughput another 1.5–2× without meaningful accuracy loss.
Final Decision Framework
| Scenario | Best Choice | Why |
|---|---|---|
| NLP research / competition | RoBERTa-large | Highest accuracy ceiling, best fine-tuning stability |
| Production REST API (<100ms SLA) | DistilBERT | ~60% faster inference, 40% fewer parameters |
| Mobile / edge deployment | ALBERT-base | Only 12M params, <50MB model file |
| Domain-specific NLP (medical, legal) | RoBERTa + domain pre-training | Strong base + further pre-train on domain text |
| Few labelled examples (<1,000) | RoBERTa-large | Better representations = better few-shot fine-tuning |
| Multilingual tasks | XLM-RoBERTa | RoBERTa trained on 100 languages |
| Semantic similarity / embeddings | DistilBERT (Sentence-Transformers) | Fast, good embeddings, huge ecosystem |
| Memory budget <64MB | ALBERT-base | Smallest parameter count of the three models covered here |
Start every new NLP project with DistilBERT as your baseline — it is fast, cheap to fine-tune, and surprisingly strong. If you need higher accuracy, upgrade to RoBERTa-base. If memory is the constraint, switch to ALBERT-base. Only reach for the "large" variants when you have a GPU budget and a well-curated dataset — large models need large data to realise their potential. The three models together give you coverage of virtually every real-world NLP deployment scenario.