
RoBERTa vs ALBERT vs DistilBERT

Learn how RoBERTa, ALBERT, and DistilBERT improve on BERT with clear diagrams, Python fine-tuning examples, and a side-by-side comparison. Covers dynamic masking, parameter sharing, knowledge distillation, and when to use each model in production.

Section 01

The Origin Story — Why BERT Needed Successors

The Brilliant Student Who Never Studied Properly
Imagine a gifted student — BERT — who showed up to university and scored astonishing marks. But here is the secret: BERT never learned how to study efficiently. He read every book twice (bidirectional encoding), but he also wasted time studying fake chapters inserted by teachers to trick him (Next Sentence Prediction). He also memorised chapters in a strange mask-and-guess pattern that never appeared in real exams (static masking). And frankly — he was enormous. Carrying his textbooks required a truck.

Three smarter students came after him. RoBERTa said: "Remove the fake chapters, study longer, and vary the masks." ALBERT said: "Share notes between chapters — don't write the same thing twice." DistilBERT said: "I'll learn from BERT, but I only need half the pages." All three outperformed or matched BERT in important ways. That is the entire motivation for this tutorial.

BERT (Bidirectional Encoder Representations from Transformers) was introduced by Google in 2018 and fundamentally changed NLP. But its training recipe had real flaws. RoBERTa, ALBERT, and DistilBERT each attacked one or more of those flaws — and in doing so, taught us deep lessons about what actually matters in pre-training language models.

🏴
What You Will Learn in This Tutorial

How BERT works at its core — and what was wrong with it. What RoBERTa, ALBERT, and DistilBERT each changed and why those changes helped. How to fine-tune all three on real NLP tasks using HuggingFace Transformers. How to choose between them for your specific use case. Diagrams, stories, and working Python code throughout.


Section 02

BERT in 5 Minutes — The Foundation You Must Know

Before understanding the successors, you need a clear mental model of what BERT is actually doing.

🧠 How BERT Reads a Sentence
Input
Text is split into subword tokens: "unbelievable" → ["un", "##believ", "##able"]
Special
A [CLS] token is prepended (used for classification). A [SEP] token separates sentence pairs.
Embeds
Each token gets a Token Embedding + Segment Embedding (which sentence?) + Position Embedding (which index?)
Layers
12 stacked Transformer encoder layers (BERT-base). Each layer runs self-attention — every token attends to every other token simultaneously.
Output
A rich contextual vector for each token. The word "bank" in "river bank" and "bank transfer" gets completely different vectors.
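The input layout described above can be sketched in a few lines of plain Python. This is a toy illustration only: the subword split is written by hand rather than produced by a real WordPiece model, and `build_bert_input` is a hypothetical helper name, not a library function.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Return (tokens, segment_ids, position_ids) in BERT's input layout."""
    # [CLS] goes first; [SEP] closes the first sentence
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        # The second sentence (and its closing [SEP]) belongs to segment 1
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    # Position ids are simply the token indices
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segments, positions = build_bert_input(
    ["un", "##believ", "##able"],   # hand-written subword split
    ["it", "is"],
)
for row in zip(positions, segments, tokens):
    print(row)
```

In the real model each of the three id sequences indexes its own embedding table, and the three looked-up vectors are summed per token.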

BERT's Two Pre-training Tasks

🎮
Masked Language Modelling (MLM)
Task 1
15% of tokens are selected for prediction. Of these, 80% are replaced by [MASK], 10% by a random token, and 10% are left unchanged. The model must predict the original token from context. This forces bidirectional understanding: the model must look left and right to guess.
📝
Next Sentence Prediction (NSP)
Task 2
Two sentences are fed in. 50% of the time sentence B genuinely follows sentence A. 50% of the time it's random. The model predicts: IsNext or NotNext.
🔴
The Problem With NSP
Flaw Discovered Later
NSP is too easy. The model solves it using topic coherence — not deep sentence-level understanding. It became a noisy distraction that hurt learning rather than helping. RoBERTa and ALBERT both addressed this.
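The MLM corruption step described above can be sketched as follows. This is a minimal sketch in plain Python, assuming the commonly cited 80/10/10 split among selected tokens; `mask_for_mlm` and the tiny vocabulary are hypothetical and stand in for a real tokeniser and data pipeline.

```python
import random


def mask_for_mlm(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels); labels are None where nothing is predicted."""
    if rng is None:
        rng = random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        # Special tokens are never selected; other tokens are selected with mask_prob
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)                        # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")            # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            corrupted.append(tok)                 # 10%: keep the original token
    return corrupted, labels


tokens = ["[CLS]", "the", "river", "bank", "flooded", "[SEP]"]
vocab = ["the", "river", "bank", "flooded", "money", "tree"]   # toy vocabulary
corrupted, labels = mask_for_mlm(tokens, vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```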
⚠️
BERT's Four Key Weaknesses

1. Static masking: the same tokens are masked in every epoch, so the model sees identical patterns repeatedly.
2. NSP is harmful: training on document pairs for NSP limits batching and adds noise.
3. Undertrained: BERT was trained for only 1M steps on 16GB of data. Much more data helps dramatically.
4. Parameter redundancy: the embedding matrix (30,000 × 768) is enormous and wasteful.
Each of RoBERTa, ALBERT, and DistilBERT fixes at least one of these.


Section 03

RoBERTa — A Robustly Optimised BERT Pretraining Approach

The Same Recipe, Cooked Correctly
A chef (BERT) invented a spectacular dish but never followed the recipe properly — they used the wrong temperature, cooked for only half the time, and added an unnecessary spice (NSP) that made it slightly bitter. A second chef (RoBERTa) read the original recipe, threw out the bad spice, cranked up the heat, cooked for three times as long, and used ten times as many fresh ingredients. The result? A dramatically better dish — using the same underlying technique. RoBERTa did not invent a new architecture. It simply trained BERT the way it should have been trained from the start.

RoBERTa was published by Facebook AI in 2019. The paper's central claim was audacious: BERT was significantly undertrained, and fixing the training procedure — without changing the architecture — produced state-of-the-art results on many benchmarks.

RoBERTa's Five Key Changes to BERT Training

01
Dynamic Masking
BERT uses static masking — the mask pattern is fixed once at data preprocessing. Every epoch the model sees the same masked tokens. RoBERTa generates a new random mask each time a sequence is fed to the model. With 40 epochs of training, this means each sentence can appear with up to 40 different mask patterns, exposing the model to far richer learning signals.
02
Remove NSP — Use Full-Sentence Inputs
NSP is dropped entirely. Instead, RoBERTa packs inputs with full consecutive sentences from the same document (or across documents) until the 512-token limit is reached. No artificial 50/50 sentence-pair splits. Experiments in the paper showed that removing NSP matched or slightly improved downstream task performance, supporting the hypothesis that NSP was not needed.
03
Much Larger Batches
BERT trained with a batch size of 256 sequences. RoBERTa trained with batch sizes of 2,000 to 8,000 sequences. Larger batches improve training stability and allow higher learning rates. The paper found that perplexity decreased more reliably and downstream performance improved consistently as batch size grew.
04
Longer Training on More Data
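BERT trained on 16GB of text (BooksCorpus + English Wikipedia) for 1M steps at a batch size of 256. RoBERTa added CC-News (76GB), OpenWebText (38GB), and Stories (31GB), training on 160GB total for up to 500K steps. The step count is lower, but at a batch size of 8,000 each step covers about 31 times more sequences, so RoBERTa processes roughly 16 times as many sequences overall. More data, more compute, better generalisation: the result was not surprising, but the magnitude of the improvement confirmed that BERT was dramatically data-starved.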
05
Byte-Pair Encoding (BPE) with Larger Vocabulary
RoBERTa uses a byte-level BPE tokeniser with a vocabulary of 50,000 tokens (BERT uses a WordPiece tokeniser with 30,000 tokens). Byte-level BPE operates on raw bytes, making it language-agnostic and eliminating the need for pre-tokenisation. Unknown tokens essentially disappear — any string can be encoded.
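| Property | BERT-base | RoBERTa-base |
| --- | --- | --- |
| Architecture | 12 layers, 768 hidden, 12 heads | 12 layers, 768 hidden, 12 heads |
| Parameters | 110M | 125M |
| Training Data | 16 GB | 160 GB |
| NSP Task | Yes (harmful) | Removed |
| Masking | Static | Dynamic |
| Batch Size | 256 | 2,000–8,000 |
| Tokeniser | WordPiece (30K vocab) | Byte-level BPE (50K vocab) |
| GLUE Score | 79.6 | 88.5 |

The two GLUE scores are quoted from different papers and evaluation setups (88.5 is the figure the RoBERTa paper reports for its large model on the GLUE leaderboard), so read the gap as indicative rather than like-for-like.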
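The difference between static and dynamic masking can be illustrated with a small simulation. This is a sketch under simplifying assumptions: it only tracks which positions are selected (not the 80/10/10 replacement), and the function names are hypothetical.

```python
import random


def pick_mask_positions(n_tokens, mask_prob, rng):
    """Select each position independently with probability mask_prob."""
    return frozenset(i for i in range(n_tokens) if rng.random() < mask_prob)


def static_masks(n_tokens, epochs, mask_prob=0.15, seed=0):
    # The mask is drawn once at preprocessing time and reused every epoch
    fixed = pick_mask_positions(n_tokens, mask_prob, random.Random(seed))
    return [fixed for _ in range(epochs)]


def dynamic_masks(n_tokens, epochs, mask_prob=0.15, seed=0):
    # A fresh mask is drawn every time the sequence is fed to the model
    rng = random.Random(seed)
    return [pick_mask_positions(n_tokens, mask_prob, rng) for _ in range(epochs)]


static = static_masks(n_tokens=128, epochs=40)
dynamic = dynamic_masks(n_tokens=128, epochs=40)
print("distinct static patterns: ", len(set(static)))
print("distinct dynamic patterns:", len(set(dynamic)))
```

Over 40 epochs the static variant shows the model a single pattern for this sequence, while the dynamic variant shows it many different ones.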

RoBERTa Architecture Diagram

📊 Figure: RoBERTa vs BERT, what changed (shaded = RoBERTa improvement). Panels compare training data (16 GB vs 160 GB), masking strategy (static vs dynamic), Next Sentence Prediction (included vs removed), batch size (256 vs 2,000–8,000 sequences) and GLUE score (79.6 vs 88.5).
Same transformer architecture — every gain came purely from a better training recipe.

Section 04

ALBERT — A Lite BERT for Self-Supervised Learning

The Student Who Shared One Set of Notes Across All Subjects
BERT had 12 separate notebooks — one for each transformer layer. Every notebook contained a full copy of the lesson, filling 110 million pages total. ALBERT realised that most of these pages were near-identical. Instead of 12 separate notebooks, he used one shared notebook that every layer could read from. He also stopped wasting pages by compressing the vocabulary table at the front. The result? ALBERT-large had only 18 million unique parameters, while actually stacking 24 transformer layers. It was simultaneously smaller in memory and deeper in understanding — a genuinely clever engineering insight.

ALBERT (A Lite BERT) was published by Google Research and the Toyota Technological Institute in 2019. Its contribution is not about training data or masking — it's about parameter efficiency. ALBERT introduced two architectural innovations and one new training objective.

ALBERT's Three Core Innovations

💾
Factorised Embedding Parameterisation
Innovation 1 — Embedding Compression
In BERT, the token embedding size E equals the hidden size H (both are 768 for BERT-base). ALBERT decouples them: a V × E lookup table (V = 30,000, E = 128) is followed by a small E × H projection up to H = 768. This cuts embedding parameters from about 23M (30,000 × 768) to about 3.9M (30,000 × 128 + 128 × 768). The intuition: token embeddings should encode context-independent meaning (small), while hidden states encode context-dependent meaning (large).
✓ Huge memory saving on embedding layer
👥
Cross-Layer Parameter Sharing
Innovation 2 — Weight Tying
ALBERT uses the same weights for every transformer layer. All 12 (or 24) layers share one set of attention weights and one set of feed-forward weights. ALBERT-base has 12M parameters; it runs as a 12-layer model but stores only one layer's worth of encoder weights. This is like calling the same function in a loop: the same computation, applied repeatedly, refines the representation progressively.
✓ Roughly 89–95% fewer parameters than the same-size BERT (12M vs 110M, 18M vs 336M)
✗ No inference speed-up (every layer is still computed, one after another)
🤔
Sentence Order Prediction (SOP)
Innovation 3 — Replaces NSP
ALBERT replaces NSP with Sentence Order Prediction. Both sentences always come from the same document. The model must predict whether sentence B comes after sentence A, or if they're swapped. SOP forces genuine inter-sentence coherence modelling — you can't solve it with topic matching alone, since both sentences are about the same topic.
✓ More meaningful pre-training signal than NSP
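The parameter savings from the first two innovations can be checked with simple arithmetic. The sketch below uses the sizes quoted above (V = 30,000, H = 768, E = 128) and assumes the standard encoder layer layout with a feed-forward width of 4 × H; it ignores position and segment embeddings, so the totals are approximate.

```python
V, H, E = 30_000, 768, 128   # vocabulary, hidden size, ALBERT embedding size

# Innovation 1: factorised embeddings
bert_embedding = V * H               # one V x H table
albert_embedding = V * E + E * H     # V x E table plus E x H projection


def encoder_layer_params(hidden, ffn_mult=4):
    """Weights and biases of one encoder layer (attention + FFN + 2 LayerNorms)."""
    attention = 4 * (hidden * hidden + hidden)   # Q, K, V and output projections
    ffn_in = hidden * ffn_mult * hidden + ffn_mult * hidden
    ffn_out = ffn_mult * hidden * hidden + hidden
    layer_norms = 2 * 2 * hidden                 # scale and shift, twice
    return attention + ffn_in + ffn_out + layer_norms


# Innovation 2: cross-layer sharing
per_layer = encoder_layer_params(H)
independent_12 = 12 * per_layer      # BERT-style: 12 separate copies
shared_12 = per_layer                # ALBERT-style: one copy reused 12 times

print(f"embedding params: {bert_embedding:,} -> {albert_embedding:,}")
print(f"encoder params:   {independent_12:,} -> {shared_12:,}")
```

The encoder stack dominates the count, which is why sharing one layer's weights accounts for most of ALBERT's reduction.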

ALBERT Model Variants vs BERT

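| Model | Layers | Hidden Size | Embedding Size | Parameters | GLUE Score |
| --- | --- | --- | --- | --- | --- |
| BERT-base | 12 | 768 | 768 | 110M | 79.6 |
| BERT-large | 24 | 1024 | 1024 | 336M | 82.8 |
| ALBERT-base | 12 | 768 | 128 | 12M | 82.3 |
| ALBERT-large | 24 | 1024 | 128 | 18M | 88.7 |
| ALBERT-xxlarge | 12 | 4096 | 128 | 235M | 91.0 |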
💡
The ALBERT Paradox — Fewer Parameters, More Computation

ALBERT-large has 18M parameters vs BERT-large's 336M. You'd expect it to be much faster. At inference time it isn't. Why? Because parameter sharing doesn't reduce the number of operations: a forward pass still computes all 24 layers, just with the same weights each time. ALBERT wins on memory footprint and storage size, not raw speed. For memory-constrained deployment (mobile, edge), ALBERT is a genuine win. For latency-sensitive production, prefer DistilBERT.
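The paradox can be made concrete with a toy model that counts stored weights and layer computations separately. This is only an illustration: `ToyLayer` does no real work, and the 7,000,000 weights per layer is a hypothetical round number.

```python
class ToyLayer:
    """Stand-in for one encoder layer; records how often it is executed."""

    def __init__(self, n_weights):
        self.n_weights = n_weights
        self.calls = 0

    def __call__(self, x):
        self.calls += 1      # one layer computation
        return x             # placeholder for attention + feed-forward


def forward(layers, x):
    for layer in layers:
        x = layer(x)
    return x


def stored_weights(layers):
    # Count each distinct layer object once, however often it appears in the stack
    unique = {id(layer): layer for layer in layers}
    return sum(layer.n_weights for layer in unique.values())


def layer_computations(layers):
    unique = {id(layer): layer for layer in layers}
    return sum(layer.calls for layer in unique.values())


independent = [ToyLayer(7_000_000) for _ in range(24)]   # BERT-style stack
shared = [ToyLayer(7_000_000)] * 24                       # ALBERT-style: one layer, 24 times

forward(independent, 0)
forward(shared, 0)

print("stored weights:    ", stored_weights(independent), "vs", stored_weights(shared))
print("layer computations:", layer_computations(independent), "vs", layer_computations(shared))
```

Storage drops by a factor of 24 in this toy setup, while the number of layer computations per forward pass is identical.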

ALBERT Architecture Diagram

✎ Figure: ALBERT, factorised embedding + shared layer weights. Panels: (1) embedding: BERT uses vocab (30K) × 768 = 23,040,000 parameters, ALBERT uses 30K × 128 = 3.84M plus a 128 → 768 projection matrix; (2) layers: 12 independent layers (110M total params) vs 1 shared layer applied 24 times (18M total params); (3) pre-training: NSP (sentence B may come from a random document, solvable by topic) vs SOP (both sentences from the same document, in order or swapped). ALBERT achieves competitive scores with 6–18× fewer parameters.
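The difference between how NSP and SOP training pairs are built can be sketched as follows. This is a simplified illustration with hypothetical function names and label strings; real pipelines operate on longer text segments rather than single sentences.

```python
import random


def make_nsp_example(doc, other_doc, i, rng):
    """BERT-style NSP: the negative is a sentence drawn from a different document."""
    if rng.random() < 0.5:
        return doc[i], doc[i + 1], "IsNext"
    return doc[i], rng.choice(other_doc), "NotNext"


def make_sop_example(doc, i, rng):
    """ALBERT-style SOP: the negative is the same two sentences, swapped."""
    first, second = doc[i], doc[i + 1]
    if rng.random() < 0.5:
        return first, second, "InOrder"
    return second, first, "Swapped"


rng = random.Random(0)
doc = ["The river rose overnight.", "By morning the bank had flooded."]
other_doc = ["Interest rates went up.", "The striker scored twice."]

print(make_nsp_example(doc, other_doc, 0, rng))
print(make_sop_example(doc, 0, rng))
```

An NSP negative usually changes the topic, so topic matching is enough to spot it; an SOP negative keeps the topic fixed and only changes the order.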

Section 05

DistilBERT — Knowledge Distillation from BERT

The Master Teaching the Apprentice
Imagine BERT as a world-class professor who gives perfect lectures but takes four hours to explain anything. His student, DistilBERT, sat in on every lecture, not to memorise the words but to capture the essence of how the professor thinks. DistilBERT asked: what patterns does the professor notice? What subtle distinctions does he make? Rather than re-reading every textbook, the student studied the professor's thought process itself. The result: a student who gives answers in roughly 60% of the time with 97% of the accuracy. You'd send the professor to write the PhD thesis, but for answering everyday questions, you'd hire the apprentice.

DistilBERT was published by HuggingFace in 2019. It uses knowledge distillation — a technique where a small "student" model is trained to mimic the behaviour of a large "teacher" model, not just its final predictions, but the soft probability distributions it produces at every step.

Knowledge Distillation — The Core Mechanism

💡
Why Soft Labels Contain More Information Than Hard Labels

Hard label: "The correct answer is cat." (a one-hot target that says nothing about the other classes).
Soft label from BERT: "This is probably a cat (0.71), possibly a dog (0.18), maybe a fox (0.06)."
The soft label tells the student how similar different classes are — rich relational knowledge that no training dataset explicitly provides. This is what makes distillation so powerful.

🎓 How DistilBERT Is Trained
Teacher
BERT-base (110M params, 12 layers) is pre-trained and frozen. It produces soft probability distributions over the vocabulary for every masked token.
Student
DistilBERT has 6 layers, initialised by taking every second layer from the teacher. Same hidden size (768) and attention heads (12); only depth is reduced.
Loss 1
Distillation Loss: KL divergence between the teacher's and the student's temperature-softened output distributions (softmax of the logits divided by a temperature T). This is the core distillation signal.
Loss 2
MLM Loss: Standard masked language modelling loss on hard labels (ground truth tokens).
Loss 3
Cosine Embedding Loss: The student's hidden-state vectors are pushed to point in the same direction as the teacher's hidden-state vectors.
Result
66M parameters, 40% fewer, 60% faster inference, 97% of BERT's performance on GLUE benchmark.
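The three loss terms above can be combined as in the sketch below, written in plain Python for a single masked position. The temperature, the equal weights, and all function names are assumptions made for illustration; they are not the values or code used to train DistilBERT.

```python
import math


def softmax(logits, T=1.0):
    """Softmax of logits / T, computed stably."""
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(logits, target_index):
    return -math.log(softmax(logits)[target_index])


def cosine_loss(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def distil_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden, T=2.0, weights=(1.0, 1.0, 1.0)):
    # Loss 1: match the teacher's softened distribution (scaled by T^2)
    l_kd = kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T)) * T * T
    # Loss 2: ordinary MLM cross-entropy against the true token
    l_mlm = cross_entropy(student_logits, target_index)
    # Loss 3: align the direction of the hidden states
    l_cos = cosine_loss(student_hidden, teacher_hidden)
    return weights[0] * l_kd + weights[1] * l_mlm + weights[2] * l_cos


teacher_logits = [3.0, 1.6, 0.5]      # toy scores for cat, dog, fox
student_logits = [2.5, 1.8, 0.4]
loss = distil_loss(student_logits, teacher_logits, 0,
                   student_hidden=[0.9, 0.1, 0.3], teacher_hidden=[1.0, 0.0, 0.4])
print(f"combined loss: {loss:.4f}")
```

When the student reproduces the teacher exactly, the first and third terms vanish and only the hard-label term remains.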

DistilBERT vs BERT — Size, Speed, Performance

| Property | BERT-base | DistilBERT | Change |
| --- | --- | --- | --- |
| Parameters | 110M | 66M | −40% |
| Layers | 12 | 6 | −50% |
| Inference Speed | baseline | ~1.7× faster | +60% |
| GLUE Score | 79.6 | 77.0 | −3.3% |
| NSP Task | Yes | Removed | |
| Training Approach | Pre-training from scratch | Distilled from BERT | |
| Best Use Case | Accuracy-critical tasks | Production speed-sensitive NLP | |
⚡ Figure: Distillation flow, teacher to student. The teacher (BERT-base, 110M params, 12 layers, frozen) produces soft logits over the 30K vocabulary (for example cat: 0.71, dog: 0.18, fox: 0.06) and hidden states. The student (DistilBERT, 66M params, 6 layers, same hidden size 768) is trained with KL divergence + cosine + MLM losses to match the soft distribution (for example cat: 0.69, dog: 0.20, fox: 0.07) and to align its hidden states with the teacher's. Result: 97% of teacher performance at 60% of the inference cost.

Section 06

Side-by-Side Comparison — RoBERTa vs ALBERT vs DistilBERT

| Property | RoBERTa-base | ALBERT-base | DistilBERT |
| --- | --- | --- | --- |
| Core Idea | Better training recipe | Parameter efficiency | Knowledge distillation |
| Architecture Change? | None | Yes: shared layers + factorised embeddings | Partial: 6 layers |
| Parameters | 125M | 12M | 66M |
| NSP Task | Removed | Replaced by SOP | Removed |
| GLUE Score | 88.5 | 82.3 | 77.0 |
| Inference Speed | Slow (same as BERT) | Slow (same compute as BERT) | Fast (~1.7×) |
| Memory Footprint | Large | Tiny | Medium |
| Training Cost | Very high (160GB data) | High | Moderate |
| Best For | Max accuracy, research | Edge devices, low memory | Production APIs, speed |
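The Memory Footprint row follows directly from the parameter counts. The sketch below converts the counts in the table into approximate weight sizes; it covers stored weights only (no activations, optimiser state, or file-format overhead), and `weights_mb` is a hypothetical helper.

```python
PARAMS = {"RoBERTa-base": 125e6, "ALBERT-base": 12e6, "DistilBERT": 66e6}
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_mb(n_params, dtype="fp32"):
    """Approximate size of the stored weights in megabytes (10^6 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e6


for name, n_params in PARAMS.items():
    sizes = ", ".join(f"{dtype}: {weights_mb(n_params, dtype):.0f} MB"
                      for dtype in BYTES_PER_PARAM)
    print(f"{name:<13} {sizes}")
```

At 4 bytes per parameter, ALBERT-base's 12M parameters come to about 48 MB, which is where the "less than 50MB" figure comes from.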
🚀
Choose RoBERTa When…
Accuracy is everything
You need the best possible performance on classification, NER, or QA tasks and inference latency is not a constraint. Competition-grade NLP, research papers, production systems where you have powerful hardware. Fine-tunes extremely well thanks to its superior pre-training.
✓ Highest accuracy among the three
✗ Largest compute requirement
📈
Choose ALBERT When…
Memory is the constraint
You need to deploy on edge devices, mobile, or extremely memory-constrained environments. ALBERT-base stores only 12M parameters — less than 50MB. Excellent when you need competitive accuracy with a tiny model footprint. Note: inference is not faster, just leaner.
✓ Smallest parameter count by far
✗ Not faster at inference time
Choose DistilBERT When…
Speed is the constraint
You need fast inference in production. REST APIs, real-time chatbots, text classification pipelines processing thousands of requests per second. DistilBERT is genuinely 1.6–2× faster than BERT/RoBERTa and fits comfortably on a single small GPU or even CPU in production.
✓ Fastest inference of the three
✗ Lowest accuracy ceiling

Section 07

Fine-Tuning RoBERTa for Text Classification

All three models share the same fine-tuning paradigm via HuggingFace Transformers. We'll fine-tune RoBERTa on sentiment classification (SST-2) as our first example.

📦
Install Requirements

Run pip install transformers datasets accelerate scikit-learn torch before proceeding. All examples use HuggingFace Transformers 4.x and PyTorch.

# ── RoBERTa Fine-tuning for Sentiment Classification (SST-2) ──
from transformers import (
    RobertaTokenizer,
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments
)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# ── 1. Load data ───────────────────────────────────────────────
dataset = load_dataset("glue", "sst2")   # SST-2 as distributed with the GLUE benchmark
tokeniser = RobertaTokenizer.from_pretrained("roberta-base")

# ── 2. Tokenise ────────────────────────────────────────────────
def tokenise(batch):
    return tokeniser(
        batch["sentence"],
        truncation=True,
        padding="max_length",
        max_length=128
    )

tokenised = dataset.map(tokenise, batched=True)
tokenised = tokenised.rename_column("label", "labels")
tokenised.set_format("torch",
    columns=["input_ids", "attention_mask", "labels"])

# ── 3. Load pre-trained model ──────────────────────────────────
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=2     # binary: positive / negative
)

# ── 4. Metrics ─────────────────────────────────────────────────
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1":       f1_score(labels, preds, average="binary")
    }

# ── 5. Training arguments ──────────────────────────────────────
import torch  # used only to check whether a CUDA GPU is available

args = TrainingArguments(
    output_dir="./roberta-sst2",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    evaluation_strategy="epoch",     # named eval_strategy in newer Transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=torch.cuda.is_available()   # mixed precision needs a CUDA GPU
)

# ── 6. Train ───────────────────────────────────────────────────
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["validation"],
    compute_metrics=compute_metrics
)

trainer.train()
results = trainer.evaluate()
print(f"Validation Accuracy: {results['eval_accuracy']:.4f}")
print(f"Validation F1:       {results['eval_f1']:.4f}")
OUTPUT
Epoch 1/3 - eval_accuracy: 0.9277 eval_f1: 0.9281
Epoch 2/3 - eval_accuracy: 0.9496 eval_f1: 0.9501
Epoch 3/3 - eval_accuracy: 0.9564 eval_f1: 0.9568
Validation Accuracy: 0.9564
Validation F1: 0.9568
RoBERTa Fine-tuning Tips

Learning rate of 1e-5 to 3e-5 works well for most tasks. Use weight_decay=0.01 on all parameters except bias and LayerNorm (the Trainer applies this exclusion by default). Warmup for 6–10% of total steps prevents early instability. 3 epochs is typically enough; more than 5 often causes overfitting. On a CUDA GPU, fp16=True usually gives a substantial speed-up at little or no cost in accuracy.
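To see what a warmup of 6–10% means in practice, the sketch below computes a linear warmup followed by linear decay in plain Python. It is meant to mirror the shape of the default linear schedule, not to reproduce the library's exact step rounding; `linear_warmup_decay` is a hypothetical helper.

```python
def linear_warmup_decay(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp from 0 up to peak_lr over the warmup phase
        return peak_lr * step / max(1, warmup_steps)
    # Decay from peak_lr down to 0 over the remaining steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000
for step in (0, 50, 100, 550, 1000):
    print(f"step {step:>4}: lr = {linear_warmup_decay(step, total_steps):.2e}")
```

The small learning rates at the start keep the randomly initialised classification head from sending large gradients into the pre-trained layers.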


Section 08

Fine-Tuning ALBERT for Named Entity Recognition

ALBERT shines in memory-constrained settings. Here we fine-tune it on NER (token classification), labelling each token as PER, ORG, LOC, MISC, or O (not an entity).

# ── ALBERT Fine-tuning for NER (CoNLL-2003) ───────────────────
from transformers import (
    AlbertTokenizerFast,
    AlbertForTokenClassification,
    DataCollatorForTokenClassification,
    Trainer, TrainingArguments
)
from datasets import load_dataset
import numpy as np

dataset   = load_dataset("conll2003")
tokeniser = AlbertTokenizerFast.from_pretrained("albert-base-v2")

label_list = dataset["train"].features["ner_tags"].feature.names
# ['O','B-PER','I-PER','B-ORG','I-ORG','B-LOC','I-LOC','B-MISC','I-MISC']
num_labels = len(label_list)

# ── Align labels with subword tokens ──────────────────────────
def tokenise_and_align(examples):
    tok = tokeniser(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True
    )
    aligned_labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids   = tok.word_ids(batch_index=i)
        prev_word  = None
        label_ids  = []
        for wid in word_ids:
            if wid is None:
                label_ids.append(-100)   # ignore [CLS], [SEP]
            elif wid != prev_word:
                label_ids.append(label[wid])
            else:
                label_ids.append(-100)   # ignore sub-tokens
            prev_word = wid
        aligned_labels.append(label_ids)
    tok["labels"] = aligned_labels
    return tok

tokenised  = dataset.map(tokenise_and_align, batched=True)
collator   = DataCollatorForTokenClassification(tokeniser)

model = AlbertForTokenClassification.from_pretrained(
    "albert-base-v2",
    num_labels=num_labels
)

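import torch  # used only to check whether a CUDA GPU is available

args = TrainingArguments(
    output_dir="./albert-ner",
    num_train_epochs=5,
    per_device_train_batch_size=32,   # large batch: ALBERT's weights take little memory
    learning_rate=3e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",      # named eval_strategy in newer Transformers releases
    fp16=torch.cuda.is_available()    # mixed precision needs a CUDA GPU
)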

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["validation"],
    data_collator=collator,
    tokenizer=tokeniser
)
trainer.train()
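OUTPUT
Epoch 1 - eval_loss: 0.1834
Epoch 3 - eval_loss: 0.0912
Epoch 5 - eval_loss: 0.0741
F1 Score on CoNLL-2003 validation: 0.8923
(The script above reports eval_loss only; the F1 line requires passing a compute_metrics function, for example an entity-level F1 from seqeval, to the Trainer.)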
📌
Label Alignment — The Critical NER Detail

Subword tokenisers split words into pieces: with BERT's WordPiece, "Washington" might become ["Wash", "##ington"]; ALBERT's SentencePiece tokeniser marks word starts with ▁ instead of marking continuations with ##. Your original label applies to the whole word (B-LOC). Only the first subword of each word gets the true label; the remaining subwords are assigned -100 so the loss function ignores them. The is_split_into_words=True flag and word_ids() method handle this.
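The alignment rule can be tested in isolation, without loading a tokeniser. In the sketch below the `word_ids` list is written by hand to imitate what a fast tokeniser's `word_ids()` would return for a three-word sentence; the subword split and the helper name `align_labels` are assumptions for illustration.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first subword; ignore everything else."""
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None or wid == previous:
            # Special token, or a continuation subword of the same word
            aligned.append(ignore_index)
        else:
            # First subword of a new word carries the word-level label
            aligned.append(word_labels[wid])
        previous = wid
    return aligned


# "Washington is big": special tokens at both ends, "Washington" split in two
word_ids = [None, 0, 0, 1, 2, None]
word_labels = [5, 0, 0]    # B-LOC, O, O using the CoNLL-2003 label ids
print(align_labels(word_ids, word_labels))
# [-100, 5, -100, 0, 0, -100]
```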


Section 09

Fine-Tuning DistilBERT for Question Answering

DistilBERT's strength is production speed. Here we fine-tune it on extractive QA (SQuAD-style) — given a passage and a question, find the answer span within the passage.

# ── DistilBERT Fine-tuning for Extractive QA (SQuAD v1) ───────
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForQuestionAnswering
)
from datasets import load_dataset
import torch

dataset   = load_dataset("squad", split="train[:5000]")
tokeniser = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

# ── Tokenise + find start/end positions ───────────────────────
def preprocess(examples):
    inputs = tokeniser(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length"
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map     = inputs.pop("overflow_to_sample_mapping")
    answers        = examples["answers"]
    start_positions, end_positions = [], []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer     = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char   = start_char + len(answer["text"][0])

        seq_ids = inputs.sequence_ids(i)
        # Find context token range
        ctx_start = next(j for j, s in enumerate(seq_ids) if s == 1)
        ctx_end   = next(j for j in range(len(seq_ids)-1, -1, -1) if seq_ids[j] == 1)

        if offset[ctx_start][0] > start_char or offset[ctx_end][1] < end_char:
            start_positions.append(0); end_positions.append(0)
        else:
            start_positions.append(next(
                j for j in range(ctx_start, ctx_end+1)
                if offset[j][0] <= start_char < offset[j][1] or offset[j][0] == start_char
            ))
            end_positions.append(next(
                j for j in range(ctx_end, ctx_start-1, -1)
                if offset[j][1] >= end_char
            ))

    inputs["start_positions"] = start_positions
    inputs["end_positions"]   = end_positions
    return inputs

tokenised = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)
model     = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")

# ── Training ───────────────────────────────────────────────────
from transformers import Trainer, TrainingArguments
args = TrainingArguments(
    output_dir="./distilbert-squad",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    fp16=torch.cuda.is_available()   # mixed precision needs a CUDA GPU
)
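trainer = Trainer(model=model, args=args, train_dataset=tokenised)
trainer.train()
trainer.save_model("./distilbert-squad")   # write the final weights where the pipeline below loads them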

# ── Inference ──────────────────────────────────────────────────
from transformers import pipeline

qa_pipe = pipeline(
    "question-answering",
    model="./distilbert-squad",
    tokenizer=tokeniser
)
result = qa_pipe({
    "question": "Where did the Battle of Hastings take place?",
    "context":  "The Battle of Hastings was fought on 14 October 1066 near Hastings, East Sussex, England between the Norman-French army of William the Conqueror and the English army of King Harold II."
})
print(f"Answer:     {result['answer']}")
print(f"Confidence: {result['score']:.4f}")
OUTPUT
Answer: near Hastings, East Sussex, England
Confidence: 0.9142
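The core of the preprocessing step is mapping an answer's character span onto token indices. The sketch below shows that mapping on its own, with whitespace splitting standing in for a real tokeniser's `offset_mapping`; the helper name and the example sentence are assumptions for illustration.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """offsets: (start, end) character positions for each context token."""
    # First token whose character range contains the answer's first character
    start_tok = next(i for i, (s, e) in enumerate(offsets) if s <= start_char < e)
    # Last token whose character range contains the answer's last character
    end_tok = next(i for i in range(len(offsets) - 1, -1, -1)
                   if offsets[i][0] < end_char <= offsets[i][1])
    return start_tok, end_tok


context = "fought near Hastings in 1066"

# Build character offsets for whitespace-separated tokens
offsets, pos = [], 0
for word in context.split():
    start = context.index(word, pos)
    offsets.append((start, start + len(word)))
    pos = start + len(word)

answer = "near Hastings"
start_char = context.index(answer)
end_char = start_char + len(answer)

span = char_span_to_token_span(offsets, start_char, end_char)
print(span, context.split()[span[0]:span[1] + 1])
```

The model is trained to predict exactly these two indices, one classification over start positions and one over end positions.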

Section 10

Advanced — Building Your Own Distillation Pipeline

You are not limited to the pre-distilled DistilBERT. You can distil from any fine-tuned teacher — for example, a RoBERTa teacher fine-tuned on your domain data, distilled into a smaller student model for production deployment.

# ── Custom Knowledge Distillation: RoBERTa → Small Student ────
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import (
    RobertaForSequenceClassification,
    DistilBertForSequenceClassification,
    RobertaTokenizer
)

# Assume teacher is already fine-tuned on your task
teacher = RobertaForSequenceClassification.from_pretrained("./roberta-finetuned")
student = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
teacher.eval()   # teacher is frozen during distillation

class DistillationLoss(nn.Module):
    def __init__(self, temperature=4.0, alpha=0.5):
        super().__init__()
        self.T     = temperature   # soften probability distributions
        self.alpha = alpha         # weight: distil vs hard label loss

    def forward(self, student_logits, teacher_logits, true_labels):
        # Soft distillation loss (KL divergence)
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / self.T, dim=-1),
            F.softmax(teacher_logits / self.T, dim=-1),
            reduction="batchmean"
        ) * (self.T ** 2)

        # Hard label (cross-entropy) loss
        hard_loss = F.cross_entropy(student_logits, true_labels)

        # Combined loss: alpha controls the trade-off
        return self.alpha * soft_loss + (1 - self.alpha) * hard_loss

loss_fn   = DistillationLoss(temperature=4.0, alpha=0.7)
optimiser = torch.optim.AdamW(student.parameters(), lr=3e-5)

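# ── Training loop ──────────────────────────────────────────────
# RoBERTa and DistilBERT use different vocabularies, so each model gets
# input tokenised by its own tokeniser. Only the class logits are compared.
# Assumes train_loader yields {"text": list of str, "labels": LongTensor}.
from transformers import AutoTokenizer
teacher_tok = AutoTokenizer.from_pretrained("./roberta-finetuned")   # assumes the tokeniser was saved with the teacher
student_tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

student.train()
for epoch in range(3):
    for batch in train_loader:
        texts  = list(batch["text"])
        labels = batch["labels"]

        teacher_in = teacher_tok(texts, padding=True, truncation=True, return_tensors="pt")
        student_in = student_tok(texts, padding=True, truncation=True, return_tensors="pt")

        # Teacher forward pass (no gradient)
        with torch.no_grad():
            teacher_out = teacher(**teacher_in)

        # Student forward pass
        student_out = student(**student_in)

        loss = loss_fn(student_out.logits, teacher_out.logits, labels)

        optimiser.zero_grad()
        loss.backward()
        optimiser.step()

    # Loss of the last batch in the epoch
    print(f"Epoch {epoch+1}: loss = {loss.item():.4f}")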
🎯
Temperature — The Secret Dial of Distillation

Temperature T controls how "soft" the teacher's distributions are. At T=1, you get the model's ordinary softmax distribution. At T=4, the probabilities are flattened, so the relative scores of the wrong classes become visible to the student. A higher T gives a smoother learning signal, but too high a value washes out the differences between classes. The soft loss is multiplied by T² so that its gradients stay on a comparable scale as T changes. Typical values: 3.0 to 6.0. Start at 4.0.
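The flattening effect is easy to see numerically. The sketch below applies a temperature-scaled softmax to one set of toy logits; the logit values are made up for illustration.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T, computed stably."""
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 0.5]   # toy teacher scores for three classes
for T in (1.0, 4.0, 20.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T:>4}: " + ", ".join(f"{p:.3f}" for p in probs))
```

As T grows, the top class loses probability mass to the others and the distribution approaches uniform, which is why very high temperatures stop being informative.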


Section 11

When Things Go Wrong — Common Mistakes and Fixes

🔴
Catastrophic Forgetting
Symptom: train acc 99%, val acc 55%
Learning rate too high. The model overwrites its pre-trained knowledge. Fix: Use learning rates between 1e-5 and 5e-5. Apply differential learning rates — lower for early layers, higher for the classification head. Add a warmup phase.
✗ Most common fine-tuning failure
🔴
Wrong Tokeniser for Model
Symptom: poor performance immediately
RoBERTa uses byte-level BPE. ALBERT uses SentencePiece. DistilBERT uses WordPiece. Never use a BERT tokeniser with RoBERTa. Fix: Always load tokeniser and model from the same checkpoint string. Use AutoTokenizer.from_pretrained(model_name).
✗ Silent failure — loss looks normal
🔴
ALBERT's Slow Inference Surprise
Symptom: ALBERT is slower than expected
ALBERT-large has 24 layers: far fewer parameters than BERT-large but the same depth. Each forward pass still runs all 24 layers through the shared weights, so compute is unchanged. Fix: Use ALBERT for memory-constrained environments, not latency-constrained ones. For speed, use DistilBERT.
✗ Common misconception
| Problem | Likely Cause | Fix |
| --- | --- | --- |
| Validation loss spikes at epoch 1 | LR too high | Reduce to 1e-5, add warmup_ratio=0.1 |
| Loss NaN on RoBERTa | fp16 + large logits | Keep gradient clipping on (max_grad_norm=1.0 in TrainingArguments, the default), lower the learning rate, or switch to bf16 or fp32 |
| ALBERT labels wrong in NER | Subword misalignment | Use is_split_into_words=True + word_ids() |
| DistilBERT accuracy 3% below BERT | Expected; this is a known trade-off | Accept the trade-off or try distilroberta-base |
| Out-of-memory on ALBERT-xxlarge | Hidden size 4096 is very wide | Reduce batch size to 4 or 8, use gradient accumulation |
| RoBERTa tokeniser adds Ġ prefix | Byte-level BPE encoding | Normal: Ġ marks a leading space, and tokeniser.decode() turns it back into one |
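The differential learning rates mentioned under Catastrophic Forgetting are often implemented as layer-wise decay: the classification head gets the highest rate and each layer below it gets a slightly smaller one. The sketch below only computes the rates; the decay factor of 0.9, the group names, and the helper itself are assumptions for illustration, and mapping them onto optimiser parameter groups is left out.

```python
def layerwise_learning_rates(num_layers, head_lr=5e-5, decay=0.9):
    """Return {group_name: lr}; lower layers get geometrically smaller rates."""
    rates = {"classifier": head_lr}
    # Walk from the top encoder layer down to the bottom one
    for depth, layer in enumerate(range(num_layers - 1, -1, -1), start=1):
        rates[f"layer_{layer}"] = head_lr * decay ** depth
    # The embeddings sit below every encoder layer
    rates["embeddings"] = head_lr * decay ** (num_layers + 1)
    return rates


for name, lr in layerwise_learning_rates(num_layers=12).items():
    print(f"{name:<11} {lr:.2e}")
```

Each name would correspond to one parameter group passed to the optimiser, so early layers, which hold the most general features, move the least.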

Section 12

Production Inference — Loading and Using All Three Models

# ── Unified Inference Examples — All Three Models ─────────────
from transformers import pipeline

# ── RoBERTa: Sentiment Classification ──────────────────────────
roberta_pipe = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest"
)
print(roberta_pipe("This model is absolutely brilliant!"))
# [{'label': 'positive', 'score': 0.9876}]

# ── ALBERT: Zero-Shot Classification ───────────────────────────
albert_pipe = pipeline(
    "zero-shot-classification",
    model="cross-encoder/nli-albert-small-v2"
)
result = albert_pipe(
    "The Federal Reserve raised interest rates by 25 basis points.",
    candidate_labels=["finance", "sports", "politics", "technology"]
)
print(f"Label: {result['labels'][0]}, Score: {result['scores'][0]:.4f}")
# Label: finance, Score: 0.9543

# ── DistilBERT: Fast Feature Extraction ────────────────────────
distil_pipe = pipeline(
    "feature-extraction",
    model="distilbert-base-uncased",
    return_tensors=True
)
# Get a sentence embedding by mean-pooling the token vectors
import torch
outputs = distil_pipe("Knowledge distillation compresses large models.")
hidden = torch.as_tensor(outputs)                  # [1, seq_len, 768]
hidden = hidden.reshape(-1, hidden.shape[-1])      # [seq_len, 768]
embedding = hidden.mean(dim=0, keepdim=True)       # average over tokens: [1, 768]
print(f"Sentence embedding shape: {embedding.shape}")
# Sentence embedding shape: torch.Size([1, 768])

# ── Benchmark Inference Speed ──────────────────────────────────
import time
texts = ["This is a test sentence for benchmarking."] * 100

for name, pipe in [("RoBERTa", roberta_pipe), ("DistilBERT", distil_pipe)]:
    start = time.time()
    _ = [pipe(t) for t in texts]
    elapsed = time.time() - start
    print(f"{name}: {elapsed:.2f}s for 100 sentences ({elapsed*10:.0f}ms/sentence)")
OUTPUT
RoBERTa: 8.42s for 100 sentences (84ms/sentence)
DistilBERT: 4.97s for 100 sentences (50ms/sentence)
→ DistilBERT is 1.69× faster on CPU inference
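Single-run timings like the ones above are noisy. A slightly more careful harness, sketched below, adds warmup calls and reports the median over several repeats. The `fake_model` function is a stand-in so the example runs without downloading anything; in practice you would pass a pipeline object instead.

```python
import statistics
import time


def benchmark(fn, inputs, warmup=3, repeats=5):
    """Median latency in milliseconds per input over several timed passes."""
    # Warmup calls absorb one-off costs such as lazy initialisation
    for x in inputs[:warmup]:
        fn(x)
    per_input = []
    for _ in range(repeats):
        start = time.perf_counter()
        for x in inputs:
            fn(x)
        per_input.append((time.perf_counter() - start) / len(inputs))
    return statistics.median(per_input) * 1000


def fake_model(text):
    # Stand-in for a real pipeline call
    return sum(ord(ch) for ch in text)


texts = ["This is a test sentence for benchmarking."] * 100
latency_ms = benchmark(fake_model, texts)
print(f"median latency: {latency_ms:.4f} ms/sentence")
```

For a fair comparison, both models should run the same task on the same hardware with the same inputs.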

Section 13

Golden Rules — Non-Negotiable Practices

🧠 RoBERTa / ALBERT / DistilBERT — Production Rules
1
Always load tokeniser and model from the same checkpoint. Never use a BERT tokeniser with RoBERTa (different vocabulary, different special tokens). Use AutoTokenizer.from_pretrained(model_name) and let HuggingFace pick the right one.
2
Fine-tune with a learning rate between 1e-5 and 5e-5. Anything above 1e-4 is likely to overwrite pre-trained representations. Add a warmup phase of 6–10% of total training steps to stabilise early training.
3
For NER and token classification, always handle subword label alignment. Assign label -100 to all subword tokens except the first — these are ignored by the loss function. Failing to do this will corrupt your training signal silently.
4
Use RoBERTa as your default accuracy-first baseline, not BERT. RoBERTa is better on virtually all benchmarks at the same architecture cost. There is almost no reason to use BERT-base over RoBERTa-base in new projects.
5
Do not confuse ALBERT's small parameter count with fast inference. ALBERT-large has 24 layers, and every forward pass still computes all 24 of them through the shared weights. It's memory-efficient, not compute-efficient. Use it for edge deployment, not low-latency APIs.
6
For production APIs handling >100 req/s, use DistilBERT or a quantised model. Even on a GPU, RoBERTa at full precision adds latency that compounds at scale. For CPU serving, DistilBERT + dynamic quantisation (torch.quantization.quantize_dynamic) can push throughput another 1.5–2×, usually with only a small accuracy loss; measure both on your own data.
7
Evaluate on your actual task data, not just benchmark scores. RoBERTa wins on GLUE. DistilBERT wins on short-text classification. ALBERT-xxlarge wins on SQuAD. None of this predicts your domain performance. Always fine-tune and evaluate on held-out in-domain data before choosing a model.

Section 14

Final Decision Framework

| Scenario | Best Choice | Why |
| --- | --- | --- |
| NLP research / competition | RoBERTa-large | Highest accuracy ceiling, best fine-tuning stability |
| Production REST API (<100ms SLA) | DistilBERT | 1.7× faster inference, about 40% fewer parameters |
| Mobile / edge deployment | ALBERT-base | Only 12M params, <50MB model file |
| Domain-specific NLP (medical, legal) | RoBERTa + domain pre-training | Strong base + further pre-train on domain text |
| Few labelled examples (<1,000) | RoBERTa-large | Better representations = better few-shot fine-tuning |
| Multilingual tasks | XLM-RoBERTa | RoBERTa trained on 100 languages |
| Semantic similarity / embeddings | DistilBERT (Sentence-Transformers) | Fast, good embeddings, huge ecosystem |
| Memory budget <64MB | ALBERT-base | Smallest parameter count of the three models compared here |
🏆
The Practitioner's Final Word

Start every new NLP project with DistilBERT as your baseline — it is fast, cheap to fine-tune, and surprisingly strong. If you need higher accuracy, upgrade to RoBERTa-base. If memory is the constraint, switch to ALBERT-base. Only reach for the "large" variants when you have a GPU budget and a well-curated dataset — large models need large data to realise their potential. The three models together give you coverage of virtually every real-world NLP deployment scenario.
