Transfer Learning in NLP

Section 01

The Story That Explains Transfer Learning

📖 Real-World Analogy

The Expert Doctor Who Moves Hospitals

Imagine a doctor who has spent 10 years mastering internal medicine at a major research hospital. She knows how to read ECGs, interpret bloodwork, recognise inflammation patterns, and understand how drugs interact. One day she transfers to a small rural clinic that specialises in treating rare tropical diseases.

Does she start learning medicine from scratch? Absolutely not. She transfers everything she already knows — anatomy, pharmacology, diagnostic reasoning — and only learns the new, clinic-specific things on top. Within weeks she is effective. A fresh medical graduate doing the same job would take years.

This is the exact idea behind Transfer Learning in NLP. Instead of training a language model from scratch on your small dataset, you take a massive model that already understands language deeply, and you teach it just your specific task.

Before transfer learning took over NLP around 2018, most new tasks meant training a model from scratch. You needed large labelled datasets and substantial GPU time for each project, and many organisations simply could not afford it. Transfer learning changed all of that within a few years.

💡

What Transfer Learning Means

Transfer learning is the practice of taking a model trained on a large general task (e.g. predicting the next word in billions of sentences) and adapting it to a smaller specific task (e.g. classifying customer complaints). The knowledge from pretraining — grammar, semantics, world facts, reasoning patterns — carries over for free.

Section 02

Why NLP Needed Transfer Learning So Badly

Language is brutally hard to learn from nothing. Consider what a model must know just to understand one sentence: "The bank refused the loan because it was insolvent." It needs to know that "bank" here means a financial institution (not a riverbank), that "it" refers to the bank (not the loan), and that "insolvent" means unable to pay debts. That one sentence requires finance knowledge, co-reference resolution, and vocabulary depth.

😱 What Language Understanding Actually Requires

Syntax

Grammar structure, part-of-speech, dependency trees — the skeleton of language

Semantics

Word meaning, word relationships, polysemy — what words mean in context

Pragmatics

Implied meaning, sarcasm, tone — what is meant beyond what is said

World Knowledge

Facts, entities, events, cause-effect — understanding the world being described

Co-reference

Tracking who "it", "they", "she" refer to across long passages of text

Building all of that from 10,000 labelled sentiment examples is impossible. But a model pre-trained on 800 GB of text has already internalised most of it. Your job becomes dramatically simpler.

⚠️

The Pre-Transfer Learning Era Was Painful

Before 2018, state-of-the-art NLP systems were often task-specific pipelines: separate tokenisers, separate POS taggers, separate embeddings, all glued together. They were fragile, expensive to build, and generalised poorly. Even word2vec and GloVe embeddings were context-free: the word "bank" had one fixed vector regardless of its meaning in context. Contextual pretrained models addressed this limitation directly.

Section 03

The Evolution: Word2Vec → ELMo → BERT → GPT

Word2Vec / GloVe (2013–2014) — Static Embeddings

The first real transfer learning in NLP. Words are mapped to dense vectors so that king − man + woman ≈ queen. A huge leap — but every word has one fixed vector. "Bank" in "river bank" and "bank account" gets the same representation. Context is lost entirely.
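To make the limitation concrete, here is a minimal sketch of a static embedding table. The 3-dimensional vectors are invented for illustration and chosen so that the analogy works out (real Word2Vec vectors are learned from data and have hundreds of dimensions), and `embed` and `cosine` are hypothetical helper names.

```python
import math

# Toy static embedding table. The numbers are made up for illustration only.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "bank":  [0.5, 0.5, 0.5],
}

def embed(word, context=None):
    """Static lookup: the context argument is accepted but cannot change the result."""
    return EMBEDDINGS[word]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Analogy by vector arithmetic: king - man + woman
target = [k - m + w for k, m, w in zip(embed("king"), embed("man"), embed("woman"))]
candidates = [w for w in EMBEDDINGS if w not in ("king", "man", "woman")]
nearest = max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))
print("king - man + woman is closest to:", nearest)

# The same vector comes back no matter which sense of "bank" is meant
same_vector = embed("bank", "river bank") == embed("bank", "bank account")
print("Identical vector for both senses of 'bank':", same_vector)
```

The second check is the whole problem in one line: a lookup table has no way to use the surrounding words.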

ELMo (2018) — Contextual Embeddings

ELMo (Embeddings from Language Models) used bidirectional LSTMs trained on large corpora. Now "bank" in different sentences gets different vectors depending on surrounding context. A learned, task-specific weighted combination of the network's layer states was fed to downstream models as features, which gave a marked improvement across many NLP tasks.

BERT (2018) — Bidirectional Transformer Pretraining

Google's BERT used the Transformer architecture with a masked language modelling objective — predict a randomly hidden word using all surrounding words. The bidirectional attention meant every word could attend to every other word in one pass. Fine-tuning BERT on downstream tasks shattered benchmarks across 11 NLP tasks at once.

GPT Series (2018–2024) — Autoregressive Pretraining

OpenAI's GPT models took a different path: train a left-to-right (autoregressive) Transformer to predict the next token. While BERT excels at understanding tasks, GPT-style models excel at generation. GPT-3 (175 billion parameters) showed remarkable few-shot transfer: describe a task in the prompt, optionally with a few examples, and the model performs it without any gradient updates.

Modern Era — RoBERTa, T5, LLaMA, Mistral (2019–2025)

Refinements piled up: RoBERTa (better BERT training), T5 (unified text-to-text framework), LLaMA and Mistral (efficient open-source LLMs). Parameter-efficient fine-tuning methods like LoRA, Adapters, and Prefix Tuning emerged — enabling transfer learning on consumer hardware.

Section 04

How Pretraining Works: The Three Objectives

All transfer learning begins with pretraining: training a model on a massive self-supervised task so it learns language structure without needing human labels. Three pretraining objectives dominate the field.

🎭

Masked Language Modelling (MLM)

Used by BERT, RoBERTa, DistilBERT

Randomly mask 15% of tokens in a sentence. The model must predict each masked token from the surrounding context in both directions. Example: "The cat sat on the [MASK]." → model predicts mat. This builds deep bidirectional representations.
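The masking step can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual data pipeline: `mask_tokens` is a hypothetical helper, whitespace splitting stands in for a real subword tokeniser, and the label value -100 is borrowed from the common convention for positions the loss should ignore.

```python
import random

MASK, IGNORE = "[MASK]", -100  # -100 marks positions the loss should skip

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for one sentence.

    Simplification: every selected token becomes [MASK]. BERT itself replaces
    only 80% of selected tokens with [MASK], swaps 10% for a random token and
    leaves 10% unchanged.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else IGNORE for i, tok in enumerate(tokens)]
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

Only the masked positions carry a label, so the loss is computed on roughly 15% of the tokens in each batch.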

📄

Causal / Autoregressive LM

Used by GPT, LLaMA, Mistral, Falcon

Train the model to predict the next token given all previous tokens. "The cat sat on the" → predict mat. This is left-to-right only but scales to enormous sizes and produces outstanding text generation ability.
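As a rough sketch of what "predict the next token" means as training data, the snippet below expands one sentence into (context, target) pairs and builds the causal mask that stops a position from seeing its future. `next_token_pairs` and `causal_mask` are hypothetical names, and whitespace splitting stands in for a real tokeniser.

```python
def next_token_pairs(tokens):
    """Turn one sequence into (context, target) training pairs, left to right."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def causal_mask(n):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(" ".join(context), "->", target)

mask = causal_mask(len(tokens))
```

In practice a Transformer does not loop over these pairs. It processes the whole sequence once, and the causal mask ensures each position's prediction depends only on earlier tokens.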

🔌

Sequence-to-Sequence (Span Corruption)

Used by T5, BART, mT5

Corrupt the input by masking or scrambling spans of text, then train the decoder to reconstruct the original. This teaches the model to understand and generate in one unified framework, making it excellent for translation, summarisation, and QA.
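A minimal sketch of T5-style span corruption follows. It assumes the spans to drop are given explicitly (real pipelines sample them at random), and it uses the `<extra_id_N>` sentinel naming from the T5 tokeniser. `corrupt_spans` is a hypothetical helper, and BART's noising scheme differs in its details.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the decoder target.

    spans must be sorted, non-overlapping, end-exclusive index pairs.
    """
    corrupted, target, cursor = [], [], 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        corrupted += tokens[cursor:start] + [sentinel]
        target += [sentinel] + tokens[start:end]
        cursor = end
    corrupted += tokens[cursor:]
    target.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return corrupted, target

tokens = "Thank you for inviting me to your party last week".split()
corrupted, target = corrupt_spans(tokens, [(2, 4), (8, 9)])
print("encoder input :", " ".join(corrupted))
print("decoder target:", " ".join(target))
```

The encoder sees the sentence with holes, and the decoder only has to produce the missing spans, which keeps target sequences short.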

🔑

Why Self-Supervised Pretraining Is So Powerful

The training signal — predicting masked or next tokens — requires no human labels. The model learns from the structure of language itself. By training on Wikipedia, Books Corpus, Common Crawl, and Code, the model absorbs an enormous amount of implicit world knowledge, grammatical structure, reasoning patterns, and factual associations. Your fine-tuning dataset, even at 500 labelled examples, can then leverage all of this for free.

Section 05

The Fine-Tuning Paradigm — Adapting a Pretrained Model

📖 Story

Teaching a Polyglot a New Dialect

Picture someone who speaks ten languages fluently. You ask them to learn a new Creole dialect spoken by 5,000 people on a remote island. They don't start from zero — they recognise patterns from French, Spanish, and Portuguese already embedded in the Creole. In a week they're conversational. A monolingual person learning this same dialect from scratch would take years.

Fine-tuning is exactly this. BERT already speaks "the language of text" fluently. You are just teaching it a specific dialect — your domain, your task, your labels.

Fine-tuning means taking a pretrained model and continuing to train it on your labelled task-specific dataset, but with a much lower learning rate. We don't want to destroy the knowledge gained during pretraining; we want to gently steer the model toward the new task.

⚙️ Full Fine-Tuning Pipeline — Text Classification Example

Step 1

Load a pretrained model (e.g. bert-base-uncased) and its tokeniser

Step 2

Add a task-specific head on top (e.g. a linear layer over the [CLS] token for classification)

Step 3

Tokenise your labelled dataset using the same tokeniser used in pretraining

Step 4

Train with a small learning rate (1e-5 to 5e-5), typically 3–5 epochs

Step 5

Evaluate on a held-out validation set; save the checkpoint with best performance

Section 06

BERT Architecture — A Visual Tour

Understanding BERT's architecture tells you why it transfers so well. Every layer learns increasingly abstract language features — from token-level spelling at the bottom to semantic reasoning at the top.

🌐 BERT Architecture — Encoder Stack

BERT-base has 12 layers, a hidden size of 768, 12 attention heads, and about 110M parameters. BERT-large has 24 layers, a hidden size of 1024, 16 attention heads, and about 340M parameters.
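The rounded parameter counts can be reproduced from the layer and hidden sizes. The sketch below counts the encoder with its pooler and no task head. The vocabulary size (30,522), maximum position (512), and feed-forward width (4 × hidden) are assumptions taken from the bert-base-uncased configuration rather than from the text above, and `bert_param_count` is a hypothetical helper.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, type_vocab=2):
    """Count parameters of a BERT encoder (with pooler, without any task head).

    Assumes the feed-forward width is 4 * hidden, as in BERT-base and BERT-large.
    """
    ffn = 4 * hidden
    layer_norm = 2 * hidden                       # scale + bias
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)    # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT-base:  {base:,}")
print(f"BERT-large: {large:,}")
```

Most of the weights sit in the stacked encoder layers. The embedding table accounts for roughly a fifth of BERT-base and a smaller share of BERT-large.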

Section 07

Six Strategies for Adapting a Pretrained Model

Not all transfer learning looks the same. Depending on your dataset size, your compute budget, and how different your task is from the original pretraining, you'll choose among six strategies.

❌

Feature Extraction (Frozen)

Freeze all pretrained weights completely. Use the model purely as a feature extractor: pass text through it, take the output representations, and train a simple classifier (e.g. logistic regression) on top. Fastest and lowest compute, but the representations cannot adapt to your task, so accuracy usually trails full fine-tuning.

Best when: tiny dataset (<500 examples), fast prototyping

🔥

Full Fine-Tuning

Unfreeze all layers and train the entire network on your labelled data with a very small learning rate (1e-5 to 5e-5). The pretrained weights shift slightly toward your task. Achieves the highest accuracy but requires a moderate dataset size and GPU memory.

Best when: 1,000–100,000 examples, downstream task has enough data

🏭

Gradual Unfreezing

Start by training only the task head. Then progressively unfreeze layers from top to bottom, fine-tuning each layer before moving to the next. Prevents catastrophic forgetting of lower-level linguistic knowledge while allowing upper layers to adapt fully.

Best when: small-medium data, domain shift is significant
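One way to picture gradual unfreezing is as a schedule of which blocks are trainable at each epoch. The sketch below is framework-free and makes one simplifying assumption: exactly one extra encoder layer is unfrozen per epoch. `unfreeze_schedule` is a hypothetical helper.

```python
def unfreeze_schedule(num_layers, epochs):
    """Map each epoch to the trainable blocks, unfreezing from the top down.

    Epoch 0 trains only the task head; each later epoch unfreezes one more
    encoder layer, starting from the top (layer num_layers - 1).
    """
    schedule = []
    for epoch in range(epochs):
        n_unfrozen = min(epoch, num_layers)
        layers = list(range(num_layers - n_unfrozen, num_layers))
        schedule.append({"epoch": epoch, "head": True, "encoder_layers": layers})
    return schedule

schedule = unfreeze_schedule(num_layers=12, epochs=4)
for step in schedule:
    print(step)
```

In PyTorch the schedule would typically be applied by toggling `requires_grad` on the parameters of the listed layers at the start of each epoch.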

🌟

Parameter-Efficient Fine-Tuning (PEFT)

Freeze most of the model and add a small number of trainable parameters (Adapters, LoRA, Prefix Tuning). Only these new parameters are updated. You get near full-fine-tuning accuracy at a tiny fraction of the compute cost — often <1% of total parameters.

Best when: very large models (7B–70B), limited GPU budget

📄

Domain-Adaptive Pretraining (DAPT)

Continue pretraining (MLM) on your domain-specific unlabelled text before fine-tuning. For example, pretrain BERT further on 10 GB of medical literature before fine-tuning on clinical notes classification. Bridges the domain gap without labelled data.

Best when: highly specialised domain (legal, medical, code)

💡

Prompting / In-Context Learning

For large decoder-only models (GPT-4, LLaMA), no gradient updates at all. Format your task as a natural language prompt with a few examples, and the model generalises via its pretraining alone. Zero fine-tuning cost, but accuracy depends heavily on prompt quality.

Best when: LLM API access only, zero-shot or few-shot scenarios
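For prompting, the only "training" artefact is the prompt itself. Here is a minimal sketch of assembling a few-shot classification prompt. The `Text:` / `Label:` layout and the example messages are illustrative assumptions, not a format required by any particular model, and `build_few_shot_prompt` is a hypothetical helper.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Format labelled examples plus one unlabelled query as a single prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Text: {text}", f"Label: {label}", ""]
    lines += [f"Text: {query}", "Label:"]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify each customer message as COMPLAINT or PRAISE.",
    [
        ("The parcel arrived broken.", "COMPLAINT"),
        ("Support fixed my issue in minutes.", "PRAISE"),
    ],
    "I was charged twice for one order.",
)
print(prompt)
```

The prompt ends on an open `Label:` so that the model's next tokens are the prediction. Passing an empty example list turns the same function into a zero-shot prompt.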

Section 08

LoRA — Low-Rank Adaptation of Large Language Models

📖 Story

The Minimalist Surgeon

Instead of performing open-heart surgery to fix a minor valve problem, a skilled surgeon now makes a 2mm incision and inserts a tiny device that corrects the issue from the inside. The heart keeps beating normally — only the specific broken part is touched.

LoRA does the same to a large language model. Rather than updating all 7 billion weights (open-heart surgery), it adds two small matrices alongside selected weight matrices, typically the attention projections, and updates only those. The original weights never change. The result is a model that behaves much like a fully fine-tuned one, while training well under 1% of the parameters and using far less GPU memory.

The mathematical insight behind LoRA: weight updates during fine-tuning tend to be low-rank. Instead of updating the full weight matrix W (which could be 4096 × 4096 = 16.7 million numbers), you express the update as the product of two small matrices: ΔW = B·A, where B is 4096×r and A is r×4096, with r typically 4–64.

Original Forward Pass

h = W₀ · x

W₀ is the frozen pretrained weight. It never changes during LoRA fine-tuning.

LoRA Modified Forward Pass

h = W₀·x + (B·A)·x

A and B are small trainable matrices. At inference, B·A can be merged back into W₀ for zero latency cost.

Parameter Saving

r × (d_in + d_out) vs d_in × d_out

With rank r=8 and d_in = d_out = 4096: 8 × 8,192 = 65,536 parameters vs 16,777,216. That is a 256× reduction per adapted weight matrix.
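The arithmetic is easy to check, and to repeat for other ranks. This is a small sketch with hypothetical helper names `lora_params` and `full_params`.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters for one adapted weight matrix: r * (d_in + d_out)."""
    return r * (d_in + d_out)

def full_params(d_in, d_out):
    """Parameters in the full weight matrix."""
    return d_in * d_out

d, r = 4096, 8
lora, full = lora_params(d, d, r), full_params(d, d)
print(f"LoRA: {lora:,}  full: {full:,}  reduction: {full // lora}x")

# How the saving shrinks as the rank grows
for rank in (4, 16, 64):
    print(rank, full // lora_params(d, d, rank))
```

Doubling the rank doubles the adapter size, so the reduction factor halves each time.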

LoRA Scaling Factor

α / r

α (lora_alpha) controls how much the LoRA update contributes. Setting α = 2r is a safe default.
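Putting the forward pass and the scaling factor together, the sketch below checks on tiny hand-written matrices that the adapter path and the merged weight give the same output. The shapes and numbers are arbitrary illustrations, the helpers (`matmul`, `add`, `scale`) are hypothetical, and the α/r scaling is applied to B·A as described in the scaling-factor note above.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def scale(X, s):
    return [[s * a for a in row] for row in X]

# Tiny illustrative shapes: d_out = d_in = 3, rank r = 1, alpha = 2
W0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen weight
B = [[1.0], [0.0], [2.0]]      # d_out x r, trainable
A = [[0.5, 0.0, 1.0]]          # r x d_in, trainable
alpha, r = 2.0, 1
x = [[1.0], [2.0], [3.0]]      # input as a column vector

delta = scale(matmul(B, A), alpha / r)   # scaled low-rank update

# Training-time path: frozen weight plus the low-rank branch
h_adapter = add(matmul(W0, x), matmul(delta, x))

# Inference-time path: fold the update into a single matrix first
W_merged = add(W0, delta)
h_merged = matmul(W_merged, x)

print(h_adapter, h_merged)
```

Because the two paths agree, a merged model needs no extra matrix multiplications at inference time. In the original LoRA formulation B starts at zero, so the adapted model initially behaves exactly like the pretrained one.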

🌟

LoRA in Practice — What It Enables

A 7B parameter model trained with LoRA (rank 16) adds roughly 33 million trainable parameters, less than 0.5% of the total; the exact count depends on which layers are adapted. You can fine-tune a Mistral-7B model in under 2 hours on a single A100 GPU with 40GB VRAM (a data-centre card, not a consumer one). Without LoRA, full fine-tuning typically requires 4–8 GPUs and 24+ hours. The accuracy difference is typically <1% on most tasks.

Section 09

What Each Transformer Layer Actually Learns

Probing experiments — training small classifiers on each layer's representations — reveal a consistent pattern across BERT and its descendants. This is why transfer learning works so powerfully.

Layer Range	What Is Learned	Fine-Tuning Implication
Layers 1–3	Surface features: spelling, morphology, part-of-speech, basic syntax	Rarely needs fine-tuning — these are universal across all tasks
Layers 4–6	Syntactic structure: dependency parsing, subject-verb agreement, phrase boundaries	Sometimes frozen in gradual unfreezing; very stable representations
Layers 7–9	Semantic roles, entity types, co-reference patterns	Partially fine-tuned; high transfer value for NER, RE tasks
Layers 10–12	Task-specific reasoning, discourse structure, sentence-level semantics	Most important to fine-tune — these layers adapt most to new tasks
Final Layer + Head	Classification logits or token-level predictions for your specific task	Always trained from random initialisation on your task

🔎

Probing Classifiers — How We Know This

Researchers train tiny classifiers (e.g. a single linear layer) on the frozen representation at each BERT layer and measure how well they can predict POS tags, NER labels, syntactic tree depth, etc. The pattern is remarkably consistent: lower layers encode form, higher layers encode meaning and reasoning. This validates the "hierarchical representation" theory of Transformer models.

Section 10

Code — Full Fine-Tuning with Hugging Face Transformers

The Hugging Face transformers library is the standard toolkit for transfer learning in NLP. Below is an end-to-end sentiment classification pipeline using bert-base-uncased on the IMDB movie review dataset. It is sized for a quick demo rather than production use.

# ── Install: pip install transformers datasets accelerate ─────────────

import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# ── 1. Load dataset ────────────────────────────────────────────────────
dataset = load_dataset("imdb")
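# Shrink to 2000 train / 500 test for a fast demo.
# Shuffle first: the IMDB splits are ordered by label, so the first rows are all one class.
dataset["train"] = dataset["train"].shuffle(seed=42).select(range(2000))
dataset["test"]  = dataset["test"].shuffle(seed=42).select(range(500))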

# ── 2. Load tokeniser ──────────────────────────────────────────────────
MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        max_length=256,       # IMDB reviews are long; truncate to save memory
        padding=False,        # dynamic padding via DataCollator
    )

tokenized = dataset.map(tokenize, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# ── 3. Load pretrained model and add classification head ───────────────
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,            # pos / neg
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)

# ── 4. Metrics function ─────────────────────────────────────────────────
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1":       f1_score(labels, preds, average="binary"),
    }

# ── 5. Training arguments ───────────────────────────────────────────────
args = TrainingArguments(
    output_dir="./bert-imdb",
    eval_strategy="epoch",   # named evaluation_strategy in transformers < 4.41
    save_strategy="epoch",
    learning_rate=2e-5,      # critical: much smaller than standard SGD
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    warmup_ratio=0.1,        # warm up LR for first 10% of steps
    fp16=torch.cuda.is_available(),  # mixed precision if GPU available
    logging_steps=50,
    seed=42,
)

# ── 6. Trainer ──────────────────────────────────────────────────────────
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
results = trainer.evaluate()
print(f"Final Test Accuracy: {results['eval_accuracy']:.4f}")
print(f"Final Test F1:       {results['eval_f1']:.4f}")

OUTPUT

Epoch 1/3 - eval_accuracy: 0.9120  eval_f1: 0.9118
Epoch 2/3 - eval_accuracy: 0.9280  eval_f1: 0.9275
Epoch 3/3 - eval_accuracy: 0.9340  eval_f1: 0.9338

Final Test Accuracy: 0.9340
Final Test F1:       0.9338

Training time (A100 GPU): ~4 minutes for 2000 examples
Baseline (logistic regression on TF-IDF): 0.8760 accuracy

🎯

93.4% Accuracy with 2,000 Labelled Examples

Without transfer learning, getting above 87–88% on IMDB sentiment requires tens of thousands of examples and careful feature engineering. With BERT fine-tuning, we reach 93.4% on just 2,000 examples in 3 epochs. This is the power of transferred linguistic knowledge. A randomly initialised Transformer on 2,000 examples would barely reach 55%.

Section 11

Code — LoRA Fine-Tuning with PEFT for Large Models

When your model is 7B+ parameters, full fine-tuning is often impractical. LoRA via the Hugging Face peft library makes it manageable. The example below attaches LoRA adapters to a small GPT-2 for demonstration and inspects how few parameters are trainable. The same pattern carries over to LLaMA-3-8B or Mistral-7B once target_modules is changed to that model's layer names.

# ── Install: pip install transformers peft datasets ───────────────────

from peft import (
    LoraConfig,
    TaskType,
    get_peft_model,
    PeftModel,
)
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# ── Load base model ────────────────────────────────────────────────────
BASE_MODEL = "gpt2"         # swap with "meta-llama/Meta-Llama-3-8B" etc.
tokenizer  = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL,
    num_labels=2,
    torch_dtype=torch.float16,   # half precision to fit in GPU memory
)
model.config.pad_token_id = tokenizer.eos_token_id

# ── Define LoRA config ─────────────────────────────────────────────────
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,                 # rank — higher = more capacity, more params
    lora_alpha=32,        # scaling factor; alpha = 2*r is a good default
    lora_dropout=0.1,    # regularisation for LoRA layers
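    target_modules=["c_attn", "c_proj"],  # GPT-2 layer names; LLaMA/Mistral use q_proj, v_proj, etc.
    fan_in_fan_out=True,  # GPT-2 stores these weights in Conv1D layout (fan_in, fan_out)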
    bias="none",
)

# ── Wrap model with LoRA ───────────────────────────────────────────────
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()

# ── Inspect: how many parameters are trainable? ────────────────────────
total   = sum(p.numel() for p in peft_model.parameters())
trained = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
print(f"Trainable params: {trained:,} / {total:,}  ({100*trained/total:.2f}%)")

# ── Save and reload LoRA adapter (tiny file!) ──────────────────────────
peft_model.save_pretrained("./lora-adapter")
# Saved file is typically 5–50 MB even for 7B models
# Reload: peft_model = PeftModel.from_pretrained(base_model, "./lora-adapter")
# Optionally fold the adapter into the base weights: merged = peft_model.merge_and_unload()

OUTPUT

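trainable params: 294,912 || all params: 124,736,768 || trainable%: 0.2364
Trainable params: 294,912 / 124,736,768 (0.24%)
LoRA adapter saved size: ~1.1 MB (vs 498 MB for full model weights)
Full fine-tune GPU RAM needed: ~12 GB
LoRA fine-tune GPU RAM needed: ~4.5 GB (63% reduction)

Note: 294,912 corresponds to a rank-8 adapter on c_attn only (12 layers × 8 × (768 + 2,304)). With the rank-16 config above, which also matches both c_proj layers in every block, expect roughly 1.6 million LoRA parameters plus the classification head.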

Section 12

Common NLP Tasks — Which Pretrained Model to Choose

Different tasks have different ideal model architectures. Encoder-only models excel at understanding; decoder-only at generation; encoder-decoder at transformation tasks.

Task	Architecture	Recommended Models	Fine-Tune Head
Text Classification / Sentiment	Encoder-only	BERT, RoBERTa, DistilBERT	Linear on [CLS] token
Named Entity Recognition (NER)	Encoder-only	BERT, RoBERTa, XLM-R	Linear per token → BIO tags
Question Answering (Extractive)	Encoder-only	BERT, DeBERTa, RoBERTa-large	Start/end span classification
Text Summarisation	Encoder-Decoder	BART, T5, Pegasus	Seq2seq fine-tuning
Machine Translation	Encoder-Decoder	mBART, mT5, NLLB-200	Seq2seq fine-tuning
Text Generation / Instruction-Following	Decoder-only	LLaMA-3, Mistral, GPT-2	SFT + RLHF / DPO
Semantic Similarity / Sentence Embeddings	Encoder-only	SBERT, E5, BGE, UAE-Large	Siamese network + contrastive loss
Zero/Few-Shot Classification	Encoder-Decoder or Decoder	T5-XXL, GPT-4, Mistral-7B-Instruct	Prompt engineering only

📚

The Golden Rule for Model Selection

If your task is understanding text (classify, extract, label) → start with roberta-base. It typically outperforms BERT with similar compute. If your task is generating text (summarise, translate, answer) → use T5 for constrained outputs and LLaMA/Mistral for open-ended generation. Never reach for a 7B model when RoBERTa-base will do the job — save the heavy guns for genuinely hard tasks.

Section 13

Catastrophic Forgetting — The Hidden Danger

📖 Story

The Overwritten Hard Drive

Imagine a hard drive with years of carefully organised files. You start copying new data onto it at maximum speed, overwriting sectors randomly. The new data goes in perfectly — but half your old files are gone forever. You saved the new project and destroyed the archive.

Catastrophic forgetting in neural networks works exactly like this. If you fine-tune with too large a learning rate or too many epochs, the model overwrites its pretrained language understanding with your specific task labels. It can then classify your training data perfectly but has "forgotten" how to understand language — and generalises terribly to new examples.

❌ Catastrophic Forgetting (LR = 1e-3, 20 epochs)

Epoch	Train Loss	Val Accuracy
5	0.42	88.2%
10	0.18	87.1%
15	0.06	83.4%
20	0.01	74.8% ← collapsing

✔️ Healthy Fine-Tuning (LR = 2e-5, 3 epochs)

Epoch	Train Loss	Val Accuracy
1	0.38	91.2%
2	0.22	93.1%
3	0.18	93.4% ← saved here
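The two tables can be turned into a small early-stopping check. This sketch stops as soon as validation accuracy fails to improve for `patience` evaluations and reports which checkpoint to keep. It monitors accuracy only because that is what the tables report, and `best_checkpoint` is a hypothetical helper.

```python
def best_checkpoint(history, patience=1):
    """Return (best_epoch, best_accuracy, stopped_at) with simple early stopping.

    history is a list of (epoch, val_accuracy) pairs in training order. Training
    stops once val_accuracy has failed to improve for `patience` evaluations.
    """
    best_epoch, best_acc, bad_evals = None, float("-inf"), 0
    stopped_at = history[-1][0]
    for epoch, acc in history:
        if acc > best_acc:
            best_epoch, best_acc, bad_evals = epoch, acc, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                stopped_at = epoch
                break
    return best_epoch, best_acc, stopped_at

# Validation accuracies copied from the two tables above
forgetting = [(5, 88.2), (10, 87.1), (15, 83.4), (20, 74.8)]
healthy = [(1, 91.2), (2, 93.1), (3, 93.4)]

forgetting_result = best_checkpoint(forgetting)
healthy_result = best_checkpoint(healthy)
print("High-LR run:", forgetting_result)
print("Healthy run:", healthy_result)
```

On the high-learning-rate run the check halts at the first drop and keeps the epoch-5 checkpoint, so the later collapse never reaches the saved model. The transformers library ships an `EarlyStoppingCallback` that applies the same idea inside `Trainer`.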

🚨 Rules to Prevent Catastrophic Forgetting

Always use a small learning rate: 1e-5 to 5e-5 for BERT-scale models. Never use the same LR you'd use for training from scratch (1e-3 to 1e-2). The difference between 2e-5 and 2e-3 is the difference between fine-tuning and destroying the model.

Use linear warmup with decay. Increase the LR linearly for the first 6–10% of steps, then decay it. The warmup prevents large updates in the first steps when gradients are most unreliable.

Fine-tune for 3–5 epochs maximum on most tasks. Use early stopping on validation loss. The model converges fast because it starts from a very good initialisation. More epochs typically mean overfitting, not improvement.

Consider discriminative fine-tuning: apply different (lower) learning rates to lower layers than upper layers. Bottom layers already have good representations; they need barely any updating. Top layers and the head need more.

Use weight decay (0.01) and dropout on the task head. Regularisation prevents the fine-tuned weights from drifting too far from the pretrained values.
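The discriminative fine-tuning rule above can be sketched as a per-layer learning-rate table. The decay factor of 0.9 per layer is an illustrative assumption, not a value from the text, and `layerwise_learning_rates` is a hypothetical helper.

```python
def layerwise_learning_rates(num_layers, head_lr=2e-5, decay=0.9):
    """Assign each block a learning rate that shrinks toward the bottom layers.

    The head gets head_lr, the top encoder layer gets head_lr * decay, the next
    one head_lr * decay**2, and so on. decay=0.9 is an illustrative choice.
    """
    rates = {"head": head_lr}
    for depth, layer in enumerate(range(num_layers - 1, -1, -1), start=1):
        rates[f"layer_{layer}"] = head_lr * decay ** depth
    return rates

rates = layerwise_learning_rates(num_layers=12)
for name in ("head", "layer_11", "layer_6", "layer_0"):
    print(f"{name:>8}: {rates[name]:.2e}")
```

With these settings the bottom layer trains at a bit under a third of the head's rate. In PyTorch, a table like this is usually passed to the optimiser as one parameter group per layer, each with its own `lr`.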

Section 14

Multilingual Transfer Learning — One Model, 100 Languages

One of the most remarkable feats of modern NLP is zero-shot cross-lingual transfer. Fine-tune a multilingual model on English labelled data, and it immediately works in Hindi, Swahili, or Arabic — without a single labelled example in those languages.

🌐

How Zero-Shot Cross-Lingual Transfer Works

Models like XLM-R (trained on 100 languages simultaneously) learn a shared multilingual representation space. "Cat", "gato", "chat", "बिल्ली" all map to nearby points in this space. When you fine-tune on the concept "animal name" using English examples, the model can recognise animal names in Spanish and Hindi too — because they live in the same neighbourhood of the embedding space.

Model	Languages	Architecture	Best For
XLM-RoBERTa (XLM-R)	100	Encoder-only	Classification, NER, cross-lingual transfer
mBERT	104	Encoder-only	Multilingual understanding (weaker than XLM-R)
mT5 / mBART	101 / 50	Enc-Dec	Multilingual summarisation, translation
LaBSE	109	Encoder-only	Cross-lingual sentence embeddings, similarity
NLLB-200	200	Enc-Dec	Machine translation including low-resource languages

Section 15

Code — Named Entity Recognition (NER) Fine-Tuning

Token classification is one of the most common industrial NLP tasks — extracting names, dates, organisations, locations from text. Here's a concise fine-tuning example.

from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification,
)
from datasets import load_dataset
import numpy as np
from seqeval.metrics import classification_report

# ── Load CoNLL-2003 NER dataset ────────────────────────────────────────
dataset   = load_dataset("conll2003")
label_list = dataset["train"].features["ner_tags"].feature.names
print(f"NER Labels: {label_list}")
# ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

MODEL_NAME = "roberta-base"
tokenizer  = AutoTokenizer.from_pretrained(MODEL_NAME, add_prefix_space=True)

def tokenize_and_align_labels(examples):
    """Tokenize and re-align NER labels with subword tokens."""
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,  # input is already word-tokenised
    )
    all_labels = []
    for i, labels in enumerate(examples["ner_tags"]):
        word_ids      = tokenized_inputs.word_ids(batch_index=i)
        aligned       = []
        previous_word = None
        for word_id in word_ids:
            if word_id is None:
                aligned.append(-100)       # ignore [CLS], [SEP], padding
            elif word_id != previous_word:
                aligned.append(labels[word_id])
            else:
                aligned.append(-100)       # ignore continuation subwords
            previous_word = word_id
        all_labels.append(aligned)
    tokenized_inputs["labels"] = all_labels
    return tokenized_inputs

tokenized = dataset.map(tokenize_and_align_labels, batched=True)
collator  = DataCollatorForTokenClassification(tokenizer=tokenizer)

# ── Model ──────────────────────────────────────────────────────────────
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(label_list),
    id2label={i: l for i, l in enumerate(label_list)},
    label2id={l: i for i, l in enumerate(label_list)},
)

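# ── Metrics: entity-level F1 via seqeval ───────────────────────────────
from seqeval.metrics import f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Drop positions labelled -100 (special tokens, continuation subwords)
    true_tags = [[label_list[lab] for lab in row if lab != -100] for row in labels]
    pred_tags = [
        [label_list[p] for p, lab in zip(p_row, l_row) if lab != -100]
        for p_row, l_row in zip(preds, labels)
    ]
    return {"f1": f1_score(true_tags, pred_tags)}

args = TrainingArguments(
    output_dir="./roberta-ner",
    eval_strategy="epoch",   # named evaluation_strategy in transformers < 4.41
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    num_train_epochs=4,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model, args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # without this, no F1 is reported
)
trainer.train()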

OUTPUT

Epoch 4/4 - F1 (overall): 0.9189

        precision  recall  f1-score
PER     0.965      0.963   0.964
ORG     0.901      0.895   0.898
LOC     0.939      0.942   0.940
MISC    0.831      0.827   0.829

Overall F1: 0.9189 (vs BiLSTM-CRF baseline: 0.845)

Section 16

The BERT Family — Which Variant for Your Constraints?

Model	Parameters	Speed	Accuracy	Best Use Case
BERT-base	110M	Moderate	Good	Standard benchmark, learning purposes
RoBERTa-base	125M	Moderate	Better than BERT	Best general-purpose encoder; use as default
DistilBERT	66M	~60% faster	97% of BERT accuracy	Production with latency constraints, mobile/edge
DeBERTa-v3-base	86M backbone (184M with embeddings)	Moderate	Best-in-class	Highest accuracy classification and NER tasks
BERT-large	340M	Slow	High	Maximum accuracy when latency is not a concern
XLM-RoBERTa-base	278M	Moderate	Best multilingual	Any task involving non-English or mixed-language text
ModernBERT-base	149M	Fast (Flash Attn)	Excellent	Long documents (8192 context), code+text tasks (2024)

Section 17

Domain-Adaptive Pretraining — Closing the Domain Gap

BERT was pretrained on Wikipedia and BooksCorpus. If your task involves biomedical literature, legal contracts, or financial filings, there is a domain gap: the model has seen little of the vocabulary, abbreviations, or writing style of your domain.

🧪

BioBERT / PubMedBERT

Pre-trained on PubMed abstracts and PMC full-text articles. Dramatically outperforms general BERT on biomedical NER, relation extraction, and biomedical QA tasks. Use when working with clinical notes, drug names, gene mentions.

Domain: Biomedical / Clinical

⚖️

Legal-BERT / LexLM

Pre-trained on legal corpora including European Court of Human Rights cases, US case law, and contracts. Handles clause extraction, judgment prediction, and legal named entity recognition far better than general models.

Domain: Legal / Contracts

📈

FinBERT

Trained on Reuters, Bloomberg financial news, and 10-K filings. Superior at financial sentiment analysis, earnings call analysis, and ESG classification. The general "positive/negative" signals in financial text are different from everyday language.

Domain: Finance / Investing

🛠️

When to Do Your Own DAPT

If no domain-specific model exists for your field, take roberta-base and continue MLM pretraining on your domain corpus (even 100 MB of unlabelled text helps). Use a learning rate of 1e-4 and train for 1–3 epochs. This costs a few hours on a single GPU and can improve downstream task accuracy by 3–8 percentage points in specialised domains.

Section 18

Transfer Learning vs Training from Scratch

⏱ Training from Scratch

Property	Value
Data needed	Millions of labelled examples
GPU time	Days to weeks
Accuracy at 1K labels	~55–65%
Cost (cloud GPU)	$1,000–$50,000+
Domain expertise needed	Deep architecture knowledge

✨ Transfer Learning (Fine-Tuning)

Property	Value
Data needed	100–10,000 labelled examples
GPU time	Minutes to hours
Accuracy at 1K labels	~87–94%
Cost (cloud GPU)	$1–$50
Domain expertise needed	Basic API usage

Section 19

Golden Rules of Transfer Learning in NLP

📚 Transfer Learning — Non-Negotiable Rules

Always use the same tokeniser that was used during pretraining. Mixing tokenisers destroys the correspondence between token IDs and the embedding table. Loading the model and the tokeniser from the same checkpoint name with from_pretrained() guarantees this.

Keep learning rates very small: 1e-5 to 5e-5 for BERT-scale models. For LoRA adapters only, you can use up to 1e-4. Using standard learning rates (1e-3) will cause catastrophic forgetting and destroy pretraining knowledge in minutes.

Check for domain mismatch before choosing a model. For biomedical, legal, financial, or code tasks, always prefer a domain-adapted model (BioBERT, Legal-BERT, FinBERT, CodeBERT) over general BERT. The vocabulary and language patterns are different enough to make a significant accuracy difference.

For limited labels (<500 examples), try feature extraction first — freeze the encoder and train only a classifier head. If accuracy is insufficient, progressively unfreeze the top layers. Full fine-tuning with very few examples often overfits badly.

Use LoRA or Adapters when working with models larger than 1B parameters. Full fine-tuning of a 7B model on a single GPU is usually infeasible (out of memory). LoRA with rank 16 typically gives near-identical accuracy and, combined with 4-bit quantisation, can fit on a single consumer GPU.

Always evaluate on a held-out validation set during fine-tuning, not just training loss. Use early stopping. Transfer learning models converge in 2–5 epochs; training beyond that is almost always overfitting, not improving generalisation.

LR warmup is not optional for Transformer fine-tuning. Warming up for 6–10% of total steps prevents dangerously large gradient updates in the first batches, which can permanently damage the pretrained representations before the model has had a chance to adapt properly.
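The warmup-then-decay schedule from the rules above fits in one small function. This is a sketch of a linear variant, using the 2e-5 peak and 10% warmup quoted earlier as defaults. `linear_warmup_decay` is a hypothetical name, although the transformers library provides a comparable scheduler, `get_linear_schedule_with_warmup`.

```python
def linear_warmup_decay(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear ramp to peak_lr, then linear decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

total = 1000
for step in (0, 50, 100, 550, 1000):
    print(f"step {step:>4}: lr = {linear_warmup_decay(step, total):.2e}")
```

The first updates are taken at a fraction of the peak rate, which is exactly when the freshly initialised task head produces the noisiest gradients.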

Section 20

Quick Reference — Transfer Learning Decision Cheatsheet

📋 Which Strategy Should I Use?

0–100 labels

Use prompting / zero-shot with a large LLM (GPT-4, LLaMA-3-70B) or feature extraction with frozen BERT + linear probe

100–1K labels

Use gradual unfreezing or few-shot fine-tuning with SetFit (sentence-transformer based, extremely label-efficient)

1K–50K labels

Use full fine-tuning with roberta-base or deberta-v3-base. 3 epochs, LR 2e-5, warmup 10%

Large model (7B+)

Use LoRA (rank 16–64) with peft library. Combine with 4-bit quantisation (bitsandbytes) for consumer GPUs

Specialised domain

Start with a domain-adapted model (BioBERT, FinBERT, LegalBERT). If none exists, run DAPT on your unlabelled corpus for 1–3 epochs before fine-tuning

Multilingual task

Use xlm-roberta-base for understanding tasks. Fine-tune on English data, evaluate zero-shot on other languages — it often works remarkably well
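The cheatsheet can also be read as a small decision function. The sketch below encodes the rows above under a few stated assumptions: boundary values (exactly 100 or 1,000 labels) fall into the lower bracket, anything above 50K labels is treated like the 1K to 50K row, and the model-size check takes priority over the label count. `choose_strategy` and its parameter names are hypothetical.

```python
def choose_strategy(n_labels, model_params_billion=0.1,
                    specialised_domain=False, multilingual=False):
    """Return the cheatsheet's suggestions as a list of short strings."""
    advice = []
    if specialised_domain:
        advice.append("start from a domain-adapted model, or run DAPT first")
    if multilingual:
        advice.append("use xlm-roberta-base and evaluate zero-shot per language")
    if model_params_billion >= 7:
        advice.append("LoRA (rank 16-64), optionally with 4-bit quantisation")
    elif n_labels <= 100:
        advice.append("prompting with a large LLM, or frozen encoder + linear probe")
    elif n_labels <= 1000:
        advice.append("gradual unfreezing or SetFit")
    else:
        advice.append("full fine-tuning: 3 epochs, LR 2e-5, warmup 10%")
    return advice

print(choose_strategy(50))
print(choose_strategy(5000, specialised_domain=True))
print(choose_strategy(500, model_params_billion=7))
```

Treat the output as a starting point for experiments rather than a verdict. The thresholds are rules of thumb, and a quick validation run should always have the final say.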

🎉

The Big Picture

Transfer learning did not just improve NLP — it democratised it. What once required a research team, millions of labelled examples, and a data centre can now be done by a solo developer with a laptop, 500 labelled examples, and a free Colab GPU in an afternoon. The pretrained models available today represent tens of thousands of GPU-hours of training, freely available to anyone. Using them is not cheating — it is the state of the art.