Master transfer learning with Python and PyTorch.

Section 01

The Story That Explains Transfer Learning

📖 Real World Analogy

The Surgeon Who Became a Painter — Knowledge Travels

Dr. Amara spent twelve years as a cardiothoracic surgeon. Every day she trained her hands for extraordinary fine motor precision, learned to read the subtle visual language of human tissue, and built the mental focus to stay calm under pressure. Then, at 41, she decided to take up oil painting.

Her surgical colleagues thought she would start from scratch like everyone else. They were wrong. Within eight months she painted with a control of fine detail that students a decade younger could not match. Her hands already knew precision. Her eyes already read subtle colour gradients in tissue — learning to read them in landscapes was a smaller leap. Her surgical discipline translated directly into painting discipline.

She did not start from zero. She transferred a decade of deeply encoded skill into a related domain — and reached mastery far faster than anyone starting cold.

This is Transfer Learning. A model trained on one large task — say, recognising 1.2 million ImageNet photographs — has already learned the fundamental vocabulary of visual understanding: edges, textures, shapes, object parts. When you take that model and point it at your specific problem — plant diseases, chest X-rays, satellite images — it does not start from random noise. It starts from a deep, structured knowledge of what visual patterns mean. You are not retraining a model. You are redirecting an expert.

Transfer Learning is the practice of taking a model already trained on a large source task and reusing its learned representations — either directly or with adaptation — on a different but related target task. Instead of training from random weight initialisation (which requires millions of examples and days of GPU time), you begin from a rich learned starting point and converge quickly with far less data.

🌐

Why Transfer Learning Dominates Modern Practice

Training ResNet-50 from scratch on ImageNet requires 1.2 million labelled images, 90 epochs, and roughly 4 days on 8 V100 GPUs — costing over $2,000 in compute. Fine-tuning that same pretrained ResNet-50 on your 1,000-image medical dataset takes under 30 minutes on a single consumer GPU and routinely achieves 85–95% accuracy on tasks where training from scratch would give 40–60%. Transfer learning is not just a technique. In 2025, it is the default approach for any visual, textual, or audio deep learning task.

Section 02

Why It Works — The Universal Feature Hierarchy

📖 The Scientific Insight

Zeiler & Fergus, 2013 — What Is the Network Actually Looking At?

In 2013, Matthew Zeiler and Rob Fergus at NYU asked a deceptively simple question: what does each layer of a deep CNN actually learn? Using a technique called deconvolutional networks (deconvnets), they projected each filter's activations back into pixel space to visualise the patterns that most excite each layer.

The result was striking. The network had learned a layered hierarchy of features, and later work found much the same hierarchy in other CNNs trained on natural images, largely regardless of architecture or training details. This hierarchy is not designed by engineers. It emerges from the structure of natural images themselves. And because much of that low-level structure is shared across imaging domains, features learned from photographs of cats and dogs transfer meaningfully to X-rays, satellite imagery, skin lesions, and factory defect detection.

Layer 1–2

Early Layers — Primitive Detectors

Gabor-like edge detectors at multiple orientations and frequencies. Colour blob detectors. Filters like these appear in the first layers of AlexNet, VGG, and ResNet alike. They are largely domain-agnostic: the same kinds of features are useful for X-rays, satellite images, and microscopy.

Layer 3–5

Middle Layers — Textures & Patterns

Combinations of edges forming textures: fur, scales, brickwork, fabric weave, grass. Corner detectors. Circular blob detectors. More task-specific than early layers but still broadly transferable. A fur-texture detector becomes a tissue-texture detector with minimal adaptation.

Layer 6–8

Deep Layers — Object Parts

Detectors for specific object components: eyes, noses, wheel rims, windows, leaves, text characters. Highly task-specific. For very different target domains (medical vs natural images), these layers should be partially or fully retrained.

Final Layer

Classification Head — Task-Specific

The fully-connected layer(s) that map extracted features to class probabilities. 100% task-specific — always replaced when transferring to a new classification task. This is the only layer you must train from scratch.

The Key Insight

Transferability Decreases With Depth

Early layer features are universally useful. Deep layer features are increasingly task-specific. This gradient of transferability determines your strategy: freeze early layers, carefully tune deep layers, always replace the head. More domain difference → unfreeze more layers.

Texture Bias

Geirhos et al., 2019 — The Striking Discovery

ImageNet-trained CNNs are primarily texture-biased, not shape-biased (unlike humans). A cat texture on a dog shape → network predicts cat. This surprising finding may help explain why CNNs often transfer well to medical imaging: tissue has rich texture, and the network's texture detectors are well suited to it.

Section 03

The Six Strategies — Which One Should You Use?

Transfer learning is not one technique. It is a spectrum of strategies, six of which are covered here. The right choice depends mainly on two factors: how much target data you have and how similar your target domain is to the source domain (ImageNet, for vision models).

🔒

Strategy A — Feature Extraction

Freeze the entire pretrained backbone. Add a new classification head and train only that head. The backbone acts as a fixed feature extractor — every image is converted to a rich feature vector and the head learns to classify those vectors. Training takes minutes, even on CPU.

Small data + domain similar to ImageNet

🔥

Strategy B — Fine-Tuning (Partial)

Freeze early layers (edges, textures — universally useful). Unfreeze later layers and the head — these are more task-specific and need to adapt. Train unfrozen layers with a low learning rate (10–100× smaller than training from scratch). Best balance of speed and accuracy.

Moderate data + domain somewhat different

📈

Strategy C — Full Fine-Tuning

Unfreeze all layers after head warm-up. Apply discriminative learning rates: very low for early layers, progressively higher for later layers and head. Extracts maximum performance but requires more data and careful LR tuning to avoid destroying pretrained features.

Large data + domain different from ImageNet

⚡

Strategy D — Domain Adaptive Pretraining

First, pretrain (or continue-pretrain) on a large unlabelled in-domain dataset using self-supervised or masked-autoencoder objectives (MAE, SimCLR, DINO). Then fine-tune with your labelled data. The "source" model is now domain-adapted before any supervised signal.

Very different domain, limited labelled data

🌟

Strategy E — Zero / Few-Shot (CLIP)

Use a model pretrained on image-text pairs (CLIP, ALIGN, BLIP). At inference, describe your classes in natural language and compute similarity between image and text embeddings. No fine-tuning required for zero-shot; dramatic gains with just 5–20 examples per class (few-shot).

Zero or near-zero labelled data

🔀

Strategy F — Knowledge Distillation

Train a small "student" model to mimic the soft probability outputs of a large pretrained "teacher" model — capturing the teacher's generalisation without its size. Produces compact models suitable for edge/mobile deployment that outperform students trained on hard labels alone.

Production deployment on constrained hardware
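
The distillation objective behind Strategy F can be sketched for a single example as below. This is a minimal sketch assuming the common formulation (a temperature-softened KL term between teacher and student, blended with the usual hard-label cross-entropy); the function names and the values temperature=4.0 and alpha=0.5 are illustrative assumptions, not settings taken from this article.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.5):
    """Blend a soft-target KL term with hard-label cross-entropy.

    alpha weights the teacher-matching term; (1 - alpha) weights the
    ordinary cross-entropy against the ground-truth label.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Standard cross-entropy of the student at temperature 1
    hard = -math.log(softmax(student_logits)[true_label])
    # The T^2 factor keeps the soft-term gradients on a comparable scale
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard


loss = distillation_loss([1.0, 0.5, -0.2], [3.0, 0.1, -1.0], true_label=0)
print(f"distillation loss: {loss:.4f}")
```

A higher temperature spreads the teacher's probability mass over more classes, which is what lets the student learn from the teacher's relative confidence in the wrong answers as well as the right one.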

Strategy	Data Size	Domain Similarity	Layers Trained	Training Time	Accuracy Potential
A — Feature Extraction	< 1,000 images	Similar	Head only	Minutes (CPU)	Good (82–90%)
B — Partial Fine-Tuning	1k – 10k images	Moderate	Last 2–3 blocks + head	30 min – 2 hr	Very good (88–94%)
C — Full Fine-Tuning	10k+ images	Different	All layers + head	2 – 8 hr	Excellent (91–97%)
D — Domain Adaptive Pretrain	Large unlabelled + small labelled	Very different	All (self-supervised first)	Days	Excellent (93–98%)
E — Zero-Shot (CLIP)	0 images	Any	None (inference only)	Seconds	Moderate (60–80%)
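
As a rough aid, the table above can be encoded as a small lookup function. This is a simplified sketch: suggest_strategy is a hypothetical helper, it keys mainly on the table's data-size thresholds, and real projects will usually need to weigh domain similarity and compute budget more carefully than this.

```python
VALID_DOMAINS = {"similar", "moderate", "different", "very_different"}


def suggest_strategy(n_labelled, domain, has_unlabelled_corpus=False):
    """Return the strategy letter (A-E) suggested by the table.

    n_labelled: number of labelled target images.
    domain: how close the target domain is to the source domain.
    has_unlabelled_corpus: whether a large unlabelled in-domain set exists.
    """
    if domain not in VALID_DOMAINS:
        raise ValueError(f"unknown domain similarity: {domain!r}")
    if n_labelled == 0:
        return "E"  # no labels at all: zero-shot inference only
    if domain == "very_different" and has_unlabelled_corpus:
        return "D"  # self-supervised pretraining first, then fine-tune
    if n_labelled < 1_000:
        return "A"  # feature extraction: train the head only
    if n_labelled < 10_000:
        return "B"  # partial fine-tuning: last blocks + head
    return "C"      # full fine-tuning: all layers


for case in [(0, "similar"), (500, "similar"), (5_000, "moderate"),
             (50_000, "different"), (800, "very_different", True)]:
    print(case, "->", suggest_strategy(*case))
```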

Section 04

Strategy A — Feature Extraction (Frozen Backbone)

📋 Feature Extraction Recipe — Step by Step

Step 1

Load a pretrained model (ResNet-50, EfficientNet-B2, ViT-B/16) with ImageNet weights.

Step 2

Freeze all backbone parameters: requires_grad = False. No gradients are computed for the backbone, which saves both memory and compute.

Step 3

Replace the final classification head with a new layer sized for your num_classes. This head has randomly initialised weights — it is what you train.

Step 4

Pass your entire training set through the frozen backbone once and cache the feature vectors to disk. Then train a simple classifier on those cached features — re-running inference is unnecessary.

Step 5

Evaluate on validation set. If accuracy is sufficient, deploy. If not, move to Strategy B (partial fine-tuning).

import torch
import torch.nn as nn
from torchvision import models, datasets, transforms
from torch.utils.data import DataLoader
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# ── 1. Load pretrained ResNet-50 ─────────────────────────────────
backbone = models.resnet50(weights='IMAGENET1K_V2')

# ── 2. Freeze ALL backbone parameters ────────────────────────────
for param in backbone.parameters():
    param.requires_grad = False

# ── 3. Replace classification head ───────────────────────────────
# ResNet-50's fc layer: Linear(2048, 1000) → replace with our task
num_classes = 5
backbone.fc = nn.Sequential(
    nn.Linear(backbone.fc.in_features, 256),
    nn.ReLU(),
    nn.Dropout(p=0.4),
    nn.Linear(256, num_classes)
)
backbone = backbone.to(device)

# Confirm: only head parameters are trainable
trainable = sum(p.numel() for p in backbone.parameters()
                 if p.requires_grad)
total     = sum(p.numel() for p in backbone.parameters())
print(f"Trainable : {trainable:,}  ({trainable/total*100:.2f}% of total)")
print(f"Frozen    : {total-trainable:,}")
print(f"Total     : {total:,}")

# ── 4. Data pipeline ─────────────────────────────────────────────
tfm = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485,0.456,0.406], [0.229,0.224,0.225])
])
train_ds = datasets.ImageFolder('data/train', transform=tfm)
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=4)

# ── 5. Train only the head ────────────────────────────────────────
optimiser = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

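# Keep the frozen backbone in eval mode so its BatchNorm running statistics
# are not updated; only the new head runs in training mode (Dropout active).
backbone.eval()
backbone.fc.train()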
for epoch in range(10):
    total_loss, correct, n = 0.0, 0, 0
    for imgs, labels in train_dl:
        imgs, labels = imgs.to(device), labels.to(device)
        optimiser.zero_grad()
        logits = backbone(imgs)
        loss   = criterion(logits, labels)
        loss.backward()
        optimiser.step()
        total_loss += loss.item() * imgs.size(0)
        correct    += (logits.argmax(1) == labels).sum().item()
        n          += imgs.size(0)
    print(f"Epoch {epoch+1:02d} | loss={total_loss/n:.4f} | acc={correct/n:.4f}")
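
Step 4 of the recipe (cache features once, then train on the cache) is not shown in the loop above, which re-runs the backbone every epoch. The sketch below illustrates the caching variant under stated assumptions: a tiny randomly initialised network stands in for the pretrained backbone and synthetic tensors stand in for the dataset, so that the example runs without downloads. Caching is only valid when the preprocessing is deterministic (no random augmentation).

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-in for a frozen pretrained backbone. In practice this
# would be e.g. a ResNet-50 whose fc layer has been replaced by nn.Identity().
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

# Synthetic images and labels standing in for the real training set.
images = torch.randn(64, 3, 32, 32)
labels = torch.randint(0, 5, (64,))
loader = DataLoader(TensorDataset(images, labels), batch_size=16)

# Single pass through the backbone with gradients disabled.
feats, targets = [], []
with torch.no_grad():
    for x, y in loader:
        feats.append(backbone(x))
        targets.append(y)
feats, targets = torch.cat(feats), torch.cat(targets)
# torch.save({"x": feats, "y": targets}, "features.pt")  # optional disk cache

# Train a small head on the cached feature vectors only.
head = nn.Linear(feats.shape[1], 5)
opt = torch.optim.Adam(head.parameters(), lr=1e-2)
cached = DataLoader(TensorDataset(feats, targets), batch_size=16, shuffle=True)
for epoch in range(5):
    for fx, fy in cached:
        opt.zero_grad()
        loss = nn.functional.cross_entropy(head(fx), fy)
        loss.backward()
        opt.step()
print("cached feature matrix:", tuple(feats.shape))
```

Because the backbone never runs again after the first pass, each head-training epoch costs only a few small matrix multiplications.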


Section 05

Strategy B — Partial Fine-Tuning (The Most Common Real-World Approach)

📖 Story

The Two-Phase Restaurant Retraining

A Michelin-starred French chef is hired by a Japanese restaurant. They do not throw away everything he knows. His knife skills, sense of timing, heat control, and flavour balance are universal — they keep all of that. They freeze it. What needs to change is his seasoning palette, his presentation style, and his ingredient vocabulary — the later, more task-specific skills.

First they send him to a six-week Japanese culinary course — a warm-up phase where only the new skills are trained while the foundational ones rest untouched. Once the new skills are stable, they gradually allow his French techniques to blend in at a very low influence — a full fine-tune phase at low learning rate. The result is a chef who is both deeply grounded and newly specialised.

This is exactly the two-phase transfer learning protocol: warm up the head, then carefully unfreeze deeper layers.

import torch
import torch.nn as nn
from torchvision import models

# ── Load pretrained EfficientNet-B2 ──────────────────────────────
model = models.efficientnet_b2(weights='IMAGENET1K_V1')
num_classes = 7

# Replace classifier head
in_features = model.classifier[1].in_features    # 1408 for B2
model.classifier = nn.Sequential(
    nn.Dropout(p=0.3),
    nn.Linear(in_features, num_classes)
)
model = model.to(device)

# ════════════════════════════════════════════════════════
# PHASE 1 — Freeze backbone, train head only (5–10 epochs)
# ════════════════════════════════════════════════════════
for param in model.features.parameters():
    param.requires_grad = False
model.classifier.requires_grad_(True)

opt_p1 = torch.optim.AdamW(model.classifier.parameters(),
                              lr=1e-3, weight_decay=1e-2)
sch_p1 = torch.optim.lr_scheduler.OneCycleLR(
    opt_p1, max_lr=1e-3, steps_per_epoch=len(train_dl), epochs=5)

print("Phase 1 — Training head only...")
# [training loop here — same as Section 04]

# ════════════════════════════════════════════════════════
# PHASE 2 — Unfreeze deeper backbone blocks, low LR
# ════════════════════════════════════════════════════════
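# torchvision's EfficientNet-B2 `features` holds 9 modules (0–8):
# a stem conv, seven MBConv stages (1–7), and a final 1×1 conv (8).
# Unfreeze stages 5, 6, 7 and the classifier; features[8] stays frozen here.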
for name, param in model.named_parameters():
    param.requires_grad = False     # start fully frozen

for block_idx in [5, 6, 7]:
    for param in model.features[block_idx].parameters():
        param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True

# Use separate LR groups: smaller for backbone, larger for head
opt_p2 = torch.optim.AdamW([
    {'params': model.features[5].parameters(), 'lr': 2e-5},
    {'params': model.features[6].parameters(), 'lr': 4e-5},
    {'params': model.features[7].parameters(), 'lr': 8e-5},
    {'params': model.classifier.parameters(),   'lr': 3e-4},
], weight_decay=1e-2)

sch_p2 = torch.optim.lr_scheduler.CosineAnnealingLR(
    opt_p2, T_max=15, eta_min=1e-7)

trainable_p2 = sum(p.numel() for p in model.parameters()
                    if p.requires_grad)
print(f"Phase 2 trainable params: {trainable_p2:,}")
print("Phase 2 — Fine-tuning blocks 5-7 + head with discriminative LRs...")

OUTPUT

Phase 1 — Training head only...
Phase 2 trainable params: 3,847,192
Phase 2 — Fine-tuning blocks 5-7 + head with discriminative LRs...

⚠️

The Catastrophic Forgetting Trap

If you skip Phase 1 and immediately fine-tune all layers together, the randomly initialised head generates enormous gradients in the first few batches. These gradients propagate into the backbone and catastrophically overwrite pretrained weights — destroying the very knowledge you are trying to leverage. Always warm up the head first. The head's random initialisation is the loaded gun; Phase 1 is the safety.

Section 06

Discriminative Learning Rates — One of the Most Powerful Tricks

Not all layers need the same amount of change. Early layers learned near-universal features and barely need updating. Later layers are task-specific and need more adaptation. The head is randomly initialised and needs aggressive training. Discriminative learning rates assign progressively higher learning rates to progressively deeper (more task-specific) parts of the network.

⚠ Uniform LR — The Wrong Way

Layer Group	LR	Problem
Early conv layers	1e-4	Over-trains universal edge detectors — corrupts them
Middle layers	1e-4	Slightly too high — forces unnecessary change
Late layers	1e-4	Reasonable but could be higher for faster adaptation
Head	1e-4	Far too low — head needs much faster learning from random init

✅ Discriminative LR — The Right Way

Layer Group	LR	Rationale
Early conv layers	1e-6	Near-perfect universal features — barely touch them
Middle layers	1e-5	Minor domain adaptation needed
Late layers	1e-4	Task-specific — meaningful adaptation required
Head	1e-3	Random init — train aggressively from day one

from torchvision import models
import torch.nn as nn
import torch

# ── Full ResNet-50 discriminative LR setup ────────────────────────
model = models.resnet50(weights='IMAGENET1K_V2')
model.fc = nn.Linear(model.fc.in_features, 10)  # 10-class task
model = model.to(device)

# Unfreeze everything (after head warm-up Phase 1)
for p in model.parameters(): p.requires_grad = True

# ResNet-50 layer groups: conv1+bn1 / layer1 / layer2 / layer3 / layer4 / fc
# Assign geometrically increasing learning rates
base_lr = 1e-4
opt = torch.optim.AdamW([
    {'params': [*model.conv1.parameters(),
                *model.bn1.parameters()],  'lr': base_lr / 100},   # 1e-6
    {'params': model.layer1.parameters(),  'lr': base_lr / 100},   # 1e-6
    {'params': model.layer2.parameters(),  'lr': base_lr / 10},    # 1e-5
    {'params': model.layer3.parameters(),  'lr': base_lr / 10},    # 1e-5
    {'params': model.layer4.parameters(),  'lr': base_lr},         # 1e-4
    {'params': model.fc.parameters(),      'lr': base_lr * 10},   # 1e-3
], weight_decay=1e-2)

# Print LR for each group
group_names = ['conv1+bn1', 'layer1', 'layer2',
               'layer3',   'layer4', 'fc (head)']
for name, pg in zip(group_names, opt.param_groups):
    print(f"  {name:14s}: lr = {pg['lr']:.1e}")

OUTPUT

  conv1+bn1     : lr = 1.0e-06
  layer1        : lr = 1.0e-06
  layer2        : lr = 1.0e-05
  layer3        : lr = 1.0e-05
  layer4        : lr = 1.0e-04
  fc (head)     : lr = 1.0e-03

💡

The fastai Rule of Thumb

Jeremy Howard at fast.ai popularised discriminative LRs. His rule: divide the network into three groups (early, middle, late) and use LRs of lr/100, lr/10, lr respectively. Start with lr=1e-3. On most practical fine-tuning tasks this alone adds 1–3% accuracy over uniform LR with zero additional compute.
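
The three-group rule can be sketched as a small planning helper. This is an illustrative sketch only: three_group_lrs is a hypothetical function, and splitting the layer list into equal thirds is an assumption made here for simplicity rather than something the rule itself prescribes.

```python
def three_group_lrs(layer_names, lr=1e-3):
    """Assign lr/100, lr/10 and lr to early, middle and late layer groups.

    layer_names must be ordered from input to output.
    """
    n = len(layer_names)
    if n < 3:
        raise ValueError("need at least three layers to form three groups")
    cut1, cut2 = n // 3, 2 * n // 3
    groups = [layer_names[:cut1], layer_names[cut1:cut2], layer_names[cut2:]]
    rates = [lr / 100, lr / 10, lr]
    return [{"layers": g, "lr": r} for g, r in zip(groups, rates)]


plan = three_group_lrs(["conv1", "layer1", "layer2", "layer3", "layer4", "fc"])
for group in plan:
    print(f"{group['lr']:.0e}  {', '.join(group['layers'])}")
```

Each entry maps directly onto an optimiser parameter group: replace the layer names with the corresponding module parameters and pass the list to the optimiser.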

Section 07

Choosing a Pretrained Backbone — The Practical Decision

Model	Params	ImageNet Acc	Inference Speed	Memory	Best For
MobileNetV3-S	2.5M	67.7%	Very fast (CPU)	Very low	Mobile, edge, IoT deployment
EfficientNet-B0	5.3M	77.1%	Fast	Low	Best accuracy/params at small scale
ResNet-50	25M	80.9%	Medium	Medium	Universal baseline — well-understood, predictable
EfficientNet-B4	19M	83.4%	Medium	Medium	Competition-grade accuracy, good efficiency
ConvNeXt-Base	89M	85.8%	Medium	High	State-of-art CNN, modern training recipe
ViT-B/16	86M	85.3%	Medium (GPU)	High	Large datasets, global context critical
ViT-L/14 (CLIP)	307M	87.2%+	Slow	Very high	Zero-shot, multimodal, maximum accuracy

📱

MOBILE

MobileNetV3 / EfficientNet-B0

Deployment target is a smartphone, Raspberry Pi, or microcontroller. Latency < 50ms on CPU is required. Accuracy is good enough; size and speed are the constraints. MobileNetV3-Small reaches about 67% ImageNet accuracy with only 2.5M parameters.

📈

BALANCED

ResNet-50 / EfficientNet-B2

The most common real-world choice. Excellent accuracy, reasonable memory, well-understood training dynamics, abundant tutorials and pretrained weights. ResNet-50 is the "universal donor" of transfer learning — adapts to almost any domain reliably.

🏆

MAXIMUM

ConvNeXt-B / ViT-B/16

Competition setting or production system where every 0.5% matters. GPU available, memory is not the bottleneck. Requires more careful hyperparameter tuning. ViT benefits significantly from pretraining on ImageNet-21k (14M images) rather than ImageNet-1k.

Section 08

NLP Transfer Learning — BERT and the Transformer Revolution

📖 Story

Jacob Devlin's BERT — Reading Every Book in the Library

In 2018, Jacob Devlin and colleagues at Google published BERT (Bidirectional Encoder Representations from Transformers). The core idea was to train a large transformer on two tasks using English Wikipedia and BookCorpus (3.3 billion words): (1) Masked Language Modelling: randomly mask 15% of tokens and predict them. (2) Next Sentence Prediction: decide whether two sentences are consecutive.

These tasks sound trivial. They are not. Predicting a masked word in the sentence "The surgeon operated on the [MASK] with great precision" requires understanding grammar, semantics, world knowledge, and context simultaneously. After reading three billion words this way, BERT had built a deep, contextual model of the English language.

Fine-tuning BERT on a downstream task (sentiment classification, question answering, named entity recognition) then took a small fraction of the pretraining cost: the paper reports at most an hour on a single Cloud TPU, or a few hours on a GPU. BERT set new state-of-the-art results on eleven NLP tasks at once. GPT-2, GPT-3, RoBERTa, DeBERTa, and later language models follow the same pretrain-then-adapt paradigm.
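
The masked language modelling objective described in this story can be sketched in plain Python. This is a minimal sketch, not BERT's actual data pipeline: mask_tokens and the toy vocabulary are hypothetical, and the 80/10/10 replacement split (mask / random token / unchanged) follows the BERT paper rather than anything stated in this article.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for masked language modelling.

    labels[i] holds the original token where a prediction is required,
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Special tokens are never selected; others are picked with mask_prob.
        if tok in SPECIAL_TOKENS or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token but still predict it
    return corrupted, labels


sentence = ["[CLS]", "the", "surgeon", "operated", "on", "the",
            "patient", "with", "great", "precision", "[SEP]"]
toy_vocab = ["heart", "table", "quickly", "blue"]
corrupted, labels = mask_tokens(sentence, toy_vocab, mask_prob=0.15)
print(corrupted)
print(labels)
```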

Tokenisation

Convert raw text to subword tokens using WordPiece (BERT) or BPE (GPT). Add special tokens: [CLS] at the start, [SEP] between sentences. Pad/truncate to max_length (typically 512 for BERT). Convert tokens to integer IDs.

Embedding Layer

Each token ID maps to a 768-dimensional embedding vector. Positional embeddings are added (BERT uses learned absolute positional embeddings). Segment embeddings distinguish sentence A from sentence B. All three are summed.

Transformer Encoder Blocks × 12

12 identical encoder layers, each containing multi-head self-attention (12 heads × 64-d = 768-d) and a position-wise feed-forward network (768 → 3072 → 768). Layer norm and residual connections throughout. Total: 110M parameters.

[CLS] Token Output

The final hidden state of the [CLS] token is a 768-d vector that aggregates contextual information from the entire sequence. This is the sentence representation used for classification tasks.

Task-Specific Head

A Linear(768, num_labels) layer maps the [CLS] representation to class logits. For token-level tasks (NER, QA), per-token hidden states are used instead. This head is always randomly initialised and trained from scratch.
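
The 110M parameter figure can be checked with simple arithmetic from the dimensions listed above. The sketch assumes the bert-base-uncased values for vocabulary size (30,522), maximum positions (512) and segment types (2), which are not stated in this article, and it counts the encoder plus pooler without any task-specific head.

```python
# Dimensions from the architecture description above
hidden, ffn, num_layers = 768, 3072, 12
# Assumed bert-base-uncased values (not given in the text)
vocab_size, max_positions, type_vocab = 30_522, 512, 2


def linear(n_in, n_out):
    """Parameter count of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

# Token + position + segment embeddings, followed by a LayerNorm
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

# Per encoder layer: Q, K, V and output projections, then a LayerNorm
attention = 4 * linear(hidden, hidden) + layer_norm
# Per encoder layer: 768 -> 3072 -> 768 feed-forward, then a LayerNorm
feed_forward = linear(hidden, ffn) + linear(ffn, hidden) + layer_norm

encoder = num_layers * (attention + feed_forward)
pooler = linear(hidden, hidden)  # dense layer applied to the [CLS] state

total = embeddings + encoder + pooler
print(f"embeddings : {embeddings:,}")
print(f"encoder    : {encoder:,}")
print(f"pooler     : {pooler:,}")
print(f"total      : {total:,}")  # 109,482,240, i.e. roughly 110M
```

About 78% of the parameters sit in the twelve encoder layers, which is why the learning rate applied to the encoder dominates fine-tuning behaviour.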

# pip install transformers datasets
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# ── 1. Load dataset and tokeniser ────────────────────────────────
dataset   = load_dataset("imdb")
tokeniser = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenise(batch):
    return tokeniser(batch["text"], truncation=True,
                     max_length=256, padding="max_length")

tokenised = dataset.map(tokenise, batched=True, remove_columns=["text"])
tokenised.set_format("torch")

# ── 2. Load pretrained BERT with classification head ─────────────
# AutoModelForSequenceClassification automatically adds Linear(768, 2) head
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# ── 3. Define metrics ─────────────────────────────────────────────
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1":       f1_score(labels, preds, average="macro")
    }

# ── 4. TrainingArguments — key fine-tuning settings ──────────────
args = TrainingArguments(
    output_dir="./bert-imdb",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,              # critical: very low for BERT fine-tuning
    weight_decay=0.01,
    warmup_ratio=0.1,               # 10% of steps for LR warm-up
    lr_scheduler_type="cosine",
    evaluation_strategy="epoch",     # renamed to eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=True,                       # half-precision training (2× faster on GPU)
    report_to="none"
)

# ── 5. Train ──────────────────────────────────────────────────────
trainer = Trainer(
    model=model, args=args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["test"],
    compute_metrics=compute_metrics
)
trainer.train()


🎯

Why 2e-5 Is the Magic Learning Rate for BERT

The original BERT paper recommended 2e-5 to 5e-5 for fine-tuning. The reasoning: BERT's pretrained weights are extraordinarily valuable and fragile. Too high a learning rate (e.g. 1e-3) destroys the contextual representations within a single epoch. Too low (e.g. 1e-7) and the model barely adapts. The 2e-5 range has been validated across hundreds of tasks. Always add a warm-up phase (10% of total steps) to let the random classification head stabilise before the encoder weights start shifting.
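
The shape of a linear warm-up followed by cosine decay can be sketched as a plain function. This is an illustrative sketch rather than the Trainer's internal implementation; lr_at_step and its argument names are hypothetical, with peak_lr=2e-5 and warmup_ratio=0.1 taken from the settings discussed above.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at a given optimiser step.

    Rises linearly from 0 to peak_lr over the warm-up steps, then follows
    half a cosine wave from peak_lr down to 0.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1_000
for step in (0, 50, 100, 550, 1_000):
    print(f"step {step:4d}: lr = {lr_at_step(step, total):.2e}")
```

During the first 10% of steps the encoder receives only small updates, which gives the randomly initialised head time to settle before the pretrained weights move much.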

Section 09

Domain Adaptive Pretraining — When Your Domain Is Truly Different

📖 Story

BioBERT — The Language Model That Learned Medicine

Standard BERT was trained on Wikipedia and BookCorpus — general English text. When researchers tried to use it for biomedical NER (recognising gene names, protein interactions, disease mentions), it performed respectably but missed critical domain-specific nuances. Words like "positive" mean something entirely different in a clinical context than in general English.

In 2019, Lee et al. published BioBERT. They took BERT's weights as a starting point and continued pretraining on 18 billion words from PubMed abstracts and PMC full-text articles — the same masked language modelling task, but now on biomedical text. Then they fine-tuned on biomedical NER, relation extraction, and question answering tasks.

BioBERT outperformed standard BERT by 0.62% (F1) on NER, 2.80% (F1) on relation extraction, and 12.24% (MRR) on question answering. Differences of this size matter enormously in medical applications, where precision can affect patient outcomes.

🧬

BioBERT

Biomedical NLP

BERT + continued pretraining on PubMed + PMC. Best model for gene/protein/disease entity recognition, biomedical relation extraction, and clinical question answering. Widely used in clinical NLP pipelines.

📑

SciBERT

Scientific Text

Pretrained from scratch on 1.14M scientific papers (biomedical + CS). Uses a domain-specific vocabulary: only about 42% of its tokens overlap with standard BERT's vocabulary. Better for CS/AI papers than BioBERT.

📈

FinBERT

Financial Text

BERT + continued pretraining on 4.9B financial tokens (Reuters, 10-K filings, earnings calls). Dramatically outperforms BERT on financial sentiment analysis, risk factor extraction, and earnings call classification.

⚖

LegalBERT

Legal Documents

Pretrained on 12GB of legal text: EU legislation, contracts, court cases. Legal vocabulary is highly specialised — legal documents use words like "hereinafter", "notwithstanding", and "inter alia" that BERT rarely encountered.

🏭

CheXpert/RadImageNet

Medical Vision

Vision models pretrained specifically on chest X-rays or multi-organ radiology images. ImageNet-pretrained CNNs transfer surprisingly well to X-rays, but domain-pretrained models add 2–5% on detection tasks for subtle findings.

🌏

SatMAE / GFM

Satellite Imagery

Masked Autoencoders pretrained on terabytes of satellite imagery. Satellite images differ drastically from natural photos: multispectral channels, nadir viewpoint, no perspective distortion. Domain pretraining is essential for top performance.

# Domain Adaptive Pretraining — continue-pretrain BERT on your domain corpus
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, TrainingArguments, Trainer)
from datasets import load_dataset

# ── Load your domain corpus ───────────────────────────────────────
# e.g. a folder of clinical notes, legal contracts, financial reports
corpus = load_dataset("text", data_files={"train": "domain_corpus/*.txt"})

# ── Tokenise the corpus ───────────────────────────────────────────
tokeniser = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenise_corpus(batch):
    return tokeniser(batch["text"], truncation=True, max_length=512)

tokenised = corpus.map(tokenise_corpus, batched=True,
                         remove_columns=["text"])

# ── Data collator handles dynamic masking (15% of tokens masked) ─
collator = DataCollatorForLanguageModeling(
    tokenizer=tokeniser,
    mlm=True,
    mlm_probability=0.15   # mask 15% of tokens — same as original BERT
)

# ── Load BERT for Masked Language Modelling (continues pretraining)
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# ── Continue pretraining on domain corpus ─────────────────────────
dap_args = TrainingArguments(
    output_dir="./domain-bert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_ratio=0.06,
    fp16=True,
    save_strategy="epoch",
    report_to="none"
)

dap_trainer = Trainer(
    model=mlm_model,
    args=dap_args,
    train_dataset=tokenised["train"],
    data_collator=collator
)
dap_trainer.train()

# ── Save domain-adapted model weights ────────────────────────────
mlm_model.save_pretrained("./domain-bert-final")
tokeniser.save_pretrained("./domain-bert-final")
print("Domain-adapted BERT saved. Now fine-tune on your labelled task data.")

OUTPUT

Epoch 1/3 | mlm_loss=1.8432
Epoch 2/3 | mlm_loss=1.3217
Epoch 3/3 | mlm_loss=1.1084
Domain-adapted BERT saved. Now fine-tune on your labelled task data.

Section 10

Zero-Shot Transfer with CLIP — No Labels Required

OpenAI's CLIP (Contrastive Language-Image Pretraining, 2021) was pretrained on 400 million (image, text) pairs from the internet. It learned to align image and text representations in a shared embedding space. The result: given any new category, you can describe it in English and CLIP knows what it looks like — without seeing a single labelled training example.

CLIP Training Objective

maximize cos(img_emb, txt_emb) for matching pairs

Contrastive loss on N×N similarity matrix. For each batch of N image-text pairs, push together the N matching pairs while pushing apart the N²-N non-matching ones. With N=32,768, this is extremely powerful.

Zero-Shot Prediction

argmax_c cos(f_img(x), f_txt("a photo of a {c}"))

Encode the image. Encode each class description as text. Compute cosine similarity between image embedding and each text embedding. The highest similarity is the predicted class. No fine-tuning needed.

Prompt Engineering

"a photo of a {class}" vs "{class}"

The text prompt format matters significantly. "a photo of a golden retriever" outperforms "golden retriever" because CLIP was trained on natural caption sentences, not bare labels. Ensemble 80 prompts for best results.

Linear Probe

train Linear(d, num_classes) on frozen CLIP features

The CLIP feature extraction baseline: freeze CLIP, extract embeddings for all training images, train a linear classifier on those embeddings. Often outperforms full fine-tuning on small datasets while being dramatically faster.
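
The contrastive objective on the N×N similarity matrix can be sketched in plain Python for a tiny batch. This is a minimal sketch under stated assumptions: clip_loss and its helpers are hypothetical names, the embeddings are toy vectors, and the fixed temperature of 0.07 is an assumption (CLIP learns its temperature during training).

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the true match."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_z - logits[target]


def clip_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric contrastive loss: pair k of images matches pair k of texts."""
    imgs = [l2_normalise(v) for v in img_embs]
    txts = [l2_normalise(v) for v in txt_embs]
    n = len(imgs)
    # N x N matrix of temperature-scaled cosine similarities
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Image -> text: each row should peak on its diagonal entry
    loss_i2t = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    # Text -> image: each column should peak on its diagonal entry
    loss_t2i = sum(cross_entropy([sim[r][k] for r in range(n)], k)
                   for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(f"matching pairs  : {clip_loss(aligned, aligned):.4f}")
print(f"mismatched pairs: {clip_loss(aligned, swapped):.4f}")
```

The loss is low when every image is most similar to its own caption and high when the pairing is scrambled, which is the signal that pulls the two embedding spaces into alignment.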

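# pip install ftfy regex tqdm
# pip install git+https://github.com/openai/CLIP.git   (install route from the official CLIP repository)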
import clip
import torch
import numpy as np
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# ── Zero-shot classification on new categories ────────────────────
class_names = ["healthy wheat", "leaf rust disease",
               "powdery mildew", "septoria blight"]

# Prompt ensemble — average 3 prompt styles per class for robustness
prompt_templates = [
    "a photo of {}",
    "a close-up photograph of {} on a plant",
    "a field image showing {}"
]

with torch.no_grad():
    # Build text embeddings (ensemble of prompts per class)
    class_embeddings = []
    for cls in class_names:
        prompts = clip.tokenize(
            [t.format(cls) for t in prompt_templates]
        ).to(device)
        embs = model.encode_text(prompts)
        embs /= embs.norm(dim=-1, keepdim=True)
        class_embeddings.append(embs.mean(dim=0))  # average ensemble
    class_embeddings = torch.stack(class_embeddings)
    class_embeddings /= class_embeddings.norm(dim=-1, keepdim=True)

    # Encode test image
    img   = preprocess(Image.open("wheat_sample.jpg")).unsqueeze(0).to(device)
    img_e = model.encode_image(img)
    img_e /= img_e.norm(dim=-1, keepdim=True)

    # Cosine similarities → probabilities
    sims  = (100.0 * img_e @ class_embeddings.T).softmax(dim=-1)[0]

print("Zero-Shot CLIP — Wheat Disease Classification")
print(f"(No training examples used)\n")
for cls, prob in zip(class_names, sims):
    bar = "█" * int(prob.item() * 50)
    print(f"  {cls:22s}: {prob.item()*100:5.1f}%  {bar}")

OUTPUT

Zero-Shot CLIP — Wheat Disease Classification
(No training examples used)

  healthy wheat         :   4.2%  ██
  leaf rust disease     :  78.3%  ██████████████████████████████████████
  powdery mildew        :  12.1%  ██████
  septoria blight       :   5.4%  ██

Section 11

Critical Hyperparameters for Fine-Tuning

Hyperparameter	Feature Extraction	Partial Fine-Tuning	Full Fine-Tuning	BERT / Transformer
Learning Rate (head)	1e-3 to 3e-3	3e-4 to 1e-3	1e-4 to 3e-4	2e-5 to 5e-5
Learning Rate (backbone)	0 (frozen)	1e-5 to 1e-4	1e-5 to 1e-4	2e-5 to 5e-5
Optimiser	Adam / SGD	AdamW (recommended)	AdamW (recommended)	AdamW
Weight Decay	1e-4 to 1e-2	1e-2	1e-2	0.01
LR Scheduler	StepLR / Cosine	CosineAnnealingLR	OneCycleLR	Linear warmup + cosine
Warmup Steps	None	Optional (5–10%)	Recommended (5%)	Required (10%)
Epochs	10–20	15–30	20–50	2–5
Batch Size	32–128	32–64	32–64	16–32
Gradient Clipping	Not needed	Optional (max 1.0)	Recommended (1.0)	Required (1.0)
Label Smoothing	Optional (0.1)	0.1	0.05–0.1	Not standard

# ── The complete recommended fine-tuning configuration ───────────
import math
import torch
import torch.nn as nn
from torchvision import models

# Assumes `train_dl` (the training DataLoader) is already defined
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model  = models.efficientnet_b4(weights='IMAGENET1K_V1')
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 8)
model  = model.to(device)

# Two parameter groups: low LR for the backbone, higher LR for the head
params_head     = model.classifier.parameters()
params_backbone = [p for n, p in model.named_parameters()
                   if 'classifier' not in n]

optimiser = torch.optim.AdamW([
    {'params': params_backbone, 'lr': 1e-4, 'weight_decay': 1e-2},
    {'params': params_head,     'lr': 5e-4, 'weight_decay': 1e-2},
])

total_steps = len(train_dl) * 30  # 30 epochs
warmup_steps = int(0.05 * total_steps)

def lr_lambda(step):
    # Linear warmup from 0 to the base LR, then cosine decay back to 0
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

scheduler  = torch.optim.lr_scheduler.LambdaLR(optimiser, lr_lambda)
criterion  = nn.CrossEntropyLoss(label_smoothing=0.1)

print(f"Total steps  : {total_steps}")
print(f"Warmup steps : {warmup_steps}")
print(f"Backbone LR  : 1e-4   |   Head LR : 5e-4")

OUTPUT

Total steps  : 1620
Warmup steps : 81
Backbone LR  : 1e-4   |   Head LR : 5e-4
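To see what this warmup-plus-cosine schedule does to the two learning rates, the multiplier can be evaluated by hand at a few steps. The sketch below uses only the standard library and plugs in the step counts from the output (1620 total, 81 warmup); the sample steps are arbitrary choices for illustration.

```python
import math

total_steps, warmup_steps = 1620, 81  # values taken from the printed output

def lr_lambda(step):
    # Linear warmup from 0 to 1, then cosine decay from 1 back to 0
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# LambdaLR multiplies each group's base LR by this factor at every step
for step in (0, 40, 81, 850, 1620):
    m = lr_lambda(step)
    print(f"step {step:4d}  multiplier={m:.3f}  "
          f"backbone_lr={1e-4 * m:.2e}  head_lr={5e-4 * m:.2e}")
```

The multiplier is shared by both parameter groups, so the 5:1 ratio between head and backbone learning rates is preserved throughout training.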

Section 12

What Goes Wrong — Transfer Learning Failure Modes

🚫 Catastrophic Forgetting

Symptom: Training loss drops fast but validation accuracy is terrible from epoch 1. Cause: LR too high for backbone — destroys pretrained weights in first batches. Fix: Use Phase 1 (freeze backbone) before unfreezing. Use discriminative LRs. Backbone LR should be ≤ 1e-4.

⚠️ Domain Too Different

Symptom: Transfer barely outperforms training from scratch. Cause: Source and target domains are genuinely too different (e.g. ImageNet weights for ultrasound segmentation). Fix: Domain adaptive pretraining — continue pretraining on your in-domain unlabelled data before supervised fine-tuning.

🚫 Using Augmented Val Data

Symptom: Validation metrics look great during training but the model fails in production. Cause: Validation transform includes random augmentations (flip, crop, colour jitter). Evaluation becomes non-deterministic, so scores vary from run to run and cannot be trusted or compared across epochs. Fix: Val transform = only Resize + CenterCrop + Normalize. No randomness, ever.

⚠️ Wrong ImageNet Stats

Symptom: Model converges but accuracy is consistently 3–8% below expected. Cause: Using ImageNet mean/std on medical images (greyscale, different intensity range) or satellite data (different channels). Fix: Compute your dataset's mean/std and use those for normalisation.
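Computing per-channel statistics for your own data could look like the sketch below. The helper name `channel_mean_std` is hypothetical, and the random single-channel tensors stand in for a real dataset; with real data the loader would wrap the training split only, using a transform that stops at `ToTensor()`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def channel_mean_std(loader):
    """Per-channel mean and std over every pixel of every image in `loader`."""
    n_pixels, total, total_sq = 0, None, None
    for images, _ in loader:             # images: (B, C, H, W)
        images = images.double()         # float64 limits accumulation error
        b, c, h, w = images.shape
        if total is None:
            total    = torch.zeros(c, dtype=torch.float64)
            total_sq = torch.zeros(c, dtype=torch.float64)
        total    += images.sum(dim=(0, 2, 3))
        total_sq += (images ** 2).sum(dim=(0, 2, 3))
        n_pixels += b * h * w
    mean = total / n_pixels
    # Var[x] = E[x^2] - E[x]^2; clamp guards against tiny negative round-off
    std = (total_sq / n_pixels - mean ** 2).clamp(min=0).sqrt()
    return mean.float(), std.float()

# Synthetic greyscale "scans" standing in for a real dataset (assumption)
torch.manual_seed(0)
fake_images = torch.rand(64, 1, 32, 32) * 0.3 + 0.1
fake_labels = torch.zeros(64, dtype=torch.long)
loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=16)

mean, std = channel_mean_std(loader)
print("mean:", mean.tolist(), "std:", std.tolist())
```

The resulting values would then be passed to `transforms.Normalize(mean.tolist(), std.tolist())` in both the training and validation transforms.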

🚫 Head Not Warmed Up

Symptom: Loss is extremely high (10+) in epoch 1 and never fully recovers. Cause: Training backbone and random head simultaneously — head's random gradients corrupt backbone. Fix: Freeze backbone entirely for 5–10 epochs. Backbone LR must be zero until head is stable.

⚠️ Insufficient Regularisation

Symptom: Train accuracy 99%, val accuracy 70% — massive gap. Cause: Fine-tuning on small dataset (under 500 images) with too many trainable parameters. Fix: Feature extraction (freeze backbone), increase Dropout (0.4–0.5), add weight decay (1e-2), add MixUp or CutMix augmentation.
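Of the regularisers listed in this fix, MixUp is the least obvious to implement, so here is a minimal sketch. The helper names `mixup_batch` and `mixup_loss` are hypothetical, `alpha=0.2` is an assumed starting value rather than a tuned one, and the linear model and random batch only exist to make the example self-contained.

```python
import torch
import torch.nn as nn

def mixup_batch(x, y, alpha=0.2):
    """Blend each sample with a randomly chosen partner from the same batch."""
    lam  = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0), device=x.device)
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    return x_mixed, y, y[perm], lam

def mixup_loss(criterion, logits, y_a, y_b, lam):
    """Apply the same convex combination to the losses for both label sets."""
    return lam * criterion(logits, y_a) + (1.0 - lam) * criterion(logits, y_b)

# Toy model and batch (assumption: stand-ins for a real network and DataLoader)
torch.manual_seed(0)
model     = nn.Linear(10, 3)
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(8, 10), torch.randint(0, 3, (8,))

x_mixed, y_a, y_b, lam = mixup_batch(x, y)
loss = mixup_loss(criterion, model(x_mixed), y_a, y_b, lam)
loss.backward()
print(f"lambda={lam:.3f}  loss={loss.item():.3f}")
```

MixUp applies to training batches only; validation batches should pass through the model unmixed.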

Section 13

End-to-End Project — Plant Disease Detection with Transfer Learning

import torch, torch.nn as nn, time
from torchvision import models, datasets, transforms
from torch.utils.data import DataLoader
from sklearn.metrics import classification_report

# ── Config ────────────────────────────────────────────────────────
NUM_CLASSES   = 38      # PlantVillage: 38 disease/healthy classes
BATCH_SIZE    = 64
EPOCHS_P1     = 5       # Phase 1: head only
EPOCHS_P2     = 20      # Phase 2: partial fine-tune
device        = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# ── Data transforms ───────────────────────────────────────────────
train_tfm = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(), transforms.RandomVerticalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1),
    transforms.ToTensor(),
    transforms.Normalize([0.485,0.456,0.406],[0.229,0.224,0.225])
])
val_tfm = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485,0.456,0.406],[0.229,0.224,0.225])
])

train_ds = datasets.ImageFolder('plantvillage/train', transform=train_tfm)
val_ds   = datasets.ImageFolder('plantvillage/val',   transform=val_tfm)
train_dl = DataLoader(train_ds, BATCH_SIZE, shuffle=True,  num_workers=4)
val_dl   = DataLoader(val_ds,   BATCH_SIZE, shuffle=False, num_workers=4)

# ── Model: EfficientNet-B2 backbone ───────────────────────────────
model = models.efficientnet_b2(weights='IMAGENET1K_V1')
model.classifier = nn.Sequential(
    nn.Dropout(p=0.35), nn.Linear(1408, NUM_CLASSES))
model = model.to(device)

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

def run_epoch(dl, train):
    model.train(True) if train else model.eval()
    loss_sum, correct, n = 0.0, 0, 0
    ctx = torch.enable_grad() if train else torch.no_grad()
    with ctx:
        for x, y in dl:
            x, y = x.to(device), y.to(device)
            out  = model(x)
            loss = criterion(out, y)
            if train:
                opt.zero_grad(); loss.backward()
                nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                opt.step(); sch.step()
            loss_sum += loss.item() * x.size(0)
            correct  += (out.argmax(1) == y).sum().item(); n += x.size(0)
    return loss_sum/n, correct/n

# ══ PHASE 1: head only ════════════════════════════════════════════
for p in model.features.parameters(): p.requires_grad = False
opt = torch.optim.AdamW(model.classifier.parameters(), lr=1e-3, weight_decay=1e-2)
sch = torch.optim.lr_scheduler.OneCycleLR(opt,1e-3,len(train_dl)*EPOCHS_P1)
print("═ Phase 1 ═")
for e in range(EPOCHS_P1):
    tl,ta = run_epoch(train_dl,True); vl,va = run_epoch(val_dl,False)
    print(f"Ep{e+1:02d} tr_loss={tl:.3f} tr_acc={ta:.3f} | vl_loss={vl:.3f} vl_acc={va:.3f}")

# ══ PHASE 2: unfreeze last 3 blocks + head ════════════════════════
for p in model.parameters(): p.requires_grad = False
for b in [5,6,7]:
    for p in model.features[b].parameters(): p.requires_grad=True
model.classifier.requires_grad_(True)
opt = torch.optim.AdamW([
    {'params':model.features[5].parameters(),'lr':2e-5},
    {'params':model.features[6].parameters(),'lr':5e-5},
    {'params':model.features[7].parameters(),'lr':1e-4},
    {'params':model.classifier.parameters(),  'lr':3e-4},
], weight_decay=1e-2)
# run_epoch steps the scheduler once per batch, so T_max is counted in batches
sch = torch.optim.lr_scheduler.CosineAnnealingLR(opt,len(train_dl)*EPOCHS_P2,eta_min=1e-7)
best_acc = 0
print("\n═ Phase 2 ═")
for e in range(EPOCHS_P2):
    tl,ta = run_epoch(train_dl,True); vl,va = run_epoch(val_dl,False)
    if va > best_acc:
        best_acc=va; torch.save(model.state_dict(),'best.pth')
    print(f"Ep{e+1:02d} tr_acc={ta:.3f} | vl_acc={va:.3f} | best={best_acc:.3f}")
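The script tracks overall validation accuracy only, and with 38 classes a single number can hide classes the model rarely gets right. A per-class breakdown could be sketched as below. Both helper names are hypothetical, and the toy model and data are stand-ins so the block runs on its own; in the project you would first reload the checkpoint with `model.load_state_dict(torch.load('best.pth'))` and pass `model`, `val_dl` and `device`.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

@torch.no_grad()
def collect_predictions(model, loader, device):
    """Return (y_true, y_pred) as 1-D CPU tensors covering the whole loader."""
    model.eval()
    y_true, y_pred = [], []
    for x, y in loader:
        y_pred.append(model(x.to(device)).argmax(dim=1).cpu())
        y_true.append(y)
    return torch.cat(y_true), torch.cat(y_pred)

def per_class_recall(y_true, y_pred, num_classes):
    """Fraction of each class's samples predicted correctly (NaN if class absent)."""
    # Confusion matrix: rows = true class, columns = predicted class
    cm = torch.bincount(y_true * num_classes + y_pred,
                        minlength=num_classes ** 2).reshape(num_classes, num_classes)
    return cm.diag().float() / cm.sum(dim=1).float()

# Tiny stand-in model and data (assumption: replaces the trained EfficientNet)
torch.manual_seed(0)
device    = torch.device('cpu')
toy_model = nn.Linear(5, 3)
toy_dl    = DataLoader(TensorDataset(torch.randn(30, 5),
                                     torch.randint(0, 3, (30,))), batch_size=8)

y_true, y_pred = collect_predictions(toy_model, toy_dl, device)
recall = per_class_recall(y_true, y_pred, num_classes=3)
print("per-class recall:", [round(r, 3) for r in recall.tolist()])
```

The same two arrays can be handed to the `classification_report` function imported at the top of the script, for example `classification_report(y_true.numpy(), y_pred.numpy(), target_names=val_ds.classes)`, to add precision and F1 per class.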

Section 14

Transfer Learning Across Modalities & Domains

Source Pretrain	Target Domain	Model	Recommended Strategy	Expected Gain vs Scratch
ImageNet-1k (natural photos)	Chest X-ray classification	ResNet-50 / DenseNet	Full fine-tune + label smooth	+15–30% AUC
ImageNet-1k	Satellite scene classification	EfficientNet-B4	Partial fine-tune (blocks 4–7)	+12–20% accuracy
ImageNet-1k	Industrial defect detection	ResNet-50 + FPN	Full fine-tune, low LR backbone	+20–35% F1
Wikipedia + BookCorpus (BERT)	Legal contract clause classification	BERT / LegalBERT	Domain pretrain + fine-tune	+5–15% F1 vs vanilla BERT
Wikipedia + BookCorpus	Clinical NER (disease/drug entities)	BioBERT / ClinicalBERT	Domain pretrain + fine-tune	+8–20% F1 vs vanilla BERT
400M image-text pairs (CLIP)	Custom product category recognition	CLIP ViT-B/32	Zero-shot → linear probe → fine-tune	Zero-shot: 60–75%; Fine-tune: 88–94%
AudioSet (audio events)	Machinery fault detection (audio)	AST / PANNs	Full fine-tune, keep mel-spec preprocessing	+18–30% AUC vs classical features

Section 15

Golden Rules

🔄 Transfer Learning — Non-Negotiable Rules

Always warm up the head before touching the backbone. A randomly initialised head generates catastrophically large gradients. If the backbone is unfrozen from epoch 1, those gradients destroy pretrained weights within the first few batches. Freeze the backbone entirely for 5–10 epochs. Only unfreeze once the head has reached a stable loss plateau.

Use discriminative learning rates — not uniform LR for all layers. Early layers encode generic features (edges, textures) that transfer almost unchanged, so they need only tiny updates: 1e-5 or lower. The head needs about 1e-3. Apply a geometric progression from early to late layers: lr/100 → lr/10 → lr. This single practice typically adds 1–3% accuracy over uniform LR at zero compute cost.

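Normalise with the statistics of the pretrained model's dataset. For ImageNet-pretrained models: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]. For domain-pretrained models, use their published statistics. The exception is input with a fundamentally different intensity range or channel layout (greyscale medical scans, multispectral satellite imagery), where statistics computed on your own training set are the better choice. Wrong normalisation is invisible during training but consistently reduces accuracy by 3–8%.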

When your domain is very different from ImageNet, do domain adaptive pretraining first. Do not fine-tune directly from ImageNet weights to medical ultrasound or satellite multispectral data. First continue-pretrain on a large unlabelled in-domain corpus (self-supervised or MLM). Then apply supervised fine-tuning. The gains are consistently 5–20%.

Use AdamW with weight_decay=0.01 as your default optimiser for fine-tuning. Plain Adam with L2 regularisation adds the penalty to the gradient, where the adaptive scaling weakens it unevenly across parameters. AdamW applies the decay directly to the weights, decoupled from the adaptive update, which makes regularisation more predictable and fine-tuning safer. Avoid vanilla SGD for fine-tuning transformers.

Add a warm-up schedule to the learning rate — mandatory for transformers. Jumping to the target learning rate from step 0 destabilises the attention layers in BERT/ViT. Linearly increase from 0 to target LR over the first 5–10% of training steps. For CNNs this is optional but still beneficial, especially when fine-tuning with aggressive LRs.

Start with feature extraction, then escalate to fine-tuning if needed. Feature extraction (frozen backbone) is faster, safer, and more regularised. It almost always works well when your data is small (<2,000 images) and reasonably similar to ImageNet. Only move to full fine-tuning once you have verified that feature extraction has plateaued and you have sufficient data to justify it.
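The discriminative learning rate rule above can be sketched as a small helper that builds optimiser parameter groups. The function name `discriminative_param_groups` and the two-stage toy backbone are hypothetical; for a real network you would pass its own stages (for example slices of `model.features`) in order from earliest to latest.

```python
import torch
import torch.nn as nn

def discriminative_param_groups(stages, head, head_lr=1e-3, decay=10.0):
    """Build param groups whose LR shrinks geometrically with distance from the head.

    `stages` is ordered from earliest to latest backbone stage. The latest
    stage gets head_lr / decay, the one before it head_lr / decay**2, and so on.
    """
    groups = [{'params': head.parameters(), 'lr': head_lr}]
    for depth, stage in enumerate(reversed(stages), start=1):
        groups.append({'params': stage.parameters(),
                       'lr': head_lr / decay ** depth})
    return groups

# Toy backbone with two stages plus a head (assumption: stand-in for a real CNN)
early = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
late  = nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
head  = nn.Linear(16, 4)

optimiser = torch.optim.AdamW(
    discriminative_param_groups([early, late], head), weight_decay=1e-2)

lrs = [g['lr'] for g in optimiser.param_groups]
print("head, late, early LRs:", lrs)
```

With the defaults this yields lr for the head, lr/10 for the late stage and lr/100 for the early stage, matching the progression in the rule. Any scheduler attached to the optimiser scales all groups together, so the ratios hold for the whole run.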