The Story That Explains Transfer Learning
Her surgical colleagues thought she would start from scratch like everyone else. They were wrong. Within eight months she painted with a control of fine detail that students a decade younger could not match. Her hands already knew precision. Her eyes already read subtle colour gradients in tissue — learning to read them in landscapes was a smaller leap. Her surgical discipline translated directly into painting discipline.
She did not start from zero. She transferred a decade of deeply encoded skill into a related domain — and reached mastery far faster than anyone starting cold.
This is Transfer Learning. A model trained on one large task — say, recognising 1.2 million ImageNet photographs — has already learned the fundamental vocabulary of visual understanding: edges, textures, shapes, object parts. When you take that model and point it at your specific problem — plant diseases, chest X-rays, satellite images — it does not start from random noise. It starts from a deep, structured knowledge of what visual patterns mean. You are not retraining a model. You are redirecting an expert.
Transfer Learning is the practice of taking a model already trained on a large source task and reusing its learned representations — either directly or with adaptation — on a different but related target task. Instead of training from random weight initialisation (which requires millions of examples and days of GPU time), you begin from a rich learned starting point and converge quickly with far less data.
Training ResNet-50 from scratch on ImageNet requires 1.2 million labelled images, 90 epochs, and roughly 4 days on 8 V100 GPUs — costing over $2,000 in compute. Fine-tuning that same pretrained ResNet-50 on your 1,000-image medical dataset takes under 30 minutes on a single consumer GPU and routinely achieves 85–95% accuracy on tasks where training from scratch would give 40–60%. Transfer learning is not just a technique. In 2025, it is the default approach for any visual, textual, or audio deep learning task.
Why It Works — The Universal Feature Hierarchy
The result was stunning. Large CNNs trained on natural images, across a wide range of architectures, datasets, and training details, learn broadly the same layered hierarchy of features: edges and colour blobs in the earliest layers, textures and repeated patterns in the middle, and object parts near the top. This hierarchy is not designed by engineers. It emerges from the structure of natural images themselves. And because the low-level end of that structure (edges, gradients, textures) also appears in other kinds of imagery, features learned from photographs of cats and dogs transfer meaningfully to X-rays, satellite imagery, skin lesions, and factory defect detection.
The Four Strategies — Which One Should You Use?
Transfer learning is not one technique. It is a spectrum of strategies, and the table below lists five of them (A to E). The right strategy depends on two factors: how much target data you have and how similar your target domain is to the source domain (ImageNet, for vision models).
| Strategy | Data Size | Domain Similarity | Layers Trained | Training Time | Accuracy Potential |
|---|---|---|---|---|---|
| A — Feature Extraction | < 1,000 images | Similar | Head only | Minutes (CPU) | Good (82–90%) |
| B — Partial Fine-Tuning | 1k – 10k images | Moderate | Last 2–3 blocks + head | 30 min – 2 hr | Very good (88–94%) |
| C — Full Fine-Tuning | 10k+ images | Different | All layers + head | 2 – 8 hr | Excellent (91–97%) |
| D — Domain Adaptive Pretrain | Large unlabelled + small labelled | Very different | All (self-supervised first) | Days | Excellent (93–98%) |
| E — Zero-Shot (CLIP) | 0 images | Any | None (inference only) | Seconds | Moderate (60–80%) |
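The table can be read as a rough decision rule. The sketch below encodes it in plain Python. The function name `choose_strategy`, the similarity labels, and the order in which the checks are applied are illustrative assumptions, and the thresholds are only the table's approximate guides, not hard limits.

```python
def choose_strategy(n_labelled, similarity, has_unlabelled_corpus=False):
    """Map the table's two factors onto a strategy letter (A-E).

    n_labelled: number of labelled target images.
    similarity: 'similar', 'moderate', 'different' or 'very different'
                (hypothetical labels for closeness to the source domain).
    has_unlabelled_corpus: True if a large unlabelled in-domain set exists.
    """
    if n_labelled == 0:
        return "E"  # zero-shot: nothing to train on
    if similarity == "very different" and has_unlabelled_corpus:
        return "D"  # domain-adaptive pretraining before fine-tuning
    if n_labelled < 1_000:
        return "A"  # feature extraction: train the head only
    if n_labelled < 10_000:
        return "B"  # partial fine-tuning: last blocks + head
    return "C"      # full fine-tuning: all layers


for case in [(600, "similar"), (4_000, "moderate"), (50_000, "different"),
             (800, "very different", True), (0, "similar")]:
    print(case, "->", choose_strategy(*case))
```

Treat the result as a starting point: a small dataset in a very different domain, for example, may still need experimentation between A and B.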
Strategy A — Feature Extraction (Frozen Backbone)
import torch
import torch.nn as nn
from torchvision import models, datasets, transforms
from torch.utils.data import DataLoader
import numpy as np
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# ── 1. Load pretrained ResNet-50 ─────────────────────────────────
backbone = models.resnet50(weights='IMAGENET1K_V2')
# ── 2. Freeze ALL backbone parameters ────────────────────────────
for param in backbone.parameters():
    param.requires_grad = False
# ── 3. Replace classification head ───────────────────────────────
# ResNet-50's fc layer: Linear(2048, 1000) → replace with our task
num_classes = 5
backbone.fc = nn.Sequential(
    nn.Linear(backbone.fc.in_features, 256),
    nn.ReLU(),
    nn.Dropout(p=0.4),
    nn.Linear(256, num_classes)
)
backbone = backbone.to(device)
# Confirm: only head parameters are trainable
trainable = sum(p.numel() for p in backbone.parameters()
                if p.requires_grad)
total = sum(p.numel() for p in backbone.parameters())
print(f"Trainable : {trainable:,} ({trainable/total*100:.2f}% of total)")
print(f"Frozen : {total-trainable:,}")
print(f"Total : {total:,}")
# ── 4. Data pipeline ─────────────────────────────────────────────
tfm = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485,0.456,0.406], [0.229,0.224,0.225])
])
train_ds = datasets.ImageFolder('data/train', transform=tfm)
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=4)
# ── 5. Train only the head ────────────────────────────────────────
optimiser = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
backbone.eval()      # frozen BatchNorm layers keep their ImageNet running statistics
backbone.fc.train()  # enable Dropout in the new head only
for epoch in range(10):
    total_loss, correct, n = 0.0, 0, 0
    for imgs, labels in train_dl:
        imgs, labels = imgs.to(device), labels.to(device)
        optimiser.zero_grad()
        logits = backbone(imgs)
        loss = criterion(logits, labels)
        loss.backward()
        optimiser.step()
        total_loss += loss.item() * imgs.size(0)
        correct += (logits.argmax(1) == labels).sum().item()
        n += imgs.size(0)
    print(f"Epoch {epoch+1:02d} | loss={total_loss/n:.4f} | acc={correct/n:.4f}")
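The trainable-parameter figure printed in step 3 of the listing can be checked by hand. The sketch below does the arithmetic in plain Python. The head sizes come straight from the listing; the total of 25,557,032 parameters for torchvision's ResNet-50 is an assumption taken from the commonly reported count, so verify it against your own `total` printout.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# New head from the listing: Linear(2048, 256) -> ReLU -> Dropout -> Linear(256, 5)
# ReLU and Dropout hold no parameters.
head = linear_params(2048, 256) + linear_params(256, 5)

# Assumption: commonly reported parameter count of torchvision's ResNet-50,
# which includes the original Linear(2048, 1000) classifier.
RESNET50_TOTAL = 25_557_032
backbone = RESNET50_TOTAL - linear_params(2048, 1000)

total = backbone + head
print(f"Trainable (head)  : {head:,}")
print(f"Frozen (backbone) : {backbone:,}")
print(f"Trainable share   : {head / total * 100:.2f}%")
```

Only a small fraction of the network is being trained, which is why this strategy can run in minutes even on modest hardware.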
Strategy B — Partial Fine-Tuning (The Most Common Real-World Approach)
First they send him to a six-week Japanese culinary course — a warm-up phase where only the new skills are trained while the foundational ones rest untouched. Once the new skills are stable, they gradually allow his French techniques to blend in at a very low influence — a full fine-tune phase at low learning rate. The result is a chef who is both deeply grounded and newly specialised.
This is exactly the two-phase transfer learning protocol: warm up the head, then carefully unfreeze deeper layers.
import torch
import torch.nn as nn
from torchvision import models
# ── Load pretrained EfficientNet-B2 ──────────────────────────────
model = models.efficientnet_b2(weights='IMAGENET1K_V1')
num_classes = 7
# Replace classifier head
in_features = model.classifier[1].in_features # 1408 for B2
model.classifier = nn.Sequential(
    nn.Dropout(p=0.3),
    nn.Linear(in_features, num_classes)
)
model = model.to(device)
# ════════════════════════════════════════════════════════
# PHASE 1 — Freeze backbone, train head only (5–10 epochs)
# ════════════════════════════════════════════════════════
for param in model.features.parameters():
    param.requires_grad = False
model.classifier.requires_grad_(True)
opt_p1 = torch.optim.AdamW(model.classifier.parameters(),
                           lr=1e-3, weight_decay=1e-2)
sch_p1 = torch.optim.lr_scheduler.OneCycleLR(
    opt_p1, max_lr=1e-3, steps_per_epoch=len(train_dl), epochs=5)
print("Phase 1 — Training head only...")
# [training loop here: same loop as in Strategy A, using opt_p1 and calling sch_p1.step() after every batch]
# ════════════════════════════════════════════════════════
# PHASE 2 — Unfreeze deeper backbone blocks, low LR
# ════════════════════════════════════════════════════════
# torchvision's EfficientNet-B2 `features` holds 9 modules:
# 0 = stem conv, 1–7 = MBConv stages, 8 = final 1×1 conv
# Unfreeze stages 5, 6, 7 and the classifier (module 8 stays frozen here)
for name, param in model.named_parameters():
    param.requires_grad = False  # start fully frozen
for block_idx in [5, 6, 7]:
    for param in model.features[block_idx].parameters():
        param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True
# Use separate LR groups: smaller for backbone, larger for head
opt_p2 = torch.optim.AdamW([
    {'params': model.features[5].parameters(), 'lr': 2e-5},
    {'params': model.features[6].parameters(), 'lr': 4e-5},
    {'params': model.features[7].parameters(), 'lr': 8e-5},
    {'params': model.classifier.parameters(),  'lr': 3e-4},
], weight_decay=1e-2)
sch_p2 = torch.optim.lr_scheduler.CosineAnnealingLR(
    opt_p2, T_max=15, eta_min=1e-7)  # T_max=15 assumes one step per epoch
trainable_p2 = sum(p.numel() for p in model.parameters()
                   if p.requires_grad)
print(f"Phase 2 trainable params: {trainable_p2:,}")
print("Phase 2 — Fine-tuning blocks 5-7 + head with discriminative LRs...")
If you skip Phase 1 and immediately fine-tune all layers together, the randomly initialised head generates enormous gradients in the first few batches. These gradients propagate into the backbone and catastrophically overwrite pretrained weights — destroying the very knowledge you are trying to leverage. Always warm up the head first. The head's random initialisation is the loaded gun; Phase 1 is the safety.
Discriminative Learning Rates — One of the Most Powerful Tricks
Not all layers need the same amount of change. Early layers learned near-universal features such as edges and textures, so they barely need updating. Later layers are task-specific and need more adaptation. The head is randomly initialised and needs aggressive training. Discriminative learning rates assign progressively higher learning rates to the later, more task-specific parts of the network. The first table below shows what goes wrong when every group shares a single rate of 1e-4; the second shows a discriminative assignment.
| Layer Group | LR | Problem |
|---|---|---|
| Early conv layers | 1e-4 | Over-trains universal edge detectors — corrupts them |
| Middle layers | 1e-4 | Slightly too high — forces unnecessary change |
| Late layers | 1e-4 | Reasonable but could be higher for faster adaptation |
| Head | 1e-4 | Far too low — head needs much faster learning from random init |
| Layer Group | LR | Rationale |
|---|---|---|
| Early conv layers | 1e-6 | Near-perfect universal features — barely touch them |
| Middle layers | 1e-5 | Minor domain adaptation needed |
| Late layers | 1e-4 | Task-specific — meaningful adaptation required |
| Head | 1e-3 | Random init — train aggressively from day one |
from torchvision import models
import torch.nn as nn
import torch
# ── Full ResNet-50 discriminative LR setup ────────────────────────
model = models.resnet50(weights='IMAGENET1K_V2')
model.fc = nn.Linear(model.fc.in_features, 10) # 10-class task
model = model.to(device)
# Unfreeze everything (after head warm-up Phase 1)
for p in model.parameters(): p.requires_grad = True
# ResNet-50 layer groups: conv1+bn1 / layer1 / layer2 / layer3 / layer4 / fc
# Assign geometrically increasing learning rates
base_lr = 1e-4
opt = torch.optim.AdamW([
    {'params': [*model.conv1.parameters(),
                *model.bn1.parameters()],  'lr': base_lr / 100},  # 1e-6
    {'params': model.layer1.parameters(), 'lr': base_lr / 100},  # 1e-6
    {'params': model.layer2.parameters(), 'lr': base_lr / 10},   # 1e-5
    {'params': model.layer3.parameters(), 'lr': base_lr / 10},   # 1e-5
    {'params': model.layer4.parameters(), 'lr': base_lr},        # 1e-4
    {'params': model.fc.parameters(),     'lr': base_lr * 10},   # 1e-3
], weight_decay=1e-2)
# Print LR for each group
group_names = ['conv1+bn1', 'layer1', 'layer2',
               'layer3', 'layer4', 'fc (head)']
for name, pg in zip(group_names, opt.param_groups):
    print(f"  {name:14s}: lr = {pg['lr']:.1e}")
Jeremy Howard at fast.ai popularised discriminative LRs. His rule: divide the network into three groups (early, middle, late) and use LRs of lr/100, lr/10, lr respectively. Start with lr=1e-3. On most practical fine-tuning tasks this alone adds 1–3% accuracy over uniform LR with zero additional compute.
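The three-group rule is easy to express without any framework. The sketch below splits an ordered list of layer groups into early, middle, and late thirds and assigns lr/100, lr/10, and lr. The function name `three_group_lrs` and the even three-way split are assumptions made for illustration, not fast.ai's own API.

```python
def three_group_lrs(layer_names, lr=1e-3):
    """Assign lr/100, lr/10, lr to the early, middle and late thirds.

    layer_names must be ordered from input side to output side.
    """
    divisors = (100, 10, 1)
    n = len(layer_names)
    rates = {}
    for i, name in enumerate(layer_names):
        group = min(2, i * 3 // n)  # 0 = early, 1 = middle, 2 = late
        rates[name] = lr / divisors[group]
    return rates


# ResNet-50 layer groups, input to output (the head counts as "late" here)
resnet_groups = ["conv1", "layer1", "layer2", "layer3", "layer4", "fc"]
for name, rate in three_group_lrs(resnet_groups).items():
    print(f"{name:7s}: lr = {rate:.0e}")
```

Each entry can then be turned into one optimiser parameter group, exactly as in the ResNet-50 listing above.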
Choosing a Pretrained Backbone — The Practical Decision
| Model | Params | ImageNet Acc | Inference Speed | Memory | Best For |
|---|---|---|---|---|---|
| MobileNetV3-S | 2.5M | 67.7% | Very fast (CPU) | Very low | Mobile, edge, IoT deployment |
| EfficientNet-B0 | 5.3M | 77.1% | Fast | Low | Best accuracy/params at small scale |
| ResNet-50 | 25M | 80.9% | Medium | Medium | Universal baseline — well-understood, predictable |
| EfficientNet-B4 | 19M | 83.4% | Medium | Medium | Competition-grade accuracy, good efficiency |
| ConvNeXt-Base | 89M | 85.8% | Medium | High | State-of-art CNN, modern training recipe |
| ViT-B/16 | 86M | 85.3% | Medium (GPU) | High | Large datasets, global context critical |
| ViT-L/14 (CLIP) | 307M | 87.2%+ | Slow | Very high | Zero-shot, multimodal, maximum accuracy |
NLP Transfer Learning — BERT and the Transformer Revolution
These tasks sound trivial. They are not. Predicting a masked word in the sentence "The surgeon operated on the [MASK] with great precision" requires understanding grammar, semantics, world knowledge, and context simultaneously. After reading roughly 3.3 billion words this way (BooksCorpus plus English Wikipedia), BERT had built a deep, contextual model of the English language.
Fine-tuning BERT on a downstream task (sentiment classification, question answering, named entity recognition) then needed only a few epochs and far fewer labelled examples than training a task-specific model from scratch. BERT set new state-of-the-art results on eleven NLP tasks at once. GPT-2, GPT-3, RoBERTa, DeBERTa, and every modern language model followed the same transfer learning paradigm.
# pip install transformers datasets
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
# ── 1. Load dataset and tokeniser ────────────────────────────────
dataset = load_dataset("imdb")
tokeniser = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenise(batch):
    return tokeniser(batch["text"], truncation=True,
                     max_length=256, padding="max_length")
tokenised = dataset.map(tokenise, batched=True, remove_columns=["text"])
tokenised.set_format("torch")
# ── 2. Load pretrained BERT with classification head ─────────────
# AutoModelForSequenceClassification automatically adds Linear(768, 2) head
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
# ── 3. Define metrics ─────────────────────────────────────────────
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="macro")
    }
# ── 4. TrainingArguments — key fine-tuning settings ──────────────
args = TrainingArguments(
    output_dir="./bert-imdb",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,           # critical: very low for BERT fine-tuning
    weight_decay=0.01,
    warmup_ratio=0.1,             # 10% of steps for LR warm-up
    lr_scheduler_type="cosine",
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=True,                    # half-precision training; needs a CUDA GPU
    report_to="none"
)
# ── 5. Train ──────────────────────────────────────────────────────
trainer = Trainer(
    model=model, args=args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["test"],
    compute_metrics=compute_metrics
)
trainer.train()
The original BERT paper recommended 2e-5 to 5e-5 for fine-tuning. The reasoning: BERT's pretrained weights are extraordinarily valuable and fragile. Too high a learning rate (e.g. 1e-3) destroys the contextual representations within a single epoch. Too low (e.g. 1e-7) and the model barely adapts. The 2e-5 range has been validated across hundreds of tasks. Always add a warm-up phase (10% of total steps) to let the random classification head stabilise before the encoder weights start shifting.
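To make the "10% warm-up" concrete, the sketch below works out the step counts for the IMDB run above and evaluates a linear warm-up plus cosine decay schedule at a few points. It assumes a single GPU, no gradient accumulation, 25,000 training reviews, and that fractional step counts are rounded up; the helper names `warmup_plan` and `lr_at` are hypothetical, and the exact rounding inside the `Trainer` may differ.

```python
import math


def warmup_plan(n_examples, batch_size, epochs, warmup_ratio):
    """Return (steps_per_epoch, total_steps, warmup_steps)."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return steps_per_epoch, total_steps, warmup_steps


def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warm-up to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


per_epoch, total, warmup = warmup_plan(25_000, 16, 3, 0.1)
print(f"steps/epoch={per_epoch}  total={total}  warmup={warmup}")
for step in (0, warmup // 2, warmup, total // 2, total):
    print(f"step {step:5d}: lr = {lr_at(step, total, warmup, 2e-5):.2e}")
```

The encoder sees almost no update pressure during the first few hundred steps, which is the window in which the new classification head is still close to random.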
Domain Adaptive Pretraining — When Your Domain Is Truly Different
In 2019, Lee et al. published BioBERT. They took BERT's weights as a starting point and continued pretraining on 18 billion words from PubMed abstracts and PMC full-text articles — the same masked language modelling task, but now on biomedical text. Then they fine-tuned on biomedical NER, relation extraction, and question answering tasks.
BioBERT outperformed standard BERT by 0.62% F1 on NER, 2.80% F1 on relation extraction, and 12.24% MRR on question answering. These differences matter enormously in medical applications where precision can affect patient outcomes.
# Domain Adaptive Pretraining — continue-pretrain BERT on your domain corpus
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, TrainingArguments, Trainer)
from datasets import load_dataset
# ── Load your domain corpus ───────────────────────────────────────
# e.g. a folder of clinical notes, legal contracts, financial reports
corpus = load_dataset("text", data_files={"train": "domain_corpus/*.txt"})
# ── Tokenise the corpus ───────────────────────────────────────────
tokeniser = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenise_corpus(batch):
    return tokeniser(batch["text"], truncation=True, max_length=512)
tokenised = corpus.map(tokenise_corpus, batched=True,
                       remove_columns=["text"])
# ── Data collator handles dynamic masking (15% of tokens masked) ─
collator = DataCollatorForLanguageModeling(
    tokenizer=tokeniser,
    mlm=True,
    mlm_probability=0.15  # mask 15% of tokens, as in the original BERT
)
# ── Load BERT for Masked Language Modelling (continues pretraining)
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
# ── Continue pretraining on domain corpus ─────────────────────────
dap_args = TrainingArguments(
    output_dir="./domain-bert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_ratio=0.06,
    fp16=True,
    save_strategy="epoch",
    report_to="none"
)
dap_trainer = Trainer(
    model=mlm_model,
    args=dap_args,
    train_dataset=tokenised["train"],
    data_collator=collator
)
dap_trainer.train()
# ── Save domain-adapted model weights ────────────────────────────
mlm_model.save_pretrained("./domain-bert-final")
tokeniser.save_pretrained("./domain-bert-final")
print("Domain-adapted BERT saved. Now fine-tune on your labelled task data.")
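What the data collator does to each batch can be sketched in a few lines of plain Python. This is a simplified illustration of BERT-style dynamic masking (select about 15% of positions; of those, 80% become `[MASK]`, 10% a random token, 10% stay unchanged), not the library's actual implementation. The function name `mask_tokens` is hypothetical, special tokens are not excluded here, and the ids 103 and 30522 are the `[MASK]` id and vocabulary size of `bert-base-uncased`.

```python
import random


def mask_tokens(token_ids, mask_id=103, vocab_size=30522,
                mlm_probability=0.15, seed=0):
    """Return (inputs, labels) for masked language modelling.

    labels hold the original token at selected positions and -100
    elsewhere (-100 is the index the loss ignores).
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue                       # position not selected
        labels[i] = tok                    # model must predict the original
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id            # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: leave the token unchanged
    return inputs, labels


tokens = list(range(2000, 2040))           # stand-in for real token ids
inputs, labels = mask_tokens(tokens)
selected = [i for i, lab in enumerate(labels) if lab != -100]
print(f"{len(selected)} of {len(tokens)} positions selected for prediction")
```

Because the masking is redrawn every time a batch is built, the model sees a different set of masked positions on each pass over the domain corpus.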
Zero-Shot Transfer with CLIP — No Labels Required
OpenAI's CLIP (Contrastive Language-Image Pretraining, 2021) was pretrained on 400 million (image, text) pairs from the internet. It learned to align image and text representations in a shared embedding space. The result: given any new category, you can describe it in English and CLIP knows what it looks like — without seeing a single labelled training example.
# pip install git+https://github.com/openai/CLIP.git   (official install route for the `clip` module)
import clip
import torch
import numpy as np
from PIL import Image
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
# ── Zero-shot classification on new categories ────────────────────
class_names = ["healthy wheat", "leaf rust disease",
               "powdery mildew", "septoria blight"]
# Prompt ensemble: average 3 prompt styles per class for robustness
prompt_templates = [
    "a photo of {}",
    "a close-up photograph of {} on a plant",
    "a field image showing {}"
]
with torch.no_grad():
    # Build text embeddings (ensemble of prompts per class)
    class_embeddings = []
    for cls in class_names:
        prompts = clip.tokenize(
            [t.format(cls) for t in prompt_templates]
        ).to(device)
        embs = model.encode_text(prompts)
        embs /= embs.norm(dim=-1, keepdim=True)
        class_embeddings.append(embs.mean(dim=0))  # average ensemble
    class_embeddings = torch.stack(class_embeddings)
    class_embeddings /= class_embeddings.norm(dim=-1, keepdim=True)
    # Encode test image
    img = preprocess(Image.open("wheat_sample.jpg")).unsqueeze(0).to(device)
    img_e = model.encode_image(img)
    img_e /= img_e.norm(dim=-1, keepdim=True)
    # Cosine similarities → probabilities
    sims = (100.0 * img_e @ class_embeddings.T).softmax(dim=-1)[0]
print("Zero-Shot CLIP — Wheat Disease Classification")
print(f"(No training examples used)\n")
for cls, prob in zip(class_names, sims):
    bar = "█" * int(prob.item() * 50)
    print(f"  {cls:22s}: {prob.item()*100:5.1f}%  {bar}")
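The scoring step at the end of the listing (normalise, take cosine similarities, scale by 100, apply softmax) can be reproduced with toy vectors and no model at all. The sketch below uses made-up 3-dimensional embeddings purely for illustration; real CLIP embeddings have hundreds of dimensions, and the helper names are hypothetical.

```python
import math


def normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_probs(image_emb, class_embs, scale=100.0):
    """Softmax over scaled cosine similarities, as in the listing above."""
    img = normalise(image_emb)
    logits = [scale * sum(a * b for a, b in zip(img, normalise(c)))
              for c in class_embs]
    top = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up embeddings: the image points mostly along the first class direction
image = [1.0, 0.2, 0.0]
classes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
for idx, p in enumerate(zero_shot_probs(image, classes)):
    print(f"class {idx}: {p * 100:5.1f}%")
```

The scale factor matters: with unit-length vectors the raw similarities all lie between -1 and 1, so without it the softmax output would be close to uniform.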
Critical Hyperparameters for Fine-Tuning
| Hyperparameter | Feature Extraction | Partial Fine-Tuning | Full Fine-Tuning | BERT / Transformer |
|---|---|---|---|---|
| Learning Rate (head) | 1e-3 to 3e-3 | 3e-4 to 1e-3 | 1e-4 to 3e-4 | 2e-5 to 5e-5 |
| Learning Rate (backbone) | 0 (frozen) | 1e-5 to 1e-4 | 1e-5 to 1e-4 | 2e-5 to 5e-5 |
| Optimiser | Adam / SGD | AdamW (recommended) | AdamW (recommended) | AdamW |
| Weight Decay | 1e-4 to 1e-2 | 1e-2 | 1e-2 | 0.01 |
| LR Scheduler | StepLR / Cosine | CosineAnnealingLR | OneCycleLR | Linear warmup + cosine |
| Warmup Steps | None | Optional (5–10%) | Recommended (5%) | Required (10%) |
| Epochs | 10–20 | 15–30 | 20–50 | 2–5 |
| Batch Size | 32–128 | 32–64 | 32–64 | 16–32 |
| Gradient Clipping | Not needed | Optional (max 1.0) | Recommended (1.0) | Required (1.0) |
| Label Smoothing | Optional (0.1) | 0.1 | 0.05–0.1 | Not standard |
# ── The complete recommended fine-tuning configuration ───────────
import math
import torch
import torch.nn as nn
from torchvision import models
model = models.efficientnet_b4(weights='IMAGENET1K_V1')
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 8)
model = model.to(device)
# Two-group optimiser config: lower LR for the backbone, higher for the head
params_head = model.classifier.parameters()
params_backbone = [p for n, p in model.named_parameters()
                   if 'classifier' not in n]
optimiser = torch.optim.AdamW([
    {'params': params_backbone, 'lr': 1e-4, 'weight_decay': 1e-2},
    {'params': params_head,     'lr': 5e-4, 'weight_decay': 1e-2},
])
total_steps = len(train_dl) * 30  # 30 epochs
warmup_steps = int(0.05 * total_steps)
def lr_lambda(step):
    # Linear warm-up for the first 5% of steps, then cosine decay to zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimiser, lr_lambda)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
print(f"Total steps : {total_steps}")
print(f"Warmup steps : {warmup_steps}")
print(f"Backbone LR : 1e-4 | Head LR : 5e-4")
What Goes Wrong — Transfer Learning Failure Modes
End-to-End Project — Plant Disease Detection with Transfer Learning
import torch, torch.nn as nn, time
from torchvision import models, datasets, transforms
from torch.utils.data import DataLoader
from sklearn.metrics import classification_report
# ── Config ────────────────────────────────────────────────────────
NUM_CLASSES = 38 # PlantVillage: 38 disease/healthy classes
BATCH_SIZE = 64
EPOCHS_P1 = 5 # Phase 1: head only
EPOCHS_P2 = 20 # Phase 2: partial fine-tune
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# ── Data transforms ───────────────────────────────────────────────
train_tfm = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(), transforms.RandomVerticalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1),
    transforms.ToTensor(),
    transforms.Normalize([0.485,0.456,0.406],[0.229,0.224,0.225])
])
val_tfm = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485,0.456,0.406],[0.229,0.224,0.225])
])
train_ds = datasets.ImageFolder('plantvillage/train', transform=train_tfm)
val_ds = datasets.ImageFolder('plantvillage/val', transform=val_tfm)
train_dl = DataLoader(train_ds, BATCH_SIZE, shuffle=True, num_workers=4)
val_dl = DataLoader(val_ds, BATCH_SIZE, shuffle=False, num_workers=4)
# ── Model: EfficientNet-B2 backbone ───────────────────────────────
model = models.efficientnet_b2(weights='IMAGENET1K_V1')
model.classifier = nn.Sequential(
    nn.Dropout(p=0.35), nn.Linear(1408, NUM_CLASSES))
model = model.to(device)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
def run_epoch(dl, train):
    # Uses the module-level `opt` and `sch` defined for the current phase
    model.train(train)  # train mode when train=True, eval mode otherwise
    loss_sum, correct, n = 0.0, 0, 0
    ctx = torch.enable_grad() if train else torch.no_grad()
    with ctx:
        for x, y in dl:
            x, y = x.to(device), y.to(device)
            out = model(x)
            loss = criterion(out, y)
            if train:
                opt.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                opt.step()
                sch.step()  # scheduler is stepped once per batch
            loss_sum += loss.item() * x.size(0)
            correct += (out.argmax(1) == y).sum().item()
            n += x.size(0)
    return loss_sum/n, correct/n
# ══ PHASE 1: head only ════════════════════════════════════════════
for p in model.features.parameters(): p.requires_grad = False
opt = torch.optim.AdamW(model.classifier.parameters(), lr=1e-3, weight_decay=1e-2)
sch = torch.optim.lr_scheduler.OneCycleLR(opt,1e-3,len(train_dl)*EPOCHS_P1)
print("═ Phase 1 ═")
for e in range(EPOCHS_P1):
    tl,ta = run_epoch(train_dl,True); vl,va = run_epoch(val_dl,False)
    print(f"Ep{e+1:02d} tr_loss={tl:.3f} tr_acc={ta:.3f} | vl_loss={vl:.3f} vl_acc={va:.3f}")
# ══ PHASE 2: unfreeze last 3 blocks + head ════════════════════════
for p in model.parameters(): p.requires_grad = False
for b in [5,6,7]:
    for p in model.features[b].parameters(): p.requires_grad=True
model.classifier.requires_grad_(True)
opt = torch.optim.AdamW([
    {'params':model.features[5].parameters(),'lr':2e-5},
    {'params':model.features[6].parameters(),'lr':5e-5},
    {'params':model.features[7].parameters(),'lr':1e-4},
    {'params':model.classifier.parameters(), 'lr':3e-4},
], weight_decay=1e-2)
# run_epoch steps the scheduler once per batch, so T_max is counted in batches
sch = torch.optim.lr_scheduler.CosineAnnealingLR(
    opt, T_max=len(train_dl)*EPOCHS_P2, eta_min=1e-7)
best_acc = 0
print("\n═ Phase 2 ═")
for e in range(EPOCHS_P2):
    tl,ta = run_epoch(train_dl,True); vl,va = run_epoch(val_dl,False)
    if va > best_acc:
        best_acc=va; torch.save(model.state_dict(),'best.pth')
    print(f"Ep{e+1:02d} tr_acc={ta:.3f} | vl_acc={va:.3f} | best={best_acc:.3f}")
Transfer Learning Across Modalities & Domains
| Source Pretrain | Target Domain | Model | Recommended Strategy | Expected Gain vs Scratch |
|---|---|---|---|---|
| ImageNet-1k (natural photos) | Chest X-ray classification | ResNet-50 / DenseNet | Full fine-tune + label smooth | +15–30% AUC |
| ImageNet-1k | Satellite scene classification | EfficientNet-B4 | Partial fine-tune (blocks 4–7) | +12–20% accuracy |
| ImageNet-1k | Industrial defect detection | ResNet-50 + FPN | Full fine-tune, low LR backbone | +20–35% F1 |
| Wikipedia + BookCorpus (BERT) | Legal contract clause classification | BERT / LegalBERT | Domain pretrain + fine-tune | +5–15% F1 vs vanilla BERT |
| Wikipedia + BookCorpus | Clinical NER (disease/drug entities) | BioBERT / ClinicalBERT | Domain pretrain + fine-tune | +8–20% F1 vs vanilla BERT |
| 400M image-text pairs (CLIP) | Custom product category recognition | CLIP ViT-B/32 | Zero-shot → linear probe → fine-tune | Zero-shot: 60–75%; Fine-tune: 88–94% |
| AudioSet (audio events) | Machinery fault detection (audio) | AST / PANNs | Full fine-tune, keep mel-spec preprocessing | +18–30% AUC vs classical features |