The Story That Explains BERT
Now imagine a different reader who is given the entire sentence at once, but some words are hidden behind sticky notes. This reader can look left and right simultaneously to figure out what each hidden word probably says. When the hidden word is "bank", they can see "river" on the left and "muddy shore" on the right — and correctly guess "bank" in the geographical, not financial, sense.
That second reader is BERT.
BERT — Bidirectional Encoder Representations from Transformers — is a language model introduced by Google in 2018. Unlike earlier models that read text left-to-right (or right-to-left), BERT reads the entire sequence simultaneously, building a rich, context-aware representation for every token based on all surrounding tokens. It fundamentally changed the NLP landscape and forms the backbone of dozens of modern language understanding systems.
Language understanding requires bidirectional context. The word "bank" has completely different meanings depending on the words to its left and right simultaneously. GPT-style models read left-to-right and miss right-side context. BERT processes both directions at once, giving it a fundamentally richer understanding of meaning.
The Transformer Architecture — BERT's Foundation
BERT is built entirely on the Transformer Encoder — specifically a stack of encoder blocks from the original "Attention Is All You Need" paper (Vaswani et al., 2017). Understanding the encoder is essential to understanding BERT.
The Three Embedding Inputs
Every BERT input begins with [CLS] (Classification token) and sentences are separated and ended with [SEP] (Separator token). Format: [CLS] Sentence A [SEP] Sentence B [SEP]. The [CLS] token's final hidden state acts as the aggregate sequence representation for classification tasks. It sees every other token through self-attention and learns to summarise the whole input.
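The input format above can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's implementation: `build_pair_input` is a hypothetical helper, and the segment ids (0 for sentence A, 1 for sentence B) are an assumption based on how BERT marks which sentence each token belongs to.

```python
def build_pair_input(tokens_a, tokens_b, max_len=16):
    """Assemble [CLS] A [SEP] B [SEP] plus segment ids and an attention mask."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)  # 1 = real token
    pad = max_len - len(tokens)
    if pad < 0:
        raise ValueError("pair is longer than max_len")
    # Pad to a fixed length; padding positions get mask 0.
    tokens += ["[PAD]"] * pad
    segment_ids += [0] * pad
    attention_mask += [0] * pad
    return tokens, segment_ids, attention_mask


tokens, segment_ids, attention_mask = build_pair_input(
    ["the", "river", "rose"], ["the", "bank", "flooded"], max_len=12
)
print(tokens)
print(segment_ids)
print(attention_mask)
```

In practice the tokeniser produces these three sequences for you; the sketch only shows what they contain.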
BERT Variants — Base vs Large
| Property | BERT-Base | BERT-Large |
|---|---|---|
| Encoder Layers (L) | 12 | 24 |
| Hidden Size (H) | 768 | 1,024 |
| Attention Heads (A) | 12 | 16 |
| Total Parameters | 110 million | 340 million |
| Pre-training Hardware | 4 Cloud TPUs (16 TPU chips), 4 days | 16 Cloud TPUs (64 TPU chips), 4 days |
| Typical Fine-tune Speed | Fast — fits on 1 GPU | Slow — needs multi-GPU or large VRAM |
| Best For | Most downstream tasks, production | State-of-the-art benchmarks |
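The parameter counts in the table can be roughly reproduced from L and H alone. The sketch below assumes the usual BERT layout (token, position and segment embeddings, an FFN width of 4 × H, a pooler layer on top, and a 30,522-token vocabulary); `bert_encoder_params` is a hypothetical helper name.

```python
def bert_encoder_params(num_layers, hidden, vocab=30522, max_pos=512, type_vocab=2):
    """Count weights and biases in a BERT-style encoder (no task head)."""
    ffn = 4 * hidden
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Q, K, V and output projections, each hidden x hidden plus a bias.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Expansion to ffn, contraction back to hidden.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * (attention + feed_forward) + pooler


base = bert_encoder_params(num_layers=12, hidden=768)
large = bert_encoder_params(num_layers=24, hidden=1024)
print(f"BERT-Base:  {base:,}")   # BERT-Base:  109,482,240
print(f"BERT-Large: {large:,}")  # BERT-Large: 335,141,888
```

The number of attention heads does not appear in the formula: the heads split the same H × H projections, so changing A changes how the weights are used, not how many there are.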
Self-Attention — How BERT "Sees" Everything at Once
Self-attention does exactly this for every token in a sequence. For the token "it", the mechanism computes how relevant every other token is, assigns a score, and produces a weighted combination of all their representations. The result: "it" now carries information from the noun it refers to.
Self-attention computes three vectors for every token — Query (Q), Key (K), and Value (V) — using learned weight matrices. The attention score between token i and token j is the dot product of token i's Query with token j's Key, scaled and then softmax-normalised.
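As a minimal sketch of the computation just described, the following plain-Python function implements single-head scaled dot-product attention, softmax(QKᵀ / √d_k)·V, on tiny hand-written vectors. The scaling factor √d_k follows Vaswani et al.; the Q, K and V values are made up for illustration and are not learned projections.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Return (outputs, weights) for single-head scaled dot-product attention."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score of this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted combination of all value vectors.
        out = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and says how much one token draws from every token in the sequence, including those to its right.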
With 12 attention heads (BERT-Base), the model can simultaneously track 12 different types of relationships between tokens — one head might track syntactic dependencies, another co-reference, another semantic similarity. No single head needs to do everything; they specialise automatically during training.
Pre-Training Task 1 — Masked Language Model (MLM)
MLM is exactly this. BERT is given billions of such fill-in-the-blank puzzles, and in solving them, it learns a deep model of language.
In MLM, 15% of input tokens are randomly selected for prediction before the text is fed into the model. The model must predict the original identity of those tokens using the full surrounding context. Of the selected tokens, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. This strategy reduces the mismatch between pre-training (where [MASK] appears) and fine-tuning (where it does not).
During pre-training, [MASK] tokens appear in nearly every sequence. During fine-tuning on downstream tasks, they never appear. If every selected token were replaced with [MASK], the model could learn to build good representations only at [MASK] positions, which never occur downstream. Because the 10% random and 10% unchanged cases look like ordinary input, the model cannot tell which tokens it will be asked to predict, so it has to keep a contextual representation for every token at every position.
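The selection and replacement rule (15% of tokens selected; of those, 80% become [MASK], 10% become a random token, 10% stay unchanged) can be sketched as below. This is a simplified illustration: it flips an independent coin per token rather than selecting an exact 15% of each sequence, and `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random

SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (inputs, labels); labels[i] is None where no prediction is required."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_prob:
            continue  # not selected: input unchanged, no loss at this position
        labels[i] = tok  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


rng = random.Random(0)
tokens = "[CLS] the scientist discovered a new species [SEP]".split()
toy_vocab = ["river", "blue", "walk", "seven"]
inputs, labels = mask_tokens(tokens, toy_vocab, rng)
print(inputs)
print(labels)
```

Note that a position can carry a label even though its input token is unchanged, which is exactly the 10% case the strategy relies on.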
MLM — Worked Example
| Position | Original Token | Input to BERT | BERT Must Predict | Strategy |
|---|---|---|---|---|
| 1 | [CLS] | [CLS] | — | Not selected |
| 2 | The | The | — | Not selected |
| 3 | scientist | [MASK] | scientist | 80% replace |
| 4 | discovered | discovered | — | Not selected |
| 5 | a | a | — | Not selected |
| 6 | new | river | new | 10% random |
| 7 | species | species | species | 10% unchanged |
| 8 | [SEP] | [SEP] | — | Not selected |
Pre-Training Task 2 — Next Sentence Prediction (NSP)
BERT is trained to be this editor: given two sentences A and B, it must decide whether B is the actual next sentence that follows A in the original document, or a random sentence from somewhere else.
NSP is a binary classification task. During pre-training, 50% of the time Sentence B is the true next sentence (IsNext), and 50% of the time it is a randomly sampled sentence from the corpus (NotNext). The final hidden state of the [CLS] token is passed through a linear layer + softmax to make this binary prediction.
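A minimal sketch of how such training pairs could be sampled, assuming documents are already split into lists of sentences. `make_nsp_example` is a hypothetical helper, not the original data pipeline.

```python
import random


def make_nsp_example(doc, index, other_docs, rng):
    """Return (sentence_a, sentence_b, label) for doc[index].

    Requires index < len(doc) - 1 so that a true next sentence exists.
    """
    sentence_a = doc[index]
    if rng.random() < 0.5:
        # Positive example: the sentence that actually follows A.
        return sentence_a, doc[index + 1], "IsNext"
    # Negative example: a random sentence from a different document.
    random_doc = rng.choice(other_docs)
    return sentence_a, rng.choice(random_doc), "NotNext"


rng = random.Random(1)
doc = ["The river rose overnight.", "By morning the bank had flooded."]
others = [["Interest rates fell again.", "The bank cut its forecast."]]
for _ in range(4):
    print(make_nsp_example(doc, 0, others, rng))
```

Drawing negatives from other documents keeps the sketch simple, and it also illustrates why the task can be solved largely by spotting a change of topic.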
Later work (notably RoBERTa by Liu et al., 2019) showed that removing NSP entirely and using only MLM with longer sequences and more data matches or slightly improves downstream performance. The hypothesis: NSP is too easy. Because the negative examples come from completely different documents, the model can simply learn to detect topic similarity rather than discourse coherence. Later variants such as RoBERTa and DeBERTa drop NSP. The original BERT still uses it, and the original paper's ablations report that it helps on tasks like Natural Language Inference and Question Answering that explicitly involve sentence pairs.
Pre-Training Data — What BERT Learned From
BERT's Input Representation — The Full Picture
Every BERT input is constructed following a strict format. Understanding this format is essential when you implement fine-tuning, because you must replicate it exactly — the tokeniser and the model were built together.
BERT does not operate on whole words. It uses WordPiece tokenisation, which splits rare words into subword units. The word "unbelievably" might become ["un", "##believ", "##ably"] (the ## prefix means "continuation of previous token"). This allows BERT to handle almost any word, including novel ones, because it can decompose them into known subword pieces. The vocabulary is 30,522 tokens for English uncased BERT.
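The splitting step can be sketched as a greedy longest-match-first search over the vocabulary. The toy vocabulary below is made up so that it reproduces the example split; the real vocabulary is learned from data and may split the same word differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the previous piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"un", "##believ", "##ably", "##able", "believe"}
print(wordpiece("unbelievably", toy_vocab))  # → ['un', '##believ', '##ably']
print(wordpiece("xyz", toy_vocab))           # → ['[UNK]']
```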
Fine-Tuning — Adapting BERT to Your Task
Pre-training gives BERT a universal language model. Fine-tuning adapts that model to a specific downstream task by adding a small task-specific layer on top and training the whole stack end-to-end on labelled task data. Fine-tuning typically takes minutes to hours on a single GPU — orders of magnitude less compute than pre-training.
Before BERT, building a high-quality NLP model typically required large task-specific labelled datasets, task-specific architectures or hand-crafted features, and long training runs. With BERT fine-tuning, a few thousand labelled examples and about 30 minutes on a GPU can often produce strong results. The pre-trained weights already encode rich linguistic knowledge, and fine-tuning just steers that knowledge toward your task.
BERT in Code — From Tokenisation to Fine-Tuning
Step 1 — Install and Load
# Install Hugging Face Transformers
# pip install transformers torch datasets
from transformers import BertTokenizer, BertModel, BertForSequenceClassification
import torch
# Load pre-trained BERT-Base (uncased)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
Step 2 — Tokenise Text
text = 'The scientist discovered a new species in the rainforest.'
# Tokenise and encode to input IDs
tokens = tokenizer.tokenize(text)
print('Tokens:', tokens)
# Example output (exact subword splits depend on the WordPiece vocabulary):
# ['the', 'scientist', 'discovered', 'a', 'new', 'species',
#  'in', 'the', 'rain', '##forest', '.']
# Full encoding with attention mask and special tokens
encoding = tokenizer(
    text,
    return_tensors='pt',  # PyTorch tensors
    padding=True,
    truncation=True,
    max_length=128
)
print('Input IDs shape:', encoding['input_ids'].shape)
print('Token IDs:', encoding['input_ids'])
Step 3 — Extract Contextual Embeddings
model.eval()
with torch.no_grad():
    outputs = model(**encoding)
# last_hidden_state: shape [batch, seq_len, 768]
last_hidden = outputs.last_hidden_state
print('All token embeddings shape:', last_hidden.shape)
# [CLS] token embedding: position 0 of the sequence
cls_embedding = last_hidden[:, 0, :]
print('CLS embedding shape:', cls_embedding.shape)
# Pooled output: the [CLS] hidden state passed through a dense layer + tanh
pooled = outputs.pooler_output
print('Pooled output shape:', pooled.shape)
Step 4 — Fine-Tune for Sentiment Classification
from transformers import BertForSequenceClassification
from torch.optim import AdamW  # transformers.AdamW is deprecated and absent from recent releases
from torch.utils.data import DataLoader, Dataset
import torch
# ── Toy dataset ──────────────────────────────────────────────
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        # Tokenise everything up front; padding=True pads to the longest text
        self.encodings = tokenizer(texts, truncation=True,
                                   padding=True, max_length=max_len,
                                   return_tensors='pt')
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Returns (features dict, label) so the training loop can unpack both
        item = {key: val[idx] for key, val in self.encodings.items()}
        return item, self.labels[idx]
# Sample data
texts = ['I loved this film!', 'Absolutely terrible.',
'Best movie ever.', 'Waste of time.']
labels = [1, 0, 1, 0] # 1 = positive, 0 = negative
dataset = SentimentDataset(texts, labels, tokenizer)
dataloader = DataLoader(dataset, batch_size=2)
# ── Model — BERT-Base + classification head ───────────────────
clf_model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)
optimizer = AdamW(clf_model.parameters(), lr=2e-5)
# ── Training loop (3 epochs) ─────────────────────────────────
clf_model.train()
for epoch in range(3):
    total_loss = 0.0
    for batch, batch_labels in dataloader:
        optimizer.zero_grad()
        outputs = clf_model(**batch, labels=batch_labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1} | Loss: {total_loss/len(dataloader):.4f}")
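The loop above keeps the learning rate constant. A common refinement is to ramp it up linearly over the first steps and then decay it linearly to zero. The sketch below shows the multiplier such a schedule applies to the base learning rate; it is intended to mirror the shape of a linear warmup schedule, and the function name and step counts are assumptions for illustration.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimiser step."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Then decay linearly from 1 to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-5
for step in (0, 50, 100, 550, 1000):
    factor = linear_warmup_decay(step, warmup_steps=100, total_steps=1000)
    print(step, base_lr * factor)
```

With 100 warmup steps out of 1,000, the multiplier is 0 at step 0, 0.5 at step 50, 1.0 at step 100, 0.5 at step 550 and 0 at step 1,000.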
Step 5 — Inference
clf_model.eval()
new_texts = ['This is a masterpiece!', 'Completely boring.']
new_inputs = tokenizer(new_texts, return_tensors='pt',
                       padding=True, truncation=True)
with torch.no_grad():
    logits = clf_model(**new_inputs).logits
predictions = logits.argmax(dim=-1)
labels_map = {0: 'Negative', 1: 'Positive'}
for text, pred in zip(new_texts, predictions):
    print(f"'{text}' → {labels_map[pred.item()]}")
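The argmax above yields a label but no confidence. If a probability is wanted, the logits can be passed through a softmax. The plain-Python sketch below uses made-up logit values to show the arithmetic; with PyTorch tensors the equivalent call is `torch.softmax(logits, dim=-1)`.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [-1.2, 2.3]  # made-up values for [Negative, Positive]
probs = softmax(example_logits)
label = ["Negative", "Positive"][probs.index(max(probs))]
print(label, [round(p, 3) for p in probs])
```

Keep in mind that a model fine-tuned on four toy sentences will not produce meaningful probabilities; the point is only how logits map to a distribution.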
MLM vs NSP — A Side-by-Side Comparison
| Property | Masked Language Model (MLM) | Next Sentence Prediction (NSP) |
|---|---|---|
| Task Type | Token prediction (multi-class) | Binary classification |
| What it Learns | Deep bidirectional word context | Sentence-pair relationship |
| Input | Single sequence with masked tokens | Two sentences joined with [SEP] |
| Output Head | Softmax over 30,522 vocabulary tokens | Softmax over {IsNext, NotNext} |
| Loss Computed On | Only the 15% of tokens selected for prediction | Classifier output on the [CLS] final hidden state |
| Downstream Benefit | All NLP tasks — core representation | Sentence pairs: NLI, QA, STS |
| Kept in RoBERTa? | Yes; extended and improved | No; removed without hurting downstream results |
| Relative Importance | Critical — primary learning signal | Secondary — helps specific tasks |
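The "Loss Computed On" row for MLM can be made concrete with a small sketch: cross-entropy is averaged only over positions that carry a label, and all other positions are skipped. Using -100 as the "ignore" label follows a common PyTorch convention; the function name and toy logits are assumptions.

```python
import math

IGNORE = -100  # label for positions that were not selected for prediction


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE."""
    losses = []
    for row, label in zip(logits, labels):
        if label == IGNORE:
            continue  # unselected position: contributes nothing to the loss
        m = max(row)
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses) if losses else 0.0


# Four positions, toy vocabulary of size 3; only positions 1 and 3 were selected.
toy_logits = [[2.0, 0.1, 0.1], [0.0, 0.0, 0.0], [0.3, 1.5, 0.2], [4.0, 0.0, 0.0]]
toy_labels = [IGNORE, 2, IGNORE, 0]
print(round(masked_cross_entropy(toy_logits, toy_labels), 4))
```

Changing the logits at an ignored position leaves the loss untouched, which is what "only the selected tokens" means in practice.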
BERT's Key Hyperparameters
| Hyperparameter | BERT-Base Value | What It Controls |
|---|---|---|
| hidden_size | 768 | Dimension of all token representations throughout the model |
| num_hidden_layers | 12 | Number of Transformer encoder blocks stacked |
| num_attention_heads | 12 | Parallel attention heads per layer; each has dim = 768/12 = 64 |
| intermediate_size | 3072 | Hidden size in the FFN (4 × hidden_size); expansion then contraction |
| max_position_embeddings | 512 | Maximum sequence length supported (tokens) |
| vocab_size | 30,522 | WordPiece vocabulary size (English uncased) |
| hidden_dropout_prob | 0.1 | Dropout applied to embeddings and hidden states; attention weights use the separate attention_probs_dropout_prob (also 0.1) |
| Fine-tune lr | 2e-5 to 5e-5 | Learning rate; lower than typical, because a large LR destroys pre-trained weights |
| Fine-tune epochs | 2 – 4 | More epochs risk overfitting on small datasets; 3 is a common default |
BERT vs Other Models — Where It Stands
| Model | Architecture | Direction | Pre-train Task | Best For |
|---|---|---|---|---|
| BERT | Transformer Encoder | Bidirectional | MLM + NSP | Understanding tasks (classification, NER, QA) |
| GPT-2/3 | Transformer Decoder | Left-to-right only | Causal LM | Text generation |
| RoBERTa | Transformer Encoder | Bidirectional | MLM only (no NSP) | Stronger than BERT on most benchmarks |
| ALBERT | Transformer Encoder | Bidirectional | MLM + SOP* | Far fewer parameters than BERT via cross-layer sharing; memory-constrained deployment |
| DistilBERT | Transformer Encoder (6L) | Bidirectional | Distilled from BERT | 60% faster, 40% smaller, 97% of BERT's accuracy |
| T5 | Encoder-Decoder | Bidirectional encoder | Text-to-Text | All tasks framed as text generation |
* SOP = Sentence Order Prediction — a harder version of NSP used in ALBERT.
BERT's Impact — Benchmark Results
When BERT was released, it achieved state-of-the-art on 11 NLP tasks simultaneously — an unprecedented leap. Here are representative results from the original paper:
| Benchmark | Task Type | Previous SOTA | BERT-Base | BERT-Large |
|---|---|---|---|---|
| GLUE (overall) | Multi-task NLU | 72.8 | 79.6 | 80.5 |
| SQuAD 1.1 (F1) | Extractive QA | 91.7 | 88.5 (dev, single model) | 93.2 (test, ensemble + TriviaQA) |
| SQuAD 2.0 (F1) | QA with unanswerable | 78.0 | not reported | 83.1 |
| MultiNLI | Natural Language Inference | 82.1% | 84.6% | 86.7% |
| SST-2 | Sentiment (2-class) | 93.2% | 93.5% | 94.9% |
The improvements were not marginal. BERT broke records on 11 tasks using the same pre-trained model with a different output head for each task. Previously, strong results typically required a task-specific architecture. BERT showed that a single pre-trained representation could transfer across a wide range of tasks, the moment NLP became a transfer-learning field like computer vision.
Common Mistakes and How to Avoid Them
bert-base-uncased uses an uncased tokeniser that lowercases all input. Using a cased tokeniser on an uncased model (or vice versa) silently produces gibberish embeddings, because the token IDs no longer match the vocabulary the model was trained on. Use AutoTokenizer.from_pretrained(model_name) with the same model_name you load the model from to guarantee the correct match.
Keep the fine-tuning learning rate in the range 2e-5 to 5e-5. A much larger learning rate can destroy the pre-trained weights in the first few steps, leaving the model no better than a random baseline.
The attention_mask tensor (1 = real token, 0 = padding) must always be passed to the model. The tokeniser returns it by default when you call it directly, whatever value you give return_tensors.
BERT accepts at most 512 tokens. For longer inputs, use explicit truncation (truncation=True, max_length=512), sliding window approaches, or switch to a long-context model like Longformer or BigBird. Silently truncating input beyond 512 tokens discards the rest of the text and can produce wrong results.
Use learning-rate warmup. get_linear_schedule_with_warmup handles this. Without warmup, training can destabilise in the first steps, when the task head outputs random values and gradients are large.
When you want a stronger baseline, consider roberta-base as your drop-in replacement via Hugging Face, loaded through the Auto classes so that the matching tokeniser is used.
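The sliding-window approach to long inputs can be sketched as follows. This is a minimal illustration on a plain list of token IDs: `sliding_windows` is a hypothetical helper, `stride` here means the number of tokens shared by consecutive windows, and a real pipeline would also reserve room for [CLS] and [SEP] in each window.

```python
def sliding_windows(token_ids, max_len=512, stride=128):
    """Split token_ids into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens so that no span of text
    is seen only at a window boundary.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    windows, start = [], 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window reached the end of the input
        start += step
    return windows


ids = list(range(1000))  # stand-in for 1,000 token IDs
windows = sliding_windows(ids, max_len=512, stride=128)
print([(w[0], w[-1], len(w)) for w in windows])
# → [(0, 511, 512), (384, 895, 512), (768, 999, 232)]
```

Each window is then encoded separately and the per-window predictions are combined, for example by averaging logits or taking the most confident window, depending on the task.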
The BERT Pipeline — Complete Overview
BERT did not just improve NLP; it changed how NLP is done. Pre-training on raw text and then fine-tuning on labelled data is now the standard paradigm for virtually every language task. Models such as RoBERTa, DeBERTa, ALBERT and XLNet build directly on BERT's use of bidirectional context. Large generative models such as GPT-4, Gemini and Claude predict text left-to-right rather than bidirectionally, but they rest on the same lesson that BERT helped establish: deep pre-training on large corpora enables remarkable generalisation.