
BERT Explained: Architecture, Masked Language Modelling

A comprehensive, story-driven tutorial on BERT — covering its Transformer encoder architecture, the three input embeddings, how Masked Language Modelling and Next Sentence Prediction work, fine-tuning strategies, and full Python code from tokenisation to sentiment classification.

Section 01

The Story That Explains BERT

The Blind Man Reading a Sentence — Left to Right, Forever
Imagine you are reading a mystery novel, but you are only allowed to read it one word at a time, strictly left to right, and you can never peek ahead. When you reach the word "bank" in the sentence "He sat on the bank of the river", you have to decide what it means before you have seen the word that settles it: "river" only arrives later.

Now imagine a different reader who is given the entire sentence at once, but some words are hidden behind sticky notes. This reader can look left and right simultaneously to figure out what each hidden word probably says. When the hidden word is "bank", they can see "river" on the left and "muddy shore" on the right — and correctly guess "bank" in the geographical, not financial, sense.

That second reader is BERT.

BERT — Bidirectional Encoder Representations from Transformers — is a language model introduced by Google in 2018. Unlike earlier models that read text left-to-right (or right-to-left), BERT reads the entire sequence simultaneously, building a rich, context-aware representation for every token based on all surrounding tokens. It fundamentally changed the NLP landscape and forms the backbone of dozens of modern language understanding systems.

💡
The Core Insight

Language understanding requires bidirectional context. The word "bank" has completely different meanings depending on the words to its left and right simultaneously. GPT-style models read left-to-right and miss right-side context. BERT processes both directions at once, giving it a fundamentally richer understanding of meaning.


Section 02

The Transformer Architecture — BERT's Foundation

BERT is built entirely on the Transformer Encoder — specifically a stack of encoder blocks from the original "Attention Is All You Need" paper (Vaswani et al., 2017). Understanding the encoder is essential to understanding BERT.

⚙️ Inside a Single Transformer Encoder Block
Input
Token Embeddings + Segment Embeddings + Positional Embeddings are summed element-wise into a single vector per token
Layer 1
Multi-Head Self-Attention — each token attends to every other token; multiple attention heads capture different relationship types simultaneously
Add & Norm
Residual connection wraps the attention output, then Layer Normalisation is applied for training stability
Layer 2
Feed-Forward Network — two linear layers with a GELU activation in between; expands and contracts the representation
Add & Norm
Second residual + layer norm. Output: same shape as input — a contextualised vector for every token
Stack
BERT-Base repeats this block 12 times. BERT-Large repeats it 24 times. Each layer refines contextual understanding further
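
The two "Add & Norm" steps and the GELU activation described above can be sketched for a single token vector in plain Python. This is a minimal illustration under stated assumptions, not BERT's actual implementation: the function names are hypothetical, the learned scale and shift parameters of layer normalisation are left out, and the "sub-layer output" is just GELU applied to the input rather than a real attention or feed-forward result.

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def layer_norm(vec, eps=1e-12):
    # Normalise one token vector to zero mean and unit variance
    # (learned scale and shift parameters are omitted in this sketch)
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(x, sublayer_out):
    # Residual connection (x + sublayer output), then layer normalisation
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

x = [0.5, -1.0, 2.0, 0.0]              # toy 4-dim token vector
sublayer_out = [gelu(v) for v in x]    # stand-in for a real sub-layer output
y = add_and_norm(x, sublayer_out)
print(y)
```

The output has the same length as the input, which is what allows the block to be stacked 12 or 24 times.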

The Three Embedding Inputs

🔐
Token Embeddings
WordPiece Vocabulary
Each token is mapped to a 768-dimensional vector (BERT-Base) using a learned vocabulary of ~30,000 subword units. The special token [CLS] is always prepended — its final representation aggregates the entire sequence and is used for classification tasks.
📌
Segment Embeddings
Sentence A / Sentence B
When BERT receives two sentences (e.g. question + answer), every token in Sentence A receives the Embedding A vector, and every token in Sentence B receives Embedding B. This lets BERT distinguish which sentence each token belongs to.
📅
Positional Embeddings
Learned, not Sinusoidal
Unlike the original Transformer, BERT uses learned positional embeddings — one vector per position, up to 512 tokens. These vectors are added to the token embedding so the model knows where each token sits in the sequence, since attention has no built-in sense of order.
📌
The [CLS] and [SEP] Tokens

Every BERT input begins with [CLS] (Classification token) and sentences are separated and ended with [SEP] (Separator token). Format: [CLS] Sentence A [SEP] Sentence B [SEP]. The [CLS] token's final hidden state acts as the aggregate sequence representation for classification tasks. It sees every other token through self-attention and learns to summarise the whole input.


Section 03

BERT Variants — Base vs Large

| Property | BERT-Base | BERT-Large |
|---|---|---|
| Encoder Layers (L) | 12 | 24 |
| Hidden Size (H) | 768 | 1,024 |
| Attention Heads (A) | 12 | 16 |
| Total Parameters | 110 million | 340 million |
| Pre-training Hardware | 4 Cloud TPUs (16 TPU chips), ~4 days | 16 Cloud TPUs (64 TPU chips), ~4 days |
| Typical Fine-tune Speed | Fast, fits on 1 GPU | Slow, needs multi-GPU or large VRAM |
| Best For | Most downstream tasks, production | State-of-the-art benchmarks |
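
The parameter totals in the table can be roughly reproduced from L, H and the vocabulary size. The sketch below is an approximation under assumptions: it counts the embeddings, the encoder layers and the pooling layer on top of [CLS], assumes a feed-forward size of 4 × H and a 30,522-token vocabulary, and ignores the pre-training heads. The function name is hypothetical.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, segments=2):
    ffn = 4 * hidden                                   # feed-forward inner size
    # Token, position and segment embedding tables, plus one LayerNorm
    embeddings = (vocab + max_pos + segments) * hidden + 2 * hidden
    # Query, key, value and output projections (weights + biases)
    attention = 4 * (hidden * hidden + hidden)
    # Two linear layers: hidden -> ffn -> hidden
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)                     # two LayerNorms per block
    per_layer = attention + feed_forward + layer_norms
    pooler = hidden * hidden + hidden                  # dense layer on [CLS]
    return embeddings + layers * per_layer + pooler

base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT-Base:  {base:,}")    # about 110 million
print(f"BERT-Large: {large:,}")   # about 335 million, usually quoted as 340 million
```

Most of the parameters sit in the encoder layers; the embedding tables account for roughly a fifth of BERT-Base.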

Section 04

Self-Attention — How BERT "Sees" Everything at Once

The Dinner Party Conversation
Imagine you are at a dinner party and you need to understand what someone means by the word "it". You naturally scan the entire conversation — who said what, what objects were mentioned, what the topic was — and you assign different levels of attention to different speakers and statements based on relevance.

Self-attention does exactly this for every token in a sequence. For the token "it", the mechanism computes how relevant every other token is, assigns a score, and produces a weighted combination of all their representations. The result: "it" now carries information from the noun it refers to.

Self-attention computes three vectors for every token — Query (Q), Key (K), and Value (V) — using learned weight matrices. The attention score between token i and token j is the dot product of token i's Query with token j's Key, scaled and then softmax-normalised.

Attention Score
score(i,j) = Q_i · K_j / √d_k
Dot product of query and key, scaled by √d_k to prevent gradient vanishing in the softmax
Attention Weights
α_ij = softmax(score(i,j))
Normalised weights across all positions j; they sum to 1.0 and represent relative importance
Output Vector
out_i = Σ α_ij · V_j
Weighted sum of all Value vectors — token i's new representation blends context from every token
Multi-Head
MultiHead = Concat(head_1,...,head_h) W^O
Multiple parallel attention heads capture different types of relationships; outputs are concatenated and projected
Why Multi-Head Matters

With 12 attention heads per layer (BERT-Base), the model can track several kinds of relationships between tokens in parallel. Analyses of trained models suggest that some heads lean towards syntactic dependencies, others towards co-reference or positional patterns. No single head needs to do everything, and this division of labour emerges during training rather than being designed in.
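
The score, weight and output formulas in this section can be sketched for a single attention head in plain Python. This is a toy illustration: the Q, K and V values are made up, and in a real model they would come from learned projections of the token representations.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention for one head; one row per token."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # score(i, j) = Q_i . K_j / sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        alpha = softmax(scores)            # weights over all positions j
        # out_i = sum_j alpha_ij * V_j
        out = [sum(a * v[d] for a, v in zip(alpha, V)) for d in range(len(V[0]))]
        weights.append(alpha)
        outputs.append(out)
    return outputs, weights

# Toy example: 3 tokens, d_k = 2 (values are illustrative only)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = self_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1, and every output vector is a blend of all Value vectors, which is the sense in which a token "sees" the whole sequence.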


Section 05

Pre-Training Task 1 — Masked Language Model (MLM)

The Fill-in-the-Blank Test
Remember the cloze tests from English class? You were given a paragraph with certain words blanked out, and you had to fill them in using context. "The scientist discovered a new _____ in the Amazon rainforest." You'd write "species" because the surrounding context — Amazon, rainforest, discovered — makes it the most plausible word.

MLM is exactly this. BERT is given billions of such fill-in-the-blank puzzles, and in solving them, it learns a deep model of language.

In MLM, 15% of input tokens are randomly selected for masking before the text is fed into the model. The model must predict the original identity of those tokens using the full surrounding context. Crucially, the masking follows a specific strategy to reduce the mismatch between pre-training (where [MASK] appears) and fine-tuning (where it does not).

🎯 The 15% Masking Strategy — How It Works
Select
Randomly choose 15% of all tokens in the input sequence as candidates
80%
Replace the selected token with [MASK]: "The cat sat on the [MASK]" (standard masking)
10%
Replace with a random token: "The cat sat on the television" (the model cannot assume that a visible token is correct, so it must check every token against its context)
10%
Keep unchanged: "The cat sat on the mat" (the model must still predict it, which prevents over-reliance on the [MASK] signal)
Loss
Cross-entropy loss is computed only on the 15% selected tokens, not on the whole sequence. Each pass therefore yields fewer training signals than a left-to-right language model, which is one reason MLM pre-training needs many steps to converge
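
The selection and 80/10/10 replacement rule above can be sketched as follows. This is a simplified illustration, not the original pre-processing code: the function name, the toy vocabulary and the choice to always select at least one token are assumptions made for the example.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted inputs, labels); labels are None where no prediction is made."""
    inputs = list(tokens)
    labels = [None] * len(tokens)
    # Special tokens are never selected
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL]
    n_select = min(len(candidates), max(1, round(len(candidates) * select_prob)))
    for i in rng.sample(candidates, n_select):
        labels[i] = tokens[i]               # the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK                # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels

toy_vocab = ["the", "cat", "sat", "on", "mat", "television", "river"]
tokens = "[CLS] the cat sat on the mat [SEP]".split()
inputs, labels = mask_tokens(tokens, toy_vocab, random.Random(0))
print(inputs)
print(labels)
```

Positions with a label of None contribute nothing to the loss; only the selected positions are predicted.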
⚠️
The Pre-train / Fine-tune Mismatch Problem

During pre-training, [MASK] appears in most training sequences. During fine-tuning on downstream tasks, it never appears. If selected tokens were always replaced with [MASK], the model could learn to build useful representations only at [MASK] positions and neglect ordinary tokens. The 10% random + 10% unchanged strategy forces the model to develop contextual representations for every token at every position, regardless of whether it was actually masked.

MLM — Worked Example

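| Position | Original Token | Input to BERT | BERT Must Predict | Strategy |
|---|---|---|---|---|
| 1 | [CLS] | [CLS] | (nothing) | Not selected |
| 2 | The | The | (nothing) | Not selected |
| 3 | scientist | [MASK] | scientist | 80% replace |
| 4 | discovered | discovered | (nothing) | Not selected |
| 5 | a | a | (nothing) | Not selected |
| 6 | new | river | new | 10% random |
| 7 | species | species | species | 10% unchanged |
| 8 | [SEP] | [SEP] | (nothing) | Not selected |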
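
To make the "loss only on selected tokens" rule concrete, here is a small sketch that averages cross-entropy over positions 3, 6 and 7 of the worked example and skips the rest. The probabilities are invented for illustration, and the function name is hypothetical.

```python
import math

def mlm_loss(predicted_probs, labels):
    """Mean cross-entropy over selected positions only.

    predicted_probs[i] maps candidate tokens to probabilities at position i;
    labels[i] is the original token if position i was selected, else None.
    """
    losses = [
        -math.log(probs[label])
        for probs, label in zip(predicted_probs, labels)
        if label is not None            # unselected positions are skipped
    ]
    return sum(losses) / len(losses) if losses else 0.0

# Labels follow the worked example; probabilities are made up
labels = [None, None, "scientist", None, None, "new", "species", None]
predicted_probs = [
    {}, {},
    {"scientist": 0.50, "explorer": 0.30},   # position 3: [MASK]
    {}, {},
    {"new": 0.25, "river": 0.05},            # position 6: random replacement
    {"species": 0.80},                       # position 7: unchanged
    {},
]
loss = mlm_loss(predicted_probs, labels)
print(round(loss, 4))
```

The model's outputs at the five unselected positions never enter the loss, even though those tokens still shape the context for the selected ones.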

Section 06

Pre-Training Task 2 — Next Sentence Prediction (NSP)

The Editor's Coherence Check
A good editor can instantly tell when a paragraph does not logically follow from the one before it. If a biography of Einstein reads: "Einstein published the special theory of relativity in 1905." followed by "The Amazon rainforest covers 5.5 million square kilometres," — the editor flags it immediately. The sentences are coherent in isolation but completely unrelated to each other.

BERT is trained to be this editor: given two sentences A and B, it must decide whether B is the actual next sentence that follows A in the original document, or a random sentence from somewhere else.

NSP is a binary classification task. During pre-training, 50% of the time Sentence B is the true next sentence (IsNext), and 50% of the time it is a randomly sampled sentence from the corpus (NotNext). The final hidden state of the [CLS] token is passed through a linear layer + softmax to make this binary prediction.

📋 NSP — Input Construction Examples
IsNext
[CLS] The dog chased the cat. [SEP] It ran under the sofa. [SEP] → Label: IsNext
NotNext
[CLS] The dog chased the cat. [SEP] Photosynthesis occurs in chloroplasts. [SEP] → Label: NotNext
IsNext
[CLS] She opened the letter nervously. [SEP] Inside was the exam result she had been dreading. [SEP] → Label: IsNext
NotNext
[CLS] She opened the letter nervously. [SEP] The Eiffel Tower was built in 1889. [SEP] → Label: NotNext
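
A minimal sketch of how such pairs could be generated from a corpus: half of the time the true next sentence is used, otherwise a sentence is drawn from a different document. The function name and the tiny corpus are assumptions for illustration, and the real pre-processing works on token spans rather than whole sentences.

```python
import random

def make_nsp_examples(documents, rng):
    """Build (sentence_a, sentence_b, label) triples; needs at least two documents."""
    examples = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            sent_a = doc[i]
            if rng.random() < 0.5:
                # Positive example: the sentence that really follows A
                sent_b, label = doc[i + 1], "IsNext"
            else:
                # Negative example: a random sentence from another document
                others = [d for j, d in enumerate(documents) if j != doc_idx]
                sent_b, label = rng.choice(rng.choice(others)), "NotNext"
            examples.append((sent_a, sent_b, label))
    return examples

documents = [
    ["The dog chased the cat.", "It ran under the sofa."],
    ["She opened the letter nervously.", "Inside was the exam result she had been dreading."],
    ["Photosynthesis occurs in chloroplasts.", "It converts light into chemical energy."],
]
for sent_a, sent_b, label in make_nsp_examples(documents, random.Random(0)):
    print(f"[CLS] {sent_a} [SEP] {sent_b} [SEP] -> {label}")
```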
🚨
Controversy — Is NSP Actually Helpful?

Later work (notably RoBERTa by Liu et al., 2019) showed that removing NSP entirely and using only MLM with longer sequences and more data matches or improves downstream performance. The hypothesis: NSP is too easy. Since the negative examples come from completely different documents, the model can learn to detect topic similarity rather than discourse coherence. Several later variants (RoBERTa, DeBERTa) drop NSP, and ALBERT replaces it with a harder sentence-ordering task. The original BERT still uses it, and the ablations in the original paper reported that it helped on sentence-pair tasks such as Natural Language Inference and Question Answering.


Section 07

Pre-Training Data — What BERT Learned From

📚
BooksCorpus
~800 million words
~11,000 unpublished books covering fiction and non-fiction. Provides long contiguous text — critical for the model to learn long-range dependencies and multi-sentence discourse structure. Documents are long, so NSP pairs are meaningful within-book sequences.
🌐
English Wikipedia
~2,500 million words
Text-only content (no lists, tables, or headers). Wikipedia provides factual, encyclopaedic text covering an enormous range of topics — science, history, culture, biography. This teaches BERT real-world entity knowledge alongside linguistic structure.
Combined Size
~3.3 billion words total
Pre-training ran for 1 million steps with a batch size of 256 sequences × 512 tokens = 131,072 tokens per batch (the paper rounds this to 128,000). BERT-Base was trained on 4 Cloud TPUs (16 TPU chips) for 4 days; BERT-Large was trained on 16 Cloud TPUs (64 TPU chips) for 4 days.
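
The batch arithmetic can be checked in a few lines. Treating one WordPiece token as one word is a simplifying assumption (subword splitting produces more tokens than words), so the epoch figure is only a rough estimate.

```python
batch_sequences = 256
sequence_length = 512
steps = 1_000_000
corpus_words = 3.3e9        # BooksCorpus + English Wikipedia, approximate

tokens_per_batch = batch_sequences * sequence_length
total_tokens = tokens_per_batch * steps
approx_epochs = total_tokens / corpus_words   # assumes 1 token per word

print(tokens_per_batch)             # 131072
print(f"{total_tokens:.2e}")        # 1.31e+11
print(round(approx_epochs))         # 40
```

Under that assumption the model passes over the corpus about 40 times, which is in line with the figure given in the original paper.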

Section 08

BERT's Input Representation — The Full Picture

Every BERT input is constructed following a strict format. Understanding this format is essential when you implement fine-tuning, because you must replicate it exactly — the tokeniser and the model were built together.

📝 Complete Input Representation for a Sentence Pair Task
Tokens
[CLS]  my  dog  is  cute  [SEP]  he  likes  play  ##ing  [SEP]
Segment
A      A   A    A   A     A      B   B      B     B      B
Position
0      1   2    3   4     5      6   7      8     9      10
Sum
Token Embedding + Segment Embedding + Positional Embedding → 768-dim input vector per token
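
The layout above can be generated with a small helper. This sketch assumes the two sentences are already split into WordPiece tokens (with "playing" split into "play" and "##ing") and uses 0 and 1 for segments A and B; the function name is hypothetical.

```python
def build_pair_input(tokens_a, tokens_b):
    """Return tokens, segment ids and position ids for a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment A covers [CLS], sentence A and the first [SEP];
    # segment B covers sentence B and the final [SEP]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = build_pair_input(
    ["my", "dog", "is", "cute"],
    ["he", "likes", "play", "##ing"],
)
for tok, seg, pos in zip(tokens, segment_ids, position_ids):
    print(f"{pos:>2}  {'AB'[seg]}  {tok}")
```

In the model, each of the three id sequences indexes its own embedding table, and the three looked-up vectors are summed per token.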
🔨
WordPiece Tokenisation

BERT does not operate on whole words. It uses WordPiece tokenisation, which splits rare words into subword units. The word "unbelievably" might become ["un", "##believ", "##ably"] (the ## prefix means "continuation of previous token"). This allows BERT to handle almost any word, including novel ones, because it can decompose them into known subword pieces; only words containing characters outside the vocabulary fall back to [UNK]. The vocabulary has 30,522 tokens for English uncased BERT.
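
The splitting itself is usually described as a greedy longest-match-first search over the vocabulary. The sketch below shows that idea on a toy vocabulary; the vocabulary contents are made up, the function name is hypothetical, and a real tokeniser also lowercases, strips accents and splits punctuation beforehand.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]                       # no piece fits: whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"un", "##believ", "##ably", "##able", "rain", "##forest", "the"}
print(wordpiece("unbelievably", toy_vocab))   # ['un', '##believ', '##ably']
print(wordpiece("rainforest", toy_vocab))     # ['rain', '##forest']
print(wordpiece("xyz", toy_vocab))            # ['[UNK]']
```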


Section 09

Fine-Tuning — Adapting BERT to Your Task

Pre-training gives BERT a universal language model. Fine-tuning adapts that model to a specific downstream task by adding a small task-specific layer on top and training the whole stack end-to-end on labelled task data. Fine-tuning typically takes minutes to hours on a single GPU — orders of magnitude less compute than pre-training.

🎯
Text Classification
Sentiment · Spam · Topics
Feed a single sentence. Take the [CLS] token's final hidden state (768-dim). Add a linear layer → softmax. Train on labelled examples. Examples: sentiment analysis, spam detection, news topic classification.
🔍
Question Answering
Extractive · SQuAD style
Input: [CLS] Question [SEP] Context [SEP]. Add two output vectors to predict the start and end token of the answer span within the context. The model extracts the answer directly from the passage.
🏷
Named Entity Recognition
Token-level Classification
Take the final hidden state of every token (not just [CLS]). Add a linear layer → softmax per token to classify each as B-PER, I-ORG, O, etc. BERT's contextual embeddings make this dramatically more accurate than previous methods.
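
For extractive question answering, the start and end scores still have to be turned into one answer span. A common approach, sketched here with made-up logits and a hypothetical function name, is to pick the pair (start, end) with the highest combined score subject to the end not preceding the start.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) of the highest-scoring valid span."""
    best = (None, None, float("-inf"))
    for s, s_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best

# Illustrative logits for a 5-token context
start_logits = [0.1, 0.2, 3.0, 0.5, 0.1]
end_logits   = [0.0, 4.0, 0.3, 2.5, 0.2]
print(best_span(start_logits, end_logits))   # (2, 3, 5.5)
```

Taking the argmax of each list independently would give start 2 and end 1 here, which is not a valid span; the joint search avoids that.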
🔥
Why Fine-Tuning is Powerful

Before BERT, building a high-quality NLP model typically required large task-specific labelled datasets, hand-crafted features or custom architectures, and long training runs. With BERT fine-tuning, a few thousand labelled examples and under an hour on a single GPU can often produce strong results. The pre-trained weights already encode rich linguistic knowledge, and fine-tuning just steers that knowledge toward your task.


Section 10

BERT in Code — From Tokenisation to Fine-Tuning

Step 1 — Install and Load

# Install Hugging Face Transformers
# pip install transformers torch datasets

from transformers import BertTokenizer, BertModel, BertForSequenceClassification
import torch

# Load pre-trained BERT-Base (uncased)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model     = BertModel.from_pretrained('bert-base-uncased')

Step 2 — Tokenise Text

text = 'The scientist discovered a new species in the rainforest.'

# Tokenise and encode to input IDs
tokens = tokenizer.tokenize(text)
print('Tokens:', tokens)
# Output: ['the', 'scientist', 'discovered', 'a', 'new', 'species',
#          'in', 'the', 'rain', '##forest', '.']

# Full encoding with attention mask and special tokens
encoding = tokenizer(
    text,
    return_tensors='pt',        # PyTorch tensors
    padding=True,
    truncation=True,
    max_length=128
)

print('Input IDs shape:', encoding['input_ids'].shape)
print('Token IDs:', encoding['input_ids'])
OUTPUT
Tokens: ['the', 'scientist', 'discovered', 'a', 'new', 'species', 'in', 'the', 'rain', '##forest', '.']
Input IDs shape: torch.Size([1, 13])
Token IDs: tensor([[ 101, 1996, 3820, 3603, 1037, 2047, 3605, 1999, 1996, 4482, 18421, 1012, 102]])
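
With a single sentence, padding=True has nothing to pad. Its effect only shows with a batch of different lengths, which the following plain-Python sketch imitates. The helper name is hypothetical and the token IDs are placeholders, apart from 101 and 102 for [CLS] and [SEP] and 0 for [PAD], which match bert-base-uncased.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that attention should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

batch = [
    [101, 7, 8, 9, 102],   # placeholder IDs for a 3-token sentence
    [101, 7, 102],         # placeholder IDs for a 1-token sentence
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)        # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(attention_mask)   # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```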

Step 3 — Extract Contextual Embeddings

model.eval()

with torch.no_grad():
    outputs = model(**encoding)

# last_hidden_state: shape [batch, seq_len, 768]
last_hidden = outputs.last_hidden_state
print('All token embeddings shape:', last_hidden.shape)

# [CLS] token embedding — position 0
cls_embedding = last_hidden[:, 0, :]
print('CLS embedding shape:', cls_embedding.shape)

# Pooled output (another classification-ready representation)
pooled = outputs.pooler_output
print('Pooled output shape:', pooled.shape)
OUTPUT
All token embeddings shape: torch.Size([1, 13, 768])
CLS embedding shape: torch.Size([1, 768])
Pooled output shape: torch.Size([1, 768])

Step 4 — Fine-Tune for Sentiment Classification

from transformers import BertForSequenceClassification
from torch.optim import AdamW   # the AdamW class in transformers is deprecated; PyTorch's works the same way here
from torch.utils.data import DataLoader, Dataset
import torch

# ── Toy dataset ──────────────────────────────────────────────
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.encodings = tokenizer(texts, truncation=True,
                                   padding=True, max_length=max_len,
                                   return_tensors='pt')
        self.labels = torch.tensor(labels)

    def __len__(self): return len(self.labels)

    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.encodings.items()}, self.labels[idx]

# Sample data
texts  = ['I loved this film!', 'Absolutely terrible.',
          'Best movie ever.',   'Waste of time.']
labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative

dataset    = SentimentDataset(texts, labels, tokenizer)
dataloader = DataLoader(dataset, batch_size=2)

# ── Model — BERT-Base + classification head ───────────────────
clf_model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)

optimizer = AdamW(clf_model.parameters(), lr=2e-5)

# ── Training loop (3 epochs) ─────────────────────────────────
clf_model.train()
for epoch in range(3):
    total_loss = 0
    for batch, batch_labels in dataloader:
        optimizer.zero_grad()
        outputs = clf_model(**batch, labels=batch_labels)
        loss     = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1} | Loss: {total_loss/len(dataloader):.4f}")
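OUTPUT (illustrative; the classification head is randomly initialised, so exact values vary from run to run)
Epoch 1 | Loss: 0.7243
Epoch 2 | Loss: 0.5812
Epoch 3 | Loss: 0.3971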
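
The loop above keeps the learning rate constant. A schedule with linear warmup followed by linear decay is often used instead when fine-tuning BERT. The function below sketches only the shape of such a schedule in plain Python; its name, the 10% warmup fraction and the step counts are assumptions for illustration, and in practice a library scheduler attached to the optimiser would do this job.

```python
def linear_warmup_decay(step, total_steps, peak_lr=2e-5, warmup_fraction=0.1):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp from 0 up to peak_lr over the warmup steps
        return peak_lr * step / warmup_steps
    # Decay linearly from peak_lr to 0 over the remaining steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, f"{linear_warmup_decay(step, total_steps):.2e}")
```

The small initial steps give the freshly initialised classification head time to settle before the pre-trained layers receive full-sized updates.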

Step 5 — Inference

clf_model.eval()

new_texts  = ['This is a masterpiece!', 'Completely boring.']
new_inputs = tokenizer(new_texts, return_tensors='pt',
                       padding=True, truncation=True)

with torch.no_grad():
    logits = clf_model(**new_inputs).logits

predictions = logits.argmax(dim=-1)
labels_map  = {0: 'Negative', 1: 'Positive'}

for text, pred in zip(new_texts, predictions):
    print(f"'{text}' → {labels_map[pred.item()]}")
OUTPUT
'This is a masterpiece!' → Positive
'Completely boring.' → Negative

Section 11

MLM vs NSP — A Side-by-Side Comparison

| Property | Masked Language Model (MLM) | Next Sentence Prediction (NSP) |
|---|---|---|
| Task Type | Token prediction (multi-class) | Binary classification |
| What it Learns | Deep bidirectional word context | Sentence-pair relationship |
| Input | Single sequence with masked tokens | Two sentences joined with [SEP] |
| Output Head | Softmax over 30,522 vocabulary tokens | Softmax over {IsNext, NotNext} |
| Loss Computed On | Only the 15% selected tokens | [CLS] token final hidden state |
| Downstream Benefit | All NLP tasks (core representation) | Sentence pairs: NLI, QA, STS |
| Kept in RoBERTa? | Yes, extended with dynamic masking | No, removed; found unnecessary |
| Relative Importance | Critical, primary learning signal | Secondary, helps specific tasks |

Section 12

BERT's Key Hyperparameters

| Hyperparameter | BERT-Base Value | What It Controls |
|---|---|---|
| hidden_size | 768 | Dimension of all token representations throughout the model |
| num_hidden_layers | 12 | Number of Transformer encoder blocks stacked |
| num_attention_heads | 12 | Parallel attention heads per layer; each has dim = 768/12 = 64 |
| intermediate_size | 3072 | Hidden size in the FFN (4 × hidden_size); expansion then contraction |
| max_position_embeddings | 512 | Maximum sequence length supported (tokens) |
| vocab_size | 30,522 | WordPiece vocabulary size (English uncased) |
| hidden_dropout_prob | 0.1 | Dropout on embeddings and sub-layer outputs; dropout on attention weights is set separately by attention_probs_dropout_prob (also 0.1) |
| Fine-tune lr | 2e-5 to 5e-5 | Learning rate; lower than typical, because a large LR destroys pre-trained weights |
| Fine-tune epochs | 2 to 4 | More epochs lead to overfitting on small datasets; 3 is a common default |

Section 13

BERT vs Other Models — Where It Stands

| Model | Architecture | Direction | Pre-train Task | Best For |
|---|---|---|---|---|
| BERT | Transformer Encoder | Bidirectional | MLM + NSP | Understanding tasks (classification, NER, QA) |
| GPT-2/3 | Transformer Decoder | Left-to-right only | Causal LM | Text generation |
| RoBERTa | Transformer Encoder | Bidirectional | MLM only (no NSP) | Stronger than BERT on most benchmarks |
| ALBERT | Transformer Encoder | Bidirectional | MLM + SOP* | Far fewer parameters through weight sharing; memory-constrained settings (not necessarily faster at inference) |
| DistilBERT | Transformer Encoder (6L) | Bidirectional | Distilled from BERT | 60% faster, 40% smaller, 97% of BERT's accuracy |
| T5 | Encoder-Decoder | Bidirectional encoder | Span corruption (text-to-text) | All tasks framed as text generation |

* SOP = Sentence Order Prediction — a harder version of NSP used in ALBERT.


Section 14

BERT's Impact — Benchmark Results

When BERT was released, it achieved state-of-the-art on 11 NLP tasks simultaneously — an unprecedented leap. Here are representative results from the original paper:

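| Benchmark | Task Type | Previous SOTA | BERT-Base | BERT-Large |
|---|---|---|---|---|
| GLUE (average) | Multi-task NLU | 75.1 | 79.6 | 82.1 |
| SQuAD 1.1 (F1) | Extractive QA | 91.7 (test, ensemble) | 88.5 (dev, single model) | 93.2 (test, ensemble + TriviaQA) |
| SQuAD 2.0 (F1) | QA with unanswerable | 78.0 (test) | not reported | 83.1 (test) |
| MultiNLI (matched) | Natural Language Inference | 82.1% | 84.6% | 86.7% |
| SST-2 | Sentiment (2-class) | 93.2% | 93.5% | 94.9% |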
🏆
Why These Numbers Were Revolutionary

The improvements were not marginal. BERT broke records on 11 tasks using exactly the same model with a different output head for each task. Previously, each task required a custom-built architecture from scratch. BERT proved that a single pre-trained representation could transfer universally — the moment NLP became a transfer-learning field like computer vision.


Section 15

Common Mistakes and How to Avoid Them

⚡ BERT — Non-Negotiable Rules for Practitioners
1
Always use the tokeniser that matches the model. bert-base-uncased uses an uncased tokeniser that lowercases all input. Using a cased tokeniser on an uncased model (or vice versa) silently produces gibberish embeddings. Use AutoTokenizer.from_pretrained(model_name) to guarantee the correct match.
2
Do not use a large learning rate. Typical neural network training uses lr ~ 1e-3. Fine-tuning BERT requires lr in the range 2e-5 to 5e-5. A large learning rate can wipe out the pre-trained weights in the first few steps, after which the model may do no better than chance.
3
Always add an attention mask. When you pad sequences to equal length, the padding tokens should not participate in self-attention. The attention_mask tensor (1 = real token, 0 = padding) must always be passed to the model. The tokeniser returns it by default alongside input_ids, so passing the whole encoding with model(**encoding) is enough.
4
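Respect the 512-token limit. BERT cannot process sequences longer than 512 tokens, because it has no positional embeddings beyond that point. For long documents, use truncation (truncation=True, max_length=512), sliding window approaches, or switch to a long-context model like Longformer or BigBird. Untruncated inputs longer than 512 tokens typically fail with an indexing error, while truncation silently discards everything past the limit, which can drop the very passage your task depends on.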
5
Fine-tune for 2–4 epochs only. BERT overfits rapidly on small labelled datasets. The original paper found 2–4 epochs optimal for most tasks. Use early stopping based on validation loss, not training loss.
6
Use a linear learning rate warmup. Train with a scheduler that warms up the learning rate for the first 10% of steps, then linearly decays. HuggingFace's get_linear_schedule_with_warmup handles this. Without warmup, the model destabilises in the first steps when the task head outputs random values and gradients are large.
7
For most tasks, prefer RoBERTa over BERT. RoBERTa is trained on 10× more data, uses dynamic masking in MLM, removes NSP, and uses larger batch sizes. It outperforms BERT on nearly every benchmark with nearly identical fine-tuning code. Use roberta-base as a near drop-in replacement via HuggingFace, loading the model and tokeniser through the Auto classes so that the matching tokeniser is picked up.
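
The sliding window approach mentioned in rule 4 can be sketched as follows. This is a minimal illustration on already-tokenised IDs: the function name and the overlap of 128 tokens are assumptions, the body IDs are placeholders, and 101 and 102 are the [CLS] and [SEP] IDs of bert-base-uncased.

```python
def sliding_windows(token_ids, max_len=512, overlap=128, cls_id=101, sep_id=102):
    """Split a long sequence into overlapping windows of at most max_len tokens."""
    body_len = max_len - 2            # leave room for [CLS] and [SEP]
    step = body_len - overlap         # how far each window advances
    if step <= 0:
        raise ValueError("overlap must be smaller than max_len - 2")
    windows, start = [], 0
    while True:
        chunk = token_ids[start:start + body_len]
        windows.append([cls_id] + chunk + [sep_id])
        if start + body_len >= len(token_ids):
            break                     # this window reached the end of the input
        start += step
    return windows

long_document = list(range(1000, 2200))     # 1,200 placeholder token IDs
windows = sliding_windows(long_document)
print(len(windows), [len(w) for w in windows])   # 3 [512, 512, 438]
```

Each window is then run through the model separately, and the per-window predictions are combined afterwards, for example by taking the maximum or the mean, depending on the task.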

Section 16

The BERT Pipeline — Complete Overview

01
Raw Text Corpus
BooksCorpus + English Wikipedia — ~3.3B words of clean natural language text. Documents are split into sentence pairs for NSP construction.
02
WordPiece Tokenisation
Text is tokenised into subword units using a 30,522-token vocabulary. [CLS] and [SEP] special tokens are inserted. Sequences are padded or truncated to 512 tokens.
03
Input Embedding Construction
Token embeddings + segment embeddings + positional embeddings are summed to produce a 768-dim vector per token. For MLM, 15% of positions are selected and corrupted (mostly with [MASK]) before the tokens are embedded.
04
12 Transformer Encoder Layers
Each layer applies Multi-Head Self-Attention (all tokens attend to all tokens bidirectionally) then a Feed-Forward Network, with residual connections and layer norm throughout. Output: contextualised embedding per token.
05
Dual Pre-Training Loss
MLM head predicts the original identity of masked tokens (cross-entropy over vocab). NSP head uses [CLS] hidden state to classify IsNext/NotNext. Both losses are summed and backpropagated together.
06
Fine-Tuning on Downstream Task
Pre-trained weights are loaded. A task-specific output head is added (linear layer). The full model is trained end-to-end for 2–4 epochs on labelled task data with a small learning rate (2e-5 to 5e-5). The pre-trained knowledge transfers immediately.
🌟
BERT's Lasting Legacy

BERT did not just improve NLP; it changed how NLP is done. It made pre-training on raw text followed by fine-tuning on labelled data the standard paradigm for language understanding tasks. Successors such as RoBERTa, DeBERTa, ALBERT and XLNet build on or respond to its design and share its insight that understanding benefits from bidirectional context. Generative models such as GPT-4, Gemini and Claude produce text left to right rather than bidirectionally, but they reflect the broader lesson that BERT helped establish: deep pre-training on large corpora enables remarkable generalisation.