BERT Tutorial: Architecture, MLM & NSP

Section 01

The Story That Explains BERT

📖 Real World Analogy

The Blind Man Reading a Sentence — Left to Right, Forever

Imagine you are reading a mystery novel, but you are only allowed to read it one word at a time, strictly left to right, and you can never peek ahead. When you reach the word "bank" in the sentence "He sat on the bank of the river", you cannot yet tell whether it means a financial bank or a riverbank, because the clue ("river") only arrives after you have already had to interpret the word.

Now imagine a different reader who is given the entire sentence at once, but some words are hidden behind sticky notes. This reader can look left and right simultaneously to figure out what each hidden word probably says. When the hidden word is "bank", they can see "river" on the left and "muddy shore" on the right — and correctly guess "bank" in the geographical, not financial, sense.

That second reader is BERT.

BERT — Bidirectional Encoder Representations from Transformers — is a language model introduced by Google in 2018. Unlike earlier models that read text left-to-right (or right-to-left), BERT reads the entire sequence simultaneously, building a rich, context-aware representation for every token based on all surrounding tokens. It fundamentally changed the NLP landscape and forms the backbone of dozens of modern language understanding systems.

💡

The Core Insight

Language understanding requires bidirectional context. The word "bank" has completely different meanings depending on the words to its left and right simultaneously. GPT-style models read left-to-right and miss right-side context. BERT processes both directions at once, giving it a fundamentally richer understanding of meaning.

Section 02

The Transformer Architecture — BERT's Foundation

BERT is built entirely on the Transformer Encoder — specifically a stack of encoder blocks from the original "Attention Is All You Need" paper (Vaswani et al., 2017). Understanding the encoder is essential to understanding BERT.

⚙️ Inside a Single Transformer Encoder Block

Input

Token Embeddings + Segment Embeddings + Positional Embeddings are summed element-wise into a single vector per token

Layer 1

Multi-Head Self-Attention — each token attends to every other token; multiple attention heads capture different relationship types simultaneously

Add & Norm

Residual connection wraps the attention output, then Layer Normalisation is applied for training stability

Layer 2

Feed-Forward Network — two linear layers with a GELU activation in between; expands and contracts the representation

Add & Norm

Second residual + layer norm. Output: same shape as input — a contextualised vector for every token

Stack

BERT-Base repeats this block 12 times. BERT-Large repeats it 24 times. Each layer refines contextual understanding further

The Three Embedding Inputs

🔐

Token Embeddings

WordPiece Vocabulary

Each token is mapped to a 768-dimensional vector (BERT-Base) using a learned vocabulary of ~30,000 subword units. The special token [CLS] is always prepended — its final representation aggregates the entire sequence and is used for classification tasks.

📌

Segment Embeddings

Sentence A / Sentence B

When BERT receives two sentences (e.g. question + answer), every token in Sentence A receives the Embedding A vector, and every token in Sentence B receives Embedding B. This lets BERT distinguish which sentence each token belongs to.

📅

Positional Embeddings

Learned, not Sinusoidal

Unlike the original Transformer, BERT uses learned positional embeddings — one vector per position, up to 512 tokens. These vectors are added to the token embedding so the model knows where each token sits in the sequence, since attention has no built-in sense of order.

📌

The [CLS] and [SEP] Tokens

Every BERT input begins with [CLS] (Classification token) and sentences are separated and ended with [SEP] (Separator token). Format: [CLS] Sentence A [SEP] Sentence B [SEP]. The [CLS] token's final hidden state acts as the aggregate sequence representation for classification tasks. It sees every other token through self-attention and learns to summarise the whole input.

Section 03

BERT Variants — Base vs Large

Property	BERT-Base	BERT-Large
Encoder Layers (L)	12	24
Hidden Size (H)	768	1,024
Attention Heads (A)	12	16
Total Parameters	110 million	340 million
Pre-training Hardware	4 Cloud TPUs (16 TPU chips), ~4 days	16 Cloud TPUs (64 TPU chips), ~4 days
Typical Fine-tune Speed	Fast — fits on 1 GPU	Slow — needs multi-GPU or large VRAM
Best For	Most downstream tasks, production	State-of-the-art benchmarks
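
The parameter totals in the table can be sanity-checked with a little arithmetic. The sketch below counts weights and biases for the embeddings, the encoder stack, and the pooler layer on top of [CLS]. It assumes the 30,522-token uncased vocabulary, 512 positions, two segment types, and a feed-forward size of 4 × hidden size for both variants; the function name bert_encoder_params is made up for this illustration.

```python
def bert_encoder_params(vocab_size, hidden, layers, intermediate,
                        max_positions=512, type_vocab=2):
    """Count weights and biases in a BERT encoder plus its pooler."""
    layer_norm = 2 * hidden  # one scale vector and one shift vector

    # Token, position and segment embedding tables, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Self-attention: Q, K, V and output projections (hidden x hidden, plus bias).
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward: expand to `intermediate`, then contract back to `hidden`.
    ffn = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    per_layer = attention + ffn + 2 * layer_norm

    # Pooler: one dense layer applied to the final [CLS] vector.
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


base = bert_encoder_params(30522, hidden=768, layers=12, intermediate=3072)
large = bert_encoder_params(30522, hidden=1024, layers=24, intermediate=4096)
print(f"BERT-Base:  {base:,}")   # 109,482,240
print(f"BERT-Large: {large:,}")  # 335,141,888
```

Both totals land close to the rounded 110 million and 340 million figures in the table. The number of attention heads does not appear in the calculation because splitting the hidden size across heads does not change the size of the projection matrices.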

Section 04

Self-Attention — How BERT "Sees" Everything at Once

📖 Story

The Dinner Party Conversation

Imagine you are at a dinner party and you need to understand what someone means by the word "it". You naturally scan the entire conversation — who said what, what objects were mentioned, what the topic was — and you assign different levels of attention to different speakers and statements based on relevance.

Self-attention does exactly this for every token in a sequence. For the token "it", the mechanism computes how relevant every other token is, assigns a score, and produces a weighted combination of all their representations. The result: "it" now carries information from the noun it refers to.

Self-attention computes three vectors for every token — Query (Q), Key (K), and Value (V) — using learned weight matrices. The attention score between token i and token j is the dot product of token i's Query with token j's Key, scaled and then softmax-normalised.

Attention Score

score(i,j) = Q_i · K_j / √d_k

Dot product of query and key, scaled by √d_k so that large dot products do not push the softmax into regions where its gradients become extremely small

Attention Weights

α_ij = softmax(score(i,j))

Normalised weights across all positions j; they sum to 1.0 and represent relative importance

Output Vector

out_i = Σ α_ij · V_j

Weighted sum of all Value vectors — token i's new representation blends context from every token

Multi-Head

MultiHead = Concat(head_1,...,head_h) W^O

Multiple parallel attention heads capture different types of relationships; outputs are concatenated and projected
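
The score, weight and output formulas above fit in a few lines of plain Python. This is a minimal single-head sketch, not BERT's actual implementation: it takes Q, K and V as ready-made lists of vectors, whereas the real model produces them with learned projection matrices and runs many heads in parallel on tensors. The function names and the toy numbers are illustrative.

```python
import math


def softmax(xs):
    """Normalise a list of scores into weights that sum to 1."""
    m = max(xs)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # score(i, j) = Q_i . K_j / sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # alpha_ij = softmax over all positions j
        alpha = softmax(scores)
        # out_i = sum_j alpha_ij * V_j
        out = [sum(a * v[d] for a, v in zip(alpha, V)) for d in range(len(V[0]))]
        weights.append(alpha)
        outputs.append(out)
    return outputs, weights


# Three toy tokens with 2-dimensional queries, keys and values.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = self_attention(Q, K, V)
for i, row in enumerate(weights):
    print(f"token {i} attends with weights {[round(w, 3) for w in row]}")
```

Every row of weights sums to 1, and every output vector has the same dimension as the value vectors, which is why an encoder layer can hand its output straight to the next layer.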

⚡

Why Multi-Head Matters

With 12 attention heads per layer (BERT-Base), the model can track several kinds of relationship between tokens in parallel: one head might follow syntactic dependencies, another co-reference, another semantic similarity. No single head needs to do everything. Nobody assigns these roles; whatever specialisation emerges is learned during training.

Section 05

Pre-Training Task 1 — Masked Language Model (MLM)

📖 Real World Analogy

The Fill-in-the-Blank Test

Remember the cloze tests from English class? You were given a paragraph with certain words blanked out, and you had to fill them in using context. "The scientist discovered a new _____ in the Amazon rainforest." You'd write "species" because the surrounding context — Amazon, rainforest, discovered — makes it the most plausible word.

MLM is exactly this. BERT is given billions of such fill-in-the-blank puzzles, and in solving them, it learns a deep model of language.

In MLM, 15% of input tokens are randomly selected for masking before the text is fed into the model. The model must predict the original identity of those tokens using the full surrounding context. Crucially, the masking follows a specific strategy to reduce the mismatch between pre-training (where [MASK] appears) and fine-tuning (where it does not).

🎯 The 15% Masking Strategy — How It Works

Select

Randomly choose 15% of all tokens in the input sequence as candidates

80%

Replace the selected token with [MASK] — "The cat sat on the [MASK]" — standard masking

10%

Replace with a random word — "The cat sat on the television" — forces model to check the token against context every time

10%

Keep unchanged — "The cat sat on the mat" — model must still predict it, preventing over-reliance on [MASK] signal

Loss

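Cross-entropy loss is computed only on the 15% selected tokens, not on the whole sequence. Each sequence therefore provides fewer prediction targets than left-to-right language modelling does, so BERT needs more pre-training steps to converge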
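
The selection and 80/10/10 replacement steps above can be sketched in plain Python. This is a simplified illustration, not the original pre-processing code: it selects each token independently with probability 0.15 (so the selected share is 15% only on average), it never selects special tokens, and the names mask_tokens, SPECIAL and the toy vocabulary are assumptions made for this example.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}


def mask_tokens(tokens, vocab, select_prob=0.15, seed=None):
    """Return (corrupted, labels); labels[i] is None where no loss is computed."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_prob:
            continue  # not selected: no prediction target, no loss
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return corrupted, labels


tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corrupted, labels = mask_tokens(tokens, vocab=["dog", "river", "television"], seed=0)
print(corrupted)
print(labels)
```

The labels list doubles as the loss mask: positions holding None contribute nothing to the cross-entropy, and every other position is scored against its original token regardless of which of the three replacement branches it fell into.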

⚠️

The Pre-train / Fine-tune Mismatch Problem

During pre-training, [MASK] tokens appear in most input sequences. During fine-tuning on downstream tasks, they never appear. If every selected token were replaced with [MASK], the model could concentrate on making good predictions at [MASK] positions and would be poorly prepared for inputs that contain none. Because 10% of the selected tokens are swapped for a random token and 10% are left unchanged, the model cannot tell which positions it will be asked to predict, so it has to maintain a contextual representation of every input token.

MLM — Worked Example

Position	Original Token	Input to BERT	BERT Must Predict	Strategy
1	[CLS]	[CLS]	—	Not selected
2	The	The	—	Not selected
3	scientist	[MASK]	scientist	80% replace
4	discovered	discovered	—	Not selected
5	a	a	—	Not selected
6	new	river	new	10% random
7	species	species	species	10% unchanged
8	[SEP]	[SEP]	—	Not selected

Section 06

Pre-Training Task 2 — Next Sentence Prediction (NSP)

📖 Real World Analogy

The Editor's Coherence Check

A good editor can instantly tell when a paragraph does not logically follow from the one before it. If a biography of Einstein reads: "Einstein published the special theory of relativity in 1905." followed by "The Amazon rainforest covers 5.5 million square kilometres," — the editor flags it immediately. The sentences are coherent in isolation but completely unrelated to each other.

BERT is trained to be this editor: given two sentences A and B, it must decide whether B is the actual next sentence that follows A in the original document, or a random sentence from somewhere else.

NSP is a binary classification task. During pre-training, 50% of the time Sentence B is the true next sentence (IsNext), and 50% of the time it is a randomly sampled sentence from the corpus (NotNext). The final hidden state of the [CLS] token is passed through a linear layer + softmax to make this binary prediction.

📋 NSP — Input Construction Examples

IsNext

[CLS] The dog chased the cat. [SEP] It ran under the sofa. [SEP] → Label: IsNext

NotNext

[CLS] The dog chased the cat. [SEP] Photosynthesis occurs in chloroplasts. [SEP] → Label: NotNext

IsNext

[CLS] She opened the letter nervously. [SEP] Inside was the exam result she had been dreading. [SEP] → Label: IsNext

NotNext

[CLS] She opened the letter nervously. [SEP] The Eiffel Tower was built in 1889. [SEP] → Label: NotNext
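
The 50/50 construction of IsNext and NotNext pairs can be sketched as follows. This is a minimal illustration under stated assumptions: documents are already split into sentences, there are at least two non-empty documents, and negatives are always drawn from a different document. The function name make_nsp_examples is hypothetical.

```python
import random


def make_nsp_examples(documents, seed=None):
    """documents: list of documents, each a list of sentence strings.

    Returns a list of (sentence_a, sentence_b, label) triples.
    """
    rng = random.Random(seed)
    examples = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            sent_a = doc[i]
            if rng.random() < 0.5:
                # Positive pair: B really follows A in the same document.
                sent_b, label = doc[i + 1], "IsNext"
            else:
                # Negative pair: B is a random sentence from another document.
                others = [d for j, d in enumerate(documents) if j != doc_idx]
                sent_b, label = rng.choice(rng.choice(others)), "NotNext"
            examples.append((sent_a, sent_b, label))
    return examples


docs = [
    ["The dog chased the cat.", "It ran under the sofa."],
    ["She opened the letter nervously.",
     "Inside was the exam result she had been dreading."],
]
for a, b, label in make_nsp_examples(docs, seed=0):
    print(f"[CLS] {a} [SEP] {b} [SEP] -> {label}")
```

Because the negatives here come from unrelated documents, a model can often separate the two classes by topic alone, which is the weakness of NSP discussed in the controversy note that follows.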

🚨

Controversy — Is NSP Actually Helpful?

Later work (notably RoBERTa by Liu et al., 2019) found that removing NSP entirely and training with MLM alone, on longer sequences and more data, matches or slightly improves downstream performance. A common hypothesis is that NSP is too easy: because the negative examples come from completely different documents, the model can solve the task by detecting topic similarity rather than discourse coherence. Later BERT variants such as RoBERTa and DeBERTa drop NSP. The original BERT still uses it, and the ablations in the BERT paper reported that removing it hurt tasks that explicitly involve sentence pairs, such as Natural Language Inference and Question Answering.

Section 07

Pre-Training Data — What BERT Learned From

📚

BooksCorpus

~800 million words

~11,000 self-published books by not-yet-published authors, mostly fiction across many genres. Provides long contiguous text, which is critical for the model to learn long-range dependencies and multi-sentence discourse structure. Documents are long, so NSP pairs are meaningful within-book sequences.

🌐

English Wikipedia

~2,500 million words

Text-only content (no lists, tables, or headers). Wikipedia provides factual, encyclopaedic text covering an enormous range of topics — science, history, culture, biography. This teaches BERT real-world entity knowledge alongside linguistic structure.

⚡

Combined Size

~3.3 billion words total

Pre-training ran for 1 million steps with a batch size of 256 sequences × 512 tokens = 131,072 tokens per batch (the paper rounds this to 128,000), which amounts to roughly 40 epochs over the corpus. BERT-Base was trained on 4 Cloud TPUs (16 TPU chips) and BERT-Large on 16 Cloud TPUs (64 TPU chips); each run took about 4 days.

Section 08

BERT's Input Representation — The Full Picture

Every BERT input is constructed following a strict format. Understanding this format is essential when you implement fine-tuning, because you must replicate it exactly — the tokeniser and the model were built together.

📝 Complete Input Representation for a Sentence Pair Task

Tokens

[CLS] my dog is cute [SEP] he likes play ##ing [SEP]

Segment

A A A A A A B B B B B

Position

0 1 2 3 4 5 6 7 8 9 10

Sum

Token Embedding + Segment Embedding + Positional Embedding → 768-dim input vector per token
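
The three rows above can be generated mechanically from two token lists. The sketch below is a minimal illustration with a hypothetical helper, build_pair_input; it uses 0 and 1 for segments A and B, and it takes the second sentence with its last word already split into play and ##ing, so the sequence has 11 tokens to match the 11 segment and position entries.

```python
def build_pair_input(tokens_a, tokens_b):
    """Assemble tokens, segment ids and position ids for a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment A covers [CLS], sentence A and the first [SEP];
    # segment B covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Positions simply count from 0 across the whole sequence.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_pair_input(
    ["my", "dog", "is", "cute"],
    ["he", "likes", "play", "##ing"],
)
for tok, seg, pos in zip(tokens, segment_ids, position_ids):
    print(f"{pos:>2}  {'AB'[seg]}  {tok}")
```

In the model, each of these three id sequences indexes its own embedding table, and the three looked-up vectors are added together position by position.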

🔨

WordPiece Tokenisation

BERT does not operate on whole words. It uses WordPiece tokenisation, which splits rare words into subword units. The word "unbelievably" might become ["un", "##believ", "##ably"] (the ## prefix means "continuation of previous token"). This allows BERT to handle almost any word, including novel ones, because it can decompose them into known subword pieces; only text that cannot be decomposed at all falls back to the [UNK] token. The vocabulary is 30,522 tokens for English uncased BERT.
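
At inference time, WordPiece splitting is usually described as a greedy longest-match-first search over the vocabulary. The sketch below illustrates that idea with a tiny invented vocabulary; the function name wordpiece and the vocabulary contents are assumptions for this example, and the real tokeniser also handles lowercasing, punctuation and a maximum word length.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation of the previous piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##believ", "##ably", "##able", "rain", "##forest", "the"}
print(wordpiece("unbelievably", toy_vocab))  # ['un', '##believ', '##ably']
print(wordpiece("rainforest", toy_vocab))    # ['rain', '##forest']
print(wordpiece("xyz", toy_vocab))           # ['[UNK]']
```

The actual split of any given word depends entirely on which pieces made it into the trained vocabulary, so the output of a real BERT tokeniser may differ from this toy example.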

Section 09

Fine-Tuning — Adapting BERT to Your Task

Pre-training gives BERT a universal language model. Fine-tuning adapts that model to a specific downstream task by adding a small task-specific layer on top and training the whole stack end-to-end on labelled task data. Fine-tuning typically takes minutes to hours on a single GPU — orders of magnitude less compute than pre-training.

🎯

Text Classification

Sentiment · Spam · Topics

Feed a single sentence. Take the [CLS] token's final hidden state (768-dim). Add a linear layer → softmax. Train on labelled examples. Examples: sentiment analysis, spam detection, news topic classification.

🔍

Question Answering

Extractive · SQuAD style

Input: [CLS] Question [SEP] Context [SEP]. Add two output vectors to predict the start and end token of the answer span within the context. The model extracts the answer directly from the passage.

🏷

Named Entity Recognition

Token-level Classification

Take the final hidden state of every token (not just [CLS]). Add a linear layer → softmax per token to classify each as B-PER, I-ORG, O, etc. BERT's contextual embeddings make this dramatically more accurate than previous methods.
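
For the question answering head described above, the model emits one start score and one end score per token, and the answer is the span whose start and end scores sum highest, with the end not before the start. The sketch below shows that selection step in plain Python. The logits are invented numbers, the helper name best_answer_span and the max_answer_len cap are assumptions, and a real pipeline would also restrict the search to context tokens.

```python
def best_answer_span(start_logits, end_logits, max_answer_len=30):
    """Return ((start, end), score) for the highest-scoring valid span."""
    best, best_score = None, float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length cap.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score


tokens = ["[CLS]", "who", "wrote", "it", "[SEP]",
          "jane", "austen", "wrote", "it", "[SEP]"]
# Invented scores: high start score on "jane", high end score on "austen".
start_logits = [0.1, 0.0, 0.0, 0.0, 0.0, 4.2, 1.0, 0.3, 0.1, 0.0]
end_logits = [0.0, 0.0, 0.0, 0.0, 0.0, 0.9, 3.8, 0.2, 0.4, 0.0]

(span_start, span_end), score = best_answer_span(start_logits, end_logits)
print(tokens[span_start:span_end + 1])  # ['jane', 'austen']
```

Searching over pairs, rather than taking the best start and best end independently, rules out spans whose end falls before their start.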

🔥

Why Fine-Tuning is Powerful

Before BERT, building a high-quality NLP model typically required large task-specific labelled datasets, hand-crafted features, and long training runs. With BERT fine-tuning, a few thousand labelled examples and well under an hour on a single GPU can often produce strong results, and when BERT was released this recipe set the state of the art on many benchmarks. The pre-trained weights already encode rich linguistic knowledge; fine-tuning just steers that knowledge toward your task.

Section 10

BERT in Code — From Tokenisation to Fine-Tuning

Step 1 — Install and Load

# Install Hugging Face Transformers
# pip install transformers torch datasets

from transformers import BertTokenizer, BertModel, BertForSequenceClassification
import torch

# Load pre-trained BERT-Base (uncased)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model     = BertModel.from_pretrained('bert-base-uncased')

Step 2 — Tokenise Text

text = 'The scientist discovered a new species in the rainforest.'

# Tokenise and encode to input IDs
tokens = tokenizer.tokenize(text)
print('Tokens:', tokens)
# Output: ['the', 'scientist', 'discovered', 'a', 'new', 'species',
#          'in', 'the', 'rain', '##forest', '.']

# Full encoding with attention mask and special tokens
encoding = tokenizer(
    text,
    return_tensors='pt',        # PyTorch tensors
    padding=True,
    truncation=True,
    max_length=128
)

print('Input IDs shape:', encoding['input_ids'].shape)
print('Token IDs:', encoding['input_ids'])

OUTPUT

Tokens: ['the', 'scientist', 'discovered', 'a', 'new', 'species', 'in', 'the', 'rain', '##forest', '.']
Input IDs shape: torch.Size([1, 13])
Token IDs: tensor([[ 101, 1996, 3820, 3603, 1037, 2047, 3605, 1999, 1996, 4482, 18421, 1012, 102]])

Step 3 — Extract Contextual Embeddings

model.eval()

with torch.no_grad():
    outputs = model(**encoding)

# last_hidden_state: shape [batch, seq_len, 768]
last_hidden = outputs.last_hidden_state
print('All token embeddings shape:', last_hidden.shape)

# [CLS] token embedding — position 0
cls_embedding = last_hidden[:, 0, :]
print('CLS embedding shape:', cls_embedding.shape)

# Pooled output (another classification-ready representation)
pooled = outputs.pooler_output
print('Pooled output shape:', pooled.shape)

OUTPUT

All token embeddings shape: torch.Size([1, 13, 768])
CLS embedding shape: torch.Size([1, 768])
Pooled output shape: torch.Size([1, 768])

Step 4 — Fine-Tune for Sentiment Classification

from transformers import BertForSequenceClassification
from torch.optim import AdamW   # transformers.AdamW is deprecated; PyTorch's AdamW replaces it
from torch.utils.data import DataLoader, Dataset
import torch

# ── Toy dataset ──────────────────────────────────────────────
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.encodings = tokenizer(texts, truncation=True,
                                   padding=True, max_length=max_len,
                                   return_tensors='pt')
        self.labels = torch.tensor(labels)

    def __len__(self): return len(self.labels)

    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.encodings.items()}, self.labels[idx]

# Sample data
texts  = ['I loved this film!', 'Absolutely terrible.',
          'Best movie ever.',   'Waste of time.']
labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative

dataset    = SentimentDataset(texts, labels, tokenizer)
dataloader = DataLoader(dataset, batch_size=2)

# ── Model — BERT-Base + classification head ───────────────────
clf_model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)

optimizer = AdamW(clf_model.parameters(), lr=2e-5)

# ── Training loop (3 epochs) ─────────────────────────────────
clf_model.train()
for epoch in range(3):
    total_loss = 0
    for batch, batch_labels in dataloader:
        optimizer.zero_grad()
        outputs = clf_model(**batch, labels=batch_labels)
        loss     = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1} | Loss: {total_loss/len(dataloader):.4f}")

OUTPUT

Epoch 1 | Loss: 0.7243
Epoch 2 | Loss: 0.5812
Epoch 3 | Loss: 0.3971

Step 5 — Inference

clf_model.eval()

new_texts  = ['This is a masterpiece!', 'Completely boring.']
new_inputs = tokenizer(new_texts, return_tensors='pt',
                       padding=True, truncation=True)

with torch.no_grad():
    logits = clf_model(**new_inputs).logits

predictions = logits.argmax(dim=-1)
labels_map  = {0: 'Negative', 1: 'Positive'}

for text, pred in zip(new_texts, predictions):
    print(f"'{text}' → {labels_map[pred.item()]}")

OUTPUT

'This is a masterpiece!' → Positive
'Completely boring.' → Negative

Section 11

MLM vs NSP — A Side-by-Side Comparison

Property	Masked Language Model (MLM)	Next Sentence Prediction (NSP)
Task Type	Token prediction (multi-class)	Binary classification
What it Learns	Deep bidirectional word context	Sentence-pair relationship
Input	Single sequence with masked tokens	Two sentences joined with [SEP]
Output Head	Softmax over 30,522 vocabulary tokens	Softmax over {IsNext, NotNext}
Loss Computed On	Only the 15% of tokens selected for prediction	One prediction per input, made from the [CLS] final hidden state
Downstream Benefit	All NLP tasks — core representation	Sentence pairs: NLI, QA, STS
Kept in RoBERTa?	Yes, with dynamic masking	No: removing it matched or slightly improved results
Relative Importance	Critical — primary learning signal	Secondary — helps specific tasks

Section 12

BERT's Key Hyperparameters

Hyperparameter	BERT-Base Value	What It Controls
`hidden_size`	768	Dimension of all token representations throughout the model
`num_hidden_layers`	12	Number of Transformer encoder blocks stacked
`num_attention_heads`	12	Parallel attention heads per layer; each has dim = 768/12 = 64
`intermediate_size`	3072	Hidden size in the FFN (4 × hidden_size); expansion then contraction
`max_position_embeddings`	512	Maximum sequence length supported (tokens)
`vocab_size`	30,522	WordPiece vocabulary size (English uncased)
`hidden_dropout_prob`	0.1	Dropout applied to the embeddings and hidden states; attention weights have their own setting, `attention_probs_dropout_prob` (also 0.1)
Fine-tune `lr`	2e-5 → 5e-5	Learning rate — lower than typical; large LR destroys pre-trained weights
Fine-tune epochs	2 – 4	More epochs → overfitting on small datasets; 3 is a common default

Section 13

BERT vs Other Models — Where It Stands

Model	Architecture	Direction	Pre-train Task	Best For
BERT	Transformer Encoder	Bidirectional	MLM + NSP	Understanding tasks (classification, NER, QA)
GPT-2/3	Transformer Decoder	Left-to-right only	Causal LM	Text generation
RoBERTa	Transformer Encoder	Bidirectional	MLM only (no NSP)	Stronger than BERT on most benchmarks
ALBERT	Transformer Encoder	Bidirectional	MLM + SOP*	Far fewer parameters through cross-layer weight sharing; memory-constrained settings
DistilBERT	Transformer Encoder (6L)	Bidirectional	Distilled from BERT	60% faster, 40% smaller, 97% of BERT's accuracy
T5	Encoder-Decoder	Bidirectional encoder	Text-to-Text	All tasks framed as text generation

* SOP = Sentence Order Prediction — a harder version of NSP used in ALBERT.

Section 14

BERT's Impact — Benchmark Results

When BERT was released, it achieved state-of-the-art on 11 NLP tasks simultaneously — an unprecedented leap. Here are representative results from the original paper:

Benchmark	Task Type	Previous SOTA	BERT-Base	BERT-Large
GLUE (average, test)	Multi-task NLU	75.1	79.6	82.1
SQuAD 1.1 (F1)	Extractive QA	91.7	88.5 (dev, single model)	93.2 (test, ensemble + TriviaQA)
SQuAD 2.0 (F1, test)	QA with unanswerable	78.0	not reported	83.1
MultiNLI (matched)	Natural Language Inference	82.1%	84.6%	86.7%
SST-2	Sentiment (2-class)	93.2%	93.5%	94.9%

🏆

Why These Numbers Were Revolutionary

The improvements were not marginal. BERT broke records on 11 tasks using the same pre-trained model with only a different output head for each task. Previously, the strongest systems usually relied on architectures designed for one task. BERT showed that a single pre-trained representation could transfer across tasks, and it marked the moment NLP became a transfer-learning field like computer vision.

Section 15

Common Mistakes and How to Avoid Them

⚡ BERT — Non-Negotiable Rules for Practitioners

Always use the tokeniser that matches the model. bert-base-uncased uses an uncased tokeniser that lowercases all input. Using a cased tokeniser on an uncased model (or vice versa) silently produces gibberish embeddings. Use AutoTokenizer.from_pretrained(model_name) to guarantee the correct match.

Do not use a large learning rate. Typical neural network training uses lr ~ 1e-3. Fine-tuning BERT requires lr in the range 2e-5 to 5e-5. A large learning rate can wipe out the pre-trained weights within the first few steps, and the model may end up performing no better than chance.

Always pass the attention mask. When you pad sequences to equal length, the padding tokens should not participate in self-attention. The attention_mask tensor (1 = real token, 0 = padding) tells the model which positions to ignore. The Hugging Face tokeniser returns it by default alongside input_ids, so passing the whole encoding to the model, as in model(**encoding), is enough.

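Respect the 512-token limit. BERT has position embeddings for only 512 positions, so it cannot process longer sequences; passing one without truncation typically fails with an indexing error. For long documents, use truncation (truncation=True, max_length=512), sliding window approaches, or switch to a long-context model like Longformer or BigBird. Keep in mind that truncation silently discards everything past the limit, so the model never sees information located there.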

Fine-tune for 2–4 epochs only. BERT overfits rapidly on small labelled datasets. The original paper found 2–4 epochs optimal for most tasks. Use early stopping based on validation loss, not training loss.

Use a linear learning rate warmup. Train with a scheduler that warms up the learning rate for the first 10% of steps, then linearly decays. HuggingFace's get_linear_schedule_with_warmup handles this. Without warmup, the model destabilises in the first steps when the task head outputs random values and gradients are large.

For most tasks, prefer RoBERTa over BERT. RoBERTa is trained on 10× more data, uses dynamic masking in MLM, removes NSP, and uses larger batch sizes. It outperforms BERT on nearly every benchmark with identical fine-tuning code. Use roberta-base as your drop-in replacement via HuggingFace.
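
The warmup-then-linear-decay schedule recommended in the rules above is simple enough to write out by hand, which helps when checking what a library scheduler is doing. The sketch below is an illustration only: the function name linear_warmup_decay and its default arguments are assumptions, with the 10% warmup share and the 2e-5 peak taken from the text.

```python
def linear_warmup_decay(step, total_steps, warmup_frac=0.1, peak_lr=2e-5):
    """Learning rate at `step`: linear ramp up, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Warmup: grow from 0 to peak_lr over the first warmup_steps.
        return peak_lr * step / warmup_steps
    # Decay: shrink linearly from peak_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total = 1000
for step in (0, 50, 100, 550, 1000):
    print(f"step {step:>4}: lr = {linear_warmup_decay(step, total):.2e}")
```

With 1,000 total steps the rate climbs to its peak at step 100 and reaches zero at step 1,000. The low rate in the earliest steps gives the randomly initialised task head time to settle before it can send large gradients into the pre-trained layers.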

Section 16

The BERT Pipeline — Complete Overview

Raw Text Corpus

BooksCorpus + English Wikipedia — ~3.3B words of clean natural language text. Documents are split into sentence pairs for NSP construction.

WordPiece Tokenisation

Text is tokenised into subword units using a 30,522-token vocabulary. [CLS] and [SEP] special tokens are inserted. Sequences are padded or truncated to 512 tokens.

Input Embedding Construction

Token embeddings + segment embeddings + positional embeddings are summed to produce a 768-dim vector per token. For MLM, 15% of the tokens are selected before this lookup, and most of them are replaced with [MASK].

12 Transformer Encoder Layers

Each layer applies Multi-Head Self-Attention (all tokens attend to all tokens bidirectionally) then a Feed-Forward Network, with residual connections and layer norm throughout. Output: contextualised embedding per token.

Dual Pre-Training Loss

MLM head predicts the original identity of masked tokens (cross-entropy over vocab). NSP head uses [CLS] hidden state to classify IsNext/NotNext. Both losses are summed and backpropagated together.

Fine-Tuning on Downstream Task

Pre-trained weights are loaded. A task-specific output head is added (linear layer). The full model is trained end-to-end for 2–4 epochs on labelled task data with a small learning rate (2e-5 to 5e-5). The pre-trained knowledge transfers immediately.
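
The Dual Pre-Training Loss step in the pipeline above can be sketched numerically: average the cross-entropy over the selected MLM positions only, then add the NSP cross-entropy. The code below is a plain-Python illustration with made-up logits over a 4-token toy vocabulary; the function names and the use of None to mark unselected positions are assumptions for this example.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_label):
    """Mean MLM loss over selected positions plus the NSP loss."""
    # Positions whose label is None were not selected and are skipped.
    selected = [(l, t) for l, t in zip(mlm_logits, mlm_labels) if t is not None]
    mlm = sum(cross_entropy(l, t) for l, t in selected) / max(1, len(selected))
    nsp = cross_entropy(nsp_logits, nsp_label)
    return mlm + nsp


# Three positions, toy vocabulary of size 4; only position 1 was selected.
mlm_logits = [[0.2, 0.1, 0.0, 0.3], [2.0, 0.5, 0.1, 0.1], [0.0, 0.0, 0.0, 0.0]]
mlm_labels = [None, 0, None]
nsp_logits = [1.5, -0.5]  # index 0 = IsNext, index 1 = NotNext (a convention chosen here)
loss = pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_label=0)
print(f"combined loss: {loss:.4f}")
```

Because the two terms are simply added, a single backward pass updates the shared encoder for both objectives at once.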

🌟

BERT's Lasting Legacy

BERT did not just improve NLP; it changed how NLP is done. Pre-training on raw text and then adapting the model to a task is now the standard paradigm for virtually every language task. Encoder models such as RoBERTa, DeBERTa, ALBERT and XLNet refine or respond to BERT's recipe and confirm how much bidirectional context helps language understanding. Generative models such as GPT-4, Gemini and Claude use a left-to-right objective instead, but they rest on the same broader lesson that BERT helped establish: deep pre-training on large corpora enables remarkable generalisation.