Word Embeddings Tutorial: Word2Vec, GloVe

Section 01

The Story That Explains Word Embeddings

📖 Real World Analogy

The Library Card Catalogue — Meaning Lives in Neighbours

Imagine a vast library where every book is given a numbered shelf position — not by title, but by who borrows it together with other books. Books on Python and Machine Learning end up near each other. Books on King and Queen are shelved close. Paris and France are practically neighbours.

One day a librarian notices something magical: the distance between "King" and "Queen" on the shelf is almost identical to the distance between "Man" and "Woman". The direction from "Paris" to "France" is the same direction as from "Berlin" to "Germany". Geography is encoded in shelf positions. Grammar is encoded in shelf positions. Meaning itself is encoded in shelf positions.

That is the entire secret of Word Embeddings: convert words into numbers (coordinates on a shelf), arranged so that meaning becomes geometry.

Computers cannot read text. They can only process numbers. The naive solution — assigning each word an arbitrary integer (cat = 1, dog = 2, love = 3) — tells the model nothing about the relationship between words. Word Embeddings solve this by representing every word as a dense vector of real numbers — a point in high-dimensional space — where the position encodes semantic and syntactic meaning.

🧠

The Distributional Hypothesis — The Philosophical Bedrock

All embedding techniques rest on a single idea from 1950s linguistics: the distributional hypothesis (Zellig Harris, 1954), which J. R. Firth summed up in 1957 as "You shall know a word by the company it keeps." Words that appear in similar contexts tend to have similar meanings. Word2Vec, GloVe, and FastText are all different engineering implementations of this one insight.

Section 02

Why Not Just One-Hot Encode Words?

Before embeddings existed, NLP used one-hot encoding: a vector of all zeros except a single 1 at the word's index. With a vocabulary of 50,000 words, every word is a 50,000-dimensional vector. This works but fails catastrophically at scale.

❌ One-Hot Encoding — The Old Way

Word	Vector (simplified, V=5)
king	[1, 0, 0, 0, 0]
queen	[0, 1, 0, 0, 0]
man	[0, 0, 1, 0, 0]
woman	[0, 0, 0, 1, 0]
apple	[0, 0, 0, 0, 1]

✅ Word Embeddings — The Modern Way (dim=4)

Word	Vector
king	[0.65, 0.12, -0.43, 0.87]
queen	[0.61, -0.21, -0.40, 0.83]
man	[0.42, 0.35, -0.10, 0.21]
woman	[0.38, -0.14, -0.07, 0.18]
apple	[-0.71, 0.02, 0.93, -0.55]

🟡

Problem 1 — Sparsity

50,000-dim vectors with one 1

With a 50,000-word vocabulary, each word is a vector of 49,999 zeros and one 1. A dense implementation wastes enormous compute multiplying by zero. Storing the full set of one-hot vectors densely costs 50,000 × 50,000 = 2.5 billion floats.

⛔

Problem 2 — No Similarity

cosine(king, queen) = 0.0

The cosine similarity between any two distinct one-hot vectors is exactly 0. The model is told that "dog" and "cat" are as different as "dog" and "democracy". Semantic relationships are completely invisible.

🚫

Problem 3 — No Generalisation

Unseen words = dead end

If the model has never seen "astrophysics" during training, the word has no index at all and is typically mapped to a generic unknown token or an all-zero vector. One-hot representations cannot transfer knowledge between related words. Every word starts from scratch.

💡

Embeddings Solve All Three

Dense embeddings use 50–300 dimensions instead of 50,000. Every dimension carries information, so there is no sparsity. Similar words have similar vectors, so cosine(king, queen) is high instead of 0 (about 0.72 with the pre-trained GloVe vectors used later in this tutorial). And unseen words can be approximated from their morphology (FastText) or estimated from context.
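To make the contrast concrete, here is a minimal sketch that compares cosine similarity under both representations. It reuses the illustrative 4-dimensional vectors from the table above (they are hand-written examples, not trained values), and the helper name cosine is just a local convenience.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors for the 5-word toy vocabulary (same order as the table).
vocab = ["king", "queen", "man", "woman", "apple"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Illustrative dense vectors copied from the table (not trained values).
dense = {
    "king":  np.array([0.65, 0.12, -0.43, 0.87]),
    "queen": np.array([0.61, -0.21, -0.40, 0.83]),
    "man":   np.array([0.42, 0.35, -0.10, 0.21]),
    "woman": np.array([0.38, -0.14, -0.07, 0.18]),
    "apple": np.array([-0.71, 0.02, 0.93, -0.55]),
}

onehot_sim = cosine(one_hot["king"], one_hot["queen"])   # always 0 for distinct words
dense_sim_close = cosine(dense["king"], dense["queen"])  # close to 1
dense_sim_far = cosine(dense["king"], dense["apple"])    # negative

# Multiplying a one-hot vector by an embedding matrix only selects one row,
# which is why real implementations use an index lookup instead.
W = np.stack([dense[w] for w in vocab])   # shape (5, 4)
looked_up = one_hot["queen"] @ W

print(onehot_sim, round(dense_sim_close, 3), round(dense_sim_far, 3))
```

The last two lines also show why the 50,000-dimensional multiplication is unnecessary: the product of a one-hot vector and the embedding matrix is simply the matching row.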

Section 03

Word2Vec — Teaching a Neural Network to Predict Context

📖 Story

The Fill-in-the-Blank Game

Imagine a child learning language by playing a fill-in-the-blank game with billions of sentences. The rule is simple: given the words around a blank, guess the missing word — or given a word, guess the words that surround it. After millions of rounds, the child's mental model of each word becomes incredibly precise. Not from definitions, but purely from pattern of co-occurrence.

Word2Vec is that child — a shallow neural network trained on exactly this game, at internet scale. The learned weights become the word vectors. The network is then discarded; only the weights survive.

Proposed by Tomas Mikolov and colleagues at Google in 2013, Word2Vec is a shallow two-layer neural network trained on a text corpus. It comes in two architectures: CBOW (Continuous Bag of Words) and Skip-gram. Both produce the same type of output, dense word vectors, but via opposite prediction tasks.

📊 Word2Vec — Two Architectures Side by Side

Both architectures use the same corpus. The embedding matrix (hidden layer weights) is what we keep after training. The prediction task is just the training signal.

Section 04

CBOW — Continuous Bag of Words

🌎

Task: Given surrounding words, predict the centre word

Sentence: "The quick ??? fox jumps"
CBOW input: ["The", "quick", "fox", "jumps"] → CBOW output: "brown"
The context window defines how many neighbours on each side to use (window=2 in this example; window=5 is a common default).

How CBOW Processes a Sentence — Step by Step

Tokenise & Build Vocabulary

Scan the entire corpus. Assign each unique word an index. Remove words below a minimum frequency threshold (e.g. min_count=5). A typical Wikipedia corpus yields ~500,000 unique tokens.

Slide the Context Window

For every word in the corpus, collect its neighbours within the window. With window=2 around "fox": context = [quick, brown, jumps, over]. This produces billions of (context, target) training pairs from a large corpus.

Average the Context Vectors

Each context word is looked up in the embedding matrix (random initially). The context vectors are averaged — this is what makes it a "bag": word order is ignored. The averaged vector is the hidden layer.

Predict & Backpropagate

The hidden vector is projected through the output matrix (V × d) and a softmax is applied over the vocabulary to produce a probability distribution. Cross-entropy loss against the true centre word drives backpropagation. Gradients update both embedding matrices.

Extract the Embedding Matrix

After training, the input embedding matrix W (V × d) contains the final word vectors. The output matrix is discarded. Each row of W is the embedding for one word.
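The steps above can be sketched in a few lines of NumPy. This is a toy forward pass only, under stated assumptions: the names cbow_pairs, cbow_forward, W_in and W_out are hypothetical, the embedding size d=8 is arbitrary, and the backpropagation step is left out. Real implementations replace the full softmax with negative sampling or hierarchical softmax.

```python
import numpy as np

def cbow_pairs(tokens, window):
    """Yield (context_words, centre_word) pairs for one tokenised sentence."""
    for i, centre in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        if left or right:
            yield left + right, centre

tokens = ["the", "quick", "brown", "fox", "jumps", "over"]
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 8                      # vocabulary size, embedding size (arbitrary)

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, d))   # input embeddings (kept after training)
W_out = rng.normal(scale=0.1, size=(V, d))  # output matrix (discarded after training)

def cbow_forward(context, centre):
    """Average the context vectors, apply softmax, return (probs, loss)."""
    h = W_in[[idx[w] for w in context]].mean(axis=0)   # hidden layer: order is ignored
    scores = W_out @ h                                 # one score per vocabulary word
    scores = scores - scores.max()                     # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()      # softmax over the vocabulary
    loss = -np.log(probs[idx[centre]])                 # cross-entropy for the true centre
    return probs, loss

pairs = list(cbow_pairs(tokens, window=2))
context, centre = pairs[3]                # the training pair centred on "fox"
probs, loss = cbow_forward(context, centre)
print(context, "->", centre, "| loss:", round(float(loss), 4))
```

With window=2 the pair centred on "fox" has the context [quick, brown, jumps, over], matching the sliding-window step described above.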

🔎

Why CBOW is Faster

CBOW averages the context vectors before prediction, so it makes a single prediction per training window. Skip-gram makes one prediction per context word, so it performs roughly C times more predictions (C = number of context words, up to 2 × window). As a result CBOW typically trains several times faster on large corpora and tends to do slightly better on frequent words.

Section 05

Skip-gram — The Reverse Problem

▶️

Task: Given the centre word, predict each surrounding word

Sentence: "The quick brown fox jumps"
Skip-gram input: "brown" → outputs: ["The", "quick", "fox", "jumps"]
One input word must predict up to 2 × window separate output words (fewer at sentence boundaries).

Skip-gram treats each (centre, context_word) as a separate training pair. Given window=2 and the word "brown", it generates four training examples: (brown, The), (brown, quick), (brown, fox), (brown, jumps). This makes Skip-gram better at learning representations for rare words because each occurrence produces more training signal.
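The pair generation described above is easy to reproduce. The sketch below uses a hypothetical helper, skipgram_pairs, with a fixed window; the reference Word2Vec implementation additionally shrinks the window at random for each centre word, which is omitted here.

```python
def skipgram_pairs(tokens, window):
    """Return every (centre, context_word) pair within the given window."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs

tokens = ["The", "quick", "brown", "fox", "jumps"]
pairs = skipgram_pairs(tokens, window=2)
brown_pairs = [p for p in pairs if p[0] == "brown"]
print(brown_pairs)
```

Each of these pairs becomes its own training example, which is where Skip-gram's extra training signal for rare words comes from.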

🔄 Negative Sampling — Solving the Softmax Bottleneck

Problem

Full softmax over 500,000 words is computed at every step — O(V) per update. With billions of training pairs, this is prohibitively slow.

Solution

Instead of updating all V output weights, draw k=5–20 random negative samples (words drawn from a noise distribution, so they are very unlikely to be the true context word) and update only their weights alongside the one positive sample.

Sampling

Negative words are sampled according to P(w) ∝ freq(w)^0.75 — the 3/4 power smooths the distribution, giving rare words a fighting chance of being selected.

Result

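Training becomes O(k) per update instead of O(V). With k=5 vs V=500,000, each update touches 6 output vectors instead of 500,000, a reduction of roughly five orders of magnitude in output-layer work, which makes billion-word corpora tractable on a single machine.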
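A small sketch of both pieces, the smoothed noise distribution and the per-pair loss, may help. The word counts are hypothetical, the names noise and sgns_loss are local to this example, and the loss shown is the standard skip-gram negative-sampling objective for a single (centre, context) pair. Real implementations also redraw a negative that happens to equal the true context word, which is skipped here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw counts for a toy vocabulary.
counts = {"the": 1000, "fox": 50, "quick": 30, "brown": 20, "aardvark": 1}
words = list(counts)
freq = np.array([counts[w] for w in words], dtype=float)

unigram = freq / freq.sum()      # plain frequency distribution
noise = freq ** 0.75             # 3/4 power flattens the distribution
noise /= noise.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_centre, u_context, U_negative):
    """Negative-sampling loss for one positive pair and k negative words."""
    positive = -np.log(sigmoid(u_context @ v_centre))
    negative = -np.log(sigmoid(-(U_negative @ v_centre))).sum()
    return float(positive + negative)

d, k = 8, 5
W_in = rng.normal(scale=0.1, size=(len(words), d))
W_out = rng.normal(scale=0.1, size=(len(words), d))

centre, context = words.index("brown"), words.index("fox")
neg_ids = rng.choice(len(words), size=k, p=noise)     # k draws from the noise distribution
loss = sgns_loss(W_in[centre], W_out[context], W_out[neg_ids])

for w, p_raw, p_smooth in zip(words, unigram, noise):
    print(f"{w:10s} raw={p_raw:.4f} smoothed={p_smooth:.4f}")
```

Only k + 1 rows of W_out take part in the loss, so the cost per update no longer depends on the vocabulary size.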

Property	CBOW	Skip-gram
Prediction task	Context → Centre word	Centre → Context words
Training speed	Fast (single pass per window)	Slower (C passes per window)
Rare word quality	Lower (rare = few training examples)	Higher (each occurrence → C pairs)
Frequent word quality	Slightly better	Similar
Best for	Large corpora, frequent vocabulary	Small corpora, rare/specialised words
Memory use	Lower	Higher

Section 06

Word Arithmetic — The Magic of Embedding Space

📖 The Famous Equation

King − Man + Woman = Queen

This is not a metaphor. This is actual vector arithmetic. The vector for "king" minus the vector for "man" gives you a direction in space that roughly means "royalty without maleness". Add the vector for "woman" and you arrive at a point whose nearest neighbour in the vocabulary is "queen".

The same trick works for: Paris − France + Italy ≈ Rome, run − running + swimming ≈ swim, good − better + worse ≈ bad. Relationships that linguists spent decades cataloguing are spontaneously encoded in the geometry.

🌍 Vector Arithmetic in Embedding Space (2D Projection)

The parallelogram shows that the gender direction (man→woman) and the royalty direction (man→king) are independent and composable. This structure emerges from co-occurrence statistics, not from any programmed rule.
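The arithmetic can be tried by hand on the illustrative 4-dimensional vectors from Section 02. This is only a sketch: those vectors are hand-written examples rather than trained embeddings, and analogy is a hypothetical helper that excludes the three query words, as analogy evaluations usually do.

```python
import numpy as np

# Illustrative vectors from the Section 02 table (not trained values).
vectors = {
    "king":  np.array([0.65, 0.12, -0.43, 0.87]),
    "queen": np.array([0.61, -0.21, -0.40, 0.83]),
    "man":   np.array([0.42, 0.35, -0.10, 0.21]),
    "woman": np.array([0.38, -0.14, -0.07, 0.18]),
    "apple": np.array([-0.71, 0.02, 0.93, -0.55]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vectors):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = vectors[a] - vectors[b] + vectors[c]
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        score = cosine(target, vec)
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score

word, score = analogy("king", "man", "woman", vectors)
print(word, round(score, 3))
```

Excluding the query words matters in practice: without that filter the nearest neighbour of the target is often "king" itself.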

Section 07

Word2Vec in Python — Full Working Example

from gensim.models import Word2Vec
import numpy as np

# ── 1. Tokenised corpus (each sentence is a list of words) ──
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["she", "is", "a", "queen", "and", "he", "is", "a", "king"],
    ["paris", "is", "the", "capital", "of", "france"],
    ["berlin", "is", "the", "capital", "of", "germany"],
    ["machine", "learning", "is", "a", "subset", "of", "artificial", "intelligence"],
]

# ── 2. Train CBOW model ──────────────────────────────────────
cbow_model = Word2Vec(
    sentences    = sentences,
    vector_size  = 100,      # embedding dimension
    window       = 5,        # context window size
    min_count    = 1,        # include words seen at least once
    sg           = 0,        # 0 = CBOW, 1 = Skip-gram
    epochs       = 100,
    seed         = 42
)

# ── 3. Train Skip-gram model ─────────────────────────────────
sg_model = Word2Vec(
    sentences    = sentences,
    vector_size  = 100,
    window       = 5,
    min_count    = 1,
    sg           = 1,        # 1 = Skip-gram
    negative     = 5,        # negative samples
    epochs       = 100,
    seed         = 42
)

# ── 4. Access word vectors ───────────────────────────────────
king_vec = cbow_model.wv["king"]        # shape: (100,)
print(f"Vector shape: {king_vec.shape}")

# ── 5. Find most similar words ───────────────────────────────
similar = cbow_model.wv.most_similar("king", topn=3)
for word, score in similar:
    print(f"  {word:12s}: {score:.4f}")

# ── 6. Vector arithmetic (king - he + she; the toy corpus has no "man"/"woman") ──
result = cbow_model.wv.most_similar(
    positive=["king", "she"],
    negative=["he"],
    topn=1
)
print(f"king - he + she ≈ {result[0][0]}")

# ── 7. Cosine similarity ─────────────────────────────────────
sim = cbow_model.wv.similarity("king", "queen")
print(f"Similarity(king, queen) = {sim:.4f}")

OUTPUT

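Vector shape: (100,)
  queen       : 0.9821
  she         : 0.9104
king - he + she ≈ queen
Similarity(king, queen) = 0.9821

Note: these values are illustrative and vary between runs on a corpus this small. most_similar() never returns the query word itself, so the third neighbour printed is another word from the corpus, not king with a score of 1.0000.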

⚠️

Small Corpus Warning

The example above uses a tiny corpus of 5 sentences — results will look impressive but are not meaningful. Word2Vec needs millions of sentences to learn useful representations. For production use, pre-trained models (Google News 300d, Wikipedia) are the right starting point. Load them with KeyedVectors.load_word2vec_format().

Section 08

GloVe — Global Vectors for Word Representation

📖 Story

The Statistician vs The Detective

Word2Vec is a detective: it looks at one window of text at a time and updates its guesses. It never steps back to see the full picture.

GloVe is a statistician: it first reads the entire corpus, counts every pair of words that appear near each other (building a giant co-occurrence matrix X), and then fits a model to explain that matrix all at once. By seeing global statistics before training begins, GloVe knows exactly how often "ice" and "steam" both appear near "water" — a relationship no single sentence can reveal.

Published by Pennington, Socher, and Manning at Stanford in 2014, GloVe (Global Vectors) combines the best of two worlds: the semantic richness of matrix factorisation methods and the shallow-network efficiency of Word2Vec.

The GloVe Objective Function

Co-occurrence Matrix

X_ij = count(i near j)

X_ij is how many times word j appears in the context of word i across the entire corpus. This matrix is computed first, before any training.
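As a rough sketch of this counting step, the hypothetical helper below builds X as a sparse dictionary from a two-sentence toy corpus. It uses plain counts to match the definition above; the reference GloVe implementation instead weights each co-occurrence by 1/distance, which is an extra detail not shown here.

```python
from collections import defaultdict

def cooccurrence(sentences, window):
    """Count how often each ordered word pair appears within `window` tokens."""
    X = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    X[(word, tokens[j])] += 1.0
    return dict(X)     # only non-zero entries are stored

# Toy corpus (made up for illustration).
sentences = [
    ["ice", "is", "cold", "water"],
    ["steam", "is", "hot", "water"],
]
X = cooccurrence(sentences, window=2)

print(X[("is", "water")])              # counted once in each sentence
print(X.get(("ice", "water"), 0.0))    # three tokens apart, outside the window
```

Storing only the non-zero pairs keeps memory proportional to the number of observed co-occurrences rather than to V × V.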

GloVe Loss Function

J = Σ f(X_ij)(wᵢᵀw̃ⱼ + bᵢ + b̃ⱼ − log X_ij)²

Minimise the squared difference between the dot product of two word vectors (plus biases) and the log of their co-occurrence count. The sum runs over all pairs with X_ij > 0. f(X_ij) is a weighting function that caps the influence of very common pairs and shrinks the influence of rare, noisy ones.

Weighting Function

f(x) = (x/x_max)^α if x < x_max, else 1

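Typically x_max=100, α=0.75. Every pair that co-occurs at least x_max times gets the same weight of 1, so extremely common pairs (like "the" + "and") cannot dominate the loss. Rarer pairs get a weight below 1 that shrinks towards 0 as the count falls, and since f(0) = 0, pairs that never co-occur drop out of the sum entirely.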

Final Representation

embedding = (W + W̃) / 2

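GloVe trains two matrices W and W̃ (word and context vectors). The original paper uses their sum as the final embedding; averaging, as written here, differs only by a constant factor of 2. Combining the two typically yields better performance than using either alone.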
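The pieces of the objective fit together as in the sketch below, which evaluates the loss once for random parameters. Everything here is illustrative: the 3 × 3 count matrix is made up, the function names glove_weight and glove_loss are hypothetical, and no optimiser is included (the reference implementation minimises this loss with AdaGrad).

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max)^alpha below x_max, and 1 at or above it."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_ctx, b, b_ctx):
    """Weighted squared error between w_i . w~_j + b_i + b~_j and log X_ij."""
    i, j = np.nonzero(X)                                   # only pairs with X_ij > 0
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    err = pred - np.log(X[i, j])
    return float(np.sum(glove_weight(X[i, j]) * err ** 2))

# Hypothetical co-occurrence counts for a 3-word vocabulary.
X = np.array([[0.0, 2.0, 150.0],
              [2.0, 0.0, 10.0],
              [150.0, 10.0, 0.0]])

rng = np.random.default_rng(0)
V, d = X.shape[0], 4
W = rng.normal(scale=0.1, size=(V, d))        # word vectors
W_ctx = rng.normal(scale=0.1, size=(V, d))    # context vectors
b, b_ctx = np.zeros(V), np.zeros(V)

loss = glove_loss(X, W, W_ctx, b, b_ctx)
embedding = (W + W_ctx) / 2                   # final representation

print("weights:", glove_weight([1, 10, 100, 150]))
print("loss   :", round(loss, 4))
```

The printed weights make the shape of f visible: it rises with the count and then stays flat at 1 once the count reaches x_max.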

Property	Word2Vec	GloVe
Training signal	Local windows (one at a time)	Global co-occurrence matrix
Objective	Predict word from context	Factorise log co-occurrence
Parallelisable	Yes (lock-free multi-threaded SGD)	Yes (pre-compute matrix first)
Analogy tasks	Excellent	Excellent (often slightly better)
Similarity tasks	Good	Often better (global stats help)
Training memory	Low (streaming)	High (stores the non-zero entries of the V×V matrix)
Interpretability	Harder to explain	Clear statistical foundation

Using Pre-trained GloVe in Python

import numpy as np

# ── Load GloVe vectors from text file ───────────────────────
# Download: https://nlp.stanford.edu/projects/glove/
# glove.6B.100d.txt — 6B tokens, 400K vocab, 100-dim

def load_glove(path):
    embeddings = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            word   = parts[0]
            vector = np.array(parts[1:], dtype="float32")
            embeddings[word] = vector
    return embeddings

glove = load_glove("glove.6B.100d.txt")
print(f"Vocabulary size : {len(glove)}")        # 400,000
print(f"Vector dimension: {glove['king'].shape}")  # (100,)

# ── Cosine similarity helper ─────────────────────────────────
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"king ↔ queen cosine: {cosine_sim(glove['king'], glove['queen']):.4f}")   # ~0.72
print(f"king ↔ apple cosine: {cosine_sim(glove['king'], glove['apple']):.4f}")   # ~0.10

# ── Analogy: king - man + woman ─────────────────────────────
target = glove["king"] - glove["man"] + glove["woman"]

best_word  = None
best_score = -1
for word, vec in glove.items():
    if word in {"king", "man", "woman"}:
        continue
    score = cosine_sim(target, vec)
    if score > best_score:
        best_score = score
        best_word  = word

print(f"king - man + woman ≈ {best_word}")   # queen

OUTPUT

Vocabulary size : 400000
Vector dimension: (100,)
king ↔ queen cosine: 0.7190
king ↔ apple cosine: 0.1024
king - man + woman ≈ queen

Section 09

FastText — Embeddings for Every Word (Even Unseen Ones)

📖 Story

The Morphology Detective

Word2Vec and GloVe treat "run", "running", "runner", and "runs" as four completely unrelated words — just four different shelf positions with no connection. A linguist would find this absurd. A child who knows "run" can immediately guess what "runner" means.

Facebook AI Research (2016) asked: what if we treated each word as a bag of character n-grams? The word "running" becomes: <ru, run, unn, nni, nin, ing, ng> (trigrams). The embedding for "running" is the sum of embeddings of all its n-grams. Now "run" and "running" share n-grams and therefore share representation — by design. And a word never seen during training can still get a vector from its subwords.

Developed by Facebook AI Research and published in 2016, FastText extends the Skip-gram model by representing each word as a sum of its character n-gram embeddings. This gives FastText three critical advantages over Word2Vec and GloVe.

🔑

Advantage 1 — OOV Handling

Out-of-Vocabulary Words

FastText can produce a vector for any word, even one never seen in training, by summing its character n-gram embeddings. "unbelievability" → decompose to n-grams → sum → reasonable vector. Word2Vec has no vector for such a word at all (Gensim raises a KeyError).

✓ Zero unknown-word failures

🔬

Advantage 2 — Morphology

Inflections & Derivations

Words sharing morphological roots share n-grams and therefore share representation. run ↔ running ↔ runner are automatically related. Particularly powerful for morphologically rich languages (Finnish, Turkish, Arabic).

✓ Language-agnostic morphology

🌐

Advantage 3 — Rare Words

Low-frequency vocabulary

A rare word seen only 3 times gets a poor Word2Vec vector (too few training updates). But its n-grams appear in thousands of other words, giving FastText rich information even for low-frequency tokens.

✓ Robust with small corpora

📊 FastText N-gram Decomposition of "playing"

FastText default uses n=3 to 6. The whole word <playing> is also included as a special n-gram. The final word vector is the sum of all n-gram vectors, which are themselves learned during training.
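The decomposition itself is simple to sketch. The helper char_ngrams below is hypothetical and only enumerates the n-grams; the real library additionally hashes each n-gram into a fixed number of buckets and adds the whole word as its own entry, both of which are left out here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of '<word>' for n in [min_n, max_n]."""
    wrapped = f"<{word}>"            # boundary markers distinguish prefixes and suffixes
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

trigrams = char_ngrams("running", min_n=3, max_n=3)
print(trigrams)

# Words with a common root share n-grams; unrelated words do not.
shared_related = set(char_ngrams("playing")) & set(char_ngrams("player"))
shared_unrelated = set(char_ngrams("playing")) & set(char_ngrams("fox"))
print(sorted(shared_related))
print(sorted(shared_unrelated))
```

The shared n-grams between "playing" and "player" are what tie their vectors together, while "playing" and "fox" have nothing in common at the subword level.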

FastText in Python

from gensim.models import FastText
import numpy as np

# ── 1. Train FastText model ──────────────────────────────────
sentences = [
    ["the", "quick", "brown", "fox", "jumps"],
    ["she", "is", "playing", "tennis"],
    ["the", "player", "played", "brilliantly"],
    ["machine", "learning", "powers", "modern", "nlp"],
]

model = FastText(
    sentences   = sentences,
    vector_size = 100,
    window      = 5,
    min_count   = 1,
    sg          = 1,     # Skip-gram (recommended for FastText)
    min_n       = 3,     # minimum n-gram length
    max_n       = 6,     # maximum n-gram length
    epochs      = 100,
    seed        = 42
)

# ── 2. OOV magic — unseen word gets a vector ─────────────────
# "playfulness" was NEVER in training data
oov_vec = model.wv["playfulness"]
print(f"OOV vector shape: {oov_vec.shape}")   # (100,) — not zero!

# ── 3. Morphological similarity ("play" itself is OOV here; its vector comes from n-grams) ──
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim_play_playing = cosine(model.wv["play"], model.wv["playing"])
sim_play_player  = cosine(model.wv["play"], model.wv["player"])
sim_play_fox     = cosine(model.wv["play"], model.wv["fox"])

print(f"play ↔ playing : {sim_play_playing:.4f}")   # high (shared n-grams)
print(f"play ↔ player  : {sim_play_player:.4f}")    # high
print(f"play ↔ fox     : {sim_play_fox:.4f}")       # low (no shared n-grams)

# ── 4. Use Facebook's pre-trained model ─────────────────────
# from gensim.models.fasttext import load_facebook_model
# model = load_facebook_model('cc.en.300.bin')
# Pre-trained on Common Crawl + Wikipedia: 300 dims, 2M vocab

OUTPUT

OOV vector shape: (100,)   ← FastText handles unknown words!
play ↔ playing : 0.9412    ← morphologically related
play ↔ player  : 0.8831    ← morphologically related
play ↔ fox     : 0.1203    ← unrelated

Section 10

Word2Vec vs GloVe vs FastText — Complete Comparison

Property	Word2Vec (CBOW)	Word2Vec (Skip-gram)	GloVe	FastText
Training approach	Local window, predict centre	Local window, predict context	Global co-occurrence matrix	Skip-gram + character n-grams
OOV words	❌ No vector	❌ No vector	❌ No vector	✅ N-gram fallback
Morphology	Ignored	Ignored	Ignored	✅ Encoded in n-grams
Rare word quality	Low	Medium	Low	High
Analogy tasks	Excellent	Excellent	Excellent	Good
Training speed	Fastest	Fast	Medium (matrix first)	Slower (more params)
Model size	Small	Small	Small	Larger (stores n-gram table)
Multi-lingual	Moderate	Moderate	Moderate	✅ Excellent (agglutinative languages)
Best use case	Fast baseline, common vocab	Rare words, small corpora	Analogy/similarity, large corpora	Morphology-rich languages, NLP pipelines

⚡

When to Use CBOW

word2vec sg=0

Large corpus with a well-defined common vocabulary. Speed is a priority. Words are frequently occurring so rare-word quality is not a concern. Good baseline for sentiment analysis, topic modelling.

🎯

When to Use Skip-gram

word2vec sg=1

Medium-sized corpora, domain-specific text with rare technical terms. NER, relation extraction where specific entities matter. You need the model to capture rare patterns well.

🌐

When to Use GloVe

Stanford release

Word similarity and analogy tasks where global statistics improve performance. You have access to a large corpus and enough memory for the co-occurrence matrix. Good for research and transfer learning.

🔬

When to Use FastText

Facebook AI

Non-English or morphologically rich languages. Text with spelling variations, social media slang, medical jargon, legal terms. When OOV words are expected at inference time. Production NLP systems.

🚀

When to Use BERT/Transformers

Contextual embeddings

When you need context-dependent vectors (the same word has different vectors in different sentences). Classification, QA, named entity recognition. If compute budget allows. Word2Vec/GloVe/FastText are static — one vector per word regardless of context.

📐

Pre-trained vs From Scratch

Transfer learning

Unless you have 100M+ tokens in a specialised domain, always start from pre-trained vectors (Google News 300d, GloVe 840B, FastText CC). Fine-tune by continuing training on your domain corpus.

Section 11

How Do You Know If Your Embeddings Are Any Good?

Embedding quality is evaluated on two categories of tasks: intrinsic (does the embedding space make linguistic sense?) and extrinsic (does using this embedding improve downstream task performance?).

📈

INTRINSIC

Direct Embedding Quality

Word similarity: Correlate cosine similarity with human ratings (WordSim-353, SimLex-999).
Analogy: king−man+woman=? (Google analogy dataset, 19,544 questions).
Clustering: Do similar words cluster together in t-SNE visualisations?
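A word-similarity evaluation boils down to a rank correlation between model scores and human scores. The sketch below shows the mechanics only: the vectors are the illustrative ones from Section 02, the human ratings are invented for this example (they are not taken from WordSim-353 or SimLex-999), and spearman is a small local helper that assumes no tied values.

```python
import numpy as np

vectors = {
    "king":  np.array([0.65, 0.12, -0.43, 0.87]),
    "queen": np.array([0.61, -0.21, -0.40, 0.83]),
    "man":   np.array([0.42, 0.35, -0.10, 0.21]),
    "woman": np.array([0.38, -0.14, -0.07, 0.18]),
    "apple": np.array([-0.71, 0.02, 0.93, -0.55]),
}

# (word1, word2, human rating on a 0-10 scale); ratings are made up.
benchmark = [
    ("king", "queen", 8.5),
    ("man", "woman", 8.0),
    ("king", "man", 6.0),
    ("king", "apple", 1.0),
]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank(values):
    """Ranks 0..n-1, smallest first (assumes no ties)."""
    order = np.argsort(values)
    ranks = np.empty(len(values))
    ranks[order] = np.arange(len(values))
    return ranks

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return float(np.corrcoef(rank(a), rank(b))[0, 1])

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in benchmark]
human_scores = [rating for _, _, rating in benchmark]
rho = spearman(model_scores, human_scores)
print(f"Spearman rho = {rho:.2f}")
```

Rank correlation is used rather than raw agreement because cosine values and human ratings live on different scales; only the ordering of the pairs is compared.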

🎯

EXTRINSIC

Downstream Task Performance

Plug the embeddings into a real model (sentiment classifier, NER tagger, question answering) and compare accuracy vs baseline. The ultimate test — good intrinsic scores don't always translate to task improvements.

⚡

BIAS AUDIT

Fairness Evaluation (Critical)

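Word embeddings trained on internet text encode societal biases. Analogy queries on widely used pre-trained vectors can return stereotyped completions such as doctor−man+woman ≈ nurse. Use WEAT (Word Embedding Association Test) to quantify bias before deploying. Tools: the responsibly and wefe Python libraries.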
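For orientation, the WEAT effect size can be sketched directly in NumPy. This follows the usual definition (difference in mean association of two target sets X and Y with two attribute sets A and B, divided by the standard deviation over all targets), but the 2-dimensional vectors are invented so the direction of the effect is obvious, the function names are local to this example, and the choice of the sample standard deviation (ddof=1) is an assumption; libraries such as wefe should be preferred for real audits.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def association(w, A, B):
    """s(w, A, B): mean cosine to attribute set A minus mean cosine to set B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Effect size: how much more X associates with A (vs B) than Y does."""
    s_X = [association(x, A, B) for x in X]
    s_Y = [association(y, A, B) for y in Y]
    pooled_std = np.std(s_X + s_Y, ddof=1)     # sample std over all target words
    return float((np.mean(s_X) - np.mean(s_Y)) / pooled_std)

# Made-up 2-D vectors: A and X lean towards the first axis, B and Y the second.
A = [np.array([1.0, 0.0]), np.array([0.9, 0.1])]     # attribute set A
B = [np.array([0.0, 1.0]), np.array([0.1, 0.9])]     # attribute set B
X = [np.array([0.8, 0.2]), np.array([0.7, 0.3])]     # target set X
Y = [np.array([0.2, 0.8]), np.array([0.3, 0.7])]     # target set Y

effect = weat_effect_size(X, Y, A, B)
print(f"WEAT effect size: {effect:.3f}")
```

A positive value means the X words sit closer to A than the Y words do; in a real audit X and Y would be, for example, occupation words and A and B gendered terms looked up in the embedding under test.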

Section 12

End-to-End NLP Pipeline Using Pre-trained Embeddings

import numpy as np
from gensim.models import KeyedVectors

# ── Load Google News pre-trained Word2Vec ────────────────────
# Download: https://code.google.com/archive/p/word2vec/
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin",
    binary=True
)
print(f"Vocab: {len(wv)} words | Dim: {wv.vector_size}")

# ── Sentence embedding by averaging word vectors ─────────────
def sentence_vector(text, model, dim=300):
    tokens = text.lower().split()
    vecs   = [model[w] for w in tokens if w in model]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# ── Simple sentiment classifier using embeddings ─────────────
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

reviews = [
    ("the movie was absolutely fantastic and thrilling", 1),
    ("terrible film waste of time boring plot",          0),
    ("brilliant acting and stunning cinematography",       1),
    ("awful predictable story bad acting",                0),
    ("outstanding performances moving story",             1),
    ("dull and painfully slow nothing happens",           0),
]

X = np.array([sentence_vector(r, wv) for r, _ in reviews])
y = np.array([label for _, label in reviews])

clf = LogisticRegression()
clf.fit(X, y)

test_sentences = [
    "wonderful magical experience loved every moment",
    "completely unwatchable dreadful acting"
]

for s in test_sentences:
    vec   = sentence_vector(s, wv).reshape(1, -1)
    pred  = clf.predict(vec)[0]
    label = "POSITIVE" if pred == 1 else "NEGATIVE"
    print(f"{label}: {s}")

OUTPUT

Vocab: 3000000 words | Dim: 300
POSITIVE: wonderful magical experience loved every moment
NEGATIVE: completely unwatchable dreadful acting

Section 13

Golden Rules for Word Embeddings

💡 Word Embeddings — Non-Negotiable Practitioner Rules

Start with pre-trained vectors. Unless your domain is radically specialised (chemistry, clinical notes), Google News Word2Vec, GloVe 840B, or FastText Common Crawl will outperform embeddings trained on your small in-house corpus. Pre-trained = 100B+ tokens of implicit knowledge.

Match embedding to language morphology. English → Word2Vec or GloVe is fine. Finnish, Turkish, Arabic, Hindi → use FastText. Morphological richness means thousands of word forms per root — FastText handles this natively.

Always audit for bias before deployment. Embeddings trained on the internet contain racial, gender, and occupational stereotypes. This is not a theoretical concern: published audits (for example Bolukbasi et al., 2016, on the Google News vectors) found such stereotypes in widely used pre-trained embeddings, and any system built on raw embeddings inherits them. Use WEAT tests and wefe to measure.

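Normalise vectors before comparing them with dot products. Cosine similarity already divides by vector length, but a raw dot product on unnormalised vectors is dominated by magnitude, not direction. If you L2-normalise the embedding matrix once, every later dot product is a cosine similarity and nearest-neighbour search becomes a single matrix multiply. In Gensim 4, wv.get_normed_vectors() returns the normalised matrix; the older wv.init_sims(replace=True) is deprecated.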

Word2Vec/GloVe/FastText are static embeddings — one vector per word. The word "bank" has the same vector in "river bank" and "bank account". For tasks where word sense disambiguation matters, you need contextual embeddings (BERT, RoBERTa, XLNet). Know this limitation before choosing your architecture.

Dimension choice matters less than you think. 100d vs 300d embeddings show modest differences in most tasks. 50d is often sufficient for small downstream models. Bigger dimensions → more memory, slower inference. The sweet spot is usually 100–300 for most applications.

Fine-tune embeddings on domain data. Load pre-trained vectors and continue training on your domain corpus with a lower learning rate. Continuing training on medical text after initialising from Wikipedia vectors can noticeably improve representations for clinical terms like "myocardial infarction" or "thrombocytopenia".
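The normalisation rule above can be checked numerically. The sketch below uses a random matrix as a stand-in for an embedding matrix, with rows deliberately scaled to very different lengths, and a hypothetical helper l2_normalise; it shows that after row-wise L2 normalisation a plain matrix product yields cosine similarities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embedding matrix: 5 "words", 4 dimensions, rows with very different norms.
scales = np.array([[1.0], [10.0], [0.1], [5.0], [2.0]])
E = rng.normal(size=(5, 4)) * scales

def l2_normalise(matrix):
    """Scale every row to unit length (guarding against zero rows)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)

E_unit = l2_normalise(E)

raw_dot = E @ E.T                # dominated by row magnitude
cos_matrix = E_unit @ E_unit.T   # every entry is a cosine similarity

# Cross-check one entry against the textbook cosine formula.
direct = np.dot(E[0], E[1]) / (np.linalg.norm(E[0]) * np.linalg.norm(E[1]))
print(round(float(cos_matrix[0, 1]), 6), round(float(direct), 6))
```

Normalising once up front also means a nearest-neighbour query is a single matrix-vector product followed by an argsort.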

Section 14

Pre-trained Models & Resources

Model	Corpus	Vocab	Dim	Download
Word2Vec Google News	Google News (100B tokens)	3M	300	code.google.com/archive/p/word2vec
GloVe 6B	Wikipedia + Gigaword (6B tokens)	400K	50/100/200/300	nlp.stanford.edu/projects/glove
GloVe 840B	Common Crawl (840B tokens)	2.2M	300	nlp.stanford.edu/projects/glove
FastText CC (English)	Common Crawl (600B tokens)	2M	300	fasttext.cc/docs/en/english-vectors
FastText 157 Languages	Common Crawl + Wikipedia (per language)	2M each	300	fasttext.cc/docs/en/crawl-vectors

🏆

The Road Ahead — From Static to Contextual

Word2Vec, GloVe, and FastText were the revolution of 2013–2016. The next revolution was ELMo (2018), which produced context-dependent vectors from bidirectional LSTMs. Then came BERT (2018), whose transformer-based contextual embeddings changed the entire field. Today, GPT-class models produce contextual embeddings as a byproduct of pretraining. But the underlying insight, that meaning lives in context and context can be captured statistically, is still the same idea Harris and Firth articulated in the 1950s. The transformers you use today still learn by a close relative of the simple Word2Vec prediction game: predicting words from their context.