
Word Embeddings

A comprehensive data science tutorial covering how computers learn the meaning of words. Starts from the problem with one-hot encoding, builds up to Word2Vec's two architectures (CBOW and Skip-gram), explains GloVe's global co-occurrence approach, and covers FastText's character n-gram innovation.

Section 01

The Story That Explains Word Embeddings

The Library Card Catalogue — Meaning Lives in Neighbours
Imagine a vast library where every book is given a numbered shelf position — not by title, but by who borrows it together with other books. Books on Python and Machine Learning end up near each other. Books on King and Queen are shelved close. Paris and France are practically neighbours.

One day a librarian notices something magical: the distance between "King" and "Queen" on the shelf is almost identical to the distance between "Man" and "Woman". The direction from "Paris" to "France" is the same direction as from "Berlin" to "Germany". Geography is encoded in shelf positions. Grammar is encoded in shelf positions. Meaning itself is encoded in shelf positions.

That is the entire secret of Word Embeddings: convert words into numbers (coordinates on a shelf), arranged so that meaning becomes geometry.

Computers cannot read text. They can only process numbers. The naive solution — assigning each word an arbitrary integer (cat = 1, dog = 2, love = 3) — tells the model nothing about the relationship between words. Word Embeddings solve this by representing every word as a dense vector of real numbers — a point in high-dimensional space — where the position encodes semantic and syntactic meaning.

🧠
The Distributional Hypothesis — The Philosophical Bedrock

All embedding techniques rest on the distributional hypothesis, formulated by linguist Zellig Harris in 1954 and summed up by John Rupert Firth in 1957: "You shall know a word by the company it keeps." Words that appear in similar contexts have similar meanings. Word2Vec, GloVe, and FastText are all different engineering implementations of this one insight.


Section 02

Why Not Just One-Hot Encode Words?

Before embeddings existed, NLP used one-hot encoding: a vector of all zeros except a single 1 at the word's index. With a vocabulary of 50,000 words, every word is a 50,000-dimensional vector. This works but fails catastrophically at scale.

❌ One-Hot Encoding — The Old Way
| Word | Vector (simplified, V=5) |
|---|---|
| king | [1, 0, 0, 0, 0] |
| queen | [0, 1, 0, 0, 0] |
| man | [0, 0, 1, 0, 0] |
| woman | [0, 0, 0, 1, 0] |
| apple | [0, 0, 0, 0, 1] |
✅ Word Embeddings — The Modern Way (dim=4)
| Word | Vector |
|---|---|
| king | [0.65, 0.12, -0.43, 0.87] |
| queen | [0.61, -0.21, -0.40, 0.83] |
| man | [0.42, 0.35, -0.10, 0.21] |
| woman | [0.38, -0.14, -0.07, 0.18] |
| apple | [-0.71, 0.02, 0.93, -0.55] |
🟡
Problem 1 — Sparsity
50,000-dim vectors with one 1
With a 50,000-word vocabulary, each word is a vector of 49,999 zeros and a single 1. Multiplying by these vectors densely wastes almost all compute on zeros, and a dense one-hot matrix for the whole vocabulary takes 50,000 × 50,000 = 2.5 billion floats (about 10 GB at 32-bit precision).
Problem 2 — No Similarity
cosine(king, queen) = 0.0
The cosine similarity between any two distinct one-hot vectors is exactly 0. The model learns that "dog" and "cat" are as different as "dog" and "democracy". Semantic relationships are completely invisible.
🚫
Problem 3 — No Generalisation
Unseen words = dead end
If the model has never seen "astrophysics" during training, the word has no index at all and is typically mapped to a shared unknown token or a zero vector. One-hot representations cannot transfer knowledge between related words. Every word starts from scratch.
💡
Embeddings Solve All Three

Dense embeddings use 50–300 dimensions instead of 50,000. Every dimension carries information, so there is no sparsity. Similar words have similar vectors, so cosine(king, queen) is high (roughly 0.65–0.75 in widely used pre-trained models, versus 0 for one-hot). And unseen words can be approximated from their morphology (FastText) or estimated from context.
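The contrast can be checked numerically. The sketch below is a minimal illustration, assuming NumPy is available; it reuses the toy dim=4 vectors from the comparison above, which are illustrative values and not trained embeddings. The `cosine` helper is a name introduced here.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = ["king", "queen", "man", "woman", "apple"]

# One-hot: row i of the identity matrix is the vector for word i.
identity = np.eye(len(vocab))
one_hot = {w: identity[i] for i, w in enumerate(vocab)}

# Dense toy vectors copied from the dim=4 comparison (illustrative, not trained).
dense = {
    "king":  np.array([0.65, 0.12, -0.43, 0.87]),
    "queen": np.array([0.61, -0.21, -0.40, 0.83]),
    "apple": np.array([-0.71, 0.02, 0.93, -0.55]),
}

print(cosine(one_hot["king"], one_hot["queen"]))  # 0.0 for any two distinct words
print(cosine(dense["king"], dense["queen"]))      # close to 1
print(cosine(dense["king"], dense["apple"]))      # negative
```

With one-hot vectors every pair of distinct words scores exactly 0, so there is nothing for a model to exploit. The dense vectors separate "queen" from "apple" with a single dot product.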


Section 03

Word2Vec — Teaching a Neural Network to Predict Context

The Fill-in-the-Blank Game
Imagine a child learning language by playing a fill-in-the-blank game with billions of sentences. The rule is simple: given the words around a blank, guess the missing word, or given a word, guess the words that surround it. After millions of rounds, the child's mental model of each word becomes incredibly precise. Not from definitions, but purely from patterns of co-occurrence.

Word2Vec is that child — a shallow neural network trained on exactly this game, at internet scale. The learned weights become the word vectors. The network is then discarded; only the weights survive.

Proposed by Tomas Mikolov and colleagues at Google in 2013, Word2Vec is a shallow two-layer neural network trained on a text corpus. It comes in two architectures: CBOW (Continuous Bag of Words) and Skip-gram. Both produce the same type of output (dense word vectors) but via opposite prediction tasks.

📊 Word2Vec — Two Architectures Side by Side
[Figure: two panels. CBOW (predict centre from context): the context words "The", "brown", "jumps", "over" are the input, pass through a 300-dim hidden embedding layer, and the output is the centre word "fox". Skip-gram (predict context from centre): the centre word "fox" is the input, passes through a 300-dim hidden embedding layer, and the outputs are the context words "The", "brown", "jumps", "over".]

Both architectures use the same corpus. The embedding matrix (hidden layer weights) is what we keep after training. The prediction task is just the training signal.


Section 04

CBOW — Continuous Bag of Words

🌎
Task: Given surrounding words, predict the centre word

Sentence: "The quick ??? fox jumps"
CBOW input: ["The", "quick", "fox", "jumps"] → CBOW output: "brown"
The context window defines how many neighbours on each side to use (typically window=5).

How CBOW Processes a Sentence — Step by Step

01
Tokenise & Build Vocabulary
Scan the entire corpus. Assign each unique word an index. Remove words below a minimum frequency threshold (e.g. min_count=5). Even after filtering, a Wikipedia-scale corpus typically leaves hundreds of thousands of unique tokens (around 500,000 is a reasonable working figure).
02
Slide the Context Window
For every word in the corpus, collect its neighbours within the window. With window=2 around "fox": context = [quick, brown, jumps, over]. This produces billions of (context, target) training pairs from a large corpus.
03
Average the Context Vectors
Each context word is looked up in the embedding matrix (random initially). The context vectors are averaged — this is what makes it a "bag": word order is ignored. The averaged vector is the hidden layer.
04
Predict & Backpropagate
The hidden vector is projected through the output matrix (V × d) and a softmax is applied over the vocabulary to produce a probability distribution. Cross-entropy loss against the true centre word drives backpropagation. Gradients update both embedding matrices.
05
Extract the Embedding Matrix
After training, the input embedding matrix W (V × d) contains the final word vectors. The output matrix is discarded. Each row of W is the embedding for one word.
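Steps 03 to 05 can be sketched in NumPy. This is a minimal illustration with a full softmax on a six-word vocabulary, not how production Word2Vec is implemented (which uses negative sampling or hierarchical softmax). The names `W_in`, `W_out`, and `cbow_step`, the learning rate, and the dimension d=8 are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "quick", "brown", "fox", "jumps", "over"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 8                          # vocabulary size, embedding dimension

W_in = rng.normal(scale=0.1, size=(V, d))     # input embeddings (kept after training)
W_out = rng.normal(scale=0.1, size=(V, d))    # output weights (usually discarded)

def cbow_step(context, target, lr=0.5):
    """One SGD step of CBOW with a full softmax. Returns the loss before the update."""
    ctx_ids = [word_to_id[w] for w in context]
    t = word_to_id[target]

    h = W_in[ctx_ids].mean(axis=0)            # step 03: average the context vectors
    scores = W_out @ h                        # step 04: project to V scores
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # softmax over the vocabulary
    loss = -np.log(probs[t])                  # cross-entropy against the true centre

    grad_scores = probs.copy()
    grad_scores[t] -= 1.0                     # d(loss)/d(scores)
    grad_h = W_out.T @ grad_scores            # gradient reaching the hidden layer
    W_out[...] -= lr * np.outer(grad_scores, h)
    # Each context word receives an equal share of the hidden-layer gradient.
    np.subtract.at(W_in, ctx_ids, lr * grad_h / len(ctx_ids))
    return float(loss)

context, target = ["quick", "brown", "jumps", "over"], "fox"
losses = [cbow_step(context, target) for _ in range(100)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")

fox_vector = W_in[word_to_id["fox"]]          # step 05: a row of W_in is the embedding
```

Repeating the same window only shows that the loss falls. Real training sweeps over every window in the corpus, and the rows of `W_in` become useful only once many different contexts have pulled on them.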
🔎
Why CBOW is Faster

CBOW averages the context vectors before prediction, so each training window needs a single forward pass. Skip-gram makes one prediction per context word, so it performs C times more predictions (C = number of context words, up to 2 × window). In practice CBOW trains several times faster on large corpora and tends to do slightly better on frequent words.


Section 05

Skip-gram — The Reverse Problem

▶️
Task: Given the centre word, predict each surrounding word

Sentence: "The quick brown fox jumps"
Skip-gram input: "brown" → outputs: ["The", "quick", "fox", "jumps"]
One input word must predict 2 × window separate output words.

Skip-gram treats each (centre, context_word) as a separate training pair. Given window=2 and the word "brown", it generates four training examples: (brown, The), (brown, quick), (brown, fox), (brown, jumps). This makes Skip-gram better at learning representations for rare words because each occurrence produces more training signal.
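The pair generation described above can be sketched in plain Python. The function name `skipgram_pairs` is introduced here for illustration. Real implementations also shrink the window at random for each centre word and subsample very frequent words, which this sketch leaves out.

```python
def skipgram_pairs(tokens, window=2):
    """Return every (centre, context) pair within `window` positions."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

tokens = ["The", "quick", "brown", "fox", "jumps"]
pairs = skipgram_pairs(tokens, window=2)

print([p for p in pairs if p[0] == "brown"])
# [('brown', 'The'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]
print(len(pairs))  # 14
```

Words at the edge of a sentence have fewer neighbours, so a five-word sentence yields 14 pairs, not 5 × 4 = 20.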

🔄 Negative Sampling — Solving the Softmax Bottleneck
Problem
Full softmax over 500,000 words is computed at every step — O(V) per update. With billions of training pairs, this is prohibitively slow.
Solution
Instead of updating all V output weights, draw k=5–20 random negative samples (words drawn from a noise distribution and treated as not belonging to the context) and update only their weights alongside the one positive sample.
Sampling
Negative words are sampled according to P(w) ∝ freq(w)^0.75 — the 3/4 power smooths the distribution, giving rare words a fighting chance of being selected.
Result
Training becomes O(k) per update instead of O(V). With k=5 vs V=500,000, the output layer does roughly 100,000× less work per update (500,000 / 5), making billion-word corpora tractable on a single machine.
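The sampling rule and the resulting loss can be sketched in NumPy. The word counts below are hypothetical, and `sgns_loss` is a name introduced for this sketch. It computes the standard negative-sampling objective for one positive pair, and it does not re-draw a negative that happens to equal the true context word, which real implementations usually do.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw counts for a five-word vocabulary, most frequent first.
counts = np.array([1000.0, 200.0, 50.0, 10.0, 1.0])

unigram = counts / counts.sum()
noise = counts ** 0.75
noise /= noise.sum()                    # P(w) proportional to freq(w)^0.75

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_centre, u_context, u_negatives):
    """Negative-sampling loss for one positive pair and k negative rows."""
    positive = -np.log(sigmoid(u_context @ v_centre))
    negative = -np.log(sigmoid(-(u_negatives @ v_centre))).sum()
    return float(positive + negative)

d, k = 8, 5
W_in = rng.normal(scale=0.1, size=(len(counts), d))
W_out = rng.normal(scale=0.1, size=(len(counts), d))

centre, context = 0, 1
neg_ids = rng.choice(len(counts), size=k, p=noise)   # only these k output rows are touched
loss = sgns_loss(W_in[centre], W_out[context], W_out[neg_ids])

print("unigram:", unigram.round(4))
print("noise  :", noise.round(4))
print("loss   :", round(loss, 4))
```

Comparing the two printed distributions shows the effect of the 3/4 power: the rarest word is sampled more often than its raw frequency suggests, and the most frequent word less often.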
| Property | CBOW | Skip-gram |
|---|---|---|
| Prediction task | Context → Centre word | Centre → Context words |
| Training speed | Fast (single pass per window) | Slower (C passes per window) |
| Rare word quality | Lower (rare = few training examples) | Higher (each occurrence → C pairs) |
| Frequent word quality | Slightly better | Similar |
| Best for | Large corpora, frequent vocabulary | Small corpora, rare/specialised words |
| Memory use | Lower | Higher |

Section 06

Word Arithmetic — The Magic of Embedding Space

King − Man + Woman = Queen
This is not a metaphor. This is actual vector arithmetic. The vector for "king" minus the vector for "man" gives you a direction in space that roughly means "royalty without maleness". Add the vector for "woman" and you arrive at a point whose nearest neighbour in the vocabulary is "queen".

The same trick works for: Paris − France + Italy ≈ Rome, run − running + swimming ≈ swim, good − better + worse ≈ bad. Relationships that linguists spent decades cataloguing are spontaneously encoded in the geometry.
🌍 Vector Arithmetic in Embedding Space (2D Projection)
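[Figure: 2D projection with Dimension 1 (e.g. royalty) and Dimension 2 (e.g. gender). The points man, king, woman, and queen form a parallelogram: the "+ royalty" arrow runs from man to king, the "+ fem." arrow runs from man to woman, and king − man + woman ≈ queen.]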

The parallelogram shows that the gender direction (man→woman) and the royalty direction (man→king) are approximately independent and composable. This structure emerges from co-occurrence statistics, not from any programmed rule.
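The arithmetic can be tried on the toy 4-dimensional vectors from the one-hot comparison in Section 02. Those values are illustrative, not trained, so this is a sketch of the mechanics only. The `analogy` helper is a name introduced here; it excludes the three query words from the candidates, which is the usual convention.

```python
import numpy as np

# Toy 4-d vectors from the one-hot vs embedding comparison (illustrative, not trained).
vectors = {
    "king":  np.array([0.65, 0.12, -0.43, 0.87]),
    "queen": np.array([0.61, -0.21, -0.40, 0.83]),
    "man":   np.array([0.42, 0.35, -0.10, 0.21]),
    "woman": np.array([0.38, -0.14, -0.07, 0.18]),
    "apple": np.array([-0.71, 0.02, 0.93, -0.55]),
}

def analogy(a, b, c, vectors):
    """Solve a - b + c ≈ ? by cosine similarity, excluding the three query words."""
    target = vectors[a] - vectors[b] + vectors[c]
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        score = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

print(analogy("king", "man", "woman", vectors))  # queen
```

Excluding the query words matters: without that step the nearest neighbour of king − man + woman is very often "king" itself.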


Section 07

Word2Vec in Python — Full Working Example

from gensim.models import Word2Vec
import numpy as np

# ── 1. Tokenised corpus (each sentence is a list of words) ──
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["she", "is", "a", "queen", "and", "he", "is", "a", "king"],
    ["paris", "is", "the", "capital", "of", "france"],
    ["berlin", "is", "the", "capital", "of", "germany"],
    ["machine", "learning", "is", "a", "subset", "of", "artificial", "intelligence"],
]

# ── 2. Train CBOW model ──────────────────────────────────────
cbow_model = Word2Vec(
    sentences    = sentences,
    vector_size  = 100,      # embedding dimension
    window       = 5,        # context window size
    min_count    = 1,        # include words seen at least once
    sg           = 0,        # 0 = CBOW, 1 = Skip-gram
    epochs       = 100,
    seed         = 42
)

# ── 3. Train Skip-gram model ─────────────────────────────────
sg_model = Word2Vec(
    sentences    = sentences,
    vector_size  = 100,
    window       = 5,
    min_count    = 1,
    sg           = 1,        # 1 = Skip-gram
    negative     = 5,        # negative samples
    epochs       = 100,
    seed         = 42
)

# ── 4. Access word vectors ───────────────────────────────────
king_vec = cbow_model.wv["king"]        # shape: (100,)
print(f"Vector shape: {king_vec.shape}")

# ── 5. Find most similar words ───────────────────────────────
similar = cbow_model.wv.most_similar("king", topn=3)
for word, score in similar:
    print(f"  {word:12s}: {score:.4f}")

# ── 6. Vector arithmetic (king - he + she ≈ queen; "man"/"woman" are not in this toy corpus) ──
result = cbow_model.wv.most_similar(
    positive=["king", "she"],
    negative=["he"],
    topn=1
)
print(f"king - he + she ≈ {result[0][0]}")

# ── 7. Cosine similarity ─────────────────────────────────────
sim = cbow_model.wv.similarity("king", "queen")
print(f"Similarity(king, queen) = {sim:.4f}")
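OUTPUT (illustrative: only the shape is guaranteed; neighbours and scores from a 5-sentence corpus vary with the run and Gensim version, and most_similar never returns the query word itself)
Vector shape: (100,)
  queen       : 0.9821
  she         : 0.9104
king - he + she ≈ queen
Similarity(king, queen) = 0.9821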
⚠️
Small Corpus Warning

The example above uses a tiny corpus of 5 sentences — results will look impressive but are not meaningful. Word2Vec needs millions of sentences to learn useful representations. For production use, pre-trained models (Google News 300d, Wikipedia) are the right starting point. Load them with KeyedVectors.load_word2vec_format().


Section 08

GloVe — Global Vectors for Word Representation

The Statistician vs The Detective
Word2Vec is a detective: it looks at one window of text at a time and updates its guesses. It never steps back to see the full picture.

GloVe is a statistician: it first reads the entire corpus, counts every pair of words that appear near each other (building a giant co-occurrence matrix X), and then fits a model to explain that matrix all at once. By seeing global statistics before training begins, GloVe knows exactly how often "ice" and "steam" both appear near "water" — a relationship no single sentence can reveal.

Published by Pennington, Socher, and Manning at Stanford in 2014, GloVe (Global Vectors) combines the best of two worlds: the semantic richness of matrix factorisation methods and the shallow-network efficiency of Word2Vec.
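The counting pass that GloVe runs before training can be sketched in plain Python. The function name `cooccurrence_counts` and the two toy sentences are introduced here for illustration. The reference GloVe implementation additionally weights each co-occurrence by the inverse of the distance between the two words, which this sketch omits.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often word j appears within `window` positions of word i."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

sentences = [
    ["ice", "is", "frozen", "water"],
    ["steam", "is", "hot", "water"],
]
X = cooccurrence_counts(sentences, window=2)

print(X[("is", "water")])    # 2 (once in each sentence)
print(X[("water", "is")])    # 2 (a symmetric window gives symmetric counts)
print(X[("ice", "water")])   # 0 (three positions apart, outside the window)
```

Storing only the non-zero pairs in a dictionary keeps the matrix sparse, which is what makes this step feasible for vocabularies of hundreds of thousands of words.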

The GloVe Objective Function

Co-occurrence Matrix
X_ij = count(i near j)
X_ij is how many times word j appears in the context of word i across the entire corpus. This matrix is computed first, before any training.
GloVe Loss Function
J = Σ f(X_ij)(wᵢᵀw̃ⱼ + bᵢ + b̃ⱼ − log X_ij)²
Minimise the weighted squared difference between the dot product of two word vectors (plus biases) and the log of their co-occurrence count, summed over all pairs with X_ij > 0. f(X_ij) is a weighting function that gives rare, noisy pairs little weight and caps the influence of very common pairs.
Weighting Function
f(x) = (x/x_max)^α if x < x_max, else 1
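Typically x_max=100, α=0.75. The weight rises with the count up to x_max and is then capped at 1, so extremely common pairs (like "the" + "and") cannot dominate the loss. Rare pairs get a small weight because their counts are noisy, and pairs that never co-occur (X_ij = 0) get weight 0 and are skipped.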
Final Representation
embedding = W + W̃
GloVe trains two matrices W and W̃ (word and context vectors). The original paper uses their sum as the final embedding (averaging them gives the same directions, so cosine similarities are unchanged), which typically performs better than using either matrix alone.
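The weighting function and the objective can be written out directly. The sketch below assumes NumPy and a small dense matrix of hypothetical counts. The names `glove_weight` and `glove_loss` are introduced here; it only evaluates the loss once and does not perform the optimisation.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x): grows as (x / x_max)^alpha and is capped at 1 for x >= x_max."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_ctx, b, b_ctx):
    """Weighted least-squares GloVe objective over the non-zero entries of X."""
    i, j = np.nonzero(X)                           # zero counts are skipped (log 0 is undefined)
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    err = pred - np.log(X[i, j])
    return float(np.sum(glove_weight(X[i, j]) * err ** 2))

rng = np.random.default_rng(0)
X = np.array([[0.0, 10.0, 1.0],
              [10.0, 0.0, 200.0],
              [1.0, 200.0, 0.0]])                  # hypothetical co-occurrence counts
V, d = X.shape[0], 4
W, W_ctx = rng.normal(scale=0.1, size=(2, V, d))   # word and context matrices
b, b_ctx = np.zeros(V), np.zeros(V)

weights = glove_weight([1, 10, 100, 200])
print(weights.round(4))                            # small for rare pairs, 1.0 at and above x_max
print(round(glove_loss(X, W, W_ctx, b, b_ctx), 4))
```

The printed weights make the shape of f visible: a pair seen once contributes about 3% of the weight of a pair seen 100 times, and a pair seen 200 times contributes no more than one seen 100 times.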
| Property | Word2Vec | GloVe |
|---|---|---|
| Training signal | Local windows (one at a time) | Global co-occurrence matrix |
| Objective | Predict word from context | Factorise log co-occurrence |
| Parallelisable | Yes (multi-threaded asynchronous SGD) | Yes (pre-compute matrix first) |
| Analogy tasks | Excellent | Excellent (often slightly better) |
| Similarity tasks | Good | Often better (global stats help) |
| Training memory | Low (streaming) | High (stores sparse co-occurrence counts) |
| Interpretability | Harder to explain | Clear statistical foundation |

Using Pre-trained GloVe in Python

import numpy as np

# ── Load GloVe vectors from text file ───────────────────────
# Download: https://nlp.stanford.edu/projects/glove/
# glove.6B.100d.txt — 6B tokens, 400K vocab, 100-dim

def load_glove(path):
    embeddings = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            word   = parts[0]
            vector = np.array(parts[1:], dtype="float32")
            embeddings[word] = vector
    return embeddings

glove = load_glove("glove.6B.100d.txt")
print(f"Vocabulary size : {len(glove)}")        # 400,000
print(f"Vector dimension: {glove['king'].shape}")  # (100,)

# ── Cosine similarity helper ─────────────────────────────────
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"king ↔ queen cosine: {cosine_sim(glove['king'], glove['queen']):.4f}")   # high, roughly 0.7
print(f"king ↔ apple cosine: {cosine_sim(glove['king'], glove['apple']):.4f}")   # much lower

# ── Analogy: king - man + woman ─────────────────────────────
target = glove["king"] - glove["man"] + glove["woman"]

best_word  = None
best_score = -1
for word, vec in glove.items():
    if word in {"king", "man", "woman"}:
        continue
    score = cosine_sim(target, vec)
    if score > best_score:
        best_score = score
        best_word  = word

print(f"king - man + woman ≈ {best_word}")   # queen
OUTPUT (illustrative; exact values depend on the vector file)
Vocabulary size : 400000
Vector dimension: (100,)
king ↔ queen cosine: 0.7190
king ↔ apple cosine: 0.1024
king − man + woman ≈ queen

Section 09

FastText — Embeddings for Every Word (Even Unseen Ones)

The Morphology Detective
Word2Vec and GloVe treat "run", "running", "runner", and "runs" as four separate atomic tokens: four shelf positions with no shared parameters, related only if their contexts happen to overlap. A linguist would find this absurd. A child who knows "run" can immediately guess what "runner" means.

Facebook AI Research (2016) asked: what if we treated each word as a bag of character n-grams? The word "running" becomes: <ru, run, unn, nni, nin, ing, ng> (trigrams). The embedding for "running" is the sum of embeddings of all its n-grams. Now "run" and "running" share n-grams and therefore share representation — by design. And a word never seen during training can still get a vector from its subwords.

FastText extends the Skip-gram model by representing each word as the sum of its character n-gram embeddings. This gives it three critical advantages over Word2Vec and GloVe.

🔑
Advantage 1 — OOV Handling
Out-of-Vocabulary Words
FastText can produce a vector for any word, even one never seen in training, by summing its character n-gram embeddings. "unbelievability" → decompose to n-grams → sum → reasonable vector. Word2Vec has no vector for such a word (Gensim raises a KeyError).
✓ Zero unknown-word failures
🔬
Advantage 2 — Morphology
Inflections & Derivations
Words sharing morphological roots share n-grams and therefore share representation. run ↔ running ↔ runner are automatically related. Particularly powerful for morphologically rich languages (Finnish, Turkish, Arabic).
✓ Language-agnostic morphology
🌐
Advantage 3 — Rare Words
Low-frequency vocabulary
A rare word seen only 3 times gets a poor Word2Vec vector (too few training updates). But its n-grams appear in thousands of other words, giving FastText rich information even for low-frequency tokens.
✓ Robust with small corpora
📊 FastText N-gram Decomposition of "playing"
[Figure: the word "playing" split into character n-grams (n=3) with boundary markers < and >: <pl, pla, lay, ayi, yin, ing, ng>, plus the whole word <playing>. The FastText vector for "playing" is the sum of all n-gram embeddings. It shares "<pl", "pla", and "lay" with "player", so the two words get similar vectors.]

FastText default uses n=3 to 6. The whole word <playing> is also included as a special n-gram. The final word vector is the sum of all n-gram vectors, which are themselves learned during training.
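The decomposition is easy to reproduce. The helper below is a sketch, with `char_ngrams` a name introduced here. It lists the n-grams as strings, whereas the real FastText implementation hashes them into a fixed number of buckets, so distinct n-grams can share an embedding.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word` wrapped in < and >, plus the whole wrapped word."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    if wrapped not in grams:
        grams.append(wrapped)          # the full word is kept as a special n-gram
    return grams

playing = char_ngrams("playing", 3, 3)
player = char_ngrams("player", 3, 3)

print(playing)
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>', '<playing>']
print(sorted(set(playing) & set(player)))
# ['<pl', 'lay', 'pla']
```

The three shared trigrams are the parameters that "playing" and "player" have in common. With the default range of 3 to 6 the overlap also includes longer pieces such as "play" and "<play".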

FastText in Python

from gensim.models import FastText
import numpy as np

# ── 1. Train FastText model ──────────────────────────────────
sentences = [
    ["the", "quick", "brown", "fox", "jumps"],
    ["she", "is", "playing", "tennis"],
    ["the", "player", "played", "brilliantly"],
    ["machine", "learning", "powers", "modern", "nlp"],
]

model = FastText(
    sentences   = sentences,
    vector_size = 100,
    window      = 5,
    min_count   = 1,
    sg          = 1,     # Skip-gram (recommended for FastText)
    min_n       = 3,     # minimum n-gram length
    max_n       = 6,     # maximum n-gram length
    epochs      = 100,
    seed        = 42
)

# ── 2. OOV magic — unseen word gets a vector ─────────────────
# "playfulness" was NEVER in training data
oov_vec = model.wv["playfulness"]
print(f"OOV vector shape: {oov_vec.shape}")   # (100,) — not zero!

# ── 3. Morphological similarity ──────────────────────────────
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim_play_playing = cosine(model.wv["play"], model.wv["playing"])
sim_play_player  = cosine(model.wv["play"], model.wv["player"])
sim_play_fox     = cosine(model.wv["play"], model.wv["fox"])

print(f"play ↔ playing : {sim_play_playing:.4f}")   # high (shared n-grams)
print(f"play ↔ player  : {sim_play_player:.4f}")    # high
print(f"play ↔ fox     : {sim_play_fox:.4f}")       # low (no shared n-grams)

# ── 4. Use Facebook's pre-trained model ─────────────────────
# from gensim.models.fasttext import load_facebook_model
# model = load_facebook_model('cc.en.300.bin')   # Gensim 4.x; replaces the older FastText.load_fasttext_format
# Pre-trained on Common Crawl and Wikipedia: 300 dims, 2M vocab
OUTPUT (illustrative; similarity scores on a 4-sentence corpus vary by run and Gensim version)
OOV vector shape: (100,)     ← FastText handles unknown words!
play ↔ playing : 0.9412      ← morphologically related
play ↔ player  : 0.8831      ← morphologically related
play ↔ fox     : 0.1203      ← unrelated

Section 10

Word2Vec vs GloVe vs FastText — Complete Comparison

| Property | Word2Vec (CBOW) | Word2Vec (Skip-gram) | GloVe | FastText |
|---|---|---|---|---|
| Training approach | Local window, predict centre | Local window, predict context | Global co-occurrence matrix | Skip-gram + character n-grams |
| OOV words | ❌ No vector | ❌ No vector | ❌ No vector | ✅ N-gram fallback |
| Morphology | Ignored | Ignored | Ignored | ✅ Encoded in n-grams |
| Rare word quality | Low | Medium | Low | High |
| Analogy tasks | Excellent | Excellent | Excellent | Good |
| Training speed | Fastest | Fast | Medium (matrix first) | Slower (more params) |
| Model size | Small | Small | Small | Larger (stores n-gram table) |
| Multi-lingual | Moderate | Moderate | Moderate | ✅ Excellent (agglutinative languages) |
| Best use case | Fast baseline, common vocab | Rare words, small corpora | Analogy/similarity, large corpora | Morphology-rich languages, NLP pipelines |
When to Use CBOW
word2vec sg=0
Large corpus with a well-defined common vocabulary. Speed is a priority. Words are frequently occurring so rare-word quality is not a concern. Good baseline for sentiment analysis, topic modelling.
🎯
When to Use Skip-gram
word2vec sg=1
Medium-sized corpora, domain-specific text with rare technical terms. NER, relation extraction where specific entities matter. You need the model to capture rare patterns well.
🌐
When to Use GloVe
Stanford release
Word similarity and analogy tasks where global statistics improve performance. You have access to a large corpus and enough memory for the co-occurrence matrix. Good for research and transfer learning.
🔬
When to Use FastText
Facebook AI
Non-English or morphologically rich languages. Text with spelling variations, social media slang, medical jargon, legal terms. When OOV words are expected at inference time. Production NLP systems.
🚀
When to Use BERT/Transformers
Contextual embeddings
When you need context-dependent vectors (the same word has different vectors in different sentences). Classification, QA, named entity recognition. If compute budget allows. Word2Vec/GloVe/FastText are static — one vector per word regardless of context.
📐
Pre-trained vs From Scratch
Transfer learning
Unless you have 100M+ tokens in a specialised domain, always start from pre-trained vectors (Google News 300d, GloVe 840B, FastText CC). Fine-tune by continuing training on your domain corpus.

Section 11

How Do You Know If Your Embeddings Are Any Good?

Embedding quality is evaluated on two categories of tasks: intrinsic (does the embedding space make linguistic sense?) and extrinsic (does using this embedding improve downstream task performance?).

📈
INTRINSIC
Direct Embedding Quality
Word similarity: Correlate cosine similarity with human ratings (WordSim-353, SimLex-999).
Analogy: king−man+woman=? (Google analogy dataset, 19,544 questions).
Clustering: Do similar words cluster together in t-SNE visualisations?
🎯
EXTRINSIC
Downstream Task Performance
Plug the embeddings into a real model (sentiment classifier, NER tagger, question answering) and compare accuracy vs baseline. The ultimate test — good intrinsic scores don't always translate to task improvements.
BIAS AUDIT
Fairness Evaluation (Critical)
Word embeddings trained on internet text encode societal biases: analogy queries such as doctor − man + woman can return nurse. Use WEAT (Word Embedding Association Test) to quantify bias before deploying. Tools: responsibly, wefe libraries.
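The word-similarity check can be sketched as a rank correlation between model scores and human scores. Everything numeric below is a stand-in: the vectors are the toy dim=4 values from Section 02, and the human ratings are HYPOTHETICAL numbers invented for this sketch, not taken from WordSim-353 or SimLex-999. The helpers `ranks` and `spearman` are introduced here; `scipy.stats.spearmanr` is the usual choice for real data, since it also handles ties.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ranks(values):
    """Rank positions (0 = smallest). Assumes no ties, which holds for this toy data."""
    order = np.argsort(values)
    r = np.empty(len(values))
    r[order] = np.arange(len(values))
    return r

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

vectors = {
    "king":  np.array([0.65, 0.12, -0.43, 0.87]),
    "queen": np.array([0.61, -0.21, -0.40, 0.83]),
    "man":   np.array([0.42, 0.35, -0.10, 0.21]),
    "woman": np.array([0.38, -0.14, -0.07, 0.18]),
    "apple": np.array([-0.71, 0.02, 0.93, -0.55]),
}

# HYPOTHETICAL human ratings on a 0-10 scale, invented for this sketch.
rated_pairs = [
    ("king", "queen", 8.5),
    ("man", "woman", 8.0),
    ("king", "man", 6.0),
    ("king", "apple", 0.5),
]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in rated_pairs]
human_scores = [rating for _, _, rating in rated_pairs]

rho = spearman(model_scores, human_scores)
print(round(rho, 3))  # 0.8
```

Rank correlation is used instead of raw correlation because only the ordering of pairs is comparable between a cosine score and a 0-10 human scale.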

Section 12

End-to-End NLP Pipeline Using Pre-trained Embeddings

import numpy as np
from gensim.models import KeyedVectors

# ── Load Google News pre-trained Word2Vec ────────────────────
# Download: https://code.google.com/archive/p/word2vec/
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin",
    binary=True
)
print(f"Vocab: {len(wv)} words | Dim: {wv.vector_size}")

# ── Sentence embedding by averaging word vectors ─────────────
def sentence_vector(text, model, dim=300):
    tokens = text.lower().split()
    vecs   = [model[w] for w in tokens if w in model]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# ── Simple sentiment classifier using embeddings ─────────────
from sklearn.linear_model import LogisticRegression

reviews = [
    ("the movie was absolutely fantastic and thrilling", 1),
    ("terrible film waste of time boring plot",          0),
    ("brilliant acting and stunning cinematography",       1),
    ("awful predictable story bad acting",                0),
    ("outstanding performances moving story",             1),
    ("dull and painfully slow nothing happens",           0),
]

X = np.array([sentence_vector(r, wv) for r, _ in reviews])
y = np.array([label for _, label in reviews])

clf = LogisticRegression()
clf.fit(X, y)

test_sentences = [
    "wonderful magical experience loved every moment",
    "completely unwatchable dreadful acting"
]

for s in test_sentences:
    vec   = sentence_vector(s, wv).reshape(1, -1)
    pred  = clf.predict(vec)[0]
    label = "POSITIVE" if pred == 1 else "NEGATIVE"
    print(f"{label}: {s}")
OUTPUT (expected; with only six training reviews the predicted labels are not guaranteed)
Vocab: 3000000 words | Dim: 300
POSITIVE: wonderful magical experience loved every moment
NEGATIVE: completely unwatchable dreadful acting

Section 13

Golden Rules for Word Embeddings

💡 Word Embeddings — Non-Negotiable Practitioner Rules
1
Start with pre-trained vectors. Unless your domain is radically specialised (chemistry, clinical notes), Google News Word2Vec, GloVe 840B, or FastText Common Crawl will outperform embeddings trained on your small in-house corpus. Pre-trained = 100B+ tokens of implicit knowledge.
2
Match embedding to language morphology. English → Word2Vec or GloVe is fine. Finnish, Turkish, Arabic, Hindi → use FastText. Morphological richness means thousands of word forms per root — FastText handles this natively.
3
Always audit for bias before deployment. Embeddings trained on the internet contain racial, gender, and occupational stereotypes. This is not a theoretical concern: published audits have found such stereotypes in widely used pre-trained vectors, and systems built on raw embeddings inherit them. Use WEAT tests and wefe to measure.
4
Normalise vectors before using dot products as similarity. Cosine similarity already divides by the vector norms, but a raw dot product on unnormalised vectors is dominated by magnitude, not direction. L2-normalise the embedding matrix once and the dot product equals cosine similarity. In Gensim 4.x, use wv.get_normed_vectors(); the older wv.init_sims(replace=True) is deprecated.
5
Word2Vec/GloVe/FastText are static embeddings — one vector per word. The word "bank" has the same vector in "river bank" and "bank account". For tasks where word sense disambiguation matters, you need contextual embeddings (BERT, RoBERTa, XLNet). Know this limitation before choosing your architecture.
6
Dimension choice matters less than you think. 100d vs 300d embeddings show modest differences in most tasks. 50d is often sufficient for small downstream models. Bigger dimensions → more memory, slower inference. The sweet spot is usually 100–300 for most applications.
7
Fine-tune embeddings on domain data. Load pre-trained vectors and continue training on your domain corpus with a lower learning rate. For example, continuing training on medical text after initialising from Wikipedia vectors can substantially improve representations for clinical terms like "myocardial infarction" or "thrombocytopenia".
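Rule 4 is easy to verify numerically. The sketch below assumes NumPy and uses a small random matrix as a hypothetical embedding table, with one row deliberately scaled up to mimic a word whose vector has a large norm.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 4))        # hypothetical embedding matrix, one row per word
E[0] *= 10                         # give one word a much larger norm

# L2-normalise each row once, up front.
E_unit = E / np.linalg.norm(E, axis=1, keepdims=True)

raw_dot = E @ E.T                  # sensitive to vector length
cos = E_unit @ E_unit.T            # cosine similarity for every pair of words

a, b = E[0], E[1]
direct = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.isclose(cos[0, 1], direct))      # True
print(np.allclose(np.diag(cos), 1.0))     # True: every word has similarity 1 with itself
print(np.abs(raw_dot[0]).round(2))        # row 0 is inflated by the larger norm
```

After normalisation a single matrix product gives all pairwise cosine similarities, which is far cheaper than dividing by the norms for each pair separately.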

Section 14

Pre-trained Models & Resources

| Model | Corpus | Vocab | Dim | Download |
|---|---|---|---|---|
| Word2Vec Google News | Google News (100B tokens) | 3M | 300 | code.google.com/archive/p/word2vec |
| GloVe 6B | Wikipedia + Gigaword (6B tokens) | 400K | 50/100/200/300 | nlp.stanford.edu/projects/glove |
| GloVe 840B | Common Crawl (840B tokens) | 2.2M | 300 | nlp.stanford.edu/projects/glove |
| FastText CC (English) | Common Crawl + Wikipedia | 2M | 300 | fasttext.cc/docs/en/english-vectors |
| FastText 157 Languages | Common Crawl + Wikipedia (multilingual) | 2M each | 300 | fasttext.cc/docs/en/crawl-vectors |
🏆
The Road Ahead — From Static to Contextual

Word2Vec, GloVe, and FastText were the revolution of 2013–2016. The next revolution was ELMo (2018), with context-dependent vectors from bidirectional LSTMs. Then came BERT (2018), whose transformer-based contextual embeddings changed the entire field. Today, GPT-class models produce contextual embeddings as a byproduct of pretraining. But the underlying insight (meaning lives in context, and context can be captured statistically) is the same distributional idea that Harris and Firth put forward in the 1950s. The transformers you use today build on the same predict-from-context training game that Word2Vec made popular.