The Story That Explains Word Embeddings
One day a librarian notices something magical: the distance between "King" and "Queen" on the shelf is almost identical to the distance between "Man" and "Woman". The direction from "Paris" to "France" is the same direction as from "Berlin" to "Germany". Geography is encoded in shelf positions. Grammar is encoded in shelf positions. Meaning itself is encoded in shelf positions.
That is the entire secret of Word Embeddings: convert words into numbers (coordinates on a shelf), arranged so that meaning becomes geometry.
Computers cannot read text. They can only process numbers. The naive solution — assigning each word an arbitrary integer (cat = 1, dog = 2, love = 3) — tells the model nothing about the relationship between words. Word Embeddings solve this by representing every word as a dense vector of real numbers — a point in high-dimensional space — where the position encodes semantic and syntactic meaning.
All embedding techniques rest on a single idea, published in 1957 by linguist John Rupert Firth: "You shall know a word by the company it keeps." Words that appear in similar contexts tend to have similar meanings. Word2Vec, GloVe, and FastText are all different engineering implementations of this one insight.
Why Not Just One-Hot Encode Words?
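Before embeddings existed, NLP used one-hot encoding: a vector of all zeros except a single 1 at the word's index. With a vocabulary of 50,000 words, every word is a 50,000-dimensional vector. This works but fails catastrophically at scale: the vectors grow with the vocabulary, and any two distinct words are orthogonal, so "king" is exactly as unrelated to "queen" as it is to "apple".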
| Word | Vector (simplified, V=5) |
|---|---|
| king | [1, 0, 0, 0, 0] |
| queen | [0, 1, 0, 0, 0] |
| man | [0, 0, 1, 0, 0] |
| woman | [0, 0, 0, 1, 0] |
| apple | [0, 0, 0, 0, 1] |
Compare this with dense embeddings (illustrative values, 4 dimensions):

| Word | Vector |
|---|---|
| king | [0.65, 0.12, -0.43, 0.87] |
| queen | [0.61, -0.21, -0.40, 0.83] |
| man | [0.42, 0.35, -0.10, 0.21] |
| woman | [0.38, -0.14, -0.07, 0.18] |
| apple | [-0.71, 0.02, 0.93, -0.55] |
Dense embeddings use 50–300 dimensions instead of 50,000. Every dimension carries information, so there is no sparsity. Similar words have similar vectors: for the illustrative vectors above, cosine(king, queen) ≈ 0.96. Unseen words can be approximated from their morphology (FastText) or estimated from context.
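The contrast between the two tables can be checked numerically. The sketch below is a minimal illustration in plain Python (no NumPy, to keep it dependency-free); the vectors are copied from the tables above and the `cosine` helper is an assumption of this example, not part of any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One-hot vectors from the first table (V = 5)
one_hot = {
    "king":  [1, 0, 0, 0, 0],
    "queen": [0, 1, 0, 0, 0],
    "apple": [0, 0, 0, 0, 1],
}

# Illustrative dense vectors from the second table
dense = {
    "king":  [0.65, 0.12, -0.43, 0.87],
    "queen": [0.61, -0.21, -0.40, 0.83],
    "apple": [-0.71, 0.02, 0.93, -0.55],
}

# Distinct one-hot vectors are orthogonal, so similarity is always 0
one_hot_sim = cosine(one_hot["king"], one_hot["queen"])

# Dense vectors can express "close" and "far"
dense_king_queen = cosine(dense["king"], dense["queen"])  # close to 1
dense_king_apple = cosine(dense["king"], dense["apple"])  # negative

print(f"one-hot king/queen: {one_hot_sim:.2f}")
print(f"dense   king/queen: {dense_king_queen:.2f}")
print(f"dense   king/apple: {dense_king_apple:.2f}")
```

With one-hot vectors every pair of words scores the same, so no downstream model can learn that "king" and "queen" are related from the representation alone.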
Word2Vec — Teaching a Neural Network to Predict Context
Word2Vec is a shallow neural network trained on a context-prediction game at internet scale. The learned weights become the word vectors. The network is then discarded; only the weights survive.
Proposed by Tomas Mikolov at Google in 2013, Word2Vec is a two-layer neural network trained on a text corpus. It comes in two architectures: CBOW (Continuous Bag of Words) and Skip-gram. Both produce the same type of output — dense word vectors — but via opposite prediction tasks.
Both architectures use the same corpus. The embedding matrix (hidden layer weights) is what we keep after training. The prediction task is just the training signal.
CBOW — Continuous Bag of Words
Sentence: "The quick ??? fox jumps"
CBOW input: ["The", "quick", "fox", "jumps"] → CBOW output: "brown"
The context window defines how many neighbours on each side to use (typically window=5).
How CBOW Processes a Sentence — Step by Step
Words rarer than a frequency threshold are dropped from the vocabulary (for example, min_count=5). A typical Wikipedia corpus yields ~500,000 unique tokens.
CBOW averages the context vectors before prediction, so each training window needs a single forward pass. Skip-gram makes one prediction per context word, so it performs C times more predictions (C = number of context words, up to 2 × window). CBOW typically trains 3–10× faster on large corpora and tends to represent frequent words slightly better.
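The averaging step can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the five-word vocabulary, the embedding dimension D = 4, and the random initial weights are all made up for the example, and the full softmax used here is what real Word2Vec replaces with negative sampling or hierarchical softmax for speed.

```python
import math
import random

random.seed(0)

vocab = ["the", "quick", "brown", "fox", "jumps"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 4  # vocabulary size and embedding dimension (toy values)

# Input embeddings and output weights, both V x D, randomly initialised
W_in = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]

def cbow_forward(context_words):
    """Return a probability distribution over the vocabulary for the centre word."""
    ids = [word_to_id[w] for w in context_words]
    # 1. Average the context embeddings into one hidden vector
    hidden = [sum(W_in[i][d] for i in ids) / len(ids) for d in range(D)]
    # 2. Score every vocabulary word against the hidden vector
    scores = [sum(W_out[v][d] * hidden[d] for d in range(D)) for v in range(V)]
    # 3. Softmax turns scores into probabilities (shifted for numerical stability)
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = cbow_forward(["the", "quick", "fox", "jumps"])

# Cross-entropy loss for the true centre word "brown"
loss = -math.log(probs[word_to_id["brown"]])
print(f"P(brown | context) = {probs[word_to_id['brown']]:.3f}, loss = {loss:.3f}")
```

Training would backpropagate this loss into `W_in` and `W_out`; after training, the rows of `W_in` are the word vectors that get kept.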
Skip-gram — The Reverse Problem
Sentence: "The quick brown fox jumps"
Skip-gram input: "brown" → outputs: ["The", "quick", "fox", "jumps"]
One input word must predict 2 × window separate output words.
Skip-gram treats each (centre, context_word) as a separate training pair. Given window=2 and the word "brown", it generates four training examples: (brown, The), (brown, quick), (brown, fox), (brown, jumps). This makes Skip-gram better at learning representations for rare words because each occurrence produces more training signal.
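The pair generation described above can be sketched as follows. The function name `skipgram_pairs` is hypothetical, and the sketch uses a fixed window; real implementations also subsample frequent words and randomly shrink the window.

```python
def skipgram_pairs(tokens, window):
    """Generate (centre, context) training pairs for a fixed window size."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs

tokens = ["The", "quick", "brown", "fox", "jumps"]
pairs = skipgram_pairs(tokens, window=2)

brown_pairs = [p for p in pairs if p[0] == "brown"]
print(brown_pairs)
# [('brown', 'The'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]
```

Words near the edge of the sentence produce fewer pairs because part of their window falls outside the sentence.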
| Property | CBOW | Skip-gram |
|---|---|---|
| Prediction task | Context → Centre word | Centre → Context words |
| Training speed | Fast (single pass per window) | Slower (C passes per window) |
| Rare word quality | Lower (rare = few training examples) | Higher (each occurrence → C pairs) |
| Frequent word quality | Slightly better | Similar |
| Best for | Large corpora, frequent vocabulary | Small corpora, rare/specialised words |
| Memory use | Lower | Higher |
Word Arithmetic — The Magic of Embedding Space
The classic example is king − man + woman ≈ queen. The same trick works for: Paris − France + Italy ≈ Rome, run − running + swimming ≈ swim, good − better + worse ≈ bad. Relationships that linguists spent decades cataloguing are spontaneously encoded in the geometry.
The parallelogram shows that the gender direction (man→woman) and the royalty direction (man→king) are independent and composable. This structure emerges from co-occurrence statistics, not from any programmed rule.
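The arithmetic can be tried on the illustrative 4-dimensional vectors from the dense-embedding table earlier in this article. This is a toy sketch: the vectors are hand-picked illustrations rather than trained embeddings, and the `analogy` helper is a hypothetical name for this example.

```python
import math

# Illustrative vectors from the dense-embedding table (not trained values)
vectors = {
    "king":  [0.65, 0.12, -0.43, 0.87],
    "queen": [0.61, -0.21, -0.40, 0.83],
    "man":   [0.42, 0.35, -0.10, 0.21],
    "woman": [0.38, -0.14, -0.07, 0.18],
    "apple": [-0.71, 0.02, 0.93, -0.55],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman")
print(f"king - man + woman ≈ {result}")  # queen
```

Excluding the three input words matters: without that step the nearest neighbour of the target is often one of the inputs themselves.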
Word2Vec in Python — Full Working Example
from gensim.models import Word2Vec
import numpy as np
# ── 1. Tokenised corpus (each sentence is a list of words) ──
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["she", "is", "a", "queen", "and", "he", "is", "a", "king"],
    ["paris", "is", "the", "capital", "of", "france"],
    ["berlin", "is", "the", "capital", "of", "germany"],
    ["machine", "learning", "is", "a", "subset", "of", "artificial", "intelligence"],
]
# ── 2. Train CBOW model ──────────────────────────────────────
cbow_model = Word2Vec(
    sentences=sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=1,      # include words seen at least once
    sg=0,             # 0 = CBOW, 1 = Skip-gram
    epochs=100,
    seed=42
)
# ── 3. Train Skip-gram model ─────────────────────────────────
sg_model = Word2Vec(
    sentences=sentences,
    vector_size=100,
    window=5,
    min_count=1,
    sg=1,             # 1 = Skip-gram
    negative=5,       # negative samples
    epochs=100,
    seed=42
)
# ── 4. Access word vectors ───────────────────────────────────
king_vec = cbow_model.wv["king"] # shape: (100,)
print(f"Vector shape: {king_vec.shape}")
# ── 5. Find most similar words ───────────────────────────────
similar = cbow_model.wv.most_similar("king", topn=3)
for word, score in similar:
    print(f" {word:12s}: {score:.4f}")
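# ── 6. Vector arithmetic (king - he + she) ───────────────────
# Stand-in for king - man + woman ≈ queen: "man" and "woman" are not
# in this toy corpus, so "he" and "she" are used instead.
result = cbow_model.wv.most_similar(
    positive=["king", "she"],
    negative=["he"],
    topn=1
)
print(f"king - he + she ≈ {result[0][0]}")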
# ── 7. Cosine similarity ─────────────────────────────────────
sim = cbow_model.wv.similarity("king", "queen")
print(f"Similarity(king, queen) = {sim:.4f}")
The example above uses a tiny corpus of 5 sentences, so its results may look plausible but are not meaningful. Word2Vec needs millions of sentences to learn useful representations. For production use, pre-trained models (Google News 300d, Wikipedia) are the right starting point. Load them with KeyedVectors.load_word2vec_format().
GloVe — Global Vectors for Word Representation
GloVe is a statistician: it first reads the entire corpus, counts every pair of words that appear near each other (building a giant co-occurrence matrix X), and then fits a model to explain that matrix all at once. By seeing global statistics before training begins, GloVe knows exactly how often "ice" and "steam" both appear near "water" — a relationship no single sentence can reveal.
Published by Pennington, Socher, and Manning at Stanford in 2014, GloVe (Global Vectors) combines the best of two worlds: the semantic richness of matrix factorisation methods and the shallow-network efficiency of Word2Vec.
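The counting step can be sketched with a `Counter`. This is a simplified illustration: the two toy sentences and the function name `cooccurrence_counts` are assumptions of the example, and it uses raw counts, whereas the GloVe paper weights a pair that is d words apart by 1/d.

```python
from collections import Counter

def cooccurrence_counts(sentences, window):
    """Count how often each (word, context) pair appears within the window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Toy corpus (hypothetical sentences for illustration)
sentences = [
    ["ice", "is", "solid", "water"],
    ["steam", "is", "hot", "water"],
]

X = cooccurrence_counts(sentences, window=3)

print(X[("ice", "water")])    # 1
print(X[("steam", "water")])  # 1
print(X[("is", "water")])     # 2, one from each sentence
print(X[("ice", "steam")])    # 0, they never share a window
```

"ice" and "steam" never co-occur directly, yet both co-occur with "water"; that shared neighbour only becomes visible once counts from the whole corpus are aggregated.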
The GloVe Objective Function
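As given in the GloVe paper, the objective is a weighted least-squares fit to the log co-occurrence counts: J = Σ_{i,j} f(X_ij) · (w_i · w̃_j + b_i + b̃_j − log X_ij)², where f(x) = (x / x_max)^α for x < x_max and 1 otherwise (the paper uses x_max = 100 and α = 0.75). The sketch below evaluates that loss in plain Python; the co-occurrence counts and the 2-dimensional parameters are hypothetical values chosen only to make the example run.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): damps rare pairs and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(X, W, W_ctx, b, b_ctx):
    """Weighted squared error summed over all non-zero co-occurrence counts."""
    total = 0.0
    for (i, j), x_ij in X.items():
        dot = sum(wi * wj for wi, wj in zip(W[i], W_ctx[j]))
        error = dot + b[i] + b_ctx[j] - math.log(x_ij)
        total += weight(x_ij) * error ** 2
    return total

# Hypothetical counts, keyed by (word index, context index)
X = {(0, 1): 10.0, (1, 0): 10.0, (0, 2): 1.0, (2, 0): 1.0}

# Hypothetical 2-dimensional word vectors, context vectors, and biases
W = [[0.5, 0.1], [0.4, 0.2], [-0.3, 0.6]]
W_ctx = [[0.3, 0.2], [0.6, 0.1], [-0.1, 0.4]]
b = [0.0, 0.0, 0.0]
b_ctx = [0.0, 0.0, 0.0]

loss = glove_loss(X, W, W_ctx, b, b_ctx)
print(f"GloVe loss on toy data: {loss:.4f}")
```

Only non-zero entries of X contribute, which is why implementations store the matrix sparsely rather than as a dense V×V array. Training minimises this loss by gradient descent over the vectors and biases.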
| Property | Word2Vec | GloVe |
|---|---|---|
| Training signal | Local windows (one at a time) | Global co-occurrence matrix |
| Objective | Predict word from context | Factorise log co-occurrence |
| Parallelisable | Difficult (online SGD) | Yes (pre-compute matrix first) |
| Analogy tasks | Excellent | Excellent (often slightly better) |
| Similarity tasks | Good | Often better (global stats help) |
| Training memory | Low (streaming) | High (stores V×V matrix) |
| Interpretability | Harder to explain | Clear statistical foundation |
Using Pre-trained GloVe in Python
import numpy as np
# ── Load GloVe vectors from text file ───────────────────────
# Download: https://nlp.stanford.edu/projects/glove/
# glove.6B.100d.txt — 6B tokens, 400K vocab, 100-dim
def load_glove(path):
    embeddings = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            word = parts[0]
            vector = np.array(parts[1:], dtype="float32")
            embeddings[word] = vector
    return embeddings
glove = load_glove("glove.6B.100d.txt")
print(f"Vocabulary size : {len(glove)}") # 400,000
print(f"Vector dimension: {glove['king'].shape}") # (100,)
# ── Cosine similarity helper ─────────────────────────────────
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine_sim(glove["king"], glove["queen"])) # ~0.72
print(cosine_sim(glove["king"], glove["apple"])) # ~0.10
# ── Analogy: king - man + woman ─────────────────────────────
target = glove["king"] - glove["man"] + glove["woman"]
best_word = None
best_score = -1
for word, vec in glove.items():
    if word in {"king", "man", "woman"}:
        continue  # skip the input words themselves
    score = cosine_sim(target, vec)
    if score > best_score:
        best_score = score
        best_word = word
print(f"king - man + woman ≈ {best_word}") # queen
FastText — Embeddings for Every Word (Even Unseen Ones)
Facebook AI Research (2016) asked: what if we treated each word as a bag of character n-grams? The word "running" becomes: <ru, run, unn, nni, nin, ing, ng> (trigrams). The embedding for "running" is the sum of embeddings of all its n-grams. Now "run" and "running" share n-grams and therefore share representation — by design. And a word never seen during training can still get a vector from its subwords.
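Developed by Facebook AI Research and published in 2016, FastText extends the Skip-gram model by representing each word as the sum of its character n-gram embeddings. This gives FastText three critical advantages over Word2Vec and GloVe: it can build vectors for out-of-vocabulary words, it encodes morphology, and it produces better vectors for rare words.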
FastText default uses n=3 to 6. The whole word <playing> is also included as a special n-gram. The final word vector is the sum of all n-gram vectors, which are themselves learned during training.
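The n-gram extraction can be sketched in a few lines. The function name `char_ngrams` is hypothetical, and the sketch only lists the n-grams; the real FastText implementation additionally hashes them into a fixed number of buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of '<word>' for every n in [min_n, max_n]."""
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

trigrams = char_ngrams("running", min_n=3, max_n=3)
print(trigrams)
# ['<ru', 'run', 'unn', 'nni', 'nin', 'ing', 'ng>']

# Subwords shared by "run" and "running" with the default n = 3 to 6
shared = set(char_ngrams("run")) & set(char_ngrams("running"))
print(sorted(shared))
# ['<ru', '<run', 'run']
```

Because the word vector is the sum of its n-gram vectors, the shared subwords are what pull "run" and "running" towards each other, and they are also what lets an unseen word receive a vector.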
FastText in Python
from gensim.models import FastText
import numpy as np
# ── 1. Train FastText model ──────────────────────────────────
sentences = [
    ["the", "quick", "brown", "fox", "jumps"],
    ["she", "is", "playing", "tennis"],
    ["the", "player", "played", "brilliantly"],
    ["machine", "learning", "powers", "modern", "nlp"],
]
model = FastText(
    sentences=sentences,
    vector_size=100,
    window=5,
    min_count=1,
    sg=1,     # Skip-gram (recommended for FastText)
    min_n=3,  # minimum n-gram length
    max_n=6,  # maximum n-gram length
    epochs=100,
    seed=42
)
# ── 2. OOV magic — unseen word gets a vector ─────────────────
# "playfulness" was NEVER in training data
oov_vec = model.wv["playfulness"]
print(f"OOV vector shape: {oov_vec.shape}") # (100,) — not zero!
# ── 3. Morphological similarity ──────────────────────────────
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
sim_play_playing = cosine(model.wv["play"], model.wv["playing"])
sim_play_player = cosine(model.wv["play"], model.wv["player"])
sim_play_fox = cosine(model.wv["play"], model.wv["fox"])
print(f"play ↔ playing : {sim_play_playing:.4f}") # high (shared n-grams)
print(f"play ↔ player : {sim_play_player:.4f}") # high
print(f"play ↔ fox : {sim_play_fox:.4f}") # low (no shared n-grams)
# ── 4. Use Facebook's pre-trained model ─────────────────────
# FastText.load_fasttext_format() was removed in Gensim 4; use load_facebook_model()
# from gensim.models.fasttext import load_facebook_model
# model = load_facebook_model('cc.en.300.bin')
# Pre-trained on Common Crawl: 300 dims, 2M vocab
Word2Vec vs GloVe vs FastText — Complete Comparison
| Property | Word2Vec (CBOW) | Word2Vec (Skip-gram) | GloVe | FastText |
|---|---|---|---|---|
| Training approach | Local window, predict centre | Local window, predict context | Global co-occurrence matrix | Skip-gram + character n-grams |
| OOV words | ❌ No vector | ❌ No vector | ❌ No vector | ✅ N-gram fallback |
| Morphology | Ignored | Ignored | Ignored | ✅ Encoded in n-grams |
| Rare word quality | Low | Medium | Low | High |
| Analogy tasks | Excellent | Excellent | Excellent | Good |
| Training speed | Fastest | Fast | Medium (matrix first) | Slower (more params) |
| Model size | Small | Small | Small | Larger (stores n-gram table) |
| Multi-lingual | Moderate | Moderate | Moderate | ✅ Excellent (agglutinative languages) |
| Best use case | Fast baseline, common vocab | Rare words, small corpora | Analogy/similarity, large corpora | Morphology-rich languages, NLP pipelines |
How Do You Know If Your Embeddings Are Any Good?
Embedding quality is evaluated on two categories of tasks: intrinsic (does the embedding space make linguistic sense?) and extrinsic (does using this embedding improve downstream task performance?).
Analogy: king−man+woman=? (Google analogy dataset, 19,544 pairs).
Clustering: Do similar words cluster together in t-SNE visualisations?
Bias: audit embeddings for social bias with the responsibly and wefe libraries.
End-to-End NLP Pipeline Using Pre-trained Embeddings
import numpy as np
from gensim.models import KeyedVectors
# ── Load Google News pre-trained Word2Vec ────────────────────
# Download: https://code.google.com/archive/p/word2vec/
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin",
    binary=True
)
print(f"Vocab: {len(wv)} words | Dim: {wv.vector_size}")
# ── Sentence embedding by averaging word vectors ─────────────
def sentence_vector(text, model, dim=300):
    tokens = text.lower().split()
    # Keep only tokens the model knows; unknown words are skipped
    vecs = [model[w] for w in tokens if w in model]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)
# ── Simple sentiment classifier using embeddings ─────────────
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
reviews = [
    ("the movie was absolutely fantastic and thrilling", 1),
    ("terrible film waste of time boring plot", 0),
    ("brilliant acting and stunning cinematography", 1),
    ("awful predictable story bad acting", 0),
    ("outstanding performances moving story", 1),
    ("dull and painfully slow nothing happens", 0),
]
X = np.array([sentence_vector(r, wv) for r, _ in reviews])
y = np.array([label for _, label in reviews])
clf = LogisticRegression()
clf.fit(X, y)
test_sentences = [
    "wonderful magical experience loved every moment",
    "completely unwatchable dreadful acting"
]
for s in test_sentences:
    vec = sentence_vector(s, wv).reshape(1, -1)
    pred = clf.predict(vec)[0]
    label = "POSITIVE" if pred == 1 else "NEGATIVE"
    print(f"{label}: {s}")
Golden Rules for Word Embeddings
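Audit your embeddings for bias: use wefe to measure.
L2-normalise vectors before comparing them: wv.init_sims(replace=True) in Gensim (deprecated since Gensim 4.0, where wv.get_normed_vectors() returns unit-length copies).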
Pre-trained Models & Resources
| Model | Corpus | Vocab | Dim | Download |
|---|---|---|---|---|
| Word2Vec Google News | Google News (100B tokens) | 3M | 300 | code.google.com/archive/p/word2vec |
| GloVe 6B | Wikipedia + Gigaword (6B tokens) | 400K | 50/100/200/300 | nlp.stanford.edu/projects/glove |
| GloVe 840B | Common Crawl (840B tokens) | 2.2M | 300 | nlp.stanford.edu/projects/glove |
| FastText CC (English) | Common Crawl + Wikipedia | 2M | 300 | fasttext.cc/docs/en/english-vectors |
| FastText 157 Languages | Common Crawl + Wikipedia (multilingual) | 2M each | 300 | fasttext.cc/docs/en/crawl-vectors |
Word2Vec, GloVe, and FastText were the revolution of 2013–2016. The next revolution was ELMo (2018), which produced context-dependent vectors from bidirectional LSTMs. Then came BERT (2018), whose transformer-based contextual embeddings changed the entire field. Today, GPT-class models produce contextual embeddings as a byproduct of pretraining. But the underlying insight (meaning lives in context, and context can be captured statistically) is still the idea Firth articulated in 1957. Every transformer you use today is a conceptual descendant of the simple Word2Vec prediction game.