The Story Behind Embeddings
After years of watching, he can now look at you for two seconds and say: "You'll love Arrival. You'll skip Transformers."
He never reads your mind. He converts what he knows about you — your history, your patterns, your implicit preferences — into a kind of invisible fingerprint. He does the same for every film. When your fingerprint aligns closely with a film's fingerprint, he recommends it.
That fingerprint? It's an embedding.
An embedding is a dense numerical vector — a list of floating-point numbers — that encodes the identity and meaning of something (a user, a movie, a song, a product) in a compact mathematical space. Things that behave similarly in the real world end up numerically close to each other in that space.
Embeddings turn discrete, categorical identities (User #8812, Movie #441) into points in a continuous geometric space. Geometric proximity = semantic similarity. This idea underpins most large-scale recommendation systems today, such as those at Netflix, Spotify, Amazon, YouTube and TikTok.
Why Not Just Use One-Hot Encoding?
| User / Movie (one-hot) | Vector | Dims |
|---|---|---|
| User #1 | [1,0,0,0,…,0] | 1,000,000 |
| User #2 | [0,1,0,0,…,0] | 1,000,000 |
| Movie A | [1,0,0,…,0] | 500,000 |
| Movie B | [0,1,0,…,0] | 500,000 |
| Distance(Movie A, Movie B) | Always √2 ≈ 1.414 for any two distinct items, so it says nothing about similarity | |
| User / Movie (embedding) | Vector | Dims |
|---|---|---|
| User #1 | [0.82, -0.31, 0.55] | 32–256 |
| User #2 | [0.80, -0.28, 0.53] | 32–256 |
| Movie A | [0.77, -0.30, 0.56] | 32–256 |
| Movie B | [-0.9, 0.71, -0.4] | 32–256 |
| Cosine sim(User #1, Movie A) | ≈ 0.999, a very close match | |
| Property | One-Hot Encoding | Embeddings |
|---|---|---|
| Dimensionality | = number of items (millions) | 16–512 (fixed, tunable) |
| Captures similarity | No — all items equidistant | Yes — learned from behaviour |
| Memory footprint | Enormous sparse matrices | Tiny dense matrices |
| Cold start | Struggles — new items = new dim | Manageable with side info |
| Used in production | Rarely as a final representation (older systems) | Widely (e.g. Google, Meta, Netflix) |
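The contrast in the tables can be checked numerically. The sketch below is a minimal illustration, assuming a 5-item catalogue for the one-hot case and reusing the three-dimensional vectors from the embedding table; the helper name `cosine` is ours.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot: any two distinct items are exactly sqrt(2) apart (toy 5-item catalogue).
one_hot = np.eye(5)
one_hot_dist = np.linalg.norm(one_hot[0] - one_hot[1])

# Dense vectors copied from the embedding table above.
user_1 = np.array([0.82, -0.31, 0.55])
movie_a = np.array([0.77, -0.30, 0.56])
movie_b = np.array([-0.9, 0.71, -0.4])

sim_a = cosine(user_1, movie_a)
sim_b = cosine(user_1, movie_b)
print(f"one-hot distance:      {one_hot_dist:.3f}")
print(f"cos(User #1, Movie A): {sim_a:.3f}")
print(f"cos(User #1, Movie B): {sim_b:.3f}")
```

Movie A scores close to 1 and Movie B comes out strongly negative, whereas the one-hot distance is the same constant for every pair.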
Dense Vector Representation — What It Really Means
The word "dense" is the key distinction. A one-hot vector for User #5 among 1 million users is 999,999 zeros and a single 1. An embedding for that same user is 128 numbers, all of them potentially non-zero, each encoding a latent dimension of taste.
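Picture a two-dimensional taste-space in which the first coordinate runs from arthouse (negative) to mainstream (positive) and the second from dark (negative) to light (positive):
▸ The Dark Knight lands at (0.7, -0.8): blockbuster, very dark.
▸ Toy Story lands at (0.4, 0.95): mainstream, very light.
▸ Parasite lands at (-0.6, -0.7): arthouse, very dark.
A user who loved The Dark Knight and Parasite would sit near (0.05, -0.75), the midpoint of those two films: indifferent on the mainstream axis, firmly on the dark side. The space captures their taste without us ever writing a single rule. Real embeddings do this in 128+ dimensions, not 2.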
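As a rough sketch of the two-dimensional example, the snippet below places a user at the mean of the films they loved and ranks all films by distance. Averaging is an assumption made here for illustration; trained models learn the user vector directly.

```python
import numpy as np

# 2-D coordinates from the example: (arthouse -> mainstream, dark -> light)
films = {
    "The Dark Knight": np.array([0.7, -0.8]),
    "Toy Story":       np.array([0.4, 0.95]),
    "Parasite":        np.array([-0.6, -0.7]),
}

# Assumption: a user's position is the mean of the films they loved.
liked = ["The Dark Knight", "Parasite"]
user_vec = np.mean([films[t] for t in liked], axis=0)

# Rank every film by Euclidean distance to the user (closest first).
ranked = sorted(films, key=lambda t: np.linalg.norm(films[t] - user_vec))
print("user position:", user_vec)
print("closest to farthest:", ranked)
```

Under that assumption the user lands at the midpoint of the two dark films, and Toy Story ends up farthest away.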
User Embeddings — Who Are You to the Algorithm?
A user embedding is the system's mathematical answer to: "What does this person's entire interaction history reveal about their latent taste?" It is not a demographic profile. It is not a list of their liked items. It is a single vector that positions them in taste-space.
This gives you collaborative filtering for free. If User A and User B have nearly identical embedding vectors, items User A loved but User B hasn't seen yet are strong candidates to recommend to B — even if they share zero explicit overlap in their histories.
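The neighbour idea described above can be sketched in a few lines. All data here is hypothetical toy data (four users, three-dimensional vectors, made-up item labels), and `nearest_user` and `neighbour_recs` are illustrative names, not part of any library.

```python
import numpy as np

# Hypothetical toy data: 4 users with 3-D embeddings and their liked-item sets.
user_vecs = np.array([
    [0.82, -0.31, 0.55],   # user 0
    [0.80, -0.28, 0.53],   # user 1, almost identical to user 0
    [-0.90, 0.71, -0.40],  # user 2, opposite taste
    [0.10, 0.90, 0.20],    # user 3
])
liked = {0: {"A", "B", "C"}, 1: {"D"}, 2: {"E"}, 3: {"F"}}

def nearest_user(target):
    """Index of the user with the highest cosine similarity to `target`."""
    unit = user_vecs / np.linalg.norm(user_vecs, axis=1, keepdims=True)
    sims = unit @ unit[target]
    sims[target] = -np.inf          # never return the user themselves
    return int(np.argmax(sims))

def neighbour_recs(target):
    """Items the nearest neighbour liked that `target` has not interacted with."""
    return sorted(liked[nearest_user(target)] - liked[target])

print(neighbour_recs(1))
```

Users 0 and 1 share no items in their histories, yet user 1 still receives user 0's items because their vectors nearly coincide.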
Item Embeddings — What Is This Product, Really?
An item embedding is the algorithm's answer to: "What latent properties does this item express — and what kind of user would it appeal to?" Items that are consumed by the same types of people end up geometrically close, even if they appear very different on the surface.
| Item Type | What Drives Embedding Position | Implicit Signal Used |
|---|---|---|
| 🎵 Songs (Spotify) | Co-listening patterns, playlist co-occurrence | Play, skip, add-to-playlist, replay |
| 🎬 Movies (Netflix) | Co-watching by same account, completion rate | Watch %, pause rate, rewatch, thumbs |
| 🛍️ Products (Amazon) | Co-purchase, co-view, session co-occurrence | Click, add-to-cart, buy, review |
| 📰 Articles (News) | Session co-reads, same-user click patterns | Time on page, scroll depth, share |
| 🎮 Games (Steam) | Same-library co-ownership, playtime patterns | Hours played, achievements, wishlisted |
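To see how co-consumption alone can pull items together, here is a small sketch that swaps in a simpler technique than the gradient-trained embeddings used later: a truncated SVD of an item co-occurrence matrix. The sessions and item names are invented for illustration.

```python
import numpy as np

# Hypothetical sessions: each inner list is the items one user consumed together.
sessions = [
    ["jazz_1", "jazz_2", "jazz_3"],
    ["jazz_1", "jazz_2"],
    ["jazz_2", "jazz_3"],
    ["metal_1", "metal_2", "metal_3"],
    ["metal_1", "metal_2"],
    ["metal_2", "metal_3"],
]
items = sorted({i for s in sessions for i in s})
pos = {name: k for k, name in enumerate(items)}

# Symmetric co-occurrence counts: C[a, b] = number of sessions containing both.
C = np.zeros((len(items), len(items)))
for s in sessions:
    for a in s:
        for b in s:
            if a != b:
                C[pos[a], pos[b]] += 1

# Truncated SVD: keep the top-2 singular directions as 2-D item embeddings.
U, S, _ = np.linalg.svd(C)
emb = U[:, :2] * np.sqrt(S[:2])

def cosine(a, b):
    """Cosine similarity between the embeddings of two named items."""
    va, vb = emb[pos[a]], emb[pos[b]]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

print("jazz_1 vs jazz_3: ", round(cosine("jazz_1", "jazz_3"), 3))
print("jazz_1 vs metal_1:", round(cosine("jazz_1", "metal_1"), 3))
```

No genre label is ever supplied; the two clusters emerge purely from which items appear in the same sessions.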
The Architecture — User + Item Embeddings Together
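In symbols, the model implemented in the next section scores a user and an item as follows. The notation is ours rather than the article's: p_u and q_i are the rows of the user and item embedding tables, and b_u, b_i are the scalar bias terms.

```latex
\hat{y}_{ui} = \sigma\!\left(\mathbf{p}_u^{\top}\mathbf{q}_i + b_u + b_i\right),
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}},
\qquad
\mathbf{p}_u, \mathbf{q}_i \in \mathbb{R}^{d}
```

Both tables are ordinary trainable parameters, so the same gradient step that reduces the loss also moves each user closer to the items they interacted with.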
Python Implementation — Matrix Factorisation from Scratch
Building User + Item Embeddings with PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.utils.data import DataLoader, Dataset
# ─── 1. Synthetic interaction dataset ───────────────────────
class InteractionDataset(Dataset):
    def __init__(self, num_users=1000, num_items=500, num_interactions=20000):
        np.random.seed(42)
        self.users = torch.tensor(np.random.randint(0, num_users, num_interactions), dtype=torch.long)
        self.items = torch.tensor(np.random.randint(0, num_items, num_interactions), dtype=torch.long)
        # All observed interactions are labelled positive. With no negatives the
        # model can drive the loss down by predicting 1 everywhere, so this only
        # demonstrates the mechanics. Step 2 of the project adds negative sampling.
        self.labels = torch.ones(num_interactions)

    def __len__(self):
        return len(self.users)

    def __getitem__(self, idx):
        return self.users[idx], self.items[idx], self.labels[idx]
# ─── 2. Matrix Factorisation model (user + item embeddings) ─
class MatrixFactorisation(nn.Module):
    def __init__(self, num_users, num_items, embedding_dim=64):
        super().__init__()
        # User embeddings, shape: (num_users, embedding_dim)
        self.user_emb = nn.Embedding(num_users, embedding_dim)
        # Item embeddings, shape: (num_items, embedding_dim)
        self.item_emb = nn.Embedding(num_items, embedding_dim)
        # Bias terms for each user and item
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_bias = nn.Embedding(num_items, 1)
        # Initialise with small values for stability
        nn.init.normal_(self.user_emb.weight, std=0.01)
        nn.init.normal_(self.item_emb.weight, std=0.01)
        nn.init.zeros_(self.user_bias.weight)
        nn.init.zeros_(self.item_bias.weight)

    def forward(self, user_ids, item_ids):
        u = self.user_emb(user_ids)                   # (batch, 64)
        i = self.item_emb(item_ids)                   # (batch, 64)
        # squeeze(-1) drops only the trailing dim, so a batch of 1 stays 1-D
        u_b = self.user_bias(user_ids).squeeze(-1)    # (batch,)
        i_b = self.item_bias(item_ids).squeeze(-1)    # (batch,)
        # Dot product + biases → predicted score
        score = (u * i).sum(dim=1) + u_b + i_b
        return torch.sigmoid(score)                   # squash to (0, 1)
# ─── 3. Train ────────────────────────────────────────────────
NUM_USERS, NUM_ITEMS = 1000, 500
dataset = InteractionDataset(NUM_USERS, NUM_ITEMS)
loader = DataLoader(dataset, batch_size=256, shuffle=True)
model = MatrixFactorisation(NUM_USERS, NUM_ITEMS, embedding_dim=64)
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
loss_fn = nn.BCELoss()
for epoch in range(5):
    total_loss = 0.0
    for users, items, labels in loader:
        preds = model(users, items)
        loss = loss_fn(preds, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1} | Loss: {total_loss/len(loader):.4f}")
print(f"\nUser embedding shape: {model.user_emb.weight.shape}")
print(f"Item embedding shape: {model.item_emb.weight.shape}")
Python Implementation — Generating Recommendations
import torch
import torch.nn.functional as F
# ─── Extract trained embeddings ─────────────────────────────
user_vectors = model.user_emb.weight.detach()               # (1000, 64)
item_vectors = model.item_emb.weight.detach()               # (500, 64)
# Item biases are part of the trained score, so keep them when ranking.
# (The user bias is constant for a given user and cannot change the order.)
item_biases = model.item_bias.weight.detach().squeeze(-1)   # (500,)
# ─── Recommend top-K items for a given user ──────────────────
def recommend(user_id, top_k=10, exclude_seen=None):
    """
    Returns top-K item indices for a user, ranked by dot product plus item bias.
    exclude_seen: set of item_ids the user already interacted with.
    """
    u_vec = user_vectors[user_id]                  # (64,)
    # Score ALL items at once (vectorised)
    scores = item_vectors @ u_vec + item_biases    # (500,), one score per item
    if exclude_seen:
        scores[list(exclude_seen)] = -float('inf')
    top_items = torch.topk(scores, k=top_k).indices.tolist()
    return top_items
# ─── Recommend for User #42 ──────────────────────────────────
user_42_history = {5, 17, 88, 103} # items they've already seen
recs = recommend(user_id=42, top_k=5, exclude_seen=user_42_history)
print(f"Top 5 recommendations for User 42: {recs}")
# ─── Find similar users ──────────────────────────────────────
def similar_users(user_id, top_k=5):
    u_vec = F.normalize(user_vectors[user_id].unsqueeze(0), dim=1)
    all_u = F.normalize(user_vectors, dim=1)
    cos_sim = (all_u @ u_vec.T).squeeze(1)    # cosine similarity, (num_users,)
    cos_sim[user_id] = -float('inf')          # exclude self (-1 is a valid cosine)
    neighbours = torch.topk(cos_sim, k=top_k).indices.tolist()
    return neighbours
print(f"Users most similar to User 42: {similar_users(42)}")
# ─── Find similar items ──────────────────────────────────────
def similar_items(item_id, top_k=5):
    i_vec = F.normalize(item_vectors[item_id].unsqueeze(0), dim=1)
    all_i = F.normalize(item_vectors, dim=1)
    cos_sim = (all_i @ i_vec.T).squeeze(1)    # cosine similarity, (num_items,)
    cos_sim[item_id] = -float('inf')          # exclude self (-1 is a valid cosine)
    neighbours = torch.topk(cos_sim, k=top_k).indices.tolist()
    return neighbours
print(f"Items most similar to Item 88: {similar_items(88)}")
After training a single matrix factorisation model you get: (1) personalised item recommendations, (2) similar-user discovery for social features, and (3) similar-item retrieval for "if you liked this, try this." All three emerge naturally from the two embedding tables.
Advanced: Neural Collaborative Filtering (NCF)
A plain dot product can only capture interactions that are linear in each latent factor, and real preferences are often more complicated than that. Neural Collaborative Filtering (NCF) replaces the dot product with a multi-layer perceptron (MLP) that can, in principle, learn more general interaction functions. It runs this MLP in parallel with a GMF (Generalised Matrix Factorisation) path and combines the two outputs.
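Written out, the two paths and their combination look roughly like this. The symbols are our own shorthand: superscripts G and M mark the separate GMF and MLP embedding tables, ⊙ is the element-wise product, and h, b are the weights and bias of the final linear layer.

```latex
\begin{aligned}
\mathbf{z}^{\text{GMF}} &= \mathbf{p}_u^{G} \odot \mathbf{q}_i^{G} \\
\mathbf{z}^{\text{MLP}} &= \mathrm{MLP}\big([\,\mathbf{p}_u^{M};\ \mathbf{q}_i^{M}\,]\big) \\
\hat{y}_{ui} &= \sigma\big(\mathbf{h}^{\top}[\,\mathbf{z}^{\text{GMF}};\ \mathbf{z}^{\text{MLP}}\,] + b\big)
\end{aligned}
```

If the MLP path is dropped, h is fixed to a vector of ones and b to zero, the last line reduces to the sigmoid of a plain dot product, which is why the GMF path is described as a generalisation of matrix factorisation.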
import torch
import torch.nn as nn
class NCF(nn.Module):
    """
    Neural Collaborative Filtering (He et al. 2017)
    Combines GMF (element-wise product path) + MLP (deep interaction path)
    """
    def __init__(self, num_users, num_items,
                 emb_dim_gmf=32, emb_dim_mlp=32,
                 mlp_layers=(64, 32, 16)):   # tuple: avoids a mutable default argument
        super().__init__()
        # ── GMF path (element-wise product) ─────────────────
        self.user_emb_gmf = nn.Embedding(num_users, emb_dim_gmf)
        self.item_emb_gmf = nn.Embedding(num_items, emb_dim_gmf)
        # ── MLP path (concatenation → deep network) ─────────
        self.user_emb_mlp = nn.Embedding(num_users, emb_dim_mlp)
        self.item_emb_mlp = nn.Embedding(num_items, emb_dim_mlp)
        # Build MLP: input = 2 * emb_dim_mlp (user + item concat)
        layers = []
        input_dim = emb_dim_mlp * 2
        for out_dim in mlp_layers:
            layers += [nn.Linear(input_dim, out_dim), nn.ReLU()]
            input_dim = out_dim
        self.mlp = nn.Sequential(*layers)
        # Final prediction: [GMF output || MLP output] → scalar
        self.output = nn.Linear(emb_dim_gmf + mlp_layers[-1], 1)

    def forward(self, user_ids, item_ids):
        # GMF path
        u_g = self.user_emb_gmf(user_ids)
        i_g = self.item_emb_gmf(item_ids)
        gmf_out = u_g * i_g                                # element-wise product (batch, 32)
        # MLP path
        u_m = self.user_emb_mlp(user_ids)
        i_m = self.item_emb_mlp(item_ids)
        mlp_in = torch.cat([u_m, i_m], dim=1)              # (batch, 64)
        mlp_out = self.mlp(mlp_in)                         # (batch, 16)
        # Combine and predict
        combined = torch.cat([gmf_out, mlp_out], dim=1)    # (batch, 48)
        score = torch.sigmoid(self.output(combined))       # (batch, 1)
        return score.squeeze(-1)                           # (batch,), also for batch size 1
# ─── Usage ───────────────────────────────────────────────────
ncf = NCF(num_users=1000, num_items=500)
user_batch = torch.tensor([42, 7, 312])
item_batch = torch.tensor([88, 22, 441])
predictions = ncf(user_batch, item_batch)
print(f"Predicted scores: {predictions.detach().numpy().round(3)}")
print(f"NCF parameters: {sum(p.numel() for p in ncf.parameters()):,}")
Key Hyperparameters for Embedding Models
| Hyperparameter | Typical Range | What It Controls | Tuning Advice |
|---|---|---|---|
| embedding_dim | 32–512 | Expressiveness of each vector | Start at 64. Larger helps with many items; tiny hurts recall. |
| learning_rate | 1e-4 – 1e-2 | Gradient step size | Use Adam. Start 1e-3, halve if loss oscillates. |
| weight_decay | 1e-6 – 1e-4 | L2 regularisation on embeddings | Curbs overfitting and runaway embedding norms; too much shrinks every vector toward zero. Tune carefully. |
| batch_size | 256 – 4096 | Samples per gradient update | Larger = more stable; 1024 is a good default. |
| num_negatives | 1 – 10 | Negatives sampled per positive (BPR) | 4–5 is standard. Too many slows training for marginal gain. |
| epochs | 10 – 100 | Training iterations over the dataset | Early stopping on Recall@10 on validation set. |
Cold start: a purely ID-based embedding model cannot score a brand-new user or item, because neither has a trained row in the embedding table yet. Common mitigations: (1) Use side features (age, genre, category) to initialise embeddings. (2) Fall back to content-based recommendations for new items. (3) Use session-based models that treat the current session as the user signal, bypassing the user ID entirely.
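As one possible reading of the side-feature approach, the sketch below initialises a new item with the mean embedding of trained items that share its genre. The item table is random stand-in data and `cold_start_vector` is a hypothetical helper; real systems often learn a mapping from features to embeddings instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trained item table: 6 items, 4-D embeddings, one genre each.
item_emb = rng.normal(size=(6, 4))
item_genre = ["sci-fi", "sci-fi", "comedy", "comedy", "sci-fi", "drama"]

def cold_start_vector(genre):
    """Initialise a new item as the mean embedding of trained items in its genre.

    Falls back to the global mean when the genre has never been seen.
    """
    rows = [k for k, g in enumerate(item_genre) if g == genre]
    if not rows:
        return item_emb.mean(axis=0)
    return item_emb[rows].mean(axis=0)

new_scifi = cold_start_vector("sci-fi")      # average of items 0, 1 and 4
new_western = cold_start_vector("western")   # unseen genre: global average
print(new_scifi.round(3), new_western.round(3))
```

The new vector is only a starting point; once the item collects interactions, ordinary training can refine it.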
Evaluation — Measuring Recommendation Quality
import numpy as np
def recall_at_k(recommended, relevant, k):
    """Fraction of relevant items that appear in top-K recs."""
    top_k = set(recommended[:k])
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(top_k & relevant_set) / len(relevant_set)

def ndcg_at_k(recommended, relevant, k):
    """Normalised Discounted Cumulative Gain at K."""
    relevant_set = set(relevant)
    dcg = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant_set:
            dcg += 1.0 / np.log2(rank + 1)
    # Ideal DCG: all relevant items at top positions
    idcg = sum(1.0 / np.log2(r + 2) for r in range(min(k, len(relevant_set))))
    return dcg / idcg if idcg > 0 else 0.0
# ─── Evaluate over many users ─────────────────────────────
recall_scores, ndcg_scores = [], []
for uid in range(100):  # first 100 users as example
    # Placeholder ground truth: 5 random items per user, so the scores printed
    # below sit near chance level. Substitute each user's held-out positives.
    relevant_items = list(np.random.randint(0, 500, 5))
    recommended = recommend(uid, top_k=20)
    recall_scores.append(recall_at_k(recommended, relevant_items, k=10))
    ndcg_scores.append(ndcg_at_k(recommended, relevant_items, k=10))
print(f"Recall@10: {np.mean(recall_scores):.4f}")
print(f"NDCG@10: {np.mean(ndcg_scores):.4f}")
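A small worked example may make the two metrics more concrete. The ranked list and relevant set below are made up, and the arithmetic is spelled out inline rather than calling the helper functions so the snippet stands alone.

```python
import numpy as np

recommended = [3, 1, 7, 5]   # ranked list, best first
relevant = {1, 5, 9}         # held-out items the user actually liked
k = 4

hits = [item in relevant for item in recommended[:k]]   # [False, True, False, True]
recall = sum(hits) / len(relevant)                      # 2 of the 3 relevant items found

# DCG: each hit at 1-based rank r contributes 1 / log2(r + 1).
dcg = sum(1.0 / np.log2(r + 1) for r, hit in enumerate(hits, start=1) if hit)
# Ideal DCG: the min(k, |relevant|) = 3 top slots all filled with hits.
idcg = sum(1.0 / np.log2(r + 1) for r in range(1, min(k, len(relevant)) + 1))
ndcg = dcg / idcg

print(f"Recall@{k} = {recall:.3f}, NDCG@{k} = {ndcg:.3f}")
# prints: Recall@4 = 0.667, NDCG@4 = 0.498
```

Recall only counts how many relevant items were found, while NDCG also penalises the list for placing its hits at ranks 2 and 4 instead of ranks 1 and 2.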
🚀 Complete Project — Movie Recommendation Engine
This project builds an end-to-end recommendation engine on the MovieLens 100K dataset, a widely used benchmark in recommender-systems research. It trains user and item embeddings, generates personalised recommendations, finds similar movies, and evaluates with Recall@K and NDCG@K.
Step 1 — Load the MovieLens Dataset
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split
import warnings; warnings.filterwarnings('ignore')
# ─── Load MovieLens 100K ─────────────────────────────────────
# Download from: https://grouplens.org/datasets/movielens/100k/
# Or use the mirror below
url = 'https://files.grouplens.org/datasets/movielens/ml-100k/u.data'
df = pd.read_csv(url, sep='\t',
names=['user_id', 'movie_id', 'rating', 'timestamp'])
# Re-index user_id and movie_id to be 0-based integers
df['user_idx'] = df['user_id'].astype('category').cat.codes
df['item_idx'] = df['movie_id'].astype('category').cat.codes
NUM_USERS = df['user_idx'].nunique()
NUM_ITEMS = df['item_idx'].nunique()
print(f"Users: {NUM_USERS} | Items: {NUM_ITEMS} | Interactions: {len(df):,}")
print(df['rating'].describe())
# Convert ratings to implicit feedback (liked = rating ≥ 4)
df['liked'] = (df['rating'] >= 4).astype(int)
positives = df[df['liked'] == 1][['user_idx','item_idx']].reset_index(drop=True)
print(f"\nPositive interactions (rating ≥ 4): {len(positives):,}")
Step 2 — Dataset with Negative Sampling
class BPRDataset(Dataset):
    """
    Bayesian Personalised Ranking dataset.
    Each positive (user, item+) pair is visited num_negatives times per epoch,
    and every visit draws one random negative item-.
    Training signal: score(user, item+) > score(user, item-)
    """
    def __init__(self, interactions_df, num_items, num_negatives=4):
        # cat.codes yields small integer dtypes (int16 here), but nn.Embedding
        # only accepts int32/int64 indices, so cast up front.
        self.users = interactions_df['user_idx'].values.astype(np.int64)
        self.pos_items = interactions_df['item_idx'].values.astype(np.int64)
        self.num_items = num_items
        self.num_neg = num_negatives
        # Build user positive set for fast negative sampling
        self.user_pos = interactions_df.groupby('user_idx')['item_idx'].apply(set).to_dict()

    def __len__(self):
        return len(self.users) * self.num_neg

    def __getitem__(self, idx):
        i = idx % len(self.users)
        u = int(self.users[i])
        pos = int(self.pos_items[i])
        # Sample a negative item that is not among this user's positives in this split
        seen = self.user_pos.get(u, set())
        while True:
            neg = int(np.random.randint(self.num_items))
            if neg not in seen:
                break
        return torch.tensor(u), torch.tensor(pos), torch.tensor(neg)
train_df, test_df = train_test_split(positives, test_size=0.2, random_state=42)
train_ds = BPRDataset(train_df, NUM_ITEMS, num_negatives=4)
train_loader = DataLoader(train_ds, batch_size=1024, shuffle=True, num_workers=0)
print(f"Training batches: {len(train_loader)} | Test positives: {len(test_df):,}")
Step 3 — Model with BPR Loss
class MovieRecommender(nn.Module):
    def __init__(self, num_users, num_items, dim=64):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)
        self.item_emb = nn.Embedding(num_items, dim)
        nn.init.xavier_normal_(self.user_emb.weight)
        nn.init.xavier_normal_(self.item_emb.weight)

    def forward(self, users, pos_items, neg_items):
        u = self.user_emb(users)
        i_p = self.item_emb(pos_items)
        i_n = self.item_emb(neg_items)
        pos_score = (u * i_p).sum(dim=1)
        neg_score = (u * i_n).sum(dim=1)
        # BPR loss: -log σ(pos_score - neg_score).
        # logsigmoid is numerically stable, so no epsilon is needed inside the log.
        loss = -nn.functional.logsigmoid(pos_score - neg_score).mean()
        return loss

    def get_scores(self, user_id):
        """Dot-product score of one user against every item, shape (num_items,)."""
        u = self.user_emb.weight[user_id]
        return (self.item_emb.weight @ u).detach()
model = MovieRecommender(NUM_USERS, NUM_ITEMS, dim=64)
opt = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
# ─── Training loop ────────────────────────────────────────────
for epoch in range(20):
    model.train()
    total = 0.0
    for users, pos, neg in train_loader:
        loss = model(users, pos, neg)
        opt.zero_grad()
        loss.backward()
        opt.step()
        total += loss.item()
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1:2d} | Loss: {total/len(train_loader):.4f}")
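For reference, the objective minimised by the training loop above can be written as follows. The notation is ours: B is a mini-batch of (user, positive item, sampled negative item) triples, and p_u, q_i are rows of the user and item embedding tables.

```latex
\mathcal{L}_{\text{BPR}}
  = -\frac{1}{|B|} \sum_{(u,\,i,\,j) \in B}
    \ln \sigma\!\left(\mathbf{p}_u^{\top}\mathbf{q}_i - \mathbf{p}_u^{\top}\mathbf{q}_j\right)
```

The loss depends only on the difference between the positive and negative scores, so the model is trained to rank items correctly rather than to predict an absolute rating. The L2 penalty usually written alongside BPR is supplied here through the optimiser's `weight_decay` argument.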
Step 4 — Generate Personalised Recommendations
# ─── Movie title lookup ───────────────────────────────────────
title_url = 'https://files.grouplens.org/datasets/movielens/ml-100k/u.item'
movies = pd.read_csv(title_url, sep='|', encoding='latin-1',
names=['movie_id','title'] + [f'f{i}' for i in range(22)],
usecols=['movie_id','title'])
# Build index mapping (item_idx → title)
movie_id_to_idx = dict(zip(df['movie_id'], df['item_idx']))
# Iterate over the movie_id column itself: iterrows() would yield the 0-based
# row position, not the 1-based MovieLens id, and shift every title by one.
idx_to_title = {movie_id_to_idx[mid]: title
                for mid, title in zip(movies['movie_id'], movies['title'])
                if mid in movie_id_to_idx}

def get_recommendations(user_id, top_k=10):
    """user_id is the 0-based user_idx, not the raw MovieLens user_id."""
    model.eval()
    with torch.no_grad():
        scores = model.get_scores(user_id)
    # Exclude items the user already rated
    seen = df.loc[df['user_idx'] == user_id, 'item_idx'].unique()
    scores[torch.as_tensor(seen, dtype=torch.long)] = -float('inf')
    top_items = torch.topk(scores, k=top_k).indices.tolist()
    return [(idx_to_title.get(i, f'Item_{i}'), round(scores[i].item(), 3))
            for i in top_items]

print("\n🎬 Top 10 recommendations for User 42:")
for rank, (title, score) in enumerate(get_recommendations(42), 1):
    print(f" {rank:2d}. {title[:45]:45s} score={score:.3f}")
Step 5 — Evaluate with Recall@10 and NDCG@10
def evaluate(model, test_df, train_df, num_items, K=10):
    model.eval()
    recalls, ndcgs = [], []
    # Build train positive sets per user (to exclude)
    train_pos = train_df.groupby('user_idx')['item_idx'].apply(set).to_dict()
    test_pos = test_df.groupby('user_idx')['item_idx'].apply(list).to_dict()
    with torch.no_grad():
        for user_id, relevant in test_pos.items():
            scores = model.get_scores(int(user_id))
            seen = train_pos.get(user_id, set())
            if seen:  # a user may have no training positives after the random split
                scores[torch.tensor(list(seen), dtype=torch.long)] = -float('inf')
            top_k = torch.topk(scores, k=K).indices.tolist()
            recalls.append(recall_at_k(top_k, relevant, K))
            ndcgs.append(ndcg_at_k(top_k, relevant, K))
    return np.mean(recalls), np.mean(ndcgs)
rec, ndcg = evaluate(model, test_df, train_df, NUM_ITEMS, K=10)
print(f"Recall@10 : {rec:.4f}")
print(f"NDCG@10 : {ndcg:.4f}")
print(f"\n✅ Model ready for production serving via FAISS ANN index.")
Production Serving with FAISS
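At production scale the item embedding matrix can have millions of rows (the MovieLens model above has only 1,682). Comparing a user vector against every row is O(N) per query, which can be too slow for real-time serving. FAISS (Facebook AI Similarity Search) provides index types, such as inverted-file (IVF) and graph-based (HNSW) indexes, that answer approximate nearest-neighbour queries in sub-linear time by searching only part of the collection.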
import faiss
import numpy as np
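# ─── Build FAISS index from item embeddings ──────────────────
item_np = model.item_emb.weight.detach().numpy().astype('float32')  # astype makes a copy
DIM = item_np.shape[1]
# IndexFlatIP ranks by inner product. It is an exact, brute-force index, which is
# fine for a small catalogue; for approximate search at scale, FAISS also offers
# IndexIVFFlat and IndexHNSWFlat.
index = faiss.IndexFlatIP(DIM)
# L2-normalise so that inner product equals cosine similarity.
# (Skip this step to rank by the raw dot product the model was trained on.)
faiss.normalize_L2(item_np)
index.add(item_np)
print(f"FAISS index: {index.ntotal} items indexed, dim={DIM}")
# ─── Real-time recommendation for a user ─────────────────────
def fast_recommend(user_id, top_k=10, exclude_seen=None):
    u_vec = model.user_emb.weight[user_id].detach().numpy().astype('float32')
    u_vec = u_vec.reshape(1, -1)
    faiss.normalize_L2(u_vec)
    seen = exclude_seen or set()
    # Over-fetch so that top_k items remain after dropping already-seen ones
    n_fetch = min(index.ntotal, top_k + len(seen))
    scores, item_indices = index.search(u_vec, n_fetch)
    return [int(i) for i in item_indices[0] if int(i) not in seen][:top_k]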
import time
start = time.time()
recs = fast_recommend(42, top_k=10)
elapsed = (time.time() - start) * 1000
print(f"Query time: {elapsed:.2f}ms for {index.ntotal:,} items")
print(f"Top items: {recs}")
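To show why an approximate index can skip most of the catalogue, here is a NumPy-only sketch of the inverted-file (IVF) idea. It does not use FAISS and is not how FAISS is implemented internally; the data is random, and names such as `n_lists`, `nprobe` and `ivf_search` are chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
items = rng.normal(size=(2000, 16)).astype("float32")
items /= np.linalg.norm(items, axis=1, keepdims=True)   # unit vectors: dot = cosine

# Build: partition the items into n_lists cells with a few k-means-style steps.
n_lists = 20
centroids = items[rng.choice(len(items), n_lists, replace=False)].copy()
for _ in range(5):
    cell = np.argmax(items @ centroids.T, axis=1)        # best-matching centroid by dot product
    for c in range(n_lists):
        members = items[cell == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
cell = np.argmax(items @ centroids.T, axis=1)            # final assignment
inverted_lists = [np.flatnonzero(cell == c) for c in range(n_lists)]

def exact_search(q, k):
    """Brute force: score all N items."""
    scores = items @ q
    return np.argsort(-scores)[:k]

def ivf_search(q, k, nprobe=4):
    """Score only the items in the nprobe cells whose centroids best match q."""
    probe = np.argsort(-(centroids @ q))[:nprobe]
    cand = np.concatenate([inverted_lists[c] for c in probe])
    scores = items[cand] @ q
    return cand[np.argsort(-scores)[:k]]

q = items[0]
exact = exact_search(q, 10)
approx = ivf_search(q, 10, nprobe=4)
overlap = len(set(exact.tolist()) & set(approx.tolist())) / 10
print(f"probed {4} of {n_lists} cells, overlap with exact top-10: {overlap:.0%}")
```

Raising `nprobe` trades speed for accuracy: probing every cell reproduces the exact result, while probing a few cells scores only a fraction of the items and may miss some true neighbours.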
Comparison — Embedding Architectures
| Architecture | Interaction Function | Recall@10 (indicative; varies by dataset and protocol) | Complexity | Best For |
|---|---|---|---|---|
| MF (Matrix Factorisation) | Dot product | ~0.15–0.20 | Low — fast | Baseline, simple systems |
| NCF (Neural CF) | MLP + element-wise | ~0.18–0.25 | Medium | Non-linear patterns |
| LightGCN | Graph propagation | ~0.22–0.30 | Medium | Social networks, dense graphs |
| BERT4Rec | Transformer (sequential) | ~0.25–0.35 | High — slow training | Sessions, temporal patterns |
| Two-Tower (DSSM) | Separate towers, dot | ~0.20–0.28 | Medium | Large-scale retrieval (Google, Meta) |