The Story That Explains Everything
Simple, right? Except: 300 million users × 20,000 movies = 6 trillion cells. Storing that table would cost billions. Worse — most of those cells are empty. The average user has rated fewer than 100 movies. That's a 99.9995% empty table. Statisticians call this a sparse matrix.
Matrix Factorization is the algorithm that says: "We don't need to store that monster. We can compress it down to two tiny tables that, when multiplied together, reconstruct any rating — even for movies the user has never seen."
That compression — from one enormous sparse table into two small dense ones — is the entire idea. The rest of this tutorial unpacks how it works, why it's mathematically sound, and how to implement it from scratch and with libraries.
Latent factors and what they really mean · Singular Value Decomposition (SVD) and its geometric intuition · Low-rank approximation — the mathematical backbone · Hidden embeddings — why they outperform explicit features · The key formula R ≈ PQᵀ — and how to optimize it.
The Rating Matrix R — Our Starting Point
Before factorizing anything, you need to understand what R looks like. It is a matrix of shape m × n where m = number of users and n = number of items. Each observed entry R[u, i] is the rating user u gave item i. Most entries are missing — we use a ? to denote unobserved cells.
Only 14 of 30 possible cells are filled — that's a 53% sparsity rate. In real systems this is 99.99%+. Our job: predict every ? accurately.
The naïve approach — predict Alice's Titanic rating as the average of all Titanic ratings (3.5) — ignores the fact that Alice specifically dislikes Romance and loves Sci-Fi. Matrix Factorization learns Alice's personal taste profile and Titanic's genre profile simultaneously, giving a far more accurate prediction.
Latent Factors — The Hidden Language of Taste
She can also profile a customer: "This person likes bold, full-bodied reds." When she matches wine to customer, she is computing how well the wine's hidden profile aligns with the customer's hidden preference profile.
That is exactly what latent factors are. The algorithm invents its own taste dimensions — it discovers them from the data — and represents both users and items in terms of those invented dimensions.
A latent factor is a hidden dimension that explains observed behaviour without being directly measured. We choose how many latent factors to use (call it k). Typical values: 10, 20, 50, 100 for large systems.
You do not choose what the latent factors represent. You do not hand-label them as "genre" or "complexity". The algorithm discovers the most useful dimensions purely from the pattern of ratings. This is what makes it powerful — it finds structure you never knew existed. The labels above are just human interpretations added after the fact.
The Core Formula — R ≈ PQᵀ
The fundamental idea: decompose the rating matrix R (shape m×n) into two smaller matrices P (shape m×k) and Q (shape n×k), such that their product approximates R.
P (green) holds user embeddings, Qᵀ (amber) holds item embeddings (transposed). Their product fills every missing cell in R with a predicted rating.
Low-Rank Approximation — The Mathematical Backbone
The phrase "low-rank approximation" is the formal mathematical description of what we are doing. It comes from linear algebra — let's build intuition carefully.
Low-rank approximation does the same thing to a matrix. A full-rank m×n matrix might have min(m,n) independent dimensions. But rating data has far less truly independent structure — most taste differences can be explained by just k ≪ n underlying factors. We keep only those k dimensions and discard the rest.
| Concept | Full Rank Matrix | Low-Rank Approximation (k factors) |
|---|---|---|
| Storage | m × n numbers | k × (m + n) numbers |
| Example (1M users, 100k items, k=50) | 100 billion numbers | 55 million numbers — 1,800× smaller |
| Missing values | Cannot predict them | Predicted via dot product |
| Noise handling | Memorises noise | Smoothed out — only structure survives |
| Generalisation | None — can't extrapolate | Strong — learned patterns transfer |
| Rank | min(m, n) | k (chosen, small) |
The rank of a matrix is the number of linearly independent rows (or columns). A full-rank 5×6 matrix has rank 5 — all rows contain unique information. A rank-2 approximation expresses the entire matrix using only 2 independent dimensions. We call this low-rank because k ≪ min(m, n). The approximation is accurate precisely because real rating data actually is low-rank — taste is simple.
SVD — Singular Value Decomposition
The theoretically optimal low-rank approximation for a complete matrix comes from Singular Value Decomposition (SVD). SVD decomposes any matrix R into three matrices:
The Eckart–Young theorem guarantees that the rank-k SVD truncation is the best possible rank-k approximation in terms of Frobenius norm.
Classic SVD requires a complete matrix — every cell must have a value. Real rating data has 99%+ missing entries. Imputing all missing values with zeros or means before running SVD severely distorts the result. This is why practical recommender systems use matrix factorization via gradient descent instead — it optimises only on observed ratings and ignores missing ones entirely.
Hidden Embeddings — The Secret Geometry of Taste
Predicting a rating is now geometry: how close is the user's dot to the movie's dot? Close means high predicted rating. Far means low.
That map is the embedding space. Those coordinates are the embeddings. The algorithm learns the best map automatically from the data.
Users (circles) near movies (squares) get high predicted ratings. The embedding space makes similarity geometrically intuitive.
| Entity | Embedding Vector | Latent Factor 1 (Sci-Fi) | Latent Factor 2 (Family) | Predicted Alice rating |
|---|---|---|---|---|
| Alice | P[alice,:] | 0.90 | 0.05 | — |
| Bob | P[bob,:] | 0.08 | 0.88 | — |
| Inception | Q[inception,:] | 0.95 | 0.02 | 0.90×0.95 + 0.05×0.02 = 0.856 → ~5★ |
| Moana | Q[moana,:] | 0.04 | 0.92 | 0.90×0.04 + 0.05×0.92 = 0.082 → ~1★ |
| Titanic | Q[titanic,:] | 0.50 | 0.55 | 0.90×0.50 + 0.05×0.55 = 0.478 → ~3★ |
Learning P and Q — Gradient Descent on the Loss
We initialise P and Q with small random values, then optimise them to minimise prediction error on the observed ratings only. The loss function is:
Matrix Factorization from Scratch in Python
The cleanest way to build intuition is to implement the SGD update loop yourself. The entire algorithm fits in about 30 lines of NumPy.
import numpy as np
# ── Toy rating matrix (0 = not rated) ────────────────────────────────────
R = np.array([
[5, 4, 0, 0, 0, 5],
[0, 0, 4, 5, 4, 0],
[4, 5, 0, 0, 0, 0],
[0, 0, 3, 0, 5, 0],
[2, 0, 0, 3, 0, 1],
], dtype=float)
m, n = R.shape # 5 users, 6 movies
k = 2 # latent factors
eta = 0.01 # learning rate
lam = 0.01 # regularisation
# ── Initialise P and Q randomly ────────────────────────────────────────
np.random.seed(42)
P = np.random.normal(0, 0.1, (m, k))
Q = np.random.normal(0, 0.1, (n, k))
# ── Index of observed ratings ──────────────────────────────────────────
observed = [(u, i) for u in range(m) for i in range(n) if R[u, i] > 0]
# ── SGD training loop ──────────────────────────────────────────────────
for epoch in range(500):
for u, i in observed:
pred = P[u] @ Q[i] # dot product
err = R[u, i] - pred # residual error
P[u] += eta * (err * Q[i] - lam * P[u])
Q[i] += eta * (err * P[u] - lam * Q[i])
if epoch % 100 == 0:
total_loss = sum((R[u,i] - P[u] @ Q[i])**2 for u,i in observed)
print(f"Epoch {epoch:4d} | Loss: {total_loss:.4f}")
# ── Reconstruct full matrix ────────────────────────────────────────────
R_hat = P @ Q.T
print("\nPredicted Rating Matrix R̂:")
print(np.round(R_hat, 2))
Alice's row shows high predictions for Sci-Fi (columns 1, 2, 6) and low for Family. Bob's row is the mirror. The algorithm discovered the genre structure purely from 14 ratings — with no labels and no genre information provided. This is the power of latent factors.
Truncated SVD in Practice
When your matrix is reasonably complete (e.g. after mean-imputing missing values for a quick baseline), scipy.sparse.linalg.svds gives you the rank-k truncated SVD directly without computing all singular values.
import numpy as np
from scipy.sparse.linalg import svds
# ── Impute missing values with row means (simple baseline) ─────────────
R_full = R.copy()
row_means = np.true_divide(R_full.sum(1), (R_full != 0).sum(1))
for u in range(m):
R_full[u, R_full[u] == 0] = row_means[u]
# ── Truncated SVD with k=2 factors ─────────────────────────────────────
k = 2
U, sigma, Vt = svds(R_full, k=k)
# sigma is a 1D array; make it a diagonal matrix
Sigma = np.diag(sigma)
# ── Reconstruct ─────────────────────────────────────────────────────────
R_svd = U @ Sigma @ Vt
print("SVD Low-rank Reconstruction:")
print(np.round(R_svd, 2))
# ── Singular values show variance captured ──────────────────────────────
print(f"\nSingular values: {np.round(sigma, 3)}")
total_var = np.sum(sigma**2)
print(f"Variance explained by k={k}: {100*total_var/np.sum(np.linalg.svd(R_full,compute_uv=False)**2):.1f}%")
Production Implementation — Surprise Library
For real datasets, the Surprise library provides optimised, cross-validated matrix factorization via its SVD class (which is actually the SGD-optimised MF, not classical SVD — a common naming quirk).
from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate, GridSearchCV
import pandas as pd
# ── Load MovieLens 100k ───────────────────────────────────────────────
data = Dataset.load_builtin('ml-100k')
# ── Baseline SVD — default hyperparams ───────────────────────────────
algo = SVD(n_factors=50, n_epochs=20, lr_all=0.005, reg_all=0.02, random_state=42)
results = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
print(f"\nMean RMSE: {results['test_rmse'].mean():.4f}")
print(f"Mean MAE: {results['test_mae'].mean():.4f}")
# ── Grid search over k and regularisation ────────────────────────────
param_grid = {
'n_factors': [20, 50, 100],
'n_epochs': [20, 40],
'lr_all': [0.003, 0.005],
'reg_all': [0.02, 0.1]
}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)
gs.fit(data)
print(f"\nBest RMSE: {gs.best_score['rmse']:.4f}")
print(f"Best params: {gs.best_params['rmse']}")
# ── Predict a specific user-item rating ──────────────────────────────
trainset = data.build_full_trainset()
algo.fit(trainset)
pred = algo.predict(uid='196', iid='302', r_ui=4, verbose=True)
Hyperparameter Deep Dive
| Hyperparameter | Symbol | Typical Range | Effect of Increase | Risk |
|---|---|---|---|---|
| Latent Factors | k | 10 – 200 | Richer representation, captures subtler taste | Overfitting, slower training |
| Learning Rate | η | 0.001 – 0.02 | Faster convergence (up to a point) | Divergence if too high |
| Regularisation | λ | 0.001 – 0.1 | Smaller embeddings, less overfitting | Underfitting if too large |
| Epochs | T | 20 – 200 | Better convergence | Overfitting; use early stopping |
| Bias Terms | bᵤ, bᵢ | yes/no | Accounts for generous raters, blockbuster effect | Usually always use them |
The basic formula R̂[u,i] = P[u]·Q[i] ignores two important effects: (1) some users always rate high (optimistic bias bᵤ), and (2) some movies are universally loved (item bias bᵢ). The improved formula is R̂[u,i] = μ + bᵤ + bᵢ + P[u]·Q[i] where μ is the global mean. This consistently reduces RMSE by 5–10% in real datasets. Surprise's SVD includes this automatically — set biased=True (the default).
Matrix Factorization vs Other Approaches
| Approach | How It Works | Strengths | Weaknesses |
|---|---|---|---|
| Matrix Factorization | Learns latent factors from ratings | No domain knowledge needed; discovers hidden taste dimensions | Cold-start problem — new users/items have no ratings |
| User-Based CF | Find similar users; recommend what they liked | Intuitive; serendipitous discoveries | Scales poorly; user similarity degrades at scale |
| Item-Based CF | Find similar items; recommend similar to what you liked | More stable; pre-computeable | Over-specialises; ignores evolving taste |
| Content Filtering | Uses item features (genre, actors, director) | Solves cold start for items; explainable | Misses cross-genre surprises; requires metadata |
| Neural CF / Deep MF | Replaces dot product with a neural network | Learns non-linear taste interactions | Heavier compute; less interpretable; slower |
| Hybrid (MF + Content) | Combines latent factors with item features | Best of both; handles cold start + taste discovery | More complex; harder to tune |
The Cold Start Problem — MF's Achilles' Heel
That is the cold-start problem. Matrix Factorization cannot generate embeddings for a user or item with zero ratings. The embedding vector simply does not exist in P or Q.
Evaluation Metrics for Recommendation Quality
RMSE optimises for rating prediction accuracy. But in production, users never see the predicted number — they see a ranked list of items. A model with worse RMSE can produce a better ranked list. Most modern systems report Precision@10, Recall@10, and NDCG@10 as primary metrics. Optimise for what users actually experience.