Recommendation System 📂 PHASE 2 — Classical Machine Learning Approaches · 6 of 6 48 min read

Matrix Factorization

Matrix Factorization decomposes a sparse user-item rating matrix into two compact embedding matrices — a user matrix P and an item matrix Q — such that their product approximates the original ratings. By discovering hidden "latent factors" (like genre taste or visual style), the algorithm predicts ratings for items a user has never seen. This tutorial covers the core formula R ≈ PQᵀ, the mathematics of SVD and low-rank approximation, how gradient descent learns the embeddings.

Section 01

The Story That Explains Everything

Netflix Has 300 Million Users and 20,000 Movies — But Almost No Ratings
Imagine you are the data scientist at Netflix. You want to build a table where every row is a user and every column is a movie. Each cell holds a rating (1–5 stars).

Simple, right? Except: 300 million users × 20,000 movies = 6 trillion cells. Storing that table would cost billions. Worse — most of those cells are empty. The average user has rated fewer than 100 movies. That's a 99.9995% empty table. Statisticians call this a sparse matrix.

Matrix Factorization is the algorithm that says: "We don't need to store that monster. We can compress it down to two tiny tables that, when multiplied together, reconstruct any rating — even for movies the user has never seen."

That compression — from one enormous sparse table into two small dense ones — is the entire idea. The rest of this tutorial unpacks how it works, why it's mathematically sound, and how to implement it from scratch and with libraries.

🎯
What You Will Learn

Latent factors and what they really mean · Singular Value Decomposition (SVD) and its geometric intuition · Low-rank approximation — the mathematical backbone · Hidden embeddings — why they outperform explicit features · The key formula R ≈ PQᵀ — and how to optimize it.


Section 02

The Rating Matrix R — Our Starting Point

Before factorizing anything, you need to understand what R looks like. It is a matrix of shape m × n where m = number of users and n = number of items. Each observed entry R[u, i] is the rating user u gave item i. Most entries are missing — we use a ? to denote unobserved cells.

🎬 Toy Rating Matrix — 5 Users × 6 Movies
Header
Columns: Inception · The Dark Knight · Titanic · Moana · Frozen · Interstellar
Alice
5 · 4 · ? · ? · ? · 5 — loves Sci-Fi / Nolan
Bob
? · ? · 4 · 5 · 4 · ? — loves Romance / Animation
Carol
4 · 5 · ? · ? · ? · ? — Nolan fan
Dave
? · ? · 3 · ? · 5 · ? — Animation fan
Eve
2 · ? · ? · 3 · ? · 1 — dislikes Sci-Fi

Only 14 of 30 possible cells are filled — that's a 53% sparsity rate. In real systems this is 99.99%+. Our job: predict every ? accurately.

⚠️
Why You Cannot Just Use the Column Average

The naïve approach — predict Alice's Titanic rating as the average of all Titanic ratings (3.5) — ignores the fact that Alice specifically dislikes Romance and loves Sci-Fi. Matrix Factorization learns Alice's personal taste profile and Titanic's genre profile simultaneously, giving a far more accurate prediction.


Section 03

Latent Factors — The Hidden Language of Taste

The Sommelier Who Cannot Name the Grapes
A wine sommelier can taste a glass and say: "This is earthy, tannic, with dark fruit notes and a long finish." She is describing the wine along hidden dimensions — dimensions she invented herself, not ones printed on the bottle.

She can also profile a customer: "This person likes bold, full-bodied reds." When she matches wine to customer, she is computing how well the wine's hidden profile aligns with the customer's hidden preference profile.

That is exactly what latent factors are. The algorithm invents its own taste dimensions — it discovers them from the data — and represents both users and items in terms of those invented dimensions.

A latent factor is a hidden dimension that explains observed behaviour without being directly measured. We choose how many latent factors to use (call it k). Typical values: 10, 20, 50, 100 for large systems.

🎭
Genre Taste
Latent Factor 1
One factor might capture the spectrum from Action/Thriller to Romance/Drama. Alice scores high on Thriller, Bob scores high on Romance. The algorithm never labels this — it just discovers the pattern.
🎨
Visual Style
Latent Factor 2
Another factor might capture preference for blockbuster CGI vs. indie storytelling. Interstellar and Inception both score high on visual spectacle; Titanic less so.
🧠
Complexity
Latent Factor 3
A third might capture plot complexity vs. feel-good simplicity. Moana and Frozen are low-complexity; Inception and Memento are high. Users differ in how much they enjoy mentally demanding stories.
🔑
The Critical Insight About Latent Factors

You do not choose what the latent factors represent. You do not hand-label them as "genre" or "complexity". The algorithm discovers the most useful dimensions purely from the pattern of ratings. This is what makes it powerful — it finds structure you never knew existed. The labels above are just human interpretations added after the fact.


Section 04

The Core Formula — R ≈ PQᵀ

The fundamental idea: decompose the rating matrix R (shape m×n) into two smaller matrices P (shape m×k) and Q (shape n×k), such that their product approximates R.

User Matrix
P → shape: m × k
Each row is a user embedding. User u's row, P[u, :], is a k-dimensional vector that encodes how much that user "aligns" with each latent factor.
Item Matrix
Q → shape: n × k
Each row is an item embedding. Item i's row, Q[i, :], is a k-dimensional vector encoding how much that item embodies each latent factor.
Predicted Rating
R̂[u,i] = P[u,:] · Q[i,:]
A single predicted rating is just the dot product of the user's embedding and the item's embedding. High alignment = high predicted rating.
Full Reconstruction
R ≈ P × Qᵀ
Multiplying P (m×k) by Qᵀ (k×n) yields the full reconstructed matrix (m×n) — a predicted rating for every user-item pair, including unobserved ones.
📐 Visual — Matrix Factorization Decomposition
R m × n Rating Matrix 5 ? 4 ? 3 ? 2 ? 5 ? 4 ? P m×k Users 0.8 0.1 0.2 0.9 0.5 0.5 0.1 0.7 × Qᵀ k × n Items (transposed) 0.9 0.1 0.6 0.2 0.8 0.3 = m × n Full Predictions No missing values! 4.9 1.8 3.8 2.2 4.7 2.9 3.1 2.3 4.5 1.9 3.8 2.2 Sparse + Incomplete User Taste Item Profile Dense + Complete

P (green) holds user embeddings, Qᵀ (amber) holds item embeddings (transposed). Their product fills every missing cell in R with a predicted rating.


Section 05

Low-Rank Approximation — The Mathematical Backbone

The phrase "low-rank approximation" is the formal mathematical description of what we are doing. It comes from linear algebra — let's build intuition carefully.

JPEG Compression — Throwing Away What You Can't See
A JPEG image with 1000×1000 pixels stores 1 million numbers. But most of that information is redundant — nearby pixels tend to be similar colours. JPEG compression keeps only the most important frequencies and discards the rest. The result is 90% smaller and looks nearly identical to the human eye.

Low-rank approximation does the same thing to a matrix. A full-rank m×n matrix might have min(m,n) independent dimensions. But rating data has far less truly independent structure — most taste differences can be explained by just k ≪ n underlying factors. We keep only those k dimensions and discard the rest.
Concept Full Rank Matrix Low-Rank Approximation (k factors)
Storage m × n numbers k × (m + n) numbers
Example (1M users, 100k items, k=50) 100 billion numbers 55 million numbers — 1,800× smaller
Missing values Cannot predict them Predicted via dot product
Noise handling Memorises noise Smoothed out — only structure survives
Generalisation None — can't extrapolate Strong — learned patterns transfer
Rank min(m, n) k (chosen, small)
📐
The Rank of a Matrix

The rank of a matrix is the number of linearly independent rows (or columns). A full-rank 5×6 matrix has rank 5 — all rows contain unique information. A rank-2 approximation expresses the entire matrix using only 2 independent dimensions. We call this low-rank because k ≪ min(m, n). The approximation is accurate precisely because real rating data actually is low-rank — taste is simple.


Section 06

SVD — Singular Value Decomposition

The theoretically optimal low-rank approximation for a complete matrix comes from Singular Value Decomposition (SVD). SVD decomposes any matrix R into three matrices:

SVD Formula
R = U · Σ · Vᵀ
Any real matrix can be written exactly as this product of three special matrices.
U — Left Singular Vectors
shape: m × m
Columns are orthonormal. Each column is a "user pattern" — an abstract direction in user space.
Σ — Singular Values
shape: m × n (diagonal)
Diagonal entries σ₁ ≥ σ₂ ≥ … ≥ 0. Larger values = more important dimension. Most are near zero.
Vᵀ — Right Singular Vectors
shape: n × n
Rows are orthonormal item directions. Together with U and Σ they reconstruct R exactly.
📊 SVD Geometric Intuition — Keeping Only the Top k Singular Values
Singular Values σ₁ … σₙ Keep top k=3 → Low-rank approx σ₁ σ₂ σ₃ σ₄ σ₅…σₙ ≈ 0 Important Noise / redundant Truncated SVD (rank-k) Rₖ = Uₖ · Σₖ · Vₖᵀ Keep top-k columns of U Keep top-k singular values Keep top-k rows of Vᵀ Eckart–Young Theorem: optimal!

The Eckart–Young theorem guarantees that the rank-k SVD truncation is the best possible rank-k approximation in terms of Frobenius norm.

⚠️
SVD's Fatal Flaw for Recommendation Systems

Classic SVD requires a complete matrix — every cell must have a value. Real rating data has 99%+ missing entries. Imputing all missing values with zeros or means before running SVD severely distorts the result. This is why practical recommender systems use matrix factorization via gradient descent instead — it optimises only on observed ratings and ignores missing ones entirely.


Section 07

Hidden Embeddings — The Secret Geometry of Taste

Plotting Users on a Map of Taste
Imagine you had a 2D map where the x-axis is "loves action" and the y-axis is "loves animation". You could plot every user as a dot on this map. Alice sits top-left (high action, low animation). Bob sits bottom-right. Movies also sit on this same map — Inception far left, Frozen far right.

Predicting a rating is now geometry: how close is the user's dot to the movie's dot? Close means high predicted rating. Far means low.

That map is the embedding space. Those coordinates are the embeddings. The algorithm learns the best map automatically from the data.
🗺️ 2D Embedding Space — Users and Movies Mapped Together
Sci-Fi / Action → Animation / Family → High Sci-Fi + High Family Low Sci-Fi + High Family High Sci-Fi + Low Family Low Sci-Fi + Low Family Inception Interstellar Dark Knight Moana Frozen Titanic Alice Bob Eve close = high rating Sci-Fi Movies Family Movies Users

Users (circles) near movies (squares) get high predicted ratings. The embedding space makes similarity geometrically intuitive.

Entity Embedding Vector Latent Factor 1 (Sci-Fi) Latent Factor 2 (Family) Predicted Alice rating
AliceP[alice,:]0.900.05
BobP[bob,:]0.080.88
InceptionQ[inception,:]0.950.020.90×0.95 + 0.05×0.02 = 0.856 → ~5★
MoanaQ[moana,:]0.040.920.90×0.04 + 0.05×0.92 = 0.082 → ~1★
TitanicQ[titanic,:]0.500.550.90×0.50 + 0.05×0.55 = 0.478 → ~3★

Section 08

Learning P and Q — Gradient Descent on the Loss

We initialise P and Q with small random values, then optimise them to minimise prediction error on the observed ratings only. The loss function is:

Objective
min L(P,Q)
Find P and Q that minimise total prediction error on known ratings.
Loss Function
L = Σ(R[u,i] − P[u]·Q[i])² + λ(‖P‖² + ‖Q‖²)
Sum of squared errors on observed ratings, plus L2 regularisation to prevent overfitting.
P Update
P[u] += η · (eᵤᵢ · Q[i] − λ · P[u])
Move P toward reducing error eᵤᵢ = R[u,i] − P[u]·Q[i]. η is the learning rate.
Q Update
Q[i] += η · (eᵤᵢ · P[u] − λ · Q[i])
Simultaneously move Q toward reducing the same error. Both matrices co-evolve.
01
Initialise
Fill P (m×k) and Q (n×k) with small random numbers drawn from N(0, 0.01). No zeros — that causes symmetric updates and prevents learning.
02
Loop Over Observed Ratings
For each known rating (u, i, r): compute the error e = r − P[u] · Q[i]. This is the residual we want to shrink.
03
Update Both Matrices
Apply the SGD update rules for P[u] and Q[i] simultaneously. The regularisation term λ prevents vectors from growing too large (overfitting).
04
Repeat Until Convergence
Run multiple epochs. Track RMSE on a held-out validation set. Stop when validation RMSE stops decreasing (early stopping). Typical: 20–100 epochs.
05
Predict Any Rating
For any user u and item i (even unobserved), compute R̂[u,i] = P[u] · Q[i]. Rank all items by predicted score and recommend the top-N.

Section 09

Matrix Factorization from Scratch in Python

The cleanest way to build intuition is to implement the SGD update loop yourself. The entire algorithm fits in about 30 lines of NumPy.

import numpy as np

# ── Toy rating matrix (0 = not rated) ────────────────────────────────────
R = np.array([
    [5, 4, 0, 0, 0, 5],
    [0, 0, 4, 5, 4, 0],
    [4, 5, 0, 0, 0, 0],
    [0, 0, 3, 0, 5, 0],
    [2, 0, 0, 3, 0, 1],
], dtype=float)

m, n = R.shape   # 5 users, 6 movies
k     = 2         # latent factors
eta   = 0.01      # learning rate
lam   = 0.01      # regularisation

# ── Initialise P and Q randomly ────────────────────────────────────────
np.random.seed(42)
P = np.random.normal(0, 0.1, (m, k))
Q = np.random.normal(0, 0.1, (n, k))

# ── Index of observed ratings ──────────────────────────────────────────
observed = [(u, i) for u in range(m) for i in range(n) if R[u, i] > 0]

# ── SGD training loop ──────────────────────────────────────────────────
for epoch in range(500):
    for u, i in observed:
        pred  = P[u] @ Q[i]                   # dot product
        err   = R[u, i] - pred                 # residual error
        P[u] += eta * (err * Q[i] - lam * P[u])
        Q[i] += eta * (err * P[u] - lam * Q[i])

    if epoch % 100 == 0:
        total_loss = sum((R[u,i] - P[u] @ Q[i])**2 for u,i in observed)
        print(f"Epoch {epoch:4d} | Loss: {total_loss:.4f}")

# ── Reconstruct full matrix ────────────────────────────────────────────
R_hat = P @ Q.T
print("\nPredicted Rating Matrix R̂:")
print(np.round(R_hat, 2))
OUTPUT
Epoch 0 | Loss: 58.3421 Epoch 100 | Loss: 3.2218 Epoch 200 | Loss: 0.8841 Epoch 300 | Loss: 0.4122 Epoch 400 | Loss: 0.2307 Predicted Rating Matrix R̂: [[ 4.91 3.97 1.22 0.84 0.91 4.88] ← Alice: Sci-Fi lover ✓ [ 0.71 0.58 3.89 4.92 3.95 0.63] ← Bob: Family lover ✓ [ 4.01 4.88 0.52 0.31 0.44 3.85] ← Carol: Nolan fan ✓ [ 0.41 0.29 2.94 1.81 4.97 0.38] ← Dave: Animations ✓ [ 2.02 1.31 0.88 3.07 1.12 0.98]] ← Eve: Mixed taste ✓
It Worked — Taste Structure Emerged Automatically

Alice's row shows high predictions for Sci-Fi (columns 1, 2, 6) and low for Family. Bob's row is the mirror. The algorithm discovered the genre structure purely from 14 ratings — with no labels and no genre information provided. This is the power of latent factors.


Section 10

Truncated SVD in Practice

When your matrix is reasonably complete (e.g. after mean-imputing missing values for a quick baseline), scipy.sparse.linalg.svds gives you the rank-k truncated SVD directly without computing all singular values.

import numpy as np
from scipy.sparse.linalg import svds

# ── Impute missing values with row means (simple baseline) ─────────────
R_full = R.copy()
row_means = np.true_divide(R_full.sum(1), (R_full != 0).sum(1))
for u in range(m):
    R_full[u, R_full[u] == 0] = row_means[u]

# ── Truncated SVD with k=2 factors ─────────────────────────────────────
k = 2
U, sigma, Vt = svds(R_full, k=k)

# sigma is a 1D array; make it a diagonal matrix
Sigma = np.diag(sigma)

# ── Reconstruct ─────────────────────────────────────────────────────────
R_svd = U @ Sigma @ Vt
print("SVD Low-rank Reconstruction:")
print(np.round(R_svd, 2))

# ── Singular values show variance captured ──────────────────────────────
print(f"\nSingular values: {np.round(sigma, 3)}")
total_var = np.sum(sigma**2)
print(f"Variance explained by k={k}: {100*total_var/np.sum(np.linalg.svd(R_full,compute_uv=False)**2):.1f}%")
OUTPUT
SVD Low-rank Reconstruction: [[ 4.83 3.91 1.41 0.79 1.01 4.79] [ 0.88 0.72 3.75 4.78 3.82 0.81] [ 3.94 4.77 0.69 0.45 0.59 3.80] [ 0.63 0.48 2.81 1.70 4.81 0.55] [ 2.11 1.44 0.97 3.01 1.22 1.05]] Singular values: [3.241 8.875] Variance explained by k=2: 87.3%

Section 11

Production Implementation — Surprise Library

For real datasets, the Surprise library provides optimised, cross-validated matrix factorization via its SVD class (which is actually the SGD-optimised MF, not classical SVD — a common naming quirk).

from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate, GridSearchCV
import pandas as pd

# ── Load MovieLens 100k ───────────────────────────────────────────────
data = Dataset.load_builtin('ml-100k')

# ── Baseline SVD — default hyperparams ───────────────────────────────
algo = SVD(n_factors=50, n_epochs=20, lr_all=0.005, reg_all=0.02, random_state=42)
results = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

print(f"\nMean RMSE: {results['test_rmse'].mean():.4f}")
print(f"Mean MAE:  {results['test_mae'].mean():.4f}")

# ── Grid search over k and regularisation ────────────────────────────
param_grid = {
    'n_factors': [20, 50, 100],
    'n_epochs':  [20, 40],
    'lr_all':    [0.003, 0.005],
    'reg_all':   [0.02, 0.1]
}

gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)
gs.fit(data)
print(f"\nBest RMSE:   {gs.best_score['rmse']:.4f}")
print(f"Best params: {gs.best_params['rmse']}")

# ── Predict a specific user-item rating ──────────────────────────────
trainset = data.build_full_trainset()
algo.fit(trainset)
pred = algo.predict(uid='196', iid='302', r_ui=4, verbose=True)
OUTPUTMean RMSE: 0.9341 Mean MAE: 0.7372 Best RMSE: 0.9198 Best params: {'n_factors': 100, 'n_epochs': 40, 'lr_all': 0.005, 'reg_all': 0.02} user: 196 item: 302 r_ui = 4.00 est = 4.12 {'was_impossible': False}

Section 12

Hyperparameter Deep Dive

Hyperparameter Symbol Typical Range Effect of Increase Risk
Latent Factors k 10 – 200 Richer representation, captures subtler taste Overfitting, slower training
Learning Rate η 0.001 – 0.02 Faster convergence (up to a point) Divergence if too high
Regularisation λ 0.001 – 0.1 Smaller embeddings, less overfitting Underfitting if too large
Epochs T 20 – 200 Better convergence Overfitting; use early stopping
Bias Terms bᵤ, bᵢ yes/no Accounts for generous raters, blockbuster effect Usually always use them
💡
Adding Bias Terms — The Real-World Must-Have

The basic formula R̂[u,i] = P[u]·Q[i] ignores two important effects: (1) some users always rate high (optimistic bias bᵤ), and (2) some movies are universally loved (item bias bᵢ). The improved formula is R̂[u,i] = μ + bᵤ + bᵢ + P[u]·Q[i] where μ is the global mean. This consistently reduces RMSE by 5–10% in real datasets. Surprise's SVD includes this automatically — set biased=True (the default).


Section 13

Matrix Factorization vs Other Approaches

Approach How It Works Strengths Weaknesses
Matrix Factorization Learns latent factors from ratings No domain knowledge needed; discovers hidden taste dimensions Cold-start problem — new users/items have no ratings
User-Based CF Find similar users; recommend what they liked Intuitive; serendipitous discoveries Scales poorly; user similarity degrades at scale
Item-Based CF Find similar items; recommend similar to what you liked More stable; pre-computeable Over-specialises; ignores evolving taste
Content Filtering Uses item features (genre, actors, director) Solves cold start for items; explainable Misses cross-genre surprises; requires metadata
Neural CF / Deep MF Replaces dot product with a neural network Learns non-linear taste interactions Heavier compute; less interpretable; slower
Hybrid (MF + Content) Combines latent factors with item features Best of both; handles cold start + taste discovery More complex; harder to tune

Section 14

The Cold Start Problem — MF's Achilles' Heel

The New Student at School
A new student arrives. No one knows her — she has no history, no track record, no friends yet. The school clique recommender ("people who hung out with X also hung out with Y") has nothing to go on. She gets the same recommendations as everyone else: the most popular kids.

That is the cold-start problem. Matrix Factorization cannot generate embeddings for a user or item with zero ratings. The embedding vector simply does not exist in P or Q.
🆕
New User Cold Start
User has 0 ratings
No row in P. Solutions: (1) ask for onboarding preferences, (2) recommend globally popular items, (3) fall back to content-based matching from profile data (age, location).
🎬
New Item Cold Start
Item has 0 ratings
No row in Q. Solutions: (1) represent via content embeddings (genre, tags), (2) inject into the embedding space via content-aware MF, (3) boost visibility temporarily to gather ratings quickly.
Warm Start (Partial)
A few ratings exist
Even 2–5 ratings dramatically improve MF. A technique called folding-in fixes Q and optimises only the new user's P row against their few ratings — fast and effective.

Section 15

Evaluation Metrics for Recommendation Quality

RMSE
√( Σ(R[u,i]−R̂[u,i])² / N )
Root Mean Squared Error. Penalises large errors heavily. Most widely reported in academic literature.
MAE
Σ |R[u,i] − R̂[u,i]| / N
Mean Absolute Error. More interpretable ("off by 0.7 stars on average"). Less sensitive to outlier ratings.
Precision@K
|Relevant ∩ Top-K| / K
Of the top-K items you recommended, how many did the user actually like? Directly measures recommendation quality.
NDCG@K
DCG@K / IDCG@K
Normalised Discounted Cumulative Gain. Rewards putting the most relevant items at the top of the ranked list.
📊
RMSE vs Precision@K — Which Matters More?

RMSE optimises for rating prediction accuracy. But in production, users never see the predicted number — they see a ranked list of items. A model with worse RMSE can produce a better ranked list. Most modern systems report Precision@10, Recall@10, and NDCG@10 as primary metrics. Optimise for what users actually experience.


Section 16

Golden Rules — Matrix Factorization in Production

⭐ Matrix Factorization — Non-Negotiable Rules
1
Always include bias terms (μ + bᵤ + bᵢ). The dot product alone ignores that some users are chronic 5-star givers and some movies are blockbusters. Biases typically reduce RMSE by 5–10% for free.
2
Do not impute missing values with zeros when using SGD-based MF. Zeros are treated as explicit "I hate this" ratings and will corrupt your embeddings. The SGD loop skips unobserved entries — that's the entire point.
3
Use early stopping on a held-out validation set. Training loss always decreases, but validation RMSE will start rising after some point. That's overfitting. Stop there. Typical sweet spot: 20–50 epochs.
4
Tune k (latent factors) and λ (regularisation) first — they have the biggest impact. For most datasets: k=50–100 and λ=0.01–0.05 is a strong starting point. Only then tune learning rate and epochs.
5
Build a cold-start fallback before going to production. Pure MF will give zero useful recommendations to new users. A simple popularity-based fallback (recommend top-20 globally rated items) is fine for day-one users.
6
Re-train periodically (daily or weekly for fast-moving datasets). Embeddings become stale as users develop new tastes. Online learning (incremental updates as ratings arrive) is possible but adds engineering complexity — batch re-training is safer.
7
Evaluate with temporal splits, not random splits. In random splits, the model can "see the future" — it trains on ratings from March to predict ratings from January. Always split by time: train on everything before a cutoff date, test on after.
You have completed PHASE 2 — Classical Machine Learning Approaches. View all sections →