Matrix Factorization in Recommendation Systems

Section 01

The Story That Explains Everything

📖 Real World Analogy

Netflix Has 300 Million Users and 20,000 Movies — But Almost No Ratings

Imagine you are the data scientist at Netflix. You want to build a table where every row is a user and every column is a movie. Each cell holds a rating (1–5 stars).

Simple, right? Except: 300 million users × 20,000 movies = 6 trillion cells. Storing that table would cost billions. Worse — most of those cells are empty. The average user has rated fewer than 100 movies. That's a 99.9995% empty table. Statisticians call this a sparse matrix.

Matrix Factorization is the algorithm that says: "We don't need to store that monster. We can compress it down to two tiny tables that, when multiplied together, reconstruct any rating — even for movies the user has never seen."

That compression — from one enormous sparse table into two small dense ones — is the entire idea. The rest of this tutorial unpacks how it works, why it's mathematically sound, and how to implement it from scratch and with libraries.

🎯

What You Will Learn

Latent factors and what they really mean · Singular Value Decomposition (SVD) and its geometric intuition · Low-rank approximation — the mathematical backbone · Hidden embeddings — why they outperform explicit features · The key formula R ≈ PQᵀ — and how to optimize it.

Section 02

The Rating Matrix R — Our Starting Point

Before factorizing anything, you need to understand what R looks like. It is a matrix of shape m × n where m = number of users and n = number of items. Each observed entry R[u, i] is the rating user u gave item i. Most entries are missing — we use a ? to denote unobserved cells.

🎬 Toy Rating Matrix — 5 Users × 6 Movies

Header

Columns: Inception · The Dark Knight · Titanic · Moana · Frozen · Interstellar

Alice

5 · 4 · ? · ? · ? · 5 — loves Sci-Fi / Nolan

Bob

? · ? · 4 · 5 · 4 · ? — loves Romance / Animation

Carol

4 · 5 · ? · ? · ? · ? — Nolan fan

Dave

? · ? · 3 · ? · 5 · ? — Animation fan

Eve

2 · ? · ? · 3 · ? · 1 — dislikes Sci-Fi

Only 14 of 30 possible cells are filled — that's a 53% sparsity rate. In real systems this is 99.99%+. Our job: predict every ? accurately.

⚠️

Why You Cannot Just Use the Column Average

The naïve approach — predict Alice's Titanic rating as the average of all Titanic ratings (3.5) — ignores the fact that Alice specifically dislikes Romance and loves Sci-Fi. Matrix Factorization learns Alice's personal taste profile and Titanic's genre profile simultaneously, giving a far more accurate prediction.

Section 03

Latent Factors — The Hidden Language of Taste

📖 Story

The Sommelier Who Cannot Name the Grapes

A wine sommelier can taste a glass and say: "This is earthy, tannic, with dark fruit notes and a long finish." She is describing the wine along hidden dimensions — dimensions she invented herself, not ones printed on the bottle.

She can also profile a customer: "This person likes bold, full-bodied reds." When she matches wine to customer, she is computing how well the wine's hidden profile aligns with the customer's hidden preference profile.

That is exactly what latent factors are. The algorithm invents its own taste dimensions — it discovers them from the data — and represents both users and items in terms of those invented dimensions.

A latent factor is a hidden dimension that explains observed behaviour without being directly measured. We choose how many latent factors to use (call it k). Typical values: 10, 20, 50, 100 for large systems.

🎭

Genre Taste

Latent Factor 1

One factor might capture the spectrum from Action/Thriller to Romance/Drama. Alice scores high on Thriller, Bob scores high on Romance. The algorithm never labels this — it just discovers the pattern.

🎨

Visual Style

Latent Factor 2

Another factor might capture preference for blockbuster CGI vs. indie storytelling. Interstellar and Inception both score high on visual spectacle; Titanic less so.

🧠

Complexity

Latent Factor 3

A third might capture plot complexity vs. feel-good simplicity. Moana and Frozen are low-complexity; Inception and Memento are high. Users differ in how much they enjoy mentally demanding stories.

🔑

The Critical Insight About Latent Factors

You do not choose what the latent factors represent. You do not hand-label them as "genre" or "complexity". The algorithm discovers the most useful dimensions purely from the pattern of ratings. This is what makes it powerful — it finds structure you never knew existed. The labels above are just human interpretations added after the fact.

Section 04

The Core Formula — R ≈ PQᵀ

The fundamental idea: decompose the rating matrix R (shape m×n) into two smaller matrices P (shape m×k) and Q (shape n×k), such that their product approximates R.

User Matrix

P → shape: m × k

Each row is a user embedding. User u's row, P[u, :], is a k-dimensional vector that encodes how much that user "aligns" with each latent factor.

Item Matrix

Q → shape: n × k

Each row is an item embedding. Item i's row, Q[i, :], is a k-dimensional vector encoding how much that item embodies each latent factor.

Predicted Rating

R̂[u,i] = P[u,:] · Q[i,:]

A single predicted rating is just the dot product of the user's embedding and the item's embedding. High alignment = high predicted rating.

Full Reconstruction

R ≈ P × Qᵀ

Multiplying P (m×k) by Qᵀ (k×n) yields the full reconstructed matrix (m×n) — a predicted rating for every user-item pair, including unobserved ones.

📐 Visual — Matrix Factorization Decomposition

P (green) holds user embeddings, Qᵀ (amber) holds item embeddings (transposed). Their product fills every missing cell in R with a predicted rating.

Section 05

Low-Rank Approximation — The Mathematical Backbone

The phrase "low-rank approximation" is the formal mathematical description of what we are doing. It comes from linear algebra — let's build intuition carefully.

📖 Analogy

JPEG Compression — Throwing Away What You Can't See

A JPEG image with 1000×1000 pixels stores 1 million numbers. But most of that information is redundant — nearby pixels tend to be similar colours. JPEG compression keeps only the most important frequencies and discards the rest. The result is 90% smaller and looks nearly identical to the human eye.

Low-rank approximation does the same thing to a matrix. A full-rank m×n matrix might have min(m,n) independent dimensions. But rating data has far less truly independent structure — most taste differences can be explained by just k ≪ n underlying factors. We keep only those k dimensions and discard the rest.

Concept	Full Rank Matrix	Low-Rank Approximation (k factors)
Storage	m × n numbers	k × (m + n) numbers
Example (1M users, 100k items, k=50)	100 billion numbers	55 million numbers — 1,800× smaller
Missing values	Cannot predict them	Predicted via dot product
Noise handling	Memorises noise	Smoothed out — only structure survives
Generalisation	None — can't extrapolate	Strong — learned patterns transfer
Rank	min(m, n)	k (chosen, small)

📐

The Rank of a Matrix

The rank of a matrix is the number of linearly independent rows (or columns). A full-rank 5×6 matrix has rank 5 — all rows contain unique information. A rank-2 approximation expresses the entire matrix using only 2 independent dimensions. We call this low-rank because k ≪ min(m, n). The approximation is accurate precisely because real rating data actually is low-rank — taste is simple.

Section 06

SVD — Singular Value Decomposition

The theoretically optimal low-rank approximation for a complete matrix comes from Singular Value Decomposition (SVD). SVD decomposes any matrix R into three matrices:

SVD Formula

R = U · Σ · Vᵀ

Any real matrix can be written exactly as this product of three special matrices.

U — Left Singular Vectors

shape: m × m

Columns are orthonormal. Each column is a "user pattern" — an abstract direction in user space.

Σ — Singular Values

shape: m × n (diagonal)

Diagonal entries σ₁ ≥ σ₂ ≥ … ≥ 0. Larger values = more important dimension. Most are near zero.

Vᵀ — Right Singular Vectors

shape: n × n

Rows are orthonormal item directions. Together with U and Σ they reconstruct R exactly.

📊 SVD Geometric Intuition — Keeping Only the Top k Singular Values

The Eckart–Young theorem guarantees that the rank-k SVD truncation is the best possible rank-k approximation in terms of Frobenius norm.

⚠️

SVD's Fatal Flaw for Recommendation Systems

Classic SVD requires a complete matrix — every cell must have a value. Real rating data has 99%+ missing entries. Imputing all missing values with zeros or means before running SVD severely distorts the result. This is why practical recommender systems use matrix factorization via gradient descent instead — it optimises only on observed ratings and ignores missing ones entirely.

Section 07

Hidden Embeddings — The Secret Geometry of Taste

📖 Story

Plotting Users on a Map of Taste

Imagine you had a 2D map where the x-axis is "loves action" and the y-axis is "loves animation". You could plot every user as a dot on this map. Alice sits top-left (high action, low animation). Bob sits bottom-right. Movies also sit on this same map — Inception far left, Frozen far right.

Predicting a rating is now geometry: how close is the user's dot to the movie's dot? Close means high predicted rating. Far means low.

That map is the embedding space. Those coordinates are the embeddings. The algorithm learns the best map automatically from the data.

🗺️ 2D Embedding Space — Users and Movies Mapped Together

Users (circles) near movies (squares) get high predicted ratings. The embedding space makes similarity geometrically intuitive.

Entity	Embedding Vector	Latent Factor 1 (Sci-Fi)	Latent Factor 2 (Family)	Predicted Alice rating
Alice	P[alice,:]	0.90	0.05	—
Bob	P[bob,:]	0.08	0.88	—
Inception	Q[inception,:]	0.95	0.02	0.90×0.95 + 0.05×0.02 = 0.856 → ~5★
Moana	Q[moana,:]	0.04	0.92	0.90×0.04 + 0.05×0.92 = 0.082 → ~1★
Titanic	Q[titanic,:]	0.50	0.55	0.90×0.50 + 0.05×0.55 = 0.478 → ~3★

Section 08

Learning P and Q — Gradient Descent on the Loss

We initialise P and Q with small random values, then optimise them to minimise prediction error on the observed ratings only. The loss function is:

Objective

min L(P,Q)

Find P and Q that minimise total prediction error on known ratings.

Loss Function

L = Σ(R[u,i] − P[u]·Q[i])² + λ(‖P‖² + ‖Q‖²)

Sum of squared errors on observed ratings, plus L2 regularisation to prevent overfitting.

P Update

P[u] += η · (eᵤᵢ · Q[i] − λ · P[u])

Move P toward reducing error eᵤᵢ = R[u,i] − P[u]·Q[i]. η is the learning rate.

Q Update

Q[i] += η · (eᵤᵢ · P[u] − λ · Q[i])

Simultaneously move Q toward reducing the same error. Both matrices co-evolve.

Initialise

Fill P (m×k) and Q (n×k) with small random numbers drawn from N(0, 0.01). No zeros — that causes symmetric updates and prevents learning.

Loop Over Observed Ratings

For each known rating (u, i, r): compute the error e = r − P[u] · Q[i]. This is the residual we want to shrink.

Update Both Matrices

Apply the SGD update rules for P[u] and Q[i] simultaneously. The regularisation term λ prevents vectors from growing too large (overfitting).

Repeat Until Convergence

Run multiple epochs. Track RMSE on a held-out validation set. Stop when validation RMSE stops decreasing (early stopping). Typical: 20–100 epochs.

Predict Any Rating

For any user u and item i (even unobserved), compute R̂[u,i] = P[u] · Q[i]. Rank all items by predicted score and recommend the top-N.

Section 09

Matrix Factorization from Scratch in Python

The cleanest way to build intuition is to implement the SGD update loop yourself. The entire algorithm fits in about 30 lines of NumPy.

import numpy as np

# ── Toy rating matrix (0 = not rated) ────────────────────────────────────
R = np.array([
    [5, 4, 0, 0, 0, 5],
    [0, 0, 4, 5, 4, 0],
    [4, 5, 0, 0, 0, 0],
    [0, 0, 3, 0, 5, 0],
    [2, 0, 0, 3, 0, 1],
], dtype=float)

m, n = R.shape   # 5 users, 6 movies
k     = 2         # latent factors
eta   = 0.01      # learning rate
lam   = 0.01      # regularisation

# ── Initialise P and Q randomly ────────────────────────────────────────
np.random.seed(42)
P = np.random.normal(0, 0.1, (m, k))
Q = np.random.normal(0, 0.1, (n, k))

# ── Index of observed ratings ──────────────────────────────────────────
observed = [(u, i) for u in range(m) for i in range(n) if R[u, i] > 0]

# ── SGD training loop ──────────────────────────────────────────────────
for epoch in range(500):
    for u, i in observed:
        pred  = P[u] @ Q[i]                   # dot product
        err   = R[u, i] - pred                 # residual error
        P[u] += eta * (err * Q[i] - lam * P[u])
        Q[i] += eta * (err * P[u] - lam * Q[i])

    if epoch % 100 == 0:
        total_loss = sum((R[u,i] - P[u] @ Q[i])**2 for u,i in observed)
        print(f"Epoch {epoch:4d} | Loss: {total_loss:.4f}")

# ── Reconstruct full matrix ────────────────────────────────────────────
R_hat = P @ Q.T
print("\nPredicted Rating Matrix R̂:")
print(np.round(R_hat, 2))

OUTPUT

Epoch 0 | Loss: 58.3421 Epoch 100 | Loss: 3.2218 Epoch 200 | Loss: 0.8841 Epoch 300 | Loss: 0.4122 Epoch 400 | Loss: 0.2307 Predicted Rating Matrix R̂: [[ 4.91 3.97 1.22 0.84 0.91 4.88] ← Alice: Sci-Fi lover ✓ [ 0.71 0.58 3.89 4.92 3.95 0.63] ← Bob: Family lover ✓ [ 4.01 4.88 0.52 0.31 0.44 3.85] ← Carol: Nolan fan ✓ [ 0.41 0.29 2.94 1.81 4.97 0.38] ← Dave: Animations ✓ [ 2.02 1.31 0.88 3.07 1.12 0.98]] ← Eve: Mixed taste ✓

✅

It Worked — Taste Structure Emerged Automatically

Alice's row shows high predictions for Sci-Fi (columns 1, 2, 6) and low for Family. Bob's row is the mirror. The algorithm discovered the genre structure purely from 14 ratings — with no labels and no genre information provided. This is the power of latent factors.

Section 10

Truncated SVD in Practice

When your matrix is reasonably complete (e.g. after mean-imputing missing values for a quick baseline), scipy.sparse.linalg.svds gives you the rank-k truncated SVD directly without computing all singular values.

import numpy as np
from scipy.sparse.linalg import svds

# ── Impute missing values with row means (simple baseline) ─────────────
R_full = R.copy()
row_means = np.true_divide(R_full.sum(1), (R_full != 0).sum(1))
for u in range(m):
    R_full[u, R_full[u] == 0] = row_means[u]

# ── Truncated SVD with k=2 factors ─────────────────────────────────────
k = 2
U, sigma, Vt = svds(R_full, k=k)

# sigma is a 1D array; make it a diagonal matrix
Sigma = np.diag(sigma)

# ── Reconstruct ─────────────────────────────────────────────────────────
R_svd = U @ Sigma @ Vt
print("SVD Low-rank Reconstruction:")
print(np.round(R_svd, 2))

# ── Singular values show variance captured ──────────────────────────────
print(f"\nSingular values: {np.round(sigma, 3)}")
total_var = np.sum(sigma**2)
print(f"Variance explained by k={k}: {100*total_var/np.sum(np.linalg.svd(R_full,compute_uv=False)**2):.1f}%")

OUTPUT

SVD Low-rank Reconstruction: [[ 4.83 3.91 1.41 0.79 1.01 4.79] [ 0.88 0.72 3.75 4.78 3.82 0.81] [ 3.94 4.77 0.69 0.45 0.59 3.80] [ 0.63 0.48 2.81 1.70 4.81 0.55] [ 2.11 1.44 0.97 3.01 1.22 1.05]] Singular values: [3.241 8.875] Variance explained by k=2: 87.3%

Section 11

Production Implementation — Surprise Library

For real datasets, the Surprise library provides optimised, cross-validated matrix factorization via its SVD class (which is actually the SGD-optimised MF, not classical SVD — a common naming quirk).

from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate, GridSearchCV
import pandas as pd

# ── Load MovieLens 100k ───────────────────────────────────────────────
data = Dataset.load_builtin('ml-100k')

# ── Baseline SVD — default hyperparams ───────────────────────────────
algo = SVD(n_factors=50, n_epochs=20, lr_all=0.005, reg_all=0.02, random_state=42)
results = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

print(f"\nMean RMSE: {results['test_rmse'].mean():.4f}")
print(f"Mean MAE:  {results['test_mae'].mean():.4f}")

# ── Grid search over k and regularisation ────────────────────────────
param_grid = {
    'n_factors': [20, 50, 100],
    'n_epochs':  [20, 40],
    'lr_all':    [0.003, 0.005],
    'reg_all':   [0.02, 0.1]
}

gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)
gs.fit(data)
print(f"\nBest RMSE:   {gs.best_score['rmse']:.4f}")
print(f"Best params: {gs.best_params['rmse']}")

# ── Predict a specific user-item rating ──────────────────────────────
trainset = data.build_full_trainset()
algo.fit(trainset)
pred = algo.predict(uid='196', iid='302', r_ui=4, verbose=True)

OUTPUTMean RMSE: 0.9341 Mean MAE: 0.7372 Best RMSE: 0.9198 Best params: {'n_factors': 100, 'n_epochs': 40, 'lr_all': 0.005, 'reg_all': 0.02} user: 196 item: 302 r_ui = 4.00 est = 4.12 {'was_impossible': False}

Section 12

Hyperparameter Deep Dive

Hyperparameter	Symbol	Typical Range	Effect of Increase	Risk
Latent Factors	k	10 – 200	Richer representation, captures subtler taste	Overfitting, slower training
Learning Rate	η	0.001 – 0.02	Faster convergence (up to a point)	Divergence if too high
Regularisation	λ	0.001 – 0.1	Smaller embeddings, less overfitting	Underfitting if too large
Epochs	T	20 – 200	Better convergence	Overfitting; use early stopping
Bias Terms	bᵤ, bᵢ	yes/no	Accounts for generous raters, blockbuster effect	Usually always use them

💡

Adding Bias Terms — The Real-World Must-Have

The basic formula R̂[u,i] = P[u]·Q[i] ignores two important effects: (1) some users always rate high (optimistic bias bᵤ), and (2) some movies are universally loved (item bias bᵢ). The improved formula is R̂[u,i] = μ + bᵤ + bᵢ + P[u]·Q[i] where μ is the global mean. This consistently reduces RMSE by 5–10% in real datasets. Surprise's SVD includes this automatically — set biased=True (the default).

Section 13

Matrix Factorization vs Other Approaches

Approach	How It Works	Strengths	Weaknesses
Matrix Factorization	Learns latent factors from ratings	No domain knowledge needed; discovers hidden taste dimensions	Cold-start problem — new users/items have no ratings
User-Based CF	Find similar users; recommend what they liked	Intuitive; serendipitous discoveries	Scales poorly; user similarity degrades at scale
Item-Based CF	Find similar items; recommend similar to what you liked	More stable; pre-computeable	Over-specialises; ignores evolving taste
Content Filtering	Uses item features (genre, actors, director)	Solves cold start for items; explainable	Misses cross-genre surprises; requires metadata
Neural CF / Deep MF	Replaces dot product with a neural network	Learns non-linear taste interactions	Heavier compute; less interpretable; slower
Hybrid (MF + Content)	Combines latent factors with item features	Best of both; handles cold start + taste discovery	More complex; harder to tune

Section 14

The Cold Start Problem — MF's Achilles' Heel

📖 Story

The New Student at School

A new student arrives. No one knows her — she has no history, no track record, no friends yet. The school clique recommender ("people who hung out with X also hung out with Y") has nothing to go on. She gets the same recommendations as everyone else: the most popular kids.

That is the cold-start problem. Matrix Factorization cannot generate embeddings for a user or item with zero ratings. The embedding vector simply does not exist in P or Q.

🆕

New User Cold Start

User has 0 ratings

No row in P. Solutions: (1) ask for onboarding preferences, (2) recommend globally popular items, (3) fall back to content-based matching from profile data (age, location).

🎬

New Item Cold Start

Item has 0 ratings

No row in Q. Solutions: (1) represent via content embeddings (genre, tags), (2) inject into the embedding space via content-aware MF, (3) boost visibility temporarily to gather ratings quickly.

⚡

Warm Start (Partial)

A few ratings exist

Even 2–5 ratings dramatically improve MF. A technique called folding-in fixes Q and optimises only the new user's P row against their few ratings — fast and effective.

Section 15

Evaluation Metrics for Recommendation Quality

RMSE

√( Σ(R[u,i]−R̂[u,i])² / N )

Root Mean Squared Error. Penalises large errors heavily. Most widely reported in academic literature.

MAE

Σ |R[u,i] − R̂[u,i]| / N

Mean Absolute Error. More interpretable ("off by 0.7 stars on average"). Less sensitive to outlier ratings.

Precision@K

|Relevant ∩ Top-K| / K

Of the top-K items you recommended, how many did the user actually like? Directly measures recommendation quality.

NDCG@K

DCG@K / IDCG@K

Normalised Discounted Cumulative Gain. Rewards putting the most relevant items at the top of the ranked list.

📊

RMSE vs Precision@K — Which Matters More?

RMSE optimises for rating prediction accuracy. But in production, users never see the predicted number — they see a ranked list of items. A model with worse RMSE can produce a better ranked list. Most modern systems report Precision@10, Recall@10, and NDCG@10 as primary metrics. Optimise for what users actually experience.

Section 16

Golden Rules — Matrix Factorization in Production

⭐ Matrix Factorization — Non-Negotiable Rules

Always include bias terms (μ + bᵤ + bᵢ). The dot product alone ignores that some users are chronic 5-star givers and some movies are blockbusters. Biases typically reduce RMSE by 5–10% for free.

Do not impute missing values with zeros when using SGD-based MF. Zeros are treated as explicit "I hate this" ratings and will corrupt your embeddings. The SGD loop skips unobserved entries — that's the entire point.

Use early stopping on a held-out validation set. Training loss always decreases, but validation RMSE will start rising after some point. That's overfitting. Stop there. Typical sweet spot: 20–50 epochs.

Tune k (latent factors) and λ (regularisation) first — they have the biggest impact. For most datasets: k=50–100 and λ=0.01–0.05 is a strong starting point. Only then tune learning rate and epochs.

Build a cold-start fallback before going to production. Pure MF will give zero useful recommendations to new users. A simple popularity-based fallback (recommend top-20 globally rated items) is fine for day-one users.

Re-train periodically (daily or weekly for fast-moving datasets). Embeddings become stale as users develop new tastes. Online learning (incremental updates as ratings arrive) is possible but adds engineering complexity — batch re-training is safer.

Evaluate with temporal splits, not random splits. In random splits, the model can "see the future" — it trains on ratings from March to predict ratings from January. Always split by time: train on everything before a cutoff date, test on after.