The Story That Explains Collaborative Filtering
She never read any of those books herself. She just noticed patterns in what people with similar taste chose. That is the entire idea behind Collaborative Filtering. No content analysis. No manual rules. Pure collective wisdom from behaviour.
Collaborative Filtering (CF) is a recommendation technique that predicts what a user will like based on the preferences of many other users. It operates on the idea that people who agreed in the past tend to agree again in the future. Netflix, Spotify, Amazon — they all use some form of CF at their core.
CF exploits the fact that user behaviour is a rich signal. Ratings, clicks, purchases, skips, and watch-time all implicitly encode taste. The algorithm finds structure in this behaviour without ever needing to know why users like something — just that they do, and who else does too.
The Two Flavours of Collaborative Filtering
Collaborative Filtering splits into two main approaches depending on where you anchor the similarity calculation — on users or on items. Both use the same raw data (a user–item rating matrix) but reason from opposite directions.
| Question Asked |
|---|
| Who among all users rates things most similarly to the target user? |
| Find the K most similar users (neighbours) |
| Aggregate their ratings on unseen items |
| Recommend items with the highest predicted score |
| Analogy: "People like you also liked…" |
| Question Asked |
|---|
| Which items are rated most similarly across all users? |
| For each item the user liked, find its K nearest item neighbours |
| Aggregate similarity-weighted ratings |
| Recommend the most similar unseen items |
| Analogy: "Because you liked X, try Y…" |
| Property | User-User CF | Item-Item CF |
|---|---|---|
| Similarity computed between | Users | Items |
| Best when | Many items, few users | Many users, stable item catalogue |
| Scalability | Slow — users grow fast | Better — item set more static |
| Cold start (new user) | Struggles immediately | Slightly more robust |
| Cold start (new item) | Needs some ratings | Struggles immediately |
| Interpretability | Moderate | High — "because you liked X" |
| Amazon, Netflix approach | Less common in production | Dominant in large catalogues |
The Rating Matrix — The Foundation of Everything
| User | Inception | Interstellar | The Martian | Dune | Arrival | Tenet |
|---|---|---|---|---|---|---|
| Alice | 5 | 4 | — | 5 | 4 | — |
| Bob | 4 | 5 | 4 | — | 5 | 3 |
| Carol | — | 4 | 5 | 4 | — | 4 |
| Dave | 5 | — | 4 | 5 | 4 | 5 |
| Eve | 3 | 2 | — | 2 | 3 | — |
Notice how the matrix is sparse — most cells are empty (—). In real systems like Netflix (200M+ users, 17,000+ titles), less than 1% of cells are filled. The challenge is using that tiny fraction of known ratings to predict everything else.
Real-world rating matrices are extremely sparse — often 99%+ empty. This creates the fundamental challenge of CF: two users may share very few rated items, making similarity calculation noisy. Techniques like matrix factorisation (see Section 10) were invented specifically to overcome this.
User-User Similarity — Deep Dive
The user-user approach asks: "Which users have the most similar taste profile to the target user?" Once we find these K nearest neighbours, we let their ratings vote on what the target user should see next.
The Prediction Formula
pred(u, i) = avg_rating(u) + Σ[ sim(u,v) × (r(v,i) − avg_rating(v)) ] / Σ|sim(u,v)|
We subtract each neighbour's average rating before weighting, then add back the target user's
average. This corrects for rating bias — some users always give 5s, others never
go above 3. Without this correction, generous raters dominate the prediction.
Pearson Correlation — Measuring Taste Similarity
Pearson Correlation captures this. It measures whether two users' ratings move together, not whether they're the same absolute number. It corrects for individual rating scales by centring each user's ratings around their own mean.
Worked Example — Alice vs Bob vs Eve
| Film | Alice's Rating | Bob's Rating | Eve's Rating | Alice − Mean(4.67) | Bob − Mean(4.2) |
|---|---|---|---|---|---|
| Inception | 5 | 4 | 3 | +0.33 | −0.20 |
| Interstellar | 4 | 5 | 2 | −0.67 | +0.80 |
| Dune | 5 | — | 2 | +0.33 | — |
| Arrival | 4 | 5 | 3 | −0.67 | +0.80 |
| Pearson vs Alice | — | ≈ −0.94 | ≈ +0.99 | — | — |
Alice and Eve have +0.99 — near-perfect correlation. When Alice rates something high, Eve rates it high too (just always lower in absolute terms). Bob has −0.94 — inverse taste. When Alice loves a film, Bob tends not to, and vice versa. For recommendations, we want neighbours with high positive Pearson scores.
Python: Pearson Correlation from Scratch
import numpy as np
# Ratings: 0 = unrated (we only use co-rated items)
ratings = {
'Alice': {'Inception': 5, 'Interstellar': 4, 'Dune': 5, 'Arrival': 4},
'Bob': {'Inception': 4, 'Interstellar': 5, 'Arrival': 5, 'Tenet': 3},
'Eve': {'Inception': 3, 'Interstellar': 2, 'Dune': 2, 'Arrival': 3},
}
def pearson_similarity(user1, user2, ratings):
"""Pearson correlation between two users on their co-rated items."""
u1_ratings = ratings[user1]
u2_ratings = ratings[user2]
# Find items rated by BOTH users
common = set(u1_ratings.keys()) & set(u2_ratings.keys())
if len(common) < 2:
return 0 # not enough co-rated items
u1 = np.array([u1_ratings[i] for i in common], dtype=float)
u2 = np.array([u2_ratings[i] for i in common], dtype=float)
# Centre ratings around each user's mean
u1 -= u1.mean()
u2 -= u2.mean()
denom = np.sqrt(np.sum(u1**2) * np.sum(u2**2))
return np.dot(u1, u2) / denom if denom > 0 else 0
# Compute similarity of Alice vs each other user
for user in ['Bob', 'Eve']:
sim = pearson_similarity('Alice', user, ratings)
print(f"Alice ↔ {user}: Pearson = {sim:.4f}")
Cosine Similarity — The Angle Between Taste Vectors
Cosine Similarity measures exactly this: the cosine of the angle between two vectors. An angle of 0° means identical direction (similarity = 1.0). An angle of 90° means orthogonal — no relationship (similarity = 0). An angle of 180° means opposite taste (similarity = −1).
Pearson vs Cosine — When to Use Which
| Property | Pearson Correlation | Cosine Similarity | Adjusted Cosine |
|---|---|---|---|
| Corrects for rating scale bias | Yes — subtracts user mean | No | Yes |
| Works on co-rated items only | Yes | Treats missing as 0 | Yes (co-rated) |
| Best for | User-User CF | Sparse implicit data | Item-Item CF |
| Sensitive to # of co-ratings | Yes — unreliable with <5 | Yes | Yes |
| Computational complexity | O(n) per pair | O(n) per pair | O(n) per pair |
Python: Cosine and Adjusted Cosine
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
# Build a dense rating matrix (NaN for missing)
data = {
'Inception': [5, 4, None, 5, 3 ],
'Interstellar': [4, 5, 4, None, 2 ],
'The Martian': [None, 4, 5, 4, None],
'Dune': [5, None, 4, 5, 2 ],
'Arrival': [4, 5, None, 4, 3 ],
}
users = ['Alice', 'Bob', 'Carol', 'Dave', 'Eve']
df = pd.DataFrame(data, index=users)
# --- Standard Cosine (fill NaN with 0) ---
df_filled = df.fillna(0)
cos_sim = cosine_similarity(df_filled)
cos_df = pd.DataFrame(cos_sim, index=users, columns=users)
print("=== Standard Cosine Similarity ===")
print(cos_df.round(3))
# --- Adjusted Cosine (subtract user means) ---
df_centred = df.sub(df.mean(axis=1), axis=0).fillna(0)
adj_cos_sim = cosine_similarity(df_centred)
adj_cos_df = pd.DataFrame(adj_cos_sim, index=users, columns=users)
print("\n=== Adjusted Cosine Similarity ===")
print(adj_cos_df.round(3))
Standard cosine shows Alice ↔ Bob at 0.825 — they look similar! But adjusted cosine reveals 0.162 — barely correlated after accounting for the fact that Bob rates everything higher than Alice. Always prefer adjusted cosine for explicit rating data. Use raw cosine only for implicit signals (clicks, plays, views) where there's no personal scale to correct for.
Item-Item Similarity — Deep Dive
Item-item CF flips the perspective: instead of finding users who think alike, we find items that attract the same crowd. If the same set of users consistently loves both Inception and Interstellar, those films are similar — regardless of what genre they belong to or what the plot is.
pred(u,j) = Σ [sim(i,j) × r(u,i)] / Σ |sim(i,j)| where the sum runs over the K most similar already-rated items.The similarity matrix between items changes slowly (item catalogues are stable). You compute it once, offline, in batch. At serving time, personalisation is just a fast lookup and weighted average — sub-millisecond. Amazon described this exact approach in their landmark 2003 paper that powered their recommendation engine for over a decade.
Python: Full User-User CF from Scratch
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
# ── 1. Rating matrix (NaN = not yet rated) ─────────────────
data = {
'Inception': [5, 4, np.nan, 5, 3 ],
'Interstellar': [4, 5, 4, np.nan, 2 ],
'The Martian': [np.nan, 4, 5, 4, np.nan],
'Dune': [5, np.nan, 4, 5, 2 ],
'Arrival': [4, 5, np.nan, 4, 3 ],
'Tenet': [np.nan, 3, 4, 5, np.nan],
}
users = ['Alice', 'Bob', 'Carol', 'Dave', 'Eve']
R = pd.DataFrame(data, index=users)
# ── 2. Pearson similarity (co-rated items only) ─────────────
def pearson_sim(u, v, R):
common = R.loc[u].notna() & R.loc[v].notna()
if common.sum() < 2:
return 0.0
r_u = R.loc[u, common]
r_v = R.loc[v, common]
corr, _ = pearsonr(r_u, r_v)
return 0.0 if np.isnan(corr) else corr
# Full similarity matrix
sim_matrix = pd.DataFrame(
[[pearson_sim(u, v, R) for v in users] for u in users],
index=users, columns=users
)
# ── 3. Predict missing ratings for Alice ────────────────────
target = 'Alice'
K = 3 # number of nearest neighbours
# Items Alice hasn't rated yet
unrated = R.loc[target][R.loc[target].isna()].index.tolist()
alice_mean = R.loc[target].mean()
predictions = {}
for item in unrated:
# Neighbours who have rated this item, sorted by sim desc
neighbours = (
sim_matrix.loc[target]
.drop(target)
.loc[R[item].notna()]
.sort_values(ascending=False)
.head(K)
)
if len(neighbours) == 0 or neighbours.abs().sum() == 0:
continue
numer, denom = 0.0, 0.0
for nb, sim in neighbours.items():
nb_mean = R.loc[nb].mean()
numer += sim * (R.loc[nb, item] - nb_mean)
denom += abs(sim)
predictions[item] = alice_mean + numer / denom
print("=== Predicted Ratings for Alice ===")
for item, pred in sorted(predictions.items(), key=lambda x: -x[1]):
print(f" {item:20s}: {pred:.2f}")
Python: Full Item-Item CF from Scratch
import numpy as np
import pandas as pd
# Reuse the rating matrix R from Section 08
# ── 1. Adjusted Cosine Similarity between items ─────────────
def adjusted_cosine(i, j, R):
"""Similarity between items i and j — subtract user means."""
# Users who rated BOTH items
common_users = R[i].notna() & R[j].notna()
if common_users.sum() < 2:
return 0.0
user_means = R.mean(axis=1) # mean rating per user
r_i = R.loc[common_users, i] - user_means[common_users]
r_j = R.loc[common_users, j] - user_means[common_users]
numer = (r_i * r_j).sum()
denom = np.sqrt((r_i**2).sum()) * np.sqrt((r_j**2).sum())
return numer / denom if denom > 0 else 0.0
items = R.columns.tolist()
item_sim = pd.DataFrame(
[[adjusted_cosine(i, j, R) for j in items] for i in items],
index=items, columns=items
)
print("=== Item Similarity Matrix (Adjusted Cosine) ===")
print(item_sim.round(3))
# ── 2. Predict missing ratings for Alice via item-item ───────
target = 'Alice'
K = 3
rated = R.loc[target].dropna()
unrated = R.loc[target][R.loc[target].isna()].index
predictions = {}
for candidate in unrated:
# K most similar items that Alice HAS rated
sims = item_sim.loc[candidate, rated.index].sort_values(ascending=False).head(K)
numer = (sims * rated[sims.index]).sum()
denom = sims.abs().sum()
if denom > 0:
predictions[candidate] = numer / denom
print("\n=== Item-Item Predictions for Alice ===")
for item, pred in sorted(predictions.items(), key=lambda x: -x[1]):
print(f" {item:20s}: {pred:.2f}")
User-user CF predicted Tenet: 4.73 via Alice's taste neighbours. Item-item CF predicted Tenet: 4.89 via the items Alice rated highly (Inception, Dune) which are both highly similar to Tenet. When both methods converge, it's a strong signal of a genuine recommendation.
A Complete Real-World Demo — Movie Recommender with Surprise
Let's build a production-grade collaborative filtering system using the Surprise library, which implements CF algorithms with proper evaluation. We'll use the classic MovieLens 100K dataset.
Setup and Baseline
# pip install scikit-surprise
from surprise import Dataset, Reader, KNNBasic, KNNWithMeans, SVD
from surprise.model_selection import cross_validate, train_test_split
from surprise import accuracy
from collections import defaultdict
import pandas as pd
import numpy as np
# ── Load MovieLens 100K (downloads automatically) ───────────
data = Dataset.load_builtin('ml-100k')
print("Dataset: MovieLens 100K")
print("100,000 ratings | 943 users | 1,682 movies | scale 1–5")
# ── Test 3 algorithms side by side ──────────────────────────
algorithms = {
'User-User CF (K=40)': KNNBasic(k=40, sim_options={
'name': 'pearson', 'user_based': True
}),
'Item-Item CF (K=40)': KNNWithMeans(k=40, sim_options={
'name': 'pearson_baseline', 'user_based': False
}),
'SVD (Matrix Factorisation)': SVD(n_factors=100, n_epochs=20),
}
results = {}
for name, algo in algorithms.items():
cv = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)
results[name] = {
'RMSE': cv['test_rmse'].mean(),
'MAE': cv['test_mae'].mean(),
}
print(f"{name:35s} RMSE={results[name]['RMSE']:.4f} MAE={results[name]['MAE']:.4f}")
Generate Top-N Recommendations for a Specific User
from surprise import KNNWithMeans
# ── Train on full dataset ────────────────────────────────────
trainset = data.build_full_trainset()
algo = KNNWithMeans(k=40, sim_options={'name': 'pearson_baseline', 'user_based': False})
algo.fit(trainset)
def get_top_n_recommendations(algo, data, user_id, n=10):
"""Return top N recommendations for a given user."""
trainset = algo.trainset
all_items = set(trainset.all_items())
# Items this user already rated (known)
rated_items = set(
trainset.to_raw_iid(iid)
for iid in trainset.ur[trainset.to_inner_uid(user_id)]
.keys() if hasattr(trainset.ur[trainset.to_inner_uid(user_id)], 'keys')
) if False else {
trainset.to_raw_iid(iid)
for (iid, _) in trainset.ur[trainset.to_inner_uid(user_id)]
}
# Predict rating for every unseen item
predictions = [
algo.predict(user_id, trainset.to_raw_iid(iid))
for iid in all_items
if trainset.to_raw_iid(iid) not in rated_items
]
predictions.sort(key=lambda x: x.est, reverse=True)
return predictions[:n]
top10 = get_top_n_recommendations(algo, data, user_id='196', n=10)
print("=== Top 10 Recommendations for User 196 ===")
for i, pred in enumerate(top10, 1):
print(f" {i:2d}. Movie {pred.iid:6s} — Predicted Rating: {pred.est:.2f}")
Evaluating with Precision@K and Recall@K
from surprise.model_selection import train_test_split as sp_split
trainset, testset = sp_split(data, test_size=0.25, random_state=42)
algo.fit(trainset)
predictions = algo.test(testset)
def precision_recall_at_k(predictions, k=10, threshold=3.5):
"""Compute Precision@K and Recall@K for all users."""
user_est_true = defaultdict(list)
for uid, _, true_r, est, _ in predictions:
user_est_true[uid].append((est, true_r))
precisions, recalls = {}, {}
for uid, user_ratings in user_est_true.items():
user_ratings.sort(key=lambda x: x[0], reverse=True)
n_rel = sum(true_r >= threshold for _, true_r in user_ratings)
n_rec_k = min(k, len(user_ratings))
n_rel_and_rec_k = sum(
true_r >= threshold
for _, true_r in user_ratings[:n_rec_k]
)
precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k > 0 else 0
recalls[uid] = n_rel_and_rec_k / n_rel if n_rel > 0 else 0
return precisions, recalls
prec, rec = precision_recall_at_k(predictions, k=10, threshold=3.5)
print(f"Precision@10 = {sum(prec.values())/len(prec):.4f}")
print(f"Recall@10 = {sum(rec.values())/len(rec):.4f}")
print(f"RMSE = {accuracy.rmse(predictions, verbose=False):.4f}")