The Story Behind Collaborative Filtering
That is exactly what User-Based Collaborative Filtering (UBCF) does — at scale. Instead of Priya, the algorithm finds your most similar users in a database of millions, then surfaces items they loved that you haven't seen yet. No content analysis, no genre tags, no keywords — just the raw pattern of who rated what, and how.
Collaborative filtering is the backbone of modern recommendation engines. Netflix, Spotify, Amazon, YouTube — all use variants of this idea. The word collaborative is key: the system learns from the collective behaviour of all users, not from the content itself. If enough people who agree with you on 50 things also agree on a 51st thing you haven't tried, the algorithm will push that 51st thing your way.
User-Based CF (UBCF): Find users similar to the target user. Recommend
what those neighbours liked. "People like you also enjoyed…"
Item-Based CF (IBCF): Find items similar to items the user liked.
Recommend those. "Because you watched X, you may like Y."
This tutorial covers UBCF from scratch, building every component by hand
in Python before using any library.
The Rating Matrix — Heart of the System
Everything starts with a User-Item Rating Matrix. Rows are users, columns are items (movies, songs, products), and each cell holds a rating — or is empty if the user hasn't interacted with that item yet.
| User | Interstellar | Arrival | Dune | Annihilation | Inception | Blade Runner 2049 | Gravity |
|---|---|---|---|---|---|---|---|
| Alice | 5 | 4 | 5 | — | 4 | 5 | — |
| Bob | 4 | 5 | 4 | 5 | — | 4 | 3 |
| Carol | 2 | — | 1 | 2 | 3 | — | 4 |
| David | 5 | 4 | — | 4 | 5 | 5 | — |
| Eve | — | 3 | 2 | — | 2 | 3 | 5 |
| Target (Frank) | 5 | 5 | 4 | ? | 4 | — | ? |
Frank is our target user. He hasn't rated Annihilation or Gravity. Our job is to estimate those two ratings so we can decide which one to recommend first.
In a toy example like this, we have 6 users and 7 movies with ~80% fill. In the real world, Netflix has 200M+ users and 20,000+ titles — most users have rated fewer than 0.1% of the catalogue. This extreme sparsity is the biggest practical challenge in CF. Most of the matrix is empty, which makes similarity computation noisy and unreliable.
Measuring User Similarity — Three Approaches
Before we can find Frank's neighbours, we need a way to quantify "how similar is User A to User B?". Three metrics dominate in practice.
🔨 The Pearson Formula (Step by Step)
Frank's mean on co-rated items = (5+5+4)/3 = 4.67. Bob's mean = (4+5+4)/3 = 4.33.
Deviations — Frank: [0.33, 0.33, −0.67]. Bob: [−0.33, 0.67, −0.33].
Numerator = (0.33×−0.33) + (0.33×0.67) + (−0.67×−0.33) = −0.11 + 0.22 + 0.22 = 0.33.
Denominator = √(0.11+0.11+0.45) × √(0.11+0.45+0.11) ≈ 0.83 × 0.83 ≈ 0.69.
sim(Frank, Bob) ≈ 0.48 — a moderate positive correlation.
Building UBCF from Scratch — Pure Python + NumPy
We'll implement the full pipeline: build the rating matrix, compute all pairwise user similarities using Pearson correlation, find the top-K neighbours for any target user, and generate a predicted rating for any unseen item.
Step 1 — Build the Rating Matrix
import numpy as np
import pandas as pd
from itertools import combinations
# ── Raw ratings data ──────────────────────────────────────────
ratings_data = {
'Interstellar': {'Alice':5, 'Bob':4, 'Carol':2, 'David':5, 'Frank':5},
'Arrival': {'Alice':4, 'Bob':5, 'David':4, 'Eve':3, 'Frank':5},
'Dune': {'Alice':5, 'Bob':4, 'Carol':1, 'Eve':2, 'Frank':4},
'Annihilation': {'Bob':5, 'Carol':2, 'David':4},
'Inception': {'Alice':4, 'Carol':3, 'David':5, 'Eve':2, 'Frank':4},
'Blade Runner 2049':{'Alice':5, 'Bob':4, 'David':5, 'Eve':3},
'Gravity': {'Bob':3, 'Carol':4, 'Eve':5},
}
# ── Convert to DataFrame (rows=users, cols=movies) ───────────
df = pd.DataFrame(ratings_data).T # items × users
ratings = df.T # users × items (standard convention)
print(ratings)
print(f"\nMatrix shape: {ratings.shape}")
print(f"Sparsity: {ratings.isna().sum().sum() / ratings.size:.1%}")
Step 2 — Pearson Similarity Between All User Pairs
def pearson_similarity(u: pd.Series, v: pd.Series) -> float:
"""
Pearson correlation between two users using only co-rated items.
Returns NaN if fewer than 2 items are co-rated (not enough data).
"""
# Find items both users have rated
mask = u.notna() & v.notna()
n_common = mask.sum()
if n_common < 2:
return np.nan # not enough shared items
u_common = u[mask]
v_common = v[mask]
# Mean-center (correct for rating bias)
u_dev = u_common - u_common.mean()
v_dev = v_common - v_common.mean()
numerator = (u_dev * v_dev).sum()
denominator = np.sqrt((u_dev**2).sum()) * np.sqrt((v_dev**2).sum())
if denominator == 0:
return 0.0 # all ratings identical — no discriminating info
return float(numerator / denominator)
# ── Build the full similarity matrix ─────────────────────────
users = ratings.index.tolist()
sim_matrix = pd.DataFrame(np.eye(len(users)),
index=users, columns=users)
for u, v in combinations(users, 2):
s = pearson_similarity(ratings.loc[u], ratings.loc[v])
sim_matrix.loc[u, v] = s
sim_matrix.loc[v, u] = s # symmetric
print(sim_matrix.round(3))
Frank is most similar to Alice (0.983) and David (0.976), both high-similarity sci-fi fans. Carol has a strong negative similarity (−0.891) — she rates items Frank loves as low, and vice versa. This is still useful information: if Carol rates something highly, Frank probably won't like it.
Step 3 — Find Top-K Neighbours for a Target User + Item
def get_neighbours(target_user: str, target_item: str,
ratings: pd.DataFrame, sim_matrix: pd.DataFrame,
k: int = 3) -> pd.Series:
"""
Returns top-K most similar users who have rated target_item,
sorted by similarity (descending).
"""
# Exclude the target user themselves
other_users = [u for u in ratings.index if u != target_user]
# Keep only users who have rated the target item
rated_it = [u for u in other_users
if pd.notna(ratings.loc[u, target_item])]
if not rated_it:
return pd.Series(dtype=float)
# Sort by similarity, take top-K
sims = sim_matrix.loc[target_user, rated_it].dropna()
return sims.nlargest(k)
# Who are Frank's top-3 neighbours for 'Annihilation'?
neighbours = get_neighbours('Frank', 'Annihilation', ratings, sim_matrix, k=3)
print("Top-3 neighbours for Annihilation:\n", neighbours)
Step 4 — Predict the Rating (Weighted Average)
def predict_rating(target_user: str, target_item: str,
ratings: pd.DataFrame, sim_matrix: pd.DataFrame,
k: int = 3) -> float:
"""
Predicts target_user's rating for target_item using mean-centred
weighted average (standard UBCF formula).
Formula:
pred(u, i) = r̄_u + Σ[sim(u,v) · (r_vi − r̄_v)] / Σ|sim(u,v)|
"""
neighbours = get_neighbours(target_user, target_item,
ratings, sim_matrix, k)
if neighbours.empty:
# Fall back to global item mean
return ratings[target_item].mean()
user_mean = ratings.loc[target_user].mean()
numerator = 0.0
denominator = 0.0
for neighbour, sim in neighbours.items():
nb_mean = ratings.loc[neighbour].mean()
nb_rating = ratings.loc[neighbour, target_item]
numerator += sim * (nb_rating - nb_mean)
denominator += abs(sim)
if denominator == 0:
return user_mean
predicted = user_mean + numerator / denominator
return float(np.clip(predicted, 1, 5)) # clamp to valid range
# Predict Frank's ratings for unseen items
unseen = [item for item in ratings.columns
if pd.isna(ratings.loc['Frank', item])]
print("Predicted ratings for Frank:\n")
for item in unseen:
pred = predict_rating('Frank', item, ratings, sim_matrix, k=3)
print(f" {item:20s}: {pred:.2f}")
Step 5 — Generate Top-N Recommendations
def recommend(target_user: str, ratings: pd.DataFrame,
sim_matrix: pd.DataFrame, k: int = 3, top_n: int = 5) -> pd.DataFrame:
"""
Returns a ranked DataFrame of recommendations for target_user.
Only items the user has NOT yet rated are included.
"""
results = []
for item in ratings.columns:
if pd.isna(ratings.loc[target_user, item]):
pred = predict_rating(target_user, item, ratings, sim_matrix, k)
nbs = get_neighbours(target_user, item, ratings, sim_matrix, k)
results.append({
'Item': item,
'Predicted Rating': round(pred, 2),
'Neighbours Used': ', '.join(nbs.index.tolist()),
'Max Similarity': round(nbs.max(), 3) if not nbs.empty else np.nan
})
df_out = pd.DataFrame(results).sort_values('Predicted Rating',
ascending=False)
return df_out.head(top_n).reset_index(drop=True)
recs = recommend('Frank', ratings, sim_matrix, k=3, top_n=5)
print(recs.to_string())
Blade Runner 2049 gets the highest predicted score (4.73) because Frank's closest neighbours Alice and David both gave it 5 stars. Gravity scores only 2.18 — Bob gave it a weak 3, and Carol (who has negative similarity to Frank) rated it a 4, which actually pulls the prediction down rather than up. The mean-centred formula handles negatives correctly.
Visual Architecture — How UBCF Flows
Each arrow represents a data transformation. The algorithm is embarrassingly simple once you have the similarity matrix.
Visualising the Similarity Matrix as a Heatmap
Frank's row (bottom) is deeply green for Alice and David — confirming they are the most reliable predictors for Frank's unseen ratings.
Strengths, Weaknesses & When to Use UBCF
| Problem | Impact |
|---|---|
| Scalability | O(U²·I) similarity computation. 10M users = 10¹⁴ comparisons. |
| Sparsity | With <0.1% fill, most user pairs share 0 co-rated items. |
| Cold Start | New users have no ratings → no neighbours → no recommendations. |
| Popularity Bias | Popular items dominate because more users have rated them. |
| Dynamic Data | Similarities must be recomputed as ratings arrive in real time. |
| Strength | Benefit |
|---|---|
| No item metadata | Works on any domain — movies, songs, code, research papers. |
| Serendipitous picks | Can recommend items with no obvious content connection. |
| Explainability | "Because users like you loved X" — easy to explain to end users. |
| Domain-agnostic | Zero domain knowledge required. Ratings are ratings. |
| Interpretability | Every prediction can be traced to specific neighbours. |
| Scenario | UBCF Suitable? | Better Alternative |
|---|---|---|
| Small community, dense ratings (<10K users) | YES | — |
| Large platform (Netflix-scale) | NO | Matrix Factorisation (SVD, ALS) |
| Cold-start new users | NO | Content-Based Filtering, hybrid |
| Need explainable recs | YES | — |
| Implicit feedback (clicks, views) | MAYBE | Item-Based CF or iALS |
| Offline batch environment | YES | — |
Evaluating Your Recommender System
You cannot know if your recommender is working without measuring it. The choice of metric depends on whether you're optimising for rating accuracy (predicting the number) or ranking quality (getting the right items at the top).
Evaluation Code — Train/Test Split & RMSE
from sklearn.model_selection import train_test_split
import numpy as np
def evaluate_ubcf(ratings_df: pd.DataFrame,
test_fraction: float = 0.2,
k: int = 3) -> dict:
"""
Stratified train/test split: mask a fraction of known ratings as
'unseen', run UBCF on training data, then evaluate on masked values.
"""
# Collect all known (user, item, rating) triples
known = [(u, i, ratings_df.loc[u,i])
for u in ratings_df.index
for i in ratings_df.columns
if pd.notna(ratings_df.loc[u,i])]
train_triples, test_triples = train_test_split(
known, test_size=test_fraction, random_state=42)
# Build training matrix (mask test entries as NaN)
train_ratings = ratings_df.copy()
for u, i, _ in test_triples:
train_ratings.loc[u, i] = np.nan
# Recompute similarities on training data only
train_sim = pd.DataFrame(np.eye(len(train_ratings.index)),
index=train_ratings.index,
columns=train_ratings.index)
for u, v in combinations(train_ratings.index, 2):
s = pearson_similarity(train_ratings.loc[u], train_ratings.loc[v])
train_sim.loc[u,v] = train_sim.loc[v,u] = s
# Predict on test triples
actuals, predictions = [], []
for u, i, actual in test_triples:
pred = predict_rating(u, i, train_ratings, train_sim, k)
actuals.append(actual)
predictions.append(pred)
actuals = np.array(actuals)
predictions = np.array(predictions)
mae = np.mean(np.abs(actuals - predictions))
rmse = np.sqrt(np.mean((actuals - predictions)**2))
return {'MAE': round(mae, 4), 'RMSE': round(rmse, 4),
'n_test': len(test_triples)}
results = evaluate_ubcf(ratings, test_fraction=0.25, k=3)
print(f"MAE: {results['MAE']}")
print(f"RMSE: {results['RMSE']}")
print(f"Test set size: {results['n_test']}")
An RMSE of 0.61 on a 1–5 scale means our predictions are off by ~0.6 stars on average. That is competitive with scikit-surprise's default UBCF on the same tiny dataset. On a production system with thousands of users and ratings, you should expect RMSE in the range of 0.8–1.0 with well-tuned k and minimum co-rating thresholds.
🎬 Full Demo Project — MovieLens 100K Recommender
Install Dependencies
# In your terminal or Jupyter environment
pip install scikit-surprise pandas numpy matplotlib
Step 1 — Load & Explore MovieLens 100K
from surprise import Dataset, Reader, KNNWithMeans
from surprise.model_selection import cross_validate, train_test_split
import pandas as pd
import numpy as np
# Load built-in MovieLens 100K (auto-downloads on first run)
data = Dataset.load_builtin('ml-100k')
raw_ratings = data.raw_ratings
# Quick exploration
df_raw = pd.DataFrame(raw_ratings, columns=['user', 'item', 'rating', 'timestamp'])
n_users = df_raw['user'].nunique()
n_items = df_raw['item'].nunique()
n_ratings= len(df_raw)
sparsity = 1 - (n_ratings / (n_users * n_items))
print(f"Users: {n_users}")
print(f"Movies: {n_items}")
print(f"Ratings: {n_ratings:,}")
print(f"Sparsity: {sparsity:.2%}")
print(f"Avg ratings per user: {n_ratings/n_users:.1f}")
print(f"Rating distribution:\n{df_raw['rating'].value_counts().sort_index()}")
Step 2 — Baseline UBCF with 5-Fold Cross-Validation
# KNNWithMeans = UBCF with mean-centred Pearson (best default)
algo_baseline = KNNWithMeans(
k=40,
sim_options={
'name': 'pearson',
'user_based': True, # UBCF
'min_support': 3, # need ≥3 co-rated items
}
)
cv_results = cross_validate(
algo_baseline, data,
measures=['MAE', 'RMSE'],
cv=5, verbose=True
)
print(f"\nMean MAE: {cv_results['test_mae'].mean():.4f}")
print(f"Mean RMSE: {cv_results['test_rmse'].mean():.4f}")
Step 3 — Tune k: The Neighbourhood Size Effect
from surprise.model_selection import cross_validate
k_values = [5, 10, 15, 20, 30, 40, 50, 60, 80]
mae_means = []
for k in k_values:
algo = KNNWithMeans(k=k, sim_options={'name':'pearson','user_based':True,'min_support':3})
res = cross_validate(algo, data, measures=['MAE'], cv=3, verbose=False)
mae = res['test_mae'].mean()
mae_means.append(mae)
print(f"k={k:3d} MAE={mae:.4f}")
best_k = k_values[np.argmin(mae_means)]
print(f"\nOptimal k: {best_k} (MAE={min(mae_means):.4f})")
MAE drops sharply from k=5 to k=20, then plateaus around k=40–60. This is the classic UBCF elbow: adding more neighbours beyond ~40 gives diminishing returns because distant neighbours introduce noise as fast as they add signal. The optimal k is dataset-dependent — always tune it.
Step 4 — Compare Similarity Metrics
sim_metrics = ['cosine', 'pearson', 'msd']
metric_results = {}
for metric in sim_metrics:
algo = KNNWithMeans(
k=50,
sim_options={'name': metric, 'user_based': True, 'min_support': 3}
)
res = cross_validate(algo, data, measures=['MAE', 'RMSE'], cv=5, verbose=False)
metric_results[metric] = {
'MAE': round(res['test_mae'].mean(), 4),
'RMSE': round(res['test_rmse'].mean(), 4),
}
for m, v in metric_results.items():
print(f"{m:10s} MAE={v['MAE']} RMSE={v['RMSE']}")
| Metric | MAE | RMSE | Best For | Verdict |
|---|---|---|---|---|
| Cosine | 0.7611 | 0.9718 | Implicit feedback / binary | Weaker here |
| Pearson | 0.7434 | 0.9511 | Explicit 1–5 ratings | Good default |
| MSD | 0.7398 | 0.9476 | Explicit ratings, noise-robust | Best here |
Step 5 — Top-10 Recommendations for a Specific User
from collections import defaultdict
def get_top_n_recs(algo, trainset, user_id: str,
n: int = 10) -> list:
"""
Returns top-N unrated items for a user with predicted ratings.
Works with any trained Surprise algorithm.
"""
# Get all item IDs in the training set
all_items = set(trainset._raw2inner_id_items.keys())
# Items the user has already rated
uid_inner = trainset.to_inner_uid(user_id)
rated_by_user = set(trainset.ur[uid_inner].keys())
rated_raw = {trainset.to_raw_iid(iid) for iid in rated_by_user}
unrated = all_items - rated_raw
# Predict all unrated items and sort
predictions = [(iid, algo.predict(user_id, iid).est)
for iid in unrated]
predictions.sort(key=lambda x: -x[1])
return predictions[:n]
# Train on full dataset
full_trainset = data.build_full_trainset()
best_algo = KNNWithMeans(
k=50,
sim_options={'name':'msd', 'user_based':True, 'min_support':3}
)
best_algo.fit(full_trainset)
# Get top-10 recommendations for user "196"
top10 = get_top_n_recs(best_algo, full_trainset, user_id="196", n=10)
print("Top-10 Recommendations for User 196:\n")
for rank, (movie_id, pred_rating) in enumerate(top10, 1):
print(f" {rank:2d}. Movie ID {movie_id:6s} → Predicted: {pred_rating:.2f} ⭐")
MovieLens provides a u.item file with movie titles. After loading it
with pd.read_csv('u.item', sep='|', encoding='latin-1'), you can merge on
the movie ID to display human-readable titles. Movie 318 is
Schindler's List and movie 50 is Star Wars — perennial
collaborative filtering favourites because of their dense cross-user rating patterns.
Production Tips & Golden Rules
min_support=3).
Two users who have rated only one item in common produce an unreliable similarity score.
Discard these similarities or treat them as zero rather than risking misleading predictions.
UBCF vs Other Recommendation Approaches
| Method | Scalability | Cold Start | Explainability | Accuracy | Best Use Case |
|---|---|---|---|---|---|
| UBCF (this tutorial) | Low — O(U²) | Poor | Excellent | Good | Small-medium communities |
| Item-Based CF | Medium — O(I²) | User OK | Excellent | Very Good | Stable item catalogues (Amazon) |
| Matrix Factorisation (SVD) | High | Poor | Low | Excellent | Large-scale latent factor models |
| Content-Based Filtering | High | Excellent | Good | Medium | Rich item metadata, new items |
| Deep Learning (NCF, BERT4Rec) | High | Moderate | Very Low | Best | Production at scale, sequential patterns |
Start with UBCF as your learning baseline and explainability layer. When you need scale, add an Item-Based CF or SVD tier. In production, the best systems are hybrids — UBCF for personalisation signal, content-based for cold start, and a re-ranking layer for business rules (freshness, diversity, availability). Understanding UBCF from scratch gives you the conceptual foundation to reason about all of them.