The Five Families of Recommendation Systems
The Content-Based advisor picks up your last book, studies its genre, themes, and writing style — then finds novels with the same fingerprint. The Collaborative advisor ignores the book entirely and asks: "Customers who loved exactly what you loved — what did they read next?" The Hybrid advisor uses both approaches at once, playing them off each other. The Knowledge-Based advisor sits you down and asks structured questions: "Do you prefer fast or slow pacing? Historical or contemporary?" And the Session-Based advisor watches you browse for ten minutes and whispers recommendations based entirely on what you just picked up and put down — no history needed, just now.
These are your five families. Each one solves a different problem, and each has a domain where it dominates. Large production systems such as those at Netflix, Amazon, and Spotify typically combine several of them rather than relying on any single one.
All five families feed into a unified recommendation engine. Production systems blend signals from multiple families to generate a single ranked output for each user.
| Family | Core Signal | Cold Start? | Best For | Used By |
|---|---|---|---|---|
| Content-Based | Item features (genre, tags, metadata) | Handles item cold start | News, articles, niche catalogues | Pandora, Medium |
| Collaborative | User-item interaction history | Suffers badly | Movies, music, e-commerce | Netflix, Amazon |
| Hybrid | Interactions + features + context | Mitigated | Large-scale platforms | YouTube, Spotify |
| Knowledge-Based | Explicit user constraints & requirements | No cold start at all | High-stakes, infrequent purchases | Finance, real estate, travel |
| Session-Based | Current session clicks / events only | Designed for it | E-commerce, anonymous users | Zalando, Booking.com |
Content-Based Filtering
When you typed in "Radiohead", Pandora didn't ask who else liked Radiohead. It dissected Radiohead's DNA: minor key, complex rhythmic structure, abstract lyrics, layered electric guitar, falsetto vocals. Then it found every song in the library with a similar genome — and played them for you.
That is pure content-based filtering. The system never consulted another user. It understood the item deeply and matched items to items.
Content-Based Filtering recommends items similar to those a user has interacted with positively in the past — based purely on the features of the items themselves. It builds a profile of the user's taste from item attributes, then finds items whose feature vectors are closest to that profile.
The user profile is an aggregated feature vector of liked items, and candidate items are ranked by cosine similarity to that profile. In the illustrated example, Item C scores 0.94, a nearly identical feature fingerprint, so it gets recommended.
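To make the profile-and-rank step concrete, here is a minimal sketch. The four feature columns, the item vectors, and the use of a plain mean as the aggregation are all assumptions made for illustration; a real system would use much richer features and might weight liked items by rating or recency.

```python
import numpy as np

# Hypothetical feature columns: [sci-fi, thriller, drama, comedy]
item_features = {
    "Item A": np.array([1.0, 1.0, 0.0, 0.0]),
    "Item B": np.array([1.0, 0.0, 1.0, 0.0]),
    "Item C": np.array([1.0, 1.0, 1.0, 0.0]),
    "Item D": np.array([0.0, 0.0, 0.0, 1.0]),
}
liked = ["Item A", "Item B"]  # assumed positive interactions


def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


# User profile = mean of the liked items' feature vectors
profile = np.mean([item_features[name] for name in liked], axis=0)

# Rank every item the user has not interacted with yet
ranked = sorted(
    ((name, cosine(profile, vec))
     for name, vec in item_features.items() if name not in liked),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

With these made-up vectors, Item C shares every feature the user has shown interest in and scores about 0.94, while Item D shares none and scores 0.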
TF-IDF for Text-Based Content Filtering
For text-heavy items (articles, job listings, product descriptions), the most common feature representation is TF-IDF (Term Frequency – Inverse Document Frequency). It weights words by how distinctive they are to a document relative to the whole corpus — suppressing common words like "the" and amplifying rare, meaningful ones.
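Before handing the work to a library, it can help to compute the weighting by hand once. The sketch below uses the textbook form, tf × log(N / df), on a four-document toy corpus that is assumed purely for illustration. It splits on whitespace, so "sci-fi" stays a single token.

```python
import math
from collections import Counter

# Toy corpus (assumed for illustration)
docs = {
    "Inception": "sci-fi thriller dream heist nolan",
    "Interstellar": "sci-fi space epic nolan wormhole",
    "The Matrix": "sci-fi action simulation hacker",
    "Parasite": "thriller drama class-divide dark comedy",
}
tokens = {title: text.split() for title, text in docs.items()}
n_docs = len(tokens)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for words in tokens.values() for term in set(words))


def tf_idf(term, title):
    """Textbook TF-IDF: relative term frequency times log(N / df)."""
    if term not in doc_freq:
        return 0.0
    words = tokens[title]
    tf = words.count(term) / len(words)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf


for term in ["sci-fi", "nolan", "heist"]:
    print(f"{term:8s} {tf_idf(term, 'Inception'):.3f}")
```

"heist" appears in one document and gets the highest weight, "nolan" appears in two and lands in the middle, and "sci-fi" appears in three and is weighted lowest. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalises each row by default, so its numbers differ from this textbook version, though the ordering follows the same logic.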
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# ── Sample movie dataset with text descriptions ──────────────
movies = pd.DataFrame({
    'title': [
        'Inception', 'Interstellar', 'Dune', 'The Matrix',
        'Arrival', 'Parasite', '1917', 'Tenet'
    ],
    'soup': [
        'sci-fi thriller dream heist nolan subconscious layered',
        'sci-fi space epic nolan wormhole relativity emotional',
        'sci-fi epic desert chosen-one villeneuve political',
        'sci-fi action simulation reality hacker rebellion',
        'sci-fi alien language time thriller emotional villeneuve',
        'thriller drama class-divide korean dark comedy',
        'war drama one-shot wwii emotional sacrifice',
        'sci-fi action espionage time-reversal nolan thriller',
    ]
})

# ── Build TF-IDF matrix ───────────────────────────────────────
# Note: the default tokeniser splits on hyphens, so 'sci-fi' becomes 'sci' and 'fi'
tfidf = TfidfVectorizer(stop_words='english')
tfidf_m = tfidf.fit_transform(movies['soup'])

# ── Pairwise cosine similarity ────────────────────────────────
cos_sim = cosine_similarity(tfidf_m, tfidf_m)

# ── Recommend function ────────────────────────────────────────
def recommend_cb(title, top_n=3):
    # Row position of the query title (raises IndexError if the title is unknown)
    idx = movies[movies['title'] == title].index[0]
    # Drop the query itself explicitly rather than assuming it sorts first
    scores = [(i, s) for i, s in enumerate(cos_sim[idx]) if i != idx]
    scores = sorted(scores, key=lambda x: -x[1])
    return [movies['title'].iloc[i] for i, _ in scores[:top_n]]

print("Because you watched Inception:")
print(recommend_cb('Inception'))
print("\nBecause you watched Arrival:")
print(recommend_cb('Arrival'))
Content-based systems have a subtle failure mode: over-specialisation. If you only ever watch Nolan films, you will only ever be recommended Nolan films — you'll never discover Villeneuve or Tarkovsky. The system optimises for similarity, not surprise. This is why serendipity mechanisms (random exploration, diversity penalties) must be deliberately injected into pure content-based systems.
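One common way to inject a diversity penalty is Maximal Marginal Relevance (MMR), which greedily picks the item that is relevant to the user but not too similar to what has already been selected. The sketch below is one possible implementation, not the only one. The relevance scores, the pairwise similarities, and the trade-off parameter `lam` are all invented for illustration.

```python
def mmr_rerank(relevance, similarity, k, lam=0.7):
    """Greedy MMR: trade relevance against similarity to already-picked items.

    lam=1.0 reduces to plain relevance ranking; lower values push for diversity.
    """
    selected = []
    remaining = list(relevance)
    while remaining and len(selected) < k:
        def mmr_score(item):
            max_sim = max((similarity[item][s] for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * max_sim
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical content-based relevance scores for one user
relevance = {"Tenet": 0.9, "Interstellar": 0.85, "Arrival": 0.6, "Parasite": 0.3}

# Hypothetical item-item similarities (symmetric)
pairs = {
    ("Tenet", "Interstellar"): 0.9,
    ("Tenet", "Arrival"): 0.3,
    ("Tenet", "Parasite"): 0.1,
    ("Interstellar", "Arrival"): 0.4,
    ("Interstellar", "Parasite"): 0.1,
    ("Arrival", "Parasite"): 0.2,
}
similarity = {name: {} for name in relevance}
for (a, b), s in pairs.items():
    similarity[a][b] = s
    similarity[b][a] = s

print("Relevance only:", mmr_rerank(relevance, similarity, k=3, lam=1.0))
print("With diversity:", mmr_rerank(relevance, similarity, k=3, lam=0.5))
```

With `lam=1.0` the two near-duplicate Nolan films take the top two slots. With `lam=0.5` the second Nolan film is pushed out in favour of less similar titles.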
| Aspect | Content-Based Filtering |
|---|---|
| Strengths | No cold start for items; transparent ("because you liked X"); works for unique users; no need for other users' data |
| Weaknesses | Over-specialisation; requires rich item metadata; user cold start still applies; cannot leverage crowd wisdom |
| Best domains | News articles, music (audio features), job listings, real estate |
| Key algorithms | TF-IDF + Cosine Similarity, BM25, Word2Vec / Sentence Transformers, k-NN in feature space |
Collaborative Filtering
Collaborative filtering is word of mouth in the village square, scaled to 300 million people. It asks: "Of all the users in this system, who behaves most like me? What have they interacted with that I haven't?" The item doesn't need to be described. The crowd's collective wisdom does the work.
Collaborative Filtering (CF) predicts a user's preferences by collecting and analysing interaction data from many users — collaborating to filter the item space. It operates entirely on the user-item interaction matrix, requiring no knowledge of item content or user demographics.
Two Variants: Memory-Based vs Model-Based
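Memory-based methods (user-user and item-item CF) compute similarities directly on the interaction matrix at query time. Model-based methods (matrix factorisation such as SVD or ALS, and neural CF) learn a compact model first and use it to fill in the missing entries. Either way, the output is a completed user-item matrix: the green highlighted cells in the original figure are the predicted ratings for previously missing interactions. The top-k unseen items with the highest predicted scores become the recommendations.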
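The memory-based variant can be sketched in a few lines of NumPy. This is a simplified item-item example and rests on several assumptions: the small ratings matrix is a toy, missing ratings are stored as 0 and treated as zeros in the cosine computation (a common shortcut, not the only option), and the prediction is a similarity-weighted average of the user's own ratings.

```python
import numpy as np

users = ["Ali", "Ben", "Cara", "Dev"]
items = ["A", "B", "C", "D"]

# Rows = users, columns = items, 0 = not rated (toy data)
R = np.array([
    [5, 4, 3, 0],
    [4, 5, 0, 2],
    [0, 3, 5, 0],
    [5, 0, 4, 0],
], dtype=float)


def item_similarity(R):
    """Cosine similarity between item columns of the rating matrix."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for unrated items
    normalised = R / norms
    return normalised.T @ normalised


def predict_item_item(R, sim, user_idx, item_idx):
    """Similarity-weighted average of the ratings this user has already given."""
    rated = np.flatnonzero(R[user_idx])
    weights = sim[item_idx, rated]
    if weights.sum() == 0:
        return float(R[R > 0].mean())  # fall back to the global mean rating
    return float(weights @ R[user_idx, rated] / weights.sum())


sim = item_similarity(R)
ali_on_d = predict_item_item(R, sim, users.index("Ali"), items.index("D"))
print(f"Item-item prediction for Ali on D: {ali_on_d:.2f}")
```

Item D is most similar to item B, because the only user who rated D also rated B highly, so Ali's rating of B dominates the prediction.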
# Requires the scikit-surprise package (imported as `surprise`)
import pandas as pd
from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate

# ── Build a synthetic ratings dataset ────────────────────────
ratings_dict = {
    'userID': ['Ali','Ali','Ali','Ben','Ben','Ben','Cara','Cara','Dev','Dev'],
    'itemID': ['A','B','C','A','B','D','B','C','A','C'],
    'rating': [5,4,3,4,5,2,3,5,5,4],
}
df = pd.DataFrame(ratings_dict)
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)

# ── SVD Matrix Factorisation ──────────────────────────────────
svd = SVD(n_factors=20, n_epochs=30, lr_all=0.005, reg_all=0.02)

# ── 5-fold cross-validation ───────────────────────────────────
# With only ten ratings each fold tests on two, so these numbers are illustrative only
results = cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)
print(f"SVD RMSE: {results['test_rmse'].mean():.4f}")
print(f"SVD MAE : {results['test_mae'].mean():.4f}")

# ── Predict a missing rating: Ali on item D ───────────────────
trainset = data.build_full_trainset()
svd.fit(trainset)
pred = svd.predict('Ali', 'D')
print(f"\nPredicted rating for Ali on item D: {pred.est:.2f}")
SVD learns hidden dimensions, called latent factors, that explain the patterns in ratings. These factors are never labelled, but they might correspond to concepts like "prefers cerebral sci-fi" or "likes dark psychological thrillers". A user is represented as a vector in this latent space, and so is each item. The predicted rating is driven by the dot product of the two vectors, which is large when the user and the item point in the same direction. Surprise's SVD adds the global mean rating plus a per-user and a per-item bias to that dot product.
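To see what the library is doing under the hood, here is a from-scratch sketch of biased matrix factorisation trained with stochastic gradient descent. It is a simplified stand-in, not Surprise's actual implementation. The hyperparameters (two factors, 200 epochs, the learning rate and regularisation values) are assumptions chosen for a ten-rating toy dataset.

```python
import numpy as np

ratings = [
    ("Ali", "A", 5), ("Ali", "B", 4), ("Ali", "C", 3),
    ("Ben", "A", 4), ("Ben", "B", 5), ("Ben", "D", 2),
    ("Cara", "B", 3), ("Cara", "C", 5),
    ("Dev", "A", 5), ("Dev", "C", 4),
]
user_index = {u: k for k, u in enumerate(sorted({u for u, _, _ in ratings}))}
item_index = {i: k for k, i in enumerate(sorted({i for _, i, _ in ratings}))}


def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Learn mu, user/item biases and latent vectors by SGD on squared error."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in ratings]))
    b_user = np.zeros(len(user_index))
    b_item = np.zeros(len(item_index))
    P = rng.normal(0.0, 0.1, size=(len(user_index), n_factors))  # user factors
    Q = rng.normal(0.0, 0.1, size=(len(item_index), n_factors))  # item factors
    for _ in range(n_epochs):
        for user, item, rating in ratings:
            u, i = user_index[user], item_index[item]
            err = rating - (mu + b_user[u] + b_item[i] + P[u] @ Q[i])
            b_user[u] += lr * (err - reg * b_user[u])
            b_item[i] += lr * (err - reg * b_item[i])
            p_old = P[u].copy()  # use the pre-update user vector for the item step
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return mu, b_user, b_item, P, Q


def predict(model, user, item, low=1.0, high=5.0):
    """Global mean + biases + dot product, clipped to the rating scale."""
    mu, b_user, b_item, P, Q = model
    u, i = user_index[user], item_index[item]
    return float(np.clip(mu + b_user[u] + b_item[i] + P[u] @ Q[i], low, high))


def rmse(model):
    errors = [rating - predict(model, user, item) for user, item, rating in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))


model = train_mf(ratings)
print(f"Training RMSE: {rmse(model):.3f}")
print(f"Ali on item D: {predict(model, 'Ali', 'D'):.2f}")
```

Predicting the global mean for every rating gives an RMSE of exactly 1.0 on this dataset, so any training RMSE below that shows the biases and factors are picking up signal. On ten ratings they are also certainly overfitting.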
| Method | Complexity | Scalability | Accuracy | Cold Start |
|---|---|---|---|---|
| User-User CF | O(U² × I) | Poor — O(n²) users | Good for small sets | Severe |
| Item-Item CF | O(I² × U) | Better — item count stable | Very Good | Severe |
| SVD / ALS | O(k × U × I) | Excellent | Excellent | Severe |
| Neural CF (NCF) | Flexible | Excellent with GPU | State-of-the-art | Partial mitigation |
Hybrid Recommendation Systems
Hybrid recommender systems are conductors. Content-based filtering provides item-level understanding. Collaborative filtering provides social wisdom. Context models provide situational awareness. The hybrid engine learns exactly how much weight to give each voice — and produces recommendations none could generate independently. Netflix's system, for example, blends over twelve individual models.
Four Hybridisation Strategies
Weighted: score = α × CF_score + β × CB_score. The weights (α, β) are tuned on validation data. This strategy is easy to implement and interpret, and it is a common baseline for blending in production systems.
Production systems blend three or more signals through a learned weight combiner, then apply a re-ranker to enforce diversity and business rules (e.g. don't recommend the same genre three times in a row).
import numpy as np

# ── Simulated scores from three separate models ───────────────
# Each array: scores for 8 candidate items for a specific user
cf_scores = np.array([0.91, 0.44, 0.72, 0.88, 0.31, 0.65, 0.52, 0.79])
cb_scores = np.array([0.78, 0.82, 0.61, 0.55, 0.90, 0.47, 0.69, 0.33])
ctx_scores = np.array([0.60, 0.70, 0.85, 0.45, 0.55, 0.90, 0.40, 0.75])
items = ['Dune', 'Parasite', 'Arrival', 'Inception',
         'Midsommar', '1917', 'Annihilation', 'Interstellar']

# ── Weighted hybrid fusion ────────────────────────────────────
# Weights sum to 1.0 so the hybrid score stays on the same 0-1 scale
alpha, beta, gamma = 0.55, 0.30, 0.15
hybrid_scores = alpha * cf_scores + beta * cb_scores + gamma * ctx_scores

# ── Rank by hybrid score ──────────────────────────────────────
ranked = sorted(zip(items, hybrid_scores), key=lambda x: -x[1])
print("Hybrid Recommendations (Top-5):")
for rank, (item, score) in enumerate(ranked[:5], 1):
    print(f" {rank}. {item:15s} score={score:.3f}")
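The re-ranking step mentioned earlier can be sketched as a greedy pass over the fused ranking. The example below enforces the "no three of the same genre in a row" rule. The scores are rounded from the weighted fusion above, while the genre labels and the `max_run` parameter are assumptions added for illustration.

```python
def rerank_no_genre_runs(ranked, genre_of, max_run=2):
    """Greedy re-ranker: never place more than `max_run` same-genre items in a row.

    Walks the list in score order and defers any item that would extend a run.
    """
    pending = list(ranked)
    output = []
    while pending:
        for pos, (item, _score) in enumerate(pending):
            recent = [genre_of[name] for name, _ in output[-max_run:]]
            makes_run = len(recent) == max_run and all(g == genre_of[item] for g in recent)
            if not makes_run:
                output.append(pending.pop(pos))
                break
        else:
            # Every remaining item would extend the run: keep score order
            output.extend(pending)
            pending.clear()
    return output


# Fused ranking (scores rounded from the weighted hybrid example)
ranked = [
    ("Dune", 0.82), ("Inception", 0.72), ("Arrival", 0.71), ("Interstellar", 0.65),
    ("1917", 0.63), ("Parasite", 0.59), ("Annihilation", 0.55), ("Midsommar", 0.52),
]

# Hypothetical genre labels
genre_of = {
    "Dune": "sci-fi", "Inception": "sci-fi", "Arrival": "sci-fi",
    "Interstellar": "sci-fi", "Annihilation": "sci-fi",
    "1917": "war", "Parasite": "thriller", "Midsommar": "horror",
}

for rank, (item, score) in enumerate(rerank_no_genre_runs(ranked, genre_of), 1):
    print(f" {rank}. {item:15s} {genre_of[item]:9s} score={score:.2f}")
```

The raw ranking opens with four sci-fi titles in a row. After re-ranking, 1917 is promoted to third place to break the run, at the cost of showing a slightly lower-scored item earlier.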
Knowledge-Based Recommendation Systems
"What is your investment horizon? How would you feel if this dropped 30%? Do you need liquidity within 2 years? Do you have ethical restrictions on certain sectors?"
Only after building a complete picture of your constraints and requirements does a good financial advisor make a recommendation: one that is not "popular" or "similar to what you've bought before" but specifically correct for your situation.
This is knowledge-based recommendation. Not data-driven in the interaction sense, but constraint-driven — powered by domain knowledge and explicit user requirements. It dominates wherever the stakes are high and interaction data is sparse.
Knowledge-Based Systems recommend items using explicit knowledge about users' requirements, preferences, and item attributes — combined with domain-expert knowledge about what makes items suitable for certain needs. They do not rely on interaction history at all.
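In a critiquing interface, the system proposes an item and the user responds with a critique such as "cheaper" or "lighter". Each critique cycle narrows the candidate space toward the user's unstated ideal. Studies show users typically need 4–7 critique cycles before accepting a recommendation in high-stakes domains.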
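A critique cycle can be sketched as a filter-and-repropose loop. Everything below is a simplified assumption: the three critique names, the toy laptop catalogue, and in particular the rule for choosing the next proposal (the surviving item closest in price to the current one). Real critiquing systems use more careful similarity measures.

```python
catalogue = [
    {"name": "ProBook X1",   "price": 1200, "battery_h": 10, "weight_kg": 1.4},
    {"name": "SlimAir 3",    "price": 950,  "battery_h": 14, "weight_kg": 1.1},
    {"name": "WorkForce 15", "price": 1500, "battery_h": 8,  "weight_kg": 2.1},
    {"name": "UltraBook Z",  "price": 1100, "battery_h": 12, "weight_kg": 1.2},
    {"name": "PowerEdge G",  "price": 800,  "battery_h": 6,  "weight_kg": 2.5},
]

# Each critique compares a candidate against the currently proposed item
CRITIQUES = {
    "cheaper": lambda cand, ref: cand["price"] < ref["price"],
    "lighter": lambda cand, ref: cand["weight_kg"] < ref["weight_kg"],
    "longer battery": lambda cand, ref: cand["battery_h"] > ref["battery_h"],
}


def critique_session(catalogue, start, critiques):
    """Apply critiques in order, narrowing candidates and re-proposing each time."""
    candidates, current, history = list(catalogue), start, []
    for critique in critiques:
        narrowed = [c for c in candidates if CRITIQUES[critique](c, current)]
        if not narrowed:
            break  # critique cannot be satisfied: keep the current proposal
        candidates = narrowed
        # Hypothetical proposal rule: stay as close as possible to the current price
        current = min(candidates, key=lambda c: abs(c["price"] - current["price"]))
        history.append((critique, current["name"], len(candidates)))
    return current, history


final, history = critique_session(
    catalogue, start=catalogue[2], critiques=["lighter", "longer battery", "cheaper"]
)
for critique, proposal, n_left in history:
    print(f"'{critique}' -> propose {proposal} ({n_left} candidates left)")
print("Accepted:", final["name"])
```

Starting from the WorkForce 15, three critiques shrink the candidate set from five laptops to three, then two, then one.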
# ── Constraint-Based Knowledge Recommender: Laptop Example ──
laptops = [
    {'name':'ProBook X1',  'price':1200, 'battery_h':10, 'weight_kg':1.4, 'ram_gb':16, 'gpu':True},
    {'name':'SlimAir 3',   'price':950,  'battery_h':14, 'weight_kg':1.1, 'ram_gb':8,  'gpu':False},
    {'name':'WorkForce 15','price':1500, 'battery_h':8,  'weight_kg':2.1, 'ram_gb':32, 'gpu':True},
    {'name':'UltraBook Z', 'price':1100, 'battery_h':12, 'weight_kg':1.2, 'ram_gb':16, 'gpu':False},
    {'name':'PowerEdge G', 'price':800,  'battery_h':6,  'weight_kg':2.5, 'ram_gb':16, 'gpu':True},
]

# ── User requirements (hard constraints) ─────────────────────
user_req = {
    'max_price': 1300,
    'min_battery_h': 10,
    'max_weight_kg': 1.5,
    'min_ram_gb': 16,
}

# ── Soft preference: minimise weight (lower = better) ────────
def knowledge_recommend(laptops, req):
    # Keep only laptops that satisfy every hard constraint
    candidates = [
        laptop for laptop in laptops
        if laptop['price'] <= req['max_price']
        and laptop['battery_h'] >= req['min_battery_h']
        and laptop['weight_kg'] <= req['max_weight_kg']
        and laptop['ram_gb'] >= req['min_ram_gb']
    ]
    # Rank surviving candidates by weight (lightest first)
    return sorted(candidates, key=lambda x: x['weight_kg'])

results = knowledge_recommend(laptops, user_req)
print("Laptops matching your requirements:")
for r in results:
    print(f" {r['name']:15s} £{r['price']} | {r['battery_h']}h | {r['weight_kg']}kg | {r['ram_gb']}GB")
Knowledge-based systems are the right tool when: (1) items are purchased infrequently (cars, homes, insurance), (2) preferences are complex and constraint-heavy, (3) there is no meaningful interaction history to learn from, or (4) the domain requires expert knowledge to evaluate item suitability (medical devices, financial products). No historical data is required — domain knowledge replaces it.
Session-Based Recommendation Systems
Session-based recommendation operates entirely on what you're doing right now. The session is the signal. Your click sequence, your dwell time, and your scroll depth in the last twenty minutes paint a picture of intent that can be sharper than months of historical data, because intent shifts with mood and moment.
Session-Based Systems generate recommendations using only the sequence of user interactions within the current browsing session — no long-term user profile is needed or used. They are essential for: anonymous users, platforms with high user turnover, and any context where current intent matters more than historical preference.
Two Dominant Approaches
The transformer encoder attends to the entire click history simultaneously — unlike RNNs, which process sequentially. This allows it to capture that "running shoes + gels" strongly implies athletic intent, even across long sequences.
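The "attends to the entire click history simultaneously" idea comes down to one matrix operation. The sketch below is a heavily simplified illustration: it uses random vectors as stand-ins for learned item embeddings and omits the learned query, key, and value projections, positional encodings, and stacked layers that a real transformer encoder would have.

```python
import numpy as np


def self_attention(X):
    """Scaled dot-product self-attention with no learned projections.

    X has one row per click. Returns the context vectors and the
    attention weights (row i = how much click i attends to each click).
    """
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                        # every click vs every click
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax per row
    return weights @ X, weights


rng = np.random.default_rng(0)
session = ["Running Shoes", "Running Socks", "Water Bottle", "Energy Gels"]
embeddings = rng.normal(size=(len(session), 8))  # hypothetical item embeddings

context, weights = self_attention(embeddings)
session_vector = context[-1]  # one simple choice: the last click's context vector

print("Attention weights for the last click:")
for name, w in zip(session, weights[-1]):
    print(f"  {name:15s} {w:.2f}")
```

All pairwise scores are computed in a single matrix product, so the first click can influence the representation of the last one directly, without being passed step by step through a recurrent state.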
import numpy as np

# ── Simplified Markov Chain Session Recommender ───────────────
# Transition matrix: P(next_item | current_item)
items = ['Running Shoes', 'Running Socks', 'Water Bottle',
         'Energy Gels', 'GPS Watch', 'Compression Leggings']

# Row i → probability of transitioning to item j next
transitions = np.array([
    [0.00, 0.40, 0.25, 0.10, 0.15, 0.10],  # Running Shoes →
    [0.30, 0.00, 0.20, 0.25, 0.15, 0.10],  # Running Socks →
    [0.15, 0.15, 0.00, 0.40, 0.20, 0.10],  # Water Bottle →
    [0.10, 0.15, 0.20, 0.00, 0.30, 0.25],  # Energy Gels →
    [0.20, 0.10, 0.25, 0.15, 0.00, 0.30],  # GPS Watch →
    [0.30, 0.20, 0.10, 0.15, 0.25, 0.00],  # Compression Leggings →
])

def session_recommend(session_clicks, top_n=3):
    # Accumulate transition probabilities for the whole session.
    # (A strict first-order Markov chain would use only the last click.)
    scores = np.zeros(len(items))
    for click in session_clicks:
        scores += transitions[items.index(click)]
    # Exclude already-seen items so they can never be recommended,
    # even when fewer than top_n unseen items remain
    seen_idx = {items.index(c) for c in session_clicks}
    unseen = [i for i in range(len(items)) if i not in seen_idx]
    top_idx = sorted(unseen, key=lambda i: -scores[i])[:top_n]
    return [(items[i], round(float(scores[i]), 3)) for i in top_idx]

session = ['Running Shoes', 'Running Socks', 'Water Bottle', 'Energy Gels']
recs = session_recommend(session)
print("Session-based recommendations:")
for item, score in recs:
    print(f" → {item:25s} score={score}")
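The transition matrix above is hard-coded. In practice it would be estimated from logged sessions. A minimal sketch of that estimation step follows, counting consecutive click pairs and normalising each row. The four logged sessions are invented for illustration, and no smoothing is applied for unseen transitions.

```python
from collections import Counter, defaultdict


def fit_transitions(sessions):
    """Estimate P(next | current) from consecutive click pairs."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for current, followers in counts.items()
    }


def next_items(probs, last_click, top_n=2):
    """First-order Markov prediction: condition on the last click only."""
    followers = probs.get(last_click, {})
    return sorted(followers.items(), key=lambda kv: -kv[1])[:top_n]


# Hypothetical logged sessions
logged_sessions = [
    ["Running Shoes", "Running Socks", "Energy Gels"],
    ["Running Shoes", "Running Socks", "Water Bottle"],
    ["Running Shoes", "GPS Watch"],
    ["Water Bottle", "Energy Gels"],
]

probs = fit_transitions(logged_sessions)
for item, p in next_items(probs, "Running Shoes"):
    print(f"Running Shoes -> {item}: {p:.2f}")
```

Two of the three sessions that contain Running Shoes continue to Running Socks, so that transition gets probability 2/3. An item that never appears as a "current" click simply returns no recommendations, which is where popularity fallbacks or smoothing would come in.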
Use session-based models when users are anonymous or new, when the current session intent dominates (e.g. shopping for a gift, browsing a specific category), or when user preferences change rapidly. Use long-term models (CF, Matrix Factorisation) for established users whose cumulative taste is stable. The best production systems run both in parallel and blend their outputs — using session signals to contextualise long-term preference.
Choosing the Right System — The Decision Framework
| Criteria | Content-Based | Collaborative | Hybrid | Knowledge-Based | Session-Based |
|---|---|---|---|---|---|
| Needs interaction data? | No | Yes — lots | Partial | No | No (just session) |
| Needs item metadata? | Yes — rich | No | Helpful | Yes — structured | No |
| New user (cold start)? | Partial | Fails | Mitigated | Works perfectly | Designed for this |
| New item (cold start)? | Works | Fails | Mitigated | Works | Depends on model |
| Serendipity / Discovery? | Low — echo chamber | High | Tunable | Medium | Medium |
| Interpretability? | High — "because you liked X" | Medium | Low | Very high — rule-based | Medium |
| Computation cost? | Medium | High (large matrices) | High | Low | Low to Medium |
| Best domain examples | News, music, articles | Movies, e-commerce | Netflix, YouTube | Finance, real estate | E-commerce, travel |