Content-Based Filtering — The Idea Behind "More Like This"
Suppose you loved Interstellar. A great librarian doesn't ask your neighbour what they liked. She looks at the content of Interstellar itself (sci-fi, wormholes, time dilation, space exploration, an emotional father-daughter story) and pulls out other films with matching themes.
That is content-based filtering in one sentence: recommend items similar to what a user already liked, based on the properties of the items themselves.
Content-based filtering is one of the two pillars of recommender systems (the other being collaborative filtering). It works entirely from item descriptions — no other users are needed. This makes it ideal when you have rich text data about your items, a cold-start problem with new users, or privacy concerns that prevent sharing user data.
Content-based filtering asks: "What is this item made of? Find me other items made of similar stuff."
Collaborative filtering asks: "Who else liked what I liked? What did they also like?"
Today we focus entirely on content-based filtering using movie descriptions.
| Property | Content-Based | Collaborative |
|---|---|---|
| Data needed | Item descriptions only | Requires many users + ratings |
| New user problem | Works from a single liked item | Cold-start failure until the user has ratings |
| New item problem | Needs description | Cannot recommend |
| Niche interests | Captures them precisely | May miss rare items |
| Serendipity | Low — stays in comfort zone | Higher — finds surprises |
| Privacy | No user data shared | Relies on cross-user data |
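To make the "what is this item made of?" question concrete, here is a minimal sketch that ranks movies purely by how many tags they share. The tag sets and the `jaccard` and `more_like_this` helpers are illustrative assumptions, not part of the pipeline built later, which uses TF-IDF and cosine similarity instead of set overlap.

```python
# Toy content-based comparison: items are described only by their own tags.
# The tag sets below are made up for illustration.
catalogue = {
    "Interstellar": {"scifi", "drama", "space", "wormhole", "time"},
    "Gravity": {"scifi", "thriller", "space", "orbit", "survival"},
    "The Avengers": {"action", "superhero", "team", "alien"},
}

def jaccard(a: set, b: set) -> float:
    """Share of tags two items have in common (0 = none, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def more_like_this(title: str) -> list:
    """Rank every other item by tag overlap with `title`."""
    liked = catalogue[title]
    scored = [(other, jaccard(liked, tags))
              for other, tags in catalogue.items() if other != title]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = more_like_this("Interstellar")
print(ranking)
```

No other user's data appears anywhere in this sketch, which is the defining property of the content-based column in the table.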
The Complete Pipeline — Bird's Eye View
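Before diving into the maths, here is the end-to-end flow of our movie recommendation system: Feature Engineering → TF-IDF Vectorisation → Cosine Similarity Matrix → Item-Based Recommendations. Each step is a section in this tutorial.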
Feature Engineering — Turning Metadata Into Text Soup
What Features Can We Use for Movies?
Building the Feature Soup
The standard approach is to concatenate all relevant features into a single string per movie, then apply text preprocessing. Here is what that looks like in practice:
| Movie | Genre | Director | Cast | Keywords | Final Soup |
|---|---|---|---|---|---|
| Interstellar | Sci-Fi, Drama | Nolan | McConaughey, Hathaway, Chastain | space wormhole time | scifi drama nolan mcconaughey hathaway chastain space wormhole time |
| Inception | Sci-Fi, Thriller | Nolan | DiCaprio, Gordon-Levitt, Page | dream heist mind | scifi thriller nolan dicaprio gordonlevitt page dream heist mind |
| The Martian | Sci-Fi, Drama | Scott | Damon, Chastain, Ejiofor | mars survival alone | scifi drama scott damon chastain ejiofor mars survival alone |
| Gravity | Sci-Fi, Thriller | Cuarón | Bullock, Clooney | space orbit survival | scifi thriller cuaron bullock clooney space orbit survival |
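The "Final Soup" column can be reproduced with a few lines of plain Python. This is a sketch under assumptions: the `to_token` and `make_soup` helpers are hypothetical names, and stripping accents and hyphens is inferred from the table (for example "Cuarón" becoming "cuaron"), not from any stated rule.

```python
import unicodedata

def to_token(name: str) -> str:
    """Lowercase a name, strip accents, and drop spaces and hyphens."""
    ascii_name = (unicodedata.normalize("NFKD", name)
                  .encode("ascii", "ignore").decode("ascii"))
    return ascii_name.lower().replace(" ", "").replace("-", "")

def make_soup(genres, director, cast, keywords) -> str:
    """Concatenate all features of one movie into a single string."""
    tokens = [to_token(g) for g in genres]
    tokens.append(to_token(director))
    tokens += [to_token(c) for c in cast]
    tokens += keywords.lower().split()
    return " ".join(tokens)

soup = make_soup(["Sci-Fi", "Thriller"], "Cuarón",
                 ["Bullock", "Clooney"], "space orbit survival")
print(soup)
```

Collapsing each name into one token keeps a shared first name or a shared word inside a genre label from counting as a match between two unrelated movies.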
Text Preprocessing Steps
import pandas as pd
import numpy as np
from ast import literal_eval

# ── Load the TMDB dataset ──────────────────────────────────
df = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

# Merge on the movie id rather than the title: a few titles repeat,
# and a title merge would duplicate those rows.
# Assumes the Kaggle column names: `id` in movies, `movie_id` in credits.
df = df.merge(credits.drop(columns='title'), left_on='id', right_on='movie_id')

# ── Parse JSON columns ─────────────────────────────────────
json_cols = ['genres', 'keywords', 'cast', 'crew']
for col in json_cols:
    df[col] = df[col].apply(literal_eval)

# ── Extract top-3 cast members ─────────────────────────────
def get_top_cast(cast_list, n=3):
    return [member['name'].replace(' ', '').lower()
            for member in cast_list[:n]]

# ── Extract director from crew ─────────────────────────────
def get_director(crew_list):
    for person in crew_list:
        if person['job'] == 'Director':
            # Repeat the token to weight the director x2
            return [person['name'].replace(' ', '').lower()] * 2
    return []

# ── Generic list cleaner ───────────────────────────────────
def clean_list(lst):
    return [item['name'].replace(' ', '').lower() for item in lst]

# ── Apply feature extraction ───────────────────────────────
df['cast_clean'] = df['cast'].apply(get_top_cast)
df['director_clean'] = df['crew'].apply(get_director)
df['genres_clean'] = df['genres'].apply(clean_list)
df['keywords_clean'] = df['keywords'].apply(clean_list)

# ── Combine into the feature soup ─────────────────────────
def create_soup(row):
    return ' '.join(
        row['keywords_clean'] +
        row['genres_clean'] * 2 +   # genres weighted x2
        row['cast_clean'] +
        row['director_clean']       # director already x2
    )

df['soup'] = df.apply(create_soup, axis=1)
print(df['soup'].iloc[0])
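The parsing step is easier to follow on a single cell. The sketch below uses a made-up string in the same shape as the dataset's `genres` column (a list of dictionaries stored as text) and shows what `literal_eval` and the name cleaning produce. The `json.loads` comparison is an aside: it is an alternative worth considering if cells ever contain JSON-only literals such as `null`, which `literal_eval` does not accept.

```python
import json
from ast import literal_eval

# A made-up cell, shaped like one entry of the `genres` column
raw = '[{"id": 878, "name": "Science Fiction"}, {"id": 18, "name": "Drama"}]'

parsed = literal_eval(raw)          # text -> list of dicts
names = [item["name"].replace(" ", "").lower() for item in parsed]
print(names)

# For this cell the JSON parser gives the same structure
same_as_json = json.loads(raw) == parsed
```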
TF-IDF — Measuring Word Importance Mathematically
TF-IDF scores each word by how distinctive it is for one document. A word that appears a lot in one movie but rarely across all movies is a powerful signal for that movie. A word that appears in every movie is useless noise.
The Two Components
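TF(t, d) measures how often term t appears in document d. IDF(t) measures how rare t is across the whole collection.
The final weight for a word is simply: TF-IDF(t, d) = TF(t, d) × IDF(t)
A word gets a high score only when it appears often in this document AND rarely in other documents. Both conditions must hold simultaneously.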
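One common textbook form of the two components is written out below. Treat the exact variant as an assumption: libraries differ in the details (scikit-learn, for instance, applies a smoothed IDF by default). Here f_{t,d} is the number of times term t occurs in document d, N is the number of documents, and df(t) is the number of documents that contain t.

```latex
\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{IDF}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)
```

With this form, a term that occurs in every document has IDF = log(N / N) = 0, so its weight vanishes no matter how often it appears.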
Worked Example — 5 Movies, 5 Words
| Movie | space | dream | nolan | the | survival |
|---|---|---|---|---|---|
| Interstellar | 0.42 | 0.00 | 0.28 | 0.00 | 0.21 |
| Inception | 0.00 | 0.51 | 0.28 | 0.00 | 0.00 |
| The Martian | 0.18 | 0.00 | 0.00 | 0.00 | 0.44 |
| Gravity | 0.39 | 0.00 | 0.00 | 0.00 | 0.26 |
| The Avengers | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
Notice: the word "the" scores 0 everywhere. It appears in every movie, so under the textbook definition IDF = log(N / df) its IDF is log(1) = 0. The word "dream" scores high only for Inception, because no other movie contains it. This is TF-IDF at work.
Comparing Interstellar and Inception: both share "nolan" with equal weight. "space" belongs to Interstellar; "dream" to Inception. "the" scores zero for both, which is stop-word filtering in action.
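The behaviour in the table can be checked with a few lines of plain Python. This is a sketch of the textbook variant (relative term frequency times log(N / df)); the three tiny documents are made up, so the numbers will not match the table, only the pattern will.

```python
import math
from collections import Counter

# Made-up miniature corpus: each movie is a short list of tokens
docs = {
    "Interstellar": "the space wormhole nolan space survival".split(),
    "Inception": "the dream heist nolan dream".split(),
    "Gravity": "the space orbit survival".split(),
}

def tf_idf(docs):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for words in docs.values() for term in set(words))
    weights = {}
    for title, words in docs.items():
        counts = Counter(words)
        weights[title] = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights

weights = tf_idf(docs)
for term in ("the", "nolan", "dream"):
    print(term, round(weights["Inception"][term], 3))
```

"the" occurs in all three documents, so its weight is exactly zero, while "dream" (unique to Inception) outweighs "nolan" (shared with Interstellar).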
Implementing TF-IDF in Scikit-Learn
from sklearn.feature_extraction.text import TfidfVectorizer

# ── Fit TF-IDF on the feature soup ──────────────────────────
tfidf = TfidfVectorizer(
    stop_words='english',   # auto-remove common English words
    min_df=2,               # ignore tokens appearing in < 2 movies
    max_features=10000,     # cap vocabulary at the 10k most frequent tokens
    ngram_range=(1, 2)      # unigrams plus bigrams of adjacent soup tokens
)

# Defensive: replace any missing soup with an empty string
df['soup'] = df['soup'].fillna('')

# tfidf_matrix shape: (num_movies, num_unique_tokens)
tfidf_matrix = tfidf.fit_transform(df['soup'])
print(f"Matrix shape: {tfidf_matrix.shape}")
print(f"Total unique tokens: {len(tfidf.vocabulary_)}")

# Peek at top tokens for Interstellar
# (df has a default 0..n-1 index here, so the label is also the row position)
movie_idx = df[df['title'] == 'Interstellar'].index[0]
feature_names = tfidf.get_feature_names_out()
scores = tfidf_matrix[movie_idx].toarray()[0]
top10 = sorted(zip(feature_names, scores),
               key=lambda x: -x[1])[:10]
for token, score in top10:
    print(f" {token:20s}: {score:.4f}")
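One detail worth knowing when comparing these scores with hand calculations: with its default settings (`smooth_idf=True`, `norm='l2'`), scikit-learn documents the IDF as ln((1 + n) / (1 + df)) + 1 and then scales each row to unit length. The sketch below re-implements that formula by hand in plain Python; the helper names and the toy counts are assumptions for illustration.

```python
import math

def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """IDF following scikit-learn's documented default (smooth_idf=True)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

def l2_normalise(weights: dict) -> dict:
    """Scale a row of weights to unit Euclidean length (norm='l2')."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else dict(weights)

n_docs = 5
idf_everywhere = smoothed_idf(n_docs, doc_freq=5)   # token present in all 5 movies
idf_rare = smoothed_idf(n_docs, doc_freq=1)         # token present in 1 movie

# One toy document: raw counts and (made-up) document frequencies
counts = {"space": 2, "nolan": 1}
doc_freq = {"space": 3, "nolan": 2}
row = l2_normalise({t: c * smoothed_idf(n_docs, doc_freq[t])
                    for t, c in counts.items()})

print(idf_everywhere, round(idf_rare, 3))
```

Under this variant a token found in every movie keeps an IDF of 1 instead of dropping to 0, which is one reason the vectoriser is given an explicit `stop_words` list.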
Cosine Similarity — Measuring the Angle Between Movies
Each movie is now a TF-IDF vector, which we can picture as an arrow in a high-dimensional space. The question is: how similar are two arrows?
We don't care how long they are — a short description and a long description should still be comparable. We care about the angle between them. Two arrows pointing in the same direction → angle = 0° → cosine = 1 → perfectly similar. Two arrows at 90° → cosine = 0 → totally unrelated.
The Formula
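cos(θ) = (A · B) / (‖A‖ × ‖B‖)
A · B = dot product of vectors A and B. ‖A‖ = Euclidean length of vector A.
For TF-IDF vectors, which have no negative entries, the result is always between 0 (no similarity) and 1 (identical direction).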
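Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch in plain Python follows, using three rows from the worked TF-IDF table as inputs. Returning 0.0 for an all-zero vector is a convention chosen here to avoid dividing by zero.

```python
import math

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for all-zero vectors
    return dot / (norm_a * norm_b)

# Columns: space, dream, nolan, the, survival
interstellar = [0.42, 0.00, 0.28, 0.00, 0.21]
inception    = [0.00, 0.51, 0.28, 0.00, 0.00]
gravity      = [0.39, 0.00, 0.00, 0.00, 0.26]

sim_gravity = cosine(interstellar, gravity)
sim_inception = cosine(interstellar, inception)
print(f"Interstellar vs Gravity:   {sim_gravity:.3f}")
print(f"Interstellar vs Inception: {sim_inception:.3f}")
```

Scaling a vector does not change the result: `cosine(a, [2 * x for x in a])` is still 1, which is why description length does not matter.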
Visual Intuition — 2D Example
Shown in 2D for illustration; real TF-IDF vectors have thousands of dimensions. Interstellar and Gravity share "space" and "survival": small angle, high cosine. Inception shares little: wider angle, lower cosine.
Computing the Full Similarity Matrix
We compute pairwise cosine similarity for all movies at once. For 5,000 movies this produces a 5,000 × 5,000 matrix — 25 million values. Each cell [i][j] holds the similarity between movie i and movie j.
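The size of that matrix is worth estimating before computing it. The arithmetic below assumes a dense matrix of 8-byte floats (float64), with a 4-byte float32 copy shown for comparison; the movie count is the dataset size quoted elsewhere in this tutorial.

```python
n_movies = 4803                      # dataset size quoted in this tutorial
n_cells = n_movies ** 2              # one similarity value per pair of movies

bytes_float64 = n_cells * 8          # assumed default: 8 bytes per value
bytes_float32 = n_cells * 4          # half the memory at reduced precision

print(f"{n_cells:,} cells")
print(f"float64: {bytes_float64 / 1e6:.0f} MB")
print(f"float32: {bytes_float32 / 1e6:.0f} MB")
```

Memory grows with the square of the catalogue size, so ten times as many movies needs a hundred times as much memory.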
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# ── Compute full pairwise cosine similarity ──────────────────
# tfidf_matrix is sparse, shape (num_movies, num_tokens), e.g. (4803, 8742)
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(f"Similarity matrix shape: {cosine_sim.shape}")

# ── Build a reverse-lookup index: title → row index ─────────
# Keep the first row for each title. Series.drop_duplicates() would compare
# the row numbers, not the titles, so it would remove nothing here.
indices = pd.Series(df.index, index=df['title'])
indices = indices[~indices.index.duplicated(keep='first')]

# ── Peek at similarity for Interstellar ─────────────────────
idx = indices['Interstellar']
# Exclude the movie itself by index instead of assuming it sorts first
sim_scores = [(i, s) for i, s in enumerate(cosine_sim[idx]) if i != idx]
sim_scores = sorted(sim_scores, key=lambda x: -x[1])

print("\nTop 5 similar to Interstellar:")
for movie_idx, score in sim_scores[:5]:
    title = df['title'].iloc[movie_idx]
    print(f" {title:30s}: {score:.4f}")
The Martian and Gravity share "space" and "survival" with Interstellar. Inception shares "Nolan" as director. 2001: A Space Odyssey and Contact are both cerebral sci-fi space films. These matches are what we would expect, which suggests the pipeline is working as intended.
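Sorting every score just to read the first few does more work than needed. As a sketch, `heapq.nlargest` from the standard library can pick the top k directly, and excluding the query movie by its index stays correct even when another movie ties with it at 1.0. The `top_k_similar` helper and the scores below are illustrative assumptions.

```python
import heapq

def top_k_similar(sim_row, self_idx, k=5):
    """Return (index, score) pairs for the k highest scores, excluding self_idx."""
    candidates = ((i, s) for i, s in enumerate(sim_row) if i != self_idx)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Made-up similarity row for a query movie stored at index 0
row = [1.000, 0.684, 0.782, 0.753, 0.121]
top = top_k_similar(row, self_idx=0, k=3)
print(top)

# A tie at 1.0 (for example, two movies with identical soups) is handled too
tied_row = [1.0, 1.0, 0.2]
top_tied = top_k_similar(tied_row, self_idx=1, k=1)
```

The same helper works on one row of `cosine_sim`, for example `top_k_similar(cosine_sim[idx], idx, k=5)`.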
Item Similarity — Understanding the Similarity Matrix
The cosine similarity matrix is the heart of our recommendation engine. Let's visualise what it looks like and how to interpret it.
| Movie ↓ / Movie → | Interstellar | Inception | The Martian | Gravity | Avengers |
|---|---|---|---|---|---|
| Interstellar | 1.000 | 0.684 | 0.782 | 0.753 | 0.121 |
| Inception | 0.684 | 1.000 | 0.312 | 0.208 | 0.098 |
| The Martian | 0.782 | 0.312 | 1.000 | 0.801 | 0.054 |
| Gravity | 0.753 | 0.208 | 0.801 | 1.000 | 0.033 |
| Avengers | 0.121 | 0.098 | 0.054 | 0.033 | 1.000 |
Diagonal is always 1.0 — every movie is identical to itself.
Symmetric — sim(A, B) = sim(B, A). The matrix mirrors itself across the diagonal.
Values in [0, 1] — because TF-IDF vectors have no negative values; cosine stays non-negative.
Sparse clusters emerge — action movies cluster together; sci-fi movies cluster together.
Green = 1.0 (diagonal, self-similarity). Deeper blue = higher similarity. Notice how Martian and Gravity share a strong cluster. Avengers is isolated from the space-film group.
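The properties listed above can be verified mechanically. The sketch below copies the five-movie matrix from the table into plain Python lists and checks the diagonal, the symmetry and the value range, then reads off each movie's nearest neighbour.

```python
titles = ["Interstellar", "Inception", "The Martian", "Gravity", "Avengers"]
sim = [
    [1.000, 0.684, 0.782, 0.753, 0.121],
    [0.684, 1.000, 0.312, 0.208, 0.098],
    [0.782, 0.312, 1.000, 0.801, 0.054],
    [0.753, 0.208, 0.801, 1.000, 0.033],
    [0.121, 0.098, 0.054, 0.033, 1.000],
]
n = len(sim)

diagonal_ok = all(sim[i][i] == 1.0 for i in range(n))
symmetric = all(sim[i][j] == sim[j][i] for i in range(n) for j in range(n))
in_range = all(0.0 <= v <= 1.0 for row in sim for v in row)

# Most similar *other* movie for each title
nearest = {
    titles[i]: titles[max((j for j in range(n) if j != i),
                          key=lambda j: sim[i][j])]
    for i in range(n)
}
print(diagonal_ok, symmetric, in_range)
print(nearest)
```

Because the matrix is symmetric, only the upper triangle carries information, which roughly halves the storage needed if memory becomes a concern.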
The Complete Movie Recommendation System — Full Project
Let's bring all five concepts together into a single end-to-end Python script. We will use the TMDB 5000 Movies dataset: 4,803 movies with full metadata.
pip install pandas scikit-learn numpy
# ═══════════════════════════════════════════════════════════
# Movie Recommendation System — Content-Based Filtering
# Dataset: TMDB 5000 Movies
# ═══════════════════════════════════════════════════════════
import pandas as pd
import numpy as np
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# ── 1. LOAD DATA ────────────────────────────────────────────
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')
# Merge on movie id (assumes the Kaggle column names `id` / `movie_id`);
# merging on title would duplicate rows for the few repeated titles.
df = movies.merge(credits.drop(columns='title'),
                  left_on='id', right_on='movie_id')

# Keep only what we need
df = df[['title', 'overview', 'genres',
         'keywords', 'cast', 'crew']].copy()
# Drop movies without an overview, then reset the index so that row labels
# match the row positions used by the similarity matrix.
df = df.dropna(subset=['overview']).reset_index(drop=True)

# ── 2. PARSE JSON COLUMNS ───────────────────────────────────
for col in ['genres', 'keywords', 'cast', 'crew']:
    df[col] = df[col].apply(literal_eval)

# ── 3. FEATURE EXTRACTION ───────────────────────────────────
def extract_names(lst, n=None):
    names = [x['name'].replace(' ', '').lower() for x in lst]
    return names[:n] if n else names

def get_director(crew):
    directors = [x['name'].replace(' ', '').lower()
                 for x in crew if x['job'] == 'Director']
    return directors * 2  # weight director twice

df['genres_f'] = df['genres'].apply(lambda x: extract_names(x) * 2)
df['keywords_f'] = df['keywords'].apply(lambda x: extract_names(x))
df['cast_f'] = df['cast'].apply(lambda x: extract_names(x, n=3))
df['crew_f'] = df['crew'].apply(get_director)

# ── 4. BUILD FEATURE SOUP ───────────────────────────────────
def build_soup(row):
    overview_words = str(row['overview']).lower().split()
    meta = (row['genres_f'] + row['keywords_f'] +
            row['cast_f'] + row['crew_f'])
    return ' '.join(overview_words + meta)

df['soup'] = df.apply(build_soup, axis=1)

# ── 5. TF-IDF VECTORISATION ─────────────────────────────────
tfidf = TfidfVectorizer(
    stop_words='english',
    min_df=2,
    max_features=15000,
    ngram_range=(1, 2),
    sublinear_tf=True   # dampen very high term freqs
)
tfidf_matrix = tfidf.fit_transform(df['soup'])

# ── 6. COSINE SIMILARITY MATRIX ─────────────────────────────
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Title-to-index lookup (keep the first row for each repeated title)
indices = pd.Series(df.index, index=df['title'])
indices = indices[~indices.index.duplicated(keep='first')]

# ── 7. RECOMMENDATION FUNCTION ──────────────────────────────
def recommend(title, n=10):
    """Return top-n content-based recommendations for a given movie title."""
    if title not in indices:
        return f"Movie '{title}' not found in dataset."
    idx = indices[title]
    # Score every other movie, excluding the query movie by index
    sim_scores = [(i, s) for i, s in enumerate(cosine_sim[idx]) if i != idx]
    sim_scores = sorted(sim_scores, key=lambda x: -x[1])[:n]
    movie_indices = [i for i, _ in sim_scores]
    scores = [round(float(s), 4) for _, s in sim_scores]
    result = df['title'].iloc[movie_indices].reset_index(drop=True)
    return pd.DataFrame({'Recommended Movie': result, 'Similarity Score': scores})

# ── 8. TEST IT ──────────────────────────────────────────────
print("🎬 Recommendations for: Interstellar\n")
print(recommend('Interstellar', n=8).to_string(index=False))
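The lookup in `recommend` needs an exact title match, so a typo or different capitalisation ends in "not found". One possible remedy, sketched here with the standard library's `difflib`, is to resolve the user's text to the closest known title first. The `resolve_title` helper, its 0.6 cutoff and the short title list are assumptions for illustration.

```python
import difflib

def resolve_title(query, known_titles, cutoff=0.6):
    """Map a user-typed title to the closest known title, or None."""
    by_lower = {t.lower(): t for t in known_titles}
    match = difflib.get_close_matches(query.lower(), list(by_lower),
                                      n=1, cutoff=cutoff)
    return by_lower[match[0]] if match else None

# Stand-in for the dataset's title column
known = ["Interstellar", "Inception", "The Martian", "Gravity"]

print(resolve_title("interstelar", known))   # typo plus lowercase
print(resolve_title("the martian", known))   # lowercase only
print(resolve_title("zzzz", known))          # nothing close enough
```

In the full script this could sit in front of the lookup, for example by passing `df['title']` as `known_titles` and calling `recommend` only when a match is found.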
Evaluating & Tuning Your Recommender
Content-based recommenders are harder to evaluate than classifiers — there is no single "correct" answer. Here are the standard approaches:
| Metric | What It Measures | When to Use | Limitation |
|---|---|---|---|
| Precision@K | Fraction of top-K recommendations the user actually liked | When you have user rating data | Needs ground-truth ratings |
| Recall@K | Fraction of all liked items that appear in top-K | When recall matters (e.g. library discovery) | Ignores ranking order |
| NDCG@K | Normalised Discounted Cumulative Gain — rewards top-ranked relevant items | Best overall ranking metric | Complex to compute |
| Coverage | % of catalogue that ever gets recommended | Checks for diversity | Doesn't measure quality |
| Intra-List Diversity | Average pairwise dissimilarity within a recommendation list | Checks for filter bubble risk | Domain-specific |
| A/B Test CTR | Click-through rate on recommended items in production | Gold standard — real user behaviour | Requires live product |
# ── Quick Precision@K evaluation with mock ratings ──────────
def precision_at_k(recommended_titles, liked_titles, k):
    top_k = recommended_titles[:k]
    hits = len(set(top_k) & set(liked_titles))
    return hits / k

# Example: user liked these space films
user_liked = ['Gravity', 'The Martian', 'Arrival', 'Contact']
recs = recommend('Interstellar', n=10)['Recommended Movie'].tolist()

p5 = precision_at_k(recs, user_liked, k=5)
p10 = precision_at_k(recs, user_liked, k=10)
print(f"Precision@5 : {p5:.2f}")
print(f"Precision@10 : {p10:.2f}")
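Recall@K and NDCG@K from the table can be sketched in the same style. The version below assumes binary relevance (a title is either liked or not) and uses the common log2 discount for NDCG; the recommendation list and the liked list are made up so the example runs without the dataset.

```python
import math

def recall_at_k(recommended, liked, k):
    """Fraction of all liked items that appear in the top k."""
    liked = set(liked)
    if not liked:
        return 0.0
    return len(set(recommended[:k]) & liked) / len(liked)

def ndcg_at_k(recommended, liked, k):
    """Binary-relevance NDCG: hits near the top count more than hits lower down."""
    liked = set(liked)
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, title in enumerate(recommended[:k]) if title in liked)
    ideal_hits = min(len(liked), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# Made-up lists for illustration
mock_recs = ["Gravity", "Inception", "The Martian", "Armageddon", "Contact"]
mock_liked = ["Gravity", "The Martian", "Arrival", "Contact"]

r5 = recall_at_k(mock_recs, mock_liked, k=5)
n5 = ndcg_at_k(mock_recs, mock_liked, k=5)
print(f"Recall@5 : {r5:.2f}")
print(f"NDCG@5   : {n5:.2f}")
```

Unlike precision and recall, NDCG changes when the same hits are reordered, which is what makes it a ranking metric.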
Content-based filtering has one well-known weakness: over-specialisation. If a user watches only space films, the system only ever recommends more space films. It never discovers that the user might also love historical dramas. In production, add a small dose of serendipity: inject 10–20% diverse recommendations or blend with collaborative filtering (a hybrid system).
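Over-specialisation can be measured before trying to fix it. The sketch below computes intra-list diversity, the average pairwise dissimilarity (1 minus similarity) within one recommendation list. The helper names and the similarity scores are made up for illustration; in the full pipeline the scores would come from the cosine similarity matrix.

```python
from itertools import combinations

def intra_list_diversity(items, similarity):
    """Average of (1 - similarity) over all pairs in a recommendation list."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - similarity(a, b) for a, b in pairs) / len(pairs)

# Made-up similarity scores between three titles
toy_sim = {
    frozenset(["Gravity", "The Martian"]): 0.80,
    frozenset(["Gravity", "Amadeus"]): 0.05,
    frozenset(["The Martian", "Amadeus"]): 0.05,
}

def lookup(a, b):
    return toy_sim[frozenset([a, b])]

narrow = intra_list_diversity(["Gravity", "The Martian"], lookup)
mixed = intra_list_diversity(["Gravity", "The Martian", "Amadeus"], lookup)
print(f"space films only : {narrow:.2f}")
print(f"with one outsider: {mixed:.2f}")
```

Tracking this number alongside precision shows how much relevance is being traded away when diverse items are injected.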
Golden Rules
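Collapse multi-word names into single tokens with .replace(' ', ''), so that a first name shared by two different people never counts as a match.
Save the similarity matrix with np.save after computing it. Recomputing a 5,000-movie matrix on every request kills production performance. Serialise it once; load it on startup.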
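A minimal sketch of that save-once, load-on-startup pattern with NumPy follows. The small random matrix stands in for the real `cosine_sim`, the temporary directory and file name are placeholders, and casting to float32 is an optional assumption that halves the file size at reduced precision.

```python
import os
import tempfile
import numpy as np

# Stand-in for the real cosine_sim matrix (random values, illustration only)
rng = np.random.default_rng(0)
cosine_sim = rng.random((4, 4))

# Offline step: compute once, then serialise
path = os.path.join(tempfile.mkdtemp(), "cosine_sim.npy")
np.save(path, cosine_sim.astype(np.float32))

# Startup step: load the precomputed matrix instead of recomputing it
loaded = np.load(path)
print(loaded.shape, loaded.dtype)
```

For large matrices, `np.load(path, mmap_mode='r')` maps the file read-only and reads rows from disk on demand instead of loading everything into memory.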
You now have a complete content-based movie recommendation engine from scratch:
Feature Engineering → TF-IDF Vectorisation →
Cosine Similarity Matrix → Item-Based Recommendations.
The same pipeline works for article recommendation, product recommendation, job matching,
and any domain where items have rich text descriptions. The maths is identical —
only the features change.