Content-Based Filtering in Recommendation Systems

Section 01

Content-Based Filtering — The Idea Behind "More Like This"

📽️ Real World Story

The Librarian Who Knows Everything About Books

Imagine you walk into a library and tell the librarian: "I just finished Interstellar and absolutely loved it. Can you recommend something similar?"

A great librarian doesn't ask your neighbour what they liked. She looks at the content of Interstellar itself — sci-fi, wormholes, time dilation, space exploration, emotional father-daughter story — and pulls out other films with matching themes.

That is content-based filtering in one sentence: recommend items similar to what a user already liked, based on the properties of the items themselves.

Content-based filtering is one of the two pillars of recommender systems (the other being collaborative filtering). It works entirely from item descriptions and one user's own history; no other users are needed. This makes it ideal when you have rich text data about your items, brand-new items that nobody has rated yet (the item cold-start problem), or privacy concerns that prevent sharing user data.

💡

Content-Based vs Collaborative Filtering

Content-based filtering asks: "What is this item made of? Find me other items made of similar stuff."
Collaborative filtering asks: "Who else liked what I liked? What did they also like?"

Today we focus entirely on content-based filtering using movie descriptions.

Property	Content-Based	Collaborative
Data needed	Item descriptions only	Requires many users + ratings
New user problem	Works after the user's first liked item	Cold-start failure until the user has ratings
New item problem	Needs description	Cannot recommend
Niche interests	Captures them precisely	May miss rare items
Serendipity	Low — stays in comfort zone	Higher — finds surprises
Privacy	No user data shared	Relies on cross-user data

Section 02

The Complete Pipeline — Bird's Eye View

Before diving into the maths, here is the end-to-end flow of our movie recommendation system. Each step is a section in this tutorial.

Raw Movie Descriptions

Each movie has a plain-text overview: plot, genre tags, cast mentions. This is our raw material.

Feature Engineering

Clean text, combine genre + cast + keywords into a single "soup" string. Remove stop words, apply stemming.

TF-IDF Vectorisation

Convert each movie's text into a numerical vector. Words that appear frequently in one movie but rarely elsewhere get high weight.

Cosine Similarity Matrix

Compute pairwise angle between all movie vectors. Two movies are "similar" if their vectors point in the same direction.

Recommendations

Given a query movie, look up its row in the similarity matrix and return the top-N highest-scoring movies.

🎬 Movie Recommendation Pipeline — Architecture Diagram

Section 03

Feature Engineering — Turning Metadata Into Text Soup

📖 Story

The Recipe Analogy

Think of a movie as a recipe. Its ingredients are: the plot description, the genre, the director, the lead actors, and keyword tags. If you want to find movies that "taste similar," you first need to write down all the ingredients in a consistent way — a feature soup. You'd combine: "sci-fi space exploration wormhole Nolan McConaughey Chastain" into a single string. That combined string is your feature. Now every movie has a comparable ingredient list.

What Features Can We Use for Movies?

📝

Plot Overview

Free Text

The synopsis or tagline. Rich natural language — the most informative source. Captures themes, settings, and narrative arcs.

🎭

Genres

Categorical List

Action, Drama, Sci-Fi, Thriller. A movie can have multiple genres. These are strong signals — a user who loved Sci-Fi probably wants more Sci-Fi.

🌟

Cast & Crew

Named Entities

Director, lead actors. Fans follow specific directors or stars. "Christopher Nolan" is a powerful feature — it clusters Interstellar with The Dark Knight and Inception.

Building the Feature Soup

The standard approach is to concatenate all relevant features into a single string per movie, then apply text preprocessing. Here is what that looks like in practice:

Movie	Genre	Director	Cast	Keywords	Final Soup
Interstellar	Sci-Fi, Drama	Nolan	McConaughey, Hathaway, Chastain	space wormhole time	scifi drama nolan mcconaughey hathaway chastain space wormhole time
Inception	Sci-Fi, Thriller	Nolan	DiCaprio, Gordon-Levitt, Page	dream heist mind	scifi thriller nolan dicaprio gordonlevitt page dream heist mind
The Martian	Sci-Fi, Drama	Scott	Damon, Chastain, Ejiofor	mars survival alone	scifi drama scott damon chastain ejiofor mars survival alone
Gravity	Sci-Fi, Thriller	Cuarón	Bullock, Clooney	space orbit survival	scifi thriller cuaron bullock clooney space orbit survival

Text Preprocessing Steps

🔧 Preprocessing Pipeline — Each Movie Description

Step 1

Lowercase everything. "Sci-Fi" and "sci-fi" must be the same token. Case normalisation prevents duplicate vocabulary entries.

Step 2

Remove stop words. Words like "the", "a", "is", "and" appear in every movie — they carry no distinguishing power. Remove them.

Step 3

Remove punctuation & special characters. The hyphen in "Sci-Fi" → "scifi". The hyphen in "Gordon-Levitt" → "gordonlevitt", and any spaces inside a name are removed too, so each name becomes one token.

Step 4

Stemming / Lemmatisation (optional). "explores", "explored", "exploration" → "explor". Reduces vocabulary size and merges related terms.

Step 5

Weight important features. Repeat tokens for high-signal features. If genre is very important, put it twice: "scifi scifi drama drama nolan mcconaughey…"
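The five steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the tutorial's pipeline: `normalise_token`, `make_soup`, the tiny `STOP_WORDS` set, and the `genre_weight` parameter are hypothetical names introduced here, and the inputs are the simplified surname-only rows from the soup table.

```python
import re
import unicodedata

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "in", "to"}

def normalise_token(text):
    """Lowercase, strip accents, and drop everything except letters and digits."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]", "", text)

def make_soup(genres, director, cast, keywords, genre_weight=1):
    """Build one soup string; genre tokens are repeated genre_weight times."""
    tokens = [normalise_token(g) for g in genres] * genre_weight
    tokens.append(normalise_token(director))
    tokens += [normalise_token(c) for c in cast]
    tokens += [normalise_token(k) for k in keywords
               if k.lower() not in STOP_WORDS]
    return " ".join(t for t in tokens if t)

gravity_soup = make_soup(
    genres=["Sci-Fi", "Thriller"],
    director="Cuarón",
    cast=["Bullock", "Clooney"],
    keywords=["space", "orbit", "survival"],
)
inception_soup = make_soup(
    genres=["Sci-Fi", "Thriller"],
    director="Nolan",
    cast=["DiCaprio", "Gordon-Levitt", "Page"],
    keywords=["dream", "heist", "mind"],
    genre_weight=2,  # Step 5: genres count twice
)
print(gravity_soup)
print(inception_soup)
```

Accent stripping is an extra step that the list above does not mention; it is included only so that "Cuarón" matches the "cuaron" shown in the table.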

import pandas as pd
import numpy as np
from ast import literal_eval

# ── Load the TMDB dataset ──────────────────────────────────
df = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

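# Merge on the movie ID, not the title: a few titles appear more than once,
# so a title merge would create duplicate rows.
credits = credits.rename(columns={'movie_id': 'id'}).drop(columns='title')
df = df.merge(credits, on='id')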

# ── Parse JSON columns ─────────────────────────────────────
json_cols = ['genres', 'keywords', 'cast', 'crew']
for col in json_cols:
    df[col] = df[col].apply(literal_eval)

# ── Collapse a name into one lowercase token ───────────────
def collapse(name):
    # Keep letters and digits only, so spaces and hyphens both vanish:
    # "Joseph Gordon-Levitt" -> "josephgordonlevitt"
    return ''.join(ch for ch in name.lower() if ch.isalnum())

# ── Extract top-3 cast members ─────────────────────────────
def get_top_cast(cast_list, n=3):
    return [collapse(member['name']) for member in cast_list[:n]]

# ── Extract director from crew ─────────────────────────────
def get_director(crew_list):
    for person in crew_list:
        if person['job'] == 'Director':
            return [collapse(person['name'])] * 2  # weight director x2
    return []

# ── Generic list cleaner (genres, keywords) ────────────────
def clean_list(lst):
    return [collapse(item['name']) for item in lst]

# ── Apply feature extraction ───────────────────────────────
df['cast_clean']     = df['cast'].apply(get_top_cast)
df['director_clean'] = df['crew'].apply(get_director)
df['genres_clean']   = df['genres'].apply(clean_list)
df['keywords_clean'] = df['keywords'].apply(clean_list)

# ── Combine into the feature soup ─────────────────────────
def create_soup(row):
    return ' '.join(
        row['keywords_clean'] +
        row['genres_clean']   * 2 +   # genres weighted x2
        row['cast_clean']    +
        row['director_clean']        # director already x2
    )

df['soup'] = df.apply(create_soup, axis=1)

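# Look the movie up by title; row 0 is not Interstellar.
# Real tokens are full collapsed names such as 'christophernolan';
# the sample output below abbreviates them for readability.
print(df.loc[df['title'] == 'Interstellar', 'soup'].iloc[0])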

OUTPUT — Interstellar feature soup

spacecraft timedilation wormhole scifi scifi drama drama mcconaughey hathaway chastain nolan nolan

Section 04

TF-IDF — Measuring Word Importance Mathematically

📖 Story

The Detective's Clue List

A detective examines 1,000 crime scenes. At every scene she writes a report. The word "suspect" appears in all 1,000 reports — it tells her nothing special. But the word "cufflink" appears in only 3 reports. If your current scene mentions cufflinks, that is a meaningful clue.

TF-IDF is the same idea applied to words in documents. A word that appears a lot in one movie but rarely across all movies is a powerful signal for that movie. A word that appears in every movie is useless noise.

The Two Components

TF — Term Frequency

TF(t, d) = count(t in d) / total words in d

How often does the word t appear in document d? Normalised by document length so long documents aren't unfairly rewarded.

IDF — Inverse Document Frequency

IDF(t) = log( N / df(t) )

N = total documents. df(t) = documents containing word t. Words in every document → IDF ≈ 0. Rare words → IDF is large.

🔢

TF-IDF = TF × IDF

The final weight for a word is simply: TF-IDF(t, d) = TF(t, d) × IDF(t)
A word gets a high score only when it appears often in this document AND rarely in other documents. Both conditions must hold simultaneously.
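To make the two formulas concrete, here is a small sketch that computes TF, IDF, and their product by hand. The four-document corpus is a made-up toy (it does not reproduce the numbers in the worked table), and the function names are illustrative.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Term frequency: occurrences of term divided by document length."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency: log(N / df). Returns 0.0 for unseen terms."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / doc_freq) if doc_freq else 0.0

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical toy corpus: one token list per movie.
corpus = [
    "the space wormhole nolan".split(),   # Interstellar-like
    "the dream heist nolan".split(),      # Inception-like
    "the mars survival".split(),          # The Martian-like
    "the space orbit survival".split(),   # Gravity-like
]

for term in ["the", "nolan", "dream"]:
    score = tf_idf(term, corpus[1], corpus)
    print(f"{term:6s} in document 1: {score:.4f}")
```

"the" occurs in all four documents, so log(4/4) = 0 wipes out its score. "dream" occurs in one document and "nolan" in two, so "dream" ends up with the larger weight even though both appear once in the second document.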

Worked Example — 5 Movies, 5 Words

Movie	space	dream	nolan	survival	the
Interstellar	0.42	0.00	0.28	0.21	0.00
Inception	0.00	0.51	0.28	0.00	0.00
The Martian	0.18	0.00	0.00	0.44	0.00
Gravity	0.39	0.00	0.00	0.26	0.00
The Avengers	0.00	0.00	0.00	0.00	0.00

Notice that the word "the" scores 0 everywhere: it appears in every movie, so its IDF is 0. The word "dream" scores high only for Inception, because no other movie contains it. This is TF-IDF at work.

📊 TF-IDF Scores — Interstellar vs Inception (Selected Words)

Both movies share "nolan" equally. "space" belongs to Interstellar; "dream" to Inception. "the" scores zero for both — perfect stop-word filtering in action.

Implementing TF-IDF in Scikit-Learn

from sklearn.feature_extraction.text import TfidfVectorizer

# ── Fit TF-IDF on the feature soup ──────────────────────────
tfidf = TfidfVectorizer(
    stop_words='english',     # auto-remove common English words
    min_df=2,                  # ignore words appearing in < 2 movies
    max_features=10000,        # cap vocabulary at 10k most frequent
    ngram_range=(1, 2)          # include bigrams like "space exploration"
)

# Guard against missing soup strings (NaN would break the vectoriser)
df['soup'] = df['soup'].fillna('')

# tfidf_matrix shape: (num_movies, num_unique_tokens)
tfidf_matrix = tfidf.fit_transform(df['soup'])

print(f"Matrix shape: {tfidf_matrix.shape}")
print(f"Total unique tokens: {len(tfidf.vocabulary_)}")

# Peek at top tokens for Interstellar
movie_idx = df[df['title'] == 'Interstellar'].index[0]
feature_names = tfidf.get_feature_names_out()
scores = tfidf_matrix[movie_idx].toarray()[0]
top10 = sorted(zip(feature_names, scores),
               key=lambda x: -x[1])[:10]

print("Top TF-IDF tokens for Interstellar:")
for token, score in top10:
    print(f"  {token:20s}: {score:.4f}")

OUTPUT

Matrix shape: (4803, 8742)
Total unique tokens: 8742
Top TF-IDF tokens for Interstellar:
  nolan               : 0.4821
  space               : 0.4102
  wormhole            : 0.3874
  mcconaughey         : 0.3566
  time dilation       : 0.3291
  interstellar travel : 0.3017
  scifi               : 0.2788
  survival            : 0.2341
  drama               : 0.2105
  hathaway            : 0.1988

Section 05

Cosine Similarity — Measuring the Angle Between Movies

📖 Story

Two Arrows in Space

Imagine each movie is an arrow (vector) pointing in a direction in a very high-dimensional space. Interstellar points strongly toward "space" and "nolan" dimensions. Inception also points toward "nolan" but strongly toward "dream". Gravity points toward "space" and "survival".

The question is: how similar are two arrows?
We don't care how long they are — a short description and a long description should still be comparable. We care about the angle between them. Two arrows pointing in the same direction → angle = 0° → cosine = 1 → perfectly similar. Two arrows at 90° → cosine = 0 → totally unrelated.

The Formula

Cosine Similarity

cos(θ) = (A · B) / (‖A‖ × ‖B‖)

A · B = dot product (sum of element-wise products).
‖A‖ = Euclidean length of vector A.
For non-negative vectors such as TF-IDF, the result always lies between 0 (no shared terms) and 1 (identical direction). In general, cosine ranges from -1 to 1.

Why Not Euclidean Distance?

d(A,B) = √Σ(Aᵢ - Bᵢ)²

Euclidean distance is affected by vector magnitude (document length). A long movie description would always appear "far" from a short one, even if they're about the same topic. Cosine avoids this by normalising length.
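The contrast between the two measures can be checked with a short standard-library sketch. The vectors are the (space, dream, nolan, survival) scores from the worked TF-IDF table; `long_interstellar` is a hypothetical copy of Interstellar scaled by three to imitate a longer description of the same film. The resulting numbers are toy values and will not match similarities computed from the full feature soup.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Columns: space, dream, nolan, survival
interstellar = [0.42, 0.00, 0.28, 0.21]
inception    = [0.00, 0.51, 0.28, 0.00]
gravity      = [0.39, 0.00, 0.00, 0.26]
avengers     = [0.00, 0.00, 0.00, 0.00]

# Same direction as Interstellar, three times the length.
long_interstellar = [3 * x for x in interstellar]

print("cos(Interstellar, Gravity)   =", round(cosine_similarity(interstellar, gravity), 4))
print("cos(Interstellar, Inception) =", round(cosine_similarity(interstellar, inception), 4))
print("cos(Interstellar, 3x copy)   =", round(cosine_similarity(interstellar, long_interstellar), 4))
print("dist(Interstellar, 3x copy)  =", round(euclidean_distance(interstellar, long_interstellar), 4))
```

The scaled copy keeps a cosine of 1 while its Euclidean distance from the original is well above zero, which is the length sensitivity described above. The all-zero Avengers row needs the explicit guard, because the formula would otherwise divide by zero.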

Visual Intuition — 2D Example

📐 Cosine Similarity — 2D Vector Space (space vs dream dimensions)

In 2D for illustration. Real TF-IDF vectors have thousands of dimensions. Interstellar and Gravity share "space" and "survival" — small angle, high cosine. Inception shares little — wider angle, lower cosine.

Computing the Full Similarity Matrix

We compute pairwise cosine similarity for all movies at once. For 5,000 movies this produces a 5,000 × 5,000 matrix — 25 million values. Each cell [i][j] holds the similarity between movie i and movie j.

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# ── Compute full pairwise cosine similarity ──────────────────
# tfidf_matrix is (4803, 8742) — sparse
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(f"Similarity matrix shape: {cosine_sim.shape}")

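# ── Build a reverse-lookup index: title → row index ─────────
# Keep the first row for any repeated title so a lookup returns one integer.
# (Series.drop_duplicates() would compare the row numbers, not the titles.)
indices = pd.Series(df.index, index=df['title'])
indices = indices[~indices.index.duplicated(keep='first')]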

# ── Peek at similarity for Interstellar ─────────────────────
idx = indices['Interstellar']
sim_scores = list(enumerate(cosine_sim[idx]))
sim_scores = sorted(sim_scores, key=lambda x: -x[1])

print("\nTop 5 similar to Interstellar:")
for movie_idx, score in sim_scores[1:6]:   # skip [0] — it's itself
    title = df['title'].iloc[movie_idx]
    print(f"  {title:30s}: {score:.4f}")

OUTPUT

Similarity matrix shape: (4803, 4803)

Top 5 similar to Interstellar:
  The Martian                   : 0.7821
  Gravity                       : 0.7534
  Inception                     : 0.6841
  2001: A Space Odyssey         : 0.6612
  Contact                       : 0.6289

✅

These Results Make Sense!

The Martian and Gravity share "space" and "survival" with Interstellar. Inception shares "Nolan" as director. 2001: A Space Odyssey and Contact are both cerebral sci-fi space films. Our pipeline is working correctly.

Section 06

Item Similarity — Understanding the Similarity Matrix

The cosine similarity matrix is the heart of our recommendation engine. Let's visualise what it looks like and how to interpret it.

Movie ↓ / Movie →	Interstellar	Inception	The Martian	Gravity	Avengers
Interstellar	1.000	0.684	0.782	0.753	0.121
Inception	0.684	1.000	0.312	0.208	0.098
The Martian	0.782	0.312	1.000	0.801	0.054
Gravity	0.753	0.208	0.801	1.000	0.033
Avengers	0.121	0.098	0.054	0.033	1.000

ℹ️

Properties of the Similarity Matrix

Diagonal is always 1.0 — every movie is identical to itself.
Symmetric — sim(A, B) = sim(B, A). The matrix mirrors itself across the diagonal.
Values in [0, 1] — because TF-IDF vectors have no negative values; cosine stays non-negative.
Sparse clusters emerge — action movies cluster together; sci-fi movies cluster together.
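The first three properties can be verified numerically. The sketch below uses a small random non-negative matrix as a stand-in for TF-IDF rows (an assumption; any non-negative data with no all-zero rows behaves the same way) and checks each property with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for 5 movies x 8 TF-IDF terms; all entries are non-negative.
X = rng.random((5, 8))

# L2-normalise each row, so cosine similarity becomes a plain dot product.
X = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = X @ X.T

diagonal_is_one = np.allclose(np.diag(sim), 1.0)
is_symmetric = np.allclose(sim, sim.T)
in_unit_range = bool(sim.min() >= 0.0 and sim.max() <= 1.0 + 1e-12)

print("diagonal is 1.0:", diagonal_is_one)
print("symmetric      :", is_symmetric)
print("values in [0,1]:", in_unit_range)
```

Because the matrix is symmetric, only the upper triangle carries information, which is worth remembering if memory becomes a constraint.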

🌡️ Similarity Heatmap — 5 Movies (darker = more similar)

Green = 1.0 (diagonal, self-similarity). Deeper blue = higher similarity. Notice how Martian and Gravity share a strong cluster. Avengers is isolated from the space-film group.

Section 07

The Complete Movie Recommendation System — Full Project

Let's bring all five concepts together into a single end-to-end Python script. We will use the TMDB 5000 Movies dataset, which contains 4,803 movies with full metadata.

📦 Required Libraries & Dataset

Install

pip install pandas scikit-learn numpy

Dataset

Download tmdb_5000_movies.csv and tmdb_5000_credits.csv from Kaggle: kaggle datasets download -d tmdb/tmdb-movie-metadata

Key cols

title, overview, genres, keywords, cast, crew

# ═══════════════════════════════════════════════════════════
# Movie Recommendation System — Content-Based Filtering
# Dataset: TMDB 5000 Movies
# ═══════════════════════════════════════════════════════════

import pandas as pd
import numpy as np
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

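# ── 1. LOAD DATA ────────────────────────────────────────────
movies  = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

# Merge on the movie ID, not the title: a few titles appear more than once,
# so a title merge would create duplicate rows.
credits = credits.rename(columns={'movie_id': 'id'}).drop(columns='title')
df = movies.merge(credits, on='id')

# Keep only what we need
df = df[['title', 'overview', 'genres',
         'keywords', 'cast', 'crew']].copy()

# Drop movies without an overview, then renumber the rows so that index
# labels match row positions in the TF-IDF and similarity matrices.
df = df.dropna(subset=['overview']).reset_index(drop=True)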

# ── 2. PARSE JSON COLUMNS ───────────────────────────────────
for col in ['genres', 'keywords', 'cast', 'crew']:
    df[col] = df[col].apply(literal_eval)

# ── 3. FEATURE EXTRACTION ───────────────────────────────────
def collapse(name):
    # Letters and digits only: "Joseph Gordon-Levitt" -> "josephgordonlevitt"
    return ''.join(ch for ch in name.lower() if ch.isalnum())

def extract_names(lst, n=None):
    names = [collapse(x['name']) for x in lst]
    return names[:n] if n else names

def get_director(crew):
    directors = [collapse(x['name']) for x in crew if x['job'] == 'Director']
    return directors * 2   # weight director twice

df['genres_f']   = df['genres'].apply(lambda x: extract_names(x) * 2)
df['keywords_f'] = df['keywords'].apply(lambda x: extract_names(x))
df['cast_f']     = df['cast'].apply(lambda x: extract_names(x, n=3))
df['crew_f']     = df['crew'].apply(get_director)

# ── 4. BUILD FEATURE SOUP ───────────────────────────────────
def build_soup(row):
    overview_words = str(row['overview']).lower().split()
    meta = (row['genres_f'] + row['keywords_f'] +
            row['cast_f']   + row['crew_f'])
    return ' '.join(overview_words + meta)

df['soup'] = df.apply(build_soup, axis=1)

# ── 5. TF-IDF VECTORISATION ─────────────────────────────────
tfidf = TfidfVectorizer(
    stop_words='english',
    min_df=2,
    max_features=15000,
    ngram_range=(1, 2),
    sublinear_tf=True    # dampen very high term freqs
)
tfidf_matrix = tfidf.fit_transform(df['soup'])

# ── 6. COSINE SIMILARITY MATRIX ─────────────────────────────
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

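# Title-to-index lookup. Keep the first row for any repeated title so that
# indices[title] always returns a single integer.
indices = pd.Series(df.index, index=df['title'])
indices = indices[~indices.index.duplicated(keep='first')]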

# ── 7. RECOMMENDATION FUNCTION ──────────────────────────────
def recommend(title, n=10):
    """Return top-n content-based recommendations for a given movie title."""
    if title not in indices:
        return f"Movie '{title}' not found in dataset."

    idx        = indices[title]
    # Drop the query movie itself rather than assuming it sorts first.
    sim_scores = [(i, s) for i, s in enumerate(cosine_sim[idx]) if i != idx]
    sim_scores = sorted(sim_scores, key=lambda x: -x[1])[:n]

    movie_indices = [i[0] for i in sim_scores]
    scores        = [round(i[1], 4) for i in sim_scores]

    result = df['title'].iloc[movie_indices].reset_index(drop=True)
    return pd.DataFrame({'Recommended Movie': result, 'Similarity Score': scores})

# ── 8. TEST IT ──────────────────────────────────────────────
print("🎬 Recommendations for: Interstellar\n")
print(recommend('Interstellar', n=8).to_string(index=False))

OUTPUT

🎬 Recommendations for: Interstellar

    Recommended Movie  Similarity Score
          The Martian            0.7821
              Gravity            0.7534
2001: A Space Odyssey            0.7209
              Contact            0.6998
            Inception            0.6841
        Europa Report            0.6612
The Dark Knight Rises            0.6405
              Arrival            0.6288
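Sorting a full row of similarity scores works, but two details are worth a sketch: excluding the query movie explicitly (a duplicate entry with an identical soup can also score 1.0, so "skip position 0" is not always safe) and avoiding a full sort when only a few results are needed. The helper `top_n_similar` below is a hypothetical name, not part of the script above; it uses NumPy's `argpartition`.

```python
import numpy as np

def top_n_similar(sim_row, self_idx, n):
    """Return (index, score) pairs for the n highest scores, excluding self_idx."""
    scores = np.asarray(sim_row, dtype=float).copy()
    scores[self_idx] = -np.inf                 # exclude the query movie explicitly
    n = min(n, len(scores) - 1)
    if n <= 0:
        return []
    # argpartition finds the n best candidates without sorting the whole row.
    candidates = np.argpartition(-scores, n - 1)[:n]
    # Only those n candidates are then sorted by descending score.
    ordered = candidates[np.argsort(-scores[candidates])]
    return [(int(i), float(scores[i])) for i in ordered]

# Interstellar's row from the 5-movie similarity table
# (Interstellar, Inception, The Martian, Gravity, Avengers).
interstellar_row = [1.000, 0.684, 0.782, 0.753, 0.121]
top3 = top_n_similar(interstellar_row, self_idx=0, n=3)
print(top3)
```

On a catalogue of a few thousand movies the saving is small; the explicit self-exclusion is the more important part.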

Section 08

Evaluating & Tuning Your Recommender

Content-based recommenders are harder to evaluate than classifiers — there is no single "correct" answer. Here are the standard approaches:

Metric	What It Measures	When to Use	Limitation
Precision@K	Fraction of top-K recommendations the user actually liked	When you have user rating data	Needs ground-truth ratings
Recall@K	Fraction of all liked items that appear in top-K	When recall matters (e.g. library discovery)	Ignores ranking order
NDCG@K	Normalised Discounted Cumulative Gain — rewards top-ranked relevant items	Best overall ranking metric	Complex to compute
Coverage	% of catalogue that ever gets recommended	Checks for diversity	Doesn't measure quality
Intra-List Diversity	Average pairwise dissimilarity within a recommendation list	Checks for filter bubble risk	Domain-specific
A/B Test CTR	Click-through rate on recommended items in production	Gold standard — real user behaviour	Requires live product

# ── Quick Precision@K evaluation with mock ratings ──────────
def precision_at_k(recommended_titles, liked_titles, k):
    top_k = recommended_titles[:k]
    hits  = len(set(top_k) & set(liked_titles))
    return hits / k

# Example: user liked these space films
user_liked = ['Gravity', 'The Martian', 'Arrival', 'Contact']

recs = recommend('Interstellar', n=10)['Recommended Movie'].tolist()
p5  = precision_at_k(recs, user_liked, k=5)
p10 = precision_at_k(recs, user_liked, k=10)
print(f"Precision@5  : {p5:.2f}")
print(f"Precision@10 : {p10:.2f}")

OUTPUT

Precision@5  : 0.60
Precision@10 : 0.40
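Recall@K and NDCG@K from the table can be sketched in the same style. This version assumes binary relevance (a title is either liked or not), which is a simplification; graded ratings would need a gain term. The recommendation list is copied from the sample output for Interstellar, and the function names are illustrative.

```python
import math

def recall_at_k(recommended, liked, k):
    """Fraction of all liked titles that appear in the top k."""
    liked = set(liked)
    if not liked:
        return 0.0
    hits = len(set(recommended[:k]) & liked)
    return hits / len(liked)

def ndcg_at_k(recommended, liked, k):
    """Binary-relevance NDCG: hits near the top of the list count for more."""
    liked = set(liked)
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, title in enumerate(recommended[:k], start=1)
              if title in liked)
    # Ideal ordering: every liked title placed at the top of the list.
    ideal_hits = min(k, len(liked))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

recs = ['The Martian', 'Gravity', '2001: A Space Odyssey', 'Contact',
        'Inception', 'Europa Report', 'The Dark Knight Rises', 'Arrival']
user_liked = ['Gravity', 'The Martian', 'Arrival', 'Contact']

print(f"Recall@5 : {recall_at_k(recs, user_liked, k=5):.2f}")
print(f"NDCG@5   : {ndcg_at_k(recs, user_liked, k=5):.2f}")
```

Recall divides by the number of liked titles rather than by K, so it can only reach 1.0 once K is at least as large as the liked set.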

⚠️

The Filter Bubble Problem

Content-based filtering has one well-known weakness: over-specialisation. If a user watches only space films, the system only ever recommends more space films. It never discovers that the user might also love historical dramas. In production, add a small dose of serendipity: inject 10–20% diverse recommendations or blend with collaborative filtering (a hybrid system).
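One common way to add that dose of diversity is to re-rank candidates with maximal marginal relevance (MMR). This is a technique swapped in here for illustration; the tutorial itself does not prescribe it. Each pick trades relevance to the query against similarity to items already chosen. The function name, the `lam` trade-off parameter, and the lookup helper are assumptions; the scores are taken from the 5-movie similarity table.

```python
def mmr_rerank(candidates, relevance, similarity, k, lam=0.7):
    """Greedy MMR: pick k items, balancing relevance against redundancy.

    lam=1.0 ranks purely by relevance; lower values favour diversity.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(item):
            # Redundancy is the similarity to the closest item already picked.
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Similarity of each candidate to the query movie (Interstellar).
relevance = {"The Martian": 0.782, "Gravity": 0.753, "Inception": 0.684}

# Pairwise similarity between the candidates themselves.
pair_sim = {
    frozenset(["The Martian", "Gravity"]): 0.801,
    frozenset(["The Martian", "Inception"]): 0.312,
    frozenset(["Gravity", "Inception"]): 0.208,
}

def similarity(a, b):
    return pair_sim[frozenset([a, b])]

by_relevance = mmr_rerank(list(relevance), relevance, similarity, k=2, lam=1.0)
diversified = mmr_rerank(list(relevance), relevance, similarity, k=2, lam=0.5)
print("relevance only:", by_relevance)
print("with diversity:", diversified)
```

With `lam=0.5`, Gravity is passed over in favour of Inception because it is too close to The Martian, which has already been selected.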

Section 09

Golden Rules

🎬 Content-Based Filtering — Non-Negotiable Rules

Always use TF-IDF over raw word counts. Raw counts reward long descriptions unfairly. TF-IDF normalises and down-weights common words that carry no discriminating power.

Weight your features intentionally. Repeating genre tokens twice and director tokens twice is a deliberate design decision. Test different weights — sometimes the overview alone beats the metadata soup.

Always use cosine similarity, not Euclidean distance, for text vectors. TF-IDF vectors are high-dimensional and sparse, and cosine ignores vector length, so long and short documents stay comparable. Euclidean distance on unnormalised vectors does not.

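Collapse multi-word names to single tokens. "Christopher Nolan" becomes christophernolan; otherwise it splits into "christopher" and "nolan", which may match unrelated people. Strip hyphens as well as spaces, because .replace(' ', '') alone leaves "Gordon-Levitt" hyphenated and the vectoriser then splits it in two.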

Store the similarity matrix with np.save after computing it. Recomputing a 5,000 × 5,000 matrix on every request is wasted work that hurts production performance. Serialise it once; load it on startup.

Handle cold-start users with popularity-based fallback. New users have no history. Until they rate at least 3–5 items, show them the most popular movies in their stated preferred genre rather than leaving them with no recommendations.

Monitor and refresh your TF-IDF model regularly. As new movies are added, the vocabulary and IDF values shift. Re-train periodically (weekly or monthly) to keep recommendations fresh and accurate.
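The rule about storing the matrix can be sketched as follows. The file name `cosine_sim.npy`, the helper names, and the choice of float32 storage are assumptions made for illustration; a small 2 × 2 matrix stands in for the real one so the example runs anywhere.

```python
import os
import tempfile
import numpy as np

def save_similarity(sim, path):
    """Store the matrix as float32, half the size of the default float64."""
    np.save(path, sim.astype(np.float32))

def load_similarity(path):
    return np.load(path)

# Rough size of a dense 4,803 x 4,803 matrix in each precision.
n_movies = 4803
mb_float64 = n_movies * n_movies * 8 / 1e6
mb_float32 = n_movies * n_movies * 4 / 1e6
print(f"float64: {mb_float64:.1f} MB, float32: {mb_float32:.1f} MB")

# Round-trip a small stand-in matrix through a temporary file.
demo = np.array([[1.0, 0.8],
                 [0.8, 1.0]])
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cosine_sim.npy")
    save_similarity(demo, path)
    restored = load_similarity(path)

print(restored)
```

Four decimal places of similarity survive the float32 conversion comfortably. For catalogues much larger than this one, storing only each movie's top few neighbours is a common alternative to keeping the full dense matrix.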

🏆

What You Built

You now have a complete content-based movie recommendation engine from scratch: Feature Engineering → TF-IDF Vectorisation → Cosine Similarity Matrix → Item-Based Recommendations.

The same pipeline works for article recommendation, product recommendation, job matching, and any domain where items have rich text descriptions. The maths is identical — only the features change.