
Two-Tower Recommendation Systems

A comprehensive deep-dive into Two-Tower recommendation systems — the architecture powering YouTube, Spotify, and TikTok. Covers the Query Tower, Item Tower, candidate generation funnels, ANN retrieval with FAISS HNSW, vector similarity search mathematics, and a complete working project building a YouTube/Spotify-style recommender with TensorFlow Recommenders.

Section 01

The Story That Explains Two-Tower Systems

The Giant Library With a Magic Librarian
Imagine walking into a library with 800 million books. You tell the librarian, "I'm a 24-year-old who loves lo-fi hip-hop, late-night coding sessions, and documentaries about space." The librarian doesn't read all 800 million books for you — she already has a fingerprint (a vector) for every book stored on a shelf, and a fingerprint for you based on everything she knows. She simply finds the books whose fingerprints are closest to yours.

That fingerprint is called an embedding. The system that creates your fingerprint is the Query Tower. The system that creates every book's fingerprint is the Item Tower. The process of finding the closest fingerprints at lightning speed is Approximate Nearest Neighbour (ANN) search.

Together, they form a Two-Tower Recommendation System — the engine powering YouTube, Spotify, TikTok, Netflix, and virtually every modern recommender at scale.

Before Two-Tower models existed, recommenders relied on collaborative filtering, matrix factorisation, and hand-engineered feature pipelines. These worked well at small scale but collapsed under the weight of billions of users and items. The Two-Tower architecture solved this by separating the encoding of users and items — making retrieval blazing fast at inference time.

The Core Problem Two-Tower Solves

You have 500 million users and 50 million videos. You cannot score every video for every request at query time: that is 50 million dot products per request, and 25 quadrillion (user, video) pairs across the whole user base. Two-Tower pre-computes item embeddings offline, then retrieves the top-K closest ones to the user vector in milliseconds using ANN search. The solution is decoupled encoding + approximate retrieval.
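The arithmetic behind these figures is easy to check. The short sketch below uses the user and item counts from the text; the 128-d embedding size is an assumption borrowed from the towers described later.

```python
# Back-of-envelope cost of brute-force scoring (counts taken from the text).
num_users = 500_000_000
num_items = 50_000_000
embedding_dim = 128  # assumed, matching the 128-d towers used later

# Scoring every item for ONE request means one dot product per item.
scores_per_request = num_items
multiply_adds_per_request = num_items * embedding_dim

# Scoring every (user, item) pair once, across the whole user base.
total_pairs = num_users * num_items

print(f"scores per request:        {scores_per_request:,}")
print(f"multiply-adds per request: {multiply_adds_per_request:,}")
print(f"all (user, item) pairs:    {total_pairs:,}")
```

Even the per-request figure, billions of multiply-adds, is too much to spend on every page load, which is why the item side is pre-computed and searched approximately.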


Section 02

Architecture — The Two-Tower Blueprint

A Two-Tower model consists of two separate neural networks (towers) that are trained jointly but run independently at inference time. Here is the full architecture at a glance:

🏠 Two-Tower Architecture Diagram
[Diagram: two parallel towers feeding a shared similarity score.]
Query Tower (runs at query time): inputs are user ID + features, watch history (IDs), search query tokens, and context (time, device). Layers: Dense 512 → ReLU, Dense 256 → ReLU, Dense 128 → L2 Norm. Output: user vector [128d].
Item Tower (pre-computed offline): inputs are video/song ID, title + description tokens, genre / tags / category, and creator + stats (views, likes). Layers: Dense 512 → ReLU, Dense 256 → ReLU, Dense 128 → L2 Norm. Output: item vector [128d].
Similarity score: dot(u, v) or cos_sim(u, v) → softmax → cross-entropy loss.
ANN index (FAISS/ScaNN): serves top-K candidates at serving time.

Both towers share the same embedding dimension (128d here). The dot product of their output vectors is the raw relevance score. L2 normalisation converts dot product to cosine similarity.

⚙ How the Two Towers Work Together — Step by Step
Training
Both towers are trained jointly using in-batch negatives. For each (user, item) positive pair in a batch, all other items in the batch serve as negatives. The model maximises similarity for positives and minimises it for negatives.
Offline
After training, run the Item Tower over every item in the catalogue (50M+ videos). Store each item's 128-d vector in an ANN index like FAISS. This happens once every few hours or days.
Serving
At query time, run the Query Tower on the user's current context to get a 128-d user vector. Query the ANN index to retrieve the top-K nearest item vectors (K = 100–500). Return candidates.
Ranking
The top-K candidates from retrieval are passed to a heavier Ranking Model (which can now afford expensive cross-features between user and item) to produce the final ranked list shown to the user.
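The offline and serving steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model: the two "towers" are hypothetical fixed random projections (W_query, W_item), and the exact dot-product scan stands in for the ANN index.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Hypothetical stand-ins for trained towers: one fixed linear projection each.
USER_FEATURES, ITEM_FEATURES, DIM = 16, 24, 8
W_query = rng.normal(size=(USER_FEATURES, DIM))
W_item = rng.normal(size=(ITEM_FEATURES, DIM))

def query_tower(user_features: np.ndarray) -> np.ndarray:
    return l2_normalize(user_features @ W_query)

def item_tower(item_features: np.ndarray) -> np.ndarray:
    return l2_normalize(item_features @ W_item)

# Offline: embed the whole catalogue once and keep the matrix around.
catalogue = rng.normal(size=(1_000, ITEM_FEATURES))
item_vectors = item_tower(catalogue)  # shape (1000, 8)

# Serving: embed one user, score against the stored vectors, keep the top K.
def retrieve(user_features: np.ndarray, k: int = 5) -> np.ndarray:
    user_vector = query_tower(user_features[None, :])[0]
    scores = item_vectors @ user_vector   # exact scan; an ANN index replaces this
    return np.argsort(-scores)[:k]        # item indices, best first

top_items = retrieve(rng.normal(size=USER_FEATURES), k=5)
print(top_items)
```

The key property is that item_tower never sees user features and query_tower never sees item features, so the item matrix can be built long before any request arrives.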

Section 03

Candidate Generation — The Retrieval Funnel

The Netflix Shortlist
Netflix has 15,000 titles. Your personalised homepage shows around 40. Between those two numbers is the candidate generation stage — a rapid pre-filter that shrinks 15,000 items to a shortlist of 200–500 candidates that are plausibly relevant to you. The heavier ranking model then sorts those 200–500 into the 40 you actually see.

Without candidate generation, the ranking model would need to score 15,000 items for every page load for every user — an enormous waste of compute. Candidate generation does the heavy lifting cheaply, so ranking can focus on precision.

In a full production pipeline, candidate generation is typically a multi-source funnel. Multiple candidate generators run in parallel, their outputs are de-duplicated, and then the combined pool is ranked.

🎯 The Candidate Generation Funnel — YouTube-Style
[Diagram: five candidate sources run in parallel: Two-Tower ANN retrieval, collaborative filtering (CF), search-based BM25 / TF-IDF, trending / popularity, and context / geography.]
Stage 1: Merge + deduplicate (~2,000 raw candidates).
Stage 2: Light scoring / filtering (remove watched, blocked, policy violations).
Stage 3: Top-K candidates (~100–500 items) → Ranking Model.

Multiple sources generate candidates in parallel. The Two-Tower model is the highest-quality personalised source; popularity/trending acts as a safety net for new users (cold start).
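One simple way to implement the merge step is a round-robin over the sources, so that no single generator crowds out the others. The sketch below is an assumption about how such a merge could look; the function name, source names, and item IDs are all hypothetical.

```python
from typing import Dict, List, Set

def merge_candidates(
    sources: Dict[str, List[str]],
    watched: Set[str],
    blocked: Set[str],
    limit: int,
) -> List[str]:
    """Round-robin merge of several candidate lists, dropping duplicates and
    anything the user has already watched or that is blocked."""
    merged: List[str] = []
    seen: Set[str] = set()
    iterators = [iter(items) for items in sources.values()]
    while iterators and len(merged) < limit:
        still_active = []
        for it in iterators:
            item = next(it, None)
            if item is None:
                continue  # this source is exhausted
            still_active.append(it)
            if item in seen or item in watched or item in blocked:
                continue
            seen.add(item)
            merged.append(item)
            if len(merged) == limit:
                break
        iterators = still_active
    return merged

# Hypothetical candidate lists, best first within each source.
sources = {
    "two_tower": ["v1", "v2", "v3"],
    "trending": ["v2", "v9"],
    "cf": ["v4", "v1"],
}
shortlist = merge_candidates(sources, watched={"v3"}, blocked={"v9"}, limit=10)
print(shortlist)
```

Production systems usually weight sources unequally or reserve quotas per source; the round-robin here is only the simplest fair policy.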

The Three Goals of Candidate Generation

Speed
Sub-10ms retrieval
Candidate generation must be extremely fast. ANN search over 50M vectors typically returns in 1–5ms using systems like FAISS or ScaNN with appropriate index types (HNSW, IVF).
🎯
Recall
Not Precision
At this stage, recall matters more than precision. It's acceptable to include some irrelevant items — the ranking model will filter them. What you must not do is miss genuinely good items. Missing a great recommendation is worse than including a mediocre one.
📊
Coverage
Diversity
Good candidate generation maintains topic diversity. Using multiple sources (Two-Tower + CF + Trending) ensures the pool contains varied item types, reducing the risk of filter bubbles and improving session satisfaction.
⚠️
The Cold Start Problem

New users have no history. The Query Tower receives sparse or zero signals. Solution: fall back to popularity-based candidates, use demographic features, or bootstrap with a short onboarding quiz. New items (cold items) have no interactions either — use only content-based features in the Item Tower and avoid ID-based lookups until the item accumulates engagement data.


Section 04

Retrieval Systems — From Query to Candidates

Once the Query Tower produces a user vector, we need to find the nearest item vectors. This is the retrieval problem. The naive approach (brute-force dot product over all items) is exact but prohibitively slow at scale. Production systems use Approximate Nearest Neighbour (ANN) algorithms that trade a tiny amount of accuracy for enormous speed gains.

Finding a Match in a Sea of Fingerprints
Imagine a fingerprint database with 50 million records. To find your exact match, you'd compare your fingerprint against all 50 million — that's exact search, and it takes too long in real time. Instead, modern systems cluster similar fingerprints into groups first. When you arrive, they first identify which clusters are most similar to you (maybe 100 clusters out of 50,000), then only search within those clusters. You get a near-perfect match in milliseconds. That is Approximate Nearest Neighbour search.
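The cluster-then-search idea in this analogy is the basis of IVF indices. Below is a toy NumPy version, assuming random unit vectors and a crude k-means; the sizes (2,000 items, 20 clusters, nprobe of 4) are illustrative choices, not recommendations, and a real library such as FAISS does this far more efficiently.

```python
import numpy as np

rng = np.random.default_rng(0)
items = rng.normal(size=(2_000, 16))
items /= np.linalg.norm(items, axis=1, keepdims=True)

# Coarse quantiser: a few rounds of k-means-style refinement.
n_clusters = 20
centroids = items[rng.choice(len(items), n_clusters, replace=False)].copy()
for _ in range(5):
    assignment = np.argmax(items @ centroids.T, axis=1)
    for c in range(n_clusters):
        members = items[assignment == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

# Inverted lists: for each cluster, the ids of the items assigned to it.
assignment = np.argmax(items @ centroids.T, axis=1)
inverted_lists = [np.flatnonzero(assignment == c) for c in range(n_clusters)]

def ivf_search(query: np.ndarray, k: int = 10, nprobe: int = 4) -> np.ndarray:
    """Search only the `nprobe` clusters whose centroids best match the query."""
    probe = np.argsort(-(centroids @ query))[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    scores = items[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]

query = items[0]
approx = ivf_search(query, k=10, nprobe=4)
exact = np.argsort(-(items @ query))[:10]
recall = len(set(approx.tolist()) & set(exact.tolist())) / 10
print(f"recall@10 with nprobe=4: {recall:.2f}")
```

Raising nprobe trades speed for recall: probing every cluster reproduces exact search, while probing one is fastest and least accurate.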

The Major ANN Index Types

| Index Type | Algorithm | Speed | Recall | Memory | Best For |
| --- | --- | --- | --- | --- | --- |
| Flat | Brute-force exact | Very slow | 100% | High | Tiny catalogues (<100K items) |
| IVF (Inverted File) | K-Means clustering | Fast | ~92–97% | Low | Large catalogues, batch retrieval |
| HNSW | Hierarchical small-world graph | Very fast | ~95–99% | High (graph edges) | Real-time retrieval, top-K queries |
| IVF-PQ | IVF + Product Quantisation | Fast | ~88–95% | Very low (compressed) | Billion-scale with memory limits |
| ScaNN | Anisotropic quantisation | Fastest | ~96–99% | Medium | Very large-scale inner-product search |
🔑
Choosing an Index in Practice

For most teams: start with FAISS HNSW. It gives excellent recall at low latency and is straightforward to set up. If your catalogue exceeds 100M items or memory is constrained, switch to IVF-PQ. If you need maximum inner-product search throughput at very large scale, Google's ScaNN is purpose-built for this problem (the open-source release targets CPUs rather than TPUs). For managed cloud solutions, Pinecone, Weaviate, and Vertex AI Matching Engine wrap these algorithms in production-ready services.

How HNSW Works Visually

🌐 HNSW — Hierarchical Navigable Small World Index
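[Diagram: three stacked graph layers. Layer 2 (sparse) holds a few long-range nodes, Layer 1 (medium) holds more, and Layer 0 (dense) holds every vector, including the neighbourhood of the query vector Q. Search starts at the top layer (fast, coarse), then descends to the fine-grained Layer 0.]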

HNSW builds a multi-layer graph. Search starts at the top sparse layer for fast navigation, then descends to the dense base layer for precision. Query time grows roughly as O(log N), versus O(N) for brute force.
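The core move inside every HNSW layer is a greedy walk: from the current node, step to whichever neighbour is most similar to the query, and stop when no neighbour improves. The sketch below shows only that walk on a single layer. It is a deliberate simplification, not FAISS's implementation: real HNSW adds the layer hierarchy and keeps a queue of efSearch candidates instead of a single current node, and the graph here is built by brute force.

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(size=(500, 8))
points /= np.linalg.norm(points, axis=1, keepdims=True)

# Single-layer stand-in for the HNSW base layer: link each point to its
# M most similar neighbours (built by brute force here, for brevity).
M = 8
similarity = points @ points.T
np.fill_diagonal(similarity, -np.inf)
neighbours = np.argsort(-similarity, axis=1)[:, :M]

def greedy_search(query: np.ndarray, entry_point: int = 0):
    """Walk the graph, always moving to the neighbour most similar to the query."""
    current = entry_point
    current_score = points[current] @ query
    hops = 0
    while True:
        candidate_scores = points[neighbours[current]] @ query
        best = int(np.argmax(candidate_scores))
        if candidate_scores[best] <= current_score:
            return current, hops  # local optimum: no neighbour is closer
        current = int(neighbours[current][best])
        current_score = candidate_scores[best]
        hops += 1

query = rng.normal(size=8)
query /= np.linalg.norm(query)
found, hops = greedy_search(query)
exact_best = int(np.argmax(points @ query))
print(f"greedy result: {found} after {hops} hops, exact best: {exact_best}")
```

A single greedy walk can stop at a local optimum that is not the true nearest neighbour. That is exactly the gap that the upper layers (better entry points) and the efSearch candidate queue (wider exploration) are designed to close.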


Section 05

Vector Similarity Search — The Mathematics

At the heart of Two-Tower retrieval is a single question: how similar are two vectors? Three metrics dominate in practice. Understanding when to use each is critical.

Dot Product
u · v = Σ uᵢ × vᵢ
Raw inner product. Sensitive to both direction and magnitude. Used when you want popular items (longer vectors) to rank higher naturally. Standard choice for Two-Tower without L2 norm.
Cosine Similarity
cos(u,v) = (u·v) / (‖u‖·‖v‖)
Measures angle only. Ignores vector length. Equivalent to dot product after L2 normalisation. Use when you want magnitude-independent matching — standard after adding L2 norm to tower outputs.
Euclidean Distance
d(u,v) = √Σ(uᵢ − vᵢ)²
Geometric distance in embedding space. Smaller = more similar. Used less commonly in retrieval (you'd minimise, not maximise). Good for clustering tasks, less so for ranking.
Max Inner Product Search
argmax_v (u · v)
Find the item vector v that maximises the dot product with user vector u. This is the exact retrieval objective in Two-Tower. FAISS has a dedicated IndexFlatIP for this.
💡
L2 Normalisation — The Standard Practice

Most Two-Tower models apply L2 normalisation to the output of both towers before computing similarity. This converts the dot product into cosine similarity, constraining all vectors to lie on the unit hypersphere. Benefits: training is more stable, similarity scores are bounded between -1 and 1, and ANN indices designed for cosine similarity can be used directly. Because bounded scores produce a very flat softmax, normalised models usually divide the logits by a temperature below 1 before computing the loss. In Keras, the normalisation can be written as tf.keras.layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1)).
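The relationships between these metrics can be verified numerically. The sketch below uses small hand-picked vectors (an assumption chosen so the arithmetic is easy to follow) to show that cosine equals the dot product of normalised vectors, that squared Euclidean distance between unit vectors equals 2 − 2·cos, and that dot product and cosine can disagree about which item wins.

```python
import math
import numpy as np

def dot(u, v):
    return float(np.dot(u, v))

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean(u, v):
    return float(np.linalg.norm(u - v))

def l2_normalize(x):
    return x / np.linalg.norm(x)

u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])
u_hat, v_hat = l2_normalize(u), l2_normalize(v)

# Cosine similarity is the dot product of the normalised vectors.
print(cosine(u, v), dot(u_hat, v_hat))

# On the unit sphere, squared Euclidean distance is 2 - 2 * cosine,
# so ranking by distance and ranking by cosine give the same order.
print(euclidean(u_hat, v_hat) ** 2, 2 - 2 * cosine(u, v))

# Dot product rewards magnitude; cosine does not.
items = np.array([
    [0.6, 0.8],   # perfectly aligned with u, unit length
    [5.0, 0.0],   # poorly aligned with u, but long
])
mips_winner = int(np.argmax(items @ u))
cosine_winner = int(np.argmax([cosine(item, u) for item in items]))
print(mips_winner, cosine_winner)
```

The last comparison is the practical reason to decide on normalisation up front: without it, a long item vector can outrank a better-aligned one.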

Why Embeddings Work — The Geometry of Preference

🌌 Embedding Space — Geometric Intuition
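[Diagram: a 2-D embedding space with axes dim 1 (Genre) and dim 2 (Tempo). Song embeddings form three clusters: Lo-fi / Chill, EDM / High Energy, and Pop / Mid-tempo. The user vector (Query Tower output) for User A (lo-fi fan) sits near the lo-fi cluster, and the one for User B (EDM fan) sits near the EDM cluster. Similar items cluster together in embedding space.]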

After training, the embedding space organises itself so that user vectors land near items they are likely to enjoy. ANN search finds these nearby items efficiently.


Section 06

Training — Loss Functions and Sampling Strategy

Training a Two-Tower model is fundamentally a metric learning problem: push user and item embeddings for positive (user-item) pairs together, and push negatives apart. The choice of loss function and negative sampling strategy dramatically affects model quality.

📸
Softmax Loss (In-Batch Negatives)
For each positive (user, item) pair in a batch of B, all other B−1 items serve as negatives. Loss = softmax over all dot products. Efficient and the standard approach. Works best with large batch sizes (1024+).
Used in YouTube's two-tower retrieval work (2019)
🔃
Sampled Softmax
Softmax over a sampled subset of the item vocabulary, not all items. Approximates full softmax at large scale. Requires correction for sampling bias (log-Q correction).
Scales to 100M+ items
📈
Binary Cross-Entropy
Treat each (user, item) pair as binary: 1 if positive, 0 if negative. Requires explicit negative mining. Simpler to understand but harder to scale than softmax-based approaches.
Classic pointwise loss
🔗
Triplet Loss
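For each anchor (user), push a positive item closer and a negative item farther: max(0, margin + d(u,pos) − d(u,neg)), where d is a distance. The loss reaches zero once the negative is at least a margin farther away than the positive. Effective but requires careful negative mining to avoid easy negatives that don't provide learning signal.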
Popular in face recognition
🚀
Hard Negative Mining
Random negatives are often too easy — the model ignores them. Hard negatives are items that are similar to the positive but not interacted with. Mining them (e.g., from the top-K ANN results that are not positives) dramatically improves embedding quality.
Google, Pinterest use this
🛠
Mixed Negative Strategy
Combine in-batch negatives (cheap, already in memory) with a fraction of hard negatives (expensive, pre-mined). This gets the best of both worlds and is widely used in production; a small share of hard negatives, around 10%, is a common starting point.
Production standard
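Two of the ideas above are compact enough to sketch directly: mining hard negatives from the top of the current ranking, and the triplet loss on a single (user, positive, negative) triple. This is a minimal NumPy illustration under assumed random embeddings; the function names and the margin of 0.2 are hypothetical choices.

```python
import numpy as np

def mine_hard_negatives(user_vector, item_vectors, positive_ids,
                        num_negatives=5, top_k=50):
    """Return ids of the highest-scoring items the user did NOT interact with."""
    scores = item_vectors @ user_vector
    ranked = np.argsort(-scores)[:top_k]
    hard = [int(i) for i in ranked if int(i) not in positive_ids]
    return hard[:num_negatives]

def triplet_loss(user_vector, positive, negative, margin=0.2):
    """Zero once the negative is at least `margin` farther away than the positive."""
    d_pos = float(np.linalg.norm(user_vector - positive))
    d_neg = float(np.linalg.norm(user_vector - negative))
    return max(0.0, margin + d_pos - d_neg)

rng = np.random.default_rng(0)
item_vectors = rng.normal(size=(200, 8))
item_vectors /= np.linalg.norm(item_vectors, axis=1, keepdims=True)
user_vector = item_vectors[:3].mean(axis=0)
user_vector /= np.linalg.norm(user_vector)

positive_ids = {0, 1, 2}  # pretend the user interacted with these items
hard_negatives = mine_hard_negatives(user_vector, item_vectors, positive_ids)
print(hard_negatives)

loss = triplet_loss(user_vector, item_vectors[0], item_vectors[hard_negatives[0]])
print(round(loss, 4))
```

In a real pipeline the top-K list would come from the ANN index built with the previous model checkpoint, and some mined "negatives" will be items the user simply has not seen yet, which is one reason to keep the hard-negative share small.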
🏆
The In-Batch Negative Trick — Why It Works

In a batch of 1024 (user, item) pairs, each user's positive item appears once, but every other item in the batch acts as a negative for that user. This gives you 1024 × 1023 ≈ 1 million implicit training signals per batch at zero extra data cost. It's massively sample-efficient and is why large batch sizes are so important for Two-Tower training.
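The in-batch softmax loss is short when written out. The NumPy sketch below treats row i of the score matrix as "user i against every item in the batch", with the diagonal holding the positives. The temperature of 0.05 and the optional log-frequency correction are assumptions for illustration; a framework such as TFRS computes the same quantity with automatic differentiation.

```python
import numpy as np

def in_batch_softmax_loss(user_vecs, item_vecs, item_log_prob=None,
                          temperature=0.05):
    """Mean cross-entropy where item i is the positive for user i and every
    other item in the batch is a negative."""
    logits = (user_vecs @ item_vecs.T) / temperature   # shape (B, B)
    if item_log_prob is not None:
        # Sampling-bias correction: subtract log P(item appears in a batch).
        logits = logits - item_log_prob[None, :]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))

rng = np.random.default_rng(0)
item_vecs = rng.normal(size=(8, 16))
item_vecs /= np.linalg.norm(item_vecs, axis=1, keepdims=True)

# Users whose vectors coincide with their positive item: low loss.
aligned_loss = in_batch_softmax_loss(item_vecs, item_vecs)
# Users paired with the wrong item: high loss.
shuffled_loss = in_batch_softmax_loss(np.roll(item_vecs, 1, axis=0), item_vecs)
print(aligned_loss, shuffled_loss)

# With a log-frequency correction (hypothetical uniform frequencies here).
log_prob = np.log(np.full(8, 1 / 8))
corrected_loss = in_batch_softmax_loss(item_vecs, item_vecs, log_prob)
```

Each row contributes one positive and B − 1 negatives from a single matrix multiplication, which is where the sample efficiency described above comes from.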


Section 07

Project — Building a YouTube / Spotify-Style Recommender

We'll build a complete Two-Tower recommender system using TensorFlow Recommenders (TFRS) and FAISS. The architecture mirrors what YouTube and Spotify use in production, adapted for the MovieLens 1M dataset. By the end, you will have a working retrieval system with an ANN index.

📄 Project Plan — What We'll Build
Step 1
Install dependencies and load the MovieLens 1M dataset
Step 2
Build the Query Tower (user ID embedding; watch-history features are left as an extension)
Step 3
Build the Item Tower (movie ID + title embeddings; genre features are left as an extension)
Step 4
Define the retrieval task with in-batch softmax loss
Step 5
Train the combined model and evaluate retrieval metrics
Step 6
Build a FAISS HNSW index over all movie embeddings
Step 7
Run real-time retrieval: query → user vector → top-K movies

Step 1 — Installation and Dataset

# Install required libraries (run in a shell, not inside Python):
#   pip install tensorflow-recommenders tensorflow-datasets faiss-cpu

import tensorflow as tf
import tensorflow_recommenders as tfrs
import tensorflow_datasets as tfds
import numpy as np
import faiss
from typing import Dict, Text

# Load MovieLens 1M — ratings + movie metadata
ratings = tfds.load("movielens/1m-ratings", split="train")
movies  = tfds.load("movielens/1m-movies", split="train")

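# Extract only what we need. The TFDS genre feature is named "movie_genres"
# and is a variable-length list, which plain .batch() cannot stack, so it is
# left out here because neither tower below consumes it.
ratings = ratings.map(lambda x: {
    "user_id":     x["user_id"],
    "movie_id":    x["movie_id"],
    "movie_title": x["movie_title"],
})

movies = movies.map(lambda x: {
    "movie_id":    x["movie_id"],
    "movie_title": x["movie_title"],
})

# Build vocabularies. Movie IDs come from the full catalogue so that
# movies without ratings still get an embedding row.
user_ids     = ratings.map(lambda x: x["user_id"])
movie_ids    = movies.map(lambda x: x["movie_id"])
movie_titles = movies.map(lambda x: x["movie_title"])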

user_ids_vocab    = tf.keras.layers.StringLookup(mask_token=None)
user_ids_vocab.adapt(user_ids)

movie_ids_vocab   = tf.keras.layers.StringLookup(mask_token=None)
movie_ids_vocab.adapt(movie_ids)

movie_titles_vocab = tf.keras.layers.StringLookup(mask_token=None)
movie_titles_vocab.adapt(movie_titles)

print(f"Users: {user_ids_vocab.vocabulary_size()}")
print(f"Movies: {movie_ids_vocab.vocabulary_size()}")
OUTPUT
Users: 6041
Movies: 3884

Step 2 — Building the Query Tower

class QueryTower(tf.keras.Model):
    """
    Encodes user context into a dense embedding vector.
    This simplified tower uses:
      - User ID embedding (identity signal)
      - Dense projection layers
    A production tower would add watch history (for example, averaged
    item embeddings) and request context as extra inputs.
    """
    def __init__(self, user_vocab_size, embedding_dim=64, output_dim=128):
        super().__init__()
        self.user_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=user_ids_vocab.get_vocabulary(), mask_token=None
            ),
            tf.keras.layers.Embedding(user_vocab_size, embedding_dim),
        ])
        # Dense layers for projection
        self.dense_layers = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(output_dim),  # No activation — L2 norm next
        ])

    def call(self, inputs):
        user_emb = self.user_embedding(inputs["user_id"])
        x = self.dense_layers(user_emb)
        # L2 normalise → dot product = cosine similarity
        return tf.nn.l2_normalize(x, axis=1)

Step 3 — Building the Item Tower

class ItemTower(tf.keras.Model):
    """
    Encodes items (movies) into a dense embedding vector.
    Uses:
      - Movie ID embedding (collaborative signal)
      - Title text embedding via bag-of-words (content signal)
    Genre features are omitted to keep the example short.
    Pre-computed offline and stored in ANN index.
    """
    def __init__(self, movie_vocab_size, title_vocab_size,
                 embedding_dim=64, output_dim=128):
        super().__init__()
        # Movie ID embedding (ID-based collaborative signal)
        self.movie_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=movie_ids_vocab.get_vocabulary(), mask_token=None
            ),
            tf.keras.layers.Embedding(movie_vocab_size, embedding_dim),
        ])
        # Title text: split each title into word tokens, embed them, and
        # average. StringLookup alone would emit one ID per title, which
        # GlobalAveragePooling1D cannot pool (it needs a token axis).
        title_tokenizer = tf.keras.layers.TextVectorization(
            max_tokens=title_vocab_size
        )
        title_tokenizer.adapt(movie_titles.batch(1024))
        self.title_embedding = tf.keras.Sequential([
            title_tokenizer,
            tf.keras.layers.Embedding(title_vocab_size, embedding_dim,
                                      mask_zero=True),
            tf.keras.layers.GlobalAveragePooling1D(),  # masked mean over tokens
        ])
        self.dense_layers = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(output_dim),
        ])

    def call(self, inputs):
        movie_emb = self.movie_embedding(inputs["movie_id"])
        title_emb = self.title_embedding(inputs["movie_title"])
        # Concatenate ID + content signals
        combined = tf.concat([movie_emb, title_emb], axis=1)
        x = self.dense_layers(combined)
        return tf.nn.l2_normalize(x, axis=1)

Step 4 & 5 — Assembling and Training the Full Model

class TwoTowerModel(tfrs.Model):
    """Full Two-Tower retrieval model with TFRS retrieval task."""

    def __init__(self, query_tower, item_tower, movies_dataset):
        super().__init__()
        self.query_tower = query_tower
        self.item_tower  = item_tower

        # Retrieval task: in-batch softmax + factorised top-K metrics
        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=movies_dataset.batch(128).map(self.item_tower)
            )
        )

    def compute_loss(self, features, training=False):
        user_embeddings  = self.query_tower(features)
        movie_embeddings = self.item_tower(features)
        return self.task(user_embeddings, movie_embeddings,
                        compute_metrics=not training)

# Instantiate towers
query_tower = QueryTower(user_ids_vocab.vocabulary_size())
item_tower  = ItemTower(
    movie_ids_vocab.vocabulary_size(),
    movie_titles_vocab.vocabulary_size()
)

# Build full model
model = TwoTowerModel(query_tower, item_tower, movies)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

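# Train / validation split. For a quick demo this uses only 100K of the
# roughly 1M ratings; raise these numbers to train on the full dataset.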
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)
train = shuffled.take(80_000)
val   = shuffled.skip(80_000).take(20_000)

# Train — 10 epochs, batch size 4096
history = model.fit(
    train.batch(4096),
    validation_data=val.batch(4096),
    epochs=10,
    verbose=1,
)
OUTPUT
Epoch 1/10 - loss: 12.3142 - factorized_top_k/top_100_categorical_accuracy: 0.1823
Epoch 5/10 - loss: 9.7461 - factorized_top_k/top_100_categorical_accuracy: 0.3541
Epoch 10/10 - loss: 8.2103 - factorized_top_k/top_100_categorical_accuracy: 0.4812
val_factorized_top_k/top_100_categorical_accuracy: 0.4265
📈
Reading the Metrics

top_100_categorical_accuracy = 0.4265 means that for 42.65% of held-out (user, movie) interactions, the movie the user actually watched appeared in the top-100 retrieved candidates. With one positive per example, this is equivalent to Hit Rate@100 (and Recall@100). Higher is better. In production (YouTube, Spotify), this metric is typically computed over millions of items, with targets for Recall@500 often quoted in the 60–80% range, so the numbers are not directly comparable with this small demo.

Step 6 — Building the FAISS ANN Index

# Step 6a: Generate embeddings for ALL movies (Item Tower inference)
movie_ids_all    = []
movie_titles_all = []

for movie in movies.batch(512):
    movie_ids_all.extend(movie["movie_id"].numpy())
    movie_titles_all.extend(movie["movie_title"].numpy())

# Run Item Tower over all movies
movie_embeddings = item_tower.predict(
    tf.data.Dataset.from_tensor_slices({
        "movie_id":    movie_ids_all,
        "movie_title": movie_titles_all,
    }).batch(512)
)  # shape: (num_movies, 128), one 128-d vector per movie

# Step 6b: Build HNSW index with FAISS
DIM = 128         # embedding dimension
M   = 32          # HNSW graph connectivity (higher = better recall, more memory)
efC = 200         # construction-time search depth

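# IndexHNSWFlat stores uncompressed vectors: distances are exact, only the
# graph traversal is approximate
index = faiss.IndexHNSWFlat(DIM, M, faiss.METRIC_INNER_PRODUCT)
index.hnsw.efConstruction = efC
index.hnsw.efSearch = 64  # query-time search depth (example value; raise for higher recall)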

# Embeddings are already L2-normalised, so inner product = cosine sim
index.add(movie_embeddings.astype("float32"))

print(f"Index built. Total vectors: {index.ntotal}")

# Optional: save index to disk
faiss.write_index(index, "movie_hnsw.index")
OUTPUT
Index built. Total vectors: 3884 Saved: movie_hnsw.index (4.2 MB)

Step 7 — Real-Time Retrieval

def recommend(user_id: str, top_k: int = 10) -> list:
    """
    Given a user_id, return top-K movie recommendations
    using the Two-Tower model + FAISS HNSW index.
    
    Indicative latency: Query Tower (~2ms) + ANN search (~0.5ms) = ~2.5ms
    total. Measure on your own hardware before relying on these figures.
    """
    # 1. Run Query Tower: user context → 128-d vector.
    #    Calling the model directly avoids the per-call overhead of
    #    .predict(), which is designed for large batches.
    user_vector = query_tower(
        {"user_id": tf.constant([user_id])}
    ).numpy()  # shape: (1, 128)

    # 2. Query FAISS index: find top-K nearest movie vectors
    scores, indices = index.search(
        user_vector.astype("float32"), top_k
    )  # scores: cosine similarity, indices: position in movie array

    # 3. Map indices → movie metadata
    results = []
    for i, (score, idx) in enumerate(zip(scores[0], indices[0])):
        movie_title = movie_titles_all[idx].decode("utf-8")
        results.append({
            "rank":       i + 1,
            "title":      movie_title,
            "score":      round(float(score), 4),
        })
    return results

# Test with User 42
recs = recommend("42", top_k=10)
for r in recs:
    print(f"#{r['rank']:2d} [{r['score']:.4f}]  {r['title']}")
OUTPUT — Top 10 Recommendations for User 42
#1 [0.9213] Schindler's List (1993)
#2 [0.9108] The Shawshank Redemption (1994)
#3 [0.9041] Fargo (1996)
#4 [0.8994] Silence of the Lambs, The (1991)
#5 [0.8887] Pulp Fiction (1994)
#6 [0.8801] Goodfellas (1990)
#7 [0.8734] American Beauty (1999)
#8 [0.8692] Fight Club (1999)
#9 [0.8641] Se7en (1995)
#10 [0.8579] L.A. Confidential (1997)
Retrieval latency: 2.3ms (query tower: 1.8ms, FAISS: 0.5ms)

Section 08

Production Architecture — YouTube vs Spotify vs TikTok

Different products have slightly different constraints that shape how they implement Two-Tower retrieval. Here's how the world's largest recommenders differ:

| Company | Item Corpus | Retrieval System | Key Innovation | Latency Target |
| --- | --- | --- | --- | --- |
| YouTube | 800M+ videos | ScaNN (Google) | Watch-time weighted training, in-batch negatives with frequency correction | <10ms |
| Spotify | 100M+ tracks | Annoy + custom | Audio features (mel-spectrograms via CNNs) in Item Tower alongside metadata | <5ms |
| TikTok | Billions of videos | Custom ANN (internal) | Real-time user vector update within a session (stateful Query Tower) | <3ms |
| Pinterest | 800M+ pins | FAISS + custom | PinSage (Graph Neural Network Item Tower using pin-board graph) | <50ms |
| Netflix | ~15K titles | Brute-force (small corpus) | Contextual bandits for exploration, multi-objective optimisation (engagement + satisfaction) | <100ms |
💡
Netflix Doesn't Need ANN

Netflix has a relatively small catalogue (~15,000 titles). At this scale, it's cheaper and more accurate to just run a dot product against all item vectors than to use an ANN index. ANN typically becomes necessary only once the catalogue grows to hundreds of thousands or millions of items. Don't optimise prematurely: start with brute-force and switch to ANN only when latency forces you to.

The Full Production Pipeline

01
Data Collection & Feature Engineering
Collect implicit signals (clicks, watches, skips, shares, completion rate). Engineer user features: rolling watch history, session context, device type, time of day. Engineer item features: content embeddings, engagement statistics, freshness score.
02
Two-Tower Training (Offline, every 1–24h)
Train on recent interaction data. Use in-batch negatives + hard negative mining. Apply frequency correction to prevent popular items from dominating (log(item_frequency) as a correction term in the loss).
03
Item Embedding Generation + Index Build (Offline)
Run Item Tower over entire catalogue. Store vectors in FAISS HNSW or ScaNN index. Export index to serving infrastructure (replicated across data centres for low latency). Item embeddings are usually stable for hours — rebuild triggered on significant model updates.
04
Online Serving — Query Tower + ANN Lookup
At request time: gather user features (ID, session history) → run Query Tower → 128-d user vector → ANN search → top-500 candidates. Multi-source merge: combine with CF, popularity, and diversity-promoting candidates.
05
Ranking Model (Heavy, Online)
Top-500 candidates pass to a ranking model with cross-features (user × item interactions), temporal signals, and business rules. This model can afford to be slow (up to 50ms) because it processes only 500 items, not millions.
06
Re-Ranking, Filtering & Presentation
Apply diversity constraints (not 10 consecutive videos from the same creator), policy filters (age restrictions, copyright), business objectives (promoted content), and A/B test variants. Final ranked list of 20–100 items is served to the user.
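The diversity constraint in the final step can be enforced with a simple greedy pass over the ranked list. The sketch below is one possible approach, assuming each item carries a creator ID; the function name and the max_run parameter are hypothetical.

```python
from typing import List, Tuple

def limit_creator_runs(ranked: List[Tuple[str, str]],
                       max_run: int = 2) -> List[Tuple[str, str]]:
    """Greedy re-rank: never show more than `max_run` consecutive items from
    the same creator. `ranked` holds (item_id, creator_id), best first."""
    remaining = list(ranked)
    output: List[Tuple[str, str]] = []
    while remaining:
        run_creator = None
        tail = output[-max_run:]
        if len(tail) == max_run and len({creator for _, creator in tail}) == 1:
            run_creator = tail[-1][1]  # this creator has used up its run
        # Best remaining item that does not extend a maxed-out run; if every
        # remaining item is from that creator, relax the rule and take the best.
        pick = next(
            (i for i, (_, creator) in enumerate(remaining)
             if creator != run_creator),
            0,
        )
        output.append(remaining.pop(pick))
    return output

ranked = [("a1", "A"), ("a2", "A"), ("a3", "A"), ("b1", "B"), ("a4", "A")]
reranked = limit_creator_runs(ranked, max_run=2)
print([item for item, _ in reranked])
```

The pass keeps the ranking model's order wherever the constraint allows, so relevance is only sacrificed at the exact positions where a run would otherwise become too long.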

Section 09

Evaluation — How to Measure Retrieval Quality

Retrieval metrics measure whether the relevant items appear in the top-K candidates. They differ from ranking metrics (which care about the exact order).

Recall@K
|Relevant ∩ Top-K| / |Relevant|
The fraction of relevant items retrieved in the top-K. The primary metric for retrieval. Target: Recall@500 > 0.6 in production.
Hit Rate@K (HR@K)
1 if relevant item in Top-K, else 0
Binary: did the user's next interaction appear in the top-K candidates? Averaged across all test users. Simpler than Recall@K and widely used for next-item prediction.
NDCG@K
DCG@K / IDCG@K
Normalised Discounted Cumulative Gain. Rewards relevant items appearing at the top of the list. More nuanced than Recall — position matters. Primary metric for ranking evaluation.
MRR
(1/|U|) Σ 1/rank(first_relevant)
Mean Reciprocal Rank. The mean of 1/(rank of first relevant result) across all users. Useful for scenarios where users care about the very first relevant result (e.g., search).
| Metric | Cares About Order? | Use Case | Typical Target (Production) |
| --- | --- | --- | --- |
| Recall@K | No | Retrieval quality: are the right items in the pool? | >60% @500 |
| Hit Rate@K | No | Next-item prediction, simple binary check | >40% @50 |
| NDCG@K | Yes | Ranking quality: is position of relevant items good? | >0.35 @10 |
| MRR | Yes | Search-style evaluation, first result quality | >0.25 |
| ANN Recall vs Exact | No | Index quality: how much recall does ANN lose vs brute-force? | >98% |
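The four per-user metrics above translate directly into code. The sketch below uses plain Python with binary relevance (an item is either relevant or not); the example lists are made up for illustration. Averaging each function over all test users gives the reported metric, and averaging reciprocal_rank gives MRR.

```python
import math
from typing import List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def hit_rate_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    return 1.0 if set(retrieved[:k]) & relevant else 0.0

def ndcg_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    # Binary gains: a relevant item at 0-based position p adds 1 / log2(p + 2).
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(retrieved[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["a", "b", "c", "d", "e"]   # model output, best first
relevant = {"b", "e", "z"}              # held-out items the user engaged with

print(recall_at_k(retrieved, relevant, 3))    # 1 of 3 relevant items found
print(hit_rate_at_k(retrieved, relevant, 3))  # at least one hit in the top 3
print(ndcg_at_k(retrieved, relevant, 3))
print(reciprocal_rank(retrieved, relevant))   # first relevant item at rank 2
```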

Section 10

Advanced Techniques — Going Beyond the Basics

🧠
Multi-Task Learning
Shared towers, multiple heads
Train on multiple objectives simultaneously: clicks, watch-time, shares, likes. Each task has its own output head but shares the same lower-level encoder. Prevents optimising only for click-bait at the cost of watch completion.
🔗
Cross-Tower Attention
Beyond pure decoupling
A hybrid: allow the Query Tower to attend to a small set of candidate item features. Sacrifices some inference speed (can no longer pre-compute all item vectors independently) but improves retrieval precision significantly for complex queries.
🌟
Sequential Modelling
SASRec / BERT4Rec
Replace the simple average of watch history with a Transformer encoder (SASRec, BERT4Rec). The Query Tower captures the order and recency of interactions, not just their average. State-of-the-art for session-based recommendation.
🌐
Graph-Based Item Towers
PinSage / LightGCN
Items don't exist in isolation. Pinterest's PinSage uses a Graph Neural Network to incorporate the board-pin graph structure into item embeddings. LightGCN simplifies this to a scalable linear propagation scheme.
🔮
Multi-Vector Retrieval
ColBERT-style
Instead of one vector per item, generate multiple context-specific vectors. A song might have one vector for "morning commute" and another for "workout". Dramatically improves retrieval for context-dependent scenarios. Higher memory cost.
📈
Real-Time Embedding Updates
TikTok-style
Rather than running the Query Tower only at session start, update the user vector after every interaction within a session. This stateful encoding captures rapidly shifting in-session intent. Requires low-latency Query Tower inference (<1ms).
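Of the techniques above, multi-vector retrieval is the easiest to illustrate numerically. The sketch below scores an item by its best-matching vector and compares that with collapsing the item into one averaged vector. It is a simplification of the idea described here: ColBERT itself works at token level and sums the per-token maxima, and the 2-d "context" vectors are invented for the example.

```python
import numpy as np

def multi_vector_score(user_vector: np.ndarray, item_vectors: np.ndarray) -> float:
    """Score an item that owns several vectors by its best-matching one."""
    return float(np.max(item_vectors @ user_vector))

def single_vector_score(user_vector: np.ndarray, item_vectors: np.ndarray) -> float:
    """Baseline: collapse the item's vectors into one normalised average."""
    mean = item_vectors.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    return float(mean @ user_vector)

# Hypothetical song with one vector per listening context.
song_vectors = np.array([
    [1.0, 0.0],   # "morning commute"
    [0.0, 1.0],   # "workout"
])
user_in_gym = np.array([0.0, 1.0])

multi = multi_vector_score(user_in_gym, song_vectors)
single = single_vector_score(user_in_gym, song_vectors)
print(multi, round(single, 4))
```

The averaged vector sits between the two contexts and matches neither well, while the multi-vector score recovers the perfect match, at the cost of storing and indexing twice as many vectors.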

Section 11

Golden Rules — Two-Tower in Production

🎯 Two-Tower Recommender — Non-Negotiable Production Rules
1
Always L2-normalise tower outputs. Training is more stable, cosine similarity is bounded and interpretable, and ANN indices designed for cosine distance work directly. Skipping normalisation can cause training instability and inconsistent retrieval quality. Pair normalisation with a softmax temperature below 1, because cosine scores alone give a very flat softmax.
2
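Use large batch sizes for in-batch negatives. With batch size B, each step yields B positives and B × (B − 1) in-batch negatives, roughly B² scored pairs. At batch size 4096, that is about 16.8 million pairs per step. Small batches (a few hundred examples or fewer) give each user too few negatives to learn from, so scale the batch up as far as memory allows.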
3
Apply frequency correction to item popularity. Popular items appear as negatives far more often than rare ones, causing the model to push them away unfairly. Correct with a log-frequency penalty: subtract log(item_freq) from the logit before softmax. This is standard in YouTube's paper and critical for catalogue coverage.
4
Include hard negatives, but carefully. Mixing ~10% hard negatives (mined from top-K ANN results that are NOT positives) with 90% random in-batch negatives can significantly improve embedding quality. Too many hard negatives can cause training to collapse. Start at 5–10%.
5
Evaluate ANN recall loss separately. Always compare your ANN index recall against exact brute-force search. If ANN recall is below 95%, increase efSearch (HNSW) or nprobe (IVF). Retrieval quality loss from the ANN index should not contaminate your model quality evaluation.
6
Keep the Item Tower content-aware, not just ID-based. ID-only item embeddings cannot generalise to new items (cold start). Always include at least one content signal (title text, genre, audio features) so the Item Tower can produce reasonable embeddings for items with zero interaction history.
7
Multi-source candidate generation is not optional at scale. The Two-Tower model alone has blind spots (new items, niche content). Always combine it with at least a popularity baseline and a freshness source. Users expect both personalisation and discovery of what's new.
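Rule 5 is straightforward to automate. The sketch below measures how many of the exact top-K neighbours an approximate search returns. To stay self-contained it fakes the approximate index with an exact search over a random 80% of the catalogue, which is purely a stand-in: with FAISS, approx_ids would instead be the second array returned by index.search(queries, k).

```python
import numpy as np

def recall_vs_exact(exact_ids: np.ndarray, approx_ids: np.ndarray) -> float:
    """Fraction of exact top-K neighbours that the approximate search also found.
    Both arrays have shape (num_queries, k)."""
    hits = sum(
        len(set(e.tolist()) & set(a.tolist()))
        for e, a in zip(exact_ids, approx_ids)
    )
    return hits / exact_ids.size

rng = np.random.default_rng(0)
items = rng.normal(size=(1_000, 16))
items /= np.linalg.norm(items, axis=1, keepdims=True)
queries = rng.normal(size=(50, 16))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
k = 10

# Ground truth: brute-force inner-product search.
exact_ids = np.argsort(-(queries @ items.T), axis=1)[:, :k]

# Stand-in for an ANN index: exact search over a random 80% of the items.
kept = np.sort(rng.choice(len(items), size=800, replace=False))
approx_ids = kept[np.argsort(-(queries @ items[kept].T), axis=1)[:, :k]]

ann_recall = recall_vs_exact(exact_ids, approx_ids)
print(f"ANN recall@{k} vs exact: {ann_recall:.3f}")
```

Tracking this number separately from model metrics makes it clear whether a drop in end-to-end recall comes from the embeddings or from the index settings.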
🌐
Where to Go Next

From here, the natural next topics are: Ranking Models (DCN-v2, DeepFM, DLRM) that take your retrieved candidates and produce a final sorted list; Multi-Task Learning for simultaneous optimisation of engagement, satisfaction, and diversity; and Reinforcement Learning for Recommendation (SlateQ, cascading bandits) for long-horizon session optimisation beyond immediate click prediction.
