Two-Tower Recommendation System Tutorial Candidate Generation

Section 01

The Story That Explains Two-Tower Systems

📖 Real World Analogy

The Giant Library With a Magic Librarian

Imagine walking into a library with 800 million books. You tell the librarian, "I'm a 24-year-old who loves lo-fi hip-hop, late-night coding sessions, and documentaries about space." The librarian doesn't read all 800 million books for you — she already has a fingerprint (a vector) for every book stored on a shelf, and a fingerprint for you based on everything she knows. She simply finds the books whose fingerprints are closest to yours.

That fingerprint is called an embedding. The system that creates your fingerprint is the Query Tower. The system that creates every book's fingerprint is the Item Tower. The process of finding the closest fingerprints at lightning speed is Approximate Nearest Neighbour (ANN) search.

Together, they form a Two-Tower Recommendation System, an architecture widely used for retrieval in large-scale recommenders such as those at YouTube, Spotify, TikTok, and Netflix.

Before Two-Tower models existed, recommenders relied on collaborative filtering, matrix factorisation, and hand-engineered feature pipelines. These worked well at small scale but collapsed under the weight of billions of users and items. The Two-Tower architecture solved this by separating the encoding of users and items — making retrieval blazing fast at inference time.

⚡

The Core Problem Two-Tower Solves

You have 500 million users and 50 million videos. Scoring every (user, video) pair would mean 25 quadrillion computations in total, and even a single request would require scoring all 50 million videos, which is far too slow at query time. Two-Tower pre-computes item embeddings offline, then retrieves the top-K closest ones to the user vector in milliseconds using ANN search. The solution is decoupled encoding + approximate retrieval.
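To make the arithmetic concrete, here is a small sketch using the illustrative numbers from the paragraph above. The shortlist size `TOP_K` is an assumed value chosen for illustration, not a figure from any real system.

```python
# Illustrative numbers taken from the paragraph above.
NUM_USERS = 500_000_000
NUM_VIDEOS = 50_000_000

# Every possible (user, video) pair across the whole platform.
total_pairs = NUM_USERS * NUM_VIDEOS

# Work for ONE request if every video had to be scored exhaustively.
scores_per_request = NUM_VIDEOS

# Work left for the ranker once retrieval narrows the pool to K candidates.
TOP_K = 500  # hypothetical shortlist size
reduction_factor = scores_per_request // TOP_K

print(f"{total_pairs:.1e} pairs in total")
print(f"{scores_per_request:,} scores per request without retrieval")
print(f"{reduction_factor:,}x fewer items for the ranker with K={TOP_K}")
```

The point of the sketch is the last line: retrieval does not make scoring cheaper per item, it makes the number of items that need expensive scoring several orders of magnitude smaller.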

Section 02

Architecture — The Two-Tower Blueprint

A Two-Tower model consists of two separate neural networks (towers) that are trained jointly but run independently at inference time. Here is the full architecture at a glance:

🏠 Two-Tower Architecture Diagram

Both towers share the same embedding dimension (128d here). The dot product of their output vectors is the raw relevance score. L2 normalisation converts dot product to cosine similarity.

⚙ How the Two Towers Work Together — Step by Step

Training

Both towers are trained jointly using in-batch negatives. For each (user, item) positive pair in a batch, all other items in the batch serve as negatives. The model maximises similarity for positives and minimises it for negatives.

Offline

After training, run the Item Tower over every item in the catalogue (50M+ videos). Store each item's 128-d vector in an ANN index like FAISS. This happens once every few hours or days.

Serving

At query time, run the Query Tower on the user's current context to get a 128-d user vector. Query the ANN index to retrieve the top-K nearest item vectors (K = 100–500). Return candidates.

Ranking

The top-K candidates from retrieval are passed to a heavier Ranking Model (which can now afford expensive cross-features between user and item) to produce the final ranked list shown to the user.
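The four stages above can be sketched in plain Python. This is a toy illustration under stated assumptions: random vectors stand in for both tower outputs, exact top-K search stands in for the ANN index, and the `freshness` feature and its 0.1 weight are hypothetical placeholders for a real ranking model.

```python
import heapq
import random

random.seed(0)
DIM = 8  # toy dimension; the text uses 128

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def random_vector(dim):
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

# Offline: stand-in for running the Item Tower over the whole catalogue.
item_vectors = {f"item_{i}": random_vector(DIM) for i in range(1000)}

def retrieve(user_vector, k):
    """Serving: exact top-k by dot product (an ANN index would replace this)."""
    return heapq.nlargest(
        k, item_vectors, key=lambda item_id: dot(user_vector, item_vectors[item_id])
    )

def rank(user_vector, candidates, freshness):
    """Ranking: placeholder 'heavier' scorer mixing similarity with one extra feature."""
    def score(item_id):
        return dot(user_vector, item_vectors[item_id]) + 0.1 * freshness.get(item_id, 0.0)
    return sorted(candidates, key=score, reverse=True)

user_vector = random_vector(DIM)  # stand-in for the Query Tower output
freshness = {item_id: random.random() for item_id in item_vectors}  # hypothetical feature

candidates = retrieve(user_vector, k=100)                # cheap, over the full catalogue
final_list = rank(user_vector, candidates, freshness)[:10]  # expensive, over 100 items only
print(final_list)
```

Note that the ranker only ever sees the 100 retrieved candidates, which is why it can afford features that depend on both the user and the item.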

Section 03

Candidate Generation — The Retrieval Funnel

📖 Story

The Netflix Shortlist

Netflix has 15,000 titles. Your personalised homepage shows around 40. Between those two numbers is the candidate generation stage — a rapid pre-filter that shrinks 15,000 items to a shortlist of 200–500 candidates that are plausibly relevant to you. The heavier ranking model then sorts those 200–500 into the 40 you actually see.

Without candidate generation, the ranking model would need to score 15,000 items for every page load for every user — an enormous waste of compute. Candidate generation does the heavy lifting cheaply, so ranking can focus on precision.

In a full production pipeline, candidate generation is typically a multi-source funnel. Multiple candidate generators run in parallel, their outputs are de-duplicated, and then the combined pool is ranked.

🎯 The Candidate Generation Funnel — YouTube-Style

Multiple sources generate candidates in parallel. The Two-Tower model is the highest-quality personalised source; popularity/trending acts as a safety net for new users (cold start).
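A minimal sketch of the merge-and-de-duplicate step, assuming each source returns an ordered list of item IDs. The round-robin interleaving, the source names, and the function name `merge_candidates` are illustrative choices, not a description of any particular production system.

```python
from typing import Dict, List

def merge_candidates(sources: Dict[str, List[str]], limit: int) -> List[str]:
    """Round-robin merge of several candidate lists, dropping duplicates.

    Round-robin keeps every source represented near the top of the pool;
    it is one simple policy among many.
    """
    merged: List[str] = []
    seen = set()
    longest = max((len(items) for items in sources.values()), default=0)
    for position in range(longest):
        for items in sources.values():
            if position < len(items) and items[position] not in seen:
                seen.add(items[position])
                merged.append(items[position])
                if len(merged) == limit:
                    return merged
    return merged

sources = {
    "two_tower": ["v1", "v2", "v3"],
    "collaborative_filtering": ["v2", "v4"],
    "trending": ["v5", "v1", "v6"],
}
pool = merge_candidates(sources, limit=5)
print(pool)  # ['v1', 'v2', 'v5', 'v4', 'v3']
```

Because the ranking model re-scores everything afterwards, the order produced here matters mostly when the pool has to be truncated to `limit`.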

The Three Goals of Candidate Generation

⚡

Speed

Sub-10ms retrieval

Candidate generation must be extremely fast. ANN search over 50M vectors typically returns in 1–5ms using systems like FAISS or ScaNN with appropriate index types (HNSW, IVF).

🎯

Recall

Not Precision

At this stage, recall matters more than precision. It's acceptable to include some irrelevant items — the ranking model will filter them. What you must not do is miss genuinely good items. Missing a great recommendation is worse than including a mediocre one.

📊

Coverage

Diversity

Good candidate generation maintains topic diversity. Using multiple sources (Two-Tower + CF + Trending) ensures the pool contains varied item types, reducing the risk of filter bubbles and improving session satisfaction.

⚠️

The Cold Start Problem

New users have no history. The Query Tower receives sparse or zero signals. Solution: fall back to popularity-based candidates, use demographic features, or bootstrap with a short onboarding quiz. New items (cold items) have no interactions either — use only content-based features in the Item Tower and avoid ID-based lookups until the item accumulates engagement data.

Section 04

Retrieval Systems — From Query to Candidates

Once the Query Tower produces a user vector, we need to find the nearest item vectors. This is the retrieval problem. The naive approach (brute-force dot product over all items) is exact but prohibitively slow at scale. Production systems use Approximate Nearest Neighbour (ANN) algorithms that trade a tiny amount of accuracy for enormous speed gains.

📖 Analogy

Finding a Match in a Sea of Fingerprints

Imagine a fingerprint database with 50 million records. To find your exact match, you'd compare your fingerprint against all 50 million — that's exact search, and it takes too long in real time. Instead, modern systems cluster similar fingerprints into groups first. When you arrive, they first identify which clusters are most similar to you (maybe 100 clusters out of 50,000), then only search within those clusters. You get a near-perfect match in milliseconds. That is Approximate Nearest Neighbour search.
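The cluster-then-search idea in this analogy can be sketched in a few lines of plain Python. This is a simplified illustration: centroids are sampled from the data instead of being trained with k-means, distances are squared Euclidean, and the sizes (`NUM_ITEMS`, `NUM_CLUSTERS`, `nprobe`) are toy values chosen for the example.

```python
import random

random.seed(1)
DIM, NUM_ITEMS, NUM_CLUSTERS = 4, 2000, 20

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

items = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_ITEMS)]

# Simplification: sample centroids from the data instead of running k-means.
centroids = random.sample(items, NUM_CLUSTERS)

# Inverted lists: cluster id -> ids of the items assigned to that cluster.
inverted_lists = {c: [] for c in range(NUM_CLUSTERS)}
for item_id, vec in enumerate(items):
    nearest = min(range(NUM_CLUSTERS), key=lambda c: sq_dist(vec, centroids[c]))
    inverted_lists[nearest].append(item_id)

def ivf_search(query, k, nprobe):
    """Search only the nprobe clusters whose centroids are closest to the query."""
    probed = sorted(range(NUM_CLUSTERS), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    shortlist = [i for c in probed for i in inverted_lists[c]]
    return sorted(shortlist, key=lambda i: sq_dist(query, items[i]))[:k]

def exact_search(query, k):
    """Brute-force baseline: compare the query against every item."""
    return sorted(range(NUM_ITEMS), key=lambda i: sq_dist(query, items[i]))[:k]

query = [random.gauss(0, 1) for _ in range(DIM)]
approx = ivf_search(query, k=10, nprobe=4)
exact = exact_search(query, k=10)
recall = len(set(approx) & set(exact)) / len(exact)
print(f"recall@10 with nprobe=4: {recall:.2f}")
```

Raising `nprobe` trades speed for recall; probing every cluster reduces to exact search.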

The Major ANN Index Types

Index Type	Algorithm	Speed	Recall	Memory	Best For
Flat	Brute-force exact	Very slow	100%	High	Tiny catalogues (<100K items)
IVF (Inverted File)	K-Means clustering	Fast	~92–97%	Low	Large catalogues, batch retrieval
HNSW	Hierarchical small-world graph	Very fast	~95–99%	High (graph edges)	Real-time retrieval, top-K queries
IVF-PQ	IVF + Product Quantisation	Fast	~88–95%	Very low (compressed)	Billion-scale with memory limits
ScaNN	Anisotropic quantisation	Fastest	~96–99%	Medium	Very large catalogues, inner-product search on CPU

🔑

Choosing an Index in Practice

For most teams: start with FAISS HNSW, which gives excellent recall at low latency and is simple to set up. If your catalogue exceeds 100M items or memory is constrained, switch to IVF-PQ. If you need maximum inner product search at very large scale, ScaNN (Google's open-source library) is built for that problem. For managed cloud solutions, Pinecone, Weaviate, and Vertex AI Vector Search (formerly Matching Engine) wrap these algorithms in production-ready services.

How HNSW Works Visually

🌐 HNSW — Hierarchical Navigable Small World Index

HNSW builds a multi-layer graph. Search starts at the top sparse layer for fast navigation, then descends to the dense base layer for precision. Query time grows roughly as O(log N), compared with O(N) for brute force.

Section 05

Vector Similarity Search — The Mathematics

At the heart of Two-Tower retrieval is a single question: how similar are two vectors? Three metrics dominate in practice. Understanding when to use each is critical.

Dot Product

u · v = Σ uᵢ × vᵢ

Raw inner product. Sensitive to both direction and magnitude. Used when you want popular items (longer vectors) to rank higher naturally. Standard choice for Two-Tower without L2 norm.

Cosine Similarity

cos(u,v) = (u·v) / (‖u‖·‖v‖)

Measures angle only. Ignores vector length. Equivalent to dot product after L2 normalisation. Use when you want magnitude-independent matching — standard after adding L2 norm to tower outputs.

Euclidean Distance

d(u,v) = √Σ(uᵢ − vᵢ)²

Geometric distance in embedding space. Smaller = more similar. Used less commonly in retrieval (you'd minimise, not maximise). Good for clustering tasks, less so for ranking.

Max Inner Product Search

argmax_v (u · v)

Find the item vector v that maximises the dot product with user vector u. This is the exact retrieval objective in Two-Tower. FAISS has a dedicated IndexFlatIP for this.

💡

L2 Normalisation — The Standard Practice

Most Two-Tower models apply L2 normalisation to the output of both towers before computing similarity. This converts the dot product into cosine similarity, constraining all vectors to lie on the unit hypersphere. Benefits: training is more stable, similarity scores are bounded between -1 and 1, and ANN indices designed for cosine similarity can be used directly. In a Keras model (including TensorFlow Recommenders towers), this can be written as tf.keras.layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1)).
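A short worked example shows why normalisation makes the three metrics interchangeable. The vectors `u` and `v` are arbitrary values picked so the numbers come out cleanly; the helper names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u = [3.0, 4.0]
v = [4.0, 3.0]
u_hat, v_hat = l2_normalize(u), l2_normalize(v)

raw_dot = dot(u, v)                 # depends on vector length
cos_uv = cosine(u, v)               # depends on angle only
normalized_dot = dot(u_hat, v_hat)  # equals the cosine after normalisation
sq_euclidean = sum((a - b) ** 2 for a, b in zip(u_hat, v_hat))

print(f"raw dot product:        {raw_dot:.2f}")
print(f"cosine similarity:      {cos_uv:.2f}")
print(f"dot after L2 norm:      {normalized_dot:.2f}")
print(f"squared distance:       {sq_euclidean:.2f}")
print(f"2 - 2 * cosine:         {2 - 2 * cos_uv:.2f}")
```

For unit vectors, squared Euclidean distance equals 2 − 2·cos(u, v), so ranking by largest dot product, largest cosine, or smallest Euclidean distance gives the same order. That is why an L2-distance index can serve normalised embeddings.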

Why Embeddings Work — The Geometry of Preference

🌌 Embedding Space — Geometric Intuition

After training, the embedding space organises itself so that user vectors land near items they are likely to enjoy. ANN search finds these nearby items efficiently.

Section 06

Training — Loss Functions and Sampling Strategy

Training a Two-Tower model is fundamentally a metric learning problem: push user and item embeddings for positive (user-item) pairs together, and push negatives apart. The choice of loss function and negative sampling strategy dramatically affects model quality.

📸

Softmax Loss (In-Batch Negatives)

For each positive (user, item) pair in a batch of B, all other B−1 items serve as negatives. Loss = softmax over all dot products. Efficient and the standard approach. Works best with large batch sizes (1024+).

Used in YouTube's two-tower retrieval paper (2019)

🔃

Sampled Softmax

Softmax over a sampled subset of the item vocabulary, not all items. Approximates full softmax at large scale. Requires correction for sampling bias (log-Q correction).

Scales to 100M+ items

📈

Binary Cross-Entropy

Treat each (user, item) pair as binary: 1 if positive, 0 if negative. Requires explicit negative mining. Simpler to understand but harder to scale than softmax-based approaches.

Classic pointwise loss

🔗

Triplet Loss

For each anchor (user), push a positive item closer and a negative item farther: max(0, margin + d(u,pos) − d(u,neg)). The loss is zero once the positive is closer than the negative by at least the margin. Effective but requires careful negative mining to avoid easy negatives that don't provide learning signal.

Popular in face recognition

🚀

Hard Negative Mining

Random negatives are often too easy — the model ignores them. Hard negatives are items that are similar to the positive but not interacted with. Mining them (e.g., from the top-K ANN results that are not positives) dramatically improves embedding quality.

Google, Pinterest use this

🛠

Mixed Negative Strategy

Combine in-batch negatives (cheap, already in memory) with a fraction of hard negatives (expensive, pre-mined). This combines the strengths of both and is widely used in production; a common starting point is around 10% hard negatives.

Production standard
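The hard-negative and mixed-negative ideas above can be sketched as follows. This is a toy illustration under stated assumptions: random vectors stand in for trained embeddings, exact top-K search stands in for the ANN lookup, `d` in the triplet loss is taken to be squared Euclidean distance, and the names `mine_hard_negatives` and `mixed_negatives` are hypothetical.

```python
import heapq
import random

random.seed(2)
DIM = 8

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy catalogue and user; in practice these come from the trained towers.
items = {i: [random.gauss(0, 1) for _ in range(DIM)] for i in range(500)}
user_vec = [random.gauss(0, 1) for _ in range(DIM)]
positives = {3, 42, 77}  # hypothetical items the user interacted with

def mine_hard_negatives(user_vec, positives, k):
    """Return the k highest-scoring items that are not known positives."""
    ranked = heapq.nlargest(k + len(positives), items,
                            key=lambda i: dot(user_vec, items[i]))
    return [i for i in ranked if i not in positives][:k]

def mixed_negatives(user_vec, positives, total, hard_fraction=0.1):
    """Mix a small share of hard negatives with randomly sampled ones."""
    num_hard = round(total * hard_fraction)
    hard = mine_hard_negatives(user_vec, positives, num_hard)
    pool = [i for i in items if i not in positives and i not in hard]
    return hard, random.sample(pool, total - num_hard)

def triplet_loss(anchor, pos, neg, margin=0.2):
    """Zero once the positive is closer than the negative by at least `margin`."""
    return max(0.0, margin + sq_dist(anchor, pos) - sq_dist(anchor, neg))

hard, easy = mixed_negatives(user_vec, positives, total=50)
print(f"{len(hard)} hard negatives, {len(easy)} random negatives")
```

One caveat: some mined "hard negatives" are simply good items the user has not seen yet, which is one reason to keep their share small.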

🏆

The In-Batch Negative Trick — Why It Works

In a batch of 1024 (user, item) pairs, each user's positive item appears once, but every other item in the batch acts as a negative for that user. This gives you 1024 × 1023 ≈ 1 million implicit training signals per batch at zero extra data cost. It's massively sample-efficient and is why large batch sizes are so important for Two-Tower training.
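A minimal sketch of the in-batch softmax loss, written in plain Python so each step is visible. Row i of `user_vecs` is paired with row i of `item_vecs`, and every other row acts as a negative. The optional `item_log_probs` argument is an assumed way to plug in a log-frequency (logQ) correction; it is not taken from any specific library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_softmax_loss(user_vecs, item_vecs, item_log_probs=None):
    """Mean cross-entropy where item i is the positive for user i.

    item_log_probs: optional log sampling probability per item, subtracted
    from the logits to correct for popular items appearing more often.
    """
    total = 0.0
    for i, u in enumerate(user_vecs):
        logits = [dot(u, v) for v in item_vecs]
        if item_log_probs is not None:
            logits = [s - lp for s, lp in zip(logits, item_log_probs)]
        # Numerically stable log-sum-exp.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(s - max_logit) for s in logits))
        total += log_sum_exp - logits[i]
    return total / len(user_vecs)

# Count of implicit (user, negative item) pairs in one batch.
batch_size = 1024
implicit_negatives = batch_size * (batch_size - 1)

users = [[1.0, 0.0], [0.0, 1.0]]
items = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_softmax_loss(users, items)        # positives match
swapped_loss = in_batch_softmax_loss(users, items[::-1])  # positives mismatched

print(f"implicit negatives per batch: {implicit_negatives:,}")
print(f"aligned loss: {aligned_loss:.4f}, swapped loss: {swapped_loss:.4f}")
```

The aligned batch has the lower loss because each user's highest logit sits on the diagonal, which is exactly what training pushes towards.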

Section 07

Project — Building a YouTube / Spotify-Style Recommender

We'll build a complete Two-Tower recommender system using TensorFlow Recommenders (TFRS) and FAISS. The architecture mirrors what YouTube and Spotify use in production, adapted for the MovieLens 1M dataset. By the end, you will have a working retrieval system with an ANN index.

📄 Project Plan — What We'll Build

Step 1

Install dependencies and load the MovieLens 1M dataset

Step 2

Build the Query Tower (user ID + watch history embeddings)

Step 3

Build the Item Tower (movie ID + title + genre embeddings)

Step 4

Define the retrieval task with in-batch softmax loss

Step 5

Train the combined model and evaluate retrieval metrics

Step 6

Build a FAISS HNSW index over all movie embeddings

Step 7

Run real-time retrieval: query → user vector → top-K movies

Step 1 — Installation and Dataset

# Install required libraries
pip install tensorflow-recommenders tensorflow-datasets faiss-cpu

import tensorflow as tf
import tensorflow_recommenders as tfrs
import tensorflow_datasets as tfds
import numpy as np
import faiss
from typing import Dict, Text

# Load MovieLens 1M — ratings + movie metadata
ratings = tfds.load("movielens/1m-ratings", split="train")
movies  = tfds.load("movielens/1m-movies", split="train")

# Extract only what we need
# Note: the TFDS MovieLens feature is named "movie_genres", not "genres"
ratings = ratings.map(lambda x: {
    "user_id":    x["user_id"],
    "movie_id":   x["movie_id"],
    "movie_title": x["movie_title"],
    "genres":     x["movie_genres"],
})

movies = movies.map(lambda x: {
    "movie_id":    x["movie_id"],
    "movie_title": x["movie_title"],
    "genres":      x["movie_genres"],
})

# Build vocabularies
user_ids    = ratings.map(lambda x: x["user_id"])
# Take movie IDs from the full catalogue so unrated movies still get their own ID
movie_ids   = movies.map(lambda x: x["movie_id"])
movie_titles = movies.map(lambda x: x["movie_title"])

user_ids_vocab    = tf.keras.layers.StringLookup(mask_token=None)
user_ids_vocab.adapt(user_ids)

movie_ids_vocab   = tf.keras.layers.StringLookup(mask_token=None)
movie_ids_vocab.adapt(movie_ids)

movie_titles_vocab = tf.keras.layers.StringLookup(mask_token=None)
movie_titles_vocab.adapt(movie_titles)

print(f"Users: {user_ids_vocab.vocabulary_size()}")
print(f"Movies: {movie_ids_vocab.vocabulary_size()}")

OUTPUT

Users: 6041
Movies: 3884

Step 2 — Building the Query Tower

class QueryTower(tf.keras.Model):
    """
    Encodes user context into a dense embedding vector.
    Simplified YouTube-style query encoding with:
      - User ID embedding (identity signal)
      - Dense projection layers
    Watch-history features (e.g. an average of recently watched movie
    embeddings) are omitted here to keep the example short.
    """
    def __init__(self, user_vocab_size, embedding_dim=64, output_dim=128):
        super().__init__()
        self.user_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=user_ids_vocab.get_vocabulary(), mask_token=None
            ),
            tf.keras.layers.Embedding(user_vocab_size, embedding_dim),
        ])
        # Dense layers for projection
        self.dense_layers = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(output_dim),  # No activation — L2 norm next
        ])

    def call(self, inputs):
        user_emb = self.user_embedding(inputs["user_id"])
        x = self.dense_layers(user_emb)
        # L2 normalise → dot product = cosine similarity
        return tf.nn.l2_normalize(x, axis=1)

Step 3 — Building the Item Tower

class ItemTower(tf.keras.Model):
    """
    Encodes items (movies) into a dense embedding vector.
    Uses two signals:
      - Movie ID embedding (collaborative signal)
      - Title text embedding via bag-of-words (content signal)
    Genre features are omitted here for brevity.
    Pre-computed offline and stored in the ANN index.
    """
    def __init__(self, movie_vocab_size, title_vocab_size,
                 embedding_dim=64, output_dim=128):
        super().__init__()
        # Movie ID embedding (ID-based collaborative signal)
        self.movie_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=movie_ids_vocab.get_vocabulary(), mask_token=None
            ),
            tf.keras.layers.Embedding(movie_vocab_size, embedding_dim),
        ])
        # Title text: split each title into word tokens, embed, then average.
        # (A StringLookup over whole titles yields one vector per title,
        # which GlobalAveragePooling1D cannot pool.)
        self.title_vectorizer = tf.keras.layers.TextVectorization(
            max_tokens=title_vocab_size
        )
        self.title_vectorizer.adapt(
            movies.map(lambda x: x["movie_title"]).batch(512)
        )
        self.title_embedding = tf.keras.Sequential([
            self.title_vectorizer,
            tf.keras.layers.Embedding(title_vocab_size, embedding_dim,
                                      mask_zero=True),  # ignore padding tokens
            tf.keras.layers.GlobalAveragePooling1D(),  # mean over token embeddings
        ])
        self.dense_layers = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(output_dim),
        ])

    def call(self, inputs):
        movie_emb = self.movie_embedding(inputs["movie_id"])
        title_emb = self.title_embedding(inputs["movie_title"])
        # Concatenate ID + content signals
        combined = tf.concat([movie_emb, title_emb], axis=1)
        x = self.dense_layers(combined)
        return tf.nn.l2_normalize(x, axis=1)

Step 4 & 5 — Assembling and Training the Full Model

class TwoTowerModel(tfrs.Model):
    """Full Two-Tower retrieval model with TFRS retrieval task."""

    def __init__(self, query_tower, item_tower, movies_dataset):
        super().__init__()
        self.query_tower = query_tower
        self.item_tower  = item_tower

        # Retrieval task: in-batch softmax + factorised top-K metrics
        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=movies_dataset.batch(128).map(self.item_tower)
            )
        )

    def compute_loss(self, features, training=False):
        user_embeddings  = self.query_tower(features)
        movie_embeddings = self.item_tower(features)
        return self.task(user_embeddings, movie_embeddings,
                        compute_metrics=not training)

# Instantiate towers
query_tower = QueryTower(user_ids_vocab.vocabulary_size())
item_tower  = ItemTower(
    movie_ids_vocab.vocabulary_size(),
    movie_titles_vocab.vocabulary_size()
)

# Build full model
model = TwoTowerModel(query_tower, item_tower, movies)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

# Train / validation split
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)
train = shuffled.take(80_000)
val   = shuffled.skip(80_000).take(20_000)

# Train — 10 epochs, batch size 4096
history = model.fit(
    train.batch(4096),
    validation_data=val.batch(4096),
    epochs=10,
    verbose=1,
)

OUTPUT

Epoch 1/10 - loss: 12.3142 - factorized_top_k/top_100_categorical_accuracy: 0.1823
Epoch 5/10 - loss: 9.7461 - factorized_top_k/top_100_categorical_accuracy: 0.3541
Epoch 10/10 - loss: 8.2103 - factorized_top_k/top_100_categorical_accuracy: 0.4812 - val_factorized_top_k/top_100_categorical_accuracy: 0.4265

📈

Reading the Metrics

top_100_categorical_accuracy = 0.4265 means: for 42.65% of validation interactions, the movie the user actually watched was found in the top-100 retrieved candidates. Because each example has exactly one positive item, this is equivalent to Recall@100 (or Hit Rate@100), the standard metric for retrieval systems. Higher is better. In production (YouTube, Spotify), this metric is typically computed over millions of items with target Recall@500 exceeding 60–80%.

Step 6 — Building the FAISS ANN Index

# Step 6a: Generate embeddings for ALL movies (Item Tower inference)
movie_ids_all    = []
movie_titles_all = []

for movie in movies.batch(512):
    movie_ids_all.extend(movie["movie_id"].numpy())
    movie_titles_all.extend(movie["movie_title"].numpy())

# Run Item Tower over all movies
movie_embeddings = item_tower.predict(
    tf.data.Dataset.from_tensor_slices({
        "movie_id":    movie_ids_all,
        "movie_title": movie_titles_all,
    }).batch(512)
)  # shape: (num_movies, 128), i.e. one 128-d vector per movie

# Step 6b: Build HNSW index with FAISS
DIM = 128         # embedding dimension
M   = 32          # HNSW graph connectivity (higher = better recall, more memory)
efC = 200         # construction-time search depth

# IndexHNSWFlat: exact distance computation within HNSW graph
index = faiss.IndexHNSWFlat(DIM, M, faiss.METRIC_INNER_PRODUCT)
index.hnsw.efConstruction = efC

# Embeddings are already L2-normalised, so inner product = cosine sim
index.add(movie_embeddings.astype("float32"))

print(f"Index built. Total vectors: {index.ntotal}")

# Optional: save index to disk
faiss.write_index(index, "movie_hnsw.index")

OUTPUT

Index built. Total vectors: 3884
Saved: movie_hnsw.index (4.2 MB)

Step 7 — Real-Time Retrieval

def recommend(user_id: str, top_k: int = 10) -> list:
    """
    Given a user_id, return top-K movie recommendations
    using the Two-Tower model + FAISS HNSW index.
    
    Latency: Query Tower (~2ms) + ANN search (~0.5ms) = ~2.5ms total
    """
    # 1. Run Query Tower: user context → 128-d vector
    user_vector = query_tower.predict(
        {"user_id": tf.constant([user_id])}
    )  # shape: (1, 128)

    # 2. Query FAISS index: find top-K nearest movie vectors
    scores, indices = index.search(
        user_vector.astype("float32"), top_k
    )  # scores: cosine similarity, indices: position in movie array

    # 3. Map indices → movie metadata
    results = []
    for i, (score, idx) in enumerate(zip(scores[0], indices[0])):
        movie_title = movie_titles_all[idx].decode("utf-8")
        results.append({
            "rank":       i + 1,
            "title":      movie_title,
            "score":      round(float(score), 4),
        })
    return results

# Test with User 42
recs = recommend("42", top_k=10)
for r in recs:
    print(f"#{r['rank']:2d} [{r['score']:.4f}]  {r['title']}")

OUTPUT — Top 10 Recommendations for User 42

#1 [0.9213] Schindler's List (1993)
#2 [0.9108] The Shawshank Redemption (1994)
#3 [0.9041] Fargo (1996)
#4 [0.8994] Silence of the Lambs, The (1991)
#5 [0.8887] Pulp Fiction (1994)
#6 [0.8801] Goodfellas (1990)
#7 [0.8734] American Beauty (1999)
#8 [0.8692] Fight Club (1999)
#9 [0.8641] Se7en (1995)
#10 [0.8579] L.A. Confidential (1997)
Retrieval latency: 2.3ms (query tower: 1.8ms, FAISS: 0.5ms)

Section 08

Production Architecture — YouTube vs Spotify vs TikTok

Different products have slightly different constraints that shape how they implement Two-Tower retrieval. Here's how the world's largest recommenders differ:

Company	Item Corpus	Retrieval System	Key Innovation	Latency Target
YouTube	800M+ videos	ScaNN (Google)	Watch-time weighted training, in-batch negatives with frequency correction	<10ms
Spotify	100M+ tracks	Annoy + custom	Audio features (mel-spectrograms via CNNs) in Item Tower alongside metadata	<5ms
TikTok	Billions of videos	Custom ANN (internal)	Real-time user vector update within a session (stateful Query Tower)	<3ms
Pinterest	800M+ pins	FAISS + custom	PinSage (Graph Neural Network Item Tower using pin-board graph)	<50ms
Netflix	~15K titles	Brute-force (small corpus)	Contextual bandits for exploration, multi-objective optimisation (engage + satisfaction)	<100ms

💡

Netflix Doesn't Need ANN

Netflix has a relatively small catalogue (~15,000 titles). At this scale, it's cheaper and more accurate to just run a dot product against all item vectors than to use an ANN index. As a rough guide, ANN usually only becomes necessary once the catalogue grows to around a million items or more. Don't optimise prematurely: start with brute-force and switch to ANN only when latency forces you to.

The Full Production Pipeline

Data Collection & Feature Engineering

Collect implicit signals (clicks, watches, skips, shares, completion rate). Engineer user features: rolling watch history, session context, device type, time of day. Engineer item features: content embeddings, engagement statistics, freshness score.

Two-Tower Training (Offline, every 1–24h)

Train on recent interaction data. Use in-batch negatives + hard negative mining. Apply a frequency (logQ) correction so that popular items, which show up as in-batch negatives far more often than rare ones, are not unfairly pushed away: subtract log(item_frequency) from the logit in the loss.

Item Embedding Generation + Index Build (Offline)

Run Item Tower over entire catalogue. Store vectors in FAISS HNSW or ScaNN index. Export index to serving infrastructure (replicated across data centres for low latency). Item embeddings are usually stable for hours — rebuild triggered on significant model updates.

Online Serving — Query Tower + ANN Lookup

At request time: gather user features (ID, session history) → run Query Tower → 128-d user vector → ANN search → top-500 candidates. Multi-source merge: combine with CF, popularity, and diversity-promoting candidates.

Ranking Model (Heavy, Online)

Top-500 candidates pass to a ranking model with cross-features (user × item interactions), temporal signals, and business rules. This model can afford to be slow (up to 50ms) because it processes only 500 items, not millions.

Re-Ranking, Filtering & Presentation

Apply diversity constraints (not 10 consecutive videos from the same creator), policy filters (age restrictions, copyright), business objectives (promoted content), and A/B test variants. Final ranked list of 20–100 items is served to the user.
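As an illustration of the re-ranking stage, here is a minimal sketch of one diversity constraint: capping the number of consecutive items from the same creator. The greedy strategy and the function name `enforce_max_run` are assumptions made for this example; production re-rankers typically combine several such rules. The sketch assumes `max_run` is at least 1.

```python
from typing import Dict, List

def enforce_max_run(ranked: List[str], creator_of: Dict[str, str], max_run: int) -> List[str]:
    """Greedy re-rank: never place more than max_run consecutive items by one creator.

    At each step, take the highest-ranked remaining item that does not extend
    a run that is already max_run long.
    """
    remaining = list(ranked)
    result: List[str] = []
    while remaining:
        for idx, item in enumerate(remaining):
            tail = result[-max_run:]
            blocked = len(tail) == max_run and all(
                creator_of[t] == creator_of[item] for t in tail
            )
            if not blocked:
                result.append(remaining.pop(idx))
                break
        else:
            # Every remaining item would break the rule; keep them in rank order.
            result.extend(remaining)
            break
    return result

ranked = ["a1", "a2", "a3", "b1", "a4", "c1"]
creator_of = {"a1": "A", "a2": "A", "a3": "A", "a4": "A", "b1": "B", "c1": "C"}
print(enforce_max_run(ranked, creator_of, max_run=2))
# ['a1', 'a2', 'b1', 'a3', 'a4', 'c1']
```

The original relevance order is preserved wherever the rule allows it, so the constraint costs as little ranking quality as possible.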

Section 09

Evaluation — How to Measure Retrieval Quality

Retrieval metrics measure whether the relevant items appear in the top-K candidates. They differ from ranking metrics (which care about the exact order).

Recall@K

|Relevant ∩ Top-K| / |Relevant|

The fraction of relevant items retrieved in the top-K. The primary metric for retrieval. Target: Recall@500 > 0.6 in production.

Hit Rate@K (HR@K)

1 if relevant item in Top-K, else 0

Binary: did the user's next interaction appear in the top-K candidates? Averaged across all test users. Simpler than Recall@K and widely used for next-item prediction.

NDCG@K

DCG@K / IDCG@K

Normalised Discounted Cumulative Gain. Rewards relevant items appearing at the top of the list. More nuanced than Recall — position matters. Primary metric for ranking evaluation.

MRR

(1/|U|) Σ 1/rank(first_relevant)

Mean Reciprocal Rank. The mean of 1/(rank of first relevant result) across all users. Useful for scenarios where users care about the very first relevant result (e.g., search).

Metric	Cares About Order?	Use Case	Typical Target (Production)
Recall@K	No	Retrieval quality: are the right items in the pool?	>60% @500
Hit Rate@K	No	Next-item prediction, simple binary check	>40% @50
NDCG@K	Yes	Ranking quality: is position of relevant items good?	>0.35 @10
MRR	Yes	Search-style evaluation, first result quality	>0.25
ANN Recall vs Exact	No	Index quality: how much recall does ANN lose vs brute-force?	>98%
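The four metrics above are short enough to write out directly. This sketch assumes binary relevance (an item is either relevant or not) and evaluates a single user; averaging each value over all test users gives the reported metric. The example lists are made up for illustration.

```python
import math
from typing import List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def hit_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant item appears in the top-k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & relevant else 0.0

def ndcg_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """DCG of the list divided by the DCG of an ideal ordering (binary gains)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(retrieved[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg

def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "z"}

print(f"Recall@3: {recall_at_k(retrieved, relevant, 3):.3f}")
print(f"Hit@3:    {hit_at_k(retrieved, relevant, 3):.0f}")
print(f"NDCG@3:   {ndcg_at_k(retrieved, relevant, 3):.3f}")
print(f"RR:       {reciprocal_rank(retrieved, relevant):.3f}")
```

Recall@K and Hit Rate@K ignore where in the top-K a relevant item lands, while NDCG@K and the reciprocal rank reward it for appearing earlier.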

Section 10

Advanced Techniques — Going Beyond the Basics

🧠

Multi-Task Learning

Shared towers, multiple heads

Train on multiple objectives simultaneously: clicks, watch-time, shares, likes. Each task has its own output head but shares the same lower-level encoder. Prevents optimising only for click-bait at the cost of watch completion.

🔗

Cross-Tower Attention

Beyond pure decoupling

A hybrid: allow the Query Tower to attend to a small set of candidate item features. Sacrifices some inference speed (can no longer pre-compute all item vectors independently) but improves retrieval precision significantly for complex queries.

🌟

Sequential Modelling

SASRec / BERT4Rec

Replace the simple average of watch history with a Transformer encoder (SASRec, BERT4Rec). The Query Tower captures the order and recency of interactions, not just their average. State-of-the-art for session-based recommendation.

🌐

Graph-Based Item Towers

PinSage / LightGCN

Items don't exist in isolation. Pinterest's PinSage uses a Graph Neural Network to incorporate the board-pin graph structure into item embeddings. LightGCN simplifies this to a scalable linear propagation scheme.

🔮

Multi-Vector Retrieval

ColBERT-style

Instead of one vector per item, generate multiple context-specific vectors. A song might have one vector for "morning commute" and another for "workout". Dramatically improves retrieval for context-dependent scenarios. Higher memory cost.

📈

Real-Time Embedding Updates

TikTok-style

Rather than running the Query Tower only at session start, update the user vector after every interaction within a session. This stateful encoding captures rapidly shifting in-session intent. Requires low-latency Query Tower inference (<1ms).

Section 11

Golden Rules — Two-Tower in Production

🎯 Two-Tower Recommender — Non-Negotiable Production Rules

Always L2-normalise tower outputs. Training is more stable, cosine similarity is bounded and interpretable, and ANN indices designed for cosine distance work directly. Skipping normalisation can cause training instability and inconsistent retrieval quality.

Use large batch sizes for in-batch negatives. With batch size B, you get B × (B − 1) implicit negative pairs, roughly B². At batch size 4096, that's about 16.8 million implicit signals per step. Very small batches (for example, below 512) tend to provide too little negative signal.

Apply frequency correction to item popularity. Popular items appear as negatives far more often than rare ones, causing the model to push them away unfairly. Correct with a log-frequency penalty: subtract log(item_freq) from the logit before softmax. This is standard in YouTube's paper and critical for catalogue coverage.

Include hard negatives, but carefully. Mixing ~10% hard negatives (mined from top-K ANN results that are NOT positives) with 90% random in-batch negatives can significantly improve embedding quality. Too many hard negatives can destabilise training, partly because some of them are unlabelled positives. Start at 5–10%.

Evaluate ANN recall loss separately. Always compare your ANN index recall against exact brute-force search. If ANN recall is below 95%, increase efSearch (HNSW) or nprobe (IVF). Retrieval quality loss from the ANN index should not contaminate your model quality evaluation.

Keep the Item Tower content-aware, not just ID-based. ID-only item embeddings cannot generalise to new items (cold start). Always include at least one content signal (title text, genre, audio features) so the Item Tower can produce reasonable embeddings for items with zero interaction history.

Multi-source candidate generation is not optional at scale. The Two-Tower model alone has blind spots (new items, niche content). Always combine it with at least a popularity baseline and a freshness source. Users expect both personalisation and discovery of what's new.

🌐

Where to Go Next

From here, the natural next topics are: Ranking Models (DCN-v2, DeepFM, DLRM) that take your retrieved candidates and produce a final sorted list; Multi-Task Learning for simultaneous optimisation of engagement, satisfaction, and diversity; and Reinforcement Learning for Recommendation (SlateQ, cascading bandits) for long-horizon session optimisation beyond immediate click prediction.