The Story That Explains Two-Tower Systems
That fingerprint is called an embedding. The system that creates your fingerprint is the Query Tower. The system that creates every book's fingerprint is the Item Tower. The process of finding the closest fingerprints at lightning speed is Approximate Nearest Neighbour (ANN) search.
Together, they form a Two-Tower Recommendation System, the retrieval architecture behind recommenders at YouTube, Spotify, TikTok, Netflix, and many other large-scale platforms.
Before Two-Tower models existed, recommenders relied on collaborative filtering, matrix factorisation, and hand-engineered feature pipelines. These worked well at small scale but collapsed under the weight of billions of users and items. The Two-Tower architecture solved this by separating the encoding of users and items — making retrieval blazing fast at inference time.
You have 500 million users and 50 million videos. You cannot score every (user, video) pair: that is 25 quadrillion pairs in total, and even a single request would need 50 million model evaluations. Two-Tower pre-computes item embeddings offline, then retrieves the top-K closest ones to the user vector in milliseconds using ANN search. The solution is decoupled encoding plus approximate retrieval.
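The arithmetic behind these figures is easy to check. The sketch below uses the user and video counts from the paragraph above; the 128-dimensional float32 embeddings are an assumption borrowed from the architecture described later, used only to estimate how much memory the pre-computed item vectors would need.

```python
# Back-of-the-envelope cost of exhaustive scoring.
NUM_USERS = 500_000_000
NUM_ITEMS = 50_000_000
EMBEDDING_DIM = 128    # assumption: matches the 128-d towers used later
BYTES_PER_FLOAT = 4    # assumption: float32 storage

total_pairs = NUM_USERS * NUM_ITEMS        # every (user, video) pair
scores_per_request = NUM_ITEMS             # one user against the full catalogue
item_index_bytes = NUM_ITEMS * EMBEDDING_DIM * BYTES_PER_FLOAT

print(f"All (user, video) pairs: {total_pairs:.1e}")
print(f"Scores per request:      {scores_per_request:,}")
print(f"Raw item embeddings:     {item_index_bytes / 1e9:.1f} GB")
```

Pre-computing the item side turns the per-request work into a single Query Tower pass plus a nearest-neighbour lookup over those stored vectors.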
Architecture — The Two-Tower Blueprint
A Two-Tower model consists of two separate neural networks (towers) that are trained jointly but run independently at inference time. Here is the full architecture at a glance:
Both towers share the same embedding dimension (128d here). The dot product of their output vectors is the raw relevance score. L2 normalisation converts dot product to cosine similarity.
Candidate Generation — The Retrieval Funnel
Without candidate generation, the ranking model would need to score 15,000 items for every page load for every user — an enormous waste of compute. Candidate generation does the heavy lifting cheaply, so ranking can focus on precision.
In a full production pipeline, candidate generation is typically a multi-source funnel. Multiple candidate generators run in parallel, their outputs are de-duplicated, and then the combined pool is ranked.
Multiple sources generate candidates in parallel. The Two-Tower model is the highest-quality personalised source; popularity/trending acts as a safety net for new users (cold start).
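The merge step of such a funnel can be sketched in a few lines. This is a minimal illustration, not a production design: the source names, the item IDs, and the rule that earlier sources keep priority are all assumptions made for the example.

```python
from typing import Dict, List


def merge_candidates(sources: Dict[str, List[str]], limit: int) -> List[dict]:
    """Merge candidate lists, dropping duplicates and recording every source."""
    merged: Dict[str, List[str]] = {}
    # Dict order defines priority: an item keeps the position of its first appearance.
    for source_name, item_ids in sources.items():
        for item_id in item_ids:
            merged.setdefault(item_id, []).append(source_name)
    pool = [{"item_id": i, "sources": s} for i, s in merged.items()]
    return pool[:limit]


# Hypothetical outputs from three candidate generators.
sources = {
    "two_tower": ["v7", "v3", "v9"],
    "trending": ["v3", "v1"],
    "recently_followed": ["v9", "v4"],
}
pool = merge_candidates(sources, limit=10)
for candidate in pool:
    print(candidate)
```

Keeping the list of contributing sources on each candidate is a design choice that lets the ranking stage use "which generator found this" as a feature.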
The Three Goals of Candidate Generation
New users have no history. The Query Tower receives sparse or zero signals. Solution: fall back to popularity-based candidates, use demographic features, or bootstrap with a short onboarding quiz. New items (cold items) have no interactions either — use only content-based features in the Item Tower and avoid ID-based lookups until the item accumulates engagement data.
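A popularity fallback for cold users might look like the sketch below. The threshold `MIN_HISTORY`, the function names, and the stand-in for the personalised source are hypothetical; a real system would tune the threshold and plug in the Query Tower plus ANN lookup.

```python
from typing import Callable, List

MIN_HISTORY = 5  # assumption: below this many interactions the user counts as cold


def generate_candidates(
    history: List[str],
    personalised: Callable[[List[str]], List[str]],
    popular: List[str],
    k: int,
) -> List[str]:
    """Use the personalised source when there is enough history, else popularity."""
    if len(history) < MIN_HISTORY:
        return popular[:k]
    seen = set(history)
    ranked = [item for item in personalised(history) if item not in seen]
    # Top up with popular items if the personalised source returns too few.
    for item in popular:
        if len(ranked) >= k:
            break
        if item not in seen and item not in ranked:
            ranked.append(item)
    return ranked[:k]


def fake_two_tower(history: List[str]) -> List[str]:
    """Stand-in for Query Tower + ANN lookup (returns one already-seen item)."""
    return ["m10", "m11", history[0]]


popular = ["m1", "m2", "m3", "m10"]
print(generate_candidates([], fake_two_tower, popular, k=3))
print(generate_candidates(["m1", "m5", "m6", "m7", "m8"], fake_two_tower, popular, k=3))
```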
Retrieval Systems — From Query to Candidates
Once the Query Tower produces a user vector, we need to find the nearest item vectors. This is the retrieval problem. The naive approach (brute-force dot product over all items) is exact but prohibitively slow at scale. Production systems use Approximate Nearest Neighbour (ANN) algorithms that trade a tiny amount of accuracy for enormous speed gains.
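For reference, the exact brute-force baseline that ANN methods approximate is short enough to write out. The sketch below uses random vectors and plain Python lists purely for illustration; the catalogue size and dimension are arbitrary.

```python
import heapq
import random
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def brute_force_top_k(
    query: Sequence[float], items: List[Sequence[float]], k: int
) -> List[Tuple[float, int]]:
    """Score every item against the query and keep the k highest dot products."""
    scored = ((dot(query, vec), idx) for idx, vec in enumerate(items))
    return heapq.nlargest(k, scored)  # O(N) scoring work per query


random.seed(0)
items = [[random.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
query = [random.gauss(0, 1) for _ in range(8)]
for score, idx in brute_force_top_k(query, items, k=3):
    print(f"item {idx:4d}  score {score:.3f}")
```

Every query touches every item, which is why the cost grows linearly with catalogue size.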
The Major ANN Index Types
| Index Type | Algorithm | Speed | Recall | Memory | Best For |
|---|---|---|---|---|---|
| Flat | Brute-force exact | Very slow | 100% | High | Tiny catalogues (<100K items) |
| IVF (Inverted File) | K-Means clustering | Fast | ~92–97% | Low | Large catalogues, batch retrieval |
| HNSW | Hierarchical small-world graph | Very fast | ~95–99% | High (graph edges) | Real-time retrieval, top-K queries |
| IVF-PQ | IVF + Product Quantisation | Fast | ~88–95% | Very low (compressed) | Billion-scale with memory limits |
| ScaNN | Anisotropic quantisation | Fastest | ~96–99% | Medium | Very large catalogues, inner-product search on CPU |
For most teams: start with FAISS HNSW. It gives excellent recall at low latency and is simple to set up. If your catalogue exceeds 100M items or memory is constrained, switch to IVF-PQ. If you need maximum inner-product search at very large scale, consider ScaNN, Google's open-source library, which is optimised for x86 CPUs with AVX support rather than for TPUs. For managed cloud solutions, Pinecone, Weaviate, and Vertex AI Vector Search (formerly Matching Engine) wrap these algorithms in production-ready services.
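To make the IVF row of the table concrete, here is a toy inverted-file index in plain Python. It is a sketch under simplifying assumptions: a few rounds of k-means, squared Euclidean distance instead of inner product, and no compression. The names `build_ivf`, `ivf_search`, and `nprobe` are chosen for the example (FAISS uses `nprobe` for the same idea), and this is not how FAISS implements it internally.

```python
import random
from typing import List, Tuple

Vector = List[float]


def sq_dist(u: Vector, v: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest(centroids: List[Vector], vec: Vector) -> int:
    return min(range(len(centroids)), key=lambda c: sq_dist(centroids[c], vec))


def assign(items: List[Vector], centroids: List[Vector]) -> List[List[int]]:
    """Put each item index into the inverted list of its nearest centroid."""
    lists: List[List[int]] = [[] for _ in centroids]
    for idx, vec in enumerate(items):
        lists[nearest(centroids, vec)].append(idx)
    return lists


def build_ivf(items: List[Vector], n_lists: int, n_iter: int = 5,
              seed: int = 0) -> Tuple[List[Vector], List[List[int]]]:
    """Cluster items with a few k-means rounds; return centroids and inverted lists."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(items, n_lists)]
    dim = len(items[0])
    for _ in range(n_iter):
        for c, members in enumerate(assign(items, centroids)):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [
                    sum(items[i][d] for i in members) / len(members)
                    for d in range(dim)
                ]
    return centroids, assign(items, centroids)


def ivf_search(query: Vector, items: List[Vector], centroids: List[Vector],
               lists: List[List[int]], k: int, nprobe: int) -> List[int]:
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(centroids[c], query))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(items[i], query))[:k]


rng = random.Random(1)
items = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(500)]
centroids, lists = build_ivf(items, n_lists=10)
query = [rng.gauss(0, 1) for _ in range(4)]
approx = ivf_search(query, items, centroids, lists, k=5, nprobe=3)
exact = ivf_search(query, items, centroids, lists, k=5, nprobe=10)  # all lists = brute force
print("approx:", approx)
print("exact: ", exact)
```

Raising `nprobe` scans more clusters, trading speed for recall; probing every list reduces to exact search.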
How HNSW Works Visually
HNSW builds a multi-layer graph. Search starts at the top sparse layer for fast navigation, then descends to the dense base layer for precision. Query time grows roughly as O(log N), versus O(N) for brute force.
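The core move inside each HNSW layer is a greedy walk over a neighbour graph. The sketch below shows only that step on a single layer. It is a deliberate simplification: real HNSW builds its graph incrementally, stacks several layers, and keeps a list of candidates during search instead of a single current node. The brute-force graph construction and the function names are assumptions for the demo.

```python
import random
from typing import Dict, List, Tuple

Vector = List[float]


def sq_dist(u: Vector, v: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def build_knn_graph(items: List[Vector], m: int) -> Dict[int, List[int]]:
    """Link every node to its m nearest neighbours (brute force, fine for a demo)."""
    graph = {}
    for i, vec in enumerate(items):
        others = sorted(
            (j for j in range(len(items)) if j != i),
            key=lambda j: sq_dist(vec, items[j]),
        )
        graph[i] = others[:m]
    return graph


def greedy_search(query: Vector, items: List[Vector],
                  graph: Dict[int, List[int]], entry: int) -> Tuple[int, int]:
    """Walk the graph, always moving to the neighbour closest to the query."""
    current = entry
    current_dist = sq_dist(items[current], query)
    hops = 0
    while True:
        best = min(graph[current], key=lambda j: sq_dist(items[j], query))
        best_dist = sq_dist(items[best], query)
        if best_dist >= current_dist:
            return current, hops  # local minimum: no neighbour is closer
        current, current_dist = best, best_dist
        hops += 1


rng = random.Random(7)
points = [[rng.random(), rng.random()] for _ in range(300)]
graph = build_knn_graph(points, m=8)
query = [0.5, 0.5]
found, hops = greedy_search(query, points, graph, entry=0)
true_nearest = min(range(len(points)), key=lambda j: sq_dist(points[j], query))
print(f"greedy result: node {found} after {hops} hops; exact nearest: node {true_nearest}")
```

The greedy walk can stop at a local minimum that is not the true nearest neighbour, which is where the "approximate" in ANN comes from; the upper layers and the candidate list in full HNSW exist to make that less likely.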
Vector Similarity Search — The Mathematics
At the heart of Two-Tower retrieval is a single question: how similar are two vectors? Three metrics dominate in practice. Understanding when to use each is critical.
Most Two-Tower models apply L2 normalisation to the output of both towers before computing similarity. This converts the dot product into cosine similarity, constraining all vectors to lie on the unit hypersphere. Benefits: training is more stable, similarity scores are bounded between -1 and 1, and ANN indices designed for cosine similarity can be used directly. In Keras this can be written as tf.keras.layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1)); the towers built later in this article call tf.nn.l2_normalize directly.
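A short numeric check makes the effect of normalisation visible. The sketch assumes the three metrics in question are the usual trio of dot product, cosine similarity, and Euclidean distance; the two example vectors are arbitrary.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def norm(v: Vector) -> float:
    return math.sqrt(dot(v, v))


def cosine(u: Vector, v: Vector) -> float:
    return dot(u, v) / (norm(u) * norm(v))


def euclidean(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def l2_normalize(v: Vector) -> Vector:
    n = norm(v)
    return [x / n for x in v]


user = [3.0, 4.0]
item = [4.0, 3.0]
print(f"dot       = {dot(user, item):.2f}")
print(f"cosine    = {cosine(user, item):.2f}")
print(f"euclidean = {euclidean(user, item):.3f}")

# After L2 normalisation all three metrics rank items identically:
# dot product equals cosine, and squared distance equals 2 - 2 * cosine.
u_hat, i_hat = l2_normalize(user), l2_normalize(item)
print(f"dot of unit vectors         = {dot(u_hat, i_hat):.2f}")
print(f"squared distance, unit vecs = {euclidean(u_hat, i_hat) ** 2:.2f}")
```

Because the three metrics agree on unit vectors, an index built for any one of them returns the same neighbours.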
Why Embeddings Work — The Geometry of Preference
After training, the embedding space organises itself so that user vectors land near items they are likely to enjoy. ANN search finds these nearby items efficiently.
Training — Loss Functions and Sampling Strategy
Training a Two-Tower model is fundamentally a metric learning problem: push user and item embeddings for positive (user-item) pairs together, and push negatives apart. The choice of loss function and negative sampling strategy dramatically affects model quality.
In a batch of 1024 (user, item) pairs, each user's positive item appears once, but every other item in the batch acts as a negative for that user. This gives you 1024 × 1023 ≈ 1 million implicit negative pairs per batch at zero extra data cost. It's massively sample-efficient and is why large batch sizes are so important for Two-Tower training.
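The in-batch negatives idea corresponds to a softmax cross-entropy over the batch's user-item score matrix. The sketch below is a minimal pure-Python version for illustration; it assumes plain dot-product logits with no temperature or sampling-bias correction, both of which production systems often add.

```python
import math
from typing import List

Vector = List[float]


def in_batch_softmax_loss(user_vecs: List[Vector], item_vecs: List[Vector]) -> float:
    """Mean cross-entropy where row i's positive is item i; other items are negatives."""
    total = 0.0
    for i, u in enumerate(user_vecs):
        logits = [sum(a * b for a, b in zip(u, v)) for v in item_vecs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]  # -log softmax probability of the positive
    return total / len(user_vecs)


users = [[1.0, 0.0], [0.0, 1.0]]
aligned_items = [[1.0, 0.0], [0.0, 1.0]]     # each user matches its own item
misaligned_items = [[0.0, 1.0], [1.0, 0.0]]  # positives point the wrong way

loss_aligned = in_batch_softmax_loss(users, aligned_items)
loss_misaligned = in_batch_softmax_loss(users, misaligned_items)
print(f"aligned:    {loss_aligned:.4f}")
print(f"misaligned: {loss_misaligned:.4f}")

batch_size = 1024
print(f"negative pairs per batch: {batch_size * (batch_size - 1):,}")
```

The loss is lower when each user's embedding sits closest to its own positive item, which is exactly the geometry training is meant to produce.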
Project — Building a YouTube / Spotify-Style Recommender
We'll build a complete Two-Tower recommender system using TensorFlow Recommenders (TFRS) and FAISS. The architecture mirrors what YouTube and Spotify use in production, adapted for the MovieLens 1M dataset. By the end, you will have a working retrieval system with an ANN index.
Step 1 — Installation and Dataset
# Install required libraries (shell command; run it in a terminal, or prefix it with ! in a notebook)
pip install tensorflow-recommenders tensorflow-datasets faiss-cpu
import tensorflow as tf
import tensorflow_recommenders as tfrs
import tensorflow_datasets as tfds
import numpy as np
import faiss
from typing import Dict, Text
# Load MovieLens 1M — ratings + movie metadata
ratings = tfds.load("movielens/1m-ratings", split="train")
movies = tfds.load("movielens/1m-movies", split="train")
# Extract only what we need (TFDS names the genre feature "movie_genres")
ratings = ratings.map(lambda x: {
    "user_id": x["user_id"],
    "movie_id": x["movie_id"],
    "movie_title": x["movie_title"],
    "genres": x["movie_genres"],
})
movies = movies.map(lambda x: {
    "movie_id": x["movie_id"],
    "movie_title": x["movie_title"],
    "genres": x["movie_genres"],
})
# Build vocabularies
user_ids = ratings.map(lambda x: x["user_id"])
movie_ids = ratings.map(lambda x: x["movie_id"])
movie_titles = movies.map(lambda x: x["movie_title"])
# Adapt on batched datasets so each step processes a 1-D batch of strings
user_ids_vocab = tf.keras.layers.StringLookup(mask_token=None)
user_ids_vocab.adapt(user_ids.batch(10_000))
movie_ids_vocab = tf.keras.layers.StringLookup(mask_token=None)
movie_ids_vocab.adapt(movie_ids.batch(10_000))
movie_titles_vocab = tf.keras.layers.StringLookup(mask_token=None)
movie_titles_vocab.adapt(movie_titles.batch(1_000))
print(f"Users: {user_ids_vocab.vocabulary_size()}")
print(f"Movies: {movie_ids_vocab.vocabulary_size()}")
Step 2 — Building the Query Tower
class QueryTower(tf.keras.Model):
    """
    Encodes user context into a dense embedding vector.

    This simplified tower uses:
    - User ID embedding (identity signal)
    - Dense projection layers
    A YouTube-style query tower would also add sequence signals such as an
    averaged watch-history embedding; that part is not implemented here.
    """

    def __init__(self, user_vocab_size, embedding_dim=64, output_dim=128):
        super().__init__()
        # Map raw user ID strings to integer indices, then to embeddings
        self.user_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=user_ids_vocab.get_vocabulary(), mask_token=None
            ),
            tf.keras.layers.Embedding(user_vocab_size, embedding_dim),
        ])
        # Dense layers for projection
        self.dense_layers = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(output_dim),  # No activation: L2 norm comes next
        ])

    def call(self, inputs):
        user_emb = self.user_embedding(inputs["user_id"])
        x = self.dense_layers(user_emb)
        # L2 normalise → dot product = cosine similarity
        return tf.nn.l2_normalize(x, axis=1)
Step 3 — Building the Item Tower
class ItemTower(tf.keras.Model):
    """
    Encodes items (movies) into a dense embedding vector.
    Uses two signals:
    - Movie ID embedding (collaborative signal)
    - Title text embedding via bag-of-words (content signal)
    Genre is loaded with the data but not used by this simplified tower.
    Item embeddings are pre-computed offline and stored in the ANN index.
    """

    def __init__(self, movie_vocab_size, title_vocab_size,
                 embedding_dim=64, output_dim=128):
        super().__init__()
        # Movie ID embedding (ID-based collaborative signal)
        self.movie_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=movie_ids_vocab.get_vocabulary(), mask_token=None
            ),
            tf.keras.layers.Embedding(movie_vocab_size, embedding_dim),
        ])
        # Title text: split each title into word tokens, embed them, and average.
        # A StringLookup over whole titles yields one ID per title, which
        # GlobalAveragePooling1D cannot pool, so TextVectorization is used instead.
        # title_vocab_size caps the number of distinct tokens kept.
        self.title_vectorizer = tf.keras.layers.TextVectorization(
            max_tokens=title_vocab_size
        )
        self.title_vectorizer.adapt(movie_titles.batch(1024))  # movie_titles from Step 1
        self.title_embedding = tf.keras.Sequential([
            self.title_vectorizer,
            tf.keras.layers.Embedding(title_vocab_size, embedding_dim, mask_zero=True),
            tf.keras.layers.GlobalAveragePooling1D(),  # mean over token embeddings
        ])
        self.dense_layers = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(output_dim),
        ])

    def call(self, inputs):
        movie_emb = self.movie_embedding(inputs["movie_id"])
        title_emb = self.title_embedding(inputs["movie_title"])
        # Concatenate ID + content signals
        combined = tf.concat([movie_emb, title_emb], axis=1)
        x = self.dense_layers(combined)
        return tf.nn.l2_normalize(x, axis=1)
Step 4 & 5 — Assembling and Training the Full Model
class TwoTowerModel(tfrs.Model):
    """Full Two-Tower retrieval model with TFRS retrieval task."""

    def __init__(self, query_tower, item_tower, movies_dataset):
        super().__init__()
        self.query_tower = query_tower
        self.item_tower = item_tower
        # Retrieval task: in-batch softmax + factorised top-K metrics
        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=movies_dataset.batch(128).map(self.item_tower)
            )
        )

    def compute_loss(self, features, training=False):
        user_embeddings = self.query_tower(features)
        movie_embeddings = self.item_tower(features)
        return self.task(user_embeddings, movie_embeddings,
                         compute_metrics=not training)
# Instantiate towers
query_tower = QueryTower(user_ids_vocab.vocabulary_size())
item_tower = ItemTower(
    movie_ids_vocab.vocabulary_size(),
    movie_titles_vocab.vocabulary_size()
)
# Build full model
model = TwoTowerModel(query_tower, item_tower, movies)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))
# Train / validation split
# Note: only 100,000 of the roughly 1M ratings are used here (80K train, 20K validation);
# increase the shuffle buffer and the take/skip sizes to train on more data.
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)
train = shuffled.take(80_000)
val = shuffled.skip(80_000).take(20_000)
# Train: 10 epochs, batch size 4096
history = model.fit(
    train.batch(4096),
    validation_data=val.batch(4096),
    epochs=10,
    verbose=1,
)
top_100_categorical_accuracy = 0.4265 means that for 42.65% of held-out (user, movie) interactions, the movie the user actually watched appeared in the top-100 retrieved candidates. With one positive item per example, this is equivalent to Recall@100 (hit rate at 100). Higher is better. In production (YouTube, Spotify), the same metric is computed over catalogues of millions of items, often at larger cut-offs such as Recall@500; the 60–80% target range quoted here is a rough guide, not a published benchmark.
Step 6 — Building the FAISS ANN Index
# Step 6a: Generate embeddings for ALL movies (Item Tower inference)
movie_ids_all = []
movie_titles_all = []
for movie in movies.batch(512):
    movie_ids_all.extend(movie["movie_id"].numpy())
    movie_titles_all.extend(movie["movie_title"].numpy())
# Run Item Tower over all movies
movie_embeddings = item_tower.predict(
    tf.data.Dataset.from_tensor_slices({
        "movie_id": movie_ids_all,
        "movie_title": movie_titles_all,
    }).batch(512)
)  # shape: (num_movies, 128), one 128-d vector per movie
# Step 6b: Build HNSW index with FAISS
DIM = 128 # embedding dimension
M = 32 # HNSW graph connectivity (higher = better recall, more memory)
efC = 200 # construction-time search depth
# IndexHNSWFlat: exact distance computation within HNSW graph
index = faiss.IndexHNSWFlat(DIM, M, faiss.METRIC_INNER_PRODUCT)
index.hnsw.efConstruction = efC
# Embeddings are already L2-normalised, so inner product = cosine sim
index.add(movie_embeddings.astype("float32"))
print(f"Index built. Total vectors: {index.ntotal}")
# Optional: save index to disk
faiss.write_index(index, "movie_hnsw.index")
Step 7 — Real-Time Retrieval
def recommend(user_id: str, top_k: int = 10) -> list:
    """
    Given a user_id, return top-K movie recommendations
    using the Two-Tower model + FAISS HNSW index.
    Illustrative latency budget (not measured here):
    Query Tower (~2ms) + ANN search (~0.5ms) = ~2.5ms total
    """
    # 1. Run Query Tower: user context → 128-d vector
    # Calling the model directly avoids the per-call overhead of predict() for one input
    user_vector = query_tower(
        {"user_id": tf.constant([user_id])}
    ).numpy()  # shape: (1, 128)
    # 2. Query FAISS index: find top-K nearest movie vectors
    scores, indices = index.search(
        user_vector.astype("float32"), top_k
    )  # scores: cosine similarity, indices: position in movie array
    # 3. Map indices → movie metadata
    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx < 0:  # FAISS pads with -1 when fewer than top_k results are found
            continue
        movie_title = movie_titles_all[idx].decode("utf-8")
        results.append({
            "rank": len(results) + 1,
            "title": movie_title,
            "score": round(float(score), 4),
        })
    return results

# Test with User 42
recs = recommend("42", top_k=10)
for r in recs:
    print(f"#{r['rank']:2d} [{r['score']:.4f}] {r['title']}")
Production Architecture — YouTube vs Spotify vs TikTok
Different products have slightly different constraints that shape how they implement Two-Tower retrieval. Here's how the world's largest recommenders differ:
| Company | Item Corpus | Retrieval System | Key Innovation | Latency Target |
|---|---|---|---|---|
| YouTube | 800M+ videos | ScaNN (Google) | Watch-time weighted training, in-batch negatives with frequency correction | <10ms |
| Spotify | 100M+ tracks | Annoy + custom | Audio features (mel-spectrograms via CNNs) in Item Tower alongside metadata | <5ms |
| TikTok | Billions of videos | Custom ANN (internal) | Real-time user vector update within a session (stateful Query Tower) | <3ms |
| Pinterest | 800M+ pins | FAISS + custom | PinSage (Graph Neural Network Item Tower using pin-board graph) | <50ms |
| Netflix | ~15K titles | Brute-force (small corpus) | Contextual bandits for exploration, multi-objective optimisation (engagement + satisfaction) | <100ms |
Netflix has a relatively small catalogue (~15,000 titles). At this scale, it's cheaper and more accurate to just run a dot product against all item vectors than to use an ANN index. ANN typically becomes worthwhile once the catalogue grows into the hundreds of thousands to millions of items, depending on your latency budget. Don't optimise prematurely: start with brute-force and switch to ANN only when latency forces you to.
The Full Production Pipeline
Evaluation — How to Measure Retrieval Quality
Retrieval metrics measure whether the relevant items appear in the top-K candidates. They differ from ranking metrics (which care about the exact order).
| Metric | Cares About Order? | Use Case | Typical Target (Production) |
|---|---|---|---|
| Recall@K | No | Retrieval quality: are the right items in the pool? | >60% @500 |
| Hit Rate@K | No | Next-item prediction, simple binary check | >40% @50 |
| NDCG@K | Yes | Ranking quality: is position of relevant items good? | >0.35 @10 |
| MRR | Yes | Search-style evaluation, first result quality | >0.25 |
| ANN Recall vs Exact | No | Index quality: how much recall does ANN lose vs brute-force? | >98% |
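
The metrics in the table are straightforward to compute per query. The sketch below uses binary relevance and hypothetical item IDs. MRR is the mean of `reciprocal_rank` over all queries, and `ann_recall` compares an ANN result list with the exact top-K for the same query.

```python
import math
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hit_rate_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant item is in the top-k, else 0.0."""
    return 1.0 if any(item in relevant for item in retrieved[:k]) else 0.0


def ndcg_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(retrieved[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant item (0.0 if none is retrieved)."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ann_recall(ann_ids: List[str], exact_ids: List[str]) -> float:
    """Share of the exact top-K that the ANN index also returned."""
    return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)


retrieved = ["a", "b", "c", "d", "e"]  # hypothetical ranked candidates
relevant = {"b", "e", "z"}             # hypothetical ground truth

print(f"Recall@5   = {recall_at_k(retrieved, relevant, 5):.3f}")
print(f"HitRate@1  = {hit_rate_at_k(retrieved, relevant, 1):.1f}")
print(f"NDCG@5     = {ndcg_at_k(retrieved, relevant, 5):.3f}")
print(f"RR         = {reciprocal_rank(retrieved, relevant):.2f}")
print(f"ANN recall = {ann_recall(['a', 'b', 'c'], ['a', 'b', 'd']):.3f}")
```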
Advanced Techniques — Going Beyond the Basics
Golden Rules — Two-Tower in Production
From here, the natural next topics are: Ranking Models (DCN-v2, DeepFM, DLRM) that take your retrieved candidates and produce a final sorted list; Multi-Task Learning for simultaneous optimisation of engagement, satisfaction, and diversity; and Reinforcement Learning for Recommendation (SlateQ, cascading bandits) for long-horizon session optimisation beyond immediate click prediction.