
Why Deep Learning for Recommendation Systems?

A practitioner's deep dive into why deep learning has replaced traditional ML in modern recommendation systems. It covers the five hard limits of collaborative filtering and matrix factorisation, how neural networks model nonlinear interactions that a plain dot product cannot, and how industrial platforms like YouTube and Spotify personalise at billion-user scale. Includes architecture diagrams, real-world stories, comparison tables, and PyTorch implementations of a Two-Tower Network, Neural Collaborative Filtering, and SASRec.

Section 01

The Story That Explains Why Deep Learning Won Recommendations

The Librarian, the Algorithm, and the Million-Dollar Mistake
Imagine a librarian named Priya who has memorised every book in a 50,000-title library. A regular customer, Arjun, always borrows thrillers by Indian authors. Priya recommends a new one — perfect. Classic recommender logic: people who liked X liked Y.

One day, a stranger walks in. She has no borrowing history. She says: "I'm going through a divorce. I want something that will make me laugh, but also think." Priya freezes — her borrowing-history model has nothing to work with.

Now imagine a second librarian who has read every book, watched every film the customer mentioned, understands tone, theme, and emotional subtext — and can reason across all of it simultaneously. That librarian is Deep Learning. The first librarian is traditional ML.

Netflix learned a version of this lesson. The ensemble that won the $1M Netflix Prize in 2009 was never fully deployed, and in the years that followed the company moved much of its recommendation stack towards deep neural networks, because the problem had become too complex, too contextual, and too personal for linear algebra alone.

Recommendation systems are the invisible engines of the modern internet. They decide which product you see on Amazon, which video plays next on YouTube, which song Spotify queues up at 11 PM on a Tuesday. Getting them right is not just a technical problem; it is a business survival problem. At the scale of these platforms, even a 1% improvement in recommendation quality can be worth hundreds of millions of dollars.

🌿
The Core Question This Tutorial Answers

Why did the world's largest recommendation systems — Netflix, Spotify, YouTube, TikTok, Amazon — all migrate from traditional ML to deep learning? What specifically can deep neural networks do that collaborative filtering, matrix factorisation, and tree-based models cannot? This tutorial answers that question from first principles, with real examples and code.


Section 02

The Limits of Traditional ML in Recommendation Systems

Traditional recommender systems had a golden era — and a hard ceiling. Understanding why they fail is the key to understanding why deep learning succeeds. There are five fundamental limitations.

01
The Cold-Start Problem — No History, No Help
Collaborative filtering and matrix factorisation require a user's past behaviour to make predictions. A new user with zero interactions gets a generic recommendation — or nothing at all. Similarly, a new item with zero ratings cannot be recommended to anyone. At large platforms, 20–40% of users are always in cold-start state. Traditional ML has no principled way to handle this without auxiliary heuristics.
02
Linearity Assumption — The World Is Not Linear
Matrix factorisation decomposes the user-item interaction matrix as a dot product of two embedding matrices: R ≈ U·Vᵀ. This assumes user-item affinity is a linear combination of latent factors. But real preferences are conditional: a user may love horror novels and dislike horror films, so their affinity for "horror" depends on what it is combined with. A single dot product over fixed latent factors struggles to express that kind of conditional preference.
03
Feature Blindness — Ignoring Rich Side Information
Pure collaborative filtering uses only user-item interaction data. It ignores item metadata (genre, director, description, price, release date), user context (time of day, device, location), and session behaviour (scroll depth, dwell time, clicks without purchases). This rich side information is exactly what distinguishes a great recommendation from a generic one.
04
Static Embeddings — Ignoring Time and Context
Traditional embeddings are trained offline and updated weekly or daily. But user intent changes within a single session. What a user wants on Monday morning (news, productivity) is different from Friday night (entertainment). Collaborative filtering cannot model sequential intent — it treats all past interactions as equally relevant, regardless of recency or order.
05
Scalability of Interaction Modelling
With 100 million users and 10 million items, the interaction matrix has 10¹⁵ possible entries — but only 0.01% are observed. Traditional matrix factorisation struggles to scale to this regime efficiently, especially when re-ranking must happen in real-time (under 50ms) for every page load.
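The sparsity figures in limitation 05 are easy to verify by hand. The short calculation below uses only the numbers quoted above (100 million users, 10 million items, 0.01% observed); the four-bytes-per-entry storage width is an assumption added purely for illustration.

```python
# Back-of-the-envelope check of the interaction-matrix numbers quoted above.
NUM_USERS = 100_000_000      # 100 million users
NUM_ITEMS = 10_000_000       # 10 million items
BYTES_PER_ENTRY = 4          # assumption: one float32 per cell

possible_entries = NUM_USERS * NUM_ITEMS
observed_entries = possible_entries // 10_000   # 0.01% of entries are observed
dense_bytes = possible_entries * BYTES_PER_ENTRY

print(f"possible entries : {possible_entries:.1e}")
print(f"observed entries : {observed_entries:.1e}")
print(f"dense storage    : {dense_bytes / 1e15:.0f} PB")
```

Even the observed 0.01% amounts to a hundred billion interactions, which is why the sparse data itself, and not only the dense matrix, is expensive to re-train on.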
⚠️
The Netflix Prize Paradox

The winning Netflix Prize algorithm (BellKor's Pragmatic Chaos, 2009) improved RMSE by 10.06%. Netflix never fully deployed it. By the time it won, user behaviour had shifted from DVD rentals to streaming — changing the fundamental nature of the recommendation problem. Traditional ML had optimised for the wrong signal. Deep learning was ultimately deployed because it could adapt to new signals (watch time, re-watches, pause behaviour) that traditional models couldn't model.

| Limitation | Traditional ML Impact | Real-World Consequence |
|---|---|---|
| Cold start | No recommendations for new users/items | New users churn in first session; new items never get discovered |
| Linear interactions only | Cannot model complex taste patterns | Recommendations feel "obvious" and shallow (filter bubbles) |
| Feature blindness | Ignores metadata, context, session signals | Wrong time-of-day recommendations; ignores device context |
| Static embeddings | Cannot model session-level intent | Recommends horror movies to someone watching Christmas specials |
| Scalability | Slow re-training; high memory for huge matrices | Stale recommendations; cannot adapt to trending content in real-time |

Section 03

Why Deep Learning? The Six Superpowers

Deep learning does not just fix the limitations of traditional ML. It also opens capabilities that were previously out of reach. Here are the six transformative advantages.

🧠
Automatic Feature Learning
No manual feature engineering
Deep networks learn hierarchical representations directly from raw data — text descriptions, images, audio waveforms, click sequences. Features that would take a team of engineers months to engineer are discovered automatically through backpropagation.
🔀
Nonlinear Interaction Modelling
Universal function approximation
Neural networks are universal function approximators: in principle they can model any continuous relationship between user, item, and context features, however complex, conditional, or high-order. A dot product is a fixed bilinear form; a deep network can bend its decision surface to fit the data.
📋
Multi-Modal Fusion
Text + image + audio + behaviour
A deep recommender can simultaneously process a product's image (via CNN), its description (via BERT), its price (tabular), and the user's click history (via RNN/Transformer) — all in one unified model with shared gradient updates.
⏱️
Sequential Modelling
RNNs, Transformers, attention
Recurrent networks and Transformers can model the order of interactions. "User clicked action → horror → documentary" is a different intent signal than "documentary → horror → action." Traditional CF collapses all interactions into a bag.
❄️
Cold-Start via Content
Content-based warm-up
A new item's content (image, description, metadata) can be encoded into the same embedding space as collaborative signals. The model can immediately recommend it based on semantic similarity — no interaction history required.
🎯
End-to-End Optimisation
Direct business metric optimisation
Traditional pipelines have separate stages: feature engineering → embedding → ranking. Each stage optimises a proxy metric. Deep learning can optimise directly for the true business metric (clicks, purchases, watch time) end-to-end through a single loss function.
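The "Cold-Start via Content" card can be made concrete with a toy example. The sketch below embeds items purely from their tags, so an item with zero interactions still lands next to similar established items. The tag vectors, item names, and the averaging encoder are all hypothetical stand-ins for a trained content encoder.

```python
import math

# Hypothetical 3-d tag embeddings; a real system would learn these with a content encoder.
TAG_VECTORS = {
    "thriller": [0.9, 0.1, 0.0],
    "crime":    [0.8, 0.2, 0.1],
    "comedy":   [0.0, 0.9, 0.2],
    "romance":  [0.1, 0.8, 0.3],
}

def embed_item(tags):
    """Average the tag vectors: needs content only, no interaction history."""
    vectors = [TAG_VECTORS[t] for t in tags]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

catalogue = {
    "established_thriller": embed_item(["thriller", "crime"]),
    "established_romcom":   embed_item(["comedy", "romance"]),
}
new_item = embed_item(["crime", "thriller"])   # brand-new item, zero interactions

nearest = max(catalogue, key=lambda name: cosine(new_item, catalogue[name]))
print(nearest)   # established_thriller
```

Because the new item shares an embedding space with the catalogue, anything that retrieves neighbours of the established thriller can surface the new item on day one.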

Section 04

Deep Dive — Nonlinear Interactions in Recommendation

The Paradox of the Cooking-and-Horror Fan
Maya loves cooking shows and horror films. She hates cooking horror films (yes, that's a genre — think Hannibal). A linear model scores her affinity for an item as: score = w₁·(cooking_genre) + w₂·(horror_genre). Both weights are positive for Maya. So a cooking horror film scores very high — completely wrong.

A nonlinear model learns: cooking AND horror together → negative. This is a second-order interaction — the combination matters, not just the sum of parts. Deep networks discover these interactions automatically. Traditional matrix factorisation cannot.
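Maya's case can be checked mechanically. The sketch below searches a grid of weights for a purely additive scorer that likes cooking, likes horror, and dislikes the combination, and finds none; adding a single cross term makes the pattern expressible. The grid range and the example weights are arbitrary choices for illustration.

```python
from itertools import product

# Maya's constraints, relative to a neutral item with neither genre:
#   cooking only  -> liked,  horror only -> liked,  both together -> disliked
def ranks_correctly(score):
    neutral = score(0, 0)
    return (score(1, 0) > neutral and
            score(0, 1) > neutral and
            score(1, 1) < neutral)

grid = [w / 4 for w in range(-20, 21)]   # weights from -5.0 to 5.0 in steps of 0.25

# Purely additive scorer: count the (w1, w2) pairs that satisfy all three constraints.
linear_hits = sum(
    ranks_correctly(lambda c, h, w1=w1, w2=w2: w1 * c + w2 * h)
    for w1, w2 in product(grid, grid)
)

# One cross term w3 * cooking * horror is enough to express the pattern.
def with_interaction(c, h, w1=1.0, w2=1.0, w3=-3.0):
    return w1 * c + w2 * h + w3 * c * h

print(linear_hits)                         # 0
print(ranks_correctly(with_interaction))   # True
```

The grid search only illustrates a general fact: liking each genre alone forces w1 > 0 and w2 > 0, which forces w1 + w2 > 0, so the additive model must also like the combination.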

Why the Dot Product Fails

Matrix factorisation learns user embedding u and item embedding v and predicts affinity as r̂ = uᵀv. Geometrically, this asks: "how well aligned are these two vectors in latent space, scaled by their lengths?" But user-item affinity is not only about alignment. It often depends on complex, conditional, high-order relationships that a single inner product over fixed embeddings cannot express.

📊 Linear vs Nonlinear Decision Boundary in Recommendation Space
[Figure: two panels plotting Horror Affinity against Cooking Affinity. Left, Matrix Factorisation (Linear): a single boundary uᵀv = threshold cannot separate XOR-like patterns and mispredicts the horror + cooking case. Right, Deep Neural Network (Nonlinear): Input → H₁ (ReLU) → H₂ (ReLU) → Score learns a complex, curved decision boundary.]

Left: Matrix factorisation draws a linear hyperplane — it misclassifies the "cooking + horror = dislike" case. Right: A deep network learns a curved boundary, correctly identifying that the combination carries different meaning than either feature alone.

What "Higher-Order Interactions" Actually Means

A first-order interaction is: user likes action → recommend action films.
A second-order interaction is: user likes action AND Christopher Nolan → recommend Dunkirk.
A third-order interaction is: user likes action AND Nolan AND watches at 11 PM → recommend Interstellar (long, immersive).

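Traditional models capture higher-order interactions only when someone engineers them by hand (for example, as crossed features) or when the model fixes the order in advance, as factorisation machines do with pairwise terms. Deep networks can learn interactions of many orders implicitly from data.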
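To see what modelling these interactions explicitly would involve, the sketch below enumerates the first-, second-, and third-order crosses for the example above. The feature names are hypothetical, and the 100-feature count at the end is an arbitrary illustration of how quickly the number of crosses grows.

```python
import math
from itertools import combinations

# Hypothetical active features for one impression.
active = ["genre=action", "director=nolan", "hour=23"]

def crosses(features, order):
    """All conjunctions of exactly `order` active features."""
    return [" AND ".join(combo) for combo in combinations(features, order)]

for order in (1, 2, 3):
    print(order, crosses(active, order))

# A hand-engineered model needs one weight per cross it wants to use.
# With 100 raw features there are already this many third-order crosses:
print(math.comb(100, 3))   # 161700
```

A deep network avoids enumerating these crosses; its hidden layers can represent the useful ones without a dedicated weight for every combination.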

The Universal Approximation Theorem — Why It Matters for Recommendations

A feed-forward network with at least one hidden layer and a suitable nonlinear activation can approximate any continuous function on a bounded (compact) input domain to arbitrary precision, given enough hidden units. Applied to recommendations, this means that any continuous user-item affinity function can in principle be represented by a deep network. The theorem guarantees that such a network exists, not that training will find it or that the available data suffices. By contrast, a dot-product model is restricted to bilinear functions of its embeddings.
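As a small existence proof, the cooking-and-horror pattern from the story earlier in this section can be represented exactly by a network with one hidden layer and two ReLU units. The weights below are set by hand, not trained; they are one possible choice among many.

```python
def relu(x):
    return max(0.0, x)

def tiny_network(cooking, horror):
    """One hidden layer, two ReLU units, hand-set weights (not trained)."""
    h1 = relu(1.0 * cooking + 1.0 * horror)         # active when either genre is present
    h2 = relu(1.0 * cooking + 1.0 * horror - 1.0)   # active only when both are present
    return 1.0 * h1 - 3.0 * h2                      # output layer

for cooking, horror in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print((cooking, horror), tiny_network(cooking, horror))
# (0, 0) 0.0
# (1, 0) 1.0
# (0, 1) 1.0
# (1, 1) -1.0
```

The second hidden unit acts as a learned "both genres at once" detector, which is exactly the interaction an additive model has no way to express.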


Section 05

Large-Scale Personalisation — The Industrial Reality

YouTube's 800 Million Decision Problem
YouTube must decide which 20 videos to show each of its 800 million daily active users from a catalogue of 800 million videos. Scoring every user against every video would mean 6.4 × 10¹⁷ user-video pairs (800 million × 800 million). No SQL query, no matrix factorisation re-training, no tree ensemble handles this by brute force.

YouTube's solution (published in their 2016 paper) is a two-stage deep learning pipeline: a candidate generation network that narrows the corpus to a few hundred candidates, and a ranking network that scores those candidates and keeps the top results. Both are deep neural networks. At that scale, exhaustively scoring every pair with a traditional model was never a realistic option.

The Three-Stage Architecture of Industrial Recommenders

01
Stage 1 — Retrieval (Candidate Generation)
Reduce millions/billions of items to ~hundreds of candidates. Speed is paramount: must complete in <10ms. Tools: approximate nearest-neighbour search (FAISS, ScaNN) over deep user/item embeddings learned by a two-tower neural network. Traditional CF would require full matrix multiply over the entire item space — impossible at this scale.
02
Stage 2 — Ranking (Scoring)
Score each candidate with a rich deep model that uses all available features: user history, item metadata, context signals, cross-features. Can use complex architectures (attention, wide & deep, DCN) because candidate pool is small (~200 items). Latency budget: 20–40ms.
03
Stage 3 — Re-Ranking (Business Rules + Diversity)
Apply business constraints: freshness boosts, sponsored content insertion, diversity enforcement (no five videos from the same channel), safety filters. Some platforms use a final lightweight ML layer here. Output: the final ordered list shown to the user.
📊 Industrial Deep Recommender Architecture — From Billions to One Feed
[Figure: the item corpus (800M items) and the user profile (history, context, demographics) feed Stage 1 Retrieval (two-tower network with user and item encoders, ANN search with FAISS, <10ms, trained with sampled softmax over item embeddings), then Stage 2 Ranking (Wide & Deep / DCN over history, context, metadata, and cross-features, 20–40ms, trained with a pointwise, pairwise, or listwise loss), then Stage 3 Re-Ranking (business rules: diversity, freshness, sponsored slots, safety; often lightweight rule-based or learned), producing the top 20 personalised results in the user feed. Offline: embeddings trained weekly, ranking model trained daily. Online: real-time inference only.]

The three-stage deep recommender funnel: Retrieval reduces billions to hundreds, Ranking scores them with rich features, Re-Ranking applies business constraints. The first two stages are deep learning models; the third is often rule-based or a lightweight learned layer. Traditional ML pipelines struggle to operate at this scale with this speed.
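The funnel is easier to follow in code. The sketch below runs all three stages on a small synthetic catalogue. Everything in it is a stand-in: the embeddings are random, the ranking formula replaces a deep ranking model, exact sorting replaces ANN search, and the two-per-channel cap is one example of a diversity rule.

```python
import random

random.seed(0)
DIM = 8

def rand_vec():
    return [random.uniform(-1, 1) for _ in range(DIM)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical catalogue: each item has an embedding, a channel, and a freshness feature.
items = [
    {"id": i, "emb": rand_vec(), "channel": i % 20, "freshness": random.random()}
    for i in range(5_000)
]
user_emb = rand_vec()

# Stage 1, retrieval: cheap dot product over the whole catalogue (exact here, ANN in production).
candidates = sorted(items, key=lambda it: dot(user_emb, it["emb"]), reverse=True)[:200]

# Stage 2, ranking: a richer score computed only for the 200 candidates.
def rank_score(item):
    return dot(user_emb, item["emb"]) + 0.5 * item["freshness"]

ranked = sorted(candidates, key=rank_score, reverse=True)

# Stage 3, re-ranking: business rule, at most 2 items per channel in the final feed.
feed, per_channel = [], {}
for item in ranked:
    if per_channel.get(item["channel"], 0) < 2:
        feed.append(item)
        per_channel[item["channel"]] = per_channel.get(item["channel"], 0) + 1
    if len(feed) == 20:
        break

print(len(items), "->", len(candidates), "->", len(feed))
```

The expensive scoring function only ever sees 200 items, which is the design choice that lets the ranking stage use rich features within its latency budget.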

🎯
Scale Numbers That Make Traditional ML Impossible

Spotify: 600M users, 100M+ tracks. TikTok: 1B+ users, effectively infinite content. Amazon: 300M+ users, 350M+ products. At these scales, storing the full user-item interaction matrix in dense form is impractical: 300M × 350M is roughly 10¹⁷ cells, about 100 petabytes at one byte per cell. Deep learning sidesteps this with compressed embeddings: a user is represented by a 256-dimensional vector instead of 350M sparse interaction signals, a reduction of more than 99.99% in values per user while keeping most of the signal needed for ranking.
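The storage argument above can be checked with a few lines of arithmetic. The user, item, and embedding sizes are the Amazon-scale figures quoted in the text; the byte widths are assumptions chosen for illustration.

```python
# Figures quoted in the text (Amazon-scale); storage widths are assumptions.
NUM_USERS = 300_000_000
NUM_ITEMS = 350_000_000
EMBED_DIM = 256
BYTES_PER_CELL = 1       # assumption: one byte per dense interaction cell
BYTES_PER_FLOAT = 4      # assumption: float32 embeddings

dense_cells = NUM_USERS * NUM_ITEMS
dense_petabytes = dense_cells * BYTES_PER_CELL / 1e15

user_table_gigabytes = NUM_USERS * EMBED_DIM * BYTES_PER_FLOAT / 1e9
reduction_per_user = 1 - EMBED_DIM / NUM_ITEMS

print(f"dense matrix    : {dense_cells:.2e} cells, ~{dense_petabytes:.0f} PB")
print(f"user embeddings : ~{user_table_gigabytes:.0f} GB")
print(f"values per user : {NUM_ITEMS:,} -> {EMBED_DIM} ({reduction_per_user:.5%} fewer)")
```

A table of a few hundred gigabytes can be sharded across a handful of machines, whereas the dense matrix cannot realistically be materialised at all.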


Section 06

Key Deep Learning Architectures for Recommendations

| Architecture | Core Idea | Best For | Used By |
|---|---|---|---|
| Two-Tower Network | Separate encoders for user and item; dot product similarity | Retrieval stage; cold start | Google, YouTube, LinkedIn |
| Wide & Deep (Google, 2016) | Memorisation (wide linear) + Generalisation (deep MLP) in one model | Ranking; CTR prediction | Google Play Store |
| Neural CF (NCF) | Replaces dot product with MLP to learn nonlinear user-item interactions | Replacing matrix factorisation | Research baseline; Pinterest |
| DeepFM | Combines Factorisation Machines (FM) with deep MLP; automatic feature interaction | CTR; sparse high-dim features | Huawei, advertising systems |
| DCN (Deep & Cross Network) | Cross layers explicitly model feature interactions of bounded degree | Feature interaction at scale | Google, TensorFlow team |
| BERT4Rec / SASRec | Transformer self-attention over user's interaction history (sequential) | Session-based; sequential recs | Amazon, Alibaba |
| GNN-based (PinSage, NGCF) | Graph convolutions over user-item interaction graph | Social graphs; Pinterest-like feeds | Pinterest, Alibaba |

Section 07

Python Implementation — Two-Tower Neural Network

The Two-Tower (or dual encoder) network is the most widely deployed deep architecture for the retrieval stage. User and item each pass through their own neural encoder, producing a dense embedding. Similarity is a dot product or cosine. At inference, item embeddings are pre-computed and indexed for fast approximate nearest-neighbour search.
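The inference path described above, pre-computing item embeddings once and then searching them per request, can be sketched as follows. The embeddings here are random stand-ins for the outputs of trained towers, and the search is an exact brute-force top-K with `torch.topk`; an ANN index would approximate the same result faster on a large catalogue.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
NUM_ITEMS, EMBED_DIM, TOP_K = 50_000, 64, 10

# Offline: encode every item once and store the matrix
# (random stand-ins here, in place of a trained item tower's outputs).
item_index = F.normalize(torch.randn(NUM_ITEMS, EMBED_DIM), dim=-1)

# Online: encode the user (again a random stand-in), then score against the index.
user_emb = F.normalize(torch.randn(1, EMBED_DIM), dim=-1)

scores = user_emb @ item_index.T                        # (1, NUM_ITEMS) cosine similarities
top_scores, top_ids = torch.topk(scores, k=TOP_K, dim=-1)

print(top_ids.squeeze(0).tolist())
```

Only the user tower runs at request time; the item side is a lookup into a matrix that was built offline, which is what makes the architecture compatible with tight latency budgets.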

📐 Two-Tower Architecture — Components
User Tower
Input: user ID, age, watch history (avg embedding), device → MLP → 64-dim user embedding
Item Tower
Input: item ID, genre, tags, duration → MLP → 64-dim item embedding
Similarity
Score = dot(user_emb, item_emb). Production systems typically train this with a sampled softmax loss; the listing below uses the simpler pairwise BPR loss.
Inference
Pre-index all item embeddings with FAISS → retrieve top-K for any user in <10ms
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from torch.utils.data import Dataset, DataLoader

# ─── Configuration ────────────────────────────────────────
NUM_USERS   = 10_000
NUM_ITEMS   = 50_000
EMBED_DIM   = 64     # shared embedding dimension for both towers
HIDDEN_DIM  = 128
BATCH_SIZE  = 512
LEARNING_RATE = 1e-3
EPOCHS      = 10

# ─── User Tower ───────────────────────────────────────────
# Both towers below embed IDs only; the side features listed above are omitted for brevity.
class UserTower(nn.Module):
    def __init__(self, num_users, embed_dim, hidden_dim):
        super().__init__()
        self.user_embed = nn.Embedding(num_users, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, embed_dim)
        )
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, user_ids):
        x = self.user_embed(user_ids)     # (B, embed_dim)
        x = self.mlp(x)
        return F.normalize(self.norm(x), dim=-1)  # L2 normalise

# ─── Item Tower ───────────────────────────────────────────
class ItemTower(nn.Module):
    def __init__(self, num_items, embed_dim, hidden_dim):
        super().__init__()
        self.item_embed = nn.Embedding(num_items, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, embed_dim)
        )
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, item_ids):
        x = self.item_embed(item_ids)
        x = self.mlp(x)
        return F.normalize(self.norm(x), dim=-1)

# ─── Two-Tower Model ──────────────────────────────────────
class TwoTowerModel(nn.Module):
    def __init__(self, num_users, num_items, embed_dim, hidden_dim):
        super().__init__()
        self.user_tower = UserTower(num_users, embed_dim, hidden_dim)
        self.item_tower = ItemTower(num_items, embed_dim, hidden_dim)
        # Learn the temperature in log space so it stays strictly positive
        self.log_temperature = nn.Parameter(torch.log(torch.tensor(0.07)))

    def forward(self, user_ids, pos_item_ids, neg_item_ids):
        # Encode user and items (call the modules directly so hooks run)
        user_emb = self.user_tower(user_ids)          # (B, D)
        pos_emb  = self.item_tower(pos_item_ids)      # (B, D)
        neg_emb  = self.item_tower(neg_item_ids)      # (B, D)

        # Cosine similarity (embeddings are L2-normalised), scaled by temperature
        temperature = self.log_temperature.exp()
        pos_score = (user_emb * pos_emb).sum(dim=-1) / temperature
        neg_score = (user_emb * neg_emb).sum(dim=-1) / temperature

        # BPR (Bayesian Personalised Ranking) loss
        loss = -F.logsigmoid(pos_score - neg_score).mean()
        return loss

# ─── Synthetic Dataset ────────────────────────────────────
class InteractionDataset(Dataset):
    def __init__(self, n_samples, num_users, num_items):
        self.users    = torch.randint(0, num_users, (n_samples,))
        self.pos_items = torch.randint(0, num_items, (n_samples,))
        self.neg_items = torch.randint(0, num_items, (n_samples,))

    def __len__(self): return len(self.users)

    def __getitem__(self, idx):
        return self.users[idx], self.pos_items[idx], self.neg_items[idx]

# ─── Training Loop ────────────────────────────────────────
dataset    = InteractionDataset(200_000, NUM_USERS, NUM_ITEMS)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

model     = TwoTowerModel(NUM_USERS, NUM_ITEMS, EMBED_DIM, HIDDEN_DIM)
optimiser = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimiser, T_max=EPOCHS)

for epoch in range(EPOCHS):
    model.train()
    total_loss = 0.0
    for batch_users, batch_pos, batch_neg in dataloader:
        optimiser.zero_grad()
        loss = model(batch_users, batch_pos, batch_neg)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimiser.step()
        total_loss += loss.item()
    scheduler.step()
    avg_loss = total_loss / len(dataloader)
    print(f"Epoch {epoch+1:02d}/{EPOCHS} | Loss: {avg_loss:.4f} | LR: {scheduler.get_last_lr()[0]:.5f}")
OUTPUT (illustrative; the data is random, so loss values vary from run to run. The LR column is the value reported after each scheduler step.)
Epoch 01/10 | Loss: 0.6831 | LR: 0.00098
Epoch 02/10 | Loss: 0.5912 | LR: 0.00090
Epoch 03/10 | Loss: 0.5204 | LR: 0.00079
Epoch 04/10 | Loss: 0.4771 | LR: 0.00065
Epoch 05/10 | Loss: 0.4435 | LR: 0.00050
Epoch 06/10 | Loss: 0.4187 | LR: 0.00035
Epoch 07/10 | Loss: 0.3991 | LR: 0.00021
Epoch 08/10 | Loss: 0.3840 | LR: 0.00010
Epoch 09/10 | Loss: 0.3729 | LR: 0.00002
Epoch 10/10 | Loss: 0.3658 | LR: 0.00000
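The component list for this section names sampled softmax, while the listing trains with BPR. A minimal sketch of the in-batch variant of sampled softmax is shown below, assuming each row of the batch is an observed (user, item) pair and every other item in the batch serves as a negative. The function name and the fixed temperature are illustrative choices, and the sketch omits the correction for popular items being over-sampled as in-batch negatives.

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_emb, item_emb, temperature=0.07):
    """
    user_emb, item_emb: (B, D) L2-normalised embeddings where row i of each
    tensor is an observed (user, item) pair. All other items in the batch
    act as negatives for user i.
    """
    logits = user_emb @ item_emb.T / temperature           # (B, B) similarity matrix
    targets = torch.arange(user_emb.size(0), device=user_emb.device)
    return F.cross_entropy(logits, targets)                # positives sit on the diagonal

# Usage with random stand-in embeddings
torch.manual_seed(0)
users = F.normalize(torch.randn(8, 64), dim=-1)
items = F.normalize(torch.randn(8, 64), dim=-1)

loss_random = in_batch_softmax_loss(users, items).item()
loss_aligned = in_batch_softmax_loss(users, users).item()  # perfectly matched pairs
print(f"random pairs: {loss_random:.4f} | aligned pairs: {loss_aligned:.4f}")
```

Compared with BPR, each positive is contrasted against B - 1 negatives per step instead of one, at no extra encoding cost, because the item embeddings are already computed for the batch.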

Section 08

Python Implementation — Neural Collaborative Filtering (NCF)

NCF (He et al., 2017) directly replaces the dot product of matrix factorisation with a multilayer perceptron. It concatenates user and item embeddings and passes them through hidden layers to learn arbitrarily complex interaction functions — the nonlinear generalisation of matrix factorisation.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import numpy as np

class NeuralCF(nn.Module):
    """
    Neural Collaborative Filtering — combines GMF + MLP paths.
    GMF: element-wise product of embeddings (generalised MF).
    MLP: concatenated embeddings through deep layers.
    Final: concat GMF + MLP outputs → sigmoid prediction.
    """
    def __init__(self, num_users, num_items,
                 gmf_dim=32, mlp_embed_dim=32,
                 mlp_layers=(128, 64, 32), dropout=0.2):
        super().__init__()

        # GMF embeddings
        self.gmf_user = nn.Embedding(num_users, gmf_dim)
        self.gmf_item = nn.Embedding(num_items, gmf_dim)

        # MLP embeddings
        self.mlp_user = nn.Embedding(num_users, mlp_embed_dim)
        self.mlp_item = nn.Embedding(num_items, mlp_embed_dim)

        # MLP layers: input is concat of user + item embeddings
        mlp_in = mlp_embed_dim * 2
        layers = []
        for out_dim in mlp_layers:
            layers += [nn.Linear(mlp_in, out_dim),
                       nn.ReLU(),
                       nn.Dropout(dropout)]
            mlp_in = out_dim
        self.mlp = nn.Sequential(*layers)

        # Final prediction layer
        self.output = nn.Linear(gmf_dim + mlp_layers[-1], 1)

        self._init_weights()

    def _init_weights(self):
        for embed in [self.gmf_user, self.gmf_item,
                       self.mlp_user, self.mlp_item]:
            nn.init.normal_(embed.weight, std=0.01)
        nn.init.xavier_uniform_(self.output.weight)

    def forward(self, user_ids, item_ids):
        # GMF path: element-wise product
        gmf_out = self.gmf_user(user_ids) * self.gmf_item(item_ids)

        # MLP path: concat → hidden layers
        mlp_input = torch.cat([
            self.mlp_user(user_ids),
            self.mlp_item(item_ids)
        ], dim=-1)
        mlp_out = self.mlp(mlp_input)

        # Combine paths and predict
        combined = torch.cat([gmf_out, mlp_out], dim=-1)
        score = torch.sigmoid(self.output(combined).squeeze(-1))
        return score

# ─── Training with BCE Loss ───────────────────────────────
NUM_USERS, NUM_ITEMS = 5_000, 20_000
N_SAMPLES = 300_000

# Synthetic user-item interactions (1 = interaction, 0 = negative sample)
users   = torch.randint(0, NUM_USERS, (N_SAMPLES,))
items   = torch.randint(0, NUM_ITEMS, (N_SAMPLES,))
labels  = torch.randint(0, 2, (N_SAMPLES,)).float()

dataset    = TensorDataset(users, items, labels)
dataloader = DataLoader(dataset, batch_size=1024, shuffle=True)

model     = NeuralCF(NUM_USERS, NUM_ITEMS)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
criterion = nn.BCELoss()

for epoch in range(5):
    model.train()
    total_loss = 0.0
    for batch_u, batch_i, batch_y in dataloader:
        optimiser.zero_grad()
        preds = model(batch_u, batch_i)
        loss  = criterion(preds, batch_y)
        loss.backward()
        optimiser.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}/5 | BCE Loss: {total_loss/len(dataloader):.4f}")
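OUTPUT (illustrative; the labels are random, so the loss stays close to ln 2 ≈ 0.693 and any decrease reflects memorisation rather than learning)
Epoch 1/5 | BCE Loss: 0.6934
Epoch 2/5 | BCE Loss: 0.6891
Epoch 3/5 | BCE Loss: 0.6847
Epoch 4/5 | BCE Loss: 0.6802
Epoch 5/5 | BCE Loss: 0.6751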
🌿
NCF vs Matrix Factorisation — What Changes

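In matrix factorisation, the interaction function is fixed: f(u, v) = uᵀv (a dot product). In NCF, the interaction function is f(u, v) = MLP([u; v]), learned from data with far fewer constraints on its shape. This single change brings the nonlinear modelling power of deep networks to the interaction itself. The original paper reports that NCF outperforms its MF baselines on MovieLens-1M in Hit Rate@10, although later studies found that a carefully tuned dot-product model can close much of that gap, so always compare against a well-tuned baseline.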
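The relationship between the GMF path in the listing and plain matrix factorisation can be checked numerically. In the sketch below, a GMF score with all output weights set to one (and no sigmoid) equals the dot product exactly; any other weight vector re-weights the latent dimensions. The embeddings and weights are made-up values.

```python
def dot_product(u, v):
    return sum(a * b for a, b in zip(u, v))

def gmf_score(u, v, h):
    """GMF: element-wise product of embeddings, then one weight per dimension."""
    return sum(w * a * b for w, a, b in zip(h, u, v))

u = [0.5, -1.0, 2.0]     # illustrative user embedding
v = [1.5, 0.5, 0.25]     # illustrative item embedding

uniform = [1.0, 1.0, 1.0]
learned = [2.0, 0.0, -1.0]   # hypothetical learned output weights

print(dot_product(u, v))          # 0.75
print(gmf_score(u, v, uniform))   # 0.75 (identical to matrix factorisation)
print(gmf_score(u, v, learned))   # 1.0  (dimensions re-weighted)
```

Matrix factorisation is therefore a special case of the GMF path, and the MLP path adds capacity on top of it.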


Section 09

Sequential Recommendation — Transformers Capture User Intent Over Time

One of the most powerful advantages of deep learning over traditional methods is the ability to model sequential patterns in user behaviour. A user who watches a documentary then searches for flights has a different next-item intent than one who does the reverse. Transformers, via self-attention, can model these dependencies across arbitrary sequence lengths.

import torch
import torch.nn as nn
import math

class SASRec(nn.Module):
    """
    Self-Attentive Sequential Recommendation (SASRec).
    Uses Transformer encoder over the user's interaction history
    to predict the next item.
    Reference: Kang & McAuley, ICDM 2018.
    """
    def __init__(self, num_items, max_seq_len=50,
                 d_model=64, n_heads=2, n_layers=2,
                 dropout=0.2):
        super().__init__()
        self.item_embed = nn.Embedding(num_items + 1, d_model, padding_idx=0)
        self.pos_embed  = nn.Embedding(max_seq_len, d_model)
        self.dropout    = nn.Dropout(dropout)
        self.norm       = nn.LayerNorm(d_model)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=d_model * 4,
            dropout=dropout,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.output_proj = nn.Linear(d_model, num_items)

    def forward(self, seq):
        """
        seq: (B, L) — item ID sequence, 0-padded.
        Returns logits over all items for the next position.
        """
        B, L = seq.shape
        positions = torch.arange(L, device=seq.device).unsqueeze(0)  # (1, L)

        # Item + positional embeddings
        x = self.item_embed(seq) + self.pos_embed(positions)
        x = self.dropout(self.norm(x))

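        # Causal mask: True marks positions a query may NOT attend to (the future).
        # A boolean mask keeps the dtype consistent with the padding mask below.
        causal_mask = torch.triu(
            torch.ones(L, L, dtype=torch.bool, device=seq.device),
            diagonal=1
        )

        # Padding mask: True marks 0-padded positions to ignore.
        # Pad on the right so every position can attend to at least one real item.
        pad_mask = (seq == 0)

        x = self.transformer(x, mask=causal_mask, src_key_padding_mask=pad_mask)

        # Project every position to item scores; the caller reads the
        # last non-padded position to get the next-item prediction
        logits = self.output_proj(x)  # (B, L, num_items)
        return logits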

# ─── Inference example ────────────────────────────────────
NUM_ITEMS = 10_000
model = SASRec(num_items=NUM_ITEMS)
model.eval()   # disable dropout so repeated calls give the same scores

# Simulate a user session: watched items [42, 817, 3, 1204, 88]
user_seq = torch.tensor([[42, 817, 3, 1204, 88]])
with torch.no_grad():   # no gradients needed at inference
    logits = model(user_seq)

# Predict next item from the last position
next_item_scores = logits[0, -1, :]           # scores over all items
top5_items = torch.topk(next_item_scores, k=5)
print("Top-5 recommended next items:", top5_items.indices.tolist())
print("Scores:                      ", [f"{s:.3f}" for s in top5_items.values.tolist()])
OUTPUT (illustrative; the model is untrained, so the item IDs and scores are arbitrary)
Top-5 recommended next items: [7823, 291, 5504, 9182, 612]
Scores:                       ['0.142', '0.138', '0.131', '0.129', '0.121']
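The premise of this section, that order carries signal, can be shown without any model. The sketch below takes the two click orders from the Sequential Modelling card in Section 03 and compares how an order-agnostic "bag" view and a position-aware view see them. The helper name is illustrative.

```python
from collections import Counter

session_a = ["action", "horror", "documentary"]
session_b = ["documentary", "horror", "action"]

# Bag-of-interactions view: what an order-agnostic model sees.
same_bag = Counter(session_a) == Counter(session_b)

# Sequence view: pair every item with its position, as a positional embedding does.
def with_positions(session):
    return [(pos, item) for pos, item in enumerate(session)]

same_sequence = with_positions(session_a) == with_positions(session_b)

print(same_bag)        # True: the two sessions are indistinguishable
print(same_sequence)   # False: the order is preserved
print(with_positions(session_a)[-1], with_positions(session_b)[-1])
# (2, 'documentary') (2, 'action')
```

The most recent item differs between the two sessions, and that is the signal a causal Transformer reads when it predicts from the last position.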

Section 10

Traditional ML vs Deep Learning — Full Comparison

| Dimension | Collaborative Filtering / MF | Deep Learning Recommender |
|---|---|---|
| Interaction modelling | Linear dot product only | Arbitrary nonlinear interactions |
| Cold start | Cannot handle new users/items | Content encoding bridges the gap |
| Side features (metadata, context) | Ignored in pure CF | Natively incorporated |
| Sequential/temporal modelling | Not possible | RNNs, Transformers, attention |
| Multi-modal signals | Single signal type only | Text + image + audio + behaviour |
| Scalability (100M+ items) | Matrix too large; slow re-training | ANN over pre-computed embeddings |
| Interpretability | Relatively interpretable | Black box; needs attention maps/SHAP |
| Training data required | Less data needed | Needs millions of interactions |
| Infrastructure complexity | Simple; runs on CPU | GPU training, ANN indexing, serving infra |
| End-to-end optimisation | Multi-stage pipelines; proxy losses | Direct business metric optimisation |
| Peak accuracy (large scale) | Good (limited by linear constraint) | State of the art; 5–20% better |

Section 11

Golden Rules — Deep Learning for Recommendation Systems

🌿 Deep RecSys — Rules You Must Know
1
Start with matrix factorisation as your baseline. Before building a deep model, implement a simple MF or BPR baseline. If the deep model does not beat it by a clear margin (say ≥5% on your offline metric), the bottleneck is more likely data quality than model architecture. Fix your data first.
2
The Two-Tower architecture is your first deep model. It is the most widely deployed retrieval architecture for a reason: user and item embeddings are computed independently, enabling pre-computation of all item embeddings and fast ANN lookup at inference. Never try to use a cross-attention model for retrieval over millions of items — it does not scale.
3
Negative sampling strategy is as important as model architecture. Random negatives (uniformly sampled from all items) are too easy: the model quickly learns to score positives higher. Use hard negatives: items the user almost interacted with (appeared in search but not clicked). Mixing hard and easy negatives is common practice; a 1:1 ratio is a reasonable starting point to tune from.
4
Separate retrieval and ranking concerns. The retrieval model must be fast (ANN-compatible) and therefore constrained to dot-product similarity. The ranking model can be arbitrarily complex because it only scores ~200 candidates. Never try to build a single model that does both — you will sacrifice either accuracy or latency.
5
Embedding dimension is a hyperparameter with diminishing returns. Going from 32→64 dimensions usually gives a significant boost. Going from 256→512 rarely does. Start at 64; only scale up if retrieval recall is demonstrably insufficient. Larger embeddings also increase ANN index memory and lookup latency roughly linearly.
6
Optimise for what you actually care about — not RMSE. RMSE measures rating prediction accuracy. But your business cares about clicks, purchases, watch time, or retention. Use Hit Rate@K, NDCG@K, or MAP@K as your offline metric — and always A/B test online, because offline metrics are imperfect proxies.
7
Handle the cold-start problem explicitly. Train your item tower on content features (text descriptions, images, metadata) not just IDs. This lets a brand-new item with zero interactions be immediately embedded in the same space as established items — solved via content similarity rather than collaborative signal.
8
Freshness and diversity are not free — engineer them in. Deep models maximise relevance, which leads to repetitive, homogeneous recommendations. Always add a re-ranking stage that enforces intra-list diversity (e.g. maximum marginal relevance) and freshness boosts (recency-weighted scores). Failure to do this leads to "filter bubble" effects and user disengagement over time.
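Rule 8 mentions maximum marginal relevance (MMR) as one way to enforce intra-list diversity. A minimal greedy sketch follows, assuming relevance scores come from the ranking model and that some item-to-item similarity is available. The candidate IDs, scores, the same-genre similarity, and the default `lam` are all hypothetical.

```python
def mmr_rerank(candidates, relevance, similarity, k, lam=0.7):
    """
    Greedy maximal marginal relevance.
    candidates: list of item ids
    relevance:  dict id -> relevance score from the ranking model
    similarity: function (id, id) -> similarity in [0, 1]
    lam:        1.0 = pure relevance, 0.0 = pure diversity
    """
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def mmr(item):
            # Penalise an item by its similarity to the closest already-selected item
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical candidates: three near-duplicate horror titles and one documentary.
genre = {"h1": "horror", "h2": "horror", "h3": "horror", "d1": "documentary"}
relevance = {"h1": 0.95, "h2": 0.90, "h3": 0.85, "d1": 0.70}

def same_genre(a, b):
    return 1.0 if genre[a] == genre[b] else 0.0

print(mmr_rerank(list(genre), relevance, same_genre, k=3, lam=1.0))  # ['h1', 'h2', 'h3']
print(mmr_rerank(list(genre), relevance, same_genre, k=3, lam=0.7))  # ['h1', 'd1', 'h2']
```

With `lam=1.0` the list is all horror; lowering it to 0.7 promotes the documentary to second place even though its relevance score is the lowest. The right value is a product decision to validate in A/B tests.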