
User-Item Interaction Matrix

A comprehensive technical tutorial on the foundational data structure of every collaborative filtering system — the user-item interaction matrix. Covers how to build it from raw interaction logs, how user and item vectors encode taste, the three sparse matrix formats (COO, CSR, CSC) and their memory savings, how to visualise sparsity patterns, and the five critical ways sparsity degrades recommendation quality — with Python code, annotated SVG diagrams, and a practitioner's solution toolkit.

Section 01

The Cold Start Problem — When the Engine Has No Fuel

The New Doctor in Town
Dr. Amara just graduated from medical school and opened her first practice in a new city. She is brilliant — top of her class, published researcher, outstanding clinical training. But on day one, she faces a problem no textbook prepared her for: she has no patient history. No records. No referrals. No reputation. New patients arrive and ask: "Is this doctor good? Should I trust her?" The town has no way to answer — not because she isn't good, but because there is simply no data yet.

Meanwhile, Dr. Pemberton across the street has been practising for 30 years. His waiting room is full. Patients recommend him to friends. His reputation is self-reinforcing — every satisfied patient generates more referrals, more data, more trust.

This is the Cold Start Problem in recommendation systems. A new user arrives with no interaction history — the system cannot personalise for them. A new item appears in the catalogue — no user has rated it, so the system cannot recommend it. A brand new platform launches — no one has interacted with anything yet. In all three cases, the recommendation engine has no fuel to run on. It is cold. It cannot start.

And just like Dr. Amara, the items and users suffering from cold start are not necessarily worse — they are simply unknown. The system's failure to recommend them is a failure of data, not of quality.
🎯
Formal Definition

The Cold Start Problem refers to the inability of a collaborative filtering-based recommendation system to make accurate, personalised recommendations for entities — users, items, or entire systems — that have insufficient interaction history. It is one of the most pervasive and practically important challenges in production recommender systems, affecting every platform regardless of size. Solving it requires moving beyond pure interaction data to leverage content features, contextual signals, and auxiliary information.

📐 Three Faces of Cold Start — Taxonomy
Figure: the three variants of cold start (insufficient interaction data).
👤 New user: registers with zero interaction history. Who are they? Unknown. Most common; hardest to solve.
📦 New item: added with no ratings and no interaction history. Only metadata is known. Solvable via content features.
🌐 System cold start: a brand new platform with no users, no items, no interactions. Must be bootstrapped from scratch.

Each cold start variant has distinct causes and solutions. New user and new item cold start occur continuously on any growing platform. System cold start is a one-time bootstrap challenge faced by every new product.

Type | Who Suffers | Root Cause | Frequency | Primary Solution
New User | Every first-time visitor | Empty interaction row in matrix | Constant (every signup) | Onboarding, demographics, content-based
New Item | Recently added content | Empty interaction column in matrix | Frequent (catalogue grows) | Content features, hybrid CF+CB
System | Entire platform at launch | No historical data whatsoever | One-time (at launch only) | Editorial curation, external data, popularity

Section 02

The New User Problem

Netflix's First Week Problem
Sofia signs up for Netflix on a Friday evening. She is excited. She clicks on the homepage and is shown the same titles as every other new subscriber — the most popular content globally. She doesn't want popular content. She wants her content: obscure Korean thrillers, 1970s Italian westerns, long documentary series about natural history. But the algorithm has no idea. To the system, Sofia is just another cold, anonymous row of zeros in a 300-million-row matrix.

Netflix has reported, from its own consumer research, that a typical member loses interest after roughly 60–90 seconds of browsing; if nothing appealing has turned up by then, the risk of that member abandoning the service rises substantially. New users, about whom the system knows least, are the ones most likely to hit that limit. The cold start problem isn't just a technical inconvenience: it is a direct revenue threat, measurable in churn.

The new user problem occurs because collaborative filtering requires interaction history to compute user similarity. With no rated items, no watched videos, no purchased products, the user vector is entirely empty — every entry is NaN. You cannot compute meaningful cosine similarity between an empty vector and any other vector.

📐 New User Cold Start — The Empty Row Problem
Figure: user-item matrix with an empty row for the new user.
        I1  I2  I3  I4  I5  I6
User A  ★5  ?   ★4  ?   ★2  ★3
User B  ?   ★5  ?   ★4  ?   ★5
User C  ★3  ?   ★4  ★1  ?   ★4
NEW     ?   ?   ?   ?   ?   ?    ← empty row: cannot compute similarity to any other user
Legend: ★n = observed rating; ? = unknown (for the new user, all unknowns).

The new user's row contains only unknowns. Every collaborative filter needs at least some observed entries to compute meaningful similarity. Without them, the system falls back to non-personalised strategies.
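To make the empty-row problem concrete, here is a minimal sketch on the toy matrix from the figure above, with NaN marking unobserved ratings. The helper name `cosine_on_overlap` and the choice to compare users only on co-rated items are illustrative assumptions, not a prescribed method.

```python
import numpy as np

# Toy ratings from the figure above; NaN marks an unobserved rating.
R = np.array([
    [5, np.nan, 4, np.nan, 2, 3],        # User A
    [np.nan, 5, np.nan, 4, np.nan, 5],   # User B
    [3, np.nan, 4, 1, np.nan, 4],        # User C
    [np.nan] * 6,                        # new user: nothing observed
])

def cosine_on_overlap(u, v):
    """Cosine similarity over co-rated items; None when there is no overlap."""
    both = ~np.isnan(u) & ~np.isnan(v)
    if not both.any():
        return None  # similarity is undefined here, not zero
    a, b = u[both], v[both]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_a_c = cosine_on_overlap(R[0], R[2])
sims_new = [cosine_on_overlap(R[3], R[i]) for i in range(3)]
print(f"User A vs User C: {sim_a_c:.3f}")
print("New user vs A, B, C:", sims_new)
```

Returning `None` rather than 0 is a deliberate choice in this sketch: a similarity of 0 would claim the new user is unlike everyone, whereas the honest answer is that nothing is known yet.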

The Exploration-Exploitation Dilemma for New Users

The new user problem is fundamentally an exploration-exploitation trade-off. To learn what a user likes, the system must show them items — but without knowing what they like, it doesn't know which items to show. Every recommendation is simultaneously a guess and a data collection opportunity. The faster the system learns, the sooner personalisation kicks in — and the sooner the user gets value.
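One simple way to act on this trade-off is an epsilon-greedy policy: most of the time recommend from the genre with the best observed click rate, and occasionally pick a genre at random to keep learning. The sketch below is a simulation under stated assumptions; `true_click_prob`, `epsilon`, and the genre list are all hypothetical, and production systems typically use more refined bandit methods.

```python
import numpy as np

rng = np.random.default_rng(0)
genres = ["thriller", "western", "documentary", "comedy"]
# Hypothetical ground truth the system cannot see: per-genre click probability.
true_click_prob = np.array([0.7, 0.5, 0.6, 0.1])

shows = np.zeros(len(genres))    # times each genre was recommended
clicks = np.zeros(len(genres))   # clicks observed per genre
epsilon = 0.2                    # fraction of recommendations spent exploring

for _ in range(500):
    if shows.sum() == 0 or rng.random() < epsilon:
        g = int(rng.integers(len(genres)))        # explore: random genre
    else:
        rates = clicks / np.maximum(shows, 1)
        g = int(np.argmax(rates))                 # exploit: best estimate so far
    shows[g] += 1
    clicks[g] += float(rng.random() < true_click_prob[g])   # simulated feedback

estimated = clicks / np.maximum(shows, 1)
for name, n, est in zip(genres, shows, estimated):
    print(f"{name:12s} shown={int(n):3d}  est_click_rate={est:.2f}")
```

Raising `epsilon` speeds up learning at the cost of more irrelevant recommendations in the short term, which is the dilemma described above in miniature.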

Cold Start Threshold
k_min ≈ 10–20 interactions
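A common rule of thumb is that collaborative filtering becomes reliable once a user has 10–20 observed interactions. Below this, content-based and demographic models tend to outperform CF. The exact threshold depends on the platform and should be validated on your own data.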
Revenue Impact
Churn ∝ 1 / relevance(t=0)
First-session recommendation quality is often a strong predictor of 30-day retention. Poor cold start recommendations directly increase churn on subscription platforms. The proportionality above is a heuristic, not a fitted model.
Data Sparsity Rate
sparsity = 1 − |R_obs| / (|U|×|I|)
New users contribute entirely empty rows, worsening overall matrix sparsity. Until they interact, they add nothing to neighbourhood computations and cannot serve as neighbours for anyone else.
Warm-up Rate
t_warm = k_min / avg_interactions_per_session
If a user interacts with 3 items per session, warm-up requires ~4–7 sessions. Design the onboarding flow to accelerate this.
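The sparsity and warm-up formulas above are simple enough to check directly. The following sketch assumes NaN marks an unobserved cell; the function names `sparsity` and `warmup_sessions` are illustrative.

```python
import math
import numpy as np

def sparsity(R):
    """1 - |R_obs| / (|U| x |I|), with NaN marking unobserved cells."""
    return 1 - np.count_nonzero(~np.isnan(R)) / R.size

def warmup_sessions(k_min, per_session):
    """Whole sessions needed to reach k_min interactions."""
    return math.ceil(k_min / per_session)

# Tiny 2 x 3 example: 3 of 6 cells observed.
R = np.array([[5, np.nan, 4],
              [np.nan, np.nan, 2]], dtype=float)

print(f"sparsity = {sparsity(R):.2f}")
print("sessions to warm up at 3 interactions/session:",
      warmup_sessions(10, 3), "to", warmup_sessions(20, 3))
```

With k_min between 10 and 20 and 3 interactions per session, this reproduces the 4 to 7 sessions quoted above.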

Section 03

The New Item Problem

The Invisible Masterpiece
In 2019, a small independent filmmaker released a debut film on a major streaming platform. The film had no stars, no marketing budget, no press coverage. But it was extraordinary — a haunting, original thriller with stunning cinematography. Critics who eventually saw it called it a masterpiece.

The problem: the recommendation algorithm had never heard of it. With zero ratings, zero watch history, zero user interactions, the film's column in the interaction matrix was entirely empty. The collaborative filter had nothing to go on — it couldn't find any users who had "liked films similar to this one" because it had no evidence of anyone watching it at all.

The film sat invisible for six months. Then a single influential reviewer posted about it on social media. Suddenly 10,000 users watched it in a week — all giving it five stars. The algorithm noticed, updated, and within days it was appearing everywhere. But six months of viewership had been lost to cold start. The algorithm rewarded the already-visible and punished the newly-born.

The new item problem is the column-side equivalent of the new user problem. When a new item is added to the catalogue, its column in the interaction matrix is entirely empty. No user has rated it, clicked it, or watched it. Collaborative filtering cannot place it relative to other items because there is no co-rating data to compare against.

📐 New Item Cold Start — The Empty Column Problem
Figure: user-item matrix (users Ali, Ben, Cara, Dev; items I1–I5 plus a NEW item). Existing items have scattered star ratings; the NEW item's column is entirely empty, so CF cannot place it.
Content features available for new items:
• Genre: Thriller, Sci-Fi
• Director, cast metadata
• Runtime, release year
• Text description (TF-IDF)
→ Use these to find similar known items.

Content-based filtering is the natural solution to item cold start — it bypasses the empty column entirely by representing items through their features rather than their interaction history.


Section 04

Data Sparsity — The Cold Start Amplifier

The More You Grow, the Emptier You Become
A small music streaming startup launches with 1,000 users and 500 songs. Each user listens to an average of 50 songs. The interaction matrix is 1,000 × 500 = 500,000 cells with 50,000 observations — a density of 10%. Collaborative filtering works beautifully.

Two years later the platform has grown to 5 million users and 10 million songs. Users now listen to about 200 songs on average. The matrix is 5M × 10M = 50 trillion cells with 1 billion observations: a density of 0.002%. The matrix is effectively empty. Cold start now affects most items (new releases in a catalogue of 10 million) and most users (the millions who joined in the last three months).

Growth makes cold start worse, not better. The paradox of success: the more popular your platform becomes, the harder personalisation gets — unless you actively engineer solutions.
📊 Matrix Density Collapse as Platforms Grow
Chart: matrix density (y-axis) against platform scale, users × items (x-axis). Startup ~20%, Growth ~8%, Scale ~1%, Hyperscale <0.01%. Cold start gets worse as you grow.

Matrix density collapses as the platform grows because the denominator (total possible cells) scales as U×I, while observed interactions scale only as roughly U×k, where k is the average number of interactions per user. Users rarely increase k in proportion to the catalogue, so each added item dilutes the matrix further.
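The scaling argument can be written out explicitly. This is a short derivation under the same simplifying assumption as the text (each user contributes about k observations, independent of catalogue size); the symbols follow the sparsity formula given earlier.

```latex
\begin{aligned}
\text{density} &= \frac{|R_{\text{obs}}|}{|U| \times |I|}
  \approx \frac{|U| \cdot k}{|U| \times |I|} = \frac{k}{|I|}, \\
\text{sparsity} &= 1 - \text{density} \approx 1 - \frac{k}{|I|}.
\end{aligned}
```

For the startup example, k = 50 and |I| = 500 give a density of 50/500 = 10%. The user count cancels: adding users leaves density unchanged, while doubling the catalogue at constant k halves it.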

# ── Simulating density collapse as platform grows ─────────────
# Assume the average user interacts with k items regardless of catalogue size
k_avg       = 50   # avg lifetime interactions per user
k_min       = 10   # interactions needed before CF becomes reliable
per_session = 3    # avg interactions per session

platform_stages = [
    {'stage':'Startup',    'users':1_000,     'items':500},
    {'stage':'Early',     'users':50_000,    'items':5_000},
    {'stage':'Growth',    'users':500_000,   'items':50_000},
    {'stage':'Scale',     'users':5_000_000, 'items':500_000},
    {'stage':'Netflix',   'users':280_000_000,'items':15_000},
]

for p in platform_stages:
    total       = p['users'] * p['items']                 # all possible cells
    observed    = p['users'] * min(k_avg, p['items'])     # cells actually filled
    density     = observed / total
    cold_window = k_min / per_session   # sessions before warm-up (t_warm)
    print(f"{p['stage']:12s} | "
          f"users={p['users']:>12,} | "
          f"items={p['items']:>8,} | "
          f"density={density:>10.4%} | "
          f"cold_sessions≈{cold_window:.1f}")
OUTPUT
Startup      | users=       1,000 | items=     500 | density=  10.0000% | cold_sessions≈3.3
Early        | users=      50,000 | items=   5,000 | density=   1.0000% | cold_sessions≈3.3
Growth       | users=     500,000 | items=  50,000 | density=   0.1000% | cold_sessions≈3.3
Scale        | users=   5,000,000 | items= 500,000 | density=   0.0100% | cold_sessions≈3.3
Netflix      | users= 280,000,000 | items=  15,000 | density=   0.3333% | cold_sessions≈3.3
⚠️
The Paradox of Scale

Matrix density collapses from 10% at Startup to 0.01% at Scale: a 1,000× degradation in data richness per cell. The Netflix-like row is denser than Scale (about 0.33%) only because its catalogue is small; density tracks interactions per user divided by catalogue size, not user count. Cold start affects proportionally more of the matrix at scale than at startup. This is why pure neighbourhood-based collaborative filtering becomes impractical at Netflix or YouTube scale: the matrix is too sparse for most pairs of users or items to share enough co-ratings. Dense embeddings and hybrid approaches become mandatory, not optional, as you grow.


Section 05

Solutions — Solving New User Cold Start

📋
Onboarding Survey
Explicit preference elicitation
Show new users a curated set of items (artists, genres, topics) and ask them to select preferences. Netflix's "Pick 3 genres you enjoy", Spotify's "Choose 3 artists" — even 3–5 explicit preferences dramatically narrow the taste space. Converts cold start into a warm start within minutes of registration.
👥
Demographic Clustering
Proxy user profile
Group new users by demographic attributes (age, location, device, language, referral source) and assign them the average preference vector of similar existing users. "Users who signed up via Instagram from the UK, aged 18–24, tend to enjoy X." Crude but effective as a bootstrap — better than global popularity.
🔥
Popularity Fallback
Safe default
Show globally or contextually popular items when no user profile exists. Stratify by context — trending in user's country, trending in user's inferred age group, trending today vs all-time. Not personalised, but better than random. Keep as short-term fallback only — replace as soon as interactions accumulate.
🤖
Meta-Learning (MAML)
Few-shot learning
Model-Agnostic Meta-Learning trains a model to learn quickly from very few examples. The system learns how to learn user preferences from 5–10 interactions rather than requiring hundreds. Deployed at Alibaba, Pinterest, and others for rapid personalisation of new user experiences.
Session-Based Models
Zero-history personalisation
Recommend based on current session behaviour alone — no history needed. Within 3–5 clicks, a session-based model (GRU4Rec, BERT4Rec) can infer intent and produce personalised recommendations. Ideal for anonymous users and the first session of new registered users.
🧠
Transfer Learning
Cross-domain signals
If users have connected accounts (Google, Facebook, Spotify), leverage their preferences from those platforms to seed the initial profile. A user's Spotify listening history can bootstrap a podcast recommendation engine even before a single podcast interaction occurs.
import pandas as pd
import numpy as np

# ── Onboarding-based warm start ───────────────────────────────
# Existing users with known genre preferences
users = pd.DataFrame({
    'user'     : ['Ali', 'Ben', 'Cara', 'Dev'],
    'age_grp'  : ['18-24', '25-34', '18-24', '25-34'],
    'sci_fi'   : [0.9, 0.3, 0.8, 0.4],
    'thriller' : [0.7, 0.6, 0.6, 0.8],
    'comedy'   : [0.2, 0.9, 0.3, 0.5],
    'drama'    : [0.5, 0.7, 0.9, 0.6],
})

# ── New user says: "I like Sci-Fi and Thriller" (onboarding) ─
new_user_prefs = {'sci_fi': 1, 'thriller': 1, 'comedy': 0, 'drama': 0}
genre_cols = ['sci_fi', 'thriller', 'comedy', 'drama']

# ── Find most similar existing users by onboarding vector ────
new_vec = np.array([new_user_prefs[g] for g in genre_cols])

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    d = np.linalg.norm(a) * np.linalg.norm(b)
    return np.dot(a, b) / d if d > 0 else 0.0

# Round to 4 decimals so the printed table is compact and stable
users['similarity'] = users[genre_cols].apply(
    lambda row: cosine(row.values, new_vec), axis=1
).round(4)
top_neighbours = users.nlargest(2, 'similarity')

print("New user onboarding preferences:")
print("  Sci-Fi=1, Thriller=1, Comedy=0, Drama=0\n")
print("Nearest neighbours from onboarding match:")
print(top_neighbours[['user','sci_fi','thriller','similarity']]
      .to_string(index=False))

# ── Bootstrap: use neighbours' item ratings as warm start ─────
item_ratings = {
    'Ali'  : {'Dune':5, 'Arrival':4, 'Interstellar':5},
    'Cara' : {'Dune':4, 'Tenet':5, 'Blade Runner':5},
}
# A list keeps the print order deterministic (a set would not);
# items are collected in neighbour order, skipping duplicates.
warm_start_items = []
for u in top_neighbours['user']:
    for item in item_ratings.get(u, {}):
        if item not in warm_start_items:
            warm_start_items.append(item)
print("\nWarm-start recommendations for new user:")
for item in warm_start_items:
    print(f"  → {item}")
OUTPUT
New user onboarding preferences:
  Sci-Fi=1, Thriller=1, Comedy=0, Drama=0

Nearest neighbours from onboarding match:
user  sci_fi  thriller  similarity
 Ali     0.9       0.7      0.8972
Cara     0.8       0.6      0.7182

Warm-start recommendations for new user:
  → Dune
  → Arrival
  → Interstellar
  → Tenet
  → Blade Runner
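The Demographic Clustering card above can be sketched in the same style: average the genre preferences of existing users per segment and hand that vector to a new user from the same segment. This is a minimal illustration that reuses the toy user table; the `prior_for` helper and the fallback to a global average for unseen segments are assumptions.

```python
import pandas as pd

users = pd.DataFrame({
    "user":     ["Ali", "Ben", "Cara", "Dev"],
    "age_grp":  ["18-24", "25-34", "18-24", "25-34"],
    "sci_fi":   [0.9, 0.3, 0.8, 0.4],
    "thriller": [0.7, 0.6, 0.6, 0.8],
    "comedy":   [0.2, 0.9, 0.3, 0.5],
    "drama":    [0.5, 0.7, 0.9, 0.6],
})
genre_cols = ["sci_fi", "thriller", "comedy", "drama"]

# Average preference vector per demographic segment, plus a global fallback.
segment_prior = users.groupby("age_grp")[genre_cols].mean()
global_prior = users[genre_cols].mean()

def prior_for(age_grp):
    """Segment average if the segment has been seen, else the global average."""
    if age_grp in segment_prior.index:
        return segment_prior.loc[age_grp]
    return global_prior

prior = prior_for("18-24")
print(prior.sort_values(ascending=False))
```

A new 18-24 user would start with a profile leaning towards sci-fi and drama before a single interaction; the prior should be discarded as soon as real interactions accumulate.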

Section 06

Solutions — Solving New Item Cold Start

🛠️ Content-Based Bridging — The Primary Solution for Item Cold Start
Step 1
Extract item features: For the new item, extract all available metadata — genre, director, cast, plot keywords, duration, language, release year. Convert to a feature vector using TF-IDF, one-hot encoding, or embeddings.
Step 2
Find similar known items: Compute cosine similarity between the new item's feature vector and every existing item's feature vector. The most similar items are its "content neighbours".
Step 3
Inherit collaborative signals: Treat the new item as if it had the average interaction history of its content neighbours. Users who rated similar items highly are predicted to rate the new item similarly.
Step 4
Accelerate accumulation: Actively surface the new item to users predicted to enjoy it based on content similarity. This bootstraps interaction data faster than passive exposure alone.
Step 5
Transition to CF: Once the item accumulates 50–100 interactions, it has enough signal for collaborative filtering to take over. The content-based proxy is gradually replaced by genuine CF signals.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# ── Existing items with content features ──────────────────────
catalogue = pd.DataFrame({
    'title': ['Inception', 'Interstellar', 'Dune',
               'Arrival', 'The Matrix', 'Parasite'],
    'soup' : [
        'scifi thriller dream nolan heist subconscious mind',
        'scifi epic space nolan wormhole relativity love',
        'scifi epic desert villeneuve political chosen hero',
        'scifi alien language time villeneuve linguistics',
        'scifi action hacker simulation reality rebellion',
        'thriller drama class korean dark social satire',
    ],
    'avg_rating': [4.8, 4.7, 4.5, 4.3, 4.6, 4.9],
})

# ── New item just added to catalogue ─────────────────────────
new_item = {
    'title': 'Tenet',
    'soup' : 'scifi action espionage nolan time reversal spy',
}

# ── Build TF-IDF on all items including new one ───────────────
all_soups = list(catalogue['soup']) + [new_item['soup']]
tfidf     = TfidfVectorizer()
tfidf_m   = tfidf.fit_transform(all_soups)

# New item vector is the last row
new_vec   = tfidf_m[-1]
exist_m   = tfidf_m[:-1]

# ── Compute similarity to existing items ──────────────────────
# Rounded to 4 decimals so ties break by catalogue order, not float noise
sims = cosine_similarity(new_vec, exist_m).flatten().round(4)
catalogue['content_sim'] = sims

# ── Find top-k content neighbours ────────────────────────────
top_k    = catalogue.nlargest(3, 'content_sim')

print(f"New item: '{new_item['title']}'")
print("\nContent neighbours (cold-start bridge):")
print(top_k[['title','content_sim','avg_rating']].to_string(index=False))

# ── Inherit predicted rating from neighbours ─────────────────
# Similarity-weighted mean of the neighbours' average ratings
weighted_sum = (top_k['content_sim'] * top_k['avg_rating']).sum()
weight_total = top_k['content_sim'].sum()
predicted    = weighted_sum / weight_total

print(f"\nPredicted rating for '{new_item['title']}' "
      f"(from content bridge): {predicted:.2f} / 5.0")
OUTPUT
New item: 'Tenet'

Content neighbours (cold-start bridge):
     title  content_sim  avg_rating
   Arrival       0.1886         4.3
The Matrix       0.1826         4.6
 Inception       0.1386         4.8

Predicted rating for 'Tenet' (from content bridge): 4.54 / 5.0
Content Bridge — Zero Interactions Required

Tenet was added with zero interaction data. Its content features still overlap with known titles: Arrival (scifi, time; 0.19), The Matrix (scifi, action; 0.18) and Inception (scifi, nolan; 0.14). From these the system can immediately predict a rating of 4.54/5 and surface the film to users who loved those titles, without waiting for a single real rating. Two caveats show up even in this toy example. The absolute similarities are low, because the tag vocabularies are tiny and almost every title shares "scifi". And Interstellar ties with Inception at 0.14 (both share only scifi and nolan with Tenet), losing third place purely on catalogue order; richer features such as cast or plot embeddings would separate such cases. This is the content bridge: using feature similarity to cross the cold start gap.
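Step 3 above ("inherit collaborative signals") can also be applied per user rather than to a single average rating. The sketch below builds a pseudo-rating for the new item for each user, as the similarity-weighted mean of that user's ratings on the content neighbours. The neighbour names, ratings, and similarity values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings on three content neighbours; NaN = not rated.
ratings = pd.DataFrame(
    {"Neighbour A": [5, np.nan, 4, 2],
     "Neighbour B": [4, 5, np.nan, 1],
     "Neighbour C": [np.nan, 4, 5, 2]},
    index=["Ali", "Ben", "Cara", "Dev"],
)
# Hypothetical content similarity between the new item and each neighbour.
content_sim = pd.Series(
    {"Neighbour A": 0.5, "Neighbour B": 0.3, "Neighbour C": 0.2})

# Weight each rating by content similarity; unrated cells get zero weight.
weights = ratings.notna().astype(float).mul(content_sim, axis=1)
pseudo = (ratings.fillna(0) * weights).sum(axis=1) / weights.sum(axis=1)

print(pseudo.round(2).sort_values(ascending=False))
```

Users with the highest pseudo-ratings are the natural first audience for the new item (Step 4). A user who has rated none of the neighbours gets a zero weight total and therefore NaN, which correctly signals that no inherited estimate is available.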


Section 07

Advanced Solutions — Hybrid and Adaptive Strategies

The Smart Switching Thermostat
A smart thermostat on its first day has no idea when you wake up, when you leave, or how warm you like the bedroom. It starts in "default mode" — a generic schedule based on what works for most households. It watches. It learns. After three days it knows your morning routine. After two weeks it understands your weekend patterns. After a month it's running your preferences perfectly — and the default mode is gone, replaced entirely by personalisation.

The best recommendation systems work exactly this way: they have a graceful degradation ladder — starting with population-level defaults and progressively personalising as evidence accumulates. The switch from one approach to the next is seamless, automatic, and threshold-triggered.
📐 The Personalisation Ladder — From Cold to Warm to Hot
Figure: the personalisation ladder. Personalisation quality increases with each rung.
Rung 1: Global Popularity (0 events). Show top-trending items globally. Non-personalised fallback.
Rung 2: Demographic / Onboarding Priors (1–5 events). Onboarding quiz answered or demographics known; genre-filtered popularity by segment.
Rung 3: Content-Based + Session Signals (5–20 events). TF-IDF / embedding similarity plus real-time session context.
Rung 4: Hybrid CF + Content (20–50 events). Weighted blend of early CF signals with content features.
Rung 5: Full Collaborative Filtering (50+ events). Warm user.

The system automatically advances up the ladder as user interactions accumulate. Thresholds are tunable — stricter platforms (healthcare, finance) may require more evidence before switching rungs.

# ── Adaptive cold-start routing system ───────────────────────

def get_recommendations(user_id, n_interactions, user_features=None):
    """
    Route to the appropriate recommendation strategy
    based on how many interactions the user has accumulated.
    """

    if n_interactions == 0:
        strategy = 'global_popularity'
        recs     = ['Top Movie 1', 'Top Movie 2', 'Top Movie 3']

    elif n_interactions < 5:
        strategy = 'demographic_prior'
        segment  = user_features.get('segment', 'general') if user_features else 'general'
        recs     = [f'Segment-{segment} Top 1', f'Segment-{segment} Top 2',
                   f'Onboarding Genre Pick 1']

    elif n_interactions < 20:
        strategy = 'content_based_session'
        recs     = ['Content-Similar A', 'Session-Inferred B',
                   'Genre-Match C']

    elif n_interactions < 50:
        strategy = 'hybrid_cf_cb'
        cf_weight = n_interactions / 50         # more CF as data grows
        cb_weight = 1 - cf_weight
        recs = [f'Hybrid(CF={cf_weight:.0%}, CB={cb_weight:.0%}) Item {i}'
                for i in range(1, 4)]

    else:
        strategy = 'full_collaborative_filtering'
        recs     = ['CF-Personalised A', 'CF-Personalised B',
                   'CF-Personalised C']

    return strategy, recs

# ── Simulate a user's journey from cold to warm ───────────────
stages = [(0, None), (3, {'segment':'sci_fi_fans'}),
          (12, None), (35, None), (80, None)]

print(f"{'Interactions':>14} {'Strategy':>30} {'Top Recommendation'}")
print("-" * 75)
for n, feats in stages:
    strat, recs = get_recommendations('new_user', n, feats)
    print(f"{n:>14} {strat:>30}   {recs[0]}")
OUTPUT
  Interactions                       Strategy Top Recommendation
---------------------------------------------------------------------------
             0              global_popularity   Top Movie 1
             3              demographic_prior   Segment-sci_fi_fans Top 1
            12          content_based_session   Content-Similar A
            35                   hybrid_cf_cb   Hybrid(CF=70%, CB=30%) Item 1
            80   full_collaborative_filtering   CF-Personalised A
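The router above only labels the hybrid rung with its CF/CB weights. A minimal sketch of what the blend itself might look like follows; the scores are hypothetical, and the linear schedule (CF weight = interactions / 50, capped at 1) simply mirrors the `cf_weight` rule in the routing code rather than any tuned production setting.

```python
import numpy as np

def hybrid_scores(cf_scores, cb_scores, n_interactions, full_cf_at=50):
    """Linear blend; the CF weight grows with evidence and is capped at 1."""
    w_cf = min(n_interactions / full_cf_at, 1.0)
    return w_cf * np.asarray(cf_scores) + (1 - w_cf) * np.asarray(cb_scores)

items = ["Item 1", "Item 2", "Item 3"]
cf = [0.2, 0.9, 0.5]   # hypothetical collaborative-filtering scores
cb = [0.9, 0.3, 0.6]   # hypothetical content-based scores

for n in (20, 35, 50):
    blended = hybrid_scores(cf, cb, n)
    best = items[int(np.argmax(blended))]
    print(f"n={n:2d}  blended={np.round(blended, 2)}  top={best}")
```

With these numbers the top item switches from the content-favoured Item 1 at 20 interactions to the CF-favoured Item 2 at 35, showing how the ranking drifts towards collaborative signals as evidence accumulates.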

Section 08

Complete Solutions Reference — Cold Start Toolkit

Solution | Targets | Complexity | Effectiveness | Used By
Onboarding Survey | New User | Low (UI only) | High (immediate warm start) | Netflix, Spotify, YouTube
Popularity Fallback | Both | Trivial | Low (not personalised) | Every platform as baseline
Demographic Clustering | New User | Low | Medium (segment-level) | Amazon, Zalando
Content-Based Filtering | New Item | Medium | High (feature-based) | Pandora, Medium, LinkedIn
Session-Based Models | New User | Medium-High | High (real-time personalisation) | Zalando, Booking.com
Meta-Learning (MAML) | New User | High (complex training) | Very High (few-shot learning) | Alibaba, Pinterest
Content Bridge (CB→CF) | New Item | Medium | High (seamless transition) | Netflix, Spotify
Hybrid (CF+CB) Blend | Both | Medium-High | Very High (robust) | YouTube, Netflix, Amazon
Transfer Learning | New User, System | High (cross-domain) | Very High (leverages external data) | Apple, Google, Meta
🎯 Golden Rules for Cold Start Engineering
1. Never show a new user a generic homepage. Even a minimal onboarding quiz (3 genre selections, a single preference tick) can reduce cold start duration from weeks to minutes. Every interaction at signup is worth 10× an interaction during normal browsing. Design onboarding to collect maximum signal with minimum friction.
2. Build a personalisation ladder and advance it automatically. Define interaction thresholds (0, 5, 20, 50 events) and the strategy each triggers. This ensures every user gets the best available personalisation at every moment without manual engineering of edge cases.
3. Never let new items stay invisible. Add every new item to the content-based index immediately upon ingestion. Compute its content neighbours. Surface it to the users most predicted to enjoy it within 24 hours of catalogue addition. Cold start is a data scarcity problem: the solution is always to generate data faster.
4. Treat item cold start as a content problem, not a CF problem. Collaborative filtering cannot help a new item, because it has no co-ratings to learn from. Stop trying to force CF to handle it. Use content features, editorial curation, and active exploration to bootstrap the item's interaction history as fast as possible.
5. Monitor cold start metrics as a production KPI. Track: what fraction of recommendations are served to cold-start users? What is the click-through rate for cold-start recommendations vs warm? What is the 30-day retention difference? Cold start is a business metric, not just a research problem.
6. Use exploration to accelerate warm-up. Deliberately include diverse items in early recommendations to maximise information gain about a new user's preferences. The goal of the first 10 recommendations is not to maximise immediate click-rate; it is to learn as much as possible about the user as fast as possible. Accept short-term accuracy loss for long-term personalisation gain.
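The KPIs in rule 5 can be computed from an ordinary impression log. The sketch below assumes a hypothetical log with one row per recommendation shown, the user's history size at that moment, and whether it was clicked; the column names and the 20-interaction cut-off are illustrative choices, not a standard.

```python
import pandas as pd

# Hypothetical impression log: one row per recommendation shown.
log = pd.DataFrame({
    "user":           ["u1", "u1", "u2", "u2", "u3", "u3", "u4", "u4"],
    "n_interactions": [0, 1, 3, 4, 60, 61, 200, 201],   # history size when shown
    "clicked":        [0, 1, 0, 0, 1, 1, 0, 1],
})

COLD_THRESHOLD = 20  # assumed cut-off between cold and warm users
log["segment"] = (log["n_interactions"] < COLD_THRESHOLD).map(
    {True: "cold", False: "warm"})

# Impressions, click-through rate, and traffic share per segment.
kpi = log.groupby("segment")["clicked"].agg(impressions="count", ctr="mean")
kpi["share_of_traffic"] = kpi["impressions"] / len(log)
print(kpi)
```

Tracking the gap between the cold and warm click-through rates over time shows whether onboarding and content-based improvements are actually closing it.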