Explicit vs Implicit Feedback: Ratings, Clicks, Watch Time

Section 01

The Signal Behind Every Recommendation

📖 Real-World Story

Two Kinds of Opinion

Imagine two friends who both loved the same film. The first friend walks out of the cinema and immediately writes a 5-star review with a 400-word essay about the cinematography. The second friend says nothing — but she watches it again the following Saturday, then recommends it to three colleagues on Monday, and pauses every time the trailer plays on television.

Which friend gave you more information about how much she loved the film?

The first friend gave you explicit feedback — a direct, intentional, verbal declaration of preference. The second friend gave you implicit feedback — a rich stream of behavioural signals, none of them labelled, all of them honest. The second friend's signals are harder to read, but they are far more plentiful, and in many ways more truthful — because behaviour is difficult to fake and requires no effort to produce.

This is the central tension of modern recommender systems. Explicit feedback is clean but rare. Implicit feedback is noisy but abundant. Mastering both — knowing when to trust each, how to process each, and how to combine them — is the craft at the heart of recommendation engineering.

🎯

Why This Distinction Matters Enormously

Netflix reports that fewer than 1% of users rate content explicitly. Amazon receives explicit reviews on fewer than 2% of purchases. Spotify users almost never thumb-rate individual songs. If recommendation systems relied solely on explicit feedback, they would have data on almost no one. The industry pivoted to implicit feedback around 2008 — and accuracy improved dramatically. Understanding the difference between these two signal types is not academic; it determines the entire architecture of a modern recommender system.

📐 Explicit vs Implicit Feedback — The Signal Landscape

The choice between explicit and implicit feedback shapes every downstream design decision — from data collection infrastructure to model architecture and evaluation metrics.

Section 02

Explicit Feedback — Ratings

📖 Story

The Netflix Ratings Revolution — and Then Retreat

Netflix was founded in 1997, and for years it built its recommendation engine around 5-star ratings. Users were actively encouraged to rate every film they watched. The more you rated, the better your recommendations became. The system was elegant and transparent. In 2006, Netflix even offered a $1 million prize to anyone who could improve its rating prediction accuracy by 10%.

Then in 2017, Netflix quietly removed the 5-star rating system and replaced it with a simple thumbs up / thumbs down.

Why? Because the data showed that most users rarely left star ratings: a 5-point scale demanded too much effort. The thumbs system generated 200% more ratings per user almost immediately. But even that wasn't the real revolution: Netflix had realised that what you watch tells them far more than what you rate. By 2017, ratings had become a secondary signal. Behaviour had become primary.

Ratings are the canonical form of explicit feedback. Users assign a numerical or categorical score to an item — expressing how much they liked or disliked it. Ratings are clean, unambiguous, and directly interpretable. They are also frustratingly rare.

⭐ Ratings Scales in the Wild — Design Decisions That Matter

1–5 Stars

Amazon, Yelp, Google Maps. Most granular scale. Susceptible to J-curve bias — users tend to rate only extreme experiences (1 or 5), skipping the middle. Mean-centring per user is essential before use.

Thumbs

Netflix, YouTube. Binary positive/negative. Far higher participation rate because of the lower cognitive load. Loses nuance (a 4-star and a 5-star both become "liked") but gains volume dramatically.

1–10

IMDb. Finer-grained expression, but people tend to struggle to distinguish adjacent points such as 7 and 8 consistently, so rating reliability tends to drop on scales with more than about 5 points. More cognitive load per rating.

None / Skip

Implicit signal. A user who watches a film but chooses not to rate it has still told you something. Absence of rating combined with presence of watch-through is itself interpretable.
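
One way to make the "watched but did not rate" case concrete is to compare watch events against explicit ratings and keep the pairs that have no rating. The sketch below is illustrative only: the event tuples, the 0.8 completion threshold, and the name `silent_completions` are assumptions, not anything a particular platform is known to use.

```python
# Hypothetical logs: (user, item, completion fraction) and explicit ratings.
watch_events = [
    ("Ali", "A", 0.95),
    ("Ali", "B", 0.20),
    ("Ben", "A", 0.90),
    ("Cara", "C", 0.85),
]
explicit_ratings = {("Ali", "A"): 5}  # only Ali rated item A

WATCH_THROUGH = 0.8  # assumed threshold for "watched it through"

# Pairs where the user finished the item but left no rating.
silent_completions = [
    (user, item)
    for user, item, completion in watch_events
    if completion >= WATCH_THROUGH and (user, item) not in explicit_ratings
]
print(silent_completions)
```

These pairs could then be treated as mild positive implicit signals rather than as missing data.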

📊 The J-Curve Bias — Why Raw Ratings Mislead

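On most platforms, ★1 and ★5 ratings are massively over-represented because users only bother to rate when they feel strongly. Raw ratings therefore need normalising before training. The usual first step is mean-centring each user's ratings, which removes differences in how generously individual users rate (it does not by itself undo the selection effect of who chooses to rate).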

import numpy as np
import pandas as pd

# ── Raw ratings dataset ───────────────────────────────────────
ratings = pd.DataFrame({
    'user'  : ['Ali','Ali','Ali','Ben','Ben','Ben','Cara','Cara'],
    'item'  : ['A','B','C','A','B','D','B','C'],
    'rating': [5, 4, 1, 3, 4, 5, 2, 5],
})

# ── Problem: raw ratings confound user generosity with preference
# Ali gives 5,4,1 — tough rater.  Cara gives 2,5 — wide spread.
# A raw "4" from Ali ≠ a "4" from a generous rater.

# ── Solution: mean-centre per user ───────────────────────────
user_means = ratings.groupby('user')['rating'].transform('mean')
ratings['rating_norm'] = ratings['rating'] - user_means

print(ratings.to_string(index=False))
print("\nUser mean ratings:")
print(ratings.groupby('user')['rating'].mean().round(2))

OUTPUT

user item  rating  rating_norm
 Ali    A       5         1.67
 Ali    B       4         0.67
 Ali    C       1        -2.33
 Ben    A       3        -1.00
 Ben    B       4         0.00
 Ben    D       5         1.00
Cara    B       2        -1.50
Cara    C       5         1.50

User mean ratings:
user
Ali     3.33
Ben     4.00
Cara    3.50

🔑

Mean-Centring — The Most Important Preprocessing Step for Ratings

Ali's raw "4" and Ben's raw "4" look identical — but Ali averages 3.33 (tough rater) while Ben averages 4.00 (generous rater). Ali's "4" is actually above her norm; Ben's "4" is exactly average. After mean-centring, Ali's 4 becomes +0.67 and Ben's 4 becomes 0.00 — now they correctly signal different sentiment. Always mean-centre explicit ratings before training any collaborative filtering model.
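
To see why centring matters downstream, the sketch below compares two users by cosine similarity over the items they both rated, before and after centring. It is a minimal illustration, not a full collaborative filtering model; the helpers `centre` and `cosine` and the small rating dictionaries are assumptions made for the example.

```python
import math

ratings = {
    "Ali": {"A": 5, "B": 4, "C": 1},
    "Ben": {"A": 3, "B": 4, "D": 5},
}

def centre(user_ratings):
    """Subtract the user's own mean from each of their ratings."""
    mean = sum(user_ratings.values()) / len(user_ratings)
    return {item: r - mean for item, r in user_ratings.items()}

def cosine(u, v):
    """Cosine similarity over the items both users rated."""
    shared = sorted(set(u) & set(v))
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no usable signal on the shared items
    return dot / (norm_u * norm_v)

raw_sim = cosine(ratings["Ali"], ratings["Ben"])
centred_sim = cosine(centre(ratings["Ali"]), centre(ratings["Ben"]))

print(f"raw similarity:     {raw_sim:.3f}")
print(f"centred similarity: {centred_sim:.3f}")
```

On raw scores the two users look almost identical on their shared items. After centring, the similarity turns negative, because Ali rated item A above her own average while Ben rated it below his. With only two shared items this is an illustration, not a reliable estimate.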

Section 03

Explicit Feedback — Likes & Dislikes

Binary feedback — thumbs up / thumbs down, heart / no heart, upvote / downvote — is the simplest form of explicit preference signal. What it loses in granularity, it more than compensates for in volume and participation rate.

Platform	Signal Type	Positive	Negative	How It's Used
Netflix	Thumbs binary	👍 Thumbs Up	👎 Thumbs Down	Directly shifts genre weights in taste profile; 👎 triggers suppression of similar content for 6+ months
YouTube	Like / Dislike	👍 Like	👎 Dislike	Dislike removed from public count in 2021 but remains a private signal in ranking algorithm
Spotify	Heart + hide	❤️ Like / Save	🚫 Hide song	Heart adds to Liked Songs; Hide immediately removes from radio/playlist and suppresses artist for 30 days
Reddit	Up/Down vote	⬆️ Upvote	⬇️ Downvote	Net score determines feed ranking; personalised feed weights subreddit affinity from upvote history
TikTok	Heart	❤️ Heart	🚫 "Not interested"	Heart is the explicit positive signal; "Not interested" is an explicit negative, and TikTok also infers negative preference from rapid scrolling

⚠️

The Asymmetry of Likes and Dislikes

Positive and negative signals are not mirror images. A like is a weak positive signal — it costs nothing, users click it casually. A dislike is a strong negative signal — users only bother when they feel strongly enough to complain. This asymmetry means negative signals should carry greater weight per instance in your model. Spotify's "Hide" suppresses an artist for 30 days because they treat it as a high-confidence negative — far more informative than the absence of a like.

import pandas as pd
import numpy as np

# ── Binary feedback: asymmetric weighting ────────────────────
interactions = pd.DataFrame({
    'user'   : ['Ali','Ali','Ali','Ben','Ben','Cara','Cara'],
    'item'   : ['A','B','C','A','C','B','D'],
    'signal' : ['like','dislike','like','like','dislike','like','like'],
})

# Asymmetric weights: dislike carries 3× the signal strength of a like
weight_map = {'like': 1.0, 'dislike': -3.0}
interactions['weight'] = interactions['signal'].map(weight_map)

# ── Build weighted user preference score per item ─────────────
user_item_score = (
    interactions
    .groupby(['user', 'item'])['weight']
    .sum()
    .reset_index()
    .rename(columns={'weight': 'pref_score'})
)
print(user_item_score.to_string(index=False))

# ── Suppress disliked items from candidate pool ───────────────
suppress = interactions[interactions['signal'] == 'dislike'][['user','item']]
print(f"\nSuppressed (user, item) pairs:")
print(suppress.to_string(index=False))

OUTPUT

user item  pref_score
 Ali    A         1.0
 Ali    B        -3.0
 Ali    C         1.0
 Ben    A         1.0
 Ben    C        -3.0
Cara    B         1.0
Cara    D         1.0

Suppressed (user, item) pairs:
user item
 Ali    B
 Ben    C
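
The suppression step above has no expiry. If a time-limited window is wanted, as in the 30-day example mentioned for Spotify, one possible sketch is below. The window length, the `dislike_log` structure, and the function name `is_suppressed` are all assumptions for illustration.

```python
from datetime import datetime, timedelta

SUPPRESS_WINDOW = timedelta(days=30)  # assumed window length

# Hypothetical log: (user, item) -> time of the most recent dislike.
dislike_log = {
    ("Ali", "B"): datetime(2024, 6, 1),
    ("Ben", "C"): datetime(2024, 4, 1),
}

def is_suppressed(user, item, now):
    """True while the user's dislike of this item is inside the window."""
    disliked_at = dislike_log.get((user, item))
    return disliked_at is not None and now - disliked_at < SUPPRESS_WINDOW

now = datetime(2024, 6, 15)
candidates = {"Ali": ["A", "B", "C"], "Ben": ["A", "C", "D"]}

# Drop candidates that are still inside a suppression window.
filtered = {
    user: [item for item in items if not is_suppressed(user, item, now)]
    for user, items in candidates.items()
}
print(filtered)
```

Ali's dislike is 14 days old, so item B stays hidden; Ben's is 75 days old, so item C is eligible again.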

Section 04

Implicit Feedback — Clicks

📖 Story

The Click That Lied

A user browsing Amazon clicks on a product: a bright red kitchen blender with an eye-catching thumbnail. Did they want a blender? Not necessarily. Maybe the thumbnail caught their eye. Maybe they clicked to read the reviews out of curiosity. Maybe they were comparison shopping for a gift they would never buy for themselves. Maybe they clicked by accident on a touchscreen.

The click happened. The signal was recorded. But the interpretation of that click requires context — position on page, how long they stayed, whether they scrolled, whether they added to cart, whether they bounced immediately back.

Clicks are the most abundant implicit signal in the digital world, and the most treacherous. They are simultaneously the most valuable data you have and the easiest to misread. Learning to weight, de-noise, and contextualise click data is one of the most important skills in industrial recommendation engineering.
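
As a rough illustration of contextualising clicks, the sketch below grades each click using dwell time and a follow-up action. The thresholds (5 and 30 seconds), the field layout, and the label names are assumptions chosen for the example; a real system would tune them per surface.

```python
def grade_click(dwell_s, added_to_cart=False):
    """Assign a coarse quality label to a single click event."""
    if added_to_cart:
        return "strong"     # a follow-up action confirms intent
    if dwell_s < 5:
        return "bounce"     # likely accidental or a misleading thumbnail
    if dwell_s < 30:
        return "glance"     # curiosity, weak evidence at best
    return "considered"     # stayed long enough to read or compare

# Hypothetical click events: (user, item, dwell seconds, added to cart)
click_events = [
    ("Ali", "Blender", 3, False),
    ("Ali", "Kettle", 45, False),
    ("Ben", "Blender", 12, False),
    ("Cara", "Toaster", 8, True),
]

graded = [(user, item, grade_click(dwell, cart))
          for user, item, dwell, cart in click_events]
for row in graded:
    print(row)
```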

Position Bias — The Elephant in the Room

Users are far more likely to click items that appear higher on a list — not because those items are better, but because they are seen first. This is called position bias, and it is the most dangerous confound in click data. If you train a model on raw click data, you will systematically reinforce items that were already ranked highly — regardless of their true quality.

📊 Position Bias — Click-Through Rate vs Ranking Position

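This decay curve means training on raw clicks teaches the model "position 1 is good" rather than "this specific item is relevant." Inverse Propensity Scoring (IPS) corrects for this by weighting each click by the inverse of the probability that its position was seen, which down-weights clicks from top-ranked positions relative to clicks further down the list.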

import numpy as np
import pandas as pd

# ── Click data with position information ──────────────────────
clicks = pd.DataFrame({
    'user'    : ['Ali','Ali','Ben','Ben','Cara','Cara'],
    'item'    : ['A','B','A','C','B','D'],
    'position': [1, 3, 1, 5, 2, 4],
    'clicked' : [1, 1, 1, 1, 1, 1],
})

# ── Inverse Propensity Scoring (IPS) ──────────────────────────
# P(seen | position): examination propensity, here assumed to follow the empirical CTR curve
propensity = {1: 0.20, 2: 0.13, 3: 0.09, 4: 0.07, 5: 0.05}

clicks['propensity'] = clicks['position'].map(propensity)

# IPS weight: down-weight easy (high-position) clicks
clicks['ips_weight'] = 1.0 / clicks['propensity']

print(clicks[['user','item','position','propensity','ips_weight']]
      .to_string(index=False))

print("\nIPS-weighted click value by item:")
print(clicks.groupby('item')['ips_weight']
      .sum().sort_values(ascending=False).round(2))

OUTPUT

user item  position  propensity  ips_weight
 Ali    A         1        0.20        5.00
 Ali    B         3        0.09       11.11
 Ben    A         1        0.20        5.00
 Ben    C         5        0.05       20.00
Cara    B         2        0.13        7.69
Cara    D         4        0.07       14.29

IPS-weighted click value by item:
item
C    20.00   ← Position-5 click = highest single-click weight
B    18.80
D    14.29
A    10.00

💡

The IPS Insight — Position 5 Beats Position 1

Item C was clicked at position 5 — with only a 5% chance of being seen. The fact that a user scrolled past four items to click it is a much stronger signal of genuine interest than clicking item A at position 1, which nearly everyone sees. IPS re-weights accordingly: item C's click is worth 20× a "fair" click; item A's click is worth only 5×. Raw click counts completely obscure this difference.
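
Raw IPS weights can become very large when a propensity is small, so a single deep-position click may dominate an estimate. A common mitigation is to cap the weights. The sketch below reuses the propensity values from the example above; the cap of 10.0 and the name `clipped_ips_weight` are assumptions for illustration, and a sensible cap depends on the data.

```python
# Propensities copied from the example above (probability a position is seen).
propensity = {1: 0.20, 2: 0.13, 3: 0.09, 4: 0.07, 5: 0.05}

MAX_WEIGHT = 10.0  # assumed cap; trades a little bias for lower variance

def clipped_ips_weight(position, cap=MAX_WEIGHT):
    """Inverse propensity weight for a click, capped at `cap`."""
    return min(1.0 / propensity[position], cap)

# (item, position) for each observed click, as in the example above.
clicks = [("A", 1), ("B", 3), ("A", 1), ("C", 5), ("B", 2), ("D", 4)]

totals = {}
for item, position in clicks:
    totals[item] = totals.get(item, 0.0) + clipped_ips_weight(position)

for item, value in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(item, round(value, 2))
```

With the cap in place, the single position-5 click on item C no longer outweighs the two clicks on item B, which is the kind of trade-off clipping is meant to expose.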

Section 05

Implicit Feedback — Watch Time & Dwell Time

📖 Story

YouTube's 2012 Revolution

In 2012, YouTube made a decision that changed the internet. Until then, their recommendation algorithm optimised for clicks: the more often a video was opened, the more it was promoted. The unintended consequence was catastrophic: thumbnails became outrageously misleading, titles became "clickbait", and users felt deceived constantly.

YouTube's engineering team ran the numbers and discovered something stunning: videos optimised for clicks had terrible completion rates. Users clicked, felt cheated, and left after 10 seconds. Meanwhile, genuinely excellent long-form content had lower click rates — but users who clicked watched it to the end.

YouTube switched their recommendation objective from "maximise clicks" to "maximise watch time". The clickbait era collapsed overnight. Watch time became — and remains — the single most important signal in the YouTube recommendation algorithm. It is the feedback signal that cannot be easily gamed, because you cannot fake spending 45 minutes on a video you didn't enjoy.

Watch time (for video), dwell time (for articles), listen duration (for podcasts/music), and reading time are time-based implicit signals that are considerably more informative than binary clicks. Time is a scarce resource — users who give it are signalling genuine engagement.

📐 Interpreting Watch Time — From Signal to Sentiment

Completion rate is far more nuanced than binary click. A 10-second bounce after a click is a negative signal. A full watch plus replay is one of the strongest positive signals a recommender can receive.

import pandas as pd
import numpy as np

# ── Video watch events ────────────────────────────────────────
watches = pd.DataFrame({
    'user'       : ['Ali','Ali','Ben','Ben','Cara','Cara'],
    'video'      : ['V1','V2','V1','V3','V2','V3'],
    'duration_s' : [1800, 1800, 1800, 2400, 1800, 2400],  # video length
    'watched_s'  : [1750, 120,  900,  2380, 1800, 85],   # seconds watched
    'replayed'   : [True, False, False, True, False, False],
})

# ── Compute completion rate ───────────────────────────────────
watches['completion'] = (watches['watched_s'] / watches['duration_s']).clip(0, 1)

# ── Sentiment label from completion rate ──────────────────────
def watch_sentiment(row):
    c = row['completion']
    if   c < 0.10: return 'bounce'
    elif c < 0.35: return 'browse'
    elif c < 0.80: return 'engaged'
    else:           return 'loved'

watches['sentiment'] = watches.apply(watch_sentiment, axis=1)

# ── Implicit score: completion + replay bonus ─────────────────
watches['impl_score'] = watches['completion'] + watches['replayed'].astype('float') * 0.5

cols = ['user','video','completion','sentiment','impl_score']
print(watches[cols].to_string(index=False))

OUTPUT

user video  completion sentiment  impl_score
 Ali    V1        0.97     loved        1.47
 Ali    V2        0.07    bounce        0.07
 Ben    V1        0.50   engaged        0.50
 Ben    V3        0.99     loved        1.49
Cara    V2        1.00     loved        1.00
Cara    V3        0.04    bounce        0.04

⚡

Three Watch-Time Features That Always Matter

(1) Completion rate: a 97% completion is a far stronger signal than a 50% completion. (2) Replay / re-watch: watching something twice or returning to it signals the highest tier of positive sentiment. (3) Drop-off point: where in the video a user stopped helps distinguish an early abandonment (a technical issue, or content that did not match the thumbnail) from a user who left near the end after getting the value they wanted. A drop at 95% is not a negative signal.
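
The drop-off point can be turned into a feature by bucketing the fraction of the video reached when playback stopped. This is a minimal sketch: the bucket boundaries (10% and 90%) and the labels are assumptions, and `classify_dropoff` is a hypothetical helper rather than anything from a specific platform.

```python
def classify_dropoff(watched_s, duration_s):
    """Label where in the video playback stopped."""
    fraction = min(watched_s / duration_s, 1.0)
    if fraction < 0.10:
        return "early_exit"      # left almost immediately
    if fraction < 0.90:
        return "mid_abandon"     # stopped partway through
    return "near_complete"       # effectively finished (credits, outro)

# (video, seconds watched, video length in seconds); illustrative values
sessions = [("V1", 1750, 1800), ("V2", 120, 1800), ("V1", 900, 1800),
            ("V3", 2280, 2400)]

labels = [classify_dropoff(w, d) for _, w, d in sessions]
print(labels)
```

A session that stops at 95% lands in `near_complete`, so it is not penalised the way an early exit would be.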

Section 06

Implicit Feedback — Purchase History

📖 Story

The Purchase That Wasn't What It Seemed

Amazon knows you bought a set of pink fairy lights, a princess tiara, and a personalised children's birthday banner. Their algorithm lights up — you must love princess aesthetics, children's party supplies, pink decor. It starts recommending tutus and glitter craft kits.

You were buying for your niece's fifth birthday. It was a one-off gift, not a reflection of your own taste, and you will never buy princess fairy lights again.

Purchase history is the strongest implicit signal in e-commerce — someone actually spent money, which is far more committed than a click or even a long scroll. But it is also the most context-dependent. Every purchase exists in a moment: was it a gift? A one-time need? An impulse? A recurring necessity? Without modelling purchase intent, even the strongest signal can mislead catastrophically.

Purchase history is the gold-standard implicit signal for e-commerce recommendation. A transaction is the strongest commitment a user can make: they paid. It is a strong positive signal of intent, though not necessarily of satisfaction. Purchases therefore require careful contextualisation: gift-buying, seasonal purchases, one-time needs, and category exploration all create false taste signals if naively fed into a model.

Purchase Pattern	Naive Interpretation	Correct Interpretation	Mitigation Strategy
One-time purchase of unusual category	Strong interest in category	Likely a gift or one-off need	Decay signal over time; require repeat purchase before high-weighting
Repeat purchase of same item	Strong preference — replenishment	Correct — high-confidence signal	Upweight heavily; trigger subscription or bulk-buy recommendation
Purchase immediately after browsing	Deliberate choice — strong signal	Correct — high intent	Full positive weight; expand into adjacent categories
Purchase + immediate return	Ambiguous	Negative — item did not meet expectations	Return event should flip purchase weight negative; suppress similar items
Seasonal spike purchase	Ongoing interest in category	Time-bound event (Christmas, birthday)	Timestamp features in model; only recommend seasonal content in window

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# ── Purchase history with context signals ─────────────────────
now = datetime(2024, 6, 15)
purchases = pd.DataFrame({
    'user'      : ['Ali','Ali','Ali','Ben','Ben'],
    'item'      : ['Blender','Coffee','Toy','Coffee','Headphones'],
    'category'  : ['Kitchen','Grocery','Toys','Grocery','Electronics'],
    'date'      : [now - timedelta(days=d)
                   for d in [400, 30, 360, 20, 5]],
    'returned'  : [False, False, True, False, False],
    'repeat_buy': [False, True, False, True, False],
})

# ── Time-decay: recent purchases are more relevant ────────────
purchases['days_ago']     = (now - purchases['date']).dt.days
purchases['time_decay']   = np.exp(-purchases['days_ago'] / 90)  # 90-day time constant (half-life ≈ 62 days)

# ── Base signal: return = -1, normal purchase = +1 ───────────
purchases['base_signal']  = np.where(purchases['returned'], -1.0, 1.0)

# ── Repeat bonus multiplier ───────────────────────────────────
purchases['repeat_mult']  = np.where(purchases['repeat_buy'], 2.0, 1.0)

# ── Final implicit score ──────────────────────────────────────
purchases['impl_score'] = (purchases['base_signal']
                             * purchases['time_decay']
                             * purchases['repeat_mult'])

cols = ['user','item','days_ago','returned','repeat_buy','impl_score']
print(purchases[cols].round(3).to_string(index=False))

OUTPUT

user       item  days_ago  returned  repeat_buy  impl_score
 Ali    Blender       400     False       False       0.012
 Ali     Coffee        30     False        True       1.433
 Ali        Toy       360      True       False      -0.018
 Ben     Coffee        20     False        True       1.601
 Ben Headphones         5     False       False       0.946

📈

Reading the Output

Ali's Blender purchase (400 days ago) is nearly irrelevant after time decay. Her Coffee, recent and repeated, scores 1.43, her strongest positive signal. The returned Toy becomes a negative signal (−0.018), although after 360 days of decay its magnitude is small. Ben's repeated Coffee (1.60) and recent Headphones purchase (0.946) are both strong signals with different recency weights. This single feature engineering step transforms raw transaction logs into a nuanced taste profile.
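
The decay in the snippet above divides by 90 inside the exponential, which makes 90 days the time constant: the weight falls to about 37% after 90 days. If the intent is a true half-life, where the weight is exactly 0.5 after the chosen number of days, one way to write it is sketched below. The function names `half_life_decay` and `time_constant_decay` are assumptions for this example.

```python
import math

def half_life_decay(days_ago, half_life_days):
    """Weight that halves every `half_life_days` days."""
    return 0.5 ** (days_ago / half_life_days)

def time_constant_decay(days_ago, tau_days):
    """exp(-t / tau): falls to 1/e (about 0.37) after `tau_days` days."""
    return math.exp(-days_ago / tau_days)

for days in (0, 30, 90, 400):
    print(days,
          round(half_life_decay(days, 90), 3),
          round(time_constant_decay(days, 90), 3))

# A time constant of 90 days corresponds to a half-life of 90 * ln(2) days.
equivalent_half_life = 90 * math.log(2)
print(round(equivalent_half_life, 1))
```

Either form works as a recency weight; what matters is that the parameter is named for what it actually is, so that a "90-day half-life" in a design document matches the behaviour in code.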

Section 07

Combining All Signals — The Unified Feedback Matrix

In production, no single feedback signal is trusted alone. Every signal has biases, gaps, and misinterpretation risks. The art of feedback engineering is building a unified confidence-weighted interaction score that aggregates all available signals into a single interpretable value.

📐 Signal Confidence Hierarchy — From Weakest to Strongest

The confidence hierarchy guides how much weight to assign each signal type. A single purchase or explicit dislike outweighs dozens of passive impressions. Production systems accumulate evidence across all tiers before drawing strong conclusions.

import pandas as pd
import numpy as np

# ── Unified implicit score from multiple signal types ─────────
# Signal weights: higher = more trustworthy
SIGNAL_WEIGHTS = {
    'impression' : 0.01,
    'click'      : 0.10,
    'watch_50pct': 0.30,
    'watch_full' : 0.60,
    'like'       : 0.70,
    'dislike'    : -1.50,   # negative and high-magnitude
    'rating_4'   : 0.65,
    'rating_5'   : 0.90,
    'rating_1'   : -1.20,
    'purchase'   : 1.00,
    'rewatch'    : 1.20,
}

# ── Example: Ali's interactions with Item V1 ─────────────────
ali_v1_signals = ['impression', 'click', 'watch_full', 'rewatch', 'like']
ali_v1_score   = sum(SIGNAL_WEIGHTS[s] for s in ali_v1_signals)

# ── Example: Ben's interactions with Item V2 ─────────────────
ben_v2_signals = ['impression', 'click', 'watch_50pct', 'dislike']
ben_v2_score   = sum(SIGNAL_WEIGHTS[s] for s in ben_v2_signals)

# ── Build interaction table ───────────────────────────────────
data = []
for sig in ali_v1_signals:
    data.append({'user':'Ali','item':'V1','signal':sig,'weight':SIGNAL_WEIGHTS[sig]})
for sig in ben_v2_signals:
    data.append({'user':'Ben','item':'V2','signal':sig,'weight':SIGNAL_WEIGHTS[sig]})

df = pd.DataFrame(data)
summary = df.groupby(['user','item'])['weight'].sum().reset_index()
summary.columns = ['user','item','unified_score']

print(df.to_string(index=False))
print("\nUnified Interaction Scores:")
print(summary.to_string(index=False))

OUTPUT

user item      signal  weight
 Ali   V1  impression    0.01
 Ali   V1       click    0.10
 Ali   V1  watch_full    0.60
 Ali   V1     rewatch    1.20
 Ali   V1        like    0.70
 Ben   V2  impression    0.01
 Ben   V2       click    0.10
 Ben   V2 watch_50pct    0.30
 Ben   V2     dislike   -1.50

Unified Interaction Scores:
user item  unified_score
 Ali   V1           2.61
 Ben   V2          -1.09

✅

Ali Loved V1. Ben Cannot Stand V2.

Ali's unified score of +2.61 makes V1 a top recommendation candidate for Ali, confirmed by every signal tier. Ben's score of −1.09 marks V2 as content to be suppressed in Ben's feed, because the dislike (−1.50) outweighs the combined positive weight of the impression, click, and half-watch (+0.41). This is the power of unified signal weighting: a nuanced, multi-signal portrait of preference that no single signal provides alone.
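
In a real event log each signal can occur many times, so the unified score is more naturally a weighted count than a sum over distinct signal names. The sketch below assumes the same illustrative weights as above and two hypothetical raw event lists; it also checks the earlier claim that one purchase outweighs dozens of impressions.

```python
from collections import Counter

# Subset of the illustrative weights used above.
SIGNAL_WEIGHTS = {"impression": 0.01, "click": 0.10, "purchase": 1.00}

def unified_score(events):
    """Sum weight * count over every event type in a raw event list."""
    counts = Counter(events)
    return sum(SIGNAL_WEIGHTS[name] * n for name, n in counts.items())

# Hypothetical logs for one (user, item) pair each.
browsing_only = ["impression"] * 40 + ["click"] * 2   # much exposure, no commitment
single_purchase = ["impression", "click", "purchase"]

print(round(unified_score(browsing_only), 2))
print(round(unified_score(single_purchase), 2))
```

Forty impressions and two clicks still score below a single purchase path under these weights, which is the behaviour the confidence hierarchy is meant to produce.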

Section 08

Explicit vs Implicit — The Complete Comparison

Property	Explicit Feedback	Implicit Feedback
Volume / Density	Sparse — <2% of users provide it	Dense — 100% of users generate it continuously
Signal Noise	Low — intentional declaration	High — requires careful interpretation
Negative Signal	Clear — 1-star or thumbs-down	Ambiguous — not clicking ≠ disliking
User Effort	Requires deliberate action	Zero effort — automatic behavioural trace
Truthfulness	Susceptible to social desirability bias	Hard to fake — behaviour is honest
Best Use Case	Cold start seeding, preference calibration	Core training signal for large-scale CF
Key Preprocessing	Mean-centring per user, J-curve correction	IPS weighting, time decay, deduplication
Platforms That Rely On It	Yelp, IMDb, Goodreads, older Netflix	TikTok, YouTube, Amazon, Spotify, modern Netflix

🎯 Golden Rules for Feedback Signal Engineering

Always mean-centre explicit ratings. A "4" from a tough rater and a "4" from a generous rater mean completely different things. Normalise per user before any model training, every time, without exception.

Never treat a click as a positive label. A click is evidence of curiosity, not preference. Apply Inverse Propensity Scoring to correct for position bias, and combine with dwell time before assigning any positive signal weight.

Watch time is your most honest implicit signal. Completion rate is hard to fake. A user who watches 95% of a video and replays it has given you the clearest possible signal of genuine enjoyment. Weight it accordingly.

Dislikes and negative signals punch above their weight. A single dislike, return, or hide should carry 2–3× the magnitude of a single like. Users only bother to express strong negative feedback — it is high-confidence information.

Apply time decay to all signals. A purchase 400 days ago is nearly irrelevant compared to a purchase last week. Use an exponential decay function with a domain-appropriate half-life: 30–90 days for e-commerce; 7–30 days for news; 180+ days for movies.

Context-tag your purchase signals. Gift purchases, seasonal spikes, and one-time needs will corrupt your user taste model if treated as genuine preference. Use session context (gift-wrap selected? seasonal keyword in search?) to flag and down-weight context-specific purchases.

Combine, don't choose. The best production systems fuse explicit and implicit signals into a single unified score, with confidence weights reflecting each signal's reliability. Explicit signals calibrate the model's direction; implicit signals provide the volume it needs to generalise.
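
The context-tagging rule above can be sketched as a multiplier applied to the purchase weight. Everything specific here is an assumption for illustration: the `gift_wrap` and `seasonal_query` flags, the 0.2 and 0.5 multipliers, and the function name `context_weight`.

```python
def context_weight(gift_wrap=False, seasonal_query=False):
    """Down-weight purchases that look like gifts or seasonal one-offs."""
    weight = 1.0
    if gift_wrap:
        weight *= 0.2   # assumed: gifts say little about the buyer's taste
    if seasonal_query:
        weight *= 0.5   # assumed: seasonal intent is time-bound
    return weight

# Hypothetical purchases: (item, gift_wrap, seasonal_query)
orders = [
    ("Princess tiara", True, False),
    ("Coffee beans", False, False),
    ("Christmas lights", False, True),
]

weighted = {item: context_weight(gift, seasonal)
            for item, gift, seasonal in orders}
print(weighted)
```

The resulting multiplier would be applied on top of the base purchase weight and any time decay, so a gift-wrapped one-off contributes little to the buyer's long-term taste profile.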