Master random variables — discrete vs continuous

Section 01

Numbers That Carry Uncertainty 🎲

Imagine you are about to roll a die. Before it lands, you cannot say with certainty what the outcome will be — but you know it will be some number. Now imagine you record that number, call it X. This X is not a fixed, known quantity. It is a number that takes different values depending on the outcome of a random experiment. That is a Random Variable.

Random variables are the bridge between the abstract language of probability (sample spaces, events) and the practical world of data, measurement, and machine learning. Every dataset column you have ever analysed — customer age, transaction amount, exam score, daily rainfall — is a realisation of a random variable. The entire machinery of statistics is built around describing, comparing, and modelling them.

This tutorial covers what a random variable is, the crucial distinction between discrete and continuous types, and the three functions (PMF, PDF, and CDF) that characterise their probabilistic behaviour.

💡

Notation Convention

Random variables are written as capital letters (X, Y, Z). Their specific observed values are written as lowercase letters (x, y, z). So "X = 3" means the random variable X took the specific value 3 in one trial. P(X = 3) is the probability that this happens. This distinction matters throughout statistics and machine learning literature.

Section 02

What Is a Random Variable? 🔢

The Story: The Call Centre Manager

Priya manages a call centre. Each hour, she observes how many customer complaints arrive. Some hours it's 0, some hours 12, occasionally 30. She can't predict the exact number in advance — but she can describe the pattern. She defines X = "number of complaints per hour." X is her random variable.

Meanwhile, her colleague Arjun monitors how long each call lasts. A call might last 2.3 minutes, 7.81 minutes, or 14.002 minutes. Any positive real number is possible — there is no list of discrete options. He defines Y = "call duration in minutes." Y is also a random variable, but of a fundamentally different type.

Random Variable (Formal)

X : Ω → ℝ

A function that maps each outcome in the sample space Ω to a real number. It assigns a numerical value to every possible outcome of an experiment.

Realisation / Observed Value

x = X(ω)

When the experiment is run and outcome ω occurs, the random variable takes the specific value x. This is called a realisation or observed value.

Distribution

P(X ∈ A) for all A ⊆ ℝ

The distribution of X describes the probability that X falls in any set A. It completely characterises the random variable's behaviour — the full picture of all possible values and their likelihoods.

Two Fundamental Types

Discrete or Continuous

Discrete: countable values (0, 1, 2, …). Continuous: uncountable values on an interval. The type determines which mathematical tools are used to describe the distribution.
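
To make the mapping X : Ω → ℝ concrete, here is a minimal sketch. It assumes a hypothetical experiment that is not part of the stories above: rolling two fair dice and letting X be the sum of the faces. The names omega, X and distribution are chosen here purely for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
omega = list(product(range(1, 7), repeat=2))

def X(outcome):
    """Random variable: maps an outcome (d1, d2) to a real number, the sum."""
    d1, d2 = outcome
    return d1 + d2

# One realisation: if the outcome (3, 4) occurs, X takes the value 7.
x = X((3, 4))

# Distribution of X: count how many outcomes map to each value.
counts = Counter(X(w) for w in omega)
distribution = {
    value: Fraction(n, len(omega)) for value, n in sorted(counts.items())
}

for value, prob in distribution.items():
    print(f"P(X = {value}) = {prob}")
```

The random variable is just an ordinary function; the randomness lives in which outcome ω the experiment produces.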

Section 03

Discrete vs Continuous — The Core Split ⚖️

The most fundamental distinction in the theory of random variables is whether the variable takes values from a countable set or an uncountable continuum. This determines everything: which formula you use, which distribution family applies, which plots make sense, and which statistical methods are valid.

Discrete Random Variable

X ∈ {x₁, x₂, ...}

Takes countable values (finite or infinite)
Values can be listed: 0, 1, 2, 3…
Gaps exist between possible values
Described by PMF: P(X = x)
Sum of all PMF values = 1
CDF is a staircase function
Examples: die rolls, coin flips, counts

Continuous Random Variable

X ∈ [a, b] or ℝ

Takes any value in an interval
Values cannot be listed — uncountable
No gaps — infinitely dense
Described by PDF: f(x)
Area under PDF curve = 1
CDF is a continuous curve with no jumps
Examples: height, time, temperature

The Critical Difference

P(X=x)

Discrete: P(X=x) can be > 0
Continuous: P(X=x) = 0 always
For continuous: ask P(a ≤ X ≤ b)
Probabilities need intervals, not points
This is why PDFs are densities, not probs
Areas = probabilities (for continuous)
Heights = probabilities (for discrete)

Feature	Discrete	Continuous
Values	Countable set {0, 1, 2, 3, …}	Uncountable interval [a, b] or ℝ
Probability at a point	P(X = x) > 0 possible	P(X = x) = 0 always
Probability function	PMF — p(x) = P(X = x)	PDF — f(x), area gives probability
Sums to	Σ p(x) = 1	∫ f(x)dx = 1
CDF shape	Staircase (step function)	Continuous curve (often S-shaped)
Common distributions	Binomial, Poisson, Geometric	Normal, Exponential, Uniform, Beta
Real examples	Number of defects, goals scored, clicks	Height, weight, time, temperature
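
The claim that P(X = x) = 0 for a continuous variable can be explored numerically. The sketch below assumes, purely for illustration, a standard normal X and the point x = 0.5 (neither comes from the text above). It shrinks an interval around x and watches the probability vanish while the ratio probability / width settles near the density.

```python
from statistics import NormalDist

# Assumption for illustration: X is standard normal, and we look around x = 0.5.
dist = NormalDist(mu=0.0, sigma=1.0)
x = 0.5

def interval_prob(eps):
    """P(x - eps <= X <= x + eps), computed from the CDF."""
    return dist.cdf(x + eps) - dist.cdf(x - eps)

widths = [1.0, 0.1, 0.01, 0.001, 0.0001]
probs = [interval_prob(eps) for eps in widths]

for eps, p in zip(widths, probs):
    print(f"eps = {eps:<7} P = {p:.6f}   P / (2*eps) = {p / (2 * eps):.4f}")

# P itself goes to 0; the ratio P / (2*eps) approaches the density f(x).
print("density at x:", round(dist.pdf(x), 4))
```

This is the sense in which a PDF is a density: probability per unit length, not probability.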

Section 04

Probability Mass Function (PMF) 📊

The Story: The Quality Control Inspector

Ravi inspects batches of 3 smartphones. Each phone independently has a 20% chance of having a defect. He defines X = "number of defective phones in a batch of 3." X can be 0, 1, 2, or 3. Before shipping any batch, Ravi wants to know the probability of each possible count. That probability assignment — one value for each possible outcome — is the Probability Mass Function.

PMF Definition

p(x) = P(X = x)

Maps every possible value x to its probability. p(x) tells you exactly how much probability mass sits at the point x.

Two PMF Requirements

p(x) ≥ 0 and Σ p(x) = 1

Every probability must be non-negative, and all probabilities must sum to exactly 1. These two conditions must always be satisfied.

Expected Value (Mean)

E[X] = Σ x · p(x)

The probability-weighted average of all possible values. The "centre of mass" of the distribution. Sum of (value × its probability) over all x.

Variance

Var(X) = Σ (x−μ)² · p(x)

Probability-weighted average squared deviation from the mean μ = E[X]. Measures how spread out the distribution is around its centre.

🧮 PMF — Defective Phones (Binomial, n=3, p=0.2)

Formula

Binomial PMF: P(X=k) = C(n,k) × pᵏ × (1−p)ⁿ⁻ᵏ | n=3, p=0.20

P(X=0)

C(3,0) × 0.2⁰ × 0.8³ = 1 × 1 × 0.512 = 0.512 (51.2% — no defects)

P(X=1)

C(3,1) × 0.2¹ × 0.8² = 3 × 0.2 × 0.64 = 0.384 (38.4% — one defect)

P(X=2)

C(3,2) × 0.2² × 0.8¹ = 3 × 0.04 × 0.8 = 0.096 (9.6% — two defects)

P(X=3)

C(3,3) × 0.2³ × 0.8⁰ = 1 × 0.008 × 1 = 0.008 (0.8% — all defective)

Verify

0.512 + 0.384 + 0.096 + 0.008 = 1.000 ✓

E[X]

μ = 0×0.512 + 1×0.384 + 2×0.096 + 3×0.008 = 0 + 0.384 + 0.192 + 0.024 = 0.60
On average, 0.6 phones per batch are defective. (Also = n×p = 3×0.2 = 0.6 ✓)

Var(X)

Var(X) = n×p×(1−p) = 3×0.2×0.8 = 0.48 | SD = √0.48 ≈ 0.693
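
The hand calculation above can be reproduced in a few lines. This is a minimal sketch using only the standard library; the helper name binomial_pmf is introduced here for illustration. It also computes the variance directly from the definition Σ (x−μ)² · p(x) rather than from the n×p×(1−p) shortcut.

```python
from math import comb, sqrt

n, p = 3, 0.20  # batch size and per-phone defect probability from the example

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

pmf = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

total = sum(pmf.values())                                        # should be 1
mean = sum(k * pk for k, pk in pmf.items())                      # E[X]
variance = sum((k - mean) ** 2 * pk for k, pk in pmf.items())    # Var(X)

for k, pk in pmf.items():
    print(f"P(X = {k}) = {pk:.3f}")
print(f"sum = {total:.3f}, E[X] = {mean:.2f}, "
      f"Var(X) = {variance:.2f}, SD = {sqrt(variance):.3f}")
```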

More PMF Examples

Experiment	X (Random Variable)	Possible Values	PMF Example
Coin flip (fair)	X = 1 if Heads, 0 if Tails	{0, 1}	P(X=0) = P(X=1) = 0.5
Die roll (fair)	X = face value shown	{1, 2, 3, 4, 5, 6}	P(X=k) = 1/6 for each k
Goals in a football match	X = total goals scored	{0, 1, 2, 3, …}	Poisson: P(X=k) = e⁻λ λᵏ/k!
Emails per hour	X = number of emails received	{0, 1, 2, 3, …}	Poisson with λ = average rate
Raffle draw	X = 1 if win, 0 if lose	{0, 1}	P(X=1)=1/1000, P(X=0)=999/1000

Section 05

Probability Density Function (PDF) 🌊

The Story: The Hospital Wait Time

The emergency room of a hospital records wait times. A patient might wait 8.7 minutes, 23.14 minutes, or 5.003 minutes. The wait time Y is a continuous random variable: it can take any non-negative real value. The probability that Y equals exactly 8.7 minutes is zero, because probability is spread over uncountably many possible times and no single instant carries any mass of its own.

But the probability that Y falls between 10 and 20 minutes is a perfectly meaningful number — it is the area under the probability density curve between those two points. The function that defines this curve is the Probability Density Function.

⚠️

The Single Most Common Misconception About PDFs

f(x) is NOT a probability. It is a density — like population density (people per km², not just people). f(x) can be greater than 1. What gives probability is the area under the curve over an interval: P(a ≤ X ≤ b) = ∫ₐᵇ f(x)dx. The total area under the entire curve must equal 1, but the height f(x) at any point can exceed 1.
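
A quick way to see a density exceed 1 while the total area stays at 1 is sketched below. It assumes a narrow normal distribution with σ = 0.1, chosen only for illustration, and a simple midpoint-rule integrator (midpoint_area is a helper written for this sketch, not a library function).

```python
from statistics import NormalDist

# Assumption for illustration: a narrow normal distribution with sigma = 0.1.
narrow = NormalDist(mu=0.0, sigma=0.1)

peak_density = narrow.pdf(0.0)  # height of the curve at its mode

def midpoint_area(f, a, b, steps=20_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

# Integrate over mu +/- 10 sigma, which captures essentially all the mass.
area = midpoint_area(narrow.pdf, -1.0, 1.0)

print(f"f(0) = {peak_density:.3f}")   # well above 1
print(f"total area = {area:.6f}")     # still about 1
```

The curve is tall because it is narrow: squeezing the same unit of probability into a shorter stretch raises the density.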

PDF Definition

P(a≤X≤b) = ∫ₐᵇ f(x)dx

The probability that X falls in [a,b] is the integral (area) of f(x) over that interval. f(x) itself is the density, not the probability.

Two PDF Requirements

f(x) ≥ 0 and ∫₋∞^∞ f(x)dx = 1

Density must be non-negative everywhere, and the total area under the entire curve must equal exactly 1 — all probability accounted for.

Expected Value

E[X] = ∫₋∞^∞ x·f(x)dx

The continuous analogue of the PMF mean formula. Replace the sum with an integral, multiply each point x by its density f(x).

Variance

Var(X) = ∫₋∞^∞ (x−μ)²f(x)dx

Continuous version of variance. The integral of squared deviations from the mean, weighted by the density function.

🧮 PDF — Uniform Distribution: Bus Arrival

Story

A bus arrives at a random time uniformly between 0 and 10 minutes from now. Y ~ Uniform(0, 10). Every instant is equally likely — the density is flat.

PDF

f(y) = 1/(b−a) = 1/(10−0) = 0.10 for 0 ≤ y ≤ 10; f(y) = 0 otherwise.

Verify

∫₀¹⁰ 0.10 dy = 0.10 × 10 = 1.0 ✓

P(2≤Y≤5)

P(bus arrives in 2–5 min) = ∫₂⁵ 0.10 dy = 0.10 × (5−2) = 0.30 (30%)

P(Y=3)

P(bus arrives at exactly 3 min) = ∫₃³ 0.10 dy = 0.10 × 0 = 0.00 — zero probability at a point.

E[Y]

E[Y] = (a+b)/2 = (0+10)/2 = 5 minutes average wait time.
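
The bus example can be checked both exactly and by simulation. The sketch below is one way to do it, with helper names (uniform_pdf, uniform_cdf) introduced here for illustration and a fixed random seed so the run is reproducible.

```python
import random

a, b = 0.0, 10.0  # bus arrives uniformly between 0 and 10 minutes

def uniform_pdf(y):
    """Density of Uniform(a, b): flat at 1/(b - a) inside the interval."""
    return 1.0 / (b - a) if a <= y <= b else 0.0

def uniform_cdf(y):
    """F(y) = P(Y <= y) for Uniform(a, b)."""
    if y < a:
        return 0.0
    if y > b:
        return 1.0
    return (y - a) / (b - a)

exact = uniform_cdf(5) - uniform_cdf(2)   # P(2 <= Y <= 5)

# Monte Carlo check: draw many arrival times and count how often 2 <= Y <= 5.
rng = random.Random(42)
draws = [rng.uniform(a, b) for _ in range(200_000)]
simulated = sum(2 <= y <= 5 for y in draws) / len(draws)
sample_mean = sum(draws) / len(draws)

print(f"exact P(2 <= Y <= 5) = {exact:.2f}, simulated = {simulated:.3f}")
print(f"sample mean = {sample_mean:.2f} (theory: {(a + b) / 2})")
```

The simulated values will differ slightly from the exact ones; the gap shrinks as the number of draws grows.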

🧮 PDF — Normal Distribution: Heights of Adults

Story

Adult heights in India approximately follow a normal distribution with μ = 165 cm and σ = 8 cm. Define H = height of a randomly selected adult.

PDF Formula

f(h) = 1/(σ√(2π)) × exp[−(h−μ)²/(2σ²)]
= 1/(8√(2π)) × exp[−(h−165)²/128]

P(H > 170)

Standardise: Z = (170−165)/8 = 0.625. P(Z > 0.625) ≈ 0.266 (26.6%)
About 26.6% of adults are taller than 170 cm.

P(157≤H≤173)

Z₁ = (157−165)/8 = −1.0, Z₂ = (173−165)/8 = +1.0
P(−1 ≤ Z ≤ +1) ≈ 0.683 (68.3%) — the famous 68% rule.

Empirical Rule

μ ± 1σ → 68.3% of heights fall between 157 and 173 cm
μ ± 2σ → 95.4% fall between 149 and 181 cm
μ ± 3σ → 99.7% fall between 141 and 189 cm
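
The standardise-then-look-up steps above can be sketched without a printed table. This version writes the standard normal CDF Φ(z) in terms of the error function from Python's math module; the helper names phi and z_score are introduced here for illustration.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu, sigma = 165.0, 8.0  # height parameters from the example above

def z_score(h):
    """Standardise a height: how many standard deviations from the mean."""
    return (h - mu) / sigma

p_taller_170 = 1.0 - phi(z_score(170))                 # P(H > 170)
p_within_1sd = phi(z_score(173)) - phi(z_score(157))   # P(157 <= H <= 173)

# Empirical rule: mass within k standard deviations of the mean.
empirical = {k: phi(k) - phi(-k) for k in (1, 2, 3)}

print(f"P(H > 170)         = {p_taller_170:.3f}")
print(f"P(157 <= H <= 173) = {p_within_1sd:.3f}")
for k, mass in empirical.items():
    print(f"within {k} sd: {mass:.3f}")
```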

Section 06

Cumulative Distribution Function (CDF) 📈

The Story: The Weather Forecaster

A meteorologist studies daily rainfall. Instead of asking "What's the probability of exactly 15mm of rain?", she asks "What is the probability of getting at most 15mm of rain?" This cumulative question — phrased with "at most" or "less than or equal to" — is precisely what the Cumulative Distribution Function answers for any value.

The CDF is defined for every type of random variable — discrete and continuous. It always starts at 0, always ends at 1, and is always non-decreasing. For discrete variables it looks like stairs. For continuous variables it forms a smooth S-curve. It is arguably the most fundamental function in all of probability theory.

CDF Definition (Universal)

F(x) = P(X ≤ x)

The probability that the random variable X takes a value less than or equal to x. Defined for all real x, for both discrete and continuous variables.

CDF from PMF (Discrete)

F(x) = Σ p(k) for all k ≤ x

Cumulative sum of all PMF values up to and including x. Produces a staircase that jumps up by p(x) at each possible value x.

CDF from PDF (Continuous)

F(x) = ∫₋∞ˣ f(t)dt

Integral of the PDF from −∞ to x. Produces a continuous curve rising from 0 to 1. Wherever f is continuous, the derivative of F(x) is f(x): F'(x) = f(x).

Interval Probability from CDF

P(a < X ≤ b) = F(b) − F(a)

The probability X falls in any interval is the difference in CDF values at the endpoints. Works for both discrete and continuous variables.
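
The relationship F'(x) = f(x) can be checked numerically. The sketch below assumes an Exponential distribution with rate 0.5, picked only because its PDF and CDF both have simple closed forms, and compares a central-difference slope of the CDF with the PDF at a few points.

```python
from math import exp

lam = 0.5  # assumed rate parameter, chosen only for illustration

def pdf(x):
    """Exponential density f(x) = lam * exp(-lam * x) for x >= 0."""
    return lam * exp(-lam * x) if x >= 0 else 0.0

def cdf(x):
    """Exponential CDF F(x) = 1 - exp(-lam * x) for x >= 0."""
    return 1.0 - exp(-lam * x) if x >= 0 else 0.0

def numerical_derivative(F, x, h=1e-5):
    """Central-difference approximation to F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)

checks = [(x, numerical_derivative(cdf, x), pdf(x)) for x in (0.5, 1.0, 2.0, 4.0)]
for x, slope, density in checks:
    print(f"x = {x}: F'(x) ~ {slope:.6f}, f(x) = {density:.6f}")

# Interval probability from the CDF: P(1 < X <= 3) = F(3) - F(1).
p_interval = cdf(3) - cdf(1)
print(f"P(1 < X <= 3) = {p_interval:.4f}")
```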

Three Properties of Every CDF

📐 Universal CDF Properties (Always True)

Bounded between 0 and 1: 0 ≤ F(x) ≤ 1 for all x. As x → −∞, F(x) → 0 (nothing has accumulated yet). As x → +∞, F(x) → 1 (all probability accumulated).

Non-decreasing: If a ≤ b, then F(a) ≤ F(b). Probability accumulates as x grows, so the CDF can never decrease. Extending the range to include more outcomes can add to the cumulative total but never subtract from it.

Right-continuous: At any jump point x₀ (for discrete variables), F(x₀) includes the probability at x₀. Mathematically: F(x₀) = lim_{x→x₀⁺} F(x). This is a technical convention with important practical consequences.

🧮 CDF — Defective Phones (Building on Our PMF)

Recall PMF

P(X=0)=0.512, P(X=1)=0.384, P(X=2)=0.096, P(X=3)=0.008

F(0)

P(X ≤ 0) = P(X=0) = 0.512

F(1)

P(X ≤ 1) = P(X=0) + P(X=1) = 0.512 + 0.384 = 0.896

F(2)

P(X ≤ 2) = 0.896 + 0.096 = 0.992

F(3)

P(X ≤ 3) = 0.992 + 0.008 = 1.000 ✓

Applications

P(X > 1) = 1 − F(1) = 1 − 0.896 = 0.104 (10.4% chance of 2+ defects)
P(1 < X ≤ 3) = F(3) − F(1) = 1.000 − 0.896 = 0.104
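
Building the staircase CDF from the PMF amounts to a running sum plus a lookup. The sketch below is one possible way to write it, using accumulate for the running totals and bisect_right to find the step that a given real x sits on; the function name cdf is introduced here for illustration.

```python
from bisect import bisect_right
from itertools import accumulate

values = [0, 1, 2, 3]
pmf = [0.512, 0.384, 0.096, 0.008]      # from the defective-phones example
cumulative = list(accumulate(pmf))       # running totals: F(0), F(1), F(2), F(3)

def cdf(x):
    """F(x) = P(X <= x) for any real x: a right-continuous step function."""
    i = bisect_right(values, x)          # number of possible values <= x
    return cumulative[i - 1] if i > 0 else 0.0

print([round(c, 3) for c in cumulative])
print("F(0.999) =", round(cdf(0.999), 3))   # still on the first step
print("F(1)     =", round(cdf(1), 3))       # the jump at x = 1 is included
print("P(X > 1) =", round(1 - cdf(1), 3))
print("P(1 < X <= 3) =", round(cdf(3) - cdf(1), 3))
```

Evaluating just below and exactly at x = 1 shows right-continuity in action: the jump belongs to the point where it happens.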

Section 07

PMF vs PDF vs CDF — The Full Comparison

Feature	PMF p(x)	PDF f(x)	CDF F(x)
Variable type	Discrete only	Continuous only	Both discrete & continuous
Value at a point	P(X=x) — a probability	A density (NOT a probability)	P(X≤x) — always a probability
Range of values	0 ≤ p(x) ≤ 1	f(x) ≥ 0 (can exceed 1!)	0 ≤ F(x) ≤ 1 always
Sums/integrates to	Σ p(x) = 1	∫ f(x)dx = 1	F(+∞) = 1
Shape	Bar chart (spikes at values)	Smooth curve or flat line	Staircase (discrete) or S-curve
Probability of interval	Σ p(x) for a ≤ x ≤ b	∫ₐᵇ f(x)dx (area under curve)	F(b) − F(a) = P(a < X ≤ b)
Relationship	CDF = cumulative sum of PMF	CDF = integral of PDF	PDF = derivative of CDF

Section 08

Common Discrete Distributions & Their PMFs 🎰

Distribution	PMF p(x)	Parameters	Mean	Use Case
Bernoulli	p if x=1; (1−p) if x=0	p ∈ [0,1]	p	Single trial: click/no click
Binomial B(n,p)	C(n,k)pᵏ(1−p)ⁿ⁻ᵏ	n trials, p success prob	np	n trials, count successes
Poisson(λ)	e⁻λ λᵏ / k!	λ = average rate	λ	Count events per unit time/area
Geometric(p)	(1−p)ᵏ⁻¹ p	p = success probability	1/p	Trials until first success
Uniform(a,b) discrete	1/(b−a+1)	a = min, b = max	(a+b)/2	Fair die, lottery
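
The Mean column of the table can be sanity-checked by summing x · p(x) directly. The sketch below does this for the Poisson and Geometric rows, with parameter values (λ = 4, p = 0.25) assumed purely for illustration. The infinite sums are truncated at a point where the remaining tail is negligible.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), k = 0, 1, 2, ..."""
    return exp(-lam) * lam**k / factorial(k)

def geometric_pmf(k, p):
    """P(X = k) for the number of trials until the first success, k = 1, 2, ..."""
    return (1 - p) ** (k - 1) * p

lam, p = 4.0, 0.25  # assumed parameter values, for illustration only

# Truncate the infinite sums where the remaining tail mass is negligible.
poisson_total = sum(poisson_pmf(k, lam) for k in range(0, 100))
poisson_mean = sum(k * poisson_pmf(k, lam) for k in range(0, 100))
geometric_total = sum(geometric_pmf(k, p) for k in range(1, 500))
geometric_mean = sum(k * geometric_pmf(k, p) for k in range(1, 500))

print(f"Poisson:   sum = {poisson_total:.6f}, mean = {poisson_mean:.6f} (lambda = {lam})")
print(f"Geometric: sum = {geometric_total:.6f}, mean = {geometric_mean:.6f} (1/p = {1 / p})")
```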

Section 09

Common Continuous Distributions & Their PDFs 🌊

Distribution	PDF f(x)	Range	Mean	Use Case
Uniform(a,b)	1/(b−a)	[a, b]	(a+b)/2	Equal likelihood over interval
Normal N(μ,σ²)	1/(σ√(2π)) · e^(−(x−μ)²/(2σ²))	(−∞, +∞)	μ	Heights, errors, test scores
Exponential(λ)	λe^(−λx)	[0, +∞)	1/λ	Time between events, survival
Beta(α,β)	x^(α−1)(1−x)^(β−1)/B(α,β)	[0, 1]	α/(α+β)	Probabilities, proportions
Gamma(α,β), β = scale	x^(α−1)e^(−x/β)/(βᵅΓ(α))	[0, +∞)	αβ	Waiting times, insurance claims
Log-Normal	1/(xσ√(2π)) · e^(−(ln x−μ)²/(2σ²))	(0, +∞)	e^(μ+σ²/2)	Stock prices, income, file sizes
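
For continuous distributions the same checks use integrals instead of sums. The sketch below integrates the Exponential and Beta densities from the table numerically, with parameter values (λ = 2, α = 2, β = 5) assumed for illustration and a hand-written midpoint rule (midpoint_integral is a helper for this sketch, not a library function).

```python
from math import exp, gamma

def midpoint_integral(f, a, b, steps=50_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

lam = 2.0               # assumed Exponential rate
alpha, beta = 2.0, 5.0  # assumed Beta shape parameters

# B(alpha, beta) = Gamma(alpha) * Gamma(beta) / Gamma(alpha + beta)
beta_norm = gamma(alpha) * gamma(beta) / gamma(alpha + beta)

def exponential_pdf(x):
    return lam * exp(-lam * x)

def beta_pdf(x):
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / beta_norm

# The exponential tail beyond x = 40 is negligible for lam = 2.
exp_area = midpoint_integral(exponential_pdf, 0.0, 40.0)
exp_mean = midpoint_integral(lambda x: x * exponential_pdf(x), 0.0, 40.0)
beta_area = midpoint_integral(beta_pdf, 0.0, 1.0)
beta_mean = midpoint_integral(lambda x: x * beta_pdf(x), 0.0, 1.0)

print(f"Exponential: area = {exp_area:.5f}, mean = {exp_mean:.5f} (1/lambda = {1 / lam})")
print(f"Beta:        area = {beta_area:.5f}, mean = {beta_mean:.5f} "
      f"(alpha/(alpha+beta) = {alpha / (alpha + beta):.5f})")
```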

Section 10

Using the CDF — Practical Calculations 🔧

🧮 CDF in Action — Four Question Types from One Distribution

Setup

Test scores follow a Normal distribution: μ = 70, σ = 10. F(x) is the standard normal CDF evaluated at z = (x−70)/10.

P(X ≤ 80)

z = (80−70)/10 = 1.0. F(80) = P(Z≤1.0) = 0.841 (84.1%)
84.1% of students score 80 or below.

P(X > 80)

P(X > 80) = 1 − F(80) = 1 − 0.841 = 0.159 (15.9%)
15.9% of students score above 80.

P(60 ≤ X ≤ 80)

z₁ = (60−70)/10 = −1.0, z₂ = (80−70)/10 = 1.0
P = F(80) − F(60) = 0.8413 − 0.1587 ≈ 0.683 (68.3%), in line with the 68% rule.

Percentile

What score is the 90th percentile? Find x such that F(x) = 0.90.
z = 1.282 → x = μ + z×σ = 70 + 1.282×10 = 82.82
90% of students score below 82.82.
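
All four question types map onto two methods of NormalDist in Python's statistics module: cdf for "at most" questions and inv_cdf for percentiles. A minimal sketch, with variable names chosen here for illustration:

```python
from statistics import NormalDist

scores = NormalDist(mu=70, sigma=10)  # setup from the example above

p_at_most_80 = scores.cdf(80)                     # P(X <= 80)
p_above_80 = 1 - scores.cdf(80)                   # P(X > 80)
p_60_to_80 = scores.cdf(80) - scores.cdf(60)      # P(60 <= X <= 80)
percentile_90 = scores.inv_cdf(0.90)              # x such that F(x) = 0.90

print(f"P(X <= 80)       = {p_at_most_80:.3f}")
print(f"P(X > 80)        = {p_above_80:.3f}")
print(f"P(60 <= X <= 80) = {p_60_to_80:.3f}")
print(f"90th percentile  = {percentile_90:.2f}")
```

The inverse CDF runs the lookup backwards: instead of going from a score to a probability, it goes from a probability to a score.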

Section 11

Random Variables in Data Science & ML 🤖

Application	Random Variable	Type	Distribution Used
🖱️ Ad click prediction	X = 1 (click), 0 (no click)	Discrete	Bernoulli / Binomial
📦 Inventory management	X = daily demand (units)	Discrete	Poisson
📉 Stock price modelling	X = daily log-return	Continuous	Normal (prices themselves: Log-Normal)
⏱️ Server response time	X = time to respond (ms)	Continuous	Exponential / Gamma
🎯 Bayesian A/B testing	X = true conversion rate	Continuous	Beta
🔤 NLP: word frequency	X = count of word w in doc	Discrete	Multinomial / Poisson
🧬 Genomics: mutations	X = mutations per genome	Discrete	Poisson
🏠 House price prediction	X = log(house price)	Continuous	Normal (log-transformed)

🧮

The Connection to Machine Learning Loss Functions

Many ML loss functions are statements about random variables and their distributions in disguise. Minimising Mean Squared Error is equivalent to maximum likelihood when the target is assumed normally distributed around the prediction with constant variance. Cross-entropy loss (for classification) corresponds to a Bernoulli or categorical distribution. Poisson loss is used for count data. Understanding PMFs and PDFs is the key to understanding why certain loss functions are appropriate for certain problems.
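
The link between Mean Squared Error and the normal distribution can be seen in a few lines. The sketch below uses made-up targets and predictions and assumes a fixed noise standard deviation of 1. It shows that the Gaussian negative log-likelihood equals a constant plus a rescaled MSE, so minimising one minimises the other.

```python
from math import isclose, log, pi

# Hypothetical targets and two sets of predictions, made up for illustration.
y_true = [3.0, 5.0, 2.5, 7.0]
preds_a = [2.8, 5.1, 2.9, 6.5]
preds_b = [4.0, 4.0, 4.0, 4.0]
sigma = 1.0  # assumed fixed noise standard deviation

def mse(y, y_hat):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y, y_hat)) / len(y)

def gaussian_nll(y, y_hat, sigma):
    """Negative log-likelihood of y under N(y_hat_i, sigma^2), summed over points."""
    return sum(
        0.5 * log(2 * pi * sigma**2) + (t - p) ** 2 / (2 * sigma**2)
        for t, p in zip(y, y_hat)
    )

n = len(y_true)
constant = n * 0.5 * log(2 * pi * sigma**2)  # does not depend on the predictions

for name, preds in (("A", preds_a), ("B", preds_b)):
    nll = gaussian_nll(y_true, preds, sigma)
    rescaled = constant + n * mse(y_true, preds) / (2 * sigma**2)
    print(f"{name}: MSE = {mse(y_true, preds):.4f}, NLL = {nll:.4f}, "
          f"const + scaled MSE = {rescaled:.4f}, match = {isclose(nll, rescaled)}")
```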

Section 12

The Golden Rules

🎯 12 Rules Every Data Scientist Must Master

Identify discrete vs continuous before anything else. This single decision determines whether you use a PMF or PDF, sums or integrals, bar charts or density curves. Getting this wrong invalidates all subsequent analysis.

P(X = x) = 0 for any continuous random variable. This is not an approximation — it is mathematically exact. For continuous variables, always compute probabilities over intervals, never at points.

PDF values are densities, not probabilities. f(x) can exceed 1. Only areas under the PDF curve represent probabilities. Never interpret a PDF height as "the probability of x."

The CDF is universal — use it to compute all interval probabilities. P(a < X ≤ b) = F(b) − F(a) works for every distribution, discrete or continuous. This is often the fastest computation path.

Every PMF and PDF must satisfy two conditions. Non-negativity (p(x) ≥ 0 or f(x) ≥ 0) AND normalisation (sums or integrates to 1). If either fails, it is not a valid probability function.

E[X] is the centre of mass of the distribution. For symmetric, unimodal distributions with a finite mean, E[X] equals the median and mode. For skewed distributions, these three measures of centre diverge, and E[X] is typically pulled toward the long tail.

F'(x) = f(x) — the PDF is the derivative of the CDF. Conversely, the CDF is the integral of the PDF. These two functions carry identical information — just in different forms. Statistical software switches between them constantly.

The CDF is always non-decreasing from 0 to 1. It can never decrease. It starts at 0 (left tail) and ends at 1 (right tail). Any function that violates these properties is not a valid CDF.

Choose distributions based on data-generating process, not visual fit alone. Count data that can only be 0, 1, 2, … → Poisson or Binomial. Proportions strictly between 0 and 1 → Beta. Time until an event → Exponential. The mechanism matters more than the histogram shape.

Standardise to use the normal CDF table. For X ~ N(μ, σ²), convert to Z = (X − μ)/σ ~ N(0,1). All normal probability calculations reduce to looking up values in the standard normal table or using the Φ(z) function.

The Poisson is the limiting case of the Binomial. When n is very large and p is very small (rare events), Binomial(n, p) ≈ Poisson(λ = np). This is why Poisson models rare events: insurance claims, network packets, radioactive decay.

Many ML models have an assumed distribution behind their loss function. Understanding PMFs and PDFs lets you read such models as probabilistic models, and choose or design the right loss for your data type and distribution assumptions.
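
The rule that the Poisson approximates the Binomial for large n and small p can be checked numerically. The sketch below assumes n = 1000 and p = 0.003 (values chosen only for illustration, giving λ = np = 3) and compares the two PMFs side by side.

```python
from math import comb, exp, factorial

n, p = 1000, 0.003   # assumed values: many trials, rare success
lam = n * p          # matching Poisson rate, lambda = np

def binomial_pmf(k):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

rows = [(k, binomial_pmf(k), poisson_pmf(k)) for k in range(0, 11)]
max_gap = max(abs(b - q) for _, b, q in rows)

for k, b, q in rows:
    print(f"k = {k:2d}: Binomial = {b:.5f}, Poisson = {q:.5f}")
print(f"largest pointwise gap for k = 0..10: {max_gap:.6f}")
```

Rerunning with a larger p (say 0.3) and smaller n makes the gap visibly wider, which is a useful reminder that the approximation depends on events being rare.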

🧮

You Now Have the Probabilistic Toolkit

Random variables give numbers to uncertainty. The discrete/continuous split determines your mathematical language. The PMF assigns a probability to each discrete outcome. The PDF describes the density of probability across a continuous range, with areas as probabilities. The CDF accumulates probability from left to right and answers every "at most" question directly. Together, these tools form the complete probabilistic description of any real-world quantity: the foundation of statistical models, machine learning algorithms, and data-driven decisions.