
Joint & Marginal Probability

A richly illustrated, story-driven tutorial covering joint probability P(A,B), marginal probability through summation/integration, and the precise mathematical triangle connecting joint, marginal, and conditional probability — with three complete real-world stories (Meera's commute, e-commerce email campaigns, medical smoking study), six custom inline SVGs (Venn, joint-to-marginal collapse, Golden Triangle, bivariate normal contours, probability tree, full tables), independence testing, continuo

Section 01

When Two Things Happen Together 🔗

Most real-world events don't happen in isolation. A customer buys a laptop and a mouse. A patient has high blood pressure and elevated cholesterol. A student studies data science and passes their exam. The world is full of pairs — and pairs of events demand a richer probability vocabulary than single events alone.

Until now, we have asked questions like "What is the probability of event A?" — one variable, one question. But data science, machine learning, and statistics constantly ask about relationships between variables: How likely are A and B together? Does knowing one tell us about the other? Are they related or independent? These questions require three interconnected concepts: joint probability, marginal probability, and their precise relationship to conditional probability.

Master these three, and you will understand the mathematical foundation of everything from the Naive Bayes classifier to the full joint distribution of a Bayesian network — the backbone of probabilistic AI.

💡
The Three Probabilities and Their Relationship

Joint: P(A, B) — probability that BOTH A and B occur simultaneously.  |  Marginal: P(A) — probability of A alone, ignoring B entirely (sum/integrate out B).  |  Conditional: P(A|B) — probability of A given that B is known to have occurred.  |  They are connected by: P(A,B) = P(A|B) × P(B) = P(B|A) × P(A)


Section 02

Joint Probability — P(A, B) 🎯

The Story: The Morning Commuter

Meera commutes to work every morning. On any given day, two uncertain things happen: it might rain (R), and her bus might be late (L). Neither is certain, and they might be related — rain could cause traffic that delays the bus. Meera wants to understand not just "Will it rain?" or "Will the bus be late?" in isolation, but all the combinations together. This is joint probability.

Joint Probability (Discrete)
P(A, B) = P(A ∩ B)
Probability that both event A and event B occur simultaneously in the same trial. Also written P(A ∩ B). The comma and intersection symbol are interchangeable.
Joint Probability (Continuous)
f(x, y) — Joint PDF
For continuous variables X and Y, the joint probability density function f(x,y) describes how probability density is spread across all combinations of values. Probabilities come from integrating the density over a region: P(a≤X≤b, c≤Y≤d) = ∫_a^b ∫_c^d f(x,y) dy dx
Joint PMF (Discrete)
p(x, y) = P(X=x, Y=y)
For discrete random variables X and Y, the joint PMF assigns a probability to every combination (x, y). All values must sum to 1: Σₓ Σᵧ p(x,y) = 1
Independence Test
P(A,B) = P(A)·P(B)
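A and B are independent if and only if their joint probability equals the product of their individual probabilities. Any deviation signals statistical dependence: knowing that one occurred changes the probability of the other. Dependence alone does not show that one event causes the other.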
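[Figure: Joint Probability, Two Ways to See It. Left: a Venn diagram in the sample space Ω, where the overlap A∩B is the joint probability P(A,B). Right: a 2×2 joint probability table with interior cells P(A,B), P(A,Bᶜ), P(Aᶜ,B), P(Aᶜ,Bᶜ), the marginals P(B) and P(Bᶜ) as column totals, and the marginals of A as row totals.]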

Section 03

The Joint Distribution Table 📋

The Story: Meera's Commute — Building the Full Picture

Meera tracked her commute for 200 days. She recorded whether it rained (R) and whether her bus was late (L). Here are the raw counts — and from these, we can build the complete joint probability distribution, read off all marginal probabilities, and derive every conditional probability.

Count (200 days) | Bus Late (L) | Bus On Time (Lᶜ) | Row Total
Rained (R) | 45 | 15 | 60
No Rain (Rᶜ) | 25 | 115 | 140
Column Total | 70 | 130 | 200

Now divide every cell by the grand total (200) to convert counts into probabilities:

Joint Probability | Bus Late (L) | On Time (Lᶜ) | Marginal of Rain
Rained (R) | P(R,L) = 45/200 = 0.225 | P(R,Lᶜ) = 15/200 = 0.075 | P(R) = 60/200 = 0.300
No Rain (Rᶜ) | P(Rᶜ,L) = 25/200 = 0.125 | P(Rᶜ,Lᶜ) = 115/200 = 0.575 | P(Rᶜ) = 140/200 = 0.700
Marginal of Bus | P(L) = 70/200 = 0.350 | P(Lᶜ) = 130/200 = 0.650 | 1.000 ✓
📐
How to Read the Joint Table

Every interior cell is a joint probability — the probability of one specific combination of both variables. Every row total is a marginal probability for the row variable. Every column total is a marginal probability for the column variable. And all four interior cells sum to 1.00 — because they cover every possible combination of the two events.
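The count-to-probability step can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the dictionary layout and the key names (`"rain"`, `"late"`, and so on) are assumptions made for this example, while the counts are Meera's.

```python
# Raw day counts from Meera's 200-day log, keyed by (rain, bus) outcome.
# The key names are illustrative labels, not part of the original data.
counts = {
    ("rain", "late"): 45,
    ("rain", "on_time"): 15,
    ("no_rain", "late"): 25,
    ("no_rain", "on_time"): 115,
}

total = sum(counts.values())  # grand total of observed days

# Divide every cell by the grand total to turn counts into joint probabilities.
joint = {cell: n / total for cell, n in counts.items()}

for cell, p in joint.items():
    print(cell, round(p, 3))

# Normalisation check: the interior cells must sum to 1.
assert abs(sum(joint.values()) - 1.0) < 1e-12
```

Keeping the table as a mapping from outcome pairs to numbers makes each interior cell directly addressable, which mirrors how the table is read by hand.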


Section 04

Marginal Probability — Summing Out a Variable 📊

The Story: Zooming Out to the Big Picture

Meera's boss asks: "Overall, how often does it rain?" He doesn't care about the bus. He wants to ignore the bus variable entirely and just look at rain alone. To answer this from the joint table, Meera adds up all the joint probabilities across every value of the bus variable. She is marginalising out the bus — summing over all its possible outcomes.

The word "marginal" comes from the physical location of these probabilities in the table — they appear in the margins (the row and column totals at the edges). But the concept is far more important than the name: marginalisation is the operation that lets you extract information about one variable while ignoring all others.

Marginal — Discrete (Sum)
P(A) = Σ_B P(A, B)
Sum the joint probability P(A,B) over all possible values of B. "Summing out" B collapses the joint distribution to reveal A's distribution alone.
Marginal — Continuous (Integrate)
f_X(x) = ∫ f(x,y) dy
Integrate the joint density over all values of Y. The result is the marginal density of X, describing X's distribution regardless of Y's value.
Marginal P(B) from Joint
P(B) = Σ_A P(A, B)
Sum the joint probability P(A,B) over all possible values of A. This gives B's distribution, ignoring A completely.
Normalisation Check
Σ_A P(A) = 1 and Σ_B P(B) = 1
Both marginal distributions must sum to 1 separately. And all joint probabilities must also sum to 1 together: Σ_A Σ_B P(A,B) = 1.
🧮 Computing Marginals — Meera's Commute
P(R) — Marginal
Sum across all bus outcomes (L and Lᶜ):
P(R) = P(R,L) + P(R,Lᶜ) = 0.225 + 0.075 = 0.300 (30%)
It rains on 30% of Meera's commute days — regardless of bus timing.
P(Rᶜ) — Marginal
P(Rᶜ) = P(Rᶜ,L) + P(Rᶜ,Lᶜ) = 0.125 + 0.575 = 0.700 (70%)
Check: P(R) + P(Rᶜ) = 0.30 + 0.70 = 1.00 ✓
P(L) — Marginal
Sum across all rain outcomes (R and Rᶜ):
P(L) = P(R,L) + P(Rᶜ,L) = 0.225 + 0.125 = 0.350 (35%)
The bus is late on 35% of days — ignoring whether it rained.
P(Lᶜ) — Marginal
P(Lᶜ) = P(R,Lᶜ) + P(Rᶜ,Lᶜ) = 0.075 + 0.575 = 0.650 (65%)
Check: P(L) + P(Lᶜ) = 0.35 + 0.65 = 1.00 ✓
All Cells Sum
0.225 + 0.075 + 0.125 + 0.575 = 1.000 ✓
All joint probabilities cover all possible outcomes and sum to 1.
[Figure: Marginalisation, Summing Out a Variable. The joint table P(R,L) (Rain: 0.225 late, 0.075 on time; No Rain: 0.125 late, 0.575 on time; column totals 0.350 and 0.650) collapses, by summing over L, into the marginal P(R): Rain 0.300, No Rain 0.700, total 1.000 ✓]
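Summing out a variable is mechanical enough to express as a small helper. The sketch below assumes the joint table is stored as a dictionary keyed by (rain, bus) pairs; the function name `marginal` and its `axis` parameter are hypothetical choices for this example.

```python
from collections import defaultdict

# Meera's joint table; key names are illustrative labels.
joint = {
    ("rain", "late"): 0.225,
    ("rain", "on_time"): 0.075,
    ("no_rain", "late"): 0.125,
    ("no_rain", "on_time"): 0.575,
}

def marginal(joint, axis):
    """Keep the variable at position `axis` and sum the joint over the other one."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[outcome[axis]] += p
    return dict(out)

p_rain = marginal(joint, axis=0)  # sums out the bus variable
p_bus = marginal(joint, axis=1)   # sums out the rain variable

print({k: round(v, 3) for k, v in p_rain.items()})
print({k: round(v, 3) for k, v in p_bus.items()})
```

The same helper works for tables with more than two outcomes per variable, because it never assumes the table is 2×2.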

Section 05

The Golden Triangle — Joint, Marginal & Conditional 🔺

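Joint, marginal, and conditional probability are not three separate topics. They are three views of the same underlying information, connected by exact mathematical relationships. The joint distribution determines every marginal and every conditional, and a conditional combined with its matching marginal rebuilds the joint. The marginals alone are not enough, because many different joint tables share the same row and column totals. This triangle is the algebraic heart of probabilistic reasoning.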

Joint from Conditional × Marginal
P(A,B) = P(A|B) · P(B)
The multiplication rule. Joint probability = conditional × the marginal we're conditioning on. This is how Bayesian networks factorise complex joint distributions into simpler conditionals.
Conditional from Joint ÷ Marginal
P(A|B) = P(A,B) / P(B)
The definition of conditional probability, valid whenever P(B) > 0. Divide the joint probability by the marginal of the conditioning variable. This restricts the sample space to B's universe.
Marginal from Joint (Sum Out)
P(A) = Σ_B P(A,B)
Marginalise by summing (or integrating) the joint distribution over all values of the unwanted variable B. Eliminates B entirely from the picture.
Bayes' Theorem (Derived from Triangle)
P(A|B) = P(B|A)·P(A) / P(B)
Directly derived by applying P(A,B) = P(A|B)·P(B) = P(B|A)·P(A) simultaneously. Bayes' theorem IS the Golden Triangle applied twice to the same joint probability.
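[Figure: The Golden Triangle, Three Probability Types. Vertices: Joint P(A,B); Marginal P(A), P(B); Conditional P(A|B), P(B|A). Edges: joint to marginal by summing, Σ_B P(A,B); joint to conditional by dividing, P(A|B) = P(A,B)/P(B); conditional to joint by multiplying, P(A|B) × P(B). Bayes: P(A|B) = P(B|A)P(A)/P(B).]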
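The three edges of the triangle translate directly into three one-line functions. The following is a rough sketch, with function names invented for this example, that checks the round trip using the rain and bus numbers from Meera's table (A = rain, B = late bus).

```python
def joint_from_conditional(p_a_given_b, p_b):
    """Multiplication rule: P(A,B) = P(A|B) * P(B)."""
    return p_a_given_b * p_b

def conditional_from_joint(p_ab, p_b):
    """Definition of conditional probability: P(A|B) = P(A,B) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive to condition on B")
    return p_ab / p_b

def bayes(p_b_given_a, p_a, p_b):
    """Bayes' theorem: build the joint from P(B|A) * P(A), then divide by P(B)."""
    return conditional_from_joint(joint_from_conditional(p_b_given_a, p_a), p_b)

# Values from Meera's table: P(R,L), P(R), P(L).
p_rl, p_r, p_l = 0.225, 0.300, 0.350

p_r_given_l = conditional_from_joint(p_rl, p_l)      # joint -> conditional
rebuilt = joint_from_conditional(p_r_given_l, p_l)   # conditional -> joint
p_l_given_r = conditional_from_joint(p_rl, p_r)
via_bayes = bayes(p_l_given_r, p_r, p_l)             # flip the conditional

print(round(p_r_given_l, 3), round(rebuilt, 3), round(via_bayes, 3))
```

Writing `bayes` as a composition of the other two functions makes the point of the triangle explicit: Bayes' theorem adds no new ingredient beyond multiplying and dividing.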

Section 06

Deriving Everything from the Joint Table 🔄

The joint probability table is a complete specification of the relationship between two variables. From it alone, you can recover every marginal probability, every conditional probability, test for independence, and verify Bayes' theorem. Let's do all of this with Meera's commute data.

🧮 Complete Analysis — All Probabilities from One Joint Table
Joint Values
P(R,L)=0.225  |  P(R,Lᶜ)=0.075  |  P(Rᶜ,L)=0.125  |  P(Rᶜ,Lᶜ)=0.575
Marginals
P(R)=0.300, P(Rᶜ)=0.700  |  P(L)=0.350, P(Lᶜ)=0.650
P(L|R)
Given it rained, what's the probability the bus is late?
P(L|R) = P(R,L) / P(R) = 0.225 / 0.300 = 0.750 (75%)
When it rains, the bus is late 75% of the time.
P(L|Rᶜ)
Given no rain, what's the probability the bus is late?
P(L|Rᶜ) = P(Rᶜ,L) / P(Rᶜ) = 0.125 / 0.700 = 0.179 (17.9%)
Without rain, only 17.9% late — a massive difference!
P(R|L)
Given the bus is late, what's the probability it rained?
P(R|L) = P(R,L) / P(L) = 0.225 / 0.350 = 0.643 (64.3%)
If the bus is late, there's a 64.3% chance it was raining.
Independence?
Test: P(R,L) = P(R) × P(L)?
P(R) × P(L) = 0.300 × 0.350 = 0.105   vs   P(R,L) = 0.225
0.105 ≠ 0.225 → NOT independent. Rain and late buses co-occur about 2.1 times as often as independence would predict.
Verify Bayes
P(R|L) = P(L|R) × P(R) / P(L)
= 0.750 × 0.300 / 0.350 = 0.225 / 0.350 = 0.643 ✓
Bayes' theorem and the direct calculation agree perfectly.
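The whole analysis above can be reproduced programmatically. This sketch assumes the joint table and its marginals are stored as dictionaries with illustrative key names; it builds the conditional table P(bus | rain) and runs the cell-by-cell independence test.

```python
import math

# Joint table and marginals from Meera's commute data.
joint = {
    ("rain", "late"): 0.225,
    ("rain", "on_time"): 0.075,
    ("no_rain", "late"): 0.125,
    ("no_rain", "on_time"): 0.575,
}
p_rain = {"rain": 0.300, "no_rain": 0.700}
p_bus = {"late": 0.350, "on_time": 0.650}

# P(bus | rain): divide each cell by the marginal of its rain outcome.
bus_given_rain = {(r, b): p / p_rain[r] for (r, b), p in joint.items()}

# P(rain | bus): divide each cell by the marginal of its bus outcome.
rain_given_bus = {(r, b): p / p_bus[b] for (r, b), p in joint.items()}

# Independence requires joint == product of marginals in every cell.
independent = all(
    math.isclose(p, p_rain[r] * p_bus[b], abs_tol=1e-9)
    for (r, b), p in joint.items()
)

print(round(bus_given_rain[("rain", "late")], 3))     # P(L|R)
print(round(bus_given_rain[("no_rain", "late")], 3))  # P(L|Rᶜ)
print(round(rain_given_bus[("rain", "late")], 3))     # P(R|L)
print("independent:", independent)
```

Each row of `bus_given_rain` sums to 1, which is a handy extra check that the division used the right marginal.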

Section 07

Story 2 — The E-Commerce Platform 🛒

Customer Purchase Analysis

An online retailer tracks two binary behaviours across 1,000 customers in one month: whether they opened a promotional email (E: yes/no) and whether they made a purchase (P: yes/no). The marketing team wants to understand the joint, marginal, and conditional probabilities to design better campaigns.

Joint Table (n=1000) | Purchased (P) | Did Not Purchase (Pᶜ) | Marginal of Email
Opened Email (E) | 180 → 0.180 | 220 → 0.220 | 400 → P(E) = 0.400
Didn't Open (Eᶜ) | 70 → 0.070 | 530 → 0.530 | 600 → P(Eᶜ) = 0.600
Marginal of Purchase | 250 → P(P) = 0.250 | 750 → P(Pᶜ) = 0.750 | 1.000 ✓
🧮 E-Commerce — Business Insights from Probability Analysis
Baseline
Overall purchase rate (marginal): P(P) = 0.250 (25%)
Email open rate (marginal): P(E) = 0.400 (40%)
P(P|E)
Purchase rate among email openers:
P(P|E) = P(E,P) / P(E) = 0.180 / 0.400 = 0.450 (45%)
Email openers buy at nearly twice the baseline rate!
P(P|Eᶜ)
Purchase rate among non-openers:
P(P|Eᶜ) = P(Eᶜ,P) / P(Eᶜ) = 0.070 / 0.600 = 0.117 (11.7%)
Non-openers buy at less than half the baseline rate.
P(E|P)
Among those who bought, what fraction opened the email?
P(E|P) = P(E,P) / P(P) = 0.180 / 0.250 = 0.720 (72%)
72% of buyers opened the email — strong signal for attribution.
Independence?
P(E) × P(P) = 0.400 × 0.250 = 0.100   vs   P(E,P) = 0.180
0.100 ≠ 0.180 → Highly dependent. Email opening and purchasing are strongly linked.
Lift
Lift = P(P|E) / P(P) = 0.450 / 0.250 = 1.80×
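Email openers are 1.8× as likely to purchase as the average customer. That is a strong association, though on its own it does not prove the email caused the purchases: customers who already intend to buy may simply be more likely to open it.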
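Starting from the raw counts, the purchase rates and the lift take only a few lines to compute. The variable names below are assumptions for this sketch; the counts are those in the e-commerce table.

```python
# Customer counts from the e-commerce table, keyed by (email, purchase) outcome.
counts = {
    ("opened", "bought"): 180,
    ("opened", "no_purchase"): 220,
    ("not_opened", "bought"): 70,
    ("not_opened", "no_purchase"): 530,
}

n = sum(counts.values())
n_opened = counts[("opened", "bought")] + counts[("opened", "no_purchase")]
n_not_opened = n - n_opened
n_bought = counts[("opened", "bought")] + counts[("not_opened", "bought")]

p_buy = n_bought / n                                              # marginal P(P)
p_buy_given_open = counts[("opened", "bought")] / n_opened        # P(P|E)
p_buy_given_not = counts[("not_opened", "bought")] / n_not_opened # P(P|Eᶜ)
p_open_given_buy = counts[("opened", "bought")] / n_bought        # P(E|P)

# Lift compares the conditional purchase rate with the baseline rate.
lift = p_buy_given_open / p_buy

print(round(p_buy_given_open, 3), round(p_buy_given_not, 3), round(lift, 2))
```

Working from counts rather than rounded probabilities avoids compounding rounding error when several ratios are chained together.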

Section 08

Joint Probability for Continuous Variables 🌊

The Story: Height and Weight

A researcher studies the joint distribution of height (H) and weight (W) in adults. Both are continuous variables. The joint PDF f(h, w) describes the density across all (height, weight) pairs — like a contour map of a mountain, where higher contours represent more probable combinations.

Joint PDF (Continuous)
f(x,y) ≥ 0 everywhere
The joint density function must be non-negative for all (x, y). Higher density = more probable combinations in that region of the plane.
Normalisation
∫∫ f(x,y)dxdy = 1
The double integral of the joint PDF over all possible (x, y) must equal 1. All probability across the entire plane sums to 100%.
Marginal PDF of X
f_X(x) = ∫ f(x,y)dy
Integrate out Y (over all its possible values) to get the marginal density of X alone. This is the continuous version of "summing out" a variable.
Conditional PDF
f(x|y) = f(x,y) / f_Y(y)
The joint PDF divided by the marginal density of Y gives the conditional density of X given Y = y. Exact continuous analogue of the discrete formula.
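The continuous formulas can be checked numerically on a toy density. The sketch below uses a hypothetical joint PDF, f(x, y) = x + y on the unit square, chosen only because it integrates to 1 and has the simple marginal f_X(x) = x + 1/2; it is not the height and weight distribution from the story. Integrals are approximated with the midpoint rule.

```python
def f(x, y):
    """Hypothetical joint PDF: x + y on the unit square, 0 elsewhere."""
    if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
        return x + y
    return 0.0

def midpoints(n):
    """Centres of n equal-width bins on [0, 1]."""
    return [(i + 0.5) / n for i in range(n)]

def marginal_x(x, n=1000):
    """Approximate f_X(x), the integral of f(x, y) over y, with the midpoint rule."""
    return sum(f(x, y) for y in midpoints(n)) / n

def marginal_y(y, n=1000):
    """Approximate f_Y(y), the integral of f(x, y) over x, with the midpoint rule."""
    return sum(f(x, y) for x in midpoints(n)) / n

def total_mass(n=200):
    """Approximate the double integral of f over the whole square."""
    return sum(marginal_x(x, n) for x in midpoints(n)) / n

def conditional_x_given_y(x, y):
    """Conditional density f(x | y) = f(x, y) / f_Y(y)."""
    return f(x, y) / marginal_y(y)

print(round(total_mass(), 6))                      # normalisation check
print(round(marginal_x(0.3), 6))                   # compare with 0.3 + 0.5
print(round(conditional_x_given_y(0.3, 0.5), 6))
```

For densities without a convenient closed form, the same structure applies, but the grid would need to cover the region where the density is non-negligible and the bin count would need tuning.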

Visualising the Bivariate Normal Distribution

The bivariate normal distribution is the most important continuous joint distribution. Its contour lines are ellipses centred on the mean (μₓ, μᵧ). Integrating the density along one axis, which can be pictured as projecting all the probability mass onto the X or Y axis, gives the marginal distributions, and both are normal.

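[Figure: Bivariate Normal, Joint and Marginals. Elliptical contours centred on (μₓ, μᵧ), with Height (X) and Weight (Y) on the axes and the marginal curves f_X(x) and f_Y(y) drawn along them. Key relationships: inner ellipse = high joint density; projecting onto the X-axis gives the marginal of X; projecting onto the Y-axis gives the marginal of Y; a slice at fixed Y=y gives the conditional f(x|y); the tilt of the ellipses reflects the correlation.]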

Section 09

Story 3 — Medical Study: Disease & Risk Factor 🏥

An epidemiologist studies whether smoking (S) is associated with lung disease (D) in a population of 5,000 patients. She builds the joint distribution from patient records. This example shows how joint probability analysis drives medical insights.

Joint Table (n=5000) | Lung Disease (D) | No Disease (Dᶜ) | Marginal of Smoking
Smoker (S) | 350 → 0.070 | 650 → 0.130 | 1000 → P(S) = 0.200
Non-Smoker (Sᶜ) | 150 → 0.030 | 3850 → 0.770 | 4000 → P(Sᶜ) = 0.800
Marginal of Disease | 500 → P(D) = 0.100 | 4500 → P(Dᶜ) = 0.900 | 1.000 ✓
🧮 Medical Epidemiology — Full Probability Analysis
Baseline
Disease prevalence: P(D) = 0.100 (10%) in this population.
Smoking rate: P(S) = 0.200 (20%) of the population.
P(D|S)
Disease risk among smokers:
P(D|S) = P(S,D) / P(S) = 0.070 / 0.200 = 0.350 (35%)
Smokers have a 35% chance of lung disease!
P(D|Sᶜ)
Disease risk among non-smokers:
P(D|Sᶜ) = P(Sᶜ,D) / P(Sᶜ) = 0.030 / 0.800 = 0.0375 (3.75%)
Non-smokers have only a 3.75% risk.
Relative Risk
Relative Risk (RR) = P(D|S) / P(D|Sᶜ) = 0.350 / 0.0375 = 9.33×
Smokers are over 9 times more likely to develop lung disease. A devastating finding.
P(S|D)
Among diseased patients, what fraction are smokers?
P(S|D) = P(S,D) / P(D) = 0.070 / 0.100 = 0.700 (70%)
70% of lung disease patients are smokers — despite smokers being only 20% of the population.
Verify with Total
P(D) = P(D|S)×P(S) + P(D|Sᶜ)×P(Sᶜ)
= 0.350×0.200 + 0.0375×0.800 = 0.070 + 0.030 = 0.100 ✓
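The relative risk and the total-probability check can be reproduced from the patient counts. This is a minimal sketch with variable names chosen for the example; the counts are those in the study table.

```python
# Patient counts from the study table, keyed by (smoking, disease) status.
counts = {
    ("smoker", "disease"): 350,
    ("smoker", "healthy"): 650,
    ("non_smoker", "disease"): 150,
    ("non_smoker", "healthy"): 3850,
}

n = sum(counts.values())
n_smoker = counts[("smoker", "disease")] + counts[("smoker", "healthy")]
n_non_smoker = n - n_smoker

p_s = n_smoker / n            # marginal P(S)
p_not_s = n_non_smoker / n    # marginal P(Sᶜ)

p_d_given_s = counts[("smoker", "disease")] / n_smoker              # P(D|S)
p_d_given_not_s = counts[("non_smoker", "disease")] / n_non_smoker  # P(D|Sᶜ)

# Relative risk: ratio of the two conditional disease probabilities.
relative_risk = p_d_given_s / p_d_given_not_s

# Law of total probability: rebuild P(D) from the weighted conditionals.
p_d_total = p_d_given_s * p_s + p_d_given_not_s * p_not_s

# Direct marginal from the counts, for comparison.
p_d_direct = (counts[("smoker", "disease")] + counts[("non_smoker", "disease")]) / n

print(round(relative_risk, 2), round(p_d_total, 3), round(p_d_direct, 3))
```

The agreement between `p_d_total` and `p_d_direct` is the same check performed by hand in the last step of the analysis.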

Section 10

Independence in Joint Distributions 🔍

Two random variables X and Y are independent if and only if their joint distribution factors perfectly into the product of their marginals. No amount of information about X tells you anything about Y — they are probabilistically unrelated.

Independence Condition
P(X=x, Y=y) = P(X=x)·P(Y=y)
For ALL pairs (x, y). The joint probability equals the product of marginals for every single cell in the table. A single violation disproves independence.
Continuous Independence
f(x,y) = f_X(x) · f_Y(y)
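For continuous variables, independence means the joint PDF factors into the product of marginal PDFs at every point. For a bivariate normal, ρ=0 (zero correlation) implies X and Y are independent. This is special to the jointly normal case: in general, zero correlation does not guarantee independence.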
🧮 Independence Illustration — Die Roll and Coin Flip
Setup
Roll a fair die (X = 1..6) and flip a fair coin (Y = H or T). These are independent — the die result has no effect on the coin.
Marginals
P(X=k) = 1/6 for k∈{1,2,3,4,5,6}  |  P(Y=H) = P(Y=T) = 1/2
Joint (one cell)
P(X=3, Y=H) = P(X=3) × P(Y=H) = 1/6 × 1/2 = 1/12 ≈ 0.0833
All 12 cells
Each of the 12 combinations (1H, 1T, 2H, 2T, ..., 6H, 6T) has probability 1/12.
12 × (1/12) = 1.00 ✓  |  Every cell = marginal × marginal — perfect independence.
Contrast
In Meera's commute: P(R,L) = 0.225 but P(R)×P(L) = 0.300×0.350 = 0.105.
0.225 ≠ 0.105 → Not independent. Rain and bus delay are correlated.
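The "every single cell" requirement is easy to test exhaustively. The sketch below uses exact fractions to avoid floating-point noise; the helper name `is_independent` is a hypothetical choice, and the commute probabilities are Meera's counts divided by 200.

```python
from fractions import Fraction
from itertools import product

# Fair die and fair coin marginals.
die = {k: Fraction(1, 6) for k in range(1, 7)}
coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}

# Joint table of the die and coin: 12 cells, each the product of its marginals.
die_coin = {(k, c): die[k] * coin[c] for k, c in product(die, coin)}

def is_independent(joint, px, py):
    """True only if joint == px * py holds in every cell of the table."""
    return all(joint[(x, y)] == px[x] * py[y] for x in px for y in py)

# Meera's commute as exact fractions (counts out of 200 days).
commute = {
    ("rain", "late"): Fraction(45, 200),
    ("rain", "on_time"): Fraction(15, 200),
    ("no_rain", "late"): Fraction(25, 200),
    ("no_rain", "on_time"): Fraction(115, 200),
}
rain = {"rain": Fraction(60, 200), "no_rain": Fraction(140, 200)}
bus = {"late": Fraction(70, 200), "on_time": Fraction(130, 200)}

print(is_independent(die_coin, die, coin))  # die and coin
print(is_independent(commute, rain, bus))   # rain and bus
```

With real data, exact equality is too strict because sample frequencies fluctuate; a statistical test such as chi-square would be the usual way to judge whether a deviation is meaningful.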

Section 11

Applications in Data Science & Machine Learning 🤖

Application | Joint Probability Used | Marginal Used | Conditional Derived
🧠 Naive Bayes Classifier | P(features, class) factorisations | P(class), the prior | P(class | features), the prediction
🎬 Recommendation System | P(user u, item i), co-occurrence | P(item i), item popularity | P(item i | user u), personalised recommendation
🔗 Bayesian Networks | Full joint = product of conditionals | Marginalise to query any variable | Conditional queries on all nodes
📊 Exploratory Data Analysis | Joint distributions in heatmaps | Histograms of individual features | Conditional distributions by group
🔍 Anomaly Detection | P(X, Y), expected combinations | P(X), expected individual values | P(Y | X), conditional outlier check
🧬 Genomics (GWAS) | P(SNP₁, SNP₂), linkage disequilibrium | P(SNP₁), allele frequency | P(disease | SNP profile)
📉 Finance: Risk Modelling | P(Asset A, Asset B), joint returns | P(Asset A), individual return distribution | P(A | B drops), contagion risk
🤖 Generative AI | P(all tokens jointly) in a sequence | P(token), unigram distribution | P(token_n | token_1, ..., token_{n-1}), the LLM's next-token distribution
🧮
Bayesian Networks — The Ultimate Expression of Joint Probability

A Bayesian network is a directed acyclic graph where each node is a random variable and each edge represents a direct conditional dependency. The full joint distribution of all variables is expressed as a product of conditional probabilities: P(X₁, X₂, ..., Xₙ) = Π P(Xᵢ | Parents(Xᵢ)). This factorisation makes it computationally feasible to reason about dozens of variables simultaneously, which would be intractable with a single giant joint table whose size grows exponentially with the number of variables. Any conditional or marginal probability can then be computed from the factorised joint through marginalisation.
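The factorisation can be illustrated with a tiny three-node chain, Rain → Late → Missed meeting. In this sketch, P(R) and P(L | R) come from Meera's table, while the third node and its probabilities P(M | L) are hypothetical values invented purely for illustration.

```python
from fractions import Fraction as F
from itertools import product

p_rain = F(3, 10)                                 # P(R) from Meera's table
p_late_given_rain = {True: F(3, 4), False: F(5, 28)}   # P(L | R), P(L | Rᶜ)
# Hypothetical third node: probability of missing a meeting given bus status.
p_miss_given_late = {True: F(2, 5), False: F(1, 20)}   # assumed values

def bernoulli(p_true, value):
    """Probability that a binary variable takes `value`, given P(True)."""
    return p_true if value else 1 - p_true

def joint(rain, late, miss):
    """P(R, L, M) = P(R) * P(L | R) * P(M | L), the chain factorisation."""
    return (
        bernoulli(p_rain, rain)
        * bernoulli(p_late_given_rain[rain], late)
        * bernoulli(p_miss_given_late[late], miss)
    )

states = [True, False]

# All 8 joint cells must sum to 1.
total = sum(joint(r, l, m) for r, l, m in product(states, repeat=3))

# Marginals: sum the joint over the variables we do not care about.
p_late = sum(joint(r, True, m) for r, m in product(states, repeat=2))
p_miss = sum(joint(r, l, True) for r, l in product(states, repeat=2))

# Conditional query: P(R | M) = P(R, M) / P(M).
p_rain_given_miss = sum(joint(True, l, True) for l in states) / p_miss

print(total, p_late, p_miss, p_rain_given_miss)
```

The network stores five numbers instead of the seven free parameters a full three-variable table would need; the saving grows rapidly as more variables are added. Enumerating every state, as done here, only works for very small networks.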


Section 12

Complete Formula Reference

Formula | Expression | What It Computes | Direction
Joint (Discrete) | P(A, B) = P(A ∩ B) | Probability both A and B occur | Both Together
Marginal from Joint | P(A) = Σ_B P(A, B) | A's probability ignoring B | Sum Out B
Joint from Conditional | P(A,B) = P(A|B) × P(B) | Joint via multiplication rule | Build Up
Conditional from Joint | P(A|B) = P(A,B) / P(B) | A's probability restricted to B's world | Restrict Down
Independence Test | P(A,B) = P(A) × P(B) | Check if variables are unrelated | Factor Check
Total Probability | P(A) = Σᵢ P(A|Bᵢ) × P(Bᵢ) | Marginal via weighted conditionals | Sum Over Parts
Bayes' Theorem | P(A|B) = P(B|A)·P(A)/P(B) | Flip conditional direction | Reverse Condition
Marginal PDF (Continuous) | f_X(x) = ∫ f(x,y)dy | X's density integrating out Y | Integrate Out Y

Section 13

The Golden Rules of Joint & Marginal Probability

🎯 12 Rules Every Data Scientist Must Master
1
The joint table contains everything. From a complete joint probability table, you can derive every marginal probability, every conditional probability, test for independence, verify Bayes' theorem, and apply the Law of Total Probability. It is the complete specification of the relationship between two variables.
2
Marginalisation = summing out unwanted variables. To find P(A), sum (or integrate) the joint distribution over every possible value of every other variable. This is how complex joint distributions are simplified into the quantity you actually care about.
3
Row totals = marginal of row variable; column totals = marginal of column variable. The name "marginal" literally comes from the margins of the joint table. Always build the full joint table first — the marginals fall out naturally from the row and column sums.
4
The Golden Triangle: Joint = Conditional × Marginal. P(A,B) = P(A|B)·P(B) = P(B|A)·P(A). This symmetry IS Bayes' theorem. Knowing any two of the three lets you calculate the third. Never memorise Bayes separately — it falls directly from the multiplication rule.
5
Independence means joint = product of marginals — for every single cell. P(A,B) = P(A)·P(B) must hold for every combination, not just on average. A single cell violation disproves independence. In the joint table, every cell would equal (row marginal) × (column marginal).
6
Conditional probability divides; joint probability multiplies. To go from joint → conditional: divide by the marginal. To go from conditional → joint: multiply by the marginal. This up-down relationship is the algebra of the Golden Triangle.
7
All joint probabilities must sum to 1; all marginals must also sum to 1. These are normalisation checks. If they fail, you have made an arithmetic error. Always verify both after constructing a joint table.
8
For continuous variables, replace sums with integrals throughout. Every discrete formula has a continuous analogue: Σ becomes ∫, PMF becomes PDF, and "probability at a point" becomes "density at a point." The conceptual structure is identical.
9
Marginalisation is how Bayesian inference works in practice. Computing a posterior P(θ|data) often requires marginalising out nuisance parameters. MCMC, variational inference, and numerical integration are all techniques for performing this marginalisation when it has no closed form.
10
Large differences between joint and product-of-marginals reveal strong associations. If P(A,B) is much larger than P(A)·P(B), A and B co-occur far more than chance — a strong positive association. If much smaller, they tend to avoid each other — a negative association. This is the basis of chi-square tests and mutual information.
11
The full joint distribution of n variables is the complete probabilistic model. Most probabilistic ML models can be read as estimating or approximating this joint distribution, either directly (generative models) or by modelling only a conditional such as P(label | features) (discriminative models). Understanding joint probability is the foundation for understanding probabilistic ML.
12
Always build the joint table when analysing two categorical variables. Before computing any conditional probabilities, first fill in the full joint table with both raw counts and probabilities. This prevents arithmetic errors, makes all relationships visible simultaneously, and reveals patterns that selective calculations would miss.
🧮
The Unified Picture

Joint probability is the atomic unit of probabilistic reasoning about multiple variables. From it, marginals reveal each variable in isolation. Conditionals reveal how knowledge of one variable reshapes our beliefs about another. The Golden Triangle connects all three through multiplication and division. Every tool in the data scientist's arsenal — Bayesian classifiers, recommendation engines, causal inference, generative models, Bayesian networks — is built on this foundation. Build the joint table. Sum to get marginals. Divide to get conditionals. Verify with Bayes. That workflow is the core of probabilistic thinking.