
Basic Probability Foundations

A comprehensive, story-driven tutorial covering the complete foundations of probability — from sample spaces and event types to all three probability interpretations (Classical, Frequentist, Bayesian), the Addition Rule (with Venn diagrams), the Multiplication Rule (with tree diagrams), conditional probability, Kolmogorov's axioms, a Bayesian medical diagnosis worked example, the Law of Large Numbers diagram, independence testing, and 10 golden rules every data scientist must master.

Section 01

The Language of Uncertainty 🎲

Every single day, you make decisions under uncertainty. Will it rain? Will the bus arrive on time? Is this email a scam? Your brain quietly estimates likelihoods and makes choices — but it does so imprecisely, shaped by emotion, memory, and bias.

Probability is humanity's attempt to make this process exact. It is the mathematical language for quantifying uncertainty — assigning precise numbers between 0 and 1 to the likelihood of outcomes. A probability of 0 means "impossible." A probability of 1 means "certain." Everything real and interesting lives between these two extremes.

This tutorial covers the absolute foundations — the vocabulary and rules that every concept in statistics, machine learning, and data science is built upon. Master these, and everything from Bayesian classifiers to hypothesis tests to neural network loss functions will make intuitive sense.

💡
Why These Foundations Matter

Conditional probability underlies Naive Bayes classifiers. The multiplication rule powers Bayesian networks. The addition rule is the heart of the total probability theorem. The sample space concept is at the core of Monte Carlo simulations. None of these can be understood without first mastering what follows in this tutorial.


Section 02

Sample Space & Events 🎯

The Story: Rolling a Die at a Board Game Night

It's game night. You roll a standard six-sided die. Before the die even lands, you can list every possible result: 1, 2, 3, 4, 5, or 6. Nothing else can happen — the die can't land on 7, it can't vanish mid-air. This complete list of all possible outcomes is called the Sample Space.

Now your friend says: "I win if I roll an even number." The outcomes that satisfy this condition — {2, 4, 6} — form an Event. An event is simply a subset of the sample space that you care about. Events can be as simple as a single outcome (rolling a 6) or as complex as "rolling a prime number greater than 3" (just {5}).

Sample Space (Ω or S)
Ω = {all possible outcomes}
The complete set of every outcome that could occur in one trial of an experiment. Nothing outside Ω can happen.
Event (E)
E ⊆ Ω
Any subset of the sample space. E occurs if the actual outcome is one of the elements in E.
Probability of an Event
0 ≤ P(E) ≤ 1
P(Ω) = 1 (something must happen). P(∅) = 0 (the impossible event never occurs). The probabilities of the individual outcomes in Ω sum to 1.
Complement Event (Eᶜ)
P(Eᶜ) = 1 − P(E)
The complement of E is everything in Ω that is NOT in E. Together, E and Eᶜ cover the entire sample space.
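
The definitions above can be sketched in a few lines of Python, treating Ω and events as sets. This is a minimal illustration that assumes equally likely outcomes; the helper name `prob` is hypothetical and not part of any library.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}

def prob(event, sample_space):
    """Favourable outcomes / total outcomes.

    Assumes every outcome in sample_space is equally likely.
    """
    if not event <= sample_space:
        raise ValueError("an event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))

even = {2, 4, 6}
not_even = omega - even          # complement of the event

print(prob(even, omega))         # 1/2
print(prob(not_even, omega))     # 1/2
print(prob(omega, omega))        # 1  (the certain event)
print(prob(set(), omega))        # 0  (the impossible event)
```

Using `Fraction` keeps the results exact, so the complement identity P(E) + P(Eᶜ) = 1 can be checked without floating-point rounding.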

Common Sample Spaces

Experiment  |  Sample Space (Ω)  |  Example Event  |  P(Event)
Flip one coin  |  {H, T}  |  Getting Heads: {H}  |  1/2 = 0.50
Roll one die  |  {1, 2, 3, 4, 5, 6}  |  Even number: {2, 4, 6}  |  3/6 = 0.50
Flip two coins  |  {HH, HT, TH, TT}  |  At least one Head: {HH, HT, TH}  |  3/4 = 0.75
Draw 1 card (52-card deck)  |  {A♠, 2♠, … K♥}  |  Drawing a King: {K♠, K♣, K♦, K♥}  |  4/52 = 1/13
Baby's gender (simplified)  |  {Boy, Girl}  |  Having a girl: {Girl}  |  ≈ 0.49
Daily stock movement  |  {Up, Down, Unchanged}  |  Stock rises: {Up}  |  Historical frequency

Special Types of Events

Event Type  |  Definition  |  Example (Die Roll)
Simple event  |  Contains exactly one outcome  |  {3}: rolling exactly 3
Compound event  |  Contains two or more outcomes  |  {2, 4, 6}: rolling even
Impossible event (∅)  |  Contains no outcomes; P = 0  |  Rolling a 7 on a d6 (no outcome in Ω matches, so the event is ∅)
Certain event (Ω)  |  Contains all outcomes; P = 1  |  {1,2,3,4,5,6}: rolling something
Mutually exclusive  |  Cannot both occur simultaneously  |  {1,3,5} and {2,4,6}: odd and even
Exhaustive events  |  Together cover the entire sample space  |  {≤3} and {>3} together = {1,2,3,4,5,6}
Independent events  |  Occurrence of one doesn't affect the other  |  Roll 1 and Roll 2 of separate dice

Visualising Events — Venn Diagrams

The sample space Ω is the entire rectangle. Events A and B are circles within it. Their overlap (A ∩ B) is where both occur simultaneously. The union (A ∪ B) is the total area covered by either.

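[Figure: Venn diagram. The rectangle is Ω. Circles A and B overlap in A ∩ B (both occur); the shaded area covered by either circle is A ∪ B; the region outside both circles is Aᶜ ∩ Bᶜ (neither occurs).]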
📐
Set Notation Cheat Sheet

A ∪ B — Union: A OR B (or both) occurs.  |  A ∩ B — Intersection: A AND B both occur.  |  Aᶜ — Complement: A does NOT occur.  |  A ⊆ B — A is a subset: every outcome in A is also in B.


Section 03

Three Types of Probability 🔭

Probability didn't spring fully-formed from one mind. Over three centuries, three distinct interpretations of "what probability means" were developed, each suited to different situations. A professional data scientist needs to understand all three — because they appear in different parts of the toolbox.

Classical
P = f/n
  • Also: A priori / Theoretical
  • Assumes equally likely outcomes
  • Calculated before any experiment
  • P(E) = favourable / total outcomes
  • Works for: dice, cards, coins
  • Fails when outcomes aren't equal
Frequentist
P = lim(f/n)
  • Also: Empirical / Objective
  • Based on observed long-run frequency
  • Calculated from actual data
  • Needs many repeated experiments
  • Works for: insurance, medicine, QC
  • Fails for unique/one-time events
Bayesian
P(H|E)
  • Also: Subjective / Posterior
  • Probability = degree of belief
  • Updated when new evidence arrives
  • Uses prior + likelihood → posterior
  • Works for: unique events, ML, AI
  • Requires a prior (can be subjective)

Section 04

Type 1 — Classical Probability 🎴

The Story: Pierre-Simon Laplace and the Equally Likely World

In 1814, French mathematician Pierre-Simon Laplace published his Philosophical Essay on Probabilities. He proposed that when all outcomes of an experiment are equally likely — a perfectly balanced coin, a perfectly symmetric die, a well-shuffled deck — probability can be calculated purely by counting, before any data is collected.

Classical Probability Formula
P(E) = |E| / |Ω|
|E| = number of favourable outcomes in event E. |Ω| = total number of equally likely outcomes in sample space.
Key Assumption
All outcomes equally likely
If a die is loaded, or a coin is bent, classical probability gives the wrong answer. Equally likely is the critical constraint.
🧮 Classical Probability — Five Worked Examples
Example 1
Rolling an odd number on a fair die.
Ω = {1,2,3,4,5,6}, |Ω| = 6.   E = {1,3,5}, |E| = 3.
P(odd) = 3/6 = 0.50 (50%)
Example 2
Drawing a red card from a standard deck.
|Ω| = 52. Red cards = 26 (13♥ + 13♦).
P(red) = 26/52 = 0.50 (50%)
Example 3
Drawing a face card (Jack, Queen, King).
Face cards = 12 (4 suits × 3 face cards). |Ω| = 52.
P(face card) = 12/52 = 3/13 ≈ 0.231 (23.1%)
Example 4
Getting at least one head in two coin flips.
Ω = {HH, HT, TH, TT}, |Ω| = 4.   E = {HH, HT, TH}, |E| = 3.
P(at least one H) = 3/4 = 0.75 (75%)
Example 5
Picking a vowel from the alphabet {A–Z}.
|Ω| = 26. Vowels = {A, E, I, O, U}, |E| = 5.
P(vowel) = 5/26 ≈ 0.192 (19.2%)
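
The counting in these examples can be reproduced by enumerating the sample space explicitly. The sketch below is one way to do it, assuming a standard 52-card deck represented as (rank, suit) pairs; the names `RANKS`, `SUITS`, and `classical` are illustrative choices.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["♠", "♣", "♦", "♥"]

# Every card is a (rank, suit) pair; a well-shuffled deck makes them equally likely.
deck = set(product(RANKS, SUITS))

def classical(event, sample_space):
    """P(E) = |E| / |Ω|, valid only for equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

red = {card for card in deck if card[1] in {"♦", "♥"}}
face = {card for card in deck if card[0] in {"J", "Q", "K"}}

# Two coin flips: the sample space is every ordered pair of H/T.
two_flips = set(product("HT", repeat=2))
at_least_one_head = {flips for flips in two_flips if "H" in flips}

print(classical(red, deck))                      # 1/2
print(classical(face, deck))                     # 3/13
print(classical(at_least_one_head, two_flips))   # 3/4
```

Enumerating Ω like this only scales to small experiments, but it makes the "count favourable, divide by total" recipe concrete.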
⚠️
When Classical Probability Fails

Classical probability breaks down whenever outcomes are NOT equally likely. A thumbtack dropped on a floor can land point-up or point-down — but these are not equally likely. A biased coin, a weighted die, a real stock market — none of these satisfy Laplace's assumption. For these situations, you need Frequentist or Bayesian probability.


Section 05

Type 2 — Frequentist Probability 📊

The Story: The Life Insurance Actuary

In 1693, Edmund Halley (of Halley's Comet fame) compiled one of the first mortality tables: a systematic record of how many people in Breslau (now Wrocław, Poland) died at each age. Halley didn't know when any individual would die, but he could calculate, from thousands of observations, that approximately 1 in 100 forty-year-old men would die before turning 41. Modern life insurance pricing grew out of this insight.

This is Frequentist probability: probability as the long-run relative frequency of an event across many repeated identical trials. It requires no assumptions about equal likelihood — only a lot of data.

Frequentist Formula
P(E) = lim(n→∞) f/n
f = number of times E occurred. n = total number of trials. As n → ∞, f/n converges to the true probability.
Relative Frequency (Observed)
P̂(E) = f / n
With finite n, this is an estimate of P(E). The larger n is, the better the estimate — guaranteed by the Law of Large Numbers.
🧮 Frequentist Probability — Worked Examples
Example 1
Thumbtack experiment. You drop a thumbtack 1,000 times. It lands point-up 382 times.
P̂(point-up) = 382/1000 = 0.382 (38.2%)
You could never calculate this classically — only observation reveals it.
Example 2
Website click-through rate. An ad is shown to 50,000 users. 1,750 click it.
P̂(click) = 1750/50000 = 0.035 (3.5%)
This is now the estimated probability a random user clicks this ad.
Example 3
Manufacturing defect rate. A factory inspects 10,000 chips. 47 are defective.
P̂(defect) = 47/10000 = 0.0047 (0.47%)
This drives quality control decisions — no theoretical model needed.
Example 4
Hospital readmission rate. 800 patients discharged; 112 readmitted within 30 days.
P̂(readmission) = 112/800 = 0.14 (14%)
Used to benchmark hospital quality and allocate follow-up care resources.

The Law of Large Numbers in Action

As the number of coin flips increases, the observed relative frequency of heads converges toward the true probability of 0.5. This convergence is guaranteed by the Law of Large Numbers — the mathematical foundation of Frequentist probability.
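
The convergence described above can be seen in a small simulation. This is a sketch using Python's standard `random` module with a fixed seed so that runs are repeatable; the function name `running_frequency` and the choice of 100,000 flips are assumptions made for illustration.

```python
import random

def running_frequency(n_flips, p_heads=0.5, seed=0):
    """Simulate n_flips coin flips and return the relative frequency
    of heads after each flip."""
    rng = random.Random(seed)
    heads = 0
    freqs = []
    for n in range(1, n_flips + 1):
        if rng.random() < p_heads:
            heads += 1
        freqs.append(heads / n)
    return freqs

freqs = running_frequency(100_000)
for n in (10, 50, 200, 1_000, 100_000):
    print(f"n = {n:>7}: relative frequency = {freqs[n - 1]:.4f}")
```

The early values typically wander well away from 0.5, while the later ones settle close to it. Changing `seed` gives a different path with the same long-run behaviour.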

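[Figure: relative frequency of heads (vertical axis, 0 to 1.0) against number of trials n (horizontal axis, marked at 10, 50, 200, 1000). The curve is volatile early on and converges to the true value P = 0.5.]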
💡
When Frequentist Probability Fails

Frequentist probability requires repeatable experiments. What is the "frequentist probability" that it will rain in Delhi on 15 August 2026? There is only one 15 August 2026 — you cannot repeat it. Similarly: "What is the probability that this specific defendant committed this specific crime?" Unique, non-repeatable events demand a different interpretation of probability — which is where Bayesian thinking shines.


Section 06

Type 3 — Bayesian Probability 🔄

The Story: The Lost Submarine and Bayesian Search

In 1968, the USS Scorpion — a nuclear submarine — disappeared in the Atlantic Ocean. The Navy needed to find it. Traditional search methods would have been hopeless across millions of square kilometres of ocean. Instead, analyst John Craven used Bayesian reasoning.

He assembled a team of experts — submariners, oceanographers, salvage specialists — and asked each: "Given everything you know, where do you think it sank?" Each expert gave a probability distribution across ocean zones. These were the priors. Then Craven combined them using Bayes' theorem, weighting each zone by cumulative evidence. The search team found the Scorpion within 220 metres of the predicted location.

Bayes' Theorem
P(H|E) = P(E|H) × P(H) / P(E)
H = hypothesis. E = evidence (data). P(H) = prior belief before evidence. P(H|E) = posterior belief after evidence.
The Three Components
Prior → Likelihood → Posterior
Prior: what you believed before. Likelihood: how probable is this evidence given H? Posterior: updated belief after seeing evidence E.
🧮 Bayesian Probability — Medical Diagnosis Story
Scenario
A rare disease affects 1% of the population. A test for it is 95% accurate (95% sensitivity, 95% specificity). You test positive. What is the actual probability you have the disease?
Prior
P(Disease) = 0.01  (1% of population has it)
P(No Disease) = 0.99
Likelihoods
P(Positive | Disease) = 0.95  (true positive rate — sensitivity)
P(Positive | No Disease) = 0.05  (false positive rate = 1 − specificity)
Step 1
P(Positive) = P(Pos|Disease)×P(Disease) + P(Pos|No Disease)×P(No Disease)
= 0.95×0.01 + 0.05×0.99 = 0.0095 + 0.0495 = 0.059
Step 2
P(Disease | Positive) = (0.95 × 0.01) / 0.059 = 0.0095 / 0.059 = 0.161 (16.1%)
Insight
Despite a 95% accurate test, a positive result means only a 16% chance of actually having the disease! The rare prior (1%) swamps the test result. This is why mass screening for rare diseases generates many false alarms — and why Bayesian thinking is essential in medicine.
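
The two steps of the diagnosis example translate directly into code. The function below is a minimal sketch of Bayes' theorem for a binary test; the name `posterior` and its parameter names are chosen here for illustration.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prior                  # P(Pos | Disease) * P(Disease)
    false_pos = (1 - specificity) * (1 - prior)     # P(Pos | No Disease) * P(No Disease)
    evidence = true_pos + false_pos                 # P(Positive), by total probability
    return true_pos / evidence

p = posterior(prior=0.01, sensitivity=0.95, specificity=0.95)
print(f"{p:.3f}")  # 0.161
```

Trying a larger prior, for example `prior=0.10`, shows how strongly the base rate drives the answer.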
🧮
Bayesian Updating — The Power of Iteration

Bayesian probability is dynamic. After the first test, your posterior (16.1%) becomes the new prior. If you test positive a second time, you update again, now starting from 16.1% instead of 1%. After two positive tests, the probability rises to about 78% (0.95 × 0.161 / (0.95 × 0.161 + 0.05 × 0.839) ≈ 0.785), assuming the two test results are independent given your true disease status. Bayesian reasoning accumulates evidence naturally: each observation sharpens your belief. This is the engine behind spam filters, self-driving cars, and recommendation systems.
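
Sequential updating can be sketched as a loop in which each posterior becomes the next prior. The code assumes that repeated test results are conditionally independent given the true disease status, which real repeat tests may not satisfy. The helper name `update` is hypothetical. Exact fractions are used so the arithmetic can be checked by hand.

```python
from fractions import Fraction

def update(prior, sensitivity, specificity):
    """One Bayesian update after a positive test result."""
    true_pos = sensitivity * prior
    false_pos = (1 - specificity) * (1 - prior)
    return true_pos / (true_pos + false_pos)

sens = spec = Fraction(95, 100)
belief = Fraction(1, 100)        # starting prior: 1% prevalence
history = []
for test_number in (1, 2, 3):
    belief = update(belief, sens, spec)   # posterior becomes the next prior
    history.append(belief)
    print(f"after positive test {test_number}: {float(belief):.3f}")
# after positive test 1: 0.161
# after positive test 2: 0.785
# after positive test 3: 0.986
```

Under these assumptions the second positive result lifts the probability to 361/460, roughly 78.5%, and a third takes it above 98%.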

Feature  |  Classical  |  Frequentist  |  Bayesian
Based on  |  Counting equally likely outcomes  |  Long-run observed frequencies  |  Degree of belief + evidence
Requires data?  |  No (theoretical)  |  Yes (many trials)  |  Prior + evidence
Handles unique events?  |  No  |  No  |  Yes
Assumption needed  |  Equal likelihood  |  Repeatable experiments  |  Valid prior distribution
Best for  |  Games, theory, combinatorics  |  Insurance, QC, epidemiology  |  ML, AI, medicine, forecasting
Example use  |  Card game odds  |  Defect rates, mortality tables  |  Spam filters, medical diagnosis

Section 07

The Probability Rules 📏

With sample spaces and event types defined, and the three interpretations of probability in hand, we can now establish the fundamental mathematical rules that govern how probabilities combine. These are not approximations or conventions: they follow logically from the three axioms that Andrei Kolmogorov set out in 1933.

💡
Kolmogorov's Three Axioms (The Foundation of All Probability)

Axiom 1: P(E) ≥ 0 for any event E. (Probabilities are non-negative.)  |  Axiom 2: P(Ω) = 1. (Something in the sample space always occurs.)  |  Axiom 3: For mutually exclusive events, P(A ∪ B) = P(A) + P(B). (Probabilities of non-overlapping events add up.) Every probability rule that follows is a logical consequence of these three axioms alone.


Section 08

Rule 1 — The Addition Rule ➕

The Story: The Raffle Ticket Problem

You buy tickets at a school raffle. There are 100 tickets total. You hold ticket #7 and ticket #23. What is the probability you win? Since you hold two tickets, and only one can be drawn, your chances are simply added together: 1/100 + 1/100 = 2/100 = 2%. This works because the two events (ticket #7 wins, ticket #23 wins) cannot both happen simultaneously — they are mutually exclusive.

But what if you ask: "What is the probability I draw a King OR a Heart?" Now a card can be both a King AND a Heart (the King of Hearts). Simply adding P(King) + P(Heart) would double-count the King of Hearts. The Addition Rule has two forms to handle this.

General Addition Rule
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Works for ALL pairs of events. Subtract the intersection to avoid double-counting outcomes that belong to both A and B.
Mutually Exclusive Events
P(A ∪ B) = P(A) + P(B)
Special case when A ∩ B = ∅ (impossible to belong to both). P(A ∩ B) = 0, so the subtraction term vanishes.
Complement Rule
P(Aᶜ) = 1 − P(A)
A direct consequence of the addition rule applied to A and Aᶜ. Since A ∪ Aᶜ = Ω and P(Ω) = 1, P(A) + P(Aᶜ) = 1.
Three Events (Extension)
P(A∪B∪C) = P(A)+P(B)+P(C) −P(A∩B)−P(A∩C)−P(B∩C)+P(A∩B∩C)
Inclusion-Exclusion Principle: subtract all pairwise intersections, then add back the triple intersection.
🧮 Addition Rule — Five Worked Examples
Ex 1 — ME
Rolling a 2 or a 5 on a fair die. These are mutually exclusive (a die shows only one number).
P(2 or 5) = P(2) + P(5) = 1/6 + 1/6 = 2/6 = 0.333
Ex 2 — ME
Drawing a Club or a Diamond from a deck. A card cannot be both simultaneously.
P(♣ or ♦) = 13/52 + 13/52 = 26/52 = 0.50
Ex 3 — General
Drawing a King OR a Heart. King of Hearts belongs to both — must subtract it.
P(K) = 4/52, P(♥) = 13/52, P(K and ♥) = 1/52
P(K or ♥) = 4/52 + 13/52 − 1/52 = 16/52 ≈ 0.308
Ex 4 — General
Rolling a number greater than 3 OR an even number on a die.
A = {4,5,6} → P(A) = 3/6.  B = {2,4,6} → P(B) = 3/6.  A∩B = {4,6} → P(A∩B) = 2/6
P(A or B) = 3/6 + 3/6 − 2/6 = 4/6 ≈ 0.667
Ex 5 — Complement
"At least one head" in 3 coin flips. Use the complement: P(at least 1H) = 1 − P(no heads).
P(no heads) = P(TTT) = (1/2)³ = 1/8
P(at least 1H) = 1 − 1/8 = 7/8 = 0.875
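
The general addition rule and the complement shortcut can both be checked by brute-force enumeration. The sketch below rebuilds the 52-card deck as (rank, suit) pairs and compares the directly counted union with the formula; the names used are illustrative.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["♠", "♣", "♦", "♥"]
deck = set(product(RANKS, SUITS))

def prob(event):
    """Probability of an event over the equally likely 52-card deck."""
    return Fraction(len(event), len(deck))

kings = {card for card in deck if card[0] == "K"}
hearts = {card for card in deck if card[1] == "♥"}

direct = prob(kings | hearts)                                   # count the union
by_rule = prob(kings) + prob(hearts) - prob(kings & hearts)     # addition rule
print(direct, by_rule)   # 4/13 4/13

# Complement rule: at least one head in three flips = 1 - P(no heads).
flips = set(product("HT", repeat=3))
no_heads = {f for f in flips if "H" not in f}
p_at_least_one = 1 - Fraction(len(no_heads), len(flips))
print(p_at_least_one)    # 7/8
```

Here 4/13 is the reduced form of 16/52, matching Ex 3. Dropping the `- prob(kings & hearts)` term would give 17/52, which double-counts the King of Hearts.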

Visualising the Addition Rule

[Figure: two Venn diagrams. Mutually exclusive: circles A and B do not overlap, so P(A∪B) = P(A) + P(B). Non-exclusive (general): circles A and B overlap in A∩B, so P(A∪B) = P(A) + P(B) − P(A∩B).]

Section 09

Rule 2 — The Multiplication Rule ✖️

The Story: The Password Hacker

A cybersecurity analyst is testing how quickly a brute-force attack can crack a 4-digit PIN where each digit is chosen from 0–9. What is the probability of guessing correctly on the first attempt? Each digit has 10 possibilities, and the choices are independent. The total number of combinations is 10 × 10 × 10 × 10 = 10,000. P(correct guess) = 1/10,000 = 0.01%.

This is the multiplication rule for independent events. But what if events aren't independent? What if drawing one card changes the probabilities for the next draw? Then we need the conditional form of the multiplication rule.

Independent Events
P(A ∩ B) = P(A) × P(B)
When knowing A occurred gives NO information about B. The two events don't influence each other at all.
Dependent Events (General)
P(A ∩ B) = P(A) × P(B|A)
P(B|A) = conditional probability of B given A has occurred. When A affects the probability of B, use this form.
Conditional Probability
P(B|A) = P(A ∩ B) / P(A)
The probability of B given that A is known to have occurred. Restricts the sample space to only outcomes in A.
Three or More Events
P(A∩B∩C) = P(A) × P(B|A) × P(C|A∩B)
Chain rule of probability. Each successive event is conditioned on all previous ones having occurred.
🧮 Multiplication Rule — Six Worked Examples
Ex 1 — Indep.
Flipping heads twice in a row. Coin flips are independent.
P(H and H) = P(H) × P(H) = 1/2 × 1/2 = 1/4 = 0.25
Ex 2 — Indep.
Rolling a 6 on a die AND flipping heads. Independent events.
P(6 and H) = 1/6 × 1/2 = 1/12 ≈ 0.083
Ex 3 — Indep.
Three consecutive product defects at a 2% defect rate, assuming each unit's defect status is independent of the others.
P(3 defects) = 0.02 × 0.02 × 0.02 = 0.000008 (0.0008%)
Ex 4 — Dep.
Drawing 2 Aces from a deck WITHOUT replacement.
P(Ace 1st) = 4/52.  After drawing an Ace: P(Ace 2nd | Ace 1st) = 3/51.
P(both Aces) = 4/52 × 3/51 = 12/2652 = 1/221 ≈ 0.0045
Ex 5 — Dep.
Drawing a King then a Queen (no replacement).
P(King 1st) = 4/52.  P(Queen 2nd | King 1st) = 4/51.
P(King then Queen) = 4/52 × 4/51 = 16/2652 ≈ 0.006
Ex 6 — Chain
Drawing 3 red cards in a row (no replacement).
P(R₁) = 26/52.  P(R₂|R₁) = 25/51.  P(R₃|R₁∩R₂) = 24/50.
P(all 3 red) = 26/52 × 25/51 × 24/50 = 15600/132600 ≈ 0.1176
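
For dependent draws, the chain-rule products above can be cross-checked against a direct count of every ordered draw. This is a sketch under the assumption that all ordered pairs of distinct cards are equally likely; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations, product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["♠", "♣", "♦", "♥"]
deck = list(product(RANKS, SUITS))

# Chain rule: P(Ace 1st) * P(Ace 2nd | Ace 1st).
chain = Fraction(4, 52) * Fraction(3, 51)

# Brute force: every ordered pair of distinct cards (52 * 51 = 2652 draws).
draws = list(permutations(deck, 2))
both_aces = [d for d in draws if d[0][0] == "A" and d[1][0] == "A"]
brute = Fraction(len(both_aces), len(draws))
print(chain, brute)      # 1/221 1/221

# Three red cards in a row, conditioning each draw on the previous ones.
three_red = Fraction(26, 52) * Fraction(25, 51) * Fraction(24, 50)
print(three_red, float(three_red))   # 2/17, about 0.1176
```

The brute-force count agrees with the chain rule because conditioning on the first draw simply removes one card from the pool for the second.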

The Probability Tree — Independent vs Dependent

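[Figure: probability tree for two independent coin flips. From Start, the first flip branches to H or T (½ each); each branch splits again into H or T (½ each). Leaf probabilities: HH = ¼, HT = ¼, TH = ¼, TT = ¼, which sum to 1.]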

Section 10

Testing Independence 🔍

How do you know whether two events are independent? There is one definitive test: calculate P(A), P(B), and P(A ∩ B) separately from your data — and check whether the multiplication rule holds.

Independence Test
P(A ∩ B) = P(A) × P(B)
If this equality holds, A and B are independent. If not, they are dependent. Also check: P(B|A) = P(B) — knowing A occurred doesn't change B's probability.
Practical Check
P(B|A) = P(B)
Equivalent condition. If the conditional probability of B given A equals the unconditional probability of B, the events are independent.
🧮 Independence Check — Study Hours vs Exam Grade
Data
Survey of 200 students. A = "studied more than 5 hours" (100 students). B = "passed exam" (140 students). Both A and B = "studied >5 hrs AND passed" (90 students).
Calculate
P(A) = 100/200 = 0.50  |  P(B) = 140/200 = 0.70  |  P(A∩B) = 90/200 = 0.45
Test
P(A) × P(B) = 0.50 × 0.70 = 0.35
P(A∩B) = 0.45
0.35 ≠ 0.45 → Events are NOT independent
Interpretation
P(B|A) = 90/100 = 0.90  vs  P(B) = 0.70.
Students who studied more have a 90% pass rate vs 70% overall. Studying and passing are statistically associated, so the events are dependent. (No surprise! Note that dependence alone does not prove that studying causes the pass.)
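
The same check can be written as a small function that takes the four survey counts. This is a sketch: the name `independence_report` is hypothetical, and the strict equality test mirrors the textbook definition. With real sample data the two sides rarely match exactly, so in practice a formal statistical test would normally replace the equality check.

```python
from fractions import Fraction

def independence_report(n_total, n_a, n_b, n_both):
    """Compare P(A ∩ B) with P(A) * P(B) using exact fractions."""
    p_a = Fraction(n_a, n_total)
    p_b = Fraction(n_b, n_total)
    p_both = Fraction(n_both, n_total)
    return {
        "P(A) x P(B)": p_a * p_b,
        "P(A and B)": p_both,
        "P(B | A)": p_both / p_a,
        "independent": p_both == p_a * p_b,
    }

# Survey counts from the example: 200 students, 100 studied >5 hrs,
# 140 passed, 90 did both.
report = independence_report(200, 100, 140, 90)
for name, value in report.items():
    print(f"{name}: {value}")
# P(A) x P(B): 7/20
# P(A and B): 9/20
# P(B | A): 9/10
# independent: False
```

Changing the last count to 70 makes P(A ∩ B) = 7/20 = P(A) × P(B), and the function then reports independence.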

Section 11

All the Rules Together — Quick Reference

Rule  |  Formula  |  When to Use  |  Key Word
Complement  |  P(Aᶜ) = 1 − P(A)  |  "At least one," "not," "none"  |  NOT
Addition (ME)  |  P(A∪B) = P(A) + P(B)  |  Mutually exclusive events  |  OR (no overlap)
Addition (General)  |  P(A∪B) = P(A)+P(B)−P(A∩B)  |  Any two events (safe to always use)  |  OR (with overlap)
Multiplication (Indep.)  |  P(A∩B) = P(A) × P(B)  |  Independent events only  |  AND (independent)
Multiplication (Dep.)  |  P(A∩B) = P(A) × P(B|A)  |  Dependent events (always safe)  |  AND (dependent)
Conditional Prob.  |  P(B|A) = P(A∩B) / P(A)  |  Given that A has occurred  |  GIVEN
Bayes' Theorem  |  P(H|E) = P(E|H)×P(H)/P(E)  |  Update beliefs with new evidence  |  UPDATE

Section 12

The Golden Rules of Probability

🎯 10 Foundations Every Data Scientist Must Know
1
Always start with the sample space. Before calculating any probability, explicitly list or describe Ω. Many errors come from having an incomplete or incorrect sample space — and if Ω is wrong, every probability derived from it is wrong.
2
Probabilities always sum to 1. For any partition of the sample space, all probabilities must add to exactly 1. Use this as a sanity check — if your probabilities don't sum to 1, something is wrong.
3
"OR" triggers the addition rule. "AND" triggers the multiplication rule. The keyword in the problem tells you which formula to reach for. "At least one" almost always means: use the complement rule, then subtract from 1.
4
Always check whether events are mutually exclusive before adding. Using P(A) + P(B) without checking for overlap is the most common probability mistake. Always ask: can A and B happen simultaneously?
5
Always check independence before multiplying. P(A) × P(B) only equals P(A ∩ B) when events are truly independent. Assuming independence when events are correlated can produce catastrophically wrong risk assessments.
6
Conditional probability restricts the sample space. P(B|A) is calculated entirely within the world where A has happened. The denominator becomes P(A), not 1. Drawing Venn diagrams helps visualise this restriction clearly.
7
The complement rule is your most powerful shortcut. "At least one," "one or more," and "not all" problems are almost always easier solved as 1 − P(none). Save yourself long addition chains by flipping to the complement.
8
Choose the right probability interpretation for your problem. Symmetric games → Classical. Real data with many trials → Frequentist. Unique events or belief updating → Bayesian. Using the wrong interpretation doesn't just give wrong numbers — it gives meaningless ones.
9
Draw a tree diagram or Venn diagram for multi-step problems. Visual representations make sample spaces, intersections, and conditional probabilities concrete. They prevent errors more effectively than algebra alone — always sketch before calculating.
10
Rare priors dominate test results — always apply Bayes. A 99% accurate test on a disease affecting 0.1% of the population still yields mostly false positives. Ignoring prior probabilities is one of the most dangerous mistakes in medicine, law, and machine learning.
🧮
You've Built the Foundation

Sample spaces define the universe of possibility. Events carve out the outcomes that matter. Classical, Frequentist, and Bayesian interpretations give probability its meaning. The Addition and Multiplication rules govern how probabilities combine. Master these concepts and you have everything needed to go deeper into conditional probability, distributions, Bayesian inference, and the entire edifice of modern statistical machine learning that rests upon them.