
The p-value Explained

A deep dive into the p-value: from the Lady Tasting Tea experiment of around 1920 that made it famous, to a visual binomial diagram, the three-panel normal curve illustration, a myth-busting section covering six dangerous misconceptions, the evidence strength scale, a worked hospital wait-time example, and Python code that shows how sample size, effect size, and variability all drive the p-value. Includes the p-value vs confidence interval duality.

Section 01

The Coin, The Courtroom & The Coffee Taster

Rothamsted Experimental Station, England, around 1920. A group of scientists is having afternoon tea. A colleague, Dr Muriel Bristol, claims she can tell, just by tasting, whether milk was poured before or after the tea in her cup. The statistician Ronald Fisher thinks this is nonsense. So he designs a test.

He prepares eight cups — four milk-first, four tea-first — shuffled randomly. If Dr Bristol is just guessing, probability theory says she should get about 4 right by chance. She gets all 8 correct. Fisher calculates the probability of guessing all 8 right by pure chance: 1 in 70 — about 1.4%.

💡
That 1.4% is a p-value

Fisher reasoned: "If she had no ability whatsoever, the probability of getting all 8 cups correct by luck alone is just 1.4%." That tiny probability was the evidence. It told him that the observed result would be extremely unlikely under the assumption of "no ability." This experiment, the Lady Tasting Tea, became the classic origin story of the modern significance test, and it is still one of the best intuitive explanations of what a p-value actually measures.
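
Where does "1 in 70" come from? A minimal sketch, assuming the taster knows that exactly four cups are milk-first and simply has to name which four: there are C(8, 4) equally likely selections under pure guessing, and only one of them is entirely right. The variable names are illustrative.

```python
from math import comb

# Fisher's design: 8 cups, 4 of them milk-first.
n_cups, n_milk_first = 8, 4

# Number of equally likely ways to pick 4 cups out of 8 under pure guessing.
n_ways = comb(n_cups, n_milk_first)

# Only one of those selections identifies every cup correctly.
p_all_correct = 1 / n_ways

print(f"Possible selections: {n_ways}")                       # 70
print(f"P(all 8 correct | guessing) = {p_all_correct:.4f}")   # 0.0143

# Ways to get exactly k of the 4 milk-first cups right (hypergeometric counts).
counts = [comb(4, k) * comb(4, 4 - k) for k in range(n_milk_first + 1)]
for k, ways in enumerate(counts):
    print(f"  {k} of 4 milk-first cups right: {ways}/{n_ways} = {ways / n_ways:.3f}")
```

The full distribution shows why a perfect score is so much more persuasive than three out of four: getting at least three right by guessing happens 17 times in 70.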

The p-value is arguably the most widely used — and most widely misunderstood — number in all of science. It appears in every medical journal, every psychology study, every A/B test report, every quality control dashboard. Understanding what it truly means, and what it does not mean, is one of the most valuable skills a data scientist can have.


Section 02

What Exactly Is a p-value?

Before the formal definition, try this thought experiment. You suspect a coin is biased toward heads. You flip it 10 times and get 9 heads. You ask yourself: "If this coin were perfectly fair, how often would I see 9 or more heads in 10 flips just by chance?"

The answer to that question is the p-value.

💡
The Formal Definition

The p-value is the probability of obtaining a test statistic at least as extreme as the one actually observed, assuming the null hypothesis is true.

p = P( observing data this extreme or more extreme | H₀ is true )

It answers one question and one question only: "How surprising is this data, if we assume there is nothing going on?"

Small p-value
p < 0.05
  • Data is surprising under H₀
  • Evidence against H₀
  • Reject H₀ if p < α
  • Result is "significant"
Large p-value
p ≥ 0.05
  • Data is unsurprising under H₀
  • Weak evidence against H₀
  • Fail to reject H₀
  • Result is "non-significant"
The Threshold α
usually 0.05
  • Set before the experiment
  • Not magic — just convention
  • 0.01 for medicine/finance
  • p < α → reject H₀

Section 03

The Coin Flip — A Visual Walkthrough

Let us build the p-value from scratch, visually, using the coin example. We flip a coin 10 times and observe 9 heads. The null hypothesis is H₀: the coin is fair (P(heads) = 0.5). We use a two-tailed test because we want to detect bias in either direction.

[Figure: Binomial distribution under H₀ (n = 10, p = 0.5), showing the probability of each number of heads in 10 flips of a fair coin (peak: 24.6% at 5 heads). The observed outcome, 9 heads, is marked. Two-tailed p-value = P(≤1 or ≥9 heads | fair coin) ≈ 0.021, or 2.1%.]

Reading the Diagram

The blue bars show every possible outcome when flipping a fair coin 10 times. The red bars are the outcomes as extreme as — or more extreme than — what we observed (9 heads). Their combined probability is the p-value: ≈ 2.1%.

Since p = 0.021 < α = 0.05, we reject H₀. The coin is unlikely to be fair. But notice: 2.1% is not zero. A perfectly fair coin still produces a result this extreme about 1 time in 47, so we could be making a Type I error.
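
The 2.1% can be checked by summing binomial probabilities directly. The sketch below uses only the standard library; `binom_pmf` is a helper written for this illustration. It also computes the one-tailed version, which answers the "9 or more heads" question from Section 02.

```python
from math import comb

n_flips, observed_heads = 10, 9

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n flips with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# One-tailed: outcomes with at least as many heads as observed.
p_one_tailed = sum(binom_pmf(k, n_flips) for k in range(observed_heads, n_flips + 1))

# Two-tailed: outcomes at least as far from the expected 5 heads, in either direction.
distance = abs(observed_heads - n_flips / 2)
p_two_tailed = sum(
    binom_pmf(k, n_flips)
    for k in range(n_flips + 1)
    if abs(k - n_flips / 2) >= distance
)

print(f"P(>= 9 heads)         = {p_one_tailed:.4f}")   # 0.0107
print(f"P(<= 1 or >= 9 heads) = {p_two_tailed:.4f}")   # 0.0215
```

Because the fair-coin distribution is symmetric, the two-tailed value is exactly twice the one-tailed value. Which one is appropriate depends on whether the alternative hypothesis was directional before the data were seen.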


Section 04

The p-value on a Normal Distribution

For continuous data (means, differences), the p-value comes from the area under the normal (or t) distribution curve. The test statistic — a z-score or t-value — tells you how many standard errors your sample result sits from the null hypothesis value. The p-value is the shaded area in the tail(s) beyond that point.

[Figure: p-value as area under the curve, three scenarios. Panel 1: large p-value, z = 0.80, p ≈ 0.42 (two-tailed), fail to reject H₀. Panel 2: borderline p-value, z = ±1.96, 2.5% in each tail, p = 0.05 (two-tailed), right on the α = 0.05 boundary. Panel 3: very small p-value, z = ±3.0, about 0.13% in each tail, p ≈ 0.003 (two-tailed), strong evidence, reject H₀. A smaller shaded area means a smaller p-value, stronger evidence against H₀, and a result further from the null.]

🎯
The Intuition in One Sentence

The p-value is the area of the red shaded tail(s) beyond your observed test statistic, under the curve that represents what would happen if H₀ were true. The further your result sits from the centre of the null distribution, the smaller the tail area, the smaller the p-value, and the stronger the case against H₀.
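
The three z-scores in the illustration can be turned into tail areas with the standard library's `statistics.NormalDist`. This is a small sketch for the normal (z) case only; `two_tailed_p` is a name chosen for this example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def two_tailed_p(z):
    """Area in both tails beyond |z| under the standard normal curve."""
    return 2 * (1 - std_normal.cdf(abs(z)))

for z in (0.80, 1.96, 3.0):
    print(f"z = {z:4.2f}  ->  two-tailed p = {two_tailed_p(z):.4f}")
```

For small samples with an estimated standard deviation, the same idea applies with the t distribution in place of the normal, which gives slightly larger tail areas.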


Section 05

What the p-value Is NOT — The Six Myths

The American Statistical Association issued an official statement in 2016 — something it had never done in its 177-year history — specifically to address the widespread misuse of p-values. Here are the six most dangerous myths, each corrected with the accurate interpretation.

Six p-value Myths, Debunked

Myth 1: "p = 0.03 means there's a 3% chance H₀ is true"
✅ Truth: p assumes H₀ is already true. It cannot tell you the probability that H₀ is true or false; that requires Bayesian methods, not a p-value.

Myth 2: "p < 0.05 means the result is practically important"
✅ Truth: With a large enough sample, even a 0.001 mmHg blood pressure difference gives p < 0.0001. Significance ≠ importance. Always report effect size (Cohen's d, etc.).

Myth 3: "p ≥ 0.05 means there is no effect"
✅ Truth: Failing to reject H₀ does not prove H₀ is true. The study may simply have been underpowered (too small n) to detect a real but small effect. Absence of evidence ≠ evidence of absence.

Myth 4: "p = 0.049 is far stronger evidence than p = 0.051"
✅ Truth: The 0.05 threshold is an arbitrary convention. p = 0.049 and p = 0.051 represent virtually identical evidence. Do not treat the threshold as a bright-line truth detector.

Myth 5: "If I repeat the study, I'll get p < 0.05 again"
✅ Truth: p-values vary enormously across repeated samples even when H₁ is true. A single p < 0.05 is not reproducibility; it is a signal worth investigating further.

Myth 6: "A smaller p-value means a bigger effect"
✅ Truth: p depends on both effect size AND sample size. Larger n can produce p = 0.0001
⚠️
Myth 6 Continued — Sample Size Inflates Significance

A tiny p-value can come from a massive sample size detecting a trivially small effect. A large p-value can come from a small study that was underpowered to detect a very real, large effect. The p-value is a function of effect size, variability, and sample size together: for a one-sample t-test, t = d × √n, where d = (x̄ − μ₀) / s is the standardised effect size. The p-value alone cannot tell you which of these is happening. This is why effect size reporting is not optional.
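
To make the point concrete, here is a sketch that uses the identity t = d × √n for a one-sample t-test, where d = (x̄ − μ₀) / s. The two scenarios are hypothetical numbers picked for contrast, not data from any study, and `one_sample_p` is a helper defined only for this example.

```python
import numpy as np
from scipy import stats

def one_sample_p(d, n):
    """Two-tailed p-value of a one-sample t-test with observed Cohen's d and sample size n."""
    t_stat = d * np.sqrt(n)          # t = (x_bar - mu_0) / (s / sqrt(n)) = d * sqrt(n)
    return 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Hypothetical scenarios chosen for illustration only.
scenarios = [
    ("trivial effect, huge sample", 0.01, 1_000_000),
    ("large effect, tiny sample",   0.80, 5),
]
for label, d, n in scenarios:
    print(f"{label:30s} d={d:.2f}  n={n:>9,d}  p={one_sample_p(d, n):.4g}")
```

The first scenario is "significant" by any threshold yet practically negligible; the second misses α = 0.05 despite a large standardised effect. The p-value alone does not distinguish them, but d does.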


Section 06

p-value Scale — How to Read the Evidence

Not all significant results are equally convincing, and not all non-significant results are equally weak. Here is a practical interpretation scale that statisticians commonly use to communicate the strength of evidence conveyed by different p-value ranges. It follows conventions that trace back to Ronald Fisher's informal benchmarks.

[Figure: The p-value evidence scale, running from very strong evidence (p < 0.001) to little or no evidence (p ≥ 0.10), with the conventional α = 0.05 threshold separating "reject H₀" from "fail to reject H₀". These ranges are guidelines for communication, not rigid rules of nature.]

p-value Range | Strength of Evidence | Decision (α = 0.05) | Example Context
p < 0.001 | Very strong against H₀ | Reject H₀ | Vaccine efficacy in phase III trial
0.001 ≤ p < 0.01 | Strong against H₀ | Reject H₀ | Drug lowers cholesterol
0.01 ≤ p < 0.05 | Moderate against H₀ | Reject H₀ | A/B test: Button B wins
0.05 ≤ p < 0.10 | Weak / marginal | Fail to Reject H₀ | Trend worth investigating further
p ≥ 0.10 | Little or no evidence | Fail to Reject H₀ | No detected difference between groups

Section 07

Step-by-Step: Computing a p-value from Scratch

A hospital claims its average patient wait time is 30 minutes. An auditor suspects it is longer. She randomly samples 40 patients and records their wait times. She finds: x̄ = 34.2 minutes, s = 10.5 minutes. Is this enough evidence to reject the hospital's claim?

🧮 Computing the p-value — One-sample t-test (Right-tailed)
Step 1
State hypotheses.
H₀: μ ≤ 30 minutes  |  H₁: μ > 30 minutes (right-tailed)
Step 2
Set significance level.
α = 0.05. σ unknown → t-test. df = n − 1 = 39.
Step 3
Compute Standard Error.
SE = s / √n = 10.5 / √40 = 10.5 / 6.325 = 1.66 minutes
Step 4
Compute the t-statistic.
t = (x̄ − μ₀) / SE = (34.2 − 30) / 1.66 = 4.2 / 1.66 = 2.530
Step 5
Find the p-value.
Right-tailed: p = P(T > 2.530 | df=39)
From t-table or Python: p = 0.0078
Decision
p = 0.0078 < α = 0.05 → Reject H₀
At the 5% significance level, there is strong evidence (p = 0.008) that the true mean wait time exceeds 30 minutes. The hospital's claim is not supported by this sample.
What Changed the p-value Here?

The p-value of 0.0078 was driven by three factors simultaneously: the size of the difference (x̄ − μ₀ = 4.2 minutes), the variability in the data (s = 10.5), and the sample size (n = 40). Had we sampled only 10 patients and observed the same mean and standard deviation, SE would have been 3.32 and t = 1.27, giving p ≈ 0.12, which would not be significant. Same observed difference, different conclusion, because of n.
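
The hand calculation can be reproduced directly from the summary statistics, without any raw data. A minimal sketch, assuming SciPy is available; `right_tailed_t_test` is a helper name introduced here for illustration.

```python
import numpy as np
from scipy import stats

def right_tailed_t_test(x_bar, s, n, mu_0):
    """One-sample, right-tailed t-test computed from summary statistics."""
    se = s / np.sqrt(n)
    t_stat = (x_bar - mu_0) / se
    p = stats.t.sf(t_stat, df=n - 1)   # area to the right of t_stat
    return se, t_stat, p

# Same mean and standard deviation, two different sample sizes.
for n in (40, 10):
    se, t_stat, p = right_tailed_t_test(x_bar=34.2, s=10.5, n=n, mu_0=30)
    print(f"n={n:2d}  SE={se:.2f}  t={t_stat:.3f}  p={p:.4f}")
```

Working from x̄, s, and n keeps the result tied to the numbers in the worked example, whereas a simulated sample will only approximate them.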


Section 08

Python Implementation

Computing p-values from First Principles

import numpy as np
from scipy import stats

# Hospital wait time example
# H0: mu <= 30  |  H1: mu > 30  (right-tailed)
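# Simulated data: drawn from N(34.2, 10.5), so the sample mean and sd will
# differ slightly from the worked example's 34.2 and 10.5
np.random.seed(42)
sample = np.random.normal(loc=34.2, scale=10.5, size=40)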

mu_0  = 30
x_bar = np.mean(sample)
s     = np.std(sample, ddof=1)
n     = len(sample)
se    = s / np.sqrt(n)
df    = n - 1

# t-statistic
t_stat = (x_bar - mu_0) / se

# p-value: right-tailed = area to the RIGHT of t_stat
p_right = stats.t.sf(t_stat, df=df)          # sf = 1 - cdf

# For two-tailed: multiply by 2
p_two   = 2 * stats.t.sf(abs(t_stat), df=df)

# Critical t-value at alpha=0.05, right-tailed
t_crit = stats.t.ppf(0.95, df=df)

print(f"Sample mean x̄:       {x_bar:.4f} min")
print(f"Standard Error SE:    {se:.4f} min")
print(f"t-statistic:          {t_stat:.4f}")
print(f"Degrees of freedom:   {df}")
print(f"t-critical (right):   {t_crit:.4f}")
print(f"p-value (right-tail): {p_right:.4f}")
print(f"p-value (two-tail):   {p_two:.4f}")
print()
print(f"Conclusion: {'REJECT H0' if p_right < 0.05 else 'FAIL TO REJECT H0'}")
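
As a sanity check on the manual calculation, the same right-tailed test can be run with `scipy.stats.ttest_1samp`. This sketch assumes SciPy 1.6 or later, where the `alternative` argument is available; on older versions only the two-sided p-value is returned.

```python
import numpy as np
from scipy import stats

np.random.seed(42)
sample = np.random.normal(loc=34.2, scale=10.5, size=40)

# Manual right-tailed p-value, as in the listing above.
t_manual = (np.mean(sample) - 30) / (np.std(sample, ddof=1) / np.sqrt(len(sample)))
p_manual = stats.t.sf(t_manual, df=len(sample) - 1)

# Library version; `alternative` requires SciPy 1.6 or later.
result = stats.ttest_1samp(sample, popmean=30, alternative="greater")

print(f"manual : t={t_manual:.4f}  p={p_manual:.4f}")
print(f"scipy  : t={result.statistic:.4f}  p={result.pvalue:.4f}")
```

The two lines should agree to floating-point precision. In practice the library call is the safer default, and the manual version is mainly useful for understanding what it computes.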

Visualising What Changes the p-value

import numpy as np
from scipy import stats

# How sample size and effect size each affect the p-value (variability held fixed)
mu_0    = 30     # null hypothesis value
x_bar   = 34.2   # sample mean (fixed)
s       = 10.5   # std dev (fixed)

print("Effect of Sample Size on p-value (same effect, same sd):")
for n in [10, 20, 40, 100, 500]:
    se     = s / np.sqrt(n)
    t_stat = (x_bar - mu_0) / se
    p      = stats.t.sf(t_stat, df=n-1)
    sig    = "*** REJECT H0" if p < 0.05 else "  fail to reject"
    print(f"  n={n:4d}  SE={se:.3f}  t={t_stat:.3f}  p={p:.4f}  {sig}")

print()
print("Effect of Effect Size on p-value (n=40 fixed, sd=10.5 fixed):")
for mean in [30.5, 31.0, 32.0, 34.2, 38.0]:
    se     = s / np.sqrt(40)
    t_stat = (mean - mu_0) / se
    p      = stats.t.sf(t_stat, df=39)
    sig    = "*** REJECT H0" if p < 0.05 else "  fail to reject"
    print(f"  x̄={mean:.1f}  t={t_stat:.3f}  p={p:.4f}  {sig}")

# Key insight (expected output):
# With n=10:  p≈0.12   (not significant, although the observed difference is the same)
# With n=500: p≈0.0000 (the same 4.2-minute difference is now overwhelmingly significant)
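
The listing above varies sample size and effect size. Variability is the third driver, and it can be swept the same way. In this sketch the standard deviations other than 10.5 are hypothetical values chosen for illustration, and `p_by_sd` is a name introduced here.

```python
import numpy as np
from scipy import stats

mu_0, x_bar, n = 30, 34.2, 40   # same null value, sample mean, and sample size as above

print("Effect of Variability on p-value (n=40 fixed, difference=4.2 fixed):")
p_by_sd = {}
for s in [5.0, 10.5, 15.0, 20.0, 30.0]:   # hypothetical standard deviations
    se     = s / np.sqrt(n)
    t_stat = (x_bar - mu_0) / se
    p      = stats.t.sf(t_stat, df=n - 1)   # right-tailed
    p_by_sd[s] = p
    sig    = "*** REJECT H0" if p < 0.05 else "  fail to reject"
    print(f"  s={s:5.1f}  SE={se:.3f}  t={t_stat:.3f}  p={p:.4f}  {sig}")
```

Noisier data inflate the standard error, shrink the t-statistic, and push the p-value up, even though the observed 4.2-minute difference never changes.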

The Complete p-value Interpretation Workflow

import numpy as np
from scipy import stats

def full_test_report(sample, mu_0, alpha=0.05, tail="two"):
    """
    Runs a one-sample t-test and prints a complete interpretation report.
    tail: 'two', 'right', or 'left'
    """
    n       = len(sample)
    x_bar   = np.mean(sample)
    s       = np.std(sample, ddof=1)
    se      = s / np.sqrt(n)
    df      = n - 1
    t_stat  = (x_bar - mu_0) / se

    # p-value by tail direction
    if tail == "two":
        p = 2 * stats.t.sf(abs(t_stat), df=df)
    elif tail == "right":
        p = stats.t.sf(t_stat, df=df)
    elif tail == "left":
        p = stats.t.cdf(t_stat, df=df)
    else:
        raise ValueError("tail must be 'two', 'right', or 'left'")

    # Effect size: Cohen's d
    d = (x_bar - mu_0) / s
    d_label = ("negligible" if abs(d) < 0.2 else
               "small"      if abs(d) < 0.5 else
               "medium"     if abs(d) < 0.8 else "large")

    # Evidence strength
    if   p < 0.001: evidence = "Very Strong"
    elif p < 0.010: evidence = "Strong"
    elif p < 0.050: evidence = "Moderate"
    elif p < 0.100: evidence = "Weak"
    else:           evidence = "Little/None"

    print("=" * 55)
    print(f"  One-sample t-test  ({tail}-tailed)")
    print("=" * 55)
    print(f"  H₀: μ = {mu_0}   |   α = {alpha}")
    print(f"  n={n}  x̄={x_bar:.3f}  s={s:.3f}  SE={se:.3f}")
    print(f"  t({df}) = {t_stat:.4f}")
    print(f"  p-value  = {p:.4f}")
    print(f"  Cohen's d = {d:.3f} ({d_label} effect)")
    print(f"  Evidence strength: {evidence}")
    print()
    if p < alpha:
        print(f"  DECISION: REJECT H₀ (p={p:.4f} < α={alpha})")
    else:
        print(f"  DECISION: FAIL TO REJECT H₀ (p={p:.4f} ≥ α={alpha})")
    print("=" * 55)

# Run on hospital wait time data
np.random.seed(42)
wait_times = np.random.normal(34.2, 10.5, 40)
full_test_report(wait_times, mu_0=30, alpha=0.05, tail="right")

Simulating p-value Variability — Why One Study Is Never Enough

import numpy as np
from scipy import stats

# Simulate 1000 studies: true effect exists (mu_true = 32, mu_0 = 30)
# Show how much p-values vary even when H1 is actually true
np.random.seed(0)

n_studies    = 1000
n_per_study  = 40
mu_true, mu_0, sigma = 32, 30, 10.5

p_values = []
for _ in range(n_studies):
    sample = np.random.normal(mu_true, sigma, n_per_study)
    t, p   = stats.ttest_1samp(sample, popmean=mu_0)
    p_values.append(p)

p_values = np.array(p_values)

print("True effect exists (μ_true=32 ≠ μ_0=30)")
print(f"Studies run:              {n_studies}")
print(f"Studies with p < 0.05:   {(p_values < 0.05).sum()} ({(p_values<0.05).mean()*100:.1f}%)")
print(f"Studies with p >= 0.05:  {(p_values >= 0.05).sum()} ({(p_values>=0.05).mean()*100:.1f}%)")
print(f"Median p-value:          {np.median(p_values):.4f}")
print(f"Min p-value:             {p_values.min():.6f}")
print(f"Max p-value:             {p_values.max():.4f}")
print()
fail_pct = (p_values >= 0.05).mean() * 100
print(f"Lesson: Even when H1 is TRUE, {fail_pct:.0f}% of these studies failed to")
print("find significance with n=40. This is the power problem.")
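
The simulation estimates power empirically. It can be cross-checked analytically with the noncentral t distribution. This is a sketch under the same assumptions as the simulation (normal data, σ = 10.5, n = 40, two-sided test at α = 0.05), using `scipy.stats.nct`.

```python
import numpy as np
from scipy import stats

mu_true, mu_0, sigma, n, alpha = 32, 30, 10.5, 40, 0.05
df = n - 1

# Noncentrality parameter: the true shift measured in standard errors.
ncp = (mu_true - mu_0) / (sigma / np.sqrt(n))

# Two-sided rejection region: |t| > t_crit.
t_crit = stats.t.ppf(1 - alpha / 2, df=df)

# Power = probability the test statistic lands in the rejection region
# when it follows a noncentral t distribution with this ncp.
power = stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

print(f"Noncentrality parameter: {ncp:.3f}")
print(f"Theoretical power:       {power:.3f}")
print(f"Expected share of non-significant studies: {1 - power:.1%}")
```

The simulated share of studies with p < 0.05 should land close to the theoretical power. A 2-minute true difference is small relative to σ = 10.5, so a sample of 40 detects it only a minority of the time.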
⚠️
p-hacking — The Crisis of Reproducibility

If you run 20 independent tests on random noise and use α = 0.05, you expect roughly 1 false positive by chance alone, and the probability of at least one is 1 − 0.95²⁰ ≈ 64%. Researchers who try multiple analyses, subgroups, or outcome measures and report only the significant one are inflating the Type I error rate far beyond 5%. This is called p-hacking or data dredging, and it is a major contributor to the replication crisis in psychology, medicine, and nutrition science. Always correct for multiple comparisons (Bonferroni, FDR/BH) when running more than one test.
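
The corrections named above are short enough to write out. The sketch below uses only the standard library; the function names and the five p-values are hypothetical, chosen to show where the two procedures disagree. For real analyses a maintained implementation is preferable to hand-rolled code.

```python
def familywise_error_rate(k, alpha=0.05):
    """P(at least one false positive) across k independent tests when every H0 is true."""
    return 1 - (1 - alpha) ** k

def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni: reject H0_i only if p_i < alpha / k."""
    k = len(p_values)
    return [p < alpha / k for p in p_values]

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    # Largest rank r (1-based) whose sorted p-value satisfies p_(r) <= r * alpha / k.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / k:
            max_rank = rank
    reject = [False] * k
    for idx in order[:max_rank]:
        reject[idx] = True
    return reject

print(f"FWER for 20 tests at alpha=0.05: {familywise_error_rate(20):.3f}")   # 0.642

# Hypothetical p-values from five tests, for illustration only.
p_vals = [0.001, 0.012, 0.021, 0.045, 0.30]
print("Bonferroni:", bonferroni_reject(p_vals))
print("BH (FDR):  ", benjamini_hochberg_reject(p_vals))
```

Bonferroni controls the chance of any false positive and is the stricter of the two; Benjamini-Hochberg controls the expected proportion of false discoveries and rejects more hypotheses on the same inputs.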


Section 09

p-value vs Confidence Interval — Two Sides of the Same Coin

Every p-value has a directly equivalent confidence interval, and every confidence interval implicitly tells you about significance. They are two different ways of communicating the same underlying information, but confidence intervals are often more informative because they also convey effect size and precision.

Property | p-value | Confidence Interval
What it gives you | Strength of evidence against H₀ (often reduced to significant or not) | Range of plausible parameter values
Conveys effect size? | No | Yes, via the interval's location (centred on the point estimate)
Conveys precision? | No | Yes, via the interval's width (narrower = more precise)
Significance link | p < 0.05 | 95% CI does not contain μ₀
More informative? | Less | More; report both together
Affected by sample size? | Yes: larger n shrinks p for a fixed effect | Yes: larger n narrows the interval
📐
The Duality Rule

For a two-tailed test at α = 0.05:
p < 0.05 if and only if the 95% CI does not contain μ₀.
p ≥ 0.05 if and only if the 95% CI contains μ₀.

The best practice in reporting: always give both the p-value and the confidence interval. The CI tells your reader how large the effect is and how precisely you estimated it. The p-value alone cannot do that.
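
The duality can be verified on the hospital wait-time numbers. This sketch assumes the same summary statistics as the worked example (n = 40, x̄ = 34.2, s = 10.5, μ₀ = 30) and uses a two-tailed test so that it matches the 95% interval.

```python
import numpy as np
from scipy import stats

# Summary statistics from the hospital wait-time example.
n, x_bar, s, mu_0, alpha = 40, 34.2, 10.5, 30, 0.05
df = n - 1
se = s / np.sqrt(n)

# Two-tailed p-value.
t_stat = (x_bar - mu_0) / se
p_two = 2 * stats.t.sf(abs(t_stat), df=df)

# Matching 95% confidence interval for the mean.
t_crit = stats.t.ppf(1 - alpha / 2, df=df)
ci_low, ci_high = x_bar - t_crit * se, x_bar + t_crit * se

ci_excludes_mu0 = not (ci_low <= mu_0 <= ci_high)
print(f"two-tailed p = {p_two:.4f}")
print(f"95% CI       = ({ci_low:.2f}, {ci_high:.2f})")
print(f"p < alpha: {p_two < alpha}   |   CI excludes mu_0: {ci_excludes_mu0}")
```

Both checks give the same verdict because they rest on the same comparison: |t| > t_crit is algebraically identical to μ₀ lying outside x̄ ± t_crit × SE. The interval additionally shows that plausible mean wait times run from just under 31 to about 37.5 minutes.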


Section 10

Golden Rules

🎯 p-value — Key Rules
1
p-value = P(data this extreme or more | H₀ is true). Nothing else. It is the probability of data at least as extreme as what was observed, under the assumption that H₀ is true. It is not the probability H₀ is true, not the probability of the result being due to chance, and not the probability of replicating the finding.
2
Always set α before collecting data, never after seeing p. Setting α = 0.05 because your p-value came out at 0.04 is p-hacking. Pre-specify your significance level, hypotheses, and test type. In clinical trials, regulators and journals expect this analysis plan to be registered in advance.
3
Statistical significance ≠ practical significance. Always report effect size. A p-value of 0.0001 with n = 1,000,000 may mean a 0.001 unit difference — real but useless. Report Cohen's d, odds ratio, or relative risk alongside every p-value. Effect size tells you whether the result matters.
4
p ≥ 0.05 is not evidence of no effect — it is absence of sufficient evidence. Failing to reject H₀ could mean H₀ is true, or that your study was underpowered. Before concluding "no effect," calculate the study's power. If power was low, your non-result is uninformative — a larger study is needed.
5
Correct for multiple comparisons when running more than one test. Running 20 tests at α = 0.05 on pure noise expects 1 false positive by chance. Use Bonferroni correction (α/k), Holm's method, or the Benjamini-Hochberg procedure (FDR control) whenever you test multiple hypotheses simultaneously.
6
Always report the 95% CI alongside the p-value. The confidence interval conveys effect size, direction, and precision in a single interval. It communicates far more than p alone. Together, the p-value for the decision and the CI for the magnitude form a complete statistical report.