The Coin, The Courtroom & The Coffee Taster
Rothamsted Experimental Station, England, in the early 1920s. A group of scientists are having afternoon tea. A colleague, Dr Muriel Bristol, claims she can tell, just by tasting, whether milk was poured before or after the tea in her cup. The statistician Ronald Fisher thinks this is nonsense. So he designs a test.
He prepares eight cups — four milk-first, four tea-first — shuffled randomly. If Dr Bristol is just guessing, probability theory says she should get about 4 right by chance. She gets all 8 correct. Fisher calculates the probability of guessing all 8 right by pure chance: 1 in 70 — about 1.4%.
Fisher reasoned: "If she had no ability whatsoever, the probability of getting all 8 cups correct by luck alone is just 1.4%." That small probability was the evidence: it told him the observed result would be very unlikely under the assumption of "no ability." The experiment, known as the Lady Tasting Tea, became Fisher's classic illustration of significance testing, and it remains one of the best intuitive explanations of what a p-value actually measures.
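The 1-in-70 figure can be checked by counting. This is a minimal sketch, assuming the taster knows that exactly four cups are milk-first and therefore has to pick four of the eight; the variable names are illustrative.

```python
from math import comb

# Number of ways to choose which 4 of the 8 cups are "milk-first"
n_ways = comb(8, 4)

# Only one of those selections matches the true arrangement
p_all_correct = 1 / n_ways

print(f"Possible selections: {n_ways}")
print(f"P(all 8 correct by guessing) = 1/{n_ways} = {p_all_correct:.4f}")
```

Because every selection is equally likely under pure guessing, the p-value for a perfect score is simply one over the number of possible selections.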
The p-value is arguably the most widely used — and most widely misunderstood — number in all of science. It appears in every medical journal, every psychology study, every A/B test report, every quality control dashboard. Understanding what it truly means, and what it does not mean, is one of the most valuable skills a data scientist can have.
What Exactly Is a p-value?
Before the formal definition, try this thought experiment. You suspect a coin is biased toward heads. You flip it 10 times and get 9 heads. You ask yourself: "If this coin were perfectly fair, how often would I see 9 or more heads in 10 flips just by chance?"
The answer to that question is the p-value.
The p-value is the probability of obtaining a test statistic
at least as extreme as the one actually observed, assuming the null
hypothesis is true.
p = P( observing data this extreme or more extreme | H₀ is true )
It answers one question and one question only: "How surprising is this
data, if we assume there is nothing going on?"
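The definition can be made concrete by simulation: repeat the experiment many times in a world where H₀ is true and count how often the result is at least as extreme as the one observed. The sketch below does this for the one-sided coin question (9 or more heads in 10 flips). The seed and the number of simulated experiments are arbitrary choices, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n_experiments = 200_000          # number of simulated 10-flip experiments
observed_heads = 9

# Each entry is the number of heads in 10 flips of a fair coin (H0 is true)
heads = rng.binomial(n=10, p=0.5, size=n_experiments)

# Fraction of fair-coin experiments at least as extreme as what we saw
p_sim = np.mean(heads >= observed_heads)

# Exact value for comparison: (C(10,9) + C(10,10)) / 2**10 = 11/1024
p_exact = 11 / 1024

print(f"Simulated one-sided p-value: {p_sim:.4f}")
print(f"Exact one-sided p-value:     {p_exact:.4f}")
```

The simulated fraction should land close to the exact value of about 1.1%. The p-value is nothing more than this long-run frequency under H₀.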
Small p-value (p < α):
- Data is surprising under H₀
- Evidence against H₀
- Reject H₀ if p < α
- Result is "significant"

Large p-value (p ≥ α):
- Data is unsurprising under H₀
- Weak evidence against H₀
- Fail to reject H₀
- Result is "non-significant"

Significance level (α):
- Set before the experiment
- Not magic, just convention
- Stricter thresholds such as 0.01 are common in medicine/finance
- p < α → reject H₀
The Coin Flip — A Visual Walkthrough
Let us build the p-value from scratch, visually, using the coin example. We flip a coin 10 times and observe 9 heads. The null hypothesis is H₀: the coin is fair (P(heads) = 0.5). We use a two-tailed test because we want to detect bias in either direction.
The blue bars show every possible outcome when flipping a fair coin 10 times.
The red bars are the outcomes as extreme as — or more extreme than —
what we observed (9 heads). Their combined probability is the p-value: ≈ 2.1%.
Since p = 0.021 < α = 0.05, we reject H₀: a result this lopsided would be
unusual for a fair coin. But notice that 2.1% is not zero. A fair coin produces
a result at least this extreme about 1 time in 47, so rejecting H₀ here could be a Type I error.
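The 2.1% figure can be reproduced exactly by summing the probabilities of the extreme outcomes. This is a small sketch using only the standard library. It treats "as extreme" as "at least as far from 5 heads as the observed 9", which is the usual two-tailed convention for a symmetric null distribution.

```python
from math import comb

n, observed = 10, 9

# Probability of each possible head count under a fair coin
pmf = [comb(n, k) / 2**n for k in range(n + 1)]

# "At least as extreme" in either direction: as far from n/2 as 9 is
distance = abs(observed - n / 2)
extreme = [k for k in range(n + 1) if abs(k - n / 2) >= distance]

p_two_tailed = sum(pmf[k] for k in extreme)

print(f"Extreme outcomes: {extreme}")
print(f"Two-tailed p-value: {p_two_tailed:.4f}")
```

The extreme outcomes are 0, 1, 9 and 10 heads, with a combined probability of 22/1024.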
The p-value on a Normal Distribution
For continuous data (means, differences), the p-value comes from the area under the normal (or t) distribution curve. The test statistic — a z-score or t-value — tells you how many standard errors your sample result sits from the null hypothesis value. The p-value is the shaded area in the tail(s) beyond that point.
The p-value is the area of the red shaded tail(s) beyond your observed test statistic, under the curve that represents what would happen if H₀ were true. The further your result sits from the centre of the null distribution, the smaller the tail area, the smaller the p-value, and the stronger the case against H₀.
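The tail areas can be read directly from the standard normal distribution. This is a brief sketch with a hypothetical test statistic of z = 1.96, chosen only because it corresponds to the familiar 5% two-tailed threshold.

```python
from scipy import stats

z = 1.96  # hypothetical test statistic: standard errors away from the null value

p_right = stats.norm.sf(z)          # area to the right of z (right-tailed test)
p_left = stats.norm.cdf(z)          # area to the left of z (left-tailed test)
p_two = 2 * stats.norm.sf(abs(z))   # area in both tails beyond |z|

print(f"Right-tailed p: {p_right:.4f}")
print(f"Left-tailed p:  {p_left:.4f}")
print(f"Two-tailed p:   {p_two:.4f}")
```

For small samples with an estimated standard deviation, the same pattern applies with `stats.t.sf(t_stat, df=...)` in place of `stats.norm.sf`.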
What the p-value Is NOT — The Six Myths
The American Statistical Association issued an official statement in 2016 — something it had never done in its 177-year history — specifically to address the widespread misuse of p-values. Here are the six most dangerous myths, each corrected with the accurate interpretation.
A tiny p-value can come from a massive sample detecting a trivially small effect. A large p-value can come from a small, underpowered study of a very real, large effect. The p-value depends jointly on the effect size, the variability in the data, and the sample size, so on its own it cannot tell you which of these situations you are in. This is why reporting the effect size alongside the p-value is essential.
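To see how a p-value can mislead about magnitude, the sketch below compares two hypothetical one-sample studies. The effect sizes (in units of the sample standard deviation) and the sample sizes are invented for illustration, and `one_sample_p` is a helper defined here, not a library function.

```python
import numpy as np
from scipy import stats

def one_sample_p(d, n):
    """Two-tailed one-sample t-test p-value when the observed mean sits
    d sample standard deviations away from the null value."""
    t_stat = d * np.sqrt(n)          # t = (x_bar - mu_0) / (s / sqrt(n))
    return 2 * stats.t.sf(abs(t_stat), df=n - 1)

p_tiny_effect = one_sample_p(d=0.02, n=100_000)   # trivial effect, huge sample
p_large_effect = one_sample_p(d=0.80, n=8)        # large effect, tiny sample

print(f"d=0.02, n=100000 -> p = {p_tiny_effect:.2e}")
print(f"d=0.80, n=8      -> p = {p_large_effect:.4f}")
```

The trivial effect comes out highly "significant" while the large one does not clear α = 0.05, which is exactly why the p-value needs an effect size next to it.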
p-value Scale — How to Read the Evidence
Not all significant results are equally convincing, and not all non-significant results are equally weak. Here is a practical interpretation scale that statisticians often use to communicate the strength of evidence conveyed by different p-value ranges. It follows informal conventions that grew out of Ronald Fisher's work; the cut-offs are guidelines, not laws.
| p-value Range | Strength of Evidence | Decision (α=0.05) | Example Context |
|---|---|---|---|
| p < 0.001 | Very strong against H₀ | Reject H₀ | Vaccine efficacy in phase III trial |
| 0.001 ≤ p < 0.01 | Strong against H₀ | Reject H₀ | Drug lowers cholesterol |
| 0.01 ≤ p < 0.05 | Moderate against H₀ | Reject H₀ | A/B test — Button B wins |
| 0.05 ≤ p < 0.10 | Weak / marginal | Fail to Reject H₀ | Trend worth investigating further |
| p ≥ 0.10 | Little or no evidence | Fail to Reject H₀ | No detected difference between groups |
Step-by-Step: Computing a p-value from Scratch
A hospital claims its average patient wait time is 30 minutes. An auditor suspects it is longer. She randomly samples 40 patients and records their wait times. She finds: x̄ = 34.2 minutes, s = 10.5 minutes. Is this enough evidence to reject the hospital's claim?
Step 1. State the hypotheses: H₀: μ ≤ 30 minutes | H₁: μ > 30 minutes (right-tailed)
Step 2. Choose the significance level and the test: α = 0.05. σ unknown → t-test. df = n − 1 = 39.
Step 3. Compute the standard error: SE = s / √n = 10.5 / √40 = 10.5 / 6.325 = 1.66 minutes
Step 4. Compute the test statistic: t = (x̄ − μ₀) / SE = (34.2 − 30) / 1.66 = 4.2 / 1.66 = 2.530
Step 5. Find the p-value: right-tailed, p = P(T > 2.530 | df = 39). From a t-table or Python: p = 0.0078
Step 6. Conclude: At the 5% significance level, there is strong evidence (p = 0.008) that the true mean wait time exceeds 30 minutes. The hospital's claim is not supported by this sample.
The p-value of 0.0078 was driven by three factors simultaneously: the size of the difference (x̄ − μ₀ = 4.2 minutes), the variability in the data (s = 10.5), and the sample size (n = 40). Had we sampled only 10 patients and observed the same mean and standard deviation, SE would have been 3.32 and t = 1.27 with df = 9, giving p ≈ 0.12, which would not be significant. Same observed difference, different conclusion, because of n.
Python Implementation
Computing p-values from First Principles
import numpy as np
from scipy import stats
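# Hospital wait time example
# H0: mu <= 30 | H1: mu > 30 (right-tailed)
# Note: the sample is simulated, so its mean and standard deviation will be
# close to, but not exactly, the 34.2 and 10.5 used in the hand calculation.
np.random.seed(42)
sample = np.random.normal(loc=34.2, scale=10.5, size=40)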
mu_0 = 30
x_bar = np.mean(sample)
s = np.std(sample, ddof=1)
n = len(sample)
se = s / np.sqrt(n)
df = n - 1
# t-statistic
t_stat = (x_bar - mu_0) / se
# p-value: right-tailed = area to the RIGHT of t_stat
p_right = stats.t.sf(t_stat, df=df) # sf = 1 - cdf
# For two-tailed: multiply by 2
p_two = 2 * stats.t.sf(abs(t_stat), df=df)
# Critical t-value at alpha=0.05, right-tailed
t_crit = stats.t.ppf(0.95, df=df)
print(f"Sample mean x̄: {x_bar:.4f} min")
print(f"Standard Error SE: {se:.4f} min")
print(f"t-statistic: {t_stat:.4f}")
print(f"Degrees of freedom: {df}")
print(f"t-critical (right): {t_crit:.4f}")
print(f"p-value (right-tail): {p_right:.4f}")
print(f"p-value (two-tail): {p_two:.4f}")
print()
print(f"Conclusion: {'REJECT H0' if p_right < 0.05 else 'FAIL TO REJECT H0'}")
Visualising What Changes the p-value
import numpy as np
from scipy import stats
# How sample size, effect size, and variability each affect p-value
mu_0 = 30 # null hypothesis value
x_bar = 34.2 # sample mean (fixed)
s = 10.5 # std dev (fixed)
print("Effect of Sample Size on p-value (same effect, same sd):")
for n in [10, 20, 40, 100, 500]:
    se = s / np.sqrt(n)
    t_stat = (x_bar - mu_0) / se
    p = stats.t.sf(t_stat, df=n-1)  # right-tailed p-value
    sig = "*** REJECT H0" if p < 0.05 else " fail to reject"
    print(f" n={n:4d} SE={se:.3f} t={t_stat:.3f} p={p:.4f} {sig}")
print()
print("Effect of Effect Size on p-value (n=40 fixed, sd=10.5 fixed):")
for mean in [30.5, 31.0, 32.0, 34.2, 38.0]:
    se = s / np.sqrt(40)
    t_stat = (mean - mu_0) / se
    p = stats.t.sf(t_stat, df=39)  # right-tailed p-value
    sig = "*** REJECT H0" if p < 0.05 else " fail to reject"
    print(f" x̄={mean:.1f} t={t_stat:.3f} p={p:.4f} {sig}")
# Key insight output:
# With n=10: p≈0.12 (not significant, although the observed difference is the same 4.2 minutes)
# With n=500: p prints as 0.0000 (the same 4.2-minute difference is now overwhelmingly "significant")
The Complete p-value Interpretation Workflow
import numpy as np
from scipy import stats
def full_test_report(sample, mu_0, alpha=0.05, tail="two"):
    """
    Runs a one-sample t-test and prints a complete interpretation report.
    tail: 'two', 'right', or 'left'
    """
    n = len(sample)
    x_bar = np.mean(sample)
    s = np.std(sample, ddof=1)  # sample standard deviation
    se = s / np.sqrt(n)
    df = n - 1
    t_stat = (x_bar - mu_0) / se
    # p-value by tail direction
    if tail == "two":
        p = 2 * stats.t.sf(abs(t_stat), df=df)
    elif tail == "right":
        p = stats.t.sf(t_stat, df=df)
    elif tail == "left":
        p = stats.t.cdf(t_stat, df=df)
    else:
        raise ValueError("tail must be 'two', 'right', or 'left'")
    # Effect size: Cohen's d
    d = (x_bar - mu_0) / s
    d_label = ("negligible" if abs(d) < 0.2 else
               "small" if abs(d) < 0.5 else
               "medium" if abs(d) < 0.8 else "large")
    # Evidence strength
    if p < 0.001: evidence = "Very Strong"
    elif p < 0.010: evidence = "Strong"
    elif p < 0.050: evidence = "Moderate"
    elif p < 0.100: evidence = "Weak"
    else: evidence = "Little/None"
    print("=" * 55)
    print(f" One-sample t-test ({tail}-tailed)")
    print("=" * 55)
    print(f" H₀: μ = {mu_0} | α = {alpha}")
    print(f" n={n} x̄={x_bar:.3f} s={s:.3f} SE={se:.3f}")
    print(f" t({df}) = {t_stat:.4f}")
    print(f" p-value = {p:.4f}")
    print(f" Cohen's d = {d:.3f} ({d_label} effect)")
    print(f" Evidence strength: {evidence}")
    print()
    if p < alpha:
        print(f" DECISION: REJECT H₀ (p={p:.4f} < α={alpha})")
    else:
        print(f" DECISION: FAIL TO REJECT H₀ (p={p:.4f} ≥ α={alpha})")
    print("=" * 55)
# Run on hospital wait time data
np.random.seed(42)
wait_times = np.random.normal(34.2, 10.5, 40)
full_test_report(wait_times, mu_0=30, alpha=0.05, tail="right")
Simulating p-value Variability — Why One Study Is Never Enough
import numpy as np
from scipy import stats
# Simulate 1000 studies: true effect exists (mu_true = 32, mu_0 = 30)
# Show how much p-values vary even when H1 is actually true
np.random.seed(0)
n_studies = 1000
n_per_study = 40
mu_true, mu_0, sigma = 32, 30, 10.5
p_values = []
for _ in range(n_studies):
    sample = np.random.normal(mu_true, sigma, n_per_study)
    t, p = stats.ttest_1samp(sample, popmean=mu_0)  # two-sided by default
    p_values.append(p)
p_values = np.array(p_values)
print("True effect exists (μ_true=32 ≠ μ_0=30)")
print(f"Studies run: {n_studies}")
print(f"Studies with p < 0.05: {(p_values < 0.05).sum()} ({(p_values<0.05).mean()*100:.1f}%)")
print(f"Studies with p >= 0.05: {(p_values >= 0.05).sum()} ({(p_values>=0.05).mean()*100:.1f}%)")
print(f"Median p-value: {np.median(p_values):.4f}")
print(f"Min p-value: {p_values.min():.6f}")
print(f"Max p-value: {p_values.max():.4f}")
print()
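# The true effect is small (2 minutes, about 0.19 standard deviations), so with
# n=40 the test has low power and most studies miss it.
fail_rate = (p_values >= 0.05).mean() * 100
print(f"Lesson: Even though H1 is TRUE, {fail_rate:.0f}% of these studies failed to")
print("find significance with n=40. This is the power problem.")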
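The share of significant studies in the simulation above is an estimate of the test's power, and it can be cross-checked analytically. The sketch below assumes the same setup (true mean 32, null value 30, σ = 10.5, n = 40, two-sided α = 0.05) and uses the noncentral t distribution. Treat it as an illustrative check, not a full power analysis.

```python
import numpy as np
from scipy import stats

mu_true, mu_0, sigma, n, alpha = 32, 30, 10.5, 40, 0.05
df = n - 1

# Noncentrality parameter: the true shift measured in standard errors
nc = (mu_true - mu_0) / (sigma / np.sqrt(n))

# Two-sided rejection threshold under H0
t_crit = stats.t.ppf(1 - alpha / 2, df=df)

# Power = P(|T| > t_crit) when T follows the noncentral t distribution
power = stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

print(f"Noncentrality parameter: {nc:.3f}")
print(f"Analytical power:        {power:.3f}")
print(f"Expected share of non-significant studies: {1 - power:.1%}")
```

The analytical power should agree closely with the simulated fraction of studies reaching p < 0.05.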
If you run 20 independent tests on random noise and use α = 0.05, you expect roughly 1 false positive by chance alone (20 × 0.05 = 1), and the chance of at least one is 1 − 0.95²⁰ ≈ 64%. Researchers who try multiple analyses, subgroups, or outcome measures and report only the significant one are inflating the Type I error rate far beyond 5%. This is called p-hacking or data dredging, and it is widely regarded as a major contributor to the replication crisis in psychology, medicine, and nutrition science. Always correct for multiple comparisons (for example Bonferroni or Benjamini-Hochberg FDR) when running more than one test.
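The inflation is easy to demonstrate by simulation. The sketch below runs 20 one-sample t-tests on pure noise, many times over, and counts how often at least one test comes out "significant" with no correction, with Bonferroni, and with a hand-written Benjamini-Hochberg step. The simulation sizes and the `benjamini_hochberg` helper are illustrative assumptions, not a reference implementation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_tests, n_obs, alpha = 2000, 20, 30, 0.05

# Pure noise: every null hypothesis (mean = 0) is true
noise = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n_tests, n_obs))
p = stats.ttest_1samp(noise, popmean=0.0, axis=2).pvalue  # shape (n_sims, n_tests)

# Share of simulated "papers" reporting at least one false positive
fwer_raw = np.mean((p < alpha).any(axis=1))
fwer_bonf = np.mean((p < alpha / n_tests).any(axis=1))

def benjamini_hochberg(p_values, q=0.05):
    """Boolean mask of rejections under the Benjamini-Hochberg procedure."""
    p_values = np.asarray(p_values)
    m = len(p_values)
    order = np.argsort(p_values)
    passed = p_values[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest rank meeting its threshold
        reject[order[:k + 1]] = True
    return reject

any_bh = np.mean([benjamini_hochberg(row, q=alpha).any() for row in p])

print(f"At least one false positive, uncorrected: {fwer_raw:.1%}")
print(f"At least one false positive, Bonferroni:  {fwer_bonf:.1%}")
print(f"At least one rejection, Benjamini-Hochberg: {any_bh:.1%}")
```

Uncorrected, the rate should sit near the theoretical 64%; both corrections bring it back to roughly the nominal 5% when every null is true.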
p-value vs Confidence Interval — Two Sides of the Same Coin
Every p-value has a directly equivalent confidence interval, and every confidence interval implicitly tells you about significance. They are two different ways of communicating the same underlying information, but confidence intervals are often more informative because they also convey effect size and precision.
| Property | p-value | Confidence Interval |
|---|---|---|
| What it gives you | Strength of evidence against H₀ (often reduced to significant or not) | Range of plausible parameter values |
| Conveys effect size? | No | Yes, via the location of the interval |
| Conveys precision? | No | Yes — narrower = more precise |
| Significance link | p < 0.05 | 95% CI does not contain μ₀ |
| More informative? | Less | More — report both together |
| Affected by sample size? | Yes: for a fixed effect, larger n shrinks p | Yes: larger n narrows the interval |
For a two-tailed test at α = 0.05:
p < 0.05 if and only if the 95% CI does not contain μ₀.
p ≥ 0.05 if and only if the 95% CI contains μ₀.
The best practice in reporting: always give both the p-value and the
confidence interval. The CI tells your reader how large the effect is
and how precisely you estimated it. The p-value alone cannot do that.
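The equivalence can be checked numerically. The sketch below reuses the hospital summary statistics (x̄ = 34.2, s = 10.5, n = 40) and tests several hypothetical null values, comparing the two-tailed p-value with whether each value falls inside the 95% confidence interval. The list of null values is an arbitrary choice for illustration.

```python
import numpy as np
from scipy import stats

x_bar, s, n, alpha = 34.2, 10.5, 40, 0.05
se = s / np.sqrt(n)
df = n - 1

# 95% confidence interval for the mean
t_crit = stats.t.ppf(1 - alpha / 2, df=df)
ci_low, ci_high = x_bar - t_crit * se, x_bar + t_crit * se
print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")

results = []
for mu_0 in [30, 31, 34, 37, 38]:  # hypothetical null values
    t_stat = (x_bar - mu_0) / se
    p_two = 2 * stats.t.sf(abs(t_stat), df=df)
    inside = ci_low <= mu_0 <= ci_high
    results.append((mu_0, p_two, inside))
    print(f"mu_0={mu_0}: p={p_two:.4f}  inside CI: {inside}  reject H0: {p_two < alpha}")
```

Every null value outside the interval is rejected at α = 0.05, and every value inside it is not. The interval additionally shows the plausible range of the mean wait time, which the p-value alone does not.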