p-value Explained in Depth

Section 01

The Coin, The Courtroom & The Coffee Taster

Rothamsted Experimental Station, England, early 1920s. A group of scientists are having afternoon tea. A colleague, Dr Muriel Bristol, claims she can tell, just by tasting, whether milk was poured before or after the tea in her cup. The statistician Ronald Fisher thinks this is nonsense. So he designs a test.

He prepares eight cups — four milk-first, four tea-first — shuffled randomly. If Dr Bristol is just guessing, probability theory says she should get about 4 right by chance. She gets all 8 correct. Fisher calculates the probability of guessing all 8 right by pure chance: 1 in 70 — about 1.4%.

💡

That 1.4% is a p-value

Fisher reasoned: "If she had no ability whatsoever, the probability of getting all 8 cups correct by luck alone is just 1.4%." That tiny probability was the evidence. It told him that the observed result would be extremely unlikely under the assumption of "no ability." This experiment, the Lady Tasting Tea, became a founding example of modern significance testing, and it is still one of the best intuitive explanations of what a p-value actually measures.
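The 1-in-70 figure can be reproduced by counting. The sketch below assumes, as in the story above, that the taster knows there are exactly four cups of each kind, so guessing amounts to picking which four of the eight cups are milk-first.

```python
from math import comb

# Number of ways to choose which 4 of the 8 cups are "milk-first"
n_arrangements = comb(8, 4)

# Under H0 (pure guessing) every arrangement is equally likely,
# and only one of them matches the true arrangement exactly.
p_all_correct = 1 / n_arrangements

print(f"Possible arrangements: {n_arrangements}")
print(f"P(all 8 correct | guessing): {p_all_correct:.4f}")
```

There are 70 possible arrangements, so the probability of a perfect score by luck is 1/70 ≈ 0.0143.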

The p-value is arguably the most widely used — and most widely misunderstood — number in all of science. It appears in every medical journal, every psychology study, every A/B test report, every quality control dashboard. Understanding what it truly means, and what it does not mean, is one of the most valuable skills a data scientist can have.

Section 02

What Exactly Is a p-value?

Before the formal definition, try this thought experiment. You suspect a coin is biased toward heads. You flip it 10 times and get 9 heads. You ask yourself: "If this coin were perfectly fair, how often would I see 9 or more heads in 10 flips just by chance?"

The answer to that question is the p-value.

💡

The Formal Definition

The p-value is the probability of obtaining a test statistic at least as extreme as the one actually observed, assuming the null hypothesis is true.

p = P( observing data this extreme or more extreme | H₀ is true )

It answers one question and one question only: "How surprising is this data, if we assume there is nothing going on?"
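For the coin thought experiment, the question "how often would a fair coin give 9 or more heads in 10 flips?" is a binomial tail probability. A minimal sketch using SciPy's binomial distribution (the variable names are ours, chosen for illustration):

```python
from scipy import stats

n_flips, observed_heads = 10, 9

# P(X >= 9) for X ~ Binomial(10, 0.5).
# sf(k) returns P(X > k), so we pass observed_heads - 1.
p_one_sided = stats.binom.sf(observed_heads - 1, n_flips, 0.5)

print(f"P(9 or more heads | fair coin) = {p_one_sided:.4f}")
```

This is the one-sided answer, 11/1024 ≈ 0.0107: ten ways to get exactly 9 heads plus one way to get 10, out of 1,024 equally likely sequences.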

Small p-value

p < 0.05

Data is surprising under H₀
Evidence against H₀
Reject H₀ if p < α
Result is "significant"

Large p-value

p ≥ 0.05

Data is unsurprising under H₀
Weak evidence against H₀
Fail to reject H₀
Result is "non-significant"

The Threshold α

usually 0.05

Set before the experiment
Not magic — just convention
Stricter values such as 0.01 in some fields
p < α → reject H₀

Section 03

The Coin Flip — A Visual Walkthrough

Let us build the p-value from scratch, visually, using the coin example. We flip a coin 10 times and observe 9 heads. The null hypothesis is H₀: the coin is fair (P(heads) = 0.5). We use a two-tailed test because we want to detect bias in either direction.

✅

Reading the Diagram

The blue bars show every possible outcome when flipping a fair coin 10 times. The red bars are the outcomes as extreme as — or more extreme than — what we observed (9 heads). Their combined probability is the p-value: ≈ 2.1%.

Since p = 0.021 < α = 0.05, we reject H₀: the data are hard to reconcile with a fair coin. But notice: 2.1% is not zero. A fair coin still produces a result this extreme about 1 time in 47, so we could be making a Type I error.
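The two-tailed figure can be rebuilt by listing every outcome of 10 fair flips and adding up those that are no more likely than the observed one. This is a sketch of one common way to define "as extreme" for a two-sided binomial test; the small tolerance is our own guard against floating-point ties.

```python
import numpy as np
from scipy import stats

n_flips, observed_heads = 10, 9
outcomes = np.arange(n_flips + 1)
pmf = stats.binom.pmf(outcomes, n_flips, 0.5)   # P(X = k) under a fair coin

# Two-tailed: sum the probability of every outcome that is
# no more likely than the one we observed
p_observed = pmf[observed_heads]
extreme = pmf <= p_observed * (1 + 1e-9)
p_two_tailed = pmf[extreme].sum()

print("Extreme outcomes (heads):", outcomes[extreme].tolist())
print(f"Two-tailed p-value: {p_two_tailed:.4f}")
```

The extreme outcomes are 0, 1, 9 and 10 heads, and their combined probability is 22/1024 ≈ 0.0215.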

Section 04

The p-value on a Normal Distribution

For continuous data (means, differences), the p-value comes from the area under the normal (or t) distribution curve. The test statistic — a z-score or t-value — tells you how many standard errors your sample result sits from the null hypothesis value. The p-value is the shaded area in the tail(s) beyond that point.

🎯

The Intuition in One Sentence

The p-value is the area of the red shaded tail(s) beyond your observed test statistic, under the curve that represents what would happen if H₀ were true. The further your result sits from the centre of the null distribution, the smaller the tail area, the smaller the p-value, and the stronger the case against H₀.
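To make the tail-area idea concrete, here is a small sketch that converts a z-score into a p-value with the standard normal distribution. The helper name `z_to_p` is hypothetical; for small samples the same pattern applies with `stats.t` and the appropriate degrees of freedom.

```python
from scipy import stats

def z_to_p(z, tail="two"):
    """Tail area under the standard normal curve beyond an observed z-score."""
    if tail == "two":
        return 2 * stats.norm.sf(abs(z))   # both tails
    if tail == "right":
        return stats.norm.sf(z)            # area to the right of z
    if tail == "left":
        return stats.norm.cdf(z)           # area to the left of z
    raise ValueError("tail must be 'two', 'right', or 'left'")

for z in [0.5, 1.0, 1.96, 2.58, 3.29]:
    print(f"z = {z:4.2f}  two-tailed p = {z_to_p(z):.4f}")
```

The further z moves from zero, the smaller the printed tail area; z = 1.96 lands at the familiar two-tailed 0.05.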

Section 05

What the p-value Is NOT — The Six Myths

The American Statistical Association issued an official statement in 2016 — something it had never done in its 177-year history — specifically to address the widespread misuse of p-values. Here are the six most dangerous myths, each corrected with the accurate interpretation.

⚠️

Myth 6 Continued — Sample Size Inflates Significance

A tiny p-value can come from a massive sample detecting a trivially small effect. A large p-value can come from a small, underpowered study of a very real, large effect. For a t-test, the test statistic grows with effect size × √n, so the p-value depends on both at once and cannot tell you which one is driving it. This is why effect size reporting is not optional: it is mandatory.
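The two situations described above can be put side by side. The sketch below uses two hypothetical scenarios, invented purely for illustration, and the one-sample identity t = d × √n, where d is the observed difference measured in sample standard deviations.

```python
import numpy as np
from scipy import stats

def two_tailed_p(effect_in_sd, n):
    """Two-tailed one-sample t-test p-value when the observed mean sits
    `effect_in_sd` sample standard deviations away from the null value."""
    t_stat = effect_in_sd * np.sqrt(n)      # t = (x_bar - mu_0) / (s / sqrt(n))
    return 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Hypothetical scenarios chosen only for illustration
p_trivial_effect_huge_n = two_tailed_p(effect_in_sd=0.01, n=1_000_000)
p_large_effect_tiny_n   = two_tailed_p(effect_in_sd=0.80, n=5)

print(f"d = 0.01, n = 1,000,000  ->  p = {p_trivial_effect_huge_n:.2e}")
print(f"d = 0.80, n = 5          ->  p = {p_large_effect_tiny_n:.3f}")
```

The negligible effect comes out overwhelmingly "significant" while the large effect does not reach 0.05, which is exactly why the p-value needs an effect size beside it.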

Section 06

p-value Scale — How to Read the Evidence

Not all significant results are equally convincing, and not all non-significant results are equally weak. Here is a practical interpretation scale that statisticians commonly use to communicate the strength of evidence conveyed by different p-value ranges. The cut-offs are conventions in the tradition of Ronald Fisher, who popularised 0.05 as a convenient threshold, not sharp boundaries.

p-value Range	Strength of Evidence	Decision (α=0.05)	Example Context
p < 0.001	Very strong against H₀	Reject H₀	Vaccine efficacy in phase III trial
0.001 ≤ p < 0.01	Strong against H₀	Reject H₀	Drug lowers cholesterol
0.01 ≤ p < 0.05	Moderate against H₀	Reject H₀	A/B test — Button B wins
0.05 ≤ p < 0.10	Weak / marginal	Fail to Reject H₀	Trend worth investigating further
p ≥ 0.10	Little or no evidence	Fail to Reject H₀	No detected difference between groups

Section 07

Step-by-Step: Computing a p-value from Scratch

A hospital claims its average patient wait time is 30 minutes. An auditor suspects it is longer. She randomly samples 40 patients and records their wait times. She finds: x̄ = 34.2 minutes, s = 10.5 minutes. Is this enough evidence to reject the hospital's claim?

🧮 Computing the p-value — One-sample t-test (Right-tailed)

Step 1

State hypotheses.
H₀: μ ≤ 30 minutes | H₁: μ > 30 minutes (right-tailed)

Step 2

Set significance level.
α = 0.05. σ unknown → t-test. df = n − 1 = 39.

Step 3

Compute Standard Error.
SE = s / √n = 10.5 / √40 = 10.5 / 6.325 = 1.66 minutes

Step 4

Compute the t-statistic.
t = (x̄ − μ₀) / SE = (34.2 − 30) / 1.66 = 4.2 / 1.66 = 2.530

Step 5

Find the p-value.
Right-tailed: p = P(T > 2.530 | df=39)
From t-table or Python: p = 0.0078

Decision

p = 0.0078 < α = 0.05 → Reject H₀
At the 5% significance level, there is strong evidence (p = 0.008) that the true mean wait time exceeds 30 minutes. The hospital's claim is not supported by this sample.
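The five steps can be checked directly from the summary statistics, without any raw data. A minimal sketch, using only the numbers quoted in the example:

```python
import math
from scipy import stats

# Summary statistics from the worked example
n, x_bar, s, mu_0 = 40, 34.2, 10.5, 30

se      = s / math.sqrt(n)                 # Step 3: standard error
t_stat  = (x_bar - mu_0) / se              # Step 4: t-statistic
p_right = stats.t.sf(t_stat, df=n - 1)     # Step 5: right-tail area beyond t_stat

print(f"SE = {se:.3f}, t = {t_stat:.3f}, p = {p_right:.4f}")
```

This reproduces SE ≈ 1.660, t ≈ 2.530 and a right-tailed p-value of about 0.0078.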

✅

What Changed the p-value Here?

The p-value of 0.0078 was driven by three factors simultaneously: the size of the difference (x̄ − μ₀ = 4.2 minutes), the variability in the data (s = 10.5), and the sample size (n = 40). Had we sampled only 10 patients and seen the same mean and standard deviation, SE would have been 3.32 and t = 1.27, giving p ≈ 0.12, which would not be significant. Same observed difference, different conclusion, because of n.

Section 08

Python Implementation

Computing p-values from First Principles

import numpy as np
from scipy import stats

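# Hospital wait time example (simulated data)
# H0: mu <= 30  |  H1: mu > 30  (right-tailed)
# The sample is drawn at random, so its mean and sd will not equal the
# worked example's 34.2 and 10.5 exactly, and the printed p-value will differ.
np.random.seed(42)
sample = np.random.normal(loc=34.2, scale=10.5, size=40)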

mu_0  = 30
x_bar = np.mean(sample)
s     = np.std(sample, ddof=1)
n     = len(sample)
se    = s / np.sqrt(n)
df    = n - 1

# t-statistic
t_stat = (x_bar - mu_0) / se

# p-value: right-tailed = area to the RIGHT of t_stat
p_right = stats.t.sf(t_stat, df=df)          # sf = 1 - cdf

# For two-tailed: multiply by 2
p_two   = 2 * stats.t.sf(abs(t_stat), df=df)

# Critical t-value at alpha=0.05, right-tailed
t_crit = stats.t.ppf(0.95, df=df)

print(f"Sample mean x̄:       {x_bar:.4f} min")
print(f"Standard Error SE:    {se:.4f} min")
print(f"t-statistic:          {t_stat:.4f}")
print(f"Degrees of freedom:   {df}")
print(f"t-critical (right):   {t_crit:.4f}")
print(f"p-value (right-tail): {p_right:.4f}")
print(f"p-value (two-tail):   {p_two:.4f}")
print()
print(f"Conclusion: {'REJECT H0' if p_right < 0.05 else 'FAIL TO REJECT H0'}")
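As a cross-check on the manual calculation above, SciPy's built-in one-sample t-test should give the same numbers. The sketch below regenerates the same simulated sample and assumes SciPy 1.6 or newer, where `ttest_1samp` accepts the `alternative` argument.

```python
import numpy as np
from scipy import stats

np.random.seed(42)
sample = np.random.normal(loc=34.2, scale=10.5, size=40)

# Manual right-tailed p-value
t_manual = (sample.mean() - 30) / (sample.std(ddof=1) / np.sqrt(len(sample)))
p_manual = stats.t.sf(t_manual, df=len(sample) - 1)

# Built-in equivalent (the `alternative` argument needs SciPy 1.6 or newer)
result = stats.ttest_1samp(sample, popmean=30, alternative="greater")

print(f"manual : t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy  : t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```

Without `alternative="greater"`, `ttest_1samp` returns the two-tailed p-value, which is twice the right-tailed one when the t-statistic is positive.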

Visualising What Changes the p-value

import numpy as np
from scipy import stats

# How sample size and effect size each affect the p-value (sd held fixed)
mu_0    = 30     # null hypothesis value
x_bar   = 34.2   # sample mean (fixed)
s       = 10.5   # std dev (fixed)

print("Effect of Sample Size on p-value (same effect, same sd):")
for n in [10, 20, 40, 100, 500]:
    se     = s / np.sqrt(n)
    t_stat = (x_bar - mu_0) / se
    p      = stats.t.sf(t_stat, df=n-1)
    sig    = "*** REJECT H0" if p < 0.05 else "  fail to reject"
    print(f"  n={n:4d}  SE={se:.3f}  t={t_stat:.3f}  p={p:.4f}  {sig}")

print()
print("Effect of Effect Size on p-value (n=40 fixed, sd=10.5 fixed):")
for mean in [30.5, 31.0, 32.0, 34.2, 38.0]:
    se     = s / np.sqrt(40)
    t_stat = (mean - mu_0) / se
    p      = stats.t.sf(t_stat, df=39)
    sig    = "*** REJECT H0" if p < 0.05 else "  fail to reject"
    print(f"  x̄={mean:.1f}  t={t_stat:.3f}  p={p:.4f}  {sig}")

# Key insight:
# With n=10:  p≈0.12   (not significant, despite the same 4.2-minute difference)
# With n=500: p<0.0001 (same difference, now overwhelmingly "significant")

The Complete p-value Interpretation Workflow

import numpy as np
from scipy import stats

def full_test_report(sample, mu_0, alpha=0.05, tail="two"):
    """
    Runs a one-sample t-test and prints a complete interpretation report.
    tail: 'two', 'right', or 'left'
    """
    n       = len(sample)
    x_bar   = np.mean(sample)
    s       = np.std(sample, ddof=1)
    se      = s / np.sqrt(n)
    df      = n - 1
    t_stat  = (x_bar - mu_0) / se

    # p-value by tail direction
    if tail == "two":
        p = 2 * stats.t.sf(abs(t_stat), df=df)
    elif tail == "right":
        p = stats.t.sf(t_stat, df=df)
    elif tail == "left":
        p = stats.t.cdf(t_stat, df=df)
    else:
        raise ValueError("tail must be 'two', 'right', or 'left'")

    # Effect size: Cohen's d
    d = (x_bar - mu_0) / s
    d_label = ("negligible" if abs(d) < 0.2 else
               "small"      if abs(d) < 0.5 else
               "medium"     if abs(d) < 0.8 else "large")

    # Evidence strength
    if   p < 0.001: evidence = "Very Strong"
    elif p < 0.010: evidence = "Strong"
    elif p < 0.050: evidence = "Moderate"
    elif p < 0.100: evidence = "Weak"
    else:           evidence = "Little/None"

    print("=" * 55)
    print(f"  One-sample t-test  ({tail}-tailed)")
    print("=" * 55)
    print(f"  H₀: μ = {mu_0}   |   α = {alpha}")
    print(f"  n={n}  x̄={x_bar:.3f}  s={s:.3f}  SE={se:.3f}")
    print(f"  t({df}) = {t_stat:.4f}")
    print(f"  p-value  = {p:.4f}")
    print(f"  Cohen's d = {d:.3f} ({d_label} effect)")
    print(f"  Evidence strength: {evidence}")
    print()
    if p < alpha:
        print(f"  DECISION: REJECT H₀ (p={p:.4f} < α={alpha})")
    else:
        print(f"  DECISION: FAIL TO REJECT H₀ (p={p:.4f} ≥ α={alpha})")
    print("=" * 55)

# Run on hospital wait time data
np.random.seed(42)
wait_times = np.random.normal(34.2, 10.5, 40)
full_test_report(wait_times, mu_0=30, alpha=0.05, tail="right")

Simulating p-value Variability — Why One Study Is Never Enough

import numpy as np
from scipy import stats

# Simulate 1000 studies: true effect exists (mu_true = 32, mu_0 = 30)
# Show how much p-values vary even when H1 is actually true
np.random.seed(0)

n_studies    = 1000
n_per_study  = 40
mu_true, mu_0, sigma = 32, 30, 10.5

p_values = []
for _ in range(n_studies):
    sample = np.random.normal(mu_true, sigma, n_per_study)
    t, p   = stats.ttest_1samp(sample, popmean=mu_0)
    p_values.append(p)

p_values = np.array(p_values)

print("True effect exists (μ_true=32 ≠ μ_0=30)")
print(f"Studies run:              {n_studies}")
print(f"Studies with p < 0.05:   {(p_values < 0.05).sum()} ({(p_values<0.05).mean()*100:.1f}%)")
print(f"Studies with p >= 0.05:  {(p_values >= 0.05).sum()} ({(p_values>=0.05).mean()*100:.1f}%)")
print(f"Median p-value:          {np.median(p_values):.4f}")
print(f"Min p-value:             {p_values.min():.6f}")
print(f"Max p-value:             {p_values.max():.4f}")
print()
print("Lesson: Even when H1 is TRUE, most studies here (roughly 75-80%) fail to")
print("find significance with n=40. This is the power problem.")

⚠️

p-hacking — The Crisis of Reproducibility

If you run 20 independent tests on random noise and use α = 0.05, you expect roughly 1 false positive by chance alone, and the chance of at least one is about 64%. Researchers who try multiple analyses, subgroups, or outcome measures and report only the significant one are inflating the Type I error rate far beyond 5%. This is called p-hacking or data dredging, and it is widely regarded as a major contributor to the replication crisis in psychology, medicine, and nutrition science. Always correct for multiple comparisons (Bonferroni, FDR/BH) when running more than one test.
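The arithmetic behind the 20-tests warning, and the two corrections named above, can be sketched in a few lines. The five p-values are hypothetical, made up only to show how the procedures differ; the Benjamini-Hochberg step is written out by hand rather than taken from a library.

```python
import numpy as np

alpha, k = 0.05, 20

# Chance of at least one false positive across k independent tests on pure noise
fwer = 1 - (1 - alpha) ** k
print(f"Expected false positives: {alpha * k:.1f}")
print(f"P(at least one false positive): {fwer:.3f}")

# Hypothetical p-values from five tests, for illustration only
p_values = np.array([0.001, 0.012, 0.025, 0.041, 0.200])
m = len(p_values)

# Bonferroni: compare every p-value with alpha / m
bonferroni_reject = p_values < alpha / m

# Benjamini-Hochberg: find the largest rank i with p_(i) <= (i / m) * alpha,
# then reject every hypothesis ranked at or below i
order = np.argsort(p_values)
ranked = p_values[order]
below = ranked <= (np.arange(1, m + 1) / m) * alpha
bh_reject = np.zeros(m, dtype=bool)
if below.any():
    cutoff = np.max(np.nonzero(below)[0])
    bh_reject[order[: cutoff + 1]] = True

print("Bonferroni rejects:", bonferroni_reject.tolist())
print("BH rejects:        ", bh_reject.tolist())
```

On these made-up values Bonferroni keeps only the smallest p-value while Benjamini-Hochberg keeps the first three, which reflects the usual trade-off: Bonferroni controls the chance of any false positive, BH controls the expected share of false positives among the rejections.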

Section 09

p-value vs Confidence Interval — Two Sides of the Same Coin

Every p-value has a directly equivalent confidence interval, and every confidence interval implicitly tells you about significance. They are two different ways of communicating the same underlying information, but confidence intervals are often more informative because they also convey effect size and precision.

Property	p-value	Confidence Interval
What it gives you	Strength of evidence against H₀ as a single probability	Range of plausible parameter values
Conveys effect size?	No	Yes, via the interval's location (its centre is the point estimate)
Conveys precision?	No	Yes — narrower = more precise
Significance link	p < 0.05	95% CI does not contain μ₀
More informative?	Less	More — report both together
Affected by sample size?	Yes — n inflates significance	Yes — n narrows the interval

📐

The Duality Rule

For a two-tailed test at α = 0.05:
p < 0.05 if and only if the 95% CI does not contain μ₀.
p ≥ 0.05 if and only if the 95% CI contains μ₀.

The best practice in reporting: always give both the p-value and the confidence interval. The CI tells your reader how large the effect is and how precisely you estimated it. The p-value alone cannot do that.
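The duality can be checked on the hospital wait-time numbers from Section 07. Note that the sketch uses a two-tailed test, because the rule links a two-tailed p-value to a two-sided 95% interval; the right-tailed p-value from Section 07 is half of the one computed here.

```python
import math
from scipy import stats

# Summary statistics from the hospital wait-time example
n, x_bar, s, mu_0 = 40, 34.2, 10.5, 30
alpha = 0.05

se     = s / math.sqrt(n)
df     = n - 1
t_stat = (x_bar - mu_0) / se

# Two-tailed p-value and the matching 95% confidence interval
p_two  = 2 * stats.t.sf(abs(t_stat), df=df)
t_crit = stats.t.ppf(1 - alpha / 2, df=df)
ci_low, ci_high = x_bar - t_crit * se, x_bar + t_crit * se

ci_excludes_mu0 = not (ci_low <= mu_0 <= ci_high)
print(f"p (two-tailed) = {p_two:.4f}")
print(f"95% CI = ({ci_low:.2f}, {ci_high:.2f})")
print(f"p < alpha: {p_two < alpha}   CI excludes mu_0: {ci_excludes_mu0}")
```

The interval is roughly (30.84, 37.56) minutes. It excludes 30 and the two-tailed p-value is below 0.05, so the two statements agree, and the interval additionally shows that the plausible excess wait ranges from under one minute to more than seven.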

Section 10

Golden Rules

🎯 p-value — Key Rules

p-value = P(data this extreme or more | H₀ is true). Nothing else. It is the probability of data at least as extreme as what was observed, under the assumption that H₀ is true. It is not the probability H₀ is true, not the probability of the result being due to chance, and not the probability of replicating the finding.

Always set α before collecting data, never after seeing p. Choosing α = 0.05 because your p-value came out at 0.04 is p-hacking. Pre-specify your significance level, hypotheses, and test type. In clinical trials, regulators and journals expect the primary analysis to be pre-specified in a registered protocol.

Statistical significance ≠ practical significance. Always report effect size. A p-value of 0.0001 with n = 1,000,000 may mean a 0.001 unit difference — real but useless. Report Cohen's d, odds ratio, or relative risk alongside every p-value. Effect size tells you whether the result matters.

p ≥ 0.05 is not evidence of no effect — it is absence of sufficient evidence. Failing to reject H₀ could mean H₀ is true, or that your study was underpowered. Before concluding "no effect," calculate the study's power. If power was low, your non-result is uninformative — a larger study is needed.

Correct for multiple comparisons when running more than one test. Running 20 tests at α = 0.05 on pure noise expects 1 false positive by chance. Use Bonferroni correction (α/k), Holm's method, or the Benjamini-Hochberg procedure (FDR control) whenever you test multiple hypotheses simultaneously.

Always report the 95% CI alongside the p-value. The confidence interval gives effect size, direction, and precision in one interval. It communicates far more than p alone. The two together, p-value for the decision and CI for the magnitude, form a complete statistical report.
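The rule about calculating power before concluding "no effect" can be sketched with the noncentral t distribution. The helper `one_sample_t_power` is a hypothetical name, and the inputs (a true difference of 2 minutes, sd of 10.5) are the assumed values from the simulation in Section 08, not estimates from real data.

```python
import numpy as np
from scipy import stats

def one_sample_t_power(effect, sd, n, alpha=0.05):
    """Power of a two-tailed one-sample t-test, assuming normal data.

    effect: assumed true difference (mu_true - mu_0)
    sd:     assumed population standard deviation
    """
    df     = n - 1
    ncp    = effect / sd * np.sqrt(n)            # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df=df)
    # Probability the test statistic lands in either rejection region
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

for n in [20, 40, 100, 200, 400]:
    power = one_sample_t_power(effect=2, sd=10.5, n=n)
    print(f"n = {n:3d}  power = {power:.2f}")
```

With n = 40 the power comes out at only a little over 20%, so a non-significant result from a study of that size says very little about whether a 2-minute effect exists.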