The Courtroom, The Medicine & The Factory
Imagine you are a judge. A person stands before you accused of a crime. In every democratic legal system, the law starts from one powerful assumption: the accused is innocent until proven guilty. The burden of proof lies entirely with the prosecution. Unless the evidence is so overwhelming that it cannot plausibly be explained by chance or coincidence, the verdict must be not guilty.
Notice that the judge does not prove innocence. The judge simply asks: "Is the evidence against this person strong enough to reject the assumption of innocence?" Hypothesis testing in statistics follows exactly the same logic, which is why the courtroom is the standard analogy for it.
A pharmaceutical company claims a new drug lowers blood pressure better than the current standard. How do we know the improvement is real and not random luck in the trial? A factory manager suspects her machine produces bolts heavier than the 50 g specification. How many bolts must she weigh to be sure? An e-commerce team runs an A/B test — Button A converts at 4.2%, Button B at 4.7%. Is that difference real or just noise? In all three cases the answer uses one tool: hypothesis testing.
Hypothesis testing is the formal statistical procedure for deciding whether observed data provide sufficient evidence to reject a specific claim about a population. It is the engine behind clinical trials, quality control systems, A/B tests, and much of published empirical science.
What Is a Hypothesis Test?
A statistical hypothesis is a testable claim about a population parameter (a mean, a proportion, a variance, or a difference). The test uses sample data to decide how plausible that claim is. The entire procedure rests on one central question:
"If the null hypothesis were true, how likely is it that we would observe data at least this extreme purely by chance?"
If that probability (the p-value) is smaller than a pre-set threshold called the significance level α, we say the result is statistically significant and reject the null hypothesis. If not, we fail to reject it.
Null hypothesis (H₀)
- The default assumption
- "Nothing is happening"
- Always contains an equality (=, ≤, or ≥)
- We look for evidence against it
Alternative hypothesis (H₁)
- What we want to establish
- "Something is happening"
- Contains ≠, <, or >
- Supported only if H₀ is rejected
Decision rule
- p < α → Reject H₀
- p ≥ α → Fail to reject H₀
- α = 0.05 is the conventional default
- Never say "accept H₀"
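The decision rule above can be sketched in a few lines. This is a minimal illustration, not a full test: the helper name `decide` and the z statistic of 2.5 are hypothetical values chosen for the example, and the p-value is computed under a standard normal null distribution.

```python
from scipy import stats

def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the decision rule: reject H0 only when p < alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# Hypothetical z statistic (illustrative value, not from any real study)
z = 2.5

# Two-sided p-value: probability of |Z| >= 2.5 if H0 is true
p_two_sided = 2 * stats.norm.sf(abs(z))

print(f"p = {p_two_sided:.4f} -> {decide(p_two_sided)}")
# Boundary case: p exactly equal to alpha is NOT a rejection
print(f"p = 0.0500 -> {decide(0.05)}")
```

Note that the wording returned is "fail to reject H0", never "accept H0", in line with the last bullet.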
Null vs Alternative Hypothesis — In Depth
The null and alternative hypotheses are two competing statements about the world. Together they must be mutually exclusive and exhaustive. Understanding how to formulate them correctly is the most critical skill in hypothesis testing.
The Three Test Types
| Test Type | H₀ | H₁ | Rejection Region | Real-world Question |
|---|---|---|---|---|
| Two-tailed | μ = μ₀ | μ ≠ μ₀ | Both tails (α/2 each) | "Is the mean different from 50 g?" |
| Right-tailed | μ ≤ μ₀ | μ > μ₀ | Right tail only (α) | "Is the mean greater than 50 g?" |
| Left-tailed | μ ≥ μ₀ | μ < μ₀ | Left tail only (α) | "Is the mean less than 50 g?" |
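To make the three rejection regions concrete, the sketch below converts a single t statistic into a p-value under each test type. The helper `t_test_p_value` and the inputs (t = 1.8, df = 24) are hypothetical and serve only to show how the tail choice changes the answer.

```python
from scipy import stats

def t_test_p_value(t_stat: float, df: int, tail: str = "two") -> float:
    """p-value of a t statistic for a two-, right- or left-tailed test."""
    if tail == "two":
        return 2 * stats.t.sf(abs(t_stat), df)   # both tails
    if tail == "right":
        return stats.t.sf(t_stat, df)            # area to the right of t
    if tail == "left":
        return stats.t.cdf(t_stat, df)           # area to the left of t
    raise ValueError("tail must be 'two', 'right' or 'left'")

# Hypothetical test statistic and degrees of freedom
t_stat, df = 1.8, 24

p_two = t_test_p_value(t_stat, df, "two")
p_right = t_test_p_value(t_stat, df, "right")
p_left = t_test_p_value(t_stat, df, "left")

for name, p in [("two-tailed", p_two), ("right-tailed", p_right), ("left-tailed", p_left)]:
    print(f"{name:>12}: p = {p:.4f}")
```

With these inputs the right-tailed p-value falls below 0.05 while the two-tailed one does not, which is why the tail must be fixed before the data are seen.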
Researchers instinctively want to put their research claim in H₀. Always do the opposite. H₀ is the sceptic's position — "nothing has changed, there is no effect." Your claim — the thing you are trying to establish — always goes in H₁. You build evidence against H₀, just as a prosecutor builds a case against the presumption of innocence.
Significance Level α and the p-value
Before collecting any data you must set the significance level α — the maximum probability of wrongly rejecting H₀ you are willing to tolerate. After collecting data you compute the p-value, then compare the two to make your decision.
Common misreadings of the p-value:
❌ "The probability that H₀ is true." False.
❌ "The probability the result is due to chance." False.
❌ "The effect is large or important." False.
✅ The p-value means: assuming H₀ is true, the probability of seeing data at least this extreme. A result of p = 0.001 with n = 1,000,000 may reflect a trivially small real-world effect. Always pair p-values with effect sizes.
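The arithmetic behind that warning can be checked directly. The sketch below assumes a one-sample z-test with known σ, where z = d·√n and d is the true shift in standard-deviation units; the shift of 0.0033 SD is a hypothetical value chosen so that the p-value lands near 0.001.

```python
import math
from scipy import stats

n = 1_000_000
effect_in_sd = 0.0033   # hypothetical shift: about a third of one percent of an SD

# For a one-sample z-test with known sigma, z = d * sqrt(n)
z = effect_in_sd * math.sqrt(n)
p = 2 * stats.norm.sf(abs(z))   # two-sided p-value

print(f"z = {z:.2f}, p = {p:.5f}")
print(f"Standardised effect size d = {effect_in_sd} (far below the 0.2 'small' threshold)")
```

The result is highly significant even though the effect is far too small to matter in most practical settings.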
Type I & Type II Errors — The Cost of Being Wrong
No matter how carefully you design a test, two types of error are always possible. Think of a COVID test analogy: a Type I error is a false positive (testing positive when you are healthy); a Type II error is a false negative (testing negative when you are actually infected).
| Error Type | Also Called | Probability | What Happened | Real-world Cost |
|---|---|---|---|---|
| Type I Error | False Positive | α (you set this) | Rejected H₀ but it was true | Approving a drug that doesn't work |
| Type II Error | False Negative | β (depends on n, α, and effect size) | Failed to reject H₀ but it was false | Rejecting a drug that genuinely works |
| Power | Sensitivity | 1 − β | Correctly rejected a false H₀ | Detecting a real drug effect — target ≥ 80% |
Reducing α (say, from 0.05 to 0.01) makes Type I errors less likely but, for a fixed sample size and effect size, increases β, because you demand stronger evidence before rejecting H₀. The most direct way to reduce both at once is to increase the sample size n. Clinical trials are conventionally designed with α ≤ 0.05 and power ≥ 80% (β ≤ 0.20), and these targets are used to derive the minimum trial size before a single patient is enrolled.
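The trade-off between α, β and n can be sketched with the usual normal approximation for a two-sided one-sample test, n ≈ ((z₁₋α/₂ + z₁₋β)·σ/δ)². The helper names `required_n` and `approx_power` are hypothetical, and the inputs (δ = 1.2 g, σ = 2.8 g) are assumptions made for illustration; a real trial would use dedicated power software and a t-based calculation.

```python
import math
from scipy import stats

def required_n(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size for a two-sided one-sample z-test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

def approx_power(delta, sigma, n, alpha=0.05):
    """Approximate power; ignores the negligible rejection chance in the far tail."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    return stats.norm.cdf(delta * math.sqrt(n) / sigma - z_alpha)

# Assumed inputs: detect a 1.2 g shift when sigma is about 2.8 g
delta, sigma = 1.2, 2.8

n_05 = required_n(delta, sigma, alpha=0.05)
n_01 = required_n(delta, sigma, alpha=0.01)
print(f"n for 80% power at alpha=0.05: {n_05}")
print(f"n for 80% power at alpha=0.01: {n_01}")

# With n fixed at 25, tightening alpha lowers power (raises beta)
power_05 = approx_power(delta, sigma, n=25, alpha=0.05)
power_01 = approx_power(delta, sigma, n=25, alpha=0.01)
print(f"power at n=25: {power_05:.3f} (alpha=0.05) vs {power_01:.3f} (alpha=0.01)")
```

Under these assumptions a stricter α requires more observations for the same power, and at a fixed n it costs power.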
The Universal 5-Step Framework
Every hypothesis test — regardless of which specific test you use — follows the same five-step procedure. Memorising this structure means you will never lose track of where you are or what a result actually means.
Scenario: the specification for bolt weight is μ₀ = 50 g. An engineer suspects the machine is off and collects n = 25 bolts.
Step 1: State the hypotheses. H₀: μ = 50 g (on spec) | H₁: μ ≠ 50 g (off spec, two-tailed)
Step 2: Set α and choose the test. α = 0.05. σ unknown, n = 25 (small sample) → one-sample t-test.
Step 3: Compute the test statistic. Sample: x̄ = 51.2 g, s = 2.8 g, n = 25, df = 24.
t = (x̄ − μ₀) / (s / √n) = (51.2 − 50) / (2.8 / 5) = 1.2 / 0.56 = 2.143
Step 4: Find the p-value and decide. Two-tailed p-value (df = 24, t = 2.143) = 0.0426
Critical value t* = ±2.064 (df = 24, α = 0.05)
p = 0.0426 < α = 0.05 → Reject H₀, and |2.143| > 2.064 ✓
Step 5: State the conclusion. At the 5% significance level there is sufficient evidence that the mean bolt weight differs from 50 g (t(24) = 2.14, p = 0.043). Production should be halted and the machine recalibrated.
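The numbers in this worked example can be reproduced from the summary statistics alone. The sketch below uses only the values stated above (x̄ = 51.2, s = 2.8, n = 25, μ₀ = 50, α = 0.05); the variable names are illustrative.

```python
import math
from scipy import stats

# Summary statistics from the bolt example
x_bar, s, n = 51.2, 2.8, 25
mu_0, alpha = 50.0, 0.05

df = n - 1
se = s / math.sqrt(n)                        # standard error of the mean
t_stat = (x_bar - mu_0) / se                 # test statistic
p_value = 2 * stats.t.sf(abs(t_stat), df)    # two-tailed p-value
t_crit = stats.t.ppf(1 - alpha / 2, df)      # two-tailed critical value

print(f"SE = {se:.3f}, t = {t_stat:.3f}, df = {df}")
print(f"p = {p_value:.4f}, critical value = ±{t_crit:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Both routes to the decision, p-value versus α and |t| versus the critical value, always agree because they are two views of the same tail area.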
Always state: (1) the significance level, (2) the direction of the finding, (3) the test statistic and p-value in brackets, (4) the real-world implication.
Never write: "We prove H₁ is true" or "H₀ is false."
Always write: "There is sufficient evidence at the α = 0.05 level to reject H₀ in favour of H₁", or "There is insufficient evidence to reject H₀."
One-tailed vs Two-tailed Tests
The choice between one-tailed and two-tailed must be made before looking at the data. Switching after seeing results to get a smaller p-value is a form of data manipulation called p-hacking.
| Property | Two-tailed Test | One-tailed Test |
|---|---|---|
| H₁ contains | ≠ (not equal) | > or < (directional) |
| Rejection region | Both tails (α/2 each) | One tail only (α) |
| Critical z at α=0.05 | ±1.960 | +1.645 or −1.645 |
| More conservative? | Yes — harder to reject | No — easier to reject |
| Recommended default? | Yes — safer choice | Only if theory clearly demands a direction |
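The critical values in the table come straight from the standard normal quantile function. The short sketch below recomputes them and shows a borderline case; the observed statistic z = 1.8 is a hypothetical value picked to fall between the two thresholds.

```python
from scipy import stats

alpha = 0.05

z_two = stats.norm.ppf(1 - alpha / 2)   # two-tailed: alpha/2 in each tail
z_one = stats.norm.ppf(1 - alpha)       # one-tailed: all of alpha in one tail

print(f"Two-tailed critical z: ±{z_two:.3f}")
print(f"One-tailed critical z:  {z_one:.3f}")

# Hypothetical observed statistic between the two thresholds
z_obs = 1.8
print(f"z = {z_obs}: one-tailed reject = {z_obs > z_one}, "
      f"two-tailed reject = {abs(z_obs) > z_two}")
```

The same data can therefore be "significant" or not depending on the tail choice, which is exactly why that choice must not be made after looking at the results.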
Python Implementation
One-sample t-test — Factory Bolt
import numpy as np
from scipy import stats
np.random.seed(7)
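# 25 simulated bolt weights (grams), drawn to resemble the worked example;
# the sample mean and SD will not equal 51.2 g and 2.8 g exactly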
bolt_weights = np.random.normal(loc=51.2, scale=2.8, size=25)
mu_0 = 50 # H₀: μ = 50 g
alpha = 0.05
# Two-tailed one-sample t-test
t_stat, p_value = stats.ttest_1samp(bolt_weights, popmean=mu_0)
x_bar = np.mean(bolt_weights)
s = np.std(bolt_weights, ddof=1)
se = s / np.sqrt(len(bolt_weights))
df = len(bolt_weights) - 1
t_crit = stats.t.ppf(1 - alpha/2, df=df)
print(f"x̄ = {x_bar:.4f} g | s = {s:.4f} g | SE = {se:.4f}")
print(f"t_stat = {t_stat:.4f}")
print(f"t_crit = ±{t_crit:.4f} (df={df}, α={alpha})")
print(f"p_value = {p_value:.4f}")
if p_value < alpha:
    print(f"\np={p_value:.4f} < α={alpha} → REJECT H₀")
    print("Mean bolt weight differs significantly from 50 g")
else:
    print(f"\np={p_value:.4f} ≥ α={alpha} → FAIL TO REJECT H₀")
One-tailed Test — Drug Trial (Right-tailed)
import numpy as np
from scipy import stats
np.random.seed(42)
# Blood pressure reduction (mmHg) for 30 trial patients
# H₀: μ ≤ 10 | H₁: μ > 10 (right-tailed)
bp_reduction = np.random.normal(loc=12.5, scale=4.0, size=30)
mu_0, alpha = 10, 0.05
t_stat, p_two = stats.ttest_1samp(bp_reduction, popmean=mu_0)
# For right-tailed: halve p-value only when t_stat > 0
p_one_right = p_two / 2 if t_stat > 0 else 1 - (p_two / 2)
print(f"Mean reduction: {np.mean(bp_reduction):.2f} mmHg")
print(f"t-statistic: {t_stat:.4f}")
print(f"p (one-tailed): {p_one_right:.4f}")
if p_one_right < alpha:
    print("\nREJECT H₀: mean reduction is significantly greater than 10 mmHg")
else:
    print("\nFAIL TO REJECT H₀")
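As an alternative to halving the two-tailed p-value by hand, `ttest_1samp` accepts an `alternative` argument. The sketch below assumes SciPy 1.6 or later, where that argument is available, and checks that it matches the manual calculation on the same simulated data.

```python
import numpy as np
from scipy import stats

np.random.seed(42)
# Simulated blood pressure reductions (mmHg), same setup as the listing above
bp_reduction = np.random.normal(loc=12.5, scale=4.0, size=30)
mu_0 = 10

# Manual route: halve the two-tailed p-value when t points in the H1 direction
t_stat, p_two = stats.ttest_1samp(bp_reduction, popmean=mu_0)
p_manual = p_two / 2 if t_stat > 0 else 1 - p_two / 2

# Built-in route (assumes SciPy >= 1.6)
result = stats.ttest_1samp(bp_reduction, popmean=mu_0, alternative="greater")

print(f"manual one-tailed p:   {p_manual:.6f}")
print(f"built-in one-tailed p: {result.pvalue:.6f}")
```

The built-in form is harder to get wrong, since the sign check on the t statistic is handled inside the library.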
Two-sample Welch's t-test — A/B Test
import numpy as np
from scipy import stats
np.random.seed(0)
# Conversion data for two button designs
# H₀: μ_A = μ_B | H₁: μ_A ≠ μ_B (two-tailed)
conversions_A = np.random.binomial(1, 0.042, 500)
conversions_B = np.random.binomial(1, 0.047, 500)
alpha = 0.05
# Welch's t-test (equal_var=False) does not assume equal group variances.
# With 0/1 conversion data and large n it approximates a two-proportion z-test.
t_stat, p_value = stats.ttest_ind(conversions_A, conversions_B,
                                  equal_var=False)
print(f"Button A rate: {conversions_A.mean()*100:.2f}%")
print(f"Button B rate: {conversions_B.mean()*100:.2f}%")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
if p_value < alpha:
    print("\nREJECT H₀: significant difference between buttons")
else:
    print("\nFAIL TO REJECT H₀: no significant difference detected")
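Because conversions are proportions, the same question can also be posed as a pooled two-proportion z-test on the counts. The sketch below is a hand-rolled version using only `scipy.stats.norm`; the helper name `two_proportion_ztest` is hypothetical, and the sample size of 1,000 visitors per button is an assumption, since the text gives only the rates 4.2% and 4.7%.

```python
import math
from scipy import stats

def two_proportion_ztest(x_a, n_a, x_b, n_b):
    """Pooled two-proportion z-test, two-sided. Returns (z, p_value)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)                    # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, 2 * stats.norm.sf(abs(z))

# Assumed sample sizes: 1,000 visitors each, giving 42 and 47 conversions
z, p_value = two_proportion_ztest(x_a=42, n_a=1000, x_b=47, n_b=1000)

print(f"z = {z:.3f}, p = {p_value:.3f}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")
```

At this assumed sample size a gap of half a percentage point is well within noise; detecting it reliably would take far more traffic.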
Computing Cohen's d — Effect Size
import numpy as np
from scipy import stats
def cohen_d(group1, group2):
    """Pooled Cohen's d for two independent groups."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = np.std(group1, ddof=1), np.std(group2, ddof=1)
    pooled_std = np.sqrt(((n1-1)*s1**2 + (n2-1)*s2**2) / (n1+n2-2))
    return (np.mean(group1) - np.mean(group2)) / pooled_std
np.random.seed(42)
control = np.random.normal(10, 4, 30)
treated = np.random.normal(12.5, 4, 30)
t_stat, p_val = stats.ttest_ind(treated, control, equal_var=False)
d = cohen_d(treated, control)
print(f"t = {t_stat:.4f}, p = {p_val:.4f}")
print(f"Cohen's d = {d:.4f}")
# Interpret effect size
if abs(d) < 0.2:
    label = "negligible"
elif abs(d) < 0.5:
    label = "small"
elif abs(d) < 0.8:
    label = "medium"
else:
    label = "large"
print(f"Effect size: {label}")
# Crucial lesson: statistically significant doesn't mean practically important.
# Always report d alongside p-value.
stats.ttest_ind() defaults to equal_var=True (Student's t-test), which assumes both groups have equal population variances, an assumption that is rarely checked. Unless you have good reason to believe the variances are equal, pass equal_var=False to use Welch's t-test. It is nearly as powerful when variances are equal and far more robust when they are not, so there is little downside to using it by default.
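How much the two variants can disagree is easiest to see with unequal variances and unequal group sizes. The sketch below uses `scipy.stats.ttest_ind_from_stats` with hypothetical summary statistics (a small, tight group against a large, noisy one); the numbers are invented for illustration only.

```python
from scipy import stats

# Hypothetical summary statistics (mean, sample SD with ddof=1, group size)
mean1, sd1, n1 = 10.0, 2.0, 10    # small group, low variance
mean2, sd2, n2 = 12.0, 8.0, 40    # large group, high variance

# Student's t-test pools the variances; Welch's does not
student = stats.ttest_ind_from_stats(mean1, sd1, n1, mean2, sd2, n2, equal_var=True)
welch = stats.ttest_ind_from_stats(mean1, sd1, n1, mean2, sd2, n2, equal_var=False)

print(f"Student: t = {student.statistic:.3f}, p = {student.pvalue:.3f}")
print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.3f}")
```

Here the pooled variance is dominated by the large noisy group, so Student's test inflates the standard error and reports a much weaker result than Welch's for the same data.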
Choosing the Right Test — Quick Reference
| Scenario | Test | H₀ Example | Python Function (SciPy unless noted) |
|---|---|---|---|
| 1 group vs known value | One-sample t-test | μ = 50 | ttest_1samp(data, 50) |
| 2 independent groups | Welch's t-test | μ₁ = μ₂ | ttest_ind(a, b, equal_var=False) |
| Same group before & after | Paired t-test | μ_diff = 0 | ttest_rel(before, after) |
| 3+ independent groups | One-way ANOVA | μ₁ = μ₂ = μ₃ | f_oneway(g1, g2, g3) |
| Proportion vs benchmark | z-test proportion | p = 0.50 | proportions_ztest(count, nobs, value=0.50) (statsmodels) |
| Categorical association | Chi-square test | Variables independent | chi2_contingency(table) |
| Non-normal, 2 groups | Mann-Whitney U | Same distribution | mannwhitneyu(a, b) |
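As one example from the table, the paired t-test row (H₀: μ_diff = 0) is equivalent to a one-sample t-test on the per-subject differences. The sketch below checks this with hypothetical before and after readings for eight subjects; the values are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical readings for the same 8 subjects before and after an intervention
before = np.array([140, 152, 138, 147, 160, 155, 149, 143])
after = np.array([135, 150, 136, 140, 154, 151, 148, 137])

# Paired t-test on the two columns
paired = stats.ttest_rel(before, after)

# Equivalent one-sample t-test on the differences against 0
one_sample = stats.ttest_1samp(before - after, popmean=0)

print(f"paired:     t = {paired.statistic:.4f}, p = {paired.pvalue:.4f}")
print(f"one-sample: t = {one_sample.statistic:.4f}, p = {one_sample.pvalue:.4f}")
```

Seeing the equivalence makes it clear why pairing matters: the test works on differences within subjects, so variation between subjects drops out.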