Hypothesis Testing in Statistics

Section 01

The Courtroom, The Medicine & The Factory

Imagine you are a judge. A person stands before you accused of a crime. In every democratic legal system, the law starts from one powerful assumption: the accused is innocent until proven guilty. The burden of proof lies entirely with the prosecution. Unless the evidence is so overwhelming that it cannot plausibly be explained by chance or coincidence, the verdict must be not guilty.

Notice the judge does not prove innocence. The judge simply asks: "Is the evidence against this person strong enough to reject the assumption of innocence?" This is exactly how hypothesis testing works in statistics — and it is not a coincidence that statisticians borrowed the language of law to build it.

💡

Three Stories, One Framework

A pharmaceutical company claims a new drug lowers blood pressure better than the current standard. How do we know the improvement is real and not random luck in the trial? A factory manager suspects her machine produces bolts heavier than the 50 g specification. How many bolts must she weigh before the evidence is convincing? An e-commerce team runs an A/B test: Button A converts at 4.2%, Button B at 4.7%. Is that difference real or just noise? In all three cases the answer uses one tool: hypothesis testing.

Hypothesis testing is the formal statistical procedure for deciding whether observed data provide sufficient evidence to reject a specific claim about a population. It underpins clinical trials, quality control systems, A/B tests, and much of published empirical science.

Section 02

What Is a Hypothesis Test?

A statistical hypothesis is a testable claim about a population parameter — a mean, a proportion, a variance, a difference. The test uses sample data to decide how plausible that claim is. The entire procedure rests on one central question:

💡

The Central Question of Every Hypothesis Test

"If the null hypothesis were true, how likely is it that we would observe data at least this extreme purely by chance?"

If that probability — the p-value — is smaller than a pre-set threshold called the significance level α, we say the result is statistically significant and reject the null hypothesis. If not, we fail to reject it.

Null Hypothesis

H₀

The default assumption
"Nothing is happening"
Always contains = sign
We try to disprove it

Alternative Hypothesis

H₁ or Hₐ

What we want to establish
"Something is happening"
Contains ≠, <, or >
Accepted only if H₀ rejected

Decision Rule

p vs α

p < α → Reject H₀
p ≥ α → Fail to reject H₀
α = 0.05 is standard
Never say "accept H₀"

Section 03

Null vs Alternative Hypothesis — In Depth

The null and alternative hypotheses are two competing statements about the world. Together they must be mutually exclusive and exhaustive. Understanding how to formulate them correctly is the most critical skill in hypothesis testing.

The Three Test Types

Test Type	H₀	H₁	Rejection Region	Real-world Question
Two-tailed	μ = μ₀	μ ≠ μ₀	Both tails (α/2 each)	"Is the mean different from 50 g?"
Right-tailed	μ ≤ μ₀	μ > μ₀	Right tail only (α)	"Is the mean greater than 50 g?"
Left-tailed	μ ≥ μ₀	μ < μ₀	Left tail only (α)	"Is the mean less than 50 g?"
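
The three rows of the table map onto the `alternative` argument that SciPy's t-test functions accept (available in SciPy 1.6 and later). The sketch below uses a small set of made-up bolt weights, purely for illustration, to show how the same data yield three different p-values depending on which H₁ was chosen.

```python
import numpy as np
from scipy import stats

# Hypothetical bolt weights in grams (illustrative values, not from the article)
weights = np.array([51.3, 49.8, 52.1, 50.6, 51.9, 50.2, 52.4, 51.0, 49.5, 51.7])
mu_0 = 50.0

# Map each test type in the table to SciPy's `alternative` argument
alternatives = {
    "two-tailed": "two-sided",   # H1: mu != mu_0
    "right-tailed": "greater",   # H1: mu >  mu_0
    "left-tailed": "less",       # H1: mu <  mu_0
}

p_values = {}
for name, alt in alternatives.items():
    result = stats.ttest_1samp(weights, popmean=mu_0, alternative=alt)
    p_values[name] = result.pvalue
    print(f"{name:<12} t = {result.statistic:.3f}  p = {result.pvalue:.4f}")
```

The two one-tailed p-values always sum to 1, and the two-tailed p-value is twice the smaller of them, which is why the direction has to be fixed before the data are seen.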

⚠️

The Most Common Formulation Mistake

Researchers instinctively want to put their research claim in H₀. Always do the opposite. H₀ is the sceptic's position — "nothing has changed, there is no effect." Your claim — the thing you are trying to establish — always goes in H₁. You build evidence against H₀, just as a prosecutor builds a case against the presumption of innocence.

Section 04

Significance Level α and the p-value

Before collecting any data you must set the significance level α — the maximum probability of wrongly rejecting H₀ you are willing to tolerate. After collecting data you compute the p-value, then compare the two to make your decision.

p-value — Precise Definition

p = P(observing data this extreme | H₀ is true)

The probability of obtaining a test statistic at least as extreme as the one observed, assuming H₀ is true. Small p-value = data would be very unlikely if H₀ were true = evidence against H₀. It is not the probability that H₀ is true.
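
One way to make this definition concrete is to simulate it. The sketch below borrows n = 25 and an observed t of 2.143 from the bolt example later in this article, and assumes normally distributed weights with σ = 2.8 g (an assumption needed only to generate data under H₀). It draws many samples from a world in which H₀ is true and counts how often the test statistic is at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(0)

n, mu_0 = 25, 50.0
sigma_assumed = 2.8      # assumption: only used to generate data under H0
t_observed = 2.143       # observed statistic, taken from the bolt example
n_sims = 100_000

# Draw many samples from a world in which H0 is true
samples = rng.normal(loc=mu_0, scale=sigma_assumed, size=(n_sims, n))
means = samples.mean(axis=1)
sds = samples.std(axis=1, ddof=1)
t_null = (means - mu_0) / (sds / np.sqrt(n))

# Two-tailed p-value: share of null statistics at least as extreme as observed
p_mc = np.mean(np.abs(t_null) >= t_observed)
print(f"Monte Carlo p-value: {p_mc:.4f}")
```

The simulated share should land close to the analytical two-tailed p-value from the t distribution, up to Monte Carlo noise.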

Significance Level α — The Pre-set Gate

α = P(Type I Error) = P(Reject H₀ | H₀ is true)

Set before data collection. Standard choices: α = 0.05 (science), 0.01 (medicine/finance), 0.10 (exploratory). Result is "statistically significant at the α level" when p < α.

⚠️

What the p-value Does NOT Mean

❌ "The probability that H₀ is true." — False.
❌ "The probability the result is due to chance." — False.
❌ "The effect is large or important." — False.
✅ p-value means: assuming H₀ is true, the probability of seeing data at least this extreme. A p = 0.001 with n = 1,000,000 may represent a trivially tiny real-world effect. Always pair p-values with effect sizes.
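
The gap between statistical and practical significance is easy to show with arithmetic. The following sketch uses hypothetical numbers (a 0.05 mmHg difference, an assumed population SD of 10 mmHg, and one million observations) and a simple one-sample z calculation.

```python
import numpy as np
from scipy import stats

# Hypothetical numbers chosen for illustration only
n = 1_000_000
sigma = 10.0          # assumed population SD (mmHg)
observed_diff = 0.05  # observed mean difference from the H0 value (mmHg)

se = sigma / np.sqrt(n)               # standard error of the mean
z = observed_diff / se                # z statistic
p_value = 2 * stats.norm.sf(abs(z))   # two-tailed p-value
cohens_d = observed_diff / sigma      # standardised effect size

print(f"z = {z:.2f}, p = {p_value:.2e}, Cohen's d = {cohens_d:.3f}")
```

The p-value is far below any conventional α, yet the standardised effect is a tiny fraction of a standard deviation, which no patient would notice.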

Section 05

Type I & Type II Errors — The Cost of Being Wrong

No matter how carefully you design a test, two types of error are always possible. Think of a COVID test analogy: a Type I error is a false positive (testing positive when you are healthy); a Type II error is a false negative (testing negative when you are actually infected).

Error Type	Also Called	Probability	What Happened	Real-world Cost
Type I Error	False Positive	α (you set this)	Rejected H₀ but it was true	Approving a drug that doesn't work
Type II Error	False Negative	β (depends on n, effect size, and α)	Failed to reject H₀ but it was false	Rejecting a drug that genuinely works
Power	Sensitivity	1 − β	Correctly rejected a false H₀	Detecting a real drug effect — target ≥ 80%
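
Both error rates in the table can be estimated by simulation. The sketch below runs many one-sample t-tests, first with H₀ true and then with a hypothetical real shift of 1.2 g (σ = 2.8 g and n = 25 are likewise assumed values), and records how often H₀ is rejected in each case.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n, n_sims = 0.05, 25, 20_000
mu_0, sigma = 50.0, 2.8     # hypothetical specification and spread
true_shift = 1.2            # hypothetical real effect when H0 is false

def rejection_rate(true_mean):
    """Fraction of simulated experiments in which H0: mu = mu_0 is rejected."""
    samples = rng.normal(true_mean, sigma, size=(n_sims, n))
    p = stats.ttest_1samp(samples, popmean=mu_0, axis=1).pvalue
    return np.mean(p < alpha)

type_i_rate = rejection_rate(mu_0)            # H0 true: rejections are false positives
power = rejection_rate(mu_0 + true_shift)     # H0 false: rejections are correct
type_ii_rate = 1 - power

print(f"Type I rate  ≈ {type_i_rate:.3f} (should be near alpha)")
print(f"Power        ≈ {power:.3f}")
print(f"Type II rate ≈ {type_ii_rate:.3f}")
```

With H₀ true, the rejection rate should sit near α. With H₀ false, the rejection rate is the power, and whatever remains is β.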

🎯

The Error Trade-off

Reducing α (from 0.05 to 0.01) makes Type I errors less likely but, with everything else held fixed, increases β, because you demand stronger evidence before rejecting H₀. For a given effect size and measurement variability, the way to reduce both simultaneously is to increase the sample size n. In clinical trials, the usual convention is α ≤ 0.05 and power ≥ 80% (β ≤ 0.20); these targets are then used to derive the minimum trial size before a single patient is enrolled.
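
The link between α, power, and sample size can be sketched with the standard normal-approximation formula n ≈ ((z₁₋α/₂ + z₁₋β) · σ / δ)². The helper `required_n` below is a hypothetical name, and the effect size δ = 1.2 g and σ = 2.8 g are assumed planning values. An exact t-based calculation would give a slightly larger n.

```python
import math
from scipy import stats

def required_n(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n for a two-tailed one-sample test (normal approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # critical value for the chosen alpha
    z_beta = stats.norm.ppf(power)            # quantile matching the target power
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

# Assumed planning values: detect a 1.2 g shift when sigma is about 2.8 g
n_05 = required_n(delta=1.2, sigma=2.8, alpha=0.05)
n_01 = required_n(delta=1.2, sigma=2.8, alpha=0.01)

print(f"alpha = 0.05 -> n = {n_05}")
print(f"alpha = 0.01 -> n = {n_01}")
```

Under these assumptions the stricter α = 0.01 needs roughly half as many observations again (64 versus 43) to keep the same 80% power, which is the trade-off described above expressed in sample size.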

Section 06

The Universal 5-Step Framework

Every hypothesis test — regardless of which specific test you use — follows the same five-step procedure. Memorising this structure means you will never lose track of where you are or what a result actually means.

🧮 Worked Example — Factory Bolt Weight (One-sample t-test)

Step 1

State the hypotheses.
Specification: bolt weight μ₀ = 50 g. Engineer suspects machine is off. She collects n = 25 bolts.
H₀: μ = 50 g (on spec) | H₁: μ ≠ 50 g (off spec — two-tailed)

Step 2

Set α and choose test.
α = 0.05. σ unknown, n = 25 (small sample) → one-sample t-test.

Step 3

Compute the test statistic.
Sample: x̄ = 51.2 g, s = 2.8 g, n = 25, df = 24.
t = (x̄ − μ₀) / (s / √n) = (51.2 − 50) / (2.8 / 5) = 1.2 / 0.56 ≈ 2.143

Step 4

Find p-value and compare.
Two-tailed p-value (df=24, t=2.143) = 0.0426
Critical value t* = ±2.064 (df=24, α=0.05)
p = 0.0426 < α = 0.05 → Reject H₀. Equivalently, |t| = 2.143 > t* = 2.064 ✓

Step 5

Conclusion in plain language.
At the 5% significance level there is sufficient evidence that the mean bolt weight differs from 50 g (t(24) = 2.14, p = 0.043); the sample mean of 51.2 g points to bolts that are too heavy. Production should be halted and the machine recalibrated.
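
The arithmetic in Steps 3 and 4 can be checked directly from the summary statistics, without the raw data. This is a minimal sketch that uses only the numbers given in the worked example and the t distribution from SciPy.

```python
import math
from scipy import stats

# Summary statistics from the worked example
x_bar, s, n = 51.2, 2.8, 25
mu_0, alpha = 50.0, 0.05

se = s / math.sqrt(n)                       # standard error: 2.8 / 5
t_stat = (x_bar - mu_0) / se                # test statistic
df = n - 1                                  # degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df)   # two-tailed p-value
t_crit = stats.t.ppf(1 - alpha / 2, df)     # two-tailed critical value

print(f"t = {t_stat:.3f}, df = {df}")
print(f"p = {p_value:.4f}, t* = ±{t_crit:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

The p-value rule and the critical-value rule always lead to the same decision, because they are two views of the same tail area.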

✅

How to Write a Correct Conclusion

Always state: (1) the significance level, (2) direction of finding, (3) test statistic and p-value in brackets, (4) the real-world implication.
Never write: "We prove H₁ is true" or "H₀ is false."
Always write: "There is sufficient evidence at the α=0.05 level to reject H₀ in favour of H₁" — or "There is insufficient evidence to reject H₀."

Section 07

One-tailed vs Two-tailed Tests

The choice between one-tailed and two-tailed must be made before looking at the data. Switching after seeing results to get a smaller p-value is a form of data manipulation called p-hacking.

Property	Two-tailed Test	One-tailed Test
H₁ contains	≠ (not equal)	> or < (directional)
Rejection region	Both tails (α/2 each)	One tail only (α)
Critical z at α=0.05	±1.960	+1.645 or −1.645
More conservative?	Yes — harder to reject	No — easier to reject
Recommended default?	Yes — safer choice	Only if theory clearly demands a direction
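
The critical values in the table come straight from the standard normal quantile function. The sketch below reproduces them and then applies both rules to a hypothetical observed z of 1.80, a value chosen because it falls between the two thresholds.

```python
from scipy import stats

alpha = 0.05

# Critical values: alpha split across two tails vs. placed in one tail
z_two = stats.norm.ppf(1 - alpha / 2)
z_one = stats.norm.ppf(1 - alpha)
print(f"two-tailed critical z: ±{z_two:.3f}")
print(f"one-tailed critical z:  {z_one:.3f}")

# Hypothetical observed statistic that lies between the two critical values
z_obs = 1.80
p_two = 2 * stats.norm.sf(abs(z_obs))   # both tails
p_one_right = stats.norm.sf(z_obs)      # right tail only

print(f"two-tailed p = {p_two:.4f} -> reject: {p_two < alpha}")
print(f"right-tailed p = {p_one_right:.4f} -> reject: {p_one_right < alpha}")
```

The same statistic is significant under the one-tailed rule and not under the two-tailed rule, which is exactly why the choice must be locked in before the data are examined.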

Section 08

Python Implementation

One-sample t-test — Factory Bolt

import numpy as np
from scipy import stats

np.random.seed(7)

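# Simulate 25 bolt weights (grams); real measurements would replace this line.
# The simulated sample mean and SD will differ somewhat from 51.2 and 2.8.
bolt_weights = np.random.normal(loc=51.2, scale=2.8, size=25)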

mu_0  = 50       # H₀: μ = 50 g
alpha = 0.05

# Two-tailed one-sample t-test
t_stat, p_value = stats.ttest_1samp(bolt_weights, popmean=mu_0)

x_bar = np.mean(bolt_weights)
s     = np.std(bolt_weights, ddof=1)
se    = s / np.sqrt(len(bolt_weights))
df    = len(bolt_weights) - 1
t_crit = stats.t.ppf(1 - alpha/2, df=df)

print(f"x̄ = {x_bar:.4f} g  |  s = {s:.4f} g  |  SE = {se:.4f}")
print(f"t_stat   = {t_stat:.4f}")
print(f"t_crit   = ±{t_crit:.4f}  (df={df}, α={alpha})")
print(f"p_value  = {p_value:.4f}")

if p_value < alpha:
    print(f"\np={p_value:.4f} < α={alpha} → REJECT H₀")
    print("Mean bolt weight differs significantly from 50 g")
else:
    print(f"\np={p_value:.4f} ≥ α={alpha} → FAIL TO REJECT H₀")

One-tailed Test — Drug Trial (Right-tailed)

import numpy as np
from scipy import stats

np.random.seed(42)

# Blood pressure reduction (mmHg) for 30 trial patients
# H₀: μ ≤ 10  |  H₁: μ > 10  (right-tailed)
bp_reduction = np.random.normal(loc=12.5, scale=4.0, size=30)
mu_0, alpha  = 10, 0.05

t_stat, p_two = stats.ttest_1samp(bp_reduction, popmean=mu_0)

# For right-tailed: halve p-value only when t_stat > 0
p_one_right = p_two / 2 if t_stat > 0 else 1 - (p_two / 2)

print(f"Mean reduction: {np.mean(bp_reduction):.2f} mmHg")
print(f"t-statistic:    {t_stat:.4f}")
print(f"p (one-tailed): {p_one_right:.4f}")

if p_one_right < alpha:
    print(f"\nREJECT H₀ — new drug produces significantly greater reduction")
else:
    print(f"\nFAIL TO REJECT H₀")

Two-sample Welch's t-test — A/B Test

import numpy as np
from scipy import stats

np.random.seed(0)

# Simulated 0/1 conversion outcomes for two button designs (500 visitors each)
# H₀: μ_A = μ_B  |  H₁: μ_A ≠ μ_B  (two-tailed); the mean of 0/1 data is the conversion rate
conversions_A = np.random.binomial(1, 0.042, 500)
conversions_B = np.random.binomial(1, 0.047, 500)
alpha = 0.05

# Welch's t-test — always use equal_var=False in practice
t_stat, p_value = stats.ttest_ind(conversions_A, conversions_B,
                                   equal_var=False)

print(f"Button A rate: {conversions_A.mean()*100:.2f}%")
print(f"Button B rate: {conversions_B.mean()*100:.2f}%")
print(f"t-statistic:   {t_stat:.4f}")
print(f"p-value:       {p_value:.4f}")

if p_value < alpha:
    print(f"\nREJECT H₀ — significant difference between buttons")
else:
    print(f"\nFAIL TO REJECT H₀ — no significant difference detected")
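
For conversion data, a pooled two-proportion z-test is a common alternative to running a t-test on 0/1 outcomes. The sketch below computes it by hand for the 4.2% and 4.7% rates mentioned at the start of this article. The visitor count of 10,000 per button is an assumption made for illustration, since the article gives no sample size for that example.

```python
import math
from scipy import stats

# Assumed traffic: 10,000 visitors per button (hypothetical)
n_a, n_b = 10_000, 10_000
conv_a, conv_b = 420, 470          # 4.2% and 4.7% of the assumed traffic
alpha = 0.05

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled rate under H0: p_A = p_B
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * stats.norm.sf(abs(z))        # two-tailed

print(f"z = {z:.3f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Even with 10,000 visitors per arm, a half-point difference at these base rates does not reach significance at α = 0.05, which illustrates how much traffic small conversion lifts require.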

Computing Cohen's d — Effect Size

import numpy as np
from scipy import stats

def cohen_d(group1, group2):
    """Pooled Cohen's d for two independent groups."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = np.std(group1, ddof=1), np.std(group2, ddof=1)
    pooled_std = np.sqrt(((n1-1)*s1**2 + (n2-1)*s2**2) / (n1+n2-2))
    return (np.mean(group1) - np.mean(group2)) / pooled_std

np.random.seed(42)
control = np.random.normal(10, 4, 30)
treated = np.random.normal(12.5, 4, 30)

t_stat, p_val = stats.ttest_ind(treated, control, equal_var=False)
d = cohen_d(treated, control)

print(f"t = {t_stat:.4f},  p = {p_val:.4f}")
print(f"Cohen's d = {d:.4f}")

# Interpret effect size
if   abs(d) < 0.2: label = "negligible"
elif abs(d) < 0.5: label = "small"
elif abs(d) < 0.8: label = "medium"
else:               label = "large"
print(f"Effect size: {label}")

# Crucial lesson: statistically significant doesn't mean practically important.
# Always report d alongside p-value.

⚠️

Always Use equal_var=False in ttest_ind()

stats.ttest_ind() defaults to equal_var=True (Student's t-test), which assumes both groups have equal population variances, an assumption that is rarely checked. Pass equal_var=False to use Welch's t-test instead. It loses very little power when the variances really are equal and is far more robust when they are not, so it is the safer default.
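
A small simulation can show what goes wrong when the equal-variance assumption fails. In the sketch below both groups have the same mean, so every rejection is a false positive. The group sizes and standard deviations are hypothetical values picked to make the problem visible: a small, noisy group compared against a large, quiet one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, n_sims = 0.05, 10_000

# Both groups share the same mean, so H0 is true in every simulated experiment.
small_noisy = rng.normal(0.0, 5.0, size=(n_sims, 10))   # n = 10, SD = 5
large_quiet = rng.normal(0.0, 1.0, size=(n_sims, 50))   # n = 50, SD = 1

# One test per row: each row pair is an independent simulated experiment
p_student = stats.ttest_ind(small_noisy, large_quiet, axis=1, equal_var=True).pvalue
p_welch = stats.ttest_ind(small_noisy, large_quiet, axis=1, equal_var=False).pvalue

student_rate = np.mean(p_student < alpha)
welch_rate = np.mean(p_welch < alpha)

print(f"Student false-positive rate: {student_rate:.3f}")
print(f"Welch false-positive rate:   {welch_rate:.3f}")
```

In this setup Student's test rejects a true H₀ far more often than the nominal 5%, while Welch's test stays close to it.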

Section 09

Choosing the Right Test — Quick Reference

Scenario	Test	H₀ Example	Python Function
1 group vs known value	One-sample t-test	μ = 50	`ttest_1samp(data, 50)`
2 independent groups	Welch's t-test	μ₁ = μ₂	`ttest_ind(a, b, equal_var=False)`
Same group before & after	Paired t-test	μ_diff = 0	`ttest_rel(before, after)`
3+ independent groups	One-way ANOVA	μ₁ = μ₂ = μ₃	`f_oneway(g1, g2, g3)`
Proportion vs benchmark	z-test proportion	P = 0.50	`proportions_ztest(count, nobs, value=0.5)` (statsmodels, not SciPy)
Categorical association	Chi-square test	Variables independent	`chi2_contingency(table)`
Non-normal, 2 groups	Mann-Whitney U	Same distribution	`mannwhitneyu(a, b)`

Section 10

Golden Rules

🎯 Hypothesis Testing — Key Rules

Your research claim always goes in H₁ — never in H₀. H₀ is the sceptic's default. You build evidence against it with data, just as a prosecutor builds a case against presumed innocence. If a statement contains an equals sign, it belongs in H₀.

Set α before collecting data — never after. Choosing α = 0.05 because your p-value happened to be 0.04 is p-hacking — one of the most common forms of statistical misconduct. Pre-register your significance level, test type, and hypotheses before touching any data.

"Fail to reject H₀" is NOT the same as "Accept H₀." You never prove the null hypothesis. A non-significant result means you did not find sufficient evidence against H₀. It could mean H₀ is true — or simply that your sample was too small to detect a real effect. Always calculate statistical power.

Statistical significance ≠ practical significance. With n = 1,000,000, even a clinically meaningless blood pressure difference of a few hundredths of a mmHg can be statistically significant. Always report effect size (Cohen's d, odds ratio, relative risk) alongside every p-value. A small p-value with a tiny effect size means little to patients.

Always use Welch's t-test (equal_var=False) for two independent groups. Student's t-test assumes equal variances, an assumption rarely verified. Welch's test is almost as powerful when variances are equal and far more robust when they are not, which makes it the safer default.

Check your assumptions before running any test. Parametric tests (t-tests, ANOVA) require approximate normality or large n. For heavily skewed data with small samples, use non-parametric alternatives: Mann-Whitney U instead of t-test, Kruskal-Wallis instead of ANOVA, Wilcoxon signed-rank instead of paired t-test.
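
One simple way to act on this rule is to screen each sample with a Shapiro-Wilk normality test and fall back to Mann-Whitney U when normality looks doubtful. The helper `compare_two_groups` below is a hypothetical name, and the lognormal samples are simulated for illustration. Choosing a test from a preliminary normality check is itself a simplification, so treat this as a sketch of the idea and not as a recommended pipeline.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical, heavily right-skewed samples (lognormal), 20 observations each
group_a = rng.lognormal(mean=0.0, sigma=1.0, size=20)
group_b = rng.lognormal(mean=0.5, sigma=1.0, size=20)

def compare_two_groups(a, b, alpha=0.05):
    """Pick Welch's t-test or Mann-Whitney U based on a Shapiro-Wilk screen."""
    normal_a = stats.shapiro(a).pvalue >= alpha   # no evidence against normality
    normal_b = stats.shapiro(b).pvalue >= alpha
    if normal_a and normal_b:
        result = stats.ttest_ind(a, b, equal_var=False)
        return "Welch's t-test", result.pvalue
    result = stats.mannwhitneyu(a, b, alternative="two-sided")
    return "Mann-Whitney U", result.pvalue

test_name, p_value = compare_two_groups(group_a, group_b)
print(f"{test_name}: p = {p_value:.4f}")
```

A histogram or Q-Q plot of each group is a useful companion to the Shapiro-Wilk screen, because with small samples the test has limited power to detect non-normality.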