Type I & Type II Errors: False Positives vs Negatives

Section 01

Two Ways to Be Wrong ⚖️

Every decision under uncertainty carries the risk of being wrong. A doctor diagnosing a patient, a judge evaluating evidence, an engineer approving a bridge — each must act on incomplete information, accepting that mistakes are possible. In statistics, there are exactly two distinct ways a hypothesis test can fail you, and understanding both is the difference between good science and dangerous science.

Think of it like a smoke detector. It has two possible errors: it beeps when there is no fire (a false alarm waking you at 3 AM for burnt toast), or it stays silent when there is a real fire (a catastrophic miss). Both errors exist. You cannot eliminate one without worsening the other — unless you buy a better detector (collect more data). Statistics calls these the Type I Error and the Type II Error.

💡

The Core Tension

Type I and Type II errors are locked in a trade-off. For a fixed sample size and a fixed test, lowering one raises the other. The only escape is to collect more data, which lets you lower both at once. Every statistical design is, at its heart, a negotiation between these two risks.

Section 02

Definitions — Crystal Clear

Type I Error

Also called: False Positive
Rejecting H₀ when H₀ is actually TRUE
Seeing an effect that does not exist
Probability = α (significance level)
You control this directly
Lower α → fewer Type I errors
Analogy: convicting the innocent

Type II Error

Also called: False Negative
Failing to reject H₀ when H₀ is FALSE
Missing an effect that truly exists
Probability = β
Controlled indirectly via n and α
Lower β → fewer Type II errors
Analogy: acquitting the guilty

Statistical Power

1 − β

The probability of detecting a real effect
Power = 1 − β
Target: 80% power (β = 0.20)
Increases with larger sample size
Increases with larger effect size
Increases with higher α
The flip-side of Type II error

Section 03

The Four-Outcome Matrix

When you run a hypothesis test, reality is either one way or the other — and your decision is either correct or one of two errors. This two-by-two table is the most important diagram in all of inferential statistics. Memorise it.

Your Decision	H₀ is Actually TRUE (no real effect exists)	H₀ is Actually FALSE (real effect exists)
Reject H₀ ("Effect found")	❌ Type I Error (False Positive), probability = α	✅ Correct: True Positive, probability = 1 − β (Power)
Fail to Reject H₀ ("No effect found")	✅ Correct: True Negative, probability = 1 − α	❌ Type II Error (False Negative), probability = β
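The four cells of the matrix can be checked empirically with a small simulation. The sketch below is illustrative only: it assumes a two-sided z-test with known σ, and the values n = 71, σ = 15 and a true shift of 5 are hypothetical inputs chosen for the demonstration, as are the function and variable names.

```python
import math
import random
from statistics import NormalDist


def simulate_rejection_rate(true_shift, n=71, sigma=15.0, alpha=0.05,
                            trials=4000, seed=0):
    """Return the fraction of simulated two-sided z-tests that reject H0.

    true_shift == 0 means H0 is true, so every rejection is a Type I error.
    true_shift != 0 means H0 is false, so every non-rejection is a Type II error.
    """
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)                      # standard error of the sample mean
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    rejections = 0
    for _ in range(trials):
        sample_mean = rng.gauss(true_shift, se)    # draw the sample mean directly
        if abs(sample_mean / se) > z_crit:
            rejections += 1
    return rejections / trials


type_i_rate = simulate_rejection_rate(true_shift=0.0)   # column "H0 is TRUE"
power = simulate_rejection_rate(true_shift=5.0)         # column "H0 is FALSE"
type_ii_rate = 1 - power

print(f"Type I rate  (reject, H0 true):   {type_i_rate:.3f}")
print(f"True negative rate:               {1 - type_i_rate:.3f}")
print(f"Power        (reject, H0 false):  {power:.3f}")
print(f"Type II rate (retain, H0 false):  {type_ii_rate:.3f}")
```

Under these assumptions the simulated Type I rate should land near the chosen α and the rejection rate under H₁ should land near the design power, with some run-to-run noise.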

📐

Memory Trick — The "Cry Wolf" Framework

Type I: The boy who cried wolf when there was no wolf (a false alarm). The villagers rejected "no wolf" (H₀) when it was true.

Type II: The villagers who ignored the boy when a wolf finally came (a missed detection). They failed to reject "no wolf" when there really was one.

Section 04

Visualising Both Errors on the Bell Curve 📊

Both errors arise from the overlap between two distributions: the null distribution (what data looks like if H₀ is true) and the alternative distribution (what data looks like if H₁ is true). The critical value, set by α, draws a vertical line. In a right-tailed test, everything to its right gets labelled "significant." But because the two distributions overlap, both types of errors are inevitable.

Type I Error (α): shaded tail of H₀ past the critical value (False Positive)

Type II Error (β): H₁ tail left of the critical value (False Negative)

Power (1 − β): H₁ area past the critical value (Correct Detection)

💡

Why the Overlap Creates Both Errors

The further apart H₀ and H₁ are (larger effect size), the less the distributions overlap, and the easier it is to separate them. Moving the critical value left reduces β (more power) but increases α. Moving it right reduces α but increases β. The only way to reduce both is to reduce the spread of both distributions — which means increasing sample size n.
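The effect of sliding the critical value can be computed directly from the two normal curves. This is a minimal sketch, assuming a right-tailed test where both sampling distributions are normal with the same standard error; the effect of 2.5 and the function name `error_rates` are hypothetical choices for illustration.

```python
from statistics import NormalDist


def error_rates(critical_value, effect=2.5, se=1.0):
    """Right-tailed test: reject H0 when the statistic exceeds critical_value."""
    null_dist = NormalDist(mu=0.0, sigma=se)     # sampling distribution under H0
    alt_dist = NormalDist(mu=effect, sigma=se)   # sampling distribution under H1
    alpha = 1 - null_dist.cdf(critical_value)    # H0 area to the right of the line
    beta = alt_dist.cdf(critical_value)          # H1 area to the left of the line
    return alpha, beta, 1 - beta


# Moving the line right lowers alpha and raises beta; moving it left does the opposite.
for c in (1.0, 1.645, 2.326):
    alpha, beta, power = error_rates(c)
    print(f"critical value {c:.3f}: alpha={alpha:.3f}, beta={beta:.3f}, power={power:.3f}")

# Halving the standard error (a larger n) lowers beta while alpha stays at 0.05.
z_95 = NormalDist().inv_cdf(0.95)
for se in (1.0, 0.5):
    alpha, beta, power = error_rates(z_95 * se, se=se)
    print(f"se={se}: alpha={alpha:.3f}, beta={beta:.4f}")
```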

Section 05

Story 1 — The Courtroom 🏛️

The criminal justice system is built around the principle of "innocent until proven guilty." The null hypothesis is always innocence: H₀: defendant is innocent. The prosecution must provide overwhelming evidence to overcome this presumption.

⚖️ Justice System — Both Errors Mapped

H₀

The defendant is innocent — no crime committed

H₁

The defendant is guilty — crime was committed

Type I Error

Convicting an innocent person. The jury rejects "innocence" (H₀) when the defendant truly is innocent. The system produces a false positive — a wrongful conviction. DNA exoneration cases are real-world Type I errors in justice.

Type II Error

Acquitting a guilty person. The jury fails to reject "innocence" (H₀) when the defendant truly is guilty. The system produces a false negative — a guilty party walks free.

Society's α

Society chooses a very low α — "beyond reasonable doubt" — because Type I errors (convicting the innocent) are considered more morally catastrophic than Type II errors. The legal system explicitly accepts more guilty acquittals to prevent innocent convictions.

The Trade-off

Lowering the evidence standard (raising α) convicts more guilty people — but also more innocent ones. Raising the standard (lowering α) protects the innocent — but lets more guilty people escape. There is no perfect threshold.

Section 06

Story 2 — The Cancer Screening Test 🏥

A hospital introduces a new blood test to screen for early-stage pancreatic cancer in patients over 50. The null hypothesis is always the medical default: H₀: patient does not have cancer. The test looks for a biomarker above a certain threshold. If found, the patient is sent for invasive biopsy and further treatment.

🩺 Cancer Screening — Real-World Error Consequences

H₀

Patient does not have cancer

H₁

Patient has cancer

Type I Error

False Positive — healthy patient told they may have cancer. Consequences: severe psychological distress, unnecessary invasive biopsy, potential complications from procedures, financial cost. Rate controlled by the test's specificity.

Type II Error

False Negative — cancer patient told they are healthy. Consequences: cancer goes undetected at an early, treatable stage. Patient returns home without treatment. Cancer progresses. Potentially fatal. Rate controlled by the test's sensitivity.

Which Is Worse?

In cancer screening, Type II errors are catastrophic. Missing a real cancer is far worse than a false alarm requiring follow-up. So screening tests are deliberately designed with high sensitivity (low β), accepting more false positives so that as few real cancers as possible are missed.

Design Choice

The hospital sets a lower biomarker threshold (higher α). This catches more true cancers (high power, low β) at the cost of more false alarms requiring biopsy follow-up. The clinical decision balances patient safety against healthcare system capacity.

⚠️

Sensitivity vs Specificity — Clinical Names for the Same Trade-off

Sensitivity = Power = 1 − β = P(positive test | disease present). How well the test catches real disease.

Specificity = 1 − α = P(negative test | no disease). How well the test avoids false alarms.

A perfect test has both at 100%. In practice, for a given test, moving the threshold to improve one degrades the other.
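Both quantities fall straight out of the four cells of a confusion matrix. The sketch below uses made-up counts for 1,000 screened patients (not taken from any real study), and `screening_metrics` is a hypothetical helper name.

```python
def screening_metrics(tp, fp, tn, fn):
    """Map confusion-matrix counts to the clinical and statistical error rates."""
    sensitivity = tp / (tp + fn)   # P(positive test | disease present)
    specificity = tn / (tn + fp)   # P(negative test | no disease)
    return {
        "sensitivity": sensitivity,               # analogue of power, 1 - beta
        "specificity": specificity,               # analogue of 1 - alpha
        "false_positive_rate": 1 - specificity,   # analogue of alpha
        "false_negative_rate": 1 - sensitivity,   # analogue of beta
    }


# Hypothetical counts: 50 patients with disease, 950 without.
metrics = screening_metrics(tp=45, fp=95, tn=855, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```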

Section 07

Story 3 — The Email Spam Filter 📧

Every email inbox has a spam filter making thousands of hypothesis tests per day — one for each incoming message. H₀: this email is legitimate. H₁: this email is spam. The filter must decide: quarantine or deliver?

📬 Spam Filter: Type I vs Type II Error Analysis

H₀

Email is legitimate (not spam)

H₁

Email is spam

Type I Error

Legitimate email sent to spam folder. Your boss's important message disappears. A job offer is missed. A client's invoice goes unseen. Type I errors in spam filters destroy real communication and can have serious professional consequences.

Type II Error

Spam email delivered to inbox. Annoying, wastes time, possibly carries phishing or malware links. But the user can delete it. Less damaging than losing a legitimate email — usually.

Design Priority

Most email providers prioritise low Type I error rate — they'd rather let some spam through than delete real emails. Enterprise filters (especially financial or legal firms) may invert this for security: higher α to block everything suspicious.

User Control

Your "mark as spam" / "not spam" feedback acts as a training signal that shifts the filter's effective threshold, and with it the balance between α and β. When you rescue a legitimate email from spam, you push the filter toward fewer Type I errors on similar future messages.
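The threshold trade-off can be made concrete with a toy filter. This is only a sketch: the spam scores below are invented, real filters are far more elaborate, and `filter_errors` is a hypothetical name.

```python
# Hypothetical spam scores in [0, 1]; higher means "more spam-like".
legit_scores = [0.05, 0.10, 0.20, 0.35, 0.55, 0.70]
spam_scores = [0.40, 0.60, 0.75, 0.85, 0.90, 0.95]


def filter_errors(threshold):
    """Quarantine any message whose score is at or above the threshold."""
    type_i = sum(score >= threshold for score in legit_scores)   # real mail quarantined
    type_ii = sum(score < threshold for score in spam_scores)    # spam delivered
    return type_i, type_ii


for threshold in (0.3, 0.5, 0.8):
    type_i, type_ii = filter_errors(threshold)
    print(f"threshold {threshold}: {type_i} legitimate blocked, {type_ii} spam delivered")
```

Raising the threshold protects legitimate mail at the cost of more delivered spam; lowering it does the reverse.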

Section 08

Story 4 — The Factory Quality Inspector 🏭

A pharmaceutical factory produces insulin vials. Each batch is tested to see if the concentration is within specification. If a batch fails, it is destroyed — a costly decision. H₀: batch concentration is within specification. A Type I error means destroying a good batch. A Type II error means shipping dangerous insulin to diabetic patients.

🧮 Insulin Batch QC — Worked Hypothesis Test

Setup

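Specification: μ = 100 IU/mL ± tolerance. H₀: batch is in spec (μ = 100). H₁: batch is out of spec (μ ≠ 100). Two-tailed t-test, α = 0.01, n = 20 vials. Because H₀ is "in spec", a strict α mainly guards against destroying good batches (Type I); it makes a bad batch harder, not easier, to catch.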

Data

Sample: x̄ = 97.4 IU/mL, s = 4.2 IU/mL, n = 20

Step 1

SE = s/√n = 4.2/√20 = 4.2/4.472 = 0.939

Step 2

t = (97.4 − 100)/0.939 = −2.6/0.939 = −2.77

Step 3

df = 19. Critical t (two-tailed, α = 0.01, df = 19) = ±2.861

Decision

|−2.77| = 2.77 < 2.861 → Fail to Reject H₀. Batch passes at α = 0.01.
At α = 0.05: critical t = ±2.093. |−2.77| > 2.093 → Reject H₀. Batch would be destroyed. Different α, different fate for the batch.

Error Analysis

If the batch truly IS out of spec but passes (at α = 0.01) → Type II Error. Dangerous insulin ships. If the batch truly is fine but fails (at α = 0.05) → Type I Error. Good product destroyed. The factory carefully weighs which risk is more acceptable.
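The arithmetic of this worked test can be reproduced in a few lines. The sketch below takes the sample statistics and the two critical t values directly from the steps above rather than computing them from a t-distribution; the variable names are illustrative.

```python
import math

n, sample_mean, sample_sd, target = 20, 97.4, 4.2, 100.0

se = sample_sd / math.sqrt(n)          # standard error of the mean
t_stat = (sample_mean - target) / se   # one-sample t statistic, df = n - 1

# Two-tailed critical values for df = 19, as quoted in the worked example.
critical_t = {0.01: 2.861, 0.05: 2.093}

decisions = {}
for alpha, t_crit in critical_t.items():
    decisions[alpha] = "reject H0" if abs(t_stat) > t_crit else "fail to reject H0"
    print(f"alpha={alpha}: |t|={abs(t_stat):.2f} vs {t_crit} -> {decisions[alpha]}")
```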

Section 09

Factors That Control Each Error

Factor	Effect on Type I Error (α)	Effect on Type II Error (β)	Effect on Power (1−β)
Increase α (e.g. 0.01 → 0.05)	Increases ↑	Decreases ↓	Increases ↑
Decrease α (e.g. 0.05 → 0.01)	Decreases ↓	Increases ↑	Decreases ↓
Increase sample size n	No change	Decreases ↓	Increases ↑
Larger effect size (δ)	No change	Decreases ↓	Increases ↑
Reduce population variance (σ²)	No change	Decreases ↓	Increases ↑
One-tailed vs two-tailed	Same total α	Decreases in predicted direction ↓	Increases in predicted direction ↑
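Each row of the table can be checked numerically. The following is a rough sketch, assuming a one-sample z-test with known σ and ignoring the negligible rejection probability in the opposite tail; the baseline values (n = 50, δ = 5, σ = 15) and the function name `power` are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(n, delta, sigma, alpha=0.05, two_tailed=True):
    """Approximate power of a one-sample z-test for a true shift of delta."""
    tail_alpha = alpha / 2 if two_tailed else alpha
    z_crit = STD_NORMAL.inv_cdf(1 - tail_alpha)
    shift = delta * math.sqrt(n) / sigma       # true shift in standard-error units
    return STD_NORMAL.cdf(shift - z_crit)


scenarios = {
    "baseline (n=50, delta=5, sigma=15, alpha=0.05)": power(50, 5, 15),
    "raise alpha to 0.10": power(50, 5, 15, alpha=0.10),
    "lower alpha to 0.01": power(50, 5, 15, alpha=0.01),
    "double n to 100": power(100, 5, 15),
    "larger effect (delta=8)": power(50, 8, 15),
    "smaller spread (sigma=10)": power(50, 5, 10),
    "one-tailed instead of two-tailed": power(50, 5, 15, two_tailed=False),
}

for label, value in scenarios.items():
    print(f"{label}: power={value:.3f}, beta={1 - value:.3f}")
```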

✅

The Only Free Lunch: Increase n

Every other knob you turn to reduce one error worsens the other. Increasing sample size is the exception: at a fixed α it lowers β, and it also lets you tighten α without giving up power. This is why power analysis (calculating the n needed to achieve 80% power at your chosen α) is the first step of any well-designed experiment.

Section 10

Real-World Error Priorities by Domain

Domain	Scenario	Worse Error	Design Priority	Typical α
⚖️ Criminal law	Convict / acquit defendant	Type I (convict innocent)	Low α — high evidence bar	Very low (beyond doubt)
🏥 Cancer screening	Detect / miss disease	Type II (miss cancer)	High sensitivity, low β	0.05 – 0.10
💊 Drug approval	Approve / reject drug	Type I (approve useless drug)	Low α — strict evidence	0.01 – 0.05
☢️ Safety monitoring	Sound alarm / stay silent	Type II (miss hazard)	High sensitivity, high α	0.05 – 0.10
📧 Spam filtering	Block / allow email	Type I (block real email)	Protect legitimate mail	Low (conservative)
📈 A/B testing	Launch / reject new feature	Balanced — both matter	80% power at α = 0.05	0.05
🔬 Particle physics	Claim new particle	Type I (false discovery)	Extremely low α (5σ)	≈ 0.0000003
🧬 Genomics (GWAS)	Identify disease gene	Type I (false association)	Bonferroni correction	5 × 10⁻⁸
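The two smallest thresholds in the table can be recovered from first principles. In this sketch the 5σ figure is treated as a one-sided normal tail probability, and the genomics threshold is shown as a Bonferroni split of 0.05 across an assumed one million independent tests; that count is an illustrative assumption, not a number stated in the table.

```python
from statistics import NormalDist

# One-sided tail probability beyond 5 standard deviations of a standard normal.
five_sigma_p = NormalDist().cdf(-5.0)

# Bonferroni: divide the family-wise alpha by the (assumed) number of tests.
assumed_tests = 1_000_000
gwas_threshold = 0.05 / assumed_tests

print(f"5-sigma one-sided p-value: {five_sigma_p:.2e}")
print(f"Bonferroni threshold for {assumed_tests:,} tests: {gwas_threshold:.0e}")
```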

Section 11

Worked Example — Power Calculation 🔢

A researcher wants to detect whether a new teaching method increases exam scores by at least 5 points (the minimum meaningful effect). Current scores have μ = 70 and σ = 15. How many students are needed to achieve 80% power at α = 0.05 (two-tailed)?

Sample Size Formula

n = (Z_α/2 + Z_β)² × σ² / δ²

δ = minimum detectable effect, Z_α/2 = critical Z for α, Z_β = Z for desired power (1−β)

Z Values Needed

Z_α/2 = 1.96, Z_β = 0.842

At α = 0.05 (two-tailed): Z_α/2 = 1.96. For 80% power (β = 0.20): Z_β = 0.842 from Z-table.

Calculation

n = (1.96 + 0.842)² × 225 / 25

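(1.96 + 0.842)² = (2.802)² = 7.85. × 225 / 25 = 7.85 × 9 = 70.7, rounded up to 71 students.

Interpretation

n ≈ 71

With 71 students, a one-sample comparison against the known mean of 70 has 80% power to detect a 5-point improvement at α = 0.05. Fewer students risks a Type II error: missing a real effect. This formula is the one-sample form; comparing two independent groups (new method vs control) takes roughly twice as many students per group, about 142.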
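The sample size formula is easy to evaluate programmatically, which avoids reading Z values off a table. This is a minimal sketch that implements the formula exactly as written above; `required_n` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def required_n(delta, sigma, alpha=0.05, power=0.80):
    """n = (Z_alpha/2 + Z_beta)^2 * sigma^2 / delta^2, before rounding up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # about 0.842 for 80% power
    return (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2


n_exact = required_n(delta=5, sigma=15)
n_needed = math.ceil(n_exact)   # always round up so power is not below target

print(f"exact n = {n_exact:.2f}, rounded up to {n_needed}")
```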

🧮 What Happens at Different Sample Sizes?

n = 20

Power ≈ 30%. β ≈ 0.70. A 70% chance of missing the 5-point improvement even if it's real. The study is severely underpowered — a waste of resources.

n = 45

Power ≈ 60%. β ≈ 0.40. Still a substantial chance of missing the effect: a real improvement goes undetected about 40% of the time. Below the conventional 80% target.

n = 71

Power ≈ 80%. β ≈ 0.20. The standard target. One in five real effects will still be missed, but this is the accepted convention in most research fields.

n = 130

Power ≈ 97%. β ≈ 0.03. Only about a 3% chance of missing the effect (95% power is already reached near n = 117). Used in high-stakes research where a Type II error would be very costly. Requires almost twice the resources of the 80% design.
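The power at each sample size can be recomputed by inverting the sample size formula. The sketch below assumes the same one-sample z approximation (δ = 5, σ = 15, two-tailed α = 0.05) and ignores the tiny rejection probability in the opposite tail; `power_at` is a hypothetical name.

```python
import math
from statistics import NormalDist


def power_at(n, delta=5.0, sigma=15.0, alpha=0.05):
    """Approximate power of a two-tailed one-sample z-test at sample size n."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(delta * math.sqrt(n) / sigma - z_crit)


powers = {n: power_at(n) for n in (20, 45, 71, 130)}
for n, p in powers.items():
    print(f"n = {n:>3}: power = {p:.1%}, beta = {1 - p:.1%}")
```

Under this approximation the first three sample sizes come out near 32%, 61% and 80%, and n = 130 comes out near 97%.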

Section 12

The Replication Crisis — Type I Errors at Scale 📉

Between 2011 and 2015, a team of psychologists attempted to replicate 100 published psychology studies, nearly all of which (97%) had reported statistically significant results at α = 0.05. The result was shocking: only 36% of the replications were significant. Many of the original findings were likely Type I errors, or real effects with badly overestimated sizes, that had been published as solid discoveries.

How did this happen? Several mechanisms inflated the false positive rate far above 5%:

🚨 Five Practices That Inflate the Type I Error Rate

HARKing — Hypothesising After Results are Known. Researchers run an experiment with no firm hypothesis, find any significant result, then write the paper as if that result was predicted. The test statistic is no longer drawn from the distribution assumed under H₀.

Multiple comparisons without correction. Running 20 tests at α = 0.05 gives a ~64% chance of at least one false positive. Reporting only the "significant" result as if it were the only test run makes the true Type I error rate enormous.

Optional stopping. Checking data after every 10 participants and stopping when p < 0.05 dramatically inflates the Type I error rate. An ordinary p-value is only valid when n (or a formal sequential stopping rule) is fixed in advance.

Publication bias. Journals prefer to publish significant findings. Null results (which might represent correct non-rejections of H₀) sit in file drawers. The published literature becomes a biased sample of Type I errors.

Underpowered studies. Small samples make Type II errors common: real effects are missed. Paradoxically, they also make published significant results less trustworthy. When power is low, true effects rarely reach significance, so false positives make up a larger share of the significant results; and the true effects that do reach significance tend to have estimates inflated by sampling noise.
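Two of these practices are easy to quantify. The sketch below computes the family-wise error rate for independent tests and simulates optional stopping under a true null; it assumes a z-test with known standard deviation 1 and a look after every 10 participants up to 100, and all function names and simulation settings are illustrative.

```python
import math
import random
from statistics import NormalDist


def familywise_error_rate(num_tests, alpha=0.05):
    """P(at least one false positive) across independent tests when every H0 is true."""
    return 1 - (1 - alpha) ** num_tests


def optional_stopping_rate(max_n=100, look_every=10, alpha=0.05,
                           trials=2000, seed=1):
    """Fraction of null experiments declared significant when peeking repeatedly."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(trials):
        total = 0.0
        for i in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)         # H0 is true: mean 0, sd 1
            if i % look_every == 0:
                z = total / math.sqrt(i)         # z statistic with known sd = 1
                if abs(z) > z_crit:
                    false_positives += 1         # stop at the first "significant" look
                    break
    return false_positives / trials


fwer_20 = familywise_error_rate(20)
peeking_rate = optional_stopping_rate()

print(f"20 uncorrected tests: P(at least one false positive) = {fwer_20:.3f}")
print(f"Peeking every 10 participants: false positive rate = {peeking_rate:.3f}")
```

The first number matches the ~64% quoted above; the simulated peeking rate should come out well above the nominal 5%, although its exact value depends on the number of looks and the random seed.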

⚠️

The Solution: Pre-Registration

Scientists now combat the replication crisis by pre-registering their hypotheses, sample sizes, α levels, and analysis plans in public repositories (OSF, ClinicalTrials.gov) before collecting a single data point. This locks in the test design, making it much harder to change decisions after seeing the data without the change being visible. Pre-registered studies replicate at roughly twice the rate of non-pre-registered ones.

Section 13

Quick Reference Summary

Feature	Type I Error	Type II Error
Also called	False Positive, α error	False Negative, β error
Definition	Reject H₀ when H₀ is TRUE	Fail to reject H₀ when H₀ is FALSE
Probability	α (set by researcher)	β (depends on n, δ, σ, α)
You control it via	Setting significance level α	Sample size, effect size, α
Medical term	Low specificity (false alarm)	Low sensitivity (missed detection)
Legal analogy	Convicting the innocent	Acquitting the guilty
Detector analogy	Smoke alarm — no fire	Smoke alarm — silent during fire
Reduced by	Lowering α	Increasing n or raising α
Trade-off	At a fixed n, reducing one increases the other; a larger n lets you reduce both

🧮

The Analyst's Creed

Great statistical thinking is not about getting p < 0.05 — it's about honestly quantifying and minimising both kinds of error for the decision at hand. Always ask: "In my context, which error is more costly?" Design your study to control that error first, power it properly to control the other, and report both transparently. That is the foundation of trustworthy data science.