Two Ways to Be Wrong ⚖️
Every decision under uncertainty carries the risk of being wrong. A doctor diagnosing a patient, a judge evaluating evidence, an engineer approving a bridge — each must act on incomplete information, accepting that mistakes are possible. In statistics, there are exactly two distinct ways a hypothesis test can fail you, and understanding both is the difference between good science and dangerous science.
Think of it like a smoke detector. It has two possible errors: it beeps when there is no fire (a false alarm waking you at 3 AM for burnt toast), or it stays silent when there is a real fire (a catastrophic miss). Both errors exist. You cannot eliminate one without worsening the other — unless you buy a better detector (collect more data). Statistics calls these the Type I Error and the Type II Error.
Type I and Type II errors are locked in a trade-off. For a fixed sample size and a fixed effect, lowering one raises the other. The main escape is to collect more data, which makes it possible to shrink both at once. Every statistical design is, at its heart, a negotiation between these two risks.
Definitions — Crystal Clear
Type I Error
- Also called: False Positive
- Rejecting H₀ when H₀ is actually TRUE
- Seeing an effect that does not exist
- Probability = α (significance level)
- You control this directly
- Lower α → fewer Type I errors
- Analogy: convicting the innocent
Type II Error
- Also called: False Negative
- Failing to reject H₀ when H₀ is FALSE
- Missing an effect that truly exists
- Probability = β
- Controlled indirectly via n and α
- Lower β → fewer Type II errors
- Analogy: acquitting the guilty
Statistical Power
- The probability of detecting a real effect
- Power = 1 − β
- Target: 80% power (β = 0.20)
- Increases with larger sample size
- Increases with larger effect size
- Increases with higher α
- The flip-side of Type II error
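The three levers on power listed above can be checked numerically. The following is a minimal sketch, assuming a right-tailed one-sample z-test with known σ; the function name `power_one_sided_z` and all the numbers (effect of 2, σ of 10, n of 50) are hypothetical values chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def power_one_sided_z(effect, sigma, n, alpha):
    """Power of a right-tailed one-sample z-test when the true shift is `effect`."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)     # rejection cutoff in z units
    shift = effect * sqrt(n) / sigma           # distance between H0 and H1 in z units
    return 1 - STD_NORMAL.cdf(z_crit - shift)  # P(reject H0 | H1 true) = 1 - beta


# Hypothetical baseline, then one lever changed at a time.
baseline = power_one_sided_z(effect=2, sigma=10, n=50, alpha=0.05)
more_data = power_one_sided_z(effect=2, sigma=10, n=100, alpha=0.05)
bigger_effect = power_one_sided_z(effect=4, sigma=10, n=50, alpha=0.05)
looser_alpha = power_one_sided_z(effect=2, sigma=10, n=50, alpha=0.10)

for label, value in [
    ("baseline", baseline),
    ("larger n", more_data),
    ("larger effect", bigger_effect),
    ("higher alpha", looser_alpha),
]:
    print(f"{label:<14} power = {value:.3f}   beta = {1 - value:.3f}")
```

Each changed lever should report higher power than the baseline. With `effect=0` the function returns α itself, since every rejection is then a false positive.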
The Four-Outcome Matrix
When you run a hypothesis test, reality is either one way or the other — and your decision is either correct or one of two errors. This two-by-two table is the most important diagram in all of inferential statistics. Memorise it.
| Your Decision | H₀ is Actually TRUE (no real effect exists) | H₀ is Actually FALSE (real effect exists) |
|---|---|---|
| Reject H₀ ("Effect found") | ❌ Type I Error: False Positive, probability = α | ✅ Correct: True Positive, probability = 1 − β (Power) |
| Fail to Reject H₀ ("No effect found") | ✅ Correct: True Negative, probability = 1 − α | ❌ Type II Error: False Negative, probability = β |
Type I: The boy who cried wolf when there was no wolf, a false alarm. The villagers rejected "no wolf" (H₀) when it was true.
Type II: The villagers who ignored the boy when a wolf finally came, a missed detection. They failed to reject "no wolf" when there really was one.
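The four cells of the matrix can also be estimated by simulation. This is a rough sketch under stated assumptions: a right-tailed z-test of H₀: mean = 0 with known σ = 1, n = 30, and a hypothetical true mean of 0.5 when H₀ is false. The helper name `rejection_rate` is made up for this example.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def rejection_rate(true_mean, n=30, sigma=1.0, alpha=0.05, trials=5000, seed=0):
    """Fraction of simulated right-tailed z-tests of H0: mean = 0 that reject H0."""
    rng = random.Random(seed)                    # fixed seed so runs are repeatable
    z_crit = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = fmean(sample) * sqrt(n) / sigma      # test statistic under H0: mean = 0
        if z > z_crit:
            rejections += 1
    return rejections / trials


type_1_rate = rejection_rate(true_mean=0.0)  # H0 true: every rejection is a Type I error
power = rejection_rate(true_mean=0.5)        # H0 false: every rejection is a true positive

print(f"H0 true : reject (Type I)      = {type_1_rate:.3f}, keep (correct)  = {1 - type_1_rate:.3f}")
print(f"H0 false: reject (correct)     = {power:.3f}, keep (Type II)  = {1 - power:.3f}")
```

The first row should land close to α and 1 − α, and the second close to the power and β implied by the assumed effect.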
Visualising Both Errors on the Bell Curve 📊
Both errors arise from the overlap between two distributions: the null distribution (what data looks like if H₀ is true) and the alternative distribution (what data looks like if H₁ is true). The critical value — set by α — draws a vertical line. Everything to its right gets labelled "significant." But because the two distributions overlap, both types of errors are inevitable.
The further apart H₀ and H₁ are (larger effect size), the less the distributions overlap, and the easier it is to separate them. For a right-tailed test, moving the critical value left reduces β (more power) but increases α; moving it right reduces α but increases β. The only way to reduce both is to reduce the spread of both distributions, which in practice usually means increasing the sample size n.
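To make the sliding critical value concrete, here is a small sketch that computes α and β directly from the two sampling distributions. It assumes a right-tailed test on a sample mean with known σ; the function `error_rates` and the values (effect 1.0, σ 2.0) are illustrative assumptions rather than anything taken from a specific study.

```python
from math import sqrt
from statistics import NormalDist


def error_rates(cutoff, effect, sigma, n):
    """Return (alpha, beta) when H0 is rejected if the sample mean exceeds `cutoff`."""
    se = sigma / sqrt(n)                          # spread of the sample mean
    alpha = 1 - NormalDist(0.0, se).cdf(cutoff)   # area of the null curve right of the cutoff
    beta = NormalDist(effect, se).cdf(cutoff)     # area of the alternative curve left of it
    return alpha, beta


EFFECT, SIGMA = 1.0, 2.0  # hypothetical true shift and population sd

print("Moving the cutoff at n = 25:")
for cutoff in (0.4, 0.6, 0.8):
    alpha, beta = error_rates(cutoff, EFFECT, SIGMA, n=25)
    print(f"  cutoff = {cutoff:.1f}  alpha = {alpha:.3f}  beta = {beta:.3f}")

print("Growing n with the cutoff held midway between the two means:")
for n in (25, 100):
    alpha, beta = error_rates(EFFECT / 2, EFFECT, SIGMA, n)
    print(f"  n = {n:<3}  alpha = {alpha:.3f}  beta = {beta:.3f}")
```

In the first loop the two error rates move in opposite directions. In the second, narrowing both curves with a larger n lowers both.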
Story 1 — The Courtroom 🏛️
The criminal justice system is built around the principle of "innocent until proven guilty." The null hypothesis is always innocence: H₀: defendant is innocent. The prosecution must provide overwhelming evidence to overcome this presumption.
Story 2 — The Cancer Screening Test 🏥
A hospital introduces a new blood test to screen for early-stage pancreatic cancer in patients over 50. The null hypothesis is always the medical default: H₀: patient does not have cancer. The test looks for a biomarker above a certain threshold. If found, the patient is sent for invasive biopsy and further treatment.
Sensitivity = Power = 1 − β = P(positive test | disease present): how well the test catches real disease.
Specificity = 1 − α = P(negative test | no disease): how well the test avoids false alarms. A perfect test has both at 100%; in practice, for a given test, moving the decision threshold to improve one degrades the other.
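As a quick illustration of how these quantities fall out of screening results, the sketch below uses made-up counts for 1,000 patients. None of the numbers come from a real test; they only show the arithmetic.

```python
# Hypothetical screening results for 1,000 patients (illustrative counts only).
true_positives = 45    # disease present, test positive
false_negatives = 5    # disease present, test negative (Type II errors)
false_positives = 95   # no disease, test positive (Type I errors)
true_negatives = 855   # no disease, test negative

sensitivity = true_positives / (true_positives + false_negatives)  # P(positive | disease)
specificity = true_negatives / (true_negatives + false_positives)  # P(negative | no disease)
beta = 1 - sensitivity    # Type II error rate
alpha = 1 - specificity   # Type I error rate

print(f"sensitivity (power) = {sensitivity:.2f}, beta  = {beta:.2f}")
print(f"specificity         = {specificity:.2f}, alpha = {alpha:.2f}")
```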
Story 3 — The Email Spam Filter 📧
Every email inbox has a spam filter making thousands of hypothesis tests per day — one for each incoming message. H₀: this email is legitimate. H₁: this email is spam. The filter must decide: quarantine or deliver?
Story 4 — The Factory Quality Inspector 🏭
A pharmaceutical factory produces insulin vials. Each batch is tested to see if the concentration is within specification. If a batch fails, it is destroyed — a costly decision. H₀: batch concentration is within specification. A Type I error means destroying a good batch. A Type II error means shipping dangerous insulin to diabetic patients.
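At α = 0.05: critical t = ±2.093. |−2.77| > 2.093 → Reject H₀. Batch would be destroyed. A stricter α raises the critical value, so the same test statistic can lead to the opposite decision: different α, different fate for the batch.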
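The dependence of the decision on α can be sketched in code. Two assumptions are made here and should be read as such: the degrees of freedom are taken to be 19 (the value implied by a two-tailed critical t of 2.093 at α = 0.05), and α = 0.01 is used as an example of a stricter level. The helpers `t_pdf`, `t_cdf`, and `t_critical` are hypothetical stdlib-only stand-ins; a statistics library would normally supply the quantile.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """CDF for x >= 0: Simpson's rule on [0, x] plus the 0.5 of mass below zero."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(alpha, df):
    """Two-tailed critical value, found by bisection on the CDF."""
    lo, hi = 0.0, 50.0
    target = 1 - alpha / 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


T_STAT = -2.77  # test statistic quoted in the text
DF = 19         # assumption: implied by the critical value 2.093 at alpha = 0.05

for alpha in (0.05, 0.01):
    crit = t_critical(alpha, DF)
    decision = "reject H0 (destroy batch)" if abs(T_STAT) > crit else "fail to reject H0 (ship batch)"
    print(f"alpha = {alpha:.2f}: critical t = ±{crit:.3f} -> {decision}")
```

Under these assumptions the batch is rejected at α = 0.05 but not at α = 0.01, because the stricter level pushes the critical value above 2.77.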
Factors That Control Each Error
| Factor | Effect on Type I Error (α) | Effect on Type II Error (β) | Effect on Power (1−β) |
|---|---|---|---|
| Increase α (e.g. 0.01 → 0.05) | Increases ↑ | Decreases ↓ | Increases ↑ |
| Decrease α (e.g. 0.05 → 0.01) | Decreases ↓ | Increases ↑ | Decreases ↓ |
| Increase sample size n | No change | Decreases ↓ | Increases ↑ |
| Larger effect size (δ) | No change | Decreases ↓ | Increases ↑ |
| Reduce population variance (σ²) | No change | Decreases ↓ | Increases ↑ |
| One-tailed vs two-tailed | Same total α | Decreases in predicted direction ↓ | Increases in predicted direction ↑ |
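Changing α only trades one error for the other. Increasing the sample size is different: at a fixed α it lowers β, and the power it buys can instead be spent on a stricter α, so it is the lever most directly under the researcher's control that can reduce both error rates. (A larger effect size or a smaller population variance helps too, as the table shows, but these are usually harder to change.) This is why power analysis, calculating the n needed to achieve 80% power at your chosen α, is the first step of any well-designed experiment.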
Real-World Error Priorities by Domain
| Domain | Scenario | Worse Error | Design Priority | Typical α |
|---|---|---|---|---|
| ⚖️ Criminal law | Convict / acquit defendant | Type I (convict innocent) | Low α — high evidence bar | Very low (beyond doubt) |
| 🏥 Cancer screening | Detect / miss disease | Type II (miss cancer) | High sensitivity, low β | 0.05 – 0.10 |
| 💊 Drug approval | Approve / reject drug | Type I (approve useless drug) | Low α — strict evidence | 0.01 – 0.05 |
| ☢️ Safety monitoring | Sound alarm / stay silent | Type II (miss hazard) | High sensitivity, high α | 0.05 – 0.10 |
| 📧 Spam filtering | Block / allow email | Type I (block real email) | Protect legitimate mail | Low (conservative) |
| 📈 A/B testing | Launch / reject new feature | Balanced — both matter | 80% power at α = 0.05 | 0.05 |
| 🔬 Particle physics | Claim new particle | Type I (false discovery) | Extremely low α (5σ) | ≈ 0.0000003 |
| 🧬 Genomics (GWAS) | Identify disease gene | Type I (false association) | Bonferroni correction | 5 × 10⁻⁸ |
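The two most extreme thresholds in the table can be reproduced with a few lines of arithmetic. This sketch assumes the 5σ criterion is read as a one-sided tail of the standard normal, and that the genomics threshold comes from a Bonferroni correction of α = 0.05 across roughly one million independent tests; both are conventional readings rather than statements from the table itself.

```python
from statistics import NormalDist

# One-sided tail probability beyond 5 standard deviations.
five_sigma_alpha = NormalDist().cdf(-5.0)

# Bonferroni: divide the overall alpha by the number of tests.
overall_alpha = 0.05
assumed_independent_tests = 1_000_000  # assumption: conventional GWAS figure
gwas_alpha = overall_alpha / assumed_independent_tests

print(f"5 sigma (one-sided) alpha  = {five_sigma_alpha:.2e}")
print(f"Bonferroni-corrected alpha = {gwas_alpha:.0e}")
```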
Worked Example — Power Calculation 🔢
A researcher wants to detect whether a new teaching method increases exam scores by at least 5 points (the minimum meaningful effect). Current scores have μ = 70 and σ = 15. How many students are needed to achieve 80% power at α = 0.05 (two-tailed)?
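One way to answer this is the standard normal-approximation formula n = ((z₁₋α/₂ + z₁₋β) · σ / δ)². The sketch below applies it under the assumption of a one-sample design, where a single class is compared against the known mean of 70 with σ treated as known. The function name `required_n` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_n(effect, sigma, alpha=0.05, power=0.80):
    """Sample size for a two-tailed one-sample z-test (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)            # about 0.84 for 80% power
    return ceil(((z_alpha + z_beta) * sigma / effect) ** 2)


n_students = required_n(effect=5, sigma=15)
print(f"Students needed: {n_students}")  # prints 71 under these assumptions
```

If the design instead compares two independent groups (new method versus old), the same approximation roughly doubles the requirement per group, and using a t-test rather than a z-test adds a little more.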
The Replication Crisis — Type I Errors at Scale 📉
Between 2011 and 2015, a team of psychologists (the Open Science Collaboration) attempted to replicate 100 published psychology studies, nearly all of which (97%) had reported statistically significant results at α = 0.05. The result was shocking: only 36% of the replications were themselves significant. Not every failed replication proves the original finding was wrong, but the gap suggests that many published results were Type I errors or greatly overstated effects.
How did this happen? Several mechanisms inflated the false positive rate far above 5%:
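One well-known mechanism of this kind is multiple testing: if a study checks several outcomes and reports whichever one is significant, the chance of at least one false positive grows quickly. The sketch below assumes k independent tests with every null hypothesis true; the function name `familywise_error_rate` is chosen for this illustration.

```python
def familywise_error_rate(alpha, k):
    """P(at least one false positive) across k independent tests of true nulls."""
    return 1 - (1 - alpha) ** k


for k in (1, 5, 20):
    rate = familywise_error_rate(0.05, k)
    print(f"{k:>2} tests at alpha = 0.05 -> chance of a false positive = {rate:.3f}")
```

With 20 such tests the chance of at least one spurious "discovery" is roughly 64%, far above the nominal 5%.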
Scientists now combat the replication crisis by pre-registering their hypotheses, sample sizes, α levels, and analysis plans in public repositories (OSF, ClinicalTrials.gov) before collecting a single data point. This locks in the test design, making it much harder to adjust analytic decisions after seeing the data. Pre-registered studies replicate at roughly twice the rate of non-pre-registered ones.
Quick Reference Summary
| Feature | Type I Error | Type II Error |
|---|---|---|
| Also called | False Positive, α error | False Negative, β error |
| Definition | Reject H₀ when H₀ is TRUE | Fail to reject H₀ when H₀ is FALSE |
| Probability | α (set by researcher) | β (depends on n, δ, σ, α) |
| You control it via | Setting significance level α | Sample size, effect size, α |
| Medical term | Low specificity (false alarm) | Low sensitivity (missed detection) |
| Legal analogy | Convicting the innocent | Acquitting the guilty |
| Detector analogy | Smoke alarm — no fire | Smoke alarm — silent during fire |
| Reduced by | Lowering α | Increasing n or raising α |
| Trade-off | At fixed n, reducing one increases the other; a larger n lets you reduce both | |
Great statistical thinking is not about getting p < 0.05 — it's about honestly quantifying and minimising both kinds of error for the decision at hand. Always ask: "In my context, which error is more costly?" Design your study to control that error first, power it properly to control the other, and report both transparently. That is the foundation of trustworthy data science.