
Significance Level (α)

A deep, story-driven tutorial on the significance level α — covering its origin with Ronald Fisher, what it truly means (precisely), its relationship to Type I and II errors, three real-world stories (nuclear safety, FDA drug trials, particle physics), the critical α vs p-value distinction, worked examples comparing three α values on the same data, a field guide for choosing α, its connection to power and sample size, and 8 golden rules every analyst must follow.

Section 01

The Judge's Threshold ⚖️

Picture a courtroom. A defendant stands accused, and the judge must decide: how much evidence is "enough" to convict? Set the bar too low, and innocent people go to prison. Set it too high, and guilty people walk free. Before the trial even begins, society agrees on a standard — a threshold of proof called "beyond reasonable doubt."

In statistics, that threshold has a name: the Significance Level, written as the Greek letter α (alpha). It is the single number you choose — always before collecting data — that defines how much evidence you need before you will reject the null hypothesis.

Most people encounter α = 0.05 and never question it. But where did 0.05 come from? What does it really mean? What happens when you change it? And how do you choose the right α for your own study? This tutorial answers all of it — with stories, diagrams, and worked examples.

💡
The One-Sentence Definition

The significance level α is the maximum probability of making a Type I error (rejecting a true H₀) that you are willing to tolerate. It is your "false alarm budget" — set before the experiment, never adjusted after seeing the data.


Section 02

Where Did α = 0.05 Come From? 📜

The Story of Ronald Fisher

The year is 1925. Ronald Aylmer Fisher, a British statistician working at an agricultural research station, publishes a landmark book: Statistical Methods for Research Workers. In it, he writes almost casually that a result is "significant" if it would occur by chance fewer than 1 in 20 times — that is, with probability less than 0.05.

Fisher never intended this to be a universal law. He chose 1-in-20 as a convenient, round threshold for agricultural experiments — whether a new fertiliser genuinely improved crop yields versus random soil variation. Over the following century, the 0.05 threshold spread from agronomy into medicine, psychology, economics, and engineering, eventually becoming the global default.

⚠️
Fisher's Own Warning — Ignored for 100 Years

Fisher himself warned that 0.05 was not meant to be a rigid rule for all situations. He argued that a scientific fact should be confirmed by independent repetition, not judged by a single p-value crossing an arbitrary line. The American Statistical Association issued a formal statement in 2016 echoing this, yet α = 0.05 remains the default in most fields.


Section 03

What α Actually Means — Precisely

This is where most textbooks are vague. Let's be exact. Setting α = 0.05 means:

📐 The Four Precise Meanings of α = 0.05
1. It is a probability about your decision procedure, not about any specific result. If H₀ is truly true and you ran this same experiment 100 times, you would incorrectly reject H₀ in approximately 5 of those 100 experiments, purely by random chance.
2. It defines the rejection region. α is the area under the null distribution that counts as "extreme enough to reject H₀." For a two-tailed Z-test, this is the outer 2.5% of each tail (together = 5%).
3. It is NOT the probability that H₀ is true or false. α says nothing about whether your particular H₀ is correct. It only governs the long-run error rate of your decision rule across many repeated experiments.
4. It is chosen by you, not calculated from data. α is a policy decision made before the study begins. It reflects how serious a Type I error would be in your specific context, not a property of the data itself.
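The long-run reading in meaning 1 can be checked with a small simulation. This is a minimal sketch, assuming a two-tailed Z-test on samples drawn from a standard normal population so that H₀ (μ = 0) is true by construction; the function name, sample size, trial count, and seed are illustrative choices, not part of the tutorial.

```python
import random
from statistics import NormalDist

def false_alarm_rate(alpha=0.05, n=30, trials=20_000, seed=42):
    """Fraction of simulated experiments that reject a TRUE H0 (two-tailed Z-test)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    rejections = 0
    for _ in range(trials):
        # H0 is true by construction: data come from N(mu=0, sigma=1)
        sample_mean = sum(rng.gauss(0, 1) for _ in range(n)) / n
        z = sample_mean / (1 / n ** 0.5)  # (x_bar - mu0) / (sigma / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

rate = false_alarm_rate()
print(f"Observed false-alarm rate: {rate:.3f}")
```

The printed rate should land close to 0.05, and it tracks whatever α you pass in. No single experiment in the loop "knows" whether it was one of the false alarms, which is the point of meaning 3.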

Section 04

Visualising α on the Bell Curve 📊

When you set α, you are literally drawing a line on the null distribution — everything beyond that line is the "rejection region." Here is how the same test looks with three different α values. Notice how the rejection region grows as α increases, making it easier to reject H₀ — but also increasing the risk of a false alarm.

[Figure: the null distribution (bell curve) with the two-tailed rejection region shaded for three α values]
α = 0.01 (1%), Very Strict: critical Z = ±2.576, 0.5% in each tail. Tiny rejection region.
α = 0.05 (5%), Standard: critical Z = ±1.96, 2.5% in each tail. Default in most fields.
α = 0.10 (10%), Lenient: critical Z = ±1.645, 5% in each tail. Easier to reject H₀.
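The three critical values can be reproduced from the standard normal distribution. A minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `two_tailed_critical_z` is an illustrative choice.

```python
from statistics import NormalDist

def two_tailed_critical_z(alpha):
    """Critical |Z| that leaves alpha/2 in each tail of the standard normal."""
    return NormalDist().inv_cdf(1 - alpha / 2)

for alpha in (0.01, 0.05, 0.10):
    z_crit = two_tailed_critical_z(alpha)
    print(f"alpha = {alpha:.2f}: reject H0 when |Z| > {z_crit:.3f}")
```

Splitting α across both tails is what makes the two-tailed cutoff at α = 0.10 (±1.645) coincide with the one-tailed cutoff at α = 0.05.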

Section 05

α and the Two Types of Error 🎯

Choosing α is fundamentally a trade-off between two kinds of mistakes. Understanding this trade-off is one of the most important skills in applied statistics — because the "right" α depends entirely on which mistake is more costly in your specific situation.

Type I Error
α
  • False Positive — rejecting a TRUE H₀
  • Probability = α (you control this)
  • Lower α → fewer false alarms
  • But: harder to detect real effects
  • Example: convicting an innocent person
  • Example: approving a useless drug
Type II Error
β
  • False Negative — failing to reject a FALSE H₀
  • Probability = β (related to Power)
  • Lower α → higher β (worse)
  • Power = 1 − β (probability of detecting a real effect)
  • Example: acquitting a guilty person
  • Example: rejecting a life-saving drug
The Trade-Off
α ↔ β
  • Lowering α increases β
  • Raising α decreases β
  • Only bigger sample size reduces both
  • Choose α based on which error costs more
  • Medical: prefer low α (safety first)
  • Exploration: α = 0.10 may be fine
Decision  |  H₀ is Actually TRUE  |  H₀ is Actually FALSE
You Reject H₀  |  ❌ Type I Error: Probability = α (false alarm)  |  ✅ Correct Decision: Probability = Power = 1 − β
You Fail to Reject H₀  |  ✅ Correct Decision: Probability = 1 − α  |  ❌ Type II Error: Probability = β (missed effect)
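To see the α ↔ β trade-off in numbers, the sketch below computes β for a right-tailed Z-test at three α levels while holding everything else fixed. It assumes a hypothetical true effect of two standard errors above the null value (`SHIFT = 2.0`); that figure and the function name are illustrative, not taken from the tutorial.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def beta_right_tailed(alpha, shift_in_se):
    """Type II error rate of a right-tailed Z-test.

    shift_in_se is the true effect measured in standard errors,
    i.e. (true mean - null mean) / (sigma / sqrt(n)).
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    # Under H1 the Z statistic is centred on shift_in_se, so the effect is
    # missed whenever Z lands below the critical value.
    return STD_NORMAL.cdf(z_crit - shift_in_se)

SHIFT = 2.0  # hypothetical true effect: two standard errors above the null
for alpha in (0.01, 0.05, 0.10):
    beta = beta_right_tailed(alpha, SHIFT)
    print(f"alpha = {alpha:.2f}  beta = {beta:.3f}  power = {1 - beta:.3f}")
```

With the sample size and effect held constant, each step down in α pushes β up, which is the pattern the cards above describe.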

Section 06

Story 1 — The Nuclear Safety Inspector ☢️

An engineer monitors radiation levels at a nuclear plant. The null hypothesis is: radiation is within safe limits (H₀: μ ≤ 50 mSv/year). She is testing whether levels have risen dangerously.

What should α be? Think carefully about the consequences:

  • Type I Error (α): She raises a false alarm — the plant shuts down unnecessarily, costing millions. Costly, but reversible.
  • Type II Error (β): She misses a real radiation spike — workers are exposed to dangerous levels. Catastrophic, irreversible.

The Type II error is far more dangerous. At a fixed sample size, the direct way to lower β is to raise α. In safety-critical monitoring like this, engineers often use α = 0.10: they would rather trigger ten false alarms than miss one genuine hazard.

🧮 Nuclear Safety — Z-Test with α = 0.10
H₀ / H₁
H₀: μ ≤ 50 mSv/year  |  H₁: μ > 50 (Right-tailed)  |  α = 0.10
Data
n = 36 readings, x̄ = 53.2 mSv, σ = 12 mSv (known from sensors)
Step 1
SE = 12 / √36 = 12 / 6 = 2.0
Step 2
Z = (53.2 − 50) / 2.0 = 3.2 / 2.0 = +1.60
Step 3
Critical Z (right-tailed, α = 0.10) = +1.282
Decision
Z = 1.60 > 1.282 → Reject H₀. Plant goes on alert. ⚠️
Note: At α = 0.05, critical value = 1.645. Z = 1.60 < 1.645 → would FAIL to reject. The choice of α changed the outcome.
⚠️
The α Choice Changed the Decision

With Z = 1.60: at α = 0.10, the plant alerts. At α = 0.05, no action is taken. At α = 0.01, definitely no action. Same data, same test — three different decisions based solely on the threshold chosen beforehand. This is why α must be set before analysis, not chosen to make results "significant."
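The nuclear-safety calculation can be replayed at all three thresholds with a few lines of Python. A minimal sketch using the numbers from the worked example; the function name `right_tailed_z_test` is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist

def right_tailed_z_test(x_bar, mu0, sigma, n, alpha):
    """Return (z, critical_z, reject) for H0: mu <= mu0 vs H1: mu > mu0."""
    se = sigma / sqrt(n)                       # standard error of the mean
    z = (x_bar - mu0) / se                     # test statistic
    z_crit = NormalDist().inv_cdf(1 - alpha)   # right-tailed critical value
    return z, z_crit, z > z_crit

for alpha in (0.10, 0.05, 0.01):
    z, z_crit, reject = right_tailed_z_test(x_bar=53.2, mu0=50, sigma=12, n=36, alpha=alpha)
    decision = "reject H0 (alert)" if reject else "fail to reject H0"
    print(f"alpha = {alpha:.2f}: Z = {z:.2f}, critical Z = {z_crit:.3f} -> {decision}")
```

The test statistic never changes inside the loop; only the cutoff moves, and the decision moves with it.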


Section 07

Story 2 — The Clinical Drug Trial 💊

A pharmaceutical company develops a new cancer drug. The FDA requires clinical trials before approval. The null hypothesis: the drug has no effect (or is harmful). The alternative: the drug improves survival rates.

Here the consequences are reversed from the nuclear case:

  • Type I Error (α): Approving a drug that doesn't work — patients take an ineffective treatment, may forgo better alternatives, company profits falsely. Serious harm.
  • Type II Error (β): Rejecting a drug that genuinely saves lives — delayed access. Terrible, but the drug can be retested.

Confirmatory Phase III trials conventionally use a two-sided α = 0.05, and stricter thresholds such as α = 0.01 may be applied when the evidence must be especially robust. The priority is preventing false approvals.

🧮 Drug Trial — T-Test with α = 0.01
Setup
Trial of new cancer drug vs placebo. Survival improvement (months) measured. n = 25 patients, x̄ = 4.2 months, s = 6.5 months. α = 0.01. Right-tailed t-test.
H₀ / H₁
H₀: μ ≤ 0 months improvement  |  H₁: μ > 0 (drug extends survival)
Step 1
SE = s / √n = 6.5 / √25 = 6.5 / 5 = 1.30
Step 2
t = (4.2 − 0) / 1.30 = +3.23
Step 3
df = 24. Critical t (right-tailed, df=24, α=0.01) = +2.492
Decision
t = 3.23 > 2.492 → Reject H₀ even at α = 0.01. Strong evidence. FDA proceeds to review. ✅
At α = 0.05
Critical t (df=24, α=0.05) = 1.711. Also rejected. At α = 0.001: critical t ≈ 3.467. Fail to reject. The strictest threshold would have blocked this drug.
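The drug-trial arithmetic can be verified the same way. This sketch computes the t statistic from the summary numbers in the example and compares it with the critical values quoted in the text; those critical values are copied from the worked example as table lookups, not computed, because the Python standard library has no t-distribution.

```python
from math import sqrt

# Right-tailed critical t values for df = 24, copied from the worked example
# above (table values, not computed here).
CRITICAL_T_DF24 = {0.05: 1.711, 0.01: 2.492, 0.001: 3.467}

def one_sample_t(x_bar, mu0, s, n):
    """t statistic for a one-sample t-test: (x_bar - mu0) / (s / sqrt(n))."""
    return (x_bar - mu0) / (s / sqrt(n))

t_stat = one_sample_t(x_bar=4.2, mu0=0, s=6.5, n=25)
for alpha, t_crit in CRITICAL_T_DF24.items():
    decision = "reject H0" if t_stat > t_crit else "fail to reject H0"
    print(f"alpha = {alpha}: t = {t_stat:.2f} vs critical t = {t_crit} -> {decision}")
```

If SciPy is available, `scipy.stats.t.ppf(1 - alpha, df=24)` should give the same critical values without a lookup table.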

Section 08

Story 3 — The Particle Physics Standard 🔬

In 2012, CERN announced the discovery of the Higgs boson — the so-called "God particle." Did they use α = 0.05? Not even close.

Particle physics uses the "5-sigma" standard, which corresponds to a one-sided α ≈ 0.0000003 (about 3 in 10 million). Why? Because physicists run billions of collision experiments. At α = 0.05, about 5% of the tests in which nothing real is present would still look like discoveries purely by chance; with billions of experiments, that's hundreds of millions of false positives. The 5-sigma rule keeps the number of false discoveries manageable even at astronomical scales.
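The link between "5 sigma" and α is just a tail area of the standard normal. A minimal sketch, assuming the one-sided convention (area above +5σ only); the function name is illustrative.

```python
from math import erfc, sqrt

def upper_tail_probability(n_sigma):
    """P(Z > n_sigma) for a standard normal, via the complementary error function."""
    return 0.5 * erfc(n_sigma / sqrt(2))

p_5sigma = upper_tail_probability(5)
print(f"One-sided 5-sigma tail: {p_5sigma:.2e} (about 1 in {1 / p_5sigma:,.0f})")
```

Using `erfc` rather than `1 - cdf(5)` avoids subtracting two nearly equal numbers, which matters when the tail is this small.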

Standard Research
α = 0.05
1 in 20 chance of false positive. Used in most social science, business, and medical research.
Clinical / Medical
α = 0.01
1 in 100. Used for drug approvals, medical device safety, where false positives have serious consequences.
Highly Conservative
α = 0.001
1 in 1,000. Genomics (testing thousands of genes), quality control in high-stakes manufacturing.
Particle Physics
α ≈ 0.0000003
"5-sigma" standard. 1 in 3.5 million. Used when billions of tests are run and any false discovery would be catastrophic.

Section 09

α vs p-value — The Most Common Confusion

Students frequently confuse α and the p-value. They are fundamentally different things, and mixing them up leads to incorrect conclusions.

Feature  |  Significance Level (α)  |  p-value
What it is  |  A threshold you choose in advance  |  A probability calculated from your data
When it's set  |  Before data collection  |  After data collection and analysis
Who determines it  |  You (the researcher), based on context  |  Calculated from the test statistic
What it means  |  Your tolerance for a false positive  |  Probability of data this extreme if H₀ were true
How it's used  |  Sets the decision boundary  |  Compared against α to make the decision
Fixed or variable?  |  Fixed; you do NOT change it after seeing results  |  Variable; depends entirely on your sample
Decision rule  |  If p < α → Reject H₀  |  If p < α → Reject H₀
📐
The Perfect Analogy

α is like a limit posted before you drive; the p-value is the reading you take during the drive. The one twist is direction: in hypothesis testing, a reading below the posted limit (p < α) is what triggers action, so you reject H₀, while a reading at or above it (p ≥ α) means you fail to reject H₀. Either way, you don't change the posted limit after seeing your reading.


Section 10

Worked Comparison — Same Data, Three Values of α

A marketing team runs an A/B test on a new website design. They measure conversion rates. The Z-test gives a test statistic of Z = +1.75 (right-tailed). Watch how the conclusion changes with α.

🧮 A/B Test — Z = +1.75, Right-Tailed
Context
New website design tested on 500 users. H₁: new design increases conversions. Z = +1.75. p-value (right-tailed) = 0.040
At α = 0.10
Critical Z = +1.282.  1.75 > 1.282 → Reject H₀ ✅  Launch new design. (Lenient — exploratory marketing test)
At α = 0.05
Critical Z = +1.645.  1.75 > 1.645 → Reject H₀ ✅  Launch new design. p = 0.040 < 0.05. (Standard research threshold)
At α = 0.01
Critical Z = +2.326.  1.75 < 2.326 → Fail to Reject H₀ ❌  Keep old design. (Strict threshold — not enough evidence)
Lesson
The data is identical in all three cases. Only the pre-set threshold differs. This is why α must be declared in advance, and why changing it after seeing results is a form of p-hacking that invalidates the test.
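The same comparison can be made on the p-value scale instead of the Z scale. A minimal sketch that recomputes the right-tailed p-value for Z = 1.75 and applies the rule "reject H₀ if p < α" at each threshold; the function name is illustrative.

```python
from statistics import NormalDist

def right_tailed_p_value(z):
    """P(Z >= z) under H0 for a right-tailed Z-test."""
    return 1 - NormalDist().cdf(z)

p_value = right_tailed_p_value(1.75)
print(f"p-value = {p_value:.3f}")
for alpha in (0.10, 0.05, 0.01):
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"alpha = {alpha:.2f}: {decision}")
```

Comparing Z with the critical value and comparing p with α are two views of the same rule, so they always agree.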

Section 11

How to Choose Your α — A Field Guide

Field / Situation  |  Typical α  |  Reasoning  |  Priority
🔬 Particle physics, astronomy  |  0.0000003 (5σ)  |  Billions of tests; any false discovery is catastrophic  |  Minimise α
💊 FDA drug approval  |  0.01 – 0.05  |  False approvals harm patients; very high evidence bar  |  Minimise α
🧬 Genomics (GWAS)  |  5 × 10⁻⁸  |  Millions of SNPs tested simultaneously; Bonferroni correction  |  Minimise α
🏭 Industrial quality control  |  0.01 – 0.05  |  Stopping production is expensive; balance false alarms  |  Balance
📊 Social science research  |  0.05  |  Convention; sample sizes moderate; effects complex  |  Balance
☢️ Safety / environmental monitoring  |  0.05 – 0.10  |  Missing a real hazard (Type II) is worse than false alarm  |  Raise α
📈 A/B testing (marketing)  |  0.05 – 0.10  |  Low cost of Type I; quick iteration; business decisions  |  Raise α
🔍 Exploratory / pilot studies  |  0.10 – 0.20  |  Generating hypotheses, not confirming them; high β risk  |  Raise α
💡
The Bonferroni Correction — When You Run Many Tests

If you run 20 hypothesis tests each at α = 0.05 and every null hypothesis is true, you expect 1 false positive on average just by chance. The Bonferroni correction addresses this by dividing α by the number of tests: α_adjusted = α / m. For 20 tests at α = 0.05, each individual test uses α_adjusted = 0.0025. This keeps the overall family-wise error rate at or below 5%.
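The effect of the correction on the family-wise error rate can be checked directly. This sketch assumes the 20 tests are independent and that every null hypothesis is true, which is the worst case for false positives; the function name is illustrative.

```python
def family_wise_error_rate(alpha_per_test, m):
    """P(at least one false positive) across m independent tests of true nulls."""
    return 1 - (1 - alpha_per_test) ** m

m, alpha = 20, 0.05
alpha_adjusted = alpha / m  # Bonferroni: 0.05 / 20 = 0.0025

print(f"Expected false positives, uncorrected: {m * alpha:.1f}")
print(f"FWER uncorrected:     {family_wise_error_rate(alpha, m):.3f}")
print(f"FWER with Bonferroni: {family_wise_error_rate(alpha_adjusted, m):.3f}")
```

Without correction, the chance of at least one false alarm across the family is well over half; with the Bonferroni-adjusted threshold it drops back under 5%. The Bonferroni bound also holds when tests are dependent, although the exact formula in the sketch does not.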


Section 12

α, Sample Size, and Statistical Power

There is a three-way relationship connecting α, sample size (n), and statistical power (1 − β). Understanding this triangle is essential for designing any experiment properly.

Increase α (e.g. 0.05 → 0.10)
Power ↑   β ↓
Easier to reject H₀. Better at detecting real effects. But more false alarms. Same sample size.
Decrease α (e.g. 0.05 → 0.01)
Power ↓   β ↑
Harder to reject H₀. Fewer false alarms. But more likely to miss real effects. Need larger n to compensate.
Increase Sample Size (n)
Power ↑   SE ↓
The only way to reduce BOTH α and β simultaneously. More data = narrower sampling distribution = easier to detect true effects precisely.
Effect Size (δ)
Larger δ → easier
A very large real effect is easy to detect even at strict α with small n. A tiny effect requires massive n or relaxed α.
The Power Equation

Most research aims for 80% power (β = 0.20) at α = 0.05. This means: a 20% chance of missing a real effect, and a 5% chance of a false positive. This 80/20 balance was suggested by statistician Jacob Cohen and remains the standard for sample size calculations in clinical and social science research.
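For a right-tailed Z-test with known σ, a standard way to connect α, power, effect size δ, and sample size is n = ((z₁₋α + z_power) · σ / δ)². The sketch below applies that formula; the inputs σ = 12 and δ = 3 are hypothetical values chosen for illustration, and the function name is an assumption of this example.

```python
from math import ceil
from statistics import NormalDist

def required_n(alpha, power, sigma, delta):
    """Sample size for a right-tailed Z-test to detect a true shift delta.

    Uses n = ((z_{1-alpha} + z_{power}) * sigma / delta) ** 2, rounded up.
    """
    z = NormalDist().inv_cdf
    return ceil(((z(1 - alpha) + z(power)) * sigma / delta) ** 2)

# Hypothetical inputs: sigma = 12 and a shift of 3 units worth detecting.
for alpha in (0.10, 0.05, 0.01):
    n = required_n(alpha=alpha, power=0.80, sigma=12, delta=3)
    print(f"alpha = {alpha:.2f}, power = 0.80 -> n >= {n}")
```

Tightening α while holding power at 80% raises the required n, and halving δ roughly quadruples it, which is why the effect size you plan to detect should be fixed alongside α before data collection.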


Section 13

The Golden Rules of α

🎯 8 Rules Every Analyst Must Follow
1. Set α before collecting data. Choosing α after seeing your p-value is equivalent to moving the goalposts: it destroys the validity of the entire test and is a form of p-hacking.
2. α is not a universal constant. α = 0.05 is a convention, not a law of nature. Choose α based on the relative costs of Type I versus Type II errors in your specific domain and decision context.
3. Report the actual p-value, not just "significant" or "not significant." A p-value of 0.048 and a p-value of 0.002 both "reject H₀ at α = 0.05", but they convey very different strengths of evidence.
4. Statistical significance ≠ practical significance. With a very large sample (say n = 100,000), a difference far too small to matter in practice can still be statistically significant at α = 0.05. Always pair p-values with effect sizes (Cohen's d, R², odds ratios).
5. Apply the Bonferroni correction (or FDR control) for multiple tests. Running 20 tests at α = 0.05 gives an expected 1 false positive when every null is true. Correct α accordingly, or use False Discovery Rate methods in genomics and large-scale studies.
6. Consider the cost asymmetry. Ask: "If I make a Type I error, what happens? If I make a Type II error, what happens?" The more dangerous error should drive your α choice, not journal convention.
7. Pre-register your hypotheses and α. Upload your H₀, H₁, α, and analysis plan to a public registry (OSF, ClinicalTrials.gov) before data collection. This protects your choice of α against accusations of manipulation.
8. α is not a substitute for replication. A single study rejecting H₀ at α = 0.05 is weak evidence. Science advances through independent replication, ideally with pre-registration, not through single studies with convenient p-values.
🧮
The Big Picture

α is one of the most consequential decisions a researcher makes, yet it's often set without thought. It determines the sensitivity of your test, the balance between false alarms and missed discoveries, and the credibility of your conclusions. Use it deliberately. Set it with intention. And never, ever change it after seeing your data.