Z-Test, T-Test & Chi-Square: Test Statistics Explained

Section 01

The Universal Translator 🌐

Imagine you and three friends each measure the temperature of a swimming pool using different thermometers — one in Celsius, one in Fahrenheit, one in Kelvin. The readings look wildly different, yet they all describe the same water. To compare them, you need a common language.

Test statistics do exactly that for hypothesis testing. No matter what you measured — average delivery time, defect rates, or voting preferences — a test statistic converts your raw sample result into a single standardised number that sits on a known distribution. That number tells you: "How surprising is my result, if H₀ were true?"

There are three test statistics you will encounter in almost every data science and statistics course: the Z-statistic, the T-statistic, and the Chi-square statistic (χ²). Each is designed for a specific type of data and situation. This tutorial tells the story of all three.

💡

What All Test Statistics Have in Common

Every test statistic follows the same logic: (Observed − Expected) / Spread. You measure how far your sample result is from what H₀ predicts, then scale it by the variability in your data. A large result means the data is surprising under H₀. A small result means the data is consistent with H₀.

Section 02

The Three Champions — At a Glance

Z-Statistic

Tests means or proportions
Requires known population σ
OR large sample (n ≥ 30)
Follows standard normal curve
Critical value: ±1.96 at α=0.05
Simplest and most classic test

T-Statistic

Tests means (σ unknown)
Works with small samples (n < 30)
Follows t-distribution with df
Heavier tails = more conservative
Critical value depends on df
Most common in real research

Chi-Square (χ²)

Tests categorical / count data
No means involved — uses frequencies
Always positive (squared values)
Skewed right distribution
Tests goodness-of-fit & independence
Depends on degrees of freedom

Section 03

Part I — The Z-Statistic 🏙️

The Story: The City Water Authority

The Delhi Water Authority claims that the average daily water supply per household is 200 litres. A consumer rights group suspects this is overstated. They survey 64 households across the city. The population standard deviation is known from historical records to be 40 litres.

Because the population standard deviation is known and the sample is large (n = 64), this is a perfect case for the Z-test. Think of it as the gold standard — when you have complete information about population variability, Z is the most precise tool available.

Z-Statistic Formula (Mean)

Z = (x̄ − μ₀) / (σ / √n)

x̄ = sample mean, μ₀ = hypothesised mean, σ = known population std dev, n = sample size

Z-Statistic Formula (Proportion)

Z = (p̂ − p₀) / √(p₀q₀/n)

p̂ = sample proportion, p₀ = hypothesised proportion, q₀ = 1 − p₀, n = sample size
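The proportion formula can be sketched in a few lines of Python. This is a minimal illustration, not a full testing routine: the function name `proportion_z` and the survey numbers (58 "yes" answers out of 100, tested against p₀ = 0.50) are hypothetical and do not come from the tutorial.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z(successes: int, n: int, p0: float) -> float:
    """Z = (p_hat - p0) / sqrt(p0 * q0 / n), with q0 = 1 - p0."""
    p_hat = successes / n
    q0 = 1 - p0
    return (p_hat - p0) / sqrt(p0 * q0 / n)


# Hypothetical data (illustrative only): 58 of 100 say "yes"; H0: p = 0.50.
z = proportion_z(58, 100, 0.50)

# Two-tailed p-value from the standard normal distribution.
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"Z = {z:.2f}, two-tailed p = {p_two_tailed:.3f}")  # Z = 1.60, two-tailed p = 0.110
```

Note that the standard error uses p₀ (the hypothesised value), not p̂, because the test statistic is computed under the assumption that H₀ is true.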

🧮 Z-Test Worked Example — Delhi Water Supply

Given

μ₀ = 200 L, σ = 40 L, n = 64 households, x̄ = 192 L, α = 0.05, Left-tailed test

H₀ / H₁

H₀: μ = 200 | H₁: μ < 200 (Authority is overstating supply)

Step 1

Standard Error = σ / √n = 40 / √64 = 40 / 8 = 5

Step 2

Z = (192 − 200) / 5 = −8 / 5 = −1.60

Step 3

Critical value (left-tailed, α = 0.05) = −1.645

Decision

Z = −1.60 is NOT less than −1.645. Fail to Reject H₀. At 5% significance, insufficient evidence to conclude the authority is overstating supply. The consumer group needs more data.
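The same steps can be checked with Python's standard library. This is a small sketch using the numbers from the worked example; the variable names are my own, and the left-tail p-value is an addition that the worked example does not compute.

```python
from math import sqrt
from statistics import NormalDist

# Values from the Delhi water supply example.
mu0, sigma, n, x_bar, alpha = 200, 40, 64, 192, 0.05

se = sigma / sqrt(n)                    # Step 1: standard error = 40 / 8 = 5
z = (x_bar - mu0) / se                  # Step 2: Z = -8 / 5 = -1.60
z_crit = NormalDist().inv_cdf(alpha)    # Step 3: left-tail critical value
p_value = NormalDist().cdf(z)           # area to the left of Z under H0

reject = z < z_crit
print(f"Z = {z:.2f}, critical = {z_crit:.3f}, p = {p_value:.4f}, reject H0: {reject}")
# Z = -1.60, critical = -1.645, p = 0.0548, reject H0: False
```

The p-value of about 0.055 sits just above α = 0.05, which matches the "close, but not enough" reading of the decision.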

📌

When to Use Z (the Checklist)

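Use Z when: (1) the population standard deviation σ is known, OR (2) the sample is large (n ≥ 30), so the Central Limit Theorem makes the sampling distribution of the mean approximately normal. For proportions, use Z when np₀ ≥ 5 and n(1 − p₀) ≥ 5. If the conditions for a mean fail (σ unknown and n small), switch to the T-test. If the conditions for a proportion fail, use an exact binomial test instead.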

Distribution shapes compared:

Z (standard normal): symmetric, mean = 0. Typical use: σ known, n ≥ 30.
t (Student's t): wider tails, df-dependent. Typical use: σ unknown, n < 30.
χ² (chi-square): right-skewed, always ≥ 0. Typical use: categorical data.

Section 04

Part II — The T-Statistic 🧪

The Story: William Gosset and the Guinness Brewery

The year is 1908. William Sealy Gosset is a chemist working at the Guinness Brewery in Dublin. His job: test whether different batches of barley produce consistent quality beer. The problem? He can only afford to test small samples of barley — and the population standard deviation is completely unknown.

The Z-test fails here. With small samples, estimating σ from the sample data introduces extra uncertainty, so the resulting distribution is wider and heavier-tailed than the standard normal. Gosset derived the t-distribution to account for this, publishing it under the pseudonym "Student" because Guinness did not allow employees to publish under their own names. That's why it's called Student's t-test to this day.

💡

What Are Degrees of Freedom?

Degrees of freedom (df) = n − 1. Think of it as the number of values "free to vary" once you've fixed the sample mean. With only 5 data points and a known mean, only 4 of those points are truly independent — the last one is determined. As df increases, the t-distribution approaches the Z (normal) distribution — which is why large samples can use Z.
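The "free to vary" idea can be made concrete with a tiny sketch. The five sample values below are hypothetical, chosen only for illustration: once the mean is fixed and four values are known, the fifth has no freedom left.

```python
from statistics import mean

# Hypothetical sample of 5 values (illustrative only).
sample = [4.0, 7.0, 5.0, 9.0, 10.0]
n = len(sample)
fixed_mean = mean(sample)               # 7.0

# Suppose only the mean and the first n - 1 values are known.
free_values = sample[:-1]

# The last value is forced by the constraint sum(values) = n * mean.
last_value = n * fixed_mean - sum(free_values)

print(last_value)                       # 10.0
print("df =", n - 1)                    # df = 4
```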

One-Sample T-Statistic

t = (x̄ − μ₀) / (s / √n)

s = sample standard deviation (estimated, not known). df = n − 1

Two-Sample T-Statistic

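t = (x̄₁ − x̄₂) / √(sₚ²/n₁ + sₚ²/n₂)

Compares means of two independent groups, assuming equal variances. sₚ² = pooled variance = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2). df = n₁ + n₂ − 2. If the variances differ, use Welch's version: √(s₁²/n₁ + s₂²/n₂) in the denominator, with df from the Welch-Satterthwaite approximation.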

Paired T-Statistic

t = d̄ / (sᴅ / √n)

d̄ = mean of differences, sᴅ = std dev of differences. df = n − 1

Key Difference from Z

s replaces σ

Sample std dev (s) is an estimate — this uncertainty widens the distribution, especially at low df.
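As a rough illustration of the two-sample formula, here is a sketch of the pooled (equal-variance) version. The two groups of measurements are hypothetical and the function name `pooled_t` is my own; the sketch assumes both groups share a common variance, which should be checked before relying on the result.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(group_1, group_2):
    """Two-sample t-statistic with a pooled variance estimate.

    Returns (t, df). Assumes independent groups with equal variances.
    """
    n1, n2 = len(group_1), len(group_2)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    s2_pooled = ((n1 - 1) * variance(group_1) + (n2 - 1) * variance(group_2)) / df
    se = sqrt(s2_pooled / n1 + s2_pooled / n2)
    return (mean(group_1) - mean(group_2)) / se, df


# Hypothetical measurements (illustrative only).
group_a = [12, 15, 14, 10, 13]
group_b = [9, 11, 10, 8, 12]

t, df = pooled_t(group_a, group_b)
print(f"t = {t:.2f}, df = {df}")
```

The resulting t is then compared against a critical value for the returned df, exactly as in the one-sample case.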

🧮 One-Sample T-Test — Sleep Drug Trial

Story

A researcher tests a new sleep drug on 10 patients. Before the drug, patients averaged 6.0 hours of sleep. After the drug, the sample mean is 6.8 hours with a sample standard deviation of 1.0 hour. Did the drug significantly increase sleep?

Given

μ₀ = 6.0, x̄ = 6.8, s = 1.0, n = 10, α = 0.05. Right-tailed test.

H₀ / H₁

H₀: μ ≤ 6.0 | H₁: μ > 6.0 (drug increases sleep hours)

Step 1

SE = s / √n = 1.0 / √10 = 1.0 / 3.162 = 0.316

Step 2

t = (6.8 − 6.0) / 0.316 = 0.8 / 0.316 = +2.53

Step 3

df = n − 1 = 10 − 1 = 9. Critical value (right-tailed, df=9, α=0.05) = +1.833

Decision

t = 2.53 > 1.833 → Reject H₀. The drug significantly increased sleep hours. ✅
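The calculation above can be reproduced from the summary statistics alone. This is a minimal sketch; the critical value is typed in from a t-table rather than computed, since Python's standard library has no t-distribution.

```python
from math import sqrt

# Summary statistics from the sleep drug example.
mu0, x_bar, s, n = 6.0, 6.8, 1.0, 10
t_crit = 1.833                  # right-tailed, df = 9, alpha = 0.05 (from a t-table)

se = s / sqrt(n)                # Step 1: standard error
t = (x_bar - mu0) / se          # Step 2: t-statistic
df = n - 1                      # Step 3: degrees of freedom

print(f"t = {t:.2f}, df = {df}, reject H0: {t > t_crit}")
# t = 2.53, df = 9, reject H0: True
```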

🧮 Paired T-Test — Blood Pressure Before vs After

Story

A doctor measures blood pressure (mmHg) for 6 patients before and after a 4-week diet programme. The same patients are measured twice — making this a paired design. We analyse the differences, not the raw values.

Data

Differences (Before − After): 8, 5, 12, 3, 9, 7
d̄ = (8+5+12+3+9+7)/6 = 44/6 = 7.33
sᴅ = 3.14 (computed from the 6 differences), n = 6

H₀ / H₁

H₀: μᴅ = 0 | H₁: μᴅ > 0 (diet reduces BP; a positive difference means a decrease)

Step 1

SE = sᴅ / √n = 3.14 / √6 = 3.14 / 2.449 = 1.282

Step 2

t = d̄ / SE = 7.33 / 1.282 = +5.72

Step 3

df = 6 − 1 = 5. Critical value (right-tailed, df=5, α=0.05) = +2.015

Decision

t = 5.72 ≫ 2.015 → Reject H₀. The diet programme significantly reduced blood pressure. 🎉
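A paired test is just a one-sample test on the differences, which the following sketch makes explicit. It recomputes the mean and standard deviation directly from the six differences rather than taking them as given; the critical value is typed in from a t-table.

```python
from math import sqrt
from statistics import mean, stdev

# Differences (Before - After) from the blood pressure example.
diffs = [8, 5, 12, 3, 9, 7]
t_crit = 2.015                  # right-tailed, df = 5, alpha = 0.05 (from a t-table)

n = len(diffs)
d_bar = mean(diffs)             # mean of the differences
s_d = stdev(diffs)              # sample std dev (n - 1 in the denominator)
se = s_d / sqrt(n)              # standard error of the mean difference
t = d_bar / se
df = n - 1

print(f"d_bar = {d_bar:.2f}, s_d = {s_d:.2f}, SE = {se:.3f}, t = {t:.2f}, df = {df}")
print("Reject H0:", t > t_crit)
```

Working from the raw differences avoids rounding drift, since each intermediate value keeps full precision until the final print.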

Section 05

Part III — The Chi-Square Statistic (χ²) 🎲

The Story: The Dice Factory Inspector

A toy company manufactures six-sided dice. A quality inspector suspects one batch is biased: some faces appear more often than they should. She rolls a die 120 times and records how often each face appears. If the die is fair, each face is expected to appear 20 times on average, with some random variation around that figure. The observed counts are not all 20. Is this due to random chance, or is the die actually unfair?

There are no means here. No standard deviation of heights or weights. Just counts in categories. This is exactly what the Chi-square test was built for.

Chi-Square Formula

χ² = Σ (O − E)² / E

O = Observed frequency, E = Expected frequency. Sum over all categories. df = k − 1 for goodness-of-fit.

Expected Frequency

E = n × p

n = total count, p = hypothesised proportion for that category. For a fair die: E = 120 × (1/6) = 20

Degrees of Freedom

df = k − 1

k = number of categories. For independence test in a table: df = (rows − 1)(cols − 1)

Key Insight

(O − E)² / E

Squaring removes negatives. Dividing by E weights deviations by expected size. Large χ² = big surprise.

🧮 Chi-Square Goodness-of-Fit — The Biased Die

Setup

H₀: Die is fair (each face equally likely, p = 1/6). H₁: Die is biased. n = 120 rolls, k = 6 faces, α = 0.05

Data

Face 1: O=25, E=20 | Face 2: O=17, E=20 | Face 3: O=22, E=20
Face 4: O=30, E=20 | Face 5: O=14, E=20 | Face 6: O=12, E=20

Step 1

χ² = (25−20)²/20 + (17−20)²/20 + (22−20)²/20 + (30−20)²/20 + (14−20)²/20 + (12−20)²/20

Step 2

= 25/20 + 9/20 + 4/20 + 100/20 + 36/20 + 64/20
= 1.25 + 0.45 + 0.20 + 5.00 + 1.80 + 3.20 = 11.90

Step 3

df = k − 1 = 6 − 1 = 5. Critical value (χ², df=5, α=0.05) = 11.07

Decision

χ² = 11.90 > 11.07 → Reject H₀. The die is significantly biased! The factory recalls the batch. 🎲
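The goodness-of-fit arithmetic is easy to verify in code. This sketch uses the observed counts from the example and a fair-die expectation; the critical value is typed in from a chi-square table.

```python
# Observed counts for faces 1 to 6 from the biased die example.
observed = [25, 17, 22, 30, 14, 12]
chi2_crit = 11.07               # df = 5, alpha = 0.05 (from a chi-square table)

n = sum(observed)               # 120 rolls
k = len(observed)               # 6 categories
expected = [n / k] * k          # fair die: 20 per face

# Sum of (O - E)^2 / E over all categories.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = k - 1

print(f"chi2 = {chi2:.2f}, df = {df}, reject H0: {chi2 > chi2_crit}")
# chi2 = 11.90, df = 5, reject H0: True
```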

💡

Chi-Square for Independence (Two Variables)

Chi-square is also used to test whether two categorical variables are related. For example: "Is gender independent of political party preference?" You build a contingency table (rows = gender, columns = party), compute expected frequencies for each cell using E = (row total × col total) / grand total, then apply the same χ² formula. df = (rows − 1)(cols − 1).

🧮 Chi-Square Independence Test — Tea Preference vs City

Story

A survey of 200 people in Delhi and Mumbai asks whether they prefer Chai or Coffee. Is beverage preference independent of city? The contingency table is shown below.

Table

City	Chai	Coffee	Total
Delhi	70	30	100
Mumbai	50	50	100
Total	120	80	200

Expected

E(Delhi, Chai) = 100×120/200 = 60 | E(Delhi, Coffee) = 100×80/200 = 40
E(Mumbai, Chai) = 100×120/200 = 60 | E(Mumbai, Coffee) = 100×80/200 = 40

χ² Calc

= (70−60)²/60 + (30−40)²/40 + (50−60)²/60 + (50−40)²/40
= 100/60 + 100/40 + 100/60 + 100/40
= 1.67 + 2.50 + 1.67 + 2.50 = 8.33

df = (2−1)(2−1) = 1. Critical value (χ², df=1, α=0.05) = 3.84

Decision

8.33 > 3.84 → Reject H₀. Beverage preference is NOT independent of city. Delhi leans strongly Chai; Mumbai is evenly split. ☕
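The independence test follows the same pattern once the expected counts are built from the row and column totals. The sketch below works for any table size; the critical value is typed in from a chi-square table, and no continuity correction is applied.

```python
# Rows: Delhi, Mumbai. Columns: Chai, Coffee.
table = [[70, 30],
         [50, 50]]
chi2_crit = 3.84                # df = 1, alpha = 0.05 (from a chi-square table)

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# E = (row total x column total) / grand total for every cell.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(table, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(table) - 1) * (len(table[0]) - 1)

print(expected)                 # [[60.0, 40.0], [60.0, 40.0]]
print(f"chi2 = {chi2:.2f}, df = {df}, reject H0: {chi2 > chi2_crit}")
# chi2 = 8.33, df = 1, reject H0: True
```

If you cross-check with a statistics library, be aware that some routines (for example SciPy's `chi2_contingency`) apply Yates' continuity correction to 2×2 tables by default, which gives a somewhat smaller statistic than the hand calculation.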

Section 06

Choosing the Right Test — The Decision Map

This is the most practical skill in inferential statistics. Picking the wrong test doesn't just produce the wrong answer — it undermines the entire analysis. Use this framework every time.

Data Type	What You're Testing	σ Known?	Sample Size	Use This Test
Continuous (numeric)	One sample mean vs known value	Yes	Any	Z-test
Continuous (numeric)	One sample mean vs known value	No	Any	One-sample t-test
Continuous (numeric)	Two independent group means	No	Any	Two-sample t-test
Continuous (numeric)	Same subjects measured twice	No	Any	Paired t-test
Binary (yes/no)	One sample proportion vs value	N/A	np₀ ≥ 5 and n(1 − p₀) ≥ 5	Z-test (proportion)
Categorical (counts)	Observed vs expected frequencies	N/A	E ≥ 5 per cell	χ² goodness-of-fit
Categorical (counts)	Two categorical variables related?	N/A	E ≥ 5 per cell	χ² independence test
Continuous (numeric)	Three or more group means	No	Any	ANOVA (F-test)
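The decision map can also be written as a small lookup function. This is only a sketch of the table above: the function name `choose_test` and the string labels for data types and goals are hypothetical, invented for this example, and the function does not check sample-size conditions or assumptions.

```python
def choose_test(data_type: str, goal: str, sigma_known: bool = False) -> str:
    """Return the test suggested by the decision table for a given situation."""
    if data_type == "continuous":
        if goal == "one_mean":
            # The only row where knowing sigma changes the answer.
            return "Z-test" if sigma_known else "One-sample t-test"
        by_goal = {
            "two_independent_means": "Two-sample t-test",
            "paired_means": "Paired t-test",
            "three_or_more_means": "ANOVA (F-test)",
        }
    elif data_type == "binary":
        by_goal = {"one_proportion": "Z-test (proportion)"}
    elif data_type == "categorical":
        by_goal = {
            "goodness_of_fit": "χ² goodness-of-fit",
            "independence": "χ² independence test",
        }
    else:
        raise ValueError(f"unknown data type: {data_type!r}")

    if goal not in by_goal:
        raise ValueError(f"unknown goal {goal!r} for data type {data_type!r}")
    return by_goal[goal]


print(choose_test("continuous", "one_mean", sigma_known=True))   # Z-test
print(choose_test("continuous", "paired_means"))                 # Paired t-test
print(choose_test("categorical", "independence"))                # χ² independence test
```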

Section 07

Critical Value Reference Tables

Z Critical Values

α	Two-Tailed (±Z)	Right-Tailed (+Z)	Left-Tailed (−Z)
0.10	±1.645	+1.282	−1.282
0.05	±1.960	+1.645	−1.645
0.01	±2.576	+2.326	−2.326
0.001	±3.291	+3.090	−3.090
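The Z critical values do not have to be memorised; they can be regenerated from the inverse CDF of the standard normal distribution. The sketch below uses `statistics.NormalDist` from Python's standard library; the dictionary name `z_table` is my own.

```python
from statistics import NormalDist

std_normal = NormalDist()       # mean 0, standard deviation 1

z_table = {}
for alpha in (0.10, 0.05, 0.01, 0.001):
    two_tailed = std_normal.inv_cdf(1 - alpha / 2)   # alpha split across both tails
    one_tailed = std_normal.inv_cdf(1 - alpha)       # all of alpha in one tail
    z_table[alpha] = (two_tailed, one_tailed)
    print(f"{alpha:<6} ±{two_tailed:.3f}  +{one_tailed:.3f}  -{one_tailed:.3f}")
```

The left-tailed value is simply the negative of the right-tailed one because the normal curve is symmetric about zero.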

T Critical Values

df	Right-Tailed (α=0.05)	Two-Tailed (α=0.05)	Two-Tailed (α=0.01)
1	6.314	12.706	63.657
5	2.015	2.571	4.032
9	1.833	2.262	3.250
15	1.753	2.131	2.947
20	1.725	2.086	2.845
30	1.697	2.042	2.750
∞ (→Z)	1.645	1.960	2.576

Chi-Square Critical Values (Right-Tailed)

df	α = 0.10	α = 0.05	α = 0.01
1	2.706	3.841	6.635
2	4.605	5.991	9.210
3	6.251	7.815	11.345
4	7.779	9.488	13.277
5	9.236	11.070	15.086
10	15.987	18.307	23.209

Section 08

Key Assumptions to Check

Every test statistic is only valid when its underlying assumptions hold. Violating these assumptions can make your p-values meaningless — even if the maths looks correct.

Test	Assumption 1	Assumption 2	Assumption 3
Z-test	σ known OR n ≥ 30	Random sampling	Independent observations
One-sample t	Population approximately normal (or n ≥ 30)	σ unknown (use s)	Independent observations
Two-sample t	Both populations approximately normal	Independent groups	Equal variances for the pooled test (an F-test can check this); otherwise use Welch's t-test
Paired t	Differences approximately normal	Same subjects measured twice	Random sampling
χ² test	Categorical data (counts, not means)	Expected frequency E ≥ 5 in each cell	Independent observations

⚠️

What If Assumptions Are Violated?

If your data is heavily skewed and n is small, consider non-parametric alternatives: the Mann-Whitney U test instead of a two-sample t, the Wilcoxon signed-rank test instead of a paired t, or Fisher's exact test instead of χ² when expected cell counts are below 5.
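The expected-count rule for χ² is simple to automate before running the test. This is a minimal sketch: the helper names are my own, and the second table is a hypothetical small sample added to show a failing case.

```python
def expected_frequencies(table):
    """Expected counts under independence: (row total x column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_ok(table, minimum=5):
    """True if every expected cell count meets the minimum (rule of thumb: 5)."""
    return all(e >= minimum for row in expected_frequencies(table) for e in row)


survey = [[70, 30], [50, 50]]       # Chai/Coffee table from Section 05
small_table = [[3, 2], [4, 6]]      # hypothetical small sample

print(chi_square_ok(survey))        # True
print(chi_square_ok(small_table))   # False: consider Fisher's exact test
```

The check looks at expected counts, not observed ones; a table can contain a small observed cell and still satisfy the rule.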

Section 09

The Golden Rules of Test Statistics

🎯 The 7 Rules Every Data Scientist Must Know

Check your data type first. Categorical data (counts, frequencies) → Chi-square. Numerical data (heights, scores, times) → Z or T. This one question eliminates half the decision tree immediately.

Do you know σ? If the population standard deviation is given or known from historical data, use Z. If you're estimating it from your sample (the usual case in practice), use T.

Large samples forgive a lot. When n ≥ 30 (a rule of thumb, not a guarantee), the Central Limit Theorem makes the sampling distribution of the mean approximately normal, so Z works even if the underlying population isn't perfectly normal. Heavily skewed populations may need a larger n.

T approaches Z as df increases. At df = ∞, t = Z. At df = 30, they're already very close. This is why many textbooks use Z for large samples even when σ is unknown — it's a practical approximation.

Chi-square is always right-tailed. Because values are squared, χ² is always ≥ 0 and its distribution is right-skewed. A large χ² always means "surprising" — there's no "negative" direction for categorical deviations.

Statistical significance ≠ practical significance. With a very large sample, even a tiny, meaningless difference can produce a significant p-value. Always report effect sizes (Cohen's d, Cramér's V) alongside your test result.

Always state your conclusion in plain language. "Reject H₀" is not a conclusion — it's a formula. Say: "At the 5% significance level, there is sufficient evidence to conclude that the drug significantly increased average sleep duration."
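The effect sizes named in the rules above can be computed from numbers already on the page. The sketch below applies the usual textbook definitions, Cohen's d = (x̄ − μ₀) / s for a one-sample test and Cramér's V = √(χ² / (n × min(rows − 1, cols − 1))), to the sleep drug and Chai/Coffee examples. The function names are my own.

```python
from math import sqrt


def cohens_d_one_sample(x_bar: float, mu0: float, s: float) -> float:
    """Standardised mean difference: how many std devs the mean moved."""
    return (x_bar - mu0) / s


def cramers_v(chi2: float, n: int, n_rows: int, n_cols: int) -> float:
    """Strength of association between two categorical variables (0 to 1)."""
    return sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Sleep drug example: x_bar = 6.8, mu0 = 6.0, s = 1.0.
d = cohens_d_one_sample(6.8, 6.0, 1.0)

# Chai/Coffee example: chi-square = 8.33, n = 200, 2 x 2 table.
v = cramers_v(8.33, 200, 2, 2)

print(f"Cohen's d = {d:.2f}")     # Cohen's d = 0.80
print(f"Cramér's V = {v:.2f}")    # Cramér's V = 0.20
```

Reporting these alongside the test result shows how large the effect is, which the test statistic alone does not tell you.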

🧮

You Now Speak the Language of Evidence

Z, t, and χ² are not just formulas — they are the three fundamental ways data science converts raw observations into decisions. Master when to use each, check your assumptions, and always interpret your result in the real-world context of the problem. That's what separates a statistician from a calculator.