Test Statistics Explained: Z, T, and Chi-Square

A comprehensive, story-driven tutorial covering the three essential test statistics: Z, T, and Chi-Square. Includes the Guinness brewery origin of the t-test, five fully worked examples (water supply Z-test, sleep drug t-test, blood pressure paired t, biased dice χ², and city tea preference χ²), distribution diagrams, critical value tables, assumption checklists, and a decision map for choosing the right test.

Section 01

The Universal Translator 🌐

Imagine you and three friends each measure the temperature of a swimming pool using different thermometers — one in Celsius, one in Fahrenheit, one in Kelvin. The readings look wildly different, yet they all describe the same water. To compare them, you need a common language.

Test statistics do exactly that for hypothesis testing. No matter what you measured — average delivery time, defect rates, or voting preferences — a test statistic converts your raw sample result into a single standardised number that sits on a known distribution. That number tells you: "How surprising is my result, if H₀ were true?"

There are three test statistics you will encounter in almost every data science and statistics course: the Z-statistic, the T-statistic, and the Chi-square statistic (χ²). Each is designed for a specific type of data and situation. This tutorial tells the story of all three.

💡
What All Test Statistics Have in Common

Every test statistic follows the same logic: (Observed − Expected) / Spread. You measure how far your sample result is from what H₀ predicts, then scale it by the variability in your data. A large result means the data is surprising under H₀. A small result means the data is consistent with H₀.


Section 02

The Three Champions — At a Glance

Z-Statistic
Z
  • Tests means or proportions
  • Requires known population σ
  • OR large sample (n ≥ 30)
  • Follows standard normal curve
  • Critical value: ±1.96 at α=0.05 (two-tailed)
  • Simplest and most classic test
T-Statistic
t
  • Tests means (σ unknown)
  • Works with small samples (n < 30)
  • Follows t-distribution with df
  • Heavier tails = more conservative
  • Critical value depends on df
  • Most common in real research
Chi-Square (χ²)
χ²
  • Tests categorical / count data
  • No means involved — uses frequencies
  • Always positive (squared values)
  • Skewed right distribution
  • Tests goodness-of-fit & independence
  • Depends on degrees of freedom

Section 03

Part I — The Z-Statistic 🏙️

The Story: The City Water Authority

The Delhi Water Authority claims that the average daily water supply per household is 200 litres. A consumer rights group suspects this is overstated. They survey 64 households across the city. The population standard deviation is known from historical records to be 40 litres.

Because the population standard deviation is known and the sample is large (n = 64), this is a perfect case for the Z-test. Think of it as the gold standard — when you have complete information about population variability, Z is the most precise tool available.

Z-Statistic Formula (Mean)
Z = (x̄ − μ₀) / (σ / √n)
x̄ = sample mean, μ₀ = hypothesised mean, σ = known population std dev, n = sample size
Z-Statistic Formula (Proportion)
Z = (p̂ − p₀) / √(p₀q₀/n)
p̂ = sample proportion, p₀ = hypothesised proportion, q₀ = 1 − p₀, n = sample size
🧮 Z-Test Worked Example — Delhi Water Supply
Given
μ₀ = 200 L, σ = 40 L, n = 64 households, x̄ = 192 L, α = 0.05, Left-tailed test
H₀ / H₁
H₀: μ = 200  |  H₁: μ < 200  (Authority is overstating supply)
Step 1
Standard Error = σ / √n = 40 / √64 = 40 / 8 = 5
Step 2
Z = (192 − 200) / 5 = −8 / 5 = −1.60
Step 3
Critical value (left-tailed, α = 0.05) = −1.645
Decision
Z = −1.60 is NOT less than −1.645.  Fail to Reject H₀. At 5% significance, insufficient evidence to conclude the authority is overstating supply. The consumer group needs more data.
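The same calculation can be checked in a few lines of Python. This is a minimal sketch using only the standard library; the function name `z_test_mean` is an illustrative choice, not something from the tutorial.

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(x_bar, mu0, sigma, n):
    """Return the Z statistic for a one-sample mean test with known sigma."""
    standard_error = sigma / sqrt(n)
    return (x_bar - mu0) / standard_error

# Numbers from the Delhi water supply example
z = z_test_mean(x_bar=192, mu0=200, sigma=40, n=64)

alpha = 0.05
std_normal = NormalDist()              # mean 0, standard deviation 1
critical = std_normal.inv_cdf(alpha)   # left-tailed cutoff, about -1.645
p_value = std_normal.cdf(z)            # P(Z <= z) if H0 is true

reject_h0 = z < critical
print(f"Z = {z:.2f}, critical = {critical:.3f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

The p-value comes out at about 0.055, just above α = 0.05, which agrees with the fail-to-reject decision.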
📌
When to Use Z (the Checklist)

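Use Z when: (1) Population standard deviation σ is known, OR (2) Sample size n ≥ 30 (Central Limit Theorem kicks in). For proportions, use Z when np ≥ 5 and n(1−p) ≥ 5. If σ is unknown and the sample is small, switch to the T-test. If the proportion conditions fail, use an exact binomial test instead, since the T-test does not apply to proportions.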
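The proportion formula has no worked example in this section, so here is a small sketch that combines it with the np ≥ 5 rule of thumb. The helper name `proportion_z` and the survey numbers are hypothetical, chosen only for illustration.

```python
from math import sqrt

def proportion_z(successes, n, p0):
    """Z statistic for one proportion, with the n*p0 >= 5 rule-of-thumb check."""
    if n * p0 < 5 or n * (1 - p0) < 5:
        raise ValueError("normal approximation is doubtful: need n*p0 >= 5 and n*(1-p0) >= 5")
    p_hat = successes / n
    standard_error = sqrt(p0 * (1 - p0) / n)   # uses p0, the value assumed under H0
    return (p_hat - p0) / standard_error

# Hypothetical data (not from the article): 58 "yes" answers out of 100, H0: p = 0.5
z_prop = proportion_z(successes=58, n=100, p0=0.5)
print(f"Z = {z_prop:.2f}")
```

With these made-up numbers the statistic is 1.60, which would then be compared with the Z critical value for the chosen tail and α.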

[Figure: three distribution panels]
Z (Standard Normal): symmetric, mean = 0, SD = 1. Use when σ is known or n ≥ 30.
t (Student's t): heavier, wider tails than Z; shape depends on df. Use when σ is unknown and n < 30.
χ² (Chi-Square): right-skewed, always ≥ 0. Use for categorical data.

Section 04

Part II — The T-Statistic 🧪

The Story: William Gosset and the Guinness Brewery

The year is 1908. William Sealy Gosset is a chemist working at the Guinness Brewery in Dublin. His job: test whether different batches of barley produce consistent quality beer. The problem? He can only afford to test small samples of barley — and the population standard deviation is completely unknown.

The Z-test fails here. With small samples, estimating σ from the sample data introduces extra uncertainty — the resulting distribution is wider and heavier-tailed than the standard normal. Gosset derived the t-distribution to account for this, publishing it under the pseudonym "Student" (Guinness didn't allow employees to publish). That's why it's called Student's t-test to this day.

💡
What Are Degrees of Freedom?

Degrees of freedom (df) = n − 1. Think of it as the number of values "free to vary" once you've fixed the sample mean. With only 5 data points and a known mean, only 4 of those points are truly independent — the last one is determined. As df increases, the t-distribution approaches the Z (normal) distribution — which is why large samples can use Z.
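The "free to vary" idea can be made concrete with a tiny sketch. The five-value sample and its mean below are hypothetical numbers picked for illustration.

```python
# Hypothetical sample of n = 5 values whose mean is fixed at 10
n = 5
fixed_mean = 10
free_values = [8, 11, 9, 13]          # any four numbers can be chosen freely

# The fifth value is forced: all five must sum to n * mean
last_value = n * fixed_mean - sum(free_values)

sample = free_values + [last_value]
df = n - 1
print(f"last value must be {last_value}; degrees of freedom = {df}")
```

Changing any of the four free values changes the forced fifth value, but never the count of free choices, which stays at n − 1 = 4.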

One-Sample T-Statistic
t = (x̄ − μ₀) / (s / √n)
s = sample standard deviation (estimated, not known). df = n − 1
Two-Sample T-Statistic
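t = (x̄₁ − x̄₂) / √(sₚ²/n₁ + sₚ²/n₂)
Compares means of two independent groups. sₚ² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2) is the pooled variance, which assumes both groups share the same variance. df = n₁ + n₂ − 2. If the variances differ, use Welch's version: denominator √(s₁²/n₁ + s₂²/n₂) with the Welch-Satterthwaite df.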
Paired T-Statistic
t = d̄ / (sᴅ / √n)
d̄ = mean of differences, sᴅ = std dev of differences. df = n − 1
Key Difference from Z
s replaces σ
Sample std dev (s) is an estimate — this uncertainty widens the distribution, especially at low df.
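There is no worked two-sample example in this section, so the sketch below shows one way the two-sample statistic might be computed, both with a pooled variance (equal variances assumed) and with Welch's variant (no such assumption). The function names and both data sets are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(sample1, sample2):
    """Two-sample t statistic with a pooled variance (assumes equal variances)."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = variance(sample1), variance(sample2)   # sample variances (n - 1 denominator)
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (mean(sample1) - mean(sample2)) / sqrt(pooled_var / n1 + pooled_var / n2)
    return t, n1 + n2 - 2

def welch_t(sample1, sample2):
    """Welch's t statistic and approximate df (does not assume equal variances)."""
    n1, n2 = len(sample1), len(sample2)
    a, b = variance(sample1) / n1, variance(sample2) / n2
    t = (mean(sample1) - mean(sample2)) / sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df

# Hypothetical measurements (not from the article)
group_a = [12, 15, 11, 14, 13]
group_b = [10, 9, 12, 8, 11]
print(pooled_t(group_a, group_b))
print(welch_t(group_a, group_b))
```

Because these two made-up groups have equal sizes and equal sample variances, both versions give the same statistic and the same df. They diverge once the group sizes or variances differ.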
🧮 One-Sample T-Test — Sleep Drug Trial
Story
A researcher tests a new sleep drug on 10 patients. Before the drug, patients averaged 6.0 hours of sleep. After the drug, the sample mean is 6.8 hours with a sample standard deviation of 1.0 hour. Did the drug significantly increase sleep?
Given
μ₀ = 6.0, x̄ = 6.8, s = 1.0, n = 10, α = 0.05. Right-tailed test.
H₀ / H₁
H₀: μ ≤ 6.0  |  H₁: μ > 6.0  (drug increases sleep hours)
Step 1
SE = s / √n = 1.0 / √10 = 1.0 / 3.162 = 0.316
Step 2
t = (6.8 − 6.0) / 0.316 = 0.8 / 0.316 = +2.53
Step 3
df = n − 1 = 10 − 1 = 9.  Critical value (right-tailed, df=9, α=0.05) = +1.833
Decision
t = 2.53 > 1.833 → Reject H₀. The drug significantly increased sleep hours. ✅
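The sleep drug calculation can be reproduced from the summary statistics alone. This is a sketch under the assumption that the critical value is looked up in a t table rather than computed; `one_sample_t` is an illustrative name.

```python
from math import sqrt

def one_sample_t(x_bar, mu0, s, n):
    """t statistic and degrees of freedom from summary statistics."""
    standard_error = s / sqrt(n)
    return (x_bar - mu0) / standard_error, n - 1

t_stat, df = one_sample_t(x_bar=6.8, mu0=6.0, s=1.0, n=10)

# Right-tailed critical value for df = 9, alpha = 0.05, copied from a t table
T_CRITICAL_DF9 = 1.833
reject_h0 = t_stat > T_CRITICAL_DF9
print(f"t = {t_stat:.2f}, df = {df}, reject H0: {reject_h0}")
```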
🧮 Paired T-Test — Blood Pressure Before vs After
Story
A doctor measures blood pressure (mmHg) for 6 patients before and after a 4-week diet programme. The same patients are measured twice — making this a paired design. We analyse the differences, not the raw values.
Data
Differences (Before − After): 8, 5, 12, 3, 9, 7
d̄ = (8+5+12+3+9+7)/6 = 44/6 = 7.33
sᴅ = 3.14 (sample standard deviation of the 6 differences, n − 1 denominator), n = 6
H₀ / H₁
H₀: μᴅ = 0  |  H₁: μᴅ > 0  (diet reduces BP; a positive difference means a decrease)
Step 1
SE = sᴅ / √n = 3.14 / √6 = 3.14 / 2.449 = 1.282
Step 2
t = d̄ / SE = 7.33 / 1.282 = +5.72
Step 3
df = 6 − 1 = 5. Critical value (right-tailed, df=5, α=0.05) = +2.015
Decision
t = 5.72 ≫ 2.015 → Reject H₀. The diet programme significantly reduced blood pressure. 🎉
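For a paired design it is safest to compute everything directly from the differences. The sketch below does that with the standard library. `statistics.stdev` uses the n − 1 denominator, which gives a standard deviation of roughly 3.14 and a t statistic of roughly 5.72 for these six values.

```python
from math import sqrt
from statistics import mean, stdev

# Differences (Before - After) from the blood pressure example
differences = [8, 5, 12, 3, 9, 7]

n = len(differences)
d_bar = mean(differences)          # mean difference
s_d = stdev(differences)           # sample standard deviation (n - 1 denominator)
standard_error = s_d / sqrt(n)
t_stat = d_bar / standard_error

T_CRITICAL_DF5 = 2.015             # right-tailed, df = 5, alpha = 0.05, from a t table
print(f"d_bar = {d_bar:.2f}, s_d = {s_d:.2f}, t = {t_stat:.2f}")
print("reject H0:", t_stat > T_CRITICAL_DF5)
```

The statistic sits far above the critical value, so the reject decision does not hinge on small rounding differences in the intermediate steps.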

Section 05

Part III — The Chi-Square Statistic (χ²) 🎲

The Story: The Dice Factory Inspector

A toy company manufactures six-sided dice. A quality inspector suspects one batch is biased: some faces appear more often than they should. She rolls a die 120 times and records how often each face appears. If the die is fair, each face is expected to appear 20 times on average. But the observed counts are different. Is this due to random chance, or is the die actually unfair?

There are no means here. No standard deviation of heights or weights. Just counts in categories. This is exactly what the Chi-square test was built for.

Chi-Square Formula
χ² = Σ (O − E)² / E
O = Observed frequency, E = Expected frequency. Sum over all categories. df = k − 1 for goodness-of-fit.
Expected Frequency
E = n × p
n = total count, p = hypothesised proportion for that category. For a fair die: E = 120 × (1/6) = 20
Degrees of Freedom
df = k − 1
k = number of categories. For independence test in a table: df = (rows − 1)(cols − 1)
Key Insight
(O − E)² / E
Squaring removes negatives. Dividing by E weights deviations by expected size. Large χ² = big surprise.
🧮 Chi-Square Goodness-of-Fit — The Biased Die
Setup
H₀: Die is fair (each face equally likely, p = 1/6).  H₁: Die is biased.  n = 120 rolls, k = 6 faces, α = 0.05
Data
Face 1: O=25, E=20  |  Face 2: O=17, E=20  |  Face 3: O=22, E=20
Face 4: O=30, E=20  |  Face 5: O=14, E=20  |  Face 6: O=12, E=20
Step 1
χ² = (25−20)²/20 + (17−20)²/20 + (22−20)²/20 + (30−20)²/20 + (14−20)²/20 + (12−20)²/20
Step 2
= 25/20 + 9/20 + 4/20 + 100/20 + 36/20 + 64/20
= 1.25 + 0.45 + 0.20 + 5.00 + 1.80 + 3.20 = 11.90
Step 3
df = k − 1 = 6 − 1 = 5.  Critical value (χ², df=5, α=0.05) = 11.07
Decision
χ² = 11.90 > 11.07 → Reject H₀. The die is significantly biased! The factory recalls the batch. 🎲
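The goodness-of-fit sum is easy to verify in code. This is a minimal sketch in which the critical value is taken from a chi-square table rather than computed; `chi_square_gof` is an illustrative name.

```python
def chi_square_gof(observed, expected):
    """Chi-square statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [25, 17, 22, 30, 14, 12]      # counts for faces 1..6
n_rolls = sum(observed)                  # 120
expected = [n_rolls / 6] * 6             # fair die: 20 per face

chi2 = chi_square_gof(observed, expected)
df = len(observed) - 1

CHI2_CRITICAL_DF5 = 11.070               # alpha = 0.05, df = 5, from a chi-square table
print(f"chi2 = {chi2:.2f}, df = {df}, reject H0: {chi2 > CHI2_CRITICAL_DF5}")
```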
💡
Chi-Square for Independence (Two Variables)

Chi-square is also used to test whether two categorical variables are related. For example: "Is gender independent of political party preference?" You build a contingency table (rows = gender, columns = party), compute expected frequencies for each cell using E = (row total × col total) / grand total, then apply the same χ² formula. df = (rows − 1)(cols − 1).

🧮 Chi-Square Independence Test — Tea Preference vs City
Story
A survey of 200 people in Delhi and Mumbai asks whether they prefer Chai or Coffee. Is tea preference independent of city? The contingency table is shown below.
Table
          Chai   Coffee   Total
Delhi:     70      30      100
Mumbai:    50      50      100
Total:    120      80      200
Expected
E(Delhi, Chai) = 100×120/200 = 60  |  E(Delhi, Coffee) = 100×80/200 = 40
E(Mumbai, Chai) = 100×120/200 = 60  |  E(Mumbai, Coffee) = 100×80/200 = 40
χ² Calc
= (70−60)²/60 + (30−40)²/40 + (50−60)²/60 + (50−40)²/40
= 100/60 + 100/40 + 100/60 + 100/40
= 1.67 + 2.50 + 1.67 + 2.50 = 8.33
df
df = (2−1)(2−1) = 1. Critical value (χ², df=1, α=0.05) = 3.84
Decision
8.33 > 3.84 → Reject H₀. Tea preference is NOT independent of city. Delhi leans strongly Chai; Mumbai is evenly split. ☕
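The expected counts and the statistic for the tea survey can be derived from the table itself. The sketch below is one way to do it for a table of any size, with no continuity correction applied; `chi_square_independence` is an illustrative name.

```python
def chi_square_independence(table):
    """Chi-square statistic, df, and expected counts for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # E = (row total x column total) / grand total for every cell
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, expected

# Rows: Delhi, Mumbai. Columns: Chai, Coffee.
survey = [[70, 30], [50, 50]]
chi2, df, expected = chi_square_independence(survey)
print(f"chi2 = {chi2:.2f}, df = {df}, expected = {expected}")
```

Some statistical libraries apply a continuity correction to 2×2 tables by default, so their output can be smaller than this hand calculation.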

Section 06

Choosing the Right Test — The Decision Map

This is the most practical skill in inferential statistics. Picking the wrong test doesn't just produce the wrong answer — it undermines the entire analysis. Use this framework every time.

Data Type | What You're Testing | σ Known? | Sample Size | Use This Test
Continuous (numeric) | One sample mean vs known value | Yes | Any | Z-test
Continuous (numeric) | One sample mean vs known value | No | Any | One-sample t-test
Continuous (numeric) | Two independent group means | No | Any | Two-sample t-test
Continuous (numeric) | Same subjects measured twice | No | Any | Paired t-test
Binary (yes/no) | One sample proportion vs value | N/A | np₀ ≥ 5 and n(1 − p₀) ≥ 5 | Z-test (proportion)
Categorical (counts) | Observed vs expected frequencies | N/A | E ≥ 5 per cell | χ² goodness-of-fit
Categorical (counts) | Two categorical variables related? | N/A | E ≥ 5 per cell | χ² independence test
Continuous (numeric) | Three or more group means | No | Any | ANOVA (F-test)
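The decision map can also be read as a chain of if-statements. The helper below is a hypothetical sketch: the function name and the string labels for `data_type` and `goal` are invented for illustration, and it leaves out the sample-size conditions, which still need checking separately.

```python
def choose_test(data_type, goal, sigma_known=False):
    """Return the test suggested by the decision map (hypothetical helper)."""
    if data_type == "categorical":
        return "chi-square independence" if goal == "two_variables" else "chi-square goodness-of-fit"
    if data_type == "binary":
        return "Z-test (proportion)"
    # continuous data from here on
    if goal == "one_mean":
        return "Z-test" if sigma_known else "one-sample t-test"
    if goal == "two_independent_means":
        return "two-sample t-test"
    if goal == "paired":
        return "paired t-test"
    if goal == "three_or_more_means":
        return "ANOVA (F-test)"
    raise ValueError(f"no rule for data_type={data_type!r}, goal={goal!r}")

print(choose_test("continuous", "one_mean", sigma_known=True))
print(choose_test("continuous", "paired"))
print(choose_test("categorical", "two_variables"))
```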

Section 07

Critical Value Reference Tables

Z Critical Values

α | Two-Tailed (±Z) | Right-Tailed (+Z) | Left-Tailed (−Z)
0.10 | ±1.645 | +1.282 | −1.282
0.05 | ±1.960 | +1.645 | −1.645
0.01 | ±2.576 | +2.326 | −2.326
0.001 | ±3.291 | +3.090 | −3.090
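The Z critical values do not have to be memorised. They can be regenerated from the inverse of the standard normal CDF, as in this sketch using `statistics.NormalDist` from the Python standard library; `z_critical` is an illustrative name.

```python
from statistics import NormalDist

std_normal = NormalDist()   # standard normal: mean 0, standard deviation 1

def z_critical(alpha, two_tailed=False):
    """Positive critical value for a one- or two-tailed Z-test."""
    tail_area = alpha / 2 if two_tailed else alpha
    return std_normal.inv_cdf(1 - tail_area)

for alpha in (0.10, 0.05, 0.01, 0.001):
    print(f"alpha={alpha}: two-tailed ±{z_critical(alpha, True):.3f}, "
          f"one-tailed {z_critical(alpha):.3f}")
```

For a left-tailed test, use the negative of the one-tailed value.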

T Critical Values (Right-Tailed and Two-Tailed)

df | Right-Tailed (α=0.05) | Two-Tailed (α=0.05) | Two-Tailed (α=0.01)
1 | 6.314 | 12.706 | 63.657
5 | 2.015 | 2.571 | 4.032
9 | 1.833 | 2.262 | 3.250
15 | 1.753 | 2.131 | 2.947
20 | 1.725 | 2.086 | 2.845
30 | 1.697 | 2.042 | 2.750
∞ (→Z) | 1.645 | 1.960 | 2.576

Chi-Square Critical Values (Right-Tailed)

df | α = 0.10 | α = 0.05 | α = 0.01
1 | 2.706 | 3.841 | 6.635
2 | 4.605 | 5.991 | 9.210
3 | 6.251 | 7.815 | 11.345
4 | 7.779 | 9.488 | 13.277
5 | 9.236 | 11.070 | 15.086
10 | 15.987 | 18.307 | 23.209

Section 08

Key Assumptions to Check

Every test statistic is only valid when its underlying assumptions hold. Violating these assumptions can make your p-values meaningless — even if the maths looks correct.

Test | Assumption 1 | Assumption 2 | Assumption 3
Z-test | σ known OR n ≥ 30 | Random sampling | Independent observations
One-sample t | Population approximately normal (or n ≥ 30) | σ unknown (use s) | Independent observations
Two-sample t | Both populations approximately normal | Independent groups | Equal or unequal variances (check with F-test)
Paired t | Differences approximately normal | Same subjects measured twice | Random sampling
χ² test | Categorical data (counts, not means) | Expected frequency E ≥ 5 in each cell | Independent observations
⚠️
What If Assumptions Are Violated?

If your data is heavily skewed and n is small, consider non-parametric alternatives: the Mann-Whitney U test instead of a two-sample t, the Wilcoxon signed-rank test instead of a paired t, or Fisher's exact test instead of χ² when expected cell counts are below 5.


Section 09

The Golden Rules of Test Statistics

🎯 The 7 Rules Every Data Scientist Must Know
1. Check your data type first. Categorical data (counts, frequencies) → Chi-square. Numerical data (heights, scores, times) → Z or T. This one question eliminates half the decision tree immediately.
2. Do you know σ? If the population standard deviation is given or known from historical data, use Z. If you're estimating it from your sample (the usual case in practice), use T.
3. Large samples forgive a lot. When n ≥ 30, the Central Limit Theorem usually makes the sampling distribution of the mean approximately normal, so Z works even if the underlying population isn't perfectly normal.
4. T approaches Z as df increases. At df = ∞, t = Z. At df = 30, they're already very close. This is why many textbooks use Z for large samples even when σ is unknown; it's a practical approximation.
5. Chi-square is always right-tailed. Because values are squared, χ² is always ≥ 0 and its distribution is right-skewed. A large χ² always means "surprising"; there's no "negative" direction for categorical deviations.
6. Statistical significance ≠ practical significance. With a very large sample, even a tiny, meaningless difference can produce a significant p-value. Always report effect sizes (Cohen's d, Cramér's V) alongside your test result.
7. Always state your conclusion in plain language. "Reject H₀" is not a conclusion; it's a formula. Say: "At the 5% significance level, there is sufficient evidence to conclude that the drug significantly increased average sleep duration."
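The effect sizes named in rule 6 are short formulas. The sketch below applies the standard definitions of one-sample Cohen's d and Cramér's V to two of the worked examples. The function names are illustrative, and the inputs are the rounded figures from those examples.

```python
from math import sqrt

def cohens_d_one_sample(x_bar, mu0, s):
    """Cohen's d for a one-sample comparison: mean shift in standard-deviation units."""
    return (x_bar - mu0) / s

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V for a contingency table, ranging from 0 to 1."""
    return sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

d = cohens_d_one_sample(x_bar=6.8, mu0=6.0, s=1.0)         # sleep drug example
v = cramers_v(chi2=8.33, n=200, n_rows=2, n_cols=2)        # tea vs city example
print(f"Cohen's d = {d:.2f}, Cramér's V = {v:.2f}")
```

Reported next to the test result, these numbers show how large the effect is, not only whether it is statistically detectable.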
🧮
You Now Speak the Language of Evidence

Z, t, and χ² are not just formulas — they are the three fundamental ways data science converts raw observations into decisions. Master when to use each, check your assumptions, and always interpret your result in the real-world context of the problem. That's what separates a statistician from a calculator.