The Story That Explains Variance
Imagine two pizza delivery services in your city — QuickSlice and PizzaPal. Both claim an average delivery time of 30 minutes. You order from both over ten days and record the times.
| Day | QuickSlice (mins) | PizzaPal (mins) |
|---|---|---|
| 1 | 29 | 10 |
| 2 | 31 | 55 |
| 3 | 30 | 8 |
| 4 | 28 | 62 |
| 5 | 32 | 15 |
| 6 | 30 | 48 |
| 7 | 29 | 20 |
| 8 | 31 | 70 |
| 9 | 30 | 12 |
| 10 | 30 | 0 |
Both have a mean of exactly 30 minutes. But QuickSlice always arrives within 28 to 32 minutes, while PizzaPal arrives anywhere from 0 to 70 minutes.
The mean tells you nothing about this difference.
Variance does.
Variance measures how far the values in a dataset spread out from the mean: it is the average of the squared deviations. A low variance means values stay close to the mean (like QuickSlice). A high variance means values are scattered widely (like PizzaPal). Same mean, completely different story.
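To see that the mean alone hides the difference, here is a minimal sketch that summarises both services using the table's values. The helper name `summarize` is made up for this illustration, and the range (max minus min) is used only as a quick stand-in for spread before variance is defined.

```python
import statistics

quickslice = [29, 31, 30, 28, 32, 30, 29, 31, 30, 30]
pizzapal = [10, 55, 8, 62, 15, 48, 20, 70, 12, 0]

def summarize(times):
    """Return the mean and the range (max - min) of a list of delivery times."""
    return statistics.mean(times), max(times) - min(times)

quick_mean, quick_range = summarize(quickslice)
pal_mean, pal_range = summarize(pizzapal)

# Identical means, very different ranges
print(f"QuickSlice: mean={quick_mean}, range={quick_range}")
print(f"PizzaPal:   mean={pal_mean}, range={pal_range}")
```

The two means match, while the ranges (4 minutes versus 70 minutes) do not.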
The Formula
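For a sample of n values, the sample variance is the sum of the squared deviations from the sample mean, divided by (n − 1) rather than n. Dividing by (n − 1) is called Bessel's correction. Deviations are measured from the sample mean, which sits closer to the sample's own values than the true population mean does, so dividing by n tends to underestimate how spread out the full population is. Dividing by (n − 1) corrects this bias, which is why the result is called an "unbiased estimator" of the population variance.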
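Written in standard notation, the sample variance with Bessel's correction and its population counterpart look like this. The symbols are the usual textbook conventions and are assumptions here, not taken from the text above: $x_i$ is each value, $\bar{x}$ the sample mean, $n$ the sample size, $\mu$ the population mean, and $N$ the population size.

```latex
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
\qquad\text{(sample variance)}

\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} \left(x_i - \mu\right)^2
\qquad\text{(population variance)}
```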
Step-by-Step Calculation
Let us calculate the variance for QuickSlice:
Data: [29, 31, 30, 28, 32, 30, 29, 31, 30, 30] (mean = 30)
Deviations from the mean: −1, 1, 0, −2, 2, 0, −1, 1, 0, 0
Sum of squared deviations: 1 + 1 + 0 + 4 + 4 + 0 + 1 + 1 + 0 + 0 = 12
Sample variance: 12 / (10 − 1) = 12 / 9 ≈ 1.33 min²
Now let us calculate PizzaPal:
Data: [10, 55, 8, 62, 15, 48, 20, 70, 12, 0] (mean = 30)
Deviations from the mean: −20, 25, −22, 32, −15, 18, −10, 40, −18, −30
Sum of squared deviations: 400 + 625 + 484 + 1024 + 225 + 324 + 100 + 1600 + 324 + 900 = 6006
Sample variance: 6006 / (10 − 1) = 6006 / 9 ≈ 667.33 min²
QuickSlice variance = 1.33 min²
PizzaPal variance = 667.33 min²
Same mean, but PizzaPal's variance is roughly 500 times higher (6006 / 12 = 500.5). Now you have proof, not just a feeling, that PizzaPal is wildly unpredictable.
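The hand calculation above can be mirrored step by step in code. This is a minimal sketch: the function name `variance_steps` and the dictionary keys are invented for this illustration, and it keeps every intermediate quantity so each one can be checked against the worked numbers.

```python
def variance_steps(data):
    """Return each intermediate quantity of the sample-variance calculation."""
    n = len(data)
    mean = sum(data) / n
    deviations = [x - mean for x in data]   # distance of each value from the mean
    squared = [d ** 2 for d in deviations]  # squaring removes the sign
    sum_of_squares = sum(squared)
    return {
        "mean": mean,
        "deviations": deviations,
        "squared": squared,
        "sum_of_squares": sum_of_squares,
        "variance": sum_of_squares / (n - 1),  # Bessel's correction
    }

quickslice = [29, 31, 30, 28, 32, 30, 29, 31, 30, 30]
pizzapal = [10, 55, 8, 62, 15, 48, 20, 70, 12, 0]

quick = variance_steps(quickslice)
pal = variance_steps(pizzapal)

print(quick["sum_of_squares"], round(quick["variance"], 2))
print(pal["sum_of_squares"], round(pal["variance"], 2))
print(round(pal["variance"] / quick["variance"], 1))  # how many times larger
```

Returning the intermediate lists instead of only the final number makes it easy to spot which step a hand calculation went wrong in.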
Python Implementation
Manual calculation
quickslice = [29, 31, 30, 28, 32, 30, 29, 31, 30, 30]
pizzapal = [10, 55, 8, 62, 15, 48, 20, 70, 12, 0]
def sample_variance(data):
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)
print(f"QuickSlice variance: {sample_variance(quickslice):.2f}")
# QuickSlice variance: 1.33
print(f"PizzaPal variance: {sample_variance(pizzapal):.2f}")
# PizzaPal variance: 667.33
Using the statistics module
import statistics
quickslice = [29, 31, 30, 28, 32, 30, 29, 31, 30, 30]
pizzapal = [10, 55, 8, 62, 15, 48, 20, 70, 12, 0]
# Sample variance (divides by n-1)
print(statistics.variance(quickslice)) # 1.3333
print(statistics.variance(pizzapal)) # 667.3333
# Population variance (divides by n)
print(statistics.pvariance(quickslice)) # 1.2
print(statistics.pvariance(pizzapal)) # 600.6
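The two functions differ only in their divisor, and a small simulation can suggest why (n − 1) is preferred for samples. This is a sketch under arbitrary assumptions: the population (10,000 normally distributed values), the seed, the sample size of 5, and the number of trials are all made up for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# A made-up population whose true variance can be computed directly
population = [random.gauss(50, 10) for _ in range(10_000)]
true_variance = statistics.pvariance(population)

sample_size = 5
trials = 5_000
biased, corrected = [], []
for _ in range(trials):
    sample = random.sample(population, sample_size)
    biased.append(statistics.pvariance(sample))    # divides by n
    corrected.append(statistics.variance(sample))  # divides by n - 1

avg_biased = statistics.mean(biased)
avg_corrected = statistics.mean(corrected)

print(f"True population variance:   {true_variance:.1f}")
print(f"Average dividing by n:      {avg_biased:.1f}")
print(f"Average dividing by n - 1:  {avg_corrected:.1f}")
```

Averaged over many small samples, the n-divisor estimate should land noticeably below the true variance, while the (n − 1) estimate should land close to it.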
Using NumPy
import numpy as np
quickslice = [29, 31, 30, 28, 32, 30, 29, 31, 30, 30]
pizzapal = [10, 55, 8, 62, 15, 48, 20, 70, 12, 0]
# ddof=1 → sample variance
# ddof=0 → population variance (NumPy default)
print(np.var(quickslice, ddof=1)) # 1.3333
print(np.var(pizzapal, ddof=1)) # 667.3333
np.var() uses ddof=0 by default, which gives the population variance. When your data is a sample from a larger population, as it is in most data science work, you want ddof=1. Always specify ddof explicitly to avoid a subtle but impactful mistake.
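To make the role of ddof concrete, here is a pure-Python sketch of the idea: the divisor is n − ddof. The function below is a hypothetical stand-in written for this illustration, not NumPy's implementation, and raising an error for too few data points is this sketch's own choice.

```python
def variance(data, ddof=0):
    """Variance with divisor (n - ddof); ddof=0 mirrors NumPy's default."""
    n = len(data)
    if n - ddof <= 0:
        # Design choice of this sketch, not NumPy behaviour
        raise ValueError("need more data points than ddof")
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - ddof)

pizzapal = [10, 55, 8, 62, 15, 48, 20, 70, 12, 0]

population_style = variance(pizzapal)       # divisor n = 10
sample_style = variance(pizzapal, ddof=1)   # divisor n - 1 = 9

print(round(population_style, 2), round(sample_style, 2))
```

On ten data points the two results differ by about 10%, and the gap shrinks as n grows, which is why the mistake is easy to miss on large datasets.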
The One Problem with Variance
Variance is powerful but has one frustrating issue — its unit is squared. If your data is in minutes, variance is in minutes squared. That is hard to interpret in the real world.
| Data | Unit | Variance Unit | Makes sense? |
|---|---|---|---|
| Delivery times | minutes | minutes² | Hard to interpret |
| Heights | cm | cm² | Hard to interpret |
| Prices | £ | £² | Hard to interpret |
| Temperatures | °C | °C² | Hard to interpret |
Taking the square root of variance gives you the Standard Deviation — which is in the same unit as your original data. This is why standard deviation is more commonly used for interpretation, while variance is used in mathematical calculations and models. The next tutorial covers standard deviation in full.
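As a quick preview, the square root can be taken with the standard library's `statistics.stdev`, which returns the sample standard deviation (the square root of the n − 1 variance). A minimal sketch using the same delivery data:

```python
import math
import statistics

quickslice = [29, 31, 30, 28, 32, 30, 29, 31, 30, 30]
pizzapal = [10, 55, 8, 62, 15, 48, 20, 70, 12, 0]

quick_sd = statistics.stdev(quickslice)
pal_sd = statistics.stdev(pizzapal)

# Same value as taking the square root of the sample variance by hand
assert math.isclose(quick_sd, math.sqrt(statistics.variance(quickslice)))

print(f"QuickSlice: {quick_sd:.2f} min")
print(f"PizzaPal:   {pal_sd:.2f} min")
```

The results are in minutes rather than minutes squared, so they can be read directly against the 30-minute mean.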
Where Variance is Used in Data Science
- Principal Component Analysis (PCA)
  - Finds directions of maximum variance
  - Reduces dimensions
  - Removes redundant features
- ANOVA and hypothesis testing
  - Compares group variances
  - Tests if groups differ
  - Used in A/B testing
- Finance
  - Measures investment risk
  - Portfolio optimisation
  - Volatility analysis
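As one illustration of comparing group variances, here is a minimal sketch that computes the ratio of two sample variances, the quantity an F-test is built on. The helper `variance_ratio` is a name invented for this example. Turning the ratio into a formal test would also need the F distribution and degrees of freedom, which are not shown here.

```python
import statistics

def variance_ratio(group_a, group_b):
    """Ratio of the larger sample variance to the smaller one (always >= 1)."""
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    larger, smaller = max(var_a, var_b), min(var_a, var_b)
    if smaller == 0:
        raise ValueError("one group has zero variance; the ratio is undefined")
    return larger / smaller

quickslice = [29, 31, 30, 28, 32, 30, 29, 31, 30, 30]
pizzapal = [10, 55, 8, 62, 15, 48, 20, 70, 12, 0]

ratio = variance_ratio(quickslice, pizzapal)
print(f"Variance ratio: {ratio:.1f}")
```

A ratio near 1 suggests the groups are similarly spread out, while a large ratio, as with the two pizza services, points to very different consistency.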