
What is Variance?

Variance measures how widely the values in a dataset are spread around the mean. Learn what it is, how to calculate it, and why it matters in data science through real-world stories.

Section 01

The Story That Explains Variance

Imagine two pizza delivery services in your city — QuickSlice and PizzaPal. Both claim an average delivery time of 30 minutes. You order from both over ten days and record the times.

Day   QuickSlice (mins)   PizzaPal (mins)
1     29                  10
2     31                  55
3     30                  8
4     28                  62
5     32                  15
6     30                  48
7     29                  20
8     31                  70
9     30                  12
10    30                  0

Both have a mean of exactly 30 minutes. But QuickSlice always arrives between 28–32 minutes. PizzaPal arrives anywhere from 0 to 70 minutes. The mean tells you nothing about this difference. Variance does.

💡
What Variance Tells You

Variance summarises, in a single number, how far the values in a dataset deviate from the mean: it is the average of the squared deviations. A low variance means values stay close to the mean (like QuickSlice). A high variance means values are all over the place (like PizzaPal). Same mean, completely different story.
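To see this in numbers, here is a minimal Python sketch using the delivery times from the table above. It uses only built-in functions; the `summary` helper is a name introduced here purely for illustration.

```python
quickslice = [29, 31, 30, 28, 32, 30, 29, 31, 30, 30]
pizzapal   = [10, 55,  8, 62, 15, 48, 20, 70, 12,  0]

def summary(data):
    """Return (mean, smallest value, largest value) for a list of numbers."""
    return sum(data) / len(data), min(data), max(data)

print(summary(quickslice))  # (30.0, 28, 32)
print(summary(pizzapal))    # (30.0, 0, 70)
```

The means match exactly, while the minimum and maximum already hint at the difference that variance will quantify.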


Section 02

The Formula

Population Variance
σ² = Σ(x − μ)² / N
Use when you have every single data point in the population
Sample Variance
s² = Σ(x − x̄)² / (n − 1)
Use when your data is a sample from a larger population
📐
Why (n − 1) and not n?

This is called Bessel's correction. In a sample, deviations are measured from the sample mean x̄, which is calculated from the same values and therefore sits closer to them than the true population mean μ does. As a result, Σ(x − x̄)² / n tends to underestimate the population variance. Dividing by (n − 1) instead of n corrects this bias, which is why the sample variance is called an "unbiased estimator" of the population variance.
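One way to check this claim is to take a tiny population, list every possible sample, and average both estimators. The sketch below assumes a hypothetical population (the six faces of a die) and samples of size 2 drawn with replacement; the `variance` helper and its `ddof` parameter are names chosen here for illustration. Fractions keep the arithmetic exact.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]  # hypothetical population: faces of a fair die

def variance(data, ddof):
    """Sum of squared deviations from the mean, divided by (n - ddof)."""
    n = len(data)
    mean = Fraction(sum(data), n)
    return sum((x - mean) ** 2 for x in data) / (n - ddof)

true_var = variance(population, ddof=0)  # population variance: divide by N

# Every possible ordered sample of size 2, drawn with replacement (36 in total)
samples = list(product(population, repeat=2))
avg_divide_by_n  = sum(variance(s, ddof=0) for s in samples) / len(samples)
avg_divide_by_n1 = sum(variance(s, ddof=1) for s in samples) / len(samples)

print(true_var)          # 35/12
print(avg_divide_by_n)   # 35/24  (too small on average)
print(avg_divide_by_n1)  # 35/12  (matches the population variance)
```

Averaged over all samples, dividing by n gives only half the true variance in this small case, while dividing by (n − 1) recovers it exactly.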


Section 03

Step-by-Step Calculation

Let us calculate the variance for QuickSlice: [29, 31, 30, 28, 32, 30, 29, 31, 30, 30] — Mean = 30

🧮 QuickSlice Variance
Step 1
Subtract the mean from each value.
(29−30)=−1   (31−30)=1   (30−30)=0   (28−30)=−2   (32−30)=2   (30−30)=0   (29−30)=−1   (31−30)=1   (30−30)=0   (30−30)=0
Step 2
Square each difference (this removes the negative signs), then add up the squares.
1 + 1 + 0 + 4 + 4 + 0 + 1 + 1 + 0 + 0 = 12
Step 3
Divide by (n − 1) — sample variance.
12 / (10 − 1) = 12 / 9 = 1.33 min²

Now let us calculate PizzaPal: [10, 55, 8, 62, 15, 48, 20, 70, 12, 0] — Mean = 30

🧮 PizzaPal Variance
Step 1
Subtract the mean from each value.
(10−30)=−20   (55−30)=25   (8−30)=−22   (62−30)=32   (15−30)=−15   (48−30)=18   (20−30)=−10   (70−30)=40   (12−30)=−18   (0−30)=−30
Step 2
Square each difference, then add up the squares.
400 + 625 + 484 + 1024 + 225 + 324 + 100 + 1600 + 324 + 900 = 6006
Step 3
Divide by (n − 1).
6006 / 9 = 667.33 min²
🧮
Result

QuickSlice variance = 1.33  |  PizzaPal variance = 667.33
Same mean, but PizzaPal's variance is about 500 times higher (6006 / 12 = 500.5). Now you have proof, not just a feeling, that PizzaPal is wildly unpredictable.
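The three steps can also be traced in a few lines of Python. This is a minimal sketch; `variance_steps` is a helper name made up for this example, and it returns the intermediate results so each step can be compared with the hand calculation.

```python
quickslice = [29, 31, 30, 28, 32, 30, 29, 31, 30, 30]
pizzapal   = [10, 55,  8, 62, 15, 48, 20, 70, 12,  0]

def variance_steps(data):
    """Return (deviations, sum of squares, sample variance) for data."""
    mean = sum(data) / len(data)
    deviations = [x - mean for x in data]              # Step 1: subtract the mean
    sum_of_squares = sum(d ** 2 for d in deviations)   # Step 2: square and add
    variance = sum_of_squares / (len(data) - 1)        # Step 3: divide by n - 1
    return deviations, sum_of_squares, variance

_, qs_ss, qs_var = variance_steps(quickslice)
_, pp_ss, pp_var = variance_steps(pizzapal)

print(qs_ss, round(qs_var, 2))     # 12.0 1.33
print(pp_ss, round(pp_var, 2))     # 6006.0 667.33
print(round(pp_var / qs_var, 1))   # 500.5
```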


Section 04

Python Implementation

Manual calculation

quickslice = [29, 31, 30, 28, 32, 30, 29, 31, 30, 30]
pizzapal   = [10, 55,  8, 62, 15, 48, 20, 70, 12,  0]

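def sample_variance(data):
    """Return the sample variance of data (divides by n - 1)."""
    n = len(data)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    mean = sum(data) / n
    # Sum of squared deviations from the mean, divided by n - 1
    return sum((x - mean) ** 2 for x in data) / (n - 1)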

print(f"QuickSlice variance: {sample_variance(quickslice):.2f}")
# QuickSlice variance: 1.33

print(f"PizzaPal variance:   {sample_variance(pizzapal):.2f}")
# PizzaPal variance:   667.33

Using the statistics module

import statistics

quickslice = [29, 31, 30, 28, 32, 30, 29, 31, 30, 30]
pizzapal   = [10, 55,  8, 62, 15, 48, 20, 70, 12,  0]

# Sample variance (divides by n-1)
print(statistics.variance(quickslice))   # 1.3333
print(statistics.variance(pizzapal))     # 667.3333

# Population variance (divides by n)
print(statistics.pvariance(quickslice))  # 1.2
print(statistics.pvariance(pizzapal))    # 600.6

Using NumPy

import numpy as np

quickslice = [29, 31, 30, 28, 32, 30, 29, 31, 30, 30]
pizzapal   = [10, 55,  8, 62, 15, 48, 20, 70, 12,  0]

# ddof=1 → sample variance
# ddof=0 → population variance (NumPy default)
print(np.var(quickslice, ddof=1))   # 1.3333
print(np.var(pizzapal,   ddof=1))   # 667.3333
⚠️
NumPy Default Warning

np.var() uses ddof=0 by default — population variance. In almost all data science work you want ddof=1. Always specify it explicitly to avoid a subtle but impactful mistake.
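To make the pitfall concrete, the sketch below calls `np.var` on the QuickSlice data with and without `ddof`. It assumes NumPy is installed; the variable names are chosen here for illustration.

```python
import numpy as np

quickslice = [29, 31, 30, 28, 32, 30, 29, 31, 30, 30]

default_var = np.var(quickslice)          # ddof defaults to 0: divides by n
sample_var  = np.var(quickslice, ddof=1)  # divides by n - 1

print(round(float(default_var), 4))  # 1.2
print(round(float(sample_var), 4))   # 1.3333
```

The gap between the two results shrinks as n grows, but on small samples like this one it is clearly visible.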


Section 05

The One Problem with Variance

Variance is powerful but has one frustrating issue — its unit is squared. If your data is in minutes, variance is in minutes squared. That is hard to interpret in the real world.

Data             Unit      Variance Unit   Interpretation
Delivery times   minutes   minutes²        Hard to interpret
Heights          cm        cm²             Hard to interpret
Prices           £         £²              Hard to interpret
Temperatures     °C        °C²             Hard to interpret
🎯
Solution: Standard Deviation

Taking the square root of variance gives you the Standard Deviation — which is in the same unit as your original data. This is why standard deviation is more commonly used for interpretation, while variance is used in mathematical calculations and models. The next tutorial covers standard deviation in full.
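As a quick preview, here is a minimal sketch using `statistics.stdev` from the standard library, which returns the square root of the sample variance. The data is the same delivery-time data used throughout this tutorial.

```python
import statistics

quickslice = [29, 31, 30, 28, 32, 30, 29, 31, 30, 30]
pizzapal   = [10, 55,  8, 62, 15, 48, 20, 70, 12,  0]

qs_sd = statistics.stdev(quickslice)  # square root of the sample variance
pp_sd = statistics.stdev(pizzapal)

print(f"QuickSlice: {qs_sd:.2f} min")  # QuickSlice: 1.15 min
print(f"PizzaPal:   {pp_sd:.2f} min")  # PizzaPal:   25.83 min
```

Both results are in minutes, so they can be read directly against the 30-minute mean.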


Section 06

Where Variance is Used in Data Science

PCA
📊
  • Finds directions of max variance
  • Reduces dimensions
  • Removes redundant features
ANOVA
📐
  • Compares variance between groups with variance within groups
  • Tests whether group means differ
  • Used in A/B testing
Finance
📈
  • Measures investment risk
  • Portfolio optimisation
  • Volatility analysis

Section 07

Golden Rules

🎯 Variance — Key Rules
1
Variance is always zero or positive. It can never be negative because you square the differences before summing them. A variance of zero means every value is identical.
2
Use sample variance (n−1) when your data is a sample. Use population variance (n) only when you have the complete dataset for the entire population — which is rare in practice.
3
Variance is highly sensitive to outliers. Because differences are squared, a single extreme value inflates variance dramatically. Always check for outliers before interpreting variance.
4
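For human interpretation use Standard Deviation (square root of variance). For mathematical operations in ML models use variance, because it has better algebraic properties. For example, the variances of independent variables simply add up, which standard deviations do not.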
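Rules 1 and 3 are easy to check in code. The sketch below uses two hypothetical datasets invented for this example: `identical`, where every delivery takes 30 minutes, and `with_outlier`, which is the QuickSlice data with the last delivery changed to 90 minutes.

```python
import statistics

quickslice = [29, 31, 30, 28, 32, 30, 29, 31, 30, 30]

identical    = [30, 30, 30, 30]        # hypothetical: no spread at all
with_outlier = quickslice[:-1] + [90]  # hypothetical: one 90-minute delivery

# Rule 1: identical values give a variance of zero
print(statistics.variance(identical) == 0)          # True

# Rule 3: a single extreme value inflates the variance
print(round(statistics.variance(quickslice), 2))    # 1.33
print(round(statistics.variance(with_outlier), 2))  # 361.33
```

One late delivery out of ten is enough to push the variance from about 1.33 to about 361.33, which is why it pays to inspect outliers first.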