
What is Standard Deviation?

Standard deviation is the square root of variance and measures spread in the same unit as your data. Learn how to calculate it, what the 68-95-99.7 rule means, and how it powers machine learning through real stories.

Section 01

The Story That Explains Standard Deviation

You are a quality control manager at a biscuit factory. Every biscuit should weigh exactly 20 grams. At the end of the day two machines both produced biscuits with a mean weight of 20g. Your boss is happy. But you check the individual weights.

Biscuit   Machine A (g)   Machine B (g)
1         19.8            14.0
2         20.1            26.0
3         20.0            18.0
4         19.9            22.0
5         20.2            20.0

Machine A: mean = 20g, sample std dev = 0.16g. Nearly perfect every time.
Machine B: mean = 20g, sample std dev = 4.47g. All over the place.

⚠️
Machine B Needs Fixing

A biscuit weighing 14g is underweight, so customers complain. A biscuit weighing 26g breaks the packaging and wastes ingredients. The mean looks fine, but the standard deviation reveals that Machine B is broken. This is why standard deviation is central to quality-control methods such as Six Sigma, which are widely used in manufacturing.


Section 02

What Standard Deviation Actually Means

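Standard deviation answers one simple question: "How far does a typical value sit from the mean?" Strictly, it is the root-mean-square distance from the mean, so large deviations count for more than small ones.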

Population Std Dev
σ = √( Σ(x − μ)² / N )
Square root of population variance. Use for complete datasets.
Sample Std Dev
s = √( Σ(x − x̄)² / (n−1) )
Square root of sample variance. Use for sampled data.
💡
Std Dev vs Variance — The Key Difference

Standard deviation is simply the square root of variance. That one operation makes all the difference: it brings the unit back to the same scale as your original data. The sample variance of Machine A is 0.025 g², and squared grams are hard to picture. The standard deviation is 0.16g, so now you can say "biscuits typically vary by about 0.16 grams."
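
One way to see the unit argument is to change units and watch what happens to each measure. The sketch below is only an illustration: it reuses the Machine A weights from Section 01 and converts them from grams to milligrams (the variable names are made up for this example).

```python
import statistics

# Machine A weights from Section 01, in grams
weights_g = [19.8, 20.1, 20.0, 19.9, 20.2]
# The same weights expressed in milligrams
weights_mg = [w * 1000 for w in weights_g]

var_g, var_mg = statistics.variance(weights_g), statistics.variance(weights_mg)
std_g, std_mg = statistics.stdev(weights_g), statistics.stdev(weights_mg)

# Std dev scales with the unit (x1000); variance scales with its square (x1,000,000)
print(f"std dev ratio:  {std_mg / std_g:.0f}")    # std dev ratio:  1000
print(f"variance ratio: {var_mg / var_g:.0f}")    # variance ratio: 1000000
```

Standard deviation follows the unit you measure in, while variance lives in the squared unit, which is why it is rarely reported on its own.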


Section 03

Step-by-Step Calculation

Dataset — exam scores of 8 students: [52, 74, 68, 90, 61, 85, 72, 78]

🧮 Calculating Standard Deviation
Step 1
Calculate the mean.
(52 + 74 + 68 + 90 + 61 + 85 + 72 + 78) / 8 = 580 / 8 = 72.5
Step 2
Subtract mean and square each difference.
(52−72.5)² = 420.25
(74−72.5)² = 2.25
(68−72.5)² = 20.25
(90−72.5)² = 306.25
(61−72.5)² = 132.25
(85−72.5)² = 156.25
(72−72.5)² = 0.25
(78−72.5)² = 30.25
Step 3
Sum the squared differences.
420.25 + 2.25 + 20.25 + 306.25 + 132.25 + 156.25 + 0.25 + 30.25 = 1068
Step 4
Divide by (n − 1) to get variance.
1068 / (8 − 1) = 1068 / 7 = 152.57
Step 5
Take the square root to get std dev.
√152.57 = 12.35
Interpretation

The mean exam score is 72.5 and the standard deviation is 12.35. This tells you that on average, students scored within about 12 marks of the mean — either above or below. Most students scored between 60 and 85.
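
The interpretation can be checked directly against the data. This minimal sketch counts how many of the eight scores fall within one sample standard deviation of the mean; the variable names are illustrative.

```python
import statistics

scores = [52, 74, 68, 90, 61, 85, 72, 78]

mean = statistics.mean(scores)   # 72.5
s = statistics.stdev(scores)     # about 12.35 (sample std dev)

# Range covering one std dev either side of the mean
low, high = mean - s, mean + s
within = [x for x in scores if low <= x <= high]

print(f"Range: {low:.2f} to {high:.2f}")
# Range: 60.15 to 84.85
print(f"{len(within)} of {len(scores)} scores fall within one std dev: {within}")
# 5 of 8 scores fall within one std dev: [74, 68, 61, 72, 78]
```

Five of the eight students land inside the range, which is what "most students" means here. The score of 85 falls just outside it.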


Section 04

Python Implementation

Manual calculation

scores = [52, 74, 68, 90, 61, 85, 72, 78]

n    = len(scores)
mean = sum(scores) / n

squared_diffs = [(x - mean) ** 2 for x in scores]
variance      = sum(squared_diffs) / (n - 1)
std_dev       = variance ** 0.5

print(f"Mean:     {mean}")           # 72.5
print(f"Variance: {variance:.2f}")   # 152.57
print(f"Std Dev:  {std_dev:.2f}")    # 12.35

Using the statistics module

import statistics

scores = [52, 74, 68, 90, 61, 85, 72, 78]

print(statistics.mean(scores))       # 72.5
print(statistics.variance(scores))   # 152.57   (sample)
print(statistics.stdev(scores))      # 12.35    (sample std dev)
print(statistics.pstdev(scores))     # 11.55    (population std dev)

Using NumPy

import numpy as np

scores = [52, 74, 68, 90, 61, 85, 72, 78]

print(np.mean(scores))               # 72.5
print(np.std(scores, ddof=1))        # 12.35   (sample)
print(np.std(scores, ddof=0))        # 11.55   (population)
print(np.var(scores, ddof=1))        # 152.57  (sample variance)
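
Why does ddof matter? A small simulation makes the difference visible. The sketch below is an illustration under stated assumptions, not part of the walkthrough above: it assumes a normal population with a true variance of 1.0, draws many samples of size 5, and averages the variance estimates. The seed, sample size, and variable names are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed so the run is repeatable

# 100,000 samples of size 5 from a normal population with true variance 1.0
samples = rng.normal(loc=0.0, scale=1.0, size=(100_000, 5))

# Variance of each sample (row), computed both ways
var_ddof0 = samples.var(axis=1, ddof=0)   # divide by n
var_ddof1 = samples.var(axis=1, ddof=1)   # divide by n - 1

avg_ddof0 = var_ddof0.mean()
avg_ddof1 = var_ddof1.mean()

print(f"Average estimate with ddof=0: {avg_ddof0:.3f}")   # lands near 0.8
print(f"Average estimate with ddof=1: {avg_ddof1:.3f}")   # lands near 1.0
```

With ddof=0 the estimates average out to roughly (n − 1)/n of the true variance, which is 0.8 for n = 5. Dividing by n − 1 removes that bias.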

Complete comparison

import numpy as np

machine_a = [19.8, 20.1, 20.0, 19.9, 20.2]
machine_b = [14.0, 26.0, 18.0, 22.0, 20.0]

for name, data in [("Machine A", machine_a), ("Machine B", machine_b)]:
    mean    = np.mean(data)
    std_dev = np.std(data, ddof=1)
    print(f"{name}: mean={mean:.1f}g  std_dev={std_dev:.2f}g")

# Machine A: mean=20.0g  std_dev=0.16g
# Machine B: mean=20.0g  std_dev=4.47g

Section 05

The 68 – 95 – 99.7 Rule

When data follows a normal distribution (bell curve), standard deviation lets you predict approximately what percentage of values fall within a given number of standard deviations of the mean. This is called the Empirical Rule.

68% Rule
±1σ
  • 68% of all values
  • The "normal" range
  • Mean ± 1 std dev
95% Rule
±2σ
  • 95% of all values
  • Covers almost everyone
  • Mean ± 2 std devs
99.7% Rule
±3σ
  • 99.7% of all values
  • Beyond = extreme outlier
  • Mean ± 3 std devs

Real Example — Adult Male Heights

Suppose adult male heights follow a normal distribution with mean = 175 cm and std dev = 7 cm.

Range        Calculation   Heights               % of men
±1σ          175 ± 7       168 cm – 182 cm       68%
±2σ          175 ± 14      161 cm – 189 cm       95%
±3σ          175 ± 21      154 cm – 196 cm       99.7%
Beyond ±3σ   -             <154 cm or >196 cm    0.3%

mean    = 175   # cm
std_dev = 7     # cm

print(f"68% range: {mean - std_dev} – {mean + std_dev} cm")
# 68% range: 168 – 182 cm

print(f"95% range: {mean - 2*std_dev} – {mean + 2*std_dev} cm")
# 95% range: 161 – 189 cm

print(f"99.7% range: {mean - 3*std_dev} – {mean + 3*std_dev} cm")
# 99.7% range: 154 – 196 cm
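
Where do 68, 95, and 99.7 come from? For a normal distribution, the share of values within k standard deviations of the mean equals erf(k / √2). The sketch below uses only Python's standard math module; the helper name share_within is made up for illustration.

```python
import math

def share_within(k: float) -> float:
    """Fraction of a normal distribution lying within k std devs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within ±{k}σ: {share_within(k) * 100:.2f}%")
# within ±1σ: 68.27%
# within ±2σ: 95.45%
# within ±3σ: 99.73%
```

The familiar 68, 95, and 99.7 are rounded versions of these figures, and they hold only when the data really is close to normal.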

Section 06

Standard Deviation in Machine Learning

Standard deviation is not just a statistics concept — it powers several core machine learning techniques you will use every day.

Feature Scaling — StandardScaler

Many ML algorithms (SVM, KNN, Neural Networks) are sensitive to the scale of features. StandardScaler transforms each feature so that it has mean = 0 and std dev = 1. This is called Z-score normalisation.

Z-score Formula
z = (x − μ) / σ
Expresses any value as the number of standard deviations it lies from the mean

from sklearn.preprocessing import StandardScaler

# Raw features on different scales (one column each, one row per person)
heights = [[160], [175], [180], [155], [190]]   # cm
weights = [[55],  [70],  [80],  [50],  [95]]    # kg (would be scaled the same way)

# StandardScaler divides by the population std dev (ddof=0)
scaler = StandardScaler()
scaled = scaler.fit_transform(heights)

print("Original heights:", [h[0] for h in heights])
# float() keeps the printed list clean on newer NumPy versions
print("Scaled heights:  ", [round(float(s[0]), 2) for s in scaled])
# Original heights: [160, 175, 180, 155, 190]
# Scaled heights:   [-0.93, 0.23, 0.62, -1.32, 1.4]
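
To see what the scaler does under the hood, the Z-score formula can be applied by hand with NumPy. This is a minimal sketch of the same arithmetic for a single column. It assumes the population std dev (ddof=0), which is the convention StandardScaler follows; the variable names are illustrative.

```python
import numpy as np

heights = np.array([160, 175, 180, 155, 190], dtype=float)   # cm

mu    = heights.mean()          # 172.0
sigma = heights.std(ddof=0)     # population std dev, about 12.88

# z = (x - mu) / sigma, applied to every value at once
z = (heights - mu) / sigma

print([round(float(v), 2) for v in z])
# [-0.93, 0.23, 0.62, -1.32, 1.4]
print(f"std dev of z: {z.std(ddof=0):.2f}")
# std dev of z: 1.00
print("mean of z is ~0:", bool(abs(z.mean()) < 1e-12))
# mean of z is ~0: True
```

After scaling, a value of 1.4 reads as "1.4 standard deviations above the mean," whatever the original unit was.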

Anomaly Detection

import numpy as np

# Server response times (ms)
response_times = [120, 118, 122, 119, 121, 120, 118, 500, 123, 119]

# Baseline: the first seven readings, taken before the suspect value arrived.
# If the 500 ms reading were included, it would inflate the std dev so much
# that it would slip under its own 3-std-dev threshold.
baseline = response_times[:7]

mean    = np.mean(baseline)
std_dev = np.std(baseline, ddof=1)

print(f"Mean:    {mean:.1f} ms")      # Mean:    119.7 ms
print(f"Std Dev: {std_dev:.1f} ms")   # Std Dev: 1.5 ms

# Any value more than 3 std devs above the baseline mean is an anomaly
threshold = mean + 3 * std_dev

anomalies = [t for t in response_times if t > threshold]
print(f"Anomalies detected: {anomalies}")
# Anomalies detected: [500]

💡
Anomaly Detection in the Real World

Flagging values that lie more than 3 standard deviations from the mean is a common first-pass check for fraudulent transactions, abnormal lab results, and server incidents. A response time of 500 ms against a baseline mean of about 120 ms and a std dev of about 1.5 ms is roughly 250 standard deviations away. That is not normal latency; that is an outage.
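
A caveat with small samples: if the suspect reading is included when the mean and std dev are computed, it inflates the std dev and can hide itself. The sketch below (variable names are illustrative) computes the z-score of the 500 ms reading using all ten readings from the example. It also prints the largest z-score any single value can reach in a sample of that size, which is (n − 1) / √n when the sample std dev is used.

```python
import numpy as np

response_times = np.array([120, 118, 122, 119, 121, 120, 118, 500, 123, 119])

# Statistics computed over ALL readings, suspect value included
mean    = response_times.mean()
std_dev = response_times.std(ddof=1)

z_of_500 = (500 - mean) / std_dev

# In a sample of n values, no z-score (with ddof=1) can exceed (n - 1) / sqrt(n)
n = len(response_times)
z_ceiling = (n - 1) / np.sqrt(n)

print(f"mean={mean:.1f}  std_dev={std_dev:.1f}")
# mean=158.0  std_dev=120.2
print(f"z-score of 500: {z_of_500:.2f}")
# z-score of 500: 2.85
print(f"largest possible z-score for n={n}: {z_ceiling:.2f}")
# largest possible z-score for n=10: 2.85
```

With only ten readings, a 3-standard-deviation rule can never fire on a value that took part in the calculation. Estimating the mean and std dev from readings known to be normal, or from a much larger window, avoids the problem.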


Section 07

Variance vs Standard Deviation — Summary

Property                 Variance                       Standard Deviation
Symbol                   σ² or s²                       σ or s
Formula                  Mean of squared differences    Square root of variance
Unit                     Squared (e.g. g², mins²)       Same as data (e.g. g, mins)
Interpretable?           Harder                         Easy
Used in                  PCA, ANOVA, math proofs        Reporting, scaling, outlier detection
Sensitive to outliers?   Yes                            Yes

Section 08

Golden Rules

🎯 Standard Deviation — Key Rules
1
Always use ddof=1 in NumPy when working with sample data. The default ddof=0 gives population std dev which underestimates spread when you have a sample.
2
Std dev is in the same unit as your data. This makes it directly meaningful — a std dev of 12 marks means students typically score within 12 marks of the mean.
3
The 68-95-99.7 rule only applies to normally distributed data. Always visualise your data first — skewed or multimodal distributions behave very differently.
4
In ML preprocessing, StandardScaler uses std dev to normalise features. Understanding this helps you debug models that perform poorly due to unscaled input features.
5
A std dev of zero means every single value is identical. This is a red flag in ML — a feature with zero variance carries no information and should be dropped.
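
Rule 5 can be sketched in a few lines of NumPy. The feature matrix below is hypothetical, made up purely for illustration, and its middle column is constant. Libraries such as scikit-learn offer a ready-made selector for this (VarianceThreshold); the manual version here just shows the idea.

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features.
# The middle column is constant, so its std dev is zero.
X = np.array([
    [160.0, 1.0, 55.0],
    [175.0, 1.0, 70.0],
    [180.0, 1.0, 80.0],
    [155.0, 1.0, 50.0],
])

col_std = X.std(axis=0)     # std dev of each column
keep    = col_std > 0       # True for columns that actually vary

# Boolean mask selects only the informative columns
X_reduced = X[:, keep]

print("Columns kept:", np.flatnonzero(keep).tolist())    # Columns kept: [0, 2]
print("Shape before/after:", X.shape, X_reduced.shape)   # Shape before/after: (4, 3) (4, 2)
```

Dropping constant columns before scaling also avoids dividing by a std dev of zero in the Z-score formula.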