The Story That Starts Everything
It is 7 am on a Monday. A hospital manager walks into her office and finds a spreadsheet with 50,000 rows — every patient visit from the past year. Name, age, diagnosis, treatment, cost, duration, outcome. Her board meeting starts in one hour and they want to know: how is the hospital performing?
She cannot read 50,000 rows in an hour. She cannot hand the spreadsheet to the board. What she needs is a way to compress 50,000 data points into a handful of meaningful numbers — numbers that tell the story of the data without showing every detail.
That is exactly what Descriptive Statistics does.
Average patient age: 47. Most common diagnosis: respiratory (28%). Average hospital stay: 3.2 days. Average treatment cost: £1,840. Patient outcome success rate: 94.3%. Longest wait time recorded: 18 hours. These six numbers summarise 50,000 rows. That is descriptive statistics.
What is Descriptive Statistics?
Descriptive statistics is the branch of statistics that organises, summarises, and describes a dataset using numbers and charts. It does not make predictions. It does not test hypotheses. It simply answers one question: What does this data look like?
- Mean, Median, Mode: Where is the centre? What is typical?
- Range, Variance, Std Dev: How spread out is it? How consistent is it?
- Skewness, Kurtosis: What shape is it? Are there outliers?
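As a rough illustration of the three groups of measures listed above, the sketch below computes each one with pandas on a small made-up series of hospital stays. The values and the name `stay_days` are assumptions for illustration, not figures from the hospital story.

```python
import pandas as pd

# Hypothetical lengths of stay in days (illustrative values only)
stay_days = pd.Series([2, 5, 1, 8, 3, 1, 4, 2, 3, 21])

summary = {
    # Centre: what is typical?
    "mean": stay_days.mean(),
    "median": stay_days.median(),
    "mode": stay_days.mode().tolist(),   # there can be more than one mode
    # Spread: how consistent is it?
    "range": stay_days.max() - stay_days.min(),
    "variance": stay_days.var(),         # pandas default: sample variance (ddof=1)
    "std_dev": stay_days.std(),          # pandas default: sample std dev (ddof=1)
    # Shape: is it lopsided, are the tails heavy?
    "skewness": stay_days.skew(),
    "kurtosis": stay_days.kurt(),        # excess kurtosis
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

The single 21-day stay pulls the mean well above the median and produces a positive skewness, which is the kind of signal the shape measures are meant to surface.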
Descriptive statistics describes the data you have. Inferential statistics uses that data to make predictions or draw conclusions about data you have not seen. Descriptive always comes first — you cannot make good inferences from data you have not properly described.
The Four Types of Data
Before you can choose which descriptive statistic to calculate, you must know what type of data you are working with. The wrong statistic on the wrong data type produces meaningless results.
All data falls into one of four types, arranged in a hierarchy from least to most informative:
Nominal: categories with no natural order.
Examples: blood type (A, B, AB, O), eye colour, country of birth, programming language, gender, car brand.
Valid stats: Mode, frequency, percentage. You cannot calculate a mean blood type.
Ordinal: categories that can be ranked, but with gaps of unknown size between ranks.
Examples: customer satisfaction (1–5 stars), pain scale (1–10), education level (GCSE, A-Level, Degree, Masters), film ratings (Poor, Average, Good, Excellent).
Valid stats: Mode, median, percentiles. The gap between 1 star and 2 stars is not necessarily the same as the gap between 4 stars and 5 stars, so the mean is unreliable.
Interval: ranked values with equal gaps, but no true zero.
Examples: temperature in Celsius or Fahrenheit, IQ scores, calendar year (year 0 does not mean "no time"), SAT scores.
Valid stats: Mean, median, mode, std dev, correlation. You can say 30°C is 10° hotter than 20°C, but you cannot say 30°C is "twice as hot" as 15°C. Only ratios of differences are meaningful, not ratios of values.
Ratio: ranked values with equal gaps and a true zero.
Examples: height, weight, age, salary, distance, time elapsed, number of purchases, temperature in Kelvin.
Valid stats: All statistics. You can say someone who earns £60,000 earns twice as much as someone earning £30,000. That ratio is meaningful because zero salary means no salary at all.
A common mistake is treating ordinal data as if it were ratio. A customer satisfaction survey numbered 1 to 10 looks like numbers, but the difference between a 3 and a 4 is not the same emotional distance as between an 8 and a 9. Calculating a mean satisfaction score of 7.3 and treating it like a precise measurement is a common and misleading error in business reporting.
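One way to see why a mean satisfaction score can mislead is to compare two surveys that share the same mean but describe very different customers. The sketch below uses two made-up sets of 1 to 10 responses. The names `polarised` and `consistent` are hypothetical.

```python
import pandas as pd

# Two hypothetical satisfaction surveys on a 1-10 scale (made-up responses)
polarised = pd.Series([1, 1, 1, 1, 10, 10, 10, 10, 10, 10])
consistent = pd.Series([6, 6, 6, 6, 6, 6, 7, 7, 7, 7])

for label, scores in [("polarised", polarised), ("consistent", consistent)]:
    print(label)
    print(f"  mean:   {scores.mean():.1f}")
    print(f"  median: {scores.median():.1f}")
    print(f"  mode:   {scores.mode()[0]}")
    # Share of respondents who gave a low score (3 or below)
    print(f"  share scoring 3 or lower: {(scores <= 3).mean():.0%}")
```

Both surveys report the same mean, yet the median, the mode, and the share of low scores differ sharply. For ordinal scales these order-based and frequency-based summaries are usually the safer thing to report.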
Data Types — Real Examples Side by Side
| Type | Real example | Can rank? | Equal gaps? | True zero? | Best stat |
|---|---|---|---|---|---|
| Nominal | Favourite colour: Red, Blue, Green | No | No | No | Mode |
| Ordinal | Movie rating: ★ ★★ ★★★ ★★★★ ★★★★★ | Yes | No | No | Median |
| Interval | Temperature: 0°C, 10°C, 20°C, 30°C | Yes | Yes | No | Mean |
| Ratio | Salary: £0, £25k, £50k, £100k | Yes | Yes | Yes | Mean + all |
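For the ordinal row of the table, pandas can record the ranking explicitly through an ordered categorical. The sketch below is one possible way to do this, using the education levels mentioned earlier and a handful of made-up responses. The median is taken as the middle response after sorting by rank, which only needs the ordering and not equal gaps.

```python
import pandas as pd

levels = ["GCSE", "A-Level", "Degree", "Masters"]  # lowest to highest
# Hypothetical survey responses (illustrative only)
responses = ["Degree", "GCSE", "A-Level", "Degree", "Masters", "A-Level", "Degree"]

education = pd.Series(pd.Categorical(responses, categories=levels, ordered=True))

# Frequencies and mode are valid for any categorical data
print(education.value_counts(sort=False))
print("Mode:", education.mode()[0])

# Ordered categories can be compared, so min and max are defined
print("Lowest:", education.min(), "| Highest:", education.max())

# Median: sort the rank codes and take the middle one (odd sample size here)
codes = education.cat.codes.sort_values()
median_level = levels[int(codes.iloc[len(codes) // 2])]
print("Median:", median_level)
```

With an even number of responses the two middle ranks may differ, and there is no meaningful "halfway" category, so in that case it is common to report both middle values.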
Another way to remember it — the thermometer test
Ask: does zero mean "none of this thing exists"?
0°C: does it mean "no temperature"? No. It still has temperature, just at the freezing point of water. So temperature in Celsius is Interval.
0 kg: does it mean "no weight"? Yes. Zero weight means nothing is there. So weight is Ratio.
0 goals scored: does it mean "no goals"? Yes. So goals scored is Ratio.
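The effect of a missing true zero can be checked numerically. The sketch below takes the 30°C, 20°C and 15°C values from the earlier example and converts them to Fahrenheit and Kelvin. The helper function names are introduced here for illustration.

```python
import math

def celsius_to_kelvin(c):
    return c + 273.15

def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

hot_c, mild_c, base_c = 30.0, 20.0, 15.0

# Ratio of values: depends entirely on where the scale puts its zero
ratio_c = hot_c / base_c
ratio_f = celsius_to_fahrenheit(hot_c) / celsius_to_fahrenheit(base_c)
ratio_k = celsius_to_kelvin(hot_c) / celsius_to_kelvin(base_c)
print(f"30/15 in Celsius:    {ratio_c:.3f}")
print(f"same in Fahrenheit:  {ratio_f:.3f}")
print(f"same in Kelvin:      {ratio_k:.3f}")

# Ratio of differences: the same on every scale
diff_ratio_c = (hot_c - mild_c) / (mild_c - base_c)
diff_ratio_f = (celsius_to_fahrenheit(hot_c) - celsius_to_fahrenheit(mild_c)) / (
    celsius_to_fahrenheit(mild_c) - celsius_to_fahrenheit(base_c)
)
diff_ratio_k = (celsius_to_kelvin(hot_c) - celsius_to_kelvin(mild_c)) / (
    celsius_to_kelvin(mild_c) - celsius_to_kelvin(base_c)
)
print(f"difference ratios: {diff_ratio_c:.3f}, {diff_ratio_f:.3f}, {diff_ratio_k:.3f}")
```

The "twice as hot" ratio changes with the unit, while the ratio of differences does not. Only the Kelvin ratio reflects a physical proportion, because Kelvin has a true zero.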
Population vs Sample — The Most Important Distinction
Every dataset you will ever work with is either a population or a sample. Getting this wrong leads to the wrong formulas, the wrong conclusions, and sometimes catastrophically bad decisions.
Story — the hospital again
The hospital manager has data on every patient visit last year — all 50,000. That is the population for the question "how did we perform last year?" She has every data point. She uses population formulas (divide by N).
But now she wants to know: do patients in general — across the entire NHS — prefer morning or evening appointments? She cannot collect data from every NHS patient. So she surveys 800 patients from her hospital. That 800 is a sample. She uses sample formulas (divide by n−1) and her conclusions come with uncertainty — they might not perfectly reflect all NHS patients.
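The uncertainty that comes with a sample can be made visible with a small simulation. The sketch below invents a population of 100,000 patients in which 60% prefer morning appointments, then draws several samples of 800. Both the population size and the 60% share are assumptions chosen for illustration, not real NHS figures.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1 = prefers morning, 0 = prefers evening
population = [1] * 60_000 + [0] * 40_000
true_share = sum(population) / len(population)

# Draw five independent samples of 800 patients and estimate the share from each
estimates = []
for _ in range(5):
    sample = random.sample(population, 800)
    estimates.append(sum(sample) / len(sample))

print(f"True share preferring morning: {true_share:.3f}")
for i, estimate in enumerate(estimates, start=1):
    print(f"Sample {i} estimate: {estimate:.3f}")
```

Each sample gives a slightly different answer, and none is guaranteed to match the population exactly. That spread between samples is the uncertainty a sample-based conclusion carries.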
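Take eight values, measured in minutes: [12, 18, 25, 8, 34, 21, 15, 19]
Mean: (12+18+25+8+34+21+15+19) / 8 = 152 / 8 = 19 minutes
Sum of squared deviations: Σ(x−x̄)² = 49 + 1 + 36 + 121 + 225 + 4 + 16 + 0 = 452
Population variance: Σ(x−x̄)² / N = 452 / 8 = 56.5
Sample variance: Σ(x−x̄)² / (n−1) = 452 / 7 ≈ 64.6
If these eight values are every case you care about, use population formulas (÷ N = 8). If they are a sample from a large hospital, use sample formulas (÷ n−1 = 7). Using the wrong one gives a different variance, a different standard deviation, and different conclusions.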
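The arithmetic for these eight values can be checked with Python's standard `statistics` module, which provides separate functions for the population and sample versions. This is a minimal sketch; the variable names are chosen here for illustration.

```python
import statistics

values = [12, 18, 25, 8, 34, 21, 15, 19]  # minutes

mean = statistics.mean(values)
sum_sq_dev = sum((x - mean) ** 2 for x in values)

pop_var = statistics.pvariance(values)    # divides by N
sample_var = statistics.variance(values)  # divides by n - 1
pop_sd = statistics.pstdev(values)        # square root of pop_var
sample_sd = statistics.stdev(values)      # square root of sample_var

print(f"Mean: {mean}")
print(f"Sum of squared deviations: {sum_sq_dev}")
print(f"Population variance: {pop_var:.2f}  (std dev {pop_sd:.2f})")
print(f"Sample variance:     {sample_var:.2f}  (std dev {sample_sd:.2f})")
```

The sample versions are always slightly larger than the population versions, and the gap shrinks as n grows.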
Population vs Sample — More Real-World Examples
| Scenario | Population | Sample | Type used |
|---|---|---|---|
| UK general election poll | All 46 million eligible voters | 1,000 people surveyed | Sample |
| Company employee satisfaction | All 200 employees | All 200 surveyed | Population |
| Average height of adult men in UK | ~25 million adult men | 500 men measured | Sample |
| Netflix monthly views for one film | Every view logged in their system | Every view logged | Population |
| Quality check — chocolate bars | All bars produced (millions) | 200 bars tested per shift | Sample |
| Exam scores in a class of 30 | All 30 students | All 30 scored | Population |
Even if your database has 10 million rows, it is rarely every person who has ever bought your product, used your app, or been affected by your model's decisions. Unless you have genuinely captured every possible case, treat your data as a sample and use sample formulas — divide variance and standard deviation by (n − 1), not n.
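In practice the choice between n and n − 1 is often made silently by library defaults. The sketch below compares NumPy and pandas on the same eight values used earlier: `np.std` divides by n unless told otherwise, while `Series.std` divides by n − 1. The `ddof` argument controls this in both libraries.

```python
import numpy as np
import pandas as pd

values = [12, 18, 25, 8, 34, 21, 15, 19]
series = pd.Series(values)

numpy_default = np.std(values)         # ddof=0: population formula
pandas_default = series.std()          # ddof=1: sample formula

numpy_sample = np.std(values, ddof=1)  # NumPy asked for the sample formula
pandas_population = series.std(ddof=0) # pandas asked for the population formula

print(f"np.std default:      {numpy_default:.3f}")
print(f"Series.std default:  {pandas_default:.3f}")
print(f"np.std(ddof=1):      {numpy_sample:.3f}")
print(f"Series.std(ddof=0):  {pandas_population:.3f}")
```

Passing `ddof` explicitly, rather than relying on the default, keeps the intent visible to whoever reads the code next.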
Python — Identifying Data Types and Basics
Loading and inspecting data types
import pandas as pd
import numpy as np
# Create a realistic hospital dataset
data = {
    'patient_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
    'blood_type': ['A', 'O', 'B', 'AB', 'O', 'A', 'O', 'B'],  # Nominal
    'pain_level': [3, 7, 5, 2, 8, 4, 6, 1],  # Ordinal (1–10 scale)
    'temperature_c': [37.1, 38.4, 36.9, 39.2, 37.8, 36.5, 38.1, 37.3],  # Interval
    'weight_kg': [72, 85, 61, 93, 78, 55, 88, 70],  # Ratio
    'stay_days': [2, 5, 1, 8, 3, 1, 4, 2],  # Ratio
}
df = pd.DataFrame(data)
print(df.dtypes)
print("\n", df.head())
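A pandas dtype describes how a column is stored, not its measurement level, so numeric-looking columns such as an identifier or a pain score will be averaged without complaint. The sketch below shows one possible convention for recording the level explicitly. It uses a shortened, hypothetical version of the hospital table, and the name `pain_scale` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1001, 1002, 1003, 1004],
    "blood_type": ["A", "O", "B", "AB"],
    "pain_level": [3, 7, 5, 2],
    "weight_kg": [72, 85, 61, 93],
})

# pandas will average an identifier, although the result means nothing
meaningless_mean = df["patient_id"].mean()
print(f"Mean patient_id (not meaningful): {meaningless_mean}")

# Identifier: store as text so arithmetic is no longer offered
df["patient_id"] = df["patient_id"].astype(str)

# Nominal: unordered category
df["blood_type"] = df["blood_type"].astype("category")

# Ordinal: ordered category covering the full 1-10 scale
pain_scale = pd.CategoricalDtype(categories=list(range(1, 11)), ordered=True)
df["pain_level"] = df["pain_level"].astype(pain_scale)

print(df.dtypes)

# Only genuinely numeric columns remain candidates for mean and std dev
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print("Numeric columns:", numeric_cols)
```

After the conversion, a call such as `df.describe()` summarises only `weight_kg` with mean and standard deviation, which matches the table of valid statistics per data type.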
Descriptive stats for each data type
import pandas as pd
from scipy import stats
data = {
    'blood_type': ['A', 'O', 'B', 'AB', 'O', 'A', 'O', 'B', 'A', 'O'],
    'pain_level': [3, 7, 5, 2, 8, 4, 6, 1, 5, 7],
    'temperature_c': [37.1, 38.4, 36.9, 39.2, 37.8, 36.5, 38.1, 37.3, 37.0, 38.9],
    'weight_kg': [72, 85, 61, 93, 78, 55, 88, 70, 66, 82],
}
df = pd.DataFrame(data)
# ── Nominal: blood_type — only mode and frequency make sense ──
print("=== Nominal: Blood Type ===")
print(df['blood_type'].value_counts())
print(f"Mode: {df['blood_type'].mode()[0]}")
# ── Ordinal: pain_level — mode and median are appropriate ──
print("\n=== Ordinal: Pain Level ===")
print(f"Mode: {df['pain_level'].mode()[0]}")
print(f"Median: {df['pain_level'].median()}")
# ── Interval/Ratio: temperature and weight — full stats ──
print("\n=== Ratio: Weight (kg) ===")
print(f"Mean: {df['weight_kg'].mean():.1f} kg")
print(f"Median: {df['weight_kg'].median():.1f} kg")
print(f"Std Dev: {df['weight_kg'].std():.1f} kg") # sample std (ddof=1)
print(f"Min: {df['weight_kg'].min()} kg