
What is Descriptive Statistics?

Descriptive statistics summarises and describes data so you can understand it at a glance. Learn what it is, the types of data, the difference between population and sample, and why it is the essential first step in every data science project.

Section 01

The Story That Starts Everything

It is 7 am on a Monday. A hospital manager walks into her office and finds a spreadsheet with 50,000 rows — every patient visit from the past year. Name, age, diagnosis, treatment, cost, duration, outcome. Her board meeting starts in one hour and they want to know: how is the hospital performing?

She cannot read 50,000 rows in an hour. She cannot hand the spreadsheet to the board. What she needs is a way to compress 50,000 data points into a handful of meaningful numbers — numbers that tell the story of the data without showing every detail.

That is exactly what Descriptive Statistics does.

💡
What the manager reports in one hour

Average patient age: 47. Most common diagnosis: respiratory (28%). Average hospital stay: 3.2 days. Average treatment cost: £1,840. Patient outcome success rate: 94.3%. Longest wait time recorded: 18 hours. These six numbers summarise 50,000 rows. That is descriptive statistics.


Section 02

What is Descriptive Statistics?

Descriptive statistics is the branch of statistics that organises, summarises, and describes a dataset using numbers and charts. It does not make predictions. It does not test hypotheses. It simply answers one question: What does this data look like?

Central tendency
📍
  • Mean, Median, Mode
  • Where is the centre?
  • What is typical?
Spread
↔️
  • Range, Variance, Std Dev
  • How spread out is it?
  • How consistent is it?
Shape
📈
  • Skewness, Kurtosis
  • What shape is it?
  • Are there outliers?
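
As a rough illustration of the three families above, the sketch below computes one or two measures from each with pandas. The `stays` values are made up for the example (hypothetical lengths of stay in days), with one long stay included so the shape measures have something to show.

```python
import pandas as pd

# Illustrative values only: hypothetical hospital stays in days
stays = pd.Series([2, 5, 1, 8, 3, 1, 4, 2, 3, 21])

summary = {
    # Central tendency: where is the centre?
    "mean": stays.mean(),
    "median": stays.median(),
    "mode": stays.mode().tolist(),   # a dataset can have several modes
    # Spread: how far do values sit from the centre?
    "range": stays.max() - stays.min(),
    "variance": stays.var(),         # pandas default is the sample variance (ddof=1)
    "std_dev": stays.std(),
    # Shape: is the distribution lopsided or heavy-tailed?
    "skewness": stays.skew(),
    "kurtosis": stays.kurt(),        # excess kurtosis (a normal distribution scores 0)
}

for name, value in summary.items():
    print(f"{name:>9}: {value}")
```

The single 21-day stay pulls the mean well above the median, which is the kind of pattern the shape measures are meant to flag.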
📐
Descriptive vs Inferential Statistics

Descriptive statistics describes the data you have. Inferential statistics uses that data to make predictions or draw conclusions about data you have not seen. Descriptive always comes first — you cannot make good inferences from data you have not properly described.


Section 03

The Four Types of Data

Before you can choose which descriptive statistic to calculate, you must know what type of data you are working with. The wrong statistic on the wrong data type produces meaningless results.

All data falls into one of four types, arranged in a hierarchy from least to most informative:

📊 The Four Data Types — Nominal → Ordinal → Interval → Ratio
Nominal
Categories with no order. Labels only — you can count them but you cannot rank them or do arithmetic on them.

Examples: blood type (A, B, AB, O), eye colour, country of birth, programming language, gender, car brand.

Valid stats: Mode, frequency, percentage. You cannot calculate a mean blood type.
Ordinal
Categories with a meaningful order, but the gaps between them are not equal or known.

Examples: customer satisfaction (1–5 stars), pain scale (1–10), education level (GCSE, A-Level, Degree, Masters), film ratings (Poor, Average, Good, Excellent).

Valid stats: Mode, median, percentiles. The gap between 1 star and 2 stars is not necessarily the same as the gap between 4 stars and 5 stars — so mean is unreliable.
Interval
Numeric with equal gaps between values, but no true zero — zero does not mean "none of it."

Examples: temperature in Celsius or Fahrenheit, IQ scores, calendar year (year 0 does not mean "no time"), SAT scores.

Valid stats: Mean, median, mode, std dev, correlation. You can say 30°C is 10° hotter than 20°C, but you cannot say 30°C is "twice as hot" as 15°C. Only ratios of differences are meaningful, not ratios of values.
Ratio
Numeric with equal gaps AND a true zero — zero genuinely means "none of it." The most informative type.

Examples: height, weight, age, salary, distance, time elapsed, number of purchases, temperature in Kelvin.

Valid stats: All statistics. You can say someone who earns £60,000 earns twice as much as someone earning £30,000. That ratio is meaningful because zero salary means no salary at all.
⚠️
The most common data type mistake

Treating ordinal data as if it were ratio. A customer satisfaction survey numbered 1 to 10 looks like numbers, but the difference between a 3 and a 4 is not the same emotional distance as between an 8 and a 9. Calculating a mean satisfaction score of 7.3 and treating it like a precise measurement is a common and misleading error in business reporting.
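
One way to see why the mean is fragile on ordinal data is to relabel the categories without changing their order. The sketch below uses two hypothetical products with made-up star ratings, and a second coding (`recode`, an assumption for this example) that keeps the ranking but stretches the gap at the top. The mean changes its verdict on which product is better; the median does not.

```python
from statistics import mean, median

# Hypothetical star ratings (1 to 5) for two products
product_a = [3, 3, 4, 4, 4]
product_b = [1, 2, 3, 5, 5]

# An alternative coding with the same order but a stretched top gap.
# For ordinal data this relabelling is just as legitimate as 1 to 5.
recode = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}
a_recoded = [recode[r] for r in product_a]
b_recoded = [recode[r] for r in product_b]

print("Mean,   coding 1-5:      ", mean(product_a), "vs", mean(product_b))
print("Mean,   stretched coding:", mean(a_recoded), "vs", mean(b_recoded))
print("Median, coding 1-5:      ", median(product_a), "vs", median(product_b))
print("Median, stretched coding:", median(a_recoded), "vs", median(b_recoded))
```

Because the median depends only on order, it gives the same answer under any order-preserving coding, which is why it is the safer summary for ordinal scales.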


Section 04

Data Types — Real Examples Side by Side

Type     | Real example                       | Can rank? | Equal gaps? | True zero? | Best stat
Nominal  | Favourite colour: Red, Blue, Green | No        | No          | No         | Mode
Ordinal  | Movie rating: ★ ★★ ★★★ ★★★★ ★★★★★  | Yes       | No          | No         | Median
Interval | Temperature: 0°C, 10°C, 20°C, 30°C | Yes       | Yes         | No         | Mean
Ratio    | Salary: £0, £25k, £50k, £100k      | Yes       | Yes         | Yes        | Mean + all
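
The hierarchy can also be written down as a small lookup. This is only a sketch: `LEVELS`, `ADDED_STATS`, and `valid_stats` are hypothetical names invented for this example, and the lists simply restate the "valid stats" given for each type earlier in the lesson.

```python
# Hypothetical helper: measurement levels in order, each adding statistics
# on top of everything the lower levels already allow.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

ADDED_STATS = {
    "nominal":  ["mode", "frequency", "percentage"],
    "ordinal":  ["median", "percentiles"],
    "interval": ["mean", "std dev", "correlation"],
    "ratio":    ["ratios of values"],
}

def valid_stats(level: str) -> list[str]:
    """Return every statistic that is meaningful at the given level."""
    position = LEVELS.index(level)  # raises ValueError for an unknown level
    allowed = []
    for lower_level in LEVELS[: position + 1]:
        allowed.extend(ADDED_STATS[lower_level])
    return allowed

for level in LEVELS:
    print(f"{level:<8} -> {', '.join(valid_stats(level))}")
```

Each level inherits everything below it, so a ratio variable can use the mode and median as well as the mean.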

Another way to remember it — the thermometer test

💡
The zero test — interval vs ratio

Ask: does zero mean "none of this thing exists"?

0°C — does it mean "no temperature"? No — it still has temperature, just at the freezing point of water. So temperature in Celsius is Interval.

0 kg — does it mean "no weight"? Yes — zero weight means nothing is there. So weight is Ratio.

0 goals scored — does it mean "no goals"? Yes. So goals scored is Ratio.
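
The zero test can be checked numerically: on a ratio scale a ratio survives a change of unit, while on an interval scale it does not. The sketch below uses two small conversion helpers written for this example (`celsius_to_kelvin` and `kg_to_pounds` are not from any library).

```python
def celsius_to_kelvin(celsius: float) -> float:
    return celsius + 273.15

def kg_to_pounds(kg: float) -> float:
    return kg * 2.20462

# Interval scale: the "ratio" depends on where the unit places its zero
print("30°C vs 15°C, ratio in Celsius:", 30 / 15)
print("Same temperatures in Kelvin:   ", celsius_to_kelvin(30) / celsius_to_kelvin(15))

# Ratio scale: the ratio is the same whichever unit you use
print("80 kg vs 40 kg, ratio in kg:   ", 80 / 40)
print("Same weights in pounds:        ", kg_to_pounds(80) / kg_to_pounds(40))
```

In Celsius the two temperatures appear to differ by a factor of 2, but in Kelvin the factor is only about 1.05. The weights differ by a factor of 2 in both units.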


Section 05

Population vs Sample — The Most Important Distinction

Every dataset you will ever work with is either a population or a sample. Getting this wrong leads to the wrong formulas, the wrong conclusions, and sometimes catastrophically bad decisions.

Population
Every member of the group you care about
Symbols: N (size), μ (mean), σ (std dev). Use when you have data for every single subject.
Sample
A subset selected from the population
Symbols: n (size), x̄ (mean), s (std dev). Use when you have data for only some subjects.

Story — the hospital again

The hospital manager has data on every patient visit last year — all 50,000. That is the population for the question "how did we perform last year?" She has every data point. She uses population formulas (divide by N).

But now she wants to know: do patients in general — across the entire NHS — prefer morning or evening appointments? She cannot collect data from every NHS patient. So she surveys 800 patients from her hospital. That 800 is a sample. She uses sample formulas (divide by n−1) and her conclusions come with uncertainty — they might not perfectly reflect all NHS patients.
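
To get a feel for that uncertainty, here is a small simulation. Everything in it is synthetic: the population of one million patients, the 60% who prefer mornings, and the random seed are assumptions made up for this sketch, and it assumes the 800 are drawn as a simple random sample.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic stand-in for "all patients": True = prefers a morning appointment.
# The 60% figure is invented for this sketch, not real data.
population = rng.random(1_000_000) < 0.60

# Survey 800 patients chosen at random, without asking anyone twice
sample = rng.choice(population, size=800, replace=False)

population_share = population.mean()
sample_share = sample.mean()

print(f"Population share preferring mornings: {population_share:.3f}")
print(f"Sample share (n = {sample.size}):           {sample_share:.3f}")
print(f"Gap between the two:                  {abs(sample_share - population_share):.3f}")
```

Changing the seed gives a slightly different sample share each time, which is the uncertainty that comes with working from a sample.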

🧮 Population vs Sample — How the Maths Changes
Dataset
Waiting times (minutes) for 8 patients: [12, 18, 25, 8, 34, 21, 15, 19]
Mean
Same formula for both:
(12+18+25+8+34+21+15+19) / 8 = 152 / 8 = 19 minutes
Variance
Squared deviations from the mean of 19: 49 + 1 + 36 + 121 + 225 + 4 + 16 + 0 = 452
Population variance: Σ(x−μ)² / N = 452 / 8 = 56.5
Sample variance: Σ(x−x̄)² / (n−1) = 452 / 7 ≈ 64.6
Why it matters
If these 8 patients are all the patients in a small clinic — use population formulas (÷ N = 8).
If they are a sample from a large hospital — use sample formulas (÷ n−1 = 7). Using the wrong one gives a different variance, different standard deviation, and different conclusions.
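
The hand calculation can be checked with Python's standard library. This is a minimal sketch: it computes the sum of squared deviations manually, then compares the result with `statistics.pvariance` (divide by N) and `statistics.variance` (divide by n − 1).

```python
from statistics import mean, pvariance, variance

waits = [12, 18, 25, 8, 34, 21, 15, 19]  # waiting times from the example above

centre = mean(waits)
squared_deviations = sum((x - centre) ** 2 for x in waits)

print(f"Mean:                        {centre}")
print(f"Sum of squared deviations:   {squared_deviations}")
print(f"Population variance (/ N):   {squared_deviations / len(waits):.1f}")
print(f"Sample variance (/ (n - 1)): {squared_deviations / (len(waits) - 1):.1f}")

# Cross-check against the library functions
print(f"statistics.pvariance:        {pvariance(waits):.1f}")
print(f"statistics.variance:         {variance(waits):.1f}")
```

The only difference between the two variance lines is the divisor, so the sample variance is always the larger of the two.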

Section 06

Population vs Sample — More Real-World Examples

Scenario                           | Population                        | Sample                   | Type used
UK general election poll           | All 46 million eligible voters    | 1,000 people surveyed    | Sample
Company employee satisfaction      | All 200 employees                 | All 200 surveyed         | Population
Average height of adult men in UK  | ~25 million adult men             | 500 men measured         | Sample
Netflix monthly views for one film | Every view logged in their system | Every view logged        | Population
Quality check: chocolate bars      | All bars produced (millions)      | 200 bars tested per shift | Sample
Exam scores in a class of 30       | All 30 students                   | All 30 scored            | Population
💡
In data science you almost always work with samples

Even if your database has 10 million rows, it is rarely every person who has ever bought your product, used your app, or been affected by your model's decisions. Unless you have genuinely captured every possible case, treat your data as a sample and use the sample formulas: divide the sum of squared deviations by (n − 1), not n, to get the variance, and take the square root of that to get the standard deviation.
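
In practice the divisor is controlled by the `ddof` argument, and the two libraries most people use do not share a default: `numpy.std` divides by n unless told otherwise, while pandas `Series.std` divides by n − 1. The sketch below shows both on a handful of illustrative weights.

```python
import numpy as np
import pandas as pd

weights = [72, 85, 61, 93, 78, 55, 88, 70]  # illustrative weights in kg

# NumPy defaults to the population formula (ddof=0, divide by n)
print("np.std default:        ", np.std(weights))
# Pass ddof=1 to get the sample formula (divide by n - 1)
print("np.std with ddof=1:    ", np.std(weights, ddof=1))

# pandas defaults to the sample formula (ddof=1)
series = pd.Series(weights)
print("Series.std default:    ", series.std())
print("Series.std with ddof=0:", series.std(ddof=0))
```

Passing `ddof` explicitly in both libraries avoids silently mixing population and sample results in the same report.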


Section 07

Python — Identifying Data Types and Basics

Loading and inspecting data types

import pandas as pd
import numpy as np

# Create a realistic hospital dataset
data = {
    'patient_id':    [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
    'blood_type':    ['A', 'O', 'B', 'AB', 'O', 'A', 'O', 'B'],          # Nominal
    'pain_level':    [3, 7, 5, 2, 8, 4, 6, 1],                           # Ordinal (1–10 scale)
    'temperature_c': [37.1, 38.4, 36.9, 39.2, 37.8, 36.5, 38.1, 37.3],  # Interval
    'weight_kg':     [72, 85, 61, 93, 78, 55, 88, 70],                    # Ratio
    'stay_days':     [2, 5, 1, 8, 3, 1, 4, 2],                           # Ratio
}

df = pd.DataFrame(data)
# dtypes reports how pandas stores each column, not its measurement level
print(df.dtypes)
print()
print(df.head())
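
pandas does not know that `pain_level` is ordinal; it only sees integers. One possible way to record the measurement level in the DataFrame itself is a categorical dtype, sketched below. Treating the pain scale as an ordered categorical over 1 to 10 is a modelling choice made for this example, and the exact error message depends on the pandas version.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({
    'blood_type': ['A', 'O', 'B', 'AB', 'O', 'A', 'O', 'B'],
    'pain_level': [3, 7, 5, 2, 8, 4, 6, 1],
})

# Nominal: an unordered categorical (labels only)
df['blood_type'] = df['blood_type'].astype('category')

# Ordinal: an ordered categorical covering the full 1-10 scale
pain_scale = CategoricalDtype(categories=list(range(1, 11)), ordered=True)
df['pain_level'] = df['pain_level'].astype(pain_scale)

print(df.dtypes)

# Order-based statistics still work on an ordered categorical
print("Lowest pain level: ", df['pain_level'].min())
print("Highest pain level:", df['pain_level'].max())

# Arithmetic statistics such as the mean are refused on categorical columns
try:
    df['pain_level'].mean()
except TypeError as error:
    print("mean() rejected:", error)
```

Declaring the level up front means an accidental `mean()` on an ordinal column fails loudly instead of returning a number that looks precise.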

Descriptive stats for each data type

import pandas as pd
from scipy import stats

data = {
    'blood_type':    ['A','O','B','AB','O','A','O','B','A','O'],
    'pain_level':    [3, 7, 5, 2, 8, 4, 6, 1, 5, 7],
    'temperature_c': [37.1, 38.4, 36.9, 39.2, 37.8, 36.5, 38.1, 37.3, 37.0, 38.9],
    'weight_kg':     [72, 85, 61, 93, 78, 55, 88, 70, 66, 82],
}
df = pd.DataFrame(data)

# ── Nominal: blood_type — only mode and frequency make sense ──
print("=== Nominal: Blood Type ===")
print(df['blood_type'].value_counts())
print(f"Mode: {df['blood_type'].mode()[0]}")

# ── Ordinal: pain_level — mode and median are appropriate ──
print("\n=== Ordinal: Pain Level ===")
print(f"Mode:   {df['pain_level'].mode()[0]}")
print(f"Median: {df['pain_level'].median()}")

# ── Interval/Ratio: temperature and weight — full stats ──
print("\n=== Ratio: Weight (kg) ===")
print(f"Mean:    {df['weight_kg'].mean():.1f} kg")
print(f"Median:  {df['weight_kg'].median():.1f} kg")
print(f"Std Dev: {df['weight_kg'].std():.1f} kg")  # sample std (ddof=1)
print(f"Min:     {df['weight_kg'].min()} kg")