Descriptive Statistics, Data Types

Section 01

The Story That Starts Everything

It is 7 am on a Monday. A hospital manager walks into her office and finds a spreadsheet with 50,000 rows — every patient visit from the past year. Name, age, diagnosis, treatment, cost, duration, outcome. Her board meeting starts in one hour and they want to know: how is the hospital performing?

She cannot read 50,000 rows in an hour. She cannot hand the spreadsheet to the board. What she needs is a way to compress 50,000 data points into a handful of meaningful numbers — numbers that tell the story of the data without showing every detail.

That is exactly what Descriptive Statistics does.

💡

What the manager reports in one hour

Average patient age: 47. Most common diagnosis: respiratory (28%). Average hospital stay: 3.2 days. Average treatment cost: £1,840. Patient outcome success rate: 94.3%. Longest wait time recorded: 18 hours. These six numbers summarise 50,000 rows. That is descriptive statistics.

Section 02

What is Descriptive Statistics?

Descriptive statistics is the branch of statistics that organises, summarises, and describes a dataset using numbers and charts. It does not make predictions. It does not test hypotheses. It simply answers one question: What does this data look like?

Central tendency

📍

Mean, Median, Mode
Where is the centre?
What is typical?

Spread

↔️

Range, Variance, Std Dev
How spread out is it?
How consistent is it?

Shape

📈

Skewness, Kurtosis
What shape is it?
Are there outliers?
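
The three families above each map onto a few pandas calls. Here is a minimal sketch, assuming a small made-up series of hospital stay lengths (the values are invented for illustration and are not taken from the manager's spreadsheet):

```python
import pandas as pd

# Hypothetical stay lengths in days, invented for illustration
stays = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 8, 14])

summary = {
    # Central tendency: where is the centre?
    "mean": stays.mean(),
    "median": stays.median(),
    "mode": stays.mode().iloc[0],
    # Spread: how far do values sit from the centre?
    "range": stays.max() - stays.min(),
    "variance": stays.var(),   # sample variance (ddof=1), the pandas default
    "std_dev": stays.std(),    # sample standard deviation (ddof=1)
    # Shape: is the distribution lopsided or heavy-tailed?
    "skewness": stays.skew(),
    "kurtosis": stays.kurt(),  # excess kurtosis, roughly 0 for a normal distribution
}

for name, value in summary.items():
    print(f"{name:>9}: {float(value):.2f}")
```

In this toy series the two long stays (8 and 14 days) pull the mean above the median and give a positive skewness, which is the kind of pattern the "shape" statistics are meant to flag.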

📐

Descriptive vs Inferential Statistics

Descriptive statistics describes the data you have. Inferential statistics uses that data to make predictions or draw conclusions about data you have not seen. Descriptive always comes first — you cannot make good inferences from data you have not properly described.

Section 03

The Four Types of Data

Before you can choose which descriptive statistic to calculate, you must know what type of data you are working with. The wrong statistic on the wrong data type produces meaningless results.

All data falls into one of four types, arranged in a hierarchy from least to most informative:

📊 The Four Data Types — Nominal → Ordinal → Interval → Ratio

Nominal

Categories with no order. Labels only — you can count them but you cannot rank them or do arithmetic on them.

Examples: blood type (A, B, AB, O), eye colour, country of birth, programming language, gender, car brand.

Valid stats: Mode, frequency, percentage. You cannot calculate a mean blood type.

Ordinal

Categories with a meaningful order, but the gaps between them are not equal or known.

Examples: customer satisfaction (1–5 stars), pain scale (1–10), education level (GCSE, A-Level, Degree, Masters), film ratings (Poor, Average, Good, Excellent).

Valid stats: Mode, median, percentiles. The gap between 1 star and 2 stars is not necessarily the same as the gap between 4 stars and 5 stars, so the mean is unreliable.

Interval

Numeric with equal gaps between values, but no true zero — zero does not mean "none of it."

Examples: temperature in Celsius or Fahrenheit, IQ scores, calendar year (year 0 does not mean "no time"), SAT scores.

Valid stats: Mean, median, mode, std dev, correlation. You can say 30°C is 10° hotter than 20°C, but you cannot say 30°C is "twice as hot" as 15°C. Only ratios of differences are meaningful, not ratios of values.

Ratio

Numeric with equal gaps AND a true zero — zero genuinely means "none of it." The most informative type.

Examples: height, weight, age, salary, distance, time elapsed, number of purchases, temperature in Kelvin.

Valid stats: All statistics. You can say someone who earns £60,000 earns twice as much as someone earning £30,000. That ratio is meaningful because zero salary means no salary at all.

⚠️

The most common data type mistake

Treating ordinal data as if it were ratio. A customer satisfaction survey numbered 1 to 10 looks like numbers, but the difference between a 3 and a 4 is not the same emotional distance as between an 8 and a 9. Calculating a mean satisfaction score of 7.3 and treating it like a precise measurement is a common and misleading error in business reporting.
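
One way to see why the mean is fragile on ordinal data: the numbers attached to the categories are only labels for an order, so any other labelling that keeps the same order is just as valid. The sketch below uses two made-up sets of star ratings and a hypothetical alternative coding (`recode`). The mean changes its verdict when the coding changes, while the median does not.

```python
import statistics

# Hypothetical 1-5 star ratings for two products (made-up data)
product_a = [2, 2, 2, 5, 5]
product_b = [3, 3, 3, 3, 3]

# An alternative coding that keeps the same order (1 < 2 < 3 < 4 < 5)
# but assumes the jump from 2 stars to 3 stars is much bigger than the others
recode = {1: 1, 2: 2, 3: 6, 4: 7, 5: 8}
a_recoded = [recode[r] for r in product_a]
b_recoded = [recode[r] for r in product_b]

for label, a, b in [("original coding", product_a, product_b),
                    ("recoded", a_recoded, b_recoded)]:
    mean_winner = "A" if statistics.mean(a) > statistics.mean(b) else "B"
    median_winner = "A" if statistics.median(a) > statistics.median(b) else "B"
    print(f"{label:>15}: mean favours {mean_winner}, median favours {median_winner}")
```

With the original coding the mean favours product A. With the recoded values it favours product B. The median favours B both times, because it depends only on the order of the ratings.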

Section 04

Data Types — Real Examples Side by Side

Type	Real example	Can rank?	Equal gaps?	True zero?	Best stat
Nominal	Favourite colour: Red, Blue, Green	No	No	No	Mode
Ordinal	Movie rating: ★ ★★ ★★★ ★★★★ ★★★★★	Yes	No	No	Median
Interval	Temperature: 0°C, 10°C, 20°C, 30°C	Yes	Yes	No	Mean
Ratio	Salary: £0, £25k, £50k, £100k	Yes	Yes	Yes	Mean + all
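
The table can be read as a lookup: each step up the hierarchy unlocks more statistics. A minimal sketch of that idea follows. The helper `summarise` and its `level` argument are hypothetical names invented for this example, and the coefficient of variation is included only as one illustration of a statistic that needs a true zero.

```python
import pandas as pd

def summarise(values, level):
    """Return statistics suited to the measurement level (hypothetical helper)."""
    s = pd.Series(values)
    result = {"mode": s.mode().iloc[0]}            # valid at every level
    if level in ("ordinal", "interval", "ratio"):
        result["median"] = float(s.median())       # needs a meaningful order
    if level in ("interval", "ratio"):
        result["mean"] = float(s.mean())           # needs equal gaps
        result["std_dev"] = float(s.std())
    if level == "ratio" and s.mean() != 0:
        # Dividing by the mean only makes sense when zero means "none"
        result["coeff_of_variation"] = float(s.std() / s.mean())
    return result

print(summarise(["Red", "Blue", "Red", "Green"], "nominal"))
print(summarise([1, 3, 3, 4, 5], "ordinal"))
print(summarise([0, 10, 20, 30], "interval"))
print(summarise([0, 25_000, 50_000, 100_000], "ratio"))
```

The level has to be supplied by the analyst. Nothing in the values themselves tells the code whether `[1, 3, 3, 4, 5]` is a star rating or a count.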

Another way to remember it — the thermometer test

💡

The zero test — interval vs ratio

Ask: does zero mean "none of this thing exists"?

0°C — does it mean "no temperature"? No — it still has temperature, just at the freezing point of water. So temperature in Celsius is Interval.

0 kg — does it mean "no weight"? Yes — zero weight means nothing is there. So weight is Ratio.

0 goals scored — does it mean "no goals"? Yes. So goals scored is Ratio.
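
The zero test can be checked numerically. If a ratio between two values changes when you switch units, the scale has no true zero. The sketch below converts the same two temperatures into Fahrenheit and Kelvin using the standard conversion formulas (the helper names `c_to_f` and `c_to_k` are chosen for this example):

```python
def c_to_f(celsius):
    """Convert Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32

def c_to_k(celsius):
    """Convert Celsius to Kelvin."""
    return celsius + 273.15

warm_c, cool_c = 30.0, 15.0

# The "twice as hot" claim only appears in Celsius
print(f"Celsius ratio:    {warm_c / cool_c:.3f}")
print(f"Fahrenheit ratio: {c_to_f(warm_c) / c_to_f(cool_c):.3f}")
print(f"Kelvin ratio:     {c_to_k(warm_c) / c_to_k(cool_c):.3f}")

# A ratio-scale quantity keeps its ratio under a unit change
heavy_kg, light_kg = 60.0, 30.0
kg_to_lb = 2.20462  # approximate pounds per kilogram
print(f"Weight ratio in kg: {heavy_kg / light_kg:.3f}")
print(f"Weight ratio in lb: {(heavy_kg * kg_to_lb) / (light_kg * kg_to_lb):.3f}")
```

The Celsius ratio is 2, but it drops to roughly 1.46 in Fahrenheit and roughly 1.05 in Kelvin. The weight ratio stays at 2 in both units. Kelvin is the only one of the three temperature scales where zero means "no thermal energy", so it is the only one whose ratio is physically meaningful.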

Section 05

Population vs Sample — The Most Important Distinction

Every dataset you will ever work with is either a population or a sample. Getting this wrong leads to the wrong formulas, the wrong conclusions, and sometimes catastrophically bad decisions.

Population

Every member of the group you care about

Symbols: N (size), μ (mean), σ (std dev). Use when you have data for every single subject.

Sample

A subset selected from the population

Symbols: n (size), x̄ (mean), s (std dev). Use when you have data for only some subjects.

Story — the hospital again

The hospital manager has data on every patient visit last year — all 50,000. That is the population for the question "how did we perform last year?" She has every data point. She uses population formulas (divide by N).

But now she wants to know: do patients in general — across the entire NHS — prefer morning or evening appointments? She cannot collect data from every NHS patient. So she surveys 800 patients from her hospital. That 800 is a sample. She uses sample formulas (divide by n−1) and her conclusions come with uncertainty — they might not perfectly reflect all NHS patients.

🧮 Population vs Sample — How the Maths Changes

Dataset

Waiting times (minutes) for 8 patients: [12, 18, 25, 8, 34, 21, 15, 19]

Mean

Same formula for both:
(12+18+25+8+34+21+15+19) / 8 = 152 / 8 = 19 minutes

Variance

Population variance: Σ(x−μ)² / N = 452 / 8 = 56.5
Sample variance: Σ(x−x̄)² / (n−1) = 452 / 7 ≈ 64.6

Why it matters

If these 8 patients are all the patients in a small clinic — use population formulas (÷ N = 8).
If they are a sample from a large hospital — use sample formulas (÷ n−1 = 7). Using the wrong one gives a different variance, different standard deviation, and different conclusions.
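
The arithmetic above is easy to check by hand or in code. Here is a minimal sketch using only Python's standard library: it computes the sum of squared deviations for the eight waiting times, then cross-checks against `statistics.pvariance` (population) and `statistics.variance` (sample).

```python
import math
import statistics

waits = [12, 18, 25, 8, 34, 21, 15, 19]
n = len(waits)

mean = statistics.mean(waits)                  # 19
sum_sq = sum((x - mean) ** 2 for x in waits)   # 452

pop_var = sum_sq / n          # divide by N     -> 56.5
samp_var = sum_sq / (n - 1)   # divide by n - 1 -> about 64.57

# Cross-check against the standard library
assert math.isclose(pop_var, statistics.pvariance(waits))
assert math.isclose(samp_var, statistics.variance(waits))

print(f"Sum of squared deviations: {sum_sq}")
print(f"Population variance: {pop_var:.2f}, std dev: {math.sqrt(pop_var):.2f}")
print(f"Sample variance:     {samp_var:.2f}, std dev: {math.sqrt(samp_var):.2f}")
```

The only difference between the two results is the denominator. The gap is noticeable here because n is small, and it shrinks as n grows.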

Section 06

Population vs Sample — More Real-World Examples

Scenario	Population	Sample	Type used
UK general election poll	All 46 million eligible voters	1,000 people surveyed	Sample
Company employee satisfaction	All 200 employees	All 200 surveyed	Population
Average height of adult men in UK	~25 million adult men	500 men measured	Sample
Netflix monthly views for one film	Every view logged in their system	Every view logged	Population
Quality check — chocolate bars	All bars produced (millions)	200 bars tested per shift	Sample
Exam scores in a class of 30	All 30 students	All 30 scored	Population

💡

In data science you almost always work with samples

Even if your database has 10 million rows, it is rarely every person who has ever bought your product, used your app, or been affected by your model's decisions. Unless you have genuinely captured every possible case, treat your data as a sample and use sample formulas: compute the variance with (n − 1) in the denominator, not n, and take the standard deviation as its square root.
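
A practical trap follows from this: NumPy and pandas use different defaults. `np.std` and `np.var` default to `ddof=0` (population formulas), while `Series.std` and `Series.var` in pandas default to `ddof=1` (sample formulas). The sketch below runs both on the same eight waiting times used earlier:

```python
import numpy as np
import pandas as pd

waits = [12, 18, 25, 8, 34, 21, 15, 19]
series = pd.Series(waits)

np_default = np.std(waits)      # ddof=0: population standard deviation
pd_default = series.std()       # ddof=1: sample standard deviation

# Either library can produce either version if you set ddof explicitly
np_sample = np.std(waits, ddof=1)
pd_population = series.std(ddof=0)

print(f"NumPy default  (ddof=0): {np_default:.3f}")
print(f"pandas default (ddof=1): {pd_default:.3f}")
print(f"NumPy with ddof=1:       {np_sample:.3f}")
print(f"pandas with ddof=0:      {pd_population:.3f}")
```

Passing `ddof` explicitly every time is a cheap way to make the choice visible to whoever reads the code later.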

Section 07

Python — Identifying Data Types and Basics

Loading and inspecting data types

import pandas as pd
import numpy as np

# Create a realistic hospital dataset
data = {
    'patient_id':    [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
    'blood_type':    ['A', 'O', 'B', 'AB', 'O', 'A', 'O', 'B'],          # Nominal
    'pain_level':    [3, 7, 5, 2, 8, 4, 6, 1],                           # Ordinal (1–10 scale)
    'temperature_c': [37.1, 38.4, 36.9, 39.2, 37.8, 36.5, 38.1, 37.3],  # Interval
    'weight_kg':     [72, 85, 61, 93, 78, 55, 88, 70],                    # Ratio
    'stay_days':     [2, 5, 1, 8, 3, 1, 4, 2],                           # Ratio
}

df = pd.DataFrame(data)
print(df.dtypes)
print()
print(df.head())
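
Note that `df.dtypes` reports storage types (`object`, `int64`, `float64`), not measurement levels: `pain_level` shows up as an integer even though it is ordinal. One option is to record the level yourself with pandas categoricals. The sketch below rebuilds two of the columns so that it runs on its own. Treating the pain scale as the full range 1 to 10 is an assumption based on the comment in the dataset above.

```python
import pandas as pd

df = pd.DataFrame({
    'blood_type': ['A', 'O', 'B', 'AB', 'O', 'A', 'O', 'B'],
    'pain_level': [3, 7, 5, 2, 8, 4, 6, 1],
})

# Nominal: an unordered category
df['blood_type'] = df['blood_type'].astype('category')

# Ordinal: an ordered category covering the whole 1-10 scale
pain_scale = pd.CategoricalDtype(categories=list(range(1, 11)), ordered=True)
df['pain_level'] = df['pain_level'].astype(pain_scale)

print(df.dtypes)

# Order-based operations work on the ordered column
print("Lowest pain level: ", df['pain_level'].min())
print("Highest pain level:", df['pain_level'].max())
print("Patients at 5 or above:", int((df['pain_level'] >= 5).sum()))

# Arithmetic summaries are refused, which guards against a meaningless mean
try:
    df['pain_level'].mean()
except TypeError as err:
    print("mean() refused:", err)
```

The trade-off is that a categorical column no longer behaves like a number, so you need to convert it back deliberately (for example with `.astype(int)`) when a numeric summary really is wanted.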

Descriptive stats for each data type

import pandas as pd
from scipy import stats

data = {
    'blood_type':    ['A','O','B','AB','O','A','O','B','A','O'],
    'pain_level':    [3, 7, 5, 2, 8, 4, 6, 1, 5, 7],
    'temperature_c': [37.1, 38.4, 36.9, 39.2, 37.8, 36.5, 38.1, 37.3, 37.0, 38.9],
    'weight_kg':     [72, 85, 61, 93, 78, 55, 88, 70, 66, 82],
}
df = pd.DataFrame(data)

# ── Nominal: blood_type — only mode and frequency make sense ──
print("=== Nominal: Blood Type ===")
print(df['blood_type'].value_counts())
print(f"Mode: {df['blood_type'].mode()[0]}")

# ── Ordinal: pain_level — mode and median are appropriate ──
print("\n=== Ordinal: Pain Level ===")
print(f"Mode:   {df['pain_level'].mode()[0]}")
print(f"Median: {df['pain_level'].median()}")

# ── Interval/Ratio: temperature and weight — full stats ──
print("\n=== Ratio: Weight (kg) ===")
print(f"Mean:    {df['weight_kg'].mean():.1f} kg")
print(f"Median:  {df['weight_kg'].median():.1f} kg")
print(f"Std Dev: {df['weight_kg'].std():.1f} kg")  # sample std (ddof=1)
print(f"Min:     {df['weight_kg'].min()} kg")