Quartiles, IQR and Outlier Detection

Section 01

The Story That Explains Everything

You are an HR analyst at a tech company. The CEO asks you to report the "typical" salary. You pull the data for 12 employees:

Employee	Role	Salary (£)
1	Junior Developer	32,000
2	Junior Developer	34,000
3	Junior Developer	33,000
4	Mid Developer	48,000
5	Mid Developer	51,000
6	Mid Developer	49,000
7	Senior Developer	72,000
8	Senior Developer	75,000
9	Senior Developer	78,000
10	Lead Developer	95,000
11	Engineering Manager	120,000
12	CTO	380,000

The mean salary is £88,917. You report this to the CEO. Every junior developer in the room is confused: they earn about £33,000, yet the "average" is £88,917? The CTO's £380,000 salary has dragged the mean far above what anyone except the three highest earners actually takes home.

⚠️

The Problem with Mean and Outliers

The CTO's salary is an outlier — a value so extreme it distorts the mean. This is why data scientists never rely on the mean alone. Quartiles and IQR give you a far more honest picture of where the middle of your data actually sits — and they tell you exactly which values are outliers.
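A quick check of the two summaries, as a minimal sketch using only Python's standard-library `statistics` module (the variable names are illustrative, not part of the original analysis):

```python
import statistics

# Salaries from the table above, in pounds
salaries = [32000, 34000, 33000, 48000, 51000, 49000,
            72000, 75000, 78000, 95000, 120000, 380000]

mean_salary = statistics.mean(salaries)      # pulled upwards by the CTO
median_salary = statistics.median(salaries)  # middle of the sorted values

# How many employees actually earn more than the mean?
above_mean = sum(1 for s in salaries if s > mean_salary)

print(f"Mean:   £{mean_salary:,.0f}")
print(f"Median: £{median_salary:,.0f}")
print(f"Employees above the mean: {above_mean} of {len(salaries)}")
```

Only the Lead Developer, the Engineering Manager and the CTO earn more than the mean, while the median sits between the mid-level and senior salaries.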

Section 02

What are Quartiles?

Quartiles split a sorted dataset into four parts, each containing roughly 25% of the data (exactly 25% only when the values divide evenly). Three cut points do the splitting:

Q1 (25th percentile)

Lower quartile
25% of data below
75% of data above

Q2 (50th percentile)

The median
50% of data below
50% of data above

Q3 (75th percentile)

Upper quartile
75% of data below
25% of data above

💡

Think of it like a race

Imagine 100 runners finishing a race in order. Q1 is the finishing time of runner 25 — the fastest quarter just finished. Q2 is runner 50 — exactly half done. Q3 is runner 75 — three quarters done. The middle 50% of runners (between Q1 and Q3) form the core of the pack. Stragglers far behind Q3 or sprinters far ahead of Q1 are the outliers.
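To see the four-way split concretely, the sketch below cuts the 12 salaries from Section 01 at the three quartiles and counts how many values land in each part. It assumes `statistics.quantiles` with its default settings, which is one of several quartile conventions; other conventions give slightly different cut points but the same idea.

```python
import statistics
from bisect import bisect_right

salaries = sorted([32000, 34000, 33000, 48000, 51000, 49000,
                   72000, 75000, 78000, 95000, 120000, 380000])

# Three cut points: Q1, Q2 (median), Q3
q1, q2, q3 = statistics.quantiles(salaries, n=4)

# Index in the sorted list where each cut point falls
cuts = [0] + [bisect_right(salaries, q) for q in (q1, q2, q3)] + [len(salaries)]

# Slice the sorted data into the four quarters
quarters = [salaries[a:b] for a, b in zip(cuts, cuts[1:])]

labels = ["Below Q1", "Q1 to Q2", "Q2 to Q3", "Above Q3"]
for label, part in zip(labels, quarters):
    print(f"{label}: {part}")
```

With 12 values, each quarter holds three salaries, and the CTO lands in the top quarter alongside the Lead Developer and the Engineering Manager.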

Section 03

What is the IQR?

The Interquartile Range (IQR) is the distance between Q1 and Q3. It covers the middle 50% of your data — the core, typical range — completely ignoring the extreme low and high values.

Interquartile Range

IQR = Q3 − Q1

The spread of the middle 50% of the data. Resistant to outliers — extreme values do not affect it.

📐

Why IQR is Better than Range

The full range (max − min) is destroyed by a single outlier. In our salary example, range = £380,000 − £32,000 = £348,000 — completely dominated by the CTO. The IQR ignores both extremes and tells you what the typical spread actually looks like for the middle majority.
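The contrast is easy to check numerically. This sketch computes the range and the IQR of the salary data with and without the CTO. The helper name `spread` is illustrative, and the quartiles come from `statistics.quantiles(..., method="inclusive")`, an assumed convention that can differ slightly from a hand calculation.

```python
import statistics

salaries = [32000, 33000, 34000, 48000, 49000, 51000,
            72000, 75000, 78000, 95000, 120000, 380000]

def spread(values):
    """Return (range, IQR) for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return max(values) - min(values), q3 - q1

range_all, iqr_all = spread(salaries)
range_no_cto, iqr_no_cto = spread([s for s in salaries if s != 380000])

print(f"With CTO:    range = £{range_all:,.0f}, IQR = £{iqr_all:,.0f}")
print(f"Without CTO: range = £{range_no_cto:,.0f}, IQR = £{iqr_no_cto:,.0f}")
```

Dropping one value shrinks the range by roughly three quarters, while the IQR moves by only a few thousand pounds.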

Section 04

Step-by-Step Calculation

Let us use a simpler dataset first: [4, 7, 9, 12, 15, 18, 21, 24, 28, 35, 38, 45]

🧮 Finding Q1, Q2, Q3 and IQR

Step 1

Sort the data (already sorted).
[4, 7, 9, 12, 15, 18, 21, 24, 28, 35, 38, 45]
n = 12 values

Step 2

Find Q2 (Median) — average of 6th and 7th values.
6th value = 18, 7th value = 21
Q2 = (18 + 21) / 2 = 19.5

Step 3

Find Q1 — median of the lower half [4, 7, 9, 12, 15, 18]
Average of 3rd and 4th values: (9 + 12) / 2 = 10.5

Step 4

Find Q3 — median of the upper half [21, 24, 28, 35, 38, 45]
Average of 3rd and 4th values: (28 + 35) / 2 = 31.5

Step 5

Calculate IQR.
IQR = Q3 − Q1 = 31.5 − 10.5 = 21.0
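The five steps can be wrapped in a small function. This is a sketch of the median-of-halves method used above; the function name `quartiles_by_halves` is illustrative, and for an odd number of values it assumes the middle value is left out of both halves, which is one common convention rather than the only one.

```python
import statistics

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)          # Step 1: sort
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[n - half:]        # values above it (skips the middle if n is odd)
    q2 = statistics.median(data)   # Step 2
    q1 = statistics.median(lower)  # Step 3
    q3 = statistics.median(upper)  # Step 4
    return q1, q2, q3

data = [4, 7, 9, 12, 15, 18, 21, 24, 28, 35, 38, 45]
q1, q2, q3 = quartiles_by_halves(data)
iqr = q3 - q1                      # Step 5
print(q1, q2, q3, iqr)  # 10.5 19.5 31.5 21.0
```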

Section 05

How to Detect Outliers Using IQR

The IQR method uses fences — boundaries beyond which any value is considered an outlier. This is the same method used by box plots.

Lower Fence

Q1 − 1.5 × IQR

Any value below this is a lower outlier

Upper Fence

Q3 + 1.5 × IQR

Any value above this is an upper outlier

💡

Why 1.5?

The 1.5 multiplier was popularised by statistician John Tukey in his 1977 book Exploratory Data Analysis. It is a practical convention rather than a derived constant: for normally distributed data the fences sit about 2.7 standard deviations from the mean, so only about 0.7% of values fall outside them, which is rare enough to flag. Some analyses use 3 × IQR to flag "extreme outliers" only.
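The 0.7% figure can be checked for a standard normal distribution with the standard library's `statistics.NormalDist`. This is a verification sketch under the assumption of perfectly normal data, not a derivation of the multiplier itself.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Quartiles of the standard normal distribution
q1 = std_normal.inv_cdf(0.25)   # about -0.674
q3 = std_normal.inv_cdf(0.75)   # about +0.674
iqr = q3 - q1                   # about 1.349

# Fences at 1.5 x IQR, measured in standard deviations
lower_fence = q1 - 1.5 * iqr    # about -2.698
upper_fence = q3 + 1.5 * iqr    # about +2.698

# Probability that a normal value falls outside the fences
p_outside = std_normal.cdf(lower_fence) + (1 - std_normal.cdf(upper_fence))

print(f"Fences: {lower_fence:.3f} and {upper_fence:.3f} standard deviations")
print(f"Share of values outside the fences: {p_outside:.2%}")
```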

Applying the fences to our dataset

🧮 Outlier Detection — Salary Data

Data

Sorted salaries (£000s):
[32, 33, 34, 48, 49, 51, 72, 75, 78, 95, 120, 380]

Lower half: [32, 33, 34, 48, 49, 51]
Q1 = (34 + 48) / 2 = 41

Upper half: [72, 75, 78, 95, 120, 380]
Q3 = (78 + 95) / 2 = 86.5

IQR

IQR = 86.5 − 41 = 45.5

Fences

Lower fence = 41 − (1.5 × 45.5) = 41 − 68.25 = −27.25 (no lower outliers)
Upper fence = 86.5 + (1.5 × 45.5) = 86.5 + 68.25 = 154.75

Result

Any salary above £154,750 is an outlier.
£380,000 (CTO) > £154,750 → OUTLIER ✓
£120,000 (Engineering Manager) < £154,750 → Normal ✓

✅

Result

The IQR method confirms what we suspected: the CTO's £380,000 salary is a statistical outlier. The median salary is £61,500. That is the honest "typical" salary to report, not the mean of £88,917, which the outlier inflated by about £26,500 (the mean of the other 11 salaries is £62,455).

Section 06

Python Implementation

Manual calculation

salaries = [32000, 33000, 34000, 48000, 49000, 51000,
            72000, 75000, 78000, 95000, 120000, 380000]

salaries_sorted = sorted(salaries)
n = len(salaries_sorted)

# Q2 (median): average of the two middle values.
# Note: this shortcut assumes n is even and each half has an even length,
# which holds here (n = 12, halves of 6).
mid = n // 2
q2 = (salaries_sorted[mid - 1] + salaries_sorted[mid]) / 2

# Q1: median of the lower half
lower_half = salaries_sorted[:mid]
q1 = (lower_half[len(lower_half)//2 - 1] + lower_half[len(lower_half)//2]) / 2

# Q3: median of the upper half
upper_half = salaries_sorted[mid:]
q3 = (upper_half[len(upper_half)//2 - 1] + upper_half[len(upper_half)//2]) / 2

iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"Q1: £{q1:,.0f}")
print(f"Q2: £{q2:,.0f}")
print(f"Q3: £{q3:,.0f}")
print(f"IQR: £{iqr:,.0f}")
print(f"Lower fence: £{lower_fence:,.0f}")
print(f"Upper fence: £{upper_fence:,.0f}")

outliers = [s for s in salaries if s < lower_fence or s > upper_fence]
print(f"Outliers: {outliers}")

# Q1: £41,000
# Q2: £61,500
# Q3: £86,500
# IQR: £45,500
# Lower fence: £-27,250
# Upper fence: £154,750
# Outliers: [380000]

Using NumPy

import numpy as np

salaries = [32000, 33000, 34000, 48000, 49000, 51000,
            72000, 75000, 78000, 95000, 120000, 380000]

q1  = np.percentile(salaries, 25)
q2  = np.percentile(salaries, 50)
q3  = np.percentile(salaries, 75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

salaries_arr = np.array(salaries)
outliers = salaries_arr[(salaries_arr < lower_fence) | (salaries_arr > upper_fence)]

print(f"Q1:  £{q1:,.0f}")
print(f"Q2:  £{q2:,.0f}")
print(f"Q3:  £{q3:,.0f}")
print(f"IQR: £{iqr:,.0f}")
print(f"Upper fence: £{upper_fence:,.0f}")
print(f"Outliers: {outliers}")

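# Q1:  £44,500
# Q2:  £61,500
# Q3:  £82,250
# IQR: £37,750
# Upper fence: £138,875
# Outliers: [380000]
#
# Note: np.percentile interpolates linearly between positions by default,
# so Q1 and Q3 differ from the median-of-halves values (£41,000 and £86,500).
# The flagged outlier is the same.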
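Quartile values depend on the convention used, which is why library results can differ from a hand calculation. The sketch below compares three conventions on the salary data using only the standard library: the median-of-halves method from Section 04 (the helper `halves` is illustrative and assumes an even number of values) and the two methods offered by `statistics.quantiles`. The `inclusive` method interpolates linearly between positions, the same rule `np.percentile` applies by default.

```python
import statistics

salaries = [32000, 33000, 34000, 48000, 49000, 51000,
            72000, 75000, 78000, 95000, 120000, 380000]

def halves(values):
    """Median-of-halves quartiles; assumes an even number of values."""
    data = sorted(values)
    mid = len(data) // 2
    return (statistics.median(data[:mid]),
            statistics.median(data),
            statistics.median(data[mid:]))

conventions = {
    "median of halves": halves(salaries),
    "inclusive (linear)": tuple(statistics.quantiles(salaries, n=4, method="inclusive")),
    "exclusive": tuple(statistics.quantiles(salaries, n=4, method="exclusive")),
}

# Only the upper fence is checked: every lower fence is negative for this data
results = {}
for name, (q1, q2, q3) in conventions.items():
    upper_fence = q3 + 1.5 * (q3 - q1)
    outliers = [s for s in salaries if s > upper_fence]
    results[name] = (q1, q3, upper_fence, outliers)
    print(f"{name:<20} Q1=£{q1:,.0f}  Q3=£{q3:,.0f}  "
          f"upper fence=£{upper_fence:,.0f}  outliers={outliers}")
```

The quartiles shift by a few thousand pounds between conventions, but all three flag only the CTO's salary.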

Using Pandas

import pandas as pd

salaries = [32000, 33000, 34000, 48000, 49000, 51000,
            72000, 75000, 78000, 95000, 120000, 380000]

df = pd.DataFrame({"salary": salaries})

q1  = df["salary"].quantile(0.25)
q3  = df["salary"].quantile(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag outliers as a new column
df["is_outlier"] = (df["salary"] < lower_fence) | (df["salary"] > upper_fence)

print(df)
print("\nOutlier rows:")
print(df[df["is_outlier"]])

Visualising with a Box Plot

import matplotlib.pyplot as plt

salaries = [32000, 33000, 34000, 48000, 49000, 51000,
            72000, 75000, 78000, 95000, 120000, 380000]

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Box plot: outliers are drawn as circles beyond the whiskers
# (vertical orientation is the default, so no orientation argument is needed)
axes[0].boxplot(salaries, patch_artist=True,
                boxprops=dict(facecolor="#3776AB", alpha=0.6))
axes[0].set_title("Salary Box Plot")
axes[0].set_ylabel("Salary (£)")
axes[0].yaxis.set_major_formatter(
    plt.FuncFormatter(lambda x, _: f"£{x:,.0f}")
)

# Same data without outlier for comparison
salaries_clean = [s for s in salaries if s <= 154750]
axes[1].boxplot(salaries_clean, patch_artist=True,
                boxprops=dict(facecolor="#34d399", alpha=0.6))
axes[1].set_title("Salaries Without Outlier")
axes[1].set_ylabel("Salary (£)")
axes[1].yaxis.set_major_formatter(
    plt.FuncFormatter(lambda x, _: f"£{x:,.0f}")
)

plt.suptitle("IQR Outlier Detection — Box Plots", fontsize=13)
plt.tight_layout()
plt.show()

Section 07

What To Do When You Find Outliers

Finding an outlier is only the first step. What you do next depends on why the outlier exists. There are four possible explanations:

Reason	Example	Action
Data entry error	Height recorded as 1700cm instead of 170cm	Fix or remove
Measurement error	Faulty sensor recorded 0°C on a summer day	Remove
Legitimate extreme value	CTO genuinely earns £380,000	Keep — report separately
Interesting anomaly	Fraudulent transaction flagged by model	Keep — it is the signal

⚠️

Never Blindly Remove Outliers

Removing outliers without understanding them is one of the most dangerous mistakes in data science. In fraud detection, the outliers are the thing you are trying to find. In medical research, removing extreme patient readings could hide a life-saving discovery. Always investigate first — then decide.

Capping outliers instead of removing them

import numpy as np

salaries = [32000, 33000, 34000, 48000, 49000, 51000,
            72000, 75000, 78000, 95000, 120000, 380000]

q1  = np.percentile(salaries, 25)
q3  = np.percentile(salaries, 75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Option 1 — Remove outliers
clean = [s for s in salaries if lower_fence <= s <= upper_fence]
print(f"After removal: {len(clean)} values, mean = £{sum(clean)/len(clean):,.0f}")

# Option 2: cap outliers at the fences (a Winsorisation-style approach)
# Replace each outlier with the nearest fence value instead of removing it
capped = [max(lower_fence, min(upper_fence, s)) for s in salaries]
print(f"After capping: mean = £{sum(capped)/len(capped):,.0f}")

# Option 3 — Use median (unaffected by outliers)
median = np.median(salaries)
print(f"Median (no action needed): £{median:,.0f}")

# After removal: 11 values, mean = £62,455
# After capping: mean = £68,823
# Median (no action needed): £61,500

Section 08

IQR vs Z-Score — Two Methods Compared

There are two main methods for detecting outliers. Knowing when to use each one is an important data science skill.

Property	IQR Method	Z-Score Method
Formula	Q1 − 1.5×IQR and Q3 + 1.5×IQR	z = (x − mean) / std dev
Threshold	Beyond the fences	|z| > 2 or |z| > 3
Affected by outliers?	No	Yes
Best for	Skewed data, unknown distribution	Normally distributed data
Use when	You are not sure how data is distributed	Data is confirmed to be normally distributed

import numpy as np

salaries = [32000, 33000, 34000, 48000, 49000, 51000,
            72000, 75000, 78000, 95000, 120000, 380000]

# ── IQR Method ──────────────────────────────────────────────
q1, q3  = np.percentile(salaries, [25, 75])
iqr     = q3 - q1
iqr_outliers = [s for s in salaries
                if s < q1 - 1.5*iqr or s > q3 + 1.5*iqr]
print(f"IQR outliers:     {iqr_outliers}")

# ── Z-Score Method ───────────────────────────────────────────
mean = np.mean(salaries)
std  = np.std(salaries, ddof=1)
z_scores = [(s - mean) / std for s in salaries]
z_outliers = [s for s, z in zip(salaries, z_scores) if abs(z) > 2]
print(f"Z-score outliers: {z_outliers}")

# IQR outliers:     [380000]
# Z-score outliers: [380000]

🎯

Which Should You Use?

Default to IQR when you are unsure — it is robust and makes no assumptions about the shape of your data. Use Z-score only when you have verified that your data follows a normal distribution. For salary, income, house price, or any right-skewed data, IQR is almost always the better choice.
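One way to see why the Z-score is "affected by outliers" is masking: extreme values inflate the mean and standard deviation that the Z-score itself relies on. The sketch below uses a hypothetical variant of the salary data in which the Engineering Manager's £120,000 is replaced by £350,000, so there are two very high salaries. The data change and the |z| > 3 threshold are assumptions chosen for illustration.

```python
import statistics

# Hypothetical variant of the salary data: two very high earners
salaries = [32000, 33000, 34000, 48000, 49000, 51000,
            72000, 75000, 78000, 95000, 350000, 380000]

# IQR method (quartiles by linear interpolation between positions)
q1, _, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1
iqr_outliers = [s for s in salaries
                if s < q1 - 1.5 * iqr or s > q3 + 1.5 * iqr]

# Z-score method with the sample standard deviation
mean = statistics.mean(salaries)
std = statistics.stdev(salaries)
z_scores = [(s - mean) / std for s in salaries]
z_outliers = [s for s, z in zip(salaries, z_scores) if abs(z) > 3]
max_abs_z = max(abs(z) for z in z_scores)

print(f"IQR outliers:     {iqr_outliers}")
print(f"Z-score outliers: {z_outliers}")
print(f"Largest |z|:      {max_abs_z:.2f}")
```

Here the IQR fences flag both high salaries, while no Z-score exceeds 3, because the two extreme values inflate the standard deviation they are measured against.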

Section 09

Complete Pipeline — Detect, Investigate, Handle

import numpy as np
import pandas as pd

# Full dataset — house prices in a neighbourhood (£)
prices = [185000, 192000, 178000, 205000, 195000, 210000,
          188000, 201000, 198000, 185000, 2100000, 175000]

df = pd.DataFrame({"price": prices})

# Step 1 — Calculate quartiles and IQR
q1  = df["price"].quantile(0.25)
q2  = df["price"].quantile(0.50)
q3  = df["price"].quantile(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print("── Summary Statistics ──────────────────")
print(f"Q1 (25th percentile): £{q1:,.0f}")
print(f"Q2 (Median):          £{q2:,.0f}")
print(f"Q3 (75th percentile): £{q3:,.0f}")
print(f"IQR:                  £{iqr:,.0f}")
print(f"Lower fence:          £{lower_fence:,.0f}")
print(f"Upper fence:          £{upper_fence:,.0f}")

# Step 2 — Flag outliers
df["is_outlier"] = (
    (df["price"] < lower_fence) |
    (df["price"] > upper_fence)
)

print("\n── Outliers Detected ───────────────────")
print(df[df["is_outlier"]])

# Step 3 — Compare mean before and after
mean_with    = df["price"].mean()
mean_without = df[~df["is_outlier"]]["price"].mean()
median       = df["price"].median()

print("\n── Impact of Outlier ───────────────────")
print(f"Mean with outlier:    £{mean_with:,.0f}")
print(f"Mean without outlier: £{mean_without:,.0f}")
print(f"Median (robust):      £{median:,.0f}")

# ── Summary Statistics ──────────────────
# Q1 (25th percentile): £185,000
# Q2 (Median):          £193,500
# Q3 (75th percentile): £202,000
# IQR:                  £17,000
# Lower fence:          £159,500
# Upper fence:          £227,500
#
# ── Outliers Detected ───────────────────
#       price  is_outlier
# 10  2100000        True
#
# ── Impact of Outlier ───────────────────
# Mean with outlier:    £351,000
# Mean without outlier: £192,000
# Median (robust):      £193,500

Section 10

Golden Rules

🎯 Quartiles, IQR and Outliers — Key Rules

Always sort your data before calculating quartiles by hand. The method depends on position, so unsorted data gives wrong quartile values. Library functions such as np.percentile and the pandas quantile method handle the ordering internally.

The IQR is resistant to outliers: a few extreme values barely move Q1 or Q3, because they sit outside the middle 50%. This makes IQR far more reliable than range or standard deviation for describing spread in skewed data.

The 1.5 × IQR fence is a guideline, not a law. Some domains use 2× or 3× IQR for stricter definitions. Always apply domain knowledge when deciding the right threshold.

Investigate before removing. An outlier is a question, not a mistake. Ask why it exists before deciding whether to remove, cap, transform, or keep it.

Use the median and IQR together as your default summary for skewed data — they are both robust to outliers. Reserve mean and standard deviation for data that is approximately normally distributed.
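Since the fence multiplier is a choice rather than a fixed rule, it can help to make it a parameter. The sketch below is one way to do that; the function name `iqr_outliers` and the parameter `k` are illustrative, and quartiles come from `statistics.quantiles(..., method="inclusive")`, an assumed convention.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond Q1 or Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

salaries = [32000, 33000, 34000, 48000, 49000, 51000,
            72000, 75000, 78000, 95000, 120000, 380000]

standard = iqr_outliers(salaries)         # usual 1.5 x IQR fences
extreme = iqr_outliers(salaries, k=3.0)   # stricter 3 x IQR fences

print(f"Beyond 1.5 x IQR: {standard}")
print(f"Beyond 3 x IQR:   {extreme}")
```

The CTO's salary is flagged even at 3 × IQR, which places it among the "extreme outliers" mentioned in Section 05.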