
Quartiles, IQR and Outlier Detection

Learn what quartiles and the Interquartile Range (IQR) are, how to calculate them step by step, and how to use the IQR method to detect and handle outliers in your data — with real stories and Python code.

Section 01

The Story That Explains Everything

You are an HR analyst at a tech company. The CEO asks you to report the "typical" salary. You pull the data for 12 employees:

Employee  Role                 Salary (£)
1         Junior Developer     32,000
2         Junior Developer     34,000
3         Junior Developer     33,000
4         Mid Developer        48,000
5         Mid Developer        51,000
6         Mid Developer        49,000
7         Senior Developer     72,000
8         Senior Developer     75,000
9         Senior Developer     78,000
10        Lead Developer       95,000
11        Engineering Manager  120,000
12        CTO                  380,000

The mean salary is £88,917. You report this to the CEO. Every junior developer in the room is confused: they earn between £32,000 and £34,000, yet the "average" is £88,917? The CTO's £380,000 salary has dragged the mean far above what anyone except the three highest earners actually makes.

⚠️
The Problem with Mean and Outliers

The CTO's salary is an outlier — a value so extreme it distorts the mean. This is why data scientists never rely on the mean alone. Quartiles and IQR give you a far more honest picture of where the middle of your data actually sits — and they tell you exactly which values are outliers.
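
The gap between the two summaries is easy to check. The following is a minimal sketch using only Python's standard `statistics` module on the twelve salaries above; the variable names are illustrative.

```python
from statistics import mean, median

# The twelve salaries from the table above (£)
salaries = [32000, 34000, 33000, 48000, 51000, 49000,
            72000, 75000, 78000, 95000, 120000, 380000]

mean_salary = mean(salaries)
median_salary = median(salaries)

# How many employees actually earn more than the mean?
above_mean = [s for s in salaries if s > mean_salary]

print(f"Mean:   £{mean_salary:,.0f}")
print(f"Median: £{median_salary:,.0f}")
print(f"Employees earning above the mean: {len(above_mean)} of {len(salaries)}")

# Mean:   £88,917
# Median: £61,500
# Employees earning above the mean: 3 of 12
```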


Section 02

What are Quartiles?

Quartiles split a sorted dataset into four parts, each containing roughly 25% of the data (exactly 25% only when the number of values divides evenly). There are three quartile values:

Q1
25th %ile
  • Lower quartile
  • 25% of data below
  • 75% of data above
Q2
50th %ile
  • The median
  • 50% of data below
  • 50% of data above
Q3
75th %ile
  • Upper quartile
  • 75% of data below
  • 25% of data above
💡
Think of it like a race

Imagine 100 runners finishing a race in order. Q1 is the finishing time of runner 25 — the fastest quarter just finished. Q2 is runner 50 — exactly half done. Q3 is runner 75 — three quarters done. The middle 50% of runners (between Q1 and Q3) form the core of the pack. Stragglers far behind Q3 or sprinters far ahead of Q1 are the outliers.
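
To make the analogy concrete, here is a small sketch with 100 made-up finishing times (runner 1 finishes in 1 minute, runner 2 in 2 minutes, and so on; the data is purely illustrative). It uses `statistics.quantiles` with `method="inclusive"`, which is one of several quartile conventions, so the cut points land between runners rather than exactly on runner 25, 50 and 75.

```python
from statistics import quantiles

# Hypothetical finishing times in minutes, one per runner, already sorted
times = list(range(1, 101))

# n=4 returns the three cut points Q1, Q2, Q3
q1, q2, q3 = quantiles(times, n=4, method="inclusive")

# Count how many runners fall into each quarter
quarters = [
    sum(t <= q1 for t in times),
    sum(q1 < t <= q2 for t in times),
    sum(q2 < t <= q3 for t in times),
    sum(t > q3 for t in times),
]

print(f"Q1={q1}, Q2={q2}, Q3={q3}")
print(f"Runners per quarter: {quarters}")

# Q1=25.75, Q2=50.5, Q3=75.25
# Runners per quarter: [25, 25, 25, 25]
```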


Section 03

What is the IQR?

The Interquartile Range (IQR) is the distance between Q1 and Q3. It covers the middle 50% of your data — the core, typical range — completely ignoring the extreme low and high values.

Interquartile Range
IQR = Q3 − Q1
The spread of the middle 50% of the data. Resistant to outliers — extreme values do not affect it.
📐
Why IQR is Better than Range

The full range (max − min) is destroyed by a single outlier. In our salary example, range = £380,000 − £32,000 = £348,000 — completely dominated by the CTO. The IQR ignores both extremes and tells you what the typical spread actually looks like for the middle majority.
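
One way to see this resistance is to change the CTO's salary and watch which statistic moves. The sketch below assumes the median-of-halves quartile method; the helper names and the alternative CTO salary of £150,000 are hypothetical.

```python
from statistics import median

def value_range(values):
    """Full range: max minus min."""
    return max(values) - min(values)

def iqr(values):
    """IQR via median of halves (the middle value is left out when n is odd)."""
    s = sorted(values)
    half = len(s) // 2
    return median(s[-half:]) - median(s[:half])

others = [32000, 33000, 34000, 48000, 49000, 51000,
          72000, 75000, 78000, 95000, 120000]

actual = others + [380000]         # actual CTO salary
hypothetical = others + [150000]   # hypothetical lower CTO salary

print(f"CTO on £380,000: range = £{value_range(actual):,.0f}, IQR = £{iqr(actual):,.0f}")
print(f"CTO on £150,000: range = £{value_range(hypothetical):,.0f}, IQR = £{iqr(hypothetical):,.0f}")

# CTO on £380,000: range = £348,000, IQR = £45,500
# CTO on £150,000: range = £118,000, IQR = £45,500
```

The range falls by £230,000 when one salary changes, while the IQR does not move at all.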


Section 04

Step-by-Step Calculation

Let us use a simpler dataset first: [4, 7, 9, 12, 15, 18, 21, 24, 28, 35, 38, 45]

🧮 Finding Q1, Q2, Q3 and IQR
Step 1
Sort the data (already sorted).
[4, 7, 9, 12, 15, 18, 21, 24, 28, 35, 38, 45]
n = 12 values
Step 2
Find Q2 (Median) — average of 6th and 7th values.
6th value = 18   7th value = 21
Q2 = (18 + 21) / 2 = 19.5
Step 3
Find Q1 — median of the lower half [4, 7, 9, 12, 15, 18]
Average of 3rd and 4th values: (9 + 12) / 2 = 10.5
Step 4
Find Q3 — median of the upper half [21, 24, 28, 35, 38, 45]
Average of 3rd and 4th values: (28 + 35) / 2 = 31.5
Step 5
Calculate IQR.
IQR = Q3 − Q1 = 31.5 − 10.5 = 21.0
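
The five steps above can be sketched as a small reusable function. This is one possible implementation of the median-of-halves method, not the only convention: the function name is illustrative, and the handling of odd-length data (dropping the middle value from both halves) is an assumption that some textbooks treat differently.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumption: when the number of values is odd, the middle value is
    left out of both halves. Some textbooks include it in both instead.
    """
    s = sorted(values)                      # Step 1: sort
    half = len(s) // 2
    lower, upper = s[:half], s[-half:]      # s[-half:] skips the middle value when n is odd
    return median(lower), median(s), median(upper)

data = [4, 7, 9, 12, 15, 18, 21, 24, 28, 35, 38, 45]
q1, q2, q3 = quartiles(data)
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {q3 - q1}")

# Q1 = 10.5, Q2 = 19.5, Q3 = 31.5, IQR = 21.0
```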

Section 05

How to Detect Outliers Using IQR

The IQR method uses fences — boundaries beyond which any value is considered an outlier. This is the same method used by box plots.

Lower Fence
Q1 − 1.5 × IQR
Any value below this is a lower outlier
Upper Fence
Q3 + 1.5 × IQR
Any value above this is an upper outlier
💡
Why 1.5?

The 1.5 multiplier was popularised by statistician John Tukey, notably in his 1977 book Exploratory Data Analysis. It is a practical convention rather than a derived constant. For normally distributed data, the fences sit about 2.7 standard deviations from the mean, so only about 0.7% of values fall outside them by chance, which makes them unusual enough to flag. Some analyses use 3 × IQR to flag only "extreme outliers".
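
The figure of roughly 0.7% can be checked directly for a normal distribution. The sketch below uses `statistics.NormalDist` from the standard library; the helper name `share_outside` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Quartiles of the standard normal distribution
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

def share_outside(k):
    """Fraction of a normal population that lies beyond the k × IQR fences."""
    upper_fence = q3 + k * iqr
    # The distribution is symmetric, so double the upper tail
    return 2 * (1 - std_normal.cdf(upper_fence))

print(f"Upper fence at k=1.5: {q3 + 1.5 * iqr:.3f} standard deviations")
print(f"Share flagged at k=1.5: {share_outside(1.5):.2%}")
print(f"Share flagged at k=3.0: {share_outside(3.0):.5%}")

# Upper fence at k=1.5: 2.698 standard deviations
# Share flagged at k=1.5: 0.70%
```

With k = 3 the flagged share drops to a few in a million, which is why that fence is reserved for extreme values.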

Applying the fences to our dataset

🧮 Outlier Detection — Salary Data
Data
Sorted salaries (£000s):
[32, 33, 34, 48, 49, 51, 72, 75, 78, 95, 120, 380]
Q1
Lower half: [32, 33, 34, 48, 49, 51]
Q1 = (34 + 48) / 2 = 41
Q3
Upper half: [72, 75, 78, 95, 120, 380]
Q3 = (78 + 95) / 2 = 86.5
IQR
IQR = 86.5 − 41 = 45.5
Fences
Lower fence = 41 − (1.5 × 45.5) = 41 − 68.25 = −27.25 (no lower outliers)
Upper fence = 86.5 + (1.5 × 45.5) = 86.5 + 68.25 = 154.75
Result
Any salary above £154,750 is an outlier.
£380,000 (CTO) > £154,750 → OUTLIER ✓
£120,000 (Engineering Manager) < £154,750 → Normal ✓
Result

The IQR method confirms what we suspected: the CTO's £380,000 salary is a statistical outlier. The median salary is £61,500. That is the honest "typical" salary you should report, not the mean of £88,917, which a single outlier pulled more than £27,000 above the median.


Section 06

Python Implementation

Manual calculation

salaries = [32000, 33000, 34000, 48000, 49000, 51000,
            72000, 75000, 78000, 95000, 120000, 380000]

salaries_sorted = sorted(salaries)
n = len(salaries_sorted)

# Q2: median (this indexing assumes an even number of values)
mid = n // 2
q2 = (salaries_sorted[mid - 1] + salaries_sorted[mid]) / 2

# Q1: median of the lower half (assumes the half also has an even length,
# which holds here because n = 12 is divisible by 4)
lower_half = salaries_sorted[:mid]
q1 = (lower_half[len(lower_half)//2 - 1] + lower_half[len(lower_half)//2]) / 2

# Q3: median of the upper half (same assumption)
upper_half = salaries_sorted[mid:]
q3 = (upper_half[len(upper_half)//2 - 1] + upper_half[len(upper_half)//2]) / 2

iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"Q1: £{q1:,.0f}")
print(f"Q2: £{q2:,.0f}")
print(f"Q3: £{q3:,.0f}")
print(f"IQR: £{iqr:,.0f}")
print(f"Lower fence: £{lower_fence:,.0f}")
print(f"Upper fence: £{upper_fence:,.0f}")

outliers = [s for s in salaries if s < lower_fence or s > upper_fence]
print(f"Outliers: {outliers}")

# Q1: £41,000
# Q2: £61,500
# Q3: £86,500
# IQR: £45,500
# Lower fence: £-27,250
# Upper fence: £154,750
# Outliers: [380000]

Using NumPy

import numpy as np

salaries = [32000, 33000, 34000, 48000, 49000, 51000,
            72000, 75000, 78000, 95000, 120000, 380000]

q1  = np.percentile(salaries, 25)
q2  = np.percentile(salaries, 50)
q3  = np.percentile(salaries, 75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

salaries_arr = np.array(salaries)
outliers = salaries_arr[(salaries_arr < lower_fence) | (salaries_arr > upper_fence)]

print(f"Q1:  £{q1:,.0f}")
print(f"Q2:  £{q2:,.0f}")
print(f"Q3:  £{q3:,.0f}")
print(f"IQR: £{iqr:,.0f}")
print(f"Upper fence: £{upper_fence:,.0f}")
print(f"Outliers: {outliers}")

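# Q1:  £44,500
# Q2:  £61,500
# Q3:  £82,250
# IQR: £37,750
# Upper fence: £138,875
# Outliers: [380000]
#
# Note: np.percentile interpolates linearly between data points by default,
# so Q1 and Q3 differ from the median-of-halves values (£41,000 and £86,500).
# The flagged outlier is the same either way.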
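
Hand calculation and library calls do not always agree on Q1 and Q3, because there are several accepted ways to interpolate a quartile. The sketch below compares three conventions on the salary data using only the standard library: the median-of-halves method from the manual calculation, and the two methods offered by `statistics.quantiles`. The `inclusive` method uses the same linear interpolation as NumPy's default. The dictionary layout is just for illustration.

```python
from statistics import median, quantiles

salaries = sorted([32000, 33000, 34000, 48000, 49000, 51000,
                   72000, 75000, 78000, 95000, 120000, 380000])

half = len(salaries) // 2
results = {
    # Median of the lower and upper halves (the hand method)
    "median of halves": (median(salaries[:half]), median(salaries[-half:])),
    # Linear interpolation between the first and last data points
    "inclusive": tuple(quantiles(salaries, n=4, method="inclusive")[::2]),
    # Treats the data as a sample from a larger population
    "exclusive": tuple(quantiles(salaries, n=4, method="exclusive")[::2]),
}

for name, (q1, q3) in results.items():
    upper_fence = q3 + 1.5 * (q3 - q1)
    flagged = [s for s in salaries if s > upper_fence]
    print(f"{name:<17} Q1=£{q1:,.0f}  Q3=£{q3:,.0f}  "
          f"upper fence=£{upper_fence:,.0f}  outliers={flagged}")

# median of halves  Q1=£41,000  Q3=£86,500  upper fence=£154,750  outliers=[380000]
# inclusive         Q1=£44,500  Q3=£82,250  upper fence=£138,875  outliers=[380000]
# exclusive         Q1=£37,500  Q3=£90,750  upper fence=£170,625  outliers=[380000]
```

The quartiles and fences shift with the convention, but on this dataset all three flag the same single outlier. On small datasets it is worth stating which convention was used.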

Using Pandas

import pandas as pd

salaries = [32000, 33000, 34000, 48000, 49000, 51000,
            72000, 75000, 78000, 95000, 120000, 380000]

df = pd.DataFrame({"salary": salaries})

q1  = df["salary"].quantile(0.25)
q3  = df["salary"].quantile(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag outliers as a new column
df["is_outlier"] = (df["salary"] < lower_fence) | (df["salary"] > upper_fence)

print(df)
print("\nOutlier rows:")
print(df[df["is_outlier"]])

Visualising with a Box Plot

import matplotlib.pyplot as plt
import numpy as np

salaries = [32000, 33000, 34000, 48000, 49000, 51000,
            72000, 75000, 78000, 95000, 120000, 380000]

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Box plot — outliers shown as circles beyond the whiskers
axes[0].boxplot(salaries, vert=True, patch_artist=True,
                boxprops=dict(facecolor="#3776AB", alpha=0.6))
axes[0].set_title("Salary Box Plot")
axes[0].set_ylabel("Salary (£)")
axes[0].yaxis.set_major_formatter(
    plt.FuncFormatter(lambda x, _: f"£{x:,.0f}")
)

# Same data without the outlier, for comparison
# (154750 is the upper fence from the manual calculation)
salaries_clean = [s for s in salaries if s <= 154750]
axes[1].boxplot(salaries_clean, vert=True, patch_artist=True,
                boxprops=dict(facecolor="#34d399", alpha=0.6))
axes[1].set_title("Salaries Without Outlier")
axes[1].set_ylabel("Salary (£)")
axes[1].yaxis.set_major_formatter(
    plt.FuncFormatter(lambda x, _: f"£{x:,.0f}")
)

plt.suptitle("IQR Outlier Detection — Box Plots", fontsize=13)
plt.tight_layout()
plt.show()

Section 07

What To Do When You Find Outliers

Finding an outlier is only the first step. What you do next depends on why the outlier exists. There are four possible explanations:

Reason                    Example                                     Action
Data entry error          Height recorded as 1700cm instead of 170cm  Fix or remove
Measurement error         Faulty sensor recorded 0°C on a summer day  Remove
Legitimate extreme value  CTO genuinely earns £380,000                Keep (report separately)
Interesting anomaly       Fraudulent transaction flagged by model     Keep (it is the signal)
⚠️
Never Blindly Remove Outliers

Removing outliers without understanding them is one of the most dangerous mistakes in data science. In fraud detection, the outliers are the thing you are trying to find. In medical research, removing extreme patient readings could hide a life-saving discovery. Always investigate first — then decide.

Capping outliers instead of removing them

import numpy as np

salaries = [32000, 33000, 34000, 48000, 49000, 51000,
            72000, 75000, 78000, 95000, 120000, 380000]

q1  = np.percentile(salaries, 25)
q3  = np.percentile(salaries, 75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Option 1 — Remove outliers
clean = [s for s in salaries if lower_fence <= s <= upper_fence]
print(f"After removal: {len(clean)} values, mean = £{sum(clean)/len(clean):,.0f}")

# Option 2 — Cap outliers (Winsorisation)
# Replace outlier with the fence value instead of removing
capped = [max(lower_fence, min(upper_fence, s)) for s in salaries]
print(f"After capping: mean = £{sum(capped)/len(capped):,.0f}")

# Option 3 — Use median (unaffected by outliers)
median = np.median(salaries)
print(f"Median (no action needed): £{median:,.0f}")

# After removal: 11 values, mean = £62,455
# After capping: mean = £68,823
# Median (no action needed): £61,500

Section 08

IQR vs Z-Score — Two Methods Compared

There are two main methods for detecting outliers. Knowing when to use each one is an important data science skill.

Property               IQR Method                                Z-Score Method
Formula                Q1 − 1.5×IQR and Q3 + 1.5×IQR             z = (x − mean) / std dev
Threshold              Beyond the fences                         |z| > 2 or |z| > 3
Affected by outliers?  No                                        Yes
Best for               Skewed data, unknown distribution         Normally distributed data
Use when               You are not sure how data is distributed  Data is confirmed to be normally distributed

import numpy as np

salaries = [32000, 33000, 34000, 48000, 49000, 51000,
            72000, 75000, 78000, 95000, 120000, 380000]

# ── IQR Method ──────────────────────────────────────────────
q1, q3  = np.percentile(salaries, [25, 75])
iqr     = q3 - q1
iqr_outliers = [s for s in salaries
                if s < q1 - 1.5*iqr or s > q3 + 1.5*iqr]
print(f"IQR outliers:     {iqr_outliers}")

# ── Z-Score Method ───────────────────────────────────────────
mean = np.mean(salaries)
std  = np.std(salaries, ddof=1)
z_scores = [(s - mean) / std for s in salaries]
z_outliers = [s for s, z in zip(salaries, z_scores) if abs(z) > 2]
print(f"Z-score outliers: {z_outliers}")

# IQR outliers:     [380000]
# Z-score outliers: [380000]
🎯
Which Should You Use?

Default to IQR when you are unsure — it is robust and makes no assumptions about the shape of your data. Use Z-score only when you have verified that your data follows a normal distribution. For salary, income, house price, or any right-skewed data, IQR is almost always the better choice.
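
The "affected by outliers" row in the comparison is worth seeing in action. Because the mean and standard deviation are themselves inflated by extreme values, a second extreme value can hide the first from a z-score test. The sketch below assumes a threshold of |z| > 3, the sample standard deviation, and median-of-halves quartiles; the second executive salary of £400,000 is invented for illustration.

```python
from statistics import mean, median, stdev

def z_outliers(values, threshold=3):
    """Values whose |z| exceeds the threshold (sample standard deviation)."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

def iqr_outliers(values, k=1.5):
    """Values beyond the k × IQR fences, using median-of-halves quartiles."""
    data = sorted(values)
    half = len(data) // 2
    q1, q3 = median(data[:half]), median(data[-half:])
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

salaries = [32000, 33000, 34000, 48000, 49000, 51000,
            72000, 75000, 78000, 95000, 120000, 380000]
# Hypothetical: a second executive on £400,000 joins the company
with_second_exec = salaries + [400000]

print("One extreme salary")
print(f"  z-score (|z| > 3): {z_outliers(salaries)}")
print(f"  IQR fences:        {iqr_outliers(salaries)}")
print("Two extreme salaries")
print(f"  z-score (|z| > 3): {z_outliers(with_second_exec)}")
print(f"  IQR fences:        {iqr_outliers(with_second_exec)}")

# One extreme salary
#   z-score (|z| > 3): [380000]
#   IQR fences:        [380000]
# Two extreme salaries
#   z-score (|z| > 3): []
#   IQR fences:        [380000, 400000]
```

With two extreme salaries, the standard deviation grows enough that neither exceeds three standard deviations, while the IQR fences still catch both.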


Section 09

Complete Pipeline — Detect, Investigate, Handle

import numpy as np
import pandas as pd

# Full dataset — house prices in a neighbourhood (£)
prices = [185000, 192000, 178000, 205000, 195000, 210000,
          188000, 201000, 198000, 185000, 2100000, 175000]

df = pd.DataFrame({"price": prices})

# Step 1 — Calculate quartiles and IQR
q1  = df["price"].quantile(0.25)
q2  = df["price"].quantile(0.50)
q3  = df["price"].quantile(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print("── Summary Statistics ──────────────────")
print(f"Q1 (25th percentile): £{q1:,.0f}")
print(f"Q2 (Median):          £{q2:,.0f}")
print(f"Q3 (75th percentile): £{q3:,.0f}")
print(f"IQR:                  £{iqr:,.0f}")
print(f"Lower fence:          £{lower_fence:,.0f}")
print(f"Upper fence:          £{upper_fence:,.0f}")

# Step 2 — Flag outliers
df["is_outlier"] = (
    (df["price"] < lower_fence) |
    (df["price"] > upper_fence)
)

print("\n── Outliers Detected ───────────────────")
print(df[df["is_outlier"]])

# Step 3 — Compare mean before and after
mean_with    = df["price"].mean()
mean_without = df[~df["is_outlier"]]["price"].mean()
median       = df["price"].median()

print("\n── Impact of Outlier ───────────────────")
print(f"Mean with outlier:    £{mean_with:,.0f}")
print(f"Mean without outlier: £{mean_without:,.0f}")
print(f"Median (robust):      £{median:,.0f}")

# ── Summary Statistics ──────────────────
# Q1 (25th percentile): £185,000
# Q2 (Median):          £193,500
# Q3 (75th percentile): £202,000
# IQR:                  £17,000
# Lower fence:          £159,500
# Upper fence:          £227,500
#
# ── Outliers Detected ───────────────────
#       price  is_outlier
# 10  2100000        True
#
# ── Impact of Outlier ───────────────────
# Mean with outlier:    £351,000
# Mean without outlier: £192,000
# Median (robust):      £193,500

Section 10

Golden Rules

🎯 Quartiles, IQR and Outliers — Key Rules
1
Always sort your data before calculating quartiles. The entire method depends on position — unsorted data gives completely wrong quartile values.
2
The IQR is resistant to outliers — extreme values do not affect Q1 or Q3 because they sit outside the middle 50%. This makes IQR far more reliable than range or standard deviation for describing spread in skewed data.
3
The 1.5 × IQR fence is a guideline, not a law. Some domains use 2× or 3× IQR for stricter definitions. Always apply domain knowledge when deciding the right threshold.
4
Investigate before removing. An outlier is a question, not a mistake. Ask why it exists before deciding whether to remove, cap, transform, or keep it.
5
Use the median and IQR together as your default summary for skewed data — they are both robust to outliers. Reserve mean and standard deviation for data that is approximately normally distributed.
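
Rule 3 can be sketched as a two-level classification, with the fence multipliers exposed as parameters so they can be tuned per domain. The function name, the "mild" and "extreme" labels, and the £200,000 test value are illustrative assumptions; the quartiles are the ones from the hand calculation on the salary data.

```python
def classify(value, q1, q3, mild_k=1.5, extreme_k=3.0):
    """Label a value by how far it sits outside the quartiles."""
    iqr = q3 - q1
    # Distance beyond the nearer quartile; 0 when the value is inside the box
    distance = max(q1 - value, value - q3, 0)
    if distance > extreme_k * iqr:
        return "extreme outlier"
    if distance > mild_k * iqr:
        return "mild outlier"
    return "normal"

# Quartiles from the hand calculation on the salary data
q1, q3 = 41000, 86500

for salary in [120000, 200000, 380000]:   # 200000 is a hypothetical value
    print(f"£{salary:,}: {classify(salary, q1, q3)}")

# £120,000: normal
# £200,000: mild outlier
# £380,000: extreme outlier
```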