The Story That Explains Everything
You are an HR analyst at a tech company. The CEO asks you to report the "typical" salary. You pull the data for 12 employees:
| Employee | Role | Salary (£) |
|---|---|---|
| 1 | Junior Developer | 32,000 |
| 2 | Junior Developer | 34,000 |
| 3 | Junior Developer | 33,000 |
| 4 | Mid Developer | 48,000 |
| 5 | Mid Developer | 51,000 |
| 6 | Mid Developer | 49,000 |
| 7 | Senior Developer | 72,000 |
| 8 | Senior Developer | 75,000 |
| 9 | Senior Developer | 78,000 |
| 10 | Lead Developer | 95,000 |
| 11 | Engineering Manager | 120,000 |
| 12 | CTO | 380,000 |
The mean salary is £88,917. You report this to the CEO.
Every junior developer in the room is confused: they earn between £32,000 and £34,000, yet the "average" is £88,917? The CTO's £380,000 salary has dragged the mean far above what anyone except the three highest-paid people actually earns.
The CTO's salary is an outlier — a value so extreme it distorts the mean. This is why data scientists never rely on the mean alone. Quartiles and IQR give you a far more honest picture of where the middle of your data actually sits — and they tell you exactly which values are outliers.
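A quick way to check the figures in this story is to compute them directly. The sketch below uses only Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

# Salaries from the table above, in table order
salaries = [32000, 34000, 33000, 48000, 51000, 49000,
            72000, 75000, 78000, 95000, 120000, 380000]

mean_salary = mean(salaries)
median_salary = median(salaries)

# How many people actually earn more than the mean?
above_mean = [s for s in salaries if s > mean_salary]

print(f"Mean:   £{mean_salary:,.0f}")      # Mean:   £88,917
print(f"Median: £{median_salary:,.0f}")    # Median: £61,500
print(f"Salaries above the mean: {len(above_mean)} of {len(salaries)}")  # 3 of 12
```

Only three of the twelve salaries sit above the mean, which is why it feels wrong as a "typical" figure.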
What are Quartiles?
Quartiles split a sorted dataset into four parts, each containing roughly 25% of the data. Three quartile values mark the boundaries:
- Q1 (lower quartile): 25% of the data lies below it, 75% above
- Q2 (the median): 50% of the data lies below it, 50% above
- Q3 (upper quartile): 75% of the data lies below it, 25% above
Imagine 100 runners finishing a race in order. Q1 is the finishing time of runner 25 — the fastest quarter just finished. Q2 is runner 50 — exactly half done. Q3 is runner 75 — three quarters done. The middle 50% of runners (between Q1 and Q3) form the core of the pack. Stragglers far behind Q3 or sprinters far ahead of Q1 are the outliers.
What is the IQR?
The Interquartile Range (IQR) is the distance between Q1 and Q3. It covers the middle 50% of your data — the core, typical range — completely ignoring the extreme low and high values.
The full range (max − min) is destroyed by a single outlier. In our salary example, range = £380,000 − £32,000 = £348,000 — completely dominated by the CTO. The IQR ignores both extremes and tells you what the typical spread actually looks like for the middle majority.
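To see the difference in robustness, the sketch below recomputes both measures after doubling the CTO's salary. The doubled figure is hypothetical and used only for illustration, and the helper `spread` is not part of any library. Quartiles are taken as the medians of the lower and upper halves, which assumes an even number of values.

```python
from statistics import median

def spread(salaries):
    """Return (range, IQR), with Q1 and Q3 taken as the medians of each half.

    Assumes an even number of values, as in the 12-person salary table.
    """
    data = sorted(salaries)
    half = len(data) // 2
    q1, q3 = median(data[:half]), median(data[half:])
    return data[-1] - data[0], q3 - q1

others = [32000, 33000, 34000, 48000, 49000, 51000,
          72000, 75000, 78000, 95000, 120000]

# 380000 is the real CTO salary; 760000 is a hypothetical doubled salary
for cto in (380000, 760000):
    full_range, iqr = spread(others + [cto])
    print(f"CTO £{cto:,}: range £{full_range:,}, IQR £{iqr:,.0f}")

# CTO £380,000: range £348,000, IQR £45,500
# CTO £760,000: range £728,000, IQR £45,500
```

The range more than doubles, while the IQR does not move at all.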
Step-by-Step Calculation
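Let us use a simpler dataset first:
[4, 7, 9, 12, 15, 18, 21, 24, 28, 35, 38, 45]
- Step 1: sort the data and count it. The list is already sorted, and n = 12 values.
- Step 2: find Q2, the median. With 12 values it is the average of the 6th value (18) and the 7th value (21): Q2 = (18 + 21) / 2 = 19.5
- Step 3: find Q1, the median of the lower half [4, 7, 9, 12, 15, 18]. Average of the 3rd and 4th values: Q1 = (9 + 12) / 2 = 10.5
- Step 4: find Q3, the median of the upper half [21, 24, 28, 35, 38, 45]. Average of the 3rd and 4th values: Q3 = (28 + 35) / 2 = 31.5
- Step 5: IQR = Q3 − Q1 = 31.5 − 10.5 = 21.0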
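The same steps can be wrapped in a small helper. This is a minimal sketch of the median-of-halves method, not a library function. It makes one assumption the text does not cover: when the number of values is odd, the middle value is left out of both halves. Other conventions exist and give slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumption: for an odd number of values, the middle value is
    excluded from both the lower and the upper half.
    """
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = data[:half]
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

data = [4, 7, 9, 12, 15, 18, 21, 24, 28, 35, 38, 45]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 10.5 19.5 31.5 21.0
```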
How to Detect Outliers Using IQR
The IQR method uses fences: boundaries beyond which any value is considered an outlier. The lower fence is Q1 − 1.5 × IQR and the upper fence is Q3 + 1.5 × IQR. This is the same method used by box plots.
The 1.5 multiplier was popularised by statistician John Tukey in his 1977 book Exploratory Data Analysis. For normally distributed data, the fences sit about 2.7 standard deviations from the mean, so only about 0.7% of values fall outside them by chance, which makes such values unusual enough to flag. Some analyses use 3 × IQR to flag "extreme outliers" only.
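The 0.7% figure can be checked numerically. The sketch below uses `statistics.NormalDist` from the standard library (Python 3.8+) to place the fences on a standard normal distribution and measure how much probability lies outside them. It only verifies the arithmetic for the normal case and says nothing about other distributions.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Quartiles of the standard normal distribution
q1 = z.inv_cdf(0.25)
q3 = z.inv_cdf(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Probability of landing below the lower fence or above the upper fence
p_outside = z.cdf(lower_fence) + (1 - z.cdf(upper_fence))

print(f"Upper fence: {upper_fence:.3f} standard deviations")
print(f"Share of values outside the fences: {p_outside:.2%}")
```

The upper fence lands at roughly 2.7 standard deviations and the share outside both fences at roughly 0.7%.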
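Applying the fences to our dataset
Salaries in £ thousands, sorted:
[32, 33, 34, 48, 49, 51, 72, 75, 78, 95, 120, 380]
- Lower half [32, 33, 34, 48, 49, 51]: Q1 = (34 + 48) / 2 = 41
- Upper half [72, 75, 78, 95, 120, 380]: Q3 = (78 + 95) / 2 = 86.5
- IQR = Q3 − Q1 = 86.5 − 41 = 45.5
- Lower fence = 41 − (1.5 × 45.5) = 41 − 68.25 = −27.25
- Upper fence = 86.5 + (1.5 × 45.5) = 86.5 + 68.25 = 154.75
- £380,000 (CTO) > £154,750 → OUTLIER ✓
- £120,000 (Engineering Manager) < £154,750 → Normal ✓
No salary can fall below a negative lower fence, so only the upper fence matters here.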
The IQR method confirms what we suspected: the CTO's £380,000 salary is a statistical outlier. The median salary is £61,500. That is the honest "typical" salary you should report, not the mean of £88,917, which the outlier inflated by about £26,500 (without the CTO, the mean is £62,455).
Python Implementation
Manual calculation
salaries = [32000, 33000, 34000, 48000, 49000, 51000,
72000, 75000, 78000, 95000, 120000, 380000]
salaries_sorted = sorted(salaries)
n = len(salaries_sorted)
# Q2: median. Averaging the two middle values assumes n is even, as it is here
mid = n // 2
q2 = (salaries_sorted[mid - 1] + salaries_sorted[mid]) / 2
# Q1: median of the lower half (which also has an even length here)
lower_half = salaries_sorted[:mid]
q1 = (lower_half[len(lower_half)//2 - 1] + lower_half[len(lower_half)//2]) / 2
# Q3: median of the upper half
upper_half = salaries_sorted[mid:]
q3 = (upper_half[len(upper_half)//2 - 1] + upper_half[len(upper_half)//2]) / 2
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(f"Q1: £{q1:,.0f}")
print(f"Q2: £{q2:,.0f}")
print(f"Q3: £{q3:,.0f}")
print(f"IQR: £{iqr:,.0f}")
print(f"Lower fence: £{lower_fence:,.0f}")
print(f"Upper fence: £{upper_fence:,.0f}")
outliers = [s for s in salaries if s < lower_fence or s > upper_fence]
print(f"Outliers: {outliers}")
# Q1: £41,000
# Q2: £61,500
# Q3: £86,500
# IQR: £45,500
# Lower fence: £-27,250
# Upper fence: £154,750
# Outliers: [380000]
Using NumPy
import numpy as np
salaries = [32000, 33000, 34000, 48000, 49000, 51000,
72000, 75000, 78000, 95000, 120000, 380000]
q1 = np.percentile(salaries, 25)
q2 = np.percentile(salaries, 50)
q3 = np.percentile(salaries, 75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
salaries_arr = np.array(salaries)
outliers = salaries_arr[(salaries_arr < lower_fence) | (salaries_arr > upper_fence)]
print(f"Q1: £{q1:,.0f}")
print(f"Q2: £{q2:,.0f}")
print(f"Q3: £{q3:,.0f}")
print(f"IQR: £{iqr:,.0f}")
print(f"Upper fence: £{upper_fence:,.0f}")
print(f"Outliers: {outliers}")
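# np.percentile interpolates linearly between data points by default, so Q1 and Q3
# differ from the median-of-halves values (£41,000 and £86,500) in the manual calculation.
# Q1: £44,500
# Q2: £61,500
# Q3: £82,250
# IQR: £37,750
# Upper fence: £138,875
# Outliers: [380000]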
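Quartiles have several competing definitions, so different tools can report different values for the same data. The sketch below compares three conventions using only the standard library. `statistics.quantiles` with `method="inclusive"` interpolates linearly in the same way as NumPy's default, while `method="exclusive"` is the `statistics` module's own default. The helper `upper_fence` is illustrative, and the hand-calculated quartiles are copied from the manual calculation above.

```python
from statistics import quantiles

salaries = [32000, 33000, 34000, 48000, 49000, 51000,
            72000, 75000, 78000, 95000, 120000, 380000]

def upper_fence(q1, q3):
    """Upper Tukey fence: Q3 + 1.5 x IQR."""
    return q3 + 1.5 * (q3 - q1)

# Quartiles from the hand calculation (median of each half)
hand_q1, hand_q3 = 41000, 86500

# "inclusive": linear interpolation between the minimum and maximum
inc_q1, _, inc_q3 = quantiles(salaries, n=4, method="inclusive")

# "exclusive": treats the data as a sample from a larger population
exc_q1, _, exc_q3 = quantiles(salaries, n=4, method="exclusive")

for name, q1, q3 in [("median of halves", hand_q1, hand_q3),
                     ("inclusive", inc_q1, inc_q3),
                     ("exclusive", exc_q1, exc_q3)]:
    fence = upper_fence(q1, q3)
    flagged = [s for s in salaries if s > fence]
    print(f"{name:>16}: Q1 £{q1:,.0f}, Q3 £{q3:,.0f}, "
          f"upper fence £{fence:,.0f}, outliers {flagged}")

# median of halves: Q1 £41,000, Q3 £86,500, upper fence £154,750, outliers [380000]
#        inclusive: Q1 £44,500, Q3 £82,250, upper fence £138,875, outliers [380000]
#        exclusive: Q1 £37,500, Q3 £90,750, upper fence £170,625, outliers [380000]
```

The quartiles shift by a few thousand pounds between conventions, but on this dataset every convention flags the same single outlier. On small datasets with borderline values that agreement is not guaranteed, so it helps to state which convention was used.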
Using Pandas
import pandas as pd
salaries = [32000, 33000, 34000, 48000, 49000, 51000,
72000, 75000, 78000, 95000, 120000, 380000]
df = pd.DataFrame({"salary": salaries})
q1 = df["salary"].quantile(0.25)
q3 = df["salary"].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
# Flag outliers as a new column
df["is_outlier"] = (df["salary"] < lower_fence) | (df["salary"] > upper_fence)
print(df)
print("\nOutlier rows:")
print(df[df["is_outlier"]])
Visualising with a Box Plot
import matplotlib.pyplot as plt
salaries = [32000, 33000, 34000, 48000, 49000, 51000,
72000, 75000, 78000, 95000, 120000, 380000]
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Box plot — outliers shown as circles beyond the whiskers
axes[0].boxplot(salaries, vert=True, patch_artist=True,
boxprops=dict(facecolor="#3776AB", alpha=0.6))
axes[0].set_title("Salary Box Plot")
axes[0].set_ylabel("Salary (£)")
axes[0].yaxis.set_major_formatter(
plt.FuncFormatter(lambda x, _: f"£{x:,.0f}")
)
# Same data without the outlier, for comparison
# (154750 is the upper fence from the manual calculation)
salaries_clean = [s for s in salaries if s <= 154750]
axes[1].boxplot(salaries_clean, vert=True, patch_artist=True,
boxprops=dict(facecolor="#34d399", alpha=0.6))
axes[1].set_title("Salaries Without Outlier")
axes[1].set_ylabel("Salary (£)")
axes[1].yaxis.set_major_formatter(
plt.FuncFormatter(lambda x, _: f"£{x:,.0f}")
)
plt.suptitle("IQR Outlier Detection — Box Plots", fontsize=13)
plt.tight_layout()
plt.show()
What To Do When You Find Outliers
Finding an outlier is only the first step. What you do next depends on why the outlier exists. There are four possible explanations:
| Reason | Example | Action |
|---|---|---|
| Data entry error | Height recorded as 1700cm instead of 170cm | Fix or remove |
| Measurement error | Faulty sensor recorded 0°C on a summer day | Remove |
| Legitimate extreme value | CTO genuinely earns £380,000 | Keep — report separately |
| Interesting anomaly | Fraudulent transaction flagged by model | Keep — it is the signal |
Removing outliers without understanding them is one of the most dangerous mistakes in data science. In fraud detection, the outliers are the thing you are trying to find. In medical research, removing extreme patient readings could hide a life-saving discovery. Always investigate first — then decide.
Capping outliers instead of removing them
import numpy as np
salaries = [32000, 33000, 34000, 48000, 49000, 51000,
72000, 75000, 78000, 95000, 120000, 380000]
q1 = np.percentile(salaries, 25)
q3 = np.percentile(salaries, 75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
# Option 1 — Remove outliers
clean = [s for s in salaries if lower_fence <= s <= upper_fence]
print(f"After removal: {len(clean)} values, mean = £{sum(clean)/len(clean):,.0f}")
# Option 2 — Cap outliers (Winsorisation)
# Replace outlier with the fence value instead of removing
capped = [max(lower_fence, min(upper_fence, s)) for s in salaries]
print(f"After capping: mean = £{sum(capped)/len(capped):,.0f}")
# Option 3 — Use median (unaffected by outliers)
median = np.median(salaries)
print(f"Median (no action needed): £{median:,.0f}")
# After removal: 11 values, mean = £62,455
# After capping: mean = £68,823  (380000 is replaced by the upper fence, 138875)
# Median (no action needed): £61,500
IQR vs Z-Score — Two Methods Compared
There are two main methods for detecting outliers. Knowing when to use each one is an important data science skill.
| Property | IQR Method | Z-Score Method |
|---|---|---|
| Formula | Q1 − 1.5×IQR and Q3 + 1.5×IQR | z = (x − mean) / std dev |
| Threshold | Beyond the fences | \|z\| > 2 or \|z\| > 3 |
| Affected by outliers? | No | Yes |
| Best for | Skewed data, unknown distribution | Normally distributed data |
| Use when | You are not sure how data is distributed | Data is confirmed to be normally distributed |
import numpy as np
salaries = [32000, 33000, 34000, 48000, 49000, 51000,
72000, 75000, 78000, 95000, 120000, 380000]
# ── IQR Method ──────────────────────────────────────────────
q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1
iqr_outliers = [s for s in salaries
if s < q1 - 1.5*iqr or s > q3 + 1.5*iqr]
print(f"IQR outliers: {iqr_outliers}")
# ── Z-Score Method ───────────────────────────────────────────
mean = np.mean(salaries)
std = np.std(salaries, ddof=1)
z_scores = [(s - mean) / std for s in salaries]
z_outliers = [s for s, z in zip(salaries, z_scores) if abs(z) > 2]
print(f"Z-score outliers: {z_outliers}")
# IQR outliers: [380000]
# Z-score outliers: [380000]
Default to IQR when you are unsure — it is robust and makes no assumptions about the shape of your data. Use Z-score only when you have verified that your data follows a normal distribution. For salary, income, house price, or any right-skewed data, IQR is almost always the better choice.
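One reason the Z-score is "affected by outliers" is that an extreme value inflates the very mean and standard deviation it is measured against. With the sample standard deviation, no single value among n can have a Z-score above (n − 1) / √n, however extreme it is. The sketch below illustrates this with the salary data. The larger CTO salary is hypothetical, and `top_z_score` is an illustrative helper, not a library function.

```python
from math import sqrt
from statistics import mean, stdev

others = [32000, 33000, 34000, 48000, 49000, 51000,
          72000, 75000, 78000, 95000, 120000]

def top_z_score(salaries):
    """Z-score of the largest value, using the sample standard deviation."""
    return (max(salaries) - mean(salaries)) / stdev(salaries)

n = len(others) + 1
ceiling = (n - 1) / sqrt(n)  # upper bound on any single Z-score for n values

# 380000 is the real CTO salary; 3800000 is a hypothetical tenfold salary
results = {cto: top_z_score(others + [cto]) for cto in (380000, 3800000)}

for cto, z in results.items():
    print(f"CTO £{cto:,}: z = {z:.2f} (ceiling for n = {n}: {ceiling:.2f})")
```

With 12 values the ceiling is about 3.18, so even a tenfold increase in the CTO's salary barely moves the Z-score, and a |z| > 3 threshold is only just reachable. The IQR fences do not have this limitation, because the quartiles ignore how far away the extreme value is.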
Complete Pipeline — Detect, Investigate, Handle
import numpy as np
import pandas as pd
# Full dataset — house prices in a neighbourhood (£)
prices = [185000, 192000, 178000, 205000, 195000, 210000,
188000, 201000, 198000, 185000, 2100000, 175000]
df = pd.DataFrame({"price": prices})
# Step 1 — Calculate quartiles and IQR
q1 = df["price"].quantile(0.25)
q2 = df["price"].quantile(0.50)
q3 = df["price"].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print("── Summary Statistics ──────────────────")
print(f"Q1 (25th percentile): £{q1:,.0f}")
print(f"Q2 (Median): £{q2:,.0f}")
print(f"Q3 (75th percentile): £{q3:,.0f}")
print(f"IQR: £{iqr:,.0f}")
print(f"Lower fence: £{lower_fence:,.0f}")
print(f"Upper fence: £{upper_fence:,.0f}")
# Step 2 — Flag outliers
df["is_outlier"] = (
(df["price"] < lower_fence) |
(df["price"] > upper_fence)
)
print("\n── Outliers Detected ───────────────────")
print(df[df["is_outlier"]])
# Step 3 — Compare mean before and after
mean_with = df["price"].mean()
mean_without = df[~df["is_outlier"]]["price"].mean()
median = df["price"].median()
print("\n── Impact of Outlier ───────────────────")
print(f"Mean with outlier: £{mean_with:,.0f}")
print(f"Mean without outlier: £{mean_without:,.0f}")
print(f"Median (robust): £{median:,.0f}")
# ── Summary Statistics ──────────────────────────
# Q1 (25th percentile): £185,000
# Q2 (Median): £193,500
# Q3 (75th percentile): £202,000
# IQR: £17,000
# Lower fence: £159,500
# Upper fence: £227,500
# ── Outliers Detected ───────────────────────────
# price is_outlier
# 10 2100000 True
# ── Impact of Outlier ───────────────────────────
# Mean with outlier: £351,000
# Mean without outlier: £192,000
# Median (robust): £193,500