
Mean-Median-Mode

Mean, Median, and Mode are the three core measures of central tendency that every Data Scientist must master before diving into advanced analytics. The Mean gives the mathematical average but breaks down in the presence of outliers, while the Median offers a robust alternative for skewed data like salaries and house prices. The Mode identifies the most frequent value and stands as the only valid measure for categorical data such as colours, brands, or survey responses. Understanding how

Section 01

Overview — What Are Measures of Central Tendency?

In statistics, a measure of central tendency is a single value that summarises a dataset by identifying where the centre of the data lies. The three most common measures are:

Mean
x̄ / μ
  • Sum ÷ Count
  • Sensitive to outliers
  • Best for symmetric data
Median
M
  • Middle value (sorted)
  • Outlier-resistant
  • Best for skewed data
Mode
Mo
  • Most frequent value
  • Works on categorical data
  • Can be multiple values
📌
Why This Matters

Choosing the wrong measure can completely mislead your analysis. Reporting average income in a country with extreme inequality gives a false picture — the median is far more honest.


Section 02

The Mean (Arithmetic Average)

The mean is calculated by adding all values and dividing by the number of values. It takes into account every single value in the dataset — which makes it powerful but also vulnerable to extreme values (outliers).

Formula

Population Mean
μ = (Σ xᵢ) / N
Σ xᵢ = sum of all population values
N = total population size
Sample Mean
x̄ = (Σ xᵢ) / n
Σ xᵢ = sum of sample values
n = sample size (subset of data)

Real Example — Employee Salaries

A small company has 10 employees with the following monthly salaries (₹):

Emp # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Salary (₹) | 32,000 | 35,000 | 37,000 | 38,000 | 40,000 | 42,000 | 45,000 | 47,000 | 50,000 | 54,000
🧮 Step-by-Step Calculation
Step 1
Sum all salaries:
32,000 + 35,000 + 37,000 + 38,000 + 40,000 + 42,000 + 45,000 + 47,000 + 50,000 + 54,000 = ₹4,20,000
Step 2
Divide by count (n = 10):
x̄ = 4,20,000 / 10 = ₹42,000
Result
The average salary is ₹42,000/month — a fair representation since all salaries are close to each other.

The Outlier Problem — When Mean Fails

Now suppose the CEO joins with a salary of ₹5,00,000/month:

⚠️
Outlier Effect on Mean

New Sum = 4,20,000 + 5,00,000 = ₹9,20,000
New Mean = 9,20,000 / 11 = ₹83,636/month

But 10 out of 11 employees earn between ₹32,000 and ₹54,000! The mean of ₹83,636 does NOT represent a typical salary. This is why salary surveys typically report the median.

Python Code

mean_example.py
import numpy as np

salaries = [32000, 35000, 37000, 38000, 40000,
            42000, 45000, 47000, 50000, 54000]

print("Mean (no CEO)  :", np.mean(salaries))        # 42000.0

salaries_with_ceo = salaries + [500000]
print("Mean (with CEO):", np.mean(salaries_with_ceo))  # 83636.36
print("Median (robust):", np.median(salaries_with_ceo)) # 42000.0

Industry Applications of Mean

Industry | Application
💰 Finance | Average daily stock return, average monthly expenses
🏭 Manufacturing | Average product weight, average defect rate per batch
🏥 Healthcare | Average patient recovery time, average dosage
🎓 Education | Average test score across a class
🛒 E-commerce | Average order value (AOV), average delivery time

When to Use the Mean

  • Data is normally distributed (symmetric, bell-shaped)
  • No significant outliers present
  • Variables are continuous (height, weight, temperature, exam scores)
  • You need the value in further calculations (standard deviation, regression)

Section 03

The Median (Middle Value)

The median is the middle value in a sorted dataset: about half the values fall above it and about half fall below. Because the median depends only on position, not magnitude, extreme outliers have little or no effect on it.

Formula

Odd Number of Values
M = value at (n+1)/2
Sort the data. The median is the single middle value.
e.g. n = 9 → position 5
Even Number of Values
M = [x(n/2) + x(n/2+1)] / 2
Sort the data. Average the two middle values.
e.g. n = 10 → average positions 5 & 6
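The two positional rules above can be sketched as a small function. This is a minimal illustration rather than a replacement for library routines; the name `median_by_position` is a hypothetical helper introduced here.

```python
def median_by_position(values):
    """Return the median using the odd/even positional rule."""
    ordered = sorted(values)  # the rule only applies to sorted data
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n+1)/2 is 0-based index n//2
        return ordered[mid]
    # Even n: average the values at 1-based positions n/2 and n/2 + 1
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_by_position([7, 1, 5]))     # 5
print(median_by_position([4, 1, 3, 2]))  # 2.5
```

Sorting first is the essential step: applying the position formula to unsorted data returns an arbitrary value, not the median.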

Real Example — House Prices in a City

Seven properties recently sold in a neighbourhood (₹ Lakhs):

Property | A | B | C | D | E | F | Luxury Villa
Price (₹L) | 45 | 52 | 47 | 55 | 50 | 48 | 420
🧮 Step-by-Step Median Calculation
Step 1
Sort the values: 45, 47, 48, 50, 52, 55, 420
Step 2
n = 7 (odd) → Middle position = (7+1)/2 = 4th value
Median
M = ₹50 Lakhs ✓
Compare
Mean = (45+47+48+50+52+55+420)/7 = 717/7 ≈ ₹102.4 Lakhs, which is completely misleading! No typical buyer pays ₹102.4L when 6 out of 7 homes cost under ₹56L.
💡
Real-World Insight — Income Inequality

India's GDP per capita (a rough proxy for mean income per person) is approximately ₹2.1 Lakhs/year, while estimates of the median Indian income are much lower, closer to ₹50,000–70,000/year. The gap exists largely because a small number of ultra-wealthy individuals pull the mean far above what most people actually earn.

Python Code

median_example.py
import numpy as np

house_prices = [45, 52, 47, 55, 50, 48, 420]

print("Mean  :", np.mean(house_prices))    # ≈ 102.43, misleading!
print("Median:", np.median(house_prices))  # 50.0  — accurate ✓

Industry Applications of Median

Industry | Application
🏠 Real Estate | Median home price in a city (standard reporting measure)
📊 Economics | Median household income, median net worth
🏥 Healthcare | Median survival time in clinical trials (not average)
💼 HR / Recruitment | Median salary for a job role (fairer than mean for job ads)
🚚 Logistics | Median delivery time (unaffected by rare extreme delays)

When to Use the Median

  • Data is skewed (not symmetric — income, property prices, wealth)
  • Outliers are present and significant
  • Ordinal data — values have order but distances are not equal (e.g., survey ratings 1–5)
  • Reporting typical experience rather than a mathematical average
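For ordinal data such as survey ratings, one practical wrinkle is that averaging the two middle values can produce a number that is not a valid rating. The sketch below uses hypothetical responses and the standard library's `statistics.median_low`, which returns the lower of the two middle values so the result is always an actual observed category.

```python
from statistics import median, median_low

# Hypothetical survey responses on a 1-5 satisfaction scale (ordinal data)
ratings = [1, 2, 2, 3, 4, 4, 5, 5]

print(median(ratings))      # 3.5 (average of 3 and 4, not a real rating)
print(median_low(ratings))  # 3   (lower of the two middle ratings)
```

Whether to report 3.5 or 3 is a judgment call; the point is only that the median stays meaningful when distances between categories are not equal.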

Section 04

The Mode (Most Frequent Value)

The mode is the value that appears most frequently in a dataset. Unlike the mean and median, the mode can be used with categorical (non-numeric) data. A dataset may have one mode, two modes (bimodal), multiple modes, or no mode at all.

Mode Definition
Mo = argmax f(x)
The value x that maximises frequency f(x).
Simply: the most occurring value.
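The argmax definition translates almost literally into code. The following is a minimal sketch using the standard library; `mode_argmax` is a hypothetical name, and when several values tie for the top frequency it simply returns the one that appears first in the data.

```python
from collections import Counter


def mode_argmax(values):
    """Return the value x with the highest frequency f(x)."""
    freq = Counter(values)          # f(x) for every distinct x
    return max(freq, key=freq.get)  # argmax over x of f(x)


print(mode_argmax([2, 3, 3, 5, 3, 2]))       # 3
print(mode_argmax(["Red", "Blue", "Blue"]))  # Blue
```

Because it only counts occurrences, the same function works for numbers and for categorical labels.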

Real Example — Retail Shoe Store

A shoe store records sizes purchased by 238 customers to decide which size to order the most:

Size | 6 | 6.5 | 7 | 7.5 | 8 | 8.5 ★ | 9 | 9.5 | 10 | 10.5 | 11
Customers | 5 | 8 | 15 | 22 | 38 | 45 | 40 | 30 | 20 | 10 | 5
🧮
Finding the Mode

Mode = Size 8.5 — purchased by 45 customers, the highest frequency.

The mode is the ONLY measure that directly tells the store: order the most 8.5s! No formula for mean or median could give this actionable insight as directly.

Real Example — Favourite Colour Survey (Categorical Data)

A school surveys 200 students: "What is your favourite colour?"

Colour | Blue ★ | Red | Green | Yellow | Purple | Orange
Students | 72 | 58 | 35 | 18 | 12 | 5
⚠️
Why Only Mode Works Here

You cannot calculate a mean or median for colours because they are categorical, not numeric. Mode = Blue is the only valid measure. Applying the mean or median to categorical data is a common mistake in data science, so always check your data type before choosing a measure.

Unimodal, Bimodal & Multimodal Distributions

Type | Definition | Example
Unimodal | One peak / one mode | Exam scores clustered around 70
Bimodal | Two peaks / two modes | Customer ages: peaks at 25 and 55
Multimodal | Three+ peaks | Store footfall with morning, lunchtime, and evening peaks
No Mode | All equally frequent | Rolling a fair die
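The categories in the table can be approximated in code by counting how many values tie for the highest frequency. This is a simplified sketch: it looks at exact ties between values rather than peaks in a histogram, and `describe_modes` is a hypothetical helper built on the standard library's `statistics.multimode`.

```python
from statistics import multimode


def describe_modes(values):
    """Classify a dataset by how many values tie for the highest frequency."""
    values = list(values)
    if not values:
        return "no mode", []
    modes = multimode(values)  # every value that reaches the top count
    if len(modes) == len(set(values)) and len(modes) > 1:
        return "no mode", []   # all values are equally frequent
    labels = {1: "unimodal", 2: "bimodal"}
    return labels.get(len(modes), "multimodal"), modes


print(describe_modes([70, 68, 70, 72, 70]))  # ('unimodal', [70])
print(describe_modes([25, 25, 55, 55, 40]))  # ('bimodal', [25, 55])
print(describe_modes([1, 2, 3, 4, 5, 6]))    # ('no mode', [])
```

For continuous data, where exact repeats are rare, a histogram or density plot is usually a better way to spot multiple peaks.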

Python Code

mode_example.py
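from collections import Counter
import numpy as np

shoe_sizes = [6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11]
counts     = [5, 8,  15, 22, 38, 45, 40, 30,  20,  10,  5]   # totals 238 customers

# Expand the frequency table into one entry per customer
data = np.repeat(shoe_sizes, counts)

# np.unique is used instead of stats.mode(data).mode[0], because recent SciPy
# releases return a scalar mode (so [0] fails) and reject non-numeric input
values, freqs = np.unique(data, return_counts=True)
print("Mode:", values[np.argmax(freqs)])   # 8.5

# Categorical example (top three colours from the survey)
colours = ["Blue"]*72 + ["Red"]*58 + ["Green"]*35
print("Favourite:", Counter(colours).most_common(1)[0][0])  # Blue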

Industry Applications of Mode

Industry | Application
🛍️ Retail | Most popular product size, colour, or category to stock
📣 Marketing | Most common customer age group, most popular channel
🏥 Healthcare | Most common diagnosis, most frequently prescribed drug
🚗 Transportation | Most common trip duration, most popular route
📱 Social Media | Most liked post type, most used hashtag

When to Use the Mode

  • Data is categorical (colours, brands, yes/no, city names)
  • You need to identify the most popular item or preference
  • Detecting clusters or groups in data (bimodal = two distinct groups)
  • Quality control — finding the most common defect type
  • Imputing missing values in categorical ML columns

Section 05

Skewness — How Distribution Shape Changes Everything

The relationship between mean, median, and mode reveals the shape of a distribution. Understanding skewness is essential for choosing the right measure and avoiding misleading analysis.

Symmetric
Bell-shaped
Mean = Median = Mode
Right Skewed (+)
Long tail on right
Mode < Median < Mean
Left Skewed (−)
Long tail on left
Mean < Median < Mode
Distribution | Shape | Relationship | Best Measure
Symmetric | Bell-shaped | Mean = Median = Mode | Any (all are equal)
Right-Skewed (+) | Long tail on right | Mode < Median < Mean | Median
Left-Skewed (−) | Long tail on left | Mean < Median < Mode | Median
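The ordering of mean and median in skewed data can be checked with a quick simulation. The sketch below assumes log-normal values as a stand-in for right-skewed data (an illustrative choice, not something taken from a real dataset) and mirrors them to obtain a left-skewed sample.

```python
import random
from statistics import fmean, median

random.seed(42)  # fixed seed so the sketch is repeatable

# Simulated right-skewed data: log-normal values have a long right tail
right_skewed = [random.lognormvariate(0, 1) for _ in range(10_000)]
# Mirror the values to get a long left tail instead
left_skewed = [-x for x in right_skewed]

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    m, med = fmean(data), median(data)
    direction = "mean > median" if m > med else "mean < median"
    print(f"{name}: mean={m:.2f}, median={med:.2f} ({direction})")
```

The long tail drags the mean toward it while the median stays near the bulk of the data, which is why the median is the safer summary in both skewed cases.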
📐
Pearson's Approximation

For moderately skewed distributions:
Mean − Mode ≈ 3 × (Mean − Median)

This formula lets you estimate the mode from mean and median, or verify consistency in your data.
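Rearranging the approximation gives Mode ≈ 3 × Median − 2 × Mean, which is straightforward to compute. The function name and the input numbers below are hypothetical, chosen only to illustrate the arithmetic.

```python
def estimate_mode(mean_value, median_value):
    """Rearranged Pearson approximation: Mode ≈ 3 * Median - 2 * Mean."""
    return 3 * median_value - 2 * mean_value


# Hypothetical summary statistics for a moderately right-skewed dataset
print(estimate_mode(mean_value=50, median_value=48))  # 44
```

Here Mean − Mode = 6 and 3 × (Mean − Median) = 6, so the estimate is consistent with the original form of the relation. It remains a rule of thumb and can be far off for strongly skewed or multimodal data.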


Section 06

Complete Real Example — Student Exam Scores

A class of 20 students takes a maths exam (out of 100). Here are their scores:

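# | Score | # | Score | # | Score | # | Score | # | Score
1 | 45 | 5 | 60 | 9 | 65 | 13 | 72 | 17 | 85
2 | 52 | 6 | 60 | 10 | 65 | 14 | 75 | 18 | 80
3 | 55 | 7 | 62 | 11 | 65 | 15 | 78 | 19 | 88
4 | 58 | 8 | 62 | 12 | 68 | 16 | 70 | 20 | 92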
🧮 Calculating All Three Measures
Mean
Sum = 1,357  →  x̄ = 1,357 / 20 = 67.85
Median
n = 20 (even) → Average 10th and 11th values in sorted list
..., 62, 65, 65, 65, 68, ...  →  M = (65+65)/2 = 65.0
Mode
65 appears 3 times, more than any other value  →  Mo = 65

Interpretation for the Teacher

Measure | Value | What It Tells the Teacher
Mean | 67.85 | Mathematical average; useful for comparing this class against other classes or past years.
Median | 65.0 | Typical student score; at least half the class scored 65 or lower and at least half scored 65 or higher. Better for reporting typical performance.
Mode | 65 | Most common score; three students scored exactly 65, which is useful for identifying where teaching focus paid off.
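Hand calculations like these are easy to cross-check by recomputing all three measures from the score table. A minimal sketch using the standard library's `statistics` module, with the scores listed in student-number order:

```python
from statistics import mean, median, mode

# Scores for students 1-20, copied from the table in student-number order
scores = [45, 52, 55, 58, 60, 60, 62, 62, 65, 65,
          65, 68, 72, 75, 78, 70, 85, 80, 88, 92]

print("Sum   :", sum(scores))
print("Mean  :", mean(scores))
print("Median:", median(scores))  # 65.0
print("Mode  :", mode(scores))    # 65
```

The list does not need to be sorted first: `median` sorts internally, and `mean` and `mode` do not depend on order.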

Section 07

Decision Guide — Which Measure to Choose?

Situation | Use Mean | Use Median | Use Mode
Data type | Continuous numeric | Continuous numeric | Categorical or discrete
Distribution | Symmetric | Skewed | Any
Outliers present? | No outliers | Outliers present | Doesn't matter
Goal | Further calculations | Describe typical value | Find most popular value
Real example | Average temperature | Household income | Most popular product
ML imputation | Normal features | Skewed features | Categorical columns
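The decision table can be condensed into a small helper. This is a rough sketch rather than a definitive rule: the function name and the `"categorical"`, `"ordinal"`, and `"numeric"` labels are hypothetical, and the ordinal branch follows the earlier guidance that ordered ratings are best summarised by the median.

```python
def choose_measure(data_type, skewed=False, has_outliers=False):
    """Suggest a measure of central tendency following the decision table."""
    if data_type == "categorical":
        return "mode"    # mean and median are undefined for categories
    if data_type == "ordinal":
        return "median"  # order is meaningful, distances are not
    if data_type == "numeric":
        # Skew or outliers make the mean unrepresentative
        return "median" if (skewed or has_outliers) else "mean"
    raise ValueError(f"unknown data type: {data_type!r}")


print(choose_measure("numeric"))                     # mean
print(choose_measure("numeric", has_outliers=True))  # median
print(choose_measure("categorical"))                 # mode
```

In practice the `skewed` and `has_outliers` flags would come from inspecting a histogram or box plot, not from a fixed threshold.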
🎯 Golden Rules for Data Scientists
1. Always visualize your data before choosing a measure: check for skewness and outliers using histograms and box plots.
2. Report both mean and median when in doubt; the gap between them immediately reveals skewness.
3. Never use mean for categorical data; it is mathematically meaningless.
4. For ML missing value imputation: use mean for normally distributed features, median for skewed features, mode for categorical columns.
5. Treat a large gap between mean and median as a red flag: investigate for outliers or skewness before proceeding.
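The imputation rule can be illustrated with plain Python lists. This is a simplified sketch with hypothetical columns and a hypothetical `impute` helper; real projects would more likely rely on the imputation tools in a data library, but the choice of measure is the same.

```python
from statistics import mean, median, mode


def impute(column, strategy):
    """Replace None entries with the chosen measure of the observed values."""
    observed = [v for v in column if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in column]


# Hypothetical columns with missing entries marked as None
heights = [160, 170, None, 180]            # roughly symmetric -> mean
incomes = [30, 35, None, 40, 400]          # skewed by one large value -> median
cities = ["Pune", "Delhi", None, "Pune"]   # categorical -> mode

print(impute(heights, "mean"))    # [160, 170, 170, 180]
print(impute(incomes, "median"))  # [30, 35, 37.5, 40, 400]
print(impute(cities, "mode"))     # ['Pune', 'Delhi', 'Pune', 'Pune']
```

Filling the income column with its mean (126.25) instead would insert a value larger than four of the five observed incomes, which is exactly the distortion the median avoids.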

Section 08

Quick Reference Card

Property | Mean | Median | Mode
Symbol | x̄ or μ | M | Mo
Formula | Σxᵢ / n | Middle value (sorted) | Most frequent value
Data Type | Numeric only | Numeric or ordinal | Any (incl. categorical)
Outlier effect? | Yes, strongly | No, robust | No, unaffected
Always unique? | Always 1 value | Always 1 value | 0, 1, or many
Skewed data | Misleading | Reliable | Depends
Used in Std Dev? | Yes | No | No
Real Estate | Avoid | Standard use | Rare
Income reporting | Avoid | Usually preferred | Sector breakdown
ML imputation | Normal features | Skewed features | Categorical features