The Story That Explains Correlation
A city council notices that on days when more ice cream is sold, the number of drowning incidents also rises. A worried councillor proposes banning ice cream to save lives.
Ice cream does not cause drowning. Both rise together because of a third hidden variable — hot weather. On hot days, people buy more ice cream and also swim more, which increases the chance of drowning. Ice cream and drowning are correlated but one does not cause the other. This is the most important lesson in all of statistics: correlation does not imply causation.
Correlation is a number that measures how strongly two variables move together. Understanding it — and its limits — is one of the most critical skills in data science.
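The ice cream story can be reproduced with simulated data. The sketch below is illustrative only: the variable names, the temperature range and every coefficient are invented assumptions, not real figures. Temperature drives both series, so they correlate strongly with each other, and the correlation nearly vanishes once the effect of temperature is removed from each of them (a partial correlation).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n_days = 365

# Hypothetical daily data: temperature drives BOTH quantities.
temperature = rng.uniform(5, 35, n_days)                     # degrees Celsius
ice_cream = 20 * temperature + rng.normal(0, 60, n_days)     # units sold
drownings = 0.1 * temperature + rng.normal(0, 0.5, n_days)   # incident index


def residuals(y, x):
    """Return what is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Raw correlation: looks as if ice cream and drowning are linked.
r_raw = np.corrcoef(ice_cream, drownings)[0, 1]

# Partial correlation: correlate the parts NOT explained by temperature.
r_partial = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"Ice cream vs drownings:            r = {r_raw:.2f}")
print(f"After controlling for temperature: r = {r_partial:.2f}")
```

The first number should come out strongly positive and the second close to zero, which is the numerical version of "hot weather explains both".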
What is Correlation?
Correlation measures the strength and direction of the linear relationship between two numerical variables. It answers the question: when one variable changes, does the other tend to change too — and in which direction?
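- Positive correlation: both variables rise together and fall together, e.g. height and weight
- No correlation: no consistent linear pattern, so knowing one variable tells you nothing about the direction of the other, e.g. shoe size and salary
- Negative correlation: one rises as the other falls (an inverse relationship), e.g. speed and travel time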
The Correlation Coefficient (r)
The Pearson correlation coefficient (written as r) is the most common measure of correlation. It always falls between −1 and +1.
| r value | Strength | Direction | Real example |
|---|---|---|---|
| +1.0 | Perfect | Positive | Celsius and Fahrenheit |
| +0.7 to +0.9 | Strong | Positive | Height and weight |
| +0.4 to +0.6 | Moderate | Positive | Study hours and exam score |
| +0.1 to +0.3 | Weak | Positive | Sleep and productivity |
| 0.0 | None | n/a | Shoe size and IQ |
| −0.1 to −0.3 | Weak | Negative | Stress and sleep quality |
| −0.4 to −0.6 | Moderate | Negative | Price and demand |
| −0.7 to −0.9 | Strong | Negative | Speed and travel time |
| −1.0 | Perfect | Negative | Amount spent and amount left from a fixed budget |
r = −0.85 is a stronger relationship than r = +0.40. The negative sign only means the variables move in opposite directions. Always look at the absolute value to judge strength, and the sign to judge direction.
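A quick numerical check of the two perfect cases and of the sign-versus-strength rule. The `describe` helper and its cut-offs (0.1, 0.4, 0.7) are assumptions chosen to roughly match the bands in the table above, not a standard; the fixed-budget data is likewise made up for illustration.

```python
import numpy as np


def describe(r):
    """Label r by strength and direction (cut-offs are illustrative)."""
    size = abs(r)
    if size < 0.1:
        return "none"
    strength = "weak" if size < 0.4 else "moderate" if size < 0.7 else "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


celsius = np.array([-10.0, 0.0, 15.0, 25.0, 40.0])
fahrenheit = celsius * 9 / 5 + 32           # exact linear conversion
r_temp = np.corrcoef(celsius, fahrenheit)[0, 1]

spent = np.array([0.0, 20.0, 45.0, 70.0, 100.0])
remaining = 100 - spent                     # hypothetical fixed budget of 100
r_budget = np.corrcoef(spent, remaining)[0, 1]

print(f"Celsius vs Fahrenheit: r = {r_temp:+.2f}")
print(f"Spent vs remaining:    r = {r_budget:+.2f}")
print(describe(-0.85), "|", describe(0.40))
```

Strength comes from `abs(r)` and direction from the sign, so −0.85 is labelled strong even though it is the smaller number.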
Step-by-Step Calculation
A teacher records study hours and exam scores for 5 students:
| Student | Study hours (x) | Exam score (y) |
|---|---|---|
| Alice | 2 | 50 |
| Bob | 4 | 60 |
| Carol | 6 | 72 |
| Dan | 8 | 80 |
| Eve | 10 | 95 |
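Step 1: Calculate the means
x̄ = (2+4+6+8+10)/5 = 6
ȳ = (50+60+72+80+95)/5 = 71.4
Step 2: Calculate each student's deviations from the means, and their product
Alice: (2−6)=−4, (50−71.4)=−21.4 product = 85.6
Bob: (4−6)=−2, (60−71.4)=−11.4 product = 22.8
Carol: (6−6)=0, (72−71.4)=0.6 product = 0
Dan: (8−6)=2, (80−71.4)=8.6 product = 17.2
Eve: (10−6)=4, (95−71.4)=23.6 product = 94.4
Step 3: Sum the products
Σ(x−x̄)(y−ȳ) = 85.6 + 22.8 + 0 + 17.2 + 94.4 = 220
Step 4: Calculate the sums of squared deviations
Σ(x−x̄)² = 16+4+0+4+16 = 40
Σ(y−ȳ)² = 457.96+129.96+0.36+73.96+556.96 = 1219.2
Step 5: Divide the sum of products by the square root of the product of the two sums of squares
r = 220 / √(40 × 1219.2) = 220 / 220.83 ≈ 0.996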
r = 0.996, an almost perfect positive correlation. Study hours and exam scores move together extremely closely. Every extra 2 hours of study is associated with roughly 11 more marks (a fitted slope of 5.5 marks per hour). Within this five-student sample, study hours alone predict exam scores with high accuracy; five data points are far too few to generalise from.
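For reference, the calculation above follows the standard definition of the Pearson coefficient, written out here in LaTeX. For the five students the numerator is Σ(x−x̄)(y−ȳ) = 220 and the two sums under the square root are Σ(x−x̄)² = 40 and Σ(y−ȳ)² = 1219.2.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \;\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```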
Python Implementation
Manual calculation
import math
x = [2, 4, 6, 8, 10] # study hours
y = [50, 60, 72, 80, 95] # exam scores
n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n
numerator = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
denom_x = sum((xi - x_mean) ** 2 for xi in x)
denom_y = sum((yi - y_mean) ** 2 for yi in y)
denominator = math.sqrt(denom_x * denom_y)
r = numerator / denominator
print(f"Pearson r: {r:.4f}") # 0.9962
Using NumPy
import numpy as np
x = [2, 4, 6, 8, 10]
y = [50, 60, 72, 80, 95]
# np.corrcoef returns a correlation matrix
corr_matrix = np.corrcoef(x, y)
r = corr_matrix[0, 1] # corr_matrix[0, 0] and corr_matrix[1, 1] are always 1.0 (self-correlation)
print(f"Pearson r: {r:.4f}") # 0.9962
print("\nFull correlation matrix:")
print(np.round(corr_matrix, 4))
# [[1.     0.9962]
#  [0.9962 1.    ]]
Using Pandas on a DataFrame
import pandas as pd
data = {
    'study_hours':   [2, 4, 6, 8, 10, 3, 7, 5, 9, 1],
    'exam_score':    [50, 60, 72, 80, 95, 55, 78, 65, 88, 40],
    'sleep_hours':   [8, 7, 7, 6, 5, 8, 6, 7, 5, 9],
    'anxiety_level': [3, 4, 4, 6, 7, 3, 5, 4, 8, 2],
}
df = pd.DataFrame(data)
# Single pair
r = df['study_hours'].corr(df['exam_score'])
print(f"Study hours vs exam score: r = {r:.4f}")
# Full correlation matrix for all columns
print("\nFull correlation matrix:")
print(df.corr().round(3))
Visualising with a scatter plot
import matplotlib.pyplot as plt
import numpy as np
x = [2, 4, 6, 8, 10]
y = [50, 60, 72, 80, 95]
r = np.corrcoef(x, y)[0, 1]
fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(x, y, color='#378add', s=100, zorder=3,
           edgecolors='white', linewidths=1)
# Trend line
m, b = np.polyfit(x, y, 1)
x_line = np.linspace(min(x)-0.5, max(x)+0.5, 100)
ax.plot(x_line, m*x_line+b, color='#f59e0b',
        linewidth=2, linestyle='--', label='Trend line')
# Label each point
labels = ['Alice', 'Bob', 'Carol', 'Dan', 'Eve']
for xi, yi, lb in zip(x, y, labels):
    ax.annotate(lb, (xi, yi), textcoords='offset points',
                xytext=(8, 4), fontsize=9, color='#888780')
ax.set_title(f'Study Hours vs Exam Score (r = {r:.3f})', fontsize=13)
ax.set_xlabel('Study hours')
ax.set_ylabel('Exam score')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
Real-World Stories
Story 1 — The Sunscreen Paradox
A study finds a strong positive correlation between sunscreen sales and skin cancer rates. A journalist writes: "Sunscreen causes cancer." What is actually happening?
In regions with high UV levels, people buy more sunscreen AND get more skin cancer (because sunscreen reduces but does not eliminate risk). UV exposure is the confounding variable driving both. Sunscreen actually reduces cancer risk — the correlation is misleading without controlling for the true cause.
Story 2 — Netflix and Shoe Size
A data scientist finds that in their dataset, people with larger shoe sizes watch more Netflix. r = +0.71. Should Netflix target large-footed customers?
Adults have larger shoe sizes and also watch more Netflix than children. Age is the hidden variable. When you control for age the correlation between shoe size and Netflix watching drops to near zero. This is a spurious correlation — real in the numbers, meaningless in reality. Always ask: is there a third variable explaining both?
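What "controlling for age" looks like can be sketched with simulated data. Everything here is an illustrative assumption (group sizes, shoe sizes, viewing hours): within each age group, shoe size and viewing time are generated independently, yet pooling children and adults produces a strong correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_group = 500

# Hypothetical groups: (mean EU shoe size, mean daily viewing hours).
groups = {"children": (32, 1.5), "adults": (42, 3.0)}

shoe, hours, label = [], [], []
for name, (shoe_mean, hours_mean) in groups.items():
    # Within a group the two variables are drawn independently.
    shoe.append(rng.normal(shoe_mean, 2.0, n_per_group))
    hours.append(np.clip(rng.normal(hours_mean, 0.5, n_per_group), 0, None))
    label += [name] * n_per_group

shoe = np.concatenate(shoe)
hours = np.concatenate(hours)
label = np.array(label)

# Pooled data: age is hidden, so shoe size appears to "predict" viewing.
r_pooled = np.corrcoef(shoe, hours)[0, 1]
print(f"All people: r = {r_pooled:+.2f}")

# Controlling for age by looking inside each group separately.
r_within = {}
for name in groups:
    mask = label == name
    r_within[name] = np.corrcoef(shoe[mask], hours[mask])[0, 1]
    print(f"{name:>10}: r = {r_within[name]:+.2f}")
```

Splitting by group is the simplest form of controlling for a variable; it works here because the assumed hidden variable has only two levels.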
Story 3 — The £1 Billion Decision
An e-commerce company finds a strong positive correlation (r = +0.82) between the number of customer support tickets raised and total revenue. A manager suggests cutting support to reduce tickets and save money.
More revenue means more customers — more customers means more support tickets. Revenue causes tickets, not the other way around. Cutting support would harm customers and reduce future revenue. The correlation is real but the assumed direction of causation was backwards. Always use domain knowledge alongside the numbers.
Pearson vs Spearman — Which to Use?
Pearson measures linear correlation. Spearman measures rank-based (monotonic) correlation — it works even when the relationship is not a straight line.
| Property | Pearson r | Spearman ρ |
|---|---|---|
| Measures | Linear relationship | Monotonic relationship |
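| Requires normal distribution? | Not to compute r, but the standard p-value assumes roughly normal data | No |
| Sensitive to outliers? | Yes | Much less (ranks limit the pull of extreme values) |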
| Works on ordinal data? | No | Yes |
| Use when | Data is continuous and roughly normal | Data is skewed, ordinal, or has outliers |
| Range | −1 to +1 | −1 to +1 |
from scipy import stats
import numpy as np
# Study hours and exam scores
x = [2, 4, 6, 8, 10, 3, 7, 5, 9, 1]
y = [50, 60, 72, 80, 95, 55, 78, 65, 88, 40]
# Pearson: measures linear relationship
pearson_r, pearson_p = stats.pearsonr(x, y)
print(f"Pearson r = {pearson_r:.4f} (p-value: {pearson_p:.4f})")
# Spearman: measures rank-based relationship
spearman_r, spearman_p = stats.spearmanr(x, y)
print(f"Spearman ρ = {spearman_r:.4f} (p-value: {spearman_p:.4f})")
# Pearson r ≈ 0.996 (p-value: 0.0000)
# Spearman ρ = 1.0000 (p-value: 0.0000)
# ρ is exactly 1 here because every student who studied longer also scored higher
# p-value < 0.05 means the correlation is statistically significant
# i.e. unlikely to have occurred by random chance alone
The p-value is the probability of seeing a correlation at least this strong in a sample if the true correlation were zero. A p-value below 0.05 is conventionally called statistically significant: the result is unlikely to be a fluke of random sampling. A small dataset with r = 0.7 might have p = 0.12 (roughly what six data points give), meaning the correlation is not reliable. Always report r, p and the sample size together.
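One way to see what a p-value means is a permutation test: shuffle one variable many times to destroy any real link, then count how often chance alone produces a correlation as strong as the observed one. This is a sketch of the idea rather than what `scipy.stats.pearsonr` computes by default (it derives the p-value from a theoretical distribution); the number of shuffles and the seed are arbitrary choices.

```python
import numpy as np

x = np.array([2, 4, 6, 8, 10, 3, 7, 5, 9, 1])            # study hours
y = np.array([50, 60, 72, 80, 95, 55, 78, 65, 88, 40])   # exam scores

rng = np.random.default_rng(0)
n_shuffles = 10_000

r_observed = np.corrcoef(x, y)[0, 1]

# Shuffling y breaks any real pairing between hours and scores,
# so each shuffled r shows what pure chance looks like.
r_shuffled = np.array([
    np.corrcoef(x, rng.permutation(y))[0, 1] for _ in range(n_shuffles)
])

# Two-sided p-value: share of shuffles at least as extreme as the real r.
p_perm = np.mean(np.abs(r_shuffled) >= abs(r_observed))

print(f"Observed r = {r_observed:.3f}")
print(f"Permutation p-value = {p_perm:.4f}")
```

With a relationship this strong, almost no shuffle comes close to the observed r, so the estimated p-value is near zero.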
Correlation Matrix — Many Variables at Once
In real data science you rarely look at just two variables. A correlation matrix shows the pairwise correlation between every variable in your dataset simultaneously.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(42)
n = 100
df = pd.DataFrame({
    'study_hours': np.random.uniform(1, 12, n),
    'sleep_hours': np.random.uniform(4, 9, n),
    'anxiety': np.random.uniform(1, 10, n),
})
# Construct realistic relationships
df['exam_score'] = (
    df['study_hours'] * 7 +
    df['sleep_hours'] * 2 -
    df['anxiety'] * 1.5 +
    np.random.normal(0, 5, n)
).clip(0, 100)
df['salary_k'] = (
    df['exam_score'] * 0.6 +
    df['study_hours'] * 1.2 +
    np.random.normal(30, 5, n)
).clip(20, 120)
corr = df.corr().round(2)
print(corr)
# Heatmap
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
            vmin=-1, vmax=1, center=0,
            square=True, linewidths=0.5, ax=ax)
ax.set_title('Correlation Matrix — Student Data', fontsize=13)
plt.tight_layout()
plt.show()
Dark red = strong positive correlation. Dark blue = strong negative. White = near zero. The diagonal is always 1.0 (a variable is perfectly correlated with itself). Look for dark cells off the diagonal — those are the interesting relationships in your data.
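A heatmap is easiest to read for a handful of variables. For wider tables it can help to list the strongest pairs directly. The sketch below does this for a small hand-typed student table; `strongest_pairs` is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "study_hours":   [2, 4, 6, 8, 10, 3, 7, 5, 9, 1],
    "exam_score":    [50, 60, 72, 80, 95, 55, 78, 65, 88, 40],
    "sleep_hours":   [8, 7, 7, 6, 5, 8, 6, 7, 5, 9],
    "anxiety_level": [3, 4, 4, 6, 7, 3, 5, 4, 8, 2],
})


def strongest_pairs(frame):
    """Return every distinct column pair, sorted by |r| (largest first)."""
    corr = frame.corr()
    cols = corr.columns
    # Indices of the upper triangle, excluding the diagonal of 1.0s.
    rows, columns = np.triu_indices(len(cols), k=1)
    pairs = pd.DataFrame({
        "var_1": cols[rows],
        "var_2": cols[columns],
        "r": corr.to_numpy()[rows, columns],
    })
    order = pairs["r"].abs().sort_values(ascending=False).index
    return pairs.loc[order].reset_index(drop=True)


pairs = strongest_pairs(df)
print(pairs.round(3))
```

Sorting by the absolute value keeps strong negative relationships at the top alongside strong positive ones.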
Four Warnings About Correlation
Warning 1 — Anscombe's Quartet
In 1973 statistician Francis Anscombe created four datasets that share almost exactly the same correlation coefficient (r ≈ 0.816), along with nearly the same means and variances, but look completely different when plotted.
import matplotlib.pyplot as plt
import numpy as np
# Anscombe's Quartet: four datasets, same r = 0.816
datasets = [
    ([10,8,13,9,11,14,6,4,12,7,5], [8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68]),
    ([10,8,13,9,11,14,6,4,12,7,5], [9.14,8.14,8.74,8.77,9.26,8.10,6.13,3.10,9.13,7.26,4.74]),
    ([10,8,13,9,11,14,6,4,12,7,5], [7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73]),
    ([8,8,8,8,8,8,8,8,8,8,19], [6.58,5.76,7.71,8.84,8.47,7.04,5.25,5.56,7.91,6.89,12.50]),
]
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes = axes.flatten()
for i, (x, y) in enumerate(datasets):
    r = np.corrcoef(x, y)[0,1]
    m, b = np.polyfit(x, y, 1)
    axes[i].scatter(x, y, color='#378add', s=60, zorder=3)
    axes[i].plot(sorted(x), [m*xi+b for xi in sorted(x)],
                 color='#f59e0b', linewidth=1.5, linestyle='--')
    axes[i].set_title(f'Dataset {i+1} (r = {r:.3f})', fontsize=11)
    axes[i].set_xlim(0, 22); axes[i].set_ylim(0, 14)
    axes[i].grid(alpha=0.3)
plt.suptitle("Anscombe's Quartet: same r, completely different data",
             fontsize=13)
plt.tight_layout()
plt.show()
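Anscombe's Quartet shows that the correlation coefficient alone is not enough. Dataset 1 is an ordinary noisy linear relationship. Dataset 2 is a smooth curve, not a line. Dataset 3 has an outlier ruining an otherwise perfect linear relationship. Dataset 4 has every x equal to 8 except one extreme point, and that single point creates the whole correlation. Always visualise the scatter plot before trusting r.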
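The same point can be made numerically: the usual summary statistics barely differ across the four datasets, so no summary table would reveal what the plots show. A short sketch using the quartet values from the plotting code:

```python
import numpy as np

x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = [
    (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 19],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.50]),
]

summary = []
for i, (x, y) in enumerate(datasets, start=1):
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # least-squares line
    r = np.corrcoef(x, y)[0, 1]
    summary.append((x.mean(), y.mean(), r, slope, intercept))
    print(f"Dataset {i}: mean x = {x.mean():.2f}, mean y = {y.mean():.2f}, "
          f"r = {r:.3f}, fit: y = {slope:.2f}x + {intercept:.2f}")
```

All four rows print essentially the same means, r and fitted line, which is exactly why the scatter plot is needed.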
Warning 2 — Non-linear relationships
import numpy as np
import matplotlib.pyplot as plt
# A perfect U-shape — zero linear correlation but obvious relationship
x = np.linspace(-3, 3, 100)
y = x ** 2 + np.random.normal(0, 0.2, 100)
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.4f}") # r ≈ 0.0 — but clearly y depends on x!
fig, ax = plt.subplots(figsize=(7, 4))
ax.scatter(x, y, color='#378add', s=30, alpha=0.7)
ax.set_title(f'y = x² — perfect quadratic relationship but r = {r:.3f}')
ax.set_xlabel('x'); ax.set_ylabel('y')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
A U-shaped relationship gives r ≈ 0 even though x almost completely determines y. If you only computed r you would conclude there is no relationship, which is completely wrong. Spearman handles monotonic non-linear relationships, but it does not help here because a U-shape is not monotonic. Always check the scatter plot for non-linear patterns.
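The difference between "curved but monotonic" and "curved and not monotonic" can be checked directly. In this sketch (noise-free curves chosen purely for illustration), Spearman gives a perfect score for an exponential curve, while both coefficients are essentially zero for a symmetric U-shape.

```python
import numpy as np
from scipy import stats

# Monotonic but curved: y always increases with x.
x_mono = np.linspace(0, 5, 50)
y_mono = np.exp(x_mono)

# Symmetric U-shape: y falls, then rises (x runs from -3 to 3).
x_u = np.arange(-50, 51) * 0.06
y_u = x_u ** 2

results = {}
curves = {"exponential": (x_mono, y_mono), "U-shape": (x_u, y_u)}
for name, (x, y) in curves.items():
    pearson_r, _ = stats.pearsonr(x, y)
    spearman_rho, _ = stats.spearmanr(x, y)
    results[name] = (pearson_r, spearman_rho)
    print(f"{name:>11}: Pearson r = {pearson_r:.3f}, "
          f"Spearman rho = {spearman_rho:.3f}")
```

For the exponential curve Pearson understates the relationship because the points do not lie on a straight line, whereas the ranks line up perfectly.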
Warning 3 — Sample size matters
from scipy import stats
import numpy as np
# Two datasets of very different size: always read r together with its p-value
np.random.seed(42)
# Small sample: only 6 data points
x_small = [1, 2, 3, 4, 5, 6]
y_small = [2.1, 3.8, 5.2, 4.9, 7.1, 8.3]
r_small, p_small = stats.pearsonr(x_small, y_small)
# Large sample: 100 data points with a true correlation of roughly 0.6
x_large = np.random.uniform(1, 10, 100)
y_large = 0.6*x_large + np.random.normal(0, 2, 100)
r_large, p_large = stats.pearsonr(x_large, y_large)
print(f"Small sample (n=6): r = {r_small:.3f} p = {p_small:.3f}")
print(f"Large sample (n=100): r = {r_large:.3f} p = {p_large:.3f}")
# Small sample (n=6): r = 0.973 p = 0.001
# Large sample (n=100): r = 0.576 p = 0.000
# The large sample has the smaller r but the smaller p-value:
# more data makes an estimate of r more trustworthy
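To isolate the effect of sample size, the same r can be held fixed while n varies. The sketch below uses the standard t-test for a Pearson correlation, t = r·√((n − 2)/(1 − r²)) with n − 2 degrees of freedom; `p_value_for_r` is a hypothetical helper name, and r = 0.6 is just an example value.

```python
import math

from scipy import stats


def p_value_for_r(r, n):
    """Two-sided p-value for a Pearson r observed in a sample of size n."""
    df = n - 2
    t_stat = r * math.sqrt(df / (1 - r ** 2))
    # sf is the survival function (1 - CDF); doubled for a two-sided test.
    return 2 * stats.t.sf(abs(t_stat), df)


for n in (6, 20, 100):
    print(f"r = 0.6, n = {n:>3}: p = {p_value_for_r(0.6, n):.4f}")
```

With only six points, r = 0.6 gives a p-value of about 0.21, far above 0.05, while the same r from 100 points is significant at any conventional threshold.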
Warning 4 — Correlation does not tell you the slope
Two datasets can both have r = 0.9. In one, every extra hour of study adds 2 marks. In another, it adds 20 marks. Correlation tells you how consistently they vary together — not how much y changes per unit of x. For that you need linear regression.
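This is easy to verify: rescaling y changes the slope but leaves r untouched. A minimal sketch with made-up study data (the numbers and course names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
hours = rng.uniform(1, 10, 50)

# Course A: about 2 marks per extra hour of study.
marks_a = 2 * hours + rng.normal(0, 2, 50)
# Course B: the same pattern stretched tenfold, about 20 marks per hour.
marks_b = 10 * marks_a

results = {}
for name, marks in {"A": marks_a, "B": marks_b}.items():
    r = np.corrcoef(hours, marks)[0, 1]
    slope, _ = np.polyfit(hours, marks, 1)   # slope of the least-squares line
    results[name] = (r, slope)
    print(f"Course {name}: r = {r:.3f}, slope = {slope:.1f} marks per hour")
```

Both courses print the same r, while the slopes differ by a factor of ten.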
Correlation in Machine Learning
Feature selection — drop highly correlated features
import pandas as pd
import numpy as np
np.random.seed(0)
n = 200
df = pd.DataFrame({
    'age': np.random.randint(22, 65, n),
    'years_exp': np.random.randint(0, 40, n),
    'salary': np.random.normal(60, 15, n),
    'monthly_pay': None, # placeholder, filled in below; perfectly correlated with salary
    'performance': np.random.uniform(1, 10, n),
    'team_size': np.random.randint(2, 20, n),
})
# monthly_pay is just salary/12 — perfectly correlated, adds no information
df['monthly_pay'] = df['salary'] / 12
corr = df.corr().abs()
print("Correlations with 'salary':")
print(corr['salary'].sort_values(ascending=False).round(3))
# Find pairs with correlation > 0.85 — candidates to drop
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.85)]
print(f"\nFeatures to consider dropping: {to_drop}")
# Features to consider dropping: ['monthly_pay']
When two features are highly correlated (|r| > 0.85) they carry almost identical information. Keeping both in a linear model causes multicollinearity: the coefficients become unstable and hard to interpret. Drop the less interpretable one, or use PCA to combine them into a single feature.