
Section 01

The Story That Explains Correlation

A city council notices that on days when more ice cream is sold, the number of drowning incidents also rises. A worried councillor proposes banning ice cream to save lives.

⚠️

Wait — that cannot be right

Ice cream does not cause drowning. Both rise together because of a third hidden variable — hot weather. On hot days, people buy more ice cream and also swim more, which increases the chance of drowning. Ice cream and drowning are correlated but one does not cause the other. This is the most important lesson in all of statistics: correlation does not imply causation.

Correlation is a number that measures how strongly two variables move together. Understanding it — and its limits — is one of the most critical skills in data science.
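The ice cream story can be reproduced with made-up numbers. The sketch below is a simulation under stated assumptions: the temperature range, the coefficients and the noise levels are all hypothetical, chosen only so that temperature drives both series. It compares the raw correlation with the correlation left over once the straight-line effect of temperature is removed from both variables (a partial correlation).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical daily data: temperature drives both series
temperature = rng.uniform(10, 35, n)                     # degrees Celsius
ice_cream = 20 * temperature + rng.normal(0, 60, n)      # units sold
drownings = 0.1 * temperature + rng.normal(0, 0.5, n)    # incident index

# Raw correlation between the two outcomes
r_raw = np.corrcoef(ice_cream, drownings)[0, 1]

def residuals(y, x):
    """Remove the straight-line effect of x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Partial correlation: correlate what is left after removing temperature
r_partial = np.corrcoef(residuals(ice_cream, temperature),
                        residuals(drownings, temperature))[0, 1]

print(f"Raw correlation:                   {r_raw:.2f}")
print(f"After controlling for temperature: {r_partial:.2f}")
```

In this simulation the raw correlation is strong, while the partial correlation sits close to zero, which is what you would expect when a third variable explains the link.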

Section 02

What is Correlation?

Correlation measures the strength and direction of the linear relationship between two numerical variables. It answers the question: when one variable changes, does the other tend to change too — and in which direction?

Positive

↑↑

Both rise together
Both fall together
e.g. height and weight

No correlation

→ ?

No consistent pattern
No linear relationship
e.g. shoe size and salary

Negative

↑↓

One rises, other falls
Inverse relationship
e.g. speed and travel time

Section 03

The Correlation Coefficient (r)

The Pearson correlation coefficient (written as r) is the most common measure of correlation. It always falls between −1 and +1.

Pearson r

r = Σ[(x−x̄)(y−ȳ)] / √[Σ(x−x̄)² × Σ(y−ȳ)²]

Measures linear relationship strength between −1 and +1

r value	Strength	Direction	Real example
`+1.0`	Perfect	Positive	Celsius and Fahrenheit
`+0.7 to +0.9`	Strong	Positive	Height and weight
`+0.4 to +0.6`	Moderate	Positive	Study hours and exam score
`+0.1 to +0.3`	Weak	Positive	Sleep and productivity
`0.0`	None	—	Shoe size and IQ
`−0.1 to −0.3`	Weak	Negative	Stress and sleep quality
`−0.4 to −0.6`	Moderate	Negative	Price and demand
`−0.7 to −0.9`	Strong	Negative	Speed and travel time
`−1.0`	Perfect	Negative	Money spent and money left from a fixed budget

📐

The sign tells you direction, the size tells you strength

r = −0.85 is a stronger relationship than r = +0.40. The negative sign just means they move in opposite directions. Always look at the absolute value to judge strength, and the sign to judge direction.
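To make the "size for strength, sign for direction" rule concrete, here is a small helper. The function name `describe_r` and its cut-offs (0.1, 0.4, 0.7) are assumptions loosely based on the table above, not a standard; different fields draw these lines in different places.

```python
import math

def describe_r(r):
    """Return (strength, direction) for a correlation coefficient.

    The cut-offs are illustrative assumptions, not a standard.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")

    size = abs(r)                      # strength ignores the sign
    if math.isclose(size, 1.0):
        strength = "perfect"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.4:
        strength = "moderate"
    elif size >= 0.1:
        strength = "weak"
    else:
        strength = "none"

    if strength == "none":
        direction = "none"
    else:
        direction = "positive" if r > 0 else "negative"
    return strength, direction

for r in (-0.85, 0.40, 0.05):
    print(r, describe_r(r))
```

Note that −0.85 comes out as "strong" and +0.40 as "moderate": the helper judges strength from the absolute value only.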

Section 04

Step-by-Step Calculation

A teacher records study hours and exam scores for 5 students:

Student	Study hours (x)	Exam score (y)
Alice	2	50
Bob	4	60
Carol	6	72
Dan	8	80
Eve	10	95

🧮 Calculating Pearson r

Step 1

Calculate the means.
x̄ = (2+4+6+8+10)/5 = 6
ȳ = (50+60+72+80+95)/5 = 71.4

Step 2

Calculate deviations (x−x̄) and (y−ȳ) for each student.
Alice: (2−6)=−4, (50−71.4)=−21.4 product = 85.6
Bob: (4−6)=−2, (60−71.4)=−11.4 product = 22.8
Carol: (6−6)=0, (72−71.4)=0.6 product = 0
Dan: (8−6)=2, (80−71.4)=8.6 product = 17.2
Eve: (10−6)=4, (95−71.4)=23.6 product = 94.4

Step 3

Sum the products: Σ[(x−x̄)(y−ȳ)]
85.6 + 22.8 + 0 + 17.2 + 94.4 = 220

Step 4

Calculate Σ(x−x̄)² = 16+4+0+4+16 = 40
Calculate Σ(y−ȳ)² = 457.96+129.96+0.36+73.96+556.96 = 1219.2

Step 5

r = 220 / √(40 × 1219.2) = 220 / √48768 ≈ 220 / 220.83 ≈ 0.996

✅

Interpretation

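r ≈ 0.996 is an almost perfect positive correlation. Study hours and exam scores move together extremely closely. In this sample, every extra 2 hours of study is associated with roughly 11 more marks (a fitted slope of 220 / 40 = 5.5 marks per hour). Within these five students, study hours predict exam scores closely, but five data points are too few to generalise from, and the correlation alone does not show that studying causes the higher scores.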
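The "11 marks per 2 hours" figure comes from the slope of the best-fit line, not from r itself. A minimal sketch using `np.polyfit` on the same five students shows where it comes from; the slope equals Σ[(x−x̄)(y−ȳ)] / Σ(x−x̄)² = 220 / 40.

```python
import numpy as np

x = np.array([2, 4, 6, 8, 10])       # study hours
y = np.array([50, 60, 72, 80, 95])   # exam scores

# Degree-1 fit returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(x, y, 1)
r = np.corrcoef(x, y)[0, 1]

print(f"Slope:     {slope:.2f} marks per hour")
print(f"Per 2 h:   {2 * slope:.1f} marks")
print(f"Intercept: {intercept:.2f}")
print(f"r squared: {r ** 2:.3f}")     # share of variance in y explained by the line
```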

Section 05

Python Implementation

Manual calculation

import math

x = [2, 4, 6, 8, 10]   # study hours
y = [50, 60, 72, 80, 95]   # exam scores

n    = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

numerator   = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
denom_x     = sum((xi - x_mean) ** 2 for xi in x)
denom_y     = sum((yi - y_mean) ** 2 for yi in y)
denominator = math.sqrt(denom_x * denom_y)

r = numerator / denominator
print(f"Pearson r: {r:.4f}")   # 0.9962

Using NumPy

import numpy as np

x = [2, 4, 6, 8, 10]
y = [50, 60, 72, 80, 95]

# np.corrcoef returns a 2x2 correlation matrix
corr_matrix = np.corrcoef(x, y)
r = corr_matrix[0, 1]   # corr_matrix[0, 0] and corr_matrix[1, 1] are always 1.0 (self-correlation)

print(f"Pearson r: {r:.4f}")   # 0.9962
print("\nFull correlation matrix:")
print(np.round(corr_matrix, 4))
# [[1.     0.9962]
#  [0.9962 1.    ]]

Using Pandas on a DataFrame

import pandas as pd

data = {
    'study_hours':  [2, 4, 6, 8, 10, 3, 7, 5, 9, 1],
    'exam_score':   [50, 60, 72, 80, 95, 55, 78, 65, 88, 40],
    'sleep_hours':  [8, 7, 7, 6, 5, 8, 6, 7, 5, 9],
    'anxiety_level':[3, 4, 4, 6, 7, 3, 5, 4, 8, 2],
}

df = pd.DataFrame(data)

# Single pair
r = df['study_hours'].corr(df['exam_score'])
print(f"Study hours vs exam score: r = {r:.4f}")

# Full correlation matrix for all columns
print("\nFull correlation matrix:")
print(df.corr().round(3))

Visualising with a scatter plot

import matplotlib.pyplot as plt
import numpy as np

x = [2, 4, 6, 8, 10]
y = [50, 60, 72, 80, 95]

r = np.corrcoef(x, y)[0, 1]

fig, ax = plt.subplots(figsize=(7, 5))

ax.scatter(x, y, color='#378add', s=100, zorder=3,
           edgecolors='white', linewidths=1)

# Trend line
m, b = np.polyfit(x, y, 1)
x_line = np.linspace(min(x)-0.5, max(x)+0.5, 100)
ax.plot(x_line, m*x_line+b, color='#f59e0b',
        linewidth=2, linestyle='--', label='Trend line')

# Label each point
labels = ['Alice', 'Bob', 'Carol', 'Dan', 'Eve']
for xi, yi, lb in zip(x, y, labels):
    ax.annotate(lb, (xi, yi), textcoords='offset points',
                xytext=(8, 4), fontsize=9, color='#888780')

ax.set_title(f'Study Hours vs Exam Score  (r = {r:.3f})', fontsize=13)
ax.set_xlabel('Study hours')
ax.set_ylabel('Exam score')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Section 06

Real-World Stories

Story 1 — The Sunscreen Paradox

A study finds a strong positive correlation between sunscreen sales and skin cancer rates. A journalist writes: "Sunscreen causes cancer." What is actually happening?

⚠️

Hidden variable: UV exposure

In regions with high UV levels, people buy more sunscreen AND get more skin cancer (because sunscreen reduces but does not eliminate risk). UV exposure is the confounding variable driving both. Sunscreen actually reduces cancer risk — the correlation is misleading without controlling for the true cause.

Story 2 — Netflix and Shoe Size

A data scientist finds that in their dataset, people with larger shoe sizes watch more Netflix. r = +0.71. Should Netflix target large-footed customers?

💡

Spurious correlation from age

Adults have larger shoe sizes and also watch more Netflix than children. Age is the hidden variable. When you control for age the correlation between shoe size and Netflix watching drops to near zero. This is a spurious correlation — real in the numbers, meaningless in reality. Always ask: is there a third variable explaining both?
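One simple way to "control for age" is to split the data by age group and compute the correlation inside each group. The sketch below uses simulated data; the group means, spreads and sample size are hypothetical values picked so that age group is the only thing linking shoe size to viewing hours.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000

# Hypothetical data: age group drives both shoe size and viewing hours
is_adult = rng.random(n) < 0.5
shoe_size = np.where(is_adult, 9.0, 3.0) + rng.normal(0, 1.0, n)
netflix_hours = np.where(is_adult, 10.0, 3.0) + rng.normal(0, 2.0, n)

df = pd.DataFrame({
    "group": np.where(is_adult, "adult", "child"),
    "shoe_size": shoe_size,
    "netflix_hours": netflix_hours,
})

# Correlation across everyone, ignoring age
overall_r = df["shoe_size"].corr(df["netflix_hours"])

# Correlation inside each age group
within_r = {
    name: part["shoe_size"].corr(part["netflix_hours"])
    for name, part in df.groupby("group")
}

print(f"Overall: r = {overall_r:.2f}")
for name, r in within_r.items():
    print(f"{name:>7}: r = {r:.2f}")
```

The overall correlation is strong, but inside each group it falls to roughly zero. That pattern is the signature of a confounder.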

Story 3 — The £1 Billion Decision

An e-commerce company finds a strong positive correlation (r = +0.82) between the number of customer support tickets raised and total revenue. A manager suggests cutting support to reduce tickets and save money.

⚠️

Direction of causation matters

More revenue means more customers — more customers means more support tickets. Revenue causes tickets, not the other way around. Cutting support would harm customers and reduce future revenue. The correlation is real but the assumed direction of causation was backwards. Always use domain knowledge alongside the numbers.

Section 07

Pearson vs Spearman — Which to Use?

Pearson measures linear correlation. Spearman measures rank-based (monotonic) correlation: it works even when the relationship is not a straight line, as long as one variable consistently rises or consistently falls with the other.

Property	Pearson r	Spearman ρ
Measures	Linear relationship	Monotonic relationship
Requires normal distribution?	Not to compute r; the standard p-value assumes it	No
Sensitive to outliers?	Yes	Less so
Works on ordinal data?	No	Yes
Use when	Data is continuous and roughly normal	Data is skewed, ordinal, or has outliers
Range	−1 to +1	−1 to +1
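Two rows of the table can be checked directly. The sketch below uses invented data: an exponential curve for the "monotonic but not linear" case, and a straight line with one hypothetical extreme point appended for the outlier case.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 11, dtype=float)

# Case 1: monotonic but far from a straight line
y_curve = np.exp(x / 2)
pearson_curve, _ = stats.pearsonr(x, y_curve)
spearman_curve, _ = stats.spearmanr(x, y_curve)

# Case 2: a perfect line y = 2x with one extreme point appended
x_out = np.append(x, 11.0)
y_out = np.append(2 * x, -50.0)     # the last value is a hypothetical outlier
pearson_out, _ = stats.pearsonr(x_out, y_out)
spearman_out, _ = stats.spearmanr(x_out, y_out)

print(f"Curve:   Pearson = {pearson_curve:.3f}  Spearman = {spearman_curve:.3f}")
print(f"Outlier: Pearson = {pearson_out:.3f}  Spearman = {spearman_out:.3f}")
```

On the curve, Spearman is exactly 1 because the ranks agree perfectly, while Pearson is lower. With the outlier, Pearson turns negative and Spearman drops to 0.5: ranks soften the damage from an extreme value but do not remove it.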

from scipy import stats
import numpy as np

# Study hours and exam scores
x = [2, 4, 6, 8, 10, 3, 7, 5, 9, 1]
y = [50, 60, 72, 80, 95, 55, 78, 65, 88, 40]

# Pearson — measures linear relationship
pearson_r, pearson_p = stats.pearsonr(x, y)
print(f"Pearson r  = {pearson_r:.4f}  (p-value: {pearson_p:.4f})")

# Spearman — measures rank-based relationship
spearman_r, spearman_p = stats.spearmanr(x, y)
print(f"Spearman ρ = {spearman_r:.4f}  (p-value: {spearman_p:.4f})")

# Pearson r is about 0.996. Spearman ρ is exactly 1.0000 because the scores
# rise strictly with study hours. Both p-values print as 0.0000.

# A p-value below 0.05 is the conventional cut-off for statistical significance:
# a correlation this strong would be unlikely if the true correlation were zero

💡

What is the p-value here?

The p-value answers one question: if the true correlation were zero, how likely is a sample correlation at least this far from zero? A p-value below 0.05 is conventionally called statistically significant. Sample size matters here: with only 6 data points, r = 0.7 gives p ≈ 0.12, so the correlation could plausibly be a fluke of random sampling. Always report r, the p-value and the sample size together.
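One way to build intuition for the p-value is a permutation test: shuffle one variable to destroy any real link, and count how often chance alone produces a correlation as strong as the observed one. This is a sketch of that idea, not what `stats.pearsonr` does by default; the number of shuffles (5000) and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([2, 4, 6, 8, 10, 3, 7, 5, 9, 1])             # study hours
y = np.array([50, 60, 72, 80, 95, 55, 78, 65, 88, 40])    # exam scores

observed = np.corrcoef(x, y)[0, 1]

# Shuffle y to break any real link, then count how often chance alone
# gives a correlation at least as extreme as the observed one
n_shuffles = 5000
count = 0
for _ in range(n_shuffles):
    r_shuffled = np.corrcoef(x, rng.permutation(y))[0, 1]
    if abs(r_shuffled) >= abs(observed):
        count += 1

# Add 1 to both counts so the estimate is never exactly zero
p_perm = (count + 1) / (n_shuffles + 1)
print(f"Observed r = {observed:.3f}, permutation p-value = {p_perm:.4f}")
```

A tiny p-value here means that shuffled data almost never matches the observed correlation, which is the same message the analytic p-value gives.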

Section 08

Correlation Matrix — Many Variables at Once

In real data science you rarely look at just two variables. A correlation matrix shows the pairwise correlation between every variable in your dataset simultaneously.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)
n = 100

df = pd.DataFrame({
    'study_hours':   np.random.uniform(1, 12, n),
    'sleep_hours':   np.random.uniform(4, 9, n),
    'anxiety':       np.random.uniform(1, 10, n),
    'exam_score':    None,
    'salary_k':      None,
})

# Construct realistic relationships
df['exam_score'] = (
    df['study_hours'] * 7 +
    df['sleep_hours'] * 2 -
    df['anxiety'] * 1.5 +
    np.random.normal(0, 5, n)
).clip(0, 100)

df['salary_k'] = (
    df['exam_score'] * 0.6 +
    df['study_hours'] * 1.2 +
    np.random.normal(30, 5, n)
).clip(20, 120)

corr = df.corr().round(2)
print(corr)

# Heatmap
fig, ax = plt.subplots(figsize=(8, 6))

sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
            vmin=-1, vmax=1, center=0,
            square=True, linewidths=0.5, ax=ax)

ax.set_title('Correlation Matrix — Student Data', fontsize=13)
plt.tight_layout()
plt.show()

🎯

Reading a correlation heatmap

Dark red = strong positive correlation. Dark blue = strong negative. Pale grey = near zero (the midpoint of the coolwarm palette). The diagonal is always 1.0 (a variable is perfectly correlated with itself). Look for dark cells off the diagonal: those are the interesting relationships in your data.
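With many columns, scanning a heatmap by eye gets slow. A possible shortcut is to list the off-diagonal pairs sorted by absolute correlation. The data below is a small hypothetical stand-in for the student dataset, with only `exam_score` built to depend on `study_hours`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100

# Hypothetical student data: exam_score depends on study_hours, sleep does not
study = rng.uniform(1, 12, n)
df = pd.DataFrame({
    "study_hours": study,
    "sleep_hours": rng.uniform(4, 9, n),
    "exam_score": 7 * study + rng.normal(0, 5, n),
})

corr = df.corr()

# Keep only cells above the diagonal so each pair appears exactly once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Strongest relationships first, judged by absolute value
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.round(2))
```

The result is a Series indexed by (row, column) pairs, so `ranked.index[0]` names the most strongly correlated pair of columns.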

Section 09

Four Warnings About Correlation

Warning 1 — Anscombe's Quartet

In 1973 statistician Francis Anscombe created four datasets that all have identical correlation coefficients (r = 0.816) but look completely different when plotted.

import matplotlib.pyplot as plt
import numpy as np

# Anscombe's Quartet — four datasets, same r = 0.816
datasets = [
    ([10,8,13,9,11,14,6,4,12,7,5], [8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68]),
    ([10,8,13,9,11,14,6,4,12,7,5], [9.14,8.14,8.74,8.77,9.26,8.10,6.13,3.10,9.13,7.26,4.74]),
    ([10,8,13,9,11,14,6,4,12,7,5], [7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73]),
    ([8,8,8,8,8,8,8,8,8,8,19],    [6.58,5.76,7.71,8.84,8.47,7.04,5.25,5.56,7.91,6.89,12.50]),
]

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes = axes.flatten()

for i, (x, y) in enumerate(datasets):
    r = np.corrcoef(x, y)[0,1]
    m, b = np.polyfit(x, y, 1)     # least-squares trend line
    axes[i].scatter(x, y, color='#378add', s=60, zorder=3)
    axes[i].plot(sorted(x), [m*xi+b for xi in sorted(x)],
                 color='#f59e0b', linewidth=1.5, linestyle='--')
    axes[i].set_title(f'Dataset {i+1}  (r = {r:.3f})', fontsize=11)
    # Same axis limits on every panel so the shapes are comparable
    axes[i].set_xlim(0, 22)
    axes[i].set_ylim(0, 14)
    axes[i].grid(alpha=0.3)

plt.suptitle("Anscombe's Quartet — same r, completely different data",
             fontsize=13)
plt.tight_layout()
plt.show()

⚠️

Always plot your data

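Anscombe's Quartet shows that the correlation coefficient alone is not enough. Dataset 1 is a noisy straight line, the case r is designed for. Dataset 2 is a smooth curve. Dataset 3 has one outlier spoiling an otherwise perfect line. Dataset 4 has every x equal to 8 except one extreme point, and that single point creates the whole correlation. Always look at the scatter plot before trusting r.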
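It is not only r that matches across the quartet. As a quick check, this sketch prints the mean and variance of y and the fitted line for each dataset, using the same values as the plotting code above; the Roman-numeral labels are just names chosen here.

```python
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8] * 10 + [19],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.50]),
}

summary = {}
for name, (x, y) in quartet.items():
    slope, intercept = np.polyfit(x, y, 1)      # least-squares line
    r = np.corrcoef(x, y)[0, 1]
    summary[name] = (np.mean(y), np.var(y, ddof=1), r, slope, intercept)
    print(f"{name:>3}: mean_y = {summary[name][0]:.2f}  "
          f"var_y = {summary[name][1]:.2f}  r = {r:.3f}  "
          f"line: y = {slope:.2f}x + {intercept:.2f}")
```

All four rows come out nearly identical, which is exactly why summary numbers cannot replace a plot.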

Warning 2 — Non-linear relationships

import numpy as np
import matplotlib.pyplot as plt

# A U-shape with a little noise: almost no linear correlation, yet an obvious relationship
np.random.seed(0)   # fixed seed so the run is repeatable
x = np.linspace(-3, 3, 100)
y = x ** 2 + np.random.normal(0, 0.2, 100)

r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.4f}")   # close to 0, even though y clearly depends on x

fig, ax = plt.subplots(figsize=(7, 4))
ax.scatter(x, y, color='#378add', s=30, alpha=0.7)
ax.set_title(f'y = x² — perfect quadratic relationship but r = {r:.3f}')
ax.set_xlabel('x'); ax.set_ylabel('y')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

💡

Pearson only measures linear relationships

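A U-shaped relationship gives r ≈ 0 even though x predicts y almost perfectly. If you only computed r you would conclude there is no relationship, which is completely wrong. Use Spearman for monotonic non-linear relationships, but note that it would not rescue this example: a U-shape falls and then rises, so it is not monotonic and Spearman is also close to zero. Always check the scatter plot for non-linear patterns.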
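To see where Spearman helps and where it does not, compare a U-shape with a curve that always rises. This is a noise-free sketch; the two functions are arbitrary examples.

```python
import numpy as np
from scipy import stats

x = np.linspace(-3, 3, 101)
curves = {
    "x squared (U-shape)": x ** 2,        # falls, then rises
    "x cubed (always rising)": x ** 3,    # curved but monotonic
}

results = {}
for label, y in curves.items():
    pearson_r, _ = stats.pearsonr(x, y)
    spearman_rho, _ = stats.spearmanr(x, y)
    results[label] = (pearson_r, spearman_rho)
    print(f"{label:<24} Pearson = {pearson_r:+.3f}  Spearman = {spearman_rho:+.3f}")
```

For the cubic, Spearman reaches 1 while Pearson falls short of it. For the U-shape, both measures are close to zero, so only the plot reveals the relationship.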

Warning 3 — Sample size matters

from scipy import stats
import numpy as np

# Two datasets of very different size. The smaller the sample, the more r
# swings from one sample to the next, so the less a single r can be trusted.
np.random.seed(42)

# Small sample: 6 points that happen to line up closely, so r is very high
x_small = [1, 2, 3, 4, 5, 6]
y_small = [2.1, 3.8, 5.2, 4.9, 7.1, 8.3]
r_small, p_small = stats.pearsonr(x_small, y_small)

# Large sample: 100 noisy points scattered around a line with slope 0.6
x_large = np.random.uniform(1, 10, 100)
y_large = 0.6*x_large + np.random.normal(0, 2, 100)
r_large, p_large = stats.pearsonr(x_large, y_large)

print(f"Small sample (n=6):   r = {r_small:.3f}  p = {p_small:.3f}")
print(f"Large sample (n=100): r = {r_large:.3f}  p = {p_large:.3f}")

# Small sample (n=6):   r = 0.973  p = 0.001
# Large sample (n=100): r = 0.576  p = 0.000
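A confidence interval makes the effect of sample size easier to see than a p-value does. The sketch below uses the Fisher z-transform, a standard approximation for Pearson r; the helper name `fisher_ci` and the fixed value r = 0.6 are choices made here for illustration, and the approximation is rough for very small n.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for Pearson r (Fisher z-transform)."""
    z = math.atanh(r)                 # transform r to an roughly normal scale
    se = 1 / math.sqrt(n - 3)         # standard error on that scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

intervals = {}
for n in (6, 30, 100):
    intervals[n] = fisher_ci(0.6, n)
    lo, hi = intervals[n]
    print(f"n = {n:>3}: r = 0.60, 95% CI roughly ({lo:+.2f}, {hi:+.2f})")
```

With 6 points the interval for r = 0.6 stretches from below zero to above 0.9, so the data cannot even settle the sign. With 100 points it narrows to roughly 0.46 to 0.71.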

Warning 4 — Correlation does not tell you the slope

⚠️

r says nothing about the magnitude of change

Two datasets can both have r = 0.9. In one, every extra hour of study adds 2 marks. In another, it adds 20 marks. Correlation tells you how consistently they vary together — not how much y changes per unit of x. For that you need linear regression.
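A quick way to convince yourself: rescaling y changes the slope but leaves r untouched. In this sketch `marks_b` is a hypothetical second dataset, simply the worked-example scores multiplied by 10.

```python
import numpy as np

hours = np.array([2, 4, 6, 8, 10])
marks_a = np.array([50, 60, 72, 80, 95])   # scores from the worked example
marks_b = 10 * marks_a                     # hypothetical rescaled scores

stats_by_set = {}
for label, marks in [("A", marks_a), ("B", marks_b)]:
    r = np.corrcoef(hours, marks)[0, 1]
    slope = np.polyfit(hours, marks, 1)[0]   # change in marks per extra hour
    stats_by_set[label] = (r, slope)
    print(f"Dataset {label}: r = {r:.3f}, slope = {slope:.1f} marks per hour")
```

Both datasets share the same r, yet an extra hour is worth ten times as many marks in dataset B. The slope comes from regression, not from the correlation coefficient.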

Section 10

Correlation in Machine Learning

Feature selection — drop highly correlated features

import pandas as pd
import numpy as np

np.random.seed(0)
n = 200

df = pd.DataFrame({
    'age':           np.random.randint(22, 65, n),
    'years_exp':     np.random.randint(0, 40, n),
    'salary':        np.random.normal(60, 15, n),
    'monthly_pay':   None,   # will be perfectly correlated with salary
    'performance':   np.random.uniform(1, 10, n),
    'team_size':     np.random.randint(2, 20, n),
})

# monthly_pay is just salary/12 — perfectly correlated, adds no information
df['monthly_pay'] = df['salary'] / 12

corr = df.corr().abs()
print("Correlations with 'salary':")
print(corr['salary'].sort_values(ascending=False).round(3))

# Find pairs with correlation > 0.85 — candidates to drop
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.85)]
print(f"\nFeatures to consider dropping: {to_drop}")

# Features to consider dropping: ['monthly_pay']

🎯

Multicollinearity in ML

When two features are highly correlated (|r| > 0.85 is a common rule of thumb) they carry almost identical information. Keeping both in a linear model causes multicollinearity: the coefficients become unstable and hard to interpret. Drop the less interpretable feature of the pair, or use PCA to combine them into a single feature.
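The instability is easy to demonstrate. The sketch below fits an ordinary least-squares model on bootstrap resamples, once with a near-duplicate feature and once without, and compares how much the salary coefficient moves. Everything here is simulated: the `target` column, the noise levels and the 200 resamples are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

salary = rng.normal(60, 15, n)
monthly_pay = salary / 12 + rng.normal(0, 0.05, n)   # near-duplicate feature
target = 2.0 * salary + rng.normal(0, 10, n)         # hypothetical outcome

def salary_coef(columns, idx):
    """Least-squares coefficient on salary, fitted on the rows in idx."""
    A = np.column_stack([np.ones(len(idx))] + [c[idx] for c in columns])
    coef, *_ = np.linalg.lstsq(A, target[idx], rcond=None)
    return coef[1]                                   # column 0 is the intercept

both, alone = [], []
for _ in range(200):
    idx = rng.integers(0, n, n)                      # bootstrap resample of rows
    both.append(salary_coef([salary, monthly_pay], idx))
    alone.append(salary_coef([salary], idx))

spread_both = np.std(both)
spread_alone = np.std(alone)
print(f"Spread of salary coefficient with monthly_pay:    {spread_both:.3f}")
print(f"Spread of salary coefficient without monthly_pay: {spread_alone:.3f}")
```

With the near-duplicate included, the salary coefficient varies far more from resample to resample, even though the model's predictions barely change.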

Section 11

Golden Rules

🎯 Correlation — Key Rules

Correlation does not imply causation. This is the most important rule in statistics. Ice cream and drowning are correlated. Shoe size and Netflix watching are correlated. Always ask: is there a hidden third variable driving both?

Always plot your data first. Anscombe's Quartet proves that four completely different datasets can share the same r value. The scatter plot reveals non-linearity, outliers, and clusters that r alone cannot.

Use Pearson for linear relationships on continuous, roughly normal data. Use Spearman for monotonic but non-linear relationships, and for ordinal or skewed data: it is more robust to outliers and makes fewer assumptions.

Report r and the p-value together. A correlation of r = 0.7 in a sample of 6 is unreliable (p ≈ 0.12). The same r = 0.7 in a sample of 100 is highly significant. Sample size largely determines how far a correlation can be trusted.

In ML preprocessing, when two features have |r| > 0.85 with each other, drop one of the pair (multicollinearity). Keep features that correlate strongly with the target: they are often the most predictive inputs for your model.