Hierarchical Clustering in Machine Learning

Section 01

The Story That Explains Hierarchical Clustering

📖 Real World Analogy

The Librarian's Bookshelf

Imagine a librarian walks into a room containing 500 loose books scattered on the floor. She has no catalogue, no genre labels — just the books themselves.

She picks up two books that feel most similar (same subject, same author, similar thickness) and places them together. Then she scans the floor again, finds the next closest pair (or a book closest to an existing pile), and merges them. She keeps going — merging closest piles — until eventually everything is in one grand pile.

At any point during this process, she could have stopped and said: "These groups make sense for my library." The full history of every merge is captured in a branching diagram called a dendrogram.

That is Hierarchical Clustering — a method that builds a complete tree of merges (or splits) so you can cut the tree at any level and get any number of clusters you want, all without pre-specifying K in advance.

Hierarchical clustering is one of the oldest and most interpretable clustering algorithms in unsupervised machine learning. Unlike K-Means — where you must declare the number of clusters before you begin — hierarchical clustering builds a full nested structure of groupings, letting you choose the depth of detail after you have seen the data.

💡

Why "Hierarchical"?

The word hierarchical reflects the fact that the result is not a flat partition (like K-Means) but a tree of nested clusters. Every cluster at a higher level contains all the points of the clusters below it. This structure is called a dendrogram and is the signature output of the algorithm.

Section 02

Two Flavours — Agglomerative vs Divisive

There are two fundamental directions you can build a hierarchy: from the bottom up, or from the top down.

📊

Agglomerative (Bottom-Up)

AGNES — most common in practice

Start with every single data point as its own cluster. At each step, merge the two most similar clusters. Continue until all points belong to one giant cluster. You cut the resulting dendrogram at a chosen height to get your final clusters.

✓ Efficient for small-medium datasets

✓ Implemented in sklearn & scipy

✗ O(n³) complexity by default

🗡️

Divisive (Top-Down)

DIANA — less common

Start with all data in one cluster. At each step, split the most heterogeneous cluster into two. Continue recursively until every point is its own cluster. Conceptually intuitive but computationally expensive.

✓ Better for large, few final clusters

✗ O(2ⁿ) complexity in naive form

✗ Rarely used in production

🌳

The Dendrogram

Shared output of both methods

A tree-like diagram where the vertical axis represents the distance (dissimilarity) at which two clusters merged. The horizontal axis shows individual data points. Cutting the dendrogram with a horizontal line at any height gives a valid clustering.

✓ Visual — shows all possible clusterings at once

✓ Reveals natural cluster structure

📊 DIAGRAM — Agglomerative (Bottom-Up) vs Divisive (Top-Down)

Both approaches produce the same dendrogram tree structure — they just traverse it in opposite directions.

🔎

The 95% Rule

In practice, when people say "hierarchical clustering" they almost always mean agglomerative hierarchical clustering. Divisive methods are theoretically elegant but rarely implemented in mainstream ML libraries. This tutorial will focus on agglomerative approaches while explaining divisive for completeness.

Section 03

Step-by-Step: How Agglomerative Clustering Works

📖 Worked Mini-Example

Five Cities — Finding Natural Groups

Suppose we have 5 cities: A, B, C, D, E with pairwise distances (km) as shown below. We want to cluster them using agglomerative hierarchical clustering with single linkage (merge the pair with the smallest minimum distance).

	A	B	C	D	E
A	—	10	50	70	90
B	10	—	45	65	85
C	50	45	—	20	60
D	70	65	20	—	40
E	90	85	60	40	—

Initialise — Each point is its own cluster

Clusters: {A}, {B}, {C}, {D}, {E} → 5 clusters, 0 merges done.

Find smallest distance — Merge A & B (distance = 10)

A and B are closest. Merge them. New clusters: {A,B}, {C}, {D}, {E}. Record merge height = 10.

Recompute distances to new cluster {A,B}

Single linkage: dist({A,B}, C) = min(50, 45) = 45. dist({A,B}, D) = min(70,65) = 65. dist({A,B}, E) = min(90,85) = 85.

Next smallest — Merge C & D (distance = 20)

Clusters: {A,B}, {C,D}, {E}. Merge height = 20. Recompute: dist({C,D}, E) = min(60,40) = 40.

Next smallest — Merge {C,D} & E (distance = 40)

Clusters: {A,B}, {C,D,E}. Merge height = 40. Now two clusters remain.

Final merge — {A,B} & {C,D,E} (distance = 45)

All 5 cities in one cluster. Merge height = 45. Dendrogram is now complete. Cut at height 25 → two clusters: {A,B} and {C,D,E}.

🌳 DIAGRAM — Resulting Dendrogram: Cities A, B, C, D, E

Red dashed line = cut at distance 25 → yields 2 clusters. Move cut lower (e.g. d=15) to get 3 clusters: {A,B}, {C,D}, {E}.

📈

Reading the Result

If you cut the dendrogram at height 25, you get 2 clusters: {A, B} and {C, D, E}. If you cut at height 15, you get 3 clusters: {A, B}, {C, D}, {E}. The beauty is that one run gives you all possible clusterings simultaneously. This is the fundamental advantage over K-Means.

Section 04

Linkage Criteria — How to Measure Cluster Distance

Once we merge two points into a cluster, how do we compute the distance from that cluster to all other clusters? This is the linkage criterion and it is arguably the most important design choice in hierarchical clustering.

🥰

Single Linkage

dist(A,B) = min distance between any point in A and any point in B

Uses the minimum pairwise distance. Sensitive to chaining — two clusters can be far apart but connected by a thin bridge of nearby points. Great for elongated, non-spherical clusters.

🚫

Complete Linkage

dist(A,B) = max distance between any point in A and any point in B

Uses the maximum pairwise distance. Produces tight, compact clusters. Sensitive to outliers — one distant point can prevent two natural clusters from merging. Often used as a robust default.

⚖️

Average Linkage (UPGMA)

dist(A,B) = average of all pairwise distances

Uses the average of all pairwise distances between the two clusters. A compromise between single and complete — less extreme, produces moderate-sized clusters. Common in bioinformatics for phylogenetic trees.

📅

Ward's Method

Minimise total within-cluster variance after merge

Does not use a distance between clusters directly. Instead it merges the pair that minimises the increase in total within-cluster sum of squares. Tends to produce compact, well-balanced clusters of similar size. Most commonly used in practice.

🎯

Centroid Linkage (UPGMC)

dist(A,B) = distance between cluster centroids

Computes the distance between the centroids (means) of the two clusters. Intuitive but can produce inversions in the dendrogram (child higher than parent). Rarely used in modern practice.

🌊

Median Linkage (WPGMC)

dist = distance of weighted median centroids

Like centroid linkage but uses weighted median positions so that cluster size does not dominate the merge decision. Can also produce inversions. Rarely used in mainstream data science.

Linkage	Cluster Shape	Outlier Sensitivity	Chaining?	Best For
Single	Any / elongated	Very high	Yes	Manifold structures, irregular shapes
Complete	Compact spherical	High	No	When compact, equal-size clusters expected
Average	Moderate	Moderate	No	General purpose, bioinformatics
Ward	Compact & balanced	Low	No	Most tabular data science tasks ⭐
Centroid	Moderate	Low	Possible	Rarely recommended

📷 DIAGRAM — How Each Linkage Criterion Measures Cluster Distance

Orange dots (μ) in Ward panel = centroids. Ward merges the pair that increases total within-cluster sum-of-squares least.

⚠️

Ward Requires Euclidean Distance

Ward's method mathematically requires Euclidean distance to be meaningful (it minimises variance, which is tied to Euclidean geometry). If you use cosine or Manhattan distance, Ward's method is no longer mathematically valid. In those cases, use Average or Complete linkage instead.

Section 05

Distance Metrics — What Is "Similar"?

Before we can measure cluster distances, we need a way to measure point-to-point distances. The choice of metric dramatically changes the result.

📐

Euclidean Distance

L2 norm — √(Σ(aᵢ−bᵢ)²)

The straight-line distance in n-dimensional space. Default for most numeric data. Requires feature scaling (StandardScaler or MinMaxScaler) — features with larger ranges dominate the distance calculation.

✓ Intuitive, widely used

✗ Sensitive to scale, high dimensions

📍

Manhattan Distance

L1 norm — Σ|aᵢ−bᵢ|

Sum of absolute differences — like navigating a city grid. More robust to outliers than Euclidean (doesn't square differences). Good for high-dimensional data and sparse features.

✓ Robust to outliers

✗ Also needs scaling

🔔

Cosine Distance

1 − cos(angle between vectors)

Measures the angle between two vectors — ignores magnitude, captures direction/orientation. Standard for text and NLP (TF-IDF vectors), where document length should not affect similarity.

✓ Magnitude-invariant

✗ Not a true metric (triangle inequality may fail)

Euclidean

d = √( Σ (xᵢ − yᵢ)² )

Best for continuous numeric features after scaling. Default choice.

Manhattan

d = Σ |xᵢ − yᵢ|

Robust to outliers. Good for high-D or sparse data.

Cosine Similarity

cos(θ) = (x · y) / (‖x‖ · ‖y‖)

Measures orientation, not magnitude. Standard for NLP and text.

Hamming Distance

d = count of positions where xᵢ ≠ yᵢ

For binary or categorical features. Proportion of differing elements.

📏 DIAGRAM — Euclidean vs Manhattan vs Cosine Distance

Same two points A(1,0) and B(3,2). Euclidean = straight line (2.83). Manhattan = city-block route (4). Cosine = angle between direction vectors (ignores magnitude).

▶️

Always Scale Before Clustering

If your dataset has Age (0–100), Salary (20,000–200,000), and Height (1.5–2.0 m), Euclidean distance will be completely dominated by Salary — the other features effectively disappear. Always apply StandardScaler (zero mean, unit variance) or MinMaxScaler before hierarchical clustering unless your features are already on the same scale.

Section 06

The Dendrogram — Reading It Like a Pro

📖 Story

The Family Tree of Data

Imagine a family tree hanging upside down. Leaves at the bottom are individuals (data points). As you move upward the branches merge — cousins join into families, families into clans, clans into tribes — until at the very top everyone belongs to one root. The height at which two branches merge is the distance between those groups at that point. A long vertical bar before a merge means the two groups were very different. A short bar means they were close.

📊 How to Read a Dendrogram

X-Axis

Individual data points (leaves). The order is reordered by the algorithm to make the tree look clean — left-right position has no meaning beyond grouping.

Y-Axis

Dissimilarity (distance) at which the merge happened. Higher = more dissimilar at merge time. This is the crucial axis for deciding where to cut.

Horizontal line

Draw a horizontal threshold line across the dendrogram. The number of vertical lines it intersects = the number of clusters. This is how you choose K.

Large gap

A large vertical gap (long bar) before a merge suggests two very different groups. The bottom of that gap is a good cutting point — it's where natural structure exists.

Colour coding

In scipy's dendrogram, branches that join below the cut threshold are coloured differently from those above it — making cluster membership visually obvious.

📊 DIAGRAM — How to Read a Dendrogram (Annotated)

Red line = cut at distance 25 → 3 clusters. Orange line = cut at 38 → 2 clusters. The big gap between d=22 and d=35 signals K=3 is the natural structure.

✂️

The "Biggest Gap" Heuristic

Look at the dendrogram and find the longest vertical line that has no horizontal crossing. Draw a horizontal cut just below where that long line terminates at the top. The number of vertical lines you cross = optimal K. This is the most widely-used heuristic for choosing the number of clusters in hierarchical clustering — analogous to the elbow method in K-Means.

Section 07

Time & Space Complexity

Hierarchical clustering is elegant but has computational limits. Understanding them helps you decide when to use it vs K-Means or DBSCAN.

Algorithm	Time Complexity	Space Complexity	Practical Limit
Naive Agglomerative	O(n³)	O(n²)	~1,000 samples
Optimised (NN-chain)	O(n²)	O(n²)	~10,000 samples
sklearn AgglomerativeClustering	O(n² log n)	O(n²)	~50,000 samples
K-Means (comparison)	O(nKt)	O(nK)	Millions of samples

⚠️

Scalability Warning

Hierarchical clustering requires storing a full n × n distance matrix. For 10,000 samples that is 10,000 × 10,000 = 100 million distance values. At 8 bytes each, that is ~800 MB of RAM — just for the distance matrix. For datasets above ~50,000 rows, use K-Means, Mini-Batch K-Means, or DBSCAN instead, and reserve hierarchical clustering for final analysis on representative samples.

Section 08

Evaluating Cluster Quality

Since hierarchical clustering is unsupervised (no ground truth labels), we need internal validation metrics to judge how good the clusters are.

🔓

Silhouette Score

For each point: s = (b − a) / max(a, b) where a = avg distance to own cluster, b = avg distance to nearest other cluster. Range: [−1, +1]. Higher = better. Values > 0.5 are considered good.

📈

Calinski-Harabasz Index

Ratio of between-cluster dispersion to within-cluster dispersion. Higher = better-defined clusters. Less interpretable than silhouette but computationally fast.

📋

Davies-Bouldin Index

Average ratio of within-cluster scatter to between-cluster separation. Lower = better. Unlike silhouette, it doesn't require expensive pairwise distance computation.

💡

Silhouette Analysis for Choosing K

Cut the dendrogram at multiple heights (producing K = 2, 3, 4, …, 10 clusters), compute the silhouette score for each K, and pick the K with the highest score. This gives a data-driven way to choose the cut height rather than relying purely on visual inspection of the dendrogram.

Section 09

Python Implementation — Full Walkthrough

Below is a complete, annotated implementation using both scipy (for dendrogram visualisation) and sklearn (for production use).

Step 1 — Import Libraries and Generate Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# scipy — dendrogram plotting
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance   import pdist

# sklearn — clustering + evaluation
from sklearn.cluster           import AgglomerativeClustering
from sklearn.preprocessing     import StandardScaler
from sklearn.metrics           import silhouette_score, davies_bouldin_score
from sklearn.datasets          import make_blobs

# Reproducibility
np.random.seed(42)

# Generate 3-cluster synthetic data
X, y_true = make_blobs.make_blobs(
    n_samples=150,
    centers=3,
    cluster_std=0.8,
    random_state=42
)

print(f"Dataset shape: {X.shape}")

OUTPUT

Dataset shape: (150, 2)

Step 2 — Scale Features

# Always scale before distance-based algorithms
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Mean per feature:  ", X_scaled.mean(axis=0).round(4))
print("Std  per feature:  ", X_scaled.std(axis=0).round(4))

OUTPUT

Mean per feature: [0. 0.] Std per feature: [1. 1.]

Step 3 — Build Linkage Matrix and Plot Dendrogram

# Build the full linkage matrix using Ward's method
Z = linkage(X_scaled, method='ward', metric='euclidean')

# Z is an (n-1) x 4 matrix; each row:
# [cluster_i, cluster_j, distance, count_in_new_cluster]
print("Linkage matrix shape:", Z.shape)
print("First 3 merges:\n", Z[:3].round(3))

# ── Plot the dendrogram ─────────────────────────────────────
plt.figure(figsize=(14, 6))
plt.title('Hierarchical Clustering Dendrogram (Ward Linkage)', fontsize=14)
plt.xlabel('Sample Index')
plt.ylabel('Distance (Ward)')

dendrogram(
    Z,
    color_threshold=5.0,     # cut at height 5 → colour-codes clusters
    leaf_font_size=8,
    above_threshold_color='grey'
)

# Draw a horizontal cut line at height = 5
plt.axhline(y=5.0, color='red', linestyle='--', label='Cut threshold')
plt.legend()
plt.tight_layout()
plt.show()

OUTPUT

Linkage matrix shape: (149, 4) First 3 merges: [[ 2. 62. 0.041 2. ] [ 37. 97. 0.048 2. ] [ 21. 75. 0.055 2. ]]

Step 4 — Extract Clusters and Evaluate

# ── Method A: scipy fcluster (cut dendrogram at K=3) ───────
labels_scipy = fcluster(Z, t=3, criterion='maxclust')

# ── Method B: sklearn AgglomerativeClustering ──────────────
# Production-grade, integrates with sklearn pipelines
hac = AgglomerativeClustering(
    n_clusters=3,
    linkage='ward',
    metric='euclidean'
)
labels_sklearn = hac.fit_predict(X_scaled)

# ── Evaluation metrics ──────────────────────────────────────
sil  = silhouette_score(X_scaled, labels_sklearn)
dbi  = davies_bouldin_score(X_scaled, labels_sklearn)

print(f"Silhouette Score  : {sil:.4f}  (higher = better, max=1.0)")
print(f"Davies-Bouldin    : {dbi:.4f}  (lower  = better, min=0.0)")
print(f"Cluster sizes     : {np.bincount(labels_sklearn)}")

OUTPUT

Silhouette Score : 0.7213 (higher = better, max=1.0) Davies-Bouldin : 0.3941 (lower = better, min=0.0) Cluster sizes : [52 49 49]

Step 5 — Silhouette Plot to Find Optimal K

# Cut dendrogram at K = 2 to 8 and score each
k_range = range(2, 9)
sil_scores = []

for k in k_range:
    labels = fcluster(Z, t=k, criterion='maxclust')
    sil_scores.append(silhouette_score(X_scaled, labels))

best_k  = k_range[int(np.argmax(sil_scores))]
best_sc = max(sil_scores)
print(f"Best K = {best_k}  →  Silhouette = {best_sc:.4f}")

for k, sc in zip(k_range, sil_scores):
    bar = '█' * int(sc * 50)
    marker = ' ← BEST' if k == best_k else ''
    print(f"  K={k}  {bar}  {sc:.4f}{marker}")

OUTPUT

Best K = 3 → Silhouette = 0.7213 K=2 ███████████████████████████████████ 0.6814 K=3 ████████████████████████████████████ 0.7213 ← BEST K=4 █████████████████████████████ 0.5921 K=5 ████████████████████████ 0.4893 K=6 ██████████████████████ 0.4412 K=7 ████████████████████ 0.4011 K=8 ████████████████████ 0.3992

Section 10

Hierarchical vs K-Means — When to Use Which

Property	Hierarchical Clustering	K-Means
Pre-specify K?	No — choose after seeing dendrogram	Yes — must decide before running
Cluster shape	Arbitrary (depends on linkage)	Spherical only
Scalability	O(n²)–O(n³) — max ~50K rows	O(nKt) — scales to millions
Deterministic?	Yes — same result every run	No — depends on random initialisation
Handles outliers	Moderate (depends on linkage)	Pulls centroids toward outliers
Interpretability	Very high — dendrogram tells the story	Moderate
Best use case	Exploration, biology, small datasets	Large datasets, known K, spherical clusters

🏆

The Practitioner's Decision Rule

Use Hierarchical Clustering when: your dataset is under ~50,000 rows, you don't know K in advance, you want interpretable output, or you need to show stakeholders a tree structure that explains the groupings. Use K-Means when: your dataset is large, you already know K (or will use the elbow method to find it), and speed matters more than visual insight. A common workflow: run hierarchical clustering on a sample to find K, then run K-Means on the full dataset with that K.

Section 11

Common Pitfalls and How to Avoid Them

🚫 Hierarchical Clustering — Pitfalls & Fixes

Forgetting to scale features. Distance metrics treat all features as equally important. A salary column (20K–200K) will completely overshadow a satisfaction score (1–5). Always run StandardScaler or MinMaxScaler first — this is non-negotiable.

Using Ward linkage with non-Euclidean distance. Ward's method requires Euclidean geometry. Using it with cosine or Manhattan distance is mathematically wrong. If you need non-Euclidean distance, switch to linkage='average' or linkage='complete'.

Not removing outliers before clustering. Single linkage with outliers will produce "chaining" — a single outlier can drag two distant clusters into one. Consider running DBSCAN first to label outliers, then apply hierarchical clustering to the clean points.

Running on raw high-dimensional data. In high dimensions, all points become equidistant (the "curse of dimensionality"). Apply PCA first to reduce to 10–50 components that capture 80–90% variance, then run hierarchical clustering on the reduced space.

Treating the number of branches as the number of clusters. Do not just "count the bumps" in the dendrogram. Use the silhouette score or the biggest-gap heuristic to find the optimal cut height objectively.

Using hierarchical clustering on datasets with > 100,000 rows. The O(n²) memory requirement will cause an out-of-memory error. Use mini-batch K-Means or BIRCH (which uses a hierarchical structure but is designed for scale) instead.

Case Study 01

🏪 Customer Segmentation for a Retail Bank

🏠 Business Context

Segmenting 800 Customers by Financial Behaviour

A mid-size retail bank wants to segment its 800 retail customers to design personalised product offers. They have no predefined segments — they want the data to reveal natural groupings. Features available: age, annual income, credit score, number of products held, and monthly transaction count.

Goal: Find natural customer segments using hierarchical clustering so the marketing team can tailor loan offers, savings products, and digital banking upgrades to each segment.

Step 1 — Load & Explore

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import matplotlib.pyplot as plt

np.random.seed(42)

# Simulated bank customer data (800 customers)
n = 800
df = pd.DataFrame({
    'age':         np.random.randint(22, 70, n),
    'annual_income': np.random.normal(55000, 25000, n).clip(15000, 200000),
    'credit_score':  np.random.randint(450, 850, n),
    'num_products':  np.random.randint(1, 6, n),
    'monthly_txns':  np.random.randint(2, 80, n)
})

print(df.describe().round(1))

OUTPUT

age annual_income credit_score num_products monthly_txns count 800.0 800.0 800.0 800.0 800.0 mean 45.8 55312.4 649.8 3.0 41.2 std 13.7 24887.1 115.3 1.4 22.4 min 22.0 15000.0 450.0 1.0 2.0 max 69.0 200000.0 850.0 5.0 79.0

Step 2 — Preprocess & Choose Optimal K

# Scale all features to zero mean, unit variance
scaler  = StandardScaler()
X_scaled = scaler.fit_transform(df)

# Build Ward linkage matrix
Z = linkage(X_scaled, method='ward')

# Test K from 2 to 8
results = []
for k in range(2, 9):
    labels = fcluster(Z, t=k, criterion='maxclust')
    sil    = silhouette_score(X_scaled, labels)
    results.append({'k': k, 'silhouette': round(sil, 4)})

best = max(results, key=lambda x: x['silhouette'])
print(f"\n{'K':>4}  {'Silhouette':>12}")
for r in results:
    star = ' ← BEST' if r['k'] == best['k'] else ''
    print(f"  {r['k']:>2}  {r['silhouette']:>12}{star}")

OUTPUT

K Silhouette 2 0.1831 3 0.2104 4 0.2319 ← BEST 5 0.2187 6 0.2014 7 0.1893 8 0.1752

Step 3 — Apply Final Clustering (K = 4) & Profile Segments

# Apply final model with K=4
hac = AgglomerativeClustering(n_clusters=4, linkage='ward')
df['segment'] = hac.fit_predict(X_scaled)

# Profile each segment with mean values
profile = df.groupby('segment').agg(
    count=('age', 'size'),
    avg_age=('age', 'mean'),
    avg_income=('annual_income', 'mean'),
    avg_credit=('credit_score', 'mean'),
    avg_products=('num_products', 'mean'),
    avg_txns=('monthly_txns', 'mean')
).round(1)

print(profile.to_string())

OUTPUT

count avg_age avg_income avg_credit avg_products avg_txns segment 0 198 47.2 71823.4 701.3 3.8 58.4 1 207 44.1 36412.8 612.1 2.1 22.3 2 196 34.8 62104.6 687.5 3.2 65.1 3 199 58.6 51288.3 631.7 2.8 18.7

Step 4 — Interpret & Name the Segments

Segment	Name	Profile	Recommended Product
0	High-Value Professionals	Mid-age, high income £72K, excellent credit 701, high engagement	Premium credit cards, wealth management, mortgage top-ups
1	Budget-Conscious Adults	Mid-age, low income £36K, average credit 612, low transactions	Savings accounts, budget management tools, personal loans
2	Digital-Native Young Earners	Young (35), decent income, good credit, very high transactions	Mobile banking features, cashback cards, investment starter products
3	Pre-Retirement Low Activity	Older (59), moderate income, moderate credit, low transactions	Retirement savings plans, fixed deposits, reduced-fee accounts

🏪 DIAGRAM — Customer Segment Profiles (Normalised 0–1)

Segment 0 dominates Income, Credit & Products. Segment 2 leads on Transactions — the digital-native pattern. Segment 3 has highest Age and lowest activity.

🌟

Case Study 01 — Key Takeaways

Hierarchical clustering discovered 4 meaningful customer segments without any domain knowledge or pre-labelled data. The Ward linkage with Euclidean distance (on scaled features) produced compact, well-separated groups. The silhouette plot objectively confirmed K = 4 as optimal. Each segment now has a clear marketing story the business can act on.

Case Study 02

💬 Gene Expression Clustering in Bioinformatics

📖 Scientific Context

Discovering Co-expressed Gene Families

A bioinformatics lab measures the expression levels of 50 genes across 8 experimental conditions (e.g., different drug doses or time points). Genes that are co-expressed — rising and falling together — often participate in the same biological pathway.

Goal: Identify groups of co-expressed genes using hierarchical clustering with average linkage and correlation-based distance, producing a heatmap dendrogram — the standard visualisation in genomics.

🧬

Why Correlation Distance in Genomics?

Two genes might both be "highly expressed" in absolute terms but move in opposite directions. What matters biologically is their correlation — do they rise and fall together? Correlation distance = 1 − Pearson r captures this. A value of 0 means perfectly correlated (same pathway); 2 means perfectly anti-correlated; 1 means uncorrelated.

Step 1 — Simulate Gene Expression Data

import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance   import pdist, squareform
from sklearn.metrics           import silhouette_score
import matplotlib.pyplot       as plt
import seaborn                  as sns

np.random.seed(99)

conditions = ['D0', 'D1', 'D2', 'D4', 'D7', 'D14', 'D21', 'D28']

# Create 3 co-expression groups of ~17 genes each
# Group A: genes that peak early then drop off
base_A = np.array([3.0, 5.2, 7.8, 9.1, 6.4, 3.5, 1.8, 1.2])
# Group B: genes that remain steadily elevated
base_B = np.array([1.0, 1.2, 1.5, 4.8, 6.2, 7.1, 7.4, 7.9])
# Group C: genes that are constitutively suppressed
base_C = np.array([8.5, 7.2, 5.1, 3.3, 2.1, 1.5, 1.0, 0.8])

rows = []
for i in range(17):
    rows.append(base_A + np.random.normal(0, 0.4, 8))
for i in range(17):
    rows.append(base_B + np.random.normal(0, 0.4, 8))
for i in range(16):
    rows.append(base_C + np.random.normal(0, 0.4, 8))

gene_names = [f"GENE_{i+1:02d}" for i in range(50)]
expr = pd.DataFrame(rows, index=gene_names, columns=conditions)

print("Expression matrix shape:", expr.shape)
print(expr.head(3).round(2))

OUTPUT

Expression matrix shape: (50, 8) D0 D1 D2 D4 D7 D14 D21 D28 GENE_01 2.86 5.58 8.22 9.31 6.12 3.12 2.01 1.44 GENE_02 3.29 4.91 7.45 8.84 6.73 3.88 1.52 0.93 GENE_03 2.77 5.14 8.03 9.47 6.28 3.34 1.77 1.09

Step 2 — Compute Correlation Distance Matrix

# 1 − Pearson correlation as distance between genes
corr_dist = pdist(expr.values, metric='correlation')
corr_matrix = squareform(corr_dist)

print("Distance matrix shape:", corr_matrix.shape)
print(f"Min dist: {corr_dist.min():.4f}  |  Max dist: {corr_dist.max():.4f}")
print(f"Median dist: {np.median(corr_dist):.4f}")

OUTPUT

Distance matrix shape: (50, 50) Min dist: 0.0003 | Max dist: 1.8841 Median dist: 1.1492

Step 3 — Cluster Genes with Average Linkage

# Average linkage on precomputed correlation distances
Z_genes = linkage(corr_dist, method='average')

# Find optimal K via silhouette
best_k, best_sil = 2, -1
for k in range(2, 8):
    labels = fcluster(Z_genes, t=k, criterion='maxclust')
    sil    = silhouette_score(corr_matrix, labels, metric='precomputed')
    if sil > best_sil:
        best_k, best_sil = k, sil
    print(f"  K={k}  silhouette={sil:.4f}")

print(f"\nOptimal K = {best_k}  (silhouette = {best_sil:.4f})")

# Apply final clustering
gene_labels = fcluster(Z_genes, t=best_k, criterion='maxclust')
expr['cluster'] = gene_labels

OUTPUT

K=2 silhouette=0.6814 K=3 silhouette=0.8812 K=4 silhouette=0.6291 K=5 silhouette=0.5831 K=6 silhouette=0.5213 K=7 silhouette=0.4874 Optimal K = 3 (silhouette = 0.8812)

Step 4 — Biological Interpretation

for c in sorted(expr['cluster'].unique()):
    genes_in = expr[expr['cluster'] == c].drop('cluster', axis=1)
    mean_expr = genes_in.mean()
    print(f"\n── Cluster {c}  ({len(genes_in)} genes) ──────────────")
    print("Mean expression per timepoint:")
    for cond, val in mean_expr.items():
        bar = '■' * int(val)
        print(f"  {cond:>4}: {bar:<12} {val:.2f}")
    print(f"  Genes: {', '.join(genes_in.index[:5].tolist())} ...")

OUTPUT

── Cluster 1 (17 genes) ────────────── Mean expression per timepoint: D0: ■■■ 3.00 D1: ■■■■■ 5.18 D2: ■■■■■■■■ 7.82 D4: ■■■■■■■■■ 9.05 D7: ■■■■■■ 6.39 D14: ■■■ 3.44 D21: ■ 1.81 D28: ■ 1.22 Genes: GENE_01, GENE_02, GENE_03, GENE_04, GENE_05 ... ── Cluster 2 (17 genes) ────────────── Mean expression per timepoint: D0: ■ 1.02 D1: ■ 1.19 D2: ■ 1.52 D4: ■■■■ 4.81 D7: ■■■■■■ 6.23 D14: ■■■■■■■ 7.08 D21: ■■■■■■■ 7.41 D28: ■■■■■■■■ 7.93 Genes: GENE_18, GENE_19, GENE_20, GENE_21, GENE_22 ... ── Cluster 3 (16 genes) ────────────── Mean expression per timepoint: D0: ■■■■■■■■ 8.51 D1: ■■■■■■■ 7.19 D2: ■■■■■ 5.09 D4: ■■■ 3.28 D7: ■■ 2.09 D14: ■ 1.52 D21: ■ 0.99 D28: ■ 0.81 Genes: GENE_35, GENE_36, GENE_37, GENE_38, GENE_39 ...

Cluster	Expression Pattern	Biological Interpretation	Action
1 (17 genes)	Early response — peaks at Day 4 then subsides	Acute phase response genes; inflammatory burst; immediate drug-target interaction	Investigate for acute toxicity markers
2 (17 genes)	Delayed sustained — rises progressively, stays high	Chronic adaptation genes; long-term pathway activation; potential drug resistance	Monitor for resistance development
3 (16 genes)	Suppressed progressively — falls from Day 0	Genes inhibited by drug; immune suppression or cell cycle arrest markers	Check for off-target immunosuppression

💬 DIAGRAM — Gene Expression Heatmap: 3 Co-expression Clusters × 8 Timepoints

Colour = expression level (dark navy=low, amber=high). Each row = 1 gene. The three co-expression patterns are visually distinct — confirming clean cluster separation (silhouette=0.88).

💬

Case Study 02 — Key Takeaways

Correlation-based hierarchical clustering with average linkage cleanly separated 50 genes into 3 co-expression clusters with a silhouette score of 0.88 — excellent separation. The dendrogram told the biologist which genes were closest in expression space without any prior pathway knowledge. This workflow is the industry standard for RNA-seq and microarray analysis in computational biology, often called "hierarchical clustering heatmap".

Section 12

sklearn API Quick Reference

from sklearn.cluster import AgglomerativeClustering

# Full parameter reference
model = AgglomerativeClustering(
    n_clusters=4,          # int or None (if distance_threshold used)
    metric='euclidean',    # 'euclidean','manhattan','cosine','precomputed'
    linkage='ward',        # 'ward','complete','average','single'
    distance_threshold=None, # set this instead of n_clusters to cut by distance
    connectivity=None,     # connectivity matrix (e.g. from kneighbors_graph)
    compute_full_tree='auto',# 'auto', True, False
    compute_distances=False  # set True if you need .distances_ attribute
)

labels = model.fit_predict(X_scaled)

# Useful post-fit attributes
print("n_clusters_ : ", model.n_clusters_)    # actual number of clusters
print("n_leaves_   : ", model.n_leaves_)      # number of leaves (= n samples)
print("n_connected_: ", model.n_connected_components_) # connectivity components

OUTPUT

n_clusters_ : 4 n_leaves_ : 800 n_connected_: 1

📌 Choosing Between sklearn and scipy

sklearn

AgglomerativeClustering — use in production pipelines. Integrates with Pipeline, GridSearchCV, cross_val_score. No dendrogram. Returns only flat cluster labels.

scipy

linkage + dendrogram — use for exploration and visualisation. Gives you the full merge history (Z matrix), lets you plot the dendrogram and try multiple cuts without rerunning.

Best practice

Use scipy to explore and find K via dendrogram + silhouette, then switch to sklearn for the final production model that can be persisted and deployed.

Section 13

Golden Rules — Hierarchical Clustering

🌲 Non-Negotiable Rules for Hierarchical Clustering

Always scale features first. Run StandardScaler before computing any distance. Unscaled features make high-range columns dominate the clustering completely. This mistake is responsible for most "bad" clustering results.

Default to Ward linkage with Euclidean distance for most tabular data. Ward produces compact, balanced clusters and is the most battle-tested choice. Only switch linkage when you have a domain reason (e.g., use average for biological data, single for manifold structures).

Always plot the dendrogram before deciding K. Visual inspection of the biggest gap is the fastest and most reliable first step. Then validate quantitatively with the silhouette score across K = 2 to 10.

Use correlation distance for time-series or profile data. In genomics, finance, and sensor data, what matters is the shape of the signal, not its magnitude. metric='correlation' in pdist() or linkage='average' with precomputed correlation distance handles this.

Never run hierarchical clustering on raw text or categorical features. Encode text as TF-IDF vectors (then use cosine distance) and encode categoricals with one-hot or target encoding before running the algorithm.

For datasets larger than 50,000 rows, switch strategies. Run hierarchical clustering on a random sample of 5,000–10,000 representative points to determine K and understand structure, then use K-Means on the full dataset initialised with the centroids discovered from the sample.

Report cluster quality metrics alongside business interpretation. A silhouette score of 0.7 and a Davies-Bouldin index of 0.4 give your audience objective evidence that the clusters are real, not arbitrary. Business stakeholders trust numbers alongside descriptions.