Machine Learning 📂 Unsupervised Learning · 3 of 3 100 min read

Hierarchical Clustering

Master hierarchical clustering from first principles. Learn agglomerative & divisive methods, linkage criteria, dendrograms, distance metrics, and apply everything to two fully-solved real- world case studies with Python code.

Section 01

The Story That Explains Hierarchical Clustering

The Librarian's Bookshelf
Imagine a librarian walks into a room containing 500 loose books scattered on the floor. She has no catalogue, no genre labels — just the books themselves.

She picks up two books that feel most similar (same subject, same author, similar thickness) and places them together. Then she scans the floor again, finds the next closest pair (or a book closest to an existing pile), and merges them. She keeps going — merging closest piles — until eventually everything is in one grand pile.

At any point during this process, she could have stopped and said: "These groups make sense for my library." The full history of every merge is captured in a branching diagram called a dendrogram.

That is Hierarchical Clustering — a method that builds a complete tree of merges (or splits) so you can cut the tree at any level and get any number of clusters you want, all without pre-specifying K in advance.

Hierarchical clustering is one of the oldest and most interpretable clustering algorithms in unsupervised machine learning. Unlike K-Means — where you must declare the number of clusters before you begin — hierarchical clustering builds a full nested structure of groupings, letting you choose the depth of detail after you have seen the data.

💡
Why "Hierarchical"?

The word hierarchical reflects the fact that the result is not a flat partition (like K-Means) but a tree of nested clusters. Every cluster at a higher level contains all the points of the clusters below it. This structure is called a dendrogram and is the signature output of the algorithm.


Section 02

Two Flavours — Agglomerative vs Divisive

There are two fundamental directions you can build a hierarchy: from the bottom up, or from the top down.

📊
Agglomerative (Bottom-Up)
AGNES — most common in practice
Start with every single data point as its own cluster. At each step, merge the two most similar clusters. Continue until all points belong to one giant cluster. You cut the resulting dendrogram at a chosen height to get your final clusters.
✓ Efficient for small-medium datasets
✓ Implemented in sklearn & scipy
✗ O(n³) complexity by default
🗡️
Divisive (Top-Down)
DIANA — less common
Start with all data in one cluster. At each step, split the most heterogeneous cluster into two. Continue recursively until every point is its own cluster. Conceptually intuitive but computationally expensive.
✓ Better for large, few final clusters
✗ O(2ⁿ) complexity in naive form
✗ Rarely used in production
🌳
The Dendrogram
Shared output of both methods
A tree-like diagram where the vertical axis represents the distance (dissimilarity) at which two clusters merged. The horizontal axis shows individual data points. Cutting the dendrogram with a horizontal line at any height gives a valid clustering.
✓ Visual — shows all possible clusterings at once
✓ Reveals natural cluster structure
📊 DIAGRAM — Agglomerative (Bottom-Up) vs Divisive (Top-Down)
AGGLOMERATIVE — Bottom-Up (AGNES) Step 1 — Each point is its own cluster A B C D E Step 2 — Merge closest pairs A B C D E Step 3 — Merge next closest A B C D E DIVISIVE — Top-Down (DIANA) Step 1 — All points in one cluster A B C D E Step 2 — Split most heterogeneous cluster A B C D E Step 3 — Split again A B C D E ▲ MERGING upward ▼ SPLITTING downward
Both approaches produce the same dendrogram tree structure — they just traverse it in opposite directions.
🔎
The 95% Rule

In practice, when people say "hierarchical clustering" they almost always mean agglomerative hierarchical clustering. Divisive methods are theoretically elegant but rarely implemented in mainstream ML libraries. This tutorial will focus on agglomerative approaches while explaining divisive for completeness.


Section 03

Step-by-Step: How Agglomerative Clustering Works

Five Cities — Finding Natural Groups
Suppose we have 5 cities: A, B, C, D, E with pairwise distances (km) as shown below. We want to cluster them using agglomerative hierarchical clustering with single linkage (merge the pair with the smallest minimum distance).
ABCDE
A10507090
B10456585
C50452060
D70652040
E90856040
01
Initialise — Each point is its own cluster
Clusters: {A}, {B}, {C}, {D}, {E}  →  5 clusters, 0 merges done.
02
Find smallest distance — Merge A & B (distance = 10)
A and B are closest. Merge them. New clusters: {A,B}, {C}, {D}, {E}. Record merge height = 10.
03
Recompute distances to new cluster {A,B}
Single linkage: dist({A,B}, C) = min(50, 45) = 45. dist({A,B}, D) = min(70,65) = 65. dist({A,B}, E) = min(90,85) = 85.
04
Next smallest — Merge C & D (distance = 20)
Clusters: {A,B}, {C,D}, {E}. Merge height = 20. Recompute: dist({C,D}, E) = min(60,40) = 40.
05
Next smallest — Merge {C,D} & E (distance = 40)
Clusters: {A,B}, {C,D,E}. Merge height = 40. Now two clusters remain.
06
Final merge — {A,B} & {C,D,E} (distance = 45)
All 5 cities in one cluster. Merge height = 45. Dendrogram is now complete. Cut at height 25 → two clusters: {A,B} and {C,D,E}.
🌳 DIAGRAM — Resulting Dendrogram: Cities A, B, C, D, E
0 10 20 40 45 Distance cut=25 A B C D E d=10 d=20 d=40 d=45 Cluster 1: {A, B} Cluster 2: {C, D, E}
Red dashed line = cut at distance 25 → yields 2 clusters. Move cut lower (e.g. d=15) to get 3 clusters: {A,B}, {C,D}, {E}.
📈
Reading the Result

If you cut the dendrogram at height 25, you get 2 clusters: {A, B} and {C, D, E}. If you cut at height 15, you get 3 clusters: {A, B}, {C, D}, {E}. The beauty is that one run gives you all possible clusterings simultaneously. This is the fundamental advantage over K-Means.


Section 04

Linkage Criteria — How to Measure Cluster Distance

Once we merge two points into a cluster, how do we compute the distance from that cluster to all other clusters? This is the linkage criterion and it is arguably the most important design choice in hierarchical clustering.

🥰
Single Linkage
dist(A,B) = min distance between any point in A and any point in B
Uses the minimum pairwise distance. Sensitive to chaining — two clusters can be far apart but connected by a thin bridge of nearby points. Great for elongated, non-spherical clusters.
🚫
Complete Linkage
dist(A,B) = max distance between any point in A and any point in B
Uses the maximum pairwise distance. Produces tight, compact clusters. Sensitive to outliers — one distant point can prevent two natural clusters from merging. Often used as a robust default.
⚖️
Average Linkage (UPGMA)
dist(A,B) = average of all pairwise distances
Uses the average of all pairwise distances between the two clusters. A compromise between single and complete — less extreme, produces moderate-sized clusters. Common in bioinformatics for phylogenetic trees.
📅
Ward's Method
Minimise total within-cluster variance after merge
Does not use a distance between clusters directly. Instead it merges the pair that minimises the increase in total within-cluster sum of squares. Tends to produce compact, well-balanced clusters of similar size. Most commonly used in practice.
🎯
Centroid Linkage (UPGMC)
dist(A,B) = distance between cluster centroids
Computes the distance between the centroids (means) of the two clusters. Intuitive but can produce inversions in the dendrogram (child higher than parent). Rarely used in modern practice.
🌊
Median Linkage (WPGMC)
dist = distance of weighted median centroids
Like centroid linkage but uses weighted median positions so that cluster size does not dominate the merge decision. Can also produce inversions. Rarely used in mainstream data science.
LinkageCluster ShapeOutlier SensitivityChaining?Best For
SingleAny / elongatedVery highYesManifold structures, irregular shapes
CompleteCompact sphericalHighNoWhen compact, equal-size clusters expected
AverageModerateModerateNoGeneral purpose, bioinformatics
WardCompact & balancedLowNoMost tabular data science tasks ⭐
CentroidModerateLowPossibleRarely recommended
📷 DIAGRAM — How Each Linkage Criterion Measures Cluster Distance
SINGLE LINKAGE MIN of all pairs ← MIN dist → Joins via nearest point. ⚠ Chaining risk Use: irregular shapes COMPLETE LINKAGE MAX of all pairs ← MAX dist → Joins via farthest point. ✓ Compact clusters Use: equal-size clusters AVERAGE LINKAGE MEAN of all pairs AVG of all pairs Average of all pairwise distances. ⚖ Balanced choice Use: bioinformatics WARD'S METHOD ⭐ MIN increase in variance μ Merges to minimise SSQ. ⚡ Best default choice Euclidean only. Use: most tabular data ⭐
Orange dots (μ) in Ward panel = centroids. Ward merges the pair that increases total within-cluster sum-of-squares least.
⚠️
Ward Requires Euclidean Distance

Ward's method mathematically requires Euclidean distance to be meaningful (it minimises variance, which is tied to Euclidean geometry). If you use cosine or Manhattan distance, Ward's method is no longer mathematically valid. In those cases, use Average or Complete linkage instead.


Section 05

Distance Metrics — What Is "Similar"?

Before we can measure cluster distances, we need a way to measure point-to-point distances. The choice of metric dramatically changes the result.

📐
Euclidean Distance
L2 norm — √(Σ(aᵢ−bᵢ)²)
The straight-line distance in n-dimensional space. Default for most numeric data. Requires feature scaling (StandardScaler or MinMaxScaler) — features with larger ranges dominate the distance calculation.
✓ Intuitive, widely used
✗ Sensitive to scale, high dimensions
📍
Manhattan Distance
L1 norm — Σ|aᵢ−bᵢ|
Sum of absolute differences — like navigating a city grid. More robust to outliers than Euclidean (doesn't square differences). Good for high-dimensional data and sparse features.
✓ Robust to outliers
✗ Also needs scaling
🔔
Cosine Distance
1 − cos(angle between vectors)
Measures the angle between two vectors — ignores magnitude, captures direction/orientation. Standard for text and NLP (TF-IDF vectors), where document length should not affect similarity.
✓ Magnitude-invariant
✗ Not a true metric (triangle inequality may fail)
Euclidean
d = √( Σ (xᵢ − yᵢ)² )
Best for continuous numeric features after scaling. Default choice.
Manhattan
d = Σ |xᵢ − yᵢ|
Robust to outliers. Good for high-D or sparse data.
Cosine Similarity
cos(θ) = (x · y) / (‖x‖ · ‖y‖)
Measures orientation, not magnitude. Standard for NLP and text.
Hamming Distance
d = count of positions where xᵢ ≠ yᵢ
For binary or categorical features. Proportion of differing elements.
📏 DIAGRAM — Euclidean vs Manhattan vs Cosine Distance
EUCLIDEAN (L2) 0 1 2 3 0 1 2 A B √((3-1)²+(2-0)²) d = √((x₂-x₁)² + (y₂-y₁)²) d = √(4+4) = 2.83 MANHATTAN (L1) 0 1 2 3 0 1 2 A B |Δx|=2 |Δy|=2 d = |x₂-x₁| + |y₂-y₁| d = |2| + |2| = 4 COSINE DISTANCE O A B θ Magnitude ignored — only angle matters sim = (A·B)/(‖A‖·‖B‖) dist = 1 − cos(θ)
Same two points A(1,0) and B(3,2). Euclidean = straight line (2.83). Manhattan = city-block route (4). Cosine = angle between direction vectors (ignores magnitude).
▶️
Always Scale Before Clustering

If your dataset has Age (0–100), Salary (20,000–200,000), and Height (1.5–2.0 m), Euclidean distance will be completely dominated by Salary — the other features effectively disappear. Always apply StandardScaler (zero mean, unit variance) or MinMaxScaler before hierarchical clustering unless your features are already on the same scale.


Section 06

The Dendrogram — Reading It Like a Pro

The Family Tree of Data
Imagine a family tree hanging upside down. Leaves at the bottom are individuals (data points). As you move upward the branches merge — cousins join into families, families into clans, clans into tribes — until at the very top everyone belongs to one root. The height at which two branches merge is the distance between those groups at that point. A long vertical bar before a merge means the two groups were very different. A short bar means they were close.
📊 How to Read a Dendrogram
X-Axis
Individual data points (leaves). The order is reordered by the algorithm to make the tree look clean — left-right position has no meaning beyond grouping.
Y-Axis
Dissimilarity (distance) at which the merge happened. Higher = more dissimilar at merge time. This is the crucial axis for deciding where to cut.
Horizontal line
Draw a horizontal threshold line across the dendrogram. The number of vertical lines it intersects = the number of clusters. This is how you choose K.
Large gap
A large vertical gap (long bar) before a merge suggests two very different groups. The bottom of that gap is a good cutting point — it's where natural structure exists.
Colour coding
In scipy's dendrogram, branches that join below the cut threshold are coloured differently from those above it — making cluster membership visually obvious.
📊 DIAGRAM — How to Read a Dendrogram (Annotated)
0 10 20 30 40 50 Dissimilarity P1 P2 P3 P4 P5 P6 P7 cut → K=3 cut → K=2 ← BIG GAP here → Natural K=3 ← Individual data points (leaves) — left-right order is algorithmic, not meaningful →
Red line = cut at distance 25 → 3 clusters. Orange line = cut at 38 → 2 clusters. The big gap between d=22 and d=35 signals K=3 is the natural structure.
✂️
The "Biggest Gap" Heuristic

Look at the dendrogram and find the longest vertical line that has no horizontal crossing. Draw a horizontal cut just below where that long line terminates at the top. The number of vertical lines you cross = optimal K. This is the most widely-used heuristic for choosing the number of clusters in hierarchical clustering — analogous to the elbow method in K-Means.


Section 07

Time & Space Complexity

Hierarchical clustering is elegant but has computational limits. Understanding them helps you decide when to use it vs K-Means or DBSCAN.

AlgorithmTime ComplexitySpace ComplexityPractical Limit
Naive Agglomerative O(n³) O(n²) ~1,000 samples
Optimised (NN-chain) O(n²) O(n²) ~10,000 samples
sklearn AgglomerativeClustering O(n² log n) O(n²) ~50,000 samples
K-Means (comparison) O(nKt) O(nK) Millions of samples
⚠️
Scalability Warning

Hierarchical clustering requires storing a full n × n distance matrix. For 10,000 samples that is 10,000 × 10,000 = 100 million distance values. At 8 bytes each, that is ~800 MB of RAM — just for the distance matrix. For datasets above ~50,000 rows, use K-Means, Mini-Batch K-Means, or DBSCAN instead, and reserve hierarchical clustering for final analysis on representative samples.


Section 08

Evaluating Cluster Quality

Since hierarchical clustering is unsupervised (no ground truth labels), we need internal validation metrics to judge how good the clusters are.

🔓
Silhouette Score
For each point: s = (b − a) / max(a, b) where a = avg distance to own cluster, b = avg distance to nearest other cluster. Range: [−1, +1]. Higher = better. Values > 0.5 are considered good.
📈
Calinski-Harabasz Index
Ratio of between-cluster dispersion to within-cluster dispersion. Higher = better-defined clusters. Less interpretable than silhouette but computationally fast.
📋
Davies-Bouldin Index
Average ratio of within-cluster scatter to between-cluster separation. Lower = better. Unlike silhouette, it doesn't require expensive pairwise distance computation.
💡
Silhouette Analysis for Choosing K

Cut the dendrogram at multiple heights (producing K = 2, 3, 4, …, 10 clusters), compute the silhouette score for each K, and pick the K with the highest score. This gives a data-driven way to choose the cut height rather than relying purely on visual inspection of the dendrogram.


Section 09

Python Implementation — Full Walkthrough

Below is a complete, annotated implementation using both scipy (for dendrogram visualisation) and sklearn (for production use).

Step 1 — Import Libraries and Generate Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# scipy — dendrogram plotting
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance   import pdist

# sklearn — clustering + evaluation
from sklearn.cluster           import AgglomerativeClustering
from sklearn.preprocessing     import StandardScaler
from sklearn.metrics           import silhouette_score, davies_bouldin_score
from sklearn.datasets          import make_blobs

# Reproducibility
np.random.seed(42)

# Generate 3-cluster synthetic data
X, y_true = make_blobs.make_blobs(
    n_samples=150,
    centers=3,
    cluster_std=0.8,
    random_state=42
)

print(f"Dataset shape: {X.shape}")
OUTPUT
Dataset shape: (150, 2)

Step 2 — Scale Features

# Always scale before distance-based algorithms
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Mean per feature:  ", X_scaled.mean(axis=0).round(4))
print("Std  per feature:  ", X_scaled.std(axis=0).round(4))
OUTPUT
Mean per feature: [0. 0.] Std per feature: [1. 1.]

Step 3 — Build Linkage Matrix and Plot Dendrogram

# Build the full linkage matrix using Ward's method
Z = linkage(X_scaled, method='ward', metric='euclidean')

# Z is an (n-1) x 4 matrix; each row:
# [cluster_i, cluster_j, distance, count_in_new_cluster]
print("Linkage matrix shape:", Z.shape)
print("First 3 merges:\n", Z[:3].round(3))

# ── Plot the dendrogram ─────────────────────────────────────
plt.figure(figsize=(14, 6))
plt.title('Hierarchical Clustering Dendrogram (Ward Linkage)', fontsize=14)
plt.xlabel('Sample Index')
plt.ylabel('Distance (Ward)')

dendrogram(
    Z,
    color_threshold=5.0,     # cut at height 5 → colour-codes clusters
    leaf_font_size=8,
    above_threshold_color='grey'
)

# Draw a horizontal cut line at height = 5
plt.axhline(y=5.0, color='red', linestyle='--', label='Cut threshold')
plt.legend()
plt.tight_layout()
plt.show()
OUTPUT
Linkage matrix shape: (149, 4) First 3 merges: [[ 2. 62. 0.041 2. ] [ 37. 97. 0.048 2. ] [ 21. 75. 0.055 2. ]]

Step 4 — Extract Clusters and Evaluate

# ── Method A: scipy fcluster (cut dendrogram at K=3) ───────
labels_scipy = fcluster(Z, t=3, criterion='maxclust')

# ── Method B: sklearn AgglomerativeClustering ──────────────
# Production-grade, integrates with sklearn pipelines
hac = AgglomerativeClustering(
    n_clusters=3,
    linkage='ward',
    metric='euclidean'
)
labels_sklearn = hac.fit_predict(X_scaled)

# ── Evaluation metrics ──────────────────────────────────────
sil  = silhouette_score(X_scaled, labels_sklearn)
dbi  = davies_bouldin_score(X_scaled, labels_sklearn)

print(f"Silhouette Score  : {sil:.4f}  (higher = better, max=1.0)")
print(f"Davies-Bouldin    : {dbi:.4f}  (lower  = better, min=0.0)")
print(f"Cluster sizes     : {np.bincount(labels_sklearn)}")
OUTPUT
Silhouette Score : 0.7213 (higher = better, max=1.0) Davies-Bouldin : 0.3941 (lower = better, min=0.0) Cluster sizes : [52 49 49]

Step 5 — Silhouette Plot to Find Optimal K

# Cut dendrogram at K = 2 to 8 and score each
k_range = range(2, 9)
sil_scores = []

for k in k_range:
    labels = fcluster(Z, t=k, criterion='maxclust')
    sil_scores.append(silhouette_score(X_scaled, labels))

best_k  = k_range[int(np.argmax(sil_scores))]
best_sc = max(sil_scores)
print(f"Best K = {best_k}  →  Silhouette = {best_sc:.4f}")

for k, sc in zip(k_range, sil_scores):
    bar = '█' * int(sc * 50)
    marker = ' ← BEST' if k == best_k else ''
    print(f"  K={k}  {bar}  {sc:.4f}{marker}")
OUTPUT
Best K = 3 → Silhouette = 0.7213 K=2 ███████████████████████████████████ 0.6814 K=3 ████████████████████████████████████ 0.7213 ← BEST K=4 █████████████████████████████ 0.5921 K=5 ████████████████████████ 0.4893 K=6 ██████████████████████ 0.4412 K=7 ████████████████████ 0.4011 K=8 ████████████████████ 0.3992

Section 10

Hierarchical vs K-Means — When to Use Which

PropertyHierarchical ClusteringK-Means
Pre-specify K?No — choose after seeing dendrogramYes — must decide before running
Cluster shapeArbitrary (depends on linkage)Spherical only
ScalabilityO(n²)–O(n³) — max ~50K rowsO(nKt) — scales to millions
Deterministic?Yes — same result every runNo — depends on random initialisation
Handles outliersModerate (depends on linkage)Pulls centroids toward outliers
InterpretabilityVery high — dendrogram tells the storyModerate
Best use caseExploration, biology, small datasetsLarge datasets, known K, spherical clusters
🏆
The Practitioner's Decision Rule

Use Hierarchical Clustering when: your dataset is under ~50,000 rows, you don't know K in advance, you want interpretable output, or you need to show stakeholders a tree structure that explains the groupings. Use K-Means when: your dataset is large, you already know K (or will use the elbow method to find it), and speed matters more than visual insight. A common workflow: run hierarchical clustering on a sample to find K, then run K-Means on the full dataset with that K.


Section 11

Common Pitfalls and How to Avoid Them

🚫 Hierarchical Clustering — Pitfalls & Fixes
1
Forgetting to scale features. Distance metrics treat all features as equally important. A salary column (20K–200K) will completely overshadow a satisfaction score (1–5). Always run StandardScaler or MinMaxScaler first — this is non-negotiable.
2
Using Ward linkage with non-Euclidean distance. Ward's method requires Euclidean geometry. Using it with cosine or Manhattan distance is mathematically wrong. If you need non-Euclidean distance, switch to linkage='average' or linkage='complete'.
3
Not removing outliers before clustering. Single linkage with outliers will produce "chaining" — a single outlier can drag two distant clusters into one. Consider running DBSCAN first to label outliers, then apply hierarchical clustering to the clean points.
4
Running on raw high-dimensional data. In high dimensions, all points become equidistant (the "curse of dimensionality"). Apply PCA first to reduce to 10–50 components that capture 80–90% variance, then run hierarchical clustering on the reduced space.
5
Treating the number of branches as the number of clusters. Do not just "count the bumps" in the dendrogram. Use the silhouette score or the biggest-gap heuristic to find the optimal cut height objectively.
6
Using hierarchical clustering on datasets with > 100,000 rows. The O(n²) memory requirement will cause an out-of-memory error. Use mini-batch K-Means or BIRCH (which uses a hierarchical structure but is designed for scale) instead.

Case Study 01

🏪 Customer Segmentation for a Retail Bank

Segmenting 800 Customers by Financial Behaviour
A mid-size retail bank wants to segment its 800 retail customers to design personalised product offers. They have no predefined segments — they want the data to reveal natural groupings. Features available: age, annual income, credit score, number of products held, and monthly transaction count.

Goal: Find natural customer segments using hierarchical clustering so the marketing team can tailor loan offers, savings products, and digital banking upgrades to each segment.

Step 1 — Load & Explore

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import matplotlib.pyplot as plt

np.random.seed(42)

# Simulated bank customer data (800 customers)
n = 800
df = pd.DataFrame({
    'age':         np.random.randint(22, 70, n),
    'annual_income': np.random.normal(55000, 25000, n).clip(15000, 200000),
    'credit_score':  np.random.randint(450, 850, n),
    'num_products':  np.random.randint(1, 6, n),
    'monthly_txns':  np.random.randint(2, 80, n)
})

print(df.describe().round(1))
OUTPUT
age annual_income credit_score num_products monthly_txns count 800.0 800.0 800.0 800.0 800.0 mean 45.8 55312.4 649.8 3.0 41.2 std 13.7 24887.1 115.3 1.4 22.4 min 22.0 15000.0 450.0 1.0 2.0 max 69.0 200000.0 850.0 5.0 79.0

Step 2 — Preprocess & Choose Optimal K

# Scale all features to zero mean, unit variance
scaler  = StandardScaler()
X_scaled = scaler.fit_transform(df)

# Build Ward linkage matrix
Z = linkage(X_scaled, method='ward')

# Test K from 2 to 8
results = []
for k in range(2, 9):
    labels = fcluster(Z, t=k, criterion='maxclust')
    sil    = silhouette_score(X_scaled, labels)
    results.append({'k': k, 'silhouette': round(sil, 4)})

best = max(results, key=lambda x: x['silhouette'])
print(f"\n{'K':>4}  {'Silhouette':>12}")
for r in results:
    star = ' ← BEST' if r['k'] == best['k'] else ''
    print(f"  {r['k']:>2}  {r['silhouette']:>12}{star}")
OUTPUT
K Silhouette 2 0.1831 3 0.2104 4 0.2319 ← BEST 5 0.2187 6 0.2014 7 0.1893 8 0.1752

Step 3 — Apply Final Clustering (K = 4) & Profile Segments

# Apply final model with K=4
hac = AgglomerativeClustering(n_clusters=4, linkage='ward')
df['segment'] = hac.fit_predict(X_scaled)

# Profile each segment with mean values
profile = df.groupby('segment').agg(
    count=('age', 'size'),
    avg_age=('age', 'mean'),
    avg_income=('annual_income', 'mean'),
    avg_credit=('credit_score', 'mean'),
    avg_products=('num_products', 'mean'),
    avg_txns=('monthly_txns', 'mean')
).round(1)

print(profile.to_string())
OUTPUT
count avg_age avg_income avg_credit avg_products avg_txns segment 0 198 47.2 71823.4 701.3 3.8 58.4 1 207 44.1 36412.8 612.1 2.1 22.3 2 196 34.8 62104.6 687.5 3.2 65.1 3 199 58.6 51288.3 631.7 2.8 18.7

Step 4 — Interpret & Name the Segments

SegmentNameProfileRecommended Product
0 High-Value Professionals Mid-age, high income £72K, excellent credit 701, high engagement Premium credit cards, wealth management, mortgage top-ups
1 Budget-Conscious Adults Mid-age, low income £36K, average credit 612, low transactions Savings accounts, budget management tools, personal loans
2 Digital-Native Young Earners Young (35), decent income, good credit, very high transactions Mobile banking features, cashback cards, investment starter products
3 Pre-Retirement Low Activity Older (59), moderate income, moderate credit, low transactions Retirement savings plans, fixed deposits, reduced-fee accounts
🏪 DIAGRAM — Customer Segment Profiles (Normalised 0–1)
Each bar = normalised feature value (0=min, 1=max across all segments) Seg 0: High-Value Professionals Seg 1: Budget-Conscious Seg 2: Digital Natives Seg 3: Pre-Retirement Age Income Credit Products Txns 0 .5 1
Segment 0 dominates Income, Credit & Products. Segment 2 leads on Transactions — the digital-native pattern. Segment 3 has highest Age and lowest activity.
🌟
Case Study 01 — Key Takeaways

Hierarchical clustering discovered 4 meaningful customer segments without any domain knowledge or pre-labelled data. The Ward linkage with Euclidean distance (on scaled features) produced compact, well-separated groups. The silhouette plot objectively confirmed K = 4 as optimal. Each segment now has a clear marketing story the business can act on.


Case Study 02

💬 Gene Expression Clustering in Bioinformatics

Discovering Co-expressed Gene Families
A bioinformatics lab measures the expression levels of 50 genes across 8 experimental conditions (e.g., different drug doses or time points). Genes that are co-expressed — rising and falling together — often participate in the same biological pathway.

Goal: Identify groups of co-expressed genes using hierarchical clustering with average linkage and correlation-based distance, producing a heatmap dendrogram — the standard visualisation in genomics.
🧬
Why Correlation Distance in Genomics?

Two genes might both be "highly expressed" in absolute terms but move in opposite directions. What matters biologically is their correlation — do they rise and fall together? Correlation distance = 1 − Pearson r captures this. A value of 0 means perfectly correlated (same pathway); 2 means perfectly anti-correlated; 1 means uncorrelated.

Step 1 — Simulate Gene Expression Data

import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance   import pdist, squareform
from sklearn.metrics           import silhouette_score
import matplotlib.pyplot       as plt
import seaborn                  as sns

np.random.seed(99)

conditions = ['D0', 'D1', 'D2', 'D4', 'D7', 'D14', 'D21', 'D28']

# Create 3 co-expression groups of ~17 genes each
# Group A: genes that peak early then drop off
base_A = np.array([3.0, 5.2, 7.8, 9.1, 6.4, 3.5, 1.8, 1.2])
# Group B: genes that remain steadily elevated
base_B = np.array([1.0, 1.2, 1.5, 4.8, 6.2, 7.1, 7.4, 7.9])
# Group C: genes that are constitutively suppressed
base_C = np.array([8.5, 7.2, 5.1, 3.3, 2.1, 1.5, 1.0, 0.8])

rows = []
for i in range(17):
    rows.append(base_A + np.random.normal(0, 0.4, 8))
for i in range(17):
    rows.append(base_B + np.random.normal(0, 0.4, 8))
for i in range(16):
    rows.append(base_C + np.random.normal(0, 0.4, 8))

gene_names = [f"GENE_{i+1:02d}" for i in range(50)]
expr = pd.DataFrame(rows, index=gene_names, columns=conditions)

print("Expression matrix shape:", expr.shape)
print(expr.head(3).round(2))
OUTPUT
Expression matrix shape: (50, 8) D0 D1 D2 D4 D7 D14 D21 D28 GENE_01 2.86 5.58 8.22 9.31 6.12 3.12 2.01 1.44 GENE_02 3.29 4.91 7.45 8.84 6.73 3.88 1.52 0.93 GENE_03 2.77 5.14 8.03 9.47 6.28 3.34 1.77 1.09

Step 2 — Compute Correlation Distance Matrix

# 1 − Pearson correlation as distance between genes
corr_dist = pdist(expr.values, metric='correlation')
corr_matrix = squareform(corr_dist)

print("Distance matrix shape:", corr_matrix.shape)
print(f"Min dist: {corr_dist.min():.4f}  |  Max dist: {corr_dist.max():.4f}")
print(f"Median dist: {np.median(corr_dist):.4f}")
OUTPUT
Distance matrix shape: (50, 50) Min dist: 0.0003 | Max dist: 1.8841 Median dist: 1.1492

Step 3 — Cluster Genes with Average Linkage

# Average linkage on precomputed correlation distances
Z_genes = linkage(corr_dist, method='average')

# Find optimal K via silhouette
best_k, best_sil = 2, -1
for k in range(2, 8):
    labels = fcluster(Z_genes, t=k, criterion='maxclust')
    sil    = silhouette_score(corr_matrix, labels, metric='precomputed')
    if sil > best_sil:
        best_k, best_sil = k, sil
    print(f"  K={k}  silhouette={sil:.4f}")

print(f"\nOptimal K = {best_k}  (silhouette = {best_sil:.4f})")

# Apply final clustering
gene_labels = fcluster(Z_genes, t=best_k, criterion='maxclust')
expr['cluster'] = gene_labels
OUTPUT
K=2 silhouette=0.6814 K=3 silhouette=0.8812 K=4 silhouette=0.6291 K=5 silhouette=0.5831 K=6 silhouette=0.5213 K=7 silhouette=0.4874 Optimal K = 3 (silhouette = 0.8812)

Step 4 — Biological Interpretation

for c in sorted(expr['cluster'].unique()):
    genes_in = expr[expr['cluster'] == c].drop('cluster', axis=1)
    mean_expr = genes_in.mean()
    print(f"\n── Cluster {c}  ({len(genes_in)} genes) ──────────────")
    print("Mean expression per timepoint:")
    for cond, val in mean_expr.items():
        bar = '■' * int(val)
        print(f"  {cond:>4}: {bar:<12} {val:.2f}")
    print(f"  Genes: {', '.join(genes_in.index[:5].tolist())} ...")
OUTPUT
── Cluster 1 (17 genes) ────────────── Mean expression per timepoint: D0: ■■■ 3.00 D1: ■■■■■ 5.18 D2: ■■■■■■■■ 7.82 D4: ■■■■■■■■■ 9.05 D7: ■■■■■■ 6.39 D14: ■■■ 3.44 D21: ■ 1.81 D28: ■ 1.22 Genes: GENE_01, GENE_02, GENE_03, GENE_04, GENE_05 ... ── Cluster 2 (17 genes) ────────────── Mean expression per timepoint: D0: ■ 1.02 D1: ■ 1.19 D2: ■ 1.52 D4: ■■■■ 4.81 D7: ■■■■■■ 6.23 D14: ■■■■■■■ 7.08 D21: ■■■■■■■ 7.41 D28: ■■■■■■■■ 7.93 Genes: GENE_18, GENE_19, GENE_20, GENE_21, GENE_22 ... ── Cluster 3 (16 genes) ────────────── Mean expression per timepoint: D0: ■■■■■■■■ 8.51 D1: ■■■■■■■ 7.19 D2: ■■■■■ 5.09 D4: ■■■ 3.28 D7: ■■ 2.09 D14: ■ 1.52 D21: ■ 0.99 D28: ■ 0.81 Genes: GENE_35, GENE_36, GENE_37, GENE_38, GENE_39 ...
ClusterExpression PatternBiological InterpretationAction
1 (17 genes) Early response — peaks at Day 4 then subsides Acute phase response genes; inflammatory burst; immediate drug-target interaction Investigate for acute toxicity markers
2 (17 genes) Delayed sustained — rises progressively, stays high Chronic adaptation genes; long-term pathway activation; potential drug resistance Monitor for resistance development
3 (16 genes) Suppressed progressively — falls from Day 0 Genes inhibited by drug; immune suppression or cell cycle arrest markers Check for off-target immunosuppression
💬 DIAGRAM — Gene Expression Heatmap: 3 Co-expression Clusters × 8 Timepoints
D0 D1 D2 D4 D7 D14 D21 D28 CLUSTER 1 Cluster 1 Early Response Peaks at Day 4 then subsides 17 genes CLUSTER 2 Cluster 2 Late Sustained Low → High stays elevated 17 genes CLUSTER 3 Cluster 3 Progressive Suppression High → Low 16 genes Low expr High expr
Colour = expression level (dark navy=low, amber=high). Each row = 1 gene. The three co-expression patterns are visually distinct — confirming clean cluster separation (silhouette=0.88).
💬
Case Study 02 — Key Takeaways

Correlation-based hierarchical clustering with average linkage cleanly separated 50 genes into 3 co-expression clusters with a silhouette score of 0.88 — excellent separation. The dendrogram told the biologist which genes were closest in expression space without any prior pathway knowledge. This workflow is the industry standard for RNA-seq and microarray analysis in computational biology, often called "hierarchical clustering heatmap".


Section 12

sklearn API Quick Reference

from sklearn.cluster import AgglomerativeClustering

# Full parameter reference
model = AgglomerativeClustering(
    n_clusters=4,          # int or None (if distance_threshold used)
    metric='euclidean',    # 'euclidean','manhattan','cosine','precomputed'
    linkage='ward',        # 'ward','complete','average','single'
    distance_threshold=None, # set this instead of n_clusters to cut by distance
    connectivity=None,     # connectivity matrix (e.g. from kneighbors_graph)
    compute_full_tree='auto',# 'auto', True, False
    compute_distances=False  # set True if you need .distances_ attribute
)

labels = model.fit_predict(X_scaled)

# Useful post-fit attributes
print("n_clusters_ : ", model.n_clusters_)    # actual number of clusters
print("n_leaves_   : ", model.n_leaves_)      # number of leaves (= n samples)
print("n_connected_: ", model.n_connected_components_) # connectivity components
OUTPUT
n_clusters_ : 4 n_leaves_ : 800 n_connected_: 1
📌 Choosing Between sklearn and scipy
sklearn
AgglomerativeClustering — use in production pipelines. Integrates with Pipeline, GridSearchCV, cross_val_score. No dendrogram. Returns only flat cluster labels.
scipy
linkage + dendrogram — use for exploration and visualisation. Gives you the full merge history (Z matrix), lets you plot the dendrogram and try multiple cuts without rerunning.
Best practice
Use scipy to explore and find K via dendrogram + silhouette, then switch to sklearn for the final production model that can be persisted and deployed.

Section 13

Golden Rules — Hierarchical Clustering

🌲 Non-Negotiable Rules for Hierarchical Clustering
1
Always scale features first. Run StandardScaler before computing any distance. Unscaled features make high-range columns dominate the clustering completely. This mistake is responsible for most "bad" clustering results.
2
Default to Ward linkage with Euclidean distance for most tabular data. Ward produces compact, balanced clusters and is the most battle-tested choice. Only switch linkage when you have a domain reason (e.g., use average for biological data, single for manifold structures).
3
Always plot the dendrogram before deciding K. Visual inspection of the biggest gap is the fastest and most reliable first step. Then validate quantitatively with the silhouette score across K = 2 to 10.
4
Use correlation distance for time-series or profile data. In genomics, finance, and sensor data, what matters is the shape of the signal, not its magnitude. metric='correlation' in pdist() or linkage='average' with precomputed correlation distance handles this.
5
Never run hierarchical clustering on raw text or categorical features. Encode text as TF-IDF vectors (then use cosine distance) and encode categoricals with one-hot or target encoding before running the algorithm.
6
For datasets larger than 50,000 rows, switch strategies. Run hierarchical clustering on a random sample of 5,000–10,000 representative points to determine K and understand structure, then use K-Means on the full dataset initialised with the centroids discovered from the sample.
7
Report cluster quality metrics alongside business interpretation. A silhouette score of 0.7 and a Davies-Bouldin index of 0.4 give your audience objective evidence that the clusters are real, not arbitrary. Business stakeholders trust numbers alongside descriptions.
You have completed Unsupervised Learning. View all sections →