The Story That Explains Hierarchical Clustering
She picks up two books that feel most similar (same subject, same author, similar thickness) and places them together. Then she scans the floor again, finds the next closest pair (or a book closest to an existing pile), and merges them. She keeps going — merging closest piles — until eventually everything is in one grand pile.
At any point during this process, she could have stopped and said: "These groups make sense for my library." The full history of every merge is captured in a branching diagram called a dendrogram.
That is Hierarchical Clustering — a method that builds a complete tree of merges (or splits) so you can cut the tree at any level and get any number of clusters you want, all without pre-specifying K in advance.
Hierarchical clustering is one of the oldest and most interpretable clustering algorithms in unsupervised machine learning. Unlike K-Means — where you must declare the number of clusters before you begin — hierarchical clustering builds a full nested structure of groupings, letting you choose the depth of detail after you have seen the data.
The word hierarchical reflects the fact that the result is not a flat partition (like K-Means) but a tree of nested clusters. Every cluster at a higher level contains all the points of the clusters below it. This structure is called a dendrogram and is the signature output of the algorithm.
Two Flavours — Agglomerative vs Divisive
There are two fundamental directions you can build a hierarchy: from the bottom up, or from the top down.
In practice, when people say "hierarchical clustering" they almost always mean agglomerative hierarchical clustering. Divisive methods are theoretically elegant but rarely implemented in mainstream ML libraries. This tutorial will focus on agglomerative approaches while explaining divisive for completeness.
Step-by-Step: How Agglomerative Clustering Works
| A | B | C | D | E | |
|---|---|---|---|---|---|
| A | — | 10 | 50 | 70 | 90 |
| B | 10 | — | 45 | 65 | 85 |
| C | 50 | 45 | — | 20 | 60 |
| D | 70 | 65 | 20 | — | 40 |
| E | 90 | 85 | 60 | 40 | — |
If you cut the dendrogram at height 25, you get 2 clusters: {A, B} and {C, D, E}. If you cut at height 15, you get 3 clusters: {A, B}, {C, D}, {E}. The beauty is that one run gives you all possible clusterings simultaneously. This is the fundamental advantage over K-Means.
Linkage Criteria — How to Measure Cluster Distance
Once we merge two points into a cluster, how do we compute the distance from that cluster to all other clusters? This is the linkage criterion and it is arguably the most important design choice in hierarchical clustering.
| Linkage | Cluster Shape | Outlier Sensitivity | Chaining? | Best For |
|---|---|---|---|---|
| Single | Any / elongated | Very high | Yes | Manifold structures, irregular shapes |
| Complete | Compact spherical | High | No | When compact, equal-size clusters expected |
| Average | Moderate | Moderate | No | General purpose, bioinformatics |
| Ward | Compact & balanced | Low | No | Most tabular data science tasks ⭐ |
| Centroid | Moderate | Low | Possible | Rarely recommended |
Ward's method mathematically requires Euclidean distance to be meaningful (it minimises variance, which is tied to Euclidean geometry). If you use cosine or Manhattan distance, Ward's method is no longer mathematically valid. In those cases, use Average or Complete linkage instead.
Distance Metrics — What Is "Similar"?
Before we can measure cluster distances, we need a way to measure point-to-point distances. The choice of metric dramatically changes the result.
If your dataset has Age (0–100), Salary (20,000–200,000), and Height (1.5–2.0 m),
Euclidean distance will be completely dominated by Salary — the other features
effectively disappear. Always apply StandardScaler (zero mean, unit variance)
or MinMaxScaler before hierarchical clustering unless your features are
already on the same scale.
The Dendrogram — Reading It Like a Pro
Look at the dendrogram and find the longest vertical line that has no horizontal crossing. Draw a horizontal cut just below where that long line terminates at the top. The number of vertical lines you cross = optimal K. This is the most widely-used heuristic for choosing the number of clusters in hierarchical clustering — analogous to the elbow method in K-Means.
Time & Space Complexity
Hierarchical clustering is elegant but has computational limits. Understanding them helps you decide when to use it vs K-Means or DBSCAN.
| Algorithm | Time Complexity | Space Complexity | Practical Limit |
|---|---|---|---|
| Naive Agglomerative | O(n³) | O(n²) | ~1,000 samples |
| Optimised (NN-chain) | O(n²) | O(n²) | ~10,000 samples |
| sklearn AgglomerativeClustering | O(n² log n) | O(n²) | ~50,000 samples |
| K-Means (comparison) | O(nKt) | O(nK) | Millions of samples |
Hierarchical clustering requires storing a full n × n distance matrix. For 10,000 samples that is 10,000 × 10,000 = 100 million distance values. At 8 bytes each, that is ~800 MB of RAM — just for the distance matrix. For datasets above ~50,000 rows, use K-Means, Mini-Batch K-Means, or DBSCAN instead, and reserve hierarchical clustering for final analysis on representative samples.
Evaluating Cluster Quality
Since hierarchical clustering is unsupervised (no ground truth labels), we need internal validation metrics to judge how good the clusters are.
Cut the dendrogram at multiple heights (producing K = 2, 3, 4, …, 10 clusters), compute the silhouette score for each K, and pick the K with the highest score. This gives a data-driven way to choose the cut height rather than relying purely on visual inspection of the dendrogram.
Python Implementation — Full Walkthrough
Below is a complete, annotated implementation using both scipy
(for dendrogram visualisation) and sklearn (for production use).
Step 1 — Import Libraries and Generate Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# scipy — dendrogram plotting
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance import pdist
# sklearn — clustering + evaluation
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.datasets import make_blobs
# Reproducibility
np.random.seed(42)
# Generate 3-cluster synthetic data
X, y_true = make_blobs.make_blobs(
n_samples=150,
centers=3,
cluster_std=0.8,
random_state=42
)
print(f"Dataset shape: {X.shape}")
Step 2 — Scale Features
# Always scale before distance-based algorithms
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Mean per feature: ", X_scaled.mean(axis=0).round(4))
print("Std per feature: ", X_scaled.std(axis=0).round(4))
Step 3 — Build Linkage Matrix and Plot Dendrogram
# Build the full linkage matrix using Ward's method
Z = linkage(X_scaled, method='ward', metric='euclidean')
# Z is an (n-1) x 4 matrix; each row:
# [cluster_i, cluster_j, distance, count_in_new_cluster]
print("Linkage matrix shape:", Z.shape)
print("First 3 merges:\n", Z[:3].round(3))
# ── Plot the dendrogram ─────────────────────────────────────
plt.figure(figsize=(14, 6))
plt.title('Hierarchical Clustering Dendrogram (Ward Linkage)', fontsize=14)
plt.xlabel('Sample Index')
plt.ylabel('Distance (Ward)')
dendrogram(
Z,
color_threshold=5.0, # cut at height 5 → colour-codes clusters
leaf_font_size=8,
above_threshold_color='grey'
)
# Draw a horizontal cut line at height = 5
plt.axhline(y=5.0, color='red', linestyle='--', label='Cut threshold')
plt.legend()
plt.tight_layout()
plt.show()
Step 4 — Extract Clusters and Evaluate
# ── Method A: scipy fcluster (cut dendrogram at K=3) ───────
labels_scipy = fcluster(Z, t=3, criterion='maxclust')
# ── Method B: sklearn AgglomerativeClustering ──────────────
# Production-grade, integrates with sklearn pipelines
hac = AgglomerativeClustering(
n_clusters=3,
linkage='ward',
metric='euclidean'
)
labels_sklearn = hac.fit_predict(X_scaled)
# ── Evaluation metrics ──────────────────────────────────────
sil = silhouette_score(X_scaled, labels_sklearn)
dbi = davies_bouldin_score(X_scaled, labels_sklearn)
print(f"Silhouette Score : {sil:.4f} (higher = better, max=1.0)")
print(f"Davies-Bouldin : {dbi:.4f} (lower = better, min=0.0)")
print(f"Cluster sizes : {np.bincount(labels_sklearn)}")
Step 5 — Silhouette Plot to Find Optimal K
# Cut dendrogram at K = 2 to 8 and score each
k_range = range(2, 9)
sil_scores = []
for k in k_range:
labels = fcluster(Z, t=k, criterion='maxclust')
sil_scores.append(silhouette_score(X_scaled, labels))
best_k = k_range[int(np.argmax(sil_scores))]
best_sc = max(sil_scores)
print(f"Best K = {best_k} → Silhouette = {best_sc:.4f}")
for k, sc in zip(k_range, sil_scores):
bar = '█' * int(sc * 50)
marker = ' ← BEST' if k == best_k else ''
print(f" K={k} {bar} {sc:.4f}{marker}")
Hierarchical vs K-Means — When to Use Which
| Property | Hierarchical Clustering | K-Means |
|---|---|---|
| Pre-specify K? | No — choose after seeing dendrogram | Yes — must decide before running |
| Cluster shape | Arbitrary (depends on linkage) | Spherical only |
| Scalability | O(n²)–O(n³) — max ~50K rows | O(nKt) — scales to millions |
| Deterministic? | Yes — same result every run | No — depends on random initialisation |
| Handles outliers | Moderate (depends on linkage) | Pulls centroids toward outliers |
| Interpretability | Very high — dendrogram tells the story | Moderate |
| Best use case | Exploration, biology, small datasets | Large datasets, known K, spherical clusters |
Use Hierarchical Clustering when: your dataset is under ~50,000 rows, you don't know K in advance, you want interpretable output, or you need to show stakeholders a tree structure that explains the groupings. Use K-Means when: your dataset is large, you already know K (or will use the elbow method to find it), and speed matters more than visual insight. A common workflow: run hierarchical clustering on a sample to find K, then run K-Means on the full dataset with that K.
Common Pitfalls and How to Avoid Them
StandardScaler or
MinMaxScaler first — this is non-negotiable.
linkage='average' or linkage='complete'.
🏪 Customer Segmentation for a Retail Bank
Goal: Find natural customer segments using hierarchical clustering so the marketing team can tailor loan offers, savings products, and digital banking upgrades to each segment.
Step 1 — Load & Explore
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import matplotlib.pyplot as plt
np.random.seed(42)
# Simulated bank customer data (800 customers)
n = 800
df = pd.DataFrame({
'age': np.random.randint(22, 70, n),
'annual_income': np.random.normal(55000, 25000, n).clip(15000, 200000),
'credit_score': np.random.randint(450, 850, n),
'num_products': np.random.randint(1, 6, n),
'monthly_txns': np.random.randint(2, 80, n)
})
print(df.describe().round(1))
Step 2 — Preprocess & Choose Optimal K
# Scale all features to zero mean, unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
# Build Ward linkage matrix
Z = linkage(X_scaled, method='ward')
# Test K from 2 to 8
results = []
for k in range(2, 9):
labels = fcluster(Z, t=k, criterion='maxclust')
sil = silhouette_score(X_scaled, labels)
results.append({'k': k, 'silhouette': round(sil, 4)})
best = max(results, key=lambda x: x['silhouette'])
print(f"\n{'K':>4} {'Silhouette':>12}")
for r in results:
star = ' ← BEST' if r['k'] == best['k'] else ''
print(f" {r['k']:>2} {r['silhouette']:>12}{star}")
Step 3 — Apply Final Clustering (K = 4) & Profile Segments
# Apply final model with K=4
hac = AgglomerativeClustering(n_clusters=4, linkage='ward')
df['segment'] = hac.fit_predict(X_scaled)
# Profile each segment with mean values
profile = df.groupby('segment').agg(
count=('age', 'size'),
avg_age=('age', 'mean'),
avg_income=('annual_income', 'mean'),
avg_credit=('credit_score', 'mean'),
avg_products=('num_products', 'mean'),
avg_txns=('monthly_txns', 'mean')
).round(1)
print(profile.to_string())
Step 4 — Interpret & Name the Segments
| Segment | Name | Profile | Recommended Product |
|---|---|---|---|
| 0 | High-Value Professionals | Mid-age, high income £72K, excellent credit 701, high engagement | Premium credit cards, wealth management, mortgage top-ups |
| 1 | Budget-Conscious Adults | Mid-age, low income £36K, average credit 612, low transactions | Savings accounts, budget management tools, personal loans |
| 2 | Digital-Native Young Earners | Young (35), decent income, good credit, very high transactions | Mobile banking features, cashback cards, investment starter products |
| 3 | Pre-Retirement Low Activity | Older (59), moderate income, moderate credit, low transactions | Retirement savings plans, fixed deposits, reduced-fee accounts |
Hierarchical clustering discovered 4 meaningful customer segments without any domain knowledge or pre-labelled data. The Ward linkage with Euclidean distance (on scaled features) produced compact, well-separated groups. The silhouette plot objectively confirmed K = 4 as optimal. Each segment now has a clear marketing story the business can act on.
💬 Gene Expression Clustering in Bioinformatics
Goal: Identify groups of co-expressed genes using hierarchical clustering with average linkage and correlation-based distance, producing a heatmap dendrogram — the standard visualisation in genomics.
Two genes might both be "highly expressed" in absolute terms but move in opposite directions. What matters biologically is their correlation — do they rise and fall together? Correlation distance = 1 − Pearson r captures this. A value of 0 means perfectly correlated (same pathway); 2 means perfectly anti-correlated; 1 means uncorrelated.
Step 1 — Simulate Gene Expression Data
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(99)
conditions = ['D0', 'D1', 'D2', 'D4', 'D7', 'D14', 'D21', 'D28']
# Create 3 co-expression groups of ~17 genes each
# Group A: genes that peak early then drop off
base_A = np.array([3.0, 5.2, 7.8, 9.1, 6.4, 3.5, 1.8, 1.2])
# Group B: genes that remain steadily elevated
base_B = np.array([1.0, 1.2, 1.5, 4.8, 6.2, 7.1, 7.4, 7.9])
# Group C: genes that are constitutively suppressed
base_C = np.array([8.5, 7.2, 5.1, 3.3, 2.1, 1.5, 1.0, 0.8])
rows = []
for i in range(17):
rows.append(base_A + np.random.normal(0, 0.4, 8))
for i in range(17):
rows.append(base_B + np.random.normal(0, 0.4, 8))
for i in range(16):
rows.append(base_C + np.random.normal(0, 0.4, 8))
gene_names = [f"GENE_{i+1:02d}" for i in range(50)]
expr = pd.DataFrame(rows, index=gene_names, columns=conditions)
print("Expression matrix shape:", expr.shape)
print(expr.head(3).round(2))
Step 2 — Compute Correlation Distance Matrix
# 1 − Pearson correlation as distance between genes
corr_dist = pdist(expr.values, metric='correlation')
corr_matrix = squareform(corr_dist)
print("Distance matrix shape:", corr_matrix.shape)
print(f"Min dist: {corr_dist.min():.4f} | Max dist: {corr_dist.max():.4f}")
print(f"Median dist: {np.median(corr_dist):.4f}")
Step 3 — Cluster Genes with Average Linkage
# Average linkage on precomputed correlation distances
Z_genes = linkage(corr_dist, method='average')
# Find optimal K via silhouette
best_k, best_sil = 2, -1
for k in range(2, 8):
labels = fcluster(Z_genes, t=k, criterion='maxclust')
sil = silhouette_score(corr_matrix, labels, metric='precomputed')
if sil > best_sil:
best_k, best_sil = k, sil
print(f" K={k} silhouette={sil:.4f}")
print(f"\nOptimal K = {best_k} (silhouette = {best_sil:.4f})")
# Apply final clustering
gene_labels = fcluster(Z_genes, t=best_k, criterion='maxclust')
expr['cluster'] = gene_labels
Step 4 — Biological Interpretation
for c in sorted(expr['cluster'].unique()):
genes_in = expr[expr['cluster'] == c].drop('cluster', axis=1)
mean_expr = genes_in.mean()
print(f"\n── Cluster {c} ({len(genes_in)} genes) ──────────────")
print("Mean expression per timepoint:")
for cond, val in mean_expr.items():
bar = '■' * int(val)
print(f" {cond:>4}: {bar:<12} {val:.2f}")
print(f" Genes: {', '.join(genes_in.index[:5].tolist())} ...")
| Cluster | Expression Pattern | Biological Interpretation | Action |
|---|---|---|---|
| 1 (17 genes) | Early response — peaks at Day 4 then subsides | Acute phase response genes; inflammatory burst; immediate drug-target interaction | Investigate for acute toxicity markers |
| 2 (17 genes) | Delayed sustained — rises progressively, stays high | Chronic adaptation genes; long-term pathway activation; potential drug resistance | Monitor for resistance development |
| 3 (16 genes) | Suppressed progressively — falls from Day 0 | Genes inhibited by drug; immune suppression or cell cycle arrest markers | Check for off-target immunosuppression |
Correlation-based hierarchical clustering with average linkage cleanly separated 50 genes into 3 co-expression clusters with a silhouette score of 0.88 — excellent separation. The dendrogram told the biologist which genes were closest in expression space without any prior pathway knowledge. This workflow is the industry standard for RNA-seq and microarray analysis in computational biology, often called "hierarchical clustering heatmap".
sklearn API Quick Reference
from sklearn.cluster import AgglomerativeClustering
# Full parameter reference
model = AgglomerativeClustering(
n_clusters=4, # int or None (if distance_threshold used)
metric='euclidean', # 'euclidean','manhattan','cosine','precomputed'
linkage='ward', # 'ward','complete','average','single'
distance_threshold=None, # set this instead of n_clusters to cut by distance
connectivity=None, # connectivity matrix (e.g. from kneighbors_graph)
compute_full_tree='auto',# 'auto', True, False
compute_distances=False # set True if you need .distances_ attribute
)
labels = model.fit_predict(X_scaled)
# Useful post-fit attributes
print("n_clusters_ : ", model.n_clusters_) # actual number of clusters
print("n_leaves_ : ", model.n_leaves_) # number of leaves (= n samples)
print("n_connected_: ", model.n_connected_components_) # connectivity components
Golden Rules — Hierarchical Clustering
StandardScaler before
computing any distance. Unscaled features make high-range columns dominate
the clustering completely. This mistake is responsible for most "bad" clustering results.
metric='correlation' in pdist() or
linkage='average' with precomputed correlation distance handles this.