The Story That Explains Unsupervised Learning
Nobody tells you how many categories to create. Nobody gives you a labelling guide. You just start reading spines, noticing patterns. Books about war and politics drift together. Books about recipes and cooking cluster near the kitchen section. Science and maths titles migrate toward the back wall. You invent the organisation system purely from the content itself.
That is exactly what unsupervised learning does. The algorithm gets raw data with no labels and must discover the hidden structure on its own — grouping, compressing, or flagging what doesn't belong.
In supervised learning, you hand the model thousands of emails pre-labelled Spam / Not Spam. In unsupervised learning, you hand it the same emails with no labels at all and ask: "What natural groups exist here?"
Unsupervised learning finds hidden structure in unlabelled data. No teacher, no correct answer — just patterns, groups, compressions, and anomalies discovered purely from the data itself. It is the closest machine learning gets to true exploration.
The Three Families of Unsupervised Learning
Unsupervised methods fall into three broad families. Each answers a different question about your data.
Clustering: K-Means, DBSCAN, and Hierarchical Clustering carve your data into natural groups, such as customer segments, topic clusters, or gene expression profiles.
Dimensionality reduction: PCA, t-SNE, and Autoencoders collapse hundreds of features into far fewer meaningful dimensions (as few as 2–3 for plotting), which is useful for visualisation, denoising, or feature engineering.
Anomaly detection: Isolation Forest and One-Class SVM learn the shape of "normal" data and flag anything that falls outside it, such as fraud, faults, or network intrusions.
Clustering — K-Means: The City Planner
The planner starts by throwing 5 pins randomly on the map. Every house gets assigned to its nearest pin. Then each pin moves to the average location of its assigned houses. Houses re-assign. Pins move again. After a dozen rounds, the pins stop moving — they've found the natural centres. That iterative "assign → move → reassign" process is K-Means.
Left: random initialisation. Middle: first reassignment. Right: converged clusters with stable centroids.
The objective K-Means minimises is called inertia — the sum of squared distances from each point to its centroid. Lower inertia = tighter, more compact clusters.
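The assign → move loop and the inertia it drives down can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the two-blob demo data, and the choice to start from randomly picked data points are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_rounds=20, seed=0):
    """Plain assign -> move loop; returns (centroids, labels, inertia)."""
    rng = np.random.default_rng(seed)
    # Throw k "pins": start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_rounds):
        # Assign: each point goes to its nearest centroid.
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # Move: each centroid becomes the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old pin
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # pins stopped moving
        centroids = new_centroids
    # Final assignment and inertia (sum of squared distances to own centroid).
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dists.argmin(axis=1)
    inertia = sq_dists[np.arange(len(X)), labels].sum()
    return centroids, labels, inertia

# Two well-separated synthetic blobs, for illustration only.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels_demo, inertia = kmeans_sketch(X_demo, k=2)
print(centroids.round(2), round(float(inertia), 2))
```

In practice scikit-learn's KMeans adds smarter initialisation and multiple restarts (`n_init`), which is why it is used in the rest of this section.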
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
# ── Generate synthetic customer data ────────────────────
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
# ── Scale (K-Means IS distance-sensitive) ───────────────
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# ── Elbow method — find optimal K ───────────────────────
inertias = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
# ── Fit final model ──────────────────────────────────────
kmeans = KMeans(n_clusters=4, n_init=20, random_state=42)
labels = kmeans.fit_predict(X_scaled)
print(f"Inertia: {kmeans.inertia_:.2f}")
print(f"Cluster centres: {kmeans.cluster_centers_.shape}")
print(f"Cluster sizes: {np.bincount(labels)}")
K-Means requires you to pre-specify K, assumes clusters are roughly spherical and equal-sized, is sensitive to outliers (one extreme point can pull a centroid far off course), and always finds K clusters even if the natural number is different. Always scale your features first — unlike Random Forest, K-Means is distance-based.
Choosing K — The Elbow Method and Silhouette Score
One of the trickiest decisions in K-Means is selecting the right value of K. Two diagnostic tools exist for this.
The "elbow" is where adding more clusters stops giving meaningful inertia reduction. Here, K=4 is optimal.
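One rough way to read the elbow without a plot is to print how much inertia each extra cluster saves. The sketch below regenerates the same kind of synthetic blobs so that it runs on its own; the variable names (`X_elbow`, `inertia_by_k`, `drops`) are illustrative, and judging where the drops "flatten out" remains a judgment call rather than a fixed rule.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X_elbow, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
X_elbow = StandardScaler().fit_transform(X_elbow)

ks = list(range(1, 11))
inertia_by_k = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_elbow).inertia_
    for k in ks
]

# Inertia saved by going from K-1 to K clusters, for K = 2..10.
drops = -np.diff(inertia_by_k)
for k, drop in zip(ks[1:], drops):
    print(f"K={k}: inertia falls by {drop:.1f}")
```

The elbow is the last K whose drop is still large compared with the drops that follow it.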
from sklearn.metrics import silhouette_score
# ── Test K = 2 to 10 using silhouette score ─────────────
sil_scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    lbl = km.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, lbl)
    sil_scores.append((k, score))
    print(f"K={k} Silhouette={score:.4f}")
best_k = max(sil_scores, key=lambda x: x[1])
print(f"\nBest K: {best_k[0]} Score: {best_k[1]:.4f}")
Clustering — DBSCAN: The Astronomer's Lens
Her method: pick any star and count how many stars lie within a fixed radius ε. If at least minPts stars fall inside that radius, it is a core star that anchors a cluster. She expands outward from every core star, absorbing all reachable neighbours into the same cluster. A star with too few neighbours of its own that still sits within ε of a core star joins that cluster as a border star; a star with too few neighbours and no core star nearby is "noise", isolated background.
This is DBSCAN — Density-Based Spatial Clustering of Applications with Noise. It finds arbitrarily-shaped clusters and explicitly labels noise. You never specify K.
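The core / border / noise distinction can be sketched directly from pairwise distances. This is only the classification step, not the full cluster expansion, and the helper name `classify_points` and the toy data are assumptions for illustration. It counts a point as part of its own neighbourhood, which matches how scikit-learn documents `min_samples`.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' (no cluster IDs)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    within = dists <= eps                      # neighbourhoods include the point itself
    is_core = within.sum(axis=1) >= min_pts
    # A non-core point is a border point if any core point lies within eps.
    near_core = (within & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.2, size=(60, 2))     # one dense group
lonely = np.array([[5.0, 5.0]])                # one far-away point
kinds = classify_points(np.vstack([dense, lonely]), eps=0.3, min_pts=5)
print({name: int((kinds == name).sum()) for name in ("core", "border", "noise")})
```

The pairwise distance matrix makes this quadratic in memory, so it is suitable only for small demonstrations.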
K-Means splits rings by a vertical boundary (wrong). DBSCAN follows density contours and correctly separates both rings.
| Property | K-Means | DBSCAN |
|---|---|---|
| Number of clusters | Must pre-specify K | Discovered automatically |
| Cluster shapes | Spherical only | Any shape |
| Handles noise/outliers | No — assigns all points | Yes — explicit noise label (−1) |
| Scales to large data | Very fast | Slower with high dimensions |
| Parameters | n_clusters | eps, min_samples |
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles
# ── Concentric rings — K-Means can't handle this ────────
X_rings, _ = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=42)
X_rings = StandardScaler().fit_transform(X_rings)
# ── DBSCAN ───────────────────────────────────────────────
db = DBSCAN(eps=0.3, min_samples=8)
db_labels = db.fit_predict(X_rings)
n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_noise = list(db_labels).count(-1)
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")
print(f"Unique labels: {set(db_labels)}")
Clustering — Hierarchical: The Family Tree
She starts with each species as its own group (200 clusters). Then merges the two most similar species together. Then the next two most similar groups. Over and over until everything is one large group. At any point she can slice the tree horizontally to get however many clusters she wants — without re-running the algorithm. The result is a dendrogram — a branching diagram of the entire merger history.
A single dendrogram encodes every possible K. Cutting higher gives fewer, broader clusters. Cutting lower gives more, finer clusters.
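Cutting one tree at several levels can be sketched with SciPy's `fcluster`, which reads cluster labels off an existing linkage matrix without refitting. The three-group data below is synthetic and the names `X_tree`, `Z_tree`, and `cut_labels` are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Three loose groups of 20 points each (synthetic, for illustration only).
X_tree = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in (0.0, 3.0, 6.0)])

Z_tree = linkage(X_tree, method="ward")        # build the merge history once

for k in (2, 3, 5):
    # criterion="maxclust" cuts the tree so that at most k clusters remain.
    cut_labels = fcluster(Z_tree, t=k, criterion="maxclust")
    print(f"K={k}: cluster sizes {np.bincount(cut_labels)[1:]}")
```

Note that `fcluster` numbers clusters from 1, which is why the first bin of `np.bincount` is dropped.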
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
# ── Fit hierarchical clustering ──────────────────────────
hier = AgglomerativeClustering(
    n_clusters=4,
    linkage='ward'  # minimises intra-cluster variance at each merge
)
hier_labels = hier.fit_predict(X_scaled)
print(f"Cluster sizes: {np.bincount(hier_labels)}")
# ── Full dendrogram (use linkage on a small sample) ──────
Z = linkage(X_scaled[:50], method='ward')
# Z is a (49 x 4) merge matrix — plot with scipy dendrogram()
print(f"Merge matrix shape: {Z.shape}")
print(f"Last merge distance: {Z[-1, 2]:.3f}")
Dimensionality Reduction — PCA: The Shadow on the Wall
Now imagine rotating the spotlight until the shadow is as large and spread-out as possible — the angle where the most variance is visible. That is Principal Component Analysis (PCA). It finds the axis along which your data varies most, projects everything onto that axis, then finds the second most-varying perpendicular axis, and so on — compressing hundreds of features into a handful of principal components that capture the most variance.
PC1 runs along the axis of greatest spread. PC2 is perpendicular and captures the next most variance. Dashed lines show orthogonal projection.
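The geometry described here can be reproduced with a singular value decomposition of the centred data. This is a minimal sketch on a synthetic 2-D cloud (the mixing matrix and variable names are assumptions for the example), showing that the axes come out perpendicular and ordered by variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D cloud: most of the spread lies along one diagonal direction.
X_cloud = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])

X_centred = X_cloud - X_cloud.mean(axis=0)
# Rows of Vt are the principal axes, ordered by decreasing singular value.
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
explained_ratio = S**2 / np.sum(S**2)

scores = X_centred @ Vt.T          # coordinates of each point on PC1 and PC2
print("PC1 direction:", Vt[0].round(3))
print("Explained variance ratio:", explained_ratio.round(3))
print("PC1 . PC2 =", round(float(Vt[0] @ Vt[1]), 6))   # perpendicular axes
```

The sign of each axis is arbitrary, so a direction may come out flipped relative to another PCA implementation.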
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
# ── Digits dataset: 1797 samples × 64 pixel features ────
digits = load_digits()
X_dig = digits.data # shape (1797, 64)
X_dig = StandardScaler().fit_transform(X_dig)
# ── How many components explain 95% variance? ────────────
pca_full = PCA(n_components=0.95)
X_pca = pca_full.fit_transform(X_dig)
print(f"Original dims: {X_dig.shape[1]}")
print(f"Reduced dims: {X_pca.shape[1]}")
print(f"Variance retained: {sum(pca_full.explained_variance_ratio_):.4f}")
# ── 2D projection for visualisation ─────────────────────
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_dig)
print(f"2D variance: {sum(pca_2d.explained_variance_ratio_):.4f}")
print(f"PC1 explains: {pca_2d.explained_variance_ratio_[0]:.4f}")
print(f"PC2 explains: {pca_2d.explained_variance_ratio_[1]:.4f}")
PCA is variance-sensitive. A feature with a large numeric range (annual salary) will dominate one with a small range (age) and take over the leading principal components. Always apply StandardScaler before PCA. This is the opposite of Random Forest, which never needs scaling.
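The effect is easy to see on two made-up, independent features. In this sketch the salary and age distributions are invented purely for illustration; only the contrast between the unscaled and scaled explained-variance ratios matters.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
salary = rng.normal(60_000, 15_000, size=1000)   # large numeric range
age = rng.normal(40, 10, size=1000)              # small numeric range
X_raw = np.column_stack([salary, age])           # independent by construction

ratio_raw = PCA(n_components=2).fit(X_raw).explained_variance_ratio_
ratio_scaled = PCA(n_components=2).fit(
    StandardScaler().fit_transform(X_raw)
).explained_variance_ratio_

print("Unscaled:", ratio_raw.round(4))     # PC1 is almost entirely salary
print("Scaled:  ", ratio_scaled.round(4))  # roughly an even split
```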
Dimensionality Reduction — t-SNE: The Neighbourhood Mapper
t-SNE does exactly this. It converts high-dimensional distances into probabilities — "what is the chance that point A would pick point B as a neighbour?" Then it tries to replicate those same probabilities in 2D. Clusters that are genuinely distinct in high dimensions appear as clearly separated islands on your t-SNE map. It is the gold standard for visualising complex datasets, but not for general compression or preprocessing.
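The "chance that point A would pick point B as a neighbour" can be sketched as a row-normalised Gaussian kernel. This is a simplification: it uses one fixed bandwidth `sigma` for every point, whereas real t-SNE tunes a bandwidth per point to match the perplexity setting and then symmetrises the result. The function name and the three toy points are assumptions for illustration.

```python
import numpy as np

def neighbour_probabilities(X, sigma=1.0):
    """Row i holds p(j | i): how likely point i is to pick j as a neighbour."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    affinities = np.exp(-sq_dists / (2.0 * sigma**2))
    np.fill_diagonal(affinities, 0.0)          # a point never picks itself
    return affinities / affinities.sum(axis=1, keepdims=True)

# Three points: A and B are close together, C is far from both.
pts = np.array([[0.0, 0.0], [0.5, 0.0], [4.0, 0.0]])
P = neighbour_probabilities(pts)
print(P.round(4))
```

Row A puts almost all of its probability on B and almost none on C, which is the neighbourhood structure t-SNE then tries to reproduce in 2D.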
| Use t-SNE for | Avoid t-SNE for |
|---|---|
| ✅ Visualising high-dim data in 2D/3D | ❌ Feature engineering for ML pipelines |
| ✅ Confirming cluster structure visually | ❌ Interpreting cluster distances (not preserved) |
| ✅ Exploring new datasets | ❌ Large datasets (>50k rows: too slow) |
| ✅ Revealing sub-clusters within groups | ❌ Reproducible pipelines (stochastic) |
from sklearn.manifold import TSNE
# ── t-SNE: always reduce with PCA first for speed ────────
pca_50 = PCA(n_components=50, random_state=42)
X_50 = pca_50.fit_transform(X_dig) # PCA → 50 dims first
tsne = TSNE(
    n_components=2,
    perplexity=40,       # typical range: 5–50
    max_iter=1000,       # this argument is called n_iter in scikit-learn < 1.5
    random_state=42,
    learning_rate='auto'
)
X_tsne = tsne.fit_transform(X_50)
print(f"t-SNE output shape: {X_tsne.shape}")
print(f"KL divergence: {tsne.kl_divergence_:.4f}")
# Lower KL = the 2D neighbour probabilities match the high-dimensional ones more closely
Autoencoders — The Art of Compressing and Rebuilding
An autoencoder is a neural network that does exactly this. The encoder compresses input data into a smaller latent space. The decoder reconstructs the original data from that compressed representation. The network trains by minimising reconstruction error — no labels needed.
The bottleneck layer forces the network to learn a compact representation. Everything left of centre = encoder. Right = decoder.
import tensorflow as tf
from tensorflow import keras
# ── Simple autoencoder for MNIST-like digits ─────────────
input_dim = 64 # sklearn digits: 8x8 = 64 pixels
latent_dim = 8 # compress to 8 dimensions
# ── Encoder ───────────────────────────────────────────────
inputs = keras.Input(shape=(input_dim,))
encoded = keras.layers.Dense(32, activation='relu')(inputs)
latent = keras.layers.Dense(latent_dim, activation='relu')(encoded)
# ── Decoder ───────────────────────────────────────────────
decoded = keras.layers.Dense(32, activation='relu')(latent)
outputs = keras.layers.Dense(input_dim, activation='sigmoid')(decoded)
# ── Full autoencoder ──────────────────────────────────────
autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')
# ── Normalise to [0,1] ────────────────────────────────────
X_norm = digits.data / 16.0
autoencoder.fit(X_norm, X_norm, # target = input (reconstruction)
epochs=50, batch_size=32, validation_split=0.1, verbose=0)
# ── Extract compressed representations ───────────────────
encoder_model = keras.Model(inputs, latent)
X_compressed = encoder_model.predict(X_norm)
print(f"Compressed shape: {X_compressed.shape}") # (1797, 8)
Anomaly Detection — Isolation Forest: The Odd One Out
A genuine coin looks like many others and takes many splits to be alone. The counterfeit is so different it ends up alone after just a few splits.
Isolation Forest exploits this: anomalies are isolated by short paths in random decision trees. Points with very short average path lengths across hundreds of trees are flagged as anomalies. No need to define what "normal" looks like — the algorithm finds what's hard to isolate and calls it normal.
Points in dense clusters require many splits to isolate (long paths = normal). Anomalies are isolated almost immediately (short paths = anomaly).
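The path-length idea can be sketched by repeatedly splitting on a random feature at a random value and counting splits until one chosen point is alone. This is a toy version: the helper `isolation_depth` and the data are illustrative, and a real Isolation Forest builds whole trees on subsamples rather than tracing a single point.

```python
import numpy as np

def isolation_depth(X, target_idx, rng, max_depth=50):
    """Count random axis-aligned splits until X[target_idx] is alone."""
    idx = np.arange(len(X))
    depth = 0
    while len(idx) > 1 and depth < max_depth:
        feature = rng.integers(X.shape[1])
        lo, hi = X[idx, feature].min(), X[idx, feature].max()
        if lo == hi:
            break                                  # cannot split on this feature
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target point.
        if X[target_idx, feature] < split:
            idx = idx[X[idx, feature] < split]
        else:
            idx = idx[X[idx, feature] >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
# 200 ordinary points plus one odd one out in the last row.
X_coins = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0]]])

outlier_depth = np.mean([isolation_depth(X_coins, 200, rng) for _ in range(200)])
inlier_depth = np.mean([isolation_depth(X_coins, 0, rng) for _ in range(200)])
print(f"Average depth, outlier: {outlier_depth:.1f}   inlier: {inlier_depth:.1f}")
```

Averaging over many random runs plays the role of the hundreds of trees in the forest.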
from sklearn.ensemble import IsolationForest
import numpy as np
# ── Credit card fraud detection simulation ───────────────
np.random.seed(42)
X_normal = np.random.randn(500, 5) # 500 normal transactions
X_fraud = np.random.randn(20, 5) * 4 + 6 # 20 anomalous transactions
X_all = np.vstack([X_normal, X_fraud])
# ── Isolation Forest ─────────────────────────────────────
iso = IsolationForest(
    n_estimators=200,
    contamination=0.04,  # expected anomaly fraction (~4%)
    random_state=42
)
preds = iso.fit_predict(X_all) # +1 = normal, -1 = anomaly
scores = iso.score_samples(X_all) # lower score = more anomalous
n_detected = (preds == -1).sum()
actual_frauds_found = (preds[500:] == -1).sum()
print(f"Total anomalies flagged: {n_detected}")
print(f"Real frauds detected: {actual_frauds_found} / 20")
print(f"Lowest anomaly score: {scores.min():.4f}")
Isolation Forest is one of the most practical anomaly detectors available.
It works on tabular data with no label requirements, scales well, handles
high-dimensional data, and produces anomaly scores that can be thresholded.
The contamination parameter is your main lever — set it to your
expected anomaly rate, or use 'auto' and threshold the scores manually.
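Thresholding the scores manually might look like the sketch below. The data is synthetic, and `review_fraction` is a hypothetical budget chosen for the example, not a recommended value.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_txn = np.vstack([
    rng.normal(0, 1, size=(500, 5)),          # bulk of "normal" rows
    rng.normal(6, 4, size=(20, 5)),           # 20 unusual rows
])

iso_auto = IsolationForest(n_estimators=200, contamination="auto", random_state=42)
iso_auto.fit(X_txn)
txn_scores = iso_auto.score_samples(X_txn)    # lower = more anomalous

# Hypothetical review budget: flag roughly the 4% lowest-scoring rows.
review_fraction = 0.04
threshold = np.quantile(txn_scores, review_fraction)
flagged = txn_scores <= threshold
print(f"Threshold: {threshold:.4f}   rows flagged: {flagged.sum()}")
```

Because the threshold is applied to the scores rather than baked into the model, it can be moved later without refitting.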
End-to-End Pipeline — Customer Segmentation
The pipeline: Scale → PCA (reduce noise, speed up clustering) → K-Means → Profile each cluster → Name and target each segment.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# ── Simulate customer data ────────────────────────────────
np.random.seed(42)
n = 2000
df = pd.DataFrame({
    'recency_days': np.random.exponential(60, n),
    'frequency': np.random.poisson(8, n),
    'total_spend': np.random.lognormal(5, 1.2, n),
    'avg_order_value': np.random.lognormal(3, 0.8, n),
    'categories_browsed': np.random.randint(1, 15, n),
    'returns_count': np.random.poisson(1.5, n),
})
# ── Step 1: Scale ─────────────────────────────────────────
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
# ── Step 2: PCA — 90% variance ───────────────────────────
pca = PCA(n_components=0.90, random_state=42)
X_pca = pca.fit_transform(X_scaled)
print(f"Features after PCA: {X_pca.shape[1]}")
# ── Step 3: Choose K ──────────────────────────────────────
best_k, best_sil = 2, -1.0  # silhouette ranges from -1 to 1
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    lbl = km.fit_predict(X_pca)
    sil = silhouette_score(X_pca, lbl)
    if sil > best_sil:
        best_sil, best_k = sil, k
print(f"Optimal K: {best_k} Silhouette: {best_sil:.4f}")
# ── Step 4: Fit and profile ───────────────────────────────
kmeans = KMeans(n_clusters=best_k, n_init=20, random_state=42)
df['segment'] = kmeans.fit_predict(X_pca)
profile = df.groupby('segment').mean().round(1)
print("\nSegment profiles (means):")
print(profile.to_string())
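The Scale → PCA → K-Means steps can also be bundled into a single scikit-learn pipeline, so that new customers pass through exactly the same fitted transformations. This is a sketch under assumptions: the three-column `customers` frame is a cut-down stand-in for the data above, and K=3 is an arbitrary choice rather than a tuned one.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
customers = pd.DataFrame({
    "recency_days": rng.exponential(60, 500),
    "frequency": rng.poisson(8, 500),
    "total_spend": rng.lognormal(5, 1.2, 500),
})

segmenter = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.90, random_state=42),
    KMeans(n_clusters=3, n_init=10, random_state=42),   # K=3 is arbitrary here
)
customers["segment"] = segmenter.fit_predict(customers)

# New rows go through the same fitted scaler and PCA before assignment.
new_customers = customers.drop(columns="segment").head(5)
print(segmenter.predict(new_customers))
```

Keeping the steps in one object avoids the common mistake of refitting the scaler on new data.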
Algorithm Comparison — When to Use What
| Algorithm | Best For | Requires Scaling | Specify K? | Handles Noise |
|---|---|---|---|---|
| K-Means | Large datasets, spherical clusters, fast baseline | Yes — mandatory | Yes | No |
| DBSCAN | Irregular shapes, unknown K, with noise points | Yes | No — automatic | Yes — explicit |
| Hierarchical | Small-medium data, needing full merge history | Recommended | No — cut after | Partial |
| PCA | Linear reduction, preprocessing, noise removal | Yes — mandatory | n_components | Partial |
| t-SNE | 2D/3D visualisation only | Yes | 2 or 3 | Partial |
| Autoencoder | Non-linear reduction, denoising, anomaly detection | Yes | Latent dim | Yes (VAE) |
| Isolation Forest | Anomaly / fraud detection on tabular data | No | contamination | Yes — core purpose |
Golden Rules for Unsupervised Learning
Use StandardScaler as your default. Only Isolation Forest and other tree-based methods are immune to feature scale.
Set random_state on everything stochastic. K-Means, t-SNE, and Isolation Forest all rely on randomness (random initialisation or random splits). Without a fixed seed, results change from run to run, which makes them hard to reproduce and debug.
Set contamination deliberately. Don't use 'auto' in production. Estimate your expected anomaly rate from domain knowledge (fraud rates, fault rates) and set it explicitly. Treat the output as a ranking of anomaly scores and threshold it yourself.
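The random_state rule can be checked directly. The sketch below fits K-Means twice with the same seed and once with a different one; the data and the helper `fit_labels` are illustrative, and `n_init=1` is used only to make the effect of the seed visible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_seed, _ = make_blobs(n_samples=200, centers=3, random_state=0)

def fit_labels(seed):
    # n_init=1 keeps a single random start, so the seed's effect is visible.
    return KMeans(n_clusters=3, n_init=1, random_state=seed).fit_predict(X_seed)

same_seed_match = np.array_equal(fit_labels(42), fit_labels(42))
print("Same seed, identical labels:", same_seed_match)

# With a different seed the cluster IDs may be permuted or the partition may
# differ; neither outcome is guaranteed, so inspect rather than assume.
print("Different seed, identical labels:", np.array_equal(fit_labels(42), fit_labels(7)))
```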