Unsupervised Learning

Section 01

The Story That Explains Unsupervised Learning

📖 Real World Analogy

The New Librarian and 10,000 Unlabelled Books

Imagine you walk into a chaotic library. Ten thousand books are piled everywhere — no shelves, no categories, no Dewey decimal system. Your job: organise them.

Nobody tells you how many categories to create. Nobody gives you a labelling guide. You just start reading spines, noticing patterns. Books about war and politics drift together. Books about recipes and cooking cluster near the kitchen section. Science and maths titles migrate toward the back wall. You invent the organisation system purely from the content itself.

That is exactly what unsupervised learning does. The algorithm gets raw data with no labels and must discover the hidden structure on its own — grouping, compressing, or flagging what doesn't belong.

In supervised learning, you hand the model thousands of emails pre-labelled Spam / Not Spam. In unsupervised learning, you hand it the same emails with no labels at all and ask: "What natural groups exist here?"

💡

The Core Idea

Unsupervised learning finds hidden structure in unlabelled data. No teacher, no correct answer — just patterns, groups, compressions, and anomalies discovered purely from the data itself. It is the closest machine learning gets to true exploration.

Section 02

The Three Families of Unsupervised Learning

Unsupervised methods fall into three broad families. Each answers a different question about your data.

🔵

Clustering

Group Similar Points

"Which things belong together?"

K-Means, DBSCAN, and Hierarchical Clustering carve your data into natural groups — customer segments, topic clusters, gene expression profiles.

📐

Dimensionality Reduction

Compress & Visualise

"What is the essence of this data?"

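PCA, t-SNE, and Autoencoders compress hundreds of features into a handful of meaningful dimensions. Two or three dimensions suit visualisation (the only job t-SNE should be given); PCA and autoencoders are also used for denoising and feature engineering.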

🚨

Anomaly Detection

Find the Outliers

"What doesn't belong here?"

Isolation Forest and One-Class SVM learn the shape of "normal" data and flag anything that falls outside — fraud, faults, network intrusions.

Section 03

Clustering — K-Means: The City Planner

📖 Story

Opening Five Fire Stations in a New City

A city planner needs to open exactly 5 fire stations to serve all neighbourhoods as quickly as possible. Nobody has labelled which neighbourhood belongs to which station. The planner's rule: each station serves the houses closest to it, and each station should sit at the centre of gravity of its assigned houses.

The planner starts by throwing 5 pins randomly on the map. Every house gets assigned to its nearest pin. Then each pin moves to the average location of its assigned houses. Houses re-assign. Pins move again. After a dozen rounds, the pins stop moving — they've found the natural centres. That iterative "assign → move → reassign" process is K-Means.

🔁 K-Means Algorithm — Step by Step

Step 1

Choose K — the number of clusters you want. Place K centroids randomly in the feature space.

Step 2

Assignment: Each data point is assigned to the nearest centroid using Euclidean distance.

Step 3

Update: Each centroid moves to the mean position of all points currently assigned to it.

Step 4

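Repeat Steps 2–3 until assignments no longer change (convergence). The result is a local optimum that depends on the starting centroids, which is why scikit-learn's n_init reruns the algorithm from several starts and keeps the run with the lowest inertia.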

Result

Each point belongs to exactly one cluster. The centroid is the cluster's representative point.

📊 K-Means Convergence — Three Iterations

Left: random initialisation. Middle: first reassignment. Right: converged clusters with stable centroids.

The objective K-Means minimises is called inertia — the sum of squared distances from each point to its centroid. Lower inertia = tighter, more compact clusters.

Inertia (WCSS)

Inertia = Σₖ Σ(xᵢ ∈ Cₖ) ||xᵢ − μₖ||²

Sum of squared distances from each point to its assigned centroid. K-Means minimises this.

Centroid Update

μₖ = (1/|Cₖ|) Σ(xᵢ ∈ Cₖ) xᵢ

New centroid = arithmetic mean of all points assigned to cluster k.
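The assign → update loop and the two formulas above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the toy blobs, and the choice to start from K randomly picked data points are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain assign -> update loop; returns labels, centroids, inertia history."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as starting centroids (an assumption)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    history = []
    for _ in range(n_iter):
        # Step 2: squared Euclidean distance from every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Inertia: sum of squared distances to the assigned centroid
        history.append(d2[np.arange(len(X)), labels].sum())
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids, history

rng = np.random.default_rng(42)
# Three hypothetical 2-D blobs, purely for illustration
X_demo = np.vstack([rng.normal(c, 0.5, size=(50, 2))
                    for c in ([0, 0], [5, 5], [0, 5])])
labels_demo, centroids_demo, inertia_history = kmeans_sketch(X_demo, k=3)
print(f"Iterations run: {len(inertia_history)}")
print(f"Final inertia:  {inertia_history[-1]:.2f}")
```

The inertia recorded at each round never increases, because both the assignment step and the mean update can only lower (or keep) the sum of squared distances.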

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

# ── Generate synthetic customer data ────────────────────
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# ── Scale (K-Means IS distance-sensitive) ───────────────
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# ── Elbow method — find optimal K ───────────────────────
inertias = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

# ── Fit final model ──────────────────────────────────────
kmeans = KMeans(n_clusters=4, n_init=20, random_state=42)
labels = kmeans.fit_predict(X_scaled)

print(f"Inertia:         {kmeans.inertia_:.2f}")
print(f"Cluster centres: {kmeans.cluster_centers_.shape}")
print(f"Cluster sizes:   {np.bincount(labels)}")

OUTPUT

Inertia:         244.91
Cluster centres: (4, 2)
Cluster sizes:   [76 74 76 74]

⚠️

K-Means Weaknesses — Know These Before Using It

K-Means requires you to pre-specify K, assumes clusters are roughly spherical and equal-sized, is sensitive to outliers (one extreme point can pull a centroid far off course), and always finds K clusters even if the natural number is different. Always scale your features first — unlike Random Forest, K-Means is distance-based.

Section 04

Choosing K — The Elbow Method and Silhouette Score

One of the trickiest decisions in K-Means is selecting the right value of K. Two diagnostic tools exist for this.

📉 The Elbow Method — Inertia vs K

The "elbow" is where adding more clusters stops giving meaningful inertia reduction. Here, K=4 is optimal.
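One rough way to read the elbow without a plot is to print how much inertia each extra cluster removes. The sketch below rebuilds the same kind of synthetic blobs so it runs on its own; the variable names (`X_elbow`, `inertias_elbow`) and the "percentage drop" readout are illustrative choices, not a standard API.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X_elbow, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
X_elbow = StandardScaler().fit_transform(X_elbow)

ks = list(range(1, 11))
inertias_elbow = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_elbow).inertia_
    for k in ks
]

# Share of the previous inertia removed by adding one more cluster
for k, prev, cur in zip(ks[1:], inertias_elbow[:-1], inertias_elbow[1:]):
    drop = 100 * (prev - cur) / prev
    print(f"K={k:2d}  inertia={cur:8.2f}  drop vs K-1: {drop:5.1f}%")
```

The elbow is the K after which the drop column falls off sharply. Treat it as a hint rather than a verdict, since real data rarely shows a clean bend.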

Silhouette Score

s = (b − a) / max(a, b)

a = mean intra-cluster distance. b = mean nearest-cluster distance. Range −1 to +1. Higher is better.

Interpret Score

+1 → 0 → −1

+1 = perfect separation. 0 = overlapping clusters. −1 = point assigned to wrong cluster.
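To make a and b concrete, the sketch below computes the silhouette of a single point by hand and compares it with scikit-learn's `silhouette_samples`. The dataset and the choice of point index 0 are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X_sil, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=0)
labels_sil = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_sil)

i = 0  # index of the point to inspect (arbitrary)
dists = np.linalg.norm(X_sil - X_sil[i], axis=1)

# a: mean distance to the other points in the same cluster
same = labels_sil == labels_sil[i]
same[i] = False  # exclude the point itself
a = dists[same].mean()

# b: smallest mean distance to the points of any other cluster
b = min(dists[labels_sil == c].mean()
        for c in np.unique(labels_sil) if c != labels_sil[i])

s_manual = (b - a) / max(a, b)
s_sklearn = silhouette_samples(X_sil, labels_sil)[i]
print(f"a={a:.3f}  b={b:.3f}  s={s_manual:.4f}  (sklearn: {s_sklearn:.4f})")
```

`silhouette_score` is simply the mean of these per-point values over the whole dataset.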

from sklearn.metrics import silhouette_score

# ── Test K = 2 to 10 using silhouette score ─────────────
sil_scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    lbl = km.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, lbl)
    sil_scores.append((k, score))
    print(f"K={k}  Silhouette={score:.4f}")

best_k = max(sil_scores, key=lambda x: x[1])
print(f"\nBest K: {best_k[0]}  Score: {best_k[1]:.4f}")

OUTPUT

K=2  Silhouette=0.5821
K=3  Silhouette=0.6450
K=4  Silhouette=0.7812   ← Best
K=5  Silhouette=0.6103
...

Best K: 4  Score: 0.7812

Section 05

Clustering — DBSCAN: The Astronomer's Lens

📖 Story

Mapping Star Clusters Without Knowing How Many There Are

An astronomer points a telescope at the night sky and takes a photo of thousands of stars. She wants to find natural star clusters — but has no idea how many exist, and some regions of sky are sparsely populated (those are just background noise stars, not real clusters).

Her method: pick any star and count how many other stars lie within a fixed radius ε. If at least minPts stars fall inside that radius, it is a core star that anchors a cluster. She expands outward from every core star, absorbing all reachable neighbours into the same cluster. A star with too few neighbours of its own that still sits within ε of a core star joins that cluster as a border star. A star that is neither core nor within reach of one is "noise": isolated background.

This is DBSCAN — Density-Based Spatial Clustering of Applications with Noise. It finds arbitrarily-shaped clusters and explicitly labels noise. You never specify K.
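The three kinds of points (core, border, noise) can be read straight off a fitted scikit-learn `DBSCAN` object. A small sketch, assuming a two-moons toy dataset and hand-picked `eps` and `min_samples` values that are not tuned for anything beyond this illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=300, noise=0.08, random_state=0)
db_demo = DBSCAN(eps=0.2, min_samples=5).fit(X_moons)

# core_sample_indices_ lists the points that have enough neighbours within eps
is_core = np.zeros(len(X_moons), dtype=bool)
is_core[db_demo.core_sample_indices_] = True
is_noise = db_demo.labels_ == -1
# Border points sit inside a cluster but are not dense enough to be core
is_border = ~is_core & ~is_noise

print(f"Core points:   {is_core.sum()}")
print(f"Border points: {is_border.sum()}")
print(f"Noise points:  {is_noise.sum()}")
```

Every point falls into exactly one of the three groups, so the counts always add up to the dataset size.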

🔭 DBSCAN vs K-Means on Non-Spherical Data

K-Means splits rings by a vertical boundary (wrong). DBSCAN follows density contours and correctly separates both rings.

Property	K-Means	DBSCAN
Number of clusters	Must pre-specify K	Discovered automatically
Cluster shapes	Roughly spherical, convex	Any shape
Handles noise/outliers	No — assigns all points	Yes — explicit noise label (−1)
Scales to large data	Very fast	Slower with high dimensions
Parameters	`n_clusters`	`eps`, `min_samples`

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles

# ── Concentric rings — K-Means can't handle this ────────
X_rings, _ = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=42)  # fixed seed for repeatable runs
X_rings = StandardScaler().fit_transform(X_rings)

# ── DBSCAN ───────────────────────────────────────────────
db = DBSCAN(eps=0.3, min_samples=8)
db_labels = db.fit_predict(X_rings)

n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_noise    = list(db_labels).count(-1)

print(f"Clusters found:  {n_clusters}")
print(f"Noise points:    {n_noise}")
print(f"Unique labels:   {set(db_labels)}")

OUTPUT

Clusters found:  2
Noise points:    0
Unique labels:   {0, 1}

Section 06

Clustering — Hierarchical: The Family Tree

📖 Story

Building a Species Family Tree from DNA

A biologist sequences the DNA of 200 species and wants to understand evolutionary relationships. She doesn't want a flat list of clusters — she wants a tree of life, showing that chimps and humans are closer to each other than either is to dolphins, and that all mammals share a deeper common ancestor with reptiles.

She starts with each species as its own group (200 clusters). Then merges the two most similar species together. Then the next two most similar groups. Over and over until everything is one large group. At any point she can slice the tree horizontally to get however many clusters she wants — without re-running the algorithm. The result is a dendrogram — a branching diagram of the entire merger history.

🌳 Dendrogram — Agglomerative Hierarchical Clustering

A single dendrogram encodes every possible K. Cutting higher gives fewer, broader clusters. Cutting lower gives more, finer clusters.
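Slicing one tree at several heights can be sketched with SciPy's `fcluster`, which turns a linkage matrix into flat cluster labels. The toy data and the chosen values of K below are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X_h, _ = make_blobs(n_samples=60, centers=4, cluster_std=0.8, random_state=42)
Z_h = linkage(X_h, method="ward")   # the merge history is built once

# Cut the same tree three times; no re-clustering happens here
cuts = {k: fcluster(Z_h, t=k, criterion="maxclust") for k in (2, 4, 6)}
for k, lbl in cuts.items():
    # fcluster labels start at 1, so drop the empty bin 0
    print(f"K={k}: cluster sizes {np.bincount(lbl)[1:]}")
```

Because the cuts come from one tree, they are nested: every cluster in the K=4 cut sits entirely inside a single cluster of the K=2 cut.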

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# ── Fit hierarchical clustering ──────────────────────────
hier = AgglomerativeClustering(
    n_clusters=4,
    linkage='ward'      # minimises intra-cluster variance at each merge
)
hier_labels = hier.fit_predict(X_scaled)

print(f"Cluster sizes: {np.bincount(hier_labels)}")

# ── Full dendrogram (use linkage on a small sample) ──────
Z = linkage(X_scaled[:50], method='ward')
# Z is a (49 x 4) merge matrix — plot with scipy dendrogram()
print(f"Merge matrix shape: {Z.shape}")
print(f"Last merge distance: {Z[-1, 2]:.3f}")

OUTPUT

Cluster sizes: [77 75 74 74]
Merge matrix shape: (49, 4)
Last merge distance: 18.452

Section 07

Dimensionality Reduction — PCA: The Shadow on the Wall

📖 Story

The Sculpture and Its Shadow

Imagine a complex 3D sculpture — a twisted helix of metal. You shine a spotlight on it from the front and observe the 2D shadow it casts on the wall. That shadow loses some depth information, but it captures the most informative projection of the object from that angle.

Now imagine rotating the spotlight until the shadow is as large and spread-out as possible — the angle where the most variance is visible. That is Principal Component Analysis (PCA). It finds the axis along which your data varies most, projects everything onto that axis, then finds the second most-varying perpendicular axis, and so on — compressing hundreds of features into a handful of principal components that capture the most variance.

📐 PCA — Finding the Axes of Maximum Variance

PC1 runs along the axis of greatest spread. PC2 is perpendicular and captures the next most variance. Dashed lines show orthogonal projection.
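Under the hood, the axes of maximum variance come from a singular value decomposition of the centred data. The sketch below does that by hand and compares it with scikit-learn's `PCA`. The random correlated dataset is a made-up example, and component signs are compared in absolute value because the direction of each axis is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical correlated data: 200 samples, 5 features
X_raw = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Manual PCA: centre the data, then take the SVD of the centred matrix
X_centred = X_raw - X_raw.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
ratio_manual = S**2 / np.sum(S**2)       # variance share of each component
X_proj_manual = X_centred @ Vt[:2].T     # coordinates along PC1 and PC2

pca_check = PCA(n_components=2).fit(X_raw)
X_proj_sklearn = pca_check.transform(X_raw)

print("Manual ratio: ", np.round(ratio_manual[:2], 4))
print("sklearn ratio:", np.round(pca_check.explained_variance_ratio_, 4))
```

The rows of `Vt` are the principal axes, and the squared singular values are proportional to the variance captured along each one.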

from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

# ── Digits dataset: 1797 samples × 64 pixel features ────
digits = load_digits()
X_dig  = digits.data          # shape (1797, 64)
X_dig  = StandardScaler().fit_transform(X_dig)

# ── How many components explain 95% variance? ────────────
pca_full = PCA(n_components=0.95)
X_pca    = pca_full.fit_transform(X_dig)

print(f"Original dims:   {X_dig.shape[1]}")
print(f"Reduced dims:    {X_pca.shape[1]}")
print(f"Variance retained: {sum(pca_full.explained_variance_ratio_):.4f}")

# ── 2D projection for visualisation ─────────────────────
pca_2d  = PCA(n_components=2)
X_2d    = pca_2d.fit_transform(X_dig)
print(f"2D variance:     {sum(pca_2d.explained_variance_ratio_):.4f}")
print(f"PC1 explains:    {pca_2d.explained_variance_ratio_[0]:.4f}")
print(f"PC2 explains:    {pca_2d.explained_variance_ratio_[1]:.4f}")

OUTPUT

Original dims:   64
Reduced dims:    29   ← 29 components retain 95% of variance
Variance retained: 0.9509
2D variance:     0.2862
PC1 explains:    0.1489
PC2 explains:    0.1373

🔑

PCA Always Needs Scaling First

PCA is variance-sensitive. A feature with a large numeric range (annual salary, in the tens of thousands) will dominate one with a small range (age, in years) and hijack the leading principal components. Apply StandardScaler before PCA whenever features are measured in different units. This is the opposite of Random Forest, which never needs scaling.
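The effect is easy to reproduce. In the sketch below, age and salary are hypothetical, statistically independent columns, so neither should dominate; the only thing that differs between the two PCA fits is whether the data was standardised first.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical, independent features on very different scales
age = rng.normal(40, 10, size=500)              # years
salary = rng.normal(60_000, 20_000, size=500)   # currency units
X_mixed = np.column_stack([age, salary])

pca_raw = PCA(n_components=2).fit(X_mixed)
pca_std = PCA(n_components=2).fit(StandardScaler().fit_transform(X_mixed))

load_age, load_salary = pca_raw.components_[0]
print(f"Unscaled PC1 loadings: age={load_age:.4f}, salary={load_salary:.4f}")
print(f"Unscaled PC1 variance share: {pca_raw.explained_variance_ratio_[0]:.4f}")
print(f"Scaled PC1 variance share:   {pca_std.explained_variance_ratio_[0]:.4f}")
```

Without scaling, PC1 is essentially the salary column. After scaling, the two independent features share the variance roughly evenly.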

Section 08

Dimensionality Reduction — t-SNE: The Neighbourhood Mapper

📖 Story

Moving a City to a 2D Map While Keeping Neighbours Together

Imagine you have a city in 100-dimensional space (impossible to visualise) and you want to draw it on a 2D map so that neighbours stay neighbours. Two people who live on the same street must still be close on your map, even if the overall geography is distorted.

t-SNE does exactly this. It converts high-dimensional distances into probabilities — "what is the chance that point A would pick point B as a neighbour?" Then it tries to replicate those same probabilities in 2D. Clusters that are genuinely distinct in high dimensions appear as clearly separated islands on your t-SNE map. It is the gold standard for visualising complex datasets, but not for general compression or preprocessing.

📊 When to Use t-SNE

✅ Visualising high-dim data in 2D/3D

✅ Confirming cluster structure visually

✅ Exploring new datasets

✅ Revealing sub-clusters within groups

🚫 When NOT to Use t-SNE

❌ Feature engineering for ML pipelines

❌ Interpreting cluster distances (not preserved)

❌ Large datasets (>50k rows — too slow)

❌ Pipelines that must embed new data (no transform() for unseen points, and the layout changes with the seed)
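Two of the limits listed above can be seen directly in scikit-learn. The sketch below runs t-SNE twice on a small slice of the digits data with different seeds, then checks whether the estimator offers a `transform` method for new rows. The subset size, perplexity, and the explicit `init="random"` are assumptions chosen to keep the run short and the seed effect visible.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X_small = load_digits().data[:200]   # small subset to keep the run short

# Two runs that differ only in the random seed
emb_a = TSNE(n_components=2, perplexity=30, init="random",
             random_state=0).fit_transform(X_small)
emb_b = TSNE(n_components=2, perplexity=30, init="random",
             random_state=1).fit_transform(X_small)
print("Same coordinates across seeds:", np.allclose(emb_a, emb_b))

# t-SNE cannot place unseen rows into an existing map; PCA can
print("TSNE has transform():", hasattr(TSNE(), "transform"))
print("PCA has transform(): ", hasattr(PCA(), "transform"))
```

Since a fitted t-SNE map cannot be applied to new data, it fits exploration better than a train-then-predict pipeline.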

from sklearn.manifold import TSNE

# ── t-SNE: always reduce with PCA first for speed ────────
pca_50  = PCA(n_components=50, random_state=42)
X_50    = pca_50.fit_transform(X_dig)        # PCA → 50 dims first

tsne    = TSNE(
    n_components=2,
    perplexity=40,          # typical range: 5–50
    max_iter=1000,          # named n_iter in scikit-learn < 1.5
    random_state=42,
    learning_rate='auto'
)
X_tsne = tsne.fit_transform(X_50)

print(f"t-SNE output shape: {X_tsne.shape}")
print(f"KL divergence:      {tsne.kl_divergence_:.4f}")
# Lower KL = the 2D neighbour probabilities match the high-dimensional ones more closely

OUTPUT

t-SNE output shape: (1797, 2)
KL divergence:      0.8312

Section 09

Autoencoders — The Art of Compressing and Rebuilding

📖 Story

The Whisper Telephone Game — Lossy but Structured

Imagine a game where you must whisper a 100-word story to a friend using only 10 words. Your friend then has to reconstruct the full 100-word story from those 10 words. Both of you train together — you learn to compress, they learn to reconstruct. After many rounds, your 10-word summaries become remarkably efficient, capturing exactly the most essential information.

An autoencoder is a neural network that does exactly this. The encoder compresses input data into a smaller latent space. The decoder reconstructs the original data from that compressed representation. The network trains by minimising reconstruction error — no labels needed.

🧠 Autoencoder Architecture

The bottleneck layer forces the network to learn a compact representation. Everything left of centre = encoder. Right = decoder.

import tensorflow as tf
from tensorflow import keras

# ── Simple autoencoder for MNIST-like digits ─────────────
input_dim  = 64    # sklearn digits: 8x8 = 64 pixels
latent_dim = 8     # compress to 8 dimensions

# ── Encoder ───────────────────────────────────────────────
inputs  = keras.Input(shape=(input_dim,))
encoded = keras.layers.Dense(32, activation='relu')(inputs)
latent  = keras.layers.Dense(latent_dim, activation='relu')(encoded)

# ── Decoder ───────────────────────────────────────────────
decoded  = keras.layers.Dense(32, activation='relu')(latent)
outputs  = keras.layers.Dense(input_dim, activation='sigmoid')(decoded)

# ── Full autoencoder ──────────────────────────────────────
autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')

# ── Normalise to [0,1] ────────────────────────────────────
X_norm = digits.data / 16.0

autoencoder.fit(X_norm, X_norm,   # target = input (reconstruction)
              epochs=50, batch_size=32, validation_split=0.1, verbose=0)

# ── Extract compressed representations ───────────────────
encoder_model = keras.Model(inputs, latent)
X_compressed  = encoder_model.predict(X_norm)
print(f"Compressed shape: {X_compressed.shape}")  # (1797, 8)

OUTPUT

Compressed shape: (1797, 8)
Original: 64 features → Latent: 8 features (87.5% compression)
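For readers without TensorFlow installed, the same compress-and-rebuild idea can be sketched with scikit-learn's `MLPRegressor` trained to predict its own input. This is a stand-in, not the Keras model above: the layer sizes mirror it, but the output layer is linear rather than sigmoid, and the encoder half is reproduced by hand from the fitted weights (`coefs_`, `intercepts_`).

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPRegressor

X_ae = load_digits().data / 16.0    # pixel values scaled to [0, 1]

# 64 -> 32 -> 8 -> 32 -> 64: the 8-unit layer is the bottleneck
ae = MLPRegressor(hidden_layer_sizes=(32, 8, 32), activation="relu",
                  max_iter=300, random_state=42)
ae.fit(X_ae, X_ae)                  # target = input (reconstruction)

X_rebuilt = ae.predict(X_ae)
mse = np.mean((X_ae - X_rebuilt) ** 2)
# Reference point: the error of always predicting the mean image
baseline = np.mean((X_ae - X_ae.mean(axis=0)) ** 2)

# Encoder half: push the data through the first two layers by hand
hidden = np.maximum(0, X_ae @ ae.coefs_[0] + ae.intercepts_[0])
latent_codes = np.maximum(0, hidden @ ae.coefs_[1] + ae.intercepts_[1])

print(f"Reconstruction MSE: {mse:.4f}  (mean-image baseline: {baseline:.4f})")
print(f"Latent shape:       {latent_codes.shape}")
```

A reconstruction error well below the mean-image baseline suggests the eight latent values carry real information about each digit.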

Section 10

Anomaly Detection — Isolation Forest: The Odd One Out

📖 Story

Finding the Counterfeit Coin in a Bag

You have a bag of 1,000 coins. One is counterfeit — slightly different weight, slightly different size. If you keep randomly splitting the coins into two groups and asking "which group is the counterfeit in?", the counterfeit coin is isolated from the group much faster than any genuine coin.

A genuine coin looks like many others and takes many splits to be alone. The counterfeit is so different it ends up alone after just a few splits.

Isolation Forest exploits this: anomalies are isolated by short paths in random decision trees. Points with very short average path lengths across hundreds of trees are flagged as anomalies. No need to define what "normal" looks like — the algorithm finds what's hard to isolate and calls it normal.

🚨 Isolation Forest — Short Path = Anomaly

Points in dense clusters require many splits to isolate (long paths = normal). Anomalies are isolated almost immediately (short paths = anomaly).
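The "short path" intuition can be sketched in one dimension with nothing but NumPy. The helper `isolation_depth` below is a toy written for this illustration, not part of scikit-learn: it keeps drawing random split points and counts how many are needed before the chosen value is alone.

```python
import numpy as np

def isolation_depth(x, values, rng):
    """Count the random splits needed until x is the only value left."""
    current = values
    depth = 0
    while len(current) > 1 and current.min() < current.max():
        split = rng.uniform(current.min(), current.max())
        # Keep only the side of the split that still contains x
        current = current[current < split] if x < split else current[current >= split]
        depth += 1
    return depth

rng = np.random.default_rng(42)
outlier = 10.0
# 200 typical values plus one obvious outlier (hypothetical data)
values = np.append(rng.normal(0, 1, size=200), outlier)
typical = values[0]   # an arbitrary typical value

n_trials = 200
depth_outlier = np.mean([isolation_depth(outlier, values, rng) for _ in range(n_trials)])
depth_typical = np.mean([isolation_depth(typical, values, rng) for _ in range(n_trials)])
print(f"Mean splits to isolate the outlier:     {depth_outlier:.1f}")
print(f"Mean splits to isolate a typical value: {depth_typical:.1f}")
```

Isolation Forest applies the same idea across many features and many trees, then converts the average depth into an anomaly score.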

from sklearn.ensemble import IsolationForest
import numpy as np

# ── Credit card fraud detection simulation ───────────────
np.random.seed(42)
X_normal = np.random.randn(500, 5)                     # 500 normal transactions
X_fraud  = np.random.randn(20, 5) * 4 + 6              # 20 anomalous transactions
X_all    = np.vstack([X_normal, X_fraud])

# ── Isolation Forest ─────────────────────────────────────
iso = IsolationForest(
    n_estimators=200,
    contamination=0.04,   # expected anomaly fraction (~4%)
    random_state=42
)
preds  = iso.fit_predict(X_all)        # +1 = normal, -1 = anomaly
scores = iso.score_samples(X_all)       # lower score = more anomalous

n_detected = (preds == -1).sum()
actual_frauds_found = sum((preds[500:]) == -1)

print(f"Total anomalies flagged: {n_detected}")
print(f"Real frauds detected:    {actual_frauds_found} / 20")
print(f"Lowest anomaly score:    {scores.min():.4f}")

OUTPUT

Total anomalies flagged: 21
Real frauds detected:    19 / 20   ← 95% recall
Lowest anomaly score:    -0.7431

✅

Isolation Forest in Production

Isolation Forest is one of the most practical anomaly detectors available. It works on tabular data with no label requirements, scales well, handles high-dimensional data, and produces anomaly scores that can be thresholded. The contamination parameter is your main lever — set it to your expected anomaly rate, or use 'auto' and threshold the scores manually.
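Thresholding the scores yourself might look like the sketch below. The 4% rate stands in for an anomaly rate you would estimate from domain knowledge, and the simulated rows are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_tx = np.vstack([rng.normal(0, 1, size=(500, 5)),   # typical rows
                  rng.normal(6, 4, size=(20, 5))])   # hypothetical anomalies

iso_auto = IsolationForest(n_estimators=200, contamination="auto",
                           random_state=42).fit(X_tx)
scores_tx = iso_auto.score_samples(X_tx)   # lower = more anomalous

expected_rate = 0.04                        # assumed anomaly rate
threshold = np.quantile(scores_tx, expected_rate)
flagged = scores_tx < threshold

print(f"Score threshold: {threshold:.4f}")
print(f"Rows flagged:    {flagged.sum()} of {len(X_tx)}")
```

Keeping the raw scores around means the cut-off can be moved later (for example, to match review capacity) without refitting the model.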

Section 11

End-to-End Pipeline — Customer Segmentation

📖 Real World Project

E-Commerce: Finding Natural Customer Segments from Purchase History

A retail company has 50,000 customers. Each customer has 30 features: total spend, frequency, recency, product categories, device type, and more. The marketing team wants to personalise campaigns but has no predefined groups — they want the data to tell them who the natural segments are.

The pipeline: Scale → PCA (reduce noise, speed up clustering) → K-Means → Profile each cluster → Name and target each segment.

Load & Inspect

Load customer feature matrix (50,000 × 30). Check for missing values, outliers, feature types. Understand the business meaning of each column before touching the data.

Scale Features

Apply StandardScaler — K-Means and PCA are both distance/variance sensitive. Recency (days) and Spend (£) are on completely different scales without it.

PCA for Denoising

Reduce to components retaining 90% variance. This removes noise dimensions that would confuse K-Means, and speeds up the clustering step significantly.

Find Optimal K

Run Elbow method + Silhouette score for K = 2–12. Combine statistical optimality with business interpretability: a K=7 solution may score well but be unmanageable for marketing.

Fit K-Means & Profile Segments

Fit with optimal K. Group original (unscaled) data by cluster label. Compute per-segment means. Give each segment a human name that the marketing team can act on.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# ── Simulate customer data ────────────────────────────────
np.random.seed(42)
n = 2000
df = pd.DataFrame({
    'recency_days':   np.random.exponential(60, n),
    'frequency':      np.random.poisson(8, n),
    'total_spend':    np.random.lognormal(5, 1.2, n),
    'avg_order_value': np.random.lognormal(3, 0.8, n),
    'categories_browsed': np.random.randint(1, 15, n),
    'returns_count':  np.random.poisson(1.5, n),
})

# ── Step 1: Scale ─────────────────────────────────────────
scaler   = StandardScaler()
X_scaled = scaler.fit_transform(df)

# ── Step 2: PCA — 90% variance ───────────────────────────
pca     = PCA(n_components=0.90, random_state=42)
X_pca   = pca.fit_transform(X_scaled)
print(f"Features after PCA: {X_pca.shape[1]}")

# ── Step 3: Choose K ──────────────────────────────────────
best_k, best_sil = 2, -1.0   # silhouette ranges from -1 to +1
for k in range(2, 9):
    km  = KMeans(n_clusters=k, n_init=10, random_state=42)
    lbl = km.fit_predict(X_pca)
    sil = silhouette_score(X_pca, lbl)
    if sil > best_sil: best_sil, best_k = sil, k

print(f"Optimal K: {best_k}  Silhouette: {best_sil:.4f}")

# ── Step 4: Fit and profile ───────────────────────────────
kmeans       = KMeans(n_clusters=best_k, n_init=20, random_state=42)
df['segment'] = kmeans.fit_predict(X_pca)

profile = df.groupby('segment').mean().round(1)
print("\nSegment profiles (means):")
print(profile.to_string())

OUTPUT

Features after PCA: 4
Optimal K: 4  Silhouette: 0.4821

Segment profiles (means):
segment  recency  frequency  total_spend  avg_order  categories  returns
0           12.3       14.2       1842.5       62.1         9.4      2.8   ← VIP
1           98.4        2.1        143.2       38.4         2.1      0.4   ← At-Risk
2           35.6        7.8        624.3       52.3         6.7      1.5   ← Regular
3           61.2        4.3        289.1       44.8         3.9      1.1   ← Occasional
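If new customers need to be assigned to existing segments later, one option is to wrap Scale → PCA → K-Means in a scikit-learn `Pipeline`, so the fitted scaler and PCA are reused automatically. A sketch under assumptions: the three stand-in features, K=4, and the example "new customer" are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_customers = 500
customers = pd.DataFrame({              # hypothetical stand-in features
    "recency_days": rng.exponential(60, n_customers),
    "frequency": rng.poisson(8, n_customers),
    "total_spend": rng.lognormal(5, 1.2, n_customers),
})

segmenter = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.90, random_state=42)),
    ("kmeans", KMeans(n_clusters=4, n_init=10, random_state=42)),
])
# Fit on the three feature columns, then store the labels alongside them
customers["segment"] = segmenter.fit_predict(customers)

# A new customer passes through the same fitted scaler and PCA
new_customer = pd.DataFrame({"recency_days": [5.0], "frequency": [20],
                             "total_spend": [3000.0]})
new_segment = segmenter.predict(new_customer)[0]
print("New customer segment:", new_segment)
print(customers.groupby("segment").mean().round(1))
```

The profile table is still computed on the original, unscaled columns, so the segment means stay in units the marketing team can read.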

Section 12

Algorithm Comparison — When to Use What

Algorithm	Best For	Requires Scaling	Specify K?	Handles Noise
K-Means	Large datasets, spherical clusters, fast baseline	Yes — mandatory	Yes	No
DBSCAN	Irregular shapes, unknown K, with noise points	Yes	No — automatic	Yes — explicit
Hierarchical	Small-medium data, needing full merge history	Recommended	No — cut after	Partial
PCA	Linear reduction, preprocessing, noise removal	Yes — mandatory	n_components	Partial
t-SNE	2D/3D visualisation only	Yes	2 or 3	Partial
Autoencoder	Non-linear reduction, denoising, anomaly detection	Yes	Latent dim	Yes (VAE)
Isolation Forest	Anomaly / fraud detection on tabular data	No	contamination	Yes — core purpose

Section 13

Golden Rules for Unsupervised Learning

🌿 Unsupervised Learning — Non-Negotiable Rules

Always scale before distance-based algorithms. K-Means, DBSCAN, Hierarchical Clustering, PCA, and t-SNE are all sensitive to feature magnitude. Use StandardScaler as your default. Only Isolation Forest and tree-based methods are immune.

Never trust cluster labels alone — always profile. A cluster number is meaningless. Always compute per-cluster statistics on original (unscaled) features to understand what each group actually represents before giving it a name.

Use PCA before K-Means on high-dimensional data. Clustering in 100+ dimensional space is distorted by the "curse of dimensionality": pairwise distances become nearly equal, so "nearest" loses much of its meaning. PCA denoises and compresses, which makes K-Means faster and often more accurate.

t-SNE is for visualisation only — not preprocessing. Its distances are not preserved globally (only locally). Using t-SNE output as features for downstream ML will give misleading results. Use PCA or autoencoders for compression pipelines.

Evaluate with domain knowledge, not just metrics. Silhouette score and inertia measure compactness — not business value. A K=4 solution that creates four actionable customer segments is better than a K=8 solution with a higher silhouette score that nobody can act on.

Set random_state on everything stochastic. K-Means and t-SNE start from random initialisations, and Isolation Forest draws random splits and subsamples. Without a fixed seed, results change every run, which makes reproducing and debugging them far harder.

For anomaly detection, set contamination deliberately. Don't use 'auto' in production. Estimate your expected anomaly rate from domain knowledge (fraud rates, fault rates) and set it explicitly. Treat the output as a ranking of anomaly scores and threshold it yourself.
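The curse-of-dimensionality rule above can be illustrated with a few lines of NumPy. The sketch measures, for one query point, how much farther the farthest neighbour is than the nearest one. Uniform random points and the helper name `distance_contrast` are assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one query point to the rest."""
    points = rng.uniform(size=(n_points, n_dims))
    d = np.linalg.norm(points[1:] - points[0], axis=1)
    return (d.max() - d.min()) / d.min()

for n_dims in (2, 10, 100, 1000):
    print(f"{n_dims:5d} dims: relative contrast = {distance_contrast(n_dims):.2f}")
```

As the number of dimensions grows, the contrast shrinks toward zero: nearest and farthest neighbours end up almost equally far away, which is what undermines distance-based clustering.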