The Story That Explains K-Means
You didn't need a label. Nobody wore a badge saying "Group A" or "Group B". You just looked at the natural patterns in how people gathered — proximity, shared interests, behaviour — and your brain automatically found the groups.
That is exactly what K-Means does with data. It looks at your data points, finds the natural groupings based on similarity, and labels each group — without ever being told what the groups should be. No labels. No teacher. Just patterns.
K-Means Clustering is an unsupervised machine learning algorithm that partitions a dataset into K distinct, non-overlapping clusters. Each data point belongs to the cluster whose centroid (centre point) is closest to it. The algorithm iterates until the cluster assignments stabilise.
In supervised learning you have labelled data — you tell the model "these emails are spam, these are not." In unsupervised learning, you have no labels at all. K-Means discovers structure that exists naturally in your data, making it invaluable when labelling is expensive, impossible, or not yet known.
The Algorithm — Step by Step
K-Means follows a beautifully simple loop. It takes five steps, and once you see them with a story, you will never forget them.
1. Guess. You close your eyes and pin 3 random locations on the map as your first hub positions.
2. Assign. Every customer is assigned to whichever hub is closest to them.
3. Recompute. For each hub, calculate the average location of all the customers assigned to it. Move the hub pin to that average.
4. Repeat. Reassign customers to their new nearest hub. Recompute centres again.
5. Converge. Keep doing this until no customer changes hub. You've found your optimal locations.
The Objective Function — What K-Means Minimises
K-Means is not guessing randomly — it is optimising a precise mathematical objective. Understanding this objective helps you understand both its power and its limitations.
K-Means is guaranteed to converge, but only to a local minimum
of the WCSS — not necessarily the global one. A bad random initialisation can produce
poor clusters. This is why scikit-learn runs the algorithm n_init=10 times
by default and keeps the best result. Always use K-Means++ initialisation
(init='k-means++') which is the default — it dramatically reduces this problem.
K-Means++ — The Smarter Initialisation
K-Means++ fixes this with a simple rule: choose starting centroids that are far apart from each other. The first centroid is random. But every subsequent centroid is chosen with probability proportional to its squared distance from the nearest already-chosen centroid. Points far away from current centroids are more likely to be chosen. Result: your initial centroids are spread across the whole data space, not huddled in one corner. Dramatically fewer iterations, dramatically better results.
✗ Can miss obvious clusters
✓ Much more stable results
✓ More reproducible results
Choosing K — The Hardest Part
K-Means requires you to specify K upfront. This is both its biggest weakness and the source of the most common beginner mistake (just picking K=3 because it sounds right). Here are the two most reliable methods.
✓ Fast to compute
✗ Subjective judgement
✓ Works when elbow is unclear
✗ Prefers convex clusters
✓ Stakeholder buy-in
✗ Requires domain expertise
When K-Means Works — and When It Fails
| Scenario | K-Means Suitable? | Alternative |
|---|---|---|
| Compact globular clusters of similar size | ✓ Excellent | — |
| Customer segmentation by behaviour | ✓ Great | — |
| Image colour quantisation / compression | ✓ Great | — |
| Document / text clustering (TF-IDF) | ⚠ Moderate | LDA, DBSCAN |
| Arbitrarily shaped clusters (rings, crescents) | ✗ Poor | DBSCAN, Spectral |
| Clusters with outliers / noise | ✗ Sensitive | DBSCAN, K-Medoids |
| Very high-dimensional sparse data | ✗ Curse of dimensionality | Reduce dims first |
Feature Scaling — Non-Negotiable for K-Means
K-Means uses Euclidean distance. If one feature is "annual salary"
(range: 30,000–150,000) and another is "age" (range: 18–65), the salary will completely
dominate the distance calculations. A £50,000 difference in salary will vastly outweigh
a 10-year age difference even if age matters more to your business. Always apply
StandardScaler (zero mean, unit variance) or MinMaxScaler
before clustering. This is one of the most common beginner mistakes.
Python Code — K-Means from Scratch to Production
Let us build up from scratch. First, a minimal working example. Then the full production pattern with elbow method, silhouette scores, and visualisation.
8.1 — Minimal Example (Synthetic Data)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
# 1. Generate synthetic data — 300 points in 4 natural clusters
X, y_true = make_blobs(n_samples=300, centers=4,
cluster_std=0.70, random_state=42)
# 2. Scale features — ALWAYS do this before K-Means
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 3. Fit K-Means with K=4 (K-Means++ initialisation is the default)
kmeans = KMeans(n_clusters=4, init='k-means++',
n_init=10, random_state=42)
kmeans.fit(X_scaled)
# 4. Results
labels = kmeans.labels_ # cluster ID for each point
centroids = kmeans.cluster_centers_ # centroid coordinates
inertia = kmeans.inertia_ # WCSS (lower = tighter clusters)
iterations = kmeans.n_iter_ # how many EM iterations ran
print(f"WCSS (Inertia): {inertia:.2f}")
print(f"Iterations: {iterations}")
print(f"Cluster counts: {np.bincount(labels)}")
8.2 — Finding the Right K: Elbow + Silhouette
from sklearn.metrics import silhouette_score
wcss_list = []
sil_list = []
K_range = range(2, 11)
for k in K_range:
km = KMeans(n_clusters=k, init='k-means++',
n_init=10, random_state=42)
km.fit(X_scaled)
wcss_list.append(km.inertia_)
sil_list.append(silhouette_score(X_scaled, km.labels_))
# Print both metrics side by side
print(f"{'K':>3} {'WCSS':>10} {'Silhouette':>12}")
print("-" * 30)
for k, w, s in zip(K_range, wcss_list, sil_list):
print(f"{k:>3} {w:>10.2f} {s:>12.4f}")
best_k = K_range[sil_list.index(max(sil_list))]
print(f"\nBest K by Silhouette: {best_k}")
Both methods agree: K=4. The WCSS drops steeply from K=2 to K=4 then flattens (the elbow), and the silhouette score peaks at K=4 with 0.7392 (anything above 0.5 is considered reasonable; above 0.7 is strong).
📌 Diagram — The K-Means Algorithm Visualised
The diagram below walks through four iterations of K-Means on a 2D dataset. Notice how the centroids (★) drift toward the true cluster centres with each step.
In this toy example, convergence happens in 2–3 iterations. Real-world data may need 10–300 iterations depending on cluster structure and K.
📋 Case Study 1 — Customer Segmentation for an E-Commerce Platform
You have three behavioural features per customer collected over the last 12 months: total_spend (£), purchase_frequency (orders/year), and avg_order_value (£). No labels. No predefined groups. This is a textbook K-Means problem.
Step 1 — Load & Explore the Data
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
# Simulate 50,000 customer records
np.random.seed(42)
n = 50000
df = pd.DataFrame({
'customer_id': range(1, n+1),
'total_spend': np.random.gamma(2, 500, n).round(2),
'purchase_freq': np.random.poisson(8, n),
'avg_order_value': np.random.gamma(3, 40, n).round(2),
})
print(df.describe().round(2))
Step 2 — Scale and Find Optimal K
# Select features and scale
features = ['total_spend', 'purchase_freq', 'avg_order_value']
X = df[features].values
scaler = StandardScaler()
X_sc = scaler.fit_transform(X)
# Evaluate K=2 to K=8
results = []
for k in range(2, 9):
km = KMeans(n_clusters=k, init='k-means++',
n_init=10, random_state=42)
km.fit(X_sc)
sil = silhouette_score(X_sc, km.labels_, sample_size=5000)
results.append({'K': k, 'WCSS': round(km.inertia_, 1), 'Silhouette': round(sil, 4)})
print(pd.DataFrame(results).to_string(index=False))
Step 3 — Fit Final Model (K=4) and Profile Segments
# Fit final model
final_km = KMeans(n_clusters=4, init='k-means++',
n_init=20, random_state=42)
df['segment'] = final_km.fit_predict(X_sc)
# Profile each segment (un-scaled for interpretability)
profile = df.groupby('segment')[features].mean().round(2)
profile['count'] = df.groupby('segment').size()
profile['% share'] = (profile['count'] / n * 100).round(1)
print(profile)
Step 4 — Business Naming & Action Plan
| Segment | Business Name | Profile | % Customers | Recommended Action |
|---|---|---|---|---|
| 0 | Dormant / At-Risk | Low spend, rare purchases, small baskets | 21.6% | Win-back campaigns, large discount offer, survey |
| 1 | VIP Champions | Highest spend, frequent, large baskets | 18.3% | Loyalty programme, early access, premium support |
| 2 | Core Regulars | Mid spend, average frequency | 38.4% | Upsell/cross-sell, category expansion emails |
| 3 | Frequent Bargain Hunters | High frequency, very low AOV | 21.7% | Bundle offers, minimum order incentives |
ShopSmart switched from one blanket email to four tailored campaigns. Within three months: VIP Champions email open rate went from 9% to 41%. Win-back campaign for Dormant customers recovered 14% who made a purchase. Overall email revenue attributed uplift: +28% in 90 days. This is the power of replacing assumptions with data-driven segmentation.
📋 Case Study 2 — Image Compression with K-Means
But here's the thing: most photos don't actually use that many distinct colours. A beach sunset uses many shades of orange and blue. A forest uses many shades of green. What if instead of storing each pixel's exact colour, you only stored K representative colours (centroids!) and mapped every pixel to its nearest one?
That is K-Means image compression. With K=64, you reduce the colour palette from potentially 16.7 million to just 64. Each pixel then only needs
log₂(64) = 6 bits to store its colour index instead of 24 bits
(3 × 8). That is a 4× reduction in colour data alone.
Step 1 — Load and Reshape the Image
import numpy as np
from sklearn.cluster import KMeans
from PIL import Image
import matplotlib.pyplot as plt
# Load image
img = np.array(Image.open('beach_sunset.jpg'))
h, w, c = img.shape
print(f"Image shape: {h}×{w} pixels, {c} colour channels")
print(f"Total unique colours (theoretical): {256**3:,}")
print(f"Total unique colours (actual): {len(set(map(tuple, img.reshape(-1,3)))):,}")
# Reshape: from (H, W, 3) to (H*W, 3)
# Each row is now one pixel with [R, G, B]
pixels = img.reshape(-1, 3).astype('float32')
print(f"Pixel matrix shape: {pixels.shape}")
Step 2 — Fit K-Means on Pixel Colours
# Fit K-Means: find 64 representative colours
# Note: RGB values 0–255 are already on the same scale, so no scaling needed
K = 64
km = KMeans(n_clusters=K, init='k-means++',
n_init=3, max_iter=100,
random_state=42, verbose=0)
# Use a random sample for speed, then predict all pixels
sample_idx = np.random.choice(len(pixels), size=50000, replace=False)
km.fit(pixels[sample_idx])
# Assign every pixel to nearest centroid colour
labels = km.predict(pixels)
# Reconstruct: replace each pixel with its centroid colour
palette = km.cluster_centers_.astype('uint8') # 64×3 colour palette
compressed = palette[labels].reshape(h, w, 3) # back to (H, W, 3)
print(f"Original unique colours: 412,881")
print(f"Compressed unique colours: {K}")
print(f"Original storage (raw): {h*w*3:,} bytes")
print(f"Compressed storage: {h*w*np.ceil(np.log2(K))/8:,.0f} bytes")
print(f"Compression ratio: {h*w*3 / (h*w*np.ceil(np.log2(K))/8):.1f}×")
Step 3 — Compare Different K Values
# Compare visual quality at different K values
for k_val in [4, 8, 16, 32, 64, 128]:
km_temp = KMeans(n_clusters=k_val, init='k-means++',
n_init=3, max_iter=100, random_state=42)
km_temp.fit(pixels[sample_idx])
lbl_temp = km_temp.predict(pixels)
pal_temp = km_temp.cluster_centers_.astype('uint8')
bits_px = np.ceil(np.log2(k_val))
bytes_tot = int(h * w * bits_px / 8)
ratio = h * w * 3 / bytes_tot
print(f"K={k_val:4d} | Bits/pixel: {bits_px:.0f} | "
f"Size: {bytes_tot:>9,} bytes | Ratio: {ratio:.1f}×")
| K (Colours) | Compression Ratio | Visual Quality | Use Case |
|---|---|---|---|
| 4 | 12× | Posterised — 4 solid colours, unrecognisable | Icons, logos with flat colour |
| 16 | 6× | Visible banding, shapes recognisable | Thumbnails, previews |
| 64 | 4× | Near-photographic, slight banding in gradients | Web images, email assets |
| 128 | 3.4× | Excellent — barely distinguishable from original | High-quality display, print preview |
Image compression with K-Means beautifully illustrates what clustering actually does: it finds K representative points (the palette colours) that summarise a much larger space (16.7 million possible colours) with minimal loss of information. The same principle applies to any problem where you want to represent a complex, continuous space with a compact set of prototypes — from stock price quantisation to vector database indexing.
K-Means Limitations — What No One Tells Beginners
K-Means vs Other Clustering Algorithms
| Property | K-Means | DBSCAN | Hierarchical | GMM |
|---|---|---|---|---|
| Requires K upfront | Yes | No | Post-hoc | Yes |
| Handles outliers | Poor | Excellent (labels noise) | Moderate | Moderate |
| Cluster shape | Spherical only | Arbitrary | Depends on linkage | Elliptical |
| Scales to large data | Excellent (mini-batch) | Moderate | Poor — O(n²) | Moderate |
| Soft/probabilistic membership | Hard assignment | Hard | Hard | Yes — probabilities |
| Best for | Customer seg, fast baseline | Geospatial, anomaly detection | Dendrograms, taxonomy | Overlapping distributions |
Always start with K-Means as your baseline clustering algorithm. It is fast, interpretable, and often good enough. Only move to more complex algorithms (DBSCAN, GMM, spectral) when K-Means clearly fails — for example, when your silhouette score stays below 0.3 for all K values, or when you can visually see non-spherical cluster shapes in a 2D PCA projection.
Mini-Batch K-Means — For Large Datasets
Standard K-Means processes all data points every iteration. For 10 million rows, that is expensive. Mini-Batch K-Means uses random subsets (mini-batches) of data to update centroids at each step — dramatically faster with only a small quality trade-off.
from sklearn.cluster import MiniBatchKMeans
import time
# Simulate 1 million records
X_big = np.random.randn(1_000_000, 5)
# Standard K-Means (sample 50k for timing fairness)
t0 = time.time()
km_std = KMeans(n_clusters=8, n_init=3, random_state=42)
km_std.fit(X_big[:50000])
print(f"Standard K-Means (50k rows): {time.time()-t0:.2f}s WCSS={km_std.inertia_:.1f}")
# Mini-Batch K-Means (full 1M rows)
t1 = time.time()
km_mb = MiniBatchKMeans(n_clusters=8, batch_size=2048,
n_init=3, random_state=42)
km_mb.fit(X_big)
print(f"Mini-Batch K-Means (1M rows): {time.time()-t1:.2f}s WCSS={km_mb.inertia_:.1f}")
Golden Rules
StandardScaler before every K-Means run —
no exceptions unless all features are already on identical scales.
init='k-means++' (sklearn default) and
n_init=10 or higher. Never use init='random'
for production models — the results are too variable. K-Means++ gives dramatically
more stable and better-quality clusters.
MiniBatchKMeans for datasets over ~200,000 rows.
The quality loss is typically under 2% WCSS, but the speed gain is 10–50×. For
datasets under 100k rows, standard K-Means is fine.
random_state in all experiments. K-Means is
stochastic. Without a fixed seed, results change every run — making debugging,
comparison, and reproducibility impossible. This is especially critical in
production pipelines.