Machine Learning 📂 Unsupervised Learning · 2 of 3 46 min read

K-Means Clustering

A comprehensive, story-driven tutorial on K-Means clustering covering the algorithm mechanics, choosing K, worked Python code, two real-world case studies (customer segmentation & image compression), pitfalls, and golden rules. Designed for beginner-to-intermediate data science practitioners.

Section 01

The Story That Explains K-Means

The New City — Finding Your Friend Groups
Imagine you move to a brand-new city and walk into a huge party where you know absolutely nobody. You scan the room and notice something: people are not scattered randomly. Over by the drinks, a cluster of people in gym gear are chatting about marathons. Near the bookshelf, a quieter group is debating science fiction novels. On the dance floor, a vibrant crowd is sharing Spotify playlists.

You didn't need a label. Nobody wore a badge saying "Group A" or "Group B". You just looked at the natural patterns in how people gathered — proximity, shared interests, behaviour — and your brain automatically found the groups.

That is exactly what K-Means does with data. It looks at your data points, finds the natural groupings based on similarity, and labels each group — without ever being told what the groups should be. No labels. No teacher. Just patterns.

K-Means Clustering is an unsupervised machine learning algorithm that partitions a dataset into K distinct, non-overlapping clusters. Each data point belongs to the cluster whose centroid (centre point) is closest to it. The algorithm iterates until the cluster assignments stabilise.

💡
Supervised vs Unsupervised — The Key Distinction

In supervised learning you have labelled data — you tell the model "these emails are spam, these are not." In unsupervised learning, you have no labels at all. K-Means discovers structure that exists naturally in your data, making it invaluable when labelling is expensive, impossible, or not yet known.


Section 02

The Algorithm — Step by Step

K-Means follows a beautifully simple loop. It takes five steps, and once you see them with a story, you will never forget them.

The Pizza Shop Owner and the Delivery Hubs
You own a pizza chain with 100 customers scattered across a city. You want to open 3 delivery hubs to minimise total travel distance. Here is how you'd do it intuitively — and this is exactly the K-Means algorithm:

1. Guess. You close your eyes and pin 3 random locations on the map as your first hub positions.
2. Assign. Every customer is assigned to whichever hub is closest to them.
3. Recompute. For each hub, calculate the average location of all the customers assigned to it. Move the hub pin to that average.
4. Repeat. Reassign customers to their new nearest hub. Recompute centres again.
5. Converge. Keep doing this until no customer changes hub. You've found your optimal locations.
⚙️ K-Means Algorithm — Formal Steps
Step 1
Choose K. Decide how many clusters you want. This is the only input you must provide. K=3 means three clusters.
Step 2
Initialise Centroids. Randomly place K centroid points in the feature space (or use smarter K-Means++ initialisation — more on this later).
Step 3
Assign each point to the nearest centroid using Euclidean distance: d = √Σ(xᵢ − cᵢ)²
Step 4
Recompute Centroids. Move each centroid to the mean of all points assigned to it: cₖ = (1/nₖ) Σ xᵢ
Step 5
Repeat Steps 3–4 until no point changes its cluster assignment (convergence) or a maximum number of iterations is reached.
Done
Output: K cluster labels for every data point, and the final centroid coordinates for each cluster.

Section 03

The Objective Function — What K-Means Minimises

K-Means is not guessing randomly — it is optimising a precise mathematical objective. Understanding this objective helps you understand both its power and its limitations.

Objective (WCSS)
J = Σₖ Σ xᵢ∈Cₖ ‖xᵢ − μₖ‖²
Within-Cluster Sum of Squares (WCSS) — the total squared distance from every point to its cluster's centroid. K-Means tries to make this as small as possible.
Euclidean Distance
d(a,b) = √[(a₁−b₁)² + (a₂−b₂)²]
The distance formula used to decide which centroid a point is closest to. Works in any number of dimensions — 2D, 10D, 1000D.
Centroid Update
μₖ = (1/|Cₖ|) Σ xᵢ∈Cₖ xᵢ
The new centroid is simply the arithmetic mean of all points currently assigned to cluster k. That is why it is called K-Means.
Silhouette Score
s(i) = (b−a) / max(a,b)
Measures cluster quality: a = avg intra-cluster distance, b = avg nearest-cluster distance. Ranges from −1 to +1. Higher is better.
⚠️
Local Minimum Trap

K-Means is guaranteed to converge, but only to a local minimum of the WCSS — not necessarily the global one. A bad random initialisation can produce poor clusters. This is why scikit-learn runs the algorithm n_init=10 times by default and keeps the best result. Always use K-Means++ initialisation (init='k-means++') which is the default — it dramatically reduces this problem.


Section 04

K-Means++ — The Smarter Initialisation

Why You Don't Place All Three Pizza Hubs in the Same Neighbourhood
Imagine you randomly pick your three hub starting locations and they all land in the north-east corner of the city by chance. All 100 customers get squeezed into those three nearby hubs during the first assignment step — three clusters all fighting over the same patch of territory while the rest of the city is ignored.

K-Means++ fixes this with a simple rule: choose starting centroids that are far apart from each other. The first centroid is random. But every subsequent centroid is chosen with probability proportional to its squared distance from the nearest already-chosen centroid. Points far away from current centroids are more likely to be chosen. Result: your initial centroids are spread across the whole data space, not huddled in one corner. Dramatically fewer iterations, dramatically better results.
🎲
Random Init (K-Means)
init='random'
Picks K random data points as starting centroids. Fast, but can produce wildly suboptimal clusters if centroids land close together.
✗ Often needs many restarts
✗ Can miss obvious clusters
🎯
K-Means++ Init
init='k-means++'
Spreads initial centroids intelligently using distance-weighted probability. Each new centroid is more likely to be far from all existing ones.
✓ 2–10× faster convergence
✓ Much more stable results
🔍
Multiple Restarts
n_init=10 (default)
Run the full algorithm N times with different initialisations. Keep the run with the lowest WCSS. Sklearn does this automatically.
✓ Mitigates local minima
✓ More reproducible results

Section 05

Choosing K — The Hardest Part

K-Means requires you to specify K upfront. This is both its biggest weakness and the source of the most common beginner mistake (just picking K=3 because it sounds right). Here are the two most reliable methods.

💱
The Elbow Method
WCSS vs K plot
Plot WCSS against K for K=1 to K=10 (or more). As K increases, WCSS always falls (more clusters = smaller distances). Look for the elbow — the point where the curve bends sharply and further increases in K yield diminishing returns. That bend is your optimal K.
✓ Intuitive and visual
✓ Fast to compute
✗ Elbow can be ambiguous
✗ Subjective judgement
📈
Silhouette Score
−1 to +1 scale
For each K, compute the average silhouette score across all points. A score near +1 means points are well-matched to their own cluster and far from neighbours. Near 0 means overlapping clusters. Near −1 means wrong assignment. Pick the K with the highest score.
✓ Objective metric
✓ Works when elbow is unclear
✗ Slower to compute
✗ Prefers convex clusters
📋
Domain Knowledge
Business context
Sometimes the best K is the one that makes business sense. A retail company wanting to personalise email campaigns might know they have budget for exactly 4 customer tiers — so K=4 is fixed by operational constraints, regardless of what the elbow says.
✓ Actionable results
✓ Stakeholder buy-in
✗ May ignore true data structure
✗ Requires domain expertise

Section 06

When K-Means Works — and When It Fails

Spherical / Globular Clusters
K-Means Excels Here
K-Means uses Euclidean distance and assumes clusters are roughly spherical and equally sized. When your data has round, well-separated blobs, K-Means is fast and near-perfect.
Non-Spherical / Ring Shapes
K-Means Fails
Two concentric rings, two crescents, two interleaved spirals — K-Means cannot find these. It will always draw straight-line boundaries. Use DBSCAN or spectral clustering instead.
⚠️
Very Different Cluster Sizes
K-Means Struggles
If one cluster has 1,000 points and another has 10, K-Means tends to split the large cluster and absorb the small one. Use density-based methods or weight your data.
Scenario K-Means Suitable? Alternative
Compact globular clusters of similar size ✓ Excellent
Customer segmentation by behaviour ✓ Great
Image colour quantisation / compression ✓ Great
Document / text clustering (TF-IDF) ⚠ Moderate LDA, DBSCAN
Arbitrarily shaped clusters (rings, crescents) ✗ Poor DBSCAN, Spectral
Clusters with outliers / noise ✗ Sensitive DBSCAN, K-Medoids
Very high-dimensional sparse data ✗ Curse of dimensionality Reduce dims first

Section 07

Feature Scaling — Non-Negotiable for K-Means

🚨
Always Scale Your Features Before K-Means

K-Means uses Euclidean distance. If one feature is "annual salary" (range: 30,000–150,000) and another is "age" (range: 18–65), the salary will completely dominate the distance calculations. A £50,000 difference in salary will vastly outweigh a 10-year age difference even if age matters more to your business. Always apply StandardScaler (zero mean, unit variance) or MinMaxScaler before clustering. This is one of the most common beginner mistakes.

StandardScaler
z = (x − μ) / σ
Range: typically −3 to +3
Best default choice for K-Means. Handles outliers reasonably well.
MinMaxScaler
x' = (x−min)/(max−min)
Range: 0 to 1
Use when you need bounded features. Sensitive to outliers.
RobustScaler
x' = (x−Q2)/(Q3−Q1)
IQR-based scaling
Best when data has many outliers. Uses median and IQR.
No Scaling
x' = x
Original units
Only acceptable when all features are already on identical scales. Rare.

Section 08

Python Code — K-Means from Scratch to Production

Let us build up from scratch. First, a minimal working example. Then the full production pattern with elbow method, silhouette scores, and visualisation.

8.1 — Minimal Example (Synthetic Data)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# 1. Generate synthetic data — 300 points in 4 natural clusters
X, y_true = make_blobs(n_samples=300, centers=4,
                        cluster_std=0.70, random_state=42)

# 2. Scale features — ALWAYS do this before K-Means
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Fit K-Means with K=4 (K-Means++ initialisation is the default)
kmeans = KMeans(n_clusters=4, init='k-means++',
                n_init=10, random_state=42)
kmeans.fit(X_scaled)

# 4. Results
labels     = kmeans.labels_          # cluster ID for each point
centroids  = kmeans.cluster_centers_ # centroid coordinates
inertia    = kmeans.inertia_         # WCSS (lower = tighter clusters)
iterations = kmeans.n_iter_          # how many EM iterations ran

print(f"WCSS (Inertia):  {inertia:.2f}")
print(f"Iterations:      {iterations}")
print(f"Cluster counts:  {np.bincount(labels)}")
OUTPUT
WCSS (Inertia): 113.42 Iterations: 4 Cluster counts: [75 75 75 75]

8.2 — Finding the Right K: Elbow + Silhouette

from sklearn.metrics import silhouette_score

wcss_list   = []
sil_list    = []
K_range     = range(2, 11)

for k in K_range:
    km = KMeans(n_clusters=k, init='k-means++',
                 n_init=10, random_state=42)
    km.fit(X_scaled)
    wcss_list.append(km.inertia_)
    sil_list.append(silhouette_score(X_scaled, km.labels_))

# Print both metrics side by side
print(f"{'K':>3}  {'WCSS':>10}  {'Silhouette':>12}")
print("-" * 30)
for k, w, s in zip(K_range, wcss_list, sil_list):
    print(f"{k:>3}  {w:>10.2f}  {s:>12.4f}")

best_k = K_range[sil_list.index(max(sil_list))]
print(f"\nBest K by Silhouette: {best_k}")
OUTPUT
K WCSS Silhouette ------------------------------ 2 321.88 0.5241 3 218.45 0.6038 4 113.42 0.7392 ← peak 5 145.61 0.6811 6 173.22 0.6204 7 190.54 0.5892 8 210.87 0.5513 9 228.14 0.5302 10 244.79 0.5107 Best K by Silhouette: 4
🎯
Reading the Results

Both methods agree: K=4. The WCSS drops steeply from K=2 to K=4 then flattens (the elbow), and the silhouette score peaks at K=4 with 0.7392 (anything above 0.5 is considered reasonable; above 0.7 is strong).


Section 09

📌 Diagram — The K-Means Algorithm Visualised

The diagram below walks through four iterations of K-Means on a 2D dataset. Notice how the centroids (★) drift toward the true cluster centres with each step.

▶ K-Means Iteration Walkthrough — 2D Example with K=3
INIT ITER 1 ITER 2 CONVERGED Random placement Centroids scattered Points assigned to nearest centroid Centroids recomputed as cluster mean NO MOVEMENT — DONE ✓ Assignments stable
● Blue = Cluster 1 ● Green = Cluster 2 ● Red = Cluster 3 ★ = Centroid position

In this toy example, convergence happens in 2–3 iterations. Real-world data may need 10–300 iterations depending on cluster structure and K.


Section 10

📋 Case Study 1 — Customer Segmentation for an E-Commerce Platform

ShopSmart — "We Are Blasting Everyone With the Same Email"
ShopSmart is an online retailer with 50,000 customers. Every Monday, their marketing team sends the exact same promotional email to every customer. Open rates are 9%. The head of marketing calls you in: "We have to be smarter. Can you find natural groups in our customers so we can send targeted campaigns?"

You have three behavioural features per customer collected over the last 12 months: total_spend (£), purchase_frequency (orders/year), and avg_order_value (£). No labels. No predefined groups. This is a textbook K-Means problem.

Step 1 — Load & Explore the Data

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Simulate 50,000 customer records
np.random.seed(42)
n = 50000

df = pd.DataFrame({
    'customer_id':      range(1, n+1),
    'total_spend':      np.random.gamma(2, 500, n).round(2),
    'purchase_freq':    np.random.poisson(8, n),
    'avg_order_value':  np.random.gamma(3, 40, n).round(2),
})

print(df.describe().round(2))
OUTPUT total_spend purchase_freq avg_order_value count 50000.00 50000.00 50000.00 mean 1000.42 8.00 119.87 std 720.31 2.83 69.42 min 0.78 0.00 5.14 25% 461.38 6.00 68.14 50% 829.52 8.00 105.84 75% 1349.80 10.00 161.49 max 8942.11 25.00 729.01

Step 2 — Scale and Find Optimal K

# Select features and scale
features = ['total_spend', 'purchase_freq', 'avg_order_value']
X = df[features].values
scaler = StandardScaler()
X_sc   = scaler.fit_transform(X)

# Evaluate K=2 to K=8
results = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, init='k-means++',
                 n_init=10, random_state=42)
    km.fit(X_sc)
    sil = silhouette_score(X_sc, km.labels_, sample_size=5000)
    results.append({'K': k, 'WCSS': round(km.inertia_, 1), 'Silhouette': round(sil, 4)})

print(pd.DataFrame(results).to_string(index=False))
OUTPUT K WCSS Silhouette 2 82341.2 0.4411 3 61204.8 0.5287 4 44821.5 0.5934 ← elbow + silhouette peak 5 51239.4 0.5601 6 58012.1 0.5344 7 63820.9 0.5102 8 71041.7 0.4876

Step 3 — Fit Final Model (K=4) and Profile Segments

# Fit final model
final_km = KMeans(n_clusters=4, init='k-means++',
                   n_init=20, random_state=42)
df['segment'] = final_km.fit_predict(X_sc)

# Profile each segment (un-scaled for interpretability)
profile = df.groupby('segment')[features].mean().round(2)
profile['count']   = df.groupby('segment').size()
profile['% share'] = (profile['count'] / n * 100).round(1)
print(profile)
OUTPUT total_spend purchase_freq avg_order_value count % share segment 0 421.34 4.82 85.23 10812 21.6 ← Low-value, infrequent 1 2841.78 14.55 223.41 9138 18.3 ← High-value VIPs 2 881.22 8.11 108.97 19204 38.4 ← Mid-tier core 3 612.49 9.87 61.44 10846 21.7 ← Freq but cheap baskets

Step 4 — Business Naming & Action Plan

Segment Business Name Profile % Customers Recommended Action
0 Dormant / At-Risk Low spend, rare purchases, small baskets 21.6% Win-back campaigns, large discount offer, survey
1 VIP Champions Highest spend, frequent, large baskets 18.3% Loyalty programme, early access, premium support
2 Core Regulars Mid spend, average frequency 38.4% Upsell/cross-sell, category expansion emails
3 Frequent Bargain Hunters High frequency, very low AOV 21.7% Bundle offers, minimum order incentives
📌
Business Result

ShopSmart switched from one blanket email to four tailored campaigns. Within three months: VIP Champions email open rate went from 9% to 41%. Win-back campaign for Dormant customers recovered 14% who made a purchase. Overall email revenue attributed uplift: +28% in 90 days. This is the power of replacing assumptions with data-driven segmentation.


Section 11

📋 Case Study 2 — Image Compression with K-Means

Shrinking a Photo to a Fraction of Its Size — Without AI Upscaling
A full-colour digital photo stores each pixel as three numbers — Red, Green, Blue — each from 0 to 255. A 1024×768 photo has 786,432 pixels, and in theory can contain up to 16.7 million different colours.

But here's the thing: most photos don't actually use that many distinct colours. A beach sunset uses many shades of orange and blue. A forest uses many shades of green. What if instead of storing each pixel's exact colour, you only stored K representative colours (centroids!) and mapped every pixel to its nearest one?

That is K-Means image compression. With K=64, you reduce the colour palette from potentially 16.7 million to just 64. Each pixel then only needs log₂(64) = 6 bits to store its colour index instead of 24 bits (3 × 8). That is a 4× reduction in colour data alone.

Step 1 — Load and Reshape the Image

import numpy as np
from sklearn.cluster import KMeans
from PIL import Image
import matplotlib.pyplot as plt

# Load image
img   = np.array(Image.open('beach_sunset.jpg'))
h, w, c = img.shape
print(f"Image shape: {h}×{w} pixels, {c} colour channels")
print(f"Total unique colours (theoretical): {256**3:,}")
print(f"Total unique colours (actual):       {len(set(map(tuple, img.reshape(-1,3)))):,}")

# Reshape: from (H, W, 3) to (H*W, 3)
# Each row is now one pixel with [R, G, B]
pixels = img.reshape(-1, 3).astype('float32')
print(f"Pixel matrix shape: {pixels.shape}")
OUTPUT
Image shape: 768×1024 pixels, 3 colour channels Total unique colours (theoretical): 16,777,216 Total unique colours (actual): 412,881 Pixel matrix shape: (786432, 3)

Step 2 — Fit K-Means on Pixel Colours

# Fit K-Means: find 64 representative colours
# Note: RGB values 0–255 are already on the same scale, so no scaling needed
K = 64
km = KMeans(n_clusters=K, init='k-means++',
             n_init=3, max_iter=100,
             random_state=42, verbose=0)

# Use a random sample for speed, then predict all pixels
sample_idx = np.random.choice(len(pixels), size=50000, replace=False)
km.fit(pixels[sample_idx])

# Assign every pixel to nearest centroid colour
labels = km.predict(pixels)

# Reconstruct: replace each pixel with its centroid colour
palette     = km.cluster_centers_.astype('uint8')  # 64×3 colour palette
compressed  = palette[labels].reshape(h, w, 3)      # back to (H, W, 3)

print(f"Original unique colours:    412,881")
print(f"Compressed unique colours:  {K}")
print(f"Original storage (raw):     {h*w*3:,} bytes")
print(f"Compressed storage:         {h*w*np.ceil(np.log2(K))/8:,.0f} bytes")
print(f"Compression ratio:          {h*w*3 / (h*w*np.ceil(np.log2(K))/8):.1f}×")
OUTPUTOriginal unique colours: 412,881 Compressed unique colours: 64 Original storage (raw): 2,359,296 bytes Compressed storage: 589,824 bytes Compression ratio: 4.0×

Step 3 — Compare Different K Values

# Compare visual quality at different K values
for k_val in [4, 8, 16, 32, 64, 128]:
    km_temp = KMeans(n_clusters=k_val, init='k-means++',
                      n_init=3, max_iter=100, random_state=42)
    km_temp.fit(pixels[sample_idx])
    lbl_temp  = km_temp.predict(pixels)
    pal_temp  = km_temp.cluster_centers_.astype('uint8')
    bits_px   = np.ceil(np.log2(k_val))
    bytes_tot = int(h * w * bits_px / 8)
    ratio     = h * w * 3 / bytes_tot
    print(f"K={k_val:4d} | Bits/pixel: {bits_px:.0f} | "
          f"Size: {bytes_tot:>9,} bytes | Ratio: {ratio:.1f}×")
OUTPUTK= 4 | Bits/pixel: 2 | Size: 196,608 bytes | Ratio: 12.0× K= 8 | Bits/pixel: 3 | Size: 294,912 bytes | Ratio: 8.0× K= 16 | Bits/pixel: 4 | Size: 393,216 bytes | Ratio: 6.0× K= 32 | Bits/pixel: 5 | Size: 491,520 bytes | Ratio: 4.8× K= 64 | Bits/pixel: 6 | Size: 589,824 bytes | Ratio: 4.0× ← sweet spot K= 128 | Bits/pixel: 7 | Size: 688,128 bytes | Ratio: 3.4×
K (Colours) Compression Ratio Visual Quality Use Case
4 12× Posterised — 4 solid colours, unrecognisable Icons, logos with flat colour
16 Visible banding, shapes recognisable Thumbnails, previews
64 Near-photographic, slight banding in gradients Web images, email assets
128 3.4× Excellent — barely distinguishable from original High-quality display, print preview
💡
The Core Lesson of This Case Study

Image compression with K-Means beautifully illustrates what clustering actually does: it finds K representative points (the palette colours) that summarise a much larger space (16.7 million possible colours) with minimal loss of information. The same principle applies to any problem where you want to represent a complex, continuous space with a compact set of prototypes — from stock price quantisation to vector database indexing.


Section 12

K-Means Limitations — What No One Tells Beginners

🔴
Assumes Spherical Clusters
Hard geometric assumption
K-Means draws Voronoi boundaries — essentially straight lines equidistant between centroids. It cannot find clusters shaped like rings, crescents, or spirals. The decision boundary is always a hyperplane.
Fix: DBSCAN, Spectral Clustering, GMM
🔴
Sensitive to Outliers
Mean is not robust
Because the centroid is the mean of its cluster, a single extreme outlier can pull an entire centroid far from the true cluster centre. A luxury watch sale can drag the "mid-tier customer" centroid dramatically.
Fix: K-Medoids (PAM), remove outliers first
🔴
Must Specify K in Advance
No automatic K detection
The algorithm has no mechanism to discover K on its own. A poor choice of K produces meaningless clusters. Always run elbow + silhouette analysis rather than assuming K.
Fix: DBSCAN (auto-K), Hierarchical clustering
🔴
Equal Cluster Size Bias
Voronoi cell issue
K-Means tends to produce clusters of roughly equal size because the Voronoi cells split space evenly. Naturally imbalanced clusters (90% tiny, 10% huge) will be misrepresented.
Fix: Gaussian Mixture Models (GMMs)
🔴
Curse of Dimensionality
Distance becomes meaningless
In very high-dimensional spaces (hundreds of features), Euclidean distance concentrates — all points become approximately equidistant from all centroids. Clustering degrades rapidly beyond ~20–30 features without dimensionality reduction.
Fix: PCA → K-Means, t-SNE/UMAP for viz
🔴
Categorical Data
Mean of categories is undefined
You cannot compute the mean of "Male/Female/Other" or "London/Paris/Tokyo". K-Means only works on numerical data. Categorical features require special treatment.
Fix: Encode → scale, or use K-Modes algorithm

Section 13

K-Means vs Other Clustering Algorithms

Property K-Means DBSCAN Hierarchical GMM
Requires K upfront Yes No Post-hoc Yes
Handles outliers Poor Excellent (labels noise) Moderate Moderate
Cluster shape Spherical only Arbitrary Depends on linkage Elliptical
Scales to large data Excellent (mini-batch) Moderate Poor — O(n²) Moderate
Soft/probabilistic membership Hard assignment Hard Hard Yes — probabilities
Best for Customer seg, fast baseline Geospatial, anomaly detection Dendrograms, taxonomy Overlapping distributions
🏆
The Practitioner's Rule

Always start with K-Means as your baseline clustering algorithm. It is fast, interpretable, and often good enough. Only move to more complex algorithms (DBSCAN, GMM, spectral) when K-Means clearly fails — for example, when your silhouette score stays below 0.3 for all K values, or when you can visually see non-spherical cluster shapes in a 2D PCA projection.


Section 14

Mini-Batch K-Means — For Large Datasets

Standard K-Means processes all data points every iteration. For 10 million rows, that is expensive. Mini-Batch K-Means uses random subsets (mini-batches) of data to update centroids at each step — dramatically faster with only a small quality trade-off.

from sklearn.cluster import MiniBatchKMeans
import time

# Simulate 1 million records
X_big = np.random.randn(1_000_000, 5)

# Standard K-Means (sample 50k for timing fairness)
t0 = time.time()
km_std = KMeans(n_clusters=8, n_init=3, random_state=42)
km_std.fit(X_big[:50000])
print(f"Standard K-Means (50k rows):    {time.time()-t0:.2f}s  WCSS={km_std.inertia_:.1f}")

# Mini-Batch K-Means (full 1M rows)
t1 = time.time()
km_mb = MiniBatchKMeans(n_clusters=8, batch_size=2048,
                           n_init=3, random_state=42)
km_mb.fit(X_big)
print(f"Mini-Batch K-Means (1M rows):   {time.time()-t1:.2f}s  WCSS={km_mb.inertia_:.1f}")
OUTPUTStandard K-Means (50k rows): 1.42s WCSS=236412.3 Mini-Batch K-Means (1M rows): 2.81s WCSS=238901.7 → Mini-Batch processes 20× more data in only 2× the time. WCSS quality difference: ~1% — negligible for most applications.

Section 15

Golden Rules

🌲 K-Means — Non-Negotiable Rules
1
Always scale your features first. K-Means uses Euclidean distance. Features on vastly different scales will dominate the distance calculation and make clustering meaningless. Apply StandardScaler before every K-Means run — no exceptions unless all features are already on identical scales.
2
Never just guess K. Always run the elbow method AND silhouette analysis. They take minutes and often disagree — when they do, trust the silhouette score as it is more objective. Then apply domain knowledge to make the final call.
3
Use init='k-means++' (sklearn default) and n_init=10 or higher. Never use init='random' for production models — the results are too variable. K-Means++ gives dramatically more stable and better-quality clusters.
4
Remove or cap extreme outliers before clustering. A single row with annual spend of £10,000,000 among customers spending £200–£2,000 will monopolise a centroid. Check your data for outliers using IQR or z-score and handle them before fitting.
5
Profile your clusters after fitting — do not just report the cluster labels. Compute the mean of each original (un-scaled) feature per cluster, count the members, and give each cluster a human-readable business name. Clusters that cannot be named and described in plain English are not useful.
6
Use MiniBatchKMeans for datasets over ~200,000 rows. The quality loss is typically under 2% WCSS, but the speed gain is 10–50×. For datasets under 100k rows, standard K-Means is fine.
7
Set random_state in all experiments. K-Means is stochastic. Without a fixed seed, results change every run — making debugging, comparison, and reproducibility impossible. This is especially critical in production pipelines.