Machine Learning 📂 Dimensionality reduction · 1 of 3 51 min read

Principal Component Analysis (PCA)

PCA is a linear technique that rotates your data into a new coordinate system where the first axis captures the most variance, the second axis captures the next most, and so on — letting you keep 95 % of the information with a fraction of the original features.

Section 01

The Story That Explains PCA

The Shadow on the Wall
Imagine you are holding a crumpled piece of wire in 3-D space and shining a torch at it. The shadow it casts on the wall is a 2-D projection of a 3-D object. Some angles produce a rich, informative shadow — you can almost reconstruct the wire. Other angles collapse everything into a confusing blob.

PCA is the algorithm that automatically finds the best torch angle — the projection that preserves the most information (variance) from your original high-dimensional data in fewer dimensions.

In practice: you have a dataset with 50 columns. PCA finds the 2 or 3 "best angles" (principal components) so you can plot, visualise, or model it without losing much detail.

Principal Component Analysis (PCA) is an unsupervised linear transformation technique. It re-expresses your data in a new coordinate system where the axes (principal components) are ordered by how much variance they explain. The first principal component (PC1) is the direction of maximum spread in your data; PC2 is the direction of maximum spread perpendicular to PC1, and so on.

💡
One-Line Definition

PCA rotates your data into a new frame where axes point in the directions of maximum variance — ordered so you can drop the last few axes and lose almost nothing.


Section 02

Why Dimensionality Reduction Matters

Modern datasets routinely have hundreds or thousands of features. More features sounds better — but past a point it backfires in three serious ways:

📊
The Curse of Dimensionality
Volume explodes exponentially
In high dimensions, data points become sparse. A k-NN model that needs 10 neighbours in 2-D needs thousands in 100-D to achieve the same density. Models overfit, distances become meaningless.
Computational Cost
Training time scales badly
Many algorithms scale as O(d²) or O(d³) where d is the feature count. Reducing 1000 features to 50 can make training 20–400× faster with negligible accuracy loss.
👁
Visualisation
Humans live in 2-D and 3-D
You cannot plot a 784-pixel image dataset directly. Projecting to 2 PCA components lets you scatter-plot thousands of images and see cluster structure instantly.
⚠️
PCA Is Not Feature Selection

Feature selection keeps a subset of original columns. PCA creates brand-new columns that are linear combinations of all original features. The new features have no direct real-world meaning, but they are mathematically optimal for preserving variance.


Section 03

The Four-Step PCA Algorithm

01
Standardise the Data
Subtract the mean and divide by the standard deviation for each feature. This gives every feature zero mean and unit variance. Without this step, features on large scales (e.g. salary in £) dominate those on small scales (e.g. age). Result: a matrix X where every column has mean = 0 and std = 1.
02
Compute the Covariance Matrix
Calculate C = (XᵀX) / (n−1) — a d×d symmetric matrix where each cell C[i,j] captures how much feature i and feature j vary together. The diagonal holds variances; off-diagonal holds covariances.
03
Eigen-Decomposition of C
Solve Cv = λv for all eigenvalue/eigenvector pairs. Each eigenvector is a direction (principal component) in the original feature space. The corresponding eigenvalue tells you how much variance that direction captures. Sort pairs by eigenvalue — largest first.
04
Project the Data
Choose the top k eigenvectors (where k ≪ d). Stack them into a matrix W of shape d×k. Project: Z = X·W. The result Z has shape n×k — your data in the new reduced space.
05
Decide How Many Components to Keep
Use the explained variance ratio to decide k. Plot a scree plot and look for the "elbow". A common rule: keep enough PCs to explain ≥ 95% of total variance.

Section 04

The Maths — Made Concrete with Numbers

Numerical Example: 2 Features → 1 Principal Component

Let's walk through PCA on a tiny dataset — 5 students with two exam scores. We want to find the single direction (PC1) that best summarises both scores.

Student Maths Score (x₁) Physics Score (x₂)
Alice8588
Bob7072
Carol9295
Dan6058
Eve7880
📝 Step-by-Step Calculation
Step 1
Compute means: μ₁ = (85+70+92+60+78)/5 = 77.0, μ₂ = (88+72+95+58+80)/5 = 78.6
Step 2
Centre the data (subtract means): Alice becomes (8, 9.4), Bob (−7, −6.6), Carol (15, 16.4), Dan (−17, −20.6), Eve (1, 1.4)
Step 3
Covariance matrix: Var(x₁) = 148.5  |  Var(x₂) = 186.3  |  Cov(x₁,x₂) = 165.5
C = [[148.5, 165.5], [165.5, 186.3]]
Step 4
Eigenvalues: Solving det(C − λI) = 0 gives λ₁ ≈ 332.8 and λ₂ ≈ 1.95
Step 5
Eigenvectors: v₁ ≈ [0.664, 0.747] (PC1 direction)  |  v₂ ≈ [−0.747, 0.664] (PC2 direction)
Result
Explained variance: PC1 explains 332.8 / (332.8+1.95) = 99.4% of variance!
One component is almost perfect — the two exam scores are nearly identical in direction.
🌟
Intuition Check

The eigenvector [0.664, 0.747] points in the direction students vary most — roughly at 45° between the two axes. This makes sense: students who score high in Maths also score high in Physics. The information is almost entirely captured by a single "academic ability" axis.

Key Formulas

Covariance Matrix
C = XᵀX / (n−1)
Symmetric d×d matrix. Each entry Cᵢⱼ measures co-variation of features i and j.
Eigen Equation
C · v = λ · v
v = eigenvector (direction), λ = eigenvalue (amount of variance in that direction).
Explained Variance Ratio
EVR_k = λ_k / Σλ_i
Fraction of total variance explained by component k. All ratios sum to 1.
Projection
Z = X · W_k
W_k is the d×k matrix of top-k eigenvectors. Z is the reduced n×k dataset.

Section 05

Visual Diagram — What PCA Actually Does

📊 PCA Geometric Visualisation — Interactive 3D
Data points PC1 — max variance PC2 PC3 Drag canvas to rotate • Sliders for precise control
PC1 variance
PC2 variance
PC3 variance
Retained (k=1)

↑ Try PC1 axis view — looking straight down PC1, the data collapses into a tight disc, showing how little variance lives in PC2 and PC3. The amber arrow is the direction of maximum spread.


Section 06

Scree Plot — Choosing How Many Components to Keep

The Mountain Ridge
Imagine ranking hills by height. The first few hills are huge — they dominate the landscape. Then, after a sharp drop, the rest are gentle undulations almost indistinguishable from noise. The point where the big drop ends is called the "elbow". Keep the hills above the elbow — they're telling you something real. Ignore the flat part — it's mostly noise.

That is a scree plot. The y-axis is eigenvalue (= variance explained); the x-axis is component number. The elbow tells you where to cut.
📈 Scree Plot — Explained Variance per Component
0% 20% 40% 60% 80% PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 55% 22% 10% 6% 3% 2% Elbow → keep PC1+PC2 (77% total variance) cumulative

Amber bars = components worth keeping (above the elbow). Blue bars = mostly noise. The green dashed line tracks cumulative explained variance.


Section 07

Python Implementation — Step by Step

Step 1: Standardisation and Raw PCA from Scratch

import numpy as np

# ── Toy dataset: 6 students × 4 exam scores ──────────────────
X = np.array([
    [85, 88, 82, 90],
    [70, 72, 68, 74],
    [92, 95, 91, 96],
    [60, 58, 63, 55],
    [78, 80, 76, 82],
    [55, 52, 57, 50],
], dtype=float)

# Step 1 — Standardise (zero mean, unit variance)
mean = X.mean(axis=0)
std  = X.std(axis=0, ddof=1)
X_std = (X - mean) / std

# Step 2 — Covariance matrix
C = np.cov(X_std.T)
print("Covariance matrix shape:", C.shape)
print(np.round(C, 3))

# Step 3 — Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(C)
idx          = np.argsort(eigenvalues)[::-1]
eigenvalues  = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Step 4 — Explained variance ratio
evr = eigenvalues / eigenvalues.sum()
print("Eigenvalues:         ", np.round(eigenvalues, 4))
print("Explained var ratio: ", np.round(evr, 4))
print("Cumulative EVR:      ", np.round(np.cumsum(evr), 4))

# Step 5 — Project onto top-2 PCs
W  = eigenvectors[:, :2]
Z  = X_std @ W
print("Projected data (2 PCs):
", np.round(Z, 3))
OUTPUT
Covariance matrix shape: (4, 4) [[ 1. 0.998 0.999 0.997] [ 0.998 1. 0.997 0.999] [ 0.999 0.997 1. 0.995] [ 0.997 0.999 0.995 1. ]] Eigenvalues: [3.9883 0.0086 0.0024 0.0006] Explained var ratio: [0.9971 0.0021 0.0006 0.0002] Cumulative EVR: [0.9971 0.9993 0.9998 1. ] Projected data (2 PCs): [[ 1.032 0.094] [-0.519 -0.031] [ 1.548 0.127] [-1.897 -0.113] [ 0.587 0.058] [-0.752 -0.135]]
🌟
What the output tells us

The first eigenvalue (3.99) is enormous compared to the rest (0.009, 0.002, 0.001). PC1 alone explains 99.71% of all variance — four exam scores essentially measure one thing: "academic ability". We compressed 4 dimensions into 1 and lost only 0.29% of information.

Step 2: sklearn PCA — The Production Way

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca_full = PCA(n_components=None)
pca_full.fit(X_scaled)

print("Explained variance ratio per component:")
for i, ev in enumerate(pca_full.explained_variance_ratio_):
    print(f"  PC{i+1}: {ev:.4f}  ({ev*100:.2f}%)")
print(f"
95% variance threshold: {np.argmax(np.cumsum(pca_full.explained_variance_ratio_) >= 0.95) + 1} components")

pca_2d = PCA(n_components=2)
X_2d   = pca_2d.fit_transform(X_scaled)

print(f"
Original shape:  {X.shape}")
print(f"Reduced shape:   {X_2d.shape}")
print(f"Total variance retained: {pca_2d.explained_variance_ratio_.sum():.4f}")

print("
Component loadings (how much each feature contributes):")
for i, comp in enumerate(pca_2d.components_):
    print(f"  PC{i+1}:")
    for fname, loading in zip(feature_names, comp):
        print(f"    {fname:28s}: {loading:+.4f}")
OUTPUT
Explained variance ratio per component: PC1: 0.7296 (72.96%) PC2: 0.2285 (22.85%) PC3: 0.0367 (3.67%) PC4: 0.0052 (0.52%) 95% variance threshold: 2 components Original shape: (150, 4) Reduced shape: (150, 2) Total variance retained: 0.9581 Component loadings (how much each feature contributes): PC1: sepal length (cm) : +0.5224 sepal width (cm) : -0.2634 petal length (cm) : +0.5813 petal width (cm) : +0.5656 PC2: sepal length (cm) : +0.3724 sepal width (cm) : +0.9256 petal length (cm) : -0.0211 petal width (cm) : -0.0657
📖
Reading the Loadings

PC1 has high positive loadings for petal length (+0.58), petal width (+0.57), and sepal length (+0.52). It is essentially an "overall flower size" axis — bigger flowers score higher. PC2 is dominated by sepal width (+0.93) — it captures a completely different aspect of the flower's shape.

Step 3: Choosing n_components with Variance Threshold

from sklearn.decomposition import PCA

pca_95 = PCA(n_components=0.95)
X_95   = pca_95.fit_transform(X_scaled)
print(f"Components needed for 95% variance: {pca_95.n_components_}")

pca_mle = PCA(n_components='mle')
X_mle   = pca_mle.fit_transform(X_scaled)
print(f"MLE auto-selected components:       {pca_mle.n_components_}")

X_reconstructed = pca_2d.inverse_transform(X_2d)
mse = ((X_scaled - X_reconstructed) ** 2).mean()
print(f"Reconstruction MSE (2 PCs):         {mse:.6f}")
OUTPUT
Components needed for 95% variance: 2 MLE auto-selected components: 1 Reconstruction MSE (2 PCs): 0.041853

Section 08

Complete Real-World Example — Digits Dataset

The digits dataset contains 1,797 handwritten digit images (8×8 pixels = 64 features). We will use PCA to compress 64 dimensions to 2, visualise the digit clusters, and then compress to 30 dimensions and train a classifier.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

digits = load_digits()
X, y = digits.data, digits.target
print(f"Dataset: {X.shape[0]} images × {X.shape[1]} pixel features")

scaler   = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca_30  = PCA(n_components=30, random_state=42)
X_30    = pca_30.fit_transform(X_scaled)
print(f"Variance retained (30 PCs): {pca_30.explained_variance_ratio_.sum():.4f}")
print(f"Features: 64 → 30 ({30/64*100:.0f}% of original)")

X_train, X_test, y_train, y_test = train_test_split(
    X_30, y, test_size=0.2, random_state=42)

for name, Xtr, Xte in [
    ("No PCA (64 features)",  *train_test_split(X_scaled, test_size=0.2, random_state=42)),
    ("PCA 30 components",      X_train, X_test)
]:
    lr = LogisticRegression(max_iter=1000, random_state=42)
    lr.fit(Xtr, y_train)
    acc = accuracy_score(y_test, lr.predict(Xte))
    print(f"  {name:25s}: {acc:.4f}")
OUTPUT
Dataset: 1797 images × 64 pixel features Variance retained (2 PCs): 0.286 Variance retained (30 PCs): 0.9335 Features: 64 → 30 (47% of original) No PCA (64 features) : 0.9611 PCA 30 components : 0.9528
👉
Key Takeaway

By discarding the noisiest 34 dimensions and keeping only 30 PCs (47% of the original features), we retained 93.4% of all variance and lost only 0.8% accuracy. In exchange, training is faster, memory usage is lower, and noisy pixel dimensions that could confuse the model are removed.


Section 09

PCA in a Full Machine-Learning Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.datasets import load_wine

wine = load_wine()
X, y = wine.data, wine.target

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca',    PCA(n_components=0.95)),
    ('svm',    SVC(kernel='rbf', random_state=42))
])

cv_scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

param_grid = {
    'pca__n_components': [0.90, 0.95, 0.99, 5, 8],
    'svm__C':            [0.1, 1.0, 10.0],
    'svm__gamma':        ['scale', 'auto'],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X, y)

print(f"Best CV accuracy: {grid.best_score_:.4f}")
print(f"Best params:      {grid.best_params_}")
print(f"PCs selected:     {grid.best_estimator_.named_steps['pca'].n_components_}")
OUTPUT
CV Accuracy: 0.9833 ± 0.0183 Best CV accuracy: 0.9944 Best params: {'pca__n_components': 8, 'svm__C': 10.0, 'svm__gamma': 'scale'} PCs selected: 8
⚠️
Data Leakage Warning

Always put PCA inside a Pipeline when doing cross-validation. Fitting PCA on the full dataset before splitting leaks information from the test fold into training — PCA has "seen" the test data's variance structure. The Pipeline guarantees PCA is only fit on the training fold each time.


Section 10

Incremental PCA — When Data Doesn't Fit in RAM

from sklearn.decomposition import IncrementalPCA
import numpy as np

np.random.seed(42)
X_large = np.random.randn(100_000, 50)

ipca = IncrementalPCA(n_components=10, batch_size=500)
ipca.fit(X_large)

print("Explained variance ratio (top 10):")
print(np.round(ipca.explained_variance_ratio_, 4))
print(f"Total: {ipca.explained_variance_ratio_.sum():.4f}")

X_transformed = ipca.transform(X_large[:1000])
print(f"Transformed shape: {X_transformed.shape}")
OUTPUT
Explained variance ratio (top 10): [0.0229 0.0222 0.0219 0.0217 0.0215 0.0214 0.0213 0.0212 0.0211 0.0211] Total: 0.2163 Transformed shape: (1000, 10)
📊
When to use IncrementalPCA

Standard PCA loads the entire dataset into RAM to compute the covariance matrix. IncrementalPCA processes mini-batches and updates its internal state incrementally — memory usage stays constant regardless of dataset size. Use it whenever your data is larger than available RAM, or when data arrives as a stream. Results are numerically very close to standard PCA.


Section 11

PCA Diagram — The Full Data Flow

⚙️ PCA End-to-End Data Flow
Raw Data n × d matrix e.g. 150 × 64 Scale Standardised mean=0, std=1 per feature Cov(X) Cov Matrix C = XᵀX/(n-1) d × d symmetric eig(C) Eigenvectors sorted by λ (desc) directions of variance choose k W matrix top-k eigenvectors shape: d × k project Z = X · W Reduced data shape: n × k ✓ Lower dimension Visualise scatter 2-D/3-D Train Model SVM, LR, RF … Reconstruct Z · Wᵀ ≈ X LEGEND Data matrix Output Transform Data flow

Section 12

Key Hyperparameters

Parameter Default What It Controls Tuning Advice
n_components None Number of PCs to keep. Also accepts float (variance %) or 'mle' Most important param. Use 0.95 for 95% variance threshold.
svd_solver 'auto' SVD algorithm: 'full', 'arpack', 'randomized', 'auto' Use 'randomized' for large d and small k — much faster.
whiten False Scale components to unit variance after projection Set True before SVM or neural nets for better conditioning.
random_state None Seed for randomized SVD (affects reproducibility) Always set for reproducibility in production.
copy True Whether to copy input X before centering Set False to save memory if you don't need X after transform.

Section 13

When to Use PCA — and When Not To

Visualisation (2D/3D)
Project any dataset to 2 or 3 PCs and scatter-plot it. Cluster structure, outliers, and class separation become immediately visible — impossible otherwise in 50+ dimensions.
exploratory data analysis
Noise Reduction
The last few principal components often represent measurement noise. Projecting and back-projecting (Z·Wᵀ) gives a denoised version of your data. Used in image compression, audio denoising, and genomics.
denoising, image compression
Multicollinearity Removal
PCA components are by construction orthogonal — zero correlation. If your model suffers from highly correlated features (e.g. financial ratios), PCA preprocessing removes the multicollinearity completely.
regression, logistic regression
Interpretability Required
PCA components are linear combinations of all features. "PC1 = 0.52·feature_A + 0.58·feature_B …" has no direct business meaning. If you need to explain "why did the model predict X", use original features.
regulated industries, clinical AI
Non-Linear Structure
PCA is purely linear. If your data lies on a curved manifold (e.g. the Swiss roll), PCA fails badly. Use t-SNE, UMAP, or Kernel PCA instead for non-linear dimensionality reduction.
use t-SNE, UMAP, KernelPCA
Small Dataset, High Variance Budget
On very small datasets (n ≪ d), PCA can amplify noise into the top components. The estimated covariance matrix becomes unreliable when d > n. Consider regularised PCA or Ridge regression instead.
n < d regime, genomics caution

Section 14

PCA vs Other Dimensionality Reduction Methods

Method Type Preserves Speed Best For
PCA Linear Global variance structure Fast Preprocessing, noise removal, visualisation
Kernel PCA Non-linear Non-linear variance Slow (O(n²)) Non-linear manifolds in moderate datasets
t-SNE Non-linear Local neighbourhood structure Very slow Cluster visualisation only (2D/3D)
UMAP Non-linear Local + some global Medium Large datasets, preserves topology
Autoencoder Non-linear Learned latent structure Slow (training) Images, audio, highly complex data
LDA Linear Class separation (supervised) Fast Classification preprocessing (supervised)
🏆
The Practitioner's Rule

Start with PCA — it is fast, interpretable (loadings), and works as a preprocessing step inside pipelines with no data leakage risk. If cluster structure remains invisible after PCA, switch to UMAP for visualisation. If your problem is fundamentally non-linear and performance matters more than speed, try Kernel PCA or a deep autoencoder.


Section 15

Golden Rules

🌟 PCA — Non-Negotiable Rules
1
Always standardise before PCA. PCA maximises variance. Features on large scales dominate those on small scales. Use StandardScaler — zero mean, unit variance — every single time unless you have a specific reason not to (e.g. all features are already on the same scale by design).
2
Always put PCA inside a Pipeline. Fitting PCA on the full dataset before train/test splitting leaks test-set variance information into training. Wrap it in Pipeline([...]) so it is only fit on training folds.
3
Inspect explained variance ratios before choosing k. Never blindly set n_components=2. Plot the scree plot, check cumulative EVR. If 2 components explain only 30% of variance, your 2-D plot is misleading. Report the retained variance alongside every PCA visualisation.
4
Use PCA for preprocessing, not as an end in itself. PCA is a means to an end — faster training, better visualisation, removed multicollinearity. Always evaluate whether PCA actually improves your downstream task metric. Sometimes it helps; sometimes full features are better.
5
Do not use PCA to fix class imbalance or missing values. Impute missing values before PCA. PCA is undefined for NaN inputs. Use SimpleImputer or KNNImputer first, then scale, then PCA.
6
For datasets too large to fit in RAM, use IncrementalPCA with a sensible batch_size (typically 500–5000 rows). Results are numerically equivalent to standard PCA but constant memory cost regardless of total dataset size.
7
Components are not features — do not name them carelessly. A loading plot (feature vs. component weight) can help you interpret what a component represents, but treat interpretations as heuristics, not facts. PC1 of an Iris dataset may resemble "petal size", but it is still a mathematical construction, not a measured property.