The Story That Explains PCA
PCA finds that special angle: the one from which the shadow (projection) of your data is most spread out. It rotates your coordinate system so that the first new axis, called the first principal component, points in the direction of maximum variance in your data. The shadow on that axis contains as much information as possible. Each subsequent component captures the next highest variance while staying perpendicular to all previous ones.
The result: you can describe your data in fewer dimensions while losing as little information as possible.
Principal Component Analysis (PCA) is an unsupervised linear transformation that projects data onto a new set of orthogonal axes (principal components), ordered by the amount of variance they explain. It is the workhorse of dimensionality reduction — used in data visualisation, noise filtering, feature extraction, and preprocessing before clustering or classification.
Given a dataset with p features, find a new set of k < p axes (principal components) such that the projected data retains maximum variance. The original p-dimensional data is represented in k dimensions with minimal information loss.
The Problem — Dataset & Goal
We have a small dataset with 4 samples and 2 features (x₁, x₂). The goal is to reduce it from 2D to 1D using PCA, retaining the maximum variance possible. We will solve every step by hand — exactly as you would in a Data Science exam notebook.
| Sample | Feature x₁ | Feature x₂ |
|---|---|---|
| S1 | 1 | 2 |
| S2 | 3 | 4 |
| S3 | 5 | 4 |
| S4 | 3 | 6 |
① Compute the mean of each feature → ② Mean-centre the data → ③ Compute the covariance matrix → ④ Solve for eigenvalues → ⑤ Compute eigenvectors → ⑥ Select principal components (explained variance) → ⑦ Project data onto chosen components
Step 1 — Compute the Mean of Each Feature
PCA is sensitive to the spread of data, not its absolute position. We must first shift the data so it is centred at the origin. Before that, we need the mean of each feature.
Mean of feature x₁:
x̄₁ = (x₁₁ + x₂₁ + x₃₁ + x₄₁) / 4
x̄₁ = (1 + 3 + 5 + 3) / 4
x̄₁ = 12 / 4
x̄₁ = 3
Mean of feature x₂:
x̄₂ = (x₁₂ + x₂₂ + x₃₂ + x₄₂) / 4
x̄₂ = (2 + 4 + 4 + 6) / 4
x̄₂ = 16 / 4
x̄₂ = 4
Mean vector: μ = [3, 4]
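The arithmetic above can be double-checked with a few lines of plain Python. This is only a sketch: the names `samples`, `mean_x1` and `mean_x2` are illustrative, and `Fraction` is used so the result stays exact, as in the hand calculation.

```python
from fractions import Fraction

# The four samples from the table above, as (x1, x2) pairs.
samples = [(1, 2), (3, 4), (5, 4), (3, 6)]
n = len(samples)

# Mean of each feature: add up the column, then divide by the number of samples.
mean_x1 = Fraction(sum(x1 for x1, _ in samples), n)
mean_x2 = Fraction(sum(x2 for _, x2 in samples), n)
mu = [mean_x1, mean_x2]

print("mean vector:", [str(m) for m in mu])
```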
Step 2 — Mean-Centre the Data (Subtract the Mean)
We subtract the mean vector μ = [3, 4] from every data point to produce the centred matrix B. This shifts the data cloud so its centroid sits exactly at the origin — a prerequisite for covariance analysis.
B₁ = (1 − 3, 2 − 4) = (−2, −2)
B₂ = (3 − 3, 4 − 4) = ( 0, 0)
B₃ = (5 − 3, 4 − 4) = ( 2, 0)
B₄ = (3 − 3, 6 − 4) = ( 0, 2)
| Sample | x₁ | x₂ |
|---|---|---|
| S1 | 1 | 2 |
| S2 | 3 | 4 |
| S3 | 5 | 4 |
| S4 | 3 | 6 |
| Sample | b₁ | b₂ |
|---|---|---|
| S1 | −2 | −2 |
| S2 | 0 | 0 |
| S3 | 2 | 0 |
| S4 | 0 | 2 |
Sum of b₁: −2 + 0 + 2 + 0 = 0 ✓
Sum of b₂: −2 + 0 + 0 + 2 = 0 ✓
If either sum ≠ 0, you've made an arithmetic error in the mean or subtraction.
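The same sanity check is easy to automate. The sketch below assumes NumPy is available and uses illustrative variable names; it centres the data by broadcasting and confirms that every column of B sums to zero.

```python
import numpy as np

# Original data: one row per sample, one column per feature.
X = np.array([[1, 2], [3, 4], [5, 4], [3, 6]], dtype=float)

mu = X.mean(axis=0)        # per-feature mean
B = X - mu                 # broadcasting subtracts mu from every row

col_sums = B.sum(axis=0)   # should be zero for every feature
print("centred matrix B:\n", B)
print("column sums:", col_sums)
```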
Step 3 — Compute the Covariance Matrix
The covariance matrix C captures how much each pair of features varies together. It is a symmetric p × p matrix. For our 2-feature dataset it is 2 × 2.
C = (1 / (n−1)) × BᵀB
where B is the mean-centred matrix and n is the number of samples.
We divide by (n−1) for the unbiased sample covariance estimate.
With n = 4, the divisor is 3.
Row 1 of Bᵀ (the b₁ column of B): [−2, 0, 2, 0]
Row 2 of Bᵀ (the b₂ column of B): [−2, 0, 0, 2]
BᵀB[1,1] = (−2)(−2) + (0)(0) + (2)(2) + (0)(0) = 4 + 0 + 4 + 0 = 8
BᵀB[1,2] = (−2)(−2) + (0)(0) + (2)(0) + (0)(2) = 4 + 0 + 0 + 0 = 4
BᵀB[2,1] = (−2)(−2) + (0)(0) + (0)(2) + (2)(0) = 4 + 0 + 0 + 0 = 4
BᵀB[2,2] = (−2)(−2) + (0)(0) + (0)(0) + (2)(2) = 4 + 0 + 0 + 4 = 8
BᵀB = [[8, 4], [4, 8]]
C[1,1] = 8/3 ≈ 2.667 ← Variance of x₁
C[1,2] = 4/3 ≈ 1.333 ← Covariance of x₁ and x₂
C[2,1] = 4/3 ≈ 1.333 ← Covariance of x₂ and x₁ (symmetric)
C[2,2] = 8/3 ≈ 2.667 ← Variance of x₂
C = [[8/3, 4/3], [4/3, 8/3]]
| Covariance Matrix C | x₁ | x₂ |
|---|---|---|
| x₁ | 8/3 ≈ 2.667 | 4/3 ≈ 1.333 |
| x₂ | 4/3 ≈ 1.333 | 8/3 ≈ 2.667 |
Diagonal entries (C[1,1] and C[2,2]) = variance of each feature individually. Off-diagonal entries (C[1,2] = C[2,1]) = how much x₁ and x₂ change together. A positive off-diagonal (1.333) means as x₁ increases, x₂ tends to increase too — they are positively correlated. If the matrix were diagonal, features would be uncorrelated and PCA would produce the original axes unchanged.
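As a quick cross-check, the covariance matrix can be built from BᵀB directly and compared with `np.cov`. This is a minimal sketch with illustrative names. The correlation coefficient at the end is an extra step not in the hand calculation, added only to quantify the "positively correlated" remark.

```python
import numpy as np

# Mean-centred data from Step 2.
B = np.array([[-2, -2], [0, 0], [2, 0], [0, 2]], dtype=float)
n = B.shape[0]

scatter = B.T @ B                 # the 2 x 2 matrix of summed products
C = scatter / (n - 1)             # unbiased sample covariance

# np.cov divides by n - 1 by default; rowvar=False means columns are features.
C_np = np.cov(B, rowvar=False)

# Correlation = covariance scaled by both standard deviations.
corr = C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])

print("scatter:\n", scatter)
print("C:\n", C)
print("correlation:", corr)
```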
Step 4 — Solve for Eigenvalues
The eigenvalues of the covariance matrix tell us how much variance
each principal component explains. We find them by solving the
characteristic equation: det(C − λI) = 0
det(C − λI) = (8/3 − λ)(8/3 − λ) − (4/3)(4/3)
= (8/3 − λ)² − 16/9
Set equal to zero:
(8/3 − λ)² − 16/9 = 0
(8/3 − λ)² = 16/9
8/3 − λ = ± 4/3
Case 1 (positive sign):
8/3 − λ = +4/3
λ = 8/3 − 4/3 = 4/3
Case 2 (negative sign):
8/3 − λ = −4/3
λ = 8/3 + 4/3 = 12/3 = 4
We sort largest first:
λ₁ = 4 (larger → PC1)
λ₂ = 4/3 ≈ 1.333 (smaller → PC2)
PCA requires eigenvalues to be sorted from largest to smallest. PC1 always corresponds to the largest eigenvalue (most variance), PC2 to the second largest, and so on. Never assign PCs in the order the solver returns them without sorting first.
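For a 2 × 2 matrix the characteristic equation is just a quadratic, λ² − trace(C)·λ + det(C) = 0, so the eigenvalues can be checked with the quadratic formula. The sketch below is one way to do that in plain Python; the variable names are illustrative.

```python
import math
from fractions import Fraction

# Covariance matrix from Step 3, kept as exact fractions.
C = [[Fraction(8, 3), Fraction(4, 3)],
     [Fraction(4, 3), Fraction(8, 3)]]

trace = C[0][0] + C[1][1]                      # 16/3
det = C[0][0] * C[1][1] - C[0][1] * C[1][0]    # 64/9 - 16/9 = 16/3

# Quadratic formula for lambda^2 - trace*lambda + det = 0.
disc = trace ** 2 - 4 * det                    # 64/9
root = math.sqrt(disc)

lam1 = (float(trace) + root) / 2               # larger root -> PC1
lam2 = (float(trace) - root) / 2               # smaller root -> PC2
print("eigenvalues, largest first:", lam1, lam2)
```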
Step 5 — Compute Eigenvectors (Principal Directions)
Eigenvectors give us the directions of the new axes.
For each eigenvalue λ, we solve (C − λI)v = 0 and then
normalise the resulting vector to unit length.
Eigenvector for λ₁ = 4:
C − λ₁I = [[8/3 − 12/3, 4/3], [4/3, 8/3 − 12/3]]
= [[−4/3, 4/3], [4/3, −4/3]]
Take row 1: −4/3 · v₁ + 4/3 · v₂ = 0
→ −v₁ + v₂ = 0
→ v₁ = v₂
Choose v = [1, 1], then normalise:
‖v‖ = √(1² + 1²) = √2
e₁ = [1/√2, 1/√2] ≈ [0.707, 0.707]
Eigenvector for λ₂ = 4/3:
C − λ₂I = [[8/3 − 4/3, 4/3], [4/3, 8/3 − 4/3]]
= [[4/3, 4/3], [4/3, 4/3]]
Take row 1: 4/3 · v₁ + 4/3 · v₂ = 0
→ v₁ + v₂ = 0
→ v₁ = −v₂
Choose v = [1, −1], then normalise:
‖v‖ = √(1² + (−1)²) = √2
e₂ = [1/√2, −1/√2] ≈ [0.707, −0.707]
Check orthogonality: e₁ · e₂ = (1/√2)(1/√2) + (1/√2)(−1/√2) = 1/2 − 1/2 = 0 ✓
Principal components are always perpendicular to each other. A non-zero dot product means a calculation error.
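Both eigenpairs can be verified numerically by plugging them back into Cv = λv. The following sketch assumes NumPy and uses the hand-derived vectors directly rather than calling an eigen-solver.

```python
import numpy as np

C = np.array([[8/3, 4/3], [4/3, 8/3]])

# Hand-derived unit eigenvectors and their eigenvalues.
e1 = np.array([1.0, 1.0]) / np.sqrt(2)
e2 = np.array([1.0, -1.0]) / np.sqrt(2)
lam1, lam2 = 4.0, 4.0 / 3.0

# C @ e should equal lambda * e for a genuine eigenpair.
residual1 = C @ e1 - lam1 * e1
residual2 = C @ e2 - lam2 * e2

print("residual for (lam1, e1):", residual1)
print("residual for (lam2, e2):", residual2)
print("unit lengths:", np.linalg.norm(e1), np.linalg.norm(e2))
print("dot product e1 . e2:", e1 @ e2)
```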
Step 6 — Explained Variance & Choosing Components
The eigenvalue tells us how much variance each principal component captures. We express this as a percentage of total variance to decide how many components to keep. This is the heart of the dimensionality reduction decision.
Σλ = λ₁ + λ₂ = 4 + 4/3 = 12/3 + 4/3 = 16/3 ≈ 5.333
PC1 explained variance:
EV₁ = λ₁ / Σλ = 4 / (16/3) = 4 × (3/16) = 12/16 = 3/4 = 75.0 %
PC2 explained variance:
EV₂ = λ₂ / Σλ = (4/3) / (16/3) = 4/16 = 1/4 = 25.0 %
Cumulative (PC1 + PC2) = 75% + 25% = 100% ✓
| Component | Eigenvalue | Explained Variance | Cumulative Variance | Keep? |
|---|---|---|---|---|
| PC1 | 4.000 | 75.0 % | 75.0 % | YES |
| PC2 | 1.333 | 25.0 % | 100.0 % | OPTIONAL |
PC1 alone captures 75% of the total variance. By keeping only 1 component we reduce the data from 2D to 1D and give up the remaining 25% of the variance. In practice, a cumulative-variance threshold of ≥ 95% is common for larger datasets. For this demo, we proceed with PC1 only.
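The "keep components until a cumulative threshold is reached" rule can be written as a small helper. This is a sketch, not a library function: `components_for_threshold` is a hypothetical name, and the tiny tolerance is an assumption added to guard against floating-point round-off.

```python
import numpy as np

def components_for_threshold(eigenvalues, threshold):
    """Smallest k whose cumulative explained-variance ratio reaches `threshold`."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # largest first
    cumulative = np.cumsum(lam) / lam.sum()
    reached = cumulative >= threshold - 1e-12                   # tolerate round-off
    return int(np.argmax(reached)) + 1                          # first True, 1-based

eigenvalues = [4.0, 4.0 / 3.0]
k_75 = components_for_threshold(eigenvalues, 0.75)
k_95 = components_for_threshold(eigenvalues, 0.95)
print("components needed for 75%:", k_75)
print("components needed for 95%:", k_95)
```

With only two features, a 95% threshold forces you to keep both components, which is why the demo settles for 75%.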
Step 7 — Project Data onto PC1 (Reduce to 1D)
We now project the mean-centred data matrix B onto the first principal component direction e₁. The projection is simply a dot product: multiply each centred data row by the eigenvector.
Z₁ = (−2)(1/√2) + (−2)(1/√2) = −2/√2 − 2/√2 = −4/√2 = −4 × (√2/2) = −2√2 ≈ −2.828
Z₂ = (0)(1/√2) + (0)(1/√2) = 0 + 0 = 0.000
Z₃ = (2)(1/√2) + (0)(1/√2) = 2/√2 + 0 = 2/√2 = √2 ≈ 1.414
Z₄ = (0)(1/√2) + (2)(1/√2) = 0 + 2/√2 = 2/√2 = √2 ≈ 1.414
Projected Data Z = [−2.828, 0, 1.414, 1.414]
| Sample | x₁ | x₂ |
|---|---|---|
| S1 | 1 | 2 |
| S2 | 3 | 4 |
| S3 | 5 | 4 |
| S4 | 3 | 6 |
| Sample | PC1 Score |
|---|---|
| S1 | −2.828 |
| S2 | 0.000 |
| S3 | 1.414 |
| S4 | 1.414 |
We successfully transformed 4 data points from 2 dimensions down to 1 dimension, retaining 75% of the total variance. S3 and S4 map to the same PC1 score (1.414) — they overlap in the principal direction, but differ in PC2 (the discarded component). This is the inherent trade-off of dimensionality reduction.
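One way to make this trade-off concrete is to map the 1D scores back into 2D and look at what was lost. The reconstruction step below is not part of the hand solution above; it is an illustrative sketch with assumed variable names.

```python
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 4], [3, 6]], dtype=float)
mu = X.mean(axis=0)
B = X - mu

e1 = np.array([1.0, 1.0]) / np.sqrt(2)   # first principal direction
z = B @ e1                               # 1D scores, one per sample

# Map the scores back to 2D: each point lands on the PC1 line through the mean.
X_hat = mu + np.outer(z, e1)

# Squared reconstruction error summed over all samples.
sse = ((X - X_hat) ** 2).sum()

print("scores:", z)
print("reconstruction:\n", X_hat)
print("sum of squared errors:", sse)
```

S3 and S4 reconstruct to the same point, and the total squared error equals (n − 1) × λ₂ = 3 × 4/3 = 4, which is exactly the variance carried by the discarded component.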
Full Solution Summary — One View
Here is the complete PCA solution in condensed form, exactly as you would write it in a final exam answer:
Key Formulae at a Glance
Three Concepts Every PCA Student Must Understand
When to Use PCA — and When Not To
Python Implementation — Verify Our Hand Calculation
Let us reproduce the exact numerical example above in Python — both manually and using scikit-learn. The outputs should match our hand-solved values to within floating-point precision.
import numpy as np
from sklearn.decomposition import PCA
# ── Original dataset ──────────────────────────────────────────
X = np.array([
    [1, 2],  # S1
    [3, 4],  # S2
    [5, 4],  # S3
    [3, 6],  # S4
], dtype=float)
# ── MANUAL PCA ────────────────────────────────────────────────
# Step 1: Mean vector
mu = X.mean(axis=0)
print(f"Mean vector μ: {mu}") # [3. 4.]
# Step 2: Mean-centre
B = X - mu
print(f"\nCentred matrix B:\n{B}")
# Step 3: Covariance matrix (ddof=1 → divide by n-1)
C = np.cov(B, rowvar=False) # equivalent to BᵀB / (n-1)
print(f"\nCovariance matrix C:\n{C}")
# Step 4 & 5: Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(C)
# eigh returns ascending order — reverse to descending
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
print(f"\nEigenvalues (sorted): {eigenvalues}") # [4. 1.333]
print(f"Eigenvectors (columns):\n{eigenvectors}")
# Step 6: Explained variance
ev_ratio = eigenvalues / eigenvalues.sum()
print(f"\nExplained variance ratio: {ev_ratio}") # [0.75 0.25]
print(f"Cumulative: {ev_ratio.cumsum()}")
# Step 7: Project onto PC1 only
W = eigenvectors[:, :1] # take first column only
Z = B @ W # matrix multiply
print(f"\nProjected data Z (1D):\n{Z.flatten()}")
λ₁ = 4.0 ✓ λ₂ = 1.333 ✓ EV = 75%, 25% ✓
Z = [−2.828, 0, 1.414, 1.414] ✓ (up to an overall sign flip, depending on which sign the solver picks for the eigenvector)
All values match our step-by-step notebook solution. NumPy uses the same mathematical definitions, which confirms we solved the problem correctly by hand.
# ── SCIKIT-LEARN PCA (one-liner verification) ─────────────────
pca = PCA(n_components=1)
Z_sk = pca.fit_transform(X) # sklearn centres internally
print(f"sklearn Z: {Z_sk.flatten()}")
print(f"sklearn EV ratio: {pca.explained_variance_ratio_}")
print(f"sklearn eigenvalue: {pca.explained_variance_}")
print(f"sklearn component: {pca.components_}")
The sign of eigenvectors is arbitrary (both [0.707, 0.707] and [−0.707, −0.707] are valid).
Some solvers may flip the sign of the PC scores. This does not affect variance explained,
relative distances, or model accuracy — only the direction of the axis label.
sklearn's PCA uses a deterministic sign convention; always check consistency
when comparing implementations.
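When comparing two implementations, it can help to bring both results to a common sign convention first. The helper below is a sketch of one possible convention (make the largest-magnitude entry positive). `align_sign` is a hypothetical name, and this is not claimed to be the rule any particular library uses.

```python
import numpy as np

def align_sign(v):
    """Flip v if needed so that its largest-magnitude entry is positive."""
    v = np.asarray(v, dtype=float)
    pivot = np.argmax(np.abs(v))          # index of the largest |entry|
    return v if v[pivot] > 0 else -v

# Two equally valid answers for the first principal direction.
e_a = np.array([1.0, 1.0]) / np.sqrt(2)
e_b = -e_a

B = np.array([[-2, -2], [0, 0], [2, 0], [0, 2]], dtype=float)
scores_a = B @ align_sign(e_a)
scores_b = B @ align_sign(e_b)

print("aligned directions match:", np.allclose(align_sign(e_a), align_sign(e_b)))
print("aligned scores match:", np.allclose(scores_a, scores_b))
```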
PCA vs Other Dimensionality Reduction Methods
| Property | PCA | t-SNE | UMAP | LDA |
|---|---|---|---|---|
| Linearity | Linear | Non-linear | Non-linear | Linear |
| Supervised? | No (unsupervised) | No | No | Yes (needs labels) |
| Preserves | Global variance | Local clusters | Local & global | Class separability |
| Output usable for ML? | Yes | No (stochastic) | Sometimes | Yes |
| Scales to high-d? | Yes (TruncatedSVD) | Slow (>50k samples) | Good | Yes |
| Best for | Preprocessing, noise removal | 2D/3D visualisation | Exploration, clustering | Classification preprocessing |
Golden Rules — PCA Non-Negotiables
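Apply StandardScaler before PCA in every sklearn pipeline whose features are measured on different scales; otherwise the feature with the largest numeric range dominates the first component.
For very large datasets, use sklearn.decomposition.TruncatedSVD (which does not centre the data for you) or PCA(svd_solver='randomized'). The full exact SVD on a 1M × 1000 matrix is prohibitively expensive. Randomised PCA gives nearly identical results at a fraction of the compute cost.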
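To see why scaling matters, here is a minimal NumPy-only sketch. It reuses the four-sample dataset but pretends the second feature was recorded in units 1000 times smaller (a made-up factor, purely for illustration). The manual standardisation mimics what StandardScaler does, i.e. subtracting the mean and dividing by the population standard deviation. `first_pc` is a hypothetical helper.

```python
import numpy as np

def first_pc(data):
    """Unit eigenvector of the largest eigenvalue of the sample covariance."""
    centred = data - data.mean(axis=0)
    C = np.cov(centred, rowvar=False)
    _, vecs = np.linalg.eigh(C)            # eigh sorts eigenvalues ascending
    v = vecs[:, -1]                        # last column = largest eigenvalue
    return v if v[np.argmax(np.abs(v))] > 0 else -v   # fix the arbitrary sign

X = np.array([[1, 2], [3, 4], [5, 4], [3, 6]], dtype=float)
X_mixed = X * np.array([1.0, 1000.0])      # hypothetical change of units for x2

pc_raw = first_pc(X_mixed)

# Z-score each feature (population std, ddof=0).
X_std = (X_mixed - X_mixed.mean(axis=0)) / X_mixed.std(axis=0)
pc_std = first_pc(X_std)

print("PC1 without scaling:", pc_raw)
print("PC1 after standardising:", pc_std)
```

Without scaling, PC1 points almost entirely along the rescaled feature. After standardising, both features contribute equally again.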
The Real Significance of Eigenvectors
Most textbooks define eigenvectors as "directions of maximum variance" and move on. That is true but dangerously shallow. Here is what an eigenvector actually means — geometrically, algebraically, and in the context of your data.
An eigenvector is a special vector that the matrix does not rotate. It only gets stretched or compressed. The matrix multiplies it by a scalar — the eigenvalue — and that is it.
Cv = λv
Read this as: "When the covariance matrix C acts on eigenvector v, the result is just a scaled version of v itself — same direction, different magnitude."
In PCA: the covariance matrix encodes how your data varies. Its eigenvectors are the directions in feature space where this variation is pure — no mixing, no rotation. They are the natural axes of the data's shape.
Imagine your data points as a cloud in 2D space. This cloud is probably not circular — it is likely an ellipse stretched in some direction. The eigenvectors of the covariance matrix point along the axes of that ellipse. The longest axis of the ellipse = first eigenvector (PC1). The shortest axis = second eigenvector (PC2). They are always perpendicular to each other. Always.
Because the covariance matrix C is real and symmetric, it is guaranteed to have:
① Real eigenvalues (no complex numbers)
② Eigenvectors that are mutually orthogonal (perpendicular to each other)
③ A complete set of eigenvectors that span the entire feature space
This is why PCA works. If the covariance matrix were not symmetric, its eigenvalues and eigenvectors could be complex, or the eigenvectors could fail to be orthogonal, and the whole method would break down. Symmetry of C is guaranteed because covariance is symmetric: Cov(x₁,x₂) always equals Cov(x₂,x₁).
Once you rotate into PC space, there is no "x₁" or "x₂" anymore. PC1 is a blend of all original features. This is why PCA projections are hard to interpret: you usually cannot say "a patient scores high on PC1 because their blood pressure is high." If interpretability matters for your task, this can be a showstopper. Use feature selection or SHAP-based methods instead.
The Real Significance of Eigenvalues
The eigenvalue is often summarised as "the variance explained." That is correct. But why is the eigenvalue the variance? Where does that come from? And what exactly does the number 4 mean in our example? Let us build the answer from scratch.
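Let z = Be be the scores obtained by projecting the centred data B onto a unit-length eigenvector e (so eᵀe = 1). Because B is centred, z has zero mean.
Variance of the projection:
Var(z) = (1/(n−1)) × zᵀz
= (1/(n−1)) × (Be)ᵀ(Be)
= (1/(n−1)) × eᵀBᵀBe
= eᵀ [(1/(n−1)) BᵀB] e
= eᵀ C e ← definition of covariance matrix
Since e is an eigenvector of C, we can substitute Ce = λe:
eᵀ C e = eᵀ(λe) = λ(eᵀe) = λ × 1 = λ
Therefore: Var(z) = λ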
The variance of the data projected onto eigenvector e is exactly equal to the eigenvalue λ. This is not a definition — it is a proven consequence of the eigenvector equation.
When we project our 4 data points onto PC1 (e₁), the resulting 1D scores are [−2.828, 0, 1.414, 1.414].
The variance of these scores = exactly 4.0.
Verify: mean of scores = (−2.828 + 0 + 1.414 + 1.414) / 4 = 0 ✓ (centred data has zero mean)
Var = [(−2.828)² + 0² + 1.414² + 1.414²] / (4−1) = [8 + 0 + 2 + 2] / 3 = 12/3 = 4.0 ✓
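The same check can be run for both components at once. The sketch below assumes NumPy; note that `np.linalg.eigh` returns eigenvalues in ascending order, so the PC2 values come first.

```python
import numpy as np

B = np.array([[-2, -2], [0, 0], [2, 0], [0, 2]], dtype=float)
C = np.cov(B, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(C)     # ascending: [4/3, 4]

# Project onto every eigenvector; column j holds the scores for eigvecs[:, j].
scores = B @ eigvecs

# Sample variance (divide by n - 1) of each column of scores.
score_var = scores.var(axis=0, ddof=1)

print("eigenvalues:       ", eigvals)
print("variance of scores:", score_var)
```

The result does not depend on the sign the solver chooses for each eigenvector, because flipping a direction flips every score and leaves the variance unchanged.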
Eigenvalue and Eigenvector Together — The Complete Picture
They are inseparable. An eigenvalue without its eigenvector is just a number. An eigenvector without its eigenvalue is just a direction. Together they answer two questions that completely describe a principal component:
| Question | Answer from Eigenvector |
|---|---|
| Which direction in feature space? | e = [0.707, 0.707] → 45° diagonal |
| How do features combine? | PC1 = 0.707·x₁ + 0.707·x₂ |
| Which features dominate this axis? | Both equally (same loading) |
| Are features increasing or opposing? | Both positive → move together |
| Are PCs uncorrelated? | Yes: e₁ · e₂ = 0 (orthogonal axes, uncorrelated scores) |
| Question | Answer from Eigenvalue |
|---|---|
| How much variance along this axis? | λ₁ = 4.0 units of variance |
| How important is this PC? | 75% of total variance |
| How many PCs to keep? | Until Σλₖ/Σλ ≥ 95% |
| Is this axis informative? | λ ≫ 0 → yes; λ ≈ 0 → discard |
| What is the total information? | Σλ = Trace(C) = 16/3 |
The eigenvector tells you the road's direction — which angle to build the highway. It could run north-south, east-west, or diagonally. The eigenvector picks the exact angle that maximises the spread of your data along that road.
The eigenvalue tells you how much traffic the road carries — the total variance of data spread along that highway. A large eigenvalue means data points are spread far apart on this road — it is a useful axis. A small eigenvalue means data points are all clustered close together on this road — the highway goes nowhere interesting.
PCA finds the best roads (eigenvectors) ranked by traffic (eigenvalues) and keeps only the busiest ones.
| Scenario | Eigenvalue | Eigenvector | What It Means |
|---|---|---|---|
| Two features perfectly correlated | λ₁ = total variance, λ₂ = 0 | e₁ lies along the line (e.g. [0.707, 0.707] when both variances are equal) | Data lies on a 1D line. One PC captures everything. Second PC carries no variance. |
| Two features completely uncorrelated | λ₁ = Var(x₁), λ₂ = Var(x₂) | e₁ = [1, 0], e₂ = [0, 1] | PCs = original axes. PCA adds nothing — features are already orthogonal. |
| Two features equal variance, correlated | λ₁ > λ₂, both non-zero | e₁ = [0.707, 0.707] | Our exact example. Data is an ellipse tilted at 45°. PCA rotates it upright. |
| Feature with zero variance | λ = 0 for that direction | Points along that feature | All samples have the same value for this feature. It is a constant — zero information. |
| All eigenvalues equal | λ₁ = λ₂ = ... = λₚ | Any orthogonal set works | Data cloud is spherical (equal spread in every direction). The principal directions are not unique, so PCA has no preferred direction. |
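
Two of the scenarios in the table can be reproduced with tiny synthetic datasets. The numbers below are made up for illustration only, and `eig_of_cov` is a hypothetical helper.

```python
import numpy as np

def eig_of_cov(data):
    """Eigenvalues (ascending) and eigenvectors of the sample covariance."""
    centred = data - data.mean(axis=0)
    return np.linalg.eigh(np.cov(centred, rowvar=False))

# Perfectly correlated features: x2 is exactly 2 * x1.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
on_a_line = np.column_stack([x1, 2 * x1])
vals_line, vecs_line = eig_of_cov(on_a_line)

# Uncorrelated features with different variances.
uncorrelated = np.array([[-1, 0], [1, 0], [0, -2], [0, 2]], dtype=float)
vals_unc, vecs_unc = eig_of_cov(uncorrelated)

print("perfect correlation, eigenvalues:", vals_line)
print("perfect correlation, PC1 (up to sign):", vecs_line[:, -1])
print("uncorrelated, eigenvalues:", vals_unc)
print("uncorrelated, eigenvectors (up to sign):\n", vecs_unc)
```

In the first case the smaller eigenvalue is zero up to round-off and PC1 follows the line x₂ = 2x₁. In the second case the eigenvectors are simply the original axes.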
Trace(C) = Σ eigenvalues = Total variance of all features
In our example: Trace(C) = C[1,1] + C[2,2] = 8/3 + 8/3 = 16/3
Sum of eigenvalues: 4 + 4/3 = 12/3 + 4/3 = 16/3 ✓
This identity means PCA is a conservation law: it never creates or destroys
variance. It only redistributes it — concentrating as much as possible
into the first few axes. The total information in your data is exactly preserved.
What you discard by keeping only k PCs is precisely the variance of the dropped components.
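The conservation identity is easy to confirm numerically. The sketch below assumes NumPy and uses illustrative names; `k = 1` matches the choice made in Step 6.

```python
import numpy as np

C = np.array([[8/3, 4/3], [4/3, 8/3]])

# eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
eigvals = np.linalg.eigvalsh(C)[::-1]    # reverse: largest first

total_variance = np.trace(C)             # sum of the per-feature variances
k = 1                                    # number of components kept

retained = eigvals[:k].sum() / total_variance
discarded = eigvals[k:].sum() / total_variance

print("trace(C):          ", total_variance)
print("sum of eigenvalues:", eigvals.sum())
print("retained fraction: ", retained)
print("discarded fraction:", discarded)
```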
Practical Significance — What These Numbers Mean for Your Data
Your data has one overwhelming dominant pattern. Almost all variation in your dataset can be described by a single axis. The data is effectively 1-dimensional. Check if this dominant PC correlates with a confounding variable (e.g. sample size, batch effect).
When eigenvalues are nearly equal, the corresponding eigenvectors are numerically unstable — small changes in data can swap them or rotate them into each other. They span the same 2D subspace, but their individual directions are unreliable. Do not over-interpret individual loadings from nearly degenerate eigenvalues.
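A small experiment illustrates this instability. The two matrices below are hypothetical covariance matrices that both sit within 10⁻⁶ of the identity, yet their leading eigenvectors point in very different directions. This is a sketch with made-up numbers, not a result from the worked example.

```python
import numpy as np

eps = 1e-6

# Two covariance matrices that differ from the identity (and from each other)
# only by entries of size eps.
C_a = np.eye(2) + eps * np.array([[1.0, 0.0], [0.0, 0.0]])
C_b = np.eye(2) + eps * np.array([[0.0, 1.0], [1.0, 0.0]])

def top_eigenvector(C):
    vals, vecs = np.linalg.eigh(C)       # ascending eigenvalues
    return vals, vecs[:, -1]             # eigenvector of the largest one

vals_a, v_a = top_eigenvector(C_a)
vals_b, v_b = top_eigenvector(C_b)

# Angle between the two leading directions (sign-insensitive).
cosine = abs(v_a @ v_b)
angle_deg = np.degrees(np.arccos(np.clip(cosine, 0.0, 1.0)))

print("eigenvalues A:", vals_a)
print("eigenvalues B:", vals_b)
print("angle between leading eigenvectors (degrees):", angle_deg)
```

The eigenvalues barely move, but the leading direction swings by about 45°.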
Features 3 and 7 together explain most of the variance in your data. They are the most informative features. They likely share an underlying cause — look for a domain explanation. This is also a signal that a simpler 2-feature model using only features 3 and 7 might perform nearly as well as the full model.
Yes, in almost every case. Near-zero eigenvalues mean near-zero variance, so those PCs are mostly noise directions. However, if you are doing anomaly detection, outliers sometimes appear in these low-variance directions (they differ from the main data cloud in unusual ways), so check before discarding them.