
PCA Dimensional Reduction Numerical

Section 01

The Story That Explains PCA

Projecting a Shadow on the Wall
Imagine you are holding a 3D sculpture and shining a torch at it from different angles. Depending on the angle, the shadow on the wall captures more or less of the sculpture's interesting shape. From some angles, the shadow is just a thin line — almost no information. From one special angle, the shadow captures the widest, most descriptive outline of the object.

PCA finds that special angle. It rotates your coordinate system so the first new axis — called the first principal component — points in the direction of maximum variance in your data. The shadow on that axis contains as much information as possible. Each subsequent component captures the next highest variance, perpendicular to all previous ones.

The result: you can describe your data in fewer dimensions while losing as little information as possible.

Principal Component Analysis (PCA) is an unsupervised linear transformation that projects data onto a new set of orthogonal axes (principal components), ordered by the amount of variance they explain. It is the workhorse of dimensionality reduction — used in data visualisation, noise filtering, feature extraction, and preprocessing before clustering or classification.

💡
The Core Goal

Given a dataset with p features, find a new set of k < p axes (principal components) such that the projected data retains maximum variance. The original p-dimensional data is represented in k dimensions with minimal information loss.


Section 02

The Problem — Dataset & Goal

We have a small dataset with 4 samples and 2 features (x₁, x₂). The goal is to reduce it from 2D to 1D using PCA, retaining the maximum variance possible. We will solve every step by hand — exactly as you would in a Data Science exam notebook.

📋 Given Dataset — X (4 samples × 2 features)
Sample | Feature x₁ | Feature x₂
S1 | 1 | 2
S2 | 3 | 4
S3 | 5 | 4
S4 | 3 | 6
✍️
PCA Roadmap — 7 Steps

① Compute the mean of each feature  →  ② Mean-centre the data  →  ③ Compute the covariance matrix  →  ④ Solve for eigenvalues  →  ⑤ Compute eigenvectors  →  ⑥ Select principal components (explained variance)  →  ⑦ Project data onto chosen components


Section 03

Step 1 — Compute the Mean of Each Feature

PCA is sensitive to the spread of data, not its absolute position. We must first shift the data so it is centred at the origin. Before that, we need the mean of each feature.

Formula: x̄ⱼ = (1/n) × Σᵢ xᵢⱼ
Mean of feature x₁:
x̄₁ = (x₁₁ + x₂₁ + x₃₁ + x₄₁) / 4
x̄₁ = (1 + 3 + 5 + 3) / 4
x̄₁ = 12 / 4
x̄₁ = 3

Mean of feature x₂:
x̄₂ = (x₁₂ + x₂₂ + x₃₂ + x₄₂) / 4
x̄₂ = (2 + 4 + 4 + 6) / 4
x̄₂ = 16 / 4
x̄₂ = 4

Mean vector: μ = [3, 4]
Mean of x₁
x̄₁ = 12 / 4 = 3
Sum of all x₁ values divided by number of samples (n = 4)
Mean of x₂
x̄₂ = 16 / 4 = 4
Sum of all x₂ values divided by number of samples (n = 4)

Section 04

Step 2 — Mean-Centre the Data (Subtract the Mean)

We subtract the mean vector μ = [3, 4] from every data point to produce the centred matrix B. This shifts the data cloud so its centroid sits exactly at the origin — a prerequisite for covariance analysis.

Formula: B = X − μ (subtract mean row-by-row)
B₁ = (1 − 3, 2 − 4) = (−2, −2)
B₂ = (3 − 3, 4 − 4) = ( 0, 0)
B₃ = (5 − 3, 4 − 4) = ( 2, 0)
B₄ = (3 − 3, 6 − 4) = ( 0, 2)
Original Matrix X
Sample | x₁ | x₂
S1 | 1 | 2
S2 | 3 | 4
S3 | 5 | 4
S4 | 3 | 6
Centred Matrix B = X − μ
Sample | b₁ | b₂
S1 | −2 | −2
S2 | 0 | 0
S3 | 2 | 0
S4 | 0 | 2
Sanity Check — Column Sums Must Equal Zero

Sum of b₁: −2 + 0 + 2 + 0 = 0 ✓
Sum of b₂: −2 + 0 + 0 + 2 = 0 ✓
If either sum ≠ 0, you've made an arithmetic error in the mean or subtraction.
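Steps 1 and 2, together with the column-sum sanity check, can be reproduced in a few lines of NumPy. This is a minimal sketch; the variable names (`X`, `mu`, `B`, `col_sums`) are illustrative choices, not anything prescribed by the method.

```python
import numpy as np

# Dataset from Section 02: rows are samples S1..S4, columns are x1 and x2
X = np.array([[1, 2], [3, 4], [5, 4], [3, 6]], dtype=float)

# axis=0 averages down each column, giving one mean per feature
mu = X.mean(axis=0)

# Broadcasting subtracts the mean vector from each of the 4 rows
B = X - mu

# Sanity check: every column of the centred matrix should sum to zero
col_sums = B.sum(axis=0)

print("mu =", mu)
print("B =\n", B)
print("column sums =", col_sums)
```

Using `np.allclose(col_sums, 0)` rather than an exact comparison is the safer check on real data, where floating-point rounding leaves tiny non-zero residues.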


Section 05

Step 3 — Compute the Covariance Matrix

The covariance matrix C captures how much each pair of features varies together. It is a symmetric p × p matrix. For our 2-feature dataset it is 2 × 2.

📚
Formula

C = (1 / (n−1)) × BᵀB
where B is the mean-centred matrix and n is the number of samples. We divide by (n−1) for the unbiased sample covariance estimate. With n = 4, the divisor is 3.

Matrix Multiplication: Bᵀ (2×4) × B (4×2) → BᵀB (2×2)
Bᵀ =
Row 1 (b₁ column): [−2, 0, 2, 0]
Row 2 (b₂ column): [−2, 0, 0, 2]

BᵀB[1,1] = (−2)(−2) + (0)(0) + (2)(2) + (0)(0) = 4 + 0 + 4 + 0 = 8
BᵀB[1,2] = (−2)(−2) + (0)(0) + (2)(0) + (0)(2) = 4 + 0 + 0 + 0 = 4
BᵀB[2,1] = (−2)(−2) + (0)(0) + (0)(2) + (2)(0) = 4 + 0 + 0 + 0 = 4
BᵀB[2,2] = (−2)(−2) + (0)(0) + (0)(0) + (2)(2) = 4 + 0 + 0 + 4 = 8

BᵀB = [[8, 4], [4, 8]]
C = (1/3) × BᵀB
C = (1/3) × [[8, 4], [4, 8]]

C[1,1] = 8/3 ≈ 2.667    ← Variance of x₁
C[1,2] = 4/3 ≈ 1.333    ← Covariance of x₁ and x₂
C[2,1] = 4/3 ≈ 1.333    ← Covariance of x₂ and x₁ (symmetric)
C[2,2] = 8/3 ≈ 2.667    ← Variance of x₂

C = [[8/3, 4/3], [4/3, 8/3]]
Covariance Matrix C
   | x₁ | x₂
x₁ | 8/3 ≈ 2.667 | 4/3 ≈ 1.333
x₂ | 4/3 ≈ 1.333 | 8/3 ≈ 2.667
🔑
Reading the Covariance Matrix

Diagonal entries (C[1,1] and C[2,2]) = variance of each feature individually. Off-diagonal entries (C[1,2] = C[2,1]) = how much x₁ and x₂ change together. A positive off-diagonal (1.333) means as x₁ increases, x₂ tends to increase too — they are positively correlated. If the matrix were diagonal, features would be uncorrelated and PCA would produce the original axes unchanged.
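As a cross-check on the hand calculation, the sketch below computes C from the formula BᵀB / (n−1) and compares it with NumPy's built-in `np.cov`. The names `C_manual` and `C_numpy` are illustrative.

```python
import numpy as np

# Centred matrix B from Step 2
B = np.array([[-2, -2], [0, 0], [2, 0], [0, 2]], dtype=float)
n = B.shape[0]

# Explicit formula: C = B^T B / (n - 1)
C_manual = B.T @ B / (n - 1)

# rowvar=False tells np.cov that columns are features;
# its default divisor is also n - 1
C_numpy = np.cov(B, rowvar=False)

print(C_manual)
print("matches np.cov:", np.allclose(C_manual, C_numpy))
```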


Section 06

Step 4 — Solve for Eigenvalues

The eigenvalues of the covariance matrix tell us how much variance each principal component explains. We find them by solving the characteristic equation: det(C − λI) = 0

Subtract λ from the diagonal, take the determinant, set to zero
C − λI = [[8/3 − λ, 4/3], [4/3, 8/3 − λ]]

det(C − λI) = (8/3 − λ)(8/3 − λ) − (4/3)(4/3)
              = (8/3 − λ)² − 16/9

Set equal to zero:
(8/3 − λ)² − 16/9 = 0
(8/3 − λ)² = 16/9
8/3 − λ = ± 4/3
Two solutions from the ± sign
Case 1 (positive sign):
8/3 − λ = +4/3
λ = 8/3 − 4/3 = 4/3

Case 2 (negative sign):
8/3 − λ = −4/3
λ = 8/3 + 4/3 = 12/3 = 4

We sort largest first:
λ₁ = 4    (larger → PC1)
λ₂ = 4/3 ≈ 1.333    (smaller → PC2)
λ₁
Eigenvalue 1
λ₁ = 4.000
Corresponds to the first principal component (PC1). Larger value = more variance captured.
λ₂
Eigenvalue 2
λ₂ = 4/3 ≈ 1.333
Corresponds to the second principal component (PC2). Smaller value = less variance.
Σλ
Total Variance
4 + 4/3 = 16/3 ≈ 5.333
Sum of all eigenvalues equals total variance in the dataset. Used to compute explained variance %.
⚠️
Common Mistake — Always Sort Eigenvalues Descending

PCA requires eigenvalues to be sorted from largest to smallest. PC1 always corresponds to the largest eigenvalue (most variance), PC2 to the second largest, and so on. Never assign PCs in the order the solver returns them without sorting first.
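For a 2 × 2 matrix the characteristic equation expands to λ² − trace(C)·λ + det(C) = 0, so the eigenvalues can be checked by finding the roots of that quadratic. The sketch below does this with `np.roots`; it is a verification aid for this small example, not how eigenvalues are computed for large matrices in practice.

```python
import numpy as np

C = np.array([[8/3, 4/3], [4/3, 8/3]])

# Coefficients of det(C - lambda*I) = lambda^2 - trace(C)*lambda + det(C)
trace = np.trace(C)
det = np.linalg.det(C)

# np.roots expects polynomial coefficients, highest power first
roots = np.roots([1.0, -trace, det])

# Sort descending so the first entry belongs to PC1
eigenvalues = np.sort(roots.real)[::-1]
print(eigenvalues)
```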


Section 07

Step 5 — Compute Eigenvectors (Principal Directions)

Eigenvectors give us the directions of the new axes. For each eigenvalue λ, we solve (C − λI)v = 0 and then normalise the resulting vector to unit length.

(C − 4I)v = 0 → Find the null space
C − 4I = [[8/3 − 4, 4/3], [4/3, 8/3 − 4]]
         = [[8/3 − 12/3, 4/3], [4/3, 8/3 − 12/3]]
         = [[−4/3, 4/3], [4/3, −4/3]]

Take row 1:   −4/3 · v₁ + 4/3 · v₂ = 0
             → −v₁ + v₂ = 0
             → v₁ = v₂

Choose v = [1, 1], then normalise:
‖v‖ = √(1² + 1²) = √2
e₁ = [1/√2, 1/√2] ≈ [0.707, 0.707]
(C − 4/3 · I)v = 0 → Find the null space
C − (4/3)I = [[8/3 − 4/3, 4/3], [4/3, 8/3 − 4/3]]
            = [[4/3, 4/3], [4/3, 4/3]]

Take row 1:   4/3 · v₁ + 4/3 · v₂ = 0
             → v₁ + v₂ = 0
             → v₁ = −v₂

Choose v = [1, −1], then normalise:
‖v‖ = √(1² + (−1)²) = √2
e₂ = [1/√2, −1/√2] ≈ [0.707, −0.707]
PC1 Direction
e₁ = [0.707, 0.707]
Points diagonally up-right at 45°. This is the direction in feature space along which the data spreads the most. Both features contribute equally and positively to this component.
PC2 Direction
e₂ = [0.707, −0.707]
Orthogonal to PC1, pointing diagonally down-right at −45°. Captures the remaining variance. One feature is positive, the other negative — it captures the contrast between them.
Orthogonality Check
e₁ · e₂ = 0
e₁ · e₂ = (0.707)(0.707) + (0.707)(−0.707) = 0.5 − 0.5 = 0 ✓
Principal components are always perpendicular to each other. Non-zero dot product means calculation error.
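The hand-derived eigenvectors can be checked against the defining property Cv = λv, along with unit length and orthogonality. A minimal sketch, with illustrative variable names:

```python
import numpy as np

C = np.array([[8/3, 4/3], [4/3, 8/3]])
e1 = np.array([1.0, 1.0]) / np.sqrt(2)    # PC1 direction
e2 = np.array([1.0, -1.0]) / np.sqrt(2)   # PC2 direction

# Defining property: multiplying by C only rescales the vector
pc1_ok = np.allclose(C @ e1, 4.0 * e1)
pc2_ok = np.allclose(C @ e2, (4/3) * e2)

# Unit length and orthogonality
norms = np.array([np.linalg.norm(e1), np.linalg.norm(e2)])
dot = e1 @ e2

print(pc1_ok, pc2_ok, norms, dot)
```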

Section 08

Step 6 — Explained Variance & Choosing Components

The eigenvalue tells us how much variance each principal component captures. We express this as a percentage of total variance to decide how many components to keep. This is the heart of the dimensionality reduction decision.

Formula: Explained Variance (PCₖ) = λₖ / Σλᵢ × 100%
Total variance:
Σλ = λ₁ + λ₂ = 4 + 4/3 = 12/3 + 4/3 = 16/3 ≈ 5.333

PC1 explained variance:
EV₁ = λ₁ / Σλ = 4 / (16/3) = 4 × (3/16) = 12/16 = 3/4 = 75.0 %

PC2 explained variance:
EV₂ = λ₂ / Σλ = (4/3) / (16/3) = 4/16 = 1/4 = 25.0 %

Cumulative (PC1 + PC2) = 75% + 25% = 100% ✓
Component | Eigenvalue | Explained Variance | Cumulative Variance | Keep?
PC1 | 4.000 | 75.0 % | 75.0 % | YES
PC2 | 1.333 | 25.0 % | 100.0 % | OPTIONAL
🎯
Decision — Keep 1 Component

PC1 alone captures 75% of the total variance. By keeping only 1 component we reduce the data from 2D to 1D, losing just 25% of the information. In practice, a threshold of ≥ 95% cumulative variance is common for larger datasets. For this demo, we proceed with PC1 only.
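The threshold rule can be written as a small helper. The function name `components_for_threshold` and its signature are hypothetical, introduced only for illustration; it returns the smallest k whose cumulative explained variance reaches the chosen threshold.

```python
import numpy as np

def components_for_threshold(eigenvalues, threshold):
    """Smallest k whose cumulative explained variance is >= threshold."""
    # Sort descending and convert to fractions of total variance
    ratios = np.sort(eigenvalues)[::-1] / np.sum(eigenvalues)
    cumulative = np.cumsum(ratios)
    # searchsorted returns the first index where cumulative >= threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against rounding pushing k past the number of components
    return min(k, len(ratios))

eigenvalues = [4.0, 4/3]
k_70 = components_for_threshold(eigenvalues, 0.70)
k_95 = components_for_threshold(eigenvalues, 0.95)
print(k_70, k_95)
```

On this dataset a 70% threshold is met by PC1 alone, whereas a 95% threshold would force us to keep both components, which is why the demo deliberately accepts a lower threshold.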


Section 09

Step 7 — Project Data onto PC1 (Reduce to 1D)

We now project the mean-centred data matrix B onto the first principal component direction e₁. The projection is simply a dot product: multiply each centred data row by the eigenvector.

Formula: Z = B · e₁ where e₁ = [1/√2, 1/√2]
Remember B = [[−2,−2], [0,0], [2,0], [0,2]] and e₁ = [1/√2, 1/√2] ≈ [0.707, 0.707]

Z₁ = (−2)(1/√2) + (−2)(1/√2) = −2/√2 − 2/√2 = −4/√2 = −4 × (√2/2) = −2√2 ≈ −2.828

Z₂ = (0)(1/√2) + (0)(1/√2) = 0 + 0 = 0.000

Z₃ = (2)(1/√2) + (0)(1/√2) = 2/√2 + 0 = 2/√2 = √2 ≈ 1.414

Z₄ = (0)(1/√2) + (2)(1/√2) = 0 + 2/√2 = 2/√2 = √2 ≈ 1.414

Projected Data Z = [−2.828, 0, 1.414, 1.414]
Original 2D Data (X)
Sample | x₁ | x₂
S1 | 1 | 2
S2 | 3 | 4
S3 | 5 | 4
S4 | 3 | 6
Reduced 1D Data (Z on PC1)
Sample | PC1 Score
S1 | −2.828
S2 | 0.000
S3 | 1.414
S4 | 1.414
🎉
Result — Dimensionality Reduced!

We successfully transformed 4 data points from 2 dimensions down to 1 dimension, retaining 75% of the total variance. S3 and S4 map to the same PC1 score (1.414) — they overlap in the principal direction, but differ in PC2 (the discarded component). This is the inherent trade-off of dimensionality reduction.
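One way to see the trade-off concretely is to map the 1D scores back into 2D and measure what was lost. The reconstruction step below (X̂ = Z·e₁ᵀ + μ) is an illustrative addition that is not part of the seven-step roadmap, and the variable names are assumptions.

```python
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 4], [3, 6]], dtype=float)
mu = X.mean(axis=0)
B = X - mu
e1 = np.array([1.0, 1.0]) / np.sqrt(2)

# Step 7: project each centred row onto PC1
Z = B @ e1

# Map the 1D scores back to 2D along e1 and restore the mean
X_hat = np.outer(Z, e1) + mu

# Total squared reconstruction error across all samples
sq_error = np.sum((X - X_hat) ** 2)

print("Z =", Z)
print("X_hat =\n", X_hat)
print("squared error =", sq_error)
```

S3 and S4 both reconstruct to the point (4, 5), and the total squared error equals (n−1)·λ₂ = 3 × 4/3 = 4, which is exactly the variance carried by the discarded PC2.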


Section 10

Full Solution Summary — One View

Here is the complete PCA solution in condensed form, exactly as you would write it in a final exam answer:

📄 Complete Worked Solution at a Glance
Step 1
Means: x̄₁ = 3, x̄₂ = 4  →  μ = [3, 4]
Step 2
Centred B: [−2,−2], [0,0], [2,0], [0,2]
Step 3
Covariance: C = [[8/3, 4/3], [4/3, 8/3]]
Step 4
Eigenvalues: λ₁ = 4, λ₂ = 4/3   (from det(C−λI)=0)
Step 5
Eigenvectors: e₁ = [0.707, 0.707], e₂ = [0.707, −0.707]
Step 6
Variance: PC1 = 75%, PC2 = 25%  →  Keep PC1 only
Step 7
Projection Z: [−2.828, 0.000, 1.414, 1.414]

Section 11

Key Formulae at a Glance

Mean Centering
B = X − μ
Subtract the feature-wise mean from every row. Shifts data centroid to origin.
Covariance Matrix
C = BᵀB / (n−1)
Symmetric p×p matrix. Diagonal = variances. Off-diagonal = covariances.
Characteristic Equation
det(C − λI) = 0
Solve for eigenvalues λ. Each λ is the variance explained by its corresponding PC.
Eigenvector Equation
(C − λI)v = 0
Solve for each eigenvector v. Normalise to unit length: e = v / ‖v‖
Explained Variance
EVₖ = λₖ / Σλᵢ
Fraction of total variance captured by principal component k. Sum to 1 (100%).
Projection
Z = B · W
W = matrix of selected eigenvectors (columns). Z is the reduced dataset.

Section 12

Three Concepts Every PCA Student Must Understand

📈
Variance = Information
Why we maximise it
PCA defines "information" as variance. A feature with zero variance tells you nothing — every sample is identical on that axis. The principal component with the most variance is the single axis that spreads your data points the furthest apart — it is the most discriminating direction in the data.
Orthogonality
Why PCs are uncorrelated
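Each principal component direction is perpendicular (orthogonal) to all the others. Because these directions are eigenvectors of the covariance matrix, the resulting PC scores have zero covariance with one another. PCA thereby removes linear redundancy: the original features may be correlated, but the PC scores computed on the same data are not.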
🕑
Scree Plot
How to pick k
Plot eigenvalues in descending order. Look for the elbow — the point where the drop-off flattens. Components before the elbow explain most variance; components after are mostly noise. Alternatively, keep components until cumulative variance ≥ 95%.
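The claim that PC scores are uncorrelated can be checked on the worked example: the covariance matrix of the scores should be diagonal, with the eigenvalues on the diagonal. In this sketch, `W` stacks the hand-computed e₁ and e₂ as columns.

```python
import numpy as np

B = np.array([[-2, -2], [0, 0], [2, 0], [0, 2]], dtype=float)

# Columns are e1 and e2 from Step 5
W = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)

# Scores on both principal components
Z = B @ W

# Covariance of the scores: off-diagonal entries should vanish
C_Z = np.cov(Z, rowvar=False)
print(np.round(C_Z, 6))
```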

Section 13

When to Use PCA — and When Not To

📊
Visualisation
2D/3D plotting
Reduce 100-feature data to 2 or 3 PCs for scatter plots. Makes cluster structures visible that are invisible in high dimensions.
🔈
Noise Filtering
Drop low-variance PCs
Small eigenvalues often correspond to noise. Discarding the last few PCs removes noise while preserving signal — used in image compression and genetics.
🔗
Multicollinearity Fix
Before regression
Highly correlated features make OLS coefficient estimates unstable, with inflated standard errors. PCA creates uncorrelated PCs, removing multicollinearity before a linear model is fitted (principal component regression, PCR).
Speed Up Training
Preprocessing
Reduce 500 features to 50 PCs before training SVM or kNN. Runtime drops dramatically with minimal accuracy loss on well-structured data.
Avoid: Tree Models
Not helpful
Random Forest and XGBoost cope with high dimensions and correlated features natively. PCA removes feature interpretability and usually brings little or no performance benefit for trees.
Avoid: Non-linear Data
Use kernel PCA instead
PCA only captures linear relationships. Curved or spiral data structures require kernel PCA, t-SNE, or UMAP to reveal the true low-dimensional manifold.

Section 14

Python Implementation — Verify Our Hand Calculation

Let us reproduce the exact numerical example above in Python — both manually and using scikit-learn. The outputs should match our hand-solved values to within floating-point precision.

import numpy as np
from sklearn.decomposition import PCA

# ── Original dataset ──────────────────────────────────────────
X = np.array([
    [1, 2],   # S1
    [3, 4],   # S2
    [5, 4],   # S3
    [3, 6],   # S4
], dtype=float)

# ── MANUAL PCA ────────────────────────────────────────────────
# Step 1: Mean vector
mu = X.mean(axis=0)
print(f"Mean vector μ: {mu}")   # [3. 4.]

# Step 2: Mean-centre
B = X - mu
print(f"\nCentred matrix B:\n{B}")

# Step 3: Covariance matrix (ddof=1 → divide by n-1)
C = np.cov(B, rowvar=False)     # equivalent to BᵀB / (n-1)
print(f"\nCovariance matrix C:\n{C}")

# Step 4 & 5: Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(C)

# eigh returns ascending order — reverse to descending
idx = np.argsort(eigenvalues)[::-1]
eigenvalues  = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

print(f"\nEigenvalues (sorted): {eigenvalues}")   # [4.  1.333]
print(f"Eigenvectors (columns):\n{eigenvectors}")

# Step 6: Explained variance
ev_ratio = eigenvalues / eigenvalues.sum()
print(f"\nExplained variance ratio: {ev_ratio}")  # [0.75  0.25]
print(f"Cumulative:              {ev_ratio.cumsum()}")

# Step 7: Project onto PC1 only
W  = eigenvectors[:, :1]      # take first column only
Z  = B @ W                      # matrix multiply
print(f"\nProjected data Z (1D):\n{Z.flatten()}")
OUTPUT
Mean vector μ: [3. 4.]

Centred matrix B:
[[-2. -2.]
 [ 0.  0.]
 [ 2.  0.]
 [ 0.  2.]]

Covariance matrix C:
[[2.66666667 1.33333333]
 [1.33333333 2.66666667]]

Eigenvalues (sorted): [4.         1.33333333]
Eigenvectors (columns):
[[ 0.70710678  0.70710678]
 [ 0.70710678 -0.70710678]]

Explained variance ratio: [0.75 0.25]
Cumulative:              [0.75 1.  ]

Projected data Z (1D):
[-2.82842712  0.          1.41421356  1.41421356]
Output Matches Hand Calculation Exactly

λ₁ = 4.0 ✓   λ₂ = 1.333 ✓   EV = 75%, 25% ✓   Z = [−2.828, 0, 1.414, 1.414] ✓
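All values match our step-by-step notebook solution, up to the arbitrary sign of each eigenvector (a solver may return −e in place of e, which flips the sign of the corresponding scores). NumPy uses the same mathematical definitions, which confirms we solved the problem correctly by hand.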

# ── SCIKIT-LEARN PCA (one-liner verification) ─────────────────
pca = PCA(n_components=1)
Z_sk = pca.fit_transform(X)    # sklearn centres internally

print(f"sklearn Z:          {Z_sk.flatten()}")
print(f"sklearn EV ratio:   {pca.explained_variance_ratio_}")
print(f"sklearn eigenvalue: {pca.explained_variance_}")
print(f"sklearn component:  {pca.components_}")
OUTPUT
sklearn Z:          [-2.82842712  0.          1.41421356  1.41421356]
sklearn EV ratio:   [0.75]
sklearn eigenvalue: [4.]
sklearn component:  [[0.70710678 0.70710678]]
📚
Note on Sign Conventions

The sign of eigenvectors is arbitrary (both [0.707, 0.707] and [−0.707, −0.707] are valid). Some solvers may flip the sign of the PC scores. This does not affect variance explained, relative distances, or model accuracy — only the direction of the axis label. sklearn's PCA uses a deterministic sign convention; always check consistency when comparing implementations.
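When comparing two implementations, it helps to force both onto one sign convention first. The helper below is a sketch of one possible convention (make the largest-magnitude entry of each eigenvector positive). The name `fix_signs` is hypothetical, and this is not claimed to be the exact rule sklearn applies internally.

```python
import numpy as np

def fix_signs(W):
    """Flip each column so its largest-magnitude entry is positive."""
    W = np.asarray(W, dtype=float)
    # Row index of the largest |entry| in each column
    idx = np.argmax(np.abs(W), axis=0)
    signs = np.sign(W[idx, np.arange(W.shape[1])])
    signs[signs == 0] = 1.0          # leave all-zero columns untouched
    return W * signs                 # broadcasts one sign per column

# The same eigenvectors as in Step 5, but with PC1 returned as -e1
W_solver = np.array([[-1.0, 1.0], [-1.0, -1.0]]) / np.sqrt(2)
aligned = fix_signs(W_solver)
print(aligned)
```

If the eigenvectors are flipped, the scores must be recomputed (or multiplied by the same signs) so that the two stay consistent.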


Section 15

PCA vs Other Dimensionality Reduction Methods

Property | PCA | t-SNE | UMAP | LDA
Linearity | Linear | Non-linear | Non-linear | Linear
Supervised? | No (unsupervised) | No | No | Yes (needs labels)
Preserves | Global variance | Local clusters | Local & global | Class separability
Output usable for ML? | Yes | No (stochastic) | Sometimes | Yes
Scales to high-d? | Yes (TruncatedSVD) | Slow (>50k samples) | Good | Yes
Best for | Preprocessing, noise removal | 2D/3D visualisation | Exploration, clustering | Classification preprocessing

Section 16

Golden Rules — PCA Non-Negotiables

🌲 PCA — Rules You Must Never Break
1
Always standardise features before PCA when features are on different scales (e.g. age in years vs. salary in lakhs). Without standardisation, the feature with the largest variance will dominate the leading principal components, even if it is not the most informative. In sklearn, place StandardScaler before PCA in the pipeline whenever the scales differ.
2
Fit the scaler and PCA on training data only. Apply the learned transform to test data. Fitting on the full dataset leaks future information into the PCA axes — this is data leakage and inflates performance metrics in evaluation.
3
Sum of all eigenvalues = Trace of covariance matrix = Total variance. Always verify this: if eigenvalues don't sum to the trace, the eigendecomposition went wrong. In our example: λ₁ + λ₂ = 4 + 4/3 = 16/3 = Trace(C) = 8/3 + 8/3 = 16/3 ✓
4
Use a scree plot or cumulative explained variance curve to choose k, not an arbitrary number. A common threshold is keeping enough components to explain ≥ 95% cumulative variance. For visualisation, use 2 or 3 components.
5
PCA destroys feature interpretability. A PC is a linear combination of all original features — you cannot say "PC1 is height" or "PC2 is weight". If your downstream task requires human-interpretable features (e.g. clinical reporting, regulatory compliance), do NOT use PCA. Use feature selection or manual engineering instead.
6
PCA is linear-only. It cannot capture non-linear structures in data. If your data lies on a curved manifold (e.g. Swiss roll, concentric rings), PCA will produce poor projections. Use kernel PCA (with RBF kernel) or UMAP / t-SNE for non-linear dimensionality reduction.
7
For very large datasets (millions of rows, thousands of features), use PCA(svd_solver='randomized') or sklearn.decomposition.TruncatedSVD. Note that TruncatedSVD does not centre the data, so it matches PCA only on data that is already mean-centred. The full exact SVD on a 1M × 1000 matrix is expensive. Randomised PCA usually gives nearly identical leading components at a fraction of the compute cost.
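Rules 1 and 2 combine naturally in an sklearn pipeline. The sketch below uses synthetic data (an assumption made purely for illustration) with three features on very different scales, fits the scaler and PCA on the training rows only, and then applies the learned transform to the held-out rows.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only: 3 features on different scales
rng = np.random.default_rng(0)
X_all = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
X_train, X_test = X_all[:150], X_all[150:]

# Scaling and PCA chained so they are always fitted together
pipe = make_pipeline(StandardScaler(), PCA(n_components=2))

# fit() sees the training rows only, so no test information leaks in
pipe.fit(X_train)

# transform() reuses the training means, scales, and principal axes
Z_train = pipe.transform(X_train)
Z_test = pipe.transform(X_test)

print(Z_train.shape, Z_test.shape)
```

Calling `pipe.fit_transform(X_all)` before splitting would be the leakage pattern that rule 2 warns against.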

Section 17

The Real Significance of Eigenvectors

Most textbooks define eigenvectors as "directions of maximum variance" and move on. That is true but dangerously shallow. Here is what an eigenvector actually means — geometrically, algebraically, and in the context of your data.

What direction does the covariance matrix NOT rotate?
A matrix is a transformation. When you multiply any vector v by a matrix C, the vector generally changes direction — it gets rotated and stretched.

An eigenvector is a special vector that the matrix does not rotate. It only gets stretched or compressed. The matrix multiplies it by a scalar — the eigenvalue — and that is it.

Cv = λv

Read this as: "When the covariance matrix C acts on eigenvector v, the result is just a scaled version of v itself — same direction, different magnitude."

In PCA: the covariance matrix encodes how your data varies. Its eigenvectors are the directions in feature space where this variation is pure — no mixing, no rotation. They are the natural axes of the data's shape.
🔭
Geometric Meaning — The Data Cloud Has a Shape

Imagine your data points as a cloud in 2D space. This cloud is probably not circular — it is likely an ellipse stretched in some direction. The eigenvectors of the covariance matrix point along the axes of that ellipse. The longest axis of the ellipse = first eigenvector (PC1). The shortest axis = second eigenvector (PC2). They are always perpendicular to each other. Always.

📏
Eigenvector = Direction
Not a point — a compass bearing
An eigenvector tells you which way to look in feature space to see the most spread. In our example, e₁ = [0.707, 0.707] means "look diagonally — equally in the x₁ and x₂ directions simultaneously." It is a new axis drawn through your data at 45°.
Eigenvector = Rotation
It rotates your coordinate system
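The matrix of eigenvectors W is an orthogonal matrix: a rotation, possibly combined with a reflection (with the signs chosen in our example, det W = −1, so one axis is also flipped). When you compute Z = B·W, you are re-expressing the entire dataset so that the axis of maximum spread becomes the new x-axis (PC1), and the next axis of spread becomes the new y-axis (PC2). No stretching: distances and angles are preserved.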
🔨
Eigenvector = Recipe
A linear combination of features
Each eigenvector is a recipe that tells you how to mix your original features to create a new, more informative axis. In our example: PC1 = 0.707·x₁ + 0.707·x₂ — add both features equally. PC2 = 0.707·x₁ − 0.707·x₂ — subtract them. These mixtures are uncorrelated by construction.
Because uncorrelated axes must be orthogonal in Euclidean space
The covariance matrix C is symmetric (C = Cᵀ). A fundamental theorem of linear algebra — the Spectral Theorem — guarantees that any real symmetric matrix has:

① Real eigenvalues (no complex numbers)
② Eigenvectors that are mutually orthogonal (perpendicular to each other)
③ A complete set of eigenvectors that span the entire feature space

This is why PCA works. If the covariance matrix were not symmetric, eigenvectors could point in complex directions or fail to be orthogonal — and the whole method would break down. Symmetry of C is guaranteed because covariance is symmetric: Cov(x₁,x₂) always equals Cov(x₂,x₁).
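The three guarantees of the Spectral Theorem can be observed directly on our covariance matrix. This is a numerical illustration on one example, not a proof; `np.linalg.eigh` is NumPy's eigen-solver for symmetric matrices.

```python
import numpy as np

C = np.array([[8/3, 4/3], [4/3, 8/3]])

# eigh assumes a symmetric matrix and returns real eigenvalues in ascending order
eigenvalues, W = np.linalg.eigh(C)

# (1) Real eigenvalues
is_real = np.isrealobj(eigenvalues)

# (2) Mutually orthogonal unit eigenvectors: W^T W = I
is_orthonormal = np.allclose(W.T @ W, np.eye(2))

# (3) Complete set: C can be rebuilt as W diag(lambda) W^T
rebuilt = W @ np.diag(eigenvalues) @ W.T

print(is_real, is_orthonormal, np.allclose(rebuilt, C))
```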
💡 What Each Eigenvector Component Value Means — Interpreting e₁ = [0.707, 0.707]
0.707
The loading of feature x₁ on PC1. How much x₁ contributes to this principal component. Large positive value = strong positive contribution.
0.707
The loading of feature x₂ on PC1. Both features contribute equally (0.707 = 0.707), so PC1 is purely the sum of both features — it captures their shared movement.
Sign
In e₂ = [0.707, −0.707]: x₁ loads positively, x₂ loads negatively. PC2 captures how the features diverge — when one goes up and the other goes down.
Magnitude
Larger absolute loading = that feature dominates this PC. A loading near 0 means the feature barely influences this axis. Loadings always satisfy ‖e‖ = 1.
⚠️
The Critical Implication — Eigenvectors Destroy Feature Names

Once you rotate into PC space, there is no "x₁" or "x₂" anymore. PC1 is a blend of all original features. This is why PCA projections are hard to interpret: you cannot simply say "a patient scores high on PC1 because their blood pressure is high." If interpretability matters for your task, this can be a showstopper. Use feature selection or SHAP-based methods instead.


Section 18

The Real Significance of Eigenvalues

The eigenvalue is often summarised as "the variance explained." That is correct. But why is the eigenvalue the variance? Where does that come from? And what exactly does the number 4 mean in our example? Let us build the answer from scratch.

The variance of projected data equals the eigenvalue. Here is the proof.
Let e be a unit eigenvector. The projected data along this direction is z = Be (a column vector).

Variance of the projection:
Var(z) = (1/(n−1)) × zᵀz
        = (1/(n−1)) × (Be)ᵀ(Be)
        = (1/(n−1)) × eᵀBᵀBe
        = eᵀ [(1/(n−1)) BᵀB] e
        = eᵀ C e    ← definition of covariance matrix

Since e is an eigenvector of C:   Ce = λe

        = eᵀ(λe) = λ(eᵀe) = λ × 1 = λ

Therefore: Var(z) = λ

The variance of the data projected onto eigenvector e is exactly equal to the eigenvalue λ. This is not a definition — it is a proven consequence of the eigenvector equation.
💡
The Meaning of λ₁ = 4 in Our Example

When we project our 4 data points onto PC1 (e₁), the resulting 1D scores are [−2.828, 0, 1.414, 1.414]. The variance of these scores = exactly 4.0.

Verify: mean of scores = (−2.828 + 0 + 1.414 + 1.414) / 4 = 0 ✓ (centred data has zero mean)
Var = [(−2.828)² + 0² + 1.414² + 1.414²] / (4−1) = [8 + 0 + 2 + 2] / 3 = 12/3 = 4.0 ✓

📊 What the Eigenvalue Number Physically Represents
λ = 4
The data points, when projected onto PC1, have a spread (variance) of 4. This is the largest variance achievable by projecting onto any single unit-length direction. No other direction gives you more than 4.
λ = 4/3
Projected onto PC2, the data has a spread (variance) of only 1.333. This axis captures less information. The data is less spread out in this direction.
λ → 0
An eigenvalue near zero means almost zero variance in that direction. The data is nearly flat along that axis — no information there. These are the PCs you discard.
λ = 0
An exactly zero eigenvalue means the features are perfectly linearly dependent (multicollinear). The matrix is singular and that direction is pure redundancy — discarding it loses nothing.
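The statement that no direction beats PC1 can be checked by brute force: project the centred data onto unit vectors at every whole degree from 0° to 179° and compare the variances. This is a sketch for the 2D example only; the one-degree grid is an arbitrary choice.

```python
import numpy as np

B = np.array([[-2, -2], [0, 0], [2, 0], [0, 2]], dtype=float)

# Unit vectors at 0, 1, ..., 179 degrees (directions repeat after 180)
angles = np.deg2rad(np.arange(180))
directions = np.column_stack([np.cos(angles), np.sin(angles)])

# Column j of `scores` holds the projections onto direction j
scores = B @ directions.T

# Sample variance (divisor n - 1) of each set of projections
variances = scores.var(axis=0, ddof=1)

best = int(np.argmax(variances))
worst = int(np.argmin(variances))
print(best, variances[best])
print(worst, variances[worst])
```

The maximum sits at 45° with variance 4 (PC1 and λ₁), and the minimum at 135° with variance 4/3 (the PC2 axis and λ₂).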
📈
Eigenvalue = Spread
It is the variance of the projection
The bigger the eigenvalue, the more spread out the data is along that eigenvector's direction. Large spread = the PC axis discriminates well between data points. Small spread = data points are nearly identical along that axis — uninformative.
🎯
Eigenvalue = Importance
Ranking tells you what to keep
Sorting eigenvalues descending gives you a strict importance ranking of axes. The first PC is the single best summary of your data. The second is the next best orthogonal summary. You can objectively decide where to cut using the explained variance ratio.
📵
Eigenvalue = Stretching Factor
How the covariance matrix deforms space
The covariance matrix is a linear transformation. It stretches space in the direction of each eigenvector by a factor of λ. Large λ = strong stretch = the data varies a lot in that direction. The eigenvalues set the shape of the data ellipse: each axis has length proportional to √λ.

Section 19

Eigenvalue and Eigenvector Together — The Complete Picture

They are inseparable. An eigenvalue without its eigenvector is just a number. An eigenvector without its eigenvalue is just a direction. Together they answer two questions that completely describe a principal component:

Eigenvector Answers — WHERE?
Question | Answer from Eigenvector
Which direction in feature space? | e = [0.707, 0.707] → 45° diagonal
How do features combine? | PC1 = 0.707·x₁ + 0.707·x₂
Which features dominate this axis? | Both equally (same loading)
Are features increasing or opposing? | Both positive → move together
Are PCs uncorrelated? | Yes: e₁ · e₂ = 0 (orthogonal directions)
Eigenvalue Answers — HOW MUCH?
Question | Answer from Eigenvalue
How much variance along this axis? | λ₁ = 4.0 units of variance
How important is this PC? | 75% of total variance
How many PCs to keep? | Until Σλₖ/Σλ ≥ 95%
Is this axis informative? | λ ≫ 0 → yes; λ ≈ 0 → discard
What is the total information? | Σλ = Trace(C) = 16/3
Eigenvector is the road. Eigenvalue is how far the road takes you.
Imagine you are a traveller in feature space. Your data has spread across this space and you want to build a single highway that carries the most traffic (information).

The eigenvector tells you the road's direction — which angle to build the highway. It could run north-south, east-west, or diagonally. The eigenvector picks the exact angle that maximises the spread of your data along that road.

The eigenvalue tells you how much traffic the road carries — the total variance of data spread along that highway. A large eigenvalue means data points are spread far apart on this road — it is a useful axis. A small eigenvalue means data points are all clustered close together on this road — the highway goes nowhere interesting.

PCA finds the best roads (eigenvectors) ranked by traffic (eigenvalues) and keeps only the busiest ones.
Scenario | Eigenvalue | Eigenvector | What It Means
Two features perfectly correlated | λ₁ = total variance, λ₂ = 0 | e₁ = [0.707, 0.707] (when the variances are equal) | Data lies on a 1D line. One PC captures everything. The second PC carries no variance at all.
Two features completely uncorrelated | λ₁ = Var(x₁), λ₂ = Var(x₂) (assuming Var(x₁) > Var(x₂)) | e₁ = [1, 0], e₂ = [0, 1] | PCs = original axes. PCA adds nothing: the features are already uncorrelated.
Two features equal variance, correlated | λ₁ > λ₂, both non-zero | e₁ = [0.707, 0.707] | Our exact example. Data is an ellipse tilted at 45°. PCA rotates it upright.
Feature with zero variance | λ = 0 for that direction | Points along that feature | All samples have the same value for this feature. It is a constant, so it carries zero information.
All eigenvalues equal | λ₁ = λ₂ = ... = λₚ | Any orthogonal set works | Data cloud is a perfect hypersphere. The principal directions are not unique, so there is no preferred direction.
🔑
The Most Important Identity in PCA

Trace(C) = Σ eigenvalues = Total variance of all features

In our example: Trace(C) = C[1,1] + C[2,2] = 8/3 + 8/3 = 16/3
Sum of eigenvalues: 4 + 4/3 = 12/3 + 4/3 = 16/3 ✓

This identity means PCA is a conservation law: it never creates or destroys variance. It only redistributes it — concentrating as much as possible into the first few axes. The total information in your data is exactly preserved. What you discard by keeping only k PCs is precisely the variance of the dropped components.
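Why the identity holds can be sketched in one line. It assumes the eigendecomposition C = WΛWᵀ guaranteed by the Spectral Theorem (W orthogonal, Λ the diagonal matrix of eigenvalues) and uses the cyclic property of the trace, tr(AB) = tr(BA):

```latex
\operatorname{tr}(C)
  = \operatorname{tr}\!\left(W \Lambda W^{\top}\right)
  = \operatorname{tr}\!\left(\Lambda W^{\top} W\right)
  = \operatorname{tr}(\Lambda)
  = \sum_{i=1}^{p} \lambda_i
```

The third equality uses WᵀW = I. Since tr(C) is also the sum of the diagonal variances of C, the total variance is the same in the original and the rotated coordinates.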


Section 20

Practical Significance — What These Numbers Mean for Your Data

Eigenvalue Tells You the Signal-to-Noise Ratio of a Component
In real datasets, the first few eigenvalues are large (signal — genuine patterns in the data) and the rest are small (noise — random measurement error, irrelevant fluctuations). The point where the eigenvalue curve bends sharply downward — the "elbow" in a scree plot — is the boundary between signal PCs and noise PCs. Keep everything before the elbow. Discard everything after. The eigenvectors of large eigenvalues encode real structure; eigenvectors of near-zero eigenvalues encode noise directions.
Eigenvector Loadings Tell You Which Original Features Drive Each PC
In a real dataset with 10 features, the first eigenvector might look like: [0.52, 0.48, 0.01, 0.02, −0.01, 0.49, 0.51, 0.00, 0.01, 0.02]. Features 1, 2, 6, 7 have large loadings. This PC represents "the combined level of features 1, 2, 6, 7" — perhaps they all relate to one underlying latent factor (e.g. socioeconomic status in survey data). The eigenvector has given you a data-driven grouping of related features — something domain analysis alone might miss.
Eigenvalue Ratio Tells You How Redundant Your Features Are
If the first eigenvalue captures 99% of the variance in a 50-feature dataset, it means 49 of your features are essentially redundant — they are all measuring the same underlying thing. The dataset's effective dimensionality is close to 1, not 50. Conversely, if eigenvalues are roughly equal, all features contribute unique information and PCA provides little compression benefit.
Eigenvectors of the Largest Eigenvalues Are the Most Stable
When you add more data, the eigenvectors corresponding to large eigenvalues stay nearly constant — they represent robust, real structure. Eigenvectors corresponding to small eigenvalues are unstable — small perturbations in data can rotate them significantly. This is why discarding small-eigenvalue PCs not only saves dimensions but also improves robustness: you are keeping only the stable, reproducible structure.
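Returning to the 10-feature loading example above, picking out the features that drive a PC amounts to filtering by absolute loading. In this sketch the cut-off of 0.3 is an arbitrary illustrative value, not a standard threshold.

```python
import numpy as np

# First eigenvector from the 10-feature example in the text
e1 = np.array([0.52, 0.48, 0.01, 0.02, -0.01, 0.49, 0.51, 0.00, 0.01, 0.02])

# Arbitrary illustrative cut-off on the absolute loading
threshold = 0.3

# flatnonzero gives 0-based positions; add 1 for 1-based feature numbers
dominant = np.flatnonzero(np.abs(e1) > threshold) + 1

# Rank all features from strongest to weakest absolute loading
ranking = np.argsort(-np.abs(e1)) + 1

print("dominant features:", dominant)
print("strongest feature:", ranking[0])
```

Taking the absolute value matters: a loading of −0.5 contributes as strongly as +0.5, only in the opposite direction.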
📈 How to Read Eigenvalues and Eigenvectors in a Real Project
Q1
"My first eigenvalue is 10× larger than the rest. What does that mean?"
Your data has one overwhelming dominant pattern. Almost all variation in your dataset can be described by a single axis. The data is effectively 1-dimensional. Check if this dominant PC correlates with a confounding variable (e.g. sample size, batch effect).
Q2
"Two of my eigenvectors have very similar eigenvalues. Which is PC1 and which is PC2?"
When eigenvalues are nearly equal, the corresponding eigenvectors are numerically unstable — small changes in data can swap them or rotate them into each other. They span the same 2D subspace, but their individual directions are unreliable. Do not over-interpret individual loadings from nearly degenerate eigenvalues.
Q3
"My eigenvector for PC1 has high loadings only on features 3 and 7. What does that mean?"
Features 3 and 7 dominate the direction of greatest variance in your data. They may share an underlying cause, so look for a domain explanation, but first check that their high loadings are not simply an artefact of unstandardised scales. It can also hint that a simpler 2-feature model using only features 3 and 7 might perform nearly as well as the full model; verify this on held-out data rather than assuming it.
Q4
"My last 5 eigenvalues are all near zero. Should I always discard them?"
Yes in almost every case. Near-zero eigenvalues mean near-zero variance — those PCs are noise directions. However, if you are doing anomaly detection, outliers sometimes appear in these low-variance directions (they differ from the main data cloud in unusual ways), so check first before discarding.