The Story That Explains PCA
PCA is the algorithm that automatically finds the best torch angle — the projection that preserves the most information (variance) from your original high-dimensional data in fewer dimensions.
In practice: you have a dataset with 50 columns. PCA finds the 2 or 3 "best angles" (principal components) so you can plot, visualise, or model it without losing much detail.
Principal Component Analysis (PCA) is an unsupervised linear transformation technique. It re-expresses your data in a new coordinate system where the axes (principal components) are ordered by how much variance they explain. The first principal component (PC1) is the direction of maximum spread in your data; PC2 is the direction of maximum spread perpendicular to PC1, and so on.
PCA rotates your data into a new frame where axes point in the directions of maximum variance — ordered so you can drop the last few axes and lose almost nothing.
Why Dimensionality Reduction Matters
Modern datasets routinely have hundreds or thousands of features. More features sounds better — but past a point it backfires in three serious ways:
Feature selection keeps a subset of original columns. PCA creates brand-new columns that are linear combinations of all original features. The new features have no direct real-world meaning, but they are mathematically optimal for preserving variance.
The Four-Step PCA Algorithm
The Maths — Made Concrete with Numbers
Numerical Example: 2 Features → 1 Principal Component
Let's walk through PCA on a tiny dataset — 5 students with two exam scores. We want to find the single direction (PC1) that best summarises both scores.
| Student | Maths Score (x₁) | Physics Score (x₂) |
|---|---|---|
| Alice | 85 | 88 |
| Bob | 70 | 72 |
| Carol | 92 | 95 |
| Dan | 60 | 58 |
| Eve | 78 | 80 |
C = [[148.5, 165.5], [165.5, 186.3]]
One component is almost perfect — the two exam scores are nearly identical in direction.
The eigenvector [0.664, 0.747] points in the direction students vary most — roughly at 45° between the two axes. This makes sense: students who score high in Maths also score high in Physics. The information is almost entirely captured by a single "academic ability" axis.
Key Formulas
Visual Diagram — What PCA Actually Does
—
—
—
—
↑ Try PC1 axis view — looking straight down PC1, the data collapses into a tight disc, showing how little variance lives in PC2 and PC3. The amber arrow is the direction of maximum spread.
Scree Plot — Choosing How Many Components to Keep
That is a scree plot. The y-axis is eigenvalue (= variance explained); the x-axis is component number. The elbow tells you where to cut.
Amber bars = components worth keeping (above the elbow). Blue bars = mostly noise. The green dashed line tracks cumulative explained variance.
Python Implementation — Step by Step
Step 1: Standardisation and Raw PCA from Scratch
import numpy as np
# ── Toy dataset: 6 students × 4 exam scores ──────────────────
X = np.array([
[85, 88, 82, 90],
[70, 72, 68, 74],
[92, 95, 91, 96],
[60, 58, 63, 55],
[78, 80, 76, 82],
[55, 52, 57, 50],
], dtype=float)
# Step 1 — Standardise (zero mean, unit variance)
mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)
X_std = (X - mean) / std
# Step 2 — Covariance matrix
C = np.cov(X_std.T)
print("Covariance matrix shape:", C.shape)
print(np.round(C, 3))
# Step 3 — Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(C)
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# Step 4 — Explained variance ratio
evr = eigenvalues / eigenvalues.sum()
print("Eigenvalues: ", np.round(eigenvalues, 4))
print("Explained var ratio: ", np.round(evr, 4))
print("Cumulative EVR: ", np.round(np.cumsum(evr), 4))
# Step 5 — Project onto top-2 PCs
W = eigenvectors[:, :2]
Z = X_std @ W
print("Projected data (2 PCs):
", np.round(Z, 3))
The first eigenvalue (3.99) is enormous compared to the rest (0.009, 0.002, 0.001). PC1 alone explains 99.71% of all variance — four exam scores essentially measure one thing: "academic ability". We compressed 4 dimensions into 1 and lost only 0.29% of information.
Step 2: sklearn PCA — The Production Way
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca_full = PCA(n_components=None)
pca_full.fit(X_scaled)
print("Explained variance ratio per component:")
for i, ev in enumerate(pca_full.explained_variance_ratio_):
print(f" PC{i+1}: {ev:.4f} ({ev*100:.2f}%)")
print(f"
95% variance threshold: {np.argmax(np.cumsum(pca_full.explained_variance_ratio_) >= 0.95) + 1} components")
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)
print(f"
Original shape: {X.shape}")
print(f"Reduced shape: {X_2d.shape}")
print(f"Total variance retained: {pca_2d.explained_variance_ratio_.sum():.4f}")
print("
Component loadings (how much each feature contributes):")
for i, comp in enumerate(pca_2d.components_):
print(f" PC{i+1}:")
for fname, loading in zip(feature_names, comp):
print(f" {fname:28s}: {loading:+.4f}")
PC1 has high positive loadings for petal length (+0.58), petal width (+0.57), and sepal length (+0.52). It is essentially an "overall flower size" axis — bigger flowers score higher. PC2 is dominated by sepal width (+0.93) — it captures a completely different aspect of the flower's shape.
Step 3: Choosing n_components with Variance Threshold
from sklearn.decomposition import PCA
pca_95 = PCA(n_components=0.95)
X_95 = pca_95.fit_transform(X_scaled)
print(f"Components needed for 95% variance: {pca_95.n_components_}")
pca_mle = PCA(n_components='mle')
X_mle = pca_mle.fit_transform(X_scaled)
print(f"MLE auto-selected components: {pca_mle.n_components_}")
X_reconstructed = pca_2d.inverse_transform(X_2d)
mse = ((X_scaled - X_reconstructed) ** 2).mean()
print(f"Reconstruction MSE (2 PCs): {mse:.6f}")
Complete Real-World Example — Digits Dataset
The digits dataset contains 1,797 handwritten digit images (8×8 pixels = 64 features). We will use PCA to compress 64 dimensions to 2, visualise the digit clusters, and then compress to 30 dimensions and train a classifier.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
digits = load_digits()
X, y = digits.data, digits.target
print(f"Dataset: {X.shape[0]} images × {X.shape[1]} pixel features")
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca_30 = PCA(n_components=30, random_state=42)
X_30 = pca_30.fit_transform(X_scaled)
print(f"Variance retained (30 PCs): {pca_30.explained_variance_ratio_.sum():.4f}")
print(f"Features: 64 → 30 ({30/64*100:.0f}% of original)")
X_train, X_test, y_train, y_test = train_test_split(
X_30, y, test_size=0.2, random_state=42)
for name, Xtr, Xte in [
("No PCA (64 features)", *train_test_split(X_scaled, test_size=0.2, random_state=42)),
("PCA 30 components", X_train, X_test)
]:
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(Xtr, y_train)
acc = accuracy_score(y_test, lr.predict(Xte))
print(f" {name:25s}: {acc:.4f}")
By discarding the noisiest 34 dimensions and keeping only 30 PCs (47% of the original features), we retained 93.4% of all variance and lost only 0.8% accuracy. In exchange, training is faster, memory usage is lower, and noisy pixel dimensions that could confuse the model are removed.
PCA in a Full Machine-Learning Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.datasets import load_wine
wine = load_wine()
X, y = wine.data, wine.target
pipe = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=0.95)),
('svm', SVC(kernel='rbf', random_state=42))
])
cv_scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
param_grid = {
'pca__n_components': [0.90, 0.95, 0.99, 5, 8],
'svm__C': [0.1, 1.0, 10.0],
'svm__gamma': ['scale', 'auto'],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X, y)
print(f"Best CV accuracy: {grid.best_score_:.4f}")
print(f"Best params: {grid.best_params_}")
print(f"PCs selected: {grid.best_estimator_.named_steps['pca'].n_components_}")
Always put PCA inside a Pipeline when doing cross-validation. Fitting PCA on the full dataset before splitting leaks information from the test fold into training — PCA has "seen" the test data's variance structure. The Pipeline guarantees PCA is only fit on the training fold each time.
Incremental PCA — When Data Doesn't Fit in RAM
from sklearn.decomposition import IncrementalPCA
import numpy as np
np.random.seed(42)
X_large = np.random.randn(100_000, 50)
ipca = IncrementalPCA(n_components=10, batch_size=500)
ipca.fit(X_large)
print("Explained variance ratio (top 10):")
print(np.round(ipca.explained_variance_ratio_, 4))
print(f"Total: {ipca.explained_variance_ratio_.sum():.4f}")
X_transformed = ipca.transform(X_large[:1000])
print(f"Transformed shape: {X_transformed.shape}")
Standard PCA loads the entire dataset into RAM to compute the covariance matrix. IncrementalPCA processes mini-batches and updates its internal state incrementally — memory usage stays constant regardless of dataset size. Use it whenever your data is larger than available RAM, or when data arrives as a stream. Results are numerically very close to standard PCA.
PCA Diagram — The Full Data Flow
Key Hyperparameters
| Parameter | Default | What It Controls | Tuning Advice |
|---|---|---|---|
n_components |
None | Number of PCs to keep. Also accepts float (variance %) or 'mle' | Most important param. Use 0.95 for 95% variance threshold. |
svd_solver |
'auto' | SVD algorithm: 'full', 'arpack', 'randomized', 'auto' | Use 'randomized' for large d and small k — much faster. |
whiten |
False | Scale components to unit variance after projection | Set True before SVM or neural nets for better conditioning. |
random_state |
None | Seed for randomized SVD (affects reproducibility) | Always set for reproducibility in production. |
copy |
True | Whether to copy input X before centering | Set False to save memory if you don't need X after transform. |
When to Use PCA — and When Not To
PCA vs Other Dimensionality Reduction Methods
| Method | Type | Preserves | Speed | Best For |
|---|---|---|---|---|
| PCA | Linear | Global variance structure | Fast | Preprocessing, noise removal, visualisation |
| Kernel PCA | Non-linear | Non-linear variance | Slow (O(n²)) | Non-linear manifolds in moderate datasets |
| t-SNE | Non-linear | Local neighbourhood structure | Very slow | Cluster visualisation only (2D/3D) |
| UMAP | Non-linear | Local + some global | Medium | Large datasets, preserves topology |
| Autoencoder | Non-linear | Learned latent structure | Slow (training) | Images, audio, highly complex data |
| LDA | Linear | Class separation (supervised) | Fast | Classification preprocessing (supervised) |
Start with PCA — it is fast, interpretable (loadings), and works as a preprocessing step inside pipelines with no data leakage risk. If cluster structure remains invisible after PCA, switch to UMAP for visualisation. If your problem is fundamentally non-linear and performance matters more than speed, try Kernel PCA or a deep autoencoder.
Golden Rules
StandardScaler — zero mean, unit variance — every single time
unless you have a specific reason not to (e.g. all features are already on the
same scale by design).
Pipeline([...])
so it is only fit on training folds.
n_components=2. Plot the scree plot, check cumulative EVR.
If 2 components explain only 30% of variance, your 2-D plot is misleading.
Report the retained variance alongside every PCA visualisation.
SimpleImputer or KNNImputer first, then scale, then PCA.
IncrementalPCA with a sensible batch_size
(typically 500–5000 rows). Results are numerically equivalent to standard PCA
but constant memory cost regardless of total dataset size.