Linear Discriminant Analysis (LDA)

Section 01

The Story That Explains LDA

📖 Real World Analogy

The Museum Security Guard

Imagine a museum security guard whose job is to separate real visitors from thieves at the exit. Every person carries a bag with two measurements: bag weight and number of items.

A naive guard might just look at the tallest person, or the one with the heaviest bag. But a smart guard asks a different question: "Which direction, when I project everyone onto it, spreads the two groups as far apart as possible while keeping each group tightly packed?"

That is the entire idea behind Linear Discriminant Analysis. LDA finds the mathematical "direction" — called a discriminant — that maximises the gap between classes while minimising the spread within each class. It does not just find variance; it finds discriminative variance.

Linear Discriminant Analysis (LDA) is a supervised dimensionality-reduction and classification algorithm invented by Ronald Fisher in 1936. Unlike PCA which maximises total variance, LDA maximises the ratio of between-class scatter to within-class scatter — deliberately finding the projection that makes classes most separable.

🌿

Two Roles of LDA

LDA wears two hats. As a dimensionality reducer it projects high-dimensional data into at most (C−1) dimensions, where C is the number of classes — perfect as a preprocessing step before another classifier. As a classifier it assigns new points to the class whose centroid is closest in the projected space, assuming Gaussian class-conditional distributions with equal covariance.

Section 02

The Core Intuition — Two Objectives at Once

❌ Bad Projection

What it does	Result
Maximises total variance	Classes overlap badly
PCA direction	Red & Blue mix together
Ignores class labels	Useless for classification

✅ LDA Projection

What it does	Result
Maximises between-class gap	Classes far apart
Minimises within-class spread	Each class tightly clustered
Uses class labels	Ideal for classification

Objective (Fisher Criterion)

J(w) = wᵀ S_B w / wᵀ S_W w

Find the weight vector w that maximises the ratio of between-class scatter to within-class scatter.

Solution

S_W⁻¹ S_B w = λ w

This is a generalised eigenvalue problem. The eigenvectors of S_W⁻¹ S_B give the optimal discriminant directions.

Between-Class Scatter

S_B = Σ nᵢ (μᵢ − μ)(μᵢ − μ)ᵀ

Measures how far apart the class means are from the overall mean. Large S_B = well-separated classes.

Within-Class Scatter

S_W = Σ Σ (x − μᵢ)(x − μᵢ)ᵀ

Measures how spread out points are around their own class mean. Small S_W = tightly clustered classes.

Section 03

3D Interactive Visualisation — LDA in Action

The diagram below shows three classes in 3D space (left) and the same data projected onto the two LDA discriminant axes (right). Drag to rotate the 3D view. Notice how the LDA projection pulls the three clusters apart far more cleanly than random projections would.

🔭 Interactive 3D LDA Diagram — Rotate & Explore

🖱️ Drag to rotate | Scroll to zoom | Three Gaussian blobs (red, blue, green) in 3D space projected onto the two LDA components.

Section 04

Step-by-Step Maths — A Worked Example

📖 Story — The Wine Lab

Classifying Wines by Chemistry

A wine lab measures two chemicals in each bottle: Alcohol % and Malic Acid mg/L. They have three wine varieties: Barbera, Barolo, and Grignolino. The head chemist wants a single scale that, when a new bottle is measured, immediately tells her which variety it is. LDA finds that scale.

🧮 LDA — Step-by-Step Algorithm

Step 1

Compute the class means μ₁, μ₂, … μ_C and the overall mean μ for each feature.

Step 2

Compute the Within-Class Scatter Matrix S_W = Σ_c Σ_{x∈c} (x−μ_c)(x−μ_c)ᵀ

Step 3

Compute the Between-Class Scatter Matrix S_B = Σ_c n_c (μ_c−μ)(μ_c−μ)ᵀ

Step 4

Solve the generalised eigenvalue problem: S_W⁻¹ S_B w = λ w

Step 5

Sort eigenvectors by descending eigenvalue. Take the top k = min(C−1, d) eigenvectors as discriminant axes.

Step 6

Project data onto these axes: Z = X · W where W = [w₁ | w₂ | … | wₖ]

Numerical Mini-Example (2 Features, 2 Classes, 4 Points)

Sample	Alcohol %	Malic Acid	Class
1	14.2	1.7	Barolo (C₁)
2	13.8	1.9	Barolo (C₁)
3	12.4	3.1	Barbera (C₂)
4	11.9	3.5	Barbera (C₂)

📐 Computation Walkthrough

Means

μ₁ = [14.0, 1.8] | μ₂ = [12.15, 3.3] | μ = [13.075, 2.55]

S_W

C₁ scatter: deviations [+0.2,−0.1] and [−0.2,+0.1] → S_W₁ = [[0.08, −0.04],[−0.04, 0.02]]
C₂ scatter: deviations [+0.25,−0.2] and [−0.25,+0.2] → S_W₂ = [[0.125, −0.1],[−0.1, 0.08]]
S_W = [[0.205, −0.14],[−0.14, 0.1]]

S_B

Δμ = [14.0−13.075, 1.8−2.55] = [0.925, −0.75]
S_B = 2×[0.925,−0.75]ᵀ[0.925,−0.75] + 2×[−0.925,0.75]ᵀ[−0.925,0.75]
S_B ≈ [[3.42, −2.77],[−2.77, 2.25]]

Optimal w

w = S_W⁻¹(μ₁−μ₂) ∝ [−0.87, −1.13] after normalisation
LD1 direction: [−0.61, −0.79] — projects Barolo high, Barbera low on the axis.

🎯

Maximum Dimensions After LDA

LDA can produce at most C − 1 discriminant components, where C is the number of classes. Three-class problem → max 2 LDA components. This is fundamentally different from PCA which can produce up to min(n−1, d) components. A 10-class dataset projected to 9 LDA dimensions is almost always dramatically more useful for classification than 9 PCA dimensions.

Section 05

LDA Assumptions — The Fine Print

📊

Gaussian Class Distributions

Normality

Each class is assumed to follow a multivariate Gaussian (normal) distribution. LDA is robust to mild violations, but heavy skew or multimodality within a class will hurt accuracy.

✅ Check: Q-Q plots per class

❌ Fix: transform features (log, Box-Cox)

⚖️

Equal Covariance Matrices

Homoscedasticity

All classes share the same covariance structure (only their means differ). If covariances differ substantially, use Quadratic Discriminant Analysis (QDA) instead — it estimates a covariance matrix per class.

✅ Check: Box's M test

❌ Fix: switch to QDA

🔗

Independent Samples

Independence

Observations must be independent of each other. Repeated measures from the same subject, time-series data, or grouped data violates this. Correlated features are fine — correlated samples are not.

✅ Check: study design

❌ Fix: mixed models, MLM

⚠️

The Singularity Problem

If the number of features d ≥ n (more features than samples), the within-class scatter matrix S_W becomes singular and cannot be inverted. Solutions: apply PCA first to reduce dimensions, use solver='lsqr' or solver='eigen' with regularisation in scikit-learn, or switch to Regularised LDA which adds a shrinkage parameter λ to S_W.

Section 06

Python Implementation — Wine Dataset

LDA as a Dimensionality Reducer + Visualiser

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt

# Load the UCI Wine dataset — 178 samples, 13 chemical features, 3 classes
wine = load_wine()
X, y = wine.data, wine.target
labels = wine.target_names  # ['class_0', 'class_1', 'class_2']

# Standardise features (recommended for LDA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit LDA — project down to 2 components (C-1 = 3-1 = 2 max)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)

# Explained variance ratio of each discriminant
print("Explained variance ratio:", lda.explained_variance_ratio_)
print(f"LD1 explains: {lda.explained_variance_ratio_[0]*100:.1f}%")
print(f"LD2 explains: {lda.explained_variance_ratio_[1]*100:.1f}%")

# Plot the 2D projection
colors = ['#f87171', '#60a5fa', '#34d399']
plt.figure(figsize=(8,6))
for i, label in enumerate(labels):
    mask = y == i
    plt.scatter(X_lda[mask, 0], X_lda[mask, 1],
                label=label, color=colors[i], alpha=0.8, edgecolors='white', s=60)
plt.xlabel('LD1'); plt.ylabel('LD2')
plt.title('Wine Dataset — LDA 2D Projection')
plt.legend(); plt.tight_layout(); plt.show()

OUTPUT

Explained variance ratio: [0.6875 0.3125] LD1 explains: 68.8% LD2 explains: 31.2%

LDA as a Classifier

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Build a proper pipeline: scale → LDA classifier
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lda',    LinearDiscriminantAnalysis())   # n_components=None → uses all C-1
])

wine = load_wine()
X, y = wine.data, wine.target

# 5-fold stratified cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')

print(f"CV Accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")

# Full fit + detailed report
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
pipe.fit(X_tr, y_tr)
y_pred = pipe.predict(X_te)

print(classification_report(y_te, y_pred, target_names=wine.target_names))
print("Confusion Matrix:")
print(confusion_matrix(y_te, y_pred))

OUTPUT

CV Accuracy: 0.9888 +/- 0.0156 precision recall f1-score support class_0 1.00 1.00 1.00 15 class_1 0.95 1.00 0.97 19 class_2 1.00 0.93 0.96 14 accuracy 0.98 48 Confusion Matrix: [[15 0 0] [ 0 19 0] [ 0 1 13]]

⚡

LDA Achieves Near-Perfect Accuracy on Wine

98.8% accuracy with 5-fold CV using only 13 original features projected into 2 LDA dimensions. This is remarkable because we are classifying in 2D what was a 13D problem. It demonstrates that when LDA's assumptions roughly hold — as they do for chemical composition data — the discriminant projection captures nearly all classification-relevant information.

Section 07

Regularised LDA — Handling Small Samples & Many Features

Standard LDA breaks when features outnumber samples (singular S_W) or when the training set is small. Regularised LDA (RLDA) adds a shrinkage penalty that pulls the within-class covariance estimate toward the identity matrix:

Regularised S_W

S_W(λ) = (1−λ)·S_W + λ·tr(S_W)/d · I

λ=0 is standard LDA. λ=1 assumes all classes have identical diagonal covariance. Optimal λ chosen by Ledoit-Wolf shrinkage or cross-validation.

Sklearn Parameter

shrinkage='auto'

Set shrinkage='auto' and solver='lsqr' or solver='eigen' to automatically estimate the optimal λ via the Ledoit-Wolf lemma.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Simulate high-dimensional, small-sample scenario
# 200 samples, 150 features — classic small-n-large-p regime
X, y = make_classification(
    n_samples=200, n_features=150, n_informative=30,
    n_classes=3, n_clusters_per_class=1, random_state=42
)

# Standard LDA — likely to fail or be unstable
std_pipe = Pipeline([
    ('sc', StandardScaler()),
    ('lda', LinearDiscriminantAnalysis(solver='svd'))    # SVD-based, no shrinkage
])

# Regularised LDA — Ledoit-Wolf automatic shrinkage
reg_pipe = Pipeline([
    ('sc', StandardScaler()),
    ('lda', LinearDiscriminantAnalysis(
        solver='lsqr',
        shrinkage='auto'   # Ledoit-Wolf optimal shrinkage
    ))
])

std_scores = cross_val_score(std_pipe, X, y, cv=5, scoring='accuracy')
reg_scores = cross_val_score(reg_pipe, X, y, cv=5, scoring='accuracy')

print(f"Standard LDA CV:    {std_scores.mean():.4f} +/- {std_scores.std():.4f}")
print(f"Regularised LDA CV: {reg_scores.mean():.4f} +/- {reg_scores.std():.4f}")

# Manual shrinkage tuning (grid search over λ)
from sklearn.model_selection import GridSearchCV
param_grid = {'lda__shrinkage': [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 'auto']}
gs = GridSearchCV(reg_pipe, param_grid, cv=5, scoring='accuracy')
gs.fit(X, y)
print(f"Best shrinkage: {gs.best_params_}")
print(f"Best CV accuracy: {gs.best_score_:.4f}")

OUTPUT

Standard LDA CV: 0.6650 +/- 0.0483 ← unstable, near-singular matrix Regularised LDA CV: 0.8150 +/- 0.0424 ← dramatically better Best shrinkage: {'lda__shrinkage': 'auto'} Best CV accuracy: 0.8150

Section 08

Real-World Example — Face Recognition (AT&T Dataset)

📖 Story — The Airport Scanner

Recognising Faces in a Crowd

Each face image is a 112×92 pixel grey-scale picture — that is 10,304 features per image. There are 40 people with 10 photos each = 400 images total. You cannot run LDA directly on 10,304 dimensions with only 400 samples. The classic solution — still used in production biometric systems — is PCA first, then LDA. The PCA step (called Eigenfaces) reduces dimensionality to ~150, then LDA (called Fisherfaces) finds the 39 directions that maximally separate the 40 identities.

from sklearn.datasets import fetch_olivetti_faces
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Olivetti Faces: 400 images, 4096 features, 40 classes
data = fetch_olivetti_faces(shuffle=True, random_state=42)
X, y = data.data, data.target  # X shape: (400, 4096)

# Fisherfaces pipeline: PCA (Eigenfaces) → LDA (Fisherfaces)
fisherfaces = Pipeline([
    ('scale', StandardScaler()),
    ('pca',   PCA(n_components=150, whiten=True, random_state=42)),
    ('lda',   LinearDiscriminantAnalysis())   # max 39 components (40 classes - 1)
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(fisherfaces, X, y, cv=cv, scoring='accuracy')

print("=== Fisherfaces (PCA → LDA) ===")
print(f"CV Accuracy: {scores.mean()*100:.1f}% +/- {scores.std()*100:.1f}%")
print(f"Input dims:  4096 → PCA: 150 → LDA: 39")

# Compare: PCA-only nearest centroid
from sklearn.neighbors import NearestCentroid
pca_only = Pipeline([
    ('scale', StandardScaler()),
    ('pca',   PCA(n_components=150, whiten=True, random_state=42)),
    ('clf',   NearestCentroid())
])
pca_scores = cross_val_score(pca_only, X, y, cv=cv, scoring='accuracy')
print("\n=== PCA-only nearest centroid ===")
print(f"CV Accuracy: {pca_scores.mean()*100:.1f}% +/- {pca_scores.std()*100:.1f}%")

OUTPUT

=== Fisherfaces (PCA → LDA) === CV Accuracy: 93.2% +/- 2.8% Input dims: 4096 → PCA: 150 → LDA: 39 === PCA-only nearest centroid === CV Accuracy: 78.5% +/- 3.9%

🏆

+14.7% from Adding LDA after PCA

PCA finds the directions of maximum variance — not maximum discriminability. Adding LDA after PCA increases accuracy from 78.5% to 93.2% on face recognition because LDA knows which class each face belongs to and rotates the space to make those 40 identities maximally separable. This PCA→LDA pipeline is the Fisherfaces algorithm, one of the most influential techniques in computer vision history.

Section 09

LDA vs PCA — When to Use Each

Property	PCA	LDA
Type	Unsupervised	Supervised
Objective	Maximise total variance	Maximise class separability
Uses class labels?	No	Yes
Max components	min(n−1, d)	C−1 (num classes minus 1)
Best for	Visualisation, noise reduction, no labels	Classification preprocessing, supervised DR
Assumes Gaussian?	No	Yes (within each class)
Sensitive to outliers?	Moderately	Moderately
Works with many classes?	Yes — unlimited	Yes — but only C−1 axes
When to choose	Feature compression, autoencoders, EDA	Classification, face recognition, biometrics

🔑

The Golden Combo

PCA and LDA are complementary, not competing. Use PCA first to eliminate noise and handle the singularity problem (especially when d > n), then use LDA to extract the discriminant axes. This PCA→LDA chain is battle-tested across face recognition, medical imaging, and text classification. The PCA step typically reduces to ~100–300 components; LDA then extracts C−1 truly class-separating directions from those.

Section 10

LDA vs QDA — Covariance Assumptions

When LDA's equal-covariance assumption fails, Quadratic Discriminant Analysis (QDA) steps in. QDA estimates a separate covariance matrix Σ_c for each class, resulting in quadratic (curved) decision boundaries rather than linear ones.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Generate data with UNEQUAL class covariances — LDA assumption violated
import numpy as np
np.random.seed(42)

X_c1 = np.random.multivariate_normal([0,0], [[1,0],[0,1]], 200)     # circular
X_c2 = np.random.multivariate_normal([3,3], [[4,2],[2,4]], 200)     # elongated ellipse
X_c3 = np.random.multivariate_normal([6,0], [[0.5,0],[0,3]],200)    # narrow tall

X = np.vstack([X_c1, X_c2, X_c3])
y = np.array([0]*200 + [1]*200 + [2]*200)

lda_pipe = Pipeline([('sc', StandardScaler()), ('clf', LinearDiscriminantAnalysis())])
qda_pipe = Pipeline([('sc', StandardScaler()), ('clf', QuadraticDiscriminantAnalysis())])

lda_cv = cross_val_score(lda_pipe, X, y, cv=5).mean()
qda_cv = cross_val_score(qda_pipe, X, y, cv=5).mean()

print(f"LDA CV Accuracy: {lda_cv:.4f}  (assumption violated)")
print(f"QDA CV Accuracy: {qda_cv:.4f}  (no equal-covariance assumption)")

OUTPUT

LDA CV Accuracy: 0.8567 (assumption violated — linear boundary doesn't fit) QDA CV Accuracy: 0.9583 (quadratic boundary captures the true structure)

Aspect	LDA	QDA
Decision boundary	Linear hyperplane	Quadratic surface
Parameters per class	One shared Σ	One Σ per class
Bias	Higher (if covariances differ)	Lower
Variance	Lower (fewer params)	Higher
Needs more data?	No — fewer parameters	Yes — d(d+1)/2 params per class
Dimensionality reduction?	Yes — C−1 components	No
Choose when…	Classes roughly share same shape	Classes have very different spreads/shapes

Section 11

Medical Diagnosis Example — Breast Cancer LDA

📖 Story — The Radiologist's Dilemma

Malignant vs Benign Tumour Projection

A hospital lab measures 30 cell-nucleus features from fine needle aspirate biopsies: radius, texture, perimeter, area, smoothness, compactness, concavity, symmetry, and fractal dimension — each for mean, worst, and standard error. The radiologist wants to project these 30 dimensions onto a single axis that best separates Malignant from Benign tumours — a literal life-or-death discrimination. LDA is perfectly suited for this two-class problem.

from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

bc = load_breast_cancer()
X, y = bc.data, bc.target   # 569 samples, 30 features, 2 classes

pipe = Pipeline([
    ('sc',  StandardScaler()),
    ('lda', LinearDiscriminantAnalysis())
])

# 10-fold stratified CV — important for medical data
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
acc_scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')
auc_scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')

print(f"CV Accuracy: {acc_scores.mean():.4f} +/- {acc_scores.std():.4f}")
print(f"CV AUC-ROC:  {auc_scores.mean():.4f} +/- {auc_scores.std():.4f}")

# Full fit for feature weights (discriminant loadings)
pipe.fit(X, y)
lda_model = pipe.named_steps['lda']

# Top contributing features to LD1
coeff = np.abs(lda_model.coef_[0])
top_idx = np.argsort(coeff)[::-1][:5]
print("\nTop 5 most discriminating features:")
for i in top_idx:
    print(f"  {bc.feature_names[i]:35s}: {coeff[i]:.4f}")

OUTPUT

CV Accuracy: 0.9595 +/- 0.0181 CV AUC-ROC: 0.9946 +/- 0.0059 Top 5 most discriminating features: worst concave points : 3.8821 worst perimeter : 2.9174 mean concave points : 2.7533 worst radius : 2.5611 mean concavity : 1.9840

📋

LDA Coefficients as Feature Importance

In two-class LDA, lda.coef_[0] gives the LD1 discriminant weights — how much each feature contributes to separating Malignant from Benign after standardisation. Features with large absolute coefficients drive the discrimination. Worst concave points and worst perimeter dominate — matching clinical knowledge that tumour boundary irregularity and size are primary malignancy indicators.

Section 12

LDA Solvers in Scikit-Learn

🔢

SVD Solver

solver='svd' (default)

Singular Value Decomposition. Does not compute S_W explicitly. Most stable numerically. Cannot use shrinkage. Best for standard LDA with n > d.

✅ Default, stable, no matrix inverse

❌ No regularisation

📉

LSQR Solver

solver='lsqr'

Least-squares solution. Supports shrinkage parameter including 'auto' for Ledoit-Wolf. Ideal for high-dimensional or small-sample datasets. Cannot produce class posterior probabilities.

✅ Supports shrinkage — use for small n

❌ No predict_proba

🎯

Eigen Solver

solver='eigen'

Explicit eigenvalue decomposition of S_W⁻¹ S_B. Supports shrinkage. Produces class posterior probabilities. Slower than SVD and LSQR for large datasets.

✅ Has predict_proba + shrinkage

❌ Slower, needs invertible S_W

Solver	Supports Shrinkage?	predict_proba?	Best Use
`svd` (default)	No	Yes	Standard LDA, large datasets
`lsqr`	Yes	No	High-d, small-sample datasets
`eigen`	Yes	Yes	Need probabilities + regularisation

Section 13

LDA for Multi-Class — Iris 3D Projection

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target   # 150 samples, 4 features, 3 classes

pipe = Pipeline([
    ('sc',  StandardScaler()),
    ('lda', LinearDiscriminantAnalysis(n_components=2))
])

# Cross-validated accuracy
scores = cross_val_score(pipe, X, y, cv=10)
print(f"CV Accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")

# Fit and show explained variance per discriminant
pipe.fit(X, y)
lda = pipe.named_steps['lda']
print(f"\nLD1 explains {lda.explained_variance_ratio_[0]*100:.1f}% of between-class variance")
print(f"LD2 explains {lda.explained_variance_ratio_[1]*100:.1f}% of between-class variance")

# Class separation quality: Mahalanobis-like distances between centroids in LDA space
X_lda = lda.transform(pipe.named_steps['sc'].transform(X))
centroids = [X_lda[y==i].mean(axis=0) for i in range(3)]
for i in range(3):
  for j in range(i+1, 3):
    d = np.linalg.norm(np.array(centroids[i])-np.array(centroids[j]))
    print(f"Distance {iris.target_names[i]} ↔ {iris.target_names[j]}: {d:.3f}")

OUTPUT

CV Accuracy: 0.9800 +/- 0.0271 LD1 explains 99.1% of between-class variance LD2 explains 0.9% of between-class variance Distance setosa ↔ versicolor: 7.602 Distance setosa ↔ virginica: 10.282 Distance versicolor ↔ virginica: 3.181

🌸

LD1 Does 99% of the Work

For the Iris dataset, a single linear discriminant (LD1) captures 99.1% of all between-class variance. You could achieve nearly the same classification accuracy by projecting onto just one dimension. This is LDA's power: it compresses multi-dimensional data to its true discriminant essence, often needing far fewer dimensions than PCA would.

Section 14

When to Use (and Avoid) LDA

✅

Multi-class Classification

LDA is a natural fit for multi-class problems on tabular data when features are roughly Gaussian. 3-class cancer type, 5-star review rating, 10-digit MNIST.

works well natively

✅

Dimensionality Reduction Before Classification

Project 100+ features to C−1 dimensions before training an SVM, neural network, or k-NN. Reduces noise and computation while preserving discriminative power.

preprocessing step

✅

Face & Biometric Recognition

PCA → LDA (Fisherfaces) is a proven pipeline for face recognition, speaker identification, and handwriting recognition with hundreds of classes.

Fisherfaces pipeline

❌

Non-Gaussian Feature Distributions

Binary, count, or heavily skewed features violate the Gaussian assumption. Use tree-based classifiers, Naive Bayes (with appropriate distribution), or logistic regression instead.

use RandomForest or LogReg

❌

Very Unequal Class Sizes

LDA's projection is dominated by the majority class when class sizes differ dramatically. Address with priors parameter or prefer class-weighted alternatives.

imbalanced datasets

❌

Complex Non-Linear Boundaries

If classes interleave in a spiral, ring, or XOR pattern, a linear boundary cannot separate them. Use kernel LDA, SVM with RBF kernel, or neural networks.

use KernelPCA + LDA or SVM

Section 15

Golden Rules

🌿 LDA — Non-Negotiable Rules

Always standardise features before LDA. Unlike tree-based models, LDA computes covariance matrices and eigenvalues. Features on different scales (e.g. height in cm vs salary in thousands) will dominate the covariance calculation and distort the discriminant. Use StandardScaler in a Pipeline.

Remember: max C−1 discriminants. A 5-class problem produces at most 4 LDA components regardless of how many features you have. If you need more axes, apply PCA first to extract signal, then LDA to maximise class separation within that reduced space.

Use shrinkage='auto' whenever n < 10d. The standard LDA within-class scatter matrix becomes unreliable with small samples. Ledoit-Wolf automatic shrinkage fixes this at almost zero cost. Use solver='lsqr' or solver='eigen' to enable it.

Check class balance and set priors if needed. By default, LDA estimates class priors from training data frequency. For imbalanced datasets, set priors=[0.5, 0.5] (or uniform/domain-appropriate priors) to prevent the majority class from dominating predictions.

LDA is a classifier, not just a transformer. You can call lda.predict(X) directly. As a pure classifier on well-separated Gaussian data it is incredibly fast (essentially a matrix multiply + nearest centroid) and often competitive with much heavier models. Always try LDA as a baseline before reaching for random forests or gradient boosting.

Use PCA before LDA when d ≥ n/10. The rule of thumb: if you have >1 feature per 10 samples, add a PCA step with n_components=min(n//10, 300) before LDA. This handles the singularity, speeds up computation, and often improves generalisation by removing noise dimensions.

Inspect explained_variance_ratio_ after fitting. If LD1 explains >95% of between-class variance, you only need 1D projection. This is both a sanity check and a strong visualisation opportunity — you can meaningfully plot your entire dataset as a 1D histogram coloured by class.