The Story That Explains LDA
A naive guard might just look at the tallest person, or the one with the heaviest bag. But a smart guard asks a different question: "Which direction, when I project everyone onto it, spreads the two groups as far apart as possible while keeping each group tightly packed?"
That is the entire idea behind Linear Discriminant Analysis. LDA finds the mathematical "direction" — called a discriminant — that maximises the gap between classes while minimising the spread within each class. It does not just find variance; it finds discriminative variance.
Linear Discriminant Analysis (LDA) is a supervised dimensionality-reduction and classification algorithm invented by Ronald Fisher in 1936. Unlike PCA which maximises total variance, LDA maximises the ratio of between-class scatter to within-class scatter — deliberately finding the projection that makes classes most separable.
LDA wears two hats. As a dimensionality reducer it projects high-dimensional data into at most (C−1) dimensions, where C is the number of classes — perfect as a preprocessing step before another classifier. As a classifier it assigns new points to the class whose centroid is closest in the projected space, assuming Gaussian class-conditional distributions with equal covariance.
The Core Intuition — Two Objectives at Once
| What it does | Result |
|---|---|
| Maximises total variance | Classes overlap badly |
| PCA direction | Red & Blue mix together |
| Ignores class labels | Useless for classification |
| What it does | Result |
|---|---|
| Maximises between-class gap | Classes far apart |
| Minimises within-class spread | Each class tightly clustered |
| Uses class labels | Ideal for classification |
3D Interactive Visualisation — LDA in Action
The diagram below shows three classes in 3D space (left) and the same data projected onto the two LDA discriminant axes (right). Drag to rotate the 3D view. Notice how the LDA projection pulls the three clusters apart far more cleanly than random projections would.
🖱️ Drag to rotate | Scroll to zoom | Three Gaussian blobs (red, blue, green) in 3D space projected onto the two LDA components.
Step-by-Step Maths — A Worked Example
Numerical Mini-Example (2 Features, 2 Classes, 4 Points)
| Sample | Alcohol % | Malic Acid | Class |
|---|---|---|---|
| 1 | 14.2 | 1.7 | Barolo (C₁) |
| 2 | 13.8 | 1.9 | Barolo (C₁) |
| 3 | 12.4 | 3.1 | Barbera (C₂) |
| 4 | 11.9 | 3.5 | Barbera (C₂) |
C₂ scatter: deviations [+0.25,−0.2] and [−0.25,+0.2] → S_W₂ = [[0.125, −0.1],[−0.1, 0.08]]
S_W = [[0.205, −0.14],[−0.14, 0.1]]
S_B = 2×[0.925,−0.75]ᵀ[0.925,−0.75] + 2×[−0.925,0.75]ᵀ[−0.925,0.75]
S_B ≈ [[3.42, −2.77],[−2.77, 2.25]]
LD1 direction: [−0.61, −0.79] — projects Barolo high, Barbera low on the axis.
LDA can produce at most C − 1 discriminant components, where C is the number of classes. Three-class problem → max 2 LDA components. This is fundamentally different from PCA which can produce up to min(n−1, d) components. A 10-class dataset projected to 9 LDA dimensions is almost always dramatically more useful for classification than 9 PCA dimensions.
LDA Assumptions — The Fine Print
If the number of features d ≥ n (more features than samples), the
within-class scatter matrix S_W becomes singular and cannot be inverted. Solutions:
apply PCA first to reduce dimensions, use solver='lsqr' or
solver='eigen' with regularisation in scikit-learn, or switch to
Regularised LDA which adds a shrinkage parameter λ to S_W.
Python Implementation — Wine Dataset
LDA as a Dimensionality Reducer + Visualiser
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
# Load the UCI Wine dataset — 178 samples, 13 chemical features, 3 classes
wine = load_wine()
X, y = wine.data, wine.target
labels = wine.target_names # ['class_0', 'class_1', 'class_2']
# Standardise features (recommended for LDA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit LDA — project down to 2 components (C-1 = 3-1 = 2 max)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)
# Explained variance ratio of each discriminant
print("Explained variance ratio:", lda.explained_variance_ratio_)
print(f"LD1 explains: {lda.explained_variance_ratio_[0]*100:.1f}%")
print(f"LD2 explains: {lda.explained_variance_ratio_[1]*100:.1f}%")
# Plot the 2D projection
colors = ['#f87171', '#60a5fa', '#34d399']
plt.figure(figsize=(8,6))
for i, label in enumerate(labels):
mask = y == i
plt.scatter(X_lda[mask, 0], X_lda[mask, 1],
label=label, color=colors[i], alpha=0.8, edgecolors='white', s=60)
plt.xlabel('LD1'); plt.ylabel('LD2')
plt.title('Wine Dataset — LDA 2D Projection')
plt.legend(); plt.tight_layout(); plt.show()
LDA as a Classifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Build a proper pipeline: scale → LDA classifier
pipe = Pipeline([
('scaler', StandardScaler()),
('lda', LinearDiscriminantAnalysis()) # n_components=None → uses all C-1
])
wine = load_wine()
X, y = wine.data, wine.target
# 5-fold stratified cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
# Full fit + detailed report
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(
X, y, test_size=0.25, stratify=y, random_state=42
)
pipe.fit(X_tr, y_tr)
y_pred = pipe.predict(X_te)
print(classification_report(y_te, y_pred, target_names=wine.target_names))
print("Confusion Matrix:")
print(confusion_matrix(y_te, y_pred))
98.8% accuracy with 5-fold CV using only 13 original features projected into 2 LDA dimensions. This is remarkable because we are classifying in 2D what was a 13D problem. It demonstrates that when LDA's assumptions roughly hold — as they do for chemical composition data — the discriminant projection captures nearly all classification-relevant information.
Regularised LDA — Handling Small Samples & Many Features
Standard LDA breaks when features outnumber samples (singular S_W) or when the training set is small. Regularised LDA (RLDA) adds a shrinkage penalty that pulls the within-class covariance estimate toward the identity matrix:
shrinkage='auto' and solver='lsqr' or solver='eigen' to automatically estimate the optimal λ via the Ledoit-Wolf lemma.from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Simulate high-dimensional, small-sample scenario
# 200 samples, 150 features — classic small-n-large-p regime
X, y = make_classification(
n_samples=200, n_features=150, n_informative=30,
n_classes=3, n_clusters_per_class=1, random_state=42
)
# Standard LDA — likely to fail or be unstable
std_pipe = Pipeline([
('sc', StandardScaler()),
('lda', LinearDiscriminantAnalysis(solver='svd')) # SVD-based, no shrinkage
])
# Regularised LDA — Ledoit-Wolf automatic shrinkage
reg_pipe = Pipeline([
('sc', StandardScaler()),
('lda', LinearDiscriminantAnalysis(
solver='lsqr',
shrinkage='auto' # Ledoit-Wolf optimal shrinkage
))
])
std_scores = cross_val_score(std_pipe, X, y, cv=5, scoring='accuracy')
reg_scores = cross_val_score(reg_pipe, X, y, cv=5, scoring='accuracy')
print(f"Standard LDA CV: {std_scores.mean():.4f} +/- {std_scores.std():.4f}")
print(f"Regularised LDA CV: {reg_scores.mean():.4f} +/- {reg_scores.std():.4f}")
# Manual shrinkage tuning (grid search over λ)
from sklearn.model_selection import GridSearchCV
param_grid = {'lda__shrinkage': [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 'auto']}
gs = GridSearchCV(reg_pipe, param_grid, cv=5, scoring='accuracy')
gs.fit(X, y)
print(f"Best shrinkage: {gs.best_params_}")
print(f"Best CV accuracy: {gs.best_score_:.4f}")
Real-World Example — Face Recognition (AT&T Dataset)
from sklearn.datasets import fetch_olivetti_faces
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
# Olivetti Faces: 400 images, 4096 features, 40 classes
data = fetch_olivetti_faces(shuffle=True, random_state=42)
X, y = data.data, data.target # X shape: (400, 4096)
# Fisherfaces pipeline: PCA (Eigenfaces) → LDA (Fisherfaces)
fisherfaces = Pipeline([
('scale', StandardScaler()),
('pca', PCA(n_components=150, whiten=True, random_state=42)),
('lda', LinearDiscriminantAnalysis()) # max 39 components (40 classes - 1)
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(fisherfaces, X, y, cv=cv, scoring='accuracy')
print("=== Fisherfaces (PCA → LDA) ===")
print(f"CV Accuracy: {scores.mean()*100:.1f}% +/- {scores.std()*100:.1f}%")
print(f"Input dims: 4096 → PCA: 150 → LDA: 39")
# Compare: PCA-only nearest centroid
from sklearn.neighbors import NearestCentroid
pca_only = Pipeline([
('scale', StandardScaler()),
('pca', PCA(n_components=150, whiten=True, random_state=42)),
('clf', NearestCentroid())
])
pca_scores = cross_val_score(pca_only, X, y, cv=cv, scoring='accuracy')
print("\n=== PCA-only nearest centroid ===")
print(f"CV Accuracy: {pca_scores.mean()*100:.1f}% +/- {pca_scores.std()*100:.1f}%")
PCA finds the directions of maximum variance — not maximum discriminability. Adding LDA after PCA increases accuracy from 78.5% to 93.2% on face recognition because LDA knows which class each face belongs to and rotates the space to make those 40 identities maximally separable. This PCA→LDA pipeline is the Fisherfaces algorithm, one of the most influential techniques in computer vision history.
LDA vs PCA — When to Use Each
| Property | PCA | LDA |
|---|---|---|
| Type | Unsupervised | Supervised |
| Objective | Maximise total variance | Maximise class separability |
| Uses class labels? | No | Yes |
| Max components | min(n−1, d) | C−1 (num classes minus 1) |
| Best for | Visualisation, noise reduction, no labels | Classification preprocessing, supervised DR |
| Assumes Gaussian? | No | Yes (within each class) |
| Sensitive to outliers? | Moderately | Moderately |
| Works with many classes? | Yes — unlimited | Yes — but only C−1 axes |
| When to choose | Feature compression, autoencoders, EDA | Classification, face recognition, biometrics |
PCA and LDA are complementary, not competing. Use PCA first to eliminate noise and handle the singularity problem (especially when d > n), then use LDA to extract the discriminant axes. This PCA→LDA chain is battle-tested across face recognition, medical imaging, and text classification. The PCA step typically reduces to ~100–300 components; LDA then extracts C−1 truly class-separating directions from those.
LDA vs QDA — Covariance Assumptions
When LDA's equal-covariance assumption fails, Quadratic Discriminant Analysis (QDA) steps in. QDA estimates a separate covariance matrix Σ_c for each class, resulting in quadratic (curved) decision boundaries rather than linear ones.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Generate data with UNEQUAL class covariances — LDA assumption violated
import numpy as np
np.random.seed(42)
X_c1 = np.random.multivariate_normal([0,0], [[1,0],[0,1]], 200) # circular
X_c2 = np.random.multivariate_normal([3,3], [[4,2],[2,4]], 200) # elongated ellipse
X_c3 = np.random.multivariate_normal([6,0], [[0.5,0],[0,3]],200) # narrow tall
X = np.vstack([X_c1, X_c2, X_c3])
y = np.array([0]*200 + [1]*200 + [2]*200)
lda_pipe = Pipeline([('sc', StandardScaler()), ('clf', LinearDiscriminantAnalysis())])
qda_pipe = Pipeline([('sc', StandardScaler()), ('clf', QuadraticDiscriminantAnalysis())])
lda_cv = cross_val_score(lda_pipe, X, y, cv=5).mean()
qda_cv = cross_val_score(qda_pipe, X, y, cv=5).mean()
print(f"LDA CV Accuracy: {lda_cv:.4f} (assumption violated)")
print(f"QDA CV Accuracy: {qda_cv:.4f} (no equal-covariance assumption)")
| Aspect | LDA | QDA |
|---|---|---|
| Decision boundary | Linear hyperplane | Quadratic surface |
| Parameters per class | One shared Σ | One Σ per class |
| Bias | Higher (if covariances differ) | Lower |
| Variance | Lower (fewer params) | Higher |
| Needs more data? | No — fewer parameters | Yes — d(d+1)/2 params per class |
| Dimensionality reduction? | Yes — C−1 components | No |
| Choose when… | Classes roughly share same shape | Classes have very different spreads/shapes |
Medical Diagnosis Example — Breast Cancer LDA
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
bc = load_breast_cancer()
X, y = bc.data, bc.target # 569 samples, 30 features, 2 classes
pipe = Pipeline([
('sc', StandardScaler()),
('lda', LinearDiscriminantAnalysis())
])
# 10-fold stratified CV — important for medical data
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
acc_scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')
auc_scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
print(f"CV Accuracy: {acc_scores.mean():.4f} +/- {acc_scores.std():.4f}")
print(f"CV AUC-ROC: {auc_scores.mean():.4f} +/- {auc_scores.std():.4f}")
# Full fit for feature weights (discriminant loadings)
pipe.fit(X, y)
lda_model = pipe.named_steps['lda']
# Top contributing features to LD1
coeff = np.abs(lda_model.coef_[0])
top_idx = np.argsort(coeff)[::-1][:5]
print("\nTop 5 most discriminating features:")
for i in top_idx:
print(f" {bc.feature_names[i]:35s}: {coeff[i]:.4f}")
In two-class LDA, lda.coef_[0] gives the LD1 discriminant weights — how much
each feature contributes to separating Malignant from Benign after standardisation.
Features with large absolute coefficients drive the discrimination.
Worst concave points and worst perimeter dominate — matching clinical
knowledge that tumour boundary irregularity and size are primary malignancy indicators.
LDA Solvers in Scikit-Learn
shrinkage parameter including
'auto' for Ledoit-Wolf. Ideal for high-dimensional or small-sample
datasets. Cannot produce class posterior probabilities.
| Solver | Supports Shrinkage? | predict_proba? | Best Use |
|---|---|---|---|
svd (default) | No | Yes | Standard LDA, large datasets |
lsqr | Yes | No | High-d, small-sample datasets |
eigen | Yes | Yes | Need probabilities + regularisation |
LDA for Multi-Class — Iris 3D Projection
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
import numpy as np
iris = load_iris()
X, y = iris.data, iris.target # 150 samples, 4 features, 3 classes
pipe = Pipeline([
('sc', StandardScaler()),
('lda', LinearDiscriminantAnalysis(n_components=2))
])
# Cross-validated accuracy
scores = cross_val_score(pipe, X, y, cv=10)
print(f"CV Accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
# Fit and show explained variance per discriminant
pipe.fit(X, y)
lda = pipe.named_steps['lda']
print(f"\nLD1 explains {lda.explained_variance_ratio_[0]*100:.1f}% of between-class variance")
print(f"LD2 explains {lda.explained_variance_ratio_[1]*100:.1f}% of between-class variance")
# Class separation quality: Mahalanobis-like distances between centroids in LDA space
X_lda = lda.transform(pipe.named_steps['sc'].transform(X))
centroids = [X_lda[y==i].mean(axis=0) for i in range(3)]
for i in range(3):
for j in range(i+1, 3):
d = np.linalg.norm(np.array(centroids[i])-np.array(centroids[j]))
print(f"Distance {iris.target_names[i]} ↔ {iris.target_names[j]}: {d:.3f}")
For the Iris dataset, a single linear discriminant (LD1) captures 99.1% of all between-class variance. You could achieve nearly the same classification accuracy by projecting onto just one dimension. This is LDA's power: it compresses multi-dimensional data to its true discriminant essence, often needing far fewer dimensions than PCA would.
When to Use (and Avoid) LDA
priors parameter or prefer class-weighted alternatives.Golden Rules
StandardScaler in a Pipeline.
shrinkage='auto' whenever n < 10d. The standard LDA
within-class scatter matrix becomes unreliable with small samples. Ledoit-Wolf
automatic shrinkage fixes this at almost zero cost. Use solver='lsqr'
or solver='eigen' to enable it.
priors if needed. By default,
LDA estimates class priors from training data frequency. For imbalanced datasets, set
priors=[0.5, 0.5] (or uniform/domain-appropriate priors) to prevent the
majority class from dominating predictions.
lda.predict(X) directly. As a pure classifier on well-separated Gaussian
data it is incredibly fast (essentially a matrix multiply + nearest centroid) and often
competitive with much heavier models. Always try LDA as a baseline before reaching for
random forests or gradient boosting.
n_components=min(n//10, 300)
before LDA. This handles the singularity, speeds up computation, and often improves
generalisation by removing noise dimensions.
explained_variance_ratio_ after fitting. If LD1
explains >95% of between-class variance, you only need 1D projection. This is both
a sanity check and a strong visualisation opportunity — you can meaningfully plot your
entire dataset as a 1D histogram coloured by class.