The Story That Explains Cross-Validation
So the board devises a smarter system: they rotate. In Round 1, students 1–100 are the "test group" and everyone else studies. In Round 2, students 101–200 are tested. By Round 10, every student has been examined on material they didn't study for, and the board has a reliable, fair picture of every student's true ability.
This rotating-exam system is Cross-Validation — the gold standard for measuring how well a machine learning model will perform on data it has never seen.
In machine learning, cross-validation (CV) is a resampling technique used to evaluate models on limited data. Instead of using a single fixed train/test split — which can be lucky or unlucky — CV rotates the held-out portion multiple times, giving a statistically robust estimate of generalisation performance.
With a single 80/20 split, your reported accuracy depends heavily on which 20% ended up in your test set. If the test set happens to contain easy samples, you overestimate performance. If it contains hard outliers, you underestimate. Cross-validation averages over many splits, which greatly reduces this luck factor and gives you a more reliable estimate of how your model will behave on truly new data.
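The "luck factor" of a single split can be made visible with a small experiment. The sketch below is illustrative only: the synthetic dataset, the LogisticRegression model and the 20 seeds are assumptions chosen for speed, not part of the original walkthrough.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small synthetic dataset (assumed for illustration)
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=8, random_state=0)

# Repeat the same 80/20 evaluation with 20 different random splits
single_split_scores = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    single_split_scores.append(model.score(X_te, y_te))

single_split_scores = np.array(single_split_scores)
print(f"Lowest single-split accuracy : {single_split_scores.min():.4f}")
print(f"Highest single-split accuracy: {single_split_scores.max():.4f}")
print(f"Mean over 20 splits          : {single_split_scores.mean():.4f}")
```

The distance between the lowest and highest score is the uncertainty that one arbitrary split hides from you.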
The Two Deadly Sins CV Protects You From
Standard K-Fold Cross-Validation — The Foundation
K-Fold CV is the workhorse of model evaluation. The dataset is split into k folds of (nearly) equal size. The model trains on k−1 folds and is tested on the remaining fold. This repeats k times, with each fold serving as the test set exactly once. The final score is the mean of the k test scores, usually reported with their standard deviation.
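The rotation itself needs no library. The following is a minimal pure-Python sketch of the index bookkeeping; `kfold_indices` is a hypothetical helper written for this illustration (no shuffling), not a scikit-learn function.

```python
def kfold_indices(n_samples, k):
    """Yield (train, test) index lists; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(kfold_indices(n_samples=10, k=5))
for i, (train, test) in enumerate(splits, 1):
    print(f"Fold {i}: test={test} train={train}")
```

Every index lands in exactly one test fold, which is the property that separates K-Fold from repeated random splitting.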
Python — Standard K-Fold (k=5)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
# ── Dataset: 1000 samples, 20 features ──────────────
X, y = make_classification(
    n_samples=1000, n_features=20,
    n_informative=12, random_state=42
)
# ── Model ────────────────────────────────────────────
model = RandomForestClassifier(n_estimators=100, random_state=42)
# ── K-Fold: 5 folds, shuffled ────────────────────────
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# ── cross_val_score runs all folds automatically ─────
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
print("Per-Fold Scores:")
for i, s in enumerate(scores, 1):
    print(f" Fold {i}: {s:.4f}")
print(f"\nMean Accuracy : {scores.mean():.4f}")
print(f"Std Deviation : {scores.std():.4f}")
# Mean ± 2·std describes the spread of the fold scores; it is a rough band,
# not a formal 95% confidence interval for the mean.
print(f"Mean ± 2 std : [{scores.mean()-2*scores.std():.4f}, {scores.mean()+2*scores.std():.4f}]")
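k=5 is a common default. k=10 trains each model on 90% of the data instead of 80%, so the estimate is slightly less pessimistic, but it costs about 2× the compute. k=3 is fine for very large datasets where training is expensive. k=N (Leave-One-Out) has the least pessimistic bias, but it needs N model fits, becomes expensive once N passes a few hundred, and its estimate can have high variance. For most problems: start with k=5.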
Stratified K-Fold — The Fix for Class Imbalance
Stratified K-Fold solves this by ensuring each fold has the same class proportion as the full dataset. Every fold gets exactly 10% disease cases (100 × 10% = 10 per fold). Fair evaluation, guaranteed.
StratifiedKFold preserves the class distribution (percentage of each class) in every fold. This is critical for imbalanced classification, where standard K-Fold can produce folds with zero or near-zero minority class examples.
With a 90/10 class split, standard K-Fold can produce folds whose minority share drifts noticeably, for example from 6% to 14%. Stratified K-Fold keeps every fold as close to 10% as the integer class counts allow. This removes most of the instability in F1/AUC scores caused by unevenly balanced test folds.
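The difference is easy to check by counting minority samples per test fold. This is a small sketch under assumed toy labels (900 negatives, 100 positives); `minority_counts` is a hypothetical helper, and the features are placeholders because only the labels influence how the folds are drawn.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels (assumed for illustration): 900 negatives, 100 positives, shuffled
rng = np.random.default_rng(0)
y = rng.permutation(np.array([0] * 900 + [1] * 100))
X = np.zeros((len(y), 1))  # placeholder features


def minority_counts(splitter):
    """Number of class-1 samples that land in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]


plain_counts = minority_counts(KFold(n_splits=5, shuffle=True, random_state=0))
strat_counts = minority_counts(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold           minority count per fold:", plain_counts)
print("StratifiedKFold minority count per fold:", strat_counts)
```

With 100 positives and 5 folds, the stratified splitter places 20 positives in every test fold, while the plain splitter's counts wander around that value.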
Python — Stratified K-Fold
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
import numpy as np
# ── Imbalanced dataset: 90% class 0, 10% class 1 ────
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10,
    weights=[0.9, 0.1], # imbalanced!
    random_state=42
)
print(f"Class distribution: {np.bincount(y)}")
# → roughly [1800, 200]; the default flip_y=0.01 label noise shifts the counts slightly
model = GradientBoostingClassifier(n_estimators=100, random_state=42)
# ── StratifiedKFold preserves 90/10 in every fold ────
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# ── Evaluate with F1 (better than accuracy for imbalance) ──
f1_scores = cross_val_score(model, X, y, cv=skf, scoring='f1')
auc_scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
print("\nStratified K-Fold (k=5) Results:")
for i in range(5):
    print(f" Fold {i+1}: F1={f1_scores[i]:.4f} AUC={auc_scores[i]:.4f}")
print(f"\nMean F1 : {f1_scores.mean():.4f} ± {f1_scores.std():.4f}")
print(f"Mean AUC : {auc_scores.mean():.4f} ± {auc_scores.std():.4f}")
# ── Verify class proportions in each fold ────────────
print("\nClass 1 proportion per fold:")
for fold_idx, (train_idx, test_idx) in enumerate(skf.split(X, y), 1):
    n_pos = int(y[test_idx].sum())  # count class-1 samples directly (avoids float rounding)
    prop = n_pos / len(test_idx)
    print(f" Fold {fold_idx}: {prop:.3f} ({n_pos}/{len(test_idx)} samples)")
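For classification tasks, balanced or not, StratifiedKFold is almost always a safer choice than KFold.
It costs essentially nothing extra and protects against random bad splits.
When cv is an integer or left at its default and the estimator is a classifier, both cross_val_score and cross_validate use StratifiedKFold automatically,
but without shuffling. Pass an explicit StratifiedKFold(shuffle=True, random_state=...) when you want shuffled folds, and always in custom loops.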
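You can inspect which splitter scikit-learn resolves an integer `cv` to by calling `check_cv` from `sklearn.model_selection`. The sketch below assumes this is the same resolution the cross-validation helpers apply internally; the toy targets are made up for the demonstration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

y_class = np.array([0, 1] * 10)      # toy binary labels
y_reg = np.linspace(0.0, 1.0, 20)    # toy continuous target

# classifier=True mimics passing a classifier; classifier=False a regressor
cv_for_classifier = check_cv(5, y_class, classifier=True)
cv_for_regressor = check_cv(5, y_reg, classifier=False)

print(type(cv_for_classifier).__name__, "shuffle =", cv_for_classifier.shuffle)
print(type(cv_for_regressor).__name__, "shuffle =", cv_for_regressor.shuffle)
```

Note that the resolved splitter does not shuffle, so sorted data (for example, all class 0 rows first) is a good reason to construct the splitter yourself.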
Leave-One-Out CV (LOOCV) — The Maximalist Approach
Leave-One-Out CV (LOOCV) is K-Fold where k = N (number of samples). In each iteration, a single sample is the test set and all other N−1 samples form the training set. The process repeats N times, giving N test scores — one per sample.
Python — Leave-One-Out CV
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC
from sklearn.datasets import load_iris
import numpy as np
# ── Small dataset: LOOCV is ideal ─────────────────────
iris = load_iris()
X, y = iris.data, iris.target
print(f"Dataset size: {X.shape}")
# → 150 samples → LOOCV runs 150 train/test iterations
model = SVC(kernel='rbf', C=1.0, random_state=42)
loo = LeaveOneOut()
# ── cross_val_score handles all N iterations ───────────
scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')
print(f"Total iterations : {len(scores)}")
print(f"Scores (first 10) : {scores[:10]}")
print(f"Mean Accuracy : {scores.mean():.4f}")
print(f"Std Deviation : {scores.std():.4f}")
print(f"Misclassified : {(scores == 0).sum()} / {len(scores)}")
| Property | K-Fold (k=5) | K-Fold (k=10) | LOOCV (k=N) |
|---|---|---|---|
| Training set size | 80% of data | 90% of data | N−1 samples |
| Test set size | 20% of data | 10% of data | 1 sample |
| Number of models trained | 5 | 10 | N |
| Bias (underestimation) | Moderate | Low | Minimal |
| Variance of estimate | Low | Medium | High |
| Compute cost | Low | Medium | Very High |
| Best for | General use | Medium datasets | Tiny datasets (<100) |
Stratified Shuffle Split — Large-Scale Evaluation
StratifiedShuffleSplit performs a user-defined number of independent random stratified train/test splits. Each split is drawn without replacement, but because every split is drawn afresh, the same sample can appear in several test sets, unlike in K-Fold. It's ideal when your dataset is large enough that you don't need every sample in training, but you want many evaluation runs for statistical reliability.
K-Fold guarantees each sample is tested exactly once. StratifiedShuffleSplit makes no such guarantee — some samples may never be in the test set, others may appear in multiple. This makes it less statistically pure than K-Fold but much faster for large datasets where you want, say, 10 random 10% test splits rather than a full 10-fold rotation.
Python — StratifiedShuffleSplit
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = make_classification(n_samples=5000, n_features=15,
                           weights=[0.8, 0.2], random_state=42)
model = LogisticRegression(max_iter=1000)
# ── 10 random stratified 80/20 splits ─────────────────
sss = StratifiedShuffleSplit(
    n_splits=10, # 10 random evaluations
    test_size=0.2, # 20% test each time
    random_state=42
)
auc_scores = cross_val_score(model, X, y, cv=sss, scoring='roc_auc')
print("StratifiedShuffleSplit (10 runs × 20% test):")
for i, s in enumerate(auc_scores, 1):
    print(f" Run {i:2d}: AUC = {s:.4f}")
print(f"\nMean AUC : {auc_scores.mean():.4f}")
print(f"Std : {auc_scores.std():.4f}")
Group K-Fold — When Samples Are Not Independent
But wait. In fold 1, Patient #47's scan from Monday is in training, and their scan from Friday is in the test set. The model has already seen that patient. It learned their unique anatomy, their specific radiological signature. Of course it scores high — it's essentially recognising the same patient.
Your reported 94% AUC is a lie. On a genuinely new patient, you get 71%.
Group K-Fold ensures all scans from the same patient are always in the same fold. If Patient #47 is in the test fold, all five of their scans are there — none of them in training.
Python — Group K-Fold
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
import numpy as np
# ── Simulate: 400 patients, 5 scans each ─────────────
np.random.seed(42)
n_patients = 400
scans_per_patient = 5
n_samples = n_patients * scans_per_patient # 2000
X = np.random.randn(n_samples, 20)
y = np.random.randint(0, 2, n_samples)
groups = np.repeat(range(n_patients), scans_per_patient)
# groups = [0,0,0,0,0, 1,1,1,1,1, ..., 399,399,399,399,399]
model = RandomForestClassifier(n_estimators=100, random_state=42)
gkf = GroupKFold(n_splits=5)
# ── Pass groups= to cross_val_score ───────────────────
scores = cross_val_score(
    model, X, y,
    cv=gkf,
    groups=groups,
    scoring='accuracy'
)
print("Group K-Fold (patient-level splits):")
for i, s in enumerate(scores, 1):
    print(f" Fold {i}: {s:.4f}")
print(f"\nMean: {scores.mean():.4f} ± {scores.std():.4f}")
# ── Verify: no patient appears in both train and test ─
print("\nVerification — any patient leakage?")
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups), 1):
    train_groups = set(groups[train_idx])
    test_groups = set(groups[test_idx])
    leak = train_groups & test_groups
    print(f" Fold {fold}: {len(test_groups)} test patients, leakage = {len(leak)}")
Use Group K-Fold whenever your samples are not i.i.d. (independently and identically distributed): multiple measurements per patient, multiple transactions per customer, multiple posts per user, multiple frames per video, multiple time windows per sensor. Standard K-Fold on grouped data gives catastrophically optimistic performance estimates due to data leakage between folds.
Time Series Cross-Validation — Respecting Temporal Order
In reality, you will always train on historical data and predict future prices. A model that can peek at future data in training will look brilliant in CV but fail spectacularly in production — because time only moves forward.
Time Series CV (TimeSeriesSplit) enforces this one-way street: training data is always older than test data. No future leakage. Ever.
TimeSeriesSplit uses an expanding window: each split adds more historical data to training while the test set always contains future samples. The future never leaks into the past.
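The expanding window can be written out by hand to see exactly which indices each split uses. The helper `expanding_window_splits` below is a hypothetical pure-Python sketch intended to mirror the default behaviour of TimeSeriesSplit (no gap, no maximum training size); treat the exact boundaries as an assumption rather than a specification.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train, test) index ranges where training always precedes testing."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        train = range(0, test_start)                        # everything before the test block
        test = range(test_start, test_start + test_size)    # the next block in time
        yield train, test


windows = list(expanding_window_splits(n_samples=360, n_splits=5))
for fold, (train, test) in enumerate(windows, 1):
    print(f"Fold {fold}: train[0:{train.stop}] -> test[{test.start}:{test.stop}]")
```

Each training range ends where its test range begins, so no observation from the future is ever available during fitting.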
Python — TimeSeriesSplit
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
import numpy as np
# ── Simulate daily sales data: 360 days, 12 features ──
np.random.seed(42)
n = 360 # 360 daily observations
t = np.arange(n)
X = np.column_stack([
    np.sin(2*np.pi*t/30), # monthly seasonality
    np.sin(2*np.pi*t/7), # weekly pattern
    t/360, # trend
    np.random.randn(n, 9) # noise features
])
y = (10 + 0.02*t
     + 3*np.sin(2*np.pi*t/30)
     + np.random.randn(n)*0.5) # sales with trend + seasonality
model = GradientBoostingRegressor(n_estimators=100, random_state=42)
# ── 5-split time series CV ─────────────────────────────
tscv = TimeSeriesSplit(n_splits=5)
maes = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), 1):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]
    model.fit(X_tr, y_tr)
    preds = model.predict(X_te)
    mae = mean_absolute_error(y_te, preds)
    maes.append(mae)
    print(f"Fold {fold}: train[0:{len(train_idx)}] → test[{len(train_idx)}:{len(train_idx)+len(test_idx)}] | MAE={mae:.4f}")
print(f"\nMean MAE : {np.mean(maes):.4f}")
print(f"Std MAE : {np.std(maes):.4f}")
In the run described here, MAE falls from 0.42 to 0.35 as the training window grows. This is the usual pattern: more historical data tends to make the model better. If MAE instead rose in later folds, it might signal distribution shift (the data's statistical properties changing over time), a critical warning for production models.
Repeated K-Fold — For Maximum Statistical Reliability
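RepeatedKFold and RepeatedStratifiedKFold run K-Fold multiple times with different random splits each time, then aggregate all scores. With 5-fold repeated 10 times, you get 50 test scores, which gives a much more stable estimate than a single K-Fold run. The 50 scores are not independent, because the training sets overlap heavily, so treat any confidence interval built from them as approximate.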
Use it when statistical rigour matters — comparing two models where the difference in performance is small (e.g., 84.2% vs 84.7%), or when writing a research paper where you need tight confidence intervals. It's also excellent for small datasets where a single K-Fold run could be overly optimistic or pessimistic by chance.
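The bookkeeping behind "5 folds × 10 repeats" can be verified directly on the splitter. This is a small sketch with assumed toy labels; it only inspects the splits and trains no model.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

y = np.array([0] * 60 + [1] * 40)   # toy labels (assumed for illustration)
X = np.zeros((len(y), 1))           # placeholder features

rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
n_total_splits = rskf.get_n_splits(X, y)

# Count how many times each sample is used for testing across all repeats
times_tested = np.zeros(len(y), dtype=int)
for _, test_idx in rskf.split(X, y):
    times_tested[test_idx] += 1

print("Total train/test splits:", n_total_splits)
print("Times each sample is tested:", np.unique(times_tested))
```

Every sample is tested once per repeat, so each one contributes to the final estimate the same number of times.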
Python — Repeated Stratified K-Fold
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from scipy import stats
import numpy as np
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=10, random_state=42)
# ── 5-fold × 10 repeats = 50 scores ───────────────────
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
rf_scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42),
                            X, y, cv=rskf, scoring='accuracy')
svc_scores = cross_val_score(SVC(kernel='rbf'),
                             X, y, cv=rskf, scoring='accuracy')
print("Repeated Stratified K-Fold (5×10 = 50 scores):")
print(f" Random Forest : {rf_scores.mean():.4f} ± {rf_scores.std():.4f}")
print(f" SVC (RBF) : {svc_scores.mean():.4f} ± {svc_scores.std():.4f}")
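# ── Paired t-test: is the difference statistically significant? ──
# The scores are paired because the integer random_state makes rskf yield the same splits in both calls.
# Overlapping training sets make this test optimistic, so read the p-value as a rough guide.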
t_stat, p_val = stats.ttest_rel(rf_scores, svc_scores)
print(f"\nPaired t-test:")
print(f" t-statistic : {t_stat:.4f}")
print(f" p-value : {p_val:.4f}")
print(f" Significant : {'YES' if p_val < 0.05 else 'NO'} (α=0.05)")
Nested Cross-Validation — Unbiased Model Selection
But there's a problem: you selected those hyperparameters because they happened to work well on those specific folds. You've effectively overfit to the CV folds themselves. The naive CV score reported for the tuned model therefore tends to be higher than what you'd get on truly new data. This is called optimistic bias in model selection.
Nested CV uses two loops: an inner loop for hyperparameter tuning and an outer loop for honest performance estimation. The outer folds are never touched during tuning. It's the gold standard for unbiased model evaluation.
5 outer folds × 3 inner folds = 15 total model fits per hyperparameter combination. If you try 50 param combos: 50 × 15 = 750 model fits, plus one refit of the best combination per outer fold. Nested CV is expensive, but it gives a far less biased performance estimate for tuned models than reusing the tuning score.
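The cost arithmetic is worth wrapping in a tiny function before launching a long search. `nested_cv_fits` is a hypothetical helper written for this illustration; the `refit` flag accounts for the extra fit GridSearchCV performs on each outer training set when its default `refit=True` is kept.

```python
def nested_cv_fits(n_outer, n_inner, n_combos, refit=True):
    """Total number of model fits for a nested grid search."""
    search_fits = n_combos * n_inner               # inner loop, per outer fold
    per_outer = search_fits + (1 if refit else 0)  # optional refit of the best combo
    return n_outer * per_outer


fits_search_only = nested_cv_fits(5, 3, 50, refit=False)
fits_with_refit = nested_cv_fits(5, 3, 50, refit=True)

print("50 combos, search fits only:", fits_search_only)
print("50 combos, including refits:", fits_with_refit)
print("18 combos, including refits:", nested_cv_fits(5, 3, 18))
```

Multiplying the result by the time of a single fit gives a quick estimate of whether the search fits your compute budget.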
Python — Nested CV with GridSearchCV
from sklearn.model_selection import (
    StratifiedKFold, GridSearchCV, cross_val_score
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import numpy as np
X, y = make_classification(n_samples=600, n_features=20,
                           n_informative=10, random_state=42)
# ── Outer CV: honest evaluation (5 folds) ─────────────
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# ── Inner CV: hyperparameter search (3 folds) ─────────
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
# ── Parameter grid to search ──────────────────────────
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, None],
    'max_features': ['sqrt', 'log2']
} # 3×3×2 = 18 combinations × 3 inner folds = 54 fits per outer fold
# ── GridSearchCV wraps the inner loop ─────────────────
clf = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=inner_cv,
    scoring='accuracy',
    n_jobs=-1
)
# ── cross_val_score wraps the outer loop ──────────────
nested_scores = cross_val_score(
    clf, X, y,
    cv=outer_cv,
    scoring='accuracy',
    n_jobs=-1
)
print("Nested CV Outer Scores:")
for i, s in enumerate(nested_scores, 1):
    print(f" Outer Fold {i}: {s:.4f}")
print(f"\nNested CV Mean : {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
# ── Compare with naive (non-nested) CV ────────────────
# Fit on all data, report best inner CV score — OPTIMISTIC!
clf.fit(X, y)
print(f"\nNaive (non-nested) CV best score : {clf.best_score_:.4f} ← overly optimistic!")
print(f"Nested CV honest estimate : {nested_scores.mean():.4f} ← real performance")
print(f"Optimism gap : {clf.best_score_ - nested_scores.mean():.4f}")
The non-nested CV score is usually higher than the nested CV score, though on a given dataset the gap can be small or even slightly negative by chance. The gap tends to grow with the number of hyperparameter combinations you try: the more you search, the more you overfit to the CV folds. For papers and production deployments where honesty matters, nested CV is non-negotiable.
cross_validate — Multiple Metrics in One Call
cross_val_score returns a single metric. cross_validate is its more powerful sibling —
it returns multiple metrics simultaneously, plus training scores and fit times.
Use it whenever you need more than one evaluation metric per fold.
Python — cross_validate with Multiple Metrics
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np
X, y = make_classification(
    n_samples=2000, n_features=20,
    weights=[0.8, 0.2], random_state=42
)
model = GradientBoostingClassifier(n_estimators=100, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# ── Evaluate 5 metrics at once ─────────────────────────
results = cross_validate(
    model, X, y,
    cv=skf,
    scoring=['accuracy', 'f1', 'roc_auc',
             'precision', 'recall'],
    return_train_score=True # also capture train scores to detect overfitting
)
# ── Print as a tidy table ─────────────────────────────
metrics = ['accuracy', 'f1', 'roc_auc', 'precision', 'recall']
print(f"{'Metric':12s} {'Train':8s} {'Test':8s} {'Gap':8s}")
print("-"*42)
for m in metrics:
    train = results[f'train_{m}'].mean()
    test = results[f'test_{m}'].mean()
    gap = train - test
    print(f"{m:12s} {train:.4f} {test:.4f} {gap:+.4f}")
The gap between train and test score is your overfitting indicator.
A gap of 0.05 accuracy is normal. A gap of 0.25+ is a red flag — your model is memorising training data.
The train score being far above the test score means you need more regularisation,
simpler model architecture, or more training data. cross_validate with
return_train_score=True makes this diagnostic trivial.
CV with Pipelines — The Only Safe Way
If you scale your features before CV, the scaler sees the test fold's data during
scaler.fit(X). The test fold statistics leak into training — your model has technically
seen the test data before being evaluated on it. This artificially inflates scores,
sometimes by 2–5% depending on the dataset. Always put preprocessing inside a Pipeline
so it refits only on each training fold and transforms each test fold blindly.
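What "fit only on the training fold" means can be shown with plain NumPy, without any model. The sketch below uses made-up Gaussian data and a fixed 80/20 index split (both assumptions for illustration) and standardises the test fold twice: once with statistics from all rows and once with statistics from the training rows only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))   # toy feature matrix
train_idx, test_idx = np.arange(80), np.arange(80, 100)

# Leaky: mean/std computed on ALL rows, so the test fold influences them
mu_all, sd_all = X.mean(axis=0), X.std(axis=0)

# Fold-safe: mean/std computed on the training fold only
mu_tr, sd_tr = X[train_idx].mean(axis=0), X[train_idx].std(axis=0)

X_train_safe = (X[train_idx] - mu_tr) / sd_tr
X_test_safe = (X[test_idx] - mu_tr) / sd_tr      # transformed "blindly"
X_test_leaky = (X[test_idx] - mu_all) / sd_all   # already shaped by test data

print("Mean (all rows)     :", np.round(mu_all, 3))
print("Mean (training fold):", np.round(mu_tr, 3))
print("Max abs difference in scaled test fold:",
      float(np.abs(X_test_safe - X_test_leaky).max()))
```

For simple scaling the difference is often small, but the same mechanism applies to feature selection, imputation and target encoding, where leakage can be far larger.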
Python — Wrong vs Right CV with Preprocessing
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.datasets import make_classification
import numpy as np
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=10, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# ─── ❌ WRONG: Scale ALL data before CV ───────────────
scaler = StandardScaler()
X_scaled_wrong = scaler.fit_transform(X) # test data leaks here!
wrong_scores = cross_val_score(
    SVC(kernel='rbf'), X_scaled_wrong, y,
    cv=skf, scoring='accuracy'
)
# ─── ✅ CORRECT: Pipeline scales inside each CV fold ──
pipeline = Pipeline([
    ('scaler', StandardScaler()), # fits on train fold, transforms test fold
    ('svc', SVC(kernel='rbf'))
])
correct_scores = cross_val_score(
    pipeline, X, y, # raw X — pipeline handles scaling internally
    cv=skf, scoring='accuracy'
)
print("❌ WRONG (pre-scaled, data leakage):")
print(f" {wrong_scores} → mean: {wrong_scores.mean():.4f}")
print("\n✅ CORRECT (Pipeline, no leakage):")
print(f" {correct_scores} → mean: {correct_scores.mean():.4f}")
print(f"\nInflation from leakage: {(wrong_scores.mean()-correct_scores.mean())*100:+.2f} percentage points")
All CV Variants at a Glance — Decision Guide
Work through these five questions top to bottom. They cover 95% of real-world CV selection scenarios.
| CV Method | Class Stratified | Group-Safe | Temporal-Safe | Compute Cost | Best For |
|---|---|---|---|---|---|
| KFold | No | No | No | Low | Regression, balanced classes |
| StratifiedKFold | Yes | No | No | Low | Classification (default choice) |
| LOOCV | No | No | No | Very High | Tiny datasets (<100 samples) |
| StratifiedShuffleSplit | Yes | No | No | Medium | Large datasets, quick evaluation |
| GroupKFold | No (see StratifiedGroupKFold) | Yes | No | Low | Patient/user/session data |
| TimeSeriesSplit | No | No | Yes | Low | Sequential/temporal data |
| RepeatedStratifiedKFold | Yes | No | No | High | Research, tight CI, model comparison |
| Nested CV | Yes | Possible | Possible | Very High | Hyperparameter tuning + honest evaluation |
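The table can be condensed into a small lookup function. This is a hypothetical sketch: `suggest_splitter` and its flags are invented for illustration, the precedence order (temporal, then grouped, then tiny, then classification) is an assumption, and it does not reproduce the guide's five questions.

```python
def suggest_splitter(*, temporal=False, grouped=False, tiny=False,
                     classification=False, small_differences=False):
    """Return the name of a scikit-learn splitter suggested by the table above."""
    if temporal:
        return "TimeSeriesSplit"      # order in time must be respected
    if grouped:
        return "GroupKFold"           # keep each patient/user in a single fold
    if tiny:
        return "LeaveOneOut"          # very small datasets
    if classification:
        # Repeats help when comparing models that differ only slightly
        return "RepeatedStratifiedKFold" if small_differences else "StratifiedKFold"
    return "KFold"                    # regression or balanced problems


print(suggest_splitter(classification=True))
print(suggest_splitter(grouped=True, classification=True))
print(suggest_splitter(temporal=True))
```

Hyperparameter tuning is orthogonal to this choice: wrap whichever splitter is returned in an outer loop, as in the nested CV section.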
Golden Rules — Cross-Validation
For classifiers, cross_val_score and cross_validate both default to an unshuffled
StratifiedKFold when cv is an integer or None, but make the splitter explicit in
manual CV loops and whenever you need shuffling or reproducibility.