Cross-Validation Guide: K-Fold, Stratified, LOOCV

Section 01

The Story That Explains Cross-Validation

📖 Real World Analogy

The Medical Exam Board — Testing Doctors Fairly

Imagine a medical school wants to know if its students are truly competent — not just good at memorising past exam papers. If the professors test students only on questions they've already seen, every student scores 95%. But put them in front of real patients — and half of them freeze.

So the board devises a smarter system: they rotate. In Round 1, students 1–100 are the "test group" and everyone else studies. In Round 2, students 101–200 are tested. By Round 10, every student has been examined on material they didn't study for, and the board has a reliable, fair picture of every student's true ability.

This rotating-exam system is Cross-Validation — the gold standard for measuring how well a machine learning model will perform on data it has never seen.

In machine learning, cross-validation (CV) is a resampling technique used to evaluate models on limited data. Instead of using a single fixed train/test split — which can be lucky or unlucky — CV rotates the held-out portion multiple times, giving a statistically robust estimate of generalisation performance.

🌿

The Problem Cross-Validation Solves

With a single 80/20 split, your reported accuracy depends heavily on which 20% ended up in your test set. If the test set happens to contain easy samples, you overestimate performance. If it contains hard outliers, you underestimate. Cross-validation averages over many splits, which sharply reduces this luck factor and gives you a more reliable estimate of how your model will behave on truly new data.

Section 02

The Three Deadly Sins CV Protects You From

📈

Sin 1 — Overfitting

The Training-Data Trap

A model memorises training data, including noise. It scores 99% on training data but 61% on new data. Without proper CV, you'd report 99% and ship a useless model. CV exposes this gap by testing on data the model genuinely hasn't seen during training.

🎲

Sin 2 — Lucky Split

The One-Split Gamble

A single train/test split is a single roll of the dice. Your 85% accuracy might be 78% on a different random seed, or 91%. With only 1000 samples and a 20% test set, an accuracy of 85% measured on 200 test samples carries a 95% confidence interval of roughly ±5 percentage points. CV narrows that interval by evaluating on every sample and averaging across folds.
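To see the gamble in practice, here is a minimal sketch that repeats a single 80/20 split under ten different random seeds. The synthetic dataset, the logistic-regression model, and the seed range are illustrative assumptions, not part of the original text, and the exact numbers depend on your scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative synthetic data (an assumption for this sketch)
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=0)

split_scores = []
for seed in range(10):
    # Same data, same model; only the random split changes
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    split_scores.append(clf.score(X_te, y_te))

print(f"Lowest  single-split accuracy: {min(split_scores):.3f}")
print(f"Highest single-split accuracy: {max(split_scores):.3f}")
print(f"Spread                       : {max(split_scores) - min(split_scores):.3f}")
```

The spread between the lowest and highest score is the uncertainty you would silently accept by reporting only one split.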

🔄

Sin 3 — Data Waste

The Small Dataset Problem

With 500 samples, a fixed 80/20 split leaves only 100 for testing — statistically thin. K-fold CV with k=5 trains on 400 samples and tests on 100, but does this 5 times — effectively using all 500 samples for both training and evaluation.

Section 03

Standard K-Fold Cross-Validation — The Foundation

K-Fold CV is the workhorse of model evaluation. The dataset is split into k equal-sized "folds." The model trains on k−1 folds and is tested on the remaining fold. This repeats k times, with each fold serving as the test set exactly once. The final score is the mean (and standard deviation) across all k test scores.

Shuffle & Split the Dataset into k Equal Folds

The dataset of N samples is divided into k non-overlapping subsets (folds), each of size N/k. With k=5 and 1000 samples, each fold contains 200 samples. Shuffling first prevents ordering bias.

Iteration i: Train on k−1 Folds, Test on Fold i

In iteration i, fold i becomes the test set. The model trains on all other k−1 folds combined. The test score (accuracy, AUC, RMSE, etc.) is recorded for this fold.

Rotate: Repeat for All k Folds

Repeat the train/test iteration for i = 1, 2, …, k. Every sample is in the test set exactly once. Every sample is in a training set exactly k−1 times. No data is wasted.

Aggregate: Mean ± Std of k Scores

Final reported score = mean(score₁, score₂, …, scoreₖ). The standard deviation tells you how stable the model is across different data splits. High std = model is sensitive to which data it sees.

Train Final Model on 100% of Data

CV is for evaluation only. Once you've confirmed the model is good, retrain it on the full dataset for deployment. The CV score is your honest performance estimate; the final model uses all available information.
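The five steps above can be sketched as a hand-written loop. This is a minimal illustration, assuming a synthetic dataset and a logistic-regression model chosen only for speed; in practice the scikit-learn splitters do the index bookkeeping for you.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative data and model (assumptions for this sketch)
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=12, random_state=42)
model = LogisticRegression(max_iter=1000)
k = 5

# Step 1: shuffle the row indices and cut them into k folds
rng = np.random.default_rng(42)
indices = rng.permutation(len(X))
folds = np.array_split(indices, k)

fold_scores = []
for i in range(k):
    # Steps 2-3: fold i is the test set, the other k-1 folds form the training set
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    fold_scores.append(fold_model.score(X[test_idx], y[test_idx]))

# Step 4: aggregate the k test scores
fold_scores = np.array(fold_scores)
print(f"CV accuracy: {fold_scores.mean():.4f} ± {fold_scores.std():.4f}")

# Step 5: retrain on 100% of the data for deployment
final_model = clone(model).fit(X, y)
```

Using clone for each fold keeps the rounds independent: no fitted state carries over from one fold to the next.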

📊 Figure: K-Fold CV with k=5. The test fold rotates through all 5 positions, and each fold takes a turn as the unseen test set exactly once.

Python — Standard K-Fold (k=5)

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# ── Dataset: 1000 samples, 20 features ──────────────
X, y = make_classification(
    n_samples=1000, n_features=20,
    n_informative=12, random_state=42
)

# ── Model ────────────────────────────────────────────
model = RandomForestClassifier(n_estimators=100, random_state=42)

# ── K-Fold: 5 folds, shuffled ────────────────────────
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# ── cross_val_score runs all folds automatically ─────
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

print("Per-Fold Scores:")
for i, s in enumerate(scores, 1):
    print(f"  Fold {i}: {s:.4f}")

print(f"\nMean Accuracy : {scores.mean():.4f}")
print(f"Std Deviation : {scores.std():.4f}")
# Rough spread of the fold scores, not a formal confidence interval for the mean
print(f"Mean ± 2·Std  : [{scores.mean()-2*scores.std():.4f}, {scores.mean()+2*scores.std():.4f}]")

OUTPUT

Per-Fold Scores:
  Fold 1: 0.8450
  Fold 2: 0.8700
  Fold 3: 0.8300
  Fold 4: 0.8600
  Fold 5: 0.8500

Mean Accuracy : 0.8510
Std Deviation : 0.0136
Mean ± 2·Std  : [0.8239, 0.8781]

💡

Choosing k — The Classic Trade-off

k=5 is a common default. k=10 gives a slightly less biased estimate, because each model trains on 90% of the data instead of 80%, but costs about 2× the compute. k=3 is fine for very large datasets where training is expensive. k=N (Leave-One-Out) has the lowest bias but a high-variance estimate, and it needs N model fits, which quickly becomes expensive as N grows. For most problems: start with k=5.

Section 04

Stratified K-Fold — The Fix for Class Imbalance

📖 Story

The Biased Jury Pool

Imagine a dataset of 1000 patient records: 900 healthy (class 0) and 100 with a rare disease (class 1). Standard 5-fold CV shuffles randomly, so each fold of 200 records should hold about 20 disease cases. By pure bad luck, fold 3 might contain only 4 and fold 5 might contain 28. When fold 3 is the test set, it has too few disease examples to give a meaningful F1 score. When fold 5 is the test set, the model trains on fewer disease cases than in the other rounds.

Stratified K-Fold solves this by ensuring each fold has the same class proportion as the full dataset. Every fold gets 10% disease cases (200 × 10% = 20 per fold). Fair evaluation by construction.

StratifiedKFold preserves the class distribution (percentage of each class) in every fold. This is critical for imbalanced classification, where standard K-Fold can produce folds with zero or near-zero minority class examples.

📊 Standard K-Fold vs Stratified K-Fold: Class Distribution per Fold

Full dataset: Class 0 (Healthy) 90%, Class 1 (Disease) 10%.

Fold	❌ Standard: Class 0	❌ Standard: Class 1	✅ Stratified: Class 0	✅ Stratified: Class 1
Fold 1	88%	12%	90%	10%
Fold 2	94%	6%	90%	10%
Fold 3	91%	9%	90%	10%
Fold 4	86%	14%	90%	10%
Fold 5	91%	9%	90%	10%

With a 90/10 class split, standard K-Fold can produce folds ranging from 6% to 14% minority class. Stratified K-Fold holds every fold at 10% (up to rounding). This removes the unstable F1/AUC scores caused by unevenly balanced test folds.
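The contrast can be checked directly by counting minority-class samples in each test fold. The sketch below rebuilds the 900/100 label vector from the story; the placeholder feature matrix and the random seed are assumptions, and the plain KFold counts will differ from the illustrative percentages quoted above.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 1000 labels: 900 healthy (0) and 100 disease (1), as in the story above
y = np.array([0] * 900 + [1] * 100)
X = np.zeros((len(y), 1))  # features are irrelevant for splitting

def minority_counts(splitter):
    """Number of class-1 samples in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]

plain = minority_counts(KFold(n_splits=5, shuffle=True, random_state=0))
strat = minority_counts(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold           disease cases per fold:", plain)
print("StratifiedKFold disease cases per fold:", strat)
```

StratifiedKFold places 20 disease cases in every fold, while the plain KFold counts scatter around 20.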

Python — Stratified K-Fold

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
import numpy as np

# ── Imbalanced dataset: 90% class 0, 10% class 1 ────
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10,
    weights=[0.9, 0.1],          # imbalanced!
    random_state=42
)
print(f"Class distribution: {np.bincount(y)}")
# → approximately [1800, 200]; make_classification randomly reassigns about 1% of labels by default (flip_y=0.01)

model = GradientBoostingClassifier(n_estimators=100, random_state=42)

# ── StratifiedKFold preserves 90/10 in every fold ────
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# ── Evaluate with F1 (better than accuracy for imbalance) ──
f1_scores = cross_val_score(model, X, y, cv=skf, scoring='f1')
auc_scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')

print("\nStratified K-Fold (k=5) Results:")
for i in range(5):
    print(f"  Fold {i+1}: F1={f1_scores[i]:.4f}  AUC={auc_scores[i]:.4f}")

print(f"\nMean F1  : {f1_scores.mean():.4f} ± {f1_scores.std():.4f}")
print(f"Mean AUC : {auc_scores.mean():.4f} ± {auc_scores.std():.4f}")

# ── Verify class proportions in each fold ────────────
print("\nClass 1 proportion per fold:")
for fold_idx, (train_idx, test_idx) in enumerate(skf.split(X, y), 1):
    n_pos = int(y[test_idx].sum())   # count directly; int(prop * n) can truncate on float error
    print(f"  Fold {fold_idx}: {n_pos/len(test_idx):.3f} ({n_pos}/{len(test_idx)} samples)")

OUTPUT

Class distribution: [1800 200]

Stratified K-Fold (k=5) Results:
  Fold 1: F1=0.7812  AUC=0.9214
  Fold 2: F1=0.7634  AUC=0.9187
  Fold 3: F1=0.7923  AUC=0.9301
  Fold 4: F1=0.7701  AUC=0.9156
  Fold 5: F1=0.7855  AUC=0.9244

Mean F1  : 0.7785 ± 0.0104
Mean AUC : 0.9220 ± 0.0050

Class 1 proportion per fold:
  Fold 1: 0.100 (40/400 samples)
  Fold 2: 0.100 (40/400 samples)
  Fold 3: 0.100 (40/400 samples)
  Fold 4: 0.100 (40/400 samples)
  Fold 5: 0.100 (40/400 samples)

⚠️

Always Use Stratified K-Fold for Classification

For classification tasks, balanced or not, StratifiedKFold is almost always a better choice than KFold: it costs nothing extra and protects against random bad splits. When cv is an integer (or left at its default) and the estimator is a classifier, both cross_val_score and cross_validate use StratifiedKFold automatically, but without shuffling. Specify the splitter explicitly when you want shuffling, a fixed random_state, or a custom loop.

Section 05

Leave-One-Out CV (LOOCV) — The Maximalist Approach

Leave-One-Out CV (LOOCV) is K-Fold where k = N (number of samples). In each iteration, a single sample is the test set and all other N−1 samples form the training set. The process repeats N times, giving N test scores — one per sample.

🔬 LOOCV — How It Works on 5 Samples

Round 1

Train on [2,3,4,5] → Test on [1] → Score₁

Round 2

Train on [1,3,4,5] → Test on [2] → Score₂

Round 3

Train on [1,2,4,5] → Test on [3] → Score₃

Round 4

Train on [1,2,3,5] → Test on [4] → Score₄

Round 5

Train on [1,2,3,4] → Test on [5] → Score₅

Result

Final Score = mean(Score₁ … Score₅)
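The five rounds above can be reproduced with scikit-learn's LeaveOneOut splitter. This is a small sketch in which the array of sample IDs stands in for a real dataset.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

samples = np.array([1, 2, 3, 4, 5])  # sample IDs, numbered as in the rounds above

loo = LeaveOneOut()
rounds = []
for round_no, (train_idx, test_idx) in enumerate(loo.split(samples), 1):
    train_ids = samples[train_idx].tolist()
    test_ids = samples[test_idx].tolist()
    rounds.append((train_ids, test_ids))
    print(f"Round {round_no}: train on {train_ids} -> test on {test_ids}")

# Round 1: train on [2, 3, 4, 5] -> test on [1]
# ...
# Round 5: train on [1, 2, 3, 4] -> test on [5]
```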

Python — Leave-One-Out CV

from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC
from sklearn.datasets import load_iris
import numpy as np

# ── Small dataset: LOOCV is ideal ─────────────────────
iris = load_iris()
X, y = iris.data, iris.target
print(f"Dataset size: {X.shape}")
# → 150 samples → LOOCV runs 150 train/test iterations

model = SVC(kernel='rbf', C=1.0, random_state=42)
loo = LeaveOneOut()

# ── cross_val_score handles all N iterations ───────────
scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')

print(f"Total iterations  : {len(scores)}")
print(f"Scores (first 10) : {scores[:10]}")
print(f"Mean Accuracy     : {scores.mean():.4f}")
print(f"Std Deviation     : {scores.std():.4f}")
print(f"Misclassified     : {(scores == 0).sum()} / {len(scores)}")

OUTPUT

Dataset size: (150, 4)
Total iterations  : 150
Scores (first 10) : [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Mean Accuracy     : 0.9800
Std Deviation     : 0.1400
Misclassified     : 3 / 150

Property	K-Fold (k=5)	K-Fold (k=10)	LOOCV (k=N)
Training set size	80% of data	90% of data	N−1 samples
Test set size	20% of data	10% of data	1 sample
Number of models trained	5	10	N
Bias (underestimation)	Moderate	Low	Minimal
Variance of estimate	Low	Medium	High
Compute cost	Low	Medium	Very High
Best for	General use	Medium datasets	Tiny datasets (<100)

Section 06

Stratified Shuffle Split — Large-Scale Evaluation

StratifiedShuffleSplit performs a user-defined number of independent random stratified train/test splits. Within one split no sample is drawn twice, but because each split is drawn afresh, the same sample can appear in multiple test sets. It's ideal when your dataset is large enough that you don't need every sample in training, but you want many evaluation runs for statistical reliability.

🔀

Key Difference from K-Fold

K-Fold guarantees each sample is tested exactly once. StratifiedShuffleSplit makes no such guarantee: some samples may never be in the test set, while others may appear in several. This makes it less statistically tidy than K-Fold but more flexible, because the number of splits and the test size are set independently. On a large dataset you can run, say, 5 random 10% test splits instead of a full 10-fold rotation and halve the number of model fits.
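A quick way to see this difference is to count how often each sample lands in a test set. The sketch below uses a toy 80/20 label vector and a hypothetical helper, count_test_appearances; both are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # placeholder features; only the split matters here

def count_test_appearances(splitter):
    """How many times each sample lands in a test set (hypothetical helper)."""
    counts = np.zeros(len(y), dtype=int)
    for _, test_idx in splitter.split(X, y):
        counts[test_idx] += 1
    return counts

kfold_counts = count_test_appearances(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
sss_counts = count_test_appearances(
    StratifiedShuffleSplit(n_splits=5, test_size=20, random_state=0))  # 20 samples = 20%

print("StratifiedKFold: every sample tested exactly once?",
      bool((kfold_counts == 1).all()))
print("StratifiedShuffleSplit: never tested =", int((sss_counts == 0).sum()),
      "| tested 2+ times =", int((sss_counts >= 2).sum()))
```

Both splitters run five evaluations on 20 test samples each; only the K-Fold variant covers every sample exactly once.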

Python — StratifiedShuffleSplit

from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
import numpy as np

X, y = make_classification(n_samples=5000, n_features=15,
                             weights=[0.8, 0.2], random_state=42)

model = LogisticRegression(max_iter=1000)

# ── 10 random stratified 80/20 splits ─────────────────
sss = StratifiedShuffleSplit(
    n_splits=10,           # 10 random evaluations
    test_size=0.2,         # 20% test each time
    random_state=42
)

auc_scores = cross_val_score(model, X, y, cv=sss, scoring='roc_auc')

print("StratifiedShuffleSplit (10 runs × 20% test):")
for i, s in enumerate(auc_scores, 1):
    print(f"  Run {i:2d}: AUC = {s:.4f}")

print(f"\nMean AUC : {auc_scores.mean():.4f}")
print(f"Std      : {auc_scores.std():.4f}")

OUTPUT

StratifiedShuffleSplit (10 runs × 20% test):
  Run  1: AUC = 0.8712
  Run  2: AUC = 0.8694
  Run  3: AUC = 0.8731
  Run  4: AUC = 0.8688
  Run  5: AUC = 0.8721
  Run  6: AUC = 0.8709
  Run  7: AUC = 0.8698
  Run  8: AUC = 0.8715
  Run  9: AUC = 0.8703
  Run 10: AUC = 0.8726

Mean AUC : 0.8710
Std      : 0.0013

Section 07

Group K-Fold — When Samples Are Not Independent

📖 Story

The Patient Who Appeared Twice

You're training a model to detect pneumonia from chest X-rays. Your dataset has 2,000 X-rays from 400 patients — each patient contributed 5 scans at different appointments. You run standard 5-Fold CV, feeling great about your 94% AUC.

But wait. In fold 1, Patient #47's scan from Monday is in training, and their scan from Friday is in the test set. The model has already seen that patient. It learned their unique anatomy, their specific radiological signature. Of course it scores high — it's essentially recognising the same patient.

Your reported 94% AUC is a lie. On a genuinely new patient, you get 71%.

Group K-Fold ensures all scans from the same patient are always in the same fold. If Patient #47 is in the test fold, all five of their scans are there — none of them in training.

Python — Group K-Fold

from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# ── Simulate: 400 patients, 5 scans each ─────────────
np.random.seed(42)
n_patients = 400
scans_per_patient = 5
n_samples = n_patients * scans_per_patient     # 2000

X = np.random.randn(n_samples, 20)
y = np.random.randint(0, 2, n_samples)         # random labels, so ~50% accuracy is the expected result
groups = np.repeat(range(n_patients), scans_per_patient)
# groups = [0,0,0,0,0, 1,1,1,1,1, ..., 399,399,399,399,399]

model = RandomForestClassifier(n_estimators=100, random_state=42)
gkf = GroupKFold(n_splits=5)

# ── Pass groups= to cross_val_score ───────────────────
scores = cross_val_score(
    model, X, y,
    cv=gkf,
    groups=groups,
    scoring='accuracy'
)

print("Group K-Fold (patient-level splits):")
for i, s in enumerate(scores, 1):
    print(f"  Fold {i}: {s:.4f}")
print(f"\nMean: {scores.mean():.4f} ± {scores.std():.4f}")

# ── Verify: no patient appears in both train and test ─
print("\nVerification — any patient leakage?")
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups), 1):
    train_groups = set(groups[train_idx])
    test_groups  = set(groups[test_idx])
    leak = train_groups & test_groups
    print(f"  Fold {fold}: {len(test_groups)} test patients, leakage = {len(leak)}")

OUTPUT

Group K-Fold (patient-level splits):
  Fold 1: 0.5025
  Fold 2: 0.4975
  Fold 3: 0.5100
  Fold 4: 0.4925
  Fold 5: 0.4975

Mean: 0.5000 ± 0.0059

Verification — any patient leakage?
  Fold 1: 80 test patients, leakage = 0
  Fold 2: 80 test patients, leakage = 0
  Fold 3: 80 test patients, leakage = 0
  Fold 4: 80 test patients, leakage = 0
  Fold 5: 80 test patients, leakage = 0

⚠️

Where Group K-Fold Is Essential

Use Group K-Fold whenever your samples are not i.i.d. (independent and identically distributed): multiple measurements per patient, multiple transactions per customer, multiple posts per user, multiple frames per video, multiple time windows per sensor. Standard K-Fold on grouped data can give badly over-optimistic performance estimates, because related samples leak between training and test folds.
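To make the leakage concrete, the sketch below counts how many patients end up on both sides of the split under plain shuffled KFold and under GroupKFold. The 400 × 5 layout follows the X-ray story; the placeholder features, the seed, and the helper name shared_patients are assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

n_patients, scans_per_patient = 400, 5
groups = np.repeat(np.arange(n_patients), scans_per_patient)
X = np.zeros((len(groups), 1))  # placeholder features; only the split matters here

def shared_patients(splitter, **split_kwargs):
    """Patients present in both train and test, per fold (hypothetical helper)."""
    overlaps = []
    for train_idx, test_idx in splitter.split(X, **split_kwargs):
        overlaps.append(len(set(groups[train_idx]) & set(groups[test_idx])))
    return overlaps

kfold_leak = shared_patients(KFold(n_splits=5, shuffle=True, random_state=0))
group_leak = shared_patients(GroupKFold(n_splits=5), groups=groups)

print("KFold      shared patients per fold:", kfold_leak)
print("GroupKFold shared patients per fold:", group_leak)
```

With plain KFold, most test patients also have scans in the training set; GroupKFold reports zero shared patients in every fold.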

Section 08

Time Series Cross-Validation — Respecting Temporal Order

📖 Story

The Stock Broker Who Knew the Future

You build a stock price prediction model. You shuffle the data and run 5-Fold CV. Fold 3 happens to put January 2024 data in training and July 2023 data in the test set. The model is being trained on the future and tested on the past.

In reality, you will always train on historical data and predict future prices. A model that can peek at future data in training will look brilliant in CV but fail spectacularly in production — because time only moves forward.

Time Series CV (TimeSeriesSplit) enforces this one-way street: training data is always older than test data. No future leakage. Ever.

📊 Figure: TimeSeriesSplit, the expanding window. Horizontal axis: months Jan to Dec; rows: Splits 1 to 5. In each split the TRAIN block (🔵) covers the older months and grows from one split to the next, and the TEST block (🔴) always lies in the months after it.

TimeSeriesSplit uses an expanding window: each split adds more historical data to training while the test set always contains future samples. The future never leaks into the past.
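The expanding window can be printed directly for a 12-month toy series. This is a sketch under default TimeSeriesSplit settings, with month names standing in for real observations.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

months = np.array(["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                   "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])

tscv = TimeSeriesSplit(n_splits=5)
windows = []
for split_no, (train_idx, test_idx) in enumerate(tscv.split(months), 1):
    windows.append((train_idx.tolist(), test_idx.tolist()))
    print(f"Split {split_no}: train {months[train_idx[0]]}-{months[train_idx[-1]]}"
          f" | test {months[test_idx[0]]}-{months[test_idx[-1]]}")

# Split 1: train Jan-Feb | test Mar-Apr
# Split 2: train Jan-Apr | test May-Jun
# Split 3: train Jan-Jun | test Jul-Aug
# Split 4: train Jan-Aug | test Sep-Oct
# Split 5: train Jan-Oct | test Nov-Dec
```

With 12 samples and 5 splits, each test window holds 12 // (5 + 1) = 2 samples, and every training index is smaller than every test index.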

Python — TimeSeriesSplit

from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
import numpy as np

# ── Simulate daily sales data: 360 days, 12 features ─
np.random.seed(42)
n = 360   # 360 daily observations
t = np.arange(n)
X = np.column_stack([
    np.sin(2*np.pi*t/30),   # monthly seasonality
    np.sin(2*np.pi*t/7),    # weekly pattern
    t/360,                   # trend
    np.random.randn(n, 9)   # noise features
])
y = (10 + 0.02*t
     + 3*np.sin(2*np.pi*t/30)
     + np.random.randn(n)*0.5)  # sales with trend + seasonality

model = GradientBoostingRegressor(n_estimators=100, random_state=42)

# ── 5-split time series CV ─────────────────────────────
tscv = TimeSeriesSplit(n_splits=5)

maes = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), 1):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]

    model.fit(X_tr, y_tr)
    preds = model.predict(X_te)
    mae = mean_absolute_error(y_te, preds)
    maes.append(mae)

    print(f"Fold {fold}: train[0:{len(train_idx)}] → test[{len(train_idx)}:{len(train_idx)+len(test_idx)}] | MAE={mae:.4f}")

print(f"\nMean MAE : {np.mean(maes):.4f}")
print(f"Std MAE  : {np.std(maes):.4f}")

OUTPUT

Fold 1: train[0:60] → test[60:120] | MAE=0.4231
Fold 2: train[0:120] → test[120:180] | MAE=0.3987
Fold 3: train[0:180] → test[180:240] | MAE=0.3812
Fold 4: train[0:240] → test[240:300] | MAE=0.3654
Fold 5: train[0:300] → test[300:360] | MAE=0.3501

Mean MAE : 0.3837
Std MAE  : 0.0255

📈

The Improving MAE Pattern Is a Good Sign

Notice MAE decreases from 0.42 to 0.35 as the training window grows. This is the usual pattern: more historical data tends to make the model better. If MAE increased in later folds, it might signal distribution shift (the data's statistical properties changing over time), a critical warning for production models.

Section 09

Repeated K-Fold — For Maximum Statistical Reliability

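RepeatedKFold and RepeatedStratifiedKFold run K-Fold multiple times with different random splits each time, then aggregate all scores. With 5-fold repeated 10 times, you get 50 test scores, which averages out the luck of any single partition and gives a much more stable estimate. Note that the 50 scores are not independent, because they are computed on overlapping subsets of the same data.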

🔬

When to Use Repeated K-Fold

Use it when statistical rigour matters — comparing two models where the difference in performance is small (e.g., 84.2% vs 84.7%), or when writing a research paper where you need tight confidence intervals. It's also excellent for small datasets where a single K-Fold run could be overly optimistic or pessimistic by chance.

Python — Repeated Stratified K-Fold

from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from scipy import stats
import numpy as np

X, y = make_classification(n_samples=500, n_features=20,
                             n_informative=10, random_state=42)

# ── 5-fold × 10 repeats = 50 scores ───────────────────
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)

rf_scores  = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42),
                              X, y, cv=rskf, scoring='accuracy')
svc_scores = cross_val_score(SVC(kernel='rbf'),
                              X, y, cv=rskf, scoring='accuracy')

print("Repeated Stratified K-Fold (5×10 = 50 scores):")
print(f"  Random Forest : {rf_scores.mean():.4f} ± {rf_scores.std():.4f}")
print(f"  SVC (RBF)     : {svc_scores.mean():.4f} ± {svc_scores.std():.4f}")

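# ── Paired t-test: is the difference statistically significant? ──
# Caveat: CV scores share training data and are not independent,
# so this naive test tends to overstate significance.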
t_stat, p_val = stats.ttest_rel(rf_scores, svc_scores)
print(f"\nPaired t-test:")
print(f"  t-statistic : {t_stat:.4f}")
print(f"  p-value     : {p_val:.4f}")
print(f"  Significant  : {'YES' if p_val < 0.05 else 'NO'} (α=0.05)")

OUTPUT

Repeated Stratified K-Fold (5×10 = 50 scores):
  Random Forest : 0.8472 ± 0.0312
  SVC (RBF)     : 0.8391 ± 0.0308

Paired t-test:
  t-statistic : 2.4183
  p-value     : 0.0192
  Significant  : YES (α=0.05)
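Because repeated-CV scores come from overlapping training sets, a plain paired t-test can be too eager to declare significance. One commonly cited adjustment is the corrected resampled t-test attributed to Nadeau and Bengio, which inflates the variance by the ratio of test to training size. The sketch below is an illustrative implementation under that assumption; the helper name and the toy score arrays are hypothetical and are not the results shown above.

```python
import numpy as np
from scipy import stats

def corrected_resampled_ttest(scores_a, scores_b, n_train, n_test):
    """Corrected resampled t-test (hypothetical helper name).

    scores_a, scores_b: per-split scores of two models on the SAME splits.
    n_train, n_test: number of samples in each training / test split.
    """
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    n = len(diffs)
    mean_diff = diffs.mean()
    var_diff = diffs.var(ddof=1)
    # The n_test / n_train term accounts for the overlap between training sets
    t_stat = mean_diff / np.sqrt((1.0 / n + n_test / n_train) * var_diff)
    p_val = 2 * stats.t.sf(abs(t_stat), df=n - 1)
    return t_stat, p_val

# Toy scores standing in for 50 paired CV results (illustrative only)
rng = np.random.default_rng(0)
a = 0.85 + 0.03 * rng.standard_normal(50)
b = a - 0.008 + 0.02 * rng.standard_normal(50)

naive_t, naive_p = stats.ttest_rel(a, b)
corr_t, corr_p = corrected_resampled_ttest(a, b, n_train=400, n_test=100)
print(f"naive     : t={naive_t:.3f}, p={naive_p:.4f}")
print(f"corrected : t={corr_t:.3f}, p={corr_p:.4f}")
```

The corrected statistic is always smaller in magnitude than the naive one, so a difference that looks significant under the plain test may not survive the correction.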

Section 10

Nested Cross-Validation — Unbiased Model Selection

📖 Story

The Teacher Who Marks Their Own Exam

You use 5-Fold CV to tune your model's hyperparameters — say, you try 100 parameter combinations and pick the one with the best CV score. Then you report that best CV score as your model's performance.

But there's a problem: you selected those hyperparameters because they happened to work well on those specific folds. You've effectively overfit to the CV folds themselves. The best CV score from the search therefore tends to be higher than what you'd get on truly new data. This is called optimistic bias in model selection.

Nested CV uses two loops: an inner loop for hyperparameter tuning and an outer loop for honest performance estimation. The outer folds are never touched during tuning. It's the gold standard for unbiased model evaluation.

📊 Nested CV — Outer Loop (Evaluation) Wraps Inner Loop (Tuning)

5 outer folds × 3 inner folds = 15 model fits per hyperparameter combination. If you try 50 param combos: 50 × 15 = 750 model fits, plus one refit of the best combination per outer fold. Nested CV is expensive, but it gives a nearly unbiased performance estimate for tuned models.
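The cost arithmetic generalises to any grid. A small sketch, using a hypothetical helper nested_cv_fits and assuming the search refits the best combination once per outer fold (as GridSearchCV does with its default refit=True):

```python
def nested_cv_fits(n_outer, n_inner, n_combos, refit=True):
    """Count model fits in nested CV (hypothetical helper for illustration)."""
    search_fits = n_combos * n_inner      # inner-loop fits per outer fold
    refit_fits = 1 if refit else 0        # best combo refit on the outer training set
    return n_outer * (search_fits + refit_fits)

print(nested_cv_fits(5, 3, 50, refit=False))  # → 750
print(nested_cv_fits(5, 3, 50))               # → 755
print(nested_cv_fits(5, 3, 18))               # → 275
```

The last call corresponds to an 18-combination grid with 3 inner folds: 54 search fits plus 1 refit per outer fold.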

Python — Nested CV with GridSearchCV

from sklearn.model_selection import (
    StratifiedKFold, GridSearchCV, cross_val_score
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=600, n_features=20,
                             n_informative=10, random_state=42)

# ── Outer CV: honest evaluation (5 folds) ─────────────
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# ── Inner CV: hyperparameter search (3 folds) ─────────
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# ── Parameter grid to search ──────────────────────────
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth':    [3, 5, None],
    'max_features': ['sqrt', 'log2']
}   # 3×3×2 = 18 combinations × 3 inner folds = 54 fits per outer fold

# ── GridSearchCV wraps the inner loop ─────────────────
clf = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=inner_cv,
    scoring='accuracy',
    n_jobs=-1
)

# ── cross_val_score wraps the outer loop ──────────────
nested_scores = cross_val_score(
    clf, X, y,
    cv=outer_cv,
    scoring='accuracy',
    n_jobs=-1
)

print("Nested CV Outer Scores:")
for i, s in enumerate(nested_scores, 1):
    print(f"  Outer Fold {i}: {s:.4f}")
print(f"\nNested CV Mean : {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")

# ── Compare with naive (non-nested) CV ────────────────
# Fit on all data, report best inner CV score — OPTIMISTIC!
clf.fit(X, y)
print(f"\nNaive (non-nested) CV best score : {clf.best_score_:.4f}  ← overly optimistic!")
print(f"Nested CV honest estimate        : {nested_scores.mean():.4f}  ← real performance")
print(f"Optimism gap                     : {clf.best_score_ - nested_scores.mean():.4f}")

OUTPUT

Nested CV Outer Scores:
  Outer Fold 1: 0.8500
  Outer Fold 2: 0.8583
  Outer Fold 3: 0.8417
  Outer Fold 4: 0.8750
  Outer Fold 5: 0.8583

Nested CV Mean : 0.8567 ± 0.0111

Naive (non-nested) CV best score : 0.8750  ← overly optimistic!
Nested CV honest estimate        : 0.8567  ← real performance
Optimism gap                     : 0.0183

🎯

The Optimism Gap Is Usually Positive

On average, the non-nested CV score is higher than the nested CV score, although on a particular dataset the gap can be near zero or even negative. The gap tends to grow with the number of hyperparameter combinations you try: the more you search, the more you overfit to the CV folds. For papers and production deployments where honesty matters, nested CV is non-negotiable.

Section 11

cross_validate — Multiple Metrics in One Call

cross_val_score returns a single metric. cross_validate is its more powerful sibling: it returns multiple metrics at once, plus fit and score times, and optionally training scores. Use it whenever you need more than one evaluation metric per fold.

Python — cross_validate with Multiple Metrics

from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np

X, y = make_classification(
    n_samples=2000, n_features=20,
    weights=[0.8, 0.2], random_state=42
)

model = GradientBoostingClassifier(n_estimators=100, random_state=42)
skf   = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# ── Evaluate 5 metrics at once ─────────────────────────
results = cross_validate(
    model, X, y,
    cv=skf,
    scoring=['accuracy', 'f1', 'roc_auc',
             'precision', 'recall'],
    return_train_score=True   # also capture train scores to detect overfitting
)

# ── Print as a tidy table ─────────────────────────────
metrics = ['accuracy', 'f1', 'roc_auc', 'precision', 'recall']
print(f"{'Metric':12s}  {'Train':8s}  {'Test':8s}  {'Gap':8s}")
print("-"*42)
for m in metrics:
    train = results[f'train_{m}'].mean()
    test  = results[f'test_{m}'].mean()
    gap   = train - test
    print(f"{m:12s}  {train:.4f}    {test:.4f}    {gap:+.4f}")

OUTPUT

Metric        Train     Test      Gap
------------------------------------------
accuracy      0.9812    0.9305    +0.0507
f1            0.9601    0.8791    +0.0810
roc_auc       0.9981    0.9731    +0.0250
precision     0.9615    0.8823    +0.0792
recall        0.9588    0.8760    +0.0828

🔍

Reading the Train-Test Gap

The gap between train and test score is your overfitting indicator. A gap of 0.05 accuracy is normal. A gap of 0.25+ is a red flag — your model is memorising training data. The train score being far above the test score means you need more regularisation, simpler model architecture, or more training data. cross_validate with return_train_score=True makes this diagnostic trivial.

Section 12

CV with Pipelines — The Only Safe Way

⚠️

The Most Common CV Mistake — Data Leakage via Preprocessing

If you scale your features before CV, the scaler sees the test fold's data during scaler.fit(X). The test fold statistics leak into training, so your model has indirectly seen the test data before being evaluated on it. This can artificially inflate scores; the size of the effect depends on the dataset and the preprocessing step, and it is usually small for scaling but can be large for steps such as feature selection. Always put preprocessing inside a Pipeline so it refits only on each training fold and transforms each test fold blindly.

Python — Wrong vs Right CV with Preprocessing

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=500, n_features=20,
                             n_informative=10, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# ─── ❌ WRONG: Scale ALL data before CV ───────────────
scaler = StandardScaler()
X_scaled_wrong = scaler.fit_transform(X)   # test data leaks here!
wrong_scores = cross_val_score(
    SVC(kernel='rbf'), X_scaled_wrong, y,
    cv=skf, scoring='accuracy'
)

# ─── ✅ CORRECT: Pipeline scales inside each CV fold ──
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # fits on train fold, transforms test fold
    ('svc',    SVC(kernel='rbf'))
])
correct_scores = cross_val_score(
    pipeline, X, y,              # raw X — pipeline handles scaling internally
    cv=skf, scoring='accuracy'
)

print("❌ WRONG  (pre-scaled, data leakage):")
print(f"   {wrong_scores}  → mean: {wrong_scores.mean():.4f}")
print("\n✅ CORRECT (Pipeline, no leakage):")
print(f"   {correct_scores}  → mean: {correct_scores.mean():.4f}")
print(f"\nInflation from leakage: {(wrong_scores.mean()-correct_scores.mean())*100:+.2f}%")

OUTPUT

❌ WRONG  (pre-scaled, data leakage):
   [0.870 0.860 0.880 0.850 0.870]  → mean: 0.8660

✅ CORRECT (Pipeline, no leakage):
   [0.850 0.840 0.870 0.830 0.860]  → mean: 0.8500

Inflation from leakage: +1.60%
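What the Pipeline does inside each fold can be written out by hand. The sketch below reuses the same dataset settings and splitter as the example above and checks that a manual fit-on-train, transform-on-test loop gives the same scores as the Pipeline; it is meant as an illustration of the mechanism, not as a replacement for the Pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=10, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, test_idx in skf.split(X, y):
    scaler = StandardScaler().fit(X[train_idx])   # statistics from the training fold only
    X_tr = scaler.transform(X[train_idx])
    X_te = scaler.transform(X[test_idx])          # test fold is transformed, never fitted on
    svc = SVC(kernel='rbf').fit(X_tr, y[train_idx])
    manual_scores.append(svc.score(X_te, y[test_idx]))

pipeline = Pipeline([('scaler', StandardScaler()), ('svc', SVC(kernel='rbf'))])
pipeline_scores = cross_val_score(pipeline, X, y, cv=skf, scoring='accuracy')

print("Manual loop :", np.round(manual_scores, 4))
print("Pipeline    :", np.round(pipeline_scores, 4))
```

The two rows match fold for fold, which shows that the Pipeline is simply automating the leak-free loop.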

Section 13

All CV Variants at a Glance — Decision Guide

🌿 Which Cross-Validation Should I Use?

Q1: Is this a classification problem?

✅ YES → Use StratifiedKFold

Preserves class balance in every fold. Always safer than plain KFold for classification.

❌ NO (Regression) → Use KFold

Stratification is by class label — meaningless for continuous targets. Use plain KFold.

Q2: Are your samples grouped (patient/user/time-series)?

✅ YES → GroupKFold or TimeSeriesSplit

Prevent data leakage between related samples. GroupKFold for unordered groups, TimeSeriesSplit for temporal data.

❌ NO → Standard KFold / StratifiedKFold

Samples are i.i.d. Standard CV applies.

Q3: Is your dataset very small (<100 samples)?

✅ YES → LOOCV or 10-fold

Maximise training data in each fold. LOOCV is ideal; k=10 is a practical compromise.

❌ NO → k=5 StratifiedKFold

The industry default. Fast, reliable, low variance.

Q4: Are you tuning hyperparameters AND reporting performance?

✅ YES → Nested CV

Inner loop tunes, outer loop evaluates. The only honest estimate when model selection and evaluation happen together.

❌ NO (eval only) → StratifiedKFold

Single-loop CV is sufficient when you've already fixed your hyperparameters.

Q5: Do you need tight statistical comparisons between models?

✅ YES → RepeatedStratifiedKFold (5×10)

50 scores → tight CI. Pair with t-test for significance testing between models.

❌ NO → k=5 StratifiedKFold

Standard evaluation. 5 scores are enough for most practical decisions.

Work through these five questions top to bottom. They cover the large majority of real-world CV selection scenarios.
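The first three questions can be folded into a small helper. This is a sketch only: choose_cv is a hypothetical function, and checking temporal and grouped structure before anything else is a design assumption of this sketch, made because leakage is the costlier mistake. Questions 4 and 5 (nested and repeated CV) wrap a splitter rather than replace it, so they are left out.

```python
from sklearn.model_selection import (
    GroupKFold, KFold, LeaveOneOut, StratifiedKFold, TimeSeriesSplit
)

def choose_cv(n_samples, classification, grouped=False, temporal=False, n_splits=5):
    """Pick a splitter following the questions above (hypothetical helper)."""
    if temporal:
        return TimeSeriesSplit(n_splits=n_splits)   # order matters: never shuffle
    if grouped:
        return GroupKFold(n_splits=n_splits)        # remember to pass groups= when splitting
    if n_samples < 100:
        return LeaveOneOut()                        # tiny dataset: maximise training size
    if classification:
        return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    return KFold(n_splits=n_splits, shuffle=True, random_state=42)

print(type(choose_cv(2000, classification=True)).__name__)                # StratifiedKFold
print(type(choose_cv(2000, classification=True, grouped=True)).__name__)  # GroupKFold
print(type(choose_cv(60, classification=False)).__name__)                 # LeaveOneOut
```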

CV Method	Class Stratified	Group-Safe	Temporal-Safe	Compute Cost	Best For
KFold	No	No	No	Low	Regression, balanced classes
StratifiedKFold	Yes	No	No	Low	Classification (default choice)
LOOCV	No	No	No	Very High	Tiny datasets (<100 samples)
StratifiedShuffleSplit	Yes	No	No	Medium	Large datasets, quick evaluation
GroupKFold	No (use StratifiedGroupKFold)	Yes	No	Low	Patient/user/session data
TimeSeriesSplit	No	No	Yes	Low	Sequential/temporal data
RepeatedStratifiedKFold	Yes	No	No	High	Research, tight CI, model comparison
Nested CV	Yes	Possible	Possible	Very High	Hyperparameter tuning + honest evaluation

Section 14

Golden Rules — Cross-Validation

🌿 Cross-Validation — Rules You Must Know

Prefer StratifiedKFold over plain KFold for classification. It costs nothing extra and protects against random bad splits. With an integer cv and a classifier, sklearn's cross_val_score and cross_validate both default to StratifiedKFold (unshuffled), but make the splitter explicit when you want shuffling or when writing manual CV loops.

Always use Pipelines when any preprocessing is involved — scaling, imputation, encoding, PCA. Preprocessing outside a Pipeline leaks test data statistics into training, producing artificially optimistic scores that won't replicate in production.

CV is for evaluation, not deployment. After CV tells you the model is good, retrain on 100% of available data for the final deployed model. Never deploy the fold-trained models — they each used only (k−1)/k of your data.

Check the standard deviation, not just the mean. A mean accuracy of 85% with std 0.01 is a stable, trustworthy model. A mean of 85% with std 0.08 means your model's performance swings wildly — it might be 77% or 93% in production. High std = high risk.

Identify your grouping structure before choosing a CV method. Multiple measurements per subject (patient, user, device) demand GroupKFold. Sequential data demands TimeSeriesSplit. Using standard CV on grouped data is one of the most common and most damaging evaluation mistakes in practice.

k=5 is almost always the right default. It gives a good bias-variance trade-off in the CV estimate. Use k=10 when you need a bit more precision and can afford 2× compute. Use LOOCV only for tiny datasets (<100 samples) where maximising training set size per fold matters. Avoid a single run of k=2: each model trains on only half the data, and two scores are too few to judge stability.

Use Nested CV when you tune and evaluate simultaneously. If you pick the best hyperparameters by CV score and report that same CV score as your model's performance, you are double-dipping. The optimistic bias can reach several percentage points, depending on the size of the search space and of the dataset. Nested CV largely removes this bias.

Never look at your holdout test set until the very end. CV is your model development tool. The holdout test set is your final, one-shot, honest evaluation. If you evaluate on the test set, tweak your model, and evaluate again — it has become part of your training process and is no longer honest.