
Data Splitting Mastery

A comprehensive, story-driven guide to the art and science of splitting datasets for machine learning. Covers train-test split, validation sets, and all major cross-validation techniques with real-world analogies, visual diagrams, and Python code examples.

Section 01

Why Data Splitting Matters

Every machine learning model has one job: generalise well to data it has never seen before. If you train and evaluate on the same data, your accuracy number is a lie — the model has simply memorised the answers. Data splitting is the practice of partitioning your dataset into separate subsets so that training, tuning, and final evaluation each happen on independent data.

The Exam Study Problem
Imagine a student who studies by reading past exam papers — then is tested on those exact same papers. They score 100%. Does that mean they've mastered the subject? Of course not. A fair exam uses new questions. In machine learning, your test set is that new exam — kept completely hidden until you need a final, honest performance number.
The Doctor Who Tested on His Own Patients
Dr. Aryan built a cancer-detection AI using records from 1,000 patients and tested it on those same 1,000 records — achieving 98% accuracy. He deployed it. Accuracy dropped to 61%. The model memorised patient IDs and quirks of his hospital's data-entry habits, not the actual cancer signals. Proper data splitting would have caught this before anyone was harmed.
⚠️
The Golden Rule

Never use the test set for any decision during model development. Not for hyperparameter tuning, not for feature selection, not for architecture choice. The moment you peek at test performance to make a decision, it becomes a second validation set — and your reported score becomes fiction.

The Three-Set Framework

📊 The Three-Way Data Split — 60% / 20% / 20%
[Figure: three-way data split. Training set 60% (① fit model weights, repeated), validation set 20% (② compare models here), test set 20% (③ final evaluation, touched once only).]

Proportions are guidelines. With 1M+ rows a 98/1/1 split is fine. With fewer than 1,000 rows, prefer cross-validation over any static holdout.

The Workflow Order

⚙️ THREE-SET PIPELINE
Step 01
Training Set (~60–70%)
The data your model learns from. Weights, coefficients, and decision boundaries are fitted here. More data here → better learning, but less left for evaluation.
Step 02
Validation Set (~15–20%)
Used during development to tune hyperparameters and compare model architectures. You see this data repeatedly — it is your feedback loop, not your final judge.
Step 03
Test Set (~15–20%)
Locked away until the very end. Touched exactly once to report final generalisation performance. This is the number you report to the world.
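The three partitions above can be sketched without any library at all. This is a minimal illustration, assuming a 60/20/20 split over row indices; the function name `three_way_split` and its parameters are made up for this example and are not part of any library.

```python
import random

def three_way_split(n_rows, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle row indices once, then cut them into train / val / test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # seeded, so the split is reproducible
    n_train = round(n_rows * train_frac)
    n_val = round(n_rows * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]          # whatever remains is the test set
    return train, val, test

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The point of the sketch is that the three index lists never overlap: a row belongs to exactly one partition, which is what keeps training, tuning, and final evaluation independent.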

Section 02

Train-Test Split (The Holdout Method)

The simplest form of data splitting is the holdout method: randomly partition the dataset once into a training set and a test set. It is fast, easy to understand, and the right default when you have a large dataset (typically >10,000 rows).

The Spam Filter That Was Too Optimistic
A team built a spam classifier and reported 99.2% accuracy — training and evaluating on the same 5,000 emails. When deployed, it caught only 60% of spam. The model had memorised sender addresses from training that never appeared in production. A proper holdout split would have caught this instantly. Their "99.2%" became "60%" overnight.

How It Works — Shuffle Then Split

🔀 Train-Test Split — Step by Step
[Figure: train-test split, step by step. Step 1: raw dataset in its original order (S1 … S10). Shuffle randomly (random_state=42). Step 2: shuffled order (S7 S3 S9 S1 S5 S10 S2 S6 S4 S8), then split into train (80%, the model trains here) and test (20%, final evaluation).]

Always shuffle before splitting — unless dealing with time-series data where order must be preserved.

Python Implementation

from sklearn.model_selection import train_test_split
import pandas as pd

# Load dataset
df = pd.read_csv('house_prices.csv')
X  = df.drop('price', axis=1)
y  = df['price']

# Simple 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=42,
    shuffle=True
)
print(f"Training rows : {len(X_train)}")
print(f"Test rows     : {len(X_test)}")
▶ Output
Training rows : 800
Test rows     : 200

Key Parameters

⚙️ train_test_split — Parameter Guide
test_size
0.1 – 0.3 or int
Fraction or count of samples in test set. Use 0.2 as default. Smaller for large datasets.
random_state
Any integer (e.g. 42)
Seeds the RNG — ensures identical splits every run. Always set this in production.
shuffle
True / False
Set False only for time-series data where order must be preserved.
stratify
y (for classification)
Preserves class proportions in both splits. Critical for imbalanced datasets.
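The parameters in the guide above can be checked on toy data. The sketch below assumes a made-up array of 50 rows; it only demonstrates how `random_state`, an integer `test_size`, and `shuffle=False` behave.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # 50 toy rows, 2 features
y = np.arange(50)

# Same seed -> identical split on every run.
_, a_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
_, b_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = np.array_equal(a_test, b_test)

# An integer test_size is an absolute row count, not a fraction.
_, c_test, _, _ = train_test_split(X, y, test_size=5, random_state=42)

# shuffle=False keeps the original order: the test set is the tail of the data.
_, d_test, _, _ = train_test_split(X, y, test_size=0.2, shuffle=False)

print(same_split, len(a_test), len(c_test), d_test[:, 0].tolist())
```

With `shuffle=False` the last 20% of rows become the test set, which is the behaviour you want for ordered data.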

Stratified Split — Preserving Class Balance

When classes are imbalanced, a random split may put all minority cases in training and none in the test set. stratify=y guarantees every split mirrors the original class ratio.

❌ Without stratify — Risky
Set | Class 0 | Class 1 (Fraud)
Train | 760 | 40 (5%)
Test | 190 | 10 (5%) by chance
Worst case | 800 | 0 (0%): disaster
✅ With stratify=y — Safe
Set | Class 0 | Class 1 (Fraud)
Train | 760 | 40 (5%) guaranteed
Test | 190 | 10 (5%) guaranteed
Both sets | Exact same ratio ✓
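The stratified numbers in the tables above can be reproduced on synthetic labels. This is a sketch assuming a toy dataset of 950 legitimate rows and 50 fraud rows; the feature column is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 950 legitimate rows (0) and 50 fraud rows (1), i.e. 5% fraud.
y = np.array([0] * 950 + [1] * 50)
X = np.arange(len(y)).reshape(-1, 1)    # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

train_fraud = int(y_train.sum())
test_fraud = int(y_test.sum())
print(f"Train: {len(y_train)} rows, {train_fraud} fraud")
print(f"Test : {len(y_test)} rows, {test_fraud} fraud")
```

Because the class counts divide evenly here, both partitions keep the 5% fraud rate exactly; with counts that do not divide evenly, the ratios are matched as closely as possible.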
💡
Rule of Thumb for Split Ratios

<1K rows → Use cross-validation  |  1K–10K → 70/30 split  |  10K–100K → 80/20 split  |  >100K → 90/10 or 99/1 — you have enough test samples anyway.


Section 03

The Validation Set

Once you add hyperparameter tuning — choosing learning rate, tree depth, regularisation strength — you need a third partition: the validation set. This is the data you use to compare models and tune parameters without contaminating the test set.

The Overfitted Sales Predictor
A data scientist tried 50 hyperparameter combinations, each time measuring accuracy on the test set and keeping the best. The final test accuracy looked excellent. But in production, the model performed terribly. By evaluating 50 times on the test set, she effectively trained on the test set. She needed a separate validation set for tuning — the test set should have been touched only once.
📊 Three-Way Split Workflow — Model Comparison Flow
[Figure: three-way split workflow (train 60%, val 20%, test 20%). Model A (lr=0.01, d=3), Model B (lr=0.1, d=5) and Model C (lr=0.001, d=8) are each scored on the validation set → compare val scores and pick the best → best model → final test evaluation.]

Each model is evaluated on validation only. The winner is tested exactly once on the test set to report final performance. Never use the test set to pick between models.

Creating a Three-Way Split in Python

from sklearn.model_selection import train_test_split

# Step 1: split off the test set first (20%)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Step 2: split remaining 80% → train (75%) + val (25%)
# 0.25 × 0.80 = 0.20 of total → gives 60/20/20 overall
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(f"Train : {len(X_train)} ({len(X_train)/len(X)*100:.0f}%)")
print(f"Val   : {len(X_val)}   ({len(X_val)/len(X)*100:.0f}%)")
print(f"Test  : {len(X_test)}  ({len(X_test)/len(X)*100:.0f}%)")
▶ Output
Train : 600 (60%)
Val   : 200   (20%)
Test  : 200  (20%)
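Once the three partitions exist, model selection might look like the sketch below. It assumes synthetic data from `make_classification` and uses decision-tree depth as a stand-in for whichever hyperparameters you are tuning; the candidate depths are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

# Every candidate is fitted on train and scored on validation only.
val_scores = {}
for depth in (3, 5, 8):
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=42)
    candidate.fit(X_train, y_train)
    val_scores[depth] = candidate.score(X_val, y_val)

best_depth = max(val_scores, key=val_scores.get)

# The winner is evaluated on the test set exactly once.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(f"Best depth: {best_depth}, test accuracy: {test_accuracy:.3f}")
```

The test set appears in a single line at the very end; nothing it reveals feeds back into the choice of depth.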
ℹ️
Why Test Accuracy Is Usually Lower Than Validation Accuracy

Final test accuracy is typically slightly below the best validation accuracy, and that is normal and healthy. The validation set was used to select the best model, so its score is slightly optimistic. The test set gives the honest, real-world estimate. A test score a little above validation can happen by chance on a small test set, but a large gap in either direction is worth investigating (for example, a leak or a split that is not representative).


Section 04

K-Fold Cross-Validation

A single validation split is sensitive to which rows happen to land where. Cross-validation solves this by repeating the train/evaluate cycle multiple times on different partitions and averaging the results. Every data point gets to be in the validation set exactly once.

The Fair Judge Panel
A singing competition with one judge → biased. Ten judges → much fairer. Cross-validation is like rotating ten different "judges" (validation folds) so no single random draw unfairly decides your model's fate. The final score is the average of all judges' verdicts.
📊 5-Fold Cross-Validation — All 5 Iterations
[Figure: 5-fold cross-validation. The data is cut into Folds 1–5. In iteration 1 the validation fold is Fold 1 and the rest are training folds; in each later iteration the validation fold moves one position, so by iteration 5 every fold has been held out once.]

Every data point gets to be in the validation set exactly once. Final score = mean ± std across all k runs.
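The rotation itself is simple enough to write by hand. The following is a library-free sketch of the idea, not how scikit-learn implements it; `k_fold_indices` is a hypothetical helper name.

```python
import random
from collections import Counter

def k_fold_indices(n_rows, k=5, seed=42):
    """Yield (train, val) index lists; each row lands in val exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]     # k near-equal slices
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

val_counts = Counter()
for train, val in k_fold_indices(20, k=5):
    assert not set(train) & set(val)              # no row is in both sides
    val_counts.update(val)

print(sorted(val_counts.items()))
```

Counting how often each row is held out makes the "exactly once" property explicit.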

from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble       import RandomForestClassifier
import numpy as np

model = RandomForestClassifier(n_estimators=100, random_state=42)
kf    = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='accuracy')
print("Fold scores :", np.round(scores, 4))
print(f"Mean        : {scores.mean():.4f}")
print(f"Std Dev     : {scores.std():.4f}")
print(f"Mean ± 2 SD : {scores.mean():.4f} ± {2*scores.std():.4f}")
▶ Output
Fold scores : [0.8833 0.9083 0.8917 0.9000 0.8750]
Mean        : 0.8917
Std Dev     : 0.0118
Mean ± 2 SD : 0.8917 ± 0.0236

The Math Behind K-Fold Scoring

CV Score (Mean)
CV = (1/k) × Σ score_i
Average metric across all k folds. More robust than a single split estimate.
CV Std Deviation
σ = √[ (1/k) × Σ (score_i − CV)² ]
High σ means your model's performance is unstable across different data slices.
Training Size per Fold
n_train = N × (k−1) / k
For k=5 and N=1000: each fold trains on 800 samples, validates on 200.
Choosing k
k = 5 or k = 10
k=5 is standard. k=10 gives lower bias but costs more compute. Avoid k=2.
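As a quick check of the formulas, the fold scores printed earlier can be plugged in directly. This sketch uses only the standard library; `train_size_per_fold` is a name invented for the example, and `pstdev` is used because the formula divides by k rather than k−1.

```python
from statistics import mean, pstdev

# Fold scores copied from the 5-fold output shown earlier in this section.
fold_scores = [0.8833, 0.9083, 0.8917, 0.9000, 0.8750]

cv_mean = mean(fold_scores)          # CV = (1/k) × Σ score_i
cv_std = pstdev(fold_scores)         # population std: divides by k

def train_size_per_fold(n_rows, k):
    """n_train = N × (k−1) / k, as a whole number of rows."""
    return n_rows * (k - 1) // k

print(f"CV mean : {cv_mean:.4f}")
print(f"CV std  : {cv_std:.4f}")
print(f"n_train : {train_size_per_fold(1000, 5)}")
```

The population standard deviation matches the default of NumPy's `std()`, which is what the earlier code prints.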

Section 05

Stratified K-Fold — Preserving Class Balance

Regular K-Fold assigns rows to folds without looking at the labels. With imbalanced classes, a fold could end up with zero minority cases in its validation set, which makes the score for that fold unreliable. StratifiedKFold ensures each fold maintains the same class proportions as the full dataset.

📊 Regular K-Fold vs Stratified K-Fold — Imbalanced Data
[Figure: regular K-Fold vs Stratified K-Fold on imbalanced data. Regular K-Fold (risky): Fold 1 validation has 0 minority cases; in Fold 2 the minority class is bunched in one spot. Stratified K-Fold (safe): Fold 1 validation is ~80% majority, 20% minority; Fold 2 keeps the same class ratio. Legend: majority class (e.g. Not Fraud), minority class (e.g. Fraud).]

Stratified K-Fold is almost always preferred over plain K-Fold for classification tasks — especially when classes are imbalanced.

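from sklearn.model_selection import StratifiedKFold, cross_val_score

# Each fold keeps the class ratio of the full training set
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X_train, y_train, cv=skf, scoring='roc_auc')
print("Stratified ROC-AUC scores:", np.round(scores, 4))
print(f"Mean AUC: {scores.mean():.4f} ± {scores.std():.4f}")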
▶ Output
Stratified ROC-AUC scores: [0.9412 0.9487 0.9356 0.9523 0.9401]
Mean AUC: 0.9436 ± 0.0060
ℹ️
Regular KFold vs StratifiedKFold

For imbalanced classification tasks, always use StratifiedKFold instead of plain KFold. Regular K-Fold can place all minority class samples in a single fold, making some iterations train with zero positive examples. Stratified ensures every fold sees the correct class proportions.
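The difference is easy to see by counting minority rows per validation fold. The sketch below assumes a deliberately bad case: 100 toy rows sorted by label, with plain KFold left unshuffled so that the minority class bunches into one fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 90 majority rows followed by 10 minority rows (sorted on purpose).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))               # features are irrelevant here

plain = [int(y[val].sum()) for _, val in KFold(n_splits=5).split(X)]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
strat = [int(y[val].sum()) for _, val in skf.split(X, y)]

print("Minority rows per validation fold")
print("KFold           :", plain)
print("StratifiedKFold :", strat)
```

With plain KFold, four validation folds contain no minority rows at all, and the iteration that validates on the fifth fold trains with zero positives. Shuffling makes this exact pattern unlikely but does not rule out badly skewed folds, which is why stratification is the safer default.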


Section 06

Leave-One-Out Cross-Validation (LOOCV)

LOOCV is the extreme case of k-fold where k = N. Each iteration trains on N−1 samples and validates on a single sample. Every data point gets one turn as the test case. It gives the least biased estimate but is computationally expensive for large datasets.

The Clinical Trial with 10 Patients
Dr. Singh has only 10 patient records. Splitting 80/20 would give just 2 patients for validation — completely unreliable. LOOCV solves this: train on 9 patients, test on the 10th. Repeat 10 times. Every patient gets a turn being the "test case", and the final accuracy is averaged across all 10 predictions.
📊 Leave-One-Out CV — n=6 Mini Example
[Figure: leave-one-out cross-validation with 6 samples (S1–S6). In iteration i=1 the test sample is S1 and the rest are training; the test position moves one sample per iteration and repeats N times in total. Final score = mean of the N individual predictions.]

LOOCV uses maximum training data per iteration. Ideal for tiny datasets (<100 rows), computationally prohibitive for large ones.

Property | LOOCV | 10-Fold CV
Training size per fold | n−1 (maximum) | ~90% of data
Number of iterations | n (e.g. 10,000) | 10
Bias of estimate | Very low | Low
Variance of estimate | High | Low–Medium
Computational cost | Very high | Moderate
Best for | Tiny datasets (<100) | Most practical cases
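For a concrete picture of the 10-record scenario, scikit-learn's `LeaveOneOut` splitter can be inspected directly. This sketch assumes ten placeholder rows and only looks at the indices, not at a model.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(10).reshape(-1, 1)        # 10 placeholder "patient" rows

loo = LeaveOneOut()
splits = list(loo.split(X))

for train_idx, test_idx in splits[:3]:
    print(f"train on {len(train_idx)} rows, test on row {int(test_idx[0])}")

# Collect which row was held out in each iteration.
held_out = sorted(int(test_idx[0]) for _, test_idx in splits)
print("iterations:", len(splits))
```

The splitter can be passed as `cv=LeaveOneOut()` to `cross_val_score`; each iteration then scores a single row, so only the mean across all iterations is meaningful.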

Section 07

Time Series Cross-Validation

Standard k-fold must not be used for time-series data — using future data to predict the past is data leakage. TimeSeriesSplit enforces temporal order: training always uses past data, validation always uses future data.

The Stock Market Time Traveller
Vivek built a stock price prediction model with 94% accuracy using random K-Fold. He deployed it — and lost money. The model had been trained using future price data to predict the past. In production, the future is unknown. TimeSeriesSplit would have revealed his real accuracy of 52% — barely better than chance.
📊 TimeSeriesSplit — Expanding Window (4 Splits)
[Figure: TimeSeriesSplit with an expanding window, timeline from 2020 (past) to 2024 (future). Split 1: train 2020–2021, val 2022. Split 2: train 2020–2022, val 2023 H1. Split 3: train 2020–2023 H1, val 2023 H2. Split 4: train 2020–2023, val 2024. The training window always uses past data only, and validation always lies after training.]

Never shuffle time-series data. The training window expands with each split — future data never leaks into training.

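from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.linear_model   import Ridge
import numpy as np

# X_timeseries / y_timeseries: features and target, rows sorted oldest → newest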
tscv     = TimeSeriesSplit(n_splits=5)
model_ts = Ridge(alpha=1.0)

scores = cross_val_score(
    model_ts, X_timeseries, y_timeseries,
    cv=tscv,
    scoring='neg_root_mean_squared_error'
)
rmse_scores = -scores
print(f"RMSE per fold : {np.round(rmse_scores, 2)}")
print(f"Mean RMSE     : {rmse_scores.mean():.3f}")
▶ Output
RMSE per fold : [12.4 11.8 13.2 12.1 11.5]
Mean RMSE     : 12.200
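To see the expanding window itself, it helps to print the indices that `TimeSeriesSplit` produces. The sketch below assumes 12 placeholder rows already sorted from oldest to newest.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)        # 12 time-ordered placeholder rows

tscv = TimeSeriesSplit(n_splits=4)
windows = [(train.tolist(), val.tolist()) for train, val in tscv.split(X)]

for train, val in windows:
    print(f"train rows 0..{train[-1]}  ->  validate on rows {val}")
```

In every split the last training index is smaller than the first validation index, and the training window grows from one split to the next.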

Section 08

Which Technique Should You Use?

📊 Data Splitting Decision Flowchart
[Figure: decision flowchart for choosing a splitting technique. Time-series data → TimeSeriesSplit. Otherwise, by dataset size (threshold 5,000 rows) and task: classification (especially imbalanced) → Stratified K-Fold; very tiny dataset → LOOCV (k = n); deep learning or huge dataset (>100K rows) → simple train / val / test; default → K-Fold CV with k=5 or k=10.]

When in doubt, 5-Fold or 10-Fold CV is robust, widely understood, and works for 90% of use cases.

✂️ Simple Train-Test
train_test_split()
Best for large datasets (>10K rows) where speed matters and you have enough data for a reliable single holdout.
✓ Fast   ✓ Simple   ✓ Scales to big data
✗ High variance on small datasets
🔁 K-Fold CV
KFold / StratifiedKFold
Gold standard for model evaluation. Use StratifiedKFold for classification. k=5 or k=10.
✓ Low variance   ✓ Efficient use of data
✗ k× slower than a single split
📅 Time Series Split
TimeSeriesSplit
The only valid CV for temporal data. Always trains on past, validates on future. Never shuffle.
✓ No leakage   ✓ Realistic evaluation
✗ Early folds train on very little data
Situation | Recommended Method | Why
Dataset >100K rows | Train-Test Split | Enough data for reliable single holdout
Dataset 1K–10K rows | 5-Fold Stratified CV | Maximises data usage, reduces estimate variance
Dataset <1K rows | 10-Fold or LOOCV | Can't afford to waste data on a single split
Imbalanced classes | StratifiedKFold | Preserves class ratios in every fold
Time-series / sequential | TimeSeriesSplit | Avoids look-ahead leakage
Hyperparameter tuning | Nested CV | Outer loop evaluates, inner loop tunes
Regression tasks | KFold (no stratify) | Stratification on continuous targets isn't meaningful
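The table can be read as a small decision function. The sketch below is one possible encoding, not a definitive rule: the function name, the order in which the checks run, and the 100-row cut-off for LOOCV (taken from the LOOCV section) are all assumptions made for illustration.

```python
def recommend_splitter(n_rows, is_time_series=False,
                       is_classification=False, tuning=False):
    """Return the name of a splitting method for a given situation.

    Hypothetical helper: the order of the checks is an assumption.
    """
    if is_time_series:
        return "TimeSeriesSplit"        # temporal order overrides everything
    if n_rows > 100_000:
        return "train_test_split"       # a single holdout is reliable enough
    if tuning:
        return "Nested CV"              # outer loop evaluates, inner loop tunes
    if n_rows < 100:
        return "LeaveOneOut"            # too few rows to hold any back
    if is_classification:
        return "StratifiedKFold"        # keep class ratios in every fold
    return "KFold"                      # regression or generic default

print(recommend_splitter(5_000, is_classification=True))
print(recommend_splitter(250_000))
print(recommend_splitter(2_000, is_time_series=True))
```

Real projects often combine several rows of the table, for example stratified folds inside a nested CV, so treat the return value as a starting point.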

Section 09

Nested Cross-Validation & Pipelines

When you use the same CV loop for both model selection and performance estimation, the estimate is optimistically biased. The solution is nested cross-validation — an outer loop for unbiased performance estimation, and an inner loop for hyperparameter search.

📊 Nested Cross-Validation Architecture
[Figure: nested cross-validation. Outer loop (performance estimate): Test Fold 1 is held out and Folds 2–5 form the inner training pool. Inner loop (hyperparameter search, GridSearchCV): the training pool is split again into validation and training folds to find the best hyperparameters. The best inner model predicts on outer Test Fold 1, the score is recorded, and the process repeats for all outer folds.]

Nested CV separates model selection (inner loop) from performance estimation (outer loop), so the reported score is not inflated by the hyperparameter search.

from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.svm            import SVC

param_grid = {'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']}
inner_cv   = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv   = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

clf = GridSearchCV(SVC(), param_grid, cv=inner_cv, scoring='accuracy')

nested_scores = cross_val_score(clf, X, y, cv=outer_cv, scoring='accuracy')
print(f"Nested CV Accuracy: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
▶ Output
Nested CV Accuracy: 0.8912 ± 0.0143

Section 10

Data Leakage — The #1 Mistake in ML

Data leakage happens when information from outside the training set leaks into the model during training, making it look better than it really is. It is the single most common reason ML models fail in production despite appearing excellent during development.

The Leaky Exam Cheat
A student accidentally received next week's exam paper while studying. He aced the test with 100%. The teacher put him in the advanced class — but the next exam (without the leak) he failed completely. Data leakage works exactly the same way: your model "cheats" by seeing information it should never access during training, then fails in the real world.

Three Types of Leakage

🎯
Target Leakage
A feature is derived from or correlated with the target after the event. Example: using "insurance_claim_amount" to predict "will_claim".
Most Dangerous
🔀
Train-Test Contamination
Fitting preprocessors (scaler, encoder, imputer) on the full dataset before splitting. Test data statistics corrupt training preprocessing.
Very Common
📅
Temporal Leakage
Using future information to predict past events. Especially common when random shuffling ignores time order in sequential data.
Sneaky

Leaky Pipeline vs Clean Pipeline

❌ Leaky Pipeline — Wrong
scaler.fit(X_all) ← entire dataset!
X_scaled = scaler.transform(X_all)
X_train, X_test = split(X_scaled)
model.fit(X_train)
model.predict(X_test)
✅ Clean Pipeline — Correct
X_train, X_test = split(X_raw)
scaler.fit(X_train) ← train only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
model.fit(X_train) → predict(X_test)
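The two pipelines above can be compared on real numbers. This is a minimal sketch assuming random toy features; it fits one `StandardScaler` on the full data (leaky) and one on the training rows only (clean), then shows that their learned statistics differ.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(200, 3))   # toy features
y = rng.integers(0, 2, size=200)                          # toy labels

X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.2, random_state=42
)

leaky_scaler = StandardScaler().fit(X_raw)      # wrong: has seen the test rows
clean_scaler = StandardScaler().fit(X_train)    # right: training rows only

# Reuse the training statistics on the test rows; never refit on them.
X_train_scaled = clean_scaler.transform(X_train)
X_test_scaled = clean_scaler.transform(X_test)

leak_gap = np.abs(leaky_scaler.mean_ - clean_scaler.mean_).max()
print(f"Largest difference in learned means: {leak_gap:.4f}")
```

On this toy data the gap is small, but it is not zero: the leaky scaler's statistics depend on rows the model is later evaluated on. With target encoders, imputers, or feature selection the same mistake can inflate scores far more.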

Using sklearn Pipeline to Prevent Leakage Automatically

from sklearn.pipeline        import Pipeline
from sklearn.preprocessing   import StandardScaler
from sklearn.ensemble        import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Pipeline ensures scaler is fit ONLY on training data in each fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model',  GradientBoostingClassifier(random_state=42))
])

# cross_val_score + Pipeline = leakage-free evaluation
scores = cross_val_score(
    pipe, X, y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='accuracy'
)
print(f"Pipeline CV: {scores.mean():.4f} ± {scores.std():.4f}")
▶ Output
Pipeline CV: 0.9023 ± 0.0087

Section 11

Summary & Golden Rules

✂️
Train-Test Split
Holdout Method
Fast, simple, reliable for large data. Set random_state and stratify for best results.
🔁
Validation Set
3-Way Split
Essential when tuning hyperparameters. Keeps test set pristine for final honest evaluation.
📊
Cross-Validation
k-Fold / CV
Best practice for small–medium datasets. Every row acts as validation exactly once.
Method | Best For | sklearn Class | Compute Cost
Train-Test Split | Large datasets, quick baseline | train_test_split() | Low
K-Fold CV | General purpose evaluation | KFold | Medium
Stratified K-Fold | Classification, imbalanced data | StratifiedKFold | Medium
LOOCV | Very small datasets (<100) | LeaveOneOut | Very High
Time Series Split | Sequential / temporal data | TimeSeriesSplit | Medium
Nested CV | Hyperparameter tuning + eval | GridSearchCV + outer CV | Very High
🎯 6 Golden Rules of Data Splitting
1. Split before any preprocessing. Fit all transformers (scalers, imputers, encoders) only on training data. Apply them to validation and test without refitting.
2. Set random_state everywhere. Reproducibility is not optional in science or production. Any split without a seed is a split you cannot reproduce.
3. Use stratify for classification. Any time your target is categorical, use stratify=y in train_test_split and StratifiedKFold in CV loops.
4. Never tune on the test set. Touching the test set more than once turns it into a second validation set and makes your reported performance optimistic.
5. Respect temporal ordering in time series. Past data trains the model. Future data validates it. Never shuffle time-series data before splitting.
6. Use sklearn Pipelines. They guarantee every preprocessing step is applied correctly within each fold, eliminating the most common source of leakage.
Key Takeaway

Data splitting is not just a technical step — it is a philosophy of honesty in ML. The gold-standard workflow: Split off test set first → Use Stratified K-Fold + Pipeline on training data → Pick best model → Evaluate once on test set → Report results. A model evaluated properly is a model you can trust.
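The workflow in the takeaway can be sketched end to end. The example assumes synthetic data from `make_classification` and uses logistic regression with three arbitrary values of `C` as the candidate models; `make_pipe` is a helper defined only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def make_pipe(C):
    """Scaler + model in one object, so scaling is refit inside every fold."""
    return Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(C=C, max_iter=1000)),
    ])

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 1. Split off the test set first.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# 2. Stratified K-Fold + Pipeline on the development data only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_means = {
    C: cross_val_score(make_pipe(C), X_dev, y_dev, cv=cv, scoring="accuracy").mean()
    for C in (0.01, 1.0, 100.0)
}

# 3. Pick the best candidate by cross-validated score.
best_C = max(cv_means, key=cv_means.get)

# 4. Refit on all development data and evaluate once on the test set.
final_model = make_pipe(best_C).fit(X_dev, y_dev)
test_accuracy = final_model.score(X_test, y_test)
print(f"Best C: {best_C}, test accuracy: {test_accuracy:.3f}")
```

The test rows are used in one place only, after every modelling decision has been made.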
