Data Splitting in Machine Learning

Section 01

Why Data Splitting Matters

Every machine learning model has one job: generalise well to data it has never seen before. If you train and evaluate on the same data, your accuracy number is a lie — the model has simply memorised the answers. Data splitting is the practice of partitioning your dataset into separate subsets so that training, tuning, and final evaluation each happen on independent data.

📖 Analogy

The Exam Study Problem

Imagine a student who studies by reading past exam papers — then is tested on those exact same papers. They score 100%. Does that mean they've mastered the subject? Of course not. A fair exam uses new questions. In machine learning, your test set is that new exam — kept completely hidden until you need a final, honest performance number.

📖 Real-World Story

The Doctor Who Tested on His Own Patients

Dr. Aryan built a cancer-detection AI using records from 1,000 patients and tested it on those same 1,000 records — achieving 98% accuracy. He deployed it. Accuracy dropped to 61%. The model memorised patient IDs and quirks of his hospital's data-entry habits, not the actual cancer signals. Proper data splitting would have caught this before anyone was harmed.

⚠️

The Golden Rule

Never use the test set for any decision during model development. Not for hyperparameter tuning, not for feature selection, not for architecture choice. The moment you peek at test performance to make a decision, it becomes a second validation set — and your reported score becomes fiction.

The Three-Set Framework

📊 The Three-Way Data Split — 60% / 20% / 20%

Proportions are guidelines. With 1M+ rows a 98/1/1 split is fine. With fewer than 1,000 rows, prefer cross-validation over any static holdout.

The Workflow Order

⚙️ THREE-SET PIPELINE

Step 01

Training Set (~60–70%)

The data your model learns from. Weights, coefficients, and decision boundaries are fitted here. More data here → better learning, but less left for evaluation.

Step 02

Validation Set (~15–20%)

Used during development to tune hyperparameters and compare model architectures. You see this data repeatedly — it is your feedback loop, not your final judge.

Step 03

Test Set (~15–20%)

Locked away until the very end. Touched exactly once to report final generalisation performance. This is the number you report to the world.

Section 02

Train-Test Split (The Holdout Method)

The simplest form of data splitting is the holdout method: randomly partition the dataset once into a training set and a test set. It is fast, easy to understand, and the right default when you have a large dataset (typically >10,000 rows).

📖 Real-World Story

The Spam Filter That Was Too Optimistic

A team built a spam classifier and reported 99.2% accuracy — training and evaluating on the same 5,000 emails. When deployed, it caught only 60% of spam. The model had memorised sender addresses from training that never appeared in production. A proper holdout split would have caught this instantly. Their "99.2%" became "60%" overnight.

How It Works — Shuffle Then Split

🔀 Train-Test Split — Step by Step

Always shuffle before splitting — unless dealing with time-series data where order must be preserved.

Python Implementation

from sklearn.model_selection import train_test_split
import pandas as pd

# Load dataset
df = pd.read_csv('house_prices.csv')
X  = df.drop('price', axis=1)
y  = df['price']

# Simple 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=42,
    shuffle=True
)
print(f"Training rows : {len(X_train)}")
print(f"Test rows     : {len(X_test)}")

▶ Output

Training rows : 800
Test rows     : 200

Key Parameters

⚙️ train_test_split — Parameter Guide

test_size

0.1 – 0.3 or int

Fraction or count of samples in test set. Use 0.2 as default. Smaller for large datasets.

random_state

Any integer (e.g. 42)

Seeds the RNG — ensures identical splits every run. Always set this in production.

shuffle

True / False

Set False only for time-series data where order must be preserved.

stratify

y (for classification)

Preserves class proportions in both splits. Critical for imbalanced datasets.
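The effect of random_state and shuffle can be checked directly. The following is a minimal sketch, assuming a placeholder array of 100 ordered rows in place of a real dataset; the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (an assumption for illustration): 100 rows in a fixed order.
X = np.arange(100).reshape(-1, 1)

# Same seed -> the same rows land in the test set on every run.
_, test_a = train_test_split(X, test_size=0.2, random_state=42)
_, test_b = train_test_split(X, test_size=0.2, random_state=42)
same_split = np.array_equal(test_a, test_b)

# shuffle=False keeps the original order: the last 20% of rows become the test set.
_, ordered_test = train_test_split(X, test_size=0.2, shuffle=False)

print("Identical split with identical seed:", same_split)
print("Ordered test rows run from", ordered_test[0, 0], "to", ordered_test[-1, 0])
```

With shuffle=False the test set is simply the tail of the data, which is the behaviour a time-ordered dataset needs.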

Stratified Split — Preserving Class Balance

When classes are imbalanced, a purely random split can leave the test set with far fewer minority cases than expected, in the extreme case none at all. stratify=y keeps the class ratio in each split as close to the original ratio as rounding allows.

❌ Without stratify — Risky

Set	Class 0	Class 1 (Fraud)
Train	760	40 (5%)
Test	190	10 (5%) by chance
Worst case	800	0 (0%) — disaster

✅ With stratify=y — Safe

Set	Class 0	Class 1 (Fraud)
Train	760	40 (5%) guaranteed
Test	190	10 (5%) guaranteed
Both sets	Exact same ratio ✓
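To see the difference the tables describe, the sketch below counts fraud cases in the test set across ten seeds, with and without stratify. The labels are synthetic (1,000 rows, 5% fraud), chosen only to match the numbers in the tables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (an assumption): 1,000 rows, 50 fraud cases (5%).
y = np.array([1] * 50 + [0] * 950)
X = np.arange(len(y)).reshape(-1, 1)

plain_counts, strat_counts = [], []
for seed in range(10):
    # Plain random split: the number of fraud cases in the test set varies by seed.
    _, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=seed)
    plain_counts.append(int(y_test_plain.sum()))

    # Stratified split: the 5% fraud rate is preserved in the test set.
    _, _, _, y_test_strat = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    strat_counts.append(int(y_test_strat.sum()))

print("Fraud cases in test, plain split     :", plain_counts)
print("Fraud cases in test, stratified split:", strat_counts)
```

The stratified counts stay at 10 for every seed, while the plain counts drift around that value.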

💡

Rule of Thumb for Split Ratios

<1K rows → Use cross-validation | 1K–10K → 70/30 split | 10K–100K → 80/20 split | >100K → 90/10 or 99/1 — you have enough test samples anyway.

Section 03

The Validation Set

Once you add hyperparameter tuning — choosing learning rate, tree depth, regularisation strength — you need a third partition: the validation set. This is the data you use to compare models and tune parameters without contaminating the test set.

📖 Real-World Story

The Overfitted Sales Predictor

A data scientist tried 50 hyperparameter combinations, each time measuring accuracy on the test set and keeping the best. The final test accuracy looked excellent. But in production, the model performed terribly. By evaluating 50 times on the test set, she effectively trained on the test set. She needed a separate validation set for tuning — the test set should have been touched only once.

📊 Three-Way Split Workflow — Model Comparison Flow

Each model is evaluated on validation only. The winner is tested exactly once on the test set to report final performance. Never use the test set to pick between models.

Creating a Three-Way Split in Python

from sklearn.model_selection import train_test_split

# Step 1: split off the test set first (20%)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Step 2: split remaining 80% → train (75%) + val (25%)
# 0.25 × 0.80 = 0.20 of total → gives 60/20/20 overall
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(f"Train : {len(X_train)} ({len(X_train)/len(X)*100:.0f}%)")
print(f"Val   : {len(X_val)}   ({len(X_val)/len(X)*100:.0f}%)")
print(f"Test  : {len(X_test)}  ({len(X_test)/len(X)*100:.0f}%)")

▶ Output

Train : 600 (60%)
Val   : 200   (20%)
Test  : 200  (20%)

ℹ️

Why Test Accuracy Is Lower Than Validation Accuracy

Final test accuracy is typically slightly below the best validation accuracy, and that is normal and healthy. The validation set was used to select the best model, so its score is slightly optimistic. The test set gives the honest, real-world estimate. A small gap in either direction can be chance, but if test accuracy is much higher than validation accuracy, check for a bug, a leak, or an unrepresentative split.
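The whole tune-on-validation, report-on-test loop can be sketched in a few lines. This is an illustration only: the dataset is synthetic (make_classification), and the decision tree and its max_depth grid are arbitrary stand-ins for whatever model and hyperparameters are being tuned.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (an assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 60/20/20 split, with the test set carved off first.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

# Tune one hyperparameter using the validation set only.
val_scores = {}
for depth in [1, 2, 4, 8, 16]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    val_scores[depth] = model.score(X_val, y_val)

best_depth = max(val_scores, key=val_scores.get)

# Refit the chosen configuration and touch the test set exactly once.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final_model.fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)

print(f"Best max_depth (chosen on validation): {best_depth}")
print(f"Validation accuracy: {val_scores[best_depth]:.3f}")
print(f"Test accuracy      : {test_score:.3f}")
```

The test score is computed after every decision has been made, so it plays no part in choosing max_depth.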

Section 04

K-Fold Cross-Validation

A single validation split is sensitive to which rows happen to land where. Cross-validation solves this by repeating the train/evaluate cycle multiple times on different partitions and averaging the results. Every data point gets to be in the validation set exactly once.

📖 Analogy

The Fair Judge Panel

A singing competition with one judge → biased. Ten judges → much fairer. Cross-validation is like rotating ten different "judges" (validation folds) so no single random draw unfairly decides your model's fate. The final score is the average of all judges' verdicts.

📊 5-Fold Cross-Validation — All 5 Iterations

Every data point gets to be in the validation set exactly once. Final score = mean ± std across all k runs.

from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble       import RandomForestClassifier
import numpy as np

model = RandomForestClassifier(n_estimators=100, random_state=42)
kf    = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='accuracy')
print("Fold scores :", np.round(scores, 4))
print(f"Mean        : {scores.mean():.4f}")
print(f"Std Dev     : {scores.std():.4f}")
# Rough spread of the fold scores; this is not a formal 95% confidence interval
print(f"Mean ± 2 SD : {scores.mean():.4f} ± {2*scores.std():.4f}")

▶ Output

Fold scores : [0.8833 0.9083 0.8917 0.9000 0.8750]
Mean        : 0.8917
Std Dev     : 0.0118
Mean ± 2 SD : 0.8917 ± 0.0236

The Math Behind K-Fold Scoring

CV Score (Mean)

CV = (1/k) × Σ score_i

Average metric across all k folds. More robust than a single split estimate.

CV Std Deviation

σ = √[ (1/k) × Σ (score_i − CV)² ]

High σ means your model's performance is unstable across different data slices.

Training Size per Fold

n_train = N × (k−1) / k

For k=5 and N=1000: each fold trains on 800 samples, validates on 200.

Choosing k

k = 5 or k = 10

k=5 is standard. k=10 gives lower bias but costs more compute. Avoid k=2.
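The formulas above can be checked by hand. The sketch below applies them to the five fold scores from the earlier example output, then confirms the training-size formula with KFold on a placeholder array of 1,000 rows (the array contents are an assumption; only the row count matters).

```python
import numpy as np
from sklearn.model_selection import KFold

# Fold scores copied from the 5-fold example output.
scores = np.array([0.8833, 0.9083, 0.8917, 0.9000, 0.8750])
k = len(scores)

# CV = (1/k) * sum(score_i)
cv_mean = scores.sum() / k
# sigma = sqrt((1/k) * sum((score_i - CV)^2)), the population form that np.std uses by default
cv_std = np.sqrt(((scores - cv_mean) ** 2).sum() / k)

print(f"CV mean: {cv_mean:.4f}")
print(f"CV std : {cv_std:.4f}")

# n_train = N * (k - 1) / k, checked against the actual fold sizes
N = 1000
X_dummy = np.zeros((N, 1))  # placeholder features
fold_sizes = [(len(tr), len(va)) for tr, va in KFold(n_splits=5).split(X_dummy)]
print("Train/validation sizes per fold:", fold_sizes)
```

The mean and standard deviation round to 0.8917 and 0.0118, and every fold trains on 800 rows and validates on 200.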

Section 05

Stratified K-Fold — Preserving Class Balance

Regular K-Fold assigns rows to folds without looking at the class labels. With imbalanced classes, a fold could end up with zero minority cases in the validation set, which makes that fold's score unreliable and leaves metrics such as ROC-AUC undefined. StratifiedKFold keeps the class proportions in each fold close to those of the full dataset.

📊 Regular K-Fold vs Stratified K-Fold — Imbalanced Data

Stratified K-Fold is almost always preferred over plain K-Fold for classification tasks — especially when classes are imbalanced.

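from sklearn.model_selection import StratifiedKFold

# model, cross_val_score and np are reused from the K-Fold example in Section 04
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X_train, y_train, cv=skf, scoring='roc_auc')
print("Stratified ROC-AUC scores:", np.round(scores, 4))
print(f"Mean AUC: {scores.mean():.4f} ± {scores.std():.4f}")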

▶ Output

Stratified ROC-AUC scores: [0.9412 0.9487 0.9356 0.9523 0.9401]
Mean AUC: 0.9436 ± 0.0060

ℹ️

Regular KFold vs StratifiedKFold

For imbalanced classification tasks, always use StratifiedKFold instead of plain KFold. Regular K-Fold can place all minority class samples in a single fold, making some iterations train with zero positive examples. Stratified ensures every fold sees the correct class proportions.
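The failure mode described above is easy to reproduce. The sketch below uses synthetic labels sorted by class (an assumption chosen to make the worst case visible) and counts the positives that land in each validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic labels (an assumption): 100 rows, 5 positives, sorted by class.
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((len(y), 1))  # placeholder features

def positives_per_val_fold(splitter):
    """Count positive labels in the validation part of each fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]

# Unshuffled KFold cuts the data into consecutive blocks.
kfold_counts = positives_per_val_fold(KFold(n_splits=5))
# StratifiedKFold spreads each class evenly across folds.
strat_counts = positives_per_val_fold(StratifiedKFold(n_splits=5))

print("Positives per validation fold, KFold          :", kfold_counts)
print("Positives per validation fold, StratifiedKFold:", strat_counts)
```

Plain KFold puts all five positives in the last fold, so four validation folds contain no positives and one iteration trains without any. The stratified version gives each fold one positive.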

Section 06

Leave-One-Out Cross-Validation (LOOCV)

LOOCV is the extreme case of k-fold where k = N. Each iteration trains on N−1 samples and validates on a single sample. Every data point gets one turn as the test case. It gives the least biased estimate but is computationally expensive for large datasets.

📖 Real-World Story

The Clinical Trial with 10 Patients

Dr. Singh has only 10 patient records. Splitting 80/20 would give just 2 patients for validation — completely unreliable. LOOCV solves this: train on 9 patients, test on the 10th. Repeat 10 times. Every patient gets a turn being the "test case", and the final accuracy is averaged across all 10 predictions.

📊 Leave-One-Out CV — n=6 Mini Example

LOOCV uses maximum training data per iteration. Ideal for tiny datasets (<100 rows), computationally prohibitive for large ones.

Property	LOOCV	10-Fold CV
Training size per fold	n−1 (maximum)	~90% of data
Number of iterations	n (e.g. 10,000)	10
Bias of estimate	Very low	Low
Variance of estimate	High	Low–Medium
Computational cost	Very high	Moderate
Best for	Tiny datasets (<100)	Most practical cases
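In scikit-learn, LOOCV is available through the LeaveOneOut splitter. The following is a minimal sketch on a tiny synthetic dataset; the 30-row size and the logistic regression model are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Tiny synthetic dataset (an assumption): 30 rows, 4 features.
X, y = make_classification(
    n_samples=30, n_features=4, n_informative=2, n_redundant=0, random_state=42
)

# One iteration per row: train on 29 rows, validate on the remaining one.
loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(), X, y, cv=loo, scoring='accuracy')

print(f"Iterations    : {len(scores)}")
print(f"LOOCV accuracy: {scores.mean():.3f}")
```

Each individual score is either 0 or 1 because every validation set holds a single row; only the average is meaningful.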

Section 07

Time Series Cross-Validation

Standard k-fold must not be used for time-series data — using future data to predict the past is data leakage. TimeSeriesSplit enforces temporal order: training always uses past data, validation always uses future data.

📖 Real-World Story

The Stock Market Time Traveller

Vivek built a stock price prediction model with 94% accuracy using random K-Fold. He deployed it — and lost money. The model had been trained using future price data to predict the past. In production, the future is unknown. TimeSeriesSplit would have revealed his real accuracy of 52% — barely better than chance.

📊 TimeSeriesSplit — Expanding Window (4 Splits)

Never shuffle time-series data. The training window expands with each split — future data never leaks into training.

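import numpy as np
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.linear_model   import Ridge

# X_timeseries / y_timeseries: features and target sorted by time (assumed already loaded)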

tscv     = TimeSeriesSplit(n_splits=5)
model_ts = Ridge(alpha=1.0)

scores = cross_val_score(
    model_ts, X_timeseries, y_timeseries,
    cv=tscv,
    scoring='neg_root_mean_squared_error'
)
rmse_scores = -scores
print(f"RMSE per fold : {np.round(rmse_scores, 2)}")
print(f"Mean RMSE     : {rmse_scores.mean():.3f}")

▶ Output

RMSE per fold : [12.4 11.8 13.2 12.1 11.5]
Mean RMSE     : 12.200
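To see the expanding window itself, it helps to print the row indices that TimeSeriesSplit produces. The sketch below uses 12 placeholder observations (an assumption) and 4 splits.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered observations (placeholder data, an assumption).
X = np.arange(12).reshape(-1, 1)

splits = list(TimeSeriesSplit(n_splits=4).split(X))
for i, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"Split {i}: train={train_idx.tolist()} validate={val_idx.tolist()}")

# In every split, the last training row comes before the first validation row.
no_lookahead = all(tr.max() < va.min() for tr, va in splits)
print("Training always precedes validation:", no_lookahead)
```

The training indices grow from split to split while the validation block always sits immediately after them.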

Section 08

Which Technique Should You Use?

📊 Data Splitting Decision Flowchart

When in doubt, 5-Fold or 10-Fold CV is robust, widely understood, and a sound default for most non-temporal datasets.

✂️ Simple Train-Test

train_test_split()

Best for large datasets (>10K rows) where speed matters and you have enough data for a reliable single holdout.

✓ Fast ✓ Simple ✓ Scales to big data

✗ High variance on small datasets

🔁 K-Fold CV

KFold / StratifiedKFold

Gold standard for model evaluation. Use StratifiedKFold for classification. k=5 or k=10.

✓ Low variance ✓ Efficient use of data

✗ k× slower than a single split

📅 Time Series Split

TimeSeriesSplit

The only valid CV for temporal data. Always trains on past, validates on future. Never shuffle.

✓ No leakage ✓ Realistic evaluation

✗ Early folds train on only a small slice of the data

Situation	Recommended Method	Why
Dataset >100K rows	Train-Test Split	Enough data for reliable single holdout
Dataset 1K–10K rows	5-Fold Stratified CV	Maximises data usage, reduces estimate variance
Dataset <1K rows	10-Fold or LOOCV	Can't afford to waste data on a single split
Imbalanced classes	StratifiedKFold	Preserves class ratios in every fold
Time-series / sequential	TimeSeriesSplit	Avoids look-ahead leakage
Hyperparameter tuning	Nested CV	Outer loop evaluates, inner loop tunes
Regression tasks	KFold (no stratify)	Stratification on continuous targets isn't meaningful

Section 09

Nested Cross-Validation & Pipelines

When you use the same CV loop for both model selection and performance estimation, the estimate is optimistically biased. The solution is nested cross-validation — an outer loop for unbiased performance estimation, and an inner loop for hyperparameter search.

📊 Nested Cross-Validation Architecture

Nested CV separates model selection (inner loop) from performance estimation (outer loop), so the reported score is not inflated by the hyperparameter search.

from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.svm            import SVC

param_grid = {'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']}
inner_cv   = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv   = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

clf = GridSearchCV(SVC(), param_grid, cv=inner_cv, scoring='accuracy')

nested_scores = cross_val_score(clf, X, y, cv=outer_cv, scoring='accuracy')
print(f"Nested CV Accuracy: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")

▶ Output

Nested CV Accuracy: 0.8912 ± 0.0143

Section 10

Data Leakage — The #1 Mistake in ML

Data leakage happens when information from outside the training set leaks into the model during training, making it look better than it really is. It is the single most common reason ML models fail in production despite appearing excellent during development.

📖 Analogy

The Leaky Exam Cheat

A student accidentally received next week's exam paper while studying. He aced the test with 100%. The teacher put him in the advanced class — but the next exam (without the leak) he failed completely. Data leakage works exactly the same way: your model "cheats" by seeing information it should never access during training, then fails in the real world.

Three Types of Leakage

🎯

Target Leakage

A feature is derived from the target, or is only known after the event being predicted. Example: using "insurance_claim_amount" to predict "will_claim".

Most Dangerous

🔀

Train-Test Contamination

Fitting preprocessors (scaler, encoder, imputer) on the full dataset before splitting. Test data statistics corrupt training preprocessing.

Very Common

📅

Temporal Leakage

Using future information to predict past events. Especially common when random shuffling ignores time order in sequential data.

Sneaky

Leaky Pipeline vs Clean Pipeline

❌ Leaky Pipeline — Wrong

scaler.fit(X_all) ← entire dataset!

X_scaled = scaler.transform(X_all)

X_train, X_test = split(X_scaled)

model.fit(X_train)

model.predict(X_test)

✅ Clean Pipeline — Correct

X_train, X_test = split(X_raw)

scaler.fit(X_train) ← train only

X_train = scaler.transform(X_train)

X_test = scaler.transform(X_test)

model.fit(X_train) → predict(X_test)
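The two pipelines above can be compared on a toy feature. This is a minimal sketch, assuming a single synthetic column whose largest values fall in the test rows, so that the statistics learned by each scaler differ visibly.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature (an assumption): values 0..99, with the largest 20 in the test rows.
X_raw = np.arange(100, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X_raw, test_size=0.2, shuffle=False)

leaky_scaler = StandardScaler().fit(X_raw)    # sees the test rows: wrong
clean_scaler = StandardScaler().fit(X_train)  # sees training rows only: correct

print("Mean learned by leaky scaler:", leaky_scaler.mean_[0])
print("Mean learned by clean scaler:", clean_scaler.mean_[0])

# The clean scaler is fitted once and then applied to both subsets without refitting.
X_train_scaled = clean_scaler.transform(X_train)
X_test_scaled = clean_scaler.transform(X_test)
```

The leaky scaler learns a mean of 49.5, which includes information from the test rows; the clean scaler learns 39.5 from the training rows alone.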

Using sklearn Pipeline to Prevent Leakage Automatically

from sklearn.pipeline     import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble      import GradientBoostingClassifier

# Pipeline ensures scaler is fit ONLY on training data in each fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model',  GradientBoostingClassifier(random_state=42))
])

# cross_val_score + Pipeline = leakage-free evaluation
scores = cross_val_score(
    pipe, X, y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='accuracy'
)
print(f"Pipeline CV: {scores.mean():.4f} ± {scores.std():.4f}")

▶ Output

Pipeline CV: 0.9023 ± 0.0087

Section 11

Summary & Golden Rules

✂️

Train-Test Split

Holdout Method

Fast, simple, reliable for large data. Set random_state and stratify for best results.

🔁

Validation Set

3-Way Split

Essential when tuning hyperparameters. Keeps test set pristine for final honest evaluation.

📊

Cross-Validation

k-Fold / CV

Best practice for small–medium datasets. Every row acts as validation exactly once.

Method	Best For	sklearn Class	Compute Cost
Train-Test Split	Large datasets, quick baseline	train_test_split()	Low
K-Fold CV	General purpose evaluation	KFold	Medium
Stratified K-Fold	Classification, imbalanced data	StratifiedKFold	Medium
LOOCV	Very small datasets (<100)	LeaveOneOut	Very High
Time Series Split	Sequential / temporal data	TimeSeriesSplit	Medium
Nested CV	Hyperparameter tuning + eval	GridSearchCV + outer CV	Very High

🎯 6 Golden Rules of Data Splitting

Split before any preprocessing. Fit all transformers (scalers, imputers, encoders) only on training data. Apply them to validation and test without refitting.

Set random_state everywhere. Reproducibility is not optional in science or production. Any split without a seed is a split you cannot reproduce.

Use stratify for classification. Any time your target is categorical, use stratify=y in train_test_split and StratifiedKFold in CV loops.

Never tune on the test set. Touching the test set more than once turns it into a second validation set and makes your reported performance optimistic.

Respect temporal ordering in time series. Past data trains the model. Future data validates it. Never shuffle time-series data before splitting.

Use sklearn Pipelines. They guarantee every preprocessing step is applied correctly within each fold, eliminating the most common source of leakage.

✅

Key Takeaway

Data splitting is not just a technical step — it is a philosophy of honesty in ML. The gold-standard workflow: Split off test set first → Use Stratified K-Fold + Pipeline on training data → Pick best model → Evaluate once on test set → Report results. A model evaluated properly is a model you can trust.
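The workflow in the takeaway can be sketched end to end. This is one possible arrangement under stated assumptions, not a definitive recipe: the data is synthetic, and the logistic regression model and its C grid are placeholders for whatever is being tuned.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (an assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 1. Split off the test set first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# 2. Stratified K-Fold + Pipeline on the training data only.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {'model__C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, cv=cv, scoring='accuracy')

# 3. Pick the best model (GridSearchCV refits it on all training rows).
search.fit(X_train, y_train)

# 4. Evaluate once on the test set.
test_accuracy = search.score(X_test, y_test)

# 5. Report results.
print(f"Best params      : {search.best_params_}")
print(f"CV accuracy      : {search.best_score_:.4f}")
print(f"Test accuracy    : {test_accuracy:.4f}")
```

Because the scaler lives inside the pipeline, it is refitted on the training part of each fold, and the test rows influence nothing until step 4.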