Data Splitting Mastery
A comprehensive, story-driven guide to the art and science of
splitting datasets for machine learning. Covers train-test split,
validation sets, and all major cross-validation techniques with
real-world analogies, visual diagrams, and Python code examples.
Section 01
Why Data Splitting Matters
Every machine learning model has one job: generalise well to data it has never seen before. If you train and evaluate on the same data, your accuracy number is a lie — the model has simply memorised the answers. Data splitting is the practice of partitioning your dataset into separate subsets so that training, tuning, and final evaluation each happen on independent data.
📖 Analogy
The Exam Study Problem
Imagine a student who studies by reading past exam papers — then is tested on those exact same papers. They score 100%. Does that mean they've mastered the subject? Of course not. A fair exam uses new questions. In machine learning, your test set is that new exam — kept completely hidden until you need a final, honest performance number.
📖 Real-World Story
The Doctor Who Tested on His Own Patients
Dr. Aryan built a cancer-detection AI using records from 1,000 patients, tested it on those same 1,000 records, and measured 98% accuracy. He deployed it, and accuracy on new patients dropped to 61%. The model had memorised patient IDs and quirks of his hospital's data-entry habits, not the actual cancer signals. Proper data splitting would have caught this before anyone was harmed.
⚠️
The Golden Rule
Never use the test set for any decision during model development. Not for hyperparameter tuning, not for feature selection, not for architecture choice. The moment you peek at test performance to make a decision, it becomes a second validation set — and your reported score becomes fiction.
The Three-Set Framework
📊 The Three-Way Data Split — 60% / 20% / 20%
Proportions are guidelines. With 1M+ rows a 98/1/1 split is fine. With fewer than 1,000 rows, prefer cross-validation over any static holdout.
The Workflow Order
⚙️ THREE-SET PIPELINE
Step 01
Training Set (~60–70%)
The data your model learns from. Weights, coefficients, and decision boundaries are fitted here. More data here → better learning, but less left for evaluation.
Step 02
Validation Set (~15–20%)
Used during development to tune hyperparameters and compare model architectures. You see this data repeatedly — it is your feedback loop, not your final judge.
Step 03
Test Set (~15–20%)
Locked away until the very end. Touched exactly once to report final generalisation performance. This is the number you report to the world.
Section 02
Train-Test Split (The Holdout Method)
The simplest form of data splitting is the holdout method: randomly partition the dataset once into a training set and a test set. It is fast, easy to understand, and the right default when you have a large dataset (typically >10,000 rows).
📖 Real-World Story
The Spam Filter That Was Too Optimistic
A team built a spam classifier and reported 99.2% accuracy — training and evaluating on the same 5,000 emails. When deployed, it caught only 60% of spam. The model had memorised sender addresses from training that never appeared in production. A proper holdout split would have caught this instantly. Their "99.2%" became "60%" overnight.
How It Works — Shuffle Then Split
🔀 Train-Test Split — Step by Step
Always shuffle before splitting — unless dealing with time-series data where order must be preserved.
Python Implementation
from sklearn.model_selection import train_test_split
import pandas as pd
# Load dataset
df = pd.read_csv('house_prices.csv')
X = df.drop('price', axis=1)
y = df['price']
# Simple 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=42,
    shuffle=True
)
print(f"Training rows : {len(X_train)}")
print(f"Test rows : {len(X_test)}")
▶ Output
Training rows : 800
Test rows : 200
Key Parameters
⚙️ train_test_split — Parameter Guide
test_size (0.1 – 0.3, or an int): Fraction or count of samples in the test set. Use 0.2 as a default; go smaller for large datasets.
random_state (any integer, e.g. 42): Seeds the RNG so the split is identical on every run. Always set this in production.
shuffle (True / False): Set False only for time-series data where order must be preserved.
stratify (y, for classification): Preserves class proportions in both splits. Critical for imbalanced datasets.
Stratified Split — Preserving Class Balance
When classes are imbalanced, a purely random split can leave the test set with very few minority cases, or in the worst case none at all. Passing stratify=y keeps the class ratio in both the training and test sets as close as possible to the ratio in the original data.
❌ Without stratify — Risky
| Set | Class 0 | Class 1 (Fraud) |
|---|---|---|
| Train | 760 | 40 (5%) |
| Test | 190 | 10 (5%) by chance |
| Worst case | 800 | 0 (0%), disaster |
✅ With stratify=y — Safe
| Set | Class 0 | Class 1 (Fraud) |
|---|---|---|
| Train | 760 | 40 (5%) guaranteed |
| Test | 190 | 10 (5%) guaranteed |

Both sets: exact same ratio ✓
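The two tables above can be checked with a short sketch. The data here is synthetic and purely illustrative (1,000 rows, 5% fraud), chosen to match the counts in the tables; the variable names are assumptions, not part of any real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 1,000 rows with a 5% positive (fraud) class.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 50 + [0] * 950)

# stratify=y asks the splitter to keep the 95/5 class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

print("Train fraud rows:", int(y_train.sum()), "of", len(y_train))  # 40 of 800
print("Test fraud rows :", int(y_test.sum()), "of", len(y_test))    # 10 of 200
```

Dropping the stratify argument and rerunning with different seeds shows how far the test-set fraud count can drift from 10.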
💡
Rule of Thumb for Split Ratios
<1K rows → Use cross-validation | 1K–10K → 70/30 split | 10K–100K → 80/20 split | >100K → 90/10 or 99/1 — you have enough test samples anyway.
Section 03
The Validation Set
Once you add hyperparameter tuning — choosing learning rate, tree depth, regularisation strength — you need a third partition: the validation set. This is the data you use to compare models and tune parameters without contaminating the test set.
📖 Real-World Story
The Overfitted Sales Predictor
A data scientist tried 50 hyperparameter combinations, each time measuring accuracy on the test set and keeping the best. The final test accuracy looked excellent. But in production, the model performed terribly. By evaluating 50 times on the test set, she effectively trained on the test set. She needed a separate validation set for tuning — the test set should have been touched only once.
📊 Three-Way Split Workflow — Model Comparison Flow
Each model is evaluated on validation only. The winner is tested exactly once on the test set to report final performance. Never use the test set to pick between models.
Creating a Three-Way Split in Python
from sklearn.model_selection import train_test_split
# Step 1: split off the test set first (20%)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
# Step 2: split remaining 80% → train (75%) + val (25%)
# 0.25 × 0.80 = 0.20 of total → gives 60/20/20 overall
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
print(f"Train : {len(X_train)} ({len(X_train)/len(X)*100:.0f}%)")
print(f"Val : {len(X_val)} ({len(X_val)/len(X)*100:.0f}%)")
print(f"Test : {len(X_test)} ({len(X_test)/len(X)*100:.0f}%)")
▶ Output
Train : 600 (60%)
Val : 200 (20%)
Test : 200 (20%)
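The model comparison flow described in this section (fit on train, compare on validation, score the winner once on test) might look like the sketch below. The synthetic dataset and the three candidate models are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 60/20/20 split: test set first, then train/validation from the remainder.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

# Hypothetical candidates; any estimators with fit/score would do.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree_depth3": DecisionTreeClassifier(max_depth=3, random_state=42),
    "tree_depth10": DecisionTreeClassifier(max_depth=10, random_state=42),
}

# Every candidate is fitted on train and compared on validation only.
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = model.score(X_val, y_val)

# The winner touches the test set exactly once.
best_name = max(val_scores, key=val_scores.get)
test_score = candidates[best_name].score(X_test, y_test)

print("Validation scores:", {k: round(v, 3) for k, v in val_scores.items()})
print(f"Selected: {best_name}, test accuracy: {test_score:.3f}")
```

The test score is computed after the selection is final, so it plays no part in choosing between candidates.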
ℹ️
Why Test Accuracy Is Usually Lower Than Validation Accuracy
Final test accuracy is typically slightly below the best validation accuracy, and that is normal and healthy. The validation set was used to select the best model, so its score is slightly optimistic. The test set gives the honest, real-world estimate. A test score well above the validation score can happen by chance on a small test set, but it is worth checking for a bug or for leakage between the splits.
Section 04
K-Fold Cross-Validation
A single validation split is sensitive to which rows happen to land where. Cross-validation solves this by repeating the train/evaluate cycle multiple times on different partitions and averaging the results. Every data point gets to be in the validation set exactly once.
📖 Analogy
The Fair Judge Panel
A singing competition with one judge → biased. Ten judges → much fairer. Cross-validation is like rotating ten different "judges" (validation folds) so no single random draw unfairly decides your model's fate. The final score is the average of all judges' verdicts.
📊 5-Fold Cross-Validation — All 5 Iterations
Every data point gets to be in the validation set exactly once. Final score = mean ± std across all k runs.
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
import numpy as np
model = RandomForestClassifier(n_estimators=100, random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='accuracy')
print("Fold scores :", np.round(scores, 4))
print(f"Mean : {scores.mean():.4f}")
print(f"Std Dev : {scores.std():.4f}")
print(f"Mean ± 2 SD : {scores.mean():.4f} ± {2*scores.std():.4f}")  # rough spread of fold scores, not a formal 95% confidence interval
▶ Output
Fold scores : [0.8833 0.9083 0.8917 0.9000 0.8750]
Mean : 0.8917
Std Dev : 0.0118
Mean ± 2 SD : 0.8917 ± 0.0236
The Math Behind K-Fold Scoring
CV Score (Mean)
CV = (1/k) × Σ score_i
Average metric across all k folds. More robust than a single split estimate.
CV Std Deviation
σ = √[ (1/k) × Σ (score_i − CV)² ]
High σ means your model's performance is unstable across different data slices.
Training Size per Fold
n_train = N × (k−1) / k
For k=5 and N=1000: each fold trains on 800 samples, validates on 200.
Choosing k
k = 5 or k = 10
k=5 is standard. k=10 gives lower bias but costs more compute. Avoid k=2.
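To connect the formulas above to code, here is a minimal sketch that runs the k-fold loop by hand and computes the mean, the standard deviation, and the per-fold training size directly. The synthetic dataset and the choice of LogisticRegression are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data with N = 1000 rows (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)

fold_scores, train_sizes = [], []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))
    train_sizes.append(len(train_idx))

fold_scores = np.array(fold_scores)

# CV = (1/k) × Σ score_i
cv_mean = fold_scores.sum() / k
# σ = √[ (1/k) × Σ (score_i − CV)² ], the same as np.std with its default ddof=0
cv_std = np.sqrt(((fold_scores - cv_mean) ** 2).sum() / k)

print("Training rows per fold:", train_sizes)  # N × (k−1) / k = 800 each
print(f"CV mean: {cv_mean:.4f}, CV std: {cv_std:.4f}")
```

The hand-computed values agree with fold_scores.mean() and fold_scores.std(), which is what cross_val_score users typically report.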
Section 05
Stratified K-Fold — Preserving Class Balance
Regular K-Fold assigns rows to folds without looking at the class labels. With imbalanced classes, a fold can end up with very few minority cases in its validation set, or none at all, which makes the score for that fold unreliable. StratifiedKFold ensures each fold maintains the same class proportions as the full dataset.
📊 Regular K-Fold vs Stratified K-Fold — Imbalanced Data
Stratified K-Fold is almost always preferred over plain K-Fold for classification tasks — especially when classes are imbalanced.
For imbalanced classification tasks, always use StratifiedKFold instead of plain KFold. Regular K-Fold can place all minority class samples in a single fold, so the iteration that validates on that fold trains with zero positive examples. Stratification ensures every fold sees the correct class proportions.
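The difference is easy to see by counting positives in each validation fold. This sketch uses a deliberately extreme arrangement as an assumption: 1,000 synthetic labels sorted by class, with plain KFold left unshuffled (its default).

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic labels: 50 positives followed by 950 negatives (sorted on purpose).
y = np.array([1] * 50 + [0] * 950)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here


def positives_per_fold(splitter):
    """Count minority-class rows in each validation fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]


plain = positives_per_fold(KFold(n_splits=5))  # no shuffling, labels ignored
strat = positives_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)

print("KFold positives per fold          :", plain)  # [50, 0, 0, 0, 0]
print("StratifiedKFold positives per fold:", strat)  # [10, 10, 10, 10, 10]
```

With plain KFold, the iteration that validates on the first fold trains on no positive rows at all. Shuffling reduces the problem but still leaves the per-fold counts to chance.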
Section 06
Leave-One-Out Cross-Validation (LOOCV)
LOOCV is the extreme case of k-fold where k = N. Each iteration trains on N−1 samples and validates on a single sample. Every data point gets one turn as the test case. It gives the least biased estimate but is computationally expensive for large datasets.
📖 Real-World Story
The Clinical Trial with 10 Patients
Dr. Singh has only 10 patient records. Splitting 80/20 would give just 2 patients for validation — completely unreliable. LOOCV solves this: train on 9 patients, test on the 10th. Repeat 10 times. Every patient gets a turn being the "test case", and the final accuracy is averaged across all 10 predictions.
📊 Leave-One-Out CV — n=6 Mini Example
LOOCV uses maximum training data per iteration. Ideal for tiny datasets (<100 rows), computationally prohibitive for large ones.
| Property | LOOCV | 10-Fold CV |
|---|---|---|
| Training size per fold | n−1 (maximum) | ~90% of data |
| Number of iterations | n (e.g. 10,000) | 10 |
| Bias of estimate | Very low | Low |
| Variance of estimate | High | Low–Medium |
| Computational cost | Very high | Moderate |
| Best for | Tiny datasets (<100) | Most practical cases |
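A minimal LOOCV sketch with scikit-learn's LeaveOneOut follows. The 30-row synthetic dataset and the LogisticRegression model are assumptions for illustration; each fold scores a single row, so every fold score is either 0 or 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Tiny synthetic dataset (assumption): 30 rows, 5 features.
X, y = make_classification(
    n_samples=30, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

loo = LeaveOneOut()
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=loo, scoring="accuracy"
)

print("Number of iterations:", loo.get_n_splits(X))  # one per row
print("Per-row results     :", scores.astype(int))
print(f"LOOCV accuracy      : {scores.mean():.3f}")
```

The mean of the per-row results is the LOOCV accuracy. The model is refit once per row, which is why the cost grows quickly with dataset size.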
Section 07
Time Series Cross-Validation
Standard k-fold must not be used for time-series data — using future data to predict the past is data leakage. TimeSeriesSplit enforces temporal order: training always uses past data, validation always uses future data.
📖 Real-World Story
The Stock Market Time Traveller
Vivek built a stock price prediction model with 94% accuracy using random K-Fold. He deployed it — and lost money. The model had been trained using future price data to predict the past. In production, the future is unknown. TimeSeriesSplit would have revealed his real accuracy of 52% — barely better than chance.
📊 TimeSeriesSplit — Expanding Window (4 Splits)
Never shuffle time-series data. The training window expands with each split — future data never leaks into training.
RMSE per fold : [12.4 11.8 13.2 12.1 11.5]
Mean RMSE : 12.200
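For reference, a minimal expanding-window sketch with TimeSeriesSplit is shown here. It uses a synthetic trend-plus-noise series as an assumption, so its RMSE values will not match the figures quoted above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic series (assumption): linear trend plus noise, 120 time steps.
rng = np.random.default_rng(42)
n = 120
t = np.arange(n)
X = t.reshape(-1, 1)
y = 0.5 * t + rng.normal(scale=5.0, size=n)

tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X))

rmses = []
for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    # Training rows always come before validation rows in time.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    rmse = float(np.sqrt(mean_squared_error(y[val_idx], pred)))
    rmses.append(rmse)
    print(f"Fold {fold}: train rows 0..{train_idx.max()}, "
          f"validate rows {val_idx.min()}..{val_idx.max()}, RMSE {rmse:.2f}")

print(f"Mean RMSE: {np.mean(rmses):.3f}")
```

With 120 rows and 5 splits, the training window grows from 20 to 100 rows while each validation window stays at 20 rows.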
Section 08
Which Technique Should You Use?
📊 Data Splitting Decision Flowchart
When in doubt, 5-Fold or 10-Fold CV is robust, widely understood, and works for most use cases.
✂️ Simple Train-Test
train_test_split()
Best for large datasets (>10K rows) where speed matters and you have enough data for a reliable single holdout.
✓ Fast ✓ Simple ✓ Scales to big data
✗ High variance on small datasets
🔁 K-Fold CV
KFold / StratifiedKFold
Gold standard for model evaluation. Use StratifiedKFold for classification. k=5 or k=10.
✓ Low variance ✓ Efficient use of data
✗ k× slower than a single split
📅 Time Series Split
TimeSeriesSplit
The only valid CV for temporal data. Always trains on past, validates on future. Never shuffle.
✓ No leakage ✓ Realistic evaluation
✗ Early folds train on very little data
| Situation | Recommended Method | Why |
|---|---|---|
| Dataset >100K rows | Train-Test Split | Enough data for reliable single holdout |
| Dataset 1K–10K rows | 5-Fold Stratified CV | Maximises data usage, reduces estimate variance |
| Dataset <1K rows | 10-Fold or LOOCV | Can't afford to waste data on a single split |
| Imbalanced classes | StratifiedKFold | Preserves class ratios in every fold |
| Time-series / sequential | TimeSeriesSplit | Avoids look-ahead leakage |
| Hyperparameter tuning | Nested CV | Outer loop evaluates, inner loop tunes |
| Regression tasks | KFold (no stratify) | Stratification on continuous targets isn't meaningful |
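The table can be turned into a small helper that returns a scikit-learn splitter. This is only a sketch: choose_splitter is a hypothetical function, and the exact thresholds (LOOCV below 100 rows, 10 folds below 1,000 rows) are assumptions drawn from the guidance in this guide, not fixed rules. It covers the cross-validation rows only; very large datasets would use train_test_split instead.

```python
from sklearn.model_selection import (
    KFold,
    LeaveOneOut,
    StratifiedKFold,
    TimeSeriesSplit,
)


def choose_splitter(n_rows, task="classification", temporal=False):
    """Hypothetical helper: map a situation to a CV splitter.

    Thresholds are illustrative assumptions, not hard limits.
    """
    if temporal:
        # Temporal data overrides everything else: never shuffle.
        return TimeSeriesSplit(n_splits=5)
    if n_rows < 100:
        return LeaveOneOut()
    n_splits = 10 if n_rows < 1_000 else 5
    if task == "classification":
        return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    return KFold(n_splits=n_splits, shuffle=True, random_state=42)


for args in [(50,), (500,), (5_000,), (5_000, "regression"), (5_000, "regression", True)]:
    print(args, "->", choose_splitter(*args))
```

The returned object can be passed directly as the cv argument of cross_val_score or GridSearchCV.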
Section 09
Nested Cross-Validation & Pipelines
When you use the same CV loop for both model selection and performance estimation, the estimate is optimistically biased. The solution is nested cross-validation — an outer loop for unbiased performance estimation, and an inner loop for hyperparameter search.
📊 Nested Cross-Validation Architecture
Nested CV separates model selection (inner loop) from performance estimation (outer loop), so the outer score is not inflated by the hyperparameter search.
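One common way to sketch nested CV in scikit-learn is to wrap a GridSearchCV (inner loop) inside cross_val_score (outer loop). The synthetic data, the decision tree, and the max_depth grid below are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)  # tunes
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)  # evaluates

# Inner loop: hyperparameter search over an illustrative grid.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 4, 8]},
    cv=inner_cv,
    scoring="accuracy",
)

# Outer loop: each outer fold reruns the whole search on its training part,
# then scores the tuned model on rows the search never saw.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")

print(f"Nested CV accuracy: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
```

The cost is the product of the two loops: here 5 outer folds × 3 grid points × 3 inner folds, plus one refit per outer fold.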
Data leakage happens when information from outside the training set leaks into the model during training, making it look better than it really is. It is the single most common reason ML models fail in production despite appearing excellent during development.
📖 Analogy
The Leaky Exam Cheat
A student accidentally received next week's exam paper while studying. He aced the test with 100%. The teacher put him in the advanced class — but the next exam (without the leak) he failed completely. Data leakage works exactly the same way: your model "cheats" by seeing information it should never access during training, then fails in the real world.
Three Types of Leakage
🎯
Target Leakage
A feature is derived from the target, or only becomes known after the outcome has occurred. Example: using "insurance_claim_amount" to predict "will_claim".
Most Dangerous
🔀
Train-Test Contamination
Fitting preprocessors (scaler, encoder, imputer) on the full dataset before splitting. Test data statistics corrupt training preprocessing.
Very Common
📅
Temporal Leakage
Using future information to predict past events. Especially common when random shuffling ignores time order in sequential data.
Sneaky
Leaky Pipeline vs Clean Pipeline
❌ Leaky Pipeline — Wrong
scaler.fit(X_all)  # fitted on the entire dataset, test rows included
X_scaled = scaler.transform(X_all)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
model.fit(X_train, y_train)
model.predict(X_test)
✅ Clean Pipeline — Correct
X_train, X_test, y_train, y_test = train_test_split(X_raw, y)
scaler.fit(X_train)  # fitted on training rows only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
model.fit(X_train, y_train)
model.predict(X_test)
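The contrast between the two pipelines can be made concrete by comparing what the scaler learns in each case. This is a minimal sketch on synthetic features (an assumption); it shows only the preprocessing step, not a full model.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features (assumption): 200 rows, 2 columns.
rng = np.random.default_rng(42)
X_raw = rng.normal(loc=10.0, scale=3.0, size=(200, 2))

X_train, X_test = train_test_split(X_raw, test_size=0.25, random_state=42)

# Leaky: statistics are computed from every row, test rows included.
leaky_scaler = StandardScaler().fit(X_raw)
# Clean: statistics are computed from the training rows only.
clean_scaler = StandardScaler().fit(X_train)

print("Mean learned from all rows  :", leaky_scaler.mean_)
print("Mean learned from train only:", clean_scaler.mean_)

# The clean scaler is then applied, without refitting, to both parts.
X_train_scaled = clean_scaler.transform(X_train)
X_test_scaled = clean_scaler.transform(X_test)
```

The two sets of means differ because the leaky scaler has absorbed information from the test rows. On this toy data the gap is small, but the same mistake with imputers, encoders, or feature selection can inflate scores noticeably.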
Using sklearn Pipeline to Prevent Leakage Automatically
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
# Pipeline ensures scaler is fit ONLY on training data in each fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingClassifier(random_state=42))
])
# cross_val_score + Pipeline: the scaler is refit inside every fold
scores = cross_val_score(
    pipe, X, y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='accuracy'
)
print(f"Pipeline CV: {scores.mean():.4f} ± {scores.std():.4f}")
▶ Output
Pipeline CV: 0.9023 ± 0.0087
Section 11
Summary & Golden Rules
✂️
Train-Test Split
Holdout Method
Fast, simple, reliable for large data. Set random_state and stratify for best results.
🔁
Validation Set
3-Way Split
Essential when tuning hyperparameters. Keeps test set pristine for final honest evaluation.
📊
Cross-Validation
k-Fold / CV
Best practice for small–medium datasets. Every row acts as validation exactly once.
| Method | Best For | sklearn Class | Compute Cost |
|---|---|---|---|
| Train-Test Split | Large datasets, quick baseline | train_test_split() | Low |
| K-Fold CV | General purpose evaluation | KFold | Medium |
| Stratified K-Fold | Classification, imbalanced data | StratifiedKFold | Medium |
| LOOCV | Very small datasets (<100) | LeaveOneOut | Very High |
| Time Series Split | Sequential / temporal data | TimeSeriesSplit | Medium |
| Nested CV | Hyperparameter tuning + eval | GridSearchCV + outer CV | Very High |
🎯 6 Golden Rules of Data Splitting
1. Split before any preprocessing. Fit all transformers (scalers, imputers, encoders) only on training data. Apply them to validation and test without refitting.
2. Set random_state everywhere. Reproducibility is not optional in science or production. Any split without a seed is a split you cannot reproduce.
3. Use stratify for classification. Any time your target is categorical, use stratify=y in train_test_split and StratifiedKFold in CV loops.
4. Never tune on the test set. Touching the test set more than once turns it into a second validation set and makes your reported performance optimistic.
5. Respect temporal ordering in time series. Past data trains the model. Future data validates it. Never shuffle time-series data before splitting.
6. Use sklearn Pipelines. They guarantee every preprocessing step is applied correctly within each fold, eliminating the most common source of leakage.
✅
Key Takeaway
Data splitting is not just a technical step — it is a philosophy of honesty in ML. The gold-standard workflow: Split off test set first → Use Stratified K-Fold + Pipeline on training data → Pick best model → Evaluate once on test set → Report results. A model evaluated properly is a model you can trust.
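The gold-standard workflow above can be sketched end to end as follows. The synthetic dataset and the two candidate pipelines are assumptions for illustration; the structure (hold out test, cross-validate pipelines on the rest, pick one, score once) is the part that matters.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 1. Split off the test set first and set it aside.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# 2. Stratified K-Fold + Pipeline on the remaining data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "logreg": Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]),
    "tree": Pipeline([
        ("scaler", StandardScaler()),
        ("model", DecisionTreeClassifier(max_depth=5, random_state=42)),
    ]),
}
cv_means = {
    name: cross_val_score(pipe, X_dev, y_dev, cv=cv, scoring="accuracy").mean()
    for name, pipe in candidates.items()
}

# 3. Pick the best model by CV score, then refit it on all development data.
best_name = max(cv_means, key=cv_means.get)
best_pipe = candidates[best_name].fit(X_dev, y_dev)

# 4. Evaluate once on the test set and report.
test_accuracy = best_pipe.score(X_test, y_test)
print("CV means:", {k: round(v, 4) for k, v in cv_means.items()})
print(f"Selected: {best_name}, test accuracy: {test_accuracy:.4f}")
```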