
Handling Imbalanced Data in Machine Learning

A story-driven, comprehensive guide to detecting and treating class imbalance — covering oversampling (SMOTE), undersampling, class weights, threshold tuning, and evaluation metrics that actually work on imbalanced datasets — with live diagrams, before/after comparisons, and complete reusable sklearn pipeline code.

Section 01

The Class Imbalance Problem

In an ideal world, machine learning datasets have equal numbers of examples for each class. In the real world, the most important problems are almost always severely imbalanced. Fraud is rare. Cancer is rare. Equipment failure is rare. Network intrusions are rare. This rarity is precisely what makes these problems critical — and it is exactly what makes standard machine learning approaches fail silently and dangerously.

The Fraud Model That Caught Zero Frauds
A payment processing company built a fraud detection model on a dataset with 99.1% legitimate transactions and 0.9% fraudulent ones. The model achieved 99.1% accuracy — and the team celebrated. Then someone asked: how many frauds did it actually catch? The answer was zero. The model had learned the perfect lazy strategy — predict "legitimate" for every single transaction. It was statistically correct 99.1% of the time. It was also completely useless. In the first month of deployment, ₹4.2 crore in fraudulent transactions passed through undetected. Accuracy was not the right metric. The model had been judged as excellent using a metric that was blind to the only thing that mattered: catching fraud. The lesson: on imbalanced data, accuracy is a lie. Precision, recall, F1, and AUC-PR are the metrics that tell the truth.
⚠️
Accuracy Is a Lie on Imbalanced Data

On a dataset that is 99% class 0 and 1% class 1, a model that predicts class 0 for everything achieves 99% accuracy, a score that looks excellent, while being completely useless. Never use accuracy as your primary metric on imbalanced data. Use precision, recall, F1-score, and especially AUC-PR (Area Under the Precision-Recall Curve).
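To make the arithmetic concrete, here is a minimal sketch in plain Python of an always-negative predictor. The labels are made up to mirror the 99:1 example above, and the variable names are illustrative only.

```python
# Hypothetical labels: 990 negatives (class 0) and 10 positives (class 1)
y_true = [0] * 990 + [1] * 10

# A "model" that predicts the majority class for every example
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(y_true)

accuracy = correct / len(y_true)   # share of all predictions that are right
recall = true_pos / actual_pos     # share of actual positives that were caught

print(f"Accuracy: {accuracy:.1%}")   # 99.0%
print(f"Recall  : {recall:.1%}")     # 0.0%
```

The accuracy figure comes entirely from the negatives; recall exposes that not a single positive was found.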

Imbalance ratio | Severity | Minority share | Typical examples
1:4 | Mild | 20% | Review ratings, churn prediction
1:10 | Moderate | 9% | Disease screening, loan default
1:100 | Severe | ~1% | Fraud detection, manufacturing defect
1:1000 | Extreme | ~0.1% | Rare disease, cyber intrusion
📊 Class Distribution — Imbalanced Dataset vs Balanced

A 99:1 class ratio means the model sees 99 negative examples for every 1 positive — it is vastly easier for the model to simply predict "negative" for everything and be right 99% of the time. The rare class is the one that matters most — and the one the model learns least about.


Section 02

Step 1 — Detecting Class Imbalance

Before applying any fix, you must first measure and visualise the imbalance. Understanding the ratio, the absolute counts, and the distribution pattern determines which treatment strategy is appropriate.

import pandas as pd
import numpy  as np
from collections import Counter

# ── Basic class count ─────────────────────────────────
print(df['target'].value_counts())
print(df['target'].value_counts(normalize=True).round(4))

# ── Imbalance ratio ───────────────────────────────────
counts = df['target'].value_counts()
majority = counts.max()
minority = counts.min()
ratio    = majority / minority
print(f"Majority class : {majority:,} ({majority/len(df)*100:.1f}%)")
print(f"Minority class : {minority:,} ({minority/len(df)*100:.1f}%)")
print(f"Imbalance ratio: {ratio:.0f}:1")

# ── Severity classification ───────────────────────────
def imbalance_severity(ratio):
    if   ratio < 4:    return "Mild — may not need treatment"
    elif ratio < 10:   return "Moderate — class weights recommended"
    elif ratio < 100:  return "Severe — SMOTE or resampling needed"
    else:              return "Extreme — combine multiple strategies"
print(imbalance_severity(ratio))

# ── Counter on numpy array ────────────────────────────
print(Counter(y_train))    # Counter({0: 9910, 1: 90})

Section 03

Step 2 — Use the Right Metrics

Before fixing the data, fix the evaluation. If you optimise for the wrong metric, every technique you apply will improve that wrong metric while potentially making the real problem worse. On imbalanced data, precision, recall, F1, and AUC-PR (with balanced accuracy as a useful supplement) are the metrics that tell you whether your model is actually learning to detect the minority class.

❌ Naïve Model — Predicts Everything as "Not Fraud"
True Negative (correct, legit): 9910
False Positive (wrong alarm): 0
False Negative (⚠ fraud missed): 90
True Positive (caught fraud): 0

Accuracy: 99.1%  |  Recall: 0%  |  Frauds caught: 0/90

✅ Balanced Model — After SMOTE + Class Weights
True Negative (correct, legit): 9640
False Positive (wrong alarm): 270
False Negative (missed fraud): 18
True Positive (⭐ fraud caught): 72

Accuracy: 97.1%  |  Recall: 80%  |  Frauds caught: 72/90
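The headline numbers under each matrix can be reproduced directly from the four counts. The sketch below is plain Python; the helper name `summarise` is hypothetical, and reporting precision as 0 when there are no positive predictions is a convention chosen here, not something the source specifies.

```python
def summarise(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Convention for this sketch: undefined ratios are reported as 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Counts taken from the two confusion matrices above
naive = summarise(tn=9910, fp=0, fn=90, tp=0)
balanced = summarise(tn=9640, fp=270, fn=18, tp=72)

for name, (acc, prec, rec, f1) in [("naive", naive), ("balanced", balanced)]:
    print(f"{name:9s} accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

With these counts the balanced model's precision is 72 / (72 + 270), about 0.21, so most of its alerts are false alarms. That is the price paid for 80% recall, and it is why precision belongs next to recall in any report.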

from sklearn.metrics import (
    classification_report, confusion_matrix,
    precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score,
    precision_recall_curve
)

# ── Full classification report ────────────────────────
print(classification_report(y_test, y_pred,
      target_names=['Legit', 'Fraud'],
      digits=4))

# ── Key minority-class metrics ────────────────────────
print(f"Precision (fraud) : {precision_score(y_test, y_pred):.3f}")
# Of all predicted frauds, what % were actually fraud?

print(f"Recall (fraud)    : {recall_score(y_test, y_pred):.3f}")
# Of all actual frauds, what % did the model catch?

print(f"F1 (fraud)        : {f1_score(y_test, y_pred):.3f}")
# Harmonic mean of precision and recall

print(f"AUC-ROC           : {roc_auc_score(y_test, y_prob):.3f}")
print(f"AUC-PR            : {average_precision_score(y_test, y_prob):.3f}")
# AUC-PR is more informative than AUC-ROC for imbalanced data

# ── Balanced accuracy (accounts for imbalance) ────────
from sklearn.metrics import balanced_accuracy_score
print(f"Balanced Accuracy  : {balanced_accuracy_score(y_test, y_pred):.3f}")
📊 Precision-Recall Curve — The Correct Metric for Imbalanced Data

Precision-Recall Curve (AUC-PR)

ROC Curve (AUC-ROC)

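The ROC curve (right) can look impressive even for a weak model, because the false positive rate is measured against the very large pool of negatives, so hundreds of false alarms barely move it. (A model that predicts one class for everything still scores only 0.5 AUC-ROC.) The PR curve (left) is brutally honest: its baseline is the minority prevalence, about 0.01 on a 99:1 dataset, so a weak model sits near the bottom. Prefer AUC-PR as the primary metric on severely imbalanced data.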
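The gap between the two curves can be illustrated with simulated scores. This is a sketch under stated assumptions: the data are synthetic (1% positives, a deliberately weak scorer drawn from two overlapping normal distributions), and it relies on NumPy and scikit-learn being installed.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(42)

# Simulated labels (an assumption for illustration): 1% positives
n_neg, n_pos = 49_500, 500
y_true = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

# A weak scorer: positives score only one standard deviation higher on average
scores = np.concatenate([rng.normal(0.0, 1.0, n_neg), rng.normal(1.0, 1.0, n_pos)])

auc_roc = roc_auc_score(y_true, scores)
auc_pr = average_precision_score(y_true, scores)
prevalence = y_true.mean()   # AUC-PR expected from a random scorer

print(f"AUC-ROC: {auc_roc:.3f}")
print(f"AUC-PR : {auc_pr:.3f} (random baseline = {prevalence:.3f})")
```

On a run like this, AUC-ROC should land well above 0.5 while AUC-PR stays much closer to the 0.01 baseline, which is the contrast the two plots are meant to show.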


Section 04

Six Strategies to Handle Imbalanced Data

⚖️
Class Weights
Tell the algorithm to penalise mistakes on the minority class more heavily during training. Zero data change — just reweighted loss function.
class_weight='balanced'
🧬
SMOTE Oversampling
Synthetic Minority Oversampling — generates new synthetic minority examples by interpolating between existing ones. Most widely used technique.
imblearn.over_sampling.SMOTE
✂️
Undersampling
Remove majority class examples to balance the dataset. Fast and reduces training time, but loses potentially useful data.
RandomUnderSampler
🎯
Threshold Tuning
Lower the decision threshold below 0.5 to classify more examples as positive. Directly controls the precision-recall trade-off.
predict_proba + threshold
🔀
Combined Sampling
Use SMOTE to oversample minority then Tomek Links or ENN to clean noisy majority examples. Best of both worlds.
SMOTETomek / SMOTEENN
🌲
Algorithm Choice
Some algorithms handle imbalance better natively — Balanced Random Forest, EasyEnsemble, and XGBoost with scale_pos_weight.
BalancedRandomForestClassifier

Section 05

Strategy 1 — Class Weights (Fastest Fix)

Class weighting is the simplest and most computationally efficient treatment. It modifies the loss function to penalise errors on the minority class more heavily — without changing a single row of data. A model with class_weight='balanced' treats each minority class error as if it were equivalent to 100 majority class errors (in a 100:1 imbalance scenario).
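The "100 majority errors" figure follows from how the 'balanced' heuristic is defined in scikit-learn: each class gets a weight of n_samples / (n_classes * class_count). A minimal NumPy sketch of that formula, using hypothetical 100:1 labels and an illustrative helper name:

```python
import numpy as np

def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * count_of_that_class)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Hypothetical 100:1 training labels
y_train = np.array([0] * 10_000 + [1] * 100)
weights = balanced_class_weights(y_train)

print(weights)                                               # {0: 0.505, 1: 50.5}
print(f"minority/majority weight ratio: {weights[1] / weights[0]:.1f}")   # 100.0
```

The weight ratio equals the imbalance ratio, so the total weight carried by each class ends up equal.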

The Insurance Fraud Team's Two-Line Fix
An insurance company's fraud team had been tuning XGBoost hyperparameters for three weeks trying to improve their recall score on fraudulent claims. Their dataset was 200,000 legitimate claims and 2,000 fraudulent ones, a 100:1 ratio. Their best model had recall of 31% (catching only 31% of actual fraud). A senior data scientist asked whether they had applied any class weighting, which in XGBoost means setting scale_pos_weight (the counterpart of scikit-learn's class_weight='balanced'). They had not. Two lines of code later, recall jumped from 31% to 64%, catching roughly twice as many fraudulent claims, with no change to the model architecture, no additional data collection, and no SMOTE. The class weight adjustment alone doubled the fraud detection rate.
from sklearn.linear_model  import LogisticRegression
from sklearn.ensemble      import RandomForestClassifier
from sklearn.svm           import SVC
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics       import classification_report
import numpy as np

# ── Option A: 'balanced' — auto-computes from data ────
lr  = LogisticRegression(class_weight='balanced', max_iter=1000)
rf  = RandomForestClassifier(class_weight='balanced', n_estimators=200)
svc = SVC(class_weight='balanced', probability=True)

# ── Option B: Manual weights ──────────────────────────
weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
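# Map each actual label to its weight (enumerate() would break for labels other than 0..n-1)
weight_dict = dict(zip(np.unique(y_train).tolist(), weights.tolist()))
print(weight_dict)  # e.g. {0: 0.505, 1: 50.5} at 100:1  ← minority gets 100× weight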

rf_manual = RandomForestClassifier(class_weight=weight_dict)

# ── For XGBoost: scale_pos_weight ────────────────────
import xgboost as xgb
neg = (y_train == 0).sum()
pos = (y_train == 1).sum()
xgb_model = xgb.XGBClassifier(
    scale_pos_weight = neg / pos,   # e.g. 100 for 100:1 ratio
    eval_metric='aucpr'             # optimise for PR-AUC
)

# ── Fit and evaluate on minority class ────────────────
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))

Section 06

Strategy 2 — SMOTE Oversampling

SMOTE (Synthetic Minority Oversampling TEchnique) generates new synthetic minority class examples by interpolating between existing minority examples. For each minority sample, SMOTE finds its k nearest minority neighbours and creates new points along the line connecting them. This is fundamentally different from simple duplication: the model sees new, plausible points in feature space instead of repeated rows, although the synthetic points carry no information beyond what the original minority samples already contain.

The Medical Diagnosis Model That Finally Found Rare Cases
A hospital's AI team built a model to detect a rare autoimmune condition from blood panel results. Their dataset had 48,000 negative cases and only 480 positive ones, a 100:1 ratio. Even with class weights, the model's recall on positive cases was only 42%, missing 58% of actual patients. The problem: with only 480 positive examples, the model had too little data to learn the decision boundary accurately. After applying SMOTE to generate 9,520 synthetic positive examples (10,000 positives in total, bringing the ratio to roughly 5:1), recall improved to 87%. The synthetic examples allowed the model to map the boundary between positive and negative much more accurately. The hospital estimated the improved model would catch an additional 180 patients per year who would otherwise have been misdiagnosed.
📊 How SMOTE Works — Synthetic Interpolation Between Minority Samples
[Figure: how SMOTE generates synthetic minority samples by interpolating between existing minority points. Panels: "Before SMOTE" (480 minority vs 48,000 majority, minority class severely underrepresented) and "After SMOTE" (synthetic samples added). Legend: original minority, synthetic (SMOTE).]

SMOTE creates new minority examples (green) by randomly selecting a point on the line between two existing minority examples (red). It does not duplicate — it interpolates. This forces the model to learn a denser, more accurate boundary around the minority class.
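The interpolation step described above can be sketched in a few lines of NumPy. This is a simplified illustration, not imblearn's implementation: the function name `smote_sketch`, the brute-force distance matrix, and the toy minority cloud are all assumptions made for the example, and it requires more than k minority samples.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples (brute force)
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)                  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]    # k nearest minority neighbours per row

    base = rng.integers(0, len(X_min), size=n_new)               # random starting samples
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]    # one neighbour for each
    gap = rng.random((n_new, 1))                                 # position along the segment
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

rng = np.random.default_rng(1)
X_minority = rng.normal(loc=2.0, scale=0.5, size=(20, 2))   # made-up minority cloud
X_synth = smote_sketch(X_minority, n_new=40, k=5)
print(X_synth.shape)   # (40, 2)
```

Because every synthetic row is a convex combination of two real minority rows, it always falls inside the region the minority class already occupies. That is also the limitation: SMOTE densifies the known region rather than discovering new ones.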

from collections             import Counter
from imblearn.over_sampling  import SMOTE, ADASYN, BorderlineSMOTE
from imblearn.pipeline       import Pipeline  # imblearn Pipeline, NOT sklearn!
from sklearn.ensemble        import RandomForestClassifier
from sklearn.preprocessing   import StandardScaler

# ── Basic SMOTE ───────────────────────────────────────
smote = SMOTE(
    sampling_strategy=0.5,  # minority becomes 50% of majority
    k_neighbors=5,           # use 5 neighbours for interpolation
    random_state=42
)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(Counter(y_resampled))  # {0: 9910, 1: 4955}

# ── Borderline-SMOTE (focuses on boundary samples) ───
bl_smote = BorderlineSMOTE(random_state=42)

# ── ADASYN (adaptive — more synthesis where harder) ──
adasyn = ADASYN(random_state=42)

# ── CRITICAL: Use imblearn Pipeline, not sklearn! ─────
# sklearn's Pipeline rejects samplers such as SMOTE (they have no transform()).
# imblearn's Pipeline resamples only during fit() and skips SMOTE at predict time.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('smote',  SMOTE(random_state=42)),
    ('model',  RandomForestClassifier(n_estimators=200, random_state=42))
])

pipe.fit(X_train, y_train)   # SMOTE applied only to X_train
y_pred = pipe.predict(X_test)  # X_test untouched — correct!
⚠️
The Most Common SMOTE Mistake — Applying Before Train-Test Split

Never apply SMOTE to the full dataset before splitting. If you do, synthetic examples generated from training rows will appear in the test set — causing catastrophically optimistic evaluation metrics that do not reflect real-world performance. Always split first, then apply SMOTE only to the training set. The easiest way to guarantee this: use imblearn's Pipeline, which applies SMOTE only during fit().
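The order of operations matters more than the resampler itself. The sketch below shows "split first, then resample the training set only", using scikit-learn's train_test_split and, as a deliberately simple stand-in for SMOTE, random duplication of minority rows in NumPy. The data are synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # made-up features
y = np.array([1] * 20 + [0] * 980)        # 2% minority

# 1) Split first, stratified so both sets keep the original class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 2) Resample the training set only (duplication here, standing in for SMOTE)
minority_idx = np.flatnonzero(y_train == 1)
n_majority = int((y_train == 0).sum())
target_minority = n_majority // 2         # aim for a 2:1 ratio
extra = rng.choice(minority_idx, size=target_minority - len(minority_idx), replace=True)

X_train_res = np.vstack([X_train, X_train[extra]])
y_train_res = np.concatenate([y_train, y_train[extra]])

print("train before:", np.bincount(y_train))
print("train after :", np.bincount(y_train_res))
print("test        :", np.bincount(y_test))   # untouched, still the real-world ratio
```

The test set keeps its original distribution, so metrics computed on it reflect what the model will face in production.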

📊 Class Distribution — Before vs After SMOTE

Before SMOTE (100:1 ratio)

After SMOTE (2:1 ratio)

SMOTE with sampling_strategy=0.5 generates enough minority samples to make it 50% of the majority (a 2:1 ratio). Going all the way to 1:1 often over-corrects — a 2:1 or 3:1 ratio is usually sufficient for most algorithms.
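The arithmetic behind sampling_strategy is easy to check by hand. The helper below is a hypothetical sketch (not part of imblearn) that computes the minority count after oversampling and the number of synthetic rows this implies, using the 9,910 / 90 class counts from the earlier Counter output. Clamping at zero is a simplification made for this example.

```python
def smote_target_counts(n_majority, n_minority, sampling_strategy):
    """Return (minority count after resampling, synthetic rows needed)."""
    target_minority = int(sampling_strategy * n_majority)
    n_synthetic = max(target_minority - n_minority, 0)   # simplification: never negative
    return target_minority, n_synthetic

for strategy in (0.33, 0.5, 1.0):
    target, synthetic = smote_target_counts(9910, 90, strategy)
    print(f"sampling_strategy={strategy}: minority -> {target:,} "
          f"({synthetic:,} synthetic), ratio ~ {9910 / target:.1f}:1")
```

At sampling_strategy=0.5 the minority grows from 90 to 4,955 rows, which means 4,865 of them are synthetic. Keeping that proportion in mind is a useful sanity check before pushing the ratio all the way to 1:1.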


Section 07

Strategy 3 — Undersampling & Combined Methods

from collections             import Counter
from imblearn.under_sampling import RandomUnderSampler, TomekLinks, EditedNearestNeighbours
from imblearn.combine        import SMOTETomek, SMOTEENN
from imblearn.pipeline       import Pipeline
from sklearn.ensemble        import RandomForestClassifier

# ── Random Undersampling ─────────────────────────────
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_under, y_under = rus.fit_resample(X_train, y_train)
print(Counter(y_under))  # e.g. {0: 180, 1: 90}: majority cut to 2× the minority

# ── Tomek Links — remove borderline majority samples ─
tl = TomekLinks()         # Removes the majority member of each Tomek link (a cross-class pair of mutual nearest neighbours)
X_tl, y_tl = tl.fit_resample(X_train, y_train)

# ── SMOTETomek — SMOTE + Tomek cleanup ───────────────
# Oversample minority with SMOTE, then clean noisy pairs
smt = SMOTETomek(random_state=42)
X_comb, y_comb = smt.fit_resample(X_train, y_train)

# ── SMOTEENN — SMOTE + Edited Nearest Neighbours ─────
smenn = SMOTEENN(random_state=42)
X_enn, y_enn = smenn.fit_resample(X_train, y_train)

# ── Pipeline with combined resampling ─────────────────
pipe_comb = Pipeline([
    ('resample', SMOTETomek(random_state=42)),
    ('model',    RandomForestClassifier(n_estimators=200, random_state=42))
])
pipe_comb.fit(X_train, y_train)
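A Tomek link is a pair of samples from different classes that are each other's nearest neighbour. The NumPy sketch below finds such pairs on a toy one-dimensional dataset and drops the majority member of each link. It is an illustration of the idea under simple assumptions (brute-force distances, binary labels with class 0 as the majority), not imblearn's implementation, and the function name `tomek_links` is hypothetical.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)           # ignore self-distances
    nearest = dists.argmin(axis=1)            # nearest neighbour of every point
    links = []
    for i, j in enumerate(nearest):
        # i < j reports each pair once; nearest[j] == i makes the relation mutual
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

# Toy 1-D data: class 0 is the majority, class 1 the minority
X = np.array([[0.0], [1.0], [2.0], [2.9], [3.0], [5.0]])
y = np.array([0, 0, 0, 0, 1, 1])

links = tomek_links(X, y)
print(links)                                  # [(3, 4)]

# Drop only the majority member of each link
drop = {i if y[i] == 0 else j for i, j in links}
keep = [idx for idx in range(len(y)) if idx not in drop]
X_clean, y_clean = X[keep], y[keep]
print(y_clean)                                # [0 0 0 1 1]
```

Only the majority point at 2.9, which sits right next to the minority point at 3.0, is removed. Tomek cleaning therefore trims the class boundary rather than balancing the dataset, which is why it is usually paired with SMOTE.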

Section 08

Strategy 4 — Threshold Tuning

Most classifiers can output a probability (or a score) for each example. By default, class 1 is predicted when that probability exceeds 0.5. On imbalanced data, this default is often a poor choice: the model is biased toward predicting the majority class, so the best threshold is usually well below 0.5. Tuning the threshold lets you directly control the precision-recall trade-off without changing any model parameters.

import numpy as np
from sklearn.metrics import (
    precision_recall_curve, classification_report,
    precision_score, recall_score, f1_score
)

# ── Get predicted probabilities on a held-out validation set ──
# Tune the threshold on validation data (X_val, y_val), never on the test set
y_prob = model.predict_proba(X_val)[:, 1]  # P(class=1)

# ── Find optimal threshold by maximising F1 ──────────
precisions, recalls, thresholds = precision_recall_curve(y_val, y_prob)

# precisions/recalls have one more entry than thresholds, so drop the final point
f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-10)
best_idx   = np.argmax(f1_scores)
best_thresh = thresholds[best_idx]
print(f"Optimal threshold: {best_thresh:.3f}")
print(f"At this threshold: Precision={precisions[best_idx]:.3f}, Recall={recalls[best_idx]:.3f}")

# ── Apply custom threshold ────────────────────────────
y_pred_custom = (y_prob >= best_thresh).astype(int)
print(classification_report(y_val, y_pred_custom))

# ── Business-driven threshold ─────────────────────────
# In fraud detection: catching fraud (recall) > false alarms (precision)
# Lower threshold → higher recall, more false alarms
for thresh in [0.3, 0.4, 0.5, 0.6]:
    y_pred_t = (y_prob >= thresh).astype(int)
    p = precision_score(y_val, y_pred_t, zero_division=0)
    r = recall_score(y_val, y_pred_t)
    f = f1_score(y_val, y_pred_t, zero_division=0)
    print(f"Thresh={thresh:.1f} | Precision={p:.3f} | Recall={r:.3f} | F1={f:.3f}")

# ── Save threshold alongside model ────────────────────
import joblib
joblib.dump({'model': model, 'threshold': best_thresh}, 'fraud_model.pkl')

Section 09

Strategy Comparison — Which Method Wins?

📊 Output: All Strategies Compared — Recall, Precision & F1 on Fraud Detection
Legend: Recall (catching fraud) · Precision (avoiding false alarms) · F1 Score

No strategy is universally best: the choice depends on whether you prioritise recall (catching more fraud) or precision (reducing false alarms). SMOTE + class weights and SMOTETomek tie for the best F1 (0.77), with SMOTE + class weights giving the higher recall. For maximum recall at the cost of precision, threshold tuning to 0.2 is most aggressive.

Strategy | Recall | Precision | F1 | Training Speed | Best For
No treatment (baseline) | 0.00 | 0.00 | 0.00 | Fast | Never use on imbalanced data
Class weights | 0.64 | 0.71 | 0.67 | Fast | First line of defence; always try first
Random Undersampling | 0.68 | 0.58 | 0.63 | Fast | Very large datasets where speed matters
SMOTE Oversampling | 0.76 | 0.68 | 0.72 | Medium | Moderate imbalance with enough minority samples
SMOTE + Class Weights | 0.81 | 0.74 | 0.77 | Medium | Best general-purpose combination
SMOTETomek Combined | 0.79 | 0.76 | 0.77 | Slow | When boundary clarity matters
Threshold Tuning (0.2) | 0.89 | 0.48 | 0.62 | Fast | Maximise recall; tolerate false alarms
Section 10

Complete Imbalanced Data Pipeline

from imblearn.pipeline       import Pipeline   # NOT sklearn.pipeline!
from imblearn.over_sampling  import SMOTE
from imblearn.combine        import SMOTETomek
from sklearn.compose         import ColumnTransformer
from sklearn.preprocessing   import StandardScaler, OneHotEncoder
from sklearn.impute           import SimpleImputer
from sklearn.ensemble         import RandomForestClassifier
from sklearn.model_selection  import StratifiedKFold, cross_val_score
from sklearn.metrics          import (average_precision_score, classification_report,
                                      precision_recall_curve)
import numpy as np
import joblib

# ── 1. Preprocessing ──────────────────────────────────
num_cols = ['amount', 'age', 'credit_score', 'balance']
cat_cols = ['city', 'gender', 'product_type']

num_pipe = Pipeline([
    ('imp', SimpleImputer(strategy='median')),
    ('sc',  StandardScaler())
])
cat_pipe = Pipeline([
    ('imp', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
preprocessor = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])

# ── 2. Full imblearn pipeline ─────────────────────────
full_pipe = Pipeline([
    ('prep',    preprocessor),
    ('smote',   SMOTE(sampling_strategy=0.5, random_state=42)),
    ('model',   RandomForestClassifier(
                    n_estimators=300,
                    class_weight='balanced',  # combine with SMOTE
                    random_state=42, n_jobs=-1
                ))
])

# ── 3. Cross-validate with StratifiedKFold ────────────
# StratifiedKFold preserves the class ratio in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(full_pipe, X_train, y_train,
                              cv=cv, scoring='average_precision')
print(f"CV AUC-PR: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# ── 4. Fit on full training set ───────────────────────
full_pipe.fit(X_train, y_train)

# ── 5. Find optimal threshold on validation set ───────
y_prob = full_pipe.predict_proba(X_val)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_val, y_prob)
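# precisions/recalls have one more entry than thresholds, so drop the final point
f1s = 2*(precisions[:-1]*recalls[:-1])/(precisions[:-1]+recalls[:-1]+1e-10)
best_threshold = thresholds[np.argmax(f1s)]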
print(f"Optimal threshold: {best_threshold:.3f}")

# ── 6. Final evaluation ───────────────────────────────
y_pred_final = (full_pipe.predict_proba(X_test)[:, 1] >= best_threshold).astype(int)
print(classification_report(y_test, y_pred_final, target_names=['Legit', 'Fraud']))
print(f"AUC-PR: {average_precision_score(y_test, full_pipe.predict_proba(X_test)[:,1]):.3f}")

# ── 7. Save pipeline + threshold together ─────────────
joblib.dump({'pipeline': full_pipe, 'threshold': best_threshold}, 'imbalanced_model.pkl')

# ── Inference on new data ─────────────────────────────
artifacts = joblib.load('imbalanced_model.pkl')
pipe_loaded  = artifacts['pipeline']
thresh_loaded = artifacts['threshold']
new_probs    = pipe_loaded.predict_proba(new_raw_data)[:, 1]
new_preds    = (new_probs >= thresh_loaded).astype(int)
Critical Rules for Imbalanced Pipelines

Four non-negotiable rules: (1) Use imblearn Pipeline, not sklearn Pipeline: imblearn applies SMOTE only during fit() and skips it during predict(), whereas sklearn's Pipeline does not accept samplers at all; (2) Use StratifiedKFold for cross-validation, since it preserves the class ratio in every fold; (3) Tune and save the threshold alongside the pipeline, because a model without its threshold falls back to the default 0.5, which is rarely optimal for imbalanced data; (4) Evaluate with AUC-PR rather than accuracy or AUC-ROC alone, because it focuses on minority class performance.
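Rule (2) is easy to see on a small example. The sketch below assumes scikit-learn is installed and uses hypothetical labels that happen to be sorted by class, a common situation when data are exported ordered by outcome or by date.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels sorted by class: 990 negatives followed by 10 positives
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((len(y), 1))   # features are irrelevant for the split itself

def positives_per_fold(splitter):
    """Count minority examples in the test part of each fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]

print("KFold          :", positives_per_fold(KFold(n_splits=5)))            # [0, 0, 0, 0, 10]
print("StratifiedKFold:", positives_per_fold(StratifiedKFold(n_splits=5)))  # [2, 2, 2, 2, 2]
```

With plain KFold, four of the five validation folds contain no positives at all, so recall and AUC-PR cannot even be computed on them. Shuffling KFold helps on average but gives no guarantee, whereas stratification does.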


Section 11

Golden Rules of Handling Imbalanced Data

🎯 8 Rules Every Data Scientist Must Follow
1
Never use accuracy as your primary metric on imbalanced data. A model that predicts the majority class for every single example will score 99% accuracy on a 99:1 dataset while catching zero minority examples. Accuracy is a lie. Use precision, recall, F1, and AUC-PR.
2
Always try class weights first, before SMOTE, before undersampling, before anything else. It is a single argument (class_weight='balanced'), it costs almost nothing, and it often delivers much of the benefit you would get from more complex resampling techniques.
3
Never apply SMOTE before the train-test split. Synthetic examples created from training rows will bleed into the test set, producing catastrophically optimistic metrics that collapse in production. Always split first. Use imblearn Pipeline to enforce this automatically.
4
Use imblearn's Pipeline, not sklearn's. sklearn's Pipeline requires every intermediate step to implement transform(), so it rejects samplers such as SMOTE outright. imblearn's Pipeline applies resampling only during fit() and skips it during prediction, which is the only correct behaviour.
5
Always use StratifiedKFold for cross-validation on imbalanced data. Regular KFold ignores the labels when it assigns rows to folds, so some folds may end up with very few or even zero minority examples, making CV scores meaningless. StratifiedKFold preserves the class ratio in every fold.
6
Tune the decision threshold using a held-out validation set — not the test set. Find the threshold that maximises F1 (or recall, depending on business requirements) and save it alongside the model. The default 0.5 threshold is almost always suboptimal for imbalanced problems.
7
SMOTE requires enough minority samples to interpolate meaningfully. With the default k_neighbors=5 it needs at least 6 minority samples; with fewer it fails unless k_neighbors is lowered, and with only a handful of samples the synthetic examples are near-copies of one another. As a rule of thumb, have at least 10 minority samples before applying SMOTE, with 50+ preferred.
8
Understand the business cost of each type of error before choosing a strategy. In fraud detection, a missed fraud (false negative) costs far more than a false alarm (false positive). In a screening test, missing a case may be unacceptable. The precision-recall trade-off is a business decision, not just a statistical one.
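Rule 8 can be turned into code by picking the threshold that minimises total error cost instead of maximising F1. The sketch below uses NumPy with a tiny made-up validation set. The helper name `cheapest_threshold` and the cost figures (a missed fraud costing 50 times a false alarm) are assumptions for illustration only.

```python
import numpy as np

def cheapest_threshold(y_true, y_prob, cost_fn, cost_fp, candidates):
    """Return (threshold, total cost) minimising cost_fn*FN + cost_fp*FP."""
    best = None
    for t in candidates:
        y_pred = (y_prob >= t).astype(int)
        fn = int(((y_true == 1) & (y_pred == 0)).sum())   # missed positives
        fp = int(((y_true == 0) & (y_pred == 1)).sum())   # false alarms
        cost = cost_fn * fn + cost_fp * fp
        if best is None or cost < best[1]:
            best = (t, cost)
    return best

# Made-up validation labels and predicted probabilities
y_val = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
p_val = np.array([0.05, 0.10, 0.15, 0.20, 0.35, 0.45, 0.30, 0.60, 0.55, 0.90])
candidates = [0.2, 0.3, 0.4, 0.5, 0.6]

# Missed fraud is 50x as costly as a false alarm
print(cheapest_threshold(y_val, p_val, cost_fn=50, cost_fp=1, candidates=candidates))  # (0.3, 3)
# Both errors cost the same
print(cheapest_threshold(y_val, p_val, cost_fn=1, cost_fp=1, candidates=candidates))   # (0.5, 2)
```

The same model and the same probabilities yield different operating points once the costs change, which is why the threshold should be agreed with the business rather than left at a statistical default.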
🧮
Key Takeaway

The payment company's fraud model that caught zero frauds was not a model failure — it was a problem formulation failure. It was optimised for the wrong objective on the wrong metric with the wrong data distribution. Handling imbalanced data is not a preprocessing trick — it is a fundamental re-orientation of the entire modelling workflow: choose the right metrics first, then choose the right treatment, then build the pipeline in the right order. The model is the last thing you change. Everything before it determines whether the model even has a chance to learn what matters.