The Class Imbalance Problem
In an ideal world, machine learning datasets have equal numbers of examples for each class. In the real world, the most important problems are almost always severely imbalanced. Fraud is rare. Cancer is rare. Equipment failure is rare. Network intrusions are rare. This rarity is precisely what makes these problems critical — and it is exactly what makes standard machine learning approaches fail silently and dangerously.
On a dataset that is 99% class 0 and 1% class 1, a model that predicts class 0 for everything achieves 99% accuracy, a score that looks excellent, while being completely useless: it never detects a single positive. Never use accuracy as your primary metric on imbalanced data. Use precision, recall, F1-score, and especially AUC-PR (Area Under the Precision-Recall Curve).
[Figure: example problems with class imbalance: review ratings, churn prediction, disease screening, loan default, fraud detection, manufacturing defect, rare disease, cyber intrusion.]
A 99:1 class ratio means the model sees 99 negative examples for every 1 positive — it is vastly easier for the model to simply predict "negative" for everything and be right 99% of the time. The rare class is the one that matters most — and the one the model learns least about.
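The accuracy trap described above is easy to reproduce. The following is a minimal sketch, assuming toy labels at a 99:1 ratio (9,900 negatives, 100 positives) and scikit-learn's `DummyClassifier` standing in for a model that always predicts the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Assumed toy labels: 9,900 negatives and 100 positives (a 99:1 ratio).
y = np.array([0] * 9900 + [1] * 100)
X = np.zeros((len(y), 1))  # features are irrelevant for this baseline

# A "model" that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

acc = accuracy_score(y, y_pred)
rec = recall_score(y, y_pred)
print(f"Accuracy: {acc:.2%}")  # high, despite learning nothing
print(f"Recall  : {rec:.2%}")  # no positive is ever detected
```

Any real model should be compared against this baseline before its accuracy is taken seriously.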
Step 1 — Detecting Class Imbalance
Before applying any fix, you must first measure and visualise the imbalance. Understanding the ratio, the absolute counts, and the distribution pattern determines which treatment strategy is appropriate.
import pandas as pd
import numpy as np
from collections import Counter
# ── Basic class count ─────────────────────────────────
print(df['target'].value_counts())
print(df['target'].value_counts(normalize=True).round(4))
# ── Imbalance ratio ───────────────────────────────────
counts = df['target'].value_counts()
majority = counts.max()
minority = counts.min()
ratio = majority / minority
print(f"Majority class : {majority:,} ({majority/len(df)*100:.1f}%)")
print(f"Minority class : {minority:,} ({minority/len(df)*100:.1f}%)")
print(f"Imbalance ratio: {ratio:.0f}:1")
# ── Severity classification ───────────────────────────
def imbalance_severity(ratio):
    if ratio < 4:
        return "Mild — may not need treatment"
    elif ratio < 10:
        return "Moderate — class weights recommended"
    elif ratio < 100:
        return "Severe — SMOTE or resampling needed"
    else:
        return "Extreme — combine multiple strategies"
print(imbalance_severity(ratio))
# ── Counter on numpy array ────────────────────────────
print(Counter(y_train)) # Counter({0: 9910, 1: 90})
Step 2 — Use the Right Metrics
Before fixing the data, fix the evaluation. If you optimise for the wrong metric, every technique you apply will improve that wrong metric while potentially making the real problem worse. On imbalanced data, precision, recall, F1, and AUC-PR tell you whether your model is actually learning to detect the minority class; accuracy does not.
[Figure: two confusion matrices, each with cells for correct legit, wrong alarm, missed fraud, and caught fraud.]
First matrix (fraud missed): Accuracy: 99.1% | Recall: 0% | Frauds caught: 0/90
Second matrix (fraud caught): Accuracy: 97.1% | Recall: 80% | Frauds caught: 72/90
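The figures for the second confusion matrix can be unpacked with a little arithmetic. This is a sketch under one assumption: the test set holds 10,000 transactions (9,910 legit and 90 fraud), which is what the 99.1% accuracy of the all-negative model implies. The false-alarm count is not stated in the text; it is derived here from the stated accuracy.

```python
# Stated in the text: 90 frauds, 72 caught, 97.1% accuracy.
# Assumed: 10,000 test transactions in total.
total, n_fraud = 10_000, 90
tp = 72                           # frauds caught
fn = n_fraud - tp                 # frauds missed
correct = round(0.971 * total)    # predictions that were right
tn = correct - tp                 # legit transactions correctly passed
fp = (total - n_fraud) - tn       # false alarms implied by the accuracy

accuracy = (tp + tn) / total
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(f"TP={tp} FN={fn} TN={tn} FP={fp}")
print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
```

Under that assumption the second model trades two points of accuracy for 72 caught frauds, and its precision shows how many alerts an analyst would have to review per real fraud.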
from sklearn.metrics import (
classification_report, confusion_matrix,
precision_score, recall_score, f1_score,
roc_auc_score, average_precision_score,
precision_recall_curve
)
# ── Full classification report ────────────────────────
print(classification_report(y_test, y_pred,
target_names=['Legit', 'Fraud'],
digits=4))
# ── Key minority-class metrics ────────────────────────
print(f"Precision (fraud) : {precision_score(y_test, y_pred):.3f}")
# Of all predicted frauds, what % were actually fraud?
print(f"Recall (fraud) : {recall_score(y_test, y_pred):.3f}")
# Of all actual frauds, what % did the model catch?
print(f"F1 (fraud) : {f1_score(y_test, y_pred):.3f}")
# Harmonic mean of precision and recall
print(f"AUC-ROC : {roc_auc_score(y_test, y_prob):.3f}")
print(f"AUC-PR : {average_precision_score(y_test, y_prob):.3f}")
# AUC-PR is more informative than AUC-ROC for imbalanced data
# ── Balanced accuracy (accounts for imbalance) ────────
from sklearn.metrics import balanced_accuracy_score
print(f"Balanced Accuracy : {balanced_accuracy_score(y_test, y_pred):.3f}")
[Figure: Precision-Recall curve (AUC-PR, left) and ROC curve (AUC-ROC, right).]
The ROC curve (right) can look impressive even for a weak model, because the false positive rate is divided by the very large number of negatives, so hundreds of false alarms barely move it. The PR curve (left) is brutally honest: precision counts every false alarm against the small number of true positives, so a weak model's curve collapses toward the bottom. Prefer AUC-PR as the primary metric on severely imbalanced data.
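The gap between the two summaries can be seen on synthetic scores. The sketch below assumes a 99:1 dataset in which positive scores are drawn from a normal distribution shifted two standard deviations above the negatives; the distributions and the seed are illustrative choices, not taken from the article's data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_neg, n_pos = 9900, 100

# Labels: 9,900 negatives followed by 100 positives.
y_true = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
# Assumed score distributions: positives score higher on average, with overlap.
y_score = np.concatenate([
    rng.normal(0.0, 1.0, n_neg),
    rng.normal(2.0, 1.0, n_pos),
])

roc = roc_auc_score(y_true, y_score)
ap = average_precision_score(y_true, y_score)
prevalence = y_true.mean()  # AUC-PR of a random scorer is about this value

print(f"AUC-ROC: {roc:.3f}")
print(f"AUC-PR : {ap:.3f} (random baseline ~ {prevalence:.3f})")
```

The same scores yield a comfortable-looking AUC-ROC and a much lower AUC-PR, because the overlap region contains far more negatives than positives.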
Six Strategies to Handle Imbalanced Data
Strategy 1 — Class Weights (Fastest Fix)
Class weighting is the simplest and most computationally efficient treatment. It modifies the loss function to penalise errors on the minority class more heavily — without changing a single row of data. A model with class_weight='balanced' treats each minority class error as if it were equivalent to 100 majority class errors (in a 100:1 imbalance scenario).
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight
import numpy as np
# ── Option A: 'balanced' — auto-computes from data ────
lr = LogisticRegression(class_weight='balanced', max_iter=1000)
rf = RandomForestClassifier(class_weight='balanced', n_estimators=200)
svc = SVC(class_weight='balanced', probability=True)
# ── Option B: Manual weights ──────────────────────────
weights = compute_class_weight(
class_weight='balanced',
classes=np.unique(y_train),
y=y_train
)
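# Map each class label to its weight (safer than enumerate if labels are not 0..n-1)
weight_dict = dict(zip(np.unique(y_train).tolist(), weights.tolist()))
print(weight_dict) # e.g. {0: 0.505, 1: 50.5} at a 100:1 ratio ← minority gets 100× weight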
rf_manual = RandomForestClassifier(class_weight=weight_dict)
# ── For XGBoost: scale_pos_weight ────────────────────
import xgboost as xgb
neg = (y_train == 0).sum()
pos = (y_train == 1).sum()
xgb_model = xgb.XGBClassifier(
scale_pos_weight = neg / pos, # e.g. 100 for 100:1 ratio
eval_metric='aucpr' # optimise for PR-AUC
)
# ── Fit and evaluate on minority class ────────────────
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
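The 100× figure quoted for `class_weight='balanced'` follows from the formula scikit-learn documents for it, `n_samples / (n_classes * count_of_class)`. Here is a small check of that formula by hand, assuming toy labels with 10,000 negatives and 100 positives.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed toy labels at a 100:1 ratio.
y = np.array([0] * 10_000 + [1] * 100)

classes, counts = np.unique(y, return_counts=True)
# 'balanced' heuristic: n_samples / (n_classes * samples_in_class)
manual = len(y) / (len(classes) * counts)
from_sklearn = compute_class_weight(class_weight="balanced", classes=classes, y=y)

weight_dict = dict(zip(classes.tolist(), manual.tolist()))
print(weight_dict)            # {0: 0.505, 1: 50.5}
print(manual[1] / manual[0])  # relative penalty for a minority-class error
```

The absolute weights matter less than their ratio, which equals the imbalance ratio.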
Strategy 2 — SMOTE Oversampling
SMOTE (Synthetic Minority Oversampling TEchnique) generates new synthetic minority class examples by interpolating between existing minority examples. For each minority sample, SMOTE finds its k nearest minority neighbours and creates new points along the line connecting them. This is fundamentally different from simple duplication: instead of repeating existing rows, it fills in plausible new points between them, which makes the minority region denser without giving any single row extra weight.
SMOTE creates new minority examples (green) by randomly selecting a point on the line between two existing minority examples (red). It does not duplicate — it interpolates. This forces the model to learn a denser, more accurate boundary around the minority class.
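The interpolation step can be sketched in a few lines of NumPy. This is an illustrative simplification, not imblearn's implementation: the function name `smote_sketch` and its parameters are hypothetical, and it assumes a purely numeric minority matrix with no duplicate rows.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)

    base = rng.integers(0, len(X_min), size=n_new)          # random starting points
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]   # one of their k neighbours
    lam = rng.random((n_new, 1))                            # position along the segment
    # x_new = x_i + lambda * (x_neighbour - x_i), with lambda in [0, 1)
    return X_min[base] + lam * (X_min[neigh] - X_min[base])

# Toy minority class: 20 points in 2-D.
rng = np.random.default_rng(42)
X_min = rng.normal(size=(20, 2))
X_syn = smote_sketch(X_min, n_new=50)
print(X_syn.shape)
```

Because every synthetic point lies on a segment between two real minority points, none of them can fall outside the region the minority class already occupies.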
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE
from imblearn.pipeline import Pipeline # imblearn Pipeline, NOT sklearn!
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
# ── Basic SMOTE ───────────────────────────────────────
smote = SMOTE(
sampling_strategy=0.5, # minority becomes 50% of majority
k_neighbors=5, # use 5 neighbours for interpolation
random_state=42
)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(Counter(y_resampled)) # {0: 9910, 1: 4955}
# ── Borderline-SMOTE (focuses on boundary samples) ───
bl_smote = BorderlineSMOTE(random_state=42)
# ── ADASYN (adaptive — more synthesis where harder) ──
adasyn = ADASYN(random_state=42)
# ── CRITICAL: Use imblearn Pipeline, not sklearn! ─────
# sklearn Pipeline rejects samplers: SMOTE has no transform(), so building it raises a TypeError
# imblearn Pipeline resamples only during fit(), never at predict(): correct!
pipe = Pipeline([
('scaler', StandardScaler()),
('smote', SMOTE(random_state=42)),
('model', RandomForestClassifier(n_estimators=200, random_state=42))
])
pipe.fit(X_train, y_train) # SMOTE applied only to X_train
y_pred = pipe.predict(X_test) # X_test untouched — correct!
Never apply SMOTE to the full dataset before splitting. If you do, synthetic examples generated from training rows will appear in the test set — causing catastrophically optimistic evaluation metrics that do not reflect real-world performance. Always split first, then apply SMOTE only to the training set. The easiest way to guarantee this: use imblearn's Pipeline, which applies SMOTE only during fit().
[Figure: class distribution before SMOTE (100:1 ratio) and after SMOTE (2:1 ratio).]
SMOTE with sampling_strategy=0.5 generates enough minority samples to make it 50% of the majority (a 2:1 ratio). Going all the way to 1:1 often over-corrects — a 2:1 or 3:1 ratio is usually sufficient for most algorithms.
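The effect of `sampling_strategy` is plain arithmetic. The sketch below uses the counts printed earlier (9,910 majority and 90 minority rows) and assumes the target minority size is the ratio times the majority count, truncated to an integer.

```python
n_majority, n_minority = 9910, 90
sampling_strategy = 0.5  # desired minority / majority ratio after resampling

target_minority = int(sampling_strategy * n_majority)  # minority size after SMOTE
n_synthetic = target_minority - n_minority             # rows SMOTE has to create
ratio_after = n_majority / target_minority

print(f"Minority after SMOTE : {target_minority}")
print(f"Synthetic rows added : {n_synthetic}")
print(f"Ratio after          : {ratio_after:.1f}:1")
```

Almost all minority rows in the resampled set are synthetic, which is one reason to stop short of a 1:1 ratio when only a few real minority examples exist.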
Strategy 3 — Undersampling & Combined Methods
from imblearn.under_sampling import RandomUnderSampler, TomekLinks, EditedNearestNeighbours
from imblearn.combine import SMOTETomek, SMOTEENN
from imblearn.pipeline import Pipeline
# ── Random Undersampling ─────────────────────────────
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_under, y_under = rus.fit_resample(X_train, y_train)
print(Counter(y_under)) # {0: 180, 1: 90} — majority cut to 2× the 90 minority rows
# ── Tomek Links — remove borderline majority samples ─
tl = TomekLinks() # Removes majority samples that form Tomek links (cross-class mutual nearest neighbours)
X_tl, y_tl = tl.fit_resample(X_train, y_train)
# ── SMOTETomek — SMOTE + Tomek cleanup ───────────────
# Oversample minority with SMOTE, then clean noisy pairs
smt = SMOTETomek(random_state=42)
X_comb, y_comb = smt.fit_resample(X_train, y_train)
# ── SMOTEENN — SMOTE + Edited Nearest Neighbours ─────
smenn = SMOTEENN(random_state=42)
X_enn, y_enn = smenn.fit_resample(X_train, y_train)
# ── Pipeline with combined resampling ─────────────────
pipe_comb = Pipeline([
('resample', SMOTETomek(random_state=42)),
('model', RandomForestClassifier(n_estimators=200, random_state=42))
])
pipe_comb.fit(X_train, y_train)
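For readers unfamiliar with Tomek links: a pair of points forms a link when the two belong to different classes and each is the other's nearest neighbour. The sketch below finds the majority members of such pairs; the helper name `tomek_link_majority_indices` is hypothetical, and the code assumes no duplicate rows.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_link_majority_indices(X, y, majority_label=0):
    """Return indices of majority samples that take part in a Tomek link."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    nearest = idx[:, 1]                       # column 0 is the point itself
    mutual = nearest[nearest] == np.arange(len(X))
    opposite = y != y[nearest]                # the neighbour has the other label
    is_link = mutual & opposite
    return np.where(is_link & (y == majority_label))[0]

# Toy 1-D data: the majority point at 1.0 sits right next to the minority point at 1.1.
X = np.array([[0.0], [1.0], [1.1], [5.0], [6.0]])
y = np.array([0, 0, 1, 0, 0])

to_drop = tomek_link_majority_indices(X, y)
print(to_drop)  # -> [1]
```

Dropping those majority points widens the gap around minority examples, which is the clean-up step that SMOTETomek runs after oversampling.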
Strategy 4 — Threshold Tuning
Most classifiers can output a probability or a score. By default, class 1 is predicted when the probability exceeds 0.5. On imbalanced data, this default is rarely the best choice: an untreated model is biased toward predicting the majority class, so the optimal threshold is usually well below 0.5. Tuning the threshold lets you directly control the precision-recall trade-off without changing any model parameters.
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score
# ── Get predicted probabilities ───────────────────────
# Note: in practice, tune the threshold on a held-out validation split, not the final test set
y_prob = model.predict_proba(X_test)[:, 1] # P(class=1)
# ── Find optimal threshold by maximising F1 ──────────
precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)
f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-10)
# precisions/recalls have one more entry than thresholds; skip the final (recall = 0) point
best_idx = np.argmax(f1_scores[:-1])
best_thresh = thresholds[best_idx]
print(f"Optimal threshold: {best_thresh:.3f}")
print(f"At this threshold — Precision: {precisions[best_idx]:.3f}, Recall: {recalls[best_idx]:.3f}")
# ── Apply custom threshold ────────────────────────────
y_pred_custom = (y_prob >= best_thresh).astype(int)
print(classification_report(y_test, y_pred_custom))
# ── Business-driven threshold ─────────────────────────
# In fraud detection: catching fraud (recall) > false alarms (precision)
# Lower threshold → higher recall, more false alarms
for thresh in [0.3, 0.4, 0.5, 0.6]:
    y_pred_t = (y_prob >= thresh).astype(int)
    p = precision_score(y_test, y_pred_t, zero_division=0)
    r = recall_score(y_test, y_pred_t)
    print(f"Thresh={thresh:.1f} | Precision={p:.3f} | Recall={r:.3f} | F1={f1_score(y_test,y_pred_t):.3f}")
# ── Save threshold alongside model ────────────────────
import joblib
joblib.dump({'model': model, 'threshold': best_thresh}, 'fraud_model.pkl')
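A business-driven threshold can also be chosen by minimising expected cost rather than maximising F1. The following is a sketch only: the helper `pick_threshold_by_cost`, the cost values, and the synthetic probabilities are all hypothetical and should be replaced with real costs and validation-set predictions.

```python
import numpy as np

def pick_threshold_by_cost(y_true, y_prob, cost_fn, cost_fp, grid=None):
    """Return the threshold on the grid with the lowest total misclassification cost."""
    grid = np.linspace(0.05, 0.95, 19) if grid is None else grid
    costs = []
    for t in grid:
        y_pred = y_prob >= t
        fn = np.sum((y_true == 1) & ~y_pred)   # missed positives
        fp = np.sum((y_true == 0) & y_pred)    # false alarms
        costs.append(fn * cost_fn + fp * cost_fp)
    best = int(np.argmin(costs))
    return float(grid[best]), float(costs[best])

# Synthetic stand-in for validation labels and predicted probabilities.
rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(9900, dtype=int), np.ones(100, dtype=int)])
y_prob = np.concatenate([rng.beta(1, 8, 9900), rng.beta(4, 3, 100)])

# Hypothetical costs: equal costs versus a missed fraud costing 200x a false alarm.
t_equal, _ = pick_threshold_by_cost(y_true, y_prob, cost_fn=1, cost_fp=1)
t_costly_miss, _ = pick_threshold_by_cost(y_true, y_prob, cost_fn=200, cost_fp=1)
print(f"Equal costs        -> threshold {t_equal:.2f}")
print(f"Misses cost 200x   -> threshold {t_costly_miss:.2f}")
```

Raising the cost of a missed positive can only push the chosen threshold down or leave it unchanged, which matches the recall-first reasoning for fraud detection.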
Strategy Comparison — Which Method Wins?
No strategy is universally best: the choice depends on whether you prioritise recall (catching more fraud) or precision (reducing false alarms). In the table below, SMOTE + class weights and SMOTETomek tie for the best F1 (0.77), with the former slightly ahead on recall and the latter on precision. For maximum recall at the cost of precision, threshold tuning to 0.2 is the most aggressive option.
| Strategy | Recall | Precision | F1 | Training Speed | Best For |
|---|---|---|---|---|---|
| No treatment (baseline) | 0.00 | — | 0.00 | Fast | Never use on imbalanced data |
| Class weights | 0.64 | 0.71 | 0.67 | Fast | First line of defence — always try first |
| Random Undersampling | 0.68 | 0.58 | 0.63 | Fast | Very large datasets where speed matters |
| SMOTE Oversampling | 0.76 | 0.68 | 0.72 | Medium | Moderate imbalance with enough minority samples |
| SMOTE + Class Weights | 0.81 | 0.74 | 0.77 | Medium | Best general-purpose combination |
| SMOTETomek Combined | 0.79 | 0.76 | 0.77 | Slow | When boundary clarity matters |
| Threshold Tuning (0.2) | 0.89 | 0.48 | 0.62 | Fast | Maximise recall; tolerate false alarms |
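The F1 column can be cross-checked from the recall and precision columns, since F1 is their harmonic mean. The values below are copied from the table; the baseline row is omitted because its precision is undefined.

```python
# (recall, precision, reported F1) per strategy, copied from the table above.
rows = {
    "Class weights": (0.64, 0.71, 0.67),
    "Random Undersampling": (0.68, 0.58, 0.63),
    "SMOTE Oversampling": (0.76, 0.68, 0.72),
    "SMOTE + Class Weights": (0.81, 0.74, 0.77),
    "SMOTETomek Combined": (0.79, 0.76, 0.77),
    "Threshold Tuning (0.2)": (0.89, 0.48, 0.62),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

computed = {name: round(f1(p, r), 2) for name, (r, p, _) in rows.items()}
for name, (_, _, reported) in rows.items():
    print(f"{name:<24} reported={reported:.2f} computed={computed[name]:.2f}")
```

The threshold-tuning row shows why F1 alone can mislead: it has the highest recall in the table but one of the lowest F1 scores.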
Complete Imbalanced Data Pipeline
from imblearn.pipeline import Pipeline # NOT sklearn.pipeline!
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import average_precision_score, classification_report, precision_recall_curve
import numpy as np
import joblib
# ── 1. Preprocessing ──────────────────────────────────
num_cols = ['amount', 'age', 'credit_score', 'balance']
cat_cols = ['city', 'gender', 'product_type']
num_pipe = Pipeline([
('imp', SimpleImputer(strategy='median')),
('sc', StandardScaler())
])
cat_pipe = Pipeline([
('imp', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
preprocessor = ColumnTransformer([
('num', num_pipe, num_cols),
('cat', cat_pipe, cat_cols)
])
# ── 2. Full imblearn pipeline ─────────────────────────
full_pipe = Pipeline([
('prep', preprocessor),
('smote', SMOTE(sampling_strategy=0.5, random_state=42)),
('model', RandomForestClassifier(
n_estimators=300,
class_weight='balanced', # combine with SMOTE
random_state=42, n_jobs=-1
))
])
# ── 3. Cross-validate with StratifiedKFold ────────────
# StratifiedKFold preserves the class ratio in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(full_pipe, X_train, y_train,
cv=cv, scoring='average_precision')
print(f"CV AUC-PR: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# ── 4. Fit on full training set ───────────────────────
full_pipe.fit(X_train, y_train)
# ── 5. Find optimal threshold on validation set ───────
y_prob = full_pipe.predict_proba(X_val)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_val, y_prob)
f1s = 2*(precisions*recalls)/(precisions+recalls+1e-10)
# thresholds is one element shorter than precisions/recalls; skip the final point
best_threshold = thresholds[np.argmax(f1s[:-1])]
print(f"Optimal threshold: {best_threshold:.3f}")
# ── 6. Final evaluation ───────────────────────────────
y_pred_final = (full_pipe.predict_proba(X_test)[:, 1] >= best_threshold).astype(int)
print(classification_report(y_test, y_pred_final, target_names=['Legit', 'Fraud']))
print(f"AUC-PR: {average_precision_score(y_test, full_pipe.predict_proba(X_test)[:,1]):.3f}")
# ── 7. Save pipeline + threshold together ─────────────
joblib.dump({'pipeline': full_pipe, 'threshold': best_threshold}, 'imbalanced_model.pkl')
# ── Inference on new data ─────────────────────────────
artifacts = joblib.load('imbalanced_model.pkl')
pipe_loaded = artifacts['pipeline']
thresh_loaded = artifacts['threshold']
new_probs = pipe_loaded.predict_proba(new_raw_data)[:, 1]
new_preds = (new_probs >= thresh_loaded).astype(int)
Four non-negotiable rules: (1) Use imblearn's Pipeline, not sklearn's: it applies SMOTE only during fit() and skips it during predict() and predict_proba(), so validation and test data are never resampled; (2) Use StratifiedKFold for cross-validation, which preserves the class ratio in every fold; (3) Tune and save the threshold alongside the pipeline, because a model loaded without its threshold falls back to the default 0.5, which is rarely right for imbalanced data; (4) Evaluate with AUC-PR as the primary metric, not accuracy and not AUC-ROC alone, because it focuses directly on minority-class performance.
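Rule (2) can be checked directly. This is a small sketch, assuming the same 9,910 / 90 label counts used earlier and dummy features, that counts the minority examples landing in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Assumed labels: 9,910 negatives and 90 positives; features are placeholders.
y = np.array([0] * 9910 + [1] * 90)
X = np.zeros((len(y), 1))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_positives = [int(y[val_idx].sum()) for _, val_idx in cv.split(X, y)]
fold_sizes = [len(val_idx) for _, val_idx in cv.split(X, y)]

print(fold_positives)  # [18, 18, 18, 18, 18]
print(fold_sizes)
```

With a plain KFold the minority count per fold would vary by chance, which makes fold-level recall and AUC-PR noisier.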
Golden Rules of Handling Imbalanced Data
The payment company's fraud model that caught zero frauds was not a model failure — it was a problem formulation failure. It was optimised for the wrong objective on the wrong metric with the wrong data distribution. Handling imbalanced data is not a preprocessing trick — it is a fundamental re-orientation of the entire modelling workflow: choose the right metrics first, then choose the right treatment, then build the pipeline in the right order. The model is the last thing you change. Everything before it determines whether the model even has a chance to learn what matters.