Machine Learning 📂 Supervised Learning · 19 of 20 54 min read

Gradient Boosting

A deep-dive into Gradient Boosting from first principles, covering how trees are built sequentially, loss functions, regularisation, and every major hyperparameter tuning strategy: Grid Search, Random Search, Optuna, and Bayesian Optimisation — with working Python code, stories, and visual diagrams.

Section 01

The Story That Explains Gradient Boosting

The Student Who Learns Only From Mistakes
Imagine a student sitting an exam in three rounds. In Round 1, she scores 60/100. Her teacher doesn't throw away that answer sheet — instead, the teacher circles only the questions she got wrong and hands them back. In Round 2, the student studies only those wrong answers. She improves. In Round 3, she studies the remaining errors.

After ten rounds — each focused exclusively on the previous round's mistakes — she scores 98/100. The teacher never re-taught everything from scratch. She only ever corrected what was wrong.

That is the entire logic of Gradient Boosting. Each new tree doesn't learn the original target — it learns the residual errors (mistakes) left behind by all the previous trees combined. Add them all up, and you get a remarkably accurate model.

Gradient Boosting is a sequential ensemble method that builds a series of weak decision trees — typically shallow — where each tree is trained to correct the cumulative prediction error of all previous trees. Unlike Random Forest (which builds trees independently and in parallel), Gradient Boosting is inherently sequential — each tree depends on what came before it.

🌲
Why "Gradient"?

The word gradient refers to the mathematical gradient of a loss function — the direction in which errors are largest. Each new tree is fit to the negative gradient of the loss, which for squared-error regression is simply the residuals. The method generalises to any differentiable loss function, which gives it tremendous flexibility.


Section 02

Boosting vs Bagging — Two Different Philosophies

Both boosting and bagging are ensemble methods — they combine many weak models. But how they combine them is fundamentally different.

🎲 Bagging (e.g. Random Forest)
PropertyDetail
Tree orderParallel — all trees built independently
Each tree trains onRandom bootstrap sample of original data
Error focusNo — each tree sees the full target
DepthDeep trees (high variance, low bias)
ReducesVariance only
Parallelisable✓ Yes — trivially
🎯 Boosting (Gradient Boosting)
PropertyDetail
Tree orderSequential — each tree follows the last
Each tree trains onResiduals / pseudo-residuals of previous trees
Error focusYes — each tree corrects prior mistakes
DepthShallow trees (depth 3–5)
ReducesBoth bias AND variance
Parallelisable✗ No — sequential by design
Diagram Bias vs variance — what each algorithm reduces
Bias and variance bullseye diagram Four bullseye targets showing how high/low bias and variance affect predictions, and where each algorithm sits on the spectrum. Low variance High variance Low bias High bias Accurate & consistent — ideal Gradient Boosting Scattered but centred — overfitting Random Forest (raw) averaging reduces variance Consistent but wrong — underfitting Wrong & inconsistent — worst case

Random Forest starts top-right (scattered but centred) and averaging pushes it left. Gradient Boosting targets the top-left directly by iteratively correcting bias through sequential tree fitting.

🔐
The Key Insight

Random Forest reduces variance by averaging diverse trees. Gradient Boosting reduces bias by iteratively correcting errors. This is why GB can fit complex patterns that Random Forest misses — and also why it is more prone to overfitting if not carefully tuned.


Section 03

The Algorithm — Step by Step

Let's walk through the full algorithm using a regression problem — predicting house prices. Each step is precise but explained in plain English.

01
Initialise with a Constant Prediction
The model starts with the simplest possible prediction: the mean of the target for regression, or log-odds for classification. For house prices with a mean of £280,000, every house is initially predicted at £280,000. This is F₀(x) = ȳ.
02
Compute Residuals (Pseudo-Residuals)
For each training row, compute the residual error: rᵢ = yᵢ − F(xᵢ). A house worth £400,000 has residual +£120,000. One worth £180,000 has residual −£100,000. These are the "mistakes" the current model is making.
03
Fit a Shallow Tree to the Residuals
Train a new decision tree — typically depth 3 to 5 — with the residuals as the new target, not the original house prices. This tree learns "how wrong was the current model for this house?" — not the price itself.
04
Add the Tree to the Ensemble (with Shrinkage)
The new tree's predictions are multiplied by the learning rate η (e.g. 0.1) before being added: F₁(x) = F₀(x) + η · h₁(x). The learning rate prevents any single tree from dominating.
05
Recompute Residuals on the Updated Model
With updated ensemble F₁(x), compute new residuals. The errors are now smaller. The next tree trains on these new, smaller residuals.
06
Repeat for M Trees (the Full Ensemble)
Repeat steps 2–5 for M iterations. The final model is: F(x) = F₀ + η·h₁ + η·h₂ + … + η·hₘ. Residuals shrink with every round until they converge or the stop criterion is reached.
Diagram Residual error shrinks with every boosting round
Residual shrinkage across six boosting rounds Bar chart showing the residual error for a single house prediction collapsing from £120k to £18k over six rounds as each tree corrects the previous ensemble's mistakes. Residual error (£k) 0 30 60 90 £120k £99k £79k £58k £36k £18k Round 0 (F₀) Round 1 (+h₁) Round 2 (+h₂) Round 3 (+h₃) Round 4 (+h₄) Round 5 (+h₅) Boosting rounds — one shallow tree added per round

ⓘ For a house worth £400k, starting at the £280k mean prediction, the residual error is progressively chopped down each round until the ensemble converges on the true value.

🏅
Concrete Numeric Example

Suppose one house is worth £400,000.
Round 0 — constant prediction: £280,000 → residual: +£120,000
Round 1 — tree predicts residual £110,000 → new prediction: 280k + 0.1×110k = £291,000
Round 2 — tree predicts residual £100,000 → new prediction: 291k + 0.1×100k = £301,000
… After 100 rounds the prediction approaches £400,000. Each round takes a small, precise step toward the truth.


Section 04

The Mathematics — What "Gradient" Actually Means

Gradient Boosting is rooted in functional gradient descent. Instead of updating model parameters with gradients (like in neural networks), it updates the prediction function itself in the direction that reduces loss the most.

Loss Function (Regression)
L(y, F) = ½ (y − F)²
Squared Error. Residuals are rᵢ = yᵢ − F(xᵢ). Each tree directly fits these residuals.
Negative Gradient (General)
rᵢ = − [∂L / ∂F(xᵢ)]
For any loss L, the pseudo-residual is the negative gradient of L w.r.t. the current prediction. This is what the next tree learns.
Model Update Rule
Fₘ(x) = Fₘ₋₁(x) + η · hₘ(x)
η = learning rate (shrinkage). hₘ = m-th tree trained on pseudo-residuals. F grows with each step.
Final Ensemble
F(x) = F₀ + Σ η · hₘ(x)
The final model is the sum of the initial constant plus all M scaled tree predictions. No weighting or voting — just summation.
🔬
Why "Pseudo-Residuals" and Not Just "Residuals"?

For squared-error loss, the negative gradient is exactly the raw residual: rᵢ = yᵢ − F(xᵢ). Simple. But for other loss functions (log-loss for classification, absolute error for regression), the negative gradient is not the raw residual — it is something related but mathematically derived. The name pseudo-residuals covers all cases generically.


Section 05

Visual Diagram — Sequential Boosting Process

📊 How Gradient Boosting Builds Sequentially
TRAINING DATASET y = target F₀(x) Mean prediction e.g. £280,000 residuals r₁ r = y − F₀ e.g. +£120,000 Tree h₁ fit to r₁ depth 3–5 F₁(x) F₀ + η·h₁ £291,000 residuals r₂ F₂(x) F₁ + η·h₂ £301,000 r = y − F₁ smaller errors Tree h₂ fit to r₂ depth 3–5 ··· F(x) = F₀ + Σ η · hₘ FINAL MODEL ≈ £400,000 ✓ η = LEARNING RATE (e.g. 0.1) Each tree's contribution is shrunk by η before adding

ⓘ Each box in the top row is a cumulative ensemble model. Each bottom box is a shallow tree trained on current residuals. Arrows show the flow of errors and corrections.


Section 06

Loss Functions — The Heart of Flexibility

One of Gradient Boosting's greatest strengths is that it works with any differentiable loss function. The loss function determines what kind of "mistake" the model measures and minimises.

📈
Squared Error (MSE)
loss='squared_error'
Default for regression. Residuals are exactly yᵢ − Fᵢ. Sensitive to outliers because large errors are squared and penalised more heavily.
💷
Absolute Error (MAE)
loss='absolute_error'
Pseudo-residuals are the sign of the error: ±1. Robust to outliers — extreme values are not amplified. Predictions converge to the conditional median.
💡
Huber Loss
loss='huber'
Best of both worlds. Squared for small errors (like MSE), linear for large errors (like MAE). Controlled by δ. Robust without losing efficiency on clean data.
📋
Log-Loss (Deviance)
loss='log_loss'
Used for binary and multiclass classification. Pseudo-residuals are yᵢ − p̂ᵢ where p̂ is the current predicted probability.
📊
Exponential Loss
loss='exponential'
Equivalent to AdaBoost — the original boosting algorithm. Very sensitive to outliers; log-loss is almost always preferred in practice.
🔄
Quantile Loss
loss='quantile'
Enables prediction intervals — train two models (e.g. 10th and 90th percentile) to get a confidence range. Essential for uncertainty-aware forecasting.
Interactive Loss function shapes — click to compare

Squares the error — large mistakes are penalised heavily. A 2× bigger error causes 4× bigger loss. Sensitive to outliers because the curve steepens rapidly.

🎯
Which Loss Should You Use?

Regression with clean data: squared_error. Regression with outliers: huber or absolute_error. Binary classification: log_loss (default). Prediction intervals: quantile with two separate models.


Section 07

Key Hyperparameters — The Control Panel

🍎 Family 1 — Ensemble Structure

ParameterDefaultWhat It ControlsPractical Advice
n_estimators100Number of trees (boosting rounds)Set high (300–3000); use early stopping to find the sweet spot.
learning_rate (η)0.1Shrinkage applied to each treeLower = better generalisation, but needs more trees. Best range: 0.01–0.1.
lossvariesObjective function being minimisedMatch to your task: squared_error, log_loss, huber, quantile.

🌵 Family 2 — Tree Structure (Complexity)

ParameterDefaultWhat It ControlsPractical Advice
max_depth3Depth of each individual treeKeep shallow! 3–5 is ideal. Depth > 6 rarely helps and often overfits.
min_samples_leaf1Min samples at a leaf nodeIncrease to 5–20 on noisy or small datasets to regularise.
max_featuresNoneFeatures considered per splitSetting to 'sqrt' or 0.3–0.7 adds randomness and reduces overfitting.
max_leaf_nodesNoneMaximum number of leavesAlternative to max_depth. Try 8–64.

🎗 Family 3 — Stochasticity (Randomness)

ParameterDefaultWhat It ControlsPractical Advice
subsample1.0Fraction of rows sampled per treeSet 0.6–0.8 to enable Stochastic GB — reduces overfitting and variance.
validation_fraction0.1Fraction of train set used for early stoppingUse with n_iter_no_change to auto-stop before overfitting.
n_iter_no_changeNoneEarly stopping patienceSet to 10–20. Training halts if validation score doesn't improve.

Section 08

Shrinkage — The Single Most Important Tuning Decision

The Surgeon's Steady Hand
Two surgeons are correcting a 5 mm spinal misalignment. Surgeon A makes one bold 5 mm correction. Surgeon B makes fifty 0.1 mm corrections, checking after each one.

Surgeon A is faster — but if she overshoots by even 0.5 mm, she causes new damage. Surgeon B takes longer but always knows where she is and never overshoots. In complex systems, precision beats speed.

The learning rate is Surgeon B's calibration: small steps, constant correction, fewer catastrophic errors.
More Trees Needed
n_estimators ↑
With η = 0.01, you need 10× as many trees as with η = 0.1 to achieve the same total correction. Training takes longer.
⚠ Higher compute cost
🌍
Better Generalisation
test accuracy ↑
Lower η forces the model to learn more gradually, preventing it from overfitting to the noise of any single tree. Almost always improves held-out performance.
✓ Lower generalisation error
⚔️
The Gold Rule
η × n_estimators ≈ constant
Halving the learning rate and doubling n_estimators produces roughly the same accuracy — but with better smoothing. η = 0.05 with 1000 trees beats η = 0.1 with 500 trees.
✓ Competition-proven strategy
Interactive Learning rate vs convergence — drag the sliders
Highlight η 0.10
Trees shown 20

Low η (green) takes more trees to reach the true value but overshoots less and generalises better. Very high η (red) converges fast but oscillates. All five rates are shown simultaneously — drag to explore the trade-off.

💡
The Friedman (2002) Finding

Jerome Friedman showed empirically that η < 0.1 combined with large n_estimators consistently outperforms higher learning rates. Set η = 0.05 or lower, then use early stopping to automatically find the right n_estimators. Never manually search n_estimators — let the data tell you when to stop.


Section 09

Stochastic Gradient Boosting — Adding Randomness

Instead of using the full training set for each tree, use a random subsample of rows (without replacement). This is controlled by the subsample parameter. Setting it below 1.0 creates Stochastic Gradient Boosting.

🔴 Standard GB (subsample=1.0)
PropertyEffect
Data used per tree100% of training set
DeterministicYes (with fixed seed)
Overfitting riskHigher on noisy data
Training speedSlower (more data per tree)
BiasLower
VarianceHigher
🟢 Stochastic GB (subsample=0.7)
PropertyEffect
Data used per tree70% of training set (random)
DeterministicNo — introduces randomness
Overfitting riskLower — diversity from sampling
Training speedFaster per tree (30% less data)
BiasSlightly higher
VarianceLower

Section 10

Overfitting and Regularisation in Gradient Boosting

🎗 Regularisation Levers — Ranked by Impact
1st
Learning Rate (η) — Reduce η (try 0.01–0.05). Single biggest regulariser. Always pair with early stopping.
2nd
max_depth — Keep trees shallow (3–5). Deep trees memorise noise. Depth 3 is a very common sweet spot.
3rd
subsample — Set 0.6–0.8 for Stochastic GB. Row subsampling injects beneficial randomness and reduces variance.
4th
max_features — Subsample features per split (try 'sqrt' or 0.5). Column subsampling mirrors what Random Forest does.
5th
min_samples_leaf — Increase to 5–20 on noisy data. Forces each leaf to represent a meaningful data pattern.
6th
Early Stopping — Set n_iter_no_change=20 and validation_fraction=0.1. Automatically halts before overfitting. Free regularisation.
⚠️
The Overfitting Trap

Unlike Random Forest (where adding more trees never hurts), adding more trees to Gradient Boosting will eventually overfit if the learning rate is not low enough. Always track both training loss and validation loss. If training loss keeps falling but validation loss plateaus or rises — you are overfitting. Enable early stopping immediately.


Section 11

Python — sklearn GradientBoostingClassifier

🛠️ Full Classification Pipeline: Credit Card Fraud Detection

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=8000, n_features=20,
    n_informative=12, n_redundant=4,
    weights=[0.95, 0.05],   # 95% legit, 5% fraud
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

gb = GradientBoostingClassifier(
    n_estimators=500,          # many trees — early stopping will prune
    learning_rate=0.05,        # small steps, better generalisation
    max_depth=4,               # shallow trees prevent overfitting
    min_samples_leaf=10,       # regularise leaf size
    subsample=0.75,            # stochastic GB: 75% row sample per tree
    max_features=0.5,          # 50% of features per split
    loss='log_loss',           # binary classification
    validation_fraction=0.1,   # 10% for early stopping
    n_iter_no_change=20,       # stop if no improvement for 20 rounds
    tol=1e-4,
    random_state=42
)
gb.fit(X_train, y_train)
print(f"Trees used (early stopping): {gb.n_estimators_}")
print(f"Test AUC-ROC: {roc_auc_score(y_test, gb.predict_proba(X_test)[:,1]):.4f}")
print(classification_report(y_test, gb.predict(X_test), digits=4))
OUTPUT
Trees used (early stopping): 312 Test AUC-ROC: 0.9741 precision recall f1-score support 0 0.9883 0.9921 0.9902 1523 1 0.8421 0.7812 0.8105 77 accuracy 0.9817 1600

📈 Regression Example: Predicting Energy Consumption

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_absolute_error, r2_score

housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
gb_reg = GradientBoostingRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    max_features='sqrt',
    loss='huber',              # robust to outlier prices
    alpha=0.9,
    validation_fraction=0.1,
    n_iter_no_change=25,
    random_state=42
)
gb_reg.fit(X_train, y_train)
y_pred = gb_reg.predict(X_test)
print(f"R² Score : {r2_score(y_test, y_pred):.4f}")
print(f"MAE      : ${mean_absolute_error(y_test, y_pred) * 100_000:.0f}")
print(f"Trees    : {gb_reg.n_estimators_} (early stopped)")
OUTPUT
R² Score : 0.8641 MAE : $26,134 Trees : 487 (early stopped)

Section 12

LightGBM — Faster, Scalable Gradient Boosting

LightGBM is Microsoft's production-grade implementation. It is the same sequential residual-fitting logic, but with two critical engineering innovations that make it dramatically faster on large datasets.

🔴 Standard GB Tree Growth
PropertyDetail
Tree growth strategyLevel-wise (all nodes at depth d before d+1)
Split findingExact: evaluates every value of every feature
Speed on 1M rowsSlow — O(n × features) per split
MemoryHigh
Categorical featuresRequires manual encoding
🟢 LightGBM Tree Growth
PropertyDetail
Tree growth strategyLeaf-wise: always splits the leaf with max gain
Split findingHistogram-based: bins features into 256 buckets
Speed on 1M rows10–100× faster than sklearn GB
MemoryLow (histogram compression)
Categorical featuresNative support — no encoding needed
import lightgbm as lgb
from sklearn.metrics import roc_auc_score, classification_report

train_data = lgb.Dataset(X_train, label=y_train)
val_data   = lgb.Dataset(X_test,  label=y_test, reference=train_data)

params = {
    'objective':         'binary',
    'metric':            'auc',
    'learning_rate':     0.05,
    'num_leaves':        31,         # primary complexity control for LGB
    'max_depth':         -1,         # -1 = no limit; num_leaves controls it
    'min_child_samples': 20,
    'feature_fraction':  0.7,
    'bagging_fraction':  0.8,
    'bagging_freq':      5,
    'lambda_l1':         0.1,
    'lambda_l2':         1.0,
    'verbosity':         -1,
    'random_state':      42
}
callbacks = [lgb.early_stopping(stopping_rounds=50), lgb.log_evaluation(period=100)]
model = lgb.train(params, train_data, num_boost_round=2000,
                   valid_sets=[val_data], callbacks=callbacks)

y_prob = model.predict(X_test)
print(f"Best iteration : {model.best_iteration}")
print(f"AUC-ROC        : {roc_auc_score(y_test, y_prob):.4f}")
OUTPUT
Best iteration : 347 AUC-ROC : 0.9741 precision recall f1-score support 0 0.9401 0.9387 0.9394 5002 1 0.9387 0.9401 0.9394 4998 accuracy 0.9394 10000

Section 13

Feature Importance in Gradient Boosting

📈 Gain (Impurity Reduction)
importance_type='gain'
Measures the total improvement in loss each feature provides across all splits. The most informative measure — it tells you how much a feature actually contributes to good predictions.
✓ Reflects predictive value directly
✗ Can favour high-cardinality features
🔄 Split Count (Frequency)
importance_type='split'
Counts how many times each feature is used as a split across all trees. Easy to compute but biased toward continuous features — they can be split many times with little predictive value.
✓ Fast and easy to compute
✗ Misleading for high-cardinality features
🎓 SHAP Values
shap.TreeExplainer
The gold standard for interpretability. Shows exactly how much each feature pushes the prediction up or down for a specific row. Mathematically grounded in game theory (Shapley values).
✓ Local + global, theoretically sound
✗ Slower to compute

Section 14

Gradient Boosting vs Random Forest — Full Comparison

PropertyGradient BoostingRandom Forest
Core ideaSequential error correctionParallel variance averaging
What it reducesBias + VarianceVariance only
Tree depthShallow (3–5) — intentionally weakDeep — intentionally complex
ParallelisableNo — sequential by definitionYes — trivially parallel
Overfitting riskHigh — must tune carefullyLow — bagging is self-regularising
Hyperparameter sensitivityHigh — many knobs matterLow — good defaults out of the box
Peak accuracy (tabular)Highest in practiceVery good but rarely highest
Feature scaling neededNoNo
Best use caseMaximising accuracy, competitions, productionFast baseline, robust default, interpretability
💬
The Practitioner Decision Tree

Start with Random Forest. If it doesn't meet your accuracy target after basic tuning, move to Gradient Boosting. If your dataset has >100,000 rows, go directly to LightGBM — sklearn's GB is too slow at scale.


Section 15

Golden Rules

🌳 Gradient Boosting — Non-Negotiable Rules
1
Always use early stopping. Set n_iter_no_change=20 and validation_fraction=0.1 in sklearn, or early_stopping_rounds=50 in LightGBM. Never pick n_estimators by hand — it is the job of early stopping.
2
Start with a low learning rate (0.05 or lower). The single most reliable way to improve a GB model. η = 0.01 with 3,000 trees beats η = 0.1 with 300 trees on almost every real-world dataset.
3
Keep trees shallow. max_depth=3 is a robust default. Rarely go deeper than 6. Deep trees in a boosting context memorise noise — the very opposite of what you want.
4
Enable Stochastic Gradient Boosting. Set subsample=0.7–0.8 as your default. Row subsampling adds healthy randomness, reduces variance, speeds up training, and almost always improves validation accuracy on real-world noisy data.
5
Use LightGBM for datasets over 50,000 rows. Sklearn's GradientBoostingClassifier is excellent for learning and small datasets but too slow for production-scale data. LightGBM delivers the same or better accuracy at 10–100× the speed.
6
Tune in the right order. Fix η → tune max_depth and min_samples_leaf → tune subsample and max_features → then lower η further. Searching all parameters simultaneously wastes compute and introduces unnecessary interaction noise.
7
Match your loss function to your metric. If you optimise squared error but evaluate on MAE, you are optimising the wrong objective. Ensure loss is always consistent with your evaluation metric.
8
Feature scaling is never required. Tree-based splits are scale-invariant — the magnitude of a feature value is irrelevant. Do not apply StandardScaler or MinMaxScaler. It changes nothing and wastes preprocessing time.