Gradient Boosting

Section 01

The Story That Explains Gradient Boosting

📖 Real World Analogy

The Student Who Learns Only From Mistakes

Imagine a student sitting an exam in three rounds. In Round 1, she scores 60/100. Her teacher doesn't throw away that answer sheet — instead, the teacher circles only the questions she got wrong and hands them back. In Round 2, the student studies only those wrong answers. She improves. In Round 3, she studies the remaining errors.

After ten rounds — each focused exclusively on the previous round's mistakes — she scores 98/100. The teacher never re-taught everything from scratch. She only ever corrected what was wrong.

That is the entire logic of Gradient Boosting. Each new tree doesn't learn the original target — it learns the residual errors (mistakes) left behind by all the previous trees combined. Add them all up, and you get a remarkably accurate model.

Gradient Boosting is a sequential ensemble method that builds a series of weak decision trees — typically shallow — where each tree is trained to correct the cumulative prediction error of all previous trees. Unlike Random Forest (which builds trees independently and in parallel), Gradient Boosting is inherently sequential — each tree depends on what came before it.

🌲

Why "Gradient"?

The word gradient refers to the mathematical gradient of a loss function — the direction in which errors are largest. Each new tree is fit to the negative gradient of the loss, which for squared-error regression is simply the residuals. The method generalises to any differentiable loss function, which gives it tremendous flexibility.

Section 02

Boosting vs Bagging — Two Different Philosophies

Both boosting and bagging are ensemble methods — they combine many weak models. But how they combine them is fundamentally different.

🎲 Bagging (e.g. Random Forest)

Property	Detail
Tree order	Parallel — all trees built independently
Each tree trains on	Random bootstrap sample of original data
Error focus	No — each tree sees the full target
Depth	Deep trees (high variance, low bias)
Reduces	Variance only
Parallelisable	✓ Yes — trivially

🎯 Boosting (Gradient Boosting)

Property	Detail
Tree order	Sequential — each tree follows the last
Each tree trains on	Residuals / pseudo-residuals of previous trees
Error focus	Yes — each tree corrects prior mistakes
Depth	Shallow trees (depth 3–5)
Reduces	Both bias AND variance
Parallelisable	✗ No — sequential by design

Diagram Bias vs variance — what each algorithm reduces

Random Forest starts top-right (scattered but centred) and averaging pushes it left. Gradient Boosting targets the top-left directly by iteratively correcting bias through sequential tree fitting.

🔐

The Key Insight

Random Forest reduces variance by averaging diverse trees. Gradient Boosting reduces bias by iteratively correcting errors. This is why GB can fit complex patterns that Random Forest misses — and also why it is more prone to overfitting if not carefully tuned.

Section 03

The Algorithm — Step by Step

Let's walk through the full algorithm using a regression problem — predicting house prices. Each step is precise but explained in plain English.

Initialise with a Constant Prediction

The model starts with the simplest possible prediction: the mean of the target for regression, or log-odds for classification. For house prices with a mean of £280,000, every house is initially predicted at £280,000. This is F₀(x) = ȳ.

Compute Residuals (Pseudo-Residuals)

For each training row, compute the residual error: rᵢ = yᵢ − F(xᵢ). A house worth £400,000 has residual +£120,000. One worth £180,000 has residual −£100,000. These are the "mistakes" the current model is making.

Fit a Shallow Tree to the Residuals

Train a new decision tree — typically depth 3 to 5 — with the residuals as the new target, not the original house prices. This tree learns "how wrong was the current model for this house?" — not the price itself.

Add the Tree to the Ensemble (with Shrinkage)

The new tree's predictions are multiplied by the learning rate η (e.g. 0.1) before being added: F₁(x) = F₀(x) + η · h₁(x). The learning rate prevents any single tree from dominating.

Recompute Residuals on the Updated Model

With updated ensemble F₁(x), compute new residuals. The errors are now smaller. The next tree trains on these new, smaller residuals.

Repeat for M Trees (the Full Ensemble)

Repeat steps 2–5 for M iterations. The final model is: F(x) = F₀ + η·h₁ + η·h₂ + … + η·hₘ. Residuals shrink with every round until they converge or the stop criterion is reached.

Diagram Residual error shrinks with every boosting round

ⓘ For a house worth £400k, starting at the £280k mean prediction, the residual error is progressively chopped down each round until the ensemble converges on the true value.

🏅

Concrete Numeric Example

Suppose one house is worth £400,000.
Round 0 — constant prediction: £280,000 → residual: +£120,000
Round 1 — tree predicts residual £110,000 → new prediction: 280k + 0.1×110k = £291,000
Round 2 — tree predicts residual £100,000 → new prediction: 291k + 0.1×100k = £301,000
… After 100 rounds the prediction approaches £400,000. Each round takes a small, precise step toward the truth.

Section 04

The Mathematics — What "Gradient" Actually Means

Gradient Boosting is rooted in functional gradient descent. Instead of updating model parameters with gradients (like in neural networks), it updates the prediction function itself in the direction that reduces loss the most.

Loss Function (Regression)

L(y, F) = ½ (y − F)²

Squared Error. Residuals are rᵢ = yᵢ − F(xᵢ). Each tree directly fits these residuals.

Negative Gradient (General)

rᵢ = − [∂L / ∂F(xᵢ)]

For any loss L, the pseudo-residual is the negative gradient of L w.r.t. the current prediction. This is what the next tree learns.

Model Update Rule

Fₘ(x) = Fₘ₋₁(x) + η · hₘ(x)

η = learning rate (shrinkage). hₘ = m-th tree trained on pseudo-residuals. F grows with each step.

Final Ensemble

F(x) = F₀ + Σ η · hₘ(x)

The final model is the sum of the initial constant plus all M scaled tree predictions. No weighting or voting — just summation.

🔬

Why "Pseudo-Residuals" and Not Just "Residuals"?

For squared-error loss, the negative gradient is exactly the raw residual: rᵢ = yᵢ − F(xᵢ). Simple. But for other loss functions (log-loss for classification, absolute error for regression), the negative gradient is not the raw residual — it is something related but mathematically derived. The name pseudo-residuals covers all cases generically.

Section 05

Visual Diagram — Sequential Boosting Process

📊 How Gradient Boosting Builds Sequentially

ⓘ Each box in the top row is a cumulative ensemble model. Each bottom box is a shallow tree trained on current residuals. Arrows show the flow of errors and corrections.

Section 06

Loss Functions — The Heart of Flexibility

One of Gradient Boosting's greatest strengths is that it works with any differentiable loss function. The loss function determines what kind of "mistake" the model measures and minimises.

📈

Squared Error (MSE)

loss='squared_error'

Default for regression. Residuals are exactly yᵢ − Fᵢ. Sensitive to outliers because large errors are squared and penalised more heavily.

💷

Absolute Error (MAE)

loss='absolute_error'

Pseudo-residuals are the sign of the error: ±1. Robust to outliers — extreme values are not amplified. Predictions converge to the conditional median.

💡

Huber Loss

loss='huber'

Best of both worlds. Squared for small errors (like MSE), linear for large errors (like MAE). Controlled by δ. Robust without losing efficiency on clean data.

📋

Log-Loss (Deviance)

loss='log_loss'

Used for binary and multiclass classification. Pseudo-residuals are yᵢ − p̂ᵢ where p̂ is the current predicted probability.

📊

Exponential Loss

loss='exponential'

Equivalent to AdaBoost — the original boosting algorithm. Very sensitive to outliers; log-loss is almost always preferred in practice.

🔄

Quantile Loss

loss='quantile'

Enables prediction intervals — train two models (e.g. 10th and 90th percentile) to get a confidence range. Essential for uncertainty-aware forecasting.

Interactive Loss function shapes — click to compare

Squares the error — large mistakes are penalised heavily. A 2× bigger error causes 4× bigger loss. Sensitive to outliers because the curve steepens rapidly.

🎯

Which Loss Should You Use?

Regression with clean data: squared_error. Regression with outliers: huber or absolute_error. Binary classification: log_loss (default). Prediction intervals: quantile with two separate models.

Section 07

Key Hyperparameters — The Control Panel

🍎 Family 1 — Ensemble Structure

Parameter	Default	What It Controls	Practical Advice
`n_estimators`	100	Number of trees (boosting rounds)	Set high (300–3000); use early stopping to find the sweet spot.
`learning_rate` (η)	0.1	Shrinkage applied to each tree	Lower = better generalisation, but needs more trees. Best range: 0.01–0.1.
`loss`	varies	Objective function being minimised	Match to your task: squared_error, log_loss, huber, quantile.

🌵 Family 2 — Tree Structure (Complexity)

Parameter	Default	What It Controls	Practical Advice
`max_depth`	3	Depth of each individual tree	Keep shallow! 3–5 is ideal. Depth > 6 rarely helps and often overfits.
`min_samples_leaf`	1	Min samples at a leaf node	Increase to 5–20 on noisy or small datasets to regularise.
`max_features`	None	Features considered per split	Setting to 'sqrt' or 0.3–0.7 adds randomness and reduces overfitting.
`max_leaf_nodes`	None	Maximum number of leaves	Alternative to max_depth. Try 8–64.

🎗 Family 3 — Stochasticity (Randomness)

Parameter	Default	What It Controls	Practical Advice
`subsample`	1.0	Fraction of rows sampled per tree	Set 0.6–0.8 to enable Stochastic GB — reduces overfitting and variance.
`validation_fraction`	0.1	Fraction of train set used for early stopping	Use with n_iter_no_change to auto-stop before overfitting.
`n_iter_no_change`	None	Early stopping patience	Set to 10–20. Training halts if validation score doesn't improve.

Section 08

Shrinkage — The Single Most Important Tuning Decision

📖 Story

The Surgeon's Steady Hand

Two surgeons are correcting a 5 mm spinal misalignment. Surgeon A makes one bold 5 mm correction. Surgeon B makes fifty 0.1 mm corrections, checking after each one.

Surgeon A is faster — but if she overshoots by even 0.5 mm, she causes new damage. Surgeon B takes longer but always knows where she is and never overshoots. In complex systems, precision beats speed.

The learning rate is Surgeon B's calibration: small steps, constant correction, fewer catastrophic errors.

➕

More Trees Needed

n_estimators ↑

With η = 0.01, you need 10× as many trees as with η = 0.1 to achieve the same total correction. Training takes longer.

⚠ Higher compute cost

🌍

Better Generalisation

test accuracy ↑

Lower η forces the model to learn more gradually, preventing it from overfitting to the noise of any single tree. Almost always improves held-out performance.

✓ Lower generalisation error

⚔️

The Gold Rule

η × n_estimators ≈ constant

Halving the learning rate and doubling n_estimators produces roughly the same accuracy — but with better smoothing. η = 0.05 with 1000 trees beats η = 0.1 with 500 trees.

✓ Competition-proven strategy

Interactive Learning rate vs convergence — drag the sliders

Highlight η 0.10

Trees shown 20

Low η (green) takes more trees to reach the true value but overshoots less and generalises better. Very high η (red) converges fast but oscillates. All five rates are shown simultaneously — drag to explore the trade-off.

💡

The Friedman (2002) Finding

Jerome Friedman showed empirically that η < 0.1 combined with large n_estimators consistently outperforms higher learning rates. Set η = 0.05 or lower, then use early stopping to automatically find the right n_estimators. Never manually search n_estimators — let the data tell you when to stop.

Section 09

Stochastic Gradient Boosting — Adding Randomness

Instead of using the full training set for each tree, use a random subsample of rows (without replacement). This is controlled by the subsample parameter. Setting it below 1.0 creates Stochastic Gradient Boosting.

🔴 Standard GB (subsample=1.0)

Property	Effect
Data used per tree	100% of training set
Deterministic	Yes (with fixed seed)
Overfitting risk	Higher on noisy data
Training speed	Slower (more data per tree)
Bias	Lower
Variance	Higher

🟢 Stochastic GB (subsample=0.7)

Property	Effect
Data used per tree	70% of training set (random)
Deterministic	No — introduces randomness
Overfitting risk	Lower — diversity from sampling
Training speed	Faster per tree (30% less data)
Bias	Slightly higher
Variance	Lower

Section 10

Overfitting and Regularisation in Gradient Boosting

🎗 Regularisation Levers — Ranked by Impact

1st

Learning Rate (η) — Reduce η (try 0.01–0.05). Single biggest regulariser. Always pair with early stopping.

2nd

max_depth — Keep trees shallow (3–5). Deep trees memorise noise. Depth 3 is a very common sweet spot.

3rd

subsample — Set 0.6–0.8 for Stochastic GB. Row subsampling injects beneficial randomness and reduces variance.

4th

max_features — Subsample features per split (try 'sqrt' or 0.5). Column subsampling mirrors what Random Forest does.

5th

min_samples_leaf — Increase to 5–20 on noisy data. Forces each leaf to represent a meaningful data pattern.

6th

Early Stopping — Set n_iter_no_change=20 and validation_fraction=0.1. Automatically halts before overfitting. Free regularisation.

⚠️

The Overfitting Trap

Unlike Random Forest (where adding more trees never hurts), adding more trees to Gradient Boosting will eventually overfit if the learning rate is not low enough. Always track both training loss and validation loss. If training loss keeps falling but validation loss plateaus or rises — you are overfitting. Enable early stopping immediately.

Section 11

Python — sklearn GradientBoostingClassifier

🛠️ Full Classification Pipeline: Credit Card Fraud Detection

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=8000, n_features=20,
    n_informative=12, n_redundant=4,
    weights=[0.95, 0.05],   # 95% legit, 5% fraud
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

gb = GradientBoostingClassifier(
    n_estimators=500,          # many trees — early stopping will prune
    learning_rate=0.05,        # small steps, better generalisation
    max_depth=4,               # shallow trees prevent overfitting
    min_samples_leaf=10,       # regularise leaf size
    subsample=0.75,            # stochastic GB: 75% row sample per tree
    max_features=0.5,          # 50% of features per split
    loss='log_loss',           # binary classification
    validation_fraction=0.1,   # 10% for early stopping
    n_iter_no_change=20,       # stop if no improvement for 20 rounds
    tol=1e-4,
    random_state=42
)
gb.fit(X_train, y_train)
print(f"Trees used (early stopping): {gb.n_estimators_}")
print(f"Test AUC-ROC: {roc_auc_score(y_test, gb.predict_proba(X_test)[:,1]):.4f}")
print(classification_report(y_test, gb.predict(X_test), digits=4))

OUTPUT

Trees used (early stopping): 312 Test AUC-ROC: 0.9741 precision recall f1-score support 0 0.9883 0.9921 0.9902 1523 1 0.8421 0.7812 0.8105 77 accuracy 0.9817 1600

📈 Regression Example: Predicting Energy Consumption

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_absolute_error, r2_score

housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
gb_reg = GradientBoostingRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    max_features='sqrt',
    loss='huber',              # robust to outlier prices
    alpha=0.9,
    validation_fraction=0.1,
    n_iter_no_change=25,
    random_state=42
)
gb_reg.fit(X_train, y_train)
y_pred = gb_reg.predict(X_test)
print(f"R² Score : {r2_score(y_test, y_pred):.4f}")
print(f"MAE      : ${mean_absolute_error(y_test, y_pred) * 100_000:.0f}")
print(f"Trees    : {gb_reg.n_estimators_} (early stopped)")

OUTPUT

R² Score : 0.8641 MAE : $26,134 Trees : 487 (early stopped)

Section 12

LightGBM — Faster, Scalable Gradient Boosting

LightGBM is Microsoft's production-grade implementation. It is the same sequential residual-fitting logic, but with two critical engineering innovations that make it dramatically faster on large datasets.

🔴 Standard GB Tree Growth

Property	Detail
Tree growth strategy	Level-wise (all nodes at depth d before d+1)
Split finding	Exact: evaluates every value of every feature
Speed on 1M rows	Slow — O(n × features) per split
Memory	High
Categorical features	Requires manual encoding

🟢 LightGBM Tree Growth

Property	Detail
Tree growth strategy	Leaf-wise: always splits the leaf with max gain
Split finding	Histogram-based: bins features into 256 buckets
Speed on 1M rows	10–100× faster than sklearn GB
Memory	Low (histogram compression)
Categorical features	Native support — no encoding needed

import lightgbm as lgb
from sklearn.metrics import roc_auc_score, classification_report

train_data = lgb.Dataset(X_train, label=y_train)
val_data   = lgb.Dataset(X_test,  label=y_test, reference=train_data)

params = {
    'objective':         'binary',
    'metric':            'auc',
    'learning_rate':     0.05,
    'num_leaves':        31,         # primary complexity control for LGB
    'max_depth':         -1,         # -1 = no limit; num_leaves controls it
    'min_child_samples': 20,
    'feature_fraction':  0.7,
    'bagging_fraction':  0.8,
    'bagging_freq':      5,
    'lambda_l1':         0.1,
    'lambda_l2':         1.0,
    'verbosity':         -1,
    'random_state':      42
}
callbacks = [lgb.early_stopping(stopping_rounds=50), lgb.log_evaluation(period=100)]
model = lgb.train(params, train_data, num_boost_round=2000,
                   valid_sets=[val_data], callbacks=callbacks)

y_prob = model.predict(X_test)
print(f"Best iteration : {model.best_iteration}")
print(f"AUC-ROC        : {roc_auc_score(y_test, y_prob):.4f}")

OUTPUT

Best iteration : 347 AUC-ROC : 0.9741 precision recall f1-score support 0 0.9401 0.9387 0.9394 5002 1 0.9387 0.9401 0.9394 4998 accuracy 0.9394 10000

Section 13

Feature Importance in Gradient Boosting

📈 Gain (Impurity Reduction)

importance_type='gain'

Measures the total improvement in loss each feature provides across all splits. The most informative measure — it tells you how much a feature actually contributes to good predictions.

✓ Reflects predictive value directly

✗ Can favour high-cardinality features

🔄 Split Count (Frequency)

importance_type='split'

Counts how many times each feature is used as a split across all trees. Easy to compute but biased toward continuous features — they can be split many times with little predictive value.

✓ Fast and easy to compute

✗ Misleading for high-cardinality features

🎓 SHAP Values

shap.TreeExplainer

The gold standard for interpretability. Shows exactly how much each feature pushes the prediction up or down for a specific row. Mathematically grounded in game theory (Shapley values).

✓ Local + global, theoretically sound

✗ Slower to compute

Section 14

Gradient Boosting vs Random Forest — Full Comparison

Property	Gradient Boosting	Random Forest
Core idea	Sequential error correction	Parallel variance averaging
What it reduces	Bias + Variance	Variance only
Tree depth	Shallow (3–5) — intentionally weak	Deep — intentionally complex
Parallelisable	No — sequential by definition	Yes — trivially parallel
Overfitting risk	High — must tune carefully	Low — bagging is self-regularising
Hyperparameter sensitivity	High — many knobs matter	Low — good defaults out of the box
Peak accuracy (tabular)	Highest in practice	Very good but rarely highest
Feature scaling needed	No	No
Best use case	Maximising accuracy, competitions, production	Fast baseline, robust default, interpretability

💬

The Practitioner Decision Tree

Start with Random Forest. If it doesn't meet your accuracy target after basic tuning, move to Gradient Boosting. If your dataset has >100,000 rows, go directly to LightGBM — sklearn's GB is too slow at scale.

Section 15

Golden Rules

🌳 Gradient Boosting — Non-Negotiable Rules

Always use early stopping. Set n_iter_no_change=20 and validation_fraction=0.1 in sklearn, or early_stopping_rounds=50 in LightGBM. Never pick n_estimators by hand — it is the job of early stopping.

Start with a low learning rate (0.05 or lower). The single most reliable way to improve a GB model. η = 0.01 with 3,000 trees beats η = 0.1 with 300 trees on almost every real-world dataset.

Keep trees shallow. max_depth=3 is a robust default. Rarely go deeper than 6. Deep trees in a boosting context memorise noise — the very opposite of what you want.

Enable Stochastic Gradient Boosting. Set subsample=0.7–0.8 as your default. Row subsampling adds healthy randomness, reduces variance, speeds up training, and almost always improves validation accuracy on real-world noisy data.

Use LightGBM for datasets over 50,000 rows. Sklearn's GradientBoostingClassifier is excellent for learning and small datasets but too slow for production-scale data. LightGBM delivers the same or better accuracy at 10–100× the speed.

Tune in the right order. Fix η → tune max_depth and min_samples_leaf → tune subsample and max_features → then lower η further. Searching all parameters simultaneously wastes compute and introduces unnecessary interaction noise.

Match your loss function to your metric. If you optimise squared error but evaluate on MAE, you are optimising the wrong objective. Ensure loss is always consistent with your evaluation metric.

Feature scaling is never required. Tree-based splits are scale-invariant — the magnitude of a feature value is irrelevant. Do not apply StandardScaler or MinMaxScaler. It changes nothing and wastes preprocessing time.