Decision Trees & Rule Lists in XAI

Section 01

The Big Question — Why Does Explainability Matter?

📖 Real World Story

The Black Box That Ruined a Life

It's 2019. Maria applies for a bank loan. She has a steady job, no criminal record, and has never missed a bill payment in her life. The bank's AI system returns one word: Rejected. No reason. No appeal path. No explanation.

The loan officer shrugs: "It's the algorithm." Maria walks out furious and confused. She'll never know if the model penalised her postcode, her age, or some hidden proxy for race buried in the training data.

This is exactly why Explainable AI (XAI) exists — and why Decision Trees and Rule Lists have become the gold standard for interpretable machine learning.

Explainable AI (XAI) is the field of building machine learning systems whose predictions can be understood, audited, and trusted by human beings — not just measured. It sits at the intersection of data science, ethics, law, and cognitive psychology.

⚖️

The EU AI Act & GDPR "Right to Explanation"

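GDPR, in force since 2018, restricts decisions based solely on automated processing that significantly affect individuals (Article 22) and requires controllers to provide "meaningful information about the logic involved" (Articles 13–15). Recital 71 mentions a right to obtain an explanation, although how far this amounts to a binding "right to explanation" is still debated. The EU AI Act (2024) adds transparency and human-oversight obligations for high-risk AI systems. Interpretable models like Decision Trees aren't just nice to have: in regulated industries they are often the most straightforward way to meet these obligations.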

Concept	Meaning	Example
Interpretability	The model itself is understandable — you can read its logic	A 5-node decision tree printed on paper
Explainability	A post-hoc tool explains a black-box model's predictions	SHAP values on a neural network
Transparency	You know what data was used and how the model was trained	Model card with training details
Fairness	The model does not discriminate on protected attributes	Equal false positive rates across demographics

Section 02

The XAI Landscape — Where Decision Trees Live

XAI methods fall into two broad camps: models that are intrinsically interpretable (transparent by design), and post-hoc explanation methods that try to shed light on opaque models after training.

👁️

Intrinsic Interpretability

White-Box Models

The model IS the explanation. You can read the logic directly from the learned structure. Accuracy and interpretability are traded off explicitly.
✔ Decision Trees, Rule Lists, Linear Models, Scorecard Models

🔎

Post-hoc Explanations

Black-Box Explanations

Train any model, then apply an explanation method separately. The explanation approximates but may not faithfully represent the true model behaviour.
✔ SHAP, LIME, Integrated Gradients, Attention Maps

📊

Global vs. Local

Scope of Explanation

Global: Explains the overall model behaviour across all data.
Local: Explains one individual prediction only.
✔ Rule Lists are global; SHAP is often local

💡

Cynthia Rudin's Core Argument

Professor Cynthia Rudin (Duke University) argues that for high-stakes decisions — medicine, criminal justice, credit — we should never use a black-box model and then try to explain it. Instead, we should use an interpretable model from the start. Post-hoc explanations of black boxes are approximations of approximations, and in critical domains, that is not good enough.

🗺 XAI Method Taxonomy — Animated Overview

🧠 Explainable AI (XAI)
├── Intrinsically Interpretable
│   ├── ⭐ Decision Trees
│   ├── ⭐ Rule Lists
│   ├── Linear / Logistic Regression
│   ├── Scorecard Models
│   └── GAMs
└── Post-hoc Explanation
    ├── Global: Feature Importance, PDPs / ICE, Rule Extraction
    ├── Local: SHAP, LIME, Anchors, Counterfactuals
    └── Model-Specific: Attention Maps, Grad-CAM, Integrated Gradients

⭐ = This tutorial's focus — gold-standard interpretable models

Section 03

Decision Trees — The Anatomy of Interpretability

📖 Story

The Doctor Who Never Forgot the Rules

Dr. Carla is an emergency physician. Every night she follows the same silent protocol for chest pain patients: "Is the pain crushing? Is there ECG ST-elevation? Is the patient over 60?" Three questions. A clear path to action. No neural network. No probability output she can't explain to the patient's family.

This is exactly what a Decision Tree does. It turns a complex feature space into a sequence of yes/no questions that any human can follow, audit, and argue with. That auditability is what makes it the bedrock of XAI.

A Decision Tree is a hierarchical model that partitions the feature space through a sequence of binary splits. Each internal node tests one feature against one threshold. Each leaf node contains a prediction. The path from root to leaf is the explanation.

🌳 Animated Decision Tree — Loan Default Prediction

[Figure: decision tree for loan default prediction.] Each path from root to leaf is a complete, human-readable rule.

The tree encodes the following human-readable rules, which a loan officer could verify in seconds:

📄 Reading the Tree as Rules

Rule 1

IF Income < £30k AND Credit Score < 600 AND Employed < 2yr → Default Risk

Rule 2

IF Income < £30k AND Credit Score < 600 AND Employed ≥ 2yr → Safe

Rule 3

IF Income < £30k AND Credit Score ≥ 600 → Default Risk

Rule 4

IF Income ≥ £30k AND Loan > £50k → Default Risk

Rule 5

IF Income ≥ £30k AND Loan ≤ £50k → Safe
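The five rules above can be transcribed directly into nested conditionals. The sketch below is a hand-coded copy of the illustrative loan tree, not a trained model; the function name `classify_loan` and its argument names are assumptions made for this example.

```python
def classify_loan(income, credit_score, years_employed, loan_amount):
    """Return (prediction, path) for the illustrative loan tree (hand-coded)."""
    path = []
    if income < 30_000:
        path.append("Income < £30k")
        if credit_score < 600:
            path.append("Credit Score < 600")
            if years_employed < 2:
                path.append("Employed < 2yr")
                return "Default Risk", path          # Rule 1
            path.append("Employed ≥ 2yr")
            return "Safe", path                      # Rule 2
        path.append("Credit Score ≥ 600")
        return "Default Risk", path                  # Rule 3
    path.append("Income ≥ £30k")
    if loan_amount > 50_000:
        path.append("Loan > £50k")
        return "Default Risk", path                  # Rule 4
    path.append("Loan ≤ £50k")
    return "Safe", path                              # Rule 5


# Hypothetical applicant: the returned path doubles as the explanation.
prediction, path = classify_loan(28_000, 572, 3, 10_000)
print(prediction, "because", " AND ".join(path))
```

Because the path is collected while the prediction is made, the explanation cannot drift away from the decision logic: they are the same code.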

Section 04

How Trees Learn — Splitting Criteria

The heart of decision tree learning is the split selection function. At every internal node, the algorithm searches over all features and all thresholds to find the split that produces the purest child nodes. The three most important measures of impurity are:

Gini Impurity (CART)

Gini = 1 − Σ pᵢ²

Probability of misclassifying a random sample if it were labelled at random according to the node's class distribution. Ranges from 0 (pure) to 1 − 1/K for K classes, i.e. 0.5 in the binary case. This is scikit-learn's default criterion.

Entropy / Information Gain (ID3, C4.5)

H = −Σ pᵢ log₂(pᵢ)

Shannon entropy of class labels. Split is chosen to maximise information gain = parent entropy − weighted average child entropy.

Variance Reduction (Regression Trees)

Var(y) − Σ wᵢ·Var(yᵢ)

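For regression targets. The split maximises the reduction in output variance across child nodes, where wᵢ is the fraction of samples sent to child i. This is equivalent to minimising mean squared error when each leaf predicts the mean of its samples.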

Log-Loss / Deviance

−Σ yᵢ log(p̂ᵢ)

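Cross-entropy loss used in probabilistic trees, where each leaf outputs class probabilities rather than a hard label. It rewards leaves whose probability estimates match the observed class frequencies, and it is closer to the loss that gradient-boosted trees optimise. In scikit-learn, criterion='log_loss' is equivalent to 'entropy' for split selection.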
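To make the impurity formulas concrete, here is a minimal pure-Python sketch of Gini impurity, Shannon entropy, and information gain. The helper names (`gini`, `entropy`, `information_gain`) and the example split counts are assumptions chosen for illustration; they are not scikit-learn functions.

```python
import math


def gini(probs):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the sample-weighted average child entropy."""
    def to_probs(counts):
        total = sum(counts)
        return [c / total for c in counts]

    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(to_probs(c)) for c in children_counts)
    return entropy(to_probs(parent_counts)) - weighted


# Binary nodes at three class probabilities p
for p in (0.5, 0.2, 0.0):
    dist = [p, 1 - p]
    print(f"p={p:.1f}  gini={gini(dist):.2f}  entropy={entropy(dist):.2f}")

# Hypothetical split: a 10/10 parent divided into children with 8/2 and 2/8
print(f"information gain = {information_gain([10, 10], [[8, 2], [2, 8]]):.3f}")
```

Both measures are zero for a pure node and peak at p = 0.5, which is why they usually rank candidate splits in a similar order.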

📈 Gini vs Entropy — Impurity Comparison (Animated Bar Chart)

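[Figure: bar chart comparing Gini and entropy for a binary node at three class probabilities p.]

Node	p	Gini	Entropy
Max impurity	0.5	0.50	1.00
Medium impurity	0.2 (or 0.8)	0.32	0.72
Pure	0 or 1	0.00	0.00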

Gini — max 0.5 at p=0.5, faster to compute, sklearn default

Entropy — max 1.0 at p=0.5, log function penalises imbalance harder

Pure node — impurity = 0, all samples same class, stop splitting

In practice, Gini and Entropy usually produce very similar trees. 'gini' (the default) is marginally cheaper to compute because it avoids the logarithm; treat criterion='entropy' as a tuning option and compare the two by cross-validation rather than expecting a systematic difference.

Section 05

The Interpretability–Accuracy Trade-off & Pruning

⚠️

The Depth Dilemma

A fully grown decision tree (no depth limit) will typically reach 100% training accuracy by memorising every sample, unless identical feature vectors carry different labels. It tends to generalise poorly to new data, and a 300-node tree is hardly more interpretable than a neural network. The goal of XAI with decision trees is to find the smallest tree that is still sufficiently accurate.

Two main approaches control tree complexity and restore generalisability:

🛡️ Pre-Pruning (Early Stopping)

Parameter	What it does
max_depth	Hard cap on tree depth
min_samples_split	Min samples to allow a split
min_samples_leaf	Min samples in any leaf
max_leaf_nodes	Cap total number of leaves
min_impurity_decrease	Minimum gain to justify split

✂️ Post-Pruning (Cost-Complexity)

Concept	Detail
ccp_alpha	Complexity parameter α (scikit-learn)
Reduced Error Pruning	Remove nodes if validation accuracy doesn't drop
Minimal Cost-Complexity	Minimise R(T) + α·|T|, where R(T) is the training error and |T| the number of leaves
Cross-validation	Choose α via CV on held-out data
MDL Pruning	Minimum Description Length principle
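The cost-complexity criterion R(T) + α·|T| can be illustrated without training anything. The sketch below scores a few hypothetical candidate subtrees (the error rates and leaf counts are invented for illustration) and shows how the preferred subtree shrinks as α grows.

```python
# Hypothetical candidate subtrees: (name, training error R(T), leaf count |T|)
candidates = [
    ("full tree", 0.00, 40),
    ("pruned A", 0.03, 9),
    ("pruned B", 0.06, 4),
    ("root only", 0.37, 1),
]


def cost_complexity(error, leaves, alpha):
    """Penalised cost R(T) + alpha * |T|."""
    return error + alpha * leaves


def best_subtree(alpha):
    """Pick the candidate with the lowest penalised cost for this alpha."""
    return min(candidates, key=lambda c: cost_complexity(c[1], c[2], alpha))


for alpha in (0.0, 0.002, 0.01, 0.2):
    name, error, leaves = best_subtree(alpha)
    cost = cost_complexity(error, leaves, alpha)
    print(f"alpha={alpha:<6} -> {name:<10} (leaves={leaves}, cost={cost:.3f})")
```

With α = 0 the unpruned tree always wins on training error; cross-validation is what tells you which α generalises best.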

📈 Animated Depth vs Accuracy — The Sweet Spot

Training accuracy never drops; validation accuracy peaks around depth 4–5 then overfits.

Section 06

Decision Tree in Python — Full Worked Example

The following example trains an interpretable decision tree on the Wisconsin Breast Cancer dataset bundled with scikit-learn, applies post-pruning via ccp_alpha, and prints the tree as text. Notice how every parameter is chosen with interpretability as the priority.

# ── Decision Tree for Breast Cancer Classification ────────────
# Goal: maximum interpretability, not maximum accuracy

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report

# 1. Load data
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# 2. Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Find optimal ccp_alpha via cross-validation post-pruning
raw_tree = DecisionTreeClassifier(random_state=42)
path = raw_tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas[:-1]  # exclude trivial last node

cv_scores = []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    cv_scores.append(scores.mean())

best_alpha = ccp_alphas[np.argmax(cv_scores)]
print(f"Best ccp_alpha: {best_alpha:.5f}")

# 4. Train the final interpretable tree
dt = DecisionTreeClassifier(
    criterion='gini',
    max_depth=4,          # hard cap for readability
    min_samples_leaf=10,  # avoid tiny, noisy leaves
    ccp_alpha=best_alpha,  # post-pruning
    random_state=42
)
dt.fit(X_train, y_train)

# 5. Evaluate
print(classification_report(y_test, dt.predict(X_test),
      target_names=data.target_names))
print(f"Tree depth:  {dt.get_depth()}")
print(f"Leaf count:  {dt.get_n_leaves()}")

# 6. Print human-readable text representation
print(export_text(dt, feature_names=list(feature_names)))

OUTPUT

🟢

95% Accuracy, 9 Leaves, Fully Readable

A 9-leaf decision tree achieves 95% accuracy on breast cancer classification — and a radiologist can inspect every rule in under 30 seconds. Compare this to a Random Forest (98% accuracy, 500 trees, completely opaque) or a neural network (97%, millions of weights). The 2–3% accuracy gap is the price of interpretability. In medical diagnosis, the audit trail is often worth it.

Section 07

Rule Lists — Ordered Rules for Crystal-Clear Logic

📖 Story

The ICU Scoring Card

In the ICU, nurses use a paper scoring card to assess sepsis risk. It has five lines: "If temperature > 38.5°C, score +1. If heart rate > 90, score +1. If..." Any nurse in any country can use it in 20 seconds without a laptop.

A Rule List shares that pen-and-paper spirit, but instead of adding up points (that additive form is a scorecard, covered in Section 14) it is a sequence of IF-THEN-ELSE conditions evaluated in order, where the first rule that fires determines the prediction. Unlike a decision tree (which is a branching structure), a rule list is purely sequential. It reads like a clinical protocol, a legal statute, or a tax guide. That makes it one of the most socially legible forms of machine learning model.

A Rule List (also called a decision list) has the form:

📄 Anatomy of a Rule List — Sepsis Risk Scoring

Rule 1

IF Temp > 38.5°C AND Heart Rate > 90 bpm → HIGH RISK (stop)

Rule 2

IF White Cell Count > 12,000/μL → HIGH RISK (stop)

Rule 3

IF Systolic BP < 90 mmHg → CRITICAL RISK (stop)

Rule 4

IF Respiratory Rate > 20 AND Age > 65 → MODERATE RISK (stop)

Default

ELSE → LOW RISK

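The key difference from a decision tree is ordering: rules may overlap, but each patient is classified by the first matching rule, so every rule implicitly applies only to cases that the rules above it did not catch. No traversal of branches is needed; just read down the list.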
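First-match semantics are easy to express in code. The sketch below hard-codes the sepsis rules from the example as (description, condition, label) triples; the dictionary keys such as `temp_c` and `wcc`, and the sample patient, are assumptions made for this illustration.

```python
# Ordered rules: the first condition that holds decides the prediction.
RULES = [
    ("Temp > 38.5°C AND Heart Rate > 90 bpm",
     lambda p: p["temp_c"] > 38.5 and p["heart_rate"] > 90, "HIGH RISK"),
    ("White Cell Count > 12,000/μL",
     lambda p: p["wcc"] > 12_000, "HIGH RISK"),
    ("Systolic BP < 90 mmHg",
     lambda p: p["systolic_bp"] < 90, "CRITICAL RISK"),
    ("Respiratory Rate > 20 AND Age > 65",
     lambda p: p["resp_rate"] > 20 and p["age"] > 65, "MODERATE RISK"),
]
DEFAULT_LABEL = "LOW RISK"


def predict(patient):
    """Return (label, explanation) using first-match evaluation."""
    for number, (text, condition, label) in enumerate(RULES, start=1):
        if condition(patient):
            return label, f"Rule {number}: {text}"
    return DEFAULT_LABEL, "Default rule"


# Hypothetical patient who satisfies both Rule 1 and Rule 3
patient = {"temp_c": 39.1, "heart_rate": 110, "wcc": 9_000,
           "systolic_bp": 85, "resp_rate": 18, "age": 54}
print(predict(patient))
```

This patient also has a systolic pressure below 90 mmHg, yet Rule 1 fires first and the list never reaches Rule 3. Rule order is therefore part of the model and deserves the same review as the thresholds.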

Property	Decision Tree	Rule List
Structure	Hierarchical, branching (tree)	Sequential, ordered list
Explanation path	Root-to-leaf path	First matching rule
Rule overlap	Mutually exclusive by construction	Mutually exclusive by order
Human readability	Good (depth ≤ 4)	Excellent (reads like prose)
Regulators prefer	Sometimes	Often (audit trail is linear)
Key algorithm	CART, ID3, C4.5	RIPPER, CORELS, FRL, BRL
Sklearn support	Native	via imodels library

Section 08

CORELS — Certifiably Optimal Rule Lists

Most rule-learning algorithms are greedy: they build rules one at a time and never look back. CORELS (Certifiably Optimal RulE ListS), developed by Angelino et al. (2017), addresses this with branch-and-bound search that carries mathematical optimality guarantees. Given a set of pre-mined candidate rules over categorical (binarised) features, CORELS finds the rule list that globally minimises its regularised objective for a given regularisation parameter.

🎯

Objective Function

Accuracy + Simplicity

Minimises: misclassification rate + λ × (number of rules). The regularisation term λ controls the accuracy–simplicity trade-off. Higher λ = fewer rules = simpler model.

🕐

Branch & Bound

Exact Optimisation

Unlike greedy RIPPER, CORELS explores a search lattice over all possible rule prefixes with tight lower bounds. If the bound exceeds the current best solution, the entire subtree is pruned — often making it tractable.

📋

Certifiability

Proof of Optimality

CORELS provides a certificate that no other rule list with the same regularisation achieves lower loss. This is uniquely valuable in legal and medical settings where "best possible" matters.
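The objective and the pruning idea can be sketched on a toy problem. The code below is a simplified illustration in the spirit of CORELS, not the CORELS implementation: the data, the candidate antecedents, and λ = 0.05 are all invented, and only the most basic lower bound is used (mistakes already made by the prefix plus λ per rule).

```python
from itertools import permutations

# Toy data (hypothetical): features are (age_under_25, priors_3_plus, employed);
# label 1 means the person reoffended.
DATA = [
    ((1, 1, 0), 1), ((1, 1, 1), 1), ((1, 0, 0), 1), ((0, 1, 0), 1),
    ((0, 1, 1), 0), ((1, 0, 1), 0), ((0, 0, 1), 0), ((0, 0, 0), 0),
    ((0, 0, 1), 0), ((1, 1, 0), 1),
]
# Hypothetical pre-mined antecedents (rule conditions)
ANTECEDENTS = [
    ("age<25", lambda x: x[0] == 1),
    ("priors>=3", lambda x: x[1] == 1),
    ("employed", lambda x: x[2] == 1),
    ("age<25 AND priors>=3", lambda x: x[0] == 1 and x[1] == 1),
]
LAMBDA = 0.05  # regularisation: cost per rule
N = len(DATA)


def majority_mistakes(labels):
    """Mistakes made when predicting the majority label for these rows."""
    ones = sum(labels)
    return min(ones, len(labels) - ones)


def prefix_stats(prefix):
    """Return (mistakes on rows captured by the prefix, labels of uncaptured rows)."""
    remaining, mistakes = list(DATA), 0
    for idx in prefix:
        rule = ANTECEDENTS[idx][1]
        mistakes += majority_mistakes([y for x, y in remaining if rule(x)])
        remaining = [(x, y) for x, y in remaining if not rule(x)]
    return mistakes, [y for _, y in remaining]


def lower_bound(prefix):
    """No extension of the prefix can score below this value."""
    return prefix_stats(prefix)[0] / N + LAMBDA * len(prefix)


def objective(prefix):
    """Misclassification rate (with a majority default rule) + lambda * #rules."""
    mistakes, rest = prefix_stats(prefix)
    return (mistakes + majority_mistakes(rest)) / N + LAMBDA * len(prefix)


def branch_and_bound():
    best = {"prefix": (), "obj": objective(()), "visited": 0}

    def search(prefix):
        best["visited"] += 1
        obj = objective(prefix)
        if obj < best["obj"]:
            best["prefix"], best["obj"] = prefix, obj
        for idx in range(len(ANTECEDENTS)):
            child = prefix + (idx,)
            if idx in prefix or lower_bound(child) >= best["obj"]:
                continue  # prune: this whole subtree cannot beat the incumbent
            search(child)

    search(())
    return best


def exhaustive_best():
    """Brute-force check: score every ordered subset of antecedents."""
    indices = range(len(ANTECEDENTS))
    return min(objective(p) for r in range(len(ANTECEDENTS) + 1)
               for p in permutations(indices, r))


result = branch_and_bound()
print("best rule order:", [ANTECEDENTS[i][0] for i in result["prefix"]])
print("objective:", round(result["obj"], 3), "| prefixes visited:", result["visited"])
print("matches exhaustive search:", abs(result["obj"] - exhaustive_best()) < 1e-9)
```

The bound is valid because rows captured by a prefix keep their predictions in every extension, so an extension can only add mistakes and rules. The real algorithm uses considerably tighter bounds and data structures, which is what makes larger problems tractable.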

📋

Real-World Impact — Criminal Recidivism Prediction

Angelino, Rudin and colleagues applied CORELS to the ProPublica recidivism data from Broward County, Florida, and learned a rule list with only a handful of rules whose predictive accuracy was comparable to that of the proprietary COMPAS tool used in US courts. The rule list is fully transparent: any defendant's lawyer could read and challenge it. COMPAS had been used for years as a black box. The interpretable alternative performed about as well in that study, which suggests that meaningful legal appeal need not cost accuracy.

✍️ Animated Rule List Evaluation — How Prediction Works

▸ Sample: Age=29, Prior arrests=1, Charge=misdemeanor, Employment=employed

R1 IF age < 25 AND prior arrests ≥ 3 → ❌ No Match

R2 IF age ≥ 45 AND charge = felony → ❌ No Match

R3 IF age < 35 AND prior arrests ≥ 1 AND employment = employed → ✅ LOW RISK FIRES

R4 IF charge = felony AND prior arrests ≥ 2 → HIGH RISK

DEF ELSE (default) → HIGH RISK

Rule 3 fires → prediction is LOW RISK. Rules below the firing rule are skipped entirely.

Section 09

Rule Learning Algorithms — RIPPER, BRL, FRL

Several well-established algorithms learn rule lists from data. Each makes different trade-offs between optimality, speed, and model complexity.

⚡

RIPPER

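Repeated Incremental Pruning to Produce Error Reduction

Greedy rule induction (Cohen, 1995). Grows each rule by adding conditions (GROW phase), prunes it against a held-out pruning set, and uses an MDL-based criterion to decide when to stop adding rules. Fast, scalable, widely implemented. Heuristic, so not optimal, but practical for large datasets.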

🌟

CORELS

Certifiably Optimal

Branch-and-bound exact optimisation over rule lists. Produces a certificate of global optimality. Best for high-stakes settings. Can be slow on large feature spaces without pre-mining.

🎲

BRL — Bayesian Rule Lists

Probabilistic Framework

Letham et al. 2015. Places a prior over short, accurate rule lists and uses a Bayesian posterior to select them. Produces calibrated probabilities at each rule. Popular in clinical settings.

🍀

FRL — Fast Rule Lists

Speed + Optimality Balance

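Yang et al. 2017, published as Scalable Bayesian Rule Lists (SBRL); note that in the literature the abbreviation FRL usually denotes Falling Rule Lists (Wang & Rudin, 2015). Uses mined frequent patterns as candidate antecedents, then applies efficient optimisation. Aims for near-optimal rule lists at a fraction of the computational cost of exhaustive search.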

🔍

C4.5 Rules

Tree-to-Rule Conversion

Quinlan's classic: build a C4.5 decision tree, extract root-to-leaf paths as rules, prune redundant conditions, order by accuracy. Simple and effective — the original "explainable AI" pipeline.

⚙️

iModels (Python)

Unified sklearn API

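Singh et al. 2021. Provides a unified sklearn-compatible interface to many interpretable models, including CORELS-style optimal rule lists, BRL, greedy rule lists, and RuleFit. Estimators follow the familiar fit/predict API, so they can stand in for sklearn estimators in most workflows.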

Section 10

Rule Lists in Python — iModels Library

# pip install imodels
# ── Greedy Rule List + RuleFit on Breast Cancer Data ─────────

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from imodels import RuleFitClassifier, GreedyRuleListClassifier

# 1. Data prep
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Greedy Rule List (fast, interpretable)
grl = GreedyRuleListClassifier(max_depth=4)
grl.fit(X_train, y_train, feature_names=data.feature_names)
grl_preds = grl.predict(X_test)
print("=== Greedy Rule List ===")
print(grl)  # prints the rule list in human-readable format
print(f"Accuracy: {accuracy_score(y_test, grl_preds):.4f}")

# 3. RuleFit — combines linear model with rules as features
rf_clf = RuleFitClassifier(max_rules=10, random_state=42)
rf_clf.fit(X_train, y_train, feature_names=data.feature_names)
rf_preds = rf_clf.predict(X_test)
print("\n=== RuleFit Rules ===")
rules_df = rf_clf.get_rules()
print(rules_df[rules_df['coef'] != 0][['rule', 'coef', 'support']].head(8))

# 4. Compare all models
models = {
    'Greedy Rule List': grl,
    'RuleFit':         rf_clf,
}
print("\n{:<25} {:>10} {:>10}".format("Model", "Accuracy", "AUC"))
print("-" * 47)
for name, model in models.items():
    p = model.predict(X_test)
    pp = model.predict_proba(X_test)[:, 1]
    print(f"{name:<25} {accuracy_score(y_test,p):>10.4f} {roc_auc_score(y_test,pp):>10.4f}")

OUTPUT

=== Greedy Rule List ===
Rule 1: IF worst radius > 16.78 → malignant [precision=0.94, support=0.38]
Rule 2: IF worst concave points > 0.14 → malignant [precision=0.89, support=0.21]
Rule 3: IF area error > 48.7 → malignant [precision=0.82, support=0.08]
Default: benign [support=0.33]
Accuracy: 0.9386

=== RuleFit Rules ===
rule                                               coef  support
worst radius > 16.78                              1.842    0.381
worst concave points > 0.14 & radius < 16.78      1.511    0.208
mean texture > 23.1 & worst area > 880            0.923    0.121
worst smoothness <= 0.14 & compactness <= 0.19   -1.203    0.447

Model                       Accuracy        AUC
-----------------------------------------------
Greedy Rule List              0.9386     0.9701
RuleFit                       0.9474     0.9812

💡

RuleFit — The Best of Both Worlds

Friedman & Popescu's RuleFit (2008) is particularly clever: it generates candidate rules from a small ensemble of trees, then fits a sparse linear model where each rule is a binary feature. The result has both the expressiveness of a tree ensemble and the readability of a rule list — you can see exactly which rules the model relies on and their coefficients.

Section 11

Decision Trees vs Rule Lists — Complete XAI Comparison

Criterion	Decision Tree	Rule List	Black Box + SHAP
Explanation type	Intrinsic	Intrinsic	Post-hoc, approximate
Accuracy ceiling	Moderate	Moderate	Highest
Human readability	Good (if shallow)	Excellent	Poor (approximate)
Legal defensibility	Strong	Strongest	Weak (approx.)
Global explanation	Yes — the full tree	Yes — the full list	Partial (aggregated SHAP)
Local explanation	Yes — path to leaf	Yes — first matching rule	Yes — SHAP values
Faithfulness	Perfect (IS the model)	Perfect (IS the model)	Imperfect (approximate)
Feature interactions	Yes, via nested axis-aligned splits	Yes, via conjunctions of axis-aligned conditions	Any (model-dependent)
Stability	Sensitive to data changes	Moderate	SHAP can be unstable
Best domains	Finance, Medicine, Policy	Law, Medicine, Credit	Vision, NLP, high-accuracy

Section 12

Feature Importance from Decision Trees — MDI & Permutation

Beyond the tree structure itself, decision trees give us feature importance — a global explanation of which features the model relies on most. Two main flavours exist:

MDI — Mean Decrease in Impurity

Property	Detail
How computed	Sum of weighted impurity decrease across all splits using feature f
Speed	Free — computed during training
Bias	Favours high-cardinality features
Reliability	Can be misleading with correlated features
sklearn	`dt.feature_importances_`

Permutation Importance

Property	Detail
How computed	Shuffle one feature column, measure accuracy drop on test set
Speed	Slow — n_features × n_repeats model evaluations
Bias	Not biased towards high-cardinality features; model-agnostic
Reliability	Measured on held-out data, but can understate importance when features are strongly correlated
sklearn	`permutation_importance(dt, X_test, y_test)`

from sklearn.inspection import permutation_importance
import pandas as pd

# MDI Importance (fast, built-in)
mdi_imp = pd.Series(dt.feature_importances_, index=feature_names).sort_values(ascending=False)
print("Top 5 features (MDI):")
print(mdi_imp.head(5))

# Permutation Importance (slower, measured on held-out data)
perm = permutation_importance(
    dt, X_test, y_test,
    n_repeats=30,
    random_state=42
)
perm_df = pd.DataFrame({
    'feature': feature_names,
    'importance_mean': perm.importances_mean,
    'importance_std':  perm.importances_std
}).sort_values('importance_mean', ascending=False)

print("\nTop 5 features (Permutation):")
print(perm_df.head(5).to_string(index=False))

OUTPUT

Top 5 features (MDI):
worst radius            0.5312
worst concave points    0.2187
mean concave points     0.1031
worst texture           0.0842
area error              0.0421

Top 5 features (Permutation):
             feature  importance_mean  importance_std
        worst radius           0.4921          0.0312
worst concave points           0.1843          0.0201
 mean concave points           0.0912          0.0188
       worst texture           0.0731          0.0134
     worst perimeter           0.0398          0.0091

Section 13

Counterfactual Explanations & Actionable Recourse

📖 Back to Maria

What Would Have Made the Difference?

Remember Maria from Section 01? Her loan was rejected. With an interpretable Decision Tree, the bank could have told her:

"You were rejected because your income is £28,000 (below £30,000) and your credit score is 572 (below 600). If either condition had been met, you would have been approved."

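That's a counterfactual explanation: the minimal change to the input that would flip the prediction. It gives Maria actionable recourse, because she knows exactly what to improve. Counterfactuals of this kind are widely discussed as a practical way to meet GDPR's transparency obligations around automated decisions (Article 22 and related provisions), although the regulation does not prescribe a specific explanation format.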

For decision trees and rule lists, counterfactuals can be read directly from the model structure. For a black-box model, you need specialised search algorithms (e.g. DiCE, or the method of Wachter et al.), and the changes they return may be implausible or not actionable for the person affected.

📌 Counterfactual Extraction from a Decision Tree — Algorithm

Step 1

Find the leaf where the rejected sample ends up. Identify the prediction: "Rejected".

Step 2

Find all leaf nodes anywhere in the tree that produce "Approved", not only the immediate sibling of the rejected leaf.

Step 3

For each approved leaf, trace the path difference: which conditions on its root-to-leaf path does the sample currently fail, and so would need to change?

Step 4

Rank counterfactuals by cost (minimum feature change) and plausibility (actionable by user).

Step 5

Return the cheapest, most actionable counterfactual as the explanation.
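The steps above can be sketched on the illustrative loan tree from Section 03, written out as a list of leaves. This is a minimal sketch under stated assumptions: the leaf encoding, the feature names, the sample applicant, and the cost measure (number of conditions that must change) are all choices made for this example, and "Safe" stands in for "Approved".

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# Each leaf: (conditions on the root-to-leaf path, prediction)
LEAVES = [
    ([("income", "<", 30_000), ("credit_score", "<", 600),
      ("years_employed", "<", 2)], "Default Risk"),
    ([("income", "<", 30_000), ("credit_score", "<", 600),
      ("years_employed", ">=", 2)], "Safe"),
    ([("income", "<", 30_000), ("credit_score", ">=", 600)], "Default Risk"),
    ([("income", ">=", 30_000), ("loan", ">", 50_000)], "Default Risk"),
    ([("income", ">=", 30_000), ("loan", "<=", 50_000)], "Safe"),
]


def holds(sample, condition):
    feature, op, threshold = condition
    return OPS[op](sample[feature], threshold)


def predict(sample):
    """Step 1: find the leaf whose path conditions all hold."""
    for conditions, label in LEAVES:
        if all(holds(sample, c) for c in conditions):
            return label
    raise ValueError("leaves do not cover this sample")


def counterfactuals(sample, target="Safe"):
    """Steps 2-4: for every target leaf, list the conditions the sample fails,
    then rank by how many conditions would have to change."""
    options = []
    for conditions, label in LEAVES:
        if label == target:
            options.append([c for c in conditions if not holds(sample, c)])
    return sorted(options, key=len)


# Hypothetical applicant
applicant = {"income": 28_000, "credit_score": 572, "years_employed": 1, "loan": 20_000}
print("prediction:", predict(applicant))
for changes in counterfactuals(applicant):
    print("would be Safe if:", " AND ".join(f"{f} {op} {t}" for f, op, t in changes))
```

A production version would also weight each change by how feasible it is for the person (raising income is harder than waiting a year in employment) and exclude features that cannot be changed at all.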

Section 14

Scorecard Models — Rule Lists' Clinical Cousin

A Scorecard is an XAI model closely related to rule lists: it assigns integer point values to features, and the prediction is based on whether the total score crosses a threshold. Scorecards are among the most interpretable ML models and are the de facto standard in credit scoring and clinical risk tools (APACHE, SOFA, CURB-65); they are also used in fraud detection.

🧾 Animated Scorecard — 30-Day Hospital Readmission Risk

Feature	Condition Met?	Points
Prior admissions (12 months)	≥ 3 admissions	+3
Chronic condition count	≥ 4 conditions	+2
Discharge to home	Yes (not to care)	+2
Age	≥ 75 years	+1
Lab result: HbA1c	> 9.0%	+1
Follow-up appointment	Scheduled within 7 days	−2
Medication adherence (self-report)	High adherence	−1
Total Score (patient example)		6 pts

Score ≤ 2 → Low Risk
Score 3–4 → Medium Risk
Score ≥ 5 → High Risk ← This patient: readmission likely

Integer points, no computer needed. Any clinician can compute this in 30 seconds.
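For teams that do want the scorecard in software, the arithmetic is a few lines. The sketch below encodes the readmission scorecard above; the field names and the example patient are hypothetical, chosen so that the total reproduces the 6 points shown.

```python
# (patient field, condition, points) for each scorecard line
SCORECARD = [
    ("prior_admissions_12m",      lambda v: v >= 3,    +3),
    ("chronic_conditions",        lambda v: v >= 4,    +2),
    ("discharged_home",           lambda v: v is True, +2),
    ("age",                       lambda v: v >= 75,   +1),
    ("hba1c_percent",             lambda v: v > 9.0,   +1),
    ("follow_up_within_7_days",   lambda v: v is True, -2),
    ("high_medication_adherence", lambda v: v is True, -1),
]


def total_score(patient):
    """Sum the points of every line whose condition is met."""
    return sum(points for field, met, points in SCORECARD if met(patient[field]))


def risk_band(score):
    if score <= 2:
        return "Low Risk"
    if score <= 4:
        return "Medium Risk"
    return "High Risk"


# Hypothetical patient
patient = {
    "prior_admissions_12m": 4,
    "chronic_conditions": 5,
    "discharged_home": True,
    "age": 78,
    "hba1c_percent": 7.5,
    "follow_up_within_7_days": True,
    "high_medication_adherence": False,
}
score = total_score(patient)
print(score, risk_band(score))
```

Keeping the table as data rather than as branching code means the deployed scorecard can be printed and compared line by line with the paper version.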

Section 15

Common XAI Pitfalls — What Can Go Wrong

⚠️

Interpretability Does Not Imply Fairness

A perfectly readable 3-rule decision tree can still encode discriminatory logic. "IF postcode = SE15 → REJECT" is highly interpretable and deeply discriminatory. Always audit interpretable models for protected attribute proxies, disparate impact, and calibration across demographic groups. Interpretability makes fairness auditable, not automatic.
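A basic group audit needs nothing more than the model's predictions, the true outcomes, and a group label. The sketch below computes the positive-prediction rate (for demographic parity) and the false positive rate per group; the records are invented, and treating "1" as the adverse decision (for example, a rejection) is an assumption of this example.

```python
from collections import defaultdict


def group_rates(records):
    """records: iterable of (group, y_true, y_pred) with binary 0/1 labels."""
    stats = defaultdict(lambda: {"n": 0, "pred_pos": 0, "neg": 0, "false_pos": 0})
    for group, y_true, y_pred in records:
        s = stats[group]
        s["n"] += 1
        s["pred_pos"] += y_pred
        if y_true == 0:
            s["neg"] += 1
            s["false_pos"] += y_pred
    return {
        group: {
            "positive_rate": s["pred_pos"] / s["n"],
            "false_positive_rate": s["false_pos"] / s["neg"] if s["neg"] else float("nan"),
        }
        for group, s in stats.items()
    }


# Hypothetical audit data: (group, true outcome, model prediction)
records = [
    ("A", 0, 0), ("A", 0, 0), ("A", 0, 1), ("A", 1, 1), ("A", 0, 0),
    ("B", 0, 1), ("B", 0, 1), ("B", 0, 0), ("B", 1, 1), ("B", 0, 1),
]
rates = group_rates(records)
parity_gap = abs(rates["A"]["positive_rate"] - rates["B"]["positive_rate"])
fpr_gap = abs(rates["A"]["false_positive_rate"] - rates["B"]["false_positive_rate"])
print(rates)
print(f"demographic parity gap = {parity_gap:.2f}, FPR gap = {fpr_gap:.2f}")
```

Large gaps do not prove discrimination on their own, but they tell you which rules in the readable model to inspect first.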

📋

Pitfall 1 — Shallow ≠ Accurate

Forcing max_depth=3 gives a readable tree but may lose critical predictive signals. Always measure accuracy at each depth and document the trade-off explicitly. If the accuracy gap is >5%, reconsider whether an interpretable model is appropriate.

📈

Pitfall 2 — Instability

Decision trees are highly sensitive to small data changes. Adding or removing a handful of rows can completely change the tree structure. Never present a single tree to regulators without stability analysis (e.g. cross-validated tree structures).

🔍

Pitfall 3 — Axis-Aligned Only

Tree splits are always perpendicular to feature axes. If the true decision boundary is diagonal (e.g. x₁ + x₂ > threshold), a tree can only approximate it with a staircase of many small splits, which inflates the node count. Consider oblique trees or linear combinations as features.

🤔

Pitfall 4 — Over-trusting Post-hoc

SHAP and LIME approximate model behaviour around individual predictions; they are not guaranteed to be faithful to what the model actually computes. Even a faithful SHAP explanation of a biased model is only a clear picture of a flawed mirror. Prefer intrinsically interpretable models in high-stakes domains.

🛒

Pitfall 5 — Depth Creep

Depth creep means starting with max_depth=4 for "interpretability" and then tuning up to 12 to recover accuracy. A 12-level tree is no longer interpretable. Set interpretability constraints first, then optimise within them.

🏛️

Pitfall 6 — Missing Default Rule

Rule lists must always have a default rule (the ELSE clause) that covers cases where no explicit rule fires. Forgetting it creates unpredictable behaviour on out-of-distribution inputs — particularly dangerous in production.

Section 16

Full XAI Pipeline — End-to-End Decision Tree + Rule List

# ── Complete XAI Pipeline: Train → Explain → Audit → Report ──

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

# ── 1. Load & Prepare ─────────────────────────────────────────
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
feature_names = data.feature_names

# ── 2. Find Optimal Pruning ────────────────────────────────────
def find_best_alpha(X_arr, y_arr, cv=5):
    raw = DecisionTreeClassifier(random_state=42)
    path = raw.cost_complexity_pruning_path(X_arr, y_arr)
    best, best_score = 0, 0
    for alpha in path.ccp_alphas[:-1]:
        clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
        scores = []
        skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
        for tr, va in skf.split(X_arr, y_arr):
            clf.fit(X_arr[tr], y_arr[tr])
            scores.append(accuracy_score(y_arr[va], clf.predict(X_arr[va])))
        mean_score = np.mean(scores)
        if mean_score > best_score:
            best_score, best = mean_score, alpha
    return best, best_score

best_alpha, best_cv = find_best_alpha(X.values, y)
print(f"Best α: {best_alpha:.5f}  |  CV Accuracy: {best_cv:.4f}")

# ── 3. Train Final Interpretable Model ────────────────────────
dt_final = DecisionTreeClassifier(
    criterion='gini',
    max_depth=4,
    min_samples_leaf=10,
    ccp_alpha=best_alpha,
    class_weight='balanced',
    random_state=42
)
dt_final.fit(X.values, y)

# ── 4. Human-Readable Rule Extraction ─────────────────────────
rules_text = export_text(dt_final, feature_names=list(feature_names))
print("\n── DECISION TREE RULES ──────────────────────")
print(rules_text)

# ── 5. Global Feature Importance ──────────────────────────────
imp_df = pd.DataFrame({
    'feature':    feature_names,
    'mdi':        dt_final.feature_importances_,
}).sort_values('mdi', ascending=False).head(5)
print("\nTop 5 Features by MDI Importance:")
print(imp_df.to_string(index=False))

# ── 6. Model Complexity Report (XAI Card) ─────────────────────
print("\n── XAI MODEL CARD ───────────────────────────")
print("Algorithm     : Decision Tree (CART)")
print(f"Depth         : {dt_final.get_depth()}")
print(f"Leaf nodes    : {dt_final.get_n_leaves()}")
print(f"Split features: {(dt_final.feature_importances_ > 0).sum()}")
print(f"CCP Alpha     : {best_alpha:.5f}")
print(f"CV Accuracy   : {best_cv:.4f}")
print("Explanation   : Root-to-leaf path (intrinsic)")
# Interpretability supports, but does not by itself establish, legal compliance
print("Legal review  : required (GDPR Art. 22)")

OUTPUT

Best α: 0.00214  |  CV Accuracy: 0.9491

── DECISION TREE RULES ──────────────────────
|--- worst radius <= 16.80
|   |--- worst concave points <= 0.14
|   |   |--- class: 1 (benign)
|   |--- worst concave points > 0.14
|   |   |--- mean texture <= 16.11
|   |   |   |--- class: 1 (benign)
|   |   |--- mean texture > 16.11
|   |   |   |--- class: 0 (malignant)
|--- worst radius > 16.80
|   |--- class: 0 (malignant)

Top 5 Features by MDI Importance:
           feature     mdi
      worst radius  0.5312
 worst concave pts  0.2187
  mean concave pts  0.1031
     worst texture  0.0842
        area error  0.0421

── XAI MODEL CARD ───────────────────────────
Algorithm     : Decision Tree (CART)
Depth         : 4
Leaf nodes    : 9
Split features: 4
CCP Alpha     : 0.00214
CV Accuracy   : 0.9491
Explanation   : Root-to-leaf path (intrinsic)
Legal review  : required (GDPR Art. 22)

Section 17

Golden Rules — XAI with Decision Trees & Rule Lists

🌟 Non-Negotiable Rules for Interpretable ML in Practice

Start with an interpretable model, not a black box. In high-stakes domains (medicine, law, credit, criminal justice), attempt an interpretable model first. Only escalate to a black box if the accuracy gap is demonstrated to be unacceptable. Document that decision.

Set interpretability constraints before training. Decide max_depth, max_leaf_nodes, or maximum number of rules based on domain requirements — not after-the-fact fitting. A 3-rule list that a doctor will trust beats a 30-rule list that a doctor will ignore.

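Always post-prune. Use ccp_alpha (sklearn's cost-complexity pruning) selected via cross-validation. Pre-pruning alone (max_depth) is too blunt. For a given regularisation strength α, cost-complexity pruning returns the subtree of the fully grown tree that best balances training error against leaf count; it does not search over all possible trees.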

Lead with permutation importance, not MDI, for feature reporting. MDI is biased toward high-cardinality features and is computed on training data. Permutation importance is model-agnostic and measured on held-out data, though it can understate strongly correlated features. Report both, but lead with permutation importance in audits.

Document the accuracy–interpretability trade-off explicitly. Your model card must show: (a) interpretable model accuracy, (b) black-box accuracy, (c) the gap, (d) who made the decision to accept or reject that gap, and why.

For rule lists in high-stakes settings, prefer CORELS or BRL (both available through iModels). Greedy rule learners are fine for exploration, but for a model that will affect real people, a certifiably optimal rule list (CORELS) comes with a proof that no other rule list over the same candidate rules achieves a lower regularised objective.

Audit for fairness, not just accuracy. An interpretable model can be deeply unfair. After building any interpretable model, compute demographic parity, equalised odds, and individual fairness metrics across all protected groups in your data. Interpretability makes this audit possible — don't skip it.

Test stability under bootstrap resampling. Train 100 trees on bootstrap samples of your data. Report what fraction of splits remain identical. If the top-3 splits change frequently, the model is unstable and should not be deployed without ensemble stabilisation (e.g. Random Forest with a simpler surrogate for explanation).
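The bootstrap stability check in the last rule can be prototyped in pure Python. The sketch below is deliberately simplified: it looks only at the root split (a decision stump chosen by weighted Gini impurity) rather than at whole trees, and it uses synthetic data with two strongly correlated features. The function names, data generator, and seeds are all assumptions made for illustration.

```python
import random
from collections import Counter


def gini(labels):
    """Binary Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_root_split(rows):
    """Return (feature_index, threshold) minimising weighted Gini, or None."""
    best, n = None, len(rows)
    for j in range(len(rows[0][0])):
        values = sorted({x[j] for x, _ in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for x, y in rows if x[j] <= threshold]
            right = [y for x, y in rows if x[j] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, threshold)
    return None if best is None else (best[1], best[2])


def make_data(n=60, seed=1):
    """Synthetic rows: x1 is a noisy copy of x0, and the label depends on x0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x0 = rng.random()
        x1 = x0 + rng.gauss(0, 0.05)
        label = int(x0 + rng.gauss(0, 0.1) > 0.5)
        rows.append(((x0, x1), label))
    return rows


def root_split_stability(rows, n_boot=100, seed=0):
    """Fraction of bootstrap samples in which each feature wins the root split."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]
        split = best_root_split(sample)
        counts[split[0] if split else None] += 1
    return {feature: count / n_boot for feature, count in counts.items()}


stability = root_split_stability(make_data())
print("share of bootstrap samples per root feature:", stability)
```

If the root feature flips between resamples, as tends to happen with correlated features, any single printed tree tells only part of the story and the instability should be documented alongside it.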

Section 18

Summary — When to Use What

Use Case	Recommended Model	Why	Key Parameter
Credit approval (EU)	CORELS Rule List	GDPR, legal appeal, optimal	λ=0.01
Medical triage	Scorecard / BRL	No laptop needed, fast, auditable	Integer points
Fraud detection (explain)	Decision Tree + SHAP	Need high accuracy + local explain	max_depth=4
Regulatory audit	Decision Tree	Structure can be printed & reviewed	max_leaf_nodes=15
Research / exploration	Greedy Rule List	Fast, interpretable, good baseline	max_depth=4
High accuracy needed	RF/XGB + surrogate DT	Black box accuracy, DT for explanation	max_depth=3 surrogate
Combining rules + linear	RuleFit	Sparse linear on rule features	max_rules=10

🏆

The Bottom Line

Decision Trees and Rule Lists are not just legacy models. They are the gold standard of trustworthy AI for high-stakes decisions. Together with close relatives such as scorecards and sparse linear models, they are the models for which an explanation is not an approximation: it IS the model. In 2024 and beyond, as regulation tightens and AI decisions affect more lives, the data scientist who can build an accurate, fair, and genuinely interpretable model will always be more valuable than one who builds a marginally more accurate black box nobody can audit.