The Big Question — Why Does Explainability Matter?
The loan officer shrugs: "It's the algorithm." Maria walks out furious and confused. She'll never know if the model penalised her postcode, her age, or some hidden proxy for race buried in the training data.
This is exactly why Explainable AI (XAI) exists — and why Decision Trees and Rule Lists have become the gold standard for interpretable machine learning.
Explainable AI (XAI) is the field of building machine learning systems whose predictions can be understood, audited, and trusted by human beings — not just measured. It sits at the intersection of data science, ethics, law, and cognitive psychology.
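Since 2018, the GDPR has restricted solely automated decisions that significantly affect individuals (Article 22) and entitles them to "meaningful information about the logic involved" (Articles 13-15). How far this amounts to a full right to explanation is still debated. The EU AI Act (2024) adds transparency and human-oversight obligations for high-risk AI systems. Interpretable models like Decision Trees aren't just nice to have: in many regulated industries they are the most direct way to meet these obligations.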
| Concept | Meaning | Example |
|---|---|---|
| Interpretability | The model itself is understandable — you can read its logic | A 5-node decision tree printed on paper |
| Explainability | A post-hoc tool explains a black-box model's predictions | SHAP values on a neural network |
| Transparency | You know what data was used and how the model was trained | Model card with training details |
| Fairness | The model does not discriminate on protected attributes | Equal false positive rates across demographics |
The XAI Landscape — Where Decision Trees Live
XAI methods fall into two broad camps: models that are intrinsically interpretable (transparent by design), and post-hoc explanation methods that try to shed light on opaque models after training.
Intrinsically interpretable: ✔ Decision Trees, Rule Lists, Linear Models, Scorecard Models
Post-hoc: ✔ SHAP, LIME, Integrated Gradients, Attention Maps
Global: Explains the model's overall behaviour.
Local: Explains one individual prediction only.
✔ Rule Lists are global; SHAP is often local
Professor Cynthia Rudin (Duke University) argues that for high-stakes decisions — medicine, criminal justice, credit — we should never use a black-box model and then try to explain it. Instead, we should use an interpretable model from the start. Post-hoc explanations of black boxes are approximations of approximations, and in critical domains, that is not good enough.
Decision Trees — The Anatomy of Interpretability
This is exactly what a Decision Tree does. It turns a complex feature space into a sequence of yes/no questions that any human can follow, audit, and argue with. That auditability is what makes it the bedrock of XAI.
A Decision Tree is a hierarchical model that partitions the feature space through a sequence of binary splits. Each internal node tests one feature against one threshold. Each leaf node contains a prediction. The path from root to leaf is the explanation.
Figure: each path from root to leaf is a complete, human-readable rule.
The tree above encodes the following human-readable rules that a loan officer could verify in seconds:
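As a minimal sketch of how such rules can be read off a fitted tree, the snippet below trains a depth-2 tree on a tiny, made-up loan dataset and prints it with sklearn's export_text. The feature names, values and labels are illustrative assumptions, not the data behind the figure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy loan data: [income in £k, credit score]
X_toy = np.array([
    [22, 540], [25, 610], [28, 572], [31, 590], [35, 640],
    [40, 580], [45, 700], [52, 660], [26, 680], [29, 650],
])
# Hypothetical labels: 1 = approved, 0 = rejected
y_toy = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 0])

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(X_toy, y_toy)

# export_text prints one indented line per node;
# reading from the top down to a "class:" line gives one complete rule
rules = export_text(toy_tree, feature_names=["income_k", "credit_score"])
print(rules)
```

Each indented branch in the printed text corresponds to one yes/no question, so a reviewer can follow any applicant from the root to a final class without running code.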
How Trees Learn — Splitting Criteria
The heart of decision tree learning is the split selection function. At every internal node, the algorithm searches over all features and all thresholds to find the split that produces the purest child nodes. The three most important measures of impurity are:
Figure values, impurity at class probability p = 0.5, 0.2 and 0.0: Gini 0.50, 0.32, 0.00; Entropy 1.00, 0.72, 0.00.
In practice, Gini and Entropy produce nearly identical trees. Choose criterion='entropy' when you want slightly more balanced trees; use 'gini' (default) for speed.
Each cluster of bars shows impurity at a different class probability p.
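The impurity values quoted in this section can be reproduced with a few lines of Python. This is a sketch for the binary case only; the third function, misclassification error, is an assumption about which third measure is meant, since only Gini and Entropy are named here.

```python
import math

def gini(p):
    """Gini impurity of a binary node with positive-class probability p."""
    return 2 * p * (1 - p)

def entropy(p):
    """Shannon entropy in bits; a pure node (p = 0 or 1) has entropy 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def misclassification_error(p):
    """Error rate when the node predicts its majority class."""
    return 1 - max(p, 1 - p)

for p in (0.5, 0.2, 0.0):
    print(f"p={p:.1f}  gini={gini(p):.2f}  "
          f"entropy={entropy(p):.2f}  error={misclassification_error(p):.2f}")
```

All three measures are zero for a pure node and peak at p = 0.5, which is why a split that pushes child nodes towards p = 0 or p = 1 is preferred.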
The Interpretability–Accuracy Trade-off & Pruning
A fully grown decision tree (no depth limit) will typically reach 100% training accuracy by memorising every sample. Such a tree usually generalises poorly to new data, and a 300-node tree is no more interpretable than a neural network. The goal of XAI with decision trees is to find the smallest tree that is still sufficiently accurate.
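Two main approaches control tree complexity and restore generalisability: pre-pruning, which stops growth early through hyperparameters, and post-pruning, which grows the tree and then cuts it back.
| Pre-pruning parameter | What it does |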
|---|---|
| max_depth | Hard cap on tree depth |
| min_samples_split | Min samples to allow a split |
| min_samples_leaf | Min samples in any leaf |
| max_leaf_nodes | Cap total number of leaves |
| min_impurity_decrease | Minimum gain to justify split |
| Post-pruning concept | Detail |
|---|---|
| ccp_alpha | Complexity parameter α in sklearn; larger values prune more |
| Reduced Error Pruning | Remove nodes if validation accuracy doesn't drop |
| Minimal Cost-Complexity | Minimise R(T) + α·(number of leaves in T) |
| Cross-validation | Choose α via CV on held-out data |
| MDL Pruning | Minimum Description Length principle |
Training accuracy never drops; validation accuracy peaks around depth 4–5 then overfits.
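A curve like this can be produced with a simple depth sweep. The sketch below assumes sklearn's built-in breast cancer dataset and a single 70/30 split; the exact depth at which validation accuracy peaks will vary with the data and the split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Record (depth, training accuracy, validation accuracy) for depths 1..10
depth_scores = []
for depth in range(1, 11):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_tr, y_tr)
    depth_scores.append((depth, clf.score(X_tr, y_tr), clf.score(X_val, y_val)))

for depth, train_acc, val_acc in depth_scores:
    print(f"depth={depth:2d}  train={train_acc:.3f}  val={val_acc:.3f}")
```

Reading the two columns side by side shows where extra depth stops paying for itself, which is the point at which to cap the tree.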
Decision Tree in Python — Full Worked Example
The following example trains an interpretable decision tree on scikit-learn's built-in breast cancer dataset, applies post-pruning via ccp_alpha, and prints the tree as readable text. Notice how every parameter is chosen with interpretability as the priority.
# ── Decision Tree for Breast Cancer Classification ────────────
# Goal: maximum interpretability, not maximum accuracy
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
# 1. Load data
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
# 2. Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# 3. Find optimal ccp_alpha via cross-validation post-pruning
raw_tree = DecisionTreeClassifier(random_state=42)
path = raw_tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas[:-1] # exclude trivial last node
cv_scores = []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    cv_scores.append(scores.mean())
best_alpha = ccp_alphas[np.argmax(cv_scores)]
print(f"Best ccp_alpha: {best_alpha:.5f}")
# 4. Train the final interpretable tree
dt = DecisionTreeClassifier(
    criterion='gini',
    max_depth=4,            # hard cap for readability
    min_samples_leaf=10,    # avoid tiny, noisy leaves
    ccp_alpha=best_alpha,   # post-pruning
    random_state=42
)
dt.fit(X_train, y_train)
# 5. Evaluate
print(classification_report(y_test, dt.predict(X_test),
                            target_names=data.target_names))
print(f"Tree depth: {dt.get_depth()}")
print(f"Leaf count: {dt.get_n_leaves()}")
# 6. Print human-readable text representation
print(export_text(dt, feature_names=list(feature_names)))
A 9-leaf decision tree achieves about 95% accuracy on breast cancer classification, and a radiologist can inspect every rule in under 30 seconds. Compare this to a Random Forest (about 98% accuracy, 500 trees, opaque in practice) or a neural network (about 97%, millions of weights). The 2–3% accuracy gap is the price of interpretability. In medical diagnosis, the audit trail is often worth it.
Rule Lists — Ordered Rules for Crystal-Clear Logic
This is a Rule List — a sequence of IF-THEN-ELSE conditions evaluated in order, where the first rule that fires determines the prediction. Unlike a decision tree (which is a branching structure), a rule list is purely sequential. It reads like a clinical protocol, a legal statute, or a tax guide. That makes it the most socially legible form of machine learning model.
A Rule List (also called a decision list) has the form:
The key difference from a decision tree is mutual exclusivity + order: each patient is classified by the first matching rule. No traversal of branches — just read down the list.
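The first-match semantics can be sketched in a few lines of Python. The rules, feature names and risk labels below are hypothetical and chosen only to illustrate the IF / ELSE IF / ELSE structure; they are not taken from a real clinical protocol.

```python
# Each rule: (readable condition, predicate over a patient dict, prediction)
# Hypothetical rules for illustration only
RULES = [
    ("age >= 75 AND systolic_bp < 90",
     lambda x: x["age"] >= 75 and x["systolic_bp"] < 90, "HIGH RISK"),
    ("chest_pain == 1",
     lambda x: x["chest_pain"] == 1, "MEDIUM RISK"),
]
DEFAULT = "LOW RISK"  # the final ELSE branch

def predict_with_reason(patient):
    """Return (prediction, explanation) from the first rule that fires."""
    for text, condition, label in RULES:
        if condition(patient):
            return label, f"IF {text} THEN {label}"
    return DEFAULT, f"ELSE {DEFAULT}"

patient_a = {"age": 80, "systolic_bp": 85, "chest_pain": 1}
patient_b = {"age": 50, "systolic_bp": 120, "chest_pain": 1}
patient_c = {"age": 50, "systolic_bp": 120, "chest_pain": 0}

for patient in (patient_a, patient_b, patient_c):
    print(predict_with_reason(patient))
```

Patient A matches both rules, but only the first one counts. That ordering is what makes the rules mutually exclusive in effect even when their conditions overlap.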
| Property | Decision Tree | Rule List |
|---|---|---|
| Structure | Hierarchical, branching (DAG) | Sequential, ordered list |
| Explanation path | Root-to-leaf path | First matching rule |
| Rule overlap | Mutually exclusive by construction | Mutually exclusive by order |
| Human readability | Good (depth ≤ 4) | Excellent (reads like prose) |
| Regulators prefer | Sometimes | Often (audit trail is linear) |
| Key algorithm | CART, ID3, C4.5 | RIPPER, CORELS, FRL, BRL |
| Sklearn support | Native | via imodels library |
CORELS — Certifiably Optimal Rule Lists
Most rule-learning algorithms are greedy: they build rules one at a time and never look back. CORELS (Certifiably Optimal RulE ListS), developed by Angelino et al. (2017), addresses this with a branch-and-bound search that comes with a certificate of optimality. Given a set of candidate antecedents and a regularisation parameter λ, CORELS finds the rule list that minimises misclassification error plus λ times the number of rules.
Angelino, Rudin and colleagues used CORELS to learn a rule list only a few rules long for recidivism prediction whose accuracy was comparable to the proprietary COMPAS system used in US courts, with a fully transparent model any defendant's lawyer could read and challenge. COMPAS had been used for years as a black box. The interpretable alternative performed about as well and allowed meaningful legal appeal.
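To make the idea of a regularised, globally optimal rule list concrete, here is a brute-force sketch that scores every short rule list over a handful of binary antecedents using misclassification rate plus λ times the number of rules. It is not the CORELS algorithm, which reaches the same kind of optimum far more efficiently through branch-and-bound. The data, antecedent names and λ value are all hypothetical.

```python
from itertools import permutations, product

# Hypothetical toy data: (binary features, label); label 1 = reoffended
DATA = [
    ({"priors_gt_3": 1, "age_lt_21": 0, "employed": 0}, 1),
    ({"priors_gt_3": 1, "age_lt_21": 1, "employed": 0}, 1),
    ({"priors_gt_3": 0, "age_lt_21": 1, "employed": 1}, 1),
    ({"priors_gt_3": 0, "age_lt_21": 0, "employed": 1}, 0),
    ({"priors_gt_3": 0, "age_lt_21": 0, "employed": 0}, 0),
    ({"priors_gt_3": 1, "age_lt_21": 0, "employed": 1}, 1),
    ({"priors_gt_3": 0, "age_lt_21": 1, "employed": 0}, 1),
    ({"priors_gt_3": 0, "age_lt_21": 0, "employed": 1}, 0),
]
ANTECEDENTS = ["priors_gt_3", "age_lt_21", "employed"]
LAMBDA = 0.05  # penalty per rule (assumed value)

def predict(rule_list, default, x):
    """First rule whose antecedent holds decides; otherwise use the default."""
    for feature, label in rule_list:
        if x[feature] == 1:
            return label
    return default

def objective(rule_list, default):
    """Misclassification rate plus LAMBDA times the number of rules."""
    errors = sum(predict(rule_list, default, x) != y for x, y in DATA)
    return errors / len(DATA) + LAMBDA * len(rule_list)

def exhaustive_search(max_rules=2):
    """Try every ordering, labelling and default; keep the lowest objective."""
    best = None
    for k in range(max_rules + 1):
        for features in permutations(ANTECEDENTS, k):
            for labels in product((0, 1), repeat=k):
                for default in (0, 1):
                    rule_list = list(zip(features, labels))
                    score = objective(rule_list, default)
                    if best is None or score < best[0]:
                        best = (score, rule_list, default)
    return best

best_score, best_rules, best_default = exhaustive_search()
print(best_score, best_rules, best_default)
```

Exhaustive enumeration grows factorially with the number of antecedents, which is why a pruned search such as branch-and-bound is needed on realistic problems.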
Rule 3 fires → prediction is LOW RISK. Rules below the firing rule are skipped entirely.
Rule Learning Algorithms — RIPPER, BRL, FRL
Several well-established algorithms learn rule lists from data. Each makes different trade-offs between optimality, speed, and model complexity.
Rule Lists in Python — iModels Library
# pip install imodels
# ── Greedy Rule List + RuleFit on Breast Cancer Data ─────────
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from imodels import RuleFitClassifier, GreedyRuleListClassifier
# 1. Data prep
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# 2. Greedy Rule List (fast, interpretable)
grl = GreedyRuleListClassifier(max_depth=4)
grl.fit(X_train, y_train, feature_names=data.feature_names)
grl_preds = grl.predict(X_test)
print("=== Greedy Rule List ===")
print(grl) # prints the rule list in human-readable format
print(f"Accuracy: {accuracy_score(y_test, grl_preds):.4f}")
# 3. RuleFit — combines linear model with rules as features
rf_clf = RuleFitClassifier(max_rules=10, random_state=42)
rf_clf.fit(X_train, y_train, feature_names=data.feature_names)
rf_preds = rf_clf.predict(X_test)
print("\n=== RuleFit Rules ===")
rules_df = rf_clf.get_rules()
print(rules_df[rules_df['coef'] != 0][['rule', 'coef', 'support']].head(8))
# 4. Compare all models
models = {
    'Greedy Rule List': grl,
    'RuleFit': rf_clf,
}
print("\n{:<25} {:>10} {:>10}".format("Model", "Accuracy", "AUC"))
print("-" * 47)
for name, model in models.items():
    p = model.predict(X_test)
    pp = model.predict_proba(X_test)[:, 1]
    print(f"{name:<25} {accuracy_score(y_test,p):>10.4f} {roc_auc_score(y_test,pp):>10.4f}")
Friedman & Popescu's RuleFit (2008) is particularly clever: it generates candidate rules from a small ensemble of trees, then fits a sparse linear model where each rule is a binary feature. The result has both the expressiveness of a tree ensemble and the readability of a rule list — you can see exactly which rules the model relies on and their coefficients.
Decision Trees vs Rule Lists — Complete XAI Comparison
| Criterion | Decision Tree | Rule List | Black Box + SHAP |
|---|---|---|---|
| Explanation type | Intrinsic | Intrinsic | Post-hoc, approximate |
| Accuracy ceiling | Moderate | Moderate | Highest |
| Human readability | Good (if shallow) | Excellent | Poor (approximate) |
| Legal defensibility | Strong | Strongest | Weak (approx.) |
| Global explanation | Yes — the full tree | Yes — the full list | Partial (aggregated SHAP) |
| Local explanation | Yes — path to leaf | Yes — first matching rule | Yes — SHAP values |
| Faithfulness | Perfect (IS the model) | Perfect (IS the model) | Imperfect (approximate) |
| Feature interactions | Axis-aligned only | Axis-aligned only | Any (model-dependent) |
| Stability | Sensitive to data changes | Moderate | SHAP can be unstable |
| Best domains | Finance, Medicine, Policy | Law, Medicine, Credit | Vision, NLP, high-accuracy |
Feature Importance from Decision Trees — MDI & Permutation
Beyond the tree structure itself, decision trees give us feature importance — a global explanation of which features the model relies on most. Two main flavours exist:
| Property | MDI (Mean Decrease in Impurity) |
|---|---|
| How computed | Sum of weighted impurity decrease across all splits using feature f |
| Speed | Free — computed during training |
| Bias | Favours high-cardinality features |
| Reliability | Can be misleading with correlated features |
| sklearn | dt.feature_importances_ |
| Property | Permutation importance |
|---|---|
| How computed | Shuffle one feature column, measure accuracy drop on test set |
| Speed | Slow: n_features × n_repeats model evaluations |
| Bias | No cardinality bias; model-agnostic |
| Reliability | Measured on held-out data; correlated features can still share or mask importance |
| sklearn | permutation_importance(dt, X_test, y_test) |
from sklearn.inspection import permutation_importance
import pandas as pd
# MDI Importance (fast, built-in)
mdi_imp = pd.Series(dt.feature_importances_, index=feature_names).sort_values(ascending=False)
print("Top 5 features (MDI):")
print(mdi_imp.head(5))
# Permutation Importance (slower, unbiased)
perm = permutation_importance(
    dt, X_test, y_test,
    n_repeats=30,
    random_state=42
)
perm_df = pd.DataFrame({
    'feature': feature_names,
    'importance_mean': perm.importances_mean,
    'importance_std': perm.importances_std
}).sort_values('importance_mean', ascending=False)
print("\nTop 5 features (Permutation):")
print(perm_df.head(5).to_string(index=False))
Counterfactual Explanations & Actionable Recourse
"You were rejected because your income is £28,000 (below £30,000) and your credit score is 572 (below 600). If either condition had been met, you would have been approved."
That's a counterfactual explanation: the minimal change to the input that would flip the prediction. It gives Maria actionable recourse, because she knows exactly what to improve. Counterfactuals of this kind are widely discussed as a practical way to meet the transparency expectations around GDPR Article 22.
For decision trees and rule lists, counterfactuals are trivially readable from the model structure. For a black-box model, you need specialised algorithms (DiCE, Wachter et al.) that may not be faithful to the model.
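Reading a counterfactual off an interpretable model can be sketched as follows. The approval rule and thresholds mirror the quoted explanation (income of at least £30,000 or a credit score of at least 600); the function names and the single-feature search are illustrative assumptions, not a general counterfactual algorithm.

```python
# Thresholds taken from the quoted explanation; model structure is assumed
THRESHOLDS = {"income": 30_000, "credit_score": 600}

def approve(applicant):
    """Hypothetical model: approve if either threshold is met."""
    return (applicant["income"] >= THRESHOLDS["income"]
            or applicant["credit_score"] >= THRESHOLDS["credit_score"])

def counterfactuals(applicant):
    """List single-feature changes that flip a rejection into an approval."""
    if approve(applicant):
        return []
    options = []
    for feature, threshold in THRESHOLDS.items():
        changed = dict(applicant, **{feature: threshold})
        if approve(changed):
            options.append((feature, applicant[feature], threshold,
                            threshold - applicant[feature]))
    return options

maria = {"income": 28_000, "credit_score": 572}
for feature, current, target, gap in counterfactuals(maria):
    print(f"Raise {feature} from {current} to {target} (+{gap})")
```

Because the thresholds come straight from the model, every suggested change is guaranteed to flip this model's decision. That guarantee is what post-hoc counterfactual methods cannot always give.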
Scorecard Models — Rule Lists' Clinical Cousin
A Scorecard is an XAI model closely related to rule lists: it assigns integer point values to features, and the prediction is based on whether the total score crosses a threshold. Scorecards are among the most interpretable ML models and are widely used in credit scoring, clinical risk tools (APACHE, SOFA, CURB-65), and fraud detection.
Integer points, no computer needed. Any clinician can compute this in 30 seconds.
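As a sketch of how little machinery a scorecard needs, the function below computes the CURB-65 pneumonia severity score, one point per criterion met. The function and parameter names are illustrative; the snippet shows the scoring arithmetic only and leaves out how a total maps to a care decision.

```python
def curb65_score(confusion, urea_mmol_l, resp_rate, systolic_bp,
                 diastolic_bp, age):
    """Return the CURB-65 total (0 to 5): one point per criterion met."""
    points = 0
    points += 1 if confusion else 0                               # C: new confusion
    points += 1 if urea_mmol_l > 7 else 0                         # U: urea > 7 mmol/L
    points += 1 if resp_rate >= 30 else 0                         # R: respiratory rate >= 30/min
    points += 1 if systolic_bp < 90 or diastolic_bp <= 60 else 0  # B: low blood pressure
    points += 1 if age >= 65 else 0                               # 65: age 65 or over
    return points

# Hypothetical patients
print(curb65_score(False, 5.0, 18, 120, 80, 40))
print(curb65_score(False, 8.0, 20, 120, 80, 70))
print(curb65_score(True, 9.0, 32, 85, 55, 80))
```

Every point has a one-line justification, so the total is both the prediction and its explanation.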
Common XAI Pitfalls — What Can Go Wrong
A perfectly readable 3-rule decision tree can still encode discriminatory logic. "IF postcode = SE15 → REJECT" is highly interpretable and deeply discriminatory. Always audit interpretable models for protected attribute proxies, disparate impact, and calibration across demographic groups. Interpretability makes fairness auditable, not automatic.
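One simple audit of this kind compares false positive rates across demographic groups. The sketch below uses hypothetical labels, predictions and group memberships; a real audit would also examine calibration and other disparity measures.

```python
import numpy as np

def false_positive_rate(y_true, y_pred):
    """Share of true negatives that the model flagged as positive."""
    negatives = y_true == 0
    if negatives.sum() == 0:
        return float("nan")
    return float((y_pred[negatives] == 1).mean())

def fpr_by_group(y_true, y_pred, group):
    """False positive rate computed separately for each group label."""
    return {
        str(g): false_positive_rate(y_true[group == g], y_pred[group == g])
        for g in np.unique(group)
    }

# Hypothetical audit data: 1 = flagged / rejected, groups "A" and "B"
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0])
group = np.array(["A"] * 6 + ["B"] * 6)

rates = fpr_by_group(y_true, y_pred, group)
print(rates)
```

A large gap between groups, as in this made-up example, is a signal to inspect which rules fire disproportionately for one group and whether a proxy feature is involved.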
max_depth=3 gives a readable tree but may lose critical predictive signals. Always measure accuracy at each depth and document the trade-off explicitly. If the accuracy gap is >5%, reconsider whether an interpretable model is appropriate.
A common mistake is setting max_depth=4 for "interpretability" but then tuning up to 12 to recover accuracy. A 12-level tree is no longer interpretable. Set interpretability constraints first, then optimise within them.
Full XAI Pipeline — End-to-End Decision Tree + Rule List
# ── Complete XAI Pipeline: Train → Explain → Audit → Report ──
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
# ── 1. Load & Prepare ─────────────────────────────────────────
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
feature_names = data.feature_names
# ── 2. Find Optimal Pruning ────────────────────────────────────
def find_best_alpha(X_arr, y_arr, cv=5):
    """Return (alpha, score) with the best mean stratified-CV accuracy."""
    raw = DecisionTreeClassifier(random_state=42)
    path = raw.cost_complexity_pruning_path(X_arr, y_arr)
    # Same seed for every alpha, so all candidates are scored on identical folds
    skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
    best, best_score = 0, 0
    for alpha in path.ccp_alphas[:-1]:
        clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
        scores = []
        for tr, va in skf.split(X_arr, y_arr):
            clf.fit(X_arr[tr], y_arr[tr])
            scores.append(accuracy_score(y_arr[va], clf.predict(X_arr[va])))
        mean_score = np.mean(scores)
        if mean_score > best_score:
            best_score, best = mean_score, alpha
    return best, best_score
best_alpha, best_cv = find_best_alpha(X.values, y)
print(f"Best α: {best_alpha:.5f} | CV Accuracy: {best_cv:.4f}")
# ── 3. Train Final Interpretable Model ────────────────────────
dt_final = DecisionTreeClassifier(
    criterion='gini',
    max_depth=4,
    min_samples_leaf=10,
    ccp_alpha=best_alpha,
    class_weight='balanced',
    random_state=42
)
dt_final.fit(X.values, y)
# ── 4. Human-Readable Rule Extraction ─────────────────────────
rules_text = export_text(dt_final, feature_names=list(feature_names))
print("\n── DECISION TREE RULES ──────────────────────")
print(rules_text)
# ── 5. Global Feature Importance ──────────────────────────────
imp_df = pd.DataFrame({
    'feature': feature_names,
    'mdi': dt_final.feature_importances_,
}).sort_values('mdi', ascending=False).head(5)
print("\nTop 5 Features by MDI Importance:")
print(imp_df.to_string(index=False))
# ── 6. Model Complexity Report (XAI Card) ─────────────────────
print(f"\n── XAI MODEL CARD ───────────────────────────")
print(f"Algorithm : Decision Tree (CART)")
print(f"Depth : {dt_final.get_depth()}")
print(f"Leaf nodes : {dt_final.get_n_leaves()}")
print(f"Split features: {(dt_final.feature_importances_ > 0).sum()}")
print(f"CCP Alpha : {best_alpha:.5f}")
print(f"CV Accuracy : {best_cv:.4f}")
print(f"Explanation : Root-to-leaf path (intrinsic)")
print(f"Legal coverage: Supports GDPR Art.22 review (not a guarantee of compliance)")
Golden Rules — XAI with Decision Trees & Rule Lists
Set interpretability constraints first: fix max_depth, max_leaf_nodes, or the maximum number of rules based on domain requirements, not after-the-fact fitting. A 3-rule list that a doctor will trust beats a 30-rule list that a doctor will ignore.
Prefer post-pruning with ccp_alpha (sklearn's cost-complexity pruning) selected via cross-validation. Pre-pruning alone (max_depth) is too blunt. Cost-complexity pruning finds, for a given regularisation strength, the best subtree of the fully grown tree.
Summary — When to Use What
| Use Case | Recommended Model | Why | Key Parameter |
|---|---|---|---|
| Credit approval (EU) | CORELS Rule List | GDPR, legal appeal, optimal | λ=0.01 |
| Medical triage | Scorecard / BRL | No laptop needed, fast, auditable | Integer points |
| Fraud detection (explain) | Decision Tree + SHAP | Need high accuracy + local explain | max_depth=4 |
| Regulatory audit | Decision Tree | Structure can be printed & reviewed | max_leaf_nodes=15 |
| Research / exploration | Greedy Rule List | Fast, interpretable, good baseline | max_depth=4 |
| High accuracy needed | RF/XGB + surrogate DT | Black box accuracy, DT for explanation | max_depth=3 surrogate |
| Combining rules + linear | RuleFit | Sparse linear on rule features | max_rules=10 |
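The surrogate approach in the table can be sketched as follows: train a black-box model, then fit a shallow tree to the black box's predictions rather than to the true labels, and report how often the two agree (fidelity). The dataset, model sizes and split are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 1. Black-box model trained on the real labels
black_box = RandomForestClassifier(n_estimators=200, random_state=0)
black_box.fit(X_tr, y_tr)

# 2. Shallow surrogate trained to imitate the black box's predictions
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X_tr, black_box.predict(X_tr))

# 3. Fidelity: agreement between surrogate and black box on unseen data
fidelity = accuracy_score(black_box.predict(X_te), surrogate.predict(X_te))
print(f"Surrogate fidelity: {fidelity:.3f}")
print(f"Black-box accuracy: {black_box.score(X_te, y_te):.3f}")
```

Fidelity below 100% means the surrogate's rules describe the black box only approximately, so the surrogate should be reported together with its fidelity score.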
Decision Trees and Rule Lists are not just legacy models. They are the gold standard of trustworthy AI for high-stakes decisions. Together with other intrinsically interpretable models such as scorecards, they are the models for which an explanation is not an approximation: it IS the model. In 2024 and beyond, as regulation tightens and AI decisions affect more lives, the data scientist who can build an accurate, fair, and genuinely interpretable model will always be more valuable than one who builds a marginally more accurate black box nobody can audit.