What Is XAI — and Why Did It Become Urgent?
The problem? Nobody could explain why COMPAS gave a specific score to a specific person. A Black defendant who had never committed a violent crime was rated "high risk." A white defendant with a violent history was rated "low risk." When lawyers demanded an explanation, the company said the algorithm was proprietary.
This case, alongside the EU's GDPR (in force since 2018), whose Article 22 on automated decision-making is often described as a "right to explanation", ignited the field of Explainable Artificial Intelligence (XAI). The question was no longer "is the model accurate?" but "can the model explain itself?"
Explainable AI (XAI) is the collection of processes, methods, and tools that help humans understand, interpret, and trust the outputs of machine learning models — especially deep neural networks. It sits at the intersection of machine learning, cognitive science, and human–computer interaction. XAI does not require a model to be simple; it requires a model (or a companion system) to produce human-understandable reasons for its outputs.
GDPR Article 22 (2018): individuals have the right not to be subject to decisions based solely on automated processing that significantly affect them; related provisions (Articles 13 to 15) entitle them to meaningful information about the logic involved. EU AI Act (2024): "high-risk" AI systems (medical, legal, credit) must provide documentation of how decisions are made. FDA guidance for AI/ML-based software as medical devices emphasises transparency in model behaviour. XAI is no longer purely academic: in regulated domains it is increasingly a legal or regulatory expectation.
Scope: Global vs Local
Timing: Ante-hoc vs Post-hoc
Access: White-box vs Black-box
Output: Feature / Example / Rule
Every XAI method can be classified along these four axes. Understanding the axes helps you choose the right tool for the right job.
The XAI Taxonomy — Four Axes Every Practitioner Must Know
| XAI Method | Scope | Timing | Access | Output | Faithful? |
|---|---|---|---|---|---|
| Integrated Gradients | Local | Post-hoc | White-box | Attribution scores | High (axiomatic) |
| Grad-CAM | Local | Post-hoc | White-box | Spatial heatmap | High |
| LIME | Local | Post-hoc | Black-box | Linear surrogate | Medium (local approx) |
| SHAP | Local + Global | Post-hoc | Both | Shapley values | High (game-theoretic) |
| Probing Classifiers | Global (per layer) | Post-hoc | White-box | Concept accuracy | Medium (linear only) |
| Counterfactual | Local | Post-hoc | Both | "What-if" examples | Medium |
| Attention Weights | Local | Post-hoc | White-box | Routing weights | Low (not causal) |
| Decision Tree (ante-hoc) | Global | Ante-hoc | White-box | If-then rules | Perfect (model is the explanation) |
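One way to put the table to work is to encode it as data and filter on the axes. The sketch below is illustrative only: the `Method` record and the `select` helper are hypothetical names introduced here, and the rows simply restate the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    name: str
    scope: frozenset    # subset of {"local", "global"}
    timing: str         # "post-hoc" or "ante-hoc"
    access: frozenset   # subset of {"white-box", "black-box"}
    output: str


def _s(*items):
    return frozenset(items)


# Rows copied from the taxonomy table ("Both" becomes a two-element set).
TAXONOMY = [
    Method("Integrated Gradients", _s("local"), "post-hoc", _s("white-box"), "attribution scores"),
    Method("Grad-CAM", _s("local"), "post-hoc", _s("white-box"), "spatial heatmap"),
    Method("LIME", _s("local"), "post-hoc", _s("black-box"), "linear surrogate"),
    Method("SHAP", _s("local", "global"), "post-hoc", _s("white-box", "black-box"), "Shapley values"),
    Method("Probing Classifiers", _s("global"), "post-hoc", _s("white-box"), "concept accuracy"),
    Method("Counterfactual", _s("local"), "post-hoc", _s("white-box", "black-box"), "what-if examples"),
    Method("Attention Weights", _s("local"), "post-hoc", _s("white-box"), "routing weights"),
    Method("Decision Tree", _s("global"), "ante-hoc", _s("white-box"), "if-then rules"),
]


def select(scope=None, timing=None, access=None):
    """Return the names of methods matching every axis that was specified."""
    return [
        m.name for m in TAXONOMY
        if (scope is None or scope in m.scope)
        and (timing is None or timing == m.timing)
        and (access is None or access in m.access)
    ]


# Example: a single prediction from a remote API must be explained.
print(select(scope="local", access="black-box"))
# → ['LIME', 'SHAP', 'Counterfactual']
```

Filtering on scope and access first tends to narrow the choice quickly; the faithfulness column then helps decide among the remaining candidates.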
Faithfulness vs Plausibility — The Central Tension of XAI
Faithfulness is the honest map: the explanation accurately reflects what the model actually computed. Plausibility is the pretty map: the explanation aligns with human intuitions about what should matter, even if the model used something different.
This is the central tension of XAI. A plausible-but-unfaithful explanation gives humans false confidence. A faithful-but-implausible explanation may be ignored. The best methods — Integrated Gradients, SHAP — try to maximise both.
| Token | Attribution | Human Intuition |
|---|---|---|
| "excellent" | +0.82 | ✅ Makes sense |
| "but" | +0.03 | ✅ Makes sense |
| "boring" | −0.71 | ✅ Makes sense |
| "the" | +0.44 | ❌ Suspicious |
| Token | IG Attribution | Completeness Check |
|---|---|---|
| "excellent" | +0.79 | ✅ Sums to F(x)−F(x') |
| "but" | −0.12 | ✅ |
| "boring" | −0.68 | ✅ |
| "the" | +0.01 | ✅ Negligible |
XAI Method 1 — Gradient-Based Saliency (Local, Post-hoc, White-box)
In the XAI taxonomy, gradient-based saliency maps are local, post-hoc, white-box attribution methods. They answer the question: which input features contributed most to this specific prediction? The "attribution" is a score assigned to each input dimension (pixel, token, tabular feature) reflecting its contribution to the output.
Vanilla Gradient Saliency
import torch
import torchvision.transforms as T
import matplotlib.pyplot as plt
from torchvision.models import resnet50, ResNet50_Weights
from PIL import Image
# ── XAI Step 1: Load model + image ───────────────────────────
model = resnet50(weights=ResNet50_Weights.DEFAULT)
model.eval()
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225])
])
# X-rays are often single-channel; convert to RGB so the 3-channel normalisation applies
img = Image.open("chest_xray.jpg").convert("RGB")
x = transform(img).unsqueeze(0)
x.requires_grad_(True)
# ── XAI Step 2: Forward pass — get prediction ─────────────────
logits = model(x)
target_class = logits.argmax().item()
confidence = torch.softmax(logits, dim=1)[0, target_class].item()
# ── XAI Step 3: Backward — compute attribution map ───────────
model.zero_grad()
logits[0, target_class].backward()
saliency = x.grad.detach().abs()  # .detach() is the supported replacement for .data
saliency, _ = torch.max(saliency, dim=1)
saliency = saliency.squeeze().numpy()
# ── XAI Step 4: Generate explanation object ───────────────────
xai_explanation = {
"method": "vanilla_gradient",
"scope": "local",
"predicted_class": target_class,
"confidence": confidence,
"attribution": saliency,
"xai_type": "post-hoc / white-box / feature attribution"
}
print(f"Predicted class : {xai_explanation['predicted_class']}")
print(f"Confidence : {xai_explanation['confidence']:.4f}")
print(f"Explanation type: {xai_explanation['xai_type']}")
print(f"Attribution shape: {saliency.shape}")
Integrated Gradients — the XAI Gold Standard
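In XAI, Integrated Gradients (IG) is often considered the most principled gradient method because of its axiomatic grounding. Sundararajan et al. (2017) derive it from two axioms, Sensitivity and Implementation Invariance, and show that it also satisfies Completeness: the attributions sum to the difference between the model's output at the input and at the baseline. These properties are not just nice-to-have; they give a formal, checkable criterion for what an attribution method should guarantee.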
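For reference, the quantity that the code in this section approximates can be written out as follows. The notation is an assumption made here for illustration: F is the model output for the target class, x the input, x' the baseline, and m the number of interpolation steps.

```latex
\mathrm{IG}_i(x) \;=\; (x_i - x'_i)\int_{0}^{1}
  \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\,d\alpha
\;\approx\;
(x_i - x'_i)\,\frac{1}{m}\sum_{k=1}^{m}
  \frac{\partial F\bigl(x' + \tfrac{k}{m}(x - x')\bigr)}{\partial x_i},
\qquad
\sum_i \mathrm{IG}_i(x) \;=\; F(x) - F(x').
```

The last identity is the completeness property; because the integral is replaced by a finite sum, it holds only up to a discretisation error that shrinks as m grows.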
import torch
import numpy as np
def integrated_gradients_xai(model, x, target_class,
                             baseline=None, n_steps=50, tol=0.05):
    """
    XAI-complete Integrated Gradients.
    Returns attribution map AND verifies completeness axiom.
    """
    x = x.detach()  # drop autograd history so the results convert to NumPy
    if baseline is None:
        # zeros in normalised space (the dataset mean colour) = "no information"
        baseline = torch.zeros_like(x)
    delta = x - baseline
    alphas = torch.linspace(0.0, 1.0, n_steps)
    grads = []
    for a in alphas:
        x_a = (baseline + a * delta).detach().requires_grad_(True)
        out = model(x_a)
        model.zero_grad()
        out[0, target_class].backward()
        grads.append(x_a.grad.detach().clone())
    # Trapezoidal approximation of the integral over alpha in [0, 1]
    grads = torch.stack(grads)
    avg_grads = ((grads[:-1] + grads[1:]) / 2).mean(dim=0)
    ig_attrs = delta * avg_grads  # shape: [1, 3, H, W]
    # ── XAI Verification: Completeness Axiom ─────────────────────
    with torch.no_grad():
        f_x = model(x)[0, target_class].item()
        f_baseline = model(baseline)[0, target_class].item()
    expected_sum = f_x - f_baseline
    actual_sum = ig_attrs.sum().item()
    completeness_error = abs(expected_sum - actual_sum)
    print("XAI Completeness Audit:")
    print(f"  F(x)      = {f_x:.4f}")
    print(f"  F(x')     = {f_baseline:.4f}")
    print(f"  Expected  = {expected_sum:.4f}")
    print(f"  IG sum    = {actual_sum:.4f}")
    # If the error exceeds the tolerance, increase n_steps
    print(f"  Error     = {completeness_error:.4f} (tolerance {tol})")
    print(f"  Axiom OK? = {'✓' if completeness_error < tol else '✗'}")
    ig_map, _ = torch.max(ig_attrs.abs(), dim=1)
    return ig_map.squeeze().numpy(), ig_attrs
ig_vis, ig_full = integrated_gradients_xai(model, x, target_class)
Grad-CAM — Spatial XAI for CNNs
Grad-CAM's XAI output directly led to dataset cleaning and retraining. Without the spatial explanation, this Clever Hans effect would have survived deployment.
import torch
import torch.nn as nn
import cv2
import numpy as np
class GradCAM_XAI:
    """
    Grad-CAM wrapped as an XAI explanation object.
    Produces: heatmap and XAI metadata dict.
    """
    def __init__(self, model, target_layer):
        self.model = model
        self.activations = None
        self.gradients = None
        target_layer.register_forward_hook(
            lambda m, i, o: setattr(self, 'activations', o.detach()))
        # register_backward_hook is deprecated; use the "full" variant
        target_layer.register_full_backward_hook(
            lambda m, gi, go: setattr(self, 'gradients', go[0].detach()))
    def explain(self, x, class_idx=None):
        # Forward
        logits = self.model(x)
        if class_idx is None:
            class_idx = logits.argmax().item()
        # Backward
        self.model.zero_grad()
        logits[0, class_idx].backward()
        # Grad-CAM computation: channel weights = spatially averaged gradients
        alpha = self.gradients.mean(dim=[2, 3], keepdim=True)
        cam = nn.functional.relu((alpha * self.activations).sum(dim=1))
        # Min-max normalise to [0, 1] (shift first, then scale by the new max)
        cam = cam - cam.min()
        cam = cam / (cam.max() + 1e-8)
        cam_np = cam.squeeze().cpu().numpy()
        # Upsample to input resolution; cv2.resize expects (width, height)
        cam_up = cv2.resize(cam_np, (x.shape[-1], x.shape[-2]))
        # ── XAI metadata ─────────────────────────────────────────────
        xai_meta = {
            "method": "Grad-CAM",
            "scope": "local",
            "timing": "post-hoc",
            "access": "white-box",
            "predicted_cls": class_idx,
            "confidence": torch.softmax(logits, dim=1)[0, class_idx].item(),
            "heatmap_res": cam_up.shape,
            "max_activation": float(cam_up.max()),
        }
        return cam_up, xai_meta
gcam = GradCAM_XAI(model, model.layer4[2].conv3)
heatmap, meta = gcam.explain(x)
for k, v in meta.items():
    print(f"  {k:20s}: {v}")
Each stage transforms the raw signal into a progressively more human-readable XAI explanation.
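For reference, the core of the `explain` method can be stated compactly using the usual Grad-CAM definition. The symbols are introduced here for illustration: y^c is the logit of class c, A^k is the k-th feature map of the target layer, and Z is the number of spatial positions in that map.

```latex
\alpha_k^{c} \;=\; \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial y^{c}}{\partial A^{k}_{ij}},
\qquad
L^{c}_{\text{Grad-CAM}} \;=\; \mathrm{ReLU}\!\Bigl(\sum_{k}\alpha_k^{c}\,A^{k}\Bigr).
```

The ReLU keeps only regions that push the class score up. The min-max normalisation and upsampling in the code are presentation steps and are not part of the definition.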
XAI Method 2 — LIME: Black-box Local Explanations
This is LIME (Local Interpretable Model-Agnostic Explanations, Ribeiro et al. 2016). It is the definitive black-box XAI method: it treats the model as a sealed oracle, generates perturbed inputs around the instance of interest, and fits a simple interpretable model (linear regression, decision tree) on the oracle's responses. That simple model is the explanation.
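The three steps just described (perturb, query, fit a weighted surrogate) can be sketched from scratch in a few lines. This is a simplified illustration, not the `lime` package's implementation: the function name `lime_sketch`, the Gaussian perturbation scheme, the kernel settings, and the toy `black_box` function are all assumptions made for this example.

```python
import numpy as np


def lime_sketch(predict_fn, x, n_samples=2000, sigma=1.0, kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear surrogate around x; return (weights, intercept)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    d = x.size
    # 1. Perturb: sample points around the instance of interest
    Z = x + sigma * rng.standard_normal((n_samples, d))
    # 2. Query the black box (only its outputs are used)
    y = np.asarray(predict_fn(Z), dtype=float)
    # 3. Weight samples by proximity to x with an exponential kernel
    dist = np.linalg.norm(Z - x, axis=1)
    width = kernel_width * np.sqrt(d)
    w = np.exp(-(dist ** 2) / width ** 2)
    # 4. Weighted least squares on centred features plus a bias column
    A = np.hstack([Z - x, np.ones((n_samples, 1))])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[:-1], coef[-1]


def black_box(Z):
    """Toy oracle: depends on features 0 and 1, ignores feature 2."""
    return 1.0 / (1.0 + np.exp(-(3.0 * Z[:, 0] - 2.0 * Z[:, 1] ** 2)))


weights, intercept = lime_sketch(black_box, x=[0.0, 1.0, 5.0])
for i, wt in enumerate(weights):
    print(f"feature {i}: local weight {wt:+.3f}")
```

The surrogate should assign a positive weight to feature 0, a negative weight to feature 1 near this instance, and a weight close to zero to the ignored feature 2. Changing `seed` or `sigma` shifts the numbers, which is the instability that local surrogates are known for.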
import lime
import lime.lime_tabular
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# ── Black-box model (LIME treats this as opaque) ──────────────
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
# ── XAI: Set up LIME explainer ────────────────────────────────
explainer = lime.lime_tabular.LimeTabularExplainer(
X_train,
feature_names = data.feature_names,
class_names = data.target_names,
discretize_continuous = True,
mode = 'classification'
)
# ── XAI: Explain a single prediction (local explanation) ──────
instance_idx = 0
instance = X_test[instance_idx]
true_label = data.target_names[y_test[instance_idx]]
pred_proba = rf.predict_proba([instance])[0]
pred_label = data.target_names[rf.predict([instance])[0]]
exp = explainer.explain_instance(
instance,
rf.predict_proba,
num_features = 8, # top 8 features in explanation
num_samples = 3000 # perturbed samples for surrogate
)
print(f"True label : {true_label}")
print(f"Predicted : {pred_label} (p={pred_proba.max():.3f})")
print(f"\nXAI Local Explanation (LIME) — top features:")
print(f"{'Feature':35s} {'Weight':>8s} {'Direction'}")
print("─"*60)
for feat, weight in exp.as_list():
    direction = "→ MALIGNANT" if weight < 0 else "→ BENIGN"
    print(f"{feat:35s} {weight:+8.4f} {direction}")
Strength: LIME is truly model-agnostic — it works on any black-box (scikit-learn, TensorFlow, XGBoost, even a remote API).
It produces human-readable if/then feature descriptions.
Limitation: LIME is a local approximation. The surrogate model is only faithful near x.
Two runs with different random seeds can produce different top features.
For high-stakes XAI (medical, legal), always cross-validate LIME explanations with SHAP or gradient methods.
XAI Method 3 — SHAP: Game-Theoretic Explanations
SHAP (SHapley Additive exPlanations) unifies several XAI methods under a single framework grounded in cooperative game theory. Each feature is treated as a "player" in a game; its Shapley value is the fair contribution — the average marginal contribution across all possible feature coalitions.
The Shapley value was introduced by Lloyd Shapley in 1953; he later shared the 2012 Nobel Memorial Prize in Economic Sciences, which was awarded for his work on stable allocations and market design. If three workers (A, B, C) together earn £100, how do you fairly split it? The Shapley value gives each worker their average marginal contribution across every possible ordering of team formation. SHAP applies this to features: the "game" is the prediction, and each "player" is a feature. The result satisfies Efficiency (attributions sum to the prediction minus the average prediction), Symmetry, Dummy (non-contributing features get zero), and Linearity. These four axioms give SHAP one of the strongest theoretical foundations of any attribution method in XAI.
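The three-worker example can be computed exactly by enumerating every ordering. This is a minimal sketch: the per-coalition earnings in `EARNINGS` are hypothetical numbers chosen for illustration (only the grand total of £100 comes from the text), and `shapley_values` is a name introduced here.

```python
from itertools import permutations


def shapley_values(players, value):
    """Average each player's marginal contribution over all join orders."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical earnings (in £) for every possible team.
EARNINGS = {
    frozenset(): 0,
    frozenset("A"): 10, frozenset("B"): 20, frozenset("C"): 30,
    frozenset("AB"): 50, frozenset("AC"): 60, frozenset("BC"): 70,
    frozenset("ABC"): 100,
}

phi = shapley_values("ABC", lambda team: EARNINGS[frozenset(team)])
for worker, share in phi.items():
    print(f"{worker}: £{share:.2f}")
print(f"Total: £{sum(phi.values()):.2f}")  # Efficiency: shares add up to the full £100
```

Exact enumeration costs n! orderings, so it is only feasible for a handful of players. SHAP's explainers rely on model-specific algorithms or sampling to keep the computation tractable for real feature counts.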
import shap
import numpy as np
import matplotlib.pyplot as plt
# ── XAI: SHAP TreeExplainer (exact, fast for tree models) ─────
explainer_shap = shap.TreeExplainer(rf)
shap_values = explainer_shap.shap_values(X_test)
# Older SHAP releases return a list with one [n_samples, n_features] array per class;
# newer ones return a single [n_samples, n_features, n_classes] array. Handle both.
# Note: in scikit-learn's breast cancer data, class 0 = malignant and class 1 = benign.
if isinstance(shap_values, list):
    shap_class1 = shap_values[1]
else:
    shap_class1 = shap_values[:, :, 1]
# ── XAI Report: Single instance explanation ───────────────────
instance_shap = shap_class1[instance_idx]  # Shapley values for class 1 (benign)
feature_names = data.feature_names
# The efficiency axiom is stated relative to the explainer's own base value
base_value = np.atleast_1d(explainer_shap.expected_value)[1]
print("SHAP Efficiency check:")
print(f"  Σ SHAP values   = {instance_shap.sum():.4f}")
print(f"  f(x) − E[f(x)]  = {pred_proba[1] - base_value:.4f}")
print()
print("SHAP Local XAI Explanation:")
print(f"{'Feature':35s} {'SHAP Value':>10s}")
print("─"*50)
sorted_idx = np.argsort(np.abs(instance_shap))[:-9:-1]  # top 8 by magnitude
for i in sorted_idx:
    sign = "▲" if instance_shap[i] > 0 else "▼"
    print(f"{feature_names[i]:35s} {sign} {instance_shap[i]:+.4f}")
# ── XAI: Global summary — beeswarm plot ───────────────────────
# plot_type="dot" is summary_plot's beeswarm-style view
shap.summary_plot(shap_class1, X_test,
                  feature_names=data.feature_names,
                  plot_type="dot", show=False)
plt.savefig("shap_global_xai.png", bbox_inches="tight", dpi=150)
XAI Method 4 — Probing Classifiers: What Does the Network Know?
Think of a neural network's layers as geological strata. An archaeologist doesn't just look at the surface (the output); they dig at specific depths and ask what artefacts are buried here? Probing is the XAI equivalent: you freeze the model, extract hidden states at layer k, and train a tiny linear classifier to answer a specific question about those representations.
If the probe achieves high accuracy, the property is encoded. This is representation-level XAI — understanding not just what the model decided, but what the model knows.
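[Figure: a frozen network whose layers run from surface features through syntax and semantics to task-specific features. A linear probe (W·h + b → label) attached at layer k asks: does layer k encode property P?]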
XAI probing answer: if probe accuracy ≫ majority-class baseline → the property P is encoded in layer k's representation.
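The decision rule above can be tried on synthetic data where the ground truth is known. The sketch below is an illustration under stated assumptions: the "hidden states" are random vectors generated here, the probe is a least-squares linear classifier rather than logistic regression, and `probe_accuracy` is a hypothetical helper name.

```python
import numpy as np


def probe_accuracy(H, y, seed=0):
    """Fit a least-squares linear probe on half the data; score it on the other half."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    train, test = idx[:half], idx[half:]
    A = np.hstack([H, np.ones((len(y), 1))])  # add a bias column
    w, *_ = np.linalg.lstsq(A[train], 2.0 * y[train] - 1.0, rcond=None)
    pred = (A[test] @ w > 0).astype(int)
    return float((pred == y[test]).mean())


rng = np.random.default_rng(0)
n, d = 600, 32
y = rng.integers(0, 2, size=n)           # binary property P
direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)

# "Layer" that encodes P along one direction, and one that does not encode it at all
H_encoded = rng.standard_normal((n, d)) + 4.0 * np.outer(2 * y - 1, direction)
H_absent = rng.standard_normal((n, d))

majority = max(y.mean(), 1 - y.mean())
acc_encoded = probe_accuracy(H_encoded, y)
acc_absent = probe_accuracy(H_absent, y)
print(f"majority baseline : {majority:.3f}")
print(f"probe, P encoded  : {acc_encoded:.3f}")
print(f"probe, P absent   : {acc_absent:.3f}")
```

Only the first representation should beat the majority baseline by a wide margin. Scoring on held-out data matters here: with enough dimensions, a probe can fit noise on its training half even when the property is absent.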
Probing BERT — Layer-by-Layer XAI Analysis
import torch
import numpy as np
from transformers import BertModel, BertTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# ── XAI Setup: load frozen BERT ───────────────────────────────
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased',
output_hidden_states=True)
bert.eval()
# Freeze all parameters — XAI probing never modifies the model
for p in bert.parameters():
    p.requires_grad = False
sentences = [
("The cat sat on the mat", ['DET','NOUN','VERB','ADP','DET','NOUN']),
("Dogs bark loudly", ['NOUN','VERB','ADV']),
("She quickly ran away", ['PRON','ADV','VERB','ADV']),
("Beautiful flowers bloom", ['ADJ','NOUN','VERB']),
("He rarely speaks clearly", ['PRON','ADV','VERB','ADV']),
]
pos_to_id = {'DET':0,'NOUN':1,'VERB':2,'ADP':3,'ADV':4,'PRON':5,'ADJ':6}
def extract_reps(layer_idx):
    """Collect per-token hidden states from one layer, paired with POS label ids."""
    all_h, all_y = [], []
    with torch.no_grad():
        for sent, labels in sentences:
            toks = tokenizer(sent, return_tensors='pt')
            out = bert(**toks)
            # Skip [CLS]; assumes every word maps to exactly one WordPiece token
            hidden = out.hidden_states[layer_idx][0][1:1+len(labels)]
            all_h.extend(hidden.numpy())
            all_y.extend([pos_to_id[l] for l in labels])
    return np.array(all_h), np.array(all_y)
# ── XAI: Probe each BERT layer, report selectivity score ──────
# NOTE: 20 tokens is a toy dataset; the numbers only illustrate the procedure.
probe = Pipeline([('sc',StandardScaler()),('lr',LogisticRegression(max_iter=800))])
rng = np.random.default_rng(0)  # fixed seed so the shuffled-label control is repeatable
print("Layer │ Real Acc │ Random Acc │ Selectivity │ XAI Verdict")
print("──────┼──────────┼────────────┼─────────────┼─────────────────")
for layer in [0,1,3,6,9,12]:
    X, y = extract_reps(layer)
    real_acc = cross_val_score(probe, X, y, cv=3).mean()
    # Control: same probe, same representations, shuffled labels
    rand_acc = cross_val_score(probe, X, rng.permutation(y), cv=3).mean()
    selectivity = real_acc - rand_acc
    verdict = "✓ ENCODED" if selectivity > 0.35 else "~ WEAK"
    print(f"  {layer:2d}  │  {real_acc:.3f}   │   {rand_acc:.3f}    │    {selectivity:.3f}    │ {verdict}")
XAI finding: POS peaks mid-network; semantic role labelling peaks at final layers — BERT spontaneously encodes a linguistic hierarchy.
XAI Method 5 — Counterfactual Explanations
A counterfactual explanation answers the question: "What is the smallest change to this input that would flip the prediction?" This is arguably the most human-intuitive XAI output — it is how people naturally reason about causality.
This is the actionable, human-centred explanation that GDPR's "right to explanation" envisions. Saliency tells you why now. Counterfactuals tell you how to change the outcome.
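For a linear scoring model the smallest change has a closed form, which makes the idea concrete. The sketch below assumes a hypothetical linear credit scorer f(x) = w·x + b over standardised features, with approval when f(x) > 0; the weights, the applicant, and the function name `linear_counterfactual` are invented for illustration.

```python
import numpy as np


def linear_counterfactual(w, b, x, margin=1e-3):
    """Smallest L2 change to x that moves a linear score just across the boundary."""
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    score = w @ x + b
    # Aim just past the boundary on the opposite side of the current decision
    target = margin if score < 0 else -margin
    # Moving along w is the shortest path to the hyperplane w·x + b = target
    return x + (target - score) / (w @ w) * w


# Hypothetical model and applicant (standardised features).
feature_names = ["income", "dti_ratio", "emp_months", "credit_score"]
w = np.array([0.8, -1.5, 0.3, 1.0])
b = -0.2
x = np.array([-0.5, 1.2, -0.8, 0.1])

x_cf = linear_counterfactual(w, b, x)
print(f"original score       : {w @ x + b:+.3f}")
print(f"counterfactual score : {w @ x_cf + b:+.3f}")
for name, before, after in zip(feature_names, x, x_cf):
    print(f"{name:13s} {before:+.2f} -> {after:+.2f}")
```

The closed form changes every feature in proportion to its weight. Real counterfactual generators add constraints on top of this, such as keeping immutable features fixed or preferring sparse changes, which is why iterative optimisation is used for non-linear models.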
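import numpy as np
import torch
from scipy.optimize import minimize
def generate_counterfactual(model_fn, x, target_class,
                            scale=None, lam=0.5, max_iter=500):
    """
    XAI Counterfactual generator using L-BFGS-B optimisation.
    Finds a small change to x that pushes the prediction towards target_class.
    `scale` is a per-feature unit (e.g. standard deviation) so that the
    proximity term is not dominated by large-valued features.
    """
    x = np.asarray(x, dtype=np.float64)
    scale = np.ones_like(x) if scale is None else np.asarray(scale, dtype=np.float64)
    x_t = torch.tensor(x, dtype=torch.float32)
    s_t = torch.tensor(scale, dtype=torch.float32)
    def objective(x_prime):
        # Return loss and exact gradient; finite differences are unreliable in float32
        xp = torch.tensor(x_prime, dtype=torch.float32, requires_grad=True)
        proba = torch.softmax(model_fn(xp.unsqueeze(0)), dim=1)
        pred_loss = 1 - proba[0, target_class]         # push towards target
        prox_loss = (((xp - x_t) / s_t) ** 2).sum()    # stay close to x
        loss = lam * pred_loss + (1 - lam) * prox_loss
        loss.backward()
        return loss.item(), xp.grad.numpy().astype(np.float64)
    result = minimize(objective, x.copy(), jac=True, method='L-BFGS-B',
                      options={'maxiter': max_iter})
    x_cf = result.x
    cf_changes = np.where(np.abs(x_cf - x) / scale > 0.01)[0]
    return x_cf, cf_changes
# ── Example: loan application XAI counterfactual ─────────────
# Assumes `loan_model` is a trained PyTorch classifier over these four features
# that returns logits for [denied, approved]; it is not the ResNet used earlier.
feature_names = ['income_k', 'dti_ratio', 'emp_months', 'credit_score']
x_denied = np.array([38.0, 0.61, 8.0, 680.0])  # denied applicant
feature_scale = np.array([20.0, 0.15, 24.0, 60.0])  # illustrative units, not fitted values
x_cf, changed_features = generate_counterfactual(
    loan_model, x_denied, target_class=1, scale=feature_scale)  # class 1 = approved
print("XAI Counterfactual Explanation:")
print(f"{'Feature':15s} {'Original':>10s} {'Counterfactual':>14s} {'Change'}")
print("─"*58)
for i, name in enumerate(feature_names):
    change = ""
    if i in changed_features:
        diff = x_cf[i] - x_denied[i]
        change = f"{'▲' if diff > 0 else '▼'} {abs(diff):.2f}"
    print(f"{name:15s} {x_denied[i]:>10.2f} {x_cf[i]:>14.2f} {change}")
# Report what the optimiser actually changed instead of a fixed conclusion
changed = [feature_names[i] for i in changed_features]
print(f"\n→ Features changed: {', '.join(changed) if changed else 'none'}")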
Evaluating XAI: How Do You Know an Explanation Is Good?
Producing an explanation is easy. Producing a good explanation is hard. The XAI community has converged on a multi-dimensional evaluation framework. No single metric is sufficient.
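Faithfulness (sufficiency test): keep only the top-k features; does the prediction hold? The complementary necessity test removes the top-k features and checks that the prediction drops.
Measured by: the area over the perturbation curve (AOPC), or retraining-based variants such as ROAR.
Selectivity: distinguishes genuine representation encoding from probe over-fitting. As a rule of thumb, a selectivity above roughly 0.35 to 0.4 is treated as meaningful here.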
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
def faithfulness_aopc(model_fn, x, attribution_map, n_steps=10):
    """
    XAI Faithfulness metric: AOPC (Area Over the Perturbation Curve).
    Progressively mask top-k features; measure average prediction drop.
    High AOPC → explanation is faithful to the model's actual reasoning.
    """
    flat_attrs = attribution_map.flatten()
    sorted_idx = np.argsort(flat_attrs)[::-1]  # descending importance
    n_features = len(flat_attrs)
    step_size = max(1, n_features // n_steps)  # guard against n_features < n_steps
    baseline_pred = model_fn(x).item()
    drops = []
    for k in range(1, n_steps + 1):
        x_masked = x.copy()
        mask_idx = sorted_idx[:k * step_size]
        x_masked.flat[mask_idx] = 0  # zero out top-k features
        masked_pred = model_fn(x_masked).item()
        drops.append(baseline_pred - masked_pred)
    aopc = np.mean(drops)
    return aopc, drops
def selectivity_score(reps, labels, n_random=5):
    """XAI probing quality metric (after Hewitt & Liang 2019; shuffled labels as a simple control)."""
    probe = Pipeline([('sc',StandardScaler()),('lr',LogisticRegression(max_iter=500))])
    real_acc = cross_val_score(probe, reps, labels, cv=5).mean()
    rand_acc = np.mean([
        cross_val_score(probe, reps, np.random.permutation(labels), cv=5).mean()
        for _ in range(n_random)
    ])
    return real_acc, rand_acc, real_acc - rand_acc
# ── Run XAI evaluation suite ──────────────────────────────────
# Inputs come from the earlier sections: the ResNet `model`, image tensor `x`,
# `target_class`, the IG attributions `ig_full`, and `extract_reps` for BERT.
def predict_fn(arr):
    """Target-class probability for a NumPy image batch."""
    with torch.no_grad():
        return torch.softmax(model(torch.from_numpy(arr)), dim=1)[0, target_class]
x_flat = x.detach().numpy()
ig_attribution = ig_full.detach().numpy()
aopc, drops = faithfulness_aopc(predict_fn, x_flat, ig_attribution)
print(f"AOPC Faithfulness score : {aopc:.4f}")
# The 0.1 and 0.35 cut-offs below are rules of thumb, not standard thresholds
print(f"Interpretation : {'FAITHFUL ✓' if aopc > 0.1 else 'WEAK'}")
X_layer9, y_pos = extract_reps(9)
real, rand, sel = selectivity_score(X_layer9, y_pos)
print(f"\nProbing Selectivity : {sel:.3f}")
print(f"Interpretation : {'PROPERTY ENCODED ✓' if sel > 0.35 else 'INCONCLUSIVE'}")
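A useful sanity check for any faithfulness metric is to run it on a model whose true attributions are known. The sketch below is a self-contained illustration under stated assumptions: the linear model, its random weights, and the helper name `aopc_score` are invented here. For a linear model, weight × input is the exact contribution of each feature, so it should score at least as high as the same values assigned to the wrong features.

```python
import numpy as np


def aopc_score(model_fn, x, attribution, n_steps=10):
    """Average prediction drop as the highest-attributed features are zeroed out."""
    order = np.argsort(attribution.ravel())[::-1]  # most important first
    step = max(1, order.size // n_steps)
    base = model_fn(x)
    drops = []
    for k in range(1, n_steps + 1):
        x_masked = x.copy()
        x_masked.flat[order[: k * step]] = 0.0
        drops.append(base - model_fn(x_masked))
    return float(np.mean(drops))


rng = np.random.default_rng(0)
w = rng.standard_normal(50)   # hypothetical linear model weights
x = rng.standard_normal(50)   # one input


def model_fn(v):
    return float(w @ v)


faithful = w * x                      # exact per-feature contributions
shuffled = rng.permutation(faithful)  # same values, wrong features

aopc_faithful = aopc_score(model_fn, x, faithful)
aopc_shuffled = aopc_score(model_fn, x, shuffled)
print(f"AOPC (faithful attribution) : {aopc_faithful:.3f}")
print(f"AOPC (shuffled attribution) : {aopc_shuffled:.3f}")
```

Comparing against a shuffled or random attribution gives the score a reference point, which is usually more informative than a fixed threshold.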
Full XAI Pipeline: Diagnosing a Biased Medical Model
The full XAI pipeline revealed: Grad-CAM heatmaps consistently highlighted surgical ruler markings next to moles, rather than the lesion itself. SHAP global explanations flagged "image metadata" as a top contributor. Probing showed that layer 4 had learned to encode "presence of ruler" as a feature correlated with malignancy.
The explanation: clinicians photographed suspicious lesions with rulers for scale. The model learned ruler → suspicious → malignant. This Clever Hans effect was invisible in accuracy metrics. The XAI pipeline caught it before patients were harmed.
The XAI Toolkit — Libraries and When to Use Each
| Library | XAI Methods | Framework | Best For | Install |
|---|---|---|---|---|
| Captum | IG, DeepLIFT, Grad-CAM, SHAP, Occlusion | PyTorch native | White-box XAI for any PyTorch model | pip install captum |
| SHAP | Shapley values, SHAP, DeepSHAP, TreeSHAP | Framework-agnostic | Both local and global XAI; rigorous attribution | pip install shap |
| LIME | Local linear surrogate | Black-box / any | Explaining proprietary or remote APIs | pip install lime |
| Alibi Explain | Counterfactuals, Anchors, IG, SHAP | Production-grade | Regulated XAI pipelines (finance, medical) | pip install alibi |
| BertViz | Attention visualisation | Transformers | Interactive XAI for BERT/GPT attention | pip install bertviz |
| TransformerLens | Activation patching, circuit analysis | GPT-family | Mechanistic XAI for LLMs | pip install transformer-lens |
| tf-explain | Grad-CAM, SmoothGrad, Occlusion | TensorFlow / Keras | XAI callbacks during training | pip install tf-explain |
| DiCE | Diverse counterfactuals | Framework-agnostic | Multiple actionable XAI options per decision | pip install dice-ml |