Interpreting Neural Networks: Saliency & Probing

Section 01

What Is XAI — and Why Did It Become Urgent?

📖 Origin Story

The Judge, the Algorithm, and the Denied Parole

In 2016, ProPublica published an investigation into COMPAS — a commercial algorithm used by US courts to predict recidivism risk (the likelihood a defendant would re-offend). Judges across the country were letting COMPAS scores influence bail, sentencing, and parole decisions.

The problem? Nobody could explain why COMPAS gave a specific score to a specific person. A Black defendant who had never committed a violent crime was rated "high risk." A white defendant with a violent history was rated "low risk." When lawyers demanded an explanation, the company said the algorithm was proprietary.

This case, alongside the EU's GDPR, whose Article 22 on automated decision-making (often summarised as a "right to explanation") took effect in 2018, pushed Explainable Artificial Intelligence (XAI) to the centre of the field. The question was no longer only "is the model accurate?" but also "can the model explain itself?"

Explainable AI (XAI) is the collection of processes, methods, and tools that help humans understand, interpret, and trust the outputs of machine learning models — especially deep neural networks. It sits at the intersection of machine learning, cognitive science, and human–computer interaction. XAI does not require a model to be simple; it requires a model (or a companion system) to produce human-understandable reasons for its outputs.

⚖️

The Regulatory Push: GDPR, EU AI Act, and FDA Guidelines

GDPR Article 22 (2018): individuals have the right not to be subject to solely automated decisions that have legal or similarly significant effects, and related provisions call for meaningful information about the logic involved. EU AI Act (2024): "high-risk" AI systems (medical, legal, credit) must provide documentation of how decisions are made. FDA guidance for AI/ML-based software as medical devices emphasises transparency in model behaviour. XAI is no longer purely academic: in regulated domains it is increasingly a compliance requirement.

🌟 Animated — The XAI Landscape at a Glance

🧠 Explainable AI (XAI)

🌐 Scope: Global (overall model) vs Local (one prediction)
⌛ Timing: Ante-hoc (by design) vs Post-hoc (after training)
🔌 Model Access: White-box (gradients) vs Black-box (LIME / SHAP)
📋 Output Type: Feature (saliency / attribution), Example (counterfactual / prototype), or Rule

Every XAI method can be classified along these four axes. Understanding the axes helps you choose the right tool for the right job.

Section 02

The XAI Taxonomy — Four Axes Every Practitioner Must Know

🌐

Global Explanation

whole-model understanding

Describes how the model behaves on average across all inputs. Example: "This credit model relies most heavily on debt-to-income ratio." Tools: global feature importance, partial dependence plots, SHAP global summary.

📍

Local Explanation

single-prediction understanding

Describes why the model made this specific prediction for this specific input. Example: "Your loan was denied because your DTI is 0.61 and your employment history is < 1 year." Tools: LIME, SHAP force plot, saliency maps, counterfactuals.

🕐

Ante-hoc (Intrinsic)

interpretable by design

The model is interpretable as built, with no external tools needed. Examples: linear regression, decision trees, GAMs (Generalised Additive Models), and attention-based models with constrained architectures. Interpretability is high; accuracy is sometimes lower.

⌛

Post-hoc (External)

explanation after training

The model is taken as given; the explanation is generated externally after training. Saliency maps, SHAP, LIME, probing classifiers, and counterfactual generation all fall here. High flexibility: a post-hoc method exists for almost any model, although some (saliency, probing) need access to model internals.

🔌

White-box Methods

gradient access required

Requires access to model internals — gradients, weights, activations. Produces the most faithful explanations. Examples: Integrated Gradients, Grad-CAM, probing, activation maximisation.

■️

Black-box Methods

model-agnostic

Treats the model as an opaque function — only inputs and outputs are observed. Works on proprietary APIs, legacy systems, and any ML framework. Examples: LIME, KernelSHAP, Anchors, Partial Dependence Plots.

XAI Method	Scope	Timing	Access	Output	Faithful?
Integrated Gradients	Local	Post-hoc	White-box	Attribution scores	High (axiomatic)
Grad-CAM	Local	Post-hoc	White-box	Spatial heatmap	High
LIME	Local	Post-hoc	Black-box	Linear surrogate	Medium (local approx)
SHAP	Local + Global	Post-hoc	Both	Shapley values	High (game-theoretic)
Probing Classifiers	Global (per layer)	Post-hoc	White-box	Concept accuracy	Medium (linear only)
Counterfactual	Local	Post-hoc	Both	"What-if" examples	Medium
Attention Weights	Local	Post-hoc	White-box	Routing weights	Low (not causal)
Decision Tree (ante-hoc)	Global	Ante-hoc	White-box	If-then rules	Perfect (model is the explanation)
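As a rough illustration of how the four axes narrow the choice of tool, the table above can be encoded as data and filtered. This is only a sketch: the `XAIMethod` class and `candidates` helper are hypothetical names introduced here, and the rows are transcribed from the table, with "Both" expanded to a two-element set.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class XAIMethod:
    name: str
    scope: frozenset      # {"local"}, {"global"}, or both
    timing: str           # "post-hoc" or "ante-hoc"
    access: frozenset     # {"white-box"}, {"black-box"}, or both

# Rows transcribed from the table above ("Both" becomes a two-element set)
WB, BB = "white-box", "black-box"
METHODS = [
    XAIMethod("Integrated Gradients", frozenset({"local"}), "post-hoc", frozenset({WB})),
    XAIMethod("Grad-CAM", frozenset({"local"}), "post-hoc", frozenset({WB})),
    XAIMethod("LIME", frozenset({"local"}), "post-hoc", frozenset({BB})),
    XAIMethod("SHAP", frozenset({"local", "global"}), "post-hoc", frozenset({WB, BB})),
    XAIMethod("Probing Classifiers", frozenset({"global"}), "post-hoc", frozenset({WB})),
    XAIMethod("Counterfactual", frozenset({"local"}), "post-hoc", frozenset({WB, BB})),
    XAIMethod("Attention Weights", frozenset({"local"}), "post-hoc", frozenset({WB})),
    XAIMethod("Decision Tree", frozenset({"global"}), "ante-hoc", frozenset({WB})),
]

def candidates(scope, access, timing="post-hoc"):
    """Return names of methods matching the requested scope, access and timing."""
    return [m.name for m in METHODS
            if scope in m.scope and access in m.access and m.timing == timing]

# A remote API exposes only inputs and outputs, and we need per-prediction reasons
api_options = candidates(scope="local", access="black-box")
print(api_options)
```

Filtering on access first is usually the practical starting point, since it is the one axis the practitioner often cannot change.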

Section 03

Faithfulness vs Plausibility — The Central Tension of XAI

📖 Core Distinction

The Honest Map vs The Pretty Map

Imagine two maps of the same city. The first is technically accurate — every road is in the right place, every street labelled correctly — but it looks cluttered and confusing to a tourist. The second is beautifully simplified, highlights the main attractions, uses clear icons — but it omits some real roads and places a few landmarks slightly off.

Faithfulness is the honest map: the explanation accurately reflects what the model actually computed. Plausibility is the pretty map: the explanation aligns with human intuitions about what should matter, even if the model used something different.

This is the central tension of XAI. A plausible-but-unfaithful explanation gives humans false confidence. A faithful-but-implausible explanation may be ignored. The best methods — Integrated Gradients, SHAP — try to maximise both.

🛑 Plausible but Unfaithful

Token	Attribution	Human Intuition
"excellent"	+0.82	✅ Makes sense
"but"	+0.03	✅ Makes sense
"boring"	−0.71	✅ Makes sense
"the"	+0.44	❌ Suspicious

✅ Faithful (Integrated Gradients)

Token	IG Attribution	Completeness Check
"excellent"	+0.79	✅ Sums to F(x)−F(x')
"but"	−0.12	✅
"boring"	−0.68	✅
"the"	+0.01	✅ Negligible

Faithfulness

e ≈ f(x) — explanation reflects model

The explanation accurately mirrors the model's internal computation. Removing high-attribution features should significantly change the output.

Plausibility

e ≈ human — explanation aligns with priors

The explanation matches what domain experts or users believe should matter. High plausibility does not guarantee faithfulness.
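The deletion idea behind faithfulness ("removing high-attribution features should significantly change the output") can be sketched on a toy model. Everything here is an assumption for illustration: the bag-of-words weights, both explanations, and the `faithfulness_gap` helper are hypothetical, not taken from a real sentiment model.

```python
# Hypothetical bag-of-words sentiment model: score = sum of token weights
WEIGHTS = {"excellent": 0.8, "but": -0.1, "boring": -0.7, "the": 0.0}

def model(tokens):
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)

def deletion_effects(tokens):
    """Output change caused by deleting each (unique) token in turn."""
    full = model(tokens)
    return {t: full - model([u for u in tokens if u != t]) for t in tokens}

def faithfulness_gap(explanation, tokens):
    """Mean absolute gap between claimed attribution and measured deletion effect."""
    effects = deletion_effects(tokens)
    return sum(abs(explanation[t] - effects[t]) for t in tokens) / len(tokens)

tokens = ["excellent", "but", "boring", "the"]
# Explanation A reports what the model computes; B credits "the" with weight it never had
explanation_a = {"excellent": 0.8, "but": -0.1, "boring": -0.7, "the": 0.0}
explanation_b = {"excellent": 0.8, "but": 0.0, "boring": -0.7, "the": 0.4}

gap_a = faithfulness_gap(explanation_a, tokens)
gap_b = faithfulness_gap(explanation_b, tokens)
print(f"gap A = {gap_a:.3f}, gap B = {gap_b:.3f}")
```

A small gap means the claimed scores predict what actually happens when features are removed. Plausibility cannot be computed this way, because it depends on human judgement rather than on the model.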

Section 04

XAI Method 1 — Gradient-Based Saliency (Local, Post-hoc, White-box)

In the XAI taxonomy, gradient-based saliency maps are local, post-hoc, white-box attribution methods. They answer the question: which input features contributed most to this specific prediction? The "attribution" is a score assigned to each input dimension (pixel, token, tabular feature) reflecting its contribution to the output.

🛠 Gradient Saliency in the XAI Pipeline

Input

Any input x (image, text, tabular row) and a pre-trained neural network f(x).

Forward

Compute f(x): forward pass produces prediction score y_c for target class c.

Backward

Compute ∂y_c/∂x: gradient of prediction with respect to every input dimension.

Map

Absolute gradient |∂y_c/∂x| = saliency map. Large value → feature matters for XAI explanation.

Deliver

Render as heatmap overlay, token colour bar, or feature importance bar chart depending on modality.
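The forward / backward / map steps above can be sketched without any deep-learning library by replacing backpropagation with central finite differences. This is a stand-in under stated assumptions: the scorer `f` is hypothetical, and a real pipeline would obtain the gradient with autograd instead.

```python
import math

def f(x):
    """Hypothetical scorer: depends strongly on x[0], weakly on x[1], not on x[2]."""
    return math.tanh(3.0 * x[0] + 0.2 * x[1])

def saliency(fn, x, eps=1e-5):
    """|d fn / d x_i| for every input dimension, via central differences."""
    grads = []
    for i in range(len(x)):
        hi = list(x); hi[i] += eps
        lo = list(x); lo[i] -= eps
        grads.append(abs(fn(hi) - fn(lo)) / (2 * eps))
    return grads

x = [0.1, 0.5, 2.0]
sal = saliency(f, x)
# Rank input dimensions from most to least salient
ranking = sorted(range(len(x)), key=lambda i: sal[i], reverse=True)
print(ranking)
```

Note that x[2] has the largest raw value but zero saliency: the map measures sensitivity of the output, not the magnitude of the input.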

Vanilla Gradient Saliency

import torch
import torchvision.transforms as T
import matplotlib.pyplot as plt
from torchvision.models import resnet50, ResNet50_Weights
from PIL import Image

# ── XAI Step 1: Load model + image ───────────────────────────
model = resnet50(weights=ResNet50_Weights.DEFAULT)
model.eval()

transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                 std=[0.229, 0.224, 0.225])
])

img = Image.open("chest_xray.jpg").convert("RGB")   # force 3 channels; X-rays are often grayscale
x   = transform(img).unsqueeze(0)
x.requires_grad_(True)

# ── XAI Step 2: Forward pass — get prediction ─────────────────
logits       = model(x)
target_class = logits.argmax().item()
confidence   = torch.softmax(logits, dim=1)[0, target_class].item()

# ── XAI Step 3: Backward — compute attribution map ───────────
model.zero_grad()
logits[0, target_class].backward()

saliency = x.grad.detach().abs()             # |∂y_c/∂x|, shape [1, 3, 224, 224]
saliency, _ = torch.max(saliency, dim=1)     # max over colour channels
saliency = saliency.squeeze().numpy()        # shape (224, 224)

# ── XAI Step 4: Generate explanation object ───────────────────
xai_explanation = {
    "method":      "vanilla_gradient",
    "scope":       "local",
    "predicted_class": target_class,
    "confidence":  confidence,
    "attribution": saliency,
    "xai_type":    "post-hoc / white-box / feature attribution"
}

print(f"Predicted class : {xai_explanation['predicted_class']}")
print(f"Confidence      : {xai_explanation['confidence']:.4f}")
print(f"Explanation type: {xai_explanation['xai_type']}")
print(f"Attribution shape: {saliency.shape}")

OUTPUT

Predicted class : 417
Confidence      : 0.8831
Explanation type: post-hoc / white-box / feature attribution
Attribution shape: (224, 224)

Integrated Gradients — the XAI Gold Standard

In XAI, Integrated Gradients (IG) is widely regarded as one of the most principled gradient methods. Sundararajan et al. (2017) build it on two axioms, Sensitivity and Implementation Invariance, and show that it also satisfies Completeness. These are formal criteria that several simpler gradient methods fail to meet, not just nice-to-have properties.

✅

Completeness Axiom

Σ attributions = F(x) − F(x')

Attributions sum exactly to the output difference between input and baseline. No "missing credit" — every bit of the prediction is accounted for.

💡

Sensitivity Axiom

if x and x' differ only in feature i and f(x) ≠ f(x'), then attr_i ≠ 0

If the input and the baseline differ in a single feature and the outputs differ, that feature must receive non-zero attribution. Features that matter are never silently ignored.

🔄

Implementation Invariance

two identical functions → same attribution

Two networks that compute the same function must produce identical attributions, even if their architectures differ. Prevents implementation artefacts polluting the explanation.
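The Sensitivity and Completeness properties can be checked by hand on the one-variable saturating network f(x) = 1 − ReLU(1 − x) that Sundararajan et al. use as a motivating example, with baseline 0 and input 2. The sketch below is a simplification: it uses an analytic derivative and a midpoint Riemann sum, and the function names are chosen here for illustration.

```python
def f(x):
    """Saturating toy model: f(x) = 1 - ReLU(1 - x)."""
    return 1.0 - max(0.0, 1.0 - x)

def grad_f(x):
    """Analytic derivative: 1 while x < 1, 0 once the output has saturated."""
    return 1.0 if x < 1.0 else 0.0

def integrated_gradients_1d(x, baseline=0.0, n_steps=100):
    """Midpoint Riemann sum of the gradient along the straight path baseline -> x."""
    total = 0.0
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps
        total += grad_f(baseline + alpha * (x - baseline))
    return (x - baseline) * total / n_steps

x, baseline = 2.0, 0.0
plain_gradient = grad_f(x)                 # gradient at the input itself
ig = integrated_gradients_1d(x, baseline)  # gradient accumulated along the path
completeness_gap = abs(ig - (f(x) - f(baseline)))
print(plain_gradient, ig, completeness_gap)
```

The plain gradient at x = 2 is zero even though moving from the baseline to x changes the output, which is the Sensitivity failure of vanilla saliency. IG recovers the full output difference because it integrates over the region where the gradient was still non-zero.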

import torch
import numpy as np

def integrated_gradients_xai(model, x, target_class,
                               baseline=None, n_steps=50):
    """
    XAI-complete Integrated Gradients.
    Returns attribution map AND verifies completeness axiom.
    """
    if baseline is None:
        baseline = torch.zeros_like(x)   # black image = "no information"

    delta  = (x - baseline).detach()      # detach: x may carry requires_grad
    alphas = torch.linspace(0.0, 1.0, n_steps)

    grads = []
    for a in alphas:
        x_a = (baseline + a * delta).detach().requires_grad_(True)
        out = model(x_a)
        model.zero_grad()
        out[0, target_class].backward()
        grads.append(x_a.grad.detach().clone())

    # Trapezoidal approximation of the integral: average adjacent
    # gradient pairs, then average over the n_steps - 1 intervals
    grads      = torch.stack(grads)
    avg_grads  = ((grads[:-1] + grads[1:]) / 2).mean(dim=0)
    ig_attrs   = delta * avg_grads            # shape: [1, 3, H, W]

    # ── XAI Verification: Completeness Axiom ─────────────────────
    with torch.no_grad():
        f_x       = model(x)[0, target_class].item()
        f_baseline= model(baseline)[0, target_class].item()
    expected_sum = f_x - f_baseline
    actual_sum   = ig_attrs.sum().item()
    completeness_error = abs(expected_sum - actual_sum)

    print(f"XAI Completeness Audit:")
    print(f"  F(x)      = {f_x:.4f}")
    print(f"  F(x')     = {f_baseline:.4f}")
    print(f"  Expected  = {expected_sum:.4f}")
    print(f"  IG sum    = {actual_sum:.4f}")
    print(f"  Error     = {completeness_error:.4f} (should be < 0.05)")
    print(f"  Axiom OK? = {'✓' if completeness_error < 0.05 else '✗'}")

    ig_map, _ = torch.max(ig_attrs.abs(), dim=1)
    return ig_map.squeeze().numpy(), ig_attrs

ig_vis, ig_full = integrated_gradients_xai(model, x, target_class)

OUTPUT

XAI Completeness Audit:
  F(x)      = 4.7312
  F(x')     = -1.2044
  Expected  = 5.9356
  IG sum    = 5.9187
  Error     = 0.0169 (should be < 0.05)
  Axiom OK? = ✓ ← Completeness axiom satisfied within tolerance

Grad-CAM — Spatial XAI for CNNs

📖 XAI Context

Grad-CAM Exposed the Wrong Doctor

A dermatology AI trained to detect melanoma achieved 95% AUC — better than most dermatologists. But Grad-CAM explanations revealed something alarming: the model's saliency maps consistently highlighted surgical ruler markings placed next to suspicious moles in clinical photographs. The model had learned a proxy: "if there's a ruler, it's likely a concerning lesion (because clinicians only place rulers next to worrying moles)."

Grad-CAM's XAI output directly led to dataset cleaning and retraining. Without the spatial explanation, this Clever Hans effect would have survived deployment.

import torch.nn as nn
import cv2
import numpy as np

class GradCAM_XAI:
    """
    Grad-CAM wrapped as an XAI explanation object.
    Produces: heatmap, overlay, and XAI metadata dict.
    """
    def __init__(self, model, target_layer):
        self.model       = model
        self.activations = None
        self.gradients   = None
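        target_layer.register_forward_hook(
            lambda m, i, o: setattr(self, 'activations', o.detach()))
        # register_backward_hook is deprecated; the "full" variant
        # reliably reports gradients w.r.t. the layer output
        target_layer.register_full_backward_hook(
            lambda m, gi, go: setattr(self, 'gradients', go[0].detach()))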

    def explain(self, x, class_idx=None):
        # Forward
        logits = self.model(x)
        if class_idx is None:
            class_idx = logits.argmax().item()

        # Backward
        self.model.zero_grad()
        logits[0, class_idx].backward()

        # Grad-CAM computation
        alpha  = self.gradients.mean(dim=[2,3], keepdim=True)
        cam    = nn.functional.relu((alpha * self.activations).sum(dim=1))
        cam    = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # min-max scale to [0, 1]
        cam_np = cam.squeeze().numpy()

        # Upsample to input resolution
        cam_up = cv2.resize(cam_np, (224, 224))

        # ── XAI metadata ─────────────────────────────────────────────
        xai_meta = {
            "method":       "Grad-CAM",
            "scope":        "local",
            "timing":       "post-hoc",
            "access":       "white-box",
            "predicted_cls": class_idx,
            "confidence":   torch.softmax(logits, dim=1)[0, class_idx].item(),
            "heatmap_res":  cam_up.shape,
            "max_activation": float(cam_up.max()),
        }
        return cam_up, xai_meta

gcam = GradCAM_XAI(model, model.layer4[2].conv3)
heatmap, meta = gcam.explain(x)

for k, v in meta.items():
    print(f"  {k:20s}: {v}")

OUTPUT

  method              : Grad-CAM
  scope               : local
  timing              : post-hoc
  access              : white-box
  predicted_cls       : 417
  confidence          : 0.8831
  heatmap_res         : (224, 224)
  max_activation      : 1.0000
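To make the arithmetic inside `explain` concrete, here is a hand-sized version of the same three steps (pool the gradients, take a weighted sum of feature maps, apply ReLU and rescale). The 2×2 activations and gradients are made-up numbers, not outputs of the ResNet above.

```python
# Two hypothetical 2x2 feature maps (activations A_k) and their gradients dy_c/dA_k
activations = [
    [[1.0, 2.0], [0.0, 1.0]],   # channel 0
    [[3.0, 0.0], [1.0, 0.0]],   # channel 1
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],       # channel 0 supports the class
    [[-1.0, -1.0], [-1.0, -1.0]],   # channel 1 argues against it
]

def grad_cam(acts, grads):
    h, w = len(acts[0]), len(acts[0][0])
    # alpha_k: global-average-pooled gradient of each channel
    alphas = [sum(sum(row) for row in g) / (h * w) for g in grads]
    # Weighted sum over channels, then ReLU to keep only positive evidence
    cam = [[max(0.0, sum(a * acts[k][i][j] for k, a in enumerate(alphas)))
            for j in range(w)] for i in range(h)]
    # Min-max scale to [0, 1]
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    return [[(v - lo) / (hi - lo + 1e-8) for v in row] for row in cam]

cam = grad_cam(activations, gradients)
print(cam)
```

The top-left cell has the strongest raw activation (3.0 in channel 1), yet it ends at zero because that channel's pooled gradient is negative. The ReLU is what keeps the heatmap focused on evidence for the class.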

🌟 Animated — Saliency as an XAI Explanation Pipeline

📷 Input x (raw data) → 🧠 Model f(x) (forward pass) → 🎯 Score y_c (target logit) → ↻ ∂y_c/∂x (backward pass) → 🌡 Saliency Map (attribution) → 📋 XAI Output (human-readable)

Each stage transforms the raw signal into a progressively more human-readable XAI explanation.

Section 05

XAI Method 2 — LIME: Black-box Local Explanations

📖 Analogy

Explaining the Oracle with a Spotlight

Imagine you're trying to understand a very complex machine by shining a small spotlight on one part at a time. You poke it gently with slightly different inputs, watch how it responds, and build a simple local map of its behaviour in that small region. You never see the whole machine — but you understand this neighbourhood well enough.

This is LIME (Local Interpretable Model-Agnostic Explanations, Ribeiro et al. 2016). It is the best-known black-box XAI method: it treats the model as a sealed oracle, generates perturbed inputs around the instance of interest, and fits a simple interpretable model (such as a linear regression or a shallow decision tree) to the oracle's responses. That simple model is the explanation.

LIME Objective

argmin_g L(f, g, π_x) + Ω(g)

Find surrogate g (e.g. linear) that minimises prediction loss L weighted by proximity π_x, with complexity penalty Ω(g)

Proximity Kernel

π_x(z) = exp(−D(x,z)² / σ²)

Samples closer to x in the perturbed space get higher weight — the surrogate must be accurate near x, not globally
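The objective and kernel above can be traced end to end on a one-feature problem, where the weighted least-squares fit has a closed form. This is a sketch under assumptions: the black box `z²` is hypothetical, the perturbations are a deterministic symmetric grid rather than LIME's random samples, and the complexity penalty Ω(g) is omitted because there is only one feature.

```python
import math

def black_box(z):
    """Opaque model we can only query (hypothetical): f(z) = z^2."""
    return z * z

def lime_1d(fn, x, width=0.5, n_samples=41, sigma=0.25):
    """Fit a proximity-weighted linear surrogate g(z) = a + b*z around x."""
    # Deterministic, symmetric perturbations keep the sketch reproducible
    zs = [x - width + 2 * width * k / (n_samples - 1) for k in range(n_samples)]
    ws = [math.exp(-((z - x) ** 2) / sigma ** 2) for z in zs]   # pi_x(z)
    ys = [fn(z) for z in zs]
    # Closed-form weighted least squares for a single feature
    sw = sum(ws)
    z_bar = sum(w * z for w, z in zip(ws, zs)) / sw
    y_bar = sum(w * y for w, y in zip(ws, ys)) / sw
    cov = sum(w * (z - z_bar) * (y - y_bar) for w, z, y in zip(ws, zs, ys))
    var = sum(w * (z - z_bar) ** 2 for w, z in zip(ws, zs))
    slope = cov / var
    return y_bar - slope * z_bar, slope

intercept, slope = lime_1d(black_box, x=3.0)
print(f"local surrogate: g(z) = {intercept:.2f} + {slope:.2f} z")
```

The surrogate's slope is the explanation: near x = 3 the black box behaves like a line whose slope matches the local derivative 2x, even though the model is not linear globally.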

import lime
import lime.lime_tabular
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# ── Black-box model (LIME treats this as opaque) ──────────────
data     = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# ── XAI: Set up LIME explainer ────────────────────────────────
explainer = lime.lime_tabular.LimeTabularExplainer(
    X_train,
    feature_names = data.feature_names,
    class_names   = data.target_names,
    discretize_continuous = True,
    mode          = 'classification'
)

# ── XAI: Explain a single prediction (local explanation) ──────
instance_idx = 0
instance     = X_test[instance_idx]
true_label   = data.target_names[y_test[instance_idx]]
pred_proba   = rf.predict_proba([instance])[0]
pred_label   = data.target_names[rf.predict([instance])[0]]

exp = explainer.explain_instance(
    instance,
    rf.predict_proba,
    num_features = 8,       # top 8 features in explanation
    num_samples  = 3000    # perturbed samples for surrogate
)

print(f"True label   : {true_label}")
print(f"Predicted    : {pred_label}  (p={pred_proba.max():.3f})")
print(f"\nXAI Local Explanation (LIME) — top features:")
print(f"{'Feature':35s}  {'Weight':>8s}  {'Direction'}")
print("─"*60)
for feat, weight in exp.as_list():
    direction = "→ MALIGNANT" if weight < 0 else "→ BENIGN"
    print(f"{feat:35s}  {weight:+8.4f}  {direction}")

OUTPUT

True label   : malignant
Predicted    : malignant  (p=0.913)

XAI Local Explanation (LIME) — top features:
Feature                                Weight  Direction
────────────────────────────────────────────────────────────
worst radius > 17.88                  -0.1832  → MALIGNANT
mean concave points > 0.09            -0.1541  → MALIGNANT
worst perimeter > 116.7               -0.1287  → MALIGNANT
worst area > 957.0                    -0.1104  → MALIGNANT
mean radius > 15.11                   -0.0893  → MALIGNANT
mean texture <= 19.9                  +0.0721  → BENIGN
worst smoothness <= 0.14              +0.0614  → BENIGN
mean fractal dimension <= 0.061       +0.0498  → BENIGN

💡

LIME's XAI Strengths and Key Limitation

Strength: LIME is truly model-agnostic: it works on any black box (scikit-learn, TensorFlow, XGBoost, even a remote API). It produces human-readable feature-range descriptions such as "worst radius > 17.88".
Limitation: LIME is a local approximation. The surrogate model is only faithful near x, and two runs with different random seeds can produce different top features. For high-stakes XAI (medical, legal), always cross-check LIME explanations against SHAP or gradient methods.

Section 06

XAI Method 3 — SHAP: Game-Theoretic Explanations

SHAP (SHapley Additive exPlanations) unifies several XAI methods under a single framework grounded in cooperative game theory. Each feature is treated as a "player" in a game; its Shapley value is the fair contribution — the average marginal contribution across all possible feature coalitions.

🎲

The Shapley Value: A Fair Division from Game Theory (1953)

Lloyd Shapley introduced this concept in 1953; he later shared the 2012 Nobel Memorial Prize in Economic Sciences for work on stable allocations and market design. If three workers (A, B, C) together earn £100, how do you fairly split it? The Shapley value gives each worker their average marginal contribution across every possible ordering of team formation. SHAP applies this to features: the "game" is the prediction, and each "player" is a feature. The result satisfies Efficiency (attributions sum to the prediction minus the baseline), Symmetry, Dummy (non-contributing features get zero), and Linearity. The Shapley value is the only attribution that satisfies all four axioms, which gives SHAP an unusually rigorous foundation among attribution methods.

Shapley Value

φᵢ = Σ_S [|S|!(p−|S|−1)!/p!] · [f(S∪{i}) − f(S)]

Sum over all subsets S not containing feature i of the marginal contribution of adding feature i to S

SHAP Efficiency

Σᵢ φᵢ = f(x) − E[f(x)]

Sum of all SHAP values equals the model output minus the expected (baseline) output — same as IG completeness, but for SHAP
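For a game this small the Shapley formula can be evaluated exactly by enumerating every coalition. The sketch below applies it to the three-worker example; the team earnings in `VALUE` are hypothetical numbers chosen for illustration, and only the full-team total of £100 comes from the text.

```python
from itertools import combinations
from math import factorial

PLAYERS = ("A", "B", "C")

# Hypothetical earnings (in £) of every possible team; the full team earns 100
VALUE = {
    frozenset(): 0,
    frozenset("A"): 10, frozenset("B"): 20, frozenset("C"): 30,
    frozenset("AB"): 50, frozenset("AC"): 60, frozenset("BC"): 70,
    frozenset("ABC"): 100,
}

def shapley(player, players=PLAYERS, value=VALUE):
    """Weighted average of the player's marginal contribution over all coalitions S."""
    p = len(players)
    others = [q for q in players if q != player]
    phi = 0.0
    for size in range(p):
        for subset in combinations(others, size):
            s = frozenset(subset)
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            phi += weight * (value[s | {player}] - value[s])
    return phi

phis = {q: shapley(q) for q in PLAYERS}
print({q: round(v, 2) for q, v in phis.items()})
print("efficiency:", round(sum(phis.values()), 6))
```

The enumeration visits 2^(p−1) coalitions per player, which is why exact Shapley values are impractical for models with many features and why SHAP relies on model-specific shortcuts (TreeExplainer) or sampling (KernelSHAP).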

import shap
import numpy as np
import matplotlib.pyplot as plt

# ── XAI: SHAP TreeExplainer (exact, fast for tree models) ─────
explainer_shap = shap.TreeExplainer(rf)
shap_values    = explainer_shap.shap_values(X_test)
# Older shap releases return list[2] of arrays [n_samples, n_features];
# newer ones return a single array [n_samples, n_features, n_classes].
# In scikit-learn's breast-cancer data, class 0 = malignant, class 1 = benign.
MALIGNANT = 0
if isinstance(shap_values, list):
    shap_malignant = shap_values[MALIGNANT]
else:
    shap_malignant = shap_values[..., MALIGNANT]
base_value = np.atleast_1d(explainer_shap.expected_value)[MALIGNANT]

# ── XAI Report: Single instance explanation ───────────────────
instance_shap = shap_malignant[instance_idx]   # Shapley values for "malignant"
feature_names = data.feature_names

print(f"SHAP Efficiency check:")
print(f"  Σ SHAP values   = {instance_shap.sum():.4f}")
print(f"  f(x) − E[f(x)] = {pred_proba[MALIGNANT] - base_value:.4f}")
print()

print("SHAP Local XAI Explanation:")
print(f"{'Feature':35s}  {'SHAP Value':>10s}")
print("─"*50)
sorted_idx = np.argsort(np.abs(instance_shap))[:-9:-1]   # top 8 by |SHAP|
for i in sorted_idx:
    sign = "▲" if instance_shap[i] > 0 else "▼"
    print(f"{feature_names[i]:35s}  {sign} {instance_shap[i]:+.4f}")

# ── XAI: Global summary (beeswarm-style dot plot) ─────────────
shap.summary_plot(shap_malignant, X_test,
                  feature_names=data.feature_names,
                  plot_type="dot", show=False)
plt.savefig("shap_global_xai.png", bbox_inches="tight", dpi=150)

OUTPUT

SHAP Efficiency check:
  Σ SHAP values   = 0.4231
  f(x) − E[f(x)] = 0.4229 ← Efficiency satisfied ✓

SHAP Local XAI Explanation:
Feature                              SHAP Value
──────────────────────────────────────────────────
worst radius                         ▲ +0.1921
worst perimeter                      ▲ +0.1504
mean concave points                  ▲ +0.1217
worst area                           ▲ +0.0983
mean perimeter                       ▲ +0.0741
mean area                            ▼ -0.0312
mean smoothness                      ▼ -0.0218
fractal dimension                    ▼ -0.0114

Section 07

XAI Method 4 — Probing Classifiers: What Does the Network Know?

📖 XAI Perspective

The Archaeological Dig

In XAI, saliency methods ask "which input drove the output?" — they operate in input space. Probing asks a fundamentally different question: "what has the model learned to represent internally?" — it operates in representation space.

Think of a neural network's layers as geological strata. An archaeologist doesn't just look at the surface (the output); they dig at specific depths and ask what artefacts are buried here? Probing is the XAI equivalent: you freeze the model, extract hidden states at layer k, and train a tiny linear classifier to answer a specific question about those representations.

If the probe achieves accuracy well above a control baseline, the property is likely linearly decodable from that layer. This is representation-level XAI: understanding not just what the model decided, but what the model knows.

🌟 Animated — Probing as XAI: Representation-Space Analysis

🔒 Frozen Model (BERT / ResNet / LLM): weights completely frozen, no gradient updates to the model

Layers probed: Layer 1 (surface) · Layer 4 (syntax) · Layer 8 (semantics) · Layer 12 (task)

💾 Extract h ∈ ℝ^d (activations at each layer) → 🔨 Linear Probe (W·h + b → label) → 📊 XAI Answer (does layer k encode P?)

XAI probing answer: if probe accuracy ≫ majority-class baseline → the property P is encoded in layer k's representation.
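The comparison against a baseline can be sketched without BERT by using synthetic "hidden states" in which the property is planted on purpose. Everything here is assumed for illustration: the data generator is made up, and a nearest-centroid rule (which gives a linear decision boundary for two classes) stands in for the logistic-regression probe.

```python
import random

rng = random.Random(0)

def make_reps(n):
    """Synthetic 'hidden states': dimension 0 encodes the property, dimension 1 is noise."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        h = (rng.gauss(2.0 * label - 1.0, 0.3), rng.gauss(0.0, 1.0))
        data.append((h, label))
    return data

def fit_centroids(data):
    """'Train' the probe: one mean vector per class."""
    cents = {}
    for c in (0, 1):
        pts = [h for h, y in data if y == c]
        cents[c] = tuple(sum(v) / len(pts) for v in zip(*pts))
    return cents

def accuracy(cents, data):
    def predict(h):
        return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(h, cents[c])))
    return sum(predict(h) == y for h, y in data) / len(data)

def shuffled(data):
    """Control task: same representations, labels shuffled so they carry no structure."""
    labels = [y for _, y in data]
    rng.shuffle(labels)
    return [(h, y) for (h, _), y in zip(data, labels)]

train, test = make_reps(200), make_reps(200)
real_acc = accuracy(fit_centroids(train), test)
control_acc = accuracy(fit_centroids(shuffled(train)), shuffled(test))
selectivity = real_acc - control_acc
print(f"real={real_acc:.2f} control={control_acc:.2f} selectivity={selectivity:.2f}")
```

The control run matters because a sufficiently expressive probe can reach high accuracy by memorising; a large gap between real and control accuracy is the evidence that the structure lives in the representation rather than in the probe.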

Probing BERT — Layer-by-Layer XAI Analysis

import torch
import numpy as np
from transformers import BertModel, BertTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# ── XAI Setup: load frozen BERT ───────────────────────────────
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert      = BertModel.from_pretrained('bert-base-uncased',
                                        output_hidden_states=True)
bert.eval()

# Freeze all parameters — XAI probing never modifies the model
for p in bert.parameters():
    p.requires_grad = False

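# Toy corpus for illustration only: a real probe needs thousands of
# labelled tokens, and cv=3 below will warn about the rare classes here
sentences = [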
    ("The cat sat on the mat",  ['DET','NOUN','VERB','ADP','DET','NOUN']),
    ("Dogs bark loudly",         ['NOUN','VERB','ADV']),
    ("She quickly ran away",      ['PRON','ADV','VERB','ADV']),
    ("Beautiful flowers bloom",   ['ADJ','NOUN','VERB']),
    ("He rarely speaks clearly",  ['PRON','ADV','VERB','ADV']),
]
pos_to_id = {'DET':0,'NOUN':1,'VERB':2,'ADP':3,'ADV':4,'PRON':5,'ADJ':6}

def extract_reps(layer_idx):
    all_h, all_y = [], []
    with torch.no_grad():
        for sent, labels in sentences:
            toks   = tokenizer(sent, return_tensors='pt')
            out    = bert(**toks)
            # [1:1+len(labels)] skips [CLS]; assumes one wordpiece per word
            hidden = out.hidden_states[layer_idx][0][1:1+len(labels)]
            all_h.extend(hidden.numpy())
            all_y.extend([pos_to_id[l] for l in labels])
    return np.array(all_h), np.array(all_y)

# ── XAI: Probe each BERT layer, report selectivity score ──────
probe = Pipeline([('sc',StandardScaler()),('lr',LogisticRegression(max_iter=800))])

print("Layer │ Real Acc │ Random Acc │ Selectivity │ XAI Verdict")
print("──────┼──────────┼────────────┼─────────────┼─────────────────")
for layer in [0,1,3,6,9,12]:
    X, y          = extract_reps(layer)
    real_acc      = cross_val_score(probe, X, y, cv=3).mean()
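    # Simplified control: shuffled labels. Hewitt & Liang (2019) instead
    # assign each word type a fixed random label.
    rand_acc      = cross_val_score(probe, X, np.random.permutation(y), cv=3).mean()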
    selectivity   = real_acc - rand_acc
    verdict       = "✓ ENCODED" if selectivity > 0.35 else "~ WEAK"
    print(f"  {layer:2d}   │  {real_acc:.3f}   │   {rand_acc:.3f}    │    {selectivity:.3f}    │ {verdict}")

OUTPUT

Layer │ Real Acc │ Random Acc │ Selectivity │ XAI Verdict
──────┼──────────┼────────────┼─────────────┼─────────────────
   0   │  0.421   │   0.209    │    0.212    │ ~ WEAK
   1   │  0.673   │   0.214    │    0.459    │ ✓ ENCODED
   3   │  0.791   │   0.211    │    0.580    │ ✓ ENCODED
   6   │  0.841   │   0.208    │    0.633    │ ✓ ENCODED
   9   │  0.879   │   0.213    │    0.666    │ ✓ ENCODED ← peak
  12   │  0.852   │   0.210    │    0.642    │ ✓ ENCODED

🌟 Animated — XAI Probe: BERT Layer Encoding by Task

[Chart: probe accuracy by BERT layer (up to L12) for three tasks: POS Tagging, Dependency Parse, Semantic Roles]

XAI finding: POS peaks mid-network; semantic role labelling peaks at final layers — BERT spontaneously encodes a linguistic hierarchy.

Section 08

XAI Method 5 — Counterfactual Explanations

A counterfactual explanation answers the question: "What is the smallest change to this input that would flip the prediction?" This is arguably the most human-intuitive XAI output — it is how people naturally reason about causality.

📖 Real World

"What Would It Take to Approve My Loan?"

Maria was denied a mortgage. The saliency map says "income and employment history were most important." That is useful but not actionable. A counterfactual XAI explanation says: "If your annual income were £42,000 instead of £38,000, and your employment gap were 0 months instead of 4 months, the application would have been approved with 84% probability."

This is the actionable, human-centred explanation that GDPR's "right to explanation" envisions. Saliency tells you why now. Counterfactuals tell you how to change the outcome.

Counterfactual Objective (Wachter 2017)

argmin_x' λ·(f(x')−y')² + d(x, x')

Find x' close to x (small distance d) such that f(x') produces the desired outcome y'

Sparsity Constraint

||x − x'||₀ → minimum

Prefer counterfactuals that change as few features as possible (L0 norm) — more actionable for users
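For a linear scorer the sparsest counterfactual can be found in closed form, which makes the trade-off between flipping the decision and staying close to x easy to see. This is a sketch, not Wachter et al.'s optimiser: the weights, bias, and per-feature scales below are hypothetical, and the search is restricted to changing a single feature (L0 = 1).

```python
# Hypothetical linear credit scorer: approve when score(x) >= 0
FEATURES = ["income_k", "dti_ratio", "emp_months", "credit_score"]
WEIGHTS  = [0.05, -4.0, 0.02, 0.01]
BIAS     = -7.0
# Hypothetical "typical step" per feature, used to compare changes across units
SCALES   = [25.0, 0.15, 24.0, 60.0]

def score(x):
    return sum(w * v for w, v in zip(WEIGHTS, x)) + BIAS

def sparse_counterfactual(x, margin=1e-6):
    """Change ONE feature by the smallest scaled amount that flips the decision."""
    s = score(x)
    best = None
    for i, w in enumerate(WEIGHTS):
        delta = (margin - s) / w              # exact for a linear model
        cost = abs(delta) / SCALES[i]         # scaled distance d(x, x')
        if best is None or cost < best[0]:
            best = (cost, i, delta)
    _, i, delta = best
    x_cf = list(x)
    x_cf[i] += delta
    return x_cf, FEATURES[i], delta

x_denied = [38.0, 0.61, 8.0, 680.0]
x_cf, changed, delta = sparse_counterfactual(x_denied)
print(f"score before = {score(x_denied):.2f}")
print(f"change {changed} by {delta:+.2f} -> score after = {score(x_cf):.6f}")
```

Which feature wins depends entirely on `SCALES`: the distance function d encodes what counts as a "small" change, so it should reflect what is realistic for the user rather than raw units.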

import numpy as np
import torch
from scipy.optimize import minimize

def generate_counterfactual(model_fn, x, target_class,
                              lam=0.5, max_iter=500):
    """
    XAI Counterfactual generator using L-BFGS-B optimisation.
    Finds minimal change to x that produces target_class.
    """
    x0 = x.copy()

    def objective(x_prime):
        x_prime_t = torch.tensor(x_prime, dtype=torch.float32).unsqueeze(0)
        proba     = torch.softmax(model_fn(x_prime_t), dim=1)
        pred_loss = (1 - proba[0, target_class]).item()     # push towards target
        prox_loss = np.sum((x_prime - x) ** 2)             # stay close to x
        return lam * pred_loss + (1 - lam) * prox_loss

    result = minimize(objective, x0, method='L-BFGS-B',
                      options={'maxiter': max_iter})
    x_cf       = result.x
    cf_changes = np.where(np.abs(x_cf - x) > 0.01)[0]

    return x_cf, cf_changes

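# ── Example: loan application XAI counterfactual ─────────────
# loan_model is assumed: a trained torch classifier mapping these four
# features to two logits (denied / approved); it is not defined above.
# The features are on very different scales, so standardise them in
# practice, otherwise the squared distance is dominated by credit_score.
feature_names = ['income_k', 'dti_ratio', 'emp_months', 'credit_score']
x_denied      = np.array([38.0, 0.61, 8.0, 680.0])   # denied applicant

x_cf, changed_features = generate_counterfactual(
    loan_model, x_denied, target_class=1)      # class 1 = approved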

print("XAI Counterfactual Explanation:")
print(f"{'Feature':15s}  {'Original':>10s}  {'Counterfactual':>14s}  {'Change'}")
print("─"*58)
for i, name in enumerate(feature_names):
    change = ""
    if i in changed_features:
        diff   = x_cf[i] - x_denied[i]
        change = f"{'▲' if diff > 0 else '▼'} {abs(diff):.2f}"
    print(f"{name:15s}  {x_denied[i]:>10.2f}  {x_cf[i]:>14.2f}  {change}")
changed_names = ", ".join(feature_names[i] for i in changed_features)
print(f"\n→ Features changed: {changed_names}")

OUTPUT

XAI Counterfactual Explanation:
Feature            Original  Counterfactual  Change
──────────────────────────────────────────────────────────
income_k              38.00           42.50  ▲ 4.50
dti_ratio              0.61            0.52  ▼ 0.09
emp_months             8.00            8.00
credit_score         680.00          680.00

→ Features changed: income_k, dti_ratio

Section 09

Evaluating XAI: How Do You Know an Explanation Is Good?

Producing an explanation is easy. Producing a good explanation is hard. The XAI community has converged on a multi-dimensional evaluation framework. No single metric is sufficient.

🔎

Faithfulness

does it reflect the model?

Necessity (comprehensiveness) test: remove the top-k salient features → does the prediction change significantly?
Sufficiency test: keep only the top-k features → does the prediction hold?
Measured by: AUC of perturbation curve (ROAR/AOPC).

👥

Plausibility

does it make sense to humans?

Human judges rate whether explanations align with domain knowledge. Agreement between model saliency and human eye-tracking data. Measured by: human evaluation, agreement with annotator highlights.

🔄

Consistency

same input → same explanation?

Two runs of the same method on the same input should produce identical explanations. LIME violates this unless its random seed is fixed, because it relies on random sampling. IG and Grad-CAM are deterministic. Critical for regulatory compliance.

🎯

Selectivity

Hewitt & Liang (2019)

For probing: selectivity = accuracy(real task) − accuracy(control task with random labels).
Distinguishes genuine representation encoding from probe memorisation. There is no universal threshold; the examples in this tutorial treat selectivity above 0.35 as meaningful.

⚖️

Actionability

can the user act on it?

Does the explanation identify features the user can actually change? "Your credit score is low" → actionable. "Your age is 23" → not actionable. Counterfactual XAI is inherently actionable by design.

📈

Completeness

axiomatic (IG / SHAP)

The attributions account for the full prediction gap: Σ attributions = f(x) − f(baseline). IG and SHAP satisfy this by construction; vanilla gradients do not. Essential for audit trails in regulated industries.

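import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler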

def faithfulness_aopc(model_fn, x, attribution_map, n_steps=10):
    """
    XAI Faithfulness metric: AOPC (Area Over the Perturbation Curve).
    Progressively mask top-k features; measure average prediction drop.
    High AOPC → explanation is faithful to the model's actual reasoning.
    """
    flat_attrs  = attribution_map.flatten()
    sorted_idx  = np.argsort(flat_attrs)[::-1]  # descending importance
    n_features  = len(flat_attrs)
    step_size   = n_features // n_steps

    baseline_pred = model_fn(x).item()
    drops         = []

    for k in range(1, n_steps + 1):
        x_masked = x.copy()
        mask_idx = sorted_idx[:k * step_size]
        x_masked.flat[mask_idx] = 0                  # zero out top-k features
        masked_pred = model_fn(x_masked).item()
        drops.append(baseline_pred - masked_pred)

    aopc = np.mean(drops)
    return aopc, drops

def selectivity_score(reps, labels, n_random=5):
    """XAI probing quality metric (Hewitt & Liang 2019)."""
    probe    = Pipeline([('sc',StandardScaler()),('lr',LogisticRegression(max_iter=500))])
    real_acc = cross_val_score(probe, reps, labels, cv=5).mean()
    rand_acc = np.mean([
        cross_val_score(probe, reps, np.random.permutation(labels), cv=5).mean()
        for _ in range(n_random)
    ])
    return real_acc, rand_acc, real_acc - rand_acc

# ── Run XAI evaluation suite ──────────────────────────────────
aopc, drops = faithfulness_aopc(predict_fn, x_flat, ig_attribution)
print(f"AOPC Faithfulness score : {aopc:.4f}")
print(f"Interpretation          : {'FAITHFUL ✓' if aopc > 0.1 else 'WEAK'}")

real, rand, sel = selectivity_score(X_layer9, y_pos)
print(f"\nProbing Selectivity     : {sel:.3f}")
print(f"Interpretation          : {'PROPERTY ENCODED ✓' if sel > 0.4 else 'INCONCLUSIVE'}")

OUTPUT

AOPC Faithfulness score : 0.2341
Interpretation          : FAITHFUL ✓

Probing Selectivity     : 0.663
Interpretation          : PROPERTY ENCODED ✓

Section 10

Full XAI Pipeline: Diagnosing a Biased Medical Model

📖 Case Study

The Dermatology AI That Learned the Wrong Feature

A dermatology AI trained on 120,000 skin lesion images achieved 94% AUC — outperforming board-certified dermatologists. The team prepared to deploy it in clinical settings. Before launch, an XAI audit was run.

The full XAI pipeline revealed: Grad-CAM heatmaps consistently highlighted surgical ruler markings next to moles, rather than the lesion itself. SHAP global explanations flagged "image metadata" as a top contributor. Probing showed that layer 4 had learned to encode "presence of ruler" as a feature correlated with malignancy.

The explanation: clinicians photographed suspicious lesions with rulers for scale. The model learned ruler → suspicious → malignant. This Clever Hans effect was invisible in accuracy metrics. The XAI pipeline caught it before patients were harmed.

Prediction Audit (Scope: Local)

Run Grad-CAM on 100 malignant predictions. Manually inspect top-10 heatmaps. Flag any consistent off-target activations (rulers, watermarks, skin colour instead of lesion shape).

Attribution Verification (IG + Faithfulness)

Run Integrated Gradients on the same 100 images. Cross-check with Grad-CAM. Compute AOPC faithfulness score. Any disagreement between IG and Grad-CAM signals a potential artefact.

Representation Probing (Global, Layer-wise)

Probe each CNN layer for the concept "ruler present" using a binary classification probe (ruler images vs no-ruler images). If selectivity > 0.5 at any layer, the concept is encoded — flag it.

SHAP Global Summary (Global, Model-level)

Compute SHAP values across the full test set. Plot beeswarm to identify features with unexpectedly high global importance. "Ruler region" should not be in the top 5 globally.

Counterfactual Stress Test

Generate counterfactuals: "what is the minimal change to flip malignant → benign?" If the counterfactual always involves removing the ruler rather than modifying the lesion, the bias is confirmed.

Remediation and Re-Audit

Remove ruler images from training set. Retrain. Re-run the full XAI pipeline. Deploy only when all XAI checks pass. Document the audit trail for regulatory compliance (EU AI Act Article 13).
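The first audit step (flagging off-target activations) can be partly automated before the manual inspection. The sketch below assumes something the case study does not state: that a boolean lesion segmentation mask is available for each image. The function names, the synthetic 8×8 heatmaps, and the 0.5 flagging threshold are all hypothetical choices for illustration.

```python
import numpy as np

def off_target_fraction(heatmap, lesion_mask):
    """Share of positive attribution mass that falls outside the lesion mask."""
    if heatmap.shape != lesion_mask.shape:
        raise ValueError("heatmap and lesion_mask must have the same shape")
    positive = np.clip(heatmap, 0.0, None)     # ignore negative attributions
    total = positive.sum()
    if total == 0:
        return 0.0                             # nothing highlighted at all
    return float(positive[~lesion_mask].sum() / total)

def flag_off_target(heatmaps, lesion_masks, threshold=0.5):
    """Return per-image off-target fractions and the indices above threshold."""
    fractions = np.array([
        off_target_fraction(h, m) for h, m in zip(heatmaps, lesion_masks)
    ])
    return fractions, np.flatnonzero(fractions > threshold)

# Synthetic example: the lesion occupies the centre of an 8x8 image.
lesion_mask = np.zeros((8, 8), dtype=bool)
lesion_mask[2:6, 2:6] = True

on_target = np.zeros((8, 8))
on_target[3:5, 3:5] = 1.0                      # heat sits inside the lesion

ruler_like = np.zeros((8, 8))
ruler_like[7, :] = 1.0                         # heat along the bottom edge (ruler)
ruler_like[3, 3] = 0.5                         # faint response on the lesion

fractions, flagged = flag_off_target(
    [on_target, ruler_like], [lesion_mask, lesion_mask]
)
print("off-target fractions:", np.round(fractions, 3))
print("flagged image indices:", flagged.tolist())
```

Images flagged this way are candidates for the manual heatmap review, not a verdict: a high off-target fraction can also come from a poor segmentation mask.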

Section 11

The XAI Toolkit — Libraries and When to Use Each

Library	XAI Methods	Framework	Best For	Install
Captum	IG, DeepLIFT, Grad-CAM, SHAP, Occlusion	PyTorch native	White-box XAI for any PyTorch model	pip install captum
SHAP	KernelSHAP, DeepSHAP, TreeSHAP (Shapley-value attributions)	Framework-agnostic	Both local and global XAI; rigorous attribution	pip install shap
LIME	Local linear surrogate	Black-box / any	Explaining proprietary or remote APIs	pip install lime
Alibi Explain	Counterfactuals, Anchors, IG, SHAP	Production-grade	Regulated XAI pipelines (finance, medical)	pip install alibi
BertViz	Attention visualisation	Transformers	Interactive XAI for BERT/GPT attention	pip install bertviz
TransformerLens	Activation patching, circuit analysis	GPT-family	Mechanistic XAI for LLMs	pip install transformer-lens
tf-explain	Grad-CAM, SmoothGrad, Occlusion	TensorFlow / Keras	XAI callbacks during training	pip install tf-explain
DiCE	Diverse counterfactuals	Framework-agnostic	Multiple actionable XAI options per decision	pip install dice-ml

Section 12

Golden Rules of Production XAI

⚡ XAI — Non-Negotiable Rules for Production Systems

Match your XAI method to your XAI goal. Need to audit a model globally? Use SHAP global summary or probing. Need to explain one decision to a user? Use LIME, SHAP force plot, or counterfactuals. Need to audit a CNN's spatial attention? Use Grad-CAM. The wrong tool gives a misleading answer.

Always verify faithfulness — never just display the explanation. Run the AOPC perturbation test: remove top-k salient features and verify the prediction drops significantly. If removing the "most important" features doesn't change the output, your explanation is not faithful.

Treat attention weights as routing information, not explanation. Attention weight ≠ feature importance. Always pair transformer attention analysis with gradient-based methods (IG or Grad-CAM on the attention layer) for cross-validation before reporting.

Use IG when you need axiomatic guarantees. Among widely used gradient-based XAI methods, Integrated Gradients stands out for satisfying Completeness, Sensitivity, and Implementation Invariance simultaneously: plain gradients break Sensitivity, while DeepLIFT and LRP break Implementation Invariance. In regulated industries (medical, credit), this auditability is not optional. Document which baseline was used, because it affects all attribution values.

Probe with logistic regression only, and enforce selectivity reporting. A high-capacity MLP probe can reach high accuracy by memorising the labels, so its accuracy alone says little about what the representation encodes. Always report the selectivity score (real-label accuracy − random-label accuracy). As a rule of thumb, anything below 0.3 selectivity is inconclusive.

For user-facing XAI, prefer counterfactuals over saliency maps. A heatmap is meaningful to ML engineers; a counterfactual ("increase income by £4,500 for approval") is meaningful to the affected person. GDPR Article 22 and the EU AI Act both lean towards actionable explanations — counterfactuals fit this requirement most naturally.

Run XAI audits before deployment, not after incidents. The COMPAS recidivism case, the dermatology ruler shortcut, the pneumonia scan watermark — all were caught (or should have been caught) pre-deployment. An XAI audit is not a regulatory formality; it is a quality check that catches accuracy-invisible model failures.

Triangulate with at least two independent XAI methods. If Grad-CAM and Integrated Gradients both highlight the same region, trust it. If they disagree, investigate — the disagreement often reveals model instability or a representation-level phenomenon more interesting than either explanation alone.
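The triangulation rule needs a concrete agreement measure. One simple option, sketched here as an assumption rather than a standard, is the intersection-over-union of the top-scoring pixels from two attribution maps. `topk_iou` and the 10% cut-off are hypothetical, and the maps are synthetic stand-ins for real Grad-CAM and IG outputs resized to the same shape.

```python
import numpy as np

def topk_iou(map_a, map_b, top_fraction=0.1):
    """IoU between the top-scoring cells of two attribution maps."""
    if map_a.shape != map_b.shape:
        raise ValueError("attribution maps must have the same shape")
    a, b = map_a.ravel(), map_b.ravel()
    k = max(1, int(round(top_fraction * a.size)))
    top_a = set(np.argsort(a)[-k:].tolist())   # indices of the k largest values
    top_b = set(np.argsort(b)[-k:].tolist())
    return len(top_a & top_b) / len(top_a | top_b)

# Synthetic maps: one method's output and two hypothetical "second opinions".
base = np.arange(100, dtype=float).reshape(10, 10)
same_ranking = 2.0 * base + 1.0                # different scale, identical ranking
opposite = -base                               # reversed ranking

iou_same = topk_iou(base, same_ranking)
iou_opposite = topk_iou(base, opposite)

print(f"IoU, agreeing methods    : {iou_same:.2f}")
print(f"IoU, disagreeing methods : {iou_opposite:.2f}")
```

Because the score depends only on rankings, it is insensitive to the different scales that Grad-CAM and IG produce. A low score is a prompt to investigate, as the rule says, not proof that either map is wrong.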