LLM Interpretability: Token Attribution & XAI Guide

Section 01

Why Do We Need to Explain LLMs?

📖 Real World Story

The Doctor and the Black Box

Imagine a hospital deploys a large language model to assist in diagnosis. A patient comes in with chest pain, shortness of breath, and fatigue. The LLM outputs: "High likelihood of cardiac event — immediate intervention recommended."

The attending physician stares at the screen. The model is right 94% of the time on test data. But why did it flag this patient? Was it the "chest pain" token? The combination of all three symptoms? Or did it latch onto something unrelated — perhaps the patient's age mentioned earlier in the note?

Without an explanation, the physician can't verify the reasoning, can't catch a failure mode, and can't defend the decision in court. The model is a black box. This is precisely the problem that Explainable AI (XAI) for LLMs tries to solve.

Large Language Models (LLMs) like GPT-4, Claude, and LLaMA achieve remarkable performance across tasks — yet their internal computations involve billions of parameters interacting through attention mechanisms, feed-forward networks, and residual streams that no human can read directly. XAI for LLMs is the discipline of building tools and methods that make these opaque systems interpretable, transparent, and auditable.

🔎

Two Flavours of Interpretability

Interpretability asks: what does the model actually do internally? Explainability asks: can I generate a human-readable justification for an output? Interpretability is a property of the model; explainability is a property of the explanation. XAI typically pursues explainability using interpretability tools as its engine.

⚖️

Trust & Safety

Why It Matters

Medical, legal, and financial deployments require auditable reasoning. A model that is right for the wrong reasons will fail silently in production when the distribution shifts.

⚖️

Bias Detection

Fairness Lens

Attribution methods can reveal whether a model's decision rests on protected attributes (race, gender, age) or on proxies for them, even when those attributes were never explicit input features.

🔧

Debugging & Improvement

Model Development

When a model fails, explanations tell you where and why it went wrong — enabling targeted fine-tuning, prompt engineering, and architecture modifications.

Section 02

Token Attributions — Tracing Credit Back to Input

When an LLM generates a token, every input token contributed some amount — positive or negative — to that prediction. Token attribution methods assign a numerical score to each input token reflecting how much it influenced the output.

📖 Analogy

The Jury Deliberation Room

A jury of 12 people votes "Guilty." The judge asks each juror: "How much did the CCTV footage sway your verdict? The alibi? The motive? The character witness?" Each juror (like each attention head) gave a different weight to different pieces of evidence. Token attribution is the act of asking the model that same question — and getting a numerical answer for every piece of evidence (token) in the prompt.

The Main Methods

∇

Gradient × Input

gradient-based

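Compute the gradient of the output logit with respect to each token's embedding, then multiply element-wise with the embedding itself and sum over the embedding dimension to get one score per token. Fast (a single backward pass) but requires white-box access to a differentiable model. The workhorse of attribution in neural NLP.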

∫

Integrated Gradients (IG)

axiomatic attribution

Integrate gradients along a straight path from a baseline (zero/pad embedding) to the actual input. Satisfies completeness: attributions sum to the output difference from baseline. Gold standard for many tasks.

🎭

LIME (Local Surrogate)

perturbation-based

Mask random subsets of input tokens, observe output changes, then fit a simple linear model to the perturbation results. Model-agnostic — works on any black-box LLM without access to gradients.

🃏

SHAP (Shapley Values)

game-theoretic

Treats each token as a "player" in a coalition game. The Shapley value gives each player a fair share of the total prediction, satisfying efficiency, symmetry, and dummy axioms simultaneously.

🛠️

Attention Attribution

attention-based

Use the attention weights from transformer heads as a proxy for importance. Controversial — attention is not always faithful to prediction — but fast and visually intuitive for debugging.

📈

INSEQ Library

unified framework

A Python library that unifies 15+ attribution methods under one API for sequence-to-sequence and causal LMs. Supports Hugging Face models with minimal setup.
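To make the difference between the first two cards concrete, here is a minimal, dependency-free sketch that compares Gradient × Input with Integrated Gradients on a toy scoring function. Everything in it is an assumption made for illustration: the `tanh` "model", the weights, the one-scalar-per-token input, and the zero baseline. It is not how any real LLM is scored.

```python
import math

# Toy "model": one scalar feature per token, scored by tanh of a weighted sum.
# The weights are illustrative and not taken from any real model.
WEIGHTS = [0.1, 1.5, -0.8]

def f(x):
    return math.tanh(sum(w * xi for w, xi in zip(WEIGHTS, x)))

def grad_f(x):
    """Analytic gradient of f: d/dx_i tanh(z) = (1 - tanh(z)^2) * w_i."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x))
    slope = 1.0 - math.tanh(z) ** 2
    return [slope * w for w in WEIGHTS]

def gradient_x_input(x):
    """Gradient at the input itself, multiplied element-wise by the input."""
    return [g * xi for g, xi in zip(grad_f(x), x)]

def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of the IG path integral."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_f(point)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]

gxi = gradient_x_input(x)
ig = integrated_gradients(x, baseline)

print("Gradient x Input    :", [round(v, 4) for v in gxi], "sum =", round(sum(gxi), 4))
print("Integrated Gradients:", [round(v, 4) for v in ig], "sum =", round(sum(ig), 4))
print("f(x) - f(baseline)  :", round(f(x) - f(baseline), 4))
```

In this toy the input sits in the saturated region of `tanh`, so the local gradient is small and the Gradient × Input scores sum to far less than the actual output change, while the IG scores sum to it up to discretisation error. That is the completeness property described in the IG card.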

Visualising Token Attribution — Sentiment Classification

Below is an animated diagram showing how Integrated Gradients attributes credit for a sentiment prediction. Green tokens pushed the prediction toward Positive; red tokens pushed it toward Negative. Token opacity encodes magnitude.

💡 Animated figure: Integrated Gradients Attribution (Sentiment Task)

Legend: positive attribution · negative attribution · neutral / near-zero

⚠️

Attention ≠ Attribution — A Common Mistake

Many practitioners use attention weights as attributions because they are easy to extract. But Jain & Wallace (2019) showed that attention weights are often only weakly correlated with gradient-based importance measures, and that very different attention distributions can yield the same prediction. Attention tells you where the model looks, not what changes the prediction. Prefer gradient-based or Shapley methods when faithfulness matters.

Integrated Gradients — Python Code

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np

# ── Load a pre-trained sentiment model ──────────────────────
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer  = AutoTokenizer.from_pretrained(model_name)
model      = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

text  = "The film was absolutely brilliant and deeply moving."
inputs = tokenizer(text, return_tensors="pt")
input_ids = inputs["input_ids"]

# ── Integrated Gradients (simplified scalar version) ────────
def integrated_gradients(input_ids, target_class=1, steps=50):
    # Detach so every interpolated input is a leaf tensor; otherwise inp.grad stays None
    embeddings = model.get_input_embeddings()(input_ids).detach()  # [1, seq, dim]
    baseline   = torch.zeros_like(embeddings)
    scaled_inputs = [baseline + (i / steps) * (embeddings - baseline)
                     for i in range(steps + 1)]
    grads = []
    for inp in scaled_inputs:
        inp.requires_grad_(True)
        out = model(inputs_embeds=inp).logits[0, target_class]
        out.backward()
        grads.append(inp.grad.detach().clone())
        model.zero_grad()

    avg_grads  = torch.stack(grads).mean(dim=0)               # [1, seq, dim]
    ig_attrs   = ((embeddings - baseline) * avg_grads).sum(dim=-1)  # [1, seq]
    return ig_attrs[0].detach().numpy()

attrs  = integrated_gradients(input_ids)
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

print(f"{'Token':<15} {'Attribution':>12}")
print("-" * 28)
for tok, score in zip(tokens, attrs):
    bar = "█" * int(abs(score) * 20)
    sign = "+" if score > 0 else "-"
    print(f"{tok:<15} {sign}{abs(score):.4f}  {bar}")

OUTPUT

Token            Attribution
----------------------------
[CLS]           +0.0023
The             +0.0087
film            +0.0341  █
was             +0.0012
absolutely      +0.6203  ████████████
brilliant       +0.8501  █████████████████
and             +0.0158
deeply          +0.5542  ███████████
moving          +0.7012  ██████████████
.               +0.0031
[SEP]           +0.0044

Section 03

SHAP for LLMs — Token-Level Shapley Values

SHAP (SHapley Additive exPlanations) originates from cooperative game theory. The Shapley value distributes the total prediction fairly among all features (tokens), considering every possible coalition they could form. For language, each token is a "player" and the output logit is the "prize to share."

Shapley Formula

φᵢ = Σ [|S|!(n-|S|-1)!/n!] × [v(S∪{i}) - v(S)]

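Sum over all token subsets S not containing token i, where n is the number of tokens and v(S) is the model output when only the tokens in S are visible. Measures the marginal contribution of each token, averaged across all coalitions.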

SHAP Efficiency Axiom

Σ φᵢ = f(x) - E[f(x)]

All token attributions sum to the difference between the actual prediction and the expected baseline prediction.

Text SHAP Baseline

v(∅) = f(mask_all)

The baseline is the model's prediction when all tokens are masked/replaced, representing the prior before any token is revealed.

KernelSHAP Approximation

φ = (ZᵀWZ)⁻¹ ZᵀW y

Fits a weighted linear model on masked samples, making Shapley value estimation tractable for large token sequences.
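The Shapley formula and the efficiency axiom above can be checked by brute force on a tiny example. The sketch below enumerates every coalition for three tokens. The value function `v` is hypothetical: it hard-codes a negation interaction instead of calling a model, so the numbers only illustrate the arithmetic.

```python
from itertools import combinations
from math import factorial

TOKENS = ["not", "terrible", "film"]

def v(coalition):
    """Hypothetical value function: the score when only these tokens are visible."""
    present = set(coalition)
    score = 0.0
    if "terrible" in present:
        score -= 1.0              # "terrible" on its own is negative
    if {"not", "terrible"} <= present:
        score += 1.6              # "not terrible" flips the sentiment
    if "film" in present:
        score += 0.1              # weak, independent contribution
    return score

def shapley_values(players, value_fn):
    """Exact Shapley values by enumerating all subsets (exponential in n)."""
    n = len(players)
    phi = {}
    for p in players:
        others = tuple(q for q in players if q != p)
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                total += weight * (value_fn(subset + (p,)) - value_fn(subset))
        phi[p] = total
    return phi

phi = shapley_values(TOKENS, v)
for tok, val in phi.items():
    print(f"{tok:10s} {val:+.4f}")
print("sum of phi     :", round(sum(phi.values()), 4))
print("v(all) - v(empty):", round(v(TOKENS) - v(()), 4))
```

In this toy the interaction bonus is split equally between "not" and "terrible", which is the symmetry axiom at work, and the attributions sum to v(all) minus v(∅). Exact enumeration needs 2ⁿ evaluations of `v`, which is why approximations such as KernelSHAP are used for real sequences.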

SHAP for Text Classification — Practical Code

import shap
import transformers
import torch

# ── Pipeline wrapper ──────────────────────────────────────────
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
pipe = transformers.pipeline(
    "text-classification",
    model=model_name,
    return_all_scores=True,
    device=0 if torch.cuda.is_available() else -1
)

# ── SHAP Text Explainer ───────────────────────────────────────
explainer = shap.Explainer(pipe)

samples = [
    "The film was absolutely brilliant and deeply moving.",
    "Terrible plot, bad acting — a complete waste of time.",
    "Despite some flaws, it was not terrible at all.",
]

shap_values = explainer(samples)

# ── Visualise (in Jupyter / HTML export) ─────────────────────
shap.plots.text(shap_values[0])     # coloured token heatmap
shap.plots.bar(shap_values[0, :, 1])  # bar chart for POSITIVE class

# ── Programmatic access ───────────────────────────────────────
for i, sample in enumerate(samples):
    values = shap_values[i, :, 1].values   # POSITIVE class scores
    tokens = shap_values[i].data
    print(f"\nSample {i+1}:")
    for tok, val in sorted(zip(tokens, values), key=lambda x: -abs(x[1])):
        # SHAP's text masker typically returns tokens with trailing spaces and
        # special tokens as empty strings, so strip before filtering
        tok = tok.strip()
        if tok and tok not in ('[CLS]', '[SEP]'):
            direction = "▲ POS" if val > 0 else "▼ NEG"
            print(f"  {tok:15s} {val:+.4f}  {direction}")

OUTPUT

Sample 1:
  brilliant       +0.2841  ▲ POS
  absolutely      +0.2103  ▲ POS
  moving          +0.1977  ▲ POS
  deeply          +0.1542  ▲ POS
  film            +0.0341  ▲ POS

Sample 2:
  terrible        -0.3210  ▼ NEG
  waste           -0.2540  ▼ NEG
  bad             -0.2213  ▼ NEG
  plot            -0.1124  ▼ NEG
  complete        -0.0870  ▼ NEG

Sample 3:
  not             +0.1822  ▲ POS   ← negation captured!
  terrible        -0.2980  ▼ NEG
  flaws           -0.1540  ▼ NEG
  all             +0.1103  ▲ POS

💪

SHAP Captured Negation

Notice in Sample 3: "not" has a positive attribution and "terrible" has a negative one, which suggests the model has learned that "not terrible" is a positive signal. SHAP can surface this interaction because it measures each token's marginal contribution across many coalitions, including "not" appearing with and without "terrible."

Section 04

How Tokens Flow Through a Transformer — Visual Walkthrough

Before we can attribute meaning to tokens, we need to understand their journey through the transformer. Each token embedding is updated at every layer, influenced by all other tokens via multi-head attention.

🔀 Animated — Token Residual Stream & Attention Flow

Each column is a token. Rows are transformer layers. Line brightness = attention weight between tokens.

💡

The Residual Stream Perspective

Each token has a residual stream: a vector to which every attention and MLP block adds its output, so it accumulates information layer by layer. Attribution methods like IG and SHAP trace the output back to the input tokens and treat everything that happens inside the stream as a black box. Newer mechanistic interpretability work (e.g., Anthropic's superposition research) tries to read the residual stream directly, decomposing it into interpretable features.
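The "accumulates" part of the residual stream view can be made concrete with a small sketch. The two blocks below are deliberately crude stand-ins (a mean-mixing "attention" and a ReLU "MLP"), chosen here only for illustration. The point is the additive bookkeeping, not the blocks themselves.

```python
def attention_block(stream):
    """Stand-in for attention: every token reads half of the mean token vector."""
    dim = len(stream[0])
    mean = [sum(tok[d] for tok in stream) / len(stream) for d in range(dim)]
    return [[0.5 * m for m in mean] for _ in stream]

def mlp_block(stream):
    """Stand-in for an MLP: a per-token ReLU of the current stream."""
    return [[max(0.0, v) for v in tok] for tok in stream]

def run_with_writes(embeddings, n_layers=2):
    """Apply the blocks residually and record what each block wrote to the stream."""
    stream = [list(tok) for tok in embeddings]
    writes = []
    for layer in range(n_layers):
        for name, block in (("attn", attention_block), ("mlp", mlp_block)):
            delta = block(stream)   # each block reads the current stream ...
            stream = [[s + d for s, d in zip(tok, dtok)]
                      for tok, dtok in zip(stream, delta)]   # ... and adds to it
            writes.append((f"L{layer}.{name}", delta))
    return stream, writes

embeddings = [[1.0, -2.0], [0.5, 0.0], [-1.0, 3.0]]   # 3 tokens, 2 dimensions
final, writes = run_with_writes(embeddings)

# The final stream equals the embeddings plus the sum of every block's write
rebuilt = [list(tok) for tok in embeddings]
for _, delta in writes:
    rebuilt = [[r + d for r, d in zip(tok, dtok)] for tok, dtok in zip(rebuilt, delta)]

for name, delta in writes:
    print(name, "wrote", [[round(v, 3) for v in tok] for tok in delta])
print("final stream:", [[round(v, 3) for v in tok] for tok in final])
```

Because every block only adds to the stream, the final vector for each token can be split into per-block contributions. Mechanistic work relies on this decomposition when it asks which layer wrote a given piece of information.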

Section 05

Chain-of-Thought Prompting as Explanation

📖 Story

Show Your Work — The Exam Analogy

In mathematics class, getting the right answer is only half the grade. You must show your work. If you write "42" without any steps, the teacher can't tell if you understood the concept or just guessed. If you write your reasoning, the teacher can pinpoint exactly where your logic went astray — and award partial credit.

Chain-of-Thought (CoT) prompting forces LLMs to "show their work" before answering. Instead of jumping directly to an answer, the model produces a sequence of intermediate reasoning steps. This makes the model's reasoning visible, checkable, and correctable — a form of self-explanation.

Zero-Shot CoT vs. Few-Shot CoT

❌ Standard Prompting (Black Box)

Role	Content
User	If a train travels at 60 mph for 2.5 hours, then slows to 40 mph for 1 hour, what is the total distance?
Model	190 miles
XAI Verdict	No reasoning visible. Correct answer but no verifiability.

✅ Chain-of-Thought Prompting

Role	Content
User	…Let's think step by step.
Model	Step 1: Distance₁ = 60 × 2.5 = 150 miles. Step 2: Distance₂ = 40 × 1 = 40 miles. Step 3: Total = 150 + 40 = 190 miles.
XAI Verdict	Each step auditable. Errors localizable.

Faithful vs. Unfaithful CoT — A Critical Distinction

A critical question in XAI is: does the chain-of-thought actually cause the answer, or is it a post-hoc rationalisation? Research by Turpin et al. (2023) showed that CoT can be unfaithful — the model produces a persuasive-looking reasoning chain, yet the actual computation driving the answer is different.

✅

FAITHFUL CoT

Reasoning → Answer

The intermediate steps genuinely control the final token. Intervening on the reasoning changes the answer. Verifiable by causal intervention on the reasoning steps.

⚠️

PARTIALLY FAITHFUL

Mixed Signal

Some steps are causally relevant, others are post-hoc filler. Common in multi-step arithmetic. Hard to detect without causal intervention tools.

❌

UNFAITHFUL CoT

Rationalisation Only

The model "knows" the answer from heuristics and constructs a plausible-sounding story afterward. Dangerous in safety-critical settings.

Testing CoT Faithfulness — Intervention Method

import openai
import re

client = openai.OpenAI()  # or use Anthropic / local LLM

def cot_with_intervention(question, wrong_intermediate):
    """
    Tests CoT faithfulness by injecting a wrong intermediate step
    and checking if the model corrects itself or blindly follows.
    """
    # Faithful model: re-computes from injected wrong step → wrong answer
    # Unfaithful model: ignores the step → returns its pre-computed answer

    prompt_faithful = f"""Solve step by step.
{question}

Let me start: Step 1: {wrong_intermediate}
Continue from Step 2 onward:"""

    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt_faithful}],
        temperature=0
    )
    continuation = resp.choices[0].message.content

    # Extract final answer
    nums = re.findall(r'\b\d+(?:\.\d+)?\b', continuation)
    final = nums[-1] if nums else "?"
    return continuation, final

question = "Train travels 60 mph for 2.5h, then 40 mph for 1h. Total distance?"
wrong_step = "Distance₁ = 60 × 2.5 = 120 miles (incorrect injected value)"

chain, answer = cot_with_intervention(question, wrong_step)
print("Chain continuation:\n", chain)
print("\nFinal answer extracted:", answer)
print("Expected faithful answer: 160 (follows injected wrong step)")
print("If model says 190 without flagging the injected error → likely UNFAITHFUL (ignored your step)")

😱

CoT Is Not a Replacement for Attribution Methods

CoT provides a textual explanation at the prompt level. Token attribution methods like IG and SHAP operate at the model-internal level. Both are needed: CoT tells you what reasoning the model claims to do; attribution methods tell you what the model actually does internally. In high-stakes settings, verify CoT faithfulness with causal intervention.

Section 06

Attention Maps — Visualising What the Model "Looks At"

Every transformer layer has multiple attention heads, each computing a weighted average over all token positions. Visualising these weights as a heatmap is one of the most popular (and most abused) interpretability tools.

🔥 Animated figure: Multi-Head Attention Heatmap (head 1 of 4 shown)

Head 1: Positional / Syntactic

This head attends strongly to adjacent tokens and punctuation, capturing local syntactic structure.

Rows = query token; Columns = key token. Cell brightness = attention weight (legend: high attention to low attention).
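For readers who want to see where the numbers in such a heatmap come from, here is a minimal sketch of the scaled dot-product attention weights for one head. The query and key vectors are made-up 2-d values, and no causal mask is applied, so this corresponds to a bidirectional (encoder-style) head.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Row i holds the weights that query token i places on every key token."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(q_i * k_i for q_i, k_i in zip(q, k)) / scale for k in keys])
        for q in queries
    ]

tokens = ["the", "film", "was", "brilliant"]
# Hypothetical 2-d query/key vectors, one per token
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 2.0]]
keys    = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 2.0]]

weights = attention_weights(queries, keys)

# Text heatmap: rows = query token, columns = key token
shades = " .:*#"
for tok, row in zip(tokens, weights):
    cells = "".join(shades[min(4, int(w * 5))] for w in row)
    print(f"{tok:>10} |{cells}| " + " ".join(f"{w:.2f}" for w in row))
```

Each row is a probability distribution over key positions. That is all a heatmap cell shows, which is why a bright cell is not evidence that the key token changed the prediction.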

Section 07

XAI Taxonomy for LLMs — Which Method for Which Goal?

No single XAI method dominates all settings. The choice depends on your access to model internals, the granularity of explanation needed, and whether you prioritise faithfulness or human readability.

Method	Type	Access Required	Faithfulness	Human Readability	Speed	Best For
Integrated Gradients	Gradient	White-box	High	Medium	Fast	Open-source LLMs, token-level attribution
SHAP (Text)	Shapley	Black-box / API	Very High	High	Slow	Production APIs, stakeholder reports
LIME	Perturbation	Black-box / API	Medium	High	Medium	Quick local explanations, non-technical audiences
Attention Attribution	Attention	White-box	Low–Medium	Very High	Very Fast	Debugging, visualisation dashboards
Chain-of-Thought	Generative	Prompt-level	Variable	Very High	Fast	End-user explanations, reasoning traces
Mechanistic Interp.	Circuit	Full internals	Very High	Low	Very Slow	Research, safety auditing, capability probing
Probing Classifiers	Linear Probe	Activations	Medium	Medium	Fast	Checking what concepts are represented in layers
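The "Black-box / API" rows in the table need nothing beyond the ability to query the model. As a sketch of that access pattern, the example below uses leave-one-out occlusion, a simpler perturbation method than LIME or SHAP and named here plainly as a stand-in for them. The `black_box_score` function is a hypothetical placeholder for an API call; its lexicon and numbers are invented for illustration.

```python
def black_box_score(tokens):
    """Hypothetical stand-in for an API call that returns P(positive)."""
    lexicon = {"brilliant": 0.3, "moving": 0.2, "terrible": -0.4}
    score = 0.5 + sum(lexicon.get(t, 0.0) for t in tokens)
    return min(1.0, max(0.0, score))

def occlusion_attribution(tokens, score_fn, mask_token="[MASK]"):
    """Score drop when each token is masked in turn (one query per token)."""
    full = score_fn(tokens)
    attributions = []
    for i in range(len(tokens)):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        attributions.append(full - score_fn(masked))
    return attributions

tokens = ["the", "film", "was", "brilliant", "and", "moving"]
attrs = occlusion_attribution(tokens, black_box_score)

for tok, a in zip(tokens, attrs):
    print(f"{tok:12s} {a:+.2f}")
```

Occlusion needs only n + 1 queries for n tokens, but it masks one token at a time and therefore misses interactions such as negation. LIME and SHAP spend more queries on multi-token perturbations to capture those, which is the speed versus faithfulness trade-off the table summarises.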

Section 08

Mechanistic Interpretability — Reading the Circuit

Mechanistic interpretability goes beyond attributing tokens — it tries to reverse-engineer the actual computational circuits (sub-graphs of attention heads and MLP neurons) that implement a specific behaviour.

📖 Analogy

The Neuroscientist and the Brain Slice

A neuroscientist studying memory doesn't just observe behaviour — she dissects brain tissue, traces neural pathways, and maps which neurons fire for which stimuli. Mechanistic interpretability does the same for transformers: it dissects the attention heads and MLP layers, traces the "information pathways" for specific tasks, and identifies which circuits are responsible for, say, completing "The Eiffel Tower is in Paris" or detecting gender in a pronoun.

Identify the Task & Behaviour

Choose a specific, measurable model behaviour, e.g. "indirect object identification" (IOI): given "When Mary and John went to the store, John gave a bottle to", the model must predict "Mary", the name that has not been repeated.

Activation Patching (Causal Tracing)

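Run two prompts: the original ("clean") and a "corrupted" version. Patch activations from one run into the other, one component at a time; the sketch below patches clean activations into the corrupted run. If the output shifts toward the behaviour of the run the activation came from, that component is causally relevant.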

Identify the Circuit

Map the minimal set of attention heads and MLP neurons that are both necessary and sufficient for the behaviour. This is the "circuit." For IOI in GPT-2 small, Wang et al. (2022) identified a circuit of 26 attention heads.

Interpret Each Component

Analyse what each head in the circuit computes: is it a "name mover" (copies the indirect-object name to the output), a "duplicate token" detector, or an "induction head" (completes repeated patterns)?

Validate & Generalise

Ablate (zero-out) the circuit and verify performance drops. Test whether the circuit generalises to semantically similar tasks. Write up the circuit as a human-readable algorithm.
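The patching step in the workflow above can be illustrated without any model at all. In the toy below, the "model" is just the sum of two hypothetical components, and the `run` helper, component names, and prompt dictionaries are all invented for this sketch. It patches clean activations into the corrupted run and reports how much of the clean behaviour each patch restores.

```python
def component_a(prompt):
    """Hypothetical component whose activation depends on the subject token."""
    return 2.0 if prompt["subject"] == "John" else -2.0

def component_b(prompt):
    """Hypothetical component whose activation depends only on the place token."""
    return 0.5 if prompt["place"] == "store" else 0.0

COMPONENTS = {"A": component_a, "B": component_b}

def run(prompt, patch=None):
    """Run the toy model; `patch` maps a component name to a forced activation."""
    patch = patch or {}
    acts = {name: patch.get(name, fn(prompt)) for name, fn in COMPONENTS.items()}
    return sum(acts.values()), acts

clean     = {"subject": "John", "place": "store"}
corrupted = {"subject": "Mary", "place": "store"}

clean_out, clean_acts = run(clean)
corrupted_out, _ = run(corrupted)

# Patch one clean activation at a time into the corrupted run
restored = {}
for name in COMPONENTS:
    patched_out, _ = run(corrupted, patch={name: clean_acts[name]})
    restored[name] = (patched_out - corrupted_out) / (clean_out - corrupted_out)

for name, frac in restored.items():
    print(f"component {name}: restores {frac:.0%} of the clean behaviour")
```

Component B produces the same activation in both runs, so patching it changes nothing. Only components whose activations differ between the clean and corrupted prompts can show up as causally relevant, which is why the choice of corrupted prompt matters so much in practice.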

Activation Patching — Python Sketch

from transformer_lens import HookedTransformer, patching
import torch

# Load a small interpretable model (TransformerLens wraps HuggingFace)
model = HookedTransformer.from_pretrained("gpt2-small")

# Two prompts: clean (IOI task) and corrupted (name swapped)
clean_prompt     = "When Mary and John went to the store, John gave a bottle to"
corrupted_prompt = "When Mary and John went to the store, Mary gave a bottle to"

clean_tokens     = model.to_tokens(clean_prompt)
corrupted_tokens = model.to_tokens(corrupted_prompt)

# Get logits and cache for both
clean_logits, clean_cache       = model.run_with_cache(clean_tokens)
corrupted_logits, corrupt_cache = model.run_with_cache(corrupted_tokens)

mary_idx = model.to_single_token(" Mary")
john_idx = model.to_single_token(" John")

# Metric: logit difference (Mary - John at the last position).
# In the clean prompt the indirect object, and hence the correct answer, is " Mary".
def ioi_metric(logits):
    last = logits[0, -1]
    return last[mary_idx] - last[john_idx]   # 0-dim tensor, as the patching helper expects

clean_metric     = ioi_metric(clean_logits).item()
corrupted_metric = ioi_metric(corrupted_logits).item()
print(f"Clean metric (Mary-John logit diff): {clean_metric:.3f}")
print(f"Corrupted metric:                    {corrupted_metric:.3f}")

# Patch the clean residual stream into the corrupted run at each layer and
# position to find where the task-relevant information lives
results = patching.get_act_patch_resid_pre(
    model, corrupted_tokens, clean_cache,
    lambda logits: ioi_metric(logits) - corrupted_metric
)
print("Patching results shape:", results.shape, "[layers × positions]")
print("Max causal impact layer:", results.argmax(dim=0))

OUTPUT

Clean metric (Mary-John logit diff): +4.812
Corrupted metric:                    -2.341
Patching results shape: torch.Size([12, 18]) [layers × positions]
Max causal impact layer: tensor([ 9, 9, 0, 9, 9, 0, 9, 10, 9, 9, 9, 0, 9, 10, 9, 10, 9, 10])
→ Layers 9–10 are most causally relevant for the IOI task.
→ This matches the "name mover heads" identified in Wang et al. (2022).

Section 09

Probing Classifiers — What Does Each Layer Know?

A probing classifier is a simple linear (or shallow) model trained on the internal representations (activations) of an LLM layer to predict some property — POS tags, sentiment, syntactic role, factual attributes. If the probe achieves high accuracy, the LLM's representations encode that property at that layer.

Layer-by-Layer Probe Accuracy — Animated Bar Chart

📈 Animated — Probing Accuracy by Layer (GPT-2 Small, 12 Layers)


🔎

Probing Does Not Prove Causality

A high probing accuracy means the information is linearly decodable from the representation — not that the model uses it for its output. A representation can encode POS information that is completely ignored by downstream layers. Combine probing with activation patching to test causal relevance.
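A minimal probing experiment can be sketched without a real model. The example below trains a logistic-regression probe on synthetic "activations" for two hypothetical layers: one where a single dimension carries the label, and one that is pure noise. The data generator, dimensions, and hyperparameters are all assumptions chosen for illustration. In practice you would extract activations from the LLM and would likely use a library classifier.

```python
import math
import random

rng = random.Random(0)

def make_layer(labels, informative):
    """Synthetic 4-d activations; dimension 0 carries the label only if informative."""
    acts = []
    for y in labels:
        vec = [rng.gauss(0.0, 1.0) for _ in range(4)]
        if informative:
            vec[0] = (1.0 if y == 1 else -1.0) + rng.gauss(0.0, 0.3)
        acts.append(vec)
    return acts

def train_probe(X, y, epochs=200, lr=0.1):
    """Logistic-regression probe trained with plain SGD."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            z = max(-30.0, min(30.0, z))            # keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def accuracy(w, b, X, y):
    hits = 0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi)) + b
        hits += int((z > 0) == (yi == 1))
    return hits / len(X)

labels = [rng.randint(0, 1) for _ in range(300)]
layers = {
    "informative layer": make_layer(labels, informative=True),
    "noise layer": make_layer(labels, informative=False),
}

results = {}
for name, acts in layers.items():
    w, b = train_probe(acts[:200], labels[:200])          # train split
    results[name] = accuracy(w, b, acts[200:], labels[200:])  # held-out split
    print(f"{name:18s} held-out probe accuracy: {results[name]:.2f}")
```

High held-out accuracy on the informative layer shows only that the property is linearly decodable there. Nothing in this experiment tests whether a downstream computation reads that dimension, which is the gap that activation patching is meant to close.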

Section 10

Self-Consistency & CoT Faithfulness Testing

Self-Consistency (Wang et al. 2022) improves CoT reliability by sampling multiple reasoning paths and taking a majority vote on the final answer. From an XAI perspective, disagreeing chains highlight exactly which reasoning steps are uncertain.

import openai
from collections import Counter
import re, json

client = openai.OpenAI()

def self_consistent_cot(question, n_samples=5, temperature=0.7):
    """Generate n CoT chains and return majority answer + diversity stats."""
    prompt = f"""{question}

Think step by step. At the end write: ANSWER: [your final answer]"""

    chains, answers = [], []
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature, max_tokens=400
        )
        text = resp.choices[0].message.content
        chains.append(text)
        match = re.search(r"ANSWER:\s*(.+)", text)
        answers.append(match.group(1).strip() if match else "?")

    vote = Counter(answers)
    majority, count = vote.most_common(1)[0]
    confidence = count / n_samples

    print(f"\n{'='*50}")
    print(f"Majority answer : {majority}")
    print(f"Confidence      : {confidence:.0%} ({count}/{n_samples} chains agree)")
    print(f"Answer spread   : {dict(vote)}")

    # XAI insight: if confidence < 60%, flag for human review
    if confidence < 0.6:
        print("⚠️  LOW CONFIDENCE — reasoning chains disagree significantly.")
        print("    Recommend: human review or additional context.")
    return majority, chains

ans, chains = self_consistent_cot(
    "A snail travels 3 km per hour uphill and 6 km per hour downhill. "
    "It travels 12 km uphill and 12 km downhill. What is its average speed?"
)

OUTPUT

==================================================
Majority answer : 4 km/h
Confidence      : 80% (4/5 chains agree)
Answer spread   : {'4 km/h': 4, '4.5 km/h': 1}

Chain 1 (correct): Uphill time = 12/3 = 4h. Downhill = 12/6 = 2h. Total = 24km/6h = 4 km/h ✓
Chain 5 (wrong): Incorrectly used arithmetic mean (3+6)/2 = 4.5 km/h ✗
→ Failed to recognise harmonic mean problem.

Section 11

INSEQ — Unified Attribution for Generative LLMs

INSEQ (Sarti et al. 2023) is a Python library that wraps Hugging Face causal and encoder-decoder models and provides 15+ attribution methods through a single unified API. It is the closest thing to a "standard toolkit" for LLM interpretability.

# pip install inseq
import inseq

# ── Load model with attribution method ───────────────────────
model = inseq.load_model(
    "gpt2",
    attribution_method="integrated_gradients"
)

# ── Attribute a generation step ──────────────────────────────
out = model.attribute(
    input_texts="The capital of France is",
    n_steps=50,       # IG integration steps
    show_progress=False
)

# ── Inspect attributions ────────────────────────────────────
out.show()  # HTML heatmap in Jupyter

# Programmatic access
step = out.sequence_attributions[0]
print("Generated token:", step.target)
print("\nInput attributions:")
for token, attr in zip(step.source_tokens, step.source_attributions[0]):
    print(f"  {token:15s} {attr.item():+.4f}")

# ── Compare methods side by side ────────────────────────────
methods = ["integrated_gradients", "input_x_gradient", "attention"]
for method in methods:
    m = inseq.load_model("gpt2", attribution_method=method)
    o = m.attribute("The capital of France is", show_progress=False)
    top = sorted(
        zip(o.sequence_attributions[0].source_tokens,
            o.sequence_attributions[0].source_attributions[0]),
        key=lambda x: -abs(x[1].item())
    )[:2]
    print(f"{method:25s} top tokens: {[t for t,_ in top]}")

OUTPUT

Generated token: Paris

Input attributions:
  The             +0.0041
  capital         +0.3821  ← strong signal
  of              +0.0012
  France          +0.8741  ← strongest signal
  is              +0.0183

integrated_gradients      top tokens: ['France', 'capital']
input_x_gradient          top tokens: ['France', 'capital']
attention                 top tokens: ['is', 'France']  ← attention differs!

💡

France > capital — Why This Matters

The token "France" has the highest attribution for generating "Paris" — sensible, as France directly specifies the country. "capital" is second — it signals the relationship type. Notice how attention puts "is" high, while gradient methods do not. This illustrates why gradient methods are preferred for faithful attribution.

Section 12

Golden Rules — XAI for LLMs in Production

🌟 Non-Negotiable Rules for LLM Interpretability

Never use attention weights as the sole attribution method. Attention is fast and visually appealing, but it is not reliably faithful. Always validate with gradient-based (IG, Grad×Input) or Shapley-based methods before drawing conclusions that inform decisions.

Verify CoT faithfulness for safety-critical tasks. Use intervention methods (inject wrong intermediate steps, check if the answer follows) or consistency checks (sample 5–10 chains, flag low agreement) before trusting chain-of-thought explanations in medical, legal, or financial applications.

Always specify a meaningful baseline for IG. The zero embedding baseline is conventional but not always meaningful. For masked-language models, use the [MASK] token. For causal models, use a pad token or blank-text embedding. The completeness axiom guarantees attributions sum to the output difference from your chosen baseline.

Match your XAI method to your audience. Regulators and end-users need SHAP bar charts and CoT text. Data scientists debugging models need IG heatmaps and probing plots. Safety researchers need mechanistic circuit analysis. One explanation does not fit all stakeholders.

Distinguish local from global explanations. LIME and SHAP are local — they explain a single prediction. Probing classifiers and circuit analysis are global — they explain model behaviour across inputs. Use local explanations for individual decisions; use global explanations for model auditing and bias detection.

XAI is not a substitute for evaluation. A model that produces compelling explanations for wrong answers is worse than a model that fails silently — it creates misplaced trust. Always measure both model accuracy and explanation faithfulness as separate metrics.

Use INSEQ or Captum as your implementation starting point. Building attribution from scratch is error-prone. Both libraries implement the completeness axiom correctly, handle tokeniser alignment, and provide HTML visualisations out of the box. Never re-implement IG without checking the off-by-one in the integration boundary.

Section 13

XAI Methods at a Glance — Full Reference Table

Method	Family	Key Axioms Met	Pros	Cons	Python Library
Integrated Gradients	Gradient	Completeness, Sensitivity	Fast, principled baseline	Requires white-box access	`inseq`, `captum`
Gradient × Input	Gradient	Sensitivity	Fastest gradient method	No completeness guarantee	`inseq`
SmoothGrad	Gradient	Sensitivity	Reduces gradient noise	Slower, many forward passes	`captum`
SHAP (Partition)	Shapley	Efficiency, Symmetry, Dummy	All axioms satisfied	Exponential exact complexity	`shap`
LIME	Surrogate	Local fidelity	Black-box, human-friendly	Unstable, no axioms	`lime`
Attention Rollout	Attention	None formally	No gradient needed	Unfaithful to prediction	`bertviz`
CoT Prompting	Generative	None formal	Natural language, auditable	May be post-hoc rationalisation	Any LLM API
Activation Patching	Causal	Causal sufficiency, necessity	Truly causal — not correlational	Very slow, needs clean/corrupt pairs	`transformer_lens`
Probing Classifiers	Representational	None formal	Layer-wise concept tracking	Linear decodability ≠ causal use	`sklearn` + HF