
Interpreting LLMs with XAI: Token Attributions, Chain-of-Thought & Mechanistic Analysis

A comprehensive, hands-on tutorial on Explainable AI for Large Language Models. Covers six major interpretability methods (Integrated Gradients, SHAP, LIME, Attention Rollout, Activation Patching, Probing Classifiers), chain-of-thought faithfulness testing, mechanistic interpretability circuits, and the INSEQ unified library, with animated diagrams, real Python code, and decision tables to help practitioners choose the right method for each setting.

Section 01

Why Do We Need to Explain LLMs?

The Doctor and the Black Box
Imagine a hospital deploys a large language model to assist in diagnosis. A patient comes in with chest pain, shortness of breath, and fatigue. The LLM outputs: "High likelihood of cardiac event — immediate intervention recommended."

The attending physician stares at the screen. The model is right 94% of the time on test data. But why did it flag this patient? Was it the "chest pain" token? The combination of all three symptoms? Or did it latch onto something unrelated — perhaps the patient's age mentioned earlier in the note?

Without an explanation, the physician can't verify the reasoning, can't catch a failure mode, and can't defend the decision in court. The model is a black box. This is precisely the problem that Explainable AI (XAI) for LLMs tries to solve.

Large Language Models (LLMs) like GPT-4, Claude, and LLaMA achieve remarkable performance across tasks — yet their internal computations involve billions of parameters interacting through attention mechanisms, feed-forward networks, and residual streams that no human can read directly. XAI for LLMs is the discipline of building tools and methods that make these opaque systems interpretable, transparent, and auditable.

🔎
Two Flavours of Interpretability

Interpretability asks: what does the model actually do internally? Explainability asks: can I generate a human-readable justification for an output? Interpretability is a property of the model; explainability is a property of the explanation. XAI typically pursues explainability using interpretability tools as its engine.

⚖️
Trust & Safety
Why It Matters
Medical, legal, and financial deployments require auditable reasoning. A model that is right for the wrong reasons will fail silently in production when the distribution shifts.
⚖️
Bias Detection
Fairness Lens
Attribution methods can reveal whether a model's decision rests on protected attributes (race, gender, age), or on proxies for them, even when those attributes were never supplied as explicit input features.
🔧
Debugging & Improvement
Model Development
When a model fails, explanations tell you where and why it went wrong — enabling targeted fine-tuning, prompt engineering, and architecture modifications.

Section 02

Token Attributions — Tracing Credit Back to Input

When an LLM generates a token, every input token contributed some amount — positive or negative — to that prediction. Token attribution methods assign a numerical score to each input token reflecting how much it influenced the output.

The Jury Deliberation Room
A jury of 12 people votes "Guilty." The judge asks each juror: "How much did the CCTV footage sway your verdict? The alibi? The motive? The character witness?" Each juror (like each attention head) gave a different weight to different pieces of evidence. Token attribution is the act of asking the model that same question — and getting a numerical answer for every piece of evidence (token) in the prompt.

The Main Methods

Gradient × Input
gradient-based
Compute the gradient of the output logit with respect to each token's embedding, then multiply element-wise with the embedding itself. Fast and differentiable — the workhorse of attribution in neural NLP.
Integrated Gradients (IG)
axiomatic attribution
Integrate gradients along a straight path from a baseline (zero/pad embedding) to the actual input. Satisfies completeness: attributions sum to the output difference from baseline. Gold standard for many tasks.
🎭
LIME (Local Surrogate)
perturbation-based
Mask random subsets of input tokens, observe output changes, then fit a simple linear model to the perturbation results. Model-agnostic — works on any black-box LLM without access to gradients.
🃏
SHAP (Shapley Values)
game-theoretic
Treats each token as a "player" in a coalition game. The Shapley value gives each player a fair share of the total prediction, satisfying efficiency, symmetry, and dummy axioms simultaneously.
🛠️
Attention Attribution
attention-based
Use the attention weights from transformer heads as a proxy for importance. Controversial — attention is not always faithful to prediction — but fast and visually intuitive for debugging.
📈
INSEQ Library
unified framework
A Python library that unifies 15+ attribution methods under one API for sequence-to-sequence and causal LMs. Supports Hugging Face models with minimal setup.
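
To make the simplest of these methods concrete, here is a minimal Gradient × Input sketch on a toy model. Everything in it is an assumption made for illustration: the two-dimensional embeddings in `EMB`, the weights `W`, and the `logit` function are hypothetical stand-ins for a real network, and the gradient is estimated by finite differences rather than backpropagation.

```python
import math

# Hypothetical 2-d embeddings for a toy vocabulary (illustration only).
EMB = {
    "the": [0.1, 0.0],
    "film": [0.2, 0.1],
    "was": [0.0, 0.1],
    "brilliant": [1.5, -0.2],
}
W = [1.0, -0.5]  # hypothetical classifier weights


def logit(embs):
    """Toy 'positive' logit: tanh of the weighted sum of pooled token embeddings."""
    pooled = [sum(e[d] for e in embs) for d in range(len(W))]
    return math.tanh(sum(w * p for w, p in zip(W, pooled)))


def grad_x_input(embs, eps=1e-5):
    """Per-token Gradient x Input, with the gradient taken by central differences."""
    scores = []
    for t in range(len(embs)):
        total = 0.0
        for d in range(len(embs[t])):
            plus = [list(e) for e in embs]
            minus = [list(e) for e in embs]
            plus[t][d] += eps
            minus[t][d] -= eps
            grad = (logit(plus) - logit(minus)) / (2 * eps)
            total += grad * embs[t][d]  # multiply the gradient by the input value
        scores.append(total)  # sum over embedding dimensions gives one score per token
    return scores


tokens = ["the", "film", "was", "brilliant"]
scores = grad_x_input([EMB[t] for t in tokens])
for tok, s in zip(tokens, scores):
    print(f"{tok:<10} {s:+.4f}")
```

In a real model the finite-difference loop is replaced by a single backward pass, which is why the method is so cheap.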

Visualising Token Attribution — Sentiment Classification

Below is an animated diagram showing how Integrated Gradients attribute credit for a sentiment prediction. Green tokens pushed the prediction toward Positive; red tokens pushed it toward Negative. Token opacity encodes magnitude.

💡 Animated: Integrated Gradients Attribution (Sentiment Task). Legend: positive attribution, negative attribution, neutral / near-zero.
⚠️
Attention ≠ Attribution — A Common Mistake

Many practitioners use attention weights as attributions because they are easily extracted. But Jain & Wallace (2019) showed that attention weights are often only weakly correlated with gradient-based importance measures, and that permuting the weights or substituting very different attention distributions frequently leaves the prediction almost unchanged. Attention tells you where the model looks, not what changes the prediction. Prefer gradient-based or Shapley methods when faithfulness matters.

Integrated Gradients — Python Code

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np

# ── Load a pre-trained sentiment model ──────────────────────
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer  = AutoTokenizer.from_pretrained(model_name)
model      = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

text  = "The film was absolutely brilliant and deeply moving."
inputs = tokenizer(text, return_tensors="pt")
input_ids = inputs["input_ids"]

# ── Integrated Gradients (simplified scalar version) ────────
def integrated_gradients(input_ids, target_class=1, steps=50):
    # Detach so every interpolated point is a leaf tensor; .grad is only populated on leaves
    embeddings = model.get_input_embeddings()(input_ids).detach()  # [1, seq, dim]
    baseline   = torch.zeros_like(embeddings)
    scaled_inputs = [baseline + (i / steps) * (embeddings - baseline)
                     for i in range(steps + 1)]
    grads = []
    for inp in scaled_inputs:
        inp.requires_grad_(True)
        out = model(inputs_embeds=inp).logits[0, target_class]
        out.backward()
        grads.append(inp.grad.detach().clone())
        model.zero_grad()

    avg_grads  = torch.stack(grads).mean(dim=0)               # [1, seq, dim]
    ig_attrs   = ((embeddings - baseline) * avg_grads).sum(dim=-1)  # [1, seq]
    return ig_attrs[0].detach().numpy()

attrs  = integrated_gradients(input_ids)
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

print(f"{'Token':<15} {'Attribution':>12}")
print("-" * 28)
for tok, score in zip(tokens, attrs):
    bar = "█" * int(abs(score) * 20)
    sign = "+" if score > 0 else "-"
    print(f"{tok:<15} {sign}{abs(score):.4f}  {bar}")
OUTPUT
Token            Attribution
----------------------------
[CLS]           +0.0023
The             +0.0087
film            +0.0341  █
was             +0.0012
absolutely      +0.6203  ████████████
brilliant       +0.8501  █████████████████
and             +0.0158
deeply          +0.5542  ███████████
moving          +0.7012  ██████████████
.               +0.0031
[SEP]           +0.0044
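
A quick sanity check for any IG implementation is the completeness axiom: the attributions should sum to f(x) minus f(baseline). The sketch below checks this on a toy function with a known gradient. The function `f`, the weights `W`, and the input values are hypothetical, and the integral is approximated with a midpoint Riemann sum, which is one reasonable choice among several.

```python
import math

W = [1.0, -0.5, 2.0]  # hypothetical weights of a toy model


def f(x):
    """Toy scalar model output."""
    return math.tanh(sum(w * xi for w, xi in zip(W, x)))


def grad_f(x):
    """Analytic gradient of f with respect to each input."""
    z = sum(w * xi for w, xi in zip(W, x))
    return [(1 - math.tanh(z) ** 2) * w for w in W]


def integrated_gradients(x, baseline, steps=200):
    """IG via a midpoint Riemann sum along the straight path baseline -> x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad_f(point))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


x = [0.8, 0.4, 0.3]
baseline = [0.0, 0.0, 0.0]
attrs = integrated_gradients(x, baseline)

# Completeness: attributions should sum to f(x) - f(baseline).
gap = abs(sum(attrs) - (f(x) - f(baseline)))
print("attributions:", [round(a, 4) for a in attrs])
print("completeness gap:", gap)
```

If the gap does not shrink as `steps` grows, the integration grid or the baseline handling is usually at fault.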

Section 03

SHAP for LLMs — Token-Level Shapley Values

SHAP (SHapley Additive exPlanations) originates from cooperative game theory. The Shapley value distributes the total prediction fairly among all features (tokens), considering every possible coalition they could form. For language, each token is a "player" and the output logit is the "prize to share."

Shapley Formula
φᵢ = Σ [|S|!(n-|S|-1)!/n!] × [v(S∪{i}) - v(S)]
Sum over all token subsets S not containing token i. Measures marginal contribution of each token across all coalitions.
SHAP Efficiency Axiom
Σ φᵢ = f(x) - E[f(x)]
All token attributions sum to the difference between the actual prediction and the expected baseline prediction.
Text SHAP Baseline
v(∅) = f(mask_all)
The baseline is the model's prediction when all tokens are masked/replaced, representing the prior before any token is revealed.
KernelSHAP Approximation
φ = (ZᵀWZ)⁻¹ ZᵀW y
Fits a weighted linear model on masked samples, making Shapley value estimation tractable for large token sequences.
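
For a handful of tokens the Shapley formula can be evaluated exactly by enumerating every coalition. The sketch below does that for three tokens. The value function `v` is a hypothetical toy sentiment score, not a real model, chosen so that "not" only matters when "terrible" is present.

```python
from itertools import combinations
from math import factorial

tokens = ["not", "terrible", "film"]


def v(coalition):
    """Hypothetical value function: toy sentiment score of the visible tokens."""
    s = set(coalition)
    score = 0.0
    if "terrible" in s:
        score -= 1.0
    if "not" in s and "terrible" in s:
        score += 1.6  # negation flips the phrase to mildly positive
    if "film" in s:
        score += 0.1
    return score


def shapley(tokens, v):
    """Exact Shapley values: weighted marginal contribution over all subsets S."""
    n = len(tokens)
    phi = {}
    for i in tokens:
        others = [t for t in tokens if t != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for S in combinations(others, size):
                total += weight * (v(S + (i,)) - v(S))
        phi[i] = total
    return phi


phi = shapley(tokens, v)
for tok, val in phi.items():
    print(f"{tok:<10} {val:+.4f}")

# Efficiency axiom: values sum to v(all tokens) - v(empty set).
print("sum:", sum(phi.values()), "target:", v(tuple(tokens)) - v(()))
```

The loop visits 2^(n-1) subsets per token, which is why practical tools approximate (KernelSHAP, Partition SHAP) instead of enumerating.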

SHAP for Text Classification — Practical Code

import shap
import transformers
import torch

# ── Pipeline wrapper ──────────────────────────────────────────
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
pipe = transformers.pipeline(
    "text-classification",
    model=model_name,
    top_k=None,  # scores for every class (replaces the deprecated return_all_scores=True)
    device=0 if torch.cuda.is_available() else -1
)

# ── SHAP Text Explainer ───────────────────────────────────────
explainer = shap.Explainer(pipe)

samples = [
    "The film was absolutely brilliant and deeply moving.",
    "Terrible plot, bad acting — a complete waste of time.",
    "Despite some flaws, it was not terrible at all.",
]

shap_values = explainer(samples)

# ── Visualise (in Jupyter / HTML export) ─────────────────────
shap.plots.text(shap_values[0])     # coloured token heatmap
shap.plots.bar(shap_values[0, :, 1])  # bar chart for POSITIVE class

# ── Programmatic access ───────────────────────────────────────
for i, sample in enumerate(samples):
    values = shap_values[i, :, 1].values   # POSITIVE class scores
    tokens = shap_values[i].data
    print(f"\nSample {i+1}:")
    for tok, val in sorted(zip(tokens, values), key=lambda x: -abs(x[1])):
        if tok not in ['[CLS]', '[SEP]']:
            direction = "▲ POS" if val > 0 else "▼ NEG"
            print(f"  {tok:15s} {val:+.4f}  {direction}")
OUTPUT
Sample 1:
  brilliant       +0.2841  ▲ POS
  absolutely      +0.2103  ▲ POS
  moving          +0.1977  ▲ POS
  deeply          +0.1542  ▲ POS
  film            +0.0341  ▲ POS

Sample 2:
  terrible        -0.3210  ▼ NEG
  waste           -0.2540  ▼ NEG
  bad             -0.2213  ▼ NEG
  plot            -0.1124  ▼ NEG
  complete        -0.0870  ▼ NEG

Sample 3:
  not             +0.1822  ▲ POS   ← negation captured!
  terrible        -0.2980  ▼ NEG
  flaws           -0.1540  ▼ NEG
  all             +0.1103  ▲ POS
💪
SHAP Captured Negation

Notice in Sample 3: "not" has a positive attribution and "terrible" has a negative one, so the negation offsets much of the penalty that "terrible" would carry on its own. SHAP can surface this because it compares coalitions in which "not" appears with and without "terrible" (for longer inputs it samples coalitions rather than enumerating all of them).


Section 04

How Tokens Flow Through a Transformer — Visual Walkthrough

Before we can attribute meaning to tokens, we need to understand their journey through the transformer. Each token embedding is updated at every layer, influenced by all other tokens via multi-head attention.

🔀 Animated — Token Residual Stream & Attention Flow
Each column is a token. Rows are transformer layers. Line brightness = attention weight between tokens.
💡
The Residual Stream Perspective

Each token has a residual stream: a vector that accumulates the output of every attention and MLP block as it passes up the layers. Attribution methods like IG and SHAP skip over this intermediate state and assign credit directly to the input tokens. Newer mechanistic interpretability work (e.g., Anthropic's superposition research) tries to read the residual stream directly, decomposing it into interpretable features.
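
The accumulation idea can be shown in a few lines. In the sketch below, `toy_attn` and `toy_mlp` are hypothetical stand-ins for real transformer blocks, and `direction` is a made-up readout vector. The only point is that each block adds its output to the stream, so the final vector (and any linear readout of it) splits into one additive term per block.

```python
def toy_attn(stream, layer):
    # Hypothetical stand-in for an attention block: a scaled copy of the stream.
    return [0.1 * (layer + 1) * s for s in stream]


def toy_mlp(stream, layer):
    # Hypothetical stand-in for an MLP block: a shifted ReLU.
    return [max(0.0, s - 0.5) for s in stream]


def run(embedding, n_layers=3):
    """Return the final residual vector and the list of (name, write) pairs."""
    stream = list(embedding)
    writes = [("embed", list(embedding))]
    for layer in range(n_layers):
        for name, block in (("attn", toy_attn), ("mlp", toy_mlp)):
            update = block(stream, layer)
            stream = [s + u for s, u in zip(stream, update)]  # residual add
            writes.append((f"{name}{layer}", update))
    return stream, writes


final, writes = run([1.0, -0.5, 0.25])

# The final stream equals the embedding plus every block's write.
reconstructed = [sum(w[d] for _, w in writes) for d in range(len(final))]

# A linear readout along a (hypothetical) direction decomposes per block.
direction = [0.5, 1.0, -1.0]
logit = sum(u * f for u, f in zip(direction, final))
contributions = {name: sum(u * x for u, x in zip(direction, w)) for name, w in writes}
for name, c in contributions.items():
    print(f"{name:<6} {c:+.4f}")
```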


Section 05

Chain-of-Thought Prompting as Explanation

Show Your Work — The Exam Analogy
In mathematics class, getting the right answer is only half the grade. You must show your work. If you write "42" without any steps, the teacher can't tell if you understood the concept or just guessed. If you write your reasoning, the teacher can pinpoint exactly where your logic went astray — and award partial credit.

Chain-of-Thought (CoT) prompting forces LLMs to "show their work" before answering. Instead of jumping directly to an answer, the model produces a sequence of intermediate reasoning steps. This makes the model's reasoning visible, checkable, and correctable — a form of self-explanation.

Standard Prompting vs. Zero-Shot CoT

❌ Standard Prompting (Black Box)
Role | Content
User | If a train travels at 60 mph for 2.5 hours, then slows to 40 mph for 1 hour, what is the total distance?
Model | 190 miles
XAI Verdict | No reasoning visible. Correct answer but no verifiability.

✅ Chain-of-Thought Prompting
Role | Content
User | (Same question, followed by:) Let's think step by step.
Model | Step 1: Distance₁ = 60 × 2.5 = 150 miles. Step 2: Distance₂ = 40 × 1 = 40 miles. Step 3: Total = 150 + 40 = 190 miles.
XAI Verdict | Each step auditable. Errors localizable.

Faithful vs. Unfaithful CoT — A Critical Distinction

A critical question in XAI is: does the chain-of-thought actually cause the answer, or is it a post-hoc rationalisation? Research by Turpin et al. (2023) showed that CoT can be unfaithful — the model produces a persuasive-looking reasoning chain, yet the actual computation driving the answer is different.

FAITHFUL CoT
Reasoning → Answer
The intermediate steps genuinely control the final token. Intervening on the reasoning changes the answer. Verifiable by editing a step and checking that the answer follows the edit.
⚠️
PARTIALLY FAITHFUL
Mixed Signal
Some steps are causally relevant, others are post-hoc filler. Common in multi-step arithmetic. Hard to detect without causal intervention tools.
UNFAITHFUL CoT
Rationalisation Only
The model "knows" the answer from heuristics and constructs a plausible-sounding story afterward. Dangerous in safety-critical settings.

Testing CoT Faithfulness — Intervention Method

import openai
import re

client = openai.OpenAI()  # or use Anthropic / local LLM

def cot_with_intervention(question, wrong_intermediate):
    """
    Tests CoT faithfulness by injecting a wrong intermediate step
    and checking if the model corrects itself or blindly follows.
    """
    # Faithful model: re-computes from injected wrong step → wrong answer
    # Unfaithful model: ignores the step → returns its pre-computed answer

    prompt_faithful = f"""Solve step by step.
{question}

Let me start: Step 1: {wrong_intermediate}
Continue from Step 2 onward:"""

    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt_faithful}],
        temperature=0
    )
    continuation = resp.choices[0].message.content

    # Extract final answer
    nums = re.findall(r'\b\d+(?:\.\d+)?\b', continuation)
    final = nums[-1] if nums else "?"
    return continuation, final

question = "Train travels 60 mph for 2.5h, then 40 mph for 1h. Total distance?"
wrong_step = "Distance₁ = 60 × 2.5 = 120 miles (incorrect injected value)"

chain, answer = cot_with_intervention(question, wrong_step)
print("Chain continuation:\n", chain)
print("\nFinal answer extracted:", answer)
print("Expected faithful answer: 160 (follows injected wrong step)")
print("If model says 190 → UNFAITHFUL (ignored your step)")
😱
CoT Is Not a Replacement for Attribution Methods

CoT provides a textual explanation at the prompt level. Token attribution methods like IG and SHAP operate at the model-internal level. Both are needed: CoT tells you what reasoning the model claims to do; attribution methods tell you what the model actually does internally. In high-stakes settings, verify CoT faithfulness with causal intervention.


Section 06

Attention Maps — Visualising What the Model "Looks At"

Every transformer layer has multiple attention heads, each computing a weighted average over all token positions. Visualising these weights as a heatmap is one of the most popular (and most abused) interpretability tools.

🔥 Animated: Multi-Head Attention Heatmap (heads 1–4)
Rows = query token; columns = key token. Cell brightness = attention weight.
Head 1 (Positional / Syntactic): This head attends strongly to adjacent tokens and punctuation, capturing local syntactic structure.
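
Raw per-layer heatmaps ignore the fact that information mixes across layers. Attention rollout is one common way to account for that: add the identity to each layer's attention matrix to model the residual connection, renormalise, and multiply the layers together. The sketch below assumes head-averaged attention matrices. The two 3×3 causal matrices are hypothetical values, not taken from any model.

```python
def matmul(a, b):
    """Plain-Python matrix product of two square-compatible matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add_residual(attn):
    """Mix attention with the identity (0.5 each) so rows still sum to 1."""
    n = len(attn)
    return [
        [0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


def attention_rollout(layers):
    """Multiply residual-adjusted attention matrices from the first layer upward."""
    rollout = add_residual(layers[0])
    for attn in layers[1:]:
        rollout = matmul(add_residual(attn), rollout)
    return rollout


# Hypothetical head-averaged causal attention for 3 tokens and 2 layers.
layer1 = [[1.0, 0.0, 0.0], [0.7, 0.3, 0.0], [0.2, 0.5, 0.3]]
layer2 = [[1.0, 0.0, 0.0], [0.4, 0.6, 0.0], [0.1, 0.8, 0.1]]

rollout = attention_rollout([layer1, layer2])
for row in rollout:
    print([round(v, 4) for v in row])
```

Each row of `rollout` estimates how much every input position feeds the corresponding output position after both layers. It is still an attention-based proxy, so the faithfulness caveats about attention apply.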

Section 07

XAI Taxonomy for LLMs — Which Method for Which Goal?

No single XAI method dominates all settings. The choice depends on your access to model internals, the granularity of explanation needed, and whether you prioritise faithfulness or human readability.

Method | Type | Access Required | Faithfulness | Human Readability | Speed | Best For
Integrated Gradients | Gradient | White-box | High | Medium | Fast | Open-source LLMs, token-level attribution
SHAP (Text) | Shapley | Black-box / API | Very High | High | Slow | Production APIs, stakeholder reports
LIME | Perturbation | Black-box / API | Medium | High | Medium | Quick local explanations, non-technical audiences
Attention Attribution | Attention | White-box | Low–Medium | Very High | Very Fast | Debugging, visualisation dashboards
Chain-of-Thought | Generative | Prompt-level | Variable | Very High | Fast | End-user explanations, reasoning traces
Mechanistic Interp. | Circuit | Full internals | Very High | Low | Very Slow | Research, safety auditing, capability probing
Probing Classifiers | Linear Probe | Activations | Medium | Medium | Fast | Checking what concepts are represented in layers

Section 08

Mechanistic Interpretability — Reading the Circuit

Mechanistic interpretability goes beyond attributing tokens — it tries to reverse-engineer the actual computational circuits (sub-graphs of attention heads and MLP neurons) that implement a specific behaviour.

The Neuroscientist and the Brain Slice
A neuroscientist studying memory doesn't just observe behaviour — she dissects brain tissue, traces neural pathways, and maps which neurons fire for which stimuli. Mechanistic interpretability does the same for transformers: it dissects the attention heads and MLP layers, traces the "information pathways" for specific tasks, and identifies which circuits are responsible for, say, completing "The Eiffel Tower is in Paris" or detecting gender in a pronoun.
01
Identify the Task & Behaviour
Choose a specific, measurable model behaviour, e.g. "indirect object identification" (IOI). Given "When Mary and John went to the store, John gave a bottle to", the model must predict "Mary", the indirect object.
02
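Activation Patching (Causal Tracing)
Run two prompts: the original (clean) prompt and a "corrupted" version. Patch activations from the clean run into the corrupted run, one component at a time. If patching a component restores the clean behaviour, that component is causally relevant. Patching in the other direction (corrupted into clean) tests whether a component is necessary rather than sufficient.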
03
Identify the Circuit
Map the minimal set of attention heads and MLP neurons that are both necessary and sufficient for the behaviour. This is the "circuit." For IOI in GPT-2 small, Wang et al. (2022) identified a circuit of 26 attention heads.
04
Interpret Each Component
Analyse what each head in the circuit computes: is it a "name mover" (attends to the indirect-object name and copies it to the output), a "duplicate token" detector, or an "induction head" (completes repeated patterns)?
05
Validate & Generalise
Ablate (zero-out) the circuit and verify performance drops. Test whether the circuit generalises to semantically similar tasks. Write up the circuit as a human-readable algorithm.
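
Before running this on a real model, the patching logic in step 02 can be rehearsed on a toy network. The sketch below is an assumption-laden illustration: the three hidden units, their output weights, and the two inputs are invented. Unit 2 deliberately carries information that differs between the runs but has zero output weight, so patching it recovers nothing.

```python
OUT_WEIGHTS = [2.0, 0.1, 0.0]  # hypothetical read-out weights per hidden unit


def forward(x, patch=None):
    """Toy network. `patch` maps a hidden-unit index to an activation to overwrite."""
    hidden = [
        x[0] - x[1],  # unit 0: compares the two inputs (task-relevant)
        0.5 * x[2],   # unit 1: reads a feature that is identical in both runs
        x[0],         # unit 2: differs between runs but is ignored by the output
    ]
    if patch:
        for idx, val in patch.items():
            hidden[idx] = val
    out = sum(w * h for w, h in zip(OUT_WEIGHTS, hidden))
    return out, hidden


clean_x, corrupted_x = [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]
clean_out, clean_hidden = forward(clean_x)
corrupted_out, _ = forward(corrupted_x)

# Patch each clean activation into the corrupted run and measure recovery:
# 0.0 = no effect, 1.0 = clean behaviour fully restored.
recovery = []
for unit in range(len(clean_hidden)):
    patched_out, _ = forward(corrupted_x, patch={unit: clean_hidden[unit]})
    recovery.append((patched_out - corrupted_out) / (clean_out - corrupted_out))

for unit, r in enumerate(recovery):
    print(f"unit {unit}: recovery {r:.2f}")
```

The normalised recovery score is the same quantity usually plotted per layer and position in real patching experiments.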

Activation Patching — Python Sketch

from transformer_lens import HookedTransformer, patching
import torch

# Load a small interpretable model (TransformerLens wraps HuggingFace)
model = HookedTransformer.from_pretrained("gpt2-small")

# Two prompts: clean (IOI task) and corrupted (name swapped)
clean_prompt     = "When Mary and John went to the store, John gave a bottle to"
corrupted_prompt = "When Mary and John went to the store, Mary gave a bottle to"

clean_tokens     = model.to_tokens(clean_prompt)
corrupted_tokens = model.to_tokens(corrupted_prompt)

# Get logits and cache for both
clean_logits, clean_cache       = model.run_with_cache(clean_tokens)
corrupted_logits, corrupt_cache = model.run_with_cache(corrupted_tokens)

mary_idx = model.to_single_token(" Mary")
john_idx = model.to_single_token(" John")

# Metric: logit difference at the last position, correct answer minus distractor.
# The clean prompt should complete with " Mary" (the indirect object), so we use Mary - John.
# Returns a 0-dim tensor: the patching helpers expect a scalar tensor, not a Python float.
def ioi_metric(logits):
    last = logits[0, -1]
    return last[mary_idx] - last[john_idx]

clean_metric     = ioi_metric(clean_logits).item()
corrupted_metric = ioi_metric(corrupted_logits).item()
print(f"Clean metric (Mary-John logit diff): {clean_metric:.3f}")
print(f"Corrupted metric:                    {corrupted_metric:.3f}")

# Patch the clean residual stream into the corrupted run at each layer and position
results = patching.get_act_patch_resid_pre(
    model, corrupted_tokens, clean_cache,
    lambda logits: ioi_metric(logits) - corrupted_metric
)
print("Patching results shape:", results.shape, "[layers × positions]")  # .shape is an attribute
print("Max causal impact layer:", results.argmax(dim=0))
OUTPUT
Clean metric (Mary-John logit diff): 4.812
Corrupted metric:                    -2.341
Patching results shape: torch.Size([12, 18]) [layers × positions]
Max causal impact layer: tensor([ 9, 9, 0, 9, 9, 0, 9, 10, 9, 9, 9, 0, 9, 10, 9, 10, 9, 10])
→ Layers 9–10 are most causally relevant for the IOI task.
→ This matches the "name mover heads" identified in Wang et al. (2022).

Section 09

Probing Classifiers — What Does Each Layer Know?

A probing classifier is a simple linear (or shallow) model trained on the internal representations (activations) of an LLM layer to predict some property — POS tags, sentiment, syntactic role, factual attributes. If the probe achieves high accuracy, the LLM's representations encode that property at that layer.

Layer-by-Layer Probe Accuracy — Animated Bar Chart

📈 Animated — Probing Accuracy by Layer (GPT-2 Small, 12 Layers)
🔎
Probing Does Not Prove Causality

A high probing accuracy means the information is linearly decodable from the representation — not that the model uses it for its output. A representation can encode POS information that is completely ignored by downstream layers. Combine probing with activation patching to test causal relevance.
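
A linear probe is small enough to write by hand. The sketch below trains a logistic-regression probe with plain gradient descent on synthetic "activations". The data generator is entirely hypothetical: it plants a binary property in dimension 2 and fills the other dimensions with noise, standing in for activations you would normally extract from a chosen layer.

```python
import math
import random

random.seed(0)
DIM = 4


def make_activations(n):
    """Synthetic activations: the property is linearly encoded in dimension 2."""
    data = []
    for _ in range(n):
        y = random.randint(0, 1)
        act = [random.gauss(0, 1) for _ in range(DIM)]
        act[2] = (1.0 if y else -1.0) + random.gauss(0, 0.3)
        data.append((act, y))
    return data


def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(data, epochs=200, lr=0.5):
    """Logistic regression trained with full-batch gradient descent."""
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * DIM, 0.0
        for x, y in data:
            err = predict(w, b, x) - y
            gw = [g + err * xi for g, xi in zip(gw, x)]
            gb += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, gw)]
        b -= lr * gb / len(data)
    return w, b


train, test = make_activations(200), make_activations(200)
w, b = train_probe(train)
acc = sum((predict(w, b, x) > 0.5) == bool(y) for x, y in test) / len(test)
print("probe weights:", [round(wi, 2) for wi in w])
print(f"held-out accuracy: {acc:.2%}")
```

High held-out accuracy here only says the property is linearly decodable. Whether the model relies on it still has to be tested with an intervention such as activation patching.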


Section 10

Self-Consistency & CoT Faithfulness Testing

Self-Consistency (Wang et al. 2022) improves CoT reliability by sampling multiple reasoning paths and taking a majority vote on the final answer. From an XAI perspective, comparing the chains that disagree shows where the reasoning diverges and which steps the model is uncertain about.

import openai
from collections import Counter
import re, json

client = openai.OpenAI()

def self_consistent_cot(question, n_samples=5, temperature=0.7):
    """Generate n CoT chains and return majority answer + diversity stats."""
    prompt = f"""{question}

Think step by step. At the end write: ANSWER: [your final answer]"""

    chains, answers = [], []
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature, max_tokens=400
        )
        text = resp.choices[0].message.content
        chains.append(text)
        match = re.search(r"ANSWER:\s*(.+)", text)
        answers.append(match.group(1).strip() if match else "?")

    vote = Counter(answers)
    majority, count = vote.most_common(1)[0]
    confidence = count / n_samples

    print(f"\n{'='*50}")
    print(f"Majority answer : {majority}")
    print(f"Confidence      : {confidence:.0%} ({count}/{n_samples} chains agree)")
    print(f"Answer spread   : {dict(vote)}")

    # XAI insight: if confidence < 60%, flag for human review
    if confidence < 0.6:
        print("⚠️  LOW CONFIDENCE — reasoning chains disagree significantly.")
        print("    Recommend: human review or additional context.")
    return majority, chains

ans, chains = self_consistent_cot(
    "A snail travels 3 km per hour uphill and 6 km per hour downhill. "
    "It travels 12 km uphill and 12 km downhill. What is its average speed?"
)
OUTPUT
==================================================
Majority answer : 4 km/h
Confidence      : 80% (4/5 chains agree)
Answer spread   : {'4 km/h': 4, '4.5 km/h': 1}

Chain 1 (correct): Uphill time = 12/3 = 4h. Downhill = 12/6 = 2h. Total = 24km/6h = 4 km/h ✓
Chain 5 (wrong): Incorrectly used arithmetic mean (3+6)/2 = 4.5 km/h ✗
→ Failed to recognise harmonic mean problem.
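
One practical wrinkle with majority voting is that free-text answers rarely match character for character, so equivalent answers can split the vote. A minimal sketch of normalising answers before counting is shown below. The `raw` answers are made-up examples, and `normalise` is a hypothetical helper that simply keeps the first number it finds, which is an assumption that only suits numeric questions.

```python
import re
from collections import Counter


def normalise(answer):
    """Keep the first number so '4 km/h', '4.0 km/h' and '4km/h' vote together."""
    match = re.search(r"-?\d+(?:\.\d+)?", answer)
    return float(match.group()) if match else None


# Hypothetical final answers extracted from five sampled chains.
raw = ["4 km/h", "4.0 km/h", "4km/h", "4.5 km/h", "about 4 km/h"]

raw_votes = Counter(raw)                       # every string is distinct
votes = Counter(normalise(a) for a in raw)     # equivalent answers merge
majority, count = votes.most_common(1)[0]
agreement = count / len(raw)

print("raw vote spread       :", dict(raw_votes))
print("normalised vote spread:", dict(votes))
print(f"majority {majority} with {agreement:.0%} agreement")
```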

Section 11

INSEQ — Unified Attribution for Generative LLMs

INSEQ (Sarti et al. 2023) is a Python library that wraps Hugging Face causal and encoder-decoder models and provides 15+ attribution methods through a single unified API. It is the closest thing to a "standard toolkit" for LLM interpretability.

# pip install inseq
import inseq

# ── Load model with attribution method ───────────────────────
model = inseq.load_model(
    "gpt2",
    attribution_method="integrated_gradients"
)

# ── Attribute a generation step ──────────────────────────────
out = model.attribute(
    input_texts="The capital of France is",
    n_steps=50,       # IG integration steps
    show_progress=False
)

# ── Inspect attributions ────────────────────────────────────
out.show()  # HTML heatmap in Jupyter

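# Programmatic access. For decoder-only models such as GPT-2, inseq stores the
# prompt attributions in `target_attributions` (`source_attributions` is None).
# Field names follow inseq's FeatureAttributionSequenceOutput; check your installed version.
step = out.sequence_attributions[0].aggregate()  # collapse the embedding dimension
n_prompt = step.attr_pos_start                     # index of the first generated token
print("Generated token:", step.target[n_prompt].token)
print("\nInput attributions:")
for tok, attr in zip(step.target[:n_prompt], step.target_attributions[:n_prompt, 0]):
    print(f"  {tok.token:15s} {attr.item():+.4f}")

# ── Compare methods side by side ────────────────────────────
methods = ["integrated_gradients", "input_x_gradient", "attention"]
for method in methods:
    m = inseq.load_model("gpt2", attribution_method=method)
    o = m.attribute("The capital of France is", show_progress=False)
    s = o.sequence_attributions[0].aggregate()
    n = s.attr_pos_start
    top = sorted(
        zip([t.token for t in s.target[:n]], s.target_attributions[:n, 0]),
        key=lambda x: -abs(x[1].item())
    )[:2]
    print(f"{method:25s} top tokens: {[t for t,_ in top]}")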
OUTPUT
Generated token: Paris

Input attributions:
  The             +0.0041
  capital         +0.3821  ← strong signal
  of              +0.0012
  France          +0.8741  ← strongest signal
  is              +0.0183

integrated_gradients      top tokens: ['France', 'capital']
input_x_gradient          top tokens: ['France', 'capital']
attention                 top tokens: ['is', 'France']  ← attention differs!
💡
France > capital — Why This Matters

The token "France" has the highest attribution for generating "Paris" — sensible, as France directly specifies the country. "capital" is second — it signals the relationship type. Notice how attention puts "is" high, while gradient methods do not. This illustrates why gradient methods are preferred for faithful attribution.


Section 12

Golden Rules — XAI for LLMs in Production

🌟 Non-Negotiable Rules for LLM Interpretability
1
Never use attention weights as the sole attribution method. Attention is fast and visually appealing, but it is not reliably faithful. Always validate with gradient-based (IG, Grad×Input) or Shapley-based methods before drawing conclusions that inform decisions.
2
Verify CoT faithfulness for safety-critical tasks. Use intervention methods (inject wrong intermediate steps, check if the answer follows) or consistency checks (sample 5–10 chains, flag low agreement) before trusting chain-of-thought explanations in medical, legal, or financial applications.
3
Always specify a meaningful baseline for IG. The zero embedding baseline is conventional but not always meaningful. For masked-language models, use the [MASK] token. For causal models, use a pad token or blank-text embedding. The completeness axiom guarantees attributions sum to the output difference from your chosen baseline.
4
Match your XAI method to your audience. Regulators and end-users need SHAP bar charts and CoT text. Data scientists debugging models need IG heatmaps and probing plots. Safety researchers need mechanistic circuit analysis. One explanation does not fit all stakeholders.
5
Distinguish local from global explanations. LIME and SHAP are local — they explain a single prediction. Probing classifiers and circuit analysis are global — they explain model behaviour across inputs. Use local explanations for individual decisions; use global explanations for model auditing and bias detection.
6
XAI is not a substitute for evaluation. A model that produces compelling explanations for wrong answers is worse than a model that fails silently — it creates misplaced trust. Always measure both model accuracy and explanation faithfulness as separate metrics.
7
Use INSEQ or Captum as your implementation starting point. Building attribution from scratch is error-prone. Both libraries implement the completeness axiom correctly, handle tokeniser alignment, and provide HTML visualisations out of the box. Never re-implement IG without checking the off-by-one in the integration boundary.
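
Rule 6 asks for explanation faithfulness to be measured separately from accuracy. One simple way to do that is a deletion test: remove the tokens an explanation ranks highest and see how far the model's score falls. The sketch below uses a hypothetical keyword-weight model (`WEIGHTS`) and two made-up attribution vectors, one aligned with the model and one that highlights the wrong tokens.

```python
# Hypothetical toy model: the score is the sum of per-token keyword weights.
WEIGHTS = {"brilliant": 2.0, "moving": 1.5, "absolutely": 0.5}


def model_score(tokens):
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)


def deletion_drop(tokens, attributions, k):
    """Remove the k highest-attributed tokens and return the drop in model score."""
    ranked = sorted(range(len(tokens)), key=lambda i: -attributions[i])
    removed = set(ranked[:k])
    kept = [t for i, t in enumerate(tokens) if i not in removed]
    return model_score(tokens) - model_score(kept)


tokens = ["the", "film", "was", "absolutely", "brilliant", "and", "moving"]

# An explanation that matches the model, and one that highlights irrelevant tokens.
faithful = [WEIGHTS.get(t, 0.0) for t in tokens]
misleading = [1.0 if t in {"the", "film"} else 0.0 for t in tokens]

drop_faithful = deletion_drop(tokens, faithful, k=2)
drop_misleading = deletion_drop(tokens, misleading, k=2)
print("score drop, faithful explanation  :", drop_faithful)
print("score drop, misleading explanation:", drop_misleading)
```

A faithful explanation should produce a much larger drop than a random or misleading one at the same k. With a real model, deleting tokens also shifts the input distribution, so results are best compared against a random-deletion baseline.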

Section 13

XAI Methods at a Glance — Full Reference Table

Method | Family | Key Axioms Met | Pros | Cons | Python Library
Integrated Gradients | Gradient | Completeness, Sensitivity | Fast, principled baseline | Requires white-box access | inseq, captum
Gradient × Input | Gradient | Sensitivity | Fastest gradient method | No completeness guarantee | inseq
SmoothGrad | Gradient | Sensitivity | Reduces gradient noise | Slower, many forward passes | captum
SHAP (Partition) | Shapley | Efficiency, Symmetry, Dummy | All axioms satisfied | Exponential exact complexity | shap
LIME | Surrogate | Local fidelity | Black-box, human-friendly | Unstable, no axioms | lime
Attention Rollout | Attention | None formally | No gradient needed | Unfaithful to prediction | bertviz
CoT Prompting | Generative | None formal | Natural language, auditable | May be post-hoc rationalisation | Any LLM API
Activation Patching | Causal | Causal sufficiency, necessity | Truly causal, not correlational | Very slow, needs clean/corrupt pairs | transformer_lens
Probing Classifiers | Representational | None formal | Layer-wise concept tracking | Linear decodability ≠ causal use | sklearn + HF