
XAI for NLP: Explaining Text Classifiers & Sentiment Models

A comprehensive, code-first guide to Explainable AI for Natural Language Processing. Covers LIME, SHAP, Integrated Gradients, attention visualization, counterfactuals, and probing classifiers — with working Python examples, interactive diagrams, real-world stories, and production deployment patterns for text classifiers and sentiment models.

Section 01

The Story That Explains Why NLP Models Need Explaining

The Angry Customer and the Mysterious Black Box
Imagine a bank deploys a sentiment model to auto-triage customer complaints. One morning, a furious email arrives: "I absolutely love how your fees magically vanish from my account." The model labels it POSITIVE with 94% confidence. The ticket is automatically closed as resolved.

Three days later, the customer files a regulator complaint. The bank's compliance officer asks the AI team: "Why did the model think that was positive?" The engineers stare at their screens. They have no answer.

This is the core problem that Explainable AI (XAI) for NLP solves — not just getting the right answer, but understanding why the model gave that answer, and when it is dangerously wrong.

Language models are among the most powerful and most opaque systems in modern machine learning. A BERT-base sentiment classifier has roughly 110 million parameters. A single prediction flows through twelve transformer layers, 144 attention heads (twelve per layer), and thousands of non-linear activations. There is no single line of code you can point to and say "here is where it decided the review was negative."

🧠
What XAI for NLP Actually Means

XAI for NLP is not a single technique — it is a family of methods that answer different questions: Which words mattered? (attribution methods), What would have changed the decision? (counterfactuals), Which training examples drove this prediction? (influence functions), and What rules does the model appear to follow? (surrogate models). Each answer is useful in a different context.


Section 02

The XAI Landscape for NLP — A Map

Before diving into individual techniques, it helps to see how the entire XAI-for-NLP space is organized. Methods differ on two key axes: scope (local = one prediction, global = the whole model) and access (white-box = model internals available, black-box = only inputs and outputs).

| Method Family | Scope | Access | Key Output | Best Used For |
|---|---|---|---|---|
| LIME | Local | Black-box | Word importance scores | Any model, quick debugging |
| SHAP (Text) | Local | Black-box | Shapley values per token | Rigorous, consistent attribution |
| Attention Visualization | Local | White-box | Attention weight heatmaps | Transformer debugging, QA models |
| Integrated Gradients | Local | White-box | Gradient attribution per token | Deep models, token-level precision |
| Counterfactuals | Local | Black-box | "What would flip the prediction?" | Regulatory compliance, fairness |
| Concept Activation Vectors | Global | White-box | High-level concept sensitivity | Understanding model behavior globally |
| Probing Classifiers | Global | White-box | What linguistic info each layer learns | BERT/GPT research, layer analysis |
⚠️
The Faithfulness Problem

The most dangerous misconception in XAI is confusing plausible with faithful explanations. An explanation is plausible if it looks reasonable to a human. It is faithful if it accurately reflects what the model actually computed. Many popular methods produce plausible but unfaithful explanations — they tell a good story that does not match the model's true reasoning.
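One way to make the plausible/faithful distinction measurable is a deletion test: remove the tokens an explanation ranks highest and see how much the model's confidence drops. The sketch below is a minimal illustration under stated assumptions: `positive_prob` is a hypothetical toy classifier (not a real library model), and `comprehensiveness` is a name chosen here for the confidence drop after deleting the top-k attributed tokens.

```python
import math

# Hypothetical toy classifier: additive word weights passed through a sigmoid.
WEIGHTS = {"love": 2.0, "great": 1.5, "broke": -2.5, "terrible": -3.0}

def positive_prob(tokens):
    """Return P(POSITIVE) for a list of tokens under the toy model."""
    score = sum(WEIGHTS.get(t.lower(), 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))

def comprehensiveness(tokens, attributions, k, predict):
    """Drop in predicted-class probability after deleting the k tokens
    with the largest absolute attribution."""
    p_pos = predict(tokens)
    predicted_positive = p_pos >= 0.5
    original = p_pos if predicted_positive else 1.0 - p_pos
    ranked = sorted(range(len(tokens)), key=lambda i: -abs(attributions[i]))
    removed = set(ranked[:k])
    reduced = [t for i, t in enumerate(tokens) if i not in removed]
    p_reduced = predict(reduced)
    after = p_reduced if predicted_positive else 1.0 - p_reduced
    return original - after

tokens = ["the", "phone", "broke", "terrible", "service"]
faithful = [WEIGHTS.get(t, 0.0) for t in tokens]   # mirrors what the toy model uses
plausible_only = [0.1, 0.9, 0.0, 0.0, 0.8]         # highlights nouns the model ignores

faithful_drop = comprehensiveness(tokens, faithful, 2, positive_prob)
plausible_drop = comprehensiveness(tokens, plausible_only, 2, positive_prob)
print(f"faithful: {faithful_drop:.3f}  plausible-only: {plausible_drop:.3f}")
```

A faithful explanation should produce a large drop; an explanation that merely looks reasonable leaves the prediction almost unchanged.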


Section 03

LIME for Text Classifiers — Explained With a Story

The Blindfolded Taster
You are handed a mysterious sauce. You cannot see inside the bottle. But you can taste it. So you start experimenting: you remove the chilli and taste again — less hot. You remove the garlic — still hot. You remove both — mild. By systematically removing and adding ingredients and tasting each result, you build a simple mental model of which ingredients matter most.

LIME (Local Interpretable Model-agnostic Explanations) does exactly this with text. It removes and masks words, asks the black-box model to re-classify each masked version, and then fits a simple linear model to the results. The linear model's coefficients tell you which words most influenced the prediction.

How LIME Works on Text — Step by Step

🔬 LIME Algorithm — Text Classification
Step 1
Take the original sentence and the model's prediction. E.g., "The product broke after one day" → NEGATIVE (92%)
Step 2
Create N perturbed versions by randomly masking out words: "The product [MASK] after one day", "[MASK] product broke after one day", etc.
Step 3
Pass all N versions through the black-box model. Record the predicted probability for the class of interest (NEGATIVE).
Step 4
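Weight each perturbed sample by its similarity to the original sentence, computed with an exponential kernel over cosine distance (closer = higher weight).
Step 5
Fit a weighted sparse linear regression mapping word presence/absence → predicted probability. The original paper uses K-LASSO; the lime library by default selects the top features and then fits a weighted ridge regression.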
Output
The linear model coefficients are the explanation: broke → +0.41, one day → +0.22, product → +0.05
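The five steps above can be sketched from scratch in plain Python. This is an illustration of the idea, not the lime library's implementation: `black_box_negative_prob` is a hypothetical stand-in for the real classifier, the kernel width of 0.25 is an assumed value, and a small weighted ridge regression is swapped in for LASSO to keep the solver short.

```python
import math
import random

def black_box_negative_prob(tokens):
    """Hypothetical stand-in for the real classifier: returns P(NEGATIVE)."""
    weights = {"broke": 2.0, "day": 0.5}
    score = -1.0 + sum(weights.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def lime_text(tokens, predict, num_samples=500, kernel_width=0.25, ridge=1e-3, seed=0):
    rng = random.Random(seed)
    d = len(tokens)
    rows, targets, sample_weights = [], [], []
    for s in range(num_samples):
        # Steps 1-2: the first sample is the original, the rest are random masks.
        mask = [1] * d if s == 0 else [rng.randint(0, 1) for _ in range(d)]
        kept = [t for t, keep in zip(tokens, mask) if keep]
        # Step 3: query the black box on the perturbed text.
        targets.append(predict(kept))
        # Step 4: exponential kernel over cosine distance to the all-ones mask.
        distance = 1.0 - math.sqrt(sum(mask) / d)
        sample_weights.append(math.exp(-(distance ** 2) / kernel_width ** 2))
        rows.append([1.0] + [float(v) for v in mask])  # leading 1.0 is the intercept
    # Step 5: weighted ridge regression via the normal equations.
    p = d + 1
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, sample_weights))
             for j in range(p)] for i in range(p)]
    for i in range(1, p):
        xtwx[i][i] += ridge  # penalize word coefficients, not the intercept
    xtwy = [sum(w * r[i] * y for r, w, y in zip(rows, sample_weights, targets))
            for i in range(p)]
    beta = solve(xtwx, xtwy)
    return dict(zip(tokens, beta[1:]))

coefs = lime_text("The product broke after one day".split(), black_box_negative_prob)
for word, weight in sorted(coefs.items(), key=lambda kv: -abs(kv[1])):
    print(f"{word:10s} {weight:+.3f}")
```

Because the toy model only reacts to "broke" and "day", those two words should dominate the fitted coefficients while the filler words stay near zero.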

LIME in Python — Sentiment Classifier

from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np

# ── 1. Train a simple sentiment classifier ──────────────────
train_texts = [
    "great product love it", "absolutely terrible broke immediately",
    "best purchase ever wonderful", "horrible waste of money disgusting",
    "decent quality good value", "arrived damaged very disappointed",
]
train_labels = [1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf',  LogisticRegression(max_iter=500))
])
pipeline.fit(train_texts, train_labels)

# ── 2. Set up LIME explainer ─────────────────────────────────
explainer = LimeTextExplainer(
    class_names=['Negative', 'Positive'],
    split_expression=r'\s+',   # split on whitespace
    bow=True                    # bag-of-words perturbation
)

# ── 3. Explain a specific prediction ────────────────────────
test_sentence = "The product broke after one day — absolutely terrible"

exp = explainer.explain_instance(
    test_sentence,
    pipeline.predict_proba,   # must return probabilities
    labels=(0,),                # explain class 0 (Negative); LIME defaults to label 1
    num_features=6,             # top 6 words to explain
    num_samples=500             # number of perturbed samples
)

# ── 4. Print results ─────────────────────────────────────────
print(f"Predicted class : {pipeline.predict([test_sentence])[0]} (0=Neg, 1=Pos)")
print(f"Confidence      : {pipeline.predict_proba([test_sentence])[0].max():.2%}")
print("\nWord Contributions (+ = pushes toward Negative class):")
for word, weight in exp.as_list(label=0):   # weights are for the Negative class
    direction = "→ NEGATIVE" if weight > 0 else "→ POSITIVE"
    print(f"  {word:25s}: {weight:+.4f}  {direction}")
OUTPUT
Predicted class : 0 (0=Neg, 1=Pos)
Confidence      : 87.34%

Word Contributions (+ = pushes toward Negative class):
  terrible                 : +0.3821  → NEGATIVE
  broke                    : +0.2914  → NEGATIVE
  absolutely               : +0.1823  → NEGATIVE
  one day                  : +0.1205  → NEGATIVE
  product                  : +0.0342  → NEGATIVE
  after                    : -0.0089  → POSITIVE
💡
The num_samples Tradeoff

More samples (num_samples) → more stable explanation, but slower. The library default is 5000. For production monitoring dashboards, a few hundred samples (e.g., num_samples=200) is usually fast and good enough for trends. For a compliance report on one specific disputed prediction, use at least num_samples=2000 and confirm that the top features stay the same across random seeds.


Section 04

SHAP for NLP — The Rigorous Way to Attribute Words

LIME approximates locally. SHAP (SHapley Additive exPlanations) goes further: it uses Shapley values from cooperative game theory to assign each word a contribution that is guaranteed to be fair, consistent, and complete — meaning the contributions always sum to the model's output.

Shapley Value Formula
φᵢ = Σ_{S ⊆ N∖{i}} [|S|!(n−|S|−1)!/n!] × [f(S∪{i}) − f(S)]
Sum over all subsets S of the n words in N that do not contain word i. Each term is the weighted marginal contribution of adding word i to subset S.
Efficiency Property
f(x) - E[f(x)] = Σᵢ φᵢ
The sum of all Shapley values equals the difference between the model output and the baseline (average) output. They always account for 100% of the prediction.
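For a handful of words the formula can be evaluated exactly by enumerating every subset. The sketch below does that for a hypothetical three-word value function `toy_value` (an assumption made for illustration, where "not" weakens "good"), and uses f(∅) as the baseline instead of an average over a dataset.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by enumerating all subsets (feasible only for small n)."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi

def toy_value(words):
    """Hypothetical P(POSITIVE) when only `words` are visible to the model."""
    if "good" in words:
        return 0.2 if "not" in words else 0.9
    return 0.5

players = ["not", "good", "movie"]
phi = shapley_values(players, toy_value)
for word in players:
    print(f"{word:6s} {phi[word]:+.4f}")

# Efficiency property: contributions sum to f(all words) - f(no words).
gap = toy_value(frozenset(players)) - toy_value(frozenset())
print(f"sum of phi = {sum(phi.values()):+.4f}, f(x) - baseline = {gap:+.4f}")
```

The enumeration visits 2^(n-1) subsets per word, which is why SHAP's explainers approximate rather than enumerate for real sentences.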

SHAP Explainers for Text: Three Flavours

Partition Explainer
shap.PartitionExplainer
Best for transformer models (BERT, RoBERTa). Splits the text into token groups hierarchically. Fast and memory-efficient. Uses Owen values (a generalization of Shapley values for features grouped into a hierarchy of coalitions).
✓ Handles tokenization correctly
✗ Approximate, not exact Shapley
🔄
Permutation Explainer
shap.PermutationExplainer
Approximates Shapley values by sampling random word orderings. Works for any model. The number of model evaluations is capped by max_evals, making the cost controllable. A good choice for TF-IDF + LogReg pipelines.
✓ Model-agnostic, controllable cost
✗ Slower than Partition for deep models
📊
Linear Explainer
shap.LinearExplainer
Only for linear models on BoW/TF-IDF features. Uses exact analytical Shapley values with no sampling required. Extremely fast. Assuming independent features, the explanation for feature i is the model coefficient × (feature value − average feature value over the background data).
✓ Exact and instantaneous
✗ Only linear models
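The closed form behind the linear case is short enough to check by hand. The sketch below assumes independent features and uses made-up coefficients and TF-IDF values; `linear_shap` is an illustrative helper, not the shap library's API.

```python
def linear_shap(coefficients, x, background_mean):
    """phi_i = w_i * (x_i - E[x_i]) for a linear model with independent features."""
    return [w * (xi - mu) for w, xi, mu in zip(coefficients, x, background_mean)]

def linear_model(coefficients, intercept, x):
    return intercept + sum(w * xi for w, xi in zip(coefficients, x))

# Hypothetical numbers: three TF-IDF features of one document.
coefficients = [1.5, -2.0, 0.3]
intercept = 0.1
x = [0.4, 0.0, 0.7]
background_mean = [0.1, 0.2, 0.5]

phi = linear_shap(coefficients, x, background_mean)
output_gap = (linear_model(coefficients, intercept, x)
              - linear_model(coefficients, intercept, background_mean))
print("phi:", [round(v, 4) for v in phi])
print("sum(phi):", round(sum(phi), 4), " f(x) - f(mean):", round(output_gap, 4))
```

Note that the second feature gets a positive attribution even though its value is zero: the word is absent, its coefficient is negative, and the background documents contain it on average.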

SHAP on a BERT Sentiment Model

import shap
import transformers
import torch

# ── Load a fine-tuned sentiment model ───────────────────────
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
model.eval()

# ── Wrap in a callable that returns probabilities ────────────
def predict_proba(texts):
    inputs = tokenizer(
        [str(t) for t in texts],  # SHAP passes a NumPy array; the tokenizer expects a list of str
        return_tensors="pt", truncation=True,
        padding=True, max_length=128
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).numpy()

# ── SHAP Partition Explainer (for transformers) ──────────────
masker = shap.maskers.Text(tokenizer)         # uses [MASK] token as baseline
explainer = shap.PartitionExplainer(predict_proba, masker)

sentences = [
    "I absolutely love this phone, it's incredibly fast and beautiful",
    "Worst experience ever. The staff was rude and the food was cold",
    "It was okay. Nothing special, but not terrible either."
]

shap_values = explainer(sentences)   # shape: (3, tokens, 2_classes)

# ── Inspect one sentence ─────────────────────────────────────
idx = 0   # first sentence
tokens = shap_values[idx].data       # the tokenized words
values = shap_values[idx].values[:, 1]  # SHAP for POSITIVE class

print(f"Sentence: {sentences[idx]}")
print(f"Baseline (avg) POSITIVE prob: {shap_values[idx].base_values[1]:.4f}")
print("\nToken Contributions to POSITIVE class:")
for token, val in sorted(zip(tokens, values), key=lambda x: -abs(x[1])):
    bar = "▓" * int(abs(val) * 60)
    sign = "+" if val > 0 else "-"
    print(f"  {token:15s}: {sign}{abs(val):.4f}  {bar}")
OUTPUT
Sentence: I absolutely love this phone, it's incredibly fast and beautiful
Baseline (avg) POSITIVE prob: 0.5012

Token Contributions to POSITIVE class:
  love           : +0.2841  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
  beautiful      : +0.1923  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓
  absolutely     : +0.1344  ▓▓▓▓▓▓▓▓
  incredibly     : +0.0912  ▓▓▓▓▓
  fast           : +0.0634  ▓▓▓▓
  phone          : +0.0122  ▓
  this           : -0.0043
  I              : -0.0012

Section 05

Attention Visualization — The Double-Edged Sword

The Eye-Tracker Illusion
Suppose researchers track where people look when reading job adverts. Participants spend most of their time on the salary section. But when asked what mattered most in their decision, many say company culture.

Looking at something is not the same as that thing causing the decision. Transformer attention weights show us what the model attends to, not necessarily what causes its output. This is the most important caveat in attention-based XAI — and it is often ignored.

Despite the caveat, attention visualization remains valuable when used correctly. For tasks like Question Answering, attention between the question and passage tokens often tracks the parts of the passage the answer depends on. For classification tasks like sentiment, it is a useful starting point but must be validated with other methods.

Visualizing BERT Attention — Code

from transformers import AutoTokenizer, AutoModel
import torch
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model     = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

text = "The movie was surprisingly good despite the poor reviews"
inputs = tokenizer(text, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    outputs = model(**inputs)

# attentions: tuple of (1, num_heads, seq_len, seq_len) for each layer
attentions = outputs.attentions   # 12 layers × 12 heads

# ── Aggregate: mean attention across heads in the last layer ─
last_layer_attn = attentions[-1][0]          # (12_heads, seq, seq)
mean_attn = last_layer_attn.mean(dim=0).numpy()   # (seq, seq)

# ── Plot ─────────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(mean_attn, cmap="Blues", aspect="auto")
ax.set_xticks(range(len(tokens)))
ax.set_yticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=45, ha="right")
ax.set_yticklabels(tokens)
ax.set_title("BERT Layer 12 — Mean Attention Across All Heads")
plt.colorbar(im, ax=ax)
plt.tight_layout()
plt.savefig("attention_map.png", dpi=150)
print("✓ Saved attention_map.png")
⚠️
Attention ≠ Explanation — The Jain & Wallace (2019) Controversy

The paper "Attention is not Explanation" showed that you can permute attention weights drastically while barely changing model outputs. Conversely, "Attention is not not Explanation" (Wiegreffe & Pinter, 2019) showed this does not mean attention is meaningless — it depends on the task. The safest practice: always validate attention explanations against a gradient-based method before reporting them to stakeholders.
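A simple way to run that validation is to compare the two token rankings with a rank correlation. The sketch below implements Spearman's rho in plain Python and applies it to made-up attention and gradient scores; the 0.5 agreement threshold is an assumption chosen for illustration, not an established standard.

```python
import math

def ranks(values):
    """Average ranks (1-based), so ties share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = avg
        pos = end + 1
    return result

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical per-token scores for one five-token sentence.
attention_scores = [0.30, 0.05, 0.10, 0.40, 0.15]   # e.g. mean [CLS] attention
gradient_scores = [0.02, 0.10, 0.60, 0.05, 0.23]    # e.g. |Integrated Gradients|

rho = spearman(attention_scores, gradient_scores)
AGREEMENT_THRESHOLD = 0.5  # assumed cut-off for "the two methods agree"
verdict = "agree" if rho >= AGREEMENT_THRESHOLD else "disagree: do not report the attention map alone"
print(f"Spearman rho = {rho:+.2f} ({verdict})")
```

Here the two rankings are negatively correlated, which is exactly the situation in which an attention heatmap would mislead a stakeholder.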


Section 06

Integrated Gradients — The Gold Standard for Token Attribution

Integrated Gradients (Sundararajan et al., 2017) solves a key problem with vanilla gradients: once the model's output has saturated with respect to a feature, the local gradient is close to zero even though the feature clearly matters. IG fixes this by integrating the gradient along a straight path from a baseline (e.g., all-zero embeddings or [MASK] tokens) to the actual input.

Integrated Gradients Formula
IGᵢ(x) = (xᵢ - x'ᵢ) × ∫₀¹ ∂F(x'+α(x-x'))/∂xᵢ dα
IG for feature i is the path integral of the gradient from baseline x' to input x. Approximated with a Riemann sum over m=50–300 steps.
Completeness Axiom
Σᵢ IGᵢ(x) = F(x) - F(x')
The sum of all token attributions exactly equals the difference between the model's output on the real input and the baseline. Attributions are 100% complete.
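Both the saturation problem and the completeness axiom are easy to see on a toy function. The sketch below assumes a hypothetical three-feature logistic model with an analytic gradient and approximates the path integral with a midpoint Riemann sum; it is independent of Captum.

```python
import math

# Hypothetical model: F(x) = sigmoid(w . x + b), with made-up weights.
W = [4.0, -2.0, 0.5]
B = 0.0

def f(x):
    z = B + sum(w * xi for w, xi in zip(W, x))
    return 1.0 / (1.0 + math.exp(-z))

def grad(x):
    """Analytic gradient of the sigmoid output with respect to each input."""
    s = f(x)
    return [s * (1.0 - s) * w for w in W]

def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of IG along the straight path."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        totals = [t + g for t, g in zip(totals, grad(point))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]

x = [3.0, 1.0, 2.0]
baseline = [0.0, 0.0, 0.0]

plain_gradient = grad(x)                      # near zero: the sigmoid is saturated
attributions = integrated_gradients(x, baseline)
delta = abs(sum(attributions) - (f(x) - f(baseline)))

print("plain gradient:", [f"{g:.6f}" for g in plain_gradient])
print("IG attributions:", [f"{a:+.4f}" for a in attributions])
print(f"completeness gap: {delta:.6f}")
```

The plain gradient is almost zero for every feature, while IG still assigns most of the credit to the first feature and the attributions sum to F(x) − F(x') up to integration error.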

Integrated Gradients on a Fine-Tuned Classifier

from captum.attr import LayerIntegratedGradients
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer  = AutoTokenizer.from_pretrained(MODEL_NAME)
model      = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

# ── Forward wrapper: returns logits for target class ─────────
def forward_func(input_ids, attention_mask):
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    return outputs.logits[:, 1]  # class 1 = POSITIVE

# ── Prepare input ─────────────────────────────────────────────
text   = "This thriller was absolutely gripping from start to finish"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
input_ids      = inputs["input_ids"]
attention_mask = inputs["attention_mask"]
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

# ── Baseline: all [PAD] tokens ───────────────────────────────
baseline_ids = torch.full_like(input_ids, tokenizer.pad_token_id)
baseline_ids[0, 0]  = tokenizer.cls_token_id   # keep [CLS]
baseline_ids[0, -1] = tokenizer.sep_token_id   # keep [SEP]

# ── Layer IG on the embedding layer ──────────────────────────
lig = LayerIntegratedGradients(forward_func, model.distilbert.embeddings)

attributions, delta = lig.attribute(
    inputs=input_ids,
    baselines=baseline_ids,
    additional_forward_args=(attention_mask,),  # passed through unchanged, not attributed
    n_steps=200,                   # Riemann sum steps (Captum's default is 50)
    return_convergence_delta=True  # should be near 0 if good
)

# ── Sum across embedding dimensions to get one score per token ─
attr_scores = attributions.sum(dim=-1).squeeze(0).detach().numpy()
attr_scores = attr_scores / (abs(attr_scores).max())  # normalize to [-1, 1]

print(f"Convergence delta: {delta.item():.6f}  (lower = more accurate)")
print("\nNormalized Token Attributions:")
for tok, score in zip(tokens, attr_scores):
    bar_len = int(abs(score) * 30)
    direction = "▶ POS" if score > 0 else "◀ NEG"
    print(f"  {tok:15s}: {score:+.4f}  {'█'*bar_len} {direction}")
OUTPUT
Convergence delta: 0.000234  (lower = more accurate)

Normalized Token Attributions:
  [CLS]          : +0.0034   ▶ POS
  this           : +0.0512  █ ▶ POS
  thriller       : +0.4821  ██████████████ ▶ POS
  was            : +0.0091   ▶ POS
  absolutely     : +0.6234  ██████████████████ ▶ POS
  gripping       : +1.0000  ██████████████████████████████ ▶ POS
  from           : -0.0045   ◀ NEG
  start          : +0.1023  ███ ▶ POS
  to             : +0.0012   ▶ POS
  finish         : +0.2341  ███████ ▶ POS
  [SEP]          : +0.0001   ▶ POS
Why Convergence Delta Matters

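The convergence delta measures how well the Riemann sum approximated the true integral: it is the gap between the summed attributions and F(x) − F(x'). As a rule of thumb, a delta below 0.001 is excellent. Above 0.05, increase n_steps to 300–500. Above 0.1, the explanation is unreliable and should not be reported. Because the delta is in the same units as the model output (here, a logit), judge it relative to the size of F(x) − F(x').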


Section 07

Counterfactual Explanations — "What Would Have Changed the Decision?"

The Loan Rejection Letter
GDPR Article 22 (in Europe), read together with the transparency provisions in Articles 13–15, and similar regulations in many other countries give people subject to consequential automated decisions a right to meaningful information about the logic involved. "Your sentiment score was 0.23" is not meaningful. "If you had written 'good' instead of 'adequate' and removed the word 'however', your complaint would have been classified as positive" is a counterfactual explanation, and it is actionable. It tells you exactly what to change.

For text classifiers, counterfactuals identify the minimal word substitutions needed to flip the model's prediction. The best counterfactuals are close to the original (minimal change) and stay in the natural language manifold (fluent text).
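The "minimal substitution" idea can be sketched as a breadth-first search: try every single-word swap first, then pairs, and stop at the first size that flips the label. Everything in the sketch is hypothetical: `classify` is a toy lexicon classifier and `SUBSTITUTES` is a hand-written candidate list. A real system would also filter candidates for fluency, which this sketch does not attempt.

```python
from itertools import combinations, product

# Hypothetical sentiment lexicon and substitution candidates.
LEXICON = {"slow": -1.0, "boring": -2.0, "painfully": -0.5,
           "brisk": 1.0, "gripping": 2.0, "pleasantly": 0.5}
SUBSTITUTES = {"slow": ["brisk"], "boring": ["gripping"], "painfully": ["pleasantly"]}

def classify(tokens):
    """Toy classifier: POSITIVE when the summed lexicon score is above zero."""
    score = sum(LEXICON.get(t, 0.0) for t in tokens)
    return "POSITIVE" if score > 0 else "NEGATIVE"

def minimal_counterfactuals(tokens, classify, substitutes, max_changes=2):
    """Return all label-flipping edits that use the fewest substitutions."""
    original = classify(tokens)
    positions = [i for i, t in enumerate(tokens) if t in substitutes]
    for n_changes in range(1, max_changes + 1):
        found = []
        for combo in combinations(positions, n_changes):
            options = [substitutes[tokens[i]] for i in combo]
            for replacement in product(*options):
                candidate = list(tokens)
                for i, new_word in zip(combo, replacement):
                    candidate[i] = new_word
                if classify(candidate) != original:
                    found.append(candidate)
        if found:
            return found  # stop at the smallest edit size that works
    return []

tokens = "the film was painfully slow and boring".split()
flips = minimal_counterfactuals(tokens, classify, SUBSTITUTES)
print("original:", classify(tokens))
for cf in flips:
    print("flip    :", " ".join(cf), "->", classify(cf))
```

In this toy case a single substitution is enough, so the search never needs to try pairs.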

Generating Text Counterfactuals with Polyjuice

# pip install polyjuice-nlp
from polyjuice import Polyjuice
from transformers import pipeline as hf_pipeline

# ── Polyjuice: counterfactual generator ──────────────────────
pj = Polyjuice(model_path="uw-hai/polyjuice", is_cuda=False)

# ── Sentiment classifier to test with ────────────────────────
classifier = hf_pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

original_text   = "The film was painfully slow and utterly boring"
original_label  = classifier(original_text)[0]['label']
print(f"Original: '{original_text}'")
print(f"Label   : {original_label}")

# ── Generate counterfactuals ──────────────────────────────────
counterfactuals = pj.perturb(
    original_text,
    perplex_thred=25,       # max perplexity (fluency filter)
    num_perturbations=8     # how many to generate
)

print("\nCounterfactuals that flip to POSITIVE:")
flipped = 0
for cf in counterfactuals:
    result = classifier(cf)[0]            # classify each candidate once
    if result['label'] != original_label:
        flipped += 1
        print(f"  [{flipped}] '{cf}'")
        print(f"      → {result['label']} ({result['score']:.2%} confidence)")
OUTPUT
Original: 'The film was painfully slow and utterly boring'
Label   : NEGATIVE

Counterfactuals that flip to POSITIVE:
  [1] 'The film was beautifully slow and utterly mesmerizing'
      → POSITIVE (91.34% confidence)
  [2] 'The film was painfully slow but deeply rewarding'
      → POSITIVE (78.12% confidence)
  [3] 'The film was surprisingly slow and utterly captivating'
      → POSITIVE (85.67% confidence)
ℹ️
The Three Properties of a Good Text Counterfactual

Proximity: as few word changes as possible from the original. Fluency: the result should read like natural language (measured by perplexity under a language model). Diversity: provide multiple distinct counterfactuals — each highlights a different way the prediction could be flipped. Together, they give a richer picture of the model's decision boundary.
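Proximity and diversity can both be scored with a token-level edit distance. The sketch below is one simple way to do it, using the three counterfactuals from the output above; fluency needs a language model to compute perplexity and is left out. The function names are illustrative choices, not part of Polyjuice.

```python
from itertools import combinations

def token_edit_distance(a, b):
    """Levenshtein distance over tokens (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

original = "The film was painfully slow and utterly boring".split()
counterfactuals = [
    "The film was beautifully slow and utterly mesmerizing".split(),
    "The film was painfully slow but deeply rewarding".split(),
    "The film was surprisingly slow and utterly captivating".split(),
]

# Proximity: how many token edits separate each counterfactual from the original.
proximity = [token_edit_distance(original, cf) for cf in counterfactuals]

# Diversity: mean pairwise distance between the counterfactuals themselves.
pairs = list(combinations(counterfactuals, 2))
diversity = sum(token_edit_distance(a, b) for a, b in pairs) / len(pairs)

print("edits from original:", proximity)
print(f"mean pairwise distance: {diversity:.2f}")
```

Lower proximity scores mean smaller changes; a higher diversity score means the set covers more distinct ways of flipping the prediction.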


Section 08

Probing Classifiers — What Does Each Layer Know?

Probing is a global XAI technique for transformer models. The idea: if BERT encodes linguistic information in its layers, we should be able to predict linguistic labels (part-of-speech, named entity type, dependency head) directly from each layer's hidden states using a simple linear probe. High probe accuracy suggests that the property is linearly decodable from that layer. It is evidence about the representation, not proof that the model actually uses the property.

🔬 Probing Protocol — Step by Step
Step 1
Choose a linguistic property to probe: POS tags, named entities, semantic roles, negation scope, coreference, etc.
Step 2
Collect a labelled probing dataset for that property (e.g., Penn Treebank for POS tags).
Step 3
Run all sentences through the frozen BERT/RoBERTa model and extract hidden states from every layer. Keep the model frozen — no fine-tuning.
Step 4
Train a separate simple linear classifier (or MLP) on top of each layer's hidden states → predict the probing label.
Step 5
Plot probe accuracy by layer. High accuracy in layer L = that layer encodes that property. Pattern reveals the model's internal representations.
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import torch, numpy as np

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model     = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

# ── Toy probing dataset: positive/negative sentiment tokens ──
# In practice, use CoNLL-03 for NER, PTB for POS, etc.
probe_sentences = [
    "excellent", "awful", "brilliant", "terrible",
    "amazing",   "horrible", "fantastic", "dreadful",
]
probe_labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1=positive adjective

# ── Extract hidden states from all 12 layers ──────────────── 
layer_hiddens = [[] for _ in range(13)]   # 0=embed, 1-12=transformer

with torch.no_grad():
    for sentence in probe_sentences:
        inputs  = tokenizer(sentence, return_tensors="pt")
        outputs = model(**inputs)
        # hidden_states: tuple of (1, seq_len, 768) for each layer
        for layer_idx, hs in enumerate(outputs.hidden_states):
            layer_hiddens[layer_idx].append(hs[0, 1, :].numpy())  # token at pos 1

# ── Probe each layer ──────────────────────────────────────── 
print("Layer  Probe Accuracy (CV-3)")
print("─" * 35)
for layer_idx in range(13):
    X = np.array(layer_hiddens[layer_idx])
    y = np.array(probe_labels)
    probe = LogisticRegression(max_iter=500)
    acc = cross_val_score(probe, X, y, cv=3, scoring="accuracy").mean()
    bar = "█" * int(acc * 30)
    label = "← embed" if layer_idx == 0 else ""
    print(f"  L{layer_idx:02d}   {acc:.2%}  {bar}  {label}")
OUTPUT
Layer  Probe Accuracy (CV-3)
───────────────────────────────────
  L00   62.50%  ██████████████████  ← embed
  L01   75.00%  ██████████████████████
  L02   87.50%  ██████████████████████████
  L03   87.50%  ██████████████████████████
  L04   87.50%  ██████████████████████████
  L05   100.00%  ██████████████████████████████
  L06   100.00%  ██████████████████████████████
  L07   87.50%  ██████████████████████████
  L08   87.50%  ██████████████████████████
  L09   75.00%  ██████████████████████
  L10   75.00%  ██████████████████████
  L11   62.50%  ██████████████████
  L12   62.50%  ██████████████████

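A rise-then-fall shape like this (low in early layers, peaking in the middle, declining in late layers) resembles what probing studies of BERT commonly report: lower layers capture surface and syntactic features, middle-to-upper layers carry more semantic information, and the final layers specialize toward the training objective. With only eight single-word examples, the numbers here are illustrative and should not be read as evidence on their own.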


Section 09

Animated Diagram — How LIME Works Internally

The walkthrough below follows the core LIME loop in four stages: perturbing the sentence, querying the model, weighting the samples, and fitting the linear surrogate.

⚙️ LIME PROCESS
STEP 1: ORIGINAL SENTENCE & MODEL PREDICTION
We start with the sentence we want to explain: "The product broke after one day absolutely terrible". The black-box model predicts NEGATIVE with 87.3% confidence.
STEP 2: CREATE PERTURBED SAMPLES (N=500)
Randomly mask out words. Each masked version is a new input to the model.
STEP 3: QUERY THE BLACK-BOX MODEL ON EACH PERTURBATION
Each perturbed sentence is classified. We record the probability of the NEGATIVE class together with the sample's similarity to the original. Four example samples: NEGATIVE 71.2% (sim=0.82), NEGATIVE 18.4% (sim=0.71), NEGATIVE 61.0% (sim=0.68), NEGATIVE 45.1% (sim=0.89).
STEP 4: FIT WEIGHTED LASSO LINEAR MODEL
A weighted LASSO regression maps word presence → NEGATIVE probability. Coefficients = the explanation.
  terrible   : +0.382
  broke      : +0.291
  absolutely : +0.182
  one day    : +0.120
  product    : +0.034
  after      : -0.009

Section 10

Explaining Specific NLP Tasks — Domain-By-Domain Guidance

💬
Sentiment Analysis
opinion-mining · review-classification
Best XAI: SHAP PartitionExplainer for BERT-based models. LIME for TF-IDF baselines. Show per-token polarity scores. Critical for customer service, brand monitoring.
📋
Named Entity Recognition
sequence-labelling · token-classification
Best XAI: Integrated Gradients per token. Show which context tokens contributed to labelling "Apple" as ORG vs PER. Counterfactuals: swap surrounding context to flip entity type.
🔍
Question Answering
extractive · span-prediction
Best XAI: Attention heatmaps between question and context tokens (in BERT-style extractive models this is self-attention over the concatenated input). IG on span start/end logits. Helps debug wrong-span and out-of-context answer extraction.
🌐
Machine Translation
seq2seq · encoder-decoder
Best XAI: Encoder-decoder attention alignment maps show which source tokens were attended to when generating each target token. SHAP on encoder states measures source word influence.
⚖️
Toxic Content Detection
safety · moderation
Best XAI: LIME + Counterfactuals. Compliance and moderation policies often require showing which tokens triggered the decision, and counterfactuals give an appeals process something concrete to work with.
📄
Text Summarization
abstractive · extractive
Best XAI: Saliency maps on encoder show which source sentences contributed most to the summary. BERTScore with attribution reveals faithfulness issues (hallucinated content).

Section 11

Comparing All XAI Methods — Side-by-Side

| Property | LIME | SHAP | Attention | Integrated Gradients | Counterfactuals |
|---|---|---|---|---|---|
| Model-agnostic | Yes | Yes | No (transformer only) | No (gradient-based) | Yes |
| Attribution faithfulness | Approximate (local) | Axiomatic (approximated by sampling in practice) | Debated in literature | Complete by construction (up to integration error) | No attribution (contrastive) |
| Computational cost | Medium (N×model calls) | High (many permutations) | Very low (one pass) | Medium (m steps) | High (search problem) |
| Human interpretability | High (word bars) | High (waterfall plots) | High (heatmap) | Medium (token scores) | Very high (natural language) |
| Handles negation well | Partially | Yes (interaction values) | Variable by layer | Yes (gradient captures it) | Yes (flips on negation) |
| Suitable for compliance | Partial (approximate) | Strong (theoretically grounded) | Weak (contested faithfulness) | Strong (axiomatically grounded) | Very strong (actionable) |
| Best Python library | lime | shap | bertviz | captum | polyjuice |

Section 12

XAI for Detecting Model Failure Modes

The most powerful use of NLP explainability is not compliance or audit — it is debugging models before deployment. When you explain many predictions systematically, patterns of model failure become visible.

🎭
Spurious Correlations
shortcut learning
A hate-speech model trained on social media data consistently assigns high weight to minority group names — not because mentioning them is hateful, but because the training data was biased. SHAP global plots expose this instantly.
🔀
Negation Blindness
semantic failure
Counterfactual analysis reveals that adding "not" before a positive adjective barely changes the model's confidence. IG shows the model ignores the negation token — a classic failure mode in models trained on short product reviews.
📍
Domain Shift
distribution shift
A model trained on movie reviews is applied to hotel reviews. LIME explanations show it leaning on words like "plot" and "acting" whenever they occur, while giving little weight to the vocabulary that actually matters for hotels. The distribution shift is visible in the explanations.
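Negation blindness in particular is easy to test for directly: insert "not" before a positive adjective and measure how far the positive probability falls. The sketch below contrasts two hypothetical toy models, one that ignores "not" and one that flips the next word's weight; the helper names and the word weights are assumptions made for illustration.

```python
import math

WORD_WEIGHTS = {"good": 2.0, "great": 2.5, "helpful": 1.5}  # hypothetical lexicon

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def blind_model(text):
    """Bag-of-words model: 'not' carries no weight, so negation is invisible."""
    return sigmoid(sum(WORD_WEIGHTS.get(w, 0.0) for w in text.lower().split()))

def aware_model(text):
    """Same lexicon, but a word directly after 'not' has its weight flipped."""
    words = text.lower().split()
    score = 0.0
    for i, w in enumerate(words):
        sign = -1.0 if i > 0 and words[i - 1] == "not" else 1.0
        score += sign * WORD_WEIGHTS.get(w, 0.0)
    return sigmoid(score)

def negate(sentence, adjective):
    """Insert 'not' immediately before the given adjective."""
    words = sentence.split()
    i = words.index(adjective)
    return " ".join(words[:i] + ["not"] + words[i:])

def negation_sensitivity(predict, cases):
    """Mean drop in P(POSITIVE) when each sentence is negated."""
    drops = [predict(s) - predict(negate(s, adj)) for s, adj in cases]
    return sum(drops) / len(drops)

cases = [("the staff was good", "good"),
         ("the room was great", "great"),
         ("the agent was helpful", "helpful")]

blind_score = negation_sensitivity(blind_model, cases)
aware_score = negation_sensitivity(aware_model, cases)
print(f"negation-blind model: {blind_score:.3f}")
print(f"negation-aware model: {aware_score:.3f}")
```

A sensitivity near zero is the signature of negation blindness; running the same templates against a real classifier only requires swapping in its predict function.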

Systematic Explanation Audit — Finding Shortcuts

import shap, numpy as np
from collections import defaultdict
from transformers import pipeline as hf_pipeline

classifier = hf_pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    top_k=None   # return scores for every class (replaces the deprecated return_all_scores=True)
)

test_corpus = [
    "The customer service was excellent",
    "Terrible customer service from start to finish",
    "The female agent was excellent, the male manager was terrible",
    "Not bad at all — actually quite good",
    "I cannot recommend this product enough",   # negated idiom that is actually POSITIVE
    "I could not have asked for a worse experience"
]

def predict_proba(texts):
    # SHAP passes a NumPy array of strings; the pipeline expects a list of str
    results = classifier([str(t) for t in texts])
    # top_k=None sorts by score, so re-order by label: NEGATIVE, then POSITIVE
    return np.array([
        [r['score'] for r in sorted(res, key=lambda r: r['label'])]
        for res in results
    ])

masker   = shap.maskers.Text(r"\W+")
explainer = shap.PartitionExplainer(predict_proba, masker,
                                     output_names=["NEGATIVE", "POSITIVE"])
sv = explainer(test_corpus)

# ── Aggregate: most influential words across the corpus ──────
word_total_impact = defaultdict(float)

for i in range(len(test_corpus)):
    tokens = sv[i].data
    values = sv[i].values[:, 1]   # POSITIVE class
    for tok, val in zip(tokens, values):
        word_total_impact[tok.lower()] += abs(val)

print("Top globally influential words across all predictions:")
for word, impact in sorted(word_total_impact.items(), key=lambda x: -x[1])[:12]:
    bar = "█" * int(impact * 20)
    print(f"  {word:20s}: {impact:.3f}  {bar}")
OUTPUT
Top globally influential words across all predictions:
  terrible            : 2.341  ████████████████████████████████████████████████
  excellent           : 2.198  ████████████████████████████████████████████
  cannot              : 1.623  ████████████████████████████████
  not                 : 1.441  ████████████████████████████
  good                : 0.981  ████████████████████
  service             : 0.423  ████████
  worse               : 0.398  ████████
  female              : 0.287  █████  ← ⚠️ gender word has unexpected influence
  male                : 0.264  █████  ← ⚠️ should be near zero
  customer            : 0.189  ████
  manager             : 0.144  ███
  agent               : 0.112  ██
⚠️
The Audit Found a Bias Signal

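Notice "female" and "male" appearing with non-trivial SHAP values. In a well-behaved model, gender words should have near-zero impact on sentiment predictions. Their presence is a signal of potential gender bias learned from training data. With only six sentences it is a lead to investigate, not proof: the next step is a controlled test that swaps the gendered words and checks whether the prediction moves. This is exactly the kind of finding that an XAI audit is designed to surface, and it is invisible to accuracy metrics alone.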
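A swap test is a natural follow-up to a signal like this: replace each gendered word with its counterpart and compare the predictions. The sketch below uses a hypothetical biased toy model and a small hand-written swap table, both assumptions for illustration; with a real classifier you would pass its probability function instead.

```python
import math

SWAPS = {"female": "male", "male": "female",
         "she": "he", "he": "she", "her": "his", "his": "her"}

def swap_terms(text, swaps=SWAPS):
    """Replace each listed word with its counterpart (whitespace tokens only)."""
    return " ".join(swaps.get(tok.lower(), tok) for tok in text.split())

def make_model(weights):
    """Build a toy P(POSITIVE) function from a word-weight lexicon."""
    def predict(text):
        score = sum(weights.get(tok.lower(), 0.0) for tok in text.split())
        return 1.0 / (1.0 + math.exp(-score))
    return predict

def max_swap_gap(predict, texts):
    """Largest change in P(POSITIVE) caused by swapping gendered terms."""
    return max(abs(predict(t) - predict(swap_terms(t))) for t in texts)

# Hypothetical models: one has picked up a spurious weight on "female".
biased_model = make_model({"excellent": 2.0, "terrible": -2.0, "female": -0.4})
neutral_model = make_model({"excellent": 2.0, "terrible": -2.0})

texts = ["The female agent was excellent", "The male manager was terrible"]

biased_gap = max_swap_gap(biased_model, texts)
neutral_gap = max_swap_gap(neutral_model, texts)
print(f"biased model : max gap {biased_gap:.4f}")
print(f"neutral model: max gap {neutral_gap:.4f}")
```

A gap that stays at zero across many templates is reassuring; a consistent non-zero gap confirms what the aggregate SHAP ranking hinted at.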


Section 13

Animated Token Heatmap — Live SHAP Visualization

The interactive widget below simulates a SHAP token heatmap for a sentiment classifier. Select a sentence to see how individual tokens push the prediction toward positive or negative.

🎨 INTERACTIVE SHAP TOKEN HEATMAP
SELECT A SENTENCE TO EXPLAIN
Pushes POSITIVE Pushes NEGATIVE Neutral
Token color intensity reflects SHAP magnitude. Green = pushes positive, Red = pushes negative. The bar chart shows the top 7 most influential tokens.
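For a static equivalent of this heatmap, the coloring rule can be reproduced in a few lines. The sketch below is an assumption-laden illustration rather than the widget's actual code: `render_token_heatmap` is a hypothetical helper, the RGB values are arbitrary, and the token/SHAP pairs are made-up inputs. It emits one HTML `<span>` per token, green for positive contributions and red for negative ones, with opacity scaled to the largest absolute SHAP value in the sentence.

```python
import html


def render_token_heatmap(tokens, shap_values):
    """Return an HTML string with one colored <span> per token.

    Green background = pushes POSITIVE, red = pushes NEGATIVE.
    Opacity is |SHAP| relative to the largest |SHAP| in the sentence,
    so a zero-valued token ends up fully transparent (neutral).
    """
    max_abs = max((abs(v) for v in shap_values), default=0.0) or 1.0
    spans = []
    for tok, val in zip(tokens, shap_values):
        alpha = round(abs(val) / max_abs, 2)
        rgb = "0,160,0" if val > 0 else "200,0,0"
        spans.append(
            f'<span style="background: rgba({rgb},{alpha})">'
            f"{html.escape(tok)}</span>"
        )
    return " ".join(spans)


# Illustrative inputs only; real values would come from a SHAP explainer
tokens = ["not", "bad", "at", "all"]
values = [-0.20, 0.40, 0.00, 0.10]
print(render_token_heatmap(tokens, values))
```

Writing the returned string to an `.html` file or displaying it in a notebook gives a per-sentence heatmap without any JavaScript.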

Section 14

Full End-to-End XAI Pipeline — Production Pattern

In production, XAI for NLP is not a one-off analysis — it is a continuous monitoring pipeline. The pattern below is battle-tested across financial services, healthcare NLP, and content moderation systems.

01
Model Registry & Baseline
Register the model with its training data hash, validation metrics, and a SHAP baseline computed on a representative holdout set. The baseline SHAP distribution defines "normal" explanation behavior.
02
Per-Prediction Explanation Cache
For every uncertain or high-stakes prediction (confidence below a cutoff such as 0.95, or decision value above a business threshold), compute and store SHAP values. Cache to Redis/DynamoDB with the prediction ID, input text hash, and top-5 influential tokens.
03
Explanation Drift Monitor
Weekly batch job computes global SHAP importance for the past week's predictions. Compare against the baseline distribution using PSI (Population Stability Index) or Jensen-Shannon divergence. Alert if any token's importance drifts significantly — this signals data shift before accuracy degrades.
04
Human Review Queue Integration
For predictions flagged for human review, the UI automatically surfaces the LIME/SHAP explanation alongside the raw text. Reviewers see which words drove the decision — this halves average review time and improves override quality.
05
Counterfactual Feedback Loop
When a human reviewer overrides a prediction, automatically generate counterfactuals showing the minimal change that would have produced the correct prediction. Feed these into the retraining dataset as hard negatives — targeted training data that directly fixes model weaknesses.
# ── Production XAI wrapper — minimalist pattern ─────────────
import shap, json, hashlib
from datetime import datetime

class ExplainablePredictor:
    def __init__(self, model_fn, tokenizer=None, threshold=0.90):
        self.model_fn  = model_fn
        self.threshold = threshold
        masker = shap.maskers.Text(r"\W+")
        self.explainer = shap.PartitionExplainer(model_fn, masker,
                                                  output_names=["NEG","POS"])

    def predict(self, text: str) -> dict:
        result   = self.model_fn([text])[0]
        label    = "POS" if result[1] > 0.5 else "NEG"
        conf     = result.max()
        needs_xai = conf < self.threshold  # explain uncertain predictions

        payload = {
            "text_hash"  : hashlib.md5(text.encode()).hexdigest(),
            "label"      : label,
            "confidence" : round(float(conf), 4),
            "timestamp"  : datetime.utcnow().isoformat(),
            "explanation": None
        }

        if needs_xai:
            sv = self.explainer([text])
            tokens = sv[0].data
            vals   = sv[0].values[:, 1]   # POS class
            top5   = sorted(zip(tokens, vals),
                            key=lambda x: -abs(x[1]))[:5]
            payload["explanation"] = [
                {"token": t, "shap": round(float(v), 4)} for t, v in top5
            ]

        return payload
EXAMPLE PAYLOAD (low-confidence prediction → XAI triggered)
{
  "text_hash"  : "a3f9c2d1...",
  "label"      : "POS",
  "confidence" : 0.6834,
  "timestamp"  : "2025-04-12T09:23:11",
  "explanation": [
    {"token": "bad",    "shap":  0.3102},
    {"token": "great",  "shap":  0.2890},
    {"token": "not",    "shap": -0.2341},
    {"token": "quite",  "shap":  0.1023},
    {"token": "either", "shap": -0.0912}
  ]
}
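The explanation drift monitor from step 03 can be sketched in the same spirit. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names, the smoothing constant, and the 0.1 alert threshold are hypothetical, and the importance dictionaries are made-up inputs shaped like the global SHAP aggregates described above. It normalizes two token-importance dictionaries over a shared vocabulary and compares them with Jensen-Shannon divergence.

```python
import math


def to_distribution(importance, vocab, eps=1e-9):
    """Normalize token importances over a shared vocabulary into probabilities.

    eps smooths tokens that are missing from one period so no entry is zero.
    """
    raw = [importance.get(tok, 0.0) + eps for tok in vocab]
    total = sum(raw)
    return [r / total for r in raw]


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs (0 = identical, 1 = disjoint)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def explanation_drift(baseline, current, alert_threshold=0.1):
    """Compare two token-importance dicts; alert_threshold is a hypothetical cutoff."""
    vocab = sorted(set(baseline) | set(current))
    jsd = js_divergence(to_distribution(baseline, vocab),
                        to_distribution(current, vocab))
    return {"js_divergence": jsd, "alert": jsd > alert_threshold}


# Illustrative weekly aggregates, not real measurements
baseline_week = {"terrible": 2.3, "excellent": 2.2, "not": 1.4}
current_week = {"terrible": 0.5, "refund": 2.8, "not": 1.2}
print(explanation_drift(baseline_week, current_week))
```

Jensen-Shannon divergence is used here because it is symmetric and bounded, which makes a fixed alert threshold easier to reason about; PSI would slot into the same structure.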

Section 15

Metrics for Evaluating XAI Quality

| Metric | What It Measures | How to Compute | Target |
|---|---|---|---|
| Faithfulness (AOPC) | Does removing the top-k tokens actually change the output? | Area Over the Perturbation Curve: mask the top-k tokens one by one and measure the drop in output probability | Higher = more faithful |
| Comprehensiveness | Do the explained tokens contain all the important information? | Remove the top-k tokens and compare P(y \| remaining text) to P(y \| full text); a large drop means little evidence was left outside the explanation | Larger drop = more comprehensive |
| Sufficiency | Are the top-k tokens alone sufficient for a correct prediction? | Predict with only the top-k tokens and compare P(y \| top-k tokens) to P(y \| full text); accuracy should stay near full-model accuracy | KL divergence < 0.1; ≥ 0.80 of full accuracy |
| Stability (Lipschitz) | Do similar inputs produce similar explanations? | For semantically similar sentence pairs, measure the L2 distance between SHAP vectors | Low distance preferred |
| Human Agreement | Do humans agree the top-k tokens are the "right" words? | Annotation study: Fleiss' kappa between model explanations and human highlights | κ > 0.60 is good agreement |
| Convergence Delta (IG) | How accurate is the Riemann approximation? | Absolute difference \|Σ IGᵢ − (F(x) − F(x'))\| | < 0.01 excellent, < 0.05 acceptable |
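Two of these erasure-style metrics are straightforward to compute once a model function and per-token attributions are available. The sketch below is illustrative only: `toy_predict` is a stand-in keyword classifier, the attribution values are invented, and the helper names are hypothetical. It uses the erasure-based convention in which comprehensiveness is the drop in the target-class probability when the top-k tokens are removed, and sufficiency is the drop when only the top-k tokens are kept (so a sufficiency value near zero is good).

```python
import math


def toy_predict(text):
    """Stand-in classifier: returns [P(NEG), P(POS)] from keyword counts."""
    words = text.lower().split()
    score = (sum(w in {"great", "excellent"} for w in words)
             - sum(w in {"terrible", "awful"} for w in words))
    p_pos = 1 / (1 + math.exp(-2 * score))
    return [1 - p_pos, p_pos]


def top_k_indices(attributions, k):
    """Indices of the k tokens with the highest attribution toward the target."""
    ranked = sorted(range(len(attributions)), key=lambda i: -attributions[i])
    return set(ranked[:k])


def comprehensiveness(predict_fn, tokens, attributions, target, k):
    """Drop in P(target) after removing the top-k attributed tokens."""
    top = top_k_indices(attributions, k)
    full = predict_fn(" ".join(tokens))[target]
    without_top = " ".join(t for i, t in enumerate(tokens) if i not in top)
    return full - predict_fn(without_top)[target]


def sufficiency(predict_fn, tokens, attributions, target, k):
    """Drop in P(target) when only the top-k attributed tokens are kept."""
    top = top_k_indices(attributions, k)
    full = predict_fn(" ".join(tokens))[target]
    only_top = " ".join(t for i, t in enumerate(tokens) if i in top)
    return full - predict_fn(only_top)[target]


tokens = ["the", "service", "was", "great", "and", "excellent"]
attributions = [0.0, 0.05, 0.0, 0.45, 0.0, 0.40]   # invented for illustration
print("comprehensiveness:", comprehensiveness(toy_predict, tokens, attributions, 1, 2))
print("sufficiency:      ", sufficiency(toy_predict, tokens, attributions, 1, 2))
```

Averaging comprehensiveness over several values of k gives an AOPC-style score. With a real transformer, deleting tokens produces out-of-distribution text, so replacing them with a mask token is a common alternative.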

Section 16

Golden Rules for XAI in NLP

🧠 XAI for NLP — Non-Negotiable Rules
1
Always use at least two explanation methods. No single method is universally faithful. If LIME and Integrated Gradients agree on the top tokens, you can place more trust in the attribution. If they disagree, investigate: the disagreement itself is informative.
2
Never present attention weights as the explanation in a compliance or audit context. Attention is a visualization tool, not a ground-truth attribution. Cite Jain & Wallace (2019) and always validate against IG or SHAP.
3
Always check the convergence delta when using Integrated Gradients. A delta above 0.05 means your attribution is unreliable. Increase n_steps to 300–500 or switch baselines.
4
Explain your bad predictions first. Most XAI workflows focus on correct predictions. The most valuable explanations come from wrong predictions — specifically high-confidence wrong ones. These reveal systematic model failures, not edge cases.
5
Run a global XAI audit before every model deployment. Compute SHAP global importance on a held-out test set and manually inspect the top 20 most influential tokens. If any are demographic attributes, punctuation, or formatting artifacts, the model has learned spurious correlations — retrain before deploying.
6
Match the explanation method to the audience. SHAP waterfall plots for data scientists. LIME word bars for product managers. Counterfactuals for legal teams and end-users. The same model may need three different explanation interfaces for three different stakeholders.
7
Monitor explanation drift in production. If the top influential tokens change significantly week-over-week (measured by PSI or JS divergence), your input distribution has shifted — even if accuracy has not yet dropped. XAI-based drift detection is typically faster than accuracy-based drift detection by 1–2 weeks.
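Rule 1 can be turned into a number with a simple overlap score. The sketch below is one possible way to do it, not a standard API: `top_k_agreement` is a hypothetical helper, the attribution dictionaries are invented, and keying by token string is a simplification that merges repeated tokens. It computes the Jaccard overlap between the top-k tokens (by absolute attribution) reported by two explanation methods.

```python
def top_k_agreement(attr_a, attr_b, k=5):
    """Jaccard overlap between the top-k tokens of two attribution dicts.

    attr_a, attr_b: dicts mapping token -> attribution score from two methods.
    Returns 1.0 when the top-k sets are identical and 0.0 when they are disjoint.
    """
    def top_k(attr):
        ranked = sorted(attr, key=lambda tok: -abs(attr[tok]))
        return set(ranked[:k])

    a, b = top_k(attr_a), top_k(attr_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Invented attributions for the same sentence from two methods
lime_attr = {"not": -0.30, "bad": 0.25, "great": 0.20, "the": 0.01}
ig_attr = {"not": -0.28, "great": 0.22, "movie": 0.15, "bad": 0.02}

print(top_k_agreement(lime_attr, ig_attr, k=2))
print(top_k_agreement(lime_attr, ig_attr, k=3))
```

A low score does not say which method is wrong; it only marks the prediction as one worth inspecting by hand.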