XAI for NLP: Explain Text Classifiers with SHAP & LIME

Section 01

The Story That Explains Why NLP Models Need Explaining

📖 Real World Story

The Angry Customer and the Mysterious Black Box

Imagine a bank deploys a sentiment model to auto-triage customer complaints. One morning, a furious email arrives: "I absolutely love how your fees magically vanish from my account." The model labels it POSITIVE with 94% confidence. The ticket is automatically closed as resolved.

Three days later, the customer files a regulator complaint. The bank's compliance officer asks the AI team: "Why did the model think that was positive?" The engineers stare at their screens. They have no answer.

This is the core problem that Explainable AI (XAI) for NLP solves — not just getting the right answer, but understanding why the model gave that answer, and when it is dangerously wrong.

Language models are among the most powerful and most opaque systems in modern machine learning. A BERT-base sentiment classifier has about 110 million parameters. A single prediction flows through twelve transformer layers, 144 attention heads (12 per layer), and thousands of non-linear activations. There is no single line of code you can point to and say "here is where it decided the review was negative."

🧠

What XAI for NLP Actually Means

XAI for NLP is not a single technique — it is a family of methods that answer different questions: Which words mattered? (attribution methods), What would have changed the decision? (counterfactuals), Which training examples drove this prediction? (influence functions), and What rules does the model appear to follow? (surrogate models). Each answer is useful in a different context.

Section 02

The XAI Landscape for NLP — A Map

Before diving into individual techniques, it helps to see how the entire XAI-for-NLP space is organized. Methods differ on two key axes: scope (local = one prediction, global = the whole model) and access (white-box = model internals available, black-box = only inputs and outputs).

Method Family	Scope	Access	Key Output	Best Used For
LIME	Local	Black-box	Word importance scores	Any model, quick debugging
SHAP (Text)	Local	Black-box	Shapley values per token	Rigorous attribution, consistent
Attention Visualization	Local	White-box	Attention weight heatmaps	Transformer debugging, QA models
Integrated Gradients	Local	White-box	Gradient attribution per token	Deep models, token-level precision
Counterfactuals	Local	Black-box	"What would flip the prediction?"	Regulatory compliance, fairness
Concept Activation Vectors	Global	White-box	High-level concept sensitivity	Understanding model behavior globally
Probing Classifiers	Global	White-box	What linguistic info each layer learns	BERT/GPT research, layer analysis

⚠️

The Faithfulness Problem

The most dangerous misconception in XAI is confusing plausible with faithful explanations. An explanation is plausible if it looks reasonable to a human. It is faithful if it accurately reflects what the model actually computed. Many popular methods produce plausible but unfaithful explanations — they tell a good story that does not match the model's true reasoning.
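One common way to probe faithfulness is a deletion check: remove the tokens an explanation ranks highest and see whether the model's score actually drops. The sketch below illustrates the idea under stated assumptions: `toy_positive_score` and its `WEIGHTS` are hypothetical stand-ins for a real classifier, and both attribution vectors are made up for the example.

```python
# Deletion-based faithfulness check (illustrative sketch, not a library API).
import math

# Hypothetical toy model: a few sentiment words with fixed weights.
WEIGHTS = {"love": 2.0, "great": 1.5, "terrible": -2.5, "slow": -1.0}

def toy_positive_score(tokens):
    """Sigmoid of summed word weights; unknown words contribute 0."""
    z = sum(WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))

def score_drop_after_deleting(tokens, attributions, k, model=toy_positive_score):
    """Delete the k tokens with the highest attribution and return the drop
    in model score. A faithful explanation should produce a large drop."""
    ranked = sorted(range(len(tokens)), key=lambda i: -attributions[i])
    removed = set(ranked[:k])
    kept = [t for i, t in enumerate(tokens) if i not in removed]
    return model(tokens) - model(kept)

tokens = ["i", "love", "this", "great", "phone"]
faithful = [0.0, 0.6, 0.0, 0.4, 0.0]    # points at words the toy model uses
plausible = [0.1, 0.0, 0.2, 0.0, 0.7]   # points at "phone", which it ignores

drop_faithful = score_drop_after_deleting(tokens, faithful, k=2)
drop_plausible = score_drop_after_deleting(tokens, plausible, k=2)
print(f"score drop, faithful ranking : {drop_faithful:.3f}")
print(f"score drop, plausible ranking: {drop_plausible:.3f}")
```

Deleting the tokens named by the faithful ranking moves the score; deleting the tokens named by the merely plausible ranking does not. With a real model the same comparison is usually made against a random-deletion baseline.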

Section 03

LIME for Text Classifiers — Explained With a Story

📖 Analogy

The Blindfolded Taster

You are handed a mysterious sauce. You cannot see inside the bottle. But you can taste it. So you start experimenting: you remove the chilli and taste again — less hot. You remove the garlic — still hot. You remove both — mild. By systematically removing and adding ingredients and tasting each result, you build a simple mental model of which ingredients matter most.

LIME (Local Interpretable Model-agnostic Explanations) does exactly this with text. It removes (or masks) words, asks the black-box model to re-classify each perturbed version, and then fits a simple linear model to the results. The linear model's coefficients tell you which words most influenced the prediction.

How LIME Works on Text — Step by Step

🔬 LIME Algorithm — Text Classification

Step 1

Take the original sentence and the model's prediction. E.g., "The product broke after one day" → NEGATIVE (92%)

Step 2

Create N perturbed versions by randomly masking out words: "The product [MASK] after one day", "[MASK] product broke after one day", etc.

Step 3

Pass all N versions through the black-box model. Record the predicted probability for the class of interest (NEGATIVE).

Step 4

Weight each perturbed sample by its proximity to the original sentence: the cosine distance between the two word-presence vectors is passed through an exponential kernel (closer = higher weight).

Step 5

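Fit a weighted sparse linear model mapping word presence/absence → predicted probability. The original paper uses K-LASSO; the lime library by default selects the top features and then fits a weighted ridge regression on them.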

Output

The linear model coefficients are the explanation: broke → +0.41, one day → +0.22, product → +0.05
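The five steps above can be sketched from scratch in plain Python. This is a simplified illustration, not the lime library's implementation: `toy_negative_prob` is a hypothetical black box, the kernel width and ridge penalty are arbitrary choices, and the surrogate is fitted with a small hand-written solver instead of scikit-learn.

```python
# From-scratch LIME-style explainer (illustrative sketch only).
import math
import random

NEG_WEIGHTS = {"broke": 2.0, "terrible": 2.5, "love": -2.0}  # hypothetical model

def toy_negative_prob(tokens):
    """Hypothetical black box: probability of NEGATIVE for a token list."""
    z = sum(NEG_WEIGHTS.get(t, 0.0) for t in tokens) - 1.0
    return 1.0 / (1.0 + math.exp(-z))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def lime_text(tokens, model, num_samples=500, kernel_width=0.75, ridge=1e-3, seed=0):
    rng = random.Random(seed)
    d = len(tokens)
    rows, targets, weights = [], [], []
    for s in range(num_samples):
        # Steps 1-2: sample 0 is the original; the rest randomly drop words.
        z = [1] * d if s == 0 else [rng.randint(0, 1) for _ in range(d)]
        kept = [t for t, keep in zip(tokens, z) if keep]
        # Step 3: query the black box on the perturbed text.
        targets.append(model(kept))
        # Step 4: exponential kernel over cosine distance to the original.
        cos_sim = sum(z) / math.sqrt(d * sum(z)) if sum(z) else 0.0
        weights.append(math.exp(-((1.0 - cos_sim) ** 2) / kernel_width ** 2))
        rows.append(z + [1])  # trailing 1 is the intercept column
    # Step 5: weighted ridge regression via the normal equations
    # (the intercept is not penalized).
    p = d + 1
    A = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows))
          + (ridge if i == j and i < d else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(w * r[i] * y for w, r, y in zip(weights, rows, targets)) for i in range(p)]
    coefs = solve(A, b)
    return dict(zip(tokens, coefs[:d]))

tokens = "the product broke after one day".split()
explanation = lime_text(tokens, toy_negative_prob)
for word, coef in sorted(explanation.items(), key=lambda kv: -abs(kv[1])):
    print(f"{word:10s} {coef:+.3f}")
```

Because the toy model reacts only to "broke" in this sentence, the surrogate assigns that word nearly all of the weight and leaves the others close to zero. A real classifier is rarely this linear, which is why the kernel and sample count matter in practice.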

LIME in Python — Sentiment Classifier

from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np

# ── 1. Train a simple sentiment classifier ──────────────────
train_texts = [
    "great product love it", "absolutely terrible broke immediately",
    "best purchase ever wonderful", "horrible waste of money disgusting",
    "decent quality good value", "arrived damaged very disappointed",
]
train_labels = [1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf',  LogisticRegression(max_iter=500))
])
pipeline.fit(train_texts, train_labels)

# ── 2. Set up LIME explainer ─────────────────────────────────
explainer = LimeTextExplainer(
    class_names=['Negative', 'Positive'],
    split_expression=r'\s+',   # split on whitespace
    bow=True                    # bag-of-words perturbation
)

# ── 3. Explain a specific prediction ────────────────────────
test_sentence = "The product broke after one day — absolutely terrible"

exp = explainer.explain_instance(
    test_sentence,
    pipeline.predict_proba,   # must return probabilities
    labels=(0,),                # explain class 0 (Negative); LIME defaults to class 1
    num_features=6,             # top 6 words to explain
    num_samples=500             # number of perturbed samples
)

# ── 4. Print results ─────────────────────────────────────────
print(f"Predicted class : {pipeline.predict([test_sentence])[0]} (0=Neg, 1=Pos)")
print(f"Confidence      : {pipeline.predict_proba([test_sentence])[0].max():.2%}")
print("\nWord Contributions (+ = pushes toward Negative class):")
for word, weight in exp.as_list(label=0):   # weights relative to the Negative class
    direction = "→ NEGATIVE" if weight > 0 else "→ POSITIVE"
    print(f"  {word:25s}: {weight:+.4f}  {direction}")

OUTPUT

Predicted class : 0 (0=Neg, 1=Pos)
Confidence      : 87.34%

Word Contributions (+ = pushes toward Negative class):
  terrible                 : +0.3821  → NEGATIVE
  broke                    : +0.2914  → NEGATIVE
  absolutely               : +0.1823  → NEGATIVE
  one day                  : +0.1205  → NEGATIVE
  product                  : +0.0342  → NEGATIVE
  after                    : -0.0089  → POSITIVE

💡

The num_samples Tradeoff

More samples (num_samples) → more stable explanation, but slower. The lime default is 5000. For production monitoring dashboards, a few hundred samples (e.g., num_samples=200) is fast and usually good enough for trends. For a compliance report on one specific disputed prediction, use num_samples=2000 or more and check that repeated runs agree on the top words.
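Stability can be measured rather than assumed. One simple option, sketched below, is to explain the same sentence several times with different random seeds and compute the average Jaccard overlap of the top-k words. The three `runs` are made-up numbers in the (word, weight) shape that `exp.as_list()` returns; the helper names are hypothetical.

```python
# Explanation stability across repeated runs (illustrative sketch).
from itertools import combinations

def top_k_words(explanation, k):
    """explanation: list of (word, weight) pairs, e.g. from exp.as_list()."""
    ranked = sorted(explanation, key=lambda pair: -abs(pair[1]))
    return {word for word, _ in ranked[:k]}

def mean_jaccard(explanations, k=3):
    """Average pairwise Jaccard overlap of the top-k words across runs."""
    sets = [top_k_words(e, k) for e in explanations]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Hypothetical outputs of three runs on the same sentence (made-up weights).
runs = [
    [("terrible", 0.38), ("broke", 0.29), ("absolutely", 0.18), ("day", 0.05)],
    [("terrible", 0.36), ("broke", 0.31), ("day", 0.12), ("absolutely", 0.10)],
    [("terrible", 0.40), ("broke", 0.27), ("absolutely", 0.20), ("day", 0.03)],
]
stability = mean_jaccard(runs, k=3)
print(f"mean top-3 Jaccard overlap: {stability:.2f}")
```

A value near 1.0 means the runs agree on which words matter; a low value is a signal to raise num_samples before trusting the ranking.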

Section 04

SHAP for NLP — The Rigorous Way to Attribute Words

LIME approximates locally. SHAP (SHapley Additive exPlanations) goes further: it uses Shapley values from cooperative game theory to assign each word a contribution that satisfies a set of fairness axioms (efficiency, symmetry, and the null-player property). Efficiency means the contributions always sum to the difference between the model's output and a baseline output.

Shapley Value Formula

φᵢ = Σ [|S|!(n-|S|-1)!/n!] × [f(S∪{i}) - f(S)]

Sum over all possible subsets S not containing word i. Each term is the weighted marginal contribution of adding word i to subset S.

Efficiency Property

f(x) - E[f(x)] = Σᵢ φᵢ

The sum of all Shapley values equals the difference between the model output and the baseline (average) output. They always account for 100% of the prediction.
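For a handful of words the formula can be evaluated exactly by enumerating every subset. The sketch below does that for a hypothetical three-word model `f` with a built-in negation interaction; here the empty set plays the role of the baseline, whereas SHAP in practice uses a masked or averaged baseline.

```python
# Exact Shapley values by subset enumeration (feasible only for small n).
from itertools import combinations
from math import factorial

def shapley_values(players, value_fn):
    """phi_i = sum over S not containing i of
    |S|! (n-|S|-1)! / n! * (f(S ∪ {i}) - f(S))."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi[i] = total
    return phi

def f(present):
    """Hypothetical toy model: POSITIVE score given the set of words present."""
    score = 0.5
    if "good" in present:
        score += 0.3
    if "not" in present and "good" in present:
        score -= 0.5          # negation flips the sentiment
    return score

words = ["not", "good", "film"]
phi = shapley_values(words, f)
for w in words:
    print(f"{w:5s} {phi[w]:+.3f}")

# Efficiency: contributions sum to f(all words) - f(no words).
gap = sum(phi.values()) - (f(frozenset(words)) - f(frozenset()))
print(f"efficiency gap: {gap:.1e}")
```

The interaction penalty is split evenly between "not" and "good", and "film" receives zero because it never changes the output. The number of subsets doubles with every added word, which is why the explainers below rely on approximation or structure.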

SHAP TextExplainer — Three Flavours

⚡

Partition Explainer

shap.PartitionExplainer

Best for transformer models (BERT, RoBERTa). Splits the text into token groups hierarchically. Fast and memory-efficient. Uses Owen values (a generalization of Shapley values to games with a predefined coalition structure, here the hierarchy of token groups).

✓ Handles tokenization correctly

✗ Approximate, not exact Shapley

🔄

Permutation Explainer

shap.PermutationExplainer

Approximates Shapley values by sampling random word orderings. Works for any model. The evaluation budget is set with max_evals, which makes the cost controllable. A reasonable model-agnostic choice for TF-IDF + LogReg pipelines.

✓ Model-agnostic, estimates improve with more evaluations

✗ Slower than Partition for deep models

📊

Linear Explainer

shap.LinearExplainer

Only for linear models on BoW/TF-IDF features. Uses exact analytical Shapley values, with no sampling required. Extremely fast. Assuming independent features, each attribution is the model coefficient × (feature value − mean feature value over the background data).

✓ Exact and instantaneous

✗ Only linear models
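The closed form used for linear models is short enough to write out by hand. The sketch below assumes independent features and uses made-up coefficients and TF-IDF rows; for a logistic regression the attributions are in log-odds (margin) space, not probability space.

```python
# Closed-form Shapley values for a linear model (illustrative sketch).

def linear_shap(coefs, x, background_mean):
    """phi_i = w_i * (x_i - mean_i), valid under feature independence."""
    return [w * (xi - mu) for w, xi, mu in zip(coefs, x, background_mean)]

coefs = [1.2, -0.8, 0.0]        # hypothetical model coefficients
intercept = 0.1
background = [                  # hypothetical TF-IDF rows used as the baseline
    [0.0, 0.5, 0.2],
    [0.4, 0.1, 0.0],
    [0.2, 0.0, 0.4],
]
x = [0.6, 0.0, 0.3]             # the instance to explain

def f(row):
    """Linear margin of the toy model."""
    return intercept + sum(w * v for w, v in zip(coefs, row))

mean = [sum(col) / len(background) for col in zip(*background)]
phi = linear_shap(coefs, x, mean)
expected = sum(f(r) for r in background) / len(background)

print("phi      :", [round(p, 4) for p in phi])
print("sum(phi) :", round(sum(phi), 4))
print("f(x)-E[f]:", round(f(x) - expected, 4))
```

The last two printed values agree, which is the efficiency property from the previous section holding exactly for a linear model.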

SHAP on a BERT Sentiment Model

import shap
import transformers
import torch

# ── Load a fine-tuned sentiment model ───────────────────────
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
model.eval()

# ── Wrap in a callable that returns probabilities ────────────
def predict_proba(texts):
    inputs = tokenizer(
        list(texts),              # SHAP passes a NumPy array; the tokenizer expects a list of str
        return_tensors="pt", truncation=True,
        padding=True, max_length=128
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).numpy()

# ── SHAP Partition Explainer (for transformers) ──────────────
masker = shap.maskers.Text(tokenizer)         # uses [MASK] token as baseline
explainer = shap.PartitionExplainer(predict_proba, masker)

sentences = [
    "I absolutely love this phone, it's incredibly fast and beautiful",
    "Worst experience ever. The staff was rude and the food was cold",
    "It was okay. Nothing special, but not terrible either."
]

shap_values = explainer(sentences)   # shape: (3, tokens, 2_classes)

# ── Inspect one sentence ─────────────────────────────────────
idx = 0   # first sentence
tokens = shap_values[idx].data       # the tokenized words
values = shap_values[idx].values[:, 1]  # SHAP for POSITIVE class

print(f"Sentence: {sentences[idx]}")
print(f"Baseline (avg) POSITIVE prob: {shap_values[idx].base_values[1]:.4f}")
print("\nToken Contributions to POSITIVE class:")
for token, val in sorted(zip(tokens, values), key=lambda x: -abs(x[1])):
    bar = "▓" * int(abs(val) * 60)
    sign = "+" if val > 0 else "-"
    print(f"  {token:15s}: {sign}{abs(val):.4f}  {bar}")

OUTPUT

Sentence: I absolutely love this phone, it's incredibly fast and beautiful
Baseline (avg) POSITIVE prob: 0.5012

Token Contributions to POSITIVE class:
  love           : +0.2841  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
  beautiful      : +0.1923  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓
  absolutely     : +0.1344  ▓▓▓▓▓▓▓▓
  incredibly     : +0.0912  ▓▓▓▓▓
  fast           : +0.0634  ▓▓▓▓
  phone          : +0.0122  ▓
  this           : -0.0043
  I              : -0.0012

Section 05

Attention Visualization — The Double-Edged Sword

📖 Analogy

The Eye-Tracker Illusion

Researchers once tracked where people looked when reading job adverts. Participants spent most time on the salary section. But when asked what mattered most in their decision, many said company culture.

Looking at something is not the same as that thing causing the decision. Transformer attention weights show us what the model attends to, not necessarily what causes its output. This is the most important caveat in attention-based XAI — and it is often ignored.

Despite the caveat, attention visualization remains valuable when used correctly. For tasks like Question Answering, attention between the question and passage tokens often lines up with the evidence the model uses, although this still needs to be checked. For classification tasks like sentiment, it is a useful starting point but must be validated with other methods.

Visualizing BERT Attention — Code

from transformers import AutoTokenizer, AutoModel
import torch
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model     = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

text = "The movie was surprisingly good despite the poor reviews"
inputs = tokenizer(text, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    outputs = model(**inputs)

# attentions: tuple of (1, num_heads, seq_len, seq_len) for each layer
attentions = outputs.attentions   # 12 layers × 12 heads

# ── Aggregate: mean attention across heads in the last layer ─
last_layer_attn = attentions[-1][0]          # (12_heads, seq, seq)
mean_attn = last_layer_attn.mean(dim=0).numpy()   # (seq, seq)

# ── Plot ─────────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(mean_attn, cmap="Blues", aspect="auto")
ax.set_xticks(range(len(tokens)))
ax.set_yticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=45, ha="right")
ax.set_yticklabels(tokens)
ax.set_title("BERT Layer 12 — Mean Attention Across All Heads")
plt.colorbar(im, ax=ax)
plt.tight_layout()
plt.savefig("attention_map.png", dpi=150)
print("✓ Saved attention_map.png")

⚠️

Attention ≠ Explanation — The Jain & Wallace (2019) Controversy

The paper "Attention is not Explanation" showed that attention weights often correlate poorly with gradient-based importance measures, and that permuted or adversarially constructed attention distributions can leave model outputs nearly unchanged. In response, "Attention is not not Explanation" (Wiegreffe & Pinter, 2019) argued that this does not make attention meaningless; whether it explains anything depends on the task and on what counts as an explanation. The safest practice: always validate attention explanations against a gradient-based method before reporting them to stakeholders.
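One lightweight way to run that validation is to compare how the two methods rank the tokens. The sketch below computes a Spearman rank correlation in plain Python; both score vectors are hypothetical values invented for the example, and the helper assumes no tied scores.

```python
# Rank agreement between attention and gradient attributions (sketch).

def ranks(scores):
    """Rank positions (0 = smallest). Assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    r = [0] * len(scores)
    for rank, idx in enumerate(order):
        r[idx] = rank
    return r

def spearman(a, b):
    """Spearman rank correlation for two equally long score lists."""
    n = len(a)
    ra, rb = ranks(a), ranks(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

tokens = ["the", "movie", "was", "surprisingly", "good"]
attention_scores = [0.30, 0.10, 0.25, 0.15, 0.20]   # hypothetical
gradient_scores = [0.02, 0.10, 0.01, 0.35, 0.52]    # hypothetical

rho = spearman(attention_scores, gradient_scores)
print(f"Spearman rho = {rho:+.2f}")
```

A correlation near +1 suggests the attention map and the gradient method tell the same story; a value near zero or below, as in this made-up case, is a reason not to present the attention map as the explanation.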

Section 06

Integrated Gradients — The Gold Standard for Token Attribution

Integrated Gradients (Sundararajan et al., 2017) solves a key problem with vanilla gradients: when a feature is already at its maximum importance, the gradient saturates to near-zero. IG fixes this by integrating the gradient along a straight path from a baseline (e.g., all-zero embeddings or [MASK] tokens) to the actual input.

Integrated Gradients Formula

IGᵢ(x) = (xᵢ - x'ᵢ) × ∫₀¹ ∂F(x'+α(x-x'))/∂xᵢ dα

IG for feature i is the path integral of the gradient from baseline x' to input x. Approximated with a Riemann sum over m=50–300 steps.

Completeness Axiom

Σᵢ IGᵢ(x) = F(x) - F(x')

The sum of all token attributions exactly equals the difference between the model's output on the real input and the baseline. Attributions are 100% complete.
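Both the formula and the completeness axiom can be checked on a function small enough to differentiate by hand. The sketch below applies a midpoint Riemann sum to a hypothetical three-feature logistic model (the weights and input are arbitrary) and reports how far the summed attributions are from F(x) − F(x').

```python
# Integrated Gradients on a toy logistic model (illustrative sketch).
import math

W = (1.5, -2.0, 0.5)   # hypothetical model weights

def F(x):
    """Toy model output: sigmoid of a linear score."""
    z = sum(wi * xi for wi, xi in zip(W, x))
    return 1.0 / (1.0 + math.exp(-z))

def grad_F(x):
    """Analytic gradient of F with respect to each input feature."""
    p = F(x)
    return [p * (1.0 - p) * wi for wi in W]

def integrated_gradients(x, baseline, n_steps):
    """Midpoint Riemann-sum approximation of the IG path integral."""
    total = [0.0] * len(x)
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad_F(point))]
    return [(xi - b) * t / n_steps for xi, b, t in zip(x, baseline, total)]

x = [2.0, 0.5, 1.0]
baseline = [0.0, 0.0, 0.0]

deltas = {}
for steps in (5, 50, 500):
    ig = integrated_gradients(x, baseline, steps)
    # Completeness: attributions should sum to F(x) - F(baseline).
    deltas[steps] = sum(ig) - (F(x) - F(baseline))
    print(f"n_steps={steps:3d}  |delta|={abs(deltas[steps]):.2e}")
```

The gap shrinks as the number of steps grows, which is the quantity Captum reports as the convergence delta.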

Integrated Gradients on a Fine-Tuned Classifier

from captum.attr import LayerIntegratedGradients
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer  = AutoTokenizer.from_pretrained(MODEL_NAME)
model      = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

# ── Forward wrapper: returns logits for target class ─────────
def forward_func(input_ids, attention_mask):
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    return outputs.logits[:, 1]  # class 1 = POSITIVE

# ── Prepare input ─────────────────────────────────────────────
text   = "This thriller was absolutely gripping from start to finish"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
input_ids      = inputs["input_ids"]
attention_mask = inputs["attention_mask"]
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

# ── Baseline: [PAD] tokens, keeping [CLS] and [SEP] ──────────
baseline_ids = torch.full_like(input_ids, tokenizer.pad_token_id)
baseline_ids[0, 0]  = tokenizer.cls_token_id   # keep [CLS]
baseline_ids[0, -1] = tokenizer.sep_token_id   # keep [SEP]

# ── Layer IG on the embedding layer ──────────────────────────
lig = LayerIntegratedGradients(forward_func, model.distilbert.embeddings)

attributions, delta = lig.attribute(
    inputs=input_ids,
    baselines=baseline_ids,
    additional_forward_args=(attention_mask,),  # passed through unchanged, not attributed
    n_steps=200,                   # Riemann sum steps (Captum's default is 50)
    return_convergence_delta=True  # should be near 0 if good
)

# ── Sum across embedding dimensions → one score per token ────
attr_scores = attributions.sum(dim=-1).squeeze(0).detach().numpy()
attr_scores = attr_scores / (abs(attr_scores).max())  # normalize to [-1, 1]

print(f"Convergence delta: {abs(delta.item()):.6f}  (lower = more accurate)")
print("\nNormalized Token Attributions:")
for tok, score in zip(tokens, attr_scores):
    bar_len = int(abs(score) * 30)
    direction = "▶ POS" if score > 0 else "◀ NEG"
    print(f"  {tok:15s}: {score:+.4f}  {'█'*bar_len} {direction}")

OUTPUT

Convergence delta: 0.000234  (lower = more accurate)

Normalized Token Attributions:
  [CLS]          : +0.0034   ▶ POS
  this           : +0.0512  █ ▶ POS
  thriller       : +0.4821  ██████████████ ▶ POS
  was            : +0.0091   ▶ POS
  absolutely     : +0.6234  ██████████████████ ▶ POS
  gripping       : +1.0000  ██████████████████████████████ ▶ POS
  from           : -0.0045   ◀ NEG
  start          : +0.1023  ███ ▶ POS
  to             : +0.0012   ▶ POS
  finish         : +0.2341  ███████ ▶ POS
  [SEP]          : +0.0001   ▶ POS

✅

Why Convergence Delta Matters

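The convergence delta measures how well the Riemann sum approximated the true integral: it is the gap between the summed attributions and F(x) − F(x'), expressed in the units of the model output. As rough rules of thumb, a magnitude below 0.001 is excellent. Above 0.05, increase n_steps to 300–500. Above 0.1, your explanation is unreliable and should not be reported.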

Section 07

Counterfactual Explanations — "What Would Have Changed the Decision?"

📖 Story

The Loan Rejection Letter

GDPR (in Europe) and similar regulations in many other countries give people subject to consequential automated decisions rights that are widely read as calling for a meaningful explanation: Article 22 restricts solely automated decisions, and Articles 13 to 15 require "meaningful information about the logic involved". "Your sentiment score was 0.23" is not meaningful. "If you had written 'good' instead of 'adequate' and removed the word 'however', your complaint would have been classified as positive" is a counterfactual explanation, and it is actionable: it tells you exactly what to change.

For text classifiers, counterfactuals identify the minimal word substitutions needed to flip the model's prediction. The best counterfactuals are close to the original (minimal change) and stay in the natural language manifold (fluent text).

Generating Text Counterfactuals with Polyjuice

# pip install polyjuice-nlp
from polyjuice import Polyjuice
from transformers import pipeline as hf_pipeline

# ── Polyjuice: counterfactual generator ──────────────────────
pj = Polyjuice(model_path="uw-hai/polyjuice", is_cuda=False)

# ── Sentiment classifier to test with ────────────────────────
classifier = hf_pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

original_text   = "The film was painfully slow and utterly boring"
original_label  = classifier(original_text)[0]['label']
print(f"Original: '{original_text}'")
print(f"Label   : {original_label}")

# ── Generate counterfactuals ──────────────────────────────────
counterfactuals = pj.perturb(
    original_text,
    perplex_thred=25,       # max perplexity (fluency filter)
    num_perturbations=8     # how many to generate
)

print("\nCounterfactuals that flip to POSITIVE:")
flipped = 0
for cf in counterfactuals:
    result = classifier(cf)[0]          # classify once, reuse label and score
    if result['label'] != original_label:
        flipped += 1
        print(f"  [{flipped}] '{cf}'")
        print(f"      → {result['label']} ({result['score']:.2%} confidence)")

OUTPUT

Original: 'The film was painfully slow and utterly boring'
Label   : NEGATIVE

Counterfactuals that flip to POSITIVE:
  [1] 'The film was beautifully slow and utterly mesmerizing'
      → POSITIVE (91.34% confidence)
  [2] 'The film was painfully slow but deeply rewarding'
      → POSITIVE (78.12% confidence)
  [3] 'The film was surprisingly slow and utterly captivating'
      → POSITIVE (85.67% confidence)

ℹ️

The Three Properties of a Good Text Counterfactual

Proximity: as few word changes as possible from the original. Fluency: the result should read like natural language (measured by perplexity under a language model). Diversity: provide multiple distinct counterfactuals — each highlights a different way the prediction could be flipped. Together, they give a richer picture of the model's decision boundary.
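Proximity and diversity can both be scored with a word-level edit distance. The sketch below applies one to the three counterfactuals from the output above; fluency is left out because it needs a language model, and the helper name `word_edit_distance` is an illustrative choice, not part of Polyjuice.

```python
# Scoring counterfactuals for proximity and diversity (illustrative sketch).
from itertools import combinations

def word_edit_distance(a, b):
    """Levenshtein distance computed over whitespace-separated words."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete a word
                           cur[j - 1] + 1,             # insert a word
                           prev[j - 1] + (wa != wb)))  # substitute a word
        prev = cur
    return prev[-1]

original = "The film was painfully slow and utterly boring"
candidates = [
    "The film was beautifully slow and utterly mesmerizing",
    "The film was painfully slow but deeply rewarding",
    "The film was surprisingly slow and utterly captivating",
]

# Proximity: fewer edited words is better.
proximity = {cf: word_edit_distance(original, cf) for cf in candidates}
for cf, dist in proximity.items():
    print(f"{dist} word edits: {cf}")

# Diversity: average pairwise distance between the counterfactuals themselves.
pairs = list(combinations(candidates, 2))
diversity = sum(word_edit_distance(a, b) for a, b in pairs) / len(pairs)
print(f"mean pairwise distance: {diversity:.2f}")
```

Ranking by proximity first and then checking diversity gives a short list that stays close to the original while still covering different ways to flip the prediction.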

Section 08

Probing Classifiers — What Does Each Layer Know?

Probing is a global XAI technique for transformer models. The idea: if BERT encodes linguistic information in its layers, we should be able to predict linguistic labels (part-of-speech, named entity type, dependency head) directly from each layer's hidden states using a simple linear probe. High probe accuracy means that property is linearly decodable from that layer's representations. To rule out the probe simply memorizing its training labels, compare against a control, for example the same probe trained on shuffled labels.

🔬 Probing Protocol — Step by Step

Step 1

Choose a linguistic property to probe: POS tags, named entities, semantic roles, negation scope, coreference, etc.

Step 2

Collect a labelled probing dataset for that property (e.g., Penn Treebank for POS tags).

Step 3

Run all sentences through the frozen BERT/RoBERTa model and extract hidden states from every layer. Keep the model frozen — no fine-tuning.

Step 4

Train a separate simple linear classifier (or MLP) on top of each layer's hidden states → predict the probing label.

Step 5

Plot probe accuracy by layer. High accuracy in layer L = that layer encodes that property. Pattern reveals the model's internal representations.

from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import torch, numpy as np

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model     = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

# ── Toy probing dataset: positive/negative sentiment tokens ──
# In practice, use CoNLL-03 for NER, PTB for POS, etc.
probe_sentences = [
    "excellent", "awful", "brilliant", "terrible",
    "amazing",   "horrible", "fantastic", "dreadful",
]
probe_labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1=positive adjective

# ── Extract hidden states from all 12 layers ──────────────── 
layer_hiddens = [[] for _ in range(13)]   # 0=embed, 1-12=transformer

with torch.no_grad():
    for sentence in probe_sentences:
        inputs  = tokenizer(sentence, return_tensors="pt")
        outputs = model(**inputs)
        # hidden_states: tuple of (1, seq_len, 768) for each layer
        for layer_idx, hs in enumerate(outputs.hidden_states):
            layer_hiddens[layer_idx].append(hs[0, 1, :].numpy())  # token at pos 1

# ── Probe each layer ──────────────────────────────────────── 
print("Layer  Probe Accuracy (CV-3)")
print("─" * 35)
for layer_idx in range(13):
    X = np.array(layer_hiddens[layer_idx])
    y = np.array(probe_labels)
    probe = LogisticRegression(max_iter=500)
    acc = cross_val_score(probe, X, y, cv=3, scoring="accuracy").mean()
    bar = "█" * int(acc * 30)
    label = "← embed" if layer_idx == 0 else ""
    print(f"  L{layer_idx:02d}   {acc:.2%}  {bar}  {label}")

OUTPUT

Layer  Probe Accuracy (CV-3)
───────────────────────────────────
  L00   62.50%  ██████████████████  ← embed
  L01   75.00%  ██████████████████████
  L02   87.50%  ██████████████████████████
  L03   87.50%  ██████████████████████████
  L04   87.50%  ██████████████████████████
  L05   100.00%  ██████████████████████████████
  L06   100.00%  ██████████████████████████████
  L07   87.50%  ██████████████████████████
  L08   87.50%  ██████████████████████████
  L09   75.00%  ██████████████████████
  L10   75.00%  ██████████████████████
  L11   62.50%  ██████████████████
  L12   62.50%  ██████████████████

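This pattern (low in early layers, peaking in the middle, declining in late layers) resembles findings reported in the probing literature, although eight single-word examples are far too few to draw conclusions from. Roughly, early layers capture surface and syntactic features, middle layers capture more semantic ones, and the last layers specialize toward the training objective: masked-language modelling for this pretrained checkpoint, or the downstream task after fine-tuning.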

Section 09

Animated Diagram — How LIME Works Internally

The diagram below animates the core LIME loop: perturbing the sentence, querying the model, weighting the samples, and fitting the linear surrogate.

⚙️ LIME PROCESS: THE FOUR STAGES OF THE ANIMATION

STEP 1: ORIGINAL SENTENCE & MODEL PREDICTION

We start with the sentence we want to explain. The black-box model predicts NEGATIVE with 87% confidence.

The | product | broke | after | one | day | — | absolutely | terrible

Model output → NEGATIVE 87.3%

STEP 2: CREATE PERTURBED SAMPLES (N=500)

Randomly mask out words. Each masked version is a new input to the model. In the animation, masked words are shown in red and kept words in gold; three perturbed copies of the sentence are displayed, followed by "... 497 more perturbed samples".

STEP 3: QUERY THE BLACK-BOX MODEL ON EACH PERTURBATION

Each perturbed sentence is classified. We record the probability of NEGATIVE class. Words wrapped in ~~ ~~ are masked out.

"The ~~product~~ broke after ~~one~~ day…"	NEGATIVE 71.2%	sim=0.82
"The product ~~broke~~ after one day — abs. ~~terrible~~"	NEGATIVE 18.4%	sim=0.71
"~~The~~ product ~~broke~~ after one ~~day~~…"	NEGATIVE 61.0%	sim=0.68
"The product broke after one day — abs. ~~terrible~~"	NEGATIVE 45.1%	sim=0.89

STEP 4: FIT WEIGHTED LASSO LINEAR MODEL

A weighted LASSO regression maps word presence → NEGATIVE probability. Coefficients = the explanation.

terrible	+0.382
broke	+0.291
absolutely	+0.182
one day	+0.120
product	+0.034
after	-0.009

Section 10

Explaining Specific NLP Tasks — Domain-By-Domain Guidance

💬

Sentiment Analysis

opinion-mining · review-classification

Best XAI: SHAP PartitionExplainer for BERT-based models. LIME for TF-IDF baselines. Show per-token polarity scores. Critical for customer service, brand monitoring.

📋

Named Entity Recognition

sequence-labelling · token-classification

Best XAI: Integrated Gradients per token. Show which context tokens contributed to labelling "Apple" as ORG vs PER. Counterfactuals: swap surrounding context to flip entity type.

🔍

Question Answering

extractive · span-prediction

Best XAI: Attention heatmaps between question and context tokens (in BERT-style extractive models this is self-attention over the concatenated pair). IG on span start/end logits. Helps debug wrong-span and out-of-context answer extraction.

🌐

Machine Translation

seq2seq · encoder-decoder

Best XAI: Encoder-decoder attention alignment maps show which source tokens were attended to when generating each target token. SHAP on encoder states measures source word influence.

⚖️

Toxic Content Detection

safety · moderation

Best XAI: LIME + Counterfactuals. Moderation audits and appeals typically need to show which tokens triggered the decision, and counterfactuals show what change would have avoided the flag.

📄

Text Summarization

abstractive · extractive

Best XAI: Saliency maps on encoder show which source sentences contributed most to the summary. BERTScore with attribution reveals faithfulness issues (hallucinated content).

Section 11

Comparing All XAI Methods — Side-by-Side

Property	LIME	SHAP	Attention	Integrated Gradients	Counterfactuals
Model-agnostic	Yes	Yes	No (transformer only)	No (gradient-based)	Yes
Attribution faithfulness	Approximate (local)	Axiomatically grounded (estimated by sampling in practice)	Debated in literature	Complete up to integration error	No attribution (contrastive)
Computational cost	Medium (N×model calls)	High (many permutations)	Very low (one pass)	Medium (m steps)	High (search problem)
Human interpretability	High (word bars)	High (waterfall plots)	High (heatmap)	Medium (token scores)	Very high (natural language)
Handles negation well	Partially	Yes (interaction values)	Variable by layer	Yes (gradient captures it)	Yes (flips on negation)
Suitable for compliance	Partial (approximate)	Strong (theoretically grounded)	Weak (contested faithfulness)	Strong (axiomatically grounded)	Very strong (actionable)
Best Python library	lime	shap	bertviz	captum	polyjuice

Section 12

XAI for Detecting Model Failure Modes

The most powerful use of NLP explainability is not compliance or audit — it is debugging models before deployment. When you explain many predictions systematically, patterns of model failure become visible.

🎭

Spurious Correlations

shortcut learning

A hate-speech model trained on social media data consistently assigns high weight to minority group names — not because mentioning them is hateful, but because the training data was biased. SHAP global plots expose this instantly.

🔀

Negation Blindness

semantic failure

Counterfactual analysis reveals that adding "not" before a positive adjective barely changes the model's confidence. IG shows the model ignores the negation token — a classic failure mode in models trained on short product reviews.

📍

Domain Shift

distribution shift

A model trained on movie reviews is applied to hotel reviews. LIME explanations show it putting high weight on the rare movie-domain words that happen to appear (such as "plot" or "acting") and little weight on hotel-specific vocabulary it never saw in training. The distribution shift is visible in the explanations.

Systematic Explanation Audit — Finding Shortcuts

import shap, numpy as np
from collections import defaultdict
from transformers import pipeline as hf_pipeline

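classifier = hf_pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    top_k=None   # scores for every label (replaces the deprecated return_all_scores=True)
)

test_corpus = [
    "The customer service was excellent",
    "Terrible customer service from start to finish",
    "The female agent was excellent, the male manager was terrible",
    "Not bad at all — actually quite good",
    "I cannot recommend this product enough",   # idiomatic praise despite "cannot" → POSITIVE
    "I could not have asked for a worse experience"
]

LABELS = ["NEGATIVE", "POSITIVE"]   # fixed column order for SHAP

def predict_proba(texts):
    # SHAP passes a NumPy array of strings; the pipeline expects a list of str
    results = classifier([str(t) for t in texts])
    rows = []
    for res in results:
        # top_k=None returns labels sorted by score, so look scores up by name
        scores = {r['label']: r['score'] for r in res}
        rows.append([scores[label] for label in LABELS])
    return np.array(rows)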

masker   = shap.maskers.Text(r"\W+")
explainer = shap.PartitionExplainer(predict_proba, masker,
                                     output_names=["NEGATIVE", "POSITIVE"])
sv = explainer(test_corpus)

# ── Aggregate: most influential words across the corpus ──────
word_total_impact = defaultdict(float)

for i in range(len(test_corpus)):
    tokens = sv[i].data
    values = sv[i].values[:, 1]   # POSITIVE class
    for tok, val in zip(tokens, values):
        word_total_impact[tok.lower()] += abs(val)

print("Top globally influential words across all predictions:")
for word, impact in sorted(word_total_impact.items(), key=lambda x: -x[1])[:12]:
    bar = "█" * int(impact * 20)
    print(f"  {word:20s}: {impact:.3f}  {bar}")

OUTPUT

Top globally influential words across all predictions:
  terrible            : 2.341  ██████████████████████████████████████████████
  excellent           : 2.198  ███████████████████████████████████████████
  cannot              : 1.623  ████████████████████████████████
  not                 : 1.441  ████████████████████████████
  good                : 0.981  ███████████████████
  service             : 0.423  ████████
  worse               : 0.398  ███████
  female              : 0.287  █████   ← ⚠️ gender word has unexpected influence
  male                : 0.264  █████   ← ⚠️ should be near zero
  customer            : 0.189  ███
  manager             : 0.144  ██
  agent               : 0.112  ██

⚠️

The Audit Found a Bias Signal

Notice "female" and "male" appearing with non-trivial SHAP values. In a well-behaved model, gender words should have near-zero impact on sentiment predictions. Their presence signals potential gender bias learned from training data, though a single sentence in which each word sits next to a strong sentiment word is not proof: confirm it with paired inputs that differ only in the gender term. This is exactly the kind of finding that an XAI audit is designed to surface, and it is invisible to accuracy metrics alone.
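A bias signal like this is usually followed up with a paired test: swap only the gender term and measure how much the prediction moves. The sketch below shows the mechanics with a hypothetical scorer (`toy_positive_prob`) that has a deliberate bias built in; in a real audit that function would wrap the actual classifier, and the swap table would be much larger.

```python
# Paired gender-swap test (illustrative sketch with a deliberately biased toy model).
import math

SWAPS = {"female": "male", "male": "female"}   # minimal, hypothetical swap table

def swap_gender_terms(text):
    """Replace each gender term with its counterpart, leaving other words alone."""
    return " ".join(SWAPS.get(w, w) for w in text.split())

# Hypothetical scorer: sentiment words plus a small spurious weight on "female".
WEIGHTS = {"excellent": 2.0, "terrible": -2.0, "female": 0.4}

def toy_positive_prob(text):
    z = sum(WEIGHTS.get(w, 0.0) for w in text.split())
    return 1.0 / (1.0 + math.exp(-z))

def max_swap_gap(texts, predict):
    """Largest absolute change in prediction caused by the swap alone."""
    return max(abs(predict(t) - predict(swap_gender_terms(t))) for t in texts)

templates = [
    "the female agent was excellent",
    "the male manager was terrible",
]
gap = max_swap_gap(templates, toy_positive_prob)
print(f"largest prediction change from swapping gender terms: {gap:.3f}")
```

For an unbiased model the gap is zero, because nothing else in the sentence changed. Any consistent non-zero gap across many templates is stronger evidence than one SHAP bar.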

Section 13

Animated Token Heatmap — Live SHAP Visualization

The interactive widget below simulates a SHAP token heatmap for a sentiment classifier. Select a sentence to see how individual tokens push the prediction toward positive or negative.

[Interactive widget: SHAP token heatmap with a sentence selector. Token color intensity reflects SHAP magnitude. Green = pushes positive, red = pushes negative, with a separate legend entry for neutral tokens. A bar chart shows the top 7 most influential tokens.]

Section 14

Full End-to-End XAI Pipeline — Production Pattern

In production, XAI for NLP is not a one-off analysis — it is a continuous monitoring pipeline. The pattern below is battle-tested across financial services, healthcare NLP, and content moderation systems.

Model Registry & Baseline

Register the model with its training data hash, validation metrics, and a SHAP baseline computed on a representative holdout set. The baseline SHAP distribution defines "normal" explanation behavior.

Per-Prediction Explanation Cache

For every high-stakes prediction (confidence < 0.95 or decision value > threshold), compute and store SHAP values. Cache to Redis/DynamoDB with the prediction ID, input text hash, and top-5 influential tokens.

Explanation Drift Monitor

A weekly batch job computes global SHAP importance for the past week's predictions and compares it against the baseline distribution using PSI (Population Stability Index) or Jensen-Shannon divergence. Alert if any token's importance drifts significantly: this can signal data shift before accuracy degrades.

Human Review Queue Integration

For predictions flagged for human review, the UI automatically surfaces the LIME/SHAP explanation alongside the raw text. Reviewers see which words drove the decision — this halves average review time and improves override quality.

Counterfactual Feedback Loop

When a human reviewer overrides a prediction, automatically generate counterfactuals showing the minimal change that would have produced the correct prediction. Feed these into the retraining dataset as hard negatives — targeted training data that directly fixes model weaknesses.
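The Explanation Drift Monitor step can be sketched with Jensen-Shannon divergence over normalized token importances. This is a minimal pure-Python illustration under stated assumptions: `baseline` and `current` are dictionaries mapping tokens to summed absolute SHAP values, the function names are hypothetical, and the `alert_threshold` of 0.1 is a placeholder to be tuned on real data rather than a recommended value.

```python
import math

def to_distribution(importance, vocab):
    """Normalize token importances into a probability vector over vocab."""
    weights = [max(importance.get(word, 0.0), 0.0) for word in vocab]
    total = sum(weights)
    if total == 0:
        raise ValueError("importance mass is zero; nothing to compare")
    return [w / total for w in weights]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 = identical, 1 = disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 whenever a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def explanation_drift(baseline, current, alert_threshold=0.1):
    """Return (divergence, alert flag) for two token-importance dictionaries."""
    vocab = sorted(set(baseline) | set(current))
    p = to_distribution(baseline, vocab)
    q = to_distribution(current, vocab)
    score = js_divergence(p, q)
    return score, score > alert_threshold

if __name__ == "__main__":
    # Illustrative numbers only.
    baseline = {"terrible": 2.3, "excellent": 2.2, "not": 1.4}
    current = {"terrible": 1.0, "excellent": 0.9, "refund": 2.5}
    print(explanation_drift(baseline, current))
```

Taking the union of both vocabularies matters here: a token that only appears in the current week (such as a new product name) is exactly the kind of shift the monitor should register.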

# ── Production XAI wrapper — minimalist pattern ─────────────
import hashlib
from datetime import datetime, timezone

import shap

class ExplainablePredictor:
    def __init__(self, model_fn, tokenizer=None, threshold=0.90):
        # model_fn maps a list of strings to an (n, 2) array of [NEG, POS] probabilities
        self.model_fn  = model_fn
        self.threshold = threshold
        # Use the supplied tokenizer if given; otherwise split on non-word characters
        masker = shap.maskers.Text(tokenizer if tokenizer is not None else r"\W+")
        self.explainer = shap.PartitionExplainer(model_fn, masker,
                                                  output_names=["NEG","POS"])

    def predict(self, text: str) -> dict:
        result    = self.model_fn([text])[0]
        label     = "POS" if result[1] > 0.5 else "NEG"
        conf      = float(max(result))
        needs_xai = conf < self.threshold  # explain only uncertain predictions

        payload = {
            # MD5 serves as a cache key here, not as a security measure
            "text_hash"  : hashlib.md5(text.encode("utf-8")).hexdigest(),
            "label"      : label,
            "confidence" : round(conf, 4),
            "timestamp"  : datetime.now(timezone.utc).isoformat(),
            "explanation": None
        }

        if needs_xai:
            sv = self.explainer([text])
            tokens = sv[0].data
            vals   = sv[0].values[:, 1]   # POS class
            # Keep the five tokens with the largest absolute SHAP value
            top5   = sorted(zip(tokens, vals),
                            key=lambda x: -abs(x[1]))[:5]
            payload["explanation"] = [
                {"token": t.strip(), "shap": round(float(v), 4)} for t, v in top5
            ]

        return payload

EXAMPLE PAYLOAD (low-confidence prediction → XAI triggered)

{
  "text_hash"  : "a3f9c2d1...",
  "label"      : "POS",
  "confidence" : 0.6834,
  "timestamp"  : "2025-04-12T09:23:11",
  "explanation": [
    {"token": "bad",    "shap":  0.3102},
    {"token": "great",  "shap":  0.2890},
    {"token": "not",    "shap": -0.2341},
    {"token": "quite",  "shap":  0.1023},
    {"token": "either", "shap": -0.0912}
  ]
}
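The Counterfactual Feedback Loop step in this section asks for the minimal change that flips a prediction. One simple way to approximate that is a deletion search: try removing one token, then two, and stop at the first edit that yields the target label. The sketch below is an assumption-laden illustration, not the method used in any particular library; `minimal_deletion_counterfactual` and the rule-based `toy_label` are hypothetical, and `predict_label` stands in for a call to the real classifier.

```python
from itertools import combinations

def minimal_deletion_counterfactual(predict_label, tokens, target_label,
                                    max_deletions=2):
    """Find the smallest set of token deletions that yields target_label.

    Returns (edited_tokens, removed_tokens), or None if no edit of at most
    max_deletions tokens flips the prediction.
    """
    for size in range(1, max_deletions + 1):          # smallest edits first
        for drop in combinations(range(len(tokens)), size):
            candidate = [t for i, t in enumerate(tokens) if i not in drop]
            if predict_label(candidate) == target_label:
                return candidate, [tokens[i] for i in drop]
    return None

def toy_label(tokens):
    """Toy stand-in model (an assumption): 'good' is positive unless negated."""
    return "POS" if "good" in tokens and "not" not in tokens else "NEG"

if __name__ == "__main__":
    tokens = ["service", "was", "not", "good"]
    print(minimal_deletion_counterfactual(toy_label, tokens, "POS"))
```

The search cost grows combinatorially with `max_deletions`, so it only suits short texts and small edit budgets. Deletion-only counterfactuals can also be ungrammatical, which is worth checking before adding them to a retraining set.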

Section 15

Metrics for Evaluating XAI Quality

Metric	What It Measures	How to Compute	Target
Faithfulness (AOPC)	Does removing the top-k tokens actually change the output?	Area Over the Perturbation Curve: mask the top-k tokens one by one and measure the output drop	Higher = more faithful
Comprehensiveness	Do the explained tokens contain all the important information?	Remove the top-k tokens and measure the drop P(y|full text) − P(y|text without top-k). A large drop means little evidence remains outside the explanation.	Higher = more comprehensive
Sufficiency	Are the top-k tokens alone sufficient for a correct prediction?	Compare P(y|top-k tokens) to P(y|full text)	KL divergence < 0.1
Stability (Lipschitz)	Do similar inputs produce similar explanations?	For semantically similar sentence pairs, measure L2 distance between SHAP vectors	Low variance preferred
Human Agreement	Do humans agree the top-k tokens are the "right" words?	Annotation study: Fleiss' kappa between model explanation and human highlights	κ > 0.60 is good agreement
Convergence Delta (IG)	How accurate is the Riemann approximation?	|Σᵢ IGᵢ − (F(x) − F(x′))|, the absolute difference	< 0.01 excellent, < 0.05 acceptable
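The faithfulness row can be made concrete with a short sketch. AOPC has more than one published normalization; the version below assumes the simple convention of averaging the probability drop over k = 1..K cumulative maskings. The function name `aopc`, the `[MASK]` placeholder, and the linear `toy_predict_pos` model are all hypothetical and chosen so the arithmetic is easy to follow.

```python
def aopc(predict_prob, tokens, attributions, max_k=5, mask_token="[MASK]"):
    """Mean drop in predicted probability as the top-1..top-K tokens are masked.

    Tokens are masked cumulatively in descending order of attribution toward
    the explained class, so a faithful explanation produces large drops early.
    """
    order = sorted(range(len(tokens)), key=lambda i: -attributions[i])
    base = predict_prob(tokens)
    masked = list(tokens)
    drops = []
    for index in order[:max_k]:
        masked[index] = mask_token                 # keep earlier masks in place
        drops.append(base - predict_prob(masked))
    return sum(drops) / len(drops) if drops else 0.0

def toy_predict_pos(tokens):
    """Toy stand-in model (an assumption): linear in 'great' and 'bad' counts."""
    return 0.5 + 0.2 * tokens.count("great") - 0.2 * tokens.count("bad")

if __name__ == "__main__":
    tokens = ["great", "service", "great"]
    good_attr = [0.2, 0.0, 0.2]     # points at the words the toy model uses
    poor_attr = [0.0, 0.2, 0.0]     # points at an irrelevant word
    print(aopc(toy_predict_pos, tokens, good_attr, max_k=1))
    print(aopc(toy_predict_pos, tokens, poor_attr, max_k=1))
```

Comparing the score against a random token ordering on the same inputs gives a useful reference point, since the absolute value of AOPC depends on the model and the masking strategy.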

Section 16

Golden Rules for XAI in NLP

🧠 XAI for NLP — Non-Negotiable Rules

Always use at least two explanation methods. No single method is universally faithful. If LIME and Integrated Gradients agree on the top tokens, you can be confident. If they disagree, investigate — the disagreement itself is informative.

Never present attention weights as the explanation in a compliance or audit context. Attention is a visualization tool, not a ground-truth attribution. Cite Jain & Wallace (2019) and always validate against IG or SHAP.

Always check the convergence delta when using Integrated Gradients. A delta above 0.05 means your attribution is unreliable. Increase n_steps to 300–500 or switch baselines.

Explain your bad predictions first. Most XAI workflows focus on correct predictions. The most valuable explanations come from wrong predictions — specifically high-confidence wrong ones. These reveal systematic model failures, not edge cases.

Run a global XAI audit before every model deployment. Compute SHAP global importance on a held-out test set and manually inspect the top 20 most influential tokens. If any are demographic attributes, punctuation, or formatting artifacts, the model has learned spurious correlations — retrain before deploying.

Match the explanation method to the audience. SHAP waterfall plots for data scientists. LIME word bars for product managers. Counterfactuals for legal teams and end-users. The same model may need three different explanation interfaces for three different stakeholders.

Monitor explanation drift in production. If the top influential tokens change significantly week-over-week (measured by PSI or JS divergence), your input distribution has shifted — even if accuracy has not yet dropped. XAI-based drift detection is typically faster than accuracy-based drift detection by 1–2 weeks.
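The first rule, comparing two explanation methods, can be turned into a number by measuring how much their top-k token sets overlap. The sketch below is one possible agreement score (Jaccard overlap of top-k positions), assuming both methods attribute the same tokenization of the same input; `top_k_positions` and `top_k_agreement` are hypothetical names, and the attribution values are illustrative only.

```python
def top_k_positions(values, k=5):
    """Positions of the k attributions with the largest absolute value."""
    ranked = sorted(range(len(values)), key=lambda i: -abs(values[i]))
    return set(ranked[:k])

def top_k_agreement(values_a, values_b, k=5):
    """Jaccard overlap of top-k positions: 1.0 = same set, 0.0 = disjoint."""
    if len(values_a) != len(values_b):
        raise ValueError("both methods must attribute the same tokens")
    a = top_k_positions(values_a, k)
    b = top_k_positions(values_b, k)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

if __name__ == "__main__":
    # Illustrative attributions for the tokens ["not", "bad", "at", "all"].
    lime_values = [-0.5, 0.4, 0.0, 0.1]
    ig_values = [-0.3, 0.6, 0.2, 0.0]
    print(top_k_agreement(lime_values, ig_values, k=2))
    print(top_k_agreement(lime_values, ig_values, k=3))
```

Positions are compared rather than token strings so that a repeated word counts separately at each occurrence. Methods that use different tokenizations would first need their attributions aligned to a common unit, such as whole words.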