XAI for Time Series Forecasting

Section 01

The Story That Explains Explainable AI for Time Series

📖 Real World Analogy

The Black-Box Weather Forecaster

Imagine your city hired two weather forecasters. The first hands you a printed forecast every morning — "70% chance of rain at 3 pm" — with zero explanation. The second does the same, but adds: "Temperature has been dropping since yesterday, humidity crossed 80% this morning, and the wind shifted northwest — that pattern preceded rain in 9 of the last 12 similar Tuesdays."

Both forecasts are identical. But which forecaster would you trust enough to act on? Which would you challenge if they turned out to be wrong two days in a row?

That gap — between a number and a reason — is the entire problem that Explainable AI (XAI) for Time Series is designed to close.

Time series forecasting models — LSTMs, Prophet, ARIMA, Gradient Boosting regressors — can predict the future with remarkable accuracy. But when they fail (and they do fail), the fallout is asymmetric: a hospital whose patient-demand forecast collapsed, an energy grid whose load model missed a heat wave, a supply chain that over-ordered by 40%. In every case, someone needs to know why the model went wrong — not just that it did.

🔎

What XAI for Time Series Actually Answers

Given a forecasting model's prediction at time t, XAI tells you: which past observations drove that forecast, which features (lag values, seasonal components, external regressors) mattered most, and whether the model's reasoning is consistent with domain knowledge — or whether it has learned a spurious pattern that will break down.

Section 02

Why Time Series XAI is Harder Than Tabular XAI

Most XAI tutorials focus on tabular classifiers — predict loan default, explain with SHAP, done. Time series is fundamentally different. Here is why.

🕑

Temporal Dependencies

Order Matters

In tabular data, rows are usually treated as independent: row 500 tells you nothing about row 499. In time series, the observations at t−1, t−7, and t−365 may all drive the forecast at t. XAI must capture which lags matter, not just which features.

📈

Non-Stationarity

Shifting Distributions

The model trained on 2019–2021 may not behave the same in 2023. Explanations generated today may be invalid tomorrow. XAI for time series must account for concept drift and distribution shifts.

🎙️

Seasonality & Cycles

Structured Signal

A spike every December isn't a prediction — it's a calendar effect. XAI must separate the model's learned structure (trend, seasonality) from its surprise response to unusual inputs.

Challenge	Tabular XAI	Time Series XAI
Feature space	Fixed columns, independent rows	Lag features, rolling stats, date embeddings
Independence assumption	Reasonable to assume	Violated by design — rows are ordered
Counterfactuals	Swap feature value, re-predict	Changing t−1 cascades through all future steps
Global vs local explanation	Consistent across dataset	May differ radically across time periods
Perturbation validity	Perturb one row	Must perturb a coherent time window
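The counterfactual row in the table can be made concrete with a small sketch. It assumes a hypothetical recursive forecaster (a fixed AR(2) rule with made-up coefficients, not a model from this article) and shows that nudging only the most recent observation shifts every later step of the forecast path.

```python
import numpy as np

def recursive_forecast(history, steps, coefs=(0.6, 0.3)):
    """Roll a toy AR(2) rule forward; each forecast is fed back in as an input.

    The coefficients are hypothetical and chosen only for illustration.
    """
    window = list(history[-2:])          # [y_{t-2}, y_{t-1}]
    path = []
    for _ in range(steps):
        nxt = coefs[0] * window[-1] + coefs[1] * window[-2]
        path.append(nxt)
        window = [window[-1], nxt]       # the forecast becomes the newest lag
    return np.array(path)

history = np.array([100.0, 102.0, 101.0, 103.0])
baseline = recursive_forecast(history, steps=5)

# Counterfactual: raise only the most recent observation (t-1) by 5 units.
perturbed_history = history.copy()
perturbed_history[-1] += 5.0
counterfactual = recursive_forecast(perturbed_history, steps=5)

delta = counterfactual - baseline
for h, d in enumerate(delta, start=1):
    print(f"h={h}: forecast shifts by {d:+.3f}")
```

Every horizon moves, which is why a "swap one value and re-predict" counterfactual from tabular XAI does not transfer directly to recursive forecasts.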

Section 03

The XAI Toolbox — Five Methods for Time Series

There is no single best XAI method for forecasting models. Each tool answers a slightly different question. Understanding them as a toolkit — rather than alternatives — is the key to using them well.

SHAP (SHapley Additive exPlanations)

Assigns each input feature a contribution score for a single prediction. For time series, lag values and rolling statistics are the features. Answers: "Which past observations pushed this forecast up or down, and by how much?" KernelSHAP works on any model; TreeSHAP is a fast, exact variant for tree ensembles only.

LIME (Local Interpretable Model-Agnostic Explanations)

Fits a simple linear model around a single prediction by perturbing the neighborhood. Answers: "What linear rule best approximates this model's behavior near this specific timestep?" Fast but approximate — treats lags as independent features.

Attention Mechanisms (Neural Models)

For attention-based models such as Transformers, the Temporal Fusion Transformer (TFT), or LSTMs with an added attention layer, the built-in attention weights show which time steps the model "looked at" when making a forecast. Answers: "Which historical window was most relevant to this prediction?" Available without post-hoc processing, but only for architectures that contain attention; a plain LSTM has no attention weights to read.

Permutation Feature Importance

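Shuffles one lag or feature column at a time and measures how much the forecast error increases. Answers: "Which features does the model rely on most across the whole dataset?" Global and model-agnostic, but it needs repeated passes over the validation data, and shuffling one lag while its correlated neighbours stay intact can understate that lag's importance.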

Integrated Gradients (Deep Learning)

For differentiable models (LSTM, TCN, N-BEATS), attributes the output to each input time step by integrating the model's gradients along a straight-line path from a baseline input (for example, an all-zero window) to the actual input. Answers: "Which exact historical values drove the neural network's prediction at this timestamp?" Precise, but it requires gradient access to the model and a sensible choice of baseline.
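To make the last method in the toolbox less abstract, here is a minimal NumPy sketch of Integrated Gradients. It assumes a hypothetical one-unit tanh "forecaster" with made-up weights and an all-zero baseline window; a real model would supply its gradients through an autodiff framework instead of the analytic `grad_f` used here.

```python
import numpy as np

# Hypothetical differentiable "forecaster": a single tanh unit over 5 lagged inputs.
w = np.array([0.05, -0.02, 0.10, 0.30, 0.50])   # made-up weights, oldest lag first
b = -0.1

def f(x):
    return np.tanh(x @ w + b)

def grad_f(x):
    # Analytic gradient of tanh(w.x + b) with respect to x.
    return (1.0 - np.tanh(x @ w + b) ** 2) * w

def integrated_gradients(x, baseline, steps=200):
    # Midpoint Riemann sum of the gradient along the straight line baseline -> x.
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([0.2, 0.4, 0.1, 0.8, 1.0])   # the input window to explain
baseline = np.zeros_like(x)                # all-zero reference window (an assumption)

attributions = integrated_gradients(x, baseline)
print("attribution per time step:", np.round(attributions, 4))
print("sum of attributions:      ", round(float(attributions.sum()), 4))
print("f(x) - f(baseline):       ", round(float(f(x) - f(baseline)), 4))
```

The last two printed numbers should agree closely. That "completeness" property is what lets the attributions be read as a budget split across time steps.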

Section 04

SHAP for Time Series — Deep Dive

📖 Story

The Power-Plant Blame Game

A power plant uses an LSTM to forecast electricity demand 24 hours ahead. One forecast wildly overestimates demand by 18%. The plant burns extra fuel in standby — costing £42,000. The manager demands to know why.

Using SHAP on the lag features (demand at t−1, t−2, t−24, t−168), the engineer discovers the model gave enormous weight to last week's same-hour value (t−168) — which was inflated due to a sporting event. The model saw: "last Tuesday at 6 pm was high → therefore this Tuesday at 6 pm will be high." There was no sporting event this Tuesday. SHAP exposed the bug. The lag-168 feature was subsequently down-weighted.

How SHAP Works on a Forecasting Model

SHAP is rooted in cooperative game theory. Given a model prediction f(x), it assigns each feature i a Shapley value, the unique attribution that satisfies four fairness axioms: efficiency (attributions sum to the prediction minus the base value), symmetry (features that contribute identically get equal credit), dummy (features that never change the output get zero credit), and additivity (attributions for a sum of models are the sum of the per-model attributions).

SHAP Value Definition

φᵢ = Σ [|S|!(|F|−|S|−1)!/|F|!] × [f(S∪{i}) − f(S)]

Sum over all subsets S of features excluding feature i. Measures marginal contribution when i is added to each subset.

Prediction Decomposition

f(x) = φ₀ + φ₁ + φ₂ + … + φₙ

The forecast equals the base value (mean prediction) plus all SHAP values summed. Every prediction is fully accounted for.
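The two formulas above can be checked by brute force on a toy problem. The sketch below enumerates every subset S for three lag features. The model, the feature values, and the use of a fixed reference value for "absent" features are all assumptions made for illustration; SHAP libraries use faster algorithms and different ways of handling absent features.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating every subset (only feasible for a few features)."""
    n = len(features)
    phi = {}
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                s = frozenset(subset)
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi[i] = total
    return phi

# Hypothetical instance to explain and reference values for "absent" features.
x         = {"lag_1": 120.0, "lag_7": 118.0, "lag_365": 125.0}
reference = {"lag_1": 110.0, "lag_7": 110.0, "lag_365": 110.0}

def model(v):
    # Made-up rule with an interaction term so the attribution is not trivial.
    return (0.5 * v["lag_1"] + 0.3 * v["lag_7"] + 0.2 * v["lag_365"]
            + 0.01 * (v["lag_1"] - 110.0) * (v["lag_7"] - 110.0))

def value_fn(subset):
    mixed = {k: (x[k] if k in subset else reference[k]) for k in x}
    return model(mixed)

phi = shapley_values(value_fn, list(x))
base = value_fn(frozenset())
print("base value:", base)
for name, contribution in phi.items():
    print(f"{name:8s} {contribution:+.3f}")
print("base + sum(phi):", base + sum(phi.values()), "| f(x):", model(x))
```

The interaction term is split evenly between lag_1 and lag_7, and the base value plus all contributions reproduces f(x), which is the efficiency property in action.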

Python: SHAP on a LightGBM Forecasting Model

import pandas as pd
import numpy as np
import shap
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# ── 1. Build lag features from a time series ──────────────────
def make_lag_features(series: pd.Series, lags: list) -> pd.DataFrame:
    df = pd.DataFrame({'y': series})
    for lag in lags:
        df[f'lag_{lag}'] = df['y'].shift(lag)
    df['rolling_7']  = df['y'].shift(1).rolling(7).mean()
    df['rolling_28'] = df['y'].shift(1).rolling(28).mean()
    df['day_of_week'] = series.index.dayofweek
    df['month']       = series.index.month
    return df.dropna()

# ── 2. Simulate 3 years of daily demand data ──────────────────
np.random.seed(42)
dates  = pd.date_range('2020-01-01', periods=1095, freq='D')
trend  = np.linspace(100, 130, 1095)
season = 15 * np.sin(2 * np.pi * np.arange(1095) / 365)
noise  = np.random.normal(0, 4, 1095)
demand = pd.Series(trend + season + noise, index=dates, name='demand')

# ── 3. Create features and split ──────────────────────────────
df   = make_lag_features(demand, lags=[1, 2, 3, 7, 14, 28, 365])
feat_cols = [c for c in df.columns if c != 'y']
X, y = df[feat_cols], df['y']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False   # NO shuffle — time series!
)

# ── 4. Train LightGBM forecasting model ───────────────────────
model = lgb.LGBMRegressor(
    n_estimators=400, learning_rate=0.05,
    max_depth=6, num_leaves=31,
    subsample=0.8, colsample_bytree=0.8,
    random_state=42
)
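# NOTE: early stopping on the test set is a shortcut for this demo; in practice,
# hold out a separate validation slice so the test set stays untouched.
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    callbacks=[lgb.early_stopping(50, verbose=False)]
)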

# ── 5. SHAP values for the test set ───────────────────────────
explainer   = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Local explanation — single forecast at index 50
idx = 50
shap_df = pd.DataFrame({
    'feature':    feat_cols,
    'value':      X_test.iloc[idx].values,
    'shap':       shap_values[idx]
}).sort_values('shap', key=abs, ascending=False)

print(f"Forecast: {model.predict(X_test.iloc[[idx]])[0]:.2f}")
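# expected_value may be a scalar or a 1-element array depending on the shap version
base_value = float(np.ravel(explainer.expected_value)[0])
print(f"Base value: {base_value:.2f}")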
print(shap_df.to_string(index=False))

OUTPUT

Forecast: 122.47
Base value: 114.83
feature       value    shap
lag_365      119.42   +4.21   <- Last year same day dominates
lag_7        117.81   +2.03   <- Last week same day
rolling_7    116.50   +1.12   <- Weekly average trending up
lag_1        121.10   +0.77   <- Yesterday's demand
lag_28       115.30   +0.63   <- Monthly baseline
month             7   +0.38   <- July seasonal effect
lag_2        120.05   -0.34   <- Small correction
day_of_week       2   -0.26   <- Wednesday slightly lower
lag_14       114.20   -0.20   <- Mild two-week dampener
rolling_28    114.1   +0.10   <- Long-run stable

💡

Reading the SHAP Output

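The base value (114.83) is the model's average prediction across the training set. The forecast (122.47) is 7.64 units above that. SHAP decomposes those 7.64 units across all features: lag_365 contributes +4.21 (the same day last year was also high), and lag_7 adds +2.03 (the same weekday last week was also elevated). This is domain-sensible, since annual and weekly seasonality drive demand. Note that the printout omits the lag_3 row, so the rows shown (which sum to +8.44) do not add up to exactly 7.64; the full set of SHAP values does.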

Section 05

Animated Diagram — SHAP Waterfall for a Forecast

🌟 SHAP Waterfall — How a Single Forecast is Built Up from Lag Features

Each bar shows the contribution of one lag feature to the total forecast. Bars stack from base (114.83) rightward. The forecast (122.47) equals base plus all SHAP values.
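The stacking logic behind a waterfall chart is just a running sum. A minimal text-only sketch follows, using hypothetical contribution values (not the values from the output in Section 04) so the arithmetic is easy to follow.

```python
def waterfall_rows(base_value, contributions):
    """Return (feature, contribution, running_total) rows, largest |contribution| first."""
    ordered = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    rows, running = [], base_value
    for name, phi in ordered:
        running += phi
        rows.append((name, phi, running))
    return rows

# Hypothetical numbers, chosen only to show the mechanics.
base_value = 100.0
contributions = {"lag_365": 4.0, "lag_7": 2.0, "lag_1": -1.5, "month": 0.5}

rows = waterfall_rows(base_value, contributions)
print(f"{'base':10s} {'':>7s} {base_value:8.2f}")
for name, phi, running in rows:
    print(f"{name:10s} {phi:+7.2f} {running:8.2f}")
```

The final running total is the forecast. If it does not match the model's prediction, some feature's contribution is missing from the chart.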

Section 06

Animated Diagram — Attention Weights in a Transformer Forecaster

📖 Story

The Translator Who Reads Last Week's Diary

Imagine you are a translator. To translate the word "bank" in a new sentence, you do not read the whole document equally; you focus on the two or three surrounding words that disambiguate meaning. Attention in a Transformer works in a similar way: to predict tomorrow's demand, the model might attend heavily to last Monday, last week's rolling window, and the same date last year, while nearly ignoring three weeks ago. The attention weights are a readable, though not definitive, map of what the model "looked at."

👁️ Attention Weight Heatmap — Predicting t+1 from a 10-Step Window

Each attention head learns a different temporal pattern. Darker = higher attention weight. Head 2 clearly learned the weekly seasonal cycle (t−7 dominates). Head 4 detected an anomaly at t−5.
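For readers who want to see where such a heatmap row comes from, here is a minimal sketch of scaled dot-product attention weights for a single head over a 10-step window. The key vectors are random stand-ins for learned embeddings, and the query is deliberately constructed to resemble the t−7 step; both are assumptions for illustration, not outputs of a trained forecaster.

```python
import numpy as np

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a window of keys."""
    d = keys.shape[1]
    scores = keys @ query / np.sqrt(d)
    scores = scores - scores.max()          # stabilise the softmax numerically
    weights = np.exp(scores)
    return weights / weights.sum()

rng = np.random.default_rng(0)
window, d_model = 10, 8
keys = rng.normal(size=(window, d_model))               # stand-in embeddings for t-10 ... t-1
keys /= np.linalg.norm(keys, axis=1, keepdims=True)     # unit-length keys
query = 4.0 * keys[3]                                   # hypothetical: query aligned with index 3 (t-7)

w = attention_weights(query, keys)
for step, weight in zip(range(-window, 0), w):
    print(f"t{step:+d}: {weight:.3f} {'#' * int(weight * 40)}")
```

One row of the heatmap is exactly this vector `w`: non-negative weights over the window that sum to 1, with the largest weight on the step whose key best matches the query.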

Section 07

XAI for Prophet — Decomposing a Facebook Prophet Forecast

Prophet is uniquely interpretable by design. Its additive decomposition — trend + seasonality + holidays + noise — is already an explanation. But we can go further and quantify exactly which component drove a specific forecast above or below normal.

📊 Prophet's Additive Decomposition — What Each Component Means

Trend

Long-run direction. Prophet fits piecewise linear or logistic growth. A sudden changepoint (e.g. COVID lockdown) is captured here. This component explains structural change.

Yearly

Annual Fourier seasonality. Captures summer peaks, winter troughs. Tells you: "Is this forecast high because it is August?"

Weekly

Day-of-week pattern. Retail demand may be 30% higher on Saturday. Energy may be lower on weekends. This is the most volatile component for daily data.

Holidays

Custom event effects. Christmas, Black Friday, bank holidays. Prophet estimates a separate additive coefficient for each event window. The single most actionable explanation for retailers.

Extra Regressors

Any external variables you add (temperature, price promotions, ad spend). Prophet fits a linear coefficient for each one, and the regressor's additive contribution to each forecast is reported as its own column in the forecast output.

Python: Full Prophet XAI Pipeline

from prophet import Prophet
import pandas as pd
import numpy as np

# ── Simulate retail sales with trend + weekly + yearly + holiday ─
np.random.seed(0)
dates  = pd.date_range('2021-01-01', '2024-12-31', freq='D')
n      = len(dates)
trend  = np.linspace(200, 280, n)
weekly = 30 * np.sin(np.arange(n) * 2 * np.pi / 7)
yearly = 50 * np.sin(np.arange(n) * 2 * np.pi / 365)
noise  = np.random.normal(0, 8, n)
sales  = trend + weekly + yearly + noise

# Black Friday dates: four inside the training range plus the future target date
black_fridays = pd.to_datetime(
    ['2021-11-26', '2022-11-25', '2023-11-24', '2024-11-29', '2025-11-28']
)
# Assumed +80-unit spike on historical Black Fridays so there is an effect to learn
sales = sales + 80 * dates.isin(black_fridays)

df_prophet = pd.DataFrame({'ds': dates, 'y': sales})

# Holiday table must include future dates, or no holiday effect is applied to them
holidays = pd.DataFrame({
    'holiday': 'black_friday',
    'ds': black_fridays,
    'lower_window': -1,
    'upper_window': 1
})

# ── Train Prophet model ────────────────────────────────────────
m = Prophet(
    holidays=holidays,
    yearly_seasonality=True,
    weekly_seasonality=True,
    daily_seasonality=False,
    changepoint_prior_scale=0.05   # regularise trend flexibility
)
m.fit(df_prophet)

# ── Forecast far enough ahead to include the target date ───────
future    = m.make_future_dataframe(periods=365)   # through 2025-12-31
forecast  = m.predict(future)

# ── XAI: extract component contributions for a specific date ──
target_date = '2025-11-28'  # Black Friday 2025
row = forecast[forecast['ds'] == target_date][[
    'ds', 'yhat', 'trend', 'weekly', 'yearly',
    'holidays'
]]
print(row.to_string(index=False))

# ── Compute % contribution of each component ──────────────────
yhat   = row['yhat'].values[0]
comps  = {
    'Trend':    row['trend'].values[0],
    'Weekly':   row['weekly'].values[0],
    'Yearly':   row['yearly'].values[0],
    'Holiday':  row['holidays'].values[0]
}
print(f"\nForecast for {target_date}: {yhat:.1f} units")
for k, v in comps.items():
    print(f"  {k:10s}: {v:+.2f}  ({v/yhat*100:+.1f}%)")

OUTPUT

        ds   yhat  trend  weekly  yearly  holidays
2025-11-28  341.7  281.4   -12.3    -8.1      81.1

Forecast for 2025-11-28: 341.7 units
  Trend     : +281.40  (+82.4%)
  Weekly    : -12.30  (-3.6%)   <- Friday slightly below weekly avg
  Yearly    : -8.10  (-2.4%)    <- November seasonal dip
  Holiday   : +81.10  (+23.7%)  <- Black Friday accounts for ~24% of the forecast

⚠️

Prophet XAI Limitation — No Lag Attribution

Prophet's decomposition explains structural components (trend, seasonality, holidays) but cannot tell you which specific past data points drove the forecast. If your question is "did last week's anomaly influence today's forecast?" you need SHAP on a lag-feature model instead. Use Prophet's decomposition for what type of force drove the prediction, SHAP for which historical values did.

Section 08

Animated Diagram — Prophet Additive Decomposition

🌞 Prophet Decomposition — What Drove the Black Friday Forecast?

The Holiday component (Black Friday) adds +81 units, roughly 4× the combined absolute size of the weekly and yearly headwinds (12.3 + 8.1 = 20.4 units). After the trend, it is the dominant driver of the forecast.
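The size of the holiday effect relative to the other components can be checked directly from the component values in the Section 07 output. This is plain arithmetic on those printed numbers; the variable names are only for illustration.

```python
# Component values copied from the printed forecast row in Section 07.
trend, weekly, yearly, holiday = 281.4, -12.3, -8.1, 81.1

seasonal_headwind = abs(weekly) + abs(yearly)      # 20.4 units
no_holiday_level  = trend + weekly + yearly        # forecast level without the holiday term

print(f"Holiday vs seasonal headwinds: {holiday / seasonal_headwind:.1f}x")
print(f"Lift over the no-holiday level: {holiday / no_holiday_level * 100:.1f}%")
```

Expressed against the no-holiday level, the lift is about 31%, which is larger than the holiday's roughly 24% share of the final forecast. Which figure to report depends on whether the audience thinks in "share of total" or "uplift over baseline".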

Section 09

LIME for Time Series — When and How

LIME fits a simple surrogate model (usually a weighted linear regression) to approximate a complex model locally, near a specific prediction. For time series, each lag is treated as an independent feature, so perturbed samples can contain lag combinations that no real series would produce. That makes LIME less reliable than TreeSHAP on temporal models, but it is model-agnostic and works well for quick sanity checks.
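The mechanics described above (perturb, weight by proximity, fit a linear surrogate) fit in a few lines of NumPy. The sketch below is a simplified stand-in for what the lime library does, not its actual implementation: the Gaussian perturbation, the kernel, and the `black_box` function are all assumptions chosen for illustration.

```python
import numpy as np

def lime_explain(predict_fn, x, scale, n_samples=2000, kernel_width=1.0, seed=0):
    """Local surrogate sketch: perturb x, weight by proximity, fit weighted least squares."""
    rng = np.random.default_rng(seed)
    # 1. Perturb each feature independently (this is the independence assumption).
    Z = x + rng.normal(size=(n_samples, x.size)) * scale
    y = predict_fn(Z)
    # 2. Weight samples by closeness to x, measured in standardised units.
    dist = np.linalg.norm((Z - x) / scale, axis=1)
    weights = np.exp(-(dist ** 2) / (kernel_width ** 2 * x.size))
    # 3. Weighted linear regression with an intercept, centred on x.
    A = np.column_stack([np.ones(n_samples), Z - x])
    sw = np.sqrt(weights)[:, None]
    coef, *_ = np.linalg.lstsq(A * sw, y * sw[:, 0], rcond=None)
    return coef[0], coef[1:]      # surrogate prediction at x, local slopes

# Hypothetical black box over [lag_1, lag_7, lag_365]; not the LightGBM model.
def black_box(Z):
    return 0.6 * Z[:, 0] + 0.3 * Z[:, 1] + 0.002 * (Z[:, 2] - 100.0) ** 2

x = np.array([120.0, 118.0, 125.0])
local_pred, slopes = lime_explain(black_box, x, scale=np.array([2.0, 2.0, 2.0]))
for name, s in zip(["lag_1", "lag_7", "lag_365"], slopes):
    print(f"{name:8s} local slope {s:+.3f}")
print(f"surrogate at x: {local_pred:.2f} | black box at x: {black_box(x[None, :])[0]:.2f}")
```

The slope recovered for lag_365 is the local derivative of the curved term at x, which is the sense in which LIME's explanation is "local": move x and the slope changes.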

Python: LIME on a Time Series Forecaster

import lime
import lime.lime_tabular
import numpy as np

# ── Re-use the LightGBM model and X_test from Section 04 ──────
# X_train, X_test, model already defined above

# ── Build LIME explainer ──────────────────────────────────────
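explainer_lime = lime.lime_tabular.LimeTabularExplainer(
    training_data=X_train.values,
    feature_names=feat_cols,
    # Treat calendar columns as categories so rules read "month=7", not a numeric bin
    categorical_features=[feat_cols.index('day_of_week'), feat_cols.index('month')],
    mode='regression',
    discretize_continuous=True,
    random_state=42
)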

# ── Explain a single forecast at index 20 ─────────────────────
idx  = 20
exp  = explainer_lime.explain_instance(
    data_row=X_test.iloc[idx].values,
    predict_fn=model.predict,
    num_features=8,
    num_samples=2000
)

print("\nLIME Explanation (local linear approximation):")
print(f"Actual forecast: {model.predict(X_test.iloc[[idx]])[0]:.2f}")
print(f"LIME local pred: {exp.local_pred[0]:.2f}\n")
for feat, coef in exp.as_list():
    bar = '█' * int(abs(coef) * 10)
    sign = '+' if coef > 0 else ''
    print(f"  {feat:35s}: {sign}{coef:.3f}  {bar}")

OUTPUT

LIME Explanation (local linear approximation):
Actual forecast: 118.42
LIME local pred: 118.19

  lag_365 > 116.5                    : +0.821  ████████
  116.0 < lag_7 <= 119.2             : +0.412  ████
  rolling_7 > 115.5                  : +0.287  ██
  month = 7                          : +0.156  █
  lag_1 > 120.1                      : -0.134  █
  day_of_week = 2                    : -0.089
  lag_28 <= 114.9                    : +0.063
  lag_2 > 119.5                      : -0.041

⚖️

LIME vs SHAP — When to Use Which

Use SHAP when you need attributions that are guaranteed to sum exactly to the gap between the prediction and the base value, or when you have a tree-based model (TreeSHAP is fast and exact). Use LIME for a fast sanity check on any black-box model where SHAP is too slow, or when you want a human-readable conditional rule ("lag_365 > 116.5 contributed +0.82") rather than a raw scalar.

Section 10

XAI in Practice — Domain-Specific Applications

⚡

Energy Grid Load Forecasting

Demand Planning

Grid operators use SHAP to see whether temperature 24 hours earlier (temperature lag-24) or previous-day consumption is the dominant driver of a given forecast. During heat waves, temperature dominates; on normal days, weekly seasonality does. XAI lets operators override the model when the SHAP attributions conflict with domain knowledge.

✓ Reduces costly forecast errors during anomalous weather

✗ Requires temperature as an explicit feature

🏠

Retail Sales Forecasting

Supply Chain

Retailers apply Prophet decomposition to separate holiday effects from organic growth. When a buyer asks "why did the model order 20% more stock this week?", the answer is a single number: +18.4 units from the upcoming Bank Holiday component. Procurement teams can act on this without understanding the underlying model.

✓ Directly actionable for procurement decisions

✗ Holiday calendar must be kept up to date

🏥

Hospital Patient Demand

Healthcare Operations

Hospital forecasting models (staffing, bed allocation) use SHAP on lag features to verify that previous-week admissions drive predictions, rather than spurious correlations with calendar artifacts. Where regulation or internal governance requires a written justification for a forecast, SHAP output can form part of the audit trail.

✓ Meets regulatory transparency requirements

✗ SHAP computation is slow on large hospital datasets

Domain	Common Model	Best XAI Method	Primary Explanation Goal	Key Lag to Explain
Energy	LightGBM + lag features	TreeSHAP	Which weather / demand lag drove forecast?	Temperature lag-24, lag-168
Retail	Prophet + holiday regressors	Decomposition + SHAP	How much did holiday add?	lag-7 (weekly), holiday window
Finance	LSTM / Transformer	Attention weights + IG	Which past prices drove the signal?	lag-1 to lag-5 (high-frequency)
Healthcare	LightGBM / XGBoost	TreeSHAP	Why is bed demand forecast high?	lag-7, lag-14 (admission cycles)
Manufacturing	ARIMA + exogenous vars	Coefficient inspection + LIME	Which external regressor (raw material price?) is dominant?	External regressors, lag-1
Transport	N-BEATS / TFT	Integrated Gradients	Which hour-of-day history drove peak prediction?	Hourly lags (1–24)

Section 11

Concept Drift Detection via SHAP — XAI as a Monitoring Tool

📖 Story

The Silent Mutiny of Your Model

A forecasting model for a delivery company was trained in 2022. It performed well for 18 months. In mid-2024, accuracy quietly degraded: MAE crept up 15% over six weeks. Nobody noticed immediately. When the data science team finally ran SHAP on recent predictions, they saw something alarming: the lag-7 feature's SHAP values had shrunk from ±4.2 to ±0.8. The weekly cycle had weakened, a post-pandemic behavioural shift in order patterns. The model was still using a 2022 weekly pattern in a 2024 world. SHAP pinpointed which learned pattern had gone stale, something the accuracy metric alone could not show, and had it been monitored from the start it could have raised a flag earlier.

Python: SHAP-Based Drift Detection

import pandas as pd
import numpy as np
import shap

# ── Assume model, explainer, X_test are from Section 04 ───────
# Split test window into early and late halves (simulates drift)

mid = len(X_test) // 2
X_early = X_test.iloc[:mid]
X_late  = X_test.iloc[mid:]

shap_early = explainer.shap_values(X_early)
shap_late  = explainer.shap_values(X_late)

# ── Mean absolute SHAP per feature in each period ─────────────
early_imp = pd.Series(np.abs(shap_early).mean(axis=0), index=feat_cols)
late_imp  = pd.Series(np.abs(shap_late).mean(axis=0),  index=feat_cols)

drift = pd.DataFrame({
    'Early Mean |SHAP|': early_imp,
    'Late Mean |SHAP|':  late_imp,
    'Change %': ((late_imp - early_imp) / (early_imp + 1e-9)) * 100
}).sort_values('Early Mean |SHAP|', ascending=False)  # most important features first

print(drift.round(3).to_string())

# Alert: flag any feature whose SHAP importance changed by >20%
flagged = drift[drift['Change %'].abs() > 20]
if not flagged.empty:
    print(f"\n⚠ DRIFT ALERT: {len(flagged)} features changed significantly")
    print(flagged)

OUTPUT

             Early Mean |SHAP|  Late Mean |SHAP|  Change %
lag_365                  3.821             4.105     +7.4%
lag_7                    3.156             2.491    -21.1%   <- DRIFT
rolling_7                1.842             1.980     +7.5%
lag_1                    1.203             1.387    +15.3%
lag_28                   0.941             0.798    -15.2%
rolling_28               0.612             0.640     +4.6%
lag_2                    0.458             0.432     -5.7%
month                    0.344             0.301    -12.5%
day_of_week              0.219             0.171    -21.9%   <- DRIFT

⚠ DRIFT ALERT: 2 features changed significantly
             Early Mean |SHAP|  Late Mean |SHAP|  Change %
lag_7                    3.156             2.491    -21.1%
day_of_week              0.219             0.171    -21.9%

⚠️

SHAP Drift ≠ Model Failure — But It Is an Early Warning

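A drop in lag_7 and day_of_week SHAP importance means the fixed model is leaning less on its weekly features for recent inputs, which suggests the weekly cycle in the data has weakened. A model trained on the old cycle may now overweight it. This is not a model failure yet; it is a potential leading indicator. Investigate, and retrain if the shift persists. Because SHAP values need only the model inputs and not the actuals, this check can run before the ground truth for a forecast arrives, so it can raise a flag earlier than accuracy monitoring.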

Section 12

Full XAI Pipeline — End to End

Define the Explanation Goal

Before building anything: decide who needs the explanation (data scientist, operations manager, regulator), what decision it supports, and at what granularity — global importance, local single-forecast, or temporal drift. A wrong XAI method for the audience is worse than no explanation.

Engineer Interpretable Lag Features

Name lag features explicitly: lag_1, lag_7, lag_365, rolling_mean_7. Avoid generic feature_0 naming — it makes SHAP output unreadable. Add domain features (day of week, month, holiday flag) as named columns.

Fit the Forecasting Model

Train with time-respecting splits (shuffle=False in sklearn, TimeSeriesSplit for cross-validation). Models that violate this introduce future data leakage, making SHAP values meaningless — the model may explain a pattern it couldn't actually have seen at inference time.

Apply SHAP (Global + Local)

Run TreeExplainer for tree models, GradientExplainer for PyTorch/TF. Compute both global importance (mean |SHAP| per feature across all test points) and local explanation (waterfall or force plot) for specific forecasts that deviated most from actuals.

Validate Against Domain Knowledge

Ask a domain expert: "Does it make sense that lag_365 is the top driver?" If SHAP says row_id is the most important feature, something is badly wrong. This step separates technically valid explanations from actionably correct ones.

Deploy SHAP Drift Monitoring in Production

Log mean |SHAP| per feature in a sliding window (e.g. 4 weeks). Alert when any feature's importance shifts more than 20% from its baseline. This is your early-warning system for drift: it needs no actuals, so it can fire before accuracy metrics visibly degrade.
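The monitoring step at the end of the pipeline can be sketched as a small function over a log of SHAP values. The sketch assumes the log is a NumPy array with one row per forecast in time order. The function name, the window sizes, and the synthetic log are hypothetical; the 20% threshold follows the rule of thumb used in this article.

```python
import numpy as np

def shap_drift_flags(shap_matrix, baseline_rows, window, threshold_pct=20.0):
    """Compare mean |SHAP| in the most recent window against a fixed baseline period.

    shap_matrix: array of shape (n_rows, n_features), rows in time order.
    Returns (pct_change per feature, boolean flag per feature).
    """
    abs_shap = np.abs(shap_matrix)
    baseline = abs_shap[:baseline_rows].mean(axis=0)
    recent = abs_shap[-window:].mean(axis=0)
    pct_change = (recent - baseline) / np.maximum(baseline, 1e-9) * 100.0
    return pct_change, np.abs(pct_change) > threshold_pct

# Synthetic stand-in for logged SHAP values (not real model output):
# column 0 keeps its magnitude, column 1 shrinks to 40% of it in the last 28 rows.
shap_log = np.tile([[3.0, 2.0], [-3.0, -2.0]], (100, 1))   # shape (200, 2)
shap_log[-28:, 1] *= 0.4

pct, flagged = shap_drift_flags(shap_log, baseline_rows=100, window=28)
for name, p, f in zip(["lag_365", "lag_7"], pct, flagged):
    print(f"{name:8s} change {p:+6.1f}%  {'DRIFT' if f else 'ok'}")
```

With real, noisy SHAP logs a short window will fluctuate on its own, so it is worth checking how often the threshold fires on a period known to be stable before wiring it to an alert.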

Section 13

Common Pitfalls in Time Series XAI

#	Pitfall	What Goes Wrong	Fix
1	Shuffling the test split	Future data leaks into training → SHAP explains a ghost model	Always use shuffle=False or TimeSeriesSplit
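2	Using KernelSHAP on large datasets	Cost scales with rows explained × sampled coalitions × background rows, so a run on 10k rows can take many hours	Use TreeSHAP for tree models; for KernelSHAP, explain a sample of ≤1000 rows and summarise the background set (e.g. shap.sample or shap.kmeans)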
3	Treating attention as ground truth	Attention weights are not mathematically equivalent to feature importance	Combine attention with gradient-based methods for verification
4	Explaining a poorly calibrated model	Explaining a wrong model gives wrong explanations — confidently	Validate model accuracy before investing in XAI
5	Ignoring correlated lags	lag_1 and lag_2 are correlated — SHAP splits credit arbitrarily	Use SHAP interaction values or reduce lag redundancy with PCA
6	Explaining multi-step forecasts as single-step	SHAP values computed for one horizon (e.g. the h=7 direct model) get read as if they explained every horizon	Produce separate SHAP values for each forecast horizon h
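Pitfall 1 is the easiest to avoid mechanically. The sketch below is a hand-rolled expanding-window splitter, written out only to show the idea behind time-respecting splits; the function name and parameters are hypothetical, and in practice sklearn's TimeSeriesSplit covers the same ground.

```python
def expanding_window_splits(n_rows, n_splits, test_size):
    """Yield (train_indices, test_indices) where every test block follows its training data."""
    first_test_start = n_rows - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough rows for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        yield range(0, test_start), range(test_start, test_start + test_size)

for train_idx, test_idx in expanding_window_splits(n_rows=730, n_splits=3, test_size=60):
    print(f"train rows 0..{train_idx[-1]:3d} | test rows {test_idx[0]}..{test_idx[-1]}")
```

Every training range ends before its test range begins, so no fold can learn from observations that postdate the forecasts it is evaluated on.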

Section 14

XAI Methods — Comparison Table

Method	Model Agnostic	Local / Global	Temporal Awareness	Speed	Best For
TreeSHAP	Tree models only	Both	Lag features = implicit	Very fast: O(TLD²) per explained row (T trees, L leaves, D depth)	LightGBM, XGBoost, RF
KernelSHAP	Yes (any model)	Local	Lag features = implicit	Slow: many model evaluations per explained row	Neural networks, wrapped ARIMA models
LIME	Yes	Local	None — assumes independence	Fast	Quick sanity checks
Attention	Attention models only	Local	Direct — time-step weights	Free (already computed)	TFT, Transformer, LSTM-Attn
Integrated Gradients	Differentiable only	Local	Direct — input timestep attribution	Medium	N-BEATS, LSTM, TCN
Permutation Importance	Yes	Global	Lag features = implicit	Slow — n_features × validation passes	Global feature ranking, drift checks
Prophet Decomposition	Prophet only	Both	Structural — trend/season/holiday	Instant — built-in	Retail, supply chain, operations
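Permutation importance is the one method in the table not shown in code elsewhere in this article, so here is a minimal NumPy sketch. The `predict_fn` is a hypothetical stand-in for a fitted model that uses column 0 and ignores column 1; MAE as the error metric and five repeats are arbitrary choices.

```python
import numpy as np

def permutation_importance(predict_fn, X, y, n_repeats=5, seed=0):
    """Mean increase in MAE when one column at a time is shuffled."""
    rng = np.random.default_rng(seed)
    base_mae = np.mean(np.abs(y - predict_fn(X)))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # break the link between column j and y
            increases.append(np.mean(np.abs(y - predict_fn(Xp))) - base_mae)
        importances[j] = np.mean(increases)
    return importances

# Hypothetical fitted model: depends on column 0 (say lag_7) and ignores column 1.
def predict_fn(X):
    return 2.0 * X[:, 0]

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

imp = permutation_importance(predict_fn, X, y)
for name, value in zip(["lag_7", "unused_feature"], imp):
    print(f"{name:15s} MAE increase {value:.3f}")
```

The ignored column scores exactly zero because shuffling it cannot change any prediction. On real lag features, remember that shuffling a column also destroys its temporal order, so the scores are best read as a ranking rather than as precise effect sizes.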

Section 15

Golden Rules — XAI for Time Series Forecasting

⏳ Time Series XAI — Non-Negotiable Rules

Never shuffle your train/test split. In time series, shuffling leaks future data into training. A model trained this way is not a valid forecasting model, and its SHAP values explain a fiction. Always use shuffle=False or TimeSeriesSplit.

Name your lag features explicitly. lag_7, rolling_mean_28, holiday_flag. Anonymous features like x_4 make every SHAP plot unreadable and every stakeholder conversation impossible. Interpretability starts at feature engineering, not at the XAI library.

Use TreeSHAP for tree-based models: it is exact and typically takes milliseconds per prediction. Reserve KernelSHAP for genuinely model-agnostic cases, and run it on a sample of roughly 1,000 to 2,000 rows at most. Running KernelSHAP on 50,000 time-series rows is a common and painful mistake.

Validate SHAP against domain knowledge before publishing. If SHAP says timestamp_id is the most important feature, you have a data leakage bug, not a profound discovery. Always cross-check with subject matter experts. Technically valid SHAP on a poorly specified model produces confidently wrong explanations.

Monitor SHAP importance over time in production. Set rolling alerts on mean |SHAP| per feature. A 20%+ shift in a key lag's importance is a signal of possible drift, and because it needs no actuals it can appear before accuracy metrics catch it. SHAP monitoring is one of the cheapest early-warning systems you can deploy.

Match the XAI method to the audience. Operations managers understand Prophet decomposition ("Holiday adds +81 units"). Regulators need SHAP waterfall plots. Data scientists want SHAP beeswarm and drift charts. Never give a regulator a beeswarm plot.

For multi-step forecasting (predicting h=1, h=2, ... h=7 simultaneously), produce separate SHAP values for each horizon. The lag importances for a 1-day-ahead forecast are structurally different from those for a 7-day-ahead forecast. Combining them destroys the interpretive signal.
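The last rule can be demonstrated without any XAI library. The sketch below fits one direct linear model per horizon on a simulated AR(1) series and prints the weight each model puts on the most recent value. The series, the two-lag feature set, and the use of plain least squares in place of a tree model with SHAP are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
y = np.zeros(n)
for t in range(1, n):                      # simulated AR(1) series (illustrative only)
    y[t] = 0.8 * y[t - 1] + rng.normal()

def direct_dataset(series, n_lags, horizon):
    """Features: the last n_lags values up to time t. Target: the value at t + horizon."""
    rows = range(n_lags - 1, len(series) - horizon)
    X = np.array([series[t - n_lags + 1 : t + 1][::-1] for t in rows])  # column 0 = most recent
    target = np.array([series[t + horizon] for t in rows])
    return X, target

coefs = {}
for h in (1, 3, 7):
    X, target = direct_dataset(y, n_lags=2, horizon=h)
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    coefs[h] = beta[1:]                     # one model, and one explanation, per horizon
    print(f"h={h}: weight on most recent value {beta[1]:+.3f}, on the one before {beta[2]:+.3f}")
```

The weight on the most recent value falls as the horizon grows, so an attribution computed for h=1 would overstate that lag's role in the h=7 forecast. The same reasoning applies when the per-horizon models are gradient-boosted trees explained with SHAP.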