The Story That Explains Explainable AI for Time Series
Both forecasts are identical. But which forecaster would you trust enough to act on? Which would you challenge if they turned out to be wrong two days in a row?
That gap — between a number and a reason — is the entire problem that Explainable AI (XAI) for Time Series is designed to close.
Time series forecasting models — LSTMs, Prophet, ARIMA, Gradient Boosting regressors — can predict the future with remarkable accuracy. But when they fail (and they do fail), the fallout is asymmetric: a hospital whose patient-demand forecast collapsed, an energy grid whose load model missed a heat wave, a supply chain that over-ordered by 40%. In every case, someone needs to know why the model went wrong — not just that it did.
Given a forecasting model's prediction at time t, XAI tells you: which past observations drove that forecast, which features (lag values, seasonal components, external regressors) mattered most, and whether the model's reasoning is consistent with domain knowledge — or whether it has learned a spurious pattern that will break down.
Why Time Series XAI is Harder Than Tabular XAI
Most XAI tutorials focus on tabular classifiers — predict loan default, explain with SHAP, done. Time series is fundamentally different. Here is why.
| Challenge | Tabular XAI | Time Series XAI |
|---|---|---|
| Feature space | Fixed columns, independent rows | Lag features, rolling stats, date embeddings |
| Independence assumption | Reasonable to assume | Violated by design — rows are ordered |
| Counterfactuals | Swap feature value, re-predict | Changing t−1 cascades through all future steps |
| Global vs local explanation | Consistent across dataset | May differ radically across time periods |
| Perturbation validity | Perturb one row | Must perturb a coherent time window |
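The "Counterfactuals" row can be made concrete with a toy recursive forecaster. The sketch below is illustrative only: the AR(1) form, the coefficient `PHI = 0.8`, and the helper name `recursive_forecast` are assumptions, not something taken from a real model in this article. It shows that a change to the last observed value does not stay local; it propagates to every later step of a multi-step forecast.

```python
import numpy as np

def recursive_forecast(last_value: float, phi: float, steps: int) -> np.ndarray:
    """Roll a toy AR(1) model forward: y[t+1] = phi * y[t]."""
    preds = []
    current = last_value
    for _ in range(steps):
        current = phi * current   # each forecast feeds the next one
        preds.append(current)
    return np.array(preds)

PHI = 0.8       # hypothetical AR coefficient
HORIZON = 5     # number of steps forecast ahead

baseline = recursive_forecast(last_value=100.0, phi=PHI, steps=HORIZON)
counterfactual = recursive_forecast(last_value=110.0, phi=PHI, steps=HORIZON)

# The +10 change at t-1 appears at every horizon, decaying as PHI**h.
effect = counterfactual - baseline
for h, delta in enumerate(effect, start=1):
    print(f"h={h}: forecast shifts by {delta:+.2f}")
```

In a tabular setting the same edit would change exactly one prediction. Here it changes all five, which is why a counterfactual for a forecaster has to be stated per horizon.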
The XAI Toolbox — Five Methods for Time Series
There is no single best XAI method for forecasting models. Each tool answers a slightly different question. Understanding them as a toolkit — rather than alternatives — is the key to using them well.
SHAP for Time Series — Deep Dive
Using SHAP on the lag features (demand at t−1, t−2, t−24, t−168), the engineer discovers the model gave enormous weight to last week's same-hour value (t−168) — which was inflated due to a sporting event. The model saw: "last Tuesday at 6 pm was high → therefore this Tuesday at 6 pm will be high." There was no sporting event this Tuesday. SHAP exposed the bug. The lag-168 feature was subsequently down-weighted.
How SHAP Works on a Forecasting Model
SHAP is rooted in cooperative game theory. Given a model prediction f(x), it assigns each feature i a Shapley value: the unique attribution that satisfies the Shapley fairness axioms, which include efficiency (attributions sum to the difference between the prediction and the base value), symmetry (features that contribute identically get equal credit), and dummy (features that never change the output get zero credit), together with additivity.
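To make the axioms tangible, here is a minimal brute-force sketch that computes exact Shapley values for a three-feature toy forecaster by enumerating every coalition. The toy `model`, the observed values in `x`, and the `baseline` reference values are hypothetical. Real SHAP explainers use far more efficient algorithms and a background dataset rather than a single reference point.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating coalitions (feasible only for small n)."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Probability weight of a coalition of this size preceding feature i
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

# Hypothetical inputs: feature 0 = lag_1, feature 1 = lag_7, feature 2 = unused
x = {0: 120.0, 1: 118.0, 2: 95.0}
baseline = {0: 100.0, 1: 100.0, 2: 100.0}   # value used when a feature is "absent"

def model(features):
    # Toy forecaster; feature 2 never enters the prediction (a dummy feature)
    return 0.5 * features[0] + 0.3 * features[1] + 10.0

def value_fn(coalition):
    """Model output when only the coalition's features take their observed values."""
    mixed = {j: (x[j] if j in coalition else baseline[j]) for j in x}
    return model(mixed)

phi = shapley_values(value_fn, 3)
base_value = value_fn(frozenset())
prediction = value_fn(frozenset(x))

print("Shapley values:", [round(p, 3) for p in phi])
print("Sum of attributions:", round(sum(phi), 3))
print("Prediction - base value:", round(prediction - base_value, 3))
```

The attributions sum to the gap between the prediction and the base value (efficiency), and the unused feature receives zero (dummy).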
Python: SHAP on a LightGBM Forecasting Model
import pandas as pd
import numpy as np
import shap
import lightgbm as lgb
from sklearn.model_selection import train_test_split
# ── 1. Build lag features from a time series ──────────────────
def make_lag_features(series: pd.Series, lags: list) -> pd.DataFrame:
    df = pd.DataFrame({'y': series})
    for lag in lags:
        df[f'lag_{lag}'] = df['y'].shift(lag)
    # shift(1) before rolling so the window never includes the target itself
    df['rolling_7'] = df['y'].shift(1).rolling(7).mean()
    df['rolling_28'] = df['y'].shift(1).rolling(28).mean()
    df['day_of_week'] = series.index.dayofweek
    df['month'] = series.index.month
    return df.dropna()
# ── 2. Simulate 3 years of daily demand data ──────────────────
np.random.seed(42)
dates = pd.date_range('2020-01-01', periods=1095, freq='D')
trend = np.linspace(100, 130, 1095)
season = 15 * np.sin(2 * np.pi * np.arange(1095) / 365)
noise = np.random.normal(0, 4, 1095)
demand = pd.Series(trend + season + noise, index=dates, name='demand')
# ── 3. Create features and split ──────────────────────────────
df = make_lag_features(demand, lags=[1, 2, 3, 7, 14, 28, 365])
feat_cols = [c for c in df.columns if c != 'y']
X, y = df[feat_cols], df['y']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False  # no shuffle: keep time order
)
# ── 4. Train LightGBM forecasting model ───────────────────────
model = lgb.LGBMRegressor(
    n_estimators=400, learning_rate=0.05,
    max_depth=6, num_leaves=31,
    subsample=0.8, colsample_bytree=0.8,
    random_state=42
)
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    callbacks=[lgb.early_stopping(50, verbose=False)]
)
# ── 5. SHAP values for the test set ───────────────────────────
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Local explanation — single forecast at index 50
idx = 50
shap_df = pd.DataFrame({
    'feature': feat_cols,
    'value': X_test.iloc[idx].values,
    'shap': shap_values[idx]
}).sort_values('shap', key=abs, ascending=False)
print(f"Forecast: {model.predict(X_test.iloc[[idx]])[0]:.2f}")
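# expected_value may be a scalar or a length-1 array depending on the shap version
print(f"Base value: {float(np.ravel(explainer.expected_value)[0]):.2f}")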
print(shap_df.to_string(index=False))
The base value (114.83) is the model's average prediction across the training set. The forecast (122.47) is 7.64 units above that. SHAP decomposes those 7.64 units: lag_365 contributes +4.21 (the same week last year was also high), and lag_7 adds +2.03 (last Tuesday was also elevated). This is domain-sensible — annual and weekly seasonality drive demand.
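The arithmetic behind that decomposition can be checked by hand. The short sketch below uses only the figures quoted in the paragraph above; the variable names are illustrative, and the remainder attributed to "other features" is derived by subtraction rather than reported in the original output.

```python
# Figures quoted in the text above
base_value = 114.83
forecast = 122.47
named_contributions = {"lag_365": 4.21, "lag_7": 2.03}

# Additivity: base value + all SHAP values = forecast
total_shift = forecast - base_value
explained_by_named = sum(named_contributions.values())
remaining = total_shift - explained_by_named   # left for all other features

print(f"Total shift from base:    {total_shift:+.2f}")
print(f"lag_365 + lag_7:          {explained_by_named:+.2f}")
print(f"All other features (sum): {remaining:+.2f}")
```

So the two named lags account for most of the shift, and every remaining feature together contributes the rest. On a real run, the same check is `base + shap_values[idx].sum()` compared against the model's prediction.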
Animated Diagram — SHAP Waterfall for a Forecast
Each bar shows the contribution of one lag feature to the total forecast. Bars stack from base (114.83) rightward. The forecast (122.47) equals base plus all SHAP values.
Animated Diagram — Attention Weights in a Transformer Forecaster
Each attention head learns a different temporal pattern. Darker = higher attention weight. Head 2 clearly learned the weekly seasonal cycle (t−7 dominates). Head 4 detected an anomaly at t−5.
XAI for Prophet — Decomposing a Facebook Prophet Forecast
Prophet is uniquely interpretable by design. Its additive decomposition — trend + seasonality + holidays + noise — is already an explanation. But we can go further and quantify exactly which component drove a specific forecast above or below normal.
Python: Full Prophet XAI Pipeline
from prophet import Prophet
import pandas as pd
import numpy as np
# ── Simulate retail sales with trend + weekly + yearly + holiday ──
np.random.seed(0)
dates = pd.date_range('2021-01-01', '2024-12-31', freq='D')
n = len(dates)
trend = np.linspace(200, 280, n)
weekly = 30 * np.sin(np.arange(n) * 2 * np.pi / 7)
yearly = 50 * np.sin(np.arange(n) * 2 * np.pi / 365)
noise = np.random.normal(0, 8, n)
# Black Friday dates. Future dates must be listed too, otherwise Prophet
# cannot apply the holiday effect inside the forecast horizon.
bf_dates = pd.to_datetime(['2021-11-26', '2022-11-25', '2023-11-24',
                           '2024-11-29', '2025-11-28'])
# Assumption for illustration: an 80-unit uplift on each Black Friday
spike = 80 * dates.isin(bf_dates)
sales = trend + weekly + yearly + noise + spike
df_prophet = pd.DataFrame({'ds': dates, 'y': sales})
# Holiday effect: Black Friday spikes, plus one day either side
holidays = pd.DataFrame({
    'holiday': 'black_friday',
    'ds': bf_dates,
    'lower_window': -1,
    'upper_window': 1
})
# ── Train Prophet model ────────────────────────────────────────
m = Prophet(
    holidays=holidays,
    yearly_seasonality=True,
    weekly_seasonality=True,
    daily_seasonality=False,
    changepoint_prior_scale=0.05  # regularise trend flexibility
)
m.fit(df_prophet)
# ── Forecast 365 days ahead so the horizon reaches Black Friday 2025 ──
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)
# ── XAI: extract component contributions for a specific date ──
target_date = '2025-11-28'  # Black Friday 2025
row = forecast[forecast['ds'] == pd.Timestamp(target_date)][[
    'ds', 'yhat', 'trend', 'weekly', 'yearly',
    'holidays'
]]
print(row.to_string(index=False))
# ── Compute % contribution of each component ──────────────────
yhat = row['yhat'].values[0]
comps = {
    'Trend': row['trend'].values[0],
    'Weekly': row['weekly'].values[0],
    'Yearly': row['yearly'].values[0],
    'Holiday': row['holidays'].values[0]
}
print(f"\nForecast for {target_date}: {yhat:.1f} units")
for k, v in comps.items():
    print(f"  {k:10s}: {v:+.2f} ({v/yhat*100:+.1f}%)")
Prophet's decomposition explains structural components (trend, seasonality, holidays) but cannot tell you which specific past data points drove the forecast. If your question is "did last week's anomaly influence today's forecast?" you need SHAP on a lag-feature model instead. Use Prophet's decomposition for what type of force drove the prediction, SHAP for which historical values did.
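One practical wrinkle with reporting each component as a percentage of yhat is that the trend usually dominates, which hides the components that actually moved the forecast on a given day. A possible alternative, sketched below, is to report each non-trend component's share of the total absolute movement away from trend. The helper `component_shares` and the numbers in `row` are hypothetical, and the sketch assumes Prophet's default additive mode, where the components sum to yhat.

```python
def component_shares(components: dict) -> dict:
    """Share of total absolute movement away from trend, per non-trend component."""
    movers = {k: v for k, v in components.items() if k != "trend"}
    total = sum(abs(v) for v in movers.values())
    if total == 0:
        return {k: 0.0 for k in movers}
    return {k: abs(v) / total for k, v in movers.items()}

# Hypothetical component values for one forecast date (illustrative only)
row = {"trend": 280.0, "weekly": -12.0, "yearly": -8.0, "holidays": 60.0}

yhat = sum(row.values())          # additive mode: components sum to yhat
shares = component_shares(row)

print(f"yhat = {yhat:.1f}")
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"  {name:10s}: {row[name]:+.1f} units, {share:.0%} of non-trend movement")
```

With these illustrative inputs the holiday term accounts for three quarters of the movement, even though it is a small fraction of yhat itself.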
Animated Diagram — Prophet Additive Decomposition
The Holiday component (Black Friday) adds +81 units — more than 10× the absolute value of all seasonal headwinds combined. This is the dominant driver of the forecast.
LIME for Time Series — When and How
LIME fits a simple surrogate model (usually a linear regression) that approximates a complex model locally, near a specific prediction. For time series, each lag is treated as an independent feature. LIME is weaker than SHAP on temporal models because its perturbations ignore the correlation between lags and its attributions are not guaranteed to sum to the prediction, but it is model-agnostic, typically faster than KernelSHAP, and works well for quick sanity checks.
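The mechanism is small enough to sketch from scratch. The code below is a simplified illustration of the idea, not the `lime` library's implementation: it perturbs an instance with Gaussian noise, weights the samples by proximity, and fits a weighted linear regression. The function `lime_sketch`, the toy `black_box`, and the feature values are all hypothetical.

```python
import numpy as np

def lime_sketch(predict_fn, x, scale, n_samples=2000, kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear surrogate around one instance x."""
    rng = np.random.default_rng(seed)
    # 1. Perturb every feature independently (this is what ignores lag correlation)
    noise = rng.normal(0.0, 1.0, size=(n_samples, x.size))
    samples = x + noise * scale
    preds = predict_fn(samples)
    # 2. Weight each sample by closeness to x, measured in standardised units
    dist_sq = (noise ** 2).sum(axis=1)
    weights = np.exp(-dist_sq / (kernel_width ** 2 * x.size))
    # 3. Weighted least squares: scale rows by sqrt(weight), then solve
    design = np.column_stack([np.ones(n_samples), samples - x])
    sw = np.sqrt(weights)[:, None]
    coef, *_ = np.linalg.lstsq(design * sw, preds[:, None] * sw, rcond=None)
    return coef[0, 0], coef[1:, 0]   # local intercept, per-feature slopes

def black_box(X):
    # Hypothetical forecaster over [lag_1, lag_7, lag_365]; lag_365 is unused
    return 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.0 * X[:, 2]

x0 = np.array([120.0, 118.0, 95.0])
intercept, slopes = lime_sketch(black_box, x0, scale=np.array([5.0, 5.0, 5.0]))

print("Local intercept:", round(float(intercept), 3))
print("Local slopes:   ", np.round(slopes, 3))
```

Because this toy black box is itself linear, the surrogate recovers its coefficients. On a real nonlinear forecaster the slopes describe behaviour only near `x0`, and step 1 can generate lag combinations that never occur in real data.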
Python: LIME on a Time Series Forecaster
import lime
import lime.lime_tabular
import numpy as np
# ── Re-use the LightGBM model and X_test from Section 04 ──────
# X_train, X_test, model already defined above
# ── Build LIME explainer ──────────────────────────────────────
explainer_lime = lime.lime_tabular.LimeTabularExplainer(
    training_data=X_train.values,
    feature_names=feat_cols,
    mode='regression',
    discretize_continuous=True,
    random_state=42
)
# ── Explain a single forecast at index 20 ─────────────────────
idx = 20
exp = explainer_lime.explain_instance(
    data_row=X_test.iloc[idx].values,
    predict_fn=model.predict,
    num_features=8,
    num_samples=2000
)
print("\nLIME Explanation (local linear approximation):")
print(f"Actual forecast: {model.predict(X_test.iloc[[idx]])[0]:.2f}")
print(f"LIME local pred: {exp.local_pred[0]:.2f}\n")
for feat, coef in exp.as_list():
    bar = '█' * int(abs(coef) * 10)
    sign = '+' if coef > 0 else ''
    print(f"  {feat:35s}: {sign}{coef:.3f} {bar}")
Use SHAP when you need mathematically guaranteed attributions that sum exactly to the prediction, or when you have a tree-based model (TreeSHAP is fast and exact). Use LIME for a fast sanity check on any black-box model where SHAP is too slow, or when you want a human-readable conditional rule ("lag_365 > 116.5 contributed +0.82") rather than a raw scalar.
XAI in Practice — Domain-Specific Applications
| Domain | Common Model | Best XAI Method | Primary Explanation Goal | Key Lag to Explain |
|---|---|---|---|---|
| Energy | LightGBM + lag features | TreeSHAP | Which weather / demand lag drove forecast? | Temperature lag-24, lag-168 |
| Retail | Prophet + holiday regressors | Decomposition + SHAP | How much did holiday add? | lag-7 (weekly), holiday window |
| Finance | LSTM / Transformer | Attention weights + IG | Which past prices drove the signal? | lag-1 to lag-5 (high-frequency) |
| Healthcare | LightGBM / XGBoost | TreeSHAP | Why is bed demand forecast high? | lag-7, lag-14 (admission cycles) |
| Manufacturing | ARIMA + exogenous vars | Coefficient inspection + LIME | Which external regressor (raw material price?) is dominant? | External regressors, lag-1 |
| Transport | N-BEATS / TFT | Integrated Gradients | Which hour-of-day history drove peak prediction? | Hourly lags (1–24) |
Concept Drift Detection via SHAP — XAI as a Monitoring Tool
Python: SHAP-Based Drift Detection
import pandas as pd
import numpy as np
import shap
# ── Assume model, explainer, X_test are from Section 04 ───────
# Split test window into early and late halves (simulates drift)
mid = len(X_test) // 2
X_early = X_test.iloc[:mid]
X_late = X_test.iloc[mid:]
shap_early = explainer.shap_values(X_early)
shap_late = explainer.shap_values(X_late)
# ── Mean absolute SHAP per feature in each period ─────────────
early_imp = pd.Series(np.abs(shap_early).mean(axis=0), index=feat_cols)
late_imp = pd.Series(np.abs(shap_late).mean(axis=0), index=feat_cols)
drift = pd.DataFrame({
    'Early Mean |SHAP|': early_imp,
    'Late Mean |SHAP|': late_imp,
    'Change %': ((late_imp - early_imp) / (early_imp + 1e-9)) * 100
}).sort_values('Change %', key=abs, ascending=False)
print(drift.round(3).to_string())
# Alert: flag any feature whose SHAP importance changed by >20%
flagged = drift[drift['Change %'].abs() > 20]
if not flagged.empty:
    print(f"\n⚠ DRIFT ALERT: {len(flagged)} features changed significantly")
    print(flagged)
A drop in lag_7 and day_of_week SHAP importance signals that the weekly cycle has weakened in recent data. The model trained on the old cycle may now overweight it. This is not a model failure yet; it is a leading indicator. Retrain before accuracy collapses. SHAP-based monitoring can surface drift before standard accuracy monitoring triggers an alert, because attribution shifts are visible as soon as new inputs arrive, without waiting for the actuals.
Full XAI Pipeline — End to End
Name every feature descriptively: lag_1, lag_7, lag_365, rolling_mean_7. Avoid generic feature_0 naming, which makes SHAP output unreadable. Add domain features (day of week, month, holiday flag) as named columns.
Split the data chronologically (shuffle=False in sklearn, TimeSeriesSplit for cross-validation). Models that violate this introduce future data leakage, making SHAP values meaningless: the model may explain a pattern it couldn't actually have seen at inference time.
Pick the explainer that matches the model: TreeExplainer for tree models, GradientExplainer for PyTorch/TF. Compute both global importance (mean |SHAP| per feature across all test points) and local explanations (waterfall or force plot) for the specific forecasts that deviated most from actuals.
Validate the explanation with a domain expert: "Does it make sense that lag_365 is the top driver?" If SHAP says row_id is the most important feature, something is badly wrong. This step separates technically valid explanations from actionably correct ones.
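The chronological-split requirement is easy to state and easy to get wrong, so here is a minimal sketch of an expanding-window splitter in plain Python. It is in the spirit of sklearn's TimeSeriesSplit but is not that class; the helper name `expanding_window_splits` and its parameters are hypothetical.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx) pairs where training always precedes testing."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("Not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(0, test_start))                      # everything before
        test_idx = list(range(test_start, test_start + test_size))  # the next block
        yield train_idx, test_idx

splits = list(expanding_window_splits(n_samples=100, n_splits=3, test_size=10))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
```

Every fold trains only on rows that come before its test block, so any SHAP values computed on a test fold describe a model that could not have seen those rows.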
Common Pitfalls in Time Series XAI
| # | Pitfall | What Goes Wrong | Fix |
|---|---|---|---|
| 1 | Shuffling the test split | Future data leaks into training → SHAP explains a ghost model | Always use shuffle=False or TimeSeriesSplit |
| 2 | Using KernelSHAP on large datasets | Cost grows with rows explained × background rows × sampled coalitions, so explaining 10k rows can take many hours | Use TreeSHAP for tree models; sample to ≤1000 rows for KernelSHAP |
| 3 | Treating attention as ground truth | Attention weights are not mathematically equivalent to feature importance | Combine attention with gradient-based methods for verification |
| 4 | Explaining a poorly calibrated model | Explaining a wrong model gives wrong explanations — confidently | Validate model accuracy before investing in XAI |
| 5 | Ignoring correlated lags | lag_1 and lag_2 are correlated — SHAP splits credit arbitrarily | Use SHAP interaction values or reduce lag redundancy with PCA |
| 6 | Explaining multi-step forecasts as single-step | SHAP values from a one-step-ahead model are presented as the explanation for an h=7 forecast, so they describe the wrong horizon | Produce separate SHAP values for each forecast horizon h |
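For pitfall 6, the per-horizon setup can be sketched with pandas alone. The helper below builds one shared feature table and one target column per horizon, so that a separate model (and therefore a separate set of SHAP values) can be fit for each h. The function name `make_direct_targets` and the column naming are hypothetical, and the model fitting and SHAP steps are intentionally left out.

```python
import numpy as np
import pandas as pd

def make_direct_targets(series: pd.Series, lags, horizons) -> pd.DataFrame:
    """One row per forecast origin: lag features plus one target per horizon."""
    frame = pd.DataFrame({f"lag_{lag}": series.shift(lag) for lag in lags})
    for h in horizons:
        # target_h1 is the current value (one step after lag_1),
        # target_h7 is the value six steps later, and so on
        frame[f"target_h{h}"] = series.shift(-(h - 1))
    return frame.dropna()

# Toy series 0, 1, 2, ... so the alignment is easy to verify by eye
toy = pd.Series(np.arange(20, dtype=float))
table = make_direct_targets(toy, lags=[1, 2], horizons=[1, 7])
print(table.head(3))

# Each horizon would get its own model and its own explanation, for example:
#   for h in horizons: fit on the lag_* columns vs. target_h{h}, then explain that model
```

In the toy table, `target_h7` is always seven steps ahead of `lag_1`, which confirms that each target column is aligned to its own horizon.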
XAI Methods — Comparison Table
| Method | Model Agnostic | Local / Global | Temporal Awareness | Speed | Best For |
|---|---|---|---|---|---|
| TreeSHAP | Tree models only | Both | Lag features = implicit | Very fast, O(TLD²) for T trees, L leaves, depth D | LightGBM, XGBoost, RF |
| KernelSHAP | Yes, any model | Local | Lag features = implicit | Slow: many model evaluations per explained row | Neural networks, ARIMA wraps |
| LIME | Yes | Local | None — assumes independence | Fast | Quick sanity checks |
| Attention | Attention models only | Local | Direct — time-step weights | Free (already computed) | TFT, Transformer, LSTM-Attn |
| Integrated Gradients | Differentiable only | Local | Direct — input timestep attribution | Medium | N-BEATS, LSTM, TCN |
| Permutation Importance | Yes | Global | Lag features = implicit | Slow — n_features × validation passes | Global feature ranking, drift checks |
| Prophet Decomposition | Prophet only | Both | Structural — trend/season/holiday | Instant — built-in | Retail, supply chain, operations |
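Permutation importance is the one method in the table that needs no explainer library at all, so a from-scratch sketch is short. The version below shuffles one column at a time and records the rise in mean absolute error. The function name, the toy model, and the synthetic data are hypothetical. Note that shuffling a lag column breaks its temporal alignment with the other lags, which is the same independence caveat the table flags for LIME.

```python
import numpy as np

def permutation_importance(predict_fn, X, y, n_repeats=5, seed=0):
    """Mean increase in MAE when each column is shuffled, averaged over repeats."""
    rng = np.random.default_rng(seed)
    base_mae = np.mean(np.abs(predict_fn(X) - y))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # break column j only
            perm_mae = np.mean(np.abs(predict_fn(X_perm) - y))
            increases.append(perm_mae - base_mae)
        importances[j] = np.mean(increases)
    return importances

def toy_model(X):
    # Hypothetical forecaster: strong use of column 0, weak use of column 1, none of column 2
    return 3.0 * X[:, 0] + 0.5 * X[:, 1]

rng = np.random.default_rng(42)
X_val = rng.normal(size=(500, 3))
y_val = toy_model(X_val)          # a perfect model keeps the example easy to read

imp = permutation_importance(toy_model, X_val, y_val)
for name, value in zip(["lag_1", "lag_7", "unused"], imp):
    print(f"{name:7s}: MAE increase {value:.3f}")
```

The cost is n_features × n_repeats prediction passes over the validation set, which is why the table lists the method as slow.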
Golden Rules — XAI for Time Series Forecasting
Never shuffle time-ordered data: use shuffle=False or TimeSeriesSplit.
Give features meaningful names such as lag_7, rolling_mean_28, holiday_flag. Anonymous features like x_4 make every SHAP plot unreadable and every stakeholder conversation impossible. Interpretability starts at feature engineering, not at the XAI library.
If SHAP says timestamp_id is the most important feature, you have a data leakage bug, not a profound discovery. Always cross-check with subject matter experts. Technically valid SHAP on a poorly specified model produces confidently wrong explanations.