The Story That Explains ARIMA
He notices something clever: today's price is almost always close to yesterday's price, adjusted for a small drift. And whenever the market over-reacted to news last week, it corrected itself this week. He uses both observations — the momentum (past values) and the correction (past errors) — to make a remarkably accurate prediction.
That detective is ARIMA. It reads only the past of the series itself. No external features. No domain knowledge. Just the signal hidden inside the history of the data.
ARIMA stands for AutoRegressive Integrated Moving Average. It is the most widely used classical model for univariate time series forecasting — from stock prices and economic indicators to weather readings and sensor data. It works by combining three distinct mechanisms, each solving a specific problem that raw time series data presents.
ARIMA is a univariate model — it uses only one variable: the series itself. It finds patterns within that single stream of numbers across time. Unlike regression, it never looks at external predictors. The entire signal must live inside the series's own past values and past errors.
Quick Primer — What Is a Time Series?
Before ARIMA makes sense, you need to see the four building blocks every time series is made of. Think of a time series as a song: it has a melody (trend), a beat (seasonality), long-form musical movements (cycles), and random crackling from the speakers (noise).
The raw series (gold) is the sum of a smooth trend (blue dashes), a repeating seasonal wave (green), and small unpredictable noise (purple). ARIMA works best after the trend is removed via differencing.
Stationarity — The One Rule ARIMA Cannot Break
ARIMA is the carpenter. Your time series is the workbench. Stationarity means the workbench is flat — the statistical properties (mean, variance, autocorrelation structure) do not change over time. Without it, ARIMA's calculations are built on shifting ground and every forecast will be unreliable.
Left: non-stationary series — the mean drifts upward. ARIMA will fail here. Right: stationary series — values oscillate around a stable mean. ARIMA works here. The fix for the left panel is differencing — subtract yesterday from today.
| Property | Behaviour |
|---|---|
| Mean | Changes over time (drifts up/down) |
| Variance | Often grows larger over time |
| Autocorrelation | Depends on when you measure |
| ACF plot | Decays very slowly, linearly |
| ADF p-value | > 0.05 (fail to reject unit root) |
| ARIMA ready? | No — must difference first |
| Property | Behaviour |
|---|---|
| Mean | Constant across the whole series |
| Variance | Constant (homoscedastic) |
| Autocorrelation | Depends only on lag, not time |
| ACF plot | Decays quickly to zero |
| ADF p-value | < 0.05 (reject unit root) |
| ARIMA ready? | Yes ✓ |
adfuller(series). If p > 0.05 → non-stationary → you must act.
ACF & PACF — Reading the Model's X-Ray
Before you pick p and q, you need to read two diagnostic plots. These are the X-ray and MRI of your series — they reveal exactly what memory structure is hiding inside.
Blue dashed lines = 95% confidence bands. Bars outside the bands are statistically significant. ACF cutting off at lag q → MA(q). PACF cutting off at lag p → AR(p). Both tailing off slowly → ARMA(p,q).
| ACF Pattern | PACF Pattern | Model to Try | Real-World Example |
|---|---|---|---|
| Cuts off after lag q | Tails off (decays gradually) | MA(q) | Queue wait times after a sudden influx |
| Tails off (decays gradually) | Cuts off after lag p | AR(p) | Daily temperature (today ≈ yesterday) |
| Tails off with damped oscillation | Tails off with damped oscillation | ARMA(p,q) | Stock returns — both momentum and reversal |
| Very slow, near-linear decay | First spike near 1.0 | Difference first! d > 0 | Stock price levels — random walk |
| Significant spikes at lags 12, 24… | Significant spikes at lags 12, 24… | SARIMA with s=12 | Monthly retail sales — yearly season |
ARIMA(p, d, q) — Dissecting the Three Parameters
Three gears, three jobs. The AR gear (p) regresses on past values. The Integration gear (d) removes the trend by differencing. The MA gear (q) uses past forecast errors to self-correct.
When unsure: start with ARIMA(1,1,1) as your baseline.
Then try ARIMA(2,1,2) and ARIMA(1,1,0).
Compare using AIC (lower is better). Never go above p=3 or q=3 without strong
ACF/PACF justification — more parameters usually overfit. Use
auto_arima() from pmdarima to search automatically.
The Complete ARIMA Workflow — Step by Step
auto_arima() for automatic selection.ARIMA in Python — Full Working Example
We will use the classic airline passengers dataset (Box & Jenkins, 1976) — monthly passenger counts from 1949 to 1960. It has a clear upward trend and strong yearly seasonality, making it a perfect test case.
# ─── 0. Install if needed ────────────────────────────────────────────────────
# pip install statsmodels pmdarima matplotlib pandas numpy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox
# ─── 1. Load data ─────────────────────────────────────────────────────────────
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv'
df = pd.read_csv(url, header=0, index_col=0, parse_dates=True)
series = df.squeeze()
print(f"Shape: {series.shape} | Period: {series.index[0]} → {series.index[-1]}")
# ─── 2. Stationarity test (raw series) ────────────────────────────────────────
adf_raw = adfuller(series, autolag='AIC')
print(f"[RAW] ADF stat={adf_raw[0]:.4f} p={adf_raw[1]:.4f}")
# p >> 0.05 → non-stationary → must transform
# ─── 3. Log + first difference ────────────────────────────────────────────────
log_series = np.log(series) # stabilise variance
diff_series = log_series.diff().dropna() # remove trend (d = 1)
adf_diff = adfuller(diff_series, autolag='AIC')
print(f"[DIFF] ADF stat={adf_diff[0]:.4f} p={adf_diff[1]:.6f}")
# ─── 4. ACF / PACF plots (on differenced series) ──────────────────────────────
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(diff_series, lags=30, ax=ax1, title='ACF — log-differenced airline')
plot_pacf(diff_series, lags=30, ax=ax2, title='PACF — log-differenced airline')
plt.tight_layout()
plt.savefig('acf_pacf.png', dpi=120)
plt.show()
# ACF: significant spikes at lags 1, 12 — suggests MA(1) + seasonal component
# PACF: spike at lag 1 — suggests AR(1)
# ─── 5. Fit candidate ARIMA models, compare AIC ───────────────────────────────
candidates = [(1,1,0), (0,1,1), (1,1,1), (2,1,1), (1,1,2)]
results = []
for order in candidates:
try:
m = ARIMA(log_series, order=order).fit()
results.append({'order': order, 'AIC': round(m.aic, 2), 'BIC': round(m.bic, 2)})
except:
pass
res_df = pd.DataFrame(results).sort_values('AIC')
print(res_df.to_string(index=False))
# ─── 6. Fit best model ARIMA(1,1,1) ──────────────────────────────────────────
best_model = ARIMA(log_series, order=(1, 1, 1))
best_fit = best_model.fit()
print(best_fit.summary())
# ─── 7. Residual diagnostics ─────────────────────────────────────────────────
residuals = best_fit.resid
# Ljung-Box: p > 0.05 at all lags → white noise ✓
lb = acorr_ljungbox(residuals, lags=20, return_df=True)
sig_lags = lb[lb['lb_pvalue'] < 0.05]
print(f"Significant lags in Ljung-Box: {len(sig_lags)}")
# Normality of residuals
from scipy.stats import shapiro
stat, p_norm = shapiro(residuals[1:])
print(f"Shapiro-Wilk p = {p_norm:.4f} → {'Normal ✓' if p_norm > 0.05 else 'Non-normal ✗'}")
# ─── 8. Forecast 24 months ahead (on original scale) ────────────────────────
n_forecast = 24
forecast_obj = best_fit.get_forecast(steps=n_forecast)
fc_mean_log = forecast_obj.predicted_mean
fc_ci_log = forecast_obj.conf_int(alpha=0.05) # 95% CI
# Invert log transform
fc_mean = np.exp(fc_mean_log)
fc_lower = np.exp(fc_ci_log.iloc[:, 0])
fc_upper = np.exp(fc_ci_log.iloc[:, 1])
# Print first 6 forecasted values with intervals
out_df = pd.DataFrame({
'Forecast': fc_mean.round(0),
'Lower_95': fc_lower.round(0),
'Upper_95': fc_upper.round(0)
})
print(out_df.head(6))
Look at the widening confidence intervals — at 6 months ahead the range is already 378 to 536, nearly ±90 passengers. The airline data has a strong 12-month seasonal pattern that pure ARIMA(1,1,1) cannot model. The model is leaving systematic seasonal structure in the residuals. This is exactly the problem that SARIMA was invented to fix — and it is the focus of every section from here forward.
From ARIMA to SARIMA — Why Seasons Break Everything
This is inefficient and fragile. The correct solution is to give the model a dedicated seasonal layer that speaks in units of full seasons — "what happened exactly 12 months ago?" — rather than individual months. That is SARIMA: ARIMA with an added seasonal gear.
SARIMA (Seasonal ARIMA) extends ARIMA with a second set of three parameters that operate at the seasonal frequency rather than the observation frequency. The full model is written as ARIMA(p,d,q)(P,D,Q)[s].
SARIMA(p,d,q)(P,D,Q)[s] — Six Parameters Explained
The blue layer operates month-by-month (non-seasonal). The gold layer skips ahead by the full seasonal period s=12 — connecting January this year to January last year. Both layers run simultaneously inside one unified model.
| Parameter | Layer | What It Controls | How to Choose | Typical Values |
|---|---|---|---|---|
| p | Non-seasonal | AR order (past values) | PACF cut-off lag | 0, 1, 2 |
| d | Non-seasonal | Differencing order | ADF test; count diffs | 0, 1 (rarely 2) |
| q | Non-seasonal | MA order (past errors) | ACF cut-off lag | 0, 1, 2 |
| P | Seasonal | Seasonal AR lags | ACF/PACF spikes at s, 2s… | 0, 1 |
| D | Seasonal | Seasonal differencing | Check seasonal ACF decay | 0, 1 |
| Q | Seasonal | Seasonal MA errors | ACF spikes at seasonal lags | 0, 1 |
| s | Both | Season length | Domain knowledge / plot | 4, 7, 12, 24, 52 |
For monthly data with a yearly cycle, always start with SARIMA(1,1,1)(1,1,1)[12]. This tiny 6-parameter model handles a staggering range of real business data. One non-seasonal difference (d=1) removes trend. One seasonal difference (D=1) removes seasonality. One AR + MA at each level captures residual autocorrelation. Compete this baseline before trying anything more exotic.
What SARIMA's Forecast Actually Looks Like
Blue = historical data. Green dashes = in-sample fitted values (how well the model tracked history). Gold = 24-step-ahead forecast. The shaded region = 95% confidence interval — it widens as uncertainty compounds further into the future. A tight CI at horizon 1 widening to a wide CI at horizon 24 is completely expected and correct behaviour.
SARIMA in Python — Full Working Example
from statsmodels.tsa.statespace.sarimax import SARIMAX
import itertools
# ─── 1. Seasonal ADF — check if seasonal differencing is needed ───────────────
# A large spike at lag 12 in ACF of the diff'd series → D = 1
seas_diff = log_series.diff(12).dropna() # seasonal difference
both_diff = seas_diff.diff(1).dropna() # then non-seasonal difference
adf_both = adfuller(both_diff)
print(f"After log + seasonal diff + diff: p = {adf_both[1]:.6f}")
# p << 0.05 → fully stationary (d=1, D=1)
# ─── 2. Grid search over (p,q,P,Q) — fix d=1, D=1, s=12 ─────────────────────
p_range = [0, 1, 2]
q_range = [0, 1, 2]
P_range = [0, 1]
Q_range = [0, 1]
grid_results = []
for p, q, P, Q in itertools.product(p_range, q_range, P_range, Q_range):
try:
m = SARIMAX(
log_series,
order=(p, 1, q),
seasonal_order=(P, 1, Q, 12),
enforce_stationarity=False,
enforce_invertibility=False
).fit(disp=False)
grid_results.append({
'(p,d,q)(P,D,Q)[s]': f"({p},1,{q})({P},1,{Q})[12]",
'AIC': round(m.aic, 2)
})
except: pass
grid_df = pd.DataFrame(grid_results).sort_values('AIC')
print(grid_df.head(6).to_string(index=False))
# ─── 3. Fit the best model: SARIMA(1,1,1)(1,1,1)[12] ────────────────────────
sarima = SARIMAX(
log_series,
order=(1, 1, 1),
seasonal_order=(1, 1, 1, 12),
enforce_stationarity=False,
enforce_invertibility=False
)
sarima_fit = sarima.fit(disp=False)
print(sarima_fit.summary())
# ─── 4. Residual diagnostics ─────────────────────────────────────────────────
res_sarima = sarima_fit.resid
lb_sarima = acorr_ljungbox(res_sarima, lags=24, return_df=True)
sig_sarima = lb_sarima[lb_sarima['lb_pvalue'] < 0.05]
print(f"\nLjung-Box significant lags: {len(sig_sarima)} (want 0)")
stat2, p2 = shapiro(res_sarima[13:])
print(f"Shapiro-Wilk p = {p2:.4f} → {'Normal ✓' if p2 > 0.05 else 'Non-normal'}")
# ─── 5. Hold-out test: last 24 months as test set ────────────────────────────
train_log = log_series[:-24]
test_orig = series[-24:]
sarima_train = SARIMAX(
train_log,
order=(1, 1, 1),
seasonal_order=(1, 1, 1, 12),
enforce_stationarity=False,
enforce_invertibility=False
).fit(disp=False)
fc_obj = sarima_train.get_forecast(steps=24)
fc_mean = np.exp(fc_obj.predicted_mean)
fc_lower = np.exp(fc_obj.conf_int().iloc[:, 0])
fc_upper = np.exp(fc_obj.conf_int().iloc[:, 1])
# ─── 6. Evaluate ─────────────────────────────────────────────────────────────
from sklearn.metrics import mean_absolute_error, mean_squared_error
mae = mean_absolute_error(test_orig, fc_mean)
rmse = np.sqrt(mean_squared_error(test_orig, fc_mean))
mape = np.mean(np.abs((test_orig.values - fc_mean.values) / test_orig.values)) * 100
# Naïve seasonal benchmark: repeat same month from 12 months ago
naive_seas = series[-36:-12].values
mae_naive = mean_absolute_error(test_orig, naive_seas)
mase = mae / mae_naive
print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAPE: {mape:.2f}%")
print(f"MASE: {mase:.4f} → {'Beats naïve ✓' if mase < 1 else 'Does NOT beat naïve ✗'}")
# ─── 7. Full auto_arima shortcut (pmdarima) ───────────────────────────────────
from pmdarima import auto_arima
auto_model = auto_arima(
log_series,
seasonal=True,
m=12,
d=None, # auto-detect
D=None, # auto-detect seasonal differencing
stepwise=True, # faster than full grid search
information_criterion='aic',
trace=True,
error_action='ignore',
suppress_warnings=True
)
print(f"\nBest model: {auto_model.order} Seasonal: {auto_model.seasonal_order}")
print(f"AIC: {auto_model.aic():.2f}")
ARIMA vs SARIMA — Head-to-Head Comparison
Left: ARIMA residual ACF shows a massive unexplained spike at lag 12 — the model left the entire seasonal pattern in the residuals. Right: SARIMA residual ACF — all bars inside confidence bands, confirming white-noise residuals. This is the single best visual argument for always using SARIMA when seasonal data is present.
| Property | ARIMA(p,d,q) | SARIMA(p,d,q)(P,D,Q)[s] |
|---|---|---|
| Models seasonality | No — only via long AR/MA lags | Yes — dedicated seasonal layer |
| Parameters | 3 (p,d,q) | 7 (p,d,q,P,D,Q,s) |
| Data required | Smaller datasets fine | Needs ≥ 3–4 full seasonal cycles |
| Residual ACF at lag s | Usually still significant spike | Inside confidence bands ✓ |
| MAPE on seasonal data | 5–15% typical | 1–4% typical |
| AIC on airline data | −242 | −302 (lower = better) |
| Training speed | Fast | Slower (more parameters) |
| Best for | Non-seasonal or unknown seasonality | Any series with a clear repeating cycle |
Decision Flowchart — Which Model to Use?
Follow the arrows. Stationarity → differencing → seasonality check → model choice. The only decision you need to make before writing code.
Evaluation Metrics — How to Know If Your Model Is Good
| Metric | Formula | Scale-free? | Punishes Large Errors? | Best Use |
|---|---|---|---|---|
| MAE | mean(|yₜ − ŷₜ|) | No — same units as data | No — equal weight | Intuitive error in original units |
| RMSE | √mean((yₜ − ŷₜ)²) | No | Yes — squares big errors | When large errors are especially costly |
| MAPE | mean(|yₜ−ŷₜ|/|yₜ|)×100 | Yes — percentage | No | Business reporting; comparing across series |
| MASE | MAE / MAE(naïve) | Yes | No | MASE < 1 = beats naïve; gold standard metric |
| AIC / BIC | −2·ln(L) + k·penalty | N/A | N/A | Model selection only — never use for forecast quality |
Never shuffle time series data before splitting train/test. You must split chronologically — train on earlier data, test on later data. If you randomly split rows, future information leaks into the training set, the model effectively memorises the future, and your metrics are completely fictitious. A model that "achieves 1% MAPE" via random split may actually be 15% on a proper temporal split.