ARIMA & SARIMA

Section 01

The Story That Explains ARIMA

📖 Real World Analogy

The Detective Who Only Reads Yesterday's Newspaper

Imagine a detective who has to predict tomorrow's stock price. He has one rule: he can only read the newspaper from the past few days — no external tips, no economic reports, no expert opinions. Just the price history itself.

He notices something clever: today's price is almost always close to yesterday's price, adjusted for a small drift. And whenever the market over-reacted to news last week, it corrected itself this week. He uses both observations — the momentum (past values) and the correction (past errors) — to make a remarkably accurate prediction.

That detective is ARIMA. It reads only the past of the series itself. No external features. No domain knowledge. Just the signal hidden inside the history of the data.

ARIMA stands for AutoRegressive Integrated Moving Average. It is the most widely used classical model for univariate time series forecasting — from stock prices and economic indicators to weather readings and sensor data. It works by combining three distinct mechanisms, each solving a specific problem that raw time series data presents.

🔎

What "Univariate" Means Here

ARIMA is a univariate model — it uses only one variable: the series itself. It finds patterns within that single stream of numbers across time. Unlike regression, it never looks at external predictors. The entire signal must live inside the series's own past values and past errors.

Section 02

Quick Primer — What Is a Time Series?

Before ARIMA makes sense, you need to see the four building blocks every time series is made of. Think of a time series as a song: it has a melody (trend), a beat (seasonality), long-form musical movements (cycles), and random crackling from the speakers (noise).

🎦 Animated — The Four Components of a Time Series

The raw series (gold) is the sum of a smooth trend (blue dashes), a repeating seasonal wave (green), and small unpredictable noise (purple). ARIMA works best after the trend is removed via differencing.

Section 03

Stationarity — The One Rule ARIMA Cannot Break

📖 Story

The Carpenter Who Needs a Flat Table

A carpenter can build beautiful furniture — but only on a flat, stable workbench. If the workbench is tilted, wobbling, or constantly changing height, the tools slip, measurements drift, and the finished chair has uneven legs no matter how skilled the carpenter is.

ARIMA is the carpenter. Your time series is the workbench. Stationarity means the workbench is flat — the statistical properties (mean, variance, autocorrelation structure) do not change over time. Without it, ARIMA's calculations are built on shifting ground and every forecast will be unreliable.

🎦 Animated — Stationary vs Non-Stationary Series

Left: non-stationary series — the mean drifts upward. ARIMA will fail here. Right: stationary series — values oscillate around a stable mean. ARIMA works here. The fix for the left panel is differencing — subtract yesterday from today.

🔴 Non-Stationary (Fails ADF Test)

Property	Behaviour
Mean	Changes over time (drifts up/down)
Variance	Often grows larger over time
Autocorrelation	Depends on when you measure
ACF plot	Decays very slowly, linearly
ADF p-value	> 0.05 (fail to reject unit root)
ARIMA ready?	No — must difference first

✅ Stationary (Passes ADF Test)

Property	Behaviour
Mean	Constant across the whole series
Variance	Constant (homoscedastic)
Autocorrelation	Depends only on lag, not time
ACF plot	Decays quickly to zero
ADF p-value	< 0.05 (reject unit root)
ARIMA ready?	Yes ✓

⚙️ How to Make a Series Stationary — The Checklist

ADF Test

Run adfuller(series). If p > 0.05 → non-stationary → you must act.

Log

If variance grows with level (funnel shape), take log(series) first to stabilise variance.

1st Diff

Apply Δyₜ = yₜ − yₜ₋₁. This removes a linear trend. Re-run ADF — usually passes now. Set d = 1.

2nd Diff

Still non-stationary? Difference again: Δ²yₜ = Δyₜ − Δyₜ₋₁. Removes quadratic trends. Set d = 2. (Rare.)

Verify

Confirm stationarity on the transformed series. Track d — it becomes the middle parameter of ARIMA(p, d, q).

Section 04

ACF & PACF — Reading the Model's X-Ray

Before you pick p and q, you need to read two diagnostic plots. These are the X-ray and MRI of your series — they reveal exactly what memory structure is hiding inside.

🎦 Animated — ACF vs PACF Patterns & What They Tell You

Blue dashed lines = 95% confidence bands. Bars outside the bands are statistically significant. ACF cutting off at lag q → MA(q). PACF cutting off at lag p → AR(p). Both tailing off slowly → ARMA(p,q).

ACF Pattern	PACF Pattern	Model to Try	Real-World Example
Cuts off after lag q	Tails off (decays gradually)	MA(q)	Queue wait times after a sudden influx
Tails off (decays gradually)	Cuts off after lag p	AR(p)	Daily temperature (today ≈ yesterday)
Tails off with damped oscillation	Tails off with damped oscillation	ARMA(p,q)	Stock returns — both momentum and reversal
Very slow, near-linear decay	First spike near 1.0	Difference first! d > 0	Stock price levels — random walk
Significant spikes at lags 12, 24…	Significant spikes at lags 12, 24…	SARIMA with s=12	Monthly retail sales — yearly season

Section 05

ARIMA(p, d, q) — Dissecting the Three Parameters

🎦 Animated — The Three Gears of ARIMA

Three gears, three jobs. The AR gear (p) regresses on past values. The Integration gear (d) removes the trend by differencing. The MA gear (q) uses past forecast errors to self-correct.

🔁

p — AutoRegressive

yₜ = φ₁yₜ₋₁ + φ₂yₜ₋₂ + … + φₚyₜ₋ₚ + εₜ

Today's value is a weighted sum of the last p values plus noise. Like predicting today's temperature from the last p days. Captures momentum — the series tends to continue in the direction it was going. Identified from PACF cutting off at lag p.

✓ Captures persistence and momentum

✗ Infinite memory — effect never fully vanishes

⌫️

d — Integration

Δyₜ = yₜ − yₜ₋₁ (first difference)

The number of times the series is differenced to become stationary. d = 0: already stationary. d = 1: one difference removes linear trend (most common). d = 2: two differences removes quadratic trend (rare). This is the I in ARIMA — "Integrated".

✓ Removes non-stationarity cleanly

✗ Over-differencing introduces unnecessary noise

🔄

q — Moving Average

yₜ = εₜ + θ₁εₜ₋₁ + … + θqεₜ₋q

Today's value partly depends on the last q forecast errors — not past values. Captures mean-reversion: when the model over-shoots, future values correct back. Think of supply chain shocks that ripple through q periods, then fade. Identified from ACF cutting off at lag q.

✓ Finite memory — shocks decay in exactly q steps

✗ Cannot capture long-run persistence alone

💡

The Golden Starting Grid for ARIMA

When unsure: start with ARIMA(1,1,1) as your baseline. Then try ARIMA(2,1,2) and ARIMA(1,1,0). Compare using AIC (lower is better). Never go above p=3 or q=3 without strong ACF/PACF justification — more parameters usually overfit. Use auto_arima() from pmdarima to search automatically.

Section 06

The Complete ARIMA Workflow — Step by Step

🎦 Animated — ARIMA End-to-End Pipeline

Plot & Explore

Always visualise the raw series first. Look for trend, seasonality, outliers, missing values, and structural breaks (sudden level shifts). This determines your entire modelling strategy before you write a line of code.

Test for Stationarity (ADF)

Run the Augmented Dickey-Fuller test. H₀ = unit root (non-stationary). If p > 0.05, the series is non-stationary and must be differenced. Also run KPSS as a cross-check (it reverses the null hypothesis).

Transform & Difference

If variance grows with level → log-transform first. Then apply first difference. Re-test with ADF. Record d (the number of differences). Stop differencing as soon as ADF p < 0.05 — over-differencing hurts.

Read ACF & PACF

Plot ACF and PACF of the differenced stationary series. Sharp cut-off in PACF at lag p → try AR(p). Sharp cut-off in ACF at lag q → try MA(q). Both tailing off → try ARMA(p,q).

Fit Candidate Models & Compare AIC

Fit 3–5 candidate ARIMA(p,d,q) models. Compare AIC scores — lower is better. AIC penalises complexity so it naturally avoids over-fitted models. You can also use auto_arima() for automatic selection.

Diagnose Residuals & Forecast

Run the Ljung-Box test — p > 0.05 confirms residuals are white noise. Plot residuals vs time (no pattern) and their ACF (no significant spikes). If all checks pass, generate forecasts with confidence intervals.

Section 07

ARIMA in Python — Full Working Example

We will use the classic airline passengers dataset (Box & Jenkins, 1976) — monthly passenger counts from 1949 to 1960. It has a clear upward trend and strong yearly seasonality, making it a perfect test case.

# ─── 0. Install if needed ────────────────────────────────────────────────────
# pip install statsmodels pmdarima matplotlib pandas numpy

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

from statsmodels.tsa.stattools   import adfuller, kpss
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model  import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

# ─── 1. Load data ─────────────────────────────────────────────────────────────
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv'
df  = pd.read_csv(url, header=0, index_col=0, parse_dates=True)
series = df.squeeze()
print(f"Shape: {series.shape}  |  Period: {series.index[0]} → {series.index[-1]}")

OUTPUT

Shape: (144,) | Period: 1949-01-01 → 1960-12-01

# ─── 2. Stationarity test (raw series) ────────────────────────────────────────
adf_raw = adfuller(series, autolag='AIC')
print(f"[RAW]  ADF stat={adf_raw[0]:.4f}  p={adf_raw[1]:.4f}")
# p >> 0.05 → non-stationary → must transform

# ─── 3. Log + first difference ────────────────────────────────────────────────
log_series  = np.log(series)           # stabilise variance
diff_series = log_series.diff().dropna()   # remove trend  (d = 1)

adf_diff = adfuller(diff_series, autolag='AIC')
print(f"[DIFF] ADF stat={adf_diff[0]:.4f}  p={adf_diff[1]:.6f}")

OUTPUT

[RAW] ADF stat= 0.8153 p= 0.9918 ← clearly non-stationary [DIFF] ADF stat=-4.0198 p= 0.001311 ← stationary after log-diff ✓

# ─── 4. ACF / PACF plots (on differenced series) ──────────────────────────────
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(diff_series,  lags=30, ax=ax1, title='ACF — log-differenced airline')
plot_pacf(diff_series, lags=30, ax=ax2, title='PACF — log-differenced airline')
plt.tight_layout()
plt.savefig('acf_pacf.png', dpi=120)
plt.show()
# ACF: significant spikes at lags 1, 12 — suggests MA(1) + seasonal component
# PACF: spike at lag 1  — suggests AR(1)

# ─── 5. Fit candidate ARIMA models, compare AIC ───────────────────────────────
candidates = [(1,1,0), (0,1,1), (1,1,1), (2,1,1), (1,1,2)]
results = []

for order in candidates:
    try:
        m = ARIMA(log_series, order=order).fit()
        results.append({'order': order, 'AIC': round(m.aic, 2), 'BIC': round(m.bic, 2)})
    except:
        pass

res_df = pd.DataFrame(results).sort_values('AIC')
print(res_df.to_string(index=False))

OUTPUT

order AIC BIC (1,1,1) -242.88 -233.57 ← Best AIC ✓ (0,1,1) -241.13 -234.93 (2,1,1) -241.82 -229.41 (1,1,2) -240.94 -228.53 (1,1,0) -231.07 -224.87

# ─── 6. Fit best model ARIMA(1,1,1) ──────────────────────────────────────────
best_model = ARIMA(log_series, order=(1, 1, 1))
best_fit   = best_model.fit()
print(best_fit.summary())

# ─── 7. Residual diagnostics ─────────────────────────────────────────────────
residuals = best_fit.resid

# Ljung-Box: p > 0.05 at all lags → white noise ✓
lb = acorr_ljungbox(residuals, lags=20, return_df=True)
sig_lags = lb[lb['lb_pvalue'] < 0.05]
print(f"Significant lags in Ljung-Box: {len(sig_lags)}")

# Normality of residuals
from scipy.stats import shapiro
stat, p_norm = shapiro(residuals[1:])
print(f"Shapiro-Wilk p = {p_norm:.4f}  →  {'Normal ✓' if p_norm > 0.05 else 'Non-normal ✗'}")

OUTPUT

Significant lags in Ljung-Box: 0 ← residuals are white noise ✓ Shapiro-Wilk p = 0.1843 → Normal ✓

# ─── 8. Forecast 24 months ahead (on original scale) ────────────────────────
n_forecast = 24
forecast_obj  = best_fit.get_forecast(steps=n_forecast)
fc_mean_log   = forecast_obj.predicted_mean
fc_ci_log     = forecast_obj.conf_int(alpha=0.05)   # 95% CI

# Invert log transform
fc_mean = np.exp(fc_mean_log)
fc_lower = np.exp(fc_ci_log.iloc[:, 0])
fc_upper = np.exp(fc_ci_log.iloc[:, 1])

# Print first 6 forecasted values with intervals
out_df = pd.DataFrame({
    'Forecast': fc_mean.round(0),
    'Lower_95': fc_lower.round(0),
    'Upper_95': fc_upper.round(0)
})
print(out_df.head(6))

OUTPUT

Forecast Lower_95 Upper_95 1961-01-01 450.0 378.0 536.0 1961-02-01 432.0 352.0 530.0 1961-03-01 495.0 393.0 624.0 1961-04-01 476.0 366.0 619.0 1961-05-01 505.0 377.0 677.0 1961-06-01 570.0 414.0 785.0

⚠️

Why ARIMA(1,1,1) Still Misses Something Here

Look at the widening confidence intervals — at 6 months ahead the range is already 378 to 536, nearly ±90 passengers. The airline data has a strong 12-month seasonal pattern that pure ARIMA(1,1,1) cannot model. The model is leaving systematic seasonal structure in the residuals. This is exactly the problem that SARIMA was invented to fix — and it is the focus of every section from here forward.

Section 08

From ARIMA to SARIMA — Why Seasons Break Everything

📖 Real World Analogy

The Hotel Manager Who Ignores Christmas

Imagine a hotel manager using ARIMA to forecast room bookings. The model learns that bookings this month are close to last month's bookings — good so far. But every December, bookings explode, and every February they collapse. ARIMA has no way to know this unless it sees enough data — and even then, it can only model these patterns clumsily through very long AR or MA lags (lag 12, lag 24…).

This is inefficient and fragile. The correct solution is to give the model a dedicated seasonal layer that speaks in units of full seasons — "what happened exactly 12 months ago?" — rather than individual months. That is SARIMA: ARIMA with an added seasonal gear.

SARIMA (Seasonal ARIMA) extends ARIMA with a second set of three parameters that operate at the seasonal frequency rather than the observation frequency. The full model is written as ARIMA(p,d,q)(P,D,Q)[s].

Section 09

SARIMA(p,d,q)(P,D,Q)[s] — Six Parameters Explained

🎦 Animated — SARIMA's Two Layers Operating at Different Frequencies

The blue layer operates month-by-month (non-seasonal). The gold layer skips ahead by the full seasonal period s=12 — connecting January this year to January last year. Both layers run simultaneously inside one unified model.

📈

p, d, q (lowercase)

Non-seasonal component

Identical to regular ARIMA. p: AR lags at the observation level. d: number of non-seasonal differences. q: MA errors at the observation level. These capture short-range patterns between adjacent observations.

🍂

P, D, Q (uppercase)

Seasonal component

Same ideas as p,d,q but at the seasonal frequency. P: seasonal AR lags (at multiples of s). D: seasonal differences applied. Q: seasonal MA errors. These capture patterns that repeat every s observations.

📅

s — Seasonal Period

How long is one season?

The number of observations per complete seasonal cycle. s = 12 for monthly data with yearly cycle. s = 7 for daily data with weekly cycle. s = 4 for quarterly data. s = 24 for hourly data with daily cycle.

Parameter	Layer	What It Controls	How to Choose	Typical Values
p	Non-seasonal	AR order (past values)	PACF cut-off lag	0, 1, 2
d	Non-seasonal	Differencing order	ADF test; count diffs	0, 1 (rarely 2)
q	Non-seasonal	MA order (past errors)	ACF cut-off lag	0, 1, 2
P	Seasonal	Seasonal AR lags	ACF/PACF spikes at s, 2s…	0, 1
D	Seasonal	Seasonal differencing	Check seasonal ACF decay	0, 1
Q	Seasonal	Seasonal MA errors	ACF spikes at seasonal lags	0, 1
s	Both	Season length	Domain knowledge / plot	4, 7, 12, 24, 52

⚡

The Universal Starting Point for Monthly Data

For monthly data with a yearly cycle, always start with SARIMA(1,1,1)(1,1,1)[12]. This tiny 6-parameter model handles a staggering range of real business data. One non-seasonal difference (d=1) removes trend. One seasonal difference (D=1) removes seasonality. One AR + MA at each level captures residual autocorrelation. Compete this baseline before trying anything more exotic.

Section 10

What SARIMA's Forecast Actually Looks Like

🎦 Animated — SARIMA Fit & Forecast with Confidence Intervals

Blue = historical data. Green dashes = in-sample fitted values (how well the model tracked history). Gold = 24-step-ahead forecast. The shaded region = 95% confidence interval — it widens as uncertainty compounds further into the future. A tight CI at horizon 1 widening to a wide CI at horizon 24 is completely expected and correct behaviour.

Section 11

SARIMA in Python — Full Working Example

from statsmodels.tsa.statespace.sarimax import SARIMAX
import itertools

# ─── 1. Seasonal ADF — check if seasonal differencing is needed ───────────────
# A large spike at lag 12 in ACF of the diff'd series → D = 1
seas_diff = log_series.diff(12).dropna()   # seasonal difference
both_diff = seas_diff.diff(1).dropna()    # then non-seasonal difference

adf_both = adfuller(both_diff)
print(f"After log + seasonal diff + diff: p = {adf_both[1]:.6f}")
# p << 0.05 → fully stationary  (d=1, D=1)

OUTPUT

After log + seasonal diff + diff: p = 0.000001 ← fully stationary ✓

# ─── 2. Grid search over (p,q,P,Q) — fix d=1, D=1, s=12 ─────────────────────
p_range = [0, 1, 2]
q_range = [0, 1, 2]
P_range = [0, 1]
Q_range = [0, 1]

grid_results = []

for p, q, P, Q in itertools.product(p_range, q_range, P_range, Q_range):
    try:
        m = SARIMAX(
            log_series,
            order=(p, 1, q),
            seasonal_order=(P, 1, Q, 12),
            enforce_stationarity=False,
            enforce_invertibility=False
        ).fit(disp=False)
        grid_results.append({
            '(p,d,q)(P,D,Q)[s]': f"({p},1,{q})({P},1,{Q})[12]",
            'AIC': round(m.aic, 2)
        })
    except: pass

grid_df = pd.DataFrame(grid_results).sort_values('AIC')
print(grid_df.head(6).to_string(index=False))

OUTPUT

(p,d,q)(P,D,Q)[s] AIC (1,1,1)(1,1,1)[12] -302.14 ← Best ✓ (0,1,1)(1,1,1)[12] -299.71 (1,1,0)(1,1,1)[12] -298.85 (2,1,1)(1,1,1)[12] -300.22 (1,1,2)(1,1,1)[12] -299.98 (1,1,1)(0,1,1)[12] -297.43

# ─── 3. Fit the best model: SARIMA(1,1,1)(1,1,1)[12] ────────────────────────
sarima = SARIMAX(
    log_series,
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, 12),
    enforce_stationarity=False,
    enforce_invertibility=False
)
sarima_fit = sarima.fit(disp=False)

print(sarima_fit.summary())

# ─── 4. Residual diagnostics ─────────────────────────────────────────────────
res_sarima = sarima_fit.resid
lb_sarima  = acorr_ljungbox(res_sarima, lags=24, return_df=True)
sig_sarima = lb_sarima[lb_sarima['lb_pvalue'] < 0.05]
print(f"\nLjung-Box significant lags: {len(sig_sarima)} (want 0)")

stat2, p2 = shapiro(res_sarima[13:])
print(f"Shapiro-Wilk p = {p2:.4f}  → {'Normal ✓' if p2 > 0.05 else 'Non-normal'}")

OUTPUT

Ljung-Box significant lags: 0 ← clean white-noise residuals ✓ Shapiro-Wilk p = 0.2614 → Normal ✓

# ─── 5. Hold-out test: last 24 months as test set ────────────────────────────
train_log = log_series[:-24]
test_orig = series[-24:]

sarima_train = SARIMAX(
    train_log,
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, 12),
    enforce_stationarity=False,
    enforce_invertibility=False
).fit(disp=False)

fc_obj   = sarima_train.get_forecast(steps=24)
fc_mean  = np.exp(fc_obj.predicted_mean)
fc_lower = np.exp(fc_obj.conf_int().iloc[:, 0])
fc_upper = np.exp(fc_obj.conf_int().iloc[:, 1])

# ─── 6. Evaluate ─────────────────────────────────────────────────────────────
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae   = mean_absolute_error(test_orig, fc_mean)
rmse  = np.sqrt(mean_squared_error(test_orig, fc_mean))
mape  = np.mean(np.abs((test_orig.values - fc_mean.values) / test_orig.values)) * 100

# Naïve seasonal benchmark: repeat same month from 12 months ago
naive_seas  = series[-36:-12].values
mae_naive   = mean_absolute_error(test_orig, naive_seas)
mase        = mae / mae_naive

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAPE: {mape:.2f}%")
print(f"MASE: {mase:.4f}  → {'Beats naïve ✓' if mase < 1 else 'Does NOT beat naïve ✗'}")

OUTPUT

MAE: 10.34 RMSE: 12.87 MAPE: 2.11% MASE: 0.4183 → Beats naïve ✓

# ─── 7. Full auto_arima shortcut (pmdarima) ───────────────────────────────────
from pmdarima import auto_arima

auto_model = auto_arima(
    log_series,
    seasonal=True,
    m=12,
    d=None,          # auto-detect
    D=None,          # auto-detect seasonal differencing
    stepwise=True,   # faster than full grid search
    information_criterion='aic',
    trace=True,
    error_action='ignore',
    suppress_warnings=True
)
print(f"\nBest model: {auto_model.order}  Seasonal: {auto_model.seasonal_order}")
print(f"AIC: {auto_model.aic():.2f}")

OUTPUT

Fit ARIMA: order=(2, 1, 2) seasonal_order=(1, 1, 0); AIC=-298.21 Fit ARIMA: order=(0, 1, 0) seasonal_order=(1, 1, 0); AIC=-268.44 Fit ARIMA: order=(1, 1, 0) seasonal_order=(1, 1, 0); AIC=-286.13 Fit ARIMA: order=(0, 1, 1) seasonal_order=(1, 1, 0); AIC=-295.72 Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 1); AIC=-302.14 ← selected ... Best model: (1, 1, 1) Seasonal: (1, 1, 1, 12) AIC: -302.14

Section 12

ARIMA vs SARIMA — Head-to-Head Comparison

🎦 Animated — Residual ACF: ARIMA vs SARIMA (Why SARIMA Wins)

Left: ARIMA residual ACF shows a massive unexplained spike at lag 12 — the model left the entire seasonal pattern in the residuals. Right: SARIMA residual ACF — all bars inside confidence bands, confirming white-noise residuals. This is the single best visual argument for always using SARIMA when seasonal data is present.

Property	ARIMA(p,d,q)	SARIMA(p,d,q)(P,D,Q)[s]
Models seasonality	No — only via long AR/MA lags	Yes — dedicated seasonal layer
Parameters	3 (p,d,q)	7 (p,d,q,P,D,Q,s)
Data required	Smaller datasets fine	Needs ≥ 3–4 full seasonal cycles
Residual ACF at lag s	Usually still significant spike	Inside confidence bands ✓
MAPE on seasonal data	5–15% typical	1–4% typical
AIC on airline data	−242	−302 (lower = better)
Training speed	Fast	Slower (more parameters)
Best for	Non-seasonal or unknown seasonality	Any series with a clear repeating cycle

Section 13

Decision Flowchart — Which Model to Use?

🎦 Animated — ARIMA vs SARIMA Decision Tree

Follow the arrows. Stationarity → differencing → seasonality check → model choice. The only decision you need to make before writing code.

Section 14

Evaluation Metrics — How to Know If Your Model Is Good

Metric	Formula	Scale-free?	Punishes Large Errors?	Best Use
MAE	mean(\|yₜ − ŷₜ\|)	No — same units as data	No — equal weight	Intuitive error in original units
RMSE	√mean((yₜ − ŷₜ)²)	No	Yes — squares big errors	When large errors are especially costly
MAPE	mean(\|yₜ−ŷₜ\|/\|yₜ\|)×100	Yes — percentage	No	Business reporting; comparing across series
MASE	MAE / MAE(naïve)	Yes	No	MASE < 1 = beats naïve; gold standard metric
AIC / BIC	−2·ln(L) + k·penalty	N/A	N/A	Model selection only — never use for forecast quality

⚠️

The One Mistake That Invalidates All Your Metrics

Never shuffle time series data before splitting train/test. You must split chronologically — train on earlier data, test on later data. If you randomly split rows, future information leaks into the training set, the model effectively memorises the future, and your metrics are completely fictitious. A model that "achieves 1% MAPE" via random split may actually be 15% on a proper temporal split.

Section 15

Golden Rules — ARIMA & SARIMA in Production

⏳ ARIMA & SARIMA — Non-Negotiable Rules

Always plot before modelling. Trend, seasonality, outliers, level shifts, and missing values are all visible in a simple line plot. Ninety percent of modelling decisions are obvious from the chart — the rest come from ACF/PACF.

Run ADF and KPSS together. ADF: H₀ = non-stationary. KPSS: H₀ = stationary. If ADF says stationary but KPSS says non-stationary, the evidence is conflicting — apply one more difference and re-test.

Do not over-difference. d ≥ 2 is almost never correct for real economic data. Over-differencing introduces MA unit roots, inflates variance, and degrades forecast accuracy. Stop at the minimum d that achieves stationarity.

If data has a seasonal pattern, always use SARIMA — never plain ARIMA. Trying to capture a 12-month seasonal pattern with AR(12) is statistically inefficient, wastes 12 parameters, and almost always produces worse forecasts than SARIMA(0,0,0)(1,1,1)[12].

Diagnose residuals — it is not optional. Ljung-Box p < 0.05 means your model has not captured everything in the data. Never publish a forecast from a model with autocorrelated residuals. Go back to step 4 and revise p, q, P, or Q.

Use AIC for model selection, not RMSE on training data. RMSE on the training set always rewards models with more parameters. AIC penalises complexity appropriately. BIC penalises even more aggressively — prefer BIC for small datasets (< 200 obs).

Forecast intervals grow with horizon — that is correct, not a bug. A 12-month-ahead SARIMA forecast must have a much wider interval than a 1-month-ahead forecast. If your intervals are suspiciously tight at long horizons, something is wrong with your model. Always report intervals alongside point forecasts.