The Story That Explains ARMA
The first desk is the Echo Chamber. Every morning its editor opens yesterday's newspaper and asks: "What did we say last week? What about the week before?" Today's tone closely mirrors the past few editions — markets are still climbing, so we say they are climbing. This is pure momentum — the series echoing its own history.
The second desk is the Correction Desk. Its editor reads the past few mistakes the paper made and corrects course: "We over-estimated the rally by 40 points last Tuesday — let's be more cautious today." This is mean-reversion — using past errors to adjust future estimates.
Most real-world time series are written by both desks working simultaneously. The Echo Chamber is the AutoRegressive (AR) part. The Correction Desk is the Moving Average (MA) part. When they work together, the result is ARMA.
ARMA(p, q) — AutoRegressive Moving Average — is the core model for stationary time series. It describes a series whose current value depends on both its own p past values and the last q random shocks it absorbed. No differencing, no seasonality — just the pure signal structure of a series that already sits on flat, stable ground.
ARMA assumes your series is already stationary. ARIMA adds the I (Integrated) step — differencing the series until it becomes stationary — then hands that stationary result to ARMA to model. Mastering ARMA is therefore the prerequisite for understanding every member of the ARIMA family.
Stationarity — The Ground ARMA Stands On
ARMA only works on stationary series. Before fitting any ARMA model you must confirm your series satisfies three conditions — all simultaneously.
Left: non-stationary series — mean drifts upward, variance expands. ARMA fails here. Right: stationary series — oscillates around a flat mean inside a stable variance band (green shading). ARMA lives here. The green shaded band represents ±1σ around the mean.
The AR Part — AutoRegressive Memory
An AR(p) process says: today's value is a weighted linear combination of the last p values, plus a random shock.
Each bar group shows the influence of past lags on today's value for three AR(1) coefficients. φ=0.9 (blue): influence decays very slowly — strong, persistent memory. φ=0.5 (gold): moderate memory, halving with each lag. φ=0.1 (purple): nearly immediate forgetting. The curve through each group is the exponential decay envelope.
An AR(p) model is stationary only if all roots of its characteristic polynomial lie outside the unit circle. For AR(1) this simply means |φ₁| < 1. For AR(2) the condition is |φ₂| < 1, φ₂ + φ₁ < 1, and φ₂ − φ₁ < 1. If any root is on or inside the unit circle, the series will explode or random-walk — not stationary, and ARMA will produce nonsense.
The MA Part — Moving Average Shocks
This is a Moving Average (MA) process — a random shock at time t propagates through exactly q subsequent periods and then vanishes completely. Unlike AR (which has infinite, exponentially decaying memory), MA has a finite, hard cutoff: after exactly q lags, the shock is gone.
A single shock at time t (tallest purple bar) propagates through exactly q=3 lags with coefficients θ₁=0.8, θ₂=0.5, θ₃=0.2 — then drops to exactly zero at lag 4 (gold cutoff line). This hard cutoff is the defining fingerprint of an MA process and is visible directly in the ACF plot.
Unlike AR, an MA(q) process is always stationary for any finite θ values. This is because it is just a finite weighted sum of white noise terms — which are themselves stationary. The only condition MA requires is invertibility (|roots of MA polynomial| > 1), which ensures unique parameter estimates. For MA(1): |θ₁| < 1.
ARMA(p, q) — Combining Both Mechanisms
In the real world, most stationary series exhibit both momentum (AR) and shock-absorption (MA). ARMA combines both mechanisms in a single, parsimonious equation.
Three inputs feed the ARMA combiner: past observed values (blue, AR part), the current random shock (red), and past shocks (purple, MA part). The combiner applies the learned φ and θ coefficients to produce yₜ (green). The output mini-signal shows how the combined process oscillates around a mean.
| Model | Equation | Memory Type | ACF Shape | PACF Shape | Best For |
|---|---|---|---|---|---|
| AR(p) | φ₁yₜ₋₁ + … + εₜ | Infinite, exponential decay | Tails off slowly | Cuts off at lag p | Persistent, momentum-driven series |
| MA(q) | εₜ + θ₁εₜ₋₁ + … | Finite — hard cutoff at q | Cuts off at lag q | Tails off slowly | Shock-driven, mean-reverting series |
| ARMA(p,q) | φ…yₜ₋ₚ + εₜ + θ…εₜ₋q | Both: infinite AR + finite MA | Tails off (no hard cutoff) | Tails off (no hard cutoff) | Most real economic/financial series |
| White Noise | εₜ only | None — independent | No significant spikes | No significant spikes | Residuals of a well-fitted model |
ACF & PACF — Reading the Model's Fingerprint
The ACF and PACF plots are your primary tools for choosing p and q. Each model type leaves a distinct fingerprint in these two plots. Learning to read them is the core practical skill of ARMA modelling.
Three fingerprint patterns. Row 1 (AR): ACF tails off, PACF cuts sharply at lag 2 → AR(2). Row 2 (MA): ACF cuts sharply at lag 2, PACF tails off → MA(2). Row 3 (ARMA): both ACF and PACF tail off gradually — no hard cutoff anywhere → ARMA(p,q). Dashed line = 95% confidence band.
Choosing p and q — Information Criteria & Grid Search
Reading ACF/PACF gives initial candidates for p and q. The definitive selection uses AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion) to compare fitted models objectively.
Each cell shows the AIC for one (p,q) combination. Green = good, red = poor. The lowest AIC (−268) belongs to ARMA(1,1) — the simplest model that explains the most variance. Notice ARMA(1,3) and ARMA(2,2) have worse AIC than ARMA(1,1) despite more parameters — the complexity penalty outweighs any marginal fit improvement.
Step 1: Read ACF/PACF. If PACF cuts at p → start with AR(p).
If ACF cuts at q → start with MA(q). Both tail off → ARMA needed.
Step 2: Fit all (p,q) combinations in a small grid (p ≤ 3, q ≤ 3). Compare AIC.
Step 3: Check residuals of the winner. Ljung-Box p > 0.05 = done.
If not, increment the flagged lag's order (p or q by 1) and refit.
ARMA in Python — Complete Workflow
We simulate a known ARMA(1,1) process, verify its properties, fit models, run diagnostics and produce forecasts — the complete end-to-end pipeline.
# ─── 0. Dependencies ─────────────────────────────────────────────────────────
# pip install statsmodels scipy matplotlib numpy pandas
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.stats.diagnostic import acorr_ljungbox
from scipy.stats import shapiro
# ─── 1. Simulate a true ARMA(1,1) process ────────────────────────────────────
np.random.seed(42)
n = 500
# True parameters: AR φ₁=0.7, MA θ₁=0.4
ar_params = np.array([1, -0.7]) # statsmodels uses [1, -φ₁, -φ₂, ...]
ma_params = np.array([1, 0.4]) # statsmodels uses [1, θ₁, θ₂, ...]
arma_process = ArmaProcess(ar_params, ma_params)
print(f"Process is stationary: {arma_process.isstationary}")
print(f"Process is invertible: {arma_process.isinvertible}")
y = arma_process.generate_sample(nsample=n, scale=1.0)
series = pd.Series(y, name='ARMA(1,1) simulation')
# ─── 2. Verify stationarity with ADF test ────────────────────────────────────
adf_stat, adf_p, _, _, adf_crit, _ = adfuller(series, autolag='AIC')
print(f"ADF Statistic : {adf_stat:.4f}")
print(f"p-value : {adf_p:.6f}")
print(f"Critical (5%) : {adf_crit['5%']:.4f}")
print(f"Verdict : {'Stationary ✓' if adf_p < 0.05 else 'Non-stationary ✗'}")
# ─── 3. Inspect ACF and PACF ─────────────────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf( series, lags=30, ax=axes[0], title='ACF — ARMA(1,1) simulation', alpha=0.05)
plot_pacf(series, lags=30, ax=axes[1], title='PACF — ARMA(1,1) simulation', alpha=0.05)
plt.tight_layout()
plt.savefig('acf_pacf_arma.png', dpi=120)
plt.show()
# Expected: neither ACF nor PACF shows a hard cutoff → both tail off → ARMA(p,q)
# ─── 4. Grid search: fit ARMA(p,q) via ARIMA(p,0,q) — d=0 means no differencing ──
p_range = range(0, 4)
q_range = range(0, 4)
grid_results = []
for p in p_range:
for q in q_range:
if p == 0 and q == 0:
continue # skip white noise
try:
m = ARIMA(series, order=(p, 0, q)).fit(method='innovations_mle')
grid_results.append({
'(p,q)' : f"({p},{q})",
'AIC' : round(m.aic, 2),
'BIC' : round(m.bic, 2),
'params': m.params.shape[0]
})
except:
pass
grid_df = pd.DataFrame(grid_results).sort_values('AIC')
print(grid_df.head(8).to_string(index=False))
# ─── 5. Fit the best model ARMA(1,1) ─────────────────────────────────────────
best = ARIMA(series, order=(1, 0, 1)).fit(method='innovations_mle')
print(best.summary())
# Extract coefficients and compare with truth
phi1_hat = best.arparams[0]
theta1_hat = best.maparams[0]
print(f"\nTrue AR(1): φ₁ = 0.7000 | Estimated: {phi1_hat:.4f}")
print(f"True MA(1): θ₁ = 0.4000 | Estimated: {theta1_hat:.4f}")
# ─── 6. Residual diagnostics ─────────────────────────────────────────────────
residuals = best.resid
# 6a. Ljung-Box test — should have no significant autocorrelation
lb = acorr_ljungbox(residuals, lags=[10, 20, 30], return_df=True)
print("Ljung-Box results:")
print(lb['lb_pvalue'].to_string())
# 6b. Normality test on residuals
stat_sw, p_sw = shapiro(residuals)
print(f"\nShapiro-Wilk p = {p_sw:.4f} → {'Normal ✓' if p_sw > 0.05 else 'Non-normal'}")
# 6c. Mean of residuals should be ~0
print(f"Residual mean = {residuals.mean():.5f} → {'≈0 ✓' if abs(residuals.mean()) < 0.05 else 'Bias present'}")
# ─── 7. Forecast 20 steps ahead ──────────────────────────────────────────────
n_fc = 20
fc_obj = best.get_forecast(steps=n_fc)
fc_mean = fc_obj.predicted_mean
fc_ci = fc_obj.conf_int(alpha=0.05) # 95% confidence interval
fc_ci_90 = fc_obj.conf_int(alpha=0.10) # 90% confidence interval
# Print forecast table
fc_table = pd.DataFrame({
'Step' : range(1, n_fc + 1),
'Forecast' : fc_mean.round(4).values,
'Lower 95%': fc_ci.iloc[:, 0].round(3).values,
'Upper 95%': fc_ci.iloc[:, 1].round(3).values
})
print(fc_table.head(8).to_string(index=False))
Notice how the point forecasts (0.422 → 0.309 → 0.228…) converge toward zero (the series mean) as the horizon grows. This is mathematically guaranteed for any stationary ARMA model. Because the series reverts to its mean, the best long-run forecast is always the mean — and the confidence interval width converges to ±1.96·σ (the unconditional standard deviation). ARMA is a short-range forecasting tool — its edge over a flat mean forecast decays exponentially with the AR coefficient.
Real Data Example — Daily Temperature Anomalies
Simulated series prove the theory. Now we apply the full workflow to a real dataset — daily temperature anomalies (deviations from a 30-year average). Temperature anomalies are typically stationary and exhibit short-range AR-like persistence.
# ─── Real data: US daily mean temperature anomalies (publicly available) ─────
# We simulate a realistic anomaly series here (replace with your own CSV)
np.random.seed(7)
n_real = 730 # 2 years of daily data
# Realistic: AR(1) dominant + small MA component + slight seasonal noise
ar_real = np.array([1, -0.62])
ma_real = np.array([1, 0.18])
y_real = ArmaProcess(ar_real, ma_real).generate_sample(nsample=n_real, scale=1.5)
dates = pd.date_range('2022-01-01', periods=n_real, freq='D')
temp_df = pd.Series(y_real, index=dates, name='Temp Anomaly (°C)')
# ─── ADF stationarity check ───────────────────────────────────────────────────
_, p_adf = adfuller(temp_df)[:2]
print(f"ADF p = {p_adf:.4f} → {'Stationary ✓' if p_adf < 0.05 else 'Non-stationary'}")
# ─── Train / Test split (last 30 days = test) ────────────────────────────────
train = temp_df[:-30]
test = temp_df[-30:]
print(f"Train: {len(train)} days | Test: {len(test)} days")
# ─── Grid search on training data ────────────────────────────────────────────
import itertools
candidates = [(p, q) for p, q in itertools.product(range(4), range(4)) if p+q > 0]
res = []
for p, q in candidates:
try:
m = ARIMA(train, order=(p, 0, q)).fit()
res.append({'(p,q)': f"({p},{q})", 'AIC': round(m.aic, 2), 'BIC': round(m.bic, 2)})
except: pass
best_pq = pd.DataFrame(res).sort_values('AIC')
print(best_pq.head(5).to_string(index=False))
# ─── Fit best ARMA(1,1) on train, forecast 30 days ───────────────────────────
final_model = ARIMA(train, order=(1, 0, 1)).fit()
fc30 = final_model.get_forecast(steps=30)
fc30_mean = fc30.predicted_mean
fc30_ci = fc30.conf_int()
# Evaluate on test set
from sklearn.metrics import mean_absolute_error, mean_squared_error
mae = mean_absolute_error(test, fc30_mean)
rmse = np.sqrt(mean_squared_error(test, fc30_mean))
mape = np.mean(np.abs((test.values - fc30_mean.values) / (np.abs(test.values) + 1e-8))) * 100
# Naïve benchmark: forecast = last observed value
naive = np.full(30, train.iloc[-1])
mae_n = mean_absolute_error(test, naive)
mase = mae / mae_n
print(f"MAE : {mae:.4f}")
print(f"RMSE : {rmse:.4f}")
print(f"MAPE : {mape:.2f}%")
print(f"MASE : {mase:.4f} → {'Beats naïve ✓' if mase < 1 else 'Worse than naïve ✗'}")
Six Mistakes That Break ARMA Models
The ARMA Family Tree — Where ARMA Fits
ARMA is the trunk of the entire classical time series tree. Add differencing → ARIMA. Add seasonal structure → SARIMA. Add external predictors → ARIMAX/SARIMAX. Add volatility modelling → ARIMA-GARCH. Scale to multiple series → VAR. Mastering ARMA makes every extension straightforward.
Complete Model Comparison Table
| Property | AR(p) | MA(q) | ARMA(p,q) | ARIMA(p,d,q) |
|---|---|---|---|---|
| Equation | Σφᵢyₜ₋ᵢ + εₜ | εₜ + Σθⱼεₜ₋ⱼ | Σφᵢyₜ₋ᵢ + εₜ + Σθⱼεₜ₋ⱼ | ARMA on Δᵈyₜ |
| Requires stationarity? | Yes — |roots| > 1 | Always stationary | Yes — AR part must be stable | No — creates stationarity |
| ACF pattern | Tails off (exponential) | Cuts off at lag q | Tails off (no cutoff) | Depends on d, p, q |
| PACF pattern | Cuts off at lag p | Tails off (exponential) | Tails off (no cutoff) | Depends on d, p, q |
| Memory type | Infinite (decaying) | Finite (exact cutoff) | Both simultaneously | Both + trend removal |
| Parameters | p + 2 (c, σ²) | q + 2 (μ, σ²) | p + q + 2 | p + q + 3 (d counts) |
| Best for | Persistent momentum series | Short-lived shock series | Most real stationary series | Trending non-stationary data |
| Stationarity check | ADF test required | None needed | ADF test required | Built-in via d |
| Long-horizon forecast | Converges to mean | Converges to mean (faster) | Converges to mean | Follows differenced mean |