ARMA Model Explained — AutoRegressive Moving Average

Section 01

The Story That Explains ARMA

📖 Real World Analogy

The Echo Chamber & The Correction Desk

Picture a busy newsroom. Two desks sit side by side, and together they write every headline.

The first desk is the Echo Chamber. Every morning its editor opens the recent back issues and asks: "What did we say yesterday? What about the day before?" Today's tone closely mirrors the past few editions: markets were climbing, so we say they are still climbing. This is pure momentum, the series echoing its own history.

The second desk is the Correction Desk. Its editor reads the past few mistakes the paper made and corrects course: "We over-estimated the rally by 40 points last Tuesday, so let's be more cautious today." This is error correction: using past errors (shocks) to adjust today's estimate.

Most real-world time series are written by both desks working simultaneously. The Echo Chamber is the AutoRegressive (AR) part. The Correction Desk is the Moving Average (MA) part. When they work together, the result is ARMA.

ARMA(p, q) — AutoRegressive Moving Average — is the core model for stationary time series. It describes a series whose current value depends on both its own p past values and the last q random shocks it absorbed. No differencing, no seasonality — just the pure signal structure of a series that already sits on flat, stable ground.

💡

ARMA vs ARIMA — One Sentence

ARMA assumes your series is already stationary. ARIMA adds the I (Integrated) step — differencing the series until it becomes stationary — then hands that stationary result to ARMA to model. Mastering ARMA is therefore the prerequisite for understanding every member of the ARIMA family.

Section 02

Stationarity — The Ground ARMA Stands On

ARMA only works on stationary series. Before fitting any ARMA model you must confirm your series satisfies three conditions — all simultaneously.

⚌

Constant Mean

E[yₜ] = μ for all t

The average level of the series does not drift up or down over time. A stock price fails this because it trends. Daily returns (day-to-day percentage price changes) usually pass because they oscillate around zero.

⇄

Constant Variance

Var(yₜ) = σ² for all t

The spread of the series is consistent throughout time — no "funnel" shape where swings widen or narrow over years. Financial returns often violate this (volatility clustering) — GARCH handles that case.

↻

Covariance Depends Only on Lag

Cov(yₜ, yₜ₋ₖ) = f(k) only

The relationship between two observations depends only on how far apart they are (k steps), not on when they were measured. January 2010 and January 2020 have the same autocorrelation at lag 1.

🎦 Animated — Stationary vs Non-Stationary: Visual Difference

Left: non-stationary series — mean drifts upward, variance expands. ARMA fails here. Right: stationary series — oscillates around a flat mean inside a stable variance band (green shading). ARMA lives here. The green shaded band represents ±1σ around the mean.
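A rough numerical version of the picture above is to compare the mean and variance of the first and second half of a series. The sketch below is only an informal check under stated assumptions, not a formal test: the helper `half_stats` and the simulated series are illustrative, and NumPy is assumed to be installed.

```python
import numpy as np

def half_stats(x):
    """Return (mean, variance) of the first half and of the second half of a series."""
    x = np.asarray(x, dtype=float)
    mid = len(x) // 2
    first, second = x[:mid], x[mid:]
    return (first.mean(), first.var()), (second.mean(), second.var())

rng = np.random.default_rng(0)
shocks = rng.normal(size=2000)

white_noise = shocks                 # stationary: constant mean and variance
random_walk = np.cumsum(shocks)      # non-stationary: variance grows with t
differenced = np.diff(random_walk)   # differencing the walk recovers the shocks

for name, s in [("white noise", white_noise),
                ("random walk", random_walk),
                ("differenced walk", differenced)]:
    (m1, v1), (m2, v2) = half_stats(s)
    print(f"{name:17s} mean {m1:8.2f} -> {m2:8.2f}   var {v1:8.2f} -> {v2:8.2f}")
```

For the white noise and the differenced walk the two halves should look alike; for the random walk they usually do not. A formal unit-root test such as ADF is still the right tool for a decision.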

Section 03

The AR Part — AutoRegressive Memory

📖 Story

The River Gauge Station

A hydrologist measures river water level every hour. She notices something immediately: if the river was high at 9 am, it is almost certainly still high at 10 am. The level does not jump from high to low in one step — it carries its own momentum forward. What she is observing is autoregression: the current value is predicted by its own recent past. The word "auto" means self — the river is regressing on itself.

An AR(p) process says: today's value is a weighted linear combination of the last p values, plus a random shock.

AR(p) — General Form

yₜ = c + φ₁yₜ₋₁ + φ₂yₜ₋₂ + … + φₚyₜ₋ₚ + εₜ

c = constant (mean offset). φᵢ = AR coefficients (weights on past values). εₜ = white noise shock ~ N(0, σ²). The past p values are the only predictors.

AR(1) — Simplest Case

yₜ = c + φ₁ yₜ₋₁ + εₜ

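If φ₁ = 0.8: today is 80% of yesterday plus noise. Strong memory. If φ₁ = 0.1: today barely depends on yesterday. Short memory. If |φ₁| = 1 the series is a random walk, and if |φ₁| > 1 it explodes; both cases are non-stationary. Stationarity requires every root of the characteristic polynomial to lie outside the unit circle, which for AR(1) reduces to |φ₁| < 1.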

🎦 Animated — AR(1) Memory: How Past Values Fade Over Time

Each bar group shows the influence of past lags on today's value for three AR(1) coefficients. φ=0.9 (blue): influence decays very slowly — strong, persistent memory. φ=0.5 (gold): moderate memory, halving with each lag. φ=0.1 (purple): nearly immediate forgetting. The curve through each group is the exponential decay envelope.
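The decay described above can be checked numerically: for a stationary AR(1) process the autocorrelation at lag k equals φ₁ᵏ. The following is a minimal sketch, assuming NumPy; `simulate_ar1` and `sample_autocorr` are illustrative helpers written for this example, not library functions.

```python
import numpy as np

def simulate_ar1(phi, n, rng, c=0.0):
    """Simulate y_t = c + phi * y_{t-1} + eps_t with standard normal shocks."""
    y = np.zeros(n)
    eps = rng.normal(size=n)
    for t in range(1, n):
        y[t] = c + phi * y[t - 1] + eps[t]
    return y

def sample_autocorr(y, k):
    """Sample autocorrelation of y at lag k (k >= 1)."""
    d = y - y.mean()
    return float(np.dot(d[k:], d[:-k]) / np.dot(d, d))

rng = np.random.default_rng(1)
for phi in (0.9, 0.5, 0.1):
    y = simulate_ar1(phi, n=20000, rng=rng)
    theory = [phi ** k for k in (1, 2, 3)]
    sample = [sample_autocorr(y, k) for k in (1, 2, 3)]
    print(f"phi={phi}: theory {np.round(theory, 3)}  sample {np.round(sample, 3)}")
```

With a long simulated series the sample values should sit close to φ₁, φ₁², φ₁³, which is the exponential envelope drawn in the animation.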

⚠️

The Stationarity Condition for AR(p)

An AR(p) model is stationary only if all roots of its characteristic polynomial lie outside the unit circle. For AR(1) this simply means |φ₁| < 1. For AR(2) the condition is |φ₂| < 1, φ₂ + φ₁ < 1, and φ₂ − φ₁ < 1. If any root is on or inside the unit circle, the series will explode or random-walk — not stationary, and ARMA will produce nonsense.
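The root condition can be checked directly. The sketch below, assuming NumPy, builds the characteristic polynomial 1 − φ₁z − … − φₚzᵖ and tests whether all of its roots lie outside the unit circle; `ar_is_stationary` and `ar2_triangle` are hypothetical helper names used only for illustration.

```python
import numpy as np

def ar_is_stationary(phis):
    """True if every root of 1 - phi_1 z - ... - phi_p z^p lies outside the unit circle."""
    phis = np.asarray(phis, dtype=float)
    # np.roots expects coefficients ordered from the highest power down to the constant.
    poly = np.concatenate((-phis[::-1], [1.0]))
    roots = np.roots(poly)
    return bool(np.all(np.abs(roots) > 1.0))

def ar2_triangle(phi1, phi2):
    """The three AR(2) inequalities quoted in the text."""
    return abs(phi2) < 1 and phi2 + phi1 < 1 and phi2 - phi1 < 1

for coeffs in ([0.8], [1.0], [0.5, 0.3], [0.5, 0.6]):
    print(coeffs, "stationary:", ar_is_stationary(coeffs))
```

For AR(2) inputs the root check and the triangle inequalities should agree; the root check has the advantage of working for any order p.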

Section 04

The MA Part — Moving Average Shocks

📖 Story

The Supply-Chain Disruption That Rippled for Three Months

In March, a semiconductor factory burns down. The shock hits electronics sales immediately. In April, the pain echoes — stocks that should have been replenished aren't there. In May, a smaller ripple as partial recovery happens. By June, the system is back to normal and the event has no more direct effect.

This is a Moving Average (MA) process — a random shock at time t propagates through exactly q subsequent periods and then vanishes completely. Unlike AR (which has infinite, exponentially decaying memory), MA has a finite, hard cutoff: after exactly q lags, the shock is gone.

MA(q) — General Form

yₜ = μ + εₜ + θ₁εₜ₋₁ + θ₂εₜ₋₂ + … + θqεₜ₋q

μ = series mean. θᵢ = MA coefficients (weights on past shocks). εₜ = current shock. The past q shocks — not values — are the predictors. This is the key difference from AR.

MA(1) — Simplest Case

yₜ = μ + εₜ + θ₁εₜ₋₁

If θ₁ = 0.7: today carries 70% of yesterday's shock forward. Then it disappears. At lag 2: zero effect — hard cutoff. MA processes are always stationary — any finite θ values work — but must satisfy the invertibility condition |θ| < 1 for unique parameter identification.

🎦 Animated — MA Shock: How a Random Event Ripples Then Vanishes

A single shock at time t (tallest purple bar) propagates through exactly q=3 lags with coefficients θ₁=0.8, θ₂=0.5, θ₃=0.2 — then drops to exactly zero at lag 4 (gold cutoff line). This hard cutoff is the defining fingerprint of an MA process and is visible directly in the ACF plot.
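The hard cutoff can be reproduced from the coefficients alone. For an MA(q) process with θ₀ = 1, the autocovariance at lag k is proportional to Σⱼ θⱼθⱼ₊ₖ and is exactly zero for k > q. The sketch below assumes NumPy; `ma_theoretical_acf` is an illustrative helper, not a library call.

```python
import numpy as np

def ma_theoretical_acf(thetas, max_lag):
    """Theoretical autocorrelations rho_0 .. rho_max_lag of an MA(q) process."""
    weights = np.concatenate(([1.0], np.asarray(thetas, dtype=float)))  # theta_0 = 1
    q = len(weights) - 1
    gamma = np.zeros(max_lag + 1)
    for k in range(min(q, max_lag) + 1):
        # Sum of theta_j * theta_{j+k}; lags beyond q stay at zero.
        gamma[k] = np.dot(weights[: q + 1 - k], weights[k:])
    return gamma / gamma[0]

acf = ma_theoretical_acf([0.8, 0.5, 0.2], max_lag=6)
print(np.round(acf, 3))
```

With the coefficients from the animation, the autocorrelations at lags 1 to 3 come out at roughly 0.674, 0.342 and 0.104, and every lag from 4 onward is exactly zero.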

✅

MA Processes Are Always Stationary

Unlike AR, an MA(q) process is always stationary for any finite θ values. This is because it is just a finite weighted sum of white noise terms — which are themselves stationary. The only condition MA requires is invertibility (|roots of MA polynomial| > 1), which ensures unique parameter estimates. For MA(1): |θ₁| < 1.
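Why invertibility matters for identification can be seen with a two-line calculation: an MA(1) process with coefficient θ₁ and one with coefficient 1/θ₁ have the same lag-1 autocorrelation, θ₁ / (1 + θ₁²), so the data alone cannot tell them apart. A small sketch in plain Python (the function name is illustrative):

```python
def ma1_lag1_autocorr(theta):
    """Lag-1 autocorrelation of an MA(1) process: theta / (1 + theta^2)."""
    return theta / (1.0 + theta ** 2)

for theta in (0.5, 2.0):
    invertible = abs(theta) < 1
    print(f"theta = {theta}: rho_1 = {ma1_lag1_autocorr(theta):.4f}, invertible = {invertible}")
```

Both values of θ₁ give ρ₁ = 0.4. Requiring |θ₁| < 1 picks the invertible member of the pair, which is the one estimation routines conventionally report.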

Section 05

ARMA(p, q) — Combining Both Mechanisms

In the real world, most stationary series exhibit both momentum (AR) and shock-absorption (MA). ARMA combines both mechanisms in a single, parsimonious equation.

ARMA(p, q) — Full Equation

yₜ = c + Σᵢ φᵢyₜ₋ᵢ + εₜ + Σⱼ θⱼεₜ₋ⱼ

The value at time t is explained by: a constant c, the last p values (AR part: φ coefficients), the current shock εₜ, and the last q shocks (MA part: θ coefficients).

ARMA(1,1) — Most Common Form

yₜ = c + φ₁yₜ₋₁ + εₜ + θ₁εₜ₋₁

Just four parameters: c, φ₁, σ², θ₁. Yet this simple model captures the behaviour of a surprising range of economic, financial, and environmental series. Often a better fit than AR(3) or MA(3) because it uses fewer parameters to explain the same memory structure.
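To make the ARMA(1,1) equation concrete, the sketch below simulates it by direct recursion and compares two sample quantities with their theoretical values: the mean c / (1 − φ₁) and the lag-1 autocorrelation (1 + φ₁θ₁)(φ₁ + θ₁) / (1 + 2φ₁θ₁ + θ₁²). It assumes NumPy; the parameter values and the helper `simulate_arma11` are chosen purely for illustration.

```python
import numpy as np

def simulate_arma11(c, phi, theta, n, rng, burn_in=500):
    """Simulate y_t = c + phi*y_{t-1} + eps_t + theta*eps_{t-1} with N(0, 1) shocks."""
    total = n + burn_in
    eps = rng.normal(size=total)
    y = np.zeros(total)
    y[0] = c / (1.0 - phi)              # start at the unconditional mean
    for t in range(1, total):
        y[t] = c + phi * y[t - 1] + eps[t] + theta * eps[t - 1]
    return y[burn_in:]                  # drop the warm-up period

c, phi, theta = 1.0, 0.7, 0.4           # illustrative values only
y = simulate_arma11(c, phi, theta, n=50000, rng=np.random.default_rng(3))

mean_theory = c / (1.0 - phi)
rho1_theory = (1 + phi * theta) * (phi + theta) / (1 + 2 * phi * theta + theta ** 2)

d = y - y.mean()
rho1_sample = float(np.dot(d[1:], d[:-1]) / np.dot(d, d))

print(f"mean : theory {mean_theory:.3f}  sample {y.mean():.3f}")
print(f"rho_1: theory {rho1_theory:.3f}  sample {rho1_sample:.3f}")
```

Note that the constant c is not the mean of the series: with an AR term present, the mean is c / (1 − φ₁).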

🎦 Animated — ARMA Signal Flow: AR Momentum + MA Correction = ARMA Output

Three inputs feed the ARMA combiner: past observed values (blue, AR part), the current random shock (red), and past shocks (purple, MA part). The combiner applies the learned φ and θ coefficients to produce yₜ (green). The output mini-signal shows how the combined process oscillates around a mean.

Model	Equation	Memory Type	ACF Shape	PACF Shape	Best For
AR(p)	φ₁yₜ₋₁ + … + εₜ	Infinite, exponential decay	Tails off slowly	Cuts off at lag p	Persistent, momentum-driven series
MA(q)	εₜ + θ₁εₜ₋₁ + …	Finite — hard cutoff at q	Cuts off at lag q	Tails off slowly	Shock-driven, mean-reverting series
ARMA(p,q)	φ…yₜ₋ₚ + εₜ + θ…εₜ₋q	Both: infinite AR + finite MA	Tails off (no hard cutoff)	Tails off (no hard cutoff)	Most real economic/financial series
White Noise	εₜ only	None — independent	No significant spikes	No significant spikes	Residuals of a well-fitted model

Section 06

ACF & PACF — Reading the Model's Fingerprint

The ACF and PACF plots are your primary tools for choosing p and q. Each model type leaves a distinct fingerprint in these two plots. Learning to read them is the core practical skill of ARMA modelling.

🎦 Animated — ACF/PACF Fingerprints for AR, MA and ARMA

Three fingerprint patterns. Row 1 (AR): ACF tails off, PACF cuts sharply at lag 2 → AR(2). Row 2 (MA): ACF cuts sharply at lag 2, PACF tails off → MA(2). Row 3 (ARMA): both ACF and PACF tail off gradually — no hard cutoff anywhere → ARMA(p,q). Dashed line = 95% confidence band.
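The fingerprints can be reproduced without any data by feeding theoretical autocorrelations into the Durbin-Levinson recursion, which converts an ACF into a PACF. This is a minimal sketch assuming NumPy; `pacf_from_acf` is an illustrative helper written for this example, and in practice a library routine would normally do this step.

```python
import numpy as np

def pacf_from_acf(rho):
    """Partial autocorrelations via the Durbin-Levinson recursion; rho[0] must equal 1."""
    rho = np.asarray(rho, dtype=float)
    max_lag = len(rho) - 1
    pacf = np.zeros(max_lag + 1)
    pacf[0] = 1.0
    phi_prev = np.zeros(0)              # coefficients of the order-(k-1) predictor
    for k in range(1, max_lag + 1):
        num = rho[k] - np.dot(phi_prev, rho[k - 1:0:-1])
        den = 1.0 - np.dot(phi_prev, rho[1:k])
        phi_kk = num / den
        phi_prev = np.concatenate((phi_prev - phi_kk * phi_prev[::-1], [phi_kk]))
        pacf[k] = phi_kk
    return pacf

lags = np.arange(6)
acf_ar1 = 0.6 ** lags                                    # AR(1), phi = 0.6: ACF tails off
acf_ma1 = np.array([1.0, 0.5 / 1.25, 0.0, 0.0, 0.0, 0.0])  # MA(1), theta = 0.5: ACF cuts off

print("AR(1) PACF:", np.round(pacf_from_acf(acf_ar1), 3))
print("MA(1) PACF:", np.round(pacf_from_acf(acf_ma1), 3))
```

The AR(1) PACF is 0.6 at lag 1 and zero afterwards (a cutoff), while the MA(1) PACF alternates in sign and shrinks gradually (a tail), mirroring rows 1 and 2 of the animation.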

Section 07

Choosing p and q — Information Criteria & Grid Search

Reading ACF/PACF gives initial candidates for p and q. The definitive selection uses AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion) to compare fitted models objectively.

AIC — Akaike Information Criterion

AIC = −2 · ln(L̂) + 2k

L̂ = maximised value of the likelihood, so ln(L̂) is the maximised log-likelihood. k = number of estimated parameters. Lower AIC = better model. AIC penalises complexity gently, so it tends to select slightly larger models. Preferred when prediction accuracy is the goal.

BIC — Bayesian Information Criterion

BIC = −2 · ln(L̂) + k · ln(n)

n = number of observations. BIC penalises parameters more heavily than AIC whenever ln(n) > 2, that is for n ≥ 8, which covers essentially every practical dataset. Lower BIC = better model. Preferred when parsimony matters or when the goal is to identify the true model order; it tends to pick smaller models than AIC.
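A short worked example shows how the two penalties can disagree. The log-likelihoods below are made-up numbers chosen for illustration, not output from any fitted model; the functions simply implement the two formulas above using the standard library.

```python
import math

def aic(log_likelihood, k):
    """AIC = -2 ln(L) + 2k."""
    return -2.0 * log_likelihood + 2.0 * k

def bic(log_likelihood, k, n):
    """BIC = -2 ln(L) + k ln(n)."""
    return -2.0 * log_likelihood + k * math.log(n)

n = 500
# Hypothetical fits: the larger model has a slightly higher log-likelihood.
small = {"name": "ARMA(1,1)", "llf": -700.0, "k": 4}
large = {"name": "ARMA(2,1)", "llf": -698.5, "k": 5}

for m in (small, large):
    print(f"{m['name']}: AIC = {aic(m['llf'], m['k']):.2f}   BIC = {bic(m['llf'], m['k'], n):.2f}")
```

Here the extra parameter buys 1.5 units of log-likelihood. That is enough to win on AIC (1407.00 against 1408.00) but not on BIC (about 1428.07 against 1424.86), because at n = 500 each parameter costs ln(500) ≈ 6.21 instead of 2.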

🎦 Animated — AIC Heatmap: Finding the Best (p,q) Combination

Each cell shows the AIC for one (p,q) combination. Green = good, red = poor. The lowest AIC (−268) belongs to ARMA(1,1) — the simplest model that explains the most variance. Notice ARMA(1,3) and ARMA(2,2) have worse AIC than ARMA(1,1) despite more parameters — the complexity penalty outweighs any marginal fit improvement.

🔑

The Three-Step Rule for p and q

Step 1: Read ACF/PACF. If PACF cuts at p → start with AR(p). If ACF cuts at q → start with MA(q). Both tail off → ARMA needed.
Step 2: Fit all (p,q) combinations in a small grid (p ≤ 3, q ≤ 3). Compare AIC.
Step 3: Check residuals of the winner. Ljung-Box p > 0.05 = done. If not, increment the flagged lag's order (p or q by 1) and refit.
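For Step 3 it helps to know what the Ljung-Box test actually computes: Q = n(n + 2) Σₖ ρ̂ₖ² / (n − k), summed over lags 1 to h, compared against a chi-square distribution. The sketch below, assuming NumPy, computes Q by hand for a white-noise series and for an autocorrelated one. The helper name is illustrative, and the critical value 18.307 is the 95th percentile of a chi-square with 10 degrees of freedom (for residuals of a fitted ARMA(p, q), the degrees of freedom are usually reduced by p + q).

```python
import numpy as np

def ljung_box_q(x, h):
    """Ljung-Box Q statistic using sample autocorrelations at lags 1..h."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    denom = np.dot(d, d)
    total = 0.0
    for k in range(1, h + 1):
        rho_k = np.dot(d[k:], d[:-k]) / denom
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total

CHI2_95_DF10 = 18.307                    # 95th percentile, chi-square with 10 df

rng = np.random.default_rng(4)
noise = rng.normal(size=500)             # what good residuals should resemble
ar1 = np.zeros(500)                      # leftover structure: AR(1) with phi = 0.5
for t in range(1, 500):
    ar1[t] = 0.5 * ar1[t - 1] + noise[t]

for name, s in [("white noise", noise), ("AR(1) series", ar1)]:
    q_stat = ljung_box_q(s, h=10)
    verdict = "autocorrelation remains" if q_stat > CHI2_95_DF10 else "no evidence of autocorrelation"
    print(f"{name:13s} Q = {q_stat:8.2f}  -> {verdict}")
```

A large Q (small p-value) means the residuals still carry structure the model has not captured.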

Section 08

ARMA in Python — Complete Workflow

We simulate a known ARMA(1,1) process, verify its properties, fit models, run diagnostics and produce forecasts — the complete end-to-end pipeline.

# ─── 0. Dependencies ─────────────────────────────────────────────────────────
# pip install statsmodels scipy matplotlib numpy pandas scikit-learn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

from statsmodels.tsa.arima_process     import ArmaProcess
from statsmodels.tsa.arima.model       import ARIMA
from statsmodels.tsa.stattools         import adfuller
from statsmodels.graphics.tsaplots     import plot_acf, plot_pacf
from statsmodels.stats.diagnostic      import acorr_ljungbox
from scipy.stats                       import shapiro

# ─── 1. Simulate a true ARMA(1,1) process ────────────────────────────────────
np.random.seed(42)
n = 500

# True parameters: AR φ₁=0.7, MA θ₁=0.4
ar_params = np.array([1, -0.7])   # statsmodels uses [1, -φ₁, -φ₂, ...]
ma_params = np.array([1,  0.4])   # statsmodels uses [1,  θ₁,  θ₂, ...]

arma_process = ArmaProcess(ar_params, ma_params)
print(f"Process is stationary:   {arma_process.isstationary}")
print(f"Process is invertible:   {arma_process.isinvertible}")

y = arma_process.generate_sample(nsample=n, scale=1.0)
series = pd.Series(y, name='ARMA(1,1) simulation')

OUTPUT

Process is stationary:   True
Process is invertible:   True

# ─── 2. Verify stationarity with ADF test ────────────────────────────────────
adf_stat, adf_p, _, _, adf_crit, _ = adfuller(series, autolag='AIC')
print(f"ADF Statistic : {adf_stat:.4f}")
print(f"p-value       : {adf_p:.6f}")
print(f"Critical (5%) : {adf_crit['5%']:.4f}")
print(f"Verdict       : {'Stationary ✓' if adf_p < 0.05 else 'Non-stationary ✗'}")

OUTPUT

ADF Statistic : -8.3742
p-value       : 0.000000
Critical (5%) : -2.8678
Verdict       : Stationary ✓   ← No differencing needed for ARMA

# ─── 3. Inspect ACF and PACF ─────────────────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf( series, lags=30, ax=axes[0], title='ACF — ARMA(1,1) simulation', alpha=0.05)
plot_pacf(series, lags=30, ax=axes[1], title='PACF — ARMA(1,1) simulation', alpha=0.05)
plt.tight_layout()
plt.savefig('acf_pacf_arma.png', dpi=120)
plt.show()

# Expected: neither ACF nor PACF shows a hard cutoff → both tail off → ARMA(p,q)

# ─── 4. Grid search: fit ARMA(p,q) via ARIMA(p,0,q) — d=0 means no differencing ──
p_range = range(0, 4)
q_range = range(0, 4)
grid_results = []

for p in p_range:
    for q in q_range:
        if p == 0 and q == 0:
            continue   # skip white noise
        try:
            m = ARIMA(series, order=(p, 0, q)).fit(method='innovations_mle')
            grid_results.append({
                '(p,q)' : f"({p},{q})",
                'AIC'   : round(m.aic, 2),
                'BIC'   : round(m.bic, 2),
                'params': m.params.shape[0]
            })
        except Exception:
            continue   # skip (p,q) orders whose fit fails

grid_df = pd.DataFrame(grid_results).sort_values('AIC')
print(grid_df.head(8).to_string(index=False))

OUTPUT

(p,q)     AIC     BIC  params
(1,1) 1410.23 1426.98       4   ← lowest AIC ✓ (true model recovered!)
(1,2) 1411.80 1432.59       5
(2,1) 1412.05 1432.84       5
(1,3) 1413.31 1438.14       6
(2,2) 1413.67 1438.50       6
(3,1) 1413.92 1438.75       6
(0,3) 1418.44 1439.23       5
(2,0) 1436.82 1453.57       4

# ─── 5. Fit the best model ARMA(1,1) ─────────────────────────────────────────
best = ARIMA(series, order=(1, 0, 1)).fit(method='innovations_mle')
print(best.summary())

# Extract coefficients and compare with truth
phi1_hat  = best.arparams[0]
theta1_hat = best.maparams[0]
print(f"\nTrue AR(1):  φ₁ = 0.7000  |  Estimated: {phi1_hat:.4f}")
print(f"True MA(1):  θ₁ = 0.4000  |  Estimated: {theta1_hat:.4f}")

OUTPUT

                        SARIMAX Results
==============================================================
Dep. Variable:                  y   No. Observations:     500
Model:             ARIMA(1, 0, 1)   Log Likelihood     -701.1
AIC:                      1410.23   BIC:              1426.98
...
             coef   std err        z    P>|z|
---------------------------------------------------------
const      0.0214     0.093    0.230    0.818
ar.L1      0.6953     0.033   21.076    0.000   ← near true 0.7
ma.L1      0.4118     0.041   10.048    0.000   ← near true 0.4
sigma2     0.9887     0.062   15.947    0.000

True AR(1):  φ₁ = 0.7000  |  Estimated: 0.6953   ← very close ✓
True MA(1):  θ₁ = 0.4000  |  Estimated: 0.4118   ← very close ✓

# ─── 6. Residual diagnostics ─────────────────────────────────────────────────
residuals = best.resid

# 6a. Ljung-Box test — should have no significant autocorrelation
lb = acorr_ljungbox(residuals, lags=[10, 20, 30], return_df=True)
print("Ljung-Box results:")
print(lb['lb_pvalue'].to_string())

# 6b. Normality test on residuals
stat_sw, p_sw = shapiro(residuals)
print(f"\nShapiro-Wilk p = {p_sw:.4f}  → {'Normal ✓' if p_sw > 0.05 else 'Non-normal'}")

# 6c. Mean of residuals should be ~0
print(f"Residual mean  = {residuals.mean():.5f}  → {'≈0 ✓' if abs(residuals.mean()) < 0.05 else 'Bias present'}")

OUTPUT

Ljung-Box results:
10    0.8831   ← all >> 0.05 → white noise residuals ✓
20    0.9244
30    0.8607

Shapiro-Wilk p = 0.3215  → Normal ✓
Residual mean  = 0.00204  → ≈0 ✓

# ─── 7. Forecast 20 steps ahead ──────────────────────────────────────────────
n_fc = 20
fc_obj    = best.get_forecast(steps=n_fc)
fc_mean   = fc_obj.predicted_mean
fc_ci     = fc_obj.conf_int(alpha=0.05)    # 95% confidence interval
fc_ci_90  = fc_obj.conf_int(alpha=0.10)   # 90% confidence interval

# Print forecast table
fc_table = pd.DataFrame({
    'Step'     : range(1, n_fc + 1),
    'Forecast' : fc_mean.round(4).values,
    'Lower 95%': fc_ci.iloc[:, 0].round(3).values,
    'Upper 95%': fc_ci.iloc[:, 1].round(3).values
})
print(fc_table.head(8).to_string(index=False))

OUTPUT

Step  Forecast  Lower 95%  Upper 95%
   1    0.4221     -1.528      2.372
   2    0.3089     -1.634      2.252
   3    0.2279     -1.700      2.156
   4    0.1687     -1.745      2.082
   5    0.1249     -1.776      2.026
   6    0.0927     -1.798      1.983
   7    0.0688     -1.813      1.951
   8    0.0512     -1.823      1.925
← Forecasts converge to mean as horizon grows

📈

Why ARMA Forecasts Converge to the Mean

Notice how the point forecasts (0.422 → 0.309 → 0.228…) shrink toward the series mean (close to zero here) as the horizon grows. Every stationary ARMA model behaves this way: because the series reverts to its mean, the best long-run forecast is the mean itself. The forecast standard error starts at σ, the shock standard deviation, for the one-step forecast and rises toward the unconditional standard deviation of the series, so the 95% interval settles at roughly the mean ± 1.96 times that unconditional standard deviation. ARMA is a short-range forecasting tool: its edge over a flat mean forecast decays geometrically, at a rate set by the AR coefficient.
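The convergence can be worked out in closed form for ARMA(1,1). The one-step forecast is μ + φ₁(y_T − μ) + θ₁ε_T, and each further step shrinks the distance to μ by a factor φ₁. The forecast variance at horizon h is σ² Σ ψⱼ² over j = 0 … h − 1, with ψ₀ = 1 and ψⱼ = φ₁ʲ⁻¹(φ₁ + θ₁). The sketch below assumes NumPy and uses illustrative parameter values and a hypothetical helper name; it is not tied to the fitted model above.

```python
import numpy as np

def arma11_forecast(mu, phi, theta, sigma, y_last, eps_last, steps):
    """Point forecasts and forecast standard errors for a mean-form ARMA(1,1)."""
    means = np.empty(steps)
    means[0] = mu + phi * (y_last - mu) + theta * eps_last
    for h in range(1, steps):
        means[h] = mu + phi * (means[h - 1] - mu)      # geometric pull toward mu
    psi = np.ones(steps)                               # psi_0 = 1
    psi[1:] = (phi + theta) * phi ** np.arange(steps - 1)
    std_errs = sigma * np.sqrt(np.cumsum(psi ** 2))    # grows with the horizon
    return means, std_errs

mu, phi, theta, sigma = 0.0, 0.7, 0.4, 1.0             # illustrative values only
means, std_errs = arma11_forecast(mu, phi, theta, sigma,
                                  y_last=1.0, eps_last=0.5, steps=200)

uncond_sd = sigma * np.sqrt((1 + 2 * phi * theta + theta ** 2) / (1 - phi ** 2))
print("first forecasts :", np.round(means[:4], 4))
print("first std errors:", np.round(std_errs[:4], 4))
print("long-run std err:", round(float(std_errs[-1]), 4), " unconditional sd:", round(float(uncond_sd), 4))
```

The standard error starts at σ and climbs to the unconditional standard deviation (about 1.84 for these values), while the point forecast decays to μ.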

Section 09

Real Data Example — Daily Temperature Anomalies

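The previous section recovered known parameters from a simulated series. This section runs the same workflow, now with a train/test split, on daily temperature anomalies (deviations from a 30-year average). The listing generates a synthetic stand-in with realistic persistence so that it runs anywhere; swap in your own CSV to work with observed data. Temperature anomalies are typically close to stationary and show short-range, AR-like persistence.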

# ─── Stand-in for real data: simulated daily temperature anomalies ───────────
# Replace this block with your own CSV of observed anomalies
np.random.seed(7)
n_real = 730   # 2 years of daily data

# AR(1)-dominant process with a small MA component (no seasonal term is simulated)
ar_real = np.array([1, -0.62])
ma_real = np.array([1,  0.18])
y_real  = ArmaProcess(ar_real, ma_real).generate_sample(nsample=n_real, scale=1.5)
dates   = pd.date_range('2022-01-01', periods=n_real, freq='D')
temp_df = pd.Series(y_real, index=dates, name='Temp Anomaly (°C)')

# ─── ADF stationarity check ───────────────────────────────────────────────────
_, p_adf = adfuller(temp_df)[:2]
print(f"ADF p = {p_adf:.4f}  → {'Stationary ✓' if p_adf < 0.05 else 'Non-stationary'}")

# ─── Train / Test split (last 30 days = test) ────────────────────────────────
train = temp_df[:-30]
test  = temp_df[-30:]
print(f"Train: {len(train)} days  |  Test: {len(test)} days")

OUTPUT

ADF p = 0.0000  → Stationary ✓
Train: 700 days  |  Test: 30 days

# ─── Grid search on training data ────────────────────────────────────────────
import itertools

candidates = [(p, q) for p, q in itertools.product(range(4), range(4)) if p+q > 0]
res = []

for p, q in candidates:
    try:
        m = ARIMA(train, order=(p, 0, q)).fit()
        res.append({'(p,q)': f"({p},{q})", 'AIC': round(m.aic, 2), 'BIC': round(m.bic, 2)})
    except Exception:
        continue   # skip (p,q) orders whose fit fails

best_pq = pd.DataFrame(res).sort_values('AIC')
print(best_pq.head(5).to_string(index=False))

OUTPUT

(p,q)     AIC     BIC
(1,1) 2488.14 2506.45   ← best
(2,1) 2489.01 2511.54
(1,2) 2489.22 2511.75
(1,0) 2501.87 2515.04
(0,1) 2509.63 2522.80

# ─── Fit best ARMA(1,1) on train, forecast 30 days ───────────────────────────
final_model = ARIMA(train, order=(1, 0, 1)).fit()
fc30        = final_model.get_forecast(steps=30)
fc30_mean   = fc30.predicted_mean
fc30_ci     = fc30.conf_int()

# Evaluate on test set
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae  = mean_absolute_error(test, fc30_mean)
rmse = np.sqrt(mean_squared_error(test, fc30_mean))
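# MAPE is unstable for this series: anomalies sit near zero, so the denominator is often tiny
mape = np.mean(np.abs((test.values - fc30_mean.values) / (np.abs(test.values) + 1e-8))) * 100

# Naïve benchmark: forecast = last observed value
naive  = np.full(30, train.iloc[-1])
mae_n  = mean_absolute_error(test, naive)
# Ratio of model MAE to naïve MAE on the test window (a relative MAE;
# the textbook MASE scales by the in-sample one-step naïve MAE instead)
mase   = mae / mae_n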

print(f"MAE  : {mae:.4f}")
print(f"RMSE : {rmse:.4f}")
print(f"MAPE : {mape:.2f}%")
print(f"MASE : {mase:.4f}  → {'Beats naïve ✓' if mase < 1 else 'Worse than naïve ✗'}")

OUTPUT

MAE  : 1.2183
RMSE : 1.5407
MAPE : 8.34%
MASE : 0.7842  → Beats naïve ✓
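The `mase` value in the listing is the model's test MAE divided by the test MAE of a last-value forecast. The textbook Mean Absolute Scaled Error uses a different denominator: the in-sample MAE of the one-step naïve forecast on the training data. A minimal sketch of that definition follows, assuming NumPy; the function signature is an illustration, not a library API.

```python
import numpy as np

def mase(y_train, y_test, y_pred, m=1):
    """MASE: test MAE divided by the in-sample MAE of the lag-m naive forecast."""
    y_train = np.asarray(y_train, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))   # in-sample naive error
    return float(np.mean(np.abs(y_test - y_pred)) / scale)

# Tiny hand-checkable example: naive in-sample error is 1, test MAE is 0.5.
example = mase(y_train=[1, 2, 3, 4, 5], y_test=[6, 7], y_pred=[6.5, 6.5])
print(example)
```

Values below 1 mean the forecast beats the in-sample one-step naïve method on average. Unlike MAPE, the scale does not blow up when the series passes near zero.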

Section 10

Six Mistakes That Break ARMA Models

🎦 Animated — The ARMA Danger Map

Section 11

The ARMA Family Tree — Where ARMA Fits

🎦 Animated — ARMA Family & When to Use Each Member

ARMA is the trunk of the entire classical time series tree. Add differencing → ARIMA. Add seasonal structure → SARIMA. Add external predictors → ARIMAX/SARIMAX. Add volatility modelling → ARIMA-GARCH. Scale to multiple series → VAR. Mastering ARMA makes every extension straightforward.

Section 12

Complete Model Comparison Table

Property	AR(p)	MA(q)	ARMA(p,q)	ARIMA(p,d,q)
Equation	Σφᵢyₜ₋ᵢ + εₜ	εₜ + Σθⱼεₜ₋ⱼ	Σφᵢyₜ₋ᵢ + εₜ + Σθⱼεₜ₋ⱼ	ARMA on Δᵈyₜ
Requires stationarity?	Yes: |roots| > 1	Always stationary	Yes: AR part must be stable	No: differencing creates it
ACF pattern	Tails off (exponential)	Cuts off at lag q	Tails off (no cutoff)	Depends on d, p, q
PACF pattern	Cuts off at lag p	Tails off (exponential)	Tails off (no cutoff)	Depends on d, p, q
Memory type	Infinite (decaying)	Finite (exact cutoff)	Both simultaneously	Both + trend removal
Parameters	p + 2 (c, σ²)	q + 2 (μ, σ²)	p + q + 2	p + q + 2 estimated (d is fixed beforehand, not estimated)
Best for	Persistent momentum series	Short-lived shock series	Most real stationary series	Trending non-stationary data
Stationarity check	ADF test required	None needed	ADF test required	Built-in via d
Long-horizon forecast	Converges to mean	Equals the mean beyond q steps	Converges to mean	Differenced series converges to its mean; the level forecast need not

Section 13

Golden Rules — ARMA in Practice

⚙️ ARMA — Rules You Must Never Break

Confirm stationarity with ADF before anything else. ARMA fitted to a non-stationary series can look like an excellent in-sample fit while being useless for forecasting, because the trend or unit root masquerades as strong autocorrelation. The ADF null hypothesis is a unit root, so p > 0.05 means non-stationarity cannot be ruled out: difference once, which is ARIMA(p, 1, q), and re-test.

ARMA is d=0. It never differences. Fitting ARIMA(p, 0, q) in statsmodels is ARMA(p, q) — d=0 is not an oversight, it is the definition. If you find yourself setting d=1, you are modelling a non-stationary series and you should use ARIMA, not ARMA.

When both ACF and PACF tail off, start with ARMA(1,1). This tiny 4-parameter model captures the behaviour of a surprisingly wide range of real economic series. Only increase p or q if the AIC/BIC grid search clearly points higher and the residuals from (1,1) fail the Ljung-Box test.

Diagnose residuals — it is never optional. Run the Ljung-Box test on residuals at lags 10, 20, and 30. Any p < 0.05 means your model missed systematic structure. Go back, read the residual ACF, and identify which lag still has a significant spike — that tells you exactly which order to increment.

Keep p ≤ 3 and q ≤ 3 unless there is a very strong reason. ARMA(2,2) with 6 parameters is already complex for most datasets. Higher orders create near-redundant parameters, numerical instability, and models that fit noise rather than signal. If you need p=5 or q=5, the series probably has seasonality — consider SARIMA.

ARMA forecasts converge to the mean — that is correct. For a stationary series, the optimal long-range forecast is always the unconditional mean. ARMA explicitly models this decay. If you need forecasts that do not converge to a flat mean, you are dealing with a non-stationary series and need ARIMA or Prophet.

Always split data chronologically, never randomly. Time series observations are ordered. Shuffling them before splitting lets the future inform the past, making your model appear better than it is. Train on an initial block (for example the first 80% of the timeline) and test on the block that follows it.
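A chronological split is a single slice. The sketch below assumes NumPy and an already time-ordered series; `chronological_split` is a hypothetical helper name, and the 80% default simply mirrors the rule of thumb above.

```python
import numpy as np

def chronological_split(series, train_fraction=0.8):
    """Split an ordered series into an initial training block and the test block after it."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

timestamps = np.arange(100)              # stand-in for an ordered time index
train_idx, test_idx = chronological_split(timestamps)

# Every training timestamp precedes every test timestamp, so nothing leaks backwards.
print(len(train_idx), len(test_idx), bool(train_idx.max() < test_idx.min()))
```

The same slicing works on a pandas Series with a date index, which is how the train/test split in the temperature example is done.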