**Open Review**. I want your feedback to make the book better for you and other readers. To add your annotation, select some text and then click the on the pop-up menu. To see the annotations of others, click the button in the upper right hand corner of the page

## 8.1 Introduction to ARIMA

ARIMA contains several elements:

- AR(p) - the AutoRegressive part, showing how the variable is impacted by its values on the previous observations. It contains \(p\) lags. For example, the quantity of the liquid in a vessel with an opened tap on some observation will depend on the quantity on the previous steps. This analogy explains the idea of AR part of the model;
- I(d) - the number of differences \(d\) taken in the model (I stands for "Integrated). Working with differences rather than with the original data means that we deal with changes and rates of changes, not with just values. Technically, differences are needed in order to make data stationary (i.e. with fixed expectation and variance, although there are different definitions of the term
*stationarity*, see below); - MA(q) - the Moving Average component, explaining how the variable is impacted by the previous white noise. It contains \(q\) lags. Once again, in technical systems, the idea that the random error can impact the value, has a relatively simple explanation. For example, when the liquid drips out of a vessel, we might not be able to predict the air fluctations, which would impact the flow and could be perceived as elements of random noise. This randomness might in turn impact the quantity of liquid in a vessel on a next observation, thus introducing the MA elements in the model.

I intentionally do not provide ARIMA examples from the demand forecasting area, as these are much more difficult to motivate and explain than the examples from the more technical areas.

Before we continue our discussion, we should define the term **stationarity**. There are two definitions in the literature, one refers to “strict stationarity,” while the other refers to the “weak stationarity”:

- Time series is said to be
**weak stationary**, when its conditional expectation and variance are constant and the variance is finite for all times; - Time series is
**strong stationary**, when its unconditional joint probability distribution does not change over time. This automatically implies that all its moments are constant (i.e. the process is also weak stationary).

The stationarity is essential in ARIMA context and also plays important role in regression analysis. If the series is not stationary, then it might be difficult to estimate its moments correctly using the conventional methods and in some cases it might be not possible to get the correct parameters at all (e.g. there is infinite combination of parameters that would produce the minimum of the selected loss function). In this case, the series is somehow transformed in order to make sure that the moments are finite and constant (e.g. take logarithms or do Box-Cox transform, take differences or detrend the series) and that the model becomes easier to identify. Note that in contrast with ARIMA, the ETS models are almost always all non-stationary and do not require for the series to be stationary. We will see the connection between the two approaches later in this chapter.

### 8.1.1 AR(p)

We start with a simple AR(1) model, which is written as: \[\begin{equation} {y}_{t} = \phi_1 y_{t-1} + \epsilon_t , \tag{8.1} \end{equation}\] where \(\phi_1\) is the parameter of the model. This formula tells us that the value on the previous observation is carried out to the new one in the proportion of \(\phi_1\). Typically, the parameter \(\phi_1\) is restricted with the region (-1, 1), in order to make the model stationary, but very often in real life \(\phi_1\) actually lies in (0, 1) region. If the parameter is equal to 1, then the model becomes equivalent to Random Walk.

The forecast trajectory (conditional expectation several steps ahead) of this model would typically correspond to the exponentially declining curve. Here is a simple example in R of a very basic forecast from AR(1) with \(\phi_1=0.9\):

```
<- vector("numeric", 20)
y 1] <- 100
y[<- 0.9
phi for(i in 2:length(y)){
<- phi * y[i-1]
y[i]
}plot(y, type="l", xlab="horizon", ylab="Forecast")
```

If for some reason we get \(\phi_1>1\), then the trajectory corresponds to exponential increase, becoming explosive, implying non-stationary behaviour. The model in this case becomes very difficult to work with, even if the parameter is close to one. So it is typically advised to restrict the parameter with stationarity region (we will discuss this in more detail later in this chapter).

In general, it is possible to imagine the situation, when the value at the moment of time \(t\) would depend on several previous values, so the model AR(p) can be written as: \[\begin{equation} {y}_{t} = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \epsilon_t , \tag{8.2} \end{equation}\] where \(\phi_i\) is the parameters for the \(i\)-th lag of the model. So, the model assumes that the data on the recent observations is influenced by the \(p\) previous observations. The more lags we introduce in the model, the more complicated the forecasting trajectory becomes, potentially introducing harmonic behaviour. Here is an example of AR(3) model \({y}_{t} = 0.9 y_{t-1} -0.7 y_{t-2} + 0.6 y_{t-3} + \epsilon_t\):

```
<- vector("numeric", 30)
y 1:3] <- c(100, 75, 30)
y[<- c(0.9,-0.7,0.6)
phi for(i in 4:30){
<- phi[1] * y[i-1] + phi[2] * y[i-2] + phi[3] * y[i-3]
y[i]
}plot(y, type="l", xlab="horizon", ylab="Forecast")
```

No matter what the forecast trajectory of AR model is, it will asymptotically converge to zero, as long as the model is stationary.

### 8.1.2 MA(q)

Before discussing the “Moving Averages” model, we should acknowledge that the name is quite misleading, and that the model has *nothing to do* with Centred Moving Averages used in time series decomposition or Simple Moving Averages (average of several concequitive observation). The idea of the simplest MA(1) model can be summarised in the following mathematical way:
\[\begin{equation}
{y}_{t} = \theta_1 \epsilon_{t-1} + \epsilon_t ,
\tag{8.3}
\end{equation}\]
where \(\theta_1\) is the parameter of the model, typically lying between (-1, 1), showing what part of the error is carried out to the next observation. Because of the conventional assumption that the error term has a zero mean (\(\mathrm{E}(\epsilon_{t})=0\)), the forecast trajectory of this model is just a straight line coinsiding with zero starting from the \(h=2\). For the one step ahead forecast we have:
\[\begin{equation}
\mathrm{E}({y}_{t+1}|t) = \theta_1 \mathrm{E}(\epsilon_{t}|t) + \mathrm{E}(\epsilon_{t+1}|t) = \theta_1 \epsilon_{t}.
\tag{8.4}
\end{equation}\]
But starting from \(h=2\) there are no observable error terms \(\epsilon_t\), so all the values past that are equal to zero:
\[\begin{equation}
\mathrm{E}({y}_{t+2}|t) = \theta_1 \mathrm{E}(\epsilon_{t+1}|t) + \mathrm{E}(\epsilon_{t+2}|t) = 0.
\tag{8.5}
\end{equation}\]
So, the forecast trajectory for MA(1) model converges to zero, when \(h>1\).

More generally, MA(q) model is written as: \[\begin{equation} {y}_{t} = \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q} + \epsilon_t , \tag{8.6} \end{equation}\] where \(\theta_i\) is the parameters for the \(i\)-th lag of the error term, which are typically restricted with the so called invertibility region (discussed in the next section). In this case, the model assumes that the recent observation is influenced by several errors on previous observations (your mistakes in the past will haunt you in the future). The more lags we introduce, the more complicated the model becomes. As for the forecast trajectory, it will reach zero, when \(h>q\).

### 8.1.3 ARMA(p,q)

Connection the models (8.2) and (8.6), we get the more complicated model, ARMA(p,q): \[\begin{equation} {y}_{t} = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q} + \epsilon_t , \tag{8.7} \end{equation}\] which has the properties of the two models discussed above. The forecast trajectory from this model will have a combination of trajectories for AR and MA for \(h \leq q\) and then will correspond to AR(p) for \(h>q\).

In order to simplify the work with ARMA models, the equation (8.7) is typically rewritten, by moving all terms with \(y_t\) to the left hand side: \[\begin{equation} {y}_{t} - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \dots - \phi_p y_{t-p} = \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q} + \epsilon_t . \tag{8.8} \end{equation}\] Furthermore, in order to make this even more compact, the backshift operator B is introduced, which just shows by how much the subscript of the variable is shifted back in time: \[\begin{equation} {y}_{t} B^i = {y}_{t-i}. \tag{8.9} \end{equation}\] Using (8.9), the ARMA model can be written as: \[\begin{equation} {y}_{t} (1 - \phi_1 B - \phi_2 B^2 - \dots - \phi_p B^p) = \epsilon_t (1 + \theta_1 B + \theta_2 B^2 + \dots + \theta_q B^q) . \tag{8.10} \end{equation}\] Finally, we can also introduce the AR and MA polynomial functions to make the model even more compact: \[\begin{equation} \begin{aligned} & \varphi^p(B) = 1 - \phi_1 B - \phi_2 B^2 - \dots - \phi_p B^p \\ & \vartheta^q(B) = 1 + \theta_1 B + \theta_2 B^2 + \dots + \theta_q B^q . \end{aligned} \tag{8.11} \end{equation}\] Inserting the functions (8.11) in (8.10) leads to the compact presentation of ARMA model: \[\begin{equation} {y}_{t} \varphi^p(B) = \epsilon_t \vartheta^q(B) . \tag{8.12} \end{equation}\] The model (8.12) can be considered as a compact form of (8.7). It is more difficult to understand and interpret, but easier to work with from mathematical point of view. In addition, this form permits introducing additional elements, which will be discussed later in this chapter.

Coming back to the ARMA model (8.7), we might notice, that it assumes convergence to zero, the speed of which is regulated via the parameters. In fact, this implies that the data has the mean of zero, and ARMA becomes useful, when the data is somehow pre-processed, so that it is stationary and varies around zero. This means that if you work with non-stationary and / or with non-zero mean data, the pure AR / MA or ARMA will be inappropriate - some prior transformations are in order.

### 8.1.4 ARMA with constant

One of the simpler ways to deal with the issue with zero forecasts is to introduce the constant (or intercept) in ARMA: \[\begin{equation} {y}_{t} = a_0 + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_p \epsilon_{t-p} + \epsilon_t \tag{8.13} \end{equation}\] or \[\begin{equation} {y}_{t} \varphi^p(B) = a_0 + \epsilon_t \vartheta^q(B) , \tag{8.12} \end{equation}\] where \(a_0\) is the constant parameter, which in this case also works as the unconditional mean of the series. The forecast trajectory in this case would converge to \(a_0\) instead of zero, but with some minor differences from the ARMA without constant. For example, in case of ARMA(1,1) with constant we will have: \[\begin{equation} {y}_{t} = a_0 + \phi_1 y_{t-1} + \theta_1 \epsilon_{t-1} + \epsilon_t . \tag{8.14} \end{equation}\] The conditional expectation of \(y_{t+h}\) for \(h=1\) and \(h=2\) can be written as (based on the discussions in previous sections): \[\begin{equation} \begin{aligned} & \mathrm{E}({y}_{t+1}|t) = a_0 + \phi_1 y_{t} + \theta_1 \epsilon_{t} \\ & \mathrm{E}({y}_{t+2}|t) = a_0 + \phi_1 \mathrm{E}(y_{t+1}|t) = a_0 + \phi_1 a_0 + \phi_1^2 y_{t} + \phi_1 \theta_1 \epsilon_t \end{aligned} , \tag{8.15} \end{equation}\] or in general for some horizon \(h\): \[\begin{equation} \mathrm{E}({y}_{t+h}|t) = \sum_{j=1}^h a_0\phi_1^{j-1} + \phi_1^h y_{t} + \phi_1^{h-1} \theta_1 \epsilon_{t} . \tag{8.16} \end{equation}\] So, the forecast trajectory from this model dampens out, similar to the ETS(A,Ad,N) model, and the rate of dampening is regulated by the value of \(\phi_1\). The following simple example demonstrates this point (I drop the MA(1) part because it does not change the shape of the curve):

```
<- vector("numeric", 20)
y 1] <- 100
y[<- 0.9
phi for(i in 2:length(y)){
<- 100 + phi * y[i-1]
y[i]
}plot(y, type="l", xlab="horizon", ylab="Forecast")
```

The more complicated ARMA(p,q) models with p>1 will have more complicated trajectories with potential harmonics, but the idea of dampening in AR(p) part of the model stays.

Finally, as alternative to adding \(a_0\), each actual value of \(y_t\) can be centred via \(y^\prime_t = y_t - \bar{y}\), making sure that the mean of \(y^\prime_t\) is zero and ARMA can be applied to the \(y^\prime_t\) data instead of \(y_t\). However, this approach introduces additional steps, but the result on stationary data is typically the same.

### 8.1.5 I(d)

Based on the previous discussion, we can conclude that ARMA cannot be applied to non-stationary data. So, if we deal with one, we need to make it stationary somehow. The convetional way of doing that is by taking differences of the data. The logic behind this is straight forward: if the data is not stationary, then the mean somehow changes over time. This can be, for example, due to a trend in the data. In this case we should be taking about the change of variable \(y_t\) rather than the variable itself. So we should work on the following data instead:
\[\begin{equation}
\Delta y_t = y_t - y_{t-1} = y_t (1 - B),
\tag{8.17}
\end{equation}\]
if the differences have constant mean. The simplest model with differences is I(1), which is also known as the **Random walk**:
\[\begin{equation}
\Delta y_t = \epsilon_t,
\tag{8.18}
\end{equation}\]
which can be reformulated in a simpler, more interpretable form by inserting (8.17) in (8.18) and regrouping elements:
\[\begin{equation}
y_t = y_{t-1} + \epsilon_t.
\tag{8.19}
\end{equation}\]
The model (8.19) can also be perceived as AR(1) with \(\phi_1=1\). This is a non-stationary model, meaning that the unconditional mean of \(y_t\) is not constant. The forecast from this model corresponds to the Na"{i}ve method with straight line equal to the last observed actual value (again, assuming that \(\mathrm{E}(\epsilon_{t})=0\) and that other basic assumptions hold):
\[\begin{equation}
\mathrm{E}(y_{t+h}|t) = \mathrm{E}(y_{t+h-1}|t) + \mathrm{E}(\epsilon_{t+h}|t) = y_{t} .
\tag{8.20}
\end{equation}\]

Another simple model that relies on differences of the data is called **Random Walk with drift** and is formulated by adding constant \(a_0\) to the right hand side of equation (8.18):
\[\begin{equation}
\Delta y_t = a_0 + \epsilon_t.
\tag{8.21}
\end{equation}\]
This model has some similarities with the global level model, which is formulated via the actual value rather than differences:
\[\begin{equation}
{y}_{t} = a_0 + \epsilon_t.
\tag{8.22}
\end{equation}\]
Using a similar regrouping as with the Random Walk, we can obtain a simpler form of (8.21):
\[\begin{equation}
y_t = a_0 + y_{t-1} + \epsilon_t.
\tag{8.23}
\end{equation}\]
which is, again, equivalent to AR(1) model with \(\phi_1=1\), but this time with a constant. The term “drift” appears because \(a_0\) acts as an additional element, showing what the tendecy in the data will be: if it is positive, the model will exhibit positive trend, if it is negative, the trend will be negative. This can be seen for the conditional mean, for example, for the case of \(h=2\):
\[\begin{equation}
\mathrm{E}(y_{t+2}|t) = \mathrm{E}(a_0) + \mathrm{E}(y_{t+1}|t) + \mathrm{E}(\epsilon_{t+2}|t) = a_0 + \mathrm{E}(a_0 + y_t + \epsilon_t|t) = 2 a_0 + y_t ,
\tag{8.24}
\end{equation}\]
or in general for the horizon \(h\):
\[\begin{equation}
\mathrm{E}(y_{t+h}|t) = h a_0 + y_t .
\tag{8.25}
\end{equation}\]

In a similar manner we can also introduce second differences of the data (differences of differences) if we suspect that the change of variable over time is not stationary, which would be written as: \[\begin{equation} \Delta^2 y_t = \Delta y_t - \Delta y_{t-1} = y_t - y_{t-1} - y_{t-1} + y_{t-2}, \tag{8.26} \end{equation}\] which can also be written in a form using backshift operator: \[\begin{equation} \Delta^2 y_t = y_t(1 - 2B + B^2) = y_t (1-B)^2. \tag{8.27} \end{equation}\] In fact, we can introduce higher level differences if we want (but typically we should not) based on the idea of (8.27): \[\begin{equation} \Delta^d = (1-B)^d. \tag{8.28} \end{equation}\] Based on that, the I(d) model is formualted as: \[\begin{equation} \Delta^d y_t = \epsilon_t. \tag{8.29} \end{equation}\]

### 8.1.6 ARIMA(p,d,q)

Finally, having made the data stationary via the differences, we can introduce ARMA elements (8.12) to it which would be done on the differenced data, instead of the original \(y_t\): \[\begin{equation} y_t \Delta^d(B) \varphi^p(B) = \epsilon_t \vartheta^q(B) , \tag{8.30} \end{equation}\] or in a more general form (8.10) with (8.27): \[\begin{equation} y_t (1-B)^d (1 - \phi_1 B - \dots - \phi_p B^p) = \epsilon_t (1 + \theta_1 B + \dots + \theta_q B^q), \tag{8.31} \end{equation}\] which is ARIMA(p,d,q) model. This model allows producing trends with some values of differences and also inherits the trajectories from both AR(p) and MA(q). This implies that the point forecasts from the model can exhibit quite complicated trajectories, depending on the values of parameters of the model.

The model (8.31) is difficult to interpret in a general form, but opening the brackets and moving all elements but \(y_t\) to the right hand side typically helps in understanding of each specific model.

### 8.1.7 Parameters bounds

ARMA models have two conditions that need to be satisfied in order for them to be useful and to work appropriately:

- Stationarity,
- Invertibility.

The condition (1) has already been discussed earlier in this chapter, and is imposed on AR parameters of the model, making sure that the forecast trajectories do not exhibit explosive behaviour (in terms of both mean and variance). (2) is equivalent to the stability condition in ETS and refers to the MA parameters: it guarantees that the old observations do not have increasing impact on the recent ones. The term “invertibility” comes from the idea that any MA process can be represented as an infinite AR process via the inversion of the parameters. For example, MA(1) model, which is written as:
\[\begin{equation}
y_t = \epsilon_t + \theta_1 \epsilon_{t-1} = \epsilon_t (1 + \theta_1 B) ,
\tag{8.32}
\end{equation}\]
can be rewritten as:
\[\begin{equation}
y_t (1 + \theta_1 B)^{-1} = \epsilon_t,
\tag{8.33}
\end{equation}\]
or in a slightly easier to digest form (based on (8.32) and the idea that \(\epsilon_{t} = y_{t} - \theta_1 \epsilon_{t-1}\), implying that \(\epsilon_{t-1} = y_{t-1} - \theta_1 \epsilon_{t-2}\)):
\[\begin{equation}
y_t = \theta_1 y_{t-1} - \theta_1^2 \epsilon_{t-2} + \epsilon_t = \theta_1 y_{t-1} - \theta_1^2 y_{t-2} + \theta_1^3 \epsilon_{t-2} + \epsilon_t = \sum_{j=1}^\infty -1^{j-1} \theta_1^j y_{t-j} + \epsilon_t.
\tag{8.34}
\end{equation}\]
The recursion in (8.34) shows that the recent actual value \(y_t\) depends on the previous infinite number of values of \(y_{t-j}\) for \(j=\{1,\dots,\infty\}\). The parameter \(\theta_1\) in this case is exponentiated and leads to the distribution of weights in this infinite series in an exponential way (reminds SES, doesn’t it?). The *invertibility* condition makes sure that those weights decline over time with the increase of \(j\), so that the older observations do not have an increasing impact on the most recent \(y_t\).

There are different ways how to check both conditions, the conventional of which is calculating the roots of the polynomial equations: \[\begin{equation} \begin{aligned} & \varphi^p(B) = 0 \text{ for AR} \\ & \vartheta^q(B) = 0 \text{ for MA} \end{aligned} , \tag{8.35} \end{equation}\] or expanding the functions in (8.35) and substituting \(B\) with a variable \(x\) (for convenience): \[\begin{equation} \begin{aligned} & 1 - \phi_1 x - \phi_2 x^2 - \dots - \phi_p x^p = 0 \text{ for AR} \\ & 1 + \theta_1 x + \theta_2 x^2 + \dots + \theta_q x^q = 0 \text{ for MA} \end{aligned} . \tag{8.36} \end{equation}\] Solving the first equation for \(x\) in (8.36), we get \(p\) roots (some of them might be complex numbers). In order for the model to be stationary all the roots need to be greater than one by absolute value. Similarly, if all the roots of the second equation in (8.36) are greater than one by absolute value, then the model is invertible (aka stable). A special case for both conditions is for the sums of parameters to lie between 0 and 1: \[\begin{equation} \begin{aligned} & 0 < \sum_{j=1}^p \phi_j < 1 \\ & 0 < \sum_{j=1}^q \theta_j < 1 \end{aligned} . \tag{8.37} \end{equation}\] In this case, the respective stationarity and invertibility conditions are satisfied, which could be used in the model estimation (calculating roots of polynomials is a more difficult task). But note that the condition (8.37) is rather restrictive and not genuinly applicable for all ARIMA models.

Finally, in a special case with AR(p) model, \(0 < \sum_{j=1}^p \phi_j < 1\) and \(\sum_{j=1}^p \phi_j = 1\), we end up with the moving weighted average, which is a non-stationary model. This becomes apparent from the connection between Simple Moving Average and AR processes (Svetunkov and Petropoulos, 2018).