\( \newcommand{\mathbbm}[1]{\boldsymbol{\mathbf{#1}}} \)

18.2 Conditional moments and scale

We have already discussed how to obtain conditional expectation and variance in Sections 5.3 and 6.3. However, the topic is worth discussing in more detail, especially for non-Normal distributions.

18.2.1 Conditional expectation

The general rule for generating conditional expectations in ADAM is that they can be produced analytically only for pure additive models. This applies not only to ETS but also to ARIMA (Subsection 9.2.1) and Regression (Section 10.2). If the model has multiplicative components (such as multiplicative error, trend, or seasonality) or is formulated in logarithms (for example, ARIMA in logarithms), then simulations should be preferred (Section 18.1) – the point forecasts from such models do not necessarily correspond to the conditional expectations.
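As a sketch of this distinction (assuming the smooth package is installed), the forecast() method for adam models returns analytical conditional expectations for pure additive models, while for multiplicative models the returned point forecasts may rely on simulations and need not equal the conditional expectations:

```r
library(smooth)

# Pure additive ETS(A,A,N): point forecasts are the analytical
# conditional expectations
adam(BJsales, "AAN") |>
    forecast(h=10)

# ETS(M,M,M): point forecasts are not guaranteed to equal the
# conditional expectations; simulations may be used internally
adam(AirPassengers, "MMM") |>
    forecast(h=12)
```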

18.2.2 Explanatory variables

If the model contains explanatory variables, then the \(h\) steps ahead conditional expectations should use their values in the calculation. The main challenge here is that the future values of the explanatory variables might not be known. This has been discussed in Section 10.2. Practically speaking, if the user provides the holdout sample values of explanatory variables, the forecast.adam() method will use them in forecasting. If they are not provided, the function will produce forecasts for each explanatory variable via the adam() function and use their conditional \(h\) steps ahead expectations in forecasting.
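This can be sketched with the classic BJsales data, using its leading indicator as an explanatory variable (the model specification here is illustrative). With holdout=TRUE, the withheld values of x are available to forecast.adam(); without them, the function would forecast x internally:

```r
library(smooth)

# Illustrative data: sales with a leading indicator as explanatory variable
BJData <- cbind(y=BJsales, x=BJsales.lead)
adamX <- adam(BJData, "ANN", h=10, holdout=TRUE,
              formula=y~x)
# The holdout values of x are used in producing the forecast
forecast(adamX, h=10)
```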

18.2.3 Conditional variance and scale

Similar to conditional expectations, as discussed in Sections 5.3 and 6.3, the conditional \(h\) steps ahead variance is in general available only for pure additive models. While the conditional expectation might be required on its own as a point forecast, the conditional variance is typically needed to produce prediction intervals. However, it is useful only for distributions that are closed under convolution (i.e. where the sum of random variables stays in the same family), which limits its applicability to pure additive models and to additive models applied to the data in logarithms. For example, if we deal with the Inverse Gaussian distribution, then the \(h\) steps ahead values will not follow the Inverse Gaussian distribution, and we would need to revert to simulations in order to obtain the proper statistics. Another such situation is a multiplicative error model relying on the Normal distribution – the product of Normal random variables does not follow a Normal distribution, so the statistics would again need to be obtained via simulations.
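A sketch of one such case in R (assuming the smooth package): an ETS(M,N,N) model with the Inverse Gaussian distribution, for which prediction intervals can be requested via simulated trajectories rather than analytical formulae:

```r
library(smooth)

# ETS(M,N,N) with Inverse Gaussian error term; the analytical
# h steps ahead variance is not available for this combination,
# so simulated intervals are requested instead
adam(BJsales, "MNN", distribution="dinvgauss") |>
    forecast(h=7, interval="simulated")
```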

If we deal with a pure additive model with either Normal, Laplace, S, or Generalised Normal distributions, then the formulae derived in Section 5.3 can be used to produce \(h\) steps ahead conditional variance. Having obtained those values, we can then produce conditional \(h\) steps ahead scales for the distributions (which would be needed, for example, to generate quantiles), using the relations between the variance and scale in those distributions (discussed in Section 5.5):

  1. Normal: scale is \(\sigma^2_h\);
  2. Laplace: \(s_h = \sigma_h \sqrt{\frac{1}{2}}\);
  3. Generalised Normal: \(s_h = \sigma_h \sqrt{\frac{\Gamma(1/\beta)}{\Gamma(3/\beta)}}\);
  4. S: \(s_h = \sqrt{\sigma_h}\sqrt[4]{\frac{1}{120}}\).
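The conversions above can be sketched in R for an assumed value of the conditional variance (sigma2h and the Generalised Normal shape beta below are illustrative numbers, not estimates):

```r
sigma2h <- 4              # an assumed h steps ahead conditional variance
sigmah <- sqrt(sigma2h)

scaleNormal  <- sigma2h                                  # Normal
scaleLaplace <- sigmah * sqrt(1/2)                       # Laplace
beta <- 1.5                                              # assumed GN shape
scaleGN <- sigmah * sqrt(gamma(1/beta) / gamma(3/beta))  # Generalised Normal
scaleS  <- sqrt(sigmah) * (1/120)^0.25                   # S
```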

If the variance is needed for the other combinations of model/distributions, simulations would need to be done to produce multiple trajectories, similar to how it was done in Section 18.1. An alternative to this would be the calculation of in-sample multistep forecast errors (similar to how it was discussed in Sections 11.3 and 14.7.3) and then calculating the variance based on them for each horizon \(j = 1 \dots h\).
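In the smooth package, the latter approach can be sketched via the rmultistep() method, which returns the in-sample multiple steps ahead forecast errors of an estimated model:

```r
library(smooth)

adamModel <- adam(BJsales)
# In-sample 1 to 7 steps ahead forecast errors
multistepErrors <- rmultistep(adamModel, h=7)
# Empirical variance of the errors for each forecast horizon
apply(multistepErrors, 2, var)
```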

In the smooth package for R, there is a multicov() method that allows extracting the multiple steps ahead covariance matrix \(\hat{\boldsymbol{\Sigma}}\) (see Subsection 11.3.5). The method can estimate the covariance matrix using analytical formulae (where available), or via empirical calculations (based on multiple steps ahead in-sample error), or via simulation. Here is an example for one of the models in R:

adam(BJsales) |>
    multicov(h=7) |>
    round(3)
##       h1    h2     h3     h4     h5     h6     h7
## h1 1.844 2.305  2.700  3.029  3.305  3.536  3.729
## h2 2.305 4.725  5.679  6.486  7.161  7.725  8.197
## h3 2.700 5.679  8.677 10.113 11.324 12.337 13.184
## h4 3.029 6.486 10.113 13.653 15.543 17.133 18.462
## h5 3.305 7.161 11.324 15.543 19.577 21.880 23.816
## h6 3.536 7.725 12.337 17.133 21.880 26.357 29.031
## h7 3.729 8.197 13.184 18.462 23.816 29.031 33.898

18.2.4 Scale model

In the case of the scale model (Chapter 17), the situation becomes more complicated because we no longer assume that the variance of the error term is constant (i.e. that the residuals are homoscedastic) – we now assume that it follows a model of its own. In this case, we need to take a step back to the recursion (5.10) and, when taking the conditional variance, introduce the time-varying variance \(\sigma_{t+h}^2\).

Remark. Note the difference between \(\sigma_{t+h}^2\) and \(\sigma_{h}^2\) in our notations – the former is the variance of the error term for the specific step \(t+h\), while the latter is the conditional variance \(h\) steps ahead, which is derived based on the assumption of homoscedasticity.

Making that substitution leads to the following analytical formula for the \(h\) steps ahead conditional variance in the case of the scale model: \[\begin{equation} \text{V}(y_{t+h}|t) = \sum_{i=1}^d \left(\mathbf{w}_{m_i}^\prime \sum_{j=1}^{\lceil\frac{h}{m_i}\rceil-1} \mathbf{F}_{m_i}^{j-1} \mathbf{g}_{m_i} \mathbf{g}^\prime_{m_i} (\mathbf{F}_{m_i}^\prime)^{j-1} \mathbf{w}_{m_i} \sigma_{t+h-j}^2 \right) + \sigma_{t+h}^2 . \tag{18.1} \end{equation}\] This variance can then be used, for example, to produce quantiles from the assumed distribution.
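For intuition, consider the special case of ETS(A,N,N), where \(d=1\), \(\mathbf{w}=1\), \(\mathbf{F}=1\), and \(\mathbf{g}=\alpha\): formula (18.1) then reduces to \(\text{V}(y_{t+h}|t) = \alpha^2 \sum_{j=1}^{h-1} \sigma_{t+h-j}^2 + \sigma_{t+h}^2\). This special case can be sketched in R (sigma2 is an assumed vector of scale model variances \(\sigma^2_{t+1}, \dots, \sigma^2_{t+h}\)):

```r
# Sketch: h steps ahead variance for ETS(A,N,N) with time-varying variance,
# a special case of formula (18.1)
varANN <- function(alpha, sigma2){
    h <- length(sigma2)
    if(h == 1){
        return(sigma2)
    }
    alpha^2 * sum(sigma2[-h]) + sigma2[h]
}

varANN(0.3, c(1.2, 1.1, 1.3))
# 0.09 * (1.2 + 1.1) + 1.3 = 1.507
```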

As mentioned above, in the case of a model that is not purely additive, or one with a distribution other than Normal, Laplace, S, or Generalised Normal, the conditional variance can be obtained using simulations. In the case of the scale model, the principles stay the same, the only difference being that each error term \(\epsilon_{t+h}\) has its own scale, obtained from the estimated scale model. The rest of the logic is exactly the same as discussed in Section 18.1.
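A sketch of that workflow in R, assuming the sm() and implant() methods for scale model estimation from Chapter 17 (the model specifications are illustrative):

```r
library(smooth)

# Estimate the location model, then a scale model for its residuals
adamModel <- adam(BJsales, "AAdN")
scaleModel <- sm(adamModel, model="ANN")
# Implant the scale model into the location model
adamMerged <- implant(adamModel, scaleModel)
# Prediction intervals now reflect the time-varying scale
forecast(adamMerged, h=7, interval="prediction")
```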