**Open Review**. I want your feedback to make the book better for you and other readers. To add your annotation, select some text and then click the on the pop-up menu. To see the annotations of others, click the button in the upper right hand corner of the page

## 11.3 Multistep losses

Another class of losses that can be used in the estimation of ADAM models is the multistep losses. The idea behind them is to produce the point forecast for \(h\) steps ahead from each observation in sample and then calculate a measure based on that, which will be minimised by the optimiser in order to find the most suitable values of parameters. There is a lot of literature on this topic, Svetunkov et al. (2021) studied them in detail, showing that their usage implies shrinkage of smoothing parameters in case of ETS models. In this section we will discuss the most popular multistep losses, see what they imply and make a connection between these losses and predictive likelihoods from the ETS models.

### 11.3.1 \(\mathrm{MSE}_h\) - MSE for h steps ahead

One of the simplest estimators is the \(\mathrm{MSE}_h\) - mean squared \(h\)-steps ahead error: \[\begin{equation} \mathrm{MSE}_h = \frac{1}{T-h} \sum_{t=1}^{T-h} e_{t+h|t}^2 , \tag{11.8} \end{equation}\] where \(e_{t+h|t} = y_{t+h} - \hat{\mu}_{t+h|t}\) is the conditional \(h\) steps ahead forecast error on the observation \(t+h\) produced from the point at time \(t\). This estimator is sometimes used to fit a model several times, for each horizon from 1 to \(h\) steps ahead, resulting in \(h\) different values of parameters for each \(j=1, \ldots, h\). The estimation process in this case becomes at least \(h\) times more complicated than estimating one model, but is reported to result in increased accuracy. Svetunkov et al. (2021) show that MSE\(_h\) is proportional to the h steps ahead forecast error variance \(V(y_{t+h}|t)=\sigma^2_{h}\), which implies that the minimisation of (11.8) leads to the minimisation of the variance and in turn to the minimisation of both one step ahead MSE and a combination of smoothing parameters of a model. This becomes more obvious in the case of pure additive ETS, where the analytical formulae for variance are available. The parameters are shrunk towards zero in case of ETS, making the model deterministic. The effect is neglected on large samples, when the ratio \(\frac{T-h}{T-1}\) becomes close to one. In case of ARIMA, the shrinkage mechanism is similar, making models closer to the deterministic ones, but the direction of shrinkage is more complicating. The strength of shrinkage is proportional to the forecast horizon \(h\) and is weakened with the increase of the sample size.

Svetunkov et al. (2021) demonstrate that the minimum of MSE\(_h\) corresponds to the maximum of the predictive likelihood based on the normal distribution, assuming that \(\epsilon_t \sim N(0,\sigma^2)\). The log-likelihood in this case is: \[\begin{equation} \ell_{\mathrm{MSE}_h}(\theta, {\sigma^2_h} | \mathbf{y}) = -\frac{T-h}{2} \left( \log(2 \pi) + \log \sigma^2_h \right) -\frac{1}{2} \sum_{t=1}^{T-h} \left( \frac{\eta_{t+h|}^2}{\sigma^2_h} \right) , \tag{11.9} \end{equation}\] where \(\eta_{t+h|} \sim N(0, \sigma_h^2)\) is the h steps ahead forecast error, conditional on the information available at time \(t\), \(\theta\) is the vector of all estimated parameters of the model and \(\mathbf{y}\) is the vector of \(y_{t+h}\) for all \(t=1,..,T-h\). The MLE of the scale parameter in (11.9) coincides with the MSE\(_h\): \[\begin{equation} \hat{\sigma}_h = \frac{1}{T-h} \sum_{t=1}^{T-h} e_{t+h|t}^2 , \tag{11.10} \end{equation}\] where \(e_{t+h|t}\) is the in sample estimate of the \(\eta_{t+h}\). The formula (11.9) can be used for the calculation of information criteria and in turn for the model selection in cases, when MSE\(_h\) is used for the model estimation.

Svetunkov et al. (2021) demonstrate that (11.8) is more efficient than the conventional MSE\(_1\) one, when the true smoothing parameters are close to zero and is less efficient otherwise. On smaller samples, MSE\(_h\) produces biased estimates of parameters due to shrinkage. This can still be considered as advantage if you are interested in forecasting and do not want the smoothing parameters to vary a lot.

### 11.3.2 TMSE - Trace MSE

An alternative to MSE\(_h\) is to produce 1 to \(h\) steps ahead forecasts and calculate the respective forecast errors. Then, based on that, we can calculate the overall measure, which we will call “Trace MSE”: \[\begin{equation} \mathrm{TMSE} = \sum_{j=1}^h \frac{1}{T-h} \sum_{t=1}^{T-h} e_{t+j|t}^2 . \tag{11.11} \end{equation}\] The benefit of this estimator is in minimising the error for the whole 1 to \(h\) steps ahead in one model - there is no need to construct \(h\) models, minimising MSE\(_j\), \(j=1,...,h\). However, this comes with a cost: typically short term forecast errors have lower MSE than the longer term ones, so if we just sum their squares up, we are mixing different values, and the minimisation will be done mainly for the ones on the longer horizons.

TMSE does not have a related predictive likelihood, so it is difficult to study its properties, but the simulations show that it tends to produce less biased and more efficient estimates of parameters than MSE\(_h\). Kourentzes et al. (2019c) showed that TMSE performs well in comparison with the conventional MSE\(_1\) and MSE\(_h\) in terms of forecasting accuracy and does not take as much time as the estimation of \(h\) models using MSE\(_h\).

### 11.3.3 GTMSE - Geometric Trace MSE

An estimator that addresses the issues of TMSE, is the GTMSE, which is derived from a so called General Predictive Likelihood (GPL by Svetunkov et al., 2021). The word “Geometric” sums up how the value is calculated: \[\begin{equation} \mathrm{GTMSE} = \sum_{j=1}^h \log \left(\frac{1}{T-h} \sum_{t=1}^{T-h} e_{t+j|t}^2 \right). \tag{11.12} \end{equation}\] Logarithms in the formula (11.12) bring the MSEs on different horizons to the same level, so that the both short term and long term errors are minimised with a similar power. As a result, the shrinkage effect in this estimator is milder than in MSE\(_h\) and in TMSE and the estimates of parameters are less biased and more efficient on smaller samples. It still has the benefits of other multistep estimators, shrinking the parameters towards zero. Although it is possible to derive a predictive likelihood that would be maximised, when GTMSE is minimised, it relies on unrealistic assumptions of independence of multistep forecast errors (they are always correlated as long as smoothing parameters are not zero).

### 11.3.4 MSCE - Mean Squared Cumulative Error

This estimator aligns the loss function with a specific inventory decision: ordering based on the lead time \(h\): \[\begin{equation} \mathrm{MSCE} = \frac{1}{T-h} \sum_{t=1}^{T-h} \left( \sum_{j=1}^h e_{t+j|t} \right)^2 . \tag{11.13} \end{equation}\] Kourentzes et al. (2019b) demonstrated that it produces more accurate forecasts in cases of intermittent demand and leads to less revenue losses. Svetunkov et al. (2021) show that the shrinkage effect is much stronger in this estimator than in the other estimators discussed in this section. In addition, it is possible to derive a predictive log-likelihood related to this estimator: \[\begin{equation} \ell_{\mathrm{MSCE}}(\theta, {\varsigma^2_h} | \mathbf{Y}_c) = -\frac{T-h}{2} \left( \log(2 \pi) + \log {\varsigma^2_h} \right) -\frac{1}{2} \sum_{t=1}^{T-h} \left( \frac{\left(\sum_{j=1}^h \eta_{t+j|t}\right)^2}{2 {\varsigma^2_h}} \right) , \tag{11.14} \end{equation}\] where \(\mathbf{Y}_c\) is the cumulative sum of actual values, the vector of \(Y_{c,t}=\sum_{j=1}^h y_{t+j}\) for all \(t=1, \ldots, T-h\) and \({\varsigma^2_h}\) is the variance of the cumulative error term, the MLE of which is equal to (11.13). Having the likelihood (11.14) permits the model selection and combination using information criteria and also means that the parameters estimated using MSCE will be asymptotically consistent and efficient.

### 11.3.5 GPL - General Predictive Likelihood

Finally, Svetunkov et al. (2021) studied the General Predictive Likelihood for normally distributed variable from Clements and Hendry (1998), p.77, logarithm version of which can be written as: \[\begin{equation} \ell_{\mathrm{GPL}_h}(\theta, \mathbf{{\Sigma}} | \mathbf{Y}) = -\frac{T-h}{2} \left( h \log(2 \pi) + \log | \mathbf{{\Sigma}}| \right) -\frac{1}{2} \sum_{t=1}^T \left( \mathbf{E_t^\prime} \mathbf{{\Sigma}}^{-1} \mathbf{E_t} \right) , \tag{11.15} \end{equation}\] where \(\mathbf{{\Sigma}}\) is the conditional covariance matrix for the vector of variables \(\mathbf{Y}_t=\begin{pmatrix} y_{t+1|t} & y_{t+2|t} & \ldots & y_{t+h|t} \end{pmatrix}\), \(\mathbf{Y}\) is the matrix consisting of \(\mathbf{Y}_t\) for all \(t=1, \ldots, T-h\) and \(\mathbf{E_t}^{\prime} = \begin{pmatrix} \eta_{t+1|t} & \eta_{t+2|t} & \ldots & \eta_{t+h|t} \end{pmatrix}\) is the vector of 1 to \(h\) steps ahead forecast errors. Svetunkov et al. (2021) showed that the maximisation of the likelihood (11.15) is equivalent to minimising the generalised variance of the error term, \(|\mathbf{\hat{\Sigma}}|\), where: \[\begin{equation} \mathbf{\hat{\Sigma}} = \frac{1}{T-h} \sum_{t=1}^{T-h} \mathbf{E_t} \mathbf{E_t^\prime} = \begin{pmatrix} \hat{\sigma}_1^2 & \hat{\sigma}_{1,2} & \dots & \hat{\sigma}_{1,h} \\ \hat{\sigma}_{1,2} & \hat{\sigma}_2^2 & \dots & \hat{\sigma}_{2,h} \\ \vdots & \vdots & \ddots & \vdots \\ \hat{\sigma}_{1,h} & \hat{\sigma}_{2,h} & \dots & \hat{\sigma}_h^2 \end{pmatrix} , \tag{11.16} \end{equation}\] where \(\hat{\sigma}_{i,j}\) is the covariance between \(i\)-th and \(j\)-th steps ahead forecast errors. Svetunkov et al. (2021) show that this estimator encompassess all the other estimators discussed in this section: minimising MSE\(_h\) is equivalent to minimising the \(\hat{\sigma}^2_{h}\); minimising TMSE is equivalent to minimising the trace of the matrix \(\mathbf{\hat{\Sigma}}\); minimising GTMSE is the same as minimising the determinant of \(\mathbf{\hat{\Sigma}}\) but with the restriction that all off-diagonal elements are equal to zero; minimising MSCE produces the same results as minimising the sum of all elements in \(\mathbf{\hat{\Sigma}}\). However, maximum of GPL is equivalent to the maximum of the conventional one step ahead likelihood for a normal model in case, when all the basic assumptions hold. In other cases, they would be different, but it is still not clear, whether the difference would be favouring the conventional likelihood of the GPL. Nonetheless, GPL, being the likelihood, guarantees that the estimates of parameters will be efficient and consistent and permits model selection and combination via information criteria.

When it comes to models with multiplicative error term, the formula of GPL (11.15) will need to be amended by analogy with the log-likelihood of normal distribution in the same situation (Table 11.2): \[\begin{equation} \begin{aligned} \ell_{\mathrm{GPL}_h}(\theta, \mathbf{{\Sigma}} | \mathbf{Y}) = & -\frac{T-h}{2} \left( h \log(2 \pi) + \log | \mathbf{{\Sigma}}| \right) -\frac{1}{2} \sum_{t=1}^T \left( \mathbf{E_t^\prime} \mathbf{{\Sigma}}^{-1} \mathbf{E_t} \right) \\ & - \sum_{t=1}^{T-h} \sum_{j=1}^h \log |\mu_{y,t+j|t}|, \end{aligned} \tag{11.17} \end{equation}\] where the term \(\sum_{t=1}^{T-h} \sum_{j=1}^h \log |\mu_{y,t+j|t}|\) appears because we assume that the actual value \(h\) steps ahead follows multivariate normal distribution with the conditional expectation \(\mu_{y,t+j|t}\). Note that this is only a very crude approximation, as the conditional distribution for \(h\) steps ahead is not defined for models with multiplicative error models. So, when dealing with GPL, it is recommended to use pure additive models only.

### 11.3.6 Other multistep estimators

It is also possible to derive multistep estimators based on MAE, HAM and other error measures. adam() unofficially supports the following multistep losses by analogy with MSE\(_h\), TMSE and MSCE discussed in this section:

- MAE\(_h\);
- TMAE;
- MACE;
- HAM\(_h\);
- THAM;
- CHAM.

When calculating likelihoods based on these losses, `adam()`

will assume Laplace distribution for (1) - (3) and S distribution for (4) - (6).

### 11.3.7 An example in R

In order to see how different estimators perform, we will us Box-Jenkins Sales data, ETS(A,A,N) model and \(h=10\):

```
<- vector("list",6)
adamModel names(adamModel) <- c("MSE","MSEh","TMSE","GTMSE","MSCE","GPL")
for(i in 1:length(adamModel)){
<- adam(BJsales, "AAN", loss=names(adamModel)[i], h=10, holdout=TRUE)
adamModel[[i]] }
```

We can compare the smoothing parameters of these models to see how the shrinkage effect worked in different estimators:

`round(sapply(adamModel,"[[","persistence"),5)`

```
## MSE MSEh TMSE GTMSE MSCE GPL
## alpha 1.00000 1 1 1.00000 1 1
## beta 0.23915 0 0 0.14617 0 0
```

In the table above we can see that \(\beta\) is close to zero for the estimators that impose harder shrinkage on parameters: MSE\(_h\), TMSE and MSCE. MSE does not shrink the parameters, while GTMSE has the mild shrinkage effect. While the models estimated using these losses are in general not comparable in-sample (although MSE, MSE\(_h\), MSCE and GPL could be compared via information criteria if they are scaled appropriately), they are comparable on the holdout:

`round(sapply(adamModel,"[[","accuracy"),5)[c("ME","MAE","MSE"),]`

```
## MSE MSEh TMSE GTMSE MSCE GPL
## ME 3.22900 1.06479 1.05233 3.44962 1.04604 0.95515
## MAE 3.34075 1.41264 1.40153 3.53730 1.39593 1.31495
## MSE 14.41862 2.89067 2.85880 16.26344 2.84288 2.62394
```

In this specific case, ETS(A,A,N) estimated using GPL produced more accurate forecast than based on the other estimators. Repeating the experiment on many samples and selecting the approach that produces more accurate forecasts would allow to select the most appropriate approach for the combination of the model + data.