
## 4.4 ETS assumptions, estimation and selection

Several assumptions need to hold for the conventional ETS models to be applied appropriately in practice. Some of them have already been discussed in one of the previous sections, so we will not repeat them here. What is important in our context is that the conventional ETS assumes that the error term \(\epsilon_t\) follows a normal distribution with zero mean and variance \(\sigma^2\). The normal distribution is defined over the whole real line, so the error term can take positive, negative and zero values. This is not a problem for additive models, which assume that the actual value can be anything. Nor is it an issue for multiplicative models applied to high-level positive data (e.g. thousands of units): the variance of the error term will be small enough for \(\epsilon_t\) to stay above minus one. However, if the level of the data is low, then the variance of the error term can be large enough for the normally distributed error to take values below minus one. In that case the multiplier \(1+\epsilon_t\) becomes negative, and the model breaks. This is a potential flaw in the conventional ETS model with the multiplicative error term. So, what the conventional multiplicative error ETS model in fact assumes is that **the data we work with is strictly positive and has high level values**.
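This effect is easy to see in a small simulation. The sketch below (all numbers are hypothetical, chosen only for illustration) draws normal errors on the relative scale and estimates how often the multiplier \(1+\epsilon_t\) turns non-positive for high-level versus low-level data:

```python
import numpy as np

rng = np.random.default_rng(42)

def prob_negative_multiplier(level, abs_std, n=100_000):
    """Estimate how often the multiplier 1 + eps_t of a multiplicative
    error model turns non-positive, when noise of absolute size abs_std
    sits on data at the given level (both numbers are hypothetical)."""
    sigma = abs_std / level                  # std of the relative error
    errors = rng.normal(0.0, sigma, size=n)
    return np.mean(1.0 + errors <= 0.0)

# High-level data: relative errors are tiny, the multiplier stays positive.
print(prob_negative_multiplier(level=1000, abs_std=50))   # essentially 0
# Low-level data: the multiplier goes negative in a noticeable share of cases.
print(prob_negative_multiplier(level=5, abs_std=3))       # roughly 0.05
```

With the same absolute noise, only the low-level series produces relative errors below minus one, which is exactly the situation where the multiplicative error model would break.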

Based on the assumption of normality of the error term, the ETS model can be estimated via the maximisation of likelihood, which is equivalent to the minimisation of the mean squared one-step-ahead forecast error \(e_t\). Note that in order to apply the ETS models to the data, we also need to know the initial values of the components, \(\hat{l}_0, \hat{b}_0, \hat{s}_{-m+2}, \hat{s}_{-m+3}, \dots, \hat{s}_{0}\). The conventional approach is to estimate these values together with the smoothing parameters during the maximisation of likelihood. As a result, the optimisation might involve a large number of parameters. In addition, the variance of the error term is considered as an additional parameter in the maximum likelihood estimation, so the number of parameters for different models is (here "*" stands for any type):

- ETS(*,N,N) - 3 parameters: \(\hat{l}_0\), \(\hat{\alpha}\) and \(\hat{\sigma}^2\);
- ETS(*,*,N) - 5 parameters: \(\hat{l}_0\), \(\hat{b}_0\), \(\hat{\alpha}\), \(\hat{\beta}\) and \(\hat{\sigma}^2\);
- ETS(*,*d,N) - 6 parameters: \(\hat{l}_0\), \(\hat{b}_0\), \(\hat{\alpha}\), \(\hat{\beta}\), \(\hat{\phi}\) and \(\hat{\sigma}^2\);
- ETS(*,N,*) - 4+m-1 parameters: \(\hat{l}_0\), \(\hat{s}_{-m+2}, \hat{s}_{-m+3}, \dots, \hat{s}_{0}\), \(\hat{\alpha}\), \(\hat{\gamma}\) and \(\hat{\sigma}^2\);
- ETS(*,*,*) - 6+m-1 parameters: \(\hat{l}_0\), \(\hat{b}_0\), \(\hat{s}_{-m+2}, \hat{s}_{-m+3}, \dots, \hat{s}_{0}\), \(\hat{\alpha}\), \(\hat{\beta}\), \(\hat{\gamma}\) and \(\hat{\sigma}^2\);
- ETS(*,*d,*) - 7+m-1 parameters: \(\hat{l}_0\), \(\hat{b}_0\), \(\hat{s}_{-m+2}, \hat{s}_{-m+3}, \dots, \hat{s}_{0}\), \(\hat{\alpha}\), \(\hat{\beta}\), \(\hat{\gamma}\), \(\hat{\phi}\) and \(\hat{\sigma}^2\).

Note that in the case of seasonal models we typically make sure that the initial seasonal indices are normalised, so we only need to estimate \(m-1\) of them; the last one is calculated as a linear combination of the others.
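The counting rule from the list above can be written as a short function (the name and interface are illustrative, not from any particular package): start from the three parameters every model has (\(\hat{l}_0\), \(\hat{\alpha}\), \(\hat{\sigma}^2\)) and add the trend, damping and seasonal parameters as needed.

```python
def ets_parameter_count(trend, seasonal, damped=False, m=1):
    """Number of parameters estimated via likelihood for a conventional
    ETS model. `trend` and `seasonal` are 'N' for "none" or any other
    component type; `m` is the seasonal period. Seasonal initials are
    normalised, so only m - 1 of them are free."""
    count = 3                     # l_0, alpha and sigma^2
    if trend != "N":
        count += 2                # b_0 and beta
        if damped:
            count += 1            # phi
    if seasonal != "N":
        count += (m - 1) + 1      # m - 1 seasonal initials and gamma
    return count

print(ets_parameter_count("N", "N"))                     # ETS(*,N,N): 3
print(ets_parameter_count("A", "A", m=12))               # ETS(*,*,*): 6 + 12 - 1 = 17
print(ets_parameter_count("A", "A", damped=True, m=12))  # ETS(*,*d,*): 7 + 12 - 1 = 18
```

For monthly data (\(m=12\)) the full damped seasonal model thus involves 18 parameters, which illustrates why the optimisation can become demanding.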

When it comes to the selection of the most appropriate model, the conventional approach is to apply all the models to the data and then select the best of them based on an information criterion. In the case of the conventional ETS model, this relies on the likelihood value of the normal distribution used in the estimation of the model.
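To make this concrete, here is a minimal sketch (my own illustration, not the implementation used in any specific package) of estimating ETS(A,N,N) by minimising the SSE and computing its AIC from the concentrated normal log-likelihood; in the conventional selection procedure, every candidate ETS model would be fitted like this and the one with the lowest information criterion chosen:

```python
import numpy as np
from scipy.optimize import minimize

def ses_sse(params, y):
    """One-step-ahead sum of squared errors for ETS(A,N,N).
    params = (alpha, l0): smoothing parameter and initial level."""
    alpha, level = params
    sse = 0.0
    for yt in y:
        e = yt - level          # one-step-ahead forecast error
        sse += e * e
        level += alpha * e      # level update equation
    return sse

def ets_ann_aic(y):
    """Fit ETS(A,N,N) by minimising the SSE (equivalent to maximising
    the normal likelihood) and return the AIC with k = 3 parameters:
    l0, alpha and sigma^2."""
    T = len(y)
    res = minimize(ses_sse, x0=[0.3, y[0]], args=(np.asarray(y),),
                   bounds=[(0.0, 1.0), (None, None)])
    sigma2 = res.fun / T        # ML estimate of the error variance
    loglik = -T / 2 * (np.log(2 * np.pi * sigma2) + 1)  # concentrated log-likelihood
    return 2 * 3 - 2 * loglik   # AIC = 2k - 2*loglik

y = [112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0, 136.0, 119.0]
print(ets_ann_aic(y))
```

Repeating this for each candidate model and picking the lowest AIC is exactly the model selection routine described above.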

Finally, the assumption of normality is used for the generation of prediction intervals from the model. There are typically two ways of doing that:

1. Calculating the variance of the multiple-steps-ahead forecast error and then using it to construct the intervals;
2. Generating thousands of possible future paths for the components of the series and the actual values, and then taking the necessary quantiles as the prediction intervals.

Typically, approach (1) is applied for pure additive models, where the closed forms of the variances are known and the assumption of normality holds several steps ahead. In some special cases of mixed models, there are approximations of the variances that work on short horizons. In all other cases approach (2) should be used, despite typically being slower than (1) and producing bounds that differ from run to run due to randomness.
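The simulation-based approach (2) can be sketched for ETS(A,N,N) as follows (parameter values are arbitrary, for illustration only): generate many future paths from the model equations and take quantiles of the simulated actuals at each horizon.

```python
import numpy as np

def simulated_intervals(level, alpha, sigma, h=12, n_paths=10_000, coverage=0.95):
    """Approach (2) for ETS(A,N,N): generate many future sample paths
    and take quantiles of the simulated actual values at each horizon."""
    rng = np.random.default_rng(41)
    tail = (1 - coverage) / 2
    paths = np.empty((n_paths, h))
    for i in range(n_paths):
        l = level
        for j in range(h):
            e = rng.normal(0.0, sigma)
            paths[i, j] = l + e   # simulated actual: y_t = l_{t-1} + eps_t
            l += alpha * e        # transition: l_t = l_{t-1} + alpha * eps_t
    return (np.quantile(paths, tail, axis=0),
            np.quantile(paths, 1 - tail, axis=0))

lower, upper = simulated_intervals(level=100.0, alpha=0.3, sigma=5.0)
# For this pure additive model, approach (1) has a closed form,
# Var(y_{t+h}) = sigma^2 * (1 + (h - 1) * alpha^2),
# so the simulated bounds should be close to the normal quantiles.
```

Since ETS(A,N,N) is pure additive, the closed-form variance is available and the simulation is unnecessary here; the point of the sketch is that the same loop works for the mixed models, where no such closed form exists.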