
12.2 Non MLE-based loss functions

12.2.1 MSE and MAE

An alternative approach to estimating ADAM ETS is to use conventional loss functions, such as MSE, MAE, etc. In this case, model selection via information criteria will not work, but this might not matter when you have already decided which model to use and simply want to improve its estimates. In some special cases (as discussed earlier), the minimisation of these losses gives the same results as the maximisation of the respective likelihoods, namely:

  1. The minimum of MSE corresponds to the maximum of the likelihood of the Normal distribution;
  2. The minimum of MAE corresponds to the maximum of the likelihood of the Laplace distribution;
  3. The minimum of the pinball loss (used in quantile regression, Koenker and Bassett 1978) corresponds to the maximum of the likelihood of the Asymmetric Laplace distribution (Yu and Zhang 2005).

The main difference between using these losses and maximising the respective likelihoods lies in the number of estimated parameters: the latter implies that the scale is estimated together with the other parameters, while the former does not involve it at all.

Having said that, the assumed distribution does not necessarily depend on the loss used and vice versa. For example, we can assume that the actual values follow the Inverse Gaussian distribution, but estimate the model using MAE. The parameter estimates might not be as efficient as in the case of MLE, but the approach is still valid. The error term minimised in the respective losses depends on the error type:

  • Additive error: \(e_t = y_t - \hat{\mu}_{y,t}\);
  • Multiplicative error: \(e_t = \frac{y_t - \hat{\mu}_{y,t}}{\hat{\mu}_{y,t}}\).

This follows directly from the respective ETS models.
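For example (the numbers here are made up purely for illustration):

```r
# Hypothetical actual value and one step ahead estimate of the conditional mean
y <- 105
muHat <- 100
# Additive error
eAdditive <- y - muHat                 # 5
# Multiplicative (relative) error
eMultiplicative <- (y - muHat)/muHat   # 0.05
```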

12.2.2 HAM

Along with the discussed MSE and MAE, there is also HAM - “Half Absolute Moment”: \[\begin{equation} \mathrm{HAM} = \frac{1}{T} \sum_{t=1}^T \sqrt{\left|e_t\right|}, \tag{12.6} \end{equation}\] the minimum of which corresponds to the maximum of the likelihood of the S distribution. The idea of this estimator is to minimise the errors that happen very often, close to the estimate. It will typically ignore the outliers and focus on the values that happen most often. As a result, if applied to integer values, the minimum of HAM would correspond to the mode of the distribution. I do not have a proof of this property, but it becomes apparent, given that the square root in (12.6) reduces the influence of all values lying above 1 and increases the influence of everything lying between 0 and 1 (e.g. \(\sqrt{0.16}=0.4\), but \(\sqrt{16}=4\)).
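A quick numerical check of this intuition (the errors below are made up):

```r
# The square root dampens the errors above 1 and inflates those between 0 and 1
errors <- c(0.16, 0.5, 1, 4, 16)
sqrt(errors)
## [1] 0.4000000 0.7071068 1.0000000 2.0000000 4.0000000
# HAM for this set of errors, as defined in (12.6)
mean(sqrt(abs(errors)))
## [1] 1.621421
```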

Similar to HAM, one can calculate other fractional losses, which would be even less sensitive to outliers and more focused on the frequently appearing values, e.g. by using \(\sqrt[\alpha]{\left|e_t\right|}\) with \(\alpha>2\) (HAM corresponds to \(\alpha=2\)).
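As a sketch, such a loss can be written as a function that could later be passed to adam() as a custom loss (see Section 12.2.4); the name of the function and the default value of \(\alpha\) here are arbitrary:

```r
# A fractional loss with root alpha; actual, fitted and B are the arguments
# that adam() expects from a custom loss function (B is not used here)
lossFractional <- function(actual, fitted, B, alpha=3){
  return(sum(abs(actual - fitted)^(1/alpha)))
}
```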

12.2.3 LASSO and RIDGE

While this approach is not well studied yet, it is possible to use LASSO (Tibshirani 1996) and RIDGE for the estimation of ETS models (James et al. 2017 give a good overview of these losses with examples in R). In ADAM ETS they are formulated as: \[\begin{equation} \begin{aligned} \mathrm{LASSO} = &(1-\lambda) \sqrt{\frac{1}{T} \sum_{t=1}^T e_t^2} + \lambda \sum |\hat{\theta}| \\ \mathrm{RIDGE} = &(1-\lambda) \sqrt{\frac{1}{T} \sum_{t=1}^T e_t^2} + \lambda \sum \hat{\theta}^2, \end{aligned} \tag{12.7} \end{equation}\] where \(\hat{\theta}\) is the vector of all estimated parameters in the model and \(\lambda\) is the regularisation parameter (a conceptual sketch of the LASSO loss in R is given after the list of tricks below). The idea of these losses is to shrink the parameters of the model. If \(\lambda=0\), then minimising them is equivalent to minimising MSE, while if \(\lambda=1\), the optimiser minimises only the values of the parameters, ignoring the MSE part. In the context of ETS, it makes sense to shrink everything but the initial level. In order for the different components to shrink at a similar speed, they need to be normalised, so here are the tricks used in ADAM ETS:

  1. The smoothing parameters are left intact, because they typically lie between 0 and 1, where 0 means that the respective states are not updated over time;
  2. The damping parameter is transformed via \(\hat{\phi}^\prime=1-\hat{\phi}\), so that shrinking \(\hat{\phi}^\prime\) towards zero enforces \(\hat{\phi}=1\), i.e. no damping of the trend;
  3. The initial level is normalised using the formula: \(\hat{l}_0' = \frac{\hat{l}_0 - \bar{y}_m}{\hat{\sigma}_{y,m}}\), where \(\bar{y}_m\) is the mean and \(\hat{\sigma}_{y,m}\) is the standard deviation of the first \(m\) actual observations, with \(m\) being the highest frequency of the data (if \(m=1\), then both values are calculated based on the first two observations). Shrinking \(\hat{l}_0'\) towards zero implies that the initial level becomes equal to the mean of the first \(m\) observations;
  4. If the trend is additive, then the initial trend component is scaled using: \(\hat{b}_0' = \frac{\hat{b}_0}{\hat{\sigma}_{y,m}}\), making sure that it shrinks towards zero (no trend). In the case of the multiplicative trend, the transformation becomes: \(\hat{b}_0' = \log{\hat{b}_0}\), making it shrink towards 1 in the original scale (again, the no trend case). The multiplicative trend can take any value between 0 and \(\infty\), with 1 corresponding to no trend, and taking the logarithm balances out the cases when \(\hat{b}_0<1\) and when \(\hat{b}_0>1\);
  5. If the seasonal component is additive, then a normalisation similar to the one for the trend is applied to every seasonal index: \(\hat{s}_i' = \frac{\hat{s}_i}{\hat{\sigma}_{y,m}}\), making sure that they shrink towards zero. If the seasonal component is multiplicative, then the logarithm of the indices is taken: \(\hat{s}_i' = \log{\hat{s}_i}\), making sure that they shrink towards 1 in the original scale;
  6. If there are explanatory variables and the error term of the model is additive, then the respective parameters are divided by the standard deviations of the respective variables. In the case of the multiplicative error term, nothing is done, because the parameters would typically be close to zero anyway (see the chapter on ADAM ETSX);
  7. Finally, in order to make \(\lambda\) slightly more meaningful, in the case of the additive error model we also divide the MSE part of the loss by \(\mathrm{V}\left({\Delta}y_t\right)\), where \({\Delta}y_t=y_t-y_{t-1}\). This scaling helps both when there is a trend in the data and when there is not. Nothing is done for the multiplicative error models, because the error in that case is typically already quite small.

All of these tricks make sure that the different components shrink towards zero simultaneously, without breaking the model. If we did not take these steps, then, for example, the initial trend could dominate \(\hat{\theta}\) and would shrink faster than the smoothing parameters. As a result of this normalisation, one can estimate a model with trend and seasonality and rely on regularisation to shrink the unnecessary parameters towards zero. The adam() function does not select the most appropriate \(\lambda\) and will set it equal to zero if it is not provided by the user.
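To make formula (12.7) a bit more tangible, here is a conceptual sketch of the LASSO loss in R, assuming that the residuals and the already normalised parameters are supplied directly (the names are hypothetical and this is not how adam() implements it internally):

```r
# LASSO part of (12.7): weighted combination of RMSE and the L1 norm of the
# normalised parameters
lossLASSO <- function(errors, parameters, lambda){
  (1 - lambda) * sqrt(mean(errors^2)) + lambda * sum(abs(parameters))
}
# RIDGE differs only in the penalty term: lambda * sum(parameters^2)
```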

Note that both LASSO and RIDGE are experimental options. If you have better ideas of how to implement them, send me a message and I will improve the mechanism in adam().

12.2.4 Custom losses

It is also theoretically possible to use other, non-standard loss functions. The adam() function allows doing that via the loss parameter. For example, we could estimate an ETS(A,A,N) model on the BJsales data using an absolute cubic loss (note that the arguments actual, fitted and B are compulsory for such a function):
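A possible implementation of such a loss function is the following sketch (the function and object names are chosen for illustration):

```r
# adam() comes from the smooth package
library(smooth)

# Absolute cubic loss; actual, fitted and B must be the arguments of the
# function, even though B is not used here
lossFunction <- function(actual, fitted, B){
  return(sum(abs(actual - fitted)^3))
}
# Estimate ETS(A,A,N) on BJsales using this custom loss
adamETSBJ <- adam(BJsales, "AAN", loss=lossFunction)
```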

where actual is the vector of actual values \(y_t\), fitted is the estimate of the one step ahead conditional mean \(\hat{\mu}_{y,t}\), and B is the vector of all estimated parameters, \(\hat{\theta}\). This way, one can use more advanced estimators, such as M-estimators (Barrow et al. 2020).

12.2.5 Examples in R

adam() has two parameters for this purpose: one regulating the assumed distribution and another one regulating the loss used in the estimation of the model. Here are examples with combinations of different losses and an assumed Inverse Gaussian distribution for ETS(M,A,M) on the AirPassengers data. We start with MSE, MAE and HAM:
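This can be done as follows (a sketch; the object names and the use of a 12-observation holdout are my choices, made for the comparisons later in this section):

```r
# ETS(M,A,M) with the Inverse Gaussian distribution, estimated via MSE, MAE and HAM
adamETSMAMMSE <- adam(AirPassengers, "MAM", h=12, holdout=TRUE,
                      distribution="dinvgauss", loss="MSE")
adamETSMAMMAE <- adam(AirPassengers, "MAM", h=12, holdout=TRUE,
                      distribution="dinvgauss", loss="MAE")
adamETSMAMHAM <- adam(AirPassengers, "MAM", h=12, holdout=TRUE,
                      distribution="dinvgauss", loss="HAM")
```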

In these three cases, models assuming the same distribution for the error term are estimated using MSE, MAE and HAM respectively. Their smoothing parameters will differ, with MSE producing fitted values closer to the mean, MAE closer to the median, and HAM closer to the mode (but not exactly at it) of the distribution.

In addition, we estimate an ADAM ETS(M,A,M) using LASSO with an arbitrarily selected \(\lambda=0.1\):
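For example (a sketch, assuming that the regularisation parameter is passed via the lambda argument of adam()):

```r
# ETS(M,A,M) estimated via LASSO with lambda=0.1
adamETSMAMLASSO <- adam(AirPassengers, "MAM", h=12, holdout=TRUE,
                        distribution="dinvgauss", loss="LASSO", lambda=0.1)
```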

And finally, we estimate the same model using a custom loss, which in this case is the Huber loss with a threshold of 1.345:
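A possible sketch of this is:

```r
# Huber loss with threshold 1.345: quadratic for small errors, linear for large ones
lossHuber <- function(actual, fitted, B){
  errors <- actual - fitted
  return(sum(ifelse(abs(errors) <= 1.345,
                    0.5 * errors^2,
                    1.345 * (abs(errors) - 0.5 * 1.345))))
}
adamETSMAMHuber <- adam(AirPassengers, "MAM", h=12, holdout=TRUE,
                        distribution="dinvgauss", loss=lossHuber)
```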

Now we can compare the performance of the six models. First, we can compare the smoothing parameters:
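One way of doing that is the following sketch, which assumes that the models are stored under the names used in the sketches above and that a likelihood-based ETS(M,A,M) (discussed earlier in the chapter) is also available:

```r
# The likelihood-based ETS(M,A,M) could be obtained, for example, via:
adamETSMAMLikelihood <- adam(AirPassengers, "MAM", h=12, holdout=TRUE,
                             distribution="dinvgauss", loss="likelihood")
# Collect the six models in a list for the comparisons
modelsList <- list(likelihood=adamETSMAMLikelihood, MSE=adamETSMAMMSE,
                   MAE=adamETSMAMMAE, HAM=adamETSMAMHAM,
                   LASSO=adamETSMAMLASSO, Huber=adamETSMAMHuber)
# Extract the smoothing parameters, stored in the persistence vector of each model
round(sapply(modelsList, "[[", "persistence"), 3)
```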

##       likelihood   MSE   MAE   HAM LASSO Huber
## alpha      0.778 0.766 0.874 0.907 0.026 0.858
## beta       0.006 0.006 0.021 0.024 0.025 0.010
## gamma      0.000 0.000 0.000 0.001 0.183 0.000

What we should observe in this case is that LASSO has the lowest smoothing parameters, because they are shrunk directly in the loss. Likelihood and MSE should give similar, but not identical, values: they both rely on squared errors, but the likelihood of the Inverse Gaussian distribution contains additional elements.

Unfortunately, we do not have information criteria for models 2 - 6 in this case, because the likelihood is not maximised with these losses, so it is not possible to compare the models via in-sample statistics. But we can compare their holdout sample performance:
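Here is a sketch of such a comparison, assuming that each model was estimated with holdout=TRUE, that the holdout actual values and the point forecasts are stored in the holdout and forecast elements of the model objects, and that the error measures come from the greybox package (loaded together with smooth):

```r
# ME, MAE and MSE on the 12-observation holdout for each of the six models
round(sapply(modelsList, function(x)
             c(ME=ME(x$holdout, x$forecast),
               MAE=MAE(x$holdout, x$forecast),
               MSE=MSE(x$holdout, x$forecast))), 3)
```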

##     likelihood     MSE     MAE     HAM   LASSO    Huber
## ME       8.184   7.253   4.571   3.603 -16.753   30.633
## MAE     19.514  18.805  16.489  16.075  19.547   38.131
## MSE    633.863 601.831 496.085 496.800 635.992 1853.751

We can also produce forecasts from the models and plot them for visual inspection:
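For example (a sketch, relying on the forecast() method for adam models and on the modelsList object defined above):

```r
# Produce and plot forecasts with prediction intervals from the six models
par(mfcol=c(3,2))
for(model in modelsList){
  plot(forecast(model, h=12, interval="prediction"))
}
```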

What we observe in this case is that the different losses lead to different forecasts and prediction intervals (we still assume the Inverse Gaussian distribution) and that, in the case of LASSO, the shrinkage is so strong that the seasonality is shrunk away almost completely. To make this approach practical, one could do a rolling origin evaluation with the different losses and then compare the forecast errors between them in order to select the most efficient one.

References

Barrow, Devon, Nikolaos Kourentzes, Rickard Sandberg, and Jacek Niklewski. 2020. “Automatic robust estimation for exponential smoothing: Perspectives from statistics and machine learning.” Expert Systems with Applications 160: 113637. https://doi.org/10.1016/j.eswa.2020.113637.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2017. An Introduction to Statistical Learning with Applications in R. New York: Springer.

Koenker, Roger, and Gilbert Bassett. 1978. “Regression Quantiles.” Econometrica 46 (1): 33–50. https://doi.org/10.2307/1913643.

Tibshirani, Robert. 1996. “Regression Shrinkage and Selection Via the Lasso.” Journal of the Royal Statistical Society: Series B (Methodological) 58 (1): 267–88. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x.

Yu, Keming, and Jin Zhang. 2005. “A three-parameter asymmetric Laplace distribution and its extension.” Communications in Statistics - Theory and Methods 34 (9-10): 1867–79. https://doi.org/10.1080/03610920500199018.