This book is in Open Review. I want your feedback to make the book better for you and other readers. To add your annotation, select some text and then click the on the pop-up menu. To see the annotations of others, click the button in the upper right hand corner of the page

12.2 Non MLE-based loss functions

12.2.1 MSE and MAE

An alternative approach for estimating ADAM ETS is using the conventional loss functions, such as MSE, MAE etc. In this case, the model selection using information criteria would not work, but this might not matter, when you have already decided what model to use and want to improve it. In some special cases (as discussed earlier), the minimisation of these losses would give the same results as the maximisation of some likelihoods, namely:

  1. Minimum of MSE corresponds to the maximum of likelihood of Normal distribution;
  2. Minimum of MAE corresponds to the maximum of likelihood of Laplace distribution;
  3. Minimum of pinball function (Koenker and Bassett 1978, quantile regression) corresponds to the maximum of likelihood of Asymmetric Laplace distribution (Yu and Zhang 2005);

The main difference between using these losses and maximising respective likelihoods is in the number of estimated parameters: the latter implies that the scale is estimated together with the other parameters, while the former does not consider it.

Having said that, the assumed distribution does not necessarily depend on the used loss and vice versa. For example, we can assume that the actuals follow Inverse Gaussian distribution, but estimate the model using MAE. The estimates of parameters might not be as efficient as in the case of MLE, but it is still possible to do. The error term, which is minimised in respective losses depends on the error type:

  • Additive error: \(e_t = y_t - \hat{\mu}_{y,t}\);
  • Multiplicative error: \(e_t = \frac{y_t - \hat{\mu}_{y,t}}{\hat{\mu}_{y,t}}\).

This follows directly from the respective ETS models.

12.2.2 HAM

Along with the discussed MSE and MAE, there is also HAM - "Half Absolute Moment": \[\begin{equation} \mathrm{HAM} = \frac{1}{T} \sum_{j=1}^T \sqrt{\left|e_t\right|}, \tag{12.6} \end{equation}\]

the minimum of which corresponds to the maximum of the likelihood of S distribution. The idea of this estimator is to minimise the errors that happen very often, close to the estimate. It will typically ignore the outliers and focus on the values that happen most often. As a result, if used for the integer values, the minimum of HAM would correspond to the mode of that distribution. I donot have a proof of this property, but it becomes apparent, given that the square root in (12.6) would reduce the influence of all values lying above 1 and increase the values of everything that lies between (0, 1) (e.g. \(\sqrt{0.16}=0.4\), but \(\sqrt{16}=4\)).

Similar to HAM, one can calculate other fractional losses, which would be even less sensitive to outliers and more focused on the frequently appearing values, e.g. by using the \(\sqrt[^\alpha]{\left|e_t\right|}\) with \(\alpha>1\).

12.2.3 LASSO and RIDGE

While this is not a well studied approach yet, it is possible to use LASSO (Tibshirani 1996) and RIDGE for the estimation of ETS models (James et al. 2017 give a good overview of these losses with examples in R), which can be formulated as (at least this is how it is formualted in ADAM ETS): \[\begin{equation} \begin{aligned} \mathrm{LASSO} = &(1-\lambda) \frac{1}{T} \sum_{j=1}^T e_t^2 + \lambda \sum |\hat{\theta}| \\ \mathrm{RIDGE} = &(1-\lambda) \frac{1}{T} \sum_{j=1}^T e_t^2 + \lambda \sum \hat{\theta}^2, \end{aligned} \tag{12.7} \end{equation}\]

where \(\theta\) is the vector of all parameters in the model and \(\lambda\) is the regularisation parameter. The idea of these losses is in shrinkage of parameters. If \(\lambda=0\), then the losses become equivalent to the MSE, when \(\lambda=1\), the optimiser would minimise the values of parameters, ignoring the MSE part. In the context of ETS, it makes sense to shrink everything but the initial level. In order for different components to shrink with a similar speed, they need to be normalised, so here are some tricks used in ADAM ETS:

  1. The smoothing parameters are left intact, because they typicall lie between 0 and 1, where 0 means that the respective states are not updated over time;
  2. The initial level is normalised using the formula: \(\hat{l}_0' = \frac{\hat{l}_0 - \bar{y}_m}{\hat{\sigma}_{y,m}}\), where \(\bar{y}_m\) is the mean and \(\hat{\sigma}_{y,m}\) is the standard deviation of the first \(m\) actual observations, where \(m\) is the highest frequency of the data (if \(m=1\), then both values are taken based on the first two observations). Shrinking it to zero implies that it becomes equivalent to the mean of the first \(m\) observations;
  3. If the trend is additive, then the initial trend component is scaled using: \(\hat{b}_0' = \frac{\hat{b}_0}{\hat{\sigma}_{y,m}}\), making sure that it shrinks towards zero (no trend). In case of the multiplicative trend, this becomes: \(\hat{b}_0' = \log{\hat{b}_0}\) , making it shrink towards 1 in the original scale (again, no trend case);
  4. If the seasonal component is additive, then the normalisation similar to the one in trend is done for every seasonal index: \(\hat{s}_i' = \frac{\hat{s}_i}{\hat{\sigma}_{y,m}}\), making sure that they shrink towards zero. If the seasonal component is multiplicative, then the logarithm of components is taken: \(\hat{s}_i' = \log{\hat{s}_i}\), making sure that they shrink towards 1 in the original scale.
  5. If there are explanatory variables and the error term of the model is additive, then the respective parameters are divided by the standard deviations of the respective variables. In case of multiplicative error term, nothing is done, because the parameters in this case would typically be close to zero anyway (see a chapter on the ADAM ETSX).

All of these tricks make sure that different components shrink towards zero, not breaking the model. As a result, one can potentially use the model with trend and seasonality and use regularisation in order to shrink the unnecessary parameters towards zero. The adam() function does not select the most appropriate \(\lambda\) and will set it equal to zero if it is not provided by the user.

Note that both LASSO and RIDGE are experimental options. If you have better ideas of how to implement them, send me a message, I will improve the mechanism in adam().

12.2.4 Custom losses

It is also theoretically possible to use other non-standard loss functions. adam() function allows doing that via the parameter loss. For example, we could estimate an ETS(A,A,N) model on the BJsales data using an absolute cubic loss (note that the parameters actual, fitted and B are compulsory for the function):

lossFunction <- function(actual, fitted, B){
adam(BJsales, "AAN", loss=lossFunction, h=10, holdout=TRUE)

where actual is the vector of actual values \(y_t\), fitted is the estimate of the one step ahead conditional mean \(\hat{\mu}_{y,t}\) and \(B\) is the vector of all estimated parameters, \(\hat{\theta}\). This way, one can use more advanced estimators, such as, for example, M-estimators (Barrow et al. 2020).

12.2.5 Examples in R

adam() has two parameters, one regulating the assumed distribution, and another one, regulating how the model will be estimated, what loss will be used for these purposes. Here are examples with combinations of different losses and an assumed Inverse Gaussian distribution for ETS(M,A,M) on AirPassengers data. We start with MSE, MAE and HAM:

adamModel <- vector("list",6)
names(adamModel) <- c("likelihood","MSE", "MAE", "HAM", "LASSO", "Huber")
adamModel[[1]] <- adam(AirPassengers, "MAM", distribution="dinvgauss",
                       h=12, holdout=TRUE, loss="likelihood")
adamModel[[2]] <- adam(AirPassengers, "MAM", distribution="dinvgauss",
                       h=12, holdout=TRUE, loss="MSE")
adamModel[[3]] <- adam(AirPassengers, "MAM", distribution="dinvgauss",
                       h=12, holdout=TRUE, loss="MAE")
adamModel[[4]] <- adam(AirPassengers, "MAM", distribution="dinvgauss",
                       h=12, holdout=TRUE, loss="HAM")

In these three cases the models, assuming the same distribution for the error term are estimated using MSE, MAE and HAM. Their smoothing parameters should differ, with MSE producing fitted values closer to the mean, MAE - closer to the median and HAM - closer to the mode (but not exactly the mode) of the distribution.

In addition, we introduce ADAM ETS(M,A,M) estimated using LASSO with arbitrarily selected \(\lambda=0.1\):

adamModel[[5]] <- adam(AirPassengers, "MAM", distribution="dinvgauss",
                       h=12, holdout=TRUE, loss="LASSO", lambda=0.1)

And, finally, we estimate the same model using a custom loss, which in this case is Huber loss with the threshold of 1.345:

# Huber loss with a threshold of 1.345
lossFunction <- function(actual, fitted, B){
  errors <- actual-fitted;
  return(sum(errors[errors<=1.345]^2) + sum(abs(errors)[errors>1.345]));
adamModel[[6]] <- adam(AirPassengers, "MAM", distribution="dinvgauss",
                       h=12, holdout=TRUE, loss=lossFunction)

Now we can compare the performance of the six models. First, we can compare the smoothing parameters:

##       likelihood   MSE   MAE   HAM LASSO Huber
## alpha      0.767 0.772 0.897 0.932 0.010 0.866
## beta       0.010 0.000 0.020 0.022 0.005 0.010
## gamma      0.000 0.000 0.000 0.000 0.000 0.010

What we sould observe in this case is that LASSO should have the lowest smoothing parameters, because they are shrunk directly in the model. Likelihood and MSE should probably give similar values, because they both rely on squared errors, but not the same because the likelihood of Inverse Gaussian also has additional elements in it.

Unfortunately, we do not have information criteria for the models 2 - 6 in this case, because the likelihood function is not maximised with these losses, so it's not possible to compare them via the in-sample statistics. But we can compare their holdout sample performance:

##     likelihood     MSE     MAE     HAM    LASSO    Huber
## ME       7.716  12.491   4.171   4.613   12.729   24.275
## MAE     19.249  23.296  16.553  16.733   57.721   31.818
## MSE    618.086 780.505 511.750 519.018 5494.784 1312.672

And we can also produce forecasts from them and plot those forecasts for the visual inspection:

adamModelForecasts <- lapply(adamModel,forecast, h=12, interval="prediction")
for(i in 1:6){
       main=paste0("ETS(MAM) estimated using ",names(adamModel)[i]))

What we observe in this case, is that different losses led to different forecasts and prediction intervals (we still assume Inverse Gaussian distribution) and that in case of LASSO, the shrinkage is so strong that the seasonality is shrunk to zero. What can potentially be done to make this practical is the rolling origin evaluation for different losses and then comparison of forecast errors between them in order to select the most efficient one.


Barrow, Devon, Nikolaos Kourentzes, Rickard Sandberg, and Jacek Niklewski. 2020. “Automatic robust estimation for exponential smoothing: Perspectives from statistics and machine learning.” Expert Systems with Applications 160: 113637. doi:10.1016/j.eswa.2020.113637.

James, Gareth, Daniela Witen, Trevor Hastie, and Robert Tibshirani. 2017. An Introduction to Statistical Learning with Applications in R. Vol. 64. 9-12. doi:10.1016/j.peva.2007.06.006.

Koenker, Roger, and Gilbert Bassett. 1978. “Regression Quantiles.” Econometrica 46 (1): 33. doi:10.2307/1913643.

Tibshirani, Robert. 1996. “Regression Shrinkage and Selection Via the Lasso.” Journal of the Royal Statistical Society: Series B (Methodological) 58 (1): 267–88. doi:10.1111/j.2517-6161.1996.tb02080.x.

Yu, Keming, and Jin Zhang. 2005. “A three-parameter asymmetric laplace distribution and its extension.” Communications in Statistics - Theory and Methods 34 (9-10): 1867–79. doi:10.1080/03610920500199018.