11.4 Initialisation of ADAM

$ \newcommand{\mathbbm}[1]{\boldsymbol{\mathbf{#1}}} $

To construct a model, we need to initialise it, defining the values of initial states of the model. In the case of the non-seasonal model, this means estimating values of $\mathbf{v}_{0}$. In the case of the seasonal one, it is $\mathbf{v}_{-m+1}, \dots, \mathbf{v}_0$, where $m$ is the seasonal lag. There are different ways of doing that, but here we only discuss the following three:

Optimisation of initials;
Backcasting;
Provided values.

The first option implies that the values of initial states are found in the same procedure as the other parameters of the model. This is what Hyndman et al. (2002) suggested doing. (2) means that the initials are refined iteratively when the model is fit to the data from observation $t=1$ to $t=T$ and then going backwards to get values for $t=0$. Finally, (3) is when a user knows initials and provides them to the model.

As a side note, we assume in ADAM that the model is initialised at the moment just before $t=1$. We do not believe that it was initialised at some point before the Big Bang (as ARIMA typically does), and we do not initialise it at the start of the sample. This way, we make all models in ADAM comparable, making them work on the same sample, no matter how many differences are taken or how many seasonal components we define.

11.4.1 Optimisation vs backcasting

In the case of optimisation, all the model parameters are estimated together. This includes (depending on the type of model, in the order used in optimiser):

Smoothing parameters of ETS;
Smoothing parameters for the regression part of the model (from Section 10.3);
Dampening parameter of ETS;
Parameters of ARIMA: first AR(p), then MA(q);
Initials of ETS;
Initials of ARIMA;
Initial values for parameters of explanatory variables;
Constant/drift for ARIMA;
Other additional parameters that are needed by assumed distributions.

The more complex the selected model is, the more parameters we will need to estimate, and all of this will happen in one and the same iterative process in the optimiser:

Choose parameters;
Fit the model;
Calculate loss function;
Compare the loss with the previous one;
Update the parameters based on (4);
Go to (2) and repeat until a specific criterion is met.

The user can specify the stopping criteria. There are several options accepted by the optimiser of adam():

Maximum number of iterations (maxeval), which by default is equal to $40\times k$ in case of ETS/ARIMA and $100 \times k$ for the model with explanatory variables, where $k$ is the number of all estimated parameters;
The relative precision of the optimiser (xtol_rel) with the default value of $10^{-6}$, which regulates the relative change of parameters;
The absolute precision of the optimiser (xtol_abs) with the default value of $10^{-8}$, which regulates the absolute change of parameters;
The stopping criterion in case of the relative change in the loss function (ftol_rel) with the default value of $10^{-8}$.

All these parameters are explained in more detail in the documentation of the nloptr() function from the nloptr package for R (Johnson, 2021), which handles the estimation of ADAM. adam() accepts several other stopping criteria, which can be found in the documentation of the function.

Remark. adam() can also print the results of the optimisation via the print_level parameter, which is defined in the same way as in the nloptr() function, but with an additional option of print_level=41, which will print the results of the optimisation, without producing step-by-step outputs.

The mechanism explained above can slow down substantially if a complex model is constructed, and it might take a lot of time and manual tuning of parameters to get to the optimum. In some cases, reducing the number of estimated parameters is worth considering, and one way of doing that is backcasting.

In case of backcasting we do not need to estimate initials of ETS, ARIMA, and regression. What the model does in this case is goes through the series from $t=1$ to $t=T$, fitting to the data, and then reverses and goes back from $t=T$ to $t=1$ based on the following state space model: \[\begin{equation} \begin{aligned} {y}_{t} = &w(\mathbf{v}_{t+\boldsymbol{l}}) + r(\mathbf{v}_{t+\boldsymbol{l}}) \epsilon_t \\ \mathbf{v}_{t} = &f(\mathbf{v}_{t+\boldsymbol{l}}) + g(\mathbf{v}_{t+\boldsymbol{l}}) \epsilon_t \end{aligned}. \tag{11.21} \end{equation}\] The new values of $\mathbf{v}_t$ for $t<1$ are then used to fit the model to the data again. The procedure can be repeated several times for the initial states to converge to more reasonable values. adam() does that only two times.

The backcasting procedure implies the extended fitting process for the model, removing the need to estimate all the initials. It works exceptionally well on large samples of data (thousands of observations) and with models with several seasonal components. The bigger your model is, the more time the optimisation will take, and the more likely backcasting would do better. On the other hand, you might also prefer backcasting to optimisation on small samples when you do not have more than two seasons of data – estimation of initial seasonal components might become challenging and can lead to overfitting.

When discussing specific models, ADAM ARIMA works better (faster and more accurate) with backcasting than with optimisation because it does not need to estimate as many parameters as in the latter case. On the other hand, ADAM ETS typically works quite well in case of optimisation, when there is enough data to train the model on. Last but not least, if you introduce explanatory variables, the mechanism implemented in adam() will optimise their parameters instead of backcasting. In the case of a dynamic ETSX/ARIMAX (Section 10.3) this will mean that the initial values of the part of the state vector for explanatory variables will be estimated as well. If you want them not to be optimised you can use initial="complete" to do full backcasting of all elements of the state vector.

It is also important to note that the information criteria of models with backcasting are typically lower than in the case of the optimised initials. This is because the difference in the number of estimated parameters is substantial in these two cases, and the models are initialised differently. So, it is advised not to mix the model selection between the two initialisation techniques, although there is no theoretical ground for forbidding it.

Nonetheless, no matter what initialisation method is used, we need to start the fitting process from $t=1$. This cannot be done unless we provide some starting (pre-initialised) values of parameters to the optimiser. The better we guess the starting values, the faster the optimiser will converge to the optimum. adam() uses several heuristics at this stage, explained in more detail in the following subsections.

11.4.2 Starting optimisation of parameters

Remark. In this subsection, we discuss how the values of smoothing parameters, damping parameters, and coefficients of ARIMA are preset before the initialisation. All the things discussed here are heuristics, developed based on my experience and many experiments with ADAM.

Depending on the model type, the vector of estimated parameters will have different lengths. We start with smoothing parameters of ETS:

For the unsafe mixed models ETS(A,A,M), ETS(A,M,A), ETS(M,A,A), and ETS(M,A,M): $\hat{\alpha}=0.01$, $\hat{\beta}=0$ and $\hat{\gamma}=0$. This is needed because the models listed above are susceptible to the changes in smoothing parameters and might fail for time series with actual values close to zero;
For one of the most complicated and sensitive models, ETS(M,M,A), $\hat{\alpha}=\hat{\beta}=\hat{\gamma}=0$. The combination of additive seasonality and the multiplicative trend is the most difficult one. The multiplicative error makes estimation even more challenging in cases of low-level data. So starting from the deterministic model, that will work for sure is a safe option;
ETS(M,A,N) is slightly easier to estimate than ETS(M,A,M) and ETS(M,A,A), so $\hat{\alpha}=0.2$, $\hat{\beta}=0.01$. The low value for the trend is needed to avoid the difficult situations with low level data, when the fitted values become negative;
ETS(M,M,N) and ETS(M,M,M) have $\hat{\alpha}=0.1$, $\hat{\beta}=0.05$, and $\hat{\gamma}=0.01$, making the trend and seasonal components a bit more conservative. The high values are typically not needed in this model as they might lead to explosive trends;
Other models with multiplicative components (ETS(M,N,N), ETS(M,N,A), ETS(M,N,M), ETS(A,N,M), ETS(A,M,N), and ETS(A,M,M)) are slightly easier to estimate and harder to break, so their parameters are set to $\hat{\alpha}=0.1$, $\hat{\beta}=0.05$, and $\hat{\gamma}=0.05$;
Finally, pure additive models are initialised with $\hat{\alpha}=0.1$, $\hat{\beta}=0.05$, and $\hat{\gamma}=0.11$. Their parameter space is the widest, and the models do not break on any data.

The smoothing parameter for the explanatory variables (Section 10.3) is set to $\hat{\delta}=0.01$ in case of additive error and $\hat{\delta}=0$ in case of the multiplicative one. The latter is done because the model might break if some ETS components are additive.

If a dampening parameter is needed in the model, then its starting value is $\hat{\phi}=0.95$.

In the case of ARIMA, the parameters are pre-initialised based on ACF and PACF (Sections 8.3.2 and 8.3.3). First, the in-sample actual values are differenced, according to the selected order $D_j$ for all $j=0,\dots,n$, after which the ACF and PACF are calculated. Then the initials for AR parameters are taken from the PACF, while the initials for MA parameters are taken from the ACF, making sure that the sum of parameters is not greater than one in both cases. If it is, then the parameters are renormalised to satisfy the condition. This mechanism aims to get a potentially correct direction towards the optimal parameters of the model and makes sure that the initial values meet the basic stationarity and invertibility conditions. In cases when it is not possible to calculate ACF and PACF for the specified lags and orders, AR parameters are set to -0.1, while the MA parameters are set to 0.1, making sure that the conditions mentioned above hold.

In case of Generalised Normal distribution, the shape parameter is set to 2 (if it is estimated), making the optimiser start from the conventional Normal distribution.

The pre-initialisations described above guarantee that the model is estimable for a wide variety of time series and that the optimiser will reach the optimum in a limited time. If it does not work for a specific case, a user can provide their vector of pre-initialised parameters via the parameter B in the ellipsis of the model. Furthermore, the typical bounds for the parameters can be tuned as well. For example, the bounds for smoothing parameters in ADAM ETS are (-5, 5), and they are needed only to simplify the optimisation procedure. The function will check the violation of either usual or admissible bounds inside the optimiser, but having some ideas of where to search for optimal parameters, helps. A user can provide their vector for the lower bound via lb and the upper one via ub.

11.4.3 Starting optimisation of ETS states

After defining the pre-initial parameters, we need to provide similar values for the initial state vector $\mathbf{v}_t$. The steps explained below are based on my experience and typically lead to a robust model. The pre-initialisation of the states of ADAM ETS differs depending on whether the model is seasonal or not. If it is seasonal, then the multiple seasonal decomposition is done using the msdecompose() function from the smooth package with the seasonality set to “multiplicative” if either the error or seasonal component of ETS is multiplicative. After that:

Initial level is set to be equal to the first initial value from the function (which is the back forecasted de-seasonalised series);
The value is corrected if regressors are included to remove their impact on the value (either by subtracting the fitted of the regression part or by dividing by them – depending on the error type);
If the trend is additive and seasonality is multiplicative, then the trend component is obtained by multiplying the initial level and trend from the decomposition (remember, the assumed model is multiplicative in this case) and then subtracting the previous level from the resulting value;
If the trend is multiplicative and seasonality is additive, then the initials are added and then divided by the previous level to get the initial multiplicative trend component;
If there is no seasonality and the trend is multiplicative, then the initial trend is set to 1. This is done to avoid the potentially explosive behaviour in the model;
If the trend is multiplicative and the level is negative, then the level is substituted by the first actual value. This might happen in some weird cases of time series with low values;
When it comes to seasonal components, if we have a pure additive or a pure multiplicative ETS model or ETS(A,Z,M), we use the seasonal indices obtained from the msdecompose() function (discussed in Subsection 3.2.3), making sure that they are normalised. The type of seasonality in msdecompose() corresponds to the seasonal component of ETS in this case, and nothing additional needs to be done;
The situation is more challenging with ETS(M,Z,A), for which the decomposition would return the multiplicative seasonal components. To convert them to the additive ones, we take their logarithm and multiply them by the minimum value of the actual time series. This way, we guarantee that the seasonal components are closer to the optimal ones.

In the case of the non-seasonal model, the algorithm is more straightforward:

The initial level is equal to either arithmetic or geometric mean (depending on the type of trend component) of the first $\max(m_1,\dots,m_n)$ observations, where $m_j$ is the model lag (e.g. in case of ARIMA(1,1,2), the components will have lags of 1 and 2). If the length of this mean is smaller than 20% of the sample, then the arithmetic mean of the first 20% of actual values is used;
If regressors are included, then the value is modified, similarly to how it is done in the seasonal case discussed above;
If the model has an additive trend, then its initial value is equal to the mean difference between first $\max(m_1,\dots,m_n)$ observations;
In the case of multiplicative trend, the initial value is equal to the geometric mean of ratios between first $\max(m_1,\dots,m_n)$ observations.

In cases of the small samples (less than two seasonal periods), the procedure is similar to the one above. However, the seasonal indices are obtained by taking the actual values and either subtracting the arithmetic mean from them or dividing them by the geometric one of the first $m_j$ observations, depending on the seasonality type, normalising them afterwards.

Finally, to ensure that the safe initials were provided, for the ETS(M,Z,Z) models, if the initial level contains a negative value, it is substituted by the Global Mean of the series.

The pre-initialisation described here is not simple, but it guarantees that any ETS model can be constructed and estimated to almost any data. Yes, there might still be some issues with mixed ETS models, but the mechanism used in ADAM is quite robust.

11.4.4 Starting optimisation of ARIMA states

ADAM ARIMA models have as many states as the number of polynomials $K$ (see Subsection 9.1.2). Each state $v_{i,t}$ needs to be initialised with $i$ values (e.g. 1 for the first state, 2 for the second, etc). This leads in general to more initial values for states than the SSARIMA from Svetunkov and Boylan (2020): $\frac{K(K+1)}{2}$ instead of $K$. However, we can reduce the number of initial seeds to estimate either by using a different initialisation procedure (e.g. backcasting) or by propagating $K$ initial values to all the states using the formula for the transition equation (9.7): \[\begin{equation*} v_{i,t} = \eta_i \sum_{j=1}^K B_{j} + (\eta_i + \psi_i) \epsilon_{t}. \end{equation*}\] The problem with the formula above is the absence of the error term for the observations before the beginning of the sample (where we initialise ADAM). So, to get the estimates of the initial vector, we use expectations of states instead: \[\begin{equation} \mathrm{E}(v_{i,t} | t) = \eta_i \sum_{j=1}^K B_{j} \text{ for } t=\{-K+1, -K+2, \dots, 0\}. \tag{9.7} \end{equation}\] Using this approach, the optimisier will find some optimal values of $B_j$ that will be propagated to all the states and then used in model fit. This way we will need to estimate only $K$ parameters for the initial values of the state vector. However, this method will only work for the states corresponding to the ARI elements of the model. In the case of MA, using the same principle of initialisation via the conditional expectation, we can set the initial MA states to zero and estimate only ARI states. This is a crude but relatively simple way to pre-initialise ADAM ARIMA.

To get some initial estimates of $B_j$, which will be used in the optimiser, ADAM applies ARIMA with backcasting and extracts the seed values from it.

Having said all that, I need to point out that it is advised to use backcasting in the ADAM ARIMA model – this is a more reliable and faster procedure for initialisation of ARIMA than the optimisation.

Finally, the experiments that I have done in the simulated setting show that the seasonal ADAM ARIMA implemented in the smooth package produces biased but consistent estimates of parameters, both in case of backcasting and optimisation. This does not apply to the non-seasonal ARIMA, which in general works well.

11.4.5 Starting optimisation of regressor states and constant

When it comes to the initials for the regressors, they are obtained from the parameters of the alm() model based on the rules below:

The model with the logarithm of the response variable is constructed, if the error term is multiplicative and one of the following distributions has been selected: Normal, Laplace, S, or Generalised Normal;
Otherwise, the model is constructed based on the provided formula and selected distribution;
In any case, the global trend is added to the formula to make sure that its effect on the values of parameters is reduced;
If the data contains categorical variables (aka “factors” in R), then they are expanded to a set of dummy variables, adding the baseline value to the mix (i.e. not dropping any categories). While the classical multiple regression would not be estimable in this situation, dynamic models like ETSX and ARIMAX can work with the complete set of levels of categorical variables (see discussion in Section 14.9). To get the missing level, the intercept is added to the parameters of dummy variables, after which the obtained vector is normalised. This way, we can get, for example, all seasonal components if we want to model seasonality via X part of the model, not merging one of the components with level (see discussion in Section 10.5).

Finally, the initialisation of constant (if needed in the model) depends on the selected model. In case of ARIMA with all $D_j=0$, the mean of the data is used. In all other cases, the arithmetic mean of difference or the geometric mean of ratios of all actual values is used depending on the error type. This is because the constant acts as a drift in the model in the case of non-zero differences. In case of ETS, the impact of the constant is removed from the level in ETS and the states of ARIMA by either subtraction or division, again depending on the error term type.

11.4.6 Example in R

All the details discussed above allow us to tell ADAM what values to start from if we want to help it in optimisation. For demonstration purposes, we consider the ETS(A,A,N)+ARIMA(2,0,0) applied to Box-Jenkins data:

adamETSBJ <- adam(BJsales, "AAN", orders=c(2,0,0))

The function will return the vector of parameters in the form it was used by the optimiser:

adamETSBJ$B

##     alpha      beta   phi1[1]   phi2[1] 
## 0.3583291 0.1817099 0.7309204 0.2687216

It can be amended to help the optimiser if we have an idea of what values we should have or just reused again to get to better set of values:

adamETSBJ <- adam(BJsales, "AAN", orders=c(2,0,0),
                  B=adamETSBJ$B)

If we are dissatisfied with the result, we can print the solution found by the optimiser to understand why it stopped (we do not provide the output here):

adamETSBJ <- adam(BJsales, "AAN", orders=c(2,0,0),
                  B=adamETSBJ$B, print_level=41)

Similar output is saved in adamETSBJ$res, so that you can access it to understand how the optimisation went.

Hopefully, after several iterations, we will get a better estimates of parameters of the model:

adamETSBJ

## Time elapsed: 0.03 seconds
## Model estimated using adam() function: ETS(AAN)+ARIMA(2,0,0)
## With backcasting initialisation
## Distribution assumed in the model: Normal
## Loss function type: likelihood; Loss function value: 258.3364
## Persistence vector g:
##  alpha   beta 
## 0.3516 0.1884 
## 
## ARMA parameters of the model:
##        Lag 1
## AR(1) 0.7297
## AR(2) 0.2703
## 
## Sample size: 150
## Number of estimated parameters: 5
## Number of degrees of freedom: 145
## Information criteria:
##      AIC     AICc      BIC     BICc 
## 526.6729 527.0896 541.7261 542.7700

Given that we are trying the ETS+ARIMA model, we can use backcasting, which should help the optimiser by reducing the number of estimated parameters:

adam(BJsales, "AAN", orders=c(2,0,0),
     initial="backcasting")

## Time elapsed: 0.04 seconds
## Model estimated using adam() function: ETS(AAN)+ARIMA(2,0,0)
## With backcasting initialisation
## Distribution assumed in the model: Normal
## Loss function type: likelihood; Loss function value: 258.3425
## Persistence vector g:
##  alpha   beta 
## 0.3583 0.1817 
## 
## ARMA parameters of the model:
##        Lag 1
## AR(1) 0.7309
## AR(2) 0.2687
## 
## Sample size: 150
## Number of estimated parameters: 5
## Number of degrees of freedom: 145
## Information criteria:
##      AIC     AICc      BIC     BICc 
## 526.6850 527.1016 541.7381 542.7820

Hopefully this gives an idea of how the estimation of parameters in adam() can be fine-tuned.

References

• Hyndman, R.J., Koehler, A.B., Snyder, R.D., Grose, S., 2002. A State Space Framework for Automatic Forecasting Using Exponential Smoothing Methods. International Journal of Forecasting. 18, 439–454. https://doi.org/10.1016/S0169-2070(01)00110-8

• Johnson, S.G., 2021. The NLopt Nonlinear Optimization Package. https://nlopt.readthedocs.io/ version: 2021-11-01

• Svetunkov, I., Boylan, J.E., 2020. State-space ARIMA for Supply-chain Forecasting. International Journal of Production Research. 58, 818–827. https://doi.org/10.1080/00207543.2019.1600764