In order to construct a model, we need to initialise it by defining the values of $$\mathbf{v}_{-m+1}, \dots, \mathbf{v}_0$$, the initial states of the model. There are different ways of doing that, but here we only discuss the following three:

1. Optimisation of initials,
2. Backcasting,
3. Provided values.

The first option implies that the values of initial states are found in the same procedure as the other parameters of the model. The second means that the initials are refined iteratively: the model is fit to the data from observation $$t=1$$ to $$t=T$$ and then backwards. Finally, the third is when the user knows the initials and provides them to the model.

As a side note, we assume in ADAM that the model is initialised at the moment just before $$t=1$$. We do not believe that it was initialised some time before the Big Bang (as ARIMA typically does), and we do not initialise it at the start of the sample. This way we make all models in ADAM comparable, making them work on exactly the same sample, no matter how many differences are taken or how many seasonal components they contain.

### 11.4.1 Optimisation vs backcasting

In case of optimisation, all the parameters of the model are estimated together. This includes (depending on the type of model):

• Smoothing parameters of ETS,
• Smoothing parameters for the regression part of the model,
• Damping parameter of ETS,
• Parameters of ARIMA: both AR(p) and MA(q),
• Initials of ETS,
• Initials of ARIMA,
• Initial values for explanatory variables,
• Constant / drift for ARIMA,
• Other additional parameters needed by assumed distributions.

The more complex the selected model is, the more parameters we will need to estimate, and all of this will happen in one and the same iterative process in the optimiser:

1. Choose parameters,
2. Fit the model,
3. Calculate loss function,
4. Compare the loss with the previous one,
5. Update the parameters based on (4),
6. Go to (2) and repeat until a specific criterion is met.
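
The loop above can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the code behind adam(): the model is a simple exponential smoothing, the loss is MSE, the optimiser is a naive step-halving search, the smoothing parameter is restricted to $$[0, 1]$$, and all function names (`fit_ses`, `mse`, `estimate`) are hypothetical.

```python
def fit_ses(y, alpha, level0):
    """Fit simple exponential smoothing and return one-step-ahead errors."""
    level = level0
    errors = []
    for obs in y:
        e = obs - level            # one-step-ahead forecast error
        errors.append(e)
        level = level + alpha * e  # update the level
    return errors


def mse(errors):
    return sum(e * e for e in errors) / len(errors)


def estimate(y, alpha=0.1, max_iter=200, tol=1e-10):
    """Estimate alpha and the initial level together (optimised initials)."""
    best = [alpha, float(y[0])]                # 1. choose parameters
    best_loss = mse(fit_ses(y, *best))         # 2-3. fit, calculate loss
    steps = [0.5, 0.5 * max(abs(best[1]), 1.0)]
    for _ in range(max_iter):
        improved = False
        for i in range(len(best)):
            for direction in (1, -1):
                candidate = list(best)
                candidate[i] += direction * steps[i]
                if i == 0 and not 0 <= candidate[0] <= 1:
                    continue                   # keep alpha in [0, 1]
                loss = mse(fit_ses(y, *candidate))
                if loss < best_loss:           # 4. compare with previous loss
                    best, best_loss = candidate, loss  # 5. update parameters
                    improved = True
        if not improved:
            steps = [s / 2 for s in steps]     # search closer to the current point
        if max(steps) < tol:                   # 6. stop when the steps are tiny
            break
    return best, best_loss
```

Note that the initial level sits in the same parameter vector as the smoothing parameter, which is exactly what makes the search space grow with the complexity of the model.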

The stopping criteria can be different and specified by the user. There are several options considered by the optimiser of adam():

1. Maximum number of iterations (maxeval), which is equal to $$40\times k$$ by default, where $$k$$ is the number of all estimated parameters;
2. The relative precision of the optimiser (xtol_rel) with default value of $$10^{-6}$$, which regulates the relative change of parameters;
3. The absolute precision of the optimiser (xtol_abs) with default value of $$10^{-8}$$, which regulates the absolute change of parameters;
4. The stopping criterion in case of the relative change in the loss function (ftol_rel) with default value of $$10^{-8}$$.

All these parameters are explained in more detail in the documentation of nloptr() function from nloptr package for R. adam() accepts several other stopping criteria, which can be found in the documentation of the function.
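
To make the four criteria more tangible, here is a rough sketch of how they could be checked between two consecutive iterations. The function `should_stop` is hypothetical, and the exact tests performed inside NLopt may differ in detail; only the default values are taken from the text above.

```python
def should_stop(iteration, x_old, x_new, f_old, f_new,
                xtol_rel=1e-6, xtol_abs=1e-8, ftol_rel=1e-8):
    """Return the name of the first criterion that is met, or None."""
    k = len(x_new)                       # number of estimated parameters
    if iteration >= 40 * k:              # maxeval = 40 * k
        return "maxeval"
    changes = [abs(new - old) for old, new in zip(x_old, x_new)]
    # Relative change of every parameter is small
    if all(d <= xtol_rel * abs(new) for d, new in zip(changes, x_new)):
        return "xtol_rel"
    # Absolute change of every parameter is small
    if all(d <= xtol_abs for d in changes):
        return "xtol_abs"
    # Relative change of the loss function is small
    if abs(f_new - f_old) <= ftol_rel * abs(f_new):
        return "ftol_rel"
    return None
```

For example, a model with three estimated parameters would be stopped after 120 iterations even if none of the precision criteria were met.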

The mechanism explained above can become quite complicated if a big complex model is constructed, and it might take a lot of time and manual tuning of parameters to get to the optimum. In some cases, it is worth reducing the number of estimated parameters, and one way of doing so is backcasting.

In case of backcasting, we do not need to estimate the initials of ETS, ARIMA and regression. What the model does in this case is go through the series from $$t=1$$ to $$t=T$$, fitting to the data, and then reverse and go back from $$t=T$$ to $$t=1$$ based on the following state space model: \begin{equation} \begin{aligned} {y}_{t} = &w(\mathbf{v}_{t+\boldsymbol{l}}) + r(\mathbf{v}_{t+\boldsymbol{l}}) \epsilon_t \\ \mathbf{v}_{t} = &f(\mathbf{v}_{t+\boldsymbol{l}}) + g(\mathbf{v}_{t+\boldsymbol{l}}) \epsilon_t \end{aligned}. \tag{11.20} \end{equation} The new values of $$\mathbf{v}_t$$ for $$t<1$$ are then used in order to fit the model to the data again. The procedure can be repeated several times in order for the initial states to converge to more reasonable values.
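
A minimal sketch of this idea for the simplest possible case, a local level model with a fixed smoothing parameter, is given below. It is not the implementation used in ADAM: the function names are hypothetical, the starting value (the first observation) is an assumption, and the backward pass is done by applying the same recursion to the reversed series.

```python
def ses_pass(y, alpha, level):
    """Run the level recursion through y and return the final level."""
    for obs in y:
        level = level + alpha * (obs - level)
    return level


def backcast_initial_level(y, alpha, n_iterations=2):
    """Refine the initial level by going forwards and then backwards."""
    level0 = float(y[0])                             # crude starting value
    for _ in range(n_iterations):
        level_T = ses_pass(y, alpha, level0)         # from t=1 to t=T
        level0 = ses_pass(y[::-1], alpha, level_T)   # from t=T back to t=1
    return level0
```

The initial level is not a parameter for the optimiser here: it is a by-product of the fitting, so only the smoothing parameter would need to be estimated.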

The backcasting procedure implies an extended fitting process for the model, removing the need to estimate all the initials. It works especially well on large samples of data (thousands of observations) and with models with several seasonal components. The bigger your model is, the more time the optimisation will take and the more likely backcasting is to do better. On the other hand, you might also prefer backcasting to optimisation in cases of small samples, when you do not have more than two seasons of data: estimation of initial seasonal components might become challenging and can potentially lead to overfitting.

When talking about specific models, ADAM ARIMA works better (faster and more accurate) with backcasting than with optimisation, because it does not need to estimate as many parameters as in the latter case. ADAM ETS, on the other hand, typically works quite well in case of optimisation, when there is enough data to train the model on. Last but not least, if you introduce explanatory variables, then optimising the initial states might be a better option than backcasting, unless you use a dynamic ETSX / ARIMAX, because the initial parameters for the explanatory variables will not be updated in the latter case.

It is also important to note that the information criteria of models with backcasting are typically lower than in case of the optimised initials. This is because the difference in the number of estimated parameters is substantial in these two cases and the models are initialised differently. So, it is advised not to mix the model selection between the two initialisation techniques.

Nonetheless, no matter what initialisation method is used, we need to start the fitting process from $$t=1$$, and this cannot be done unless we provide some pre-initialised values for parameters to the optimiser. The better we guess the initial values, the faster the optimiser will converge to the optimum. adam() uses several heuristics in this stage, explained in more detail in the next subsections.

### 11.4.2 Pre-initialisation of ADAM parameters

In this subsection we discuss how the values of smoothing parameters, damping parameter and coefficients of ARIMA are preset before the initialisation. All the things discussed here are heuristic, developed based on my experience and many experiments with ADAM ETS. Depending on the type of model, the vector of estimated parameters will contain different values. We start with smoothing parameters of ETS:

1. For the unsafe mixed models ETS(A,A,M), ETS(A,M,A), ETS(M,A,A) and ETS(M,A,M): $$\hat{\alpha}=0.01$$, $$\hat{\beta}=0$$ and $$\hat{\gamma}=0$$. This is needed because the models listed above are very sensitive to the changes in smoothing parameters and might fail for time series with level close to zero;
2. For one of the most complicated and sensitive models, ETS(M,M,A), $$\hat{\alpha}=\hat{\beta}=\hat{\gamma}=0$$. The combination of additive seasonality and multiplicative trend is one of the most difficult ones. The multiplicative error makes estimation even more challenging in cases of low level data. So starting from the deterministic model, which will work for sure, is a safe option;
3. ETS(M,A,N) is slightly easier to estimate than ETS(M,A,M) and ETS(M,A,A), so $$\hat{\alpha}=0.2$$, $$\hat{\beta}=0.01$$. The low value for the trend is needed to avoid the difficult situations with low level data, when the fitted values become negative;
4. ETS(M,M,N) and ETS(M,M,M) have $$\hat{\alpha}=0.1$$, $$\hat{\beta}=0.05$$ and $$\hat{\gamma}=0.01$$, making the trend and seasonal components a bit more conservative. The high values are not needed in this model as they might lead to explosive behaviour;
5. Other models with multiplicative components (ETS(M,N,N), ETS(M,N,A), ETS(M,N,M), ETS(A,N,M), ETS(A,M,N) and ETS(A,M,M)) are slightly easier to estimate and harder to break, so their parameters are set to $$\hat{\alpha}=0.1$$, $$\hat{\beta}=0.05$$ and $$\hat{\gamma}=0.05$$;
6. Finally, pure additive models are initialised with $$\hat{\alpha}=0.1$$, $$\hat{\beta}=0.05$$ and $$\hat{\gamma}=0.11$$. Their parameter space is the widest, and the models do not break on any data.
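
The six rules above amount to a lookup by model type, which can be written down compactly. The function `preset_smoothing` below is hypothetical and only encodes the values from the list; it returns $$\hat{\beta}$$ and $$\hat{\gamma}$$ only for models that have the respective components.

```python
def preset_smoothing(error, trend, season):
    """Preset smoothing parameters for ETS(error, trend, season).

    Each argument is one of "A", "M" or "N" ("N" is not valid for error).
    """
    model = (error, trend, season)
    if model in {("A", "A", "M"), ("A", "M", "A"),
                 ("M", "A", "A"), ("M", "A", "M")}:
        alpha, beta, gamma = 0.01, 0.0, 0.0      # rule 1: unsafe mixed models
    elif model == ("M", "M", "A"):
        alpha, beta, gamma = 0.0, 0.0, 0.0       # rule 2: deterministic start
    elif model == ("M", "A", "N"):
        alpha, beta, gamma = 0.2, 0.01, None     # rule 3
    elif model in {("M", "M", "N"), ("M", "M", "M")}:
        alpha, beta, gamma = 0.1, 0.05, 0.01     # rule 4
    elif "M" in model:
        alpha, beta, gamma = 0.1, 0.05, 0.05     # rule 5: other multiplicative
    else:
        alpha, beta, gamma = 0.1, 0.05, 0.11     # rule 6: pure additive
    values = {"alpha": alpha}
    if trend != "N":
        values["beta"] = beta
    if season != "N":
        values["gamma"] = gamma
    return values
```

For instance, `preset_smoothing("M", "A", "N")` gives $$\hat{\alpha}=0.2$$ and $$\hat{\beta}=0.01$$, while `preset_smoothing("A", "N", "N")` returns only $$\hat{\alpha}=0.1$$.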

The smoothing parameter for the explanatory variables is set to $$\hat{\delta}=0.01$$ in case of additive error and $$\hat{\delta}=0$$ in case of the multiplicative one. The latter is done because the model might break if some of ETS components are additive.

If the damping parameter is needed in the model, then its pre-initialised value is $$\hat{\phi}=0.95$$.

In case of ARIMA, the parameters are pre-initialised based on ACF and PACF. First, the in-sample actual values are differenced according to the selected order $$d$$ and all $$D_j$$, after which the ACF and PACF are calculated. Then the initials for AR parameters are taken from the PACF, while the initials for MA parameters are taken from the ACF, making sure that the sum of parameters is not greater than one in both cases. If it is, then the parameters are renormalised to satisfy the condition. The reason behind this mechanism is to get a potentially correct direction towards the optimal parameters of the model and make sure that the initial values satisfy the very basic stationarity and invertibility conditions. In cases when it is not possible to calculate ACF and PACF for the specified lags and orders, AR parameters are set to -0.1, while the MA parameters are set to 0.1, making sure that the conditions mentioned above hold.
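
The following sketch shows the logic of this mechanism for a non-seasonal ARIMA(p,d,q). It is an illustration under stated assumptions rather than the code of adam(): the function names are hypothetical, the PACF is obtained via the Durbin-Levinson recursion, and the renormalisation is done by dividing the parameters by their sum, which is only one possible way of satisfying the condition.

```python
def difference(x):
    return [b - a for a, b in zip(x, x[1:])]


def acf(x, max_lag):
    """Sample autocorrelations for lags 1..max_lag."""
    n, mean = len(x), sum(x) / len(x)
    c0 = sum((v - mean) ** 2 for v in x)
    return [sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / c0
            for k in range(1, max_lag + 1)]


def pacf(rho):
    """Partial autocorrelations from autocorrelations (Durbin-Levinson)."""
    result, phi_prev = [], []
    for k in range(1, len(rho) + 1):
        num = rho[k - 1] - sum(phi_prev[j] * rho[k - 2 - j] for j in range(k - 1))
        den = 1 - sum(phi_prev[j] * rho[j] for j in range(k - 1))
        phi_kk = num / den
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        result.append(phi_kk)
    return result


def rescale(values):
    """Assumed renormalisation: divide by the sum if it exceeds one."""
    total = sum(values)
    return [v / total for v in values] if total > 1 else list(values)


def preset_arima(y, p, d, q):
    """Return preset (AR, MA) parameters for a non-seasonal ARIMA(p,d,q)."""
    x = list(y)
    for _ in range(d):
        x = difference(x)
    max_lag = max(p, q)
    fallback = ([-0.1] * p, [0.1] * q)
    # ACF and PACF cannot be calculated on too short or constant series
    if max_lag == 0 or len(x) <= max_lag + 1 or len(set(x)) == 1:
        return fallback
    try:
        rho = acf(x, max_lag)
        return rescale(pacf(rho)[:p]), rescale(rho[:q])
    except ZeroDivisionError:
        return fallback
```

With `p = q = 1` both presets coincide with the first autocorrelation of the differenced data, because the PACF and the ACF are equal at lag one.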

If the skewness parameter of Asymmetric Laplace distribution is estimated, then its initial value is set to 0.5, corresponding to the median of the data. In case of Generalised Normal distribution, the shape parameter is set to 2 (if it is estimated), making the optimiser start from the conventional Normal distribution.

The pre-initialisations described above guarantee that the model is estimable for a wide variety of time series and that the optimiser will reach the optimum in a limited time. If this does not work for a specific case, a user can provide their own vector of pre-initialised parameters via the parameter B in ellipsis of the model. Furthermore, the typical bounds for the parameters can be tuned as well. For example, the bounds for smoothing parameters in ADAM ETS are (-5, 5), and they are needed only to simplify the optimisation procedure. The function will check the violation of either usual or admissible bounds inside the optimiser, but having some idea of where to search for optimal parameters helps. A user can provide their own vector for the lower bound via lb and for the upper one via ub.

### 11.4.3 Pre-initialisation of ADAM states, ETS

The pre-initialisation of states of ADAM ETS differs depending on whether the model is seasonal or not. If it is, then the multiple seasonal decomposition is done using msdecompose() function from smooth with the seasonality set to “multiplicative” if either error or seasonal component of ETS is multiplicative. After that:

• The initial level is equal to the first initial value from the function (which is the back forecasted de-seasonalised series);
• The value is corrected if regressors are included to remove their impact on the value (either by subtracting the fitted of the regression part or by dividing by them - depending on the type of error);
• If trend is additive and seasonality is multiplicative, then the trend component is obtained by multiplying the initial level and trend from the decomposition (remember, the assumed model is multiplicative in this case) and then subtracting the previous level;
• If trend is multiplicative and seasonality is additive, then the initials are added and then divided by the previous level to get the initial multiplicative trend component;
• If there is no seasonality and trend is multiplicative, then the initial trend is set to 1. This is done in order to avoid the potentially explosive behaviour of the model;
• If the trend is multiplicative and level is negative, then the level is substituted by the first actual value. This might happen in some weird cases of time series with low values;
• When it comes to seasonal components, if we have pure additive, or pure multiplicative ETS model or ETS(A,Z,M), then we use the seasonal indices, obtained from the msdecompose() function, making sure that they are normalised. The type of seasonality in msdecompose() corresponds to the seasonal component of ETS in this case, and nothing additional needs to be done;
• The situation is more challenging with ETS(M,Z,A), for which the decomposition would return the multiplicative seasonal components. In order to convert them to the additive, we take their logarithm and multiply them by the minimum value of the actual time series. This way we guarantee that the seasonal components are closer to the optimal ones.
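
The conversion used for ETS(M,Z,A) in the last point is short enough to write down. The function name below is hypothetical, and the input indices are assumed to be already normalised multiplicative ones.

```python
import math


def multiplicative_to_additive(indices, y):
    """Convert multiplicative seasonal indices to additive ones.

    Takes the logarithm of each index and multiplies it by the minimum
    of the actual values y.
    """
    scale = min(y)
    return [math.log(index) * scale for index in indices]
```

A useful side effect of taking logarithms is that multiplicative indices with a product of one turn into additive indices that sum to zero, so the normalisation is preserved.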

In case of the non-seasonal model, the algorithm is easier:

• The initial level is equal to either arithmetic or geometric mean (depending on the type of trend component) of the first $$\max(m_1,\dots,m_n)$$ observations, where $$m_j$$ is the seasonal periodicity. If the length of this mean is smaller than 20% of the sample, then the arithmetic mean of the first 20% of actual values is used;
• If regressors are included, then the value is modified, similar to how it is done in the seasonal ETS;
• If the model has additive trend, then its initial value is equal to the mean difference between the first $$\max(m_1,\dots,m_n)$$ observations;
• In case of multiplicative trend, the initial value is equal to the geometric mean of ratios between the first $$\max(m_1,\dots,m_n)$$ observations.
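
The rules for the level and trend (without regressors) can be sketched as follows. The function is hypothetical: `window` stands for $$\max(m_1,\dots,m_n)$$, the 20% of the sample is rounded up (an assumption), and at least two observations in the window are required to calculate the trend.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))


def preset_level_and_trend(y, window=1, trend="N"):
    """Return preset (level, trend) for a non-seasonal ETS; trend is A, M or N."""
    first = y[:window]
    if window < 0.2 * len(y):
        # The window is too short: use the first 20% of the sample instead
        level = arithmetic_mean(y[:math.ceil(0.2 * len(y))])
    elif trend == "M":
        level = geometric_mean(first)
    else:
        level = arithmetic_mean(first)
    if trend == "N":
        return level, None
    if window < 2:
        raise ValueError("at least two observations are needed for the trend")
    if trend == "A":
        # Mean difference between consecutive observations
        slope = arithmetic_mean([b - a for a, b in zip(first, first[1:])])
    else:
        # Geometric mean of ratios between consecutive observations
        slope = geometric_mean([b / a for a, b in zip(first, first[1:])])
    return level, slope
```

For the series 1, 2, ..., 10 with `window=4` and additive trend this gives the level of 2.5 and the trend of 1, while for a series that doubles every period the multiplicative trend is (up to rounding) equal to 2.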

In cases of the small samples (less than 2 seasonal periods), the procedure is similar to the one above, but the seasonal indices are obtained by taking the actual values and either subtracting an arithmetic mean or dividing them by the geometric one of the first $$m_j$$ observations, normalising them afterwards.

Finally, to make sure that safe initials are provided for the ETS(M,Z,Z) models, if the initial level is negative, then it is substituted by the global mean of the series.

The pre-initialisation described here is not simple, but it guarantees that any ETS model can be constructed and estimated on almost any data. Yes, there might still be some issues with mixed ETS models, but the mechanism used in ADAM is quite robust.

### 11.4.4 Pre-initialisation of ADAM states, ARIMA

Each state $$v_{i,t}$$ needs to be initialised with $$i$$ values (e.g. 1 for the first state, 2 for the second etc). This leads in general to more initial values for states than in the SSARIMA from Svetunkov and Boylan (2020b): $$\frac{K(K+1)}{2}$$ instead of $$K$$. However, this formulation has a more compact transition matrix, leading to computational improvements in terms of applying the model to the data with large $$K$$ (e.g. multiple seasonalities). Besides, we can reduce the number of initial seeds to estimate either by using a different initialisation procedure (e.g. backcasting) or by estimating directly $$y_t$$ and $$\epsilon_t$$ for $$t=\{-K+1, -K+2, \dots, 0\}$$ to obtain the initials for each state via the formula (9.7). In order to reduce the number of estimated parameters to $$K$$, we can take the conditional expectations for the states, in which case we will have: $\begin{equation*} \mathrm{E}(v_{i,t} | t) = \eta_i y_{t} \text{ for } t=\{-K+1, -K+2, \dots, 0\}, \end{equation*}$ and then use these expectations for the initialisation of ARIMA. At the same time, we can express the actual value in terms of the state and error from (9.4) for the last state $$K$$: $\begin{equation} y_{t} = \frac{v_{K,t} - \theta_K \epsilon_{t}}{\eta_K}. \tag{11.21} \end{equation}$ We select the last state $$K$$ because it has the highest number of initials to estimate among all states. We can then insert the value (11.21) in each formula for each state for $$i=\{1, 2, \dots, K-1\}$$ and take their expectations: $\begin{equation} \mathrm{E}(v_{i,t}|t) = \frac{\eta_i}{\eta_K} \mathrm{E}(v_{K,t}|t) \text{ for } t=\{-i+1, -i+2, \dots, 0\}. \tag{11.22} \end{equation}$ So the process then comes to estimating the initial states of $$v_{K,t}$$ for $$t=\{-K+1, -K+2, \dots, 0\}$$ and then propagating them to the other states. However, this strategy will only work for the states corresponding to ARI elements of the model. In case of MA(q), using the same principle of initialisation via the conditional expectation, we can set the initial MA states to zero and estimate only ARI states. This is a crude but relatively simple way to pre-initialise ADAM ARIMA.
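
The propagation in (11.22) can be illustrated with a short sketch. The function below is hypothetical: it takes the values $$\eta_1, \dots, \eta_K$$ and the $$K$$ initial values of the last state (ordered from $$t=-K+1$$ to $$t=0$$) as given, and assumes that $$\eta_K \neq 0$$.

```python
def propagate_initials(eta, v_K):
    """Obtain the initials of all ARI states from those of the last state.

    eta: [eta_1, ..., eta_K]; v_K: initials of state K for t = -K+1, ..., 0.
    Returns a list in which state i has i values for t = -i+1, ..., 0.
    """
    K = len(eta)
    states = []
    for i in range(1, K + 1):
        ratio = eta[i - 1] / eta[K - 1]
        # State i only needs the last i moments of time
        states.append([ratio * value for value in v_K[K - i:]])
    return states
```

For example, with `eta = [4.0, 2.0, 1.0]` and `v_K = [10.0, 20.0, 30.0]` the function returns one value for the first state, two for the second and three for the third, so that $$\frac{K(K+1)}{2}=6$$ initials are produced from only $$K=3$$ estimated ones.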

Having said all that, we need to point out that it is advised to use backcasting in case of the ADAM ARIMA model, because this is a more reliable and faster procedure for the initialisation of ARIMA than the optimisation.

### 11.4.5 Pre-initialisation of ADAM states, Regressors and constant

When it comes to the initials for the regressors, they are obtained from the parameters of the alm() model based on the rules below:

• The model with logarithm of response variable is constructed, if the error term is multiplicative and one of the following distributions has been selected: Normal, Laplace, S, Generalised Normal or Asymmetric Laplace;
• Otherwise the model is constructed based on provided formula and selected distribution;
• In both cases, the global trend is added to the formula to make sure that its effect on the values of parameters is reduced;
• If the data contains categorical variables (aka “factors” in R), then the factors are expanded to dummy variables, adding the baseline value as well. While the classical multiple regression would not be estimable in this situation, the dynamic models like ETSX and ARIMAX can work with the full set of levels of a categorical variable. In order to get the missing level, the intercept is added to the parameters of dummy variables, after which the obtained vector is normalised. This way we can get, for example, all seasonal components if we want to model seasonality via the X part of the model, not merging one of the components with the level.

Finally, the initialisation of constant (if it is needed in the model) is done depending on the selected model. In case of ARIMA with all $$D_j=0$$, the mean of the data is used. In all the other cases either the arithmetic mean of difference or geometric mean of ratios of all actual values is used. This is because the constant acts as a drift in the model in this situation. The impact of the constant is removed from the level in ETS and the states of ARIMA by either subtraction, or division.

### References

• Svetunkov, I., Boylan, J.E., 2020b. State-space ARIMA for supply-chain forecasting. International Journal of Production Research. 58, 818–827. https://doi.org/10.1080/00207543.2019.1600764