## 11.1 Maximum Likelihood Estimation

This section relies on the approach explained in detail in Chapter 13 of Svetunkov (2022a), so an interested reader is directed to that chapter.

The maximum likelihood estimation (MLE) of the ADAM relies on the distributional assumptions of each specific model. It might differ from one model to another (see Sections 5.5, 6.5 and 9.3). There are several options for the distribution supported by the adam() function in the smooth package, here we briefly discuss how the estimation is done for each one. We start with the additive error models, for which the assumptions, log-likelihoods and MLE of scales are provided in Table 11.1. The likelihoods are derived based on the probability density functions, discussed in Chapter 3 of Svetunkov (2022a), by taking logarithms of the product of probability density functions for all in sample observations. For example, this is what we get for the normal distribution: \begin{equation} \begin{aligned} \mathcal{L}(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & \prod_{t=1}^T \left(\frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{\left(y_t -\mu_t \right)^2}{2 \sigma^2} \right)\right) \\ \mathcal{L}(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & \frac{1}{\left(\sqrt{2 \pi \sigma^2}\right)^T} \exp \left( \sum_{t=1}^T -\frac{\epsilon_t^2}{2 \sigma^2} \right) \\ \ell(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & \log \mathcal{L}(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) \\ \ell(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & -\frac{T}{2} \log(2 \pi \sigma^2) -\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2} \end{aligned}, \tag{11.3} \end{equation} where $$\mathbf{y}$$ is the vector of all in-sample actual values. As for the scale, it is obtained by solving the equation after taking derivative of the log-likelihood (11.3) with respect to $$\sigma^2$$ and setting it equal to zero. We do not discuss the concentrated log-likelihoods (obtained after inserting the estimated scale in the respective log-likelihood function), because they are not useful in understanding of how the model is estimated, but knowing how to calculate scale helps, because it simplifies the model estimation process.

Table 11.1: Likelihood approach for additive error models
Assumption log-likelihood MLE of scale
$$\epsilon_t \sim \mathcal{N}(0, \sigma^2)$$ $$-\frac{T}{2} \log(2 \pi \sigma^2) -\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2}$$ $$\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T e_t^2$$
$$\epsilon_t \sim \mathcal{Laplace}(0, s)$$ $$-T \log(2 s) -\sum_{t=1}^T \frac{|\epsilon_t|}{s}$$ $$\hat{s} = \frac{1}{T} \sum_{t=1}^T |e_t|$$
$$\epsilon_t \sim \mathcal{S}(0, s)$$ $$-T \log(4 s^2) -\sum_{t=1}^T \frac{\sqrt{|\epsilon_t|}}{s}$$ $$\hat{s} = \frac{1}{2T} \sum_{t=1}^T \sqrt{|e_t|}$$
$$\epsilon_t \sim \mathcal{GN}(0, s, \beta)$$ \begin{aligned} &T\log\beta -T \log(2 s \Gamma\left(\beta^{-1}\right)) -\\ &\sum_{t=1}^T \frac{\left|\epsilon_t\right|^\beta}{s}\end{aligned} $$\hat{s} = \sqrt[^{\beta}]{\frac{\beta}{T} \sum_{t=1}^T\left| e_t \right|^{\beta}}$$
$$1+\frac{\epsilon_t}{\mu_{y,t}} \sim \mathcal{IG}(1, \sigma^2)$$ \begin{aligned} &-\frac{T}{2} \log \left(2 \pi \sigma^2 \right) -\frac{1}{2} \sum_{t=1}^T \left( \left(\frac{y_t}{\mu_{y,t}}\right)^3 \right) -\\ &\frac{3}{2}\sum_{t=1}^T \log y_t -\frac{1}{2\sigma^2} \sum_{t=1}^{T} \frac{\epsilon_t^2}{\mu_{y,t}y_t}\end{aligned} $$\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^{T} \frac{e_t^2}{\hat{\mu}_{y,t} y_t}$$
$$1+\frac{\epsilon_t}{\mu_{y,t}} \sim \mathcal{\Gamma}(\sigma^{-2}, \sigma^2)$$ \begin{aligned} &-T \log \Gamma \left(\sigma^{-2}\right) -\frac{T}{\sigma^2} \log \sigma^2 + \\ &\frac{1}{\sigma^2} \sum_{t=1}^T \log \left(\frac{y_t/\mu_{y,t}}{ \exp(y_t/\mu_{y,t})}\right) -\sum_{t=1}^T \log y_t\end{aligned} $$\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T \left(\frac{e_t}{\mu_{y,t}}\right)^2$$ *
$$1+\frac{\epsilon_t}{\mu_{y,t}} \sim \mathrm{log}\mathcal{N}\left(-\frac{\sigma^2}{2}, \sigma^2\right)$$ \begin{aligned} &-\frac{T}{2} \log \left(2 \pi \sigma^2\right) -\sum_{t=1}^T \log y_t -\\ &\frac{1}{2\sigma^2} \sum_{t=1}^{T} \left(\log \left(\frac{y_t}{\mu_{y,t}}\right)+\frac{\sigma^2}{2}\right)^2 \end{aligned} $$\hat{\sigma}^2 = 2\left(1-\sqrt{ 1-\frac{1}{T} \sum_{t=1}^{T} \log^2\left(\frac{y_t}{\hat{\mu}_{y,t}}\right)}\right)$$

Remark. MLE of scale parameter for Gamma distribution (formulated in ADAM) does not have a closed-form. While there are no proofs for it, it seems that the maximum of the likelihood of Gamma distribution is achieved, when $$\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T \left(\frac{e_t}{\mu_{y,t}}\right)^2$$, which corresponds to the estimate based on method of moments.

Other distributions can be used in ADAM ETS (for example, Logistic distribution or Student’s t), but we do not discuss them here. Note that for the additive error models $$y_t = \mu_{y,t}+\epsilon_t$$, thus some formulae in Table 11.1 are simplified. In all cases in Table 11.1, the assumptions imply that the actual value follows the same distribution but with a different location and/or scale. For example, for the Normal distribution, we have: \begin{equation} \begin{aligned} & \epsilon_t \sim \mathcal{N}(0, \sigma^2) \\ & \mathrm{or} \\ & y_t = \mu_{y,t}+\epsilon_t \sim \mathcal{N}(\mu_{y,t}, \sigma^2) \end{aligned}. \tag{11.4} \end{equation}

When it comes to the multiplicative error models, the likelihoods become slightly different. For example, when we assume that $$\epsilon_t \sim \mathcal{N}(0, \sigma^2)$$ in the multiplicative error model, this implies that: $\begin{equation} y_t = \mu_{y,t}(1+\epsilon_t) \sim \mathcal{N}(\mu_{y,t}, \mu_{y,t}^2 \sigma^2) . \tag{11.5} \end{equation}$ As a result, the log-likelihoods would have the $$\mu_{y,t}$$ part in the formulae: $\begin{equation} \ell(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = -\frac{1}{2} \sum_{t=1}^T \log(2 \pi \mu_{y,t}^2 \sigma^2) -\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2} . \tag{11.6} \end{equation}$

Remark. The part $$\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2}$$ does not contain $$\mu_{y,t}$$ because when inserted in the formula of Normal distribution, the numerator is: $$y_t - \mu_{y,t}$$, which is based on (11.5) is $$\mu_{y,t} \epsilon_t$$. After inserting in the formula for the Normal distribution that part becomes: $$\frac{1}{2} \sum_{t=1}^T \frac{(\mu_{y,t} \epsilon_t)^2}{\mu_{y,t}^2\sigma^2}$$ which after cancellations leads to (11.6).

Similar logic is applicable to $$\mathcal{Laplace}$$, $$\mathcal{S}$$ and $$\mathcal{GN}$$ distributions. From the practical point of view, these assumptions imply that the scale (and variance) of the distribution of $$y_t$$ changes together with the level of the series. When it comes to the $$\mathcal{IG}$$, $$\Gamma$$ and log$$\mathcal{N}$$, the assumptions are imposed on the $$1+\epsilon_t$$ part of the model, the respective likelihoods do not involve the expectation $$\mu_{y,t}$$, but the formulation still implies that the variance of the data increases together with the increase of the level of data.

All the likelihoods for the multiplicative error models are summarised in Table 11.2.

Table 11.2: Likelihood approach for multiplicative error models
Assumption log-likelihood MLE of scale
$$\epsilon_t \sim \mathcal{N}(0, \sigma^2)$$ \begin{aligned} &-\frac{T}{2} \log(2 \pi \sigma^2) -\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2} -\\ &\sum_{t=1}^T \log |\mu_{y,t}|\end{aligned} $$\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T e_t^2$$
$$\epsilon_t \sim \mathcal{Laplace}(0, s)$$ \begin{aligned} &-T \log(2 s) -\sum_{t=1}^T \frac{|\epsilon_t|}{s} -\\ &\sum_{t=1}^T \log \mu_{y,t}\end{aligned} $$\hat{s} = \frac{1}{T} \sum_{t=1}^T |e_t|$$
$$\epsilon_t \sim \mathcal{S}(0, s)$$ \begin{aligned} &-T \log(4 s^2) -\sum_{t=1}^T \frac{\sqrt{|\epsilon_t|}}{s} -\\ &\sum_{t=1}^T \log \mu_{y,t}\end{aligned} $$\hat{s} = \frac{1}{2T} \sum_{t=1}^T \sqrt{|e_t|}$$
$$\epsilon_t \sim \mathcal{GN}(0, s, \beta)$$ \begin{aligned} &T\log\beta -T \log(2 s \Gamma\left(\beta^{-1}\right)) -\\ &\sum_{t=1}^T \frac{\left|\epsilon_t\right|^\beta}{s} -\sum_{t=1}^T \log \mu_{y,t}\end{aligned} $$\hat{s} = \sqrt[^{\beta}]{\frac{\beta}{T} \sum_{t=1}^T\left| e_t \right|^{\beta}}$$
$$1+\epsilon_t \sim \mathcal{IG}(1, \sigma^2)$$ \begin{aligned} &-\frac{T}{2} \log \left(2 \pi \sigma^2 \right) -\frac{1}{2}\sum_{t=1}^T \left(1+\epsilon_{t}\right)^3 -\\ &\frac{3}{2}\sum_{t=1}^T \log y_t -\frac{1}{2\sigma^2} \sum_{t=1}^{T} \frac{\epsilon_t^2}{1+\epsilon_t}\end{aligned} $$\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^{T} \frac{e_t^2}{1+e_t}$$
$$1+\epsilon_t \sim \mathcal{\Gamma}(\sigma^{-2}, \sigma)$$ \begin{aligned} &-T \log \Gamma \left(\sigma^{-2}\right) -\frac{T}{\sigma^2} \log \sigma^2 + \\ &\frac{1}{\sigma^2} \sum_{t=1}^T \log \left(\frac{1+\epsilon_t}{\exp(1+\epsilon_t)}\right) -\sum_{t=1}^T \log y_t\end{aligned} $$\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T e_t^2$$ *
$$1+\epsilon_t \sim \mathrm{log}\mathcal{N}\left(-\frac{\sigma^2}{2}, \sigma^2\right)$$ \begin{aligned} &-\frac{T}{2} \log \left(2 \pi \sigma^2\right) - \sum_{t=1}^T \log y_t -\\ &\frac{1}{2\sigma^2} \sum_{t=1}^{T} \left(\log \left(1+\epsilon_t\right)+\frac{\sigma^2}{2}\right)^2 \end{aligned} $$\hat{\sigma}^2 = 2\left(1-\sqrt{ 1-\frac{1}{T} \sum_{t=1}^{T} \log^2\left(1+e_t\right)}\right)$$

When it comes to practicalities, the optimiser in adam() function calculates the scale from Tables 11.1 and 11.2 on each iteration and then uses it in the log-likelihood based on the respective distribution function. For additive error models:

1. $$\mathcal{N}$$dnorm(x=actuals, mean=fitted, sd=scale, log=TRUE) from stats package;
2. $$\mathcal{Laplace}$$dlaplace(q=actuals, mu=fitted, scale=scale, log=TRUE) from greybox package;
3. $$\mathcal{S}$$ds(q=actuals, mu=fitted, scale=scale, log=TRUE) from greybox package;
4. $$\mathcal{GN}$$dgnorm(x=actuals, mu=fitted, alpha=scale, beta=beta, log=TRUE) implemented in greybox package based on gnorm package (the version on CRAN is outdated);
5. $$\mathcal{IG}$$dinvgauss(x=actuals, mean=fitted, dispersion=scale/fitted, log=TRUE) from statmod package;
6. log$$\mathcal{N}$$dlnorm(x=actuals, meanlog=fitted-scale^2/2, sdlog=scale, log=TRUE) from stats package;
7. $$\mathcal{Gamma}$$dgamma(x=actuals, shape=1/scale, scale=scale*fitted, log=TRUE) from stats package.

and for multiplicative error models:

1. $$\mathcal{N}$$dnorm(x=actuals, mean=fitted, sd=scale*fitted, log=TRUE);
2. $$\mathcal{Laplace}$$dlaplace(q=actuals, mu=fitted, scale=scale*fitted, log=TRUE);
3. $$\mathcal{S}$$ds(q=actuals, mu=fitted, scale=scale*sqrt(fitted), log=TRUE);
4. $$\mathcal{GN}$$dgnorm(x=actuals, mu=fitted, alpha=scale*fitted^beta, beta=beta, log=TRUE);
5. $$\mathcal{IG}$$dinvgauss(x=actuals, mean=fitted, dispersion=scale/fitted, log=TRUE);
6. log$$\mathcal{N}$$dlnorm(x=actuals, meanlog=fitted-scale^2/2, sdlog=scale, log=TRUE);
7. $$\mathcal{Gamma}$$dgamma(x=actuals, shape=1/scale, scale=scale*fitted, log=TRUE).

Remark. In cases of $$\mathcal{GN}$$, additional parameter (namely $$\beta$$) is needed. If the user does not provide it, it will be estimated together with the other parameters via the maximisation of respective likelihoods.

The MLE of parameters of ADAM ETS models makes them comparable with each other irrespective of the types of components and distributional assumptions. As a result, model selection based on information criteria can be made using the auto.adam() function from the smooth, which will select the most appropriate distribution for ADAM.

### 11.1.1 An example in R

adam() function in smooth package has the parameter distribution, which allows selecting between several options discussed in this chapter, based on the respective density functions (see the list above). Here is a brief example in R with ADAM ETS(M,A,M) applied to the AirPassengers data with several distributions:

adamETSMAMAir <- vector("list",5)
distribution="dnorm")
distribution="dlaplace")
distribution="dgnorm")
distribution="dinvgauss")
distribution="dgamma")

In this case, the function will select the most appropriate ETS model for each distribution. We can see what was selected in each case and compare the models using information criteria:

setNames(sapply(adamETSMAMAir, AICc),
c("dnorm","dlaplace","dgnorm","dinvgauss","dgamma"))
##     dnorm  dlaplace    dgnorm dinvgauss    dgamma
##  971.5172  977.4446  980.0950  973.5000  973.9646

We could compare the performance of models in detail, but for demonstration, it should suffice to say that among the four models considered above, the model with the Normal distribution should be preferred based on the AICc value. This process of fit and selection can be automated using auto.adam() model, which accepts the vector of distributions to test and by default would consider distribution=c("default", "dnorm", "dlaplace", "ds", "dgnorm", "dlnorm", "dinvgauss", "dgamma"):

adamETSMAMAir <- auto.adam(AirPassengers, "MAM", h=12, holdout=TRUE)

This command should return the ADAM ETS(M,A,M) model with the most appropriate distribution, selected based on the AICc.

### References

• Svetunkov, I., 2022a. Statistics for business analytics. https://openforecast.org/sba/ (version: 31.03.2022)