
12.1 Maximum Likelihood Estimation

The maximum likelihood estimation (MLE) of an ADAM model relies on the distributional assumptions of each specific model and thus differs from one model to another. There are several options for the distribution supported by the adam() function in the smooth package, and here we briefly discuss how the estimation is done for each of them. We start with the additive error models, for which the assumptions, log-likelihoods, and MLE of scales are provided in Table 12.1. The likelihoods are derived from the probability density functions discussed in Chapter 2, by taking the logarithm of the product of pdfs over all in-sample observations. For example, this is what we get for the Normal distribution (we already had a simple example of this earlier in the book): \[\begin{equation} \begin{aligned} \mathcal{L}(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & \prod_{t=1}^T \left(\frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{\left(y_t - \mu_{y,t} \right)^2}{2 \sigma^2} \right)\right) \\ \mathcal{L}(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & \frac{1}{\left(\sqrt{2 \pi \sigma^2}\right)^T} \exp \left( \sum_{t=1}^T -\frac{\epsilon_t^2}{2 \sigma^2} \right) \\ \ell(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & \log \mathcal{L}(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) \\ \ell(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & -\frac{T}{2} \log(2 \pi \sigma^2) -\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2} \end{aligned}, \tag{12.3} \end{equation}\] where \(\mathbf{y}\) is the vector of all in-sample actual values. The scale is obtained by taking the derivative of the log-likelihood (12.3) with respect to \(\sigma^2\), setting it equal to zero, and solving the resulting equation. We do not discuss the concentrated log-likelihoods (obtained after inserting the estimated scale into the respective log-likelihood function), because they do not add to the understanding of how the model is estimated; knowing how to calculate the scale, however, is useful, because it simplifies the model estimation process.
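To make the last line of (12.3) more tangible, here is a minimal numerical check in R with artificially generated residuals (in a real application, they would come from the fitted model): the value produced by the formula coincides with the sum of log-densities returned by dnorm().

set.seed(42)
obsInSample <- 100
epsilon <- rnorm(obsInSample, 0, 2)    # artificial residuals for the illustration
sigma2 <- mean(epsilon^2)              # MLE of the scale (see Table 12.1 below)
# Log-likelihood calculated via the last line of (12.3)
-obsInSample/2 * log(2*pi*sigma2) - sum(epsilon^2) / (2*sigma2)
# The same value calculated via the Normal density function
sum(dnorm(epsilon, mean=0, sd=sqrt(sigma2), log=TRUE))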

Table 12.1: Likelihood approach for additive error models
Normal. Assumption: \(\epsilon_t \sim \mathcal{N}(0, \sigma^2)\). Log-likelihood: \(-\frac{T}{2} \log(2 \pi \sigma^2) -\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2}\). MLE of scale: \(\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T e_t^2\).
Laplace. Assumption: \(\epsilon_t \sim \mathcal{Laplace}(0, s)\). Log-likelihood: \(-T \log(2 s) -\sum_{t=1}^T \frac{|\epsilon_t|}{s}\). MLE of scale: \(\hat{s} = \frac{1}{T} \sum_{t=1}^T |e_t|\).
S. Assumption: \(\epsilon_t \sim \mathcal{S}(0, s)\). Log-likelihood: \(-T \log(4 s^2) -\sum_{t=1}^T \frac{\sqrt{|\epsilon_t|}}{s}\). MLE of scale: \(\hat{s} = \frac{1}{2T} \sum_{t=1}^T \sqrt{|e_t|}\).
Generalised Normal. Assumption: \(\epsilon_t \sim \mathcal{GN}(0, s, \beta)\). Log-likelihood: \(T\log\beta -T \log(2 s \Gamma\left(\beta^{-1}\right)) -\sum_{t=1}^T \frac{\left|\epsilon_t\right|^\beta}{s}\). MLE of scale: \(\hat{s} = \sqrt[^{\beta}]{\frac{\beta}{T} \sum_{t=1}^T\left| e_t \right|^{\beta}}\).
Asymmetric Laplace. Assumption: \(\epsilon_t \sim \mathcal{ALaplace}(0, s, \alpha)\). Log-likelihood: \(T\log\left(\alpha(1-\alpha)\right) -T \log(s) -\sum_{t=1}^T \frac{\epsilon_t (\alpha - I(\epsilon_t \leq 0))}{s}\). MLE of scale: \(\hat{s} = \frac{1}{T} \sum_{t=1}^T e_t(\alpha -I(e_t \leq 0))\).
Inverse Gaussian. Assumption: \(1+\frac{\epsilon_t}{\mu_{y,t}} \sim \mathcal{IG}(1, s)\). Log-likelihood: \(-\frac{T}{2} \log \left(2 \pi s \left(\frac{y_t}{\mu_{y,t}}\right)^3 \right) -\frac{3}{2}\sum_{t=1}^T \log y_t -\frac{1}{2s} \sum_{t=1}^{T} \frac{\epsilon_t^2}{\mu_{y,t}y_t}\). MLE of scale: \(\hat{s} = \frac{1}{T} \sum_{t=1}^{T} \frac{e_t^2}{\hat{\mu}_{y,t} y_t}\).
Gamma. Assumption: \(1+\frac{\epsilon_t}{\mu_{y,t}} \sim \mathcal{\Gamma}(s^{-1}, s)\). Log-likelihood: \(-T \log \Gamma \left(s^{-1}\right) - \frac{T}{s} \log s + \frac{1}{s} \sum_{t=1}^T \log \left(\frac{y_t/\mu_{y,t}}{ \exp(y_t/\mu_{y,t})}\right) - \sum_{t=1}^T \log y_t\). MLE of scale: \(\hat{s} = \frac{1}{T} \sum_{t=1}^T \left(\frac{e_t}{\mu_{y,t}}\right)^2\) (*).
Log Normal. Assumption: \(1+\frac{\epsilon_t}{\mu_{y,t}} \sim \mathrm{log}\mathcal{N}\left(-\frac{\sigma^2}{2}, \sigma^2\right)\). Log-likelihood: \(-\frac{T}{2} \log \left(2 \pi \sigma^2\right) -\frac{1}{2\sigma^2} \sum_{t=1}^{T} \left(\log \left(\frac{y_t}{\mu_{y,t}}\right)+\frac{\sigma^2}{2}\right)^2 -\sum_{t=1}^T \log y_t\). MLE of scale: \(\hat{\sigma}^2 = 2\left(1-\sqrt{ 1-\frac{1}{T} \sum_{t=1}^{T} \log^2\left(\frac{y_t}{\hat{\mu}_{y,t}}\right)}\right)\).

(*) Note that the MLE of the scale parameter for the Gamma distribution (as formulated in ADAM) is not available in a closed form. While we do not have a formal proof, it appears that the maximum of the Gamma likelihood is achieved when \(\hat{s} = \frac{1}{T} \sum_{t=1}^T \left(\frac{e_t}{\mu_{y,t}}\right)^2\), which corresponds to the method of moments estimate of the scale.
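This can be probed numerically. The following is a minimal sketch with artificially generated relative errors (denoted epsilon in the code), assuming \(1+\epsilon_t \sim \Gamma(s^{-1}, s)\); it compares the method of moments estimate with the value of \(s\) that maximises the Gamma log-likelihood via optimise():

set.seed(41)
sTrue <- 0.05
# Simulate 1+epsilon_t from Gamma(1/s, s), so that its mean equals one
epsilon <- rgamma(10000, shape=1/sTrue, scale=sTrue) - 1
# Method of moments estimate of the scale
sMoM <- mean(epsilon^2)
# Numerical maximisation of the Gamma log-likelihood over s
logLikGamma <- function(s){
  sum(dgamma(1+epsilon, shape=1/s, scale=s, log=TRUE))
}
sMLE <- optimise(logLikGamma, interval=c(1e-5, 1), maximum=TRUE)$maximum
# The two estimates should be very close to each other
c(MoM=sMoM, MLE=sMLE)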

Other distributions can be used in ADAM ETS as well (for example, the Logistic distribution or Student’s t), but we do not discuss them here. Note that for the additive error models \(y_t = \mu_{y,t}+\epsilon_t\), which is why some of the formulae in Table 12.1 simplify. In all of the cases in Table 12.1, the assumptions imply that the actual value follows the same distribution as the error term, but with a different location and/or scale. For example, for the Normal distribution we have: \[\begin{equation} \begin{aligned} & \epsilon_t \sim \mathcal{N}(0, \sigma^2) \\ & \mathrm{or} \\ & y_t = \mu_{y,t}+\epsilon_t \sim \mathcal{N}(\mu_{y,t}, \sigma^2) \end{aligned}. \tag{12.4} \end{equation}\]

When it comes to the multiplicative error models, the likelihoods become slightly different. For example, when we assume that \(\epsilon_t \sim \mathcal{N}(0, \sigma^2)\) in the multiplicative error model, this implies that: \[\begin{equation} y_t = \mu_{y,t}(1+\epsilon_t) \sim \mathcal{N}(\mu_{y,t}, \mu_{y,t}^2 \sigma^2) . \tag{12.5} \end{equation}\] As a result, the log-likelihoods contain the \(\mu_{y,t}\) part in their formulae. Similar logic applies to the \(\mathcal{Laplace}\), \(\mathcal{S}\), \(\mathcal{GN}\), and \(\mathcal{ALaplace}\) distributions. From the practical point of view, these assumptions imply that the scale (and variance) of the distribution of \(y_t\) changes together with the level of the series. As for the \(\mathcal{IG}\) and log\(\mathcal{N}\) distributions, their assumptions are imposed on the \(1+\epsilon_t\) part of the model, so the respective likelihoods do not involve the expectation \(\mu_{y,t}\); still, the formulation implies that the variance of the data increases together with the level of the data.

All the likelihoods for the multiplicative error models are summarised in Table 12.2.

Table 12.2: Likelihood approach for multiplicative error models
Normal. Assumption: \(\epsilon_t \sim \mathcal{N}(0, \sigma^2)\). Log-likelihood: \(-\frac{T}{2} \log(2 \pi \sigma^2) -\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2} -\sum_{t=1}^T \log |\mu_{y,t}|\). MLE of scale: \(\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T e_t^2\).
Laplace. Assumption: \(\epsilon_t \sim \mathcal{Laplace}(0, s)\). Log-likelihood: \(-T \log(2 s) -\sum_{t=1}^T \frac{|\epsilon_t|}{s} -\sum_{t=1}^T \log \mu_{y,t}\). MLE of scale: \(\hat{s} = \frac{1}{T} \sum_{t=1}^T |e_t|\).
S. Assumption: \(\epsilon_t \sim \mathcal{S}(0, s)\). Log-likelihood: \(-T \log(4 s^2) -\sum_{t=1}^T \frac{\sqrt{|\epsilon_t|}}{s} -\sum_{t=1}^T \log \mu_{y,t}\). MLE of scale: \(\hat{s} = \frac{1}{2T} \sum_{t=1}^T \sqrt{|e_t|}\).
Generalised Normal. Assumption: \(\epsilon_t \sim \mathcal{GN}(0, s, \beta)\). Log-likelihood: \(T\log\beta -T \log(2 s \Gamma\left(\beta^{-1}\right)) -\sum_{t=1}^T \frac{\left|\epsilon_t\right|^\beta}{s} -\sum_{t=1}^T \log \mu_{y,t}\). MLE of scale: \(\hat{s} = \sqrt[^{\beta}]{\frac{\beta}{T} \sum_{t=1}^T\left| e_t \right|^{\beta}}\).
Asymmetric Laplace. Assumption: \(\epsilon_t \sim \mathcal{ALaplace}(0, s, \alpha)\). Log-likelihood: \(T\log\left(\alpha(1-\alpha)\right) -T \log(s) -\sum_{t=1}^T \frac{\epsilon_t (\alpha - I(\epsilon_t \leq 0))}{s} -\sum_{t=1}^T \log \mu_{y,t}\). MLE of scale: \(\hat{s} = \frac{1}{T} \sum_{t=1}^T e_t(\alpha -I(e_t \leq 0))\).
Inverse Gaussian. Assumption: \(1+\epsilon_t \sim \mathcal{IG}(1, s)\). Log-likelihood: \(-\frac{T}{2} \log \left(2 \pi s \left(1+\epsilon_{t}\right)^3 \right) -\frac{3}{2}\sum_{t=1}^T \log y_t -\frac{1}{2s} \sum_{t=1}^{T} \frac{\epsilon_t^2}{1+\epsilon_t}\). MLE of scale: \(\hat{s} = \frac{1}{T} \sum_{t=1}^{T} \frac{e_t^2}{1+e_t}\).
Gamma. Assumption: \(1+\epsilon_t \sim \mathcal{\Gamma}(s^{-1}, s)\). Log-likelihood: \(-T \log \Gamma \left(s^{-1}\right) - \frac{T}{s} \log s + \frac{1}{s} \sum_{t=1}^T \log \left(\frac{1+\epsilon_t}{\exp(1+\epsilon_t)}\right) - \sum_{t=1}^T \log y_t\). MLE of scale: \(\hat{s} = \frac{1}{T} \sum_{t=1}^T e_t^2\) (*).
Log Normal. Assumption: \(1+\epsilon_t \sim \mathrm{log}\mathcal{N}\left(-\frac{\sigma^2}{2}, \sigma^2\right)\). Log-likelihood: \(-\frac{T}{2} \log \left(2 \pi \sigma^2\right) -\frac{1}{2\sigma^2} \sum_{t=1}^{T} \left(\log \left(1+\epsilon_t\right)+\frac{\sigma^2}{2}\right)^2 -\sum_{t=1}^T \log y_t\). MLE of scale: \(\hat{\sigma}^2 = 2\left(1-\sqrt{ 1-\frac{1}{T} \sum_{t=1}^{T} \log^2\left(1+e_t\right)}\right)\).

When it comes to practicalities, the adam() function works as follows: on each iteration of the optimiser, after fitting the model, the scales from Tables 12.1 and 12.2 are calculated and then used in the log-likelihoods based on the respective density functions (a short sketch illustrating this for the Normal distribution is provided after the lists). For additive error models, the functions are:

  1. \(\mathcal{N}\) - dnorm(x=actuals, mean=fitted, sd=scale, log=TRUE) from stats package;
  2. \(\mathcal{Laplace}\) - dlaplace(q=actuals, mu=fitted, scale=scale, log=TRUE) from greybox package;
  3. \(\mathcal{S}\) - ds(q=actuals, mu=fitted, scale=scale, log=TRUE) from greybox package;
  4. \(\mathcal{GN}\) - dgnorm(x=actuals, mu=fitted, alpha=scale, beta=beta, log=TRUE), implemented in the greybox package based on the gnorm package (the version of gnorm on CRAN is outdated);
  5. \(\mathcal{ALaplace}\) - dalaplace(q=actuals, mu=fitted, scale=scale, alpha=alpha, log=TRUE) from greybox package;
  6. \(\mathcal{IG}\) - dinvgauss(x=actuals, mean=fitted, dispersion=scale/fitted, log=TRUE) from statmod package;
  7. log\(\mathcal{N}\) - dlnorm(x=actuals, meanlog=log(fitted)-scale^2/2, sdlog=scale, log=TRUE) from stats package;
  8. \(\mathcal{\Gamma}\) - dgamma(x=actuals, shape=1/scale, scale=scale*fitted, log=TRUE) from stats package.

and for multiplicative error models:

  1. \(\mathcal{N}\) - dnorm(x=actuals, mean=fitted, sd=scale*fitted, log=TRUE);
  2. \(\mathcal{Laplace}\) - dlaplace(q=actuals, mu=fitted, scale=scale*fitted, log=TRUE);
  3. \(\mathcal{S}\) - ds(q=actuals, mu=fitted, scale=scale*sqrt(fitted), log=TRUE);
  4. \(\mathcal{GN}\) - dgnorm(x=actuals, mu=fitted, alpha=scale*fitted^beta, beta=beta, log=TRUE);
  5. \(\mathcal{ALaplace}\) - dalaplace(q=actuals, mu=fitted, scale=scale*fitted, alpha=alpha, log=TRUE);
  6. \(\mathcal{IG}\) - dinvgauss(x=actuals, mean=fitted, dispersion=scale/fitted, log=TRUE);
  7. log\(\mathcal{N}\) - dlnorm(x=actuals, meanlog=log(fitted)-scale^2/2, sdlog=scale, log=TRUE);
  8. \(\mathcal{\Gamma}\) - dgamma(x=actuals, shape=1/scale, scale=scale*fitted, log=TRUE).

Note that in the cases of \(\mathcal{GN}\) and \(\mathcal{ALaplace}\), additional parameters (namely \(\beta\) and \(\alpha\)) are needed. If the user does not provide them, they are estimated together with the other parameters via the maximisation of the respective likelihoods.
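To illustrate how the scales and the density functions above fit together, here is a minimal sketch for the Normal distribution with artificial data; in adam() itself, actuals and fitted would come from the model on the current iteration of the optimiser:

set.seed(33)
obsInSample <- 100
# Artificial level and actual values with a multiplicative Normal error
fitted <- 100 + cumsum(rnorm(obsInSample, 0, 1))
actuals <- fitted * (1 + rnorm(obsInSample, 0, 0.05))

# Additive error model: e_t = y_t - mu_t
eAdditive <- actuals - fitted
scaleAdditive <- sqrt(mean(eAdditive^2))             # MLE of sigma, Table 12.1
logLikAdditive <- sum(dnorm(x=actuals, mean=fitted,
                            sd=scaleAdditive, log=TRUE))

# Multiplicative error model: e_t = y_t / mu_t - 1
eMultiplicative <- actuals/fitted - 1
scaleMultiplicative <- sqrt(mean(eMultiplicative^2)) # MLE of sigma, Table 12.2
logLikMultiplicative <- sum(dnorm(x=actuals, mean=fitted,
                                  sd=scaleMultiplicative*fitted, log=TRUE))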

The MLE of parameters makes ADAM ETS models comparable with each other, irrespective of the types of their components and of the distributional assumptions. As a result, model selection based on information criteria can be done using the auto.adam() function from smooth, which will select both the most appropriate ETS model and the most suitable distribution.

12.1.1 An example in R

The adam() function in the smooth package has the distribution parameter, which allows selecting between the options discussed in this chapter, based on the respective density functions (see the list above with dnorm() etc.). Here is a brief example in R with ADAM ETS(M,A,M) applied to the AirPassengers data with several distributions.
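Below is a minimal sketch of how this can be done; the object name adamETSAir and the settings h=12 and holdout=TRUE are assumptions made for the illustration, not necessarily the exact code behind the output shown further:

library(smooth)

adamETSAir <- vector("list", 5)
names(adamETSAir) <- c("dnorm", "dlaplace", "dgnorm", "dinvgauss", "dgamma")
for(i in 1:length(adamETSAir)){
  # Estimate ETS(M,A,M) with the i-th distributional assumption
  adamETSAir[[i]] <- adam(AirPassengers, "MAM", h=12, holdout=TRUE,
                          distribution=names(adamETSAir)[i])
}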

In this case, the function will estimate the ETS(M,A,M) model with each of the distributions. We can then compare the resulting models using information criteria.
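For example, assuming that the models were collected in the adamETSAir list from the sketch above, the AICc values can be extracted with sapply():

sapply(adamETSAir, AICc)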

##     dnorm  dlaplace    dgnorm dinvgauss    dgamma 
##  972.5669  975.2124  977.9902  973.7769  973.1933

We could compare the performance of these models in detail but, for the purposes of this demonstration, it suffices to say that, among the five models considered above, the one with the Normal distribution should be preferred based on the AICc value. This process of fitting and selection can be automated using the auto.adam() function, which accepts a vector of distributions to test and by default considers distribution=c("dnorm", "dlaplace", "ds", "dgnorm", "dlnorm", "dinvgauss", "dgamma").
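A sketch of such a call, with the same assumed settings as in the example above:

adamETSAirAuto <- auto.adam(AirPassengers, "MAM", h=12, holdout=TRUE)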

This command should return the ADAM ETS(M,A,M) model with the most appropriate distribution, selected based on the AICc.