
11.1 Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is one of the most popular approaches for estimating the parameters of a statistical model. It relies on assumptions about the distribution of the response variable and uses either a probability density or a cumulative distribution function, depending on the assumed model, aiming to find the parameters that maximise the likelihood of observing the available data. An interested reader can find a detailed explanation of the likelihood approach in Chapter 16 of Svetunkov (2022).

Remark. Contrary to common belief, MLE itself does not assume that the true model follows a specific distribution; it merely fits the predefined distribution to the available data.

The MLE of ADAM relies on the distributional assumptions of each specific model and might differ from one model to another (see the assumptions in Sections 5.5, 6.5, and 9.3). The adam() function from the smooth package supports several options for its distribution parameter; here we briefly discuss how the estimation is done for each of them. We start with the additive error models, for which the assumptions, log-likelihoods, and MLEs of scale are provided in Table 11.1.

Table 11.1: Likelihood approach for additive error models. \(\mathcal{N}\) is the Normal, \(\mathcal{L}\) is the Laplace, \(\mathcal{S}\) is the S, \(\mathcal{GN}\) is the Generalised Normal, \(\mathcal{IG}\) is the Inverse Gaussian, \(\Gamma\) is the Gamma, and \(\mathrm{log}\mathcal{N}\) is the Log-Normal distribution.

| Assumption | log-likelihood | MLE of scale |
|:---|:---|:---|
| \(\epsilon_t \sim \mathcal{N}(0, \sigma^2)\) | \(-\frac{T}{2} \log(2 \pi \sigma^2) -\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2}\) | \(\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T e_t^2\) |
| \(\epsilon_t \sim \mathcal{L}(0, s)\) | \(-T \log(2 s) -\sum_{t=1}^T \frac{\vert\epsilon_t\vert}{s}\) | \(\hat{s} = \frac{1}{T} \sum_{t=1}^T \vert e_t \vert\) |
| \(\epsilon_t \sim \mathcal{S}(0, s)\) | \(-T \log(4 s^2) -\sum_{t=1}^T \frac{\sqrt{\vert\epsilon_t\vert}}{s}\) | \(\hat{s} = \frac{1}{2T} \sum_{t=1}^T \sqrt{\vert e_t\vert}\) |
| \(\epsilon_t \sim \mathcal{GN}(0, s, \beta)\) | \(T\log\beta -T \log\left(2 s \Gamma\left(\beta^{-1}\right)\right) -\sum_{t=1}^T \frac{\vert\epsilon_t\vert^\beta}{s^\beta}\) | \(\hat{s} = \sqrt[\beta]{\frac{\beta}{T} \sum_{t=1}^T \vert e_t \vert^{\beta}}\) |
| \(1+\frac{\epsilon_t}{\mu_{y,t}} \sim \mathcal{IG}(1, \sigma^2)\) | \(-\frac{T}{2} \log \left(2 \pi \sigma^2 \right) -\frac{1}{2} \sum_{t=1}^T \log \frac{y_t}{\mu_{y,t}} -\sum_{t=1}^T \log y_t -\frac{1}{2\sigma^2} \sum_{t=1}^{T} \frac{\epsilon_t^2}{\mu_{y,t}y_t}\) | \(\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^{T} \frac{e_t^2}{\hat{\mu}_{y,t} y_t}\) |
| \(1+\frac{\epsilon_t}{\mu_{y,t}} \sim \Gamma(\sigma^{-2}, \sigma^2)\) | \(-T \log \Gamma \left(\sigma^{-2}\right) -\frac{T}{\sigma^2} \log \sigma^2 +\frac{1}{\sigma^2} \sum_{t=1}^T \log \left(\frac{y_t/\mu_{y,t}}{\exp(y_t/\mu_{y,t})}\right) -\sum_{t=1}^T \log y_t\) | \(\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T \left(\frac{e_t}{\hat{\mu}_{y,t}}\right)^2\) \* |
| \(1+\frac{\epsilon_t}{\mu_{y,t}} \sim \mathrm{log}\mathcal{N}\left(-\frac{\sigma^2}{2}, \sigma^2\right)\) | \(-\frac{T}{2} \log \left(2 \pi \sigma^2\right) -\sum_{t=1}^T \log y_t -\frac{1}{2\sigma^2} \sum_{t=1}^{T} \left(\log \left(\frac{y_t}{\mu_{y,t}}\right)+\frac{\sigma^2}{2}\right)^2\) | \(\hat{\sigma}^2 = 2\left(1-\sqrt{1-\frac{1}{T} \sum_{t=1}^{T} \log^2\left(\frac{y_t}{\hat{\mu}_{y,t}}\right)}\right)\) |

\* The scale of the Gamma distribution does not have a closed-form MLE; the method of moments estimate is used instead (see the Remark below).

The likelihoods are derived from the probability density functions by taking the logarithms of their products over all in-sample observations. For example, this is what we get for the Normal distribution: \[\begin{equation} \begin{aligned} \mathcal{L}(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & \prod_{t=1}^T \left(\frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{\left(y_t -\mu_{y,t} \right)^2}{2 \sigma^2} \right)\right) \\ \mathcal{L}(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & \frac{1}{\left(\sqrt{2 \pi \sigma^2}\right)^T} \exp \left( -\sum_{t=1}^T \frac{\epsilon_t^2}{2 \sigma^2} \right) \\ \ell(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & \log \mathcal{L}(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) \\ \ell(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & -\frac{T}{2} \log(2 \pi \sigma^2) -\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2} \end{aligned}, \tag{11.3} \end{equation}\] where \(\mathbf{y}\) is the vector of all in-sample actual values, \(\mathcal{L}(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y})\) is the likelihood value, and \(\ell(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y})\) is the log-likelihood. The MLE of the scale is obtained by taking the derivative of the log-likelihood (11.3) with respect to \(\sigma^2\), setting it equal to zero, and solving for \(\sigma^2\). We do not discuss the concentrated log-likelihoods (obtained by inserting the estimated scale into the respective log-likelihood function), because they do not help in understanding how the model is estimated; however, knowing how to calculate the scale directly is useful, because it simplifies the estimation process.
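To illustrate the last point, here is a minimal numerical sketch in R (our own illustration with made-up residuals, not code from the smooth package), which maximises the Normal log-likelihood (11.3) over \(\sigma^2\) and compares the optimum with the closed-form estimate from Table 11.1:

# Made-up in-sample residuals
set.seed(41)
e <- rnorm(100, 0, 1.5)
# Log-likelihood (11.3) as a function of sigma^2
logLikNormal <- function(sigma2){
    -length(e)/2*log(2*pi*sigma2) - sum(e^2)/(2*sigma2)
}
# The numerical optimum...
optimise(logLikNormal, interval=c(0.01, 10), maximum=TRUE)$maximum
# ... coincides with the closed-form estimate
mean(e^2)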

Remark. The MLE of the scale parameter of the Gamma distribution (as formulated in ADAM) does not have a closed form. While we are not aware of a formal proof, the maximum of the Gamma likelihood appears to be achieved when \(\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T \left(\frac{e_t}{\hat{\mu}_{y,t}}\right)^2\), which corresponds to the method of moments estimate.
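This can be checked numerically. The following sketch (our own illustration, with a made-up true scale of 0.1) maximises the Gamma log-likelihood from Table 11.1 over \(\sigma^2\) and compares the optimum with the method of moments estimate:

# Simulate y_t/mu_{y,t} = 1 + e_t/mu_{y,t} ~ Gamma(1/sigma^2, sigma^2)
set.seed(41)
sigma2True <- 0.1
x <- rgamma(10000, shape=1/sigma2True, scale=sigma2True)
# Gamma log-likelihood from Table 11.1 (terms not depending on sigma^2 are dropped)
logLikGamma <- function(sigma2){
    -length(x)*lgamma(1/sigma2) - length(x)*log(sigma2)/sigma2 +
        sum(log(x)-x)/sigma2
}
# The numerical optimum is close to the method of moments estimate
optimise(logLikGamma, interval=c(0.001, 1), maximum=TRUE)$maximum
mean((x-1)^2)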

While other distributions can be used in ADAM ETS (for example, the Logistic or Student's t distribution), we do not discuss them here. In all cases in Table 11.1, the assumptions imply that the actual value follows the same distribution as the error term (due to the additivity of the model), but with a different location and/or scale. For example, for the Normal distribution we have: \[\begin{equation} \begin{aligned} & \epsilon_t \sim \mathcal{N}(0, \sigma^2) \\ & \mathrm{or} \\ & y_t = \mu_{y,t}+\epsilon_t \sim \mathcal{N}(\mu_{y,t}, \sigma^2) \end{aligned}. \tag{11.4} \end{equation}\]

When it comes to the multiplicative error models, the likelihoods become slightly different. For example, when we assume that \(\epsilon_t \sim \mathcal{N}(0, \sigma^2)\) in the multiplicative error model, this implies that: \[\begin{equation} y_t = \mu_{y,t}(1+\epsilon_t) \sim \mathcal{N}(\mu_{y,t}, \mu_{y,t}^2 \sigma^2) . \tag{11.5} \end{equation}\] As a result, the log-likelihoods would have the \(\mu_{y,t}\) part in the formulae: \[\begin{equation} \ell(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = -\frac{1}{2} \sum_{t=1}^T \log(2 \pi \mu_{y,t}^2 \sigma^2) -\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2} . \tag{11.6} \end{equation}\]

Remark. The part \(\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2}\) does not contain \(\mu_{y,t}\), because, based on (11.5), the deviation in the numerator of the Normal density is \(y_t - \mu_{y,t} = \mu_{y,t} \epsilon_t\), so that this part becomes \(\frac{1}{2} \sum_{t=1}^T \frac{(\mu_{y,t} \epsilon_t)^2}{\mu_{y,t}^2\sigma^2}\), which after cancellations leads to (11.6).
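This equivalence is easy to verify numerically (a small check of (11.5) and (11.6) with arbitrary made-up values): the sum of log-densities of \(\mathcal{N}(\mu_{y,t}, \mu_{y,t}^2\sigma^2)\) coincides with (11.6):

# Arbitrary values for the check
mu <- c(100, 110, 120)
sigma <- 0.05
epsilon <- c(0.02, -0.03, 0.01)
y <- mu*(1+epsilon)
# Log-likelihood (11.6)
-sum(log(2*pi*mu^2*sigma^2))/2 - sum(epsilon^2)/(2*sigma^2)
# The same value via the Normal density with sd proportional to the level
sum(dnorm(y, mean=mu, sd=mu*sigma, log=TRUE))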

Similar logic applies to the Laplace (\(\mathcal{L}\)), Generalised Normal (\(\mathcal{GN}\)), and S (\(\mathcal{S}\)) distributions. From a practical point of view, these assumptions imply that the scale (and variance) of the distribution of \(y_t\) changes together with the level of the series. In the cases of the Inverse Gaussian (\(\mathcal{IG}\)), Gamma (\(\Gamma\)), and Log-Normal (log\(\mathcal{N}\)) distributions, the assumptions are imposed on the \(1+\epsilon_t\) part of the model, and the respective likelihoods do not involve the expectation \(\mu_{y,t}\); nonetheless, the formulation still implies that the variance of the data increases together with the level of the data.
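To get some intuition about the scale changing with the level, the following sketch (our own simulation) generates two multiplicative error samples with the same \(\sigma\) but different levels; the standard deviation of the data turns out to be roughly proportional to the level:

set.seed(41)
sigma <- 0.05
# The same relative error, two different levels
y1 <- 100*(1+rnorm(10000, 0, sigma))
y2 <- 1000*(1+rnorm(10000, 0, sigma))
# Standard deviations are approximately 5 and 50
c(sd(y1), sd(y2))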

All the likelihoods for the multiplicative error models are summarised in Table 11.2.

Table 11.2: Likelihood approach for multiplicative error models. \(\mathcal{N}\) is the Normal, \(\mathcal{L}\) is the Laplace, \(\mathcal{S}\) is the S, \(\mathcal{GN}\) is the Generalised Normal, \(\mathcal{IG}\) is the Inverse Gaussian, \(\Gamma\) is the Gamma, and \(\mathrm{log}\mathcal{N}\) is the Log-Normal distribution.

| Assumption | log-likelihood | MLE of scale |
|:---|:---|:---|
| \(\epsilon_t \sim \mathcal{N}(0, \sigma^2)\) | \(-\frac{T}{2} \log(2 \pi \sigma^2) -\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2} -\sum_{t=1}^T \log \vert\mu_{y,t}\vert\) | \(\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T e_t^2\) |
| \(\epsilon_t \sim \mathcal{L}(0, s)\) | \(-T \log(2 s) -\sum_{t=1}^T \frac{\vert\epsilon_t\vert}{s} -\sum_{t=1}^T \log \mu_{y,t}\) | \(\hat{s} = \frac{1}{T} \sum_{t=1}^T \vert e_t\vert\) |
| \(\epsilon_t \sim \mathcal{S}(0, s)\) | \(-T \log(4 s^2) -\sum_{t=1}^T \frac{\sqrt{\vert\epsilon_t\vert}}{s} -\sum_{t=1}^T \log \mu_{y,t}\) | \(\hat{s} = \frac{1}{2T} \sum_{t=1}^T \sqrt{\vert e_t\vert}\) |
| \(\epsilon_t \sim \mathcal{GN}(0, s, \beta)\) | \(T\log\beta -T \log\left(2 s \Gamma\left(\beta^{-1}\right)\right) -\sum_{t=1}^T \frac{\vert\epsilon_t\vert^\beta}{s^\beta} -\sum_{t=1}^T \log \mu_{y,t}\) | \(\hat{s} = \sqrt[\beta]{\frac{\beta}{T} \sum_{t=1}^T \vert e_t \vert^{\beta}}\) |
| \(1+\epsilon_t \sim \mathcal{IG}(1, \sigma^2)\) | \(-\frac{T}{2} \log \left(2 \pi \sigma^2 \right) -\frac{1}{2}\sum_{t=1}^T \log\left(1+\epsilon_{t}\right) -\sum_{t=1}^T \log y_t -\frac{1}{2\sigma^2} \sum_{t=1}^{T} \frac{\epsilon_t^2}{1+\epsilon_t}\) | \(\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^{T} \frac{e_t^2}{1+e_t}\) |
| \(1+\epsilon_t \sim \Gamma(\sigma^{-2}, \sigma^2)\) | \(-T \log \Gamma \left(\sigma^{-2}\right) -\frac{T}{\sigma^2} \log \sigma^2 +\frac{1}{\sigma^2} \sum_{t=1}^T \log \left(\frac{1+\epsilon_t}{\exp(1+\epsilon_t)}\right) -\sum_{t=1}^T \log y_t\) | \(\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T e_t^2\) \* |
| \(1+\epsilon_t \sim \mathrm{log}\mathcal{N}\left(-\frac{\sigma^2}{2}, \sigma^2\right)\) | \(-\frac{T}{2} \log \left(2 \pi \sigma^2\right) -\sum_{t=1}^T \log y_t -\frac{1}{2\sigma^2} \sum_{t=1}^{T} \left(\log \left(1+\epsilon_t\right)+\frac{\sigma^2}{2}\right)^2\) | \(\hat{\sigma}^2 = 2\left(1-\sqrt{1-\frac{1}{T} \sum_{t=1}^{T} \log^2\left(1+e_t\right)}\right)\) |

\* Method of moments estimate (see the Remark after Table 11.1).
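To see how the rows of Table 11.2 connect with the density functions used by adam() (listed below), here is a small numerical sketch for the Log-Normal case with arbitrary made-up values; the log-likelihood from the table coincides with the sum of log-densities of dlnorm():

# Arbitrary values for the check
mu <- c(100, 110, 120)
sigma2 <- 0.02
epsilon <- c(0.05, -0.04, 0.02)
y <- mu*(1+epsilon)
# Log-Normal log-likelihood from Table 11.2
-length(y)/2*log(2*pi*sigma2) - sum(log(y)) -
    sum((log(1+epsilon)+sigma2/2)^2)/(2*sigma2)
# The same value via the Log-Normal density of y_t
sum(dlnorm(y, meanlog=log(mu)-sigma2/2, sdlog=sqrt(sigma2), log=TRUE))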

When it comes to practicalities, the optimiser in the adam() function calculates the scale from Tables 11.1 and 11.2 on each iteration and then uses it in the log-likelihood based on the respective density function (the sketches after the two lists below illustrate this two-step calculation). For additive error models, the following functions are used:

  1. Normal, \(\mathcal{N}\): dnorm(x=actuals, mean=fitted, sd=scale, log=TRUE) from the stats package;
  2. Laplace, \(\mathcal{L}\): dlaplace(q=actuals, mu=fitted, scale=scale, log=TRUE) from the greybox package;
  3. S, \(\mathcal{S}\): ds(q=actuals, mu=fitted, scale=scale, log=TRUE) from the greybox package;
  4. Generalised Normal, \(\mathcal{GN}\): dgnorm(x=actuals, mu=fitted, alpha=scale, beta=beta, log=TRUE), implemented in the greybox package based on the gnorm package (the version on CRAN is outdated);
  5. Inverse Gaussian, \(\mathcal{IG}\): dinvgauss(x=actuals, mean=fitted, dispersion=scale/fitted, log=TRUE) from the statmod package;
  6. Log-Normal, log\(\mathcal{N}\): dlnorm(x=actuals, meanlog=log(fitted)-scale^2/2, sdlog=scale, log=TRUE) from the stats package;
  7. Gamma, \(\Gamma\): dgamma(x=actuals, shape=1/scale, scale=scale*fitted, log=TRUE) from the stats package.
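For example, the two-step calculation for the additive Laplace case can be reproduced manually as follows (a sketch with made-up actuals and fitted values, mimicking what the optimiser does on each iteration rather than reproducing the actual adam() code):

library(greybox)
actuals <- c(102, 97, 105, 99, 103)
fitted <- c(100, 99, 102, 101, 102)
e <- actuals - fitted
# Step 1: MLE of scale from Table 11.1
scale <- mean(abs(e))
# Step 2: log-likelihood via the density function
sum(dlaplace(q=actuals, mu=fitted, scale=scale, log=TRUE))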

And for multiplicative error models:

  1. Normal, \(\mathcal{N}\): dnorm(x=actuals, mean=fitted, sd=scale*fitted, log=TRUE);
  2. Laplace, \(\mathcal{L}\): dlaplace(q=actuals, mu=fitted, scale=scale*fitted, log=TRUE);
  3. S, \(\mathcal{S}\): ds(q=actuals, mu=fitted, scale=scale*sqrt(fitted), log=TRUE);
  4. Generalised Normal, \(\mathcal{GN}\): dgnorm(x=actuals, mu=fitted, alpha=scale*fitted^beta, beta=beta, log=TRUE);
  5. Inverse Gaussian, \(\mathcal{IG}\): dinvgauss(x=actuals, mean=fitted, dispersion=scale/fitted, log=TRUE);
  6. Log-Normal, log\(\mathcal{N}\): dlnorm(x=actuals, meanlog=log(fitted)-scale^2/2, sdlog=scale, log=TRUE);
  7. Gamma, \(\Gamma\): dgamma(x=actuals, shape=1/scale, scale=scale*fitted, log=TRUE).
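The same logic applies in the multiplicative case; for instance, for the Inverse Gaussian model (again, a sketch with made-up numbers):

library(statmod)
actuals <- c(102, 97, 105, 99, 103)
fitted <- c(100, 99, 102, 101, 102)
# Multiplicative error: 1+e equals actuals/fitted
e <- actuals/fitted - 1
# Step 1: MLE of scale from Table 11.2
scale <- mean(e^2/(1+e))
# Step 2: log-likelihood via the density function
sum(dinvgauss(x=actuals, mean=fitted, dispersion=scale/fitted, log=TRUE))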

Remark. In the case of \(\mathcal{GN}\), an additional parameter (namely \(\beta\)) is needed. If the user does not provide it, it will be estimated together with the other parameters via the maximisation of the respective likelihood. Note, however, that the estimate of \(\beta\) can be inefficient if its true value is lower than 1.

The MLE of ADAM makes different models comparable with each other irrespective of the types of components and distributional assumptions. As a result, model selection based on information criteria can be done using the auto.adam() function from the smooth package, which will select the most appropriate distribution for ADAM.

11.1.1 An example in R

The adam() function in the smooth package has the parameter distribution, which allows selecting between the options discussed in this chapter, based on the respective density functions. Here is a brief example in R, where ADAM ETS(M,A,M) is applied to the AirPassengers data with several distributions:

adamETSMAMAir <- vector("list",5)
adamETSMAMAir[[1]] <- adam(AirPassengers, "MAM", h=12, holdout=TRUE,
                           distribution="dnorm")
adamETSMAMAir[[2]] <- adam(AirPassengers, "MAM", h=12, holdout=TRUE,
                           distribution="dlaplace")
adamETSMAMAir[[3]] <- adam(AirPassengers, "MAM", h=12, holdout=TRUE,
                           distribution="dgnorm")
adamETSMAMAir[[4]] <- adam(AirPassengers, "MAM", h=12, holdout=TRUE,
                           distribution="dinvgauss")
adamETSMAMAir[[5]] <- adam(AirPassengers, "MAM", h=12, holdout=TRUE,
                           distribution="dgamma")

In this case, the function estimates the same ETS(M,A,M) model under each of the five distributions. We can then compare the resulting models using information criteria:

sapply(adamETSMAMAir, AICc) |>
    setNames(c("dnorm","dlaplace","dgnorm","dinvgauss","dgamma"))
##     dnorm  dlaplace    dgnorm dinvgauss    dgamma 
##  991.6774  985.5971  987.4365  991.7300  991.5886

We could compare the performance of these models in detail, but for demonstration purposes it suffices to say that, based on the AICc values above, the model with the Laplace distribution should be selected among the five considered. This process of fitting and selection can be automated using the auto.adam() function, which accepts a vector of distributions to test and by default considers distribution=c("default", "dnorm", "dlaplace", "ds", "dgnorm", "dlnorm", "dinvgauss", "dgamma") (where "default" corresponds to the Normal distribution for additive error models and to the Gamma distribution for multiplicative error ones):

auto.adam(AirPassengers, "MAM", h=12, holdout=TRUE)

This command should return the ADAM ETS(M,A,M) model with the most appropriate distribution, selected based on the AICc.

References

• Svetunkov, I., 2022. Statistics for Business Analytics. https://openforecast.org/sba/ (version: 31.10.2022).