12.1 Maximum Likelihood Estimation
The maximum likelihood estimation (MLE) of an ADAM model relies on the distributional assumptions of each specific model and might differ from one model to another. There are several options for the `distribution` parameter supported by the `adam()` function in the `smooth` package, and here we briefly discuss how the estimation is done for each one of them. We start with the additive error models, for which the assumptions, log-likelihoods, and MLE of scales are provided in Table 12.1. The likelihoods are derived based on the probability density functions discussed in Chapter 2, by taking the logarithm of the product of pdfs for all in-sample observations. For example, this is what we get for the Normal distribution (we already had a simple example of this earlier in the book):
\[\begin{equation}
\begin{aligned}
\mathcal{L}(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & \prod_{t=1}^T \left(\frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{\left(y_t - \mu_t \right)^2}{2 \sigma^2} \right)\right) \\
\mathcal{L}(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & \frac{1}{\left(\sqrt{2 \pi \sigma^2}\right)^T} \exp \left( \sum_{t=1}^T -\frac{\epsilon_t^2}{2 \sigma^2} \right) \\
\ell(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & \log \mathcal{L}(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) \\
\ell(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & -\frac{T}{2} \log(2 \pi \sigma^2) -\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2}
\end{aligned},
\tag{12.3}
\end{equation}\]
where \(\mathbf{y}\) is the vector of all in-sample actual values. As for the scale, it is obtained by taking the derivative of the log-likelihood (12.3) with respect to \(\sigma^2\), setting it equal to zero, and solving the resulting equation. We do not discuss the concentrated log-likelihoods (obtained by inserting the estimated scale in the respective log-likelihood function), because they are not useful in understanding how the model is estimated; however, knowing how to calculate the scale helps, because it simplifies the model estimation process.
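For instance, taking the derivative of the log-likelihood (12.3) with respect to \(\sigma^2\) and setting it equal to zero gives:
\[\begin{aligned}
\frac{\partial \ell(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y})}{\partial \sigma^2} = -\frac{T}{2 \sigma^2} + \frac{1}{2 \sigma^4} \sum_{t=1}^T \epsilon_t^2 = 0 ,
\end{aligned}\]
which, after substituting the in-sample residuals \(e_t\) for the unobservable \(\epsilon_t\), leads to the estimate shown in the first row of Table 12.1:
\[\begin{aligned}
\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T e_t^2 .
\end{aligned}\]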
Table 12.1: Assumptions, log-likelihoods, and MLE of scales for the additive error models.

| Distribution | Assumption | Log-likelihood | MLE of scale |
|---|---|---|---|
| Normal | \(\epsilon_t \sim \mathcal{N}(0, \sigma^2)\) | \(-\frac{T}{2} \log(2 \pi \sigma^2) -\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2}\) | \(\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T e_t^2\) |
| Laplace | \(\epsilon_t \sim \mathcal{Laplace}(0, s)\) | \(-T \log(2 s) -\sum_{t=1}^T \frac{\vert\epsilon_t\vert}{s}\) | \(\hat{s} = \frac{1}{T} \sum_{t=1}^T \vert e_t \vert\) |
| S | \(\epsilon_t \sim \mathcal{S}(0, s)\) | \(-T \log(4 s^2) -\sum_{t=1}^T \frac{\sqrt{\vert\epsilon_t\vert}}{s}\) | \(\hat{s} = \frac{1}{2T} \sum_{t=1}^T \sqrt{\vert e_t \vert}\) |
| Generalised Normal | \(\epsilon_t \sim \mathcal{GN}(0, s, \beta)\) | \(\begin{split} &T\log\beta -T \log\left(2 s \Gamma\left(\beta^{-1}\right)\right) -\\ &\sum_{t=1}^T \frac{\left\vert\epsilon_t\right\vert^\beta}{s^\beta}\end{split}\) | \(\hat{s} = \sqrt[\beta]{\frac{\beta}{T} \sum_{t=1}^T\left\vert e_t \right\vert^{\beta}}\) |
| Asymmetric Laplace | \(\epsilon_t \sim \mathcal{ALaplace}(0, s, \alpha)\) | \(\begin{split} &T\log\left(\alpha(1-\alpha)\right) -T \log(s) -\\ &\sum_{t=1}^T \frac{\epsilon_t (\alpha - I(\epsilon_t \leq 0))}{s} \end{split}\) | \(\hat{s} = \frac{1}{T} \sum_{t=1}^T e_t(\alpha -I(e_t \leq 0))\) |
| Inverse Gaussian | \(1+\frac{\epsilon_t}{\mu_{y,t}} \sim \mathcal{IG}(1, s)\) | \(\begin{split} &-\frac{T}{2} \log \left(2 \pi s\right) -\frac{1}{2}\sum_{t=1}^T \log\left(1+\frac{\epsilon_t}{\mu_{y,t}}\right) -\\ &\sum_{t=1}^T \log y_t -\frac{1}{2s} \sum_{t=1}^{T} \frac{\epsilon_t^2}{\mu_{y,t}y_t}\end{split}\) | \(\hat{s} = \frac{1}{T} \sum_{t=1}^{T} \frac{e_t^2}{\hat{\mu}_{y,t} y_t}\) |
| Log Normal | \(1+\frac{\epsilon_t}{\mu_{y,t}} \sim \mathrm{log}\mathcal{N}\left(-\frac{\sigma^2}{2}, \sigma^2\right)\) | \(\begin{split} &-\frac{T}{2} \log \left(2 \pi \sigma^2\right) -\\ &\frac{1}{2\sigma^2} \sum_{t=1}^{T} \left(\log \left(\frac{y_t}{\mu_{y,t}}\right)+\frac{\sigma^2}{2}\right)^2 -\\ &\sum_{t=1}^T \log y_t \end{split}\) | \(\hat{\sigma}^2 = 2\left(1-\sqrt{ 1-\frac{1}{T} \sum_{t=1}^{T} \log^2\left(\frac{y_t}{\hat{\mu}_{y,t}}\right)}\right)\) |
In the case of the multiplicative error models, the same distributional assumptions are imposed on the error term \(\epsilon_t\), but the model is formulated for \(y_t = \mu_{y,t}(1+\epsilon_t)\), so moving from the density of the error term to the density of the data introduces an additional component. As a result, the log-likelihoods would have the \(\mu_{y,t}\) part in the formulae. Similar logic is applicable to the \(\mathcal{Laplace}\), \(\mathcal{S}\), \(\mathcal{GN}\), and \(\mathcal{ALaplace}\) distributions. From the practical point of view, these assumptions imply that the scale (and the variance) of the distribution of \(y_t\) changes together with the level of the series. When it comes to \(\mathcal{IG}\) and log\(\mathcal{N}\), the assumptions are imposed on the \(1+\epsilon_t\) part of the model, so the respective likelihoods do not involve the expectation \(\mu_{y,t}\), but the formulation still implies that the variance of the data increases together with the increase of the level of data.
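To see where the additional component comes from, take the Normal distribution as an example: given that \(y_t = \mu_{y,t}(1+\epsilon_t)\) in a multiplicative error model, the change of variables from \(\epsilon_t\) to \(y_t\) introduces the Jacobian \(\left| \frac{\partial \epsilon_t}{\partial y_t} \right| = \frac{1}{|\mu_{y,t}|}\), so that:
\[\begin{aligned}
f(y_t) = & \frac{1}{|\mu_{y,t}|} f_{\epsilon}\left(\frac{y_t - \mu_{y,t}}{\mu_{y,t}}\right) , \\
\ell(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & -\frac{T}{2} \log(2 \pi \sigma^2) -\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2} -\sum_{t=1}^T \log |\mu_{y,t}| ,
\end{aligned}\]
which corresponds to the first row of Table 12.2.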
All the likelihoods for the multiplicative error models are summarised in Table 12.2.
Table 12.2: Assumptions, log-likelihoods, and MLE of scales for the multiplicative error models.

| Distribution | Assumption | Log-likelihood | MLE of scale |
|---|---|---|---|
| Normal | \(\epsilon_t \sim \mathcal{N}(0, \sigma^2)\) | \(\begin{split} &-\frac{T}{2} \log(2 \pi \sigma^2) -\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2} -\\ &\sum_{t=1}^T \log \vert\mu_{y,t}\vert\end{split}\) | \(\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T e_t^2\) |
| Laplace | \(\epsilon_t \sim \mathcal{Laplace}(0, s)\) | \(\begin{split} &-T \log(2 s) -\sum_{t=1}^T \frac{\vert\epsilon_t\vert}{s} -\\ &\sum_{t=1}^T \log \mu_{y,t}\end{split}\) | \(\hat{s} = \frac{1}{T} \sum_{t=1}^T \vert e_t \vert\) |
| S | \(\epsilon_t \sim \mathcal{S}(0, s)\) | \(\begin{split} &-T \log(4 s^2) -\sum_{t=1}^T \frac{\sqrt{\vert\epsilon_t\vert}}{s} -\\ &\sum_{t=1}^T \log \mu_{y,t}\end{split}\) | \(\hat{s} = \frac{1}{2T} \sum_{t=1}^T \sqrt{\vert e_t \vert}\) |
| Generalised Normal | \(\epsilon_t \sim \mathcal{GN}(0, s, \beta)\) | \(\begin{split} &T\log\beta -T \log\left(2 s \Gamma\left(\beta^{-1}\right)\right) -\\ &\sum_{t=1}^T \frac{\left\vert\epsilon_t\right\vert^\beta}{s^\beta} -\sum_{t=1}^T \log \mu_{y,t}\end{split}\) | \(\hat{s} = \sqrt[\beta]{\frac{\beta}{T} \sum_{t=1}^T\left\vert e_t \right\vert^{\beta}}\) |
| Asymmetric Laplace | \(\epsilon_t \sim \mathcal{ALaplace}(0, s, \alpha)\) | \(\begin{split} &T\log\left(\alpha(1-\alpha)\right) -T \log(s) -\\ &\sum_{t=1}^T \frac{\epsilon_t (\alpha - I(\epsilon_t \leq 0))}{s} -\sum_{t=1}^T \log \mu_{y,t}\end{split}\) | \(\hat{s} = \frac{1}{T} \sum_{t=1}^T e_t(\alpha -I(e_t \leq 0))\) |
| Inverse Gaussian | \(1+\epsilon_t \sim \mathcal{IG}(1, s)\) | \(\begin{split} &-\frac{T}{2} \log \left(2 \pi s\right) -\frac{1}{2}\sum_{t=1}^T \log \left(1+\epsilon_{t}\right) -\\ &\sum_{t=1}^T \log y_t -\frac{1}{2s} \sum_{t=1}^{T} \frac{\epsilon_t^2}{1+\epsilon_t}\end{split}\) | \(\hat{s} = \frac{1}{T} \sum_{t=1}^{T} \frac{e_t^2}{1+e_t}\) |
| Log Normal | \(1+\epsilon_t \sim \mathrm{log}\mathcal{N}\left(-\frac{\sigma^2}{2}, \sigma^2\right)\) | \(\begin{split} &-\frac{T}{2} \log \left(2 \pi \sigma^2\right) -\\ &\frac{1}{2\sigma^2} \sum_{t=1}^{T} \left(\log \left(1+\epsilon_t\right)+\frac{\sigma^2}{2}\right)^2 -\\ &\sum_{t=1}^T \log y_t \end{split}\) | \(\hat{\sigma}^2 = 2\left(1-\sqrt{ 1-\frac{1}{T} \sum_{t=1}^{T} \log^2\left(1+e_t\right)}\right)\) |
When it comes to practicalities, the `adam()` function works as follows: in the optimiser, on each iteration, after fitting the model, the scales from Tables 12.1 and 12.2 are calculated and then used in the log-likelihoods based on the respective distribution functions. For the additive error models these are:

- \(\mathcal{N}\) - `dnorm(x=actuals, mean=fitted, sd=scale, log=TRUE)` from the `stats` package;
- \(\mathcal{Laplace}\) - `dlaplace(q=actuals, mu=fitted, scale=scale, log=TRUE)` from the `greybox` package;
- \(\mathcal{S}\) - `ds(q=actuals, mu=fitted, scale=scale, log=TRUE)` from the `greybox` package;
- \(\mathcal{GN}\) - `dgnorm(x=actuals, mu=fitted, alpha=scale, beta=beta, log=TRUE)`, implemented internally based on the `gnorm` package (the version on CRAN is outdated);
- \(\mathcal{ALaplace}\) - `dalaplace(q=actuals, mu=fitted, scale=scale, alpha=alpha, log=TRUE)` from the `greybox` package;
- \(\mathcal{IG}\) - `dinvgauss(x=actuals, mean=fitted, dispersion=scale/fitted, log=TRUE)` from the `statmod` package;
- log\(\mathcal{N}\) - `dlnorm(x=actuals, meanlog=fitted-scale^2/2, sdlog=scale, log=TRUE)` from the `stats` package;
and for the multiplicative error models:

- \(\mathcal{N}\) - `dnorm(x=actuals, mean=fitted, sd=scale*fitted, log=TRUE)`;
- \(\mathcal{Laplace}\) - `dlaplace(q=actuals, mu=fitted, scale=scale*fitted, log=TRUE)`;
- \(\mathcal{S}\) - `ds(q=actuals, mu=fitted, scale=scale*sqrt(fitted), log=TRUE)`;
- \(\mathcal{GN}\) - `dgnorm(x=actuals, mu=fitted, alpha=scale*fitted^beta, beta=beta, log=TRUE)`;
- \(\mathcal{ALaplace}\) - `dalaplace(q=actuals, mu=fitted, scale=scale*fitted, alpha=alpha, log=TRUE)`;
- \(\mathcal{IG}\) - `dinvgauss(x=actuals, mean=fitted, dispersion=scale/fitted, log=TRUE)`;
- log\(\mathcal{N}\) - `dlnorm(x=actuals, meanlog=fitted-scale^2/2, sdlog=scale, log=TRUE)`.
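To illustrate the mechanics described above, here is a minimal sketch in R for the additive error Normal case, reconstructing the log-likelihood of a fitted model from its residuals (the specific model and the variable names here are chosen purely for illustration):

```r
library(smooth)

# Fit an ADAM ETS(A,A,A) assuming the Normal distribution
fit <- adam(AirPassengers, "AAA", distribution="dnorm")

# MLE of the scale from Table 12.1: sigma^2 = mean of squared residuals
scale <- sqrt(mean(residuals(fit)^2))

# Sum of log-densities, as used by the optimiser on each iteration
sum(dnorm(x=actuals(fit), mean=fitted(fit), sd=scale, log=TRUE))

# The value above should closely match the one reported by the model
logLik(fit)
```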
Note that in the cases of \(\mathcal{GN}\) and \(\mathcal{ALaplace}\), additional parameters (namely \(\beta\) and \(\alpha\)) are needed. If the user does not provide them, then they are estimated together with the other parameters via the maximisation of the respective likelihoods.
The MLE of the parameters of ADAM ETS models makes the models comparable with each other, irrespective of the types of components and the distributional assumptions. As a result, model selection based on information criteria can be done using the `auto.adam()` function from the `smooth` package, which will select the most appropriate ETS model with the most suitable distribution.
12.1.1 An example in R
The `adam()` function in the `smooth` package has the parameter `distribution`, which allows selecting between the options discussed in this chapter, based on the respective density functions (see the list above with `dnorm()`, etc.). Here is a brief example in R, with ADAM ETS(M,A,M) applied to the `AirPassengers` data with several distributions:
```r
adamModel <- vector("list", 4)
adamModel[[1]] <- adam(AirPassengers, "MAM", distribution="dnorm", h=12, holdout=TRUE)
adamModel[[2]] <- adam(AirPassengers, "MAM", distribution="dlaplace", h=12, holdout=TRUE)
adamModel[[3]] <- adam(AirPassengers, "MAM", distribution="dgnorm", h=12, holdout=TRUE)
adamModel[[4]] <- adam(AirPassengers, "MAM", distribution="dinvgauss", h=12, holdout=TRUE)
```
In this case, the function estimates the ETS(M,A,M) model with each of the four distributions. We can then compare the estimated models using information criteria:
```r
setNames(sapply(adamModel, AICc), c("dnorm","dlaplace","dgnorm","dinvgauss"))
```

```
##     dnorm  dlaplace    dgnorm dinvgauss
##  972.7352  975.0105  977.0554  974.6453
```
We could compare the performance of the models in detail but, for the purposes of this demonstration, it should suffice to say that, among the four models considered above, the one with the Normal distribution should be preferred based on the AICc value. This process of fitting and selection can be automated using the `auto.adam()` function, which accepts a vector of distributions to test and by default considers `distribution=c("dnorm", "dlaplace", "ds", "dgnorm", "dlnorm", "dinvgauss")`:
```r
adamModel <- auto.adam(AirPassengers, "MAM", h=12, holdout=TRUE)
```
This command should return the ADAM ETS(M,A,M) model with the most appropriate distribution, selected based on the AICc.
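The selected distribution is reported when the estimated model is printed. After that, the model can be used in the usual way, for example, for producing forecasts. Here is a brief illustration, assuming that the `forecast()` method for `adam` objects supports `interval="prediction"` (as in recent versions of `smooth`):

```r
# Print the model to see the selected distribution, then
# produce and plot forecasts with prediction intervals
adamModel
plot(forecast(adamModel, h=12, interval="prediction"))
```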