11.1 Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) is one of the popular approaches for estimating parameters of a statistical model. It relies on assumptions about the distribution of the response variable, uses either a probability density or a cumulative distribution function (depending on the assumed model), and aims to find the parameters that maximise the likelihood of the observations being described by the model. An interested reader can find a detailed explanation of the likelihood approach in Chapter 16 of Svetunkov (2022).
Remark. Contrary to common belief, the MLE itself does not assume that the true model follows a specific distribution; it merely fits the predefined distribution to the available data.
MLE of the ADAM relies on the distributional assumptions of each specific model and might differ from one model to another (see the assumptions in Sections 5.5, 6.5, and 9.3). There are several options for the `distribution` parameter supported by the `adam()` function in the `smooth` package; here we briefly discuss how the estimation is done for each one of them. We start with the additive error model, for which the assumptions, log-likelihoods, and MLEs of scale are provided in Table 11.1.
Table 11.1: The likelihood approach for additive error models.

Assumption | log-likelihood | MLE of scale
---|---|---
\(\epsilon_t \sim \mathcal{N}(0, \sigma^2)\) | \(-\frac{T}{2} \log(2 \pi \sigma^2) -\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2}\) | \(\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T e_t^2\)
\(\epsilon_t \sim \mathcal{L}(0, s)\) | \(-T \log(2 s) -\sum_{t=1}^T \frac{|\epsilon_t|}{s}\) | \(\hat{s} = \frac{1}{T} \sum_{t=1}^T |e_t|\)
\(\epsilon_t \sim \mathcal{S}(0, s)\) | \(-T \log(4 s^2) -\sum_{t=1}^T \frac{\sqrt{|\epsilon_t|}}{s}\) | \(\hat{s} = \frac{1}{2T} \sum_{t=1}^T \sqrt{|e_t|}\)
\(\epsilon_t \sim \mathcal{GN}(0, s, \beta)\) | \(T\log\beta -T \log\left(2 s \Gamma\left(\beta^{-1}\right)\right) -\sum_{t=1}^T \frac{\left|\epsilon_t\right|^\beta}{s^\beta}\) | \(\hat{s} = \sqrt[\beta]{\frac{\beta}{T} \sum_{t=1}^T \left|e_t\right|^{\beta}}\)
\(1+\frac{\epsilon_t}{\mu_{y,t}} \sim \mathcal{IG}(1, \sigma^2)\) | \(-\frac{T}{2} \log\left(2 \pi \sigma^2\right) -\frac{3}{2} \sum_{t=1}^T \log\frac{y_t}{\mu_{y,t}} -\sum_{t=1}^T \log \mu_{y,t} -\frac{1}{2\sigma^2} \sum_{t=1}^{T} \frac{\epsilon_t^2}{\mu_{y,t} y_t}\) | \(\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^{T} \frac{e_t^2}{\hat{\mu}_{y,t} y_t}\)
\(1+\frac{\epsilon_t}{\mu_{y,t}} \sim \Gamma(\sigma^{-2}, \sigma^2)\) | \(-T \log \Gamma\left(\sigma^{-2}\right) -\frac{T}{\sigma^2} \log \sigma^2 +\frac{1}{\sigma^2} \sum_{t=1}^T \log\left(\frac{y_t/\mu_{y,t}}{\exp(y_t/\mu_{y,t})}\right) -\sum_{t=1}^T \log y_t\) | \(\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T \left(\frac{e_t}{\hat{\mu}_{y,t}}\right)^2\) *
\(1+\frac{\epsilon_t}{\mu_{y,t}} \sim \mathrm{log}\mathcal{N}\left(-\frac{\sigma^2}{2}, \sigma^2\right)\) | \(-\frac{T}{2} \log\left(2 \pi \sigma^2\right) -\sum_{t=1}^T \log y_t -\frac{1}{2\sigma^2} \sum_{t=1}^{T} \left(\log\frac{y_t}{\mu_{y,t}} +\frac{\sigma^2}{2}\right)^2\) | \(\hat{\sigma}^2 = 2\left(1-\sqrt{1-\frac{1}{T} \sum_{t=1}^{T} \log^2\left(\frac{y_t}{\hat{\mu}_{y,t}}\right)}\right)\)
The likelihoods are derived from the probability density functions, by taking the logarithm of their product over all in-sample observations. For example, this is what we get for the Normal distribution: \[\begin{equation} \begin{aligned} \mathcal{L}(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & \prod_{t=1}^T \left(\frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{\left(y_t -\mu_{y,t} \right)^2}{2 \sigma^2} \right)\right) \\ \mathcal{L}(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & \frac{1}{\left(\sqrt{2 \pi \sigma^2}\right)^T} \exp \left( -\sum_{t=1}^T \frac{\epsilon_t^2}{2 \sigma^2} \right) \\ \ell(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & \log \mathcal{L}(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) \\ \ell(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = & -\frac{T}{2} \log(2 \pi \sigma^2) -\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2} \end{aligned}, \tag{11.3} \end{equation}\] where \(\mathbf{y}\) is the vector of all in-sample actual values, \(\mathcal{L}(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y})\) is the likelihood value, and \(\ell(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y})\) is the log-likelihood. As for the scale, it is obtained by taking the derivative of the log-likelihood (11.3) with respect to \(\sigma^2\), setting it equal to zero, and solving the resulting equation. We do not discuss the concentrated log-likelihoods (obtained by inserting the estimated scale into the respective log-likelihood function), because they do not help in understanding how the model is estimated; knowing how to calculate the scale, however, simplifies the estimation process.
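To make this concrete, here is a small sketch in R (not taken from the `smooth` package) that computes the log-likelihood (11.3) for a vector of residuals, plugging in the scale estimate from Table 11.1, and verifies it against the `dnorm()` density function:

```r
# A minimal sketch: log-likelihood (11.3) with the MLE of scale plugged in.
# "errors" is a stand-in for the in-sample residuals e_t of a fitted model.
set.seed(41)
errors <- rnorm(100, mean=0, sd=2)
obsInSample <- length(errors)
scaleMLE <- mean(errors^2)                 # sigma^2 estimate from Table 11.1
logLikManual <- -obsInSample/2*log(2*pi*scaleMLE) -
  sum(errors^2)/(2*scaleMLE)
# The same value via the density function used for distribution="dnorm"
logLikCheck <- sum(dnorm(errors, mean=0, sd=sqrt(scaleMLE), log=TRUE))
all.equal(logLikManual, logLikCheck)       # TRUE
```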
Remark. The MLE of the scale parameter for the Gamma distribution (as formulated in ADAM) does not have a closed form. While there is no proof of this, the maximum of the Gamma likelihood appears to be achieved when \(\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T \left(\frac{e_t}{\hat{\mu}_{y,t}}\right)^2\), which corresponds to the estimate based on the method of moments. This is why this estimate is marked with an asterisk in Tables 11.1 and 11.2.
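This claim is easy to probe numerically. The following sketch (an illustration, not a proof) generates data from the Gamma distribution in the ADAM parametrisation and compares the numerically maximised scale with the moment-based estimate:

```r
# Sketch: compare the numerically optimised scale of the Gamma likelihood
# with the method of moments estimate. Here 1+e_t ~ Gamma(1/s2, s2), s2=0.04.
set.seed(42)
errors <- rgamma(10000, shape=25, scale=0.04) - 1
gammaLogLik <- function(s2){
  sum(dgamma(1+errors, shape=1/s2, scale=s2, log=TRUE))
}
optimise(gammaLogLik, interval=c(0.001, 1), maximum=TRUE)$maximum
mean(errors^2)    # method of moments estimate; should be close to the above
```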
While other distributions can be used in ADAM ETS (for example, the Logistic distribution or Student's t), we do not discuss them here. In all cases in Table 11.1, the assumptions imply that the actual value follows the same distribution as the error term (due to the additivity of the model), but with a different location and/or scale. For example, for the Normal distribution, we have: \[\begin{equation} \begin{aligned} & \epsilon_t \sim \mathcal{N}(0, \sigma^2) \\ & \mathrm{or} \\ & y_t = \mu_{y,t}+\epsilon_t \sim \mathcal{N}(\mu_{y,t}, \sigma^2) \end{aligned}. \tag{11.4} \end{equation}\]
When it comes to the multiplicative error models, the likelihoods become slightly different. For example, when we assume that \(\epsilon_t \sim \mathcal{N}(0, \sigma^2)\) in the multiplicative error model, this implies that: \[\begin{equation} y_t = \mu_{y,t}(1+\epsilon_t) \sim \mathcal{N}(\mu_{y,t}, \mu_{y,t}^2 \sigma^2) . \tag{11.5} \end{equation}\] As a result, the log-likelihoods would have the \(\mu_{y,t}\) part in the formulae: \[\begin{equation} \ell(\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = -\frac{1}{2} \sum_{t=1}^T \log(2 \pi \mu_{y,t}^2 \sigma^2) -\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2} . \tag{11.6} \end{equation}\]
Remark. The part \(\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2}\) does not contain \(\mu_{y,t}\), because, based on (11.5), the numerator in the exponent of the Normal density is \(y_t - \mu_{y,t} = \mu_{y,t} \epsilon_t\). Inserting this into the density, that part becomes \(\frac{1}{2} \sum_{t=1}^T \frac{(\mu_{y,t} \epsilon_t)^2}{\mu_{y,t}^2\sigma^2}\), which after cancellations leads to (11.6).
Similar logic applies to the Laplace (\(\mathcal{L}\)), Generalised Normal (\(\mathcal{GN}\)), and S (\(\mathcal{S}\)) distributions. From the practical point of view, these assumptions imply that the scale (and variance) of the distribution of \(y_t\) changes together with the level of the series. When it comes to the Inverse Gaussian (\(\mathcal{IG}\)), Gamma (\(\Gamma\)), and Log-Normal (log\(\mathcal{N}\)), the assumptions are imposed on the \(1+\epsilon_t\) part of the model and the respective likelihoods do not explicitly involve the expectation \(\mu_{y,t}\), but the formulation still implies that the variance of the data increases with the level of the data.
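To see this scale effect numerically, consider the following small simulation sketch (illustrative only), in which the standard deviation of \(y_t\) under a multiplicative error model is proportional to the level:

```r
# Sketch: with y = mu*(1+e), e ~ N(0, 0.05^2), the standard deviation of y
# is approximately 5% of the level mu
mu <- c(100, 1000)
sapply(mu, function(m) sd(m*(1 + rnorm(1e5, 0, 0.05))))
# returns values close to 5 and 50 respectively
```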
All the likelihoods for the multiplicative error models are summarised in Table 11.2.
Table 11.2: The likelihood approach for multiplicative error models.

Assumption | log-likelihood | MLE of scale
---|---|---
\(\epsilon_t \sim \mathcal{N}(0, \sigma^2)\) | \(-\frac{T}{2} \log(2 \pi \sigma^2) -\frac{1}{2} \sum_{t=1}^T \frac{\epsilon_t^2}{\sigma^2} -\sum_{t=1}^T \log |\mu_{y,t}|\) | \(\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T e_t^2\)
\(\epsilon_t \sim \mathcal{L}(0, s)\) | \(-T \log(2 s) -\sum_{t=1}^T \frac{|\epsilon_t|}{s} -\sum_{t=1}^T \log \mu_{y,t}\) | \(\hat{s} = \frac{1}{T} \sum_{t=1}^T |e_t|\)
\(\epsilon_t \sim \mathcal{S}(0, s)\) | \(-T \log(4 s^2) -\sum_{t=1}^T \frac{\sqrt{|\epsilon_t|}}{s} -\sum_{t=1}^T \log \mu_{y,t}\) | \(\hat{s} = \frac{1}{2T} \sum_{t=1}^T \sqrt{|e_t|}\)
\(\epsilon_t \sim \mathcal{GN}(0, s, \beta)\) | \(T\log\beta -T \log\left(2 s \Gamma\left(\beta^{-1}\right)\right) -\sum_{t=1}^T \frac{\left|\epsilon_t\right|^\beta}{s^\beta} -\sum_{t=1}^T \log \mu_{y,t}\) | \(\hat{s} = \sqrt[\beta]{\frac{\beta}{T} \sum_{t=1}^T \left|e_t\right|^{\beta}}\)
\(1+\epsilon_t \sim \mathcal{IG}(1, \sigma^2)\) | \(-\frac{T}{2} \log\left(2 \pi \sigma^2\right) -\frac{3}{2} \sum_{t=1}^T \log\left(1+\epsilon_{t}\right) -\sum_{t=1}^T \log \mu_{y,t} -\frac{1}{2\sigma^2} \sum_{t=1}^{T} \frac{\epsilon_t^2}{1+\epsilon_t}\) | \(\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^{T} \frac{e_t^2}{1+e_t}\)
\(1+\epsilon_t \sim \Gamma(\sigma^{-2}, \sigma^2)\) | \(-T \log \Gamma\left(\sigma^{-2}\right) -\frac{T}{\sigma^2} \log \sigma^2 +\frac{1}{\sigma^2} \sum_{t=1}^T \log\left(\frac{1+\epsilon_t}{\exp(1+\epsilon_t)}\right) -\sum_{t=1}^T \log y_t\) | \(\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T e_t^2\) *
\(1+\epsilon_t \sim \mathrm{log}\mathcal{N}\left(-\frac{\sigma^2}{2}, \sigma^2\right)\) | \(-\frac{T}{2} \log\left(2 \pi \sigma^2\right) -\sum_{t=1}^T \log y_t -\frac{1}{2\sigma^2} \sum_{t=1}^{T} \left(\log\left(1+\epsilon_t\right) +\frac{\sigma^2}{2}\right)^2\) | \(\hat{\sigma}^2 = 2\left(1-\sqrt{1-\frac{1}{T} \sum_{t=1}^{T} \log^2\left(1+e_t\right)}\right)\)
When it comes to practicalities, the optimiser in the `adam()` function calculates the scale from Tables 11.1 and 11.2 on each iteration and then uses it in the log-likelihood based on the respective density function. For additive error models:

- Normal, \(\mathcal{N}\) – `dnorm(x=actuals, mean=fitted, sd=scale, log=TRUE)` from the `stats` package;
- Laplace, \(\mathcal{L}\) – `dlaplace(q=actuals, mu=fitted, scale=scale, log=TRUE)` from the `greybox` package;
- S, \(\mathcal{S}\) – `ds(q=actuals, mu=fitted, scale=scale, log=TRUE)` from the `greybox` package;
- Generalised Normal, \(\mathcal{GN}\) – `dgnorm(x=actuals, mu=fitted, alpha=scale, beta=beta, log=TRUE)` implemented in the `greybox` package based on the `gnorm` package (the version on CRAN is outdated);
- Inverse Gaussian, \(\mathcal{IG}\) – `dinvgauss(x=actuals, mean=fitted, dispersion=scale/fitted, log=TRUE)` from the `statmod` package;
- Log-Normal, log\(\mathcal{N}\) – `dlnorm(x=actuals, meanlog=log(fitted)-scale^2/2, sdlog=scale, log=TRUE)` from the `stats` package;
- Gamma, \(\Gamma\) – `dgamma(x=actuals, shape=1/scale, scale=scale*fitted, log=TRUE)` from the `stats` package.
And for multiplicative error models:

- Normal, \(\mathcal{N}\) – `dnorm(x=actuals, mean=fitted, sd=scale*fitted, log=TRUE)`;
- Laplace, \(\mathcal{L}\) – `dlaplace(q=actuals, mu=fitted, scale=scale*fitted, log=TRUE)`;
- S, \(\mathcal{S}\) – `ds(q=actuals, mu=fitted, scale=scale*sqrt(fitted), log=TRUE)`;
- Generalised Normal, \(\mathcal{GN}\) – `dgnorm(x=actuals, mu=fitted, alpha=scale*fitted^beta, beta=beta, log=TRUE)`;
- Inverse Gaussian, \(\mathcal{IG}\) – `dinvgauss(x=actuals, mean=fitted, dispersion=scale/fitted, log=TRUE)`;
- Log-Normal, log\(\mathcal{N}\) – `dlnorm(x=actuals, meanlog=log(fitted)-scale^2/2, sdlog=scale, log=TRUE)`;
- Gamma, \(\Gamma\) – `dgamma(x=actuals, shape=1/scale, scale=scale*fitted, log=TRUE)`.
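As an illustration, the following sketch (not the actual `smooth` implementation) reproduces the log-likelihood of a multiplicative error model with the Normal distribution by hand, combining the scale formula from Table 11.2 with the `dnorm()` call above:

```r
library(smooth)
# Fit a simple multiplicative error model assuming Normal distribution
adamETSMNN <- adam(AirPassengers, "MNN", distribution="dnorm")
# Relative errors e_t and one-step-ahead expectations mu_{y,t}
mu <- as.vector(fitted(adamETSMNN))
y <- as.vector(actuals(adamETSMNN))
errors <- y/mu - 1
# MLE of scale from Table 11.2
scale <- sqrt(mean(errors^2))
# Log-likelihood, reproducing the dnorm() call used by the optimiser;
# this should be close to logLik(adamETSMNN)
sum(dnorm(x=y, mean=mu, sd=scale*mu, log=TRUE))
```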
Remark. In the case of \(\mathcal{GN}\), an additional parameter (namely \(\beta\)) is needed. If the user does not provide it, it will be estimated together with the other parameters via the maximisation of the respective likelihood. Note, however, that the estimate of \(\beta\) can be inefficient if its true value is lower than 1.
The MLE of ADAM makes different models comparable with each other irrespective of the types of components and distributional assumptions. As a result, model selection based on information criteria can be done using the `auto.adam()` function from the `smooth` package, which will select the most appropriate distribution for ADAM.
11.1.1 An example in R
The `adam()` function in the `smooth` package has the parameter `distribution`, which allows selecting between the options discussed in this chapter, based on the respective density functions. Here is a brief example in R, with the ADAM ETS(M,A,M) model applied to the `AirPassengers` data with several distributions:
```r
adamETSMAMAir <- vector("list", 5)
adamETSMAMAir[[1]] <- adam(AirPassengers, "MAM", h=12, holdout=TRUE,
                           distribution="dnorm")
adamETSMAMAir[[2]] <- adam(AirPassengers, "MAM", h=12, holdout=TRUE,
                           distribution="dlaplace")
adamETSMAMAir[[3]] <- adam(AirPassengers, "MAM", h=12, holdout=TRUE,
                           distribution="dgnorm")
adamETSMAMAir[[4]] <- adam(AirPassengers, "MAM", h=12, holdout=TRUE,
                           distribution="dinvgauss")
adamETSMAMAir[[5]] <- adam(AirPassengers, "MAM", h=12, holdout=TRUE,
                           distribution="dgamma")
```
Having estimated the same ETS(M,A,M) model with different distributions, we can compare the resulting models using information criteria:
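One way to produce such a comparison (a sketch; `AICc()` comes from the `greybox` package) is:

```r
# Extract AICc values from the five fitted models
setNames(sapply(adamETSMAMAir, AICc),
         c("dnorm", "dlaplace", "dgnorm", "dinvgauss", "dgamma"))
```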
```
##     dnorm  dlaplace    dgnorm dinvgauss    dgamma
##  974.5899  975.8483  975.7059  974.4286  973.1857
```
We could compare the performance of the models in detail, but for demonstration purposes it suffices to say that, among the five models considered above, the one with the Gamma distribution should be selected based on the AICc value. This process of fitting and selection can be automated using the `auto.adam()` function, which accepts a vector of distributions to test and by default would consider `distribution=c("default", "dnorm", "dlaplace", "ds", "dgnorm", "dlnorm", "dinvgauss", "dgamma")` (where `default` is the Normal distribution for the additive error and Gamma for the multiplicative error model):
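A call along these lines (a sketch consistent with the setup above) would be:

```r
# Let auto.adam() try the default set of distributions for ETS(M,A,M)
adamETSMAMAirAuto <- auto.adam(AirPassengers, "MAM", h=12, holdout=TRUE)
```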
This command should return the ADAM ETS(M,A,M) model with the most appropriate distribution, selected based on the AICc.