16.3 The full ADAM model
Uniting the demand occurrence and demand sizes parts of the model, we can now discuss the full iETS model, which in its most general form can be represented as: \[\begin{equation} \begin{aligned} & y_t = o_t z_t , \\ & {z}_{t} = w_z(\mathbf{v}_{z,t-\boldsymbol{l}}) + r_z(\mathbf{v}_{z,t-\boldsymbol{l}}) \epsilon_{z,t} \\ & \mathbf{v}_{z,t} = f_z(\mathbf{v}_{z,t-\boldsymbol{l}}) + g_z(\mathbf{v}_{z,t-\boldsymbol{l}}) \epsilon_{z,t} \\ & \\ & o_t \sim \text{Bernoulli} \left(p_t \right) , \\ & p_t = f_p(\mu_{a,t}, \mu_{b,t}) \\ & a_t = w_a(\mathbf{v}_{a,t-\boldsymbol{l}}) + r_a(\mathbf{v}_{a,t-\boldsymbol{l}}) \epsilon_{a,t} \\ & \mathbf{v}_{a,t} = f_a(\mathbf{v}_{a,t-\boldsymbol{l}}) + g_a(\mathbf{v}_{a,t-\boldsymbol{l}}) \epsilon_{a,t} \\ & b_t = w_b(\mathbf{v}_{b,t-\boldsymbol{l}}) + r_b(\mathbf{v}_{b,t-\boldsymbol{l}}) \epsilon_{b,t} \\ & \mathbf{v}_{b,t} = f_b(\mathbf{v}_{b,t-\boldsymbol{l}}) + g_b(\mathbf{v}_{b,t-\boldsymbol{l}}) \epsilon_{b,t} \end{aligned} , \tag{16.26} \end{equation}\] where the elements of the demand sizes and demand occurrence parts have been defined in Sections 8 and 16.1.4 respectively. The model (16.26) can also be considered a generalisation of the conventional ADAM ETS and ARIMA models: if the probability of occurrence \(p_t\) is equal to one for all observations, then the model reduces to them. Another important thing to note is that this model relies on the following assumptions:
- The demand sizes variable \(z_t\) is continuous. While this might sound artificial, Svetunkov and Boylan (2019) showed that such a model does not perform worse than count data models;
- Potential demand size may change over time even when \(o_t=0\). This means that the states evolve over time even when demand is not observed;
- Demand sizes and demand occurrence are independent. This simplifies many of the further derivations and makes the model estimable. If this assumption is violated, then a different model with different properties would need to be constructed. My gut feeling tells me that even if it is violated, the model (16.26) would still work well.
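To make the structure of (16.26) concrete, here is a minimal Python sketch of its data generating process, assuming an ETS(M,N,N) model for the demand sizes and, for simplicity, a fixed probability of occurrence. All names and parameter values are illustrative; the full model is implemented in R in the smooth package, not here.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_iets_mnn(T=100, level=10.0, alpha=0.1, sigma=0.05, p=0.3):
    """Simulate y_t = o_t * z_t with an ETS(M,N,N) demand sizes part and a
    fixed Bernoulli probability of occurrence: a special case of (16.26)."""
    y = np.zeros(T)
    l = level                          # level state of the demand sizes part
    for t in range(T):
        eps = rng.normal(0, sigma)     # multiplicative error term epsilon_{z,t}
        z = l * (1 + eps)              # measurement: potential demand size z_t
        l = l * (1 + alpha * eps)      # transition: state evolves even when o_t = 0
        o = rng.binomial(1, p)         # demand occurrence o_t
        y[t] = o * z
    return y

y = simulate_iets_mnn()
```

Note that the potential demand size `z` and the state `l` are updated at every step, reflecting the second assumption above: the demand sizes part evolves even when demand is not observed.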
Depending on the specific model for each part and the restrictions on \(\mu_{a,t}\) and \(\mu_{b,t}\), we might have different types of iETS models. In order to distinguish one model from another, we introduce the notation “iETS(demand sizes model)\(_\text{type of occurrence}\)(model A type)(model B type)”. For example, in iETS(M,N,N)\(_G\)(A,N,N)(M,M,N) the first brackets say that ETS(M,N,N) was applied to the demand sizes, the subscript letter points out that this is the “general probability” model, which uses ETS(A,N,N) for model A and ETS(M,M,N) for model B. If only one variable is needed (either \(a_t\) or \(b_t\)), then the redundant brackets are dropped and the notation is simplified, for example, to iETS(M,N,N)\(_O\)(M,N,N). If the same type of model is used for both demand sizes and demand occurrence, then the second brackets can be dropped as well, simplifying this further to iETS(M,N,N)\(_O\) (the odds ratio model with ETS(M,N,N) for all parts). All these models are implemented in the `adam()` function of the `smooth` package in R.
Similar notations and principles can be used for models based on ARIMA. Note that oARIMA is not yet implemented, but in theory a model like iARIMA(0,1,1)\(_O\)(0,1,1) could be constructed in the ADAM framework.
Last but not least, in some cases we might have explanatory variables (such as promotions, prices, weather, etc.) that impact both demand occurrence and demand sizes. In ADAM, we can include them in the respective `oes()` and `adam()` functions. Just remember that when you include explanatory variables in the occurrence part, you are modelling the probability of occurrence, not the occurrence itself. So, for example, a promotional effect in this situation would mean that there is a higher chance of having sales. In some other situations we might not need dynamic models, such as ETS and ARIMA, and can focus on static regression. While `adam()` supports this as well, the `alm()` function from the `greybox` package might be more suitable in this situation. It supports similar parameters, but its `occurrence` parameter accepts either the type of transform (`plogis` for the logit model and `pnorm` for the probit one) or a previously estimated occurrence model (either from `alm()` or from `oes()`).
16.3.1 Maximum Likelihood Estimation
While there are different ways of estimating the parameters of the ADAM model (16.26), it is worth focusing on likelihood estimation. The log-likelihood of the model consists of several parts:
- The PDF of the demand sizes part of the model, when demand occurs;
- The probability of occurrence;
- The probability of inoccurrence.
When demand occurs, the likelihood is: \[\begin{equation} \mathcal{L}(\boldsymbol{\theta} | y_t, o_t=1) = p_t f_z(z_t | \mathbf{v}_{z,t-\boldsymbol{l}}) , \tag{16.27} \end{equation}\] while in the opposite case it is: \[\begin{equation} \mathcal{L}(\boldsymbol{\theta} | y_t, o_t=0) = (1-p_t) f_z(z_t | \mathbf{v}_{z,t-\boldsymbol{l}}), \tag{16.28} \end{equation}\] where \(\boldsymbol{\theta}\) includes all the estimated parameters of the model and the parameters of the assumed distribution (e.g. the scale). Based on the equations (16.27) and (16.28), we can summarise the likelihood for the whole sample of \(T\) observations: \[\begin{equation} \mathcal{L}(\boldsymbol{\theta} | \textbf{Y}) = \prod_{o_t=1} p_t \prod_{o_t=0} (1-p_t) \prod_{t=1}^T f_z(z_t | \mathbf{v}_{z,t-\boldsymbol{l}}) , \tag{16.29} \end{equation}\] or in logarithms: \[\begin{equation} \ell(\boldsymbol{\theta} | \textbf{Y}) = \sum_{t=1}^T \log f_z(z_t | \mathbf{v}_{z,t-\boldsymbol{l}}) + \sum_{o_t=1} \log(p_t) + \sum_{o_t=0} \log(1-p_t), \tag{16.30} \end{equation}\] where \(f_z(z_t | \mathbf{v}_{z,t-\boldsymbol{l}})\) can be substituted by the likelihood of the assumed distribution from the list of candidates in Section 14.1 (substituting \(T\) in the formulae in Tables 14.1 and 14.2 by \(T_1\), the number of non-zero observations). The main issue in calculating the likelihood (16.30) is that the demand sizes are not observable when \(o_t=0\). This means that we cannot calculate the likelihood using the conventional approach and need to use something else. Svetunkov and Boylan (2019) proposed using the Expectation Maximisation (EM) algorithm for this purpose, which is typically done in the following stages:
1. Take the expectation of the likelihood;
2. Maximise it by changing the values of the parameters;
3. Go to (1) with the new set of parameters if the likelihood has not converged to its maximum.
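In the specific case discussed in this section the expectation has a closed form (the negative differential entropy, introduced below), so the expected likelihood can be maximised directly, without iterating. A hedged Python sketch for normally distributed demand sizes, with simulated residuals (all names and values are illustrative, not part of any package):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative residuals of the demand sizes part and the occurrence variable
T = 200
observed = rng.random(T) < 0.4          # o_t = 1 when demand occurs
errors = rng.normal(0, 2.0, T)          # residuals e_t, only meaningful when o_t = 1
T1 = observed.sum()                     # number of non-zero observations
sse = (errors[observed] ** 2).sum()

def expected_loglik_sizes(sigma2):
    """Demand sizes part of the expected log-likelihood, Normal case:
    observed terms use the log-density, unobserved ones are replaced
    by the negative differential entropy of the Normal distribution."""
    ll_observed = -0.5 * (T1 * np.log(2 * np.pi * sigma2) + sse / sigma2)
    ll_unobserved = -(T - T1) * 0.5 * (np.log(2 * np.pi * sigma2) + 1)
    return ll_observed + ll_unobserved

# Direct (non-iterative) maximisation of the expected likelihood over a fine grid
grid = np.linspace(0.5, 10, 200001)
sigma2_hat = grid[np.argmax(expected_loglik_sizes(grid))]
```

The maximiser coincides with `sse / T` rather than `sse / T1`, illustrating the deflation of the scale estimate discussed later in this section.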
A classical example of EM application is when there are several samples coming from distributions with different parameters, and we need to determine which group each observation belongs to and with what probability. In our context, the idea is slightly different: we know the probabilities, but we do not observe some of the demand sizes. If we take the expectation of (16.30) with respect to the unobserved demand sizes, we get: \[\begin{equation} \begin{aligned} \ell(\boldsymbol{\theta} | \textbf{Y}) & = \sum_{o_t=1} \log f_z \left(z_{t} | \mathbf{v}_{z,t-\boldsymbol{l}} \right) + \sum_{o_t=0} \text{E} \left( \log f_z \left(z_{t} | \mathbf{v}_{z,t-\boldsymbol{l}} \right) \right) \\ & + \sum_{o_t=1} \log(p_t) + \sum_{o_t=0} \log(1- p_t) \end{aligned}. \tag{16.31} \end{equation}\] Luckily, the expectation in (16.31) is known in statistics as “differential entropy” (strictly speaking, the formula above contains the negative differential entropy). It differs from one case to another, depending on the assumed demand sizes distribution. Table 16.1 summarises the differential entropies for the distributions used in ADAM.
Distribution | Assumption | Differential Entropy |
---|---|---|
Normal | \(\epsilon_t \sim \mathcal{N}(0, \sigma^2)\) | \(\frac{1}{2}\left(\log(2\pi\sigma^2)+1\right)\) |
Laplace | \(\epsilon_t \sim \mathcal{Laplace}(0, s)\) | \(1+\log(2s)\) |
S | \(\epsilon_t \sim \mathcal{S}(0, s)\) | \(2+2\log(2s)\) |
Generalised Normal | \(\epsilon_t \sim \mathcal{GN}(0, s, \beta)\) | \(\beta^{-1}-\log\left(\frac{\beta}{2s\Gamma\left(\beta^{-1}\right)}\right)\) |
Asymmetric Laplace | \(\epsilon_t \sim \mathcal{ALaplace}(0, s, \alpha)\) | \(1+\log(2s)\) |
Inverse Gaussian | \(1+\epsilon_t \sim \mathcal{IG}(1, s)\) | \(\frac{1}{2}\left(\log \pi e s -\log(2) \right)\) |
Gamma | \(1+\epsilon_t \sim \mathcal{\Gamma}(s^{-1}, s)\) | \(s^{-1} + \log \Gamma\left(s^{-1} \right) + \left(1-s^{-1}\right)\psi\left(s^{-1}\right)\) |
Log Normal | \(1+\epsilon_t \sim \mathrm{log}\mathcal{N}\left(-\frac{\sigma^2}{2}, \sigma^2\right)\) | \(\frac{1}{2}\left(\log(2\pi\sigma^2)+1\right)-\frac{\sigma^2}{2}\) |
The majority of the formulae for differential entropy in Table 16.1 are taken from Wikipedia (2021a), with the exception of the one for \(\mathcal{IG}\), which was derived by Mudholkar and Tian (2002). These values can be inserted instead of \(\text{E} \left( \log f_z \left(z_{t} | \mathbf{v}_{z,t-\boldsymbol{l}} \right) \right)\) in formula (16.31), leading to the expected likelihood for the respective distributions. Luckily, the EM process in our specific situation does not need to be iterative: the obtained likelihood can be maximised directly by changing the values of the parameters \(\boldsymbol{\theta}\). It is also possible to derive analytical formulae for the parameters of some of the distributions based on (16.31) and the values from Table 16.1. For example, in the case of \(\mathcal{IG}\), the estimate of the scale parameter is: \[\begin{equation} \hat{s} = \frac{1}{T} \sum_{o_t=1}\frac{e_t^2}{1+e_t}. \tag{16.32} \end{equation}\] In fact, it can be shown that the likelihood estimates of scales for different distributions correspond to the conventional formulae from Section 14.1, but with the summation over \(o_t=1\) instead of over all the observations. Note, however, that the division in (16.32) is done by the whole sample size \(T\). This implies that the estimate of scale will be biased, similarly to the classical bias of the sample variance (2.1). However, in the full ADAM model, the estimate of scale is biased not only in sample, but asymptotically as well, implying that with the increase of the sample size it will be consistently lower than needed. This is because the summation is done over the non-zero values, while the division is done over the whole sample: the proportion of non-zeroes impacts the scale in (16.32), deflating its value. The only situation when the bias vanishes is when the probability of occurrence reaches one (demand for the product becomes regular). Still, the value (16.32) maximises the expected likelihood (16.31) and is useful for inference.
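Since the differential entropy is simply \(\mathrm{E}(-\log f)\), the formulae in Table 16.1 can be verified by simulation. A small Python check for the Normal and Laplace rows (the parameter values are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(123)
n = 1_000_000
sigma, s = 1.5, 0.7                    # illustrative scale parameters

# Normal: the differential entropy is 0.5 * (log(2*pi*sigma^2) + 1)
x = rng.normal(0, sigma, n)
neg_logpdf = 0.5 * np.log(2 * np.pi * sigma**2) + x**2 / (2 * sigma**2)
h_normal_mc = neg_logpdf.mean()        # Monte Carlo estimate of E(-log f)
h_normal = 0.5 * (np.log(2 * np.pi * sigma**2) + 1)

# Laplace: the differential entropy is 1 + log(2*s)
x = rng.laplace(0, s, n)
h_laplace_mc = (np.log(2 * s) + np.abs(x) / s).mean()
h_laplace = 1 + np.log(2 * s)
```

Both Monte Carlo estimates should match the closed-form values from the table up to simulation noise.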
However, if one needs to construct prediction intervals, this bias needs to be addressed, which can be done using the following correction: \[\begin{equation} \hat{s}^\prime = \frac{T}{T_1-k} \hat{s}, \tag{16.33} \end{equation}\] where \(k\) is the total number of estimated parameters.
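Both (16.32) and (16.33) are straightforward to compute. A short sketch with simulated residuals (illustrative values, not the output of any estimation routine), showing that the correction effectively swaps the denominator from \(T\) to \(T_1 - k\):

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative residuals, simulated only for this example
T, k = 500, 3                          # sample size and number of estimated parameters
observed = rng.random(T) < 0.4         # occurrence o_t
e = rng.normal(0, 0.1, T)              # residuals e_t, used only when o_t = 1
T1 = observed.sum()                    # number of non-zero observations

# Scale estimate (16.32): summation over o_t = 1, division by the whole sample T
s_hat = (e[observed] ** 2 / (1 + e[observed])).sum() / T

# Bias correction (16.33) for constructing prediction intervals
s_prime = T / (T1 - k) * s_hat
```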
16.3.2 Conditional expectation and variance
Now that we have discussed how the model is formulated and how it can be estimated, we can move to the discussion of the conditional expectation and variance of the model. The former is needed to produce point forecasts, while the latter might be needed for various inventory decisions.
The conditional \(h\) steps ahead expectation of the model can be obtained easily based on the assumption of independence of demand occurrence and demand sizes discussed earlier in Section 16.3: \[\begin{equation} \mu_{y,t+h|t} = \mu_{o,t+h|t} \mu_{z,t+h|t}, \tag{16.34} \end{equation}\] where \(\mu_{o,t+h|t}\) is the conditional expectation of the occurrence variable (the conditional \(h\) steps ahead probability of occurrence) and \(\mu_{z,t+h|t}\) is the conditional expectation of the demand sizes variable \(z_t\). So, the forecast from the model (16.26) relies on the probability of occurrence and reflects the average demand per period of time. As a result, in some cases it might be less than one, implying that the product is not sold every day. Consequently, Kourentzes (2014) argues that the term “demand rate” should be used in this context instead of the conditional expectation. However, any forecasting model produces “demand per period” forecasts; conventional models just assume that the probability of occurrence is equal to one (\(p_t=1\)) for all observations. So, there is no conceptual difference between the forecasts produced by regular and intermittent demand models, and I personally do not see the point in using the “demand rate” term.
As for the conditional variance, it is slightly trickier than the conditional expectation, because the variance of a product of two independent variables involves not only their variances but their expectations as well: \[\begin{equation} \mathrm{V}(y_t) = \mathrm{V}(o_t) \mathrm{V}(z_t) + \mathrm{E}(o_t)^2 \mathrm{V}(z_t) + \mathrm{V}(o_t) \mathrm{E}(z_t)^2 . \tag{16.35} \end{equation}\] Given that we use the Bernoulli distribution for the variable \(o_t\), its variance is equal to \(\mu_{o,t+h|t} (1-\mu_{o,t+h|t})\). In our context this implies that the conditional \(h\) steps ahead variance for the model (16.26) is: \[\begin{equation} \sigma^2_h = \mu_{o,t+h|t} (1-\mu_{o,t+h|t}) \sigma^2_{z,h} + \mu_{o,t+h|t}^2 \sigma^2_{z,h} + \mu_{o,t+h|t} (1-\mu_{o,t+h|t}) \mu_{z,t+h|t}^2 , \tag{16.36} \end{equation}\] or after some manipulations: \[\begin{equation} \sigma^2_h = \mu_{o,t+h|t} \left(\sigma^2_{z,h} + (1 - \mu_{o,t+h|t}) \mu_{z,t+h|t}^2 \right). \tag{16.37} \end{equation}\] All the elements of the formula (16.37) are available and have been discussed in the previous chapters (Sections 8.3, 9.2 and 16.1).
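Formulae (16.34) and (16.37) are easy to verify by simulation. A quick Python check under the independence assumption, with illustrative values for the conditional moments (normal demand sizes are assumed here purely for convenience):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Illustrative h-steps-ahead moments, not produced by any package
mu_o, mu_z, sigma_z = 0.3, 10.0, 2.0

# Simulate y = o * z with independent occurrence and demand sizes
o = rng.binomial(1, mu_o, n)
z = rng.normal(mu_z, sigma_z, n)
y = o * z

# Conditional expectation (16.34) and variance (16.37)
mu_y = mu_o * mu_z
var_y = mu_o * (sigma_z**2 + (1 - mu_o) * mu_z**2)
```

The sample mean and variance of `y` should match `mu_y` and `var_y` up to Monte Carlo error.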
References
• Kourentzes, N., 2014. On intermittent demand model optimisation and selection. International Journal of Production Economics. 156, 180–190. https://doi.org/10.1016/j.ijpe.2014.06.007
• Mudholkar, G.S., Tian, L., 2002. An entropy characterization of the inverse Gaussian distribution and related goodness-of-fit test. Journal of Statistical Planning and Inference. 102, 211–221. https://doi.org/10.1016/S0378-3758(01)00099-4
• Svetunkov, I., Boylan, J.E., 2019. Multiplicative state-space models for intermittent time series. https://doi.org/10.13140/RG.2.2.35897.06242
• Wikipedia, 2021a. Differential entropy. https://en.wikipedia.org/wiki/Differential_entropy (version: 2021-04-22)