13.3 The full ADAM

Uniting the demand occurrence (Section 13.1) and demand sizes (Section 13.2) parts of the model, we can now discuss the full iETS model, which in the most general form can be represented as: \[\begin{equation} \begin{aligned} & y_t = o_t z_t , \\ & {z}_{t} = w_z(\mathbf{v}_{z,t-\mathbf{l}}) + r_z(\mathbf{v}_{z,t-\mathbf{l}}) \epsilon_{z,t} , \\ & \mathbf{v}_{z,t} = f_z(\mathbf{v}_{z,t-\mathbf{l}}) + g_z(\mathbf{v}_{z,t-\mathbf{l}}) \epsilon_{z,t} , \\ & \\ & o_t \sim \text{Bernoulli} \left(p_t \right) , \\ & p_t = f_p(\mu_{a,t}, \mu_{b,t}) , \\ & a_t = w_a(\mathbf{v}_{a,t-\mathbf{l}}) + r_a(\mathbf{v}_{a,t-\mathbf{l}}) \epsilon_{a,t} , \\ & \mathbf{v}_{a,t} = f_a(\mathbf{v}_{a,t-\mathbf{l}}) + g_a(\mathbf{v}_{a,t-\mathbf{l}}) \epsilon_{a,t} , \\ & b_t = w_b(\mathbf{v}_{b,t-\mathbf{l}}) + r_b(\mathbf{v}_{b,t-\mathbf{l}}) \epsilon_{b,t} , \\ & \mathbf{v}_{b,t} = f_b(\mathbf{v}_{b,t-\mathbf{l}}) + g_b(\mathbf{v}_{b,t-\mathbf{l}}) \epsilon_{b,t} \end{aligned} \tag{13.26} \end{equation}\] where the elements of the demand sizes and demand occurrence parts have been defined in Sections 5 and 13.1.4, respectively. The model (13.26) can be considered a generalisation of the conventional ETS and ARIMA models: if the probability of occurrence \(p_t\) is equal to one for all observations, then it reverts to them. Another important thing to note is that the model relies on the following assumptions:

  1. The demand sizes variable \(z_t\) is continuous. This is a reasonable assumption in many contexts, including, for example, energy forecasting. But even when we deal with integer values, Svetunkov and Boylan (2019) showed that such a model does not perform worse than count data models;
  2. Potential demand size may change over time even when \(o_t=0\). This means that the states evolve even when demand is not observed;
  3. Demand sizes and demand occurrence are independent. This simplifies many derivations and makes the model estimable. If this assumption is violated, a different model with different properties would need to be constructed. My gut feeling tells me that the model (13.26) will work well even if the assumption does not hold.

Depending on the specific model for each part and the restrictions on \(\mu_{a,t}\) and \(\mu_{b,t}\), we might have different types of iETS models. To distinguish one model from another, we introduce the notation of iETS models of the form “iETS(demand sizes model)\(_\text{type of occurrence}\)(model A type)(model B type)”. For example, in the iETS(M,N,N)\(_G\)(A,N,N)(M,M,N), the first brackets indicate that ETS(M,N,N) was applied to the demand sizes, the letter in the subscript points out that this is the “general probability” model (Subsection 13.1.4), which uses ETS(A,N,N) for model A and ETS(M,M,N) for model B. If only one variable is needed (either \(a_t\) or \(b_t\)), then the redundant brackets are dropped, and the notation is simplified, for example, to iETS(M,N,N)\(_O\)(M,N,N). If the same type of model is used for both demand sizes and demand occurrence, then the second brackets can be dropped as well, simplifying this further to iETS(M,N,N)\(_O\) (the odds ratio model with ETS(M,N,N) for all parts, Section 13.1.2). All these models are implemented in the adam() function from the smooth package for R.
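
To make the notation more tangible, here is a minimal sketch of how two of these models can be estimated with adam() on an artificially generated intermittent series (the series itself and the seed are, of course, arbitrary):

```r
library(smooth)

# Artificial intermittent demand: Bernoulli occurrence, log-normal sizes
set.seed(41)
y <- rbinom(120, 1, 0.3) * rlnorm(120, meanlog = 3, sdlog = 0.5)

# iETS(M,N,N)_O: the odds ratio model with ETS(M,N,N) for all parts
fit <- adam(y, model = "MNN", occurrence = "odds-ratio")

# iETS(M,N,N)_G: the "general probability" model from Subsection 13.1.4
fitG <- adam(y, model = "MNN", occurrence = "general")

summary(fit)
```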

Similar notation and principles can be used for ARIMA-based models. Note that the occurrence model based on ARIMA (oARIMA) is not yet implemented, but in theory a model like iARIMA(0,1,1)\(_O\)(0,1,1) could be constructed in the ADAM framework.

Last but not least, in some cases we might have explanatory variables, such as promotions, prices, weather, etc. They would impact both demand occurrence and demand sizes. In ADAM, we can include them in the respective oes() and adam() functions. Remember that when you include explanatory variables in the occurrence part, you model the probability of occurrence, not the occurrence itself. So, for example, a promotional effect in this situation would mean a higher chance of having sales. In some other situations, we might not need dynamic models, such as ETS and ARIMA, and can focus on static regression. While adam() supports this, the alm() function from the greybox package might be more suitable in this situation. It supports similar parameters, but its occurrence parameter accepts either the type of transform (plogis for the logit model and pnorm for the probit one) or a previously estimated occurrence model (either from alm() or from oes()).
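
For instance, a static regression with a logit occurrence part could be estimated as follows (a sketch reusing the artificial series from above, with a hypothetical promo dummy):

```r
library(greybox)

# Hypothetical data: the artificial series from above and a promotion dummy
salesData <- data.frame(y = y, promo = rbinom(120, 1, 0.2))

# Log-normal regression for demand sizes with a logit model for the
# occurrence; promo impacts both the probability of a sale and its size
almModel <- alm(y ~ promo, data = salesData,
                distribution = "dlnorm", occurrence = "plogis")
summary(almModel)
```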

13.3.1 Maximum Likelihood Estimation

While there are different ways of estimating the parameters of the ADAM (13.26), it is worth focusing on likelihood estimation (Section 11.1) for consistency with the other ADAMs. The likelihood of the model consists of several parts:

  1. The PDF of the demand sizes part of the model when demand occurs;
  2. The probability of occurrence;
  3. The probability of non-occurrence.

When demand occurs, the likelihood is: \[\begin{equation} \mathcal{L}(\boldsymbol{\theta} | y_t, o_t=1) = p_t f_z(z_t | \mathbf{v}_{z,t-\mathbf{l}}) , \tag{13.27} \end{equation}\] while in the opposite case it is: \[\begin{equation} \mathcal{L}(\boldsymbol{\theta} | y_t, o_t=0) = (1-p_t) f_z(z_t | \mathbf{v}_{z,t-\mathbf{l}}), \tag{13.28} \end{equation}\] where \(\boldsymbol{\theta}\) includes all the estimated parameters of the model together with the parameters of the assumed distribution (e.g. scale). Note that because the model assumes that the demand evolves over time even when it is not observed (\(o_t=0\)), the probability density function of demand sizes, \(f_z(z_t | \mathbf{v}_{z,t-\mathbf{l}})\), appears in (13.28) as well. Based on equations (13.27) and (13.28), we can summarise the likelihood for the whole sample of \(T\) observations: \[\begin{equation} \mathcal{L}(\boldsymbol{\theta} | \textbf{y}) = \prod_{o_t=1} p_t \prod_{o_t=0} (1-p_t) \prod_{t=1}^T f_z(z_t | \mathbf{v}_{z,t-\mathbf{l}}) , \tag{13.29} \end{equation}\] or in logarithms: \[\begin{equation} \ell(\boldsymbol{\theta} | \textbf{y}) = \sum_{o_t=1} \log(p_t) + \sum_{o_t=0} \log(1-p_t) + \sum_{t=1}^T \log f_z(z_t | \mathbf{v}_{z,t-\mathbf{l}}), \tag{13.30} \end{equation}\] where \(\textbf{y}\) is the vector of all actual values and \(\log f_z(z_t | \mathbf{v}_{z,t-\mathbf{l}})\) can be substituted by the log-likelihood of the assumed distribution from the list of candidates in Section 11.1. The main issue in calculating the likelihood (13.30) is that the demand sizes are not observable when \(o_t=0\), so the conventional approach to the calculation of the likelihood does not work. Svetunkov and Boylan (2019) proposed using the Expectation Maximisation (EM) algorithm for this purpose, which is typically done in the following stages (a code sketch illustrating the issue follows the list):

  1. Take the expectation of the likelihood;
  2. Maximise the expected likelihood by changing the values of the parameters;
  3. Go to (1) with the new set of parameters if the likelihood has not converged to its maximum.
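
To see what the EM algorithm resolves here, consider a minimal sketch of the likelihood (13.30), assuming the Normal distribution for demand sizes (the inputs p, mu, and sigma2 are hypothetical fitted values):

```r
# Observable part of the log-likelihood (13.30) under Normal demand sizes.
# y - actual values; p - fitted probabilities of occurrence;
# mu - conditional means of demand sizes; sigma2 - the scale.
logLikObservable <- function(y, p, mu, sigma2) {
  ot <- y != 0
  sum(log(p[ot])) + sum(log(1 - p[!ot])) +
    # The last sum in (13.30) runs over all t, but the density terms for
    # o_t = 0 cannot be evaluated, because z_t is not observed then. The
    # E-step replaces them with their expectations, leading to (13.31).
    sum(dnorm(y[ot], mean = mu[ot], sd = sqrt(sigma2), log = TRUE))
}
```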

A classic example of an EM application is when a dataset mixes several samples with different parameters: we need to split the observations into groups, but we know neither where each observation belongs nor the probability of it belonging to each of the groups. In our context, the idea is slightly different: we know the probabilities, but we do not observe some of the demand sizes. If we take the expectation of (13.30) with respect to the unobserved demand sizes, we get: \[\begin{equation} \begin{aligned} \mathrm{E}\left(\ell(\boldsymbol{\theta} | \textbf{y})\right) & = \sum_{o_t=1} \log f_z \left(z_{t} | \mathbf{v}_{z,t-\mathbf{l}} \right) + \sum_{o_t=0} \mathrm{E} \left( \log f_z \left(z_{t} | \mathbf{v}_{z,t-\mathbf{l}} \right) \right) \\ & + \sum_{o_t=1} \log(p_t) + \sum_{o_t=0} \log(1- p_t) \end{aligned}. \tag{13.31} \end{equation}\] The expectation in (13.31) is known in statistics as the negative of the “differential entropy”. It differs from one case to another, depending on the assumed demand sizes distribution. Table 13.1 summarises the differential entropies for the distributions used in ADAM.

Table 13.1: Differential entropies for different distributions. \(\Gamma(\cdot)\) is the Gamma function, while \(\psi(\cdot)\) is the digamma function.

| Assumption | Differential Entropy |
|:-----------|:---------------------|
| \(\epsilon_t \sim \mathcal{N}(0, \sigma^2)\) | \(\frac{1}{2}\left(\log(2\pi\sigma^2)+1\right)\) |
| \(\epsilon_t \sim \mathrm{Laplace}(0, s)\) | \(1+\log(2s)\) |
| \(\epsilon_t \sim \mathcal{S}(0, s)\) | \(2+2\log(2s)\) |
| \(\epsilon_t \sim \mathcal{GN}(0, s, \beta)\) | \(\beta^{-1}-\log\left(\frac{\beta}{2s\Gamma\left(\beta^{-1}\right)}\right)\) |
| \(1+\epsilon_t \sim \mathcal{IG}(1, \sigma^2)\) | \(\frac{1}{2}\left(\log\left(\pi e \sigma^2\right)-\log(2)\right)\) |
| \(1+\epsilon_t \sim \Gamma(\sigma^{-2}, \sigma^2)\) | \(\sigma^{-2} + \log\Gamma\left(\sigma^{-2}\right) + \left(1-\sigma^{-2}\right)\psi\left(\sigma^{-2}\right)\) |
| \(1+\epsilon_t \sim \mathrm{log}\mathcal{N}\left(-\frac{\sigma^2}{2}, \sigma^2\right)\) | \(\frac{1}{2}\left(\log(2\pi\sigma^2)+1\right)-\frac{\sigma^2}{2}\) |

The majority of formulae for the differential entropy in Table 13.1 are taken from Wikipedia (2021e), with the exception of the one for \(\mathcal{IG}\), which was derived by Mudholkar and Tian (2002). These values can be inserted instead of \(\mathrm{E} \left( \log f_z \left(z_{t} | \mathbf{v}_{z,t-\mathbf{l}} \right) \right)\) in formula (13.31), leading to the expected likelihoods for the respective distributions. For example, for the Inverse Gaussian distribution (using the PDF from Table 11.2), we get: \[\begin{equation} \begin{aligned} \mathrm{E}\left(\ell(\boldsymbol{\theta} | \textbf{y})\right) & = -\frac{T_1}{2} \log \left(2 \pi \sigma^2 \right) -\frac{1}{2}\sum_{o_t=1} \log \left(1+\epsilon_{t}\right)^3 -\frac{3}{2} \sum_{o_t=1} \log y_t -\frac{1}{2 \sigma^2} \sum_{o_t=1} \frac{\epsilon_t^2}{1+\epsilon_t} \\ & -\frac{T_0}{2}\left(\log \left(\pi e \sigma^2\right) -\log(2) \right) \\ & + \sum_{o_t=1} \log(p_t) + \sum_{o_t=0} \log(1- p_t) \end{aligned} , \tag{13.32} \end{equation}\] where \(T_0\) is the number of zeroes in the data and \(T_1\) is the number of non-zero values.

Luckily, in our specific situation, the EM process does not need to be iterative: the obtained likelihood can be maximised directly by changing the values of the parameters \(\boldsymbol{\theta}\). It is also possible to derive analytical formulae for the parameters of some of the distributions based on (13.31) and the values from Table 13.1. For example, in the case of \(\mathcal{IG}\), the estimate of the scale parameter is: \[\begin{equation} \hat{\sigma}^2 = \frac{1}{T} \sum_{o_t=1}\frac{e_t^2}{1+e_t}. \tag{13.33} \end{equation}\] It can be shown that the likelihood estimates of scales for the other distributions correspond to the conventional formulae from Section 11.1, but with the summation over \(o_t=1\) instead of over all the observations. Note, however, that the division in (13.33) is done by the whole sample size \(T\). This implies that the scale estimate will be biased, similarly to the classical bias of the sample variance (see Section 4.3 of Svetunkov, 2022a). Svetunkov and Boylan (2019) show that in the full ADAM the estimate of scale is biased not only in-sample but also asymptotically, implying that with the increase of the sample size it will be consistently lower than needed. This is because the summation is done over the non-zero values only, while the division is done by the whole sample size: the proportion of non-zeroes deflates the value of the scale in (13.33). The bias is only reduced when the probability of occurrence approaches one (i.e. when demand becomes regular). Still, the value (13.33) maximises the expected likelihood (13.31) and is useful for inference. However, if one needs to construct prediction intervals, this bias needs to be addressed, which can be done using the conventional correction: \[\begin{equation} \hat{\sigma}^2{^{\prime}} = \frac{T}{T_1-k} \hat{\sigma}^2, \tag{13.34} \end{equation}\] where \(k\) is the number of all estimated parameters.
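
Putting the pieces together, here is a sketch of the expected log-likelihood (13.32) with the scale estimates (13.33) and (13.34) for the Inverse Gaussian case (all inputs are hypothetical fitted values):

```r
# e  - errors e_t for the observations with o_t = 1;
# y1 - the non-zero actual values; p - fitted probabilities of occurrence
# for the whole sample; ot - logical vector of occurrences; k - number of
# estimated parameters.
expLogLikIG <- function(e, y1, p, ot, k) {
  obsAll <- length(ot)
  obsNonZero <- sum(ot)
  obsZero <- obsAll - obsNonZero
  sigma2 <- sum(e^2 / (1 + e)) / obsAll                      # eq. (13.33)
  ell <- -obsNonZero / 2 * log(2 * pi * sigma2) -
    sum(log((1 + e)^3)) / 2 - 3 / 2 * sum(log(y1)) -
    sum(e^2 / (1 + e)) / (2 * sigma2) -
    obsZero / 2 * (log(pi * exp(1) * sigma2) - log(2)) +
    sum(log(p[ot])) + sum(log(1 - p[!ot]))                   # eq. (13.32)
  list(logLik = ell,
       sigma2 = sigma2,
       sigma2Corrected = sigma2 * obsAll / (obsNonZero - k)) # eq. (13.34)
}
```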

13.3.2 Conditional expectation and variance

Now that we have discussed how the model is formulated and how it can be estimated, we can move to the conditional expectation and variance from it. The former is needed to produce point forecasts, while the latter might be required for inventory decisions.

The conditional \(h\) steps ahead expectation of the model can be obtained easily, based on the assumption of independence of demand occurrence and demand sizes discussed earlier in this section: \[\begin{equation} \mu_{y,t+h|t} = \mu_{o,t+h|t} \mu_{z,t+h|t}, \tag{13.35} \end{equation}\] where \(\mu_{o,t+h|t}\) is the conditional expectation of the occurrence variable (the conditional \(h\) steps ahead probability of occurrence) and \(\mu_{z,t+h|t}\) is the conditional expectation of the demand sizes variable \(z_t\). So, the forecast from the model (13.26) relies on the probability of occurrence and reflects the average demand per period. As a result, it might be less than one in some cases, implying that the product is not sold every day. Consequently, Kourentzes (2014) argues that the term “demand rate” should be used in this context instead of the conditional expectation. However, any forecasting model produces “demand per period” forecasts; regular ones just assume that the probability of occurrence is equal to one (\(p_t=1\)) for all observations. So, there is no conceptual difference between the forecasts produced by regular and intermittent demand models, and I do not see the point in using the “demand rate” term.
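
As an illustration, for the iETS model estimated with adam() earlier in this section, the point forecast can be produced with the forecast() method, which for intermittent models returns exactly the product (13.35); the assumption in the sketch below is that the estimated occurrence model is stored in fit$occurrence:

```r
# Average demand per period, mu_{o,t+h|t} * mu_{z,t+h|t}; the values can
# be below one unit when the probability of occurrence is low
forecast(fit, h = 7, interval = "none")$mean

# The occurrence part, i.e. the conditional probabilities mu_{o,t+h|t}
# (assuming the occurrence model is stored in the fit$occurrence element)
forecast(fit$occurrence, h = 7)$mean
```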

As for the conditional variance, it is slightly trickier than the conditional expectation, because the variance of a product of two independent random variables involves not only their variances but also their expectations: \[\begin{equation} \mathrm{V}(y_{t+h}) = \mathrm{V}(o_{t+h|t}) \mathrm{V}(z_{t+h|t}) + \mathrm{E}(o_{t+h|t})^2 \mathrm{V}(z_{t+h|t}) + \mathrm{V}(o_{t+h|t}) \mathrm{E}(z_{t+h|t})^2 . \tag{13.36} \end{equation}\] Given that we use the Bernoulli distribution for the variable \(o_t\), its variance is equal to \(\mu_{o,t+h|t} (1-\mu_{o,t+h|t})\). In our context, this implies that the conditional \(h\) steps ahead variance of the full model (13.26) is: \[\begin{equation} \sigma^2_h = \mu_{o,t+h|t} (1-\mu_{o,t+h|t}) \sigma^2_{z,h} + \mu_{o,t+h|t}^2 \sigma^2_{z,h} + \mu_{o,t+h|t} (1-\mu_{o,t+h|t}) \mu_{z,t+h|t}^2 , \tag{13.37} \end{equation}\] or, after some manipulations: \[\begin{equation} \sigma^2_h = \mu_{o,t+h|t} \left(\sigma^2_{z,h} + (1 -\mu_{o,t+h|t}) \mu_{z,t+h|t}^2 \right). \tag{13.38} \end{equation}\] All the elements of formula (13.38) are available and have been discussed in Sections 5.3, 6.3, and 13.1.
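
A short numeric sketch of (13.35) and (13.38), with made-up values for the components:

```r
# Hypothetical components: probability of occurrence, expectation and
# variance of demand sizes h steps ahead
muO <- 0.4; muZ <- 5; s2Z <- 4

muY <- muO * muZ                          # eq. (13.35): 2 units per period
s2Y <- muO * (s2Z + (1 - muO) * muZ^2)    # eq. (13.38): 7.6
```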

References

• Kourentzes, N., 2014. On intermittent demand model optimisation and selection. International Journal of Production Economics. 156, 180–190. https://doi.org/10.1016/j.ijpe.2014.06.007
• Mudholkar, G.S., Tian, L., 2002. An entropy characterization of the inverse Gaussian distribution and related goodness-of-fit test. Journal of Statistical Planning and Inference. 102, 211–221. https://doi.org/10.1016/S0378-3758(01)00099-4
• Svetunkov, I., 2022a. Statistics for business analytics. https://openforecast.org/sba/ (version: 31.03.2022)
• Svetunkov, I., Boylan, J.E., 2019. Multiplicative state-space models for intermittent time series. https://doi.org/10.13140/RG.2.2.35897.06242
• Wikipedia, 2021e. Differential entropy. https://en.wikipedia.org/wiki/Differential_entropy (version: 2021-04-22)