Chapter 3 A short introduction to main statistical ideas

Before moving forward and discussing distributions and models, it is also quite important to make sure that we understand what bias, efficiency and consistency of estimates of parameters mean. Although there are strict statistical definitions of the aforementioned terms (you can easily find them in Wikipedia or anywhere else), I do not want to copy-paste them here, because there are only a couple of important points worth mentioning in our context.

Bias refers to the expected difference between the estimated value of parameter (on a specific sample) and the “true” one (in the true model). Having unbiased estimates of parameters is important because they should lead to more accurate forecasts (at least in theory). For example, if the estimated parameter is equal to zero, while in fact it should be 0.5, then the model would not take the provided information into account correctly and as a result will produce less accurate point forecasts and incorrect prediction intervals. In inventory context this may mean that we constantly order 100 units less than needed only because the parameter is lower than it should be.

Efficiency means, if the sample size increases, then the estimated parameters will not change substantially, they will vary in a narrow range (variance of estimates will be small). In the case with inefficient estimates the increase of sample size from 50 to 51 observations may lead to the change of a parameter from 0.1 to, let’s say, 10. This is bad because the values of parameters usually influence both point forecasts and prediction intervals. As a result the inventory decision may differ radically from day to day. For example, we may decide that we urgently need 1000 units of product on Monday, and order it just to realise on Tuesday that we only need 100. Obviously this is an exaggeration, but no one wants to deal with such an erratically behaving model, so we need to have efficient estimates of parameters.

Consistency means that our estimates of parameters will get closer to the stable values (true value in the population) with the increase of the sample size. This is important because in the opposite case estimates of parameters will diverge and become less and less realistic. This once again influences both point forecasts and prediction intervals, which will be less meaningful than they should have been. In a way consistency means that with the increase of the sample size the parameters will become more efficient and less biased. This in turn means that the more observations we have, the better.

There is a prejudice in the world of practitioners that the situation in the market changes so fast that the old observations become useless very fast. As a result many companies just through away the old data. Although, in general the statement about the market changes is true, the forecasters tend to work with the models that take this into account (e.g. Exponential smoothing, ARIMA, discussed in this book). These models adapt to the potential changes. So, we may benefit from the old data because it allows us getting more consistent estimates of parameters. Just keep in mind, that you can always remove the annoying bits of data but you can never un-throw away the data.

Another important aspect to cover is what the term asymptotic means in our context. Here and after in this book, when this word is used, we refer to an unrealistic hypothetical situation of having all the data in the multiverse, where the time index \(t \rightarrow \infty\). While this is impossible, the idea is useful, because asymptotic behaviour of estimators and models is helpful on large samples of data.

Finally, we will use different estimation techniques throughout this book, one of the main of which is Maximum Likelihood Estimate (MLE). We will not go into explanation of what specifically this is at this stage, but a rough understanding should suffice. In case of MLE, we assume that a variable follows a parametric distribution and that the parameters of the model that we use can be optimised in order to maximise the respective probability density function. The main advantages of MLE is that it gives consistent, asymptotically efficient and normal estimates of parameters.

Now that we have a basic understanding of these statistical terms, we can move to the next topic, distributions.

3.1 Theory of distributions

There are several probability distributions that will be helpful in the further chapters of this textbook. Here, I want to briefly discuss those of them that will be used. While this might not seem important at this point, we might refer to this section of the textbook in the next chapters.

3.1.1 Normal distribution

Every statistical textbook has normal distribution. It is that one famous bell-curved distribution that every statistician likes because it is easy to work with and it is an asymptotic distribution for many other well-behaved distributions in some conditions (so called “Central Limit Theorem”). Here is the probability density function (PDF) of this distribution: \[\begin{equation} f(y_t) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{\left(y_t - \mu_t \right)^2}{2 \sigma^2} \right) , \tag{3.1} \end{equation}\] where \(y_t\) is the value of the response variable, \(\mu_t\) is the mean on observation \(t\) and \(\sigma^2\) is the variance of the error term. The maximum likelihood estimate of \(\sigma^2\) is: \[\begin{equation} \hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T \left(y_t - \mu_t \right)^2 , \tag{3.2} \end{equation}\]

which coincides with Mean Squared Error (MSE), discussed in the section 1.

And here how this distribution looks:

What we typically assume in the basic time series models is that a variable is random and follows normal distribution, meaning that there is a central tendency (in our case - the mean \(mu\)), around which the concentration of values is the highest and there are other potential cases, but their probability of appearance reduces proportionally to the distance from the centre.

The normal distribution has skewness of zero and kurtosis of 3 (and excess kurtosis, being kurtosis minus three, of 0).

Additionally, if normal distribution is used for the maximum likelihood estimation of a model, it gives the same parameters as the minimisation of MSE would give.

3.1.2 Laplace distribution

A more exotic distribution is Laplace, which has some similarities with Normal, but has higher excess. It has the following PDF:

\[\begin{equation} f(y_t) = \frac{1}{2 s} \exp \left( -\frac{\left| y_t - \mu_t \right|}{s} \right) , \tag{3.3} \end{equation}\] where \(s\) is the scale parameter, which, when estimated using likelihood, is equal to the Mean Absolute Error (MAE): \[\begin{equation} \hat{s} = \frac{1}{T} \sum_{t=1}^T \left| y_t - \mu_t \right| . \tag{3.4} \end{equation}\]

It has the following shape:

Similar to the normal distribution, the skewness of Laplace is equal to zero. However, it has fatter tails - its kurtosis is equal to 6 instead of 3.

The dlaplace, qlaplace, plaplace and rlaplace functions from greybox package implement different sides of Laplace distribution in R.

3.1.3 S distribution

This is something relatively new, but not ground braking. I have derived S distribution few years ago, but have never written a paper on that. It has the following density function: \[\begin{equation} f(y_t) = \frac{1}{4 s^2} \exp \left( -\frac{\sqrt{|y_t - \mu_t|}}{s} \right) , \tag{3.5} \end{equation}\] where \(s\) is the scale parameter. If estimated via maximum likelihood, the scale parameter is equal to: \[\begin{equation} \hat{s} = \frac{1}{2T} \sum_{t=1}^T \sqrt{\left| y_t - \mu_t \right|} , \tag{3.6} \end{equation}\]

which corresponds to the minimisation of a half of “Mean Root Absolute Error” or “Half Absolute Moment” (HAM). This is a more exotic type of scale, but the main benefit of this distribution is sever heavy tails - it has kurtosis of 25.2. It might be useful in cases of randomly occurring incidents and extreme values (Black Swans?).

The ds, qs, ps and rs from greybox package implement the density, quantile, cumulative and random generation functions.

3.1.4 Generalised Normal distribution

Generalised Normal (\(\mathcal{GN}\)) distribution (as the name says) is a generalisation for normal distribution, which also includes Laplace and S as special cases (Nadarajah 2005). There are two versions of this distribution: one with a shape and another with a skewness parameter. We are mainly interested in the first one, which has the following PDF: \[\begin{equation} f(y_t) = \frac{\beta}{2 s \Gamma(\beta^{-1})} \exp \left( -\left(\frac{|y_t - \mu_t|}{s}\right)^{\beta} \right), \tag{3.7} \end{equation}\] where \(\beta\) is the shape parameter, and \(s\) is the scale of the distribution, which, when estimated via MLE, is equal to: \[\begin{equation} \hat{s} = \sqrt[^beta]{\frac{\beta}{T} \sum_{t=1}^T\left| y_t - \mu_t \right|^{\beta}}, \tag{3.8} \end{equation}\]

which has MSE, MAE and HAM as special cases, when \(\beta\) is equal to 2, 1 and 0.5 respectively. The parameter \(\beta\) influences the kurtosis directly, it can be calculated for each special case as \(\frac{\Gamma(5/\beta)\Gamma(1/\beta)}{\Gamma(3/\beta)^2}\). The higher \(\beta\) is, the lower the kurtosis is.

The advantage of \(\mathcal{GN}\) distribution is it’s flexibility. In theory, it is possible to model extremely rare events with this distribution, if the shape parameter \(\beta\) is fractional and close to zero. Alternatively, when \(\beta \rightarrow \infty\), the distribution converges point-wise to the uniform distribution on \((\mu_t - s, \mu_t + s)\).

Note that the estimation of \(\beta\) is a difficult task, especially, when it is less than 2 - the MLE of it looses properties of consistency and asymptotic normality.

Depending on the value of \(\beta\), the distribution can have different shapes:

Typically, estimating \(\beta\) consistently is a tricky thing to do, especially if it is less than one. Still, it is possible to do that by maximising the likelihood function (3.7).

The working functions for the Generalised Normal are implemented in the gnorm package for R.

3.1.5 Asymmetric Laplace distribution

Asymmetric Laplace distribution (\(\mathcal{AL}\)) can be considered as a two Laplace distributions with different parameters \(s\) for left and right sides from the location \(\mu_t\). There are several ways to summarise the probability density function, the neater one relies on the asymmetry parameter \(\alpha\) (Yu and Zhang 2005): \[\begin{equation} f(y_t) = \frac{\alpha (1- \alpha)}{s} \exp \left( -\frac{y_t - \mu_t}{s} (\alpha - I(y_t \leq \mu_t)) \right) , \tag{3.9} \end{equation}\] where \(s\) is the scale parameter, \(\alpha\) is the skewness parameter and \(I(y_t \leq \mu_t)\) is the indicator function, which is equal to one, when the condition is satisfied and to zero otherwise. The scale parameter \(s\) estimated using likelihood is equal to the quantile loss: \[\begin{equation} \hat{s} = \frac{1}{T} \sum_{t=1}^T \left(y_t - \mu_t \right)(\alpha - I(y_t \leq \mu_t)) . \tag{3.10} \end{equation}\]

Thus maximising the likelihood (3.9) is equivalent to estimating the model via the minimisation of \(\alpha\) quantile, making this equivalent to quantile regression approach. So quantile regression models assume indirectly that the error term in the model is \(\epsilon_t \sim \mathcal{AL}(0, s, \alpha)\) (Geraci and Bottai 2007).

Depending on the value of \(\alpha\), the distribution can have different shapes:

Similarly to \(\mathcal{GN}\) distribution, the parameter \(\alpha\) can be estimated during the maximisation of the likelihood, although it makes more sense to set it to some specific values in order to obtain the desired quantile of distribution.

Functions dalaplace, qalaplace, palaplace and ralaplace from greybox package implement the Asymmetric Laplace distribution.

3.1.6 Log Normal, Log Laplace, Log S and Log GN distributions

In addition, it is possible to derive the log-versions of the Normal, , , and distributions. The main differences between the original and the log-versions of density functions for these distributions can be summarised as follows: \[\begin{equation} f_{log}(\log(y_t)) = \frac{1}{y_t} f(\log y_t). \tag{3.11} \end{equation}\]

They are defined for positive values only and will have different right tail, depending on the location, scale and shape parameters. \(\exp(\mu_t)\) in this case represents the geometric mean (and median) of distribution rather than the arithmetic one. The conditional expectation in these distributions is typically higher than \(\exp(\mu_t)\) and depends on the value of the scale parameter.

3.1.7 Inverse Gaussian distribution

An exotic distribution that will be useful for what comes in this textbook is the Inverse Gaussian (\(\mathcal{IG}\)), which is parameterised using mean value \(\mu_t\) and either the dispersion parameter \(s\) or the scale \(\lambda\) and is defined for positive values only. This distribution is useful because it is scalable and has some similarities with the Normal one. In our case, the important property is the following: \[\begin{equation} \text{if } \epsilon_t \sim \mathcal{IG}(1, s) \text{, then } y_t = \mu_t \times \epsilon_t \sim \mathcal{IG}\left(\mu_t, \frac{s}{\mu_t} \right), \tag{3.12} \end{equation}\]

implying that the dispersion of the model changes together with the expectation. The PDF of this distribution is:

\[\begin{equation} f(\epsilon_t) = \frac{1}{\sqrt{2 \pi s \epsilon_t^3}} \exp \left( -\frac{\left(\epsilon_t - 1 \right)^2}{2 s \epsilon_t} \right) , \tag{3.13} \end{equation}\] where the dispersion parameter can be estimated via maximising the likelihood and is calculated using: \[\begin{equation} \hat{s} = \frac{1}{T} \sum_{t=1}^T \frac{\left(\epsilon_t - 1 \right)^2}{\epsilon_t} . \tag{3.14} \end{equation}\]

This distribution becomes very useful for multiplicative models, where it is expected that the data can only be positive.

Here is how the PDF of \(\mathcal{IG}(1,1)\) looks for different

statmod package implements density, quantile, cumulative and random number generator functions for the \(\mathcal{IG}\).

References

Geraci, Marco, and Matteo Bottai. 2007. “Quantile regression for longitudinal data using the asymmetric Laplace distribution.” Biostatistics 8 (1): 140–54. doi:10.1093/biostatistics/kxj039.

Nadarajah, Saralees. 2005. “A generalized normal distribution.” Journal of Applied Statistics 32 (7): 685–94. doi:10.1080/02664760500079464.

Yu, Keming, and Jin Zhang. 2005. “A three-parameter asymmetric laplace distribution and its extension.” Communications in Statistics - Theory and Methods 34 (9-10): 1867–79. doi:10.1080/03610920500199018.