
2.7 Model selection mechanism

There are different ways to select the most appropriate model for the data. One can use judgment, statistical tests, cross-validation or meta-learning. The state-of-the-art approach in the field of exponential smoothing relies on calculating information criteria and selecting the model with the lowest value. This approach is discussed in detail in Burnham and Anderson (2004). Here we briefly explain how it works and what its advantages and disadvantages are.

2.7.1 Information criteria idea

Before we move to the mathematics and well-known formulae, it makes sense to understand what we are trying to do when we use information criteria. The idea is that we have a pool of models under consideration, and that there is a true model somewhere out there (not necessarily in our pool). This can be presented graphically in the following way:

Figure 2.16: An example of a model space

Figure 2.16 represents a space of models. There is a true one in the middle, and there are four models under consideration: Model 1, Model 2, Model 3 and Model 4. They might differ from each other in their form (additive or multiplicative) or in the included or omitted variables. They lie at some distances (the grey dashed lines) from the true model in this hypothetical model space: Model 1 is closer to the true one than the others, while Model 2 is the farthest. Models 3 and 4 are at a similar distance from the truth.

In the model selection exercise, what we typically want to do is select the model closest to the true one (Model 1 in our case). This is relatively easy to do when you know the true model: just measure the distances and select the closest one. This can be written very roughly as: \[\begin{equation} \begin{split} d_1 = \ell^* - \ell_1 \\ d_2 = \ell^* - \ell_2 \\ d_3 = \ell^* - \ell_3 \\ d_4 = \ell^* - \ell_4 \end{split} , \tag{2.35} \end{equation}\]

where \(\ell_j\) is the position of the \(j^{th}\) model and \(\ell^*\) is the position of the true one. One way of getting the position of a model is to calculate the log-likelihood (the logarithm of the likelihood) value for each model, based on the assumed distributions. The likelihood of the true model is always fixed, so if it is known, the task comes down to calculating the values for Models 1 to 4, inserting all the known values in equation (2.35), and selecting the model that has the lowest distance \(d_j\).

However, in real life we never know the true model, so we need to find some other way of measuring this distance. The good thing about this approach is that the true model will always have the highest possible likelihood. This means that it is not important to know \(\ell^*\): it will be the same for all the models. So, we can drop \(\ell^*\) in the formulae (2.35) and compare the models via their log-likelihoods \(\ell_1, \ell_2, \ell_3 \text{ and } \ell_4\): \[\begin{equation} \begin{split} d_1 = - \ell_1 \\ d_2 = - \ell_2 \\ d_3 = - \ell_3 \\ d_4 = - \ell_4 \end{split} , \tag{2.36} \end{equation}\]

This is a very simple method that allows us to get to the model closest to the true one in the pool. However, we should not forget that we usually work with samples of data, not the population. So, we will inevitably have estimates of likelihoods, not the true ones. They will be biased and will need to be corrected. Akaike (1974) showed that the bias can be corrected if the number of parameters in each model is added to the distances (2.36), resulting in the bias-corrected formula: \[\begin{equation} d_j = k_j - \ell_j \tag{2.37}, \end{equation}\] where \(k_j\) is the number of all estimated parameters in model \(j\) (this typically also includes scale parameters, when dealing with Maximum Likelihood Estimates).

When studying the properties of (2.37), Akaike (1974) suggested multiplying both terms on the right-hand side by 2, so that there is a connection between the proposed criterion and the well-known likelihood-ratio test (Wikipedia 2020d), and he proposed the following "An Information Criterion": \[\begin{equation} \mathrm{AIC}_j = 2 k_j - 2 \ell_j \tag{2.38}. \end{equation}\]
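To see why the penalty in (2.38) matters, here is a minimal Python sketch. The log-likelihood values and parameter counts for the four models are made up for illustration and do not come from any real data. The function `aic()` is a hypothetical helper that implements (2.38) directly.

```python
def aic(loglik, k):
    """AIC as in (2.38): 2k - 2*loglik."""
    return 2 * k - 2 * loglik

# Hypothetical (log-likelihood, number of estimated parameters) pairs.
models = {
    "Model 1": (-100.0, 3),
    "Model 2": (-99.5, 6),
    "Model 3": (-104.0, 2),
    "Model 4": (-103.0, 4),
}

# Uncorrected distances (2.36): d_j = -loglik_j.
distances = {name: -ll for name, (ll, k) in models.items()}
best_by_loglik = min(distances, key=distances.get)

# Bias-corrected comparison via AIC (2.38).
aic_values = {name: aic(ll, k) for name, (ll, k) in models.items()}
best_by_aic = min(aic_values, key=aic_values.get)

print(best_by_loglik)  # Model 2: highest in-sample likelihood
print(best_by_aic)     # Model 1: the extra parameters of Model 2 are penalised
```

In this made-up example, the model with the highest in-sample likelihood has twice as many parameters as Model 1, and the small gain in likelihood does not compensate for them.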

Since then, other criteria motivated by similar ideas have been proposed, among which it is worth mentioning:

  • AICc (Sugiura 1978), which is a sample-corrected version of AIC (taking the number of observations into account) for the Normal and related distributions: \[\begin{equation} \mathrm{AICc}_j = 2 \frac{T}{T-k_j-1} k_j - 2 \ell_j \tag{2.39}, \end{equation}\]

    where \(T\) is the sample size.

  • BIC (Schwarz 1978), also known as the "Schwarz criterion", which is derived from Bayesian statistics: \[\begin{equation} \mathrm{BIC}_j = \log(T) k_j - 2 \ell_j \tag{2.40}. \end{equation}\]
  • BICc (McQuarrie 1999), the sample-corrected version of BIC, relying on the assumption of normality: \[\begin{equation} \mathrm{BICc}_j = \frac{\log (T) T}{T-k_j-1} k_j - 2 \ell_j \tag{2.41}. \end{equation}\]
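The criteria (2.38) to (2.40) are easy to compute once the log-likelihood, the number of estimated parameters and the sample size are known. The following Python sketch is only an illustration: the function names and the numbers in the example are assumptions, not part of any specific package.

```python
import math

def aic(loglik, k):
    """AIC, formula (2.38)."""
    return 2 * k - 2 * loglik

def aicc(loglik, k, T):
    """AICc, formula (2.39). Requires T > k + 1."""
    if T - k - 1 <= 0:
        raise ValueError("AICc needs T > k + 1")
    return 2 * T / (T - k - 1) * k - 2 * loglik

def bic(loglik, k, T):
    """BIC, formula (2.40)."""
    return math.log(T) * k - 2 * loglik

# Hypothetical values: log-likelihood of -50 with k = 3 parameters.
loglik, k = -50.0, 3
for T in (30, 300, 3000):
    # The gap between AICc and AIC shrinks as the sample size grows.
    print(T, aic(loglik, k), round(aicc(loglik, k, T), 3), round(bic(loglik, k, T), 3))
```

The formula (2.39) is equivalent to the more common notation \(\mathrm{AIC}_j + \frac{2 k_j (k_j + 1)}{T - k_j - 1}\), which makes it clear that the correction vanishes as \(T\) increases.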

In general, it is recommended to use the sample-corrected versions of the criteria (AICc, BICc) and to use the others only in cases of large samples (thousands of observations), where the effect of the number of observations on the criteria becomes negligible. The main issue is that the corrected versions of information criteria for non-normal distributions need to be derived separately and will differ from (2.39) and (2.41). Still, Burnham and Anderson (2004) recommend using formulae (2.39) and (2.41) in cases of small samples, even if the distribution of the variable is not normal and the correct formulae are not known. The motivation for this is that the corrected versions still take the sample size into account, correcting the sample bias in the criteria to some extent.

A thing to note is that the approach relies on the asymptotic properties of estimators and assumes that the estimation method used in the process guarantees that the likelihood functions of the models are maximised. Because it relies on the asymptotic behaviour of parameters, it is not critical whether the exact maximum of the likelihood is reached in sample or whether the final solution is only near the maximum. If the sample size changes, the parameters that maximise the likelihood will change as well, so we cannot pin down the point exactly in sample anyway. It is much more important to use an estimation method that guarantees consistent maximisation of the likelihood. This implies that we might select the wrong model in some cases in sample, but that is acceptable, because if we use an adequate approach for estimation and selection, then with the increase of the sample size we will select the correct model more often than an incorrect one. While the "increase of sample size" might seem like an unrealistic idea in some real-life cases, keep in mind that it might mean not just the increase of \(T\), but also the increase of the number of series under consideration. So, for example, the approach should select the correct model on average when you test it on a sample of 10,000 SKUs.

Summarising, the idea of model selection via information criteria is to:

  1. form a pool of competing models,
  2. construct them,
  3. calculate likelihood functions,
  4. calculate the information criteria based on them,
  5. and finally, select the model that has the lowest value.
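The steps above can be sketched in Python as follows. This is a toy illustration under several assumptions: the pool consists of only two hypothetical models (a constant level and a linear trend), both are estimated via least squares (which coincides with the maximum likelihood for the Normal distribution), the data is artificial, and AICc from (2.39) is used as the criterion.

```python
import math

def normal_loglik(residuals):
    """Maximised Normal log-likelihood, with the ML variance estimate."""
    T = len(residuals)
    sigma2 = sum(e * e for e in residuals) / T
    return -0.5 * T * (math.log(2 * math.pi * sigma2) + 1)

def aicc(loglik, k, T):
    """AICc, formula (2.39)."""
    return 2 * T / (T - k - 1) * k - 2 * loglik

def fit_level(x, y):
    """Constant level model: k = 2 (the mean and the variance)."""
    mean = sum(y) / len(y)
    return [yi - mean for yi in y], 2

def fit_trend(x, y):
    """Linear trend model: k = 3 (intercept, slope and the variance)."""
    T = len(y)
    mx, my = sum(x) / T, sum(y) / T
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    return [yi - b0 - b1 * xi for xi, yi in zip(x, y)], 3

# Artificial data with a clear trend and a fixed noise pattern.
x = list(range(1, 21))
noise = [0.3, -0.2, 0.1, -0.4, 0.2] * 4
y = [2 + 0.5 * xi + ei for xi, ei in zip(x, noise)]

# Steps 1-2: form the pool and construct the models.
pool = {"level": fit_level, "trend": fit_trend}

# Steps 3-4: likelihoods and information criteria.
criteria = {}
for name, fit in pool.items():
    residuals, k = fit(x, y)
    criteria[name] = aicc(normal_loglik(residuals), k, len(y))

# Step 5: select the model with the lowest value.
best = min(criteria, key=criteria.get)
print(best)
```

Each fitting function returns the number of estimated parameters together with the residuals, so that the penalty is always consistent with the way the model was estimated.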

This approach is relatively fast (in comparison with cross-validation, judgmental selection or meta-learning) and has good theory behind it. It can also be shown that, in the case of the Normal distribution, selection for time series models based on AIC is asymptotically equivalent to selection based on leave-one-out cross-validation with MSE. This becomes relatively straightforward if we recall that time series models typically rely on one-step-ahead errors \((e_t = y_t - \mu_{t|t-1})\) and that the maximum of the likelihood of the Normal distribution gives the same estimates as the minimum of MSE.
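The connection between the Normal likelihood and MSE can be sketched as follows, assuming that the one-step-ahead errors \(e_t\) are independent and follow the Normal distribution with zero mean and variance \(\sigma^2\). The log-likelihood in this case is: \[\begin{equation*} \ell = -\frac{T}{2} \log \left(2 \pi \sigma^2 \right) - \frac{1}{2 \sigma^2} \sum_{t=1}^T e_t^2 . \end{equation*}\] Taking the derivative with respect to \(\sigma^2\) and equating it to zero gives \(\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T e_t^2 = \mathrm{MSE}\). Inserting this back into the log-likelihood leads to: \[\begin{equation*} \ell = -\frac{T}{2} \left( \log \left(2 \pi \mathrm{MSE} \right) + 1 \right) , \end{equation*}\] which is a decreasing function of MSE. So, the parameters that minimise MSE also maximise the likelihood.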

As for the disadvantages of the approach, as mentioned above, it relies on the in-sample value of the likelihood, based on the one-step-ahead error, and does not guarantee that the selected model will perform well on the holdout for multiple steps ahead. Using cross-validation or rolling origin for the full horizon could give better results if you suspect that information criteria do not work. Furthermore, any criterion is random on its own and will change with the change of the sample size. This means that there is model selection uncertainty and that the best model might change with new observations. In order to address this issue, a combination of models can be used to mitigate this uncertainty.

2.7.2 Calculating number of parameters in models

When doing model selection and calculating different statistics, it is important to know how many parameters were estimated in the model. While this might seem like a trivial question to some, it has several caveats. Interestingly, this is not discussed in detail by many authors.

When it comes to the calculation of information criteria, the general idea is to calculate the number of all the independent estimated parameters \(k\). This typically includes all the initial components and all the coefficients of the model, together with the scale, shape and shift parameters of the assumed distribution (e.g. variance in the Normal distribution).

Example 2.1 In a simple regression model \(y_t = \beta_0 + \beta_1 x_t + \epsilon_t\), assuming the Normal distribution for \(\epsilon_t\), using MLE will result in the estimation of \(k=3\) parameters: the two parameters of the model (\(\beta_0\) and \(\beta_1\)) and the variance of the error term \(\sigma^2\).

If the likelihood is not used, then the number of parameters might be different. For example, if we estimate the model via the minimisation of MSE (similar to OLS), then the number of all estimated parameters does not include the variance anymore: it is obtained as a by-product of the estimation. This is because the likelihood needs to have all the parameters of the distribution in order to be maximised, but with MSE we just minimise the mean of squared errors, and the variance of the distribution is obtained automatically. While the values of the parameters might be the same, the logic is slightly different.

Example 2.2 This means that for the same simple linear regression, the number of estimated parameters is equal to 2: estimates of \(\beta_0\) and \(\beta_1\).

In addition, restrictions on the parameters can reduce the number of estimated parameters when the parameters reach their boundary values.

Example 2.3 If we know that the parameter \(\beta_1\) lies between 0 and 1, and in the estimation process it gets to the value of 1 (due to how the optimiser works), it can be considered as a restriction \(\beta_1=1\). So, when estimated via the minimum of MSE with this restriction, this would imply that \(k=1\).

In general, if the value of a parameter is provided in the model rather than estimated, then it does not count towards the number of all estimated parameters. So, setting \(\beta_1=1\) acts in the same fashion.

Finally, if a parameter is just a function of another one, then it does not count towards \(k\) either.

Example 2.4 If we know that in the same simple linear regression \(\beta_1 = \frac{\beta_0}{\sigma^2}\), then the number of all the estimated parameters via the maximum likelihood is 2: \(\beta_0\) and \(\sigma^2\).
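The counting rules from Examples 2.1 to 2.4 can be summarised in a small Python sketch. The function `count_estimated_parameters()` and its arguments are hypothetical and serve only as an illustration of the logic; they do not correspond to any existing package.

```python
def count_estimated_parameters(n_coefficients, use_likelihood,
                               n_distribution=1, n_fixed=0, n_derived=0):
    """Number of independent estimated parameters k.

    n_coefficients: coefficients and initial components in the model.
    use_likelihood: True for MLE, False for MSE-based estimation.
    n_distribution: scale, shape and shift parameters of the distribution
                    (counted only when the likelihood is used).
    n_fixed:        coefficients provided or stuck at a boundary value.
    n_derived:      coefficients that are functions of other parameters.
    """
    k = n_coefficients - n_fixed - n_derived
    if k < 0:
        raise ValueError("more fixed/derived coefficients than coefficients")
    if use_likelihood:
        k += n_distribution
    return k

# Example 2.1: simple regression via MLE, Normal distribution.
print(count_estimated_parameters(2, use_likelihood=True))                # 3
# Example 2.2: the same regression via the minimum of MSE.
print(count_estimated_parameters(2, use_likelihood=False))               # 2
# Example 2.3: MSE with beta_1 stuck at the boundary value of 1.
print(count_estimated_parameters(2, use_likelihood=False, n_fixed=1))    # 1
# Example 2.4: MLE with beta_1 being a function of beta_0 and sigma^2.
print(count_estimated_parameters(2, use_likelihood=True, n_derived=1))   # 2
```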

We will come back to the number of parameters later in this textbook, when we discuss specific models.

A final note: typically, the scale, shape and shift parameters that maximise the likelihood are biased and do not coincide with the ones used in OLS. For example, in the case of the Normal distribution, the OLS estimate of variance has \(T-k\) in the denominator, while the likelihood one has just \(T\). This needs to be taken into account when the variance is used for inference or forecasting.
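The difference between the two variance estimates can be shown with a short Python sketch. The residuals below are made up, and \(k\) is assumed to be the number of estimated coefficients of the model (here, 2 for a simple linear regression).

```python
def variance_mle(residuals):
    """Likelihood-based (biased) estimate: divides by T."""
    return sum(e * e for e in residuals) / len(residuals)

def variance_ols(residuals, k):
    """OLS (bias-corrected) estimate: divides by T - k."""
    T = len(residuals)
    if T <= k:
        raise ValueError("need more observations than parameters")
    return sum(e * e for e in residuals) / (T - k)

# Hypothetical residuals from a model with k = 2 coefficients, T = 10.
residuals = [0.5, -0.5, 1.0, -1.0, 0.5, -0.5, 1.0, -1.0, 0.0, 0.0]
print(variance_mle(residuals))      # 0.5
print(variance_ols(residuals, 2))   # 0.625
```

The ratio of the two estimates is \(\frac{T}{T-k}\), so the difference is noticeable on small samples and becomes negligible on large ones.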


Akaike, H. 1974. “A new look at the statistical model identification.” IEEE Transactions on Automatic Control 19 (6): 716–23. doi:10.1109/TAC.1974.1100705.

Burnham, Kenneth P, and David R Anderson. 2004. Model Selection and Multimodel Inference. Springer New York. doi:10.1007/b97636.

McQuarrie, Allan D. 1999. “A small-sample correction for the Schwarz SIC model selection criterion.” Statistics & Probability Letters 44 (1): 79–86. doi:10.1016/S0167-7152(98)00294-6.

Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” The Annals of Statistics 6 (2): 461–64. doi:10.1214/aos/1176344136.

Sugiura, Nariaki. 1978. “Further analysis of the data by Akaike’s information criterion and the finite corrections.” Communications in Statistics - Theory and Methods 7 (1): 13–26. doi:10.1080/03610927808827599.

Wikipedia. 2020d. “Likelihood-Ratio Test.” Wikipedia.