1.4 Models, methods and typical assumptions

While we do not aim to cover the topic of models, methods, and typical assumptions of statistical models in full, we need to introduce several important definitions to clarify what we will discuss in this textbook. For a more detailed discussion, see Chapters 1 and 12 of Svetunkov (2021b).

The Cambridge Dictionary (Dictionary, 2021) defines a method as ‘a particular way of doing something’. A method, therefore, does not necessarily explain how the structure of the data appears or how the error term interacts with it; it only describes how a value is produced. In our context, a forecasting method is a formula that generates point forecasts based on some parameters and the available data. It does not explain what underlies the data.

A statistical model, on the other hand, is a ‘mathematical representation of a real phenomenon with a complete specification of distribution and parameters’ (Svetunkov and Boylan, 2019). It explains what happens inside the data, revealing its structure and showing how the error term interacts with that structure.
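To make the distinction more tangible, consider the Naïve forecasting method together with the Random Walk model (this specific pairing is used here purely as an illustration):

\[\begin{equation*} \text{Naïve method: } \hat{y}_{t+1} = y_t, \qquad \text{Random Walk model: } y_t = y_{t-1} + \epsilon_t . \end{equation*}\]

The former only tells us how a point forecast is produced (take the last observed value), while the latter also specifies how the error term \(\epsilon_t\) interacts with the structure.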

While discussing statistical models, we should also define the true model. It is “the idealistic statistical model that is correctly specified (has all the necessary components in the correct form), and applied to the data in population” (Svetunkov, 2021b). Some statisticians also use the term Data Generating Process (DGP) when discussing the true model. Still, we need to distinguish between the two terms, as DGP implies that the data is somehow generated using a mathematical formula. In real life, the data is not generated from any function; it comes from the measurement of a complex process, influenced by many factors (e.g. the behaviour of a group of customers based on their individual preferences and mental states). The DGP is useful when we want to conduct experiments on simulated data in a controlled environment, but it is not helpful when applying models to real data. Finally, the true model is an abstract notion, because it is never known or reachable. But it is still a useful one, as it allows us to see what would happen if we knew the model and, more importantly, what would happen when the model we use is wrong.
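As a brief sketch of what a ‘controlled environment’ means in practice (the specific numbers and the R code here are assumptions of this example, not part of the text), one could generate data from a known DGP and then apply models to it, knowing what the ‘truth’ is:

```r
# Generating data from a known DGP of the form y_t = mu + epsilon_t,
# so that any model applied to it can be judged against the known truth.
set.seed(41)
obs <- 120
mu  <- 1000                     # the true structure, unknown in real life
y   <- mu + rnorm(obs, 0, 50)   # the DGP: structure plus Normal noise
```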

Related to this definition is the estimated or applied model, which is the statistical model that is applied to the available sample of data. This model will almost always be wrong, because even if, for some mysterious reason, we knew the specification of the true model, we would still need to estimate it on our data. In this case, the estimates of parameters would differ from those in the population, and thus the model would still be wrong.

Mathematically, in the simplest case the true model can be written as:

\[\begin{equation} y_t = \mu_{y,t} + \epsilon_t, \tag{1.4} \end{equation}\]

where \(y_t\) is the actual value, \(\mu_{y,t}\) is the structure, and \(\epsilon_t\) is the true noise. If we manage to capture the structure correctly, the model applied to the sample of data can be written as:

\[\begin{equation} y_t = \hat{\mu}_{y,t} + e_t, \tag{1.5} \end{equation}\]

where \(\hat{\mu}_{y,t}\) is the estimate of the structure \(\mu_{y,t}\) and \(e_t\) is the estimate of the noise \(\epsilon_t\) (also known as the “residuals”). Even if the structure is captured correctly, there will still be a difference between (1.4) and (1.5), because the latter is estimated on a sample of data. However, if the sample size increases and we use an adequate estimation procedure, then, due to the Central Limit Theorem (see Chapter 4 of Svetunkov, 2021b), the distance between the two models will decrease, and asymptotically (with the increase of the sample size) \(e_t\) will converge to \(\epsilon_t\). This does not happen automatically: several assumptions need to hold for it to happen.
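This convergence can be illustrated with a small simulation (a sketch under the assumption of the simplest possible structure, \(\mu_{y,t} = \mu\), estimated via the sample mean):

```r
# Estimate the simplest model y_t = mu + e_t on samples of increasing size
# and check how close the residuals e_t get to the true noise epsilon_t.
set.seed(41)
mu <- 1000
for (obs in c(50, 500, 5000)) {
  epsilon <- rnorm(obs, 0, 50)   # the true noise
  y       <- mu + epsilon        # data from the true model (1.4)
  muHat   <- mean(y)             # estimate of the structure
  e       <- y - muHat           # residuals of the applied model (1.5)
  cat("obs =", obs, "; mean |e - epsilon| =",
      round(mean(abs(e - epsilon)), 3), "\n")
}
```

The average distance between the residuals and the true noise shrinks as the sample size grows, because the estimate of the structure gets closer to the true one.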

1.4.1 Assumptions of statistical models

Very roughly, the typical assumptions of statistical models can be split into the following categories (Svetunkov, 2021b):

  1. Model is correctly specified:
     a. We have not omitted important variables from the model (underfitting the data);
     b. We do not have redundant variables in the model (overfitting the data);
     c. The necessary transformations of the variables are applied;
     d. We do not have outliers in the model;
  2. Residuals are independent and identically distributed (i.i.d.):
     a. There is no autocorrelation in the residuals;
     b. The residuals are homoscedastic;
     c. The expectation of residuals is zero, no matter what;
     d. The variable follows the assumed distribution;
     e. More generally speaking, the distribution of residuals does not change over time;
  3. The explanatory variables are not correlated with anything but the response variable:
     a. No multicollinearity;
     b. No endogeneity.

Many of these assumptions come down to the idea that we have correctly captured the structure: we have not omitted any essential variables, we have not included redundant ones, and we have transformed all the variables correctly (e.g. took logarithms, where needed). If all of these assumptions hold, then we would expect the applied model to converge to the true one with the increase of the sample size. If some of them do not hold, then the point forecasts from our model might be biased, or we might end up producing wider (or narrower) prediction intervals than needed.
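In practice, some of the assumptions from group (2) can be inspected visually via the residuals of an estimated model. The sketch below (the regression in it is made up purely for illustration) shows a few standard checks:

```r
# Fit a simple regression and inspect its residuals against some of the
# assumptions listed above (group 2).
set.seed(41)
x <- rnorm(100, 10, 2)
y <- 50 + 3 * x + rnorm(100, 0, 5)
fit <- lm(y ~ x)

acf(resid(fit))                          # (2a) autocorrelation in the residuals
plot(fitted(fit), resid(fit))            # (2b) heteroscedasticity, (2c) non-zero mean
qqnorm(resid(fit)); qqline(resid(fit))   # (2d) the assumed (Normal) distribution
```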

These assumptions, together with their implications, are discussed in detail on the example of multiple regression in Chapter 12 of Svetunkov (2021b). The diagnostics of dynamic models based on these assumptions are discussed in Chapter 14.

References

• Dictionary, 2021. Method. https://dictionary.cambridge.org/dictionary/english/method (version: 2021-09-02)
• Svetunkov, I., 2021b. Statistics for business analytics. https://openforecast.org/sba/ (version: 01.10.2021)
• Svetunkov, I., Boylan, J.E., 2019. Multiplicative state-space models for intermittent time series. https://doi.org/10.13140/RG.2.2.35897.06242