
1.4 Models, methods and typical assumptions

While we do not aim to fully cover the topic of models, methods and typical assumptions of statistical models, we need to make several important definitions, so that it becomes clear what we will discuss in this textbook. For a more detailed discussion, see Chapters 1 and 12 of Svetunkov (2021c).

The Cambridge Dictionary (Dictionary, 2021) defines a method as a particular way of doing something. So, a method does not necessarily explain how the structure appears or how the error term interacts with it; it only explains how a value is produced. In our context, a forecasting method would be a formula that generates point forecasts based on some parameters and available data. It would not explain what underlies the data.

A statistical model, on the other hand, is a ‘mathematical representation of a real phenomenon with a complete specification of distribution and parameters’ (Svetunkov and Boylan, 2019). It explains what happens inside the data, reveals the structure and shows how the error term interacts with the structure.
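
The distinction can be made concrete with a minimal sketch in Python. The function names `mean_method` and `mean_model`, the toy data and the choice of a normal distribution are assumptions made purely for illustration; they do not come from the cited sources. The first function only produces a value, while the second specifies a distribution and its parameters, which is what makes quantities such as prediction intervals available.

```python
from statistics import NormalDist, mean, stdev


def mean_method(history):
    """Forecasting method: returns a point forecast and nothing else."""
    return mean(history)


def mean_model(history, level=0.95):
    """Statistical model y_t = mu + epsilon_t with epsilon_t ~ N(0, sigma^2).

    The normal distribution is an assumption made for this illustration.
    """
    mu_hat = mean(history)        # estimate of the structure
    sigma_hat = stdev(history)    # sample standard deviation around mu_hat
    dist = NormalDist(mu_hat, sigma_hat)
    # Simplified interval: it ignores the uncertainty of the estimates.
    lower = dist.inv_cdf((1 - level) / 2)
    upper = dist.inv_cdf((1 + level) / 2)
    return {"point": mu_hat, "lower": lower, "upper": upper}


history = [102, 98, 101, 97, 103, 99, 100, 100]  # hypothetical data
point = mean_method(history)
forecast = mean_model(history)
print(point)
print(forecast)
```

Both functions return the same point forecast, but only the second one says anything about the error term, and it does so only because a distribution was assumed.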

While we discuss statistical models, we should also define the true model. It is “the idealistic statistical model that is correctly specified (has all the necessary components in correct form), and applied to the data in population” (Svetunkov, 2021c). Note that some statisticians also use the term Data Generating Process (DGP) when talking about the true model, but I think that we need to make a distinction between the two terms, as DGP implies that the data is somehow generated using a mathematical formula. In real life, the data is not generated from any function; it comes from a measurement of a complex process, influenced by many factors (e.g. behaviour of a group of customers based on their individual preferences and mental states). The DGP is useful when we want to conduct experiments on simulated data in a controlled environment, but it is not helpful when it comes to applying models to real data. Finally, the true model is an abstract notion, because it is never known or reachable. But it is still a useful one, as it allows us to see not only what would happen if we knew the model, but more importantly what would happen when the model we use is wrong.

Related to this definition is the estimated or applied model, which is the statistical model that is applied to the available sample of data. This model will almost always be wrong, because even if for some mysterious reason we knew the specification of the true model, we would still need to estimate it on our data. In this case, the estimates of parameters would differ from the ones in the population, and thus the model would still be wrong.

Mathematically, in the simplest case the true model can be written as: \[\begin{equation} y_t = \mu_{y,t} + \epsilon_t, \tag{1.4} \end{equation}\] where \(y_t\) is the actual value, \(\mu_{y,t}\) is the structure and \(\epsilon_t\) is the true noise. If we manage to capture the structure correctly, the model applied to the sample of data would be written as: \[\begin{equation} y_t = \hat{\mu}_{y,t} + e_t, \tag{1.5} \end{equation}\] where \(\hat{\mu}_{y,t}\) is the estimate of the structure \(\mu_{y,t}\) and \(e_t\) is the estimate of the noise \(\epsilon_t\) (also known as “residuals”). Even if the structure is captured correctly, there would still be a difference between (1.4) and (1.5), because the latter is estimated on a sample of data. However, if the sample size increases and we use an adequate estimation procedure, then due to the Central Limit Theorem (see Chapter 4 of Svetunkov, 2021c) the distance between the two models will decrease, and asymptotically (with the increase of the sample size) \(e_t\) would converge to \(\epsilon_t\). This does not happen automatically: several assumptions should hold for this to happen.
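
A small simulation can illustrate this convergence. It is only a sketch under strong assumptions: the structure is a constant level, the noise is normal, and the values of `MU`, `SIGMA`, the sample sizes and the number of replications are arbitrary choices made for illustration. In this setting \(e_t - \epsilon_t = \mu_{y,t} - \hat{\mu}_{y,t}\) for every observation, so the distance between the residuals and the true noise equals the estimation error of the level.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so that the run is reproducible

MU = 100.0    # true structure: a constant level (hypothetical value)
SIGMA = 10.0  # standard deviation of the true noise (hypothetical value)


def estimation_gap(sample_size):
    """Simulate one sample and return |e_t - epsilon_t| = |mu - mu_hat|."""
    epsilon = [random.gauss(0.0, SIGMA) for _ in range(sample_size)]
    y = [MU + eps for eps in epsilon]             # true model (1.4)
    mu_hat = mean(y)                              # estimate of the structure
    residuals = [value - mu_hat for value in y]   # applied model (1.5)
    # The gap is the same for every t, so the first observation is enough.
    return abs(residuals[0] - epsilon[0])


def average_gap(sample_size, replications=200):
    """Average the gap over several simulated samples to smooth out randomness."""
    return mean([estimation_gap(sample_size) for _ in range(replications)])


average_gaps = {n: average_gap(n) for n in (10, 100, 1000)}
for n, gap in average_gaps.items():
    print(f"n = {n:>4}: average |e_t - epsilon_t| = {gap:.3f}")
```

With these settings the average gap should shrink roughly in proportion to \(1/\sqrt{n}\), which is the behaviour expected when the structure is specified correctly and the noise is well behaved.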

1.4.1 Assumptions of statistical models

Very roughly, the typical assumptions of statistical models can be split into the following categories (Svetunkov, 2021c):

  1. Model is correctly specified:
     1. We have not omitted important variables in the model (underfitting the data);
     2. We do not have redundant variables in the model (overfitting the data);
     3. The necessary transformations of the variables are applied;
     4. We do not have outliers in the model;
  2. Residuals are independent and identically distributed (i.i.d.):
     1. There is no autocorrelation in the residuals;
     2. The residuals are homoscedastic;
     3. The expectation of residuals is zero, no matter what;
     4. The variable follows the assumed distribution;
     5. More generally speaking, distribution of residuals does not change over time;
  3. The explanatory variables are not correlated with anything but the response variable:
     1. No multicollinearity;
     2. No endogeneity.
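
Some of the assumptions about residuals in the list above can be checked with simple summary statistics. The sketch below is not a formal diagnostic procedure: the helper names (`lag1_autocorrelation`, `variance_ratio`, `summarise`) and the three simulated residual series are hypothetical and chosen only to show what a violation looks like in numbers.

```python
import random
from statistics import mean, pvariance


def lag1_autocorrelation(residuals):
    """Sample autocorrelation of the residuals at lag 1."""
    centre = mean(residuals)
    deviations = [e - centre for e in residuals]
    numerator = sum(a * b for a, b in zip(deviations[:-1], deviations[1:]))
    denominator = sum(d * d for d in deviations)
    return numerator / denominator


def variance_ratio(residuals):
    """Variance of the second half divided by variance of the first half."""
    half = len(residuals) // 2
    return pvariance(residuals[half:]) / pvariance(residuals[:half])


def summarise(residuals):
    """Collect rough indicators for zero mean, no autocorrelation, homoscedasticity."""
    return {
        "mean": mean(residuals),
        "acf1": lag1_autocorrelation(residuals),
        "variance_ratio": variance_ratio(residuals),
    }


random.seed(1)
n = 2000

# Well-behaved residuals: independent, zero mean, constant variance.
iid = [random.gauss(0, 1) for _ in range(n)]

# Autocorrelated residuals: each value depends on the previous one.
autocorrelated = [0.0]
for _ in range(n - 1):
    autocorrelated.append(0.8 * autocorrelated[-1] + random.gauss(0, 1))

# Heteroscedastic residuals: the standard deviation grows over time.
heteroscedastic = [random.gauss(0, 1 + 3 * t / n) for t in range(n)]

summaries = {
    "iid": summarise(iid),
    "autocorrelated": summarise(autocorrelated),
    "heteroscedastic": summarise(heteroscedastic),
}
for name, summary in summaries.items():
    print(name, {key: round(value, 3) for key, value in summary.items()})
```

For the well-behaved series the lag-1 autocorrelation should be close to zero and the variance ratio close to one, while the other two series should deviate clearly on the corresponding indicator.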

Many of these assumptions come down to the idea that we have correctly captured the structure, meaning that we have not omitted any important variables, we have not included redundant ones and we have transformed all the variables in the correct way (e.g. taken logarithms where needed). If all these assumptions hold, then we would expect the applied model to converge to the true one with the increase of the sample size. If some of them do not hold, then the point forecasts from our model might be biased, or we might end up producing prediction intervals that are wider (or narrower) than needed.
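
The effect of one violated assumption, an omitted variable, can be sketched with simulated data. Everything in the example is an assumption made for illustration: the coefficients, the sample size, the normal noise and the helper functions `ols_slope` and `residualise`. The full model is estimated through the Frisch-Waugh-Lovell partialling-out approach only to keep the code free of external libraries.

```python
import random
from statistics import mean


def ols_slope(x, y):
    """Slope of a simple linear regression of y on x (with an intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    covariance = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    variance = sum((a - x_bar) ** 2 for a in x)
    return covariance / variance


def residualise(target, regressor):
    """Residuals from a simple linear regression of target on regressor."""
    slope = ols_slope(regressor, target)
    intercept = mean(target) - slope * mean(regressor)
    return [t - intercept - slope * r for t, r in zip(target, regressor)]


random.seed(7)
n = 5000

# Hypothetical "true model": y = 10 + 2 * x1 + 3 * x2 + noise,
# where x2 is correlated with x1.
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * a + random.gauss(0, 1) for a in x1]
y = [10 + 2 * a + 3 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# Underfitted model: x2 is omitted, so x1 picks up part of its effect.
slope_omitted = ols_slope(x1, y)

# Correctly specified model: the coefficient of x1 after partialling out x2.
slope_full = ols_slope(residualise(x1, x2), residualise(y, x2))

print(f"slope of x1, x2 omitted:  {slope_omitted:.3f}")
print(f"slope of x1, x2 included: {slope_full:.3f}")
```

In this setup the slope from the underfitted model is expected to be close to 2 + 3 × 0.5 = 3.5 rather than 2, and increasing the sample size does not remove this bias, so any point forecast built on it inherits the error.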

These assumptions and their implications are discussed in detail, using the example of multiple regression, in Chapter 12 of Svetunkov (2021c). The diagnostics of dynamic models based on these assumptions are discussed in Chapter 14.


• Dictionary, 2021. Method. (version: 2021-09-02)
• Svetunkov, I., 2021c. Statistics for business analytics. (version: 2021-09-01)
• Svetunkov, I., Boylan, J.E., 2019. Multiplicative state-space models for intermittent time series.