## 2.2 Typical assumptions of statistical models

For a statistical model to work adequately and not fail on new data, several assumptions about the model, as applied to the data, should hold. If they do not, the model might produce biased or inefficient estimates and forecasts. Here we briefly discuss the main ones:

1. Model is correctly specified;
2. The expectation of residuals is zero, no matter what;
3. Residuals are independent and identically distributed (i.i.d.);
4. The explanatory variables are not correlated with anything but the response variable;
5. The residuals follow the specified distribution.

### 2.2.1 Model is correctly specified

This implies that:

1. We have not omitted important variables in the model (underfitting the data);
2. We do not have redundant variables in the model (overfitting the data);
3. The necessary transformations of the variables are applied;
4. We do not have outliers in the model.

(1): if we omit important variables from the model, then the estimates of the parameters might be biased, in some cases quite seriously (e.g. a positive sign instead of a negative one). This also means that the point forecasts from the model might be biased (systematic under- or over-forecasting).
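To make the sign flip concrete, here is a minimal simulation sketch. The data generating process, the coefficients (2 for `x1`, -3 for `x2`) and the seed are all hypothetical choices made for illustration. Because `x2` moves together with `x1`, the slope in the model without `x2` should land near 2 - 3 × 0.8 = -0.4 rather than near the true value of 2.

```python
import random


def cov(a, b):
    """Sample covariance of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)


rng = random.Random(42)  # fixed seed so that the run is reproducible
n = 5000
x1 = [rng.gauss(0, 1) for _ in range(n)]
# x2 is correlated with x1, so omitting it contaminates the estimate for x1
x2 = [0.8 * a + rng.gauss(0, 0.6) for a in x1]
# Assumed true model: y = 1 + 2*x1 - 3*x2 + noise
y = [1 + 2 * a - 3 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

# Misspecified model y ~ x1: the slope absorbs part of the effect of x2
slope_omitted = cov(x1, y) / cov(x1, x1)

# Full model y ~ x1 + x2: closed-form OLS slope of x1 with two regressors
s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
slope_full = (s22 * cov(x1, y) - s12 * cov(x2, y)) / (s11 * s22 - s12 ** 2)

print(f"slope of x1 without x2: {slope_omitted:.3f}")
print(f"slope of x1 with x2:    {slope_full:.3f}")
```

In this setup the estimate from the smaller model has the wrong sign, while the full model recovers a value close to 2.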

(2): if there are redundant variables that are not needed in the model, then the estimates of parameters and point forecasts might be unbiased, but inefficient. This implies that the variance of the parameter estimates can be higher than needed and, as a result, the prediction intervals can be wider than needed.
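The loss of efficiency can be illustrated with a small Monte Carlo sketch. Everything here is an assumption made for the example: the true model contains only `x1`, while `x2` is a redundant variable that is strongly correlated with `x1`. Across repeated samples both estimators of the slope of `x1` should be centred near the true value of 2, but the one from the larger model should vary noticeably more.

```python
import random


def cov(a, b):
    """Sample covariance of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)


rng = random.Random(1)  # fixed seed so that the run is reproducible
n, reps = 50, 500
slopes_small, slopes_big = [], []

for _ in range(reps):
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    # Redundant variable: correlated with x1 (about 0.9), absent from the true model
    x2 = [0.9 * a + 0.436 * rng.gauss(0, 1) for a in x1]
    y = [1 + 2 * a + rng.gauss(0, 1) for a in x1]

    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    # Correct model y ~ x1
    slopes_small.append(s1y / s11)
    # Overfitted model y ~ x1 + x2 (closed-form OLS slope of x1)
    slopes_big.append((s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2))

mean_small = sum(slopes_small) / reps
mean_big = sum(slopes_big) / reps
var_small = cov(slopes_small, slopes_small)
var_big = cov(slopes_big, slopes_big)

print(f"y ~ x1:      mean slope {mean_small:.3f}, variance {var_small:.4f}")
print(f"y ~ x1 + x2: mean slope {mean_big:.3f}, variance {var_big:.4f}")
```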

(3): a violation here means that, for example, we apply an additive model when a multiplicative one is needed. The estimates of parameters and the point forecasts might be biased in this case as well: the model will produce a linear forecast trajectory when a non-linear one is needed.
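As a rough illustration, consider a hypothetical series that grows by 5% per period with multiplicative noise. The sketch below fits a linear trend to the original data (additive model) and a linear trend to the logarithms (multiplicative model), and then compares the two forecasts 20 steps beyond the sample. The growth rate, noise level and horizon are assumptions chosen for the example.

```python
import math
import random


def ols_simple(x, y):
    """Intercept and slope of a simple linear regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return my - slope * mx, slope


rng = random.Random(7)  # fixed seed so that the run is reproducible
t = list(range(1, 61))
# Assumed multiplicative process: level 10, growth of 5% per period
y = [10 * 1.05 ** k * math.exp(rng.gauss(0, 0.05)) for k in t]

# Additive model: linear trend in the original scale
a_add, b_add = ols_simple(t, y)
# Multiplicative model: linear trend in logarithms
a_log, b_log = ols_simple(t, [math.log(v) for v in y])

h = 80  # forecast 20 steps beyond the end of the sample
true_level = 10 * 1.05 ** h
forecast_additive = a_add + b_add * h
forecast_multiplicative = math.exp(a_log + b_log * h)

print(f"noise-free level at t={h}: {true_level:.1f}")
print(f"additive forecast:         {forecast_additive:.1f}")
print(f"multiplicative forecast:   {forecast_multiplicative:.1f}")
```

The additive model extrapolates a straight line and falls well short of the exponentially growing level, while the model in logarithms follows the curvature.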

(4): in a way, this is similar to (1): the presence of outliers might mean that we have missed some important information, in which case the estimates of parameters and forecasts would be biased as well. There can be other reasons for outliers too. For example, we might be using a wrong distributional assumption. If so, this would imply that the prediction intervals from the model are narrower than needed.

### 2.2.2 The expectation of residuals is zero, no matter what

While this holds automatically in sample in many cases (e.g. when using the Least Squares method), the assumption might be violated in the holdout sample. In this case the point forecasts would be biased, because they typically do not take the non-zero mean of the forecast error into account, and the prediction interval might be off as well, because of the wrong estimation of the scale of the distribution (e.g. variance is higher than needed).

This assumption also implies that the expectation of residuals is zero even conditional on the explanatory variables in the model. If it is not, then this might mean that there is still some important information omitted in the applied model.
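The difference between the in-sample and the holdout behaviour can be sketched as follows. We fit a simple regression with an intercept by Least Squares, so the in-sample residuals average to zero up to rounding error. The holdout is then generated with a level shift of +5, which is a purely hypothetical structural change that the model knows nothing about.

```python
import random


def ols_simple(x, y):
    """Intercept and slope of a simple linear regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return my - slope * mx, slope


rng = random.Random(3)  # fixed seed so that the run is reproducible

# In-sample data from the assumed model y = 1 + 2*x + noise
x_in = [rng.uniform(0, 10) for _ in range(200)]
y_in = [1 + 2 * v + rng.gauss(0, 1) for v in x_in]
intercept, slope = ols_simple(x_in, y_in)

resid_in = [b - (intercept + slope * a) for a, b in zip(x_in, y_in)]
mean_resid_in = sum(resid_in) / len(resid_in)

# Hypothetical holdout in which the level has shifted from 1 to 6
x_out = [rng.uniform(0, 10) for _ in range(100)]
y_out = [6 + 2 * v + rng.gauss(0, 1) for v in x_out]
errors_out = [b - (intercept + slope * a) for a, b in zip(x_out, y_out)]
mean_error_out = sum(errors_out) / len(errors_out)

print(f"mean in-sample residual:     {mean_resid_in:.2e}")
print(f"mean holdout forecast error: {mean_error_out:.3f}")
```

The mean forecast error in the holdout stays close to the size of the shift, which is exactly the systematic bias described above.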

### 2.2.3 Residuals are i.i.d.

There are two assumptions in this group:

1. There is no autocorrelation in the residuals;
2. The residuals are homoscedastic.

(1): we expect that the model captures all the important aspects, so if the residuals are autocorrelated, then something is neglected by the applied model. Typically, this leads to inefficient estimates of parameters and in some cases they can also become biased. As a result, the point forecasts can be less accurate than expected and the prediction intervals might be wrong (wider or narrower than needed).
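A simple way to look for neglected dynamics is to compute the lag-1 autocorrelation of the residuals. In the sketch below a constant-level model (the sample mean) is applied to two hypothetical series: one is white noise around a level, the other has AR(1) dynamics with a coefficient of 0.7 that the model ignores. The helper `lag1_autocorr` is written for this example only.

```python
import random


def lag1_autocorr(values):
    """Lag-1 sample autocorrelation of a sequence."""
    m = sum(values) / len(values)
    num = sum(
        (values[i] - m) * (values[i - 1] - m) for i in range(1, len(values))
    )
    den = sum((v - m) ** 2 for v in values)
    return num / den


rng = random.Random(11)  # fixed seed so that the run is reproducible
n = 2000

# Series 1: constant level plus white noise
y_white = [10 + rng.gauss(0, 1) for _ in range(n)]

# Series 2: constant level plus AR(1) dynamics, which the model neglects
state = [0.0]
for _ in range(n - 1):
    state.append(0.7 * state[-1] + rng.gauss(0, 1))
y_ar = [10 + s for s in state]


def residuals_of_level_model(y):
    """Residuals of the model 'y = level + error', estimated by the mean."""
    level = sum(y) / len(y)
    return [v - level for v in y]


r1_white = lag1_autocorr(residuals_of_level_model(y_white))
r1_ar = lag1_autocorr(residuals_of_level_model(y_ar))

print(f"lag-1 autocorrelation, white noise series: {r1_white:.3f}")
print(f"lag-1 autocorrelation, AR(1) series:       {r1_ar:.3f}")
```

A value far from zero for the second series signals that the model has left some structure in the residuals.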

(2): if this is violated, then we say that there is heteroscedasticity in the model. This means that the variance of the residuals changes as the value of some variable changes. If the model neglects this, then typically the estimates of parameters become inefficient and prediction intervals are wrong: they are wider than needed in some cases and narrower than needed in others.
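An informal check for heteroscedasticity is to compare the spread of the residuals for low and high values of an explanatory variable. The sketch below assumes a hypothetical process in which the standard deviation of the error grows proportionally to `x`. It is only a rough diagnostic in the spirit of splitting the sample, not a formal statistical test.

```python
import random


def ols_simple(x, y):
    """Intercept and slope of a simple linear regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return my - slope * mx, slope


def mean_square(values):
    """Average of squared values, used as a rough variance of residuals."""
    return sum(v * v for v in values) / len(values)


rng = random.Random(5)  # fixed seed so that the run is reproducible
n = 2000
x = [rng.uniform(1, 10) for _ in range(n)]
# Assumed process: the standard deviation of the error is 0.5 * x
y = [1 + 2 * v + rng.gauss(0, 0.5 * v) for v in x]

intercept, slope = ols_simple(x, y)

# Sort residuals by x and split them into a low-x and a high-x half
pairs = sorted((a, b - (intercept + slope * a)) for a, b in zip(x, y))
resid_low = [r for _, r in pairs[: n // 2]]
resid_high = [r for _, r in pairs[n // 2:]]

variance_ratio = mean_square(resid_high) / mean_square(resid_low)

print(f"estimated slope: {slope:.3f}")
print(f"residual variance ratio (high x / low x): {variance_ratio:.2f}")
```

The slope remains close to its true value of 2, but a ratio well above one shows that a single variance for all observations would give intervals that are too wide for small `x` and too narrow for large `x`.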

### 2.2.4 The explanatory variables are not correlated with anything but the response variable

There are two cases here as well:

1. No multicollinearity;
2. No endogeneity.

(1): multicollinearity implies that the variables included in the model are linearly dependent on each other. In this case, it becomes difficult to distinguish the effect of one variable from the effect of another. As a result, the estimates of parameters become inefficient and might become biased in some severe cases. In forecasting, the effect is not as straightforward: in some cases multicollinearity might not damage the point forecasts, but it can lead to prediction intervals of an incorrect width.
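One common way to quantify multicollinearity is the variance inflation factor (VIF). With only two explanatory variables it reduces to 1 / (1 - r²), where r is the correlation between them. The sketch below computes it for a nearly collinear pair and for an independent pair; the variables and the noise level are hypothetical, and the helper `vif_two_regressors` covers only this two-variable case.

```python
import random


def corr(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5


def vif_two_regressors(a, b):
    """VIF for a model with exactly two explanatory variables."""
    r = corr(a, b)
    return 1 / (1 - r * r)


rng = random.Random(21)  # fixed seed so that the run is reproducible
n = 1000
x1 = [rng.gauss(0, 1) for _ in range(n)]
# Nearly a copy of x1: strong multicollinearity
x2_collinear = [a + rng.gauss(0, 0.1) for a in x1]
# Generated independently of x1: no multicollinearity
x2_independent = [rng.gauss(0, 1) for _ in range(n)]

vif_collinear = vif_two_regressors(x1, x2_collinear)
vif_independent = vif_two_regressors(x1, x2_independent)

print(f"VIF, nearly collinear pair: {vif_collinear:.1f}")
print(f"VIF, independent pair:      {vif_independent:.3f}")
```

The VIF shows by how much the variance of a parameter estimate is inflated relative to the case of uncorrelated explanatory variables, so a large value indicates inefficient estimates.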

(2): endogeneity refers to the situation in which the dependent variable $$y_t$$ influences the explanatory variable $$x_t$$ on the same observation. The relation in this case becomes bi-directional, meaning that the basic model is no longer appropriate. The parameters and forecasts will typically be biased, and either a different estimation method or a different model is needed to fix this.
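The bias caused by a bi-directional relation can be seen in a small simulation. The sketch assumes a hypothetical system of two equations, `y = beta * x + u` and `x = gamma * y + v`, with `beta = 0.5`, `gamma = 0.8` and independent standard Normal errors. Solving the system for `x` and `y` gives the values used to generate the data. For these particular settings the Least Squares slope should settle near (0.5 + 0.8) / (1 + 0.64) ≈ 0.79 instead of 0.5.

```python
import random


def cov(a, b):
    """Sample covariance of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


rng = random.Random(9)  # fixed seed so that the run is reproducible
n = 5000
beta, gamma = 0.5, 0.8  # assumed effects: x on y and y on x, respectively

x, y = [], []
for _ in range(n):
    u, v = rng.gauss(0, 1), rng.gauss(0, 1)
    # Solution of the system y = beta*x + u, x = gamma*y + v
    x.append((gamma * u + v) / (1 - beta * gamma))
    y.append((beta * v + u) / (1 - beta * gamma))

# Least Squares slope of y on x, ignoring the feedback from y to x
slope_ols = cov(x, y) / cov(x, x)

print(f"true effect of x on y: {beta}")
print(f"Least Squares slope:   {slope_ols:.3f}")
```

Increasing the sample size does not help here: the estimate converges to the wrong value, because `x` is correlated with the error term `u`.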

### 2.2.5 The residuals follow the specified distribution

Finally, in some cases we are interested in using methods that imply specific distributional assumptions about the model and its residuals. For example, it is assumed in the classical linear model that the error term follows the Normal distribution. Estimating this model using MLE with the probability density function of the Normal distribution or via minimisation of Mean Squared Error (MSE) would give efficient and consistent estimates of parameters. If the assumption of normality does not hold, then the estimates might be inefficient and in some cases inconsistent. When it comes to forecasting, the main issue with a wrong distributional assumption arises when prediction intervals are needed: they might rely on a wrong distribution and be narrower or wider than needed. In addition, if we deal with the wrong distribution, then the model selection mechanism might be flawed and lead to the selection of an inappropriate model.

Throughout most of this textbook, we assume that all of these assumptions hold. Where they do not, we will say explicitly which ones are violated and what needs to be done in those situations.

Now that we have a basic understanding of these statistical terms, we can move to the next topic, distributions.