
## 15.3 The explanatory variables are not correlated with anything but the response variable

There are two assumptions in this group: no multicollinearity and no endogeneity.

Technically speaking, neither of these is an assumption, but rather a potential issue of a model. This is because they have nothing to do with properties of the “true model”. Indeed, it is unreasonable to assume that the explanatory variables do not have any relation between them or that they are not impacted by the response variable – they are what they are. However, the two issues cause difficulties in estimating parameters of models and lead to issues with the estimates of parameters. So, they are worth discussing.

### 15.3.1 Multicollinearity

Multicollinearity appears when either some of the explanatory variables are correlated with each other (see Section 9.3), or a linear combination of them explains another explanatory variable included in the model. Depending on the strength of this relation and the estimation method used for model construction, multicollinearity might cause issues of varying severity. For example, when two variables are perfectly correlated (the correlation coefficient is equal to 1 or -1), the model will have perfect multicollinearity, and it will not be possible to estimate its parameters. Another example is the case when an explanatory variable can be perfectly explained by a set of other explanatory variables (resulting in \(R^2\) being equal to one), which causes exactly the same issue. The classical example of this situation is the dummy variables trap (see Section 13), when all values of a categorical variable are included in regression together with the constant, resulting in the linear relation \(\sum_{j=1}^k d_j = 1\). Given that the square root of the \(R^2\) of a linear regression is equal to the multiple correlation coefficient, these two situations are equivalent and come down to “the absolute value of the correlation coefficient is equal to 1”.

Finally, if the correlation coefficient is high but not equal to one, multicollinearity will lead to less efficient estimates of parameters. The loss of efficiency in this case is proportional to the absolute value of the correlation coefficient. In the case of forecasting, the effect is not as straightforward: in some situations it might not damage the point forecasts, but it can lead to prediction intervals of an incorrect width. The main issue of multicollinearity comes down to difficulties in estimating the model in a sample. If we had all the data in the world, the issue would not exist. All of this tells us how the problem can be diagnosed and that the diagnosis should be carried out before constructing a regression model.
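To illustrate the dummy variables trap discussed above, here is a small hypothetical sketch in R on the `mtcars` data (the names `am0` and `am1` are created purely for this example):

```r
# Create dummy variables for both values of the "am" variable
am0 <- as.numeric(mtcars$am==0)
am1 <- as.numeric(mtcars$am==1)
# am0 + am1 = 1 on every observation, replicating the constant,
# so the model below suffers from perfect multicollinearity
lm(mpg ~ am0 + am1, data=mtcars)
# lm() detects the rank deficiency and reports NA for one of the
# dummies; other estimation routines might simply fail
```

Dropping either one of the dummies (or the constant) breaks the linear relation and makes all parameters estimable again.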

First, we can calculate the correlation matrix for the available variables. If they are all numeric, then the `cor()` function from `stats` should do the trick (we remove the response variable from consideration):

```r
cor(mtcars[,-1])
```

```
## cyl disp hp drat wt qsec
## cyl 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958 -0.59124207
## disp 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799 -0.43369788
## hp 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479 -0.70822339
## drat -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406 0.09120476
## wt 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000 -0.17471588
## qsec -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159 1.00000000
## vs -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157 0.74453544
## am -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953 -0.22986086
## gear -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870 -0.21268223
## carb 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059 -0.65624923
## vs am gear carb
## cyl -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 1.0000000 0.16834512 0.2060233 -0.56960714
## am 0.1683451 1.00000000 0.7940588 0.05753435
## gear 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.5696071 0.05753435 0.2740728 1.00000000
```

This matrix tells us that there are some variables that are highly correlated and might reduce the efficiency of the estimates of parameters of the regression model if included in the model together. This mainly applies to `cyl` and `disp`, which both characterise the size of the engine. If we have a mix of numerical and categorical variables, then the `assoc()` (aka `association()`) function from `greybox` will be more appropriate (see Section 9):

```r
assoc(mtcars)
```

In order to cover the second situation, with a linear combination of variables, we can use the `determ()` (aka `determination()`) function from `greybox`:

```r
determ(mtcars[,-1])
```

```
## cyl disp hp drat wt qsec vs am
## 0.9349544 0.9537470 0.8982917 0.7036703 0.9340582 0.8671619 0.7986256 0.7848763
## gear carb
## 0.8133441 0.8735577
```

This function constructs linear regression models for each variable from all the other variables and reports the \(R^2\) of these models. If some coefficients of determination are close to one, this might indicate that the respective variables would cause multicollinearity in the model. In our case, we see that `disp` is linearly related to the other variables, and we can expect it to reduce the efficiency of the estimates of parameters. If we remove it from consideration (we do not want to include it in our model anyway), then the picture changes:

```r
determ(mtcars[,-c(1,3)])
```

```
## cyl hp drat wt qsec vs am gear
## 0.9299952 0.8596168 0.6996363 0.8384243 0.8553748 0.7965848 0.7847198 0.8121855
## carb
## 0.7680136
```

Now `cyl` has a linear relation with some other variables, so it would not be wise to include it in a model together with them. We would need to decide what to include based on our understanding of the problem.

Instead of calculating the coefficients of determination, econometricians prefer to calculate the Variance Inflation Factor (VIF), which shows by how many times the estimates of parameters lose efficiency. Its formula is based on the \(R^2\) calculated above: \[\begin{equation*} \mathrm{VIF}_i = \frac{1}{1-R_i^2} \end{equation*}\] for each model \(i\). In our case it can be calculated as:

```r
1/(1-determ(mtcars[,-c(1,3)]))
```

```
## cyl hp drat wt qsec vs am gear
## 14.284737 7.123361 3.329298 6.189050 6.914423 4.916053 4.645108 5.324402
## carb
## 4.310597
```

This is useful when you want to see the specific impact on the variance of parameters, but it is difficult to work with when it comes to model diagnostics, because the value of VIF lies between one and infinity. So, I prefer using the determination coefficients instead, which always lie in the \((0, 1)\) region and are thus easier to interpret.
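Because VIF is a one-to-one transformation of the coefficient of determination, it is always possible to move between the two scales. A small sketch of this mapping (note that the threshold of 10 mentioned in the comment is a conventional rule of thumb, not something discussed above):

```r
library(greybox)
# VIF and R^2 are connected via VIF = 1/(1-R^2),
# so the mapping between them is monotonic
vif <- 1/(1-determ(mtcars[,-c(1,3)]))
# Recover the determination coefficients back from the VIF values
1 - 1/vif
# e.g. the conventional VIF threshold of 10 corresponds to R^2 = 0.9
```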

Finally, in some cases nothing can be done about multicollinearity: it just exists, and we need to include the correlated variables in the model. This might not be a big problem, as long as we acknowledge the issues it will cause for the estimates of parameters.

### 15.3.2 Endogeneity

**Endogeneity** applies to the situation when the dependent variable \(y_j\) influences the explanatory variable \(x_j\) in the model on the same observation. The relation in this case becomes bi-directional, meaning that the basic model is no longer appropriate in this situation. The parameters and forecasts will typically be *biased*, and a different estimation method (for example, instrumental variables) or maybe even a different model would be needed in order to fix this.

In econometrics, one of the definitions of endogeneity is that the correlation between the error term and an explanatory variable is not zero, i.e. \(\mathrm{cov}(\epsilon_j, x_{i,j}) \neq 0\) for at least some variable \(x_i\). In my personal opinion, this is a very confusing definition. First, if this applies to the “true” model, then it is absurd, because by definition the error term in the true model is not related to anything (the true model is correctly specified). Second, if this applies to the applied model, then this condition does not hold in sample if OLS is used for the estimation of parameters (as discussed in Subsection 10.3). Third, even if we are talking about working with an incorrect model on the population data, OLS will guarantee that \(\mathrm{cov}(e_j, x_{i,j}) = 0\). So, the only case when this makes sense is for the relation between the explanatory variables and the forecast errors of the model on the holdout sample of data. This is why I think that this definition is not useful.

To make things even more complicated, endogeneity cannot be properly diagnosed and comes down to the judgement of the analyst: do we expect the relation between variables to be uni-directional or bi-directional? From the true model perspective, the latter might imply that we need to consider a system of equations of the style:
\[\begin{equation}
\begin{aligned}
& y_j = \beta_0 + \beta_1 x_{1,j} + \dots + \beta_{k-1} x_{k-1,j} + \epsilon_j \\
& x_{1,j} = \gamma_0 + \gamma_1 y_{j} + \gamma_{2} x_{2,j} + \dots + \gamma_{k-1} x_{k-1,j} + \upsilon_j
\end{aligned} .
\tag{15.1}
\end{equation}\]
In equation (15.1), the response variable \(y_j\) depends on the value of \(x_{1,j}\) (among other variables), but that variable depends on the value of \(y_j\) at the same time. In order to estimate such a system of equations and break this loop, an analyst would need to find an “instrumental variable” – a variable that is correlated with \(x_{1,j}\) but not correlated with the error term \(\epsilon_j\) – and then use a different estimation procedure (e.g. two-stage least squares). We do not aim to cover possible solutions to this issue, because they lie outside of the scope of this textbook, but an interested reader is referred to Chapter 12 of Hanck et al. (2022).
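To give a flavour of how the instrumental variables approach works, here is a hypothetical sketch of two-stage least squares in R. The data frame `df`, the response `y`, the endogenous variable `x1`, the exogenous variable `x2` and the instrument `z` are all assumed for the sake of illustration:

```r
# Stage 1: regress the endogenous variable on the instrument
# and the exogenous variables
stage1 <- lm(x1 ~ z + x2, data=df)
df$x1Fitted <- fitted(stage1)
# Stage 2: replace x1 by its fitted values, which no longer
# depend on the error term of the main equation
stage2 <- lm(y ~ x1Fitted + x2, data=df)
# Note: the standard errors reported by stage2 are not correct;
# in practice ivreg() from the AER package does both stages
# and produces valid inference:
# AER::ivreg(y ~ x1 + x2 | z + x2, data=df)
```

The manual two-stage procedure above is only meant to show the logic of breaking the bi-directional loop; for real applications, a dedicated estimator such as `ivreg()` should be preferred.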

*Remark*. Note that if we work with time series, then endogeneity only appears when the bi-directional relation happens at the same time \(t\), not over time. In the latter case, we would be dealing with a recursive relation (\(y_t\) depends on \(x_{t}\), but \(x_t\) depends on \(y_{t-1}\)) rather than a contemporaneous one, and thus the estimation of such a model would not lead to the issues discussed in this subsection.

### References

• Hanck, C., Arnold, M., Gerber, A., Schmelzer, M., 2022. Introduction to Econometrics with R. https://www.econometrics-with-r.org/index.html (version: 2022-04-17)