This book is in Open Review. I want your feedback to make the book better for you and other readers. To add your annotation, select some text and then click the on the pop-up menu. To see the annotations of others, click the button in the upper right hand corner of the page

15.3 The explanatory variables are not correlated with anything but the response variable

There are two assumptions in this group:

  1. No multicollinearity;
  2. No endogeneity;

Technically speaking, both of them are not assumptions, but rather potential issues of a model. This is because they have nothing to do with properties of the “true model”. Indeed, it is unreasonable to assume that the explanatory variables do not have any relation between them or that they are not impacted by the response variable – they are what they are. However, the two issues cause difficulties in estimating parameters of models and lead to issues with estimates of parameters. So, they are worth discussing.

15.3.1 Multicollinearity

Multicollinearity appears, when either some of explanatory variables are correlated with each other (see Section 9.3), or their linear combination explains another explanatory variable included in the model. Depending on the strength of this relation and the estimation method used for model construction, the multicollinearity might cause issues of varying severity. For example, in the case, when two variables are perfectly correlated (correlation coefficient is equal to 1 or -1), the model will have perfect multicollinearity and it would not be possible to estimate its parameters. Another example is a case, when an explanatory variable can be perfectly explained by a set of other explanatory variables (resulting in \(R^2\) being close to one), which will cause exactly the same issue. The classical example of this situation is the dummy variables trap (see Section 13), when all values of categorical variable are included in regression together with the constant resulting in the linear relation \(\sum_{j=1}^k d_j = 1\). Given that the square root of \(R^2\) of linear regression is equal to multiple correlation coefficient, these two situations are equivalent and just come to “absolute value of correlation coefficient is equal to 1”. Finally, if correlation coefficient is high, but not equal to one, the effect of multicollinearity will lead to less efficient estimates of parameters. The loss of efficiency is in this case proportional to the absolute value of correlation coefficient. In case of forecasting, the effect is not as straight forward, and in some cases might not damage the point forecasts, but can lead to prediction intervals of an incorrect width. The main issue of multicollinearity comes to the difficulties in the model estimation in a sample. If we had all the data in the world, then the issue would not exist. All of this tells us how this problem can be diagnosed and that this diagnosis should be carried out before constructing regression model.

First, we can calculate correlation matrix for the available variables. If they are all numeric, then cor() function from stats should do the trick (we remove the response variable from consideration):

##             cyl       disp         hp        drat         wt        qsec
## cyl   1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958 -0.59124207
## disp  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799 -0.43369788
## hp    0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479 -0.70822339
## drat -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406  0.09120476
## wt    0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000 -0.17471588
## qsec -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159  1.00000000
## vs   -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157  0.74453544
## am   -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953 -0.22986086
## gear -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870 -0.21268223
## carb  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059 -0.65624923
##              vs          am       gear        carb
## cyl  -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    1.0000000  0.16834512  0.2060233 -0.56960714
## am    0.1683451  1.00000000  0.7940588  0.05753435
## gear  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.5696071  0.05753435  0.2740728  1.00000000

This matrix tells us that there are some variables that are highly correlated and might reduce efficiency of estimates of parameters of regression model if included in the model together. This mainly applies to cyl and disp, which both characterise the size of engine. If we have a mix of numerical and categorical variables, then assoc() (aka association()) function from greybox will be more appropriate (see Section 9).


In order to cover the second situation with linear combination of variables, we can use the determ() (aka determination()) function from greybox:

##       cyl      disp        hp      drat        wt      qsec        vs        am 
## 0.9349544 0.9537470 0.8982917 0.7036703 0.9340582 0.8671619 0.7986256 0.7848763 
##      gear      carb 
## 0.8133441 0.8735577

This function will construct linear regression models for each variable from all the other variables and report the \(R^2\) from these models. If there are coefficients of determination close to one, then this might indicate that the variables would cause multicollinearity in the model. In our case, we see that disp is linearly related to other variables, and we can expect it to cause the reduction of efficiency of estimate of parameters. If we remove it from the consideration (we do not want to include it in our model anyway), then the picture will change:

##       cyl        hp      drat        wt      qsec        vs        am      gear 
## 0.9299952 0.8596168 0.6996363 0.8384243 0.8553748 0.7965848 0.7847198 0.8121855 
##      carb 
## 0.7680136

Now cyl has linear relation with some other variables, so it would not be wise to include it in the model with the other variables. We would need to decide, what to include based on our understanding of the problem.

Instead of calculating the coefficients of determination, econometricians prefer to calculate Variance Inflation Factor (VIF), which shows by how many times the estimates of parameters will loose efficiency. Its formula is based on the \(R^2\) calculated above: \[\begin{equation*} \mathrm{VIF}_i = \frac{1}{1-R_i^2} \end{equation*}\] for each model \(i\). Which in our case can be calculated as:

##       cyl        hp      drat        wt      qsec        vs        am      gear 
## 14.284737  7.123361  3.329298  6.189050  6.914423  4.916053  4.645108  5.324402 
##      carb 
##  4.310597

This is useful when you want to see the specific impact on the variance of parameters, but is difficult to work with, when it comes to model diagnostics, because the value of VIF lies between zero and infinity. So, I prefer using the determination coefficients instead, which is always bounded by \((0, 1)\) region and thus easier to interpret.

Finally, in some cases nothing can be done with multicollinearity, it just exists, and we need to include those correlated variables. This might not be a big problem, as long as we acknowledge the issues it will cause to the estimates of parameters.

15.3.2 Endogeneity

Endogeneity applies to the situation, when the dependent variable \(y_j\) influences the explanatory variable \(x_j\) in the model on the same observation. The relation in this case becomes bi-directional, meaning that the basic model is not appropriate in this situation any more. The parameters and forecasts will typically be biased, and a different estimation method would be needed (for example, instrumental variables) or maybe a different model would need to be constructed in order to fix this.

In econometrics, one of the definitions of the endogeneity is that the correlation between the error term and an explanatory variable is not zero, i.e. \(\mathrm{E}(\epsilon_j, x_{i,j}) \neq 0\) for at least some variable \(x_i\). In my personal opinion, this is a very confusing definition. First, if this applies to the “true” model then this is absurd, because by definition the error term in the true model is not related with anything (because the true model is correctly specified). Second, if this applies to the applied model, then this condition does not hold in sample if OLS is used for the estimation of parameters (this was discussed in Subsection 10.3). Third, even if we are talking about working with an incorrect model on the population data, the OLS will guarantee that \(\mathrm{E}(e_j, x_{i,j}) = 0\). So, the only case when this makes sense is for the relation between the explanatory variables and the forecast errors from the model generated on the holdout sample of data. This is why I think that this definition is not useful.

To make things even more complicated, endogeneity cannot be properly diagnosed and comes to the judgment of analyst: do we expect the relation between variables to be one directional or bi-directional? From the true model perspective, the latter might imply that we need to consider a system of equations of the style: \[\begin{equation} \begin{aligned} & y_j = \beta_0 + \beta_1 x_{1,j} + \dots + \beta_{k-1} x_{k-1,j} + \epsilon_j \\ & x_{1,j} = \gamma_0 + \gamma_1 y_{j} + \gamma_{2} x_{2,j} + \dots + \gamma_{k-1} x_{k-1,j} + \upsilon_j \end{aligned} . \tag{15.1} \end{equation}\] In the equation (15.1), the response variable \(y_j\) depends on the value of \(x_{1,j}\) (among other variables), but that variable depends on the value of \(y_j\) at the same time. In order to estimate such system of equations and break this loop, an analyst would need to find an “instrumental variable” – a variable that would be correlated with \(x_{1,j}\) but would not be correlated with \(y_j\) and then use a different estimation procedure (e.g. two-stage least squares). We do not aim to cover possible solutions of this issue, because they lie outside of the scope of this textbook, but an interested reader is referred to Chapter 12 of Hanck et al. (2022).

Remark. Note that if we work with time series then endogeneity would only appear when the bi-directional relation happens at the same time \(t\), not over time. In the latter case we would be dealing with recursive relation (\(y_t\) depends on \(x_{t}\), but \(x_t\) depends on \(y_{t-1}\)) rather than the contemporaneous and thus the estimation of such a model would not lead to the issues discussed in this subsection.


• Hanck, C., Arnold, M., Gerber, A., Schmelzer, M., 2022. Introduction to Econometrics with R. (version: 2022-04-17)