
14.2 Model specification: Redundant variables

While there are some ways of testing for omitted variables, redundant ones are very difficult to diagnose. We could look at the significance of variables or compare models with and without some of them based on information criteria, but even if these approaches say that a variable is not significant, this does not mean that it is not needed in the model. There can be many reasons why a test would fail to reject H\(_0\) and why AIC would prefer a model without the variable under consideration. So, it comes down to using judgment, trying to figure out whether a variable is needed in the model or not.
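
The two technical checks mentioned above can be sketched in a few lines. This is only an illustration under assumptions: it uses base R's lm() as a stand-in for the functions used elsewhere in this chapter, and the choice of kms as the variable under consideration is arbitrary. The object names (seatbeltsData, lmWithKms, lmWithoutKms) are made up for this example.

```r
# Seatbelts ships with base R (datasets package);
# convert the multivariate ts object to a data frame for lm()
seatbeltsData <- as.data.frame(Seatbelts)

# Two nested linear regressions: with and without kms
# (kms is an arbitrary choice for this illustration)
lmWithKms <- lm(drivers ~ PetrolPrice + kms + law, data = seatbeltsData)
lmWithoutKms <- lm(drivers ~ PetrolPrice + law, data = seatbeltsData)

# Information criteria for both models (the lower value is preferred)
icValues <- AIC(lmWithKms, lmWithoutKms)
print(icValues)

# Estimate, standard error, t-statistic and p-value for kms
kmsTest <- summary(lmWithKms)$coefficients["kms", ]
print(kmsTest)
```

Whatever these numbers show, they only summarise how the variable behaves in this sample. They do not tell us whether the variable belongs in the model, which is why judgment is still needed.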

In the example with the Seatbelts data, DriversKilled would be a redundant variable. Let’s see what happens with the model in this case:

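# adam() is provided by the smooth package
library(smooth)
# "NNN" switches off the ETS components, leaving a pure regression
adamModelSeat04 <- adam(Seatbelts, "NNN", 
                        formula=drivers~PetrolPrice+kms+
                          front+rear+law+DriversKilled)
# Show diagnostic plots 7 and 8 side by side
par(mfcol=c(1,2))
plot(adamModelSeat04,7:8)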

The residuals from this model look adequate, the only issue being that the first 45 observations lie below the zero line. The summary of this model is:

summary(adamModelSeat04)
## 
## Model estimated using alm() function: Regression
## Response variable: drivers
## Distribution used in the estimation: Normal
## Loss function type: likelihood; Loss function value: 1159.417
## Coefficients:
##               Estimate Std. Error Lower 2.5% Upper 97.5%  
## (Intercept)   320.2844   127.4014    68.9379    571.5145 *
## PetrolPrice   741.7600   769.1811  -775.7343   2258.5517  
## kms            -0.0039     0.0042    -0.0122      0.0044  
## front           0.9302     0.1375     0.6589      1.2014 *
## rear           -0.6859     0.2122    -1.1044     -0.2675 *
## law            67.9625    35.8203    -2.7064    138.5986  
## DriversKilled   6.6785     0.4377     5.8150      7.5416 *
## 
## Sample size: 192
## Number of estimated parameters: 7
## Number of degrees of freedom: 185
## Information criteria:
##      AIC     AICc      BIC     BICc 
## 2332.834 2333.443 2355.637 2357.237

The confidence interval for the parameter of DriversKilled is narrow (from 5.8150 to 7.5416) and lies well above zero, showing that the variable has a positive impact on drivers. However, the issue here is not statistical, but rather fundamental: we have included a variable that is a part of our response variable. It does not explain why drivers get injured and killed, it just reflects a specific part of this relation. So it explains part of the variance that should have been explained by other variables (e.g. kms and law), making them statistically insignificant. So, based on technical analysis we would be inclined to keep the variable, but based on our understanding of the problem we should not.
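
A quick way to see this effect is to fit the regression with and without DriversKilled and compare the intervals for kms and law. The sketch below is an approximation under assumptions: it uses base R's lm() instead of adam(), so the standard errors may differ slightly from the summary above, and the object names (lmWithout, lmWith, redundancyCor) are made up for this example.

```r
# Seatbelts ships with base R (datasets package)
seatbeltsData <- as.data.frame(Seatbelts)

# DriversKilled counts a subset of the drivers recorded in the response,
# so the two series are expected to move together
redundancyCor <- cor(seatbeltsData$drivers, seatbeltsData$DriversKilled)
print(redundancyCor)

# The same regression without and with the redundant variable
lmWithout <- lm(drivers ~ PetrolPrice + kms + front + rear + law,
                data = seatbeltsData)
lmWith <- update(lmWithout, . ~ . + DriversKilled)

# 95% confidence intervals for kms and law in both models
print(confint(lmWithout)[c("kms", "law"), ])
print(confint(lmWith)[c("kms", "law"), ])
```

If the intervals for kms and law change noticeably once DriversKilled enters the model, that is the variance being taken over by a variable that is itself a part of the response.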

If we have redundant variables in the model, then the model might overfit the data, leading to narrower prediction intervals and biased forecasts. The parameters of such a model are typically unbiased, but inefficient.
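
The "unbiased, but inefficient" property can be illustrated with a small simulation. This is a sketch on artificial data, not on Seatbelts: the true model, the sample size, the correlation of 0.9 between the regressors and all object names are assumptions made purely for this example.

```r
set.seed(41)
nObs <- 100    # observations per sample
nReps <- 1000  # number of simulated samples

# Estimates of the slope of x1 from the correct and the overspecified model
estimates <- matrix(NA, nReps, 2,
                    dimnames = list(NULL, c("correct", "redundant")))

for (i in 1:nReps) {
  x1 <- rnorm(nObs)
  # x2 is correlated with x1 (correlation 0.9) but is not in the true model
  x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(nObs)
  # True model: only x1 matters, with slope equal to 2
  y <- 1 + 2 * x1 + rnorm(nObs)
  estimates[i, "correct"] <- coef(lm(y ~ x1))["x1"]
  estimates[i, "redundant"] <- coef(lm(y ~ x1 + x2))["x1"]
}

# Both averages should be close to the true value of 2 (no bias)
estimatesMean <- colMeans(estimates)
print(estimatesMean)

# The spread should be larger when the redundant x2 is included (inefficiency)
estimatesSD <- apply(estimates, 2, sd)
print(estimatesSD)
```

In this setting the slope of x1 is centred around the true value in both models, but its estimates vary more from sample to sample when the redundant x2 is included, and the loss of efficiency grows as the correlation between the two regressors increases.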