14.2 Model specification: Redundant variables
While there are some ways of testing for omitted variables, redundant ones are very difficult to diagnose. We could look at the significance of variables or compare models with and without some variables based on information criteria, but even if these approaches say that a variable is not significant, this does not mean that it is not needed in the model. There can be many reasons why a test would fail to reject \(\mathrm{H}_0\) and why AIC would prefer a model without the variable under consideration. So, it comes down to judgment: trying to figure out whether a variable is needed in the model or not.
In the example with the Seatbelts data, DriversKilled would be a redundant variable. Let’s see what happens with the model in this case:
adamModelSeat04 <- adam(Seatbelts, "NNN", formula=drivers~PetrolPrice+kms+front+rear+law+DriversKilled)
par(mfcol=c(1,2))
plot(adamModelSeat04, 7:8)
The residuals from this model look adequate, with the only issue being the first 45 observations lying below the zero line. The summary of this model is:
summary(adamModelSeat04)
##
## Model estimated using alm() function: Regression
## Response variable: drivers
## Distribution used in the estimation: Normal
## Loss function type: likelihood; Loss function value: 1159.417
## Coefficients:
##               Estimate Std. Error Lower 2.5% Upper 97.5%
## (Intercept)   320.2844   127.4014    68.9379    571.5145 *
## PetrolPrice   741.7600   769.1811  -775.7343   2258.5517
## kms            -0.0039     0.0042    -0.0122      0.0044
## front           0.9302     0.1375     0.6589      1.2014 *
## rear           -0.6859     0.2122    -1.1044     -0.2675 *
## law            67.9625    35.8203    -2.7064    138.5986
## DriversKilled   6.6785     0.4377     5.8150      7.5416 *
##
## Sample size: 192
## Number of estimated parameters: 7
## Number of degrees of freedom: 185
## Information criteria:
##      AIC     AICc      BIC     BICc
## 2332.834 2333.443 2355.637 2357.237
The confidence interval for the parameter of DriversKilled is narrow and lies entirely above zero, showing that the variable has a positive impact on drivers. However, the issue here is not statistical, but rather fundamental: we have included a variable that is a part of our response variable. It does not explain why drivers get injured and killed, it just reflects a specific part of this relation. So it explains part of the variance that should have been explained by other variables (e.g. kms and law), making them statistically insignificant. So, based on technical analysis we would be inclined to keep the variable, but based on our understanding of the problem we should not.
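The contrast between the technical and the fundamental view can be sketched in base R. This is only an illustration under assumptions: it uses lm() from the stats package as a stand-in for adam(), so the numbers need not match the summary above, and the object names seat, modelWithout and modelWith are made up for this example.

```r
# Seatbelts ships with base R; convert the multivariate ts to a data frame
seat <- as.data.frame(Seatbelts)

# Fundamental check: DriversKilled counts a subset of the people in drivers
all(seat$DriversKilled <= seat$drivers)

# The same regression without and with the redundant variable
# (assumption: lm() is used here in place of adam())
modelWithout <- lm(drivers ~ PetrolPrice + kms + front + rear + law, data = seat)
modelWith <- update(modelWithout, . ~ . + DriversKilled)

# Technical check: information criteria for the two models
AIC(modelWithout, modelWith)

# Confidence intervals for kms and law in each model
confint(modelWithout)[c("kms", "law"), ]
confint(modelWith)[c("kms", "law"), ]
```

The information criterion is expected to favour the model with DriversKilled, which is why a purely technical comparison cannot settle the question. The first check is the one that reflects our understanding of the problem.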
If we have redundant variables in the model, then the model might overfit the data, leading to narrower prediction intervals and biased forecasts. The parameters of such a model are typically unbiased, but inefficient.
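The effect on prediction intervals can be examined with a small holdout experiment. The sketch below is an assumption-laden illustration rather than the procedure used in this chapter: it relies on lm() and predict() from base R instead of adam(), the split into the first 180 and the last 12 observations is arbitrary, and the names train, holdout and widths are hypothetical.

```r
# Split the data: first 180 months for estimation, last 12 as a holdout
seat <- as.data.frame(Seatbelts)
train <- seat[1:180, ]
holdout <- seat[181:192, ]

# Models without and with the redundant variable, estimated in-sample only
modelWithout <- lm(drivers ~ PetrolPrice + kms + front + rear + law, data = train)
modelWith <- update(modelWithout, . ~ . + DriversKilled)

# 95% prediction intervals for the holdout period
intervalWithout <- predict(modelWithout, newdata = holdout,
                           interval = "prediction", level = 0.95)
intervalWith <- predict(modelWith, newdata = holdout,
                        interval = "prediction", level = 0.95)

# Average interval width for each model
widths <- c(without = mean(intervalWithout[, "upr"] - intervalWithout[, "lwr"]),
            with = mean(intervalWith[, "upr"] - intervalWith[, "lwr"]))
widths
```

Note that the model with DriversKilled uses the holdout values of that variable, which would not be available in a real forecasting situation, so its narrower intervals are not evidence of a better model.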