14.2 Model specification: Redundant variables

While there are ways of testing for omitted variables, redundant ones are harder to diagnose. We could look at the significance of variables (Section 5.3 of Svetunkov, 2022a) or compare models with and without some variables based on information criteria (Section 13.4 of Svetunkov, 2022a). Still, even if these approaches indicate that a variable is not significant, this does not mean that it is not needed in the model: there can be many reasons why a test would fail to reject H\(_0\) or why AIC would prefer a model without the variable under consideration. So, it comes down to judgment: trying to figure out whether a variable is needed in the model or not.
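As a rough sketch of the two checks mentioned above, the following code uses base R's lm() on the Seatbelts data. Using lm() instead of adam() is an assumption made to keep the example self-contained, and the choice of kms as the variable under consideration is purely illustrative.

```r
# Base R sketch: lm() stands in for adam() here (an assumption made
# to keep the example self-contained)
seat <- as.data.frame(Seatbelts)

# Models with and without the variable under consideration (kms)
modelWith <- lm(drivers ~ PetrolPrice + kms + law, data = seat)
modelWithout <- lm(drivers ~ PetrolPrice + law, data = seat)

# Check 1: does the confidence interval of the parameter include zero?
kmsInterval <- confint(modelWith, "kms", level = 0.95)
print(kmsInterval)

# Check 2: information criteria for the two models (lower is better)
icValues <- c(with = AIC(modelWith), without = AIC(modelWithout))
print(icValues)
```

Neither check settles the question on its own: both only summarise what the data can tell us, not whether the variable belongs in the model.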

If the model contains redundant variables, then it may overfit the data, which could lead to narrower prediction intervals and biased point forecasts.

In the example with the Seatbelts data, DriversKilled would be a redundant variable for the reasons explained in Section 14.1. Let us see what happens with the model if we include it:

adamSeat04 <- adam(Seatbelts, "NNN", 
                   formula=drivers~PetrolPrice+kms+
                     law+DriversKilled)
par(mfcol=c(1,2), mar=c(4,4,2,1))
plot(adamSeat04,7:8)

Figure 14.6: Diagnostic plots for Model 4.

The residuals from this model look adequate, with the only issue being the first 45 observations lying below the zero line. The summary of this model is:

summary(adamSeat04)
## 
## Model estimated using alm() function: Regression
## Response variable: drivers
## Distribution used in the estimation: Normal
## Loss function type: likelihood; Loss function value: 1189.274
## Coefficients:
##                 Estimate Std. Error Lower 2.5% Upper 97.5%  
## (Intercept)     905.6559   115.0935   678.6073   1132.6294 *
## PetrolPrice   -1603.7772   827.8145 -3236.8326     28.7384  
## kms              -0.0112     0.0035    -0.0182     -0.0043 *
## law             -91.2672    31.9765  -154.3483    -28.2070 *
## DriversKilled     9.0423     0.3831     8.2866      9.7978 *
## 
## Error standard deviation: 120.1081
## Sample size: 192
## Number of estimated parameters: 5
## Number of degrees of freedom: 187
## Information criteria:
##      AIC     AICc      BIC     BICc 
## 2388.549 2388.871 2404.836 2405.684

The confidence interval for the parameter of DriversKilled is narrow and lies well above zero (from 8.29 to 9.80), suggesting that the variable has a positive effect on drivers. However, the issue here is not statistical but fundamental: we have included a variable that is a part of our response variable. It does not explain why drivers get injured and killed; it just reflects a specific element of this relation. So it approximates part of the variance that should have been explained by other variables (e.g. kms and law), and can make some of them statistically not significant (in the summary above, the interval for PetrolPrice includes zero). So, based on the technical analysis, we would be inclined to keep the variable, but based on our understanding of the problem, we should not.
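To see how the redundant variable interacts with the others, one option is to compare the parameter intervals of a regression with and without DriversKilled. The sketch below again uses base R's lm() rather than adam() (an assumption for self-containment, so the numbers may differ slightly from the output above).

```r
# Base R sketch: lm() stands in for adam() here (an assumption)
seat <- as.data.frame(Seatbelts)

# DriversKilled is a part of the response variable drivers,
# so the two series move together
corValue <- cor(seat$drivers, seat$DriversKilled)
print(corValue)

# Regression without and with the redundant variable
modelBase <- lm(drivers ~ PetrolPrice + kms + law, data = seat)
modelRedundant <- lm(drivers ~ PetrolPrice + kms + law + DriversKilled,
                     data = seat)

# Compare the confidence intervals of the parameters shared by both models
sharedNames <- c("PetrolPrice", "kms", "law")
intervalsBase <- confint(modelBase)[sharedNames, ]
intervalsRedundant <- confint(modelRedundant)[sharedNames, ]
print(intervalsBase)
print(intervalsRedundant)
```

Any change in the intervals between the two models reflects variance being attributed to DriversKilled rather than to the explanatory variables.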

If we have redundant variables in the model, then the model might overfit the data, leading to narrower prediction intervals and biased forecasts. The parameters of such a model are typically unbiased but inefficient (Section 4.3 of Svetunkov, 2022a).
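The "unbiased but inefficient" property can be illustrated with a small simulation. This is a minimal sketch on artificial data, not part of the Seatbelts example: the data generating process, the sample size, the number of replications, and all variable names are assumptions chosen for illustration.

```r
# Simulation sketch: the true model is y = 1 + 2 * x1 + noise,
# and x2 is correlated with x1 but has no effect on y (redundant)
set.seed(41)
nObs <- 100
nReps <- 500
estimates <- matrix(NA, nReps, 2,
                    dimnames = list(NULL, c("correct", "redundant")))

for (i in 1:nReps) {
  x1 <- rnorm(nObs)
  x2 <- 0.9 * x1 + rnorm(nObs, sd = 0.3)
  y <- 1 + 2 * x1 + rnorm(nObs)
  # Estimate of the x1 parameter without and with the redundant x2
  estimates[i, "correct"] <- coef(lm(y ~ x1))["x1"]
  estimates[i, "redundant"] <- coef(lm(y ~ x1 + x2))["x1"]
}

# Unbiasedness: both averages should be close to the true value of 2
estimateMeans <- colMeans(estimates)
print(estimateMeans)

# Inefficiency: the variance of the estimate is larger when x2 is included
estimateVars <- apply(estimates, 2, var)
print(estimateVars)
```

The stronger the correlation between the redundant variable and the relevant ones, the larger the loss in efficiency is expected to be.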

References

• Svetunkov, I., 2022a. Statistics for business analytics. https://openforecast.org/sba/ (version: 31.03.2022)