14.1 Model specification: Omitted variables

We start with one of the most important assumptions for models: the model has not omitted important variables. In general, this is difficult to diagnose, because it is typically not possible to tell what is missing if we do not have it in front of us. The best thing one can do is a mental experiment: try to compile a list of all theoretically possible variables that would impact the variable of interest. If you manage to come up with such a list and realise that some of the variables are missing, the next step is to collect either the variables themselves or their proxies. One way or another, we would need to add the missing information to the model.
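To make the consequence of an omission concrete, here is a minimal simulated sketch in base R. It is not part of the Seatbelts example: the variable names (x1, x2, y), the object names (modelOmitted, modelFull) and the coefficient values are all assumptions chosen purely for illustration.

```r
# Simulated illustration; every name and value here is hypothetical
set.seed(41)
n <- 500
x1 <- rnorm(n)
# x2 is correlated with x1, so omitting it also distorts the slope of x1
x2 <- 0.6 * x1 + rnorm(n)
y <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

# Misspecified model: x2 is omitted
modelOmitted <- lm(y ~ x1)
# Correctly specified model
modelFull <- lm(y ~ x1 + x2)

# Residuals of the misspecified model still carry the effect of x2
corOmitted <- cor(residuals(modelOmitted), x2)
# OLS residuals are orthogonal to the included regressors,
# so this correlation is zero up to numerical precision
corFull <- cor(residuals(modelFull), x2)

# The slope of x1 absorbs part of the effect of x2 when x2 is omitted
slopeOmitted <- coef(modelOmitted)["x1"]
slopeFull <- coef(modelFull)["x1"]
```

In this setup, corOmitted should be far from zero, while corFull is zero up to numerical precision. The slope of x1 in the misspecified model also drifts away from the true value of 2 towards 2 + 3 × 0.6 = 3.8, which is the kind of distortion that an omitted variable can cause.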

In some cases, we might be able to diagnose this. For example, with our regression model from the previous section, we have a set of variables that are not included in the model. A simple thing to do is to check whether the residuals of our model are correlated with any of the omitted variables. We can either produce scatterplots or calculate measures of association to see if there is some relation in the residuals that is not explained by the existing structure. I will use the assoc() and spread() functions from the greybox package for this:

# Create a new data frame, dropping the variables that are already in the model
# (columns 2, 5 and 6 of Seatbelts: drivers, kms and PetrolPrice)
SeatbeltsWithResiduals <-
  cbind(as.data.frame(residuals(adamModelSeat01)),
        Seatbelts[,-c(2,5,6)])
colnames(SeatbeltsWithResiduals)[1] <- "residuals"
# Spread plot
greybox::spread(SeatbeltsWithResiduals)

The spread() function automatically detects the type of each variable and produces a scatterplot, boxplot() or tableplot() for each pair, making the final plot more readable. The plot above tells us that the residuals are correlated with DriversKilled, front, rear and law, so some of these variables can be added to the model to improve it. VanKilled might have a weak relation with drivers, but judging by its description, it does not make sense in the model (it is a part of the drivers variable). In our case, it is safe to add these variables, because they make sense in explaining the number of injured drivers. However, I would not add DriversKilled: it does not seem to drive the number of deaths and injuries, and is correlated with it only for the obvious reason that DriversKilled is included in drivers. We can also calculate measures of association between variables:

greybox::assoc(SeatbeltsWithResiduals)
## Associations: 
## values:
##               residuals DriversKilled  front   rear VanKilled    law
## residuals        1.0000        0.7826 0.6121 0.4811    0.2751 0.1892
## DriversKilled    0.7826        1.0000 0.7068 0.3534    0.4070 0.3285
## front            0.6121        0.7068 1.0000 0.6202    0.4724 0.5624
## rear             0.4811        0.3534 0.6202 1.0000    0.1218 0.0291
## VanKilled        0.2751        0.4070 0.4724 0.1218    1.0000 0.3949
## law              0.1892        0.3285 0.5624 0.0291    0.3949 1.0000
## 
## p-values:
##               residuals DriversKilled front   rear VanKilled    law
## residuals        0.0000             0     0 0.0000    0.0001 0.0086
## DriversKilled    0.0000             0     0 0.0000    0.0000 0.0000
## front            0.0000             0     0 0.0000    0.0000 0.0000
## rear             0.0000             0     0 0.0000    0.0925 0.6890
## VanKilled        0.0001             0     0 0.0925    0.0000 0.0000
## law              0.0086             0     0 0.6890    0.0000 0.0000
## 
## types:
##               residuals DriversKilled front     rear      VanKilled law   
## residuals     "none"    "pearson"     "pearson" "pearson" "pearson" "mcor"
## DriversKilled "pearson" "none"        "pearson" "pearson" "pearson" "mcor"
## front         "pearson" "pearson"     "none"    "pearson" "pearson" "mcor"
## rear          "pearson" "pearson"     "pearson" "none"    "pearson" "mcor"
## VanKilled     "pearson" "pearson"     "pearson" "pearson" "none"    "mcor"
## law           "mcor"    "mcor"        "mcor"    "mcor"    "mcor"    "none"
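The "mcor" label in the types table above marks the pairs where one variable is numeric and the other one (here, law) is treated as categorical. The following is a minimal base R sketch of that idea, assuming that mcor corresponds to multiple correlation, i.e. the square root of the coefficient of determination from a regression of the numeric variable on the categorical one. It does not call greybox, and the object names (seatbeltsData, lawFactor, fitFrontLaw) are hypothetical.

```r
# Base R sketch of multiple correlation between a numeric and a categorical variable
# (an assumption about what "mcor" measures, not a call to greybox itself)
seatbeltsData <- as.data.frame(Seatbelts)
lawFactor <- factor(seatbeltsData$law)

# Regress the numeric variable on the categorical one
fitFrontLaw <- lm(seatbeltsData$front ~ lawFactor)
# Multiple correlation is the square root of R^2 of that regression
mcorFrontLaw <- sqrt(summary(fitFrontLaw)$r.squared)

# For a two-level factor this coincides with the absolute Pearson correlation
pearsonFrontLaw <- abs(cor(seatbeltsData$front, seatbeltsData$law))
```

With more than two levels the Pearson shortcut no longer applies, while the regression-based measure still does, which is presumably why a separate measure is used for mixed pairs.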

Technically speaking, the output of this function tells us that all variables are significantly correlated with the residuals (every p-value in the residuals row is below 0.01) and can be considered for the model. I would still prefer not to add DriversKilled to the model for the reasons explained earlier. We can construct a new model in the following way:

adamModelSeat02 <- adam(Seatbelts, "NNN",
                        formula=drivers~PetrolPrice+kms+
                          front+rear+law)
plot(adamModelSeat02,7)

How can we know that we have not omitted any important variables in our new model? Unfortunately, there is no good way of knowing that. In general, we should use judgment in order to decide whether anything else is needed or not. But given that we deal with time series, we can analyse residuals over time and see if there is any structure left:

plot(adamModelSeat02,8)

This plot shows that the model has not captured seasonality and that there is still some structure left in the residuals. In order to address this, we will add an ETS(A,N,A) element to the model:

adamModelSeat03 <- adam(Seatbelts, "ANA",
                        formula=drivers~PetrolPrice+kms+
                          front+rear+law)
par(mfcol=c(1,2))
plot(adamModelSeat03,7:8)

This is much better. There is no apparent missing structure in the data and no apparent omitted variables. We can now move to the next steps of diagnostics.
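The visual check of residuals over time can be complemented by a couple of simple numbers. The sketch below is a stand-in under stated assumptions: it uses lm() from base R with the same regressors instead of adam(), so its residuals will not coincide exactly with those of the models above, and the object names (lmStandIn, monthlyMeans, seasonalAcf) are hypothetical.

```r
# Base R stand-in for the regression part of the model (an assumption, not adam())
seatbeltsData <- as.data.frame(Seatbelts)
lmStandIn <- lm(drivers ~ PetrolPrice + kms + front + rear + law,
                data = seatbeltsData)
res <- residuals(lmStandIn)

# Position of each observation within the year (1 = January, ..., 12 = December)
monthIndex <- as.integer(cycle(Seatbelts))
# Average residual per calendar month: large differences hint at leftover seasonality
monthlyMeans <- tapply(res, monthIndex, mean)

# Autocorrelation of the residuals; element 1 is lag 0, so element 13 is lag 12
acfRes <- acf(res, lag.max = 24, plot = FALSE)
seasonalAcf <- acfRes$acf[13]
```

Monthly means that differ markedly from each other, or a large autocorrelation at lag 12, would point in the same direction as the plot of residuals over time: a seasonal component is missing from the model.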