15.3 Explanatory variables selection

There are different approaches to automatic variable selection, but not all of them are efficient in the context of dynamic models. For example, backward stepwise might either be infeasible on small samples or take too long to converge to an optimal solution (it has polynomial computational time). This is because the ADAMX model needs to be refitted and re-estimated repeatedly, using the recursive relations of the state space model (10.4). Classical forward stepwise might also be too slow, because it has polynomial computational time as well. So some simplifications are needed to make variable selection in ADAMX doable in a reasonable time.

To make the mechanism efficient, we rely on the trace forward stepwise approach of Sagaert and Svetunkov (2022), which uses partial correlations between variables to decide which variable to include at each iteration. Its computational time is linear rather than polynomial. Still, applying it to the proper ADAMX would take longer than necessary because of the repeated fitting of the dynamic model. So one possible solution is to carry out variable selection in ADAMX in the following steps (a code sketch of the whole procedure is given after the discussion below):

  1. Estimate and fit the ADAM;
  2. Extract the residuals of the ADAM;
  3. Select the most suitable variables, explaining the residuals, based on the trace forward stepwise approach and the selected information criterion;
  4. Estimate the ADAMX model with the selected explanatory variables.

The residuals in step (2) might vary from model to model, depending on the type of the error term and the selected distribution:

  • Normal, Laplace, S, Generalised Normal or Asymmetric Laplace: \(e_t\);
  • Additive error and log-normal, Inverse Gaussian or Gamma: \(\left(1+\frac{e_t}{\hat{y}_t} \right)\);
  • Multiplicative error and log-normal, Inverse Gaussian or Gamma: \(1+e_t\).

So the extracted residuals should be formed in line with the distributional assumptions of each model.
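To make this mapping concrete, it can be wrapped in a small helper function. The following is a hypothetical sketch, assuming a model fitted by adam() from the smooth package, with residuals() returning \(e_t\), fitted() returning \(\hat{y}_t\), errorType() returning "A" or "M", and the distribution element of the model holding the density name (e.g. "dnorm" or "dgamma"):

```r
library(smooth)

# Hypothetical helper for step (2): extract residuals in the form
# dictated by the error type and the assumed distribution.
adamXResiduals <- function(model){
    eT <- residuals(model)
    if(model$distribution %in% c("dnorm","dlaplace","ds","dgnorm","dalaplace")){
        # Normal, Laplace, S, Generalised Normal or Asymmetric Laplace: e_t
        eT
    }else if(errorType(model)=="A"){
        # Additive error with Log-Normal, Inverse Gaussian or Gamma: 1 + e_t/fitted
        1 + eT/fitted(model)
    }else{
        # Multiplicative error with Log-Normal, Inverse Gaussian or Gamma: 1 + e_t
        1 + eT
    }
}
```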

In R, step (3) is done using the stepwise() function from the greybox package, which supports all the distributions discussed in the previous chapters. The only thing that needs to be modified is the number of degrees of freedom: the function should take into account all the parameters estimated at step (1). This is done internally via the df parameter of stepwise().
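Putting the four steps together, the procedure could be sketched as follows. This is a rough illustration rather than the exact internal implementation: the object names are made up, xregData stands for a matrix of candidate explanatory variables, y for the response variable, and nparam() from greybox is used to pass the number of parameters estimated at step (1) to stepwise():

```r
library(smooth)
library(greybox)

# (1) Estimate and fit the pure dynamic model (ADAM without explanatory variables)
adamModel <- adam(y, "MNM")
# (2) Extract the residuals in the form discussed above
residualsModel <- adamXResiduals(adamModel)
# (3) Select the variables explaining the residuals via trace forward stepwise,
#     accounting for the parameters already estimated at step (1)
stepwiseModel <- stepwise(data.frame(residuals=as.vector(residualsModel), xregData),
                          df=nparam(adamModel))
# (4) Reestimate ADAMX with the selected explanatory variables only
selectedVariables <- names(coef(stepwiseModel))[-1]
adamXModel <- adam(cbind(y=y, xregData[,selectedVariables,drop=FALSE]),
                   "MNM", regressors="use")
```

In practice, none of this needs to be done manually: as shown below, adam() implements the procedure via its regressors="select" option.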

While the suggested approach has obvious limitations (e.g. the smoothing parameters at step (1) can become higher than needed, absorbing some of the variability that would otherwise be explained by the variables), it is efficient in terms of computational time.

To see how this works, we use the Seatbelts data:
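Assuming, as in Section 10.6, that we only need the response variable drivers and the candidate explanatory variables kms, PetrolPrice, and law, the data can be prepared as follows:

```r
# Response variable together with the candidate explanatory variables
SeatbeltsData <- Seatbelts[,c("drivers","kms","PetrolPrice","law")]
```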

We have already had a look at this data earlier in Section 10.6, so we can move directly to the selection part:
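A call along the following lines produces the summary below; the regressors="select" option of adam() switches on the selection mechanism described above. The 12-observation holdout and the explicit Gamma distribution are assumptions made to match the sample size and distribution reported in the output:

```r
adamETSXMNMSelectSeat <- adam(SeatbeltsData, "MNM", h=12, holdout=TRUE,
                              distribution="dgamma", regressors="select")
summary(adamETSXMNMSelectSeat)
```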

## Warning: Observed Fisher Information is not positive semi-definite, which means
## that the likelihood was not maximised properly. Consider reestimating the model,
## tuning the optimiser or using bootstrap via bootstrap=TRUE.
## 
## Model estimated using adam() function: ETSX(MNM)
## Response variable: drivers
## Distribution used in the estimation: Gamma
## Loss function type: likelihood; Loss function value: 1117.189
## Coefficients:
##              Estimate Std. Error Lower 2.5% Upper 97.5%  
## alpha          0.2877     0.0856     0.1186      0.4565 *
## gamma          0.0000     0.0414     0.0000      0.0816  
## level       1655.9759    97.3924  1463.6713   1848.0473 *
## seasonal_1     1.0099     0.0155     0.9808      1.0459 *
## seasonal_2     0.9053     0.0153     0.8762      0.9413 *
## seasonal_3     0.9352     0.0156     0.9061      0.9712 *
## seasonal_4     0.8696     0.0147     0.8405      0.9056 *
## seasonal_5     0.9465     0.0162     0.9174      0.9825 *
## seasonal_6     0.9152     0.0155     0.8861      0.9513 *
## seasonal_7     0.9623     0.0160     0.9332      0.9983 *
## seasonal_8     0.9706     0.0159     0.9416      1.0067 *
## seasonal_9     1.0026     0.0169     0.9735      1.0386 *
## seasonal_10    1.0824     0.0178     1.0533      1.1184 *
## seasonal_11    1.2012     0.0183     1.1721      1.2372 *
## law            0.0200     0.1050    -0.1873      0.2271  
## 
## Error standard deviation: 0.0752
## Sample size: 180
## Number of estimated parameters: 16
## Number of degrees of freedom: 164
## Information criteria:
##      AIC     AICc      BIC     BICc 
## 2266.378 2269.715 2317.465 2326.131

Remark. The function might complain about the observed Fisher Information. This only means that the estimated variances of parameters might be lower than they should be in reality. This is discussed in Section 16.1.

Based on the summary above, we can see that neither kms nor PetrolPrice improved the model in terms of AICc, so they were not included in it. To check whether the selection has worked out well in our case, we can construct the sink regression (the model with all the available variables) as a benchmark:
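Using a hypothetical name for the benchmark model and the regressors="use" option to force all variables into the model:

```r
adamETSXMNMSinkSeat <- adam(SeatbeltsData, "MNM", h=12, holdout=TRUE,
                            distribution="dgamma", regressors="use")
summary(adamETSXMNMSinkSeat)
```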

## Warning: Observed Fisher Information is not positive semi-definite, which means
## that the likelihood was not maximised properly. Consider reestimating the model,
## tuning the optimiser or using bootstrap via bootstrap=TRUE.
## 
## Model estimated using adam() function: ETSX(MNM)
## Response variable: drivers
## Distribution used in the estimation: Gamma
## Loss function type: likelihood; Loss function value: 1234.278
## Coefficients:
##             Estimate Std. Error Lower 2.5% Upper 97.5%  
## alpha         0.9508     2.5767     0.0000      1.0000  
## gamma         0.0000     0.0131     0.0000      0.0259  
## level        23.2952     1.2746    20.7782     25.8087 *
## seasonal_1    1.1340     0.0621     1.0115      4.2731 *
## seasonal_2    0.9924     0.9516     0.8698      4.1314 *
## seasonal_3    0.9248     0.8549     0.8023      4.0639 *
## seasonal_4    0.8342     0.7888     0.7117      3.9733 *
## seasonal_5    0.9068     0.8962     0.7843      4.0459 *
## seasonal_6    0.8625     0.9153     0.7400      4.0016 *
## seasonal_7    0.8370     0.8126     0.7144      3.9761 *
## seasonal_8    0.8477     0.7795     0.7252      3.9868 *
## seasonal_9    0.9798     0.9959     0.8573      4.1189 *
## seasonal_10   1.1417     1.2918     1.0192      4.2808 *
## seasonal_11   1.3273     1.5918     1.2047      4.4663 *
## kms           0.0000     0.0000    -0.0001      0.0001  
## PetrolPrice  -2.7216     3.1292    -8.9009      3.4494  
## law           0.0181     5.2249   -10.2996     10.3216  
## 
## Error standard deviation: 0.1413
## Sample size: 180
## Number of estimated parameters: 18
## Number of degrees of freedom: 162
## Information criteria:
##      AIC     AICc      BIC     BICc 
## 2504.556 2508.804 2562.029 2573.060

We can see that the sink regression model has a higher AICc value than the model with the selected variables, which means that the latter is closer to the "true model". While adamETSXMNMSelectSeat might not be the best possible model in terms of information criteria, it is still a reasonable one and can be used for further inference. The choice of the best model should rest on both the minimum of the information criterion and an understanding of the problem; the two aspects need to be considered together to avoid overfitting.
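For a direct comparison, the information criteria can be extracted with the AICc() method from greybox (using the hypothetical sink model name from above):

```r
# Compare the two models by their AICc values
setNames(c(AICc(adamETSXMNMSelectSeat), AICc(adamETSXMNMSinkSeat)),
         c("Select","Sink"))
```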

References

• Sagaert, Y., Svetunkov, I., 2022. Trace Forward Stepwise: Automatic Selection of Variables in No Time. https://doi.org/10.13140/RG.2.2.34995.35369