
15.3 Explanatory variables selection

There are different approaches to automatic variable selection, but not all of them are efficient in the context of dynamic models. For example, conventional stepwise approaches might either be infeasible on small samples or take too long to converge to an optimal solution (their computational time is polynomial in the number of variables). This well-known problem from the regression context is magnified in the case of dynamic models, because each model fit takes much longer than in regression: the ADAMX needs to be refitted and re-estimated repeatedly using the recursive relations of the state space model (10.4). So, some simplifications are needed to make variable selection in ADAMX doable in a reasonable amount of time.

To make the mechanism efficient in a limited time, I propose using the Trace Forward Stepwise approach of Sagaert and Svetunkov (2022), which uses partial correlations between variables to decide which of them to include at each iteration. While its computational time is linear rather than polynomial, applying it to the proper ADAMX would still take a lot of time because of the repeated fitting of the dynamic model. One possible solution is therefore to select the variables in ADAMX based on the residuals of the model, in the following steps (a minimal R sketch is given after the list):

  1. Estimate and fit the ADAM;
  2. Extract the residuals of the ADAM;
  3. Select the most suitable variables explaining the residuals, based on the trace forward stepwise approach and the selected information criterion;
  4. Re-estimate the ADAMX with the selected explanatory variables.
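
To illustrate the steps above, here is a minimal sketch in R of how they could be carried out by hand. It uses the smooth and greybox packages and the Seatbelts data employed later in this section; the exact calls are an illustration only, since adam() performs all of this internally when regressors="select" is used:

library(smooth)
library(greybox)

# Response variable in the first column, candidate explanatory
# variables in the others
xregData <- Seatbelts[,c("drivers","kms","PetrolPrice","law")]

# Step 1: estimate the pure dynamic model on the response variable only
adamModel <- adam(xregData[,"drivers"], "MNM")
# Step 2: extract the residuals of the fitted model
dataResiduals <- data.frame(residuals=as.vector(residuals(adamModel)),
                            xregData[,-1])
# Step 3: select the variables explaining the residuals via the
# trace forward stepwise approach (AICc by default)
stepwiseModel <- stepwise(dataResiduals)
# Step 4 would then re-estimate the ADAMX, including only the
# variables selected in stepwiseModel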

The residuals extracted in step (2) vary from model to model, depending on the type of the error term and the assumed distribution:

  • Normal, Laplace, S, Generalised Normal or Asymmetric Laplace: \(e_t\);
  • Additive error and Log-Normal, Inverse Gaussian or Gamma: \(\left(1+\frac{e_t}{\hat{y}_t} \right)\);
  • Multiplicative error and Log-Normal, Inverse Gaussian or Gamma: \(1+e_t\).

So, the extracted residuals should be aligned with the distributional assumptions of each model.
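
To make this mapping concrete, here is a hypothetical helper (not part of the smooth package) showing how the residuals for step (2) could be formed from the raw error term and the fitted values:

# Hypothetical helper: form the residuals used in step (2) from the
# error term et and the fitted values yFitted
extractResiduals <- function(et, yFitted, errorType=c("A","M"),
                             distribution=c("dnorm","dlaplace","ds",
                                            "dgnorm","dalaplace","dlnorm",
                                            "dinvgauss","dgamma")){
  errorType <- match.arg(errorType)
  distribution <- match.arg(distribution)
  # Log-Normal, Inverse Gaussian and Gamma rely on positive,
  # multiplicative-style residuals
  if(distribution %in% c("dlnorm","dinvgauss","dgamma")){
    if(errorType=="A"){
      return(1 + et/yFitted)
    }else{
      return(1 + et)
    }
  }
  # Normal, Laplace, S, Generalised Normal, Asymmetric Laplace
  return(et)
}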

In R, step (3) is done using the stepwise() function from the greybox package, which supports all the distributions implemented in ADAM. The only thing that needs to be modified is the number of degrees of freedom: the function should take into account all the parameters estimated in step (1), including those of the dynamic part of the model. When regressors="select" is used in adam(), this adjustment happens internally via the df parameter of stepwise().
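
Continuing the sketch above, the same adjustment could be made by hand, using nparam() from greybox to count the parameters estimated in the dynamic model:

# Take the parameters of the dynamic part into account when selecting
# variables on the residuals
stepwiseModel <- stepwise(dataResiduals, df=nparam(adamModel))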

While the suggested approach has obvious limitations (e.g. the smoothing parameters in step (1) can become higher than needed, absorbing some of the variability that the explanatory variables could otherwise explain), it is efficient in terms of computational time.

To see how it works, we use the Seatbelts data:

SeatbeltsData <- Seatbelts[,c("drivers","kms","PetrolPrice","law")]

We already had a look at this data in Section 10.6, so we can move directly to the selection part:

adamETSXMNMSelectSeat <- adam(SeatbeltsData, "MNM",
                              h=12, holdout=TRUE,
                              regressors="select")
summary(adamETSXMNMSelectSeat)
## Warning: Observed Fisher Information is not positive semi-definite, which means
## that the likelihood was not maximised properly. Consider reestimating the
## model, tuning the optimiser or using bootstrap via bootstrap=TRUE.
## 
## Model estimated using adam() function: ETSX(MNM)
## Response variable: drivers
## Distribution used in the estimation: Gamma
## Loss function type: likelihood; Loss function value: 1133.186
## Coefficients:
##              Estimate Std. Error Lower 2.5% Upper 97.5%  
## alpha          0.1303     0.0932     0.0000      0.3140  
## gamma          0.1757     0.6797     0.0000      0.8697  
## level       1642.6560    51.1449  1541.6686   1743.5210 *
## seasonal_1     1.0421     0.1216     1.0047      1.4893 *
## seasonal_2     0.9484     0.2267     0.9110      1.3956 *
## seasonal_3     0.9277     0.0427     0.8903      1.3749 *
## seasonal_4     0.8471     0.0498     0.8097      1.2943 *
## seasonal_5     0.9910     0.1976     0.9536      1.4382 *
## seasonal_6     0.9073     0.0189     0.8699      1.3544 *
## seasonal_7     0.9802     0.0288     0.9428      1.4274 *
## seasonal_8     0.9561     0.0643     0.9187      1.4033 *
## seasonal_9     0.9760     0.1576     0.9386      1.4232 *
## seasonal_10    1.0345     0.2092     0.9971      1.4817 *
## seasonal_11    1.1998     0.0545     1.1624      1.6470 *
## law            0.0191     0.0435    -0.0667      0.1049  
## 
## Error standard deviation: 0.0815
## Sample size: 180
## Number of estimated parameters: 16
## Number of degrees of freedom: 164
## Information criteria:
##      AIC     AICc      BIC     BICc 
## 2298.371 2301.709 2349.458 2358.124

Remark. The summary() method might complain about the observed Fisher Information. This only means that the estimated variances of parameters might be lower than they should be in reality. This is discussed in Section 16.2.
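
If this is a concern, the covariance matrix of parameters can be recalculated via bootstrap, as the warning itself suggests. A sketch (assuming that the bootstrap argument is passed on to the vcov() method; this is computationally expensive):

# Recalculate the covariance matrix of parameters via bootstrap,
# as suggested by the warning
summary(adamETSXMNMSelectSeat, bootstrap=TRUE)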

Based on the summary of the model, we can see that neither kms nor PetrolPrice improves the model in terms of AICc: they were not included in it. To check whether the selection has worked out well in our case, we can construct the sink regression (the model that includes all the available explanatory variables) as a benchmark:

adamETSXMNMSinkSeat <- adam(SeatbeltsData, "MNM",
                            h=12, holdout=TRUE)
summary(adamETSXMNMSinkSeat)
## Warning: Observed Fisher Information is not positive semi-definite, which means
## that the likelihood was not maximised properly. Consider reestimating the
## model, tuning the optimiser or using bootstrap via bootstrap=TRUE.
## 
## Model estimated using adam() function: ETSX(MNM)
## Response variable: drivers
## Distribution used in the estimation: Gamma
## Loss function type: likelihood; Loss function value: 1148.273
## Coefficients:
##              Estimate Std. Error Lower 2.5% Upper 97.5%  
## alpha          0.1850     0.0292     0.1274      0.2426 *
## gamma          0.5728     0.1756     0.2260      0.8150 *
## level       2199.8592   193.1049  1818.5320   2580.6654 *
## seasonal_1     1.0710     0.0938     0.9310      1.2741 *
## seasonal_2     0.8864     0.0730     0.7464      1.0894 *
## seasonal_3     0.9237     0.0817     0.7837      1.1267 *
## seasonal_4     0.8352     0.0709     0.6953      1.0383 *
## seasonal_5     0.9485     0.0827     0.8085      1.1515 *
## seasonal_6     0.9202     0.0833     0.7802      1.1233 *
## seasonal_7     0.9820     0.0901     0.8420      1.1850 *
## seasonal_8     0.9889     0.0891     0.8490      1.1920 *
## seasonal_9     0.9239     0.0807     0.7839      1.1269 *
## seasonal_10    1.0432     0.0962     0.9032      1.2462 *
## seasonal_11    1.3142     0.1030     1.1743      1.5173 *
## kms            0.0000     0.0000     0.0000      0.0000  
## PetrolPrice    0.2552     1.1220    -1.9604      2.4678  
## law            0.0223     0.0814    -0.1385      0.1829  
## 
## Error standard deviation: 0.09
## Sample size: 180
## Number of estimated parameters: 18
## Number of degrees of freedom: 162
## Information criteria:
##      AIC     AICc      BIC     BICc 
## 2332.546 2336.794 2390.019 2401.050

We can see that the sink regression model has a higher AICc value than the model with the selected variables, which means that the latter is closer to the “true model”. While adamETSXMNMSelectSeat might not be the best possible model in terms of information criteria, it is still a reasonable one and can be used for further inference.
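
As a quick direct check, the information criteria of the two models can also be compared without scrolling through the summaries, using the AICc() method from the greybox package:

# Compare AICc of the model with the selected variables and the sink one
c(Select=AICc(adamETSXMNMSelectSeat),
  Sink=AICc(adamETSXMNMSinkSeat))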

References

• Sagaert, Y., Svetunkov, I., 2022. Trace Forward Stepwise: Automatic Selection of Variables in No Time. Department of Management Science Working Paper Series. 1–25. https://doi.org/10.13140/RG.2.2.34995.35369