
14.3 Model specification: Transformations

The question of appropriate transformations for the variables in a model is challenging, because it is difficult to decide what sort of transformation is needed, if one is needed at all. In many cases, this comes down to selecting between an additive linear model and a multiplicative one. This implies that we compare the model: \[\begin{equation} y_t = a_0 + a_1 x_{1,t} + \dots + a_n x_{n,t} + \epsilon_t, \tag{14.1} \end{equation}\] and \[\begin{equation} y_t = \exp\left(a_0 + a_1 x_{1,t} + \dots + a_n x_{n,t} + \epsilon_t\right) . \tag{14.2} \end{equation}\] The latter model is equivalent to the so-called “log-linear” model, but can also include logarithms of explanatory variables instead of the variables themselves.
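To make the distinction concrete, here is a minimal sketch (not from the book) estimating both forms with base R's lm() on simulated data; the data generating process, sample size, and coefficient values are all assumed for illustration:

```r
# Simulated data following the multiplicative model (14.2):
# y_t = exp(a0 + a1*x_t + e_t), with assumed a0 = 2 and a1 = 0.1
set.seed(41)
n <- 120
x <- runif(n, 10, 40)
y <- exp(2 + 0.1 * x + rnorm(n, 0, 0.1))

additiveModel  <- lm(y ~ x)       # model (14.1), estimated directly
logLinearModel <- lm(log(y) ~ x)  # model (14.2), after taking logarithms

# The log-linear model recovers the assumed coefficients
coef(logLinearModel)
```

In practice, of course, we do not know the data generating process, which is why the diagnostic plots discussed in this section are needed.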

There are different ways to diagnose a wrong transformation, which sometimes help in detecting it. The first one is the actuals vs fitted plot:
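A plot of this kind can be produced with base R along the following lines (a sketch with simulated values standing in for a real model's output; not the book's code):

```r
# Sketch of an actuals vs fitted diagnostic plot on simulated values
set.seed(42)
actualsValues <- rnorm(100, 100, 10)
fittedValues  <- actualsValues + rnorm(100, 0, 5)  # stand-in for fitted(model)

plot(fittedValues, actualsValues, xlab = "Fitted", ylab = "Actuals")
abline(a = 0, b = 1, col = "grey", lty = 2)        # 100% fit line
loessLine <- lowess(fittedValues, actualsValues)
lines(loessLine, col = "red")                      # LOESS tendency line
```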


The grey dashed line on the plot corresponds to the situation when actuals and fitted coincide (100% fit). The red line on the plot above is a LOESS line, produced by the lowess() function in R, which smooths the scatterplot to reflect potential tendencies in the data. In the ideal situation, this red line should coincide with the grey line. In addition, the variability around the line should not change as the fitted values increase. In our case, there is a slight U-shape in the red line and an insignificant increase in variability around the middle of the data. This could either be due to pure randomness and thus should be ignored, or could indicate a slight non-linearity in the data. After all, we have constructed a pure additive model on data that exhibits seasonality with multiplicative characteristics, which becomes especially apparent at the end of the series, where the drop in level is accompanied by a decrease in the variability of the data:


In order to diagnose this properly, we might use other instruments. One of these is the analysis of standardised residuals. The formula for the standardised residuals differs depending on the assumed distribution and, for some distributions, reduces to the value inside the “\(\exp\)” part of the probability density function:

  1. Normal, \(\epsilon_t \sim \mathcal{N}(0, \sigma^2)\): \(u_t = \frac{e_t - \bar{e}}{\hat{\sigma}}\);
  2. Laplace, \(\epsilon_t \sim \mathcal{Laplace}(0, s)\): \(u_t = \frac{e_t - \bar{e}}{\hat{s}}\);
  3. S, \(\epsilon_t \sim \mathcal{S}(0, s)\): \(u_t = \frac{e_t - \bar{e}}{\hat{s}^2}\);
  4. Generalised Normal, \(\epsilon_t \sim \mathcal{GN}(0, s, \beta)\): \(u_t = \frac{e_t - \bar{e}}{\hat{s}^{\frac{1}{\beta}}}\);
  5. Inverse Gaussian, \(1+\epsilon_t \sim \mathcal{IG}(1, s)\): \(u_t = \frac{1+e_t}{1+\bar{e}}\);
  6. Gamma, \(1+\epsilon_t \sim \mathcal{\Gamma}(s^{-1}, s)\): \(u_t = \frac{1+e_t}{1+\bar{e}}\);
  7. Log Normal, \(1+\epsilon_t \sim \mathrm{log}\mathcal{N}\left(-\frac{\sigma^2}{2}, \sigma^2\right)\): \(u_t = \frac{e_t - \bar{e} +\frac{\hat{\sigma}^2}{2}}{\hat{\sigma}}\),

where \(\bar{e}\) is the mean of the residuals, which is typically assumed to be zero, and \(u_t\) is the value of the standardised residuals. Note that the scales in the formulae above should be calculated via the formula with the bias correction, i.e. with division by the degrees of freedom rather than by the number of observations. Also note that in the case of \(\mathcal{IG}\), \(\Gamma\) and \(\mathrm{log}\mathcal{N}\) and additive error models, the formulae for the standardised residuals will be the same, only the assumptions will change (see Section 5.5).
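For instance, the Normal case can be computed manually as follows (a hypothetical helper written for illustration, not a function from the smooth package; the number of estimated parameters is assumed):

```r
# Standardised residuals for the Normal case: u_t = (e_t - mean(e)) / sigma-hat,
# with the scale estimated with the bias correction, i.e. divided by the
# degrees of freedom rather than by the number of observations
standardiseResiduals <- function(e, nParam) {
  eBar <- mean(e)
  sigmaHat <- sqrt(sum((e - eBar)^2) / (length(e) - nParam))
  (e - eBar) / sigmaHat
}

set.seed(7)
e <- rnorm(50, 0, 3)  # stand-in for residuals of a model with 4 parameters
u <- standardiseResiduals(e, nParam = 4)
```

By construction, u has a mean of exactly zero, while its sample standard deviation is slightly below one because of the degrees-of-freedom correction in the scale.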

Here is an example of a plot of fitted vs standardised residuals in R:


Figure 14.2: Diagnostics of pure additive ETSX model.

Given that the scale of the original variable is removed in the standardised residuals, it might be easier to spot the non-linearity. In our case, it is still not apparent, but there is a slight U-shape in the LOESS line and a slight change in variance. Another plot that we have already used before is the standardised residuals over time:


This plot shows that there is a slight decline in the residuals around the year 1977. Still, there is no prominent non-linearity in the residuals, so it is not clear whether any transformations are needed.
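The over-time plot itself can be produced with base R as follows (simulated residuals standing in for the model's; a sketch, not the book's code):

```r
# Standardised residuals over time with a zero reference line
set.seed(11)
uResid <- ts(rnorm(192), start = c(1969, 1), frequency = 12)  # assumed monthly series
plot(uResid, ylab = "Standardised residuals")
abline(h = 0, col = "grey", lty = 2)
```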

However, based on my judgment and understanding of the problem, I would expect the number of injuries and deaths to change proportionally to the change in the level of the data: if, after some external intervention, the overall level of injuries and deaths were to increase, we would expect a percentage decline, not a unit decline, from a change in the variables already in the model. This is why I will try a multiplicative model next:

adamModelSeat05 <- adam(Seatbelts, "MNM")
# Note: the remaining arguments of this call were truncated in the source

Figure 14.3: Diagnostics of pure multiplicative ETSX model.

The plot shows that the variability is now slightly more uniform across the fitted values, but the difference between Figures 14.2 and 14.3 is not very prominent. One potential solution in this situation is to compare the models in terms of information criteria:

setNames(c(AICc(adamModelSeat03), AICc(adamModelSeat05)),
         c("Additive model", "Multiplicative model"))
##       Additive model Multiplicative model 
##             2233.949             2237.081
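For reference, AICc can be reproduced from a model's log-likelihood via \(\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n-k-1}\), where \(k\) counts all estimated parameters (including the scale). A sketch using a simple lm() fit on a built-in dataset, not the book's models:

```r
# Manual AICc: AIC plus the small-sample correction term
aiccManual <- function(model) {
  ll <- logLik(model)
  k <- attr(ll, "df")  # number of estimated parameters, including the scale
  n <- nobs(model)
  -2 * as.numeric(ll) + 2 * k + 2 * k * (k + 1) / (n - k - 1)
}

fit <- lm(dist ~ speed, data = cars)
aiccManual(fit)  # slightly exceeds AIC(fit) due to the correction term
```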

Based on this, we would be inclined to select the additive model, which has the lower AICc. My personal judgment in this specific case disagrees with the information criterion.