11.2 Quality of a fit

Building upon the discussion of the quality of the fit in Section 10.4, we can introduce a measure based on the OLS criterion (10.5), which is called the “Root Mean Squared Error” (RMSE), the “standard error”, or the “standard deviation of the error” of the regression: \[\begin{equation} \hat{\sigma} = \sqrt{\frac{1}{n-k} \sum_{j=1}^n e_j^2 }. \tag{11.15} \end{equation}\] Its square, \(\hat{\sigma}^2\), is an unbiased estimate of the variance of the error term \(\sigma^2\). The denominator of (11.15) contains the number of degrees of freedom in the model, \(n-k\), not the number of observations \(n\), so technically speaking this is not a “mean” any more. This is done to correct the in-sample bias (Section 6.3.1) of the measure.

The standard error does not tell us much about the in-sample performance on its own, but it can be used to compare several models with the same response variable: the lower it is, the better the model fits the data, given the number of estimated parameters. However, this measure is not aware of the randomness in the true model (Section 1.1.1) and will be equal to zero in a model that fits the data perfectly (thus ignoring the existence of the error term). This is a potential issue, as we might end up with a poor model that seems like the best one.

Here is how this can be calculated for our model, estimated using the alm() function:

sigma(costsModel02)

## [1] 43.41579
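
To see what sigma() returns, the calculation in (11.15) can be reproduced by hand. The following is a minimal sketch that assumes base R's lm() and the built-in mtcars data rather than alm() and the costs data used in this chapter; the model and the object names (exampleModel, nObs, nParam, rmseManual) are purely illustrative.

```r
# Illustrative model on a built-in dataset (not the costs data from this chapter)
exampleModel <- lm(mpg ~ wt + hp, data = mtcars)
# Number of observations and number of estimated coefficients (including intercept)
nObs <- nobs(exampleModel)
nParam <- length(coef(exampleModel))
# RMSE with the degrees-of-freedom correction, as in (11.15)
rmseManual <- sqrt(sum(resid(exampleModel)^2) / (nObs - nParam))
# Compare with the value returned by sigma()
c(manual = rmseManual, sigma = sigma(exampleModel))
```

The two values should coincide, because sigma() for an lm object divides the sum of squared residuals by the residual degrees of freedom before taking the square root.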

The value of RMSE does not provide any important insights on its own, but it can be compared to the RMSE of another model to decide which of the two fits the data better.

Similarly to the simple linear regression, we can calculate R\(^2\) (see Section 10.4). The problem is that the value of the coefficient of determination never decreases when more variables are included in the model. This is because every variable will explain some proportion of the variance in the data purely by chance. So, if we add redundant variables, the fit will improve, but the quality of the model will deteriorate. Here is an example:

# Record number of observations
n <- nobs(costsModel02)
set.seed(13)
# Generate white noise
SBA_Chapter_11_Costs$noise <- rnorm(n,0,10)
# Add it to the model
costsModel02WithNoise <- alm(overall~size+materials+projects+year+noise,
                             SBA_Chapter_11_Costs, loss="MSE")

The code above introduces a new variable, noise, which has nothing to do with the overall costs. We would expect this variable not to bring any value to the model. Here is the coefficient of determination of the new model:

1 - sum(resid(costsModel02WithNoise)^2) /
    (var(actuals(costsModel02WithNoise))*(n-1))

## [1] 0.7523579

Compare it with the previous one:

1 - sum(resid(costsModel02)^2) /
    (var(actuals(costsModel02))*(n-1))

## [1] 0.7489864

The value in the new model will always be higher than in the previous one (or equal to it in some very special cases), no matter how we generate the random fluctuations. This means that some sort of penalisation for the number of estimated parameters is required to make the measure more reasonable. This is what the adjusted coefficient of determination does: \[\begin{equation} R^2_{adj} = 1 - \frac{\hat{\sigma}^2}{\mathrm{V}(y)} = 1 - \frac{(n-1)\mathrm{SSE}}{(n-k)\mathrm{SST}} . \tag{11.16} \end{equation}\] So, instead of dividing sums of squares, in the adjusted R\(^2\) we divide quantities that are corrected for the degrees of freedom. Given the presence of \(k\) in formula (11.16), the coefficient will not necessarily increase with the addition of variables: when a variable does not reduce the SSE of the model substantially, R\(^2_{adj}\) will not go up. Furthermore, for the same response variable, if one model has a higher \(\hat{\sigma}^2\) than the other, then the R\(^2_{adj}\) of that model will be lower, which becomes apparent given that \(\hat{\sigma}^2\) enters formula (11.16) with a negative sign.
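
The link between the two coefficients can be made explicit. As a short derivation, assuming the definition \(R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}\) that the R code above implements, formula (11.16) can be rewritten as: \[\begin{equation*} R^2_{adj} = 1 - \frac{n-1}{n-k} \frac{\mathrm{SSE}}{\mathrm{SST}} = 1 - \frac{n-1}{n-k} \left(1 - R^2\right) . \end{equation*}\] Because \(\frac{n-1}{n-k} \geq 1\) whenever \(k \geq 1\), the adjusted coefficient never exceeds R\(^2\). Adding one variable changes the multiplier to \(\frac{n-1}{n-k-1}\), so R\(^2_{adj}\) goes up only if the new SSE divided by \(n-k-1\) is smaller than the old SSE divided by \(n-k\), that is, only if SSE falls proportionally more than the degrees of freedom do.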

Here is how the adjusted R\(^2\) can be calculated for a model in R:

setNames(c(1 - sigma(costsModel02)^2 / var(actuals(costsModel02)),
           1 - sigma(costsModel02WithNoise)^2 / var(actuals(costsModel02WithNoise))),
         c("R^2-adj","R^2-adj, Noise"))

##        R^2-adj R^2-adj, Noise 
##      0.7310569      0.7298450

In the output above, the model with the noise has a lower adjusted R\(^2\) than the model without it, which is what we will typically see. However, given that we deal with randomness, if you reproduce this example many times, you will see different situations, including those where introducing noise still increases the adjusted R\(^2\) just due to pure chance. So, you should not fully trust R\(^2_{adj}\) either. When constructing a model or deciding what to include in it, you should always use your judgement: make sure that the variables included in the model are meaningful. Otherwise you can easily overfit the data, which would lead to inefficient estimates of parameters (see Section 15 for details) and inaccurate forecasts.
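
The effect of pure chance can be checked with a small simulation. The sketch below is not based on the costs data: it assumes a hypothetical data generating process with one relevant variable, uses base R's lm() instead of alm(), and adds a freshly generated noise variable 1000 times. All object names are illustrative.

```r
set.seed(41)
nObs <- 100
# Simulated data: y depends on x only (hypothetical data generating process)
x <- rnorm(nObs, 10, 2)
y <- 5 + 2 * x + rnorm(nObs, 0, 3)
# Reference model without the noise variable
baseSummary <- summary(lm(y ~ x))
baseR2 <- baseSummary$r.squared
baseR2adj <- baseSummary$adj.r.squared
# Add a fresh pure-noise variable 1000 times and record both measures
results <- replicate(1000, {
  noise <- rnorm(nObs, 0, 10)
  noiseSummary <- summary(lm(y ~ x + noise))
  c(R2 = noiseSummary$r.squared, R2adj = noiseSummary$adj.r.squared)
})
# Share of replications in which each measure did not decrease (R2)
# or strictly increased (R2adj) after adding the noise
c(R2 = mean(results["R2", ] >= baseR2),
  R2adj = mean(results["R2adj", ] > baseR2adj))
```

In a setup like this, R\(^2\) does not decrease in any replication, while the adjusted R\(^2\) increases in a noticeable minority of them: it goes up whenever the absolute value of the t-statistic of the added variable exceeds one, which can happen for an irrelevant variable by chance alone.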

11.2.1 Common mistakes related to quality of a fit