11.2 Quality of a fit
Building upon the discussion of the quality of the fit in Section 10.4, we can introduce a measure based on the OLS criterion (10.4), which is called either the “Root Mean Squared Error” (RMSE), the “standard error”, or the “standard deviation of error” of the regression:
\[\begin{equation} \hat{\sigma} = \sqrt{\frac{1}{n-k} \sum_{j=1}^n e_j^2 }. \tag{11.15} \end{equation}\]
The denominator of (11.15) contains the number of degrees of freedom in the model, \(n-k\), not the number of observations \(n\), so technically speaking this is not a “mean” any more. This is done to correct the in-sample bias (Section 6.3.1) of the measure. On its own, the standard error does not tell us much about the in-sample performance, but it can be used to compare several models with the same response variable: the lower it is, the better the model fits the data, given the number of estimated parameters. However, this measure is not aware of the randomness in the true model (Section 1.1.1) and thus will be equal to zero in a model that fits the data perfectly (ignoring the existence of the error term). This is a potential issue, as we might end up with a poor model that seems like the best one.
Here is how this can be calculated for our model, estimated using the alm() function:
## [1] 30.56428
The value of RMSE does not provide any important insights on its own, but it can be compared to the RMSE of another model to decide which of the two fits the data better.
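To make formula (11.15) concrete, here is a minimal sketch of the same calculation done by hand. It is an illustration under assumptions: it uses base R's lm() and the built-in mtcars data as a stand-in, not the costs model from the text, and the names fit and rmseManual are hypothetical.

```r
# Stand-in example: base R lm() on the built-in mtcars data
# (not the costs model estimated with alm() in the text).
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nobs(fit)           # number of observations
k <- length(coef(fit))   # number of estimated coefficients, incl. intercept
e <- residuals(fit)      # in-sample residuals

# RMSE corrected for degrees of freedom, as in (11.15)
rmseManual <- sqrt(sum(e^2) / (n - k))

# For lm objects, sigma() returns the same quantity
all.equal(rmseManual, sigma(fit))
```

Dividing by \(n-k\) rather than \(n\) is the only difference from a plain root mean squared residual, so the two values get closer as the sample grows relative to the number of parameters.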
Similarly to the simple linear regression, we can calculate the R\(^2\) (see Section 10.4). The problem is that the value of the coefficient of determination almost always increases, and never decreases, as more variables are included in the model. This is because every variable will explain some proportion of the variance in the data purely due to randomness. So, if we add redundant variables, the fit will improve, but the quality of the model will deteriorate. Here is an example:
# Record number of observations
n <- nobs(costsModel02)
# Generate white noise
SBA_Chapter_11_Costs$noise <- rnorm(n,0,10)
# Add it to the model
costsModel02WithNoise <- alm(overall~size+materials+projects+year+noise,
SBA_Chapter_11_Costs, loss="MSE")
The code above introduces a new variable, noise, which has nothing to do with the overall costs. We would expect that this variable would not bring value to the model. And here is the value of the determination coefficient of the new model:
## [1] 0.8029458
Compare it with the previous one:
## [1] 0.8016625
The value in the new model will always be higher than in the previous one (or equal to it in some very special cases), no matter how we generate the random fluctuations. This means that some sort of penalisation of the number of estimated parameters is required to make the measure more reasonable. This is what the adjusted coefficient of determination does:
\[\begin{equation} R^2_{adj} = 1 - \frac{\hat{\sigma}^2}{\mathrm{V}(y)} = 1 - \frac{(n-1)\mathrm{SSE}}{(n-k)\mathrm{SST}} . \tag{11.16} \end{equation}\]
So, instead of dividing sums of squares, in the adjusted R\(^2\) we divide the quantities that are corrected for the degrees of freedom. Given the presence of \(k\) in the formula (11.16), the coefficient will not necessarily increase with the addition of variables: when a variable does not contribute substantially to the reduction of the SSE of the model, R\(^2_{adj}\) will not go up. Furthermore, if one model has a higher \(\hat{\sigma}^2\) than the other one, then the R\(^2_{adj}\) of that model will be lower, which becomes apparent given that we have \(-\hat{\sigma}^2\) in the formula (11.16).
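To see how the penalty works, formula (11.16) can be rewritten in terms of the conventional coefficient of determination. This is a short sketch that assumes the usual definition \(R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}\):
\[\begin{equation*} R^2_{adj} = 1 - \frac{n-1}{n-k} \frac{\mathrm{SSE}}{\mathrm{SST}} = 1 - \frac{n-1}{n-k} \left(1 - R^2\right) . \end{equation*}\]
Adding one variable changes the multiplier from \(\frac{n-1}{n-k}\) to \(\frac{n-1}{n-k-1}\), which is larger. So R\(^2_{adj}\) goes up only if the new variable reduces SSE by enough to offset this, that is, if \(\frac{\mathrm{SSE}_{new}}{\mathrm{SSE}_{old}} < \frac{n-k-1}{n-k}\).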
Here is how the adjusted R\(^2\) can be calculated for a model in R:
setNames(c(1 - sigma(costsModel02)^2 / var(actuals(costsModel02)),
1 - sigma(costsModel02WithNoise)^2 / var(actuals(costsModel02WithNoise))),
c("R^2-adj","R^2-adj, Noise"))
## R^2-adj R^2-adj, Noise
## 0.7874955 0.7850317
What we will typically see in the output above is that the model with the noise has a lower value of adjusted R\(^2\) than the model without it. However, given that we deal with randomness, if you reproduce this example many times, you will see different situations, including those where introducing noise still increases the value of the coefficient purely by chance. So, you should not fully trust R\(^2_{adj}\) either. When constructing a model or deciding what to include in it, you should always use your judgement: make sure that the variables included in the model are meaningful. Otherwise you can easily overfit the data, which would lead to inefficient estimates of parameters (see Section 15 for details) and inaccurate forecasts.
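The point about chance can be checked with a small simulation. The sketch below is an illustration under assumptions: it uses base R's lm() and the built-in mtcars data as a stand-in rather than the costs model, the seed and number of repetitions are arbitrary, and the names baseFit, noiseFit, r2Up and r2AdjUp are hypothetical.

```r
# Stand-in example: base R lm() on the built-in mtcars data
# (not the costs model estimated with alm() in the text).
set.seed(41)   # arbitrary seed, used only for reproducibility
nsim <- 1000   # arbitrary number of repetitions
baseFit <- summary(lm(mpg ~ wt + hp, data = mtcars))

r2Up <- r2AdjUp <- logical(nsim)
for (i in seq_len(nsim)) {
  dataNoise <- mtcars
  # Pure noise, unrelated to the response by construction
  dataNoise$noise <- rnorm(nrow(dataNoise), 0, 10)
  noiseFit <- summary(lm(mpg ~ wt + hp + noise, data = dataNoise))
  # Small tolerance guards against floating point ties
  r2Up[i] <- noiseFit$r.squared >= baseFit$r.squared - 1e-12
  r2AdjUp[i] <- noiseFit$adj.r.squared > baseFit$adj.r.squared
}

mean(r2Up)      # share of runs where R^2 did not decrease
mean(r2AdjUp)   # share of runs where adjusted R^2 increased by chance
```

In this setup, R\(^2\) does not decrease in any of the runs, while the adjusted R\(^2\) still goes up in a noticeable share of them. This is why the adjusted coefficient reduces the problem but does not replace judgement about which variables are meaningful.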