10.4 Quality of a fit
In order to get a general impression of the performance of the estimated model, we can calculate several in-sample measures, which can provide insights into the fit of the model.
The fundamental measure that lies at the basis of many others is SSE, which is the value of the OLS criterion (10.4). It cannot be interpreted on its own and cannot be used for model comparison, but it shows the overall variability of the data around the regression line. In a more general case, it is written as: \[\begin{equation} \mathrm{SSE} = \sum_{j=1}^n (y_j - \hat{y}_j)^2 . \tag{10.20} \end{equation}\] This sum of squares is related to two others, the first being the Sum of Squares Total: \[\begin{equation} \mathrm{SST}=\sum_{j=1}^n (y_j - \bar{y})^2, \tag{10.21} \end{equation}\] where \(\bar{y}\) is the in-sample mean. If we divide the value (10.21) by \(n-1\), then we will get the in-sample variance (introduced in Section 5.1): \[\begin{equation*} \mathrm{V}(y)=\frac{\mathrm{SST}}{n-1}=\frac{1}{n-1} \sum_{j=1}^n (y_j - \bar{y})^2 . \end{equation*}\] The last sum of squares is the Sum of Squares of Regression: \[\begin{equation} \mathrm{SSR} = \sum_{j=1}^n (\bar{y} - \hat{y}_j)^2 , \tag{10.22} \end{equation}\] which shows the variability of the regression line around the mean. It is possible to show that in the linear regression (this is important! This property might be violated in other models), the three sums are related to each other via the following equation: \[\begin{equation} \mathrm{SST} = \mathrm{SSE} + \mathrm{SSR} . \tag{10.23} \end{equation}\]
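To make these quantities more tangible, here is a minimal sketch of how the three sums could be computed in R. It assumes that the simple linear regression from the previous sections was fitted to the mtcars data and saved in an object called slmMPGWt (the object name and the exact estimation call are assumptions made for the sake of the example):

```r
# Hypothetical model object; the name and the call are assumptions
slmMPGWt <- lm(mpg~wt, data=mtcars)

# Actual and fitted values from the model
y <- mtcars$mpg
yFitted <- fitted(slmMPGWt)

# Sum of Squared Errors (10.20)
SSE <- sum((y - yFitted)^2)
# Sum of Squares Total (10.21)
SST <- sum((y - mean(y))^2)
# Sum of Squares of Regression (10.22)
SSR <- sum((yFitted - mean(y))^2)
c(SSE=SSE, SST=SST, SSR=SSR)
```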
Proof. This involves manipulations, some of which are not straightforward: \[\begin{equation} \begin{aligned} \mathrm{SST} &= \mathrm{SSR} + \mathrm{SSE} = \sum_{j=1}^n (\hat{y}_j - \bar{y})^2 + \sum_{j=1}^n (y_j - \hat{y}_j)^2 \\ &= \sum_{j=1}^n \left( \hat{y}_j^2 - 2 \hat{y}_j \bar{y} + \bar{y}^2 \right) + \sum_{j=1}^n \left( y_j^2 - 2 y_j \hat{y}_j + \hat{y}_j^2 \right) \\ &= \sum_{j=1}^n \left( \hat{y}_j^2 - 2 \hat{y}_j \bar{y} + \bar{y}^2 + y_j^2 - 2 y_j \hat{y}_j + \hat{y}_j^2 \right) \\ &= \sum_{j=1}^n \left(\bar{y}^2 -2 \bar{y} y_j + y_j^2 + 2 \bar{y} y_j + \hat{y}_j^2 - 2 \hat{y}_j \bar{y} - 2 y_j \hat{y}_j + \hat{y}_j^2 \right) \\ &= \sum_{j=1}^n \left((\bar{y} - y_j)^2 + 2 \bar{y} y_j + 2 \hat{y}_j^2 - 2 \hat{y}_j \bar{y} - 2 y_j \hat{y}_j \right) \end{aligned} . \tag{10.24} \end{equation}\] We can then substitute \(y_j=\hat{y}_j+e_j\) in the right-hand side of (10.24) to get: \[\begin{equation} \begin{aligned} \mathrm{SST} &= \sum_{j=1}^n \left((\bar{y} - y_j)^2 + 2 \bar{y} (\hat{y}_j+e_j) + 2 \hat{y}_j^2 - 2 \hat{y}_j \bar{y} - 2 (\hat{y}_j+e_j) \hat{y}_j \right) \\ &= \sum_{j=1}^n \left((\bar{y} - y_j)^2 + 2 \bar{y} \hat{y}_j + 2 \bar{y} e_j + 2 \hat{y}_j^2 - 2 \hat{y}_j \bar{y} - 2 \hat{y}_j\hat{y}_j -2 e_j \hat{y}_j \right) \\ &= \sum_{j=1}^n \left((\bar{y} - y_j)^2 + 2 \bar{y} e_j + 2 \hat{y}_j^2 - 2 \hat{y}_j^2 -2 e_j \hat{y}_j \right) \\ &= \sum_{j=1}^n \left((\bar{y} - y_j)^2 + 2 \bar{y} e_j - 2 e_j \hat{y}_j \right) \end{aligned} . \tag{10.25} \end{equation}\] Now if we split the sum into three elements, we will get: \[\begin{equation} \begin{aligned} \mathrm{SST} &= \sum_{j=1}^n (\bar{y} - y_j)^2 + 2 \sum_{j=1}^n \left(\bar{y} e_j\right) - 2 \sum_{j=1}^n \left(e_j \hat{y}_j \right) \\ &= \sum_{j=1}^n (\bar{y} - y_j)^2 + 2 \bar{y} \sum_{j=1}^n e_j - 2 \sum_{j=1}^n \left(e_j \hat{y}_j \right) \end{aligned} . \tag{10.26} \end{equation}\] The second sum in (10.26) is equal to zero, because OLS guarantees that the in-sample mean of the error term is equal to zero (see proof in Subsection 10.3). The third one can be expanded to: \[\begin{equation} \begin{aligned} \sum_{j=1}^n \left(e_j \hat{y}_j \right) = \sum_{j=1}^n \left(e_j b_0 + b_1 e_j x_j \right) \end{aligned} . \tag{10.27} \end{equation}\] We see the sum of errors in the first term of (10.27), so it is equal to zero again. The second term is equal to zero as well due to OLS estimation (this was also proven in Subsection 10.3). This means that: \[\begin{equation} \mathrm{SST} = \sum_{j=1}^n (\bar{y} - y_j)^2 , \tag{10.28} \end{equation}\] which is the formula of SST (10.21).
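Using the sums from the sketch above, the decomposition (10.23) can also be checked numerically; in the linear regression with an intercept the difference should be zero up to machine precision:

```r
# Numerical check of SST = SSE + SSR (10.23)
SST - (SSE + SSR)
```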
The relation between SSE, SSR and SST is shown in Figure 10.7. If we take any observation in that Figure, we will see how the deviations from the regression line and from the mean are related.
Building upon that, there is a measure called “Coefficient of Determination”, which is calculated based on the sums of squares discussed above: \[\begin{equation} \mathrm{R}^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} = \frac{\mathrm{SSR}}{\mathrm{SST}} . \tag{10.29} \end{equation}\] Given the meaning of the sums of squares, we can imagine the following situations to interpret the values of \(\mathrm{R}^2\):
- The model fits the data in the same way as the mean line (grey line in Figure 10.7). In this case SSE would be equal to SST and SSR would be equal to zero (because \(\hat{y}_j=\bar{y}\)) and as a result the R\(^2\) would be equal to zero.
- The model fits the data perfectly, without any errors (all points lie on the black line in Figure 10.7). In this situation SSE would be equal to zero and SSR would be equal to SST, because the regression would go through all points (i.e. \(\hat{y}_j=y_j\)). This would make R\(^2\) equal to one.
In the linear regression model, due to (10.23), the coefficient of determination always lies between zero and one, where zero means that the model does not explain the data at all and one means that it overfits the data. The value itself is usually interpreted as the percentage of variability in the data explained by the model. This interpretation gives us an important point about the coefficient of determination: it should not be equal to one, and it is alarming if it is very close to one, because in that situation we are implying that there is no randomness in the data, which contradicts our definition of the statistical model (see Section 1.1.1). An adequate statistical model should always have some randomness in it. The situation of \(\mathrm{R}^2=1\) corresponds to: \[\begin{equation*} y_j = b_0 + b_1 x_j , \end{equation*}\] implying that all \(e_j=0\), which is unrealistic and is only possible if there is a functional relation between \(y\) and \(x\) (in which case there is no need for statistical inference). So, in practice we should not maximise R\(^2\) and should be careful with models that have very high values of it. At the same time, too low values of R\(^2\) are also alarming, as they tell us that the model becomes: \[\begin{equation*} y_j = b_0 + e_j, \end{equation*}\] meaning that it is not different from the global mean. So, the coefficient of determination is, in general, not a very good measure for assessing the performance of a model. It can be used for further inferences and as a basic indication of whether the model overfits (R\(^2\) close to one) or underfits (R\(^2\) close to zero) the data, but no serious conclusions should be made based on it.
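The two extreme situations discussed above can be illustrated with a small artificial example in R (the data below is made up purely for demonstration):

```r
# A purely functional relation between x and y gives R^2 of one
x <- 1:10
yDeterministic <- 2 + 3*x
summary(lm(yDeterministic~x))$r.squared

# A model with the intercept only is just the global mean,
# so its R^2 is zero
set.seed(41)
yRandom <- rnorm(10, 100, 15)
summary(lm(yRandom~1))$r.squared
```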
Here is how this measure can be calculated in R, based on the model estimated earlier:
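A minimal sketch of such a calculation, reusing the hypothetical slmMPGWt object from above and the relation between SST and the in-sample variance:

```r
# R^2 via SSE and SST, where SST = V(y) * (n-1)
1 - sum(resid(slmMPGWt)^2) /
    (var(mtcars$mpg) * (nrow(mtcars)-1))
```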
## [1] 0.7528328
Note that in this formula we used the relation between SST and V\((y)\), multiplying the variance by \(n-1\) to get rid of the denominator. The resulting value tells us that the model has explained 75.3% of the variability in the data.
Finally, based on the coefficient of determination, we can also calculate the coefficient of multiple correlation, which we have already discussed in Section 9.4: \[\begin{equation} R = \sqrt{R^2} = \sqrt{\frac{\mathrm{SSR}}{\mathrm{SST}}} . \tag{10.30} \end{equation}\] It shows how close the relation between the response variable \(y_j\) and the explanatory variables is to a linear one. The coefficient always has a positive sign, no matter what the relation between the variables is. In the case of the simple linear regression, it is equal to the correlation coefficient (from Section 9.3) with the sign equal to the sign of the slope coefficient \(b_1\): \[\begin{equation} r_{x,y} = \mathrm{sign} (b_1) R . \tag{10.31} \end{equation}\]
Here is a demonstration of the formula above in R:
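A sketch of how this could be done, again assuming the hypothetical slmMPGWt object fitted as lm(mpg~wt, data=mtcars):

```r
# Multiple correlation with the sign of the slope, as in (10.31)
sign(coef(slmMPGWt)["wt"]) * sqrt(summary(slmMPGWt)$r.squared)
# Compare with the correlation coefficient between the two variables
cor(mtcars$mpg, mtcars$wt)
```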
## wt
## -0.8676594
## [1] -0.8676594