
12.1 Confidence intervals

Given that the estimates of parameters have some uncertainty associated with them, as discussed in the introduction to this Chapter, it makes sense to capture that uncertainty so that decision makers can have a better understanding about the observed effects. The simplest way to do that is to construct confidence intervals in a similar way to what we discussed in Section 6.4. Visually, this process is shown in Figure 12.3, which continues the example we discussed before.

Figure 12.3: Parameter uncertainty in the estimated model

Figure 12.3 demonstrates the bootstrapped distributions of parameters (as before), together with the Normal probability density functions on top of them and vertical lines at the tails that represent the confidence bounds. The idea behind this is that due to the Central Limit Theorem (Section 6.2) we can assume that the estimates of parameters follow the Normal distribution with some mean and variance, and based on that we can get the quantiles, thus inferring, for example, that the true Intercept will lie between 190.79 and 227.67 in 95% of the cases, if we repeat the resampling experiment many times. So, the interpretation of the confidence interval is exactly the same as in the simpler case discussed in Section 6.4. The formula for it is slightly different, because it is constructed for the parameters of the model rather than the mean of the sample, but the logic is exactly the same: \[\begin{equation} \beta_i \in (b_i + t_{\alpha/2}(n-k) s_{b_i}, b_i + t_{1-\alpha/2}(n-k) s_{b_i}), \tag{12.1} \end{equation}\] where \(s_{b_i}\) is the standard error of the parameter \(b_i\). An important note to make is that, as usual with confidence interval construction, we can use the Normal distribution only when the variance of parameters is known. In reality it is not, so, as discussed in Section 6.4, we need to use Student’s t distribution. This is why we have \(t_{\alpha/2}(n-k)\) in equation (12.1) above.

Example 12.1 To calculate the confidence interval using equation (12.1), we need to know several things:

  1. Significance level \(\alpha\). We define it either based on our preferences or on the task at hand, and it should be selected prior to the construction of the interval. The typical one is 5%, mainly because the standard human has five fingers on a hand;
  2. The value of the estimated parameter \(b_i\). We get it after using Least Squares; any statistical software can give us the estimate;
  3. The standard error (or deviation) of the parameter. We need an additional formula to get it;
  4. The number of degrees of freedom \(n-k\), which can be easily calculated based on the sample size \(n\) and the number of estimated parameters \(k\);
  5. Student’s t statistic. This can be obtained from statistical tables or software, given the significance level \(\alpha\) and the number of degrees of freedom \(n-k\) calculated above.

Consider the construction of the interval for the slope parameter of materials from the regression we discussed earlier. In our case, we know the following:

  1. \(\alpha=0.05\) because we decided to produce the 95% confidence interval. We could choose another value, which would result in an interval of a different width;
  2. The value of the parameter \(b_1\) is 1.1649173;
  3. We have not discussed yet how to obtain the variance of the parameter analytically, but there is a built-in formula in statistical software that gives it, and/or we could get it via the bootstrap, simply by taking the variance of the values from the distribution in Figure 12.3. In R, we can get the variance of the slope parameter (the second parameter in the model) via the vcov() function, like this:
vcov(costsModelSLR)[2,2]
## [1] 0.006730734
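As a sketch of the bootstrap idea mentioned in point (3), we could resample the rows of the data, re-estimate the model on each resample and then take the variance of the resulting slopes. The costs data itself is not reproduced in this section, so the example below uses a small synthetic dataset with similar characteristics, and the resulting numbers will differ from those in the chapter:

```r
set.seed(41)
# Synthetic data for illustration only (not the costs data from the chapter)
materials <- runif(61, 20, 200)
overall <- 210 + 1.16 * materials + rnorm(61, 0, 32)
dataExample <- data.frame(overall, materials)

# Resample rows with replacement, refit the model and store the slope
nBoot <- 1000
slopes <- replicate(nBoot, {
  idx <- sample(61, replace = TRUE)
  coef(lm(overall ~ materials, data = dataExample[idx, ]))["materials"]
})
var(slopes)  # bootstrap estimate of the slope variance
```

The bootstrap estimate should be of the same order as the analytical variance returned by vcov() for the same model, with the agreement improving as the number of resamples grows.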

Based on that, we can say that the standard error of the slope is approximately 0.082.

  4. We estimated two parameters, so \(k=2\), and the sample size \(n\) was 61, which means that our model has \(n-k=61-2=59\) degrees of freedom;
  5. To get the Student’s t statistic correctly, we need to split the significance level \(\alpha\) into two equal parts, meaning that we will have 2.5% of values below the lower bound and another 2.5% of values above the upper one. So, we will calculate it for \(\alpha/2=0.025\) and \(1-\alpha/2=0.975\) with \(n-k=59\). In R, we can use the qt() function in the following way:

qt(c(0.025,0.975), 59)
## [1] -2.000995  2.000995

Taking all these values and inserting them in formula (12.1), we get two numbers, representing the lower and the upper bounds of the 95% confidence interval respectively: (1.0008, 1.3291).
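Putting the five ingredients together, the calculation of formula (12.1) for the slope can be reproduced directly in R (the numbers below are the values obtained in this example):

```r
b1 <- 1.1649173                       # estimated slope
sB1 <- sqrt(0.006730734)              # its standard error, from vcov()
tCrit <- qt(c(0.025, 0.975), df = 59) # Student's t quantiles
b1 + tCrit * sB1                      # approximately 1.0008 and 1.3291
```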

In R, the confidence interval can be obtained via the confint() function, and it should be close to what we obtained above, but not exactly the same due to rounding:

confint(costsModelSLR)
##                    S.E.       2.5%     97.5%
## (Intercept) 16.22671606 176.909386 241.87199
## materials    0.08204105   1.000694   1.32914

The confidence interval for materials above shows, for example, that if we repeat the construction of the interval many times on different samples of data, the true value of the parameter will lie between 1.0006943 and 1.3291403 in 95% of the cases. This gives us an idea about the real effect in the population and how certain we are about it.

The question that we still have not resolved is how to calculate the variance of parameters. If we estimate the model via OLS or Maximisation of the Likelihood (see Chapter 16), there is an analytical solution for the variance. In the general case, the following formula is used to get the covariance matrix of parameters: \[\begin{equation} \mathrm{V}(\hat{\boldsymbol{\beta}}) = \frac{1}{n-k} \sum_{j=1}^n e_j^2 \times \left(\mathbf{X}' \mathbf{X}\right)^{-1}, \tag{12.2} \end{equation}\] where \(\mathbf{X}\) is the matrix of explanatory variables from equation (11.7) and \(e_j\) is the residual of the model on observation \(j\). The result of this calculation is a matrix containing the variances of parameters on the diagonal and the covariances between them off the diagonal. For now, we are only interested in the diagonal, so we can ignore the covariances. The vcov() function in R uses exactly this formula in the case of OLS/Likelihood estimation, returning the following:

vcov(costsModelSLR)
##             (Intercept)    materials
## (Intercept)  263.306314 -1.288465587
## materials     -1.288466  0.006730734
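To see how formula (12.2) works, it can be reproduced manually. The sketch below uses a small synthetic dataset (the original costs data is not shown in this section, so the resulting numbers will differ) and compares the manual calculation with what vcov() returns for an lm() model:

```r
set.seed(41)
# Synthetic data for illustration only
materials <- runif(61, 20, 200)
overall <- 210 + 1.16 * materials + rnorm(61, 0, 32)

fit <- lm(overall ~ materials)
X <- cbind(1, materials)  # matrix of explanatory variables, with the intercept
e <- residuals(fit)
n <- length(overall); k <- ncol(X)

# Formula (12.2): residual variance times the inverse of X'X
vcovManual <- sum(e^2) / (n - k) * solve(t(X) %*% X)
vcovManual
vcov(fit)  # coincides with the manual calculation
```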

We can also present all of this in the following summary (this is based on the alm() model; other functions will produce a different output):

summary(costsModelSLR)
## Response variable: overall
## Distribution used in the estimation: Normal
## Loss function used in estimation: likelihood
## Coefficients:
##             Estimate Std. Error Lower 2.5% Upper 97.5%  
## (Intercept) 209.3907    16.2267   176.9094    241.8720 *
## materials     1.1649     0.0820     1.0007      1.3291 *
## 
## Error standard deviation: 31.8742
## Sample size: 61
## Number of estimated parameters: 3
## Number of degrees of freedom: 58
## Information criteria:
##      AIC     AICc      BIC     BICc 
## 598.3734 598.7944 604.7060 605.5714

This summary provides all the necessary information about the model and the estimates of its parameters: their mean values in the column “Estimate”, their standard errors (square roots of the variances) in “Std. Error”, the bounds of the confidence interval and, finally, a star if the interval does not contain zero. If we have that star, then we are certain at the selected confidence level (95% in our example) about the sign of the parameter, i.e. that the effect indeed exists.
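The star in the summary corresponds to a simple check: whether the confidence interval covers zero. As a sketch, using the rounded bounds obtained for the materials parameter above:

```r
ciMaterials <- c(1.0007, 1.3291)
# The effect is "significant" on the selected level if zero lies outside the interval
!(ciMaterials[1] <= 0 & ciMaterials[2] >= 0)  # TRUE
```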