12.1 Covariance matrix of parameters

This book is in Open Review. I want your feedback to make the book better for you and other readers. To add your annotation, select some text and then click the on the pop-up menu. To see the annotations of others, click the button in the upper right hand corner of the page

One of the simplest ways of capturing the uncertainty about the parameters is by calculating the covariance matrix of parameters. It is the matrix that shows the individual and joint variabilities of parameters of the model. It has the following shape:

\[\begin{equation} \mathrm{V}({\boldsymbol{b}}) = \begin{pmatrix} \mathrm{V}(b_0) & \mathrm{cov}(b_0, b_1) & \dots & \mathrm{cov}(b_0, b_{k-1}) \\ \mathrm{cov}(b_0, b_1) & \mathrm{V}(b_1) & \dots & \mathrm{cov}(b_1, b_{k-1}) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(b_0, b_{k-1}) & \mathrm{cov}(b_1, b_{k-1}) & \dots & \mathrm{V}(b_{k-1}) \\ \end{pmatrix} , \tag{12.1} \end{equation}\] where \({\boldsymbol{b}}\) is the vector of all the estimated parameters. The covariance matrix above is similar to the one we discussed at the end of Section 5.1, equation (5.9), but it is now applied to the estimates of parameters of the model instead of some random variables. The diagonal elements of this matrix contain variances of parameters, showing how each estimate of parameter varies in sample, while the off-diagonal ones contain the covariances, capturing the joint variability of parameters. The square roots of the diagonal of that matrix would give the so called “Standard errors” of parameters (the standard deviations). These are typically used in further analysis.

12.1.1 Bootstrap-based matrix

There are different ways how we could calculate this matrix. We could use the bootstrap as we did in the introduction to this section, producing many estimates of the parameters and then calculating the matrix. R has all the necessary functions for that:

var(costsModelMLRBootstrap$coefficients) |> round(4)

##              (Intercept) materials      size  projects       year
## (Intercept) 8823780.1900  111.7570 -214.6135 -550.2357 -4392.6498
## materials       111.7570    0.0717   -0.1528    0.1137    -0.0557
## size           -214.6135   -0.1528    0.3437   -0.3026     0.1062
## projects       -550.2357    0.1137   -0.3026    6.2181     0.2678
## year          -4392.6498   -0.0557    0.1062    0.2678     2.1868

The values themselves do not tell us much about the model and parameters, so to better understand them, we can calculate the correlation matrix to see what is the linear relation between the estimates of parameters:

var(costsModelMLRBootstrap$coefficients) |>
    cov2cor() |> round(4)

##             (Intercept) materials    size projects    year
## (Intercept)      1.0000    0.1405 -0.1232  -0.0743 -1.0000
## materials        0.1405    1.0000 -0.9730   0.1703 -0.1406
## size            -0.1232   -0.9730  1.0000  -0.2070  0.1224
## projects        -0.0743    0.1703 -0.2070   1.0000  0.0726
## year            -1.0000   -0.1406  0.1224   0.0726  1.0000

We see, for example, that the correlation between the intercept and the parameter for the year is close to -1, which implies that the two have a strong negative linear relation. This means that when we draw lines through the clouds of points on samples of data (year vs overall costs), the higher intercept implies that the line should have a lower slope, i.e. the line exhibits rotations around the middle of the data for these two variables. Still, this does not provide much useful information, but the matrix itself can be used for further derivations (see next Section).

12.1.2 Analytical solution

The alternative way of calculating the covariance matrix is using a formula. This is not universally available, and works only in case of OLS and a couple of other estimation methods (such as Maximisation of the Likelihood in case of the Normal distribution, see Chapter 16). The following formula is derived for OLS, relies on several important assumptions (see derivations in Subsection 12.1.3), and would not produce adequate results in case of other estimators: \[\begin{equation} \mathrm{V}({\boldsymbol{b}}) = \frac{1}{n-k} \sum_{j=1}^n e_j^2 \times \left(\mathbf{X}' \mathbf{X}\right)^{-1}. \tag{12.2} \end{equation}\] where \(\mathbf{X}\) is the matrix of explanatory variables from equation (11.7) and \(e_j\) is the residual of the model on the observation \(j\). One way to understand what this formula does, is to keep in mind that \(\mathbf{X}' \mathbf{X}\) will produce a matrix, similar to the covariance matrix of explanatory variables, so roughly the sample variance of the error term is then distributed across the variances and covariances of the explanatory variables, producing the corresponding variances and covariances of the parameters. The function vcov() in R, uses exactly this formula in case of the OLS/Likelihood estimation, returning the following:

vcov(costsModelMLR) |> round(4)

##               (Intercept) materials     size  projects       year
## (Intercept) 18812682.8074   86.6304 -26.2332 1352.1495 -9377.7569
## materials         86.6304    0.0968  -0.2025    0.1951    -0.0435
## size             -26.2332   -0.2025   0.4884   -0.4137     0.0109
## projects        1352.1495    0.1951  -0.4137   14.4179    -0.6985
## year           -9377.7569   -0.0435   0.0109   -0.6985     4.6748

We can see that the values of this covariance matrix differ from the ones produced via bootstrap. This is expected because different methods will always give different results. The analytical solution via (12.2) is more convenient and easier to calculate, but it relies on more assumptions than the bootstrapped one (see derivations below), the most important of them being that the model is correctly specified (i.e. we deal with the true model applied to a sample of data). If this is violated, the estimates of the covariance matrix will be biased leading to a variety of issues, discussed in Chapter 15.

12.1.3 Mathematical derivation

In this subsection, we show the proof that the variance matrix of parameters can be calculated via the formula (12.2). There are several ways of showing this, we use the one that does not need the values of the explanatory variables to be known. The important thing we need to assume at this moment is that the “true” model is known and is correctly estimated. If it is not, the final formula will stay the same, but it will have different implications.

Linear algebra! Tread lightly!

Proof. We start by expanding the formula for the OLS (11.9) by inserting the “true” model in it (\({\mathbf{y} = \mathbf{X}} {\boldsymbol{\beta}} + {\boldsymbol{\epsilon}}\): \[\begin{equation} \begin{aligned} {\boldsymbol{b}} = & \left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime {\mathbf{y}} = \\ & \left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime ({\mathbf{X}} {\boldsymbol{\beta}} + {\boldsymbol{\epsilon}}) = \\ & \left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime {\mathbf{X}} {\boldsymbol{\beta}} + \left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime {\boldsymbol{\epsilon}} = \\ & {\boldsymbol{\beta}} + \left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime {\boldsymbol{\epsilon}} . \end{aligned} \tag{12.3} \end{equation}\]

This equation tells us that the estimated parameters will have some variability around the true ones, and the extent of that variability will depend on the explanatory variables in \(\mathbf{X}\) and the specific values of the error term \(\boldsymbol{\epsilon}\).

In bypassing, we can show that the OLS estimates of parameters are unbiased (i.e. their expectation will be equal to the true parameters) as long as the model is correctly specified: \[\begin{equation} \begin{aligned} \mathrm{E}\left(\boldsymbol{b}\right) = & \mathrm{E}\left({\boldsymbol{\beta}} + \left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime {\boldsymbol{\epsilon}}\right) = \\ & {\boldsymbol{\beta}} + \mathrm{E}\left( \left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime {\boldsymbol{\epsilon}}\right) = \\ & {\boldsymbol{\beta}} + \left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime \mathrm{E}\left( {\boldsymbol{\epsilon}}\right) = \\ & \boldsymbol{\beta} , \end{aligned} \tag{12.4} \end{equation}\] which holds as long as the expectation of the true error term is zero, and this is always true by the definition of the true model (see discussion in Subsection 11.0.1).

More importantly, we can calculate the variance of the estimates of parameters using (12.3):

\[\begin{equation} \mathrm{V}\left( {\boldsymbol{b}} \right) = \mathrm{V}\left( {\boldsymbol{\beta}} + \left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime {\boldsymbol{\epsilon}} \right) . \tag{12.5} \end{equation}\]

In formula (12.5), the true parameter \(\boldsymbol{\beta}\) should be independent of the explanatory variables and the error term by definition of the true model, so the formula can be represented as a sum of variances (the right hand side only). Furthermore, in this basic linear model, we assume that the true parameter does not have any uncertainty, which means that the part related to \({\boldsymbol{\beta}}\) in (12.5) can be dropped leaving us with: \[\begin{equation} \mathrm{V}\left( {\boldsymbol{b}} \right) = \mathrm{V}\left(\left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime {\boldsymbol{\epsilon}} \right). \tag{12.6} \end{equation}\] After that, we can recall that \(\mathrm{E}\left(\left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime {\boldsymbol{\epsilon}} \right)=0\), because in the true model, there should be no structure in the error term and thus the explanatory variables and the error term should be unrelated (Subsection 11.0.1). This is useful because now we can represent the variance (12.6) as the expectation of products (this follows directly from equation (5.2) that we discussed in Subsection 5.1): \[\begin{equation*} \mathrm{V}\left( {\boldsymbol{b}} \right) = \mathrm{E}\left(\left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime {\boldsymbol{\epsilon}} \times \left( \left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime {\boldsymbol{\epsilon}} \right)^\prime \right). \tag{12.7} \end{equation*}\] Now to expand the expectation of the sum, we can use the following equality, coming directly from the definition of a covariance of two random variables (formula (5.8) from Section 5.1): \[\begin{equation} \begin{aligned} \mathrm{V}\left( {\boldsymbol{b}} \right) = & \mathrm{E}\left(\left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime {\boldsymbol{\epsilon}} \right) \times \mathrm{E}\left(\left( \left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime {\boldsymbol{\epsilon}} \right)^\prime \right) + \\ & \mathrm{cov} \left(\left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime {\boldsymbol{\epsilon}}, \left( \left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime {\boldsymbol{\epsilon}} \right)^\prime \right). \end{aligned} \tag{12.8} \end{equation}\] We can notice that the expectations in (12.8) should be zero because, as discussed earlier, in the true model, the error term is not related to anything. This means that the formula can be simplified to: \[\begin{equation} \mathrm{V}\left( {\boldsymbol{b}} \right) = \mathrm{cov} \left(\left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime {\boldsymbol{\epsilon}}, \left( \left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime {\boldsymbol{\epsilon}} \right)^\prime \right). \tag{12.9} \end{equation}\] The second element in the brackets can be rewritten as: \[\begin{equation*} \left( \left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime {\boldsymbol{\epsilon}} \right)^\prime = {\boldsymbol{\epsilon}} {\mathbf{X}} \left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} . \end{equation*}\] Substituting this to (12.9) and moving explanatory variables from the covariance gives: \[\begin{equation} \mathrm{V}\left( {\boldsymbol{b}} \right) = \left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} {\mathbf{X}}^\prime {\mathbf{X}} \left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} \mathrm{cov} \left( {\boldsymbol{\epsilon}}, {\boldsymbol{\epsilon}} \right), \tag{12.10} \end{equation}\] which after some cancellations leads to: \[\begin{equation} \mathrm{V}\left( {\boldsymbol{b}} \right) = \left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} \sigma^2 . \tag{12.11} \end{equation}\] Given that we do not know the true variance of the error term, we need to estimate it. In case of the OLS, it can be done using the formula (from equation (11.15)): \[\begin{equation*} \hat{\sigma}^2 = \frac{1}{n-k} \sum_{j=1}^n e_j^2 , \end{equation*}\] which after being inserted in (12.11), leads to: \[\begin{equation} \mathrm{V}\left( {\boldsymbol{b}} \right) = \frac{1}{n-k} \sum_{j=1}^n e_j^2 \left( {\mathbf{X}}^\prime {\mathbf{X}} \right)^{-1} . \tag{12.12} \end{equation}\]

Remark. Mathematically speaking, \(\mathrm{cov} \left( {\boldsymbol{\epsilon}}, {\boldsymbol{\epsilon}} \right)\) is a matrix, but because we are dealing with the true model and the true error term, the matrix should be diagonal with only variances in the main diagonal. This is because the error term cannot be correlated with itself (thus all the off-diagonal elements are zero). Furthermore, because we assume that the variance of the error term is the same for all observations, the following should hold: \[\begin{equation*} \mathrm{cov} \left( {\boldsymbol{\epsilon}}, {\boldsymbol{\epsilon}} \right) = \mathbf{I}_{k-1} \sigma^2, \end{equation*}\] where \(\mathbf{I}_{k-1}\) is the identity matrix of the size \(k-1\) and \(\sigma^2\) is the variance of the error term.

We should note that the majority of assumptions above needed for the derivation came from the definition of the true model. But the essential one was that the model is correctly estimated. If this is violated, the estimates of the elements in the covariance matrix will be biased in one way or the other (see Chapter 15).