
## 10.1 Ordinary Least Squares (OLS)

For obvious reasons, we do not have the values of the population parameters. This means that we will never know the true intercept and slope. Luckily, we can estimate them from a sample of data. There are different ways of doing that, and the most popular one is the “Ordinary Least Squares” (OLS) method. This is the method that was used to estimate the model in Figure 10.1. So, how does it work?

When we estimate the simple linear regression model, the model (10.1) transforms into: $$$y_j = {b}_0 + {b}_1 x_j + e_j , \tag{10.2}$$$ where $$b_0$$ is the estimate of $$\beta_0$$, $$b_1$$ is the estimate of $$\beta_1$$ and $$e_j$$ is the estimate of $$\epsilon_j$$. This is because we do not know the true values of the parameters, so they are substituted by their estimates. The same applies to the error term, for which in general $$e_j \neq \epsilon_j$$ because it is estimated from the sample. Now consider the same situation with weight vs mileage in Figure 10.2, but with some arbitrary line with unknown parameters. Each point on the plot will typically lie above or below the line, and we can calculate the vertical distances from those points to the line. They correspond to $$e_j = y_j - \hat{y}_j$$, where $$\hat{y}_j$$ is the value of the regression line (aka the “fitted” value) for each specific value of the explanatory variable. For example, for a car weighing 1.835 tonnes, the actual mileage is 33.9, while the fitted value is 27.478. The resulting error (or residual of the model) is 33.9 - 27.478 = 6.422. We could collect these errors for all available cars based on their weights, which would result in a vector of positive and negative values like this:

##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive
##          -2.2826106          -0.9197704          -2.0859521           1.2973499
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D
##          -0.2001440          -0.6932545          -3.9053627           4.1637381
##            Merc 230            Merc 280           Merc 280C          Merc 450SE
##           2.3499593           0.2998560          -1.1001440           0.8668731
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental
##          -0.0502472          -1.8830236           1.1733496           2.1032876
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla
##           5.9810744           6.8727113           1.7461954           6.4219792
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28
##          -2.6110037          -2.9725862          -3.7268663          -3.4623553
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa
##           2.4643670           0.3564263           0.1520430           1.2010593
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E
##          -4.5431513          -2.7809399          -3.2053627          -1.0274952

This corresponds to the formula: $$$e_j = y_j - {b}_0 - {b}_1 x_j. \tag{10.3}$$$ To estimate the parameters $${b}_0$$ and $${b}_1$$ of the model, we need to minimise those distances by changing the parameters. The problem is that some errors are positive, while others are negative. If we just sum them up, they will cancel each other out, and we will lose the information about the distance. The simplest way to get rid of the sign and keep the distance is to square each error and calculate the Sum of Squared Errors for the whole sample of $$n$$ observations: $$$\mathrm{SSE} = \sum_{j=1}^n e_j^2 . \tag{10.4}$$$ If we now minimise SSE by changing the values of $${b}_0$$ and $${b}_1$$, we will find the parameters for which the line passes through the cloud of points as closely as possible in terms of squared distances. Luckily, we do not need any fancy optimisers for this, as there is an analytical solution: $$$\begin{aligned} {b}_1 = & \frac{\mathrm{cov}(x,y)}{\mathrm{V}(x)} \\ {b}_0 = & \bar{y} - {b}_1 \bar{x} \end{aligned} , \tag{10.5}$$$ where $$\bar{x}$$ is the mean of the explanatory variable $$x_j$$ and $$\bar{y}$$ is the mean of the response variable $$y_j$$.
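The formulae (10.3), (10.4) and (10.5) can be tried out on a small example. The following is only a sketch in plain Python: the data are hypothetical toy values (not the dataset behind the figures), and the helper names `mean` and `sse` are introduced purely for illustration.

```python
# Hypothetical toy data (an assumption for illustration, not the book's dataset):
# x plays the role of the explanatory variable, y of the response variable.
x = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [33.0, 29.0, 25.5, 21.0, 19.5, 15.0]
n = len(x)


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sse(b0, b1):
    """Sum of Squared Errors (10.4) for the line b0 + b1 * x."""
    return sum((yj - b0 - b1 * xj) ** 2 for xj, yj in zip(x, y))


x_bar, y_bar = mean(x), mean(y)

# Covariance and variance, both divided by n, so the ratio matches (10.5)
cov_xy = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y)) / n
var_x = sum((xj - x_bar) ** 2 for xj in x) / n

# Analytical OLS solution (10.5)
b1 = cov_xy / var_x
b0 = y_bar - b1 * x_bar

# Residuals (10.3): some are positive, some negative
residuals = [yj - b0 - b1 * xj for xj, yj in zip(x, y)]
sse_ols = sse(b0, b1)

print("b0 =", b0, "b1 =", b1, "SSE =", sse_ols)
print("Sum of residuals (cancels out):", sum(residuals))

# Moving either parameter away from the OLS values increases SSE
perturbed = [sse(b0 + d0, b1 + d1)
             for d0, d1 in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.2), (0.0, -0.2)]]
print("All perturbed SSE values are larger:", all(s > sse_ols for s in perturbed))
```

The residuals sum to (numerically) zero, which is exactly why the raw sum cannot be used as a criterion, while any small change of the parameters away from the OLS values makes SSE larger.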

Proof. In order to get (10.5), we should first insert (10.3) in (10.4) to get: $$$\mathrm{SSE} = \sum_{j=1}^n (y_j - {b}_0 - {b}_1 x_j)^2 .$$$ This can be expanded to: $$$\begin{aligned} \mathrm{SSE} = & \sum_{j=1}^n y_j^2 - 2 b_0 \sum_{j=1}^n y_j - 2 b_1 \sum_{j=1}^n y_j x_j + \\ & n b_0^2 + 2 b_0 b_1 \sum_{j=1}^n x_j + b_1^2 \sum_{j=1}^n x_j^2 \end{aligned}$$$ Given that we need to find the values of parameters $$b_0$$ and $$b_1$$ minimising SSE, we can take the partial derivatives of SSE with respect to $$b_0$$ and $$b_1$$ and equate them to zero to get the following system of equations: $$$\begin{aligned} & \frac{\partial \mathrm{SSE}}{\partial b_0} = -2 \sum_{j=1}^n y_j + 2 n b_0 + 2 b_1 \sum_{j=1}^n x_j = 0 \\ & \frac{\partial \mathrm{SSE}}{\partial b_1} = -2 \sum_{j=1}^n y_j x_j + 2 b_0 \sum_{j=1}^n x_j + 2 b_1 \sum_{j=1}^n x_j^2 = 0 \end{aligned}$$$

Solving this system of equations for $$b_0$$ and $$b_1$$ we get: $$$\begin{aligned} & b_0 = \frac{1}{n}\sum_{j=1}^n y_j - b_1 \frac{1}{n}\sum_{j=1}^n x_j \\ & b_1 = \frac{n \sum_{j=1}^n y_j x_j - \sum_{j=1}^n y_j \sum_{j=1}^n x_j}{n \sum_{j=1}^n x_j^2 - \left(\sum_{j=1}^n x_j \right)^2} \end{aligned} \tag{10.6}$$$ In the system of equations (10.6), we have the following elements:

1. $$\bar{y}=\frac{1}{n}\sum_{j=1}^n y_j$$,
2. $$\bar{x}=\frac{1}{n}\sum_{j=1}^n x_j$$,
3. $$\mathrm{cov}(x,y) = \frac{1}{n}\sum_{j=1}^n y_j x_j - \frac{1}{n^2}\sum_{j=1}^n y_j \sum_{j=1}^n x_j$$,
4. $$\mathrm{V}(x) = \frac{1}{n}\sum_{j=1}^n x_j^2 - \left(\frac{1}{n} \sum_{j=1}^n x_j \right)^2$$,

which, after dividing the numerator and the denominator of $$b_1$$ in (10.6) by $$n^2$$ and inserting them, lead to (10.5).
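The steps of the proof can also be checked numerically. The sketch below uses hypothetical toy data (an assumption, not the dataset from the figures) to compute the parameters from the raw sums in (10.6) and from the four elements listed above, and then evaluates the two derivatives at the solution.

```python
# Hypothetical toy data, used only to check the algebra numerically
x = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [33.0, 29.0, 25.5, 21.0, 19.5, 15.0]
n = len(x)

# Raw sums appearing in the derivation
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xj * yj for xj, yj in zip(x, y))
sum_x2 = sum(xj ** 2 for xj in x)

# Solution in the form (10.6)
b1_sums = (n * sum_xy - sum_y * sum_x) / (n * sum_x2 - sum_x ** 2)
b0_sums = sum_y / n - b1_sums * sum_x / n

# The four elements listed after (10.6)
y_bar = sum_y / n
x_bar = sum_x / n
cov_xy = sum_xy / n - sum_y * sum_x / n ** 2
var_x = sum_x2 / n - (sum_x / n) ** 2

# Solution in the form (10.5)
b1 = cov_xy / var_x
b0 = y_bar - b1 * x_bar

# Derivatives of SSE with respect to b0 and b1, evaluated at the solution
d_b0 = -2 * sum_y + 2 * n * b0 + 2 * b1 * sum_x
d_b1 = -2 * sum_xy + 2 * b0 * sum_x + 2 * b1 * sum_x2

print("b1 from (10.6):", b1_sums, " b1 from (10.5):", b1)
print("b0 from (10.6):", b0_sums, " b0 from (10.5):", b0)
print("Derivatives at the solution:", d_b0, d_b1)
```

Up to floating point rounding, both forms give the same parameters and both derivatives are zero at these values.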

Note that if for some reason $${b}_1=0$$ in (10.5) (for example, because the covariance between $$x$$ and $$y$$ is zero, implying that they are not correlated), then the intercept is $${b}_0 = \bar{y}$$, meaning that the sample mean of the response variable would be the best predictor of $$y_j$$. This method of estimating parameters by minimising SSE is called “Ordinary Least Squares”. It is simple and does not require any specific assumptions: we just minimise the overall squared distance by changing the values of the parameters.
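The special case of a zero slope can be illustrated with a tiny constructed example. The data below are hypothetical and were chosen so that the covariance between $$x$$ and $$y$$ is exactly zero; the helper `sse_constant` is introduced only for this sketch.

```python
# Hypothetical data constructed so that cov(x, y) = 0:
# y is symmetric around the middle value of x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 9.0, 5.0, 3.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

cov_xy = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y)) / n
var_x = sum((xj - x_bar) ** 2 for xj in x) / n

# OLS estimates from (10.5)
b1 = cov_xy / var_x
b0 = y_bar - b1 * x_bar


def sse_constant(c):
    """SSE when every observation is predicted by the same constant c."""
    return sum((yj - c) ** 2 for yj in y)


print("b1 =", b1, " b0 =", b0, " mean of y =", y_bar)
# The sample mean gives a smaller SSE than nearby constants
print(sse_constant(y_bar) < sse_constant(y_bar + 0.5))
print(sse_constant(y_bar) < sse_constant(y_bar - 0.5))
```

With zero covariance the fitted line is flat, and its level coincides with the sample mean of $$y$$.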

While we can do some inference based on the simple linear regression, bivariate relations are rarely met in practice: typically, a variable is influenced by a set of variables, not just by one. This implies that the correct model would typically include many explanatory variables. This is why we will discuss inference in the following sections.

### 10.1.1 Gauss-Markov theorem

OLS is a very popular estimation method for linear regression for a variety of reasons. First, it is relatively simple (much simpler than other approaches) and conceptually easy to understand. Second, the estimates of OLS parameters can be found analytically (as in formula (10.5)). Furthermore, there is a mathematical proof that the OLS estimates of parameters are efficient (Subsection 6.3.2), consistent (Subsection 6.3.3) and unbiased (Subsection 6.3.1). This result is known as the “Gauss-Markov theorem”. It states that:

Theorem 10.1 If the regression model is correctly specified, then OLS will produce the Best Linear Unbiased Estimates (BLUE) of its parameters.

The term “correctly specified” implies that all main statistical assumptions about the model are satisfied (such as no omitted important variables, and no autocorrelation or heteroscedasticity in the residuals; see details in Chapter 15). The “BLUE” part means that OLS produces the most efficient estimates of parameters amongst all linear unbiased estimators of a linear model. For example, if we used a criterion of minimisation of Mean Absolute Error (MAE), then the estimates of parameters would be less efficient than in the case of OLS (this is because OLS gives “mean” estimates, while the minimum of MAE corresponds to the median, see Subsection 6.3.2).
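The comparison of OLS with MAE can be illustrated by a small simulation. The sketch below makes several simplifying assumptions that are not taken from the text: the model contains only an intercept (so the OLS estimate is the sample mean and the MAE estimate is the sample median), the errors are normally distributed, and the sample size, number of replications and seed are arbitrary.

```python
import random
import statistics

random.seed(42)      # arbitrary seed, for reproducibility only

n_obs = 50           # observations per simulated sample (assumption)
n_sim = 2000         # number of simulated samples (assumption)
true_level = 10.0    # true intercept of the intercept-only model (assumption)

mean_estimates = []    # minimiser of SSE for an intercept-only model
median_estimates = []  # minimiser of MAE for an intercept-only model

for _ in range(n_sim):
    # Data generated as a constant level plus standard normal noise
    sample = [random.gauss(true_level, 1.0) for _ in range(n_obs)]
    mean_estimates.append(statistics.fmean(sample))
    median_estimates.append(statistics.median(sample))

# Variability of the two estimators across the simulated samples
var_mean = statistics.pvariance(mean_estimates)
var_median = statistics.pvariance(median_estimates)

print("Average of mean-based estimates:  ", statistics.fmean(mean_estimates))
print("Average of median-based estimates:", statistics.fmean(median_estimates))
print("Variance of mean-based estimates:  ", var_mean)
print("Variance of median-based estimates:", var_median)
```

Both estimators are centred around the true level, but under these assumptions the median-based estimates vary more from sample to sample, which is what “less efficient” means here. With heavy-tailed errors the ranking can be different, so this is an illustration under normality rather than a general result.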

Practically speaking, the theorem implies that when you use OLS, the estimates of parameters will have good statistical properties (given that the model is correctly specified), in some cases better than the estimates obtained by other estimators.