Chapter 11 Multiple Linear Regression
Example 11.1 One of the problems that construction companies face is getting a good estimate of the budget needed to build something. Many companies tend to underestimate the costs and the time a project will take. To address this, a mid-size company called “Eden city”, which specialises in the construction of residential buildings, decided to take a more analytical approach to the problem. They have collected data on their previous projects and need help in building a model that would explain what drives the costs for different types of buildings. Their idea is to use this model during the business plan write-up phase to estimate the costs of a future project, which they hope will be better than the estimates they used before, based purely on their judgment.
In this example, we are interested in the overall costs of construction (in thousands of pounds), which can be impacted by:
- The size of a building in square metres,
- The cost of materials (in thousands of pounds),
- Type of building (detached, semi-detached, bungalow etc),
- How many projects the specific crew did before,
- Year when the project was started.
What else do you think can impact such costs in theory?
This data is available online:
Based on what we have discussed before, we can analyse measures of association and even build a simple regression model (or several of them), but we acknowledge that in many real-life situations there are many factors that impact the variable of interest. In the example above, we have listed five explanatory variables that can be connected to the overall costs. This means that a basic bi-variate analysis (one variable vs the other) might not be sufficient. Furthermore, the relations between variables are typically complicated. So analysing, for example, only the relation between the cost of a project and the size of the building without considering the cost of materials might be misleading: larger buildings also tend to need more materials, so part of the effect of materials would be attributed to size.
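To see why the bi-variate view can mislead, here is a minimal sketch in Python with NumPy. It does not use the Eden city data: the sample is simulated, and the coefficients (20, 1.0 and 2.0), the link between size and materials and the helper `ols()` are all assumptions made purely for illustration. It compares the slope of size when materials are ignored with the slope when both variables are included.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Synthetic data (NOT the Eden city dataset): materials cost grows with size.
size = rng.uniform(50, 250, n)
material = 0.5 * size + rng.normal(0, 10, n)
# Hypothetical "true" relation used to generate the overall costs.
overall = 20 + 1.0 * size + 2.0 * material + rng.normal(0, 15, n)

def ols(y, *columns):
    """Least squares coefficients for y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefficients

b_simple = ols(overall, size)               # size only
b_multiple = ols(overall, size, material)   # size and material together

print("size slope, size only:        ", round(b_simple[1], 2))
print("size slope, size and material:", round(b_multiple[1], 2))
```

In this simulated setting the size-only slope absorbs the effect of materials and ends up roughly twice as large as the slope obtained when materials are also in the model.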
Here is what the bi-variate relations between the variables in our dataset look like (Figure 11.1):

Figure 11.1: Spread plot between variables in the building costs dataset.
While the plot in Figure 11.1 gives a good idea about the relations, for example, between the size of property and the overall costs (it appears to be linear and positive) or between the cost of materials and the overall costs (linear and positive again), it does not tell us much about the complex relation between one variable (overall costs) and several others.
All of this motivates the so-called “Multiple Linear Regression”, the model that expresses the relation between one variable and several others. Mathematically, this is a straightforward extension of the simple linear regression model from Chapter 10, where we just add variables to the right-hand side of the equation. For example, if we had two variables impacting one (e.g. the \(size\) of the project and the cost of materials, \(material\), vs the \(overall\) cost of the project), we could write: \[\begin{equation} overall_j = \beta_0 + \beta_1 size_{j} + \beta_2 material_{j} + \epsilon_j , \tag{11.1} \end{equation}\] where \(\beta_0\) is the intercept, and \(\beta_1\), \(\beta_2\) are the coefficients for the respective variables. The predicted overall costs can be calculated based on this model by dropping the error term \(\epsilon_j\): \[\begin{equation*} \widehat{overall}_j = \beta_0 + \beta_1 size_{j} + \beta_2 material_{j}. \end{equation*}\] While in the example with the Simple Linear Regression the predicted (or fitted) values implied drawing a line through the cloud of dots on the plane of the two variables, now we are talking about drawing a plane through the cloud of points in the three-dimensional space. It can be visualised in the following way (Figure 11.2):

Figure 11.2: 3D scatterplot of Overall costs vs size of project and costs of materials.
The 3D image in Figure 11.2 is already hard to analyse, but at least it gives an idea of how the overall costs change with the change of materials costs and size of buildings. However, it would be impossible to produce a meaningful plot of overall costs against more than two explanatory variables. What the figure above gives us is the connection between the simple linear regression (which is just a straight line in the two-dimensional plane) and the multiple one (which is a plane or, with more than two explanatory variables, a hyperplane in a multi-dimensional space).
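The idea of a regression plane can also be sketched numerically. The Python snippet below evaluates the predicted overall costs from the two-variable model on a small grid of sizes and material costs. The coefficient values and the helper `predict_overall()` are hypothetical, chosen only for illustration; they are not estimates from the building costs dataset.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration.
beta_0, beta_1, beta_2 = 20.0, 1.0, 2.0

def predict_overall(size, material):
    """Point on the regression plane: beta_0 + beta_1 * size + beta_2 * material."""
    return beta_0 + beta_1 * size + beta_2 * material

# Grid of sizes (rows) and material costs (columns).
sizes = np.array([100.0, 150.0, 200.0])
materials = np.array([40.0, 60.0])
size_grid, material_grid = np.meshgrid(sizes, materials, indexing="ij")

plane = predict_overall(size_grid, material_grid)
print(plane)
```

Moving one step along either axis of the grid changes the prediction by a constant amount (here 50 per step in size and 40 per step in material costs), which is exactly what makes the fitted surface a plane rather than a curved surface.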
In a more general way, the multiple linear regression can be written as: \[\begin{equation} y_j = \beta_0 + \beta_1 x_{1,j} + \beta_2 x_{2,j} + \dots + \beta_{k-1} x_{k-1,j} + \epsilon_j , \tag{11.2} \end{equation}\] where \(\beta_i\) is the \(i\)-th parameter for the respective \(i\)-th explanatory variable and there are \(k-1\) of them in the model, meaning that when we want to estimate this model, we will have \(k\) unknown parameters (the \(k-1\) slopes and the intercept \(\beta_0\)). The regression line of this model in the population (aka expectation conditional on the values of explanatory variables) is: \[\begin{equation} \mu_{y,j} = \mathrm{E}(y_j | \mathbf{x}_j) = \beta_0 + \beta_1 x_{1,j} + \beta_2 x_{2,j} + \dots + \beta_{k-1} x_{k-1,j} . \tag{11.3} \end{equation}\] Furthermore, similar to how we discussed it in Chapter 10, when we want to estimate model (11.2), we should substitute all parameters \(\beta_i\) with their estimates \(b_i\): \[\begin{equation} \hat{y}_j = b_0 + b_1 x_{1,j} + b_2 x_{2,j} + \dots + b_{k-1} x_{k-1,j} . \tag{11.4} \end{equation}\] Similar to the Simple Linear Regression, each parameter in equation (11.4), apart from the intercept \(b_0\), represents the slope for the respective variable, showing how on average the value of the response variable (overall costs in our example) changes with a unit change in that variable, with the other variables held constant.
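The step from equation (11.2) to equation (11.4) can be illustrated with a short Python sketch. It assumes ordinary least squares as the estimation method and uses simulated data: the number of variables, the "true" parameter values and the noise level are all hypothetical and serve only to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 4  # k parameters: the intercept plus k-1 explanatory variables

# Synthetic explanatory variables and hypothetical "true" parameters.
X_vars = rng.normal(size=(n, k - 1))
beta = np.array([5.0, 2.0, -1.0, 0.5])
epsilon = rng.normal(scale=0.5, size=n)

# Design matrix: a column of ones for the intercept, then the variables.
X = np.column_stack([np.ones(n), X_vars])
y = X @ beta + epsilon            # data generated as in equation (11.2)

# Estimates b_0, ..., b_{k-1} obtained by ordinary least squares.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b                     # fitted values, equation (11.4)
residuals = y - y_hat

print("estimates:", np.round(b, 2))
```

The vector `b` has \(k\) elements, matching the \(k\) unknown parameters of the model, and with this much simulated data the estimates land close to the values used to generate it.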