11.4 Prediction with regression


One of the things we can do with regression is to find out what value of the response variable to expect for given values of the explanatory variables. Consider the model that we estimated in Section 11.1:

costsModel01 <- lm(overall~size+materials+projects+year, SBA_Chapter_11_Costs)

summary(costsModel01)

## 
## Call:
## lm(formula = overall ~ size + materials + projects + year, data = SBA_Chapter_11_Costs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -101.198  -30.443   -2.157   26.032  108.496 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  614.3227  4337.3590   0.142  0.88788   
## size           1.3471     0.6989   1.928  0.05898 . 
## materials      0.8706     0.3112   2.798  0.00704 **
## projects      -1.5921     3.7971  -0.419  0.67660   
## year          -0.1602     2.1621  -0.074  0.94121   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.42 on 56 degrees of freedom
## Multiple R-squared:  0.749,  Adjusted R-squared:  0.7311 
## F-statistic: 41.77 on 4 and 56 DF,  p-value: 3.415e-16

This model can be written mathematically as:

\[\begin{equation} overall_j = 614.3227 + 1.3471 size_j + 0.8706 materials_j - 1.5921 projects_j - 0.1602 year_j + \epsilon_j . \tag{11.17} \end{equation}\]

While there might be some issues with this model (e.g. not all important variables are included), we can use it to answer the following question:

What is the average overall cost of a project for a detached property of 90 square metres, with an overall cost of materials of 190 thousand pounds, developed in 2008 by a team that had completed 3 projects before it?

This question gives us all the necessary values of the variables, which we can now insert into equation (11.17) to get the answer. All we need to do is drop \(\epsilon_j\) (to switch to the regression line) and substitute:

\(size_j\) with 90,
\(materials_j\) with 190,
\(projects_j\) with 3,
and \(year_j\) with 2008.

Note that because the model above does not include the variable type, we cannot use the information that the property is detached. We will discuss how to introduce it in the model in Chapter 13. Inserting all the above, we get:

\[\begin{equation*} \widehat{overall}_j = 614.3227 + 1.3471 \times 90 + 0.8706 \times 190 - 1.5921 \times 3 - 0.1602 \times 2008 = 574.5178. \end{equation*}\]

So, we can conclude that the expected (or average) overall cost for such a project is 574.5178. Note the words “expected” and “average”. The number does not tell you what you will definitely get, but only indicates the general tendency in this setting. You can think of it as the arithmetic mean of the costs of many similar projects with exactly the same values of the variables (those defined in the question).

In R, this can be done with the function predict() in the following way:

costsNewdata <- data.frame(size=90, materials=190, projects=3, year=2008)
predict(costsModel01, newdata=costsNewdata)

##        1 
## 574.5546

We get a slightly different number because we rounded the coefficients in our manual calculation, but it is very close to the one we obtained before.
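To see how large the rounding effect is, the manual calculation can be repeated in R with the four-decimal coefficients copied from the summary output. This is only a minimal sketch: the names roundedCoefficients, newValues and manualPrediction are introduced here for illustration and do not appear in the original code.

```r
# Coefficients rounded to four decimals, copied from the summary() output
# (roundedCoefficients is a hypothetical name used for illustration)
roundedCoefficients <- c(intercept = 614.3227, size = 1.3471,
                         materials = 0.8706, projects = -1.5921,
                         year = -0.1602)

# Values from the question, with 1 standing in for the intercept
# (newValues is a hypothetical name used for illustration)
newValues <- c(intercept = 1, size = 90, materials = 190,
               projects = 3, year = 2008)

# Point on the regression line based on the rounded coefficients
manualPrediction <- sum(roundedCoefficients * newValues)
manualPrediction

# Gap between the value reported by predict() and the manual value
574.5546 - manualPrediction
```

The gap is small but not zero. It appears to come largely from the year coefficient, because any rounding error in it is multiplied by 2008, whereas the other variables take much smaller values.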