
16.2 Mathematical explanation

Now we can discuss the same idea from the mathematical point of view. We estimated the following simple model: \[\begin{equation} y_j = \mu_{y} + \epsilon_j, \tag{16.1} \end{equation}\] assuming a Normal distribution of the residuals (see Section 4.3). In order to make things closer to the regression context, we will introduce a changing location, which is defined by the regression line (thus, it is conditional on the set of \(k-1\) explanatory variables): \[\begin{equation} y_j = \mu_{y,j} + \epsilon_j, \tag{16.2} \end{equation}\] where \(\mu_{y,j}\) is the population regression line, defined via: \[\begin{equation} \mu_{y,j} = \beta_0 + \beta_1 x_{1,j}+ \beta_2 x_{2,j} + \dots + \beta_{k-1} x_{k-1,j} . \tag{16.3} \end{equation}\] The typical assumption in the regression context is that \(\epsilon_j \sim \mathcal{N}(0, \sigma^2)\) (Normal distribution with zero mean and fixed variance), which means that \(y_j \sim \mathcal{N}(\mu_{y,j}, \sigma^2)\). We can use this assumption to calculate the point likelihood value for each observation based on the PDF of the Normal distribution (Subsection 4.3): \[\begin{equation} \mathcal{L} (\mu_{y,j}, \sigma^2 | y_j) = f(y_j | \mu_{y,j}, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{\left(y_j - \mu_{y,j} \right)^2}{2 \sigma^2} \right). \tag{16.4} \end{equation}\]

Very roughly, the value (16.4) shows how likely it is that the specific observation comes from the assumed model with the specified parameters (we know that in the real world data does not come from any model, but this interpretation is easier to work with). Note that the likelihood is not the same as probability, because for any continuous random variable the probability of it being equal to any specific number is zero (as discussed in Section 4.1).

The point likelihood (16.4) is not very helpful on its own, but we can get \(n\) values like that, based on our sample of data. We can then summarise them in one number that characterises the whole sample, given the assumed distribution, the applied model and the selected values of parameters: \[\begin{equation} \mathcal{L} (\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = \prod_{j=1}^n \mathcal{L} (\mu_{y,j}, \sigma^2 | y_j) = \prod_{j=1}^n f(y_j | \mu_{y,j}, \sigma^2), \tag{16.5} \end{equation}\] where \(\boldsymbol{\theta}\) is the vector of all parameters in the model (in our example, there are \(k+1\) of them: the \(k\) coefficients of the model and the scale \(\sigma^2\)). We take the product of likelihoods in (16.5) because we need the joint likelihood for all observations and because we can typically assume that the point likelihoods are independent of each other (for example, the value on observation \(j\) will not be influenced by the value on \(j-1\)). The value (16.5) shows roughly how likely on average it is that the data comes from the assumed model with the specified parameters.
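To make this more tangible, here is a minimal sketch in R of the point likelihoods (16.4) and their product (16.5). The data is simulated and the parameter values are picked purely for illustration; the variable names are not from the book's examples.

```r
# Simulate a simple dataset for illustration (the assumed "true" model)
set.seed(41)
n <- 100
x <- rnorm(n, 10, 2)
y <- 5 + 1.5 * x + rnorm(n, 0, 3)

# Assume some values of the parameters: beta0, beta1 and sigma^2
beta0 <- 5; beta1 <- 1.5; sigma2 <- 9
muY <- beta0 + beta1 * x

# Point likelihoods (16.4): the Normal density of each y_j given mu_{y,j}
pointLik <- dnorm(y, mean = muY, sd = sqrt(sigma2))

# Joint likelihood (16.5) and its logarithm
prod(pointLik)
sum(log(pointLik))
```

Changing beta0, beta1 or sigma2 and re-running the last two lines shows how the joint likelihood reacts to different sets of parameters.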

Remark. Technically speaking, the “on average” element appears if we take the \(n\)-th root of (16.5) (or, equivalently, divide its logarithm by the number of observations \(n\)).

Having this value, we can change the values of the parameters of the model, getting a different value of (16.5) (as we did in the example in Section 16.1). Using an iterative procedure, we can find the estimates of parameters that maximise the likelihood (16.5). These estimates are called “Maximum Likelihood Estimates” (MLE). However, working with the product in formula (16.5) is challenging, so typically we take its natural logarithm, turning the product into a sum and obtaining the log-likelihood. For the Normal distribution, it can be written as: \[\begin{equation} \ell (\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = \log \mathcal{L} (\boldsymbol{\theta}, {\sigma}^2 | \mathbf{y}) = -\frac{n}{2} \log(2 \pi \sigma^2) -\sum_{j=1}^n \frac{\left(y_j - \mu_{y,j} \right)^2}{2 \sigma^2} . \tag{16.6} \end{equation}\]

Based on that, we can find some of the parameters of the model analytically. For example, we can derive the formula for the estimation of the scale based on the provided sample. Given that we are estimating the parameter, we should substitute \(\sigma^2\) with \(\hat{\sigma}^2\) in (16.6). We can then take the derivative of (16.6) with respect to \(\hat{\sigma}^2\) and equate it to zero in order to find the value that maximises the log-likelihood function in our sample: \[\begin{equation} \frac{d \ell (\boldsymbol{\theta}, \hat{\sigma}^2 | \mathbf{y})}{d \hat{\sigma}^2} = -\frac{n}{2} \frac{1}{\hat{\sigma}^2} + \frac{1}{2 \hat{\sigma}^4}\sum_{j=1}^n \left(y_j - \mu_{y,j} \right)^2 = 0 , \tag{16.7} \end{equation}\] which, after multiplying both sides by \(2 \hat{\sigma}^4\) and rearranging, leads to: \[\begin{equation} n \hat{\sigma}^2 = \sum_{j=1}^n \left(y_j - \mu_{y,j} \right)^2 , \tag{16.8} \end{equation}\] or \[\begin{equation} \hat{\sigma}^2 = \frac{1}{n}\sum_{j=1}^n \left(y_j - \mu_{y,j} \right)^2 . \tag{16.9} \end{equation}\] The value (16.9) is in fact the Mean Squared Error (MSE) of the model. If we calculate \(\hat{\sigma}^2\) using formula (16.9), we maximise the likelihood with respect to the scale parameter. In fact, we can insert (16.9) into (16.6) in order to obtain the so-called “concentrated” (or profile) log-likelihood for the Normal distribution: \[\begin{equation} \ell^* (\boldsymbol{\theta} | \mathbf{y}) = -\frac{n}{2}\left( \log(2 \pi e) + \log \hat{\sigma}^2 \right) . \tag{16.10} \end{equation}\]
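The derivation above can be verified numerically. The sketch below (reusing the simulated x and y from the earlier chunk) maximises the log-likelihood (16.6) with the general-purpose optimiser optim() and checks that the estimated scale matches the MSE (16.9); the starting values and variable names are arbitrary.

```r
# Normal log-likelihood (16.6) as a function of beta0, beta1 and log(sigma^2)
# (parameterising the scale in logs keeps it positive during optimisation)
logLikNormal <- function(par, y, x) {
  muY <- par[1] + par[2] * x
  sum(dnorm(y, mean = muY, sd = sqrt(exp(par[3])), log = TRUE))
}

# optim() minimises by default, so fnscale = -1 switches it to maximisation
mleFit <- optim(c(mean(y), 0, log(var(y))), logLikNormal, y = y, x = x,
                control = list(fnscale = -1, maxit = 2000))

# The MLE of sigma^2 should practically coincide with the MSE (16.9)
muYhat <- mleFit$par[1] + mleFit$par[2] * x
exp(mleFit$par[3])
mean((y - muYhat)^2)
```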

Remark. Sometimes statisticians drop the \(2 \pi e\) part from (16.10), because it does not affect any inferences, as long as one works only with the Normal distribution. However, in general, it is not recommended to do so (Burnham and Anderson, 2004), because this makes the comparison with other distributions impossible.

This function is useful because it simplifies some calculations and also demonstrates the condition under which the likelihood is maximised: the first part on the right-hand side of the formula does not depend on the parameters of the model; only the \(\log \hat{\sigma}^2\) term does. So, the maximum of the concentrated log-likelihood (16.10) is obtained when \(\hat{\sigma}^2\) is minimised, implying the minimisation of MSE, which is the mechanism behind the “Ordinary Least Squares” (OLS, Section 10.1) estimation method. By doing this, we have just demonstrated that if we assume Normality in the model, then the estimates of its parameters obtained via the maximisation of the likelihood coincide with the values obtained via OLS (the short check below illustrates this numerically). So, why bother with MLE, when we have OLS?
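As a quick numerical illustration of this coincidence (again on the simulated data from above), the OLS coefficients can be compared with the ones found by optim(), and the concentrated log-likelihood (16.10) with the MSE plugged in reproduces the log-likelihood that R reports for the OLS model. This is only a sketch; lm() and logLik() are standard R functions.

```r
olsFit <- lm(y ~ x)
coef(olsFit)                             # compare with mleFit$par[1:2]

sigma2Hat <- mean(residuals(olsFit)^2)   # MSE, formula (16.9)
-n / 2 * (log(2 * pi * exp(1)) + log(sigma2Hat))   # concentrated log-likelihood (16.10)
logLik(olsFit)                           # the same value
```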

First, the finding above holds for the Normal distribution only. If we assume a different distribution, we will get different estimates of parameters. In some cases, it might not be possible or reasonable to use OLS, but MLE would be a plausible option (for example, in logistic, Poisson and other non-standard models).
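As an illustration of this point, here is a minimal sketch of a Poisson regression estimated via likelihood maximisation (glm() does the maximisation internally); the count data is simulated purely for this example.

```r
# Simulated count data: OLS is not appropriate here, but MLE is
set.seed(41)
xCount <- rnorm(100)
yCount <- rpois(100, lambda = exp(0.5 + 0.8 * xCount))

# glm() maximises the Poisson likelihood to estimate the coefficients
poissonFit <- glm(yCount ~ xCount, family = poisson)
coef(poissonFit)
logLik(poissonFit)
```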

Second, the ML estimates of parameters have good statistical properties: they are consistent (Subsection 6.3.3) and efficient (Subsection 6.3.2). These properties hold almost universally for many likelihoods under very mild conditions. Note that the ML estimates of parameters are not necessarily unbiased (Subsection 6.3.1), but after estimating the model, one can de-bias some of them (for example, by calculating the standard deviation of the error via division of the sum of squared errors by the number of degrees of freedom \(n-k\) instead of \(n\), as discussed in Section 11.2).
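The de-biasing of the scale can be illustrated with the same simulated data; this is just a sketch of the idea rather than a general recipe.

```r
# MLE of sigma^2 divides the sum of squared errors by n (biased),
# while the standard bias-corrected estimate divides it by n - k
olsFit <- lm(y ~ x)
k <- length(coef(olsFit))
sum(residuals(olsFit)^2) / n          # MLE of sigma^2
sum(residuals(olsFit)^2) / (n - k)    # bias-corrected estimate
sigma(olsFit)^2                       # what lm() reports, same as the line above
```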

Third, likelihood can be used for model assessment, even when standard statistics, such as \(R^2\) or the F-test, are not available. We do not discuss these aspects in this textbook, but the interested reader is directed to the topic of likelihood ratios.

Finally, likelihood permits model selection (which will be discussed in Section ??) via information criteria. In general, this is not possible unless you assume a distribution and maximise the respective likelihood. In some statistical literature, you might notice that information criteria are calculated for models estimated via OLS, but what the authors of such resources do not tell you is that there is still an assumption of Normality behind this (see the link between OLS and the MLE of the Normal distribution above).
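To show the connection, here is a small sketch for the same OLS model: the AIC reported by R is just twice the number of estimated parameters (the \(k\) coefficients plus the scale) minus twice the maximised log-likelihood, which implicitly relies on the Normality assumption discussed above.

```r
olsFit <- lm(y ~ x)
2 * (length(coef(olsFit)) + 1) - 2 * as.numeric(logLik(olsFit))
AIC(olsFit)   # the same value
```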

Note that the likelihood approach assumes that all parameters of the model are estimated, including the location, scale, shape, shift of the distribution, etc. So it typically has more parameters to estimate than, for example, OLS. This is discussed in some detail later in Section 16.3.
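This parameter counting can be seen in R as well: the log-likelihood of the OLS model keeps track of the scale as an additional estimated parameter (a small illustrative check on the same model as before, not a formal argument).

```r
olsFit <- lm(y ~ x)
length(coef(olsFit))          # k coefficients estimated by OLS
attr(logLik(olsFit), "df")    # k + 1 parameters, including sigma^2
```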

References

• Burnham, K.P., Anderson, D.R., 2004. Model Selection and Multimodel Inference. Springer New York. https://doi.org/10.1007/b97636