3.2 Expected value and variance of a random variable
While a full probability distribution provides a complete picture of a random variable, we often need key summary measures to quickly understand its essential characteristics. These measures allow us to distil complex information into a few key parameters. The two most important summary measures for any probability distribution are:
- the expected value, which describes the distribution’s centre, and
- the variance, which describes its spread.
We begin with a practical example to build an intuitive understanding before moving to more formal definitions.
3.2.1 Daily Car Sales at DiCarlo Motors
Consider the sales data for DiCarlo Motors over the past 300 days of operation. The number of cars sold each day is a discrete random variable, as it can only take on a countable number of values (0, 1, 2, etc.). The observed frequencies are as follows:
Number of Cars Sold (\(x_i\)) | Number of Days |
---|---|
0 | 54 |
1 | 117 |
2 | 72 |
3 | 42 |
4 | 12 |
5 | 3 |
To analyse this as a random variable, we must convert the frequency data into a probability distribution. This is done by dividing the number of days for each outcome by the total number of days (300). The probability of selling \(x_i\) cars on any given day, \(P(x_i)\), is calculated below. Note that the probabilities must sum to 1.
Number of Cars Sold (\(x_i\)) | Calculation | Probability \(P(x_i)\) |
---|---|---|
0 | 54 / 300 | 0.18 |
1 | 117 / 300 | 0.39 |
2 | 72 / 300 | 0.24 |
3 | 42 / 300 | 0.14 |
4 | 12 / 300 | 0.04 |
5 | 3 / 300 | 0.01 |
Total | | 1.00 |
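As a quick check, the frequency-to-probability conversion can be reproduced in a few lines of Python (the values are taken from the table above; the variable names are purely illustrative):

```python
# Observed DiCarlo Motors sales frequencies: cars sold per day -> number of days
frequencies = {0: 54, 1: 117, 2: 72, 3: 42, 4: 12, 5: 3}

total_days = sum(frequencies.values())  # 300 days of operation

# Divide each frequency by the total number of days to obtain probabilities
probabilities = {x: days / total_days for x, days in frequencies.items()}

print(probabilities)  # {0: 0.18, 1: 0.39, 2: 0.24, 3: 0.14, 4: 0.04, 5: 0.01}
print(sum(probabilities.values()))  # must be 1 (up to floating-point rounding)
```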
With these values in hand, we can calculate the expected number of cars sold per day (the mean, or average).
3.2.2 Calculating the Expected Value (The Mean)
The expected value (or “expectation”), denoted by \(\mu\) (pronounced as “mu”), represents the long-run average of the random variable. It is a weighted average of all possible values, where each value is weighted by its probability of occurrence. The reason why we use probabilities as weights is to take into account the frequency of occurrence of each specific outcome. If we took a simple average, we would ignore the shape of the distribution and would assume (in a way) that the distribution is uniform. In our example, to get the expected value, we need to multiply columns 1 and 3 and then sum up the resulting value:
\[\begin{equation*} \begin{aligned} \mu = & (0 \times 0.18) + (1 \times 0.39) + (2 \times 0.24) + (3 \times 0.14) + (4 \times 0.04) + (5 \times 0.01) = \\ & 0 + 0.39 + 0.48 + 0.42 + 0.16 + 0.05 = 1.5 . \end{aligned} \end{equation*}\]
So, the expected value is 1.5 cars per day. While it is impossible to sell exactly 1.5 cars on any single day, this number tells us something important: over a long period of time, DiCarlo Motors can anticipate selling 1.5 cars per day on average. On any single day, the specific sales will vary according to the probabilities in the previous subsection, but if we sum all the sales and divide by the number of days, we get 1.5 cars per day. This measure is fundamental for forecasting; for example, the company could project average monthly sales of 45 cars (1.5 cars/day \(\times\) 30 days). This could be used for further decision making.
The mathematical formula for the expected value of a discrete random variable is: \[\begin{equation} \mu = P(x_1) x_1 + P(x_2) x_2 + \dots + P(x_n) x_n = \sum_{i=1}^n P(x_i) x_i , \tag{3.1} \end{equation}\] where \(x_i\) is the value of interest (e.g. number of cars sold) and \(P(x_i)\) is the probability of this specific outcome.
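Formula (3.1) translates directly into code. A minimal Python sketch, using the probability distribution from the example (variable names are illustrative):

```python
# Probability distribution of daily car sales at DiCarlo Motors
probabilities = {0: 0.18, 1: 0.39, 2: 0.24, 3: 0.14, 4: 0.04, 5: 0.01}

# Expected value: each outcome weighted by its probability, then summed (formula 3.1)
mu = sum(x * p for x, p in probabilities.items())

print(mu)  # approximately 1.5 cars per day
```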
The expected value shows the so-called central tendency of a distribution, pointing to where the distribution is “pulled”. If the distribution is symmetric, the expected value lies at its centre; in the asymmetric case, it is pulled towards the longer tail of the distribution.
3.2.3 Calculating the Variance and Standard Deviation
While the expected value gives us the central tendency of a distribution, it does not tell us anything about the variability of values in the data. The variance, denoted by \(\sigma^2\) (pronounced “sigma squared”), is calculated as the weighted average of the squared deviations from the mean, and shows exactly that: the overall variability of values around the expected value.
Using our calculated mean of \(\mu = 1.5\), we can find the variance for the car sales data: \[\begin{equation*} \begin{aligned} \sigma^2 = & (0 - 1.5)^2 \times 0.18 + (1 - 1.5)^2 \times 0.39 + (2 - 1.5)^2 \times 0.24 + (3 - 1.5)^2 \times 0.14 + (4 - 1.5)^2 \times 0.04 + (5 - 1.5)^2 \times 0.01 = \\ & 0.405 + 0.0975 + 0.06 + 0.315 + 0.25 + 0.1225 = 1.25 . \end{aligned} \end{equation*}\]
So, the variance in our example is 1.25, but its units are “cars squared” (because of the squares in the formula above), which is not intuitive. To get a measure of spread in the original units (cars), we can calculate the standard deviation (\(\sigma\)), which is simply the positive square root of the variance (\(\sigma = \sqrt{\sigma^2}\)): \[\begin{equation*} \sigma = \sqrt{1.25} \approx 1.118 . \end{equation*}\]
This standard deviation of 1.118 cars provides a more direct measure of the typical deviation from the mean of 1.5 cars: roughly speaking, the outcomes deviate from the expectation by 1.118 cars on average.
Mathematically, the formula for the variance is written as: \[\begin{equation} \sigma^2 = P(x_1) (x_1 - \mu)^2 + P(x_2) (x_2 - \mu)^2 + \dots + P(x_n) (x_n - \mu)^2 = \sum_{i=1}^n P(x_i) (x_i - \mu)^2 . \tag{3.2} \end{equation}\]
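Formula (3.2) and the standard deviation can be sketched in Python as well, reusing the distribution from the example (names are illustrative):

```python
import math

# Probability distribution of daily car sales at DiCarlo Motors
probabilities = {0: 0.18, 1: 0.39, 2: 0.24, 3: 0.14, 4: 0.04, 5: 0.01}

# Expected value, formula (3.1)
mu = sum(x * p for x, p in probabilities.items())

# Variance: probability-weighted average of squared deviations from the mean, formula (3.2)
sigma_squared = sum(p * (x - mu) ** 2 for x, p in probabilities.items())

# Standard deviation: positive square root of the variance, in the original units (cars)
sigma = math.sqrt(sigma_squared)

print(round(sigma_squared, 4))  # 1.25
print(round(sigma, 3))          # 1.118
```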
The purpose of variance and standard deviation is to quantify the dispersion or variability of the possible outcomes around the expected value. A distribution with outcomes that are typically far from the mean will have a high variance, indicating a wide spread. Conversely, a distribution with outcomes clustered tightly around the mean will have a low variance. The problem is that “high” and “low” have no specific definitions, and that these measures say nothing about the performance of a model fitted to the data; they only depict the overall variability.
Ultimately, the expected value and standard deviation work in tandem: the former pinpoints the centre of gravity of the distribution, while the latter quantifies the typical distance of an outcome from that centre. We will use both of them in the discussions in this and the following chapters.