## 2.2 Measuring uncertainty

While point forecasts are useful for understanding what to expect on average, prediction intervals are important in many applications because they quantify what to expect in a \(1-\alpha\) share of cases. Intervals quantify the uncertainty around the point forecasts and thus the riskiness of the decision. Without prediction intervals, you cannot adequately assess the uncertainty about future outcomes. If you cannot say that sales next week will be between 1,000 and 1,200 units with a confidence level of 95%, then you cannot say anything useful about future sales: as per the previous discussion, point forecasts represent only mean values and typically will not be equal to the actual observations from the holdout sample. This is why prediction intervals are needed in forecasting.

As with point forecasts, multiple measures can be used to evaluate prediction intervals. Here are the most popular ones:

- **Coverage**, showing the percentage of observations lying inside the interval: \[\begin{equation} \mathrm{coverage} = \frac{1}{h} \sum_{j=1}^h \left( \mathbb{1}(y_{t+j} > l_{t+j}) \times \mathbb{1}(y_{t+j} < u_{t+j}) \right), \tag{2.11} \end{equation}\] where \(l_{t+j}\) is the lower bound and \(u_{t+j}\) is the upper bound of the interval and \(\mathbb{1}(\cdot)\) is the indicator function, returning one when the condition is true and zero otherwise. Ideally, the coverage should be equal to the confidence level of the interval, but in reality, this can only be observed asymptotically, as the sample size increases, due to the inherent randomness of any sample estimates of parameters;
- **Range**, showing the width of the prediction interval: \[\begin{equation} \mathrm{range} = \frac{1}{h} \sum_{j=1}^h (u_{t+j} -l_{t+j}); \tag{2.12} \end{equation}\]
- **Mean Interval Score** (Gneiting and Raftery, 2007), which shows a combination of the previous two: \[\begin{equation} \begin{aligned} \mathrm{MIS} = & \frac{1}{h} \sum_{j=1}^h \left( (u_{t+j} -l_{t+j}) + \frac{2}{\alpha} (l_{t+j} -y_{t+j}) \mathbb{1}(y_{t+j} < l_{t+j}) +\right. \\ & \left. \frac{2}{\alpha} (y_{t+j} -u_{t+j}) \mathbb{1}(y_{t+j} > u_{t+j}) \right) , \end{aligned} \tag{2.13} \end{equation}\] where \(\alpha\) is the significance level. If the actual values lie outside of the interval, they get penalised with a ratio of \(\frac{2}{\alpha}\), proportional to the distance from the interval bound. At the same time, the width of the interval positively influences the value of the measure: the wider the interval, the higher the score. The ideal model with \(\mathrm{MIS}=0\) should have all the actual values in the holdout lying on the bounds of the interval and \(u_{t+j}=l_{t+j}\), implying that the bounds coincide with each other and that there is no uncertainty about the future (which is not possible in real life);
- **Pinball Score** (Koenker and Bassett, 1978), which measures the accuracy of models in terms of specific quantiles (this is usually applied to different quantiles produced from the model, not just to the lower and upper bounds of the 95% interval): \[\begin{equation} \mathrm{PS} = (1 -\alpha) \sum_{y_{t+j} < q_{t+j}, j=1,\dots,h } |y_{t+j} -q_{t+j}| + \alpha \sum_{y_{t+j} \geq q_{t+j} , j=1,\dots,h } |y_{t+j} -q_{t+j}|, \tag{2.14} \end{equation}\] where \(q_{t+j}\) is the value of the specific quantile of the distribution and \(\alpha\) here denotes the level of that quantile (e.g. 0.975 for the upper bound of the 95% interval). PS shows how well we capture the specific quantile in the data. The lower the value of pinball is, the closer the bound is to the specific quantile of the holdout distribution. If the PS is equal to zero, then we have done a perfect job in hitting that specific quantile. The main issue with PS is that it is very difficult to assess the quantiles correctly on small samples. For example, in order to get a better idea of how the 0.975 quantile performs, we would need to have at least 40 observations, so that 39 of them would be expected to lie below this bound \(\left(\frac{39}{40} = 0.975\right)\). In fact, the quantiles are not always uniquely defined (see, for example, Taylor, 2020), which makes the measurement difficult.
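
To make the four measures more tangible, here is a minimal sketch of how they could be computed in base R directly from formulae (2.11) to (2.14). The function names (`intervalCoverage`, `intervalRange`, `meanIntervalScore`, `pinballScore`) and the numbers are made up for illustration and are not part of any package; `alpha` is the significance level in the MIS and the quantile level in the pinball, following the notation of the respective formulae.

```r
# Hypothetical helper functions (not from any package), following (2.11)-(2.14)

# Share of actual values lying strictly inside the interval
intervalCoverage <- function(actual, lower, upper) {
  mean(actual > lower & actual < upper)
}

# Average width of the interval
intervalRange <- function(lower, upper) {
  mean(upper - lower)
}

# Mean Interval Score; alpha is the significance level (e.g. 0.05)
meanIntervalScore <- function(actual, lower, upper, alpha) {
  mean((upper - lower) +
         2 / alpha * (lower - actual) * (actual < lower) +
         2 / alpha * (actual - upper) * (actual > upper))
}

# Pinball Score; alpha is the level of the quantile (e.g. 0.975)
pinballScore <- function(actual, quantileValue, alpha) {
  absError <- abs(actual - quantileValue)
  below <- actual < quantileValue
  (1 - alpha) * sum(absError[below]) + alpha * sum(absError[!below])
}

# Made-up holdout values and interval bounds, for illustration only
y <- c(10, 12, 15, 9, 11)
l <- c(8, 9, 10, 10, 9)
u <- c(13, 14, 14, 13, 13)

intervalCoverage(y, l, u)
intervalRange(l, u)
meanIntervalScore(y, l, u, alpha = 0.05)
pinballScore(y, u, alpha = 0.975)
```

In this toy example, two of the five observations fall outside the interval, each by one unit, so each of them adds \(\frac{2}{0.05} \times 1 = 40\) to the sum inside the MIS before averaging, which illustrates how heavily misses are penalised relative to the width of the interval.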

Similar to the pinball function, it is possible to propose an expectile-based score, but while it has nice statistical properties (Taylor, 2020), it is more difficult to interpret.

Range, MIS and PS are unit-dependent. In order to aggregate them over several time series, they need to be scaled, as we did with MAE and RMSE in the previous section. This can be done either via division by the in-sample mean or by the in-sample mean absolute differences, which gives the scaled counterparts of the measures, or via division by the values from the benchmark model, which gives the relative ones.
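
The three scaling options can be sketched in base R as follows. The in-sample values, the MIS values and the variable names (`inSample`, `misModel`, `misBenchmark`) are all made up for illustration; the same logic would apply to range and PS.

```r
# Made-up in-sample data and unscaled MIS values, for illustration only
inSample <- c(100, 104, 98, 107, 111, 105)
misModel <- 36.6
misBenchmark <- 40.2

# Scaled MIS: division by the in-sample mean
sMISMean <- misModel / mean(inSample)

# Scaled MIS: division by the in-sample mean absolute differences
sMISDiff <- misModel / mean(abs(diff(inSample)))

# Relative MIS: division by the value from the benchmark model
rMIS <- misModel / misBenchmark

c(sMISMean = sMISMean, sMISDiff = sMISDiff, rMIS = rMIS)
```

A relative value below one would indicate that the model did better than the benchmark in terms of the chosen measure.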

If you are interested in the overall performance of the model, then MIS provides this information. However, it does not show what specifically happens inside and is difficult to interpret. Coverage and range are easier to interpret but only give information about the specific prediction interval and typically must be traded off against each other (i.e. one can either cover more or have a narrower interval). Academics prefer the pinball for the purposes of uncertainty assessment, as it shows more detailed information about the predictive distribution from each model, but, while it is easier to interpret than MIS, it is still not as straightforward as coverage and range. So, the selection of the measure, again, depends on your specific situation and on the understanding of statistics by decision makers.

### 2.2.1 Example in R

Continuing the example from the previous section, we could produce prediction intervals from the two models and compare them using MIS and pinball:

```
model1Forecast <- forecast(model1,h=10,interval="p",level=0.95)
model2Forecast <- forecast(model2,h=10,interval="p",level=0.95)
# Mean Interval Score
setNames(c(MIS(model1$holdout, model1Forecast$lower,
               model1Forecast$upper, 0.95),
           MIS(model2$holdout, model2Forecast$lower,
               model2Forecast$upper, 0.95)),
         c("Model 1", "Model 2"))
```

```
## Model 1 Model 2
## 36.63630 36.79749
```

```
# Pinball for the upper bound
setNames(c(pinball(model1$holdout, model1Forecast$upper, 0.975),
pinball(model2$holdout, model2Forecast$upper, 0.975)),
c("Model 1", "Model 2"))
```

```
## Model 1 Model 2
## 6.850181 6.651080
```

```
# Pinball for the lower bound
setNames(c(pinball(model1$holdout, model1Forecast$lower, 0.025),
pinball(model2$holdout, model2Forecast$lower, 0.025)),
c("Model 1", "Model 2"))
```

```
## Model 1 Model 2
## 2.308893 2.548292
```

```
# Coverage
setNames(c(mean(model1$holdout > model1Forecast$lower & model1$holdout < model1Forecast$upper),
mean(model2$holdout > model2Forecast$lower & model2$holdout < model2Forecast$upper)),
c("Model 1", "Model 2"))
```

```
## Model 1 Model 2
## 1 1
```

These measures do not tell us much about the performance of the models when they are applied to only one time series. In order to see a proper difference, we need to apply the models to a set of time series, produce forecasts, calculate the measures and then look at their aggregate performance, e.g. via the mean, median or quantiles.
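
As a rough sketch of that last step, suppose that a scaled MIS has already been calculated for two models on five time series. The values below and the object names (`sMISValues`, `aggregated`) are made up for illustration; in practice they would come from applying the models to a real dataset.

```r
# Hypothetical scaled MIS values for two models over five time series
sMISValues <- cbind("Model 1" = c(0.35, 0.42, 0.51, 0.29, 1.20),
                    "Model 2" = c(0.37, 0.40, 0.48, 0.33, 0.95))

# Aggregate over the series: mean, median and quartiles for each model
aggregated <- rbind(mean = colMeans(sMISValues),
                    median = apply(sMISValues, 2, median),
                    apply(sMISValues, 2, quantile, probs = c(0.25, 0.75)))
aggregated
```

Looking at the median and quantiles alongside the mean helps to see whether the difference between the models is driven by a few series with large errors, such as the fifth one in this made-up example.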