
## 2.2 Measuring uncertainty

As discussed in Section 1.3.2, point forecasts are not sufficient for adequate decision making: prediction intervals and quantiles are needed in order to capture the uncertainty of demand around the point forecast. As with point forecasts, multiple measures can be used to evaluate them. There are several useful measures for the evaluation of intervals. We start with the simplest of them, coverage.

**Coverage** shows the percentage of observations lying inside the interval: \[\begin{equation} \mathrm{coverage} = \frac{1}{h} \sum_{j=1}^h \left( \mathbb{1}(y_{t+j} > l_{t+j}) \times \mathbb{1}(y_{t+j} < u_{t+j}) \right), \tag{2.11} \end{equation}\] where \(l_{t+j}\) is the lower bound and \(u_{t+j}\) is the upper bound of the interval, and \(\mathbb{1}(\cdot)\) is the indicator function, returning one when the condition is true and zero otherwise. The product of the two indicators equals one only when the actual value lies between the bounds. Ideally, the coverage should be equal to the confidence level of the interval, but in reality this can only be observed asymptotically, as the sample size increases, due to the inherited randomness of any sample estimates of parameters.

**Range** shows the width of the prediction interval: \[\begin{equation} \mathrm{range} = \frac{1}{h} \sum_{j=1}^h (u_{t+j} -l_{t+j}). \tag{2.12} \end{equation}\] If the range of the interval from one model is lower than the range of the other one, then the uncertainty about the future values is lower for the first one. However, the narrower interval might not include as many actual values in the holdout sample, leading to lower coverage. So, there is a natural trade-off between the two measures.

**Mean Interval Score** (Gneiting and Raftery, 2007) combines the properties of the previous two measures: \[\begin{equation} \begin{aligned} \mathrm{MIS} = & \frac{1}{h} \sum_{j=1}^h \left( (u_{t+j} -l_{t+j}) + \frac{2}{\alpha} (l_{t+j} -y_{t+j}) \mathbb{1}(y_{t+j} < l_{t+j}) +\right. \\ & \left. \frac{2}{\alpha} (y_{t+j} -u_{t+j}) \mathbb{1}(y_{t+j} > u_{t+j}) \right) , \end{aligned} \tag{2.13} \end{equation}\] where \(\alpha\) is the significance level. If the actual values lie outside of the interval, they get penalised with a ratio of \(\frac{2}{\alpha}\), proportional to the distance from the interval bound. At the same time, the width of the interval positively influences the value of the measure: the wider the interval, the higher the score. The ideal model with \(\mathrm{MIS}=0\) would have all the actual values in the holdout lying on the bounds of the interval with \(u_{t+j}=l_{t+j}\), implying that the bounds coincide with each other and that there is no uncertainty about the future (which is not possible in real life).

**Pinball Score** (Koenker and Bassett, 1978) measures the accuracy of models in terms of specific quantiles (it is usually applied to several quantiles produced from the model, not just to the lower and upper bounds of the 95% interval): \[\begin{equation} \mathrm{PS} = (1 -\alpha) \sum_{y_{t+j} < q_{t+j}, j=1,\dots,h } |y_{t+j} -q_{t+j}| + \alpha \sum_{y_{t+j} \geq q_{t+j} , j=1,\dots,h } |y_{t+j} -q_{t+j}|, \tag{2.14} \end{equation}\] where \(q_{t+j}\) is the value of the specific quantile of the distribution. PS shows how well we capture the specific quantile in the data. The lower the value of the pinball score, the closer the bound is to the specific quantile of the holdout distribution. If PS is equal to zero, then we have done a perfect job in hitting that specific quantile. The main issue with PS is that it is very difficult to assess quantiles correctly on small samples. For example, in order to get a better idea of how well the 0.975 quantile is captured, we would need at least 40 observations, so that 39 of them would be expected to lie below this bound \(\left(\frac{39}{40} = 0.975\right)\). In fact, quantiles are not always uniquely defined (see, for example, Taylor, 2020), which makes the measurement difficult.
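To make the definitions above concrete, here is a minimal sketch computing the four measures directly from formulas (2.11) to (2.14). All numbers here are made up purely for illustration, standing in for holdout actuals and the bounds of a 95% prediction interval:

```
# Hypothetical holdout actuals and 95% prediction interval bounds
actuals <- c(102, 98, 105, 110, 95)
lower <- c(90, 92, 93, 94, 96)
upper <- c(110, 108, 112, 115, 104)
alpha <- 0.05

# Coverage (2.11): share of actual values lying inside the interval
coverage <- mean(actuals > lower & actuals < upper)

# Range (2.12): average width of the interval
intervalRange <- mean(upper - lower)

# MIS (2.13): width plus 2/alpha penalties for actuals outside the bounds
MISValue <- mean((upper - lower) +
                 2 / alpha * (lower - actuals) * (actuals < lower) +
                 2 / alpha * (actuals - upper) * (actuals > upper))

# Pinball (2.14) for the upper bound, i.e. the 0.975 quantile
quantileLevel <- 0.975
PSValue <- (1 - quantileLevel) * sum(abs(actuals - upper)[actuals < upper]) +
           quantileLevel * sum(abs(actuals - upper)[actuals >= upper])
```

On these made-up numbers, one actual value (95) falls below its lower bound, so the coverage is 0.8 rather than the nominal 0.95, and that same observation contributes the \(\frac{2}{\alpha}\) penalty to MIS.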

Similar to the pinball function, it is possible to propose the expectile-based score, but while it has nice statistical properties (Taylor, 2020), it is more difficult to interpret.
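As a sketch of what such a score could look like, the following uses the asymmetric squared loss that defines expectiles; the function name and this particular formulation are our illustration, not a definition taken from Taylor (2020):

```
# Asymmetric squared loss: squared deviations from the expectile e are
# weighted by alpha when the actual lies above it and by (1 - alpha) below it
expectileScore <- function(actuals, e, alpha){
  sum(abs(alpha - (actuals < e)) * (actuals - e)^2)
}

# In the symmetric case (alpha = 0.5) this reduces to half the sum of squares
expectileScore(c(1, 3), 2, 0.5)
```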

Range, MIS, and PS are unit-dependent. In order to aggregate them over several time series, they need to be scaled (as we did with MAE and RMSE in Section 2.1), either via division by the in-sample mean or in-sample mean absolute differences, in order to obtain the scaled counterparts of the measures, or via division by the values from a benchmark model, in order to obtain the relative ones.
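A minimal sketch of both scaling options, using MIS as the example; the in-sample series and the MIS values here are hypothetical:

```
yInsample <- c(100, 103, 97, 101, 99) # hypothetical in-sample actuals
MISModel <- 24.8                      # hypothetical MIS of the model
MISBenchmark <- 31                    # hypothetical MIS of a benchmark model

# Scaled MIS: division by the in-sample mean makes the measure unit-free
MISScaled <- MISModel / mean(yInsample)

# Relative MIS: division by the value from the benchmark model
MISRelative <- MISModel / MISBenchmark
```

Either version can then be averaged across time series, because the units of the original data no longer matter.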

If you are interested in the overall performance of the model, then MIS provides this information. However, it does not show what specifically happens inside the interval and is difficult to interpret. Coverage and range are easier to interpret, but they only give information about the specific prediction interval and typically must be traded off against each other (i.e. one can either cover more or have a narrower interval). Academics prefer the pinball score for the purposes of uncertainty assessment, as it gives more detailed information about the predictive distribution from each model, but, while it is easier to interpret than MIS, it is still not as straightforward as coverage and range. So, the selection of the measure, again, depends on your specific situation and on the understanding of statistics by decision makers.

### 2.2.1 Example in R

Continuing the example from Section 2.1, we could produce prediction intervals from the two models and compare them using MIS and pinball:

```
model1Forecast <- forecast(model1, h=10, interval="p", level=0.95)
model2Forecast <- forecast(model2, h=10, interval="p", level=0.95)
# Mean Interval Score
setNames(c(MIS(model1$holdout, model1Forecast$lower,
               model1Forecast$upper, 0.95),
           MIS(model2$holdout, model2Forecast$lower,
               model2Forecast$upper, 0.95)),
         c("Model 1", "Model 2"))
```

```
## Model 1 Model 2
## 39.68038 46.91441
```

```
# Pinball for the upper bound
setNames(c(pinball(model1$holdout, model1Forecast$upper, 0.975),
pinball(model2$holdout, model2Forecast$upper, 0.975)),
c("Model 1", "Model 2"))
```

```
## Model 1 Model 2
## 5.402511 6.703046
```

```
# Pinball for the lower bound
setNames(c(pinball(model1$holdout, model1Forecast$lower, 0.025),
pinball(model2$holdout, model2Forecast$lower, 0.025)),
c("Model 1", "Model 2"))
```

```
## Model 1 Model 2
## 4.517584 5.025557
```

```
# Coverage
setNames(c(mean(model1$holdout > model1Forecast$lower &
                model1$holdout < model1Forecast$upper),
           mean(model2$holdout > model2Forecast$lower &
                model2$holdout < model2Forecast$upper)),
         c("Model 1", "Model 2"))
```

```
## Model 1 Model 2
## 0.9 0.9
```

The values above imply that the first model (ETS) performed better than the second one in terms of MIS and pinball loss (its interval was narrower). However, these measures do not tell us much about the performance of the models when applied to only one time series. In order to get more solid results, we need to apply the models to a set of time series, produce prediction intervals, calculate the measures, and then look at their aggregate performance, e.g. via the mean / median or quantiles. The loop and the analysis would be similar to the one discussed in Section @ref(#errorMeasuresExampleBig), so we do not repeat it here.