2.2 Measuring uncertainty

\( \newcommand{\mathbbm}[1]{\boldsymbol{\mathbf{#1}}} \)

As discussed in Section 1.3.2, point forecasts are not sufficient for adequate decision making – prediction intervals and quantiles are needed to capture the uncertainty of demand. As with point forecasts, multiple measures can be used to evaluate them. There are several useful measures for the evaluation of intervals. We start with the simplest of them, coverage.

Coverage shows the percentage of observations lying inside the interval: \[\begin{equation} \mathrm{Coverage} = \frac{1}{h} \sum_{j=1}^h \left( \mathbbm{1}(y_{t+j} > l_{t+j}) \times \mathbbm{1}(y_{t+j} < u_{t+j}) \right), \tag{2.11} \end{equation}\] where \(l_{t+j}\) is the lower bound and \(u_{t+j}\) is the upper bound of the interval and \(\mathbbm{1}(\cdot)\) is the indicator function, returning one, when the condition is true and zero otherwise. Ideally, the coverage should be equal to the confidence level of the interval, but in reality, this can only be observed asymptotically (with the increase of the sample size) due to the inherited randomness of any sample estimates of parameters;
Range shows the width of the prediction interval: \[\begin{equation} \mathrm{Range} = \frac{1}{h} \sum_{j=1}^h (u_{t+j} -l_{t+j}), \tag{2.12} \end{equation}\] If the range of intervals from one model is lower than the range of the other one, then the uncertainty about the future values is lower for the first one. However, the narrower interval might not include as many actual values in the holdout sample, leading to lower coverage. So, there is a natural trade-off between the two measures;
Mean Interval Score (Gneiting and Raftery, 2007) combines the properties of the previous two measures: \[\begin{equation} \begin{aligned} \mathrm{MIS} = & \frac{1}{h} \sum_{j=1}^h \left( (u_{t+j} -l_{t+j}) + \frac{2}{\alpha} (l_{t+j} -y_{t+j}) \mathbbm{1}(y_{t+j} < l_{t+j}) +\right. \\ & \left. \frac{2}{\alpha} (y_{t+j} -u_{t+j}) \mathbbm{1}(y_{t+j} > u_{t+j}) \right) , \end{aligned} \tag{2.13} \end{equation}\] where \(\alpha\) is the significance level. If the actual values lie outside of the interval, they get penalised with a ratio of \(\frac{2}{\alpha}\), proportional to the distance from the interval bound. At the same time, the width of the interval positively influences the value of the measure: the wider the interval, the higher the score. The ideal model with \(\mathrm{MIS}=0\) should have all the actual values in the holdout lying on the bounds of the interval and \(u_{t+j}=l_{t+j}\), implying that the bounds coincide with each other and that there is no uncertainty about the future (which is not possible in real life);
Pinball Score (Koenker and Bassett, 1978) measures the accuracy of models in terms of specific quantiles (this is usually applied to different quantiles produced from the model, not just to the lower and upper bounds of 95% interval): \[\begin{equation} \mathrm{PS} = (1 -\alpha) \sum_{y_{t+j} < q_{t+j}, j=1,\dots,h } |y_{t+j} -q_{t+j}| + \alpha \sum_{y_{t+j} \geq q_{t+j} , j=1,\dots,h } |y_{t+j} -q_{t+j}|, \tag{2.14} \end{equation}\] where \(q_{t+j}\) is the value of the specific quantile of the distribution. PS shows how well we capture the specific quantile in the data. The lower the value of the pinball is, the closer the bound is to the specific quantile of the holdout distribution. If the PS is equal to zero, then we have done the perfect job in hitting that specific quantile. The main issue with PS is that it is very difficult to assess the quantiles correctly on small samples. For example, in order to get a better idea of how the 0.975 quantile performs, we would need to have at least 40 observations, so that 39 of them would be expected to lie below this bound \(\left(\frac{39}{40} = 0.975\right)\). In fact, quantiles are not always uniquely defined (see, for example, Taylor, 2020), which makes the measurement challenging.

Similar to the pinball function, it is possible to propose the expectile-based score, but while expectiles have good statistical properties (Taylor, 2020), they are more difficult to interpret.

Range, MIS, and PS are unit-dependent. To aggregate them over several time series, they need to be scaled either via division by either the in-sample mean or in-sample mean absolute differences to obtain the scaled counterparts of the measures or via division by the values from the benchmark model to get the relative one. The idea here would be similar to what we discussed for MAE and RMSE in Section 2.1.

If you are interested in the model’s overall performance, then MIS provides this information. However, it does not show what happens specifically (is there an issue in the distance from the bound or the width of the interval?) and it is difficult to interpret. Coverage and range are easier to interpret, but they only give information about the specific prediction interval. They typically must be traded off against each other (i.e. one can either cover more or have a narrower interval). Academics prefer pinball for uncertainty assessment, as it shows more detailed information about the predictive distribution from each model. However, while it is easier to interpret than MIS, it is still not as straightforward as coverage and range. So, the selection of the measure depends on the specific situation and the understanding of statistics by decision-makers.

2.2.1 Example in R

Continuing the example from Section 2.1, we could produce prediction intervals from the two models and compare them using MIS and pinball:

model1Forecast <- forecast(model1,h=10,interval="p",level=0.95)
model2Forecast <- forecast(model2,h=10,interval="p",level=0.95)

# Mean Interval Score
setNames(c(MIS(model1$holdout, model1Forecast$lower,
               model1Forecast$upper, 0.95),
           MIS(model2$holdout, model2Forecast$lower,
               model2Forecast$upper, 0.95)),
         c("Model 1", "Model 2"))

##  Model 1  Model 2 
## 39.68038 46.91441

# Pinball for the upper bound
setNames(c(pinball(model1$holdout, model1Forecast$upper, 0.975),
           pinball(model2$holdout, model2Forecast$upper, 0.975)),
         c("Model 1", "Model 2"))

##  Model 1  Model 2 
## 5.402511 6.703046

# Pinball for the lower bound
setNames(c(pinball(model1$holdout, model1Forecast$lower, 0.025),
           pinball(model2$holdout, model2Forecast$lower, 0.025)),
         c("Model 1", "Model 2"))

##  Model 1  Model 2 
## 4.517584 5.025557

# Coverage
setNames(c(mean(model1$holdout > model1Forecast$lower &
                model1$holdout < model1Forecast$upper),
           mean(model2$holdout > model2Forecast$lower &
                model2$holdout < model2Forecast$upper)),
         c("Model 1", "Model 2"))

## Model 1 Model 2 
##     0.9     0.9

The values above imply that the first model (ETS) performed better than the second one in terms of MIS and pinball loss (the interval was narrower). However, these measures do not tell much in terms of the performance of models when only applied to one time series. To see more solid results, we need to apply models to a set of time series, produce prediction intervals, calculate measures, and then look at their aggregate performance, e.g. via mean/median or quantiles. The loop and the analysis would be similar to the one discussed in Section 2.1.3, so we do not repeat it here.

References

• Gneiting, T., Raftery, A.E., 2007. Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association. 102, 359–378. https://doi.org/10.1198/016214506000001437

• Koenker, R., Bassett, G., 1978. Regression Quantiles. Econometrica. 46, 33. https://doi.org/10.2307/1913643

• Taylor, J.W., 2020. Evaluating Quantile-bounded and Expectile-bounded Interval Forecasts. International Journal of Forecasting. 37. https://doi.org/10.1016/j.ijforecast.2020.09.007