
## 2.1 Measuring accuracy of point forecasts

We start with a setting in which we are interested in point forecasts only. In this case, we typically start by splitting the available data into train and test sets, fit the models under consideration to the former, and produce forecasts for the latter without showing that part to the models. This is called the “fixed origin” approach: we fix the point in time from which to produce forecasts, produce them, calculate some sort of error measure and compare the models.
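As a rough sketch of the fixed origin approach in base R, consider the following. The series and the two competitors (a Naïve forecast and the in-sample mean) are made up for illustration only, and the mean absolute error is used here simply as one possible error measure.

```r
# Made-up series of 20 observations; the last h = 5 form the test set
y <- c(12, 15, 14, 16, 18, 17, 19, 21, 20, 22,
       24, 23, 25, 27, 26, 28, 30, 29, 31, 33)
h <- 5
train <- head(y, length(y) - h)
test <- tail(y, h)

# Two simple hypothetical competitors that only see the train set
forecastNaive <- rep(tail(train, 1), h)   # last observed value
forecastMean <- rep(mean(train), h)       # in-sample mean

# One possible error measure calculated on the test set
errors <- c(Naive = mean(abs(test - forecastNaive)),
            Mean = mean(abs(test - forecastMean)))
errors
```

The forecast origin is fixed at observation 15, and the test part is used for evaluation only.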

There are different error measures that can be used in this case. Which measure ought to be used depends on the specific need. Here we briefly discuss the most important measures and refer the reader to the works listed in the References for the gory details.

The majority of point forecast measures rely on the following two popular metrics:

Root Mean Squared Error (RMSE): $\begin{equation} \mathrm{RMSE} = \sqrt{\frac{1}{h} \sum_{j=1}^h \left( y_{t+j} - \hat{y}_{t+j} \right)^2 }, \tag{2.1} \end{equation}$ and Mean Absolute Error (MAE): $\begin{equation} \mathrm{MAE} = \frac{1}{h} \sum_{j=1}^h \left| y_{t+j} - \hat{y}_{t+j} \right| , \tag{2.2} \end{equation}$ where $$y_{t+j}$$ is the actual value $$j$$ steps ahead from the holdout, $$\hat{y}_{t+j}$$ is the $$j$$ steps ahead point forecast and $$h$$ is the forecast horizon. As you see, these error measures aggregate the performance of competing forecasting methods across the forecasting horizon, averaging out the specific performances on each $$j$$. If this information needs to be retained, then the summation can be dropped to obtain a set of “SE” and “AE.”
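To make (2.1) and (2.2) more tangible, here is a minimal sketch in base R. The functions `rmse()` and `mae()` and the numbers are hypothetical and written for illustration only; they are not the implementations from any package.

```r
# Hypothetical helper functions implementing (2.1) and (2.2)
rmse <- function(holdout, forecast) sqrt(mean((holdout - forecast)^2))
mae <- function(holdout, forecast) mean(abs(holdout - forecast))

# Made-up holdout values and point forecasts for h = 4
holdout <- c(102, 98, 105, 95)
forecast <- c(100, 100, 100, 100)

rmse(holdout, forecast)
mae(holdout, forecast)

# Without the averaging we retain the horizon-specific "SE" and "AE"
(holdout - forecast)^2
abs(holdout - forecast)
```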

It is well-known (see, for example, Kolassa, 2016) that RMSE is minimised by the mean value of a distribution, and MAE is minimised by the median. So, when selecting between the two, you should consider this property. This means, for example, that MAE-based error measures should not be used for the evaluation of models on intermittent demand, because a zero forecast will minimise MAE when the sample contains more than 50% of zeroes (see, for example, Wallström and Segerstedt, 2010).
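This property can be checked numerically with a small sketch in base R. The intermittent series is made up, and the grid search over constant forecasts is only an illustration, not a method from the cited papers.

```r
# Made-up intermittent demand: more than 50% of the values are zeroes
y <- c(0, 0, 0, 0, 0, 0, 3, 0, 5, 2)

# Candidate constant point forecasts, from 0 to 5 in steps of 0.01
candidates <- seq(0, 5, by = 0.01)

# RMSE and MAE for each candidate forecast
rmseValues <- sapply(candidates, function(f) sqrt(mean((y - f)^2)))
maeValues <- sapply(candidates, function(f) mean(abs(y - f)))

# The candidate minimising RMSE is compared with the mean,
# the candidate minimising MAE is compared with the median
c(bestRMSE = candidates[which.min(rmseValues)], mean = mean(y))
c(bestMAE = candidates[which.min(maeValues)], median = median(y))
```

In this example, MAE prefers the zero forecast, which is of little use for decisions on intermittent demand.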

Another error measure that has been used in some cases is Root Mean Squared Logarithmic Error (RMSLE, see discussion in Tofallis, 2015): $\begin{equation} \mathrm{RMSLE} = \exp\left(\sqrt{\frac{1}{h} \sum_{j=1}^h \left( \log y_{t+j} - \log \hat{y}_{t+j} \right)^2} \right). \tag{2.3} \end{equation}$ It assumes that the actual values and the forecasts are positive and is minimised by the geometric mean. In the formula (2.3), I have added the exponentiation, which is sometimes omitted. The reason for this is to bring the metric to the original scale, so that it has the same units as the actual values $$y_t$$.
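A minimal sketch of (2.3) in base R is shown below. The function `rmsle()` and the data are hypothetical and only meant to illustrate that a constant forecast equal to the geometric mean gives a lower RMSLE than the median or the arithmetic mean on skewed data.

```r
# Hypothetical helper implementing (2.3), including the exponentiation
rmsle <- function(holdout, forecast) {
    exp(sqrt(mean((log(holdout) - log(forecast))^2)))
}

# Made-up positive, right-skewed values
y <- c(1, 2, 4, 8, 100)

# Geometric mean: exponent of the mean of logarithms
geoMean <- exp(mean(log(y)))

# RMSLE of constant forecasts equal to the three measures of central tendency
c(geometric = rmsle(y, geoMean),
  median = rmsle(y, median(y)),
  arithmetic = rmsle(y, mean(y)))
```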

The main difference between the three measures arises when the data we deal with is not symmetric - in that case the arithmetic mean, the geometric mean and the median will be different, and thus the error measures might recommend different approaches depending on what specifically is produced as a point forecast from the model (see discussion in Section 1.3.1).

### 2.1.1 An example in R

In order to see how the error measures work, we consider the following example based on a couple of forecasting functions from the smooth package for R and error measures from the greybox package:

    # smooth provides es() and ces(), greybox provides MSE() and MAE()
    library(smooth)
    library(greybox)

    # Generate 100 observations from N(100, 10^2);
    # no seed is set, so the values will differ between runs
    y <- rnorm(100,100,10)
    # Fit ETS and CES, keeping the last 10 observations as a holdout
    model1 <- es(y,h=10,holdout=TRUE)
    model2 <- ces(y,h=10,holdout=TRUE)

    # RMSE
    setNames(sqrt(c(MSE(model1$holdout, model1$forecast),
                    MSE(model2$holdout, model2$forecast))),
             c("ETS","CES"))
    ##      ETS      CES
    ## 9.492744 9.494683

    # MAE
    setNames(c(MAE(model1$holdout, model1$forecast),
               MAE(model2$holdout, model2$forecast)),
             c("ETS","CES"))
    ##      ETS      CES
    ## 7.678865 7.678846

    # RMSLE
    setNames(exp(sqrt(c(MSE(log(model1$holdout), log(model1$forecast)),
                        MSE(log(model2$holdout), log(model2$forecast))))),
             c("ETS","CES"))
    ##      ETS      CES
    ## 1.095623 1.095626

Given that the distribution of the original data is symmetric, all three error measures should in general recommend the same model. But given also that the data we generated for the example is stationary, the two models will produce very similar forecasts. The values above demonstrate the latter point - the accuracy of the two models is roughly the same. Note that we have evaluated the same point forecasts from the models using different error measures, which would be wrong if the distribution of the data were skewed. In our case, the models rely on the normal distribution, so the point forecasts from them would coincide with the arithmetic mean, the geometric mean and the median.

### 2.1.2 Aggregating error measures

The main advantage of the error measures discussed in the previous subsection is that they are very simple and have clear interpretations: they reflect the “average” distances between the point forecasts and the observed values. They are perfect for work with only one time series. However, they are not suitable when a set of time series is under consideration and a forecasting method needs to be selected across them. This is because they are scale dependent and contain specific units: if you measure sales of apples in units, then MAE, RMSE and RMSLE (defined in equation (2.3)) will show the error in units as well. And, as we know, you should not add up apples with oranges - the result might not make sense.

In order to tackle this issue, different error scaling techniques have been proposed, resulting in a zoo of error measures:

1. MAPE - Mean Absolute Percentage Error: $\begin{equation} \mathrm{MAPE} = \frac{1}{h} \sum_{j=1}^h \frac{|y_{t+j} - \hat{y}_{t+j}|}{y_{t+j}}, \tag{2.4} \end{equation}$
2. MASE - Mean Absolute Scaled Error : $\begin{equation} \mathrm{MASE} = \frac{1}{h} \sum_{j=1}^h \frac{|y_{t+j} - \hat{y}_{t+j}|}{\bar{\Delta}_y}, \tag{2.5} \end{equation}$ where $$\bar{\Delta}_y = \frac{1}{t-1}\sum_{j=2}^t |\Delta y_{j}|$$ is the mean absolute value of the first differences $$\Delta y_{j}=y_j-y_{j-1}$$ of the in-sample data;
3. rMAE - Relative Mean Absolute Error : $\begin{equation} \mathrm{rMAE} = \frac{\mathrm{MAE}_a}{\mathrm{MAE}_b}, \tag{2.6} \end{equation}$ where $$\mathrm{MAE}_a$$ is the mean absolute error of the model under consideration and $$\mathrm{MAE}_b$$ is the MAE of the benchmark model;
4. sMAE - scaled Mean Absolute Error : $\begin{equation} \mathrm{sMAE} = \frac{\mathrm{MAE}}{\bar{y}}, \tag{2.7} \end{equation}$ where $$\bar{y}$$ is the mean of the in-sample data.
5. and others.
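The measures (2.4) - (2.7) can be sketched in base R as follows. All the values are made up, `forecastA` plays the role of the model under consideration and `forecastB` of the benchmark, and `mae()` is a hypothetical helper; the measures() function from greybox discussed later does this in a more complete way.

```r
# Made-up in-sample data, holdout and forecasts from two hypothetical methods
insample <- c(10, 12, 11, 13, 12, 14)
holdout <- c(15, 13, 16)
forecastA <- c(14, 14, 14)   # model under consideration
forecastB <- c(14, 14, 17)   # benchmark

mae <- function(holdout, forecast) mean(abs(holdout - forecast))

# MAPE (2.4): absolute errors divided by the actual values
mape <- mean(abs(holdout - forecastA) / holdout)
# MASE (2.5): MAE divided by the mean absolute first difference in sample
mase <- mae(holdout, forecastA) / mean(abs(diff(insample)))
# rMAE (2.6): MAE of the model divided by MAE of the benchmark
rmae <- mae(holdout, forecastA) / mae(holdout, forecastB)
# sMAE (2.7): MAE divided by the in-sample mean
smae <- mae(holdout, forecastA) / mean(insample)

round(c(MAPE = mape, MASE = mase, rMAE = rmae, sMAE = smae), 4)
```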

There is no “best” error measure. All have advantages and disadvantages, but some of them are more suitable in some circumstances than the others. For example:

1. MAPE is scale sensitive (if the actual values are measured in thousands of units, the resulting error will be much lower than in the case of hundreds of units) and cannot be estimated on data with zeroes. Furthermore, this error measure is biased, favouring models that underforecast the data (see, for example, Makridakis, 1993), and is not minimised by the median, but in general by an unknown quantity. Incidentally, in the case of the log-normal distribution, it is minimised by the mode (see discussion in Kolassa, 2016). Despite all these limitations, MAPE has a simple interpretation, as it shows the percentage error (as the name suggests);
2. MASE avoids the disadvantages of MAPE, but does so at the cost of a simple interpretation due to the division by the first differences of the data (some interpret this as an in-sample one-step-ahead Naïve forecast, which does not simplify the interpretation);
3. rMAE avoids the disadvantages of MAPE and has a simple interpretation (it shows by how much one model is better than the other), but fails when either $$\mathrm{MAE}_a$$ or $$\mathrm{MAE}_b$$ for a specific time series is equal to zero. In practice, this happens more often than desired and can be considered a serious limitation of the error measure. Furthermore, an increase in rMAE (for example, with an increase in the sample size) might mean either that method A (the numerator in (2.6)) is performing worse than before, or that method B (the benchmark in the denominator) is performing better than before - it is not possible to tell the difference unless the denominator in the formula (2.6) is fixed;
4. sMAE avoids the disadvantages of MAPE and has an interpretation close to it, but breaks down when the data has a trend.
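Two of the failures listed above are easy to reproduce. The sketch below uses made-up numbers and a hypothetical `mae()` helper: the holdout contains zeroes, and the benchmark happens to forecast it perfectly.

```r
mae <- function(holdout, forecast) mean(abs(holdout - forecast))

# Made-up holdout with zeroes and two hypothetical sets of forecasts
holdout <- c(0, 2, 0, 1)
forecastA <- c(1, 1, 1, 1)
forecastB <- holdout          # benchmark that happens to be perfect

# MAPE: division by zero actual values gives Inf for those observations
abs(holdout - forecastA) / holdout
mapeA <- mean(abs(holdout - forecastA) / holdout)
mapeA

# rMAE: a perfect benchmark leads to a zero denominator
rmaeA <- mae(holdout, forecastA) / mae(holdout, forecastB)
rmaeA
```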

When comparing different forecasting methods, it might make sense to calculate several error measures. The choice of metric might depend on the specific needs of the forecaster. Here are a few rules of thumb, however:

• If you want a robust measure that works consistently, but you do not care about the interpretation, then go with MASE.
• If you want an interpretation, then either go with rMAE, or sMAE (just keep in mind that if you decide to use rMAE or any other relative measure, you might get attacked by its creator, Andrey Davydenko, who might blame you for stealing his creation, even if you put a reference to his work).
• If the data does not exhibit trends (stationary), then you can use sMAE.
• You should typically avoid MAPE and other percentage error measures because they are highly influenced by the actual values you have in the holdout.

Furthermore, similar to the measures above, RMSE-based scaled and relative error metrics have been proposed, which measure the performance of methods in terms of means rather than medians. Here is a brief list of some of them:

1. RMSSE - Root Mean Squared Scaled Error : $\begin{equation} \mathrm{RMSSE} = \sqrt{\frac{1}{h} \sum_{j=1}^h \frac{(y_{t+j} - \hat{y}_{t+j})^2}{\bar{\Delta}_y^2}} ; \tag{2.8} \end{equation}$
2. rRMSE - Relative Root Mean Squared Error : $\begin{equation} \mathrm{rRMSE} = \frac{\mathrm{RMSE}_a}{\mathrm{RMSE}_b} ; \tag{2.9} \end{equation}$
3. sRMSE - scaled Root Mean Squared Error : $\begin{equation} \mathrm{sRMSE} = \frac{\mathrm{RMSE}}{\bar{y}} . \tag{2.10} \end{equation}$

Similarly, RMSSLE, rRMSLE and sRMSLE can be proposed, using the same principles as in (2.8), (2.9) and (2.10) to assess performance of models in terms of geometric means across time series.
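The RMSE-based measures (2.8) - (2.10) can be sketched in base R in the same way as their MAE-based counterparts. The data and the `rmse()` helper are hypothetical. Because the scale $$\bar{\Delta}_y$$ in (2.8) is a positive constant, dividing RMSE by it gives the same value as the formula with the division inside the sum.

```r
# Made-up in-sample data, holdout and forecasts from two hypothetical methods
insample <- c(10, 12, 11, 13, 12, 14)
holdout <- c(15, 13, 16)
forecastA <- c(14, 14, 14)   # model under consideration
forecastB <- c(14, 14, 17)   # benchmark

rmse <- function(holdout, forecast) sqrt(mean((holdout - forecast)^2))

# Mean absolute first difference of the in-sample data
scale <- mean(abs(diff(insample)))

# RMSSE (2.8), calculated literally and via the RMSE / scale shortcut
rmsse <- sqrt(mean((holdout - forecastA)^2 / scale^2))
rmsseShortcut <- rmse(holdout, forecastA) / scale
# rRMSE (2.9) and sRMSE (2.10)
rrmse <- rmse(holdout, forecastA) / rmse(holdout, forecastB)
srmse <- rmse(holdout, forecastA) / mean(insample)

round(c(RMSSE = rmsse, rRMSE = rrmse, sRMSE = srmse), 4)
```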

Finally, when aggregating the performance of forecasting methods across several time series, it sometimes makes sense to look at the distribution of errors: this way you will know which of the methods fails seriously and which does a consistently good job. If an aggregate measure is needed, then use the mean and the median of the chosen metric. The mean might be non-finite for some error measures, especially when a method performs extremely poorly on a time series (an outlier), but it will give you information about the average performance of the method and might flag the extreme cases. The median, at the same time, is robust to outliers and is always calculable, no matter what the distribution of the error term is. Furthermore, the comparison of the mean and the median might provide additional information about the tail of the distribution without resorting to histograms or the calculation of quantiles. Davydenko and Fildes (2013) argue for the use of the geometric mean for relative and scaled measures, but, as discussed earlier, it might become equal to zero or to infinity if the data contains outliers (e.g. in two cases: when one of the methods produced a perfect forecast, or when the benchmark in rMAE produced a perfect forecast). At the same time, if the distribution of errors in logarithms is symmetric (which is the main argument of Davydenko and Fildes, 2013), then the geometric mean will coincide with the median, so there is no point in calculating the geometric mean at all.
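The behaviour of the three aggregates can be illustrated with a small sketch in base R. The rMAE values below are made up: one series is an outlier (12), on one series the method was perfect (0), and in the second vector the benchmark was perfect on one more series (Inf).

```r
# Made-up rMAE values of one method across ten time series
rMAEvalues <- c(0.8, 0.9, 1.1, 0.7, 0.95, 1.05, 0.85, 12, 0.9, 0)

mean(rMAEvalues)             # pulled up by the outlier
median(rMAEvalues)           # robust to the outlier
exp(mean(log(rMAEvalues)))   # geometric mean collapses to zero

# Add a series where the benchmark produced a perfect forecast
withInf <- c(rMAEvalues, Inf)
mean(withInf)                # non-finite
median(withInf)              # still calculable
```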

### 2.1.3 Demonstration in R

In R, there is a variety of functions that calculate the error measures discussed above, including the accuracy() function from the forecast package and measures() from greybox. Here is an example of how the measures can be calculated based on a couple of forecasting functions from the smooth package for R and a set of generated time series:

    # smooth provides es() and ces(), greybox provides measures()
    library(smooth)
    library(greybox)

    # Apply a model to test data to get the names of the error measures
    y <- rnorm(100,100,10)
    test <- es(y,h=10,holdout=TRUE)
    # Define the number of iterations
    nsim <- 100
    # Create an array for nsim time series, 2 models and a set of error measures
    errorMeasures <- array(NA, c(nsim,2,length(test$accuracy)),
                           dimnames=list(NULL,c("ETS","CES"),
                                         names(test$accuracy)))
    # Start a loop for nsim iterations
    for(i in 1:nsim){
        # Generate a time series
        y <- rnorm(100,100,10)
        # Apply ETS
        testModel1 <- es(y,"ANN",h=10,holdout=TRUE)
        errorMeasures[i,1,] <- measures(testModel1$holdout, testModel1$forecast,
                                        actuals(testModel1))
        # Apply CES
        testModel2 <- ces(y,h=10,holdout=TRUE)
        errorMeasures[i,2,] <- measures(testModel2$holdout, testModel2$forecast,
                                        actuals(testModel2))
    }

The default benchmark method for the relative measures above is Naïve. In order to see what the distribution of error measures looks like, we can produce violin plots via the vioplot() function from the vioplot package. We will focus on the rRMSE measure (see Figure 2.1).

    vioplot::vioplot(errorMeasures[,,"rRMSE"])

Figure 2.1: Distribution of rRMSE on the original scale.

The distributions in Figure 2.1 look similar, and it is hard to tell which of the models performs better. Besides, they do not look symmetric, so we will take logarithms to see if this fixes the issue with the skewness (Figure 2.2).

    vioplot::vioplot(log(errorMeasures[,,"rRMSE"]))

Figure 2.2: Distribution of rRMSE on the log scale.

Figure 2.2 demonstrates that the distribution in logarithms is skewed, so the geometric mean would not be suitable in this case and might provide misleading information. So, we calculate the mean and median rRMSE to check the overall performance of the two models:

    # Calculate mean rRMSE
    apply(errorMeasures[,,"rRMSE"],2,mean)
    ##       ETS       CES
    ## 0.8163452 0.8135303
    # Calculate median rRMSE
    apply(errorMeasures[,,"rRMSE"],2,median)
    ##       ETS       CES
    ## 0.8796325 0.8725286

Based on the values above, we cannot make any solid conclusion about the performance of the two models: in terms of both mean and median rRMSE, CES is doing slightly better, but the difference between the two models is not substantial so we can probably choose the one that is easier to work with.

### References

• Davydenko, A., Fildes, R., 2013. Measuring Forecasting Accuracy: The Case Of Judgmental Adjustments To SKU-Level Demand Forecasts. International Journal of Forecasting. 29, 510–522. https://doi.org/10.1016/j.ijforecast.2012.09.002
• Hyndman, R.J., Koehler, A.B., 2006. Another look at measures of forecast accuracy. International Journal of Forecasting. 22, 679–688. https://doi.org/10.1016/j.ijforecast.2006.03.001
• Kolassa, S., 2016. Evaluating predictive count data distributions in retail sales forecasting. International Journal of Forecasting. 32, 788–803. https://doi.org/10.1016/j.ijforecast.2015.12.004
• Makridakis, S., 1993. Accuracy measures: theoretical and practical concerns. International Journal of Forecasting. 9, 527–529. https://doi.org/10.1016/0169-2070(93)90079-3
• Makridakis, S., Spiliotis, E., Assimakopoulos, V., 2020. The M5 Accuracy Competition: Results, Findings and Conclusions. https://www.researchgate.net/publication/344487258 Working paper
• Petropoulos, F., Kourentzes, N., 2015. Forecast combinations for intermittent demand. Journal of the Operational Research Society. 66, 914–924. https://doi.org/10.1057/jors.2014.62
• Stock, J.H., Watson, M.W., 2004. Combination forecasts of output growth in a seven-country data set. Journal of Forecasting. 23, 405–430. https://doi.org/10.1002/for.928
• Svetunkov, I., 2019. Are you sure you’re precise? Measuring accuracy of point forecasts. https://forecasting.svetunkov.ru/en/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/ (version: 2019-08-25)
• Svetunkov, I., 2017. Naughty APEs and the quest for the holy grail. https://forecasting.svetunkov.ru/en/2017/07/29/naughty-apes-and-the-quest-for-the-holy-grail/ (version: 2017-07-29)
• Svetunkov, I., Kourentzes, N., 2015. Complex Exponential Smoothing. Department of Management Science, Lancaster University. https://ideas.repec.org/p/pra/mprapa/69394.html
• Tofallis, C., 2015. A better measure of relative prediction accuracy for model selection and model estimation. The Journal of the Operational Research Society. 66, 1352–1362. https://doi.org/10.2307/24505756
• Wallström, P., Segerstedt, A., 2010. Evaluation of forecasting error measurements and techniques for intermittent demand. International Journal of Production Economics. 128, 625–636. https://doi.org/10.1016/j.ijpe.2010.07.013