3.1 Measuring accuracy of point forecasts

We start with a setting in which we are interested in point forecasts only. In this case we typically split the available data into train and test sets, fit the models under consideration to the former and produce forecasts for the latter, which the models never see. This is called the “fixed origin” approach: we fix the point in time from which to produce forecasts, produce them, calculate some error measures and compare the models.
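The fixed origin procedure can be sketched in a few lines of Python. This is only an illustration: the data are made up, and the function names `fixed_origin_split` and `naive_forecast` are hypothetical helpers introduced here, with the naive method standing in for whatever model is being evaluated.

```python
def fixed_origin_split(y, h):
    """Split a series into a training part and a holdout of the last h values."""
    if not 0 < h < len(y):
        raise ValueError("h must be between 1 and len(y) - 1")
    return y[:-h], y[-h:]


def naive_forecast(train, h):
    """Stand-in model (an assumption): repeat the last in-sample value h times."""
    return [train[-1]] * h


# Made-up series, used only for illustration
y = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]

h = 3
train, test = fixed_origin_split(y, h)

# The model only ever sees the training part
forecast = naive_forecast(train, h)

# Forecast errors on the holdout, one per step ahead
errors = [actual - predicted for actual, predicted in zip(test, forecast)]
```

The important detail is that `test` is used only after the forecasts have been produced, so the holdout plays the role of genuinely unseen data.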

There are different error measures that can be used in this case. Which measure ought to be used depends on the specific need. Here we briefly discuss the most important measures and refer to (Davydenko and Fildes 2013; Svetunkov 2017, 2019) for the gory details.

The most important error measures are Root Mean Squared Error (RMSE): \[\begin{equation} \mathrm{RMSE} = \sqrt{\frac{1}{h} \sum_{j=1}^h \left( y_{t+j} - \hat{y}_{t+j} \right)^2 }, \tag{3.1} \end{equation}\] and Mean Absolute Error (MAE): \[\begin{equation} \mathrm{MAE} = \frac{1}{h} \sum_{j=1}^h \left| y_{t+j} - \hat{y}_{t+j} \right| , \tag{3.2} \end{equation}\] where \(y_{t+j}\) is the actual value \(j\) steps ahead from the holdout, \(\hat{y}_{t+j}\) is the \(j\) steps ahead point forecast (conditional expectation of the model) and \(h\) is the forecast horizon. As you see, these error measures aggregate the performance of competing forecasting methods across the forecasting horizon, averaging out the specific performances on each \(j\). If this information needs to be retained, then the summation can be dropped to obtain just “SE” and “AE”.
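A minimal sketch of (3.1) and (3.2) in Python, using made-up holdout values and forecasts; the function names `rmse` and `mae` are chosen here for illustration. The per-horizon squared and absolute errors are also kept, which corresponds to dropping the summation.

```python
import math


def rmse(actuals, forecasts):
    """Root Mean Squared Error over the forecast horizon, as in (3.1)."""
    squared = [(a - f) ** 2 for a, f in zip(actuals, forecasts)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actuals, forecasts):
    """Mean Absolute Error over the forecast horizon, as in (3.2)."""
    absolute = [abs(a - f) for a, f in zip(actuals, forecasts)]
    return sum(absolute) / len(absolute)


# Made-up holdout and point forecasts for h = 3
holdout = [148, 136, 119]
forecast = [148, 148, 148]

# Per-horizon "SE" and "AE": one value for each step ahead j
squared_errors = [(a - f) ** 2 for a, f in zip(holdout, forecast)]
absolute_errors = [abs(a - f) for a, f in zip(holdout, forecast)]

rmse_value = rmse(holdout, forecast)
mae_value = mae(holdout, forecast)
```

In this example the largest error occurs at the third step, which is visible in `squared_errors` and `absolute_errors` but hidden in the two aggregated values.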

It is well-known (see, for example, Kolassa 2016) that RMSE is minimised by the mean value of a distribution, and MAE is minimised by the median. So, when selecting between the two, you should consider this property. This means, for example, that MAE-based error measures should not be used for the evaluation of models on intermittent demand: when more than half of the observations are zeroes, the median is zero, so a forecast of zero minimises MAE even though it is useless in practice.
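The effect is easy to see on a small made-up intermittent series. The sketch below compares two constant forecasts, the mean and the median of the data, on the same observations; it is an illustration of the property rather than a proper out-of-sample evaluation.

```python
import math
import statistics

# Made-up intermittent demand: most periods have no sales
demand = [0, 0, 3, 0, 0, 0, 5, 0, 0, 2]

mean_forecast = statistics.mean(demand)
median_forecast = statistics.median(demand)


def mae_of_constant(values, constant):
    """MAE when the same constant is used as the forecast for every value."""
    return sum(abs(v - constant) for v in values) / len(values)


def rmse_of_constant(values, constant):
    """RMSE when the same constant is used as the forecast for every value."""
    return math.sqrt(sum((v - constant) ** 2 for v in values) / len(values))


mae_mean = mae_of_constant(demand, mean_forecast)
mae_median = mae_of_constant(demand, median_forecast)
rmse_mean = rmse_of_constant(demand, mean_forecast)
rmse_median = rmse_of_constant(demand, median_forecast)
```

Here the median is zero, so MAE prefers a forecast of no demand at all, while RMSE prefers the mean.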

The main advantage of these error measures is that they are very simple and have a clear interpretation: they are the “average” distances between the point forecasts and the observed values. They are perfect if you work with only one time series. However, they are not suitable when you have several time series and want to see the performance of methods across them. This is mainly because they are scale dependent and are expressed in specific units: if you measure sales of apples in pounds, then MAE and RMSE will show the error in pounds. And, as we know, you should not add up pounds of apples with pounds of oranges - the result might not make sense.

In order to tackle this issue, different error scaling techniques have been proposed, resulting in a zoo of error measures:

  1. MAPE - Mean Absolute Percentage Error: \[\begin{equation} \mathrm{MAPE} = \frac{1}{h} \sum_{j=1}^h \frac{|y_{t+j} - \hat{y}_{t+j}|}{|y_{t+j}|}; \tag{3.3} \end{equation}\]
  2. MASE - Mean Absolute Scaled Error (Hyndman and Koehler 2006): \[\begin{equation} \mathrm{MASE} = \frac{1}{h} \sum_{j=1}^h \frac{|y_{t+j} - \hat{y}_{t+j}|}{\bar{\Delta}_y}, \tag{3.4} \end{equation}\] where \(\bar{\Delta}_y = \frac{1}{t-1}\sum_{j=2}^t |\Delta y_{j}|\) is the mean absolute value of the first differences \(\Delta y_{j}=y_j-y_{j-1}\) of the in-sample data;
  3. rMAE - Relative Mean Absolute Error (Davydenko and Fildes 2013): \[\begin{equation} \mathrm{rMAE} = \frac{\mathrm{MAE}_a}{\mathrm{MAE}_b}, \tag{3.5} \end{equation}\] where \(\mathrm{MAE}_a\) is the mean absolute error of the model under consideration and \(\mathrm{MAE}_b\) is the MAE of the benchmark model;
  4. sMAE - scaled Mean Absolute Error (Petropoulos and Kourentzes 2015): \[\begin{equation} \mathrm{sMAE} = \frac{\mathrm{MAE}}{\bar{y}}, \tag{3.6} \end{equation}\] where \(\bar{y}\) is the mean of the in-sample data.
  5. and others.
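The four measures listed above can be sketched in plain Python as follows. The function names and the data are made up for illustration, and MAPE is computed on absolute errors, as its name implies.

```python
def mae(actuals, forecasts):
    """Mean Absolute Error over the holdout."""
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)


def mape(actuals, forecasts):
    """Mean Absolute Percentage Error; fails if any actual value is zero."""
    ratios = [abs(a - f) / abs(a) for a, f in zip(actuals, forecasts)]
    return sum(ratios) / len(ratios)


def mase(actuals, forecasts, insample):
    """MAE scaled by the mean absolute first difference of the in-sample data."""
    diffs = [abs(insample[i] - insample[i - 1]) for i in range(1, len(insample))]
    return mae(actuals, forecasts) / (sum(diffs) / len(diffs))


def rmae(actuals, forecasts, benchmark_forecasts):
    """MAE of the model relative to the MAE of a benchmark."""
    return mae(actuals, forecasts) / mae(actuals, benchmark_forecasts)


def smae(actuals, forecasts, insample):
    """MAE scaled by the mean of the in-sample data."""
    return mae(actuals, forecasts) / (sum(insample) / len(insample))


# Made-up data: in-sample part, holdout, model forecasts, naive benchmark
insample = [112, 118, 132, 129, 121, 135, 148]
holdout = [148, 136, 119]
model_forecast = [140, 140, 140]
benchmark_forecast = [insample[-1]] * len(holdout)

results = {
    "MAPE": mape(holdout, model_forecast),
    "MASE": mase(holdout, model_forecast, insample),
    "rMAE": rmae(holdout, model_forecast, benchmark_forecast),
    "sMAE": smae(holdout, model_forecast, insample),
}
```

All four values are unit free, so they can be compared or aggregated across series measured in different units.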

There is no “best” error measure. All have advantages and disadvantages. For example:

  1. MAPE is scale sensitive (the same absolute errors produce a much lower MAPE when the actual values are in the thousands of units than when they are in the hundreds) and cannot be calculated on data with zeroes, because the actual values appear in the denominator. However, it has a simple interpretation as it shows the percentage error (as the name suggests);
  2. MASE avoids the disadvantages of MAPE, but does so at the cost of a simple interpretation due to the division by the first differences of the data (some interpret this as an in-sample one-step-ahead naive forecast);
  3. rMAE avoids the disadvantages of MAPE, has a simple interpretation (it shows by how much one model is better than the other), but fails when either \(\mathrm{MAE}_a\) or \(\mathrm{MAE}_b\) for a specific time series is equal to zero;
  4. sMAE avoids the disadvantages of MAPE and has an interpretation close to it, but breaks down when the data has a trend.
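The two problems of MAPE mentioned in the list can be reproduced with a short sketch. The data are made up, and `mape` is a hypothetical helper defined here for the purpose.

```python
def mape(actuals, forecasts):
    """Mean Absolute Percentage Error; divides by the actual values."""
    ratios = [abs(a - f) / abs(a) for a, f in zip(actuals, forecasts)]
    return sum(ratios) / len(ratios)


# The same forecast errors applied to two series on different levels
errors = [10, -5, 8]
low_level = [100, 110, 120]
high_level = [1000, 1010, 1020]

mape_low = mape(low_level, [a - e for a, e in zip(low_level, errors)])
mape_high = mape(high_level, [a - e for a, e in zip(high_level, errors)])

# A single zero in the holdout makes the measure impossible to calculate
try:
    mape([0, 3, 0], [1, 1, 1])
    zero_case_failed = False
except ZeroDivisionError:
    zero_case_failed = True
```

Although the absolute errors are identical, `mape_high` is roughly ten times smaller than `mape_low`, and the series containing a zero cannot be evaluated at all.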

When comparing different forecasting methods, it can make sense to calculate several of the error measures. The choice of metric might depend on the specific needs of the forecaster. Here are a few rules of thumb, however. If you want a robust measure that works consistently, but you do not care about the interpretation, then go with MASE. If you want an interpretation, then go with either rMAE or sMAE. You should typically avoid MAPE and other percentage error measures because they are highly influenced by the actual values you have in the holdout. Furthermore, similar to the measures above, one can propose RMSE-based scaled and relative error measures, which would measure the performance of methods in terms of means rather than medians.
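As a sketch of an RMSE-based counterpart, and by direct analogy with (3.5), a relative measure could be written as: \[ \mathrm{rRMSE} = \frac{\mathrm{RMSE}_a}{\mathrm{RMSE}_b}, \] where \(\mathrm{RMSE}_a\) is the RMSE (3.1) of the model under consideration and \(\mathrm{RMSE}_b\) is the RMSE of the benchmark model. The label “rRMSE” is used here for illustration only. Like rMAE, such a measure is not defined when the denominator is equal to zero.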

Finally, when aggregating the performance of forecasting methods across several time series, sometimes it makes sense to look at the distribution of errors - this way you will know which of the methods fails seriously and which does a consistently good job.
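A minimal sketch of such a comparison is given below. The per-series rMAE values are invented purely for illustration, and `summarise` is a hypothetical helper that reports the quartiles and extremes of the distribution.

```python
import statistics

# Invented per-series rMAE values of two methods against a common benchmark
rmae_method_a = [0.82, 0.95, 0.78, 1.01, 0.88, 0.91, 0.85, 0.97]
rmae_method_b = [0.60, 0.70, 0.65, 2.40, 0.62, 0.68, 3.10, 0.66]


def summarise(values):
    """Five-number style summary of an error measure across time series."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(values),
    }


summary_a = summarise(rmae_method_a)
summary_b = summarise(rmae_method_b)

for name, summary in (("Method A", summary_a), ("Method B", summary_b)):
    print(name, {key: round(value, 3) for key, value in summary.items()})
```

In this invented example method B has the lower median rMAE, but it fails badly on two of the series, while method A is less accurate on a typical series and never much worse than the benchmark. A single aggregate number would hide this difference.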

References

Davydenko, Andrey, and Robert Fildes. 2013. “Measuring Forecasting Accuracy: The Case of Judgmental Adjustments to SKU-Level Demand Forecasts.” International Journal of Forecasting 29 (3): 510–22. https://doi.org/10.1016/j.ijforecast.2012.09.002.

Hyndman, Rob J., and Anne B. Koehler. 2006. “Another Look at Measures of Forecast Accuracy.” International Journal of Forecasting 22 (4): 679–88. https://doi.org/10.1016/j.ijforecast.2006.03.001.

Kolassa, Stephan. 2016. “Evaluating Predictive Count Data Distributions in Retail Sales Forecasting.” International Journal of Forecasting 32 (3): 788–803. https://doi.org/10.1016/j.ijforecast.2015.12.004.

Petropoulos, Fotios, and Nikolaos Kourentzes. 2015. “Forecast Combinations for Intermittent Demand.” Journal of the Operational Research Society 66 (6): 914–24. https://doi.org/10.1057/jors.2014.62.

Svetunkov, Ivan. 2017. “Naughty Apes and the Quest for the Holy Grail.” Modern Forecasting. https://forecasting.svetunkov.ru/en/2017/07/29/naughty-apes-and-the-quest-for-the-holy-grail/.

Svetunkov, Ivan. 2019. “Are You Sure You’re Precise? Measuring Accuracy of Point Forecasts.” Modern Forecasting. https://forecasting.svetunkov.ru/en/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/.