2.1 Measuring accuracy of point forecasts

We start with a setting in which we are interested in point forecasts only. In this case we typically split the available data into train and test sets, fit the models under consideration to the former and produce forecasts for the latter, without showing that part to the models. This is called the “fixed origin” approach: we fix the point in time from which to produce forecasts, produce them, calculate some error measures and compare the models.

There are different error measures that can be used in this case. Which measure ought to be used depends on the specific need. Here we briefly discuss the most important measures and refer to (Davydenko and Fildes, 2013; Svetunkov, 2019, 2017) for the gory details.

The most important error measures are Root Mean Squared Error (RMSE): \[\begin{equation} \mathrm{RMSE} = \sqrt{\frac{1}{h} \sum_{j=1}^h \left( y_{t+j} - \hat{y}_{t+j} \right)^2 }, \tag{2.1} \end{equation}\] and Mean Absolute Error (MAE): \[\begin{equation} \mathrm{MAE} = \frac{1}{h} \sum_{j=1}^h \left| y_{t+j} - \hat{y}_{t+j} \right| , \tag{2.2} \end{equation}\] where \(y_{t+j}\) is the actual value \(j\) steps ahead from the holdout, \(\hat{y}_{t+j}\) is the \(j\) steps ahead point forecast (conditional expectation of the model) and \(h\) is the forecast horizon. As you see, these error measures aggregate the performance of competing forecasting methods across the forecasting horizon, averaging out the specific performances on each \(j\). If this information needs to be retained, then the summation can be dropped to obtain just “SE” and “AE.”
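As a minimal sketch of (2.1) and (2.2) in base R, the following uses made-up holdout values and forecasts; the names `holdout` and `forecast` are illustrative only and are not tied to any package.

```r
# Hypothetical holdout values and point forecasts for h = 4
holdout  <- c(102, 98, 105, 95)
forecast <- c(100, 100, 100, 100)

errors <- holdout - forecast     # 2, -2, 5, -5

# Aggregated over the horizon, as in (2.1) and (2.2)
rmse <- sqrt(mean(errors^2))     # sqrt((4 + 4 + 25 + 25) / 4) = sqrt(14.5)
mae  <- mean(abs(errors))        # (2 + 2 + 5 + 5) / 4 = 3.5

# Without the averaging, the per-horizon "SE" and "AE" are retained
se <- errors^2                   # 4, 4, 25, 25
ae <- abs(errors)                # 2, 2, 5, 5
```

Keeping `se` and `ae` as vectors makes it possible to see on which steps ahead a method performs poorly, which the averaged values hide.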

It is well known (see, for example, Kolassa, 2016) that RMSE is minimised by the mean of a distribution, and MAE is minimised by the median. So, when selecting between the two, you should consider this property. This means, for example, that MAE-based error measures should not be used for the evaluation of models on intermittent demand: when more than 50% of the sample is zeroes, the median is zero, so a zero forecast minimises MAE.
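The property can be checked numerically. The sketch below uses a made-up intermittent holdout (6 zeroes out of 10 values) and searches a grid of constant forecasts; the data and the grid are assumptions made purely for illustration.

```r
# Hypothetical intermittent demand: 6 of the 10 holdout values are zero
holdout <- c(0, 0, 3, 0, 5, 0, 0, 2, 0, 4)

mae  <- function(f) mean(abs(holdout - f))
rmse <- function(f) sqrt(mean((holdout - f)^2))

# Evaluate constant forecasts on a grid and pick the minimisers
grid <- seq(0, 5, by = 0.1)
best_mae  <- grid[which.min(sapply(grid, mae))]   # 0, the median
best_rmse <- grid[which.min(sapply(grid, rmse))]  # 1.4, the mean

median(holdout)  # 0
mean(holdout)    # 1.4
```

The zero forecast wins on MAE even though it is useless for inventory decisions, which is why RMSE-based measures are the safer choice in this setting.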

Another error measure that has been used in some cases is Root Mean Squared Logarithmic Error (RMSLE): \[\begin{equation} \mathrm{RMSLE} = \sqrt{\frac{1}{h} \sum_{j=1}^h \left( \log y_{t+j} - \log \hat{y}_{t+j} \right)^2 } . \tag{2.3} \end{equation}\] It assumes that the actual values and the forecasts are positive, and it is minimised by the geometric mean.
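A small base R sketch of RMSLE, with the square root that the name implies, is given below. The values are made up, and the helper `rmsle()` is a hypothetical function written for this example rather than one taken from a package.

```r
# Hypothetical positive holdout values
holdout <- c(10, 100, 1000)

rmsle <- function(actual, forecast) {
  sqrt(mean((log(actual) - log(forecast))^2))
}

# Geometric and arithmetic means of the holdout
geo_mean   <- exp(mean(log(holdout)))   # 100
arith_mean <- mean(holdout)             # 370

# A constant forecast at the geometric mean gives the lower RMSLE
rmsle_geo   <- rmsle(holdout, rep(geo_mean, 3))
rmsle_arith <- rmsle(holdout, rep(arith_mean, 3))
rmsle_geo < rmsle_arith                 # TRUE
```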

The main advantage of these error measures is that they are very simple and have a clear interpretation: they are the “average” distances between the point forecasts and the observed values. They are perfect if you work with only one time series. However, they are not suitable when you have several time series and want to see the performance of methods across them. This is mainly because they are scale dependent and carry specific units: if you measure sales of apples in pounds, then MAE and RMSE will show the error in pounds. And, as we know, you should not add up pounds of apples with pounds of oranges - the result might not make sense.

In order to tackle this issue, different error scaling techniques have been proposed, resulting in a zoo of error measures:

  1. MAPE - Mean Absolute Percentage Error: \[\begin{equation} \mathrm{MAPE} = \frac{1}{h} \sum_{j=1}^h \frac{|y_{t+j} - \hat{y}_{t+j}|}{y_{t+j}}, \tag{2.4} \end{equation}\]
  2. MASE - Mean Absolute Scaled Error (Hyndman and Koehler, 2006): \[\begin{equation} \mathrm{MASE} = \frac{1}{h} \sum_{j=1}^h \frac{|y_{t+j} - \hat{y}_{t+j}|}{\bar{\Delta}_y}, \tag{2.5} \end{equation}\] where \(\bar{\Delta}_y = \frac{1}{t-1}\sum_{j=2}^t |\Delta y_{j}|\) is the mean absolute value of the first differences \(\Delta y_{j}=y_j-y_{j-1}\) of the in-sample data;
  3. rMAE - Relative Mean Absolute Error (Davydenko and Fildes, 2013): \[\begin{equation} \mathrm{rMAE} = \frac{\mathrm{MAE}_a}{\mathrm{MAE}_b}, \tag{2.6} \end{equation}\] where \(\mathrm{MAE}_a\) is the mean absolute error of the model under consideration and \(\mathrm{MAE}_b\) is the MAE of the benchmark model;
  4. sMAE - scaled Mean Absolute Error (Petropoulos and Kourentzes, 2015): \[\begin{equation} \mathrm{sMAE} = \frac{\mathrm{MAE}}{\bar{y}}, \tag{2.7} \end{equation}\] where \(\bar{y}\) is the mean of the in-sample data.
  5. and others.
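The measures in the list above can be sketched in base R as follows. The data are made up, MAPE is computed with absolute errors as its name implies, and Naïve (the last in-sample value repeated) is assumed as the benchmark for rMAE.

```r
# Hypothetical in-sample data, holdout and forecasts
insample   <- c(10, 12, 11, 13, 12)
holdout    <- c(14, 10)
forecast_a <- c(13, 11)                           # model under consideration
forecast_b <- rep(tail(insample, 1), 2)           # Naive benchmark: 12, 12

mae_a <- mean(abs(holdout - forecast_a))          # 1
mae_b <- mean(abs(holdout - forecast_b))          # 2

# MAPE: absolute errors divided by the actual values
mape <- mean(abs(holdout - forecast_a) / holdout) # (1/14 + 1/10) / 2

# MASE: MAE divided by the mean absolute first difference of in-sample data
mase <- mae_a / mean(abs(diff(insample)))         # 1 / 1.5

# rMAE: MAE of the model relative to MAE of the benchmark
rmae <- mae_a / mae_b                             # 0.5

# sMAE: MAE divided by the in-sample mean
smae <- mae_a / mean(insample)                    # 1 / 11.6
```

An rMAE of 0.5 reads directly as "the model's MAE is half that of the benchmark", which is the simple interpretation that MASE lacks.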

There is no “best” error measure. All have advantages and disadvantages. For example:

  1. MAPE is scale sensitive (if the actual values are measured in thousands of units, the resulting error will be much lower than in the case of hundreds of units) and cannot be estimated on data with zeroes. However, it has a simple interpretation as it shows the percentage error (as the name suggests);
  2. MASE avoids the disadvantages of MAPE, but does so at the cost of a simple interpretation due to the division by the first differences of the data (some interpret this as an in-sample one-step-ahead Naïve forecast);
  3. rMAE avoids the disadvantages of MAPE and has a simple interpretation (it shows by how much one model is better than the other), but fails when either \(\mathrm{MAE}_a\) or \(\mathrm{MAE}_b\) for a specific time series is equal to zero;
  4. sMAE avoids the disadvantages of MAPE and has an interpretation close to it, but breaks down when the data has a trend.
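Some of these failure modes are easy to reproduce. The sketch below uses made-up numbers to show MAPE on data with a zero, rMAE with a perfect benchmark, and the in-sample mean of a trended series; it only illustrates the points above and is not an exhaustive test.

```r
# MAPE cannot be estimated when the holdout contains a zero
holdout_zero  <- c(0, 4, 6)
forecast_zero <- c(1, 5, 5)
mape <- mean(abs(holdout_zero - forecast_zero) / holdout_zero)  # Inf

# rMAE fails when the benchmark happens to be perfect (MAE_b = 0)
holdout_flat <- c(5, 5, 5)
mae_a <- mean(abs(holdout_flat - c(5, 6, 5)))   # 1/3
mae_b <- mean(abs(holdout_flat - c(5, 5, 5)))   # 0
rmae  <- mae_a / mae_b                          # Inf

# sMAE on trended data: the in-sample mean is far from the holdout level,
# so the denominator no longer reflects the level being forecast
insample_trend <- 1:100
holdout_trend  <- 101:110
mean(insample_trend)   # 50.5
mean(holdout_trend)    # 105.5
```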

When comparing different forecasting methods, it can make sense to calculate several error measures. The choice of metric depends on the specific needs of the forecaster, but here are a few rules of thumb. If you want a robust measure that works consistently, and you do not care about the interpretation, then go with MASE. If you want an interpretation, then go with either rMAE or sMAE (just keep in mind that if you decide to use rMAE or any other relative measure, you might get attacked by its creator, Andrey Davydenko, who might blame you for stealing his creation, even if you put a reference to his work). You should typically avoid MAPE and other percentage error measures, because they are highly influenced by the actual values in the holdout.

Furthermore, similar to the measures above, RMSE-based scaled and relative error metrics have been proposed, which measure the performance of methods in terms of means rather than medians. Here is a brief list of some of them:

  1. RMSSE - Root Mean Squared Scaled Error: \[\begin{equation} \mathrm{RMSSE} = \sqrt{\frac{1}{h} \sum_{j=1}^h \frac{(y_{t+j} - \hat{y}_{t+j})^2}{\bar{\Delta}_y^2}} ; \tag{2.8} \end{equation}\]
  2. rRMSE - Relative Root Mean Squared Error: \[\begin{equation} \mathrm{rRMSE} = \frac{\mathrm{RMSE}_a}{\mathrm{RMSE}_b} ; \tag{2.9} \end{equation}\]
  3. sRMSE - scaled Root Mean Squared Error: \[\begin{equation} \mathrm{sRMSE} = \frac{\mathrm{RMSE}}{\bar{y}} . \tag{2.10} \end{equation}\]
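The RMSE-based counterparts follow the same pattern. Below is a base R sketch with made-up data, again assuming Naïve as the benchmark for the relative measure.

```r
# Hypothetical in-sample data, holdout and forecasts
insample   <- c(10, 12, 11, 13, 12)
holdout    <- c(14, 10)
forecast_a <- c(13, 12)                          # model under consideration
forecast_b <- rep(tail(insample, 1), 2)          # Naive benchmark: 12, 12

rmse_a <- sqrt(mean((holdout - forecast_a)^2))   # sqrt((1 + 4) / 2) = sqrt(2.5)
rmse_b <- sqrt(mean((holdout - forecast_b)^2))   # sqrt((4 + 4) / 2) = 2

# RMSSE as in (2.8): squared errors scaled by the squared mean absolute difference
delta_bar <- mean(abs(diff(insample)))           # 1.5
rmsse <- sqrt(mean((holdout - forecast_a)^2 / delta_bar^2))

# rRMSE as in (2.9) and sRMSE as in (2.10)
rrmse <- rmse_a / rmse_b
srmse <- rmse_a / mean(insample)
```

Because the scaling term in (2.8) is constant over the horizon, RMSSE here equals `rmse_a / delta_bar`.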

Similarly, RMSSLE, rRMSLE and sRMSLE can be proposed, using the same principles as in (2.8), (2.9) and (2.10).

Finally, when aggregating the performance of forecasting methods across several time series, sometimes it makes sense to look at the distribution of errors - this way you will know which of the methods fails seriously and which does a consistently good job.
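One way to look at the distribution of errors is sketched below in base R. The series are simulated random walks, the two competing methods (Naïve and the in-sample mean) are chosen only for illustration, and `mase()` is a hypothetical helper rather than a package function.

```r
set.seed(42)   # simulated data, seed fixed only for reproducibility
n_series <- 50
n_obs    <- 60
h        <- 10

mase <- function(holdout, forecast, insample) {
  mean(abs(holdout - forecast)) / mean(abs(diff(insample)))
}

# One row per series, one column per method
results <- t(sapply(seq_len(n_series), function(i) {
  y        <- cumsum(rnorm(n_obs))               # random walk
  insample <- y[1:(n_obs - h)]
  holdout  <- y[(n_obs - h + 1):n_obs]
  c(naive = mase(holdout, rep(tail(insample, 1), h), insample),
    mean  = mase(holdout, rep(mean(insample), h), insample))
}))

# Quantiles of MASE across series for each method
error_quantiles <- apply(results, 2, quantile)
error_quantiles

# A boxplot gives the same information visually:
# boxplot(results, ylab = "MASE")
```

Comparing the upper quantiles, rather than only the means, shows which method occasionally fails badly.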

2.1.1 Demonstration in R

In R, there is a variety of functions that calculate the error measures discussed above, including the accuracy() function from the forecast package and measures() from greybox. greybox also has functions for individual measures, such as MAE(), MSE(), MASE() etc. Here is an example of how the measures can be calculated based on a couple of forecasting functions from the smooth package for R:

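library(greybox)  # measures(), actuals()
library(smooth)   # es(), ces()

y <- rnorm(100,100,10)
model1 <- es(y,h=10,holdout=TRUE)
model2 <- ces(y,h=10,holdout=TRUE)

# The third argument (assumed here to be the in-sample actuals) is needed
# for the scaled and relative measures
rbind(measures(model1$holdout, model1$forecast, actuals(model1)),
      measures(model1$holdout, model2$forecast, actuals(model1)))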
##             ME       MAE      MSE         MPE      MAPE        sCE       sMAE
## [1,] -7.412473  9.654007 110.0537 -0.08702359 0.1082522 -0.7440434 0.09690424
## [2,] -8.205575 10.129741 122.5774 -0.09568576 0.1138989 -0.8236528 0.10167953
##            sMSE      MASE     RMSSE      rMAE     rRMSE      rAME  asymmetry
## [1,] 0.01108856 0.8912543 0.7704354 0.4968460 0.5043486 0.3814848 -0.7691814
## [2,] 0.01235039 0.9351739 0.8130906 0.5213298 0.5322719 0.4223021 -0.7943697
##          sPIS
## [1,] 3.624725
## [2,] 4.050594

The default benchmark method for the relative measures above is Naïve.


• Davydenko, A., Fildes, R., 2013. Measuring Forecasting Accuracy: The Case Of Judgmental Adjustments To SKU-Level Demand Forecasts. International Journal of Forecasting. 29, 510–522.
• Hyndman, R.J., Koehler, A.B., 2006. Another look at measures of forecast accuracy. International Journal of Forecasting. 22, 679–688.
• Kolassa, S., 2016. Evaluating predictive count data distributions in retail sales forecasting. International Journal of Forecasting. 32, 788–803.
• Petropoulos, F., Kourentzes, N., 2015. Forecast combinations for intermittent demand. Journal of the Operational Research Society. 66, 914–924.
• Svetunkov, I., 2019. Are you sure you’re precise? Measuring accuracy of point forecasts. (version: 2019-08-25)
• Svetunkov, I., 2017. Naughty APEs and the quest for the holy grail. (version: 2017-07-29)