
2.5 Statistical comparison of forecasts

After applying several competing models to the data and obtaining a distribution of error terms, we might find that some of them performed very similarly. In this case, the question is whether the difference between them is significant and which of the forecasting models we should select. Consider the following artificial example, where we have four competing models and measure their performance in terms of RMSSE:

smallCompetition <- matrix(NA, 100, 4,
                           dimnames=list(NULL, paste0("Method",c(1:4))))
smallCompetition[,1] <- rnorm(100,1,0.35)
smallCompetition[,2] <- rnorm(100,1.2,0.2)
smallCompetition[,3] <- runif(100,0.5,1.5)
smallCompetition[,4] <- rlnorm(100,0,0.3)

We can check the mean and median error measures in this example to see how the methods perform overall:

overalResults <- matrix(c(colMeans(smallCompetition), 
                          apply(smallCompetition, 2, median)),
                        4, 2, dimnames=list(colnames(smallCompetition),
                                            c("Mean","Median")))
# Print the table rounded to five digits
round(overalResults, 5)
##            Mean  Median
## Method1 1.04148 0.97523
## Method2 1.23750 1.23572
## Method3 1.00213 1.02843
## Method4 0.97789 0.92591

In this artificial example, it looks like the most accurate method in terms of mean and median RMSSE is Method 4, and the least accurate one is Method 2. However, the difference in accuracy between Methods 1, 3 and 4 does not look substantial. So, should we conclude that Method 4 is the best? Let’s first look at the distribution of errors using the vioplot() function from the vioplot package (Figure 2.10).

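# Draw violin plots of the errors, one per method
vioplot::vioplot(smallCompetition)
# Add the mean error of each method as a red point
points(colMeans(smallCompetition), col="red", pch=16)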

Figure 2.10: Violin plots of RMSSE for the artificial example

What the violin plots in Figure 2.10 show is that the distribution of errors for Method 2 is shifted higher than the distributions of the other methods, but it also looks like Method 2 works more consistently, meaning that the variability of its errors is lower (the size of the box on the graph). It is difficult to tell whether Method 1 is better than Methods 3 and 4 or not: their boxes intersect and look roughly similar, with Method 4 having a slightly shorter box and Method 3 having its box positioned slightly lower.

These are just the basics of descriptive statistics, which allow us to conclude that, in general, Methods 1, 3 and 4 do a better job than Method 2. This is also reflected in the mean and median error measures discussed above. So, what should we conclude?

We should not make hasty decisions, and we should remember that we are dealing with a sample of data (100 time series), so the performance of the methods will inevitably change if we try them on different data sets. If we had a population of all the time series in the world, then we could run our methods and make a more solid conclusion about their performance. But here we deal with a sample. So it makes sense to check whether the difference in performance of the methods is significant. How should we do that?

First, we can compare the means of the distributions of errors using a parametric statistical test. We can try the F-test (Wikipedia, 2021a), which will tell us whether the mean performance of the methods is similar or not. Unfortunately, this will not tell us how the methods compare. The t-test (Wikipedia, 2021b) could be used instead for pairwise comparisons. One could also use a regression with dummy variables for methods, which will give us parameters and their confidence intervals (based on t-statistics), telling us how the means of the methods compare. However, the F-test, the t-test and the t-statistics from regression rely on strong assumptions about the distribution of the means of error measures (normality). If we had a large sample (e.g. a thousand series) and a well-behaved distribution, then we could try them, hoping that the central limit theorem would work, and might get something relatively meaningful. However, with 100 observations this could still be an issue, especially given that the distribution of error measures is typically asymmetric (this means that the estimate of the mean might be biased, which leads to many issues).
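As a rough illustration of the parametric route, here is a minimal sketch in base R. It regenerates data from the same distributions as in the artificial example. The seed and the names `errors`, `errorsLong`, `errorsLM`, `errorsANOVA` and `errorsTTests` are our own assumptions introduced for this sketch, as is the choice of the Holm correction for multiple comparisons. Note that the regression here treats all observations as independent and ignores the fact that the methods were applied to the same series, so it should be read as a simplification rather than a recommended procedure.

```r
# Regenerate errors from the same distributions as in the example
# (the seed is an arbitrary assumption, used only for reproducibility)
set.seed(41)
errors <- cbind(Method1=rnorm(100,1,0.35), Method2=rnorm(100,1.2,0.2),
                Method3=runif(100,0.5,1.5), Method4=rlnorm(100,0,0.3))

# Reshape to the long format: one row per series-method pair
errorsLong <- data.frame(error=as.vector(errors),
                         method=factor(rep(colnames(errors),
                                           each=nrow(errors))))

# Regression with dummy variables: Method1 is the baseline,
# the other coefficients are differences in mean error from it
errorsLM <- lm(error~method, data=errorsLong)
confint(errorsLM)

# F-test: is the mean error the same for all methods?
errorsANOVA <- anova(errorsLM)
errorsANOVA

# Pairwise paired t-tests with the Holm correction
errorsTTests <- pairwise.t.test(errorsLong$error, errorsLong$method,
                                paired=TRUE, p.adjust.method="holm")
errorsTTests
```

The confidence intervals from `confint()` that do not include zero point to methods whose mean error differs from that of Method 1, subject to the normality caveats discussed above.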

Second, we could compare the medians of the distributions of errors. They are robust to outliers, so their estimates should not be too biased in the case of skewed distributions on smaller samples. To get a general understanding of performance (is everything the same, or is there at least one method that performs differently?), we could try the Friedman test (Wikipedia, 2021c), which could be considered a non-parametric alternative to the F-test. This should work in our case, but won’t tell us how specifically the methods compare. We could try the Wilcoxon signed-rank test (Wikipedia, 2021d), which could be considered a non-parametric counterpart of the t-test, but it is only applicable to the comparison of two variables, while we want to compare four.
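Both non-parametric tests are available in base R. The sketch below applies them to data regenerated from the same distributions as in the artificial example. The seed and the names `errors`, `errorsFriedman` and `errorsWilcoxon` are assumptions made for this illustration, and the pair of methods passed to the Wilcoxon test is chosen arbitrarily.

```r
# Regenerate errors from the same distributions as in the example
# (the seed is an arbitrary assumption, used only for reproducibility)
set.seed(41)
errors <- cbind(Method1=rnorm(100,1,0.35), Method2=rnorm(100,1.2,0.2),
                Method3=runif(100,0.5,1.5), Method4=rlnorm(100,0,0.3))

# Friedman test: rows of the matrix are treated as blocks (series),
# columns as groups (methods)
errorsFriedman <- friedman.test(errors)
errorsFriedman

# Wilcoxon signed-rank test for one pair of methods only
errorsWilcoxon <- wilcox.test(errors[,1], errors[,4], paired=TRUE)
errorsWilcoxon
```

A small p-value in the Friedman test would only indicate that at least one method differs from the others, while the Wilcoxon test answers the question for the selected pair and nothing else.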

Luckily, there is the Nemenyi test (Demšar, 2006), which is equivalent to the MCB test (Koning et al., 2005; Kourentzes, 2012). The test ranks the performance of methods for each time series, then takes the mean of those ranks for each method and produces confidence bounds for those means. The means of ranks correspond to medians, so by using this test we compare the medians of errors of different methods. If the confidence bounds for different methods intersect, then we can conclude that the medians are not statistically different. Otherwise, we can see which of the methods has a higher rank and which has a lower one. There are different ways to present the results of the test, and there are several R functions that implement it, including nemenyi() from the tsutils package. However, we will use the function rmcb() from greybox, which has more flexible plotting capabilities, supporting all the default parameters for the plot() method.

smallCompetitionTest <- rmcb(smallCompetition, plottype="none")
plot(smallCompetitionTest, "mcb", main="")
## Regression for Multiple Comparison with the Best
## The significance level is 5%
## The number of observations is 100, the number of methods is 4
## Significance test p-value: 0

Figure 2.11: MCB test results for small competition.

Figure 2.11 shows that Methods 1, 3 and 4 are not statistically different: their intervals intersect, so we cannot really tell the difference between them, even though the mean rank of Method 4 is lower than that of the other methods. Method 2, on the other hand, is significantly worse than the other methods: it has the highest mean rank of all, and its interval does not intersect the intervals of the other methods.
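To make the ranking mechanics behind such a comparison explicit, here is a minimal by-hand sketch of the Nemenyi test using the critical distance from Demšar (2006). It is an illustration under our own assumptions, not necessarily what rmcb() computes internally: the seed and the names `errors`, `errorsRanks`, `meanRanks`, `criticalDistance` and `ranksBounds` are introduced only for this example, and the bounds are drawn as the mean rank plus and minus half of the critical distance, so that two intervals intersect exactly when the mean ranks differ by less than the critical distance.

```r
# Regenerate errors from the same distributions as in the example
# (the seed is an arbitrary assumption, used only for reproducibility)
set.seed(41)
errors <- cbind(Method1=rnorm(100,1,0.35), Method2=rnorm(100,1.2,0.2),
                Method3=runif(100,0.5,1.5), Method4=rlnorm(100,0,0.3))
nSeries <- nrow(errors)
nMethods <- ncol(errors)

# Rank the methods within each series (1 corresponds to the lowest error)
errorsRanks <- t(apply(errors, 1, rank))

# Mean rank of each method across all series
meanRanks <- colMeans(errorsRanks)

# Critical distance for the 5% significance level (Demšar, 2006):
# Studentised range quantile divided by sqrt(2), scaled by the
# standard error of the difference in mean ranks
criticalDistance <- qtukey(0.95, nMethods, Inf) / sqrt(2) *
    sqrt(nMethods*(nMethods+1)/(6*nSeries))

# Bounds around the mean ranks
ranksBounds <- cbind(lower=meanRanks-criticalDistance/2,
                     mean=meanRanks,
                     upper=meanRanks+criticalDistance/2)
ranksBounds
```

With four methods and 100 series the critical distance is roughly 0.47, so two methods would be considered different at the 5% level only if their mean ranks differ by more than that.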

Note that while this is a good way of presenting the results, all the MCB test does is compare mean ranks. It does not tell much about the distribution of errors and neglects the distances between values (i.e. 0.1 is lower than 0.11, so the first method has the lower rank, which is exactly the same result as when comparing 0.1 and 100). This happens because, by doing the test, we move from the numerical scale to the ordinal one (see Section 1.2 of Svetunkov, 2021c). Finally, like any other statistical test, it gains power as the sample size increases. We know that the null hypothesis “variables are equal to each other” is in reality always wrong (see Section 5.3 of Svetunkov, 2021c), so increasing the sample size will at some point lead to the right conclusion: the methods are statistically different. Here is a demonstration of this assertion:

largeCompetition <- 
  matrix(NA, 100000, 4,
         dimnames=list(NULL, paste0("Method",c(1:4))))
# Generate data
largeCompetition[,1] <- rnorm(100000,1,0.35)
largeCompetition[,2] <- rnorm(100000,1.2,0.2)
largeCompetition[,3] <- runif(100000,0.5,1.5)
largeCompetition[,4] <- rlnorm(100000,0,0.3)
# Run the test
largeCompetitionTest <- rmcb(largeCompetition, plottype="none")
plot(largeCompetitionTest, "mcb", main="")

Figure 2.12: MCB test results for large competition.

In the plot in Figure 2.12, Method 4 has become significantly worse than Methods 1 and 3 in terms of mean ranks (note that it was winning in the small competition). The difference between Methods 1 and 3 is still not significant, but it would become so if we continued increasing the sample size. This example tells us that we need to be careful when selecting the best method, as the choice might change under different circumstances. At least we knew from the start that Method 2 is not good.


• Demšar, J., 2006. Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research. 7, 1–30.
• Koning, A.J., Franses, P.H., Hibon, M., Stekler, H.O., 2005. The M3 competition: Statistical tests of the results. International Journal of Forecasting. 21, 397–409.
• Kourentzes, N., 2012. Statistical Significance of Forecasting Methods – an empirical evaluation of the robustness and interpretability of the MCB, ANOM and Friedman-Nemenyi Test. (version: 2021-08-12)
• Svetunkov, I., 2021c. Statistics for business analytics. (version: 01.10.2021)
• Wikipedia, 2021a. F-test. (version: 2021-07-08)
• Wikipedia, 2021b. Student’s t-test. (version: 2021-07-08)
• Wikipedia, 2021c. Friedman test. (version: 2021-07-08)
• Wikipedia, 2021d. Wilcoxon signed-rank test. (version: 2021-07-08)