<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Archives regular demand - Open Forecasting</title>
	<atom:link href="https://openforecast.org/tag/regular-demand/feed/" rel="self" type="application/rss+xml" />
	<link>https://openforecast.org/tag/regular-demand/</link>
	<description>How to look into the future</description>
	<lastBuildDate>Thu, 04 Apr 2024 15:51:38 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2015/08/cropped-usd-05-32x32.png&amp;nocache=1</url>
	<title>Archives regular demand - Open Forecasting</title>
	<link>https://openforecast.org/tag/regular-demand/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>M-competitions, from M4 to M5: reservations and expectations</title>
		<link>https://openforecast.org/2020/03/01/m-competitions-from-m4-to-m5-reservations-and-expectations/</link>
					<comments>https://openforecast.org/2020/03/01/m-competitions-from-m4-to-m5-reservations-and-expectations/#comments</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Sun, 01 Mar 2020 20:39:52 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[bla-bla-bla]]></category>
		<category><![CDATA[intermittent demand]]></category>
		<category><![CDATA[regular demand]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=2328</guid>

					<description><![CDATA[<p>UPDATE: I have also written a short post on &#8220;The role of M competitions in forecasting&#8221;, which gives historical perspective and a brief overview of the main findings of the previous competitions. Some of you might have noticed that the guidelines for the M5 competition have finally been released. Those of you who have previously [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2020/03/01/m-competitions-from-m4-to-m5-reservations-and-expectations/">M-competitions, from M4 to M5: reservations and expectations</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><strong>UPDATE</strong>: I have also written a short post on &#8220;<a href="/2024/03/14/the-role-of-m-competitions-in-forecasting/">The role of M competitions in forecasting</a>&#8221;, which gives historical perspective and a brief overview of the main findings of the previous competitions.</p>
<p>Some of you might have noticed that the guidelines for the M5 competition <a href="https://mofc.unic.ac.cy/m5-competition/">have finally been released</a>. Those of you who have previously visited this blog might know my scepticism about the M4 competition. So, I&#8217;ve decided to write this small post, outlining my reservations about the M4 and my thoughts and expectations about the M5.</p>
<h3>Reservations about the M4 competition</h3>
<p>Although I participated in two submissions to the M4, in the end I found the results unsatisfactory. My main reservations were with how the results were interpreted by the organisers, who announced right after the end of the M4 that &#8220;Machine Learning doesn&#8217;t work&#8221;, which was repeated a few times on the <a href="https://twitter.com/spyrosmakrid">Twitter of Spyros Makridakis</a> (see, for example, <a href="https://twitter.com/spyrosmakrid/status/1005004437507706880">this</a>, and <a href="https://twitter.com/spyrosmakrid/status/1177146041075994624">this</a>) and in their <a href="https://doi.org/10.1016/j.ijforecast.2018.06.001">summary paper in IJF</a>. This claim was peddled a few times on social media, and I did not take it well. I participated in the M4 competition in order to support Spyros, but I did not expect that I would contribute to an attack on Machine Learning. Besides, the claims that ML methods didn&#8217;t work were not valid. First, the <a href="https://eng.uber.com/m4-forecasting-competition/">winning method of Slawek Smyl</a> uses a Machine Learning technique. Second, not many ML methods were submitted to the competition, because the ML community was not involved in it. Third, the setting of the competition itself was not suitable for ML in the first place. What sort of non-linearity do we expect to see in yearly, quarterly or monthly data anyway? We already know that using ML makes sense when you have a lot of observations, so that you can train the methods properly and not overfit the data. Yes, there was also a handful of weekly, daily and hourly series in the M4 competition, but they constituted only 5% of the whole dataset. So, the fact that MLP and RNN failed to perform well when applied to each separate time series is not surprising. Furthermore, the organisers did not have ML specialists in their team, so I don&#8217;t think that it was a fair setting in the first place to conclude that &#8220;ML methods don&#8217;t work&#8221;. To me, it was similar to professional cyclists organising a race between supercars and bicycles on the narrow streets of an old, crowded city. The conclusion would be &#8220;the supercars don&#8217;t work&#8221;.</p>
<p>Furthermore, it appeared that there were several time series of a weird nature. For example, there were yearly data with 835 observations (e.g. series Y13190), there were some series without any variability for long periods of time (e.g. 29 years of no change in monthly data, as in series M16833), and there were some series that end in the future (e.g. Y3820 ends in the year 2378). Some other examples of weird data were discussed in <a href="https://doi.org/10.1016/j.ijforecast.2019.03.018">Darin &#038; Stellwagen (2020)</a>. Furthermore, <a href="https://doi.org/10.1016/j.ijforecast.2019.02.018">Ingel et al. (2020)</a> showed that some time series in the dataset were highly correlated, meaning that there was data leakage in the M4 competition. All of this does not have much to do with the time series we usually deal with in real life, at least not in the area I work in. We should acknowledge that preparing a competition as big as the M4 is not an easy task (it&#8217;s not possible to look through all 100,000 time series), and I&#8217;m sure that the organisers did the best they could to make sure that the dataset contains meaningful series. However, while all of this probably happened because of the complexity of the competition, I now have my reservations about using the M4 data for any future experiments.</p>
<p>I also had my reservations about the <a href="https://openforecast.org/en/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/">error measures</a> used in the competition, although I accepted them when I participated in the M4. I still think that it was a mistake to use sMAPE, because now we see a backlash, when practitioners from different companies refer to the M4 and ask us about sMAPE. I have to explain to them the limitations of this error measure, and why they should not use it. And I am afraid that we will see more and more forecasting-related papers using sMAPE for forecast evaluation just because of the M4 competition, although we have known since <a href="http://doi.org/10.1016/S0169-2070(99)00007-2">Goodwin &#038; Lawton 1999</a> that it has serious issues (it favours methods that overshoot the data). In my opinion this is a serious step back for science.</p>
<p>To be fair, there were also good results from the M4 competition, and while being critical of it in general, I don&#8217;t claim that it was a complete failure. It showed how ML can be used efficiently on big datasets: see the papers of <a href="https://doi.org/10.1016/j.ijforecast.2019.03.017">Slawek Smyl (2020)</a> and <a href="https://doi.org/10.1016/j.ijforecast.2019.02.011">Montero-Manso et al. (2020)</a>. It reconfirmed the previous findings that combinations of forecasting methods on average perform better than individual ones in terms of accuracy. It also showed that an increase in accuracy can be achieved with more complex methods, and I personally like Figure 2 of the <a href="https://doi.org/10.1016/j.ijforecast.2019.04.014">Makridakis et al. (2020)</a> paper, which presents this finding in a nice, concise way. And yes, it helped in promoting forecasting techniques developed by different researchers in the area.</p>
<p>Summarising all of that, in my opinion the M4 was a great opportunity to move science forward and promote forecasting. Unfortunately, I have a feeling that it was not well thought through by the organisers and, as a result, did not achieve this as well as it should have.</p>
<h3>Thoughts and expectations about the M5 competition</h3>
<p>Now we arrive at the <a href="https://mofc.unic.ac.cy/m5-competition/">M5 competition</a>. It will be based on data from Walmart, containing 42,840 series of daily data from 29th January 2011 to 19th June 2016, forming a natural hierarchy / grouped time series of SKUs &#8211; departments &#8211; shops &#8211; states and several categories. It is announced that this will be a mixture of intermittent and non-intermittent data, and explanatory variables (such as prices and promotions) will be provided. So far, it sounds quite exciting, because this is as close to reality as it can get.</p>
<p>When it comes to forecast evaluation, the organisers use a modification of <a href="https://doi.org/10.1016/j.ijforecast.2006.03.001">MASE</a>, which they call Root Mean Squared Scaled Error, and although it is not as interpretable as <a href="/en/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/">sMSE or relative RMSE</a>, it should do a good job in measuring the mean performance of methods, especially given the complexity of the task and the fact that they will have to <a href="/en/2020/01/13/what-about-all-those-zeroes-measuring-performance-of-models-on-intermittent-demand/">deal with intermittent data</a>. I totally understand the issue with the relative RMSE becoming equal to zero in some cases on intermittent data; I have encountered this several times myself. So, I think that RMSSE is a good choice of error measure for a competition like this.</p>
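<p>For illustration, here is a minimal R sketch of how such a scaled error can be computed, assuming that the denominator is the in-sample mean squared first difference (in the spirit of MASE); the official M5 definition may differ in details, so treat the function below as an assumption rather than the competition&#8217;s exact measure.</p>
<pre class="decode"># A sketch of a Root Mean Squared Scaled Error calculation.
# The scaling by the in-sample mean squared first difference is an assumption here.
rmsse <- function(actual, forecast, insample){
    sqrt(mean((actual - forecast)^2) / mean(diff(insample)^2))
}

# Example on artificial data with a flat mean forecast
set.seed(41)
insample <- 500 + rnorm(100, 0, 20)
holdout <- 500 + rnorm(10, 0, 20)
rmsse(holdout, rep(mean(insample), 10), insample)</pre>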
<p>They will also measure the accuracy of prediction intervals using a scaled version of the <a href="/en/2019/10/18/how-confident-are-you-assessing-the-uncertainty-in-forecasting/">pinball loss</a>, which makes sense if one is interested in the accuracy of different methods in predicting specific quantiles of the distribution. The quantiles also relate to inventory planning decisions, although not always directly (aggregation over the lead time is, inevitably, avoided in the M5 competition, probably because it would complicate things a lot).</p>
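<p>The pinball loss itself is straightforward to compute. Below is a simple R sketch under the standard definition for a target quantile, leaving out the scaling that the organisers apply; the function name and the example are mine, for illustration only.</p>
<pre class="decode"># A sketch of the pinball (quantile) loss for a target quantile tau.
# The M5 uses a scaled version of this; the scaling is omitted for simplicity.
pinball <- function(actual, quantileForecast, tau){
    mean(ifelse(actual >= quantileForecast,
                tau * (actual - quantileForecast),
                (1 - tau) * (quantileForecast - actual)))
}

# Example: a 97.5% quantile forecast for artificial intermittent demand
set.seed(42)
actual <- rpois(10, 3)
pinball(actual, rep(5, 10), tau=0.975)</pre>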
<p>The organisers plan to have two stages in the competition, starting on 2nd March, releasing the interim leader board on 31st May and setting the final deadline to 30th June. So, if you participate, you will have roughly three months for analysing the data, developing your methods and producing forecasts, and then one more month (from 31st May to 30th June) for the necessary corrections and re-submission of the forecasts in the light of the new data.</p>
<p>Finally, they say that one of the things they want to do is &#8220;to compare the accuracy/uncertainty of Machine Learning and Deep Learning methods vis-à-vis those of standard statistical ones&#8221;. I am not an expert in the ML area, but given the length and quantity of data, I think that this will be a good dataset for training all the fancy artificial neural networks and random forests that we have over there. And while the previous competition did not see many submissions from the ML community, I am sure that this one will have a lot, mainly because the M5 will be held on the <a href="https://www.kaggle.com/">Kaggle platform</a>, which typically attracts researchers from the ML, Data Science and Computer Science areas. And I hope that we won&#8217;t see a new version of &#8220;Machine Learning doesn&#8217;t work&#8221; on Twitter after the M5.</p>
<p>Summing up, I am quite optimistic about the M5 Competition. It seems to improve upon the M4 and it looks like it has addressed all the main concerns that some researchers previously had (see, for example, <a href="https://doi.org/10.1016/j.ijforecast.2019.04.012">Fildes, 2020</a>), including those issues that I had with M4.</p>
<h3>M5 competition: to do or not to do?</h3>
<p>And now an important question: to participate or not to participate.</p>
<p>Last time, when I participated in the M4, I was involved in two submissions. <a href="https://doi.org/10.1016/j.ijforecast.2019.01.006">The first one</a> was mainly done by <a href="https://researchportal.bath.ac.uk/en/persons/fotios-petropoulos">Fotios Petropoulos</a>; my role there was quite insignificant, limited to discussing a couple of ideas. All the coding, calculations and the submission were done by him. The second one was planned to be a submission from the CMAF, but I was the only person from the Centre involved. In the end I got help from <a href="https://research.monash.edu/en/persons/mahdi-abolghasemi">Mahdi Abolghasemi</a>, and, if it wasn&#8217;t for him, we wouldn&#8217;t have submitted at all. In fact, we never tested our method on the holdout and ran out of time, because I could not finish the calculations before the deadline. I even asked <a href="http://www.fsu.gr/en/doctors/evangelos-spiliotis">Evangelos Spiliotis</a> for an extension, suggesting that our submission be excluded from the official leader board. So, I know how time consuming this work can be. This is not a simple blind application of a method you came up with in the shower to a set of \(k\) time series. You need to invest time in exploring the dataset, analysing the relations, selecting variables, finding the most suitable models, testing them and then reanalysing, refining, reselecting and retesting, again and again. And because of that I will not participate in the M5 competition. Given the complexity of the problem, the time restrictions and my current workload, I don&#8217;t see how I would be able to handle it on time (I cannot even finish marking students&#8217; coursework&#8230;). It&#8217;s a shame, because this looks like a very good competition with very nice potential.</p>
<p>But this is just me and my personal problems. If you are thinking of participating in the competition, I would encourage you to do so. It looks nice and very well thought through. I hope that it lives up to our expectations!</p>
<p>All the information about the M5 Competition can be found on the <a href="https://mofc.unic.ac.cy/m5-competition/">MOFC website</a>.</p>
<p>Message <a href="https://openforecast.org/2020/03/01/m-competitions-from-m4-to-m5-reservations-and-expectations/">M-competitions, from M4 to M5: reservations and expectations</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2020/03/01/m-competitions-from-m4-to-m5-reservations-and-expectations/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
		<item>
		<title>Naughty APEs and the quest for the holy grail</title>
		<link>https://openforecast.org/2017/07/29/naughty-apes-and-the-quest-for-the-holy-grail/</link>
					<comments>https://openforecast.org/2017/07/29/naughty-apes-and-the-quest-for-the-holy-grail/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Sat, 29 Jul 2017 19:05:57 +0000</pubDate>
				<category><![CDATA[Forecast evaluation]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[error measures]]></category>
		<category><![CDATA[regular demand]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=1311</guid>

					<description><![CDATA[<p>Today I want to tell you a story of naughty APEs and the quest for the holy grail in forecasting. The topic has already been known for a while in academia, but is widely ignored by practitioners. APE stands for Absolute Percentage Error and is one of the simplest error measures, which is supposed to [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2017/07/29/naughty-apes-and-the-quest-for-the-holy-grail/">Naughty APEs and the quest for the holy grail</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Today I want to tell you a story of naughty APEs and the quest for the holy grail in forecasting. The topic has been known for a while in academia, but is widely ignored by practitioners. APE stands for Absolute Percentage Error and is one of the simplest error measures, which is supposed to show the accuracy of forecasting models. It has its sister PE (Percentage Error), which is as sinister as APE, but is supposed to give information about the bias of forecasts (so that we can see if we systematically overforecast or underforecast). As for the quest for the holy grail, it is the search for the most informative error measure, one that does not have any flaws.</p>
<p>Let’s start with one of the most popular error measures based on APE, which is very often used by practitioners and is called “MAPE” &#8211; Mean Absolute Percentage Error. It is calculated the following way:<br />
\begin{equation} \label{eq:MAPE}<br />
	\text{MAPE} = \frac{1}{h} \sum_{j=1}^h \frac{|y_{t+j} -\hat{y}_{t+j}|}{y_{t+j}} ,<br />
\end{equation}<br />
where \(y_{t+j}\) is the actual value, \(\hat{y}_{t+j}\) is the forecast value \(j\) steps ahead and \(h\) is the forecasting horizon. This shows the accuracy of forecasts relative to the actual values, and it has one big flaw – its final value strongly depends on the actual values in the holdout sample. If, for instance, the actual value is very close to zero, then the MAPE will most probably be very high. Vice versa, if the actual value is very high (e.g. 1m), then the resulting error will be very low.</p>
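<p>To make this concrete, here is a small R sketch computing MAPE and MPE for a holdout sample, following the formula above (the sign convention for MPE simply mirrors \eqref{eq:MAPE} without the absolute values):</p>
<pre class="decode"># Manual calculation of MAPE and MPE (in %), following the formula above
mape <- function(actual, forecast){
    100 * mean(abs(actual - forecast) / actual)
}
mpe <- function(actual, forecast){
    100 * mean((actual - forecast) / actual)
}

# The same absolute error of 50 units is negligible on a high-level series...
mape(actual=c(100000, 100050), forecast=c(100050, 100000))
# ...and enormous when the actuals are close to zero
mape(actual=c(10, 60), forecast=c(60, 10))</pre>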
<p>Some people would say that it’s okay, and that there’s no problem at all. They would argue that MAPE was developed to show exactly that, and in a way they would be right. But let us imagine the following situation with a company called “Precise consulting”. The company works in retail and sells different goods, and its sales manager Tom wants to have the most accurate forecasts possible. In order to stimulate forecasters to produce accurate forecasts, he has a tricky system: if a forecaster has a MAPE of less than 10%, then he gets a monetary bonus; otherwise the forecaster is punished and Tom spanks him personally. Tom thinks that this is a good system, which stimulates forecasters to work better. As for the forecasters, there is Siegfried, who always gets bonuses, and Roy, who always gets spanked. The first mainly deals with time series of the following type:</p>
<pre class="decode">x1 <- 100000 - 10*c(1:100) + rnorm(100,0,5)</pre>
<p><a href="/wp-content/uploads/2017/07/APEs00.png"><img fetchpriority="high" decoding="async" src="/wp-content/uploads/2017/07/APEs00-300x175.png" alt="" width="300" height="175" class="alignnone size-medium wp-image-1312" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs00-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs00-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs00-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs00.png&amp;nocache=1 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a><br />
While the second has:</p>
<pre class="decode">x2 <- 1000 - 10*c(1:100) + rnorm(100,0,5)</pre>
<p><a href="/wp-content/uploads/2017/07/APEs00i.png"><img decoding="async" src="/wp-content/uploads/2017/07/APEs00i-300x175.png" alt="" width="300" height="175" class="alignnone size-medium wp-image-1313" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs00i-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs00i-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs00i-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs00i.png&amp;nocache=1 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a></p>
<p>As you can see, Siegfried has data with a level of sales close to 100k, while Roy has data with a level around 1k. Neither of them has an appropriate education in forecasting, so although their data have trends, they both use simple exponential smoothing in the following way:</p>
<pre class="decode">es(x1,"ANN",silent=F,holdout=T,h=10)
es(x2,"ANN",silent=F,holdout=T,h=10)</pre>
<p>The graphs in both cases look very similar:</p>
<p><a href="/wp-content/uploads/2017/07/APEs01.png"><img decoding="async" src="/wp-content/uploads/2017/07/APEs01-300x175.png" alt="" width="300" height="175" class="alignnone size-medium wp-image-1314" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs01-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs01-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs01-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs01.png&amp;nocache=1 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a><br />
<a href="/wp-content/uploads/2017/07/APEs02.png"><img loading="lazy" decoding="async" src="/wp-content/uploads/2017/07/APEs02-300x175.png" alt="" width="300" height="175" class="alignnone size-medium wp-image-1315" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs02-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs02-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs02-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs02.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a></p>
<p>As we can see, they both do a lousy job as forecasters, unable to be precise. However, Siegfried is a lucky guy: that’s why he has MAPE=0.1% and nice bonuses each month. Roy, on the other hand, is cursed with a bad time series and has MAPE=575.3%, so no bonuses for him. As you can see from the graphs above, the problem is not Roy’s inability to produce accurate forecasts, but the error measure that Tom makes both forecasters use. In fact, even if Roy used the correct model (which would be ETS(A,A,N), or Holt’s model, in this case), he would still get punished, because his MAPE would still be above the threshold, at 35.3%:</p>
<pre class="decode">es(x2,"AAN",silent=F,holdout=T,h=10)</pre>
<p><a href="/wp-content/uploads/2017/07/APEs03.png"><img loading="lazy" decoding="async" src="/wp-content/uploads/2017/07/APEs03-300x175.png" alt="" width="300" height="175" class="alignnone size-medium wp-image-1316" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs03-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs03-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs03-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs03.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a></p>
<p>Roy is lucky enough not to have zeroes in the holdout sample. Otherwise he would have an infinite MAPE and would probably get fired.</p>
<p>One would think that using the Median APE could solve the problem, but it really doesn’t, because the level of the series still influences the errors. And I’m not even mentioning the aggregation of MAPEs over several time series. If we do that, we won’t get anything meaningful or useful either, because the errors on series with lower levels would overshadow the errors on series with higher levels.</p>
<p>So, this is why APEs are naughty. Whenever you have an error measure in which a value is divided by the actual value from the same sample, you will most likely encounter problems. MPE, mentioned above, has exactly the same problem. It is calculated similarly to MAPE, but without the absolute values. In our example Siegfried has MPE=-0.1%, while Roy has MPE=-575.3% if he uses the wrong model and MPE=26.8% if he uses the correct one. So, this error measure does not provide the correct information about the forecasts either.</p>
<p>Okay. So, now we know that using MAPE, APE, MPE and whatever else ends with PE is a bad idea, because these measures do not do what they are supposed to do and, as a result, do not allow making correct managerial decisions.</p>
<p>Tom, being a smart guy, and having read this post, decided to use a different error measure. He saw people in the forecasting literature mentioning the Symmetric MAPE (SMAPE):<br />
\begin{equation} \label{eq:SMAPE}<br />
	\text{SMAPE} = \frac{2}{h} \sum_{j=1}^h \frac{|y_{t+j} -\hat{y}_{t+j}|}{|y_{t+j}| + |\hat{y}_{t+j}|} .<br />
\end{equation}</p>
<p>The idea of this error measure is to make it less biased by dividing the absolute error by the average of the forecast and actual values (hence the 2 in the formula). Tom thinks that if this error measure is called “Symmetric”, then it should solve all of his problems. Now he asks Siegfried and Roy to report SMAPE instead of MAPE. And in order to test their forecasting skills, he gives both of them a very simple time series with a fixed (deterministic) level:</p>
<pre class="decode">x <- rnorm(100,1000,5)</pre>
<p>Either for fun or out of a lack of knowledge, Siegfried produced a weird forecast, overshooting the data:</p>
<pre class="decode">es(x,"ANN",silent=F,holdout=T,h=10,persistence=0,initial=1500)</pre>
<p><a href="/wp-content/uploads/2017/07/APEs05.png"><img loading="lazy" decoding="async" src="/wp-content/uploads/2017/07/APEs05-300x175.png" alt="" width="300" height="175" class="alignnone size-medium wp-image-1318" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs05-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs05-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs05-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs05.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><br />
while Roy did a similarly bad job, systematically underforecasting:</p>
<pre class="decode">es(x,"ANN",silent=F,holdout=T,h=10,persistence=0,initial=500)</pre>
<p><a href="/wp-content/uploads/2017/07/APEs04.png"><img loading="lazy" decoding="async" src="/wp-content/uploads/2017/07/APEs04-300x175.png" alt="" width="300" height="175" class="alignnone size-medium wp-image-1317" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs04-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs04-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs04-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/07/APEs04.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a></p>
<p>Obviously, both of them did a similarly bad job, missing the point (and both probably should be fired). But SMAPE would be able to tell us that, right? You won’t believe it, but it tells us that Siegfried still did a better job than Roy. He has SMAPE=40.1%, while Roy has SMAPE=66.6%. Judging by SMAPE alone, Tom would be inclined to fire Roy, although in reality he is as bad as Siegfried. What has happened? Are we missing something important in the performance of the forecasters? Is Siegfried really doing a better job than Roy?</p>
<p>No! The problem is once again in the error measure &#8211; we are dealing with a naughty APE. This time the error is deflated when the forecast is high, because the forecast appears in the denominator of \eqref{eq:SMAPE}. This means that SMAPE likes it when we overforecast (this was first discussed in the literature by <a href="https://doi.org/10.1016/S0169-2070(99)00007-2" rel="noopener noreferrer" target="_blank">Goodwin and Lawton, 1999</a>). Very bad APE! Very naughty APE! No one should EVER use it!</p>
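<p>The asymmetry is easy to check numerically. Here is a small R sketch illustrating the effect with a single observation (the same absolute error of 500 units, first overshooting and then undershooting an actual value of 1000):</p>
<pre class="decode"># SMAPE as defined in the formula above
smape <- function(actual, forecast){
    200 * mean(abs(actual - forecast) / (abs(actual) + abs(forecast)))
}

# The same absolute error of 500 units around an actual value of 1000:
smape(actual=1000, forecast=1500)  # overshooting gives 40%
smape(actual=1000, forecast=500)   # undershooting gives 66.7%</pre>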
<p>Okay. Now we know that there are bad APEs. What about good ones? Unfortunately, there are no good APEs, but there are other decent error measures.</p>
<p>Tom needs an error measure that would be applicable to a wide variety of data, would not have such dire problems and would be easy to interpret. As you probably already understand, there is no such thing, but at least there are better alternatives to APEs:</p>
<ol>
<li>sMAE – scaled Mean Absolute Error (proposed by <a href="http://dx.doi.org/10.1057/jors.2014.62" target="_blank" rel="noopener noreferrer">Fotios Petropoulos and Nikos Kourentzes in 2015</a>), which is also sometimes referred to as "weighted MAPE" or "wMAPE":</li>
<p>\begin{equation} \label{eq:sMAE}<br />
	\text{sMAE} = \frac{\text{MAE}}{\bar{y}},<br />
\end{equation}<br />
where \(\bar{y}\) is the average value of the in-sample actuals and \(\text{MAE}=\frac{1}{h} \sum_{j=1}^h |y_{t+j} -\hat{y}_{t+j}|\) is the Mean Absolute Error of the forecast over the holdout. This is a better analogue of MAPE, which is easy to interpret (it can also be expressed as a percentage). Unfortunately, it still has problems with different levels similar to those of MAPE, but at least it does not react to potential zeroes in the holdout sample. Use it if you desperately need something meaningful, but at the same time better than MAPE.</p>
<li>MASE – Mean Absolute Scaled Error (proposed by <a href="https://doi.org/10.1016/j.ijforecast.2006.03.001" target="_blank" rel="noopener noreferrer">Rob J. Hyndman and Anne B. Koehler in 2006</a>)</li>
<p>\begin{equation} \label{eq:MASE}<br />
	\text{MASE} = \frac{\text{MAE}}{\frac{1}{t-1} \sum_{j=2}^t |y_{j} -y_{j-1}|}.<br />
\end{equation}<br />
The key feature of MASE is the division by the mean absolute first difference of the in-sample data. Sometimes the denominator is interpreted as an in-sample one-step-ahead Naive error. This allows bringing different time series to the same scale and getting rid of a potential trend in the data, so you do not get the naughty effect of APEs. This is a more robust error measure than sMAE and has fewer problems, but at the same time it is much harder to interpret than the other error measures. For example, MASE=0.554 does not really tell us anything specific. Yes, it seems that the forecast error is lower than the mean absolute difference of the data, but so what? This is a better error measure than any APE, but good luck explaining to Tom what it means!</p>
<li>rMAE &#8211; Relative MAE (discussed by <a href="https://doi.org/10.1016/j.ijforecast.2012.09.002" target="_blank" rel="noopener noreferrer">Andrey Davydenko and Robert Fildes in 2013</a>)</li>
<p>\begin{equation} \label{eq:rMAE}<br />
	\text{rMAE} = \frac{\text{MAE}_1}{\text{MAE}_2},<br />
\end{equation}<br />
where \(\text{MAE}_1\) is the MAE of the model of interest, while \(\text{MAE}_2\) is the MAE of some benchmark model. The simplest benchmark is the Naive method, so by calculating rMAE we compare the performance of our model with Naive. If our model performs worse than the benchmark, then rMAE will be greater than one; if it is more accurate, then rMAE will be less than one. In fact, rMAE aligns very well with the so-called "Forecast Value":<br />
\begin{equation} \label{eq:FVA}<br />
	\text{FV} = (1-\text{rMAE}) \cdot 100\text{%} .<br />
\end{equation}<br />
So, for example, if rMAE=0.95, then we can conclude that the tested model is 5% better than the benchmark. This would be a perfect error measure, our holy grail, if not for a couple of small "Buts": if for some reason \(\text{MAE}_2\) is equal to zero, rMAE cannot be estimated. Furthermore, it is recommended to aggregate rMAE over different time series using the geometric mean rather than the arithmetic one (<a href="https://doi.org/10.1016/j.ijforecast.2012.09.002" target="_blank" rel="noopener noreferrer">because this is a relative error</a>). So if on some time series the chosen model performs extremely well (e.g. its MAE is very close to zero or even equal to zero), then rMAE will be close to zero too, which will pull the aggregated value towards zero no matter what, even if the model did not perform well on other series. A small R sketch computing all three measures is given after this list.</p>
</ol>
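<p>To make these three measures concrete, here is a minimal R sketch implementing the formulas above for a single time series, with the Naive method as the benchmark for rMAE (the function names and the example data are mine, for illustration only):</p>
<pre class="decode"># Sketch of sMAE, MASE and rMAE for one series: 'insample' is the training part,
# 'actual' is the holdout and 'forecast' is the forecast of the model of interest.
sMAE <- function(actual, forecast, insample){
    mean(abs(actual - forecast)) / mean(insample)
}
MASE <- function(actual, forecast, insample){
    mean(abs(actual - forecast)) / mean(abs(diff(insample)))
}
rMAE <- function(actual, forecast, benchmark){
    mean(abs(actual - forecast)) / mean(abs(actual - benchmark))
}

# Example on Roy's trending series with a flat forecast and a Naive benchmark
set.seed(7)
x2 <- 1000 - 10*c(1:100) + rnorm(100,0,5)
insample <- x2[1:90]; actual <- x2[91:100]
forecast <- rep(mean(tail(insample,10)), 10)   # crude level forecast
benchmark <- rep(tail(insample,1), 10)         # Naive: last in-sample value
c(sMAE=sMAE(actual,forecast,insample),
  MASE=MASE(actual,forecast,insample),
  rMAE=rMAE(actual,forecast,benchmark))</pre>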
<p>One more thing to note is that none of the discussed error measures can be applied to intermittent data (data with randomly occurring zeroes). But this is a completely different story, with a completely different set of forecasters and managers.</p>
<p>As a conclusion, I would advise Tom not to use percentage errors, but to use several of the other error measures, because each of them has some problems. Trying to find the one best error measure is similar to the search for the holy grail. Don’t waste your time! And, please, don’t set bonuses or punishments for forecasters based on error measures – it’s a silly idea, which demotivates people from working well and encourages them to cheat.</p>
<p>Message <a href="https://openforecast.org/2017/07/29/naughty-apes-and-the-quest-for-the-holy-grail/">Naughty APEs and the quest for the holy grail</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2017/07/29/naughty-apes-and-the-quest-for-the-holy-grail/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
