M-competitions, from M4 to M5: reservations and expectations

UPDATE: I have also written a short post on “The role of M competitions in forecasting“, which gives a historical perspective and a brief overview of the main findings of the previous competitions.

Some of you might have noticed that the guidelines for the M5 competition have finally been released. Those of you who have previously visited this blog might know about my scepticism towards the M4 competition. So, I’ve decided to write this small post, outlining my reservations about the M4 and my thoughts and expectations about the M5.

Reservations about the M4 competition

Although I participated in two submissions to the M4, in the end I found the results unsatisfactory. My main reservations were with how the results were interpreted by the organisers, who announced right after the end of the M4 that “Machine Learning doesn’t work”, a claim repeated a few times on Spyros Makridakis’s Twitter (see, for example, this, and this) and in their summary paper in the IJF. This was peddled a few times on social media, and I did not take it well. I participated in the M4 competition in order to support Spyros, but I did not expect that I would contribute to an attack on Machine Learning. Besides, the claims that ML methods didn’t work were not valid. First, the winning method of Slawek Smyl uses Machine Learning techniques. Second, there were not many ML methods submitted to the competition, because the ML community was not involved in it. Third, the setting of the competition itself was not suitable for ML in the first place. What sort of non-linearity do we expect to see in yearly, quarterly or monthly data anyway? We already know that using ML makes sense when you have a lot of observations, so that you can train the methods properly and not overfit the data. Yes, there was also a handful of weekly, daily and hourly series in the M4 competition, but they constituted only 5% of the whole dataset. So, the fact that MLPs and RNNs failed to perform well when applied to each time series separately is not surprising. Furthermore, the organisers did not have ML specialists in their team, so I don’t think that it was a fair setting in the first place for concluding that “ML methods don’t work”. To me, it was similar to professional cyclists organising a race between supercars and bicycles on the narrow streets of an old, crowded city. The conclusion would be that “supercars don’t work”.

Furthermore, it appeared that there were several time series of a weird nature. For example, there were yearly data with 835 observations (e.g. series Y13190), there were series without any variability for long periods of time (e.g. 29 years of no change in monthly data, as in series M16833), and there were series that end in the future (e.g. Y3820 ends in the year 2378). Some other examples of weird data were discussed in Darin & Stellwagen (2020). Moreover, Ingel et al. (2020) showed that some time series in the dataset were highly correlated, meaning that there was data leakage in the M4 competition. All of this does not have much to do with the time series we usually deal with in real life, at least not in the area I work in. We should acknowledge that preparing a competition as big as the M4 is not an easy task (it is not possible to look through all 100,000 time series), and I am sure that the organisers did the best they could to make sure that the dataset contained meaningful series. However, while all of this probably happened because of the complexity of the competition, I now have my reservations about using the M4 data for any future experiments.

I also had my reservations about the error measures used in the competition, although I accepted them when I participated in the M4. I still think that it was a mistake to use sMAPE, because now we see the backlash: practitioners from different companies refer to the M4 and ask us about sMAPE. I have to explain to them the limitations of this error measure and why they should not use it. And I am afraid that we will see more and more forecasting-related papers using sMAPE for forecast evaluation just because of the M4 competition, although we have known since Goodwin & Lawton (1999) that it has serious issues (it favours methods that overshoot the data). In my opinion, this is a serious step back for science.
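To illustrate the Goodwin & Lawton point, here is a small numerical sketch (plain Python, with made-up numbers): for the same absolute error, sMAPE is smaller when the forecast overshoots the actual value than when it undershoots it, so the measure rewards over-forecasting.

```python
# Illustration of the sMAPE asymmetry: the same absolute error of 10
# is penalised differently depending on whether the forecast
# overshoots or undershoots the actual value.

def smape(actual, forecast):
    """sMAPE for a single point, in percent: 200*|y - f| / (|y| + |f|)."""
    return 200 * abs(actual - forecast) / (abs(actual) + abs(forecast))

actual = 100
print(smape(actual, 110))  # overshooting by 10 -> ~9.52%
print(smape(actual, 90))   # undershooting by 10 -> ~10.53%
```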

To be fair, there were also good outcomes from the M4 competition, and while I am critical of it in general, I don’t claim that it was a complete failure. It showed how ML can be used efficiently on big datasets: see the papers of Slawek Smyl (2020) and Montero-Manso et al. (2020). It reconfirmed the previous finding that combinations of forecasting methods on average perform better than individual ones in terms of accuracy. It also showed that an increase in accuracy can be achieved with more complex methods, and I personally like Figure 2 of Makridakis et al. (2020), which presents this finding in a nice, concise way. And yes, it helped to promote the forecasting techniques developed by different researchers in the area.

Summarising all of that: in my opinion, the M4 was a great opportunity to move science forward and promote forecasting. Unfortunately, I have a feeling that it was not well thought through by the organisers and, as a result, did not achieve as much as it could have.

Thoughts and expectations about the M5 competition

Now we arrive at the M5 competition. It will be based on data from Walmart, containing 42,840 series of daily data from 29th January 2011 to 19th June 2016, forming a natural hierarchy / grouped structure of SKUs – departments – shops – states and several categories. It has been announced that this will be a mixture of intermittent and non-intermittent data, and that explanatory variables (such as prices and promotions) will be provided. So far, it sounds quite exciting, because this is as close to reality as it can get.
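For those less familiar with grouped / hierarchical time series, here is a hypothetical sketch of what the aggregation across the announced levels could look like. The column names and numbers below are my assumptions for illustration only, not the official M5 data format.

```python
# Hypothetical sketch of aggregating bottom-level (SKU) sales up the
# SKU - department - shop - state hierarchy. Column names are assumed.
import pandas as pd

sales = pd.DataFrame({
    "sku_id":   ["A1", "A2", "B1", "B2"],
    "dept_id":  ["FOODS", "FOODS", "HOBBIES", "HOBBIES"],
    "shop_id":  ["CA_1", "CA_1", "TX_1", "TX_1"],
    "state_id": ["CA", "CA", "TX", "TX"],
    "sales":    [3, 0, 5, 2],   # daily unit sales; zeroes hint at intermittence
})

# Aggregate the bottom-level series to the higher levels of the structure
by_department = sales.groupby("dept_id")["sales"].sum()
by_shop       = sales.groupby(["state_id", "shop_id"])["sales"].sum()
by_state      = sales.groupby("state_id")["sales"].sum()
total         = sales["sales"].sum()

print(by_state)
```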

When it comes to forecast evaluation, the organisers use a modification of MASE, which they call the Root Mean Squared Scaled Error (RMSSE), and although it is not as interpretable as sMSE or relative RMSE, it should do a good job of measuring the mean performance of methods, especially given the complexity of the task and the fact that they will have to deal with intermittent data. I totally understand the issue with the relative RMSE becoming equal to zero in some cases on intermittent data; I have encountered this several times myself. So, I think that RMSSE is a good choice of error measure for a competition like this.
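Based on my reading of the guidelines, a minimal sketch of RMSSE could look as follows: the squared forecast errors over the holdout are scaled by the mean squared one-step in-sample naive error, and the square root is taken at the end. Treat this as an illustration rather than the official implementation.

```python
# A minimal sketch of RMSSE (the squared-error counterpart of MASE).
import numpy as np

def rmsse(train, actual, forecast):
    """Squared holdout errors scaled by the in-sample naive error."""
    numerator = np.mean((np.asarray(actual) - np.asarray(forecast)) ** 2)
    denominator = np.mean(np.diff(np.asarray(train)) ** 2)  # naive one-step errors
    return np.sqrt(numerator / denominator)

history = np.array([0, 2, 0, 1, 3, 0, 0, 4])   # an intermittent-looking series
holdout = np.array([1, 0, 2])
point_forecast = np.array([1.2, 1.2, 1.2])
print(rmsse(history, holdout, point_forecast))
```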

They will also measure the accuracy of prediction intervals using a scaled version of the pinball loss, which makes sense if one is interested in how accurately different methods predict specific quantiles of the distribution. The quantiles also relate to inventory planning decisions, although not always directly (aggregation over the lead time is avoided in the M5 competition, probably because it would complicate things a lot).
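For reference, here is a minimal sketch of the (unscaled) pinball loss for a single quantile level; the scaling used in the M5 is omitted here for simplicity, and the numbers are hypothetical.

```python
# A minimal sketch of the pinball (quantile) loss at one probability level.
import numpy as np

def pinball_loss(actual, quantile_forecast, tau):
    """Average pinball loss at probability level tau (e.g. 0.975)."""
    actual = np.asarray(actual, dtype=float)
    q = np.asarray(quantile_forecast, dtype=float)
    return np.mean(np.where(actual >= q,
                            tau * (actual - q),
                            (1 - tau) * (q - actual)))

actual = np.array([1, 0, 2, 5])
upper_quantile = np.array([3, 3, 3, 3])   # hypothetical 97.5% quantile forecast
print(pinball_loss(actual, upper_quantile, tau=0.975))
```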

The organisers plan to have two stages in the competition, starting on 2nd March, releasing the interim leaderboard on 31st May and setting the final deadline for 30th June. So, if you participate, you will have roughly three months for analysing the data, developing your methods and producing forecasts, and then one more month (from 31st May to 30th June) for the necessary corrections and re-submission of the forecasts in the light of the new data.

Finally, they say that one of the things they want to do is “to compare the accuracy/uncertainty of Machine Learning and Deep Learning methods vis-à-vis those of standard statistical ones”. I am not an expert in the ML area, but given the length and quantity of the data, I think that this will be a good dataset for training all the fancy artificial neural networks and random forests that we have over there. And while the previous competition did not see many submissions from the ML community, I am sure that this one will have a lot, mainly because the M5 will be held on the Kaggle platform, which typically attracts researchers from the ML, Data Science and Computer Science areas. And I hope that we won’t see a new version of “Machine Learning doesn’t work” on Twitter after the M5.

Summing up, I am quite optimistic about the M5 competition. It seems to improve upon the M4, and it looks like it has addressed all the main concerns that some researchers previously had (see, for example, Fildes, 2020), including the issues that I had with the M4.

M5 competition: to do or not to do?

And now an important question: to participate or not to participate.

Last time, when I participated in the M4, I was involved in two submissions. The first one was mainly done by Fotios Petropoulos; my role there was quite insignificant, only discussing a couple of ideas. All the coding, calculations and the submission itself were done by him. The second one was planned to be a submission from the CMAF, but I was the only person from the Centre involved. In the end I got help from Mahdi Abolghasemi, and if it wasn’t for him, we wouldn’t have submitted at all. In fact, we never tested our method on the holdout and ran out of time, because I could not finish the calculations before the deadline. I even asked Evangelos Spiliotis for an extension, suggesting that our submission be excluded from the official leaderboard. So, I know how time-consuming this work can be. It is not a simple blind application of a method you came up with in the shower to a set of \(k\) time series. You need to invest time in exploring the dataset, analysing the relations, selecting variables, finding the most suitable models, testing them, and then re-analysing, refining, re-selecting and re-testing, again and again. And because of that I will not participate in the M5 competition. Given the complexity of the problem, the time restrictions and my current workload, I don’t see how I would be able to handle it on time (I cannot even finish marking students’ coursework…). It’s a shame, because this looks like a very good competition with very nice potential.

But this is me and my personal problems. If you are thinking of participating in the competition, however, I would encourage you to do so. It looks nice and very well thought through. I hope that it lives up to our expectations!

All the information about the M5 Competition can be found on the MOFC website.

Comments (3):

  1. Hi Ivan,

    Glad you like the set-up of M5. We also think it is reasonable and that it represents reality, at least as much as a competition allows while remaining relatively simple and comprehensible. I think M5 has a lot of potential and I am looking forward to reviewing its results.

    Regarding M4, I think we’ve already “agreed we disagree” in a previous post of yours about both the accuracy measures used in the competition and the way its results were interpreted.

    I won’t comment on the M4 measures here as I’ve already done that in your previous post.
    I would just like to make it clear that the finding of M4 was not that “ML does not work”. It was that pure ML methods, trained in a series-by-series fashion, are less accurate than traditional, statistical approaches. We have highlighted, however, the importance of exploiting ML elements for applying cross-learning, i.e., learning from multiple series in order to predict the individual ones. In fact, this is what Slawek (1st place) and Pablo (2nd place) did, and this is exactly why they managed to win the competition. If it weren’t for ML algorithms, it wouldn’t have been possible to apply cross-learning.

    You claim that the setting of the competition itself was not suitable for ML in the first place, because you cannot expect non-linearity in yearly, quarterly or monthly series, which are also too short to allow effective training. This is not true. ML methods do not necessarily have to be fitted to each series individually. Those who did that failed to get a high score, but those who trained their models across all 100,000 series of the competition got the highest scores. It is not about having long, non-linear data. It is about finding relationships between the series. Also, it is not about using ML methods; it is about the way you use them. Is this a supercar or a super bike? I don’t know, but it works nicely.

    From my point of view, M4 promoted the utilization of ML in time series forecasting, showing how ML methods should be used to extract information from multiple series. Its large dataset allowed for such experimentation, and a lot of research is being done in this area, expanding from ML to deep learning (see the excellent work done by Boris and his colleagues here: https://arxiv.org/abs/1905.10437). Personally, I love ML and I don’t have anything against its use. I am also pretty sure that the winner of M5 will utilize an ML-based method.

    Finally, I would like you to know that I appreciate that your current workload makes it difficult for you to participate. On the other hand, given that CMAF is the biggest forecasting center in the UK, it would be reasonable for some of its members to participate (maybe some PhD students under your supervision and that of other senior researchers?). M5 would be an excellent opportunity for forecasters to do some forecasting and experiment with the tools they’ve been developing for such purposes.

    • Hi Vangelis,

      Thank you for your comment.

      Indeed, we do agree to disagree. We don’t need to have the same opinion on the topic, and I’m just expressing mine based on the observations I made. I know that you have a reasonable view on the problem, although I’m not sure that Spyros shares your opinion on the topic.

      Anyway, good luck with M5!

      • From the conclusions article in the M4 competition IJF special issue:
        “The forecasting spring began with the M4 Competition, where a complex hybrid approach combining statistical and ML elements came first, providing a 9.4% improvement in its sMAPE relative to that of the Comb benchmark.”
        Makridakis and Petropoulos (2020)
