<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>
<channel>
	<title>
	Comments on: Are you sure you&#8217;re precise? Measuring accuracy of point forecasts	</title>
	<atom:link href="https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/feed/" rel="self" type="application/rss+xml" />
	<link>https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/</link>
	<description>How to look into the future</description>
	<lastBuildDate>Fri, 23 Apr 2021 14:36:51 +0000</lastBuildDate>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>
		By: Ivan Svetunkov		</title>
		<link>https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-148</link>

		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 24 Nov 2020 16:06:29 +0000</pubDate>
		<guid isPermaLink="false">https://openforecast.org/?p=2050#comment-148</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-147&quot;&gt;Nicolas Vandeput&lt;/a&gt;.

Yes, very good point, Nicolas! Agreed :)]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-147">Nicolas Vandeput</a>.</p>
<p>Yes, very good point, Nicolas! Agreed :)</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Nicolas Vandeput		</title>
		<link>https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-147</link>

		<dc:creator><![CDATA[Nicolas Vandeput]]></dc:creator>
		<pubDate>Tue, 24 Nov 2020 16:00:51 +0000</pubDate>
		<guid isPermaLink="false">https://openforecast.org/?p=2050#comment-147</guid>

					<description><![CDATA[&quot;MAE-based error measures should not be used on intermittent demand.&quot;
Or any distribution where the mean and the median are different ;) 
(i.e., nearly all demand distributions in supply chains)]]></description>
			<content:encoded><![CDATA[<p>&#8220;MAE-based error measures should not be used on intermittent demand.&#8221;<br />
Or any distribution where the mean and the median are different ;)<br />
(i.e., nearly all demand distributions in supply chains)</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Ivan Svetunkov		</title>
		<link>https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-135</link>

		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Sun, 16 Feb 2020 12:18:06 +0000</pubDate>
		<guid isPermaLink="false">https://openforecast.org/?p=2050#comment-135</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-130&quot;&gt;Turbo Forecasting&lt;/a&gt;.

Dear Andrey Davydenko,

Thank you for your comments. A couple of points from my side:

1. The name &quot;AvgRelMAE&quot; is difficult to parse and to pronounce, and I personally think that a simpler name can be used instead. For example, using &quot;rMAE&quot; instead of &quot;RelMAE&quot; looks shorter and easier to read. I don&#039;t use the &quot;Avg&quot; part because I usually present results in a table that contains both mean and median values of rMAE. In fact, I&#039;m not the first person using this abbreviation - I&#039;ve seen it in a presentation at ISF2019.
2. As for MAE becoming equal to zero, it is an issue of any relative measure. Trimming is one of the solutions to the problem, indeed. But it is just a solution to the problem appearing in the measure naturally. In fact, this is what I suggest in this post: &quot;Having said that, if you notice that either rMAE or rRMSE becomes equal to zero or infinite for some time series, it makes sense to investigate, why that happened, and probably remove those series from the analysis.&quot;
3. As for my ethics, I promote your error measure and refer to your work every time I mention it.

Kind regards,
Ivan Svetunkov]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-130">Turbo Forecasting</a>.</p>
<p>Dear Andrey Davydenko,</p>
<p>Thank you for your comments. A couple of points from my side:</p>
<p>1. The name &#8220;AvgRelMAE&#8221; is difficult to parse and to pronounce, and I personally think that a simpler name can be used instead. For example, using &#8220;rMAE&#8221; instead of &#8220;RelMAE&#8221; looks shorter and easier to read. I don&#8217;t use the &#8220;Avg&#8221; part because I usually present results in a table that contains both mean and median values of rMAE. In fact, I&#8217;m not the first person using this abbreviation &#8211; I&#8217;ve seen it in a presentation at ISF2019.<br />
2. As for MAE becoming equal to zero, it is an issue of any relative measure. Trimming is one of the solutions to the problem, indeed. But it is just a solution to the problem appearing in the measure naturally. In fact, this is what I suggest in this post: &#8220;Having said that, if you notice that either rMAE or rRMSE becomes equal to zero or infinite for some time series, it makes sense to investigate, why that happened, and probably remove those series from the analysis.&#8221;<br />
3. As for my ethics, I promote your error measure and refer to your work every time I mention it.</p>
<p>Kind regards,<br />
Ivan Svetunkov</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Turbo Forecasting		</title>
		<link>https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-130</link>

		<dc:creator><![CDATA[Turbo Forecasting]]></dc:creator>
		<pubDate>Sun, 16 Feb 2020 06:30:08 +0000</pubDate>
		<guid isPermaLink="false">https://openforecast.org/?p=2050#comment-130</guid>

					<description><![CDATA[Renaming AvgRelMAE into ArMAE and AvgRelMSE into ArRMSE is not a great idea (unless you intentionally want to disguise the original names) for these reasons:

1) It is not very ethical to rename a method proposed by other people just because “you find it tedious”, especially after many people used the original names.

2) The original names, AvgRelMAE and AvgRelMSE, were chosen in order to avoid confusion with other metrics based on the arithmetic means (some researchers had already used the ARMSE abbreviations for the arithmetic means).

3) This makes the measures recognizable, and using the same name across different studies helps summarize research made by different authors. For example, you can see the AvgRelMAE in the review on retail forecasting, and it is immediately clear what aggregation scheme was used.

It is also not very ethical that you are making every effort to avoid mentioning that the scheme based on the geometric averaging of relative performances was proposed and justified in (Davydenko and Fildes, 2013). In fact, the AvgRelMSE and AvgRelMAE are special cases of a general metric proposed in this Ph.D. thesis (Davydenko, 2012, p. 62):
https://www.researchgate.net/publication/338885739_Integration_of_judgmental_and_statistical_approaches_for_demand_forecasting_Models_and_methods

The IJF paper is just a chapter of the Ph.D. thesis. The boxplots of log(RelMAE) were also first demonstrated in (Davydenko and Fildes, 2013), which is worth noting when following this approach.

And, yes, this metric has the best statistical properties among other alternatives, but you must give proper references when writing about it (either Davydenko and Fildes, 2013, or Davydenko, 2012). 

As for zero MAEs, as I wrote above, there&#039;s no problem with these occurrences; the method of working with zero MAEs was given in (Davydenko and Fildes, 2014, p. 24):
https://www.researchgate.net/publication/282136084_Measuring_Forecasting_Accuracy_Problems_and_Recommendations_by_the_Example_of_SKU-Level_Judgmental_Adjustments 

So it is not very clear why you still think that obtaining zero MAEs is a problem. One important thing is, of course, that the loss function you use for optimization should correspond to the loss function used for evaluation. This criterion for a good error measure was formulated in this chapter (Davydenko and Fildes, 2016, p. 3):
https://www.researchgate.net/publication/284947381_Forecast_Error_Measures_Critical_Review_and_Practical_Recommendations

The same chapter says that if you have a density forecast, then you need to adjust your forecast depending on the loss function. But most commonly you&#039;ll obtain your forecast optimized for the linear symmetric loss (explained in the same chapter), thus the AvgRelMAE is generally a reasonable option.]]></description>
			<content:encoded><![CDATA[<p>Renaming AvgRelMAE into ArMAE and AvgRelMSE into ArRMSE is not a great idea (unless you intentionally want to disguise the original names) for these reasons:</p>
<p>1) It is not very ethical to rename a method proposed by other people just because “you find it tedious”, especially after many people used the original names.</p>
<p>2) The original names, AvgRelMAE and AvgRelMSE, were chosen in order to avoid confusion with other metrics based on the arithmetic means (some researchers had already used the ARMSE abbreviations for the arithmetic means).</p>
<p>3) This makes the measures recognizable, and using the same name across different studies helps summarize research made by different authors. For example, you can see the AvgRelMAE in the review on retail forecasting, and it is immediately clear what aggregation scheme was used.</p>
<p>It is also not very ethical that you are making every effort to avoid mentioning that the scheme based on the geometric averaging of relative performances was proposed and justified in (Davydenko and Fildes, 2013). In fact, the AvgRelMSE and AvgRelMAE are special cases of a general metric proposed in this Ph.D. thesis (Davydenko, 2012, p. 62):<br />
<a href="https://www.researchgate.net/publication/338885739_Integration_of_judgmental_and_statistical_approaches_for_demand_forecasting_Models_and_methods" rel="nofollow ugc">https://www.researchgate.net/publication/338885739_Integration_of_judgmental_and_statistical_approaches_for_demand_forecasting_Models_and_methods</a></p>
<p>The IJF paper is just a chapter of the Ph.D. thesis. The boxplots of log(RelMAE) were also first demonstrated in (Davydenko and Fildes, 2013), which is worth noting when following this approach.</p>
<p>And, yes, this metric has the best statistical properties among other alternatives, but you must give proper references when writing about it (either Davydenko and Fildes, 2013, or Davydenko, 2012). </p>
<p>As for zero MAEs, as I wrote above, there&#8217;s no problem with these occurrences; the method of working with zero MAEs was given in (Davydenko and Fildes, 2014, p. 24):<br />
<a href="https://www.researchgate.net/publication/282136084_Measuring_Forecasting_Accuracy_Problems_and_Recommendations_by_the_Example_of_SKU-Level_Judgmental_Adjustments" rel="nofollow ugc">https://www.researchgate.net/publication/282136084_Measuring_Forecasting_Accuracy_Problems_and_Recommendations_by_the_Example_of_SKU-Level_Judgmental_Adjustments</a> </p>
<p>So it is not very clear why you still think that obtaining zero MAEs is a problem. One important thing is, of course, that the loss function you use for optimization should correspond to the loss function used for evaluation. This criterion for a good error measure was formulated in this chapter (Davydenko and Fildes, 2016, p. 3):<br />
<a href="https://www.researchgate.net/publication/284947381_Forecast_Error_Measures_Critical_Review_and_Practical_Recommendations" rel="nofollow ugc">https://www.researchgate.net/publication/284947381_Forecast_Error_Measures_Critical_Review_and_Practical_Recommendations</a></p>
<p>The same chapter says that if you have a density forecast, then you need to adjust your forecast depending on the loss function. But most commonly you&#8217;ll obtain your forecast optimized for the linear symmetric loss (explained in the same chapter), thus the AvgRelMAE is generally a reasonable option.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Turbo Forecasting		</title>
		<link>https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-129</link>

		<dc:creator><![CDATA[Turbo Forecasting]]></dc:creator>
		<pubDate>Sun, 16 Feb 2020 05:46:45 +0000</pubDate>
		<guid isPermaLink="false">https://openforecast.org/?p=2050#comment-129</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-102&quot;&gt;Fotios Petropoulos&lt;/a&gt;.

The problem of zero MAEs was easily solved by the method proposed in (Davydenko and Fildes, 2013, p. 24), full text available here, see note (a):
https://www.researchgate.net/publication/282136084_Measuring_Forecasting_Accuracy_Problems_and_Recommendations_by_the_Example_of_SKU-Level_Judgmental_Adjustments

If you use trimmed AvgRelMAE, then there is no problem in obtaining a certain percentage of zero MAEs.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-102">Fotios Petropoulos</a>.</p>
<p>The problem of zero MAEs was easily solved by the method proposed in (Davydenko and Fildes, 2013, p. 24), full text available here, see note (a):<br />
<a href="https://www.researchgate.net/publication/282136084_Measuring_Forecasting_Accuracy_Problems_and_Recommendations_by_the_Example_of_SKU-Level_Judgmental_Adjustments" rel="nofollow ugc">https://www.researchgate.net/publication/282136084_Measuring_Forecasting_Accuracy_Problems_and_Recommendations_by_the_Example_of_SKU-Level_Judgmental_Adjustments</a></p>
<p>If you use trimmed AvgRelMAE, then there is no problem in obtaining a certain percentage of zero MAEs.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Ivan Svetunkov		</title>
		<link>https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-114</link>

		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 03 Sep 2019 14:31:26 +0000</pubDate>
		<guid isPermaLink="false">https://openforecast.org/?p=2050#comment-114</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-113&quot;&gt;Vangelis Spiliotis&lt;/a&gt;.

Thanks for responding to my questions.

To me personally, it&#039;s not about hate, it&#039;s about using the findings from the recent literature. To be honest, I used SMAPE a lot when I worked in Russia, because it seemed a reasonable measure at that point (I even have it as a main measure in a textbook on forecasting that I have coauthored). But then new findings came to light, and I decided that I needed to take them into account. So, nothing personal, just pragmatism.

As for the very last point, my apologies. I did not want to sound sarcastic, it was more like a joke, and, as it appears, not the best one. My bad.

As a very last note, to clear something up: I personally think that you did a very good job in organising the competition and writing the papers on that, especially given all the existing limitations. Still, this does not mean that there is no room for improvement... :)
			<content:encoded><![CDATA[<p>In reply to <a href="https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-113">Vangelis Spiliotis</a>.</p>
<p>Thanks for responding to my questions.</p>
<p>To me personally, it&#8217;s not about hate, it&#8217;s about using the findings from the recent literature. To be honest, I used SMAPE a lot when I worked in Russia, because it seemed a reasonable measure at that point (I even have it as a main measure in a textbook on forecasting that I have coauthored). But then new findings came to light, and I decided that I needed to take them into account. So, nothing personal, just pragmatism.</p>
<p>As for the very last point, my apologies. I did not want to sound sarcastic, it was more like a joke, and, as it appears, not the best one. My bad.</p>
<p>As a very last note, to clear something up: I personally think that you did a very good job in organising the competition and writing the papers on that, especially given all the existing limitations. Still, this does not mean that there is no room for improvement&#8230; :)</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Vangelis Spiliotis		</title>
		<link>https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-113</link>

		<dc:creator><![CDATA[Vangelis Spiliotis]]></dc:creator>
		<pubDate>Tue, 03 Sep 2019 10:43:14 +0000</pubDate>
		<guid isPermaLink="false">https://openforecast.org/?p=2050#comment-113</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-108&quot;&gt;Ivan Svetunkov&lt;/a&gt;.

1. I don’t disagree that MASE has better properties than sMAPE. In the M4 paper we discuss the reasons of using both measures, noting that MASE is superior to sMAPE, while also stressing the limitations of the former in terms of interpretation. Personally, I’m not a sMAPE-fan either, but it’s useful for interpreting your results, can’t deny that, especially when MAPE or relative measures are not an option.

2. That will always be the case when using different error measures, whether the measures being compared are good or bad. Sure, we cannot neglect such differences, but do they matter in practice after all? I believe that the most important thing here is that both measures agree on which methods did best, providing the same results in terms of statistical significance. So, yes, one method may score position 2 or 3 depending on the measure used, we always knew that, but the conclusions about whether the examined method works systematically better than others remain the same.

3. Same with 2

4. As I said, it isn’t good. It’s just easy to interpret. That’s why it is used either way. You may use a different measure depending on your research question if you like. No problem.

5. As Nikolopoulos mentions in his commentary, first cut is the deepest. Time will tell.

6. Fair enough.

7. That’s a little bit controversial. You hate sMAPE, but have no problem with OWA, that involves sMAPE. Either way, like you said, participants accept the conditions when participating in a competition so, if OWA is the measure, just go for it.

8. Honestly, I don’t understand why you believe that the aim of M4 was to prove that ML doesn’t work. Its aim was clearly the same as that of the previous three competitions, i.e., identifying new accurate methods and promoting forecasting research and practice. Its motivation was also clear from the beginning till the end. The performance of ML was just evaluated as an alternative to traditional statistical methods, given the hype and the advances reported since the last M competition. Actually, if you read the M4 paper, you’ll see that one of the main findings of the M4 competition was that ML works, at least when applied in a smart way (e.g., by using elements of statistical models as well). Why witch-hunt ML anyway? I don’t understand what the motivations could be. Deceived? Really? Anyway, we appreciate the work done at CMAF so, whether or not you are personally willing to participate, we’ll be happy to receive a submission from your Centre.

9. You made your point. Being sarcastic adds no value. But you are free to do so if this puts a smile on your face.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-108">Ivan Svetunkov</a>.</p>
<p>1. I don’t disagree that MASE has better properties than sMAPE. In the M4 paper we discuss the reasons of using both measures, noting that MASE is superior to sMAPE, while also stressing the limitations of the former in terms of interpretation. Personally, I’m not a sMAPE-fan either, but it’s useful for interpreting your results, can’t deny that, especially when MAPE or relative measures are not an option.</p>
<p>2. That will always be the case when using different error measures, whether the measures being compared are good or bad. Sure, we cannot neglect such differences, but do they matter in practice after all? I believe that the most important thing here is that both measures agree on which methods did best, providing the same results in terms of statistical significance. So, yes, one method may score position 2 or 3 depending on the measure used, we always knew that, but the conclusions about whether the examined method works systematically better than others remain the same.</p>
<p>3. Same with 2</p>
<p>4. As I said, it isn’t good. It’s just easy to interpret. That’s why it is used either way. You may use a different measure depending on your research question if you like. No problem.</p>
<p>5. As Nikolopoulos mentions in his commentary, first cut is the deepest. Time will tell.</p>
<p>6. Fair enough.</p>
<p>7. That’s a little bit controversial. You hate sMAPE, but have no problem with OWA, that involves sMAPE. Either way, like you said, participants accept the conditions when participating in a competition so, if OWA is the measure, just go for it.</p>
<p>8. Honestly, I don’t understand why you believe that the aim of M4 was to prove that ML doesn’t work. Its aim was clearly the same as that of the previous three competitions, i.e., identifying new accurate methods and promoting forecasting research and practice. Its motivation was also clear from the beginning till the end. The performance of ML was just evaluated as an alternative to traditional statistical methods, given the hype and the advances reported since the last M competition. Actually, if you read the M4 paper, you’ll see that one of the main findings of the M4 competition was that ML works, at least when applied in a smart way (e.g., by using elements of statistical models as well). Why witch-hunt ML anyway? I don’t understand what the motivations could be. Deceived? Really? Anyway, we appreciate the work done at CMAF so, whether or not you are personally willing to participate, we’ll be happy to receive a submission from your Centre.</p>
<p>9. You made your point. Being sarcastic adds no value. But you are free to do so if this puts a smile on your face.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Ivan Svetunkov		</title>
		<link>https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-112</link>

		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 03 Sep 2019 09:28:50 +0000</pubDate>
		<guid isPermaLink="false">https://openforecast.org/?p=2050#comment-112</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-109&quot;&gt;Suraj Vissa&lt;/a&gt;.

Hi Suraj,

I think we need to distinguish two things: what the model produces and what the forecast corresponds to when we evaluate it. So, when you work with averages, you can easily aggregate forecasts, no matter what error measure you use (there might be some issues with non-linear models from the distributional point of view, but this is a different topic). But when it comes to evaluating the model using MAE for different levels, what you do is assess how your forecast performs in terms of the median of the true distribution for the specific levels. So, there should not be an issue here, as long as you are aware of what you do.

However, if you want to align your model estimation with the evaluation (i.e. produce median forecasts and evaluate models using MAE), then this is a different question, because then you will be dealing with medians produced by models, and the sum of medians is not the same as the median of sums. Everything becomes much more complicated in this case... You might need to resort to simulations in order to produce correct median forecasts.
			<content:encoded><![CDATA[<p>In reply to <a href="https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-109">Suraj Vissa</a>.</p>
<p>Hi Suraj,</p>
<p>I think we need to distinguish two things: what the model produces and what the forecast corresponds to when we evaluate it. So, when you work with averages, you can easily aggregate forecasts, no matter what error measure you use (there might be some issues with non-linear models from the distributional point of view, but this is a different topic). But when it comes to evaluating the model using MAE for different levels, what you do is assess how your forecast performs in terms of the median of the true distribution for the specific levels. So, there should not be an issue here, as long as you are aware of what you do.</p>
<p>However, if you want to align your model estimation with the evaluation (i.e. produce median forecasts and evaluate models using MAE), then this is a different question, because then you will be dealing with medians produced by models, and the sum of medians is not the same as the median of sums. Everything becomes much more complicated in this case&#8230; You might need to resort to simulations in order to produce correct median forecasts.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Suraj Vissa		</title>
		<link>https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-109</link>

		<dc:creator><![CDATA[Suraj Vissa]]></dc:creator>
		<pubDate>Tue, 03 Sep 2019 01:33:29 +0000</pubDate>
		<guid isPermaLink="false">https://openforecast.org/?p=2050#comment-109</guid>

					<description><![CDATA[Hi Ivan,

Thanks for the article. Some very nice insights on forecasting error measures. I have a question on the use of error measures which cause the forecast to be more closely aligned to the median of the distribution (e.g., MAE).

In supply chain, we often aggregate/disaggregate forecasts using hierarchical forecasting models. Mathematically, while we can add/subtract/average point forecasts that align to average (E.g., if we use MSE), we cannot extend the same to median-based point forecasts.

What is the impact of performing these operations on median-based point forecasts in the real-world? Is it still okay?

Thanks,
Suraj]]></description>
			<content:encoded><![CDATA[<p>Hi Ivan,</p>
<p>Thanks for the article. Some very nice insights on forecasting error measures. I have a question on the use of error measures which cause the forecast to be more closely aligned to the median of the distribution (e.g., MAE).</p>
<p>In supply chain, we often aggregate/disaggregate forecasts using hierarchical forecasting models. Mathematically, while we can add/subtract/average point forecasts that align to average (E.g., if we use MSE), we cannot extend the same to median-based point forecasts.</p>
<p>What is the impact of performing these operations on median-based point forecasts in the real-world? Is it still okay?</p>
<p>Thanks,<br />
Suraj</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Ivan Svetunkov		</title>
		<link>https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-108</link>

		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 02 Sep 2019 15:31:54 +0000</pubDate>
		<guid isPermaLink="false">https://openforecast.org/?p=2050#comment-108</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-107&quot;&gt;Vangelis Spiliotis&lt;/a&gt;.

Hi Vangelis,

1. I don&#039;t say that I really like MASE, but we already know what this measure is minimised with. It is not a question of taste, it is a question of statistics. So, it is good, because we know what it does. And yes, it is better than SMAPE; a few papers in the IJF show that.

2. Table 4 of your paper (https://www.sciencedirect.com/science/article/pii/S0169207019301128) demonstrates that the ranking is different between MASE and SMAPE. For example, Pawlikowski et al. is ranked as 2 instead of 5, when we use MASE instead of SMAPE. The results change based on different measures, mainly because they are focused on different central tendencies.

3. OWA and MASE also rank people differently. For example, Montero-Manso is 3rd, not 2nd, if MASE is used instead of OWA. Fiorucci is 4th, not 5th, etc. Once again, this is expected due to different measures, but the results are different, so we cannot just neglect them and say that the error measure is not important.

4. But my main critique is that we do not know what minimises SMAPE. So, we don&#039;t know in what terms the methods perform better. For example, if we had MSE-based measure, we could have said that some methods produce more accurate mean values, but in the case of M4, we cannot say anything specific. Furthermore, we know that SMAPE prefers over-forecasting to under-forecasting. So the best method according to SMAPE will probably slightly overshoot the data. Why is this good?

5. I&#039;ve never said that either of the competitions is a failure! And I explicitly mentioned in the comment above that I think that M3 was one of the best things happening in forecasting. So, you got the wrong impression here. M4 was interesting, but I don&#039;t find it as groundbreaking as M3.

6. One of my points in the post is that there is no perfect error measure. There are better ones and there are worse ones. MASE and measures similar to MASE can be used in research, but when it comes to practice, I would go with relative ones, because they are easier to interpret.

7. I have never said that OWA was not appropriate for the competition. I understand why you use it, and, given the selection of error measures, it sort of makes sense. And I don&#039;t complain about the error measures you used - that&#039;s your business. People participating in competitions accept the conditions; I did as well. I only point out the limitations of the error measures you use. What I don&#039;t understand is why you insist on using SMAPE, given all the critique it has received in the literature over the last 20 years...

8. I never said that I won&#039;t participate because of the error measures. Once again, there is a misunderstanding; the reason is different. Whatever the findings and conclusions of M4 are, it all comes down to the statement repeated by Spyros everywhere, over and over again: &quot;machine learning doesn’t work&quot; - which is neither fair nor correct. I participated in M4 for a different reason. You said at the beginning that the aim was to see how the new methods perform, and I wanted to support you in this by submitting some innovative things. But it appears that this was not the true aim. It seems that the main aim of M4 was to show that &quot;machine learning doesn’t work&quot;, but this was never vocalised in advance. So, now I feel that by participating in M4 I was deceived. My feelings would be quite different if you had stated the true hypotheses you wanted to test clearly in advance. Now, I don&#039;t want to participate in M5, because it might lead to something similar: Spyros will probably end up ranting again that &quot;machine learning doesn’t work&quot; or something like that. Why should I spend my time on that (especially given my workload for the upcoming year) when I can do something more useful?

9. Are you going to use SMAPE for intermittent demand? :)]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://openforecast.org/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/#comment-107">Vangelis Spiliotis</a>.</p>
<p>Hi Vangelis,</p>
<p>1. I&#8217;m not saying that I really like MASE, but we already know what this measure is minimised by. It is not a question of taste, it is a question of statistics. So, it is good, because we know what it does. And yes, it is better than SMAPE; a few papers in the IJF show that.</p>
<p>2. Table 4 of your paper (<a href="https://www.sciencedirect.com/science/article/pii/S0169207019301128" rel="nofollow ugc">https://www.sciencedirect.com/science/article/pii/S0169207019301128</a>) demonstrates that the ranking differs between MASE and SMAPE. For example, Pawlikowski et al. is ranked 2nd instead of 5th when we use MASE instead of SMAPE. The results change with different measures, mainly because the measures are focused on different central tendencies.</p>
<p>3. OWA and MASE also rank participants differently. For example, Montero-Manso is 3rd, not 2nd, if MASE is used instead of OWA; Fiorucci is 4th, not 5th, etc. Once again, this is expected with different measures, but the results are different, so we cannot just neglect them and say that the error measure is not important.</p>
<p>4. But my main critique is that we do not know what minimises SMAPE. So, we don&#8217;t know in what terms the methods perform better. For example, if we had an MSE-based measure, we could say that some methods produce more accurate mean values, but in the case of M4, we cannot say anything specific. Furthermore, we know that SMAPE prefers over-forecasting to under-forecasting, so the best method according to SMAPE will probably slightly overshoot the data. Why is this good?</p>
<p>5. I&#8217;ve never said that either of the competitions was a failure! And I explicitly mentioned in the comment above that I think M3 was one of the best things to happen in forecasting. So, you got the wrong impression here. M4 was interesting, but I don&#8217;t find it as groundbreaking as M3.</p>
<p>6. One of my points in the post is that there is no perfect error measure. There are better ones and there are worse ones. MASE and similar measures can be used in research, but when it comes to practice, I would go with relative measures, because they are easier to interpret.</p>
<p>7. I have never said that OWA was not appropriate for the competition. I understand why you use it, and, given the selection of error measures, it sort of makes sense. And I don&#8217;t complain about the error measures you used &#8211; that&#8217;s your business. People participating in competitions accept the conditions; I did as well. I only point out the limitations of the error measures you use. What I don&#8217;t understand is why you insist on using SMAPE, given all the critique it has received in the literature over the last 20 years&#8230;</p>
<p>8. I never said that I won&#8217;t participate because of the error measures. Once again, there is a misunderstanding; the reason is different. Whatever the findings and conclusions of M4 are, it all comes down to the statement repeated by Spyros everywhere, over and over again: &#8220;machine learning doesn’t work&#8221; &#8211; which is neither fair nor correct. I participated in M4 for a different reason. You said at the beginning that the aim was to see how the new methods perform, and I wanted to support you in this by submitting some innovative things. But it appears that this was not the true aim. It seems that the main aim of M4 was to show that &#8220;machine learning doesn’t work&#8221;, but this was never vocalised in advance. So, now I feel that by participating in M4 I was deceived. My feelings would be quite different if you had stated the true hypotheses you wanted to test clearly in advance. Now, I don&#8217;t want to participate in M5, because it might lead to something similar: Spyros will probably end up ranting again that &#8220;machine learning doesn’t work&#8221; or something like that. Why should I spend my time on that (especially given my workload for the upcoming year) when I can do something more useful?</p>
<p>9. Are you going to use SMAPE for intermittent demand? :)</p>
]]></content:encoded>
		
			</item>
	</channel>
</rss>
