<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Archives Forecast evaluation - Open Forecasting</title>
	<atom:link href="https://openforecast.org/category/forecasting-theory/forecast-evaluation/feed/" rel="self" type="application/rss+xml" />
	<link>https://openforecast.org/category/forecasting-theory/forecast-evaluation/</link>
	<description>How to look into the future</description>
	<lastBuildDate>Mon, 23 Feb 2026 13:36:12 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2015/08/cropped-usd-05-32x32.png&amp;nocache=1</url>
	<title>Archives Forecast evaluation - Open Forecasting</title>
	<link>https://openforecast.org/category/forecasting-theory/forecast-evaluation/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Scaling of error measures</title>
		<link>https://openforecast.org/2026/02/23/scaling-of-error-measures/</link>
					<comments>https://openforecast.org/2026/02/23/scaling-of-error-measures/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 23 Feb 2026 13:36:12 +0000</pubDate>
				<category><![CDATA[Forecast evaluation]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[error measures]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=4054</guid>

					<description><![CDATA[<p>Apparently, we need to talk about scaling of error measures because this is not as obvious as it seems. In the forecasting literature, since its early days, there has been a general consensus that forecast errors from individual time series should not be analysed and aggregated as is. This is because you [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/02/23/scaling-of-error-measures/">Scaling of error measures</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Apparently, we need to talk about scaling of error measures because this is not as obvious as it seems.</p>
<p>In the forecasting literature, since its early days, there has been a general consensus that forecast errors from individual time series should not be analysed and aggregated as is. This is because you can have very different time series capturing the dynamics of very different processes.</p>
<p>Indeed, if you forecast sales of apples in kilograms, your actual values would be in kilograms, and your point forecasts would be in the same units. Subtracting one from the other tells us how many kilograms of apples we missed with our forecast. But if we then average the forecast errors for apples and beer, we would be aggregating quantities in different units, which contradicts basic aggregation principles.</p>
<p>Furthermore, if the company sells thousands of kilograms of apples but only a few jet engines, aggregating forecast errors across the two (e.g. 3000 vs 3) might introduce all sorts of issues, because the model&#8217;s performance on apples might mask its performance on jet engines. Yet jet engines are much more expensive than apples, and forecasting them accurately might be more important for the company than forecasting apples.</p>
<p>So the forecasting literature has agreed that forecast errors need to be scaled somehow, to make them unitless and not to distort the performance of models on time series with different volumes. There are several ways of doing that, some poor and some reasonable. The state of the art at the moment is to divide error measures by some in-sample statistic, to avoid potential holdout-sample distortions. Using mean absolute differences (MAD) for this (thus ending up with MASE or RMSSE) is considered the standard. A while ago, <a href="/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/">I wrote a post about the advantages and disadvantages of several scaling methods</a>.</p>
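<p>As a sketch of this standard scaling, here is how MASE and RMSSE can be computed, scaling the holdout error by the in-sample mean absolute (or squared) first differences. The function names and the example numbers in the comments are mine, not from this post:</p>

```python
import numpy as np

def mase(train, test, forecast):
    # Scale the holdout MAE by the in-sample mean absolute first difference
    # (the in-sample MAE of the one-step-ahead Naive method)
    scale = np.mean(np.abs(np.diff(train)))
    return np.mean(np.abs(test - forecast)) / scale

def rmsse(train, test, forecast):
    # Scale the holdout MSE by the in-sample mean squared first difference,
    # then take the square root to return to a linear scale
    scale = np.mean(np.diff(train) ** 2)
    return np.sqrt(np.mean((test - forecast) ** 2) / scale)
```

<p>Both measures are unitless, so they can be averaged across products that are measured in different units.</p>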
<p>But there is one method that I haven&#8217;t looked at and which is not very well discussed in the forecasting literature. It relies on the monetary value of forecasts. We could multiply each individual forecast error &#8220;e&#8221; by the price of the product &#8220;p&#8221; (thus moving to the missed income per product) and then divide everything by the overall income (price times quantity) from different products. This can be written as:</p>
<p>\begin{equation}<br />
\text{monetary Mean Error} = \frac{\sum_{j=1}^n (p_j \times e_j)} {\sum_{j=1}^n (p_j \times q_j)}<br />
\end{equation}</p>
<p>(the above formula can be modified to use squares or absolute values of the error). This way we switch from the original units to monetary values, and the measure tells you what proportion of the overall income was missed. This is useful because it connects model performance with managerial decisions and takes the value of each product into account (thus we do not mask the expensive jet engines with the cheap apples).</p>
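<p>The formula above can be sketched in a few lines of code. The function name and the sign convention (error = actual minus forecast) are my assumptions, not something fixed in the post:</p>

```python
import numpy as np

def monetary_mean_error(prices, actuals, forecasts):
    # Numerator: price-weighted forecast errors, i.e. the income missed per product
    # Denominator: overall income, the sum of price times quantity sold
    prices = np.asarray(prices, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    errors = actuals - np.asarray(forecasts, dtype=float)
    return np.sum(prices * errors) / np.sum(prices * actuals)
```

<p>As noted above, squaring the errors or taking their absolute values in the numerator gives squared or absolute variants of the same idea.</p>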
<p>However, it has a potential issue similar to that of MAE/Mean or wMAPE: if the sales of a product are not stationary, the denominator changes, driving the proportion up or down irrespective of how good the forecast is. I am not sure whether this needs to be addressed, because there is an argument that if the income from a product has increased while the error hasn&#8217;t changed, then the proportion of the missed income has decreased, which makes sense. But if we do need to address it, we can use the MAD multiplied by price in the denominator instead. In fact, something similar was done in the <a href="https://doi.org/10.1016/j.ijforecast.2021.11.013">M5 competition</a>, which used a weighted RMSSE relying on the income from each product over the last 4 weeks of data.</p>
<p>But here is one more interesting thing about this error measure. If we <strong>assume that prices for all products are exactly the same</strong>, they disappear from the numerator and the denominator, leaving us with just the sum of errors divided by the overall sales of all products. This still maintains the original idea of the proportion of the missed income, but now rests on a very strong assumption, which is probably not correct in real life (apples and engines at the same price?). Furthermore, this would again mask the performance of the model on the expensive products. I personally don&#8217;t like this measure and find the assumption unrealistic and potentially misleading. Having said that, I can see some cases where it could still be acceptable and useful (e.g. similar products with similar dynamics and similar prices).</p>
<p>Summarising:</p>
<ol>
<li>If you are conducting a forecasting experiment without a specific context, I&#8217;d recommend using RMSSE or some other similar measure with scaling.</li>
<li>If you have prices of products, income-based scaling might be more informative.</li>
<li>Setting all prices to the same value does not sound appealing to me, but I understand that there is a context where this might work.</li>
</ol>
<p>Message <a href="https://openforecast.org/2026/02/23/scaling-of-error-measures/">Scaling of error measures</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/02/23/scaling-of-error-measures/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Don’t use MAE-based error measures for intermittent demand!</title>
		<link>https://openforecast.org/2025/01/21/don-t-use-mae-based-error-measures-for-intermittent-demand/</link>
					<comments>https://openforecast.org/2025/01/21/don-t-use-mae-based-error-measures-for-intermittent-demand/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 21 Jan 2025 12:02:06 +0000</pubDate>
				<category><![CDATA[Forecast evaluation]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[error measures]]></category>
		<category><![CDATA[intermittent demand]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3768</guid>

					<description><![CDATA[<p>I’m currently doing a literature review for one of my papers on intermittent demand forecasting with machine learning, and I’ve noticed a recurring fundamental mistake in several recently published papers, even in respectable peer-reviewed journals. The mistake? Using error measures based on the Mean Absolute Error (MAE). This is a crime against humanity when [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/01/21/don-t-use-mae-based-error-measures-for-intermittent-demand/">Don’t use MAE-based error measures for intermittent demand!</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>I’m currently doing a literature review for one of my papers on intermittent demand forecasting with machine learning, and I’ve noticed a recurring fundamental mistake in several recently published papers, even in respectable peer-reviewed journals.</p>
<p>The mistake? Using error measures based on the Mean Absolute Error (MAE). This is a crime against humanity when working with intermittent demand. I’ve explained this issue multiple times before (<a href="/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/">here</a>, <a href="/2024/04/03/stop-reporting-several-error-measures-just-for-the-sake-of-them/">here</a>, and <a href="/2024/07/16/point-forecast-evaluation-state-of-the-art/">here</a>), but it appears that this idea needs to be repeated over and over again. Let me explain.</p>
<p>MAE is minimised by the median. In the case of intermittent demand, the median can often be zero. If you use MAE (or scaled measures like MASE or sMAE) to evaluate forecasts and compare, for example, Croston, TSB, ETS, and an Artificial Neural Network (ANN), you may find the ANN outperforming the others. However, this could simply mean that the ANN produces forecasts closer to zero than the alternatives. This is not what you want for intermittent demand! The goal is typically to capture the structure correctly and produce conditional mean forecasts. Instead, by relying on MAE, you might conclude: &#8220;We won’t sell anything in the next two weeks&#8221;, implying that there’s no need to stock products. This is obviously wrong and unhelpful.</p>
<p>Attached to this post is a figure showing three forecasts for an intermittent demand series:</p>
<ul>
<li>The blue line represents the mean of the data;</li>
<li>The green line is a forecast from an Artificial Neural Network;</li>
<li>The red line is the zero forecast.</li>
</ul>
<p>In the figure’s legend, you’ll see error measures indicating that the zero forecast performs best in terms of MAE, followed by the ANN, and lastly, the mean forecast. Based on MAE, the conclusion would be: &#8220;We won’t sell anything, so don’t bother stocking the product&#8221;. But this outcome occurs solely because 12 out of 20 values in the holdout are zeros, making the median zero as well.</p>
<p>On the other hand, RMSE provides a more reasonable evaluation, showing that the mean of the data is more informative and preferable to the other methods.</p>
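<p>The effect is easy to reproduce with made-up numbers. The series below is not the one from the figure, just an illustrative holdout with 12 zeroes out of 20 observations, so the median is zero:</p>

```python
import numpy as np

# Illustrative intermittent holdout: 12 of 20 values are zero
holdout = np.array([0] * 12 + [3, 2, 5, 1, 4, 2, 3, 6], dtype=float)

mean_fc = np.full(holdout.size, holdout.mean())  # flat forecast at the mean
zero_fc = np.zeros(holdout.size)                 # flat zero forecast

def mae(forecast):
    return np.mean(np.abs(holdout - forecast))

def rmse(forecast):
    return np.sqrt(np.mean((holdout - forecast) ** 2))

# MAE rewards the zero forecast (it sits at the median of the holdout),
# while RMSE prefers the forecast at the mean
```

<p>Running this confirms the ranking described above: the zero forecast wins under MAE, while the mean forecast wins under RMSE.</p>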
<p>The brief summary of this post is: <strong>Don’t use MAE-based error measures for intermittent demand!</strong> (Insert as many exclamation marks as you’d like!)</p>
<p>P.S. Actually, as a general rule, avoid using MAE for evaluating methods that produce mean forecasts. For more details, check out <a href="/2024/04/03/stop-reporting-several-error-measures-just-for-the-sake-of-them/">this post</a>.</p>
<p>P.P.S. What frustrates me a lot is that the reviewers of those papers did nothing to fix this issue, which means that they are clueless about it as well.</p>
<p>Message <a href="https://openforecast.org/2025/01/21/don-t-use-mae-based-error-measures-for-intermittent-demand/">Don’t use MAE-based error measures for intermittent demand!</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/01/21/don-t-use-mae-based-error-measures-for-intermittent-demand/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Why Naive is not a good benchmark for intermittent demand</title>
		<link>https://openforecast.org/2024/12/02/why-naive-is-not-a-good-benchmark-for-intermittent-demand/</link>
					<comments>https://openforecast.org/2024/12/02/why-naive-is-not-a-good-benchmark-for-intermittent-demand/#comments</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 02 Dec 2024 14:05:53 +0000</pubDate>
				<category><![CDATA[Forecast evaluation]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[intermittent demand]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3739</guid>

					<description><![CDATA[<p>While Naive is considered a standard benchmark in forecasting, there is a case where it might not be a good one: intermittent demand. And here is why I think so. Naive is a forecasting method that uses the last available observation as a forecast for the next ones. It does not have any parameters to [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2024/12/02/why-naive-is-not-a-good-benchmark-for-intermittent-demand/">Why Naive is not a good benchmark for intermittent demand</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>While Naive is considered a standard benchmark in forecasting, there is a case where it might not be a good one: intermittent demand. And here is why I think so.</p>
<p>Naive is a forecasting method that uses the last available observation as the forecast for the next ones. It does not have any parameters to estimate, it does not require training, and it can be applied to a sample of any size (even if you only have one observation). When you deal with regular demand, it makes perfect sense to use Naive as a benchmark: it costs nothing in terms of computational time, you get a forecast of demand, and if you cannot beat it, you should rethink your forecasting process.</p>
<p>However, in the case of intermittent demand, the demand itself does not happen on every observation. As a result, when Naive copies the last available value, it can reproduce either a proper non-zero demand or just the absence of demand. The latter implies that nobody bought our product today, and nobody will in the next week (or whatever forecast horizon we use). In the following image, Naive will be the most accurate forecasting method, because the final observation of the training set was zero, and there were no sales in the test set:</p>
<div id="attachment_3741" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/12/2024-11-01-Naive-Intermittent-01.png&amp;nocache=1"><img fetchpriority="high" decoding="async" aria-describedby="caption-attachment-3741" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/12/2024-11-01-Naive-Intermittent-01-300x180.png&amp;nocache=1" alt="Naive forecast on intermittent demand" width="300" height="180" class="size-medium wp-image-3741" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/12/2024-11-01-Naive-Intermittent-01-300x180.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/12/2024-11-01-Naive-Intermittent-01-768x461.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/12/2024-11-01-Naive-Intermittent-01.png&amp;nocache=1 1000w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-3741" class="wp-caption-text">Naive forecast on intermittent demand</p></div>
<p>But is this useful? To answer this question, we need to understand what specifically we are forecasting when we deal with demand with zeroes.</p>
<p>As discussed in a <a href="/2024/11/18/why-zeroes-happen/">previous post</a>, zeroes can occur for different reasons: some happen because nobody came to buy the product (naturally occurring zeroes), while others appear because of some sort of disruption (e.g. a stockout) or because a product was discontinued (artificially occurring zeroes). The two situations are fundamentally different, but if we work with sales data alone (no stock information), it can be hard to tell them apart. Naive might work perfectly in both cases, forecasting no sales for the next few observations, and it can be 100% right in some of them. But the problem is that this is not useful. Even if we cannot beat Naive on data with zeroes, it does not mean that we should use it, because there is a chance that we have stockouts in the holdout period. If that&#8217;s the case, we might be doing something fundamentally wrong. After all, &#8220;we will not sell anything&#8221; is a simple statement, but not ordering products based on it could be a mistake, because &#8220;no sales&#8221; is not the same as &#8220;no demand&#8221;. In fact, if Naive performs very well on your series with zeroes, this might indicate that your evaluation is flawed and you need to clean the data, removing the discontinued and out-of-stock items from it.</p>
<p>There are three lessons here:</p>
<ol>
<li>we should forecast demand, not sales;</li>
<li>we should measure accuracy on the data with naturally occurring zeroes &#8211; do data cleaning before setting up your evaluation;</li>
<li>it&#8217;s better to use a benchmark that tries to capture demand, rather than one that reproduces sales.</li>
</ol>
<p>Arguably, a more helpful benchmark forecast would be the one in the following image:</p>
<div id="attachment_3742" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/12/2024-11-01-Naive-Intermittent-02.png&amp;nocache=1"><img decoding="async" aria-describedby="caption-attachment-3742" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/12/2024-11-01-Naive-Intermittent-02-300x180.png&amp;nocache=1" alt="Forecast for intermittent demand from the SMA" width="300" height="180" class="size-medium wp-image-3742" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/12/2024-11-01-Naive-Intermittent-02-300x180.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/12/2024-11-01-Naive-Intermittent-02-768x461.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/12/2024-11-01-Naive-Intermittent-02.png&amp;nocache=1 1000w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-3742" class="wp-caption-text">Forecast for intermittent demand from the SMA</p></div>
<p>The forecast above was generated using the <a href="/2024/10/28/why-is-it-hard-to-beat-simple-moving-average/">Simple Moving Average</a>, and it tells us that there is a demand for the product over the next 13 days. Yes, it is less accurate than Naive, but it gives an estimate of the expected demand, not the expected sales.</p>
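<p>The two benchmarks can be sketched as follows. The window of 5 and the horizon of 13 are illustrative defaults of mine, not values from the post:</p>

```python
import numpy as np

def naive_forecast(y, h=13):
    # Naive: repeat the last observation over the horizon.
    # If the series ends in a zero, this forecasts no demand at all.
    return np.full(h, y[-1])

def sma_forecast(y, window=5, h=13):
    # Simple Moving Average: repeat the mean of the last `window` observations,
    # giving an estimate of the expected demand per period
    return np.full(h, np.mean(y[-window:]))
```

<p>On an intermittent series ending in a zero, Naive forecasts no demand at all, while the SMA still produces a positive expected demand, which is what we want from a benchmark here.</p>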
<p>Message <a href="https://openforecast.org/2024/12/02/why-naive-is-not-a-good-benchmark-for-intermittent-demand/">Why Naive is not a good benchmark for intermittent demand</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2024/12/02/why-naive-is-not-a-good-benchmark-for-intermittent-demand/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>What about the training/test sets?</title>
		<link>https://openforecast.org/2024/10/02/what-about-the-training-test-sets/</link>
					<comments>https://openforecast.org/2024/10/02/what-about-the-training-test-sets/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Wed, 02 Oct 2024 14:32:25 +0000</pubDate>
				<category><![CDATA[Forecast evaluation]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[error measures]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3680</guid>

					<description><![CDATA[<p>Another question my students sometimes ask is how to define the sizes for the training and test sets in a forecasting experiment. If you&#8217;ve done data mining or machine learning, you&#8217;re likely familiar with this concept. But when it comes to forecasting, there are a few nuances. Let&#8217;s discuss. First and foremost, in forecasting, the [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2024/10/02/what-about-the-training-test-sets/">What about the training/test sets?</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Another question my students sometimes ask is how to define the sizes for the training and test sets in a forecasting experiment. If you&#8217;ve done data mining or machine learning, you&#8217;re likely familiar with this concept. But when it comes to forecasting, there are a few nuances. Let&#8217;s discuss.</p>
<p>First and foremost, in forecasting, the test set (or &#8220;holdout sample&#8221;) should always be at the end of your data, while the training set (or &#8220;in-sample&#8221;) comes before it, because forecasting is about the future, not the past. This may seem obvious, but for people unfamiliar with time series, it can be surprising.</p>
<p>Also, the training and test sets should be continuous, without gaps, to avoid breaking the structure of the data. Machine learning techniques like k-fold cross-validation should account for this (no random picking of observations from the middle of the time series). The <a href="https://doi.org/10.1016/j.ijforecast.2015.12.011">relevant paper</a> here is by Devon Barrow and Sven Crone, who explored several cross-validation techniques for forecasting with neural networks.</p>
<p>As for the sizes of the sets, there’s no strict rule or definitive theory. Some people advocate a 70%/30% split, but this is arbitrary. In practice, you should consider the needs of the business and design your experiment accordingly. For example, if your forecast horizon is 14 days ahead (remember <a href="/2024/09/24/how-to-choose-forecast-horizon/">this post</a>?), your test set should have at least 14 observations for daily data. However, if you use exactly 14 observations, you&#8217;ll only be able to do a &#8220;fixed origin&#8221; evaluation &#8211; forecasting once and stopping there. This can be unreliable because a model might perform well by chance, and you wouldn’t see how it behaves in different situations (e.g., performing well in summer but not in other seasons).</p>
<p>A better approach is to make the test set longer than the forecast horizon and evaluate the model&#8217;s performance over time, for example, throughout a full year, using a rolling origin evaluation (see more <a href="/adam/rollingOrigin.html">here</a>). This gives you more data for analysis (a distribution of error measures) and shows whether the model performs consistently across different periods.</p>
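<p>A minimal sketch of a rolling origin evaluation (the expanding-window variant; the function names and the in-sample-mean benchmark are my assumptions, not code from the book):</p>

```python
import numpy as np

def rolling_origin_rmse(y, n_test, h, forecaster):
    # Expanding-window rolling origin: start from the first origin that leaves
    # n_test observations for testing, forecast h steps ahead, record the RMSE,
    # then move the origin forward by one observation and repeat
    n = len(y)
    rmses = []
    for origin in range(n - n_test, n - h + 1):
        train, test = y[:origin], y[origin:origin + h]
        fc = forecaster(train, h)
        rmses.append(np.sqrt(np.mean((test - fc) ** 2)))
    return np.array(rmses)

def mean_forecaster(train, h):
    # A simple benchmark: the in-sample mean repeated over the horizon
    return np.full(h, np.mean(train))
```

<p>The result is a distribution of RMSEs rather than a single number, which is exactly what lets you check whether a model performs consistently across different periods.</p>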
<p>Unfortunately, there is a potential problem with that: in practice, many companies only store up to three years of data, often thinking that anything older is irrelevant. This makes life more difficult for forecasters. With limited data, it may be impossible to fit or compare some models (e.g. seasonal ARIMA/ETS do not always work on fewer than three seasonal cycles of data). In such cases, your evaluation options become limited.</p>
<p>A possible solution is to train global models across multiple smaller time series, while keeping larger test sets for each series. For example, in cases where there are shared seasonal patterns, dynamic models can be adjusted to use cross-sectional seasonal indices. In case of ETS, John Boylan, Huijing Chen and I developed <a href="/2022/05/09/vector-exponential-smoothing-with-pic-restrictions/">Vector ETS</a>.</p>
<p>Lastly, to all practitioners out there: please, store as much data as possible! If an analyst or data scientist doesn’t need older data, they can always discard it. But in most cases, we are hungry for data, so the more, the merrier!</p>
<p>Message <a href="https://openforecast.org/2024/10/02/what-about-the-training-test-sets/">What about the training/test sets?</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2024/10/02/what-about-the-training-test-sets/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Straight line is just fine</title>
		<link>https://openforecast.org/2024/09/03/straight-line-is-just-fine/</link>
					<comments>https://openforecast.org/2024/09/03/straight-line-is-just-fine/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 03 Sep 2024 13:44:13 +0000</pubDate>
				<category><![CDATA[Forecast evaluation]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[error measures]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3661</guid>

					<description><![CDATA[<p>Look at the image above. Which forecast seems more appropriate: the red straight line (1) or the purple wavy line (2)? Many demand planners might choose option 2, thinking it better captures the ups and downs. But, in many cases, the straight line is just fine. Here’s why. In a previous post on Structure vs. [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2024/09/03/straight-line-is-just-fine/">Straight line is just fine</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Look at the image above. Which forecast seems more appropriate: the red straight line (1) or the purple wavy line (2)? Many demand planners might choose option 2, thinking it better captures the ups and downs. But, in many cases, the straight line is just fine. Here’s why.</p>
<p>In a previous <a href="/2024/08/13/structure-vs-noise-a-fundamental-concept-in-forecasting/">post on Structure vs. Noise</a>, we talked about how a time series is made up of different components (such as level, trend, and seasonality), and how the main goal of a point forecast is to capture the structure of the data, not the noise. Noise is unpredictable, and it should be treated by capturing uncertainty around the point forecasts (e.g., prediction intervals).</p>
<p>So the answer to the question at the very beginning of this post comes down to understanding what sort of structure we have in the data. In the series in the image, the only structure is the level (average sales). There&#8217;s no obvious trend, no seasonality, no apparent outliers, and we have no promotional information or other explanatory variables. The best you can do in that situation is capture the level correctly and produce a straight line (parallel to the x-axis) for the next 10 observations.</p>
<p>In this case, we used our judgment to decide what’s appropriate. That works well when you’re dealing with just a few time series. <a href="https://doi.org/10.1016/j.jom.2018.05.005">Petropoulos et al. (2018)</a> showed that humans are quite good at selecting models in such a task as above. But what do you do when you have thousands or even millions of time series?</p>
<p>The standard approach today is to apply several models or methods and choose the one that performs best on a holdout sample according to an error measure such as RMSE (Root Mean Squared Error; see, for example, <a href="/2024/04/24/best-practice-for-forecasts-evaluation-for-business/">this post</a>). In our example, the red line produced a forecast with an RMSE of 10.33, while the purple line had an RMSE of 10.62, suggesting that the red line is more accurate. However, relying on only one evaluation can be misleading, because just by chance we can get a better forecast from a model that overfits the data.</p>
<p>To address this, we can use a technique called &#8220;rolling origin evaluation&#8221; (<a href="https://doi.org/10.1016/S0169-2070(00)00065-0">Tashman, 2000</a>). The idea is to fit the model to the training data, evaluate its performance on a test set over a specific horizon (e.g., the next 10 days), then add one observation from the test set to the training set and repeat the process. This way, we gather a distribution of RMSEs, leading to a more reliable conclusion about a model’s performance. Nikos Kourentzes has created a neat visualization of this process:</p>
<p><img decoding="async" src="https://openforecast.org/adam/images/03-ROAnimation.gif"  width="500" height="210" class="aligncenter size-medium wp-image-3662" /></p>
<p>For more details with examples in R, you can check out <a href="https://openforecast.org/adam/rollingOrigin.html">Section 2.4</a> of my book.</p>
<p>After doing a rolling origin evaluation, you might find that the straight line is indeed the best option for your data. That’s perfectly fine &#8211; sometimes, simplicity is all you need. But then the real question becomes: what will you do with the point forecasts you’ve produced?</p>
<p>Message <a href="https://openforecast.org/2024/09/03/straight-line-is-just-fine/">Straight line is just fine</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2024/09/03/straight-line-is-just-fine/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Point Forecast Evaluation: State of the Art</title>
		<link>https://openforecast.org/2024/07/16/point-forecast-evaluation-state-of-the-art/</link>
					<comments>https://openforecast.org/2024/07/16/point-forecast-evaluation-state-of-the-art/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 16 Jul 2024 11:24:06 +0000</pubDate>
				<category><![CDATA[Forecast evaluation]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[error measures]]></category>
		<category><![CDATA[papers]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3623</guid>

					<description><![CDATA[<p>I have summarised several posts on point forecasts evaluation in an article for the Foresight journal. Mike Gilliland, being the Editor-in-Chief of the journal, contributed to the paper a lot, making it read much smoother, but preferred not to be included as the co-author. This article was recently published in the issue 74 for Q3:2024. [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2024/07/16/point-forecast-evaluation-state-of-the-art/">Point Forecast Evaluation: State of the Art</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>I have summarised several posts on point forecast evaluation in an article for the <a href="https://forecasters.org/foresight/">Foresight</a> journal. Mike Gilliland, the Editor-in-Chief of the journal, contributed a lot to the paper, making it read much smoother, but preferred not to be included as a co-author. The article was recently published in issue 74 (Q3 2024). I attach the author copy to this post just because I can. Here is <a href="/wp-content/uploads/2024/07/Svetunkov-2024-Point-Forecast-Evaluation-State-of-the-Art.pdf">the direct link</a>.</p>
<p>Here are the Key Points from the article:</p>
<ul>
<li>Evaluation is important for tracking forecast process performance and understanding whether changes (to forecasts, models, or the overall process) are needed.</li>
<li>Understand what kind of forecast our models produce, and measure it properly. Most likely, our approach produces the mean (rather than the median) as a point forecast, so root mean squared error (RMSE) should be used to evaluate it.</li>
<li>To aggregate the error measure across several products, you need to scale it. A reliable way of scaling is to divide the selected error measure by the mean absolute differences of the training data. This way we get rid of the scale and units of the original measure and make sure that its value does not change substantially if we have trend in the data.</li>
<li>Avoid MAPE!</li>
<li>To make decisions based on your error measure, consider using the FVA framework, directly comparing the performance of your forecasting approach with the performance of some simple benchmark method.</li>
</ul>
<p><strong>Disclaimer</strong>: This article originally appeared in Foresight, Issue 74 (forecasters.org/foresight) and is made available with permission of Foresight and the International Institute of Forecasters.</p>
<p>Message <a href="https://openforecast.org/2024/07/16/point-forecast-evaluation-state-of-the-art/">Point Forecast Evaluation: State of the Art</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2024/07/16/point-forecast-evaluation-state-of-the-art/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Don&#8217;t forget about bias!</title>
		<link>https://openforecast.org/2024/05/07/don-t-forget-about-bias/</link>
					<comments>https://openforecast.org/2024/05/07/don-t-forget-about-bias/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 07 May 2024 11:43:36 +0000</pubDate>
				<category><![CDATA[Forecast evaluation]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[error measures]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3531</guid>

					<description><![CDATA[<p>So far, we&#8217;ve discussed forecasts evaluation, focusing on the precision of point forecasts. However, there are many other dimensions in the evaluation that can provide useful information about your model&#8217;s performance. One of them is bias, which we&#8217;ll explore today. Introduction But before that, why should we bother with bias? Research suggests that bias is [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2024/05/07/don-t-forget-about-bias/">Don&#8217;t forget about bias!</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>So far, we&#8217;ve discussed forecast evaluation, focusing on the accuracy of point forecasts. However, there are many other dimensions in the evaluation that can provide useful information about your model&#8217;s performance. One of them is bias, which we&#8217;ll explore today.</p>
<h3>Introduction</h3>
<p>But before that, why should we bother with bias? Research suggests that bias is more closely related to operational performance than accuracy is. For example, <a href="https://doi.org/10.1111/deci.12208">Sanders &#038; Graman (2016)</a> discovered that an increase in bias leads to an exponential rise in operational costs, whereas a similar deterioration in accuracy results in a linear cost increase. <a href="https://doi.org/10.1016/j.ijpe.2019.107597">Kourentzes et al. (2020)</a> estimated forecasting models using different loss functions, including one based on inventory costs, and found that better inventory performance was associated with lower bias. These findings indicate that high bias impacts supply chain and inventory costs more substantially than low accuracy does. Thus, minimizing bias in your model is crucial.</p>
<p>So, how can we measure it?</p>
<p>When producing a point forecast and comparing it to actual values, you get a collection of forecast errors. Averaging the squares of these errors gives the Mean Squared Error (MSE). If you average their absolute values, you obtain the Mean Absolute Error (MAE). Both measures assess the variability of actual values around your forecast, but in different ways. But we can also measure whether the model systematically overshoots or undershoots the actual values. To do that, we can average the errors without removing their signs, resulting in the Mean Error (ME), which measures forecast bias.</p>
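<p>For illustration, here is a minimal Python sketch of the three measures on a made-up series (the numbers are not from the post):</p>

```python
import numpy as np

actual   = np.array([100, 102, 98, 105, 110])
forecast = np.array([103, 103, 103, 103, 103])

errors = actual - forecast     # sign convention: actual minus forecast
me  = errors.mean()            # bias: negative = overshooting, positive = undershooting
mae = np.abs(errors).mean()    # accuracy around the forecast (absolute errors)
mse = (errors ** 2).mean()     # accuracy around the forecast (squared errors)

print(me, mae, mse)            # 0.0 3.6 17.6
```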
<p>A simple interpretation of the ME is that it shows whether the forecast consistently overshoots (negative value) or undershoots (positive value) the actual values. An ME of zero suggests that, on average, the point forecast passes roughly through the middle of the data. Here is a visual example of the three situations:</p>
<div id="attachment_3533" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-30-bias-example.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-3533" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-30-bias-example-300x175.png&amp;nocache=1" alt="Example of three forecasts with different bias" width="300" height="175" class="size-medium wp-image-3533" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-30-bias-example-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-30-bias-example-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-30-bias-example-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-30-bias-example.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-3533" class="wp-caption-text">Example of three forecasts with different bias</p></div>
<p>In the plot above, we have three cases:</p>
<ol>
<li>Forecast is negatively biased, overshooting the data with ME = -166.42 (the blue line);</li>
<li>Forecast is almost unbiased, going through the middle of the data (but not capturing the structure correctly) with ME = 0.45 (the purple line);</li>
<li>Forecast is positively biased, undershooting the data with ME = 176.97 (the red line).</li>
</ol>
<p>If we were to make a decision only based on the ME, we would need to say that the second forecast (purple line) is unbiased and should be preferred. The problem is that the ME does not measure how well the forecast captures the structure or how close it is to the actual data. This is why bias should not be used on its own, but rather in combination with some accuracy measure (<a href="/2024/04/24/best-practice-for-forecasts-evaluation-for-business/">e.g. RMSE</a>). Arguably, a much better forecast is the one shown below:</p>
<div id="attachment_3534" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-30-bias-example-good.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-3534" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-30-bias-example-good-300x175.png&amp;nocache=1" alt="An example of an unbiased and accurate forecast" width="300" height="175" class="size-medium wp-image-3534" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-30-bias-example-good-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-30-bias-example-good-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-30-bias-example-good-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-30-bias-example-good.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-3534" class="wp-caption-text">An example of an unbiased and accurate forecast</p></div>
<p>In this figure, we see that the forecast line not only goes through the data in the holdout, but also accurately captures the upward trend, achieving an ME of 0.28 and the lowest RMSE among the discussed forecasts.</p>
<p>So, as we see, it is indeed important to track both bias and accuracy. What&#8217;s next?</p>
<h3>How to aggregate Mean Error</h3>
<p>Well, we might need to aggregate the mean error to get an overall impression about the performance of our model across several time series. How do we do that?</p>
<p>If you are thinking of calculating the &#8220;Mean Percentage Error&#8221; (similarly to MAPE, but without the absolute value), then don&#8217;t! It will have the same problems caused by dividing the error by the actuals (as discussed <a href="/2024/04/17/avoid-using-mape/">here</a>, for example). Instead, it&#8217;s better to use a reliable scaling method. At the very least, you could divide the mean error by the in-sample mean to get something called a &#8220;scaled Mean Error&#8221; (sME). An even better option is to divide the ME by the mean absolute differences of the in-sample data, which makes the measure robust to a potential trend (similar to how it was done in MASE by <a href="https://doi.org/10.1016/j.ijforecast.2006.03.001">Hyndman &#038; Koehler, 2006</a>).</p>
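<p>A minimal Python sketch of the sME calculation (the data and the function name are made up for illustration):</p>

```python
import numpy as np

def scaled_me(actual, forecast, insample):
    """Mean Error scaled by the mean absolute first differences of the
    in-sample data (MASE-style scaling; illustrative sketch only)."""
    me = np.mean(actual - forecast)
    scale = np.mean(np.abs(np.diff(insample)))
    return me / scale

insample = np.array([10.0, 12.0, 11.0, 14.0, 13.0])  # training data
actual   = np.array([15.0, 14.0])                    # holdout
forecast = np.array([13.0, 13.0])
print(scaled_me(actual, forecast, insample))
```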
<p>After that, you will end up with distributions of scaled MEs for various forecasting approaches across different products, like the one shown in the image below (we discussed a similar idea in <a href="/2024/03/27/what-does-lower-error-measure-really-mean/">this post</a>):</p>
<div id="attachment_3537" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-29-sME-boxplots-zoomed.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-3537" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-29-sME-boxplots-zoomed-300x175.png&amp;nocache=1" alt="Boxplots of scaled Mean Errors for several forecasting models" width="300" height="175" class="size-medium wp-image-3537" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-29-sME-boxplots-zoomed-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-29-sME-boxplots-zoomed-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-29-sME-boxplots-zoomed-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-29-sME-boxplots-zoomed.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-3537" class="wp-caption-text">Boxplots of scaled Mean Errors for several forecasting models</p></div>
<p>The boxplots above are zoomed in because there were some cases with extremely high bias. We can see that different forecasting methods produce varying distributions of sMEs: some are narrower, others wider. Comparing the means (red dots) and medians (black dots) of these distributions, it appears that the CES model is the least biased on average, since its mean and median are closest to zero. However, averaging like this can be misleading, as positive and negative biases can cancel each other out, resulting in an &#8220;average temperature in the hospital&#8221; situation. Besides, just because the average of sMEs for one method is closer to zero, it doesn&#8217;t necessarily mean it is consistently less biased; it only shows that it is the least biased on average. A more useful approach might be to look at the distribution of the absolute values of sMEs (after all, what matters is not so much whether the model is positively or negatively biased on average, but whether it is biased at all):</p>
<div id="attachment_3536" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-29-sME-abs-boxplots-zoomed.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-3536" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-29-sME-abs-boxplots-zoomed-300x175.png&amp;nocache=1" alt="Boxplots of absolute scaled Mean Errors for several forecasting models" width="300" height="175" class="size-medium wp-image-3536" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-29-sME-abs-boxplots-zoomed-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-29-sME-abs-boxplots-zoomed-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-29-sME-abs-boxplots-zoomed-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-29-sME-abs-boxplots-zoomed.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-3536" class="wp-caption-text">Boxplots of absolute scaled Mean Errors for several forecasting models</p></div>
<p>In the image above, it becomes clear that CES is the least biased approach in terms of the median, while ETS is the least biased in terms of the mean. This suggests that CES has some instances of higher bias than ETS. Additionally, the boxplot for CES is narrower, indicating that it produces less biased forecasts in most cases.</p>
<h3>Bias and accuracy</h3>
<p>Finally, as discussed above, it makes sense to look at both bias and accuracy together to better understand how models perform. But how exactly can we do this?</p>
<p><strong>DISCLAIMER</strong>: the ideas I am about to share are based on my own understanding of the issue. Most of the research on forecasting focuses on evaluating accuracy, and there are only a few studies on bias. I haven&#8217;t come across any studies that address both bias and accuracy together in a holistic way. So, take this with a grain of salt, and I&#8217;d appreciate any references to relevant studies that I might have missed.</p>
<p>To jointly analyze both bias and accuracy, we might try to summarize them using the mean to gain a clearer picture of how models perform on our data. However, simply looking at their average values can be misleading (see <a href="/2024/03/27/what-does-lower-error-measure-really-mean/">this post</a>). A better approach could be to look at the quartiles and the means of these measures. Since bias often relates more directly to operational costs, it might make sense to examine it first. Here is an example using the same dataset as in <a href="/2024/03/27/what-does-lower-error-measure-really-mean/">this post</a> (the lower the value, the better the model performs in terms of absolute bias):</p>
<p><strong>Absolute Bias</strong></p>
<pre>           min    1Q  median    3Q     max  mean
ADAM ETS <strong>0.000</strong> <strong>0.277</strong>   <strong>0.748</strong> 1.862  39.694 <strong>1.463</strong>
ETS      0.001 0.281   0.760 1.901  39.694 1.488
ARIMA    <strong>0.000</strong> 0.297   0.778 1.875  39.694 1.509
CES      <strong>0.000</strong> 0.278   0.753 <strong>1.801</strong>  <strong>39.294</strong> 1.473</pre>
<p>From the table above, we see that no single model consistently outperforms the others across all measures of absolute bias. However, the ADAM ETS model does better than others in terms of the median and mean absolute bias, and it&#8217;s the second-best in terms of the third quartile and the maximum value. The second least biased model is CES, which suggests these two models could be prime candidates for the next step in the model selection.</p>
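<p>The quartile rows in a table like the one above can be computed along these lines (an illustrative Python sketch with simulated absolute errors, not the author&#8217;s code or data):</p>

```python
import numpy as np

def summary_row(abs_errors):
    """min / 1Q / median / 3Q / max / mean of absolute scaled errors
    across time series, one row per model (illustrative sketch)."""
    q = np.quantile(abs_errors, [0.0, 0.25, 0.5, 0.75, 1.0])
    return {"min": q[0], "1Q": q[1], "median": q[2], "3Q": q[3],
            "max": q[4], "mean": np.mean(abs_errors)}

rng = np.random.default_rng(1)
abs_sme = np.abs(rng.normal(0, 1.5, 1000))  # pretend: |sME| over 1000 series
row = summary_row(abs_sme)
print(row)
```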
<p>Looking at accuracy, we see the following picture:</p>
<p><strong>Accuracy</strong></p>
<pre>           min    1Q  median    3Q     max  mean
ADAM ETS <strong>0.024</strong> <strong>0.670</strong>   1.180 2.340  51.616 <strong>1.947</strong>
ETS      <strong>0.024</strong> 0.677   1.181 2.376  51.616 1.970
ARIMA    0.025 0.681   1.179 2.358  51.616 1.986
CES      0.045 0.675   <strong>1.171</strong> <strong>2.330</strong>  <strong>51.201</strong> 1.960</pre>
<p>The accuracy tells us a slightly different story: CES appears to be the most accurate in terms of the median, third quartile, and maximum values. ETS still performs better in terms of the mean, minimum, and first quartile. Given these mixed results, we can&#8217;t conclusively choose the best model between the two. Therefore, our selection might also consider other factors, such as computational time, ease of understanding, or simplicity of the model (ETS, for instance, is arguably simpler than the Complex Exponential Smoothing).</p>
<p>Message <a href="https://openforecast.org/2024/05/07/don-t-forget-about-bias/">Don&#8217;t forget about bias!</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2024/05/07/don-t-forget-about-bias/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Best practice for forecasts evaluation for business</title>
		<link>https://openforecast.org/2024/04/24/best-practice-for-forecasts-evaluation-for-business/</link>
					<comments>https://openforecast.org/2024/04/24/best-practice-for-forecasts-evaluation-for-business/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Wed, 24 Apr 2024 10:06:39 +0000</pubDate>
				<category><![CDATA[Forecast evaluation]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[error measures]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3526</guid>

					<description><![CDATA[<p>One question I received from my LinkedIn followers was how to evaluate forecast accuracy in practice. MAPE is wrong, but it is easy to use. In practice, we want something simple, informative and straightforward, but not all error measures are easy to calculate and interpret. What should we do? Here is my subjective view. Step [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2024/04/24/best-practice-for-forecasts-evaluation-for-business/">Best practice for forecasts evaluation for business</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>One question I received from my LinkedIn followers was how to evaluate forecast accuracy in practice. <a href="https://openforecast.org/2024/04/17/avoid-using-mape/">MAPE is wrong</a>, but it is easy to use. In practice, we want something simple, informative and straightforward, but not all error measures are easy to calculate and interpret. What should we do? Here is my subjective view.</p>
<p><strong>Step 1</strong>. Choose error measure.</p>
<p>If you are interested in measuring the performance of your approaches at the individual level (e.g., SKU), you can use RMSE without needing to scale it. Why RMSE? As discussed in <a href="/2024/04/03/stop-reporting-several-error-measures-just-for-the-sake-of-them/">a previous post</a>, it is minimized by the mean, which is what most forecasting methods produce. If your method produces medians, then you should use MAE instead of RMSE.</p>
<p>If you want to measure the performance of different approaches across several time series, you can still calculate individual RMSEs. However, before aggregating, you need to scale them to avoid adding apples to oranges and beer bottles. Plus, the volume of sales might differ substantially from one product to another. The simplest scaling method is to divide RMSE by the in-sample mean. This has issues if the data exhibits a trend: if sales are increasing, your in-sample mean will also increase, deflating the scaled value. A better approach is to divide RMSE by the root mean squared differences of the in-sample data, which are typically more stable. This measure, called RMSSE (Root Mean Squared Scaled Error), was used in the M5 competition and was motivated by <a href="https://doi.org/10.1016/j.ijforecast.2022.08.003">Athanasopoulos &#038; Kourentzes (2022)</a>. The measure itself is hard to interpret, but we will address this in the next steps.</p>
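<p>A minimal Python sketch of the RMSSE calculation described in this step (made-up numbers, not the author&#8217;s code):</p>

```python
import numpy as np

def rmsse(actual, forecast, insample):
    """RMSE scaled by the root mean squared first differences of the
    in-sample data (the RMSSE logic; illustrative sketch only)."""
    rmse = np.sqrt(np.mean((actual - forecast) ** 2))
    scale = np.sqrt(np.mean(np.diff(insample) ** 2))
    return rmse / scale

insample = np.array([100.0, 103.0, 101.0, 106.0, 104.0])  # training data
actual   = np.array([108.0, 107.0])                       # holdout
forecast = np.array([105.0, 105.0])
print(rmsse(actual, forecast, insample))
```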
<p><strong>Step 2</strong>. Benchmarks</p>
<p>Calculate the error measures for some benchmark approaches, such as Naive, ETS, and ARIMA. These will be used as baselines for the next step.</p>
<p><strong>Step 3</strong>. &#8220;Forecast Value Added&#8221; (FVA)</p>
<p>Calculate something called FVA (<a href="https://doi.org/10.1002/9781119199885">by Mike Gilliland</a>). This approach calculates the ratio between the error measure of the method of interest and that of the benchmark. You end up with a value showing by how many percent your method is better than the benchmark. For example, if the FVA ratio for your ML approach compared to ETS was 0.85, you can say that it improves accuracy by 15% (1 - 0.85 = 0.15).</p>
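<p>The FVA calculation itself is just a ratio; here is a tiny Python illustration with made-up error measures:</p>

```python
# FVA as a ratio of error measures (numbers are made up for illustration).
rmsse_ml = 0.85         # error measure of the method of interest
rmsse_benchmark = 1.00  # error measure of the benchmark (e.g. ETS)

ratio = rmsse_ml / rmsse_benchmark
improvement = 1 - ratio
print(f"FVA ratio: {ratio:.2f}, improvement over benchmark: {improvement:.0%}")
# -> FVA ratio: 0.85, improvement over benchmark: 15%
```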
<p>And that&#8217;s it!</p>
<p>So, instead of focusing on the values of <a href="/2017/07/29/naughty-apes-and-the-quest-for-the-holy-grail/">some APE</a>, we move to discussing how you can improve your process directly. FVA gives you meaningful information about the performance of your approaches and can help you make specific changes to the forecasting process if needed.</p>
<p>There are also many other questions related to forecasting performance evaluation, such as what decisions you plan to make, on what level you should produce forecasts, how you plan to use forecasts etc. I might return to some of them in future posts.</p>
<p>Message <a href="https://openforecast.org/2024/04/24/best-practice-for-forecasts-evaluation-for-business/">Best practice for forecasts evaluation for business</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2024/04/24/best-practice-for-forecasts-evaluation-for-business/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Avoid using MAPE!</title>
		<link>https://openforecast.org/2024/04/17/avoid-using-mape/</link>
					<comments>https://openforecast.org/2024/04/17/avoid-using-mape/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Wed, 17 Apr 2024 08:31:35 +0000</pubDate>
				<category><![CDATA[Forecast evaluation]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[error measures]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3510</guid>

					<description><![CDATA[<p>Frankly speaking, I didn&#8217;t see the point in discussing MAPE when I wrote recent posts on error measures. However, I&#8217;ve received several comments and messages from data scientists and demand planners asking for clarification. So, here it is. TL;DR: Avoid using MAPE! MAPE, or Mean Absolute Percentage Error, is a still-very-popular-in-practice error measure, which is [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2024/04/17/avoid-using-mape/">Avoid using MAPE!</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Frankly speaking, I didn&#8217;t see the point in discussing MAPE when I wrote recent posts on error measures. However, I&#8217;ve received several comments and messages from data scientists and demand planners asking for clarification. So, here it is.</p>
<p><strong>TL;DR: Avoid using MAPE!</strong></p>
<p>MAPE, or Mean Absolute Percentage Error, is a still-very-popular-in-practice error measure, which is calculated by taking the absolute difference between the actual and the forecast, dividing it by the actual value, and averaging the result:</p>
<p>\begin{equation*}<br />
    \mathrm{MAPE} = \mathrm{mean} \left(\frac{\left| actual - forecast \right|}{actual} \right).<br />
\end{equation*}</p>
<p>The rationale is clear: we need to get rid of scale, and we want something that measures accuracy well, is easy to calculate and interpret. Unfortunately, MAPE is none of these things, and here is why.</p>
<ol>
<li>It is scale sensitive: if you have sales in thousands of units, the actual value in the denominator will bring the overall measure down, and you will get a very low number even if the model is not doing well. Similarly, very low volumes will inflate the measure, easily pushing it to hundreds of percent, even if the model does a very good job.</li>
<li>It is well known that MAPE rewards underforecasting (<a href="https://doi.org/10.1016/0169-2070(92)90009-X">Fildes, 1992</a>): it is not symmetric and might be misleading. By the way, the &#8220;symmetric&#8221; MAPE is no better and is not symmetric either (<a href="https://doi.org/10.1016/S0169-2070(99)00007-2">Goodwin &#038; Lawton, 1999</a>).</li>
<li>It cannot be calculated on intermittent demand: whenever the actual value is zero, the division produces an infinite value.</li>
<li>Okay, it is easy to interpret, fair enough. But the value itself does not tell you anything about the performance of your model (see point 1 above).</li>
<li>And it is not clear which point forecast (the mean, the median, or something else) minimises it (remember <a href="/2024/04/03/stop-reporting-several-error-measures-just-for-the-sake-of-them/">this post</a>?).</li>
</ol>
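<p>Points 1 and 3 above are easy to demonstrate in a few lines of Python (a hypothetical example with the same absolute error of 5 units in each case):</p>

```python
import numpy as np

def mape(actual, forecast):
    """Plain MAPE; division by a zero actual yields an infinite value."""
    with np.errstate(divide="ignore"):
        return np.mean(np.abs(actual - forecast) / np.abs(actual))

# The same absolute error of 5 units, three different scales:
high_volume = mape(np.array([1000.0]), np.array([995.0]))  # 0.005 ->  0.5%
low_volume  = mape(np.array([10.0]),   np.array([5.0]))    # 0.5   -> 50.0%
zero_actual = mape(np.array([0.0]),    np.array([5.0]))    # inf
print(high_volume, low_volume, zero_actual)
```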
<p>In fact, anything that has &#8220;APE&#8221; in it will have similar issues.</p>
<p>Right. So, how can we fix that?</p>
<p>The main problem of MAPE arises because of the division of the forecast error by the actuals from the holdout sample. If we change the denominator, we solve problems (1) and (2).</p>
<p><a href="https://doi.org/10.1016/j.ijforecast.2006.03.001">Hyndman &#038; Koehler (2006)</a> proposed a solution: take the Mean Absolute Error (MAE) of the forecast and divide it by the mean absolute differences of the in-sample data. The latter step is done purely for scaling reasons, and we end up with something called &#8220;MASE&#8221;, which does not have issues (1), (2) and (5), but is not easy to interpret.</p>
<p>The problem with MASE is that it is minimised by the median and, as a result, <a href="https://openforecast.org/2020/01/13/what-about-all-those-zeroes-measuring-performance-of-models-on-intermittent-demand/">not appropriate for intermittent demand</a>. But there is a good alternative based on the Root Mean Squared Error (RMSE), called RMSSE (<a href="https://doi.org/10.1016/j.ijforecast.2021.11.013">Makridakis et al., 2022</a>), which uses the same logic as MASE: take RMSE and divide it by the in-sample root mean squared differences. It is still hard to interpret, but at least it ticks the other four boxes.</p>
<p>If you really need the &#8220;interpretation&#8221; bit in your error measure, consider dividing MAE/RMSE by the in-sample mean of the data (<a href="https://doi.org/10.1057/jors.2014.62">Petropoulos &#038; Kourentzes, 2015</a>). This might not fix the issue (1) completely, but at least it would solve the other four problems.</p>
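<p>A minimal Python sketch of the two alternatives mentioned above (MASE-style scaling and scaling by the in-sample mean), with made-up numbers:</p>

```python
import numpy as np

def mase(actual, forecast, insample):
    """MAE scaled by the mean absolute first differences of the
    in-sample data (MASE logic; illustrative sketch only)."""
    mae = np.mean(np.abs(actual - forecast))
    return mae / np.mean(np.abs(np.diff(insample)))

def smae(actual, forecast, insample):
    """MAE scaled by the in-sample mean -- the more interpretable
    option (hypothetical helper name)."""
    mae = np.mean(np.abs(actual - forecast))
    return mae / np.mean(insample)

insample = np.array([20.0, 23.0, 21.0, 26.0, 25.0])   # training data
actual, forecast = np.array([27.0, 26.0]), np.array([25.0, 25.0])
print(mase(actual, forecast, insample), smae(actual, forecast, insample))
```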
<p>If you want to learn more about error measures, check out <a href="https://openforecast.org/adam/errorMeasures.html">Section 2.1</a> of my monograph or read an old post of mine &#8220;<a href="https://openforecast.org/2017/07/29/naughty-apes-and-the-quest-for-the-holy-grail/">Naughty APEs and the quest for the holy grail</a>&#8220;.</p>
<p>And here is a depiction of <a href="https://openforecast.org/2017/07/29/naughty-apes-and-the-quest-for-the-holy-grail/">Mean APEs</a>, inspired by my old post (thanks to Stephan Kolassa for the idea of the image):</p>
<div id="attachment_3520" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-05-MeanAPEs.jpg&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-3520" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-05-MeanAPEs-300x300.jpg&amp;nocache=1" alt="Mean APEs" width="300" height="300" class="size-medium wp-image-3520" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-05-MeanAPEs-300x300.jpg&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-05-MeanAPEs-150x150.jpg&amp;nocache=1 150w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-05-MeanAPEs-768x768.jpg&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-04-05-MeanAPEs.jpg&amp;nocache=1 1024w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-3520" class="wp-caption-text">Mean APEs</p></div>
<p>Message <a href="https://openforecast.org/2024/04/17/avoid-using-mape/">Avoid using MAPE!</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2024/04/17/avoid-using-mape/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Stop reporting several error measures just for the sake of them!</title>
		<link>https://openforecast.org/2024/04/03/stop-reporting-several-error-measures-just-for-the-sake-of-them/</link>
					<comments>https://openforecast.org/2024/04/03/stop-reporting-several-error-measures-just-for-the-sake-of-them/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Wed, 03 Apr 2024 10:45:11 +0000</pubDate>
				<category><![CDATA[Forecast evaluation]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[error measures]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3399</guid>

					<description><![CDATA[<p>We continue our discussion of error measures (if you don&#8217;t mind). One other thing that you encounter in forecasting experiments is tables containing several error measures (MASE, RMSSE, MAPE, etc.). Have you seen something like this? Well, this does not make sense, and here is why. The idea of reporting several error measures comes from [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2024/04/03/stop-reporting-several-error-measures-just-for-the-sake-of-them/">Stop reporting several error measures just for the sake of them!</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>We continue our discussion of error measures (if you don&#8217;t mind). One other thing that you encounter in forecasting experiments is tables containing several error measures (MASE, RMSSE, MAPE, etc.). Have you seen something like this? Well, this does not make sense, and here is why.</p>
<p>The idea of reporting several error measures comes from <a href="/2024/03/14/the-role-of-m-competitions-in-forecasting/">forecasting competitions</a>, one of the findings of which was that the ranking of methods might differ depending on the error measure used. But this is such a 20th-century thing to do! We are now in the 21st century and have a much better understanding of what to measure and how.</p>
<p>I should start with the maxim very well summarised by Stephan Kolassa (<a href="http://dx.doi.org/10.1016/j.ijforecast.2015.12.004">2016</a> and <a href="https://doi.org/10.1016/j.ijforecast.2019.02.017">2020</a>): MAE-based measures are minimised by the median of a distribution, while RMSE-based ones are minimised by the mean. To illustrate this point, see the following image of the distribution of a variable x:</p>
<div id="attachment_3401" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-03-28-Several-error-measures.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-3401" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-03-28-Several-error-measures-300x168.png&amp;nocache=1" alt="Distribution of a variable x with several error measures" width="300" height="168" class="size-medium wp-image-3401" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-03-28-Several-error-measures-300x168.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-03-28-Several-error-measures-1024x573.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-03-28-Several-error-measures-768x430.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/04/2024-03-28-Several-error-measures.png&amp;nocache=1 1230w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-3401" class="wp-caption-text">Distribution of a variable x with several error measures</p></div>
<p>What you see in that image is a histogram with three vertical lines: red for the mean, blue for the median and green for the geometric mean. To the right of the histogram, you can see a small table with three error measures (read more about them <a href="/adam/errorMeasures.html">here</a>): RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and RMSLE (Root Mean Squared Logarithmic Error, based on differences of logarithms). You can see that RMSE is minimised by the mean, MAE by the median, and RMSLE by the geometric mean. This happens by design, not by coincidence.</p>
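<p>The point above is easy to check numerically. Here is a minimal Python sketch (not from the original post; the simulated lognormal sample and all variable names are my own illustration): for a fixed sample, the constant forecast that minimises squared error is the mean, the one that minimises absolute error is the median, and the one that minimises squared log error is the geometric mean.</p>

```python
import numpy as np

# Right-skewed data, so the mean, median and geometric mean clearly differ
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0, sigma=1, size=10_000)

# Candidate constant "forecasts"
mean_x = x.mean()
median_x = np.median(x)
gmean_x = np.exp(np.log(x).mean())  # geometric mean

# The three error measures, each evaluated for a constant forecast f
def rmse(f):  return np.sqrt(np.mean((x - f) ** 2))
def mae(f):   return np.mean(np.abs(x - f))
def rmsle(f): return np.sqrt(np.mean((np.log(x) - np.log(f)) ** 2))

candidates = {"mean": mean_x, "median": median_x, "geometric mean": gmean_x}
for name, loss in [("RMSE", rmse), ("MAE", mae), ("RMSLE", rmsle)]:
    best = min(candidates, key=lambda c: loss(candidates[c]))
    print(f"{name} is minimised by the {best}")
```

<p>Running it shows each measure being minimised by its own central tendency, which is exactly why reporting all three for the same point forecast mixes incompatible targets.</p>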
<p>So what?</p>
<p>Well, this just tells us that it does not make sense to use all three error measures, because they measure different things. If your models produce conditional means (which is what most of them do by default), use an RMSE-based measure &#8211; there is no point in showing how those models also perform in terms of MAPE/MASE or whatever else. If your model produces a conditional mean, fails in terms of RMSE, but does a good job in terms of MAPE, then it is probably not doing a good job overall. It is like saying that my bicycle isn&#8217;t great for riding, but it excels at hammering nails. So, think about what you want to measure and measure it! And if you want to measure temperature, don&#8217;t use a ruler!</p>
<p>Having said that, I should confess that I used to report several measures in my papers until a few years back. This is because I did not understand this idea very well. But I have realised my mistake, and now I am trying to avoid this and stick to those measures that make sense for the task at hand.</p>
<p>A separate question is whether you are interested in the accuracy of point forecasts, their bias, or the performance of a model in terms of quantiles. In that case you might need a completely different set of error measures. But I might come back to this in a future post.</p>
<p>Message <a href="https://openforecast.org/2024/04/03/stop-reporting-several-error-measures-just-for-the-sake-of-them/">Stop reporting several error measures just for the sake of them!</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2024/04/03/stop-reporting-several-error-measures-just-for-the-sake-of-them/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
