Apparently, we need to talk about scaling of error measures because this is not as obvious as it seems.
In the forecasting literature, since the early days of the field, there has been a general consensus that forecast errors from individual time series should not be analysed and aggregated as is. This is because you can have very different time series capturing the dynamics of very different processes.
Indeed, if you forecast sales of apples in kilograms, your actual value would be apples in kilograms, and your point forecast would also be in the same units. Subtracting one from the other tells us how many kilograms of apples we missed with the forecast we produced. But if we then take the average of the forecast errors for apples and beer, we would be aggregating things measured in different units, which contradicts basic aggregation principles.
Furthermore, if the company sells thousands of kilograms of apples and a handful of jet engines, aggregating forecast errors on those (e.g. 3000 vs 3) might introduce all sorts of issues, because the model's performance on apples might mask its performance on jet engines. At the same time, jet engines are much more expensive than apples, and forecasting them accurately might be more important for the company than forecasting apples.
So, the forecasting literature has agreed that forecast errors need to be scaled somehow, to make them unitless and to avoid distorting the performance of models on time series with different volumes. There are several ways of doing that, some poor and some reasonable. The state of the art at the moment is to divide error measures by some in-sample statistic to avoid potential holdout-sample distortion. Using the in-sample mean absolute differences (MAD) of the data for this (thus ending up with MASE, or with RMSSE when squared differences are used) is considered the standard. A couple of years ago, I wrote a post about the advantages and disadvantages of several scaling methods.
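To make the scaling concrete, here is a minimal sketch of MASE and RMSSE in Python. The function names and example numbers are mine, purely for illustration; the scaling uses the in-sample differences of the data, i.e. the in-sample errors of the naive forecast:

```python
import numpy as np

def mase(actuals, forecasts, insample):
    """Mean Absolute Scaled Error: out-of-sample MAE divided by the
    in-sample mean absolute differences (the naive forecast's MAE)."""
    scale = np.mean(np.abs(np.diff(insample)))
    return np.mean(np.abs(actuals - forecasts)) / scale

def rmsse(actuals, forecasts, insample):
    """Root Mean Squared Scaled Error: out-of-sample MSE divided by the
    in-sample mean squared differences, square-rooted at the end."""
    scale = np.mean(np.diff(insample) ** 2)
    return np.sqrt(np.mean((actuals - forecasts) ** 2) / scale)

# Hypothetical series: in-sample history, then holdout actuals and forecasts
insample = np.array([10., 12., 11., 13., 12., 14.])
actuals = np.array([15., 13.])
forecasts = np.array([14., 14.])
print(mase(actuals, forecasts, insample))   # 0.625
print(rmsse(actuals, forecasts, insample))  # ≈ 0.598
```

Both values are unitless, so they can be aggregated across apples, beer and jet engines without mixing kilograms with units sold.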
But there is one method that I haven’t looked at and which is not very well discussed in the forecasting literature. It relies on the monetary value of forecasts. We could multiply each individual forecast error “e” by the price of the product “p” (thus moving to the missed income per product) and then divide everything by the overall income (price times quantity) from different products. This can be written as:
\begin{equation}
\text{monetary Mean Error} = \frac{\sum_{j=1}^n (p_j \times e_j)} {\sum_{j=1}^n (p_j \times q_j)}
\end{equation}
(the above formula can be modified to use squared or absolute values of the error). This way we switch from the original units to monetary values, and the measure tells us what proportion of the overall income was missed by the forecasts. This is a useful measure because it connects model performance with managerial decisions and takes the value of each product into account (thus we do not mask the expensive jet engines with the cheap apples).
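The formula above is straightforward to implement. Here is a minimal Python sketch with made-up numbers (the function name and the data are mine, for illustration only):

```python
import numpy as np

def monetary_mean_error(errors, prices, quantities):
    """Sum of price-weighted errors over total income (price * quantity).
    Returns the proportion of the overall income missed by the forecasts."""
    errors, prices, quantities = map(np.asarray, (errors, prices, quantities))
    return np.sum(prices * errors) / np.sum(prices * quantities)

# Hypothetical products: cheap apples (kg) and expensive jet engines (units)
errors = np.array([300., 1.])        # e_j = actual - forecast, per product
prices = np.array([2., 1_000_000.])  # p_j, price per unit
quantities = np.array([3000., 3.])   # q_j, actual sales
print(monetary_mean_error(errors, prices, quantities))  # ≈ 0.333
```

Note how the result is dominated by the single missed jet engine rather than the 300 missed kilograms of apples, which is exactly the behaviour motivated above.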
However, it might have a potential issue similar to that of MAE/Mean or wMAPE: if the sales of a product are not stationary, the denominator will change, driving the proportion either up or down irrespective of how good the forecast is. I am not sure whether this needs to be addressed, because there is an argument that if the income from a product has increased while the error has not changed, then the proportion of the missed income has decreased, which makes sense. But if we do need to address this, we can switch to the MAD multiplied by price in the denominator. In fact, something similar was done in the M5 competition, which used a weighted RMSSE with weights based on the income from each product over the last four weeks of data.
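Under my reading of that fix, each product's in-sample MAD is multiplied by its price, and the sum of these replaces the income in the denominator. A hedged sketch (the function name and the interpretation are mine, not an established definition):

```python
import numpy as np

def monetary_scaled_error(errors, prices, insamples):
    """Sketch: price-weighted absolute errors divided by price-weighted
    in-sample mean absolute differences (one MAD per product).
    `insamples` is a list of in-sample series, one per product."""
    mads = np.array([np.mean(np.abs(np.diff(y))) for y in insamples])
    prices = np.asarray(prices)
    errors = np.asarray(errors)
    return np.sum(prices * np.abs(errors)) / np.sum(prices * mads)

# Hypothetical data: two products with different histories and prices
insamples = [np.array([10., 12., 10., 12.]), np.array([1., 2., 1., 2.])]
print(monetary_scaled_error([2., 1.], [1., 10.], insamples))  # 1.0
```

Because the denominator is based on the in-sample variability rather than the holdout income, a sudden change in sales volume in the holdout no longer moves the measure on its own.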
But here is one more interesting thing about this error measure. If we assume that the prices of all products are exactly the same, they cancel out in the numerator and the denominator, leaving us with just the sum of errors divided by the total sales of all products. This still maintains the original idea of the proportion of missed income, but now rests on a very strong assumption, which is probably not correct in real life (apples and engines for the same price?). Furthermore, this would again mask the performance of the model on the expensive products. I personally don’t like this measure and find the assumption unrealistic and potentially misleading. Having said that, I can see some cases where it could still be acceptable and useful (e.g. similar products with similar dynamics and similar prices).
Summarising:
- If you are conducting a forecasting experiment without a specific context, I’d recommend using RMSSE or some other similar measure with scaling.
- If you have prices of products, income-based scaling might be more informative.
- Setting all prices to the same value does not sound appealing to me, but I understand that there are contexts where this might work.