What do you use for model selection? Do you select the best model based on its cross-validated performance, or do you use in-sample measures like AIC? Either way, there is a way to improve your selection process further.
JORS recently published a paper by Nikos Kourentzes and me based on a simple but powerful idea: instead of using summary statistics (like the mean RMSE of cross-validated errors), you should consider the entire distribution and choose a specific quantile. This aligns with my previous post on error measures, but here is the core intuition:
The distribution of error measures is almost always asymmetric. If you only look at the average, you end up with a “mean temperature in the hospital” statistic, which doesn’t reflect how models actually behave. Some models perform great on most series but fail miserably on a few.
What can we do in this case? We can look at quantiles of the distribution.
For example, if we use the 84th quantile, we compare the models based on their “bad” performance, i.e. the situations where they fail and produce less accurate forecasts. If you choose the best-performing model there, you will end up with something that does not fail as much. So your preferences for the model become risk-averse in this situation.
If you focus on a lower quantile (e.g. the 16th), you are looking at models that do well on the well-behaved series and ignoring how they do on the difficult ones. So, your model selection preferences can be described as risk-tolerant, because you accept that the best-performing model might fail on a difficult time series.
Furthermore, the median (the 50th quantile, the middle of the sample) corresponds to the risk-neutral situation, because it ignores the tails of the distribution.
What about the mean? This is a risk-agnostic strategy, because it says nothing about the performance on the difficult or easy time series – it averages over everything at once and, in doing so, hides the true risk profile.
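To make the idea concrete, here is a minimal Python sketch (not the code from our paper) of quantile-based selection. The model names and error values are made up for illustration; in practice the error matrix would come from your own rolling-origin evaluation.

```python
import numpy as np

# Hypothetical cross-validated RMSEs for three candidate models
# across 100 time series (one array of errors per model).
rng = np.random.default_rng(42)
errors = {
    "ETS":   rng.lognormal(mean=0.00, sigma=0.5, size=100),
    "ARIMA": rng.lognormal(mean=0.05, sigma=0.3, size=100),
    "Naive": rng.lognormal(mean=0.10, sigma=0.2, size=100),
}

def select_model(errors, quantile=None):
    """Pick the model with the lowest summary of its error distribution.

    quantile=None  -> mean of the errors (the usual, risk-agnostic choice)
    quantile=0.84  -> risk-averse: compare models on their "bad" performance
    quantile=0.50  -> risk-neutral: compare medians
    quantile=0.16  -> risk-tolerant: compare models on their "good" performance
    """
    def score(e):
        return np.mean(e) if quantile is None else np.quantile(e, quantile)
    return min(errors, key=lambda name: score(errors[name]))

print(select_model(errors))                 # mean-based selection
print(select_model(errors, quantile=0.84))  # risk-averse selection
print(select_model(errors, quantile=0.16))  # risk-tolerant selection
```

The only thing that changes between the strategies is the summary applied to each model's error distribution before the comparison.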
So what?
In the paper, we show that using a risk-averse strategy tends to improve overall forecasting accuracy in day-to-day situations. Conversely, a risk-tolerant strategy can be beneficial when disruptions are anticipated, as standard models are likely to fail anyway.
So, next time you select a model, think about the measure you are using. If it’s just the mean RMSE, keep in mind that you might be ignoring the inherent risks of that selection.
P.S. While the discussion above applies to the distribution of error measures, our paper specifically focused on point AIC (in-sample performance). But AIC is a distance measure as well, so the logic explained above still holds.
P.P.S. Nikos wrote a post about this paper here.
P.P.P.S. And here is the link to the paper.