What about the training/test sets?

Train on a test site... maybe

Another question my students sometimes ask is how to define the sizes for the training and test sets in a forecasting experiment. If you’ve done data mining or machine learning, you’re likely familiar with this concept. But when it comes to forecasting, there are a few nuances. Let’s discuss.

First and foremost, in forecasting, the test set (or “holdout sample”) should always be at the end of your data, while the training set (or “in-sample”) comes before it, because forecasting is about the future, not the past. This may seem obvious, but for people unfamiliar with time series, it can be surprising.

Also, the training and test sets should be contiguous, without gaps, so that the temporal structure of the data is preserved. Machine learning techniques such as k-fold cross-validation need to be adapted accordingly (no random picking of observations from the middle of the time series). The relevant paper here is by Devon Barrow and Sven Crone, who explored several cross-validation techniques for forecasting with neural networks.
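
To make these two points concrete, here's a rough Python sketch (the daily series is simulated, and scikit-learn's TimeSeriesSplit is just one convenient way to get ordered, gap-free folds):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# A simulated daily series, purely for illustration.
dates = pd.date_range("2021-01-01", periods=730, freq="D")
y = pd.Series(np.random.default_rng(42).normal(100, 10, len(dates)), index=dates)

# The holdout sample sits at the end of the data; the in-sample part precedes it.
h = 14
train, test = y.iloc[:-h], y.iloc[-h:]

# If cross-validation is needed, the folds must stay ordered and contiguous:
# no shuffling and no observations picked from the middle of the series.
tscv = TimeSeriesSplit(n_splits=5, test_size=h)
for fold, (train_idx, test_idx) in enumerate(tscv.split(y)):
    print(fold, y.index[train_idx[-1]].date(), y.index[test_idx[0]].date())
```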

As for the sizes of the sets, there’s no strict rule or definitive theory. Some people advocate a 70/30 split, but this is arbitrary. In practice, you should consider the needs of the business and design your experiment accordingly. For example, if your forecast horizon is 14 days ahead (remember this post?), your test set should contain at least 14 observations for daily data. However, if you use exactly 14 observations, you’ll only be able to do a “fixed origin” evaluation: forecasting once and stopping there. This can be unreliable, because a model might perform well by chance, and you wouldn’t see how it behaves in different situations (e.g., performing well in summer but not in other seasons).
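
For illustration, a fixed origin evaluation might look roughly like this in Python (the simulated series, the ETS model from statsmodels and the MAE are placeholder choices, not recommendations):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# A simulated daily series with a weekly pattern, purely for illustration.
dates = pd.date_range("2021-01-01", periods=730, freq="D")
y = pd.Series(100 + 10 * np.sin(2 * np.pi * np.arange(730) / 7)
              + np.random.default_rng(1).normal(0, 3, 730), index=dates)

# Fixed origin: split once, forecast 14 days ahead once, and stop there.
h = 14
train, test = y.iloc[:-h], y.iloc[-h:]

model = ExponentialSmoothing(train, trend="add", seasonal="add",
                             seasonal_periods=7).fit()
forecast = model.forecast(h)

mae = np.mean(np.abs(test.values - forecast.values))
print(f"Fixed origin MAE over {h} days: {mae:.2f}")
```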

A better approach is to make the test set longer than the forecast horizon and evaluate the model’s performance over time, for example, throughout a full year, using a rolling origin evaluation (see more here). This gives you more data for analysis (a distribution of error measures) and shows whether the model performs consistently across different periods.
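
Here is a very rough sketch of a rolling origin evaluation in Python, with a seasonal naive forecast standing in for whatever model you actually want to assess:

```python
import numpy as np
import pandas as pd

# A simulated daily series covering three years, purely for illustration.
n = 3 * 365
dates = pd.date_range("2019-01-01", periods=n, freq="D")
y = pd.Series(100 + 10 * np.sin(2 * np.pi * np.arange(n) / 7)
              + np.random.default_rng(2).normal(0, 3, n), index=dates)

h = 14             # forecast horizon
test_length = 365  # hold out roughly a full year

maes = []
for origin in range(n - test_length, n - h + 1):
    train = y.iloc[:origin]
    actual = y.iloc[origin:origin + h]
    # Seasonal naive: repeat the last observed week over the horizon.
    forecast = np.resize(train.iloc[-7:].values, h)
    maes.append(np.mean(np.abs(actual.values - forecast)))

# Instead of a single number, we get a whole distribution of error measures.
print(f"{len(maes)} origins, median MAE = {np.median(maes):.2f}, "
      f"90th percentile = {np.percentile(maes, 90):.2f}")
```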

Unfortunately, there is a potential problem with that: in practice, many companies store only up to three years of data, often assuming that anything older is irrelevant. This makes life more difficult for forecasters. With limited data, it may be impossible to fit or compare some models (e.g., seasonal ARIMA/ETS do not always work on data with fewer than three seasonal cycles). In such cases, your evaluation options become limited.

A possible solution is to train global models across multiple shorter time series, while keeping larger test sets for each series. For example, where series share seasonal patterns, dynamic models can be adjusted to use cross-sectional seasonal indices. In the case of ETS, John Boylan, Huijing Chen and I developed Vector ETS for this purpose.
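
Just to give a flavour of the idea (this is not Vector ETS itself, only a toy illustration), here is a small Python sketch in which weekly seasonal indices are pooled cross-sectionally across several short series that share the same pattern:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 7                          # weekly seasonality
n_series, n_weeks = 20, 8      # twenty short series, eight weeks each
true_indices = np.array([0.80, 0.90, 1.00, 1.05, 1.10, 1.30, 0.85])

# Simulated series: different levels, shared multiplicative weekly pattern.
levels = rng.uniform(50, 150, size=(n_series, 1))
noise = rng.normal(1.0, 0.05, size=(n_series, n_weeks * m))
data = levels * np.tile(true_indices, n_weeks) * noise

# Each series on its own gives noisy indices; pooling them across series
# (cross-sectionally) gives a much more stable estimate.
ratios = data / data.mean(axis=1, keepdims=True)
shared_indices = ratios.reshape(n_series, n_weeks, m).mean(axis=(0, 1))
print(np.round(shared_indices, 3))   # close to the shared true pattern
```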

Lastly, to all practitioners out there: please, store as much data as possible! If an analyst or data scientist doesn’t need older data, they can always discard it. But in most cases, we are hungry for data, so the more, the merrier!
