2.4 Rolling origin

Remark. The text in this section is based on the vignette for the greybox package, written by the author of this textbook.

When there is a need to select the most appropriate forecasting model or method for the data, the forecaster usually splits the sample into two parts: in-sample (aka “training set”) and holdout sample (aka out-sample or “test set”). The model is estimated on the in-sample and its forecasting performance is evaluated using some error measure on the holdout sample.

Using this procedure only once is known as “fixed origin” evaluation. However, this might give a misleading impression of the accuracy of forecasting methods. If, for example, the time series contains outliers or level shifts, a poor model might perform better in fixed origin evaluation than a more appropriate one. Besides, good performance might happen by chance. So it makes sense to use a more robust evaluation technique. “Rolling origin” evaluation is one such technique.

In rolling origin evaluation, the forecasting origin is repeatedly moved forward and forecasts are produced from each origin (Tashman, 2000). This technique yields several forecast errors for each time series, which gives a better understanding of how the models perform. It can be considered a time series analogue of cross-validation techniques (Wikipedia, 2020). Here is a simple graphical representation, courtesy of Nikos Kourentzes.

Figure 2.4: Rolling origin illustrated, by Nikos Kourentzes

There are several ways of doing this.

2.4.1 Principles of Rolling origin

Figure 2.5 (Svetunkov and Petropoulos, 2018) illustrates the basic idea behind rolling origin. White cells correspond to the in-sample data, while the light grey cells correspond to the three-steps-ahead forecasts. The time series in the figure has 25 observations, and forecasts are produced from 8 origins, starting from observation 15. The model is estimated on the first in-sample set and forecasts are produced for the holdout. Next, another observation is added to the end of the in-sample set, the test set is advanced and the procedure is repeated. The process stops when there is no more data left. This is a rolling origin with a constant holdout sample size. As a result, the procedure produces 8 sets of one- to three-steps-ahead forecasts. Based on them we can calculate the preferred error measures and choose the best performing model (see Section 2.1.2).

Figure 2.5: Rolling origin with constant holdout size
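The splitting logic for the constant holdout case can be sketched in base R. This is only an illustration under the settings described for Figure 2.5 (25 observations, three steps ahead, 8 origins); the helper roIndices() is hypothetical and is not part of the greybox package.

```r
# Hypothetical helper (not from greybox): list the in-sample and holdout
# indices for a rolling origin with a constant holdout size.
roIndices <- function(obs, h, origins) {
  # The last origin must leave exactly h observations for the holdout
  lastOrigin <- obs - h
  originSet <- (lastOrigin - origins + 1):lastOrigin
  lapply(originSet, function(origin) {
    list(origin = origin,
         inSample = 1:origin,
         holdout = (origin + 1):(origin + h))
  })
}

# Settings taken from the description of Figure 2.5
splits <- roIndices(obs = 25, h = 3, origins = 8)
originsUsed <- sapply(splits, function(s) s$origin)
print(originsUsed)            # origins 15 to 22
print(splits[[8]]$holdout)    # the final holdout: observations 23 to 25
```

Each element of splits holds the indices that would be passed to the model for estimation and the indices against which the forecasts would be compared.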

Another option is to continue with the rolling origin even when the test sample is smaller than the forecast horizon, as shown in Figure 2.6. In this case, the procedure continues until origin 22, when the last full set of three-steps-ahead forecasts can be produced, and then carries on with a decreasing forecasting horizon. So the two-steps-ahead forecast is produced from origin 23 and only a one-step-ahead forecast is produced from origin 24. As a result, we obtain 10 one-step-ahead forecasts, 9 two-steps-ahead forecasts and 8 three-steps-ahead forecasts. This is a rolling origin with a non-constant holdout sample size, which can be useful with small samples, when we don’t have enough observations.

Figure 2.6: Rolling origin with non-constant holdout size
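The counts of forecasts per horizon quoted above can be checked with a few lines of base R. The sketch assumes the same setting as in the text (25 observations, a horizon of three, first origin at observation 15); the variable names are illustrative only.

```r
obs <- 25
h <- 3
firstOrigin <- 15

# With a non-constant holdout, origins run up to the second-to-last observation
originSet <- firstOrigin:(obs - 1)

# The horizon available from each origin is capped by the remaining data
horizons <- pmin(h, obs - originSet)

# Number of forecasts available for each step ahead
forecastCounts <- sapply(1:h, function(j) sum(horizons >= j))
names(forecastCounts) <- paste0("h", 1:h)
print(forecastCounts)
```

The result agrees with the text: 10 one-step-ahead, 9 two-steps-ahead and 8 three-steps-ahead forecasts.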

Finally, in both of the cases above the in-sample size increased with each origin. However, for some research purposes we might need a constant in-sample size. Figure 2.7 demonstrates such a setup. In this case, in each iteration we add an observation to the end of the in-sample series and remove one from the beginning (dark grey cells).

Figure 2.7: Rolling origin with constant in-sample size
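A constant in-sample window can be sketched in the same way. The window length of 15 observations below is an assumption made for illustration (the text does not state the size used in Figure 2.7); the rest follows the earlier setting of 25 observations, a horizon of three and 8 origins.

```r
obs <- 25
h <- 3
origins <- 8
inSampleSize <- 15   # assumed window length, for illustration only

# Origins for a constant holdout of h observations
originSet <- (obs - h - origins + 1):(obs - h)

# Each window drops one observation from the start as one is added at the end
windows <- t(sapply(originSet, function(origin) {
  c(start = origin - inSampleSize + 1, end = origin)
}))
rownames(windows) <- paste0("origin", originSet)
print(windows)
```

Every row spans the same number of observations, so the model is always estimated on a sample of the same size.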

2.4.2 Rolling origin in R

The function ro() from the greybox package (written by Yves Sagaert and Ivan Svetunkov in 2016 on the way to the International Symposium on Forecasting) implements the rolling origin evaluation for any function you like with a predefined call and returns the desired value. It relies heavily on two arguments, call and value, so it is important to understand how to formulate them in order to get the desired results. ro() is a very flexible function, but as a result it is not very simple. In this subsection we will see how it works on a couple of examples.

We start with a simple example, generating a series from a normal distribution:

y <- rnorm(100,100,10)

We use an ARIMA(0,1,1) model implemented in the stats package (this model is discussed in Section 8):

ourCall <- "predict(arima(x=data,order=c(0,1,1)),n.ahead=h)"

The call that we specify includes two important elements: data and h. data specifies where the in-sample values are located in the function that we want to use, and it needs to be called “data” in the call. h tells our function where the forecasting horizon is specified in the selected function. Note that in this example we use arima(x=data,order=c(0,1,1)), which produces the desired ARIMA(0,1,1) model, and then we use predict(..., n.ahead=h), which produces an h-steps-ahead forecast from that model.
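To see why the names data and h matter, it may help to evaluate such a call string by hand. The sketch below shows one plausible mechanism in base R; it is not taken from the source code of ro(), and the helper evaluateCall() is hypothetical.

```r
set.seed(41)
y <- rnorm(100, 100, 10)
ourCall <- "predict(arima(x=data,order=c(0,1,1)),n.ahead=h)"

# Hypothetical helper: evaluate the text of a call in a context where
# `data` and `h` are bound to the in-sample values and the horizon.
evaluateCall <- function(callText, data, h) {
  eval(parse(text = callText),
       envir = list(data = data, h = h),
       enclos = globalenv())
}

# One origin: first 90 observations in-sample, three steps ahead
result <- evaluateCall(ourCall, data = y[1:90], h = 3)
print(names(result))
print(length(result$pred))
```

The elements of the returned list (here pred and se from predict()) are the kind of names one would then refer to when choosing what to collect.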

Having the call, we need also to specify what the function should return. This can be the conditional mean (point forecasts), prediction intervals, the parameters of a model, or, in fact, anything that the model returns (e.g. name of the fitted model and its likelihood). However, there are some differences in what ro() returns depending on what the function returns. If it is a vector, then ro() will produce a matrix (with values for each origin in columns). If it is a matrix then an array is returned. Finally, if it is a list, then a list of lists is returned.
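The relation between the shape of a single result and the shape of the collected results can be mimicked in base R. The sketch below is an analogy built with sapply(), not the output of ro() itself: a vector per origin binds into a matrix with origins in columns, while a matrix per origin stacks into a three-dimensional array. The 95% bounds assume normality and are used only to obtain a matrix-valued result.

```r
set.seed(41)
y <- rnorm(100, 100, 10)
obs <- length(y)
h <- 3
origins <- 8

# A vector per origin (point forecasts) binds into an h x origins matrix
pointForecasts <- sapply(1:origins, function(i) {
  inSample <- y[1:(obs - origins + i - h)]
  as.numeric(predict(arima(inSample, order = c(0, 1, 1)), n.ahead = h)$pred)
})

# A matrix per origin (approximate 95% bounds) stacks into an array
intervals <- sapply(1:origins, function(i) {
  inSample <- y[1:(obs - origins + i - h)]
  fc <- predict(arima(inSample, order = c(0, 1, 1)), n.ahead = h)
  cbind(lower = as.numeric(fc$pred) - qnorm(0.975) * as.numeric(fc$se),
        upper = as.numeric(fc$pred) + qnorm(0.975) * as.numeric(fc$se))
}, simplify = "array")

print(dim(pointForecasts))   # horizons x origins
print(dim(intervals))        # horizons x bounds x origins
```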

In order not to overcomplicate things, we start by collecting the conditional mean from the predict() function:

ourValue <- c("pred")

NOTE: If you do not specify the value to return, the function will try to return everything, but it might fail, especially if a lot of values are returned. So, in order to be on the safe side, always provide the value, when possible.

Now that we have specified ourCall and ourValue, we can produce forecasts from the model using rolling origin. Let’s say that we want three-steps-ahead forecasts and 8 origins with the default values of all the other parameters:

returnedValues1 <- ro(y, h=3, origins=8,
                      call=ourCall, value=ourValue)

The same can be achieved using the following loop:

obs <- 100
roh <- 8
h <- 3
data <- y
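returnedValues1 <- setNames(vector("list",3),
                            c("actuals","holdout","pred"))
returnedValues1$actuals <- y
# Rows are horizons (h1, h2, h3), columns are origins
returnedValues1$holdout <- returnedValues1$pred <- matrix(NA,h,roh,
                                      dimnames=list(paste0("h",1:h),NULL))
for(i in 1:roh){
  # Fit the model to the in-sample part for origin i
  testModel <- arima(x=data[1:(obs-roh+i-h)],order=c(0,1,1))
  # The next h observations after the origin form the holdout
  returnedValues1$holdout[,i] <- data[-c(1:(obs-roh+i-h))][1:h]
  returnedValues1$pred[,i] <- predict(testModel, n.ahead=h)$pred
}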

The function returns a list with all the values that we asked for plus the actual values from the holdout sample. We can calculate some basic error measure based on those values, for example, scaled Absolute Error (Petropoulos and Kourentzes, 2015):

apply(abs(returnedValues1$holdout - returnedValues1$pred),
      1, mean, na.rm=TRUE) /
      mean(returnedValues1$actuals)
##         h1         h2         h3 
## 0.05445728 0.05444613 0.06933872
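The same calculation can be wrapped in a small function. The sketch below assumes that the scaling is done by the mean of the actual values; the helper name sMAE() and the toy numbers are made up for illustration and do not come from the example above.

```r
# Hypothetical helper: mean absolute error per horizon, scaled by the
# mean of the actual values (rows are horizons, columns are origins).
sMAE <- function(holdout, forecast, actuals) {
  rowMeans(abs(holdout - forecast), na.rm = TRUE) / mean(actuals)
}

# Toy data: two origins, three horizons, flat forecasts of 100
holdout <- matrix(c(100, 110,  90,
                    110,  90, 105), nrow = 3,
                  dimnames = list(paste0("h", 1:3), c("origin1", "origin2")))
forecast <- matrix(100, nrow = 3, ncol = 2, dimnames = dimnames(holdout))
actuals <- c(90, 110, 100, 110, 90, 100)   # mean is 100

errors <- sMAE(holdout, forecast, actuals)
print(errors)
```

With na.rm=TRUE, horizons for which some origins have no holdout value are averaged over the available origins only.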

In this example we use the apply() function in order to distinguish between the different forecasting horizons and to get an idea of how the model performs for each of them. These numbers do not tell us much on their own, but if we compare the performance of this model with another one, then we can infer whether one model is more appropriate for the data than the other. For example, applying ARIMA(1,1,2) to the same data, we get:

ourCall <- "predict(arima(x=data,order=c(1,1,2)),n.ahead=h)"
returnedValues2 <- ro(y, h=3, origins=8,
                      call=ourCall, value=ourValue)
apply(abs(returnedValues2$holdout - returnedValues2$pred),
      1, mean, na.rm=TRUE) /
      mean(returnedValues2$actuals)
##         h1         h2         h3 
## 0.05444983 0.05494058 0.07064521

Comparing these errors with the ones from the previous model, we can conclude which of the approaches is more adequate for the data.
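One simple way to make this comparison is to take the ratio of the two sets of errors. The sketch below uses the values printed above; the ratio itself is an illustrative choice rather than a measure prescribed in this section.

```r
# Scaled errors as reported in the output above
errorsARIMA011 <- c(h1 = 0.05445728, h2 = 0.05444613, h3 = 0.06933872)
errorsARIMA112 <- c(h1 = 0.05444983, h2 = 0.05494058, h3 = 0.07064521)

# Values above 1 mean that ARIMA(1,1,2) has the larger error for that horizon
errorRatios <- errorsARIMA112 / errorsARIMA011
print(round(errorRatios, 4))
```

For this particular series the ratios are very close to one for all horizons, with ARIMA(0,1,1) slightly ahead for two and three steps ahead, so the difference between the two models is small.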

We can also plot the forecasts from the rolling origin, which shows how the models behave:

par(mfcol=c(2,1), mar=c(4,4,1,1))
plot(returnedValues1)
plot(returnedValues2)

Figure 2.8: Rolling origin performance of two forecasting methods

In Figure 2.8, the forecasts from different origins are close to each other. This is because the data is stationary and both models produce flat lines as forecasts.

The rolling origin function from the greybox package also allows working with explanatory variables and returning prediction intervals if needed. Some further examples are discussed in the vignette of the package: vignette("ro","greybox").

Practically speaking, if we have a set of forecasts from different models we can analyse the distribution of error measures and come to conclusions about performance of models. Here is an example with analysis of performance for \(h=1\) based on absolute errors:

aeValuesh1 <- cbind(abs(returnedValues1$holdout -
                        returnedValues1$pred)[1,],
                    abs(returnedValues1$holdout -
                        returnedValues2$pred)[1,])
colnames(aeValuesh1) <- c("ARIMA(0,1,1)","ARIMA(1,1,2)")
boxplot(aeValuesh1)

Figure 2.9: Boxplots of error measures of two methods.

The boxplots in Figure 2.9 can be interpreted as any other boxplots applied to random variables (see for example, discussion in Section 2.2 of Svetunkov, 2021c).


• Petropoulos, F., Kourentzes, N., 2015. Forecast combinations for intermittent demand. Journal of the Operational Research Society. 66, 914–924.
• Svetunkov, I., 2021c. Statistics for business analytics. (version: 01.10.2021)
• Svetunkov, I., Petropoulos, F., 2018. Old dog, new tricks: a modelling view of simple moving averages. International Journal of Production Research. 56, 6034–6047.
• Tashman, L.J., 2000. Out-of-sample tests of forecasting accuracy: An analysis and review. International Journal of Forecasting. 16, 437–450.
• Wikipedia, 2020. Cross-validation (statistics). (version: 2020-11-04)