<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Archives Applied forecasting - Open Forecasting</title>
	<atom:link href="https://openforecast.org/category/applied-forecasting/feed/" rel="self" type="application/rss+xml" />
	<link>https://openforecast.org/category/applied-forecasting/</link>
	<description>How to look into the future</description>
	<lastBuildDate>Mon, 19 Jan 2026 11:31:41 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2015/08/cropped-usd-05-32x32.png&amp;nocache=1</url>
	<title>Archives Applied forecasting - Open Forecasting</title>
	<link>https://openforecast.org/category/applied-forecasting/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Risky business: how to select your model based on risk preferences</title>
		<link>https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/</link>
					<comments>https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 19 Jan 2026 11:28:04 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[error measures]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[Information criteria]]></category>
		<category><![CDATA[model combination]]></category>
		<category><![CDATA[model selection]]></category>
		<category><![CDATA[papers]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3950</guid>

					<description><![CDATA[<p>What do you use for model selection? Do you select the best model based on its cross-validated performance, or do you use in-sample measures like AIC? If so, there is a way to improve your selection process further. JORS recently published the paper by Nikos Kourentzes and me based on a simple but powerful idea: [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/">Risky business: how to select your model based on risk preferences</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>What do you use for model selection? Do you select the best model based on its cross-validated performance, or do you use in-sample measures like AIC? If so, there is a way to improve your selection process further.</p>
<p>JORS recently published the paper by Nikos Kourentzes and me, based on a simple but powerful idea: instead of using summary statistics (like the mean RMSE of cross-validated errors), you should consider the entire distribution and choose a specific quantile. This aligns with <a href="https://openforecast.org/2024/03/27/what-does-lower-error-measure-really-mean/">my previous post on error measures</a>, but here is the core intuition:</p>
<p>The distribution of error measures is almost always asymmetric. If you only look at the average, you end up with a &#8220;mean temperature in the hospital&#8221; statistic, which doesn&#8217;t reflect how models actually behave. Some models perform great on most series but fail miserably on a few.</p>
<p>What can we do in this case? We can look at the quantiles of the distribution.</p>
<p>For example, if we use the 84th quantile, we compare the models based on their &#8220;bad&#8221; performance, i.e. the situations where they fail and produce less accurate forecasts. If you choose the best-performing model there, you will end up with something that does not fail as much. So your preferences for the model become risk-averse in this situation.</p>
<p>If you focus on a lower quantile (e.g. the 16th), you are looking at models that do well on the well-behaved series and ignoring how they do on the difficult ones. So, your model selection preferences can be described as risk-tolerant, because you accept that the best-performing model might fail on a difficult time series.</p>
<p>Furthermore, the median (the 50th quantile, the middle of the sample) corresponds to the risk-neutral situation, because it ignores the tails of the distribution.</p>
<p>What about the mean? This is a risk-agnostic strategy, because it says nothing about the performance on the difficult or easy time series &#8211; it takes everything and nothing in it at the same time, hiding the true risk profile.</p>
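<p>To make this concrete, here is a minimal sketch in Python. It is not from the paper: the error values, the model names and the <code>select_model()</code> helper are all made up for illustration.</p>

```python
import numpy as np

# Hypothetical cross-validated RMSE values for two models over 100 series.
# Model A is consistently mediocre; model B is better on most series,
# but fails badly on 20% of them.
rmse = {
    "A": np.linspace(0.9, 1.1, 100),
    "B": np.array([0.8] * 80 + [5.0] * 20),
}

def select_model(errors, quantile):
    """Pick the model with the lowest error at the given quantile of its
    error distribution: 0.84 ~ risk-averse, 0.5 ~ risk-neutral,
    0.16 ~ risk-tolerant."""
    scores = {name: np.quantile(values, quantile) for name, values in errors.items()}
    return min(scores, key=scores.get)

print(select_model(rmse, 0.84))  # "A": avoids the model that fails badly
print(select_model(rmse, 0.16))  # "B": accepts the risk of occasional failures
```

<p>On this toy data, the risk-averse choice (84th quantile) is the consistent model A, while the risk-tolerant choice (16th quantile) is model B, which does better on most series but fails badly on a fifth of them. This illustrates the idea only, not the exact procedure from the paper.</p>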
<p>So what?</p>
<p>In the paper, we show that using a risk-averse strategy tends to improve overall forecasting accuracy in day-to-day situations. Conversely, a risk-tolerant strategy can be beneficial when disruptions are anticipated, as standard models are likely to fail anyway.</p>
<p>So, next time you select a model, think about the measure you are using. If it’s just the mean RMSE, keep in mind that you might be ignoring the inherent risks of that selection.</p>
<p>P.S. While the discussion above applies to the distribution of error measures, our paper specifically focused on AIC (an in-sample measure). But AIC is a distance measure as well, so the logic explained above holds.</p>
<p>P.P.S. Nikos wrote a <a href="https://www.linkedin.com/posts/nikos-kourentzes-3660515_forecasting-datascience-analytics-activity-7414687127269007360-pLAh">post about this paper here</a>.</p>
<p>P.P.P.S. And here is <a href="https://github.com/trnnick/working_papers/blob/fd1973624e97fc755a9c2401f05c78b056780e34/Kourentzes_2026_Incorporating%20risk%20preferences%20in%20forecast%20selectionk.pdf">the link to the paper</a>.</p>
<p>Message <a href="https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/">Risky business: how to select your model based on risk preferences</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Review of a paper on comparison of modern machine learning techniques in retail</title>
		<link>https://openforecast.org/2025/06/22/review-of-a-paper-comparative-analysis-of-modern-machine-learning-models-for-retail-sales-forecasting/</link>
					<comments>https://openforecast.org/2025/06/22/review-of-a-paper-comparative-analysis-of-modern-machine-learning-models-for-retail-sales-forecasting/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Sun, 22 Jun 2025 21:59:19 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[AI and ML]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[papers]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3874</guid>

					<description><![CDATA[<p>A couple of days ago, I noticed a link to the following paper in a post by Jack Rodenberg: https://arxiv.org/abs/2506.05941v1. The topic seemed interesting and relevant to my work, so I read it, only to find that the paper contains several serious flaws that compromise its findings. Let me explain. Introduction But first, why am [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/06/22/review-of-a-paper-comparative-analysis-of-modern-machine-learning-models-for-retail-sales-forecasting/">Review of a paper on comparison of modern machine learning techniques in retail</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>A couple of days ago, I noticed a link to the following paper in a post by Jack Rodenberg: <a href="https://arxiv.org/abs/2506.05941v1" target="_blank">https://arxiv.org/abs/2506.05941v1</a>. The topic seemed interesting and relevant to my work, so I read it, only to find that the paper contains several serious flaws that compromise its findings. Let me explain.</p>
<h2>Introduction</h2>
<p>But first, why am I writing this post?</p>
<p>There’s growing interest in forecasting among data scientists, data engineers, ML experts etc. Many of them assume that they can apply their existing knowledge directly to this new area without reading domain-specific literature. As a result, we get a lot of &#8220;hit-or-miss&#8221; work: sometimes based on promising ideas, but executed in ways that violate basic forecasting principles. The main problem is that if your experiment is not done correctly, your results are compromised, i.e. your claims might simply be wrong.</p>
<p>If you&#8217;re a researcher writing forecasting-related papers, then hopefully reading this post (and the posts and papers I refer to) will help you improve your papers. This might lead to a smoother peer-review process. Also, while I can’t speak for other reviewers, if I come across a paper with similar issues, I typically give it a hard time.</p>
<p>I should also say that I am not a reviewer of this paper (if I were, I would not publish the review), but I merely decided to demonstrate what issues I can see when I read papers like that. The authors are just unlucky that I picked their paper&#8230;</p>
<p>Let&#8217;s start.</p>
<p>The authors apply several ML methods to retail data, compare their forecasting accuracy, and conclude that XGBoost and LightGBM outperform N-BEATS, NHITS, and Temporal Fusion Transformer. While the finding isn’t groundbreaking, additional evidence on a new dataset is always welcome.</p>
<h2>Major issues</h2>
<p>So, what&#8217;s wrong? Here is a list of the major comments:</p>
<ol>
<li><strong>Forecast horizon vs. data frequency</strong>:</li>
<p>Daily data with a 365-day forecast horizon makes no practical sense (page 2, paragraph 3). I haven&#8217;t seen any company making daily-level decisions a year in advance. Stock decisions are typically made on much shorter horizons, and if you need a year-ahead forecast, you definitely do not need it at the daily level. After all, there is no point in knowing that on 22nd December 2025 you will have an expected demand of 35.457 units &#8211; it is too far into the future to make any difference. Some references:</p>
<ul>
<li><a href="https://doi.org/10.1016/j.ijforecast.2022.08.003">Athanasopoulos and Kourentzes (2023)</a> paper discusses data frequency and some decisions related to them;</li>
<li>and there is <a href="/2024/09/24/how-to-choose-forecast-horizon/">a post on my website</a> on a related topic</li>
</ul>
<li><strong>Misuse of SBC classification</strong>:</li>
<p>Claiming that 70% of products are &#8220;intermittent&#8221; (page 2, last paragraph) based on SBC is incorrect. Furthermore, SBC classification does not make sense in this setting, and is not used in the paper anyway, so the authors should just drop it.</p>
<ul>
<li>Read more about it <a href="/2025/06/04/sbc-is-not-for-you/">here</a>.</li>
<li>And there is <a href="https://www.linkedin.com/posts/stephankolassa_on-the-categorization-of-demand-patterns-activity-7340669762894462978-Wjrb">a post of Stephan Kolassa</a> on exactly this point</li>
</ul>
<li><strong>Product elimination and introduction is unclear (page 3):</strong></li>
<p>The authors say &#8220;Around 30% of products were eliminated during training and 10% are newly introduced in validation&#8221;. It&#8217;s not clear why this was done and how specifically. This needs to be explained in more detail.</p>
<li><strong>&#8220;Missing values&#8221; undefined</strong>:</li>
<p>It is not clear what the authors mean by &#8220;missing values&#8221; (page 3, &#8220;Handling Missing Values&#8221;). How do they appear and why? Are they the same as stockouts, or were there some other issues in the data? This needs to be explained in more detail.</p>
<li><strong>Figure 1 is vague</strong>:</li>
<p>Figure 1 is supposed to explain how the missing values were treated. But the whole imputation process is questionable, because it is not clear how well it worked in comparison with alternatives, and how reasonable it is to have an imputed series that looks more erratic than the original one. The discussion of that needs to be expanded with some insights from the business problem.</p>
<li><strong>No stockout handling discussion</strong>:</li>
<p>The authors do not discuss whether the data has stockouts or not. This becomes especially important in retail, because if the stockouts are not treated correctly, you would end up forecasting sales instead of demand.</p>
<ul>
<li>For example, <a href="/2025/04/11/svetunkov-sroginis-2025-model-based-demand-classification/">see this post</a>.</li>
</ul>
<li><strong>Feature engineering is opaque</strong>:</li>
<p>&#8220;Lag and rolling-window statistics for sales and promotional indicators were created&#8221; (page 3, &#8220;Feature Engineering&#8221;) &#8211; it is not clear what specific lags, what lengths of rolling windows, and what statistics (anything besides the mean?) were created. These need to be explained for transparency, so that a reader could better understand what specifically was done. Without this explanation, it is not clear whether the features are sensible at all.</p>
<li><strong>Training/validation setup not explained</strong>:</li>
<p>It is not clear how specifically the split into training and validation sets was done (page 3, last paragraph), and whether the authors used rolling origin (aka time series cross-validation). If they did random splits, that could cause some issues, because the first law of time series is not to break its structure!</p>
<li><strong>Variables transformation is unclear</strong>:</li>
<p>It is not clear whether any transformations of the response variable were done. For example, if the data is not stationary, taking differences might be necessary to capture the trend and to do extrapolation correctly. Normalisation of variables is also important for neural networks, unless it is built into the functions the authors used. This is not discussed in the paper.</p>
<li><strong>Forecast strategy not explained</strong>:</li>
<p>It is not clear whether the direct or recursive strategy was used for forecasting. If lags were not used in the model, that would not matter, but they are, so this becomes a potential issue. Also, if the authors used the lag of the actual value on observation 235 steps ahead to produce the forecast for 236 steps ahead, then this is another fundamental issue, because it implies that the forecast horizon is just 1 step ahead, and not 365, as the authors claim. This needs to be explained in more detail.</p>
<ul>
<li>I&#8217;ve written <a href="/2024/05/25/recursive-vs-direct-forecasting-strategy/">a post</a> about the strategies.</li>
</ul>
<li><strong>No statistical benchmarks</strong>:</li>
<p>At the very least, the authors should use a simple moving average and probably exponential smoothing. Even if they do not perform well, this gives additional information about the performance of the other approaches. Without them, the claims about the good performance of the used ML approaches are not supported by evidence. The authors claim that they used the mean as a benchmark, but its performance is not discussed in the paper.</p>
<ul>
<li><a href="/2024/10/28/why-is-it-hard-to-beat-simple-moving-average/">A post on the Simple Moving Average</a></li>
<li><a href="/2024/01/10/why-you-should-care-about-exponential-smoothing/">Why you should care about the exponential smoothing</a></li>
</ul>
<li><strong>Issues with forecast evaluation</strong>:</li>
<p>The whole Table 3 with error measures is an example of what not to do. Here are some of major issues:</p>
<ol>
<li><a href="/2024/04/03/stop-reporting-several-error-measures-just-for-the-sake-of-them/">There is no point in reporting several error measures</a> &#8211; each of them is minimised by its own statistic. The error measure should align with what the approaches produce.</li>
<li>MSE, RMSE, MAE and ME should be dropped, because they are not scaled, so the authors are adding up error measures for bricks and nails. The result is meaningless.</li>
<li>MASE is not needed &#8211; it is minimised by the median, which could be a serious issue on intermittent demand (<a href="/2025/01/21/don-t-use-mae-based-error-measures-for-intermittent-demand/">see this post</a>). wMAPE has similar issues, because it is also based on MAE.</li>
<li>If the point forecasts are produced in terms of medians (like in case of NBEATS), then RMSSE should be dropped, and MASE should be used instead.</li>
<li>But also, comparing means with medians is not a good idea. If you assume a symmetric distribution, the two should coincide, but in general this might not hold.</li>
<li>R2 is not a good measure of forecast accuracy. It makes some sense in regression context for linear models, but in this one, it is pointless, and only shows that the authors don&#8217;t fully understand what they are doing. Plus, it&#8217;s not clear how specifically it was calculated.</li>
<li>I don&#8217;t fully understand &#8220;demand error&#8221;, &#8220;demand bias&#8221; and other measures, and the authors do not explain them in necessary detail. This needs to be added to the paper.</li>
<li>The split into &#8220;Individual Groups&#8221; and &#8220;Whole Category&#8221; is not well explained either: it is not clear what this means, why, and how this was done.</li>
<li>And in general, I don&#8217;t understand what the authors want to do with Cases A &#8211; D in Table 3. It is not clear why they are needed, and what they want to show with them. This is not explained in the paper.</li>
</ol>
<ul>
<li>I have a series of posts on forecast evaluation <a href="/category/forecasting-theory/forecast-evaluation/">here</a>.</li>
</ul>
<li><strong>Invalid analysis of bias measures</strong>:</li>
<p>Analysis of bias measures is meaningless because they were not scaled.</p>
<li><strong>Disturbing bias of NBEATS in Figure 2</strong>:</li>
<p>The bias shown in Figure 2 is disturbing and should be dealt with prior to evaluation. It could have appeared due to the loss function used for training or because the data was not pre-processed correctly. Leaving it as is and blaming NBEATS for this does not sound reasonable to me.</p>
<li><strong>No inventory implications</strong>:</li>
<p>The authors mention inventory management, but stop on forecasting, not showing how the specific forecasts translate to inventory decisions. If this paper was to be submitted to any operations-related journal, the inventory implications would need to be added in the discussion.</p>
<li><strong>Underexplained performance gaps</strong>:</li>
<p>The paper also does not explain well why neural networks performed worse than gradient boosting methods. The authors mention that this could be due to the effect of missing values, but this is speculation rather than an explanation, and one I personally do not believe (I might be wrong). While the overall results make sense to me personally, if you want to publish a good paper, you need to provide a more detailed answer to the question &#8220;why?&#8221;.</p>
</ol>
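<p>Several of the comments above (the training/validation split, the benchmarks, and the scaled error measures) can be illustrated together. Below is a minimal sketch in Python &#8211; not the authors&#8217; code, with a made-up trended series and toy benchmark forecasters &#8211; of a rolling-origin evaluation that scales each origin&#8217;s error by the in-sample one-step naive MSE, in the spirit of RMSSE:</p>

```python
import numpy as np

# A hypothetical trended daily series (a toy stand-in for real sales data).
y = 100 + 0.5 * np.arange(120)

h = 7                          # forecast horizon
origins = range(80, 120 - h)   # expanding-window forecast origins

def naive_forecast(train, h):
    return np.repeat(train[-1], h)

def mean_forecast(train, h):
    return np.repeat(train.mean(), h)

def rolling_origin_rmsse(forecaster, y, origins, h):
    """Rolling-origin evaluation: forecast h steps ahead from each origin,
    never letting the method see the holdout. Each origin's MSE is scaled
    by the in-sample one-step naive MSE (the RMSSE denominator), so the
    result is unit-free and can be averaged across series."""
    ratios = []
    for t in origins:
        train, test = y[:t], y[t:t + h]
        mse = np.mean((test - forecaster(train, h)) ** 2)
        scale = np.mean(np.diff(train) ** 2)  # in-sample naive MSE
        ratios.append(mse / scale)
    return float(np.sqrt(np.mean(ratios)))

print(rolling_origin_rmsse(naive_forecast, y, origins, h))  # ~4.47
print(rolling_origin_rmsse(mean_forecast, y, origins, h))   # much worse here
```

<p>The point is structural: every forecast is produced only from data before its origin, and the scaling makes the numbers comparable across series, so they can be safely averaged.</p>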
<h3>Minor issues</h3>
<p>I also have three minor comments:</p>
<ol>
<li>&#8220;many product series are censored&#8221; (page 2, last paragraph) is not what it sounds like. The authors imply that the histories are short, while the usual interpretation is that the sales are lower than the demand, so the values are censored. I would rewrite this.</li>
<li>Figure 2 has the legend saying &#8220;Poisson&#8221; three times, not providing any useful information. This is probably just a mistake, which can easily be fixed.</li>
<li>There are no references to Table 2 and Figure 3 in the paper. It is not clear why they are needed. Every table and figure should be referred to and explained.</li>
</ol>
<h2>Conclusions</h2>
<p>Overall, the paper has a sensible idea, but I feel that the authors need to learn more about forecasting principles, and that they have not read the forecasting literature carefully enough to understand how specifically the experiments should be designed, and what to do and what not to do (stop using SBC!). Because they made several serious mistakes, I feel that the results of the paper are compromised and might not be correct.</p>
<p>P.S. If I were a reviewer of this paper, I would recommend either &#8220;reject and resubmit&#8221; or a &#8220;major revision&#8221; (if the former option was not available).</p>
<p>P.P.S. If the authors of the paper are reading this, I hope you find these comments useful. If you have not submitted the paper yet, I&#8217;d suggest taking some of them (if not all) into account. Hopefully, this will smooth the submission process for you.</p>
<p>Message <a href="https://openforecast.org/2025/06/22/review-of-a-paper-comparative-analysis-of-modern-machine-learning-models-for-retail-sales-forecasting/">Review of a paper on comparison of modern machine learning techniques in retail</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/06/22/review-of-a-paper-comparative-analysis-of-modern-machine-learning-models-for-retail-sales-forecasting/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Six questions for a forecaster-consultant</title>
		<link>https://openforecast.org/2025/05/28/six-questions-for-a-forecaster-consultant/</link>
					<comments>https://openforecast.org/2025/05/28/six-questions-for-a-forecaster-consultant/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Wed, 28 May 2025 11:48:31 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[consultancy]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3844</guid>

					<description><![CDATA[<p>The NHS has a helpful page with a set of questions you can ask your GP to ensure you receive the right treatment for your illness. Surprisingly, these questions can be applied in other fields as well. Here’s an example in applied forecasting, working with companies. I’m not going to go through all of them, I [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/05/28/six-questions-for-a-forecaster-consultant/">Six questions for a forecaster-consultant</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>The NHS has <a href="https://www.nhs.uk/nhs-services/gps/what-to-ask-your-doctor/">a helpful page</a> with a set of questions you can ask your GP to ensure you receive the right treatment for your illness. Surprisingly, these questions can be applied in other fields as well. Here’s an example in applied forecasting, working with companies.</p>
<p>I’m not going to go through all of them; I picked the most useful ones. These are questions you can ask a consultant who is helping you (as a client). But they can also help you (as a consultant) deliver a better product for your client.</p>
<ol>
<li>What are the alternative approaches you could try? (Are there other ways to treat my condition?)</li>
<p>If a consultant presents only one approach and says it’s the best, they may have found something that works for them but not necessarily for you. Ask whether they’ve tried other approaches on your data. A fancy neural network might do a great job, but a gradient boosting method, or even a simple statistical one, could perform similarly. After all, there is no such thing as <a href="/2025/03/06/there-is-no-such-thing-as-the-best-approach-for-everything/">the best approach for everything</a>. If a consultant says there are no alternatives, find a new one.</p>
<li>What are the drawbacks of the proposed approach? (Are there any side effects or risks? If so, what are they?)</li>
<p><a href="/2025/01/07/there-is-no-such-thing-as-assumption-free-approach/">Every approach relies on assumptions</a>. If they are violated, the approach may fail. Understanding when and why this might happen is crucial. For example, the method might assume an inappropriate distribution (e.g. Normal for intermittent demand), or require more data than the company can afford to store. It could be too time-consuming to use in practice, or assume that the end users know what they are doing. There are always assumptions and trade-offs. If a consultant says, &#8220;There are no drawbacks&#8221;, they’re either lying or don’t understand what they’re doing.</p>
<li>What if we don’t change the process? (What will happen if I don’t have any treatment?)</li>
<p>One of the main responsibilities of a professional consultant is to benchmark new approaches against the current one. This allows them to show exactly what would change and what improvements it would bring. If they don’t do this, consider it a red flag.</p>
<li>How will the new approach improve our processes? (How effective is this treatment?)</li>
<p>At the very least, a consultant should demonstrate an improvement in accuracy. Even better, they should show how this translates into better decisions and cost reductions (e.g. via inventory simulations). Reducing the workload for demand planners can also be a significant benefit. In any case, ask &#8211; they should know.</p>
<li>Are there parts of the process that should be avoided? (Is there anything I should stop or avoid doing?)</li>
<p>A good consultant should know what can cause issues: e.g. if external information (promotions?) is used in the model, judgmentally adjusting forecasts may do more harm than good. They should be able to explain what not to do.</p>
<li>What else should we do? (Is there anything I can do to help myself?)</li>
<p>It’s not enough to just deploy a new approach. End users must understand how to use it. It’s great if demand planners are familiar with forecasting, statistics, and feature engineering, but if not, they might abandon an accurate approach in favour of a simpler, less effective alternative. A consultant should be able to explain what the company can do to make the new approach practical and ensure it delivers value (e.g. train the team).</p>
</ol>
<p>Any other questions you would ask?</p>
<p>Message <a href="https://openforecast.org/2025/05/28/six-questions-for-a-forecaster-consultant/">Six questions for a forecaster-consultant</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/05/28/six-questions-for-a-forecaster-consultant/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>5th IMA and OR Society Conference</title>
		<link>https://openforecast.org/2025/05/02/5th-ima-and-or-society-conference/</link>
					<comments>https://openforecast.org/2025/05/02/5th-ima-and-or-society-conference/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Fri, 02 May 2025 14:37:50 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[Conferences]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[intermittent demand]]></category>
		<category><![CDATA[presentations]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3832</guid>

					<description><![CDATA[<p>It was a pleasure to attend the 5th IMA and OR Society Conference at Aston University, Birmingham, and to present my research with Anna Sroginis on model-based demand classification. The conference attracted a great crowd of people from universities across the UK, along with several esteemed international colleagues. The event was very well organised &#8211; thanks to Aris [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/05/02/5th-ima-and-or-society-conference/">5th IMA and OR Society Conference</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>It was a pleasure to attend the 5th IMA and OR Society Conference at Aston University, Birmingham, and to present my research with Anna Sroginis on model-based demand classification. The conference attracted a great crowd of people from universities across the UK, along with several esteemed international colleagues. The event was very well organised &#8211; thanks to Aris Syntetos, Anna-Lena Sachs, Adam Letchford, Dilek Onkal, and Paresh Date.</p>
<p>My presentation was based on <a href="/2025/04/11/svetunkov-sroginis-2025-model-based-demand-classification/">this paper</a>. And here are the slides:<br />
<a href="https://openforecast.org/wp-content/uploads/2025/05/2025-05-01-IMA-OR.pdf">2025-05-01-IMA-OR</a></p>
<p>Message <a href="https://openforecast.org/2025/05/02/5th-ima-and-or-society-conference/">5th IMA and OR Society Conference</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/05/02/5th-ima-and-or-society-conference/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Challenges related to seasonal data: shifting seasonality</title>
		<link>https://openforecast.org/2025/04/07/challenges-related-to-seasonal-data/</link>
					<comments>https://openforecast.org/2025/04/07/challenges-related-to-seasonal-data/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 07 Apr 2025 12:54:49 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[Seasonality]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3814</guid>

					<description><![CDATA[<p>There are many different issues with capturing seasonality in time series. In this short post, I&#8217;d like to discuss one of the most annoying ones. I&#8217;m talking about the seasonal pattern that shifts over time. What I mean is that, for example, instead of having the standard number of observations in the cycle (e.g., 24 [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/04/07/challenges-related-to-seasonal-data/">Challenges related to seasonal data: shifting seasonality</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>There are many different issues with capturing seasonality in time series. In this short post, I&#8217;d like to discuss one of the most annoying ones.</p>
<p>I&#8217;m talking about the seasonal pattern that shifts over time. What I mean is that, for example, instead of having the standard number of observations in the cycle (e.g., 24 hours in a day), in some cases you can have more or fewer of them. How is that possible?</p>
<p>One of these issues is the Daylight Saving Time (DST) change. The original idea of DST was to reduce energy consumption, because daylight in summer is longer than in winter (there&#8217;s a nice and long article on Wikipedia about it). Because of this, many countries introduced a time shift: in spring, the clock is moved forward by one hour, while in autumn it goes back. This idea had a reasonable motivation at the beginning of the 20th century, but I personally think that as we&#8217;ve progressed as a society, it has lost its value.</p>
<p>While DST is already extremely annoying on its own, a bit unhealthy (several studies report an increased risk of heart attacks), and torture for parents with small kids (the little ones don&#8217;t understand that it&#8217;s not 7am yet), it also introduces a modelling challenge: two days in the year do not have 24 hours. In spring, we have 23 hours, while in autumn we have 25. Standard classical forecasting approaches (such as ETS/ARIMA, regression, STL or classical decomposition) break in this case, because by default they assume that a specific pattern repeats itself every 24 hours. The issue arises because business cycles are tuned to working hours, not to the movement of the sun &#8211; people come to work at 9am, no matter how many hours are in the day.</p>
<p>Another challenge is leap years. While DST is totally man-made, leap years occur because the Earth orbits the sun approximately every 365.25 days. To avoid drifting too far from reality, our calendars include one extra day every four years (29th February). This addresses the issue but also means that one year has 366 days instead of 365. Once again, conventional models relying on fixed periodicity fail.</p>
<p>There are several ways to handle this, all with their own advantages and disadvantages:</p>
<ol>
<li>Fix the data. In the case of DST, this means removing one of the duplicated hours during the autumn time change and adding one during the spring shift. For leap years, it means dropping the 29th of February. This is easy to do, but it breaks the structure of the data and might cause issues when a DST change or a leap year falls in the holdout sample.</li>
<li>Introduce more complex components, such as Fourier-based ones, to capture the shift in the data. This works well for leap years but doesn&#8217;t address the DST issue. <a href="https://doi.org/10.1016/j.energy.2015.02.100">Harmonic regressions</a> and <a href="https://doi.org/10.1198/jasa.2011.tm09771">TBATS</a> do this, for example.</li>
<li><a href="https://openforecast.org/adam/MultipleFrequenciesDSTandLeap.html">Shift seasonal indices</a> when the issue happens &#8211; for example, having two indices for 1am when the switch to winter time occurs.</li>
</ol>
<p>In R, I’ve developed the <code>temporaldummy()</code> function in the <code>greybox</code> package to introduce correct dummy variables for data with shifting seasonality, and I’ve incorporated method (3) into the <code>adam()</code> function from the <code>smooth</code> package. You can read more about these in <a href="https://openforecast.org/adam/MultipleFrequenciesDSTandLeap.html">this section of the ADAM monograph</a>.</p>
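<p>As a rough sketch of how <code>temporaldummy()</code> can be used (with a hypothetical <code>zoo</code> object <code>y</code> of hourly data with a POSIXct index; check the function&#8217;s documentation, as the exact arguments might differ between package versions):</p>
<pre class="decode">library(greybox)
library(zoo)
# Create hour-of-day dummy variables based on the timestamps of the data,
# so that each observation gets the dummy of its actual hour even when
# the clock shifts because of DST
xregHours <- temporaldummy(y, type="hour", of="day")
# These can then be used as explanatory variables in a regression or in adam()
</pre>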
<p>Are there any other strategies? Which one do you prefer?</p>
<p>BTW, Kandrika Pritularga and I are running a course on Demand Forecasting Principles with Examples in R. We’ll discuss some of these aspects there. Read more about it <a href="https://lancaster.ac.uk/centre-for-marketing-analytics-and-forecasting/grow-with-us/demand-forecasting-with-r/">here</a>.</p>
<p>Message <a href="https://openforecast.org/2025/04/07/challenges-related-to-seasonal-data/">Challenges related to seasonal data: shifting seasonality</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/04/07/challenges-related-to-seasonal-data/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Methods for the smooth functions in R</title>
		<link>https://openforecast.org/2024/10/10/methods-for-the-smooth-functions-in-r/</link>
					<comments>https://openforecast.org/2024/10/10/methods-for-the-smooth-functions-in-r/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Thu, 10 Oct 2024 13:46:22 +0000</pubDate>
				<category><![CDATA[adam()]]></category>
		<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[Package smooth for R]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[ADAM]]></category>
		<category><![CDATA[smooth]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3685</guid>

					<description><![CDATA[<p>I have been asked recently by a colleague of mine how to extract the variance from a model estimated using adam() function from the smooth package in R. The problem was that that person started reading the source code of the forecast.adam() and got lost between the lines (this happens to me as well sometimes). [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2024/10/10/methods-for-the-smooth-functions-in-r/">Methods for the smooth functions in R</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>I have been asked recently by a colleague of mine how to extract the variance from a model estimated using the <code>adam()</code> function from the <code>smooth</code> package in R. The problem was that the person started reading the source code of <code>forecast.adam()</code> and got lost between the lines (this happens to me as well sometimes). Well, there is an easier solution, and in this post I want to summarise several methods that I have implemented in the <code>smooth</code> package for its forecasting functions. I will focus on the <code>adam()</code> function, although all of them work for <code>es()</code> and <code>msarima()</code> as well, and some of them work for other functions too (at least as of now, for smooth v4.1.0). Some of them are also mentioned in the <a href="https://openforecast.org/adam/cheatSheet.html">Cheat sheet for adam() function</a> of my monograph (available <a href="https://openforecast.org/adam/">online</a>).</p>
<h3>The main methods</h3>
<p>The <code>adam</code> class supports several methods that are used in other packages in R (for example, for the <code>lm</code> class). Here they are:</p>
<ul>
<li><code>forecast()</code> and <code>predict()</code> &#8211; produce forecasts from the model. The former is preferred; the latter has somewhat limited functionality. See the documentation for the types of forecasts that can be generated. This was also discussed in <a href="https://openforecast.org/adam/ADAMForecasting.html">Chapter 18</a> of my monograph;</li>
<li><code>fitted()</code> &#8211; extracts the fitted values from the estimated object;</li>
<li><code>residuals()</code> &#8211; extracts the residuals of the model. These are values of \(e_t\), which differ depending on the error type of the model (see <a href="https://openforecast.org/adam/non-mle-based-loss-functions.html">discussion here</a>);</li>
<li><code>rstandard()</code> &#8211; returns standardised residuals, i.e. residuals divided by their standard deviation;</li>
<li><code>rstudent()</code> &#8211; studentised residuals, i.e. residuals divided by a standard deviation calculated without the impact of each specific observation. This helps in the case of influential outliers.</li>
</ul>
<p>An additional method was introduced in the <code>greybox</code> package, called <code>actuals()</code>, which allows extracting the actual values of the response variable. Another useful method is <code>accuracy()</code>, which returns a set of error measures for the provided model and the holdout values, calculated using the <code>measures()</code> function from the <code>greybox</code> package.</p>
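<p>Here is a minimal sketch of how these two methods can be used (the specific error measures in the output will depend on your data):</p>
<pre class="decode">ourModel <- adam(BJsales, h=10, holdout=TRUE)
actuals(ourModel)    # the values of the response variable
accuracy(ourModel)   # error measures based on the holdout
</pre>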
<p>All the methods above can be used for model diagnostics and for forecasting (the main purpose of the package). Furthermore, the <code>adam</code> class supports several functions for working with the coefficients of models, similar to how it is done in the case of <code>lm</code>:</p>
<ul>
<li><code>coef()</code> or <code>coefficients()</code> &#8211; extracts all the estimated coefficients of the model;</li>
<li><code>vcov()</code> &#8211; extracts the covariance matrix of parameters. This can be done either using Fisher Information or via a bootstrap (<code>bootstrap=TRUE</code>).  In the latter case, the <code>coefbootstrap()</code> method is used to create bootstrapped time series, reapply the model and extract estimates of parameters;</li>
<li><code>confint()</code> &#8211; returns the confidence intervals for the estimated parameters. Relies on <code>vcov()</code> and the assumption of normality (<a href="https://openforecast.org/adam/ADAMUncertaintyConfidenceInterval.html">CLT</a>);</li>
<li><code>summary()</code> &#8211; returns the output of the model, containing the table with estimated parameters, their standard errors and confidence intervals.</li>
</ul>
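<p>A short sketch of the first three of these methods (the specific numbers will depend on the model selected for the data):</p>
<pre class="decode">ourModel <- adam(BJsales, h=10, holdout=TRUE)
coef(ourModel)                  # point estimates of parameters
vcov(ourModel)                  # covariance matrix of the estimates
confint(ourModel, level=0.99)   # 99% confidence intervals
</pre>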
<p>Here is an example of an output from an ADAM ETS estimated using <code>adam()</code>:</p>
<pre class="decode">adamETSBJ <- adam(BJsales, h=10, holdout=TRUE)
summary(adamETSBJ, level=0.99)</pre>
<p>The first line above estimates and selects the most appropriate ETS for the data, while the second one will create a summary with 99% confidence intervals, which should look like this:</p>
<pre>Model estimated using adam() function: ETS(AAdN)
Response variable: BJsales
Distribution used in the estimation: Normal
Loss function type: likelihood; Loss function value: 241.1634
Coefficients:
      Estimate Std. Error Lower 0.5% Upper 99.5%  
alpha   0.8251     0.1975     0.3089      1.0000 *
beta    0.4780     0.3979     0.0000      0.8251  
phi     0.7823     0.2388     0.1584      1.0000 *
level 199.9314     3.6753   190.3279    209.5236 *
trend   0.2178     2.8416    -7.2073      7.6340  

Error standard deviation: 1.3848
Sample size: 140
Number of estimated parameters: 6
Number of degrees of freedom: 134
Information criteria:
     AIC     AICc      BIC     BICc 
494.3268 494.9584 511.9767 513.5372</pre>
<p>How to read this output is discussed in <a href="https://openforecast.org/adam/ADAMUncertaintyConfidenceInterval.html">Section 16.3</a>.</p>
<h3>Multistep forecast errors</h3>
<p>There are two methods that can be used as additional analytical tools for the estimated model. Their generics are implemented in the <code>smooth</code> package itself:</p>
<ol>
<li><code>rmultistep()</code> - extracts the multiple steps ahead in-sample forecast errors for the specified horizon. The model produces a forecast of length <code>h</code> for every observation, from the very first one to the last one, and the forecast errors are then calculated based on these. This is used in the case of semiparametric and nonparametric prediction intervals, but can also be used for diagnostics (see, for example, <a href="https://openforecast.org/adam/diagnosticsResidualsIIDExpectation.html#diagnosticsResidualsIIDExpectationMultiple">Subsection 14.7.3</a>);</li>
<li><code>multicov()</code> - returns the covariance matrix of the h steps ahead forecast errors. The diagonal of this matrix corresponds to the h steps ahead variances conditional on the in-sample information.</li>
</ol>
<p>For the same model that we used in the previous section, we can extract and plot the multistep errors:</p>
<pre class="decode">rmultistep(adamETSBJ, h=10) |> boxplot()
abline(h=0, col="red2", lwd=2)</pre>
<p>which will result in:<br />
<div id="attachment_3689" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamBJETSMulti.png&amp;nocache=1"><img fetchpriority="high" decoding="async" aria-describedby="caption-attachment-3689" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamBJETSMulti-300x210.png&amp;nocache=1" alt="Distributions of the multistep forecast errors" width="300" height="210" class="size-medium wp-image-3689" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamBJETSMulti-300x210.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamBJETSMulti-768x538.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamBJETSMulti.png&amp;nocache=1 1000w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-3689" class="wp-caption-text">Distributions of the multistep forecast errors</p></div>
<p>The image above shows that the model tends to undershoot the actual values in-sample (the boxplots lie slightly above the zero line). This might cause a bias in the final forecasts.</p>
<p>The covariance matrix of the multistep forecast error looks like this in our case:</p>
<pre class="decode">multicov(adamETSBJ, h=10) |> round(3)</pre>
<pre>       h1    h2     h3     h4     h5     h6     h7     h8     h9    h10
h1  1.918 2.299  2.860  3.299  3.643  3.911  4.121  4.286  4.414  4.515
h2  2.299 4.675  5.729  6.817  7.667  8.333  8.853  9.260  9.579  9.828
h3  2.860 5.729  8.942 10.651 12.250 13.501 14.480 15.246 15.845 16.314
h4  3.299 6.817 10.651 14.618 16.918 18.979 20.592 21.854 22.841 23.613
h5  3.643 7.667 12.250 16.918 21.538 24.348 26.808 28.733 30.239 31.417
h6  3.911 8.333 13.501 18.979 24.348 29.515 32.753 35.549 37.737 39.448
h7  4.121 8.853 14.480 20.592 26.808 32.753 38.372 41.964 45.036 47.440
h8  4.286 9.260 15.246 21.854 28.733 35.549 41.964 47.950 51.830 55.127
h9  4.414 9.579 15.845 22.841 30.239 37.737 45.036 51.830 58.112 62.223
h10 4.515 9.828 16.314 23.613 31.417 39.448 47.440 55.127 62.223 68.742</pre>
<p>This is not useful on its own, but can be used for some further derivations.</p>
<p>Note that the values returned by both <code>rmultistep()</code> and <code>multicov()</code> depend on the model's error type (see <a href="https://openforecast.org/adam/non-mle-based-loss-functions.html">Section 11.2</a> for clarification).</p>
<h3>Model diagnostics</h3>
<p>The conventional <code>plot()</code> method applied to a model estimated using <code>adam()</code> can produce a variety of images for visual model diagnostics. This is controlled by the <code>which</code> parameter (16 options overall). The documentation of <code>plot.smooth()</code> contains the exhaustive list of options, and Chapter 14 of the monograph shows how they can be used for model diagnostics. Here I only list several main ones:</p>
<ul>
<li><code>plot(ourModel, which=1)</code> - actuals vs fitted values. Can be used for general diagnostics of the model. Ideally, all points should lie around the diagonal line;</li>
<li><code>plot(ourModel, which=2)</code> - standardised residuals vs fitted values. Useful for detecting potential outliers. Also accepts the <code>level</code> parameter, which regulates the width of the confidence bounds;</li>
<li><code>plot(ourModel, which=4)</code> - absolute residuals vs fitted, which can be used for detecting heteroscedasticity of the residuals;</li>
<li><code>plot(ourModel, which=6)</code> - QQ plot for the analysis of the distribution of the residuals. The specific figure changes depending on the distribution assumed in the model (see <a href="https://openforecast.org/adam/ADAMETSEstimationLikelihood.html">Section 11.1</a> for the supported ones);</li>
<li><code>plot(ourModel, which=7)</code> - actuals, fitted values and point forecasts over time. Useful for understanding how the model fits the data and what point forecast it produces;</li>
<li><code>plot(ourModel, which=c(10,11))</code> - ACF and PACF of the residuals of the model to detect potentially missing AR/MA elements;</li>
<li><code>plot(ourModel, which=12)</code> - plot of the components of the model. In the case of ETS, it will show the time series decomposition based on the model.</li>
</ul>
<p>And here are four default plots for the model that we estimated earlier:</p>
<pre class="decode">par(mfcol=c(2,2))
plot(adamETSBJ)</pre>
<div id="attachment_3695" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamETSBJPlots.png&amp;nocache=1"><img decoding="async" aria-describedby="caption-attachment-3695" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamETSBJPlots-300x210.png&amp;nocache=1" alt="Diagnostic plots for the estimated model" width="300" height="210" class="size-medium wp-image-3695" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamETSBJPlots-300x210.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamETSBJPlots-768x538.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamETSBJPlots.png&amp;nocache=1 1000w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-3695" class="wp-caption-text">Diagnostic plots for the estimated model</p></div>
<p>Based on the plot above, we can conclude that the model fits the data fine and does not have apparent heteroscedasticity, but it has several potential outliers, which can be explored to improve it. Outlier detection is done via the <code>outlierdummy()</code> method, the generic of which is implemented in the <code>greybox</code> package.</p>
<h3>Other useful methods</h3>
<p>There are many other methods used by the functions to extract information about the model. I sometimes use them to simplify my coding routine. Here they are:</p>
<ul>
<li><code>lags()</code> - returns lags of the model. Especially useful if you fit a multiple seasonal model;</li>
<li><code>orders()</code> - the vector of orders of the model. Mainly useful in case of ARIMA, which can have multiple seasonalities and p,d,q,P,D,Q orders;</li>
<li><code>modelType()</code> - the type of the model. For the one fitted above it will return "AAdN". Can be useful to easily refit a similar model on new data;</li>
<li><code>modelName()</code> - the name of the model. For the one fitted above it will return "ETS(AAdN)";</li>
<li><code>nobs()</code>, <code>nparam()</code>, <code>nvariate()</code> - the number of in-sample observations, the number of all estimated parameters and the number of time series used in the model respectively. The last one is developed mainly for multivariate models, such as VAR and VETS (e.g. in the <code>legion</code> package for R);</li>
<li><code>logLik()</code> - extracts log-Likelihood of the model;</li>
<li><code>AIC()</code>, <code>AICc()</code>, <code>BIC()</code>, <code>BICc()</code> - extract respective information criteria;</li>
<li><code>sigma()</code> - returns the standard error of the residuals.</li>
</ul>
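<p>A quick sketch of several of these methods for the model estimated earlier (the comments reflect what to expect based on the summary above):</p>
<pre class="decode">lags(adamETSBJ)        # lags of the model, e.g. 1 for a non-seasonal one
modelType(adamETSBJ)   # e.g. "AAdN"
nobs(adamETSBJ)        # 140 in-sample observations in our example
logLik(adamETSBJ)      # log-likelihood of the model
AICc(adamETSBJ)        # corrected Akaike Information Criterion
</pre>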
<h3>More specialised methods</h3>
<p>One of the methods that can be useful for scenarios and artificial data generation is <code>simulate()</code>. It takes the structure and parameters of the estimated model and uses them to generate a time series similar to the original one. This is discussed in <a href="https://openforecast.org/adam/ADAMUncertaintySimulation.html">Section 16.1</a> of the ADAM monograph.</p>
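<p>A minimal sketch of <code>simulate()</code> applied to the model estimated earlier (the seed is arbitrary and is fixed only for reproducibility):</p>
<pre class="decode">set.seed(41)
# Generate an artificial series with the structure and parameters of adamETSBJ
ourSimulation <- simulate(adamETSBJ)
plot(ourSimulation)
</pre>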
<p>Furthermore, <code>smooth</code> implements the scale model, discussed in <a href="https://openforecast.org/adam/ADAMscaleModel.html">Chapter 17</a>, which allows modelling the time-varying scale of the distribution. This is done via the <code>sm()</code> method (a generic introduced in the <code>greybox</code> package), the output of which can then be merged with the original model via the <code>implant()</code> method.</p>
<p>For the same model that we used earlier, the scale model can be estimated this way:</p>
<pre class="decode">adamETSBJSM <- sm(adamETSBJ)</pre>
<p>This is how it looks:</p>
<pre class="decode">plot(adamETSBJSM, 7)</pre>
<div id="attachment_3707" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamETSBJSM.png&amp;nocache=1"><img decoding="async" aria-describedby="caption-attachment-3707" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamETSBJSM-300x210.png&amp;nocache=1" alt="Scale model for the ADAM ETS" width="300" height="210" class="size-medium wp-image-3707" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamETSBJSM-300x210.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamETSBJSM-768x538.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamETSBJSM.png&amp;nocache=1 1000w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-3707" class="wp-caption-text">Scale model for the ADAM ETS</p></div>
<p>In the plot above, the y-axis contains the squared residuals. The fact that the holdout sample contains a large increase in the error is expected, because that part corresponds to the forecast errors rather than residuals. It is added to the plot for completeness.</p>
<p>To use the scale model in forecasting, we should implant it in the location one, which can be done using the following command:</p>
<pre class="decode">adamETSBJFull <- implant(location=adamETSBJ, scale=adamETSBJSM)</pre>
<p>The resulting model will have fewer degrees of freedom (because the scale model estimated two parameters), but its prediction interval will now take the scale model into account and will differ from the original one. The variance will now be time varying, based on the more recent information, instead of being averaged across the whole time series. In our case, the forecast variance is lower than the one we would obtain from the <code>adamETSBJ</code> model alone. This leads to a narrower prediction interval (you can produce the intervals for both models and compare them):</p>
<pre class="decode">forecast(adamETSBJFull, h=10, interval="prediction") |> plot()</pre>
<div id="attachment_3708" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamETSBJFull.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-3708" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamETSBJFull-300x210.png&amp;nocache=1" alt="Forecast from the full ADAM, containing both location and scale parts" width="300" height="210" class="size-medium wp-image-3708" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamETSBJFull-300x210.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamETSBJFull-768x538.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/10/adamETSBJFull.png&amp;nocache=1 1000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-3708" class="wp-caption-text">Forecast from the full ADAM, containing both location and scale parts</p></div>
<h3>Conclusions</h3>
<p>The methods discussed above give you some flexibility in how to model things and what tools to use. I hope this makes your life easier and that you won't need to spend time reading the source code, but can instead focus on <a href="https://openforecast.org/adam/">forecasting and analytics with ADAM</a>.</p>
<p>Message <a href="https://openforecast.org/2024/10/10/methods-for-the-smooth-functions-in-r/">Methods for the smooth functions in R</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2024/10/10/methods-for-the-smooth-functions-in-r/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>The role of M competitions in forecasting</title>
		<link>https://openforecast.org/2024/03/14/the-role-of-m-competitions-in-forecasting/</link>
					<comments>https://openforecast.org/2024/03/14/the-role-of-m-competitions-in-forecasting/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Thu, 14 Mar 2024 14:53:03 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[ARIMA]]></category>
		<category><![CDATA[ETS]]></category>
		<category><![CDATA[stories]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3358</guid>

					<description><![CDATA[<p>If you are interested in forecasting, you might have heard of M-competitions. They played a pivotal role in developing forecasting principles, yet also sparked controversy. In this short post, I&#8217;ll briefly explain their historical significance and discuss their main findings. Before the M-competitions, only a few papers properly evaluated forecasting approaches. Statisticians assumed that if a model [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2024/03/14/the-role-of-m-competitions-in-forecasting/">The role of M competitions in forecasting</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>If you are interested in forecasting, you might have heard of M-competitions. They played a pivotal role in developing forecasting principles, yet also sparked controversy. In this short post, I&#8217;ll briefly explain their historical significance and discuss their main findings.</p>
<p>Before the M-competitions, only a few papers properly evaluated forecasting approaches. Statisticians assumed that if a model had solid theoretical backing, it should perform well. One of the first papers to conduct a proper evaluation was <a href="https://doi.org/10.2307/2344546">Newbold &#038; Granger (1974)</a>, who compared exponential smoothing (ES), ARIMA, and stepwise AR on 106 economic time series. Their conclusions were:</p>
<p>1. ES performed well on short time series;<br />
2. Stepwise AR did well on the series with more than 30 observations;<br />
3. Box-Jenkins methodology was recommended for series longer than 50 observations.</p>
<p>The statistical community received the results favourably, as they aligned with its expectations.</p>
<p>In 1979, <a href="https://doi.org/10.2307/2345077">Makridakis &#038; Hibon</a> conducted a similar analysis on 111 time series, including various ES methods and ARIMA. However, they found that &#8220;simpler methods perform well in comparison to the more complex and statistically sophisticated ARMA models&#8221;. This is because ARIMA performed slightly worse than ES, which contradicted the findings of Newbold &#038; Granger. Furthermore, their paper faced heavy criticism, with some claiming that Makridakis did not correctly utilize Box-Jenkins methodology.</p>
<p>So, in 1982, <a href="https://doi.org/10.1002/for.3980010202">Makridakis et al.</a> organized a competition on 1001 time series, inviting external participants to submit their forecasts. It was won by&#8230; the <a href="https://doi.org/10.1002/for.3980010108">ARARMA model by Emmanuel Parzen</a>. This model used information criteria for ARMA order selection instead of Box-Jenkins methodology. The main conclusion drawn from this competition was that &#8220;<strong>Statistically sophisticated or complex methods do not necessarily provide more accurate forecasts than simpler ones</strong>.&#8221; Note that this does not mean that simple methods are always better, because that was not even the case in the first competition: it was won by a quite complicated statistical model based on ARMA. This only means that the complexity does not necessarily translate into accuracy.</p>
<p>The M2 competition focused on judgmental forecasting, and is not discussed here.</p>
<p>We then arrive at the <a href="https://doi.org/10.1016/S0169-2070(00)00057-1">M3 competition</a> with 3003 time series and, once again, open submission for anyone. The results largely confirmed the previous findings, with <a href="https://doi.org/10.1016/S0169-2070(00)00066-2">Theta</a> by Vassilis Assimakopoulos and Kostas Nikolopoulos outperforming all the other methods. Note that ARIMA with order selection based on the Box-Jenkins methodology performed fine, but could not beat its competitors.</p>
<p>Finally, we arrive at the <a href="https://doi.org/10.1016/j.ijforecast.2019.04.014">M4 competition</a>, which had 100,000 time series and was open to an even wider audience. While I have <a href="https://openforecast.org/2020/03/01/m-competitions-from-m4-to-m5-reservations-and-expectations/">my reservations about the competition itself</a>, there were several curious findings, including the fact that the ARIMA implementation of <a href="https://doi.org/10.18637/jss.v027.i03">Hyndman &#038; Khandakar (2008)</a> performed on average better than ETS (Theta outperformed both of them), and that the more complex methods won the competition.</p>
<p>It was also the first paper to show that accuracy tends to increase on average with the computational time spent on training. This means that if you want more accurate forecasts, you need to spend more resources. The only catch is that this comes with diminishing returns: the improvements become smaller and smaller the more time you spend on training.</p>
<p>The competition was followed by M5 and M6, and now they plan to have another one. I don&#8217;t want to discuss all of them &#8211; they are beyond the scope of this short post (see details on the <a href="https://mofc.unic.ac.cy/history-of-competitions/">website of the competitions</a>). But I personally find the first competitions very impactful and useful.</p>
<p>And here are my personal takeaways from these competitions:</p>
<p>1. Simple forecasting methods perform well and should be included as benchmarks in experiments;<br />
2. Complex methods can outperform simple ones, especially if used intelligently, but you might need to spend more resources to gain in accuracy;<br />
3. ARIMA is effective, but the Box-Jenkins methodology may not be practical. Using information criteria for order selection is a better approach (as evidenced by the ARARMA example and the Hyndman &#038; Khandakar implementation).</p>
<p>Finally, I like the following <a href="https://robjhyndman.com/hyndsight/m4comp/">quote from Rob J. Hyndman about the competitions</a> that gives some additional perspective: &#8220;The &#8220;M&#8221; competitions organized by Spyros Makridakis have had an enormous influence on the field of forecasting. They focused attention on what models produced good forecasts, rather than on the mathematical properties of those models&#8221;.</p>
<div id="attachment_3360" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/03/2024-03-13-M3-competition.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-3360" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/03/2024-03-13-M3-competition-300x182.png&amp;nocache=1" alt="Table with the results of the M3 competition" width="300" height="182" class="size-medium wp-image-3360" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/03/2024-03-13-M3-competition-300x182.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/03/2024-03-13-M3-competition-768x467.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/03/2024-03-13-M3-competition.png&amp;nocache=1 923w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-3360" class="wp-caption-text">Table with the results of the M3 competition</p></div>
<p>Message <a href="https://openforecast.org/2024/03/14/the-role-of-m-competitions-in-forecasting/">The role of M competitions in forecasting</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2024/03/14/the-role-of-m-competitions-in-forecasting/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Story of &#8220;Probabilistic forecasting of hourly emergency department arrivals&#8221;</title>
		<link>https://openforecast.org/2023/05/10/story-of-probabilistic-forecasting-of-hourly-emergency-department-arrivals/</link>
					<comments>https://openforecast.org/2023/05/10/story-of-probabilistic-forecasting-of-hourly-emergency-department-arrivals/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Wed, 10 May 2023 20:47:27 +0000</pubDate>
				<category><![CDATA[adam()]]></category>
		<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[ETS]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Regression]]></category>
		<category><![CDATA[Stories]]></category>
		<category><![CDATA[Univariate models]]></category>
		<category><![CDATA[ADAM]]></category>
		<category><![CDATA[papers]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3092</guid>

					<description><![CDATA[<p>The paper Back in 2020, when we were all sitting in the COVID lockdown, I had a call with Bahman Rostami-Tabar to discuss one of our projects. He told me that he had hourly data from an Emergency Department of a hospital in Wales, and suggested writing a paper for a healthcare audience to [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2023/05/10/story-of-probabilistic-forecasting-of-hourly-emergency-department-arrivals/">Story of &#8220;Probabilistic forecasting of hourly emergency department arrivals&#8221;</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><a href="/en/2023/05/09/probabilistic-forecasting-of-hourly-emergency-department-arrivals/">The paper</a></p>
<p>Back in 2020, when we were all sitting in the COVID lockdown, I had a call with <a href="https://www.bahmanrt.com/">Bahman Rostami-Tabar</a> to discuss one of our projects. He told me that he had hourly data from an Emergency Department of a hospital in Wales, and suggested writing a paper for a healthcare audience to show them how forecasting can be done properly in this setting. I noted that we did not have experience working with high-frequency data, and that it would be good to have someone with relevant expertise. I knew a guy who worked in energy forecasting, <a href="http://www.jethrobrowell.com/">Jethro Browell</a> (we are mates in the <a href="https://forecasters.org/programs/communities/united-kingdom-chapter/">IIF UK Chapter</a>), so the three of us had a chat and formed a team to figure out better ways of forecasting ED arrivals.</p>
<p>We agreed that each of us would try their own models. Bahman wanted to try TBATS, Prophet and models from the <a href="https://github.com/tidyverts/fasster">fasster</a> package in R (spoiler: the latter produced very poor forecasts on our data, so we removed them from the paper). Jethro had a pool of <a href="https://www.gamlss.com/" rel="noopener" target="_blank">GAMLSS</a> models with different distributions, including Poisson and truncated Normal. He also tried a Gradient Boosting Machine (GBM). I decided to test ETS, Poisson Regression and <a href="https://openforecast.org/adam/" rel="noopener" target="_blank">ADAM</a>. We agreed to measure the performance of the models not only in terms of point forecasts (using RMSE), but also in terms of quantiles (pinball loss and quantile bias) and computational time. It took us a year to run all the experiments and another year to find a journal that would not desk-reject our paper because the editor thought it was not relevant (even though they had published similar papers in the past). It was rejected by Annals of Emergency Medicine, Emergency Medicine Journal, American Journal of Emergency Medicine and Journal of Medical Systems. In the end, we submitted it to Health Systems, and after a short revision the paper was accepted. So, there is a happy ending to this story.</p>
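<p>As a side note, the pinball (quantile) loss used in that evaluation is easy to compute; here is a generic sketch in R (not the exact code from the paper):</p>
<pre class="decode"># Pinball loss of a quantile forecast q at probability level tau:
# under-forecasts are penalised by tau, over-forecasts by (1 - tau)
pinball <- function(y, q, tau) {
  mean(ifelse(y >= q, tau * (y - q), (1 - tau) * (q - y)))
}

# Toy example: three actuals against a 0.95 quantile forecast of 11
pinball(y=c(10, 12, 9), q=rep(11, 3), tau=0.95)</pre>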
<p>In the paper itself, we found that, overall, in terms of quantile bias (calibration of the models), GAMLSS with the truncated Normal distribution and ADAM performed better than the other approaches, with the former also doing well in terms of pinball loss and the latter doing well in terms of point forecasts (RMSE). Note that the count data models did worse than the continuous ones, although one would expect the Poisson distribution to be appropriate for ED arrivals.</p>
<p>I don&#8217;t want to explain the paper and its findings in detail in this post, but given my involvement with ADAM, I decided to briefly explain what I included in the model and how it was used. After all, this is the first paper that uses almost all the main features of ADAM and shows how powerful it can be when used correctly.</p>
<h3>Using ADAM in Emergency Department arrivals forecasting</h3>
<p><strong>Disclaimer</strong>: The explanation provided here relies on the content of my monograph &#8220;<a href="https://openforecast.org/adam/">Forecasting and Analytics with ADAM</a>&#8220;. In the paper, I ended up creating a rather complicated model that could capture complex demand dynamics. To fully understand what I discuss in this post, you might need to refer to the monograph.</p>
<div id="attachment_3117" style="width: 1210px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-3117" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data.png&amp;nocache=1" alt="Emergency Department Arrivals" width="1200" height="800" class="size-full wp-image-3117" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data.png&amp;nocache=1 1200w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data-300x200.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data-1024x683.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data-768x512.png&amp;nocache=1 768w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a><p id="caption-attachment-3117" class="wp-caption-text">Emergency Department Arrivals. The plots were generated using <code>seasplot()</code> function from the <code>tsutils</code> package.</p></div>
<p>The figure above shows the data that we were dealing with, together with several seasonal plots (generated using the <code>seasplot()</code> function from the <code>tsutils</code> package). As we can see, the data exhibits hour of day, day of week and week of year seasonalities, although some of them are not very well pronounced. The data does not seem to have a strong trend, although there is a slow increase in the level. Based on this, I decided to use ETS(M,N,M) as the basis for modelling. However, capturing all three seasonal patterns would require fitting a triple-seasonal model, which takes too much computational time because of the estimation of all the seasonal indices. So, I decided to use a <a href="https://openforecast.org/adam/ADAMMultipleFrequencies.html">double-seasonal ETS(M,N,M)</a> instead, with hour of day and hour of week seasonalities, and to include <a href="https://openforecast.org/adam/ETSXMultipleSeasonality.html">dummy variables for the week of year seasonality</a>. The alternative to the week of year dummies would be an hour of year seasonal component, which would require estimating 8760 seasonal indices, potentially overfitting the data. I argue that the week of year dummies provide sufficient flexibility, and there is no need to capture the intra-yearly profile at a more granular level.</p>
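<p>For illustration, a week of year variable can be constructed from hourly timestamps in a couple of lines of R. The name <code>weekOfYear</code> matches the variable appearing in the model formula later in this post, but this particular construction is my sketch, not necessarily the code used in the paper:</p>
<pre class="decode"># ISO week number of each hourly observation, turned into a factor,
# which the model can then expand into week of year dummy variables
timestamps <- seq(as.POSIXct("2019-01-01 00:00", tz="UTC"),
                  by="hour", length.out=24*365)
weekOfYear <- factor(format(timestamps, "%V"))
nlevels(weekOfYear)  # number of distinct weeks, i.e. number of dummies</pre>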
<p>To make things more exciting, given that we deal with hourly data from a UK hospital, we had to handle the issues of <a href="https://openforecast.org/adam/MultipleFrequenciesDSTandLeap.html">daylight saving and the leap year</a>. I know that many of us hate the idea of daylight saving, because we have to change our lifestyles twice a year just because of an old 18th-century tradition. But in addition to being <a href="https://publichealth.jhu.edu/2023/7-things-to-know-about-daylight-saving-time#:~:text=Making%20the%20shift%20can%20increase,a%20professor%20in%20Mental%20Health.">bad for your health</a>, this nasty thing messes things up for my models, because once a year a day has 23 hours, and once it has 25. Luckily, this is taken care of by <code>adam()</code>, which shifts the seasonal indices when the time change happens. All you need to do for this mechanism to work is provide an object with timestamps to the function (for example, a <code>zoo</code> object). As for the leap year, it becomes less important when we model the week of year seasonality instead of the day of year or hour of year one.</p>
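<p>To see why the timestamps matter, consider the UK spring clock change (the last Sunday of March), which produces a 23-hour day:</p>
<pre class="decode"># The clocks go forward on 2023-03-26, so that day has only 23 hours;
# without timestamps, a model assuming 24 observations per day would
# have all its seasonal indices silently shifted from that point on
dayStart <- as.POSIXct("2023-03-26 00:00:00", tz="Europe/London")
dayEnd <- as.POSIXct("2023-03-27 00:00:00", tz="Europe/London")
difftime(dayEnd, dayStart, units="hours")  # 23 hours instead of 24</pre>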
<div id="attachment_3123" style="width: 1210px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data-daily.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-3123" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data-daily.png&amp;nocache=1" alt="Emergency Department Daily Arrivals" width="1200" height="700" class="size-full wp-image-3123" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data-daily.png&amp;nocache=1 1200w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data-daily-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data-daily-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data-daily-768x448.png&amp;nocache=1 768w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a><p id="caption-attachment-3123" class="wp-caption-text">Emergency Department Daily Arrivals</p></div>
<p>Furthermore, as can be seen from the figure above, <a href="https://openforecast.org/adam/ADAMX.html">calendar events</a> play a crucial role in ED arrivals. For example, Emergency Department demand over Christmas is typically lower than average (the drops in the figure above), but right after Christmas it tends to go up (with all the people who injured themselves during the festivities showing up in the hospital). So, these events need to be taken into account by the model in the form of additional dummy variables, together with their lags (the 24-hour lags of the original variables).</p>
<p>But that&#8217;s not all. If we want to fit a multiplicative seasonal model (which makes more sense than the additive one due to the changing seasonal amplitude at different times of year), we need to do something with the zeroes, which occur naturally in ED arrivals overnight (see the first figure in this post with the seasonal plots). They do not necessarily happen at the same time of day, but the probability of having no arrivals tends to increase at night. This meant that I needed to introduce the <a href="https://openforecast.org/adam/ADAMIntermittent.html">occurrence part of the model</a> to take care of the zeroes. I used a very basic occurrence model called &#8220;<a href="https://openforecast.org/adam/ADAMOccurrence.html#oETSD">direct probability</a>&#8221;, because it is more sensitive to changes in demand occurrence, making the model more responsive. I did not use a seasonal demand occurrence model (and I don&#8217;t remember why), which is one of the limitations of the ADAM used in this study.</p>
<p>Finally, given that we are dealing with low-volume data, a positive distribution needed to be used instead of the Normal one. I used the <a href="https://openforecast.org/adam/ADAMETSMultiplicativeDistributions.html">Gamma distribution</a>, because it is better behaved than the Log-Normal or the Inverse Gaussian, which tend to have much heavier tails. When exploring the data, I found that Gamma did better than the other two, probably because the ED arrivals have relatively slim tails.</p>
<p>So, the final ADAM included the following features:</p>
<ul>
<li>ETS(M,N,M) as the basis;</li>
<li>Double seasonality;</li>
<li>Week of year dummy variables;</li>
<li>Dummy variables for calendar events with their lags;</li>
<li>&#8220;Direct probability&#8221; occurrence model;</li>
<li>Gamma distribution for the residuals of the model.</li>
</ul>
<p>This model is summarised in equation (3) of <a href="/en/2023/05/09/probabilistic-forecasting-of-hourly-emergency-department-arrivals/">the paper</a>.</p>
<p>The model was <a href="https://openforecast.org/adam/ADAMInitialisation.html">initialised using backcasting</a>, because otherwise we would need to estimate too many initial values for the state vector. The estimation itself was done using <a href="https://openforecast.org/adam/ADAMETSEstimationLikelihood.html">likelihood</a>. In R, this corresponded to roughly the following lines of code:</p>
<pre class="decode">library(smooth)
# Occurrence part of the model: ETS(M,N,N) with the direct probability
oesModel <- oes(y, "MNN", occurrence="direct", h=48)
# Double-seasonal ETS(M,N,M) with explanatory variables,
# backcasted initials and Gamma-distributed residuals
adamModelFirst <- adam(ourData, "MNM", lags=c(24,24*7), formula=y~x+xLag24+weekOfYear,
                       h=48, initial="backcasting",
                       occurrence=oesModel, distribution="dgamma")</pre>
<p>Here, <code>x</code> is a categorical variable (a factor in R) with all the main calendar events. However, even with backcasting, the estimation of such a big model took an hour and 25 minutes. Given that Bahman, Jethro and I had agreed to do a rolling origin evaluation, I decided to help the function with the estimation inside the loop, providing <a href="https://openforecast.org/adam/ADAMInitialisation.html#starting-optimisation-of-parameters">the initial values to the optimiser</a> based on the very first estimated model. As a result, each estimation of ADAM in the rolling origin took 1.5 minutes. The code in the loop was modified to:</p>
<pre class="decode"># Extract the estimated parameters of the first model...
adamParameters <- coef(adamModelFirst)
oesModel <- oes(y, "MNN", occurrence="direct", h=48)
# ...and pass them to the optimiser as starting values via B
adamModel <- adam(ourData, "MNM", lags=c(24,24*7), formula=y~x+xLag24+weekOfYear,
                  h=48, initial="backcasting",
                  occurrence=oesModel, distribution="dgamma",
                  B=adamParameters)</pre>
<p>Finally, we generated mean and quantile forecasts for 48 hours ahead. I used <a href="https://openforecast.org/adam/ADAMForecastingPI.html#semiparametric-intervals">semiparametric quantiles</a>, because I expected violations of some of the model&#8217;s assumptions (e.g. autocorrelated residuals). The respective R code is:</p>
<pre class="decode">testForecast <- forecast(adamModel, newdata=newdata, h=48,
                         interval="semiparametric", level=c(1:19/20), side="upper")</pre>
<p>Furthermore, given that the data is integer-valued (how many people visit the hospital each hour) and ADAM produces fractional quantiles (because of the Gamma distribution), I decided to see how the model would perform if the quantiles were rounded up. This strategy is simple and might be sensible when a continuous model is used for forecasting count data (see the discussion in the paper). However, after running the experiment, the ADAM with rounded-up quantiles performed very similarly to the conventional one, so we decided not to include it in the paper.</p>
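<p>The rounding-up itself can be done with <code>ceiling()</code> applied to the quantile forecasts (the numbers below are made up for illustration):</p>
<pre class="decode"># Round fractional quantiles from the continuous model up to integers,
# since the number of arrivals cannot be fractional
quantileForecasts <- c(3.2, 7.8, 12.0)  # hypothetical Gamma quantiles
ceiling(quantileForecasts)  # 4 8 12</pre>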
<p>In the end, as stated earlier in this post, we concluded that in our experiment there were two well-performing approaches: GAMLSS with the Truncated Normal distribution (called "NOtr-2" in the paper) and ADAM in the form explained above. The popular TBATS, Prophet and Gradient Boosting Machine performed poorly in comparison. For the first two, this is because of the lack of explanatory variables and inappropriate distributional assumptions (normality). As for the GBM, this is probably due to the lack of a dynamic element (e.g. changing level and seasonal components).</p>
<p>Concluding this post, as you can see, I managed to fit a decent model based on ADAM, which captured the main characteristics of the data. However, it took some time to understand which features should be included, together with some experiments on the data. This case study shows that if you want to get a better model for your problem, you might need to dive into it and spend some time analysing what you have on hand, experimenting with different settings of the model. ADAM provides the flexibility necessary for such experiments.</p>
<p>Message <a href="https://openforecast.org/2023/05/10/story-of-probabilistic-forecasting-of-hourly-emergency-department-arrivals/">Story of &#8220;Probabilistic forecasting of hourly emergency department arrivals&#8221;</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2023/05/10/story-of-probabilistic-forecasting-of-hourly-emergency-department-arrivals/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Probabilistic forecasting of hourly emergency department arrivals</title>
		<link>https://openforecast.org/2023/05/09/probabilistic-forecasting-of-hourly-emergency-department-arrivals/</link>
					<comments>https://openforecast.org/2023/05/09/probabilistic-forecasting-of-hourly-emergency-department-arrivals/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 09 May 2023 06:45:13 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[ETS]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[ADAM]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[papers]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3090</guid>

					<description><![CDATA[<p>Authors: Bahman Rostami-Tabar, Jethro Browell, Ivan Svetunkov Journal: Health Systems Abstract: An accurate forecast of Emergency Department (ED) arrivals by an hour of the day is critical to meet patients’ demand. It enables planners to match ED staff to the number of arrivals, redeploy staff, and reconfigure units. In this study, we develop a model [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2023/05/09/probabilistic-forecasting-of-hourly-emergency-department-arrivals/">Probabilistic forecasting of hourly emergency department arrivals</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><strong>Authors</strong>: <a href="https://www.bahmanrt.com/">Bahman Rostami-Tabar</a>, <a href="http://www.jethrobrowell.com/">Jethro Browell</a>, Ivan Svetunkov</p>
<p><strong>Journal</strong>: Health Systems</p>
<p><strong>Abstract</strong>: An accurate forecast of Emergency Department (ED) arrivals by an hour of the day is critical to meet patients’ demand. It enables planners to match ED staff to the number of arrivals, redeploy staff, and reconfigure units. In this study, we develop a model based on Generalised Additive Models and an advanced dynamic model based on exponential smoothing to generate an hourly probabilistic forecast of ED arrivals for a prediction window of 48 hours. We compare the forecast accuracy of these models against appropriate benchmarks, including TBATS, Poisson Regression, Prophet, and simple empirical distribution. We use Root Mean Squared Error to examine the point forecast accuracy and assess the forecast distribution accuracy using Quantile Bias, PinBall Score and Pinball Skill Score. Our results indicate that the proposed models outperform their benchmarks. Our developed models can also be generalised to other services, such as hospitals, ambulances or clinical desk services.</p>
<p>DOI: <a href="https://doi.org/10.1080/20476965.2023.2200526" rel="noopener" target="_blank">10.1080/20476965.2023.2200526</a></p>
<p><a href="https://zenodo.org/record/7874721" rel="noopener" target="_blank">The paper and R code</a>.</p>
<p><a href="/en/2023/05/10/story-of-probabilistic-forecasting-of-hourly-emergency-department-arrivals/">Story of the paper</a>.</p>
<p>Message <a href="https://openforecast.org/2023/05/09/probabilistic-forecasting-of-hourly-emergency-department-arrivals/">Probabilistic forecasting of hourly emergency department arrivals</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2023/05/09/probabilistic-forecasting-of-hourly-emergency-department-arrivals/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Error Measures Flow Chart</title>
		<link>https://openforecast.org/2021/07/20/error-measures-flow-chart/</link>
					<comments>https://openforecast.org/2021/07/20/error-measures-flow-chart/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 20 Jul 2021 18:03:05 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[Forecast evaluation]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[error measures]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=2678</guid>

					<description><![CDATA[<p>In order to help master students of the Lancaster University Management Science department, I have developed a flow chart that acts as a basic guide on what error measures to use in different circumstances. The flow chart is neither complete nor perfect, and it assumes that the decision maker knows what intermittent demand [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2021/07/20/error-measures-flow-chart/">Error Measures Flow Chart</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>In order to help master students of the Lancaster University Management Science department, I have developed a flow chart that acts as a basic guide on what error measures to use in different circumstances. The flow chart is neither complete nor perfect, and it assumes that the decision maker knows what intermittent demand is and what stationarity means. Nonetheless, it can be used for an initial decision on what to use. All the error measures are explained in <a href="https://openforecast.org/adam/forecastsEvaluation.html" rel="noopener" target="_blank">the ADAM textbook</a>. Here is the flow chart (click to zoom in):</p>
<div id="attachment_2685" style="width: 217px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/uploads/2021/07/errorMeasuresFlowChart-1-scaled.gif"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-2685" src="https://openforecast.org/wp-content/uploads/2021/07/errorMeasuresFlowChart-1-207x300.gif" alt="Error measures flow chart" width="207" height="300" class="size-medium wp-image-2685" srcset="https://openforecast.org/wp-content/uploads/2021/07/errorMeasuresFlowChart-1-207x300.gif 207w, https://openforecast.org/wp-content/uploads/2021/07/errorMeasuresFlowChart-1-706x1024.gif 706w, https://openforecast.org/wp-content/uploads/2021/07/errorMeasuresFlowChart-1-768x1114.gif 768w, https://openforecast.org/wp-content/uploads/2021/07/errorMeasuresFlowChart-1-1059x1536.gif 1059w, https://openforecast.org/wp-content/uploads/2021/07/errorMeasuresFlowChart-1-1412x2048.gif 1412w" sizes="auto, (max-width: 207px) 100vw, 207px" /></a><p id="caption-attachment-2685" class="wp-caption-text">Error measures flow chart</p></div>
<p>And here is the <a href="https://openforecast.org/wp-content/uploads/2021/07/Svetunkov-2021-Error-Measures-Flow-Chart.pdf">pdf version</a> of it. Enjoy!</p>
<p>Message <a href="https://openforecast.org/2021/07/20/error-measures-flow-chart/">Error Measures Flow Chart</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2021/07/20/error-measures-flow-chart/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
