<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Archives papers - Open Forecasting</title>
	<atom:link href="https://openforecast.org/tag/papers/feed/" rel="self" type="application/rss+xml" />
	<link>https://openforecast.org/tag/papers/</link>
	<description>How to look into the future</description>
	<lastBuildDate>Mon, 19 Jan 2026 11:31:41 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2015/08/cropped-usd-05-32x32.png&amp;nocache=1</url>
	<title>Archives papers - Open Forecasting</title>
	<link>https://openforecast.org/tag/papers/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Risky business: how to select your model based on risk preferences</title>
		<link>https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/</link>
					<comments>https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 19 Jan 2026 11:28:04 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[error measures]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[Information criteria]]></category>
		<category><![CDATA[model combination]]></category>
		<category><![CDATA[model selection]]></category>
		<category><![CDATA[papers]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3950</guid>

					<description><![CDATA[<p>What do you use for model selection? Do you select the best model based on its cross-validated performance, or do you use in-sample measures like AIC? If so, there is a way to improve your selection process further. JORS recently published the paper by Nikos Kourentzes and me, based on a simple but powerful idea: [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/">Risky business: how to select your model based on risk preferences</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>What do you use for model selection? Do you select the best model based on its cross-validated performance, or do you use in-sample measures like AIC? If so, there is a way to improve your selection process further.</p>
<p>JORS recently published the paper by Nikos Kourentzes and me, based on a simple but powerful idea: instead of using summary statistics (like the mean RMSE of cross-validated errors), you should consider the entire distribution and choose a specific quantile. This aligns with <a href="https://openforecast.org/2024/03/27/what-does-lower-error-measure-really-mean/">my previous post on error measures</a>, but here is the core intuition:</p>
<p>The distribution of error measures is almost always asymmetric. If you only look at the average, you end up with a &#8220;mean temperature in the hospital&#8221; statistic, which doesn&#8217;t reflect how models actually behave. Some models perform great on most series but fail miserably on a few.</p>
<p>What can we do in this case? We can look at quantiles of the distribution.</p>
<p>For example, if we use the 84th quantile, we compare the models based on their &#8220;bad&#8221; performance: the situations where they fail and produce less accurate forecasts. If you choose the best-performing model there, you will end up with something that does not fail as badly. So your preferences for the model become risk-averse in this situation.</p>
<p>If you focus on a lower quantile (e.g. the 16th), you are looking at how models perform on the well-behaved series and ignoring how they do on the difficult ones. So, your model selection preferences can be described as risk-tolerant, because you accept that the best-performing model might fail on a difficult time series.</p>
<p>Furthermore, the median (50th quantile, the middle of the sample) corresponds to the risk-neutral situation, because it ignores the tails of the distribution.</p>
<p>What about the mean? This is a risk-agnostic strategy, because it says nothing about the performance on the difficult or easy time series &#8211; it takes everything and nothing in it at the same time, hiding the true risk profile.</p>
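<p>To make the idea concrete, here is a minimal sketch of quantile-based selection in Python. The model names and simulated error measures are invented purely for illustration; this is not the procedure from the paper:</p>

```python
import numpy as np

# Hypothetical cross-validated RMSEs: rows are time series, columns are models.
# The model names and the simulated numbers are made up for illustration.
rng = np.random.default_rng(42)
rmse = rng.lognormal(mean=0.0, sigma=0.5, size=(100, 3))
models = ["ETS", "ARIMA", "NN"]

def select_model(errors, quantile):
    """Pick the model with the lowest chosen quantile of its error distribution."""
    scores = np.quantile(errors, quantile, axis=0)
    return models[int(np.argmin(scores))]

risk_averse = select_model(rmse, 0.84)    # compare models on their "bad" performance
risk_neutral = select_model(rmse, 0.50)   # the median: ignore the tails
risk_tolerant = select_model(rmse, 0.16)  # compare on the well-behaved series
```

Depending on the chosen quantile, different models can win, which is exactly the point: the quantile encodes your risk preference.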
<p>So what?</p>
<p>In the paper, we show that using a risk-averse strategy tends to improve overall forecasting accuracy in day-to-day situations. Conversely, a risk-tolerant strategy can be beneficial when disruptions are anticipated, as standard models are likely to fail anyway.</p>
<p>So, next time you select a model, think about the measure you are using. If it’s just the mean RMSE, keep in mind that you might be ignoring the inherent risks of that selection.</p>
<p>P.S. While the discussion above applies to the distribution of error measures, our paper focuses specifically on AIC (in-sample performance). But AIC is a distance measure as well, so the logic explained above holds.</p>
<p>P.P.S. Nikos wrote a <a href="https://www.linkedin.com/posts/nikos-kourentzes-3660515_forecasting-datascience-analytics-activity-7414687127269007360-pLAh">post about this paper here</a>.</p>
<p>P.P.P.S. And here is <a href="https://github.com/trnnick/working_papers/blob/fd1973624e97fc755a9c2401f05c78b056780e34/Kourentzes_2026_Incorporating%20risk%20preferences%20in%20forecast%20selectionk.pdf">the link to the paper</a>.</p>
<p>Message <a href="https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/">Risky business: how to select your model based on risk preferences</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>AID paper rejected from the IJPR</title>
		<link>https://openforecast.org/2025/11/14/aid-paper-rejected-from-the-ijpr/</link>
					<comments>https://openforecast.org/2025/11/14/aid-paper-rejected-from-the-ijpr/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Fri, 14 Nov 2025 11:12:16 +0000</pubDate>
				<category><![CDATA[Papers]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[intermittent demand]]></category>
		<category><![CDATA[papers]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3941</guid>

					<description><![CDATA[<p>So, our paper with Anna Sroginis got rejected from a special issue of the International Journal of Production Research after a second round of revision. And here is what I think about this! First things first, why am I writing this post? I want to share failures with the community, because I am tired of [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/11/14/aid-paper-rejected-from-the-ijpr/">AID paper rejected from the IJPR</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>So, <a href="/2025/04/11/svetunkov-sroginis-2025-model-based-demand-classification/">our paper with Anna Sroginis</a> got rejected from a special issue of the International Journal of Production Research after a second round of revision. And here is what I think about this!</p>
<p>First things first, <strong>why am I writing this post</strong>? I want to share failures with the community, because I am tired of all the success stories. It is okay not to win, and this happens much more often than it seems.</p>
<p>Now, about the paper. In the first round of revisions, four reviewers looked at it and provided their comments. We expanded the paper accordingly, making it now 46 pages long (ouch!). We introduced inventory simulations and showed how using some basic principles improves forecasting accuracy and can lead to a reduction in inventory costs.</p>
<p>In the second round, the AE added one more reviewer. After careful consideration, two of the reviewers recommended major revisions, while the other two suggested a strong rejection, claiming that the paper does not make new and significant contributions to the production research literature.</p>
<p>Obviously, I disagree with this evaluation. Based on the reviewers’ comments, I have a feeling they didn’t read the paper in full (their main concerns relate to Section 3, and some of these could have been resolved if they had reached Section 5). But this probably also means that the paper in its current state is too big and needs to be rewritten to become more focused. Maybe this is what confused the reviewers.</p>
<p>So, what&#8217;s next?</p>
<p>We will amend it to address the reviewers’ comments, shorten it a bit to make it more focused, and then submit to another OR-related journal.</p>
<p>And while we are doing that, I have <a href="https://arxiv.org/abs/2504.05894v2">updated the arXiv version of the paper</a> to show what we did after the first round, and here is a brief summary of the main findings:</p>
<ul>
<li>Using a stockout dummy variable and capturing the level of data correctly (removing the effect of stockouts) improves the accuracy of forecasting approaches;</li>
<li>Stockout detection should be done for both the training and the test sets. If the series with stockouts are not removed from the test set, the forecasts will be evaluated incorrectly;</li>
<li>Splitting the demand into demand sizes and demand occurrence, producing forecasts for each of the parts and then combining the results substantially improves the accuracy;</li>
<li>Using the regular/intermittent demand feature improves the forecasting accuracy, but does not seem to impact the inventory performance. Note that this separation is straightforward in AID: if, after removing the stockouts, there are some zeroes left, the demand is identified as intermittent;</li>
<li>The further split into smooth/lumpy leads to slight improvements in terms of accuracy, without a substantial impact on the inventory;</li>
<li>The split into count/fractional demand does not bring value in terms of forecasting accuracy or inventory performance.</li>
</ul>
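<p>As an aside, the size/occurrence split mentioned in the summary can be sketched with a simple Croston-style decomposition. This is a generic illustration, not the exact approach from the paper; the smoothing parameter and the toy data are arbitrary:</p>

```python
import numpy as np

def ses(x, alpha=0.1):
    """Simple exponential smoothing; returns the one-step-ahead forecast."""
    level = x[0]
    for value in x[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

def split_forecast(demand, alpha=0.1):
    """Forecast demand occurrence and demand sizes separately, then combine."""
    demand = np.asarray(demand, dtype=float)
    occurrence = (demand > 0).astype(float)  # 1 in periods where demand occurred
    sizes = demand[demand > 0]               # the non-zero demand sizes
    if sizes.size == 0:
        return 0.0
    prob = ses(occurrence, alpha)            # expected probability of a sale
    size = ses(sizes, alpha)                 # expected size, given a sale happens
    return prob * size                       # expected demand per period

# A toy intermittent series: mostly zeroes with occasional sales.
forecast = split_forecast([0, 3, 0, 0, 2, 0, 4, 0, 0, 3])
```

The combined forecast is the product of the expected occurrence probability and the expected size, which is well below the typical non-zero sale, as one would expect for intermittent demand.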
<p>Message <a href="https://openforecast.org/2025/11/14/aid-paper-rejected-from-the-ijpr/">AID paper rejected from the IJPR</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/11/14/aid-paper-rejected-from-the-ijpr/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Review of a paper on comparison of modern machine learning techniques in retail</title>
		<link>https://openforecast.org/2025/06/22/review-of-a-paper-comparative-analysis-of-modern-machine-learning-models-for-retail-sales-forecasting/</link>
					<comments>https://openforecast.org/2025/06/22/review-of-a-paper-comparative-analysis-of-modern-machine-learning-models-for-retail-sales-forecasting/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Sun, 22 Jun 2025 21:59:19 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[AI and ML]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[papers]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3874</guid>

					<description><![CDATA[<p>A couple of days ago, I noticed a link to the following paper in a post by Jack Rodenberg: https://arxiv.org/abs/2506.05941v1. The topic seemed interesting and relevant to my work, so I read it, only to find that the paper contains several serious flaws that compromise its findings. Let me explain. Introduction But first, why am [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/06/22/review-of-a-paper-comparative-analysis-of-modern-machine-learning-models-for-retail-sales-forecasting/">Review of a paper on comparison of modern machine learning techniques in retail</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>A couple of days ago, I noticed a link to the following paper in a post by Jack Rodenberg: <a href="https://arxiv.org/abs/2506.05941v1" target="_blank">https://arxiv.org/abs/2506.05941v1</a>. The topic seemed interesting and relevant to my work, so I read it, only to find that the paper contains several serious flaws that compromise its findings. Let me explain.</p>
<h2>Introduction</h2>
<p>But first, why am I writing this post?</p>
<p>There’s growing interest in forecasting among data scientists, data engineers, ML experts, etc. Many of them assume that they can apply their existing knowledge directly to this new area without reading the domain-specific literature. As a result, we get a lot of &#8220;hit-or-miss&#8221; work: sometimes based on promising ideas, but executed in ways that violate basic forecasting principles. The main problem is that if your experiment is not done correctly, your results are compromised, i.e. your claims might simply be wrong.</p>
<p>If you&#8217;re a researcher writing forecasting-related papers, then hopefully reading this post (and the posts and papers I refer to) will help you improve your papers. This might lead to a smoother peer-review process. Also, while I can’t speak for other reviewers, if I come across a paper with similar issues, I typically give it a hard time.</p>
<p>I should also say that I am not a reviewer of this paper (if I were, I would not publish my review), but I merely decided to demonstrate what issues I see when I read papers like this one. The authors are just unlucky that I picked their paper&#8230;</p>
<p>Let&#8217;s start.</p>
<p>The authors apply several ML methods to retail data, compare their forecasting accuracy, and conclude that XGBoost and LightGBM outperform N-BEATS, NHITS, and Temporal Fusion Transformer. While the finding isn’t groundbreaking, additional evidence on a new dataset is always welcome.</p>
<h2>Major issues</h2>
<p>So, what&#8217;s wrong? Here is a list of the major comments:</p>
<ol>
<li><strong>Forecast horizon vs. data frequency</strong>:</li>
<p>Daily data with a 365-day forecast horizon makes no practical sense (page 2, paragraph 3). I haven&#8217;t seen any company making daily-level decisions a year in advance. Stock decisions are typically made on much shorter horizons, and if you need a year-ahead forecast, you definitely do not need it on the daily level. After all, there is no point in knowing that on 22nd December 2025 you will have an expected demand of 35.457 units &#8211; it is too far into the future to make any difference. Some references:</p>
<ul>
<li>The <a href="https://doi.org/10.1016/j.ijforecast.2022.08.003">Athanasopoulos and Kourentzes (2023)</a> paper discusses data frequency and some decisions related to it;</li>
<li>and there is <a href="/2024/09/24/how-to-choose-forecast-horizon/">a post on my website</a> on a related topic</li>
</ul>
<li><strong>Misuse of SBC classification</strong>:</li>
<p>Claiming that 70% of products are &#8220;intermittent&#8221; (page 2, last paragraph) based on SBC is incorrect. Furthermore, SBC classification does not make sense in this setting, and is not used in the paper anyway, so the authors should just drop it.</p>
<ul>
<li>Read more about it <a href="/2025/06/04/sbc-is-not-for-you/">here</a>.</li>
<li>And there is <a href="https://www.linkedin.com/posts/stephankolassa_on-the-categorization-of-demand-patterns-activity-7340669762894462978-Wjrb">a post of Stephan Kolassa</a> on exactly this point</li>
</ul>
<li><strong>Product elimination and introduction is unclear (page 3):</strong></li>
<p>The authors say &#8220;Around 30% of products were eliminated during training and 10% are newly introduced in validation&#8221;. It&#8217;s not clear why this was done and how specifically. This needs to be explained in more detail.</p>
<li><strong>&#8220;Missing values&#8221; undefined</strong>:</li>
<p>It is not clear what the authors mean by &#8220;missing values&#8221; (page 3, &#8220;Handling Missing Values&#8221;). How do they appear and why? Are they the same as stockouts, or were there some other issues in the data? This needs to be explained in more detail.</p>
<li><strong>Figure 1 is vague</strong>:</li>
<p>Figure 1 is supposed to explain how the missing values were treated. But the whole imputation process is questionable, because it is not clear how well it worked in comparison with alternatives, or how reasonable it is to have an imputed series that looks more erratic than the original one. The discussion needs to be expanded with some insights from the business problem.</p>
<li><strong>No stockout handling discussion</strong>:</li>
<p>The authors do not discuss whether the data has stockouts or not. This becomes especially important in retail, because if stockouts are not treated correctly, you end up forecasting sales instead of demand.</p>
<ul>
<li>For example, <a href="/2025/04/11/svetunkov-sroginis-2025-model-based-demand-classification/">see this post</a>.</li>
</ul>
<li><strong>Feature engineering is opaque</strong>:</li>
<p>&#8220;Lag and rolling-window statistics for sales and promotional indicators were created&#8221; (page 3, &#8220;Feature Engineering&#8221;) &#8211; it is not clear what specific lags, what lengths of rolling windows, and what statistics (anything besides the mean?) were created. These need to be explained for transparency, so that a reader can better understand what specifically was done. Without this explanation, it is not clear whether the features are sensible at all.</p>
<li><strong>Training/validation setup not explained</strong>:</li>
<p>It is not clear how specifically the split into training and validation sets was done (page 3, last paragraph), and whether the authors used rolling origin (aka time series cross-validation). If they did random splits, that could cause some issues, because the first law of time series is not to break its structure!</p>
<li><strong>Variables transformation is unclear</strong>:</li>
<p>It is not clear whether any transformations of the response variable were done. For example, if the data is not stationary, taking differences might be necessary to capture the trend and to do extrapolation correctly. Normalisation of variables is also important for neural networks, unless this is built-in in the functions the authors used. This is not discussed in the paper.</p>
<li><strong>Forecast strategy not explained</strong>:</li>
<p>It is not clear whether the direct or recursive strategy was used for forecasting. If lags were not used in the model, that would not matter, but they are, so this becomes a potential issue. Also, if the authors used the lag of the actual value at observation 235 steps ahead to produce a forecast for 236 steps ahead, then this is another fundamental issue, because it implies that the forecast horizon is just 1 step ahead, not 365, as the authors claim. This needs to be explained in more detail.</p>
<ul>
<li>I&#8217;ve written <a href="/2024/05/25/recursive-vs-direct-forecasting-strategy/">a post</a> about the strategies.</li>
</ul>
<li><strong>No statistical benchmarks</strong>:</li>
<p>At the very least, the authors should use a simple moving average and probably exponential smoothing. Even if these do not perform well, they give additional information about the performance of the other approaches. Without them, the claims about the good performance of the ML approaches used are not supported by evidence. The authors claim that they used the mean as a benchmark, but its performance is not discussed in the paper.</p>
<ul>
<li><a href="/2024/10/28/why-is-it-hard-to-beat-simple-moving-average/">A post on the Simple Moving Average</a></li>
<li><a href="/2024/01/10/why-you-should-care-about-exponential-smoothing/">Why you should care about the exponential smoothing</a></li>
</ul>
<li><strong>Issues with forecast evaluation</strong>:</li>
<p>The whole Table 3 with error measures is an example of what not to do. Here are some of major issues:</p>
<ol>
<li><a href="/2024/04/03/stop-reporting-several-error-measures-just-for-the-sake-of-them/">There is no point in reporting several error measures</a> &#8211; each of them is minimised by its own statistic. The error measure should align with what the approaches produce.</li>
<li>MSE, RMSE, MAE and ME should be dropped, because they are not scaled, so the authors are adding up error measures for bricks and nails. The result is meaningless.</li>
<li>MASE is not needed &#8211; it is minimised by the median, which could be a serious issue on intermittent demand (<a href="/2025/01/21/don-t-use-mae-based-error-measures-for-intermittent-demand/">see this post</a>). wMAPE has similar issues because it is also based on MAE.</li>
<li>If the point forecasts are produced in terms of medians (as in the case of N-BEATS), then RMSSE should be dropped, and MASE should be used instead.</li>
<li>But also, comparing means with medians is not a good idea. If you assume a symmetric distribution, the two should coincide, but in general this might not hold.</li>
<li>R2 is not a good measure of forecast accuracy. It makes some sense in a regression context for linear models, but here it is pointless, and only shows that the authors don&#8217;t fully understand what they are doing. Plus, it&#8217;s not clear how specifically it was calculated.</li>
<li>I don&#8217;t fully understand &#8220;demand error&#8221;, &#8220;demand bias&#8221; and other measures, and the authors do not explain them in necessary detail. This needs to be added to the paper.</li>
<li>The split into &#8220;Individual Groups&#8221; and &#8220;Whole Category&#8221; is not well explained either: it is not clear what this means, why, and how this was done.</li>
<li>And in general, I don&#8217;t understand what the authors want to do with Cases A &#8211; D in Table 3. It is not clear why they are needed, and what they want to show with them. This is not explained in the paper.</li>
</ol>
<ul>
<li>I have a series of posts on forecast evaluation <a href="/category/forecasting-theory/forecast-evaluation/">here</a>.</li>
</ul>
<li><strong>Invalid analysis of bias measures</strong>:</li>
<p>Analysis of bias measures is meaningless because they were not scaled.</p>
<li><strong>Disturbing bias of N-BEATS in Figure 2</strong>:</li>
<p>The bias shown in Figure 2 is disturbing and should be dealt with prior to evaluation. It could have appeared due to the loss function used for training or because the data was not pre-processed correctly. Leaving it as is and blaming N-BEATS for this does not sound reasonable to me.</p>
<li><strong>No inventory implications</strong>:</li>
<p>The authors mention inventory management, but stop at forecasting, not showing how the specific forecasts translate into inventory decisions. If this paper were to be submitted to any operations-related journal, the inventory implications would need to be added in the discussion.</p>
<li><strong>Underexplained performance gaps</strong>:</li>
<p>The paper also does not explain well why the neural networks performed worse than the gradient boosting methods. The authors mention that this could be due to the effect of missing values, but this is a speculation rather than an explanation, and one which I personally do not believe (I might be wrong). While the overall results make sense to me, if you want to publish a good paper, you need to provide a more detailed answer to the question &#8220;why?&#8221;.</p>
</ol>
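<p>To make the training/validation comment above more concrete, here is what a rolling-origin (time series cross-validation) split looks like. This is a minimal sketch with a made-up toy series and horizon, shown only to illustrate that the temporal order is never broken:</p>

```python
import numpy as np

def rolling_origin_splits(n_obs, min_train, horizon):
    """Yield (train_indices, test_indices) pairs that respect time order."""
    for origin in range(min_train, n_obs - horizon + 1):
        train = np.arange(origin)                   # everything up to the origin
        test = np.arange(origin, origin + horizon)  # the next `horizon` points
        yield train, test

series = np.arange(20)  # a toy series of 20 observations
splits = list(rolling_origin_splits(len(series), min_train=10, horizon=3))

# Every test window starts right after its training window ends:
for train, test in splits:
    assert train[-1] + 1 == test[0]
```

A random split, by contrast, would scatter future observations into the training set, which is exactly the "breaking the structure" problem mentioned above.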
<h2>Minor issues</h2>
<p>I also have three minor comments:</p>
<ol>
<li>&#8220;many product series are censored&#8221; (page 2, last paragraph) is not what it sounds like. The authors imply that the histories are short, while the usual interpretation is that the sales are lower than the demand, so the values are censored. I would rewrite this.</li>
<li>Figure 2 has the legend saying &#8220;Poisson&#8221; three times, not providing any useful information. This is probably just a mistake, which can easily be fixed.</li>
<li>There are no references to Table 2 and Figure 3 in the paper. It is not clear why they are needed. Every table and figure should be referred to and explained.</li>
</ol>
<h2>Conclusions</h2>
<p>Overall, the paper has a sensible idea, but I feel that the authors need to learn more about forecasting principles, and that they have not read the forecasting literature carefully enough to understand how the experiments should be designed and what to do and not to do (stop using SBC!). Because they made several serious mistakes, I feel that the results of the paper are compromised and might not be correct.</p>
<p>P.S. If I were a reviewer of this paper, I would recommend either &#8220;reject and resubmit&#8221; or a &#8220;major revision&#8221; (if the former option was not available).</p>
<p>P.P.S. If the authors of the paper are reading this, I hope you find these comments useful. If you have not submitted the paper yet, I&#8217;d suggest taking some of them (if not all) into account. Hopefully, this will smooth the submission process for you.</p>
<p>Message <a href="https://openforecast.org/2025/06/22/review-of-a-paper-comparative-analysis-of-modern-machine-learning-models-for-retail-sales-forecasting/">Review of a paper on comparison of modern machine learning techniques in retail</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/06/22/review-of-a-paper-comparative-analysis-of-modern-machine-learning-models-for-retail-sales-forecasting/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Online Detection of Forecast Model Inadequacies Using Forecast Errors</title>
		<link>https://openforecast.org/2025/06/11/online-detection-of-forecast-model-inadequacies-using-forecast-errors/</link>
					<comments>https://openforecast.org/2025/06/11/online-detection-of-forecast-model-inadequacies-using-forecast-errors/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Wed, 11 Jun 2025 12:04:14 +0000</pubDate>
				<category><![CDATA[Papers]]></category>
		<category><![CDATA[changepoint]]></category>
		<category><![CDATA[papers]]></category>
		<category><![CDATA[time series]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3863</guid>

					<description><![CDATA[<p>There&#8217;s a large and fascinating area in time series analysis called &#8220;changepoint detection&#8221;. I hadn&#8217;t worked in this area before, but thanks to Rebecca Killick and Thomas Grundy, I contributed to the paper &#8220;Online Detection of Forecast Model Inadequacies Using Forecast Errors&#8220;, which has just been published in the Journal of Time Series Analysis. DISCLAIMER: [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/06/11/online-detection-of-forecast-model-inadequacies-using-forecast-errors/">Online Detection of Forecast Model Inadequacies Using Forecast Errors</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>There&#8217;s a large and fascinating area in time series analysis called &#8220;changepoint detection&#8221;. I hadn&#8217;t worked in this area before, but thanks to <a href="https://www.linkedin.com/in/rebecca-killick-0427b615a">Rebecca Killick</a> and <a href="https://www.linkedin.com/in/grundy95/">Thomas Grundy</a>, I contributed to the paper &#8220;<a href="https://doi.org/10.1111/jtsa.12843">Online Detection of Forecast Model Inadequacies Using Forecast Errors</a>&#8220;, which has just been published in the Journal of Time Series Analysis.</p>
<p><em>DISCLAIMER: the image in the post is taken from the paper, Figure 6, showing the proportion of GRP A&#038;E admissions, the forecast errors and two detectors.</em></p>
<p>Here&#8217;s a brief summary of what it&#8217;s about:</p>
<p>One of the common issues in forecasting is that there might be some serious changes in the data due to external factors (e.g. changes in consumer preferences). These changes are not always captured by the model, which can lead to reduced accuracy, increased variance, and ultimately to losses. The changepoint detection literature addresses this by trying to automatically identify such structural changes and alert analysts when intervention might be needed. This becomes especially useful when managing large numbers of time series, where visual inspection isn&#8217;t feasible.</p>
<p>However, most existing approaches either work directly on the raw data or rely on a specific model, which limits their usefulness.</p>
<p>Tom Grundy and Rebecca Killick came up with a better idea: analysing forecast errors instead. They kindly invited me to join as a co-author (since I know a thing or two about forecasting). The result is an online changepoint detection mechanism that is more universal and can be applied to classical statistical forecasting models and potentially to machine learning approaches.</p>
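<p>To give a flavour of what &#8220;analysing forecast errors&#8221; means in practice, here is a basic CUSUM-style detector applied to a sequence of one-step-ahead forecast errors. This is a generic textbook sketch, not the detector proposed in the paper; the drift, threshold and toy error series are arbitrary:</p>

```python
import numpy as np

def cusum_detector(errors, drift=0.5, threshold=4.0):
    """Return the first index where the cumulative sum of standardised
    forecast errors drifts beyond the threshold, or None if it never does."""
    # Standardise using an initial calibration window (assumed change-free).
    z = (errors - np.mean(errors[:10])) / (np.std(errors[:10]) + 1e-12)
    s_pos, s_neg = 0.0, 0.0
    for t, e in enumerate(z):
        s_pos = max(0.0, s_pos + e - drift)  # upward drift statistic
        s_neg = max(0.0, s_neg - e - drift)  # downward drift statistic
        if s_pos > threshold or s_neg > threshold:
            return t
    return None

# Toy errors: small oscillations, then a shift when the model becomes inadequate.
pre = np.tile([0.2, -0.2], 30)         # 60 well-behaved errors
post = 3.0 + np.tile([0.2, -0.2], 20)  # 40 errors after a structural change
changepoint = cusum_detector(np.concatenate([pre, post]))
```

If the forecasting model stays adequate, the errors hover around zero and neither statistic accumulates; once the model breaks down, the errors become systematically biased and the detector fires.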
<p>The paper is quite technical and includes theoretical derivations, showing that the proposed method substantially reduces detection delay compared to some conventional approaches. We also evaluated its performance with ARIMA and ETS models on simulated data and provided several examples with real time series, demonstrating how it works.</p>
<p>The final version of the paper <a href="https://doi.org/10.1111/jtsa.12843">is available here</a>, while the <a href="https://doi.org/10.48550/arXiv.2502.14173">pre-print is here</a>.</p>
<p>Message <a href="https://openforecast.org/2025/06/11/online-detection-of-forecast-model-inadequacies-using-forecast-errors/">Online Detection of Forecast Model Inadequacies Using Forecast Errors</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/06/11/online-detection-of-forecast-model-inadequacies-using-forecast-errors/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Svetunkov &#038; Sroginis (2025) &#8211; Model Based Demand Classification</title>
		<link>https://openforecast.org/2025/04/11/svetunkov-sroginis-2025-model-based-demand-classification/</link>
					<comments>https://openforecast.org/2025/04/11/svetunkov-sroginis-2025-model-based-demand-classification/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Fri, 11 Apr 2025 10:39:30 +0000</pubDate>
				<category><![CDATA[Papers]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[intermittent demand]]></category>
		<category><![CDATA[papers]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3821</guid>

					<description><![CDATA[<p>For the last year, Anna Sroginis and I have been working on a paper, trying to modernise demand classification schemes and make them useful in the brave new era of machine learning. We have finally wrapped it up and submitted it to a peer-reviewed journal. But the temptation to share was too strong, so we [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/04/11/svetunkov-sroginis-2025-model-based-demand-classification/">Svetunkov &#038; Sroginis (2025) &#8211; Model Based Demand Classification</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>For the last year, Anna Sroginis and I have been working on a paper, trying to modernise demand classification schemes and make them useful in the brave new era of machine learning. We have finally wrapped it up and submitted it to a peer-reviewed journal. But the temptation to share was too strong, so we have also uploaded it to arXiv, and it is <a href="https://doi.org/10.48550/arXiv.2504.05894">now available here</a>.</p>
<p>What is this paper about?</p>
<p>Intermittent demand is a common challenge in sectors like supply chain and retail. But the key issue is that zeroes in sales can happen for two fundamentally different reasons (<a href="/2024/11/18/why-zeroes-happen/">see one of my previous posts</a>):</p>
<ul>
<li>Nobody wanted to buy the product (naturally occurring zeroes),</li>
<li>Nobody could buy the product (artificially occurring due to stockouts, etc).</li>
</ul>
<p>However, forecasting methods are typically unaware of this distinction and treat both types equally. This can lead to inaccurate forecasts and poor decisions. On top of that, existing classification schemes for intermittent demand (<a href="/2024/07/16/intermittent-demand-classifications-is-that-what-you-need/">such as SBC</a>) use arbitrary thresholds and rely on choosing between forecasting methods like Croston and SBA. There’s a clear need for smarter, more flexible tools that can distinguish between types of demand and make classifications practical.</p>
<p>In this paper, we introduce a two-stage, model-based framework called &#8220;Automatic Identification of Demand&#8221; (AID), designed to bring more clarity and accuracy to demand classification. The first stage uses a data-driven approach to detect artificially occurring zeroes. Once those are accounted for, the second stage classifies the demand into one of six categories based on key characteristics: whether the demand is regular or intermittent, whether it consists of count or fractional values, and whether intermittent demand is smooth or lumpy in nature. AID detects stockouts by analysing demand intervals using the Geometric distribution, then flags the demand as one of those six types based on several simple statistical models.</p>
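<p>To give a flavour of how the first stage might work, here is a small sketch in R (my own illustration under simplifying assumptions, not the actual AID implementation; the function name and decision rule are hypothetical): we estimate the probability of a sale from the observed demand intervals and flag a trailing run of zeroes that is improbable under the Geometric distribution.</p>
<pre class="decode"># Sketch of the stockout detection idea: estimate the probability of a sale
# from the intervals between non-zero demands (MLE under the Geometric
# distribution) and flag a trailing run of zeroes that is too long for it.
flagPotentialStockout <- function(y, level=0.95){
    nonZero <- which(y != 0)
    if(length(nonZero) < 2){
        return(FALSE)
    }
    intervals <- diff(nonZero)
    pHat <- 1 / mean(intervals)
    tailZeroes <- length(y) - max(nonZero)
    # Probability of seeing no sales for at least tailZeroes periods in a row
    pValue <- (1 - pHat)^tailZeroes
    pValue < 1 - level
}

y <- c(rep(c(1,0), 20), rep(0, 10)) # regular-ish sales, then a long gap
flagPotentialStockout(y)            # the trailing gap looks like a stockout</pre>
<p>The actual mechanism in the paper is more involved, but the idea is the same: naturally occurring zeroes are consistent with the estimated demand intervals, while artificial ones are not.</p>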
<p>We applied AID to a retailer dataset covering over 31,000 products with weekly sales across three stores. Based on that, we generated several features and tested multiple approaches (local level, pooled regression, and LightGBM) to see whether their accuracy improved. We found that:</p>
<ol>
<li>Correcting for stockouts significantly improved the accuracy of all approaches;</li>
<li>Using a mixture approach (separating demand into sizes and occurrences) yielded large gains in accuracy, regardless of the forecasting method used;</li>
<li>Further splitting the data by demand categories (e.g., regular vs. intermittent, smooth vs. lumpy) provided additional, though more modest, benefits.</li>
</ol>
<p>We argue that these three principles are universally valuable for forecasting, no matter what approach you use. If you face intermittent demand, at a minimum, consider detecting stockouts and then using the mixture approach.</p>
<p>Hope you find this paper useful. Let me know what you think in the comments.</p>
<p>Message <a href="https://openforecast.org/2025/04/11/svetunkov-sroginis-2025-model-based-demand-classification/">Svetunkov &#038; Sroginis (2025) &#8211; Model Based Demand Classification</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/04/11/svetunkov-sroginis-2025-model-based-demand-classification/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Point Forecast Evaluation: State of the Art</title>
		<link>https://openforecast.org/2024/07/16/point-forecast-evaluation-state-of-the-art/</link>
					<comments>https://openforecast.org/2024/07/16/point-forecast-evaluation-state-of-the-art/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 16 Jul 2024 11:24:06 +0000</pubDate>
				<category><![CDATA[Forecast evaluation]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[error measures]]></category>
		<category><![CDATA[papers]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3623</guid>

					<description><![CDATA[<p>I have summarised several posts on point forecast evaluation in an article for the Foresight journal. Mike Gilliland, being the Editor-in-Chief of the journal, contributed to the paper a lot, making it read much smoother, but preferred not to be included as the co-author. This article was recently published in the issue 74 for Q3:2024. [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2024/07/16/point-forecast-evaluation-state-of-the-art/">Point Forecast Evaluation: State of the Art</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>I have summarised several posts on point forecast evaluation in an article for the <a href="https://forecasters.org/foresight/">Foresight</a> journal. Mike Gilliland, the Editor-in-Chief of the journal, contributed to the paper a lot, making it read much smoother, but preferred not to be included as a co-author. This article was recently published in issue 74 (Q3 2024). I attach the author copy to this post just because I can. Here is <a href="/wp-content/uploads/2024/07/Svetunkov-2024-Point-Forecast-Evaluation-State-of-the-Art.pdf">the direct link</a>.</p>
<p>Here are the Key Points from the article:</p>
<ul>
<li>Evaluation is important for tracking forecast process performance and understanding whether changes (to forecasts, models, or the overall process) are needed.</li>
<li>Understand what kind of forecast our models produce, and measure it properly. Most likely, our approach produces the mean (rather than the median) as a point forecast, so root mean squared error (RMSE) should be used to evaluate it.</li>
<li>To aggregate the error measure across several products, you need to scale it. A reliable way of scaling is to divide the selected error measure by the mean absolute differences of the training data. This way we remove the scale and units of the original measure and ensure that its value does not change substantially when the data contains a trend.</li>
<li>Avoid MAPE!</li>
<li>To make decisions based on your error measure, consider using the FVA framework, directly comparing the performance of your forecasting approach with the performance of some simple benchmark method.</li>
</ul>
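<p>The scaling recommended above fits in a couple of lines of R (a sketch of mine, not code from the article), dividing the RMSE by the mean absolute differences of the training data:</p>
<pre class="decode"># Scaled RMSE: dividing by the mean absolute first differences of the
# training data removes the scale and units and is robust to trends
rmseScaled <- function(holdout, forecasts, train){
    sqrt(mean((holdout - forecasts)^2)) / mean(abs(diff(train)))
}

rmseScaled(holdout=c(6,5), forecasts=c(5,5), train=c(1,3,2,5,4))</pre>
<p>Being unit-free, such values can then be averaged across products with different scales.</p>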
<p><strong>Disclaimer</strong>: This article originally appeared in Foresight, Issue 74 (forecasters.org/foresight) and is made available with permission of Foresight and the International Institute of Forecasters.</p>
<p>Message <a href="https://openforecast.org/2024/07/16/point-forecast-evaluation-state-of-the-art/">Point Forecast Evaluation: State of the Art</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2024/07/16/point-forecast-evaluation-state-of-the-art/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>iETS: State space model for intermittent demand forecasting</title>
		<link>https://openforecast.org/2023/09/08/iets-state-space-model-for-intermittent-demand-forecasting/</link>
					<comments>https://openforecast.org/2023/09/08/iets-state-space-model-for-intermittent-demand-forecasting/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Fri, 08 Sep 2023 09:30:40 +0000</pubDate>
				<category><![CDATA[adam()]]></category>
		<category><![CDATA[ETS]]></category>
		<category><![CDATA[Package smooth for R]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[ADAM]]></category>
		<category><![CDATA[intermittent demand]]></category>
		<category><![CDATA[papers]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3200</guid>

					<description><![CDATA[<p>Authors: Ivan Svetunkov, John E. Boylan Journal: International Journal of Production Economics Abstract: Inventory decisions relating to items that are demanded intermittently are particularly challenging. Decisions relating to termination of sales of product often rely on point estimates of the mean demand, whereas replenishment decisions depend on quantiles from interval estimates. It is in this [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2023/09/08/iets-state-space-model-for-intermittent-demand-forecasting/">iETS: State space model for intermittent demand forecasting</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><strong>Authors</strong>: Ivan Svetunkov, <a href="/en/2023/07/21/john-e-boylan/">John E. Boylan</a></p>
<p><strong>Journal</strong>: <a href="https://www.sciencedirect.com/journal/international-journal-of-production-economics">International Journal of Production Economics</a></p>
<p><strong>Abstract</strong>: Inventory decisions relating to items that are demanded intermittently are particularly challenging. Decisions relating to termination of sales of product often rely on point estimates of the mean demand, whereas replenishment decisions depend on quantiles from interval estimates. It is in this context that modelling intermittent demand becomes an important task. In previous research, this has been addressed by generalised linear models or integer-valued ARMA models, while the development of models in state space framework has had mixed success. In this paper, we propose a general state space model that takes intermittence of data into account, extending the taxonomy of single source of error state space models. We show that this model has a connection with conventional non-intermittent state space models used in inventory planning. Certain forms of it may be estimated by Croston’s and Teunter-Syntetos-Babai (TSB) forecasting methods. We discuss properties of the proposed models and show how a selection can be made between them in the proposed framework. We then conduct a simulation experiment, empirically evaluating the inventory implications.</p>
<p><strong>DOI</strong>: <a href="https://doi.org/10.1016/j.ijpe.2023.109013">10.1016/j.ijpe.2023.109013</a>.</p>
<p><a href="http://dx.doi.org/10.13140/RG.2.2.35897.06242">Working paper</a>.</p>
<h1>About the paper</h1>
<p><strong>DISCLAIMER</strong>: The models in this paper are also discussed in detail in the <a href="https://openforecast.org/adam/">ADAM monograph</a> (<a href="https://openforecast.org/adam/ADAMIntermittent.html">Chapter 13</a>) with some examples going beyond what is discussed in the paper (e.g. models with trends).</p>
<p>What is &#8220;intermittent demand&#8221;? It is the demand that happens at irregular frequency (i.e. at random). Note that according to this definition, intermittent demand does not need to be count &#8211; it is a wider term than that. For example, electricity demand can be intermittent, but it is definitely not count. The definition above means that we do not necessarily know when specifically we will sell our product. From the modelling point of view, it means that we need to take into account two elements of uncertainty instead of just one:</p>
<ol>
<li>How much people will buy;</li>
<li>When they will buy.</li>
</ol>
<p>(1) is familiar to many demand planners and data scientists: we do not know specifically how much our customers will buy in the future, but we can get an estimate of the expected demand (mean value via a point forecast) and an idea of the uncertainty around it (e.g. produce prediction intervals or estimate the demand distribution). (2) is less obvious: there may be some periods when nobody buys our product, then periods when we sell some, followed by no sales again. In that case we can encode the &#8220;dry&#8221; periods with zeroes and the periods with demand with ones, and end up with a time series like this (this idea was briefly discussed in <a href="/en/2020/01/13/what-about-all-those-zeroes-measuring-performance-of-models-on-intermittent-demand/">this</a> post and <a href="/en/2018/09/18/smooth-package-for-r-intermittent-state-space-model-part-i-introducing-the-model/">this</a> one):</p>
<div id="attachment_3230" style="width: 610px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/IntermittentDemandOccurrence.png&amp;nocache=1"><img fetchpriority="high" decoding="async" aria-describedby="caption-attachment-3230" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/IntermittentDemandOccurrence.png&amp;nocache=1" alt="An example of the occurrence part of an intermittent demand" width="600" height="350" class="size-full wp-image-3230" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/IntermittentDemandOccurrence.png&amp;nocache=1 1200w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/IntermittentDemandOccurrence-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/IntermittentDemandOccurrence-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/IntermittentDemandOccurrence-768x448.png&amp;nocache=1 768w" sizes="(max-width: 600px) 100vw, 600px" /></a><p id="caption-attachment-3230" class="wp-caption-text">An example of the occurrence part of an intermittent demand</p></div>
<p>The plot above visualises the demand occurrence, with zeroes corresponding to the situation of &#8220;no demand&#8221; and ones corresponding to some demand. In general, it is challenging to predict when specifically the &#8220;ones&#8221; will happen, but in the case above it seems that the frequency of demand increases over time, implying that it may be becoming regular. In mathematical terms, we could phrase this as the probability of occurrence increasing over time: at the end of the series, we will not necessarily sell the product, but the chance of selling it is much higher than in the beginning. The original time series looks like this:</p>
<div id="attachment_3231" style="width: 610px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/IntermittentDemandOverall.png&amp;nocache=1"><img decoding="async" aria-describedby="caption-attachment-3231" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/IntermittentDemandOverall.png&amp;nocache=1" alt="An example of an intermittent demand" width="600" height="350" class="size-full wp-image-3231" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/IntermittentDemandOverall.png&amp;nocache=1 1200w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/IntermittentDemandOverall-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/IntermittentDemandOverall-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/IntermittentDemandOverall-768x448.png&amp;nocache=1 768w" sizes="(max-width: 600px) 100vw, 600px" /></a><p id="caption-attachment-3231" class="wp-caption-text">An example of an intermittent demand</p></div>
<p>It shows that the frequency of sales indeed increases together with the amount sold, suggesting that the product is becoming more popular, moving from the intermittent to the regular demand domain.</p>
<p>In general, forecasting intermittent demand is a challenging task, but there are many existing approaches that can be used in this case. However, they are all detached from the conventional ones that are used for regular demand (such as ETS or ARIMA). What people usually do in practice is first categorise the data into regular and intermittent and then apply specific approaches to it (e.g. ETS/ARIMA for the regular demand, and <a href="https://doi.org/10.2307/3007885">Croston</a>&#8216;s method or <a href="https://doi.org/10.1016/j.ejor.2011.05.018">TSB</a> for the intermittent one).</p>
<p>John Boylan and I developed a statistical model that unites the two worlds &#8211; you no longer need to decide whether the data is intermittent or not, you can just use one model in an automated fashion &#8211; it will take care of intermittence (if there is one). It relies fundamentally on the classical Croston&#8217;s equation:<br />
\begin{equation} \label{eq:general}<br />
	y_t = o_t z_t ,<br />
\end{equation}<br />
where \(y_t\) is the observed value at time \(t\), \(o_t\) is the binary occurrence variable and \(z_t\) is the demand sizes variable. Trying to derive the statistical model underlying Croston&#8217;s method, <a href="https://doi.org/10.1016/S0377-2217(01)00231-4">Snyder (2002)</a> and <a href="https://doi.org/10.1002/for.963">Shenstone &#038; Hyndman (2005)</a> used models based on \eqref{eq:general}, but instead of plugging a multiplicative ETS into \(z_t\), they got stuck with the idea of a logarithmic transformation of the demand sizes and/or count distributions for them. John and I looked into this equation again and decided that we could model both the demand sizes and the demand occurrence using a pair of <a href="https://openforecast.org/adam/ADAMETSPureMultiplicativeChapter.html">pure multiplicative ETS models</a>. In this post, I will focus on ETS(M,N,N) as the simplest model, but more complicated ones (with trend and/or explanatory variables) can be used as well, following the same logic. So, for the demand sizes we will have:<br />
\begin{equation}<br />
	\begin{aligned}<br />
		&#038; z_t = l_{t-1} (1 + \epsilon_t) \\<br />
		&#038; l_t = l_{t-1} (1 + \alpha \epsilon_t)<br />
	\end{aligned}<br />
 \label{eq:demandSizes}<br />
\end{equation}<br />
where \(l_t\) is the level of the series, \(\alpha\) is the smoothing parameter and \(1 + \epsilon_t \) is the error term that follows some positive distribution (the options we considered in the paper are the Log-Normal, Gamma and Inverse Gaussian). The demand sizes part is relatively straightforward: you just apply the conventional pure multiplicative ETS model with a positive distribution (which makes \(z_t\) always positive) and that&#8217;s it. However, the occurrence part is more complicated.</p>
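<p>The demand sizes recursion above is easy to simulate. Here is a minimal sketch in R (my own illustration, not code from the paper), drawing \(1 + \epsilon_t \) from a Gamma distribution with expectation one:</p>
<pre class="decode"># Simulate z_t = l_{t-1}(1 + eps_t), l_t = l_{t-1}(1 + alpha * eps_t),
# with 1 + eps_t ~ Gamma(shape, rate=shape), so that E(1 + eps_t) = 1
set.seed(42)
n <- 100
alpha <- 0.1
shape <- 20                # variance of eps_t is 1/shape
l <- 10                    # initial level
z <- numeric(n)
for(t in 1:n){
    eps <- rgamma(1, shape=shape, rate=shape) - 1
    z[t] <- l * (1 + eps)          # demand size
    l <- l * (1 + alpha * eps)     # level update
}
all(z > 0)                 # the positive distribution keeps z_t positive</pre>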
<p>Given that the occurrence variable is random, we should model the probability of occurrence. We proposed to assume that \(o_t \sim \mathrm{Bernoulli}(p_t) \) (a natural assumption, made in many other papers), meaning that the probability of occurrence changes over time. In turn, the changing probability can be modelled using one of several approaches that we proposed. For example, it can be modelled via the so-called &#8220;inverse odds ratio&#8221; model with ETS(M,N,N), formulated as:<br />
\begin{equation}<br />
	\begin{aligned}<br />
		&#038; p_t = \frac{1}{1 + \mu_{b,t}} \\<br />
		&#038; \mu_{b,t} = l_{b,t-1} \\<br />
		&#038; l_{b,t} = l_{b,t-1} (1 + \alpha_b \epsilon_{b,t})<br />
	\end{aligned}<br />
 \label{eq:demandOccurrenceOdds}<br />
\end{equation}<br />
where \(\mu_{b,t}\) is the one step ahead expectation of the underlying model, \(l_{b,t}\) is the latent level, \(\alpha_b\) is the smoothing parameter of the model, and \(1+\epsilon_{b,t}\) is the positively distributed error term (with expectation equal to one and an unknown distribution, which we actually do not care about). The main feature of the inverse odds ratio occurrence model is that it should be effective in cases when demand is building up (moving from the intermittent to the regular pattern, without zeroes). In our paper we show how such a model can be estimated, and also that Croston&#8217;s method can be used for the estimation of this model when the demand occurrence does not change (substantially) between the non-zero demands. So, this model can be considered the model underlying Croston&#8217;s method.</p>
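<p>The &#8220;inverse odds ratio&#8221; link is easy to illustrate numerically (a sketch of mine, not from the paper): when the level \(l_{b,t}\) decays over time, the probability \(p_t = \frac{1}{1 + \mu_{b,t}}\) builds up, which is exactly the &#8220;demand building up&#8221; situation described above:</p>
<pre class="decode"># Probability of occurrence implied by a decaying level in the
# inverse odds ratio model: p_t = 1 / (1 + mu_t), mu_t = l_{b,t-1}
lb <- 4 * 0.95^(0:49)   # a level that declines over time
p <- 1 / (1 + lb)       # implied probability of occurrence
all(diff(p) > 0)        # the probability builds up monotonically</pre>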
<p>Uniting the equations \eqref{eq:general}, \eqref{eq:demandSizes} and \eqref{eq:demandOccurrenceOdds}, we get the iETS(M,N,N)\(_\mathrm{I}\)(M,N,N) model, where the letters in the first brackets correspond to the demand sizes part, the subscript &#8220;I&#8221; tells us that we have the &#8220;inverse odds ratio&#8221; model for the occurrence, and the second brackets show what ETS model was used in the demand occurrence model. The paper explains in detail how this model can be built and estimated.</p>
<p>In the very same paper we discuss other potential models for demand occurrence (more suitable for demand obsolescence or a fixed probability of occurrence), and, in fact, in my opinion this part is the main contribution of the paper &#8211; we have looked into something no one has done before: how to model demand occurrence using ETS. Having so many options, we might need to decide which one to use in an automated fashion. Luckily, given that these models are formulated in one and the same framework, we can use information criteria to select the most suitable one for the data. Furthermore, when all probabilities of occurrence are equal to one, the model \eqref{eq:general} together with \eqref{eq:demandSizes} reduces to the conventional ETS(M,N,N) model. This also means that the regular ETS model can be compared with the iETS directly using information criteria to decide whether the occurrence part is needed or not. So, we end up with a relatively simple framework that can be used for any type of demand without the need for categorisation.</p>
<p>As a small side note, we also showed in the paper that the estimates of smoothing parameters for the demand sizes in iETS will always be positively biased (higher than needed). In fact, this bias appears in any intermittent demand model that assumes that the potential demand sizes change between the non-zero observations (a reasonable assumption for any modelling approach). In a way, this finding also applies to both Croston&#8217;s and TSB methods and agrees with a similar finding by <a href="https://doi.org/10.1016/j.ijpe.2014.06.007">Kourentzes (2014)</a>.</p>
<h1>Example in R</h1>
<p>All the models from the paper are implemented in the <code>adam()</code> function from the <code>smooth</code> package in R (with the <code>oes()</code> function taking care of the occurrence; see details <a href="https://openforecast.org/adam/ADAMIntermittent.html">here</a> and <a href="https://cran.r-project.org/web/packages/smooth/vignettes/oes.html">here</a>). For demonstration purposes (and for fun), we will consider an artificial example of demand obsolescence, modelled via the &#8220;Direct probability&#8221; iETS model (which underlies the TSB method):</p>
<pre class="decode">set.seed(7)
c(rpois(10,3),rpois(10,2),rpois(10,1),rpois(10,0.5),rpois(10,0.1)) |>
    ts(frequency=12) -> y</pre>
<p>My randomly generated time series looks like this:</p>
<div id="attachment_3247" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/ObsolescenceExample.png&amp;nocache=1"><img decoding="async" aria-describedby="caption-attachment-3247" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/ObsolescenceExample-300x175.png&amp;nocache=1" alt="Demand becoming obsolete" width="300" height="175" class="size-medium wp-image-3247" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/ObsolescenceExample-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/ObsolescenceExample-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/ObsolescenceExample-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/ObsolescenceExample.png&amp;nocache=1 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-3247" class="wp-caption-text">Demand becoming obsolete</p></div>
<p>In practice, in the example above, we might be interested in deciding whether to discontinue the product (to save money on stocking it) or not. To model and forecast the demand above, we can use the following code in R:</p>
<pre class="decode">library(smooth)
iETSModel <- adam(y, "YYN", occurrence="direct", h=5, holdout=TRUE)</pre>
<p>The "YYN" above tells the function to select the best pure multiplicative ETS model based on an information criterion (AICc by default; see the discussion in <a href="https://openforecast.org/adam/ETSSelection.html">Section 15.1</a> of the ADAM monograph), while the <code>occurrence</code> parameter specifies which of the demand occurrence models to build. By default, the function will use the same model for the demand probability as the one selected for the demand sizes. So, for example, if we end up with ETS(M,M,N) for the demand sizes, the function will use ETS(M,M,N) for the probability of occurrence. If you want to change this, you need to use the <code>oes()</code> function and specify the model there (see examples in <a href="https://openforecast.org/adam/IntermittentExample.html">Section 13.4</a> of the ADAM monograph). Finally, I have asked the function to produce forecasts 5 steps ahead and to keep the last 5 observations in the holdout sample. I ended up with the following model:</p>
<pre class="decode">summary(iETSModel)</pre>
<pre>Model estimated using adam() function: iETS(MMN)
Response variable: y
Occurrence model type: Direct
Distribution used in the estimation: 
Mixture of Bernoulli and Gamma
Loss function type: likelihood; Loss function value: 71.0549
Coefficients:
      Estimate Std. Error Lower 2.5% Upper 97.5%  
alpha   0.1049     0.0925     0.0000      0.2903  
beta    0.1049     0.0139     0.0767      0.1049 *
level   4.3722     1.1801     1.9789      6.7381 *
trend   0.9517     0.0582     0.8336      1.0685 *

Error standard deviation: 1.0548
Sample size: 45
Number of estimated parameters: 9
Number of degrees of freedom: 36
Information criteria:
     AIC     AICc      BIC     BICc 
202.6527 204.1911 218.9126 206.6142 </pre>
<p>As we see from the output above, the function has selected the iETS(M,M,N) model for the data. The line "Mixture of Bernoulli and Gamma" tells us that the Bernoulli distribution was used for the demand occurrence (this is the only option), while the Gamma distribution was used for the demand sizes (this is the default option, but you can change this via the <code>distribution</code> parameter). We can then produce forecasts from this model:</p>
<pre class="decode">forecast(iETSModel, h=5, interval="prediction", side="upper") |>
    plot()</pre>
<p>In the code above, I have asked the function to generate prediction intervals (by default, for the pure multiplicative models, the function <a href="https://openforecast.org/adam/ADAMForecastingPI.html#ADAMForecastingPISimulations">uses simulations</a>) and to produce only the upper bound of the interval. The latter is motivated by the idea that in the case of intermittent demand, the lower bound is typically not useful for decision making: we know that the demand cannot be below zero, and our stocking decisions are typically made based on specific quantiles (e.g. for the 95% confidence level). Here is the plot that I get after running the code above:</p>
<div id="attachment_3250" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/ObsoleteExampleForecast.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-3250" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/ObsoleteExampleForecast-300x175.png&amp;nocache=1" alt="Point and interval forecasts for the demand becoming obsolete" width="300" height="175" class="size-medium wp-image-3250" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/ObsoleteExampleForecast-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/ObsoleteExampleForecast-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/ObsoleteExampleForecast-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/ObsoleteExampleForecast.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-3250" class="wp-caption-text">Point and interval forecasts for the demand becoming obsolete</p></div>
<p>While the last observation in the holdout was not included in the prediction interval, the model captures the dynamics correctly. The question that we should ask ourselves in this example is: what decision can be made based on the model? If you want to decide whether to stock the product or not, you can look at the forecast of the probability of occurrence to see how it changes over time and then decide whether to discontinue the product:</p>
<pre class="decode">forecast(iETSModel$occurrence, h=5) |> plot()</pre>
<div id="attachment_3254" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/ObsoleteExampleOccurrence.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-3254" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/ObsoleteExampleOccurrence-300x175.png&amp;nocache=1" alt="Forecast of the probability of occurrence for the demand becoming obsolete" width="300" height="175" class="size-medium wp-image-3254" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/ObsoleteExampleOccurrence-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/ObsoleteExampleOccurrence-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/ObsoleteExampleOccurrence-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/ObsoleteExampleOccurrence.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-3254" class="wp-caption-text">Forecast of the probability of occurrence for the demand becoming obsolete</p></div>
<p>In our case, the probability reaches roughly 0.2 over the next 5 months (i.e. we might make a sale once every 5 months). If we think that this is too low, then we should discontinue the product. Otherwise, if we decide to continue selling the product, it makes more sense to generate the desired quantile of the cumulative demand over the lead time. In the case of the <code>adam()</code> function, this can be done by adding <code>cumulative=TRUE</code> to the <code>forecast()</code> function:</p>
<pre class="decode">forecast(iETSModel, h=5, interval="prediction", side="upper", cumulative=TRUE)</pre>
<p>after which we get:</p>
<pre>      Point forecast Upper bound (95%)
Oct 4      0.3055742          1.208207</pre>
<p>From the decision point of view, if we deal with count demand, the value 1.208207 complicates things. Luckily, as we showed in our paper, we can round the value up to get something meaningful while preserving the properties of the model. This means that, based on the estimated model, we need to have two items in stock to satisfy the demand over the next 5 months with a confidence level of 95%.</p>
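<p>In base R, the rounding-up step boils down to <code>ceiling()</code> applied to the upper bound shown in the output above:</p>
<pre class="decode"># Upper bound of the cumulative demand over the lead time (from the output above)
upperBound <- 1.208207
# Rounding up preserves the targeted service level for count demand
ceiling(upperBound)
# 2</pre>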
<h2>Conclusions</h2>
<p>This is just a demonstration of what can be done with the proposed iETS model; there are many more things one can do. For example, this approach allows capturing multiplicative seasonality in data that has zeroes (as long as seasonal indices can be estimated somehow). John and I started thinking in this direction, and we even did some work together with <a href="https://www.inesctec.pt/en/people/patricia-ramos">Patricia Ramos</a> (our colleague from INESC TEC), but given the hard time our paper was given by the reviewers at the IJF, we had to postpone this research. I also used the ideas explained in this post in the <a href="/en/2023/05/09/probabilistic-forecasting-of-hourly-emergency-department-arrivals/">paper on ED forecasting</a> (written together with Bahman and Jethro). In that paper, I used a seasonal model with the "direct" occurrence part, which took care of zeroes (without bothering to model them properly) and allowed me to apply a multiple seasonal multiplicative ETS model with explanatory variables. Anyway, the proposed approach is flexible enough to be used in a variety of contexts, and I think it will have many applications in real life.</p>
<h2>P.S.: Story of the paper</h2>
<p>I had written a separate long post explaining the revision process of the paper and how it got to the acceptance stage at the IJPE, but then I realised that it was too long and boring. Besides, John would not have approved of the post and would have said that I was sharing unnecessary details, creating potential exasperation for the fellow forecasters who reviewed the paper. So, I decided not to publish that post and instead just to add a short subsection. Here it is.</p>
<p>We started working on the paper in March 2016 and submitted it to the International Journal of Forecasting (IJF) in January 2017. It went through <strong>four</strong> rounds of revision, with the second reviewer being very critical and unsupportive all the way through, driving the paper in the wrong direction and burying it in a discussion of petty statistical details. We rewrote the paper several times, and I rewrote the R code of the function a few times. In the end, the Associate Editor (AE) of the IJF (who completely forgot about our paper for several months) decided not to send the paper to the reviewers again, completely ignored our responses to the reviewers, did not provide any major feedback and wrote an insulting response that ended with the phrase "I could go on, but I’m out of patience with the authors and their paper". The paper was rejected from the IJF in 2019, which set me back in my academic career. This, together with the constant rejections of my <a href="/en/2022/08/02/the-long-and-winding-road-the-story-of-complex-exponential-smoothing/">Complex Exponential Smoothing</a> paper and the actions of a colleague of mine who decided to cut all ties with me in Summer 2019, hit my self-esteem and caused serious damage to my professional life. I thought of quitting academia and either starting to work in business or doing something different with my life, not related to forecasting at all. I stayed mainly because of all the support that John Boylan, Robert Fildes, Nikos Kourentzes and my wife Anna Sroginis gave me. I recovered from that hit only in 2022, when my <a href="https://openforecast.org/en/2022/08/02/complex-exponential-smoothing/">Complex Exponential Smoothing</a> paper got accepted and things finally started turning around.
After that, John and I rewrote the paper again, split it into two: "iETS" and "Multiplicative ETS" (under revision in the IMA Journal of Management Mathematics), and submitted the former to the International Journal of Production Economics, where it got accepted after one round of revision. Unfortunately, we never got to celebrate the success with John because <a href="/en/2023/07/21/john-e-boylan/">he passed away</a>.</p>
<p>The moral of this story is that publishing in academia can be very tough and unfair. Sometimes you get very negative feedback from the people you least expect it from. People that you respect and think very highly of might not understand what you are proposing and be very unsupportive. We actually knew who the reviewers and the AE of our IJF paper were - they are esteemed academics in the field of forecasting. And while I still think highly of their research and contributions to the field, the way the second reviewer and the AE handled the review has damaged my personal respect for them - I never expected them to be so narrow-minded...</p>
<p>Message <a href="https://openforecast.org/2023/09/08/iets-state-space-model-for-intermittent-demand-forecasting/">iETS: State space model for intermittent demand forecasting</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2023/09/08/iets-state-space-model-for-intermittent-demand-forecasting/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Multi-step Estimators and Shrinkage Effect in Time Series Models</title>
		<link>https://openforecast.org/2023/08/09/multi-step-estimators-and-shrinkage-effect-in-time-series-models/</link>
					<comments>https://openforecast.org/2023/08/09/multi-step-estimators-and-shrinkage-effect-in-time-series-models/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Wed, 09 Aug 2023 10:14:59 +0000</pubDate>
				<category><![CDATA[ARIMA]]></category>
		<category><![CDATA[ETS]]></category>
		<category><![CDATA[Package smooth for R]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Univariate models]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[papers]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3142</guid>

					<description><![CDATA[<p>Authors: Ivan Svetunkov, Nikos Kourentzes, Rebecca Killick Journal: Computational Statistics Abstract: Many modern statistical models are used for both insight and prediction when applied to data. When models are used for prediction one should optimise parameters through a prediction error loss function. Estimation methods based on multiple steps ahead forecast errors have been shown to [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2023/08/09/multi-step-estimators-and-shrinkage-effect-in-time-series-models/">Multi-step Estimators and Shrinkage Effect in Time Series Models</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><strong>Authors</strong>: Ivan Svetunkov, <a href="http://kourentzes.com/forecasting/">Nikos Kourentzes</a>, <a href="https://www.lancaster.ac.uk/maths/people/rebecca-killick">Rebecca Killick</a></p>
<p><strong>Journal</strong>: <a href="https://www.springer.com/journal/180">Computational Statistics</a></p>
<p><strong>Abstract</strong>: Many modern statistical models are used for both insight and prediction when applied to data. When models are used for prediction one should optimise parameters through a prediction error loss function. Estimation methods based on multiple steps ahead forecast errors have been shown to lead to more robust and less biased estimates of parameters. However, a plausible explanation of why this is the case is lacking. In this paper, we provide this explanation, showing that the main benefit of these estimators is in a shrinkage effect, happening in univariate models naturally. However, this can introduce a series of limitations, due to overly aggressive shrinkage. We discuss the predictive likelihoods related to the multistep estimators and demonstrate what their usage implies to time series models. To overcome the limitations of the existing multiple steps estimators, we propose the Geometric Trace Mean Squared Error, demonstrating its advantages. We conduct a simulation experiment showing how the estimators behave with different sample sizes and forecast horizons. Finally, we carry out an empirical evaluation on real data, demonstrating the performance and advantages of the estimators. Given that the underlying process to be modelled is often unknown, we conclude that the shrinkage achieved by the GTMSE is a competitive alternative to conventional ones.</p>
<p><strong>DOI</strong>: <a href="https://doi.org/10.1007/s00180-023-01377-x">10.1007/s00180-023-01377-x</a>.</p>
<p><a href="http://dx.doi.org/10.13140/RG.2.2.17854.31043">Working paper</a>.</p>
<h2>About the paper</h2>
<p><b>DISCLAIMER 1</b>: To better understand what I am talking about in this section, I would recommend having a look at the <a href="https://openforecast.org/adam/">ADAM monograph</a>, specifically at <a href="https://openforecast.org/adam/ADAMETSEstimation.html">Chapter 11</a>. In fact, <a href="https://openforecast.org/adam/multistepLosses.html">Section 11.3</a> is based on this paper.</p>
<p><b>DISCLAIMER 2</b>: All the discussions in the paper only apply to pure additive models. If you are interested in multiplicative or mixed ETS models, you&#8217;ll have to wait another seven years for another paper on this topic to get written and published.</p>
<h3>Introduction</h3>
<p>There are many ways in which dynamic models can be estimated. Some analysts prefer likelihood, some stick with Least Squares (i.e. minimising MSE), while others use advanced estimators such as Huber&#8217;s loss or M-estimators. And sometimes statisticians or machine learning experts use multiple steps ahead estimators. For example, they would use a so-called &#8220;direct forecast&#8221; by fitting a model to the data, producing h-steps ahead in-sample point forecasts from the very first to the very last observation, then calculating the respective h-steps ahead forecast errors and, based on them, the Mean Squared Error. Mathematically, this can be written as:</p>
<p>\begin{equation} \label{eq:hstepsMSE}<br />
	\mathrm{MSE}_h = \frac{1}{T-h} \sum_{t=1}^{T-h} e_{t+h|t}^2 ,<br />
\end{equation}<br />
where \(e_{t+h|t}\) is the h-steps ahead error for the point forecast produced from the observation \(t\), and \(T\) is the sample size.</p>
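<p>To make the formula concrete, here is a minimal base R sketch of \(\mathrm{MSE}_h\) for simple exponential smoothing with a fixed smoothing parameter. The function name <code>msehSES</code> and the initialisation of the level by the first observation are my own illustrative choices, not the <code>smooth</code> package implementation:</p>
<pre class="decode">msehSES <- function(y, alpha, h){
    obsInSample <- length(y)
    level <- y[1]
    errors <- numeric(obsInSample - h)
    for(t in 1:(obsInSample - h)){
        # Update the level using the observation at time t
        level <- level + alpha * (y[t] - level)
        # For SES, the h-steps ahead point forecast from origin t is the level
        errors[t] <- y[t + h] - level
    }
    mean(errors^2)
}

# Example on a simulated random walk
set.seed(41)
y <- 100 + cumsum(rnorm(120, 0, 1))
msehSES(y, alpha=0.3, h=5)</pre>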
<p>In the final year of my PhD, I decided to analyse how different multistep loss functions work, to understand what happens with dynamic models when these losses are minimised, and how this can help in efficient model estimation. While doing the literature review, I noticed that the claims about the multistep estimators are sometimes contradictory: some authors say that they are more efficient (i.e. the estimates of parameters have lower variances) than the conventional estimators, some say that they are less efficient; some claim that they improve accuracy, while others do not find any substantial improvements. Finally, I could not find a proper explanation of what happens with dynamic models when these estimators are used. So, I started my own investigation, together with Nikos Kourentzes and Rebecca Killick (who was my internal examiner and joined our team after my graduation).</p>
<p>Our investigation started with the single source of error model, then led us to predictive likelihoods and, after that, to the development of a couple of non-conventional estimators. As a result, the paper grew and became less focused than initially intended. In the end, it was 42 pages long and discussed several aspects of model estimation (making it a bit of a hodgepodge):</p>
<ol>
<li>How multistep estimators regularise parameters of dynamic models;</li>
<li>That multistep forecast errors are always correlated when the models&#8217; parameters are not zero;</li>
<li>What predictive likelihoods align with the multistep estimators (this is useful for a discussion of their statistical properties);</li>
<li>How General Predictive Likelihood encompasses all popular multistep estimators;</li>
<li>And that there is another estimator (namely GTMSE &#8211; Geometric Trace Mean Squared Error), which has good properties and has not been discussed in the literature before.</li>
</ol>
<p>Because of the size of the paper and the spread of topics throughout it, many reviewers ignored (1) &#8211; (4), focusing on (5) and thus rejecting the paper on the grounds that we proposed a new estimator but spent too much time discussing supposedly irrelevant topics. Comments of this kind were given to us by the editor of the Journal of the Royal Statistical Society: B and the reviewers of Computational Statistics and Data Analysis. While we tried addressing this issue several times, given the size of the paper, we failed to fix it fully. The paper was rejected from both of these journals and ended up in Computational Statistics, where the editor gave us a chance to respond to the comments. We explained what the paper was really about and changed its focus to satisfy the reviewers, after which the paper was accepted.</p>
<p>So, what are the main findings of this paper?</p>
<h3>How multistep estimators regularise parameters of dynamic models</h3>
<p>Given that any dynamic model (such as ETS or ARIMA) can be represented in the Single Source of Error state space form, we showed that the application of multistep estimators leads to the inclusion of the models&#8217; parameters in the loss function, which results in regularisation. In ETS, this means that the smoothing parameters are shrunk towards zero, with the shrinkage becoming stronger as the forecasting horizon increases relative to the sample size. This makes the models less stochastic and more conservative. Mathematically, this becomes apparent if we express the conditional multistep variance in terms of the smoothing parameters and the one-step-ahead error variance. For example, for ETS(A,N,N) we have:</p>
<p>\begin{equation} \label{eq:hstepsMSEVariance}<br />
	\mathrm{MSE}_h \propto \hat{\sigma}_1^2 \left(1 +(h-1) \hat{\alpha} \right),<br />
\end{equation}<br />
where \( \hat{\alpha} \) is the smoothing parameter and \(\hat{\sigma}_1^2 \) is the one-step-ahead error variance. From the formula \eqref{eq:hstepsMSEVariance}, it becomes apparent that when we minimise MSE\(_h\), the estimated variance and the smoothing parameters will be minimised as well. This is how the shrinkage effect appears: we force \( \hat{\alpha} \) to become as close to zero as possible, and the strength of shrinkage is regulated by the forecasting horizon \( h \).</p>
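<p>The shrinkage can be observed directly by estimating the same model with different losses. A minimal sketch, assuming the <code>smooth</code> package is installed (the simulated series and the persistence value are arbitrary choices for illustration):</p>
<pre class="decode">library(smooth)

# Simulate an ETS(A,N,N) series with a non-zero smoothing parameter
set.seed(41)
y <- sim.es("ANN", obs=120, persistence=0.5)$data

# The multistep loss shrinks alpha towards zero compared to the one-step loss
adam(y, "ANN", loss="MSE")$persistence
adam(y, "ANN", loss="MSEh", h=10)$persistence</pre>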
<p>In the paper itself, we discuss this effect for several multistep estimators (the specific effect would be different between them) and several ETS and ARIMA models. While for ETS, it is easy to show how shrinkage works, for ARIMA, the situation is more complicated because the direction of shrinkage would change with the ARIMA orders. Still, what can be said clearly for any dynamic model is that the multistep estimators make them less stochastic and more conservative.</p>
<h3>Multistep forecast errors are always correlated</h3>
<p>This is a small finding, made in passing. It means that, for example, the forecast error two steps ahead is always correlated with the three steps ahead one. This does not depend on the autocorrelation of residuals or any violation of the assumptions of the model, but only on whether the parameters of the model are zero or not. The effect arises from the model rather than from the data. The only situation in which the forecast errors are not correlated is when the model is deterministic (e.g. a linear trend). This has important practical implications, because some forecasting techniques make the explicit and unrealistic assumption that these correlations are zero, which impacts the final forecasts.</p>
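<p>This effect is easy to see in a small simulation. The sketch below is my own illustration, not the paper's code: it generates data from ETS(A,N,N), filters it with the true smoothing parameter, and correlates the 2- and 3-steps ahead in-sample forecast errors:</p>
<pre class="decode"># Generate data from ETS(A,N,N) with standard Normal errors
set.seed(41)
alpha <- 0.3
obs <- 500
eps <- rnorm(obs)
y <- numeric(obs)
level <- 100
for(t in 1:obs){
    y[t] <- level + eps[t]
    level <- level + alpha * eps[t]
}

# Collect in-sample 2- and 3-steps ahead forecast errors
level <- y[1]
errors <- matrix(NA, obs - 3, 2)
for(t in 1:(obs - 3)){
    level <- level + alpha * (y[t] - level)
    errors[t, ] <- y[t + 2:3] - level
}
cor(errors[, 1], errors[, 2])
# The correlation is positive (theoretically about 0.34 for alpha=0.3)</pre>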
<h3>Predictive likelihoods aligning with the multistep estimators</h3>
<p>We showed that if a model assumes the Normal distribution, then in the cases of MSEh and MSCE (Mean Squared Cumulative Error) the distribution of the future values is Normal as well. This means that there are predictive likelihood functions for these estimators, the maximum of which is achieved with the same set of parameters as the minimum of the multistep estimators. This has two implications:</p>
<ol>
<li>These multistep estimators should be consistent and efficient, especially when the smoothing parameters are close to zero;</li>
<li>The predictive likelihoods can be used in the model selection via information criteria.</li>
</ol>
<p>The first point also explains the contradiction in the literature: if the smoothing parameter in the population is close to zero, then the multistep estimators will give more efficient estimates than the conventional ones; otherwise, they might be less efficient. We have not exploited the second point, but it would be useful when the best model needs to be selected for the data and an analyst wants to use information criteria. This is one of the potential directions for future research.</p>
<h3>How General Predictive Likelihood (GPL) encompasses all popular multistep estimators</h3>
<p>GPL arises when the joint distribution of 1 to h steps ahead forecast errors is considered. It will be Multivariate Normal if the model assumes normality. In the paper, we showed that the maximum of GPL coincides with the minimum of the so-called &#8220;Generalised Variance&#8221; &#8211; the determinant of the covariance matrix of forecast errors. This minimisation reduces variances for all the forecast errors (from 1 to h) and increases the covariances between them, making the multistep forecast errors look more similar. In the perfect case, when the model is correctly specified (no omitted or redundant variables, homoscedastic residuals etc), the maximum of GPL will coincide with the maximum of the conventional likelihood of the Normal distribution (see <a href="https://openforecast.org/adam/ADAMETSEstimationLikelihood.html">Section 11.1 of the ADAM monograph</a>).</p>
<p>Incidentally, it can be shown that the existing estimators are just special cases of the GPL with some restrictions imposed on the covariance matrix. I do not intend to show this here; the reader is encouraged to either read the paper or see the brief discussion <a href="https://openforecast.org/adam/multistepLosses.html#multistepLossesGPL">in Subsection 11.3.5</a> of the ADAM monograph.</p>
<h3>GTMSE &#8211; Geometric Trace Mean Squared Error</h3>
<p>Finally, looking at the special cases of GPL, we noticed one that had not been discussed in the literature. We called it the Geometric Trace Mean Squared Error (GTMSE) because of the logarithms in the formula:<br />
\begin{equation} \label{eq:GTMSE}<br />
	\mathrm{GTMSE} = \sum_{j=1}^h \log \frac{1}{T-j} \sum_{t=1}^{T-j} e_{t+j|t}^2 .<br />
\end{equation}<br />
GTMSE imposes shrinkage on the parameters similarly to the other estimators, but does it more mildly because of the logarithms in the formula. In effect, the logarithms make the variances of all forecast errors similar to each other. As a result, GTMSE does not focus on the larger variances as the other methods do, but minimises all of them simultaneously.</p>
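<p>Given a matrix of in-sample multistep forecast errors (one column per horizon \(j\)), the formula above can be computed directly. Here is a minimal sketch; the function <code>gtmse</code> is my own illustration, not the <code>smooth</code> package implementation:</p>
<pre class="decode">gtmse <- function(errors){
    # errors: matrix of in-sample forecast errors, one column per horizon.
    # Sum the logs of the per-horizon mean squared errors
    sum(log(colMeans(errors^2, na.rm=TRUE)))
}

# Toy example: two horizons, the second with larger errors
gtmse(matrix(c(1, 1, 2, 2), ncol=2))
# log(1) + log(4) = 1.386294</pre>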
<h2>Examples in R</h2>
<p>The estimators discussed in the paper are all implemented in the functions of the <code>smooth</code> package in R, including <code>adam()</code>, <code>es()</code>, <code>ssarima()</code>, <code>msarima()</code> and <code>ces()</code>. In the example below, we will see how the shrinkage works for ETS using the Box-Jenkins sales data (the example is taken from ADAM, <a href="https://openforecast.org/adam/multistepLosses.html#an-example-in-r-2">Subsection 11.3.7</a>):</p>
<pre class="decode">library(smooth)

adamETSAANBJ <- vector("list",6)
names(adamETSAANBJ) <- c("MSE","MSEh","TMSE","GTMSE","MSCE","GPL")
for(i in 1:length(adamETSAANBJ)){
    adamETSAANBJ[[i]] <- adam(BJsales, "AAN", h=10, holdout=TRUE,
                              loss=names(adamETSAANBJ)[i])
}</pre>
<p>The ETS(A,A,N) model, applied to this data, has different estimates of smoothing parameters:</p>
<pre class="decode">sapply(adamETSAANBJ,"[[","persistence") |>
	round(5)</pre>
<pre>          MSE MSEh TMSE   GTMSE MSCE GPL
alpha 1.00000    1    1 1.00000    1   1
beta  0.23915    0    0 0.14617    0   0</pre>
<p>We can see how shrinkage shows itself in the case of the smoothing parameter \(\beta\), which is shrunk to zero by MSEh, TMSE, MSCE and GPL but left intact by MSE and shrunk a little bit in the case of GTMSE. These different estimates of parameters lead to different forecasting trajectories and prediction intervals, as can be shown visually:</p>
<pre class="decode">par(mfcol=c(3,2), mar=c(2,2,4,1))
# Produce forecasts
lapply(adamETSAANBJ, forecast, h=10, interval="prediction") |>
# Plot forecasts
    lapply(function(x, ...) plot(x, ylim=c(200,280), main=x$model$loss))</pre>
<p>This should result in the following plots:</p>
<div id="attachment_3162" style="width: 1210px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/07/ADAMBJSalesLosses.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-3162" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/07/ADAMBJSalesLosses.png&amp;nocache=1" alt="ADAM ETS on Box-Jenkins data with several estimators" width="1200" height="700" class="size-full wp-image-3162" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/07/ADAMBJSalesLosses.png&amp;nocache=1 1200w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/07/ADAMBJSalesLosses-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/07/ADAMBJSalesLosses-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/07/ADAMBJSalesLosses-768x448.png&amp;nocache=1 768w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a><p id="caption-attachment-3162" class="wp-caption-text">ADAM ETS on Box-Jenkins data with several estimators</p></div>
<p>Analysing the figure, it looks like the shrinkage of the smoothing parameter \(\beta\) is useful for this time series: the forecasts from ETS(A,A,N) estimated using MSEh, TMSE, MSCE and GPL look closer to the actual values than the ones from MSE and GTMSE. To assess their performance more precisely, we can extract error measures from the models:</p>
<pre class="decode">sapply(adamETSAANBJ,"[[","accuracy")[c("ME","MSE"),] |>
	round(5)</pre>
<pre>         MSE    MSEh    TMSE    GTMSE    MSCE     GPL
ME   3.22900 1.06479 1.05233  3.44962 1.04604 0.95515
MSE 14.41862 2.89067 2.85880 16.26344 2.84288 2.62394</pre>
<p>Alternatively, we can calculate the error measures based on the produced forecasts, using the <code>measures()</code> function from the <code>greybox</code> package:</p>
<pre class="decode">lapply(adamETSAANBJ, forecast, h=10) |>
    sapply(function(x, ...) measures(holdout=x$model$holdout,
                                     forecast=x$mean,
                                     actual=actuals(x$model)))</pre>
<p>A thing to note about the multistep estimators is that they are slower than the conventional ones because they require producing 1 to \( h \) steps ahead forecasts from every observation in-sample. In the case of the <code>smooth</code> functions, the time elapsed can be extracted from the models in the following way:</p>
<pre class="decode">sapply(adamETSAANBJ, "[[", "timeElapsed")</pre>
<p>In summary, the multistep estimators are potentially useful in forecasting and can produce models with more accurate forecasts. This happens because they impose shrinkage on the estimates of parameters, making models less stochastic and more inert. But their performance depends on the specific situation and the available data, so I would not recommend using them universally.</p>
<p>Message <a href="https://openforecast.org/2023/08/09/multi-step-estimators-and-shrinkage-effect-in-time-series-models/">Multi-step Estimators and Shrinkage Effect in Time Series Models</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2023/08/09/multi-step-estimators-and-shrinkage-effect-in-time-series-models/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Story of &#8220;Probabilistic forecasting of hourly emergency department arrivals&#8221;</title>
		<link>https://openforecast.org/2023/05/10/story-of-probabilistic-forecasting-of-hourly-emergency-department-arrivals/</link>
					<comments>https://openforecast.org/2023/05/10/story-of-probabilistic-forecasting-of-hourly-emergency-department-arrivals/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Wed, 10 May 2023 20:47:27 +0000</pubDate>
				<category><![CDATA[adam()]]></category>
		<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[ETS]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Regression]]></category>
		<category><![CDATA[Stories]]></category>
		<category><![CDATA[Univariate models]]></category>
		<category><![CDATA[ADAM]]></category>
		<category><![CDATA[papers]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3092</guid>

					<description><![CDATA[<p>The paper Back in 2020, when we were all siting in the COVID lockdown, I had a call with Bahman Rostami-Tabar to discuss one of our projects. He told me that he had an hourly data of an Emergency Department from a hospital in Wales, and suggested writing a paper for a healthcare audience to [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2023/05/10/story-of-probabilistic-forecasting-of-hourly-emergency-department-arrivals/">Story of &#8220;Probabilistic forecasting of hourly emergency department arrivals&#8221;</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><a href="/en/2023/05/09/probabilistic-forecasting-of-hourly-emergency-department-arrivals/">The paper</a></p>
<p>Back in 2020, when we were all sitting in the COVID lockdown, I had a call with <a href="https://www.bahmanrt.com/">Bahman Rostami-Tabar</a> to discuss one of our projects. He told me that he had hourly data from an Emergency Department of a hospital in Wales and suggested writing a paper for a healthcare audience to show them how forecasting can be done properly in this setting. I noted that we did not have experience in working with high frequency data and that it would be good to have someone with relevant expertise. I knew a guy who worked in energy forecasting, <a href="http://www.jethrobrowell.com/">Jethro Browell</a> (we are mates in the <a href="https://forecasters.org/programs/communities/united-kingdom-chapter/">IIF UK Chapter</a>), so the three of us had a chat and formed a team to figure out better ways of forecasting ED arrivals.</p>
<p>We agreed that each of us would try their own models. Bahman wanted to try TBATS, Prophet and models from the <a href="https://github.com/tidyverts/fasster">fasster</a> package in R (spoiler: the latter produced very poor forecasts on our data, so we removed them from the paper). Jethro had a pool of <a href="https://www.gamlss.com/" rel="noopener" target="_blank">GAMLSS</a> models with different distributions, including Poisson and truncated Normal. He also tried a Gradient Boosting Machine (GBM). I decided to test ETS, Poisson Regression and <a href="https://openforecast.org/adam/" rel="noopener" target="_blank">ADAM</a>. We agreed that we would measure the performance of the models not only in terms of point forecasts (using RMSE), but also in terms of quantiles (pinball loss and quantile bias) and computational time. It took us a year to do all the experiments and another one to find a journal that would not desk-reject our paper because the editor thought it was not relevant (even though they had published similar papers in the past). It was rejected from the Annals of Emergency Medicine, Emergency Medicine Journal, American Journal of Emergency Medicine and Journal of Medical Systems. In the end, we submitted to Health Systems, and after a short revision the paper got accepted. So, there is a happy end to this story.</p>
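<p>For reference, the pinball loss used in the evaluation can be written in a few lines of R. This follows the standard definition of the quantile loss, not the paper's exact evaluation code, and the function name <code>pinball</code> is my own:</p>
<pre class="decode">pinball <- function(actual, forecast, level){
    # level is the targeted quantile, e.g. 0.95 for the 95th percentile
    mean(ifelse(actual >= forecast,
                level * (actual - forecast),
                (1 - level) * (forecast - actual)))
}

# Toy example with three hourly arrivals and a 95% quantile forecast
pinball(actual=c(5, 7, 3), forecast=c(4, 8, 3), level=0.95)
# 0.3333333</pre>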
<p>In the paper itself, we found that, overall, in terms of quantile bias (calibration of models), GAMLSS with the truncated Normal distribution and ADAM performed better than the other approaches, with the former also doing well in terms of pinball loss and the latter in terms of point forecasts (RMSE). Note that the count data models did worse than the continuous ones, although one would expect the Poisson distribution to be appropriate for ED arrivals.</p>
<p>I don&#8217;t want to explain the paper and its findings in detail in this post, but given my relation to ADAM, I decided to briefly explain what I included in the model and how it was used. After all, this is the first paper that uses almost all the main features of ADAM and shows how powerful it can be if used correctly.</p>
<h3>Using ADAM in Emergency Department arrivals forecasting</h3>
<p><strong>Disclaimer</strong>: The explanation provided here relies on the content of my monograph &#8220;<a href="https://openforecast.org/adam/">Forecasting and Analytics with ADAM</a>&#8221;. In the paper, I ended up creating quite a complicated model that allowed capturing complex demand dynamics. To fully understand what I am discussing in this post, you might need to refer to the monograph.</p>
<div id="attachment_3117" style="width: 1210px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-3117" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data.png&amp;nocache=1" alt="Emergency Department Arrivals" width="1200" height="800" class="size-full wp-image-3117" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data.png&amp;nocache=1 1200w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data-300x200.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data-1024x683.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data-768x512.png&amp;nocache=1 768w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a><p id="caption-attachment-3117" class="wp-caption-text">Emergency Department Arrivals. The plots were generated using <code>seasplot()</code> function from the <code>tsutils</code> package.</p></div>
<p>The figure above shows the data that we were dealing with, together with several seasonal plots (generated using the <code>seasplot()</code> function from the <code>tsutils</code> package). As we can see, the data exhibits hour-of-day, day-of-week and week-of-year seasonalities, although some of them are not very pronounced. The data does not seem to have a strong trend, although there is a slow increase in the level. Based on this, I decided to use ETS(M,N,M) as the basis for modelling. However, if we want to capture all three seasonal patterns, we need to fit a triple-seasonal model, which requires too much computational time because of the estimation of all the seasonal indices. So, I decided to use a <a href="https://openforecast.org/adam/ADAMMultipleFrequencies.html">double-seasonal ETS(M,N,M)</a> instead, with hour-of-day and hour-of-week seasonalities, and to include <a href="https://openforecast.org/adam/ETSXMultipleSeasonality.html">dummy variables for the week-of-year seasonality</a>. The alternative to the week-of-year dummies would be an hour-of-year seasonal component, which would require estimating 8,760 seasonal indices, potentially overfitting the data. I argue that the week-of-year dummies provide sufficient flexibility, so there is no need to capture the intra-yearly profile at a more granular level.</p>
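<p>For illustration, such week-of-year dummies can be constructed from a factor based on the ISO week number. This is a minimal base R sketch with hypothetical timestamps, not the code from the paper:</p>

```r
# Hypothetical hourly timestamps covering ten weeks
timestamps <- seq(from = as.POSIXct("2019-01-01 00:00", tz = "UTC"),
                  by = "hour", length.out = 24 * 7 * 10)
# "%V" gives the ISO 8601 week of the year (01-53)
weekOfYear <- factor(strftime(timestamps, format = "%V", tz = "UTC"))
# In a regression formula, such a factor expands into
# week-of-year dummy variables, one per level
nlevels(weekOfYear)
```

<p>Estimating one coefficient per week of the year (at most 53) is far cheaper than estimating 8,760 hour-of-year seasonal indices.</p>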
<p>To make things more exciting, given that we were dealing with hourly data from a UK hospital, we had to handle the issues of <a href="https://openforecast.org/adam/MultipleFrequenciesDSTandLeap.html">daylight saving and leap years</a>. I know that many of us hate the idea of daylight saving, because we have to change our lifestyles twice a year just because of an old tradition. But in addition to being <a href="https://publichealth.jhu.edu/2023/7-things-to-know-about-daylight-saving-time#:~:text=Making%20the%20shift%20can%20increase,a%20professor%20in%20Mental%20Health.">bad for your health</a>, this nasty thing messes things up for my models, because once a year a day has 23 hours and once a year it has 25. Luckily, this is taken care of by <code>adam()</code>, which shifts the seasonal indices when the time change happens. All you need to do for this mechanism to work is to provide an object with timestamps to the function (for example, a zoo object). As for the leap year, it becomes less important when we model the week-of-year seasonality instead of the day-of-year or hour-of-year one.</p>
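<p>The effect of the clock change is easy to see in base R: counting the hours per calendar day in the UK timezone reveals a 23-hour day in spring. A minimal sketch (the dates here are illustrative, not from the paper):</p>

```r
# Hourly timestamps around the UK switch to British Summer Time
timestamps <- seq(from = as.POSIXct("2021-03-27 00:00", tz = "Europe/London"),
                  by = "hour", length.out = 72)
# Count how many hourly observations fall on each calendar day
hoursPerDay <- table(format(timestamps, "%Y-%m-%d", tz = "Europe/London"))
# 2021-03-28 has only 23 hours: 01:00 does not exist on that day
hoursPerDay
```

<p>With plain <code>ts</code> objects this information is lost, which is why <code>adam()</code> needs timestamped data (such as a zoo object) to shift the seasonal indices correctly.</p>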
<div id="attachment_3123" style="width: 1210px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data-daily.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-3123" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data-daily.png&amp;nocache=1" alt="Emergency Department Daily Arrivals" width="1200" height="700" class="size-full wp-image-3123" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data-daily.png&amp;nocache=1 1200w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data-daily-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data-daily-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/05/EDArrivals-data-daily-768x448.png&amp;nocache=1 768w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a><p id="caption-attachment-3123" class="wp-caption-text">Emergency Department Daily Arrivals</p></div>
<p>Furthermore, as the figure above shows, <a href="https://openforecast.org/adam/ADAMX.html">calendar events</a> play a crucial role in ED arrivals. For example, Emergency Department demand over Christmas is typically lower than average (the drops in the figure above), but right after Christmas it tends to go up (with all the people who injured themselves during the festivities showing up at the hospital). So these events need to be taken into account by the model in the form of additional dummy variables, together with their lags (the 24-hour lags of the original variables).</p>
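<p>A hypothetical sketch of how such a dummy and its 24-hour lag can be built in R (the event placement is made up for illustration):</p>

```r
nHours <- 24 * 14                          # two weeks of hourly data
christmasDay <- numeric(nHours)
christmasDay[(24 * 6 + 1):(24 * 7)] <- 1   # suppose day 7 is 25 December
# The lagged dummy marks the 24 hours following the event,
# capturing the post-Christmas surge in arrivals
christmasDayLag24 <- c(numeric(24), christmasDay[1:(nHours - 24)])
```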
<p>But that&#8217;s not all. If we want to fit a multiplicative seasonal model (which makes more sense than the additive one due to the changing seasonal amplitude at different times of the year), we need to do something about the zeroes, which happen naturally in ED arrivals overnight (see the first figure in this post with the seasonal plots). They do not necessarily happen at the same time of day, but the probability of having no arrivals tends to increase at night. This meant that I needed to introduce the <a href="https://openforecast.org/adam/ADAMIntermittent.html">occurrence part of the model</a> to take care of the zeroes. I used a very basic occurrence model called &#8220;<a href="https://openforecast.org/adam/ADAMOccurrence.html#oETSD">direct probability</a>&#8221;, because it is more sensitive to changes in demand occurrence, making the model more responsive. I did not use a seasonal demand occurrence model (and I don&#8217;t remember why), which is one of the limitations of the ADAM used in this study.</p>
<p>Finally, given that we were dealing with low-volume data, a positive distribution needed to be used instead of the Normal one. I used the <a href="https://openforecast.org/adam/ADAMETSMultiplicativeDistributions.html">Gamma distribution</a> because it is better behaved than the Log-Normal or the Inverse Gaussian, which tend to have much heavier tails. When exploring the data, I found that the Gamma does better than the other two, probably because the ED arrivals have relatively slim tails.</p>
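<p>The difference in tail behaviour can be illustrated by matching the first two moments of the Gamma and the Log-Normal and comparing an extreme quantile. The numbers below are illustrative, not taken from the paper:</p>

```r
meanValue <- 1
varValue <- 0.25
# Gamma: mean = shape * scale, variance = shape * scale^2
shape <- meanValue^2 / varValue
scale <- varValue / meanValue
# Log-Normal with the same mean and variance
sigma2 <- log(1 + varValue / meanValue^2)
mu <- log(meanValue) - sigma2 / 2
# The Log-Normal 99.9% quantile exceeds the Gamma one:
# its right tail is heavier
qgamma(0.999, shape = shape, scale = scale)
qlnorm(0.999, meanlog = mu, sdlog = sqrt(sigma2))
```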
<p>So, the final ADAM included the following features:</p>
<ul>
<li>ETS(M,N,M) as the basis;</li>
<li>Double seasonality;</li>
<li>Week of year dummy variables;</li>
<li>Dummy variables for calendar events with their lags;</li>
<li>&#8220;Direct probability&#8221; occurrence model;</li>
<li>Gamma distribution for the residuals of the model.</li>
</ul>
<p>This model is summarised in equation (3) of <a href="/en/2023/05/09/probabilistic-forecasting-of-hourly-emergency-department-arrivals/">the paper</a>.</p>
<p>The model was <a href="https://openforecast.org/adam/ADAMInitialisation.html">initialised using backcasting</a>, because otherwise we would need to estimate too many initial values for the state vector. The estimation itself was done using <a href="https://openforecast.org/adam/ADAMETSEstimationLikelihood.html">likelihood</a>. In R, this corresponded to roughly the following lines of code:</p>
<pre class="decode">library(smooth)
# Occurrence part: "direct probability" model for the zeroes
oesModel <- oes(y, "MNN", occurrence="direct", h=48)
# Double-seasonal ETS(M,N,M) with calendar dummies and their 24-hour lags,
# week-of-year dummies, Gamma distribution and backcasted initials
adamModelFirst <- adam(ourData, "MNM", lags=c(24,24*7), formula=y~x+xLag24+weekOfYear,
                       h=48, initial="backcasting",
                       occurrence=oesModel, distribution="dgamma")</pre>
<p>Where <code>x</code> was the categorical variable (a factor in R) with all the main calendar events. However, even with backcasting, the estimation of such a big model took an hour and 25 minutes. Given that Bahman, Jethro and I had agreed to do a rolling origin evaluation, I decided to help the function with the estimation inside the loop, providing <a href="https://openforecast.org/adam/ADAMInitialisation.html#starting-optimisation-of-parameters">starting values to the optimiser</a> based on the very first estimated model. As a result, each estimation of ADAM in the rolling origin took 1.5 minutes. The code in the loop was modified to:</p>
<pre class="decode"># Extract the estimated parameters of the first model
adamParameters <- coef(adamModelFirst)
oesModel <- oes(y, "MNN", occurrence="direct", h=48)
# Pass them to the optimiser as starting values via the B argument
adamModel <- adam(ourData, "MNM", lags=c(24,24*7), formula=y~x+xLag24+weekOfYear,
                  h=48, initial="backcasting",
                  occurrence=oesModel, distribution="dgamma",
                  B=adamParameters)</pre>
<p>Finally, we generated mean and quantile forecasts for 48 hours ahead. I used <a href="https://openforecast.org/adam/ADAMForecastingPI.html#semiparametric-intervals">semiparametric quantiles</a>, because I expected violations of some of the assumptions of the model (e.g. autocorrelated residuals). The respective R code is:</p>
<pre class="decode"># 19 upper quantiles, for probabilities from 0.05 to 0.95
testForecast <- forecast(adamModel, newdata=newdata, h=48,
                         interval="semiparametric", level=c(1:19/20), side="upper")</pre>
<p>Furthermore, given that the data is integer-valued (how many people visit the hospital each hour) and ADAM produces fractional quantiles (because of the Gamma distribution), I decided to see how it would perform if the quantiles were rounded up. This strategy is simple and might be sensible when a continuous model is used for forecasting count data (see the discussion in the paper). However, after running the experiment, the ADAM with rounded-up quantiles performed very similarly to the conventional one, so we decided not to include it in the paper.</p>
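<p>The rounding itself is a one-liner with <code>ceiling()</code>; a minimal sketch on made-up quantile values:</p>

```r
# Hypothetical fractional quantile forecasts of hourly arrivals
quantileForecasts <- c(2.1, 4.7, 7.3, 10.8)
# Rounding up rather than to the nearest integer is conservative:
# the resulting quantiles cover at least the nominal probability
roundedQuantiles <- ceiling(quantileForecasts)
```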
<p>In the end, as stated earlier in this post, we concluded that in our experiment there were two well-performing approaches: GAMLSS with the Truncated Normal distribution (called "NOtr-2" in the paper) and ADAM in the form explained above. The popular TBATS, Prophet and Gradient Boosting Machine performed poorly in comparison. For the first two, this is because of the lack of explanatory variables and inappropriate distributional assumptions (normality). As for the GBM, this is probably due to the lack of a dynamic element in it (e.g. a changing level and seasonal components).</p>
<p>Concluding this post, as you can see, I managed to fit a decent model based on ADAM, which captured the main characteristics of the data. However, it took some time to understand which features should be included, together with some experiments on the data. This case study shows that if you want to get a better model for your problem, you might need to dive into the problem and spend some time analysing what you have on hand, experimenting with different settings of the model. ADAM provides the flexibility necessary for such experiments.</p>
<p>Message <a href="https://openforecast.org/2023/05/10/story-of-probabilistic-forecasting-of-hourly-emergency-department-arrivals/">Story of &#8220;Probabilistic forecasting of hourly emergency department arrivals&#8221;</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2023/05/10/story-of-probabilistic-forecasting-of-hourly-emergency-department-arrivals/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Probabilistic forecasting of hourly emergency department arrivals</title>
		<link>https://openforecast.org/2023/05/09/probabilistic-forecasting-of-hourly-emergency-department-arrivals/</link>
					<comments>https://openforecast.org/2023/05/09/probabilistic-forecasting-of-hourly-emergency-department-arrivals/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 09 May 2023 06:45:13 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[ETS]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[ADAM]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[papers]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3090</guid>

					<description><![CDATA[<p>Authors: Bahman Rostami-Tabar, Jethro Browell, Ivan Svetunkov Journal: Health Systems Abstract: An accurate forecast of Emergency Department (ED) arrivals by an hour of the day is critical to meet patients’ demand. It enables planners to match ED staff to the number of arrivals, redeploy staff, and reconfigure units. In this study, we develop a model [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2023/05/09/probabilistic-forecasting-of-hourly-emergency-department-arrivals/">Probabilistic forecasting of hourly emergency department arrivals</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><strong>Authors</strong>: <a href="https://www.bahmanrt.com/">Bahman Rostami-Tabar</a>, <a href="http://www.jethrobrowell.com/">Jethro Browell</a>, Ivan Svetunkov</p>
<p><strong>Journal</strong>: Health Systems</p>
<p><strong>Abstract</strong>: An accurate forecast of Emergency Department (ED) arrivals by an hour of the day is critical to meet patients’ demand. It enables planners to match ED staff to the number of arrivals, redeploy staff, and reconfigure units. In this study, we develop a model based on Generalised Additive Models and an advanced dynamic model based on exponential smoothing to generate an hourly probabilistic forecast of ED arrivals for a prediction window of 48 hours. We compare the forecast accuracy of these models against appropriate benchmarks, including TBATS, Poisson Regression, Prophet, and simple empirical distribution. We use Root Mean Squared Error to examine the point forecast accuracy and assess the forecast distribution accuracy using Quantile Bias, PinBall Score and Pinball Skill Score. Our results indicate that the proposed models outperform their benchmarks. Our developed models can also be generalised to other services, such as hospitals, ambulances or clinical desk services.</p>
<p>DOI: <a href="https://doi.org/10.1080/20476965.2023.2200526" rel="noopener" target="_blank">10.1080/20476965.2023.2200526</a></p>
<p><a href="https://zenodo.org/record/7874721" rel="noopener" target="_blank">The paper and R code</a>.</p>
<p><a href="/en/2023/05/10/story-of-probabilistic-forecasting-of-hourly-emergency-department-arrivals/">Story of the paper</a>.</p>
<p>Message <a href="https://openforecast.org/2023/05/09/probabilistic-forecasting-of-hourly-emergency-department-arrivals/">Probabilistic forecasting of hourly emergency department arrivals</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2023/05/09/probabilistic-forecasting-of-hourly-emergency-department-arrivals/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
