<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Archives Social media - Open Forecasting</title>
	<atom:link href="https://openforecast.org/category/social-media/feed/" rel="self" type="application/rss+xml" />
	<link>https://openforecast.org/category/social-media/</link>
	<description>How to look into the future</description>
	<lastBuildDate>Sun, 03 May 2026 10:18:09 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2015/08/cropped-usd-05-32x32.png&amp;nocache=1</url>
	<title>Archives Social media - Open Forecasting</title>
	<link>https://openforecast.org/category/social-media/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>smooth in python: ETS with explanatory variables</title>
		<link>https://openforecast.org/2026/05/05/smooth-in-python-ets-with-explanatory-variables/</link>
					<comments>https://openforecast.org/2026/05/05/smooth-in-python-ets-with-explanatory-variables/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 05 May 2026 08:03:37 +0000</pubDate>
				<category><![CDATA[ETS]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[smooth for Python]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[smooth]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=4128</guid>

					<description><![CDATA[<p>We continue our series of posts on the functions from the smooth package for Python/R. Today we will see how to enhance your exponential smoothing with explanatory variables. What? Yes, you heard me! Let&#8217;s dive in! We all know that in real life sales don&#8217;t just evolve over time on their own. Any univariate model, [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/05/05/smooth-in-python-ets-with-explanatory-variables/">smooth in python: ETS with explanatory variables</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>We continue our series of posts on the functions from the smooth package for Python/R. Today we will see how to enhance your exponential smoothing with explanatory variables. What? Yes, you heard me! Let&#8217;s dive in!</p>
<p>We all know that in real life sales don&#8217;t just evolve over time on their own. Any univariate model, such as ARIMA or ETS, is just a way to approximate a complex reality. In practice, there are many factors affecting the demand for your product. What would happen if the price of your product increased? What if you run a promotion (e.g. &#8220;Buy One, Get One Free&#8221;)? Your competitor&#8217;s strategy impacts the demand for your product as well&#8230; There are lots of different factors, and some of them can be quite useful in demand forecasting. But can we combine dynamic univariate models with regression?</p>
<p>Yes, we can! Although ETS is thought of as a pure univariate model, it is easy to extend it to include explanatory variables. There are several great papers showing how it works (e.g. <a href="https://doi.org/10.1016/j.ijpe.2015.09.011">Kourentzes &#038; Petropoulos, 2016</a>), and in fact the <code>es()</code> function from the smooth package for R was used as a benchmark in <a href="https://doi.org/10.1016/j.ijforecast.2021.11.013">the M5 competition</a>.</p>
<p>So, consider a situation where you have weekly sales of a product with some recorded promotions (encoded as dummy variables). We will use a time series from the fcompdata package for Python. The first image shows how the series looks; the vertical lines show when promotions happen. The series itself seems to be seasonal, roughly repeating peaks and troughs every 52 observations (every year). Also, we see that there are two types of promotions, and when they happen sales tend to increase. So, including them should improve the model fit, and if the company decides to run promotions again, the model will forecast demand better. I will start by fitting the ETS(M,N,M) to the data:</p>
<pre class="decode">from smooth import ES
from fcompdata import PromoData

y = PromoData.y

model = ES(model="MNM", lags=52, holdout=True, h=13)
model.fit(y)
model.predict(h=13)
model.plot(7)</pre>
<p><strong>NOTE</strong>: PromoData has a specific structure with several attributes. PromoData.x contains the in-sample data, PromoData.xx has the holdout &#8211; this is consistent with the Mcomp package for R. The new attributes in Python are:</p>
<ul>
<li>PromoData.y &#8211; concatenated training and test sets,</li>
<li>PromoData.xregx &#8211; matrix of explanatory variables for the training set,</li>
<li>PromoData.xregxx &#8211; matrix of explanatory variables for the test set,</li>
<li>PromoData.xreg &#8211; the full (concatenated) matrix of explanatory variables.</li>
</ul>
<p>The following image shows the model fit and the point forecasts from the ETS(M,N,M):</p>
<div id="attachment_4132" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/05/2026-04-17-smooth-posts-03-ETSX-02.png&amp;nocache=1"><img fetchpriority="high" decoding="async" aria-describedby="caption-attachment-4132" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/05/2026-04-17-smooth-posts-03-ETSX-02-300x214.png&amp;nocache=1" alt="ETS(M,N,M) fit and forecast for the promotional data example" width="300" height="214" class="size-medium wp-image-4132" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/05/2026-04-17-smooth-posts-03-ETSX-02-300x214.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/05/2026-04-17-smooth-posts-03-ETSX-02.png&amp;nocache=1 700w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-4132" class="wp-caption-text">ETS(M,N,M) fit and forecast for the promotional data example</p></div>
<p>As expected, because the model does not take promotions into account, it fits the data as best it can and produces forecasts that are oblivious to the potential external effects on sales. We can improve it by including the promotional dummies:</p>
<pre class="decode">X_train = PromoData.xreg
X_test =  PromoData.xregxx

model = ES(model="MNM", lags=52, holdout=True, h=13)
model.fit(y, X_train)
model.predict(h=13, X=X_test)
model.plot(7)</pre>
<div id="attachment_4131" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/05/2026-04-17-smooth-posts-03-ETSX-03.png&amp;nocache=1"><img decoding="async" aria-describedby="caption-attachment-4131" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/05/2026-04-17-smooth-posts-03-ETSX-03-300x214.png&amp;nocache=1" alt="ETS(M,N,M) with explanatory variables" width="300" height="214" class="size-medium wp-image-4131" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/05/2026-04-17-smooth-posts-03-ETSX-03-300x214.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/05/2026-04-17-smooth-posts-03-ETSX-03.png&amp;nocache=1 700w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-4131" class="wp-caption-text">ETS(M,N,M) with explanatory variables</p></div>
<p>The image above shows the fit and the point forecasts from the ETSX(M,N,M) model that now takes the promotions into account. This is quite an improvement in comparison with the previous one. Furthermore, if we can control when to have promotions and what types of promotions to run, we can change the values in the <code>X_test</code> matrix and see what demand to expect in that situation. So, this gives an analyst a tool for more advanced sensitivity analysis.</p>
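<p>For example, a what-if scenario could look like this (a minimal sketch continuing the code above; the promotion plan is made up for illustration, and X_test is assumed to be a 13-week matrix with two promotion columns):</p>
<pre class="decode">import numpy as np

# Hypothetical plan for the next 13 weeks: promotion type 1 on week 3,
# promotion type 2 on week 10, nothing otherwise
X_scenario = np.zeros((13, 2))
X_scenario[2, 0] = 1
X_scenario[9, 1] = 1

# Forecast demand under this scenario with the model fitted above
model.predict(h=13, X=X_scenario)
model.plot(7)</pre>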
<p>Read more about the ETSX <a href="https://openforecast.org/adam/ADAMX.html">here</a>.<br />
Install smooth: <code>pip install smooth</code><br />
<a href="https://github.com/config-i1/smooth/wiki/Explanatory-Variables">ETSX wiki on github</a>.</p>
<p>Message <a href="https://openforecast.org/2026/05/05/smooth-in-python-ets-with-explanatory-variables/">smooth in python: ETS with explanatory variables</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/05/05/smooth-in-python-ets-with-explanatory-variables/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>smooth in python: ETS with model selection</title>
		<link>https://openforecast.org/2026/04/22/smooth-in-python-ets-with-model-selection/</link>
					<comments>https://openforecast.org/2026/04/22/smooth-in-python-ets-with-model-selection/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Wed, 22 Apr 2026 00:06:43 +0000</pubDate>
				<category><![CDATA[ETS]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[smooth for Python]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[ADAM]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[smooth]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=4111</guid>

					<description><![CDATA[<p>As some of you have heard, the smooth package is now on PyPI. So, I&#8217;ve decided to write a series of posts showcasing how some of its functions work. We start with the basics, ETS. ETS stands for the &#8220;Error-Trend-Seasonal&#8221; model or ExponenTial Smoothing. It is a statistical model that relies on time series decomposition [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/04/22/smooth-in-python-ets-with-model-selection/">smooth in python: ETS with model selection</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>As some of you have heard, the smooth package is now on PyPI. So, I&#8217;ve decided to write a series of posts showcasing how some of its functions work. We start with the basics, ETS.</p>
<p>ETS stands for the &#8220;Error-Trend-Seasonal&#8221; model or ExponenTial Smoothing. It is a statistical model that relies on time series decomposition and updates the unobserved states (level/trend/seasonal) based on the mistakes it makes. In a way, you can call it an adaptive model that changes its forecast based on the most recent available information. It is relatively simple to explain and work with, and it has performed well in a variety of competitions (M3, M4, M5, for example).</p>
<p>The smooth package implements an advanced form of ETS in the ADAM class and a more basic one in the ES class. In fact, ES is just a wrapper around ADAM: it is the conventional model with some tuning. Both support all 30 ETS models, have automated model selection and forecast combination, and produce point forecasts together with a variety of prediction interval types. So, if you want a straightforward, robust implementation of ETS, give ES a try.</p>
<p>Here&#8217;s how to use it in Python:</p>
<pre class="decode">from smooth import ES
from fcompdata import M3

# Pick a series from the M3 competition for demonstration
series = M3[2568]
y = series.x
freq = series.period

# Fit ES with the automatic model selection
model = ES(lags=freq, h=18, holdout=True)
model.fit(y)
print(model)</pre>
<p>Running this produces output similar to this:</p>
<pre>Time elapsed: 0.4 seconds
Model estimated using ES() function: ETS(MAM)
With backcasting initialisation
Distribution assumed in the model: Normal
Loss function type: likelihood; Loss function value: 724.8524
Persistence vector g:
 alpha   beta  gamma
0.0065 0.0000 0.0000
Sample size: 98
Number of estimated parameters: 4
Number of degrees of freedom: 94
Information criteria:
      AIC      AICc       BIC      BICc
1457.7047 1458.1348 1468.0446 1469.0306

Forecast errors:
ME: -580.9985; MAE: 604.0204; RMSE: 710.5457
sCE: -149.9347%; Asymmetry: -2.5%; sMAE: 8.6598%; sMSE: 1.0378%
MASE: 0.2653; RMSSE: 0.2452; rMAE: 0.2555; rRMSE: 0.2163</pre>
<p>A few things worth noting from the output:</p>
<ul>
<li>ES automatically selected ETS(MAM) &#8211; a model with multiplicative error, additive trend, and multiplicative seasonality &#8211; as the best fit based on the AICc value</li>
<li>It used backcasting for the model initialisation (the default), which speeds up the process and requires fewer parameters to estimate</li>
<li>It kept the last 18 observations for the holdout, produced forecasts for them, and calculated several forecast errors. This is handy if you want to directly compare different smooth models on a time series.</li>
</ul>
<p>But why are we here? We want to forecast! So, here it is:</p>
<pre class="decode">
model.predict(h=18, interval="prediction")
model.plot(7)
</pre>
<p>This should produce an image similar to the one attached to the post. As simple as that.</p>
<p>Now it&#8217;s your turn! :)</p>
<p>🔗 Install smooth: <code>pip install smooth</code><br />
📖 <a href="https://github.com/config-i1/smooth/wiki">smooth wiki</a></p>
<p>Message <a href="https://openforecast.org/2026/04/22/smooth-in-python-ets-with-model-selection/">smooth in python: ETS with model selection</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/04/22/smooth-in-python-ets-with-model-selection/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>The real Dunning-Kruger effect</title>
		<link>https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/</link>
					<comments>https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 23 Mar 2026 09:03:35 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=4096</guid>

					<description><![CDATA[<p>Many of you have seen this image on the Internet — I&#8217;ve seen it myself a few times on LinkedIn lately. People say it depicts the &#8220;Dunning-Kruger&#8221; effect&#8230; But did you know this is actually an internet meme with little to do with the original paper? Here is one of the recent examples, a screenshot [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/">The real Dunning-Kruger effect</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Many of you have seen this image on the Internet — I&#8217;ve seen it myself a few times on LinkedIn lately. People say it depicts the &#8220;Dunning-Kruger&#8221; effect&#8230; But did you know this is actually an internet meme with little to do with the original paper?</p>
<p>Here is one of the recent examples, a screenshot of <a href="https://www.linkedin.com/posts/fotios-petropoulos-04536023_dear-mr-i-reduce-forecast-error-by-30-share-7437246645530140672-NXnT">a post by Fotios Petropoulos</a> about the effect.</p>
<div id="attachment_4098" style="width: 282px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-Petropoulos.png&amp;nocache=1"><img decoding="async" aria-describedby="caption-attachment-4098" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-Petropoulos-272x300.png&amp;nocache=1" alt="A LinkedIn post by Fotios Petropoulos" width="272" height="300" class="size-medium wp-image-4098" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-Petropoulos-272x300.png&amp;nocache=1 272w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-Petropoulos.png&amp;nocache=1 556w" sizes="(max-width: 272px) 100vw, 272px" /></a><p id="caption-attachment-4098" class="wp-caption-text">A LinkedIn post by Fotios Petropoulos</p></div>
<p>In the original paper, <a href="https://psycnet.apa.org/doi/10.1037/0022-3514.77.6.1121">Kruger and Dunning (1999)</a> ran experiments with undergraduates on humour, logical reasoning, and grammar. Participants completed a test and estimated their percentile rank. The authors then sorted participants into four quartiles by actual performance and computed averages for actual and self-assessed performance for each quartile. The plots in their paper &#8211; the real Dunning–Kruger effect &#8211; are just four data points per line, not a smooth curve over a learning journey (second image).</p>
<p>What did they find? People in the bottom quartile substantially overestimated their performance, often believing they were average or above. Top performers slightly underestimated their standing. The key finding is an asymmetry in miscalibration: low performers overestimate, high performers slightly underestimate.</p>
<p>This has almost nothing to do with the popular &#8220;experience vs. confidence&#8221; image. The original X‑axis is performance quartile at a single point in time; the meme&#8217;s X‑axis is a vague notion of &#8220;experience&#8221; through time. The original Y‑axis is the assessed test percentile; the meme&#8217;s is a free‑floating &#8220;confidence&#8221; construct. In the actual data, perceived performance increases with actual performance &#8211; there is no early spike, no &#8220;valley of despair,&#8221; no &#8220;slope of enlightenment.&#8221; That swooping curve is an internet-era graphic never reported by Kruger and Dunning, and it misleadingly frames the effect as a personal development trajectory the paper never studied.</p>
<p>There is also a serious critique of the original paper from a statistical point of view. For example, <a href="https://doi.org/10.1016/j.intell.2020.101449">Gignac and Zajenkowski (2020)</a> showed that sorting people into quartiles and plotting average self-assessment against average performance can, by itself, generate the characteristic pattern &#8211; purely as a statistical artefact. In their own empirical data, miscalibration was roughly constant across ability levels, consistent with measurement noise rather than a special cognitive deficit in low performers. You can actually reproduce the pattern using two random uncorrelated variables. Here is a simple example in R:</p>
<pre class="decode">set.seed(41)

# Two independent variables: actual and self-assessed performance
x <- rnorm(10000, 100, 10)
y <- rnorm(10000, 100, 10)
plot(x,y)
# Quartile boundaries of the actual performance
xQ <- quantile(x)

# Average actual and assessed performance within each quartile of actual performance
yMeans <- xMeans <- vector("numeric",4)

for(i in 1:4){
    xMeans[i] <- mean(x[x<xQ[i+1] &#038; x>xQ[i]])
    yMeans[i] <- mean(y[x<xQ[i+1] &#038; x>xQ[i]])
}

plot(1:4, xMeans, type="b", ylim=range(xMeans,yMeans),
     xlab="Real performance", ylab="Assessed performance",
     lwd=2)
lines(yMeans, lwd=2, lty=2)
points(yMeans, lwd=2)
legend("topleft",
       legend=c("Actual performance", "Assessed performance"),
       lwd=2, lty=c(1,2), pch=1)</pre>
<p>Which produces an image like this:</p>
<div id="attachment_4100" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-4100" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R-300x175.png&amp;nocache=1" alt="Dunning-Kruger plot reproduction" width="300" height="175" class="size-medium wp-image-4100" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-4100" class="wp-caption-text">Dunning-Kruger plot reproduction</p></div>
<p>If you introduce a correlation between the two variables, the image starts looking even more similar to the ones from the original paper.</p>
<p>So there might be a real effect &#8211; many follow-up studies have measured it with more rigorous tools &#8211; but Dunning and Kruger&#8217;s method was not the right one to establish it. And that image with experience vs confidence is just a meme and a serious misconception that should not be used.</p>
<p>P.S. If you wonder who the &#8220;leading expert&#8221; that Fotios Petropoulos refers to in his post is &#8211; it&#8217;s me. Not sure why he doesn&#8217;t tag me properly.</p>
<p>Message <a href="https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/">The real Dunning-Kruger effect</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>There&#8217;s no such thing as &#8220;deterministic forecast&#8221;</title>
		<link>https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/</link>
					<comments>https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 02 Mar 2026 22:45:31 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=4081</guid>

					<description><![CDATA[<p>Sometimes I see people referring to a &#8220;deterministic&#8221; forecast, and I have some personal issues with this. Because if you apply a model to data then there is nothing deterministic about your forecasts! In many contexts, &#8220;deterministic&#8221; has a precise meaning: no randomness, no uncertainty. A deterministic solution to an optimisation problem (e.g. linear programming) [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/">There&#8217;s no such thing as &#8220;deterministic forecast&#8221;</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Sometimes I see people referring to a &#8220;deterministic&#8221; forecast, and I have some personal issues with this. Because if you apply a model to data then there is nothing deterministic about your forecasts!</p>
<p>In many contexts, &#8220;deterministic&#8221; has a precise meaning: no randomness, no uncertainty. A deterministic solution to an optimisation problem (e.g. linear programming) implies that there are no random inputs or outputs once the model and its parameters are fixed. Forecasting is different. As <a href="https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-131X(199612)15:7%3C495::AID-FOR640%3E3.0.CO;2-O">Chatfield</a> and many others have pointed out, forecasting has multiple sources of uncertainty, and there is essentially zero chance that the future will unfold exactly as any single number suggests.</p>
<p>Yes, some people use &#8220;deterministic&#8221; as a synonym for &#8220;point forecast&#8221;. But that label is still misleading, because a point forecast is not uncertainty-free &#8211; it is just one summary of a predictive distribution (often the conditional mean, sometimes the median or another functional).</p>
<p>Here’s a quick reality check you can do yourself. Take a dataset, apply your model, and write down the point forecast for the next few observations. Now add one new observation, re-estimate, and forecast again (the image in this post depicts exactly that, but with 50 forecasts produced on different subsamples of data). The point forecast will change unless you are dealing with an exotic situation with non-random data (e.g. every day, you sell exactly 100 units). So, which of the two was the &#8220;deterministic&#8221; forecast? If forecasts were truly deterministic in the strict sense, you would not get multiple plausible values from small, reasonable changes in the sample.</p>
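<p>Here is a tiny version of that check (a minimal sketch with a hand-rolled simple exponential smoothing on synthetic data; any model on any dataset will behave similarly):</p>
<pre class="decode">import numpy as np

def ses_forecast(y, alpha=0.1):
    # Simple exponential smoothing: returns the one-step-ahead point forecast
    level = y[0]
    for value in y[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

rng = np.random.default_rng(41)
y = 100 + rng.normal(0, 10, 50)

f_before = ses_forecast(y[:-1])  # point forecast before the newest observation
f_after = ses_forecast(y)        # point forecast after adding one observation
print(f_before, f_after)         # the two "deterministic" forecasts differ</pre>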
<p>This happens because any forecasting method (statistical or ML) depends on data and on modelling choices: parameter estimation, feature selection, splitting rules, tuning, even decisions like &#8220;use α=0.1&#8221;. Those choices can be fixed across samples of data, but fixing them does not remove uncertainty &#8211; it only hides it. The randomness is still there in the data and in the fact that we only observe a sample of it.</p>
<p>So when you see someone mentioning a &#8220;deterministic forecast&#8221;, it&#8217;s worth translating it mentally to: &#8220;a point forecast, probably a conditional mean&#8221;. If you care about decisions and risk, you should know that there is uncertainty associated with this so-called &#8220;deterministic forecast&#8221;, and that it should not be ignored. But this is a topic for another discussion in another post.</p>
<p>Message <a href="https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/">There&#8217;s no such thing as &#8220;deterministic forecast&#8221;</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Scaling of error measures</title>
		<link>https://openforecast.org/2026/02/23/scaling-of-error-measures/</link>
					<comments>https://openforecast.org/2026/02/23/scaling-of-error-measures/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 23 Feb 2026 13:36:12 +0000</pubDate>
				<category><![CDATA[Forecast evaluation]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[error measures]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=4054</guid>

					<description><![CDATA[<p>Apparently, we need to talk about scaling of error measures because this is not as obvious as it seems. In the forecasting literature, since the early days of the field, there has been a general consensus that forecast errors from individual time series should not be analysed and aggregated as is. This is because you [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/02/23/scaling-of-error-measures/">Scaling of error measures</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Apparently, we need to talk about scaling of error measures because this is not as obvious as it seems.</p>
<p>In the forecasting literature, since the early days of the field, there has been a general consensus that forecast errors from individual time series should not be analysed and aggregated as is. This is because you can have very different time series capturing the dynamics of very different processes.</p>
<p>Indeed, if you forecast sales of apples in kilograms, your actual values would be in kilograms of apples, and your point forecast would also be in the same units. Subtracting one from the other tells us how many kilograms of apples we missed with the forecast we produced. But if we then take the average of the forecast errors for apples and beer, we would be aggregating things in different units, which contradicts some basic aggregation principles.</p>
<p>Furthermore, if the company sells thousands of kilograms of apples and a handful of jet engines, aggregating forecast errors on those (e.g. 3000 vs 3) might introduce all sorts of issues, because the model&#8217;s performance on apples might mask its performance on jet engines. Still, jet engines are much more expensive than apples, and forecasting them accurately might be more important for the company than forecasting apples.</p>
<p>So, the forecasting literature has agreed that the forecast errors need to be somehow scaled to make them unitless and not to distort the performance of models on time series with different volumes. There are several ways of doing that, some poor and some reasonable. The state of the art at the moment is to divide error measures by some in-sample statistic to avoid potential holdout-sample distortions. Using mean absolute differences (MAD) for this (thus ending up with MASE or RMSSE) is considered a standard. A couple of years ago, <a href="/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/">I wrote a post about the advantages and disadvantages of several scaling methods</a>.</p>
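<p>For those who prefer code, here is what that scaling looks like for MASE (a minimal sketch of the non-seasonal version, with the in-sample mean absolute differences in the denominator):</p>
<pre class="decode">import numpy as np

def mase(holdout, forecast, insample):
    # Scale the out-of-sample MAE by the in-sample mean absolute differences
    scale = np.mean(np.abs(np.diff(np.asarray(insample))))
    return np.mean(np.abs(np.asarray(holdout) - np.asarray(forecast))) / scale

print(mase([110, 95], [100, 100], [90, 100, 105, 98, 102]))</pre>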
<p>But there is one method that I haven&#8217;t looked at and which is not very well discussed in the forecasting literature. It relies on the monetary value of forecasts. We could multiply each individual forecast error &#8220;e&#8221; by the price of the product &#8220;p&#8221; (thus moving to the missed income per product) and then divide everything by the overall income (price times quantity) from different products. This can be written as:</p>
<p>\begin{equation}<br />
\text{monetary Mean Error} = \frac{\sum_{j=1}^n (p_j \times e_j)} {\sum_{j=1}^n (p_j \times q_j)}<br />
\end{equation}</p>
<p>(the above formula can be modified to use squared or absolute errors). This way we switch from the original units to monetary values, and each error tells you what proportion of the overall income was missed. This is a useful measure because it connects model performance with managerial decisions and takes the value of the product into account (thus we do not mask the expensive jet engines with the cheap apples).</p>
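<p>In code, the measure is straightforward to compute across products (a minimal sketch; the function names are mine, and the absolute-value version is included alongside):</p>
<pre class="decode">import numpy as np

def monetary_mean_error(errors, prices, quantities):
    # Proportion of the overall income missed by the forecasts
    errors, prices, quantities = map(np.asarray, (errors, prices, quantities))
    return np.sum(prices * errors) / np.sum(prices * quantities)

def monetary_mean_absolute_error(errors, prices, quantities):
    # Same idea, with absolute errors in the numerator
    errors, prices, quantities = map(np.asarray, (errors, prices, quantities))
    return np.sum(prices * np.abs(errors)) / np.sum(prices * quantities)</pre>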
<p>However, it might have a potential issue similar to the one the MAE/Mean or wMAPE has: if the sales of the product are not stationary, the denominator will change, driving the proportion either up or down, irrespective of how good the forecast is. I am not sure whether this needs to be addressed, because there is an argument that if the income from a product has increased and the error hasn&#8217;t changed, then the proportion of the missed income has decreased, which makes sense. But if we need to address this, we can switch to the MAD multiplied by price in the denominator. In fact, this was sort of done in the <a href="https://doi.org/10.1016/j.ijforecast.2021.11.013">M5 competition</a>, which used a weighted RMSSE relying on the income from each product over the last 4 weeks of data.</p>
<p>But here is one more interesting thing about this error measure. If we <strong>assume that prices for all products are exactly the same</strong>, they will disappear from the numerator and the denominator, leaving us with just the sum of errors divided by the overall sales of all products. This still maintains the original idea of the proportion of missed income, but now relies on a very strong assumption, which is probably not correct in real life (apples and engines for the same price?). Furthermore, this would again mask the performance of the model on the expensive products. I personally don&#8217;t like this measure and find the assumption unrealistic and potentially misleading. Having said that, I can see some cases where this could still be acceptable and useful (e.g. similar products with similar dynamics and similar prices).</p>
<p>Summarising:</p>
<ol>
<li>If you are conducting a forecasting experiment without a specific context, I&#8217;d recommend using RMSSE or some other similar measure with scaling.</li>
<li>If you have prices of products, income-based scaling might be more informative.</li>
<li>Setting all prices to the same value does not sound appealing to me, but I understand that there is a context where this might work.</li>
</ol>
<p>Message <a href="https://openforecast.org/2026/02/23/scaling-of-error-measures/">Scaling of error measures</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/02/23/scaling-of-error-measures/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Forecasting Competitions Datasets in Python</title>
		<link>https://openforecast.org/2026/01/26/forecasting-competitions-datasets-in-python/</link>
					<comments>https://openforecast.org/2026/01/26/forecasting-competitions-datasets-in-python/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 26 Jan 2026 09:29:25 +0000</pubDate>
				<category><![CDATA[Python]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[Competitions]]></category>
		<category><![CDATA[time series]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3955</guid>

					<description><![CDATA[<p>Here is one small, unexpected piece of news: I now have my first package on PyPI! It’s called fcompdata, and let me tell you a little bit about it. When I test my functions in R, I usually use the M1, M3, and tourism competition datasets because they are diverse enough, containing seasonal, non-seasonal, trended, [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/01/26/forecasting-competitions-datasets-in-python/">Forecasting Competitions Datasets in Python</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Here is one small, unexpected piece of news: I now have my first package on PyPI! It’s called <a href="https://pypi.org/project/fcompdata/">fcompdata</a>, and let me tell you a little bit about it.</p>
<p>When I test my functions in R, I usually use the M1, M3, and tourism competition datasets because they are diverse enough, containing seasonal, non-seasonal, trended, and non-trended time series of different frequencies (yearly, quarterly, monthly). The total number of these series is 5,315, which is large enough but not too heavy for my PC. So, when I run something on those datasets, it becomes like a stress test for the forecasting approach, and I can see where it fails and how it can be improved. I consider this type of test a toy experiment — something to do before applying anything to real-world data.</p>
<p>In R, there are the Mcomp and Tcomp packages that contain these datasets, and I like how they are organised. You can do something like this:</p>
<pre class="decode">series <- Mcomp::M3[[2568]]
ourModel <- adam(series$x)
ourForecast <- forecast(model, h=series$h)
ourError <- series$xx - ourForecast$mean</pre>
<p>Each series from the dataset contains all the necessary attributes to run the experiment without trouble. This is easy and straightforward. Plus, I don’t need to download or organise any data — I just use the installed package.</p>
<p>When I started vibe coding in Python, I realised that I missed this functionality. So, with the help of Claude AI, I created a Python script to download the data from the Monash repository and organise it the way I liked. But then I realised two things, which motivated me to package it:</p>
<ol>
<li>I needed to drag this script with me to every project I worked on. It would be much easier to just run <code>pip install fcompdata</code> and forget about everything else.</li>
<li>Some series in the Monash repository differ from those in the R package.</li>
</ol>
<p>Wait, what?! Really?</p>
<p>Yes. The difference is tiny — it’s a matter of rounding. For example, series N350 from the M1 competition data (T169 from the quarterly data subset) has three digits in the R package and only two if downloaded from the Monash repository (Zenodo website).</p>
<p>Who cares?! It's just one digit difference, right?</p>
<p>Well, if you want to reproduce results across different languages, this tiny difference might become your nightmare. So, I care (and probably nobody else in the world), and I decided to create a proper Python package. You can now do this in Python and relax:</p>
<pre class="decode">pip install fcompdata

from fcompdata import M1, M3, Tourism
series = M3[2568]</pre>
<p>The "series" object is now an instance of the MCompSeries class that has the same attributes as in R: series.x, series.h, series.xx, etc.</p>
<p>As simple as that!</p>
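<p>For instance, here is a rough Python analogue of the R snippet above, using only the attributes listed in this post and a seasonal naive forecast as a stand-in for a proper model (a minimal sketch):</p>
<pre class="decode">import numpy as np
from fcompdata import M3

series = M3[2568]
y_train = np.asarray(series.x)
y_test = np.asarray(series.xx)

# Seasonal naive: repeat the last observed year over the forecast horizon
m = series.period
forecast = np.resize(y_train[-m:], series.h)
errors = y_test - forecast</pre>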
<p>One more thing: I’ve added support for the M4 competition data, which — when imported — will be downloaded and formatted properly. The dataset is large (100k time series), and I personally don’t like it. I even wrote <a href="https://openforecast.org/2020/03/01/m-competitions-from-m4-to-m5-reservations-and-expectations/">a post about it back in 2020</a>. But if I want the package to be useful to a wider audience, I shouldn’t impose my personal preferences — you should decide for yourselves whether to use it or not.</p>
<p>P.S. Submitting to PyPI gave me a good understanding of the submission process for Python and why it can be such a mess. My package was published just a few seconds after submission — nobody looked at it, nobody ran any tests. CRAN does a variety of checks to ensure you don’t submit garbage. PyPI doesn’t care. So, I’ve gained more respect for CRAN after submitting this package to PyPI.</p>
<p>Message <a href="https://openforecast.org/2026/01/26/forecasting-competitions-datasets-in-python/">Forecasting Competitions Datasets in Python</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/01/26/forecasting-competitions-datasets-in-python/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Risky business: how to select your model based on risk preferences</title>
		<link>https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/</link>
					<comments>https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 19 Jan 2026 11:28:04 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[error measures]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[Information criteria]]></category>
		<category><![CDATA[model combination]]></category>
		<category><![CDATA[model selection]]></category>
		<category><![CDATA[papers]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3950</guid>

					<description><![CDATA[<p>What do you use for model selection? Do you select the best model based on its cross-validated performance, or do you use in-sample measures like AIC? If so, there is a way to improve your selection process further. JORS recently published a paper by Nikos Kourentzes and me based on a simple but powerful idea: [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/">Risky business: how to select your model based on risk preferences</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>What do you use for model selection? Do you select the best model based on its cross-validated performance, or do you use in-sample measures like AIC? If so, there is a way to improve your selection process further.</p>
<p>JORS recently published a paper by Nikos Kourentzes and me based on a simple but powerful idea: instead of using summary statistics (like the mean RMSE of cross-validated errors), you should consider the entire distribution and choose a specific quantile. This aligns with <a href="https://openforecast.org/2024/03/27/what-does-lower-error-measure-really-mean/">my previous post on error measures</a>, but here is the core intuition:</p>
<p>The distribution of error measures is almost always asymmetric. If you only look at the average, you end up with a &#8220;mean temperature in the hospital&#8221; statistic, which doesn&#8217;t reflect how models actually behave. Some models perform great on most series but fail miserably on a few.</p>
<p>What can we do in this case? We can look at quantiles of the distribution.</p>
<p>For example, if we use the 84th quantile, we compare the models based on their &#8220;bad&#8221; performance, i.e. situations where they fail and produce less accurate forecasts. If you choose the best-performing model there, you will end up with something that does not fail as much. So your preferences for the model become risk-averse in this situation.</p>
<p>If you focus on a lower quantile (e.g. the 16th), you are looking at models that do well on the well-behaved series and ignoring how they do on the difficult ones. So, your model selection preferences can be described as risk-tolerant, because you accept that the best-performing model might fail on a difficult time series.</p>
<p>Furthermore, the median (the 50th quantile, the middle of the sample) corresponds to the risk-neutral situation, because it ignores the tails of the distribution.</p>
<p>What about the mean? This is a risk-agnostic strategy, because it says nothing about the performance on the difficult or easy time series &#8211; it takes everything and nothing in it at the same time, hiding the true risk profile.</p>
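<p>In code, once you have the distribution of cross-validated errors per series, the selection rule is a one-liner (a minimal sketch with made-up error distributions for two hypothetical models):</p>
<pre class="decode">import numpy as np

rng = np.random.default_rng(41)
# Cross-validated RMSEs of two hypothetical models over 1000 series (made up)
errors_a = rng.lognormal(mean=0.0, sigma=0.5, size=1000)
errors_b = rng.lognormal(mean=0.1, sigma=0.3, size=1000)

# Risk-averse selection: compare the models where they perform badly
q = 0.84
best = "A" if np.quantile(errors_a, q) < np.quantile(errors_b, q) else "B"
print(best)</pre>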
<p>So what?</p>
<p>In the paper, we show that using a risk-averse strategy tends to improve overall forecasting accuracy in day-to-day situations. Conversely, a risk-tolerant strategy can be beneficial when disruptions are anticipated, as standard models are likely to fail anyway.</p>
<p>So, next time you select a model, think about the measure you are using. If it’s just the mean RMSE, keep in mind that you might be ignoring the inherent risks of that selection.</p>
<p>P.S. While the discussion above applies to the distribution of error measures, our paper specifically focused on point AIC (in-sample performance). But it is a distance measure as well, so the logic explained above holds.</p>
<p>P.P.S. Nikos wrote a <a href="https://www.linkedin.com/posts/nikos-kourentzes-3660515_forecasting-datascience-analytics-activity-7414687127269007360-pLAh">post about this paper here</a>.</p>
<p>P.P.P.S. And here is <a href="https://github.com/trnnick/working_papers/blob/fd1973624e97fc755a9c2401f05c78b056780e34/Kourentzes_2026_Incorporating%20risk%20preferences%20in%20forecast%20selectionk.pdf">the link to the paper</a>.</p>
<p>Message <a href="https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/">Risky business: how to select your model based on risk preferences</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>AID paper rejected from the IJPR</title>
		<link>https://openforecast.org/2025/11/14/aid-paper-rejected-from-the-ijpr/</link>
					<comments>https://openforecast.org/2025/11/14/aid-paper-rejected-from-the-ijpr/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Fri, 14 Nov 2025 11:12:16 +0000</pubDate>
				<category><![CDATA[Papers]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[intermittent demand]]></category>
		<category><![CDATA[papers]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3941</guid>

					<description><![CDATA[<p>So, our paper with Anna Sroginis got rejected from a special issue of the International Journal of Production Research after a second round of revision. And here is what I think about this! First things first, why am I writing this post? I want to share failures with the community, because I am tired of [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/11/14/aid-paper-rejected-from-the-ijpr/">AID paper rejected from the IJPR</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>So, <a href="/2025/04/11/svetunkov-sroginis-2025-model-based-demand-classification/">our paper with Anna Sroginis</a> got rejected from a special issue of the International Journal of Production Research after a second round of revision. And here is what I think about this!</p>
<p>First things first, <strong>why am I writing this post</strong>? I want to share failures with the community, because I am tired of all the success stories. It is okay not to win, and this happens much more often than it seems.</p>
<p>Now, about the paper. In the first round of revisions, four reviewers looked at it and provided their comments. We expanded the paper accordingly, making it now 46 pages long (ouch!). We introduced inventory simulations and showed how using some basic principles improves forecasting accuracy and can lead to a reduction in inventory costs.</p>
<p>In the second round, the AE added one more reviewer. After careful consideration, two of the reviewers recommended major revisions, while the other two suggested a strong rejection, claiming that the paper does not make new and significant contributions to the production research literature.</p>
<p>Obviously, I disagree with this evaluation. Based on the reviewers’ comments, I have a feeling they didn’t read the paper in full (their main concerns relate to Section 3, and some of these could have been resolved if they had reached Section 5). But this probably also means that the paper in its current state is too big and needs to be rewritten to become more focused. Maybe this is what confused the reviewers.</p>
<p>So, what&#8217;s next?</p>
<p>We will amend it to address the reviewers’ comments, shorten it a bit to make it more focused, and then submit it to another OR-related journal.</p>
<p>And while we are doing that, I have <a href="https://arxiv.org/abs/2504.05894v2">updated the arXiv version of the paper</a> to show what we did after the first round, and here is a brief summary of the main findings:</p>
<ul>
<li>Using a stockout dummy variable and capturing the level of data correctly (removing the effect of stockouts) improves the accuracy of forecasting approaches;</li>
<li>Stockout detection should be done for both the training and the test sets. If the series with stockouts are not removed from the test set, the forecasts will be evaluated incorrectly;</li>
<li>Splitting the demand into demand sizes and demand occurrence, producing forecasts for each of the parts and then combining the results substantially improves the accuracy;</li>
<li>Using the regular/intermittent demand feature improves the forecasting accuracy, but does not seem to impact the inventory performance. Note that this separation is straightforward in AID: if, after removing the stockouts, there are some zeroes left, the demand is identified as intermittent;</li>
<li>The further split into smooth/lumpy leads to slight improvements in terms of accuracy, without a substantial impact on the inventory;</li>
<li>The split into count/fractional demand does not bring value in terms of forecasting accuracy or inventory performance.</li>
</ul>
<p>Message <a href="https://openforecast.org/2025/11/14/aid-paper-rejected-from-the-ijpr/">AID paper rejected from the IJPR</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/11/14/aid-paper-rejected-from-the-ijpr/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Several crucial steps in demand forecasting with ML</title>
		<link>https://openforecast.org/2025/10/08/several-crucial-steps-in-demand-forecasting-with-ml/</link>
					<comments>https://openforecast.org/2025/10/08/several-crucial-steps-in-demand-forecasting-with-ml/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Wed, 08 Oct 2025 10:32:41 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3935</guid>

					<description><![CDATA[<p>When I read posts written by some ML experts, I sometimes notice that they either overlook or do not clearly explain a few crucial steps in demand forecasting. In this post, I want to highlight the three most important ones based on my personal experience. First and foremost, stationarity (see the proper definition of the [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/10/08/several-crucial-steps-in-demand-forecasting-with-ml/">Several crucial steps in demand forecasting with ML</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>When I read posts written by some ML experts, I sometimes notice that they either overlook or do not clearly explain a few crucial steps in demand forecasting. In this post, I want to highlight the three most important ones based on my personal experience.</p>
<p>First and foremost, <strong>stationarity</strong> (see the proper definition of the term <a href="/adam/ARIMAIntro.html">here</a>). Time series often exhibit either a changing level or clear trends. While some models (such as ETS) take care of that directly via specific components, others need some preliminary steps before being applied. The simplest thing one can do is to take differences of the original data if it shows any form of non-stationarity and model the rates of change instead of just demand. For ML, this is extremely important because typical approaches (such as decision trees, k-NN, neural networks) cannot extrapolate. So, getting rid of the trend and/or ensuring that the level does not change over time will help ML approaches do the job they are supposed to do. And don&#8217;t use the global trend (see image in the post), as time series rarely exhibit a constant increase or decrease. In real life, the trend is usually stochastic, implying that the average sales change at a varying rate, not a fixed one.</p>
<p>Second, real time series often exhibit <strong>heteroscedasticity</strong> (i.e. the variance of the data increases with the level). The simplest way to stabilise the variance is to take logarithms of the data. This is not a universal solution, but it works in many cases. This way, the error in the model should have a constant variance. A more advanced approach is to use the Box-Cox transformation, but this requires estimating the parameter lambda, which is not always straightforward. The main issue with this arises when working with intermittent demand, where some unpredictable zeroes occur (the logarithm of zero equals -infinity). In that case, you might want to take logarithms of demand sizes (non-zero values) instead of the demand itself and switch to a mixture model. Another simple but inelegant trick is to add one to every observation and then take logarithms. This works but breaks my heart.</p>
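<p>These two preprocessing steps take just a couple of lines (a minimal sketch; log1p is the &#8220;add one and take logarithms&#8221; trick mentioned above):</p>
<pre class="decode">import numpy as np

y = np.array([112., 118., 132., 129., 121., 135., 148., 148., 136., 119.])

# Stabilise the variance first, then remove the changing level
y_log = np.log1p(y)      # log(1 + y), safe even when there are zeroes
y_diff = np.diff(y_log)  # model the rates of change rather than raw demand

# An ML model is then trained on y_diff; its forecasts are transformed back
# with np.cumsum() and np.expm1() to return to the original units</pre>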
<p>Third, <strong>seasonality</strong> is extremely important in time series. There are many ways one can capture it. The simplest is by introducing dummy variables, but this might cause issues because, in reality, seasonality often changes over time (see <a href="/2025/09/22/evolving-seasonality/">this recent post</a>). So, a better way of capturing it is either by extracting the seasonal component from STL or ETS and using it as a feature or by using the lagged values of your data. Depending on the specific situation and approach, there can be many other ways of capturing seasonality, and frankly, I struggle to come up with a universal one.</p>
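<p>To make this more concrete, here is one simple way of building seasonal features from lagged values (a minimal sketch; the function is mine, and the STL/ETS seasonal component could be added as another column in the same way):</p>
<pre class="decode">import numpy as np

def seasonal_lag_features(y, m=12, n_lags=2):
    # For each point, use the values from the same period of previous seasons
    # (lag m, lag 2m, ...) as features; the first n_lags*m points are dropped
    y = np.asarray(y)
    rows = [[y[t - k * m] for k in range(1, n_lags + 1)]
            for t in range(n_lags * m, len(y))]
    return np.array(rows)</pre>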
<p>Bonus: if you work with intermittent demand, splitting it into demand sizes and demand occurrence might boost the accuracy of your approach if the underlying levels are captured correctly. Anna Sroginis and I show this improvement for LightGBM, regression, and simple local level approaches <a href="/2025/04/11/svetunkov-sroginis-2025-model-based-demand-classification/">in our paper</a> (currently under revision).</p>
<p>P.S. Kandrika and I will deliver a course on &#8220;Demand Forecasting Principles with examples in R&#8221; in November, where we will discuss these and other important aspects of demand forecasting. You can still sign up for it <a href="https://www.lancaster.ac.uk/centre-for-marketing-analytics-and-forecasting/grow-with-us/demand-forecasting-with-r/">here</a>.</p>
<p>Message <a href="https://openforecast.org/2025/10/08/several-crucial-steps-in-demand-forecasting-with-ml/">Several crucial steps in demand forecasting with ML</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/10/08/several-crucial-steps-in-demand-forecasting-with-ml/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Evolving seasonality</title>
		<link>https://openforecast.org/2025/09/22/evolving-seasonality/</link>
					<comments>https://openforecast.org/2025/09/22/evolving-seasonality/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 22 Sep 2025 19:21:09 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[Seasonality]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3928</guid>

					<description><![CDATA[<p>Here is another fascinating aspect of the seasonal profile in your data: it can evolve over time due to changing consumer preferences. How so? Let me explain. I&#8217;ve worked with a couple of companies where there were some examples of data with drastically changing seasonal patterns over just a few years. For example, Before Covid [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/09/22/evolving-seasonality/">Evolving seasonality</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Here is another fascinating aspect of the seasonal profile in your data: it can evolve over time due to changing consumer preferences. How so? Let me explain.</p>
<p>I&#8217;ve worked with a couple of companies where there were some examples of data with drastically changing seasonal patterns over just a few years. For example, the consumer behaviour Before Covid (BC) could be different from that After the Disruption (AD): people started ordering online some products that they used to buy in shops. Some practitioners told me that the seasonal patterns in their data had changed so dramatically that the historical data BC had become practically useless.</p>
<p>Why is this important?</p>
<p>If you don&#8217;t recognise that the seasonal patterns can change and simply use seasonal dummy variables (for example, in regression or decision trees), you&#8217;ll run into problems, as these approaches won&#8217;t capture the evolving profile. The same applies to classical decomposition (see <a href="/adam/ClassicalDecomposition.html">Section 3.2 of ADAM</a>), since it assumes a fixed seasonal structure. In fact, any model that assumes fixed seasonality would fail in this situation, and you may not even notice (see image in the post, where the model fails to capture the profile correctly because it assumes that it is fixed).</p>
<p>What can we do with that?</p>
<p>The solution is to use approaches that allow seasonality to evolve over time. ARIMA and ETS handle this via their parameters, while STL decomposition produces a dynamic seasonal profile. In regression or decision trees, you could incorporate lagged sales to partially account for this, or bring in the seasonal component from STL/ETS as an additional feature.</p>
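<p>For example, a dynamic seasonal component can be extracted with STL from statsmodels and then added to the feature matrix of an ML model (a minimal sketch on a synthetic monthly series):</p>
<pre class="decode">import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Synthetic monthly series whose seasonal amplitude grows over time
index = pd.date_range("2015-01-01", periods=96, freq="MS")
t = np.arange(96)
y = pd.Series(100 + 0.5*t + (1 + t/96) * 10 * np.sin(2*np.pi*t/12), index=index)

# The STL seasonal component evolves over time, unlike fixed dummy variables,
# and can be added as a feature to a regression or a decision tree
seasonal_feature = STL(y, period=12).fit().seasonal</pre>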
<p>All good? Not quite, because there is a nasty small grey elephant hidden in this room.</p>
<p>Even if your chosen method allows seasonality to evolve, you must ensure that forecasting uncertainty reflects this properly. If the seasonal pattern in your training set changed drastically, what prevents it from shifting again in the test set? This is particularly critical if you need predictive distributions, such as for setting safety stock levels or generating prediction intervals. Ignoring the fact that the seasonality might change further could make your predictive distribution narrower than it should be, leading to potential lost sales.</p>
<p>The good news: ARIMA and ETS handle this naturally, as the components&#8217; uncertainty translates directly to the holdout variance. In ML, it is more complicated, because you would need to invest time in proper feature engineering to explicitly capture the potential seasonality changes. Unfortunately, I haven&#8217;t done much in the latter direction, so I cannot give you a good recipe. Any thoughts on what to do here?</p>
<p>And what do you do in the situation like that?</p>
<p>Message <a href="https://openforecast.org/2025/09/22/evolving-seasonality/">Evolving seasonality</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/09/22/evolving-seasonality/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
