<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Archives theory - Open Forecasting</title>
	<atom:link href="https://openforecast.org/tag/theory/feed/" rel="self" type="application/rss+xml" />
	<link>https://openforecast.org/tag/theory/</link>
	<description>How to look into the future</description>
	<lastBuildDate>Sun, 22 Mar 2026 15:30:09 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2015/08/cropped-usd-05-32x32.png&amp;nocache=1</url>
	<title>Archives theory - Open Forecasting</title>
	<link>https://openforecast.org/tag/theory/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>The real Dunning-Kruger effect</title>
		<link>https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/</link>
					<comments>https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 23 Mar 2026 09:03:35 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=4096</guid>

					<description><![CDATA[<p>Many of you have seen this image on the Internet — I&#8217;ve seen it myself a few times on LinkedIn lately. People say it depicts the &#8220;Dunning-Kruger&#8221; effect&#8230; But did you know this is actually an internet meme with little to do with the original paper? Here is one of the recent examples, a screenshot [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/">The real Dunning-Kruger effect</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Many of you have seen this image on the Internet — I&#8217;ve seen it myself a few times on LinkedIn lately. People say it depicts the &#8220;Dunning-Kruger&#8221; effect&#8230; But did you know this is actually an internet meme with little to do with the original paper?</p>
<p>Here is one of the recent examples, a screenshot of <a href="https://www.linkedin.com/posts/fotios-petropoulos-04536023_dear-mr-i-reduce-forecast-error-by-30-share-7437246645530140672-NXnT">the post of Fotios Petropoulos</a> about the effect.</p>
<div id="attachment_4098" style="width: 282px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-Petropoulos.png&amp;nocache=1"><img fetchpriority="high" decoding="async" aria-describedby="caption-attachment-4098" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-Petropoulos-272x300.png&amp;nocache=1" alt="A LinkedIn post by Fotios Petropoulos" width="272" height="300" class="size-medium wp-image-4098" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-Petropoulos-272x300.png&amp;nocache=1 272w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-Petropoulos.png&amp;nocache=1 556w" sizes="(max-width: 272px) 100vw, 272px" /></a><p id="caption-attachment-4098" class="wp-caption-text">A LinkedIn post by Fotios Petropoulos</p></div>
<p>In the original paper, <a href="https://psycnet.apa.org/doi/10.1037/0022-3514.77.6.1121">Kruger and Dunning (1999)</a> ran experiments with undergraduates on humour, logical reasoning, and grammar. Participants completed a test and estimated their percentile rank. The authors then sorted participants into four quartiles by actual performance and computed averages for actual and self-assessed performance for each quartile. The plots in their paper &#8211; the real Dunning–Kruger effect &#8211; are just four data points per line, not a smooth curve over a learning journey (second image).</p>
<p>What did they find? People in the bottom quartile substantially overestimated their performance, often believing they were average or above. Top performers slightly underestimated their standing. The key finding is an asymmetry in miscalibration: low performers overestimate, high performers slightly underestimate.</p>
<p>This has almost nothing to do with the popular &#8220;experience vs. confidence&#8221; image. The original X‑axis is performance quartile at a single point in time; the meme&#8217;s X‑axis is a vague notion of &#8220;experience&#8221; through time. The original Y‑axis is the assessed test percentile; the meme&#8217;s is a free‑floating &#8220;confidence&#8221; construct. In the actual data, perceived performance increases with actual performance &#8211; there is no early spike, no &#8220;valley of despair,&#8221; no &#8220;slope of enlightenment.&#8221; That swooping curve is an internet-era graphic never reported by Kruger and Dunning, and it misleadingly frames the effect as a personal development trajectory the paper never studied.</p>
<p>There is also a serious critique of the original paper from a statistical point of view. For example, <a href="https://doi.org/10.1016/j.intell.2020.101449">Gignac and Zajenkowski (2020)</a> showed that sorting people into quartiles and plotting average self-assessment against average performance can, by itself, generate the characteristic pattern &#8211; purely as a statistical artefact. In their own empirical data, miscalibration was roughly constant across ability levels, consistent with measurement noise rather than a special cognitive deficit in low performers. You can actually reproduce the pattern using two random, uncorrelated variables. Here is a simple example in R:</p>
<pre class="decode">set.seed(41)

# Two independent variables playing the roles of
# "actual" and "self-assessed" performance
x <- rnorm(10000, 100, 10)
y <- rnorm(10000, 100, 10)
# Scatterplot: no relation between the two whatsoever
plot(x,y)

# Quartile boundaries of the actual performance
xQ <- quantile(x)

yMeans <- xMeans <- vector("numeric",4)

# Average both variables within each quartile of x
for(i in 1:4){
    xMeans[i] <- mean(x[x<xQ[i+1] &#038; x>=xQ[i]])
    yMeans[i] <- mean(y[x<xQ[i+1] &#038; x>=xQ[i]])
}

plot(1:4, xMeans, type="b", ylim=range(xMeans,yMeans),
     xlab="Real performance", ylab="Assessed performance",
     lwd=2)
lines(yMeans, lwd=2, lty=2)
points(yMeans, lwd=2)
legend("topleft",
       legend=c("Actual performance", "Assessed performance"),
       lwd=2, lty=c(1,2), pch=1)</pre>
<p>This produces an image like the following:</p>
<div id="attachment_4100" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R.png&amp;nocache=1"><img decoding="async" aria-describedby="caption-attachment-4100" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R-300x175.png&amp;nocache=1" alt="Dunning-Kruger plot reproduction" width="300" height="175" class="size-medium wp-image-4100" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R.png&amp;nocache=1 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-4100" class="wp-caption-text">Dunning-Kruger plot reproduction</p></div>
<p>If you introduce a correlation between the two variables, the image starts looking even more similar to the ones in the original paper.</p>
<p>So there might be a real effect &#8211; many follow-up studies have measured it with more rigorous tools &#8211; but Kruger and Dunning&#8217;s method was not the right one to establish it. And the experience-vs-confidence image is just a meme and a serious misconception that should not be used.</p>
<p>P.S. If you wonder who the &#8220;leading expert&#8221; that Fotios Petropoulos refers to in his post is &#8211; it&#8217;s me. Not sure why he doesn&#8217;t tag me properly.</p>
<p>Message <a href="https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/">The real Dunning-Kruger effect</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>There&#8217;s no such thing as &#8220;deterministic forecast&#8221;</title>
		<link>https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/</link>
					<comments>https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 02 Mar 2026 22:45:31 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=4081</guid>

					<description><![CDATA[<p>Sometimes I see people referring to a &#8220;deterministic&#8221; forecast, and I have some personal issues with this. Because if you apply a model to data then there is nothing deterministic about your forecasts! In many contexts, &#8220;deterministic&#8221; has a precise meaning: no randomness, no uncertainty. A deterministic solution to an optimisation problem (e.g. linear programming) [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/">There&#8217;s no such thing as &#8220;deterministic forecast&#8221;</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Sometimes I see people referring to a &#8220;deterministic&#8221; forecast, and I have some personal issues with this. Because if you apply a model to data then there is nothing deterministic about your forecasts!</p>
<p>In many contexts, &#8220;deterministic&#8221; has a precise meaning: no randomness, no uncertainty. A deterministic solution to an optimisation problem (e.g. linear programming) implies that there are no random inputs or outputs once the model and its parameters are fixed. Forecasting is different. As <a href="https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-131X(199612)15:7%3C495::AID-FOR640%3E3.0.CO;2-O">Chatfield</a> and many others have pointed out, forecasting has multiple sources of uncertainty, and there is essentially zero chance that the future will unfold exactly as any single number suggests.</p>
<p>Yes, some people use &#8220;deterministic&#8221; as a synonym for &#8220;point forecast&#8221;. But that label is still misleading, because a point forecast is not uncertainty-free &#8211; it is just one summary of a predictive distribution (often the conditional mean, sometimes the median or another functional).</p>
<p>Here’s a quick reality check you can do yourself. Take a dataset, apply your model, and write down the point forecast for the next few observations. Now add one new observation, re-estimate, and forecast again (the image in this post depicts exactly that, but with 50 forecasts produced on different subsamples of data). The point forecast will change unless you are dealing with an exotic situation with non-random data (e.g. every day, you sell exactly 100 units). So, which of the two was the &#8220;deterministic&#8221; forecast? If forecasts were truly deterministic in the strict sense, you would not get multiple plausible values from small, reasonable changes in the sample.</p>
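<p>For those who want to try this reality check themselves, here is a minimal sketch in R (base <code>stats</code> only; the series is simulated, so the numbers are purely illustrative):</p>
<pre class="decode">set.seed(42)
y <- cumsum(rnorm(100))                # a simulated random-walk series

# Forecast three steps ahead from origin t=99...
fit1 <- arima(y[1:99], order=c(0,1,1))
f1 <- predict(fit1, n.ahead=3)$pred    # forecasts for t=100..102

# ...then add one observation, re-estimate and forecast again
fit2 <- arima(y, order=c(0,1,1))
f2 <- predict(fit2, n.ahead=2)$pred    # forecasts for t=101..102

# The forecasts for the same future points no longer coincide
cbind(origin99=f1[2:3], origin100=f2)</pre>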
<p>This happens because any forecasting method (statistical or ML) depends on data and on modelling choices: parameter estimation, feature selection, splitting rules, tuning, even decisions like &#8220;use α=0.1&#8221;. Those choices can be fixed across samples of data, but fixing them does not remove uncertainty &#8211; it only hides it. The randomness is still there in the data and in the fact that we only observe a sample of it.</p>
<p>So when you see someone mentioning a &#8220;deterministic forecast&#8221;, it&#8217;s worth translating it mentally to &#8220;a point forecast, probably a conditional mean&#8221;. If you care about decisions and risk, you should know that there is uncertainty associated with this so-called &#8220;deterministic forecast&#8221;, and that it should not be ignored. But this is a topic for another discussion in another post.</p>
<p>Message <a href="https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/">There&#8217;s no such thing as &#8220;deterministic forecast&#8221;</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Scaling of error measures</title>
		<link>https://openforecast.org/2026/02/23/scaling-of-error-measures/</link>
					<comments>https://openforecast.org/2026/02/23/scaling-of-error-measures/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 23 Feb 2026 13:36:12 +0000</pubDate>
				<category><![CDATA[Forecast evaluation]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[error measures]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=4054</guid>

					<description><![CDATA[<p>Apparently, we need to talk about scaling of error measures because this is not as obvious as it seems. In forecasting literature, since early days of the area, there has been a general consensus that the forecast errors from the individual time series should not be analysed and aggregated as is. This is because you [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/02/23/scaling-of-error-measures/">Scaling of error measures</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Apparently, we need to talk about scaling of error measures because this is not as obvious as it seems.</p>
<p>In the forecasting literature, since the early days of the field, there has been a general consensus that forecast errors from individual time series should not be analysed and aggregated as is. This is because you can have very different time series, capturing the dynamics of very different processes.</p>
<p>Indeed, if you forecast sales of apples in kilograms, your actual value would be apples in kilograms, and your point forecast would also be in the same units. Subtracting one from another tells us how many kilograms of apples we missed with the forecast we produced. But if we then take the average between forecast errors for apples and beer, we would be aggregating things in different units, which contradicts some basic aggregating principles.</p>
<p>Furthermore, if the company sells thousands of kilograms of apples and a handful of jet engines, aggregating forecast errors across them (e.g. 3000 vs 3) might introduce all sorts of issues, because the model&#8217;s performance on apples might mask its performance on jet engines. Yet jet engines are much more expensive than apples, and forecasting them accurately might matter more for the company than forecasting apples.</p>
<p>So, the forecasting literature has agreed that forecast errors need to be scaled somehow, to make them unitless and not to distort the performance of models on time series with different volumes. There are several ways of doing that, some poor and some reasonable. The current state of the art is to divide error measures by some in-sample statistic to avoid potential holdout-sample distortion. Using mean absolute differences (MAD) for this (thus ending up with MASE or RMSSE) is considered the standard. A couple of years ago, <a href="/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/">I wrote a post about the advantages and disadvantages of several scaling methods</a>.</p>
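<p>To make this concrete, here is a minimal sketch in R of how RMSSE can be computed for a single series, scaling the squared forecast errors by the in-sample mean squared first differences (the numbers are made up):</p>
<pre class="decode"># RMSSE for one series: forecast errors scaled by the in-sample
# mean squared error of the naive (random walk) method
rmsse <- function(actual, forecasts, insample){
    scale <- mean(diff(insample)^2)
    sqrt(mean((actual - forecasts)^2) / scale)
}

rmsse(actual=c(14, 15), forecasts=c(13, 16), insample=c(10, 12, 11, 13))
# approx 0.577</pre>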
<p>But there is one method that I haven&#8217;t looked at and which is not very well discussed in the forecasting literature. It relies on the monetary value of forecasts. We could multiply each individual forecast error &#8220;e&#8221; by the price of the product &#8220;p&#8221; (thus moving to the missed income per product) and then divide everything by the overall income (price times quantity) from different products. This can be written as:</p>
<p>\begin{equation}<br />
\text{monetary Mean Error} = \frac{\sum_{j=1}^n (p_j \times e_j)} {\sum_{j=1}^n (p_j \times q_j)}<br />
\end{equation}</p>
<p>(the above formula can be modified to use squares or absolute values of the error). This way we switch from the original units to monetary values, and each error tells you the proportion of income missed relative to the overall income. This is a useful measure because it connects model performance with managerial decisions and takes the value of the product into account (thus we do not mask the expensive jet engines with the cheap apples).</p>
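<p>The formula above is straightforward to compute. Here is a minimal sketch in R with made-up errors, prices and quantities for three products:</p>
<pre class="decode"># Made-up numbers for three products (e.g. apples, beer, jet engines)
e <- c(50, -20, 1)       # forecast errors in original units
p <- c(1.2, 3.5, 2e6)    # unit prices
q <- c(3000, 500, 3)     # quantities sold

# Proportion of the missed income in the overall income
monetaryME <- sum(p * e) / sum(p * q)
monetaryME</pre>
<p>Note how the expensive product dominates the measure even though its error is tiny in original units &#8211; which is exactly the point of this scaling.</p>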
<p>However, it might have a potential issue similar to the one MAE/Mean or wMAPE has: if the sales of a product are not stationary, the denominator will change, driving the proportion either up or down irrespective of how good the forecast is. I am not sure whether this needs to be addressed, because there is an argument that if the income from a product has increased and the error hasn&#8217;t changed, then the proportion of the missed income has decreased, which makes sense. But if we do need to address it, we can switch to the MAD multiplied by price in the denominator. In fact, something similar was done in the <a href="https://doi.org/10.1016/j.ijforecast.2021.11.013">M5 competition</a>, which used a weighted RMSSE relying on the income from each product over the last 4 weeks of data.</p>
<p>But here is one more interesting thing about this error measure. If we <strong>assume that the prices of all products are exactly the same</strong>, they disappear from the numerator and the denominator, leaving us with just the sum of errors divided by the overall sales of all products. This still maintains the original idea of the proportion of missed income, but now relies on a very strong assumption, which is probably not correct in real life (apples and jet engines for the same price?). Furthermore, this would again mask the performance of the model on the expensive products. I personally don&#8217;t like this measure and find the assumption unrealistic and potentially misleading. Having said that, I can see cases where it could still be acceptable and useful (e.g. similar products with similar dynamics and similar prices).</p>
<p>Summarising:</p>
<ol>
<li>If you are conducting a forecasting experiment without a specific context, I&#8217;d recommend using RMSSE or some other similar measure with scaling.</li>
<li>If you have prices of products, income-based scaling might be more informative.</li>
<li>Setting all prices to the same value does not sound appealing to me, but I understand that there is a context where this might work.</li>
</ol>
<p>Message <a href="https://openforecast.org/2026/02/23/scaling-of-error-measures/">Scaling of error measures</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/02/23/scaling-of-error-measures/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Risky business: how to select your model based on risk preferences</title>
		<link>https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/</link>
					<comments>https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 19 Jan 2026 11:28:04 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[error measures]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[Information criteria]]></category>
		<category><![CDATA[model combination]]></category>
		<category><![CDATA[model selection]]></category>
		<category><![CDATA[papers]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3950</guid>

					<description><![CDATA[<p>What do you use for model selection? Do you select the best model based on its cross-validated performance, or do you use in-sample measures like AIC? If so, there is a way to improve your selection process further. JORS recently published the paper of Nikos Kourentzes and I based on a simple but powerful idea: [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/">Risky business: how to select your model based on risk preferences</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>What do you use for model selection? Do you select the best model based on its cross-validated performance, or do you use in-sample measures like AIC? If so, there is a way to improve your selection process further.</p>
<p>JORS recently published a paper by Nikos Kourentzes and me based on a simple but powerful idea: instead of using summary statistics (like the mean RMSE of cross-validated errors), you should consider the entire distribution and choose a specific quantile. This aligns with <a href="https://openforecast.org/2024/03/27/what-does-lower-error-measure-really-mean/">my previous post on error measures</a>, but here is the core intuition:</p>
<p>The distribution of error measures is almost always asymmetric. If you only look at the average, you end up with a &#8220;mean temperature in the hospital&#8221; statistic, which doesn&#8217;t reflect how models actually behave. Some models perform great on most series but fail miserably on a few.</p>
<p>What can we do in this case? We can look at the quantiles of the distribution.</p>
<p>For example, if we use the 84th quantile, we compare models based on their &#8220;bad&#8221; performance: the situations where they fail and produce less accurate forecasts. If you choose the best performing model there, you will end up with something that does not fail as much. So your preferences for the model become risk-averse.</p>
<p>If you focus on a lower quantile (e.g. the 16th), you are looking at how models do on the well-behaved series and ignoring how they do on the difficult ones. So, your model selection preferences can be described as risk-tolerant, because you accept that the best performing model might fail on a difficult time series.</p>
<p>Furthermore, the median (the 50th quantile, the middle of the sample) corresponds to the risk-neutral situation, because it ignores the tails of the distribution.</p>
<p>What about the mean? This is a risk-agnostic strategy, because it says nothing about the performance on the difficult or easy time series &#8211; it takes everything and nothing in it at the same time, hiding the true risk profile.</p>
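<p>These strategies are easy to compare on simulated data. Here is a small sketch in R with made-up RMSE distributions for two hypothetical models:</p>
<pre class="decode">set.seed(7)
# Made-up cross-validated RMSEs of two models over 100 series
rmseA <- rlnorm(100, meanlog=0,   sdlog=0.6)  # good on most series, heavy tail
rmseB <- rlnorm(100, meanlog=0.1, sdlog=0.2)  # slightly worse typically, stable

# Risk-averse (84th quantile), risk-neutral (median), risk-agnostic (mean)
rbind(A=c(quantile(rmseA, 0.84), median=median(rmseA), mean=mean(rmseA)),
      B=c(quantile(rmseB, 0.84), median=median(rmseB), mean=mean(rmseB)))</pre>
<p>The ranking of the two models can differ depending on which column you compare &#8211; which is why the choice of quantile should be a conscious one.</p>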
<p>So what?</p>
<p>In the paper, we show that using a risk-averse strategy tends to improve overall forecasting accuracy in day-to-day situations. Conversely, a risk-tolerant strategy can be beneficial when disruptions are anticipated, as standard models are likely to fail anyway.</p>
<p>So, next time you select a model, think about the measure you are using. If it’s just the mean RMSE, keep in mind that you might be ignoring the inherent risks of that selection.</p>
<p>P.S. While the discussion above applies to the distribution of error measures, our paper specifically focused on point AIC (in-sample performance). But it is a distance measure as well, so the logic explained above holds.</p>
<p>P.P.S. Nikos wrote a <a href="https://www.linkedin.com/posts/nikos-kourentzes-3660515_forecasting-datascience-analytics-activity-7414687127269007360-pLAh">post about this paper here</a>.</p>
<p>P.P.P.S. And here is <a href="https://github.com/trnnick/working_papers/blob/fd1973624e97fc755a9c2401f05c78b056780e34/Kourentzes_2026_Incorporating%20risk%20preferences%20in%20forecast%20selectionk.pdf">the link to the paper</a>.</p>
<p>Message <a href="https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/">Risky business: how to select your model based on risk preferences</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/01/19/risky-business-how-to-select-your-model-based-on-risk-preferences/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>SBC is not for you!</title>
		<link>https://openforecast.org/2025/06/04/sbc-is-not-for-you/</link>
					<comments>https://openforecast.org/2025/06/04/sbc-is-not-for-you/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Wed, 04 Jun 2025 11:41:00 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[intermittent demand]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3853</guid>

					<description><![CDATA[<p>I&#8217;ve been acting as a reviewer lately, providing comments on papers about intermittent demand, and I’ve felt a bit frustrated by what some authors write. Let me explain. Several papers I reviewed claim that demand can be either intermittent or lumpy. They then mention the Syntetos-Boylan-Croston (SBC) classification and use the thresholds from Syntetos et [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/06/04/sbc-is-not-for-you/">SBC is not for you!</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>I&#8217;ve been acting as a reviewer lately, providing comments on papers about intermittent demand, and I’ve felt a bit frustrated by what some authors write. Let me explain.</p>
<p>Several papers I reviewed claim that demand can be either intermittent or lumpy. They then mention the Syntetos-Boylan-Croston (SBC) classification and use the thresholds from Syntetos et al. (2005) to do some things with ML methods. Sounds reasonable?</p>
<p>No! And here’s why.</p>
<p>Actually, I’ve already explained this in <a href="/2024/07/16/intermittent-demand-classifications-is-that-what-you-need/">a previous post</a>, but let me summarise the main points again.</p>
<p>First, intermittent demand is the demand that happens at irregular frequency. That’s the definition John Boylan and I came up with in our paper (<a href="/2023/09/08/iets-state-space-model-for-intermittent-demand-forecasting/">this one</a>). But even before that, the literature generally agreed: if you observe naturally occurring zeroes (e.g., no one wants to buy a product), then the demand is intermittent &#8211; even if there’s only one zero in the data.</p>
<p>Now, <a href="https://doi.org/10.1057/palgrave.jors.2601841">Syntetos et al. (2005)</a> specifically studied <strong>intermittent demand</strong> and proposed a classification to help choose between Croston’s method and SBA. Their classification includes four types (see image in the post):</p>
<ol>
<li>Erratic but not very intermittent</li>
<li>Smooth</li>
<li>Lumpy</li>
<li>Intermittent but not very erratic</li>
</ol>
<p>The thresholds they used (ADI=1.32 and CV²=0.49) were <strong>only</strong> intended to guide the choice between Croston and SBA. And &#8220;lumpy&#8221;, as you can see, is just a special case of intermittent demand!</p>
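<p>For reference, the two statistics behind these thresholds are simple to compute. Here is a minimal sketch in R on a made-up series (ADI is approximated here as the series length over the number of non-zero observations):</p>
<pre class="decode"># A made-up intermittent demand series
demand <- c(0, 3, 0, 0, 5, 0, 2, 0, 0, 0, 7, 0)
sizes <- demand[demand > 0]

ADI <- length(demand) / length(sizes)  # average inter-demand interval
CV2 <- (sd(sizes) / mean(sizes))^2     # squared CV of non-zero demand sizes

c(ADI=ADI, CV2=CV2)  # ADI=3, CV2 approx 0.27</pre>
<p>With ADI above 1.32 and CV² below 0.49, this series would fall into the &#8220;intermittent but not very erratic&#8221; quadrant.</p>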
<p>Yes, you can classify intermittent demand into &#8220;lumpy&#8221; and &#8220;smooth&#8221;, but this separation is not well-defined. Use a different classification (e.g., <a href="https://openforecast.org/2025/04/11/svetunkov-sroginis-2025-model-based-demand-classification/">this paper</a>) and you&#8217;ll get different results. In fact, practically speaking, your ML approach likely doesn’t need this classification at all.</p>
<p>So, here are two things you should <strong>NOT DO</strong>:</p>
<ol>
<li>Say that demand can be either &#8220;intermittent&#8221; or &#8220;lumpy&#8221; &#8211; the latter is a subset of the former.</li>
<li>Use ADI=1.32 and/or CV²=0.49 to categorise demand, unless you&#8217;re selecting between Croston and SBA. And let’s be honest, you’re probably not doing that. So forget about it!</li>
</ol>
<p>And honestly, stop overusing SBC! Lately, I&#8217;ve seen more harm than good from it. If you really want to use it, make sure you&#8217;ve carefully read and understood the original paper.</p>
<p>But if you don&#8217;t know what you are doing, SBC is not for you!</p>
<p>Message <a href="https://openforecast.org/2025/06/04/sbc-is-not-for-you/">SBC is not for you!</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/06/04/sbc-is-not-for-you/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>On randomness and uncertainty</title>
		<link>https://openforecast.org/2025/04/28/on-randomness-and-uncertainty/</link>
					<comments>https://openforecast.org/2025/04/28/on-randomness-and-uncertainty/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 28 Apr 2025 11:05:29 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[theory]]></category>
		<category><![CDATA[uncertainty]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3828</guid>

					<description><![CDATA[<p>Everything is random! Your data, your model, its parameter estimates, the forecasts it produces, and even the minimum of the loss function you used. There is no such thing as a &#8220;deterministic&#8221; forecast &#8211; everything is stochastic! Whenever you work with data, you are working with a sample from a population. In some cases, this [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/04/28/on-randomness-and-uncertainty/">On randomness and uncertainty</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Everything is random! Your data, your model, its parameter estimates, the forecasts it produces, and even the minimum of the loss function you used. There is no such thing as a &#8220;deterministic&#8221; forecast &#8211; everything is stochastic!</p>
<p>Whenever you work with data, you are working with a sample from a population. In some cases, this is more apparent than in others. In my statistics lectures, I typically give the following example. Consider that we are interested in the average height of students at the university. I could ask every student at the lecture to tell me their height, take the average, and get a number. Is this number random? Yes, indeed. Why? Because if a student who was late for the lecture comes in, I would need to recalculate the average, and the number would change. The average that I get depends on who specifically I have in the sample and how many observations I have. It will vary more in smaller samples and become more stable in larger ones. But this example gives you an idea about the inherent uncertainty of any estimates we deal with.</p>
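<p>The height example can be checked numerically. Here is a small Python sketch (with made-up numbers) showing that the sample mean is itself a random quantity whose variability shrinks as the sample grows:</p>

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(175, 10, size=100_000)   # hypothetical heights, in cm

# Redraw a sample of size n many times and look at how much the sample
# mean moves around: the spread shrinks roughly as 1/sqrt(n)
spread = {}
for n in (10, 100, 10_000):
    means = [rng.choice(population, size=n).mean() for _ in range(200)]
    spread[n] = np.std(means)
    print(n, round(spread[n], 2))
```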
<p>In time series, the situation is somewhat similar: you are dealing with a sample of values that you have observed up until a specific moment. If, for example, you want to forecast daily admissions in the emergency department of a hospital and apply a model, its forecast will change when a new day comes and a new cohort of patients arrives. This is because your sample changes, and you receive new information about the demand.</p>
<p>So, the parameter estimates of a model you use will change when you get a new observation (e.g., a new record of product sales). Yes, if you estimate the model properly (e.g., using Least Squares), the parameter estimates won’t change substantially, but they will change nonetheless. And this would affect point forecasts and any other statistics produced by your model. Your standard errors, p-values, conditional means, prediction intervals, error measures, model ranking &#8211; everything will change with a new observation. In fact, if you do model selection, the structure of the model might change as well. For example, in the case of ETS, you might switch from a model without a trend to one with a trend. So, every time you estimate anything on a sample of data, you should keep in mind that it is random and will change if your sample changes or gets updated.</p>
<p>Why is that important? Because we need to understand this inherent uncertainty, and ideally, we should somehow take it into account. In forecasting, this means you should not draw conclusions based on one application of a model to a dataset. At the very least, you should perform <a href="/adam/rollingOrigin.html">a rolling origin evaluation</a>. As Leonidas Tsaprounis says, &#8220;if you don&#8217;t roll the origin, you roll the dice&#8221;.</p>
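<p>A rolling origin evaluation itself fits in a few lines. Here is a Python sketch with a Naïve forecast (the function name is illustrative, not from any package):</p>

```python
import numpy as np

def rolling_origin_mae(y, h=1, min_train=20):
    """Forecast from every origin and average the absolute h-step errors."""
    errors = []
    for origin in range(min_train, len(y) - h + 1):
        train = y[:origin]
        forecast = train[-1]               # Naive forecast: last observation
        errors.append(abs(y[origin + h - 1] - forecast))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=100))        # a random walk, where Naive is adequate
print(round(rolling_origin_mae(y), 3))     # error averaged over 80 origins, not one split
```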
<p>So, embrace the uncertainty and learn how to deal with it.</p>
<p>By the way, Kandrika Pritularga and I are running a course on Demand Forecasting starting on 6th May. There is still time to <a href="https://online-payments.lancaster-university.co.uk/product-catalogue/courses/lancaster-university-management-school-lums/centre-for-marketing-analytics-forecasting-cmaf/demand-forecasting-principles-with-examples-in-r">sign up for it here</a>.</p>
<p>Message <a href="https://openforecast.org/2025/04/28/on-randomness-and-uncertainty/">On randomness and uncertainty</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/04/28/on-randomness-and-uncertainty/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Challenges related to seasonal data: shifting seasonality</title>
		<link>https://openforecast.org/2025/04/07/challenges-related-to-seasonal-data/</link>
					<comments>https://openforecast.org/2025/04/07/challenges-related-to-seasonal-data/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 07 Apr 2025 12:54:49 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[Seasonality]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3814</guid>

					<description><![CDATA[<p>There are many different issues with capturing seasonality in time series. In this short post, I&#8217;d like to discuss one of the most annoying ones. I&#8217;m talking about the seasonal pattern that shifts over time. What I mean is that, for example, instead of having the standard number of observations in the cycle (e.g., 24 [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/04/07/challenges-related-to-seasonal-data/">Challenges related to seasonal data: shifting seasonality</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>There are many different issues with capturing seasonality in time series. In this short post, I&#8217;d like to discuss one of the most annoying ones.</p>
<p>I&#8217;m talking about the seasonal pattern that shifts over time. What I mean is that, for example, instead of having the standard number of observations in the cycle (e.g., 24 hours in a day), in some cases you can have more or fewer of them. How is that possible?</p>
<p>One of these issues is the Daylight Saving Time (DST) change. The original idea of DST was to reduce energy consumption because daylight in summer is longer than in winter (there&#8217;s a nice and long article on Wikipedia about it). Because of this, many countries introduced a time shift: in spring, the clock is moved forward by one hour, while in autumn it goes back. This idea had a reasonable motivation at the beginning of the 20th century, but I personally think that as we&#8217;ve progressed as a society, it has lost its value. While this is already extremely annoying on its own, a bit unhealthy (several studies report an increased risk of heart attacks), and torture for parents with small kids (the little ones don&#8217;t understand that it&#8217;s not 7am yet), it also introduces a modelling challenge: two days in the year do not have 24 hours. In spring, we have 23 hours, while in autumn we have 25. Standard classical forecasting approaches (such as ETS/ARIMA, regression, STL or classical decomposition) break in this case, because by default they assume that a specific pattern repeats itself every 24 hours. The issue arises because business cycles are tuned to working hours, not to the movement of the sun &#8211; people come to work at 9am, no matter how many hours are in the day.</p>
<p>Another challenge is leap years. While DST is totally man-made, leap years occur because the Earth orbits the sun approximately every 365.25 days. To avoid drifting too far from reality, our calendars include one extra day every four years (29th February). This addresses the issue but also means that one year has 366 days instead of 365. Once again, conventional models relying on fixed periodicity fail.</p>
<p>There are several ways to handle this, all with their own advantages and disadvantages:</p>
<ol>
<li>Fix the data. In the case of DST, this means removing one of the duplicated hours during the autumn time change and adding one during the spring shift. For leap years, it means dropping the 29th of February. This is easy to do, but breaks the structure and might cause issues when we have DST/leap year in the holdout sample.</li>
<li>Introduce more complex components, such as Fourier-based ones, to capture the shift in the data. This works well for leap years but doesn&#8217;t address the DST issue. <a href="https://doi.org/10.1016/j.energy.2015.02.100">Harmonic regressions</a> and <a href="https://doi.org/10.1198/jasa.2011.tm09771">TBATS</a> do this, for example.</li>
<li><a href="https://openforecast.org/adam/MultipleFrequenciesDSTandLeap.html">Shift seasonal indices</a> when the issue happens &#8211; for example, having two indices for 1am when the switch to winter time occurs.</li>
</ol>
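<p>Strategy (1) is the easiest to show in code. A Python sketch of the leap year case, using only the standard library (the DST case is analogous: remove the duplicated autumn hour and insert the missing spring one):</p>

```python
from datetime import date, timedelta

# Daily data covering a leap year
start = date(2024, 1, 1)
days = [start + timedelta(days=i) for i in range(366)]
y = {d: float(i) for i, d in enumerate(days)}

# "Fix the data": drop 29th February so that every year contains exactly
# 365 observations and a fixed yearly periodicity can be assumed
y_fixed = {d: v for d, v in y.items() if not (d.month == 2 and d.day == 29)}
print(len(y), len(y_fixed))  # 366 365
```

The obvious cost, as noted above, is that the calendar structure is broken, which becomes a problem whenever the dropped or duplicated period falls in the holdout sample.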
<p>In R, I’ve developed the <code>temporaldummy()</code> function in the <code>greybox</code> package to introduce correct dummy variables for data with shifting seasonality, and I’ve incorporated method (3) into the <code>adam()</code> function from the <code>smooth</code> package. You can read more about these <a href="https://openforecast.org/adam/MultipleFrequenciesDSTandLeap.html">here</a>.</p>
<p>Are there any other strategies? Which one do you prefer?</p>
<p>BTW, Kandrika Pritularga and I are running a course on Demand Forecasting Principles with Examples in R. We’ll discuss some of these aspects there. Read more about it <a href="https://lancaster.ac.uk/centre-for-marketing-analytics-and-forecasting/grow-with-us/demand-forecasting-with-r/">here</a>.</p>
<p>Message <a href="https://openforecast.org/2025/04/07/challenges-related-to-seasonal-data/">Challenges related to seasonal data: shifting seasonality</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/04/07/challenges-related-to-seasonal-data/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Naming conventions for seasonality types</title>
		<link>https://openforecast.org/2025/03/26/naming-conventions-for-seasonality-types/</link>
					<comments>https://openforecast.org/2025/03/26/naming-conventions-for-seasonality-types/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Wed, 26 Mar 2025 11:44:00 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[Seasonality]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3809</guid>

					<description><![CDATA[<p>In forecasting, the term seasonality doesn’t always mean what you think it does. It encompasses more than just patterns repeating from one season to the next. In fact, seasonality covers a wide range of periodic behaviors, and can have some issues associated with the naming conventions. Should we discuss? First things first: when we say [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/03/26/naming-conventions-for-seasonality-types/">Naming conventions for seasonality types</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>In forecasting, the term seasonality doesn’t always mean what you think it does. It encompasses more than just patterns repeating from one season to the next. In fact, seasonality covers a wide range of periodic behaviors, and can have some issues associated with the naming conventions. Should we discuss?</p>
<p>First things first: when we say &#8220;seasonality&#8221; in forecasting, we mean any pattern that repeats periodically. If you mention monthly seasonality, most people will understand that you’re referring to a pattern repeating every 12 observations. Similarly, quarterly seasonality is widely recognized. However, beyond these two simple cases, ambiguity creeps in.</p>
<p>For example, if you describe your data as having &#8220;weekly&#8221; seasonality, do you mean that you’re working with weekly data and observe similar patterns every 52 weeks? Or are you dealing with daily data, where the pattern repeats every 7 days? The same issue applies to the term &#8220;daily&#8221; seasonality, which can refer to a pattern within daily data or a repeating pattern across multiple days.</p>
<p>Furthermore, the more granular your data, the more potential seasonal profiles you can have. For daily data, you may observe seasonality at 7-day (weekly) and 365-day (yearly) intervals. For hourly data, you could have three seasonal patterns: 24 hours, 168 hours (24 × 7), and 8,760 hours (24 × 365). An example of such data is shown in the image attached to this post.</p>
<p>Some people use the prefix &#8220;intra&#8221; to indicate patterns within a given frequency, but I still find this confusing. For example, intraweekly only indicates that a pattern exists within the week but doesn’t specify the frequency: it could refer to either 7 days or 168 hours.</p>
<p>That’s why I personally prefer the &#8220;A of B&#8221; naming scheme for seasonality. For example, &#8220;week of year&#8221; seasonality clearly denotes a pattern repeating every 52 observations. &#8220;Day of week&#8221; clearly refers to a 7-observation pattern. This format is more precise and less ambiguous than &#8220;weekly&#8221; or &#8220;intraweekly&#8221; seasonality. &#8220;Hour of year&#8221;, &#8220;half-hour of week&#8221;, &#8220;minute of day&#8221; etc. are all straightforward and easy to understand.</p>
<p>And what naming conventions do you use?</p>
<p>P.S. Kandrika Pritularga and I are running the course &#8220;Demand Forecasting Principles with Examples in R&#8221; again, where we’ll discuss some of these and related aspects in detail. You can read more about the course and sign up for it <a href="https://lancaster.ac.uk/centre-for-marketing-analytics-and-forecasting/grow-with-us/demand-forecasting-with-r/">here</a> and <a href="https://online-payments.lancaster-university.co.uk/product-catalogue/courses/lancaster-university-management-school-lums/centre-for-marketing-analytics-forecasting-cmaf/demand-forecasting-principles-with-examples-in-r">here</a> respectively.</p>
<p>Message <a href="https://openforecast.org/2025/03/26/naming-conventions-for-seasonality-types/">Naming conventions for seasonality types</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/03/26/naming-conventions-for-seasonality-types/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>There is no such thing as &#8220;the best approach for everything&#8221;</title>
		<link>https://openforecast.org/2025/03/06/there-is-no-such-thing-as-the-best-approach-for-everything/</link>
					<comments>https://openforecast.org/2025/03/06/there-is-no-such-thing-as-the-best-approach-for-everything/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Thu, 06 Mar 2025 13:54:09 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[error measures]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3790</guid>

					<description><![CDATA[<p>If someone tells you that method X solves all problems and is the best one ever, they are either lying intentionally or do not fully understand what they are talking about. There is no such thing as &#8220;the best approach for everything&#8221;. Let me explain. Consider two products sold by retailers: ice cream and bread. [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/03/06/there-is-no-such-thing-as-the-best-approach-for-everything/">There is no such thing as &#8220;the best approach for everything&#8221;</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>If someone tells you that method X solves all problems and is the best one ever, they are either lying intentionally or do not fully understand what they are talking about. There is no such thing as &#8220;the best approach for everything&#8221;. Let me explain.</p>
<p>Consider two products sold by retailers: ice cream and bread. You would expect the demand for ice cream to exhibit seasonal patterns because people tend to buy it more when it is warm outside. As a result, demand in summer is typically higher on average than in winter (this doesn&#8217;t apply to my friend <a href="https://kourentzes.com/forecasting/">Nikos Kourentzes</a>, who eats ice cream no matter what). This suggests that if we want to forecast demand for ice cream, we should use an approach that correctly captures seasonality in one way or another.</p>
<p>Demand for bread, on the other hand, typically follows a different pattern, as people tend to buy it regularly, and it usually does not have seasonality. Imposing a seasonal structure on such data could harm forecast accuracy.</p>
<p>Even in this simplistic example, it&#8217;s clear that the optimal approach may vary depending on each situation. Yes, we could develop a more flexible model capable of distinguishing between these cases, but there are multiple ways to achieve this (cross-validation, information criteria, statistical tests, etc.), and each specific solution would have strengths and weaknesses.</p>
<p>Now, would you expect a single new approach that can distinguish between the cases to outperform all others and be the best for every possible scenario? My answer is no, because one could always devise an alternative model or method that selects features differently and performs better under different conditions. For example, approach A might forecast demand for white bread better than approach B, but the opposite might be true for sourdough bread.</p>
<p>Even if approach A outperforms all others on average across a dataset, there will always be cases where it performs worse than some competitors, because forecasting accuracy is based on the distribution of error measures, not just a single number (see my <a href="/2024/03/27/what-does-lower-error-measure-really-mean/">old post here</a>). This is, for example, <a href="https://doi.org/10.1016/j.ijforecast.2021.08.006">confirmed by the M5 competition</a>, where the winning method, LightGBM, produced the most accurate forecasts on average but was outperformed by exponential smoothing in 41.5% of cases.</p>
<p>The same principle applies beyond point forecasts, across other statistics, fields, and disciplines. For example, if you need to produce prediction intervals or quantiles, you have a variety of tools to choose from, and depending on the specific situation, some will work better than others. There is no single approach that outperforms all alternatives in every context.</p>
<p>So, when someone claims to have a silver bullet that solves all problems, keep in mind: they are either trying to sell you something or do not understand what they are talking about.</p>
<p>Message <a href="https://openforecast.org/2025/03/06/there-is-no-such-thing-as-the-best-approach-for-everything/">There is no such thing as &#8220;the best approach for everything&#8221;</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/03/06/there-is-no-such-thing-as-the-best-approach-for-everything/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Model vs Method &#8211; why should we care?</title>
		<link>https://openforecast.org/2025/02/04/model-vs-method-why-should-we-care/</link>
					<comments>https://openforecast.org/2025/02/04/model-vs-method-why-should-we-care/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 04 Feb 2025 12:14:44 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3771</guid>

					<description><![CDATA[<p>Image above depicts a fashion model making a presentation about a forecasting method. I like the forecast for the final period in that image&#8230; Over the last few years, I’ve seen phrases like &#8220;LightGBM model&#8221; or &#8220;Neural Network model&#8221; on LinkedIn many times, and the statistician in me shivers every time. So, I figured it’s [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/02/04/model-vs-method-why-should-we-care/">Model vs Method &#8211; why should we care?</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
					<content:encoded><![CDATA[<p><em>The image above depicts a fashion model making a presentation about a forecasting method. I like the forecast for the final period in that image&#8230;</em></p>
<p>Over the last few years, I’ve seen phrases like &#8220;LightGBM model&#8221; or &#8220;Neural Network model&#8221; on LinkedIn many times, and the statistician in me shivers every time. So, I figured it’s time to discuss the difference between a model and a method.</p>
<p>Some of you might remember that I wrote <a href="/2020/06/08/forecasting-method-vs-forecasting-model-what-s-difference/">a post on this topic</a> a few years ago. But it seems it is worth revisiting.</p>
<p>John Boylan and I came up with the following definitions in <a href="/2023/09/08/iets-state-space-model-for-intermittent-demand-forecasting/">our paper</a>:</p>
<ul>
<li>A forecasting model is a mathematical representation of a real phenomenon with a complete specification of distribution and parameters;</li>
<li>A forecasting method is a mathematical procedure that generates point and/or interval forecasts, with or without a forecasting model.</li>
</ul>
<p>If these sound too technical, here’s a simpler explanation:</p>
<ul>
<li>A forecasting method is a way of generating forecasts;</li>
<li>A forecasting model is a way to describe the assumed structure of a real phenomenon.</li>
</ul>
<p>The key difference? A method focuses on producing something specific (e.g., point forecasts) with minimal assumptions, while a model relies on assumptions but can do much more:</p>
<ol>
<li>Rigorous estimation. Models can be constructed in ways that ensure their estimates of parameters are efficient and consistent.</li>
<li>Model selection using information criteria. A powerful approach that saves computational time and typically produces reasonable forecasts.</li>
<li>Predictive distribution. Models can generate moments (mean, variance, skewness) and quantiles, capturing uncertainty around future values.</li>
<li>Confidence intervals for parameters. While not crucial for forecasting, this is useful in other areas to quantify uncertainty.</li>
<li>Extendibility. Additional variables and components can be easily incorporated in a model.</li>
</ol>
<p>All of this comes at a price of <a href="/2025/01/07/there-is-no-such-thing-as-assumption-free-approach/">making assumptions about the reality</a>. If the assumptions don’t hold, the model won’t perform well. It might still be useful, but the risk of error increases. For example, you can apply a Random Walk model to purely random data, but you shouldn’t expect it to work well.</p>
<h3>Examples</h3>
<ol>
<li>A forecasting method: Naïve, defined by the simple equation:<br />
\( F_t = A_{t-1} \)<br />
This method is easy to explain, hard to break, and provides point forecasts, but nothing more.</li>
<li>A forecasting model: Random Walk, which underlies the Naïve method:<br />
\( A_t = A_{t-1} + \epsilon_t \)<br />
where \( \epsilon_t \) follows some distribution with zero mean and fixed variance. The Random Walk model has all the properties described above.</li>
</ol>
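<p>The connection between the two is easy to see in simulation. A short Python sketch (the Gaussian error term is purely for illustration; the model only requires zero mean and fixed variance):</p>

```python
import numpy as np

rng = np.random.default_rng(7)

# The Random Walk model: A_t = A_{t-1} + eps_t
eps = rng.normal(0, 1, size=100)
A = np.cumsum(eps)

# The Naive method: F_t = A_{t-1}. It is the point forecast implied by the
# model, because E[A_t | A_{t-1}] = A_{t-1} when eps_t has zero mean.
F = A[:-1]
actual = A[1:]
print(round(float(np.mean(np.abs(actual - F))), 3))  # one-step-ahead MAE
```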
<p>In some cases, you can derive the model underlying a method. In my opinion, this typically enhances the method, making it more powerful for the reasons explained above: once the underlying model is identified, we can do much more with it.</p>
<p>For example, when estimating a quantile regression, we typically minimize a pinball loss function, which gives us a method for generating quantiles. However, if we estimate the same linear regression model using likelihood, assuming that the error term follows the <a href="https://doi.org/10.1093/biostatistics/kxj039">Asymmetric Laplace distribution</a>, we arrive at exactly the same parameter estimates as in quantile regression. But now, we also gain additional benefits, such as model selection, predictive distribution, and confidence intervals for parameters &#8211; the features outlined above. In a way, these benefits come &#8220;for free&#8221;, although at the cost of making explicit assumptions about the model. That said, I’d argue that assumptions exist in quantile regression anyway &#8211; they’re just not stated explicitly.</p>
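<p>The equivalence can be checked in the simplest special case: the constant forecast that minimises the pinball loss at level tau is the sample tau-quantile. A small numpy sketch (the data here is made up for the demo):</p>

```python
import numpy as np

def pinball(q, y, tau):
    """Mean pinball (quantile) loss of a constant forecast q at level tau."""
    e = y - q
    return float(np.mean(np.where(e >= 0, tau * e, (tau - 1) * e)))

rng = np.random.default_rng(1)
y = rng.exponential(scale=2, size=1000)    # skewed data
tau = 0.9

# Minimise the loss over a fine grid of candidate constants: the minimiser
# lands (up to grid resolution) on the empirical 90% quantile
grid = np.linspace(y.min(), y.max(), 10_001)
best = grid[np.argmin([pinball(q, y, tau) for q in grid])]
print(round(best, 2), round(float(np.quantile(y, tau)), 2))  # nearly identical
```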
<p>And here we finally come to the ML approaches. According to the definitions we discussed earlier, Decision Trees, k-Nearest Neighbors, Artificial Neural Networks (ANNs) and other ML approaches are not forecasting models. They do not attempt to capture the underlying structure of the data. Instead, they focus on identifying nonlinear patterns via engineered features to produce point forecasts. In other words, they are methods, not models.</p>
<p>This doesn’t make them inferior. Their strength lies in their flexibility, precisely because they don&#8217;t impose strong assumptions. However, treating them as forecasting models can lead to potential issues.</p>
<p>For example, plugging LightGBM’s point forecasts into a probability distribution doesn’t magically turn it into a model. It simply makes it a method that now generates quantiles, but without a solid theoretical foundation for why a specific distribution is chosen or used in a particular way.</p>
<p>Another example is model selection using information criteria, which is meaningless for ML approaches. Why? Because information criteria rely on the assumption that the model is estimated in a specific way (e.g., via maximum likelihood estimation), ensuring parameter consistency and model identifiability. However, some ML methods, such as ANNs, are fundamentally unidentifiable, as different architectures can produce the same output. So, the information criteria become meaningless in this setting.</p>
<p>So next time you see the term model, take a moment to consider whether it’s used correctly and whether it actually means what the author thinks.</p>
<p>Message <a href="https://openforecast.org/2025/02/04/model-vs-method-why-should-we-care/">Model vs Method &#8211; why should we care?</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/02/04/model-vs-method-why-should-we-care/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
