<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Archives statistics - Open Forecasting</title>
	<atom:link href="https://openforecast.org/tag/statistics/feed/" rel="self" type="application/rss+xml" />
	<link>https://openforecast.org/tag/statistics/</link>
	<description>How to look into the future</description>
	<lastBuildDate>Sun, 22 Mar 2026 15:30:09 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2015/08/cropped-usd-05-32x32.png&amp;nocache=1</url>
	<title>Archives statistics - Open Forecasting</title>
	<link>https://openforecast.org/tag/statistics/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>The real Dunning-Kruger effect</title>
		<link>https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/</link>
					<comments>https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 23 Mar 2026 09:03:35 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=4096</guid>

					<description><![CDATA[<p>Many of you have seen this image on the Internet — I&#8217;ve seen it myself a few times on LinkedIn lately. People say it depicts the &#8220;Dunning-Kruger&#8221; effect&#8230; But did you know this is actually an internet meme with little to do with the original paper? Here is one of the recent examples, a screenshot [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/">The real Dunning-Kruger effect</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Many of you have seen this image on the Internet — I&#8217;ve seen it myself a few times on LinkedIn lately. People say it depicts the &#8220;Dunning-Kruger&#8221; effect&#8230; But did you know this is actually an internet meme with little to do with the original paper?</p>
<p>Here is one of the recent examples, a screenshot of <a href="https://www.linkedin.com/posts/fotios-petropoulos-04536023_dear-mr-i-reduce-forecast-error-by-30-share-7437246645530140672-NXnT">the post of Fotios Petropoulos</a> about the effect.</p>
<div id="attachment_4098" style="width: 282px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-Petropoulos.png&amp;nocache=1"><img fetchpriority="high" decoding="async" aria-describedby="caption-attachment-4098" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-Petropoulos-272x300.png&amp;nocache=1" alt="A LinkedIn post by Fotios Petropoulos" width="272" height="300" class="size-medium wp-image-4098" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-Petropoulos-272x300.png&amp;nocache=1 272w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-Petropoulos.png&amp;nocache=1 556w" sizes="(max-width: 272px) 100vw, 272px" /></a><p id="caption-attachment-4098" class="wp-caption-text">A LinkedIn post by Fotios Petropoulos</p></div>
<p>In the original paper, <a href="https://psycnet.apa.org/doi/10.1037/0022-3514.77.6.1121">Kruger and Dunning (1999)</a> ran experiments with undergraduates on humour, logical reasoning, and grammar. Participants completed a test and estimated their percentile rank. The authors then sorted participants into four quartiles by actual performance and computed averages for actual and self-assessed performance for each quartile. The plots in their paper &#8211; the real Dunning–Kruger effect &#8211; are just four data points per line, not a smooth curve over a learning journey (second image).</p>
<p>What did they find? People in the bottom quartile substantially overestimated their performance, often believing they were average or above. Top performers slightly underestimated their standing. The key finding is an asymmetry in miscalibration: low performers overestimate, high performers slightly underestimate.</p>
<p>This has almost nothing to do with the popular &#8220;experience vs. confidence&#8221; image. The original X‑axis is performance quartile at a single point in time; the meme&#8217;s X‑axis is a vague notion of &#8220;experience&#8221; through time. The original Y‑axis is the assessed test percentile; the meme&#8217;s is a free‑floating &#8220;confidence&#8221; construct. In the actual data, perceived performance increases with actual performance &#8211; there is no early spike, no &#8220;valley of despair,&#8221; no &#8220;slope of enlightenment.&#8221; That swooping curve is an internet-era graphic never reported by Kruger and Dunning, and it misleadingly frames the effect as a personal development trajectory the paper never studied.</p>
<p>There is also a serious critique of the original paper from a statistical point of view. For example, <a href="https://doi.org/10.1016/j.intell.2020.101449">Gignac and Zajenkowski (2020)</a> showed that sorting people into quartiles and plotting average self-assessment against average performance can, by itself, generate the characteristic pattern &#8211; purely as a statistical artefact. In their own empirical data, miscalibration was roughly constant across ability levels, consistent with measurement noise rather than a special cognitive deficit in low performers. You can actually reproduce the pattern using two random uncorrelated variables. Here is a simple example in R:</p>
<pre class="decode">set.seed(41)

# Two independent, uncorrelated variables standing in for
# actual and self-assessed performance
x <- rnorm(10000, 100, 10)
y <- rnorm(10000, 100, 10)
plot(x, y)

# Quartile boundaries of the actual performance
xQ <- quantile(x)

# Mean actual and assessed performance in each quartile of x
yMeans <- xMeans <- vector("numeric", 4)
for(i in 1:4){
    id <- x >= xQ[i] &amp; x <= xQ[i+1]
    xMeans[i] <- mean(x[id])
    yMeans[i] <- mean(y[id])
}

plot(1:4, xMeans, type="b", ylim=range(xMeans, yMeans),
     xlab="Real performance", ylab="Assessed performance",
     lwd=2)
lines(1:4, yMeans, lwd=2, lty=2)
points(1:4, yMeans, lwd=2)
legend("topleft",
       legend=c("Actual performance", "Assessed performance"),
       lwd=2, lty=c(1,2), pch=1)</pre>
<p>This produces an image like the following:</p>
<div id="attachment_4100" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R.png&amp;nocache=1"><img decoding="async" aria-describedby="caption-attachment-4100" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R-300x175.png&amp;nocache=1" alt="Dunning-Kruger plot reproduction" width="300" height="175" class="size-medium wp-image-4100" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2026/03/2026-03-22-Dunning-Kruger-R.png&amp;nocache=1 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-4100" class="wp-caption-text">Dunning-Kruger plot reproduction</p></div>
<p>If you introduce a correlation between the two variables, the image starts looking even more similar to the ones from the original paper.</p>
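<p>As a sketch of that (with an arbitrary correlation, chosen purely for illustration), the same exercise can be repeated with an assessed performance that partially depends on the actual one:</p>
<pre class="decode">set.seed(41)

# "Actual" performance, and "assessed" performance that is
# positively correlated with it (the weights are arbitrary)
x <- rnorm(10000, 100, 10)
y <- 0.5 * x + 0.5 * rnorm(10000, 100, 10)

# Mean actual and assessed performance per quartile of x
group <- cut(x, quantile(x), include.lowest=TRUE, labels=FALSE)
xMeans <- tapply(x, group, mean)
yMeans <- tapply(y, group, mean)

plot(1:4, xMeans, type="b", ylim=range(xMeans, yMeans),
     xlab="Real performance", ylab="Assessed performance", lwd=2)
lines(1:4, yMeans, lwd=2, lty=2)
points(1:4, yMeans, lwd=2)</pre>
<p>The assessed line now rises with actual performance but stays flatter than it: the bottom quartile "overestimates" and the top quartile "underestimates", just as in the original plots.</p>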
<p>So there might be a real effect &#8211; many follow-up studies have measured it with more rigorous tools &#8211; but Dunning and Kruger&#8217;s method was not the right one to establish it. And the experience-vs-confidence image is just a meme &#8211; a serious misconception that should not be presented as the Dunning-Kruger effect.</p>
<p>P.S. If you wonder who the &#8220;leading expert&#8221; that Fotios Petropoulos refers to in his post is &#8211; it&#8217;s me. Not sure why he doesn&#8217;t tag me properly.</p>
<p>Message <a href="https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/">The real Dunning-Kruger effect</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/03/23/the-real-dunning-kruger-effect/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>There&#8217;s no such thing as &#8220;deterministic forecast&#8221;</title>
		<link>https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/</link>
					<comments>https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 02 Mar 2026 22:45:31 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=4081</guid>

					<description><![CDATA[<p>Sometimes I see people referring to a &#8220;deterministic&#8221; forecast, and I have some personal issues with this. Because if you apply a model to data then there is nothing deterministic about your forecasts! In many contexts, &#8220;deterministic&#8221; has a precise meaning: no randomness, no uncertainty. A deterministic solution to an optimisation problem (e.g. linear programming) [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/">There&#8217;s no such thing as &#8220;deterministic forecast&#8221;</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Sometimes I see people referring to a &#8220;deterministic&#8221; forecast, and I have some personal issues with this, because if you apply a model to data, there is nothing deterministic about your forecasts!</p>
<p>In many contexts, &#8220;deterministic&#8221; has a precise meaning: no randomness, no uncertainty. A deterministic solution to an optimisation problem (e.g. linear programming) implies that there are no random inputs or outputs once the model and its parameters are fixed. Forecasting is different. As <a href="https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-131X(199612)15:7%3C495::AID-FOR640%3E3.0.CO;2-O">Chatfield</a> and many others have pointed out, forecasting has multiple sources of uncertainty, and there is essentially zero chance that the future will unfold exactly as any single number suggests.</p>
<p>Yes, some people use &#8220;deterministic&#8221; as a synonym for &#8220;point forecast&#8221;. But that label is still misleading, because a point forecast is not uncertainty-free &#8211; it is just one summary of a predictive distribution (often the conditional mean, sometimes the median or another functional).</p>
<p>Here’s a quick reality check you can do yourself. Take a dataset, apply your model, and write down the point forecast for the next few observations. Now add one new observation, re-estimate, and forecast again (the image in this post depicts exactly that, but with 50 forecasts produced on different subsamples of data). The point forecast will change unless you are dealing with an exotic situation with non-random data (e.g. every day, you sell exactly 100 units). So, which of the two was the &#8220;deterministic&#8221; forecast? If forecasts were truly deterministic in the strict sense, you would not get multiple plausible values from small, reasonable changes in the sample.</p>
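<p>As a sketch of that check (the data and the simple trend model here are made up purely for illustration), this is what it looks like in R:</p>
<pre class="decode">set.seed(42)

# An artificial series: linear trend plus noise
n <- 30
trend <- 1:n
y <- 50 + 0.5 * trend + rnorm(n, 0, 5)

# Fit a trend model and produce a point forecast for t = n+2
fit1 <- lm(y ~ trend)
forecast1 <- predict(fit1, newdata=data.frame(trend=n+2))

# One new observation arrives; re-estimate and forecast
# the very same future point again
y <- c(y, 50 + 0.5 * (n+1) + rnorm(1, 0, 5))
trend <- 1:(n+1)
fit2 <- lm(y ~ trend)
forecast2 <- predict(fit2, newdata=data.frame(trend=n+2))

# The two "deterministic" forecasts of the same point differ
c(forecast1, forecast2)</pre>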
<p>This happens because any forecasting method (statistical or ML) depends on data and on modelling choices: parameter estimation, feature selection, splitting rules, tuning, even decisions like &#8220;use α=0.1&#8221;. Those choices can be fixed across samples of data, but fixing them does not remove uncertainty &#8211; it only hides it. The randomness is still there in the data and in the fact that we only observe a sample of it.</p>
<p>So when you see someone mention a &#8220;deterministic forecast&#8221;, it&#8217;s worth translating it mentally to &#8220;a point forecast, probably a conditional mean&#8221;. If you care about decisions and risk, you should know that there is uncertainty associated with this so-called &#8220;deterministic forecast&#8221;, and that it should not be ignored. But this is a topic for another discussion in another post.</p>
<p>Message <a href="https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/">There&#8217;s no such thing as &#8220;deterministic forecast&#8221;</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/03/02/there-s-no-such-thing-as-deterministic-forecast/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Teaching Statistics and Descriptive Analytics in the world of AI</title>
		<link>https://openforecast.org/2026/01/07/teaching-statistics-and-descriptive-analytics-in-the-world-of-ai/</link>
					<comments>https://openforecast.org/2026/01/07/teaching-statistics-and-descriptive-analytics-in-the-world-of-ai/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Wed, 07 Jan 2026 17:32:28 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[teaching]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3946</guid>

					<description><![CDATA[<p>Teaching statistics as a flipped classroom with the help of AI? You heard that right! That’s exactly what I tried this year &#8211; and here are the results. Attached to this post is the student evaluation score for the module. Yes, the number of responses is quite low (only 50% of the cohort), but it [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2026/01/07/teaching-statistics-and-descriptive-analytics-in-the-world-of-ai/">Teaching Statistics and Descriptive Analytics in the world of AI</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Teaching statistics as a flipped classroom with the help of AI? You heard that right! That’s exactly what I tried this year &#8211; and here are the results.</p>
<p>Attached to this post is the student evaluation score for the module. Yes, the number of responses is quite low (only 50% of the cohort), but it should still give a sense of how students perceived Statistics and Descriptive Analytics. Of course, this reflects only their impression &#8211; coursework submissions are yet to come &#8211; but it’s still an encouraging sign that some things worked well.</p>
<p>I’ve taught this module since 2018, first with Dave Worthington and later with Alisa Yusupova. Normally, I focused on the second half, covering regression through lectures and workshops. But this year, I took on the full module and realised I didn’t want to teach probability theory and statistics in the traditional way &#8211; long monologues in lectures followed by awkward silence in workshops. That format, I believe, no longer works. After all, students can always ask their favourite LLM to explain concepts they don&#8217;t understand. And some don&#8217;t even do that &#8211; they just ask it to solve problems without understanding them. So, what can be done in this brave new world?</p>
<p>I don’t yet have a definitive answer &#8211; only the results of an experiment.</p>
<p>Lectures. This year, I used Google NotebookLM to prepare lecture materials. I provided my existing notes, slides, and relevant texts, then asked it to produce podcasts on specific topics. This took more time than expected, as I had to review the generated content, adjust prompts, and refine focus areas, listening to the podcasts over and over again. Once ready, I uploaded the materials to Moodle and asked students to listen beforehand. In class, we skipped formal lectures and instead had whiteboard &#038; marker discussions. I asked questions, showed derivations, and encouraged debate. With a class of 26 students, it was possible to create much more interaction than in previous years.</p>
<p>Workshops. We still had problem-solving sessions, but I allowed (actually, encouraged) students to use LLMs to solve tasks and explain why the solutions were correct. The aim was to emphasise reasoning and assumptions over simply obtaining the right number. This worked with mixed success, and I still need to think about how it can be improved further.</p>
<p>Did it work overall?</p>
<p>I&#8217;m not entirely sure. Not all students engaged with the materials in advance, but those who did seemed to benefit and appreciated the approach. What I do know is that the &#8220;two-hour monologue while everyone tries not to fall asleep&#8221; format does not work any more. For universities, and for the (very!) expensive UK education, to remain relevant, we must innovate and rethink how we teach.</p>
<p>What would you change if you were teaching a technical subject at university in the era of AI? I’d love to hear your ideas.</p>
<p>Message <a href="https://openforecast.org/2026/01/07/teaching-statistics-and-descriptive-analytics-in-the-world-of-ai/">Teaching Statistics and Descriptive Analytics in the world of AI</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2026/01/07/teaching-statistics-and-descriptive-analytics-in-the-world-of-ai/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>On randomness and uncertainty</title>
		<link>https://openforecast.org/2025/04/28/on-randomness-and-uncertainty/</link>
					<comments>https://openforecast.org/2025/04/28/on-randomness-and-uncertainty/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 28 Apr 2025 11:05:29 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[theory]]></category>
		<category><![CDATA[uncertainty]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3828</guid>

					<description><![CDATA[<p>Everything is random! Your data, your model, its parameter estimates, the forecasts it produces, and even the minimum of the loss function you used. There is no such thing as a &#8220;deterministic&#8221; forecast &#8211; everything is stochastic! Whenever you work with data, you are working with a sample from a population. In some cases, this [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/04/28/on-randomness-and-uncertainty/">On randomness and uncertainty</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Everything is random! Your data, your model, its parameter estimates, the forecasts it produces, and even the minimum of the loss function you used. There is no such thing as a &#8220;deterministic&#8221; forecast &#8211; everything is stochastic!</p>
<p>Whenever you work with data, you are working with a sample from a population. In some cases, this is more apparent than in others. In my statistics lectures, I typically give the following example. Consider that we are interested in the average height of students at the university. I could ask every student at the lecture to tell me their height, take the average, and get a number. Is this number random? Yes, indeed. Why? Because if a student who was late for the lecture comes in, I would need to recalculate the average, and the number would change. The average that I get depends on who specifically I have in the sample and how many observations I have. It will vary more in smaller samples and become more stable in larger ones. But this example gives you an idea about the inherent uncertainty of any estimates we deal with.</p>
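<p>In numbers (with made-up heights), the effect is easy to see:</p>
<pre class="decode"># Heights (in cm) of the students present at the lecture
heights <- c(178, 165, 172, 181, 169)
mean(heights)
# 173

# A late student walks in, and the estimate changes
mean(c(heights, 190))
# 175.8333</pre>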
<p>In time series, the situation is somewhat similar: you are dealing with a sample of values that you have observed up until a specific moment. If, for example, you want to forecast daily admissions in the emergency department of a hospital and apply a model, its forecast will change when a new day comes and a new cohort of patients arrives. This is because your sample changes, and you receive new information about the demand.</p>
<p>So, the parameter estimates of a model you use will change when you get a new observation (e.g., a new record of product sales). Yes, if you estimate the model properly (e.g., using Least Squares), the parameter estimates won’t change substantially, but they will change nonetheless. And this would affect point forecasts and any other statistics produced by your model. Your standard errors, p-values, conditional means, prediction intervals, error measures, model ranking &#8211; everything will change with a new observation. In fact, if you do model selection, the structure of the model might change as well. For example, in the case of ETS, you might switch from a model without a trend to one with a trend. So, every time you estimate anything on a sample of data, you should keep in mind that it is random and will change if your sample changes or gets updated.</p>
<p>Why is that important? Because we need to understand this inherent uncertainty, and ideally, we should somehow take it into account. In forecasting, this means you should not draw conclusions based on one application of a model to a dataset. At the very least, you should perform <a href="/adam/rollingOrigin.html">a rolling origin evaluation</a>. As Leonidas Tsaprounis says, &#8220;if you don&#8217;t roll the origin, you roll the dice&#8221;.</p>
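<p>A minimal sketch of the idea (using the Naive method on an artificial series, so that no packages are needed):</p>
<pre class="decode">set.seed(123)
y <- 100 + cumsum(rnorm(60))   # an artificial random walk series

# Rolling origin with the Naive method: a one-step-ahead
# forecast from each of 20 consecutive origins
origins <- 40:59
errors <- sapply(origins, function(t) y[t+1] - y[t])

# You get a distribution of errors, not a single lucky number
summary(abs(errors))</pre>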
<p>So, embrace the uncertainty and learn how to deal with it.</p>
<p>By the way, Kandrika Pritularga and I are holding a course on Demand Forecasting starting on 6th May. There is still time to <a href="https://online-payments.lancaster-university.co.uk/product-catalogue/courses/lancaster-university-management-school-lums/centre-for-marketing-analytics-forecasting-cmaf/demand-forecasting-principles-with-examples-in-r">sign up for it here</a>.</p>
<p>Message <a href="https://openforecast.org/2025/04/28/on-randomness-and-uncertainty/">On randomness and uncertainty</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/04/28/on-randomness-and-uncertainty/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Model vs Method &#8211; why should we care?</title>
		<link>https://openforecast.org/2025/02/04/model-vs-method-why-should-we-care/</link>
					<comments>https://openforecast.org/2025/02/04/model-vs-method-why-should-we-care/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 04 Feb 2025 12:14:44 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3771</guid>

					<description><![CDATA[<p>Image above depicts a fashion model making a presentation about a forecasting method. I like the forecast for the final period in that image&#8230; Over the last few years, I’ve seen phrases like &#8220;LightGBM model&#8221; or &#8220;Neural Network model&#8221; on LinkedIn many times, and the statistician in me shivers every time. So, I figured it’s [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/02/04/model-vs-method-why-should-we-care/">Model vs Method &#8211; why should we care?</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><em>The image above depicts a fashion model making a presentation about a forecasting method. I like the forecast for the final period in that image&#8230;</em></p>
<p>Over the last few years, I’ve seen phrases like &#8220;LightGBM model&#8221; or &#8220;Neural Network model&#8221; on LinkedIn many times, and the statistician in me shivers every time. So, I figured it’s time to discuss the difference between a model and a method.</p>
<p>Some of you might remember that I wrote <a href="/2020/06/08/forecasting-method-vs-forecasting-model-what-s-difference/">a post on this topic</a> a few years ago. But it seems it is worth revisiting.</p>
<p>John Boylan and I came up with the following definitions in <a href="/2023/09/08/iets-state-space-model-for-intermittent-demand-forecasting/">our paper</a>:</p>
<ul>
<li>A forecasting model is a mathematical representation of a real phenomenon with a complete specification of distribution and parameters;</li>
<li>A forecasting method is a mathematical procedure that generates point and/or interval forecasts, with or without a forecasting model.</li>
</ul>
<p>If these sound too technical, here’s a simpler explanation:</p>
<ul>
<li>A forecasting method is a way of generating forecasts;</li>
<li>A forecasting model is a way to describe the assumed structure of a real phenomenon.</li>
</ul>
<p>The key difference? A method focuses on producing something specific (e.g., point forecasts) with minimal assumptions, while a model relies on assumptions but can do much more:</p>
<ol>
<li>Rigorous estimation. Models can be constructed in ways that ensure that their parameter estimates are efficient and consistent.</li>
<li>Model selection using information criteria. A powerful approach that saves computational time and typically produces reasonable forecasts.</li>
<li>Predictive distribution. Models can generate moments (mean, variance, skewness) and quantiles, capturing uncertainty around future values.</li>
<li>Confidence intervals for parameters. While not crucial for forecasting, this is useful in other areas to quantify uncertainty.</li>
<li>Extendibility. Additional variables and components can be easily incorporated in a model.</li>
</ol>
<p>All of this comes at a price of <a href="/2025/01/07/there-is-no-such-thing-as-assumption-free-approach/">making assumptions about the reality</a>. If the assumptions don’t hold, the model won’t perform well. It might still be useful, but the risk of error increases. For example, you can apply a Random Walk model to purely random data, but you shouldn’t expect it to work well.</p>
<h3>Examples</h3>
<ol>
<li>A forecasting method: Naïve, defined by the simple equation:<br />
\( F_t = A_{t-1} \)<br />
This method is easy to explain, hard to break, and provides point forecasts, but nothing more.</li>
<li>A forecasting model: Random Walk, which underlies the Naïve method:<br />
\( A_t = A_{t-1} + \epsilon_t \)<br />
where \( \epsilon_t \) follows some distribution with zero mean and fixed variance. The Random Walk model has all the properties described above.</li>
</ol>
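<p>The connection between the two is easy to see in a simulation. Below is a sketch: a Random Walk sample path, the Naive point forecast, and a prediction interval that only the model can justify (the normality of the error term here is an assumption made for the example):</p>
<pre class="decode">set.seed(1)
sigma <- 10

# A Random Walk sample path: A_t = A_{t-1} + epsilon_t
A <- 100 + cumsum(rnorm(100, 0, sigma))

# The Naive method gives only the point forecast: the last value
forecastNaive <- tail(A, 1)

# The Random Walk model also gives the h-steps-ahead predictive
# distribution, N(A_T, h * sigma^2), and hence a 95% interval
h <- 5
interval <- forecastNaive + c(-1, 1) * qnorm(0.975) * sigma * sqrt(h)</pre>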
<p>In some cases, you can derive the model underlying a method. In my opinion, this typically enhances the method, making it more powerful for the reasons explained above: once we identify the underlying model, we can do much more with it.</p>
<p>For example, when estimating a quantile regression, we typically minimize a pinball loss function, which gives us a method for generating quantiles. However, if we estimate the same linear regression model using likelihood, assuming that the error term follows the <a href="https://doi.org/10.1093/biostatistics/kxj039">Asymmetric Laplace distribution</a>, we arrive at exactly the same parameter estimates as in quantile regression. But now, we also gain additional benefits, such as model selection, predictive distribution, and confidence intervals for parameters &#8211; the features outlined above. In a way, these benefits come &#8220;for free&#8221;, although at the cost of making explicit assumptions about the model. That said, I’d argue that assumptions exist in quantile regression anyway &#8211; they’re just not stated explicitly.</p>
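<p>The equivalence is easy to check numerically. The sketch below minimises the pinball loss for a constant &#8220;model&#8221; (no regressors, for simplicity) and recovers the sample quantile:</p>
<pre class="decode">set.seed(7)
y <- rnorm(1000, 100, 10)
tau <- 0.9

# Pinball (quantile) loss of a candidate value q
pinball <- function(q){
    e <- y - q
    sum(e * (tau - (e < 0)))
}

# Minimising it gives (approximately) the 90th percentile of y
qHat <- optimize(pinball, range(y))$minimum
c(qHat, quantile(y, tau))</pre>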
<p>And here we finally come to the ML approaches. According to the definitions we discussed earlier, Decision Trees, k-Nearest Neighbors, Artificial Neural Networks (ANNs) and other ML approaches are not forecasting models. They do not attempt to capture the underlying structure of the data. Instead, they focus on identifying nonlinear patterns via engineered features to produce point forecasts. In other words, they are methods, not models.</p>
<p>This doesn’t make them inferior. Their strength lies in their flexibility, precisely because they don&#8217;t impose strong assumptions. However, treating them as forecasting models can lead to potential issues.</p>
<p>For example, plugging LightGBM’s point forecasts into a probability distribution doesn’t magically turn it into a model. It simply makes it a method that now generates quantiles, but without a solid theoretical foundation for why a specific distribution is chosen or used in a particular way.</p>
<p>Another example is model selection using information criteria, which is meaningless for ML approaches. Why? Because information criteria rely on the assumption that the model is estimated in a specific way (e.g., via maximum likelihood estimation), ensuring parameter consistency and model identifiability. However, some ML methods, such as ANNs, are fundamentally unidentifiable, as different architectures can produce the same output. So, the information criteria become meaningless in this setting.</p>
<p>So next time you see the term &#8220;model&#8221;, take a moment to consider whether it&#8217;s used correctly and whether it actually means what the author thinks it means.</p>
<p>Message <a href="https://openforecast.org/2025/02/04/model-vs-method-why-should-we-care/">Model vs Method &#8211; why should we care?</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/02/04/model-vs-method-why-should-we-care/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>There is no such thing as an &#8220;assumption-free approach&#8221;</title>
		<link>https://openforecast.org/2025/01/07/there-is-no-such-thing-as-assumption-free-approach/</link>
					<comments>https://openforecast.org/2025/01/07/there-is-no-such-thing-as-assumption-free-approach/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 07 Jan 2025 10:50:55 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3761</guid>

					<description><![CDATA[<p>One thing that bothers me when I read posts on social media or papers in peer-reviewed journals is the claim that a proposed approach is &#8220;assumption-free.&#8221; In forecasting, this is never true. Such an approach is like a spherical unicorn in a vacuum (see image above). Here&#8217;s why. Every model is a simplification of reality, [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2025/01/07/there-is-no-such-thing-as-assumption-free-approach/">There is no such thing as an &#8220;assumption-free approach&#8221;</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>One thing that bothers me when I read posts on social media or papers in peer-reviewed journals is the claim that a proposed approach is &#8220;assumption-free.&#8221; In forecasting, this is never true. Such an approach is like a spherical unicorn in a vacuum (see image above). Here&#8217;s why.</p>
<p>Every model is a simplification of reality, meaning that it captures only a part of it. Simplifying implies that certain aspects of reality are irrelevant and can be ignored. For example, in forecasting, regardless of the approach used, we typically assume that the model captures the structure correctly, i.e. neither omitting important elements nor overfitting the data. Different approaches address this differently: statistical models do that explicitly, while a good ML approach seeks a balance between underfitting and overfitting, often in a non-linear way. When the structure is captured correctly, the forecast reflects the essential part of reality while ignoring small random fluctuations (see a post on <a href="/2024/08/13/structure-vs-noise-a-fundamental-concept-in-forecasting/">structure vs. noise</a>).</p>
<p>Depending on the assumptions we make, we can classify approaches as parametric, semiparametric, or nonparametric.</p>
<p>Parametric approaches assume that the model is correctly specified, its parameters are accurately estimated, and the chosen distribution is appropriate (often the normal one, though others can be used). In this case, we fully rely on the model. A classical example is the construction of conventional prediction intervals: the conditional expectation and variance are calculated and plugged into the normal distribution to derive the necessary quantiles for a specified confidence level. Specifically, in this case we assume that the model is correct, errors are uncorrelated and homoscedastic, and that they follow a normal distribution.</p>
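<p>As a sketch of the parametric case in R, using a simple linear regression as a stand-in for the model (the data below are simulated):</p>

```r
set.seed(42)
# Simulated series: linear trend plus normal noise
x <- 1:100
y <- 50 + 0.5 * x + rnorm(100, 0, 5)
fit <- lm(y ~ x)

# The conditional mean and variance are plugged into the normal
# distribution -- this is exactly where the parametric assumptions
# (correct model, uncorrelated homoscedastic normal errors) enter
pred <- predict(fit, newdata = data.frame(x = 101:110),
                interval = "prediction", level = 0.95)
round(head(pred, 3), 2)
```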
<p>Semiparametric approaches relax some of these assumptions. For example, we might calculate statistics in a more robust manner or drop the assumption of a specific distribution. For instance, instead of relying on textbook formulae, we could use in-sample multistep forecast errors to calculate conditional variances. This eliminates the need to assume uncorrelated and homoscedastic errors and allows for some flexibility in the model structure. However, in this example, we still rely on normality.</p>
<p>Nonparametric approaches avoid most of the above assumptions but come with their own hidden ones. For instance, the method proposed by <a href="https://doi.org/10.1287/mnsc.45.2.225">Taylor &#038; Bunn (1999)</a> for constructing prediction intervals fits quantile regressions to in-sample multistep forecast errors. This method does not assume a correct model, well-behaved residuals, or normality. However, it does assume the appropriateness of the chosen quantile regression function (Spoiler: they used polynomial regression, but my experiments suggest that a power function is a more robust alternative).</p>
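<p>The general idea can be sketched in R by fitting a power function to the quantiles of simulated multistep errors via the pinball loss (this is my illustration of the idea, not the exact procedure of Taylor &#038; Bunn, and all the numbers are made up):</p>

```r
set.seed(42)
h <- 1:10
# Hypothetical multistep forecast errors, with spread growing in h
errors <- lapply(h, function(i) rnorm(200, 0, 2 * sqrt(i)))

# Pinball (quantile) loss for a power function a * h^b at level tau
pinball <- function(par, tau) {
  loss <- 0
  for (i in seq_along(h)) {
    e <- errors[[i]] - par[1] * h[i]^par[2]
    loss <- loss + sum(e * (tau - (e < 0)))
  }
  loss
}

# Fit the upper 97.5% bound as a smooth function of the horizon
fit <- optim(c(1, 0.5), pinball, tau = 0.975)
upperBound <- fit$par[1] * h^fit$par[2]
round(upperBound, 2)
```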
<p>You might think that nonparametric approaches, with fewer assumptions, should always be preferred. But that&#8217;s not necessarily the case. It is &#8220;horses for courses&#8221;: you should select the approach that best fits your specific situation. For example, when working with small samples, introducing some assumptions might be necessary to get meaningful estimates. A nonparametric approach, while powerful, might require more data than you have available.</p>
<p>Finally, there is no such thing as a &#8220;best&#8221; method for every situation. As is often the case in forecasting, you need to try different approaches and choose the one that works best. Even then, remember that forecasting always rests on a fundamental assumption: the future will resemble the past. And no fancy method can guarantee that this assumption will hold.</p>
<p>Message <a href="https://openforecast.org/2025/01/07/there-is-no-such-thing-as-assumption-free-approach/">There is no such thing as an &#8220;assumption-free approach&#8221;</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2025/01/07/there-is-no-such-thing-as-assumption-free-approach/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Structure vs. Noise: A Fundamental Concept in Forecasting</title>
		<link>https://openforecast.org/2024/08/13/structure-vs-noise-a-fundamental-concept-in-forecasting/</link>
					<comments>https://openforecast.org/2024/08/13/structure-vs-noise-a-fundamental-concept-in-forecasting/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 13 Aug 2024 13:06:01 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3647</guid>

					<description><![CDATA[<p>One of the core ideas in statistics, which extends to many other fields including forecasting, is the concept of structure versus noise. You’ve probably heard of it, but it’s often overlooked by those without a strong quantitative background. So, let&#8217;s discuss. The core of the idea is that any data consists of two fundamental parts: [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2024/08/13/structure-vs-noise-a-fundamental-concept-in-forecasting/">Structure vs. Noise: A Fundamental Concept in Forecasting</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>One of the core ideas in statistics, which extends to many other fields including forecasting, is the concept of structure versus noise. You’ve probably heard of it, but it’s often overlooked by those without a strong quantitative background. So, let&#8217;s discuss.</p>
<p>The core of the idea is that any data consists of two fundamental parts:</p>
<ol>
<li>Structure, which can take various forms, and might include trend, seasonality, calendar effects, and the influence of external factors on demand (e.g., price changes, promotions etc).</li>
<li>Noise, which is inherently unpredictable.</li>
</ol>
<p>Structure can be captured using models or methods, and this is what produces the fitted values or point forecasts. Noise, on the other hand, is unpredictable &#8211; like not knowing exactly who will visit a store and when they’ll make a purchase.</p>
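<p>For a time series, this split can be illustrated with a basic decomposition in base R (the data below are simulated, not real sales):</p>

```r
set.seed(42)
# Simulated monthly sales: structure (level + seasonality) plus noise
months <- 1:72
structurePart <- 100 + 15 * sin(2 * pi * months / 12)
noisePart <- rnorm(72, 0, 5)
y <- ts(structurePart + noisePart, frequency = 12)

# stl() splits the series into trend, seasonal and remainder components;
# the remainder is our estimate of the unpredictable noise
dec <- stl(y, s.window = "periodic")
round(sd(dec$time.series[, "remainder"]), 2)
```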
<p>For example, consider a local Lancaster pub that has a nice selection of beers. Their sales likely follow a pattern, such as higher sales on weekends or during special events like football matches. These patterns are the structure we can capture and forecast. However, the pub can&#8217;t anticipate when my friend Yves will visit me and we&#8217;ll go out for drinks. This element of uncertainty forms the noise &#8211; while it&#8217;s explainable from my perspective, it&#8217;s a mystery to the pub owner.</p>
<p>But as I said, the idea of structure vs. noise isn’t just relevant in demand forecasting; it applies in many other areas too. Take classification, for instance. When identifying mushrooms, you might not be able to tell for sure whether you’re looking at a Rosy Brittlegill or The Sickener without a microscope. While certain characteristics (like stem shape or cap colour) make up the structure, there’s always some randomness that can make one mushroom look like another. So, in classification, you can only say that it’s more likely that we have one type of mushroom rather than the other, and you need to consider the uncertainty around this choice (the modern approach to this is to use conformal prediction).</p>
<p>Furthermore, we as humans are very good at finding patterns in the noise. If you look at clouds and see a mushroom, it&#8217;s not a real mushroom, just a random arrangement of vapour. So when you work with data, remember this tendency and don&#8217;t fall into the trap of finding patterns that don&#8217;t actually exist. Be critical and avoid overfitting the noise.</p>
<p>As you can see, the concept of structure versus noise is fundamental and shows up in many contexts. In forecasting, our job is to capture the structure and filter out the noise, so that we can produce point forecasts (the future structure) and prediction intervals (representing the size of the uncertainty) and then make adequate decisions.</p>
<p>Message <a href="https://openforecast.org/2024/08/13/structure-vs-noise-a-fundamental-concept-in-forecasting/">Structure vs. Noise: A Fundamental Concept in Forecasting</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2024/08/13/structure-vs-noise-a-fundamental-concept-in-forecasting/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Complex-Valued Econometrics with Examples in R</title>
		<link>https://openforecast.org/2024/08/04/complex-valued-econometrics-with-examples-in-r/</link>
					<comments>https://openforecast.org/2024/08/04/complex-valued-econometrics-with-examples-in-r/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Sun, 04 Aug 2024 14:33:40 +0000</pubDate>
				<category><![CDATA[Complex-valued models]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[complex variables]]></category>
		<category><![CDATA[multivariate models]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3632</guid>

					<description><![CDATA[<p>Back in 2022, my father asked me to help him in amending and editing a monograph he wrote on the topic of &#8220;Complex-Valued Econometrics&#8221;. The original book focused on dynamic models, but after looking through the material and a thorough discussion, we decided to write something more fundamental. The monograph is based on the research [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2024/08/04/complex-valued-econometrics-with-examples-in-r/">Complex-Valued Econometrics with Examples in R</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Back in 2022, my father asked me to help him in amending and editing a monograph he wrote on the topic of &#8220;Complex-Valued Econometrics&#8221;. The original book focused on dynamic models, but after looking through the material and a thorough discussion, we decided to write something more fundamental. The monograph is based on the research he has done over the years, working in Saint Petersburg. I developed an R package called &#8220;<a href="https://github.com/config-i1/complex">complex</a>&#8221; to support the book and then expanded the text with some derivations and examples of application. The result was then submitted to Springer and is <a href="https://doi.org/10.1007/978-3-031-62608-1">now finally published</a> in their &#8220;Contributions to Economics&#8221; series. Unfortunately, due to the agreement with the publisher, we cannot make the book freely available, but some of related materials can be found on a github repo, <a href="https://github.com/config-i1/complex-econometrics">here</a>.</p>
<p>We will receive royalties from selling this book, and we have decided to direct them to a charity to help Ukrainians (<a href="https://www.justgiving.com/campaign/ukraine-aid-help-now">this one</a>).</p>
<p>And here is what the cover of the book looks like:<br />
<div id="attachment_3638" style="width: 209px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/uploads/2024/08/title.webp"><img decoding="async" aria-describedby="caption-attachment-3638" src="https://openforecast.org/wp-content/uploads/2024/08/title-199x300.webp" alt="Complex-Valued Econometrics with Examples in R" width="199" height="300" class="size-medium wp-image-3638" srcset="https://openforecast.org/wp-content/uploads/2024/08/title-199x300.webp 199w, https://openforecast.org/wp-content/uploads/2024/08/title-680x1024.webp 680w, https://openforecast.org/wp-content/uploads/2024/08/title-768x1157.webp 768w, https://openforecast.org/wp-content/uploads/2024/08/title.webp 827w" sizes="(max-width: 199px) 100vw, 199px" /></a><p id="caption-attachment-3638" class="wp-caption-text">Complex-Valued Econometrics with Examples in R</p></div>
<p><a href="https://doi.org/10.1007/978-3-031-62608-1">Svetunkov S., Svetunkov I. (2024). Complex-Valued Econometrics with Examples in R: Modelling, Regression and Applications. Springer Cham. 154 pages. DOI: 10.1007/978-3-031-62608-1</a></p>
<p>Message <a href="https://openforecast.org/2024/08/04/complex-valued-econometrics-with-examples-in-r/">Complex-Valued Econometrics with Examples in R</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2024/08/04/complex-valued-econometrics-with-examples-in-r/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Multistep loss functions: Geometric Trace MSE</title>
		<link>https://openforecast.org/2024/06/04/multistep-loss-functions-geometric-trace-mse/</link>
					<comments>https://openforecast.org/2024/06/04/multistep-loss-functions-geometric-trace-mse/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 04 Jun 2024 09:05:56 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[Univariate models]]></category>
		<category><![CDATA[ARIMA]]></category>
		<category><![CDATA[estimators]]></category>
		<category><![CDATA[ETS]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3594</guid>

					<description><![CDATA[<p>While there is a lot to say about multistep losses, I&#8217;ve decided to write the final post on one of them and leave the topic alone for a while. Here it goes. Last time, we discussed MSEh and TMSE, and I mentioned that both of them impose shrinkage and have some advantages and disadvantages. One [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2024/06/04/multistep-loss-functions-geometric-trace-mse/">Multistep loss functions: Geometric Trace MSE</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>While there is a lot to say about multistep losses, I&#8217;ve decided to write the final post on one of them and leave the topic alone for a while. Here it goes.</p>
<p>Last time, we discussed <a href="/2024/05/25/recursive-vs-direct-forecasting-strategy/">MSEh</a> and <a href="/2024/06/01/multistep-loss-functions-trace-mse/">TMSE</a>, and I mentioned that both of them impose shrinkage and have some advantages and disadvantages. One of the main advantages of TMSE was in reducing computational time in comparison with MSEh: you fit just one model instead of doing it h times. However, the downside of TMSE is that it averages things out, and we end up with model parameters that minimize the h-steps-ahead forecast error to a much larger extent than the errors at shorter horizons. For example, if the one-step-ahead MSE was 500, while the six-steps-ahead MSE was 3000, the impact of the latter in TMSE would be six times higher than that of the former, and the estimator would prioritize the minimization of the longer-horizon error.</p>
<p>A more balanced version of this was introduced in <a href="/2023/08/09/multi-step-estimators-and-shrinkage-effect-in-time-series-models/">our paper</a> and was called &#8220;Geometric Trace MSE&#8221; (GTMSE). The main idea of GTMSE is to take the geometric mean or, equivalently, the sum of logarithms of MSEh instead of taking the arithmetic mean. Because of that, the impact of MSEh on the loss becomes comparable with the effect of MSE1, and the model performs well throughout the whole horizon from 1 to h. For the same example of MSEs as above, the base-10 logarithm of 500 is approximately 2.7, while that of 3000 is approximately 3.5. The difference between the two is much smaller, reducing the impact of the long-term forecast uncertainty. As a result, GTMSE has the following features:</p>
<ul>
<li>It imposes shrinkage on model parameters.</li>
<li>The strength of shrinkage is proportional to the forecast horizon.</li>
<li>But it is much milder than in the case of MSEh or TMSE.</li>
<li>It leads to more balanced forecasts, performing well on average across the whole horizon.</li>
</ul>
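<p>Using the numbers from the example above, a couple of lines of R show how GTMSE rebalances the contributions of the horizons:</p>

```r
# Per-horizon in-sample MSEs from the example above
mse <- c(500, 3000)

# TMSE: arithmetic sum, so the longer horizon dominates
round(mse / sum(mse), 2)   # relative weights: 0.14 and 0.86

# GTMSE: sum of logarithms, so the contributions become comparable
round(log10(mse), 2)       # roughly 2.7 and 3.48
```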
<p>In that paper, we did extensive simulations to see how different estimators behave, and we found that:</p>
<ol>
<li>If an analyst is interested in parameters of models, they should stick with the conventional loss functions (based on one-step-ahead forecast error) because the multistep ones tend to produce biased estimates of parameters.</li>
<li>On the other hand, multistep losses shrink the redundant parameters towards zero faster than the conventional one, so there might be a benefit in the case of overparameterized models.</li>
<li>At the same time, if forecasting is of the main interest, then multistep losses might bring benefits, especially on larger samples.</li>
</ol>
<div id="attachment_3595" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/06/2024-06-04-Multistep-Example.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-3595" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/06/2024-06-04-Multistep-Example-300x200.png&amp;nocache=1" alt="ETS(A,A,A) estimated using different loss functions applied to the data with multiplicative seasonality" width="300" height="200" class="size-medium wp-image-3595" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/06/2024-06-04-Multistep-Example-300x200.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/06/2024-06-04-Multistep-Example-1024x681.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/06/2024-06-04-Multistep-Example-768x511.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/06/2024-06-04-Multistep-Example.png&amp;nocache=1 1049w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-3595" class="wp-caption-text">ETS(A,A,A) estimated using different loss functions applied to the data with multiplicative seasonality</p></div>
<p>The image above shows an example from our paper, where we applied an additive model to data that exhibits apparent multiplicative seasonality. Despite that, we can see that the multistep losses did a much better job than the conventional MSE, compensating for the misspecification.</p>
<p>Message <a href="https://openforecast.org/2024/06/04/multistep-loss-functions-geometric-trace-mse/">Multistep loss functions: Geometric Trace MSE</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2024/06/04/multistep-loss-functions-geometric-trace-mse/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Multistep loss functions: Trace MSE</title>
		<link>https://openforecast.org/2024/06/01/multistep-loss-functions-trace-mse/</link>
					<comments>https://openforecast.org/2024/06/01/multistep-loss-functions-trace-mse/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Sat, 01 Jun 2024 11:29:12 +0000</pubDate>
				<category><![CDATA[Social media]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[Univariate models]]></category>
		<category><![CDATA[estimators]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3586</guid>

					<description><![CDATA[<p>As we discussed last time, there are two possible strategies in forecasting: recursive and direct. The latter aligns with the estimation of a model using a so-called multistep loss function, such as Mean Squared Error for h-steps-ahead forecast (MSEh). But this is not the only loss function that can be efficiently used for model estimation. [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2024/06/01/multistep-loss-functions-trace-mse/">Multistep loss functions: Trace MSE</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>As we discussed <a href="/2024/05/25/recursive-vs-direct-forecasting-strategy/">last time</a>, there are two possible strategies in forecasting: recursive and direct. The latter aligns with the estimation of a model using a so-called multistep loss function, such as Mean Squared Error for h-steps-ahead forecast (MSEh). But this is not the only loss function that can be efficiently used for model estimation. Let&#8217;s discuss another popular option.</p>
<p>But before that, let&#8217;s take a step back to recap what we are talking about. All the multistep losses imply that we fit the model to the data in the conventional way and then recursively produce 1- to h-steps-ahead point forecasts from each in-sample observation, from the very first to the very last one. We can then calculate the forecast errors and collect them in a matrix with observations in rows and horizons in columns, as shown in the image below, generated using the <code>rmultistep()</code> function from the <code>smooth</code> package in R:</p>
<div id="attachment_3587" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/06/2024-06-01-multistep-matrix.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-3587" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/06/2024-06-01-multistep-matrix-300x185.png&amp;nocache=1" alt="An example of a matrix of multistep forecast errors" width="300" height="185" class="size-medium wp-image-3587" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/06/2024-06-01-multistep-matrix-300x185.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/06/2024-06-01-multistep-matrix-768x475.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2024/06/2024-06-01-multistep-matrix.png&amp;nocache=1 791w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-3587" class="wp-caption-text">An example of a matrix of multistep forecast errors</p></div>
<p>After that, we can calculate any of the multistep loss functions. MSEh, for example, would simply be the mean of squared errors in the last column of that matrix.</p>
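<p>The same matrix can be sketched in base R without any packages, using a naive forecast on simulated data for illustration (<code>rmultistep()</code> does this properly for a fitted model):</p>

```r
set.seed(42)
y <- cumsum(rnorm(120))   # a simulated random-walk series
h <- 6                    # maximum forecast horizon

# Rows are forecast origins, columns are horizons 1..h. With a naive
# forecast, the j-steps-ahead forecast from origin t is y[t], so the
# error is y[t + j] - y[t]
origins <- 1:(length(y) - h)
errorMatrix <- sapply(1:h, function(j) y[origins + j] - y[origins])

# Column-wise MSEs are the building blocks of the multistep losses;
# MSEh itself is the mean of the squared errors in the last column
mseByHorizon <- colMeans(errorMatrix^2)
round(mseByHorizon, 2)
```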
<p>One of the most straightforward modifications of MSEh is a loss function called &#8220;Trace MSE&#8221; (TMSE), which is the sum of the MSEs of the columns in that matrix. It has some advantages and disadvantages in comparison with MSEh. Here are some:</p>
<ul>
<li>Because we sum up MSEs for different horizons, those closer to h will tend to be higher than those close to 1, simply because typically, with an increase of the horizon, uncertainty increases as well.</li>
<li>The previous point means that the model estimated via TMSE will care less about short-term forecasts and will focus more on longer ones.</li>
<li>But at least it will not be as myopic as a model estimated with a specific MSEh.</li>
<li>You do not need to estimate h models; you can estimate just one, and it will be optimized for the entire horizon from 1 to h.</li>
<li>This means that you save on computations, making the estimation and forecasting roughly h times faster than in the case of MSEh.</li>
<li><a href="https://doi.org/10.1016/j.ijpe.2019.107597">Kourentzes et al. (2019)</a> showed that TMSE slightly outperformed MSE1 and MSEh. In fact, in one of the early versions of that paper, <a href="https://kourentzes.com/forecasting/2015/08/10/true-models-trace-optimisation-and-parameter-shrinkage-2/">Kourentzes &#038; Trapero</a> showed how well TMSE performs in the example of solar irradiation forecasting with ETS.</li>
<li>TMSE imposes shrinkage on parameters of dynamic models, which makes them less reactive and avoids overfitting.</li>
<li>But the shrinkage is not as strong as in the case of MSEh.</li>
</ul>
<p>This is discussed in <a href="/2023/08/09/multi-step-estimators-and-shrinkage-effect-in-time-series-models/">the paper</a> I wrote together with Nikolaos Kourentzes and Rebecca Killick.</p>
<p>Some examples of application of TMSE are provided in <a href="/adam/multistepLosses.html">Section 11.3 of ADAM</a>.</p>
<p>Also, Peter Laurinec did an independent exploration of multistep losses and wrote <a href="https://petolau.github.io/Multistep-loss-optimized-forecasting-with-ADAM/">this nice post</a>.</p>
<p>Message <a href="https://openforecast.org/2024/06/01/multistep-loss-functions-trace-mse/">Multistep loss functions: Trace MSE</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2024/06/01/multistep-loss-functions-trace-mse/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
