<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Archives bla-bla-bla - Open Forecasting</title>
	<atom:link href="https://openforecast.org/tag/bla-bla-bla/feed/" rel="self" type="application/rss+xml" />
	<link>https://openforecast.org/tag/bla-bla-bla/</link>
	<description>How to look into the future</description>
	<lastBuildDate>Fri, 15 Mar 2024 12:43:24 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2015/08/cropped-usd-05-32x32.png&amp;nocache=1</url>
	<title>Archives bla-bla-bla - Open Forecasting</title>
	<link>https://openforecast.org/tag/bla-bla-bla/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Forecasting for the sake of forecasting</title>
		<link>https://openforecast.org/2020/03/23/forecasting-for-the-sake-of-forecasting/</link>
					<comments>https://openforecast.org/2020/03/23/forecasting-for-the-sake-of-forecasting/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 23 Mar 2020 22:24:19 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[hype]]></category>
		<category><![CDATA[bla-bla-bla]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=2385</guid>

					<description><![CDATA[<p>You have probably already noticed that we are in the middle of the COVID-19 pandemic these days (breaking news: the UK has just announced a lockdown due to the virus). The amount of news, memes and noise on the topic coming from around the world is astonishing! What is also astonishing is the number of posts on [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2020/03/23/forecasting-for-the-sake-of-forecasting/">Forecasting for the sake of forecasting</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>You have probably already noticed that we are in the middle of the COVID-19 pandemic these days (breaking news: the UK has just announced a lockdown due to the virus). The amount of news, memes and noise on the topic coming from around the world is astonishing! What is also astonishing is the number of posts on data analysis and forecasting about the pandemic. Many data scientists, statisticians, machine learning experts, and even people who don’t know much about data analysis, feel that they have to make a difference. All of a sudden they have become experts in forecasting epidemics. They tell you how many cases of COVID-19 we will have next week, they predict how many people will die, they forecast how many COVID-19 cases will be reported in the US by 31 March 2021, and so on, and so forth. These experts use Simulations, Exponential Smoothing, ARIMA, Bayesian methods, Neural Networks, judgment and whatever else they know in order to predict these things (I’m not giving links here – use Google if you want to find them). My head aches because of all this noise around us, and I don’t think that what these people do is helpful at all. And here is why.</p>
<h3>Why?</h3>
<p>First, without a fundamental background in epidemiology, such forecasts are often just exercises in model fitting. Some of these experts think that they can filter out the noise, extract a structure, or construct a fancy model that describes something, and that this is it. But all of that should be done in light of theory and domain knowledge. One cannot just fit an ARIMA model to the data, produce forecasts and claim to have done something reasonable or useful. Without an understanding of the problem, this becomes an exercise in using R / Python. It does not make anyone an expert in the area, nor does it mean that they have done something sensible. To add to this point, <a href="https://robjhyndman.com/hyndsight/forecasting-covid19/" rel="noopener noreferrer" target="_blank">Rob Hyndman has recently summarised reasons</a> why time series models are not really useful in this context, but in my opinion the problem is wider and applies to many other analytical tools as well, if they are used without proper expertise.</p>
<p>Second, we don’t really know the real situation at the moment. The data we use is probably incorrect and incomplete (again, <a href="https://robjhyndman.com/hyndsight/forecasting-covid19/" rel="noopener noreferrer" target="_blank">see Rob&#8217;s post</a> for a discussion of this). For example, some countries seem to have stopped testing people in order not to spread the virus further, but even where testing happens, there is no proper way of saying how many people really have the virus. It looks like for the majority of the population it passes without big issues; it is only the smaller proportion that is seen on the surface. And if we construct models using such data, then the conclusions we make will inevitably be incorrect and incomplete as well. This is unless we use proper models and have the necessary domain knowledge (go back to point one of this post)&#8230;</p>
<p>Third, all these analyses and forecasts do not help in decision making; they are done just out of curiosity, without any specific purpose. For instance, an expert predicted that we would have a total of between 53 and 530 million cases of COVID-19 reported worldwide by 31 March 2021. So what? What do we do with that? Does this help decision makers? No. Does this help people understand what they should do? Again, no. This is just a forecast for the sake of forecasting. The COVID-19 topic is the current hype in analytics, and one can get attention and potentially scientific publications by doing anything in this direction. But the contribution of such analytics / forecasting exercises to science and society at large is limited (if they are useful at all).</p>
<h3>What can we do?</h3>
<p>Instead of producing “actionless” forecasts, we should focus on what we expect to happen to society and the economy. The lockdowns and self-isolation are hurting the economy, but they are inevitable: this is a trade-off between public health and public prosperity. Some questions to consider:</p>
<ul>
<li>Can we predict how the virus will spread in different scenarios (early lockdown / late lockdown / no lockdown)?</li>
<li>Can we predict what will happen with business during the lockdown period?</li>
<li>How many bankruptcies will we see?</li>
<li>What types of companies will go down first?</li>
<li>How will this impact product prices?</li>
<li>How many people will lose their jobs because of the hit on the economy?</li>
</ul>
<p>All these are important questions that can help us decide what we should do now, and how we should support business and society. Most probably, we cannot use time series methods and simple data analysis to answer these questions, for the reasons outlined above. So we either need to use domain-specific models or judgmental methods, but this needs to be done with support from experts in the area.</p>
<p>Another interesting example is the observed panic buying, which has already damaged supply chains. People have suddenly started buying on average 200% more than they usually do. We can phrase a few important operational research questions:</p>
<ul>
<li>How will the supply chain react?</li>
<li>What are the effects of panic buying in the long term?</li>
<li>When will this stop and when will demand normalise again?</li>
<li>What will happen with demand after the end of panic?</li>
</ul>
<p>These are also important questions that help in making decisions here and now for different groups of people. They are not simple questions to answer, and they require some effort, but at least the answers to them are useful.</p>
<h3>Summary</h3>
<p>I will not give you any predictions about COVID-19, because I am not an expert in the area. But I can tell you that there is too much hype, noise and panic around the topic. We (forecasters, statisticians, data scientists, etc.) need to help society, not create even more noise and panic. So if you want to analyse or forecast something related to COVID-19, make sure that your results will help other people. If they won’t, then don’t!</p>
<p>Message <a href="https://openforecast.org/2020/03/23/forecasting-for-the-sake-of-forecasting/">Forecasting for the sake of forecasting</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2020/03/23/forecasting-for-the-sake-of-forecasting/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>M-competitions, from M4 to M5: reservations and expectations</title>
		<link>https://openforecast.org/2020/03/01/m-competitions-from-m4-to-m5-reservations-and-expectations/</link>
					<comments>https://openforecast.org/2020/03/01/m-competitions-from-m4-to-m5-reservations-and-expectations/#comments</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Sun, 01 Mar 2020 20:39:52 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[bla-bla-bla]]></category>
		<category><![CDATA[intermittent demand]]></category>
		<category><![CDATA[regular demand]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=2328</guid>

					<description><![CDATA[<p>UPDATE: I have also written a short post on &#8220;The role of M competitions in forecasting&#8221;, which gives a historical perspective and a brief overview of the main findings of the previous competitions. Some of you might have noticed that the guidelines for the M5 competition have finally been released. Those of you who have previously [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2020/03/01/m-competitions-from-m4-to-m5-reservations-and-expectations/">M-competitions, from M4 to M5: reservations and expectations</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><strong>UPDATE</strong>: I have also written a short post on &#8220;<a href="/2024/03/14/the-role-of-m-competitions-in-forecasting/">The role of M competitions in forecasting</a>&#8221;, which gives a historical perspective and a brief overview of the main findings of the previous competitions.</p>
<p>Some of you might have noticed that the guidelines for the M5 competition <a href="https://mofc.unic.ac.cy/m5-competition/">have finally been released</a>. Those of you who have previously visited this blog might know about my scepticism regarding the M4 competition. So I&#8217;ve decided to write this small post, outlining my reservations about the M4 and my thoughts and expectations about the M5.</p>
<h3>Reservations about the M4 competition</h3>
<p>Although I participated in two submissions to the M4, in the end I found the results unsatisfactory. My main reservations were with how the results were interpreted by the organisers, who announced right after the end of the M4 that &#8220;Machine Learning doesn&#8217;t work&#8221;, which was repeated a few times on the <a href="https://twitter.com/spyrosmakrid">Twitter of Spyros Makridakis</a> (see, for example, <a href="https://twitter.com/spyrosmakrid/status/1005004437507706880">this</a> and <a href="https://twitter.com/spyrosmakrid/status/1177146041075994624">this</a>) and in their <a href="https://doi.org/10.1016/j.ijforecast.2018.06.001">summary paper in the IJF</a>. This was peddled a few times on social media, and I did not take it well. I participated in the M4 competition in order to support Spyros, but I did not expect that I would contribute to an attack on Machine Learning. Besides, the claims that ML methods didn&#8217;t work were not valid. First, the <a href="https://eng.uber.com/m4-forecasting-competition/">winning method of Slawek Smyl</a> uses a Machine Learning technique. Second, there were not many ML methods submitted to the competition, because the ML community was not involved in it. Third, the setting of the competition itself was not suitable for ML in the first place. What sort of non-linearity do we expect to see in yearly, quarterly or monthly data anyway? We already know that using ML makes sense when you have a lot of observations, so that you can train the methods properly and not overfit the data. Yes, there was also a handful of weekly, daily and hourly series in the M4 competition, but they constituted only 5% of the whole dataset. So the fact that MLP and RNN failed to perform well when applied to each separate time series is not surprising. Furthermore, the organisers did not have ML specialists in their team, so I don&#8217;t think it was a fair setting in the first place to conclude that &#8220;ML methods don&#8217;t work&#8221;. To me, it was similar to professional cyclists organising a race between supercars and bicycles on the narrow streets of an old, crowded city. The conclusion would be &#8220;the supercars don&#8217;t work&#8221;.</p>
<p>Furthermore, it appeared that there were several time series of a weird nature. For example, there were yearly series with 835 observations (e.g. series Y13190), some series without any variability for long periods of time (e.g. 29 years of no change in monthly data, as in series M16833), and some series that end in the future (e.g. Y3820 ends in the year 2378). Some other examples of weird data were discussed in <a href="https://doi.org/10.1016/j.ijforecast.2019.03.018">Darin &#038; Stellwagen (2020)</a>. Furthermore, <a href="https://doi.org/10.1016/j.ijforecast.2019.02.018">Ingel et al. (2020)</a> showed that some time series in the dataset were highly correlated, meaning that there was data leakage in the M4 competition. All of this does not have much to do with the time series we usually deal with in real life, at least not in the area I work in. We should acknowledge that preparing such a big competition as the M4 is not an easy task (it is not possible to look through all 100,000 time series), and I&#8217;m sure that the organisers did the best they could to make sure that the dataset contained meaningful series. However, while all of this probably happened because of the complexity of the competition, I now have my reservations about using the M4 data for any future experiments.</p>
<p>I also had my reservations about the <a href="https://openforecast.org/en/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/">error measures</a> used in the competition, although I accepted them when I participated in the M4. I still think that it was a mistake to use sMAPE, because we now see a backlash, with practitioners from different companies referring to the M4 and asking us about sMAPE. I have to explain to them the limitations of this error measure and why they should not use it. And I am afraid that we will have more and more forecasting-related papers using sMAPE for forecast evaluation just because of the M4 competition, although we have known since <a href="http://doi.org/10.1016/S0169-2070(99)00007-2">Goodwin &#038; Lawton (1999)</a> that it has serious issues (it prefers methods that overshoot the data). In my opinion, this is a serious step back for science.</p>
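<p>To make this concrete, here is a small numerical sketch (my own illustration, with made-up numbers) of the issue Goodwin &#038; Lawton pointed out: for the same absolute error, sMAPE rewards the forecast that overshoots the data.</p>

```python
# sMAPE for a single point, in percent. For the same absolute error,
# an overshooting forecast has a larger denominator and thus a lower
# (better-looking) sMAPE than an undershooting one.

def smape(actual, forecast):
    return 200 * abs(actual - forecast) / (actual + forecast)

actual = 100
print(smape(actual, 110))  # overshoot by 10: ~9.52%
print(smape(actual, 90))   # undershoot by 10: ~10.53%
```

<p>Both forecasts miss by 10 units, yet the one that overshoots looks more accurate, so a method can score better on sMAPE simply by systematically forecasting too high.</p>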
<p>To be fair, there were also good results from the M4 competition, and while I am critical of it in general, I don&#8217;t claim that it was a complete failure. It showed how ML can be efficiently used on big datasets: see the papers of <a href="https://doi.org/10.1016/j.ijforecast.2019.03.017">Slawek Smyl (2020)</a> and <a href="https://doi.org/10.1016/j.ijforecast.2019.02.011">Montero-Manso et al. (2020)</a>. It reconfirmed the previous finding that combinations of forecasting methods on average perform better in terms of accuracy than individual ones. It also showed that an increase in accuracy can be achieved with more complex methods; I personally like Figure 2 of the <a href="https://doi.org/10.1016/j.ijforecast.2019.04.014">Makridakis et al. (2020)</a> paper, which presents this finding in a nice, concise way. And yes, it helped in promoting forecasting techniques developed by different researchers in the area.</p>
<p>Summarising all of that, in my opinion the M4 was a great opportunity to move science forward and promote forecasting. Unfortunately, I have a feeling that it was not well thought through by the organisers and, as a result, did not achieve this as well as it should have.</p>
<h3>Thoughts and expectations about the M5 competition</h3>
<p>Now we arrive at the <a href="https://mofc.unic.ac.cy/m5-competition/">M5 competition</a>. It will be based on data from Walmart, containing 42,840 daily time series from 29th January 2011 to 19th June 2016, forming a natural hierarchy / grouped time series of SKUs &#8211; departments &#8211; shops &#8211; states, together with several categories. It has been announced that this will be a mixture of intermittent and non-intermittent data, and that explanatory variables (such as prices and promotions) will be provided. So far it sounds quite exciting, because this is as close to reality as it can get.</p>
<p>When it comes to forecast evaluation, the organisers use a modification of <a href="https://doi.org/10.1016/j.ijforecast.2006.03.001">MASE</a>, which they call the Root Mean Squared Scaled Error (RMSSE), and although it is not as interpretable as <a href="/en/2019/08/25/are-you-sure-youre-precise-measuring-accuracy-of-point-forecasts/">sMSE or relative RMSE</a>, it should do a good job of measuring the mean performance of methods, especially given the complexity of the task and the fact that participants will have to <a href="/en/2020/01/13/what-about-all-those-zeroes-measuring-performance-of-models-on-intermittent-demand/">deal with intermittent data</a>. I totally understand the issue with the relative RMSE becoming equal to zero in some cases on intermittent data; I have encountered this several times myself. So I think that RMSSE is a good choice of error measure for a competition like this.</p>
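<p>For readers unfamiliar with the measure, here is a minimal sketch of RMSSE for a single series (the toy numbers are made up, and the official M5 definition may differ in details such as the weighting across series): the mean squared forecast error on the holdout is scaled by the mean squared one-step naive error on the training data, and the square root is taken.</p>

```python
# RMSSE sketch: squared forecast errors on the holdout, scaled by the
# in-sample squared errors of the one-step naive (random walk) forecast.

def rmsse(train, actual, forecast):
    # Mean squared error of the naive method on the training sample
    naive_mse = sum(
        (train[t] - train[t - 1]) ** 2 for t in range(1, len(train))
    ) / (len(train) - 1)
    # Mean squared error of the forecasts on the holdout
    mse = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
    return (mse / naive_mse) ** 0.5

train = [10, 12, 11, 13, 12, 14]         # made-up in-sample data
print(rmsse(train, [13, 15], [14, 14]))  # < 1 means better than naive
```

<p>Because the scaling uses squared differences rather than absolute ones, the denominator stays positive even on intermittent series, as long as the training data is not entirely constant.</p>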
<p>They will also measure the accuracy of prediction intervals using a scaled version of the <a href="/en/2019/10/18/how-confident-are-you-assessing-the-uncertainty-in-forecasting/">pinball loss</a>, which makes sense if one is interested in the accuracy of different methods in predicting specific quantiles of a distribution. The quantiles also relate to inventory planning decisions, although not always directly (inevitably, aggregating over the lead time is avoided in the M5 competition, because it would probably complicate a lot of things).</p>
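<p>For illustration, here is a sketch of the basic (unscaled) pinball loss for one observation and one quantile forecast; the M5 additionally scales it, which is omitted here, and the numbers below are made up.</p>

```python
# Pinball (quantile) loss: errors on the two sides of the forecast are
# weighted by q and (1 - q), so minimising it targets the q-th quantile.

def pinball(actual, forecast, q):
    if actual >= forecast:
        return q * (actual - forecast)
    return (1 - q) * (forecast - actual)

# For a high quantile (q = 0.95) the loss punishes actuals above the
# forecast much more than actuals the same distance below it:
print(pinball(120, 100, 0.95))  # 0.95 * 20 = 19.0
print(pinball(80, 100, 0.95))   # 0.05 * 20 = 1.0
```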
<p>The organisers plan to have two stages in the competition, starting on 2nd March, releasing the interim leaderboard on 31st May and setting the final deadline for 30th June. So, if you participate, you will have roughly three months for analysing the data, developing your methods and producing forecasts, and then one more month (from 31st May to 30th June) for the necessary corrections and re-submission of the forecasts in the light of the new data.</p>
<p>Finally, they say that one of the things they want to do is &#8220;to compare the accuracy/uncertainty of Machine Learning and Deep Learning methods vis-à-vis those of standard statistical ones&#8221;. I am not an expert in the ML area, but given the length and quantity of the data, I think this will be a good dataset for training all the fancy artificial neural networks and random forests out there. And while the previous competition did not see many submissions from the ML community, I am sure that this one will have a lot, mainly because the M5 will be held on the <a href="https://www.kaggle.com/">Kaggle platform</a>, which typically attracts researchers from the ML, Data Science and Computer Science areas. And I hope that we won&#8217;t see a new version of &#8220;Machine Learning doesn&#8217;t work&#8221; on Twitter after the M5.</p>
<p>Summing up, I am quite optimistic about the M5 competition. It seems to improve upon the M4, and it looks like it has addressed all the main concerns that some researchers previously had (see, for example, <a href="https://doi.org/10.1016/j.ijforecast.2019.04.012">Fildes, 2020</a>), including the issues that I had with the M4.</p>
<h3>M5 competition: to do or not to do?</h3>
<p>And now an important question: to participate or not to participate.</p>
<p>When I participated in the M4, I was involved in two submissions. <a href="https://doi.org/10.1016/j.ijforecast.2019.01.006">The first one</a> was mainly done by <a href="https://researchportal.bath.ac.uk/en/persons/fotios-petropoulos">Fotios Petropoulos</a>; my role there was quite insignificant, limited to discussing a couple of ideas. All the coding, calculations and the submission itself were done by him. The second one was planned as a submission from the CMAF, but I was the only person from the Centre involved. In the end I got help from <a href="https://research.monash.edu/en/persons/mahdi-abolghasemi">Mahdi Abolghasemi</a>, and if it wasn&#8217;t for him, we wouldn&#8217;t have submitted at all. In fact, we never tested our method on the holdout and ran out of time, because I could not finish the calculations before the deadline. I even asked <a href="http://www.fsu.gr/en/doctors/evangelos-spiliotis">Evangelos Spiliotis</a> for an extension, suggesting that our submission be excluded from the official leaderboard. So, I know how time-consuming this work can be. This is not a simple blind application of a method you came up with in the shower to a set of \(k\) time series. You need to invest time in exploring the dataset, analysing the relations, selecting variables, finding the most suitable models, testing them, and then reanalysing, refining, reselecting and retesting, again and again. And because of that I will not participate in the M5 competition. Given the complexity of the problem, the time restrictions and my current workload, I don&#8217;t see how I would be able to handle it on time (I cannot even finish marking my students&#8217; courseworks&#8230;). It&#8217;s a shame, because this looks like a very good competition with very nice potential.</p>
<p>But this is me and my personal problems. If you are thinking of participating in the competition, I would encourage you to do so. It looks very well thought through. I hope that it lives up to our expectations!</p>
<p>All the information about the M5 Competition can be found on the <a href="https://mofc.unic.ac.cy/m5-competition/">MOFC website</a>.</p>
<p>Message <a href="https://openforecast.org/2020/03/01/m-competitions-from-m4-to-m5-reservations-and-expectations/">M-competitions, from M4 to M5: reservations and expectations</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2020/03/01/m-competitions-from-m4-to-m5-reservations-and-expectations/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
		<item>
		<title>True model</title>
		<link>https://openforecast.org/2016/06/25/true-model/</link>
					<comments>https://openforecast.org/2016/06/25/true-model/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Sat, 25 Jun 2016 12:12:48 +0000</pubDate>
				<category><![CDATA[Theory of forecasting]]></category>
		<category><![CDATA[bla-bla-bla]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[theory]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=715</guid>

					<description><![CDATA[<p>In the modern statistical literature there is a notion of a &#8220;true model&#8221;, by which people usually mean some abstract mathematical model presumably lying at the core of the observed process. Roughly speaking, it is implied that the data we have has been generated by some big guy with a white beard sitting in the mathematical clouds using some [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2016/06/25/true-model/">True model</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>In the modern statistical literature there is a notion of a &#8220;true model&#8221;, by which people usually mean some abstract mathematical model presumably lying at the core of the observed process. Roughly speaking, it is implied that the data we have has been generated by some big guy with a white beard, sitting in the mathematical clouds, using some function. So the aim of a researcher is to get as close to that function as possible. You may also sometimes encounter the term &#8220;Data Generating Process&#8221; (DGP), which is usually used as a synonym for the true model.</p>
<p>But here it gets a bit confusing – no one has ever seen a true model, which makes it a mythical character, like a unicorn, Superman or Jesus Christ. So you can believe in it or not, but you cannot prove its existence. There are even big books telling us about the true model, how to reach it and how it can save us all, which in fact do not prove its existence but usually just imply it. The bad thing is that these books do not explain how the hell some mathematical function can generate the real sales we observe, and whether it has anything to do with reality. I personally dislike this definition of a true model, because I don&#8217;t find it really helpful, and in my opinion there are some aspects of modelling that need clarification. For example, what really is a true model? How is it connected with the DGP and reality? Does it exist at all? Is it reachable? What does it look like? And so what?</p>
<div id="attachment_716" style="width: 210px" class="wp-caption alignnone"><a href="/wp-content/uploads/2016/06/true-model.jpg"><img fetchpriority="high" decoding="async" aria-describedby="caption-attachment-716" src="/wp-content/uploads/2016/06/true-model-200x300.jpg" alt="True model can look weird..." width="200" height="300" class="size-medium wp-image-716" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2016/06/true-model-200x300.jpg&amp;nocache=1 200w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2016/06/true-model-768x1152.jpg&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2016/06/true-model-683x1024.jpg&amp;nocache=1 683w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2016/06/true-model.jpg&amp;nocache=1 1280w" sizes="(max-width: 200px) 100vw, 200px" /></a><p id="caption-attachment-716" class="wp-caption-text">True model can look weird&#8230;</p></div>
<h3>Imagine</h3>
<p>In order to answer these questions we need to get to the core of the problem and imagine how the data we work with is really generated. Let us travel to the land of imagination and, to make this trip pleasant, let&#8217;s take the example of beer consumption.</p>
<p>First imagine that for some reason you have entered a shop and found yourself looking at several different bottles of beer, deciding what and how much to buy (if you need to buy anything at all). What is happening in your brain at this moment? You have a desire to drink, which in theory could be measured. You look at prices, trying to figure out which brand to choose, how many bottles to take and (taking your income into account) whether they are worth it at all. You also notice that there is a promotion on one of the beers: buy one, get one crate for free. The influence of all these and other elements on the final decision is hidden, and the process of selection happens very fast. But if we had two superpowers – slowing down time and reading minds – then we would be able to measure these factors and quantify the relations between the number of purchased bottles and all those factors. Some relations could be approximated using straight lines, some using more complicated mathematical functions. So there is a data generating process, and it happens in every human brain, but it happens individually, not on the aggregate level of all the consumers (as is usually implied by the big statistical books).</p>
<p>Note that we can only observe purchases of an integer number of bottles. But I would argue that the decision of how many bottles to buy is discretised from true continuous functions in the hearts of consumers. In order not to overcomplicate things, let&#8217;s discuss the continuous case here.</p>
<p>Now, these dependencies that we want to measure may vary in time, and they will definitely vary from person to person. Furthermore, the DGPs of some individuals could be approximated using logarithms, while others would have exponential, polynomial or even sinusoidal functions. So when we start aggregating these small DGPs for a group of consumers over some period of time, we end up with a very complicated mathematical model.</p>
<p>Let&#8217;s say, for example, that we have two random consumers with completely random names who want to buy the beer &#8220;Agriochorto&#8221; on the Monday of week 11, year 2016. Let&#8217;s call them Nikos and Fotios. They have the following DGPs in them:</p>
<p>\begin{equation} \label{eq:NandF}<br />
   \begin{matrix}<br />
   y_{N,t} = -0.2 \log x_{1,t} + 3 x_{2,t} \\<br />
   y_{F,t} = -0.3 \sqrt{ x_{1,t}} + 4 x_{2,t}<br />
   \end{matrix}<br />
\end{equation}</p>
<p>where \( x_1 \) is the price of Agriochorto beer and \( x_2 \) is the price of some competitor&#8217;s beer, let&#8217;s call it &#8220;Oura&#8221;.<br />
When we aggregate these two DGPs, we end up with something like this:</p>
<p>\begin{equation} \label{eq:aggregate_demand}<br />
   y_{t} = -0.2 \log x_{1,t} -0.3 \sqrt{ x_{1,t}} + 7 x_{2,t}<br />
\end{equation}</p>
<p>Here, because both Nikos and Fotios had a similar perception of Oura, the aggregate demand for our beer depends linearly on the price of the competitor&#8217;s beer. But they had different perceptions of Agriochorto, so the aggregate demand has a strange non-linear relation between the price of our beer and the quantity bought.</p>
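<p>This aggregation can be checked numerically. The sketch below (with made-up prices) simply sums the two individual DGPs from the equations above and confirms that the result coincides with the aggregate function:</p>

```python
import math

# Individual DGPs of Nikos and Fotios from the equations above;
# x1 is the price of Agriochorto, x2 is the price of Oura.
def y_nikos(x1, x2):
    return -0.2 * math.log(x1) + 3 * x2

def y_fotios(x1, x2):
    return -0.3 * math.sqrt(x1) + 4 * x2

# Aggregate demand: non-linear in x1, but still linear in x2.
def y_aggregate(x1, x2):
    return -0.2 * math.log(x1) - 0.3 * math.sqrt(x1) + 7 * x2

x1, x2 = 2.5, 3.0  # made-up prices
print(y_nikos(x1, x2) + y_fotios(x1, x2))
print(y_aggregate(x1, x2))  # the same value: aggregation is just a sum
```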
<p>Obviously, when we add more consumers throughout the day, the model becomes more and more complicated. Also keep in mind that some DGPs may be purely additive, while others may be multiplicative… So after the aggregation we end up with an insane mix. The good news is that these non-linear relations can be approximated using some simple functions (for example, linear ones), so there is no need to analyse the insane mix of DGPs directly; an approximation will suffice.</p>
<p>Note that because of the differences in the individual functions of Nikos, Fotios and all the other thousands of random people, when we approximate the aggregate relations using some functions, we will most probably end up having a constant term in our final model. This term means nothing. It just shows that there are individual differences between customers.</p>
<p>Now, let&#8217;s keep in mind that we looked at the consumers&#8217; behaviour on Monday. A similar thing will happen on Tuesday, but probably with a different set of random people and different individual DGPs, because time has passed, the weather has changed and some factors have become more important in the selection process than others. The resulting aggregate function of demand will differ from yesterday&#8217;s one, although they will share some similarities in the core relations (for example, between price and number of bottles). At the same time, some factors will disappear and others will take their places; some relations will weaken, others will become stronger. But because it is impossible to track all the smaller factors, they may be considered random and distributed, for example, normally. So the core of our sales will have some more or less stable relations between sales and a set of factors, but there will also be &#8220;the unknown&#8221;, appearing and disappearing, something that is sometimes called &#8220;noise&#8221;. Note that there would be no noise if we had all the information and knew all the individual DGPs at every moment in time. But obviously this is cumbersome and not realistic.</p>
<p>So this final model, with all the necessary variables included in the correct forms, with a constant term and random noise, is in my understanding the true model. Keep in mind, though, that some factors may look important but will not in fact influence the individual selection of the majority of consumers in one and the same manner. For example, the bitterness of a beverage may be important only for a small group of beer enthusiasts over a specific period of time (when Saturn is dominated by Venus). Such factors will influence the final sales and may correlate with them if we gather that data, but they are in fact random and should be included in the error term rather than in the model. The other very important point here is that the true model may change over time, because the DGPs in people&#8217;s heads evolve: one year you like mild beer, the next you get bored with it and switch to a bitter one.</p>
<h3>Definition, so what?</h3>
<p>So, let&#8217;s summarise my definition of the true model. It is <strong>a parsimonious model that contains all the necessary variables (not fewer and not more) in appropriate forms, while being at the same time the best model among all possible ones in terms of explaining and predicting the process of interest</strong>. Including unnecessary variables in a model leads to overfitting, while skipping important ones leads to underfitting. There should be a balance, and the true model has it.</p>
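The overfitting and underfitting sides of this balance can be demonstrated with a textbook-style simulation (the data generating process here is assumed for the sketch: a linear relation with normal noise). A model that is too simple misses the relation, while a model with many unnecessary terms chases the noise in the training sample.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

# Assumed "true" relation for this sketch: linear plus normal noise.
x_train = rng.uniform(0, 2, 30)
x_test = rng.uniform(0, 2, 30)
y_train = 2 + 3 * x_train + rng.normal(0, 1, 30)
y_test = 2 + 3 * x_test + rng.normal(0, 1, 30)

results = {}
for degree in (0, 1, 10):
    # Fit polynomials of increasing flexibility to the same data.
    p = Polynomial.fit(x_train, y_train, degree)
    mse_in = np.mean((p(x_train) - y_train) ** 2)
    mse_out = np.mean((p(x_test) - y_test) ** 2)
    results[degree] = (mse_in, mse_out)
    print(f"degree {degree:2d}: in-sample MSE {mse_in:.2f}, "
          f"out-of-sample MSE {mse_out:.2f}")
```

The constant-only model (degree 0) underfits and performs poorly everywhere; the degree-10 model overfits, showing a deceptively low in-sample error, while the degree-1 model, which matches the assumed DGP, typically predicts new data best.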
<p>There is another important point about the true model in my understanding. If we aggregated our original DGPs not to the daily, but to the weekly or monthly level, we would end up with different models (because we would have a different number of consumers with time-varying DGPs). So the true model is never one and the same: it differs across aggregation levels (both in time and space).</p>
<p>The other point concerns extrapolative models. It is crazy, for example, to claim that there is some ARIMA that generates the data: in real life sales cannot generate themselves, and they do not depend on the errors of a model! But there may exist an optimal ARIMA that satisfies the definition of a true model. So one and the same process may have several true models of a different nature. It all comes down to different points of view on the same object.</p>
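This point can also be sketched in code (all parameters invented): the data below are generated by a mechanism that has nothing to do with ARIMA, yet an AR(1), the simplest autoregressive member of the ARIMA family, estimated here by ordinary least squares on the lagged series, still approximates the dynamics far better than a global mean would.

```python
import numpy as np

rng = np.random.default_rng(7)

# A non-ARIMA generating mechanism (all numbers invented): sales
# respond non-linearly to a slowly drifting "weather" variable.
T = 300
weather = np.cumsum(rng.normal(0, 0.1, T))
sales = 100 + 20 * np.tanh(weather) + rng.normal(0, 1, T)

# Estimate an AR(1) by regressing sales on their own lag.
y, y_lag = sales[1:], sales[:-1]
slope, intercept = np.polyfit(y_lag, y, 1)
one_step = intercept + slope * y_lag

mse_ar = np.mean((y - one_step) ** 2)
mse_mean = np.mean((y - sales.mean()) ** 2)
print(f"AR(1) one-step MSE: {mse_ar:.2f}; global-mean MSE: {mse_mean:.2f}")
```

The AR(1) does not pretend to be the mechanism that produced the data; it is simply a useful point of view on the same process, which is exactly the sense in which an extrapolative model can be a true model.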
<p>So, can we reach the true model? In theory, yes; in practice, no. That&#8217;s because we are always restricted by the number of available variables and by finite sample sizes. The fact that we can observe only aggregate parts of a complicated, time-varying process complicates things even more. Nevertheless, the notion of a true model is useful because it sets a target that we may try to reach. And by trying, we may improve the models that we have.</p>
<p>The last unanswered question from the set that we defined at the very beginning is &#8220;so what?&#8221;</p>
<p>When we define the true model this way and show the connection between DGPs and the true model, we can make sense of an otherwise abstract mathematical idea. If we do not make this point, then we start implying ridiculous things (for example, that data is generated by some mathematical function). Furthermore, without this definition there is no plausible explanation of overfitting (if a parameter looks important, just include it, right?). Finally, it is hard to explain with the conventional definition why in practice we may end up with different optimal models for different aggregation levels, or why models of a different nature may make sense at the same time.</p>
<p>Obviously, this post is based on my subjective opinion, and you may disagree with my definitions. If you do, please, leave comments, so we can have a discussion. In discussion, as you probably know, the truth is found.</p>
<p>The post <a href="https://openforecast.org/2016/06/25/true-model/">True model</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2016/06/25/true-model/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
