In the modern statistical literature there is a notion of a “true model”, by which people usually mean some abstract mathematical model that presumably lies at the core of the observed process. Roughly speaking, it is implied that the data we have has been generated by some big guy with a white beard sitting in the mathematical clouds and using some function. So the aim of a researcher is to get as close to that function as possible. You may also sometimes come across the term “Data Generating Process” (DGP), which is usually used as a synonym for the true model.
But here it gets a bit confusing – no one has ever seen a true model, which makes it a mythical character like a unicorn, Superman or Jesus Christ. So you can believe in it or not, but you cannot prove its existence. There are even big books telling us about the true model, how to reach it and how it can save us all, which does not in fact prove its existence, but usually just assumes it. The bad thing is that these books do not explain how the hell some mathematical function can generate the real sales we have, or whether it has anything to do with reality. I personally dislike this definition of a true model, because I don’t find it really helpful, and in my opinion there are some aspects of modelling that need clarification. For example, what really is a true model? How is it connected with the DGP and reality? Does it exist at all? Is it reachable? What does it look like? So what?
Imagine
In order to answer these questions we need to get to the core of the problem and imagine how the data that we work with is really generated. Let us travel to the land of imagination and, in order to make this trip pleasant, let’s take an example of beer consumption.
First imagine that for some reason you entered a shop and found yourself looking at several different bottles of beer, deciding what and how much to buy (if you need to buy at all). What is happening in your brain at this moment? You have a desire to drink, which in theory could be measured. You look at prices, trying to figure out which brand to choose, how many bottles to take and (taking your income into account) whether they are worth it at all. You also notice that there is a promotion on one of the beers: buy one, get one crate for free. The influence of all these and other elements on the final decision is hidden, and the process of selection happens very fast. But if we had two superpowers, slowing down time and reading minds, then we would be able to measure these factors and quantify the relations between the number of purchased bottles and all those factors. Some relations could be approximated using straight lines, some – using more complicated mathematical functions. So there is a data generating process and it happens in every human brain, but it happens individually, not on an aggregate level of all the consumers (as usually implied by big statistical books).
Note that we can only observe purchases of an integer number of bottles. But I would argue that the decision of how many bottles to buy is a discretised version of true continuous functions in the hearts of consumers. In order not to overcomplicate things, let’s discuss the continuous case here.
Now, these dependencies that we want to measure may vary in time, and they will definitely vary from person to person. Furthermore, the DGPs of some individuals could be approximated using logarithms, while others would have exponential, polynomial or even sinusoidal functions. So when we start aggregating these small DGPs for a group of consumers over some period of time, we end up with a very complicated mathematical model.
Let’s say, for example, that we have two random consumers with completely random names who want to buy beer “Agriochorto” on Monday of week 11, year 2016. Let’s call them Nikos and Fotios. They have the following DGPs in them:
\begin{equation} \label{eq:NandF}
\begin{matrix}
y_{N,t} = -0.2 \log x_{1,t} + 3 x_{2,t} \\
y_{F,t} = -0.3 \sqrt{ x_{1,t}} + 4 x_{2,t}
\end{matrix}
\end{equation}
where \( y_{N,t} \) and \( y_{F,t} \) are the quantities that Nikos and Fotios want to buy, \( x_1 \) is the price of Agriochorto beer and \( x_2 \) is the price of some competitor’s beer, let’s call it “Oura”.
When we aggregate these two DGPs we end up with something like this:
\begin{equation} \label{eq:aggregate_demand}
y_{t} = -0.2 \log x_{1,t} -0.3 \sqrt{ x_{1,t}} + 7 x_{2,t}
\end{equation}
Here, because both Nikos and Fotios had a similar perception of Oura, the aggregate demand for our beer depends linearly on the price of the competitor’s beer. But they had different perceptions of Agriochorto beer, so the aggregate demand has a strange non-linear relation between the price of our beer and the quantity bought.
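To make the aggregation concrete, here is a minimal Python sketch that codes the two individual DGPs and the aggregate equation exactly as written above and checks that they agree. The price pairs are made up for illustration, and the function names are my own.

```python
import math

def demand_nikos(x1, x2):
    """Individual DGP of Nikos: log response to own price."""
    return -0.2 * math.log(x1) + 3 * x2

def demand_fotios(x1, x2):
    """Individual DGP of Fotios: square-root response to own price."""
    return -0.3 * math.sqrt(x1) + 4 * x2

def aggregate_demand(x1, x2):
    """Aggregate demand: the sum of the two individual DGPs."""
    return -0.2 * math.log(x1) - 0.3 * math.sqrt(x1) + 7 * x2

# Hypothetical (Agriochorto price, Oura price) pairs, not real data.
price_pairs = [(1.0, 1.2), (1.5, 1.2), (2.0, 1.0), (2.5, 1.4)]

for x1, x2 in price_pairs:
    total = demand_nikos(x1, x2) + demand_fotios(x1, x2)
    # The sum of the individual demands matches the aggregate formula.
    assert math.isclose(total, aggregate_demand(x1, x2))
    print(f"x1={x1:.1f}, x2={x2:.1f}: aggregate demand = {total:.3f}")
```

The competitor-price terms simply add up (3 + 4 = 7), while the own-price terms keep their different shapes and cannot be merged into one simple function.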
Obviously, when we add more consumers throughout the day, the model becomes more and more complicated. Keep also in mind that some DGPs may be purely additive, while others are multiplicative… So after the aggregation we end up with an insane mix. The good news is that these non-linear relations can be approximated using some simple functions (for example, linear ones), so there is no need to analyse the insane mix of DGPs directly: an approximation will suffice.
Note that because of the differences in the individual functions of Nikos, Fotios and all the other thousands of random people, when we approximate the aggregate relations using some functions, we will most probably end up with a constant term in our final model. This term means nothing. It just shows that there are individual differences between customers.
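One way to see both the approximation and the constant term is the rough sketch below. It takes the own-price part of the aggregate equation above and fits a straight line to it by ordinary least squares. The price range (1.00 to 3.00) is an assumption made purely for illustration, and neither individual DGP contains a constant of its own.

```python
import math

def own_price_effect(x1):
    """Own-price part of the aggregate demand equation above."""
    return -0.2 * math.log(x1) - 0.3 * math.sqrt(x1)

# Hypothetical price range for Agriochorto: 1.00 to 3.00 in steps of 0.10.
prices = [1.0 + 0.1 * i for i in range(21)]
effects = [own_price_effect(p) for p in prices]

# Ordinary least squares for a straight line:
# effect is approximated by intercept + slope * price.
n = len(prices)
mean_p = sum(prices) / n
mean_e = sum(effects) / n
slope = (sum((p - mean_p) * (e - mean_e) for p, e in zip(prices, effects))
         / sum((p - mean_p) ** 2 for p in prices))
intercept = mean_e - slope * mean_p

# Largest distance between the curve and the straight line on this range.
worst_gap = max(abs(e - (intercept + slope * p))
                for p, e in zip(prices, effects))

print(f"slope={slope:.3f}, intercept={intercept:.3f}, worst gap={worst_gap:.3f}")
```

On this range the line stays close to the curve, and a non-zero intercept appears even though nobody's individual function has one. It is a by-product of squeezing a mix of different shapes into one simple form.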
Now let’s keep in mind that we looked at the consumers’ behaviour on Monday. A similar thing will happen on Tuesday, but probably with a different set of random people and different individual DGPs, because time has passed, the weather has changed and some factors have become more important in the selection process than others. The resulting aggregate demand function will differ from yesterday’s, although the two will share some core relations (for example, between price and number of bottles). At the same time some factors will disappear and others will take their places; some relations will weaken, others will become stronger. But because it is impossible to track all the smaller factors, they may be considered random and distributed, for example, normally. So the core of our sales will have some more or less stable relations between sales and a set of factors, but there will also be “the unknown”, appearing and disappearing, something that is sometimes called “noise”. Note that there would be no noise if we had all the information and knew all the individual DGPs at every moment in time. But obviously this is cumbersome and not realistic.
So this final model, with all the necessary variables included in correct forms, with a constant term and random noise, is in my understanding the true model. Keep in mind though that some factors may look important but will not in fact influence individual selection for the majority of consumers in one and the same manner. For example, bitterness of the beverage may be important only for a small group of beer enthusiasts over a specific period of time (when Saturn is dominated by Venus). So these factors will influence the final sales and may correlate with them if we gather that data, but they are in fact random and should be included in the error term rather than in the model. The other very important point here is that the true model may change in time, because the DGPs in people’s heads evolve: one year you like mild beer, the next you get bored with it and switch to a bitter one.
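The idea that many small untracked factors can be treated as roughly normal noise can be illustrated with a small simulation. This is only a sketch under invented assumptions: 50 minor factors, each present on a given day with probability 0.5, and each nudging sales by a small uniform amount. None of these numbers come from real data.

```python
import random
import statistics

rng = random.Random(2016)  # fixed seed so the sketch is repeatable

N_DAYS = 2000     # hypothetical number of simulated days
N_FACTORS = 50    # hypothetical number of small, untracked factors

def untracked_effect():
    """Sum of many small factors, each switched on or off at random."""
    total = 0.0
    for _ in range(N_FACTORS):
        if rng.random() < 0.5:                # the factor matters today
            total += rng.uniform(-0.1, 0.1)   # small push up or down
    return total

noise = [untracked_effect() for _ in range(N_DAYS)]

mean = statistics.fmean(noise)
sd = statistics.stdev(noise)
# For a normal distribution roughly 68% of values fall within one
# standard deviation of the mean, so this share is a crude check.
within_one_sd = sum(abs(e - mean) <= sd for e in noise) / N_DAYS

print(f"mean={mean:.3f}, sd={sd:.3f}, share within one sd={within_one_sd:.2f}")
```

No single factor here is normally distributed, yet their sum behaves much like bell-shaped noise, which is why lumping the untracked factors into one random term is a reasonable simplification.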
Definition, so what?
So, let’s summarise my definition of the true model. It is a parsimonious model that contains all the necessary variables (no fewer and no more) in appropriate forms, while being the best model among all the possible ones in terms of explaining and predicting a process of interest. Including unnecessary variables in a model leads to overfitting, while skipping important ones leads to underfitting. There should be a balance, and the true model has it.
There is another important point about the true model in my understanding. If we aggregated our original DGPs not to the daily but to the weekly or monthly level, we would end up with different models (because we would have a different number of consumers with time-varying DGPs). So the true model is never one and the same: it is different for different aggregation levels (both in time and space).
Another point concerns extrapolative models. It is crazy, for example, to claim that there is some ARIMA that generates data: in real life sales cannot generate themselves and they do not depend on the errors of the model! But there may exist an optimal ARIMA that satisfies the definition of a true model. So one and the same process may have several true models of different natures. It all just comes down to different points of view on the same object.
So, can we reach a true model? In theory – yes, in practice – no. That’s because we are always restricted by the number of available variables and by finite sample sizes. The fact that we can observe only aggregate parts of a complicated, time-varying process complicates things even more. Nevertheless, the notion of a true model is useful because it sets a target that we may try to reach. And by trying we may improve the models that we have.
The last unanswered question in the set that we defined at the very beginning is “so what?”
When we define the true model this way and show the connection between DGPs and the true model, we can make sense of the abstract mathematical idea. If we do not make this point, then we start implying ridiculous things (for example, that data is generated using some mathematical function). Furthermore, without this definition there is no plausible explanation of overfitting (if the parameter looks important, just include it, right?). And finally, it is hard to explain using the conventional definition why in practice we may end up with different optimal models for different aggregation levels, or why models of a different nature may make sense at the same time.
Obviously, this post is based on my subjective opinion, and you may disagree with my definitions. If you do, please leave a comment so we can have a discussion. In discussion, as you probably know, the truth is found.
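The balance between underfitting and overfitting can be sketched on simulated data. Everything below is an assumption made for illustration: sales depend linearly on price plus noise, five extra variables are pure junk, and AIC is used as one possible yardstick (the definition above does not prescribe any particular criterion). The helper `ols_rss` is a hypothetical name for a small hand-written least-squares routine.

```python
import math
import random

def ols_rss(columns, y):
    """Fit y on an intercept plus the given columns; return the residual sum of squares."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    # Gaussian elimination with partial pivoting.
    for p in range(k):
        best = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[best] = A[best], A[p]
        for r in range(p + 1, k):
            f = A[r][p] / A[p][p]
            for c in range(p, k + 1):
                A[r][c] -= f * A[p][c]
    # Back substitution for the coefficients.
    b = [0.0] * k
    for r in reversed(range(k)):
        b[r] = (A[r][k] - sum(A[r][c] * b[c] for c in range(r + 1, k))) / A[r][r]
    fitted = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

rng = random.Random(11)  # fixed seed so the sketch is repeatable
n = 200
price = [rng.uniform(1.0, 3.0) for _ in range(n)]
# Five variables that have nothing to do with sales.
junk = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(5)]
# Assumed relation for the simulation: sales fall linearly with price.
sales = [5.0 - 2.0 * p + rng.gauss(0.0, 0.3) for p in price]

candidates = {
    "no variables (underfit)": [],
    "price only": [price],
    "price + 5 irrelevant (overfit)": [price] + junk,
}
results = {}
for name, cols in candidates.items():
    rss = ols_rss(cols, sales)
    n_params = len(cols) + 2  # slopes + intercept + error variance
    aic = n * math.log(rss / n) + 2 * n_params
    results[name] = (rss, aic)
    print(f"{name:32s} RSS={rss:8.3f}  AIC={aic:8.2f}")
```

The in-sample error can only go down as variables are added, so "it improves the fit" is never a good enough reason to include something. A penalised criterion such as AIC will usually, though not always, point back to the price-only model.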