greybox package for R

I am delighted to announce a new package on CRAN. It is called “greybox”. I know what my American friends will say as soon as they see the name: they will claim that there is a typo and that it should be “a” instead of “e”. But no mistake was made. I used the British spelling for the name, and I fully understand that at some point I might regret this…

So, what is “greybox”? Wikipedia tells us that a grey box is a model that “combines a partial theoretical structure with data to complete the model”. This means that almost any statistical model can be considered a grey box, which makes the package potentially quite flexible and versatile.

But why do we need a new package on CRAN?

First, there were several functions in the smooth package that did not belong there, and there are several functions in the TStools package that can be united under the topic of model building. They focus on multivariate regression analysis rather than on state-space models, time series smoothing or anything else, so it makes more sense to give them their own home package. An example of such a function is ro() (Rolling Origin), a function that Yves and I wrote in 2016 on our way to the International Symposium on Forecasting. Arguably, this function can be used not only for assessing the accuracy of forecasting models, but also for variable and model selection.
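To make the idea of rolling origin concrete, here is a minimal base R sketch. It is not the ro() implementation: rollingOrigin(), naiveForecast() and meanForecast() are hypothetical helpers invented for this illustration, and the data is simulated. The forecast origin moves forward one observation at a time, a forecast for h steps ahead is produced from each origin, and the errors are collected so that competing methods can be compared.

```r
# Rolling origin evaluation in base R (an illustrative sketch, not greybox::ro()).
# rollingOrigin() is a hypothetical helper name used only for this example.
rollingOrigin <- function(y, h = 3, origins = 5, forecastFunction) {
    obs <- length(y)
    # Rows are forecast horizons, columns are origins
    errors <- matrix(NA_real_, nrow = h, ncol = origins,
                     dimnames = list(paste0("h", 1:h), paste0("origin", 1:origins)))
    for (i in 1:origins) {
        # Index of the last in-sample observation for this origin
        trainEnd <- obs - h - origins + i
        train <- y[1:trainEnd]
        holdout <- y[(trainEnd + 1):(trainEnd + h)]
        # Forecast errors for horizons 1..h from this origin
        errors[, i] <- holdout - forecastFunction(train, h)
    }
    errors
}

# Two simple forecasting methods to compare (assumed here for illustration)
naiveForecast <- function(train, h) rep(train[length(train)], h)
meanForecast <- function(train, h) rep(mean(train), h)

set.seed(41)
y <- cumsum(rnorm(100))  # simulated random walk

errorsNaive <- rollingOrigin(y, h = 3, origins = 5, forecastFunction = naiveForecast)
errorsMean <- rollingOrigin(y, h = 3, origins = 5, forecastFunction = meanForecast)

# Mean absolute error per horizon, averaged over the origins
maeByHorizon <- cbind(naive = rowMeans(abs(errorsNaive)),
                      mean = rowMeans(abs(errorsMean)))
print(maeByHorizon)
```

The same loop could be used for variable selection: instead of two forecasting methods, one would pass in regressions with different sets of variables and keep the set with the smallest out-of-sample error.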

Second, in one of my side projects, I needed to work more on multivariate regressions, and I had several ideas I wanted to test. One of those is creating a combined multivariate regression from several models using information criteria weights. The existing implementations did not satisfy me, so I ended up writing a function, lmCombine(), that does that. In addition, our research together with Yves Sagaert indicates that there is a nice solution for the fat regression problem (when the number of parameters is higher than the number of observations) using information criteria. Uploading those functions to smooth did not sound right, but having greybox helps a lot. There are other ideas that I have in mind, and they don’t fit in the other packages.
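For readers unfamiliar with combining models via information criteria weights, the following base R sketch shows one common way of doing it. It is not how lmCombine() is implemented: the choice of AIC (rather than some other criterion), the simulated data and all the variable names are assumptions made for this illustration. Each candidate model gets the weight exp(-0.5 * (AIC_i - AIC_min)), normalised to sum to one, and the combined coefficients are the weighted sums of the individual ones, with a coefficient of zero for variables that a model excludes.

```r
# Combining regressions with AIC weights (an illustrative sketch, not lmCombine()).
set.seed(42)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- 10 + 2 * x1 - 1.5 * x2 + rnorm(n)  # x3 is irrelevant by construction
data <- data.frame(y, x1, x2, x3)

# All subsets of the regressors: each row marks the variables included in a model
variables <- c("x1", "x2", "x3")
subsets <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), length(variables))))

# Estimate one linear regression per subset (the empty subset is intercept-only)
models <- lapply(seq_len(nrow(subsets)), function(i) {
    chosen <- variables[subsets[i, ]]
    terms <- if (length(chosen) > 0) chosen else "1"
    lm(reformulate(terms, response = "y"), data = data)
})

# AIC weights: differences from the best model, exponentiated and normalised
aic <- sapply(models, AIC)
delta <- aic - min(aic)
weights <- exp(-0.5 * delta) / sum(exp(-0.5 * delta))

# Coefficients of every model in one matrix, with zeros for excluded variables
coefNames <- c("(Intercept)", variables)
coefMatrix <- t(sapply(models, function(m) {
    b <- setNames(numeric(length(coefNames)), coefNames)
    b[names(coef(m))] <- coef(m)
    b
}))

# Combined coefficients: row i of coefMatrix is multiplied by weights[i]
combined <- colSums(coefMatrix * weights)
print(round(weights, 4))
print(round(combined, 4))
```

Because the irrelevant variable only enters through models with some weight, its combined coefficient is shrunk towards zero rather than being either fully included or fully dropped.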

Finally, I could not find satisfactory (from my point of view) packages on CRAN that would focus on multivariate model building and forecasting; the usual focus is on analysis instead (including time series analysis). The other thing is the obsession of many packages with p-values and hypothesis testing, which was yet another motivator for me to develop a package that would be completely hypotheses-free (at 95% level). As a result, if you work with the functions from greybox, you might notice that they produce confidence intervals instead of p-values (because I find them more informative and useful). On top of that, I needed good instruments for promotional modelling for several projects, and it was easier to implement them myself than to compile them from different functions in different packages.

Keeping that in mind, it makes sense to briefly discuss what is already available in the package. I’ve already discussed how the xregExpander() and stepwise() functions work in one of the previous posts, and these functions are now available in greybox instead of smooth. However, I have not yet covered either lmCombine() or ro(). While lmCombine() is still under construction and works only for normal cases (fat regression can be solved, but not 100% efficiently), ro() has worked efficiently for several years already. So I created a detailed vignette explaining what rolling origin is, how the function works and how to use it. If you are interested in finding out more, check it out on CRAN.

To wrap up, the greybox package is focused on model building and forecasting, and from now on it will be periodically updated.

As a final note, I plan to do the following in greybox in future releases:

  1. Move the nemenyi() function from TStools to greybox;
  2. Develop functions for promotional modelling;
  3. Write a function for multiple correlation coefficients (will be used for multicollinearity analysis);
  4. Implement variable selection based on rolling origin evaluation;
  5. Implement stepwise regression and combinations of models based on Laplace and other distributions;
  6. Implement AICc for Laplace and other distributions;
  7. Solve the fat regression problem via a combination of regression models (sounds crazy, right?);
  8. Add xregTransformer for non-linear transformation of the provided xreg variables;
  9. Other cool stuff.

If you have any thoughts on what to implement, leave a comment – I will consider your idea.
