<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Archives greybox - Open Forecasting</title>
	<atom:link href="https://openforecast.org/tag/greybox/feed/" rel="self" type="application/rss+xml" />
	<link>https://openforecast.org/tag/greybox/</link>
	<description>How to look into the future</description>
	<lastBuildDate>Mon, 28 Jul 2025 15:51:00 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2015/08/cropped-usd-05-32x32.png&amp;nocache=1</url>
	<title>Archives greybox - Open Forecasting</title>
	<link>https://openforecast.org/tag/greybox/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>smooth &#038; greybox under LGPLv2.1</title>
		<link>https://openforecast.org/2023/09/19/smooth-greybox-under-lgplv2-1/</link>
					<comments>https://openforecast.org/2023/09/19/smooth-greybox-under-lgplv2-1/#comments</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 19 Sep 2023 09:32:56 +0000</pubDate>
				<category><![CDATA[Package greybox for R]]></category>
		<category><![CDATA[Package smooth for R]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[greybox]]></category>
		<category><![CDATA[smooth]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3286</guid>

					<description><![CDATA[<p>Good news, everyone! I&#8217;ve recently released major versions of my packages smooth and greybox, v4.0.0 and v2.0.0 respectively, on CRAN. Has something big happened? Yes and no. Let me explain. Starting from these versions, the packages will be licensed under LGPLv2.1 instead of the very restrictive GPLv2. This does not change anything to the everyday [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2023/09/19/smooth-greybox-under-lgplv2-1/">smooth &#038; greybox under LGPLv2.1</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Good news, everyone! I&#8217;ve recently released major versions of my packages <a href="https://cran.r-project.org/package=smooth">smooth</a> and <a href="https://cran.r-project.org/web/packages/greybox/index.html">greybox</a>, v4.0.0 and v2.0.0 respectively, on CRAN. Has something big happened? Yes and no. Let me explain.</p>
<div id="attachment_3308" style="width: 510px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/greybox-smooth.png&amp;nocache=1"><img fetchpriority="high" decoding="async" aria-describedby="caption-attachment-3308" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/greybox-smooth.png&amp;nocache=1" alt="Stickers of the greybox and smooth packages for R" width="500" height="289" class="size-full wp-image-3308" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/greybox-smooth.png&amp;nocache=1 500w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/greybox-smooth-300x173.png&amp;nocache=1 300w" sizes="(max-width: 500px) 100vw, 500px" /></a><p id="caption-attachment-3308" class="wp-caption-text">Stickers of the greybox and smooth packages for R</p></div>
<p>Starting from these versions, the packages will be licensed under <a href="https://www.gnu.org/licenses/old-licenses/lgpl-2.1.en.html">LGPLv2.1</a> instead of the very restrictive <a href="https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html">GPLv2</a>. This does not change anything for everyday users of the packages, but it is a potential game changer for software developers and those who might want to modify the source code of the packages for commercial purposes. This is because any modification of code under GPLv2 must itself be released and made available to everyone, while LGPLv2.1 allows modifications without releasing the source code. At the same time, both licenses require attribution to the author, so if someone modifies the code and uses it for their purposes, they still need to say who developed the original package (Ivan Svetunkov in this case). The reason I decided to change the license is that one of the software vendors I sometimes work with pointed out that they cannot touch anything under GPL because of the restrictions above. Moving to the LGPL will allow them to use my packages in their own developments. This applies to such functions as <a href="https://openforecast.org/adam/">adam()</a>, <a href="/en/category/r-en/smooth/es-function/">es()</a>, <a href="https://cran.r-project.org/web/packages/smooth/vignettes/ssarima.html">msarima()</a>, <a href="/en/2022/08/02/complex-exponential-smoothing/">ces()</a>, <a href="https://cran.r-project.org/web/packages/greybox/vignettes/alm.html">alm()</a> and others. I don&#8217;t mind, as long as they say who developed the original thing.</p>
<p>What happens now? The versions of the <code>smooth</code> and <code>greybox</code> packages under GPLv2 are available on GitHub <a href="https://github.com/config-i1/smooth/releases/tag/v3.2.2">here</a> and <a href="https://github.com/config-i1/greybox/releases/tag/v1.0.9">here</a> respectively, so if you are a radical open source adept, you can download those releases, install them and use them instead of the new versions. But from now on, I plan to support the packages under the LGPLv2.1 license.</p>
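<p>If you want to pin those last GPLv2 versions, one way (a sketch, assuming you have the <code>remotes</code> package installed) is to install them directly from the GitHub tags linked above:</p>
<pre class="decode"># Install the final GPLv2 releases of the packages from GitHub
remotes::install_github("config-i1/smooth@v3.2.2")
remotes::install_github("config-i1/greybox@v1.0.9")</pre>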
<p>Finally, a small teaser: colleagues of mine have agreed to help me translate the R code into Python (actually, I am quite useless in this endeavour; they do everything), so at some point in the future, we might see the <code>smooth</code> and <code>greybox</code> packages in Python. And they will also be licensed under LGPLv2.1.</p>
<p>Message <a href="https://openforecast.org/2023/09/19/smooth-greybox-under-lgplv2-1/">smooth &#038; greybox under LGPLv2.1</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2023/09/19/smooth-greybox-under-lgplv2-1/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>Introducing scale model in greybox</title>
		<link>https://openforecast.org/2022/01/23/introducing-scale-model-in-greybox/</link>
					<comments>https://openforecast.org/2022/01/23/introducing-scale-model-in-greybox/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Sun, 23 Jan 2022 18:04:33 +0000</pubDate>
				<category><![CDATA[Package greybox for R]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Regression]]></category>
		<category><![CDATA[Univariate models]]></category>
		<category><![CDATA[greybox]]></category>
		<category><![CDATA[regression]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=2670</guid>

					<description><![CDATA[<p>At the end of June 2021, I released the greybox package version 1.0.0. This was a major release, introducing new functionality, but I did not have time to write a separate post about it because of the teaching and lack of free time. Finally, Christmas has arrived, and I could spend several hours preparing the [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2022/01/23/introducing-scale-model-in-greybox/">Introducing scale model in greybox</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>At the end of June 2021, I released the <code>greybox</code> package version 1.0.0. This was a major release, introducing new functionality, but I did not have time to write a separate post about it because of the teaching and lack of free time. Finally, Christmas has arrived, and I could spend several hours preparing the post about it. In this post, I want to tell you about the new major feature in the <code>greybox</code> package.</p>
<h3>Scale Model</h3>
<p>The scale model is a regression-like model that captures the relation between the scale of a distribution (for example, the variance in the Normal distribution) and a set of explanatory variables. It is implemented in the <code>sm()</code> function in the <code>greybox</code> package. The motivation for this comes from <a href="https://www.gamlss.com/">GAMLSS</a>, the Generalised Additive Model for Location, Scale and Shape. While I have decided not to bother with the &#8220;GAM&#8221; part of it (there are <code>gam</code> and <code>gamlss</code> packages in R that do that), I liked the idea of being able to predict the scale (for example, the variance) of a distribution. This becomes especially useful when one suspects heteroscedasticity in the model but does not think that variable transformations are appropriate.</p>
<p>To understand what the function does, it is necessary first to discuss the underlying model. We will start the discussion with an example of a linear regression model with two explanatory variables, assuming Normally distributed residuals \(\xi_t\) with zero mean and a fixed variance \(\sigma^2\), \(\xi_t \sim \mathcal{N}(0,\sigma^2)\), which can be formulated as:<br />
\begin{equation} \label{eq:model1}<br />
    y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \xi_t ,<br />
\end{equation}<br />
where \(y_t\) is the response variable, \(x_{1,t}\) and \(x_{2,t}\) are the explanatory variables on observation \(t\), and \(\beta_0\), \(\beta_1\) and \(\beta_2\) are the parameters of the model. Recalling the basic properties of the Normal distribution, we can rewrite the same model with standard normal residuals \(\epsilon_t \sim \mathcal{N}\left(0, 1 \right)\) by inserting \(\xi_t = \sigma \epsilon_t\) into \eqref{eq:model1}:<br />
\begin{equation} \label{eq:model2}<br />
    y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \sigma \epsilon_t .<br />
\end{equation}<br />
Now if we suspect that the variance of the model might not be constant, we can substitute the standard deviation \(\sigma\) with some function, transforming the model into:<br />
\begin{equation} \label{eq:model3}<br />
    y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + f\left(\gamma_0 + \gamma_2 x_{2,t} + \gamma_3 x_{3,t}\right) \epsilon_t ,<br />
\end{equation}<br />
where \(x_{2,t}\) and \(x_{3,t}\) are the explanatory variables (as you see, not necessarily the same as in the first part of the model) and \(\gamma_0\), \(\gamma_2\) and \(\gamma_3\) are the parameters of the scale part of the model. The idea here is that there is a regression model for the conditional mean of the distribution, \(\beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t}\), and another one that regulates the standard deviation via \(f\left(\gamma_0 + \gamma_2 x_{2,t} + \gamma_3 x_{3,t}\right)\). The main thing to keep in mind about the latter is that the function \(f(\cdot)\) needs to be strictly positive, because the standard deviation cannot be zero or negative. The simplest way to guarantee this is to use the exponential function for \(f(\cdot)\). Furthermore, in our example with the Normal distribution, the scale corresponds to the variance, so we should introduce the model for the variance: \(\sigma^2_t = \exp\left(\gamma_0 + \gamma_2 x_{2,t} + \gamma_3 x_{3,t}\right)\). This leads to the following model:<br />
\begin{equation} \label{eq:model4}<br />
    y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \sqrt{\exp\left(\gamma_0 + \gamma_2 x_{2,t} + \gamma_3 x_{3,t}\right)} \epsilon_t .<br />
\end{equation}<br />
This model would have not only the conditional mean depending on the values of the explanatory variables (the conventional regression), but also the conditional variance. Note that this model assumes linearity in the conditional mean: an increase of \(x_{1,t}\) by one leads to an increase of \(y_t\) by \(\beta_1\) on average. At the same time, it assumes non-linearity in the variance: an increase of \(x_{2,t}\) by one leads to an increase of the variance by \(\left(\exp(\gamma_2)-1\right)\times 100\)%. If we want a non-linear change in the conditional mean, we can use a model in logarithms. Alternatively, we could assume a different distribution for the response variable \(y_t\). To understand how the latter would work, we need to represent the same model \eqref{eq:model4} in a more general form. For the Normal distribution, model \eqref{eq:model4} can be rewritten as:<br />
\begin{equation} \label{eq:model5}<br />
    y_t \sim \mathcal{N}\left(\beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t}, \exp\left(\gamma_0 + \gamma_2 x_{2,t} + \gamma_3 x_{3,t}\right)\right).<br />
\end{equation}<br />
This representation allows introducing the scale model for many other distributions, such as Laplace, Generalised Normal, Gamma, Inverse Gaussian, etc. All we need to do in those cases is substitute the distribution \(\mathcal{N}(\cdot)\) with a distribution of interest. The <code>sm()</code> function supports the same list of distributions as <code>alm()</code> (see <a href="https://cran.r-project.org/web/packages/greybox/vignettes/alm.html" rel="noopener" target="_blank">the vignette</a> for the function on CRAN or in R using the command <code>vignette()</code>). The specific formula for the scale differs from one distribution to another, but the principles stay the same.</p>
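<p>To see what model \eqref{eq:model4} implies in practice, here is a small base R simulation (it uses the same coefficients as the data generation in the next section and does not rely on <code>greybox</code>): the spread of the residuals grows with \(x_{2,t}\), and a unit increase of \(x_{2,t}\) multiplies the variance by \(\exp(0.5) \approx 1.65\):</p>
<pre class="decode">set.seed(41)
n <- 1000
x1 <- rnorm(n,10,3); x2 <- rnorm(n,10,3); x3 <- rnorm(n,10,3)
# Model (4): the variance of the error term is exp(0.3+0.5*x2-0.4*x3)
y <- 1000 - 0.75*x1 + 1.75*x2 + sqrt(exp(0.3+0.5*x2-0.4*x3))*rnorm(n)
# Residuals around the true mean spread out for larger x2
e <- y - (1000 - 0.75*x1 + 1.75*x2)
sd(e[x2 < 10]); sd(e[x2 >= 10])</pre>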
<h3>Demonstration in R</h3>
<p>For demonstration purposes, we will use an example with artificial data, generated according to the model \eqref{eq:model4}:</p>
<pre class="decode">xreg <- matrix(rnorm(300,10,3),100,3)
xreg <- cbind(1000-0.75*xreg[,1]+1.75*xreg[,2]+
              sqrt(exp(0.3+0.5*xreg[,2]-0.4*xreg[,3]))*rnorm(100,0,1),xreg)
colnames(xreg) <- c("y",paste0("x",c(1:3)))</pre>
<p>The scatterplot of the generated data will look like this:</p>
<pre class="decode">spread(xreg)</pre>
<div id="attachment_2789" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smExampleSpread.png&amp;nocache=1"><img decoding="async" aria-describedby="caption-attachment-2789" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smExampleSpread-300x263.png&amp;nocache=1" alt="Scatterplot matrix for the generated data" width="300" height="263" class="size-medium wp-image-2789" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smExampleSpread-300x263.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smExampleSpread-768x672.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smExampleSpread.png&amp;nocache=1 800w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-2789" class="wp-caption-text">Scatterplot matrix for the generated data</p></div>
<p>We can then fit a model, specifying the location and scale parts of it in <code>alm()</code>. In this case, <code>alm()</code> will call <code>sm()</code> and will estimate both parts via likelihood maximisation. To make things closer to a forecasting task, we will withhold the last 10 observations for the test set:</p>
<pre class="decode">ourModel <- alm(y~x1+x2+x3, scale=~x2+x3, xreg, subset=c(1:90), distribution="dnorm")</pre>
<p>The returned model contains both parts. The scale part of the model can be accessed via <code>ourModel$scale</code>. It is an object of class "scale", supporting several methods, such as <code>actuals()</code>, <code>residuals()</code>, <code>fitted()</code>, <code>summary()</code> and <code>plot()</code> (and several others). Here is how the summary of the model looks in my case:</p>
<pre class="decode">summary(ourModel)</pre>
<pre>Response variable: y
Distribution used in the estimation: Normal
Loss function used in estimation: likelihood
Coefficients:
             Estimate Std. Error Lower 2.5% Upper 97.5%  
(Intercept) 1000.2850     2.9698   994.3782   1006.1917 *
x1            -0.8350     0.1435    -1.1204     -0.5497 *
x2             1.8656     0.1714     1.5246      2.2065 *
x3            -0.0228     0.1776    -0.3761      0.3305  

Coefficients for scale:
            Estimate Std. Error Lower 2.5% Upper 97.5%  
(Intercept)   0.0436     0.7012    -1.3510      1.4382  
x2            0.4705     0.0413     0.3883      0.5527 *
x3           -0.3355     0.0487    -0.4324     -0.2385 *

Error standard deviation: 4.52
Sample size: 90
Number of estimated parameters: 7
Number of degrees of freedom: 83
Information criteria:
     AIC     AICc      BIC     BICc 
391.0191 392.3849 408.5177 411.5908</pre>
<p>The summary above shows parameters for both parts of the model. They are not far from the ones used in the generation of the data, which indicates that the implemented model works as intended. The only issue here is that the standard errors in the location part of the model (the first four coefficients) <strong>do not take the heteroscedasticity into account and thus are biased</strong>. The <a href="https://www.econometrics-with-r.org/15.4-hac-standard-errors.html" rel="noopener" target="_blank">HAC standard errors</a> are not yet implemented in <code>alm()</code>.</p>
<p>Just to see the effect of the scale model, here are the diagnostics plots for the original model (which returns the \(\xi_t\) residuals) and for the scale model (\(\epsilon_t\) residuals):</p>
<pre class="decode">par(mfcol=c(1,2))
plot(ourModel, 5)
plot(ourModel$scale, 5)</pre>
<div id="attachment_2794" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smDiagnostics.png&amp;nocache=1"><img decoding="async" aria-describedby="caption-attachment-2794" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smDiagnostics-300x175.png&amp;nocache=1" alt="Diagnostics plots for sm" width="300" height="175" class="size-medium wp-image-2794" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smDiagnostics-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smDiagnostics-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smDiagnostics-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smDiagnostics.png&amp;nocache=1 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-2794" class="wp-caption-text">Diagnostics plots for sm</p></div>
<p>The Figure above shows squared residuals vs fitted values for the location (left) and the scale (right) parts of the model. The former is agnostic of the scale model and demonstrates that the residuals are heteroscedastic (the variance increases with the fitted values). The latter shows that the scale model has managed to resolve the issue: while the LOWESS line demonstrates some non-linearity, the distribution of residuals conditional on the fitted values looks random.</p>
<p>Finally, we can produce forecasts from such a model, similarly to how it is done for any other model estimated with <code>alm()</code>:</p>
<pre class="decode">ourForecast <- predict(ourModel,xreg[-c(1:90),],interval="pred")
plot(ourForecast)</pre>
<div id="attachment_2800" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smForecast.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-2800" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smForecast-300x175.png&amp;nocache=1" alt="Forecast from the model" width="300" height="175" class="size-medium wp-image-2800" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smForecast-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smForecast-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smForecast-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smForecast.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-2800" class="wp-caption-text">Forecast from the model</p></div>
<p>In this case, the function will first predict the scale part of the model and then use the predicted variance together with the covariance matrix of parameters to calculate the prediction intervals shown in the Figure above. Given the independence of the location and scale parts of the model, the conditional expectation (point forecast) will not change if we drop the scale model. It is all about the variance.</p>
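<p>To illustrate the mechanics with simple arithmetic, here is a base R sketch that ignores the parameter uncertainty (which <code>predict()</code> does take into account) and plugs the coefficient estimates from the summary above into the two parts of the model for some hypothetical values x1=10, x2=12, x3=9:</p>
<pre class="decode"># Location part: conditional mean
mu <- 1000.2850 - 0.8350*10 + 1.8656*12 - 0.0228*9
# Scale part: predicted variance from the scale model
sigma2 <- exp(0.0436 + 0.4705*12 - 0.3355*9)
# 95% prediction interval for the Normal distribution
mu + qnorm(c(0.025, 0.975))*sqrt(sigma2)</pre>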
<p>Finally, if you do not want to use the <code>alm()</code> function, you can use <code>lm()</code> instead and then apply <code>sm()</code> to it:</p>
<pre class="decode">lmModel <- lm(y~x1+x2+x3, as.data.frame(xreg), subset=c(1:90))
smModel <- sm(lmModel, formula=~x2+x3, xreg)</pre>
<p>In this case, <code>sm()</code> will assume that the error term follows the Normal distribution, and we will end up with two models that are not connected with each other (e.g., the <code>predict()</code> method applied to <code>lmModel</code> will not use predictions from <code>smModel</code>). Nonetheless, we can still use all the R methods discussed above for the analysis of <code>smModel</code>.</p>
<p>As a final word, the scale model is a new feature. While it already works, there might be bugs in it. If you find any, please let me know by submitting <a href="https://github.com/config-i1/greybox/issues" rel="noopener" target="_blank">an issue on GitHub</a>.</p>
<h3>P.S.</h3>
<p>There is a danger that the <code>greybox</code> <strong>package will soon be removed from CRAN</strong> together with 88 other packages (including my <code>smooth</code> and <code>legion</code>), because the <code>nloptr</code> package that it relies on has not passed some of the new checks recently introduced by CRAN. This is beyond my control, and I do not have the time or power to influence it, but if this happens, you might need to switch to <a href="https://github.com/config-i1/greybox/" rel="noopener" target="_blank">the installation from GitHub</a> via the <code>remotes</code> package, using the command:</p>
<pre class="decode">remotes::install_github("config-i1/greybox")</pre>
<p>My apologies for the inconvenience. I might be able to remove the dependence on <code>nloptr</code> at some point, but it will not happen before March 2022.</p>
<p>Message <a href="https://openforecast.org/2022/01/23/introducing-scale-model-in-greybox/">Introducing scale model in greybox</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2022/01/23/introducing-scale-model-in-greybox/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>An Integrated Method for Estimation and Optimisation</title>
		<link>https://openforecast.org/2021/09/03/an-integrated-method-for-estimation-and-optimisation/</link>
					<comments>https://openforecast.org/2021/09/03/an-integrated-method-for-estimation-and-optimisation/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Fri, 03 Sep 2021 15:47:15 +0000</pubDate>
				<category><![CDATA[Package greybox for R]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Regression]]></category>
		<category><![CDATA[Univariate models]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[greybox]]></category>
		<category><![CDATA[regression]]></category>
		<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=2703</guid>

					<description><![CDATA[<p>My PhD student, Congzheng Liu (co-supervised with Adam Letchford) has written a paper, entitled &#8220;Newsvendor Problems: An Integrated Method for Estimation and Optimisation&#8220;. This paper has recently been published in EJOR. In this paper we build upon the existing Ban &#038; Rudin (2019) approach for newsvendor problem, showing that in case of the linear model, [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2021/09/03/an-integrated-method-for-estimation-and-optimisation/">An Integrated Method for Estimation and Optimisation</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>My PhD student, <a href="https://www.linkedin.com/in/congzheng-liu/" rel="noopener" target="_blank">Congzheng Liu</a> (co-supervised with <a href="https://www.lancaster.ac.uk/staff/letchfoa/default.htm" rel="noopener" target="_blank">Adam Letchford</a>) has written a paper, entitled &#8220;<a href="https://doi.org/10.1016/j.ejor.2021.08.013" rel="noopener" target="_blank">Newsvendor Problems: An Integrated Method for Estimation and Optimisation</a>&#8220;. This paper has recently been published in <a href="https://www.sciencedirect.com/journal/european-journal-of-operational-research" rel="noopener" target="_blank">EJOR</a>. In this paper we build upon the existing <a href="https://doi.org/10.1287/opre.2018.1757" rel="noopener" target="_blank">Ban &#038; Rudin (2019)</a> approach for newsvendor problem, showing that in case of the linear model, it becomes equivalent to quantile regression. We then extend it for the non-linear newsvendor problems, testing it on simulated and real life data. In order to understand what specifically we propose, we need to discuss the typical process in case of newsvendor problem.</p>
<p>The newsvendor is a class of problems where the product can only be sold for one day, after which it goes to waste. It is appropriate, for example, for perishable products in retail. Typically, in this situation we would have the historical sales of our product \(y_t\), and we would try forecasting them using regression / ETS / ARIMA or any other model. After doing that and obtaining the estimates of parameters, we would produce a quantile of the assumed distribution, which tells us how much to order (\(q_t\)). If we order more than needed, we will incur holding costs; in the opposite case, we will incur shortage costs. Based on these costs and the price of the product, we can find the optimal order that gives the maximum profit.</p>
<p>As you can already spot, the forecasting stage is detached from the optimisation one in this situation. The idea of the proposed integrated approach (IMEO) is simple: instead of optimising the model via MSE or any other conventional loss and then solving the optimisation problem, we can <strong>estimate the model via maximisation of the specific profit function</strong>, thus obtaining the required orders directly. This is not a new idea on its own, but using the profit function rather than the cost (as <a href="https://doi.org/10.1287/opre.2018.1757" rel="noopener" target="_blank">Ban &#038; Rudin, 2019</a> did) allows applying IMEO to a wider set of problems.</p>
<p>For example, if we know the price of the product \(p\), the costs for production \(v\), holding \(c_h\) and shortage costs \(c_s\), we can then calculate profit as (for a linear newsvendor problem):<br />
\begin{equation}<br />
    \pi(q_t,y_t)=<br />
    \begin{cases}<br />
        p y_t -v q_t -c_h (q_t -y_t),&#038; \text{for } q_t \geq y_t\\<br />
        p q_t -v q_t -c_s (y_t -q_t),&#038; \text{for } q_t < y_t,<br />
    \end{cases}<br />
\end{equation}<br />
where \(q_t\) is the order quantity and \(y_t\) is the actual sales. This profit function can be used for the estimation of a model of your choosing. Congzheng has written separate R code for the experiments in the paper. Inspired by his example, I have implemented custom losses in the <code>alm()</code> and <code>adam()</code> functions from the <code>greybox</code> and <code>smooth</code> packages for R respectively. At the moment, only the regression model works properly with custom losses &#8211; ETS / ARIMA need additional modifications, which we will hopefully resolve in the next paper. So, here is an example with the linear newsvendor problem and <code>alm()</code>:</p>
<pre class="decode"># Generate artificial data
x1 <- rnorm(100,100,10)
x2 <- rbinom(100,2,0.05)
y <- 10 + 1.5*x1 + 5*x2 + rnorm(100,0,10)
ourData <- cbind(y=y,x1=x1,x2=x2)

# Define price and costs
price <- 50
costBasic <- 5
costShort <- 15
costHold <- 1

# Define profit function for the linear case
# Note: B and xreg are unused here but are part of the custom loss interface
lossProfit <- function(actual, fitted, B, xreg){
    # Minus sign is needed here, because we need to minimise the loss
    profit <- -ifelse(actual >= fitted,
                     (price - costBasic) * fitted - costShort * (actual - fitted),
                     price * actual - costBasic * fitted - costHold * (fitted - actual));
    return(sum(profit));
}

# Estimate the model
model1 <- alm(y~x1+x2, ourData, loss=lossProfit)

# Print summary of the model
summary(model1, bootstrap=TRUE) </pre>
<pre>Response variable: y
Distribution used in the estimation: Normal
Loss function used in estimation: custom
Bootstrap was used for the estimation of uncertainty of parameters
Coefficients:
            Estimate Std. Error Lower 2.5% Upper 97.5%  
(Intercept)  36.5177    14.2840     2.7783     51.4844 *
x1            1.3622     0.1622     1.1909      1.7528 *
x2            3.3423     2.7810    -6.5997      5.9101  

Error standard deviation: 17.2266
Sample size: 100
Number of estimated parameters: 3
Number of degrees of freedom: 97</pre>
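<p>The same integrated idea can be sketched without any packages: take a linear order rule \(q_t = B_0 + B_1 x_t\) and estimate its parameters by maximising the profit function above directly with base R <code>optim()</code> (a toy illustration with one variable, not the code from the paper):</p>
<pre class="decode">set.seed(42)
x <- rnorm(100,100,10)
y <- 10 + 1.5*x + rnorm(100,0,10)       # actual demand
p <- 50; v <- 5; ch <- 1; cs <- 15      # price and costs
# Negative total profit of the order rule q = B[1] + B[2]*x
negProfit <- function(B){
    q <- B[1] + B[2]*x
    -sum(ifelse(q >= y, p*y - v*q - ch*(q - y),
                        p*q - v*q - cs*(y - q)))
}
# Start from the OLS estimates and maximise the profit
B <- optim(coef(lm(y ~ x)), negProfit)$par
B</pre>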
<p>The resulting model is easy to work with: it provides meaningful parameters, showing how, on average, the order should change if a variable changes by one. For example, if x1 increases by one, the order should increase on average by 1.36.</p>
<p>Note that in this specific case, as shown <a href="https://doi.org/10.1016/j.ejor.2021.08.013" rel="noopener" target="_blank">in our paper</a>, the model is equivalent to the quantile regression estimated for the quantile \(\left( \frac{c_u}{c_o+c_u} \right)\), where \(c_u = p-v+c_s\) is the "underage" cost and \(c_o = v+c_h\) is the "overage" cost. In our example, this corresponds approximately to the 0.9091 quantile. We can compare the output of this model with the one from the quantile regression in <code>alm()</code> (which is estimated as an Asymmetric Laplace model):</p>
<pre class="decode">model2 <- alm(y~x1+x2, ourData, distribution="dalaplace", alpha=0.9091)
summary(model2, bootstrap=TRUE)</pre>
<pre>Response variable: y
Distribution used in the estimation: Asymmetric Laplace with alpha=0.9091
Loss function used in estimation: likelihood
Bootstrap was used for the estimation of uncertainty of parameters
Coefficients:
            Estimate Std. Error Lower 2.5% Upper 97.5%  
(Intercept)  36.6688    11.6686     3.8674     51.1987 *
x1            1.3611     0.1338     1.1920      1.7454 *
x2            3.1259     2.5424    -6.2518      5.4703  

Error standard deviation: 17.3379
Sample size: 100
Number of estimated parameters: 4
Number of degrees of freedom: 96
Information criteria:
     AIC     AICc      BIC     BICc 
826.4622 826.8833 836.8829 837.8524</pre>
<p>The differences between the estimates of parameters of the two models are due to the optimisation procedure, which would converge to slightly different points in these two cases. Still, the values of parameters are close to each other and would converge asymptotically, which supports our finding.</p>
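<p>The implied quantile \(\frac{c_u}{c_o+c_u}\) itself is straightforward to compute. A minimal sketch, assuming some illustrative values of `price` and `costBasic` (both are defined earlier in the post; the numbers below are assumptions picked purely to demonstrate the calculation):</p>

```r
# Illustrative newsvendor costs; price and costBasic are assumptions here
price <- 50
costBasic <- 5
costShort <- 15
costHold <- 1
cu <- price - costBasic + costShort  # "underage" cost
co <- costBasic + costHold           # "overage" cost
round(cu / (co + cu), 4)
# 0.9091
```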
<p>And here is how the orders over time look in the case of our custom loss:</p>
<pre class="decode">plot(model1, 7)</pre>
<div id="attachment_2716" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2021/09/ordersDynamics.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-2716" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2021/09/ordersDynamics-300x175.png&amp;nocache=1" alt="Dynamics of orders from alm model" width="300" height="175" class="size-medium wp-image-2716" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2021/09/ordersDynamics-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2021/09/ordersDynamics-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2021/09/ordersDynamics-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2021/09/ordersDynamics.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-2716" class="wp-caption-text">Dynamics of orders from alm model</p></div>
<p>The purple line in the Figure above corresponds to the orders and would cover roughly 90.91% of cases, so that we would run out of the product in approximately 9% of cases, which would still be more profitable than any other option.</p>
<p>Finally, the approach also works well in the case of the non-linear newsvendor problem (see <a href="https://doi.org/10.1016/j.ejor.2021.08.013" rel="noopener" target="_blank">the paper</a> for details), where quantile regression is not suitable and the conventional approach fails. The only thing that changes is the loss function, in which the prices and costs depend non-linearly on the order quantity and sales.</p>
<p>You can read <a href="https://doi.org/10.1016/j.ejor.2021.08.013" rel="noopener" target="_blank">the published paper on EJOR website</a> or the working paper on <a href="http://dx.doi.org/10.13140/RG.2.2.27057.81763" rel="noopener" target="_blank">ResearchGate</a>.</p>
<p>Message <a href="https://openforecast.org/2021/09/03/an-integrated-method-for-estimation-and-optimisation/">An Integrated Method for Estimation and Optimisation</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2021/09/03/an-integrated-method-for-estimation-and-optimisation/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Analytics with greybox</title>
		<link>https://openforecast.org/2019/01/07/marketing-analytics-with-greybox/</link>
					<comments>https://openforecast.org/2019/01/07/marketing-analytics-with-greybox/#comments</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 07 Jan 2019 16:40:17 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Package greybox for R]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[greybox]]></category>
		<category><![CDATA[regression]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=1893</guid>

					<description><![CDATA[<p>One of the reasons why I have started the greybox package is to use it for marketing research and marketing analytics. The common problem that I face, when working with these courses is analysing the data measured in different scales. While R handles numeric scales natively, the work with categorical is not satisfactory. Yes, I [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2019/01/07/marketing-analytics-with-greybox/">Analytics with greybox</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
					<content:encoded><![CDATA[<p>One of the reasons why I started the <span class="lang:r decode:true crayon-inline">greybox</span> package is to use it for marketing research and marketing analytics. A common problem that I face when working on these courses is analysing data measured in different scales. While R handles numeric scales natively, its support for categorical ones is not satisfactory. Yes, I know that there are packages that implement some of the necessary functions, but I wanted to have them in one place, without needing to install a lot of packages and satisfy their dependencies. After all, what&#8217;s the point in installing a package for Cramer&#8217;s V, when it can be calculated with two lines of code? So, here&#8217;s a brief explanation of the functions for marketing analytics in <span class="lang:r decode:true crayon-inline">greybox</span>.</p>
<p>I will use the `mtcars` dataset for the examples, but we will transform some of the variables into factors:</p>
<pre class="decode">mtcarsData &lt;- as.data.frame(mtcars)
mtcarsData$vs &lt;- factor(mtcarsData$vs, levels=c(0,1), labels=c("v","s"))
mtcarsData$am &lt;- factor(mtcarsData$am, levels=c(0,1), labels=c("a","m"))</pre>
<p><em>All the functions discussed in this post are available in <span class="lang:r decode:true crayon-inline">greybox</span> starting from v0.4.0. However, I&#8217;ve found several bugs since the submission to CRAN, and the most recent version with bugfixes is now <a href="https://github.com/config-i1/greybox" rel="noopener noreferrer" target="_blank">available on github</a>.</em></p>
<h2>Analysing the relation between the two variables in categorical scales</h2>
<h3>Cramer&#8217;s V</h3>
<p>Cramer&#8217;s V measures the relation between two variables in categorical scale. It is implemented in the <span class="lang:r decode:true crayon-inline">cramer()</span> function. It returns the value in a range of 0 to 1 (1 &#8211; when the two categorical variables are linearly associated with each other, 0 &#8211; otherwise), Chi-Squared statistics from the <span class="lang:r decode:true crayon-inline">chisq.test()</span>, the respective p-value and the number of degrees of freedom. The tested hypothesis in this case is formulated as:<br />
\begin{matrix}<br />
H_0: V = 0 \text{ (there is no association between the variables);} \\<br />
H_1: V \neq 0 \text{ (there is an association between the variables).}<br />
\end{matrix}</p>
<p>Here&#8217;s what we get when trying to find the association between the engine and transmission in the `mtcars` data:</p>
<pre class="decode">cramer(mtcarsData$vs, mtcarsData$am)</pre>
<pre>Cramer's V: 0.1042
Chi^2 statistics = 0.3475, df: 1, p-value: 0.5555</pre>
<p>Judging by this output, the association between these two variables is very low (close to zero) and is not statistically significant.</p>
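<p>For a pair of variables like this, the calculation behind <span class="lang:r decode:true crayon-inline">cramer()</span> can be sketched in a couple of lines of base R as \(V = \sqrt{\frac{\chi^2}{n (k-1)}}\), where \(k\) is the smaller of the number of rows and columns of the contingency table. The sketch below reproduces the numbers above (note that <span class="lang:r decode:true crayon-inline">chisq.test()</span> applies Yates&#8217; continuity correction to 2x2 tables):</p>

```r
# Cramer's V from the chi-squared statistic of a contingency table
tab <- table(mtcars$vs, mtcars$am)
chi <- chisq.test(tab)  # Yates' correction is applied for 2x2 tables
V <- sqrt(unname(chi$statistic) / (sum(tab) * (min(dim(tab)) - 1)))
round(V, 4)
# 0.1042
```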
<p>Cramer&#8217;s V can also be used for data in numerical scales. In general, this might not be the most suitable solution, but it can be useful when a variable takes only a small number of values. For example, the variable `gear` in `mtcars` is numerical, but it has only three options (3, 4 and 5). Here&#8217;s what Cramer&#8217;s V tells us in the case of `gear` and `am`:</p>
<pre class="decode">cramer(mtcarsData$am, mtcarsData$gear)</pre>
<pre>Cramer's V: 0.809
Chi^2 statistics = 20.9447, df: 2, p-value: 0</pre>
<p>As we see, the value is high in this case (0.809), and the null hypothesis is rejected at the 5% level. So we can conclude that there is a relation between the two variables. This does not mean that one variable causes the other one; they both might be driven by something else (do more expensive cars have fewer gears but an automatic transmission?).</p>
<h3>Plotting categorical variables</h3>
<p>While R allows plotting two categorical variables against each other, the plot is hard to read and is not very helpful (in my opinion):</p>
<pre class="decode">plot(table(mtcarsData$am,mtcarsData$gear))</pre>
<div id="attachment_1912" style="width: 310px" class="wp-caption alignnone"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsPlot.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1912" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsPlot-300x300.png&amp;nocache=1" alt="" width="300" height="300" class="size-medium wp-image-1912" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsPlot-300x300.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsPlot-150x150.png&amp;nocache=1 150w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsPlot.png&amp;nocache=1 700w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1912" class="wp-caption-text">Default plot of a table</p></div>
<p>So I have created a function that produces a heat map for two categorical variables. It is called <span class="lang:r decode:true crayon-inline">tableplot()</span>:</p>
<pre class="decode">tableplot(mtcarsData$am,mtcarsData$gear)</pre>
<div id="attachment_1915" style="width: 310px" class="wp-caption alignnone"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsTableplot.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1915" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsTableplot-300x300.png&amp;nocache=1" alt="" width="300" height="300" class="size-medium wp-image-1915" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsTableplot-300x300.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsTableplot-150x150.png&amp;nocache=1 150w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsTableplot.png&amp;nocache=1 700w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1915" class="wp-caption-text">Tableplot for the two categorical variables</p></div>
<p>It is based on the <span class="lang:r decode:true crayon-inline">table()</span> function and uses the frequencies inside the table for the colours:</p>
<pre class="decode">table(mtcarsData$am,mtcarsData$gear) / length(mtcarsData$am)</pre>
<pre>        3       4       5
a 0.46875 0.12500 0.00000
m 0.00000 0.25000 0.15625</pre>
<p>The darker sectors mean that there is a higher concentration of values, while the white ones correspond to zeroes. So, in our example, we see that the largest group of cars has an automatic transmission with three gears. Furthermore, the plot shows that there is some sort of relation between the two variables: the cars with automatic transmissions tend to have fewer gears, while the ones with manual transmissions have more (something we&#8217;ve already noticed in the previous subsection).</p>
<h2>Association between the categorical and numerical variables</h2>
<p>While Cramer&#8217;s V can also be used for the measurement of association between the variables in different scales, there are better instruments. For example, some analysts recommend using the intraclass correlation coefficient when measuring the relation between numerical and categorical variables. But there is a simpler option, which involves calculating the coefficient of multiple correlation between the variables. This is implemented in the <span class="lang:r decode:true crayon-inline">mcor()</span> function of <span class="lang:r decode:true crayon-inline">greybox</span>. The `y` variable should be numerical, while `x` can be of any type. The function expands all the factors and runs a regression via the <span class="lang:r decode:true crayon-inline">.lm.fit()</span> function, returning the square root of the coefficient of determination. If the variables are linearly related, then the returned value will be close to one. Otherwise it will be closer to zero. The function also returns the F statistics from the regression, the associated p-value and the number of degrees of freedom (the hypothesis is formulated similarly to the one in the <span class="lang:r decode:true crayon-inline">cramer()</span> function).</p>
<p>Here&#8217;s how it works:</p>
<pre class="decode">mcor(mtcarsData$am,mtcarsData$mpg)</pre>
<pre>Multiple correlations value: 0.5998
F-statistics = 16.8603, df: 1, df resid: 30, p-value: 3e-04</pre>
<p>In this example, a simple linear regression of mpg on the set of dummy variables is constructed, and we can conclude that there is a linear relation between the variables, and that this relation is statistically significant.</p>
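<p>Given that <span class="lang:r decode:true crayon-inline">mcor()</span> returns the square root of the coefficient of determination of such a regression, the value above can be reproduced with a plain <span class="lang:r decode:true crayon-inline">lm()</span>:</p>

```r
# mcor() boils down to sqrt(R^2) from regressing the numerical variable
# on the expanded dummies of the categorical one
fit <- lm(mpg ~ factor(am), data=mtcars)
round(sqrt(summary(fit)$r.squared), 4)
# 0.5998
```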
<h2>Association between several variables</h2>
<h3>Measures of association</h3>
<p>When you deal with datasets (i.e. data frames or matrices), then you can use <span class="lang:r decode:true crayon-inline">cor()</span> function in order to calculate the correlation coefficients between the variables in the data. But when you have a mixture of numerical and categorical variables, the situation becomes more difficult, as the correlation does not make sense for the latter. This motivated me to create a function that uses either <span class="lang:r decode:true crayon-inline">cor()</span>, or <span class="lang:r decode:true crayon-inline">cramer()</span>, or <span class="lang:r decode:true crayon-inline">mcor()</span> functions depending on the types of data (see discussions of <span class="lang:r decode:true crayon-inline">cramer()</span> and <span class="lang:r decode:true crayon-inline">mcor()</span> above). The function is called <span class="lang:r decode:true crayon-inline">association()</span> or <span class="lang:r decode:true crayon-inline">assoc()</span> and returns three matrices: the values of the measures of association, their p-values and the types of the functions used between the variables. Here&#8217;s an example:</p>
<pre class="decode">assocValues &lt;- assoc(mtcarsData)
print(assocValues,digits=2)</pre>
<pre> Associations: 
 values:
        mpg  cyl  disp    hp  drat    wt  qsec   vs   am gear carb
 mpg   1.00 0.86 -0.85 -0.78  0.68 -0.87  0.42 0.66 0.60 0.66 0.67
 cyl   0.86 1.00  0.92  0.84  0.70  0.78  0.59 0.82 0.52 0.53 0.62
 disp -0.85 0.92  1.00  0.79 -0.71  0.89 -0.43 0.71 0.59 0.77 0.56
 hp   -0.78 0.84  0.79  1.00 -0.45  0.66 -0.71 0.72 0.24 0.66 0.79
 drat  0.68 0.70 -0.71 -0.45  1.00 -0.71  0.09 0.44 0.71 0.83 0.33
 wt   -0.87 0.78  0.89  0.66 -0.71  1.00 -0.17 0.55 0.69 0.66 0.61
 qsec  0.42 0.59 -0.43 -0.71  0.09 -0.17  1.00 0.74 0.23 0.63 0.67
 vs    0.66 0.82  0.71  0.72  0.44  0.55  0.74 1.00 0.10 0.62 0.69
 am    0.60 0.52  0.59  0.24  0.71  0.69  0.23 0.10 1.00 0.81 0.44
 gear  0.66 0.53  0.77  0.66  0.83  0.66  0.63 0.62 0.81 1.00 0.51
 carb  0.67 0.62  0.56  0.79  0.33  0.61  0.67 0.69 0.44 0.51 1.00
 
 p-values:
       mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
 mpg  1.00 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.01
 cyl  0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.01
 disp 0.00 0.00 1.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.07
 hp   0.00 0.00 0.00 1.00 0.01 0.00 0.00 0.00 0.18 0.00 0.00
 drat 0.00 0.00 0.00 0.01 1.00 0.00 0.62 0.01 0.00 0.00 0.66
 wt   0.00 0.00 0.00 0.00 0.00 1.00 0.34 0.00 0.00 0.00 0.02
 qsec 0.02 0.00 0.01 0.00 0.62 0.34 1.00 0.00 0.21 0.00 0.01
 vs   0.00 0.00 0.00 0.00 0.01 0.00 0.00 1.00 0.56 0.00 0.01
 am   0.00 0.01 0.00 0.18 0.00 0.00 0.21 0.56 1.00 0.00 0.28
 gear 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.09
 carb 0.01 0.01 0.07 0.00 0.66 0.02 0.01 0.01 0.28 0.09 1.00
 
 types:
      mpg    cyl      disp   hp     drat   wt     qsec   vs       am      
 mpg  "none" "mcor"   "cor"  "cor"  "cor"  "cor"  "cor"  "mcor"   "mcor"  
 cyl  "mcor" "none"   "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "cramer"
 disp "cor"  "mcor"   "none" "cor"  "cor"  "cor"  "cor"  "mcor"   "mcor"  
 hp   "cor"  "mcor"   "cor"  "none" "cor"  "cor"  "cor"  "mcor"   "mcor"  
 drat "cor"  "mcor"   "cor"  "cor"  "none" "cor"  "cor"  "mcor"   "mcor"  
 wt   "cor"  "mcor"   "cor"  "cor"  "cor"  "none" "cor"  "mcor"   "mcor"  
 qsec "cor"  "mcor"   "cor"  "cor"  "cor"  "cor"  "none" "mcor"   "mcor"  
 vs   "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "none"   "cramer"
 am   "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "none"  
 gear "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "cramer"
 carb "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "cramer"
      gear     carb    
 mpg  "mcor"   "mcor"  
 cyl  "cramer" "cramer"
 disp "mcor"   "mcor"  
 hp   "mcor"   "mcor"  
 drat "mcor"   "mcor"  
 wt   "mcor"   "mcor"  
 qsec "mcor"   "mcor"  
 vs   "cramer" "cramer"
 am   "cramer" "cramer"
 gear "none"   "cramer"
 carb "cramer" "none"</pre>
<p>One thing to note is that the function treats numerical variables as categorical when they have up to 10 unique values. This is useful, for example, in the case of the number of gears (`gear`) in the dataset.</p>
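<p>This behaviour is easy to inspect: the variables below are numeric in `mtcars`, but each of them has only a handful of unique values, which is why <span class="lang:r decode:true crayon-inline">assoc()</span> switches to <span class="lang:r decode:true crayon-inline">cramer()</span> or <span class="lang:r decode:true crayon-inline">mcor()</span> for them:</p>

```r
# Count the unique values of some numeric variables in mtcars
counts <- sapply(mtcars[, c("cyl", "gear", "carb")],
                 function(x) length(unique(x)))
counts
#  cyl gear carb
#    3    3    6
```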
<h2>Plots of association between several variables</h2>
<p>Similarly to the problem with <span class="lang:r decode:true crayon-inline">cor()</span>, scatterplot matrix (produced using <span class="lang:r decode:true crayon-inline">plot()</span>) is not meaningful in case of a mixture of variables:</p>
<pre class="decode">plot(mtcarsData)</pre>
<div id="attachment_1913" style="width: 310px" class="wp-caption alignnone"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsScatter.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1913" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsScatter-300x300.png&amp;nocache=1" alt="" width="300" height="300" class="size-medium wp-image-1913" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsScatter-300x300.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsScatter-150x150.png&amp;nocache=1 150w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsScatter.png&amp;nocache=1 700w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1913" class="wp-caption-text">Default scatter plot matrix</p></div>
<p>It makes sense to use a scatterplot for numerical variables, <span class="lang:r decode:true crayon-inline">tableplot()</span> for categorical ones and <span class="lang:r decode:true crayon-inline">boxplot()</span> for a mixture of the two. So, there is the function <span class="lang:r decode:true crayon-inline">spread()</span> in <span class="lang:r decode:true crayon-inline">greybox</span> that creates something more meaningful. It uses the same algorithm as the <span class="lang:r decode:true crayon-inline">assoc()</span> function, but produces plots instead of calculating measures of association. So, `gear` will be considered as categorical, and the function will produce either <span class="lang:r decode:true crayon-inline">boxplot()</span> or <span class="lang:r decode:true crayon-inline">tableplot()</span>, when plotting it against other variables.</p>
<p>Here&#8217;s an example:</p>
<pre class="decode">spread(mtcarsData)</pre>
<div id="attachment_1914" style="width: 310px" class="wp-caption alignnone"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpread.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1914" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpread-300x300.png&amp;nocache=1" alt="" width="300" height="300" class="size-medium wp-image-1914" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpread-300x300.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpread-150x150.png&amp;nocache=1 150w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpread.png&amp;nocache=1 700w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1914" class="wp-caption-text">Spread matrix</p></div>
<p>This plot demonstrates, for example, that the number of carburetors influences fuel consumption (something that we could not have spotted in the case of <span class="lang:r decode:true crayon-inline">plot()</span>). Notice also that the number of gears has a non-linear relation with fuel consumption. So constructing a model with dummy variables for the number of gears might be a reasonable thing to do.</p>
<p>The function also has the parameter `log`, which will transform all the numerical variables using logarithms. This is handy when you suspect a non-linear relation between the variables. Finally, there is a parameter `histograms`, which will plot either histograms or barplots on the diagonal.</p>
<pre class="decode">spread(mtcarsData, histograms=TRUE, log=TRUE)</pre>
<div id="attachment_1921" style="width: 310px" class="wp-caption alignnone"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpreadLogs.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1921" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpreadLogs-300x300.png&amp;nocache=1" alt="" width="300" height="300" class="size-medium wp-image-1921" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpreadLogs-300x300.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpreadLogs-150x150.png&amp;nocache=1 150w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpreadLogs.png&amp;nocache=1 700w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1921" class="wp-caption-text">Spread matrix in logs</p></div>
<p>The plot demonstrates that the `disp` has a strong non-linear relation with `mpg`, and, similarly, `drat` and `hp` also influence `mpg` in a non-linear fashion.</p>
<h2>Regression diagnostics</h2>
<p>One of the problems of linear regression that can be diagnosed prior to the model construction is multicollinearity. The conventional way of doing such diagnostics is to calculate the variance inflation factor (VIF) after constructing the model. However, VIF is not easy to interpret, because it lies in \((1,\infty)\). Coefficients of determination from the linear regressions of each explanatory variable on the others are easier to interpret and work with. If such a coefficient is equal to one, then there are some perfectly correlated explanatory variables in the dataset. If it is equal to zero, then they are not linearly related.</p>
<p>There is a function <span class="lang:r decode:true crayon-inline">determination()</span> or <span class="lang:r decode:true crayon-inline">determ()</span> in <span class="lang:r decode:true crayon-inline">greybox</span> that returns the set of coefficients of determination for the explanatory variables. The good thing is that this can be done before constructing any model. In our example, the first column, `mpg` is the response variable, so we can diagnose the multicollinearity the following way:</p>
<pre class="decode">determination(mtcarsData[,-1])</pre>
<pre>       cyl      disp        hp      drat        wt      qsec        vs 
 0.9349544 0.9537470 0.8982917 0.7036703 0.9340582 0.8671619 0.8017720 
        am      gear      carb 
 0.7924392 0.8133441 0.8735577</pre>
<p>As we can see from the output above, `disp` is the most linearly related to the other explanatory variables, so including it in the model might cause multicollinearity, which would decrease the efficiency of the estimates of parameters.</p>
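<p>The connection with VIF mentioned above is direct, because \(\mathrm{VIF}_j = \frac{1}{1 - R^2_j}\). As a quick check, the value for `disp` reported by <span class="lang:r decode:true crayon-inline">determination()</span> can be reproduced with a plain regression of `disp` on the other explanatory variables:</p>

```r
# R^2 of disp regressed on all the other variables except mpg (the response)
R2 <- summary(lm(disp ~ . - mpg, data=mtcars))$r.squared
round(R2, 4)
# 0.9537
round(1 / (1 - R2), 2)  # the respective VIF
```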
<p>Message <a href="https://openforecast.org/2019/01/07/marketing-analytics-with-greybox/">Analytics with greybox</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2019/01/07/marketing-analytics-with-greybox/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>greybox 0.3.0 &#8211; what&#8217;s new</title>
		<link>https://openforecast.org/2018/08/07/greybox-0-3-0-whats-new/</link>
					<comments>https://openforecast.org/2018/08/07/greybox-0-3-0-whats-new/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 07 Aug 2018 16:06:26 +0000</pubDate>
				<category><![CDATA[Package greybox for R]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Univariate models]]></category>
		<category><![CDATA[greybox]]></category>
		<category><![CDATA[regression]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=1778</guid>

					<description><![CDATA[<p>Three months have passed since the initial release of greybox on CRAN. I would not say that the package develops like crazy, but there have been some changes since May. Let&#8217;s have a look. We start by loading both greybox and smooth: library(greybox) library(smooth) Rolling Origin First of all, ro() function now has its own [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2018/08/07/greybox-0-3-0-whats-new/">greybox 0.3.0 &#8211; what&#8217;s new</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Three months have passed since the initial release of <span class="lang:r decode:true crayon-inline">greybox</span> on CRAN. I would not say that the package develops like crazy, but there have been some changes since May. Let&#8217;s have a look. We start by loading both <span class="lang:r decode:true crayon-inline">greybox</span> and <span class="lang:r decode:true crayon-inline">smooth</span>:</p>
<pre class="decode">library(greybox)
library(smooth)</pre>
<h3>Rolling Origin</h3>
<p>First of all, <span class="lang:r decode:true crayon-inline">ro()</span> function now has its own class and works with <span class="lang:r decode:true crayon-inline">plot()</span> function, so that you can have a visual representation of the results. Here&#8217;s an example:</p>
<pre class="decode">x <- rnorm(100,100,10)
ourCall <- "es(data, h=h, intervals=TRUE)"
ourValue <- c("forecast", "lower", "upper")
ourRO <- ro(x,h=20,origins=5,ourCall,ourValue,co=TRUE)
plot(ourRO)</pre>
<div id="attachment_1781" style="width: 310px" class="wp-caption alignnone"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExample.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1781" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExample-300x175.png&amp;nocache=1" alt="" width="300" height="175" class="size-medium wp-image-1781" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExample-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExample-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExample-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExample.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1781" class="wp-caption-text">Example of the plot of rolling origin function</p></div>
<p>Each point on the produced graph corresponds to an origin, and the straight lines correspond to the forecasts. Given that we asked for point forecasts and for the lower and upper bounds of the prediction interval, we have three respective lines. By plotting the results of the rolling origin experiment, we can see whether the model is stable or not. Just compare the previous graph with the one produced from the call to Holt's model:</p>
<pre class="decode">ourCall <- "es(data, model='AAN', h=h, intervals=TRUE)"
ourRO <- ro(x,h=20,origins=5,ourCall,ourValue,co=TRUE)
plot(ourRO)</pre>
<div id="attachment_1782" style="width: 310px" class="wp-caption alignnone"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExampleAAN.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1782" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExampleAAN-300x175.png&amp;nocache=1" alt="" width="300" height="175" class="size-medium wp-image-1782" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExampleAAN-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExampleAAN-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExampleAAN-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExampleAAN.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1782" class="wp-caption-text">Example of the plot of rolling origin function with ETS(A,A,N)</p></div>
<p>Holt's model is not suitable for this time series, so its forecasts are less stable than those of the automatically selected model in the previous case (which is ETS(A,N,N)).</p>
<p>Once again, there is a vignette with examples for the <span class="lang:r decode:true crayon-inline">ro()</span> function, <a href="https://cran.r-project.org/web/packages/greybox/vignettes/ro.html" rel="noopener noreferrer" target="_blank">have a look</a> if you want to know more.</p>
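<p>To build intuition for what <span class="lang:r decode:true crayon-inline">ro()</span> automates, the core of a rolling origin evaluation can be sketched in a few lines of base R. This is a simplified illustration using a naive forecast on made-up data, not the actual implementation of the function:</p>

```r
# A minimal sketch of rolling origin evaluation in base R.
# The "method" here is the naive forecast (repeat the last observation).
set.seed(41)
y <- 100 + cumsum(rnorm(120, 0, 5))   # an artificial random walk series
h <- 3        # forecast horizon
origins <- 5  # number of rolling origins

errors <- matrix(NA, origins, h)
for(i in 1:origins){
    # The training sample grows by one observation with each origin
    trainEnd <- length(y) - h - origins + i
    train <- y[1:trainEnd]
    holdout <- y[trainEnd + 1:h]
    errors[i,] <- holdout - rep(train[trainEnd], h)
}
# Mean absolute error for each forecast horizon step
colMeans(abs(errors))
```

Collecting the errors per origin and per horizon like this is what allows the stability of a method to be judged, rather than its performance on a single holdout.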
<h3>ALM - Advanced Linear Model</h3>
<p>Yes, there is the "Generalised Linear Model" in R, which implements Poisson, Gamma, Binomial and other regressions. Yes, there are smaller packages implementing models with more exotic distributions. But I needed several regression models based on the Laplace distribution, the Folded Normal distribution, the Chi-squared distribution and one new mysterious distribution, currently called the "S distribution". I needed them in one place and in one format: properly estimated via likelihood, returning confidence intervals and information criteria, and able to produce forecasts. I also wanted them to work similarly to <span class="lang:r decode:true crayon-inline">lm()</span>, so that the learning curve would not be too steep. So, here it is: the function <span class="lang:r decode:true crayon-inline">alm()</span>. It works quite similarly to <span class="lang:r decode:true crayon-inline">lm()</span>:</p>
<pre class="decode">xreg <- cbind(rfnorm(100,1,10),rnorm(100,50,5))
xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rlaplace(100,0,3),xreg,rnorm(100,300,10))
colnames(xreg) <- c("y","x1","x2","Noise")
inSample <- xreg[1:80,]
outSample <- xreg[-c(1:80),]

ourModel <- alm(y~x1+x2, inSample, distribution="laplace")
summary(ourModel)</pre>
<p>Here's the output of the summary: </p>
<pre>Distribution used in the estimation: Laplace
Coefficients:
            Estimate Std. Error Lower 2.5% Upper 97.5%
(Intercept) 95.85207    0.36746   95.12022    96.58392
x1           0.59618    0.02479    0.54681     0.64554
x2          -0.67865    0.00622   -0.69103    -0.66626
ICs:
     AIC     AICc      BIC     BICc 
474.2453 474.7786 483.7734 484.9419</pre>
<p>And here's the respective plot of the forecast:</p>
<pre class="decode">plot(forecast(ourModel,outSample))</pre>
<div id="attachment_1787" style="width: 310px" class="wp-caption alignnone"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/almLaplaceExample.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1787" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/almLaplaceExample-300x175.png&amp;nocache=1" alt="" width="300" height="175" class="size-medium wp-image-1787" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/almLaplaceExample-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/almLaplaceExample-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/almLaplaceExample-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/almLaplaceExample.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1787" class="wp-caption-text">Forecast from lm with Laplace distribution</p></div>
<p>The thing that is currently missing in the function is prediction intervals, but these will be added in upcoming releases.</p>
<p>The likelihood-based estimation allows comparing models with different distributions using information criteria. Here is, for example, the model we get if we assume the S distribution (which has fatter tails than Laplace):</p>
<pre class="decode">summary(alm(y~x1+x2, inSample, distribution="s"))</pre>
<pre>Distribution used in the estimation: S
Coefficients:
            Estimate Std. Error Lower 2.5% Upper 97.5%
(Intercept) 95.61244    0.23386   95.14666    96.07821
x1           0.56144    0.00721    0.54708     0.57581
x2          -0.66867    0.00302   -0.67470    -0.66265
ICs:
     AIC     AICc      BIC     BICc 
482.9358 483.4692 492.4639 493.6325</pre>
<p>As you can see, the information criteria for the S distribution are higher than those for the Laplace, so we can conclude that the first model is better in terms of ICs.</p>
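<p>The same kind of IC-based comparison can be carried out with plain <span class="lang:r decode:true crayon-inline">lm()</span> as well. A small self-contained sketch with made-up data (not the example above), just to show the mechanics:</p>

```r
# Sketch: comparing two regression models via AIC in base R.
# model1 uses the true structure, model2 omits a relevant variable.
set.seed(42)
x1 <- rnorm(100, 10, 2)
x2 <- rnorm(100, 50, 5)
y <- 100 + 0.5*x1 - 0.75*x2 + rnorm(100, 0, 3)
model1 <- lm(y ~ x1 + x2)
model2 <- lm(y ~ x1)
# The model with the lower AIC is preferred
AIC(model1, model2)
```

The comparison is only meaningful when both likelihoods are computed on the same data, which is exactly why having all the distributions in one estimation framework helps.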
<p><strong>Note</strong> that at the moment the AICc and BICc are not correct for non-normal models (at least their derivation needs to be double-checked, which I haven't done yet), so don't rely on them too much.</p>
<p>I intend to add several other distributions that either are not available in R or are implemented unsatisfactorily (from my point of view) - the function is written in quite a flexible way, so this should not be difficult to do. If you have any preferences, please add them on GitHub, <a href="https://github.com/config-i1/greybox/issues/13" rel="noopener noreferrer" target="_blank">here</a>.</p>
<p>I also want to implement mixture distributions, so that the things discussed in <a href="/en/2017/11/07/multiplicative-state-space-models-for-intermittent-time-series/">the paper on the intermittent state-space model</a> can also be implemented using pure regression.</p>
<p>Finally, now that I have <span class="lang:r decode:true crayon-inline">alm()</span>, we can select between regression models with different distributions (with the <span class="lang:r decode:true crayon-inline">stepwise()</span> function) or even combine them using AIC weights (hello, <span class="lang:r decode:true crayon-inline">lmCombine()</span>!). Yes, I know that it sounds crazy (think of the pool of models in this case), but this should be fun!</p>
<p><a name="RMCB"></a></p>
<h3>Regression for Multiple Comparison with the Best</h3>
<p><strong>Please, note that this part of the post has been updated on 02.03.2020 in order to reflect the changes in the v0.5.9 version of the package.</strong><br />
One of the typical tasks in forecasting is to evaluate the performance of different methods on a holdout. In order to do that, it is common to use statistical tests, the most popular of which is the <a href="http://www.jmlr.org/papers/volume7/demsar06a/demsar06a.pdf">Nemenyi</a> / <a href="https://doi.org/10.1016/j.ijforecast.2004.10.003">MCB</a> (Multiple Comparison with the Best) test. The test implemented in the greybox package uses similar principles and relies on the ranks of methods, but instead of taking averages and then applying studentised distances, it constructs a regression on the ranked data. This way we compare the median performance of the different methods (the same way as it is done in the classical MCB) and produce parametric confidence intervals for the parameters. The test is based on a simple linear model with dummy variables for each provided method (1 if the error corresponds to the method and 0 otherwise). Here's an example of how it works:</p>
<pre class="decode">ourData <- cbind(rnorm(100,0,10), rnorm(100,-2,5), rnorm(100,2,6), rlaplace(100,1,5))
colnames(ourData) <- c("Method A","Method B","Method C","Method D")

ourTest <- rmcb(ourData, level=0.95)</pre>
<p>By default the function produces a graph in the MCB (Multiple Comparison with the Best) style:</p>
<div id="attachment_2370" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormNew.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-2370" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormNew-300x180.png&amp;nocache=1" alt="" width="300" height="180" class="size-medium wp-image-2370" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormNew-300x180.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormNew-768x461.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormNew.png&amp;nocache=1 1000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-2370" class="wp-caption-text">RMCB example, MCB style plot</p></div>
<p>If we compare the results of the test with the mean rank values, we will see that they are the same:</p>
<pre class="decode">apply(t(apply(ourData,1,rank)),2,mean)</pre>
<pre>Method A Method B Method C Method D 
    2.40     2.06     2.75     2.79</pre>
<pre class="decode">ourTest$mean</pre>
<pre>Method B Method A Method C Method D 
    2.06     2.40     2.75     2.79</pre>
<p>This also reflects how the data was generated. Notice that Method D was generated from the Laplace distribution with mean 1, but the test still gave the correct answer in this situation, because the Laplace distribution is symmetric and the sample size is large enough. The main point of the test, however, is that we get confidence intervals for each parameter, so we can see whether the differences between the methods are significant: if the intervals intersect, then they are not.</p>
<p>The regression model used in the calculation is saved in the <span class="lang:r decode:true crayon-inline">model</span> variable, and you can request a basic summary of it:</p>
<pre class="decode">summary(ourTest$model)</pre>
<pre>            Estimate Std. Error  Lower 2.5% Upper 97.5%
(Intercept)     2.40  0.1083601  2.18761804  2.61238196
Method B       -0.34  0.1532444 -0.64035346 -0.03964654
Method C        0.35  0.1532444  0.04964654  0.65035346
Method D        0.39  0.1532444  0.08964654  0.69035346</pre>
<p>But, please, keep in mind that this is not a proper "lm" object, so you cannot do much with it.</p>
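<p>Still, the underlying principle can be reproduced approximately with a plain <span class="lang:r decode:true crayon-inline">lm()</span>: rank the errors within each series, stack the ranks into long format, and regress them on method dummies. This is a sketch of the idea on made-up data, not the exact internals of <span class="lang:r decode:true crayon-inline">rmcb()</span>:</p>

```r
# Sketch of the rank-regression idea using plain lm()
set.seed(7)
ourData <- cbind(rnorm(100,0,10), rnorm(100,-2,5), rnorm(100,2,6))
colnames(ourData) <- c("Method A","Method B","Method C")
# Rank the errors within each series (row), then stack into long format
ranks <- t(apply(ourData, 1, rank))
longData <- data.frame(rank = as.vector(ranks),
                       method = factor(rep(colnames(ourData),
                                           each = nrow(ourData))))
# With treatment contrasts the intercept is the mean rank of the first
# method and the other coefficients are differences from it
rankModel <- lm(rank ~ method, data = longData)
confint(rankModel, level = 0.95)
```

Intervals of the differences that do not cover zero then indicate a significant difference from the reference method, which is the same logic the test applies to every pair.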
<p>The function also reports the p-value from the F-test of the regression, which tests the standard hypothesis that all the parameters are equal to zero.</p>
<p>We can also produce plots with vertical lines that connect the methods belonging to the same group (no statistically significant difference, i.e. intersecting intervals). Here's an example for the same data:</p>
<pre class="decode">plot(ourTest, outplot="lines")</pre>
<div id="attachment_2371" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormLinesNew.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-2371" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormLinesNew-300x180.png&amp;nocache=1" alt="" width="300" height="180" class="size-medium wp-image-2371" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormLinesNew-300x180.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormLinesNew-768x461.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormLinesNew.png&amp;nocache=1 1000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-2371" class="wp-caption-text">RMCB example, lines plot</p></div>
<p>If you want to tune the plot, you can always do this using the standard plot parameters:</p>
<pre class="decode">plot(ourTest, xlab="Models", ylab="Errors")</pre>
<p>Also, given that this is a flexible plot method, you can tune the parameters of the canvas using the <span class="lang:r decode:true crayon-inline">par()</span> function, as is usually done in R.</p>
<h3>What else?</h3>
<p>Several methods have been moved from <span class="lang:r decode:true crayon-inline">smooth</span> to <span class="lang:r decode:true crayon-inline">greybox</span>. These include:</p>
<ul>
<li>pointLik() - returns point likelihoods, discussed in <a href="http://kourentzes.com/forecasting/2018/06/20/isf2018-presentation-beyond-summary-performance-metrics-for-forecast-selection-and-combination/" rel="noopener noreferrer" target="_blank">our research with Nikos</a>;</li>
<li>pAIC, pBIC, pAICc, pBICc - point values of the respective information criteria, from <a href="http://kourentzes.com/forecasting/2018/06/20/isf2018-presentation-beyond-summary-performance-metrics-for-forecast-selection-and-combination/" rel="noopener noreferrer" target="_blank">the same research</a>;</li>
<li>nParam() - returns the number of estimated parameters in the model (+ variance);</li>
<li>errorType() - returns the type of error used in the model (Additive / Multiplicative).</li>
</ul>
<p>Furthermore, as you might have already noticed, I've implemented several distribution functions:</p>
<ul>
<li>Folded normal distribution;</li>
<li>Laplace distribution;</li>
<li>S distribution.</li>
</ul>
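<p>The Laplace distribution, for instance, is simple enough to write down directly. Here is a base-R sketch of its density, assuming the standard location-scale parameterisation; the hypothetical <span class="lang:r decode:true crayon-inline">dlaplaceSketch()</span> below is only an illustration, not the packaged function:</p>

```r
# Density of the Laplace distribution with location mu and scale b:
# f(x) = exp(-|x - mu| / b) / (2 * b)
dlaplaceSketch <- function(x, mu = 0, b = 1){
    exp(-abs(x - mu) / b) / (2 * b)
}
dlaplaceSketch(0)   # peak of the standard Laplace density: 0.5
# The density integrates to one, as it should
integrate(dlaplaceSketch, -Inf, Inf)$value
```
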
<p>Finally, there is also a function called <span class="lang:r decode:true crayon-inline">lmDynamic()</span>, which uses pAIC in order to produce dynamic linear regression models. But that deserves a separate post.</p>
<p>That's it for now. See you in greybox 0.4.0!</p>
<p>Message <a href="https://openforecast.org/2018/08/07/greybox-0-3-0-whats-new/">greybox 0.3.0 &#8211; what&#8217;s new</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2018/08/07/greybox-0-3-0-whats-new/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>greybox package for R</title>
		<link>https://openforecast.org/2018/05/04/greybox-package-for-r/</link>
					<comments>https://openforecast.org/2018/05/04/greybox-package-for-r/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Fri, 04 May 2018 12:22:35 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[Package greybox for R]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[greybox]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=1718</guid>

					<description><![CDATA[<p>I am delighted to announce a new package on CRAN. It is called &#8220;greybox&#8221;. I know, what my American friends will say, as soon as they see the name &#8211; they will claim that there is a typo, and that it should be &#8220;a&#8221; instead of &#8220;e&#8221;. But in fact no mistake was made &#8211; [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2018/05/04/greybox-package-for-r/">greybox package for R</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/05/greybox2.png&amp;nocache=1"><img loading="lazy" decoding="async" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/05/greybox2-260x300.png&amp;nocache=1" alt="Hexagon for greybox" width="260" height="300" class="size-medium wp-image-1719" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/05/greybox2-260x300.png&amp;nocache=1 260w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/05/greybox2-768x888.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/05/greybox2-886x1024.png&amp;nocache=1 886w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/05/greybox2.png&amp;nocache=1 1206w" sizes="auto, (max-width: 260px) 100vw, 260px" /></a></p>
<p>I am delighted to announce a new package on CRAN. It is called &#8220;greybox&#8221;. I know what my American friends will say as soon as they see the name &#8211; they will claim that there is a typo and that it should be &#8220;a&#8221; instead of &#8220;e&#8221;. But in fact no mistake was made &#8211; I used the British spelling for the name, and I totally understand that at some point I might regret this&#8230;</p>
<p>So, what is &#8220;greybox&#8221;? Wikipedia <a href="https://en.wikipedia.org/wiki/Grey_box_model" rel="noopener noreferrer" target="_blank">tells us that grey box</a> is a model that &#8220;combines a partial theoretical structure with data to complete the model&#8221;. This means that almost any statistical model can be considered as a grey box, thus making the package potentially quite flexible and versatile.</p>
<p>But why do we need a new package on CRAN?</p>
<p>First, there were several functions in the <a href="/en/tag/smooth/">smooth</a> package that did not belong there, and there are several functions in the <a href="https://github.com/trnnick/TStools" rel="noopener noreferrer" target="_blank">TStools</a> package that can be united under the topic of model building. They focus on multivariate regression analysis rather than on state-space models, time series smoothing or anything else. It would make more sense to find them their own <del>home</del> package. An example of such a function is <span class="lang:r decode:true crayon-inline">ro()</span> &#8211; the <a href="https://cran.r-project.org/web/packages/greybox/vignettes/ro.html" rel="noopener noreferrer" target="_blank">Rolling Origin</a> function that Yves and I wrote in 2016 on our way to the International Symposium on Forecasting. Arguably, this function can be used not only for assessing the accuracy of forecasting models, but also for variable / model selection.</p>
<p>Second, in one of my side projects I needed to work more with multivariate regressions, and I had several ideas I wanted to test. One of those is creating a combined multivariate regression from several models using information criteria weights. The existing implementations did not satisfy me, so I ended up writing a function <span class="lang:r decode:true crayon-inline">lmCombine()</span> that does that. In addition, our research together with Yves Sagaert indicates that there is a nice solution to the fat regression problem (when the number of parameters is larger than the number of observations) using information criteria. Uploading those functions to <span class="lang:r decode:true crayon-inline">smooth</span> did not sound right, but having <span class="lang:r decode:true crayon-inline">greybox</span> helps a lot. There are other ideas that I have in mind, and they don&#8217;t fit in the other packages.</p>
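<p>The principle behind such a combination &#8211; weighting models by information criteria &#8211; can be sketched with plain <span class="lang:r decode:true crayon-inline">lm()</span> and Akaike weights. A simplified illustration on made-up data, not the package&#8217;s implementation:</p>

```r
# Sketch of combining regressions via AIC weights in base R
set.seed(1)
x1 <- rnorm(100); x2 <- rnorm(100); x3 <- rnorm(100)
y <- 2 + x1 - 0.5*x2 + rnorm(100)
models <- list(lm(y ~ x1), lm(y ~ x1 + x2), lm(y ~ x1 + x2 + x3))
ICs <- sapply(models, AIC)
# Akaike weights: models with lower AIC receive higher weight
weights <- exp(-0.5*(ICs - min(ICs)))
weights <- weights / sum(weights)
# Combined fit as the weighted sum of the individual fitted values
combinedFit <- rowSums(sapply(models, fitted) *
                       rep(weights, each = length(y)))
round(weights, 3)
```

The weights sum to one, so the combined fit is a convex combination of the individual model fits; the same weighting can be applied to parameters or forecasts.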
<p>Finally, I could not find satisfactory (from my point of view) packages on CRAN that would focus on multivariate model building and forecasting &#8211; the usual focus is on analysis instead (including time series analysis). Another thing is the obsession of many packages with p-values and hypothesis testing, which was yet another motivator for me to develop a package that would be completely hypothesis-free (at the 95% level). As a result, if you work with the functions from <span class="lang:r decode:true crayon-inline">greybox</span>, you might notice that they produce confidence intervals instead of p-values (because I find them more informative and useful). Besides, I needed good instruments for promotional modelling for several projects, and it was easier to implement them myself than to compile them from different functions from different packages.</p>
<p>Keeping that in mind, it makes sense to briefly discuss what is already available there. I&#8217;ve already discussed how the <span class="lang:r decode:true crayon-inline">xregExpander()</span> and <span class="lang:r decode:true crayon-inline">stepwise()</span> functions work <a href="/en/2018/02/10/smooth-package-for-r-common-ground-part-iv-exogenous-variables-advanced-stuff/">in one of the previous posts</a>, and these functions are now available in <span class="lang:r decode:true crayon-inline">greybox</span> instead of <span class="lang:r decode:true crayon-inline">smooth</span>. However, I have not covered either <span class="lang:r decode:true crayon-inline">lmCombine()</span> or <span class="lang:r decode:true crayon-inline">ro()</span> yet. While <span class="lang:r decode:true crayon-inline">lmCombine()</span> is still under construction and works only for normal cases (fat regression can be solved, but not 100% efficiently), <span class="lang:r decode:true crayon-inline">ro()</span> has worked efficiently for several years already. So I created a detailed <a href="https://cran.r-project.org/web/packages/greybox/vignettes/ro.html" rel="noopener noreferrer" target="_blank">vignette</a> explaining what rolling origin is, how the function works and how to use it. If you are interested in finding out more, <a href="https://cran.r-project.org/web/packages/greybox/vignettes/ro.html" rel="noopener noreferrer" target="_blank">check it out on CRAN</a>.</p>
<p>To wrap up, the <span class="lang:r decode:true crayon-inline">greybox</span> package focuses on model building and forecasting, and from now on it will be updated periodically.</p>
<p>As a final note, I plan to do the following in <span class="lang:r decode:true crayon-inline">greybox</span> in future releases:</p>
<ol>
<li>Move <span class="lang:r decode:true crayon-inline">nemenyi()</span> function from <a href="https://github.com/trnnick/TStools" rel="noopener noreferrer" target="_blank">TStools</a> to <a href="https://github.com/config-i1/greybox" rel="noopener noreferrer" target="_blank">greybox</a>;</li>
<li>Develop functions for promotional modelling;</li>
<li>Write a function for multiple correlation coefficients (will be used for multicollinearity analysis);</li>
<li>Implement variables selection based on rolling origin evaluation;</li>
<li>Stepwise regression and combinations of models based on Laplace and other distributions;</li>
<li>AICc for Laplace and other distributions;</li>
<li>Solve fat regression problem via combination of regression models (sounds crazy, right?);</li>
<li><span class="lang:r decode:true crayon-inline">xregTransformer</span> &#8211; non-linear transformation of the provided xreg variables;</li>
<li>Other cool stuff.</li>
</ol>
<p>If you have any thoughts on what to implement, leave a comment &#8211; I will consider your idea.</p>
<p>Message <a href="https://openforecast.org/2018/05/04/greybox-package-for-r/">greybox package for R</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2018/05/04/greybox-package-for-r/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
