Archives regression - Open Forecast

The first draft of “Forecasting and Analytics with ADAM”

Ivan Svetunkov — Mon, 11 Apr 2022 15:30:26 +0000

Forecasting and Analytics with ADAM

After working on this for more than a year, I have finally prepared the first draft of my online monograph “Forecasting and Analytics with ADAM”. This is a monograph on the model that unites ETS, ARIMA and regression and introduces advanced features in univariate modelling, including:

1. ETS in a new State Space form;
2. ARIMA in a new State Space form;
3. Regression;
4. TVP regression;
5. Combinations of (1), (2) and either (3), or (4);
6. Automatic selection/combination for ETS;
7. Automatic orders selection for ARIMA;
8. Variables selection for regression part;
9. Normal and non-normal distributions;
10. Automatic selection of most suitable distribution;
11. Multiple seasonality;
12. Occurrence part of the model to handle zeroes in data (intermittent demand);
13. Modelling scale of distribution (GARCH and beyond);
14. Handling uncertainty of estimates of parameters.

The model and all its features are already implemented in the adam() function from the smooth package for R (you need v3.1.6 from CRAN for all the features listed above). The function supports many options that allow one to experiment with univariate forecasting and to build complex models by combining elements from the list above. The monograph explaining how the models underlying ADAM work and how to use them is available online, and I plan to produce several physical copies of it after refining the text. Furthermore, I have already asked two well-known academics to act as reviewers of the monograph to collect feedback and improve it, and if you want to act as a reviewer as well, please let me know.

Examples in R

Just to give you a flavour of ADAM, I decided to provide a couple of examples on the AirPassengers time series (included in the datasets package in R). The first one is the ADAM ETS.

Building and selecting the most appropriate ADAM ETS comes down to running the following line of code:

adamETSAir <- adam(AirPassengers, h=12, holdout=TRUE)

In this case, ADAM will select the most appropriate ETS model for the data, creating a holdout of the last 12 observations. We can see the details of the model by printing the output:

adamETSAir

Time elapsed: 0.75 seconds
Model estimated using adam() function: ETS(MAM)
Distribution assumed in the model: Gamma
Loss function type: likelihood; Loss function value: 467.2981
Persistence vector g:
 alpha   beta  gamma 
0.7691 0.0053 0.0000 

Sample size: 132
Number of estimated parameters: 17
Number of degrees of freedom: 115
Information criteria:
      AIC      AICc       BIC      BICc 
 968.5961  973.9646 1017.6038 1030.7102 

Forecast errors:
ME: 9.537; MAE: 20.784; RMSE: 26.106
sCE: 43.598%; Asymmetry: 64.8%; sMAE: 7.918%; sMSE: 0.989%
MASE: 0.863; RMSSE: 0.833; rMAE: 0.273; rRMSE: 0.254

The output above provides plenty of detail on what was estimated and how. Some of these elements have been discussed in one of my previous posts on the es() function. The new thing is the information about the assumed distribution for the response variable. By default, ADAM works with the Gamma distribution in the case of a multiplicative error model. This is done to make the model more robust in cases of low volume data, where the Normal distribution might produce negative numbers (see my presentation on this issue). In the case of high volume data, the Gamma distribution will perform similarly to the Normal one. The pure multiplicative ADAM ETS is discussed in Chapter 6 of the ADAM monograph. If Gamma is not suitable, then another distribution can be selected via the distribution parameter. There is also an automated distribution selection approach in the function auto.adam():

adamETSAutoAir <- auto.adam(AirPassengers, h=12, holdout=TRUE)
adamETSAutoAir

Time elapsed: 3.86 seconds
Model estimated using auto.adam() function: ETS(MAM)
Distribution assumed in the model: Normal
Loss function type: likelihood; Loss function value: 466.0744
Persistence vector g:
 alpha   beta  gamma 
0.8054 0.0000 0.0000 

Sample size: 132
Number of estimated parameters: 17
Number of degrees of freedom: 115
Information criteria:
      AIC      AICc       BIC      BICc 
 966.1487  971.5172 1015.1564 1028.2628 

Forecast errors:
ME: 9.922; MAE: 21.128; RMSE: 26.246
sCE: 45.36%; Asymmetry: 65.4%; sMAE: 8.049%; sMSE: 1%
MASE: 0.877; RMSSE: 0.838; rMAE: 0.278; rRMSE: 0.255

As we see from the output above, the Normal distribution is more appropriate for the data in terms of AICc than the other ones tried out by the function (by default the list includes Normal, Laplace, S, Generalised Normal, Gamma, Inverse Gaussian and Log Normal distributions, but this can be amended by providing a vector of names via distribution parameter). The selection of ADAM ETS and distributions is discussed in Chapter 15 of the monograph.
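To illustrate the earlier point about low volume data, here is a small sketch in base R. It assumes a multiplicative error with mean one and a standard deviation of 0.5 (a value chosen purely for illustration) and compares the Normal and the Gamma assumptions. The Gamma parametrisation with shape $1/\sigma^2$ and scale $\sigma^2$ is an assumption made for this example, not necessarily what adam() uses internally:

```r
# Standard deviation of the multiplicative error (illustrative value)
sigma <- 0.5

# Normal case: the multiplier follows N(1, sigma^2), so it can become negative
pNegativeNormal <- pnorm(0, mean=1, sd=sigma)

# Gamma case with mean one and variance sigma^2 (assumed parametrisation)
set.seed(41)
gammaMultipliers <- rgamma(10000, shape=1/sigma^2, scale=sigma^2)

# Probability of a negative multiplier under the Normal distribution (about 2.3%)
print(round(pNegativeNormal, 4))
# Gamma draws are strictly positive and have a mean close to one
print(c(min=min(gammaMultipliers), mean=mean(gammaMultipliers)))
```

With a smaller standard deviation (high volume data), the probability of a negative multiplier under the Normal distribution becomes negligible, and the two distributions look alike.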

Having obtained the model, we can diagnose it using plot.adam() function:

par(mfcol=c(3,3))
plot(adamETSAutoAir,which=c(1,4,2,6,7,8,10,11,13))

The which parameter specifies what type of plots to produce, you can find the list of plots in the documentation for plot.adam(). The code above will result in:

Diagnostics plots for ADAM ETS on AirPassengers data

The diagnostic plots are discussed in Chapter 14 of the ADAM monograph. The plot above does not show any serious issues with the model.

Just for comparison, we could also try fitting the most appropriate ADAM ARIMA to the data (this model is discussed in Chapter 9). The code in this case is slightly more complicated, because we need to switch off the ETS part of the model and define the maximum orders of ARIMA to try:

adamARIMAAir <- adam(AirPassengers, model="NNN", h=12, holdout=TRUE,
                     orders=list(ar=c(3,2),i=c(2,1),ma=c(3,2),select=TRUE))

This results in the following automatically selected ARIMA model:

Time elapsed: 3.54 seconds
Model estimated using auto.adam() function: SARIMA(0,1,1)[1](0,1,1)[12]
Distribution assumed in the model: Normal
Loss function type: likelihood; Loss function value: 491.7117
ARMA parameters of the model:
MA:
 theta1[1] theta1[12] 
   -0.1952    -0.0720 

Sample size: 132
Number of estimated parameters: 16
Number of degrees of freedom: 116
Information criteria:
     AIC     AICc      BIC     BICc 
1015.423 1020.154 1061.548 1073.097 

Forecast errors:
ME: -13.795; MAE: 16.65; RMSE: 21.644
sCE: -63.064%; Asymmetry: -79.4%; sMAE: 6.343%; sMSE: 0.68%
MASE: 0.691; RMSSE: 0.691; rMAE: 0.219; rRMSE: 0.21

Given that ADAM ETS and ADAM ARIMA are formulated in the same framework, they are directly comparable using information criteria. Comparing the AICc of the models adamETSAutoAir and adamARIMAAir, we can conclude that the former is more appropriate to the data than the latter. However, the default ARIMA works with the Normal distribution, which might not be appropriate for the data, so we can revert to auto.adam() to select a better one:

adamAutoARIMAAir <- auto.adam(AirPassengers, model="NNN", h=12, holdout=TRUE,
                              orders=list(ar=c(3,2),i=c(2,1),ma=c(3,2),select=TRUE))

This will take more computational time, but will result in a different model with a lower AICc (which is still higher than the one in ADAM ETS):

Time elapsed: 25.46 seconds
Model estimated using auto.adam() function: SARIMA(0,1,1)[1](0,1,1)[12]
Distribution assumed in the model: Log-Normal
Loss function type: likelihood; Loss function value: 472.923
ARMA parameters of the model:
MA:
 theta1[1] theta1[12] 
   -0.2785    -0.5530 

Sample size: 132
Number of estimated parameters: 16
Number of degrees of freedom: 116
Information criteria:
      AIC      AICc       BIC      BICc 
 977.8460  982.5764 1023.9708 1035.5197 

Forecast errors:
ME: -12.968; MAE: 13.971; RMSE: 19.143
sCE: -59.285%; Asymmetry: -91.7%; sMAE: 5.322%; sMSE: 0.532%
MASE: 0.58; RMSSE: 0.611; rMAE: 0.184; rRMSE: 0.186
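The information criteria in the outputs above can be reproduced by hand, which helps to see what is being compared. The sketch below assumes that the reported "Loss function value" is the negative log-likelihood (the numbers in the outputs are consistent with this) and uses the standard formulae for AIC and AICc; the variable names are introduced only for this illustration:

```r
# Loss values, numbers of estimated parameters and sample size
# copied from the three outputs above
lossValues <- c(ETSNormal=466.0744, ARIMANormal=491.7117, ARIMALogNormal=472.923)
nParam <- c(17, 16, 16)
nObs <- 132

# AIC = 2k - 2*logLik, where logLik is minus the loss value
aicValues <- 2*nParam + 2*lossValues
# AICc adds the small sample correction 2k(k+1)/(n-k-1)
aiccValues <- aicValues + 2*nParam*(nParam+1)/(nObs-nParam-1)

print(round(cbind(AIC=aicValues, AICc=aiccValues), 4))
```

The resulting values match the ones printed by the functions up to rounding, with the ETS model having the lowest AICc of the three.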

Note that although the AICc is higher for ARIMA than for ETS, the former has lower error measures than the latter. So, the higher AICc does not necessarily mean that the model is not good. But if we rely on the information criteria, then we should stick with ADAM ETS and we can then produce the forecasts for the next 12 observations (see Chapter 18):

adamETSAutoAirForecast <- forecast(adamETSAutoAir, h=12, interval="prediction",
                                   level=c(0.9,0.95,0.99))
par(mfcol=c(1,1))
plot(adamETSAutoAirForecast)

Forecast from ADAM ETS

Finally, if we want to do a more in-depth analysis of parameters of ADAM, we can also produce the summary, which will create the confidence intervals for the parameters of the model:

summary(adamETSAutoAir)

Model estimated using auto.adam() function: ETS(MAM)
Response variable: data
Distribution used in the estimation: Normal
Loss function type: likelihood; Loss function value: 466.0744
Coefficients:
            Estimate Std. Error Lower 2.5% Upper 97.5%  
alpha         0.8054     0.0864     0.6343      0.9761 *
beta          0.0000     0.0203     0.0000      0.0401  
gamma         0.0000     0.0382     0.0000      0.0755  
level        96.2372     6.8596    82.6496    109.7919 *
trend         2.0901     0.3955     1.3068      2.8716 *
seasonal_1    0.9145     0.0077     0.9003      0.9372 *
seasonal_2    0.8999     0.0081     0.8857      0.9227 *
seasonal_3    1.0308     0.0094     1.0165      1.0535 *
seasonal_4    0.9885     0.0077     0.9743      1.0112 *
seasonal_5    0.9856     0.0072     0.9713      1.0083 *
seasonal_6    1.1165     0.0093     1.1023      1.1392 *
seasonal_7    1.2340     0.0115     1.2198      1.2568 *
seasonal_8    1.2254     0.0105     1.2112      1.2481 *
seasonal_9    1.0668     0.0094     1.0526      1.0896 *
seasonal_10   0.9256     0.0087     0.9113      0.9483 *
seasonal_11   0.8040     0.0075     0.7898      0.8268 *

Error standard deviation: 0.0367
Sample size: 132
Number of estimated parameters: 17
Number of degrees of freedom: 115
Information criteria:
      AIC      AICc       BIC      BICc 
 966.1487  971.5172 1015.1564 1028.2628

Note that the summary() function might complain about the Observed Fisher Information. This is because the covariance matrix of parameters is calculated numerically and sometimes the likelihood is not maximised properly. I have not been able to fully resolve this issue yet, but hopefully will do at some point. The summary above shows, for example, that the smoothing parameters $\beta$ and $\gamma$ are not significantly different from zero (at the 5% level), while $\alpha$ is expected to vary between 0.6343 and 0.9761 in 95% of the cases. You can read more about the uncertainty of parameters in ADAM in Chapter 16 of the monograph.

As for the other features of ADAM, here is a brief guide:

If you work with multiple seasonal data, then you might need to specify the seasonality via the lags parameter, for example as lags=c(24,7*24) in case of hourly data. This is discussed in Chapter 12;
If you have intermittent data, then you should read Chapter 13, which explains how to work with the occurrence parameter of the function;
Explanatory variables are discussed in Chapter 10 and are handled in the adam() function via the formula parameter;
In the cases of heteroscedasticity (time varying or induced by some explanatory variables), there is a scale model (which is discussed in Chapter 17 and implemented as the sm() method for the adam class).

You can also experiment with advanced estimators (Chapter 11, including custom loss functions) via the loss parameter and forecast combinations (Section 15.4).

Long story short, if you are interested in univariate forecasting, then do give ADAM a try - it might have the flexibility you need for your experiments. If you are worried about its accuracy, check out this post, where I compared ADAM with other models.

And, as a friend of mine says, "Happy forecasting!"

The post The first draft of “Forecasting and Analytics with ADAM” first appeared on Open Forecast.

Introducing scale model in greybox

Ivan Svetunkov — Sun, 23 Jan 2022 18:04:33 +0000

At the end of June 2021, I released the greybox package version 1.0.0. This was a major release, introducing new functionality, but I did not have time to write a separate post about it because of teaching and a lack of free time. Finally, Christmas has arrived, and I could spend several hours preparing the post about it. In this post, I want to tell you about the new major feature in the greybox package.

Scale Model

The Scale Model is a regression-like model focusing on capturing the relation between the scale of a distribution (for example, variance in the Normal distribution) and a set of explanatory variables. It is implemented in the sm() method in the greybox package. The motivation for this comes from GAMLSS, the Generalised Additive Model for Location, Scale and Shape. While I have decided not to bother with the “GAM” part of this (there are gam and gamlss packages in R that do that), I liked the idea of being able to predict the scale (for example, variance) of a distribution. This becomes especially useful when one suspects heteroscedasticity in the model but does not think that variable transformations are appropriate.

To understand what the function does, it is necessary first to discuss the underlying model. We will start the discussion with an example of a linear regression model with two explanatory variables, assuming Normally distributed residuals $\xi_t$ with zero mean and a fixed variance $\sigma^2$, $\xi_t \sim \mathcal{N}(0,\sigma^2)$, which can be formulated as:
\begin{equation} \label{eq:model1}
y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \xi_t ,
\end{equation}
where $y_t$ is the response variable, $x_{1,t}$ and $x_{2,t}$ are the explanatory variables on observation $t$, $\beta_0$, $\beta_1$ and $\beta_2$ are the parameters of the model and $\xi_t \sim \mathcal{N}\left(0, \sigma^2 \right)$. Recalling the basic properties of Normal distribution, we can rewrite the same model as a model with standard normal residuals $\epsilon_t \sim \mathcal{N}\left(0, 1 \right)$ by inserting $\xi_t = \sigma \epsilon_t$ in \eqref{eq:model1}:
\begin{equation} \label{eq:model2}
y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \sigma \epsilon_t .
\end{equation}
Now if we suspect that the variance of the model might not be constant, we can substitute the standard deviation $\sigma$ with some function, transforming the model into:
\begin{equation} \label{eq:model3}
y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + f\left(\gamma_0 + \gamma_2 x_{2,t} + \gamma_3 x_{3,t}\right) \epsilon_t ,
\end{equation}
where $x_{2,t}$ and $x_{3,t}$ are the explanatory variables (as you see, not necessarily the same as in the first part of the model) and $\gamma_0$, $\gamma_2$ and $\gamma_3$ are the parameters of the scale part of the model. The idea here is that there is a regression model for the conditional mean of the distribution $\beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t}$, and that there is another one that will regulate the standard deviation via $f\left(\gamma_0 + \gamma_2 x_{2,t} + \gamma_3 x_{3,t}\right)$. The main thing to keep in mind about the latter is that the function $f(\cdot)$ needs to be strictly positive because the standard deviation cannot be zero or negative. The simplest way to guarantee this is to use the exponent as $f(\cdot)$. Furthermore, in our example with the Normal distribution, the scale corresponds to the variance, so we should be introducing the model for variance: $\sigma^2_t = \exp\left(\gamma_0 + \gamma_2 x_{2,t} + \gamma_3 x_{3,t}\right)$. This leads to the following model:
\begin{equation} \label{eq:model4}
y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \sqrt{\exp\left(\gamma_0 + \gamma_2 x_{2,t} + \gamma_3 x_{3,t}\right)} \epsilon_t .
\end{equation}
The model above would not only have the conditional mean depending on the values of explanatory variables (the conventional regression) but also the conditional variance, which would change depending on the values of variables. Note that this model assumes linearity in the conditional mean: an increase of $x_{1,t}$ by one leads to an increase of $y_t$ by $\beta_1$ on average. At the same time, it assumes non-linearity in the variance: an increase of $x_{2,t}$ by one leads to a change of variance by $\left(\exp(\gamma_2)-1\right)\times 100$%. If we want a non-linear change in the conditional mean, we can use a model in logarithms. Alternatively, we could assume a different distribution for the response variable $y_t$. To understand how the latter would work, we need to represent the same model \eqref{eq:model4} in a more general form. For the Normal distribution, the same model \eqref{eq:model4} can be rewritten as:
\begin{equation} \label{eq:model5}
y_t \sim \mathcal{N}\left(\beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t}, \exp\left(\gamma_0 + \gamma_2 x_{2,t} + \gamma_3 x_{3,t}\right)\right).
\end{equation}
This representation allows introducing the scale model for many other distributions, such as Laplace, Generalised Normal, Gamma, Inverse Gaussian etc. All that we need to do in those cases is to substitute the distribution $\mathcal{N}(\cdot)$ with a distribution of interest. The sm() function supports the same list of distributions as alm() (see the vignette for the function on CRAN or in R using the command vignette()). Each specific formula for scale would differ from one distribution to another, but the principles will be the same.
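To make the representation above more tangible, here is a minimal sketch in base R that estimates the Normal location and scale model by maximising the likelihood directly with optim(). This is not the code used in sm() or alm(): the data, the parameter values, the starting values and all function names are assumptions introduced for the illustration only.

```r
set.seed(42)
nObs <- 2000
x1 <- rnorm(nObs, 10, 3)
x2 <- rnorm(nObs, 10, 3)
x3 <- rnorm(nObs, 10, 3)
# Data from the Normal location and scale model (illustrative parameter values)
y <- 1000 - 0.75*x1 + 1.75*x2 + sqrt(exp(0.3 + 0.5*x2 - 0.4*x3))*rnorm(nObs)

X <- cbind(1, x1, x2)   # design matrix of the location part
Z <- cbind(1, x2, x3)   # design matrix of the scale part

# Negative log-likelihood of the Normal distribution with changing variance
negLogLik <- function(par){
    mu <- drop(X %*% par[1:3])
    logVariance <- drop(Z %*% par[4:6])
    -sum(dnorm(y, mean=mu, sd=sqrt(exp(logVariance)), log=TRUE))
}
# Analytical gradient of the function above
negLogLikGradient <- function(par){
    errors <- y - drop(X %*% par[1:3])
    weights <- exp(-drop(Z %*% par[4:6]))
    c(-crossprod(X, errors*weights), 0.5*crossprod(Z, 1 - errors^2*weights))
}

# Starting values: OLS for the location, regression of log squared residuals
# for the scale. The intercept is corrected by E[log(chi-squared(1))].
olsModel <- lm(y ~ x1 + x2)
scaleStart <- coef(lm(log(residuals(olsModel)^2) ~ x2 + x3))
scaleStart[1] <- scaleStart[1] - (digamma(0.5) + log(2))
start <- c(coef(olsModel), scaleStart)

fit <- optim(start, negLogLik, negLogLikGradient, method="BFGS")
gammaHat <- fit$par[4:6]
print(round(fit$par, 4))
# Implied change of variance (in percent) when x2 increases by one
print((exp(gammaHat[2]) - 1)*100)
```

The last line applies the interpretation of the scale parameters discussed above: with $\gamma_2$ close to 0.5, the variance increases by roughly 65% when $x_{2,t}$ increases by one.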

Demonstration in R

For demonstration purposes, we will use an example with artificial data, generated according to the model \eqref{eq:model4}:

xreg <- matrix(rnorm(300,10,3),100,3)
xreg <- cbind(1000-0.75*xreg[,1]+1.75*xreg[,2]+
              sqrt(exp(0.3+0.5*xreg[,2]-0.4*xreg[,3]))*rnorm(100,0,1),xreg)
colnames(xreg) <- c("y",paste0("x",c(1:3)))

The scatterplot of the generated data will look like this:

spread(xreg)

Scatterplot matrix for the generated data

We can then fit a model, specifying the location and scale parts of it in alm(). In this case, alm() will call sm() and will estimate both parts via likelihood maximisation. To make things closer to a forecasting task, we will withhold the last 10 observations for the test set:

ourModel <- alm(y~x1+x2+x3, scale=~x2+x3, xreg, subset=c(1:90), distribution="dnorm")

The returned model contains both parts. The scale part of the model can be accessed via ourModel$scale. It is an object of class "scale", supporting several methods, such as actuals(), residuals(), fitted(), summary() and plot() (and several others). Here is how the summary of the model looks in my case:

summary(ourModel)

Response variable: y
Distribution used in the estimation: Normal
Loss function used in estimation: likelihood
Coefficients:
             Estimate Std. Error Lower 2.5% Upper 97.5%  
(Intercept) 1000.2850     2.9698   994.3782   1006.1917 *
x1            -0.8350     0.1435    -1.1204     -0.5497 *
x2             1.8656     0.1714     1.5246      2.2065 *
x3            -0.0228     0.1776    -0.3761      0.3305  

Coefficients for scale:
            Estimate Std. Error Lower 2.5% Upper 97.5%  
(Intercept)   0.0436     0.7012    -1.3510      1.4382  
x2            0.4705     0.0413     0.3883      0.5527 *
x3           -0.3355     0.0487    -0.4324     -0.2385 *

Error standard deviation: 4.52
Sample size: 90
Number of estimated parameters: 7
Number of degrees of freedom: 83
Information criteria:
     AIC     AICc      BIC     BICc 
391.0191 392.3849 408.5177 411.5908

The summary above shows parameters for both parts of the model. They are not far from the ones used in the generation of the data, which indicates that the implemented model works as intended. The only issue here is that the standard errors in the location part of the model (first four coefficients) do not take the heteroscedasticity into account and thus are biased. The HAC standard errors are not yet implemented in alm().

Just to see the effect of the scale model, here are the diagnostic plots for the original model (which returns the $\xi_t$ residuals) and for the scale model ($\epsilon_t$ residuals):

par(mfcol=c(1,2))
plot(ourModel, 5)
plot(ourModel$scale, 5)

Diagnostics plots for sm

The Figure above shows squared residuals vs fitted values for the location (the plot on the left) and the scale (the plot on the right) models. The former is agnostic of the scale model and demonstrates that there is heteroscedasticity of residuals (the variance increases with the increase of the fitted values). The latter shows that the scale model managed to resolve the issue. While the LOWESS line demonstrates some non-linearity, the distribution of residuals conditional on fitted values looks random.

Finally, we can produce forecasts from such a model, similarly to how it is done for any other model estimated with alm():

ourForecast <- predict(ourModel,xreg[-c(1:90),],interval="pred")
plot(ourForecast)

Forecast from the model

In this case, the function will first predict the scale part of the model, then it will use the predicted variance and the covariance matrix of parameters to calculate the prediction intervals, shown in the Figure above. Given the independence of location and scale parts of the model, the conditional expectation (point forecast) will not change if we drop the scale model. It is all about variance.
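The role of the scale part in the interval can be sketched by hand using the point estimates from the summary above. The example below is a simplification: it ignores the uncertainty of parameters (which predict() takes into account via the covariance matrix), and the values of explanatory variables for the new observation are hypothetical:

```r
# Point estimates copied from the summary of ourModel above
betaHat <- c(1000.2850, -0.8350, 1.8656, -0.0228)   # intercept, x1, x2, x3
gammaHat <- c(0.0436, 0.4705, -0.3355)              # intercept, x2, x3

# Hypothetical new observation (values chosen for illustration)
xNew <- c(x1=10, x2=10, x3=10)

# Conditional mean from the location part
muNew <- sum(betaHat * c(1, xNew))
# Conditional standard deviation from the scale part
sigmaNew <- sqrt(exp(sum(gammaHat * c(1, xNew[c("x2","x3")]))))

# Approximate 95% prediction interval, ignoring the uncertainty of parameters
intervalNew <- muNew + qnorm(c(0.025, 0.975)) * sigmaNew
print(c(mean=muNew, sd=sigmaNew, lower=intervalNew[1], upper=intervalNew[2]))
```

Changing x2 or x3 in xNew changes the width of the interval, while the point forecast reacts to them only through the location part.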

Finally, if you do not want to use alm() function, you can use lm() instead and then apply the sm():

lmModel <- lm(y~x1+x2+x3, as.data.frame(xreg), subset=c(1:90))
smModel <- sm(lmModel, formula=~x2+x3, xreg)

In this case, sm() will assume that the error term follows the Normal distribution, and we will end up with two models that are not connected with each other (e.g., the predict() method applied to lmModel will not use predictions from the smModel). Nonetheless, we could still use all the R methods discussed above for the analysis of the smModel.

As a final word, the scale model is a new feature. While it already works, there might be bugs in it. If you find any, please let me know by submitting an issue on GitHub.

P.S.

There is a danger that the greybox package will soon be removed from CRAN together with 88 other packages (including my smooth and legion) because the nloptr package that it relies on has not passed some of the new checks recently introduced by CRAN. This is beyond my control, and I do not have time or power to influence this, but if this happens, you might need to switch to the installation from GitHub via the remotes package, using the command:

remotes::install_github("config-i1/greybox")

My apologies for the inconvenience. I might be able to remove the dependence on nloptr at some point, but it will not happen before March 2022.


An Integrated Method for Estimation and Optimisation

Ivan Svetunkov — Fri, 03 Sep 2021 15:47:15 +0000

My PhD student, Congzheng Liu (co-supervised with Adam Letchford), has written a paper entitled “Newsvendor Problems: An Integrated Method for Estimation and Optimisation”. This paper has recently been published in EJOR. In this paper we build upon the existing Ban & Rudin (2019) approach for the newsvendor problem, showing that in the case of the linear model, it becomes equivalent to quantile regression. We then extend it to the non-linear newsvendor problems, testing it on simulated and real life data. In order to understand what specifically we propose, we need to discuss the typical process in the case of the newsvendor problem.

Newsvendor is a class of problems where the product can only be sold for one day, after which it goes to waste. So this is appropriate, for example, for perishable products in retail. Typically, in this situation we would have historical data on sales of our product $y_t$ and we would try forecasting it using regression / ETS / ARIMA or any other model. After doing that and obtaining the estimates of parameters, we would produce a quantile of the assumed distribution, which then tells us how much to order ($q_t$). If we order more than needed, we will have holding costs. In the opposite case, we will have shortage costs. Based on these costs and the price of the product, we can find the optimal order that will give the maximum profit.

As you can already spot, the forecasting stage is detached from the optimisation one in this situation. The idea of the proposed integrated approach (IMEO) is simple: instead of optimising the model via MSE or any other conventional loss and then solving the optimisation problem, we could estimate the model via maximisation of the specific profit function, thus obtaining the required orders directly. This is not a new idea on its own, but using the profit function rather than the cost (as Ban & Rudin, 2019 did) allows applying IMEO to a wider set of problems.

For example, if we know the price of the product $p$, the costs for production $v$, holding $c_h$ and shortage costs $c_s$, we can then calculate profit as (for a linear newsvendor problem):
\begin{equation}
\pi(q_t,y_t)=
\begin{cases}
p y_t -v q_t -c_h (q_t -y_t),& \text{for } q_t \geq y_t\\
p q_t -v q_t -c_s (y_t -q_t),& \text{for } q_t< y_t,
\end{cases}
\end{equation}
where $q_t$ is the order quantity and $y_t$ is the actual sales. This profit function can be used for the estimation of a model of your choosing. Congzheng has written separate R code for the experiments for the paper. Inspired by his example, I have implemented custom losses in the alm() and adam() functions from the greybox and smooth packages for R, respectively. At the moment, only the regression model works properly with custom losses; ETS / ARIMA need additional modifications, which we will hopefully resolve in the next paper. So, here is an example with the linear newsvendor problem and alm():

# Generate artificial data
x1 <- rnorm(100,100,10)
x2 <- rbinom(100,2,0.05)
y <- 10 + 1.5*x1 + 5*x2 + rnorm(100,0,10)
ourData <- cbind(y=y,x1=x1,x2=x2)

# Define price and costs
price <- 50
costBasic <- 5
costShort <- 15
costHold <- 1

# Define profit function for the linear case
lossProfit <- function(actual, fitted, B, xreg){
    # Minus sign is needed here, because we need to minimise the loss
    profit <- -ifelse(actual >= fitted,
                     (price - costBasic) * fitted - costShort * (actual - fitted),
                     price * actual - costBasic * fitted - costHold * (fitted - actual));
    return(sum(profit));
}

# Estimate the model
model1 <- alm(y~x1+x2, ourData, loss=lossProfit)

# Print summary of the model
summary(model1, bootstrap=TRUE)

Response variable: y
Distribution used in the estimation: Normal
Loss function used in estimation: custom
Bootstrap was used for the estimation of uncertainty of parameters
Coefficients:
            Estimate Std. Error Lower 2.5% Upper 97.5%  
(Intercept)  36.5177    14.2840     2.7783     51.4844 *
x1            1.3622     0.1622     1.1909      1.7528 *
x2            3.3423     2.7810    -6.5997      5.9101  

Error standard deviation: 17.2266
Sample size: 100
Number of estimated parameters: 3
Number of degrees of freedom: 97

The resulting model is easy to work with: it provides meaningful parameters, showing how on average the order should change if a variable changes by one. For example, we see that with the increase of the variable x1, the orders should change on average by 1.36.

Note that in this specific case, as shown in our paper, the model would be equivalent to the quantile regression, estimated for the quantile $\left( \frac{c_u}{c_o+c_u} \right)$, where $c_u= p-v+c_s$ is the "underage" cost and $c_o = v+c_h$ is the "overage" cost. In our example it corresponds to approximately the 0.9091 quantile. We can compare the output of this model with the one from the quantile regression in alm() (which is estimated as an Asymmetric Laplace model):

model2 <- alm(y~x1+x2, ourData, distribution="dalaplace", alpha=0.9091)
summary(model2, bootstrap=TRUE)

Response variable: y
Distribution used in the estimation: Asymmetric Laplace with alpha=0.9091
Loss function used in estimation: likelihood
Bootstrap was used for the estimation of uncertainty of parameters
Coefficients:
            Estimate Std. Error Lower 2.5% Upper 97.5%  
(Intercept)  36.6688    11.6686     3.8674     51.1987 *
x1            1.3611     0.1338     1.1920      1.7454 *
x2            3.1259     2.5424    -6.2518      5.4703  

Error standard deviation: 17.3379
Sample size: 100
Number of estimated parameters: 4
Number of degrees of freedom: 96
Information criteria:
     AIC     AICc      BIC     BICc 
826.4622 826.8833 836.8829 837.8524

The differences between the estimates of parameters of the two models are due to the optimisation procedure, which would converge to slightly different points in these two cases. Still, the values of parameters are close to each other and would converge asymptotically, which supports our finding.
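The link between the profit function and the quantile can also be checked numerically in base R without any model. The sketch below uses the prices and costs from the example above, rewrites the profit as "margin minus asymmetric cost" and finds the best constant order on an artificial sample. The simulated demand and all the function names are assumptions made for the illustration:

```r
# Prices and costs from the example above
price <- 50; costBasic <- 5; costShort <- 15; costHold <- 1

# Underage and overage costs and the implied quantile
costUnder <- price - costBasic + costShort
costOver <- costBasic + costHold
quantileLevel <- costUnder / (costOver + costUnder)
print(round(quantileLevel, 4))

# Profit of the linear newsvendor problem for order q and demand y
profit <- function(q, y){
    ifelse(q >= y,
           price*y - costBasic*q - costHold*(q - y),
           price*q - costBasic*q - costShort*(y - q))
}
# The same profit written as margin minus asymmetric (pinball-like) cost
profitPinball <- function(q, y){
    (price - costBasic)*y - (costOver*pmax(q - y, 0) + costUnder*pmax(y - q, 0))
}

set.seed(33)
demand <- rnorm(1000, 160, 10)
orders <- rnorm(1000, 165, 10)
# Both formulations should give the same values
print(all.equal(profit(orders, demand), profitPinball(orders, demand)))

# Constant order maximising the total profit vs the empirical quantile
candidates <- sort(demand)
totalProfit <- sapply(candidates, function(q) sum(profit(q, demand)))
bestOrder <- candidates[which.max(totalProfit)]
empiricalQuantile <- unname(quantile(demand, quantileLevel, type=1))
print(c(bestOrder=bestOrder, quantile=empiricalQuantile))
```

Because the margin term does not depend on the order, maximising the profit is the same as minimising the asymmetric cost, which is why the best constant order lands on the sample quantile.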

And here is how the orders over time look in the case of our custom loss:

plot(model1, 7)

Dynamics of orders from alm model

The purple line in the Figure above corresponds to the orders and would cover roughly 90.91% of cases, so that we would run out of product in approximately 9% of cases, which would still be more profitable than any other option.

Finally, the approach also works well in the case of the non-linear newsvendor problem (see the paper for details), where quantile regression is not suitable and the conventional approach fails. The only thing that would change is the loss function, where the prices and costs would depend non-linearly on the order quantity and sales.

You can read the published paper on EJOR website or the working paper on ResearchGate.


The creation of ADAM – next step in statistical forecasting

Ivan Svetunkov — Wed, 13 Jan 2021 11:24:18 +0000

Good news everyone! The future of statistical forecasting is finally here :). Have you ever struggled with ETS and needed explanatory variables? Have you ever needed to unite ARIMA and ETS? Have you ever needed to deal with all those zeroes in the data? What about the data with multiple seasonalities? All of this and more can now be solved by adam() function from smooth v3.0.1 package for R (on its way to CRAN now). ADAM stands for “Augmented Dynamic Adaptive Model” (I will talk about it in the next CMAF Friday Forecasting Talk on 15th January). Now, what is ADAM? Well, something like this:

The Creation of ADAM by Arne Niklas Jansson with my adaptation

ADAM is the next step in time series analysis and forecasting. Remember exponential smoothing and functions like es() and ets()? Remember ARIMA and functions like arima(), ssarima(), msarima() etc? Remember your favourite linear regression function, e.g. lm(), glm() or alm()? Well, now these three models are implemented in a unified framework. Now you can have exponential smoothing with ARIMA elements and explanatory variables in one box: adam(). You can do ETS components and ARIMA orders selection, together with explanatory variables selection in one go. You can estimate ETS / ARIMA / regression using either likelihood of a selected distribution or using conventional losses like MSE, or even using your own custom loss. You can tune parameters of the optimiser and experiment with initialisation and estimation of the model. The function can deal with multiple seasonalities and with intermittent data in one place. In fact, there are so many features that it is just easier to list the major ones:

1. ETS;
2. ARIMA;
3. Regression;
4. TVP regression;
5. Combination of (1), (2) and either (3), or (4);
6. Automatic selection / combination of states for ETS;
7. Automatic orders selection for ARIMA;
8. Variables selection for regression part;
9. Normal and non-normal distributions;
10. Automatic selection of most suitable distributions;
11. Advanced and custom loss functions;
12. Multiple seasonality;
13. Occurrence part of the model to handle zeroes in data (intermittent demand);
14. Model diagnostics using plot() and other methods;
15. Confidence intervals for parameters of models;
16. Automatic outliers detection;
17. Handling missing data;
18. Fine tuning of persistence vector (smoothing parameters);
19. Fine tuning of initial values of the state vector (e.g. level / trend / seasonality / ARIMA components / regression parameters);
20. Two initialisation options (optimal / backcasting);
21. Provided ARMA parameters;
22. Fine tuning of optimiser (select algorithm and convergence criteria);
…

All of this is based on the Single Source of Error state space model, which makes ETS, ARIMA and regression directly comparable via information criteria and opens a variety of modelling and forecasting possibilities. In addition, the code is much more efficient than the code of already existing smooth functions, so hopefully this will be a convenient function to use. I do not promise that everything will work 100% efficiently from scratch, because this is a new function, which implies that inevitably there are bugs and there is room for improvement. But I intend to continue working on it, improving it further, based on the provided feedback (you can submit an issue on github if you have ideas).

Keep in mind that starting from smooth v3.0.0 I will not be introducing new features in es(), ssarima() and other conventional functions for univariate variables in smooth – I will only fix bugs in them and possibly optimise some parts of the code, but there will be no innovations in them, given that the main focus from now on will be on adam(). To that extent, I have removed some experimental and not fully developed parameters from those functions (e.g. occurrence, oesmodel, updateX, persistenceX and transitionX).

Now, I realise that ADAM is something completely new and contains just too much information to cover in one post. As a result, I have started the work on an online textbook. This is work in progress, missing some chapters, but it already covers many important elements of ADAM. If you find any mistakes in the text or formulae, please, use the “Open Review” functionality in the textbook to give me feedback or send me a message. This will be highly appreciated, because, working on this alone, I am sure that I have made plenty of mistakes and typos.

Example in R

Finally, it would be boring just to announce things and leave it like that. So, I’ve decided to come up with an R experiment on M1, M3 and tourism competitions data, similar to how I’ve done it in 2017, just to show how the function compares with the other conventional ones, measuring their accuracy and computational time:

Huge chunk of code in R

# Load the packages. If the packages are not available, install them from CRAN
library(Mcomp)
library(Tcomp)
library(smooth)
library(forecast)

# Load the packages for parallel calculation
# This package is available for Linux and MacOS only
# Comment out this line if you work on Windows
library(doMC)

# Set up the cluster on all cores / threads.
## Note that the code that follows might take around 500Mb per thread,
## so the issue is not in the number of threads, but rather in the RAM availability
## If you do not have enough RAM,
## you might need to reduce the number of threads manually.
## But this should not be greater than the number of threads your processor can do.
registerDoMC(detectCores())

##### Alternatively, if you work on Windows (why?), uncomment and run the following lines
# library(doParallel)
# cl <- makeCluster(detectCores())
# registerDoParallel(cl)
#####

# Create a small but neat function that will return a vector of error measures
errorMeasuresFunction <- function(object, holdout, insample){
    return(c(measures(holdout, object$mean, insample),
             mean(holdout < object$upper & holdout > object$lower),
             mean(object$upper-object$lower)/mean(insample),
             pinball(holdout, object$upper, 0.975)/mean(insample),
             pinball(holdout, object$lower, 0.025)/mean(insample),
             sMIS(holdout, object$lower, object$upper, mean(insample),0.95),
             object$timeElapsed))
}

# Create the list of datasets
datasets <- c(M1,M3,tourism)
datasetLength <- length(datasets)
# Give names to competing forecasting methods
methodsNames <- c("ADAM-ETS(ZZZ)","ADAM-ETS(ZXZ)","ADAM-ARIMA",
                  "ETS(ZXZ)","ETSHyndman","AutoSSARIMA","AutoARIMA");
methodsNumber <- length(methodsNames);
# Run adam on one of time series from the competitions to get names of error measures
test <- adam(datasets[[125]]);
# The array with error measures for each method on each series.
## Here we calculate a lot of error measures, but we will use only few of them
testResults <- array(NA,c(methodsNumber,datasetLength,length(test$accuracy)+6),
                             dimnames=list(methodsNames, NULL,
                                           c(names(test$accuracy),
                                             "Coverage","Range",
                                             "pinballUpper","pinballLower","sMIS",
                                             "Time")));

#### ADAM(ZZZ) ####
j <- 1;
result <- foreach(i=1:datasetLength, .combine="cbind", .packages="smooth") %dopar% {
    startTime <- Sys.time()
    test <- adam(datasets[[i]],"ZZZ");
    testForecast <- forecast(test, h=datasets[[i]]$h, interval="pred");
    testForecast$timeElapsed <- Sys.time() - startTime;
    return(errorMeasuresFunction(testForecast, datasets[[i]]$xx, datasets[[i]]$x));
}
testResults[j,,] <- t(result);

#### ADAM(ZXZ) ####
j <- 2;
result <- foreach(i=1:datasetLength, .combine="cbind", .packages="smooth") %dopar% {
    startTime <- Sys.time()
    test <- adam(datasets[[i]],"ZXZ");
    testForecast <- forecast(test, h=datasets[[i]]$h, interval="pred");
    testForecast$timeElapsed <- Sys.time() - startTime;
    return(errorMeasuresFunction(testForecast, datasets[[i]]$xx, datasets[[i]]$x));
}
testResults[j,,] <- t(result);

#### ADAMARIMA ####
j <- 3;
result <- foreach(i=1:datasetLength, .combine="cbind", .packages="smooth") %dopar% {
    startTime <- Sys.time()
    test <- adam(datasets[[i]], "NNN",
                 order=list(ar=c(3,2),i=c(2,1),ma=c(3,2),select=TRUE));
    testForecast <- forecast(test, h=datasets[[i]]$h, interval="pred");
    testForecast$timeElapsed <- Sys.time() - startTime;
    return(errorMeasuresFunction(testForecast, datasets[[i]]$xx, datasets[[i]]$x));
}
testResults[j,,] <- t(result);

#### ES(ZXZ) ####
j <- 4;
result <- foreach(i=1:datasetLength, .combine="cbind", .packages="smooth") %dopar% {
    startTime <- Sys.time()
    test <- es(datasets[[i]],"ZXZ");
    testForecast <- forecast(test, h=datasets[[i]]$h, interval="parametric");
    testForecast$timeElapsed <- Sys.time() - startTime;
    return(errorMeasuresFunction(testForecast, datasets[[i]]$xx, datasets[[i]]$x));
}
testResults[j,,] <- t(result);

#### ETS from forecast package ####
j <- 5;
result <- foreach(i=1:datasetLength, .combine="cbind", .packages="forecast") %dopar% {
    startTime <- Sys.time()
    test <- ets(datasets[[i]]$x);
    testForecast <- forecast(test, h=datasets[[i]]$h, level=95);
    testForecast$timeElapsed <- Sys.time() - startTime;
    return(errorMeasuresFunction(testForecast, datasets[[i]]$xx, datasets[[i]]$x));
}
testResults[j,,] <- t(result);

#### AUTO SSARIMA ####
j <- 6;
result <- foreach(i=1:datasetLength, .combine="cbind", .packages="smooth") %dopar% {
    startTime <- Sys.time()
    test <- auto.ssarima(datasets[[i]]);
    testForecast <- forecast(test, h=datasets[[i]]$h, interval=TRUE);
    testForecast$timeElapsed <- Sys.time() - startTime;
    return(errorMeasuresFunction(testForecast, datasets[[i]]$xx, datasets[[i]]$x));
}
testResults[j,,] <- t(result);

#### AUTOARIMA ####
j <- 7;
result <- foreach(i=1:datasetLength, .combine="cbind", .packages="forecast") %dopar% {
    startTime <- Sys.time()
    test <- auto.arima(datasets[[i]]$x);
    testForecast <- forecast(test, h=datasets[[i]]$h, level=95);
    testForecast$timeElapsed <- Sys.time() - startTime;
    return(errorMeasuresFunction(testForecast, datasets[[i]]$xx, datasets[[i]]$x));
}
testResults[j,,] <- t(result);

# If you work on Windows, don't forget to shutdown the cluster via the following command:
# stopCluster(cl)

After running this code, we will get the big array (7x5315x21), which would contain many different error measures for point forecasts and prediction intervals. We will not use all of them, but instead will extract MASE and RMSSE for point forecasts and Coverage, Range and sMIS for prediction intervals, together with computational time. Although it might be more informative to look at distributions of those variables, we will calculate mean and median values overall, just to get a feeling about the performance:

A much smaller chunk of code in R

round(apply(testResults[,,c("MASE","RMSSE","Coverage","Range","sMIS","Time")],
            c(1,3),mean),3)
round(apply(testResults[,,c("MASE","RMSSE","Range","sMIS","Time")],
            c(1,3),median),3)

This will result in the following two tables (boldface shows the best performing functions):

Means:
               MASE RMSSE Coverage Range  sMIS  Time
ADAM-ETS(ZZZ) 2.415 2.098    0.888 1.398 2.437 0.654
ADAM-ETS(ZXZ) 2.250 1.961    0.895 1.225 2.092 0.497
ADAM-ARIMA    2.551 2.203    0.862 0.968 3.098 5.990
ETS(ZXZ)      2.279 1.977    0.862 1.372 2.490 1.128
ETSHyndman    2.263 1.970    0.882 1.200 2.258 0.404
AutoSSARIMA   2.482 2.134    0.801 0.780 3.335 1.700
AutoARIMA     2.303 1.989    0.834 0.805 3.013 1.385

Medians:
               MASE RMSSE Range  sMIS  Time
ADAM-ETS(ZZZ) 1.362 1.215 0.671 0.917 0.396
ADAM-ETS(ZXZ) 1.327 1.184 0.675 0.909 0.310
ADAM-ARIMA    1.476 1.300 0.769 1.006 3.525
ETS(ZXZ)      1.335 1.198 0.616 0.931 0.551
ETSHyndman    1.323 1.181 0.653 0.925 0.164
AutoSSARIMA   1.419 1.271 0.577 0.988 0.909
AutoARIMA     1.310 1.182 0.609 0.881 0.322

Some things to note from this:

ADAM ETS(ZXZ) is the most accurate model in terms of mean MASE and RMSSE, it has the coverage closest to 95% (although none of the models achieved the nominal value because of the fundamental underestimation of uncertainty) and has the lowest sMIS, implying that it did better than the other functions in terms of prediction intervals;
The ETS(ZZZ) did worse than ETS(ZXZ) because the former considers the multiplicative trend, which sometimes becomes unstable, producing exploding trajectories;
ADAM ARIMA is not performing well yet, because of the implemented order selection algorithm and it was the slowest function of all. I plan to improve it in future releases of the function;
While ADAM ETS(ZXZ) did not beat ETS from forecast package in terms of computational time, it was faster than the other functions;
When it comes to medians, auto.arima(), ets() and auto.ssarima() seem to be doing better than ADAM, but not by a large margin.
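
To make the interval measures in the tables above more tangible, here is a small sketch on toy data. It uses base R only. The helper `intervalMeasures()` is hypothetical (it is not a function from smooth or greybox), and the interval score follows the standard textbook definition, divided by the in-sample mean, which is my assumption about how the scaling in sMIS works:

```r
# Hypothetical helper: coverage, relative range and scaled interval score
# for one series. Not the greybox implementation, just the textbook formulas.
intervalMeasures <- function(holdout, lower, upper, insample, level=0.95){
    alpha <- 1 - level
    # Share of holdout values that fall inside the interval
    coverage <- mean(holdout > lower & holdout < upper)
    # Average width of the interval relative to the in-sample mean
    relRange <- mean(upper - lower) / mean(insample)
    # Interval score: width plus penalties for observations outside the bounds
    penalty <- 2/alpha * (pmax(lower - holdout, 0) + pmax(holdout - upper, 0))
    sMIS <- mean((upper - lower) + penalty) / mean(insample)
    return(c(Coverage=coverage, Range=relRange, sMIS=sMIS))
}

# Toy data: the in-sample mean is 100, the interval is 90..110 for every horizon
insample <- rep(100, 10)
holdout <- c(95, 105, 100, 120)
toyMeasures <- intervalMeasures(holdout, lower=rep(90, 4), upper=rep(110, 4), insample)
print(toyMeasures)
```

In this toy case a single observation outside the bounds dominates the score, which is why narrow intervals with lower than nominal coverage tend to be penalised by sMIS.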

In order to see if the performance of functions is statistically different, we run the RMCB test for MASE, RMSSE and sMIS. Note that RMCB compares the median performance of functions. Here is the R code:

A smaller chunk of code in R for the MCB test

# Load the package with the function
library(greybox)
# Run it for each separate measure, automatically producing plots
rmcbResultMASE <- rmcb(t(testResults[,,"MASE"]))
rmcbResultRMSSE <- rmcb(t(testResults[,,"RMSSE"]))
rmcbResultsMIS <- rmcb(t(testResults[,,"sMIS"]))

And here are the figures that we get by running that code

RMCB test for MASE

RMCB test for RMSSE

As we can see from the two figures above, ADAM-ETS(Z,X,Z) performs better than the other functions, although it is not statistically different from ETS implemented in es() and ets() functions. ADAM-ARIMA is the worst performing function for the moment, as we have already noticed in the previous analysis. The ranking is similar for both MASE and RMSSE.

And here is the sMIS plot:

RMCB test for sMIS

When it comes to sMIS, the leader in terms of medians is auto.arima(), doing quite similar to ets(), but this is mainly because they have lower ranges, incidentally resulting in lower than needed coverage (as seen from the summary performance above). ADAM-ETS does similar to ets() and es() in this aspect (the intervals of the three intersect).

Obviously, we could provide a more detailed analysis of the performance of functions on different types of data and see how they compare in each category, but the aim of this post is just to demonstrate how the new function works, so I do not intend to investigate this in detail.

Finally, I will present ADAM with several case studies in CMAF Friday Forecasting Talk on 15th January. If you are interested to hear more and have some questions, please register on MeetUp or via LinkedIn and join us online.

Message The creation of ADAM – next step in statistical forecasting first appeared on Open Forecast.

Analytics with greybox

Ivan Svetunkov — Mon, 07 Jan 2019 16:40:17 +0000

One of the reasons why I have started the greybox package is to use it for marketing research and marketing analytics. The common problem that I face when working with these courses is analysing the data measured in different scales. While R handles numeric scales natively, the work with categorical ones is not satisfactory. Yes, I know that there are packages that implement some of the functions, but I wanted to have them in one place without the need to install a lot of packages and satisfy the dependencies. After all, what’s the point in installing a package for Cramer’s V, when it can be calculated with two lines of code? So, here’s a brief explanation of the functions for marketing analytics in greybox.

I will use `mtcars` dataset for the examples, but we will transform some of the variables into factors:

mtcarsData <- as.data.frame(mtcars)
mtcarsData$vs <- factor(mtcarsData$vs, levels=c(0,1), labels=c("v","s"))
mtcarsData$am <- factor(mtcarsData$am, levels=c(0,1), labels=c("a","m"))

All the functions discussed in this post are available in greybox starting from v0.4.0. However, I’ve found several bugs since the submission to CRAN, and the most recent version with bugfixes is now available on github.

Analysing the relation between the two variables in categorical scales

Cramer’s V

Cramer’s V measures the relation between two variables in categorical scale. It is implemented in the cramer() function. It returns the value in a range of 0 to 1 (1 – when the two categorical variables are linearly associated with each other, 0 – otherwise), Chi-Squared statistics from the chisq.test(), the respective p-value and the number of degrees of freedom. The tested hypothesis in this case is formulated as:
\begin{matrix}
H_0: V = 0 \text{ (the variables don’t have association);} \\
H_1: V \neq 0 \text{ (there is an association between the variables).}
\end{matrix}

Here’s what we get when trying to find the association between the engine and transmission in the `mtcars` data:

cramer(mtcarsData$vs, mtcarsData$am)

Cramer's V: 0.1042
Chi^2 statistics = 0.3475, df: 1, p-value: 0.5555

Judging by this output, the association between these two variables is very low (close to zero) and is not statistically significant.
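
For those who want to see what is under the hood, here is a minimal sketch of the calculation in base R. The helper `cramerSketch()` is hypothetical and is not the greybox implementation. It assumes the default behaviour of chisq.test(), which applies the continuity correction to 2x2 tables, and on this example it should give approximately the same numbers as the output above:

```r
# Hypothetical helper: Cramer's V from a contingency table, base R only
cramerSketch <- function(x, y){
    counts <- table(x, y)
    # chisq.test() applies Yates' continuity correction to 2x2 tables by default
    test <- suppressWarnings(chisq.test(counts))
    chi2 <- unname(test$statistic)
    # V = sqrt(chi^2 / (n * (min(rows, columns) - 1)))
    v <- sqrt(chi2 / (sum(counts) * (min(dim(counts)) - 1)))
    return(c(V=v, chi2=chi2, df=unname(test$parameter), p=test$p.value))
}

vsAm <- cramerSketch(mtcars$vs, mtcars$am)
round(vsAm, 4)
```

The statistic is divided by the sample size and by the smaller of the two table dimensions minus one, which is what keeps the value between 0 and 1.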

Cramer’s V can also be used for the data in numerical scales. In general, this might not be the most suitable solution, but it might be useful when you have a small number of values in the data. For example, the variable `gear` in `mtcars` is numerical, but it has only three options (3, 4 and 5). Here’s what Cramer’s V tells us in the case of `gear` and `am`:

cramer(mtcarsData$am, mtcarsData$gear)

Cramer's V: 0.809
Chi^2 statistics = 20.9447, df: 2, p-value: 0

As we see, the value is high in this case (0.809), and the null hypothesis is rejected at the 5% level. So we can conclude that there is a relation between the two variables. This does not mean that one variable causes the other one, but they both might be driven by something else (do more expensive cars have fewer gears but the automatic transmission?).

Plotting categorical variables

While R allows plotting two categorical variables against each other, the plot is hard to read and is not very helpful (in my opinion):

plot(table(mtcarsData$am,mtcarsData$gear))

Default plot of a table

So I have created a function that produces a heat map for two categorical variables. It is called tableplot():

tableplot(mtcarsData$am,mtcarsData$gear)

Tableplot for the two categorical variables

It is based on table() function and uses the frequencies inside the table for the colours:

table(mtcarsData$am,mtcarsData$gear) / length(mtcarsData$am)

        3       4       5
a 0.46875 0.12500 0.00000
m 0.00000 0.25000 0.15625

The darker sectors mean that there is a higher concentration of values, while the white ones correspond to zeroes. So, in our example, we see that the largest group of cars (almost 47%) has automatic transmissions with three gears. Furthermore, the plot shows that there is some sort of relation between the two variables: the cars with automatic transmissions have the lower number of gears, while the ones with the manual have the higher number of gears (something we’ve already noticed in the previous subsection).

Association between the categorical and numerical variables

While Cramer’s V can also be used for the measurement of association between the variables in different scales, there are better instruments. For example, some analysts recommend using intraclass correlation coefficient when measuring the relation between the numerical and categorical variables. But there is a simpler option, which involves calculating the coefficient of multiple correlation between the variables. This is implemented in mcor() function of greybox. The `y` variable should be numerical, while `x` can be of any type. What the function then does is expands all the factors and runs a regression via .lm.fit() function, returning the square root of the coefficient of determination. If the variables are linearly related, then the returned value will be close to one. Otherwise it will be closer to zero. The function also returns the F statistics from the regression, the associated p-value and the number of degrees of freedom (the hypothesis is formulated similarly to cramer() function).

Here’s how it works:

mcor(mtcarsData$am,mtcarsData$mpg)

Multiple correlations value: 0.5998
F-statistics = 16.8603, df: 1, df resid: 30, p-value: 3e-04

In this example, a simple linear regression of `mpg` on the set of dummies for `am` is constructed, and we can conclude that there is a linear relation between the variables, and that this relation is statistically significant.
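
The same idea can be sketched with lm() from base R. The helper `mcorSketch()` is hypothetical and only mimics the logic described above (greybox itself relies on .lm.fit()):

```r
# Hypothetical helper: multiple correlation between a numeric y and any x
mcorSketch <- function(x, y){
    # lm() expands factors into dummy variables automatically
    fit <- lm(y ~ x)
    # Square root of the coefficient of determination
    return(sqrt(summary(fit)$r.squared))
}

amFactor <- factor(mtcars$am, levels=c(0,1), labels=c("a","m"))
mcorAmMpg <- mcorSketch(amFactor, mtcars$mpg)
round(mcorAmMpg, 4)
```

With a factor of only two levels, as here, the value coincides with the absolute value of the usual correlation coefficient between `mpg` and the dummy variable.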

Association between several variables

Measures of association

When you deal with datasets (i.e. data frames or matrices), then you can use cor() function in order to calculate the correlation coefficients between the variables in the data. But when you have a mixture of numerical and categorical variables, the situation becomes more difficult, as the correlation does not make sense for the latter. This motivated me to create a function that uses either cor(), or cramer(), or mcor() functions depending on the types of data (see discussions of cramer() and mcor() above). The function is called association() or assoc() and returns three matrices: the values of the measures of association, their p-values and the types of the functions used between the variables. Here’s an example:

assocValues <- assoc(mtcarsData)
print(assocValues,digits=2)

 Associations: 
 values:
        mpg  cyl  disp    hp  drat    wt  qsec   vs   am gear carb
 mpg   1.00 0.86 -0.85 -0.78  0.68 -0.87  0.42 0.66 0.60 0.66 0.67
 cyl   0.86 1.00  0.92  0.84  0.70  0.78  0.59 0.82 0.52 0.53 0.62
 disp -0.85 0.92  1.00  0.79 -0.71  0.89 -0.43 0.71 0.59 0.77 0.56
 hp   -0.78 0.84  0.79  1.00 -0.45  0.66 -0.71 0.72 0.24 0.66 0.79
 drat  0.68 0.70 -0.71 -0.45  1.00 -0.71  0.09 0.44 0.71 0.83 0.33
 wt   -0.87 0.78  0.89  0.66 -0.71  1.00 -0.17 0.55 0.69 0.66 0.61
 qsec  0.42 0.59 -0.43 -0.71  0.09 -0.17  1.00 0.74 0.23 0.63 0.67
 vs    0.66 0.82  0.71  0.72  0.44  0.55  0.74 1.00 0.10 0.62 0.69
 am    0.60 0.52  0.59  0.24  0.71  0.69  0.23 0.10 1.00 0.81 0.44
 gear  0.66 0.53  0.77  0.66  0.83  0.66  0.63 0.62 0.81 1.00 0.51
 carb  0.67 0.62  0.56  0.79  0.33  0.61  0.67 0.69 0.44 0.51 1.00
 
 p-values:
       mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
 mpg  1.00 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.01
 cyl  0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.01
 disp 0.00 0.00 1.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.07
 hp   0.00 0.00 0.00 1.00 0.01 0.00 0.00 0.00 0.18 0.00 0.00
 drat 0.00 0.00 0.00 0.01 1.00 0.00 0.62 0.01 0.00 0.00 0.66
 wt   0.00 0.00 0.00 0.00 0.00 1.00 0.34 0.00 0.00 0.00 0.02
 qsec 0.02 0.00 0.01 0.00 0.62 0.34 1.00 0.00 0.21 0.00 0.01
 vs   0.00 0.00 0.00 0.00 0.01 0.00 0.00 1.00 0.56 0.00 0.01
 am   0.00 0.01 0.00 0.18 0.00 0.00 0.21 0.56 1.00 0.00 0.28
 gear 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.09
 carb 0.01 0.01 0.07 0.00 0.66 0.02 0.01 0.01 0.28 0.09 1.00
 
 types:
      mpg    cyl      disp   hp     drat   wt     qsec   vs       am      
 mpg  "none" "mcor"   "cor"  "cor"  "cor"  "cor"  "cor"  "mcor"   "mcor"  
 cyl  "mcor" "none"   "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "cramer"
 disp "cor"  "mcor"   "none" "cor"  "cor"  "cor"  "cor"  "mcor"   "mcor"  
 hp   "cor"  "mcor"   "cor"  "none" "cor"  "cor"  "cor"  "mcor"   "mcor"  
 drat "cor"  "mcor"   "cor"  "cor"  "none" "cor"  "cor"  "mcor"   "mcor"  
 wt   "cor"  "mcor"   "cor"  "cor"  "cor"  "none" "cor"  "mcor"   "mcor"  
 qsec "cor"  "mcor"   "cor"  "cor"  "cor"  "cor"  "none" "mcor"   "mcor"  
 vs   "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "none"   "cramer"
 am   "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "none"  
 gear "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "cramer"
 carb "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "cramer"
      gear     carb    
 mpg  "mcor"   "mcor"  
 cyl  "cramer" "cramer"
 disp "mcor"   "mcor"  
 hp   "mcor"   "mcor"  
 drat "mcor"   "mcor"  
 wt   "mcor"   "mcor"  
 qsec "mcor"   "mcor"  
 vs   "cramer" "cramer"
 am   "cramer" "cramer"
 gear "none"   "cramer"
 carb "cramer" "none"

One thing to note is that the function considers numerical variables as categorical, when they only have up to 10 unique values. This is useful, for example, in case of the number of gears (the variable `gear`) in the dataset.
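
The selection logic behind the "types" matrix can be sketched in a few lines of base R. Both helpers below are hypothetical and only illustrate the rule described above, assuming that "up to 10 unique values" means 10 or fewer:

```r
# Hypothetical helper: is the variable treated as categorical?
# Assumption: a numeric variable with at most 10 unique values counts as categorical
isCategorical <- function(x, maxUnique=10){
    return(is.factor(x) || is.character(x) || length(unique(x)) <= maxUnique)
}

# Hypothetical helper: which measure to use for a pair of variables
measureType <- function(x, y){
    catX <- isCategorical(x)
    catY <- isCategorical(y)
    if(catX && catY){
        return("cramer")
    } else if(catX || catY){
        return("mcor")
    }
    return("cor")
}

# Apply the rule to every pair of columns in mtcars
types <- sapply(mtcars, function(x) sapply(mtcars, function(y) measureType(x, y)))
diag(types) <- "none"
types[c("mpg","cyl","gear"), c("disp","cyl","carb")]
```

On `mtcars` this rule treats `cyl`, `vs`, `am`, `gear` and `carb` as categorical, which agrees with the types reported by assoc() above.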

Plots of association between several variables

Similarly to the problem with cor(), scatterplot matrix (produced using plot()) is not meaningful in case of a mixture of variables:

plot(mtcarsData)

Default scatter plot matrix

It makes sense to use scatterplot in case of numeric variables, tableplot() in case of categorical and boxplot() in case of a mixture. So, there is the function spread() in greybox that creates something more meaningful. It uses the same algorithm as assoc() function, but produces plots instead of calculating measures of association. So, `gear` will be considered as categorical and the function will produce either boxplot() or tableplot(), when plotting it against other variables.

Here’s an example:

spread(mtcarsData)

Spread matrix

This plot demonstrates, for example, that the number of carburetors influences fuel consumption (something that we could not have spotted in the case of plot()). Notice also, that the number of gears influences the fuel consumption in a non-linear relation as well. So constructing the model with dummy variables for the number of gears might be a reasonable thing to do.

The function also has the parameter `log`, which will transform all the numerical variables using logarithms, which is handy, when you suspect the non-linear relation between the variables. Finally, there is a parameter `histograms`, which will plot either histograms, or barplots on the diagonal.

spread(mtcarsData, histograms=TRUE, log=TRUE)

Spread matrix in logs

The plot demonstrates that the `disp` has a strong non-linear relation with `mpg`, and, similarly, `drat` and `hp` also influence `mpg` in a non-linear fashion.

Regression diagnostics

One of the problems of linear regression that can be diagnosed prior to the model construction is multicollinearity. The conventional way of doing this diagnostic is via calculating the variance inflation factor (VIF) after constructing the model. However, VIF is not easy to interpret, because it lies in $(1,\infty)$. Coefficients of determination from the linear regression models of explanatory variables are easier to interpret and work with. If such a coefficient is equal to one, then there are some perfectly correlated explanatory variables in the dataset. If it is equal to zero, then they are not linearly related.

There is a function determination() or determ() in greybox that returns the set of coefficients of determination for the explanatory variables. The good thing is that this can be done before constructing any model. In our example, the first column, `mpg` is the response variable, so we can diagnose the multicollinearity the following way:

determination(mtcarsData[,-1])

       cyl      disp        hp      drat        wt      qsec        vs 
 0.9349544 0.9537470 0.8982917 0.7036703 0.9340582 0.8671619 0.8017720 
        am      gear      carb 
 0.7924392 0.8133441 0.8735577

As we can see from the output above, `disp` is the most linearly related with the variables, so including it in the model might cause the multicollinearity, which will decrease the efficiency of the estimates of parameters.
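
Here is a minimal sketch of the same diagnostics in base R, which also shows the link with VIF. The helper `determinationSketch()` is hypothetical. It is applied to the original numeric `mtcars`, so for the numeric columns it should agree closely with the output above, while the values for `vs` and `am` may differ, because greybox handles factors in its own way:

```r
# Hypothetical helper: R^2 of each explanatory variable regressed on all the others
determinationSketch <- function(xreg){
    xreg <- as.data.frame(xreg)
    sapply(names(xreg), function(name){
        # Formula of the form: name ~ all the other variables
        fit <- lm(reformulate(setdiff(names(xreg), name), response=name), data=xreg)
        summary(fit)$r.squared
    })
}

r2 <- determinationSketch(mtcars[,-1])
round(r2, 4)
# The link with the variance inflation factor: VIF = 1 / (1 - R^2)
round(1 / (1 - r2), 2)
```

The second line of output shows why the coefficients of determination are easier to read: R^2 of roughly 0.95 for `disp` corresponds to VIF of more than 20.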

Message Analytics with greybox first appeared on Open Forecast.

greybox 0.3.0 – what’s new

Ivan Svetunkov — Tue, 07 Aug 2018 16:06:26 +0000

Three months have passed since the initial release of greybox on CRAN. I would not say that the package develops like crazy, but there have been some changes since May. Let’s have a look. We start by loading both greybox and smooth:

library(greybox)
library(smooth)

Rolling Origin

First of all, ro() function now has its own class and works with plot() function, so that you can have a visual representation of the results. Here’s an example:

x <- rnorm(100,100,10)
ourCall <- "es(data, h=h, intervals=TRUE)"
ourValue <- c("forecast", "lower", "upper")
ourRO <- ro(x,h=20,origins=5,ourCall,ourValue,co=TRUE)
plot(ourRO)

Example of the plot of rolling origin function

Each point on the produced graph corresponds to an origin and straight lines correspond to the forecasts. Given that we asked for point forecasts and for lower and upper bounds of prediction interval, we have three respective lines. By plotting the results of rolling origin experiment, we can see if the model is stable or not. Just compare the previous graph with the one produced from the call to Holt's model:

ourCall <- "es(data, model='AAN', h=h, intervals=TRUE)"
ourRO <- ro(x,h=20,origins=5,ourCall,ourValue,co=TRUE)
plot(ourRO)

Example of the plot of rolling origin function with ETS(A,A,N)

Holt's model is not suitable for this time series, so its forecasts are less stable than the forecasts of the automatically selected model in the previous case (which is ETS(A,N,N)).

Once again, there is a vignette with examples for the ro() function, have a look if you want to know more.
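
If you want to understand the mechanics of the rolling origin, here is a stripped-down sketch in base R. The function `rollingOrigin()` is hypothetical and is not how ro() is implemented. It assumes a holdout of constant length, which is my reading of what `co=TRUE` does, and it uses two trivial forecasting methods instead of es():

```r
# Hypothetical helper: rolling origin with a holdout of constant length.
# forecastFunction(train, h) must return h point forecasts.
rollingOrigin <- function(y, h, origins, forecastFunction){
    n <- length(y)
    forecasts <- actuals <- matrix(NA, h, origins,
                                   dimnames=list(paste0("h", 1:h),
                                                 paste0("origin", 1:origins)))
    for(i in 1:origins){
        # The last origin leaves exactly h observations for the holdout
        trainEnd <- n - h - origins + i
        forecasts[,i] <- forecastFunction(y[1:trainEnd], h)
        actuals[,i] <- y[trainEnd + 1:h]
    }
    return(list(forecasts=forecasts, actuals=actuals))
}

set.seed(41)
x <- rnorm(100, 100, 10)
# Two simple competitors: the global mean and the last observation (Naive)
roMean <- rollingOrigin(x, h=20, origins=5, function(train, h) rep(mean(train), h))
roNaive <- rollingOrigin(x, h=20, origins=5, function(train, h) rep(tail(train, 1), h))
# Mean absolute error per origin for each method
colMeans(abs(roMean$actuals - roMean$forecasts))
colMeans(abs(roNaive$actuals - roNaive$forecasts))
```

Comparing how much the errors change from one origin to another gives the same kind of information about stability as the plots above, just in numbers.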

ALM - Advanced Linear Model

Yes, there is "Generalised Linear Model" in R, which implements Poisson, Gamma, Binomial and other regressions. Yes, there are smaller packages, implementing models with more exotic distributions. But I needed several regression models with: Laplace distribution, Folded normal distribution, Chi-squared distribution and one new mysterious distribution, which is currently called "S distribution". I needed them in one place and in one format: properly estimated using likelihoods, returning confidence intervals, information criteria and being able to produce forecasts. I also wanted them to work similar to lm(), so that the learning curve would not be too steep. So, here it is, the function alm(). It works quite similar to lm():

xreg <- cbind(rfnorm(100,1,10),rnorm(100,50,5))
xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rlaplace(100,0,3),xreg,rnorm(100,300,10))
colnames(xreg) <- c("y","x1","x2","Noise")
inSample <- xreg[1:80,]
outSample <- xreg[-c(1:80),]

ourModel <- alm(y~x1+x2, inSample, distribution="laplace")
summary(ourModel)

Here's the output of the summary:

Distribution used in the estimation: Laplace
Coefficients:
            Estimate Std. Error Lower 2.5% Upper 97.5%
(Intercept) 95.85207    0.36746   95.12022    96.58392
x1           0.59618    0.02479    0.54681     0.64554
x2          -0.67865    0.00622   -0.69103    -0.66626
ICs:
     AIC     AICc      BIC     BICc 
474.2453 474.7786 483.7734 484.9419

And here's the respective plot of the forecast:

plot(forecast(ourModel,outSample))

Forecast from lm with Laplace distribution

The thing that is currently missing in the function is prediction intervals, but this will be added in the upcoming releases.

Having the likelihood approach allows comparing different models with different distributions using information criteria. Here's, for example, what model we get if we assume S-distribution (which has fatter tails than Laplace):

summary(alm(y~x1+x2, inSample, distribution="s"))

Distribution used in the estimation: S
Coefficients:
            Estimate Std. Error Lower 2.5% Upper 97.5%
(Intercept) 95.61244    0.23386   95.14666    96.07821
x1           0.56144    0.00721    0.54708     0.57581
x2          -0.66867    0.00302   -0.67470    -0.66265
ICs:
     AIC     AICc      BIC     BICc 
482.9358 483.4692 492.4639 493.6325

As you see, the information criteria for S distribution are higher than for Laplace, so we can conclude that the previous model was better than the second in terms of ICs.
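
To show where such information criteria come from, here is a rough sketch of the Laplace case in base R. This is not how alm() is implemented: it relies on the fact that maximising the Laplace likelihood in the coefficients is equivalent to minimising the sum of absolute errors, and it uses optim() with its default settings. The data and all the names below are made up for the illustration:

```r
set.seed(42)
n <- 80
x1 <- abs(rnorm(n, 1, 10))
x2 <- rnorm(n, 50, 5)
# Laplace noise with scale 3, generated as a difference of two exponential variables
y <- 100 + 0.5*x1 - 0.75*x2 + 3*(rexp(n) - rexp(n))
X <- cbind(1, x1, x2)

# Sum of absolute errors as a function of the coefficients
sae <- function(b) sum(abs(y - X %*% b))
# Use the OLS estimates as starting values
olsFit <- lm(y ~ x1 + x2)
ladFit <- optim(coef(olsFit), sae)

# The ML estimate of the Laplace scale is the mean absolute error
scaleLaplace <- ladFit$value / n
logLikLaplace <- -n*log(2*scaleLaplace) - ladFit$value/scaleLaplace
# Number of estimated parameters: three coefficients plus the scale
k <- length(ladFit$par) + 1
aicLaplace <- 2*k - 2*logLikLaplace

# The same data under the normal distribution, for comparison
c(Laplace=aicLaplace, Normal=AIC(olsFit))
```

Both models have four estimated parameters here, so the difference in AIC comes purely from the difference in the likelihoods of the two distributions.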

Note that at this moment the AICc and BICc are not correct for non-normal models (at least the derivation of them needs to be double checked, which I haven't done yet), so don't rely on them too much.

I intend to add several other distributions that either are not available in R or are implemented unsatisfactorily (from my point of view) - the function is written in a quite flexible way, so this should not be difficult to do. If you have any preferences, please add them on github, here.

I also want to implement the mixture distributions, so that things discussed in the paper on intermittent state-space model can also be implemented using pure regression.

Finally, now that I have alm, we can select between the regression models with different distributions (with stepwise() function) or even combine them using AIC weights (hello, lmCombine()!). Yes, I know that it sounds crazy (think of the pool of models in this case), but this should be fun!

Regression for Multiple Comparison with the Best

Please, note that this part of the post has been updated on 02.03.2020 in order to reflect the changes in the v0.5.9 version of the package.
One of the typical tasks in forecasting is to evaluate the performance of different methods on the holdout. In order to do that, it is common to use some statistical tests, the most popular of which is Nemenyi / MCB (Multiple Comparison with the Best method). The test implemented in greybox package uses similar principles and relies on ranks of methods, but instead of taking averages and then applying studentised distances, it constructs a regression on the ranked data. This way we compare the median performance of different methods (the same way as it is done in the classical MCB) and we produce parametric confidence intervals for parameters. The test is based on the simple linear model with dummy variables for each provided method (1 if the error corresponds to the method and 0 otherwise). Here's an example of how this thing works:

ourData <- cbind(rnorm(100,0,10), rnorm(100,-2,5), rnorm(100,2,6), rlaplace(100,1,5))
colnames(ourData) <- c("Method A","Method B","Method C","Method D")

ourTest <- rmcb(ourData, level=0.95)

By default the function produces a graph in the MCB style:

RMCB example, MCB style plot

If we compare the results of the test with the mean rank values, we will see that they are the same:

apply(t(apply(ourData,1,rank)),2,mean)

Method A Method B Method C Method D 
    2.40     2.06     2.75     2.79

ourTest$mean

Method B Method A Method C Method D 
    2.06     2.40     2.75     2.79

This also reflects how the data was generated. Notice that Method D was generated from the Laplace distribution with mean 1, but the test still managed to give the correct answer in this situation, because the Laplace distribution is symmetric and the sample size is large enough. But the main point of the test is that we get confidence intervals for each parameter, so we can see whether the differences between the methods are significant: if the intervals of two methods intersect, then the difference between them is not.

The regression model used in the calculation is saved in the variable model and you can request a basic summary from it:

summary(ourTest$model)

            Estimate Std. Error  Lower 2.5% Upper 97.5%
(Intercept)     2.40  0.1083601  2.18761804  2.61238196
Method B       -0.34  0.1532444 -0.64035346 -0.03964654
Method C        0.35  0.1532444  0.04964654  0.65035346
Method D        0.39  0.1532444  0.08964654  0.69035346

But, please, keep in mind that this is not a proper "lm" object, so you cannot do much with it.

The function also reports the p-value from the F-test of the regression, testing the standard hypothesis that all the parameters are equal to zero.
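
If you want to see the mechanics behind this, the regression on ranks with dummy variables can be roughly imitated in base R. This is only a sketch and not the code of rmcb(): it uses lm() on the stacked ranks, the helper rLaplaceSketch() is a made-up replacement for rlaplace(), and the numbers will differ from the ones above because the data is generated anew.

```r
# A sketch of the regression on ranks with dummy variables (base R only).
set.seed(42)

# Hypothetical helper: Laplace draws as a scaled difference of two exponentials
rLaplaceSketch <- function(n, mu, b) mu + b * (rexp(n) - rexp(n))

errors <- cbind(rnorm(100, 0, 10), rnorm(100, -2, 5),
                rnorm(100, 2, 6), rLaplaceSketch(100, 1, 5))
colnames(errors) <- c("Method A", "Method B", "Method C", "Method D")

# Rank the methods within each row (observation)
ranks <- t(apply(errors, 1, rank))

# Stack the ranks into the long format; the factor produces the dummies
longData <- data.frame(rank = as.vector(ranks),
                       method = factor(rep(colnames(ranks), each = nrow(ranks)),
                                       levels = colnames(ranks)))

# The intercept is the mean rank of Method A, the rest are differences from it
rankModel <- lm(rank ~ method, data = longData)
print(coef(rankModel))
print(confint(rankModel, level = 0.95))
print(colMeans(ranks))

# p-value of the F-test for the whole regression
fStat <- summary(rankModel)$fstatistic
fPValue <- pf(fStat["value"], fStat["numdf"], fStat["dendf"], lower.tail = FALSE)
print(fPValue)
```

Note that the ranks within one row are not independent (they always sum up to the same number), so the standard errors from a plain lm() should be treated as an approximation.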

We can also produce plots with vertical lines that connect the models belonging to the same group (no statistical difference between them, i.e. their intervals intersect). Here's an example for the same data:

plot(ourTest, outplot="lines")

RMCB example, lines plot

If you want to tune the plot, you can always do this using the standard plot parameters:

plot(ourTest, xlab="Models", ylab="Errors")

Also, given that we work with a flexible plot method, you can tune the parameters of the canvas using "par()" function, as it is usually done in R.

What else?

Several methods have been moved from smooth to greybox. These include:

pointLik() - returns point likelihoods, discussed in our research with Nikos;
pAIC(), pBIC(), pAICc(), pBICc() - point values of the respective information criteria, from the same research;
nParam() - returns the number of estimated parameters in the model (+ variance);
errorType() - returns the type of error used in the model (Additive / Multiplicative).
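
To give a rough idea of what "point likelihoods" are, here is a small base R sketch for a normal linear regression. It does not call pointLik() or nParam(), and the details in greybox may differ. It only shows that the log-likelihood of the model can be split into the contributions of individual observations, assuming the normal distribution with the ML estimate of the standard deviation.

```r
# A sketch of point log-likelihoods for a normal regression (base R only).
set.seed(43)
x <- rnorm(50)
y <- 3 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)

# ML estimate of the standard deviation (division by n, not by n - k)
res <- residuals(fit)
sigmaML <- sqrt(mean(res^2))

# Contribution of each observation to the log-likelihood
pointLogLik <- dnorm(res, mean = 0, sd = sigmaML, log = TRUE)

# Their sum is the log-likelihood of the whole model
print(sum(pointLogLik))
print(as.numeric(logLik(fit)))

# Number of estimated parameters: the coefficients plus the variance
nParameters <- length(coef(fit)) + 1
print(nParameters)
```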

Furthermore, as you might have already noticed, I've implemented several distribution functions:

Folded normal distribution;
Laplace distribution;
S distribution.
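
For reference, the densities of the first two distributions are easy to write down in base R. The functions below are sketches with made-up names and textbook parameterisations (location mu and scale b for Laplace, mu and sigma of the underlying normal for the folded normal), which might not match the arguments of the functions in greybox.

```r
# Sketches of two densities (base R only; names and arguments are hypothetical).

# Laplace density with location mu and scale b
dLaplaceSketch <- function(x, mu = 0, b = 1) {
  exp(-abs(x - mu) / b) / (2 * b)
}

# Folded normal density: the density of |X|, where X ~ N(mu, sigma^2)
dFoldedNormalSketch <- function(x, mu = 0, sigma = 1) {
  ifelse(x >= 0, dnorm(x, mu, sigma) + dnorm(-x, mu, sigma), 0)
}

# Sanity check: both densities should integrate to one.
# The Laplace integral is split at mu, where the density has a kink.
laplaceArea <- integrate(dLaplaceSketch, -Inf, 1, mu = 1, b = 5)$value +
  integrate(dLaplaceSketch, 1, Inf, mu = 1, b = 5)$value
foldedArea <- integrate(dFoldedNormalSketch, 0, Inf, mu = 1, sigma = 2)$value

print(c(laplaceArea, foldedArea))
```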

Finally, there is also a function called lmDynamic(), which uses pAIC in order to produce dynamic linear regression models. But this should be discussed in a separate post.

That's it for now. See you in greybox 0.4.0!

The post greybox 0.3.0 – what’s new first appeared on Open Forecast.