<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Archives Common parameters - Open Forecasting</title>
	<atom:link href="https://openforecast.org/category/r-en/smooth/general/feed/" rel="self" type="application/rss+xml" />
	<link>https://openforecast.org/category/r-en/smooth/general/</link>
	<description>How to look into the future</description>
	<lastBuildDate>Tue, 31 Mar 2020 15:49:30 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2015/08/cropped-usd-05-32x32.png&amp;nocache=1</url>
	<title>Archives Common parameters - Open Forecasting</title>
	<link>https://openforecast.org/category/r-en/smooth/general/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>&#8220;smooth&#8221; package for R. Common ground. Part IV. Exogenous variables. Advanced stuff</title>
		<link>https://openforecast.org/2018/02/10/smooth-package-for-r-common-ground-part-iv-exogenous-variables-advanced-stuff/</link>
					<comments>https://openforecast.org/2018/02/10/smooth-package-for-r-common-ground-part-iv-exogenous-variables-advanced-stuff/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Sat, 10 Feb 2018 15:51:33 +0000</pubDate>
				<category><![CDATA[Common parameters]]></category>
		<category><![CDATA[Package smooth for R]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Univariate models]]></category>
		<category><![CDATA[smooth]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=1645</guid>

					<description><![CDATA[<p>Previously we’ve covered the basics of exogenous variables in smooth functions. Today we will go slightly crazy and discuss automatic variables selection. But before we do that, we need to look at a Santa’s little helper function implemented in smooth. It is called xregExpander(). It is useful in cases when you think that your exogenous [&#8230;]</p>
<p>The post <a href="https://openforecast.org/2018/02/10/smooth-package-for-r-common-ground-part-iv-exogenous-variables-advanced-stuff/">&#8220;smooth&#8221; package for R. Common ground. Part IV. Exogenous variables. Advanced stuff</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Previously we’ve covered the basics of exogenous variables in smooth functions. Today we will go slightly crazy and discuss automatic variable selection. But before we do that, we need to look at a Santa’s little helper function implemented in the <span class="lang:r decode:true crayon-inline">greybox</span> package. It is called <span class="lang:r decode:true crayon-inline">xregExpander()</span>. It is useful in cases when you think that your exogenous variable may influence the variable of interest via some lag or lead. Let’s say that we think that BJsales.lead, discussed in the <a href="/en/2018/01/15/smooth-package-for-r-common-ground-part-iii-exogenous-variables-basic-stuff/">previous post</a>, influences sales in a non-standard way: for example, we believe that today’s sales are driven by several values of that variable: its value today, five days ago and ten days ago. This means that we need to include <span class="lang:r decode:true crayon-inline">BJsales.lead</span> with lags. And <span class="lang:r decode:true crayon-inline">xregExpander()</span> from <span class="lang:r decode:true crayon-inline">greybox</span> automates this for us:</p>
<pre class="decode">library("greybox")
newXreg <- xregExpander(BJsales.lead, lags=c(-5,-10))</pre>
<p>The <span class="lang:r decode:true crayon-inline">newXreg</span> is a matrix, which contains the original data, the data with lag 5 and the data with lag 10. However, if we just move the original data several observations ahead or backwards, we will have missing values, so <span class="lang:r decode:true crayon-inline">xregExpander()</span> fills in those values with the forecasts using <span class="lang:r decode:true crayon-inline">es()</span> and <span class="lang:r decode:true crayon-inline">iss()</span> functions (depending on the type of variable we are dealing with). This also means that in cases of binary variables you may have weird averaged values as forecasts (e.g. 0.7812), so beware and look at the produced matrix. Maybe in your case it makes sense to just substitute these weird numbers with zeroes...</p>
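<p>To make the lag/lead alignment concrete, the shifting itself can be sketched in a few lines of base R. The helper below is hypothetical (it is not part of any package) and only illustrates where the gaps appear; <span class="lang:r decode:true crayon-inline">xregExpander()</span> fills those NAs with forecasted values:</p>

```r
# Hypothetical helper illustrating the shifting behind xregExpander().
# Negative values produce lags (x[t-k]), positive values produce leads
# (x[t+k]); the NAs mark where xregExpander() would plug in forecasts.
shiftSeries <- function(x, shift) {
    n <- length(x)
    if (shift < 0) {
        c(rep(NA, -shift), x[1:(n + shift)])   # lag: gaps at the start
    } else if (shift > 0) {
        c(x[(shift + 1):n], rep(NA, shift))    # lead: gaps at the end
    } else {
        x
    }
}

x <- 1:10
cbind(x, xLag2 = shiftSeries(x, -2), xLead3 = shiftSeries(x, 3))
```

<p>Row t of the resulting matrix aligns x[t], x[t-2] and x[t+3], which is exactly the alignment a regression with lagged and leading variables needs.</p>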
<p>You may also need leads instead of lags. This is regulated with the same "lags" parameter but with positive values:</p>
<pre class="decode">newXreg <- xregExpander(BJsales.lead, lags=c(7,-5,-10))</pre>
<p>Once again, the values are shifted, and the 7 missing values of the lead variable are filled in using the same forecasting mechanism.</p>
<p>After expanding the exogenous variables, we can use them in <span class="lang:r decode:true crayon-inline">smooth</span> functions for forecasting purposes. All the rules discussed in <a href="/en/2018/01/15/smooth-package-for-r-common-ground-part-iii-exogenous-variables-basic-stuff/">the previous post</a> apply:</p>
<pre class="decode">es(BJsales, "XXN", xreg=newXreg, h=10, holdout=TRUE)</pre>
<p>But what should we do if we have several variables and we are not sure what lags and leads to select? This may become a complicated task with several possible solutions. <span class="lang:r decode:true crayon-inline">smooth</span> functions provide one. I should warn you that this is not necessarily the best solution, but <strong>a</strong> solution. There is a function called <span class="lang:r decode:true crayon-inline">stepwise()</span> in <span class="lang:r decode:true crayon-inline">greybox</span> that does the selection based on an information criterion and partial correlations. In order to run this function, the response variable needs to be in the first column of the data. The idea of the function is simple; it works iteratively the following way:</p>
<ol>
<li>The basic model regressing the response variable on a constant is constructed (this corresponds to the simple mean). An information criterion is calculated;</li>
<li>The correlations of the residuals of the model with all the original exogenous variables are calculated;</li>
<li>The regression model of the response variable and all the variables in the previous model plus the new most correlated variable from (2) is constructed using <span class="lang:r decode:true crayon-inline">lm()</span> function;</li>
<li>An information criterion is calculated and compared with the one from the previous model. If it is greater than or equal to the previous one, then we stop and use the previous model. Otherwise we go to step 2.</li>
</ol>
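<p>The steps above can be sketched in base R. The snippet below is a simplified illustration of the idea (using plain correlations of the residuals and AIC), not the actual <span class="lang:r decode:true crayon-inline">stepwise()</span> implementation, which relies on partial correlations and supports several information criteria:</p>

```r
# Simplified sketch of the stepwise idea. The response variable is
# assumed to be in the first column of `data`. This is NOT the actual
# greybox::stepwise() implementation, just the logic of steps (1)-(4).
stepwiseSketch <- function(data) {
    response <- colnames(data)[1]
    included <- character(0)
    # (1) start from the model with a constant only (the simple mean)
    bestModel <- lm(as.formula(paste(response, "~ 1")), data = data)
    repeat {
        candidates <- setdiff(colnames(data)[-1], included)
        if (length(candidates) == 0) break
        # (2) correlate the residuals with the remaining variables
        correlations <- sapply(candidates, function(v) {
            abs(cor(residuals(bestModel), data[[v]]))
        })
        # (3) refit, adding the most correlated variable
        included <- c(included, names(which.max(correlations)))
        newModel <- lm(as.formula(paste(response, "~",
                                        paste(included, collapse = " + "))),
                       data = data)
        # (4) stop if the information criterion did not improve
        if (AIC(newModel) >= AIC(bestModel)) break
        bestModel <- newModel
    }
    bestModel
}

set.seed(42)
exampleData <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
exampleData$y <- 2 * exampleData$x1 + rnorm(100, 0, 0.1)
exampleData <- exampleData[, c("y", "x1", "x2")]
coef(stepwiseSketch(exampleData))
```

<p>On this toy data the relevant variable x1 is picked up, while the irrelevant x2 is typically left out because it does not improve the information criterion.</p>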
<p>This way we do not do a blind search; instead, we follow some sort of "trace" of a good model: if some significant part that can be explained by one of the exogenous variables is left in the residuals, then that variable is included in the model. Following correlations makes sure that we include only meaningful (from a technical point of view) things in the model. Using information criteria allows overcoming the problem of the uncertainty of statistical tests. In general, the function guarantees that you will end up with a model with a low information criterion. However, this does not guarantee that you will end up with a meaningful model or with a model that produces the most accurate forecasts. And this is why evolution has granted human beings the almighty brain – it helps in selecting the most appropriate model in those cases when statistics fails.</p>
<p>Let’s see how the function works with the same example and 1 to 10 leads and 1 to 10 lags (so we have 21 variables now). First we expand the data and form the matrix with all the variables:</p>
<pre class="decode">newXreg <- as.data.frame(xregExpander(BJsales.lead,lags=c(-10:10)))
newXreg <- cbind(as.matrix(BJsales),newXreg)
colnames(newXreg)[1] <- "y"</pre>
<p>This way we have a nice data frame with nice names, not something weird with strange long names. It is important to note that the response variable should be in the first column of the resulting matrix. After that we use our magical stepwise function:</p>
<pre class="decode">ourModel <- stepwise(newXreg)</pre>
<p>And here’s what it returns (the object of class “lm”):</p>
<pre>Call:
lm(formula = y ~ xLag4 + xLag9 + xLag3 + xLag10 + xLag5 + xLag6 + 
    xLead9 + xLag7 + xLag8, data = newXreg)

Coefficients:
(Intercept)        xLag4        xLag9        xLag3       xLag10        xLag5        xLag6  
    17.6448       3.3712       1.3724       4.6781       1.5412       2.3213       1.7075  
     xLead9        xLag7        xLag8  
     0.3767       1.4025       1.3370</pre>
<p>The variables in the formula are listed in order from the most correlated with the response variable to the least correlated. The function works very fast, because it does not need to go through all the possible combinations of variables in the dataset.</p>
<p>Okay. So, that’s the second nice function in <span class="lang:r decode:true crayon-inline">greybox</span> that you can use for exogenous variables. But what does it have to do with the main forecasting functions?</p>
<p>The thing is that <span class="lang:r decode:true crayon-inline">es()</span>, <span class="lang:r decode:true crayon-inline">ssarima()</span>, <span class="lang:r decode:true crayon-inline">ces()</span>, <span class="lang:r decode:true crayon-inline">ges()</span> – all the forecasting functions in <span class="lang:r decode:true crayon-inline">smooth</span> – have a parameter <span class="lang:r decode:true crayon-inline">xregDo</span>, which defines what to do with the provided exogenous variables; by default it is set to <span class="lang:r decode:true crayon-inline">"use"</span>. However, there is also the option <span class="lang:r decode:true crayon-inline">"select"</span>, which uses the aforementioned <span class="lang:r decode:true crayon-inline">stepwise()</span> function. The function is applied to the residuals of the specified model, and when the appropriate exogenous variables are found, the model with these variables is re-estimated. This may cause some inaccuracies in the selection mechanism (because, for example, the optimisation of the ETS and X parts does not happen simultaneously), but at least it is done in a finite time frame.</p>
<p>Let’s see how this works. First we use all the expanded variables for up to 10 lags and leads without the variable selection mechanism (I will remove the response variable from the first column of the data we previously used):</p>
<pre class="decode">newXreg <- newXreg[,-1]
ourModelUse <- es(BJsales, "XXN", xreg=newXreg, h=10, holdout=TRUE, silent=FALSE, xregDo="use", intervals="sp")</pre>
<pre>Time elapsed: 1.13 seconds
Model estimated: ETSX(ANN)
Persistence vector g:
alpha 
0.922 
Initial values were optimised.
24 parameters were estimated in the process
Residuals standard deviation: 0.287
Xreg coefficients were estimated in a normal style
Cost function type: MSE; Cost function value: 0.068

Information criteria:
      AIC      AICc       BIC 
 69.23731  79.67209 139.83673 
95% semiparametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: 0%; Bias: 55.7%; MAPE: 0.1%; SMAPE: 0.1%
MASE: 0.166; sMAE: 0.1%; RelMAE: 0.055; sMSE: 0%</pre>
<div id="attachment_1621" style="width: 310px" class="wp-caption alignnone"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/01/BJsalesETSXUse.png&amp;nocache=1"><img fetchpriority="high" decoding="async" aria-describedby="caption-attachment-1621" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/01/BJsalesETSXUse-300x175.png&amp;nocache=1" alt="" width="300" height="175" class="size-medium wp-image-1621" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/01/BJsalesETSXUse-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/01/BJsalesETSXUse-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/01/BJsalesETSXUse-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/01/BJsalesETSXUse.png&amp;nocache=1 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1621" class="wp-caption-text">BJsales series and ETSX with all the variables</p></div>
<p>As we see, the forecast became more accurate in this case than <a href="/en/2018/01/15/smooth-package-for-r-common-ground-part-iii-exogenous-variables-basic-stuff/">in the case of using pure BJsales.lead</a>, which may mean that there is a lagged relationship between the sales and the indicator. However, we have included too many exogenous variables, which may lead to overfitting. So there is a potential for an increase in accuracy if we remove the redundant variables. And that’s where the selection procedure kicks in:</p>
<pre class="decode">ourModelSelect <- es(BJsales, "XXN", xreg=newXreg, h=10, holdout=TRUE, silent=FALSE, xregDo="select", intervals="sp")</pre>
<pre>Time elapsed: 0.98 seconds
Model estimated: ETSX(ANN)
Persistence vector g:
alpha 
    1 
Initial values were optimised.
11 parameters were estimated in the process
Residuals standard deviation: 0.283
Xreg coefficients were estimated in a normal style
Cost function type: MSE; Cost function value: 0.074

Information criteria:
     AIC     AICc      BIC 
54.55463 56.61713 86.91270 
95% semiparametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: 0%; Bias: 61.4%; MAPE: 0.1%; SMAPE: 0.1%
MASE: 0.159; sMAE: 0.1%; RelMAE: 0.052; sMSE: 0%</pre>
<div id="attachment_1622" style="width: 310px" class="wp-caption alignnone"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/01/BJsalesETSXSelect.png&amp;nocache=1"><img decoding="async" aria-describedby="caption-attachment-1622" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/01/BJsalesETSXSelect-300x175.png&amp;nocache=1" alt="" width="300" height="175" class="size-medium wp-image-1622" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/01/BJsalesETSXSelect-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/01/BJsalesETSXSelect-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/01/BJsalesETSXSelect-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/01/BJsalesETSXSelect.png&amp;nocache=1 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1622" class="wp-caption-text">BJsales series and ETSX with the selected variables</p></div>
<p>Although it is hard to see from the graph, there is an increase in forecasting accuracy: MASE was reduced from 0.166 to 0.159. AICc also decreased, from 79.67209 to 56.61713. This is because only 8 exogenous variables are used in the second model (instead of 21):</p>
<pre class="decode">ncol(ourModelUse$xreg)
ncol(ourModelSelect$xreg)</pre>
<p>The xreg selection works even when you use a combination of forecasts. In this case the exogenous variables are selected for each model separately, and then the forecasts are combined based on IC weights. For example, this way we can combine different non-seasonal additive ETS models:</p>
<pre class="decode">ourModelCombine <- es(BJsales, c("ANN","AAN","AAdN","CCN"), xreg=newXreg, h=10, holdout=TRUE, silent=FALSE, xregDo="s", intervals="sp")</pre>
<pre>Time elapsed: 1.46 seconds
Model estimated: ETSX(CCN)
Initial values were optimised.
Residuals standard deviation: 0.272
Xreg coefficients were estimated in a normal style
Cost function type: MSE

Information criteria:
(combined values)
     AIC     AICc      BIC 
54.55463 56.61713 86.91270 
95% semiparametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: 0%; Bias: 61.4%; MAPE: 0.1%; SMAPE: 0.1%
MASE: 0.159; sMAE: 0.1%; RelMAE: 0.052; sMSE: 0%</pre>
<p>Given that ETSX(A,N,N) is significantly better in terms of AICc than the other models, we end up with a maximum weight for that model and infinitesimal weights for the others. That’s why the forecasts of ourModelSelect and ourModelCombine are roughly the same. Starting from v2.3.2, the es() function returns the matrix of information criteria for the different models estimated in the process, so we can see what was used and what the values were for each model. This information is available through the element ICs:</p>
<pre class="decode">ourModelCombine$ICs</pre>
<pre>               AIC      AICc      BIC
ANN       54.55463  56.61713  86.9127
AAN      120.85273 122.91523 153.2108
AAdN     107.76905 110.22575 143.0688
Combined  54.55463  56.61713  86.9127</pre>
<p>So, as we see, ETS(A,N,N) indeed had the lowest information criterion of all the models in the combination (substantially lower than the others), which led to its domination in the combination.</p>
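<p>The domination is easy to reproduce with the standard Akaike-type weights formula, using the AICc values reported above. This is a sketch of the mechanism as I understand it from the description; the exact weighting inside es() may differ in details:</p>

```r
# Akaike-type weights from the AICc values reported above: the model
# with the lowest AICc dominates, the others get infinitesimal weights.
aiccValues <- c(ANN = 56.61713, AAN = 122.91523, AAdN = 110.22575)
deltas <- aiccValues - min(aiccValues)
weights <- exp(-0.5 * deltas) / sum(exp(-0.5 * deltas))
round(weights, 10)
```

<p>The weight of ETS(A,N,N) is numerically indistinguishable from one, which is why the combined forecast coincides with its forecast.</p>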
<p>Keep in mind that this combination of forecasts does not mean the combination of regression models with different exogenous variables – this functionality is currently unavailable, and I am not sure how to implement it (and whether it is needed at all). The combination of forecasts is done for the whole ETS models, not just for the X parts of it.</p>
<p>Finally, it is worth noting that the variable selection approach we use here puts more importance on the dynamic model (in our examples it is ETS) than on the exogenous variables. We use exogenous variables as an additional instrument that helps to increase the forecasting accuracy. The conventional econometric approach is the other way around: construct a regression first, and then add dynamic components in case you cannot explain the response variable with the available data. These approaches have different aims and lead to different results.</p>
<p>The post <a href="https://openforecast.org/2018/02/10/smooth-package-for-r-common-ground-part-iv-exogenous-variables-advanced-stuff/">&#8220;smooth&#8221; package for R. Common ground. Part IV. Exogenous variables. Advanced stuff</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2018/02/10/smooth-package-for-r-common-ground-part-iv-exogenous-variables-advanced-stuff/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>&#8220;smooth&#8221; package for R. Common ground. Part III. Exogenous variables. Basic stuff</title>
		<link>https://openforecast.org/2018/01/15/smooth-package-for-r-common-ground-part-iii-exogenous-variables-basic-stuff/</link>
					<comments>https://openforecast.org/2018/01/15/smooth-package-for-r-common-ground-part-iii-exogenous-variables-basic-stuff/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 15 Jan 2018 14:42:01 +0000</pubDate>
				<category><![CDATA[Common parameters]]></category>
		<category><![CDATA[Package smooth for R]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Univariate models]]></category>
		<category><![CDATA[ARIMA]]></category>
		<category><![CDATA[ETS]]></category>
		<category><![CDATA[smooth]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=1613</guid>

					<description><![CDATA[<p>One of the features of the functions in smooth package is the ability to use exogenous (aka “external”) variables. This potentially leads to the increase in the forecasting accuracy (given that you have a good estimate of the future exogenous variable). For example, in retail this can be a binary variable for promotions and we [&#8230;]</p>
<p>The post <a href="https://openforecast.org/2018/01/15/smooth-package-for-r-common-ground-part-iii-exogenous-variables-basic-stuff/">&#8220;smooth&#8221; package for R. Common ground. Part III. Exogenous variables. Basic stuff</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>One of the features of the functions in the smooth package is the ability to use exogenous (aka “external”) variables. This potentially leads to an increase in forecasting accuracy (given that you have a good estimate of the future values of the exogenous variable). For example, in retail this can be a binary variable for promotions, where we may know when the next promotion will happen. Or we may have an idea about the temperature for the next day and include it as an exogenous variable in the model.</p>
<p>While the <span class="lang:r decode:true crayon-inline">arima()</span> function from the <span class="lang:r decode:true crayon-inline">stats</span> package allows inserting exogenous variables, the <span class="lang:r decode:true crayon-inline">ets()</span> function from the <span class="lang:r decode:true crayon-inline">forecast</span> package does not. That was one of the original motivations for developing an alternative function for ETS. It is worth noting that all the forecasting functions in the <span class="lang:r decode:true crayon-inline">smooth</span> package (except for <span class="lang:r decode:true crayon-inline">sma()</span>) allow using exogenous variables, so this feature is not restricted to <span class="lang:r decode:true crayon-inline">es()</span> only.</p>
<p>There are two types of models with exogenous variables implemented in <span class="lang:r decode:true crayon-inline">smooth</span> functions: additive error model and multiplicative error model. They are slightly different. The former one is formulated as:<br />
\begin{equation} \label{eq:additive}<br />
	y_t = w&#8217; v_{t-l} + a_1 x_{1,t} + a_2 x_{2,t} + … + a_k x_{k,t} + \epsilon_t ,<br />
\end{equation}<br />
where \(a_1, a_2, …, a_k\) are parameters for the respective exogenous variables \(x_{1,t}, x_{2,t}, …, x_{k,t}\). All the other variables have been discussed earlier in <a href="/en/2016/11/02/smooth-package-for-r-es-function-part-ii-pure-additive-models/">the previous posts</a>.</p>
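<p>For intuition, the simplest special case of the additive model – ETSX(A,N,N) with a single explanatory variable, where the \(w' v_{t-l}\) part reduces to the level \(l_{t-1}\) – can be simulated directly from the equation. This is only a sketch under those assumptions, not how smooth generates data:</p>

```r
# Simulating the simplest additive special case, ETSX(A,N,N):
#   y[t] = l[t-1] + a1 * x[t] + e[t]
#   l[t] = l[t-1] + alpha * e[t]
set.seed(41)
n <- 100
alpha <- 0.3        # smoothing parameter
a1 <- 2             # coefficient of the exogenous variable
x <- rnorm(n)       # exogenous variable
e <- rnorm(n, 0, 0.5)
y <- numeric(n)
level <- 10         # initial level l[0]
for (t in 1:n) {
    y[t] <- level + a1 * x[t] + e[t]     # measurement equation
    level <- level + alpha * e[t]        # transition equation
}
```

<p>The exogenous part simply shifts the series up or down around the slowly evolving level.</p>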
<p>The second model is formulated differently, because it is driven by <a href="/en/2016/11/18/smooth-package-for-r-es-function-part-iii-multiplicative-models/">the multiplication of the ETS components</a> by the error term:<br />
\begin{equation} \label{eq:multiplicative}<br />
	\log y_t = w&#8217; \log(v_{t-l}) + a_1 x_{1,t} + a_2 x_{2,t} + … + a_k x_{k,t} + \log(1 + \epsilon_t) ,<br />
\end{equation}<br />
so this model can be reformulated as:<br />
\begin{equation} \label{eq:multiplicativeAlternative}<br />
	y_t = \exp \left({w&#8217; \log(v_{t-l})} \right) \exp(a_1 x_{1,t}) \exp(a_2 x_{2,t}) \dots \exp(a_k x_{k,t}) (1 + \epsilon_t).<br />
\end{equation}</p>
<p>This corresponds to a log-linear model. This formulation is adopted because the exponents of the exogenous variables allow using dummy variables, which would not be possible in a log-log model. This also means that if you want to have a log-log model, you need to take logarithms of the exogenous variables before using them in the functions.</p>
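<p>The point about dummy variables is easy to verify in R: \(\exp(a x_t)\) is well defined when the dummy equals zero (the effect simply switches off), while a log-log model would require \(\log(0)\):</p>

```r
# Why the log-linear form accommodates dummy variables: exp(a * 0) = 1,
# so the multiplicative effect simply switches off, whereas log(0) is
# -Inf, which breaks a log-log specification.
a <- 0.5
dummy <- c(0, 1, 0, 1)
exp(a * dummy)   # 1 when the dummy is off, exp(a) when it is on
log(dummy)       # -Inf for the zeroes
```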
<p>The important thing to note is that the mixed ETS models may cause problems, because some components are added while the others are multiplied, and you may end up with a weird cocktail leading to meaningless or unstable forecasts. So in the case of ETSX I would advise sticking with either <a href="/en/2016/11/02/smooth-package-for-r-es-function-part-ii-pure-additive-models/">pure additive</a> or <a href="/en/2016/11/18/smooth-package-for-r-es-function-part-iii-multiplicative-models/">pure multiplicative</a> models, ignoring their combinations (see <a href="/en/2017/01/24/smooth-package-for-r-es-function-part-iv-model-selection-and-combination-of-forecasts/">the model selection post</a> on how to select between the pure models).</p>
<p>In order to construct a model with a set of preselected exogenous variables, all you need to do is specify the vector, matrix or data.frame with these variables in columns, the following way:</p>
<pre class="decode">ourModel <- es(BJsales, "XXN", xreg=BJsales.lead, h=10, holdout=TRUE, silent=FALSE)</pre>
<pre>Estimation progress: 100%... Done! 
Time elapsed: 0.27 seconds
Model estimated: ETSX(AAdN)
Persistence vector g:
alpha  beta 
0.939 0.301 
Damping parameter: 0.877
Initial values were optimised.
7 parameters were estimated in the process
Residuals standard deviation: 1.381
Xreg coefficients were estimated in a normal style
Cost function type: MSE; Cost function value: 1.811

Information criteria:
     AIC     AICc      BIC 
494.4490 495.2975 515.0405 
Forecast errors:
MPE: 1.2%; Bias: 91.3%; MAPE: 1.3%; SMAPE: 1.3%
MASE: 2.794; sMAE: 1.5%; RelMAE: 0.917; sMSE: 0%</pre>
<div id="attachment_1623" style="width: 310px" class="wp-caption alignnone"><a href="/wp-content/uploads/2018/01/BJsalesETSXSimple.png"><img decoding="async" aria-describedby="caption-attachment-1623" src="/wp-content/uploads/2018/01/BJsalesETSXSimple-300x175.png" alt="" width="300" height="175" class="size-medium wp-image-1623" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/01/BJsalesETSXSimple-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/01/BJsalesETSXSimple-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/01/BJsalesETSXSimple-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/01/BJsalesETSXSimple.png&amp;nocache=1 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1623" class="wp-caption-text">BJsales series and ETSX with a leading indicator</p></div>
<p>In this example we use sales data from the Box & Jenkins (1976) book with a leading indicator. I’ve asked for a 10-steps-ahead forecast and for the holdout. This means that we don’t need to do anything with the leading indicator; the last 10 actual observations of <span class="lang:r decode:true crayon-inline">BJsales.lead</span> will be taken for the forecasting purposes. The function tells us that an ETSX model was estimated and points out that the parameters for the exogenous variables were "estimated in a normal style", meaning that they are assumed to be constant for all the observations. The option with dynamic parameters will be discussed later.</p>
<p>As we can see, the selected model is ETSX(A,Ad,N), but the produced forecast is not particularly accurate and seems to be biased.</p>
<p>If you ever get lost and stop understanding what model you have created, you can use the <span class="lang:r decode:true crayon-inline">formula()</span> function, which in the case of <span class="lang:r decode:true crayon-inline">smooth</span> functions gives purely descriptive information – the output of that function cannot be used directly in any model. Let’s see what the formula of our model is:</p>
<pre class="decode">formula(ourModel)</pre>
<pre>"y[t] = l[t-1] + b[t-1] + a1 * x[t] + e[t]"</pre>
<p>This tells us that we have the level l[t-1], the trend b[t-1], one exogenous variable that is called "x[t]" in our case (because we provided a vector) and the error term. If we had a matrix with exogenous variables or a model with dynamic parameters for the exogenous variables, this would be reflected in the formula. Just remember that this is a purely descriptive thing. You cannot use this information directly in any other model (as you usually can with the <span class="lang:r decode:true crayon-inline">lm()</span> function).</p>
<p>Just for fun, let’s specify a weird mixed model and see its formula:</p>
<pre class="decode">ourModel <- es(BJsales, "MAN", xreg=BJsales.lead, h=10, holdout=TRUE)
formula(ourModel)</pre>
<pre>"y[t] = (l[t-1] + b[t-1]) * exp(a1 * x[t]) * e[t]"</pre>
<p>As can be seen, we first add the trend component to the level and then multiply the sum by the exponent of our exogenous variable. If for some reason the trend is negative and the level is low, we will end up with a very weird thing, because the exogenous variable will be multiplied by a negative number. That’s why I say that the mixed models are not safe.</p>
<p>Now, if we do not have the values for the holdout, then the <span class="lang:r decode:true crayon-inline">smooth</span> functions will automatically produce forecasts using <span class="lang:r decode:true crayon-inline">es()</span> for each of the variables and then use those values in the final forecast of the variable of interest. Beware that if you use dummy variables, the forecast will correspond to some sort of conditional mean value (which is then produced by the <span class="lang:r decode:true crayon-inline">iss()</span> function). This means that you will end up having something like 0.784 as a forecast. So be careful when blindly using the function in the cases of <span class="lang:r decode:true crayon-inline">holdout=FALSE</span>. Here’s how it works:</p>
<pre class="decode">es(BJsales, "XXN", xreg=BJsales.lead, h=10, holdout=FALSE, silent=FALSE)</pre>
<p>We should get the following warning:</p>
<pre>Warning message:
xreg did not contain values for the holdout, so we had to predict missing values.</pre>
<p>If your exogenous variable is longer than the variable of interest, then <span class="lang:r decode:true crayon-inline">smooth</span> functions will cut off the redundant end of data. For example:</p>
<pre class="decode">ourModel <- es(BJsales[1:140], "XXN", xreg=BJsales.lead, h=10, holdout=TRUE)</pre>
<p>This produces a warning:</p>
<pre>Warning message:
xreg contained too many observations, so we had to cut off some of them.</pre>
<p>This is because xreg contained too many observations, and the function used only the first 140 of them, removing the last ten.</p>
<p>As you see, the function works quite well on its own, but if you are keen on using the <span class="lang:r decode:true crayon-inline">forecast()</span> function together with <span class="lang:r decode:true crayon-inline">smooth</span> functions (which is not necessary at all), you can do it the following way:</p>
<pre class="decode">forecast(ourModel, h=10, xreg=BJsales.lead)</pre>
<p>Due to the implementation of exogenous variables in <span class="lang:r decode:true crayon-inline">smooth</span>, you need to provide the whole <span class="lang:r decode:true crayon-inline">xreg</span> (as much as you have) to the <span class="lang:r decode:true crayon-inline">forecast()</span> function. If you provide the values for the holdout only, the function will think that your <span class="lang:r decode:true crayon-inline">xreg</span> series is too short and produce forecasts for it. If you provide the values for the in-sample only, the function will once again produce forecasts for each variable in xreg using <span class="lang:r decode:true crayon-inline">es()</span> (as discussed above).</p>
<p>Personally, I would advise using <span class="lang:r decode:true crayon-inline">es()</span>, <span class="lang:r decode:true crayon-inline">ssarima()</span> and other <span class="lang:r decode:true crayon-inline">smooth</span> functions directly, ignoring <span class="lang:r decode:true crayon-inline">forecast()</span>. This way you can prepare your <span class="lang:r decode:true crayon-inline">xreg</span>, and then use it directly without additional lines of code.</p>
<p>Similarly to how it was discussed in <a href="/en/2017/06/11/smooth-package-for-r-prediction-intervals/">the previous post</a>, you can always ask for prediction intervals, and you will have them. But keep in mind that parametric intervals are pretty complicated in the case of dynamic models with exogenous variables, because the covariances between the parameters and the ETS components are hard to derive. Although the function will produce anything you ask of it, the parametric intervals may be inaccurate. So I would advise using semiparametric or nonparametric intervals in the case of <span class="lang:r decode:true crayon-inline">xreg</span>.</p>
<p>Finally, you can always pre-specify the values of <span class="lang:r decode:true crayon-inline">xreg</span> parameters if you don’t want them to be estimated. This is controlled using <span class="lang:r decode:true crayon-inline">initialX</span> parameter:</p>
<pre class="decode">ourModel <- es(BJsales, "XXN", xreg=BJsales.lead, h=10, holdout=T, initialX=c(-1))</pre>
<p>The function is also smart enough to detect if a provided variable does not have any variability or if some of the variables are highly correlated (the correlation or multiple correlation is higher than 0.999). In both of these cases it will drop some of the variables and tell us about that:</p>
<pre class="decode">es(BJsales, "XXN", xreg=cbind(BJsales.lead,BJsales.lead), h=10, holdout=TRUE)</pre>
<pre>Warning message:
Some exogenous variables were perfectly correlated. We've dropped them out.</pre>
<pre class="decode">es(BJsales, "XXN", xreg=cbind(BJsales.lead,rep(100,150)), h=10, holdout=TRUE)</pre>
<pre>Warning message:
Some exogenous variables do not have any variability. Dropping them out.</pre>
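<p>The two checks described above can be sketched in base R (this is only an illustration of the logic, not the actual code of the package):</p>
<pre class="decode"># Sketch: flag variables with no variability or (near-)perfect correlation
X <- cbind(x1 = 1:10, x2 = (1:10) * 2 + 3, x3 = rep(100, 10))
sd(X[, "x3"]) == 0                       # TRUE: no variability, drop it
abs(cor(X[, "x1"], X[, "x2"])) > 0.999   # TRUE: perfectly correlated, drop one</pre>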
<p>If you accidentally provide the response variable in xreg, the function will also drop it:</p>
<pre class="decode">es(BJsales, "XXN", xreg=cbind(BJsales,BJsales.lead), h=10, holdout=TRUE)</pre>
<pre>Warning message:
One of exogenous variables and the forecasted data are exactly the same. We have dropped it.</pre>
<p>That’s it for the basic exogenous variables functionality in <span class="lang:r decode:true crayon-inline">smooth</span> functions. Next time we will continue with more advanced, more fascinating stuff.</p>
<p>Message <a href="https://openforecast.org/2018/01/15/smooth-package-for-r-common-ground-part-iii-exogenous-variables-basic-stuff/">&#8220;smooth&#8221; package for R. Common ground. Part III. Exogenous variables. Basic stuff</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2018/01/15/smooth-package-for-r-common-ground-part-iii-exogenous-variables-basic-stuff/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>&#8220;smooth&#8221; package for R. Common ground. Part II. Estimators</title>
		<link>https://openforecast.org/2017/11/20/smooth-package-for-r-common-ground-part-ii-estimators/</link>
					<comments>https://openforecast.org/2017/11/20/smooth-package-for-r-common-ground-part-ii-estimators/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 20 Nov 2017 17:21:15 +0000</pubDate>
				<category><![CDATA[Common parameters]]></category>
		<category><![CDATA[Package smooth for R]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Univariate models]]></category>
		<category><![CDATA[ARIMA]]></category>
		<category><![CDATA[ETS]]></category>
		<category><![CDATA[smooth]]></category>
		<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=1359</guid>

					<description><![CDATA[<p>UPDATE: Starting from the v2.5.1 the cfType parameter has been renamed into loss. This post has been updated since then in order to include the more recent name. A bit about estimates of parameters Hi everyone! Today I want to tell you about parameters estimation of smooth functions. But before going into details, there are [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2017/11/20/smooth-package-for-r-common-ground-part-ii-estimators/">&#8220;smooth&#8221; package for R. Common ground. Part II. Estimators</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><strong>UPDATE</strong>: Starting from the v2.5.1 the <span class="lang:r decode:true crayon-inline">cfType</span> parameter has been renamed into <span class="lang:r decode:true crayon-inline">loss</span>. This post has been updated since then in order to include the more recent name.</p>
<h3>A bit about estimates of parameters</h3>
<p>Hi everyone! Today I want to tell you about the estimation of parameters of smooth functions. But before going into details, there are several things that I want to note. In this post we will discuss <strong>bias</strong>, <strong>efficiency</strong> and <strong>consistency</strong> of estimates of parameters, so I will use phrases like “efficient estimator”, implying that we are talking about some optimisation mechanism that gives efficient estimates of parameters. It is probably hard for people without a statistical background to understand what the hell is going on and why we should care, so I decided to give a brief explanation. Although there are strict statistical definitions of the aforementioned terms (you can easily find them on Wikipedia or anywhere else), I do not want to copy-paste them here, because there are only a couple of important points worth mentioning in our context. So, let’s get started.</p>
<p><strong>Bias</strong> refers to the expected difference between the estimated value of a parameter (on a specific sample) and the &#8220;true&#8221; one. Having unbiased estimates of parameters is important because they should lead to more accurate forecasts (at least in theory). For example, if the estimated parameter is equal to zero, while in fact it should be 0.5, then the model would not take the provided information into account correctly and, as a result, would produce less accurate point forecasts and incorrect prediction intervals. In inventory this may mean that we constantly order 100 units less than needed only because the parameter is lower than it should be.</p>
<p><strong>Efficiency</strong> means that if the sample size increases, then the estimated parameters will not change substantially: they will vary in a narrow range (the variance of the estimates will be small). In the case of inefficient estimates, an increase of the sample size from 50 to 51 observations may lead to a change of a parameter from 0.1 to, let’s say, 10. This is bad because the values of parameters usually influence both point forecasts and prediction intervals. As a result the inventory decision may differ radically from day to day. For example, we may decide that we urgently need 1000 units of product on Monday, and order it just to realise on Tuesday that we only need 100. Obviously this is an exaggeration, but no one wants to deal with such an erratic stocking policy, so we need to have efficient estimates of parameters.</p>
<p><strong>Consistency</strong> means that our estimates of parameters will get closer to the stable values (what statisticians would refer to as &#8220;<a href="/en/2016/06/25/true-model/">true</a>&#8221; values) with the increase of the sample size. This is important because in the opposite case estimates of parameters will diverge and become less and less realistic. This once again influences both point forecasts and prediction intervals, which will be less meaningful than they should have been. In a way, consistency means that with the increase of the sample size the parameters will become more efficient and less biased. This in turn means that the more observations we have, the better. There is a prejudice in the world of practitioners that the situation in the market changes so fast that old observations become useless very quickly. As a result many companies just throw away the old data. Although the statement about market changes is, in general, true, forecasters tend to work with models that take this into account (e.g. exponential smoothing, ARIMA). These models adapt to the potential changes. So we may benefit from the old data, because it allows us to get more consistent estimates of parameters. Just keep in mind that you can always remove the annoying bits of data, but you can never un-throw away the data.</p>
<p>Having clarified these points, we can proceed to the topic of today’s post.</p>
<h3>One-step-ahead estimators of smooth functions</h3>
<p><a href="/en/2017/01/24/smooth-package-for-r-es-function-part-iv-model-selection-and-combination-of-forecasts/">We already know</a> that the default estimator used for <span class="lang:r decode:true crayon-inline">smooth</span> functions is Mean Squared Error (for one step ahead forecast). If the residuals are distributed <a href="/en/2016/11/02/smooth-package-for-r-es-function-part-ii-pure-additive-models/">normally</a> / <a href="/en/2016/11/18/smooth-package-for-r-es-function-part-iii-multiplicative-models/">log-normally</a>, then the minimum of MSE gives estimates that also maximise the respective likelihood function. As a result the estimates of parameters become nice: consistent and efficient. It is also known in statistics that minimum of MSE gives mean estimates of the parameters, which means that MSE also produces unbiased estimates of parameters (if the model is specified correctly and <a href="/en/2016/06/25/true-model/">bla-bla-bla</a>). This works very well, when we deal with symmetric distributions of random variables. But it may perform poorly otherwise.</p>
<p>In this post we will use the series N1823 for our examples:</p>
<pre class="decode">library(Mcomp)
x <- ts(c(M3$N1823$x,M3$N1823$xx),frequency=frequency(M3$N1823$x))</pre>
<p>Plot the data in order to see what we have:</p>
<pre class="decode">plot(x)</pre>
<div id="attachment_1360" style="width: 310px" class="wp-caption alignnone"><a href="/wp-content/uploads/2017/10/SmoothEstimators01.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1360" src="/wp-content/uploads/2017/10/SmoothEstimators01-300x175.png" alt="" width="300" height="175" class="size-medium wp-image-1360" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators01-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators01-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators01-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators01.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1360" class="wp-caption-text">N1823 series</p></div>
<p>The data seems to have slight multiplicative seasonality, which changes over time, but it is hard to say for sure. Anyway, in order to simplify things, we will apply an ETS(A,A,N) model to this data, so we can see how the different estimators behave. We will withhold 18 observations, as is usually done for monthly data in M3.</p>
<pre class="decode">ourModel <- es(x,"AAN",silent=F,interval="p",h=18,holdout=T)</pre>
<div id="attachment_1361" style="width: 310px" class="wp-caption alignnone"><a href="/wp-content/uploads/2017/10/SmoothEstimators02.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1361" src="/wp-content/uploads/2017/10/SmoothEstimators02-300x175.png" alt="" width="300" height="175" class="size-medium wp-image-1361" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators02-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators02-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators02-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators02.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1361" class="wp-caption-text">N1823 and ETS(A,A,N) with MSE</p></div>
<p>Here’s the output:</p>
<pre>Time elapsed: 0.08 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta 
0.147 0.000 
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 629.249
Cost function type: MSE; Cost function value: 377623.069

Information criteria:
     AIC     AICc      BIC 
1703.389 1703.977 1716.800 
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -14%; Bias: -74.1%; MAPE: 16.8%; SMAPE: 15.1%
MASE: 0.855; sMAE: 13.4%; RelMAE: 1.047; sMSE: 2.4%</pre>
<p>It is hard to make any reasonable conclusions from the graph and the output, but it seems that we slightly overforecast the data. At least the prediction interval covers all the values in the holdout. Relative MAE is equal to 1.047, which means that the model produced forecasts less accurate than Naive. Let’s have a look at the residuals:</p>
<pre class="decode">qqnorm(resid(ourModel))
qqline(resid(ourModel))</pre>
<div id="attachment_1363" style="width: 310px" class="wp-caption alignnone"><a href="/wp-content/uploads/2017/10/SmoothEstimators02QQ.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1363" src="/wp-content/uploads/2017/10/SmoothEstimators02QQ-300x175.png" alt="" width="300" height="175" class="size-medium wp-image-1363" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators02QQ-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators02QQ-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators02QQ-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators02QQ.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1363" class="wp-caption-text">QQ-plot of the residuals from ETS(A,A,N) with MSE</p></div>
<p>The residuals of this model do not look normal: a lot of the empirical quantiles are far from the theoretical ones. If we conduct the Shapiro-Wilk test, we have to reject the hypothesis of normality of the residuals at the 5% level:</p>
<pre class="decode">shapiro.test(resid(ourModel))
> p-value = 0.001223</pre>
<p>This may indicate that other estimators may do a better job. And there is a magical parameter <span class="lang:r decode:true crayon-inline">loss</span> in the <span class="lang:r decode:true crayon-inline">smooth</span> functions which allows estimating the models differently. It controls which cost function is used in the estimation. You can select the following estimators instead of MSE:</p>
<ul>
<li>MAE – Mean Absolute Error:</li>
</ul>
<p>\begin{equation} \label{eq:MAE}<br />
	\text{MAE} = \frac{1}{T} \sum_{t=1}^T |e_{t+1}|<br />
\end{equation}</p>
<p>The minimum of MAE gives median estimates of the parameters. MAE is considered to be a more robust estimator than MSE. If you have an asymmetric distribution, give MAE a try. It gives consistent, but not efficient estimates of parameters. Asymptotically, if the distribution of the residuals is normal, the estimates obtained by minimising MAE converge to the ones of MSE (which follows from the connection between the mean and the median of the normal distribution). Also, the minimum of MAE corresponds to the maximum of the likelihood of the Laplace distribution (and the functions of the package use this property for the inference).</p>
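<p>The claims about the minima of the two losses can be checked numerically. The following base R sketch (a toy demonstration, unrelated to <span class="lang:r decode:true crayon-inline">smooth</span> itself) shows that on an asymmetric sample the minimiser of the squared loss is the sample mean, while the minimiser of the absolute loss is the sample median:</p>
<pre class="decode"># Minimise the two losses over a scalar "parameter" a
set.seed(42)
e <- rlnorm(101)  # an asymmetric (log-normal) sample
mseMin <- optimize(function(a) mean((e - a)^2), range(e))$minimum
maeMin <- optimize(function(a) mean(abs(e - a)), range(e))$minimum
c(mseMin, mean(e))    # approximately equal
c(maeMin, median(e))  # approximately equal</pre>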
<p>Let’s see what happens with the same model, on the same data when we use MAE:</p>
<pre class="decode">ourModel <- es(x,"AAN",silent=F,interval="p",h=18,holdout=T,loss="MAE")</pre>
<div id="attachment_1362" style="width: 310px" class="wp-caption alignnone"><a href="/wp-content/uploads/2017/10/SmoothEstimators03.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1362" src="/wp-content/uploads/2017/10/SmoothEstimators03-300x175.png" alt="" width="300" height="175" class="size-medium wp-image-1362" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators03-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators03-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators03-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators03.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1362" class="wp-caption-text">N1823 and ETS(A,A,N) with MAE</p></div>
<pre>Time elapsed: 0.09 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta 
0.101 0.000 
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 636.546
Cost function type: MAE; Cost function value: 462.675

Information criteria:
     AIC     AICc      BIC 
1705.879 1706.468 1719.290 
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -5.1%; Bias: -32.1%; MAPE: 12.9%; SMAPE: 12.4%
MASE: 0.688; sMAE: 10.7%; RelMAE: 0.842; sMSE: 1.5%</pre>
<p>There are several things to note from the graph and the output. First, the smoothing parameter alpha is smaller than in the previous case. Second, Relative MAE is smaller than one, which means that the model in this case outperformed Naive. Comparing this value with the one from the previous model, we can conclude that MAE worked well as an estimator of parameters for this data. Finally, the graph shows that the point forecasts go through the middle of the holdout sample, which is reflected in the lower values of the error measures. The residuals are still not normally distributed, but this is expected: they won't become normal just because we used a different estimator:</p>
<div id="attachment_1364" style="width: 310px" class="wp-caption alignnone"><a href="/wp-content/uploads/2017/10/SmoothEstimators03QQ.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1364" src="/wp-content/uploads/2017/10/SmoothEstimators03QQ-300x175.png" alt="" width="300" height="175" class="size-medium wp-image-1364" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators03QQ-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators03QQ-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators03QQ-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators03QQ.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1364" class="wp-caption-text">QQ-plot of the residuals from ETS(A,A,N) with MAE</p></div>
<ul>
<li>HAM – Half Absolute Moment:</li>
</ul>
<p>\begin{equation} \label{eq:HAM}<br />
	\text{HAM} = \frac{1}{T} \sum_{t=1}^T \sqrt{|e_{t+1}|}<br />
\end{equation}<br />
This is an even more robust estimator than MAE. On count data its minimum corresponds to the mode of the data. In the case of continuous data the minimum of this estimator corresponds to something not yet well studied, lying between the mode and the median. The paper about this is currently in a draft stage, but you can already give the estimator a try, if you want. This is also a consistent, but not efficient estimator.</p>
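<p>The HAM formula is straightforward to compute on a vector of one-step-ahead errors. A minimal base R sketch (an illustration of the formula, not the internal implementation of the package):</p>
<pre class="decode">ham <- function(e) mean(sqrt(abs(e)))
ham(c(1, -4, 9))
# 2, i.e. mean(sqrt(1), sqrt(4), sqrt(9)) = mean(1, 2, 3)</pre>
<p>Note how the square root dampens the influence of the largest error (9) compared to squaring it, which is where the robustness comes from.</p>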
<p>The same example, the same model, but HAM as an estimator:</p>
<pre class="decode">ourModel <- es(x,"AAN",silent=F,interval="p",h=18,holdout=T,loss="HAM")</pre>
<div id="attachment_1365" style="width: 310px" class="wp-caption alignnone"><a href="/wp-content/uploads/2017/10/SmoothEstimators04.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1365" src="/wp-content/uploads/2017/10/SmoothEstimators04-300x175.png" alt="" width="300" height="175" class="size-medium wp-image-1365" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators04-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators04-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators04-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators04.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1365" class="wp-caption-text">N1823 and ETS(A,A,N) with HAM</p></div>
<pre>Time elapsed: 0.06 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta 
0.001 0.001 
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 666.439
Cost function type: HAM; Cost function value: 19.67

Information criteria:
     AIC     AICc      BIC 
1715.792 1716.381 1729.203 
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -1.7%; Bias: -14.1%; MAPE: 11.4%; SMAPE: 11.4%
MASE: 0.63; sMAE: 9.8%; RelMAE: 0.772; sMSE: 1.3%</pre>
<p>This estimator produced even more accurate forecasts in this example, forcing the smoothing parameters to become close to zero. Note that the residual standard deviation in the case of HAM is larger than in the case of MAE, which in turn is larger than in the case of MSE. This means that the one-step-ahead parametric and semiparametric prediction intervals will be widest for HAM, followed by MAE and then MSE. However, given that the smoothing parameters in the last model are close to zero, the multiple-steps-ahead prediction intervals of HAM may be narrower than the ones of MSE.</p>
<p>Finally, it is worth noting that the optimisation of models using different estimators takes different amounts of time. MSE is the slowest, while HAM is the fastest estimator. I don't have a detailed explanation of this, but it probably happens because of the shapes of the cost function surfaces. So if you are in a hurry and need to estimate something somehow, you can give HAM a try. Just remember that information criteria may become inapplicable in this case.</p>
<h3>Multiple-steps-ahead estimators of smooth functions</h3>
<p>While the three estimators above are calculated based on the one-step-ahead forecast error, the next three are based on multiple-steps-ahead forecast errors. They can be useful if you want to have a more stable and “conservative” model (a paper on this topic is currently in the final stage). Be aware that prior to v2.2.1 these estimators had different names!</p>
<ul>
<li>MSE\(_h\) - Mean Squared Error for h-steps ahead forecast:</li>
</ul>
<p>\begin{equation} \label{eq:MSEh}<br />
	\text{MSE}_h = \frac{1}{T} \sum_{t=1}^T e_{t+h}^2<br />
\end{equation}<br />
The idea of this estimator is very simple: if you are interested in 5-steps-ahead forecasts, then optimise over this horizon, not one step ahead. However, by using this estimator we shrink the smoothing parameters towards zero, forcing the model to become closer to the deterministic one and more robust to outliers. This applies both to ETS and ARIMA, but the models behave slightly differently. The effect of shrinkage increases with the increase of \(h\). The forecast accuracy may increase for that specific horizon, but it almost surely will decrease for all the other horizons. Keep in mind that this is, in general, a biased and inefficient estimator with a much slower convergence to the true values than the one-step-ahead estimators. This estimator is eventually consistent, but it may need a very large sample to become one. This means that it may result in values of parameters very close to zero even if the data does not call for that. I personally would advise using this estimator on large samples (for instance, on high frequency data). By the way, Nikos Kourentzes, Rebecca Killick and I are working on a paper on that topic, so stay tuned.</p>
<p>Here’s what happens when we use this estimator:</p>
<pre class="decode">ourModel <- es(x,"AAN",silent=F,interval="p",h=18,holdout=T,loss="MSEh")</pre>
<div id="attachment_1366" style="width: 310px" class="wp-caption alignnone"><a href="/wp-content/uploads/2017/10/SmoothEstimators05.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1366" src="/wp-content/uploads/2017/10/SmoothEstimators05-300x175.png" alt="" width="300" height="175" class="size-medium wp-image-1366" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators05-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators05-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators05-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators05.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1366" class="wp-caption-text">N1823 and ETS(A,A,N) with MSEh</p></div>
<pre>Time elapsed: 0.24 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta 
    0     0 
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 657.781
Cost function type: MSEh; Cost function value: 550179.34

Information criteria:
     AIC     AICc      BIC 
30393.86 30404.45 30635.25 
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -10.4%; Bias: -62%; MAPE: 14.9%; SMAPE: 13.8%
MASE: 0.772; sMAE: 12.1%; RelMAE: 0.945; sMSE: 1.8%</pre>
<p>As you can see, the smoothing parameters are now equal to zero, which gives us the straight line going through all the data. If we had 1008 observations instead of 108, the parameters would not be shrunk to zero, because the model would need to adapt to changes in order to minimise the respective cost function.</p>
<ul>
<li>TMSE - Trace Mean Squared Error:</li>
</ul>
<p>The need for a specific 5-steps-ahead forecast is not common, so it makes sense to work with something that deals with all the steps from one to h:<br />
\begin{equation} \label{TMSE}<br />
	\text{TMSE} = \sum_{j=1}^h \frac{1}{T} \sum_{t=1}^T e_{t+j}^2<br />
\end{equation}<br />
This estimator is more reasonable than MSE\(_h\) because it takes into account all the errors from one to h steps ahead. This is a desired behaviour in inventory management, because we are not so much interested in how much we will sell tomorrow or next Monday, but rather in how much we will sell from tomorrow until next Monday. However, the variance of h-steps-ahead forecast errors is usually larger than the variance of one-step-ahead errors (because of the increasing uncertainty), which leads to the effect of “masking”: the latter is hidden behind the former. As a result, if we use TMSE as the estimator, the final values are influenced by the long-term errors much more than by the short-term ones (see the <a href="https://doi.org/10.1109/TNNLS.2015.2411629" rel="noopener noreferrer" target="_blank">Taieb and Atiya, 2015</a> paper). This estimator is not recommended if short-term forecasts are more important than long-term ones. Plus, this is still a less efficient and more biased estimator than the one-step-ahead ones, with a slow convergence to the true values, similar to MSE\(_h\), but slightly better.</p>
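<p>To make the “masking” effect concrete, here is a base R sketch of TMSE computed on a matrix of forecast errors, with rows as forecast origins and columns as horizons 1 to h (the numbers are purely illustrative, and this is not how smooth computes it internally):</p>
<pre class="decode"># TMSE over a matrix E of errors: sum the per-horizon MSEs
tmse <- function(E) sum(colMeans(E^2))
E <- matrix(c(1, -1, 2, -2, 3, -3), nrow = 2)  # h = 3 horizons
colMeans(E^2)  # 1 4 9: the 3-steps-ahead errors dominate the total
tmse(E)        # 14</pre>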
<p>This is what happens in our example:</p>
<pre class="decode">ourModel <- es(x,"AAN",silent=F,interval="p",h=18,holdout=T,loss="TMSE")</pre>
<div id="attachment_1666" style="width: 310px" class="wp-caption alignnone"><a href="/wp-content/uploads/2017/11/SmoothEstimators06.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1666" src="/wp-content/uploads/2017/11/SmoothEstimators06-300x175.png" alt="" width="300" height="175" class="size-medium wp-image-1666" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/11/SmoothEstimators06-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/11/SmoothEstimators06-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/11/SmoothEstimators06-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/11/SmoothEstimators06.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1666" class="wp-caption-text">N1823 and ETS(A,A,N) with TMSE</p></div>
<pre>Time elapsed: 0.2 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta 
0.075 0.000 
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 633.48
Cost function type: TMSE; Cost function value: 7477097.717

Information criteria:
     AIC     AICc      BIC 
30394.36 30404.94 30635.75 
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -7.5%; Bias: -48.9%; MAPE: 13.4%; SMAPE: 12.6%
MASE: 0.704; sMAE: 11%; RelMAE: 0.862; sMSE: 1.5%</pre>
<p>Comparing the model estimated using TMSE with the same one estimated using MSE and MSE\(_h\), it is worth noting that the smoothing parameters in this model are greater than in case of MSE\(_h\), but less than in case of MSE. This demonstrates that there is a shrinkage effect in TMSE, forcing the parameters towards zero, but the inclusion of one-step-ahead errors makes the model slightly more flexible than in case of MSE\(_h\). Still, it is advised to use this estimator on large samples, where the estimates of parameters become more efficient and less biased.</p>
<ul>
<li>GTMSE - Geometric Trace Mean Squared Error:</li>
</ul>
<p>This is similar to TMSE, but is derived from the so-called Trace Forecast Likelihood (which I may discuss at some point in one of the future posts). The idea here is to take the logarithm of each MSE\(_j\) and then sum them up:<br />
\begin{equation} \label{eq:GTMSE}<br />
	\text{GTMSE} = \sum_{j=1}^h \log \left( \frac{1}{T} \sum_{t=1}^T e_{t+j}^2 \right)<br />
\end{equation}<br />
Logarithms bring the variances of the errors for different steps ahead closer to each other. For example, if the variance of the one-step-ahead error is equal to 100 and the variance of the 10-steps-ahead error is equal to 1000, then their logarithms will be 4.6 and 6.9 respectively. As a result, when GTMSE is used as an estimator, the model takes into account both short-term and long-term errors, making it a more balanced estimator of the parameters than MSE\(_h\) and TMSE. This estimator is more efficient than both TMSE and MSE\(_h\) because of the log-scale, and it converges to the true values faster than the previous two, but it can still be biased on small samples.</p>
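<p>The balancing effect of the logarithms can be checked directly in R; the error matrix below is the same kind of hypothetical illustration as before, not the internal <code>smooth</code> code:</p>

```r
# Variances of 100 and 1000 differ by a factor of 10,
# but their logarithms differ only by log(10):
log(100)   # 4.60517
log(1000)  # 6.907755

# GTMSE from a hypothetical matrix of multi-step forecast errors
set.seed(41)
errors <- matrix(rnorm(100 * 3), ncol = 3)
GTMSE <- sum(log(colMeans(errors^2)))
```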
<pre class="decode">ourModel <- es(x,"AAN",silent=F,interval="p",h=18,holdout=T,loss="GTMSE")</pre>
<div id="attachment_1368" style="width: 310px" class="wp-caption alignnone"><a href="/wp-content/uploads/2017/10/SmoothEstimators07.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1368" src="/wp-content/uploads/2017/10/SmoothEstimators07-300x175.png" alt="" width="300" height="175" class="size-medium wp-image-1368" /></a><p id="caption-attachment-1368" class="wp-caption-text">N1823 and ETS(A,A,N) with GTMSE</p></div>
<pre>Time elapsed: 0.18 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta 
    0     0 
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 649.253
Cost function type: GTMSE; Cost function value: 232.419

Information criteria:
     AIC     AICc      BIC 
30402.77 30413.36 30644.16 
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -8.2%; Bias: -53.8%; MAPE: 13.8%; SMAPE: 12.9%
MASE: 0.72; sMAE: 11.3%; RelMAE: 0.882; sMSE: 1.6%</pre>
<p>In our example, GTMSE shrinks both smoothing parameters towards zero and makes the model deterministic, which corresponds to the MSE\(_h\) case. However, the initial values are slightly different, which leads to slightly different forecasts. Once again, it is advisable to use this estimator on large samples.</p>
<p>Keep in mind that all these multiple steps ahead estimators take more time to compute, because the model needs to produce \(h\)-steps-ahead forecasts from each in-sample observation.</p>
<ul>
<li>Analytical multiple steps ahead estimators.</li>
</ul>
<p>There is also an undocumented feature in <span class="lang:r decode:true crayon-inline">smooth</span> functions (currently available only for pure additive models) – analytical versions of the multiple steps ahead estimators. In order to use it, we need to add “a” in front of the desired estimator: aMSE\(_h\), aTMSE, aGTMSE. In this case only the one-step-ahead forecast errors are produced, and after that the structure of the applied state-space model is used in order to reconstruct the multiple steps ahead estimators analytically. This feature is useful if you want to use a multiple steps ahead estimator on a small sample, where the multi-step errors cannot be calculated appropriately. It is also useful for large samples, when the time of estimation matters.</p>
<p>These estimators have similar properties to their empirical counterparts, but work faster and are based on asymptotic properties. Here is an example of analytical MSE\(_h\) estimator:</p>
<pre class="decode">ourModel <- es(x,"AAN",silent=F,interval="p",h=18,holdout=T,loss="aMSEh")</pre>
<div id="attachment_1377" style="width: 310px" class="wp-caption alignnone"><a href="/wp-content/uploads/2017/10/SmoothEstimators09.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1377" src="/wp-content/uploads/2017/10/SmoothEstimators09-300x175.png" alt="" width="300" height="175" class="size-medium wp-image-1377" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators09-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators09-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators09-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/10/SmoothEstimators09.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1377" class="wp-caption-text">N1823 and ETS(A,A,N) with aMSEh</p></div>
<pre>Time elapsed: 0.11 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta 
    0     0 
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 627.818
Cost function type: aMSEh; Cost function value: 375907.976

Information criteria:
     AIC     AICc      BIC 
30652.15 30662.74 30893.55 
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -1.9%; Bias: -14.6%; MAPE: 11.7%; SMAPE: 11.6%
MASE: 0.643; sMAE: 10%; RelMAE: 0.787; sMSE: 1.3%</pre>
<p>The resulting smoothing parameters are shrunk towards zero, similar to MSE\(_h\), but the initial values are slightly different, which leads to different forecasts. Note that the time elapsed in this case is 0.11 seconds instead of 0.24 seconds for MSE\(_h\). The difference in time may grow with the sample size and the forecasting horizon.</p>
<ul>
<li>Similar to MSE, there are empirical multi-step versions of MAE and HAM in <span class="lang:r decode:true crayon-inline">smooth</span> functions (e.g. MAE\(_h\) and THAM). However, they are currently implemented mainly “because I can” and for fun, so I cannot give you any recommendations about them.</li>
</ul>
<ul>
<li>Starting from v2.4.0, a new estimator was introduced – the Mean Squared Cumulative Error (MSCE) – which may be useful when the cumulative demand is of interest rather than the point or trace forecasts.</li>
</ul>
<h3>Conclusions</h3>
<p>Now that we have discussed all the estimators that you can use with <span class="lang:r decode:true crayon-inline">smooth</span>, you are most probably confused and completely lost. The question that naturally appears after reading this post is “What should I do?” Frankly speaking, I cannot give you a definitive answer or a set of universal recommendations, because this is still an under-researched problem. However, I have some advice.</p>
<p>First, <a href="http://kourentzes.com/forecasting/" rel="noopener noreferrer" target="_blank">Nikos Kourentzes</a> and <a href="http://blog.uclm.es/juanramontrapero/" rel="noopener noreferrer" target="_blank">Juan Ramon Trapero</a> <a href="http://kourentzes.com/forecasting/2015/08/20/forecasting-solar-irradiance-true-models-trace-optimisation-and-shrinkage/" rel="noopener noreferrer" target="_blank">found that</a> in the case of high frequency data (they used solar irradiance data) using MSE\(_h\) and TMSE leads to an increase in forecasting accuracy in comparison with the conventional MSE. However, in order to achieve good accuracy with MSE\(_h\), you need to estimate \(h\) separate models, while with TMSE you need to estimate only one. So TMSE is faster than MSE\(_h\), but at the same time leads to forecasts at least as accurate as those of MSE\(_h\) for all the steps from 1 to \(h\).</p>
<p>Second, if you have asymmetrically distributed residuals in the model after using MSE, give MAE or HAM a try – they may improve your model and its accuracy.</p>
<p>Third, the analytical counterparts of the multi-step estimators can be useful in one of two situations: (1) when you deal with very large samples (e.g. high frequency data) and want to use the advanced estimation methods, but need them to work fast; (2) when you work with a small sample, but want to use the properties of these estimators anyway.</p>
<p>Finally, don’t use MSE\(_h\), TMSE and GTMSE if you are interested in the values of parameters of models – they will almost surely be inefficient and biased. This applies to both ETS and ARIMA models, which will become close to their deterministic counterparts in this case. Use conventional MSE instead.</p>
<p>Message <a href="https://openforecast.org/2017/11/20/smooth-package-for-r-common-ground-part-ii-estimators/">&#8220;smooth&#8221; package for R. Common ground. Part II. Estimators</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2017/11/20/smooth-package-for-r-common-ground-part-ii-estimators/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>&#8220;smooth&#8221; package for R. Common ground. Part I. Prediction intervals</title>
		<link>https://openforecast.org/2017/06/11/smooth-package-for-r-prediction-intervals/</link>
					<comments>https://openforecast.org/2017/06/11/smooth-package-for-r-prediction-intervals/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Sun, 11 Jun 2017 13:23:40 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[Common parameters]]></category>
		<category><![CDATA[Package smooth for R]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Univariate models]]></category>
		<category><![CDATA[ARIMA]]></category>
		<category><![CDATA[CES]]></category>
		<category><![CDATA[ETS]]></category>
		<category><![CDATA[smooth]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=984</guid>

					<description><![CDATA[<p>UPDATE: Starting from v2.5.1 the parameter intervals has been renamed into interval for the consistency purposes with the other R functions. We have spent previous six posts discussing basics of es() function (underlying models and their implementation). Now it is time to move forward. Starting from this post we will discuss common parameters, shared by [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2017/06/11/smooth-package-for-r-prediction-intervals/">&#8220;smooth&#8221; package for R. Common ground. Part I. Prediction intervals</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><strong>UPDATE</strong>: Starting from v2.5.1 the parameter <code>intervals</code> has been renamed into <code>interval</code> for the consistency purposes with the other R functions.</p>
<p>We have spent the previous six posts discussing the basics of the <code>es()</code> function (the underlying models and their implementation). Now it is time to move forward. Starting from this post, we will discuss common parameters shared by all the forecasting functions implemented in <code>smooth</code>. This means that the topics we discuss are applicable not only to <code>es()</code>, but also to <code>ssarima()</code>, <code>ces()</code>, <code>gum()</code> and <code>sma()</code>. However, given that we have only discussed ETS so far, we will use <code>es()</code> in our examples for now.</p>
<p>And I would like to start this series of general posts from the topic of prediction intervals.</p>
<h3>Prediction intervals for smooth functions</h3>
<p>One of the features of <code>smooth</code> functions is their ability to produce different types of prediction intervals. Parametric prediction intervals (triggered by <code>interval="p"</code>, <code>interval="parametric"</code> or <code>interval=TRUE</code>) are derived analytically only for <a href="/en/2016/11/02/smooth-package-for-r-es-function-part-ii-pure-additive-models/">pure additive</a> and <a href="/en/2016/11/18/smooth-package-for-r-es-function-part-iii-multiplicative-models/">pure multiplicative</a> models and are based on the state-space model discussed in the previous posts. In the current <code>smooth</code> version (v2.0.0) only the <code>es()</code> function has multiplicative components; all the other functions are based on additive models. This makes <code>es()</code> &#8220;special&#8221;. While constructing intervals for pure models (either additive or multiplicative) is relatively easy, the mixed models cause pain in the arse (one of the reasons why I don&#8217;t like them). So in the case of <a href="/en/2017/01/24/smooth-package-for-r-es-function-part-iv-model-selection-and-combination-of-forecasts/">mixed ETS models</a>, we have to use several tricks.</p>
<p>If the model has a multiplicative error term, non-multiplicative other components (trend, seasonality) and a low error variance (smaller than 0.1), then the intervals can be approximated by those of a similar model with an additive error term. For example, the intervals for ETS(M,A,N) can be approximated with the intervals of ETS(A,A,N) when the variance is low, because the distributions of the errors in the two models will be similar. In all the other cases we use simulations for the construction of prediction intervals (via the <code>sim.es()</code> function). In this case the data is generated with preset parameters (including the variance) and contains \(h\) observations. This process is repeated 10,000 times, resulting in 10,000 possible trajectories. After that the necessary quantiles of these trajectories for each step ahead are extracted using the <code>quantile()</code> function from the <code>stats</code> package and returned as prediction intervals. This cannot be considered a purely parametric approach, but it is the closest we have.</p>
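<p>The logic of this simulation-based construction can be sketched in a few lines. The trajectories below are toy random-walk-like paths with purely illustrative numbers; this is not the actual <code>sim.es()</code> code:</p>

```r
# Toy illustration: 10,000 random-walk-like trajectories, h steps ahead,
# starting from a level of 100 with an innovation sd of 5
set.seed(42)
h <- 8
nsim <- 10000
innovations <- matrix(rnorm(nsim * h, sd = 5), nrow = nsim)
trajectories <- 100 + t(apply(innovations, 1, cumsum))

# 95% prediction intervals: quantiles of the trajectories per step ahead
lower <- apply(trajectories, 2, quantile, probs = 0.025)
upper <- apply(trajectories, 2, quantile, probs = 0.975)
```

The intervals widen with the horizon, because the uncertainty accumulates from one step to the next.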
<p><code>smooth</code> functions also introduce semiparametric and nonparametric prediction intervals. Both of them are based on multiple steps ahead (sometimes also called &#8220;trace&#8221;) forecast errors. These are obtained by producing forecasts for horizons 1 to \(h\) from each observation of the time series. As a result, a matrix with \(h\) columns and \(T-h\) rows is produced. In the case of semiparametric intervals (called using <code>interval="sp"</code> or <code>interval="semiparametric"</code>), the variances of the forecast errors for each horizon are calculated and then used in order to extract the quantiles of either the normal or the log-normal distribution (depending on the error type). This way we cover possible violations of the assumptions of homoscedasticity and no autocorrelation in the residuals, but we still assume that each separate observation follows some parametric distribution.</p>
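<p>A rough sketch of the semiparametric idea for an additive error model is shown below. Both the error matrix and the point forecasts are hypothetical, and the real implementation inside <code>smooth</code> differs in the details:</p>

```r
# Hypothetical matrix of in-sample multi-step forecast errors and
# hypothetical point forecasts for h = 3 steps ahead
set.seed(41)
errors <- matrix(rnorm(100 * 3), ncol = 3)
pointForecast <- c(100, 101, 102)

# Standard deviation of forecast errors for each horizon separately,
# then normal quantiles around the point forecasts
sigmas <- apply(errors, 2, sd)
upper <- pointForecast + qnorm(0.975) * sigmas
lower <- pointForecast + qnorm(0.025) * sigmas
```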
<p>In the case of nonparametric prediction intervals (defined in R via <code>interval="np"</code> or <code>interval="nonparametric"</code>), we loosen the assumptions further, dropping the part about the distribution of the residuals. In this case quantile regressions are used, as proposed by <a href="https://www.jstor.org/stable/2634872?seq=1#page_scan_tab_contents" target="_blank" rel="noopener noreferrer">Taylor and Bunn, 1999</a>. However, we use a different form of the regression model than the authors do:<br />
\begin{equation} \label{eq:ssTaylorPIs}<br />
	\hat{e}_{j} = a_0 j ^ {a_{1}},<br />
\end{equation}<br />
where \(j = 1, \ldots, h\) is the forecast horizon. This function has an important advantage over the second order polynomial proposed by the authors: it does not have an extremum (turning point) for \(j>0\), which means that the intervals won&#8217;t behave strangely several steps ahead. Using polynomials for the intervals sometimes leads to weird bounds (for example, expanding and then shrinking). The power function, on the other hand, allows producing a wide variety of forecast trajectories, corresponding to differently increasing or decreasing bounds of the prediction intervals (depending on the values of \(a_0\) and \(a_1\)), without producing any ridiculous trajectories.</p>
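<p>The difference between the two functional forms is easy to see numerically; the coefficients below are purely illustrative:</p>

```r
j <- 1:10

# Power function of the form used by smooth: a0 * j^a1
powerBound <- 2 * j^0.5

# A second order polynomial, which has a turning point around j = 5
polyBound <- 2 + 3 * j - 0.3 * j^2

all(diff(powerBound) > 0)  # TRUE: monotonically widening, no extremum
all(diff(polyBound) > 0)   # FALSE: the bound expands and then shrinks
```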
<p>The main problem with the nonparametric intervals produced by <code>smooth</code> is caused by the quantile regressions, which do not behave well on small samples. In order to produce a correct 0.95 quantile, we need to have at least 20 observations, and if we want the 0.99 quantile, then the sample must contain at least 100. In cases when there are not enough observations, the produced intervals can be inaccurate and may not correspond to the nominal level.</p>
<p>As a small note, if a user produces only a one-step-ahead forecast, then the semiparametric interval will correspond to the parametric one (because only the variance of the one-step-ahead error is used), and the nonparametric interval is constructed using the <code>quantile()</code> function from the <code>stats</code> package.</p>
<p>Finally, the width of the prediction intervals is regulated by the parameter <code>level</code>, which can be specified either as a fraction (<code>level=0.95</code>) or as an integer less than 100 (<code>level=95</code>). I personally prefer the former, but the latter is needed for consistency with the <code>forecast</code> package functions. By default all the <code>smooth</code> functions produce 95% prediction intervals.</p>
<p>There are some other features of prediction interval construction for specific intermittent models and cumulative forecasts, but they will be covered in upcoming posts.</p>
<h3>Examples in R</h3>
<p>We will use a time series N1241 as an example and we will estimate model ETS(A,Ad,N). Here&#8217;s how we do that:</p>
<pre class="decode" title="Examples of usage">
ourModel1 <- es(M3$N1241$x, "AAdN", h=8, holdout=TRUE, interval="p")
ourModel2 <- es(M3$N1241$x, "AAdN", h=8, holdout=TRUE, interval="sp")
ourModel3 <- es(M3$N1241$x, "AAdN", h=8, holdout=TRUE, interval="np")</pre>
<p>The resulting graphs demonstrate some differences in prediction intervals widths and shapes:</p>
<div id="attachment_960" style="width: 310px" class="wp-caption alignnone"><a href="/wp-content/uploads/2016/10/N1241-AAdN-PI-parametric.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-960" src="/wp-content/uploads/2016/10/N1241-AAdN-PI-parametric-300x175.png" alt="Series N1241 from M3, es() forecast, parametric prediction intervals" width="300" height="175" class="size-medium wp-image-960" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2016/10/N1241-AAdN-PI-parametric-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2016/10/N1241-AAdN-PI-parametric-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2016/10/N1241-AAdN-PI-parametric-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2016/10/N1241-AAdN-PI-parametric.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-960" class="wp-caption-text">Series N1241 from M3, es() forecast, parametric prediction intervals</p></div>
<div id="attachment_959" style="width: 310px" class="wp-caption alignnone"><a href="/wp-content/uploads/2016/10/N1241-AAdN-PI-semiparametric.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-959" src="/wp-content/uploads/2016/10/N1241-AAdN-PI-semiparametric-300x175.png" alt="Series N1241 from M3, es() forecast, semi-parametric prediction intervals" width="300" height="175" class="size-medium wp-image-959" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2016/10/N1241-AAdN-PI-semiparametric-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2016/10/N1241-AAdN-PI-semiparametric-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2016/10/N1241-AAdN-PI-semiparametric-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2016/10/N1241-AAdN-PI-semiparametric.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-959" class="wp-caption-text">Series N1241 from M3, es() forecast, semiparametric prediction intervals</p></div>
<div id="attachment_958" style="width: 310px" class="wp-caption alignnone"><a href="/wp-content/uploads/2016/10/N1241-AAdN-PI-nonparametric.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-958" src="/wp-content/uploads/2016/10/N1241-AAdN-PI-nonparametric-300x175.png" alt="Series N1241 from M3, es() forecast, non-parametric prediction intervals" width="300" height="175" class="size-medium wp-image-958" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2016/10/N1241-AAdN-PI-nonparametric-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2016/10/N1241-AAdN-PI-nonparametric-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2016/10/N1241-AAdN-PI-nonparametric-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2016/10/N1241-AAdN-PI-nonparametric.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-958" class="wp-caption-text">Series N1241 from M3, es() forecast, nonparametric prediction intervals</p></div>
<p>All of them cover the actual values in the holdout, because the intervals are very wide. It is not obvious which of them is the most appropriate for this task. So we can calculate the spread of the intervals and see which of them is wider on average:</p>
<pre class="decode" title="Spread of intervals">
mean(ourModel1$upper-ourModel1$lower)
mean(ourModel2$upper-ourModel2$lower)
mean(ourModel3$upper-ourModel3$lower)
</pre>
<p>Which results in:</p>
<pre>950.4171
955.0831
850.614</pre>
<p>In this specific example, the nonparametric interval appeared to be the narrowest, which is good, given that it adequately covered the values in the holdout sample. However, this doesn't mean that it is superior to the other methods in general. The selection of the appropriate intervals should be based on a general understanding of which assumptions are violated. If we didn't know the actual values in the holdout sample, then we could make a decision based on an analysis of the in-sample residuals, in order to get a clue about the violation of any assumptions. This can be done, for example, in the following way:</p>
<pre class="decode" title="Graphs of residuals">
forecast::tsdisplay(ourModel1$residuals)

hist(ourModel1$residuals)

qqnorm(ourModel3$residuals)
qqline(ourModel3$residuals)</pre>
<div id="attachment_1254" style="width: 310px" class="wp-caption alignnone"><a href="/wp-content/uploads/2017/05/N1241-residuals-plot.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1254" src="/wp-content/uploads/2017/05/N1241-residuals-plot-300x175.png" alt="" width="300" height="175" class="size-medium wp-image-1254" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/05/N1241-residuals-plot-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/05/N1241-residuals-plot-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/05/N1241-residuals-plot-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/05/N1241-residuals-plot.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1254" class="wp-caption-text">Linear plot and correlation functions of the residuals of the ETS(A,Ad,N) model</p></div>
<div id="attachment_1248" style="width: 310px" class="wp-caption alignnone"><a href="/wp-content/uploads/2017/05/N1241-residuals-histogram.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1248" src="/wp-content/uploads/2017/05/N1241-residuals-histogram-300x175.png" alt="" width="300" height="175" class="size-medium wp-image-1248" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/05/N1241-residuals-histogram-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/05/N1241-residuals-histogram-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/05/N1241-residuals-histogram-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/05/N1241-residuals-histogram.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1248" class="wp-caption-text">Histogram of the residuals of the ETS(A,Ad,N) model</p></div>
<div id="attachment_1250" style="width: 310px" class="wp-caption alignnone"><a href="/wp-content/uploads/2017/05/N1241-residuals-qqplot.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1250" src="/wp-content/uploads/2017/05/N1241-residuals-qqplot-300x175.png" alt="" width="300" height="175" class="size-medium wp-image-1250" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/05/N1241-residuals-qqplot-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/05/N1241-residuals-qqplot-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/05/N1241-residuals-qqplot-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2017/05/N1241-residuals-qqplot.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1250" class="wp-caption-text">Q-Q plot of the residuals of the ETS(A,Ad,N) model</p></div>
<p>The first plot shows how the residuals change over time and what the autocorrelation and partial autocorrelation functions look like for this time series. There is no obvious autocorrelation and no obvious heteroscedasticity in the residuals. This means that we can assume that these conditions are not violated in the model, so there is no need to use semiparametric prediction intervals. However, the second and the third graphs demonstrate that the residuals are not normally distributed (as assumed by the ETS(A,Ad,N) model). This means that the parametric prediction intervals may be wrong for this time series. All of this motivates the use of nonparametric prediction intervals for the series N1241.</p>
<p>That's it for today.</p>
<p>Message <a href="https://openforecast.org/2017/06/11/smooth-package-for-r-prediction-intervals/">&#8220;smooth&#8221; package for R. Common ground. Part I. Prediction intervals</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2017/06/11/smooth-package-for-r-prediction-intervals/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
