Archives About es() function - Open Forecast

smooth v3.2.0: what’s new?

Ivan Svetunkov — Mon, 30 Jan 2023 13:06:47 +0000

smooth package has reached version 3.2.0 and is now on CRAN. While the version change from 3.1.7 to 3.2.0 looks small, this has introduced several substantial changes and represents a first step in moving to the new C++ code in the core of the functions. In this short post, I will outline the main new features of smooth 3.2.0.

New engines for ETS, MSARIMA and SMA

The first and one of the most important changes is the new engine for the ETS (Error-Trend-Seasonal exponential smoothing model), MSARIMA (Multiple Seasonal ARIMA) and SMA (Simple Moving Average), implemented respectively in es(), msarima() and sma() functions. The new engine was developed for adam() and the three models above can be considered as special cases of it. You can read more about ETS in ADAM monograph, starting from Chapter 4; MSARIMA is discussed in Chapter 9, while SMA is briefly discussed in Subsection 3.3.3.

The es() function now implements an ETS model close to the conventional one, assuming that the error term follows the normal distribution. It still supports explanatory variables (discussed in Chapter 10 of the ADAM monograph) and advanced estimators (Chapter 11), and it has the same syntax as the previous version of the function, but it now acts as a wrapper for adam(). This means that it is now faster, more accurate and requires less memory than it used to. msarima(), being a wrapper of adam() as well, is now also faster and more accurate than it used to be. In addition, both functions now support the methods that were developed for adam(), including vcov(), confint(), summary(), rmultistep(), reapply(), plot() and others. So, you can now do a more thorough analysis and improve the models using all these advanced instruments (see, for example, Chapter 14 of ADAM).

The main reason why I moved the functions to the new engine was to clean up the code and remove the old chunks that were developed when I only started learning C++. A side effect, as you see, is that the functions have now been improved in a variety of ways.

And to be on the safe side, the old versions of the functions are still available in smooth under the names es_old(), msarima_old() and sma_old(). They will be removed from the package if it ever reaches v4.0.0.

New methods for ADAM

There are two new methods for adam() that can be used in a variety of cases. The first one is simulate(), which will generate data based on the estimated ADAM, whatever the original model is (e.g. mixture of ETS, ARIMA and regression on the data with multiple frequencies). Here is how it can be used:

adam(BJsales, "AAdN") |>
     simulate() |>
     plot()

which will produce a plot similar to the following:

Simulated data based on adam() applied to Box-Jenkins sales data

This can be used for research, when a more controlled environment is needed. If you want to fine tune the parameters of ADAM before simulating the data, you can save the output in an object and amend its parameters. For example:

testModel <- adam(BJsales, "AAdN")
testModel$persistence <- c(0.5, 0.2)
simulate(testModel)

The second new method is xtable() from the respective xtable package. It produces a LaTeX version of the table from the summary of ADAM. Here is an example of a summary from ADAM ETS:

adam(BJsales, "AAdN") |>
     summary()

Model estimated using adam() function: ETS(AAdN)
Response variable: BJsales
Distribution used in the estimation: Normal
Loss function type: likelihood; Loss function value: 256.1516
Coefficients:
      Estimate Std. Error Lower 2.5% Upper 97.5%  
alpha   0.9514     0.1292     0.6960      1.0000 *
beta    0.3328     0.2040     0.0000      0.7358  
phi     0.8560     0.1671     0.5258      1.0000 *
level 203.2835     5.9968   191.4304    215.1289 *
trend  -2.6793     4.7705   -12.1084      6.7437  

Error standard deviation: 1.3623
Sample size: 150
Number of estimated parameters: 6
Number of degrees of freedom: 144
Information criteria:
     AIC     AICc      BIC     BICc 
524.3032 524.8907 542.3670 543.8387

As you can see in the output above, the function generates the confidence intervals for the parameters of the model, including the smoothing parameters, the damping parameter and the initial states. This summary can then be used to generate the LaTeX code for the main part of the table:

adam(BJsales, "AAdN") |>
     xtable()

which will look something like this:

Summary of adam()

Other improvements

First, one of the major changes in smooth functions is the new backcasting mechanism for adam(), es() and msarima() (this is discussed in Section 11.4 of the ADAM monograph). The main difference from the old one is that it no longer backcasts the parameters for the explanatory variables and instead estimates them separately via optimisation. This feature turned out to be important for some users who wanted to try MSARIMAX/ETSX (a model with explanatory variables) with backcasting as the initialisation. These users then wanted to get a summary, analysing the uncertainty around the estimates of parameters for exogenous variables, but could not, because the previous implementation would not estimate them explicitly. This is now available. Here is an example:

cbind(BJsales, BJsales.lead) |>
    adam(model="AAdN", initial="backcasting") |>
    summary()

Model estimated using adam() function: ETSX(AAdN)
Response variable: BJsales
Distribution used in the estimation: Normal
Loss function type: likelihood; Loss function value: 255.1935
Coefficients:
             Estimate Std. Error Lower 2.5% Upper 97.5%  
alpha          0.9724     0.1108     0.7534      1.0000 *
beta           0.2904     0.1368     0.0199      0.5607 *
phi            0.8798     0.0925     0.6970      1.0000 *
BJsales.lead   0.1662     0.2336    -0.2955      0.6276  

Error standard deviation: 1.3489
Sample size: 150
Number of estimated parameters: 5
Number of degrees of freedom: 145
Information criteria:
     AIC     AICc      BIC     BICc 
520.3870 520.8037 535.4402 536.4841

As you can see in the output above, the initial level and trend of the model are not reported, because they were estimated via backcasting. However, we get the value of the parameter BJsales.lead and the uncertainty around it. The old backcasting approach is now called "complete", implying that all values of the state vector are produced via backcasting.

Second, forecast.adam() now has a parameter scenarios, which when TRUE will return the simulated paths from the model. This only works when interval="simulated" and can be used for the analysis of possible forecast trajectories.

Third, the plot() method now can also produce ACF/PACF for the squared residuals for all smooth functions. This becomes useful if you suspect that your data has ARCH elements and want to see if they need to be modelled separately. This can also be done using adam() and sm() and is discussed in Chapter 17 of the monograph.

Finally, the sma() function now has the fast parameter, which when TRUE will use a modified ternary search for the best order based on information criteria. It might not give the global minimum, but it works much faster than the exhaustive search.
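To illustrate the idea behind this option, here is a minimal sketch of a plain (unmodified) ternary search over integer orders. It is not the code used inside sma(): the helper ternarySearchOrder() and the toy criterion are hypothetical, and the search assumes that the criterion has a single minimum over the range of orders.

```r
# Plain ternary search for the integer minimiser of a unimodal criterion.
# ternarySearchOrder() is a hypothetical helper, not a smooth function.
ternarySearchOrder <- function(criterion, lower, upper) {
    while (upper - lower > 2) {
        # Two interior points that split the range into three parts
        m1 <- lower + (upper - lower) %/% 3
        m2 <- upper - (upper - lower) %/% 3
        if (criterion(m1) < criterion(m2)) {
            # The minimum cannot lie at m2 or to the right of it
            upper <- m2 - 1
        } else {
            # The minimum cannot lie at m1 or to the left of it
            lower <- m1 + 1
        }
    }
    # At most three candidates are left: check them directly
    candidates <- lower:upper
    candidates[which.min(sapply(candidates, criterion))]
}

# Toy criterion standing in for an information criterion, minimum at 7
toyIC <- function(k) (k - 7)^2 + 100
bestOrder <- ternarySearchOrder(toyIC, 1, 50)
```

The search evaluates the criterion only a handful of times instead of 50, which is where the speed-up comes from. If the criterion has several local minima, such a search can stop at one of them, which is why the global minimum is not guaranteed.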

Conclusions

These are the main new features in the package. I feel that the main job in smooth is already done, and all I can do now is tune the functions and improve the existing code. I want to move all the functions to the new engine and ditch the old one, but this requires much more time than I have. So, I don't expect to finish this any time soon, but I hope I'll get there someday. On the other hand, I'm not sure that spending much time on developing an R package is a wise idea, given that nowadays people tend to use Python. I would develop a Python analogue of the smooth package, but currently I don't have the necessary expertise and time to do that. Besides, there already exist great libraries, such as statsforecast from Nixtla and sktime. I am not sure that another library implementing ETS and ARIMA is needed in Python. What do you think?

The post smooth v3.2.0: what’s new? first appeared on Open Forecast.

“smooth” package for R. Intermittent state-space model. Part I. Introducing the model

Ivan Svetunkov — Tue, 18 Sep 2018 20:52:14 +0000

UPDATE: Starting from smooth v 3.0.0, the occurrence part of the model has been removed from es() and other functions. The only one that implements this now is adam(). This post has been updated on 01 January 2021.

UPDATE: Starting from smooth v 2.5.0, the model and the respective functions have changed. Now instead of calling the parameter intermittent and working with iss(), one needs to use occurrence and oes() respectively. This post has been updated on 25 April 2019.

One of the features of the functions in the smooth package is the ability to work with intermittent data and with data that has periodically occurring zeroes.

An intermittent time series is a series that has non-zero values occurring at irregular frequency (Svetunkov and Boylan, 2017). Imagine a retailer who sells green lipsticks. The demand for such a product will not be easy to predict, because green is not a popular colour in this case, and thus the sales data will contain a lot of zeroes with occasional non-zero values. Such demand is called “intermittent”. In fact, many products exhibit intermittent patterns in sales, especially when we increase the frequency of measurement (how many tomatoes and how often does a store sell per day? What about per hour? Per minute?).

The other case is when watermelons are sold in high quantities over summer and are either not sold at all or sold very seldom in winter. In this case the demand might be intermittent or even absent in winter and have a nature of continuous demand during summer.

smooth functions can work with both of these types of data, building upon mixture distributions.

In this post we will discuss the basic intermittent demand statistical models implemented in the package.

The model

First, it is worth pointing out that the approach that is used in the statistical methods and models discussed in this post assumes that the final demand on the product can be split into two parts (Croston, 1972):

Demand occurrence part, which is represented by a binary variable, which is equal to one when there is a non-zero demand in period $t$ and zero otherwise;
Demand sizes part, which reflects the amount of sold product when demand occurrence part is equal to one.

This can be represented mathematically by the following equation:
\begin{equation} \label{eq:iSS}
y_t = o_t z_t ,
\end{equation}
where $o_t$ is the binary demand occurrence variable, $z_t$ is the demand sizes variable and $y_t$ is the final demand. This equation was originally proposed by Croston (1972), although he never considered the complete statistical model and only proposed a forecasting method.
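A minimal base R sketch of this split, using a made-up series (the numbers are purely illustrative), shows how the two parts are extracted and how their product gives back the original demand:

```r
# Example demand series with zeroes (made-up numbers)
y <- c(0, 3, 0, 0, 5, 2, 0, 1)

# Demand occurrence: 1 when the demand is non-zero, 0 otherwise
o <- as.numeric(y != 0)

# Demand sizes are only observed when the demand occurs
zObserved <- y[y != 0]

# Put the sizes back in place (any value can stand in for periods with o = 0)
z <- rep(1, length(y))
z[o == 1] <- zObserved

# The product of the two parts reproduces the original series
yReconstructed <- o * z
```

Note that the demand sizes in the periods without demand are not observable, so the value used for them in z is an arbitrary placeholder.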

There are several intermittent demand methods that are usually discussed in forecasting literature: Croston (Croston, 1972), SBA (Syntetos & Boylan, 2000) and TSB (Teunter et al., 2011). These are good methods that work well in the intermittent demand context (see, for example, Kourentzes, 2014). The only limitation is that they are “methods” and not “models”. Having a model becomes important when you want to include additional components, produce proper prediction intervals, select the appropriate components and do proper statistical inference. So, John Boylan and I developed the model underlying these methods (Svetunkov & Boylan, 2017), based on \eqref{eq:iSS}. It is built upon the ETS framework, so we call it “iETS”. Given that all the intermittent demand forecasting methods rely on simple exponential smoothing (SES), we suggested using the ETS(M,N,N) model for both demand sizes and demand occurrence parts, because it underlies SES (Hyndman et al., 2008). One of the key assumptions in our model is that demand sizes and demand occurrences are independent of each other. Although this is an obvious simplification, it is inherited from Croston and TSB, and seems to work well in many contexts.

The iETS(M,N,N) model, discussed in our paper is formulated the following way:
\begin{equation}
\begin{matrix} \label{eq:iETS}
y_t = o_t z_t \\
z_t = l_{z,t-1} \left(1 + \epsilon_t \right) \\
l_{z,t} = l_{z,t-1}( 1 + \alpha_z \epsilon_t) \\
o_t \sim \text{Bernoulli}(p_t)
\end{matrix} ,
\end{equation}
where $z_t$ is represented by the ETS(M,N,N) model, $l_{z,t}$ is the level of demand sizes, $\alpha_z$ is the smoothing parameter and $\epsilon_t$ is the error term. The important assumption in our implementation of the model is that $\left(1 + \epsilon_t \right) \sim \text{log}\mathcal{N}(0, \sigma_\epsilon^2) $ – something that we discussed in one of the previous posts. This means that the demand will always be positive. However, if you deal with some other type of data, where negative values are natural, then you might want to stick with a pure additive model.
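To see what data from this model looks like, here is a small base R simulation that follows the equations above line by line. All the numbers (smoothing parameter, scale, initial level) are arbitrary, and the probability of occurrence is held constant purely to keep the sketch short; this is not how the smooth functions generate data.

```r
set.seed(41)
obs <- 100
alphaZ <- 0.1   # smoothing parameter of demand sizes (arbitrary)
sigma <- 0.3    # scale of the log-normal error (arbitrary)
p <- 0.4        # probability of occurrence, held constant for simplicity

# (1 + epsilon_t) follows the log-normal distribution with location 0
onePlusEpsilon <- rlnorm(obs, meanlog = 0, sdlog = sigma)
epsilon <- onePlusEpsilon - 1

level <- numeric(obs + 1)
level[1] <- 5   # initial level of demand sizes (arbitrary)
z <- numeric(obs)
for (i in 1:obs) {
    # Measurement equation: z_t = l_{z,t-1} (1 + epsilon_t)
    z[i] <- level[i] * (1 + epsilon[i])
    # Transition equation: l_{z,t} = l_{z,t-1} (1 + alpha_z epsilon_t)
    level[i + 1] <- level[i] * (1 + alphaZ * epsilon[i])
}

# Occurrence: o_t ~ Bernoulli(p)
o <- rbinom(obs, size = 1, prob = p)

# Final demand: y_t = o_t z_t
y <- o * z
```

Because $1 + \epsilon_t$ is always positive, both the level and the demand sizes stay positive, so the final demand is either zero or a positive number.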

Having this statistical model makes the approach extendable, so that one can add a trend, a seasonal component, or exogenous variables. We don’t discuss these elements in our paper, but they are briefly mentioned in the conclusions. We won’t discuss these features just yet either; we will cover them in the next post.

Now the main question that stays unanswered is how to model the probability $p_t$. And there are several approaches to that:

1. iETS$_F$ – assume that the demand occurs at random with a fixed probability (so $p_t = p$).
2. iETS$_O$ – the so-called “Odds Ratio” model, which uses the logistic curve in order to update the probability. In this case the model is focused on the probability of occurrence of the demand.
3. iETS$_I$ – the “Inverse Odds Ratio” model, which uses similar principles to iETS$_O$, but focuses on the probability of non-occurrence. This model underlies the Croston (1972) method, but it uses different principles. Instead of updating the probability only when demand occurs, it does that on every observation.
4. iETS$_D$ – the “Direct probability” model, which uses the principles suggested by Teunter et al. (2011). In this case the probability is updated directly based on the values of the occurrence variable, using the SES method.
5. iETS$_G$ – the “General” model, encompassing all of the above. This model has two sub-models for the demand occurrence, capturing both the probability of occurrence and non-occurrence.

In case (1) the model is very basic, and we can just estimate the value of the probability and produce forecasts. In cases (2) – (5), we suggest using another ETS(M,N,N) model underlying each of these processes. So when it comes to producing forecasts, in both situations we assume that the future level of probability will be the same as the last obtained one (level forecast from the local-level model). After that the final forecast is generated using:
\begin{equation} \label{eq:iSSForecast}
\hat{y}_t = \hat{p}_t \hat{z}_t ,
\end{equation}
where $\hat{p}_t$ is the forecast of the probability, $\hat{z}_t$ is the forecast of demand sizes and $\hat{y}_t$ is the final forecast of the intermittent demand.
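Here is a rough base R illustration of how such a forecast is put together. It uses plain SES for both parts, in the spirit of the direct probability approach, rather than the likelihood-based estimation done in the package; the helper sesLevel(), the data and the smoothing parameter of 0.1 are all made up for the example.

```r
# Simple exponential smoothing returning the last level (hypothetical helper)
sesLevel <- function(x, alpha, initial = x[1]) {
    level <- initial
    for (value in x) {
        level <- level + alpha * (value - level)
    }
    level
}

y <- c(4, 0, 6, 5, 0, 0, 3, 0, 0, 2, 0, 0)
o <- as.numeric(y != 0)

# Probability of occurrence: SES applied to the occurrence variable
pHat <- sesLevel(o, alpha = 0.1)
# Demand sizes: SES applied to the non-zero values only
zHat <- sesLevel(y[y != 0], alpha = 0.1)

# Final forecast, the same for every step ahead
yHat <- pHat * zHat
```

Since the probability forecast is below one, the final forecast is always lower than the forecast of demand sizes: it reflects the average demand per period, including the periods with zeroes.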

In order to distinguish the whole model \eqref{eq:iETS}, the demand sizes and the demand occurrence parts of the model, different names are used. For example, iETS$_G$(M,N,N) would refer to the full model \eqref{eq:iETS} ($y_t$), oETS$_G$(M,N,N) would refer to the occurrence part of the model ($o_t$) and ETS(M,N,N) refers to the demand sizes part ($z_t$). In all these three cases the “(M,N,N)” part indicates that the exponential smoothing with multiplicative error, no trend and no seasonality is used. More advanced notations for the iETS models are available, but they will be discussed in the next post. For now we will stick with the level models and use the shorter names.

Summarising advantages of our framework:

Our model is extendable: you can use any ETS model and even introduce exogenous variables. This is already available in smooth package. In fact, you can use any model you want for demand sizes and a wide variety of models for demand occurrence variable;
The model allows selecting between the aforementioned types of intermittent models (“fixed” / “odds ratio” / “inverse odds ratio” / “direct” and “general”) using information criteria. This mechanism works fine on large samples, but, unfortunately, does not always work well in cases of small samples;
The model allows producing prediction intervals for several steps ahead and cumulative (over a lead time) upper bound of the intervals. The latter arises naturally from the model and can be used for safety stock calculation;
The estimation of models is done using likelihood function and not some ad-hoc estimators. This means that the estimates of parameters become efficient and consistent;
Although the proposed model is continuous, we show in our paper that it is more accurate than several other integer-valued models. Still, if you want to have integer numbers as your final forecasts, you can round up or round down either the point or prediction intervals, ending up with meaningful values. This can be done due to a connection between the quantiles of rounded values and the rounding of quantiles of continuous variable (discussed in Appendix of our paper).
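The last point can be checked numerically. The sketch below is not the derivation from the paper, just a small experiment on simulated positive values: it assumes that the empirical quantiles are taken as order statistics (type = 1 in quantile()), in which case rounding up the values and then taking quantiles gives the same numbers as taking quantiles and then rounding them up.

```r
set.seed(7)
# Continuous positive values standing in for simulated future demand
demand <- rlnorm(1000, meanlog = 1, sdlog = 0.5)
probs <- c(0.5, 0.9, 0.95)

# Round up first, then take the quantiles...
quantileOfRounded <- quantile(ceiling(demand), probs, type = 1)
# ...or take the quantiles first, then round them up
roundedQuantile <- ceiling(quantile(demand, probs, type = 1))
```

The two vectors coincide because rounding up is a monotone transformation, so it does not change the order of observations. The same holds for rounding down with floor().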

If you need more details, have a look at our working paper (have I already advertised it enough in this post?).

Implementation. Demand occurrence

The aforementioned model with different occurrence types is available in smooth package. There is a special function for demand occurrence part, called oes() (Occurrence Exponential Smoothing) and there is a parameter in every smooth forecasting function called occurrence, which can be one of: “none”, “fixed”, “odds-ratio”, “inverse-odds-ratio”, “direct”, “general” or “auto”, corresponding to the sub types of oETS discussed above. The “auto” option selects the best occurrence model of the five. For now we will neglect this option and come back to it later.

So, let’s consider an example with artificial data. We create the following time series:

x <- c(rpois(25,5),rpois(25,1),rpois(25,0.5),rpois(25,0.1))

This way we have artificial data, where both demand sizes and demand occurrence probability decrease over time in a step-wise manner, every 25 observations. The generated data resembles something called "demand obsolescence" or "dying out demand". Let’s start by fitting those five models for the demand occurrence part:
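To see what the step-wise decrease means in numbers, we can calculate the theoretical probability of a non-zero value and the expected non-zero demand size for each of the four Poisson rates used above. This is just arithmetic on the data generating process, not something that the package reports:

```r
lambdas <- c(5, 1, 0.5, 0.1)

# Probability of a non-zero value in each block of 25 observations
pOccurrence <- 1 - dpois(0, lambdas)
round(pOccurrence, 3)
# → 0.993 0.632 0.393 0.095

# Expected demand size given that the demand has occurred
sizeGivenOccurrence <- lambdas / pOccurrence

# Average probability of occurrence over the whole sample
mean(pOccurrence)
```

The average probability over the four blocks is roughly 0.53, which is the kind of value that a model with a fixed probability would be expected to pick up on a typical sample.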

oesFixed <- oes(x, occurrence="f", h=25)

Occurrence state space model estimated: Fixed probability
Underlying ETS model: oETS[F](MNN)
Smoothing parameters:
level 
    0 
Vector of initials:
level 
 0.55 
Information criteria: 
     AIC     AICc      BIC     BICc 
139.6278 139.6686 142.2329 142.3269

oesOdds <- oes(x, occurrence="o", h=25)

Occurrence state space model estimated: Odds ratio
Underlying ETS model: oETS[O](MNN)
Smoothing parameters:
level 
0.828 
Vector of initials:
 level 
14.442 
Information criteria: 
     AIC     AICc      BIC     BICc 
116.3124 116.4361 121.5227 121.8076

oesInverse <- oes(x, occurrence="i", h=25)

Occurrence state space model estimated: Inverse odds ratio
Underlying ETS model: oETS[I](MNN)
Smoothing parameters:
level 
0.116 
Vector of initials:
level 
0.039 
Information criteria: 
     AIC     AICc      BIC     BICc 
 98.5508  98.6745 103.7611 104.0460

oesDirect <- oes(x, occurrence="d", h=25)

Occurrence state space model estimated: Direct probability
Underlying ETS model: oETS[D](MNN)
Smoothing parameters:
level 
0.115 
Vector of initials:
level 
0.884 
Information criteria: 
     AIC     AICc      BIC     BICc 
106.5982 106.7219 111.8086 112.0934

oesGeneral <- oes(x, occurrence="g", h=25)

Occurrence state space model estimated: General
Underlying ETS model: oETS[G](MNN)(MNN)
Information criteria: 
     AIC     AICc      BIC     BICc 
102.5508 102.9718 112.9715 113.9410

By looking at the outputs, we can already say that the oETS$_I$ model performs better than the others in terms of information criteria: it has the lowest AIC, as well as the lowest values of the other criteria. This is because this model is more suitable for cases when demand is dying out, as it focuses on demand non-occurrence. Note that the optimal smoothing parameter for oETS$_O$ is quite high. This is because the model is more focused on demand occurrences. If the demand were building up rather than dying out in our example, the situation would be different: oETS$_O$ would have a lower parameter and oETS$_I$ would have a higher one. Note also that the initial level of oETS$_I$ in the output above is equal to 0.039 (0.116 is its smoothing parameter), which corresponds to $\frac{1}{1+0.039} \approx 0.96$, the probability of occurrence at the beginning of the time series.

Note also, that the oETS$_G$ does not print out details, because it has two models (called modelA and modelB in R), each of which has their own parameters. Here are the outputs of both these models:

oesGeneral$modelA
oesGeneral$modelB

Occurrence state space model estimated: General
Underlying ETS model: oETS(MNN)_A
Smoothing parameters:
level 
    0 
Vector of initials:
level 
   16 
Information criteria: 
     AIC     AICc      BIC     BICc 
 98.5508  98.6745 103.7611 104.0460

Occurrence state space model estimated: General
Underlying ETS model: oETS(MNN)_B
Smoothing parameters:
level 
0.116 
Vector of initials:
level 
0.628 
Information criteria: 
     AIC     AICc      BIC     BICc 
 98.5508  98.6745 103.7611 104.0460

Note that oETS$_G$ and both models A and B have the same likelihood, because they are all parts of one and the same thing. However, the information criteria differ because they have different numbers of parameters to estimate: models A and B have 2 parameters each, which means that the whole model has 4 parameters. Also, note that model A has the smoothing parameter equal to zero, which means that there is no update of states in that part of the model. We will come back to this observation in a moment.
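This can be verified with the numbers printed above. The sketch below recovers the log-likelihood from the AIC of model A and recalculates the criteria for 4 parameters, using the standard formulae for AIC, AICc and BIC and the sample size of 100 observations. The function names are made up for the example.

```r
obs <- 100                 # sample size of x
aicModelA <- 98.5508       # AIC printed for model A (2 parameters)

# AIC = 2k - 2 logLik, so the log-likelihood can be recovered
logLikValue <- (2 * 2 - aicModelA) / 2

# Information criteria as functions of the number of parameters k
AICValue <- function(k) 2 * k - 2 * logLikValue
AICcValue <- function(k) AICValue(k) + 2 * k * (k + 1) / (obs - k - 1)
BICValue <- function(k) k * log(obs) - 2 * logLikValue

# Criteria for the whole oETS_G model with 4 parameters
round(c(AIC = AICValue(4), AICc = AICcValue(4), BIC = BICValue(4)), 3)
```

Up to the rounding of the printed AIC, these values match the ones reported for oesGeneral (102.5508, 102.9718 and 112.9715), confirming that the difference between the outputs comes only from the number of parameters.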

We can also plot the actual occurrence variable, fitted and forecasted probabilities using plot function:

plot(oesFixed)

plot(oesOdds)

plot(oesInverse)

plot(oesDirect)

plot(oesGeneral)

Note that the different models capture the probability differently. While oETS$_F$ averages out the probability, all the other models react to the changes in the data, but differently:
the odds ratio model is more reactive and seems not to do a good job, trying to keep up after the changes in the data; the inverse odds ratio model is much smoother; the direct probability model is more reactive, but not as reactive as the odds ratio one; finally, the general model replicates the behaviour of the inverse odds ratio model. The latter happens because, as we have seen from the output above, model A in oETS$_G$ is not updating the states, which means that the general and more complicated model reverts to the special case of oETS$_I$. Still, it is worth mentioning that all the models predict that the probability will be quite low, which corresponds to the data that we have generated.

Although we could have tried out the more complicated ETS models for the demand occurrence, we will leave this to the next post.

Implementation. The whole demand

In order to deal with the intermittent data and produce the forecasts for the whole time series, we can use either es(), or ssarima(), or ces(), or gum() - all of them have the parameter occurrence, which is equal to "none" by default. We will use an example of ETS models. In order to simplify things we will use iETS$_I$ model (because we have noticed that it is more appropriate for the data than the other models):

es(x, "MNN", occurrence="i", silent=FALSE, h=25)

The forecast of this model is a straight line, close to zero due to the decrease in both demand sizes and demand occurrence parts. However, knowing that the demand decreases, we can use a trend model in this case. The flexibility of the approach allows us to do that, so we fit ETS(M,M,N) to demand sizes:

es(x, "MMN", occurrence="i", silent=FALSE, h=25)

The forecast in this case is even closer to zero and approaches it asymptotically, which means that we foresee that the demand for our product will on average die out.

We can also produce prediction intervals and use model selection for demand sizes. If you know that the data cannot be negative (e.g. selling tomatoes in kilograms), then I would recommend using the pure multiplicative model:

es(x, "YYN", occurrence="i", silent=FALSE, h=25, intervals=TRUE)

Forming the pool of models based on... MNN, MMN, Estimation progress: 100%... Done! 
Time elapsed: 1.02 seconds
Model estimated: iETS(MMN)
Occurrence model type: Inverse odds ratio
Persistence vector g:
alpha  beta 
0.268 0.000 
Initial values were optimised.
7 parameters were estimated in the process
Residuals standard deviation: 0.386
Cost function type: MSE; Cost function value: 0.149

Information criteria:
     AIC     AICc      BIC     BICc 
333.4377 334.0760 348.5648 339.9301 
95% parametric prediction intervals were constructed

As we can see, the multiplicative trend model appears to be more suitable for our data. Note that the prediction intervals for the model are narrowing down, which is due to the decrease of level of demand. Compare this graph with the one when the pure additive model is selected:

es(x, "XXN", occurrence="i", silent=FALSE, h=25, intervals=TRUE)

Forming the pool of models based on... ANN, AAN, Estimation progress:    ... Done! 
Time elapsed: 0.23 seconds
Model estimated: iETS(ANN)
Occurrence model type: Inverse odds ratio
Persistence vector g:
alpha 
0.251 
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 1.125
Cost function type: MSE; Cost function value: 1.265

Information criteria:
     AIC     AICc      BIC     BICc 
459.8706 460.1206 472.8964 464.2617 
95% parametric prediction intervals were constructed

In the latter case the prediction intervals cover the negative part of the plane which does not make sense in our context. Note also that the information criteria are lower for the multiplicative model, which is due to the changing variance in sample for each 25 observations.

The important thing to note is that although the multiplicative trend model sounds solid from the theoretical point of view (it cannot produce negative values), it might be dangerous in cases of small samples and positive trends. In this situation the model can produce an exploding trajectory, because the forecast corresponds to an exponential function. I don’t have any universal solution for this problem at the moment, but I would recommend using the ETS(M,Md,N) (damped multiplicative trend) model instead of ETS(M,M,N). The reason why I don’t recommend ETS(M,A,N) is that in cases of negative trend, with the typically low level of intermittent demand, the updated level might become negative, thus making the model inapplicable to the data.

As you see, there are now five new options for the models, which might complicate things in practice. Now we need to understand how to select the most appropriate model for the demand occurrence among these five and how to make them more flexible, so that the trend and seasonal components are taken into account in demand occurrence, potentially together with some explanatory variables. If we can have a model that does all that, it would make it universal for a wide variety of data... One could only dream... right?

The post “smooth” package for R. Intermittent state-space model. Part I. Introducing the model first appeared on Open Forecast.

“smooth” package for R. es() function. Part VI. Parameters optimisation

Ivan Svetunkov — Sat, 29 Apr 2017 18:56:21 +0000

UPDATE: Starting from the v2.5.6 the C parameter has been renamed into B. This is now consistent across all the functions.

Now that we have looked into the basics of the es() function, we can discuss how the optimisation mechanism works, how the parameters are restricted and what the initial values of the parameters are in the optimisation of the function. This will be a fairly technical post for researchers who are interested in the inner (darker) parts of es().

NOTE. In this post we will discuss initial values of parameters. Please, don’t confuse this with initial values of components. The former is a wider term than the latter, because in general it includes the latter. Here I will explain how initialisation is done before the parameters are optimised.

Let’s get started.

Before the optimisation, we need to have some initial values of all the parameters we want to estimate. The number of parameters and initialisation principles depend on the selected model. Let’s go through all of them in detail.

The smoothing parameters $\alpha$, $\beta$ and $\gamma$ (for level, trend and seasonal components) for the model with additive error term are set to be equal to 0.3, 0.2 and 0.1, while for the multiplicative one they are equal to 0.1, 0.05 and 0.01 respectively. The motivation here is that we want to have parameters closer to zero in order to smooth the series (although we don’t always get these values), and in case with multiplicative models the parameters need to be very low, otherwise the model may become too sensitive to the noise.

The next important value is the initial value of the damping parameter $\phi$, which is set equal to 0.95. We don’t want to start from one, because the damped trend model in this case loses the property of damping, but we want to be close to one in order not to enforce excessive damping.

As for the vector of states, its initial values are set depending on the type of model. First, the following simple model is fit to the first 12 observations of data (if we don’t have 12 observations, then to the whole sample we have):
\begin{equation} \label{eq:simpleregressionAdditive}
y_t = a_0 + a_1 t + e_t .
\end{equation}
In case with multiplicative trend we use a different model:
\begin{equation} \label{eq:simpleregressionMulti}
\log(y_t) = a_0 + a_1 t + e_t .
\end{equation}
In both cases $a_0$ is the intercept, which is used as the initial value of level component and $a_1$ is the slope of the trend, which is used as the initial of trend component. In case of multiplicative model, exponents of $a_0$ and $a_1$ are used. For the case of no trend, a simple average (of the same sample) is used as initial of level component.
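The two regressions can be reproduced in base R as follows. This is only an illustration of the principle described above on the BJsales data from the datasets package, not the actual code from es(), and the object names are made up:

```r
# BJsales is a dataset shipped with base R
y <- as.numeric(BJsales)
# First 12 observations, or the whole sample if it is shorter
initialSample <- y[1:min(12, length(y))]
timeIndex <- seq_along(initialSample)

# Additive trend: y_t = a_0 + a_1 t + e_t
additiveFit <- lm(initialSample ~ timeIndex)
initialsAdditive <- setNames(coef(additiveFit), c("level", "trend"))

# Multiplicative trend: log(y_t) = a_0 + a_1 t + e_t, exponents are used
multiplicativeFit <- lm(log(initialSample) ~ timeIndex)
initialsMultiplicative <- setNames(exp(coef(multiplicativeFit)),
                                   c("level", "trend"))

# No trend: simple average of the same sample
initialLevelNoTrend <- mean(initialSample)
```

In the multiplicative case the initial trend is a growth rate, so a value slightly above one corresponds to a slowly increasing series and a value below one to a decreasing one.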

In case of seasonal model, the classical seasonal decomposition (“additive” or “multiplicative” – depending on the type specified by user) is done using decompose() function, and the seasonal indices are used as initials for the seasonal component.
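For example, the seasonal indices of both types can be extracted from the output of decompose() like this. AirPassengers from the datasets package is used purely for illustration, and I apply the decomposition to the whole series here, which is an assumption of the sketch rather than a statement about what es() does internally:

```r
# AirPassengers is a monthly dataset shipped with base R
seasonalMultiplicative <- decompose(AirPassengers,
                                    type = "multiplicative")$figure
seasonalAdditive <- decompose(AirPassengers, type = "additive")$figure
```

Each vector contains 12 values, one per month. The multiplicative indices are normalised to average to one, while the additive ones sum up to zero.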

All the values are then packed in the vector called B in the following order:

1. Vector of smoothing parameters $\mathbf{g}$ (persistence);
2. Damping parameter $\phi$ (phi);
3. Initial values of non-seasonal part of vector of states $\mathbf{v}_t$ (initial);
4. Initial values of seasonal part of vector of states $\mathbf{v}_t$ (initialSeason).

After that the parameters of exogenous variables are added to the vector. We will cover the topic of exogenous variables separately in one of the upcoming posts. The sequence is:

Vector of parameters of exogenous variables $\mathbf{a}_t$ (initialX);
Transition matrix for exogenous variables (transitionX);
Persistence vector for exogenous variables (persistenceX).

Obviously, if we use predefined values of some of those elements, then they are not optimised and skipped during the formation of the vector B. For example, if user specifies parameter initial, then the step (3) is skipped.
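The logic of forming the vector can be sketched with a small hypothetical helper. The function formB() and the way the elements are named are made up for illustration and do not correspond to the internal code of es(); the starting values follow the description above for a model with additive error, and the initial level and trend are arbitrary numbers.

```r
# Hypothetical helper: only the elements that are not provided by the user
# (i.e. are absent from the "provided" list) end up in the vector B.
formB <- function(defaults, provided = list()) {
    B <- c()
    for (element in names(defaults)) {
        if (is.null(provided[[element]])) {
            B <- c(B, setNames(defaults[[element]],
                               paste0(element, seq_along(defaults[[element]]))))
        }
    }
    B
}

# Starting values for ETS(A,Ad,N): persistence, damping parameter, initials
defaults <- list(persistence = c(0.3, 0.2),
                 phi = 0.95,
                 initial = c(200, 0.5))

# Everything is estimated: B has five elements
BFull <- formB(defaults)
# The user provides the initial states, so that part is skipped
BShort <- formB(defaults, provided = list(initial = c(200, 0.5)))
```

In the second case only the smoothing parameters and the damping parameter are left in B, so the optimiser works with three values instead of five.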

There are some restrictions on the values of the estimated parameters. They are defined in vectors lb and ub (names are taken from the respective parameters of the nloptr() function), which have the same length as B and correspond to the same elements as in B (persistence vector, then damping parameter etc). They may vary depending on the value of the parameter bounds. These restrictions are needed in order to find the optimal value of the vector B faster. The majority of them are fairly technical, making sure that the resulting model has meaningful components (for example, a multiplicative component should be greater than zero). The only parameter that is worth mentioning separately is the damping parameter $\phi$. It is allowed to take values between zero and one (including boundary values). In this case the forecasting trajectories do not exhibit explosive behaviour.

Now the vectors lb and ub define general regions for all the parameters, but the bounds of the smoothing parameters need finer regulation, because they are connected with each other. That is why they are regulated in the cost function itself. If the user defines "usual" bounds, then the smoothing parameters are restricted to make sure that:
\begin{equation} \label{eq:boundsUsual}
\alpha \in [0, 1]; \beta \in [0, \alpha]; \gamma \in [0, 1-\alpha] \end{equation}
This way the exponential smoothing has the property of an averaging model, meaning that the weights are distributed over time in an exponential fashion, are all positive and add up to one, and the weights of the newer observations are higher than the weights of the older ones.
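To get a feeling for this region, here is a minimal base R sketch (not the code used inside es(); the function name inUsualBounds is made up for illustration). It checks a set of smoothing parameters against the region above and shows that the weights of simple exponential smoothing, $\alpha (1-\alpha)^j$, are positive, decline with the age of the observation and add up to one:

```r
# Hypothetical helper: TRUE if the parameters lie in the "usual" region
inUsualBounds <- function(alpha, beta = 0, gamma = 0) {
  alpha >= 0 && alpha <= 1 &&
    beta >= 0 && beta <= alpha &&
    gamma >= 0 && gamma <= 1 - alpha
}

inUsualBounds(0.3, 0.1, 0.5)   # inside the region
inUsualBounds(0.3, 0.4)        # beta is greater than alpha, so outside

# Weights of simple exponential smoothing for the last 200 observations
alpha <- 0.3
sesWeights <- alpha * (1 - alpha)^(0:199)
sum(sesWeights)                # practically one
all(diff(sesWeights) < 0)      # newer observations get higher weights
```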

One of the features introduced in smooth v2.5.3 is that if a parameter takes a boundary value (either zero or one), then it is substituted by that value and the number of parameters is decreased by one. This happens only in the case of bounds="usual" and either model selection or estimation. Also, if the damping parameter $\phi=1$, then the model with damped trend is substituted by the non-damped version (e.g. ETS(A,A,N) instead of ETS(A,Ad,N)).

If the user defines bounds="admissible", then the eigenvalues of the discount matrix are calculated on each iteration. The function makes sure that the selected smoothing parameters guarantee that the eigenvalues lie inside the unit circle. This way the model is stable, which means that the weights decrease over time and add up to one. However, on each separate observation they may become negative or greater than one, meaning that the model is no longer an “averaging” model.
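For ETS(A,A,N) this check can be sketched in a few lines of base R. The sketch assumes the standard state space form with transition matrix $\mathbf{F}$, measurement vector $\mathbf{w}$ and discount matrix $\mathbf{D} = \mathbf{F} - \mathbf{g} \mathbf{w}'$; the function name isStableAAN is hypothetical and this is not the C++ code that es() runs:

```r
# Hypothetical helper: stability check for ETS(A,A,N)
isStableAAN <- function(alpha, beta) {
  matF <- matrix(c(1, 0, 1, 1), 2, 2)   # rows are (1, 1) and (0, 1)
  vecW <- c(1, 1)                       # measurement vector
  vecG <- c(alpha, beta)                # persistence vector
  matD <- matF - outer(vecG, vecW)      # discount matrix D = F - g w'
  # The model is stable if all eigenvalues lie inside the unit circle
  all(Mod(eigen(matD, only.values = TRUE)$values) < 1)
}

isStableAAN(0.3, 0.1)       # usual region, stable
isStableAAN(1.99, 0.018)    # alpha close to two, still stable
isStableAAN(2.483, 1.093)   # unstable
```

The last two calls use the smoothing parameters from the es() outputs shown further in this post: the first pair comes from the model with admissible bounds, the second one from the model for which the function issues the instability warning.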

In the extreme case of bounds="none" the bounds of smoothing parameters are not checked.

In case of violation of bounds for smoothing parameters, the cost function returns a very high number, so the optimiser “hits the wall” and goes to the next value.
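The “wall” idea can be illustrated with ETS(A,N,N) and a toy series. The sketch below is a simplification under several assumptions: it uses optim() from base R with Nelder-Mead as a stand-in for nloptr(), the data is made up, the function name costANN is hypothetical, and only the bound $\alpha \in [0, 1]$ is checked:

```r
# MSE cost function for ETS(A,N,N); B = c(alpha, initial level)
costANN <- function(B, y) {
  alpha <- B[1]
  level <- B[2]
  # The "wall": a very high number if the bounds are violated
  if (alpha < 0 || alpha > 1) {
    return(1e100)
  }
  errors <- numeric(length(y))
  for (t in seq_along(y)) {
    errors[t] <- y[t] - level           # one-step-ahead error
    level <- level + alpha * errors[t]  # update of the level
  }
  mean(errors^2)
}

y <- c(12, 15, 14, 16, 19, 18, 21, 20, 23, 25)   # made-up data
B <- c(0.3, y[1])                                # starting values
fit <- optim(B, costANN, y = y, method = "Nelder-Mead")
fit$par     # estimated alpha and initial level
fit$value   # MSE at the optimum
```

Whenever the optimiser proposes a value of alpha outside the region, the cost is so high that this point is never accepted, so the returned alpha stays between zero and one.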

In order to optimise the model we use the function nloptr() from the nloptr package. This function implements non-linear optimisation algorithms written in C. smooth functions use two algorithms: BOBYQA and Nelder-Mead. The optimisation is done in two stages: the parameters are estimated using the former, after which the returned parameters are used as the initial values for the latter. In the case of mixed models we also check whether the parameters returned from the first stage differ from the initial values. If they don’t, the optimisation has failed and BOBYQA is repeated with different initial values of the vector of parameters B (smoothing parameters that failed during the optimisation are set equal to zero). If you find that the optimisation did not go well, you can pass two parameters to the functions via the ellipsis: the maximum number of iterations maxeval and the relative tolerance xtol_rel. The default values and a general explanation are given in the documentation of the smooth functions.

Overall this optimisation mechanism guarantees that the parameters are close to the optimal values, lie in the meaningful region and satisfy the predefined bounds.

As an example, we will apply es() to the time series N41 from M3.

First, here’s what ETS(A,A,N) with usual bounds looks like on that time series:

es(M3$N0041$x,"AAN",bounds="u",h=6)

Time elapsed: 0.1 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta 
    0     0 
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 397.628
Cost function type: MSE; Cost function value: 101640.73

Information criteria:
     AIC     AICc      BIC 
211.1391 218.6391 214.3344

As we see, in this case the optimal smoothing parameters are equal to zero. This means that we do not take any information into account and just produce the straight line (deterministic trend). See for yourselves:

Series №41 and ETS(A,A,N) with traditional (aka “usual”) bounds

And here’s what happens with the admissible bounds:

es(M3$N0041$x,"AAN",bounds="a",h=6)

Time elapsed: 0.11 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta 
1.990 0.018 
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 327.758
Cost function type: MSE; Cost function value: 69059.107

Information criteria:
     AIC     AICc      BIC 
205.7283 213.2283 208.9236

The smoothing parameter of the level, $\alpha$, is greater than one: it is almost two. This means that the exponential smoothing is no longer an averaging model, but I can assure you that the model is still stable. Such a high value of the smoothing parameter means that the level of the time series changes drastically. This is not common and usually not desired, but it is a possible behaviour of exponential smoothing. Here is how it looks:

Series №41 and ETS(A,A,N) with admissible bounds

Here I would like to note that a model can be stable even with negative smoothing parameters. So don’t be scared. If the model is not stable, the function will warn you.

Last but not least, the user can regulate the values of the B, lb and ub vectors for the first optimisation stage. Model selection does not work with the provided vectors of initial parameters, because the length of the B, lb and ub vectors is fixed, while in the case of model selection it varies from model to model. The user also needs to make sure that the length of each of the vectors is correct and corresponds to the selected model. The values are passed via the ellipsis in the following way:

B <- c(0.2, 0.1, M3$N0041$x[1], diff(M3$N0041$x)[1])
es(M3$N0041$x,"AAN",B=B,h=6,bounds="u")

Time elapsed: 0.1 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta 
    1     0 
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 429.923
Cost function type: MSE; Cost function value: 118821.938

Information criteria:
     AIC     AICc      BIC 
213.3256 220.8256 216.5209

In this case we reached boundary values for both the level and the trend smoothing parameters. The resulting model has a constantly changing level (random walk level) and a deterministic trend. This is a weird but possible combination. The fit and forecast look similar to those of the model with the admissible bounds, but are not as reactive:

Series №41 and ETS(A,A,N) with traditional bounds and non-standard initials

Using this functionality, you may end up with ridiculous and meaningless models, so be careful. For example, the following does not make any sense from a forecasting perspective:

B <- c(2.5, 1.1, M3$N0041$x[1], diff(M3$N0041$x)[1])
lb <- c(1,1, 0, -Inf)
ub <- c(3,3, Inf, Inf)
es(M3$N0041$x,"AAN",B=B, lb=lb, ub=ub, bounds="none",h=6)

Time elapsed: 0.12 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta 
2.483 1.093 
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 193.328
Cost function type: MSE; Cost function value: 24027.222

Information criteria:
     AIC     AICc      BIC 
190.9475 198.4475 194.1428 
Warning message:
Model ETS(AAN) is unstable! Use a different value of 'bounds' parameter to address this issue!

Although the fit is very good and the model approximates the data better than all the others (the MSE value is 24027 versus 70000 – 120000 for the other models), the model is unstable (the function warns us about that), meaning that the weights are distributed in an unreasonable way: the older observations become more important than the newer ones. The forecast of such a model is meaningless and is most probably biased and inaccurate. Here is how it looks:

Series №41 and ETS(A,A,N) with crazy bounds

So be careful with manual tuning of the optimiser.

Have fun but be reasonable!

The post “smooth” package for R. es() function. Part VI. Parameters optimisation first appeared on Open Forecast.

“smooth” package for R. es() function. Part V. Essential parameters

Ivan Svetunkov — Sun, 05 Mar 2017 00:00:58 +0000

While the previous posts on es() function contained two parts: theory of ETS and then the implementation – this post will cover only the latter. We won’t discuss anything new, we will mainly look into several parameters that the exponential smoothing function has and what they allow us to do.

We start with initialisation of es().

History of exponential smoothing counts dozens of methods of initialisation. Some of them are fine, some of them are very wrong. Some of those methods allow preserving data, the others unnecessarily consume parts of time series. I have implemented ETS in a way that allows initialising it before the sample starts. So the state vector $v_t$ discussed in parts 2 and 3, is defined before the very first observation $y_1$. This is consistent with Hyndman et al. (2008) approach. Still this initial value can be defined using different methods:

Optimisation. This means that the initial value is found along with the smoothing parameters. This can be triggered by initial="optimal" and is the default method in es().

While the optimisation works perfectly fine on monthly data, there may be some problems with weekly and daily seasonal data. The reason for that is a high number of parameters that need to be estimated. For example, ETS(M,N,M) on weekly seasonal data will have 52 + 1 + 2 + 1 = 56 parameters to estimate (52 seasonal indices, 1 level component, 2 smoothing parameters and 1 variance of residuals). This is not an easy task which sometimes cannot be efficiently solved. That is why we may need other initialisation methods.
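The arithmetic generalises easily. The helper below (nParamETS is a made-up name, not a function from smooth) counts the parameters in the same way as in the example above: smoothing parameters, damping parameter, initial states and the variance of residuals. The argument initialsEstimated is an assumption introduced for the sketch, covering the case when the initial states are not optimised:

```r
# Hypothetical helper: number of estimated parameters of an ETS model
nParamETS <- function(seasonalPeriod = 1, trend = FALSE, damped = FALSE,
                      seasonal = FALSE, initialsEstimated = TRUE) {
  # One smoothing parameter per component
  nSmoothing <- 1 + trend + seasonal
  # Initial level, initial trend and one index per season
  nInitials <- if (initialsEstimated) {
    1 + trend + seasonal * seasonalPeriod
  } else {
    0
  }
  # The last element is the variance of residuals
  nSmoothing + damped + nInitials + 1
}

nParamETS(52, seasonal = TRUE)    # 56, weekly ETS(M,N,M) from the text
nParamETS(336, seasonal = TRUE)   # 340, half-hourly data with weekly cycle
```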

Let’s see what happens when we encounter this problem in an example. I will use time series taylor from forecast package. This is half-hourly electricity demand data. It has frequency of 336 (7 days * 48 half-hours) and is really hard to work with when the standard initialisation is used. Let’s see what happens when es() is applied with model selection and a holdout of one week of data:

es(taylor,"ZZZ",h=336,holdout=TRUE)

Forming the pool of models based on... ANN, ANA, ANM, AAA, Estimation progress: 100%... Done! 
Time elapsed: 18.47 seconds
Model estimated: ETS(ANA)
Persistence vector g:
alpha gamma 
0.850 0.001 
Initial values were optimised.
340 parameters were estimated in the process
Residuals standard deviation: 250.546
Cost function type: MSE; Cost function value: 56999

Information criteria:
     AIC     AICc      BIC 
51642.90 51712.02 53756.01 
Forecast errors:
MPE: 1%; Bias: 50%; MAPE: 1.8%; SMAPE: 1.8%
MASE: 0.798; sMAE: 1.8%; RelMAE: 0.078; sMSE: 0.1%

We had to estimate 340 parameters and the model selection took 18 seconds (checking only 5 models in the pool). We ended up with the following graph:

Electricity demand series with ETS(A,N,A) initialised using optimisation and its forecast

The first thing that can be noticed is the initial value of the level, which results in a wrong one-step-ahead forecast for one of the first observations. This is because of the high number of parameters: the optimiser could not find appropriate values. This may not matter much given the number of observations in the data, but it may still influence the final forecast and, more importantly, the model selection. Some of the models could have been initialised slightly better than the others, which could lead to a smaller value of the information criterion for models that should not have been selected.

This example motivates the other initialisation mechanisms.

Backcasting. In order to define the initial value, the model is fitted to the data several times, going forwards and backwards. For example, for ETS(A,N,N) the formula used for the forward pass is:

\begin{equation} \label{eq:ETSANN_Forward}
\begin{matrix}
y_t = l_{t-1} + \epsilon_t \\
l_t = l_{t-1} + \alpha \epsilon_t
\end{matrix}
\end{equation}
while for the backward pass it changes to:
\begin{equation} \label{eq:ETSANN_Backward}
\begin{matrix}
y_t = l_{t+1} + \epsilon_t \\
l_t = l_{t+1} + \alpha \epsilon_t
\end{matrix}
\end{equation}

As you see, the only thing that changes is the lower index of level component. The formula \eqref{eq:ETSANN_Forward} is used for fitting of the model to the data starting from the first observation till the end of series. When we reach the end of series, we use formula \eqref{eq:ETSANN_Backward} and move from the last observation to the very first one. Then we produce forecast back in time before the initial $y_1$ and obtain the initial values for the components. The model is then fit to the data using these initials. This procedure can be repeated several times in order to get more accurate estimates of the initial values. In es() it is done 3 times and can be triggered by initial="backcasting". As mentioned above this method of initialisation is recommended for data with high seasonality frequencies (weekly, daily, hourly etc).
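A minimal sketch of this idea for ETS(A,N,N) in base R is given below. It is an illustration under simplifying assumptions rather than the implementation in es(): the smoothing parameter is fixed instead of being optimised, the first observation serves as the starting level, the data is made up, and backcastANN is a hypothetical name:

```r
# Hypothetical helper: backcasted initial level for ETS(A,N,N)
backcastANN <- function(y, alpha, nIterations = 3) {
  level <- y[1]                      # crude starting value (assumption)
  for (i in seq_len(nIterations)) {
    # Forward pass: l_t = l_{t-1} + alpha * (y_t - l_{t-1})
    for (t in seq_along(y)) {
      level <- level + alpha * (y[t] - level)
    }
    # Backward pass: l_t = l_{t+1} + alpha * (y_t - l_{t+1})
    for (t in rev(seq_along(y))) {
      level <- level + alpha * (y[t] - level)
    }
  }
  # The forecast of ETS(A,N,N) back in time is flat,
  # so the last level is the initial value before y_1
  level
}

y <- c(12, 15, 14, 16, 19, 18, 21, 20, 23, 25)   # made-up data
backcastANN(y, alpha = 0.3)
```

Because each update is a weighted average of the previous level and an observation, the backcasted initial value always stays within the range of the data when alpha is between zero and one.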

As an example, we will use the same time series from forecast package:

es(taylor,"ZZZ",h=336,holdout=TRUE,initial="b")

This time the whole process takes around 7 seconds:

Forming the pool of models based on... ANN, ANA, ANM, AAA, Estimation progress: 100%... Done! 
Time elapsed: 6.81 seconds
Model estimated: ETS(MNA)
Persistence vector g:
alpha gamma 
    1     0 
Initial values were produced using backcasting.
3 parameters were estimated in the process
Residuals standard deviation: 0.007
Cost function type: MSE; Cost function value: 38238

Information criteria:
     AIC     AICc      BIC 
49493.46 49493.47 49512.11 
Forecast errors:
MPE: 0.8%; Bias: 40.6%; MAPE: 1.7%; SMAPE: 1.8%
MASE: 0.784; sMAE: 1.7%; RelMAE: 0.076; sMSE: 0.1%

The function has checked the same pool of models and selected ETS(M,N,A) as the optimal model (estimating only 3 parameters). We ended up with a peculiar model, where smoothing parameter for the level is equal to one (meaning that we have random walk in the level) and the other parameter equal to zero (meaning that we have deterministic seasonality).

The graph of the model now looks reasonable:

Electricity demand series with ETS(M,N,A) initialised using backcasting and its forecast

As can be seen from the figure, backcasting is not a bad technique and it can be useful when we have high-frequency data. Furthermore, there is a proof that backcasting asymptotically gives the same estimates as least squares, meaning that both optimal and backcasted estimates of initial values should eventually converge to the same values.

Arbitrary values. If for some reason we know the initial values (either from previous data or from similar data), then we can provide them to es(). In this case we may provide two parameters: initial and initialSeason. The function will then use the provided values and fit the model. We can provide both of them or just one of them, meaning that the other will be estimated during the optimisation. Obviously, if we deal with a non-seasonal model, we don’t need initialSeason at all. A thing to note is that we cannot use backcasting when we define parameters manually. The other important thing to note is that model selection and combinations do not work with predefined initial values, so this method of initialisation is not available when we need to select the best model or combine several forecasts using es().

Continuing our example, we will use classical decomposition and construct the ETS(M,N,M) model:

ourFigure <- decompose(taylor,type="m")$figure
es(taylor,"MNM",h=336,holdout=TRUE,initial=mean(taylor),initialSeason=ourFigure)

Time elapsed: 3.97 seconds
Model estimated: ETS(MNM)
Persistence vector g:
alpha gamma 
    1     0 
Initial values were provided by user.
340 parameters were estimated in the process
Residuals standard deviation: 0.007
Cost function type: MSE; Cost function value: 37783

Information criteria:
     AIC     AICc      BIC 
50123.24 50192.36 52236.34 
Forecast errors:
MPE: 0.7%; Bias: 35%; MAPE: 1.6%; SMAPE: 1.7%
MASE: 0.738; sMAE: 1.6%; RelMAE: 0.072; sMSE: 0.1%

We can even compare results of this call with other initialisation methods:

es(taylor,"MNM",h=336,holdout=TRUE,initial="o")

Time elapsed: 13.13 seconds
Model estimated: ETS(MNM)
Persistence vector g:
alpha gamma 
0.919 0.000 
Initial values were optimised.
340 parameters were estimated in the process
Residuals standard deviation: 0.009
Cost function type: MSE; Cost function value: 56381

Information criteria:
     AIC     AICc      BIC 
51602.59 51671.70 53715.69 
Forecast errors:
MPE: 0.9%; Bias: 40.6%; MAPE: 1.8%; SMAPE: 1.8%
MASE: 0.791; sMAE: 1.7%; RelMAE: 0.077; sMSE: 0.1%

es(taylor,"MNM",h=336,holdout=TRUE,initial="b")

Time elapsed: 6.52 seconds
Model estimated: ETS(MNM)
Persistence vector g:
alpha gamma 
    1     0 
Initial values were produced using backcasting.
3 parameters were estimated in the process
Residuals standard deviation: 0.007
Cost function type: MSE; Cost function value: 37272

Information criteria:
     AIC     AICc      BIC 
49398.88 49398.89 49417.52 
Forecast errors:
MPE: 0.8%; Bias: 33.8%; MAPE: 1.7%; SMAPE: 1.8%
MASE: 0.786; sMAE: 1.7%; RelMAE: 0.076; sMSE: 0.1%

Estimation in this case takes approximately four seconds (on my PC), while for initial="o" it takes around 13 seconds and for initial="b" - near seven seconds. The resulting models in these cases look very similar, with the first model producing slightly more accurate forecasts.

This method of initialisation may be used when the other two methods for some reason do not work as expected (for example, take too much computational time) and/or when we know the values from some reliable source (for example, from a model previously fitted to the same or similar data). It can also be used for fun and experiments with ETS. In all the other cases I would not recommend using it.

There are other exciting parameters in es() that can be controlled. They allow switching between optimised and predefined values. For example, the parameter persistence accepts a vector of smoothing parameters, the length of which should correspond to the number of components in the model, while the parameter phi defines the damping parameter. So, for example, ETS(A,Ad,N) fitted to the time series N1234 from M3 can be constructed using:

es(M3$N1234$x,"AAdN",h=8,persistence=c(0.2,0.1),phi=0.95)

Compare the resulting graph:

Series N1234 and ETS(A,Ad,N) fit to it with predefined parameters

with the one from optimised model:

es(M3$N1234$x,"AAdN",h=8)

Series N1234 and ETS(A,Ad,N) fit to it with optimised parameters

I cannot say which of them is better in terms of accuracy, but if you really need to define those parameters manually (for example, when applying one and the same model to a large set of time series), then you can easily do it, and now you know how.

One other cool thing about es() is that it saves all the values discussed above and returns them as a list. So you can save a model and then reuse the parameters. For example, let’s select the best model and save it:

ourModel <- es(M3$N1234$x,"ZZZ",h=8,holdout=TRUE)

Which gives us:

Series N1234 and ETS(M,A,N) fit to it with optimised parameters

Now we will use a small but neat function called modelType(), which extracts type of model, and use exactly the same model with exactly the same parameters but with a larger sample:

es(M3$N1234$x,modelType(ourModel),h=8,holdout=FALSE,initial=ourModel$initial,persistence=ourModel$persistence,phi=ourModel$phi)

This now results in the same model but with the updated states, taking into account the last eight observations:

Series N1234 and the same ETS(M,A,N) fit to a larger sample

The other way to do exactly the same is just to pass ourModel to es() function this way:

es(M3$N1234$x,model=ourModel,h=8,holdout=FALSE)

This way we can, for example, do rolling origin with fixed parameters.

The function modelType() also works with models estimated using ets() function from forecast package, so you can easily use ets() and then construct the model of the same type using es():

etsModel <- ets(M3$N1234$x)
es(M3$N1234$x,model=modelType(etsModel),h=8,holdout=TRUE)

Not sure why you would need it, but here it is. Enjoy!

That’s it for today. I hope that this post was helpful and now you know what you can do when you don’t have anything to do. See you next time!

The post “smooth” package for R. es() function. Part V. Essential parameters first appeared on Open Forecast.

“smooth” package for R. es() function. Part IV. Model selection and combination of forecasts

Ivan Svetunkov — Tue, 24 Jan 2017 20:54:38 +0000

Mixed models

In the previous posts we have discussed pure additive and pure multiplicative exponential smoothing models. The next logical step would be to discuss mixed models, where some components have additive and the others have multiplicative nature. But we won’t spend much time on them because I personally think that they do not make much sense. Why do I think so? Well, they simply contradict basic modelling logic. For example, the original Holt-Winters method, which has underlying ETS(A,A,M) model, assumes that data may be both positive and negative (from side of additive error and trend), but at the same time does not work when data is non-positive (because multiplicative seasonality cannot be calculated in this case). This causes severe problems in forecasting of data with low values. A simple example is dying out seasonal product, which implies negative trend, low values and some periodic pattern. In this case level and trend components may become negative, which screws the seasonality. Having said that, mixed models work fine when data has high level. But this disadvantage of mixed models should be taken into account when one of them is selected. And this is a reason why I don’t want to spend a separate post on them. This paragraph should suffice.

Theory of model selection and combinations

Now that we have discussed all the possible types of exponential smoothing models, it is time to select the most appropriate one for your data. Fotios Petropoulos and Nikos Kourentzes conducted research on this topic and demonstrated that human beings are very good at selecting appropriate components for each time series. However, when a forecaster faces thousands of products, it is not possible to select components individually. That’s why automatic model selection is needed.

There are many model selection methods and a lot of literature on this topic. I have implemented only one of those methods in es() function. It is based on information criteria calculation and proved to work well (Hyndman et al. 2002, Billah et al. 2006). Any information criterion uses likelihood function, that is why I showed them in the previous posts. These likelihoods depend mainly on error term, so mixed models will have one of two likelihood functions.

es() allows selecting between AIC (Akaike Information Criterion), AICc (Akaike Information Criterion corrected) and BIC (Bayesian Information Criterion, also known as Schwarz IC). The very basic information criterion is AIC. It is calculated for a chosen model using formula:
\begin{equation} \label{eq:AIC}
\text{AIC} = -2 \ell \left(\theta, \hat{\sigma}^2 | Y \right) + 2k,
\end{equation}
where $k$ is number of parameters of the model. Not going too much into details, the model with the smallest AIC is considered to be the closest to the true model. Obviously IC cannot be calculated without model fitting, which implies that a human being needs to form a pool of models, then fit each of them to the data, calculate an information criterion for each of them and after that select the one model that has the lowest IC value. There are 30 ETS models, so this procedure may take some time. Or even too much time, if we deal with large samples. So what can be done in order to increase the speed?
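Of the three criteria, only the formula for AIC is shown above. The corrected version is usually computed from AIC via the standard small-sample correction, $\text{AICc} = \text{AIC} + \frac{2k(k+1)}{n-k-1}$, while BIC replaces the penalty $2k$ with $k \log n$, where $n$ is the number of in-sample observations. A short base R sketch (icFromLogLik is a made-up name, and es() may compute the criteria differently internally):

```r
# Hypothetical helper: information criteria from a log-likelihood value,
# number of parameters k and number of in-sample observations n
icFromLogLik <- function(logLik, k, n) {
  aic <- -2 * logLik + 2 * k
  c(AIC = aic,
    AICc = aic + 2 * k * (k + 1) / (n - k - 1),
    BIC = -2 * logLik + k * log(n))
}

# Log-likelihood backed out from AIC = 211.1391 with k = 5;
# n = 14 is inferred from the reported values, not stated in the posts
icFromLogLik(logLik = -100.56955, k = 5, n = 14)
```

With these inputs the function gives approximately 211.1391, 218.6391 and 214.3344, which agrees with the criteria printed for ETS(A,A,N) with usual bounds on series N41 in the Part VI post.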

I have decided to solve this problem using logic and decrease pool of models in several steps. Here’s what happens in es():

1. Simple ETS(A,N,N) is fitted to the data.
2. ETS(A,N,A) is fitted to the data.

If IC of (2) is lower than IC of (1), then there is some type of seasonality in the data. This means that we can exclude non-seasonal models. This decreases pool of models from 30 to 20. After that we go to step (3).

Otherwise there is no seasonality in the data, which means that the pool of models decreases from 30 to 10. We go to step (4).

These two steps need some explanation. We do not discuss trend at this point because even if data is trended, then ETS(A,N,N) will be a fine approximator for it (but a very poor forecaster, which is irrelevant at this point). The smoothing parameter in this case will obviously be very high (can be very close to 1), but this is not really important when we want to see if there is seasonality in the data. ETS(A,N,A) will perform better than ETS(A,N,N) on series with trend and seasonality. Similar argument holds even when data has multiplicative seasonality. ETS(A,N,A) in this case will have lower IC than ETS(A,N,N), because it will always fit seasonal data better.

3. Fit ETS(A,N,M). Compare it with (2).

If IC of (3) is lower than (2), then seasonality has multiplicative type. This reduces pool of models from 20 to 10. Go to step (4).

Otherwise, multiplicative seasonality does not contribute in the model, we can stick with additive, decreasing pool of models to 10. Go to step (4).

The logic here is similar to the previous step. If there is trend with multiplicative seasonality in the data, then level part of the model will capture the increase of level, while the model with multiplicative seasonality will approximate data better than the one with additive.

4. Fit a model with additive trend and preselected type of seasonality. Depending on steps (2) and (3) this can either be ETS(A,A,N), ETS(A,A,A) or ETS(A,A,M).

If IC of this step model is lower than IC of model selected on previous step, then there is a trend in data. This decreases pool of models from 10 to 8.

Otherwise there is no trend, meaning that we deal either with a model with additive error or with one with multiplicative error. The pool of models decreases from 10 to 1. For example, if we compared ETS(A,A,M) and ETS(A,N,M), and found that the latter is better, then there is only one model left to fit and compare with the others: ETS(M,N,M).

Additive trend on the last step is a good approximation for all trend types. If we find that it contributes towards better fit, then we can investigate, what type of trend is needed. Also, because we have already identified seasonality type, it won’t make any distortions on this step.

Using this model selection algorithm allows fitting from 5 to 12 models instead of 30. The experiments that I have conducted showed that this way we usually end up with a model with the lowest IC. Keep in mind that this does not necessarily mean that the chosen model will produce the most accurate forecasts. Here we only care about the fit and closeness to the “true model”. Which brings us to another idea. We know from forecasting theory that combinations of forecasts are on average more accurate than individual models. So why not use them for ETS?
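The branching in steps (1) to (4) can be sketched as a small base R function. It is only a schematic of the logic described above, not the code of es(): reducePool is a made-up name, icOf is a hypothetical user-supplied function that returns the information criterion for a model name, and the IC values in the usage example are invented:

```r
# Hypothetical helper: which models are checked and what pool remains
reducePool <- function(icOf) {
  # Steps (1) and (2): is there seasonality?
  checked <- c("ANN", "ANA")
  icANN <- icOf("ANN")
  icANA <- icOf("ANA")
  season <- "N"
  icBest <- icANN
  if (icANA < icANN) {
    # Step (3): which type of seasonality?
    checked <- c(checked, "ANM")
    icANM <- icOf("ANM")
    if (icANM < icANA) {
      season <- "M"
      icBest <- icANM
    } else {
      season <- "A"
      icBest <- icANA
    }
  }
  # Step (4): is there a trend?
  trendModel <- paste0("AA", season)
  checked <- c(checked, trendModel)
  trends <- if (icOf(trendModel) < icBest) c("A", "Ad", "M", "Md") else "N"
  # Remaining pool: both error types with the identified components
  pool <- paste0(as.vector(outer(c("A", "M"), trends, paste0)), season)
  list(checked = checked, pool = pool)
}

# Invented IC values for a series with trend and multiplicative seasonality
fakeIC <- c(ANN = 100, ANA = 90, ANM = 80, AAM = 70)
reducePool(function(model) fakeIC[[model]])
```

In this invented case the function checks ANN, ANA, ANM and AAM and is left with a pool of eight trended models with multiplicative seasonality, some of which have already been fitted in the process.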

One of the simplest and pretty efficient methods would be to produce forecasts from each model in the pool, then combine them and return the final combined forecast. The combination itself can be done with either equal or unequal weights. The former does not make much sense when we deal with data with specific components (for example, why would ETS(A,N,N) have the same weight as ETS(A,N,A) on seasonal data?). So we need to use some weights. There are many methods of distributing weights, but we will use the one discussed in Burnham and Anderson (2002), which is based on information criteria, so we do not create anything artificial and just use what the ETS models implemented in es() already have. Weights for each produced forecast are calculated using the following formula:
\begin{equation} \label{eq:statLikelihoodAICweights}
w_j = \frac{ \exp \left(-\frac{1}{2} \left(\text{IC}_j -\min(\text{IC}) \right) \right)}{\sum_{i=1}^m \exp \left(-\frac{1}{2} \left(\text{IC}_i -\min(\text{IC}) \right) \right)},
\end{equation}
where $m$ is number of models in the pool, IC$_j$ is information criterion value for $j^{th}$ model and $\min(\text{IC})$ is the value of the lowest IC in the pool. Models with lower IC will have higher weights than the ones with high ICs. These weights are then used for combination of forecasts, prediction intervals and fitted values. Any information criterion among discussed above can be used instead of IC.
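The formula translates into base R almost literally. The function name icWeights is made up for this sketch; the three AICc values are the ones printed in the outputs later in this post for models fitted to N2568:

```r
# Hypothetical helper: weights of models based on information criteria
icWeights <- function(ic) {
  # Numerator of the formula: exp(-0.5 * (IC_j - min(IC)))
  relativeLikelihood <- exp(-0.5 * (ic - min(ic)))
  # Normalisation, so that the weights add up to one
  relativeLikelihood / sum(relativeLikelihood)
}

# AICc values reported in this post for three models on N2568
aicc <- c(MMdM = 1771.907, MAdM = 1771.979, ANA = 1837.284)
round(icWeights(aicc), 3)
```

The two models with nearly identical AICc share the weight almost equally (roughly 0.51 and 0.49), while ETS(A,N,A), whose AICc is higher by about 65, gets a weight that is practically zero.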

By the way, Stephan Kolassa (2011) showed that using this method of forecasts combination increases accuracy of ETS models. This method protects forecaster from a random error, where a wrong model would be chosen for some unknown reasons. The pool of models is determined by a forecaster and may either include all the 30 ETS models or a sub-sample of them.

Practice of model selection and combinations

Now that we have discussed basics of model selection and combinations, we can try it out in es() function. There are several ways of making this selection. Let’s start from the very basic one.

If we are not sure what to do and whether we need to restrict our models, we may use the default selection mechanism, which is triggered by setting all the components of the model parameter equal to “Z”. This is taken from Rob Hyndman’s ets() function, which uses the same notation. If we ask for model="ZZZ", then es() will use the model selection algorithm described in the previous section.

For example, for a time series N2568 from M3 we will have:

es(M3$N2568$x, "ZZZ", h=18)

Which results in:

Forming the pool of models based on... ANN, ANA, ANM, AAM, Estimation progress: 100%... Done! 
Time elapsed: 2.6 seconds
Model estimated: ETS(MMdM)
Persistence vector g:
alpha  beta gamma 
0.020 0.020 0.001 
Damping parameter: 0.965
Initial values were optimised.
19 parameters were estimated in the process
Residuals standard deviation: 0.065
Cost function type: MSE; Cost function value: 169626

Information criteria:
     AIC     AICc      BIC 
1763.990 1771.907 1816.309

ETS(MMdM) on N2568 series from M3

During the model selection the function prints out the progress and tells us what’s happening: first it tries “ANN”, then “ANA”, then “ANM” (which means that the data is seasonal and it tries to define the type of seasonality) and finally “AAM” (checking the trend). If we do not need any output we can ask our function to shut up by specifying the parameter silent="all". As we see, the optimal model for this data is ETS(M,Md,M).

Note that the mentioned model selection algorithm is only used when we specify “Z” for at least one component and “Z”, “A”, “M” or “N” for the others. Setting some components equal to “A”, “M” or “N” will speed up the selection process, but the function will still use the selection algorithm discussed above.

By default the selection is done using AICc, but AIC and BIC are also available and can be selected with ic parameter:

es(M3$N2568$x, "ZZM", h=18, ic="BIC")

Now let’s assume that for some reason we are only interested in pure additive models. What should we do in this case? How can we select the best model from such a pool of 6 models? es() function allows us to do that via “X” component:

es(M3$N2568$x, "XXX", h=18)

This gives us:

Estimation progress: 100%... Done! 
Time elapsed: 0.72 seconds
Model estimated: ETS(ANA)
Persistence vector g:
alpha gamma 
0.174 0.695 
Initial values were optimised.
16 parameters were estimated in the process
Residuals standard deviation: 609.711
Cost function type: MSE; Cost function value: 320472

Information criteria:
     AIC     AICc      BIC 
1831.789 1837.284 1875.847

ETS(ANA) on N2568 series from M3

In this case the model selection algorithm described in the previous section is not used. We just check 6 models and select the best of them. The pool of models in this case includes: “ANN”, “AAN”, “AAdN”, “ANA”, “AAA” and “AAdA”. For this data the best model in this pool is ETS(A,N,A).

In a similar manner we can ask the function to select between multiplicative models only. This is regulated with “Y” value for components:

es(M3$N2568$x, "YYY", h=18)

Similar to “XXX” we go through 6 models and select the best one. In this case we will end up with exactly the same model as in “ZZZ” case, because the optimal model for this time series is ETS(M,Md,M). “YYY” can be especially useful, when we deal with low level values.

Now that we know these features, we can combine different parameters and form the pools that we want. For example, we can select the best model between “MNN”, “MAN”, “MAdN”, “MNM”, “MAM” and “MAdM” this way:

es(M3$N2568$x, "YXY", h=18)

Which results in:

Estimation progress: 100%... Done! 
Time elapsed: 0.92 seconds
Model estimated: ETS(MAdM)
Persistence vector g:
alpha  beta gamma 
0.021 0.021 0.001 
Damping parameter: 0.976
Initial values were optimised.
19 parameters were estimated in the process
Residuals standard deviation: 0.065
Cost function type: MSE; Cost function value: 169730

Information criteria:
     AIC     AICc      BIC 
1764.062 1771.979 1816.380

Note that the information criteria for ETS(M,Ad,M) are very close to the optimal ETS(M,Md,M). This raises the question which of the models to prefer if the difference between them is negligible…

We can also pre-specify some components if we really want something special. For example, anything with multiplicative seasonality:

es(M3$N2568$x, "ZZM", h=18)

Finally, there is another way to select the best model from some pre-specified pool. Let’s say that we have a favourite set of ETS models that we want to check, which includes: “ANN”, “MNN”, “AAdN”, “AAdM” and “MMdM”. This cannot be specified via “X”, “Y” and “Z” parameters, but it can be provided to es() as a vector of model names:

es(M3$N2568$x, c("ANN", "MNN", "AAdN", "AAdM", "MMdM"), h=18)

In this case the function will go through all the provided models and select the one with the lowest IC (which is ETS(M,Md,M) in this case).

Now let’s have a look at combinations. This is specified with components equal to “C”. Once again we can set a pool of models via “Z”, “X”, “Y”, “N”, “A” and “M” and combine their forecasts. In order to produce combinations we need to specify at least one component as “C”. In a very simple case we have (let’s also ask for prediction intervals in order to see how this thing works):

es(M3$N2568$x, "CCC", h=18, intervals=TRUE)

With the following output:

Estimation progress: 100%... Done!
Time elapsed: 10.04 seconds
Model estimated: ETS(CCC)
Initial values were optimised.
Residuals standard deviation: 438.242
Cost function type: MSE

Information criteria:
Combined AICc 
     1772.198 
95% parametric prediction intervals were constructed

and graph:

Combined ETS on N2568 series from M3

Because we have set “C” for all the components, we have combined point forecasts and prediction intervals for all the 30 models, which took almost 10 seconds. But we could have asked something stricter, if we knew that some of the components need to be of a specific type. For example, additive seasonality does not make much sense for our time series, so we can ask for:

es(M3$N2568$x, "CCM", h=18)

which will use only 10 models and as a result run approximately 3 times faster, with approximately the same result for our time series, which exhibits obvious multiplicative seasonality. The result is similar because the models with multiplicative seasonality have higher weights in the combination than the other models.
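As a rough illustration of why models with the wrong type of seasonality barely matter in the combination, here is a minimal sketch of IC-based weights. It assumes that the weights are the Akaike-type weights from the Burnham and Anderson book, calculated from AICc; ic_weights() is a hypothetical helper written for this illustration, not a function from smooth, and the two AICc values are the ones reported above for ETS(A,N,A) and ETS(M,Ad,M).

```r
# Hypothetical helper (not part of smooth): turns information criteria into weights.
# Assumption: weights are proportional to exp(-0.5 * (IC - min(IC))).
ic_weights <- function(ic) {
  delta <- ic - min(ic)     # distance of each model from the best one
  w <- exp(-0.5 * delta)    # relative likelihood of each model
  w / sum(w)                # normalise, so that the weights sum up to one
}

# AICc values taken from the two outputs shown earlier in this post
aicc <- c(ANA = 1837.284, MAdM = 1771.979)
w <- ic_weights(aicc)
print(w)
```

With a gap of about 65 AICc points the additive seasonal model gets a weight that is practically zero, so dropping such models from the pool should not change the combined forecast much.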

We can also use “X” and “Y”, which in this case will ask our function to use additive and multiplicative models in combinations respectively. For example, this thing:

es(M3$N2568$x, "CXY", h=18)

will combine forecasts from the following 12 models: “ANN”, “AAN”, “AAdN”, “ANM”, “AAM”, “AAdM”, “MNN”, “MAN”, “MAdN”, “MNM”, “MAM”, “MAdM”.
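The pools implied by “X”, “Y” and “C” can be listed with a few lines of base R. This is only a sketch of the logic described in this post: expand_pool() is a hypothetical helper, not something exported by smooth. It deliberately does not handle “Z”, because selection with “Z” relies on the step-by-step algorithm rather than on checking a fixed pool.

```r
# Hypothetical helper (not part of smooth): lists the ETS models implied by a template.
expand_pool <- function(template) {
  n <- nchar(template)
  error  <- substr(template, 1, 1)
  trend  <- substr(template, 2, n - 1)   # may be two characters, e.g. "Ad"
  season <- substr(template, n, n)

  # "X" = additive options, "Y" = multiplicative options, "C" = everything;
  # any other value is treated as a fixed component
  error_opts  <- switch(error, X = "A", Y = "M", C = c("A", "M"), error)
  trend_opts  <- switch(trend,
                        X = c("N", "A", "Ad"),
                        Y = c("N", "M", "Md"),
                        C = c("N", "A", "Ad", "M", "Md"),
                        trend)
  season_opts <- switch(season, X = c("N", "A"), Y = c("N", "M"),
                        C = c("N", "A", "M"), season)

  grid <- expand.grid(trend = trend_opts, season = season_opts,
                      error = error_opts, stringsAsFactors = FALSE)
  paste0(grid$error, grid$trend, grid$season)
}

print(expand_pool("CXY"))
print(length(expand_pool("CCC")))
```

The same helper can be used to check the other pools mentioned in this post, for example expand_pool("XXX"), expand_pool("YXY") or expand_pool("CCM").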

Finally, forecasts can be combined from an arbitrary pool of models. In order to do that a user needs to add the “CCC” model to the desired pool of models. Here is how it works:

es(M3$N2568$x, c("CCC","ANN", "MNN", "AAdN", "AAdM", "MMdM"), h=18)

This will combine forecasts of “ANN”, “MNN”, “AAdN”, “AAdM” and “MMdM” ETS models.

As we see, the model selection and combinations mechanism implemented in es() is very flexible and allows us to do a lot of cool things with exponential smoothing.

That’s it for today. I hope that this post was as exciting for you as previous ones about es() function :). We will continue with a more detailed explanation of es() function parameters.

P.S. about the pools of models

Although there are 30 types of ETS models out there, there have been some arguments that not all of them make sense. For example, it is weird to have a model with multiplicative seasonality and additive error – a more natural model would have the error and the seasonality of the same type. In addition, a lot of mixed models are difficult to work with – they break easily if the fitted values get close to zero. So, taking all of this into account, Rob Hyndman has restricted his ets() function from the forecast package to the following pool of 19 models (if the parameter allow.multiplicative.trend=TRUE): (A,N,N), (A,A,N), (A,Ad,N), (A,N,A), (A,A,A), (A,Ad,A), (M,N,N), (M,A,N), (M,Ad,N), (M,M,N), (M,Md,N), (M,N,M), (M,A,M), (M,Ad,M), (M,M,M), (M,Md,M), (M,N,A), (M,A,A), (M,Ad,A). In my opinion, the last three models should be removed from the pool for consistency, and all the mixed models should only be used when the level of the series is big enough, ensuring that we will not get to the negative plane. Still, if anyone wants to select between these models, it can easily be done in es() by providing them as a pool of models:

es(M3$N2568$x, c("ANN", "AAN", "AAdN", "ANA", "AAA", "AAdA", "MNN", "MAN", "MAdN", "MMN", "MMdN", "MNM", "MAM", "MAdM", "MMM", "MMdM", "MNA", "MAA", "MAdA"), h=18)

Message “smooth” package for R. es() function. Part IV. Model selection and combination of forecasts first appeared on Open Forecast.

“smooth” package for R. es() function. Part III. Multiplicative models

Ivan Svetunkov — Fri, 18 Nov 2016 13:17:19 +0000

Theoretical stuff

Last time we talked about pure additive models, today I want to discuss multiplicative ones.

There is a general scepticism about pure multiplicative exponential smoothing models in the forecasting community, because it is not clear why level, trend, seasonality and error term should be multiplied. Well, when it comes to seasonality, there is no doubt – the multiplicative one is met more often than the additive one and thus is more often used in practice. However, multiplicative trend and multiplicative error are not as straightforward, because it is not easy to understand why we need them in the first place. In addition, models with these multiplicative components are harder to implement, harder to work with and their forecast accuracy is not necessarily better than the accuracy of other models.

So why bother at all? There is at least one reason. Pure multiplicative models are constructed with the assumption that the data we work with is positive only. And this is a plausible assumption when we work with demand for products, because selling -50 boots in March 2017 or something like that does not make much sense. However, when we work with high-level data (for example, hundreds or even thousands of units), this advantage becomes negligible, so pure additive or mixed models can be used instead. This, in fact, is the reason why pure multiplicative models are neglected.

So, what’s the catch with these models? Let’s have a look.

The general pure multiplicative model can be written in the following compact form, if we use natural logarithms:
\begin{equation} \label{eq:ssGeneralMultiplicative}
\begin{matrix}
y_t = \exp \left(w' \log(v_{t-l}) + \log(1+\epsilon_t) \right) \\
\log(v_t) = F \log(v_{t-l}) + \log (1 + g \epsilon_t)
\end{matrix} ,
\end{equation}
where all the notations have already been introduced in the previous post. The only thing to note is that both $\exp$ and $\log$ are applied element-wise to vectors. This means that $\log(v_t)$ results in a vector consisting of the logarithms of the components. The important thing to note here is that all the components of this model must be positive, otherwise it won’t work.

An example of a multiplicative model written in this form is ETS(M,M,N), the model with multiplicative error and multiplicative trend:
\begin{equation} \label{eq:ssETS(M,M,N)}
\begin{matrix}
y_t = \exp \left(\log(l_{t-1}) + \log(b_{t-1}) + \log(1 + \epsilon_t) \right) \\
\log(l_t) = \log(l_{t-1}) + \log(b_{t-1}) + \log (1 + \alpha \epsilon_t) \\
\log(b_t) = \log(b_{t-1}) + \log (1 + \beta \epsilon_t)
\end{matrix} .
\end{equation}
Taking the exponent of the second and third equations and simplifying the first one in \eqref{eq:ssETS(M,M,N)} leads to the conventional model that underlies Pegels’ method:
\begin{equation} \label{eq:ssETS(M,M,N)_Pegels}
\begin{matrix}
y_t = l_{t-1} b_{t-1} (1 + \epsilon_t) \\
l_t = l_{t-1} b_{t-1} (1 + \alpha \epsilon_t) \\
b_t = b_{t-1} (1 + \beta \epsilon_t)
\end{matrix} .
\end{equation}
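The equivalence of the two forms is easy to check numerically. The following is a minimal sketch in base R rather than the code used inside es(): the smoothing parameters, the initial level and trend, and the standard deviation are arbitrary values picked for illustration, and the error term is generated from a log-normal distribution so that $1 + \epsilon_t$ is always positive.

```r
set.seed(42)
n_obs <- 20
alpha <- 0.3; beta <- 0.1; sigma <- 0.05   # arbitrary illustrative values

# 1 + epsilon_t is log-normal, so epsilon_t is always greater than -1
eps <- rlnorm(n_obs, meanlog = 0, sdlog = sigma) - 1

l <- 100; b <- 1.02                        # arbitrary initial level and trend
y_log <- y_prod <- numeric(n_obs)
for (t in seq_len(n_obs)) {
  # Measurement equation in the logarithmic and in the conventional form
  y_log[t]  <- exp(log(l) + log(b) + log(1 + eps[t]))
  y_prod[t] <- l * b * (1 + eps[t])
  # Transition equations (conventional form)
  l_next <- l * b * (1 + alpha * eps[t])
  b <- b * (1 + beta * eps[t])
  l <- l_next
}

print(all.equal(y_log, y_prod))  # both forms generate the same series
print(all(y_prod > 0))           # and the generated values stay positive
```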
Multiplication of level, trend and error term restricts the actuals to positive values, but this will only be true if we impose correct assumptions on the error term. While Hyndman et al. 2008 assume that it is distributed normally (they have actually discussed other assumptions in Chapter 15), we assume in es() that it is distributed log-normally:
\begin{equation} \label{eq:ssErrorlogN}
(1 + \epsilon_t) \sim \text{log}\mathcal{N}(0,\sigma^2),
\end{equation}
where $\sigma^2$ is the variance of the logarithm of $1 + \epsilon_t$. Why the log-normal distribution? Let me explain. When the variance $\sigma^2$ is small, the differences between es() and ets() models with multiplicative error are almost non-existent, because in this case the normal and log-normal distributions look very similar. However, with larger variance (and by larger I mean something greater than 0.1) the differences become more substantial, because the log-normal distribution becomes positively skewed with a longer tail. In addition, with a higher variance and the normality assumption the chances of having $1 + \epsilon_t \leq 0$ increase, which contradicts the original model. So the assumption of normality is not always realistic and may lead to problems in model construction. That is why we need the assumption \eqref{eq:ssErrorlogN}.
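Both effects can be seen with a few lines of base R. This is just an illustrative sketch: it compares $1 + \epsilon_t \sim \mathcal{N}(1, \sigma^2)$ with $1 + \epsilon_t \sim \text{log}\mathcal{N}(0, \sigma^2)$ for three standard deviations, where 0.025 and 0.413 correspond to the two examples in the R part of this post and 0.1 is an arbitrary value in between.

```r
sigma <- c(0.025, 0.1, 0.413)

# Probability of a non-positive 1 + epsilon_t under the normality assumption
p_nonpositive <- pnorm(0, mean = 1, sd = sigma)

# 95% intervals for 1 + epsilon_t under the two assumptions
normal_lower    <- qnorm(0.025, mean = 1, sd = sigma)
normal_upper    <- qnorm(0.975, mean = 1, sd = sigma)
lognormal_lower <- qlnorm(0.025, meanlog = 0, sdlog = sigma)
lognormal_upper <- qlnorm(0.975, meanlog = 0, sdlog = sigma)

print(data.frame(sigma, p_nonpositive,
                 normal_lower, normal_upper,
                 lognormal_lower, lognormal_upper))
```

The normal interval is always symmetric around one, while the log-normal one has a longer upper tail and never goes below zero; the difference between the two grows with the standard deviation.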

Now \eqref{eq:ssErrorlogN} influences several aspects of how the model works: point forecasts, prediction intervals and estimation of the model \eqref{eq:ssGeneralMultiplicative} will be slightly different.

First of all, with low actual values point forecasts correspond to median rather than mean. This is because of relations between mean in normal and log-normal distributions and the fact that variance of error term in these cases may become sufficiently large for them to differ substantially. Arguably, median is what we need in cases of skewed distributions and that’s what es() produces. Knowing that is especially important in cases with intermittent demand models, which will be discussed in this blog at some distant point in future.
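To make the difference between the two concrete, here are the standard formulae for the median and the mean of the log-normal distribution applied to the assumption \eqref{eq:ssErrorlogN}:
\begin{equation}
\text{Md}(1 + \epsilon_t) = \exp(0) = 1, \qquad \text{E}(1 + \epsilon_t) = \exp \left( \frac{\sigma^2}{2} \right) .
\end{equation}
So in the one-step-ahead case the conditional mean is $\exp(\sigma^2 / 2)$ times larger than the conditional median. For $\sigma^2 = 0.17$ this factor is roughly 1.09, while for $\sigma^2 = 0.000625$ it is about 1.0003 (both variances are taken from the examples in the R part of this post), which is why the distinction only matters when the variance is large.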

Secondly, due to asymmetry of log-normal distribution prediction intervals become asymmetric. This will especially be evident, when scale of data is low and variance is high. In the other situations the intervals will be very close to intervals produced by additive error models with normally distributed errors.

Finally, the estimation of models with multiplicative errors should be based on the following concentrated log-likelihood function (for log-normal distribution):
\begin{equation} \label{eq:ssConcentratedLogLikelihoodLnorm}
\ell(\theta | Y) = -\frac{T}{2} \left( \log \left( 2 \pi e \right) +\log \left( \hat{\sigma}^2 \right) \right) -\sum_{t=1}^T \log y_t ,
\end{equation}
where $\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T \log^2(1 + \epsilon_{t})$. The likelihood function \eqref{eq:ssConcentratedLogLikelihoodLnorm} looks very similar to the one used in additive models, but leads to a slightly different cost function (that we need to minimise):
\begin{equation} \label{eq:ssCostFunction}
\text{CF} = \log \left( \hat{\sigma}^2 \right) + \frac{2}{T} \sum_{t=1}^T \log y_t .
\end{equation}
To give an interpretation to the cost function \eqref{eq:ssCostFunction}: the variance of the error term here is weighted with the actual values. So, roughly speaking, higher actuals will be predicted with higher precision.
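The connection between the likelihood \eqref{eq:ssConcentratedLogLikelihoodLnorm} and the cost function \eqref{eq:ssCostFunction} can be verified numerically. The sketch below is not the code used in es(): cf_lognormal() and loglik_lognormal() are hypothetical helpers, the actuals and fitted values are made-up numbers, and it is assumed that the fitted values are one-step-ahead ones, so that $1 + \epsilon_t$ is the ratio of the actual to the fitted value.

```r
# Hypothetical helpers (not part of smooth).
# y are the actuals, y_fitted are one-step-ahead fitted values,
# so that 1 + epsilon_t = y_t / y_fitted_t.
cf_lognormal <- function(y, y_fitted) {
  sigma2 <- mean(log(y / y_fitted)^2)
  log(sigma2) + 2 * mean(log(y))
}

loglik_lognormal <- function(y, y_fitted) {
  n_obs <- length(y)
  sigma2 <- mean(log(y / y_fitted)^2)
  -n_obs / 2 * (log(2 * pi * exp(1)) + log(sigma2)) - sum(log(y))
}

# Made-up data, just for illustration
y        <- c(120, 135, 128, 150, 161)
y_fitted <- c(118, 130, 133, 141, 158)

cf <- cf_lognormal(y, y_fitted)
ll <- loglik_lognormal(y, y_fitted)

# The cost function is the log-likelihood multiplied by -2/T, without the constant
print(all.equal(cf, -2 * ll / length(y) - log(2 * pi * exp(1))))
```

Minimising the cost function is therefore equivalent to maximising the concentrated log-likelihood.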

R stuff

A good example of data, where pure multiplicative model can be used is time series N2457 from M3. Although the scale of data is high (the average is around 5000), the variability of data is high as well, which indicates that multiplicative model could be efficiently used here. So, let’s construct the ETS(M,N,N) model with parametric prediction intervals:

es(M3$N2457$x, "MNN", h=18, holdout=TRUE, intervals="p")

This results in the following output:

Time elapsed: 0.1 seconds
Model estimated: ETS(MNN)
Persistence vector g:
alpha 
0.145 
Initial values were optimised.
3 parameters were estimated in the process
Residuals standard deviation: 0.413
Cost function type: MSE; Cost function value: 1288657

Information criteria:
     AIC     AICc      BIC 
1645.978 1646.236 1653.702 
95% parametric prediction intervals were constructed
72% of values are in the prediction interval
Forecast errors:
MPE: 26.3%; Bias: 87%; MAPE: 39.8%; SMAPE: 49.4%
MASE: 2.944; sMAE: 120.1%; RelMAE: 1.258; sMSE: 242.7%

We have already discussed what each of the lines of this output means in the previous post, so we will not dwell on that here. The thing to note, however, is the line with residuals standard deviation, which is equal to 0.413 (meaning that variance is approximately equal to 0.17). With this standard deviation the distribution will be noticeably skewed. This can be clearly seen on the produced graph:

Series N2457 from M3 and es("MNN") forecast with prediction intervals

For the same time series ETS(A,N,N) model will produce the following forecast:

Series N2457 from M3 and es(“ANN”) forecast with prediction intervals

Comparing the two graphs, we can notice that prediction intervals on the first one are more adequate for this data, because they take into account skewness in errors distribution. As a result they cover more observations in the holdout than the intervals of the second model.

A counter example is time series N2348 – it does not have as high variability as N2457, so ETS(M,N,N) and ETS(A,N,N) produce very similar point forecasts and prediction intervals. This is because variance in multiplicative model applied to this data is equal to 0.000625 and as a result the log-normal distribution becomes very close to the normal. See for yourselves:

Series N2348 from M3 and es(“MNN”) forecast with prediction intervals

Series N2348 from M3 and es(“ANN”) forecast with prediction intervals

So in cases of time series similar to N2348 both multiplicative and additive error models can be used equally efficiently. But in general pure multiplicative models should not be neglected as they have useful properties. In one of the next posts we will see that these properties become especially useful for intermittent demand.

That’s all, folks!

Message “smooth” package for R. es() function. Part III. Multiplicative models first appeared on Open Forecast.

“smooth” package for R. es() function. Part II. Pure additive models

Ivan Svetunkov — Wed, 02 Nov 2016 09:27:21 +0000

A bit of statistics

As mentioned in the previous post, all the details of models underlying functions of “smooth” package can be found in the extensive documentation. Here I want to discuss several basic, important aspects of the statistical model underlying es() and how it is implemented in R. Today we will have a look at basic pure additive models. These models do not have multiplicative components in them in any form and are very easy to implement and understand.

es() uses the Single Source of Error State-Space model, described in Hyndman et al. (2008). The advantage of this is that it allows writing any type of exponential smoothing in a compact form and simplifies some statistical derivations. However, the general model underlying es() differs slightly from the conventional one by Hyndman et al. (2008) – the seasonal component is modelled using lags rather than dummy variables. There are some other differences, but we will not go into much detail right now.

In general state-space model underlying es() is written as (for pure additive models):
\begin{equation} \label{eq:ssGeneralAdditive}
\begin{matrix}
y_t = w' v_{t-l} + \epsilon_t \\
v_t = F v_{t-l} + g \epsilon_t
\end{matrix} ,
\end{equation}
where $y_{t}$ is value of series on observation $t$, $v_{t}$ is a state vector (containing components of time series, such as level, trend, seasonality), $w$ and $F$ are predefined measurement vector and transition matrix and $g$ is persistence vector (vector, containing smoothing parameters). Finally $\epsilon_t$ is error term, which for additive model is assumed to be normally distributed.

Using these notations, all the additive exponential smoothing models can be united in one compact form \eqref{eq:ssGeneralAdditive}. For example, Damped trend model, ETS(A,Ad,N), has the following matrices:
\begin{equation} \label{eq:ssAAdNMatrices}
w = \begin{pmatrix}
1 \\ \phi
\end{pmatrix},
F = \begin{pmatrix}
1 & \phi \\
0 & \phi
\end{pmatrix},
g = \begin{pmatrix}
\alpha \\
\beta
\end{pmatrix},
v_t = \begin{pmatrix}
l_t \\
b_t
\end{pmatrix},
v_{t-l} = \begin{pmatrix}
l_{t-1} \\
b_{t-1}
\end{pmatrix},
\end{equation}
and can be represented in the system of the following equations:
\begin{equation} \label{eq:ssAAdN}
\begin{matrix}
y_t = l_{t-1} + \phi b_{t-1} + \epsilon_t \\
l_t = l_{t-1} + \phi b_{t-1} + \alpha \epsilon_t \\
b_t = \phi b_{t-1} + \beta \epsilon_t
\end{matrix} ,
\end{equation}
where $l_t$ is the level component, $b_t$ is the trend component, $\phi$ is the damping parameter, and $\alpha$ and $\beta$ are smoothing parameters.
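To see that the matrix form \eqref{eq:ssGeneralAdditive} with the matrices \eqref{eq:ssAAdNMatrices} and the system \eqref{eq:ssAAdN} describe the same model, we can generate a series with both of them. This is a small base R sketch rather than the code inside es(); the parameters, the initial states and the standard deviation of the error term are arbitrary values picked for illustration.

```r
set.seed(1)
alpha <- 0.6; beta <- 0.2; phi <- 0.95   # arbitrary illustrative values
n_obs <- 10
eps <- rnorm(n_obs, mean = 0, sd = 5)

# Matrix form: measurement vector, transition matrix and persistence vector
w  <- c(1, phi)
Fm <- matrix(c(1, phi,
               0, phi), nrow = 2, byrow = TRUE)
g  <- c(alpha, beta)
v  <- c(100, 2)        # state vector: initial level and trend

# The same initial values for the system of equations
l <- 100; b <- 2

y_matrix <- y_scalar <- numeric(n_obs)
for (t in seq_len(n_obs)) {
  # y_t = w' v_{t-l} + e_t and v_t = F v_{t-l} + g e_t
  y_matrix[t] <- sum(w * v) + eps[t]
  v <- as.vector(Fm %*% v) + g * eps[t]

  # The three equations of ETS(A,Ad,N)
  y_scalar[t] <- l + phi * b + eps[t]
  l_next <- l + phi * b + alpha * eps[t]
  b <- phi * b + beta * eps[t]
  l <- l_next
}

print(all.equal(y_matrix, y_scalar))
print(all.equal(v, c(l, b)))
```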

Non-seasonal models in es() have exactly the same structure as in ets(). However, the differences appear when we deal with seasonal components. The main element that is different in \eqref{eq:ssGeneralAdditive} in comparison with the conventional ETS is index $l$, which indicates that some components of the state vector have different lags. For example, the seasonal component has a lag of $m$ (for example, $m=$12 for monthly data) instead of 1, so the ETS(A,A,A) model has the following lagged state vector:
\begin{equation} \label{eq:ssETS(A,A,A)StateVector}
v_{t-l} =
\begin{pmatrix}
l_{t-1} \\
b_{t-1} \\
s_{t-m}
\end{pmatrix} ,
\end{equation}
where $l_{t-1}$ is lagged level component, $b_{t-1}$ is lagged trend component, $s_{t-m}$ is lagged seasonal component and $m$ is lag of seasonality. Inserting \eqref{eq:ssETS(A,A,A)StateVector} in \eqref{eq:ssGeneralAdditive} and substituting all the other elements with the appropriate values, leads to the following well-known model, which underlies additive Holt-Winters method:
\begin{equation} \label{eq:ssETS(A,A,A)}
\begin{matrix}
y_t = l_{t-1} + b_{t-1} + s_{t-m} + \epsilon_t \\
l_t = l_{t-1} + b_{t-1} + \alpha \epsilon_t \\
b_t = b_{t-1} + \beta \epsilon_t \\
s_t = s_{t-m} + \gamma \epsilon_t
\end{matrix} .
\end{equation}

In case of Hyndman et al. (2008) equation \eqref{eq:ssETS(A,A,A)} would contain $m-$1 more seasonal components, which would not be updated (claiming that their values are just moved to another observation). So the essence of the model would be the same, but it would be larger in size. By introducing the lagged structure of state vector, we decrease dimensions of $v_t, w, F$ and $g$. This simplifies some derivations and also means that normalisation of seasonal components needs to be done differently, than proposed by Hyndman et al. (2008). Point forecasts in this case should be similar to ets(), however prediction intervals and seasonal components could be potentially slightly different than in ets(). Although using lags instead of dummies can be considered as a substantial difference from modelling perspective, it does not change substantially the final forecasts, and all the statistical properties of the model are still there. For example, concentrated log-likelihood for models with additive errors is calculated in exactly the same manner as in the original ETS:
\begin{equation} \label{eq:ssConcentratedLogLikelihoodNorm}
\ell(\theta | Y) = -\frac{T}{2} \left( \log \left( 2 \pi e \right) +\log \left( \hat{\sigma}^2 \right) \right),
\end{equation}
where $\theta$ is a set of parameters used in the model and $T$ is the number of observations. This likelihood can be used for the purpose of estimation of models and selection of the most appropriate one via information criteria. We will discuss these elements in detail in the upcoming posts.

A bit of examples in R

Let’s see some examples of usage of es(). We will use library “Mcomp”, so don’t forget to install it (if you haven’t done so before) and load it using library(Mcomp).

We start by estimating some model on a time series N1234. This time series has an obvious trend, and a safe option in forecasting of this sort of time series is using Damped trend method, which corresponds to ETS(A,Ad,N) model:

es(M3$N1234$x, "AAdN", h=8, intervals=TRUE)

This command produces two things: an output and the following graph:

Series N1234 from M3 and es() forecast

If you don’t need a graph, ask the function not to produce it via silent="graph". If you don’t need an output, then write the “smooth” object into some variable:

ourModel <- es(M3$N1234$x, "AAdN", h=8, intervals=TRUE, silent="graph")

In cases of model selection and combinations (which will be discussed later), you may want for the function to work really silently. You then need to specify silent="all" or silent=TRUE.

So, what do we have in that output? Let's see:

Time elapsed: 0.15 seconds
Model estimated: ETS(AAdN)
Persistence vector g:
alpha  beta
0.623 0.26
Damping parameter: 0.964
Initial values were optimised.
6 parameters were estimated in the process
Residuals standard deviation: 75.206
Cost function type: MSE; Cost function value: 4902

Information criteria:
     AIC     AICc      BIC 
522.0857 524.2962 532.9256 

95% parametric prediction intervals were constructed

The first two lines are self explanatory - all the process took 0.15 seconds and we have constructed damped-trend model.

"Persistence vector g" refers to the vector of smoothing parameters $g$ in \eqref{eq:ssGeneralAdditive}. It consists of two smoothing parameters: $\alpha$, for level of series, and $\beta$, for the trend component. As we see, alpha is pretty high, which indicates that the level component evolves fast in time. Beta is higher than usual, which corresponds to changes of trend in time.

"Damping parameter" shows what is the value of parameter that damps trend. It is close to one, which means that the trend is damped slightly.

The next line tells us how the initial values of state vector $v_0$ were estimated. This time they were optimised. However we could ask our function to do it differently. We may discuss this some other day, some other time.

After that we see the number of parameters estimated in the process. We get 6 because we have: 2 smoothing parameters, 2 initial values of state vector $v_0$, 1 damping parameter and 1 estimated variance of residuals. The latter is needed in order to take the correct number of degrees of freedom into account. And actually it is not fair to calculate the variance but not to take it into account.

Then we see the value of standard deviation of residuals. This is an unbiased estimate, corrected by the number of degrees of freedom. This means that it is calculated using:
\begin{equation} \label{eq:sd_Value}
s = \sqrt{\frac{1}{T-k} \sum_{t=1}^T e_t^2 },
\end{equation}
where $k$ is number of estimated parameters (in our example $k=$6). This value is reported just for the general information. It does not tell us much about the estimated model, although we could potentially compare models with additive error using this value... But not today.

The line about the cost function follows. We see that Mean Squared Error was used in the estimation and the final value is equal to 4902. Not really helpful for anything, just general information.

"Information criteria" line and a table with values below it tell what they say - these are Akaike Information Criterion, its corrected version and Bayesian Information Criterion. These can be compared across different models applied to one and the same sample of data.

Finally, the function tells us that it has produced 95% parametric prediction intervals.
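The information criteria in this output can be approximately reproduced by hand from the concentrated log-likelihood \eqref{eq:ssConcentratedLogLikelihoodNorm}. The sketch below uses the standard textbook formulae for AIC, AICc and BIC, which may differ in details from the code inside es(). Note that the number of observations, 45, is not stated anywhere in the post: it is an assumption inferred from the difference between the reported AICc and AIC. The MSE is taken from the output, where it is rounded.

```r
n_obs <- 45    # assumption: inferred from the reported AICc - AIC, not stated in the post
k     <- 6     # number of estimated parameters from the output
mse   <- 4902  # cost function value from the output (rounded)

# Concentrated log-likelihood for the model with additive normal error
loglik <- -n_obs / 2 * (log(2 * pi * exp(1)) + log(mse))

aic  <- 2 * k - 2 * loglik
aicc <- aic + 2 * k * (k + 1) / (n_obs - k - 1)
bic  <- k * log(n_obs) - 2 * loglik

print(round(c(AIC = aic, AICc = aicc, BIC = bic), 2))
```

Because the MSE is rounded in the output, the values can only be expected to agree with the reported table up to roughly two decimal places.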

All of this in a way hints at what to expect in ourModel, when we decide to extract some values. Note that ourModel is a list of variables, so the values can be extracted as usual. For example, model name is saved in ourModel$model, while forecasts are stored in ourModel$forecast. The detailed description of returned values is given in the help for function in R.

Now let's make things more interesting and see how the same model performs in the holdout. We will use parameter holdout for this purpose:

y <- ts(c(M3$N1234$x,M3$N1234$xx),start=start(M3$N1234$x),frequency=frequency(M3$N1234$x))
es(y, "AAdN", h=8, holdout=TRUE, intervals=TRUE)

The resulting graph looks similar to the previous one with a small difference - we now have actual values in the holdout:

Series N1234 from M3, es() forecast and holdout

As it can be seen from the graph we didn't manage to produce forecasts close to the actual values, but at least prediction intervals cover them. The output gives us more information about that:

Time elapsed: 0.18 seconds
Model estimated: ETS(AAdN)
Persistence vector g:
alpha  beta 
0.623 0.260 
Damping parameter: 0.964
Initial values were optimised.
6 parameters were estimated in the process
Residuals standard deviation: 75.206
Cost function type: MSE; Cost function value: 4902

Information criteria:
     AIC     AICc      BIC 
522.0857 524.2962 532.9256 

95% parametric prediction intervals were constructed
88% of values are in the prediction interval
Forecast errors:
MPE: -3.2%; Bias: -100%; MAPE: 3.2%; SMAPE: 3.2%
MASE: 4.183; sMAE: 3.7%; RelMAE: 3.436; sMSE: 0.2%

The main difference between this output and the previous one is in last several lines. Now the function also informs us, in how many cases the prediction intervals covered actual values (88%) and what were the prediction errors for the holdout. The errors are:

MPE - Mean Percentage Error;
Bias - Coefficient based on Mean Root Error, measuring symmetry / bias in residuals. If it is 0, then there's no bias, otherwise there is either positive or negative bias. This coefficient lies in region from -100% to 100%;
MAPE - Mean Absolute Percentage Error;
SMAPE - Symmetric Mean Absolute Percentage Error;
MASE - Mean Absolute Scaled Error;
sMAE - Scaled Mean Absolute Error (MAE divided by mean absolute actual value);
RelMAE - Relative Mean Absolute Error (comparison is done with Naive);
sMSE - Scaled Mean Squared Error. This is scaled by dividing MSE by square of mean absolute actual value.
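Most of these measures are easy to calculate by hand. The following base R sketch follows the verbal definitions above, but it is not the code from smooth: error_measures() is a hypothetical helper, and several details are assumptions, namely the sign convention (actual minus forecast), the exact form of SMAPE, the scaling of MASE by the in-sample one-step Naive errors, and the use of the in-sample values for the scaling of sMAE and sMSE. The Bias coefficient is left out. The numbers are made up for illustration.

```r
# Hypothetical helper (not part of smooth); see the assumptions listed above.
error_measures <- function(holdout, forecast, insample) {
  e <- holdout - forecast                                # assumed sign convention
  naive <- rep(tail(insample, 1), length(holdout))       # Naive: last in-sample value
  c(MPE    = mean(e / holdout),
    MAPE   = mean(abs(e / holdout)),
    SMAPE  = mean(2 * abs(e) / (abs(holdout) + abs(forecast))),
    MASE   = mean(abs(e)) / mean(abs(diff(insample))),
    sMAE   = mean(abs(e)) / mean(abs(insample)),
    RelMAE = mean(abs(e)) / mean(abs(holdout - naive)),
    sMSE   = mean(e^2) / mean(abs(insample))^2)
}

# Made-up numbers, just for illustration
insample <- c(100, 110, 105, 120)
holdout  <- c(100, 200)
forecast <- c(110, 180)

m <- error_measures(holdout, forecast, insample)
print(round(m, 4))
```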

These error measures allow us to assess the accuracy of the produced model. They do not mean much on their own, but can be compared across several models. For example, ETS(A,A,N) applied to the same data has:

MPE: -3.7%; Bias: -100%; MAPE: 3.7%; SMAPE: 3.6%
MASE: 4.82; sMAE: 4.3%; RelMAE: 3.958; sMSE: 0.2%

Comparing these errors with the errors of ETS(A,Ad,N), we can say that damped trend model performs slightly better than ETS(A,A,N).

The only meaningful metric in the list, that can be analysed on its own, is RelMAE, which shows us that our forecast is 3.436 times worse than Naive. Not really soothing...

All these errors are stored in ourModel$accuracy as a vector. Additionally, because we used holdout=TRUE in the estimation, now we have this part of data in a special variable - ourModel$holdout. This is handy, because it allows calculating any other error measures you can think of.

We have mentioned seasonal models in this post, so let's see how es() works with seasonal data on an example of ETS(A,N,A) model and time series N1956:

ourModel <- es(M3$N1956$x, "ANA", h=18)

The resulting graph should look like the following one:

Series N1956 from M3 and es() forecast

There is not much to say about this graph, except that the model is fitted well and more or less adequate forecasts have been produced (at least nothing ridiculous). Obviously if we compare this model with the one estimated using ets(), we will see that there are some small differences. For example, smoothing parameters values and initial values will differ, which will lead to slightly different forecasts. Here's an example of forecasts for 6 months ahead produced by both functions:

h	es()	ets()
1	3106.583	3105.761
2	3592.868	3580.976
3	4395.580	4389.775
4	5044.109	5052.459
5	4305.332	4364.522
6	3650.615	3733.528

As we see, for some values the difference is almost non-existent (0.822 for $h=$1), but for the others it becomes larger (-82.913 for $h=$6). Still the overall difference between these two forecasts is around 0.43% (averaged out for the horizon of 1 to 6, scaled to the mean of the data), so it is not clear, which of the two forecasts is more adequate. This difference is not only caused by the structure of state vector, but also by the optimisers used in functions: the one used in es() is more precise, which, however, does not necessarily translate into an increase in forecasting accuracy.

That's it for today. Next time we will look into models with multiplicative errors. They have some important differences in comparison with ets(), that we should discuss.

Message “smooth” package for R. es() function. Part II. Pure additive models first appeared on Open Forecast.

“smooth” package for R. es() function. Part I

Ivan Svetunkov — Fri, 14 Oct 2016 15:29:45 +0000

Good news, everyone! “smooth” package is now available on CRAN. And it is time to look into what this package can do and why it is needed at all. The package itself contains some documentation that you can use as a starting point. For example, there are vignettes, which show the included functions and what they allow you to do, but do not go into details of what is happening inside those functions. If you don’t have a personal life and are ready to spend some time reading semi-scientific materials with formulae, then I have a thing especially for you – smooth documentation, which describes in detail why models used in the package make sense, how they are optimised and what it leads to. Here we will try to look into some functions and their application to real life forecasting problems.

This is the first post in a series of posts on “smooth” functions. We start with es() – Exponential Smoothing.

What is es() and why do we need it?

“ES” stands for “Exponential Smoothing”. This function is an implementation of ETS model, alternative to Rob Hyndman’s ets() function from “forecast” package. The natural question would be: “Why bother if there already exists exponential smoothing function”? There are several reasons for that:

ets() does not allow constructing some mixed models. For example, classical Holt-Winters model, which corresponds to ETS(A,A,M) is unavailable. This is not a big loss, but these models may be of interest to researchers. So es() function has all 30 of them;
ets() does not have exogenous variables. es() allows providing either vector or matrix of exogenous variables. Note, however, that currently only additive ETS models are guaranteed to work well with exogenous variables. It will most probably work fine for other models, but I cannot guarantee stable performance;
There are quite a few papers in the forecasting field that show that using combinations of forecasts of different models leads to an increase in forecasting accuracy. Stephan Kolassa applied one neat method of combinations described in the Burnham and Anderson book to exponential smoothing, showing that it leads to an increase in forecast accuracy. So I took that method and implemented it in es() function;
The ets() function limits the number of seasonal coefficients to 24. The motivation here is that with a higher number of coefficients, optimisers will do a lousy job of finding the optimal values of parameters. es() doesn’t have this restriction, claiming that users should be responsible for their own actions. Every action has consequences, you know? So, be smart and responsible!
The ets() function does not allow defining initial values for the state vector. In general, who cares, right? But they may become important in some cases, when optimisation of the initial states is not an option. So I have implemented two methods of initialisation (optimisation and backcasting) and allowed users to provide initial values if they really need to. This also helps with the problem of having a large number of parameters due to the number of seasonal coefficients. For example, if you deal with weekly data, then you can switch to backcasting or provide your own values for the seasonal indices;
MSE as a cost function looked too boring, so I have introduced MAE, cost functions based on multiple steps ahead forecasting errors and the exotic “Half Absolute Moment”. Why? Well, just in order to make life more exciting! (although there is a rationale for that, which is discussed in the documentation)
There are several ways of constructing prediction intervals. es() allows selecting between parametric, semi-parametric and non-parametric ones. This additional flexibility allows dealing with cases when, for some reason, the basic assumptions of ETS do not hold;
The underlying statistical model derived in es() differs slightly from the one in ets(). This becomes especially evident when models with multiplicative errors are used. There we assume that the error term is distributed log-normally, in contrast to the assumption of normality in ets();
All the functions of the “smooth” package allow dealing with intermittent data. This is based on very recent research done in collaboration with John Boylan. So the es() function allows producing forecasts using Croston’s model (not method, this is not a typo!) and some other intermittent data models. We can even select between intermittent and normal models using this approach. This is still ongoing research, so more models with more advanced mechanisms will follow;
Last but not least, we have a fancy holdout parameter, which allows dividing the provided time series into two parts, fitting the model on the first one and assessing forecast accuracy on the second.
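The combination method mentioned in the list above weights each model by its information criterion, following the Akaike weights from Burnham and Anderson. Here is a minimal Python sketch of that idea, not the code used inside es(): the function names, the AIC values and the forecasts are all made up for illustration.

```python
import math

def akaike_weights(ic_values):
    """Turn information criterion values (e.g. AIC) into combination weights."""
    best = min(ic_values)
    # Work with differences from the best model so the exponentials stay stable
    raw = [math.exp(-0.5 * (ic - best)) for ic in ic_values]
    total = sum(raw)
    return [r / total for r in raw]

def combine_forecasts(forecasts, weights):
    """Weighted average of point forecasts; one list of forecasts per model."""
    horizon = len(forecasts[0])
    return [sum(w * f[t] for w, f in zip(weights, forecasts))
            for t in range(horizon)]

# Hypothetical AIC values and 3-steps-ahead forecasts from three models
aic = [100.0, 102.0, 110.0]
forecasts = [[10.0, 11.0, 12.0], [10.5, 11.5, 12.5], [9.0, 9.0, 9.0]]

weights = akaike_weights(aic)
combined = combine_forecasts(forecasts, weights)
print(weights)
print(combined)
```

The model with the lowest criterion gets the largest weight, and models that are far behind contribute almost nothing, so the combination behaves like a soft version of model selection.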

Having said that, I am not claiming that the es() function always produces more accurate forecasts than ets(). From what I have seen so far, the function is sometimes more accurate and sometimes less accurate than ets(). However, the main point of the function is not precision but flexibility. You can do much more with it than with ets(); the only limit is your imagination! If you don’t have imagination and / or just need an efficient exponential smoothing function, don’t waste your time with es() and use Rob’s ets().
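If you want to check such accuracy claims on your own data, the holdout idea from the list above is the simplest tool. The Python sketch below only illustrates the principle and is not how es() is implemented: it uses plain simple exponential smoothing with a fixed, arbitrarily chosen smoothing parameter, and the series and the helper names are invented for the example.

```python
def ses_forecast(train, alpha, horizon):
    """Simple exponential smoothing: a flat forecast from the last level."""
    level = train[0]  # naive initialisation with the first observation
    for y in train[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon

def holdout_mae(series, horizon, forecaster):
    """Fit on all but the last `horizon` points, then score on those points."""
    train, test = series[:-horizon], series[-horizon:]
    forecast = forecaster(train, horizon)
    return sum(abs(a - f) for a, f in zip(test, forecast)) / horizon

# Made-up series; the last 3 observations are withheld as the holdout
series = [12.0, 13.0, 12.5, 14.0, 13.5, 15.0, 14.5, 16.0, 15.5, 17.0]

mae_smooth = holdout_mae(series, 3, lambda tr, h: ses_forecast(tr, 0.3, h))
mae_naive = holdout_mae(series, 3, lambda tr, h: [tr[-1]] * h)
print(mae_smooth, mae_naive)
```

Comparing two forecasters on the same withheld part of the series is exactly the kind of check that tells you which one works better for your data, rather than in general.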

The post “smooth” package for R. es() function. Part I first appeared on Open Forecast.