<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Archives greybox - Open Forecasting</title>
	<atom:link href="https://openforecast.org/tag/greybox/feed/" rel="self" type="application/rss+xml" />
	<link>https://openforecast.org/tag/greybox/</link>
	<description>How to look into the future</description>
	<lastBuildDate>Mon, 28 Jul 2025 15:51:00 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2015/08/cropped-usd-05-32x32.png&amp;nocache=1</url>
	<title>Archives greybox - Open Forecasting</title>
	<link>https://openforecast.org/tag/greybox/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>smooth &#038; greybox under LGPLv2.1</title>
		<link>https://openforecast.org/2023/09/19/smooth-greybox-under-lgplv2-1/</link>
					<comments>https://openforecast.org/2023/09/19/smooth-greybox-under-lgplv2-1/#comments</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 19 Sep 2023 09:32:56 +0000</pubDate>
				<category><![CDATA[Package greybox for R]]></category>
		<category><![CDATA[Package smooth for R]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[greybox]]></category>
		<category><![CDATA[smooth]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=3286</guid>

					<description><![CDATA[<p>Good news, everyone! I&#8217;ve recently released major versions of my packages smooth and greybox, v4.0.0 and v2.0.0 respectively, on CRAN. Has something big happened? Yes and no. Let me explain. Starting from these versions, the packages will be licensed under LGPLv2.1 instead of the very restrictive GPLv2. This does not change anything to the everyday [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2023/09/19/smooth-greybox-under-lgplv2-1/">smooth &#038; greybox under LGPLv2.1</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Good news, everyone! I&#8217;ve recently released major versions of my packages <a href="https://cran.r-project.org/package=smooth">smooth</a> and <a href="https://cran.r-project.org/web/packages/greybox/index.html">greybox</a>, v4.0.0 and v2.0.0 respectively, on CRAN. Has something big happened? Yes and no. Let me explain.</p>
<div id="attachment_3308" style="width: 510px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/greybox-smooth.png&amp;nocache=1"><img fetchpriority="high" decoding="async" aria-describedby="caption-attachment-3308" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/greybox-smooth.png&amp;nocache=1" alt="Stickers of the greybox and smooth packages for R" width="500" height="289" class="size-full wp-image-3308" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/greybox-smooth.png&amp;nocache=1 500w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2023/09/greybox-smooth-300x173.png&amp;nocache=1 300w" sizes="(max-width: 500px) 100vw, 500px" /></a><p id="caption-attachment-3308" class="wp-caption-text">Stickers of the greybox and smooth packages for R</p></div>
<p>Starting from these versions, the packages will be licensed under <a href="https://www.gnu.org/licenses/old-licenses/lgpl-2.1.en.html">LGPLv2.1</a> instead of the very restrictive <a href="https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html">GPLv2</a>. This does not change anything for everyday users of the packages, but it is a potential game changer for software developers and those who might want to modify the source code of the packages for commercial purposes. This is because any modification of code under GPLv2 must itself be released and made available to everyone, while LGPLv2.1 allows modifications without releasing the source code. At the same time, both licenses require attribution to the author, so if someone modifies the code and uses it for their purposes, they still need to say who developed the original package (Ivan Svetunkov in this case). The reason I decided to change the license is that one of the software vendors I sometimes work with pointed out that they cannot touch anything under GPL because of the restrictions above. Moving to the LGPL will allow them to use my packages in their own developments. This applies to such functions as <a href="https://openforecast.org/adam/">adam()</a>, <a href="/en/category/r-en/smooth/es-function/">es()</a>, <a href="https://cran.r-project.org/web/packages/smooth/vignettes/ssarima.html">msarima()</a>, <a href="/en/2022/08/02/complex-exponential-smoothing/">ces()</a>, <a href="https://cran.r-project.org/web/packages/greybox/vignettes/alm.html">alm()</a> and others. I don&#8217;t mind, as long as they say who developed the original thing.</p>
<p>What happens now? The versions of the <code>smooth</code> and <code>greybox</code> packages under GPLv2 are available on GitHub <a href="https://github.com/config-i1/smooth/releases/tag/v3.2.2">here</a> and <a href="https://github.com/config-i1/greybox/releases/tag/v1.0.9">here</a> respectively, so if you are a radical open source adept, you can download those releases, install them and use them instead of the new versions. But from now on, I plan to support the packages under the LGPLv2.1 license.</p>
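<p>If you want to pin those last GPLv2 versions, one way (a sketch, assuming you have the <code>remotes</code> package installed) is to install them directly from the GitHub tags linked above:</p>
<pre class="decode"># Install the final GPLv2 releases of the packages from GitHub
remotes::install_github("config-i1/smooth@v3.2.2")
remotes::install_github("config-i1/greybox@v1.0.9")</pre>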
<p>Finally, a small teaser: colleagues of mine have agreed to help me translate the R code into Python (actually, I am quite useless in this endeavour; they do everything), so at some point in the future, we might see the <code>smooth</code> and <code>greybox</code> packages in Python. And they will also be licensed under LGPLv2.1.</p>
<p>Message <a href="https://openforecast.org/2023/09/19/smooth-greybox-under-lgplv2-1/">smooth &#038; greybox under LGPLv2.1</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2023/09/19/smooth-greybox-under-lgplv2-1/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>Introducing scale model in greybox</title>
		<link>https://openforecast.org/2022/01/23/introducing-scale-model-in-greybox/</link>
					<comments>https://openforecast.org/2022/01/23/introducing-scale-model-in-greybox/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Sun, 23 Jan 2022 18:04:33 +0000</pubDate>
				<category><![CDATA[Package greybox for R]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Regression]]></category>
		<category><![CDATA[Univariate models]]></category>
		<category><![CDATA[greybox]]></category>
		<category><![CDATA[regression]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=2670</guid>

					<description><![CDATA[<p>At the end of June 2021, I released the greybox package version 1.0.0. This was a major release, introducing new functionality, but I did not have time to write a separate post about it because of the teaching and lack of free time. Finally, Christmas has arrived, and I could spend several hours preparing the [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2022/01/23/introducing-scale-model-in-greybox/">Introducing scale model in greybox</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>At the end of June 2021, I released the <code>greybox</code> package version 1.0.0. This was a major release, introducing new functionality, but I did not have time to write a separate post about it because of the teaching and lack of free time. Finally, Christmas has arrived, and I could spend several hours preparing the post about it. In this post, I want to tell you about the new major feature in the <code>greybox</code> package.</p>
<h3>Scale Model</h3>
<p>The scale model is a regression-like model that captures the relation between the scale of a distribution (for example, the variance in the Normal distribution) and a set of explanatory variables. It is implemented in the <code>sm()</code> function in the <code>greybox</code> package. The motivation for this comes from <a href="https://www.gamlss.com/">GAMLSS</a>, the Generalised Additive Model for Location, Scale and Shape. While I have decided not to bother with the &#8220;GAM&#8221; part of it (there are <code>gam</code> and <code>gamlss</code> packages in R that do that), I liked the idea of being able to predict the scale (for example, the variance) of a distribution. This becomes especially useful when one suspects heteroscedasticity in the model but does not think that variable transformations are appropriate.</p>
<p>To understand what the function does, it is necessary first to discuss the underlying model. We will start the discussion with an example of a linear regression model with two explanatory variables, assuming Normally distributed residuals \(\xi_t\) with zero mean and a fixed variance \(\sigma^2\), \(\xi_t \sim \mathcal{N}(0,\sigma^2)\), which can be formulated as:<br />
\begin{equation} \label{eq:model1}<br />
    y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \xi_t ,<br />
\end{equation}<br />
where \(y_t\) is the response variable, \(x_{1,t}\) and \(x_{2,t}\) are the explanatory variables on observation \(t\), and \(\beta_0\), \(\beta_1\) and \(\beta_2\) are the parameters of the model. Recalling the basic properties of the Normal distribution, we can rewrite the same model with standard normal residuals \(\epsilon_t \sim \mathcal{N}\left(0, 1 \right)\) by inserting \(\xi_t = \sigma \epsilon_t\) into \eqref{eq:model1}:<br />
\begin{equation} \label{eq:model2}<br />
    y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \sigma \epsilon_t .<br />
\end{equation}<br />
Now if we suspect that the variance of the model might not be constant, we can substitute the standard deviation \(\sigma\) with some function, transforming the model into:<br />
\begin{equation} \label{eq:model3}<br />
    y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + f\left(\gamma_0 + \gamma_2 x_{2,t} + \gamma_3 x_{3,t}\right) \epsilon_t ,<br />
\end{equation}<br />
where \(x_{2,t}\) and \(x_{3,t}\) are the explanatory variables (as you see, not necessarily the same as in the first part of the model) and \(\gamma_0\), \(\gamma_2\) and \(\gamma_3\) are the parameters of the scale part of the model. The idea here is that there is a regression model for the conditional mean of the distribution, \(\beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t}\), and another one that regulates the standard deviation via \(f\left(\gamma_0 + \gamma_2 x_{2,t} + \gamma_3 x_{3,t}\right)\). The main thing to keep in mind about the latter is that the function \(f(\cdot)\) needs to be strictly positive, because the standard deviation cannot be zero or negative. The simplest way to guarantee this is to use the exponential function for \(f(\cdot)\). Furthermore, in our example with the Normal distribution, the scale corresponds to the variance, so we should introduce the model for the variance: \(\sigma^2_t = \exp\left(\gamma_0 + \gamma_2 x_{2,t} + \gamma_3 x_{3,t}\right)\). This leads to the following model:<br />
\begin{equation} \label{eq:model4}<br />
    y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \sqrt{\exp\left(\gamma_0 + \gamma_2 x_{2,t} + \gamma_3 x_{3,t}\right)} \epsilon_t .<br />
\end{equation}<br />
This model would have not only the conditional mean depending on the values of the explanatory variables (the conventional regression), but also the conditional variance. Note that this model assumes linearity in the conditional mean: an increase of \(x_{1,t}\) by one leads to an increase of \(y_t\) by \(\beta_1\) on average. At the same time, it assumes non-linearity in the variance: an increase of \(x_{2,t}\) by one leads to an increase of the variance by \(\left(\exp(\gamma_2)-1\right)\times 100\)%. If we want a non-linear change in the conditional mean, we can use a model in logarithms. Alternatively, we could assume a different distribution for the response variable \(y_t\). To understand how the latter would work, we need to represent the same model \eqref{eq:model4} in a more general form. For the Normal distribution, model \eqref{eq:model4} can be rewritten as:<br />
\begin{equation} \label{eq:model5}<br />
    y_t \sim \mathcal{N}\left(\beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t}, \exp\left(\gamma_0 + \gamma_2 x_{2,t} + \gamma_3 x_{3,t}\right)\right).<br />
\end{equation}<br />
This representation allows introducing the scale model for many other distributions, such as Laplace, Generalised Normal, Gamma, Inverse Gaussian, etc. All we need to do in those cases is substitute the distribution \(\mathcal{N}(\cdot)\) with a distribution of interest. The <code>sm()</code> function supports the same list of distributions as <code>alm()</code> (see <a href="https://cran.r-project.org/web/packages/greybox/vignettes/alm.html" rel="noopener" target="_blank">the vignette</a> for the function on CRAN or in R using the command <code>vignette()</code>). The specific formula for the scale differs from one distribution to another, but the principles stay the same.</p>
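<p>To see what model \eqref{eq:model4} implies in practice, here is a small base R simulation (it uses the same coefficients as the data generation in the next section and does not rely on <code>greybox</code>): the spread of the residuals grows with \(x_{2,t}\), and a unit increase of \(x_{2,t}\) multiplies the variance by \(\exp(0.5) \approx 1.65\):</p>
<pre class="decode">set.seed(41)
n <- 1000
x1 <- rnorm(n,10,3); x2 <- rnorm(n,10,3); x3 <- rnorm(n,10,3)
# Model (4): the variance of the error term is exp(0.3+0.5*x2-0.4*x3)
y <- 1000 - 0.75*x1 + 1.75*x2 + sqrt(exp(0.3+0.5*x2-0.4*x3))*rnorm(n)
# Residuals around the true mean spread out for larger x2
e <- y - (1000 - 0.75*x1 + 1.75*x2)
sd(e[x2 < 10]); sd(e[x2 >= 10])</pre>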
<h3>Demonstration in R</h3>
<p>For demonstration purposes, we will use an example with artificial data, generated according to the model \eqref{eq:model4}:</p>
<pre class="decode">xreg <- matrix(rnorm(300,10,3),100,3)
xreg <- cbind(1000-0.75*xreg[,1]+1.75*xreg[,2]+
              sqrt(exp(0.3+0.5*xreg[,2]-0.4*xreg[,3]))*rnorm(100,0,1),xreg)
colnames(xreg) <- c("y",paste0("x",c(1:3)))</pre>
<p>The scatterplot of the generated data will look like this:</p>
<pre class="decode">spread(xreg)</pre>
<div id="attachment_2789" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smExampleSpread.png&amp;nocache=1"><img decoding="async" aria-describedby="caption-attachment-2789" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smExampleSpread-300x263.png&amp;nocache=1" alt="Scatterplot matrix for the generated data" width="300" height="263" class="size-medium wp-image-2789" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smExampleSpread-300x263.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smExampleSpread-768x672.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smExampleSpread.png&amp;nocache=1 800w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-2789" class="wp-caption-text">Scatterplot matrix for the generated data</p></div>
<p>We can then fit a model, specifying the location and scale parts of it in <code>alm()</code>. In this case, <code>alm()</code> will call <code>sm()</code> and will estimate both parts via likelihood maximisation. To make things closer to a forecasting task, we will withhold the last 10 observations for the test set:</p>
<pre class="decode">ourModel <- alm(y~x1+x2+x3, scale=~x2+x3, xreg, subset=c(1:90), distribution="dnorm")</pre>
<p>The returned model contains both parts. The scale part of the model can be accessed via <code>ourModel$scale</code>. It is an object of class "scale", supporting several methods, such as <code>actuals()</code>, <code>residuals()</code>, <code>fitted()</code>, <code>summary()</code> and <code>plot()</code> (and several others). Here is how the summary of the model looks in my case:</p>
<pre class="decode">summary(ourModel)</pre>
<pre>Response variable: y
Distribution used in the estimation: Normal
Loss function used in estimation: likelihood
Coefficients:
             Estimate Std. Error Lower 2.5% Upper 97.5%  
(Intercept) 1000.2850     2.9698   994.3782   1006.1917 *
x1            -0.8350     0.1435    -1.1204     -0.5497 *
x2             1.8656     0.1714     1.5246      2.2065 *
x3            -0.0228     0.1776    -0.3761      0.3305  

Coefficients for scale:
            Estimate Std. Error Lower 2.5% Upper 97.5%  
(Intercept)   0.0436     0.7012    -1.3510      1.4382  
x2            0.4705     0.0413     0.3883      0.5527 *
x3           -0.3355     0.0487    -0.4324     -0.2385 *

Error standard deviation: 4.52
Sample size: 90
Number of estimated parameters: 7
Number of degrees of freedom: 83
Information criteria:
     AIC     AICc      BIC     BICc 
391.0191 392.3849 408.5177 411.5908</pre>
<p>The summary above shows parameters for both parts of the model. They are not far from the ones used in the generation of the data, which indicates that the implemented model works as intended. The only issue here is that the standard errors in the location part of the model (the first four coefficients) <strong>do not take the heteroscedasticity into account and thus are biased</strong>. The <a href="https://www.econometrics-with-r.org/15.4-hac-standard-errors.html" rel="noopener" target="_blank">HAC standard errors</a> are not yet implemented in <code>alm()</code>.</p>
<p>Just to see the effect of the scale model, here are the diagnostics plots for the original model (which returns the \(\xi_t\) residuals) and for the scale model (\(\epsilon_t\) residuals):</p>
<pre class="decode">par(mfcol=c(1,2))
plot(ourModel, 5)
plot(ourModel$scale, 5)</pre>
<div id="attachment_2794" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smDiagnostics.png&amp;nocache=1"><img decoding="async" aria-describedby="caption-attachment-2794" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smDiagnostics-300x175.png&amp;nocache=1" alt="Diagnostics plots for sm" width="300" height="175" class="size-medium wp-image-2794" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smDiagnostics-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smDiagnostics-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smDiagnostics-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smDiagnostics.png&amp;nocache=1 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-2794" class="wp-caption-text">Diagnostics plots for sm</p></div>
<p>The Figure above shows squared residuals vs fitted values for the location (left) and the scale (right) parts of the model. The former is agnostic of the scale model and demonstrates that the residuals are heteroscedastic (the variance increases with the fitted values). The latter shows that the scale model has managed to resolve the issue: while the LOWESS line demonstrates some non-linearity, the distribution of residuals conditional on the fitted values looks random.</p>
<p>Finally, we can produce forecasts from such a model, similarly to how it is done for any other model estimated with <code>alm()</code>:</p>
<pre class="decode">ourForecast <- predict(ourModel,xreg[-c(1:90),],interval="pred")
plot(ourForecast)</pre>
<div id="attachment_2800" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smForecast.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-2800" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smForecast-300x175.png&amp;nocache=1" alt="Forecast from the model" width="300" height="175" class="size-medium wp-image-2800" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smForecast-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smForecast-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smForecast-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2022/01/smForecast.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-2800" class="wp-caption-text">Forecast from the model</p></div>
<p>In this case, the function will first predict the scale part of the model and then use the predicted variance together with the covariance matrix of parameters to calculate the prediction intervals shown in the Figure above. Given the independence of the location and scale parts of the model, the conditional expectation (point forecast) will not change if we drop the scale model. It is all about the variance.</p>
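<p>To illustrate the mechanics with simple arithmetic, here is a base R sketch that ignores the parameter uncertainty (which <code>predict()</code> does take into account) and plugs the coefficient estimates from the summary above into the two parts of the model for some hypothetical values x1=10, x2=12, x3=9:</p>
<pre class="decode"># Location part: conditional mean
mu <- 1000.2850 - 0.8350*10 + 1.8656*12 - 0.0228*9
# Scale part: predicted variance from the scale model
sigma2 <- exp(0.0436 + 0.4705*12 - 0.3355*9)
# 95% prediction interval for the Normal distribution
mu + qnorm(c(0.025, 0.975))*sqrt(sigma2)</pre>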
<p>Finally, if you do not want to use the <code>alm()</code> function, you can use <code>lm()</code> instead and then apply <code>sm()</code> to it:</p>
<pre class="decode">lmModel <- lm(y~x1+x2+x3, as.data.frame(xreg), subset=c(1:90))
smModel <- sm(lmModel, formula=~x2+x3, xreg)</pre>
<p>In this case, <code>sm()</code> will assume that the error term follows the Normal distribution, and we will end up with two models that are not connected with each other (e.g., the <code>predict()</code> method applied to <code>lmModel</code> will not use predictions from <code>smModel</code>). Nonetheless, we can still use all the R methods discussed above for the analysis of <code>smModel</code>.</p>
<p>As a final word, the scale model is a new feature. While it already works, there might be bugs in it. If you find any, please let me know by submitting <a href="https://github.com/config-i1/greybox/issues" rel="noopener" target="_blank">an issue on GitHub</a>.</p>
<h3>P.S.</h3>
<p>There is a danger that the <code>greybox</code> <strong>package will soon be removed from CRAN</strong> together with 88 other packages (including my <code>smooth</code> and <code>legion</code>), because the <code>nloptr</code> package that it relies on has not passed some of the new checks recently introduced by CRAN. This is beyond my control, and I do not have the time or power to influence it, but if this happens, you might need to switch to <a href="https://github.com/config-i1/greybox/" rel="noopener" target="_blank">the installation from GitHub</a> via the <code>remotes</code> package, using the command:</p>
<pre class="decode">remotes::install_github("config-i1/greybox")</pre>
<p>My apologies for the inconvenience. I might be able to remove the dependence on <code>nloptr</code> at some point, but it will not happen before March 2022.</p>
<p>Message <a href="https://openforecast.org/2022/01/23/introducing-scale-model-in-greybox/">Introducing scale model in greybox</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2022/01/23/introducing-scale-model-in-greybox/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>An Integrated Method for Estimation and Optimisation</title>
		<link>https://openforecast.org/2021/09/03/an-integrated-method-for-estimation-and-optimisation/</link>
					<comments>https://openforecast.org/2021/09/03/an-integrated-method-for-estimation-and-optimisation/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Fri, 03 Sep 2021 15:47:15 +0000</pubDate>
				<category><![CDATA[Package greybox for R]]></category>
		<category><![CDATA[Papers]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Regression]]></category>
		<category><![CDATA[Univariate models]]></category>
		<category><![CDATA[extrapolation methods]]></category>
		<category><![CDATA[greybox]]></category>
		<category><![CDATA[regression]]></category>
		<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=2703</guid>

					<description><![CDATA[<p>My PhD student, Congzheng Liu (co-supervised with Adam Letchford) has written a paper, entitled &#8220;Newsvendor Problems: An Integrated Method for Estimation and Optimisation&#8220;. This paper has recently been published in EJOR. In this paper we build upon the existing Ban &#038; Rudin (2019) approach for newsvendor problem, showing that in case of the linear model, [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2021/09/03/an-integrated-method-for-estimation-and-optimisation/">An Integrated Method for Estimation and Optimisation</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>My PhD student, <a href="https://www.linkedin.com/in/congzheng-liu/" rel="noopener" target="_blank">Congzheng Liu</a> (co-supervised with <a href="https://www.lancaster.ac.uk/staff/letchfoa/default.htm" rel="noopener" target="_blank">Adam Letchford</a>) has written a paper, entitled &#8220;<a href="https://doi.org/10.1016/j.ejor.2021.08.013" rel="noopener" target="_blank">Newsvendor Problems: An Integrated Method for Estimation and Optimisation</a>&#8220;. This paper has recently been published in <a href="https://www.sciencedirect.com/journal/european-journal-of-operational-research" rel="noopener" target="_blank">EJOR</a>. In this paper we build upon the existing <a href="https://doi.org/10.1287/opre.2018.1757" rel="noopener" target="_blank">Ban &#038; Rudin (2019)</a> approach for newsvendor problem, showing that in case of the linear model, it becomes equivalent to quantile regression. We then extend it for the non-linear newsvendor problems, testing it on simulated and real life data. In order to understand what specifically we propose, we need to discuss the typical process in case of newsvendor problem.</p>
<p>The newsvendor is a class of problems where the product can only be sold for one day, after which it goes to waste. It is appropriate, for example, for perishable products in retail. Typically, in this situation we would have the historical sales of our product \(y_t\), and we would try forecasting them using regression / ETS / ARIMA or any other model. After doing that and obtaining the estimates of parameters, we would produce a quantile of the assumed distribution, which tells us how much to order (\(q_t\)). If we order more than needed, we will incur holding costs; in the opposite case, we will incur shortage costs. Based on these costs and the price of the product, we can find the optimal order that gives the maximum profit.</p>
<p>As you can already spot, the forecasting stage is detached from the optimisation one in this situation. The idea of the proposed integrated approach (IMEO) is simple: instead of optimising the model via MSE or any other conventional loss and then solving the optimisation problem, we can <strong>estimate the model via maximisation of the specific profit function</strong>, thus obtaining the required orders directly. This is not a new idea on its own, but using the profit function rather than the cost (as <a href="https://doi.org/10.1287/opre.2018.1757" rel="noopener" target="_blank">Ban &#038; Rudin, 2019</a> did) allows applying IMEO to a wider set of problems.</p>
<p>For example, if we know the price of the product \(p\), the costs for production \(v\), holding \(c_h\) and shortage costs \(c_s\), we can then calculate profit as (for a linear newsvendor problem):<br />
\begin{equation}<br />
    \pi(q_t,y_t)=<br />
    \begin{cases}<br />
        p y_t -v q_t -c_h (q_t -y_t),&#038; \text{for } q_t \geq y_t\\<br />
        p q_t -v q_t -c_s (y_t -q_t),&#038; \text{for } q_t < y_t,<br />
    \end{cases}<br />
\end{equation}<br />
where \(q_t\) is the order quantity and \(y_t\) is the actual sales. This profit function can be used for the estimation of a model of your choosing. Congzheng has written separate R code for the experiments in the paper. Inspired by his example, I have implemented custom losses in the <code>alm()</code> and <code>adam()</code> functions from the <code>greybox</code> and <code>smooth</code> packages for R respectively. At the moment, only the regression model works properly with custom losses &#8211; ETS / ARIMA need additional modifications, which we will hopefully resolve in the next paper. So, here is an example with the linear newsvendor problem and <code>alm()</code>:</p>
<pre class="decode"># Generate artificial data
x1 <- rnorm(100,100,10)
x2 <- rbinom(100,2,0.05)
y <- 10 + 1.5*x1 + 5*x2 + rnorm(100,0,10)
ourData <- cbind(y=y,x1=x1,x2=x2)

# Define price and costs
price <- 50
costBasic <- 5
costShort <- 15
costHold <- 1

# Define profit function for the linear case
# Note: B and xreg are unused here but are part of the custom loss interface
lossProfit <- function(actual, fitted, B, xreg){
    # Minus sign is needed here, because we need to minimise the loss
    profit <- -ifelse(actual >= fitted,
                     (price - costBasic) * fitted - costShort * (actual - fitted),
                     price * actual - costBasic * fitted - costHold * (fitted - actual));
    return(sum(profit));
}

# Estimate the model
model1 <- alm(y~x1+x2, ourData, loss=lossProfit)

# Print summary of the model
summary(model1, bootstrap=TRUE) </pre>
<pre>Response variable: y
Distribution used in the estimation: Normal
Loss function used in estimation: custom
Bootstrap was used for the estimation of uncertainty of parameters
Coefficients:
            Estimate Std. Error Lower 2.5% Upper 97.5%  
(Intercept)  36.5177    14.2840     2.7783     51.4844 *
x1            1.3622     0.1622     1.1909      1.7528 *
x2            3.3423     2.7810    -6.5997      5.9101  

Error standard deviation: 17.2266
Sample size: 100
Number of estimated parameters: 3
Number of degrees of freedom: 97</pre>
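<p>The same integrated idea can be sketched without any packages: take a linear order rule \(q_t = B_0 + B_1 x_t\) and estimate its parameters by maximising the profit function above directly with base R <code>optim()</code> (a toy illustration with one variable, not the code from the paper):</p>
<pre class="decode">set.seed(42)
x <- rnorm(100,100,10)
y <- 10 + 1.5*x + rnorm(100,0,10)       # actual demand
p <- 50; v <- 5; ch <- 1; cs <- 15      # price and costs
# Negative total profit of the order rule q = B[1] + B[2]*x
negProfit <- function(B){
    q <- B[1] + B[2]*x
    -sum(ifelse(q >= y, p*y - v*q - ch*(q - y),
                        p*q - v*q - cs*(y - q)))
}
# Start from the OLS estimates and maximise the profit
B <- optim(coef(lm(y ~ x)), negProfit)$par
B</pre>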
<p>The resulting model is easy to work with: it provides meaningful parameters, showing how, on average, the order should change if a variable changes by one. For example, if x1 increases by one, the order should increase on average by 1.36.</p>
<p>Note that in this specific case, as shown <a href="https://doi.org/10.1016/j.ejor.2021.08.013" rel="noopener" target="_blank">in our paper</a>, the model is equivalent to the quantile regression estimated for the quantile \(\left( \frac{c_u}{c_o+c_u} \right)\), where \(c_u = p-v+c_s\) is the "underage" cost and \(c_o = v+c_h\) is the "overage" cost. In our example, this corresponds approximately to the 0.9091 quantile. We can compare the output of this model with the one from the quantile regression in <code>alm()</code> (which is estimated as an Asymmetric Laplace model):</p>
<pre class="decode">model2 <- alm(y~x1+x2, ourData, distribution="dalaplace", alpha=0.9091)
summary(model2, bootstrap=TRUE)</pre>
<pre>Response variable: y
Distribution used in the estimation: Asymmetric Laplace with alpha=0.9091
Loss function used in estimation: likelihood
Bootstrap was used for the estimation of uncertainty of parameters
Coefficients:
            Estimate Std. Error Lower 2.5% Upper 97.5%  
(Intercept)  36.6688    11.6686     3.8674     51.1987 *
x1            1.3611     0.1338     1.1920      1.7454 *
x2            3.1259     2.5424    -6.2518      5.4703  

Error standard deviation: 17.3379
Sample size: 100
Number of estimated parameters: 4
Number of degrees of freedom: 96
Information criteria:
     AIC     AICc      BIC     BICc 
826.4622 826.8833 836.8829 837.8524</pre>
<p>The differences between the estimates of parameters of the two models are due to the optimisation procedure, which would converge to slightly different points in these two cases. Still, the values of parameters are close to each other and would converge asymptotically, which supports our finding.</p>
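<p>The implied quantile \(\frac{c_u}{c_o+c_u}\) itself is straightforward to compute. A minimal sketch, assuming some illustrative values of `price` and `costBasic` (both are defined earlier in the post; the numbers below are assumptions picked purely to demonstrate the calculation):</p>

```r
# Illustrative newsvendor costs; price and costBasic are assumptions here
price <- 50
costBasic <- 5
costShort <- 15
costHold <- 1
cu <- price - costBasic + costShort  # "underage" cost
co <- costBasic + costHold           # "overage" cost
round(cu / (co + cu), 4)
# 0.9091
```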
<p>And here is how the orders over time look in the case of our custom loss:</p>
<pre class="decode">plot(model1, 7)</pre>
<div id="attachment_2716" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2021/09/ordersDynamics.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-2716" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2021/09/ordersDynamics-300x175.png&amp;nocache=1" alt="Dynamics of orders from alm model" width="300" height="175" class="size-medium wp-image-2716" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2021/09/ordersDynamics-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2021/09/ordersDynamics-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2021/09/ordersDynamics-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2021/09/ordersDynamics.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-2716" class="wp-caption-text">Dynamics of orders from alm model</p></div>
<p>The purple line in the Figure above corresponds to the orders and would cover roughly 90.91% of cases, so that we would run out of the product in approximately 9% of cases, which would still be more profitable than any other option.</p>
<p>Finally, the approach also works well in the case of the non-linear newsvendor problem (see <a href="https://doi.org/10.1016/j.ejor.2021.08.013" rel="noopener" target="_blank">the paper</a> for details), where quantile regression is not suitable and the conventional approach fails. The only thing that changes is the loss function, in which the prices and costs depend non-linearly on the order quantity and sales.</p>
<p>You can read <a href="https://doi.org/10.1016/j.ejor.2021.08.013" rel="noopener" target="_blank">the published paper on EJOR website</a> or the working paper on <a href="http://dx.doi.org/10.13140/RG.2.2.27057.81763" rel="noopener" target="_blank">ResearchGate</a>.</p>
<p>Message <a href="https://openforecast.org/2021/09/03/an-integrated-method-for-estimation-and-optimisation/">An Integrated Method for Estimation and Optimisation</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2021/09/03/an-integrated-method-for-estimation-and-optimisation/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Analytics with greybox</title>
		<link>https://openforecast.org/2019/01/07/marketing-analytics-with-greybox/</link>
					<comments>https://openforecast.org/2019/01/07/marketing-analytics-with-greybox/#comments</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Mon, 07 Jan 2019 16:40:17 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Package greybox for R]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[greybox]]></category>
		<category><![CDATA[regression]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=1893</guid>

					<description><![CDATA[<p>One of the reasons why I have started the greybox package is to use it for marketing research and marketing analytics. The common problem that I face, when working with these courses is analysing the data measured in different scales. While R handles numeric scales natively, the work with categorical is not satisfactory. Yes, I [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2019/01/07/marketing-analytics-with-greybox/">Analytics with greybox</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
					<content:encoded><![CDATA[<p>One of the reasons why I started the <span class="lang:r decode:true crayon-inline">greybox</span> package is to use it for marketing research and marketing analytics. A common problem that I face when working on these courses is analysing data measured in different scales. While R handles numeric scales natively, its support for categorical ones is not satisfactory. Yes, I know that there are packages that implement some of the necessary functions, but I wanted to have them in one place, without needing to install a lot of packages and satisfy their dependencies. After all, what&#8217;s the point in installing a package for Cramer&#8217;s V, when it can be calculated with two lines of code? So, here&#8217;s a brief explanation of the functions for marketing analytics in <span class="lang:r decode:true crayon-inline">greybox</span>.</p>
<p>I will use the `mtcars` dataset for the examples, but we will transform some of the variables into factors:</p>
<pre class="decode">mtcarsData &lt;- as.data.frame(mtcars)
mtcarsData$vs &lt;- factor(mtcarsData$vs, levels=c(0,1), labels=c("v","s"))
mtcarsData$am &lt;- factor(mtcarsData$am, levels=c(0,1), labels=c("a","m"))</pre>
<p><em>All the functions discussed in this post are available in <span class="lang:r decode:true crayon-inline">greybox</span> starting from v0.4.0. However, I&#8217;ve found several bugs since the submission to CRAN, and the most recent version with bugfixes is now <a href="https://github.com/config-i1/greybox" rel="noopener noreferrer" target="_blank">available on github</a>.</em></p>
<h2>Analysing the relation between the two variables in categorical scales</h2>
<h3>Cramer&#8217;s V</h3>
<p>Cramer&#8217;s V measures the relation between two variables in categorical scale. It is implemented in the <span class="lang:r decode:true crayon-inline">cramer()</span> function. It returns the value in a range of 0 to 1 (1 &#8211; when the two categorical variables are linearly associated with each other, 0 &#8211; otherwise), Chi-Squared statistics from the <span class="lang:r decode:true crayon-inline">chisq.test()</span>, the respective p-value and the number of degrees of freedom. The tested hypothesis in this case is formulated as:<br />
\begin{matrix}<br />
H_0: V = 0 \text{ (there is no association between the variables);} \\<br />
H_1: V \neq 0 \text{ (there is an association between the variables).}<br />
\end{matrix}</p>
<p>Here&#8217;s what we get when trying to find the association between the engine and transmission in the `mtcars` data:</p>
<pre class="decode">cramer(mtcarsData$vs, mtcarsData$am)</pre>
<pre>Cramer's V: 0.1042
Chi^2 statistics = 0.3475, df: 1, p-value: 0.5555</pre>
<p>Judging by this output, the association between these two variables is very low (close to zero) and is not statistically significant.</p>
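<p>For a pair of variables like this, the calculation behind <span class="lang:r decode:true crayon-inline">cramer()</span> can be sketched in a couple of lines of base R as \(V = \sqrt{\frac{\chi^2}{n (k-1)}}\), where \(k\) is the smaller of the number of rows and columns of the contingency table. The sketch below reproduces the numbers above (note that <span class="lang:r decode:true crayon-inline">chisq.test()</span> applies Yates&#8217; continuity correction to 2x2 tables):</p>

```r
# Cramer's V from the chi-squared statistic of a contingency table
tab <- table(mtcars$vs, mtcars$am)
chi <- chisq.test(tab)  # Yates' correction is applied for 2x2 tables
V <- sqrt(unname(chi$statistic) / (sum(tab) * (min(dim(tab)) - 1)))
round(V, 4)
# 0.1042
```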
<p>Cramer&#8217;s V can also be used for data in numerical scales. In general, this might not be the most suitable solution, but it can be useful when a variable takes only a small number of values. For example, the variable `gear` in `mtcars` is numerical, but it has only three options (3, 4 and 5). Here&#8217;s what Cramer&#8217;s V tells us in the case of `gear` and `am`:</p>
<pre class="decode">cramer(mtcarsData$am, mtcarsData$gear)</pre>
<pre>Cramer's V: 0.809
Chi^2 statistics = 20.9447, df: 2, p-value: 0</pre>
<p>As we see, the value is high in this case (0.809), and the null hypothesis is rejected at the 5% level. So we can conclude that there is a relation between the two variables. This does not mean that one variable causes the other one; they both might be driven by something else (do more expensive cars have fewer gears but an automatic transmission?).</p>
<h3>Plotting categorical variables</h3>
<p>While R allows plotting two categorical variables against each other, the plot is hard to read and is not very helpful (in my opinion):</p>
<pre class="decode">plot(table(mtcarsData$am,mtcarsData$gear))</pre>
<div id="attachment_1912" style="width: 310px" class="wp-caption alignnone"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsPlot.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1912" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsPlot-300x300.png&amp;nocache=1" alt="" width="300" height="300" class="size-medium wp-image-1912" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsPlot-300x300.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsPlot-150x150.png&amp;nocache=1 150w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsPlot.png&amp;nocache=1 700w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1912" class="wp-caption-text">Default plot of a table</p></div>
<p>So I have created a function that produces a heat map for two categorical variables. It is called <span class="lang:r decode:true crayon-inline">tableplot()</span>:</p>
<pre class="decode">tableplot(mtcarsData$am,mtcarsData$gear)</pre>
<div id="attachment_1915" style="width: 310px" class="wp-caption alignnone"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsTableplot.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1915" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsTableplot-300x300.png&amp;nocache=1" alt="" width="300" height="300" class="size-medium wp-image-1915" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsTableplot-300x300.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsTableplot-150x150.png&amp;nocache=1 150w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsTableplot.png&amp;nocache=1 700w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1915" class="wp-caption-text">Tableplot for the two categorical variables</p></div>
<p>It is based on the <span class="lang:r decode:true crayon-inline">table()</span> function and uses the frequencies inside the table for the colours:</p>
<pre class="decode">table(mtcarsData$am,mtcarsData$gear) / length(mtcarsData$am)</pre>
<pre>        3       4       5
a 0.46875 0.12500 0.00000
m 0.00000 0.25000 0.15625</pre>
<p>The darker sectors mean that there is a higher concentration of values, while the white ones correspond to zeroes. So, in our example, we see that the largest group of cars has an automatic transmission with three gears. Furthermore, the plot shows that there is some sort of relation between the two variables: the cars with automatic transmissions tend to have fewer gears, while the ones with manual transmissions have more (something we&#8217;ve already noticed in the previous subsection).</p>
<h2>Association between the categorical and numerical variables</h2>
<p>While Cramer&#8217;s V can also be used for the measurement of association between the variables in different scales, there are better instruments. For example, some analysts recommend using the intraclass correlation coefficient when measuring the relation between numerical and categorical variables. But there is a simpler option, which involves calculating the coefficient of multiple correlation between the variables. This is implemented in the <span class="lang:r decode:true crayon-inline">mcor()</span> function of <span class="lang:r decode:true crayon-inline">greybox</span>. The `y` variable should be numerical, while `x` can be of any type. The function expands all the factors and runs a regression via the <span class="lang:r decode:true crayon-inline">.lm.fit()</span> function, returning the square root of the coefficient of determination. If the variables are linearly related, then the returned value will be close to one. Otherwise it will be closer to zero. The function also returns the F statistics from the regression, the associated p-value and the number of degrees of freedom (the hypothesis is formulated similarly to the one in the <span class="lang:r decode:true crayon-inline">cramer()</span> function).</p>
<p>Here&#8217;s how it works:</p>
<pre class="decode">mcor(mtcarsData$am,mtcarsData$mpg)</pre>
<pre>Multiple correlations value: 0.5998
F-statistics = 16.8603, df: 1, df resid: 30, p-value: 3e-04</pre>
<p>In this example, a simple linear regression of mpg on the set of dummy variables is constructed, and we can conclude that there is a linear relation between the variables, and that this relation is statistically significant.</p>
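<p>Given that <span class="lang:r decode:true crayon-inline">mcor()</span> returns the square root of the coefficient of determination of such a regression, the value above can be reproduced with a plain <span class="lang:r decode:true crayon-inline">lm()</span>:</p>

```r
# mcor() boils down to sqrt(R^2) from regressing the numerical variable
# on the expanded dummies of the categorical one
fit <- lm(mpg ~ factor(am), data=mtcars)
round(sqrt(summary(fit)$r.squared), 4)
# 0.5998
```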
<h2>Association between several variables</h2>
<h3>Measures of association</h3>
<p>When you deal with datasets (i.e. data frames or matrices), then you can use <span class="lang:r decode:true crayon-inline">cor()</span> function in order to calculate the correlation coefficients between the variables in the data. But when you have a mixture of numerical and categorical variables, the situation becomes more difficult, as the correlation does not make sense for the latter. This motivated me to create a function that uses either <span class="lang:r decode:true crayon-inline">cor()</span>, or <span class="lang:r decode:true crayon-inline">cramer()</span>, or <span class="lang:r decode:true crayon-inline">mcor()</span> functions depending on the types of data (see discussions of <span class="lang:r decode:true crayon-inline">cramer()</span> and <span class="lang:r decode:true crayon-inline">mcor()</span> above). The function is called <span class="lang:r decode:true crayon-inline">association()</span> or <span class="lang:r decode:true crayon-inline">assoc()</span> and returns three matrices: the values of the measures of association, their p-values and the types of the functions used between the variables. Here&#8217;s an example:</p>
<pre class="decode">assocValues &lt;- assoc(mtcarsData)
print(assocValues,digits=2)</pre>
<pre> Associations: 
 values:
        mpg  cyl  disp    hp  drat    wt  qsec   vs   am gear carb
 mpg   1.00 0.86 -0.85 -0.78  0.68 -0.87  0.42 0.66 0.60 0.66 0.67
 cyl   0.86 1.00  0.92  0.84  0.70  0.78  0.59 0.82 0.52 0.53 0.62
 disp -0.85 0.92  1.00  0.79 -0.71  0.89 -0.43 0.71 0.59 0.77 0.56
 hp   -0.78 0.84  0.79  1.00 -0.45  0.66 -0.71 0.72 0.24 0.66 0.79
 drat  0.68 0.70 -0.71 -0.45  1.00 -0.71  0.09 0.44 0.71 0.83 0.33
 wt   -0.87 0.78  0.89  0.66 -0.71  1.00 -0.17 0.55 0.69 0.66 0.61
 qsec  0.42 0.59 -0.43 -0.71  0.09 -0.17  1.00 0.74 0.23 0.63 0.67
 vs    0.66 0.82  0.71  0.72  0.44  0.55  0.74 1.00 0.10 0.62 0.69
 am    0.60 0.52  0.59  0.24  0.71  0.69  0.23 0.10 1.00 0.81 0.44
 gear  0.66 0.53  0.77  0.66  0.83  0.66  0.63 0.62 0.81 1.00 0.51
 carb  0.67 0.62  0.56  0.79  0.33  0.61  0.67 0.69 0.44 0.51 1.00
 
 p-values:
       mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
 mpg  1.00 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.01
 cyl  0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.01
 disp 0.00 0.00 1.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.07
 hp   0.00 0.00 0.00 1.00 0.01 0.00 0.00 0.00 0.18 0.00 0.00
 drat 0.00 0.00 0.00 0.01 1.00 0.00 0.62 0.01 0.00 0.00 0.66
 wt   0.00 0.00 0.00 0.00 0.00 1.00 0.34 0.00 0.00 0.00 0.02
 qsec 0.02 0.00 0.01 0.00 0.62 0.34 1.00 0.00 0.21 0.00 0.01
 vs   0.00 0.00 0.00 0.00 0.01 0.00 0.00 1.00 0.56 0.00 0.01
 am   0.00 0.01 0.00 0.18 0.00 0.00 0.21 0.56 1.00 0.00 0.28
 gear 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.09
 carb 0.01 0.01 0.07 0.00 0.66 0.02 0.01 0.01 0.28 0.09 1.00
 
 types:
      mpg    cyl      disp   hp     drat   wt     qsec   vs       am      
 mpg  "none" "mcor"   "cor"  "cor"  "cor"  "cor"  "cor"  "mcor"   "mcor"  
 cyl  "mcor" "none"   "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "cramer"
 disp "cor"  "mcor"   "none" "cor"  "cor"  "cor"  "cor"  "mcor"   "mcor"  
 hp   "cor"  "mcor"   "cor"  "none" "cor"  "cor"  "cor"  "mcor"   "mcor"  
 drat "cor"  "mcor"   "cor"  "cor"  "none" "cor"  "cor"  "mcor"   "mcor"  
 wt   "cor"  "mcor"   "cor"  "cor"  "cor"  "none" "cor"  "mcor"   "mcor"  
 qsec "cor"  "mcor"   "cor"  "cor"  "cor"  "cor"  "none" "mcor"   "mcor"  
 vs   "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "none"   "cramer"
 am   "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "none"  
 gear "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "cramer"
 carb "mcor" "cramer" "mcor" "mcor" "mcor" "mcor" "mcor" "cramer" "cramer"
      gear     carb    
 mpg  "mcor"   "mcor"  
 cyl  "cramer" "cramer"
 disp "mcor"   "mcor"  
 hp   "mcor"   "mcor"  
 drat "mcor"   "mcor"  
 wt   "mcor"   "mcor"  
 qsec "mcor"   "mcor"  
 vs   "cramer" "cramer"
 am   "cramer" "cramer"
 gear "none"   "cramer"
 carb "cramer" "none"</pre>
<p>One thing to note is that the function treats numerical variables as categorical when they have up to 10 unique values. This is useful, for example, in the case of the number of gears (`gear`) in the dataset.</p>
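<p>This behaviour is easy to inspect: the variables below are numeric in `mtcars`, but each of them has only a handful of unique values, which is why <span class="lang:r decode:true crayon-inline">assoc()</span> switches to <span class="lang:r decode:true crayon-inline">cramer()</span> or <span class="lang:r decode:true crayon-inline">mcor()</span> for them:</p>

```r
# Count the unique values of some numeric variables in mtcars
counts <- sapply(mtcars[, c("cyl", "gear", "carb")],
                 function(x) length(unique(x)))
counts
#  cyl gear carb
#    3    3    6
```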
<h2>Plots of association between several variables</h2>
<p>Similarly to the problem with <span class="lang:r decode:true crayon-inline">cor()</span>, scatterplot matrix (produced using <span class="lang:r decode:true crayon-inline">plot()</span>) is not meaningful in case of a mixture of variables:</p>
<pre class="decode">plot(mtcarsData)</pre>
<div id="attachment_1913" style="width: 310px" class="wp-caption alignnone"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsScatter.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1913" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsScatter-300x300.png&amp;nocache=1" alt="" width="300" height="300" class="size-medium wp-image-1913" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsScatter-300x300.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsScatter-150x150.png&amp;nocache=1 150w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsScatter.png&amp;nocache=1 700w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1913" class="wp-caption-text">Default scatter plot matrix</p></div>
<p>It makes sense to use a scatterplot for numerical variables, <span class="lang:r decode:true crayon-inline">tableplot()</span> for categorical ones and <span class="lang:r decode:true crayon-inline">boxplot()</span> for a mixture of the two. So, there is the function <span class="lang:r decode:true crayon-inline">spread()</span> in <span class="lang:r decode:true crayon-inline">greybox</span> that creates something more meaningful. It uses the same algorithm as the <span class="lang:r decode:true crayon-inline">assoc()</span> function, but produces plots instead of calculating measures of association. So, `gear` will be considered as categorical, and the function will produce either <span class="lang:r decode:true crayon-inline">boxplot()</span> or <span class="lang:r decode:true crayon-inline">tableplot()</span>, when plotting it against other variables.</p>
<p>Here&#8217;s an example:</p>
<pre class="decode">spread(mtcarsData)</pre>
<div id="attachment_1914" style="width: 310px" class="wp-caption alignnone"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpread.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1914" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpread-300x300.png&amp;nocache=1" alt="" width="300" height="300" class="size-medium wp-image-1914" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpread-300x300.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpread-150x150.png&amp;nocache=1 150w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpread.png&amp;nocache=1 700w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1914" class="wp-caption-text">Spread matrix</p></div>
<p>This plot demonstrates, for example, that the number of carburetors influences fuel consumption (something that we could not have spotted in the case of <span class="lang:r decode:true crayon-inline">plot()</span>). Notice also that the number of gears has a non-linear relation with fuel consumption. So constructing a model with dummy variables for the number of gears might be a reasonable thing to do.</p>
<p>The function also has the parameter `log`, which will transform all the numerical variables using logarithms. This is handy when you suspect a non-linear relation between the variables. Finally, there is a parameter `histograms`, which will plot either histograms or barplots on the diagonal.</p>
<pre class="decode">spread(mtcarsData, histograms=TRUE, log=TRUE)</pre>
<div id="attachment_1921" style="width: 310px" class="wp-caption alignnone"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpreadLogs.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1921" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpreadLogs-300x300.png&amp;nocache=1" alt="" width="300" height="300" class="size-medium wp-image-1921" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpreadLogs-300x300.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpreadLogs-150x150.png&amp;nocache=1 150w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2019/01/mtcarsSpreadLogs.png&amp;nocache=1 700w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1921" class="wp-caption-text">Spread matrix in logs</p></div>
<p>The plot demonstrates that the `disp` has a strong non-linear relation with `mpg`, and, similarly, `drat` and `hp` also influence `mpg` in a non-linear fashion.</p>
<h2>Regression diagnostics</h2>
<p>One of the problems of linear regression that can be diagnosed prior to the model construction is multicollinearity. The conventional way of doing such diagnostics is to calculate the variance inflation factor (VIF) after constructing the model. However, VIF is not easy to interpret, because it lies in \((1,\infty)\). Coefficients of determination from the linear regressions of each explanatory variable on the others are easier to interpret and work with. If such a coefficient is equal to one, then there are some perfectly correlated explanatory variables in the dataset. If it is equal to zero, then they are not linearly related.</p>
<p>There is a function <span class="lang:r decode:true crayon-inline">determination()</span> or <span class="lang:r decode:true crayon-inline">determ()</span> in <span class="lang:r decode:true crayon-inline">greybox</span> that returns the set of coefficients of determination for the explanatory variables. The good thing is that this can be done before constructing any model. In our example, the first column, `mpg` is the response variable, so we can diagnose the multicollinearity the following way:</p>
<pre class="decode">determination(mtcarsData[,-1])</pre>
<pre>       cyl      disp        hp      drat        wt      qsec        vs 
 0.9349544 0.9537470 0.8982917 0.7036703 0.9340582 0.8671619 0.8017720 
        am      gear      carb 
 0.7924392 0.8133441 0.8735577</pre>
<p>As we can see from the output above, `disp` is the most linearly related to the other explanatory variables, so including it in the model might cause multicollinearity, which would decrease the efficiency of the estimates of parameters.</p>
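<p>The connection with VIF mentioned above is direct, because \(\mathrm{VIF}_j = \frac{1}{1 - R^2_j}\). As a quick check, the value for `disp` reported by <span class="lang:r decode:true crayon-inline">determination()</span> can be reproduced with a plain regression of `disp` on the other explanatory variables:</p>

```r
# R^2 of disp regressed on all the other variables except mpg (the response)
R2 <- summary(lm(disp ~ . - mpg, data=mtcars))$r.squared
round(R2, 4)
# 0.9537
round(1 / (1 - R2), 2)  # the respective VIF
```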
<p>Message <a href="https://openforecast.org/2019/01/07/marketing-analytics-with-greybox/">Analytics with greybox</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2019/01/07/marketing-analytics-with-greybox/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>greybox 0.3.0 &#8211; what&#8217;s new</title>
		<link>https://openforecast.org/2018/08/07/greybox-0-3-0-whats-new/</link>
					<comments>https://openforecast.org/2018/08/07/greybox-0-3-0-whats-new/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Tue, 07 Aug 2018 16:06:26 +0000</pubDate>
				<category><![CDATA[Package greybox for R]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Univariate models]]></category>
		<category><![CDATA[greybox]]></category>
		<category><![CDATA[regression]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=1778</guid>

					<description><![CDATA[<p>Three months have passed since the initial release of greybox on CRAN. I would not say that the package develops like crazy, but there have been some changes since May. Let&#8217;s have a look. We start by loading both greybox and smooth: library(greybox) library(smooth) Rolling Origin First of all, ro() function now has its own [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2018/08/07/greybox-0-3-0-whats-new/">greybox 0.3.0 &#8211; what&#8217;s new</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Three months have passed since the initial release of <span class="lang:r decode:true crayon-inline">greybox</span> on CRAN. I would not say that the package develops like crazy, but there have been some changes since May. Let&#8217;s have a look. We start by loading both <span class="lang:r decode:true crayon-inline">greybox</span> and <span class="lang:r decode:true crayon-inline">smooth</span>:</p>
<pre class="decode">library(greybox)
library(smooth)</pre>
<h3>Rolling Origin</h3>
<p>First of all, <span class="lang:r decode:true crayon-inline">ro()</span> function now has its own class and works with <span class="lang:r decode:true crayon-inline">plot()</span> function, so that you can have a visual representation of the results. Here&#8217;s an example:</p>
<pre class="decode">x <- rnorm(100,100,10)
ourCall <- "es(data, h=h, intervals=TRUE)"
ourValue <- c("forecast", "lower", "upper")
ourRO <- ro(x,h=20,origins=5,ourCall,ourValue,co=TRUE)
plot(ourRO)</pre>
<div id="attachment_1781" style="width: 310px" class="wp-caption alignnone"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExample.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1781" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExample-300x175.png&amp;nocache=1" alt="" width="300" height="175" class="size-medium wp-image-1781" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExample-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExample-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExample-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExample.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1781" class="wp-caption-text">Example of the plot of rolling origin function</p></div>
<p>Each point on the produced graph corresponds to an origin, and the straight lines correspond to the forecasts. Given that we asked for point forecasts and for the lower and upper bounds of the prediction interval, we have three respective lines. By plotting the results of the rolling origin experiment, we can see whether the model is stable or not. Just compare the previous graph with the one produced from the call to Holt's model:</p>
<pre class="decode">ourCall <- "es(data, model='AAN', h=h, intervals=TRUE)"
ourRO <- ro(x,h=20,origins=5,ourCall,ourValue,co=TRUE)
plot(ourRO)</pre>
<div id="attachment_1782" style="width: 310px" class="wp-caption alignnone"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExampleAAN.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1782" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExampleAAN-300x175.png&amp;nocache=1" alt="" width="300" height="175" class="size-medium wp-image-1782" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExampleAAN-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExampleAAN-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExampleAAN-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/roPlotExampleAAN.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1782" class="wp-caption-text">Example of the plot of rolling origin function with ETS(A,A,N)</p></div>
<p>Holt's model is not suitable for this time series, so its forecasts are less stable than those of the automatically selected model in the previous case (which is ETS(A,N,N)).</p>
<p>Once again, there is a vignette with examples for the <span class="lang:r decode:true crayon-inline">ro()</span> function, <a href="https://cran.r-project.org/web/packages/greybox/vignettes/ro.html" rel="noopener noreferrer" target="_blank">have a look</a> if you want to know more.</p>
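<p>To build intuition for what <span class="lang:r decode:true crayon-inline">ro()</span> automates, the core of a rolling origin evaluation can be sketched in a few lines of base R. This is a simplified illustration using a naive forecast on made-up data, not the actual implementation of the function:</p>

```r
# A minimal sketch of rolling origin evaluation in base R.
# The "method" here is the naive forecast (repeat the last observation).
set.seed(41)
y <- 100 + cumsum(rnorm(120, 0, 5))   # an artificial random walk series
h <- 3        # forecast horizon
origins <- 5  # number of rolling origins

errors <- matrix(NA, origins, h)
for(i in 1:origins){
    # The training sample grows by one observation with each origin
    trainEnd <- length(y) - h - origins + i
    train <- y[1:trainEnd]
    holdout <- y[trainEnd + 1:h]
    errors[i,] <- holdout - rep(train[trainEnd], h)
}
# Mean absolute error for each forecast horizon step
colMeans(abs(errors))
```

Collecting the errors per origin and per horizon like this is what allows the stability of a method to be judged, rather than its performance on a single holdout.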
<h3>ALM - Advanced Linear Model</h3>
<p>Yes, there is the "Generalised Linear Model" in R, which implements Poisson, Gamma, Binomial and other regressions. Yes, there are smaller packages implementing models with more exotic distributions. But I needed several regression models based on the Laplace distribution, the Folded Normal distribution, the Chi-squared distribution and one new mysterious distribution, currently called the "S distribution". I needed them in one place and in one format: properly estimated via likelihood, returning confidence intervals and information criteria, and able to produce forecasts. I also wanted them to work similarly to <span class="lang:r decode:true crayon-inline">lm()</span>, so that the learning curve would not be too steep. So, here it is: the function <span class="lang:r decode:true crayon-inline">alm()</span>. It works quite similarly to <span class="lang:r decode:true crayon-inline">lm()</span>:</p>
<pre class="decode">xreg <- cbind(rfnorm(100,1,10),rnorm(100,50,5))
xreg <- cbind(100+0.5*xreg[,1]-0.75*xreg[,2]+rlaplace(100,0,3),xreg,rnorm(100,300,10))
colnames(xreg) <- c("y","x1","x2","Noise")
inSample <- xreg[1:80,]
outSample <- xreg[-c(1:80),]

ourModel <- alm(y~x1+x2, inSample, distribution="laplace")
summary(ourModel)</pre>
<p>Here's the output of the summary: </p>
<pre>Distribution used in the estimation: Laplace
Coefficients:
            Estimate Std. Error Lower 2.5% Upper 97.5%
(Intercept) 95.85207    0.36746   95.12022    96.58392
x1           0.59618    0.02479    0.54681     0.64554
x2          -0.67865    0.00622   -0.69103    -0.66626
ICs:
     AIC     AICc      BIC     BICc 
474.2453 474.7786 483.7734 484.9419</pre>
<p>And here's the respective plot of the forecast:</p>
<pre class="decode">plot(forecast(ourModel,outSample))</pre>
<div id="attachment_1787" style="width: 310px" class="wp-caption alignnone"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/almLaplaceExample.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1787" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/almLaplaceExample-300x175.png&amp;nocache=1" alt="" width="300" height="175" class="size-medium wp-image-1787" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/almLaplaceExample-300x175.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/almLaplaceExample-768x448.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/almLaplaceExample-1024x597.png&amp;nocache=1 1024w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/almLaplaceExample.png&amp;nocache=1 1200w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-1787" class="wp-caption-text">Forecast from lm with Laplace distribution</p></div>
<p>The thing that is currently missing in the function is prediction intervals, but these will be added in upcoming releases.</p>
<p>The likelihood-based estimation allows comparing models with different distributions using information criteria. Here is, for example, the model we get if we assume the S distribution (which has fatter tails than Laplace):</p>
<pre class="decode">summary(alm(y~x1+x2, inSample, distribution="s"))</pre>
<pre>Distribution used in the estimation: S
Coefficients:
            Estimate Std. Error Lower 2.5% Upper 97.5%
(Intercept) 95.61244    0.23386   95.14666    96.07821
x1           0.56144    0.00721    0.54708     0.57581
x2          -0.66867    0.00302   -0.67470    -0.66265
ICs:
     AIC     AICc      BIC     BICc 
482.9358 483.4692 492.4639 493.6325</pre>
<p>As you can see, the information criteria for the S distribution are higher than those for the Laplace, so we can conclude that the first model is better in terms of ICs.</p>
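<p>The same kind of IC-based comparison can be carried out with plain <span class="lang:r decode:true crayon-inline">lm()</span> as well. A small self-contained sketch with made-up data (not the example above), just to show the mechanics:</p>

```r
# Sketch: comparing two regression models via AIC in base R.
# model1 uses the true structure, model2 omits a relevant variable.
set.seed(42)
x1 <- rnorm(100, 10, 2)
x2 <- rnorm(100, 50, 5)
y <- 100 + 0.5*x1 - 0.75*x2 + rnorm(100, 0, 3)
model1 <- lm(y ~ x1 + x2)
model2 <- lm(y ~ x1)
# The model with the lower AIC is preferred
AIC(model1, model2)
```

The comparison is only meaningful when both likelihoods are computed on the same data, which is exactly why having all the distributions in one estimation framework helps.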
<p><strong>Note</strong> that at the moment the AICc and BICc are not correct for non-normal models (at least their derivation needs to be double-checked, which I haven't done yet), so don't rely on them too much.</p>
<p>I intend to add several other distributions that either are not available in R or are implemented unsatisfactorily (from my point of view) - the function is written in quite a flexible way, so this should not be difficult to do. If you have any preferences, please add them on GitHub, <a href="https://github.com/config-i1/greybox/issues/13" rel="noopener noreferrer" target="_blank">here</a>.</p>
<p>I also want to implement mixture distributions, so that the things discussed in <a href="/en/2017/11/07/multiplicative-state-space-models-for-intermittent-time-series/">the paper on the intermittent state-space model</a> can also be implemented using pure regression.</p>
<p>Finally, now that I have <span class="lang:r decode:true crayon-inline">alm()</span>, we can select between regression models with different distributions (with the <span class="lang:r decode:true crayon-inline">stepwise()</span> function) or even combine them using AIC weights (hello, <span class="lang:r decode:true crayon-inline">lmCombine()</span>!). Yes, I know that it sounds crazy (think of the pool of models in this case), but this should be fun!</p>
<p><a name="RMCB"></a></p>
<h3>Regression for Multiple Comparison with the Best</h3>
<p><strong>Please, note that this part of the post has been updated on 02.03.2020 in order to reflect the changes in the v0.5.9 version of the package.</strong><br />
One of the typical tasks in forecasting is to evaluate the performance of different methods on a holdout. In order to do that, it is common to use statistical tests, the most popular of which is the <a href="http://www.jmlr.org/papers/volume7/demsar06a/demsar06a.pdf">Nemenyi</a> / <a href="https://doi.org/10.1016/j.ijforecast.2004.10.003">MCB</a> (Multiple Comparison with the Best) test. The test implemented in the greybox package uses similar principles and relies on the ranks of methods, but instead of taking averages and then applying studentised distances, it constructs a regression on the ranked data. This way we compare the median performance of the different methods (the same way as it is done in the classical MCB) and produce parametric confidence intervals for the parameters. The test is based on a simple linear model with dummy variables for each provided method (1 if the error corresponds to the method and 0 otherwise). Here's an example of how it works:</p>
<pre class="decode">ourData <- cbind(rnorm(100,0,10), rnorm(100,-2,5), rnorm(100,2,6), rlaplace(100,1,5))
colnames(ourData) <- c("Method A","Method B","Method C","Method D")

ourTest <- rmcb(ourData, level=0.95)</pre>
<p>By default the function produces a graph in the MCB (Multiple Comparison with the Best) style:</p>
<div id="attachment_2370" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormNew.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-2370" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormNew-300x180.png&amp;nocache=1" alt="" width="300" height="180" class="size-medium wp-image-2370" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormNew-300x180.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormNew-768x461.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormNew.png&amp;nocache=1 1000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-2370" class="wp-caption-text">RMCB example, MCB style plot</p></div>
<p>If we compare the results of the test with the mean rank values, we will see that they are the same:</p>
<pre class="decode">apply(t(apply(ourData,1,rank)),2,mean)</pre>
<pre>Method A Method B Method C Method D 
    2.40     2.06     2.75     2.79</pre>
<pre class="decode">ourTest$mean</pre>
<pre>Method B Method A Method C Method D 
    2.06     2.40     2.75     2.79</pre>
<p>This also reflects how the data was generated. Notice that Method D was generated from the Laplace distribution with mean 1, but the test still gave the correct answer in this situation, because the Laplace distribution is symmetric and the sample size is large enough. The main point of the test, however, is that we get confidence intervals for each parameter, so we can see whether the differences between the methods are significant: if the intervals intersect, then they are not.</p>
<p>The regression model used in the calculation is saved in the <span class="lang:r decode:true crayon-inline">model</span> variable, and you can request a basic summary of it:</p>
<pre class="decode">summary(ourTest$model)</pre>
<pre>            Estimate Std. Error  Lower 2.5% Upper 97.5%
(Intercept)     2.40  0.1083601  2.18761804  2.61238196
Method B       -0.34  0.1532444 -0.64035346 -0.03964654
Method C        0.35  0.1532444  0.04964654  0.65035346
Method D        0.39  0.1532444  0.08964654  0.69035346</pre>
<p>But, please, keep in mind that this is not a proper "lm" object, so you cannot do much with it.</p>
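<p>Still, the underlying principle can be reproduced approximately with a plain <span class="lang:r decode:true crayon-inline">lm()</span>: rank the errors within each series, stack the ranks into long format, and regress them on method dummies. This is a sketch of the idea on made-up data, not the exact internals of <span class="lang:r decode:true crayon-inline">rmcb()</span>:</p>

```r
# Sketch of the rank-regression idea using plain lm()
set.seed(7)
ourData <- cbind(rnorm(100,0,10), rnorm(100,-2,5), rnorm(100,2,6))
colnames(ourData) <- c("Method A","Method B","Method C")
# Rank the errors within each series (row), then stack into long format
ranks <- t(apply(ourData, 1, rank))
longData <- data.frame(rank = as.vector(ranks),
                       method = factor(rep(colnames(ourData),
                                           each = nrow(ourData))))
# With treatment contrasts the intercept is the mean rank of the first
# method and the other coefficients are differences from it
rankModel <- lm(rank ~ method, data = longData)
confint(rankModel, level = 0.95)
```

Intervals of the differences that do not cover zero then indicate a significant difference from the reference method, which is the same logic the test applies to every pair.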
<p>The function also reports the p-value from the F-test of the regression, which tests the standard hypothesis that all the parameters are equal to zero.</p>
<p>We can also produce plots with vertical lines that connect the methods belonging to the same group (no statistically significant difference, i.e. intersecting intervals). Here's an example for the same data:</p>
<pre class="decode">plot(ourTest, outplot="lines")</pre>
<div id="attachment_2371" style="width: 310px" class="wp-caption aligncenter"><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormLinesNew.png&amp;nocache=1"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-2371" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormLinesNew-300x180.png&amp;nocache=1" alt="" width="300" height="180" class="size-medium wp-image-2371" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormLinesNew-300x180.png&amp;nocache=1 300w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormLinesNew-768x461.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/08/rmcExampleNormLinesNew.png&amp;nocache=1 1000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a><p id="caption-attachment-2371" class="wp-caption-text">RMCB example, lines plot</p></div>
<p>If you want to tune the plot, you can always do this using the standard plot parameters:</p>
<pre class="decode">plot(ourTest, xlab="Models", ylab="Errors")</pre>
<p>Also, given that this is a flexible plot method, you can tune the parameters of the canvas using the <span class="lang:r decode:true crayon-inline">par()</span> function, as is usually done in R.</p>
<h3>What else?</h3>
<p>Several methods have been moved from <span class="lang:r decode:true crayon-inline">smooth</span> to <span class="lang:r decode:true crayon-inline">greybox</span>. These include:</p>
<ul>
<li>pointLik() - returns point likelihoods, discussed in <a href="http://kourentzes.com/forecasting/2018/06/20/isf2018-presentation-beyond-summary-performance-metrics-for-forecast-selection-and-combination/" rel="noopener noreferrer" target="_blank">our research with Nikos</a>;</li>
<li>pAIC, pBIC, pAICc, pBICc - point values of the respective information criteria, from <a href="http://kourentzes.com/forecasting/2018/06/20/isf2018-presentation-beyond-summary-performance-metrics-for-forecast-selection-and-combination/" rel="noopener noreferrer" target="_blank">the same research</a>;</li>
<li>nParam() - returns the number of estimated parameters in the model (+ variance);</li>
<li>errorType() - returns the type of error used in the model (Additive / Multiplicative).</li>
</ul>
<p>Furthermore, as you might have already noticed, I've implemented several distribution functions:</p>
<ul>
<li>Folded normal distribution;</li>
<li>Laplace distribution;</li>
<li>S distribution.</li>
</ul>
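<p>The Laplace distribution, for instance, is simple enough to write down directly. Here is a base-R sketch of its density, assuming the standard location-scale parameterisation; the hypothetical <span class="lang:r decode:true crayon-inline">dlaplaceSketch()</span> below is only an illustration, not the packaged function:</p>

```r
# Density of the Laplace distribution with location mu and scale b:
# f(x) = exp(-|x - mu| / b) / (2 * b)
dlaplaceSketch <- function(x, mu = 0, b = 1){
    exp(-abs(x - mu) / b) / (2 * b)
}
dlaplaceSketch(0)   # peak of the standard Laplace density: 0.5
# The density integrates to one, as it should
integrate(dlaplaceSketch, -Inf, Inf)$value
```
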
<p>Finally, there is also a function called <span class="lang:r decode:true crayon-inline">lmDynamic()</span>, which uses pAIC in order to produce dynamic linear regression models. But that deserves a separate post.</p>
<p>That's it for now. See you in greybox 0.4.0!</p>
<p>Message <a href="https://openforecast.org/2018/08/07/greybox-0-3-0-whats-new/">greybox 0.3.0 &#8211; what&#8217;s new</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2018/08/07/greybox-0-3-0-whats-new/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>greybox package for R</title>
		<link>https://openforecast.org/2018/05/04/greybox-package-for-r/</link>
					<comments>https://openforecast.org/2018/05/04/greybox-package-for-r/#respond</comments>
		
		<dc:creator><![CDATA[Ivan Svetunkov]]></dc:creator>
		<pubDate>Fri, 04 May 2018 12:22:35 +0000</pubDate>
				<category><![CDATA[Applied forecasting]]></category>
		<category><![CDATA[Package greybox for R]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[greybox]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://openforecast.org/?p=1718</guid>

					<description><![CDATA[<p>I am delighted to announce a new package on CRAN. It is called &#8220;greybox&#8221;. I know, what my American friends will say, as soon as they see the name &#8211; they will claim that there is a typo, and that it should be &#8220;a&#8221; instead of &#8220;e&#8221;. But in fact no mistake was made &#8211; [&#8230;]</p>
<p>Message <a href="https://openforecast.org/2018/05/04/greybox-package-for-r/">greybox package for R</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><a href="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/05/greybox2.png&amp;nocache=1"><img loading="lazy" decoding="async" src="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/05/greybox2-260x300.png&amp;nocache=1" alt="Hexagon for greybox" width="260" height="300" class="size-medium wp-image-1719" srcset="https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/05/greybox2-260x300.png&amp;nocache=1 260w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/05/greybox2-768x888.png&amp;nocache=1 768w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/05/greybox2-886x1024.png&amp;nocache=1 886w, https://openforecast.org/wp-content/webpc-passthru.php?src=https://openforecast.org/wp-content/uploads/2018/05/greybox2.png&amp;nocache=1 1206w" sizes="auto, (max-width: 260px) 100vw, 260px" /></a></p>
<p>I am delighted to announce a new package on CRAN. It is called &#8220;greybox&#8221;. I know what my American friends will say as soon as they see the name &#8211; they will claim that there is a typo and that it should be &#8220;a&#8221; instead of &#8220;e&#8221;. But in fact no mistake was made &#8211; I used the British spelling for the name, and I totally understand that at some point I might regret this&#8230;</p>
<p>So, what is &#8220;greybox&#8221;? Wikipedia <a href="https://en.wikipedia.org/wiki/Grey_box_model" rel="noopener noreferrer" target="_blank">tells us that grey box</a> is a model that &#8220;combines a partial theoretical structure with data to complete the model&#8221;. This means that almost any statistical model can be considered as a grey box, thus making the package potentially quite flexible and versatile.</p>
<p>But why do we need a new package on CRAN?</p>
<p>First, there were several functions in the <a href="/en/tag/smooth/">smooth</a> package that did not belong there, and there are several functions in the <a href="https://github.com/trnnick/TStools" rel="noopener noreferrer" target="_blank">TStools</a> package that can be united under the topic of model building. They focus on multivariate regression analysis rather than on state-space models, time series smoothing or anything else. It would make more sense to find them their own <del>home</del> package. An example of such a function is <span class="lang:r decode:true crayon-inline">ro()</span> &#8211; the <a href="https://cran.r-project.org/web/packages/greybox/vignettes/ro.html" rel="noopener noreferrer" target="_blank">Rolling Origin</a> function that Yves and I wrote in 2016 on our way to the International Symposium on Forecasting. Arguably, this function can be used not only for assessing the accuracy of forecasting models, but also for variable / model selection.</p>
<p>Second, in one of my side projects I needed to work more with multivariate regressions, and I had several ideas I wanted to test. One of those is creating a combined multivariate regression from several models using information criteria weights. The existing implementations did not satisfy me, so I ended up writing a function <span class="lang:r decode:true crayon-inline">lmCombine()</span> that does that. In addition, our research together with Yves Sagaert indicates that there is a nice solution to the fat regression problem (when the number of parameters is larger than the number of observations) using information criteria. Uploading those functions to <span class="lang:r decode:true crayon-inline">smooth</span> did not sound right, but having <span class="lang:r decode:true crayon-inline">greybox</span> helps a lot. There are other ideas that I have in mind, and they don&#8217;t fit in the other packages.</p>
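<p>The principle behind such a combination &#8211; weighting models by information criteria &#8211; can be sketched with plain <span class="lang:r decode:true crayon-inline">lm()</span> and Akaike weights. A simplified illustration on made-up data, not the package&#8217;s implementation:</p>

```r
# Sketch of combining regressions via AIC weights in base R
set.seed(1)
x1 <- rnorm(100); x2 <- rnorm(100); x3 <- rnorm(100)
y <- 2 + x1 - 0.5*x2 + rnorm(100)
models <- list(lm(y ~ x1), lm(y ~ x1 + x2), lm(y ~ x1 + x2 + x3))
ICs <- sapply(models, AIC)
# Akaike weights: models with lower AIC receive higher weight
weights <- exp(-0.5*(ICs - min(ICs)))
weights <- weights / sum(weights)
# Combined fit as the weighted sum of the individual fitted values
combinedFit <- rowSums(sapply(models, fitted) *
                       rep(weights, each = length(y)))
round(weights, 3)
```

The weights sum to one, so the combined fit is a convex combination of the individual model fits; the same weighting can be applied to parameters or forecasts.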
<p>Finally, I could not find satisfactory (from my point of view) packages on CRAN that would focus on multivariate model building and forecasting &#8211; the usual focus is on analysis instead (including time series analysis). Another thing is the obsession of many packages with p-values and hypothesis testing, which was yet another motivator for me to develop a package that would be completely hypothesis-free (at the 95% level). As a result, if you work with the functions from <span class="lang:r decode:true crayon-inline">greybox</span>, you might notice that they produce confidence intervals instead of p-values (because I find them more informative and useful). Besides, I needed good instruments for promotional modelling for several projects, and it was easier to implement them myself than to compile them from different functions from different packages.</p>
<p>Keeping that in mind, it makes sense to briefly discuss what is already available there. I&#8217;ve already discussed how the <span class="lang:r decode:true crayon-inline">xregExpander()</span> and <span class="lang:r decode:true crayon-inline">stepwise()</span> functions work <a href="/en/2018/02/10/smooth-package-for-r-common-ground-part-iv-exogenous-variables-advanced-stuff/">in one of the previous posts</a>, and these functions are now available in <span class="lang:r decode:true crayon-inline">greybox</span> instead of <span class="lang:r decode:true crayon-inline">smooth</span>. However, I have not covered either <span class="lang:r decode:true crayon-inline">lmCombine()</span> or <span class="lang:r decode:true crayon-inline">ro()</span> yet. While <span class="lang:r decode:true crayon-inline">lmCombine()</span> is still under construction and works only for normal cases (fat regression can be solved, but not 100% efficiently), <span class="lang:r decode:true crayon-inline">ro()</span> has worked efficiently for several years already. So I created a detailed <a href="https://cran.r-project.org/web/packages/greybox/vignettes/ro.html" rel="noopener noreferrer" target="_blank">vignette</a> explaining what rolling origin is, how the function works and how to use it. If you are interested in finding out more, <a href="https://cran.r-project.org/web/packages/greybox/vignettes/ro.html" rel="noopener noreferrer" target="_blank">check it out on CRAN</a>.</p>
<p>To wrap up, the <span class="lang:r decode:true crayon-inline">greybox</span> package focuses on model building and forecasting, and from now on it will be updated periodically.</p>
<p>As a final note, I plan to do the following in <span class="lang:r decode:true crayon-inline">greybox</span> in future releases:</p>
<ol>
<li>Move <span class="lang:r decode:true crayon-inline">nemenyi()</span> function from <a href="https://github.com/trnnick/TStools" rel="noopener noreferrer" target="_blank">TStools</a> to <a href="https://github.com/config-i1/greybox" rel="noopener noreferrer" target="_blank">greybox</a>;</li>
<li>Develop functions for promotional modelling;</li>
<li>Write a function for multiple correlation coefficients (will be used for multicollinearity analysis);</li>
<li>Implement variables selection based on rolling origin evaluation;</li>
<li>Stepwise regression and combinations of models based on Laplace and other distributions;</li>
<li>AICc for Laplace and other distributions;</li>
<li>Solve fat regression problem via combination of regression models (sounds crazy, right?);</li>
<li><span class="lang:r decode:true crayon-inline">xregTransformer</span> &#8211; non-linear transformation of the provided xreg variables;</li>
<li>Other cool stuff.</li>
</ol>
<p>If you have any thoughts on what to implement, leave a comment &#8211; I will consider your idea.</p>
<p>Message <a href="https://openforecast.org/2018/05/04/greybox-package-for-r/">greybox package for R</a> first appeared on <a href="https://openforecast.org">Open Forecasting</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://openforecast.org/2018/05/04/greybox-package-for-r/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
