
9.1 Nominal scale

As discussed in Section 1.2, not all scales support the more advanced operations (such as taking the mean of a variable in the ordinal scale). This means that if we want to analyse relations between variables, we need to use an appropriate instrument. The coefficients that show relations between variables are called “measures of association”. We start their discussion with the simplest scale - nominal.

There are several measures of association for variables in the nominal scale. They are all based on calculating the numbers of specific values of variables, but use different formulae. The first one is called the contingency coefficient \(\phi\) and can only be calculated between variables that take two values each. As the name says, this measure is based on the contingency table. Here is an example:

table(mtcarsData$vs,mtcarsData$am)
##           
##            automatic manual
##   V-shaped        12      6
##   Straight         7      7

The \(\phi\) coefficient is calculated as: \[\begin{equation} \phi = \frac{n_{1,1} n_{2,2} - n_{1,2} n_{2,1}}{\sqrt{n_{1,\cdot}\times n_{2,\cdot}\times n_{\cdot,1}\times n_{\cdot,2}}} , \tag{9.1} \end{equation}\] where \(n_{i,j}\) is the element of the table in row \(i\) and column \(j\), \(n_{i,\cdot}=\sum_{j}n_{i,j}\) is the sum in row \(i\) and \(n_{\cdot,j}=\sum_{i} n_{i,j}\) is the sum in column \(j\). This coefficient lies between -1 and 1 and has a simple interpretation: it will be close to 1 when the elements on the diagonal are greater than the off-diagonal ones, implying that there is a relation between the variables. The value of -1 can only be obtained when the off-diagonal elements are non-zero, while the diagonal ones are zero. Finally, if the values in the contingency table are distributed evenly, the coefficient will be equal to zero. In our case the value of \(\phi\) is:

# Formula (9.1) applied to the contingency table above
(12*7 - 6*7)/sqrt(19*13*14*18)
## [1] 0.1683451
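
If we needed to repeat this calculation for other tables, we could wrap formula (9.1) in a small function. This is only a sketch for illustration: phiCoef() below is not part of any package, it is written here just to show how the formula translates into code.

# A helper implementing formula (9.1) for an arbitrary 2 x 2 table
# (written for illustration only, not part of any package)
phiCoef <- function(tab){
    # The coefficient is only defined for 2 x 2 contingency tables
    if(any(dim(tab)!=c(2,2))){
        stop("The phi coefficient requires a 2 x 2 table.")
    }
    # Cell products in the numerator, row and column totals in the denominator
    (tab[1,1]*tab[2,2] - tab[1,2]*tab[2,1]) /
        sqrt(sum(tab[1,])*sum(tab[2,])*sum(tab[,1])*sum(tab[,2]))
}
phiCoef(table(mtcarsData$vs, mtcarsData$am))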

This is a very low value, so even if the two variables are related, the relation is not well pronounced. In order to see whether this value is statistically significantly different from zero, we could test a statistical hypothesis (hypothesis testing was discussed in Section 7):

\(H_0\): there is no relation between variables

\(H_1\): there is some relation between variables

This can be done using the \(\chi^2\) test (we discussed it in a different context in Section 8.2), the statistic for which is calculated via: \[\begin{equation} \chi^2 = \sum_{i,j} \frac{\left(n \times n_{i,j} - n_{i,\cdot} \times n_{\cdot,j}\right)^2}{n \times n_{i,\cdot} \times n_{\cdot,j}} , \tag{9.2} \end{equation}\] where \(n\) is the sum of all elements in the contingency table. The value calculated based on (9.2) will follow the \(\chi^2\) distribution with \((r-1)(c-1)\) degrees of freedom, where \(r\) is the number of rows and \(c\) is the number of columns in the contingency table. This is a proper statistical test, so it should be treated as one. We select my favourite significance level of 1% and can now conduct the test:

chisq.test(table(mtcarsData$vs,mtcarsData$am))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(mtcarsData$vs, mtcarsData$am)
## X-squared = 0.34754, df = 1, p-value = 0.5555

Given that the p-value is greater than 1%, we fail to reject the null hypothesis and can conclude that the relation does not seem to be different from zero - we do not find a relation between the variables in our data.
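
Just to demonstrate how formula (9.2) works, we could also calculate the statistic manually. This is only a sketch: note that chisq.test() applies Yates' continuity correction to \(2\times 2\) tables by default, so the value below will be somewhat larger than the one reported above (passing correct=FALSE to chisq.test() reproduces the manual value).

# Manual calculation of the chi-squared statistic via formula (9.2)
tab <- table(mtcarsData$vs, mtcarsData$am)
n <- sum(tab)
# Products of row and column totals, n_{i,.} x n_{.,j}
rowColProducts <- outer(rowSums(tab), colSums(tab))
sum((n*tab - rowColProducts)^2 / (n*rowColProducts))
# chisq.test(tab, correct=FALSE) returns the same statistic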

The main limitation of the coefficient \(\phi\) is that it only works for \(2\times 2\) tables. In reality we can have variables in the nominal scale that take several values, and it might be useful to know the relations between them. For example, we could have a variable colour, which takes the values red, green and blue, and we would want to know whether it is related to the transmission type. We do not have this variable in the data, so just for this example we will create one (using the multinomial distribution):

# Generate a random colour variable from the multinomial distribution
# (the probabilities are normalised internally by rmultinom())
colour <- c(1:3) %*% rmultinom(nrow(mtcarsData), 1,
                               c(0.4,0.5,0.6))
colour <- factor(colour, levels=c(1:3),
                 labels=c("red","green","blue"))
barplot(table(colour), xlab="Colour")

In order to measure the relation between the new variable and am, we can use Cramer’s V coefficient, which relies on the \(\chi^2\) statistic (9.2): \[\begin{equation} V = \sqrt{\frac{\chi^2}{n\times \min(r-1, c-1)}} . \tag{9.3} \end{equation}\]

Cramer’s V always lies between 0 and 1, becoming close to one only if there is some relation between the two categorical variables. The greybox package implements this coefficient in the cramer() function:

cramer(mtcarsData$am,colour)
## Cramer's V: 0.1003
## Chi^2 statistics = 0.3222, df: 2, p-value: 0.8512

The output above shows that the value of the coefficient is approximately 0.1, which is low, implying that the relation between the two variables is very weak. In addition, the p-value tells us that we fail to reject the null hypothesis of the \(\chi^2\) test (9.2) at the 1% level, so the relation does not look statistically significant. We can conclude that, according to our data, the two variables are not related (no wonder, given that we generated one of them randomly).
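
To connect this output with formula (9.3), we could reproduce the value of the coefficient manually. This is only a sketch based on the colour variable generated above (because that variable is random, the specific numbers will change from run to run, but the relation between the \(\chi^2\) statistic and Cramer’s V holds):

# Reproducing Cramer's V via formula (9.3)
tab <- table(mtcarsData$am, colour)
# For tables larger than 2 x 2, chisq.test() does not apply Yates' correction
chiSq <- chisq.test(tab)$statistic
sqrt(chiSq / (sum(tab) * min(nrow(tab)-1, ncol(tab)-1)))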

The main limitation of Cramer’s V is that it is difficult to interpret beyond saying “there is a relation”. Imagine a situation where colour is related to the variable “class” of a car, which can take 5 values. What more could we say than stating the fact that the two are related? After all, in that case we end up with a \(3\times 5\) contingency table, and it might not be possible to say how specifically one variable changes with a change in the other. Still, Cramer’s V at least provides some information about the relation between two categorical variables. A sketch of such a situation is shown below.
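
Here the carClass variable is purely hypothetical, generated at random for demonstration: cramer() handles the resulting \(3\times 5\) table just as easily, but the single number it returns does not tell us how exactly the two variables relate.

# A hypothetical class variable with five values, generated for illustration
carClass <- factor(sample(1:5, nrow(mtcarsData), replace=TRUE),
                   levels=c(1:5),
                   labels=c("A","B","C","D","E"))
# Cramer's V for the 3 x 5 contingency table of colour and carClass
cramer(colour, carClass)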