2.1 Preliminary data analysis

This book is in Open Review. I want your feedback to make the book better for you and other readers. To add your annotation, select some text and then click the on the pop-up menu. To see the annotations of others, click the button in the upper right hand corner of the page

One of the basic thing that is worth doing before starting any modelling is the preliminary data analysis. This can be done either using numerical or graphical analysis. The former is useful when you want to have a summary information about the data without trying to find detailed information about it. The latter is useful when you can spend more time, investigating relations and issues in the data. In many cases, they compliment each other.

2.1.1 Numerical analysis

In this section we will use the classical mtcars dataset from datasets package for R. It contains 32 observations with 11 variables. While all the variables are numerical, some of them are in fact categorical variables encoded as binary ones. We can check the description of the dataset in R:

?mtcars

Judging by the explanation in the R documentation, the following variables are categorical:

vs - Engine (0 = V-shaped, 1 = straight),
am - Transmission (0 = automatic, 1 = manual).

In addition, the following variables are integer numeric ones:

cyl - Number of cylinders,
hp - Gross horsepower,
gear - Number of forward gears,
carb - Number of carburetors.

All the other variables are continuous numeric.

Takign this into account, we will create a data frame, encoding the categorical variables as factors for further analysis:

mtcarsData <- data.frame(mtcars)
mtcarsData$vs <- factor(mtcarsData$vs,levels=c(0,1),labels=c("V-shaped","Straight"))
mtcarsData$am <- factor(mtcarsData$am,levels=c(0,1),labels=c("automatic","manual"))

Given that we only have two options in those variables, it is not compulsory to do this encoding, but it will help us in the further analysis.

We can start with the basic summary statistics. We remember from the scales of information (Section 1.2) that the nominal variables can be analysed only via frequencies, so this is what we can produce for them:

table(mtcarsData$vs)

## 
## V-shaped Straight 
##       18       14

table(mtcarsData$am)

## 
## automatic    manual 
##        19        13

These tables are called contingency tables, they show the frequency of appearance of values of variables. Based on this, we can conclude that the cars with V-shaped engine are met more often in the dataset than the cars with the Straight one. In addition, the automatic transmission prevails in the data. The related statistics which is useful for analysis of categorical variables is called mode. It shows which of the values happens most often in the data. Judging by the frequencies above, we can conclude that the mode for the first variable is the value “V-shaped.”

All of this is purely descriptive information, which does not provide us much. We could probably get more information if we analysed the contingency table based on these two variables:

table(mtcarsData$vs,mtcarsData$am)

##           
##            automatic manual
##   V-shaped        12      6
##   Straight         7      7

For now, we can only conclude that the cars with V-shaped engine and automatic transmission are met more often than the other cars in the dataset.

Next, we can look at the numerical variables. As we recall from Section 1.2, this scale supports all operations, so we can use quantiles, mean, standard deviation etc. Here how we can analyse, for example, the variable mpg:

setNames(mean(mtcarsData$mpg),"mean")

##     mean 
## 20.09062

quantile(mtcarsData$mpg)

##     0%    25%    50%    75%   100% 
## 10.400 15.425 19.200 22.800 33.900

setNames(median(mtcarsData$mpg),"median")

## median 
##   19.2

The output above produces:

Mean - the average value of mpg in the dataset, \(\bar{y}=\frac{1}{n}\sum_{j=1}^n y_j\).
Quantiles - the values that show, below which values the respective proportions of the dataset lie. For example, 25% of observations have mpg less than 15.425. The 25%, 50% and 75% quantiles are also called 1st, 2nd and 3rd quartiles respectively.
Median, which splits the sample in two halves. It corresponds to the 50% quantile.

If median is greater than mean, then this typically means that the distribution of the variable is skewed (it has some rare observations that have large values). This is the case in our case, we can investigate it further using skewness and kurtosis from timeDate package:

timeDate::skewness(mtcarsData$mpg)

## [1] 0.610655
## attr(,"method")
## [1] "moment"

timeDate::kurtosis(mtcarsData$mpg)

## [1] -0.372766
## attr(,"method")
## [1] "excess"

Skewness shows the asymmetry of distribution. If it is greater than zero, then the distribution has the long right tail. If it is equal to zero, then it is symmetric.

Kurtosis shows the excess of distribution (fatness of tails) in comparison with the normal distribution. If it is equal to zero, then it is the same as for normal distribution.

Based on all of this, we can conclude that the distribution of mpg is skewed and has the longer right tail. This is expected for such variable, because the cars that have higher mileage are not common in this dataset.

All the conventional statistics discussed above can be produced using the following summary for all variables in the dataset:

summary(mtcarsData)

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec              vs             am    
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   V-shaped:18   automatic:19  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   Straight:14   manual   :13  
##  Median :3.695   Median :3.325   Median :17.71                               
##  Mean   :3.597   Mean   :3.217   Mean   :17.85                               
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90                               
##  Max.   :4.930   Max.   :5.424   Max.   :22.90                               
##       gear            carb      
##  Min.   :3.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000  
##  Median :4.000   Median :2.000  
##  Mean   :3.688   Mean   :2.812  
##  3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :8.000

2.1.2 Graphical analysis

Continuing our example with mtcars dataset, we now investigate what plots can be used for different types of data. As discussed earlier, we have two categorical variables: vs and am - and they need to be treated differently than the numerical ones. We can start by producing their barplots:

barplotVS <- barplot(table(mtcarsData$vs), xlab="Type of engine")
text(barplotVS,table(mtcarsData$vs)/2,table(mtcarsData$vs),cex=1.25)

Figure 2.1: Barplot for the engine type.

This is just a graphical presentation of the contingency table we have already discussed earlier. Note that histograms do not make sense in case of categorical variables, because they assume that variables are numerical and continuous. Barplots are useful when you deal with either categorical variables or integer numerical ones. Here is what we can produce in case of the integer variable cyl:

barplotCYL <- barplot(table(mtcarsData$cyl), xlab="Number of cylinders")
text(barplotCYL,table(mtcarsData$cyl)/2,table(mtcarsData$cyl),cex=1.25)

Figure 2.2: Barplot for the number of cylinders.

Figure 2.2 shows that there are three types of cars in the data: with 4, 6 and 8 cylinders. The most frequently met is the car with 8 cylinders. Judging by the plot, half of cars have not more than 6 cylinders (median is equal to 6). All of this can be deducted from the barplot. And here how the histogram would look like for cylinders:

hist(mtcarsData$cyl)

Figure 2.3: Histogram for the number of cylinders. Do not do this!

Figure 2.3 is difficult to read, because on histogram, the bars show frequency at which continuous variable appears in pre-specified bins. In our case we would conclude that the most frequently cars in the dataset are those that have 7.5 - 8 cylinders, which is wrong and misleading. In addition, this basic plot does not have a readable label for x-axis and a meaningful title (in fact, we do not need one, given that we have caption). So, always label your axis and make sure that the text on plots is easy to understand for those people who do not work with the data.

Coming back to categorical variables, we can construct two-dimensional plots to investigate potential relations between variables. We will first try the same barplot as above, but with vs and am variables:

barplot(table(mtcarsData$vs,mtcarsData$am),
        xlab="Type of transmission", legend.text=levels(mtcarsData$vs))

Figure 2.4: Barplot for the type of engine and transmission.

Figure 2.4 provides some information about the distribution of type of engine and transmission. For example, we can say that the most often met car in the dataset is the one with automatic transmission and V-shaped engine. However, it is not possible to say much about the relation between the two variables based on this plot. So, there is an alternative presentation, which uses the heat map (tableplot() from greybox):

tableplot(mtcarsData$vs,mtcarsData$am,
          xlab="Type of engine", ylab="Type of transmission")

Figure 2.5: Heat map for the type of engine and transmission.

The idea of this plot is that the darkness of areas shows the frequency of occurrence of each specific value. The numbers inside the box show the proportions for each answer. So, we can conclude (again), that automatic transmission with V-shaped engine is met in 37.5% of cases. On the other hand, the least frequent type of car is the one with V-shaped engine and manual transmission. There might be some tendency in the dataset: the engine and transmission might be related (v-shaped with automatic vs Straigh with manual) - but it is not very well pronounced. The same table plot can be used for the analysis of relations between integer variables (and categorical). Here, for example, the plot between the number of cylinders and the type of engine:

tableplot(mtcarsData$cyl,mtcarsData$vs,
          xlab="Number of cylinders", ylab="Type of engine")

Figure 2.6: Heat map for the number of cylinders and the type of engine.

Figure 2.6 allows making more solid conclusions about the relation between the two variables: we see that with the increase of the number of cylinders, the cars tend to switch from Straight to the V-shaped engines. This has an explanation: the engines with more cylinders need to have a different geometry to fit them all, and the V shape is more suitable for them. The table plot shows clearly this relation between the two variables.

Next, we can analyse the numerical variables. We can start with the basic histogram:

hist(mtcarsData$wt, xlab="Weight", main="", probability=TRUE)
lines(density(mtcarsData$wt), col="red")

Figure 2.7: Distribution of the weights of cars.

The histogram 2.7 shows that there is a slight skewness in the data: the cars with weight from 3 to 4 thousands pounds are met more often than the cars with more than 5. The left tail of this distribution is slightly longer than the right one. Note that I have produced the probabilities on the y-axis of the plot in order to add the density curve, which smooths out the frequencies and shows how the distribution looks like.

An alternative presentation of the histogram is the boxplot, which graphically presents quantiles of distribution:

boxplot(mtcarsData$wt, ylab="Weight")
points(mean(mtcarsData$wt),col="red", pch=16)

Figure 2.8: Boxplot of the variable weight.

This plot has the box in the middle, the whiskers on the sides, points at the top and the red point at the centre. The box shows 1st, 2nd and 3rd quartiles of distribution, thus the black line in the middle is the median. The distance between the 1st and the 3rd quartiles is called “Interquartile range” (IQR) and is used for the calculation of the interval (1st / 3rd quartile \(\pm 1.5 \times\)IQR), which corresponds roughly to the 99.3% interval (read more about this in Section 2.4) from Normal distribution and is graphically drawn as the furthest observation in the interval. So, the lower whisker on our plot corresponds to the minimum value in the data, which is still in the interval, while the upper whisker corresponds to the bound of the interval. All the observations that lie beyond the interval are marked as potential outliers. Note that this does not mean that the values are indeed outliers, they just lie outside the 99.3% interval of Normal distribution. Finally, the red dot was added by me to show where the mean is. It is lower than median, this implies that there is a slight skewness in the distribution of weight.

There is also a way for producing the plots that would combine elements of histogram, density curve and boxplot. There is a plot called “violin plot.” We will use vioplot() function from vioplot package in order to produce them:

vioplot(mtcarsData$wt, ylab="Weight")
points(mean(mtcarsData$wt),col="red", pch=16)

Figure 2.9: Violin plot together with boxplot of the variable weight.

Figure 2.9 unites the boxplot and the density curve from the plots above, providing not only information about the quantiles, but also about the shape of the distribution.

Finally, if we want to compare the distribution of a variable with a known theoretical distribution, we can produce the QQ-plot. Here how it looks for Normal distribution:

qqnorm(mtcarsData$wt)
qqline(mtcarsData$wt)

Figure 2.10: QQ plot of Normal distribution for variable weight.

The idea of the plot on Figure 2.10 is to compare theoretical quantiles with the empirical ones. If the variable would follow the specific distribution, then all the points would lie on the solid line. In our case, they do not: there are points in the right tail that are very far from the line - so we would conclude that the distribution of weight does not look Normal.

So far, we have discussed the univariate analysis of numerical variables, but we can also produce plots showing potential relations between them. We start with the classical scatterplot:

plot(mtcarsData$wt, mtcarsData$mpg, xlab="Weight", ylab="Mileage")
lines(lowess(mtcarsData$wt, mtcarsData$mpg), col="red")

Figure 2.11: Scatterplot diagram between weight and mileage.

The plot on Figure 2.11 shows the observations that have specific weight and mileage. Based on this, we can see if there is a relation between variables or not and what sort of relation this is. In order to simplify analysis, I have added the lowess line to the plot. It smooths the relation between variables, drawing the smooth line through the points and helps in understanding the existing relations in the data. Judging by Figure 2.11, there is a negative, slightly non-linear relation between the variables: the mileage decreases with reduced speed, when weight of a car increases. This relation makes sense, because heavier cars will consume more fuel and thus drive less on a gallon of petrol.

We could construct similar plots for all the other numerical variables, but not all plots would be helpful. For example, a plot of mileage versus number of forward gears would be very difficult to read (see Figure 2.12).

plot(mtcarsData$gear, mtcarsData$mpg, xlab="Number of gears", ylab="Mileage")

Figure 2.12: Scatterplot diagram between weight and mileage.

This is because one of the variables is integer and takes only a handful of values. In this case, a boxplot or a violin plot would be more useful:

boxplot(mpg~gear, mtcarsData, xlab="Number of gears", ylab="Mileage")
points(tapply(mtcarsData$mpg, mtcarsData$gear, mean), col="red", pch=16)

Figure 2.13: Boxplot of mileage vs number of gears.

The plot on Figure 2.13 is more informative than the one on Figure 2.12: it shows how the distribution of mileage changes with the increase of the numeric variable number of gears. We can also see that the mean value first increases and then goes down slightly. I do not have any good explanation of this phenomenon, but it might be related with how efficient the cars become with the increase fo the number of gears, or this could happen due to some latent, unobserved factors. So, the data tells us that there is a non-linear relation between number of gears and mileage.

Similarly, we can produce violin plots for the same data using the following code:

vioplot(mpg~gear, mtcarsData, xlab="Number of gears", ylab="Mileage")
points(tapply(mtcarsData$mpg, mtcarsData$gear, mean), col="red", pch=16)

Finally, using exactly the same idea with boxplots / violin plots, we can analyse relations between categorical and numerical variables. Figure 2.14 shows the relation between transmission type and mileage. We can conclude that the cars with manual transmission tend to have a higher mileage than the ones with the automatic one in our dataset.

vioplot(mpg~am, mtcarsData, xlab="Transmission type", ylab="Mileage")
points(tapply(mtcarsData$mpg, mtcarsData$am, mean), col="red", pch=16)

Figure 2.14: Violin plot of mileage vs transmission type.

Finally, producing plots one by one might be a tedious and challenging task, so it is good to have some instruments for producing several of them together. The plot() method will produce scatterplot matrix for numerical variables, but does not deal well with integer and categorical variables:

plot(mtcars)

Figure 2.15: Scatterplot matrix for the mtcars dataset.

Figure 2.15 is informative for the variables mpg, cyl, disp, hp, drat, qsec and carb, but is difficult to read for the others. In order to address this issue, we can use the spread() function from greybox, which will detect types of variables and produce the necessary plots automatically:

spread(mtcarsData, lowess=TRUE)

Figure 2.16: Spread plot for the mtcars dataset.

The plot on Figure 2.16 is the collection of the plots discussed above, so I will not stop on explaining what it shows.

As a final word for this section, when analysing data, it is critically important not to just describe what we see, but also explain why a result or a relationship is meaningful, otherwise this becomes an exercise of stating the obvious which does not have any value. So, for example, concluding based on Figure 2.16 that the mileage has a negative relation with displacement is not enough. If you want to analyse the data properly, you need to explain that this relation is meaningful, because with the increase of the size of engine, the fuel consumption will increase as well, and as a result the mileage will go down. Furthermore, the relation is non-linear because the change in decrease will slow down with cars with bigger engines. Inevitably, the car with a gigantic engine will be able to travel a short distance on a gallon of fuel - the mileage will not become negative, so the non-linearity is not an artefact of the data, but an existing phenomenon.