2.1 Numerical analysis

In this section we will use the classical mtcars dataset from the datasets package for R. It contains 32 observations of 11 variables. While all the variables are stored as numbers, some of them are in fact categorical variables encoded as binary ones. We can check the description of the dataset in R:

?mtcars

Judging by the explanation in the R documentation, the following variables are categorical:

  1. vs - Engine (0 = V-shaped, 1 = straight),
  2. am - Transmission (0 = automatic, 1 = manual).

In addition, the following variables are integer numeric ones:

  1. cyl - Number of cylinders,
  2. hp - Gross horsepower,
  3. gear - Number of forward gears,
  4. carb - Number of carburetors.

All the other variables are continuous numeric.

Taking this into account, we will create a data frame, encoding the categorical variables as factors for further analysis:

mtcarsData <- data.frame(mtcars)
mtcarsData$vs <- factor(mtcarsData$vs,levels=c(0,1),labels=c("V-shaped","Straight"))
mtcarsData$am <- factor(mtcarsData$am,levels=c(0,1),labels=c("automatic","manual"))

Given that each of these variables takes only two values, this encoding is not compulsory, but it will help us in the further analysis.
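Before moving on, it can be useful to check that the encoding worked as intended. The following is a minimal sketch using only base R functions; cross-tabulating the original codes against the new labels is our own suggestion for a sanity check, not a required step:

```r
# Recreate the data frame so that the example is self-contained
mtcarsData <- data.frame(mtcars)
mtcarsData$vs <- factor(mtcarsData$vs, levels=c(0,1), labels=c("V-shaped","Straight"))
mtcarsData$am <- factor(mtcarsData$am, levels=c(0,1), labels=c("automatic","manual"))

# Classes of all columns: vs and am should now be factors
sapply(mtcarsData, class)

# Cross-tabulate the original 0/1 codes against the new labels:
# all observations should lie on the main diagonal
table(mtcars$vs, mtcarsData$vs)
table(mtcars$am, mtcarsData$am)
```

If a label were attached to the wrong code, the off-diagonal cells of these tables would be non-zero.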

We can start with the basic summary statistics. We remember from the scales of information (Section 1.2) that the nominal variables can be analysed only via frequencies, so this is what we can produce for them:

table(mtcarsData$vs)
## 
## V-shaped Straight 
##       18       14
table(mtcarsData$am)
## 
## automatic    manual 
##        19        13

These tables are called contingency tables (here in their simplest, one-way form): they show how often each value of a variable appears in the data. Based on them, we can conclude that cars with the V-shaped engine occur more often in the dataset than cars with the Straight one. In addition, the automatic transmission prevails in the data. A related statistic that is useful for the analysis of categorical variables is the mode. It shows which of the values occurs most often in the data. Judging by the frequencies above, we can conclude that the mode for the first variable is the value “V-shaped.”
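Base R does not have a dedicated function for the statistical mode (the function mode() returns the storage mode of an object instead), but it can be extracted from the frequency table. A minimal sketch, where the names vsFrequencies and vsMode are ours and are used for illustration only:

```r
# Engine type encoded as a factor, as in the data frame above
vs <- factor(mtcars$vs, levels=c(0,1), labels=c("V-shaped","Straight"))

# Frequencies of each value
vsFrequencies <- table(vs)

# The mode is the value (or values, in case of ties) with the highest frequency
vsMode <- names(vsFrequencies)[vsFrequencies == max(vsFrequencies)]
vsMode
```

Comparing with max() rather than using which.max() keeps all the values in case several of them share the highest frequency.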

All of this is purely descriptive information, which does not tell us much. We could probably get more information if we analysed the contingency table based on these two variables:

table(mtcarsData$vs,mtcarsData$am)
##           
##            automatic manual
##   V-shaped        12      6
##   Straight         7      7

For now, we can only conclude that cars with the V-shaped engine and automatic transmission occur more often in the dataset than any other combination.
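The counts in the table can be easier to read when expressed as shares. The sketch below uses the base R functions prop.table() and addmargins(); the object names are ours:

```r
# Factors as defined earlier in this section
vs <- factor(mtcars$vs, levels=c(0,1), labels=c("V-shaped","Straight"))
am <- factor(mtcars$am, levels=c(0,1), labels=c("automatic","manual"))

# Two-way contingency table of engine type and transmission
vsamTable <- table(vs, am)

# The same table with row and column totals
addmargins(vsamTable)

# Share of each combination in the whole dataset
overallShares <- prop.table(vsamTable)

# Shares within each engine type (each row sums to one)
rowShares <- prop.table(vsamTable, margin=1)
round(rowShares, 3)
```

The row shares show, for instance, that 12 out of 18 cars with the V-shaped engine have the automatic transmission, while for the Straight engine the split is 7 to 7.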

Next, we can look at the numerical variables. As we recall from Section 1.2, this scale supports all operations, so we can use quantiles, mean, standard deviation etc. Here is how we can analyse, for example, the variable mpg:

setNames(mean(mtcarsData$mpg),"mean")
##     mean 
## 20.09062
quantile(mtcarsData$mpg)
##     0%    25%    50%    75%   100% 
## 10.400 15.425 19.200 22.800 33.900
setNames(median(mtcarsData$mpg),"median")
## median 
##   19.2

The output above contains:

  1. Mean - the average value of mpg in the dataset, \(\bar{y}=\frac{1}{n}\sum_{j=1}^n y_j\).
  2. Quantiles - the values below which the respective proportions of the observations lie. For example, 25% of observations have mpg less than 15.425. The 25%, 50% and 75% quantiles are also called 1st, 2nd and 3rd quartiles respectively.
  3. Median, which splits the sample in two halves. It corresponds to the 50% quantile.
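To connect these definitions with the numbers in the output, we can reproduce them by hand. This is a minimal sketch in base R; the variable names (y, n, yMean, q1) are ours and are chosen to match the notation of the formula for the mean:

```r
y <- mtcars$mpg
n <- length(y)

# Mean according to the formula: the sum of the values divided by their number
yMean <- sum(y) / n

# The 25% quantile and the share of observations that lie below it
q1 <- quantile(y, 0.25)
shareBelowQ1 <- mean(y < q1)

# The median coincides with the 50% quantile
q2 <- unname(quantile(y, 0.5))

c(mean=yMean, shareBelowQ1=shareBelowQ1, median=median(y), q2=q2)
```

Note that quantile() supports several interpolation rules (the type parameter), so in small samples the share of observations below an estimated quantile does not have to match the nominal proportion exactly.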

If the mean is greater than the median, this typically indicates that the distribution of the variable is positively skewed (it has some rare observations with large values). This is what we observe here: the mean of mpg is 20.09, while the median is 19.2. We can investigate this further using skewness and kurtosis from the timeDate package:

timeDate::skewness(mtcarsData$mpg)
## [1] 0.610655
## attr(,"method")
## [1] "moment"
timeDate::kurtosis(mtcarsData$mpg)
## [1] -0.372766
## attr(,"method")
## [1] "excess"

Skewness shows the asymmetry of the distribution. If it is greater than zero, then the distribution has a long right tail; if it is less than zero, then it has a long left tail. For a symmetric distribution it is equal to zero.

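Kurtosis shows the excess of the distribution (fatness of tails) in comparison with the normal distribution. If it is equal to zero, then the tails are as fat as those of the normal distribution. Positive values indicate fatter tails and negative values, such as the -0.37 obtained here, indicate thinner ones.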

Based on all of this, we can conclude that the distribution of mpg is skewed and has a longer right tail. This is expected for such a variable, because cars with high mileage per gallon are not common in this dataset.
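To see what stands behind these two numbers, we can calculate them by hand as the averages of the third and the fourth powers of the standardised values. The sketch below assumes that the "moment" and "excess" methods standardise by the sample standard deviation (the one returned by sd()); this is our reading of the output above rather than a statement from the package documentation, and the object names are ours:

```r
y <- mtcars$mpg
n <- length(y)

# Centre the variable and take the sample standard deviation (n-1 in the denominator)
yCentred <- y - mean(y)
ySD <- sd(y)

# Skewness: average of the cubed standardised values
skewnessManual <- sum(yCentred^3 / ySD^3) / n

# Excess kurtosis: average of the standardised values in the fourth power,
# minus 3, which is the kurtosis of the normal distribution
kurtosisManual <- sum(yCentred^4 / ySD^4) / n - 3

round(c(skewness=skewnessManual, kurtosis=kurtosisManual), 6)
```

Under this assumption the values agree with the ones reported by timeDate above. Other packages may divide by a different estimate of the standard deviation, so their results can differ slightly in small samples.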

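Most of the conventional statistics discussed above (minimum and maximum, quartiles, median and mean for numerical variables, and frequencies for factors) can be produced for all variables in the dataset using the summary() function: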

summary(mtcarsData)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec              vs             am    
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   V-shaped:18   automatic:19  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   Straight:14   manual   :13  
##  Median :3.695   Median :3.325   Median :17.71                               
##  Mean   :3.597   Mean   :3.217   Mean   :17.85                               
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90                               
##  Max.   :4.930   Max.   :5.424   Max.   :22.90                               
##       gear            carb      
##  Min.   :3.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000  
##  Median :4.000   Median :2.000  
##  Mean   :3.688   Mean   :2.812  
##  3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :8.000
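The summary above does not report the standard deviation, which was mentioned earlier as one of the statistics available for numerical variables. A minimal sketch of how it could be obtained for all numerical columns at once, using base R only (the name numericSD is ours):

```r
# Recreate the data frame so that the example is self-contained
mtcarsData <- data.frame(mtcars)
mtcarsData$vs <- factor(mtcarsData$vs, levels=c(0,1), labels=c("V-shaped","Straight"))
mtcarsData$am <- factor(mtcarsData$am, levels=c(0,1), labels=c("automatic","manual"))

# Keep only the numerical columns (the factors vs and am are dropped)
# and calculate the standard deviation of each of them
numericSD <- sapply(Filter(is.numeric, mtcarsData), sd)
round(numericSD, 3)
```

The same pattern works for any other function of a single numerical variable, for example var or IQR.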