2.1 Numerical analysis
In this section we will use the classical mtcars dataset from the datasets package for R. It contains 32 observations of 11 variables. While all the variables are numerical, some of them are in fact categorical variables encoded as binary ones. We can check the description of the dataset in R:
?mtcars
Judging by the explanation in the R documentation, the following variables are categorical:
- vs - Engine (0 = V-shaped, 1 = straight),
- am - Transmission (0 = automatic, 1 = manual).
In addition, the following variables are integer numeric ones:
- cyl - Number of cylinders,
- hp - Gross horsepower,
- gear - Number of forward gears,
- carb - Number of carburetors.
All the other variables are continuous numeric.
Taking this into account, we will create a data frame, encoding the categorical variables as factors for further analysis:
mtcarsData <- data.frame(mtcars)
mtcarsData$vs <- factor(mtcarsData$vs,levels=c(0,1),labels=c("V-shaped","Straight"))
mtcarsData$am <- factor(mtcarsData$am,levels=c(0,1),labels=c("automatic","manual"))
Given that we only have two options in those variables, it is not compulsory to do this encoding, but it will help us in the further analysis.
We can start with the basic summary statistics. We remember from the scales of information (Section 1.2) that the nominal variables can be analysed only via frequencies, so this is what we can produce for them:
table(mtcarsData$vs)
##
## V-shaped Straight
##       18       14
table(mtcarsData$am)
##
## automatic    manual
##        19        13
These tables are called contingency tables: they show how often each value of a variable appears in the data. Based on them, we can conclude that cars with V-shaped engines appear more often in the dataset than cars with straight ones (18 versus 14). In addition, automatic transmission prevails in the data (19 versus 13). A related statistic that is useful for the analysis of categorical variables is the mode, which shows which value occurs most often in the data. Judging by the frequencies above, the mode of the first variable is the value “V-shaped.”
All of this is purely descriptive information, which does not tell us much on its own. We can get more information by analysing the contingency table based on these two variables jointly:
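The mode can also be extracted from the frequency table programmatically. Base R does not provide a function for the statistical mode (the built-in mode() returns the storage type of an object), so the sketch below defines a small helper. The name statMode is hypothetical and is introduced only for this illustration; the factor is rebuilt from mtcars so that the example runs on its own.

```r
# Rebuild the factor used in the text so that the example is self-contained
vs <- factor(mtcars$vs, levels=c(0,1), labels=c("V-shaped","Straight"))

# statMode is a hypothetical helper: base R has no function for the statistical mode
statMode <- function(x) {
    frequencies <- table(x)
    # Return every value that reaches the maximum frequency, so ties are kept
    names(frequencies)[frequencies == max(frequencies)]
}

vsMode <- statMode(vs)
vsMode
```

The helper returns all the values tied for the highest frequency, which is a design choice: a variable can have more than one mode.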
table(mtcarsData$vs,mtcarsData$am)
##
##            automatic manual
##   V-shaped        12      6
##   Straight         7      7
For now, we can only conclude that the combination of V-shaped engine and automatic transmission is the most frequent one in the dataset (12 of 32 cars).
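Counts are easier to compare when they are converted into shares. The sketch below is one possible way of doing that with the base R functions prop.table() and addmargins(). The object names vsAmTable, vsAmShares and vsAmRowShares are made up for this example.

```r
# Recreate the data frame from the text so that the example is self-contained
mtcarsData <- data.frame(mtcars)
mtcarsData$vs <- factor(mtcarsData$vs, levels=c(0,1), labels=c("V-shaped","Straight"))
mtcarsData$am <- factor(mtcarsData$am, levels=c(0,1), labels=c("automatic","manual"))

vsAmTable <- table(mtcarsData$vs, mtcarsData$am)

# Counts together with row and column totals
addmargins(vsAmTable)

# Share of each combination in the whole sample (all cells sum to one)
vsAmShares <- prop.table(vsAmTable)
round(vsAmShares, 3)

# Shares of transmission types within each engine type (each row sums to one)
vsAmRowShares <- prop.table(vsAmTable, margin=1)
round(vsAmRowShares, 3)
```

The row shares show that 12 of the 18 cars with V-shaped engines have automatic transmission, while the cars with straight engines are split evenly (7 and 7).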
Next, we can look at the numerical variables. As we recall from Section 1.2, this scale supports all operations, so we can use quantiles, mean, standard deviation etc. Here is how we can analyse, for example, the variable mpg:
setNames(mean(mtcarsData$mpg),"mean")
##     mean
## 20.09062
quantile(mtcarsData$mpg)
##     0%    25%    50%    75%   100%
## 10.400 15.425 19.200 22.800 33.900
setNames(median(mtcarsData$mpg),"median")
## median
##   19.2
The output above shows:
- Mean - the average value of mpg in the dataset, \(\bar{y}=\frac{1}{n}\sum_{j=1}^n y_j\).
- Quantiles - the values below which the respective proportions of the dataset lie. For example, 25% of observations have mpg less than 15.425. The 25%, 50% and 75% quantiles are also called 1st, 2nd and 3rd quartiles respectively.
- Median, which splits the sample in two halves. It corresponds to the 50% quantile.
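The same variable can be described with measures of spread and with quantiles other than the quartiles. The lines below are a minimal sketch using only base R functions; the object names mpg, mpgDeciles and mpgIQR are chosen for this example only.

```r
mpg <- mtcars$mpg

# Standard deviation and variance (R uses the n-1 denominator for both)
sd(mpg)
var(mpg)

# Any set of probabilities can be requested, for example the deciles
mpgDeciles <- quantile(mpg, probs=seq(0.1, 0.9, by=0.1))
mpgDeciles

# Interquartile range: the distance between the 3rd and the 1st quartiles
mpgIQR <- IQR(mpg)
mpgIQR

# The median coincides with the 50% quantile
all.equal(median(mpg), unname(quantile(mpg, probs=0.5)))
```

With the quartiles reported above, the interquartile range is 22.800 - 15.425 = 7.375.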
If the mean is greater than the median, then this typically indicates that the distribution of the variable is positively skewed (it has some rare observations with large values). This is what we observe here (the mean is 20.09, while the median is 19.2), and we can investigate it further using skewness and kurtosis from the timeDate package:
timeDate::skewness(mtcarsData$mpg)
## [1] 0.610655
## attr(,"method")
## [1] "moment"
timeDate::kurtosis(mtcarsData$mpg)
## [1] -0.372766
## attr(,"method")
## [1] "excess"
Skewness shows the asymmetry of a distribution. If it is greater than zero, then the distribution has a long right tail; if it is less than zero, then the long tail is on the left. For a symmetric distribution it is equal to zero.
Kurtosis shows the excess of a distribution (fatness of tails) in comparison with the normal distribution. If it is equal to zero, then the tails are as heavy as those of the normal distribution; positive values indicate fatter tails and negative ones indicate thinner tails.
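To make the two measures less of a black box, they can be computed by hand. The sketch below assumes moment-based definitions in which the third and fourth central moments are divided by powers of the sample standard deviation (the one returned by sd()). This is an assumption about the method and not a description of the package internals, but it reproduces the values printed above. The names centred, mpgSkewness and mpgKurtosis are introduced for this example only.

```r
y <- mtcars$mpg
centred <- y - mean(y)

# Assumed definition: third central moment over the cubed sample standard deviation
mpgSkewness <- mean(centred^3) / sd(y)^3

# Assumed definition: fourth central moment over sd^4, minus 3
# (3 is the kurtosis of the normal distribution, hence "excess")
mpgKurtosis <- mean(centred^4) / sd(y)^4 - 3

round(c(skewness=mpgSkewness, kurtosis=mpgKurtosis), 6)
```

Other packages may use different denominators or bias corrections, so small differences between implementations are to be expected.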
Based on all of this, we can conclude that the distribution of mpg is skewed and has a longer right tail. This is expected for such a variable, because cars with higher mileage are not common in this dataset.
The conventional statistics discussed above (frequencies, quartiles, mean and median) can be produced for all variables in the dataset using the following summary:
summary(mtcarsData)
##       mpg             cyl             disp             hp
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
##       drat             wt             qsec              vs             am
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   V-shaped:18   automatic:19
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   Straight:14   manual   :13
##  Median :3.695   Median :3.325   Median :17.71
##  Mean   :3.597   Mean   :3.217   Mean   :17.85
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90
##  Max.   :4.930   Max.   :5.424   Max.   :22.90
##       gear            carb
##  Min.   :3.000   Min.   :1.000
##  1st Qu.:3.000   1st Qu.:2.000
##  Median :4.000   Median :2.000
##  Mean   :3.688   Mean   :2.812
##  3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :5.000   Max.   :8.000
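The summary does not report measures of spread such as the standard deviation. One possible way to obtain them for all numerical variables at once is sketched below; the factors are skipped, because only frequencies make sense for them. The names numericColumns and mtcarsSD are chosen for this illustration.

```r
# Recreate the data frame from the text so that the example is self-contained
mtcarsData <- data.frame(mtcars)
mtcarsData$vs <- factor(mtcarsData$vs, levels=c(0,1), labels=c("V-shaped","Straight"))
mtcarsData$am <- factor(mtcarsData$am, levels=c(0,1), labels=c("automatic","manual"))

# Flag the numerical columns (the two factors are excluded)
numericColumns <- sapply(mtcarsData, is.numeric)

# Standard deviation of each numerical variable
mtcarsSD <- sapply(mtcarsData[numericColumns], sd)
round(mtcarsSD, 3)
```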