6 Descriptive Statistics

R provides many functions for descriptive statistics. The subsections below cover a selection of them.

6.1 Centrality and Spread

As in Microsoft Excel, we can apply functions that measure centrality and spread to a variable.

k <- c(1, 5, 7, 9)
mean(k)
## [1] 5.5
# Use the $ operator for columns in a dataset
mean(mtcars$mpg)
## [1] 20.09062
sd(mtcars$mpg)
## [1] 6.026948

If you want to apply several functions to columns of the same dataset, the with() function can be useful: it evaluates an expression with the dataset's columns available as variables, so you do not have to repeat the $ operator.

with(mtcars, c(mean = mean(mpg), median = median(mpg), sd = sd(mpg)))
##      mean    median        sd 
## 20.090625 19.200000  6.026948
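To make explicit what these functions compute, the following minimal sketch recomputes the mean and standard deviation of mpg by hand and compares the results with the built-ins. The variable names (mpg, n, manual_mean, manual_sd) are illustrative only. Note that sd() uses the sample denominator, n - 1.

```r
mpg <- mtcars$mpg
n <- length(mpg)

# Mean: sum of the values divided by the number of values
manual_mean <- sum(mpg) / n

# Standard deviation: square root of the sample variance (n - 1 denominator)
manual_sd <- sqrt(sum((mpg - manual_mean)^2) / (n - 1))

# Compare with the built-in functions (up to floating-point tolerance)
all.equal(manual_mean, mean(mpg))
## [1] TRUE
all.equal(manual_sd, sd(mpg))
## [1] TRUE
```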

6.2 Minimum and Maximum

To compute the minimum and maximum of a variable, we can use the min() and max() functions respectively.

x <- 1:10 # 1 through 10.

min(x)
## [1] 1
max(x)
## [1] 10
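Two related base R functions are often handy alongside min() and max(). The short sketch below uses range() to get both values at once and which.max() to find where the maximum occurs; looking up the car name in mtcars is just one illustrative use.

```r
x <- 1:10

# range() returns the minimum and maximum together
range(x)
## [1]  1 10

# which.max() gives the position of the (first) maximum, useful for lookups
rownames(mtcars)[which.max(mtcars$mpg)]
## [1] "Toyota Corolla"
```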

6.3 Data Dimensions

To find the dimensions of an object in R, we can use nrow() or NROW() for the number of rows, ncol() or NCOL() for the number of columns, and dim() for both at once.

NROW(mtcars)
## [1] 32
NCOL(mtcars)
## [1] 11
dim(mtcars)
## [1] 32 11
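The lowercase and uppercase variants give the same answer for a data frame, but they differ for plain vectors. As a small illustration (the vector v is made up for this example), nrow() returns NULL because a vector has no dimensions, whereas NROW() and NCOL() treat it as a one-column matrix.

```r
v <- c(10, 20, 30)

nrow(v)   # a plain vector has no dim attribute
## NULL
NROW(v)   # treated as a one-column matrix
## [1] 3
NCOL(v)
## [1] 1
```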

6.4 Data Summary

We can call summary() on an object to get summary information about it. This function is a useful follow-up to str(): together they give you a sense of what your dataset looks like.

# Preview the dataset
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Summarize the dataset.
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Note that because Species is a factor variable, we obtain counts by category for that column rather than the quantiles and mean reported for the numeric columns.
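The same behavior can be seen by calling summary() on individual columns. This is only a brief sketch: a numeric column yields the six-number summary, while a factor column yields counts per level.

```r
# Numeric column: minimum, quartiles, mean, and maximum
summary(iris$Sepal.Length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   5.100   5.800   5.843   6.400   7.900

# Factor column: counts per level
summary(iris$Species)
##     setosa versicolor  virginica 
##         50         50         50
```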

6.5 Frequency Tables

To get counts by group, we can use the table() function; applying prop.table() to the result of table() produces proportions. The input of table() can be one or more columns, and the output is an object of class table.

6.5.1 Single-variable Case

For the single-variable case, we can simply input our desired column into the table() function.

my_table <- table(iris$Species)

my_table
## 
##     setosa versicolor  virginica 
##         50         50         50

Additionally, we can apply prop.table() to our my_table object to obtain proportions.

prop.table(my_table)
## 
##     setosa versicolor  virginica 
##  0.3333333  0.3333333  0.3333333
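If percentages are easier to read than proportions, one simple option is to multiply by 100 and round. The sketch below recreates my_table so that it runs on its own; rounding to one decimal place is an arbitrary choice.

```r
my_table <- table(iris$Species)

# Multiply proportions by 100 and round to one decimal place
round(100 * prop.table(my_table), 1)
## 
##     setosa versicolor  virginica 
##       33.3       33.3       33.3
```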

6.5.2 Multi-variable Case

For the case of multiple variables, we simply input the desired columns from a dataset.

my_table2 <- with(mtcars, table(am, gear))

my_table2
##    gear
## am   3  4  5
##   0 15  4  0
##   1  0  8  5
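With a two-way table, it is often useful to compute proportions within rows or columns and to add totals. The following sketch, which recreates my_table2 so that it is self-contained, uses the margin argument of prop.table() (1 for rows, 2 for columns) and the base function addmargins(); the name row_props is illustrative.

```r
my_table2 <- with(mtcars, table(am, gear))

# Row proportions: each row (value of am) sums to 1
row_props <- prop.table(my_table2, margin = 1)
rowSums(row_props)
## 0 1 
## 1 1

# Append row and column totals to the counts
addmargins(my_table2)
```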

When you input three or more variables, the result is a multi-dimensional array, which R prints as a series of two-way tables, one for each level of the remaining variable(s). Note that the class is still table.

my_table3 <- with(mtcars, table(am, gear, cyl))

my_table3
## , , cyl = 4
## 
##    gear
## am   3  4  5
##   0  1  2  0
##   1  0  6  2
## 
## , , cyl = 6
## 
##    gear
## am   3  4  5
##   0  2  2  0
##   1  0  2  1
## 
## , , cyl = 8
## 
##    gear
## am   3  4  5
##   0 12  0  0
##   1  0  0  2
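Because a three-way table is stored as an array, it can be inspected and indexed like one. The sketch below recreates my_table3, checks its dimensions, pulls out a single slice by level name, and prints a flattened view with the base function ftable().

```r
my_table3 <- with(mtcars, table(am, gear, cyl))

# One dimension per variable: 2 levels of am, 3 of gear, 3 of cyl
dim(my_table3)
## [1] 2 3 3

# Extract the two-way slice for 8-cylinder cars by level name
my_table3[, , "8"]

# ftable() prints the same counts as a single flat table
ftable(my_table3)
```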

6.5.3 Converting to a Data Frame

If we apply the as.data.frame() function to an object of class table, the output has one column per grouping variable and a column for the frequency. This structure is useful because it can be written directly to a file, for example as CSV.

freq <- table(iris$Species)
prop <- prop.table(freq)

as.data.frame(freq)
##         Var1 Freq
## 1     setosa   50
## 2 versicolor   50
## 3  virginica   50
as.data.frame(prop)
##         Var1      Freq
## 1     setosa 0.3333333
## 2 versicolor 0.3333333
## 3  virginica 0.3333333
# Combine the counts and proportions into one data frame, matching on Var1
my_table_df <- merge(as.data.frame(freq), as.data.frame(prop), by = 'Var1')

# The third column holds proportions (0 to 1), not percentages
names(my_table_df) <- c('Species', 'Frequency', 'Proportion')

my_table_df
##      Species Frequency Proportion
## 1     setosa        50  0.3333333
## 2 versicolor        50  0.3333333
## 3  virginica        50  0.3333333
write.csv(my_table_df, 'my_example_table.csv')
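The same conversion works for tables with more than one variable, in which case each combination of groups gets its own row. The sketch below is one possible way to do this: it uses the responseName argument of as.data.frame() to name the count column, and writes to a temporary file (rather than the working directory) with row.names = FALSE so that the row numbers are not saved. The names counts_df and csv_path are illustrative.

```r
# Long format: one row per combination of am and gear
counts_df <- as.data.frame(with(mtcars, table(am, gear)), responseName = "Count")
names(counts_df)
## [1] "am"    "gear"  "Count"
nrow(counts_df)
## [1] 6

# Write without row names and read the file back to check the round trip
csv_path <- tempfile(fileext = ".csv")
write.csv(counts_df, csv_path, row.names = FALSE)
read.csv(csv_path)
```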

6.6 Summary

Table 6.1: Summary of Descriptive Statistics Functions
| Function | Description | Example |
| --- | --- | --- |
| mean(x) | Computes the mean. | mean(mtcars$mpg) |
| sd(x) | Computes the standard deviation. | sd(mtcars$mpg) |
| median(x) | Computes the median. | median(mtcars$mpg) |
| min(x) | Computes the minimum. | min(mtcars$mpg) |
| max(x) | Computes the maximum. | max(mtcars$mpg) |
| nrow(x)/NROW(x) | Computes the number of rows. | nrow(mtcars); NROW(mtcars) |
| ncol(x)/NCOL(x) | Computes the number of columns. | ncol(mtcars); NCOL(mtcars) |
| dim(x) | Computes the number of rows and columns. | dim(mtcars) |
| length(x) | Computes the number of elements in a data object. | length(mtcars$mpg) |
| summary(x) | Summarizes a dataset. | summary(mtcars) |
| table(x) | Generates a frequency table for one or more variables. | table(mtcars$gear); with(mtcars, table(gear, am)) |
| prop.table(table) | Generates a proportions table. | prop.table(table(mtcars$gear)) |
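One caveat about length(), which appears in the table above: what counts as an "element" depends on the object. As a brief illustration, a vector's elements are its values, whereas a data frame's elements are its columns.

```r
length(mtcars$mpg)  # a vector: number of values
## [1] 32
length(mtcars)      # a data frame: number of columns, not rows
## [1] 11
```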