6 Descriptive Statistics
R provides many functions for descriptive statistics. The subsections below cover a selection of them.
6.1 Centrality and Spread
As in Microsoft Excel, we can apply functions for centrality and spread to a variable.
k <- c(1, 5, 7, 9)
mean(k)
## [1] 5.5
# Use the $ operator for columns in a dataset
mean(mtcars$mpg)
## [1] 20.09062
sd(mtcars$mpg)
## [1] 6.026948
If you want to apply several functions to columns of the same dataset, the with() function can be useful: it evaluates an expression with the dataset as the local environment, so you do not have to repeat the $ operator.
with(mtcars, c(mean = mean(mpg), median = median(mpg), sd = sd(mpg)))
## mean median sd
## 20.090625 19.200000 6.026948
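To confirm that with() is only a convenience, the following minimal sketch computes the same three statistics with and without it. The object names stats_dollar and stats_with are illustrative and do not come from the text above.

```r
# Compute the statistics with explicit $ access
stats_dollar <- c(mean = mean(mtcars$mpg),
                  median = median(mtcars$mpg),
                  sd = sd(mtcars$mpg))

# Compute the same statistics with with(), which looks up mpg inside mtcars
stats_with <- with(mtcars, c(mean = mean(mpg), median = median(mpg), sd = sd(mpg)))

# Check that both approaches return the same named vector
all.equal(stats_dollar, stats_with)
```

The with() form tends to pay off as the number of column references grows, since the dataset name appears only once.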
6.2 Minimum and Maximum
To compute the minimum and maximum of a variable, we can use the min() and max() functions, respectively.
x <- 1:10 # 1 through 10.
min(x)
## [1] 1
max(x)
## [1] 10
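If both values are needed at once, base R's range() returns them together. The sketch below is a small illustration on the same vector; the name r is an arbitrary choice.

```r
x <- 1:10

# range() returns the minimum and maximum as a length-2 vector
r <- range(x)
r

# diff() on that vector gives the distance between the two extremes
diff(r)
```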
6.3 Data Dimensions
To find the dimensions of an object in R, we can use nrow()/NROW() for the number of rows, ncol()/NCOL() for the number of columns, and dim() for both at once.
NROW(mtcars)
## [1] 32
NCOL(mtcars)
## [1] 11
dim(mtcars)
## [1] 32 11
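The lowercase and uppercase forms agree on a data frame but differ on a plain vector, which has no dim attribute. A minimal sketch of the difference, using an illustrative vector v:

```r
v <- 1:10  # a plain vector has no dim attribute

nrow(v)    # NULL, because nrow() relies on dim()
NROW(v)    # NROW() treats the vector as a one-column matrix
NCOL(v)    # and so reports a single column

# For a data frame, the two forms agree
nrow(mtcars) == NROW(mtcars)
ncol(mtcars) == NCOL(mtcars)
```

This makes NROW() and NCOL() the safer choice when the input might be either a vector or a data frame.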
6.4 Data Summary
We can call summary() on an object to obtain summary information about it. This function is useful after str(), as together they give you a sense of what your dataset is like.
# Preview the dataset
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Summarize the dataset.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Note that because Species is a factor variable, we obtain counts by category for that column instead of quantiles and means like the others.
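The same distinction appears when summary() is applied to individual columns. The following sketch, with illustrative object names, summarizes one numeric column and one factor column of iris separately:

```r
# A numeric column yields the minimum, quartiles, mean, and maximum
num_summary <- summary(iris$Sepal.Length)
num_summary

# A factor column yields a count for each level
fac_summary <- summary(iris$Species)
fac_summary
```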
6.5 Frequency Tables
To get counts by group, we can use the table() function, while applying prop.table() to the result of table() produces proportions. The input of table() can be one or more columns, and the output is an object of class table.
6.5.1 Single-variable Case
For the single-variable case, we can simply pass the desired column to the table() function.
my_table <- table(iris$Species)
my_table
##
## setosa versicolor virginica
## 50 50 50
Additionally, we can apply prop.table() to our my_table object to obtain proportions.
prop.table(my_table)
##
## setosa versicolor virginica
## 0.3333333 0.3333333 0.3333333
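Proportions are often reported as percentages. As a minimal sketch, the proportions can be scaled by 100 and rounded; the name pct is illustrative.

```r
my_table <- table(iris$Species)

# Scale the proportions to percentages and round to one decimal place
pct <- round(100 * prop.table(my_table), 1)
pct
```

The result keeps the table class, so it prints with the same labelled layout as the counts.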
6.5.2 Multi-variable Case
For the case of multiple variables, we simply input the desired columns from a dataset.
my_table2 <- with(mtcars, table(am, gear))
my_table2
## gear
## am 3 4 5
## 0 15 4 0
## 1 0 8 5
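For a two-way table, prop.table() also accepts a margin argument that controls whether proportions are computed within rows or within columns. The sketch below uses illustrative object names and rounds only for display:

```r
my_table2 <- with(mtcars, table(am, gear))

# margin = 1: proportions within each row, so each am level sums to 1
row_props <- prop.table(my_table2, margin = 1)
round(row_props, 3)

# margin = 2: proportions within each column, so each gear level sums to 1
col_props <- prop.table(my_table2, margin = 2)
round(col_props, 3)
```

Without margin, the proportions are taken over the whole table and sum to 1 overall.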
When you input three or more variables, R prints the result as a series of two-way tables, one for each level of the additional variable (note that the class is still table).
my_table3 <- with(mtcars, table(am, gear, cyl))
my_table3
## , , cyl = 4
##
## gear
## am 3 4 5
## 0 1 2 0
## 1 0 6 2
##
## , , cyl = 6
##
## gear
## am 3 4 5
## 0 2 2 0
## 1 0 2 1
##
## , , cyl = 8
##
## gear
## am 3 4 5
## 0 12 0 0
## 1 0 0 2
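Because a three-way table is stored as a three-dimensional array, it can be inspected and indexed like one. The following sketch shows this; ftable() is a base R alternative layout that the text above does not cover, included here only as an illustration.

```r
my_table3 <- with(mtcars, table(am, gear, cyl))

class(my_table3)  # the class is still "table"
dim(my_table3)    # one extent per variable: am, gear, cyl

# Extract the two-way am-by-gear slice for 8-cylinder cars
my_table3[, , "8"]

# ftable() flattens the same counts into a single two-dimensional layout
ftable(my_table3)
```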
6.5.3 Converting to a Data Frame
If we apply the as.data.frame() function to an object of class table, the output has one column per grouping variable and one column for the frequency. This structure is useful because it can be written directly to a CSV file, for example.
freq <- table(iris$Species)
prop <- prop.table(freq)
as.data.frame(freq)
## Var1 Freq
## 1 setosa 50
## 2 versicolor 50
## 3 virginica 50
as.data.frame(prop)
## Var1 Freq
## 1 setosa 0.3333333
## 2 versicolor 0.3333333
## 3 virginica 0.3333333
my_table_df <- merge(as.data.frame(freq), as.data.frame(prop), by = 'Var1')
names(my_table_df) <- c('Species', 'Frequency', 'Percent')
my_table_df
## Species Frequency Percent
## 1 setosa 50 0.3333333
## 2 versicolor 50 0.3333333
## 3 virginica 50 0.3333333
write.csv(my_table_df, 'my_example_table.csv')
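By default, write.csv() also writes the row names (1, 2, 3) as an extra first column. The sketch below shows one way to omit them and then checks the round trip; the temporary file path and the names freq_df, csv_path, and back are illustrative assumptions rather than part of the example above.

```r
freq_df <- as.data.frame(table(iris$Species))
names(freq_df) <- c("Species", "Frequency")

# Write to a temporary file so the sketch leaves nothing in the working directory
csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE omits the row labels from the file
write.csv(freq_df, csv_path, row.names = FALSE)

# Read the file back to check that only the two named columns were written
back <- read.csv(csv_path)
back
```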
6.6 Summary
Function | Description | Example |
---|---|---|
mean(x) | Computes the mean. | mean(mtcars$mpg) |
sd(x) | Computes the standard deviation. | sd(mtcars$mpg) |
median(x) | Computes the median. | median(mtcars$mpg) |
min(x) | Computes the minimum. | min(mtcars$mpg) |
max(x) | Computes the maximum. | max(mtcars$mpg) |
nrow(x)/NROW(x) | Computes the number of rows. | nrow(mtcars); NROW(mtcars) |
ncol(x)/NCOL(x) | Computes the number of columns. | ncol(mtcars); NCOL(mtcars) |
dim(x) | Computes the number of rows and columns. | dim(mtcars) |
length(x) | Computes the number of elements in a data object. | length(mtcars$mpg) |
summary(x) | Summarizes a dataset. | summary(mtcars) |
table(x) | Generates a frequency table for one or more variables. | table(mtcars$gear); with(mtcars, table(gear, am)) |
prop.table(table) | Generates a proportions table. | prop.table(table(mtcars$gear)) |