3 Data Management

In this chapter, we will learn how to replace values, switch values, import data, combine data, subset data, and split data.

3.1 Replacing Values

We can replace values with the replace() function.

x <- 1:10 # 1 through 10.

# If x equals 2, 5, or 7, replace with 0.
## replace(vector, condition, replacement value).
replace(x, x %in% c(2, 5, 7), 0)
##  [1]  1  0  3  4  0  6  0  8  9 10

3.2 Switching Values

We can switch–or recode–values with the switch() function. By default, switch() is a “scalar” function in that it only produces a single value. To produce a vector of values, we combine it with sapply()–see the Functionals chapter for more details on sapply().

3.2.1 Scalar Case

In the case of a single value, all we need to pass into switch() are (1) the data object and (2) an expression stating what the old value should become (i.e, provide the old value and the replacement value).

# SYNTAX OF switch():
## switch(x, old_value = new_value)

x <- "a"

x
## [1] "a"
xs <- switch(x, a = 1)

xs
## [1] 1

3.2.2 Vector Case

For the case of applying switch() to vectors, we make use of sapply().

To take a case in point, let’s first generate some random data of racial groups.

set.seed(1) # Remember our random sampling

# Generate vector of unique values.
my_vector <- c('Asian', 'African American', 'White', 'Other')

my_vector
## [1] "Asian"            "African American" "White"            "Other"
# Conduct repeat sampling of my_vector
## See the Probability Functions chapter for more details on sample().
my_vector2 <- sample(my_vector, 20, replace = TRUE)

# Print the new vector.
my_vector2
##  [1] "Asian"            "Other"            "White"            "Asian"           
##  [5] "African American" "Asian"            "White"            "White"           
##  [9] "African American" "African American" "White"            "White"           
## [13] "Asian"            "Asian"            "Asian"            "African American"
## [17] "African American" "African American" "African American" "White"

Let’s say that we want to recode these values: 0 for White, 1 for African American, 2 for Asian, and 3 for Other. To do so, we first define a function and pass it through sapply()–see the Function Writing and Functionals chapters respectively for more information.

# First, define a function that recodes the races into integers.
my_switch <- function(v) {
  
  switch(v, White = 0, `African American` = 1, Asian = 2, Other = 3)
  # We use back quotes for "African American" because of the space.
  
}

# Now we can pass my_switch to sapply() to execute the recoding.
sapply(my_vector2, my_switch)
##            Asian            Other            White            Asian 
##                2                3                0                2 
## African American            Asian            White            White 
##                1                2                0                0 
## African American African American            White            White 
##                1                1                0                0 
##            Asian            Asian            Asian African American 
##                2                2                2                1 
## African American African American African American            White 
##                1                1                1                0

3.3 Importing Data

We can import datasets with read.table()–this method is the most general.

# Set path to dataset
# For this example, our data is in the data folder
#  and our data are separated by commas.
my_data <- read.table('data/mtcars.csv', sep = ',', stringsAsFactors = FALSE)

# Setting stringsAsFactors = FALSE maintains strings as strings.
## See the Basics chapter for more detail on classes and types.

In the case of files with comma-separated values, we can use read.csv() to import them more easily.

# Set path to dataset
my_data <- read.csv('data/mtcars.csv', stringsAsFactors = FALSE)

3.4 Combining Data

There are three main ways to combine data: (1) cbind(), (2) rbind(), and (3) merge().

3.4.1 cbind()/rbind()

The function cbind() combines vectors or datasets column-wise, while rbind() does so row-wise.

# Creating x and y
x <- 1:5
y <- 6:10

# Seeing x and y separately
x
## [1] 1 2 3 4 5
y
## [1]  6  7  8  9 10
# Combining them
cbind(x, y)
##      x  y
## [1,] 1  6
## [2,] 2  7
## [3,] 3  8
## [4,] 4  9
## [5,] 5 10
rbind(x, y)
##   [,1] [,2] [,3] [,4] [,5]
## x    1    2    3    4    5
## y    6    7    8    9   10

If we have a list of values we want to combine, we can use do.call() and cbind()/rbind() together. The former iteratively calls a function on a list, which can be useful for combining multiple datasets together. do.call() is a special case of a function called a functional, which is a function that takes other functions as inputs–this concept is discussed more in the Functionals chapter.

my_list <- list(x = 1:5, y = 6:10, z = 11:15)

do.call(cbind, my_list)
##      x  y  z
## [1,] 1  6 11
## [2,] 2  7 12
## [3,] 3  8 13
## [4,] 4  9 14
## [5,] 5 10 15
do.call(rbind, my_list)
##   [,1] [,2] [,3] [,4] [,5]
## x    1    2    3    4    5
## y    6    7    8    9   10
## z   11   12   13   14   15

3.4.2 merge()

Merging data with merge() (AKA “joining data”) is powerful, as we can combine disparate datasets that have a common linking variable between them.

set.seed(1) # remember our random numbers from rnorm().

data1 <- data.frame(survey_id = 1:5, 
                    wage      = rnorm(5, mean = 15, sd = 5))

data1
##   survey_id     wage
## 1         1 11.86773
## 2         2 15.91822
## 3         3 10.82186
## 4         4 22.97640
## 5         5 16.64754
data2 <- data.frame(survey_id  = 5:1,
                    experience = rnorm(5, mean = 5, sd = 3))

data2
##   survey_id experience
## 1         5   2.538595
## 2         4   6.462287
## 3         3   7.214974
## 4         2   6.727344
## 5         1   4.083835
# merge(first data, second data, by = 'a common variable').
data_merge <- merge(data1, data2, by = 'survey_id')

data_merge # An "inner-join" of datasets
##   survey_id     wage experience
## 1         1 11.86773   4.083835
## 2         2 15.91822   6.727344
## 3         3 10.82186   7.214974
## 4         4 22.97640   6.462287
## 5         5 16.64754   2.538595

What we accomplished here is an inner join: a join in which two datasets overlap. See the documentation file for merge() for more information on different types of joins (i.e., type ?merge into the R console).

3.5 Subsetting Data

To subset data, we can pass data and relational/logic operators1 into the subset() function, or we can use the bracket syntax and use the operators there.

The relational operators are the following:

  • <, >, <= (less than or equal to), >= (greater than or equal to)

  • == (equal to), != (not equal to)

The main logic operators are the following:

  • & (and)

  • | (or)

3.5.1 Vector Case

Suppose we have the following vector:

x <- -10:10 # integers from -10 to 10.
x
##  [1] -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8
## [20]   9  10

Then we can subset like the following:

x[x < 0] # same as subset(x, x < 0)
##  [1] -10  -9  -8  -7  -6  -5  -4  -3  -2  -1
x[x > 2 & x < 5]
## [1] 3 4
# We can use functions inside the brackets.
## For example, %in% is a matching function:
##   let's use it to subset for only 1 through 5.
x[x %in% 1:5] 
## [1] 1 2 3 4 5

3.5.2 Data Frame Case

Suppose the dataset mtcars. Then we can subset like the following:

subset(mtcars, mpg > 30) # Same as mtcars[mtcars$mpg > 30, ]
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
## Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2
subset(mtcars, mpg > 30 & wt > 1.7)
##                 mpg cyl disp hp drat    wt  qsec vs am gear carb
## Fiat 128       32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
## Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1

3.6 Splitting Data

To split data, we pass a data frame and a variable into the split() function.

split(mtcars, mtcars$gear) # Splits into 3 subsets.
## $`3`
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 
## $`4`
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
## 
## $`5`
##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8

Splitting can be useful when you want to apply a function that’s contingent on subsets of data. For example, we can split the data and perform a regression model on each of them.

# 1. Split dataset by a splitting variable.
my_split <- split(mtcars, mtcars$gear)

# 2. Estimate a regression model based on each subset.
my_models <- lapply(my_split, function(data) lm(mpg ~ wt, data))

# 3. Print the coefficients in a matrix form.
sapply(my_models, coef)
##                     3         4         5
## (Intercept) 28.395036 42.492769 42.562784
## wt          -3.156854 -6.863478 -8.046336

For more information about lapply() and sapply(), see the Functionals chapter; for more information about lm(), see the Linear Modeling chapter.

3.7 Summary

Table 3.1: Summary of Data Management Functions
Function Description Example
replace(x, condition, replacement) Replace a value in a vector based on a condition. x <- 1:10; replace(x, x %in% c(2, 5, 7), 0)
switch(x, expression) Switch (recode) values. x <- ‘a’; switch(x, a = 1)
read.table(‘path/to/file.csv’, sep = ‘,’) Import a dataset. my_data <- read.table(‘data/mtcars.csv’, sep = ‘,’)
cbind(x,y)/rbind(x,y) Combine data column- or row-wise. x <- 1:5; y <- 6:10; cbind(x, y); rbind(x,y)
do.call(function, list) Iteratively call a function on a list my_list <- list(x = 1:5, y = 6:10, z = 11:15); do.call(cbind, my_list)
merge(x, y, by = ‘linking_var’) Join data by a linking variable.

data1 <- data.frame(survey_id = 1:5, wage = rnorm(5, mean = 15, sd = 5))

data2 <- data.frame(survey_id = 5:1, experience = rnorm(5, mean = 5, sd = 3))

data_merge <- merge(data1, data2, by = ‘survey_id’)
subset(data, condition); x[condition] Subset data via relational and logic operators. subset(mtcars, mpg > 30 & wt > 1.7)
split(data, grouping_variable) Split data by a grouping variable split(mtcars, mtcars$gear)