Working with Other Data Structures


Objectives

The objective of this module is to introduce additional data structures in R (arrays, matrices, lists, and data frames) and to learn how to extract, filter, and subset data from them.

Preliminaries

The Backstory

So far, we have seen how to create vectors, the most fundamental data structure in R. Today, we will explore and learn how to manipulate other data structures, including matrices, arrays, lists, and data frames.

Matrices and Arrays

A matrix is like a collection of vectors stored together, and like vectors, matrices can only store data of one class (e.g., numeric or character). Matrices are created using the matrix() function:

m <- matrix(data = c(1, 2, 3, 4), nrow = 2, ncol = 2)
m
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

By default, matrices are filled column-wise, because the byrow argument is set to FALSE. Setting byrow = TRUE fills the matrix row-wise instead.

m <- matrix(data = c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = FALSE)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
m <- matrix(data = c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)
m
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

You can also create matrices in R by binding vectors of the same length together either row-wise (with the function rbind()) or column-wise (with the function cbind()).

v1 <- c(1, 2, 3, 4)
v2 <- c(6, 7, 8, 9)
m1 <- rbind(v1, v2)
m1
##    [,1] [,2] [,3] [,4]
## v1    1    2    3    4
## v2    6    7    8    9
m2 <- cbind(v1, v2)
m2
##      v1 v2
## [1,]  1  6
## [2,]  2  7
## [3,]  3  8
## [4,]  4  9

Metadata about a matrix can be extracted using the class(), dim(), names(), rownames(), colnames() and other commands. The dim() command returns a vector containing the number of rows at index position 1 and the number of columns at index position 2.

class(m1)
## [1] "matrix" "array"
dim(m1)
## [1] 2 4
class(m2)
## [1] "matrix" "array"
dim(m2)
## [1] 4 2
colnames(m2)
## [1] "v1" "v2"
rownames(m2)
## NULL

NOTE: In this example, row names are not defined because cbind() was used to create the matrix, so the vector names (v1 and v2) became column names instead.
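As a small sketch of the metadata functions mentioned above, the following rebuilds m2 and then assigns row names to it. The labels "a" through "d" are made up for illustration and are not part of the original example.

```r
# Rebuild m2 exactly as above so this block runs on its own
v1 <- c(1, 2, 3, 4)
v2 <- c(6, 7, 8, 9)
m2 <- cbind(v1, v2)

nrow(m2)  # same value as dim(m2)[1]
ncol(m2)  # same value as dim(m2)[2]

# Assign (hypothetical) row names; until now rownames(m2) was NULL
rownames(m2) <- c("a", "b", "c", "d")
rownames(m2)
m2
```

After the assignment, the printed matrix shows the labels a, b, c, and d in place of [1,], [2,], [3,], and [4,].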

The structure (str()) command can be applied to any data structure to provide details about that object. This is an incredibly useful function that you will find yourself using over and over again.

str(m2)
##  num [1:4, 1:2] 1 2 3 4 6 7 8 9
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:2] "v1" "v2"

An array is a more general data structure, of which a vector (with 1 implicit dimension) and a matrix (with 2 defined dimensions) are special cases. Arrays can include additional dimensions, but (like vectors and matrices) they can only include elements that are all of the same atomic data class (e.g., numeric, character). The example below shows the construction of a 3-dimensional array with 5 rows, 6 columns, and 3 “levels”. Visualizing arrays of higher and higher dimension, obviously, becomes challenging!

a <- array(data = 1:90, dim = c(5, 6, 3))
a
## , , 1
## 
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    6   11   16   21   26
## [2,]    2    7   12   17   22   27
## [3,]    3    8   13   18   23   28
## [4,]    4    9   14   19   24   29
## [5,]    5   10   15   20   25   30
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]   31   36   41   46   51   56
## [2,]   32   37   42   47   52   57
## [3,]   33   38   43   48   53   58
## [4,]   34   39   44   49   54   59
## [5,]   35   40   45   50   55   60
## 
## , , 3
## 
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]   61   66   71   76   81   86
## [2,]   62   67   72   77   82   87
## [3,]   63   68   73   78   83   88
## [4,]   64   69   74   79   84   89
## [5,]   65   70   75   80   85   90
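To make the relationship between vectors, matrices, and arrays a bit more concrete, here is a minimal sketch that compares their dimensions. The names m_demo and v_demo are hypothetical and are used only so that nothing defined elsewhere in this module is overwritten.

```r
a <- array(data = 1:90, dim = c(5, 6, 3))
dim(a)     # one entry per dimension: 5 6 3
length(a)  # total number of elements: 90

# A matrix is simply an array with exactly two dimensions
m_demo <- matrix(data = 1:6, nrow = 2, ncol = 3)
length(dim(m_demo))  # 2
is.array(m_demo)     # TRUE

# A plain vector has no dim attribute at all
v_demo <- 1:6
dim(v_demo)          # NULL
```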

Subsetting

You can select elements from vectors, matrices, and arrays by subsetting them using their index position(s) in bracket notation. For vectors, you would specify an index value in one dimension. For matrices, you would give the index values in two dimensions. For arrays generally, you would give index values for each dimension in the array.

For example, suppose you have the following vector:

v <- 1:100  #This creates a vector containing all integers from 1 to 100
v
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100

You can select the first 15 elements as follows:

v[1:15]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

You can also give a vector of index values to use to subset:

v[c(2, 4, 6, 8, 10)]
## [1]  2  4  6  8 10

You can also use a function or a calculation to subset a vector. What does the following return?

v <- 101:200
v[seq(from = 1, to = 100, by = 2)]
##  [1] 101 103 105 107 109 111 113 115 117 119 121 123 125 127 129 131 133 135 137
## [20] 139 141 143 145 147 149 151 153 155 157 159 161 163 165 167 169 171 173 175
## [39] 177 179 181 183 185 187 189 191 193 195 197 199



CHALLENGE 1


  • First, create a vector of one-word character strings comprising the first line of the novel Ulysses, by James Joyce: “Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed.”

  • Then, extract every third element of this vector.

    • HINT 1: Take a look at the {stringr} package and the function str_split() to see if you can easily divide a single string into a vector of substrings.

    • HINT 2: If you’re ambitious, the following function, gsub("[[:punct:]]","",*string*), applied to *string*, will remove punctuation and special characters from that string.


As an example for a matrix, suppose you have the following:

m <- matrix(data = 1:80, nrow = 8, ncol = 10, byrow = FALSE)
m
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    9   17   25   33   41   49   57   65    73
## [2,]    2   10   18   26   34   42   50   58   66    74
## [3,]    3   11   19   27   35   43   51   59   67    75
## [4,]    4   12   20   28   36   44   52   60   68    76
## [5,]    5   13   21   29   37   45   53   61   69    77
## [6,]    6   14   22   30   38   46   54   62   70    78
## [7,]    7   15   23   31   39   47   55   63   71    79
## [8,]    8   16   24   32   40   48   56   64   72    80

You can extract the element in row 4, column 5 and assign it to a new variable, x, as follows:

x <- m[4, 5]
x
## [1] 36

You can also extract an entire row or an entire column (or set of rows or set of columns) from a matrix by specifying the desired row or column number(s) and leaving the other value blank.
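For instance, leaving one index blank might look like the sketch below. It rebuilds the matrix m from above so that it runs on its own; the names r3, c2, and ends are hypothetical.

```r
m <- matrix(data = 1:80, nrow = 8, ncol = 10, byrow = FALSE)

r3 <- m[3, ]          # all of row 3, returned as a vector of 10 values
r3

c2 <- m[, 2]          # all of column 2, returned as a vector of 8 values
c2

ends <- m[c(1, 8), ]  # rows 1 and 8 together, returned as a 2 x 10 matrix
dim(ends)
```

Note that a single row or column comes back as a plain vector, while a set of rows or columns comes back as a matrix.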



CHALLENGE 2


  • Given the matrix, m, above, extract the 2nd, 3rd, and 6th columns and assign them to the variable x

  • Given the matrix, m, above, extract the 6th to 8th row and assign them to the variable x

  • Given the matrix, m, above, extract the elements from row 2, column 2 to row 6, column 9 and assign them to the variable x



CHALLENGE 3


  • Construct a 4-dimensional, 400 element array (5 x 5 x 4 x 4) named a consisting of the numbers 400 to 1 (i.e., a descending series)

  • Given this array, what would the following return?

    • a[1, 1, 1, 2]

    • a[2, 3, 2, ]

    • a[1:5, 1:5, 3, 3]

Overwriting

You can replace elements in a vector or matrix, or even entire rows or columns, by identifying the elements to be replaced and then assigning them new values.

Starting with the matrix, m, defined above, what will be the effects of the operations below? Pay careful attention to row and column index values, vector recycling, and automated conversion/recasting among data classes.

m[7, 1] <- 564
m[, 8] <- 2
m[2:5, 4:8] <- 1
m[2:5, 4:8] <- c(20, 19, 18, 17)
m[2:5, 4:8] <- matrix(data = c(20:1), nrow = 4, ncol = 5, byrow = TRUE)
m[, 8] <- c("a", "b")
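If you want to check your reasoning on something smaller first, the following sketch applies the same ideas (recycling and automatic conversion) to a hypothetical 2 x 3 matrix named z, which is not part of the exercise above.

```r
z <- matrix(data = 1:6, nrow = 2, ncol = 3)

z[, 2] <- 0           # a single value is recycled down the whole column
z[1, ] <- c(9, 8, 7)  # a vector of matching length replaces row 1
z

z[, 3] <- "x"         # a character value forces the WHOLE matrix to character
z
class(z[1, 1])        # "character", even though only column 3 was assigned
```

Because a matrix can hold only one data class, assigning a single string converts every element, not just the ones being replaced.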

Lists and Data Frames

Lists and data frames, unlike vectors, matrices, and arrays, can be used to group together a mix of R structures and objects. A single list could contain a matrix, vector of character strings, vector of factors, an array, even another list.

Lists are created using the list() function where the elements to add to the list are given as arguments to the function, separated by commas. Type in the following example:

s <- c("this", "is", "a", "vector", "of", "strings")  # here we have a vector
m <- matrix(data = 1:40, nrow = 5, ncol = 8)  # this is a matrix
b <- FALSE  # and this is a logical value
l <- list(s, m, b)  # this puts them all together as a list
l
## [[1]]
## [1] "this"    "is"      "a"       "vector"  "of"      "strings"
## 
## [[2]]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    6   11   16   21   26   31   36
## [2,]    2    7   12   17   22   27   32   37
## [3,]    3    8   13   18   23   28   33   38
## [4,]    4    9   14   19   24   29   34   39
## [5,]    5   10   15   20   25   30   35   40
## 
## [[3]]
## [1] FALSE

You can extract elements from a list similarly to how you would from other data structures, except that you use double brackets to reference a single element in the list.

l[[2]]  #let's take another look at our matrix
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    6   11   16   21   26   31   36
## [2,]    2    7   12   17   22   27   32   37
## [3,]    3    8   13   18   23   28   33   38
## [4,]    4    9   14   19   24   29   34   39
## [5,]    5   10   15   20   25   30   35   40

An extension of this notation can be used to access elements contained within an element in the list. For example:

l[[2]][2, 6]  #this will show us the entry at row 2, column 6 of the matrix in the second position of the list
## [1] 27

To reference/extract multiple elements from a list, you would use single bracket notation, which would itself return a list. This is called “list slicing”.

l[1:2]  # here we're slicing our list so it just has the first and second elements
## [[1]]
## [1] "this"    "is"      "a"       "vector"  "of"      "strings"
## 
## [[2]]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    6   11   16   21   26   31   36
## [2,]    2    7   12   17   22   27   32   37
## [3,]    3    8   13   18   23   28   33   38
## [4,]    4    9   14   19   24   29   34   39
## [5,]    5   10   15   20   25   30   35   40
l[c(1, 3)]  # and here we're only taking the first and third elements
## [[1]]
## [1] "this"    "is"      "a"       "vector"  "of"      "strings"
## 
## [[2]]
## [1] FALSE
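The difference between single and double brackets is easy to miss, so here is a minimal sketch using a small, hypothetical list named demo (separate from the list l above).

```r
demo <- list(c("a", "b"), 1:3, TRUE)

class(demo[2])     # "list": single brackets return a list of length 1
length(demo[2])    # 1
class(demo[[2]])   # "integer": double brackets return the element itself
length(demo[[2]])  # 3
```

In other words, demo[2] is still a list that wraps the vector, while demo[[2]] is the vector 1:3 itself.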

Using str() provides details about the three elements in our list:

str(l)
## List of 3
##  $ : chr [1:6] "this" "is" "a" "vector" ...
##  $ : int [1:5, 1:8] 1 2 3 4 5 6 7 8 9 10 ...
##  $ : logi FALSE

You can name the elements in a list using the names() function, which adds a name attribute to each list item.

names(l) <- c("string", "matrix", "logical")
names(l)
## [1] "string"  "matrix"  "logical"

Now, you can use the name of an item in the list to refer to it:

l$string  # this is calling the element named 'string' in list 'l'
## [1] "this"    "is"      "a"       "vector"  "of"      "strings"
l$matrix[3, 5]  # this is calling the entry in row 3, column 5 of the element named 'matrix' in list 'l'
## [1] 23



CHALLENGE 4


Create a list representing this (simplified) primate taxonomy:

  • Primates
    • Haplorhini
      • Anthropoidea
        • Platyrrhini
          • Cebidae
          • Atelidae
          • Pitheciidae
        • Catarrhini
          • Cercopithecidae
          • Hylobatidae
          • Hominidae
      • Tarsioidea
        • Tarsiidae
    • Strepsirhini
      • Lorisoidea
        • Lorisidae
        • Galagidae
      • Lemuroidea
        • Cheirogaleidae
        • Lepilemuridae
        • Indriidae
        • Lemuridae
        • Daubentoniidae

HINT: you can use lists as elements in a list, which would nest the names like in a phylogeny.

I’ll get you started with the Strepsirhini:

Lemuroidea <- c("Cheirogaleidae", "Lepilemuridae", "Indriidae", "Lemuridae",
    "Daubentoniidae")
Lorisoidea <- c("Lorisidae", "Galagidae")
Strepsirhini <- list(Lemuroidea, Lorisoidea)
names(Strepsirhini) <- c("Lemuroidea", "Lorisoidea")
Strepsirhini
## $Lemuroidea
## [1] "Cheirogaleidae" "Lepilemuridae"  "Indriidae"      "Lemuridae"     
## [5] "Daubentoniidae"
## 
## $Lorisoidea
## [1] "Lorisidae" "Galagidae"

A data frame is perhaps the most useful (and most familiar) data structure that we can operate with in R as it most closely aligns with how we tend to represent tabular data, with rows as cases or observations and columns as variables describing those observations (e.g., a measurement of a particular type). Each variable tends to be measured in the same units for every observation, so its values fall into a single data class, and each column can be thought of as analogous to a vector.

The data.frame() command can be used to create data frames from scratch, like this data frame representing data on some (fictional) BU undergrads:

df <- data.frame(firstName = c("Nitin", "Silvy", "Ingrid", "Claire"), major = c("Biology (CM)",
    "Anthropology (Bio)", "Anthropology (Medical)", "Biology (EEB)"), sex = c("M",
    "F", "F", "F"), yearInProgram = c(1, 2, 2, 3))
df
##   firstName                  major sex yearInProgram
## 1     Nitin           Biology (CM)   M             1
## 2     Silvy     Anthropology (Bio)   F             2
## 3    Ingrid Anthropology (Medical)   F             2
## 4    Claire          Biology (EEB)   F             3

Rather than creating one from scratch, we more commonly read tabular data into R from a pre-existing data file (e.g., a spreadsheet), which typically results in the table being represented as a data frame. The following code will read from the file “random-people.csv”, currently stored on a user’s Desktop:

df <- read.csv("~/Desktop/random-people.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)
# only print select columns
df[, c(1, 3, 4, 11, 12)]
##    gender name.first  name.last login.password            dob
## 1    male        ted     wright          rolex   11/8/73 1:33
## 2    male    quentin    schmitt         norton   5/24/51 3:16
## 3  female      laura   johansen        stevens  5/22/77 21:03
## 4    male     ismael    herrero         303030    8/1/58 9:13
## 5  female     susana     blanco          aloha  12/18/55 3:21
## 6    male      mason     wilson         topdog   6/23/60 9:19
## 7    male       lutz    strauio       close-up   7/20/80 3:51
## 8  female     kaylee     gordon       atlantis  3/24/48 12:22
## 9    male     baraek limoncuocu         tobias   5/8/92 22:01
## 10   male     basile     perrin          ellie   2/28/65 0:37
## 11   male      ruben      lopez           beth   5/27/76 3:30
## 12   male   valtteri   waisanen       nocturne 12/24/80 10:40
## 13 female    vanessa     brewer       gladiato  1/15/68 17:39
## 14 female   kimberly      brown       nebraska    1/9/86 8:54
## 15 female     loreen   baettner         rovers    9/3/49 4:56
## 16 female      becky    wallace          bambi   3/30/59 5:03
## 17   male     hector   gonzalez        calgary  12/10/53 6:48
## 18 female       ella       neva       f00tball  7/18/91 14:30
## 19   male      simon    barnaby         buddie   6/18/89 5:06
## 20   male        max      moser    penetrating  8/12/61 12:13
str(df)
## 'data.frame':    20 obs. of  17 variables:
##  $ gender           : chr  "male" "male" "female" "male" ...
##  $ name.title       : chr  "mr" "mr" "ms" "mr" ...
##  $ name.first       : chr  "ted" "quentin" "laura" "ismael" ...
##  $ name.last        : chr  "wright" "schmitt" "johansen" "herrero" ...
##  $ location.street  : chr  "2020 royal ln" "2433 rue dubois" "2142 elmelunden" "3897 calle del barquillo" ...
##  $ location.city    : chr  "coffs harbour" "vitry-sur-seine" "silkeboeg" "gandia" ...
##  $ location.state   : chr  "tasmania" "indre-et-loire" "hovedstaden" "ceuta" ...
##  $ location.postcode: chr  "4126" "99856" "16264" "61349" ...
##  $ email            : chr  "ted.wright@example.com" "quentin.schmitt@example.com" "laura.johansen@example.com" "ismael.herrero@example.com" ...
##  $ login.username   : chr  "organicleopard402" "bluegoose191" "orangebird528" "heavyswan518" ...
##  $ login.password   : chr  "rolex" "norton" "stevens" "303030" ...
##  $ dob              : chr  "11/8/73 1:33" "5/24/51 3:16" "5/22/77 21:03" "8/1/58 9:13" ...
##  $ date.registered  : chr  "5/5/07 20:26" "4/11/11 7:05" "5/16/14 15:53" "2/17/06 16:53" ...
##  $ phone            : chr  "01-0349-5128" "05-72-65-32-21" "81616775" "974-117-403" ...
##  $ cell             : chr  "0449-989-455" "06-83-24-92-41" "697-993-20" "665-791-673" ...
##  $ picture.large    : chr  "https://randomuser.me/api/portraits/men/48.jpg" "https://randomuser.me/api/portraits/men/53.jpg" "https://randomuser.me/api/portraits/women/70.jpg" "https://randomuser.me/api/portraits/men/79.jpg" ...
##  $ nat              : chr  "AU" "FR" "DK" "ES" ...

As for other data structures, you can select and subset data frames using bracket notation. You can also select named columns from a data frame using the $ operator.

df[, 3]
##  [1] "ted"      "quentin"  "laura"    "ismael"   "susana"   "mason"   
##  [7] "lutz"     "kaylee"   "baraek"   "basile"   "ruben"    "valtteri"
## [13] "vanessa"  "kimberly" "loreen"   "becky"    "hector"   "ella"    
## [19] "simon"    "max"
str(df[, 3])
##  chr [1:20] "ted" "quentin" "laura" "ismael" "susana" "mason" "lutz" ...
# returns a vector of data in column 3
df$name.last
##  [1] "wright"     "schmitt"    "johansen"   "herrero"    "blanco"    
##  [6] "wilson"     "strauio"    "gordon"     "limoncuocu" "perrin"    
## [11] "lopez"      "waisanen"   "brewer"     "brown"      "baettner"  
## [16] "wallace"    "gonzalez"   "neva"       "barnaby"    "moser"
str(df$name.last)
##  chr [1:20] "wright" "schmitt" "johansen" "herrero" "blanco" "wilson" ...
# returns a vector of data for column *name.last*
df[3]
##    name.first
## 1         ted
## 2     quentin
## 3       laura
## 4      ismael
## 5      susana
## 6       mason
## 7        lutz
## 8      kaylee
## 9      baraek
## 10     basile
## 11      ruben
## 12   valtteri
## 13    vanessa
## 14   kimberly
## 15     loreen
## 16      becky
## 17     hector
## 18       ella
## 19      simon
## 20        max
str(df[3])
## 'data.frame':    20 obs. of  1 variable:
##  $ name.first: chr  "ted" "quentin" "laura" "ismael" ...
# returns a data frame of data from column 3
df["name.last"]
##     name.last
## 1      wright
## 2     schmitt
## 3    johansen
## 4     herrero
## 5      blanco
## 6      wilson
## 7     strauio
## 8      gordon
## 9  limoncuocu
## 10     perrin
## 11      lopez
## 12   waisanen
## 13     brewer
## 14      brown
## 15   baettner
## 16    wallace
## 17   gonzalez
## 18       neva
## 19    barnaby
## 20      moser
str(df["name.last"])
## 'data.frame':    20 obs. of  1 variable:
##  $ name.last: chr  "wright" "schmitt" "johansen" "herrero" ...
# returns a data frame of data for column *name.last*

You can add rows (additional cases) or columns (additional variables) to a data frame using rbind() and cbind().

df <- cbind(df, id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
    17, 18, 19, 20))
df <- cbind(df, school = c("BU", "BU", "Harvard", "Harvard", "BU", "Boston College",
    "MIT", "BU", "BU", "UMASS Boston", "Harvard", "BU", "Boston College", "BU",
    "Harvard", "MIT", "Harvard", "BU", "UMASS Boston", "Harvard"))
df
##    gender name.title name.first  name.last            location.street
## 1    male         mr        ted     wright              2020 royal ln
## 2    male         mr    quentin    schmitt            2433 rue dubois
## 3  female         ms      laura   johansen            2142 elmelunden
## 4    male         mr     ismael    herrero   3897 calle del barquillo
## 5  female         ms     susana     blanco    2208 avenida de america
## 6    male         mr      mason     wilson           4576 wilson road
## 7    male         mr       lutz    strauio             5707 eichenweg
## 8  female       miss     kaylee     gordon            5475 camden ave
## 9    male         mr     baraek limoncuocu             2664 baedat cd
## 10   male         mr     basile     perrin   3683 avenue jean-jaurias
## 11   male   monsieur      ruben      lopez        4204 rue principale
## 12   male         mr   valtteri   waisanen             9850 hemeentie
## 13 female        mrs    vanessa     brewer          3436 henry street
## 14 female       miss   kimberly      brown            8654 manor road
## 15 female       miss     loreen   baettner              2234 erlenweg
## 16 female       miss      becky    wallace          5965 the crescent
## 17   male   monsieur     hector   gonzalez           5104 rue pasteur
## 18 female         ms       ella       neva             4620 visiokatu
## 19   male         mr      simon    barnaby             2206 simcoe st
## 20   male         mr        max      moser 3045 koenigsberger strasse
##      location.city         location.state location.postcode
## 1    coffs harbour               tasmania              4126
## 2  vitry-sur-seine         indre-et-loire             99856
## 3        silkeboeg            hovedstaden             16264
## 4           gandia                  ceuta             61349
## 5         mastoles            extremadura             29445
## 6          dunedin               taranaki             91479
## 7            emden            brandenburg             13341
## 8            flint                 oregon             84509
## 9            siirt                  tokat             86146
## 10      versailles               dordogne             25177
## 11      montricher                 aargau              7993
## 12          halsua          south karelia             58124
## 13       celbridge                 fingal             30030
## 14          bangor                borders          HI92 8RY
## 15 chemnitzer land          niedersachsen             27167
## 16             ely         west yorkshire           GR8 0UP
## 17         bercher             st. gallen              1745
## 18          kerava         finland proper             26385
## 19          odessa            nova scotia             90889
## 20          weimar mecklenburg-vorpommern             18553
##                              email    login.username login.password
## 1           ted.wright@example.com organicleopard402          rolex
## 2      quentin.schmitt@example.com      bluegoose191         norton
## 3       laura.johansen@example.com     orangebird528        stevens
## 4       ismael.herrero@example.com      heavyswan518         303030
## 5        susana.blanco@example.com    silverkoala701          aloha
## 6         mason.wilson@example.com    organicduck470         topdog
## 7         lutz.strauio@example.com    purplemouse467       close-up
## 8        kaylee.gordon@example.com beautifulgoose794       atlantis
## 9  baraek.limoncuoculu@example.com whitebutterfly599         tobias
## 10       basile.perrin@example.com organicmeercat418          ellie
## 11         ruben.lopez@example.com      crazybear348           beth
## 12   valtteri.waisanen@example.com        redswan919       nocturne
## 13      vanessa.brewer@example.com  purpleleopard113       gladiato
## 14      kimberly.brown@example.com  crazyelephant996       nebraska
## 15     loreen.baettner@example.com     blacktiger499         rovers
## 16       becky.wallace@example.com   crazypeacock937          bambi
## 17     hector.gonzalez@example.com     organiccat637        calgary
## 18           ella.neva@example.com  orangegorilla786       f00tball
## 19       simon.barnaby@example.com     redmeercat724         buddie
## 20           max.moser@example.com     bigladybug459    penetrating
##               dob date.registered          phone           cell
## 1    11/8/73 1:33    5/5/07 20:26   01-0349-5128   0449-989-455
## 2    5/24/51 3:16    4/11/11 7:05 05-72-65-32-21 06-83-24-92-41
## 3   5/22/77 21:03   5/16/14 15:53       81616775     697-993-20
## 4     8/1/58 9:13   2/17/06 16:53    974-117-403    665-791-673
## 5   12/18/55 3:21   10/3/02 17:55    917-199-202    612-612-929
## 6    6/23/60 9:19    12/1/08 8:31 (137)-326-5772 (700)-060-1523
## 7    7/20/80 3:51   4/10/11 20:44   0802-1871274   0170-4221269
## 8   3/24/48 12:22     5/5/13 8:14 (817)-962-1275 (831)-325-1142
## 9    5/8/92 22:01    9/12/04 0:56 (023)-879-4331 (837)-014-1113
## 10   2/28/65 0:37    4/12/07 4:50 04-16-53-97-17 06-63-08-15-52
## 11   5/27/76 3:30   4/17/06 17:40 (931)-692-1073 (747)-833-3781
## 12 12/24/80 10:40   9/22/03 20:47     02-227-661  042-153-83-79
## 13  1/15/68 17:39    2/5/05 21:02   041-167-2755   081-154-0245
## 14    1/9/86 8:54    12/3/11 0:41   017684 80873   0799-553-944
## 15    9/3/49 4:56   3/24/16 17:27   0693-0473309   0176-2769900
## 16   3/30/59 5:03    7/15/08 2:42   015395 04615   0740-849-325
## 17  12/10/53 6:48   3/19/04 20:51 (569)-609-0669 (032)-958-9790
## 18  7/18/91 14:30    3/17/14 7:13     02-351-279  043-436-42-30
## 19   6/18/89 5:06  12/24/12 12:08   618-983-5566   088-025-2948
## 20  8/12/61 12:13    7/19/03 1:08   0503-3338884   0175-7158431
##                                       picture.large nat id         school
## 1    https://randomuser.me/api/portraits/men/48.jpg  AU  1             BU
## 2    https://randomuser.me/api/portraits/men/53.jpg  FR  2             BU
## 3  https://randomuser.me/api/portraits/women/70.jpg  DK  3        Harvard
## 4    https://randomuser.me/api/portraits/men/79.jpg  ES  4        Harvard
## 5  https://randomuser.me/api/portraits/women/18.jpg  ES  5             BU
## 6    https://randomuser.me/api/portraits/men/60.jpg  NZ  6 Boston College
## 7    https://randomuser.me/api/portraits/men/31.jpg  DE  7            MIT
## 8  https://randomuser.me/api/portraits/women/65.jpg  US  8             BU
## 9    https://randomuser.me/api/portraits/men/94.jpg  TR  9             BU
## 10   https://randomuser.me/api/portraits/men/82.jpg  FR 10   UMASS Boston
## 11   https://randomuser.me/api/portraits/men/84.jpg  CH 11        Harvard
## 12   https://randomuser.me/api/portraits/men/80.jpg  FI 12             BU
## 13 https://randomuser.me/api/portraits/women/15.jpg  IE 13 Boston College
## 14 https://randomuser.me/api/portraits/women/49.jpg  GB 14             BU
## 15 https://randomuser.me/api/portraits/women/50.jpg  DE 15        Harvard
## 16 https://randomuser.me/api/portraits/women/67.jpg  GB 16            MIT
## 17    https://randomuser.me/api/portraits/men/9.jpg  CH 17        Harvard
## 18 https://randomuser.me/api/portraits/women/68.jpg  FI 18             BU
## 19   https://randomuser.me/api/portraits/men/61.jpg  CA 19   UMASS Boston
## 20   https://randomuser.me/api/portraits/men/49.jpg  DE 20        Harvard

Alternatively, you can extend a data frame by adding a new variable directly using the $ operator, like this:

df$school <- c("BU", "BU", "Harvard", "Harvard", "BU", "Boston College", "MIT",
    "BU", "BU", "UMASS Boston", "Harvard", "BU", "Boston College", "BU", "Harvard",
    "MIT", "Harvard", "BU", "UMASS Boston", "Harvard")
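The examples above all add columns. Adding a row with rbind() works along the same lines, as in this minimal sketch. The data frame students and the row newRow are hypothetical (they borrow names from the fictional undergrads example earlier), and the new row needs the same column names as the data frame it is bound to.

```r
students <- data.frame(firstName = c("Nitin", "Silvy"), yearInProgram = c(1, 2))

# The new case is itself a one-row data frame with matching column names
newRow <- data.frame(firstName = "Ingrid", yearInProgram = 2)

students <- rbind(students, newRow)
nrow(students)  # 3
students
```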

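NOTE: In versions of R before 4.0.0, cbind() results in school being added as a factor, while using the $ operator results in school being added as a character vector. From R 4.0.0 onward, stringsAsFactors defaults to FALSE, so both approaches add school as a character vector. You can check which behavior you are getting by using the str() command.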

A factor is another data class in R, designed for dealing efficiently with nominal variables, usually character strings. Internally, R assigns integer values to each unique string (e.g., 1 for “female”, 2 for “male”, etc.) and stores the strings themselves as the “levels” of the factor.
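A short sketch can make the integer coding visible. The vector sex below is hypothetical example data, not taken from the file used above.

```r
sex <- factor(c("male", "female", "female", "male"))

levels(sex)      # the unique strings, sorted: "female" "male"
as.integer(sex)  # the internal codes: 2 1 1 2
str(sex)         # shows both the levels and the codes
```

By default the levels are ordered alphabetically, which is why “female” is coded as 1 and “male” as 2 here.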

Filtering

Logical vectors may also be used to subset data frames. Here, we subset the data frame for only those rows where the variable school is “BU”.

new_df <- df[df$school == "BU", ]
new_df
##    gender name.title name.first  name.last         location.street
## 1    male         mr        ted     wright           2020 royal ln
## 2    male         mr    quentin    schmitt         2433 rue dubois
## 5  female         ms     susana     blanco 2208 avenida de america
## 8  female       miss     kaylee     gordon         5475 camden ave
## 9    male         mr     baraek limoncuocu          2664 baedat cd
## 12   male         mr   valtteri   waisanen          9850 hemeentie
## 14 female       miss   kimberly      brown         8654 manor road
## 18 female         ms       ella       neva          4620 visiokatu
##      location.city location.state location.postcode
## 1    coffs harbour       tasmania              4126
## 2  vitry-sur-seine indre-et-loire             99856
## 5         mastoles    extremadura             29445
## 8            flint         oregon             84509
## 9            siirt          tokat             86146
## 12          halsua  south karelia             58124
## 14          bangor        borders          HI92 8RY
## 18          kerava finland proper             26385
##                              email    login.username login.password
## 1           ted.wright@example.com organicleopard402          rolex
## 2      quentin.schmitt@example.com      bluegoose191         norton
## 5        susana.blanco@example.com    silverkoala701          aloha
## 8        kaylee.gordon@example.com beautifulgoose794       atlantis
## 9  baraek.limoncuoculu@example.com whitebutterfly599         tobias
## 12   valtteri.waisanen@example.com        redswan919       nocturne
## 14      kimberly.brown@example.com  crazyelephant996       nebraska
## 18           ella.neva@example.com  orangegorilla786       f00tball
##               dob date.registered          phone           cell
## 1    11/8/73 1:33    5/5/07 20:26   01-0349-5128   0449-989-455
## 2    5/24/51 3:16    4/11/11 7:05 05-72-65-32-21 06-83-24-92-41
## 5   12/18/55 3:21   10/3/02 17:55    917-199-202    612-612-929
## 8   3/24/48 12:22     5/5/13 8:14 (817)-962-1275 (831)-325-1142
## 9    5/8/92 22:01    9/12/04 0:56 (023)-879-4331 (837)-014-1113
## 12 12/24/80 10:40   9/22/03 20:47     02-227-661  042-153-83-79
## 14    1/9/86 8:54    12/3/11 0:41   017684 80873   0799-553-944
## 18  7/18/91 14:30    3/17/14 7:13     02-351-279  043-436-42-30
##                                       picture.large nat id school
## 1    https://randomuser.me/api/portraits/men/48.jpg  AU  1     BU
## 2    https://randomuser.me/api/portraits/men/53.jpg  FR  2     BU
## 5  https://randomuser.me/api/portraits/women/18.jpg  ES  5     BU
## 8  https://randomuser.me/api/portraits/women/65.jpg  US  8     BU
## 9    https://randomuser.me/api/portraits/men/94.jpg  TR  9     BU
## 12   https://randomuser.me/api/portraits/men/80.jpg  FI 12     BU
## 14 https://randomuser.me/api/portraits/women/49.jpg  GB 14     BU
## 18 https://randomuser.me/api/portraits/women/68.jpg  FI 18     BU

In this case, R evaluates the expression df$school == "BU" and returns a logical vector equal in length to the number of rows in df. It then subsets df based on that vector, returning only rows that evaluate to TRUE.
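To see the intermediate logical vector for yourself, you could try something like the following sketch. It uses a tiny, hypothetical data frame named people rather than the full df, and the name keep is also just illustrative.

```r
people <- data.frame(name = c("ted", "laura", "mason"),
    school = c("BU", "Harvard", "BU"))

keep <- people$school == "BU"  # one TRUE/FALSE value per row
keep                           # TRUE FALSE TRUE
sum(keep)                      # TRUEs count as 1, so this is the number of matches: 2

people[keep, ]                 # only the rows where keep is TRUE
```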

We can also choose to only return particular columns when we subset.

new_df <- df[df$school == "BU", c("name.last", "name.first", "school")]
new_df
##     name.last name.first school
## 1      wright        ted     BU
## 2     schmitt    quentin     BU
## 5      blanco     susana     BU
## 8      gordon     kaylee     BU
## 9  limoncuocu     baraek     BU
## 12   waisanen   valtteri     BU
## 14      brown   kimberly     BU
## 18       neva       ella     BU

Factors, Conversion and Coercion, and Special Data Values

We were introduced to the factor data class above. Again, a factor stores integer codes that R uses internally, each of which corresponds to a character-valued “level”.

When we load in data from an external source (as we will do in our next module), versions of R before 4.0.0 import character string data as factors by default, assigning to each unique string an integer numeric code, and assigning the string as a “label” for that code. From R 4.0.0 onward, strings are kept as character data unless you set stringsAsFactors = TRUE. Using factors can make some code run much more quickly (e.g., in ANOVA, ANCOVA, and other forms of regression using categorical variables).

You can convert factor to character data (and vice versa) using the as.character() or as.factor() commands. You can also convert/coerce any vector to a different class using similar constructs (e.g., as.numeric()), although not all such conversions are really meaningful. Converting factor data to numeric results in the converted data having the value of R’s internal numeric code for the factor level, while converting character data to numeric results in any string that cannot be read as a number being coerced, with a warning, into the special data value of NA (see below) for missing data.
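The following sketch illustrates both of these conversions. The factor f is hypothetical example data, and suppressWarnings() is used only to keep the coercion warning out of the output.

```r
f <- factor(c("10", "20", "10"))

as.numeric(f)                # the internal codes, NOT the labels: 1 2 1
as.numeric(as.character(f))  # go through character to recover the values: 10 20 10

# Strings that look like numbers convert; anything else becomes NA
suppressWarnings(as.numeric(c("3.5", "abc")))  # 3.5 NA
```

If a factor's labels are really numbers, converting through as.character() first is one common way to avoid getting the internal codes by mistake.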

Finally, R has three special data values that it uses in a variety of situations.

  • NA (for not available) is used for missing data. Many statistical functions offer the possibility to include as an argument na.rm=TRUE (“remove NAs”) so that NAs are excluded from a calculation.
  • Inf (and -Inf) is used when the result of a numerical calculation is too extreme for R to express, e.g., when you divide a nonzero number by zero.
  • NaN (for not a number) is used when R cannot express the result of a calculation, e.g., when you try to take the square root of a negative number.
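Each of the special values listed above can be produced in a line or two, as in this minimal sketch. The vector x is hypothetical, and suppressWarnings() simply hides the warning that sqrt() gives for a negative input.

```r
x <- c(1, NA, 3)
mean(x)                     # NA: a missing value makes the result missing
mean(x, na.rm = TRUE)       # 2: the NA is dropped before the mean is taken
is.na(x)                    # FALSE TRUE FALSE

1 / 0                       # Inf
-1 / 0                      # -Inf
2^1024                      # Inf: too large to store as a number

suppressWarnings(sqrt(-1))  # NaN
0 / 0                       # NaN
```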



CHALLENGE 5


  • Store the following numbers as a 5 x 3 matrix, called “m3”: 3, 0, 1, 23, 1, 2, 33, 1, 1, 42, 0, 1, 41, 0, 2. Be sure to fill the matrix ROW-WISE.

  • Then, do the following:

    • Coerce the matrix to a data frame.

    • As a data frame, coerce the second column to be logical-valued

    • As a data frame, coerce the third column to be factor-valued

When you are done, use the str() command to show the data type for each variable in your data frame.