Exploratory Data Analysis


Objectives

The objective of this module is to begin exploring data using the summary functions and graphing abilities of R.

Preliminaries

The Backstory

R has some very easy-to-use functions for taking a quick tour of your data. We have seen some of these already (e.g., head(), tail(), and str()), and you should always use these right after loading a dataset. Also useful are dim(), which returns the number of rows and columns in a data frame, as well as names(), colnames(), and sometimes rownames().
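As a quick illustration of these functions, the sketch below uses `iris`, a small data frame that ships with R. It is not one of this module's datasets and only stands in for a freshly loaded data frame; the name `d_demo` is likewise just illustrative.

```r
# `iris` ships with R; it stands in for any freshly loaded data frame
d_demo <- iris

head(d_demo, 3)        # first three rows
tail(d_demo, 3)        # last three rows
str(d_demo)            # compact view of column types and leading values
dim(d_demo)            # number of rows, then number of columns
names(d_demo)          # column names (colnames() gives the same for a data frame)
rownames(d_demo)[1:5]  # first five row names
```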

As an aside, you can use the attach() function to make variables within data frames accessible in R with fewer keystrokes. The attach() function places the data frame named as its argument on R's search path so that, as long as the data frame is attached, its variables can be called by name without explicitly referring to the data frame. That is, if you attach() a data frame, then you do not need to use the $ operator or bracket notation to refer to a particular variable. It is important to remember to detach() data frames when finished. It is also possible to attach multiple data frames (and the same data frame multiple times), and, if these share variable names, then the more recently attached one will mask the other. Thus, it is best to attach only one data frame at a time (or none at all).
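The masking behavior described above can be seen in a minimal sketch. The data frames `df1` and `df2` are made up for illustration, and the sketch assumes no variable named `x` already exists in your workspace (a workspace variable would take precedence over both).

```r
# Two hypothetical data frames that share the variable name `x`
df1 <- data.frame(x = 1:3)
df2 <- data.frame(x = 4:6)

attach(df1)
m1 <- mean(x)   # x is found in df1
attach(df2)     # R reports that `x` from df1 is now masked
m2 <- mean(x)   # x is now found in df2, the more recently attached data frame

m1
m2

# Detach in reverse order when finished
detach(df2)
detach(df1)
```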

The with() function accomplishes much the same thing as attach() but is self-contained and cleaner, especially for use in functions. If you use with(), all code to be run should be included as an argument of the function.
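A short sketch of with(), using a made-up data frame (`df_demo` and its columns are hypothetical): the expression passed as the second argument is evaluated with the data frame's columns in scope, and nothing needs to be detached afterwards.

```r
# Hypothetical data: heights in cm, weights in kg
df_demo <- data.frame(height = c(150, 160, 170), weight = c(50, 60, 70))

# The expression is evaluated with the columns of df_demo in scope
bmi <- with(df_demo, weight / (height / 100)^2)
bmi

# Equivalent, using the $ operator
df_demo$weight / (df_demo$height / 100)^2
```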

Some Useful Procedures for Exploratory Data Analysis

For univariate data


CHALLENGE 1


Summary: The summary() function provides a quick overview of each column in a data frame. For numeric variables, this includes the minimum, 25th percentile, median, mean, 75th percentile, and maximum of the data, as well as a count of NA (missing) values. For factors, it includes a count of observations at each factor level.
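To see how summary() treats different column types, here is a small sketch on a made-up data frame (`toy` and its columns are hypothetical, not part of the module's data):

```r
# A made-up data frame with a numeric, a factor, and a character column
toy <- data.frame(
  size  = c(1.2, 3.4, NA, 5.6, 7.8),
  group = factor(c("a", "b", "a", "a", "b")),
  label = c("v", "w", "x", "y", "z"),
  stringsAsFactors = FALSE
)

summary(toy)        # numeric: quantiles, mean, NA count; factor: counts per level
summary(toy$size)   # the numeric summary for a single column
summary(toy$group)  # counts per level of the factor
```

Character columns such as `label` are only described by their length, class, and mode, which matches what the country and govt_form columns show in the output further down.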

Load the Country-Data-2016 dataset into a data frame variable, d, and summarize the variables in that data frame. You can load the file any way you want, e.g., load from a local file, or you can access the data straight from GitHub, as in the code below.

library(curl)
## Using libcurl 7.54.0 with LibreSSL/2.6.5
f <- curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall23/Country-Data-2016.csv")
d <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(d)
##          country population    area                      govt_form birthrate
## 1    Afghanistan   32564342  652230               islamic republic      38.6
## 2        Albania    3029278   28748                       republic      12.9
## 3        Algeria   39542166 2381741                       republic      23.7
## 4 American Samoa      54343     199 territory of the United States      22.9
## 5        Andorra      85580     468        constitutional monarchy       8.1
## 6         Angola   19625353 1246700                       republic      38.8
##   deathrate life_expect mammals birds reptiles amphibians fishes mollucs
## 1      13.9        50.9      11    17        1          1      5       0
## 2       6.6        78.1       3    10        4          2     44      49
## 3       4.3        76.6      14    14        8          3     40      10
## 4       4.8        75.1       1     8        6          0     12       5
## 5       7.0        82.7       2     2        1          0      0       3
## 6      11.5        55.6      17    26        5          0     51       5
##   other_inverts plants fungi_protists
## 1             2      5              0
## 2            13      0              0
## 3            26     18              0
## 4            59      1              0
## 5             4      0              0
## 6             4     34              0
summary(d)
##    country            population             area           govt_form        
##  Length:248         Min.   :3.000e+01   Min.   :       0   Length:248        
##  Class :character   1st Qu.:2.991e+05   1st Qu.:    1769   Class :character  
##  Mode  :character   Median :4.912e+06   Median :   69700   Mode  :character  
##                     Mean   :2.999e+07   Mean   :  610952                     
##                     3rd Qu.:1.803e+07   3rd Qu.:  398754                     
##                     Max.   :1.367e+09   Max.   :17098242                     
##                     NA's   :6           NA's   :1                            
##    birthrate       deathrate      life_expect       mammals      
##  Min.   : 0.00   Min.   : 0.00   Min.   :49.80   Min.   :  0.00  
##  1st Qu.:11.40   1st Qu.: 5.65   1st Qu.:67.40   1st Qu.:  3.00  
##  Median :16.40   Median : 7.40   Median :74.70   Median :  8.00  
##  Mean   :18.95   Mean   : 7.61   Mean   :72.19   Mean   : 13.85  
##  3rd Qu.:24.35   3rd Qu.: 9.40   3rd Qu.:78.40   3rd Qu.: 15.00  
##  Max.   :45.50   Max.   :14.90   Max.   :89.50   Max.   :188.00  
##  NA's   :17      NA's   :17      NA's   :19      NA's   :3       
##      birds           reptiles         amphibians          fishes      
##  Min.   :  0.00   Min.   :  0.000   Min.   :  0.000   Min.   :  0.00  
##  1st Qu.:  6.00   1st Qu.:  2.000   1st Qu.:  0.000   1st Qu.: 11.00  
##  Median : 12.00   Median :  5.000   Median :  0.000   Median : 25.00  
##  Mean   : 17.82   Mean   :  8.331   Mean   :  9.849   Mean   : 32.84  
##  3rd Qu.: 19.00   3rd Qu.:  8.000   3rd Qu.:  4.000   3rd Qu.: 43.00  
##  Max.   :165.00   Max.   :139.000   Max.   :215.000   Max.   :249.00  
##  NA's   :3        NA's   :3         NA's   :3         NA's   :3       
##     mollucs       other_inverts        plants        fungi_protists   
##  Min.   :  0.00   Min.   :  0.00   Min.   :   0.00   Min.   : 0.0000  
##  1st Qu.:  0.00   1st Qu.:  3.00   1st Qu.:   2.00   1st Qu.: 0.0000  
##  Median :  1.00   Median : 11.00   Median :  10.00   Median : 0.0000  
##  Mean   :  9.62   Mean   : 32.57   Mean   :  60.78   Mean   : 0.6082  
##  3rd Qu.:  6.00   3rd Qu.: 33.00   3rd Qu.:  44.00   3rd Qu.: 0.0000  
##  Max.   :301.00   Max.   :340.00   Max.   :1856.00   Max.   :12.0000  
##  NA's   :3        NA's   :3        NA's   :3         NA's   :3
names(d)
##  [1] "country"        "population"     "area"           "govt_form"     
##  [5] "birthrate"      "deathrate"      "life_expect"    "mammals"       
##  [9] "birds"          "reptiles"       "amphibians"     "fishes"        
## [13] "mollucs"        "other_inverts"  "plants"         "fungi_protists"
  • What are the median area and population size of all countries in the dataset? HINT: There are a couple of ways to do this… try summary() and median() (for the latter, you’ll need to use the na.rm = TRUE argument)
  • Create a new density variable in your data frame which is population / area. What are the 10 most dense countries? The 10 least dense? HINT: check out the order() function
d$density <- d$population/d$area
d <- d[order(-d$density), ]
d[1:10, ]
##                     country population   area
## 130                   Macau     592731   28.0
## 145                  Monaco      30535    2.0
## 97  Holy See (Vatican City)        842    0.1
## 199               Singapore    5674472  697.0
## 99                Hong Kong    7141106 1108.0
## 84                Gibraltar      29258    7.0
## 17                  Bahrain    1346613  760.0
## 135                Maldives     393253  298.0
## 137                   Malta     413965  316.0
## 24                  Bermuda      70196   54.0
##                                  govt_form birthrate deathrate life_expect
## 130 special administrative region of China       8.9       4.2        84.5
## 145                constitutional monarchy       6.7       9.2        89.5
## 97                                monarchy        NA        NA          NA
## 199                               republic       8.3       3.4        84.7
## 99  special administrative region of China       9.2       7.1        82.9
## 84              British overseas territory      14.1       8.4        79.3
## 17                 constitutional monarchy      13.7       2.7        78.7
## 135                               republic      15.8       3.9        75.4
## 137                               republic      10.2       9.1        80.2
## 24              British overseas territory      11.3       8.2        81.2
##     mammals birds reptiles amphibians fishes mollucs other_inverts plants
## 130       0     4        1          0      5       0             1      0
## 145       3     0        0          0     15       0             3      0
## 97        1     0        0          0      0       0             0      0
## 199      13    17        6          0     27       0           173     58
## 99        3    20        5          5     13       1             7      9
## 84        4     5        0          0     18       3             2      0
## 17        3     6        4          0     10       0            13      0
## 135       2     0        3          0     24       0            46      0
## 137       2     5        1          0     22       3             2      4
## 24        4     1        4          0     26       0            28      8
##     fungi_protists   density
## 130              0 21168.964
## 145              0 15267.500
## 97               0  8420.000
## 199              0  8141.280
## 99               0  6445.042
## 84               0  4179.714
## 17               0  1771.859
## 135              0  1319.641
## 137              0  1310.016
## 24               0  1299.926
d <- d[order(d$density), ]
d[1:10, ]
##                                      country population    area
## 206 South Georgia and South Sandwich Islands         30    3903
## 86                                 Greenland      57733 2166086
## 70                          Falkland Islands       3361   12173
## 175                         Pitcairn Islands         48      47
## 146                                 Mongolia    2992908 1564116
## 245                           Western Sahara     570866  266000
## 76                             French Guiana     181000   83534
## 152                                  Namibia    2212307  824292
## 13                                 Australia   22751014 7741220
## 101                                  Iceland     331918  103000
##                        govt_form birthrate deathrate life_expect mammals birds
## 206   British overseas territory        NA        NA          NA       3     6
## 86  autonomous region of Denmark      14.5       8.5        72.1       9     3
## 70    British overseas territory        NA        NA          NA       4     9
## 175   British overseas territory        NA        NA          NA       1    10
## 146                     republic      20.3       6.4        69.3      11    24
## 245 autonomous region of Morocco      30.2       8.3        62.6      10     3
## 76  overseas territory of France       0.0       0.0        76.1       8     8
## 152                     republic      19.8      13.9        51.6      14    28
## 13        parliamentary monarchy      12.2       7.1        82.2      63    50
## 101                     republic      13.9       6.3        83.0       6     4
##     reptiles amphibians fishes mollucs other_inverts plants fungi_protists
## 206        0          0      0       0             0      0              0
## 86         0          0      9       0             0      1              0
## 70         0          0      5       0             0      5              0
## 175        0          0     10       5            11      7              0
## 146        0          0      2       0             3      0              0
## 245        1          0     31       0             3      0              0
## 76         7          3     29       0             0     18              0
## 152        5          1     33       0             4     28              0
## 13        43         47    118     174           340     92              1
## 101        0          0     16       0             0      0              0
##         density
## 206 0.007686395
## 86  0.026653143
## 70  0.276102851
## 175 1.021276596
## 146 1.913482120
## 245 2.146112782
## 76  2.166782388
## 152 2.683887506
## 13  2.938944249
## 101 3.222504854
  • Extract data from the 20 largest countries into a new variable. What are the median area and population size of these countries?
  • Extract data from all countries beginning with the letters “A” through “F”. What are the mean area and population size of these countries?
new <- d[grep("^[A-F]", d$country), ]
summary(new)
##    country            population             area           govt_form        
##  Length:78          Min.   :5.960e+02   Min.   :      14   Length:78         
##  Class :character   1st Qu.:2.991e+05   1st Qu.:    4066   Class :character  
##  Mode  :character   Median :4.785e+06   Median :   51148   Mode  :character  
##                     Mean   :3.507e+07   Mean   :  918248                     
##                     3rd Qu.:1.469e+07   3rd Qu.:  466498                     
##                     Max.   :1.367e+09   Max.   :14000000                     
##                     NA's   :4                                                
##    birthrate       deathrate       life_expect       mammals    
##  Min.   : 0.00   Min.   : 0.000   Min.   :49.80   Min.   : 0.0  
##  1st Qu.:11.65   1st Qu.: 5.850   1st Qu.:68.75   1st Qu.: 3.0  
##  Median :15.90   Median : 7.700   Median :75.50   Median : 7.0  
##  Mean   :18.77   Mean   : 7.861   Mean   :72.25   Mean   :13.4  
##  3rd Qu.:23.30   3rd Qu.: 9.500   3rd Qu.:78.40   3rd Qu.:14.0  
##  Max.   :42.00   Max.   :14.400   Max.   :82.70   Max.   :81.0  
##  NA's   :7       NA's   :7        NA's   :7                     
##      birds           reptiles        amphibians         fishes      
##  Min.   :  0.00   Min.   : 0.000   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  6.00   1st Qu.: 2.000   1st Qu.:  0.00   1st Qu.: 10.00  
##  Median : 11.00   Median : 5.000   Median :  0.00   Median : 24.50  
##  Mean   : 18.62   Mean   : 7.397   Mean   : 11.86   Mean   : 29.54  
##  3rd Qu.: 18.00   3rd Qu.: 8.000   3rd Qu.:  3.00   3rd Qu.: 41.50  
##  Max.   :165.00   Max.   :43.000   Max.   :215.00   Max.   :133.00  
##                                                                     
##     mollucs       other_inverts        plants        fungi_protists  
##  Min.   :  0.00   Min.   :  0.00   Min.   :   0.00   Min.   :0.0000  
##  1st Qu.:  0.00   1st Qu.:  4.00   1st Qu.:   2.25   1st Qu.:0.0000  
##  Median :  1.00   Median : 11.00   Median :  10.00   Median :0.0000  
##  Mean   : 10.17   Mean   : 23.63   Mean   :  70.64   Mean   :0.6026  
##  3rd Qu.:  5.00   3rd Qu.: 25.25   3rd Qu.:  41.75   3rd Qu.:0.0000  
##  Max.   :174.00   Max.   :340.00   Max.   :1856.00   Max.   :8.0000  
##                                                                      
##     density         
##  Min.   :   0.2761  
##  1st Qu.:  24.5932  
##  Median :  75.0297  
##  Mean   : 162.4785  
##  3rd Qu.: 140.7140  
##  Max.   :1771.8592  
##  NA's   :4

Or, alternatively…

mean(new$population, na.rm = TRUE)
## [1] 35065172
mean(new$area, na.rm = TRUE)
## [1] 918247.7
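The first and third bullets of this challenge are not worked in the code above. One possible approach is sketched here on a six-row toy data frame whose values are copied from the head(d) output; "largest" is assumed to mean largest by area, and the names `toy`, `n`, and `largest` are illustrative. With the full data frame you would use `d` in place of `toy` and `n <- 20`.

```r
# Six rows copied from the head(d) output above; the full dataset has 248 rows
toy <- data.frame(
  country    = c("Afghanistan", "Albania", "Algeria",
                 "American Samoa", "Andorra", "Angola"),
  population = c(32564342, 3029278, 39542166, 54343, 85580, 19625353),
  area       = c(652230, 28748, 2381741, 199, 468, 1246700),
  stringsAsFactors = FALSE
)

# Medians; na.rm = TRUE drops missing values (needed for the full dataset)
median(toy$area, na.rm = TRUE)
median(toy$population, na.rm = TRUE)

# "Largest" is taken to mean largest by area (an assumption).
# Use n <- 20 with the full data frame; n <- 3 suits this toy example.
n <- 3
largest <- toy[order(toy$area, decreasing = TRUE), ][1:n, ]
largest$country
median(largest$area, na.rm = TRUE)
median(largest$population, na.rm = TRUE)
```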

CHALLENGE 2


Boxplots: The boxplot() function provides a box-and-whiskers visual representation of the five-number summary plus outliers that go beyond the bulk of the data. The function balks if you pass it nonnumeric data, so you may need to reference columns specifically using either bracket notation or the $ operator.
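To make the link between a boxplot and the five-number summary concrete, the sketch below compares fivenum() with the statistics that boxplot() computes. The vector `x` is made up and contains one deliberately extreme value.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)  # made-up values with one extreme point

fivenum(x)        # minimum, lower hinge, median, upper hinge, maximum

b <- boxplot(x, plot = FALSE)  # compute the statistics without drawing
b$stats           # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out             # points drawn individually as outliers

boxplot(x)        # draw the plot itself
```

Note that when outliers are present, the whiskers stop at the most extreme points within 1.5 times the box length of the hinges, so the upper whisker here differs from the maximum reported by fivenum().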

Barplots : The barplot() function is useful for crude data, with bar height proportional to the value of the variable. The function dotchart() provides a similar graphical summary.

Make boxplots of the raw population and area data, then do the same after log() transforming these variables.

NOTE: The par() command will let you set up a grid of panels in which to plot. Here, I set up a two-row by three-column grid.

par(mfrow = c(2, 3))
boxplot(d$population)
boxplot(log(d$population))
boxplot(d$area)
boxplot(log(d$area))
barplot(d$population)
barplot(d$area)


CHALLENGE 3


Histograms : The hist() function returns a histogram showing the complete empirical distribution of the data in binned categories, which is useful for checking skewness of the data, symmetry, multi-modality, etc. Setting the argument freq=FALSE will scale the Y axis to represent probability density (bar heights are rescaled so that the total area of the bars equals 1) rather than the count of observations falling into each bin.
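A minimal sketch of what freq=FALSE changes, using simulated data rather than the module's datasets (the seed and the names `x`, `h`, and `bin_width` are arbitrary). It checks that the bar heights are counts rescaled so that the bars' areas sum to 1.

```r
set.seed(1)                  # arbitrary seed so the sketch is repeatable
x <- rnorm(200)              # simulated data, not from the module's datasets

h <- hist(x, plot = FALSE)   # compute the bins without drawing
h$counts                     # bar heights when freq = TRUE
h$density                    # bar heights when freq = FALSE

# Densities are counts divided by (total count * bin width)
bin_width <- diff(h$breaks)
sum(h$density * bin_width)   # total area of the bars
all.equal(h$density, h$counts / (sum(h$counts) * bin_width))

hist(x, freq = FALSE)        # draw the density-scaled histogram
```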

Make histograms of the log() transformed population and area data from the Country-Data-2016 file. Explore what happens if you set freq=FALSE versus the default of freq=TRUE. Try looking at other variables as well.

par(mfrow = c(1, 2))  # gives us two panels
attach(d)
hist(log(population), freq = FALSE, col = "red", main = "Plot 1", xlab = "log(population size)",
    ylab = "density", ylim = c(0, 0.2))
hist(log(area), freq = FALSE, col = "red", main = "Plot 2", xlab = "log(area)",
    ylab = "density", ylim = c(0, 0.2))

NOTE: You can add a line to your histograms (e.g., to show the mean value for a variable) using the abline() command. For example, to show a single vertical line representing the mean log(population size), you would call abline() with the argument v = mean(log(population), na.rm = TRUE); the na.rm = TRUE is needed because population contains missing values.


Density plot : The density() function computes a non-parametric estimate of the distribution of a variable, which can be combined with plot() to also yield a graphical view of the distribution of the data. If your data have missing values, then you need to add the argument na.rm=TRUE to the density() function. To superimpose a density curve on a histogram, you can pass the output of density() to the lines() function.
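The effect of missing values on density() can be sketched with a small made-up vector (`x`, `res`, and `dens` are illustrative names). The tryCatch() call is only there to capture the error message instead of halting the script.

```r
x <- c(2.1, 3.5, NA, 4.0, 5.2, 3.3, NA, 4.8)  # made-up values with two NAs

# Without na.rm = TRUE, density() stops with an error on missing values
res <- tryCatch(density(x), error = function(e) conditionMessage(e))
res

dens <- density(x, na.rm = TRUE)  # kernel density estimate from the observed values
dens$n                            # number of observations actually used

hist(x, freq = FALSE)             # hist() drops missing values on its own
lines(dens, col = "green")        # overlay the density curve
```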

par(mfrow = c(1, 1))  # set up one panel and redraw the log(population) histogram
hist(log(population), freq = FALSE, col = "white", main = "My Plot with Mean and Density",
    xlab = "log(population size)", ylab = "density", ylim = c(0, 0.2))
abline(v = mean(log(population), na.rm = TRUE), col = "blue")
lines(density(log(population), na.rm = TRUE), col = "green")

detach(d)

CHALLENGE 4


Tables : the table() function can be used to summarize counts and proportions for categorical variables in your dataset.

Using the table() function, find the most common form of government in the Country-Data-2016 dataset. How many countries have that form? HINT: We can combine table() with sort() and the argument decreasing=TRUE to get the desired answer straight away:

sort(table(d$govt_form), decreasing = TRUE)
## 
##                                republic                 constitutional monarchy 
##                                     127                                      33 
##              British overseas territory            overseas territory of France 
##                                      12                                       7 
##           presidential federal republic                                monarchy 
##                                       7                                       6 
##                  parliamentary monarchy          territory of the United States 
##                                       5                                       5 
##                 parliamentary democracy                  territory of Australia 
##                                       4                                       4 
## autonomous territory of the Netherlands                        federal republic 
##                                       3                                       3 
##                        islamic republic                      socialist republic 
##                                       3                                       3 
##                       absolute monarchy            autonomous region of Denmark 
##                                       2                                       2 
##     autonomous territory of New Zealand            overseas community of France 
##                                       2                                       2 
##          parliamentary federal republic  special administrative region of China 
##                                       2                                       2 
##                     territory of Norway                       autonomous region 
##                                       2                                       1 
##             autonomous region of France            autonomous region of Morocco 
##                                       1                                       1 
##                              federation         foreign-administrated territory 
##                                       1                                       1 
##      islamic-socialist peoples republic                  parliamentary republic 
##                                       1                                       1 
##                        peoples republic                     territory of France 
##                                       1                                       1 
##                territory of New Zealand            territory of the Netherlands 
##                                       1                                       1
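The output above shows counts only. As a sketch of how to pull out the top category programmatically and get proportions as well, the example below uses a short made-up vector, `govt`, standing in for d$govt_form:

```r
# A made-up categorical vector standing in for d$govt_form
govt <- c("republic", "monarchy", "republic", "republic",
          "federation", "monarchy")

counts <- sort(table(govt), decreasing = TRUE)
counts
names(counts)[1]     # most common category
counts[[1]]          # how many observations fall in it
prop.table(counts)   # proportions instead of counts
```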

For multivariate data


CHALLENGE 5


Multiple boxplots or histograms can be laid out side-by-side or overlaid. For boxplots, the ~ operator can be read as “by”.
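Before turning to the real data, here is a minimal sketch of the formula interface on a made-up data frame (`toy`, `mass`, and `group` are hypothetical names and values):

```r
# Made-up body masses (g) for two hypothetical groups
toy <- data.frame(
  mass  = c(80, 95, 110, 4200, 5100, 6300),
  group = c("small", "small", "small", "large", "large", "large")
)

# Read as "log(mass) by group": one box per level of group
boxplot(log(mass) ~ group, data = toy)

# The same call with plot = FALSE returns the per-group statistics
b <- boxplot(log(mass) ~ group, data = toy, plot = FALSE)
b$names   # group labels, in the order the boxes are drawn
b$n       # observations per group
```

Groups are ordered alphabetically by default, so converting the grouping variable to a factor with explicit levels is one way to control the order of the boxes.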

Read in the dataset KamilarAndCooperData, which contains a host of information about 213 living primate species.

Spend some time exploring the data and then make boxplots of log(female body mass) ~ family. Try doing this with {base} graphics and then look at how we might do it in {ggplot2}, which provides a standard “grammar of graphics” (see the {ggplot2} documentation).

f <- curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall23/KamilarAndCooperData.csv")
d <- read.csv(f, header = TRUE, stringsAsFactors = FALSE)
attach(d)
head(d)
##               Scientific_Name          Family          Genus      Species
## 1 Allenopithecus_nigroviridis Cercopithecidae Allenopithecus nigroviridis
## 2         Allocebus_trichotis Cercopithecidae      Allocebus    trichotis
## 3           Alouatta_belzebul        Atelidae       Alouatta     belzebul
## 4             Alouatta_caraya        Atelidae       Alouatta       caraya
## 5            Alouatta_guariba        Atelidae       Alouatta      guariba
## 6           Alouatta_palliata        Atelidae       Alouatta     palliata
##   Brain_Size_Species_Mean Brain_Size_Female_Mean   Brain_size_Ref
## 1                   58.02                  53.70 Isler et al 2008
## 2                      NA                     NA                 
## 3                   52.84                  51.19 Isler et al 2008
## 4                   52.63                  47.80 Isler et al 2008
## 5                   51.70                  49.08 Isler et al 2008
## 6                   49.88                  48.04 Isler et al 2008
##   Body_mass_male_mean Body_mass_female_mean Mass_Dimorphism
## 1                6130                  3180           1.928
## 2                  92                    84           1.095
## 3                7270                  5520           1.317
## 4                6525                  4240           1.539
## 5                5800                  4550           1.275
## 6                7150                  5350           1.336
##                 Mass_Ref MeanGroupSize AdultMales AdultFemale AdultSexRatio
## 1       Isler et al 2008            NA         NA          NA            NA
## 2 Smith and Jungers 1997          1.00       1.00         1.0            NA
## 3       Isler et al 2008          7.00       1.00         1.0          1.00
## 4       Isler et al 2008          8.00       2.30         3.3          1.43
## 5       Isler et al 2008          6.53       1.37         2.2          1.61
## 6       Isler et al 2008         12.00       2.90         6.3          2.17
##                                                     Social_Organization_Ref
## 1                                                                          
## 2                                                             Kappeler 1997
## 3                                                       Campbell et al 2007
## 4 van Schaik et al. 1999; Kappeler and Pereira 2003; Nunn & van Schaik 2000
## 5                                                       Campbell et al 2007
## 6 van Schaik et al. 1999; Kappeler and Pereira 2003; Nunn & van Schaik 2000
##   InterbirthInterval_d Gestation WeaningAge_d MaxLongevity_m LitterSz
## 1                   NA        NA       106.15          276.0     1.01
## 2                   NA        NA           NA             NA     1.00
## 3                   NA        NA           NA             NA       NA
## 4               337.62       187       323.16          243.6     1.01
## 5                   NA        NA           NA             NA       NA
## 6               684.37       186       495.60          300.0     1.02
##    Life_History_Ref GR_MidRangeLat_dd Precip_Mean_mm Temp_Mean_degC AET_Mean_mm
## 1 Jones et al. 2009             -0.17         1574.0           25.2      1517.8
## 2                              -16.59         1902.3           20.3      1388.2
## 3                               -6.80         1643.5           24.9      1286.6
## 4 Jones et al. 2009            -20.34         1166.4           22.9      1193.1
## 5                              -21.13         1332.3           19.6      1225.7
## 6 Jones et al. 2009              6.95         1852.6           23.7      1300.0
##   PET_Mean_mm       Climate_Ref HomeRange_km2      HomeRangeRef DayLength_km
## 1      1589.4 Jones et al. 2009            NA                             NA
## 2      1653.7 Jones et al. 2009            NA                             NA
## 3      1549.8 Jones et al. 2009            NA                             NA
## 4      1404.9 Jones et al. 2009            NA                           0.40
## 5      1332.2 Jones et al. 2009          0.03 Jones et al. 2009           NA
## 6      1633.9 Jones et al. 2009          0.19 Jones et al. 2009         0.32
##       DayLengthRef Territoriality Fruit Leaves Fauna             DietRef1
## 1                              NA    NA                                  
## 2                              NA    NA                                  
## 3                              NA  57.3   19.1   0.0 Campbell et al. 2007
## 4 Nunn et al. 2003             NA  23.8   67.7   0.0 Campbell et al. 2007
## 5                              NA   5.2   73.0   0.0 Campbell et al. 2007
## 6 Nunn et al. 2003         0.6506  33.1   56.4   0.0 Campbell et al. 2007
##   Canine_Dimorphism Canine_Dimorphism_Ref  Feed  Move  Rest Social
## 1             2.210   Plavcan & Ruff 2008    NA    NA    NA     NA
## 2                NA                          NA    NA    NA     NA
## 3             1.811   Plavcan & Ruff 2008 13.75 18.75 57.30  10.00
## 4             1.542   Plavcan & Ruff 2008 15.90 17.60 61.60   4.90
## 5             1.783   Plavcan & Ruff 2008 18.33 14.33 64.37   3.00
## 6             1.703   Plavcan & Ruff 2008 17.94 12.32 66.14   3.64
##    Activity_Budget_Ref
## 1                     
## 2                     
## 3 Campbell et al. 2007
## 4 Campbell et al. 2007
## 5 Campbell et al. 2007
## 6 Campbell et al. 2007
summary(d)
##  Scientific_Name       Family             Genus             Species         
##  Length:213         Length:213         Length:213         Length:213        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Brain_Size_Species_Mean Brain_Size_Female_Mean Brain_size_Ref    
##  Min.   :  1.63          Min.   :  1.66         Length:213        
##  1st Qu.: 13.95          1st Qu.: 13.23         Class :character  
##  Median : 61.45          Median : 57.20         Mode  :character  
##  Mean   : 68.11          Mean   : 65.24                           
##  3rd Qu.: 88.48          3rd Qu.: 86.80                           
##  Max.   :491.27          Max.   :480.15                           
##  NA's   :42              NA's   :48                               
##  Body_mass_male_mean Body_mass_female_mean Mass_Dimorphism   Mass_Ref        
##  Min.   :    31      Min.   :   30.0       Min.   :0.841   Length:213        
##  1st Qu.:   865      1st Qu.:  835.5       1st Qu.:1.013   Class :character  
##  Median :  4290      Median : 3039.0       Median :1.109   Mode  :character  
##  Mean   :  8112      Mean   : 5396.5       Mean   :1.246                     
##  3rd Qu.:  7815      3rd Qu.: 6390.0       3rd Qu.:1.409                     
##  Max.   :170400      Max.   :97500.0       Max.   :2.688                     
##  NA's   :18          NA's   :18            NA's   :18                        
##  MeanGroupSize     AdultMales      AdultFemale     AdultSexRatio   
##  Min.   : 1.00   Min.   : 0.900   Min.   : 1.000   Min.   : 0.500  
##  1st Qu.: 4.00   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.000  
##  Median : 7.80   Median : 1.500   Median : 2.900   Median : 1.450  
##  Mean   :15.06   Mean   : 2.516   Mean   : 5.049   Mean   : 2.305  
##  3rd Qu.:18.50   3rd Qu.: 3.100   3rd Qu.: 7.100   3rd Qu.: 2.770  
##  Max.   :90.00   Max.   :16.000   Max.   :25.200   Max.   :15.600  
##  NA's   :60      NA's   :68       NA's   :68       NA's   :84      
##  Social_Organization_Ref InterbirthInterval_d   Gestation       WeaningAge_d   
##  Length:213              Min.   : 144.5       Min.   : 59.99   Min.   :  40.0  
##  Class :character        1st Qu.: 365.0       1st Qu.:142.00   1st Qu.: 122.4  
##  Mode  :character        Median : 476.9       Median :165.04   Median : 237.7  
##                          Mean   : 572.1       Mean   :163.47   Mean   : 310.0  
##                          3rd Qu.: 741.4       3rd Qu.:180.75   3rd Qu.: 383.4  
##                          Max.   :2007.5       Max.   :256.00   Max.   :1260.8  
##                          NA's   :103          NA's   :75       NA's   :95      
##  MaxLongevity_m     LitterSz     Life_History_Ref   GR_MidRangeLat_dd
##  Min.   :103.0   Min.   :0.990   Length:213         Min.   :-24.500  
##  1st Qu.:233.4   1st Qu.:1.010   Class :character   1st Qu.:-13.625  
##  Median :303.6   Median :1.010   Mode  :character   Median : -0.760  
##  Mean   :327.3   Mean   :1.181                      Mean   : -1.796  
##  3rd Qu.:394.2   3rd Qu.:1.050                      3rd Qu.:  6.785  
##  Max.   :720.0   Max.   :2.520                      Max.   : 35.880  
##  NA's   :66      NA's   :47                         NA's   :38       
##  Precip_Mean_mm Temp_Mean_degC   AET_Mean_mm      PET_Mean_mm    
##  Min.   : 419   Min.   : 2.60   Min.   : 453.1   Min.   : 842.5  
##  1st Qu.:1190   1st Qu.:21.95   1st Qu.:1091.0   1st Qu.:1512.3  
##  Median :1542   Median :24.30   Median :1291.1   Median :1566.9  
##  Mean   :1543   Mean   :23.13   Mean   :1253.1   Mean   :1553.2  
##  3rd Qu.:1857   3rd Qu.:25.20   3rd Qu.:1444.2   3rd Qu.:1622.2  
##  Max.   :2794   Max.   :27.40   Max.   :1828.3   Max.   :1927.3  
##  NA's   :38     NA's   :38      NA's   :38       NA's   :38      
##  Climate_Ref        HomeRange_km2     HomeRangeRef        DayLength_km   
##  Length:213         Min.   : 0.0020   Length:213         Min.   : 0.250  
##  Class :character   1st Qu.: 0.0600   Class :character   1st Qu.: 0.708  
##  Mode  :character   Median : 0.2750   Mode  :character   Median : 1.212  
##                     Mean   : 1.9379                      Mean   : 1.551  
##                     3rd Qu.: 0.8995                      3rd Qu.: 1.800  
##                     Max.   :28.2400                      Max.   :11.000  
##                     NA's   :65                           NA's   :104     
##  DayLengthRef       Territoriality        Fruit          Leaves         
##  Length:213         Min.   : 0.2250   Min.   : 1.00   Length:213        
##  Class :character   1st Qu.: 0.8555   1st Qu.:27.00   Class :character  
##  Mode  :character   Median : 1.5923   Median :48.00   Mode  :character  
##                     Mean   : 2.2289   Mean   :47.74                     
##                     3rd Qu.: 2.6867   3rd Qu.:68.00                     
##                     Max.   :15.5976   Max.   :97.00                     
##                     NA's   :109       NA's   :98                        
##     Fauna             DietRef1         Canine_Dimorphism Canine_Dimorphism_Ref
##  Length:213         Length:213         Min.   :0.880     Length:213           
##  Class :character   Class :character   1st Qu.:1.109     Class :character     
##  Mode  :character   Mode  :character   Median :1.560     Mode  :character     
##                                        Mean   :1.617                          
##                                        3rd Qu.:1.883                          
##                                        Max.   :5.263                          
##                                        NA's   :92                             
##       Feed            Move            Rest           Social      
##  Min.   : 9.00   Min.   : 3.00   Min.   : 4.00   Min.   : 0.900  
##  1st Qu.:21.75   1st Qu.:14.93   1st Qu.:18.38   1st Qu.: 3.500  
##  Median :33.30   Median :21.00   Median :30.48   Median : 5.400  
##  Mean   :33.08   Mean   :21.67   Mean   :34.26   Mean   : 7.369  
##  3rd Qu.:43.08   3rd Qu.:26.90   3rd Qu.:52.40   3rd Qu.:10.000  
##  Max.   :63.90   Max.   :70.60   Max.   :78.50   Max.   :23.500  
##  NA's   :141     NA's   :143     NA's   :143     NA's   :136     
##  Activity_Budget_Ref
##  Length:213         
##  Class :character   
##  Mode  :character   
##                     
##                     
##                     
## 

Plotting using {base} graphics…

boxplot(log(Body_mass_female_mean) ~ Family, d)

detach(d)

Alternatively, plotting using {ggplot2}… Notice how each new command is added to the previously saved object p. This is for ease of reading and for understanding how we’re building the figure. All of this could also go on one line (and is read by the computer as such in the final version of the object p), but to make it easier to see what each component is doing, we often build graphs in {ggplot2} using the method below:

library(ggplot2)
p <- ggplot(data = d, aes(x = Family, y = log(Body_mass_female_mean)))  #define the variables
p <- p + geom_boxplot()  #graph them in a boxplot
p <- p + theme(axis.text.x = element_text(angle = 90))  #put x-axis names at 90deg
p <- p + ylab("log(Female Body Mass)")  #rename y-axis title
p  #show me the graph
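
For comparison, the same figure can be written as a single chained expression. The sketch below assumes {ggplot2} is installed and uses a small made-up data frame (toy, with invented values) in place of d so that it runs on its own.

```r
library(ggplot2)

# Hypothetical stand-in for d; the values are invented for illustration only
toy <- data.frame(
  Family = rep(c("Atelidae", "Cebidae"), each = 4),
  Body_mass_female_mean = c(4550, 5350, 6430, 8000, 700, 900, 1200, 2400)
)

# The same layers as in the step-by-step version, joined with + in one expression
p <- ggplot(data = toy, aes(x = Family, y = log(Body_mass_female_mean))) +
  geom_boxplot() +                                  # draw the boxplots
  theme(axis.text.x = element_text(angle = 90)) +   # rotate x-axis labels
  ylab("log(Female Body Mass)")                     # rename the y-axis title
p
```

Both styles build the same object; the step-by-step form is simply easier to annotate and to debug one layer at a time.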


CHALLENGE 6


Scatterplots: Scatterplots are a natural tool for visualizing two continuous variables and can be made easily with the plot(x=XXX, y=YYY) function in {base} graphics (where XXX and YYY denote the names of the two variables you wish to plot). Transformations of the variables, e.g., log (log()) or square-root (sqrt()), may be necessary for effective visualization.

Again using data from the KamilarAndCooperData dataset, plot the relationship between female body size and female brain size. Then, play with log transforming the data and plot again.

attach(d)
par(mfrow = c(1, 2))
plot(x = Body_mass_female_mean, y = Brain_Size_Female_Mean)
plot(x = log(Body_mass_female_mean), y = log(Brain_Size_Female_Mean))

detach(d)
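
The same two plots can be drawn without attach() and detach() by wrapping each call in with(). The following is a minimal sketch that uses a small made-up data frame (toy, with invented values) rather than d, and that also restores the plotting layout when done.

```r
# Hypothetical stand-in for d; the values are invented for illustration only
toy <- data.frame(
  Body_mass_female_mean = c(84, 1230, 4240, 5520, 71500),
  Brain_Size_Female_Mean = c(3, 20, 48, 51, 456)
)

old_par <- par(mfrow = c(1, 2))  # two panels side by side; remember the old setting
with(toy, plot(x = Body_mass_female_mean, y = Brain_Size_Female_Mean))
with(toy, plot(x = log(Body_mass_female_mean), y = log(Brain_Size_Female_Mean)))
par(old_par)  # restore the previous layout
```

Because with() only exposes the columns for the duration of the call, nothing is left attached afterwards and there is no masking to worry about.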

The grammar for {ggplot2} is a bit more complicated… see if you can follow it in the example below.

p <- ggplot(data = d, aes(x = log(Body_mass_female_mean), y = log(Brain_Size_Female_Mean),
    color = factor(Family)))  # first, we build a plot object and color points by Family
p <- p + xlab("log(Female Body Mass)") + ylab("log(Female Brain Size)")  # then we modify the axis labels
p <- p + geom_point()  # then we make a scatterplot
p <- p + theme(legend.position = "bottom", legend.title = element_blank())  # then we modify the legend
p  # and, finally, we plot the object

Using {ggplot2}, we can also easily set up a grid for “faceting” by a grouping variable…

p <- p + facet_wrap(~Family, ncol = 4)
p <- p + theme(legend.position = "none")
p

And we can easily add regression lines to our plot. Here, we add a linear model to each facet.

p <- p + geom_smooth(method = "lm", fullrange = TRUE)
p
## `geom_smooth()` using formula 'y ~ x'


CHALLENGE 7


Build your own bivariate scatterplot using the KamilarAndCooperData dataset.

p <- ggplot(data = d, aes(x = log(Body_mass_female_mean), y = log(MaxLongevity_m)))
p <- p + geom_point()
p <- p + geom_smooth(method = "lm")
p
## `geom_smooth()` using formula 'y ~ x'

Aggregate Statistics and the {dplyr} Package

To calculate summary statistics for groups of observations in a data frame, there are many different approaches. One is to use the aggregate() function from the {stats} package (a standard package), which provides a quick way to look at summary statistics for sets of observations, though it requires a bit of clunky code. Here, we apply a particular function (FUN = "mean") to mean female body mass, grouped by Family.

aggregate(d$Body_mass_female_mean ~ d$Family, FUN = "mean", na.rm = TRUE)
##           d$Family d$Body_mass_female_mean
## 1         Atelidae               6616.2000
## 2          Cebidae                876.3323
## 3  Cercopithecidae               6327.8247
## 4    Cheirogalidae                186.0286
## 5    Daubentonidae               2490.0000
## 6        Galagidae                371.6143
## 7        Hominidae              53443.7167
## 8      Hylobatidae               6682.1200
## 9        Indriidae               3886.5333
## 10       Lemuridae               1991.1200
## 11   Lepilemuridae                813.5000
## 12       Lorisidae                489.8625
## 13     Pitheciidae               1768.5000
## 14       Tarsiidae                120.0000

Or, alternatively…

aggregate(x = d["Body_mass_female_mean"], by = d["Family"], FUN = "mean", na.rm = TRUE)
##             Family Body_mass_female_mean
## 1         Atelidae             6616.2000
## 2          Cebidae              876.3323
## 3  Cercopithecidae             6327.8247
## 4    Cheirogalidae              186.0286
## 5    Daubentonidae             2490.0000
## 6        Galagidae              371.6143
## 7        Hominidae            53443.7167
## 8      Hylobatidae             6682.1200
## 9        Indriidae             3886.5333
## 10       Lemuridae             1991.1200
## 11   Lepilemuridae              813.5000
## 12       Lorisidae              489.8625
## 13     Pitheciidae             1768.5000
## 14       Tarsiidae              120.0000
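
The formula interface of aggregate() also accepts a data argument, which avoids repeating d$ and gives cleaner column names in the result. Here is a small sketch with a made-up data frame (toy, with invented values). Note that the formula method drops rows with missing values by default (na.action = na.omit), so na.rm is not needed in this form.

```r
# Hypothetical stand-in for d; the values are invented for illustration only
toy <- data.frame(
  Family = c("Atelidae", "Atelidae", "Cebidae", "Cebidae", "Cebidae"),
  Body_mass_female_mean = c(4000, 6000, 500, NA, 1500)
)

# Mean female body mass by Family; the row with NA is omitted before averaging
res <- aggregate(Body_mass_female_mean ~ Family, data = toy, FUN = mean)
res
```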

Another, EASIER, way to summarize data is to use the package {dplyr}, which provides “a flexible grammar of data manipulation” that includes a set of verbs that can be used to perform useful operations on data frames. Before using {dplyr} for this, let’s look in general at what it can do…

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
s <- filter(d, Family == "Hominidae" & Mass_Dimorphism > 2)
head(s)  # filtering a data frame for certain rows...
##   Scientific_Name    Family   Genus  Species Brain_Size_Species_Mean
## 1 Gorilla_gorilla Hominidae Gorilla  gorilla                  490.41
## 2    Pongo_abelii Hominidae   Pongo   abelii                  389.50
## 3  Pongo_pygmaeus Hominidae   Pongo pygmaeus                  377.38
##   Brain_Size_Female_Mean   Brain_size_Ref Body_mass_male_mean
## 1                 455.89 Isler et al 2008            170400.0
## 2                 341.21 Isler et al 2008             84482.5
## 3                 337.72 Isler et al 2008             80136.9
##   Body_mass_female_mean Mass_Dimorphism         Mass_Ref MeanGroupSize
## 1               71500.0           2.383 Isler et al 2008           7.1
## 2               41148.0           2.053 Isler et al 2008            NA
## 3               36947.6           2.169 Isler et al 2008           2.0
##   AdultMales AdultFemale AdultSexRatio
## 1        1.8         4.4          2.44
## 2         NA          NA            NA
## 3        1.0         1.0          1.00
##                                                     Social_Organization_Ref
## 1 van Schaik et al. 1999; Kappeler and Pereira 2003; Nunn & van Schaik 2000
## 2                                                                          
## 3 van Schaik et al. 1999; Kappeler and Pereira 2003; Nunn & van Schaik 2000
##   InterbirthInterval_d Gestation WeaningAge_d MaxLongevity_m LitterSz
## 1               1430.8       256       920.35            648     1.05
## 2                   NA        NA           NA             NA       NA
## 3               2007.5       249      1088.80            720     1.07
##    Life_History_Ref GR_MidRangeLat_dd Precip_Mean_mm Temp_Mean_degC AET_Mean_mm
## 1 Jones et al. 2009              0.88         1528.8           24.2      1291.1
## 2                                  NA             NA             NA          NA
## 3 Jones et al. 2009              1.33         2587.0           23.8      1676.6
##   PET_Mean_mm       Climate_Ref HomeRange_km2      HomeRangeRef DayLength_km
## 1      1502.7 Jones et al. 2009          4.03 Jones et al. 2009         0.86
## 2          NA                              NA                             NA
## 3      1691.7 Jones et al. 2009          3.88 Jones et al. 2009         0.50
##       DayLengthRef Territoriality Fruit Leaves Fauna        DietRef1
## 1 Nunn et al. 2003         0.3797    11     93     0 Nunn et al 2003
## 2                              NA    NA                             
## 3 Nunn et al. 2003         0.2250    64     18     3 Nunn et al 2003
##   Canine_Dimorphism Canine_Dimorphism_Ref Feed  Move Rest Social
## 1             1.739   Plavcan & Ruff 2008 52.6  9.55 35.5    5.8
## 2             1.710   Plavcan & Ruff 2008   NA    NA   NA     NA
## 3             1.693   Plavcan & Ruff 2008 42.4 15.40 41.0    1.2
##                  Activity_Budget_Ref
## 1 wendy dataset; Sussman et al. 2005
## 2                                   
## 3                       Kamilar 2006
s <- arrange(d, Family, Genus, Body_mass_male_mean)  # rearranging a data frame...
head(s)
##      Scientific_Name   Family    Genus   Species Brain_Size_Species_Mean
## 1   Alouatta_guariba Atelidae Alouatta   guariba                   51.70
## 2    Alouatta_caraya Atelidae Alouatta    caraya                   52.63
## 3 Alouatta_seniculus Atelidae Alouatta seniculus                   55.22
## 4  Alouatta_palliata Atelidae Alouatta  palliata                   49.88
## 5  Alouatta_belzebul Atelidae Alouatta  belzebul                   52.84
## 6     Alouatta_pigra Atelidae Alouatta     pigra                   51.13
##   Brain_Size_Female_Mean   Brain_size_Ref Body_mass_male_mean
## 1                  49.08 Isler et al 2008                5800
## 2                  47.80 Isler et al 2008                6525
## 3                  54.26 Isler et al 2008                6690
## 4                  48.04 Isler et al 2008                7150
## 5                  51.19 Isler et al 2008                7270
## 6                  48.75 Isler et al 2008               11400
##   Body_mass_female_mean Mass_Dimorphism         Mass_Ref MeanGroupSize
## 1                  4550           1.275 Isler et al 2008          6.53
## 2                  4240           1.539 Isler et al 2008          8.00
## 3                  5210           1.284 Isler et al 2008          7.10
## 4                  5350           1.336 Isler et al 2008         12.00
## 5                  5520           1.317 Isler et al 2008          7.00
## 6                  6430           1.773 Isler et al 2008          6.60
##   AdultMales AdultFemale AdultSexRatio
## 1      1.370       2.200          1.61
## 2      2.300       3.300          1.43
## 3      1.700       2.200          1.29
## 4      2.900       6.300          2.17
## 5      1.000       1.000          1.00
## 6      1.925       2.175          1.13
##                                                     Social_Organization_Ref
## 1                                                       Campbell et al 2007
## 2 van Schaik et al. 1999; Kappeler and Pereira 2003; Nunn & van Schaik 2000
## 3 van Schaik et al. 1999; Kappeler and Pereira 2003; Nunn & van Schaik 2000
## 4 van Schaik et al. 1999; Kappeler and Pereira 2003; Nunn & van Schaik 2000
## 5                                                       Campbell et al 2007
## 6                                                       Campbell et al 2007
##   InterbirthInterval_d Gestation WeaningAge_d MaxLongevity_m LitterSz
## 1                   NA        NA           NA             NA       NA
## 2               337.62       187       323.16          243.6     1.01
## 3               507.35       190       370.04          300.0     1.42
## 4               684.37       186       495.60          300.0     1.02
## 5                   NA        NA           NA             NA       NA
## 6                   NA       187           NA          240.0     1.01
##    Life_History_Ref GR_MidRangeLat_dd Precip_Mean_mm Temp_Mean_degC AET_Mean_mm
## 1                              -21.13         1332.3           19.6      1225.7
## 2 Jones et al. 2009            -20.34         1166.4           22.9      1193.1
## 3 Jones et al. 2009              0.68         1823.4           25.1      1449.8
## 4 Jones et al. 2009              6.95         1852.6           23.7      1300.0
## 5                               -6.80         1643.5           24.9      1286.6
## 6 Jones et al. 2009             18.80         1341.3           25.1      1373.8
##   PET_Mean_mm       Climate_Ref HomeRange_km2      HomeRangeRef DayLength_km
## 1      1332.2 Jones et al. 2009          0.03 Jones et al. 2009           NA
## 2      1404.9 Jones et al. 2009            NA                           0.40
## 3      1574.9 Jones et al. 2009          0.10 Jones et al. 2009         0.55
## 4      1633.9 Jones et al. 2009          0.19 Jones et al. 2009         0.32
## 5      1549.8 Jones et al. 2009            NA                             NA
## 6      1580.8 Jones et al. 2009          0.30 Jones et al. 2009           NA
##       DayLengthRef Territoriality Fruit Leaves Fauna             DietRef1
## 1                              NA   5.2   73.0   0.0 Campbell et al. 2007
## 2 Nunn et al. 2003             NA  23.8   67.7   0.0 Campbell et al. 2007
## 3 Nunn et al. 2003         1.5414  40.0   48.1   0.0 Campbell et al. 2007
## 4 Nunn et al. 2003         0.6506  33.1   56.4   0.0 Campbell et al. 2007
## 5                              NA  57.3   19.1   0.0 Campbell et al. 2007
## 6                              NA  40.8   45.1   0.0 Campbell et al. 2007
##   Canine_Dimorphism Canine_Dimorphism_Ref  Feed  Move  Rest Social
## 1             1.783   Plavcan & Ruff 2008 18.33 14.33 64.37   3.00
## 2             1.542   Plavcan & Ruff 2008 15.90 17.60 61.60   4.90
## 3             1.464   Plavcan & Ruff 2008 12.70  6.20 78.50   2.50
## 4             1.703   Plavcan & Ruff 2008 17.94 12.32 66.14   3.64
## 5             1.811   Plavcan & Ruff 2008 13.75 18.75 57.30  10.00
## 6             1.109   Plavcan & Ruff 2008 24.40  9.80 61.90   3.80
##    Activity_Budget_Ref
## 1 Campbell et al. 2007
## 2 Campbell et al. 2007
## 3 Campbell et al. 2007
## 4 Campbell et al. 2007
## 5 Campbell et al. 2007
## 6 Campbell et al. 2007
s <- select(d, Family, Genus, Body_mass_male_mean)  # selecting specific columns...
head(s)
##            Family          Genus Body_mass_male_mean
## 1 Cercopithecidae Allenopithecus                6130
## 2 Cercopithecidae      Allocebus                  92
## 3        Atelidae       Alouatta                7270
## 4        Atelidae       Alouatta                6525
## 5        Atelidae       Alouatta                5800
## 6        Atelidae       Alouatta                7150
s <- rename(d, Female_Mass = Body_mass_female_mean)
head(s$Female_Mass)  # renaming columns...
## [1] 3180   84 5520 4240 4550 5350
s <- mutate(d, Binomial = paste(Genus, Species, sep = " "))
head(s$Binomial)  # and adding new columns...
## [1] "Allenopithecus nigroviridis" "Allocebus trichotis"        
## [3] "Alouatta belzebul"           "Alouatta caraya"            
## [5] "Alouatta guariba"            "Alouatta palliata"
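
Because loading {dplyr} masks stats::filter() (see the startup message above), it can be safer to call the verb with an explicit namespace prefix when several packages are loaded. The sketch below assumes {dplyr} is installed and uses a made-up data frame (toy, with invented values).

```r
library(dplyr)

# Hypothetical stand-in for d; the values are invented for illustration only
toy <- data.frame(
  Family = c("Hominidae", "Hominidae", "Cebidae"),
  Genus = c("Gorilla", "Pan", "Cebus"),
  Mass_Dimorphism = c(2.4, 1.3, 1.2)
)

# dplyr::filter() keeps the rows for which the condition is TRUE;
# the dplyr:: prefix makes it explicit which filter() is meant
s <- dplyr::filter(toy, Family == "Hominidae" & Mass_Dimorphism > 2)
s
```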

The {dplyr} package also makes it easy to summarize data using more convenient functions than aggregate(). For example:

s <- summarise(d, avgF = mean(Body_mass_female_mean, na.rm = TRUE), avgM = mean(Body_mass_male_mean,
    na.rm = TRUE))
s
##       avgF     avgM
## 1 5396.474 8111.801

The group_by() function allows us to apply summary functions to sets of observations defined by a categorical variable, as we did above with aggregate().

byFamily <- group_by(d, Family)
byFamily
## # A tibble: 213 × 44
## # Groups:   Family [14]
##    Scientific_Name   Family   Genus   Species Brain_Size_Speci… Brain_Size_Fema…
##    <chr>             <chr>    <chr>   <chr>               <dbl>            <dbl>
##  1 Allenopithecus_n… Cercopi… Alleno… nigrov…              58.0             53.7
##  2 Allocebus_tricho… Cercopi… Alloce… tricho…              NA               NA  
##  3 Alouatta_belzebul Atelidae Alouat… belzeb…              52.8             51.2
##  4 Alouatta_caraya   Atelidae Alouat… caraya               52.6             47.8
##  5 Alouatta_guariba  Atelidae Alouat… guariba              51.7             49.1
##  6 Alouatta_palliata Atelidae Alouat… pallia…              49.9             48.0
##  7 Alouatta_pigra    Atelidae Alouat… pigra                51.1             48.8
##  8 Alouatta_senicul… Atelidae Alouat… senicu…              55.2             54.3
##  9 Aotus_azarai      Cebidae  Aotus   azarai               20.7             20.7
## 10 Aotus_brumbacki   Cebidae  Aotus   brumba…              NA               NA  
## # … with 203 more rows, and 38 more variables: Brain_size_Ref <chr>,
## #   Body_mass_male_mean <dbl>, Body_mass_female_mean <dbl>,
## #   Mass_Dimorphism <dbl>, Mass_Ref <chr>, MeanGroupSize <dbl>,
## #   AdultMales <dbl>, AdultFemale <dbl>, AdultSexRatio <dbl>,
## #   Social_Organization_Ref <chr>, InterbirthInterval_d <dbl>, Gestation <dbl>,
## #   WeaningAge_d <dbl>, MaxLongevity_m <dbl>, LitterSz <dbl>,
## #   Life_History_Ref <chr>, GR_MidRangeLat_dd <dbl>, Precip_Mean_mm <dbl>, …
s <- summarise(byFamily, avgF = mean(Body_mass_female_mean, na.rm = TRUE), avgM = mean(Body_mass_male_mean,
    na.rm = TRUE))
s
## # A tibble: 14 × 3
##    Family            avgF   avgM
##    <chr>            <dbl>  <dbl>
##  1 Atelidae         6616.  7895.
##  2 Cebidae           876.  1012.
##  3 Cercopithecidae  6328.  9543.
##  4 Cheirogalidae     186.   193.
##  5 Daubentonidae    2490   2620 
##  6 Galagidae         372.   395.
##  7 Hominidae       53444. 98681.
##  8 Hylobatidae      6682.  6926.
##  9 Indriidae        3887.  3638.
## 10 Lemuridae        1991.  2077.
## 11 Lepilemuridae     814.   792 
## 12 Lorisidae         490.   512.
## 13 Pitheciidae      1768.  1955.
## 14 Tarsiidae         120    131
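
Group means can hide how many observations sit behind them. As a hedged extension of the summary above, the {dplyr} helper n() counts the rows in each group, and sum(!is.na(x)) counts the non-missing values that actually enter the mean. The data frame toy below is made up (invented values) so the sketch runs on its own; the column names n_species and n_with_mass are just illustrative.

```r
library(dplyr)

# Hypothetical stand-in for d; the values are invented for illustration only
toy <- data.frame(
  Family = c("Atelidae", "Atelidae", "Cebidae", "Cebidae", "Cebidae"),
  Body_mass_female_mean = c(4000, 6000, 500, NA, 1500)
)

byFamily <- group_by(toy, Family)
s <- summarise(byFamily,
  n_species = n(),                                    # rows per family
  n_with_mass = sum(!is.na(Body_mass_female_mean)),   # rows with a non-missing mass
  avgF = mean(Body_mass_female_mean, na.rm = TRUE)    # mean of the non-missing masses
)
s
```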

Piping

One other cool thing about the {dplyr} package is that it provides a convenient way to “pipe” together operations on a data frame using the %>% operator. This means that the result of the code to the left of the operator is passed as the first argument to the function on its right. In this way, you can use piping to build, step by step, a more complicated output.

As an example, the code below accomplishes the same thing as the multiple lines of code in the previous chunk. Although it is a single expression, I’ve split it at the pipes for ease of reading (the comment after each hash describes what that piece does); it could also be written as one continuous line:

s <-                                                             #to create dataframe "s"
  d %>%                                                          #take dataframe "d"
  group_by(Family) %>%                                           #Group it by Family
  summarise(avgF = mean(Body_mass_female_mean, na.rm=TRUE),      #And calculate mean female BM
            avgM = mean(Body_mass_male_mean, na.rm=TRUE))        #And mean male BM
s
## # A tibble: 14 × 3
##    Family            avgF   avgM
##    <chr>            <dbl>  <dbl>
##  1 Atelidae         6616.  7895.
##  2 Cebidae           876.  1012.
##  3 Cercopithecidae  6328.  9543.
##  4 Cheirogalidae     186.   193.
##  5 Daubentonidae    2490   2620 
##  6 Galagidae         372.   395.
##  7 Hominidae       53444. 98681.
##  8 Hylobatidae      6682.  6926.
##  9 Indriidae        3887.  3638.
## 10 Lemuridae        1991.  2077.
## 11 Lepilemuridae     814.   792 
## 12 Lorisidae         490.   512.
## 13 Pitheciidae      1768.  1955.
## 14 Tarsiidae         120    131

Piping allows us to keep a clean and readable workflow without having to create numerous intermediate data frames, and it offers a shorthand that expresses one complicated process as a single command that is easy to break down.

Although this may at first seem cumbersome (many students despise piping at first!), it will quickly become one of the best ways to make your code more readable and simpler to implement.
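
To see that a pipe is only a different way of writing nested function calls, the sketch below computes the same grouped summary both ways and compares the results. It assumes {dplyr} is installed and uses a made-up data frame (toy, with invented values); the names nested and piped are illustrative.

```r
library(dplyr)

# Hypothetical stand-in for d; the values are invented for illustration only
toy <- data.frame(
  Family = c("Atelidae", "Atelidae", "Cebidae", "Cebidae", "Cebidae"),
  Body_mass_female_mean = c(4000, 6000, 500, NA, 1500)
)

# Nested form: read from the inside out
nested <- summarise(group_by(toy, Family),
                    avgF = mean(Body_mass_female_mean, na.rm = TRUE))

# Piped form: read from top to bottom
piped <- toy %>%
  group_by(Family) %>%
  summarise(avgF = mean(Body_mass_female_mean, na.rm = TRUE))

all.equal(nested, piped)  # compares the two results
```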


CHALLENGE 8


In one line of code, do the following:

  • Add a variable, Binomial, to our data frame d, which is a concatenation of the Genus and Species
  • Trim the data frame to only include the variables Binomial, Family, Body_mass_female_mean, Body_mass_male_mean and Mass_Dimorphism
  • Group these by Binomial (in other words, by species but with full binomial nomenclature)…
  • And calculate the average value for female body mass, male body mass, and mass dimorphism.
s <- d %>%
    mutate(Binomial = paste(Genus, Species, sep = " ")) %>%
    select(Binomial, Family, Body_mass_female_mean, Body_mass_male_mean, Mass_Dimorphism) %>%
    group_by(Binomial) %>%
    summarise(avgF = mean(Body_mass_female_mean, na.rm = TRUE), avgM = mean(Body_mass_male_mean,
        na.rm = TRUE), avgBMD = mean(Mass_Dimorphism, na.rm = TRUE))
s
## # A tibble: 213 × 4
##    Binomial                     avgF  avgM  avgBMD
##    <chr>                       <dbl> <dbl>   <dbl>
##  1 Allenopithecus nigroviridis  3180  6130   1.93 
##  2 Allocebus trichotis            84    92   1.10 
##  3 Alouatta belzebul            5520  7270   1.32 
##  4 Alouatta caraya              4240  6525   1.54 
##  5 Alouatta guariba             4550  5800   1.27 
##  6 Alouatta palliata            5350  7150   1.34 
##  7 Alouatta pigra               6430 11400   1.77 
##  8 Alouatta seniculus           5210  6690   1.28 
##  9 Aotus azarai                 1230  1180   0.959
## 10 Aotus brumbacki               NaN   NaN NaN    
## # … with 203 more rows
  • According to Kamilar & Cooper’s (2013) dataset, what are the average male and female body masses and the body mass dimorphism of my two main study species (vervet monkeys, Chlorocebus pygerythrus; and woolly monkeys, Lagothrix lagotricha)? Which has a larger average female body mass? Which is more sexually dimorphic?

  • Compare the body size of my two main study taxa at the Family level (i.e., Cercopithecidae vs. Atelidae) by plotting (using {ggplot2}) the body mass of males and females and sexual dimorphism. If you can, make the Cercopithecid boxes green, and the Atelid boxes purple.
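
For the colour request in the last item, one possible approach (not necessarily the intended one) is to map Family to the fill aesthetic and set the colours with scale_fill_manual() and a named vector. The sketch assumes {ggplot2} is installed and uses a made-up data frame (toy, with invented values) in place of the filtered data; fam_cols is an illustrative name.

```r
library(ggplot2)

# Hypothetical stand-in for the two families in d; the values are invented
toy <- data.frame(
  Family = rep(c("Atelidae", "Cercopithecidae"), each = 4),
  Body_mass_female_mean = c(4550, 5350, 6430, 8000, 3180, 4000, 6000, 9000)
)

# Named vector: each family name is matched to its fill colour
fam_cols <- c(Cercopithecidae = "green", Atelidae = "purple")

p <- ggplot(toy, aes(x = Family, y = log(Body_mass_female_mean), fill = Family)) +
  geom_boxplot() +
  scale_fill_manual(values = fam_cols) +
  ylab("log(Female Body Mass)")
p
```

The same pattern can be repeated for male body mass and for mass dimorphism by changing the y variable.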