GIS and Spatial Analysis in R


Preliminaries

Objectives

In this module, we extend our use of R to GIS analyses. We’ll work with GIS data to create maps, and extract data from multiple layers of different data types for spatial analysis. This is a first step, and by no means comprehensive. Remember that R is constantly changing, and the ways and packages by which GIS analyses are done are changing, too!

Geographic Information Systems (GIS)

A Geographic Information System, or GIS, is an analytical framework that can be used to organize, communicate, and better understand landscape-scale datasets. GIS allows users to overlay multiple layers of data that are geographically referenced to the same mapping system, so that those layers can be associated with and mutually inform each other. This allows both for the creation of comprehensive maps and for the integration of diverse data types.

As is fitting for GIS, perhaps the best representation of this is an image:

If we added another layer of GPS points with, say, locations where monkeys have been spotted, we could then use spatial analysis techniques to construct statistical models that may help us better understand how the factors in these various layers influence whether an observer would see a monkey at a particular location. Such data and techniques are typically used to create landscape-level inferences regarding how organisms are distributed in the environment. These may include species distribution models, assessments of factors associated with daily ranging paths or the presence/absence of species in different habitat types, and estimates of home range use patterns. The nature of layer integration has also been helpful for predicting archaeological artefact discovery, analyzing dental enamel imperfections, and other, perhaps less readily intuitive uses.

Historically, GIS analyses have been done primarily using the proprietary ESRI suite of programs, including the excellent ArcGIS, which together provide a comprehensive set of GIS analytical tools. There are also open-source GIS frameworks that can be used, including QGIS and GRASS, among many others.

The increasing GIS functionality of R, however, provides another powerful option for open-source GIS analyses. In our case, this may be especially appealing, as we’ve already built up some analytical capabilities in R. Conducting GIS analyses in R itself allows us to extract data from multiple layers and immediately analyze them using statistical models of our choice. And as {ggplot2} and the rest of the {tidyverse} continue to integrate with GIS-style methods, this has become much easier (and prettier!), as well.

In this module, we’ll explore some aspects of GIS-based analysis as a base for asking questions about primate ranging and space use.

Importing Data for Use in GIS

Projections and Coordinate Systems

For the most part, importing data for GIS analyses is similar to importing data for any other R analysis, depending on the type of data. To integrate your data with spatial datasets, however, those data must use a common coordinate system and be projected into the same geospatial extent as the other data layers with which you’re working. Coordinate systems and projections are critically important, as they dictate how points from different layers align, and how the 3-dimensional geographic space of the globe is mapped into the 2-dimensional space of a map. To learn more about these systems, ArcGIS Help has a great resource. You can also see a list of map projections here. For the most part, we’ll be working with the Mercator (UTM) projection, shown here:

Always keep in mind that the projection is not a true representation of the relative size of landmasses on the globe: these areas exist on a 3-dimensional surface, and each projection system distorts them when flattening them into 2-dimensional space. The WGS84/UTM coordinates on the Mercator projection are no different. Here’s an image that shows the Mercator-projected size vs. the real size of each country (note that distortion in this projection increases with distance from the equator):

We’ll also be using the WGS84 and Universal Transverse Mercator (UTM) coordinate systems to map points onto the Mercator projection. The UTM coordinate system maps points within 6-degree longitudinal tiles, or zones, each projected around its own central meridian. Each zone has a designated number/letter combination, and the coordinate system counts an ‘easting’ in meters from the zone’s central meridian (offset by a false easting of 500,000 m) and a ‘northing’ in meters from the equator (offset in the southern hemisphere by a false northing of 10,000,000 m, so that values stay positive). Knowing your zone is critically important: because distortion increases with distance from each zone’s central meridian, every zone uses its own projection parameters to keep that distortion small.
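As a quick worked example (using the southern-hemisphere false northing): a UTM northing of 9,930,910 in zone 18M (a value we’ll meet in our data below) lies 10,000,000 − 9,930,910 = 69,090 m, or about 69 km, south of the equator, which works out to roughly 0.62°S.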

You can see the zones here:

Once you know your UTM zone, it’s fairly straightforward to convert UTM coordinates into the more commonly used GPS-based latitude/longitude coordinates on the WGS84 datum (although you can also set GPS receivers to record data directly in UTM). For example, spider monkey points collected at the Tiputini Biodiversity Station in Amazonian Ecuador fall in UTM zone 18M. The World Geodetic System 1984 (WGS84) is defined and maintained by the United States National Geospatial-Intelligence Agency, and so is typically the default for U.S. GPS systems.

Remember: if different datasets have different coordinate systems or projections, they won’t integrate properly! Imagine trying to map UTM points onto a map using the Azimuthal Equidistant projection! Also remember that these global reference systems are continuously updated and demarcated into epochs of use. New Zealand, for example, shifts about 5 cm every year relative to its designated WGS84 coordinates due to tectonic plate movement!
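As a concrete (if minimal) sketch of what this conversion and checking looks like in R, here is how the {sf} package (which we’ll meet properly below) handles it; the coordinates are illustrative values near the Tiputini Biodiversity Station, with EPSG:32718 (WGS84/UTM zone 18S) and EPSG:4326 (WGS84 latitude/longitude) as the two reference systems:

library(sf)

# An illustrative point in UTM zone 18S (EPSG:32718):
pt_utm <- st_sfc(st_point(c(371346, 9930113)), crs = 32718)

# The same point re-projected into WGS84 latitude/longitude (EPSG:4326):
pt_wgs84 <- st_transform(pt_utm, crs = 4326)
pt_wgs84

# Layers must share a CRS before they can be integrated; check before
# mapping, and st_transform() one layer to match the other if they differ:
st_crs(pt_utm) == st_crs(pt_wgs84)  # FALSE: same point, different systems!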

Now, to start, let’s import a GPS dataset of interest and start some spatial analyses!

EXAMPLE:

Suppose we have been studying a group of white-bellied spider monkeys (Ateles belzebuth) for the past few years at the Tiputini Biodiversity Station. While collecting behavioral data, we’ve also been collecting GPS points and location data from georeferenced markers every 20 minutes throughout the day, as well as every time we encounter the group. From these data, we’re interested in estimating the size of their home range.


CHALLENGE 1



Let’s explore this question using some actual data. First, load in the dataset “atelesranging.csv” and do some exploratory mapping and data analysis. Given the spatial aspect of the data, this is only a little more complicated than what we’ve done before:

library(curl)
f <- curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN597_Fall19/atelesranging.csv")
d <- read.csv(f, header = TRUE, sep = ",")
head(d)
##              Taxon   UTMX    UTMY
## 1 Ateles belzebuth 370080 9930910
## 2 Ateles belzebuth 370085 9930916
## 3 Ateles belzebuth 370105 9929864
## 4 Ateles belzebuth 370125 9929892
## 5 Ateles belzebuth 370129 9930521
## 6 Ateles belzebuth 370133 9930906
summary(d)
##               Taxon           UTMX             UTMY        
##  Ateles belzebuth:2114   Min.   :370080   Min.   :9928975  
##                          1st Qu.:371163   1st Qu.:9929927  
##                          Median :371376   Median :9930051  
##                          Mean   :371346   Mean   :9930113  
##                          3rd Qu.:371578   3rd Qu.:9930321  
##                          Max.   :374153   Max.   :9930979

Let’s take a quick look at how these points map out:

library(tidyverse)
ggplot(data = d, aes(x = UTMX, y = UTMY)) + geom_point() + theme_bw()

Ok, so we can see there are some points… but without the projection and some other layers, they’re not much help. Let’s add some more information to our ‘map’.

First, let’s make sure we’ve got the points in the proper projection. We can do that most simply using the {sf} package. Note that the crs argument (which stands for ‘coordinate reference system’) in the st_as_sf call is where we designate the projection. In this case, 32718 is the EPSG code for WGS84/UTM zone 18S. To learn more about how to define the projection, see this helpful tutorial. In the meantime, let’s project!

library(sf)
## Linking to GEOS 3.6.1, GDAL 2.1.3, PROJ 4.9.3
d <- st_as_sf(d, coords = c("UTMX", "UTMY"), crs = 32718)

Now, we’ll need to plot using the geom_sf() call, which will ensure that our ggplot is a map with the required coordinate system and projection (notice how our axes now contain the translated latitude and longitude associated with our coordinate system/projection):

q <- ggplot() + geom_sf(data = d, color = "light pink", alpha = 0.5) + theme_bw()
q

Now, to contextualize these data a bit more, we’ve also got some shapefiles associated with the study site. A shapefile is a simple, nontopological format for storing the location and attributes of a geographic feature, which can be represented by points, lines, or polygons (areas). Also associated with these values might be a dBASE table, which can store additional attributes associated with the shapefile features. On disk, a shapefile is actually several (5 or 6) files together, sharing a name but with different extensions (e.g., .shp for the geometry, .shx for the index, .dbf for the attribute table, and .prj for the projection). Most publicly available spatial layers (i.e., those available from governmental repositories) are distributed as shapefiles.

To load a shapefile into R, we must load all of the associated files at once. Luckily, the package {rgdal} has a single command, readOGR(), that does so. Let’s load in a shapefile of the Tiputini River, which forms the southern boundary of the spider monkeys’ territory.

NOTE: To execute this code, you’ll need to download these files from the AN597_Fall19 folder on my GitHub page to a folder in your working directory called GISdata.
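Once downloaded, you can quickly confirm that all the component files made it over by listing them from R (a sketch; the exact set of extensions may vary depending on how the shapefile was exported, but you should see at least the .shp, .shx, .dbf, and .prj files):

# List the component files of the river shapefile in our GISdata folder:
list.files("GISdata", pattern = "rio_tiputini_drawing")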

library(rgdal)
## Loading required package: sp
## rgdal: version: 1.4-6, (SVN revision 841)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.4.2, released 2019/06/28
##  Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/rgdal/gdal
##  GDAL binary built with GEOS: FALSE 
##  Loaded PROJ.4 runtime: Rel. 5.2.0, September 15th, 2018, [PJ_VERSION: 520]
##  Path to PROJ.4 shared files: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/rgdal/proj
##  Linking to sp version: 1.3-1
tbs_rio <- readOGR(dsn = path.expand("GISdata"), layer = "rio_tiputini_drawing")
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/christopherschmitt/fuzzyatelin.github.io/fuzzyatelin.github.io/bioanth-stats/module-GIS/GISdata", layer: "rio_tiputini_drawing"
## with 1 features
## It has 1 fields
tbs_rio <- st_as_sf(tbs_rio, crs = 32718)

Notice that, in addition to importing, we’ve also converted our shapefile into an {sf} object with the same projection as our data. Let’s see how they look:

q <- ggplot() + geom_sf(data = tbs_rio, color = "dodgerblue") + geom_sf(data = d, 
    color = "light pink", alpha = 0.5) + theme_bw()
q
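As an aside, the {sf} package can also read shapefiles directly via st_read(), skipping the import-and-convert two-step. A minimal sketch, assuming the same GISdata folder:

# Read the shapefile straight into an {sf} object; st_read() picks up the
# layer's CRS automatically from the .prj file when one is present:
tbs_rio_sf <- st_read("GISdata/rio_tiputini_drawing.shp")

# If the .prj file were missing, we could assign the CRS explicitly instead:
# tbs_rio_sf <- st_set_crs(tbs_rio_sf, 32718)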

We can do the same for the system of trails from Tiputini:

tbs_trails <- readOGR(dsn = path.expand("GISdata"), layer = "trail_polylines_2010")
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/christopherschmitt/fuzzyatelin.github.io/fuzzyatelin.github.io/bioanth-stats/module-GIS/GISdata", layer: "trail_polylines_2010"
## with 29 features
## It has 1 fields
tbs_trails <- st_as_sf(tbs_trails, crs = 32718)

q <- ggplot() + geom_sf(data = tbs_trails) + geom_sf(data = tbs_rio, color = "dodgerblue") + 
    geom_sf(data = d, color = "light pink", alpha = 0.5) + theme_bw()
q

Great! Now we’ve got a basic map with the trails and river, with our spider monkey ranging points overlaid on top of them. This is helpful for seeing where the spider monkeys occur on the trail system and relative to the river, but we can do much more…

For example, if we’d like to add satellite data to our map, we can use the {ggmap} package. Note that this now requires registering with Google, since we are actually accessing their API (application programming interface) within the package. Instructions on how to do so are here; you must complete this before the next step (although they ask for billing information, they will not bill you).

library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
# ggmap::register_google(key = 'SET YOUR KEY HERE')

# Here, we download a satellite image centered on the WGS84 GPS coordinates
# of the Tiputini Biodiversity Station (as eyeballed from our map, above).
# When downloading other maps, you may need to experiment with the zoom to
# get the right scale. Note that Google will not allow non-integer zooms,
# and also will not allow you to dictate the dimensions of the map (all
# maps must be square):
map <- get_googlemap(center = c(lon = -76.15, lat = -0.635), zoom = 14, scale = 2, 
    maptype = "satellite", color = "color")
## Source : https://maps.googleapis.com/maps/api/staticmap?center=-0.635,-76.15&zoom=14&size=640x640&scale=2&maptype=satellite&key=xxx
# Google is weird in that their 'bounding box' (coordinates that set the
# boundary of the image we download) is in a fixed coordinate system, so we
# need to define a function to fix the 'bounding box' to be in our actual
# data projection of EPSG:32718:
ggmap_bbox <- function(map) {
    if (!inherits(map, "ggmap")) 
        stop("map must be a ggmap object")
    # Extract the bounding box (in lat/lon) from the ggmap to a numeric vector,
    # and set the names to what sf::st_bbox expects:
    map_bbox <- setNames(unlist(attr(map, "bb")), c("ymin", "xmin", "ymax", 
        "xmax"))
    
    # Convert the bbox to an sf polygon, transform it to 32718, and convert
    # back to a bbox (convoluted, but it works):
    bbox_32718 <- st_bbox(st_transform(st_as_sfc(st_bbox(map_bbox, crs = 4326)), 
        32718))
    
    # Overwrite the bbox of the ggmap object with the transformed coordinates
    attr(map, "bb")$ll.lat <- bbox_32718["ymin"]
    attr(map, "bb")$ll.lon <- bbox_32718["xmin"]
    attr(map, "bb")$ur.lat <- bbox_32718["ymax"]
    attr(map, "bb")$ur.lon <- bbox_32718["xmax"]
    map
}

# Use the function to change the coordinate system of the bounding box:
map <- ggmap_bbox(map)

p <- ggmap(map) + coord_sf(xlim = c(-76.18, -76.13), ylim = c(-0.63, -0.66), 
    crs = st_crs(32718)) + geom_sf(data = tbs_trails, inherit.aes = FALSE) + 
    geom_sf(data = tbs_rio, color = "dodgerblue", inherit.aes = FALSE) + geom_sf(data = d, 
    color = "light pink", alpha = 0.5, inherit.aes = FALSE)
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
p

If you want to take a closer look, you can use the {zoom} package:

library(zoom)
inout.zoom()

Although this is pretty, it’s not necessarily helping us with our spatial analysis. To really understand the home range of these monkeys, we need to run some spatial statistics.

Basic Spatial Statistics in R

One interesting thing we can extract from our cloud of spider monkey locations is an estimate of their home range. There are a few common ways to estimate home range in terrestrial mammals:

Minimum Convex Polygon

The minimum convex polygon is, most simply put, the polygon created by drawing a line between each of the most peripheral points, such that the resulting polygon contains all points in the dataset. More specifically, it is the smallest polygon containing all data points in which no internal angle exceeds 180 degrees.

This is the method recommended by the International Union for Conservation of Nature (IUCN) for estimating home range, as it’s easy to compute from coordinate data and is appropriate for presence-only data (i.e., we are not taking into account absences in observation), although it is not perfect.
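Incidentally, the idea is simple enough to sketch directly in {sf} before we turn to a dedicated home range package: the convex hull of the union of our points is a minimum convex polygon. Using the d object we projected above, something like this should give a rough preview:

# Union the ranging points into a single geometry, then take its convex hull:
ateles_hull <- st_convex_hull(st_union(d))

# Area of the hull; with a UTM-based CRS, the units are square meters:
st_area(ateles_hull)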


CHALLENGE 2



Let’s use the package {adehabitatHR} to estimate the home range of this spider monkey group using minimum convex polygons.

Now, unfortunately, {adehabitatHR} does not yet allow for the use of {sf} objects. To use the spider monkey points in this package, we need to convert them to a spatial points object. We can do this by reimporting the dataframe and converting it to an {sp} object rather than an {sf} object:

# The curl connection was used up when we first read in the CSV, so let's
# re-open a connection to our spider monkey data file:
f <- curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN597_Fall19/atelesranging.csv")
e <- read.csv(f, header = TRUE, sep = ",")

# Pull out the coordinate columns (both your X and Y values):

xy <- e[, c("UTMX", "UTMY")]

# Create your Spatial Points dataframe:

library(sp)
ateles.sp <- SpatialPointsDataFrame(coords = xy, data = e, proj4string = CRS("+proj=utm +zone=18 +south +datum=WGS84 +units=m +no_defs"))

Notice that the coordinate reference system (CRS) call is very different from the EPSG number (which {sp} does not accept). Here, we’re defining the projection (UTM), the zone (18), the hemisphere (south), the datum (WGS84), and the units (meters). To learn how to write one of these strings, see this helpful resource.
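If you know the EPSG code but not the proj4 string, you don’t have to write the string by hand; {sf} can look it up for you (the exact output may vary slightly across PROJ versions):

# Retrieve the proj4 string associated with EPSG:32718:
st_crs(32718)$proj4string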

Now, unfortunately, {sp} objects are not easily plotted by {ggplot2}, so if we want to see our points on the map to make sure they mapped appropriately, we need to convert them either to a dataframe that geom_point() can render, or to an {sf} object. The latter is far easier, so let’s take a look:

ateles.sp.sf <- st_as_sf(ateles.sp)

ggplot() + geom_sf(data = tbs_trails) + geom_sf(data = tbs_rio, color = "dodgerblue") + 
    geom_sf(data = ateles.sp.sf, color = "light pink", alpha = 0.5) + theme_bw()

Looks good! Now we’ve got our spider monkey data in the right format for {adehabitatHR}. Let’s make that minimum convex polygon estimate of the home range:

library(adehabitatHR)
ateles.mcp <- mcp(ateles.sp, percent = 100)
ateles.mcp
## Object of class "SpatialPolygonsDataFrame" (package sp):
## 
## Number of SpatialPolygons:  1
## 
## Variables measured:
##   id     area
## a  a 564.6744

As you can see, the output gives us an estimate of the home range area; note that mcp() reports this in hectares by default, so the home range here is about 565 hectares (not square meters!). A polygon has also been created. Let’s take a look at this polygon, laid over our map:

# First, we'll convert our polygon into an {sf} object:
ateles.mcp <- st_as_sf(ateles.mcp)

# Now, let's map it:
mcp1 <- q + geom_sf(data = ateles.mcp, color = "pink", fill = "pink", alpha = 0.3)
mcp1
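A note on units: mcp() assumes input coordinates in meters (unin = "m") and reports area in hectares (unout = "ha") by default, which is why the estimate above reads as roughly 565 ha. If you’d rather have other units, a sketch:

# The same 100% MCP, but reporting area in square kilometers instead:
mcp(ateles.sp, percent = 100, unin = "m", unout = "km2")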

Now, imagine that you think the points east of the trail system are outliers, and may actually represent a different group of monkeys… how do you get rid of them? You could go back to the original data file and delete them… but then you’re actually getting rid of recorded data (which is not a good idea). You could try to modify the {sp} dataframe here in R, but regular {dplyr} and other dataframe ‘verbs’ don’t work on an {sp} dataframe. Luckily, there’s a lovely solution called {spdplyr} that implements {dplyr} verbs for {sp} datasets.

So, to get rid of those points, we can excise them using the filter() verb; in this case, we drop all points east of the UTMX coordinate 373050:

library(spdplyr)
ateles.sp <- ateles.sp %>% filter(UTMX < 373050)
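(If you’d rather not add another package, base-R bracket subsetting also works on {sp} objects; the following should be equivalent:)

# Base-R equivalent of the {spdplyr} filter above: keep only the points
# west of UTMX 373050:
ateles.sp <- ateles.sp[ateles.sp$UTMX < 373050, ]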

Now let’s see how that influences our MCP estimate of the home range:

ateles.mcp <- mcp(ateles.sp, percent = 100)
ateles.mcp <- st_as_sf(ateles.mcp)
mcp1 <- q + geom_sf(data = ateles.mcp, color = "pink", fill = "pink", alpha = 0.3)
mcp1

Kernel Density Estimated Home Range

Another way of estimating the home range is to take into account how often certain parts of the home range are used based on the density of observation points in a particular area of that range. Such density estimates assume that the observation points represent a random sample of occupation points that are not biased by observer sampling. We might be wary of this method if, for example, all our observations were on the trail system. Here, however, we do appear to have a relatively unbiased sample.

In this case, we can try to get a kernel density estimate (KDE) of the home range. Kernel density estimates essentially fit a smoothed surface over the density of points in a given geographic area (for an excellent, intuitive visual description of the process, see this resource), which allows us to create a home range based on frequency of use rather than just a presence/absence MCP. Remember: although a kernel density estimator also uses presence-only data, the absence of points from an area is itself informative here, as the local density of points is a meaningful parameter.


CHALLENGE 3



Let’s use the {adehabitatHR} package to estimate the KDE home range!

First, we’ll need to run the kernel smoothing algorithm on our {sp} spider monkey data points. The output of this algorithm will be a density map across the geographic range, divided into a grid. This grid indicates, on a visual scale, the smoothed density of spider monkey points at each location. As often occurs with spatial analysis, it may be easier to show than tell:

library(adehabitatHR)
ateles_k100 <- kernelUD(ateles.sp)
image(ateles_k100)

In this case, the yellow grid squares are the areas of high density, while the red grid squares are the areas of low density.
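How smooth this surface is depends on the bandwidth parameter, h. By default, kernelUD() uses the ‘reference’ bandwidth (h = "href"), a rule of thumb based on the spread of the points; least-squares cross-validation ("LSCV") is a common alternative. A sketch comparing the two:

# The default: the 'reference' bandwidth (equivalent to what we ran above):
ateles_href <- kernelUD(ateles.sp, h = "href")

# The alternative: least-squares cross-validation, which chooses h by
# minimizing estimation error (note that LSCV can fail to converge on
# some datasets, so check the result):
ateles_lscv <- kernelUD(ateles.sp, h = "LSCV")
image(ateles_lscv)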

Now we can get and plot the 85% home range estimate from the kernel density estimator as a “core” home range, and the 99% estimate to see where the “edge” of the home range falls. What this means is that the 85% contour encloses the smallest area containing an estimated 85% of the group’s space use (its utilization distribution), while the 99% contour encloses nearly all of it. Let’s take a look:

kern99 <- getverticeshr(ateles_k100, percent = 99)
kern85 <- getverticeshr(ateles_k100, percent = 85)

kern99 <- st_as_sf(kern99)
kern85 <- st_as_sf(kern85)
kern <- q + geom_sf(data = kern99, color = "pink", fill = "pink", alpha = 0.3) + 
    geom_sf(data = kern85, color = "salmon1", fill = "salmon1", alpha = 0.3)
kern
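If you want these home range areas as numbers rather than polygons, {adehabitatHR} also provides kernel.area(), which (like mcp()) reports hectares by default:

# Estimated home range areas at the 85% and 99% utilization levels:
kernel.area(ateles_k100, percent = c(85, 99))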

Natural hypotheses to test with such data might include whether kernel density correlates with resource density (e.g., the density of Ocotea trees), whether kernel density has an inverse relationship with predator presence, or whether the 85% KDE home range shifts temporally with seasonal factors like the fruiting of key resources or the presence of flooded forest areas that protect the monkeys from terrestrial predators.