Group Project–Species Distribution Model


Introduction

Backgroud

Species distribution modeling (SDM) is also known by other names including climate envelope-modeling, habitat modeling, and (environmental or ecological) niche-modeling.

New technologies and emerging disciplines, such as biodiversity bioinformatics, have promoted the development of methodological approaches that allow generating useful information from incomplete data; such is the case of the modeling of the ecological niches of species and their geographical distributions.

What is the geographical distribution of a species? In the simplest sense, it is that species’ spatial arrangement on the planet, or the geographical description of where it lives.

Traditionally, the distribution of species has been represented in a cartographic way from the geo-referenced points on maps, without considering explicitly the local factors that determine that distribution.

SDMs are correlative methods, static and non mechanistic, that work by identifying the relationships that exist between variables associated with the local environment and observations of presence for the species.

To make the best use of SDMs it is very important know how they work, but equally important is to know the limitations of the data sources, algorithms, and the methodological approach in general.

Despite their limitations, SDMs are a useful tool to generate geographical hypotheses about present, past and future distribution of species.

In this module we provide a short introduction to species distribution modelling in R. This module will cover a brief overview of the concept of species distribution modelling and the basic approach to modelling in R.

Figure 1. Diagram of the Species Distribution Modelling procedure. (Rodríguez-Rey et al., 2019)

There is a wide variety of modeling algorithms (between 20 and 30), as well as a series of platforms where they have been implemented. The aim of SDM is to estimate the similarity of the conditions at any site to the conditions at the locations of known occurrence (and perhaps of non-occurrence) of a phenomenon. A common application of this method is to predict species ranges with climate data as predictors.

SDM steps

  1. locations of occurrence of a species (or other phenomenon) are compiled

  2. values of environmental predictor variables (such as climate) at these locations are extracted from spatial databases

  3. the environmental values are used to fit a model to estimate similarity to the sites of occurrence, or another measure such as abundance of the species

  4. The model is used to predict the variable of interest across the region of interest (and perhaps for a future or past climate).

SDM uses

Figure 2: Uses of SDMs (Araújo et al., 2019).

Example

Figure 3. An overlay of suitable areas for the five tree-cavity-nesters’ distribution models with protected areas. (Moradi et al., 2019)

Package

Notes: Before you begin you will need to install and load the following packages and their dependencies:

  • {curl}
  • {dismo}
  • {dplyr}
  • {maptools}
  • {raster}
  • {rJava}
  • {rgdal}
  • {rgbif}
  • {rgeos}

{dismo} (=“distribution modeling”), and {raster} (=“geographic data analysis and modelling”), and each one of them has a set of functions that enable the modelling of species distributions in geographic and spatial contexts.

Step 1: Data preparation

You need to collect a sufficient number of occurrence records (and absence or abundance, if possible) of the species of interest. You also need to have accurate and relevant environmental data (predictor variables) at a sufficiently high spatial resolution.

Step 2: Import species occurrence data

Importing occurrence data into R is easy. But collecting, georeferencing, and cross-checking coordinate data is tedious…

2.1 Upload files

You can upload our data files like we did in previous modules, using the {curl} package.

For example, we can load a datast of occurrences of blue-eyed grass, Sisyrinchium angustifolium, like so:

library (curl)
#install package curl before this step to load a file from server.
library(knitr)
# to make nice data table. 

f <- curl("https://raw.githubusercontent.com/feiyawang1207/AN597_groupproject/master/S_angustifolium.csv")
#load the file in varibale f
d <- read.csv(f, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# read file in d as a dataframe.
kable(head(d))
gbifID datasetKey occurrenceID kingdom phylum class order family genus species infraspecificEpithet taxonRank scientificName verbatimScientificName verbatimScientificNameAuthorship countryCode locality stateProvince occurrenceStatus individualCount publishingOrgKey decimalLatitude decimalLongitude coordinateUncertaintyInMeters coordinatePrecision elevation elevationAccuracy depth depthAccuracy eventDate day month year taxonKey speciesKey basisOfRecord institutionCode collectionCode catalogNumber recordNumber identifiedBy dateIdentified license rightsHolder recordedBy typeStatus establishmentMeans lastInterpreted mediaType issue
2451752339 50c9509d-22c7-4a22-a47d-8c48425ef4a7 https://www.inaturalist.org/observations/35412543 Plantae Tracheophyta Magnoliopsida Solanales Solanaceae Solanum Solanum angustifolium NA SPECIES Solanum rostratum Dun. Solanum rostratum MX Michoacán NA 28eb1a3f-1c15-4a95-931a-4af90ecb574d 19.07638 -102.88527 22 NA NA NA NA NA 2019-11-07T09:40:48Z 7 11 2019 2930922 2929800 HUMAN_OBSERVATION iNaturalist Observations 35412543 2019-11-07T16:03:48Z CC_BY_NC_4_0 Jesús Martínez Jesús Martínez 2019-11-15T13:12:19.217Z STILLIMAGE GEODETIC_DATUM_ASSUMED_WGS84;COORDINATE_ROUNDED
2451721667 50c9509d-22c7-4a22-a47d-8c48425ef4a7 https://www.inaturalist.org/observations/35294114 Plantae Tracheophyta Magnoliopsida Solanales Solanaceae Solanum Solanum angustifolium NA SPECIES Solanum rostratum Dun. Solanum rostratum MX Jalisco NA 28eb1a3f-1c15-4a95-931a-4af90ecb574d 21.36370 -101.94901 4 NA NA NA NA NA 2019-11-03T14:32:00Z 3 11 2019 2930922 2929800 HUMAN_OBSERVATION iNaturalist Observations 35294114 2019-11-04T18:09:06Z CC_BY_NC_4_0 Christian Martinez Christian Martinez 2019-11-15T13:11:55.564Z STILLIMAGE GEODETIC_DATUM_ASSUMED_WGS84;COORDINATE_ROUNDED
2451686769 50c9509d-22c7-4a22-a47d-8c48425ef4a7 https://www.inaturalist.org/observations/35175230 Plantae Tracheophyta Magnoliopsida Solanales Solanaceae Solanum Solanum angustifolium NA SPECIES Solanum rostratum Dun. Solanum rostratum US Texas NA 28eb1a3f-1c15-4a95-931a-4af90ecb574d 32.69588 -102.25799 8 NA NA NA NA NA 2019-11-01T15:15:39Z 1 11 2019 2930922 2929800 HUMAN_OBSERVATION iNaturalist Observations 35175230 2019-11-01T20:17:03Z CC_BY_NC_4_0 Nathan Taylor Nathan Taylor 2019-11-15T13:11:23Z STILLIMAGE GEODETIC_DATUM_ASSUMED_WGS84
2451656431 50c9509d-22c7-4a22-a47d-8c48425ef4a7 https://www.inaturalist.org/observations/35013197 Plantae Tracheophyta Magnoliopsida Solanales Solanaceae Solanum Solanum angustifolium NA SPECIES Solanum rostratum Dun. Solanum rostratum US Texas NA 28eb1a3f-1c15-4a95-931a-4af90ecb574d 30.22526 -99.01538 6 NA NA NA NA NA 2019-10-28T11:53:22Z 28 10 2019 2930922 2929800 HUMAN_OBSERVATION iNaturalist Observations 35013197 2019-10-28T16:53:40Z CC_BY_NC_4_0 fbgjwh fbgjwh 2019-11-15T13:11:04.858Z STILLIMAGE GEODETIC_DATUM_ASSUMED_WGS84;COORDINATE_ROUNDED
2451601084 50c9509d-22c7-4a22-a47d-8c48425ef4a7 https://www.inaturalist.org/observations/34881947 Plantae Tracheophyta Magnoliopsida Solanales Solanaceae Solanum Solanum angustifolium NA SPECIES Solanum rostratum Dun. Solanum rostratum US Arkansas NA 28eb1a3f-1c15-4a95-931a-4af90ecb574d 36.44975 -94.59106 10 NA NA NA NA NA 2019-08-21T13:59:00Z 21 8 2019 2930922 2929800 HUMAN_OBSERVATION iNaturalist Observations 34881947 2019-10-25T19:01:17Z CC_BY_NC_4_0 jim_keesling jim_keesling 2019-11-15T13:10:54.246Z STILLIMAGE GEODETIC_DATUM_ASSUMED_WGS84
2451556298 1060a696-f6a2-4341-92ef-47a63babbea1 7526a2b2-b355-492e-9d57-7de5519acd6e Plantae Tracheophyta Magnoliopsida Solanales Solanaceae Solanum Solanum angustifolium NA SPECIES Solanum rostratum Dun. Solanum rostratum Dunal US by Hwy 30 c. 10 mi W of Sidney Nebraska NA 75bf5fec-40fa-43ae-919f-c43bd7ce25c9 NA NA NA NA NA NA NA NA 1945-08-21T00:00:00Z 21 8 1945 2930922 2929800 PRESERVED_SPECIMEN SJSU 12374 5425 CC_BY_4_0 C.W. Sharsmith 2019-11-14T16:31:00.343Z

2.2 Importing occurrence data

If you do not have any species distribution data you can get started by downloading data from the Global Biodiversity Inventory Facility (GBIF) There you can find 1.354.603.236 occurrence records!

To import data from a database like this, we can use a function called gbif () to download files from the GBIF database.

For now, we are going to download the occurrence data from a potato species. Solanum acaule is widespread and common in upland habitats from northern Peru (Dept. Cajamarca), south through Bolivia to northern Argentina (Prov. San Juan), and with one record in northern Chile (Antofagasta Region)

Figure 4: Picture of Solanum acaule

library(dismo)
## Loading required package: raster
## Loading required package: sp
library(knitr)

acaule <- gbif("solanum", "acaule*", geo=FALSE)
## Loading required namespace: jsonlite
## 7229 records found
## 0-300-600-900-1200-1500-1800-2100-2400-2700-3000-3300-3600-3900-4200-4500-4800-5100-5400-5700-6000-6300-6600-6900-7200-7229 records downloaded
#load the saved S. acaule data from data base

kable(head(acaule))
acceptedScientificName acceptedTaxonKey accessRights adm1 adm2 associatedReferences basisOfRecord bibliographicCitation catalogNumber class classKey cloc collectionCode collectionID continent coordinateUncertaintyInMeters country crawlId datasetID datasetKey datasetName dateIdentified day depth depthAccuracy disposition dynamicProperties elevation elevationAccuracy endDayOfYear establishmentMeans eventDate eventID eventRemarks eventTime family familyKey fieldNotes fieldNumber fullCountry gbifID genericName genus genusKey geodeticDatum georeferenceProtocol georeferenceRemarks georeferenceSources habitat higherClassification higherGeography higherGeographyID http://unknown.org/nick http://unknown.org/occurrenceDetails identificationID identificationRemarks identifiedBy identifier individualCount infraspecificEpithet installationKey institutionCode institutionID ISO2 key kingdom kingdomKey language lastCrawled lastInterpreted lastParsed lat license locality lon materialSampleID modified month municipality nomenclaturalCode nomenclaturalStatus occurrenceID occurrenceRemarks order orderKey organismID organismQuantity organismQuantityType organismRemarks originalNameUsage otherCatalogNumbers ownerInstitutionCode parentEventID phylum phylumKey preparations previousIdentifications protocol publishingCountry publishingOrgKey recordedBy recordNumber references rights rightsHolder samplingProtocol scientificName scientificNameID species speciesKey specificEpithet startDayOfYear taxonID taxonKey taxonomicStatus taxonRank taxonRemarks type typeStatus typifiedName verbatimCoordinateSystem verbatimElevation verbatimEventDate verbatimLocality vernacularName year downloadDate
Solanum acaule Bitter 2931061 NA Cusco NA NA HUMAN_OBSERVATION NA 22332967 Magnoliopsida 220 Cusco, Peru Observations NA NA 300 Peru 190 NA 50c9509d-22c7-4a22-a47d-8c48425ef4a7 iNaturalist research-grade observations 2019-04-30T14:41:30 26 NA NA NA NA NA NA NA NA 2019-03-26T14:30:00 NA NA 18:30:00Z Solanaceae 7717 NA NA Peru 2238787932 Solanum Solanum 2928997 WGS84 NA NA NA NA NA NA NA cstobie https://www.inaturalist.org/observations/22332967 52791525 NA NA 22332967 NA NA 997448a8-f762-11e1-a439-00145eb45e9a iNaturalist NA PE 2238787932 Plantae 6 NA 2019-11-29T14:10:16.861+0000 2019-11-29T15:09:07.641+0000 2019-11-29T15:09:07.641+0000 -15.047820 http://creativecommons.org/licenses/by-nc/4.0/legalcode NA -71.13188 NA 2019-05-20T17:10:31.000+0000 3 NA NA NA https://www.inaturalist.org/observations/22332967 NA Solanales 1176 NA NA NA NA NA NA NA NA Tracheophyta 7707728 NA NA DWC_ARCHIVE US 28eb1a3f-1c15-4a95-931a-4af90ecb574d cstobie NA https://www.inaturalist.org/observations/22332967 © cstobie some rights reserved cstobie NA Solanum acaule Bitter NA Solanum acaule 2931061 acaule NA 886633 2931061 ACCEPTED SPECIES NA NA NA NA NA NA 2019/03/26 2:30 PM EDT Espinar, Perú NA 2019 2019-12-04
Solanum acaule Bitter 2931061 NA Catamarca Ambato NA PRESERVED_SPECIMEN NA CORD00056696 Magnoliopsida 220 Camino desde El Rodeo rumbo a Primer Campo. Puesto Vergara, Primer Campo, alrededores de la Casa de Piedra, Catamarca, Ambato, Argentina, SOUTH_AMERICA NA NA SOUTH_AMERICA NA Argentina 15 NA ab61ad8f-953f-4c7c-80b2-e1fb50fd8187 NA 2016-01-01T00:00:00 24 NA NA NA NA NA NA NA NATIVE 2016-02-24T00:00:00 NA NA NA Solanaceae 7717 NA NA Argentina 2252415409 Solanum Solanum 2928997 WGS84 NA Puntual GPS Ensign - Datum WGS84 NA NA Sudamérica | Argentina | Catamarca NA NA NA NA NA Chiarini, F. CORD-44463984-f4cc-49ec-8911-2de91f216c2b-AR NA NA 4fd221cd-129c-404f-ab62-fdd2377d12ac CORD NA AR 2252415409 Plantae 6 NA 2019-11-05T11:30:29.242+0000 2019-11-13T15:15:57.223+0000 2019-11-13T15:15:57.223+0000 -28.179167 http://creativecommons.org/licenses/by/4.0/legalcode Camino desde El Rodeo rumbo a Primer Campo. Puesto Vergara, Primer Campo, alrededores de la Casa de Piedra -65.99250 NA NA 2 NA NA NA CORD-44463984-f4cc-49ec-8911-2de91f216c2b-AR NA Solanales 1176 NA NA NA NA NA NA NA NA Tracheophyta 7707728 NA Solanaceae | Solanum acaule Bitter | Chiarini, F. | 2016 DWC_ARCHIVE AR 7a6bdf66-ef5c-4a81-b731-2e328f4881eb Barboza, G. E. | Cantero, J. J. & Chiarini, F. 4669 NA NA NA NA Solanum acaule Bitter NA Solanum acaule 2931061 acaule NA NA 2931061 ACCEPTED SPECIES NA NA NA NA NA 3513 24/02/2016 NA NA 2016 2019-12-04
Solanum acaule Bitter 2931061 NA Lima NA NA PRESERVED_SPECIMEN NA BM001134709 Magnoliopsida 220 Huaroz district, 37 km from canta on road to La Viuda, Lima, Peru, SOUTH_AMERICA BOT NA SOUTH_AMERICA NA Peru 142 NA 7e380070-f762-11e1-a439-00145eb45e9a NA NA 28 NA NA NA {“determinationfiledas”: “Yes”, “collectionkind”: “Sheet”, “gbifissue”: [“GEODETIC_DATUM_ASSUMED_WGS84”], “created”: 1415984060000, “associatedmediacount”: 1, “project”: “Picturae”, “determinationnames”: “Solanum acaule Bitter”, “plantdescription”: “herb to .1m high; infrequent, flowers blue/purple, without interstitial leaflet; collection time: 11:42am; mixed shrubland with dominance of Senecio collinus and Lupinus brachyprennon”, “subdepartment”: “Flowering Plants”, “gbifid”: 1056708863} 4052 NA NA NA 2014-02-28T00:00:00 NA NA NA Solanaceae 7717 NA 2893 Peru 1056708863 Solanum Solanum 2928997 WGS84 NA NA NA NA NA South America; Peru; Lima; Canta NA NA NA NA NA Sandra (Sandy) Diane Knapp 4338577 NA NA 60277b56-f762-11e1-a439-00145eb45e9a NHMUK NA PE 1056708863 Plantae 6 NA 2019-11-29T15:42:43.569+0000 2019-11-29T16:08:04.040+0000 2019-11-29T16:08:04.040+0000 -11.387433 http://creativecommons.org/publicdomain/zero/1.0/legalcode Huaroz district, 37 km from canta on road to La Viuda -76.47753 NA NA 2 NA NA NA 2bc0e320-b403-43b0-bede-334814a339b1 NA Solanales 1176 NA NA NA NA NA NHMUK:ecatalogue:4338577 NA NA Tracheophyta 7707728 NA NA DWC_ARCHIVE GB 19456090-b49a-11d8-abeb-b8a03c50a862 Paúl Gonzáles, Mindy M. Syfert, Dr Erica McAlister, D Whitmore, Sandra (Sandy) Diane Knapp 2893 NA NA NA NA Solanum acaule Bitter NA Solanum acaule 2931061 acaule NA NA 2931061 ACCEPTED SPECIES NA NA NA NA NA NA NA NA NA 2014 2019-12-04
Solanum acaule Bitter 2931061 NA Lima Canta NA PRESERVED_SPECIMEN http://www.tropicos.org/Specimen/100991187 100991187 Magnoliopsida 220 Huaroz district, 37 km from Canta on road to La Viuda., Lima, Canta, Peru, SOUTH_AMERICA MO http://biocol.org/urn:lsid:biocol.org:col:15859 SOUTH_AMERICA NA Peru 95 NA 7bd65a7a-f762-11e1-a439-00145eb45e9a Tropicos NA 28 NA NA NA NA 4052 NA NA NA 2014-02-28T00:00:00 NA NA NA Solanaceae 7717 NA NA Peru 2268891108 Solanum Solanum 2928997 WGS84 NA NA NA NA NA NA NA NA NA NA NA NA urn:catalog:MO:Tropicos:100991187 1 NA c4e134d3-c68f-406d-a116-6a37895602e0 MO NA PE 2268891108 Plantae 6 NA 2019-11-29T07:20:04.818+0000 2019-11-29T07:48:48.952+0000 2019-11-29T07:48:48.952+0000 -11.387500 http://creativecommons.org/licenses/by/4.0/legalcode Huaroz district, 37 km from Canta on road to La Viuda. -76.47750 NA 2017-08-08T14:37:00.000+0000 2 NA ICNafp No opinion urn:catalog:MO:Tropicos:100991187 NA Solanales 1176 NA NA NA NA NA NA MOBOT NA Tracheophyta 7707728 NA NA DWC_ARCHIVE US 90fd6680-349f-11d8-aa2d-b8a03c50a862 Paúl Gonzáles|M. Syfert|E. McAlister|D. Whitmore|Sandy Knapp Gonzáles Arce 2893 NA NA Missouri Botanical Garden NA Solanum acaule Bitter NA Solanum acaule 2931061 acaule NA 29600091 2931061 ACCEPTED SPECIES NA PhysicalObject NA NA NA NA NA NA NA 2014 2019-12-04
Solanum acaule Bitter 2931061 NA Lima Oyon NA PRESERVED_SPECIMEN http://www.tropicos.org/Specimen/100991022 100991022 Magnoliopsida 220 Oyón district, 1.7 km up to Quichas on road up to Cordillera Raura., Lima, Oyon, Peru, SOUTH_AMERICA MO http://biocol.org/urn:lsid:biocol.org:col:15859 SOUTH_AMERICA NA Peru 95 NA 7bd65a7a-f762-11e1-a439-00145eb45e9a Tropicos NA 6 NA NA NA NA 4134 NA NA NA 2014-03-06T00:00:00 NA NA NA Solanaceae 7717 NA NA Peru 2268891573 Solanum Solanum 2928997 WGS84 NA NA NA NA NA NA NA NA NA NA NA NA urn:catalog:MO:Tropicos:100991022 1 NA c4e134d3-c68f-406d-a116-6a37895602e0 MO NA PE 2268891573 Plantae 6 NA 2019-11-29T07:20:04.818+0000 2019-11-29T07:48:54.267+0000 2019-11-29T07:48:54.267+0000 -10.558056 http://creativecommons.org/licenses/by/4.0/legalcode Oyón district, 1.7 km up to Quichas on road up to Cordillera Raura. -76.77194 NA 2017-08-07T14:49:00.000+0000 3 NA ICNafp No opinion urn:catalog:MO:Tropicos:100991022 NA Solanales 1176 NA NA NA NA NA NA MOBOT NA Tracheophyta 7707728 NA NA DWC_ARCHIVE US 90fd6680-349f-11d8-aa2d-b8a03c50a862 Paúl Gonzáles|M. Syfert|E. McAlister|D. Whitmore Gonzáles Arce 2961 NA NA Missouri Botanical Garden NA Solanum acaule Bitter NA Solanum acaule 2931061 acaule NA 29600091 2931061 ACCEPTED SPECIES NA PhysicalObject NA NA NA NA NA NA NA 2014 2019-12-04
Solanum acaule Bitter 2931061 NA Ancash NA NA PRESERVED_SPECIMEN NA BM001120832 Magnoliopsida 220 between km55-57 on rd from Yanac to Sihuas, Ancash, Peru, SOUTH_AMERICA BOT NA SOUTH_AMERICA NA Peru 142 NA 7e380070-f762-11e1-a439-00145eb45e9a NA NA 19 NA NA NA {“determinationfiledas”: “Yes”, “collectionkind”: “Sheet”, “gbifissue”: [“GEODETIC_DATUM_ASSUMED_WGS84”], “created”: 1415986484000, “associatedmediacount”: 1, “project”: “Picturae”, “determinationnames”: “Solanum acaule Bitter”, “plantdescription”: “Rosette forming herb; corolla purple; fruits green, turning purple-black; growing along moist muddy areas in puna vegetation close to Polylepis forest”, “subdepartment”: “Flowering Plants”, “gbifid”: 1055949175} 3955 NA NA NA 2013-05-19T00:00:00 NA NA NA Solanaceae 7717 NA 4706 Peru 1055949175 Solanum Solanum 2928997 WGS84 NA NA NA NA NA South America; Peru; Ancash; Corongo NA NA NA NA NA Tiina Särkinen 4339588 NA NA 60277b56-f762-11e1-a439-00145eb45e9a NHMUK NA PE 1055949175 Plantae 6 NA 2019-11-29T15:42:43.569+0000 2019-11-29T16:08:20.122+0000 2019-11-29T16:08:20.122+0000 -8.571389 http://creativecommons.org/publicdomain/zero/1.0/legalcode between km55-57 on rd from Yanac to Sihuas -77.71803 NA NA 5 NA NA NA c3a7d94e-d9d4-469e-b724-a85252512ed1 NA Solanales 1176 NA NA NA NA NA NHMUK:ecatalogue:4339588 NA NA Tracheophyta 7707728 NA NA DWC_ARCHIVE GB 19456090-b49a-11d8-abeb-b8a03c50a862 Tiina Särkinen, H. Maria Baden, Dr Erica McAlister, Diana M. Percy 4706 NA NA NA NA Solanum acaule Bitter NA Solanum acaule 2931061 acaule NA NA 2931061 ACCEPTED SPECIES NA NA NA NA NA NA NA NA NA 2013 2019-12-04

If you want to check how many rows and colums the data frame you have, we can use dim () to show the dimensions of the dataset.

colrow<-dim(acaule)
# use dim() to get dimension of our dataset

colrow[1]
## [1] 7229
#show row numbers

y<-colrow[2]
#show column number

In order to draw a map, we need to find the location by longitude and latitude from our dataset.

Here, we’re subsetting the data to include only those that have longitude and latitude. Many occurence records may not have geographic coordinates in your dataset, so you will need to remove those NA values from your dataframe. We’ll use the is.na() command to assess whether the data have both longitude and latitude data, and then subset them into a new data frame.

acgeo <- subset(acaule, !is.na(lon) & !is.na(lat))
#use subset() to seperate those data

#datatable(head(acgeo))
#showing the example data see that have longtitude and latitude data

dim(acgeo)
## [1] 4643  127
#showing the rows and columns for data frame
acgeo[1:4, c(1:5,7:10)] #removing locality
##   acceptedScientificName acceptedTaxonKey accessRights      adm1   adm2
## 1  Solanum acaule Bitter          2931061         <NA>     Cusco   <NA>
## 2  Solanum acaule Bitter          2931061         <NA> Catamarca Ambato
## 3  Solanum acaule Bitter          2931061         <NA>      Lima   <NA>
## 4  Solanum acaule Bitter          2931061         <NA>      Lima  Canta
##        basisOfRecord                      bibliographicCitation
## 1  HUMAN_OBSERVATION                                       <NA>
## 2 PRESERVED_SPECIMEN                                       <NA>
## 3 PRESERVED_SPECIMEN                                       <NA>
## 4 PRESERVED_SPECIMEN http://www.tropicos.org/Specimen/100991187
##   catalogNumber         class
## 1      22332967 Magnoliopsida
## 2  CORD00056696 Magnoliopsida
## 3   BM001134709 Magnoliopsida
## 4     100991187 Magnoliopsida

2.3 Making a map of the ocurrence localities

In order to make a map of our subset of Solanum acaule data that have locations, we’ll use the package {maptools}.

First, we need to draw an empty map: a variable called wrld_simpl! will give us an empty world-wide map. The wrld_simpl dataset contains rough country outlines.

library(maptools)
## Checking rgeos availability: TRUE
data(wrld_simpl)
x<-plot(wrld_simpl, xlim=c(-80,70), ylim=c(-80,30), axes=TRUE)

# empty map

Now we need to add our data points. It is important to make such maps to make sure that the points are, at least roughly, in the right location:

library(maptools)

data(wrld_simpl)
plot(wrld_simpl, xlim=c(-80,70), ylim=c(-80,30), axes=TRUE, col="light yellow")
# restore the box around the map
box()
# add the points
points(acgeo$lon, acgeo$lat, col='orange', pch=20, cex=0.75)
# plot points again to add a border, for better visibility
points(acgeo$lon, acgeo$lat, col='red', cex=0.75)

 r <- raster(system.file("external/test.grd", package="raster"))

2.4 Data cleaning

Data ‘cleaning’ is particularly important for data sourced from species distribution data warehouses such as GBIF. As we mentioned, we already know that Solanum acaule is a species that occurs in the higher parts of the Andes mountains of southern Peru, Bolivia and northern Argentina. Do we see this in our map?

We imported more than seven thousand records from GBIF, but not all of these are usable because of various issues. This species is distributed mainly in the Andes mountains of southern Peru, Bolivia and northern Argentina, but as you can see in the previous map there are records in other countries and even in Antartica! In order to clean this data we will need to:

  1. avoid false identifications

  2. use only specimens collected in a period relevant to the climate data we will be utilizing

  3. have reliable locational certainty (the cleaning will depend of the species we are modelling and the data availability.

The point in Brazil (record acaule[98,]) should actually be in southern Bolivia, so this is probably a typo in the longitude value.

There are also three records that have plausible latitudes, but longitudes that are clearly wrong, as they are in the Atlantic Ocean, south of West Africa! It looks like they have a longitude that is zero. In many data-bases you will find values that are ‘zero’ where ‘no data’ was intended. The gbif function (when using the default arguments) sets coordinates that are (0, 0) to NA, but not if only one of the coordinates is listed as zero. Let’s see if we find these by searching for records with longitudes of zero.

lonzero = subset(acgeo, lon==0)
lonzero[, 1:13]
##  [1] acceptedScientificName acceptedTaxonKey       accessRights          
##  [4] adm1                   adm2                   associatedReferences  
##  [7] basisOfRecord          bibliographicCitation  catalogNumber         
## [10] class                  classKey               cloc                  
## [13] collectionCode        
## <0 rows> (or 0-length row.names)

The records might also come from very different sources, sometimes from museum collections or zoos. Be sure to choose only the kind of records you really want:

unique(acaule$basisOfRecord)#if you want to know origin of your records.
## [1] "HUMAN_OBSERVATION"  "PRESERVED_SPECIMEN" "LIVING_SPECIMEN"   
## [4] "UNKNOWN"
#You can either decide to work with specimens stored only in  scientific collections or include all records in your analysis.
acaule1 <- subset(acaule,
    basisOfRecord=="specimen" |
    basisOfRecord=="unknown" |
    basisOfRecord=="living"
    )
nrow(acaule1)
## [1] 0

2.5 Duplicate records

Many databases share data, so you also risk including duplicate records. The next function will help you to remove duplicates.

#which records are duplicates (only for the first 10 columns)?
dups <- duplicated(lonzero[, 1:10])
#remove duplicates
lonzero  <-  lonzero[dups, ]
lonzero[,1:13]
##  [1] acceptedScientificName acceptedTaxonKey       accessRights          
##  [4] adm1                   adm2                   associatedReferences  
##  [7] basisOfRecord          bibliographicCitation  catalogNumber         
## [10] class                  classKey               cloc                  
## [13] collectionCode        
## <0 rows> (or 0-length row.names)

Another approach might be to detect duplicates for the same species and some coordinates in the data, even if the records were from collections by different people or in different years (in our case, using species is redundant as we have data for only one species).

# differentiating by (sub) species
# dups2 <- duplicated(acgeo[, c('species', 'lon', 'lat')])
# ignoring (sub) species and other naming variation
dups2 <- duplicated(acgeo[, c('lon', 'lat')])
# number of duplicates
sum(dups2)
## [1] 2345
## [1] 483
# keep the records that are _not_ duplicated
acg <- acgeo[!dups2, ]

Let’s repatriate the records near Pakistan to Argentina, and remove the records in Brazil, Antarctica, and with a longitude of ‘0’.

i <- acg$lon > 0 & acg$lat > 0
acg$lon[i] <- -1 * acg$lon[i]
acg$lat[i] <- -1 * acg$lat[i]
acg <- acg[acg$lon < -50 & acg$lat > -50, ]

2.6 Cross-checking

It is important to cross-check coordinates by visual and other means. One approach is to compare the country (and lower level administrative subdivisions) of the site as specified by the records, with the country implied by the coordinates (Hijmans et al., 1999).

We should ask these two questions:

  1. Which points (identified by their record numbers) do not match any country (that is, they are in an ocean)?

  2. Which points have coordinates that are in a different country than listed in the ‘country’ field of the GBIF record

In the example below we use the coordinates() function from the sp package to create a spatial points data frame, and then the over() function, also from sp, to do a point-in-polygon query with the named countries polygons.

library(sp)
coordinates(acg) <- ~lon+lat
crs(acg) <- crs(wrld_simpl)
class(acg)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
class(wrld_simpl)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
ovr <- over(acg, wrld_simpl)
head(ovr)
##   FIPS ISO2 ISO3  UN      NAME   AREA  POP2005 REGION SUBREGION     LON
## 1   PE   PE  PER 604      Peru 128000 27274266     19         5 -75.552
## 2   AR   AR  ARG  32 Argentina 273669 38747148     19         5 -65.167
## 3   PE   PE  PER 604      Peru 128000 27274266     19         5 -75.552
## 4   PE   PE  PER 604      Peru 128000 27274266     19         5 -75.552
## 5   PE   PE  PER 604      Peru 128000 27274266     19         5 -75.552
## 6   PE   PE  PER 604      Peru 128000 27274266     19         5 -75.552
##       LAT
## 1  -9.326
## 2 -35.377
## 3  -9.326
## 4  -9.326
## 5  -9.326
## 6  -9.326
cntr <- ovr$NAME
i <- which(is.na(cntr))
i
## integer(0)
## integer(0)
j <- which(cntr != acg$country)
# for the mismatches, bind the country names of the polygons and points
cbind(cntr, acg$country)[j,]
##       cntr             
##  [1,] "27"  "Argentina"
##  [2,] "27"  "Argentina"
##  [3,] "27"  "Argentina"
##  [4,] "172" "Bolivia"  
##  [5,] "172" "Bolivia"  
##  [6,] "172" "Bolivia"  
##  [7,] "27"  "Argentina"
##  [8,] "27"  "Argentina"
##  [9,] "172" "Bolivia"  
## [10,] "172" "Bolivia"  
## [11,] "172" "Bolivia"  
## [12,] "172" "Bolivia"  
## [13,] "172" "Bolivia"  
## [14,] "172" "Bolivia"  
## [15,] "172" "Bolivia"  
## [16,] "172" "Bolivia"  
## [17,] "172" "Bolivia"  
## [18,] "172" "Bolivia"  
## [19,] "172" "Bolivia"  
## [20,] "172" "Bolivia"  
## [21,] "172" "Bolivia"  
## [22,] "172" "Bolivia"  
## [23,] "172" "Bolivia"  
## [24,] "172" "Bolivia"  
## [25,] "172" "Bolivia"  
## [26,] "172" "Bolivia"  
## [27,] "172" "Bolivia"  
## [28,] "172" "Bolivia"  
## [29,] "172" "Bolivia"  
## [30,] "172" "Bolivia"  
## [31,] "172" "Bolivia"  
## [32,] "172" "Bolivia"  
## [33,] "172" "Bolivia"  
## [34,] "172" "Bolivia"  
## [35,] "172" "Bolivia"  
## [36,] "172" "Bolivia"  
## [37,] "172" "Bolivia"  
## [38,] "172" "Bolivia"  
## [39,] "172" "Bolivia"  
## [40,] "27"  "Peru"     
## [41,] "172" "Bolivia"  
## [42,] "172" "Bolivia"  
## [43,] "172" "Bolivia"  
## [44,] "172" "Bolivia"  
## [45,] "27"  "Peru"     
## [46,] "172" "Bolivia"  
## [47,] "172" "Bolivia"  
## [48,] "172" "Bolivia"  
## [49,] "172" "Bolivia"  
## [50,] "172" "Bolivia"  
## [51,] "172" "Bolivia"  
## [52,] "172" "Bolivia"  
## [53,] "172" "Bolivia"  
## [54,] "27"  "Peru"     
## [55,] "172" "Bolivia"  
## [56,] "27"  "Peru"     
## [57,] "27"  "Peru"     
## [58,] "27"  "Peru"     
## [59,] "31"  "Peru"     
## [60,] "65"  "Peru"     
## [61,] "65"  "Peru"     
## [62,] "143" "Peru"     
## [63,] "27"  "Argentina"
## [64,] "172" "Bolivia"  
## [65,] "172" "Bolivia"  
## [66,] "27"  "Argentina"
## [67,] "172" "Bolivia"
plot(acg)
plot(wrld_simpl, add=T, border='blue', lwd=2)
points(acg[j, ], col='red', pch=20, cex=2)

Step 3: Deal with absence and background points

Some of the early species distribution model algorithms, such as Bioclim and Domain only use ’presence’ data in the modeling process. Other methods also use ’absence’ data or ’background’ data.

3.1 Absence data

If you want to have a non-NA-value data frame for your method, here is what you could do…

First, there is a common method to delete all NA values from a data frame: we can use the na.omit() function to omit the rows that contains the NA values.

For example, given you only need longtitude and latitude values to draw your map, you can take those two columns into a data frame.

long<-acaule$lon
#extract column of longtitude
lati<-acaule$lat
#extract column of latitude

d<-cbind(long,lati)
#combine into a data frame
colnames(d)<-c("longitude","latitude")
#name data frame

dim(d)[1]
## [1] 7229
#rows number of uncleaning data frame

d[27,]
## longitude  latitude 
##        NA        NA
#there are some NA rows in this data frame

Then, you can use na.omit() to remove all NA value from this data frame.

clean_d <-na.omit(d)
#save teh cleaing data frame

dim(clean_d)[1]
## [1] 4643
#rows number of cleaning data frame

clean_d[27,]
## longitude  latitude 
## -67.71306 -28.97167
#NA have been removed from the previous row.

Another simple way to do this is to use the filter() function in the {dplyr} package.

For example:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:raster':
## 
##     intersect, select, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
new_d<-filter(acaule, lon !="NA" & lat !="NA" )
#new variable to store the data frame.
summary(is.na(new_d$lon))
##    Mode   FALSE 
## logical    4643
#use is.na() to check if there is remaining NA in longitude column.

3.2 Background data

Background data do not attempt to guess at absence locations, but rather characterize environments in the study region. In this sense, background is the same, irrespective of where the species has been found.

Background data establishes the environmental domain of the study, whilst presence data should establish under which conditions a species is more likely to be present than on average. A closely related but different concept, that of “pseudo-absences”, is also used for generating the non-presence class for logistic models. In this case, researchers sometimes try to guess where absences might occur – they may sample the whole region except at presence locations, or they might sample at places unlikely to be suitable for the species. We prefer the background concept because it requires fewer assumptions and has some coherent statistical methods for dealing with the “overlap” between presence and background points.

In the example below, we first get the list of filenames with the predictor raster data (discussed in detail in the next section). We use a raster as a ‘mask’ in the randomPoints() function such that the background points are from the same geographic area, and only for places where there are values (land, in our case).

library(dismo)
# get the file names
files <- list.files(path=paste(system.file(package="dismo"), '/ex',
                       sep=''),  pattern='grd',  full.names=TRUE )
# we use the first file to create a RasterLayer
mask <- raster(files[1])
# select 500 random points

set.seed(1963)
# set seed to assure that the examples will always have the same random sample.

bg <- randomPoints(mask, 500 )

plot(!is.na(mask), legend=FALSE)

points(bg, cex=0.5)

# now we repeat the sampling, but limit the area of sampling using a spatial extent
e <- extent(-80, -53, -39, -22)
bg2 <- randomPoints(mask, 50, ext=e)
#limit point for longitude -39~-22, latitude -80~-53

plot(!is.na(mask), legend=FALSE)
plot(e, add=TRUE, col='red')
points(bg2, cex=0.5)

There are several approaches one could use to sample ‘pseudo-absence’ points from a more restricted area than ‘background’.

We first read the cleaned and subsetted S. acaule data that we produced in the previous chapter from the csv file that comes with {dismo}:

file <- paste(system.file(package="dismo"), '/ex/acaule.csv', sep='')
ac <- read.csv(file)
#load the cleaned data file

coordinates(ac) <- ~lon+lat
projection(ac) <- CRS('+proj=longlat +datum=WGS84')
#create a Spatial Points Data Frame

We first create a ‘circles’ model (see the chapter about geographic models), using an arbitrary radius of 50 km. Circles() is a function from {rgeos} package.

library(rgeos)
## rgeos version: 0.5-2, (SVN revision 621)
##  GEOS runtime version: 3.7.2-CAPI-1.11.2 
##  Linking to sp version: 1.3-1 
##  Polygon checking: TRUE
x <- circles(ac, d=50000, lonlat=TRUE)
# circles with a radius of 50 km
pol <- polygons(x)

samp1 <- spsample(pol, 250, type='random', iter=25)
# sample randomly from all circles

cells <- cellFromXY(mask, samp1)
length(cells)
## [1] 250
cells <- unique(cells)
# get unique cells
length(cells)
## [1] 161
xy <- xyFromCell(mask, cells)
plot(pol, axes=TRUE)
points(xy, cex=0.75, pch=20, col='blue')

#make a plot to see results

A similar results could be produced via raster function extract()

# extract cell numbers for the circles
v <- extract(mask, x@polygons, cellnumbers=T)
# use rbind to combine the elements in list v
v <- do.call(rbind, v)
# get unique cell numbers from which you could sample
v <- unique(v[,1])
head(v)
## [1] 15531 15717 17581 17582 17765 17767
# to display the results

m <- mask
m[] <- NA
m[v] <- 1
plot(m, ext=extent(x@polygons)+1)
plot(x@polygons, add=T)

# make a plot for results,

Step 4: Using the R package {raster} to analyze data

4.1 Raster data

A raster consists of a matrix of cells (or pixels) organized into rows and columns (or a grid) where each contains a value representing information. Raster format represents real-world phenomena, so data might be discrete or continuous. Discrete data: also are called thematic, categorical or discontinuous data. A discrete object has known and definable boundaries.

In species distribution modeling, predictor variables are typically organized as raster (grid) type files. Each predictor should be a ‘raster’ representing a variable of interest. Variables can include climatic, soil, terrain, vegetation, land use, and other variables.

These data are typically stored in files in some kind of GIS format. Almost all relevant formats can be used (including ESRI grid, geoTiff, netCDF, IDRISI).

For any particular study the layers should all have the same spatial extent, resolution, origin, and projection. The set of predictor variables (rasters) can be used to make a RasterStack, which is a collection of RasterLayer objects

A RasterLayer object represents single-layer (variable) raster data. A RasterLayer object always stores a number of fundamental parameters that describe it. These include the number of columns and rows, the spatial extent, and the Coordinate Reference System. In addition, a RasterLayer can store information about the file in which the raster cell values are stored (if there is such a file). A RasterLayer can also hold the raster cell values in memory.

Using the {dismo} package, we are going to create a RasterStack from the list of files that are intalled with the package.

Fist, we need to get the folder with our files…..

path <- file.path(system.file(package="dismo"), 'ex')

Now get the names of all the files with extension .grd in this folder. The $ sign in the code below indicates that the files must end with the characters ‘grd’. By using full.names=TRUE, the full path names are returned.

library(dismo)
files <- list.files(path, pattern='grd$', full.names=TRUE )
files
## [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio1.grd" 
## [2] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio12.grd"
## [3] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio16.grd"
## [4] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio17.grd"
## [5] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio5.grd" 
## [6] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio6.grd" 
## [7] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio7.grd" 
## [8] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio8.grd" 
## [9] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/biome.grd"

Now create a RasterStack of predictor variables.

predictors <- stack(files)
predictors
## class      : RasterStack 
## dimensions : 192, 186, 35712, 9  (nrow, ncol, ncell, nlayers)
## resolution : 0.5, 0.5  (x, y)
## extent     : -125, -32, -56, 40  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## names      : bio1, bio12, bio16, bio17, bio5, bio6, bio7, bio8, biome 
## min values :  -23,     0,     0,     0,   61, -212,   60,  -66,     1 
## max values :  289,  7682,  2458,  1496,  422,  242,  461,  323,    14
names(predictors)
## [1] "bio1"  "bio12" "bio16" "bio17" "bio5"  "bio6"  "bio7"  "bio8"  "biome"
plot(predictors)

We can also make a plot of a single layer in a RasterStack, and plot some additional data on top of it.

To so, we’ll first get the world boundaries again, along with data on Bradypus:

Figure 5. Bradypus sp.

library(maptools)
data(wrld_simpl)
file <- paste(system.file(package="dismo"), "/ex/bradypus.csv", sep="")
bradypus <- read.table(file,  header=TRUE,  sep=',')
# we do not need the first column
bradypus  <- bradypus[,-1]

Plotting bioclimatic variables from the WorldClim database:

# first layer of the RasterStack
plot(predictors, 1)
# note the "add=TRUE" argument with plot
plot(wrld_simpl, add=TRUE)
# with the points function, "add" is implicit
points(bradypus, col='blue')

4.2 Extracting values from rasters

We now have a set of predictor climatic variables (rasters) and occurrence points. The next step is to extract the values of the predictors at the locations of the points. (This step can be skipped for the modeling methods that are implemented in the {dismo} package). This is a very straightforward thing to do using the extract() function from the {raster} package.

presvals <- extract(predictors, bradypus)
# setting random seed to always create the same
# random set of points for this example
set.seed(0)
#Set the seed of R's random number generator, which is useful for creating simulations or random objects that can be reproduced

backgr <- randomPoints(predictors, 500)
#setting for backgroud

absvals <- extract(predictors, backgr)
#extract values

pb <- c(rep(1, nrow(presvals)), rep(0, nrow(absvals)))

sdmdata <- data.frame(cbind(pb, rbind(presvals, absvals)))

sdmdata[,'biome'] = as.factor(sdmdata[,'biome'])

head(sdmdata)
##   pb bio1 bio12 bio16 bio17 bio5 bio6 bio7 bio8 biome
## 1  1  263  1639   724    62  338  191  147  261     1
## 2  1  263  1639   724    62  338  191  147  261     1
## 3  1  253  3624  1547   373  329  150  179  271     1
## 4  1  243  1693   775   186  318  150  168  264     1
## 5  1  243  1693   775   186  318  150  168  264     1
## 6  1  252  2501  1081   280  326  154  172  270     1
tail(sdmdata)
##     pb bio1 bio12 bio16 bio17 bio5 bio6 bio7 bio8 biome
## 611  0  265  1622   749    66  344  184  160  266     1
## 612  0  241  1648   475   354  294  183  112  229     1
## 613  0  154   731   250    80  319   18  301  198     8
## 614  0  169  1195   316   273  293   69  224  122     7
## 615  0  247  2965  1173   327  314  168  146  253     1
## 616  0  117   495   231    36  336  -96  432  216     8
summary(sdmdata)
##        pb              bio1           bio12            bio16       
##  Min.   :0.0000   Min.   : 16.0   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:0.0000   1st Qu.:176.8   1st Qu.: 821.5   1st Qu.: 341.8  
##  Median :0.0000   Median :242.0   Median :1497.0   Median : 645.5  
##  Mean   :0.1883   Mean   :213.9   Mean   :1629.5   Mean   : 656.4  
##  3rd Qu.:0.0000   3rd Qu.:261.0   3rd Qu.:2266.8   3rd Qu.: 917.0  
##  Max.   :1.0000   Max.   :286.0   Max.   :7682.0   Max.   :2458.0  
##                                                                    
##      bio17             bio5            bio6             bio7      
##  Min.   :   0.0   Min.   : 88.0   Min.   :-167.0   Min.   : 68.0  
##  1st Qu.:  42.0   1st Qu.:303.0   1st Qu.:  45.0   1st Qu.:118.8  
##  Median : 110.0   Median :320.0   Median : 151.0   Median :160.0  
##  Mean   : 169.2   Mean   :309.3   Mean   : 119.3   Mean   :190.0  
##  3rd Qu.: 240.5   3rd Qu.:333.0   3rd Qu.: 202.2   3rd Qu.:247.0  
##  Max.   :1496.0   Max.   :410.0   Max.   : 232.0   Max.   :440.0  
##                                                                   
##       bio8           biome    
##  Min.   :-20.0   1      :299  
##  1st Qu.:215.0   7      : 65  
##  Median :250.0   8      : 61  
##  Mean   :225.1   13     : 51  
##  3rd Qu.:262.0   2      : 49  
##  Max.   :323.0   4      : 33  
##                  (Other): 58

To visually investigate colinearity in the environmental data (at the presence and background points) you can use a pairs plot.

pairs(sdmdata[,2:5], cex=0.1)

Step 5: Picking models

A large number of algorithms has been used in species distribution modeling. They can be classified as ‘profile’, ‘regression’, and ‘machine learning’ methods. Profile methods only consider ‘presence’ data, not absence nor background data. Regression and machine learning methods use both presence and absence or background data. The distinction between regression and machine learning methods is not sharp, but it is perhaps still useful as way to classify models. Another distinction that one can make is between presence-only and presence-absence models. Profile methods are always presence-only, other methods can be either, depending if they are used with survey-absence or with pseudo-absence/background data. An entirely different class of models consists of models that only, or primarily, use the geographic location of known occurrences, and do not rely on the values of predictor variables at these locations. We refer to these models as ‘geographic models’.

Figure 2. A hierarchical structure of the 12 SDM algorithms commonly used. (Pecchi et al., 2019)

To make a decision on which algorithm to use, you must first know the type of data we have.

  1. Presence only (e.g. BIOCLIM)

  2. Presence/absence (e.g. ANN)

  3. Presence/pseudo-absence (e.g. GARP)

  4. Presence/environment(background) (e.g. Maxent).

Platforms with multiple algorithms:

Biomod: http://cran.r-project.org/web/packages/biomod2/index.html

Dismo: http://cran.r-project.org/web/packages/dismo/index.html

OpenModeller: http://openmodeller.sourceforge.net/

ModEco: http://gis.ucmerced.edu/ModEco/

Div-GIS: http://www.diva-gis.org

Profile model:

Bioclim, Domain, and Mahal are profile methods.These are implemented in the {dismo} package, and the procedures to use these models are the same for all three.

Regression model:

We already learned classic regression modeling earlier in this class. It is very similar here. Models are fit using maximum likelihood and by allowing the model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value

Machine learning models:

Machine learning models are non-parametric flexible regression models. Methods include Artifical Neural Networks (ANN), Random Forests, Boosted Regression Trees, and Support Vector Machines. Through the dismo package you can also use the Maxent program, that implements the most widely used method (maxent) in species distribution modeling.

Example: Generalized Linear Models

A generalized linear model (GLM) is a generalization of ordinary least squares regression, which we’ve also looked at in an earlier class. Models are fit using maximum likelihood and by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. Depending on how a GLM is specified it can be equivalent to (multiple) linear regression, logistic regression or Poisson regression

In R, GLMs may be implemented using the glm() function, and the link function and error distribution are specified with the family argument. Examples include:

*family = binomial(link = “logit”)

*family = poisson(link = “log”)

Here we fit two basic glm models. The models need to be fit with presence and absence (background) data. With the exception of ‘maxent’, we cannot fit the model with a RasterStack and points. Instead, we need to extract the environmental data values ourselves, and fit the models with these values.

#first example
gm1 <- glm(pa ~ bio1 + bio5 + bio6 + bio7 + bio8 + bio12 + bio16 + bio17,
            family = binomial(link = "logit"), data=envtrain)
summary(gm1)
## 
## Call:
## glm(formula = pa ~ bio1 + bio5 + bio6 + bio7 + bio8 + bio12 + 
##     bio16 + bio17, family = binomial(link = "logit"), data = envtrain)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.73157  -0.48749  -0.23064  -0.05455   2.90207  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  2.4189152  1.6563006   1.460  0.14417   
## bio1         0.1083418  0.0609904   1.776  0.07567 . 
## bio5         0.1010893  0.2526065   0.400  0.68902   
## bio6        -0.1941921  0.2545272  -0.763  0.44549   
## bio7        -0.1875595  0.2530838  -0.741  0.45864   
## bio8        -0.0180964  0.0275002  -0.658  0.51051   
## bio12        0.0018953  0.0007034   2.695  0.00705 **
## bio16       -0.0019137  0.0014796  -1.293  0.19586   
## bio17       -0.0042036  0.0015431  -2.724  0.00645 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 596.69  on 892  degrees of freedom
## Residual deviance: 440.50  on 884  degrees of freedom
## AIC: 458.5
## 
## Number of Fisher Scoring iterations: 8
coef(gm1)
##  (Intercept)         bio1         bio5         bio6         bio7 
##  2.418915241  0.108341785  0.101089299 -0.194192095 -0.187559462 
##         bio8        bio12        bio16        bio17 
## -0.018096425  0.001895285 -0.001913743 -0.004203616
#second example
gm2 <- glm(pa ~ bio1+bio5 + bio6 + bio7 + bio8 + bio12 + bio16 + bio17,
            family = gaussian(link = "identity"), data=envtrain)
evaluate(testpres, testbackg, gm1)
## class          : ModelEvaluation 
## n presences    : 23 
## n absences     : 200 
## AUC            : 0.7938043 
## cor            : 0.2640135 
## max TPR+TNR at : -2.844012
ge2 <- evaluate(testpres, testbackg, gm2)
ge2
## class          : ModelEvaluation 
## n presences    : 23 
## n absences     : 200 
## AUC            : 0.7801087 
## cor            : 0.2952745 
## max TPR+TNR at : 0.05090658
#show as a plot graph
pg <- predict(predictors, gm2, ext=ext)
par(mfrow=c(1,2))
plot(pg, main='GLM/gaussian, raw values')
plot(wrld_simpl, add=TRUE, border='dark grey')
tr <- threshold(ge2, 'spec_sens')
plot(pg > tr, main='presence/absence')
plot(wrld_simpl, add=TRUE, border='dark grey')
points(pres_train, pch='+')
points(backg_train, pch='-', cex=0.25)

Step 6: Model fitting, prediction, evaluation

6.1 Model fitting

Model fitting is the heart of any SDM application. Many different algorithms are available (Elith et al., 2006), and often several algorithms are combined into ensemble models or several candidate models with different candidate predictor sets are averaged (Hastie, Tibshirani, and Friedman, 2009). The decisions on these matters should have been made during the conceptualisation phase. Important aspects to consider during the model fitting step are: how to deal with multicollinearity in the environmental data? How many variables should be included in the model (without overfitting) and how should we select these? Which model settings should be used? When multiple model algorithms or candidate models are fitted, how to select the final model or average the models? Do we need to test or correct for non-independence in the data (spatial or temporal autocorrelation, nested data)? If the goal is to derive binary predictions, which threshold should be used?

Model fitting is technically quite similar across the modeling methods that exist in R. Most methods take a formula identifying the dependent and independent variables, accompanied with a data.frame that holds these variables

m1 <- glm(pb ~ bio1 + bio5 + bio12, data=sdmdata)
class(m1)
## [1] "glm" "lm"
summary(m1)
## 
## Call:
## glm(formula = pb ~ bio1 + bio5 + bio12, data = sdmdata)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.95534  -0.23380  -0.09987   0.07831   0.88796  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.550e-01  1.093e-01   1.418 0.156593    
## bio1         1.621e-03  3.960e-04   4.093 4.84e-05 ***
## bio5        -1.614e-03  4.725e-04  -3.417 0.000676 ***
## bio12        1.140e-04  1.676e-05   6.806 2.39e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1220413)
## 
##     Null deviance: 94.156  on 615  degrees of freedom
## Residual deviance: 74.689  on 612  degrees of freedom
## AIC: 458.43
## 
## Number of Fisher Scoring iterations: 2
m2 = glm(pb ~ ., data=sdmdata)
m2
## 
## Call:  glm(formula = pb ~ ., data = sdmdata)
## 
## Coefficients:
## (Intercept)         bio1        bio12        bio16        bio17  
##   0.2644272   -0.0020029    0.0004336   -0.0006207   -0.0007843  
##        bio5         bio6         bio7         bio8       biome2  
##   0.0035996   -0.0025153   -0.0041593    0.0008508   -0.0899869  
##      biome3       biome4       biome5       biome7       biome8  
##  -0.1262125   -0.0699228   -0.0136612   -0.1993102    0.0057666  
##      biome9      biome10      biome11      biome12      biome13  
##   0.0465949    0.0055868   -0.2705639   -0.0450194    0.0839402  
##     biome14  
##  -0.1689054  
## 
## Degrees of Freedom: 615 Total (i.e. Null);  595 Residual
## Null Deviance:       94.16 
## Residual Deviance: 69.17     AIC: 445.1
bc <- bioclim(presvals[,c('bio1', 'bio5', 'bio12')])
class(bc)
## [1] "Bioclim"
## attr(,"package")
## [1] "dismo"
bc
## class    : Bioclim 
## 
## variables: bio1 bio5 bio12 
## 
## 
## presence points: 116 
##    bio1 bio5 bio12
## 1   263  338  1639
## 2   263  338  1639
## 3   253  329  3624
## 4   243  318  1693
## 5   243  318  1693
## 6   252  326  2501
## 7   240  317  1214
## 8   275  335  2259
## 9   271  327  2212
## 10  274  329  2233
##   (... ...  ...)
pairs(bc)

6.2 Model prediction

Different modeling methods return different type of ‘model’ objects (typically they have the same name as the modeling method used). All of these ‘model’ objects, irrespective of their exact class, can be used to with the predict function to make predictions for any combination of values of the independent variables.

bio1 = c(40, 150, 200)
bio5 = c(60, 115, 290)
bio12 = c(600, 1600, 1700)
pd = data.frame(cbind(bio1, bio5, bio12))
pd
##   bio1 bio5 bio12
## 1   40   60   600
## 2  150  115  1600
## 3  200  290  1700
predict(m1, pd)
##         1         2         3 
## 0.1914136 0.3949569 0.2048905
predict(bc, pd)

Making such predictions for a few environments can be very useful to explore and understand model predictions. For example, it can be used in the response function that creates response plots for each variable, with the other variables at their median value.

response(bc)

predictors <- stack(list.files(file.path(system.file(package="dismo"), 'ex'), pattern='grd$', full.names=TRUE ))
names(predictors)
## [1] "bio1"  "bio12" "bio16" "bio17" "bio5"  "bio6"  "bio7"  "bio8"  "biome"
p <- predict(predictors, m1)
plot(p)

6.3 Model evaluation

Most model types have different measures that can help to assess how well the model fits the data. Most statistics or machine learning texts will provide some details on these methods. For instance, for a GLM one can look at how much deviance is explained, whether there are patterns in the residuals, whether there are points with high leverage and so on. However, since many models are to be used for prediction, much evaluation is focused on how well the model predicts to points not used in model training.

What do we consider to evaluate a model?

Does the model seem sensible, ecologically?

Do the fitted functions (the shapes of the modeled relationships) make sense?

Do the predictions seem reasonable? (map them, and think about them)?

Are there any spatial patterns in model residuals?

Different measures can be used to evaluate the quality of a prediction, perhaps depending on the goal of the study. Many measures for evaluating models based on presence-absence or presence-only data are ‘threshold dependent’. That means that a threshold must be set first (e.g., 0.5, though 0.5 is rarely a sensible choice – e.g. see Lui et al., 2005). Predicted values above that threshold indicate a prediction of ‘presence’, and values below the threshold indicate ‘absence’. Some measures emphasize the weight of false absences; others give more weight to false presences.

Commonly used statistics that are threshold independent are the correlation coefficient and the Area Under the Receiver Operator Curve (AUROC, generally further abbreviated to AUC). AUC is a measure of rank-correlation. In unbiased data, a high AUC indicates that sites with high predicted suitability values tend to be areas of known presence and locations with lower model prediction values tend to be areas where the species is not known to be present (absent or a random point). An AUC score of 0.5 means that the model is as good as a random guess.

For a discussion on the use of AUC in the context of presence-only rather than presence/absence data.

Here we illustrate the computation of the correlation coefficient and AUC with two random variables. p (presence) has higher values, and represents the predicted value for 50 known cases (locations) where the species is present, and a (absence) has lower values, and represents the predicted value for 50 known cases (locations) where the species is absent.

p <- rnorm(50, mean=0.7, sd=0.3)
a <- rnorm(50, mean=0.4, sd=0.4)
par(mfrow=c(1, 2))
plot(sort(p), col='red', pch=21)
points(sort(a), col='blue', pch=24)
legend(1, 0.95 * max(a,p), c('presence', 'absence'),
          pch=c(21,24), col=c('red', 'blue'))
comb <- c(p,a)
group <- c(rep('presence', length(p)), rep('absence', length(a)))
boxplot(comb~group, col=c('blue', 'red'))

We created two variables with random normally distributed values, but with different mean and standard deviation. The two variables clearly have different distributions, and the values for ‘presence’ tend to be higher than for ‘absence’. Here is how you can compute the correlation coefficient and the AUC:

group = c(rep(1, length(p)), rep(0, length(a)))
cor.test(comb, group)$estimate
##       cor 
## 0.4224299
mv <- wilcox.test(p,a)
auc <- as.numeric(mv$statistic) / (length(p) * length(a))
auc
## [1] 0.7368

Below we show how you can compute these, and other statistics more conveniently, with the evaluate() function in the {dismo} package. See ?evaluate for information on additional evaluation measures that are available. ROC/AUC can also be computed with the {ROCR} package.

e <- evaluate(p=p, a=a)
class(e)
## [1] "ModelEvaluation"
## attr(,"package")
## [1] "dismo"
par(mfrow=c(1, 2))
density(e)
boxplot(e, col=c('blue', 'red'))

Back to some real data, presence-only in this case. We’ll divide the data in two random sets, one for training a Bioclim model, and one for evaluating the model.

samp <- sample(nrow(sdmdata), round(0.75 * nrow(sdmdata)))
traindata <- sdmdata[samp,]
traindata <- traindata[traindata[,1] == 1, 2:9]
testdata <- sdmdata[-samp,]
bc <- bioclim(traindata)
e <- evaluate(testdata[testdata==1,], testdata[testdata==0,], bc)
e
## class          : ModelEvaluation 
## n presences    : 25 
## n absences     : 129 
## AUC            : 0.756124 
## cor            : 0.2518402 
## max TPR+TNR at : 0.01088901
plot(e, 'ROC')

Let’s first create presence and background data.

pres <- sdmdata[sdmdata[,1] == 1, 2:9]
back <- sdmdata[sdmdata[,1] == 0, 2:9]

The background data will only be used for model testing and does not need to be partitioned. We now partition the data into 5 groups:

k <- 5
group <- kfold(pres, k)
group[1:10]
##  [1] 3 2 2 4 3 2 3 1 1 2
unique(group)
## [1] 3 2 4 1 5

Now we can fit and test our model five times. In each run, the records corresponding to one of the five groups is only used to evaluate the model, while the other four groups are only used to fit the model. The results are stored in a list called ‘e’.

e <- list()
for (i in 1:k) {
    train <- pres[group != i,]
    test <- pres[group == i,]
    bc <- bioclim(train)
    e[[i]] <- evaluate(p=test, a=back, bc)
}

We can extract several things from the objects in e, but let’s restrict ourselves to the AUC values and the “maximum of the sum of the sensitivity (true positive rate) and specificity (true negative rate)” threshold “spec_sens” (this is sometimes uses as a threshold for setting cells to presence or absence).

auc <- sapply(e, function(x){x@auc})
auc
## [1] 0.7800000 0.7756522 0.8173750 0.7817826 0.7738696
mean(auc)
## [1] 0.7857359
sapply( e, function(x){ threshold(x)['spec_sens'] } )
## $spec_sens
## [1] 0.04291075
## 
## $spec_sens
## [1] 0.08592151
## 
## $spec_sens
## [1] 0.01076957
## 
## $spec_sens
## [1] 0.04291075
## 
## $spec_sens
## [1] 0.03215806

The use of AUC in evaluating SDMs has been criticized (Lobo et al., 2008; Jiménez-Valverde, 2011). A particularly sticky problem is that the values of AUC vary with the spatial extent used to select background points. Generally, the larger that extent, the higher the AUC value. Therefore, AUC values are generally biased and cannot be directly compared. Hijmans (2012) suggests that one could remove “spatial sorting bias” (the difference between the distance from testing-presence to training-presence and the distance from testing-absence to training-presence points) through “point-wise distance sampling”.

file <- file.path(system.file(package="dismo"), "ex/bradypus.csv")
bradypus <- read.table(file,  header=TRUE,  sep=',')
bradypus <- bradypus[,-1]
presvals <- extract(predictors, bradypus)
set.seed(0)
backgr <- randomPoints(predictors, 500)
nr <- nrow(bradypus)
s <- sample(nr, 0.25 * nr)
pres_train <- bradypus[-s, ]
pres_test <- bradypus[s, ]
nr <- nrow(backgr)
set.seed(9)
s <- sample(nr, 0.25 * nr)
back_train <- backgr[-s, ]
back_test <- backgr[s, ]
sb <- ssb(pres_test, back_test, pres_train)
sb[,1] / sb[,2]
##         p 
## 0.1000638

sb[,1] / sb[,2] is an indicator of spatial sorting bias (SSB). If there is no SSB this value should be 1, in these data it is close to zero, indicating that SSB is very strong. Let’s create a subsample in which SSB is removed.

i <- pwdSample(pres_test, back_test, pres_train, n=1, tr=0.1)
pres_test_pwd <- pres_test[!is.na(i[,1]), ]
back_test_pwd <- back_test[na.omit(as.vector(i)), ]
sb2 <- ssb(pres_test_pwd, back_test_pwd, pres_train)
sb2[1]/ sb2[2]
## [1] 1.004106

Spatial sorting bias is much reduced now; notice how the AUC dropped!

bc <- bioclim(predictors, pres_train)
evaluate(bc, p=pres_test, a=back_test, x=predictors)
## class          : ModelEvaluation 
## n presences    : 29 
## n absences     : 125 
## AUC            : 0.757931 
## cor            : 0.2777298 
## max TPR+TNR at : 0.03438276
evaluate(bc, p=pres_test_pwd, a=back_test_pwd, x=predictors)
## class          : ModelEvaluation 
## n presences    : 19 
## n absences     : 19 
## AUC            : 0.5914127 
## cor            : 0.1487003 
## max TPR+TNR at : 0.09185402

References

  1. Hijmans RJ, Elith J (2012) Species distribution modeling with R. http://cran.r-project.org/web/packages/dismo/vignettes/dm.pdf.

  2. Elith, J., C.H. Graham, R.P. Anderson, M. Dudik, S. Ferrier, A. Guisan, R.J. Hijmans, F. Huettmann, J. Leathwick, A. Lehmann, J. Li, L.G. Lohmann, B. Loiselle, G. Manion, C. Moritz, M. Nakamura, Y. Nakazawa, J. McC. Overton, A.T. Peterson, S. Phillips, K. Richardson, R. Scachetti-Pereira, R. Schapire, J. Soberon, S. Williams, M. Wisz and N. Zimmerman, 2006. Novel methods improve prediction of species’ distributions from occurrence data. Ecography 29: 129-151.