Species distribution modeling (SDM) is also known by other names including climate envelope-modeling, habitat modeling, and (environmental or ecological) niche-modeling.
New technologies and emerging disciplines, such as biodiversity bioinformatics, have promoted the development of methodological approaches that allow generating useful information from incomplete data; such is the case of the modeling of the ecological niches of species and their geographical distributions.
What is the geographical distribution of a species? In the simplest sense, it is that species’ spatial arrangement on the planet, or the geographical description of where it lives.
Traditionally, the distribution of species has been represented in a cartographic way from the geo-referenced points on maps, without considering explicitly the local factors that determine that distribution.
SDMs are correlative methods, static and non mechanistic, that work by identifying the relationships that exist between variables associated with the local environment and observations of presence for the species.
To make the best use of SDMs it is very important know how they work, but equally important is to know the limitations of the data sources, algorithms, and the methodological approach in general.
Despite their limitations, SDMs are a useful tool to generate geographical hypotheses about present, past and future distribution of species.
In this module we provide a short introduction to species distribution modelling in R. This module will cover a brief overview of the concept of species distribution modelling and the basic approach to modelling in R.
Figure 1. Diagram of the Species Distribution Modelling procedure. (Rodríguez-Rey et al., 2019)
There is a wide variety of modeling algorithms (between 20 and 30), as well as a series of platforms where they have been implemented. The aim of SDM is to estimate the similarity of the conditions at any site to the conditions at the locations of known occurrence (and perhaps of non-occurrence) of a phenomenon. A common application of this method is to predict species ranges with climate data as predictors.
locations of occurrence of a species (or other phenomenon) are compiled
values of environmental predictor variables (such as climate) at these locations are extracted from spatial databases
the environmental values are used to fit a model to estimate similarity to the sites of occurrence, or another measure such as abundance of the species
The model is used to predict the variable of interest across the region of interest (and perhaps for a future or past climate).
Figure 2: Uses of SDMs (Araújo et al., 2019).
Figure 3. An overlay of suitable areas for the five tree-cavity-nesters’ distribution models with protected areas. (Moradi et al., 2019)
Notes: Before you begin you will need to install and load the following packages and their dependencies:
{dismo} (=“distribution modeling”), and {raster} (=“geographic data analysis and modelling”), and each one of them has a set of functions that enable the modelling of species distributions in geographic and spatial contexts.
You need to collect a sufficient number of occurrence records (and absence or abundance, if possible) of the species of interest. You also need to have accurate and relevant environmental data (predictor variables) at a sufficiently high spatial resolution.
Importing occurrence data into R is easy. But collecting, georeferencing, and cross-checking coordinate data is tedious…
You can upload our data files like we did in previous modules, using the {curl} package.
For example, we can load a datast of occurrences of blue-eyed grass, Sisyrinchium angustifolium, like so:
library (curl)
#install package curl before this step to load a file from server.
library(knitr)
# to make nice data table.
f <- curl("https://raw.githubusercontent.com/feiyawang1207/AN597_groupproject/master/S_angustifolium.csv")
#load the file in varibale f
d <- read.csv(f, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# read file in d as a dataframe.
kable(head(d))
gbifID | datasetKey | occurrenceID | kingdom | phylum | class | order | family | genus | species | infraspecificEpithet | taxonRank | scientificName | verbatimScientificName | verbatimScientificNameAuthorship | countryCode | locality | stateProvince | occurrenceStatus | individualCount | publishingOrgKey | decimalLatitude | decimalLongitude | coordinateUncertaintyInMeters | coordinatePrecision | elevation | elevationAccuracy | depth | depthAccuracy | eventDate | day | month | year | taxonKey | speciesKey | basisOfRecord | institutionCode | collectionCode | catalogNumber | recordNumber | identifiedBy | dateIdentified | license | rightsHolder | recordedBy | typeStatus | establishmentMeans | lastInterpreted | mediaType | issue |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2451752339 | 50c9509d-22c7-4a22-a47d-8c48425ef4a7 | https://www.inaturalist.org/observations/35412543 | Plantae | Tracheophyta | Magnoliopsida | Solanales | Solanaceae | Solanum | Solanum angustifolium | NA | SPECIES | Solanum rostratum Dun. | Solanum rostratum | MX | Michoacán | NA | 28eb1a3f-1c15-4a95-931a-4af90ecb574d | 19.07638 | -102.88527 | 22 | NA | NA | NA | NA | NA | 2019-11-07T09:40:48Z | 7 | 11 | 2019 | 2930922 | 2929800 | HUMAN_OBSERVATION | iNaturalist | Observations | 35412543 | 2019-11-07T16:03:48Z | CC_BY_NC_4_0 | Jesús Martínez | Jesús Martínez | 2019-11-15T13:12:19.217Z | STILLIMAGE | GEODETIC_DATUM_ASSUMED_WGS84;COORDINATE_ROUNDED | |||||||
2451721667 | 50c9509d-22c7-4a22-a47d-8c48425ef4a7 | https://www.inaturalist.org/observations/35294114 | Plantae | Tracheophyta | Magnoliopsida | Solanales | Solanaceae | Solanum | Solanum angustifolium | NA | SPECIES | Solanum rostratum Dun. | Solanum rostratum | MX | Jalisco | NA | 28eb1a3f-1c15-4a95-931a-4af90ecb574d | 21.36370 | -101.94901 | 4 | NA | NA | NA | NA | NA | 2019-11-03T14:32:00Z | 3 | 11 | 2019 | 2930922 | 2929800 | HUMAN_OBSERVATION | iNaturalist | Observations | 35294114 | 2019-11-04T18:09:06Z | CC_BY_NC_4_0 | Christian Martinez | Christian Martinez | 2019-11-15T13:11:55.564Z | STILLIMAGE | GEODETIC_DATUM_ASSUMED_WGS84;COORDINATE_ROUNDED | |||||||
2451686769 | 50c9509d-22c7-4a22-a47d-8c48425ef4a7 | https://www.inaturalist.org/observations/35175230 | Plantae | Tracheophyta | Magnoliopsida | Solanales | Solanaceae | Solanum | Solanum angustifolium | NA | SPECIES | Solanum rostratum Dun. | Solanum rostratum | US | Texas | NA | 28eb1a3f-1c15-4a95-931a-4af90ecb574d | 32.69588 | -102.25799 | 8 | NA | NA | NA | NA | NA | 2019-11-01T15:15:39Z | 1 | 11 | 2019 | 2930922 | 2929800 | HUMAN_OBSERVATION | iNaturalist | Observations | 35175230 | 2019-11-01T20:17:03Z | CC_BY_NC_4_0 | Nathan Taylor | Nathan Taylor | 2019-11-15T13:11:23Z | STILLIMAGE | GEODETIC_DATUM_ASSUMED_WGS84 | |||||||
2451656431 | 50c9509d-22c7-4a22-a47d-8c48425ef4a7 | https://www.inaturalist.org/observations/35013197 | Plantae | Tracheophyta | Magnoliopsida | Solanales | Solanaceae | Solanum | Solanum angustifolium | NA | SPECIES | Solanum rostratum Dun. | Solanum rostratum | US | Texas | NA | 28eb1a3f-1c15-4a95-931a-4af90ecb574d | 30.22526 | -99.01538 | 6 | NA | NA | NA | NA | NA | 2019-10-28T11:53:22Z | 28 | 10 | 2019 | 2930922 | 2929800 | HUMAN_OBSERVATION | iNaturalist | Observations | 35013197 | 2019-10-28T16:53:40Z | CC_BY_NC_4_0 | fbgjwh | fbgjwh | 2019-11-15T13:11:04.858Z | STILLIMAGE | GEODETIC_DATUM_ASSUMED_WGS84;COORDINATE_ROUNDED | |||||||
2451601084 | 50c9509d-22c7-4a22-a47d-8c48425ef4a7 | https://www.inaturalist.org/observations/34881947 | Plantae | Tracheophyta | Magnoliopsida | Solanales | Solanaceae | Solanum | Solanum angustifolium | NA | SPECIES | Solanum rostratum Dun. | Solanum rostratum | US | Arkansas | NA | 28eb1a3f-1c15-4a95-931a-4af90ecb574d | 36.44975 | -94.59106 | 10 | NA | NA | NA | NA | NA | 2019-08-21T13:59:00Z | 21 | 8 | 2019 | 2930922 | 2929800 | HUMAN_OBSERVATION | iNaturalist | Observations | 34881947 | 2019-10-25T19:01:17Z | CC_BY_NC_4_0 | jim_keesling | jim_keesling | 2019-11-15T13:10:54.246Z | STILLIMAGE | GEODETIC_DATUM_ASSUMED_WGS84 | |||||||
2451556298 | 1060a696-f6a2-4341-92ef-47a63babbea1 | 7526a2b2-b355-492e-9d57-7de5519acd6e | Plantae | Tracheophyta | Magnoliopsida | Solanales | Solanaceae | Solanum | Solanum angustifolium | NA | SPECIES | Solanum rostratum Dun. | Solanum rostratum | Dunal | US | by Hwy 30 c. 10 mi W of Sidney | Nebraska | NA | 75bf5fec-40fa-43ae-919f-c43bd7ce25c9 | NA | NA | NA | NA | NA | NA | NA | NA | 1945-08-21T00:00:00Z | 21 | 8 | 1945 | 2930922 | 2929800 | PRESERVED_SPECIMEN | SJSU | 12374 | 5425 | CC_BY_4_0 | C.W. Sharsmith | 2019-11-14T16:31:00.343Z |
If you do not have any species distribution data you can get started by downloading data from the Global Biodiversity Inventory Facility (GBIF) There you can find 1.354.603.236 occurrence records!
To import data from a database like this, we can use a function called gbif ()
to download files from the GBIF database.
For now, we are going to download the occurrence data from a potato species. Solanum acaule is widespread and common in upland habitats from northern Peru (Dept. Cajamarca), south through Bolivia to northern Argentina (Prov. San Juan), and with one record in northern Chile (Antofagasta Region)
Figure 4: Picture of Solanum acaule
library(dismo)
## Loading required package: raster
## Loading required package: sp
library(knitr)
acaule <- gbif("solanum", "acaule*", geo=FALSE)
## Loading required namespace: jsonlite
## 7229 records found
## 0-300-600-900-1200-1500-1800-2100-2400-2700-3000-3300-3600-3900-4200-4500-4800-5100-5400-5700-6000-6300-6600-6900-7200-7229 records downloaded
#load the saved S. acaule data from data base
kable(head(acaule))
acceptedScientificName | acceptedTaxonKey | accessRights | adm1 | adm2 | associatedReferences | basisOfRecord | bibliographicCitation | catalogNumber | class | classKey | cloc | collectionCode | collectionID | continent | coordinateUncertaintyInMeters | country | crawlId | datasetID | datasetKey | datasetName | dateIdentified | day | depth | depthAccuracy | disposition | dynamicProperties | elevation | elevationAccuracy | endDayOfYear | establishmentMeans | eventDate | eventID | eventRemarks | eventTime | family | familyKey | fieldNotes | fieldNumber | fullCountry | gbifID | genericName | genus | genusKey | geodeticDatum | georeferenceProtocol | georeferenceRemarks | georeferenceSources | habitat | higherClassification | higherGeography | higherGeographyID | http://unknown.org/nick | http://unknown.org/occurrenceDetails | identificationID | identificationRemarks | identifiedBy | identifier | individualCount | infraspecificEpithet | installationKey | institutionCode | institutionID | ISO2 | key | kingdom | kingdomKey | language | lastCrawled | lastInterpreted | lastParsed | lat | license | locality | lon | materialSampleID | modified | month | municipality | nomenclaturalCode | nomenclaturalStatus | occurrenceID | occurrenceRemarks | order | orderKey | organismID | organismQuantity | organismQuantityType | organismRemarks | originalNameUsage | otherCatalogNumbers | ownerInstitutionCode | parentEventID | phylum | phylumKey | preparations | previousIdentifications | protocol | publishingCountry | publishingOrgKey | recordedBy | recordNumber | references | rights | rightsHolder | samplingProtocol | scientificName | scientificNameID | species | speciesKey | specificEpithet | startDayOfYear | taxonID | taxonKey | taxonomicStatus | taxonRank | taxonRemarks | type | typeStatus | typifiedName | verbatimCoordinateSystem | verbatimElevation | verbatimEventDate | verbatimLocality | vernacularName | year | downloadDate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Solanum acaule Bitter | 2931061 | NA | Cusco | NA | NA | HUMAN_OBSERVATION | NA | 22332967 | Magnoliopsida | 220 | Cusco, Peru | Observations | NA | NA | 300 | Peru | 190 | NA | 50c9509d-22c7-4a22-a47d-8c48425ef4a7 | iNaturalist research-grade observations | 2019-04-30T14:41:30 | 26 | NA | NA | NA | NA | NA | NA | NA | NA | 2019-03-26T14:30:00 | NA | NA | 18:30:00Z | Solanaceae | 7717 | NA | NA | Peru | 2238787932 | Solanum | Solanum | 2928997 | WGS84 | NA | NA | NA | NA | NA | NA | NA | cstobie | https://www.inaturalist.org/observations/22332967 | 52791525 | NA | NA | 22332967 | NA | NA | 997448a8-f762-11e1-a439-00145eb45e9a | iNaturalist | NA | PE | 2238787932 | Plantae | 6 | NA | 2019-11-29T14:10:16.861+0000 | 2019-11-29T15:09:07.641+0000 | 2019-11-29T15:09:07.641+0000 | -15.047820 | http://creativecommons.org/licenses/by-nc/4.0/legalcode | NA | -71.13188 | NA | 2019-05-20T17:10:31.000+0000 | 3 | NA | NA | NA | https://www.inaturalist.org/observations/22332967 | NA | Solanales | 1176 | NA | NA | NA | NA | NA | NA | NA | NA | Tracheophyta | 7707728 | NA | NA | DWC_ARCHIVE | US | 28eb1a3f-1c15-4a95-931a-4af90ecb574d | cstobie | NA | https://www.inaturalist.org/observations/22332967 | © cstobie some rights reserved | cstobie | NA | Solanum acaule Bitter | NA | Solanum acaule | 2931061 | acaule | NA | 886633 | 2931061 | ACCEPTED | SPECIES | NA | NA | NA | NA | NA | NA | 2019/03/26 2:30 PM EDT | Espinar, Perú | NA | 2019 | 2019-12-04 |
Solanum acaule Bitter | 2931061 | NA | Catamarca | Ambato | NA | PRESERVED_SPECIMEN | NA | CORD00056696 | Magnoliopsida | 220 | Camino desde El Rodeo rumbo a Primer Campo. Puesto Vergara, Primer Campo, alrededores de la Casa de Piedra, Catamarca, Ambato, Argentina, SOUTH_AMERICA | NA | NA | SOUTH_AMERICA | NA | Argentina | 15 | NA | ab61ad8f-953f-4c7c-80b2-e1fb50fd8187 | NA | 2016-01-01T00:00:00 | 24 | NA | NA | NA | NA | NA | NA | NA | NATIVE | 2016-02-24T00:00:00 | NA | NA | NA | Solanaceae | 7717 | NA | NA | Argentina | 2252415409 | Solanum | Solanum | 2928997 | WGS84 | NA | Puntual | GPS Ensign - Datum WGS84 | NA | NA | Sudamérica | Argentina | Catamarca | NA | NA | NA | NA | NA | Chiarini, F. | CORD-44463984-f4cc-49ec-8911-2de91f216c2b-AR | NA | NA | 4fd221cd-129c-404f-ab62-fdd2377d12ac | CORD | NA | AR | 2252415409 | Plantae | 6 | NA | 2019-11-05T11:30:29.242+0000 | 2019-11-13T15:15:57.223+0000 | 2019-11-13T15:15:57.223+0000 | -28.179167 | http://creativecommons.org/licenses/by/4.0/legalcode | Camino desde El Rodeo rumbo a Primer Campo. Puesto Vergara, Primer Campo, alrededores de la Casa de Piedra | -65.99250 | NA | NA | 2 | NA | NA | NA | CORD-44463984-f4cc-49ec-8911-2de91f216c2b-AR | NA | Solanales | 1176 | NA | NA | NA | NA | NA | NA | NA | NA | Tracheophyta | 7707728 | NA | Solanaceae | Solanum acaule Bitter | Chiarini, F. | 2016 | DWC_ARCHIVE | AR | 7a6bdf66-ef5c-4a81-b731-2e328f4881eb | Barboza, G. E. | Cantero, J. J. & Chiarini, F. | 4669 | NA | NA | NA | NA | Solanum acaule Bitter | NA | Solanum acaule | 2931061 | acaule | NA | NA | 2931061 | ACCEPTED | SPECIES | NA | NA | NA | NA | NA | 3513 | 24/02/2016 | NA | NA | 2016 | 2019-12-04 |
Solanum acaule Bitter | 2931061 | NA | Lima | NA | NA | PRESERVED_SPECIMEN | NA | BM001134709 | Magnoliopsida | 220 | Huaroz district, 37 km from canta on road to La Viuda, Lima, Peru, SOUTH_AMERICA | BOT | NA | SOUTH_AMERICA | NA | Peru | 142 | NA | 7e380070-f762-11e1-a439-00145eb45e9a | NA | NA | 28 | NA | NA | NA | {“determinationfiledas”: “Yes”, “collectionkind”: “Sheet”, “gbifissue”: [“GEODETIC_DATUM_ASSUMED_WGS84”], “created”: 1415984060000, “associatedmediacount”: 1, “project”: “Picturae”, “determinationnames”: “Solanum acaule Bitter”, “plantdescription”: “herb to .1m high; infrequent, flowers blue/purple, without interstitial leaflet; collection time: 11:42am; mixed shrubland with dominance of Senecio collinus and Lupinus brachyprennon”, “subdepartment”: “Flowering Plants”, “gbifid”: 1056708863} | 4052 | NA | NA | NA | 2014-02-28T00:00:00 | NA | NA | NA | Solanaceae | 7717 | NA | 2893 | Peru | 1056708863 | Solanum | Solanum | 2928997 | WGS84 | NA | NA | NA | NA | NA | South America; Peru; Lima; Canta | NA | NA | NA | NA | NA | Sandra (Sandy) Diane Knapp | 4338577 | NA | NA | 60277b56-f762-11e1-a439-00145eb45e9a | NHMUK | NA | PE | 1056708863 | Plantae | 6 | NA | 2019-11-29T15:42:43.569+0000 | 2019-11-29T16:08:04.040+0000 | 2019-11-29T16:08:04.040+0000 | -11.387433 | http://creativecommons.org/publicdomain/zero/1.0/legalcode | Huaroz district, 37 km from canta on road to La Viuda | -76.47753 | NA | NA | 2 | NA | NA | NA | 2bc0e320-b403-43b0-bede-334814a339b1 | NA | Solanales | 1176 | NA | NA | NA | NA | NA | NHMUK:ecatalogue:4338577 | NA | NA | Tracheophyta | 7707728 | NA | NA | DWC_ARCHIVE | GB | 19456090-b49a-11d8-abeb-b8a03c50a862 | Paúl Gonzáles, Mindy M. Syfert, Dr Erica McAlister, D Whitmore, Sandra (Sandy) Diane Knapp | 2893 | NA | NA | NA | NA | Solanum acaule Bitter | NA | Solanum acaule | 2931061 | acaule | NA | NA | 2931061 | ACCEPTED | SPECIES | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2014 | 2019-12-04 |
Solanum acaule Bitter | 2931061 | NA | Lima | Canta | NA | PRESERVED_SPECIMEN | http://www.tropicos.org/Specimen/100991187 | 100991187 | Magnoliopsida | 220 | Huaroz district, 37 km from Canta on road to La Viuda., Lima, Canta, Peru, SOUTH_AMERICA | MO | http://biocol.org/urn:lsid:biocol.org:col:15859 | SOUTH_AMERICA | NA | Peru | 95 | NA | 7bd65a7a-f762-11e1-a439-00145eb45e9a | Tropicos | NA | 28 | NA | NA | NA | NA | 4052 | NA | NA | NA | 2014-02-28T00:00:00 | NA | NA | NA | Solanaceae | 7717 | NA | NA | Peru | 2268891108 | Solanum | Solanum | 2928997 | WGS84 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | urn:catalog:MO:Tropicos:100991187 | 1 | NA | c4e134d3-c68f-406d-a116-6a37895602e0 | MO | NA | PE | 2268891108 | Plantae | 6 | NA | 2019-11-29T07:20:04.818+0000 | 2019-11-29T07:48:48.952+0000 | 2019-11-29T07:48:48.952+0000 | -11.387500 | http://creativecommons.org/licenses/by/4.0/legalcode | Huaroz district, 37 km from Canta on road to La Viuda. | -76.47750 | NA | 2017-08-08T14:37:00.000+0000 | 2 | NA | ICNafp | No opinion | urn:catalog:MO:Tropicos:100991187 | NA | Solanales | 1176 | NA | NA | NA | NA | NA | NA | MOBOT | NA | Tracheophyta | 7707728 | NA | NA | DWC_ARCHIVE | US | 90fd6680-349f-11d8-aa2d-b8a03c50a862 | Paúl Gonzáles|M. Syfert|E. McAlister|D. Whitmore|Sandy Knapp | Gonzáles Arce 2893 | NA | NA | Missouri Botanical Garden | NA | Solanum acaule Bitter | NA | Solanum acaule | 2931061 | acaule | NA | 29600091 | 2931061 | ACCEPTED | SPECIES | NA | PhysicalObject | NA | NA | NA | NA | NA | NA | NA | 2014 | 2019-12-04 |
Solanum acaule Bitter | 2931061 | NA | Lima | Oyon | NA | PRESERVED_SPECIMEN | http://www.tropicos.org/Specimen/100991022 | 100991022 | Magnoliopsida | 220 | Oyón district, 1.7 km up to Quichas on road up to Cordillera Raura., Lima, Oyon, Peru, SOUTH_AMERICA | MO | http://biocol.org/urn:lsid:biocol.org:col:15859 | SOUTH_AMERICA | NA | Peru | 95 | NA | 7bd65a7a-f762-11e1-a439-00145eb45e9a | Tropicos | NA | 6 | NA | NA | NA | NA | 4134 | NA | NA | NA | 2014-03-06T00:00:00 | NA | NA | NA | Solanaceae | 7717 | NA | NA | Peru | 2268891573 | Solanum | Solanum | 2928997 | WGS84 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | urn:catalog:MO:Tropicos:100991022 | 1 | NA | c4e134d3-c68f-406d-a116-6a37895602e0 | MO | NA | PE | 2268891573 | Plantae | 6 | NA | 2019-11-29T07:20:04.818+0000 | 2019-11-29T07:48:54.267+0000 | 2019-11-29T07:48:54.267+0000 | -10.558056 | http://creativecommons.org/licenses/by/4.0/legalcode | Oyón district, 1.7 km up to Quichas on road up to Cordillera Raura. | -76.77194 | NA | 2017-08-07T14:49:00.000+0000 | 3 | NA | ICNafp | No opinion | urn:catalog:MO:Tropicos:100991022 | NA | Solanales | 1176 | NA | NA | NA | NA | NA | NA | MOBOT | NA | Tracheophyta | 7707728 | NA | NA | DWC_ARCHIVE | US | 90fd6680-349f-11d8-aa2d-b8a03c50a862 | Paúl Gonzáles|M. Syfert|E. McAlister|D. Whitmore | Gonzáles Arce 2961 | NA | NA | Missouri Botanical Garden | NA | Solanum acaule Bitter | NA | Solanum acaule | 2931061 | acaule | NA | 29600091 | 2931061 | ACCEPTED | SPECIES | NA | PhysicalObject | NA | NA | NA | NA | NA | NA | NA | 2014 | 2019-12-04 |
Solanum acaule Bitter | 2931061 | NA | Ancash | NA | NA | PRESERVED_SPECIMEN | NA | BM001120832 | Magnoliopsida | 220 | between km55-57 on rd from Yanac to Sihuas, Ancash, Peru, SOUTH_AMERICA | BOT | NA | SOUTH_AMERICA | NA | Peru | 142 | NA | 7e380070-f762-11e1-a439-00145eb45e9a | NA | NA | 19 | NA | NA | NA | {“determinationfiledas”: “Yes”, “collectionkind”: “Sheet”, “gbifissue”: [“GEODETIC_DATUM_ASSUMED_WGS84”], “created”: 1415986484000, “associatedmediacount”: 1, “project”: “Picturae”, “determinationnames”: “Solanum acaule Bitter”, “plantdescription”: “Rosette forming herb; corolla purple; fruits green, turning purple-black; growing along moist muddy areas in puna vegetation close to Polylepis forest”, “subdepartment”: “Flowering Plants”, “gbifid”: 1055949175} | 3955 | NA | NA | NA | 2013-05-19T00:00:00 | NA | NA | NA | Solanaceae | 7717 | NA | 4706 | Peru | 1055949175 | Solanum | Solanum | 2928997 | WGS84 | NA | NA | NA | NA | NA | South America; Peru; Ancash; Corongo | NA | NA | NA | NA | NA | Tiina Särkinen | 4339588 | NA | NA | 60277b56-f762-11e1-a439-00145eb45e9a | NHMUK | NA | PE | 1055949175 | Plantae | 6 | NA | 2019-11-29T15:42:43.569+0000 | 2019-11-29T16:08:20.122+0000 | 2019-11-29T16:08:20.122+0000 | -8.571389 | http://creativecommons.org/publicdomain/zero/1.0/legalcode | between km55-57 on rd from Yanac to Sihuas | -77.71803 | NA | NA | 5 | NA | NA | NA | c3a7d94e-d9d4-469e-b724-a85252512ed1 | NA | Solanales | 1176 | NA | NA | NA | NA | NA | NHMUK:ecatalogue:4339588 | NA | NA | Tracheophyta | 7707728 | NA | NA | DWC_ARCHIVE | GB | 19456090-b49a-11d8-abeb-b8a03c50a862 | Tiina Särkinen, H. Maria Baden, Dr Erica McAlister, Diana M. Percy | 4706 | NA | NA | NA | NA | Solanum acaule Bitter | NA | Solanum acaule | 2931061 | acaule | NA | NA | 2931061 | ACCEPTED | SPECIES | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2013 | 2019-12-04 |
If you want to check how many rows and colums the data frame you have, we can use dim ()
to show the dimensions of the dataset.
colrow<-dim(acaule)
# use dim() to get dimension of our dataset
colrow[1]
## [1] 7229
#show row numbers
y<-colrow[2]
#show column number
In order to draw a map, we need to find the location by longitude and latitude from our dataset.
Here, we’re subsetting the data to include only those that have longitude and latitude. Many occurence records may not have geographic coordinates in your dataset, so you will need to remove those NA
values from your dataframe. We’ll use the is.na()
command to assess whether the data have both longitude and latitude data, and then subset them into a new data frame.
acgeo <- subset(acaule, !is.na(lon) & !is.na(lat))
#use subset() to seperate those data
#datatable(head(acgeo))
#showing the example data see that have longtitude and latitude data
dim(acgeo)
## [1] 4643 127
#showing the rows and columns for data frame
acgeo[1:4, c(1:5,7:10)] #removing locality
## acceptedScientificName acceptedTaxonKey accessRights adm1 adm2
## 1 Solanum acaule Bitter 2931061 <NA> Cusco <NA>
## 2 Solanum acaule Bitter 2931061 <NA> Catamarca Ambato
## 3 Solanum acaule Bitter 2931061 <NA> Lima <NA>
## 4 Solanum acaule Bitter 2931061 <NA> Lima Canta
## basisOfRecord bibliographicCitation
## 1 HUMAN_OBSERVATION <NA>
## 2 PRESERVED_SPECIMEN <NA>
## 3 PRESERVED_SPECIMEN <NA>
## 4 PRESERVED_SPECIMEN http://www.tropicos.org/Specimen/100991187
## catalogNumber class
## 1 22332967 Magnoliopsida
## 2 CORD00056696 Magnoliopsida
## 3 BM001134709 Magnoliopsida
## 4 100991187 Magnoliopsida
In order to make a map of our subset of Solanum acaule data that have locations, we’ll use the package {maptools}.
First, we need to draw an empty map: a variable called wrld_simpl! will give us an empty world-wide map. The wrld_simpl
dataset contains rough country outlines.
library(maptools)
## Checking rgeos availability: TRUE
data(wrld_simpl)
x<-plot(wrld_simpl, xlim=c(-80,70), ylim=c(-80,30), axes=TRUE)
# empty map
Now we need to add our data points. It is important to make such maps to make sure that the points are, at least roughly, in the right location:
library(maptools)
data(wrld_simpl)
plot(wrld_simpl, xlim=c(-80,70), ylim=c(-80,30), axes=TRUE, col="light yellow")
# restore the box around the map
box()
# add the points
points(acgeo$lon, acgeo$lat, col='orange', pch=20, cex=0.75)
# plot points again to add a border, for better visibility
points(acgeo$lon, acgeo$lat, col='red', cex=0.75)
r <- raster(system.file("external/test.grd", package="raster"))
Data ‘cleaning’ is particularly important for data sourced from species distribution data warehouses such as GBIF. As we mentioned, we already know that Solanum acaule is a species that occurs in the higher parts of the Andes mountains of southern Peru, Bolivia and northern Argentina. Do we see this in our map?
We imported more than seven thousand records from GBIF, but not all of these are usable because of various issues. This species is distributed mainly in the Andes mountains of southern Peru, Bolivia and northern Argentina, but as you can see in the previous map there are records in other countries and even in Antartica! In order to clean this data we will need to:
avoid false identifications
use only specimens collected in a period relevant to the climate data we will be utilizing
have reliable locational certainty (the cleaning will depend of the species we are modelling and the data availability.
The point in Brazil (record acaule[98,]) should actually be in southern Bolivia, so this is probably a typo in the longitude value.
There are also three records that have plausible latitudes, but longitudes that are clearly wrong, as they are in the Atlantic Ocean, south of West Africa! It looks like they have a longitude that is zero. In many data-bases you will find values that are ‘zero’ where ‘no data’ was intended. The gbif function (when using the default arguments) sets coordinates that are (0, 0) to NA
, but not if only one of the coordinates is listed as zero. Let’s see if we find these by searching for records with longitudes of zero.
lonzero = subset(acgeo, lon==0)
lonzero[, 1:13]
## [1] acceptedScientificName acceptedTaxonKey accessRights
## [4] adm1 adm2 associatedReferences
## [7] basisOfRecord bibliographicCitation catalogNumber
## [10] class classKey cloc
## [13] collectionCode
## <0 rows> (or 0-length row.names)
The records might also come from very different sources, sometimes from museum collections or zoos. Be sure to choose only the kind of records you really want:
unique(acaule$basisOfRecord)#if you want to know origin of your records.
## [1] "HUMAN_OBSERVATION" "PRESERVED_SPECIMEN" "LIVING_SPECIMEN"
## [4] "UNKNOWN"
#You can either decide to work with specimens stored only in scientific collections or include all records in your analysis.
acaule1 <- subset(acaule,
basisOfRecord=="specimen" |
basisOfRecord=="unknown" |
basisOfRecord=="living"
)
nrow(acaule1)
## [1] 0
Many databases share data, so you also risk including duplicate records. The next function will help you to remove duplicates.
#which records are duplicates (only for the first 10 columns)?
dups <- duplicated(lonzero[, 1:10])
#remove duplicates
lonzero <- lonzero[dups, ]
lonzero[,1:13]
## [1] acceptedScientificName acceptedTaxonKey accessRights
## [4] adm1 adm2 associatedReferences
## [7] basisOfRecord bibliographicCitation catalogNumber
## [10] class classKey cloc
## [13] collectionCode
## <0 rows> (or 0-length row.names)
Another approach might be to detect duplicates for the same species and some coordinates in the data, even if the records were from collections by different people or in different years (in our case, using species is redundant as we have data for only one species).
# differentiating by (sub) species
# dups2 <- duplicated(acgeo[, c('species', 'lon', 'lat')])
# ignoring (sub) species and other naming variation
dups2 <- duplicated(acgeo[, c('lon', 'lat')])
# number of duplicates
sum(dups2)
## [1] 2345
## [1] 483
# keep the records that are _not_ duplicated
acg <- acgeo[!dups2, ]
Let’s repatriate the records near Pakistan to Argentina, and remove the records in Brazil, Antarctica, and with a longitude of ‘0’.
i <- acg$lon > 0 & acg$lat > 0
acg$lon[i] <- -1 * acg$lon[i]
acg$lat[i] <- -1 * acg$lat[i]
acg <- acg[acg$lon < -50 & acg$lat > -50, ]
It is important to cross-check coordinates by visual and other means. One approach is to compare the country (and lower level administrative subdivisions) of the site as specified by the records, with the country implied by the coordinates (Hijmans et al., 1999).
We should ask these two questions:
Which points (identified by their record numbers) do not match any country (that is, they are in an ocean)?
Which points have coordinates that are in a different country than listed in the ‘country’ field of the GBIF record
In the example below we use the coordinates()
function from the sp
package to create a spatial points data frame, and then the over()
function, also from sp
, to do a point-in-polygon query with the named countries polygons.
library(sp)
coordinates(acg) <- ~lon+lat
crs(acg) <- crs(wrld_simpl)
class(acg)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
class(wrld_simpl)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
ovr <- over(acg, wrld_simpl)
head(ovr)
## FIPS ISO2 ISO3 UN NAME AREA POP2005 REGION SUBREGION LON
## 1 PE PE PER 604 Peru 128000 27274266 19 5 -75.552
## 2 AR AR ARG 32 Argentina 273669 38747148 19 5 -65.167
## 3 PE PE PER 604 Peru 128000 27274266 19 5 -75.552
## 4 PE PE PER 604 Peru 128000 27274266 19 5 -75.552
## 5 PE PE PER 604 Peru 128000 27274266 19 5 -75.552
## 6 PE PE PER 604 Peru 128000 27274266 19 5 -75.552
## LAT
## 1 -9.326
## 2 -35.377
## 3 -9.326
## 4 -9.326
## 5 -9.326
## 6 -9.326
cntr <- ovr$NAME
i <- which(is.na(cntr))
i
## integer(0)
## integer(0)
j <- which(cntr != acg$country)
# for the mismatches, bind the country names of the polygons and points
cbind(cntr, acg$country)[j,]
## cntr
## [1,] "27" "Argentina"
## [2,] "27" "Argentina"
## [3,] "27" "Argentina"
## [4,] "172" "Bolivia"
## [5,] "172" "Bolivia"
## [6,] "172" "Bolivia"
## [7,] "27" "Argentina"
## [8,] "27" "Argentina"
## [9,] "172" "Bolivia"
## [10,] "172" "Bolivia"
## [11,] "172" "Bolivia"
## [12,] "172" "Bolivia"
## [13,] "172" "Bolivia"
## [14,] "172" "Bolivia"
## [15,] "172" "Bolivia"
## [16,] "172" "Bolivia"
## [17,] "172" "Bolivia"
## [18,] "172" "Bolivia"
## [19,] "172" "Bolivia"
## [20,] "172" "Bolivia"
## [21,] "172" "Bolivia"
## [22,] "172" "Bolivia"
## [23,] "172" "Bolivia"
## [24,] "172" "Bolivia"
## [25,] "172" "Bolivia"
## [26,] "172" "Bolivia"
## [27,] "172" "Bolivia"
## [28,] "172" "Bolivia"
## [29,] "172" "Bolivia"
## [30,] "172" "Bolivia"
## [31,] "172" "Bolivia"
## [32,] "172" "Bolivia"
## [33,] "172" "Bolivia"
## [34,] "172" "Bolivia"
## [35,] "172" "Bolivia"
## [36,] "172" "Bolivia"
## [37,] "172" "Bolivia"
## [38,] "172" "Bolivia"
## [39,] "172" "Bolivia"
## [40,] "27" "Peru"
## [41,] "172" "Bolivia"
## [42,] "172" "Bolivia"
## [43,] "172" "Bolivia"
## [44,] "172" "Bolivia"
## [45,] "27" "Peru"
## [46,] "172" "Bolivia"
## [47,] "172" "Bolivia"
## [48,] "172" "Bolivia"
## [49,] "172" "Bolivia"
## [50,] "172" "Bolivia"
## [51,] "172" "Bolivia"
## [52,] "172" "Bolivia"
## [53,] "172" "Bolivia"
## [54,] "27" "Peru"
## [55,] "172" "Bolivia"
## [56,] "27" "Peru"
## [57,] "27" "Peru"
## [58,] "27" "Peru"
## [59,] "31" "Peru"
## [60,] "65" "Peru"
## [61,] "65" "Peru"
## [62,] "143" "Peru"
## [63,] "27" "Argentina"
## [64,] "172" "Bolivia"
## [65,] "172" "Bolivia"
## [66,] "27" "Argentina"
## [67,] "172" "Bolivia"
plot(acg)
plot(wrld_simpl, add=T, border='blue', lwd=2)
points(acg[j, ], col='red', pch=20, cex=2)
Some of the early species distribution model algorithms, such as Bioclim and Domain only use ’presence’ data in the modeling process. Other methods also use ’absence’ data or ’background’ data.
If you want to have a non-NA-value data frame for your method, here is what you could do…
First, there is a common method to delete all NA
values from a data frame: we can use the na.omit()
function to omit the rows that contains the NA
values.
For example, given you only need longtitude and latitude values to draw your map, you can take those two columns into a data frame.
long<-acaule$lon
#extract column of longtitude
lati<-acaule$lat
#extract column of latitude
d<-cbind(long,lati)
#combine into a data frame
colnames(d)<-c("longitude","latitude")
#name data frame
dim(d)[1]
## [1] 7229
#rows number of uncleaning data frame
d[27,]
## longitude latitude
## NA NA
#there are some NA rows in this data frame
Then, you can use na.omit()
to remove all NA
value from this data frame.
clean_d <-na.omit(d)
#save teh cleaing data frame
dim(clean_d)[1]
## [1] 4643
#rows number of cleaning data frame
clean_d[27,]
## longitude latitude
## -67.71306 -28.97167
#NA have been removed from the previous row.
Another simple way to do this is to use the filter()
function in the {dplyr} package.
For example:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:raster':
##
## intersect, select, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
new_d<-filter(acaule, lon !="NA" & lat !="NA" )
#new variable to store the data frame.
summary(is.na(new_d$lon))
## Mode FALSE
## logical 4643
#use is.na() to check if there is remaining NA in longitude column.
Background data do not attempt to guess at absence locations, but rather characterize environments in the study region. In this sense, background is the same, irrespective of where the species has been found.
Background data establishes the environmental domain of the study, whilst presence data should establish under which conditions a species is more likely to be present than on average. A closely related but different concept, that of “pseudo-absences”, is also used for generating the non-presence class for logistic models. In this case, researchers sometimes try to guess where absences might occur – they may sample the whole region except at presence locations, or they might sample at places unlikely to be suitable for the species. We prefer the background concept because it requires fewer assumptions and has some coherent statistical methods for dealing with the “overlap” between presence and background points.
In the example below, we first get the list of filenames with the predictor raster data (discussed in detail in the next section). We use a raster as a ‘mask’ in the randomPoints()
function such that the background points are from the same geographic area, and only for places where there are values (land, in our case).
library(dismo)
# get the file names
files <- list.files(path=paste(system.file(package="dismo"), '/ex',
sep=''), pattern='grd', full.names=TRUE )
# we use the first file to create a RasterLayer
mask <- raster(files[1])
# select 500 random points
set.seed(1963)
# set seed to assure that the examples will always have the same random sample.
bg <- randomPoints(mask, 500 )
plot(!is.na(mask), legend=FALSE)
points(bg, cex=0.5)
# now we repeat the sampling, but limit the area of sampling using a spatial extent
e <- extent(-80, -53, -39, -22)
bg2 <- randomPoints(mask, 50, ext=e)
#limit point for longitude -39~-22, latitude -80~-53
plot(!is.na(mask), legend=FALSE)
plot(e, add=TRUE, col='red')
points(bg2, cex=0.5)
There are several approaches one could use to sample ‘pseudo-absence’ points from a more restricted area than ‘background’.
We first read the cleaned and subsetted S. acaule data that we produced in the previous chapter from the csv
file that comes with {dismo}:
file <- paste(system.file(package="dismo"), '/ex/acaule.csv', sep='')
ac <- read.csv(file)
#load the cleaned data file
coordinates(ac) <- ~lon+lat
projection(ac) <- CRS('+proj=longlat +datum=WGS84')
#create a Spatial Points Data Frame
We first create a ‘circles’ model (see the chapter about geographic models), using an arbitrary radius of 50 km. Circles()
is a function from {rgeos} package.
library(rgeos)
## rgeos version: 0.5-2, (SVN revision 621)
## GEOS runtime version: 3.7.2-CAPI-1.11.2
## Linking to sp version: 1.3-1
## Polygon checking: TRUE
x <- circles(ac, d=50000, lonlat=TRUE)
# circles with a radius of 50 km
pol <- polygons(x)
samp1 <- spsample(pol, 250, type='random', iter=25)
# sample randomly from all circles
cells <- cellFromXY(mask, samp1)
length(cells)
## [1] 250
cells <- unique(cells)
# get unique cells
length(cells)
## [1] 161
xy <- xyFromCell(mask, cells)
plot(pol, axes=TRUE)
points(xy, cex=0.75, pch=20, col='blue')
#make a plot to see results
A similar results could be produced via raster function extract()
# extract cell numbers for the circles
v <- extract(mask, x@polygons, cellnumbers=T)
# use rbind to combine the elements in list v
v <- do.call(rbind, v)
# get unique cell numbers from which you could sample
v <- unique(v[,1])
head(v)
## [1] 15531 15717 17581 17582 17765 17767
# to display the results
m <- mask
m[] <- NA
m[v] <- 1
plot(m, ext=extent(x@polygons)+1)
plot(x@polygons, add=T)
# make a plot for results,
A raster consists of a matrix of cells (or pixels) organized into rows and columns (or a grid) where each contains a value representing information. Raster format represents real-world phenomena, so data might be discrete or continuous. Discrete data: also are called thematic, categorical or discontinuous data. A discrete object has known and definable boundaries.
In species distribution modeling, predictor variables are typically organized as raster (grid) type files. Each predictor should be a ‘raster’ representing a variable of interest. Variables can include climatic, soil, terrain, vegetation, land use, and other variables.
These data are typically stored in files in some kind of GIS format. Almost all relevant formats can be used (including ESRI grid, geoTiff, netCDF, IDRISI).
For any particular study the layers should all have the same spatial extent, resolution, origin, and projection. The set of predictor variables (rasters) can be used to make a RasterStack, which is a collection of RasterLayer objects
A RasterLayer object represents single-layer (variable) raster data. A RasterLayer object always stores a number of fundamental parameters that describe it. These include the number of columns and rows, the spatial extent, and the Coordinate Reference System. In addition, a RasterLayer can store information about the file in which the raster cell values are stored (if there is such a file). A RasterLayer can also hold the raster cell values in memory.
Using the {dismo} package, we are going to create a RasterStack from the list of files that are intalled with the package.
Fist, we need to get the folder with our files…..
path <- file.path(system.file(package="dismo"), 'ex')
Now get the names of all the files with extension .grd
in this folder. The $
sign in the code below indicates that the files must end with the characters ‘grd’. By using full.names=TRUE
, the full path names are returned.
library(dismo)
files <- list.files(path, pattern='grd$', full.names=TRUE )
files
## [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio1.grd"
## [2] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio12.grd"
## [3] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio16.grd"
## [4] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio17.grd"
## [5] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio5.grd"
## [6] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio6.grd"
## [7] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio7.grd"
## [8] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio8.grd"
## [9] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/biome.grd"
Now create a RasterStack of predictor variables.
predictors <- stack(files)
predictors
## class : RasterStack
## dimensions : 192, 186, 35712, 9 (nrow, ncol, ncell, nlayers)
## resolution : 0.5, 0.5 (x, y)
## extent : -125, -32, -56, 40 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## names : bio1, bio12, bio16, bio17, bio5, bio6, bio7, bio8, biome
## min values : -23, 0, 0, 0, 61, -212, 60, -66, 1
## max values : 289, 7682, 2458, 1496, 422, 242, 461, 323, 14
names(predictors)
## [1] "bio1" "bio12" "bio16" "bio17" "bio5" "bio6" "bio7" "bio8" "biome"
plot(predictors)
We can also make a plot of a single layer in a RasterStack, and plot some additional data on top of it.
To so, we’ll first get the world boundaries again, along with data on Bradypus:
Figure 5. Bradypus sp.
library(maptools)
data(wrld_simpl)
file <- paste(system.file(package="dismo"), "/ex/bradypus.csv", sep="")
bradypus <- read.table(file, header=TRUE, sep=',')
# we do not need the first column
bradypus <- bradypus[,-1]
Plotting bioclimatic variables from the WorldClim database:
# first layer of the RasterStack
plot(predictors, 1)
# note the "add=TRUE" argument with plot
plot(wrld_simpl, add=TRUE)
# with the points function, "add" is implicit
points(bradypus, col='blue')
We now have a set of predictor climatic variables (rasters) and occurrence points. The next step is to extract the values of the predictors at the locations of the points. (This step can be skipped for the modeling methods that are implemented in the {dismo} package). This is a very straightforward thing to do using the extract()
function from the {raster} package.
presvals <- extract(predictors, bradypus)
# setting random seed to always create the same
# random set of points for this example
set.seed(0)
#Set the seed of R's random number generator, which is useful for creating simulations or random objects that can be reproduced
backgr <- randomPoints(predictors, 500)
#setting for backgroud
absvals <- extract(predictors, backgr)
#extract values
pb <- c(rep(1, nrow(presvals)), rep(0, nrow(absvals)))
sdmdata <- data.frame(cbind(pb, rbind(presvals, absvals)))
sdmdata[,'biome'] = as.factor(sdmdata[,'biome'])
head(sdmdata)
## pb bio1 bio12 bio16 bio17 bio5 bio6 bio7 bio8 biome
## 1 1 263 1639 724 62 338 191 147 261 1
## 2 1 263 1639 724 62 338 191 147 261 1
## 3 1 253 3624 1547 373 329 150 179 271 1
## 4 1 243 1693 775 186 318 150 168 264 1
## 5 1 243 1693 775 186 318 150 168 264 1
## 6 1 252 2501 1081 280 326 154 172 270 1
tail(sdmdata)
## pb bio1 bio12 bio16 bio17 bio5 bio6 bio7 bio8 biome
## 611 0 265 1622 749 66 344 184 160 266 1
## 612 0 241 1648 475 354 294 183 112 229 1
## 613 0 154 731 250 80 319 18 301 198 8
## 614 0 169 1195 316 273 293 69 224 122 7
## 615 0 247 2965 1173 327 314 168 146 253 1
## 616 0 117 495 231 36 336 -96 432 216 8
summary(sdmdata)
## pb bio1 bio12 bio16
## Min. :0.0000 Min. : 16.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:0.0000 1st Qu.:176.8 1st Qu.: 821.5 1st Qu.: 341.8
## Median :0.0000 Median :242.0 Median :1497.0 Median : 645.5
## Mean :0.1883 Mean :213.9 Mean :1629.5 Mean : 656.4
## 3rd Qu.:0.0000 3rd Qu.:261.0 3rd Qu.:2266.8 3rd Qu.: 917.0
## Max. :1.0000 Max. :286.0 Max. :7682.0 Max. :2458.0
##
## bio17 bio5 bio6 bio7
## Min. : 0.0 Min. : 88.0 Min. :-167.0 Min. : 68.0
## 1st Qu.: 42.0 1st Qu.:303.0 1st Qu.: 45.0 1st Qu.:118.8
## Median : 110.0 Median :320.0 Median : 151.0 Median :160.0
## Mean : 169.2 Mean :309.3 Mean : 119.3 Mean :190.0
## 3rd Qu.: 240.5 3rd Qu.:333.0 3rd Qu.: 202.2 3rd Qu.:247.0
## Max. :1496.0 Max. :410.0 Max. : 232.0 Max. :440.0
##
## bio8 biome
## Min. :-20.0 1 :299
## 1st Qu.:215.0 7 : 65
## Median :250.0 8 : 61
## Mean :225.1 13 : 51
## 3rd Qu.:262.0 2 : 49
## Max. :323.0 4 : 33
## (Other): 58
To visually investigate colinearity in the environmental data (at the presence and background points) you can use a pairs plot.
pairs(sdmdata[,2:5], cex=0.1)
A large number of algorithms has been used in species distribution modeling. They can be classified as ‘profile’, ‘regression’, and ‘machine learning’ methods. Profile methods only consider ‘presence’ data, not absence nor background data. Regression and machine learning methods use both presence and absence or background data. The distinction between regression and machine learning methods is not sharp, but it is perhaps still useful as way to classify models. Another distinction that one can make is between presence-only and presence-absence models. Profile methods are always presence-only, other methods can be either, depending if they are used with survey-absence or with pseudo-absence/background data. An entirely different class of models consists of models that only, or primarily, use the geographic location of known occurrences, and do not rely on the values of predictor variables at these locations. We refer to these models as ‘geographic models’.
Figure 2. A hierarchical structure of the 12 SDM algorithms commonly used. (Pecchi et al., 2019)
To make a decision on which algorithm to use, you must first know the type of data we have.
Presence only (e.g. BIOCLIM)
Presence/absence (e.g. ANN)
Presence/pseudo-absence (e.g. GARP)
Presence/environment(background) (e.g. Maxent).
Platforms with multiple algorithms:
Biomod: http://cran.r-project.org/web/packages/biomod2/index.html
Dismo: http://cran.r-project.org/web/packages/dismo/index.html
OpenModeller: http://openmodeller.sourceforge.net/
ModEco: http://gis.ucmerced.edu/ModEco/
Div-GIS: http://www.diva-gis.org
Profile model:
Bioclim, Domain, and Mahal are profile methods.These are implemented in the {dismo} package, and the procedures to use these models are the same for all three.
Regression model:
We already learned classic regression modeling earlier in this class. It is very similar here. Models are fit using maximum likelihood and by allowing the model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value
Machine learning models:
Machine learning models are non-parametric flexible regression models. Methods include Artifical Neural Networks (ANN), Random Forests, Boosted Regression Trees, and Support Vector Machines. Through the dismo package you can also use the Maxent program, that implements the most widely used method (maxent) in species distribution modeling.
Example: Generalized Linear Models
A generalized linear model (GLM) is a generalization of ordinary least squares regression, which we’ve also looked at in an earlier class. Models are fit using maximum likelihood and by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. Depending on how a GLM is specified it can be equivalent to (multiple) linear regression, logistic regression or Poisson regression
In R, GLMs may be implemented using the glm()
function, and the link function and error distribution are specified with the family
argument. Examples include:
*family = binomial(link = “logit”)
*family = poisson(link = “log”)
Here we fit two basic glm models. The models need to be fit with presence and absence (background) data. With the exception of ‘maxent’, we cannot fit the model with a RasterStack and points. Instead, we need to extract the environmental data values ourselves, and fit the models with these values.
#first example
gm1 <- glm(pa ~ bio1 + bio5 + bio6 + bio7 + bio8 + bio12 + bio16 + bio17,
family = binomial(link = "logit"), data=envtrain)
summary(gm1)
##
## Call:
## glm(formula = pa ~ bio1 + bio5 + bio6 + bio7 + bio8 + bio12 +
## bio16 + bio17, family = binomial(link = "logit"), data = envtrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.73157 -0.48749 -0.23064 -0.05455 2.90207
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.4189152 1.6563006 1.460 0.14417
## bio1 0.1083418 0.0609904 1.776 0.07567 .
## bio5 0.1010893 0.2526065 0.400 0.68902
## bio6 -0.1941921 0.2545272 -0.763 0.44549
## bio7 -0.1875595 0.2530838 -0.741 0.45864
## bio8 -0.0180964 0.0275002 -0.658 0.51051
## bio12 0.0018953 0.0007034 2.695 0.00705 **
## bio16 -0.0019137 0.0014796 -1.293 0.19586
## bio17 -0.0042036 0.0015431 -2.724 0.00645 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 596.69 on 892 degrees of freedom
## Residual deviance: 440.50 on 884 degrees of freedom
## AIC: 458.5
##
## Number of Fisher Scoring iterations: 8
coef(gm1)
## (Intercept) bio1 bio5 bio6 bio7
## 2.418915241 0.108341785 0.101089299 -0.194192095 -0.187559462
## bio8 bio12 bio16 bio17
## -0.018096425 0.001895285 -0.001913743 -0.004203616
#second example
gm2 <- glm(pa ~ bio1+bio5 + bio6 + bio7 + bio8 + bio12 + bio16 + bio17,
family = gaussian(link = "identity"), data=envtrain)
evaluate(testpres, testbackg, gm1)
## class : ModelEvaluation
## n presences : 23
## n absences : 200
## AUC : 0.7938043
## cor : 0.2640135
## max TPR+TNR at : -2.844012
ge2 <- evaluate(testpres, testbackg, gm2)
ge2
## class : ModelEvaluation
## n presences : 23
## n absences : 200
## AUC : 0.7801087
## cor : 0.2952745
## max TPR+TNR at : 0.05090658
#show as a plot graph
pg <- predict(predictors, gm2, ext=ext)
par(mfrow=c(1,2))
plot(pg, main='GLM/gaussian, raw values')
plot(wrld_simpl, add=TRUE, border='dark grey')
tr <- threshold(ge2, 'spec_sens')
plot(pg > tr, main='presence/absence')
plot(wrld_simpl, add=TRUE, border='dark grey')
points(pres_train, pch='+')
points(backg_train, pch='-', cex=0.25)
Model fitting is the heart of any SDM application. Many different algorithms are available (Elith et al., 2006), and often several algorithms are combined into ensemble models or several candidate models with different candidate predictor sets are averaged (Hastie, Tibshirani, and Friedman, 2009). The decisions on these matters should have been made during the conceptualisation phase. Important aspects to consider during the model fitting step are: how to deal with multicollinearity in the environmental data? How many variables should be included in the model (without overfitting) and how should we select these? Which model settings should be used? When multiple model algorithms or candidate models are fitted, how to select the final model or average the models? Do we need to test or correct for non-independence in the data (spatial or temporal autocorrelation, nested data)? If the goal is to derive binary predictions, which threshold should be used?
Model fitting is technically quite similar across the modeling methods that exist in R. Most methods take a formula identifying the dependent and independent variables, accompanied with a data.frame that holds these variables
m1 <- glm(pb ~ bio1 + bio5 + bio12, data=sdmdata)
class(m1)
## [1] "glm" "lm"
summary(m1)
##
## Call:
## glm(formula = pb ~ bio1 + bio5 + bio12, data = sdmdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.95534 -0.23380 -0.09987 0.07831 0.88796
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.550e-01 1.093e-01 1.418 0.156593
## bio1 1.621e-03 3.960e-04 4.093 4.84e-05 ***
## bio5 -1.614e-03 4.725e-04 -3.417 0.000676 ***
## bio12 1.140e-04 1.676e-05 6.806 2.39e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1220413)
##
## Null deviance: 94.156 on 615 degrees of freedom
## Residual deviance: 74.689 on 612 degrees of freedom
## AIC: 458.43
##
## Number of Fisher Scoring iterations: 2
m2 = glm(pb ~ ., data=sdmdata)
m2
##
## Call: glm(formula = pb ~ ., data = sdmdata)
##
## Coefficients:
## (Intercept) bio1 bio12 bio16 bio17
## 0.2644272 -0.0020029 0.0004336 -0.0006207 -0.0007843
## bio5 bio6 bio7 bio8 biome2
## 0.0035996 -0.0025153 -0.0041593 0.0008508 -0.0899869
## biome3 biome4 biome5 biome7 biome8
## -0.1262125 -0.0699228 -0.0136612 -0.1993102 0.0057666
## biome9 biome10 biome11 biome12 biome13
## 0.0465949 0.0055868 -0.2705639 -0.0450194 0.0839402
## biome14
## -0.1689054
##
## Degrees of Freedom: 615 Total (i.e. Null); 595 Residual
## Null Deviance: 94.16
## Residual Deviance: 69.17 AIC: 445.1
bc <- bioclim(presvals[,c('bio1', 'bio5', 'bio12')])
class(bc)
## [1] "Bioclim"
## attr(,"package")
## [1] "dismo"
bc
## class : Bioclim
##
## variables: bio1 bio5 bio12
##
##
## presence points: 116
## bio1 bio5 bio12
## 1 263 338 1639
## 2 263 338 1639
## 3 253 329 3624
## 4 243 318 1693
## 5 243 318 1693
## 6 252 326 2501
## 7 240 317 1214
## 8 275 335 2259
## 9 271 327 2212
## 10 274 329 2233
## (... ... ...)
pairs(bc)
Different modeling methods return different type of ‘model’ objects (typically they have the same name as the modeling method used). All of these ‘model’ objects, irrespective of their exact class, can be used to with the predict function to make predictions for any combination of values of the independent variables.
bio1 = c(40, 150, 200)
bio5 = c(60, 115, 290)
bio12 = c(600, 1600, 1700)
pd = data.frame(cbind(bio1, bio5, bio12))
pd
## bio1 bio5 bio12
## 1 40 60 600
## 2 150 115 1600
## 3 200 290 1700
predict(m1, pd)
## 1 2 3
## 0.1914136 0.3949569 0.2048905
predict(bc, pd)
Making such predictions for a few environments can be very useful to explore and understand model predictions. For example, it can be used in the response function that creates response plots for each variable, with the other variables at their median value.
response(bc)
predictors <- stack(list.files(file.path(system.file(package="dismo"), 'ex'), pattern='grd$', full.names=TRUE ))
names(predictors)
## [1] "bio1" "bio12" "bio16" "bio17" "bio5" "bio6" "bio7" "bio8" "biome"
p <- predict(predictors, m1)
plot(p)
Most model types have different measures that can help to assess how well the model fits the data. Most statistics or machine learning texts will provide some details on these methods. For instance, for a GLM one can look at how much deviance is explained, whether there are patterns in the residuals, whether there are points with high leverage and so on. However, since many models are to be used for prediction, much evaluation is focused on how well the model predicts to points not used in model training.
What do we consider to evaluate a model?
Does the model seem sensible, ecologically?
Do the fitted functions (the shapes of the modeled relationships) make sense?
Do the predictions seem reasonable? (map them, and think about them)?
Are there any spatial patterns in model residuals?
Different measures can be used to evaluate the quality of a prediction, perhaps depending on the goal of the study. Many measures for evaluating models based on presence-absence or presence-only data are ‘threshold dependent’. That means that a threshold must be set first (e.g., 0.5, though 0.5 is rarely a sensible choice – e.g. see Lui et al., 2005). Predicted values above that threshold indicate a prediction of ‘presence’, and values below the threshold indicate ‘absence’. Some measures emphasize the weight of false absences; others give more weight to false presences.
Commonly used statistics that are threshold independent are the correlation coefficient and the Area Under the Receiver Operator Curve (AUROC, generally further abbreviated to AUC). AUC is a measure of rank-correlation. In unbiased data, a high AUC indicates that sites with high predicted suitability values tend to be areas of known presence and locations with lower model prediction values tend to be areas where the species is not known to be present (absent or a random point). An AUC score of 0.5 means that the model is as good as a random guess.
For a discussion on the use of AUC in the context of presence-only rather than presence/absence data.
Here we illustrate the computation of the correlation coefficient and AUC with two random variables. p (presence) has higher values, and represents the predicted value for 50 known cases (locations) where the species is present, and a (absence) has lower values, and represents the predicted value for 50 known cases (locations) where the species is absent.
p <- rnorm(50, mean=0.7, sd=0.3)
a <- rnorm(50, mean=0.4, sd=0.4)
par(mfrow=c(1, 2))
plot(sort(p), col='red', pch=21)
points(sort(a), col='blue', pch=24)
legend(1, 0.95 * max(a,p), c('presence', 'absence'),
pch=c(21,24), col=c('red', 'blue'))
comb <- c(p,a)
group <- c(rep('presence', length(p)), rep('absence', length(a)))
boxplot(comb~group, col=c('blue', 'red'))
We created two variables with random normally distributed values, but with different mean and standard deviation. The two variables clearly have different distributions, and the values for ‘presence’ tend to be higher than for ‘absence’. Here is how you can compute the correlation coefficient and the AUC:
group = c(rep(1, length(p)), rep(0, length(a)))
cor.test(comb, group)$estimate
## cor
## 0.4224299
mv <- wilcox.test(p,a)
auc <- as.numeric(mv$statistic) / (length(p) * length(a))
auc
## [1] 0.7368
Below we show how you can compute these, and other statistics more conveniently, with the evaluate()
function in the {dismo} package. See ?evaluate
for information on additional evaluation measures that are available. ROC/AUC can also be computed with the {ROCR} package.
e <- evaluate(p=p, a=a)
class(e)
## [1] "ModelEvaluation"
## attr(,"package")
## [1] "dismo"
par(mfrow=c(1, 2))
density(e)
boxplot(e, col=c('blue', 'red'))
Back to some real data, presence-only in this case. We’ll divide the data in two random sets, one for training a Bioclim model, and one for evaluating the model.
samp <- sample(nrow(sdmdata), round(0.75 * nrow(sdmdata)))
traindata <- sdmdata[samp,]
traindata <- traindata[traindata[,1] == 1, 2:9]
testdata <- sdmdata[-samp,]
bc <- bioclim(traindata)
e <- evaluate(testdata[testdata==1,], testdata[testdata==0,], bc)
e
## class : ModelEvaluation
## n presences : 25
## n absences : 129
## AUC : 0.756124
## cor : 0.2518402
## max TPR+TNR at : 0.01088901
plot(e, 'ROC')
Let’s first create presence and background data.
pres <- sdmdata[sdmdata[,1] == 1, 2:9]
back <- sdmdata[sdmdata[,1] == 0, 2:9]
The background data will only be used for model testing and does not need to be partitioned. We now partition the data into 5 groups:
k <- 5
group <- kfold(pres, k)
group[1:10]
## [1] 3 2 2 4 3 2 3 1 1 2
unique(group)
## [1] 3 2 4 1 5
Now we can fit and test our model five times. In each run, the records corresponding to one of the five groups is only used to evaluate the model, while the other four groups are only used to fit the model. The results are stored in a list called ‘e’.
e <- list()
for (i in 1:k) {
train <- pres[group != i,]
test <- pres[group == i,]
bc <- bioclim(train)
e[[i]] <- evaluate(p=test, a=back, bc)
}
We can extract several things from the objects in e
, but let’s restrict ourselves to the AUC values and the “maximum of the sum of the sensitivity (true positive rate) and specificity (true negative rate)” threshold “spec_sens” (this is sometimes uses as a threshold for setting cells to presence or absence).
auc <- sapply(e, function(x){x@auc})
auc
## [1] 0.7800000 0.7756522 0.8173750 0.7817826 0.7738696
mean(auc)
## [1] 0.7857359
sapply( e, function(x){ threshold(x)['spec_sens'] } )
## $spec_sens
## [1] 0.04291075
##
## $spec_sens
## [1] 0.08592151
##
## $spec_sens
## [1] 0.01076957
##
## $spec_sens
## [1] 0.04291075
##
## $spec_sens
## [1] 0.03215806
The use of AUC in evaluating SDMs has been criticized (Lobo et al., 2008; Jiménez-Valverde, 2011). A particularly sticky problem is that the values of AUC vary with the spatial extent used to select background points. Generally, the larger that extent, the higher the AUC value. Therefore, AUC values are generally biased and cannot be directly compared. Hijmans (2012) suggests that one could remove “spatial sorting bias” (the difference between the distance from testing-presence to training-presence and the distance from testing-absence to training-presence points) through “point-wise distance sampling”.
file <- file.path(system.file(package="dismo"), "ex/bradypus.csv")
bradypus <- read.table(file, header=TRUE, sep=',')
bradypus <- bradypus[,-1]
presvals <- extract(predictors, bradypus)
set.seed(0)
backgr <- randomPoints(predictors, 500)
nr <- nrow(bradypus)
s <- sample(nr, 0.25 * nr)
pres_train <- bradypus[-s, ]
pres_test <- bradypus[s, ]
nr <- nrow(backgr)
set.seed(9)
s <- sample(nr, 0.25 * nr)
back_train <- backgr[-s, ]
back_test <- backgr[s, ]
sb <- ssb(pres_test, back_test, pres_train)
sb[,1] / sb[,2]
## p
## 0.1000638
sb[,1] / sb[,2]
is an indicator of spatial sorting bias (SSB). If there is no SSB this value should be 1, in these data it is close to zero, indicating that SSB is very strong. Let’s create a subsample in which SSB is removed.
i <- pwdSample(pres_test, back_test, pres_train, n=1, tr=0.1)
pres_test_pwd <- pres_test[!is.na(i[,1]), ]
back_test_pwd <- back_test[na.omit(as.vector(i)), ]
sb2 <- ssb(pres_test_pwd, back_test_pwd, pres_train)
sb2[1]/ sb2[2]
## [1] 1.004106
Spatial sorting bias is much reduced now; notice how the AUC dropped!
bc <- bioclim(predictors, pres_train)
evaluate(bc, p=pres_test, a=back_test, x=predictors)
## class : ModelEvaluation
## n presences : 29
## n absences : 125
## AUC : 0.757931
## cor : 0.2777298
## max TPR+TNR at : 0.03438276
evaluate(bc, p=pres_test_pwd, a=back_test_pwd, x=predictors)
## class : ModelEvaluation
## n presences : 19
## n absences : 19
## AUC : 0.5914127
## cor : 0.1487003
## max TPR+TNR at : 0.09185402
Hijmans RJ, Elith J (2012) Species distribution modeling with R. http://cran.r-project.org/web/packages/dismo/vignettes/dm.pdf.
Elith, J., C.H. Graham, R.P. Anderson, M. Dudik, S. Ferrier, A. Guisan, R.J. Hijmans, F. Huettmann, J. Leathwick, A. Lehmann, J. Li, L.G. Lohmann, B. Loiselle, G. Manion, C. Moritz, M. Nakamura, Y. Nakazawa, J. McC. Overton, A.T. Peterson, S. Phillips, K. Richardson, R. Scachetti-Pereira, R. Schapire, J. Soberon, S. Williams, M. Wisz and N. Zimmerman, 2006. Novel methods improve prediction of species’ distributions from occurrence data. Ecography 29: 129-151.