Group Project–Species Distribution Model

Introduction

Backgroud

Species distribution modeling (SDM) is also known by other names including climate envelope-modeling, habitat modeling, and (environmental or ecological) niche-modeling.

New technologies and emerging disciplines, such as biodiversity bioinformatics, have promoted the development of methodological approaches that allow generating useful information from incomplete data; such is the case of the modeling of the ecological niches of species and their geographical distributions.

What is the geographical distribution of a species? In the simplest sense, it is that species’ spatial arrangement on the planet, or the geographical description of where it lives.

Traditionally, the distribution of species has been represented in a cartographic way from the geo-referenced points on maps, without considering explicitly the local factors that determine that distribution.

SDMs are correlative methods, static and non mechanistic, that work by identifying the relationships that exist between variables associated with the local environment and observations of presence for the species.

To make the best use of SDMs it is very important know how they work, but equally important is to know the limitations of the data sources, algorithms, and the methodological approach in general.

Despite their limitations, SDMs are a useful tool to generate geographical hypotheses about present, past and future distribution of species.

In this module we provide a short introduction to species distribution modelling in R. This module will cover a brief overview of the concept of species distribution modelling and the basic approach to modelling in R.

Figure 1. Diagram of the Species Distribution Modelling procedure. (Rodríguez-Rey et al., 2019)

There is a wide variety of modeling algorithms (between 20 and 30), as well as a series of platforms where they have been implemented. The aim of SDM is to estimate the similarity of the conditions at any site to the conditions at the locations of known occurrence (and perhaps of non-occurrence) of a phenomenon. A common application of this method is to predict species ranges with climate data as predictors.

SDM steps

locations of occurrence of a species (or other phenomenon) are compiled
values of environmental predictor variables (such as climate) at these locations are extracted from spatial databases
the environmental values are used to fit a model to estimate similarity to the sites of occurrence, or another measure such as abundance of the species
The model is used to predict the variable of interest across the region of interest (and perhaps for a future or past climate).

SDM uses

Figure 2: Uses of SDMs (Araújo et al., 2019).

Example

Figure 3. An overlay of suitable areas for the five tree-cavity-nesters’ distribution models with protected areas. (Moradi et al., 2019)

Package

Notes: Before you begin you will need to install and load the following packages and their dependencies:

{curl}
{dismo}
{dplyr}
{maptools}
{raster}
{rJava}
{rgdal}
{rgbif}
{rgeos}

{dismo} (=“distribution modeling”), and {raster} (=“geographic data analysis and modelling”), and each one of them has a set of functions that enable the modelling of species distributions in geographic and spatial contexts.

Step 1: Data preparation

You need to collect a sufficient number of occurrence records (and absence or abundance, if possible) of the species of interest. You also need to have accurate and relevant environmental data (predictor variables) at a sufficiently high spatial resolution.

Step 2: Import species occurrence data

Importing occurrence data into R is easy. But collecting, georeferencing, and cross-checking coordinate data is tedious…

2.1 Upload files

You can upload our data files like we did in previous modules, using the {curl} package.

For example, we can load a datast of occurrences of blue-eyed grass, Sisyrinchium angustifolium, like so:

library (curl)
#install package curl before this step to load a file from server.
library(knitr)
# to make nice data table. 

f <- curl("https://raw.githubusercontent.com/feiyawang1207/AN597_groupproject/master/S_angustifolium.csv")
#load the file in varibale f
d <- read.csv(f, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# read file in d as a dataframe.
kable(head(d))

gbifID	datasetKey	occurrenceID	kingdom	phylum	class	order	family	genus	species	infraspecificEpithet	taxonRank	scientificName	verbatimScientificName	verbatimScientificNameAuthorship	countryCode	locality	stateProvince	individualCount	publishingOrgKey	decimalLatitude	decimalLongitude	coordinateUncertaintyInMeters	coordinatePrecision	elevation	elevationAccuracy	depth	depthAccuracy	eventDate	day	month	year	taxonKey	speciesKey	basisOfRecord	institutionCode	collectionCode	catalogNumber	recordNumber	dateIdentified	license	rightsHolder	recordedBy	lastInterpreted	mediaType	issue
2451752339	50c9509d-22c7-4a22-a47d-8c48425ef4a7	https://www.inaturalist.org/observations/35412543	Plantae	Tracheophyta	Magnoliopsida	Solanales	Solanaceae	Solanum	Solanum angustifolium	NA	SPECIES	Solanum rostratum Dun.	Solanum rostratum		MX		Michoacán	NA	28eb1a3f-1c15-4a95-931a-4af90ecb574d	19.07638	-102.88527	22	NA	NA	NA	NA	NA	2019-11-07T09:40:48Z	7	11	2019	2930922	2929800	HUMAN_OBSERVATION	iNaturalist	Observations	35412543		2019-11-07T16:03:48Z	CC_BY_NC_4_0	Jesús Martínez	Jesús Martínez	2019-11-15T13:12:19.217Z	STILLIMAGE	GEODETIC_DATUM_ASSUMED_WGS84;COORDINATE_ROUNDED
2451721667	50c9509d-22c7-4a22-a47d-8c48425ef4a7	https://www.inaturalist.org/observations/35294114	Plantae	Tracheophyta	Magnoliopsida	Solanales	Solanaceae	Solanum	Solanum angustifolium	NA	SPECIES	Solanum rostratum Dun.	Solanum rostratum		MX		Jalisco	NA	28eb1a3f-1c15-4a95-931a-4af90ecb574d	21.36370	-101.94901	4	NA	NA	NA	NA	NA	2019-11-03T14:32:00Z	3	11	2019	2930922	2929800	HUMAN_OBSERVATION	iNaturalist	Observations	35294114		2019-11-04T18:09:06Z	CC_BY_NC_4_0	Christian Martinez	Christian Martinez	2019-11-15T13:11:55.564Z	STILLIMAGE	GEODETIC_DATUM_ASSUMED_WGS84;COORDINATE_ROUNDED
2451686769	50c9509d-22c7-4a22-a47d-8c48425ef4a7	https://www.inaturalist.org/observations/35175230	Plantae	Tracheophyta	Magnoliopsida	Solanales	Solanaceae	Solanum	Solanum angustifolium	NA	SPECIES	Solanum rostratum Dun.	Solanum rostratum		US		Texas	NA	28eb1a3f-1c15-4a95-931a-4af90ecb574d	32.69588	-102.25799	8	NA	NA	NA	NA	NA	2019-11-01T15:15:39Z	1	11	2019	2930922	2929800	HUMAN_OBSERVATION	iNaturalist	Observations	35175230		2019-11-01T20:17:03Z	CC_BY_NC_4_0	Nathan Taylor	Nathan Taylor	2019-11-15T13:11:23Z	STILLIMAGE	GEODETIC_DATUM_ASSUMED_WGS84
2451656431	50c9509d-22c7-4a22-a47d-8c48425ef4a7	https://www.inaturalist.org/observations/35013197	Plantae	Tracheophyta	Magnoliopsida	Solanales	Solanaceae	Solanum	Solanum angustifolium	NA	SPECIES	Solanum rostratum Dun.	Solanum rostratum		US		Texas	NA	28eb1a3f-1c15-4a95-931a-4af90ecb574d	30.22526	-99.01538	6	NA	NA	NA	NA	NA	2019-10-28T11:53:22Z	28	10	2019	2930922	2929800	HUMAN_OBSERVATION	iNaturalist	Observations	35013197		2019-10-28T16:53:40Z	CC_BY_NC_4_0	fbgjwh	fbgjwh	2019-11-15T13:11:04.858Z	STILLIMAGE	GEODETIC_DATUM_ASSUMED_WGS84;COORDINATE_ROUNDED
2451601084	50c9509d-22c7-4a22-a47d-8c48425ef4a7	https://www.inaturalist.org/observations/34881947	Plantae	Tracheophyta	Magnoliopsida	Solanales	Solanaceae	Solanum	Solanum angustifolium	NA	SPECIES	Solanum rostratum Dun.	Solanum rostratum		US		Arkansas	NA	28eb1a3f-1c15-4a95-931a-4af90ecb574d	36.44975	-94.59106	10	NA	NA	NA	NA	NA	2019-08-21T13:59:00Z	21	8	2019	2930922	2929800	HUMAN_OBSERVATION	iNaturalist	Observations	34881947		2019-10-25T19:01:17Z	CC_BY_NC_4_0	jim_keesling	jim_keesling	2019-11-15T13:10:54.246Z	STILLIMAGE	GEODETIC_DATUM_ASSUMED_WGS84
2451556298	1060a696-f6a2-4341-92ef-47a63babbea1	7526a2b2-b355-492e-9d57-7de5519acd6e	Plantae	Tracheophyta	Magnoliopsida	Solanales	Solanaceae	Solanum	Solanum angustifolium	NA	SPECIES	Solanum rostratum Dun.	Solanum rostratum	Dunal	US	by Hwy 30 c. 10 mi W of Sidney	Nebraska	NA	75bf5fec-40fa-43ae-919f-c43bd7ce25c9	NA	NA	NA	NA	NA	NA	NA	NA	1945-08-21T00:00:00Z	21	8	1945	2930922	2929800	PRESERVED_SPECIMEN	SJSU		12374	5425		CC_BY_4_0		C.W. Sharsmith	2019-11-14T16:31:00.343Z

2.2 Importing occurrence data

If you do not have any species distribution data you can get started by downloading data from the Global Biodiversity Inventory Facility (GBIF) There you can find 1.354.603.236 occurrence records!

To import data from a database like this, we can use a function called gbif () to download files from the GBIF database.

For now, we are going to download the occurrence data from a potato species. Solanum acaule is widespread and common in upland habitats from northern Peru (Dept. Cajamarca), south through Bolivia to northern Argentina (Prov. San Juan), and with one record in northern Chile (Antofagasta Region)

Figure 4: Picture of Solanum acaule

library(dismo)

## Loading required package: raster

## Loading required package: sp

library(knitr)

acaule <- gbif("solanum", "acaule*", geo=FALSE)

## Loading required namespace: jsonlite

## 7229 records found

## 0-300-600-900-1200-1500-1800-2100-2400-2700-3000-3300-3600-3900-4200-4500-4800-5100-5400-5700-6000-6300-6600-6900-7200-7229 records downloaded

#load the saved S. acaule data from data base

kable(head(acaule))

acceptedScientificName	acceptedTaxonKey	accessRights	adm1	adm2	associatedReferences	basisOfRecord	bibliographicCitation	catalogNumber	class	classKey	cloc	collectionCode	collectionID	continent	coordinateUncertaintyInMeters	country	crawlId	datasetID	datasetKey	datasetName	dateIdentified	day	depth	depthAccuracy	disposition	dynamicProperties	elevation	elevationAccuracy	endDayOfYear	establishmentMeans	eventDate	eventID	eventRemarks	eventTime	family	familyKey	fieldNotes	fieldNumber	fullCountry	gbifID	genericName	genus	genusKey	geodeticDatum	georeferenceProtocol	georeferenceRemarks	georeferenceSources	habitat	higherClassification	higherGeography	higherGeographyID	http://unknown.org/nick	http://unknown.org/occurrenceDetails	identificationID	identificationRemarks	identifiedBy	identifier	individualCount	infraspecificEpithet	installationKey	institutionCode	institutionID	ISO2	key	kingdom	kingdomKey	language	lastCrawled	lastInterpreted	lastParsed	lat	license	locality	lon	materialSampleID	modified	month	municipality	nomenclaturalCode	nomenclaturalStatus	occurrenceID	occurrenceRemarks	order	orderKey	organismID	organismQuantity	organismQuantityType	organismRemarks	originalNameUsage	otherCatalogNumbers	ownerInstitutionCode	parentEventID	phylum	phylumKey	preparations	previousIdentifications	protocol	publishingCountry	publishingOrgKey	recordedBy	recordNumber	references	rights	rightsHolder	samplingProtocol	scientificName	scientificNameID	species	speciesKey	specificEpithet	startDayOfYear	taxonID	taxonKey	taxonomicStatus	taxonRank	taxonRemarks	type	typeStatus	typifiedName	verbatimCoordinateSystem	verbatimElevation	verbatimEventDate	verbatimLocality	vernacularName	year	downloadDate
Solanum acaule Bitter	2931061	NA	Cusco	NA	NA	HUMAN_OBSERVATION	NA	22332967	Magnoliopsida	220	Cusco, Peru	Observations	NA	NA	300	Peru	190	NA	50c9509d-22c7-4a22-a47d-8c48425ef4a7	iNaturalist research-grade observations	2019-04-30T14:41:30	26	NA	NA	NA	NA	NA	NA	NA	NA	2019-03-26T14:30:00	NA	NA	18:30:00Z	Solanaceae	7717	NA	NA	Peru	2238787932	Solanum	Solanum	2928997	WGS84	NA	NA	NA	NA	NA	NA	NA	cstobie	https://www.inaturalist.org/observations/22332967	52791525	NA	NA	22332967	NA	NA	997448a8-f762-11e1-a439-00145eb45e9a	iNaturalist	NA	PE	2238787932	Plantae	6	NA	2019-11-29T14:10:16.861+0000	2019-11-29T15:09:07.641+0000	2019-11-29T15:09:07.641+0000	-15.047820	http://creativecommons.org/licenses/by-nc/4.0/legalcode	NA	-71.13188	NA	2019-05-20T17:10:31.000+0000	3	NA	NA	NA	https://www.inaturalist.org/observations/22332967	NA	Solanales	1176	NA	NA	NA	NA	NA	NA	NA	NA	Tracheophyta	7707728	NA	NA	DWC_ARCHIVE	US	28eb1a3f-1c15-4a95-931a-4af90ecb574d	cstobie	NA	https://www.inaturalist.org/observations/22332967	© cstobie some rights reserved	cstobie	NA	Solanum acaule Bitter	NA	Solanum acaule	2931061	acaule	NA	886633	2931061	ACCEPTED	SPECIES	NA	NA	NA	NA	NA	NA	2019/03/26 2:30 PM EDT	Espinar, Perú	NA	2019	2019-12-04
Solanum acaule Bitter	2931061	NA	Catamarca	Ambato	NA	PRESERVED_SPECIMEN	NA	CORD00056696	Magnoliopsida	220	Camino desde El Rodeo rumbo a Primer Campo. Puesto Vergara, Primer Campo, alrededores de la Casa de Piedra, Catamarca, Ambato, Argentina, SOUTH_AMERICA	NA	NA	SOUTH_AMERICA	NA	Argentina	15	NA	ab61ad8f-953f-4c7c-80b2-e1fb50fd8187	NA	2016-01-01T00:00:00	24	NA	NA	NA	NA	NA	NA	NA	NATIVE	2016-02-24T00:00:00	NA	NA	NA	Solanaceae	7717	NA	NA	Argentina	2252415409	Solanum	Solanum	2928997	WGS84	NA	Puntual	GPS Ensign - Datum WGS84	NA	NA	Sudamérica \| Argentina \| Catamarca	NA	NA	NA	NA	NA	Chiarini, F.	CORD-44463984-f4cc-49ec-8911-2de91f216c2b-AR	NA	NA	4fd221cd-129c-404f-ab62-fdd2377d12ac	CORD	NA	AR	2252415409	Plantae	6	NA	2019-11-05T11:30:29.242+0000	2019-11-13T15:15:57.223+0000	2019-11-13T15:15:57.223+0000	-28.179167	http://creativecommons.org/licenses/by/4.0/legalcode	Camino desde El Rodeo rumbo a Primer Campo. Puesto Vergara, Primer Campo, alrededores de la Casa de Piedra	-65.99250	NA	NA	2	NA	NA	NA	CORD-44463984-f4cc-49ec-8911-2de91f216c2b-AR	NA	Solanales	1176	NA	NA	NA	NA	NA	NA	NA	NA	Tracheophyta	7707728	NA	Solanaceae \| Solanum acaule Bitter \| Chiarini, F. \| 2016	DWC_ARCHIVE	AR	7a6bdf66-ef5c-4a81-b731-2e328f4881eb	Barboza, G. E. \| Cantero, J. J. & Chiarini, F.	4669	NA	NA	NA	NA	Solanum acaule Bitter	NA	Solanum acaule	2931061	acaule	NA	NA	2931061	ACCEPTED	SPECIES	NA	NA	NA	NA	NA	3513	24/02/2016	NA	NA	2016	2019-12-04
Solanum acaule Bitter	2931061	NA	Lima	NA	NA	PRESERVED_SPECIMEN	NA	BM001134709	Magnoliopsida	220	Huaroz district, 37 km from canta on road to La Viuda, Lima, Peru, SOUTH_AMERICA	BOT	NA	SOUTH_AMERICA	NA	Peru	142	NA	7e380070-f762-11e1-a439-00145eb45e9a	NA	NA	28	NA	NA	NA	{“determinationfiledas”: “Yes”, “collectionkind”: “Sheet”, “gbifissue”: [“GEODETIC_DATUM_ASSUMED_WGS84”], “created”: 1415984060000, “associatedmediacount”: 1, “project”: “Picturae”, “determinationnames”: “Solanum acaule Bitter”, “plantdescription”: “herb to .1m high; infrequent, flowers blue/purple, without interstitial leaflet; collection time: 11:42am; mixed shrubland with dominance of Senecio collinus and Lupinus brachyprennon”, “subdepartment”: “Flowering Plants”, “gbifid”: 1056708863}	4052	NA	NA	NA	2014-02-28T00:00:00	NA	NA	NA	Solanaceae	7717	NA	2893	Peru	1056708863	Solanum	Solanum	2928997	WGS84	NA	NA	NA	NA	NA	South America; Peru; Lima; Canta	NA	NA	NA	NA	NA	Sandra (Sandy) Diane Knapp	4338577	NA	NA	60277b56-f762-11e1-a439-00145eb45e9a	NHMUK	NA	PE	1056708863	Plantae	6	NA	2019-11-29T15:42:43.569+0000	2019-11-29T16:08:04.040+0000	2019-11-29T16:08:04.040+0000	-11.387433	http://creativecommons.org/publicdomain/zero/1.0/legalcode	Huaroz district, 37 km from canta on road to La Viuda	-76.47753	NA	NA	2	NA	NA	NA	2bc0e320-b403-43b0-bede-334814a339b1	NA	Solanales	1176	NA	NA	NA	NA	NA	NHMUK:ecatalogue:4338577	NA	NA	Tracheophyta	7707728	NA	NA	DWC_ARCHIVE	GB	19456090-b49a-11d8-abeb-b8a03c50a862	Paúl Gonzáles, Mindy M. Syfert, Dr Erica McAlister, D Whitmore, Sandra (Sandy) Diane Knapp	2893	NA	NA	NA	NA	Solanum acaule Bitter	NA	Solanum acaule	2931061	acaule	NA	NA	2931061	ACCEPTED	SPECIES	NA	NA	NA	NA	NA	NA	NA	NA	NA	2014	2019-12-04
Solanum acaule Bitter	2931061	NA	Lima	Canta	NA	PRESERVED_SPECIMEN	http://www.tropicos.org/Specimen/100991187	100991187	Magnoliopsida	220	Huaroz district, 37 km from Canta on road to La Viuda., Lima, Canta, Peru, SOUTH_AMERICA	MO	http://biocol.org/urn:lsid:biocol.org:col:15859	SOUTH_AMERICA	NA	Peru	95	NA	7bd65a7a-f762-11e1-a439-00145eb45e9a	Tropicos	NA	28	NA	NA	NA	NA	4052	NA	NA	NA	2014-02-28T00:00:00	NA	NA	NA	Solanaceae	7717	NA	NA	Peru	2268891108	Solanum	Solanum	2928997	WGS84	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	urn:catalog:MO:Tropicos:100991187	1	NA	c4e134d3-c68f-406d-a116-6a37895602e0	MO	NA	PE	2268891108	Plantae	6	NA	2019-11-29T07:20:04.818+0000	2019-11-29T07:48:48.952+0000	2019-11-29T07:48:48.952+0000	-11.387500	http://creativecommons.org/licenses/by/4.0/legalcode	Huaroz district, 37 km from Canta on road to La Viuda.	-76.47750	NA	2017-08-08T14:37:00.000+0000	2	NA	ICNafp	No opinion	urn:catalog:MO:Tropicos:100991187	NA	Solanales	1176	NA	NA	NA	NA	NA	NA	MOBOT	NA	Tracheophyta	7707728	NA	NA	DWC_ARCHIVE	US	90fd6680-349f-11d8-aa2d-b8a03c50a862	Paúl Gonzáles\|M. Syfert\|E. McAlister\|D. Whitmore\|Sandy Knapp	Gonzáles Arce 2893	NA	NA	Missouri Botanical Garden	NA	Solanum acaule Bitter	NA	Solanum acaule	2931061	acaule	NA	29600091	2931061	ACCEPTED	SPECIES	NA	PhysicalObject	NA	NA	NA	NA	NA	NA	NA	2014	2019-12-04
Solanum acaule Bitter	2931061	NA	Lima	Oyon	NA	PRESERVED_SPECIMEN	http://www.tropicos.org/Specimen/100991022	100991022	Magnoliopsida	220	Oyón district, 1.7 km up to Quichas on road up to Cordillera Raura., Lima, Oyon, Peru, SOUTH_AMERICA	MO	http://biocol.org/urn:lsid:biocol.org:col:15859	SOUTH_AMERICA	NA	Peru	95	NA	7bd65a7a-f762-11e1-a439-00145eb45e9a	Tropicos	NA	6	NA	NA	NA	NA	4134	NA	NA	NA	2014-03-06T00:00:00	NA	NA	NA	Solanaceae	7717	NA	NA	Peru	2268891573	Solanum	Solanum	2928997	WGS84	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	urn:catalog:MO:Tropicos:100991022	1	NA	c4e134d3-c68f-406d-a116-6a37895602e0	MO	NA	PE	2268891573	Plantae	6	NA	2019-11-29T07:20:04.818+0000	2019-11-29T07:48:54.267+0000	2019-11-29T07:48:54.267+0000	-10.558056	http://creativecommons.org/licenses/by/4.0/legalcode	Oyón district, 1.7 km up to Quichas on road up to Cordillera Raura.	-76.77194	NA	2017-08-07T14:49:00.000+0000	3	NA	ICNafp	No opinion	urn:catalog:MO:Tropicos:100991022	NA	Solanales	1176	NA	NA	NA	NA	NA	NA	MOBOT	NA	Tracheophyta	7707728	NA	NA	DWC_ARCHIVE	US	90fd6680-349f-11d8-aa2d-b8a03c50a862	Paúl Gonzáles\|M. Syfert\|E. McAlister\|D. Whitmore	Gonzáles Arce 2961	NA	NA	Missouri Botanical Garden	NA	Solanum acaule Bitter	NA	Solanum acaule	2931061	acaule	NA	29600091	2931061	ACCEPTED	SPECIES	NA	PhysicalObject	NA	NA	NA	NA	NA	NA	NA	2014	2019-12-04
Solanum acaule Bitter	2931061	NA	Ancash	NA	NA	PRESERVED_SPECIMEN	NA	BM001120832	Magnoliopsida	220	between km55-57 on rd from Yanac to Sihuas, Ancash, Peru, SOUTH_AMERICA	BOT	NA	SOUTH_AMERICA	NA	Peru	142	NA	7e380070-f762-11e1-a439-00145eb45e9a	NA	NA	19	NA	NA	NA	{“determinationfiledas”: “Yes”, “collectionkind”: “Sheet”, “gbifissue”: [“GEODETIC_DATUM_ASSUMED_WGS84”], “created”: 1415986484000, “associatedmediacount”: 1, “project”: “Picturae”, “determinationnames”: “Solanum acaule Bitter”, “plantdescription”: “Rosette forming herb; corolla purple; fruits green, turning purple-black; growing along moist muddy areas in puna vegetation close to Polylepis forest”, “subdepartment”: “Flowering Plants”, “gbifid”: 1055949175}	3955	NA	NA	NA	2013-05-19T00:00:00	NA	NA	NA	Solanaceae	7717	NA	4706	Peru	1055949175	Solanum	Solanum	2928997	WGS84	NA	NA	NA	NA	NA	South America; Peru; Ancash; Corongo	NA	NA	NA	NA	NA	Tiina Särkinen	4339588	NA	NA	60277b56-f762-11e1-a439-00145eb45e9a	NHMUK	NA	PE	1055949175	Plantae	6	NA	2019-11-29T15:42:43.569+0000	2019-11-29T16:08:20.122+0000	2019-11-29T16:08:20.122+0000	-8.571389	http://creativecommons.org/publicdomain/zero/1.0/legalcode	between km55-57 on rd from Yanac to Sihuas	-77.71803	NA	NA	5	NA	NA	NA	c3a7d94e-d9d4-469e-b724-a85252512ed1	NA	Solanales	1176	NA	NA	NA	NA	NA	NHMUK:ecatalogue:4339588	NA	NA	Tracheophyta	7707728	NA	NA	DWC_ARCHIVE	GB	19456090-b49a-11d8-abeb-b8a03c50a862	Tiina Särkinen, H. Maria Baden, Dr Erica McAlister, Diana M. Percy	4706	NA	NA	NA	NA	Solanum acaule Bitter	NA	Solanum acaule	2931061	acaule	NA	NA	2931061	ACCEPTED	SPECIES	NA	NA	NA	NA	NA	NA	NA	NA	NA	2013	2019-12-04

If you want to check how many rows and colums the data frame you have, we can use dim () to show the dimensions of the dataset.

colrow<-dim(acaule)
# use dim() to get dimension of our dataset

colrow[1]

## [1] 7229

#show row numbers

y<-colrow[2]
#show column number

In order to draw a map, we need to find the location by longitude and latitude from our dataset.

Here, we’re subsetting the data to include only those that have longitude and latitude. Many occurence records may not have geographic coordinates in your dataset, so you will need to remove those NA values from your dataframe. We’ll use the is.na() command to assess whether the data have both longitude and latitude data, and then subset them into a new data frame.

acgeo <- subset(acaule, !is.na(lon) & !is.na(lat))
#use subset() to seperate those data

#datatable(head(acgeo))
#showing the example data see that have longtitude and latitude data

dim(acgeo)

## [1] 4643  127

#showing the rows and columns for data frame

acgeo[1:4, c(1:5,7:10)] #removing locality

##   acceptedScientificName acceptedTaxonKey accessRights      adm1   adm2
## 1  Solanum acaule Bitter          2931061         <NA>     Cusco   <NA>
## 2  Solanum acaule Bitter          2931061         <NA> Catamarca Ambato
## 3  Solanum acaule Bitter          2931061         <NA>      Lima   <NA>
## 4  Solanum acaule Bitter          2931061         <NA>      Lima  Canta
##        basisOfRecord                      bibliographicCitation
## 1  HUMAN_OBSERVATION                                       <NA>
## 2 PRESERVED_SPECIMEN                                       <NA>
## 3 PRESERVED_SPECIMEN                                       <NA>
## 4 PRESERVED_SPECIMEN http://www.tropicos.org/Specimen/100991187
##   catalogNumber         class
## 1      22332967 Magnoliopsida
## 2  CORD00056696 Magnoliopsida
## 3   BM001134709 Magnoliopsida
## 4     100991187 Magnoliopsida

2.3 Making a map of the ocurrence localities

In order to make a map of our subset of Solanum acaule data that have locations, we’ll use the package {maptools}.

First, we need to draw an empty map: a variable called wrld_simpl! will give us an empty world-wide map. The wrld_simpl dataset contains rough country outlines.

library(maptools)

## Checking rgeos availability: TRUE

data(wrld_simpl)
x<-plot(wrld_simpl, xlim=c(-80,70), ylim=c(-80,30), axes=TRUE)

# empty map

Now we need to add our data points. It is important to make such maps to make sure that the points are, at least roughly, in the right location:

library(maptools)

data(wrld_simpl)
plot(wrld_simpl, xlim=c(-80,70), ylim=c(-80,30), axes=TRUE, col="light yellow")
# restore the box around the map
box()
# add the points
points(acgeo$lon, acgeo$lat, col='orange', pch=20, cex=0.75)
# plot points again to add a border, for better visibility
points(acgeo$lon, acgeo$lat, col='red', cex=0.75)

 r <- raster(system.file("external/test.grd", package="raster"))

2.4 Data cleaning

Data ‘cleaning’ is particularly important for data sourced from species distribution data warehouses such as GBIF. As we mentioned, we already know that Solanum acaule is a species that occurs in the higher parts of the Andes mountains of southern Peru, Bolivia and northern Argentina. Do we see this in our map?

We imported more than seven thousand records from GBIF, but not all of these are usable because of various issues. This species is distributed mainly in the Andes mountains of southern Peru, Bolivia and northern Argentina, but as you can see in the previous map there are records in other countries and even in Antartica! In order to clean this data we will need to:

avoid false identifications
use only specimens collected in a period relevant to the climate data we will be utilizing
have reliable locational certainty (the cleaning will depend of the species we are modelling and the data availability.

The point in Brazil (record acaule[98,]) should actually be in southern Bolivia, so this is probably a typo in the longitude value.

There are also three records that have plausible latitudes, but longitudes that are clearly wrong, as they are in the Atlantic Ocean, south of West Africa! It looks like they have a longitude that is zero. In many data-bases you will find values that are ‘zero’ where ‘no data’ was intended. The gbif function (when using the default arguments) sets coordinates that are (0, 0) to NA, but not if only one of the coordinates is listed as zero. Let’s see if we find these by searching for records with longitudes of zero.

lonzero = subset(acgeo, lon==0)
lonzero[, 1:13]

##  [1] acceptedScientificName acceptedTaxonKey       accessRights          
##  [4] adm1                   adm2                   associatedReferences  
##  [7] basisOfRecord          bibliographicCitation  catalogNumber         
## [10] class                  classKey               cloc                  
## [13] collectionCode        
## <0 rows> (or 0-length row.names)

The records might also come from very different sources, sometimes from museum collections or zoos. Be sure to choose only the kind of records you really want:

unique(acaule$basisOfRecord)#if you want to know origin of your records.

## [1] "HUMAN_OBSERVATION"  "PRESERVED_SPECIMEN" "LIVING_SPECIMEN"   
## [4] "UNKNOWN"

#You can either decide to work with specimens stored only in  scientific collections or include all records in your analysis.
acaule1 <- subset(acaule,
    basisOfRecord=="specimen" |
    basisOfRecord=="unknown" |
    basisOfRecord=="living"
    )
nrow(acaule1)

## [1] 0

2.5 Duplicate records

Many databases share data, so you also risk including duplicate records. The next function will help you to remove duplicates.

#which records are duplicates (only for the first 10 columns)?
dups <- duplicated(lonzero[, 1:10])
#remove duplicates
lonzero  <-  lonzero[dups, ]
lonzero[,1:13]

##  [1] acceptedScientificName acceptedTaxonKey       accessRights          
##  [4] adm1                   adm2                   associatedReferences  
##  [7] basisOfRecord          bibliographicCitation  catalogNumber         
## [10] class                  classKey               cloc                  
## [13] collectionCode        
## <0 rows> (or 0-length row.names)

Another approach might be to detect duplicates for the same species and some coordinates in the data, even if the records were from collections by different people or in different years (in our case, using species is redundant as we have data for only one species).

# differentiating by (sub) species
# dups2 <- duplicated(acgeo[, c('species', 'lon', 'lat')])
# ignoring (sub) species and other naming variation
dups2 <- duplicated(acgeo[, c('lon', 'lat')])
# number of duplicates
sum(dups2)

## [1] 2345

## [1] 483
# keep the records that are _not_ duplicated
acg <- acgeo[!dups2, ]

Let’s repatriate the records near Pakistan to Argentina, and remove the records in Brazil, Antarctica, and with a longitude of ‘0’.

i <- acg$lon > 0 & acg$lat > 0
acg$lon[i] <- -1 * acg$lon[i]
acg$lat[i] <- -1 * acg$lat[i]
acg <- acg[acg$lon < -50 & acg$lat > -50, ]

2.6 Cross-checking

It is important to cross-check coordinates by visual and other means. One approach is to compare the country (and lower level administrative subdivisions) of the site as specified by the records, with the country implied by the coordinates (Hijmans et al., 1999).

We should ask these two questions:

Which points (identified by their record numbers) do not match any country (that is, they are in an ocean)?
Which points have coordinates that are in a different country than listed in the ‘country’ field of the GBIF record

In the example below we use the coordinates() function from the sp package to create a spatial points data frame, and then the over() function, also from sp, to do a point-in-polygon query with the named countries polygons.

library(sp)
coordinates(acg) <- ~lon+lat
crs(acg) <- crs(wrld_simpl)
class(acg)

## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"

class(wrld_simpl)

## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

ovr <- over(acg, wrld_simpl)

head(ovr)

##   FIPS ISO2 ISO3  UN      NAME   AREA  POP2005 REGION SUBREGION     LON
## 1   PE   PE  PER 604      Peru 128000 27274266     19         5 -75.552
## 2   AR   AR  ARG  32 Argentina 273669 38747148     19         5 -65.167
## 3   PE   PE  PER 604      Peru 128000 27274266     19         5 -75.552
## 4   PE   PE  PER 604      Peru 128000 27274266     19         5 -75.552
## 5   PE   PE  PER 604      Peru 128000 27274266     19         5 -75.552
## 6   PE   PE  PER 604      Peru 128000 27274266     19         5 -75.552
##       LAT
## 1  -9.326
## 2 -35.377
## 3  -9.326
## 4  -9.326
## 5  -9.326
## 6  -9.326

cntr <- ovr$NAME

i <- which(is.na(cntr))
i

## integer(0)

## integer(0)
j <- which(cntr != acg$country)
# for the mismatches, bind the country names of the polygons and points
cbind(cntr, acg$country)[j,]

##       cntr             
##  [1,] "27"  "Argentina"
##  [2,] "27"  "Argentina"
##  [3,] "27"  "Argentina"
##  [4,] "172" "Bolivia"  
##  [5,] "172" "Bolivia"  
##  [6,] "172" "Bolivia"  
##  [7,] "27"  "Argentina"
##  [8,] "27"  "Argentina"
##  [9,] "172" "Bolivia"  
## [10,] "172" "Bolivia"  
## [11,] "172" "Bolivia"  
## [12,] "172" "Bolivia"  
## [13,] "172" "Bolivia"  
## [14,] "172" "Bolivia"  
## [15,] "172" "Bolivia"  
## [16,] "172" "Bolivia"  
## [17,] "172" "Bolivia"  
## [18,] "172" "Bolivia"  
## [19,] "172" "Bolivia"  
## [20,] "172" "Bolivia"  
## [21,] "172" "Bolivia"  
## [22,] "172" "Bolivia"  
## [23,] "172" "Bolivia"  
## [24,] "172" "Bolivia"  
## [25,] "172" "Bolivia"  
## [26,] "172" "Bolivia"  
## [27,] "172" "Bolivia"  
## [28,] "172" "Bolivia"  
## [29,] "172" "Bolivia"  
## [30,] "172" "Bolivia"  
## [31,] "172" "Bolivia"  
## [32,] "172" "Bolivia"  
## [33,] "172" "Bolivia"  
## [34,] "172" "Bolivia"  
## [35,] "172" "Bolivia"  
## [36,] "172" "Bolivia"  
## [37,] "172" "Bolivia"  
## [38,] "172" "Bolivia"  
## [39,] "172" "Bolivia"  
## [40,] "27"  "Peru"     
## [41,] "172" "Bolivia"  
## [42,] "172" "Bolivia"  
## [43,] "172" "Bolivia"  
## [44,] "172" "Bolivia"  
## [45,] "27"  "Peru"     
## [46,] "172" "Bolivia"  
## [47,] "172" "Bolivia"  
## [48,] "172" "Bolivia"  
## [49,] "172" "Bolivia"  
## [50,] "172" "Bolivia"  
## [51,] "172" "Bolivia"  
## [52,] "172" "Bolivia"  
## [53,] "172" "Bolivia"  
## [54,] "27"  "Peru"     
## [55,] "172" "Bolivia"  
## [56,] "27"  "Peru"     
## [57,] "27"  "Peru"     
## [58,] "27"  "Peru"     
## [59,] "31"  "Peru"     
## [60,] "65"  "Peru"     
## [61,] "65"  "Peru"     
## [62,] "143" "Peru"     
## [63,] "27"  "Argentina"
## [64,] "172" "Bolivia"  
## [65,] "172" "Bolivia"  
## [66,] "27"  "Argentina"
## [67,] "172" "Bolivia"

plot(acg)
plot(wrld_simpl, add=T, border='blue', lwd=2)
points(acg[j, ], col='red', pch=20, cex=2)

Step 3: Deal with absence and background points

Some of the early species distribution model algorithms, such as Bioclim and Domain only use ’presence’ data in the modeling process. Other methods also use ’absence’ data or ’background’ data.

3.1 Absence data

If you want to have a non-NA-value data frame for your method, here is what you could do…

First, there is a common method to delete all NA values from a data frame: we can use the na.omit() function to omit the rows that contains the NA values.

For example, given you only need longtitude and latitude values to draw your map, you can take those two columns into a data frame.

long<-acaule$lon
#extract column of longtitude
lati<-acaule$lat
#extract column of latitude

d<-cbind(long,lati)
#combine into a data frame
colnames(d)<-c("longitude","latitude")
#name data frame

dim(d)[1]

## [1] 7229

#rows number of uncleaning data frame

d[27,]

## longitude  latitude 
##        NA        NA

#there are some NA rows in this data frame

Then, you can use na.omit() to remove all NA value from this data frame.

clean_d <-na.omit(d)
#save teh cleaing data frame

dim(clean_d)[1]

## [1] 4643

#rows number of cleaning data frame

clean_d[27,]

## longitude  latitude 
## -67.71306 -28.97167

#NA have been removed from the previous row.

Another simple way to do this is to use the filter() function in the {dplyr} package.

For example:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:raster':
## 
##     intersect, select, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

new_d<-filter(acaule, lon !="NA" & lat !="NA" )
#new variable to store the data frame.
summary(is.na(new_d$lon))

##    Mode   FALSE 
## logical    4643

#use is.na() to check if there is remaining NA in longitude column.

3.2 Background data

Background data do not attempt to guess at absence locations, but rather characterize environments in the study region. In this sense, background is the same, irrespective of where the species has been found.

Background data establishes the environmental domain of the study, whilst presence data should establish under which conditions a species is more likely to be present than on average. A closely related but different concept, that of “pseudo-absences”, is also used for generating the non-presence class for logistic models. In this case, researchers sometimes try to guess where absences might occur – they may sample the whole region except at presence locations, or they might sample at places unlikely to be suitable for the species. We prefer the background concept because it requires fewer assumptions and has some coherent statistical methods for dealing with the “overlap” between presence and background points.

In the example below, we first get the list of filenames with the predictor raster data (discussed in detail in the next section). We use a raster as a ‘mask’ in the randomPoints() function such that the background points are from the same geographic area, and only for places where there are values (land, in our case).

library(dismo)
# get the file names
files <- list.files(path=paste(system.file(package="dismo"), '/ex',
                       sep=''),  pattern='grd',  full.names=TRUE )
# we use the first file to create a RasterLayer
mask <- raster(files[1])
# select 500 random points

set.seed(1963)
# set seed to assure that the examples will always have the same random sample.

bg <- randomPoints(mask, 500 )

plot(!is.na(mask), legend=FALSE)

points(bg, cex=0.5)

# now we repeat the sampling, but limit the area of sampling using a spatial extent
e <- extent(-80, -53, -39, -22)
bg2 <- randomPoints(mask, 50, ext=e)
#limit point for longitude -39~-22, latitude -80~-53

plot(!is.na(mask), legend=FALSE)
plot(e, add=TRUE, col='red')
points(bg2, cex=0.5)

There are several approaches one could use to sample ‘pseudo-absence’ points from a more restricted area than ‘background’.

We first read the cleaned and subsetted S. acaule data that we produced in the previous chapter from the csv file that comes with {dismo}:

file <- paste(system.file(package="dismo"), '/ex/acaule.csv', sep='')
ac <- read.csv(file)
#load the cleaned data file

coordinates(ac) <- ~lon+lat
projection(ac) <- CRS('+proj=longlat +datum=WGS84')
#create a Spatial Points Data Frame

We first create a ‘circles’ model (see the chapter about geographic models), using an arbitrary radius of 50 km. Circles() is a function from {rgeos} package.

library(rgeos)

## rgeos version: 0.5-2, (SVN revision 621)
##  GEOS runtime version: 3.7.2-CAPI-1.11.2 
##  Linking to sp version: 1.3-1 
##  Polygon checking: TRUE

x <- circles(ac, d=50000, lonlat=TRUE)
# circles with a radius of 50 km
pol <- polygons(x)

samp1 <- spsample(pol, 250, type='random', iter=25)
# sample randomly from all circles

cells <- cellFromXY(mask, samp1)
length(cells)

## [1] 250

cells <- unique(cells)
# get unique cells
length(cells)

## [1] 161

xy <- xyFromCell(mask, cells)
plot(pol, axes=TRUE)
points(xy, cex=0.75, pch=20, col='blue')

#make a plot to see results

A similar results could be produced via raster function extract()

# extract cell numbers for the circles
v <- extract(mask, x@polygons, cellnumbers=T)
# use rbind to combine the elements in list v
v <- do.call(rbind, v)
# get unique cell numbers from which you could sample
v <- unique(v[,1])
head(v)

## [1] 15531 15717 17581 17582 17765 17767

# to display the results

m <- mask
m[] <- NA
m[v] <- 1
plot(m, ext=extent(x@polygons)+1)
plot(x@polygons, add=T)

# make a plot for results,

Step 4: Using the R package {raster} to analyze data

4.1 Raster data

A raster consists of a matrix of cells (or pixels) organized into rows and columns (or a grid) where each contains a value representing information. Raster format represents real-world phenomena, so data might be discrete or continuous. Discrete data: also are called thematic, categorical or discontinuous data. A discrete object has known and definable boundaries.

In species distribution modeling, predictor variables are typically organized as raster (grid) type files. Each predictor should be a ‘raster’ representing a variable of interest. Variables can include climatic, soil, terrain, vegetation, land use, and other variables.

These data are typically stored in files in some kind of GIS format. Almost all relevant formats can be used (including ESRI grid, geoTiff, netCDF, IDRISI).

For any particular study the layers should all have the same spatial extent, resolution, origin, and projection. The set of predictor variables (rasters) can be used to make a RasterStack, which is a collection of RasterLayer objects

A RasterLayer object represents single-layer (variable) raster data. A RasterLayer object always stores a number of fundamental parameters that describe it. These include the number of columns and rows, the spatial extent, and the Coordinate Reference System. In addition, a RasterLayer can store information about the file in which the raster cell values are stored (if there is such a file). A RasterLayer can also hold the raster cell values in memory.

Using the {dismo} package, we are going to create a RasterStack from the list of files that are intalled with the package.

Fist, we need to get the folder with our files…..

path <- file.path(system.file(package="dismo"), 'ex')

Now get the names of all the files with extension .grd in this folder. The $ sign in the code below indicates that the files must end with the characters ‘grd’. By using full.names=TRUE, the full path names are returned.

library(dismo)
files <- list.files(path, pattern='grd$', full.names=TRUE )
files

## [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio1.grd" 
## [2] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio12.grd"
## [3] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio16.grd"
## [4] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio17.grd"
## [5] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio5.grd" 
## [6] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio6.grd" 
## [7] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio7.grd" 
## [8] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/bio8.grd" 
## [9] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/dismo/ex/biome.grd"

Now create a RasterStack of predictor variables.

predictors <- stack(files)
predictors

## class      : RasterStack 
## dimensions : 192, 186, 35712, 9  (nrow, ncol, ncell, nlayers)
## resolution : 0.5, 0.5  (x, y)
## extent     : -125, -32, -56, 40  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## names      : bio1, bio12, bio16, bio17, bio5, bio6, bio7, bio8, biome 
## min values :  -23,     0,     0,     0,   61, -212,   60,  -66,     1 
## max values :  289,  7682,  2458,  1496,  422,  242,  461,  323,    14

names(predictors)

## [1] "bio1"  "bio12" "bio16" "bio17" "bio5"  "bio6"  "bio7"  "bio8"  "biome"

plot(predictors)

We can also make a plot of a single layer in a RasterStack, and plot some additional data on top of it.

To so, we’ll first get the world boundaries again, along with data on Bradypus:

Figure 5. Bradypus sp.

library(maptools)
data(wrld_simpl)
file <- paste(system.file(package="dismo"), "/ex/bradypus.csv", sep="")
bradypus <- read.table(file,  header=TRUE,  sep=',')
# we do not need the first column
bradypus  <- bradypus[,-1]

Plotting bioclimatic variables from the WorldClim database:

# first layer of the RasterStack
plot(predictors, 1)
# note the "add=TRUE" argument with plot
plot(wrld_simpl, add=TRUE)
# with the points function, "add" is implicit
points(bradypus, col='blue')

4.2 Extracting values from rasters

We now have a set of predictor climatic variables (rasters) and occurrence points. The next step is to extract the values of the predictors at the locations of the points. (This step can be skipped for the modeling methods that are implemented in the {dismo} package). This is a very straightforward thing to do using the extract() function from the {raster} package.

presvals <- extract(predictors, bradypus)
# setting random seed to always create the same
# random set of points for this example
set.seed(0)
#Set the seed of R's random number generator, which is useful for creating simulations or random objects that can be reproduced

backgr <- randomPoints(predictors, 500)
#setting for backgroud

absvals <- extract(predictors, backgr)
#extract values

pb <- c(rep(1, nrow(presvals)), rep(0, nrow(absvals)))

sdmdata <- data.frame(cbind(pb, rbind(presvals, absvals)))

sdmdata[,'biome'] = as.factor(sdmdata[,'biome'])

head(sdmdata)

##   pb bio1 bio12 bio16 bio17 bio5 bio6 bio7 bio8 biome
## 1  1  263  1639   724    62  338  191  147  261     1
## 2  1  263  1639   724    62  338  191  147  261     1
## 3  1  253  3624  1547   373  329  150  179  271     1
## 4  1  243  1693   775   186  318  150  168  264     1
## 5  1  243  1693   775   186  318  150  168  264     1
## 6  1  252  2501  1081   280  326  154  172  270     1

tail(sdmdata)

##     pb bio1 bio12 bio16 bio17 bio5 bio6 bio7 bio8 biome
## 611  0  265  1622   749    66  344  184  160  266     1
## 612  0  241  1648   475   354  294  183  112  229     1
## 613  0  154   731   250    80  319   18  301  198     8
## 614  0  169  1195   316   273  293   69  224  122     7
## 615  0  247  2965  1173   327  314  168  146  253     1
## 616  0  117   495   231    36  336  -96  432  216     8

summary(sdmdata)

##        pb              bio1           bio12            bio16       
##  Min.   :0.0000   Min.   : 16.0   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:0.0000   1st Qu.:176.8   1st Qu.: 821.5   1st Qu.: 341.8  
##  Median :0.0000   Median :242.0   Median :1497.0   Median : 645.5  
##  Mean   :0.1883   Mean   :213.9   Mean   :1629.5   Mean   : 656.4  
##  3rd Qu.:0.0000   3rd Qu.:261.0   3rd Qu.:2266.8   3rd Qu.: 917.0  
##  Max.   :1.0000   Max.   :286.0   Max.   :7682.0   Max.   :2458.0  
##                                                                    
##      bio17             bio5            bio6             bio7      
##  Min.   :   0.0   Min.   : 88.0   Min.   :-167.0   Min.   : 68.0  
##  1st Qu.:  42.0   1st Qu.:303.0   1st Qu.:  45.0   1st Qu.:118.8  
##  Median : 110.0   Median :320.0   Median : 151.0   Median :160.0  
##  Mean   : 169.2   Mean   :309.3   Mean   : 119.3   Mean   :190.0  
##  3rd Qu.: 240.5   3rd Qu.:333.0   3rd Qu.: 202.2   3rd Qu.:247.0  
##  Max.   :1496.0   Max.   :410.0   Max.   : 232.0   Max.   :440.0  
##                                                                   
##       bio8           biome    
##  Min.   :-20.0   1      :299  
##  1st Qu.:215.0   7      : 65  
##  Median :250.0   8      : 61  
##  Mean   :225.1   13     : 51  
##  3rd Qu.:262.0   2      : 49  
##  Max.   :323.0   4      : 33  
##                  (Other): 58

To visually investigate colinearity in the environmental data (at the presence and background points) you can use a pairs plot.

pairs(sdmdata[,2:5], cex=0.1)

Step 5: Picking models

A large number of algorithms has been used in species distribution modeling. They can be classified as ‘profile’, ‘regression’, and ‘machine learning’ methods. Profile methods only consider ‘presence’ data, not absence nor background data. Regression and machine learning methods use both presence and absence or background data. The distinction between regression and machine learning methods is not sharp, but it is perhaps still useful as way to classify models. Another distinction that one can make is between presence-only and presence-absence models. Profile methods are always presence-only, other methods can be either, depending if they are used with survey-absence or with pseudo-absence/background data. An entirely different class of models consists of models that only, or primarily, use the geographic location of known occurrences, and do not rely on the values of predictor variables at these locations. We refer to these models as ‘geographic models’.

Figure 2. A hierarchical structure of the 12 SDM algorithms commonly used. (Pecchi et al., 2019)

To make a decision on which algorithm to use, you must first know the type of data we have.

Presence only (e.g. BIOCLIM)
Presence/absence (e.g. ANN)
Presence/pseudo-absence (e.g. GARP)
Presence/environment(background) (e.g. Maxent).

Platforms with multiple algorithms:

Biomod: http://cran.r-project.org/web/packages/biomod2/index.html

Dismo: http://cran.r-project.org/web/packages/dismo/index.html

OpenModeller: http://openmodeller.sourceforge.net/

ModEco: http://gis.ucmerced.edu/ModEco/

Div-GIS: http://www.diva-gis.org

Profile model:

Bioclim, Domain, and Mahal are profile methods.These are implemented in the {dismo} package, and the procedures to use these models are the same for all three.

Regression model:

We already learned classic regression modeling earlier in this class. It is very similar here. Models are fit using maximum likelihood and by allowing the model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value

Machine learning models:

Machine learning models are non-parametric flexible regression models. Methods include Artifical Neural Networks (ANN), Random Forests, Boosted Regression Trees, and Support Vector Machines. Through the dismo package you can also use the Maxent program, that implements the most widely used method (maxent) in species distribution modeling.

Example: Generalized Linear Models

A generalized linear model (GLM) is a generalization of ordinary least squares regression, which we’ve also looked at in an earlier class. Models are fit using maximum likelihood and by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. Depending on how a GLM is specified it can be equivalent to (multiple) linear regression, logistic regression or Poisson regression

In R, GLMs may be implemented using the glm() function, and the link function and error distribution are specified with the family argument. Examples include:

*family = binomial(link = “logit”)

*family = poisson(link = “log”)

Here we fit two basic glm models. The models need to be fit with presence and absence (background) data. With the exception of ‘maxent’, we cannot fit the model with a RasterStack and points. Instead, we need to extract the environmental data values ourselves, and fit the models with these values.

#first example
gm1 <- glm(pa ~ bio1 + bio5 + bio6 + bio7 + bio8 + bio12 + bio16 + bio17,
            family = binomial(link = "logit"), data=envtrain)
summary(gm1)

## 
## Call:
## glm(formula = pa ~ bio1 + bio5 + bio6 + bio7 + bio8 + bio12 + 
##     bio16 + bio17, family = binomial(link = "logit"), data = envtrain)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.73157  -0.48749  -0.23064  -0.05455   2.90207  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  2.4189152  1.6563006   1.460  0.14417   
## bio1         0.1083418  0.0609904   1.776  0.07567 . 
## bio5         0.1010893  0.2526065   0.400  0.68902   
## bio6        -0.1941921  0.2545272  -0.763  0.44549   
## bio7        -0.1875595  0.2530838  -0.741  0.45864   
## bio8        -0.0180964  0.0275002  -0.658  0.51051   
## bio12        0.0018953  0.0007034   2.695  0.00705 **
## bio16       -0.0019137  0.0014796  -1.293  0.19586   
## bio17       -0.0042036  0.0015431  -2.724  0.00645 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 596.69  on 892  degrees of freedom
## Residual deviance: 440.50  on 884  degrees of freedom
## AIC: 458.5
## 
## Number of Fisher Scoring iterations: 8

coef(gm1)

##  (Intercept)         bio1         bio5         bio6         bio7 
##  2.418915241  0.108341785  0.101089299 -0.194192095 -0.187559462 
##         bio8        bio12        bio16        bio17 
## -0.018096425  0.001895285 -0.001913743 -0.004203616

#second example
gm2 <- glm(pa ~ bio1+bio5 + bio6 + bio7 + bio8 + bio12 + bio16 + bio17,
            family = gaussian(link = "identity"), data=envtrain)
evaluate(testpres, testbackg, gm1)

## class          : ModelEvaluation 
## n presences    : 23 
## n absences     : 200 
## AUC            : 0.7938043 
## cor            : 0.2640135 
## max TPR+TNR at : -2.844012

ge2 <- evaluate(testpres, testbackg, gm2)
ge2

## class          : ModelEvaluation 
## n presences    : 23 
## n absences     : 200 
## AUC            : 0.7801087 
## cor            : 0.2952745 
## max TPR+TNR at : 0.05090658

#show as a plot graph
pg <- predict(predictors, gm2, ext=ext)
par(mfrow=c(1,2))
plot(pg, main='GLM/gaussian, raw values')
plot(wrld_simpl, add=TRUE, border='dark grey')
tr <- threshold(ge2, 'spec_sens')
plot(pg > tr, main='presence/absence')
plot(wrld_simpl, add=TRUE, border='dark grey')
points(pres_train, pch='+')
points(backg_train, pch='-', cex=0.25)

Step 6: Model fitting, prediction, evaluation

6.1 Model fitting

Model fitting is the heart of any SDM application. Many different algorithms are available (Elith et al., 2006), and often several algorithms are combined into ensemble models or several candidate models with different candidate predictor sets are averaged (Hastie, Tibshirani, and Friedman, 2009). The decisions on these matters should have been made during the conceptualisation phase. Important aspects to consider during the model fitting step are: how to deal with multicollinearity in the environmental data? How many variables should be included in the model (without overfitting) and how should we select these? Which model settings should be used? When multiple model algorithms or candidate models are fitted, how to select the final model or average the models? Do we need to test or correct for non-independence in the data (spatial or temporal autocorrelation, nested data)? If the goal is to derive binary predictions, which threshold should be used?

Model fitting is technically quite similar across the modeling methods that exist in R. Most methods take a formula identifying the dependent and independent variables, accompanied with a data.frame that holds these variables

m1 <- glm(pb ~ bio1 + bio5 + bio12, data=sdmdata)
class(m1)

## [1] "glm" "lm"

summary(m1)

## 
## Call:
## glm(formula = pb ~ bio1 + bio5 + bio12, data = sdmdata)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.95534  -0.23380  -0.09987   0.07831   0.88796  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.550e-01  1.093e-01   1.418 0.156593    
## bio1         1.621e-03  3.960e-04   4.093 4.84e-05 ***
## bio5        -1.614e-03  4.725e-04  -3.417 0.000676 ***
## bio12        1.140e-04  1.676e-05   6.806 2.39e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1220413)
## 
##     Null deviance: 94.156  on 615  degrees of freedom
## Residual deviance: 74.689  on 612  degrees of freedom
## AIC: 458.43
## 
## Number of Fisher Scoring iterations: 2

m2 = glm(pb ~ ., data=sdmdata)
m2

## 
## Call:  glm(formula = pb ~ ., data = sdmdata)
## 
## Coefficients:
## (Intercept)         bio1        bio12        bio16        bio17  
##   0.2644272   -0.0020029    0.0004336   -0.0006207   -0.0007843  
##        bio5         bio6         bio7         bio8       biome2  
##   0.0035996   -0.0025153   -0.0041593    0.0008508   -0.0899869  
##      biome3       biome4       biome5       biome7       biome8  
##  -0.1262125   -0.0699228   -0.0136612   -0.1993102    0.0057666  
##      biome9      biome10      biome11      biome12      biome13  
##   0.0465949    0.0055868   -0.2705639   -0.0450194    0.0839402  
##     biome14  
##  -0.1689054  
## 
## Degrees of Freedom: 615 Total (i.e. Null);  595 Residual
## Null Deviance:       94.16 
## Residual Deviance: 69.17     AIC: 445.1

bc <- bioclim(presvals[,c('bio1', 'bio5', 'bio12')])
class(bc)

## [1] "Bioclim"
## attr(,"package")
## [1] "dismo"

bc

## class    : Bioclim 
## 
## variables: bio1 bio5 bio12 
## 
## 
## presence points: 116 
##    bio1 bio5 bio12
## 1   263  338  1639
## 2   263  338  1639
## 3   253  329  3624
## 4   243  318  1693
## 5   243  318  1693
## 6   252  326  2501
## 7   240  317  1214
## 8   275  335  2259
## 9   271  327  2212
## 10  274  329  2233
##   (... ...  ...)

pairs(bc)

6.2 Model prediction

Different modeling methods return different type of ‘model’ objects (typically they have the same name as the modeling method used). All of these ‘model’ objects, irrespective of their exact class, can be used to with the predict function to make predictions for any combination of values of the independent variables.

bio1 = c(40, 150, 200)
bio5 = c(60, 115, 290)
bio12 = c(600, 1600, 1700)
pd = data.frame(cbind(bio1, bio5, bio12))
pd

##   bio1 bio5 bio12
## 1   40   60   600
## 2  150  115  1600
## 3  200  290  1700

predict(m1, pd)

##         1         2         3 
## 0.1914136 0.3949569 0.2048905

predict(bc, pd)

Making such predictions for a few environments can be very useful to explore and understand model predictions. For example, it can be used in the response function that creates response plots for each variable, with the other variables at their median value.

response(bc)

predictors <- stack(list.files(file.path(system.file(package="dismo"), 'ex'), pattern='grd$', full.names=TRUE ))
names(predictors)

## [1] "bio1"  "bio12" "bio16" "bio17" "bio5"  "bio6"  "bio7"  "bio8"  "biome"

p <- predict(predictors, m1)
plot(p)

6.3 Model evaluation

Most model types have different measures that can help to assess how well the model fits the data. Most statistics or machine learning texts will provide some details on these methods. For instance, for a GLM one can look at how much deviance is explained, whether there are patterns in the residuals, whether there are points with high leverage and so on. However, since many models are to be used for prediction, much evaluation is focused on how well the model predicts to points not used in model training.

What do we consider to evaluate a model?

Does the model seem sensible, ecologically?

Do the fitted functions (the shapes of the modeled relationships) make sense?

Do the predictions seem reasonable? (map them, and think about them)?

Are there any spatial patterns in model residuals?

Different measures can be used to evaluate the quality of a prediction, perhaps depending on the goal of the study. Many measures for evaluating models based on presence-absence or presence-only data are ‘threshold dependent’. That means that a threshold must be set first (e.g., 0.5, though 0.5 is rarely a sensible choice – e.g. see Lui et al., 2005). Predicted values above that threshold indicate a prediction of ‘presence’, and values below the threshold indicate ‘absence’. Some measures emphasize the weight of false absences; others give more weight to false presences.

Commonly used statistics that are threshold independent are the correlation coefficient and the Area Under the Receiver Operator Curve (AUROC, generally further abbreviated to AUC). AUC is a measure of rank-correlation. In unbiased data, a high AUC indicates that sites with high predicted suitability values tend to be areas of known presence and locations with lower model prediction values tend to be areas where the species is not known to be present (absent or a random point). An AUC score of 0.5 means that the model is as good as a random guess.

For a discussion on the use of AUC in the context of presence-only rather than presence/absence data.

Here we illustrate the computation of the correlation coefficient and AUC with two random variables. p (presence) has higher values, and represents the predicted value for 50 known cases (locations) where the species is present, and a (absence) has lower values, and represents the predicted value for 50 known cases (locations) where the species is absent.

p <- rnorm(50, mean=0.7, sd=0.3)
a <- rnorm(50, mean=0.4, sd=0.4)
par(mfrow=c(1, 2))
plot(sort(p), col='red', pch=21)
points(sort(a), col='blue', pch=24)
legend(1, 0.95 * max(a,p), c('presence', 'absence'),
          pch=c(21,24), col=c('red', 'blue'))
comb <- c(p,a)
group <- c(rep('presence', length(p)), rep('absence', length(a)))
boxplot(comb~group, col=c('blue', 'red'))

We created two variables with random normally distributed values, but with different mean and standard deviation. The two variables clearly have different distributions, and the values for ‘presence’ tend to be higher than for ‘absence’. Here is how you can compute the correlation coefficient and the AUC:

group = c(rep(1, length(p)), rep(0, length(a)))
cor.test(comb, group)$estimate

##       cor 
## 0.4224299

mv <- wilcox.test(p,a)
auc <- as.numeric(mv$statistic) / (length(p) * length(a))
auc

## [1] 0.7368

Below we show how you can compute these, and other statistics more conveniently, with the evaluate() function in the {dismo} package. See ?evaluate for information on additional evaluation measures that are available. ROC/AUC can also be computed with the {ROCR} package.

e <- evaluate(p=p, a=a)
class(e)

## [1] "ModelEvaluation"
## attr(,"package")
## [1] "dismo"

par(mfrow=c(1, 2))
density(e)
boxplot(e, col=c('blue', 'red'))

Back to some real data, presence-only in this case. We’ll divide the data in two random sets, one for training a Bioclim model, and one for evaluating the model.

samp <- sample(nrow(sdmdata), round(0.75 * nrow(sdmdata)))
traindata <- sdmdata[samp,]
traindata <- traindata[traindata[,1] == 1, 2:9]
testdata <- sdmdata[-samp,]
bc <- bioclim(traindata)
e <- evaluate(testdata[testdata==1,], testdata[testdata==0,], bc)
e

## class          : ModelEvaluation 
## n presences    : 25 
## n absences     : 129 
## AUC            : 0.756124 
## cor            : 0.2518402 
## max TPR+TNR at : 0.01088901

plot(e, 'ROC')

Let’s first create presence and background data.

pres <- sdmdata[sdmdata[,1] == 1, 2:9]
back <- sdmdata[sdmdata[,1] == 0, 2:9]

The background data will only be used for model testing and does not need to be partitioned. We now partition the data into 5 groups:

k <- 5
group <- kfold(pres, k)
group[1:10]

##  [1] 3 2 2 4 3 2 3 1 1 2

unique(group)

## [1] 3 2 4 1 5

Now we can fit and test our model five times. In each run, the records corresponding to one of the five groups is only used to evaluate the model, while the other four groups are only used to fit the model. The results are stored in a list called ‘e’.

e <- list()
for (i in 1:k) {
    train <- pres[group != i,]
    test <- pres[group == i,]
    bc <- bioclim(train)
    e[[i]] <- evaluate(p=test, a=back, bc)
}

We can extract several things from the objects in e, but let’s restrict ourselves to the AUC values and the “maximum of the sum of the sensitivity (true positive rate) and specificity (true negative rate)” threshold “spec_sens” (this is sometimes uses as a threshold for setting cells to presence or absence).

auc <- sapply(e, function(x){x@auc})
auc

## [1] 0.7800000 0.7756522 0.8173750 0.7817826 0.7738696

mean(auc)

## [1] 0.7857359

sapply( e, function(x){ threshold(x)['spec_sens'] } )

## $spec_sens
## [1] 0.04291075
## 
## $spec_sens
## [1] 0.08592151
## 
## $spec_sens
## [1] 0.01076957
## 
## $spec_sens
## [1] 0.04291075
## 
## $spec_sens
## [1] 0.03215806

The use of AUC in evaluating SDMs has been criticized (Lobo et al., 2008; Jiménez-Valverde, 2011). A particularly sticky problem is that the values of AUC vary with the spatial extent used to select background points. Generally, the larger that extent, the higher the AUC value. Therefore, AUC values are generally biased and cannot be directly compared. Hijmans (2012) suggests that one could remove “spatial sorting bias” (the difference between the distance from testing-presence to training-presence and the distance from testing-absence to training-presence points) through “point-wise distance sampling”.

file <- file.path(system.file(package="dismo"), "ex/bradypus.csv")
bradypus <- read.table(file,  header=TRUE,  sep=',')
bradypus <- bradypus[,-1]
presvals <- extract(predictors, bradypus)
set.seed(0)
backgr <- randomPoints(predictors, 500)
nr <- nrow(bradypus)
s <- sample(nr, 0.25 * nr)
pres_train <- bradypus[-s, ]
pres_test <- bradypus[s, ]
nr <- nrow(backgr)
set.seed(9)
s <- sample(nr, 0.25 * nr)
back_train <- backgr[-s, ]
back_test <- backgr[s, ]

sb <- ssb(pres_test, back_test, pres_train)
sb[,1] / sb[,2]

##         p 
## 0.1000638

sb[,1] / sb[,2] is an indicator of spatial sorting bias (SSB). If there is no SSB this value should be 1, in these data it is close to zero, indicating that SSB is very strong. Let’s create a subsample in which SSB is removed.

i <- pwdSample(pres_test, back_test, pres_train, n=1, tr=0.1)
pres_test_pwd <- pres_test[!is.na(i[,1]), ]
back_test_pwd <- back_test[na.omit(as.vector(i)), ]
sb2 <- ssb(pres_test_pwd, back_test_pwd, pres_train)
sb2[1]/ sb2[2]

## [1] 1.004106

Spatial sorting bias is much reduced now; notice how the AUC dropped!

bc <- bioclim(predictors, pres_train)
evaluate(bc, p=pres_test, a=back_test, x=predictors)

## class          : ModelEvaluation 
## n presences    : 29 
## n absences     : 125 
## AUC            : 0.757931 
## cor            : 0.2777298 
## max TPR+TNR at : 0.03438276

evaluate(bc, p=pres_test_pwd, a=back_test_pwd, x=predictors)

## class          : ModelEvaluation 
## n presences    : 19 
## n absences     : 19 
## AUC            : 0.5914127 
## cor            : 0.1487003 
## max TPR+TNR at : 0.09185402

References

Hijmans RJ, Elith J (2012) Species distribution modeling with R. http://cran.r-project.org/web/packages/dismo/vignettes/dm.pdf.
Elith, J., C.H. Graham, R.P. Anderson, M. Dudik, S. Ferrier, A. Guisan, R.J. Hijmans, F. Huettmann, J. Leathwick, A. Lehmann, J. Li, L.G. Lohmann, B. Loiselle, G. Manion, C. Moritz, M. Nakamura, Y. Nakazawa, J. McC. Overton, A.T. Peterson, S. Phillips, K. Richardson, R. Scachetti-Pereira, R. Schapire, J. Soberon, S. Williams, M. Wisz and N. Zimmerman, 2006. Novel methods improve prediction of species’ distributions from occurrence data. Ecography 29: 129-151.

AN597 F19 Module - Species Distribution Modeling in R

Frank Azorsa Salazar, MaJo Salazar Nicholls, Feiya Wang