In this section you will learn how to prepare the data to train
models using SDMtune
. We will use the
virtualSp
dataset included in the package and environmental
predictors from the WorldClim
dataset.
First let’s load some packages that we will use to visualize the data:
library(ggplot2) # To plot locations
library(maps) # To access useful maps
library(rasterVis) # To plot raster objects
We use the climate data of WorldClim version 1.4 (Hijmans et al. 2005) and the terrestrial
ecoregions from WWF (Olson et al.
2001) included in the dismo
package:
<- list.files(path = file.path(system.file(package = "dismo"), "ex"),
files pattern = "grd",
full.names = TRUE)
We convert these files in a raster
object that will be
used later in the analysis:
<- terra::rast(files) predictors
There are nine environmental variables, eight continuous and one categorical:
names(predictors)
We can plot bio1 using the gplot
function from the rasterVis
package:
gplot(predictors$bio1) +
geom_tile(aes(fill = value)) +
coord_equal() +
scale_fill_gradientn(colours = c("#2c7bb6", "#abd9e9", "#ffffbf", "#fdae61", "#d7191c"),
na.value = "transparent",
name = "°C x 10") +
labs(title = "Annual Mean Temperature",
x = "longitude",
y = "latitude") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank())
Let’s load the SDMtune package:
library(SDMtune)
For demonstrating how to use SDMtune
we use the random
generated virtualSp
dataset included in the package.
help(virtualSp)
<- virtualSp$presence
p_coords <- virtualSp$background bg_coords
Plot the study area together with the presence locations:
ggplot(map_data("world"), aes(long, lat)) +
geom_polygon(aes(group = group), fill = "grey95", color = "gray40", size = 0.2) +
geom_jitter(data = p_coords, aes(x = x, y = y), color = "red",
alpha = 0.4, size = 1) +
labs(x = "longitude", y = "latitude") +
theme_minimal() +
theme(legend.position = "none") +
coord_fixed() +
scale_x_continuous(limits = c(-125, -32)) +
scale_y_continuous(limits = c(-56, 40))
To plot the background locations run the following code:
ggplot(map_data("world"), aes(long, lat)) +
geom_polygon(aes(group = group), fill = "grey95", color = "gray40", size = 0.2) +
geom_jitter(data = as.data.frame(bg_coords), aes(x = x, y = y),
color = "blue", alpha = 0.4, size = 0.5) +
labs(x = "longitude", y = "latitude") +
theme_minimal() +
theme(legend.position = "none") +
coord_fixed() +
scale_x_continuous(limits = c(-125, -32)) +
scale_y_continuous(limits = c(-56, 40))
Before training a model we have to prepare the data in the correct
format. The prepareSWD
function creates an SWD
object that stores the species’ name, the coordinates of the species at
presence and absence/background locations and the value of the
environmental variables at the locations. The argument
categorical
indicates which environmental variables are
categorical. In our example biome is categorical (we
can pass a vector if we have more than one categorical environmental
variable). The function extracts the value of the environmental
variables for each location and excludes those locations that have
NA
value for at least one environmental variable.
<- prepareSWD(species = "Virtual species",
data p = p_coords,
a = bg_coords,
env = predictors,
categorical = "biome")
Let’s have a look at the SWD
object that we have just
created:
data
When we print an SWD
object we get a bunch of
information:
The object contains four slots: @species
,
@coords
@data
and @pa
.
@pa
contains a vector with 1 for presence and 0 for
absence/background locations. To visualize the data we run:
head(data@data)
We can visualize the coordinates with:
head(data@coords)
or the name of the species with:
@species data
We can save the SWD
object in a .csv
file using the function swd2csv
(the function saves the
file in the working directory). There are two possibilities:
swd2csv(data,
file_name = "data.csv")
swd2csv(data,
file_name = c("presence.csv", "background.csv"))
SDMtune
supports four methods for model training:
nnet
package (Venables and Ripley
2002);gbm
package (Greenwell et al.
2019);dismo
package (Hijmans et al. 2017);maxnet
package (Phillips 2017);randomForest
package (Liaw and
Wiener 2002).The code necessary to train a model is the same for all the
implementations. We will show how to train a Maxent
model, you can adapt the code for the other methods or check the
presence absence models
vignette.
We use the function train
to train a
Maxent model. We need to provide two arguments:
method
: “Maxent” in our case;data
: the SWD
object with the presence and
background locations.<- train(method = "Maxent",
model data = data)
The function trains the model using default settings that are:
We will see later how to change the default settings, for the moment
let’s have a look at the model
object.
The output of the function train
is an object of class
SDMmodel
. Let’s print the model
object we’ve
just created:
model
When we print an SDMmodel
object we get the following
information:
An SDMmodel
object has two slots:
slotNames(model)
SWD
object with the presence
absence/background locations used to train the model;Maxent
object, in our case,
with all the model configurations.The slot model
contains the configurations of the model
plus other information used to make predictions.
slotNames(model@model)
The most important are: fc, reg and iter that contain the values of the model configuration.
The function train()
accepts optional arguments that can
be used to change the default model settings. In our previous example we
could have trained the same model using:
<- train(method = "Maxent",
model data = data,
fc = "lqph",
reg = 1,
iter = 500)
In the following example we train a model using linear and hinge as feature class combination, 0.5 as regularization multiplier and 700 iterations:
<- train(method = "Maxent",
model data = data,
fc = "lh",
reg = 0.5,
iter = 700)
By default Maxent models are trained using the arguments
“removeduplicates=false” and
“addsamplestobackground=false”. The user should have the full
control of the data used to train the model, so is expected that
duplicated locations are already removed and that the presence locations
are already included in the background locations, when desired. You can
use the function thinData
to remove duplicated locations
and the function addSamplesToBg
to add the presence
locations to the background locations.
New locations are predicted with the function predict
.
The function takes three main arguments:
SDMmodel
object;data.frame
, an SWD
object or a raster
object);Next we get the prediction for our training locations using the cloglog output type:
<- predict(model,
pred data = data,
type = "cloglog")
The output in this case is a vector containing all the predicted values for the training locations:
head(pred)
We can get the prediction only for the presence location with:
<- data@data[data@pa == 1, ]
p <- predict(model,
pred data = p,
type = "cloglog")
tail(pred)
For models trained with the Maxent method, the function performs the prediction in R without calling the MaxEnt Java software. This results in a faster computation for large datasets and might result in a slightly different output compared to the Java software.
We can use the same function to create a distribution map starting
from the predictors
raster object:
<- predict(model,
map data = predictors,
type = "cloglog")
In this case the output is a raster object:
map
The map can be saved in a file directly when running the prediction,
we just have to pass additional arguments to the predict
function. In the next example we save the map in a file called
“my_file” in the GeoTIFF format:
<- predict(model,
map data = predictors,
type = "cloglog",
file = "my_map",
format = "GTiff")
The function predict
has other arguments useful when
predicting large datasets:
"text"
to
visualize a progress bar;In the next example we restrict the prediction to Chile and plot the prediction:
# First create the extent that surrounds Chile
= terra::ext(c(-77, -60, -56, -15))
e # Now use the extent to make the prediction
<- predict(model,
map_e data = predictors,
type = "cloglog",
extent = e)
To plot the distribution map we can use the function
plotPred
:
plotPred(map)
The function plotPred
plots a map with a color ramp
similar to the one used by the MaxEnt Java software. We can pass
additional arguments to customize the map. In the next example we
provide a custom color ramp and we add a title to the legend:
plotPred(map,
lt = "Habitat\nsuitability",
colorramp = c("#2c7bb6", "#abd9e9", "#ffffbf", "#fdae61", "#d7191c"))
To plot a presence/absence map we need a threshold value that splits
the prediction in presence and absence values. The function
thresholds
returns some commonly used threshold values
starting from an SDMmodel
object. In the next example we
print the threshold values using the type "cloglog"
:
<- thresholds(model,
ths type = "cloglog")
ths
For example we want to create a presence/absence map using the
threshold that maximize the training sensitivity plus specificity. We
use the function plotPA
passing the threshold value as
argument:
plotPA(map,
th = ths[3, 2])
We can also save the map in a file with the following code:
plotPA(map,
th = ths[3, 2],
filename = "my_pa_map",
format = "GTiff")
Both functions plotPred
and plotPA
have the
argument hr
to plot the the map with high resolution, useful
when the map is used in a scientific publication.
SDMtune
implements three evaluation metrics:
We will compute the value of the metrics on the training dataset.
The AUC can be calculated using the function auc
:
auc(model)
We can also plot the ROC curve using the function
plotROC
:
plotROC(model)
The TSS is computed with the function tss
:
tss(model)
For the AICc we use the function aicc
. In this case we
need to pass to the env
argument the
predictors
raster object:
aicc(model,
env = predictors)
It’s always a good practice to split the species locations into two
parts and use one part to train the model and the remaining part to
evaluate it. We can use the trainValTest
function for this
purpose. Let’s say we want to use 80% of the species locations to train
our model and 20% as testing dataset to evaluate it:
library(zeallot) # For unpacking assignment
c(train, test) %<-% trainValTest(data,
test = 0.2,
only_presence = TRUE,
seed = 25)
Now we train the model using the train
dataset:
<- train("Maxent",
model data = train)
The only_presence
argument is used to split only the
presence and not the background locations. We can now evaluate the model
using the testing dataset that has not been used to train the model:
auc(model)
auc(model, test = test)
We can plot the ROC curve for both, training and testing datasets, with:
plotROC(model, test = test)
This approach is valid when we have a large dataset. In our case, with only few observations, the evaluation depends strongly on how we split our presence locations. Let’s run a small experiment in which we perform different train/test splits and we compute the AUC:
# Create an empty data.frame
<- data.frame(matrix(NA, nrow = 10, ncol = 3))
output colnames(output) <- c("seed", "trainAUC", "testAUC")
# Create 10 different random seeds
set.seed(25)
<- sample.int(1000, 10)
seeds
# Loop through the seeds
for (i in 1:length(seeds)) {
# Make the train/test split
c(train, test) %<-% trainValTest(data,
test = 0.2,
seed = seeds[i],
only_presence = TRUE)
# train the model
<- train("Maxent", data = train)
m
# Populate the output data.frame
1] <- seeds[i]
output[i, 2] <- auc(m)
output[i, 3] <- auc(m, test = test)
output[i,
}
# Print the output
output# compute the range of the testing AUC
range(output[, 3])
The testing AUC varies for the different train/test partitions. When we have to deal with a small dataset a better approach is the cross validation.
To perform a cross validation in SDMtune we have to
pass the fold
argument to the train
function.
First we have to create the folds. There are several way to create them,
here we explain how to make a random partition of 4 folds using the
function randomFolds
:
<- randomFolds(data,
folds k = 4,
only_presence = TRUE,
seed = 25)
The output of the function is a list containing two matrices, the
first for the training and the second for the testing locations. Each
column of one matrix represents a fold with TRUE
for the
locations included in and FALSE
excluded from the
partition.
Let’s perform a 4 fold cross validation using the Maxent method (note that we use the full dataset):
<- train("Maxent",
cv_model data = data,
folds = folds)
cv_model
The output in this case is an SDMmodelCV
object. It
contains the four trained models in the models
slot and the
fold partitions in the folds
slot. We can compute the AUC
of a SDMmodelCV
object using:
auc(cv_model)
auc(cv_model, test = TRUE)
this returns the AUC value averaged across the four different models.
The train()
function accepts folds created with two
other packages:
ENMeval
(Muscarella et al.
2014)blockCV
(Valavi et al.
2019)The function will convert internally the created folds into the
correct format for SDMtune
. These packages have specific
function to create folds partitions that are spatially or
environmentally independent.
Block partition using the ENMeval
package:
library(ENMeval)
<- get.block(occ = data@coords[data@pa == 1, ],
block_folds bg.coords = data@coords[data@pa == 0, ])
<- train(method = "Maxent",
model data = data,
fc = "l",
reg = 0.8,
folds = block_folds)
Checkerboard1 partition using the ENMeval
package:
<- get.checkerboard1(occ = data@coords[data@pa == 1, ],
cb_folds env = predictors,
bg.coords = data@coords[data@pa == 0, ],
aggregation.factor = 4)
<- train(method = "Maxent",
model data = data,
fc = "l",
reg = 0.8,
folds = cb_folds)
Environmental block using the package blockCV
:
library(blockCV)
# Create sf object
<- sf::st_as_sf(cbind(data@coords, pa = data@pa),
sf_df coords = c("X", "Y"),
crs = terra::crs(predictors,
proj = TRUE))
# Spatial blocks
<- cv_spatial(x = sf_df,
spatial_folds column = "pa",
rows_cols = c(8, 10),
k = 5,
hexagon = FALSE,
selection = "systematic")
<- train(method = "Maxent",
model data = data,
fc = "l",
reg = 0.8,
folds = spatial_folds)