Modeling Abiotic Niches of Crops and Wild Ancestors Using Deep Learning: A Generalized Approach

Introduction: Understanding which interactions and environmental factors shape the geographic distribution of species is one of the fundamental questions in ecology and evolution. Insofar as the focus is on agriculturally important species, such insight is also of applied importance. Species Distribution Modeling (SDM) comprises a spectrum of approaches for establishing correlative models of species (co-)occurrences and geospatial patterns of abiotic environmental variables.
Methods: Here, we contribute to this field by presenting a generalized approach for SDM that utilizes deep learning, which offers some improvements over current methods, and by presenting a case study on the habitat suitability of staple crops and their wild ancestors. The approach we present is implemented in a reusable software toolkit, which we apply to an extensive data set of geo-referenced occurrence records for 52 species and 59 GIS layers. We compare the habitat suitability projections for selected, major crop species with the actual extent of their current cultivation.
Results: Our results show that the approach yields especially plausible projections for species with large numbers of occurrences (>500). For the analysis of such data sets, the toolkit provides a convenient interface for using deep neural networks in SDM, a relatively novel application of deep learning. The toolkit, the data, and the results are available as open source / open access packages.
Conclusions: Species Distribution Modeling with deep learning is a promising avenue for method development. The niche projections that can be produced are plausible, and the general approach provides great flexibility for incorporating additional data such as species interactions.


INTRODUCTION
The processes and environmental factors that shape the distribution of species remain poorly understood. What constitutes a suitable habitat is only partially known, or even completely unknown, for many uncommon species. Furthermore, a number of factors, including human activities such as land transformation and climate change, have put Earth's biodiversity under stress [1], causing geographical distributions to shift [2,3] or change through degradation of habitat suitability.
Aside from the current biodiversity loss, these changes in land use and climate are also highly relevant to agriculture, as climate is an important determinant of agricultural productivity [4]. As such, changes in climate or human activities that alter the environment have the potential to decrease agricultural yields and lead to economic losses [5]. To gain insight into the current habitat suitability of a variety of crop species, this study models the global habitat suitability for a number of crops and their ancestors. These results, paired with predictions of future habitat suitability, could help mitigate economic losses and aid global food security in the future.
However, a number of challenges can arise when predicting the distribution of species. Hickisch et al. [6] elaborate several reasons that affect the availability of (spatial) biodiversity data. First, there are geographic and taxonomic sampling biases that can make it harder to obtain data on certain groups of species (e.g. species that are hard to find or identify). Some species are under-sampled because their locations are physically harder to reach. Other factors that contribute to inconsistent spatial data include policy bias, whereby more samples are recorded in areas where conservation efforts are focused. Considering these biases and the resulting limited data, occurrences of species with which the focal species interacts, or of species with similar ecological niches, can be used to help predict habitat suitability for a species of interest. This requires that the model is able to make use of biotic variables (i.e. variables that represent living organisms), in addition to the more traditional abiotic variables (variables that represent inanimate features or processes in the ecosystem). A direct comparison between a deep learning and a maximum entropy approach reveals that a deep learning SDM is more accurate at predicting potential distributions if the species in question has a substantial number of occurrences to train on. Conversely, deep learning SDMs have been shown to perform worse with low numbers of occurrences [13].
Another drawback of using deep learning-based models for the projection of potential species distributions is a lack of user-friendliness. Whereas MaxEnt offers open source software with a GUI that guides users through the process of model creation, a deep learning model typically requires the user to build the model from scratch, in code. This means that users without experience in programming, modeling and deep learning are unable to use its full potential. Here, this challenge is addressed by presenting a framework for the construction and evaluation of SDMs with deep learning (hereafter: 'sdmdl'), implemented in the Python programming language. Some of the tangible results presented here thus constitute a software package called 'sdmdl' that automates the creation of species distribution models. This package was developed to meet the following requirements:
• Maximizing ease of use by requiring only a few basic commands from the user.
• Maximizing the package's flexibility by optionally allowing model parameters to be changed.
• Testable functionality of the package by way of unit tests.
• Demonstrable utility by applying the resulting package to a case study.
• A clear outlook on future improvements and feature implementations.
To demonstrate the utility of this package we apply it to model and project habitat suitability worldwide for a number of staple crops and their wild relatives.

Software development
All software development and testing for the package 'sdmdl' was performed using Python version 3.6 [15]. The main Integrated Development Environment (IDE) used to write the source code of the package was PyCharm Free Community [16], although Spyder [17] was used during the initial phases of development. The source code, required input files, documentation and all testing code that were created during development are available through a GitHub repository (https://github.com/naturalis/sdmdl). Details on how to install and use the package are provided at: https://sdmdl.readthedocs.io.
One of the goals of this study was that the code of the created package be testable, so as to guarantee the reproducibility of its functionality. This was achieved using Python's unit testing framework. To ensure that the code would remain reproducible at all times, continuous integration provided by Travis CI (see https://travis-ci.com/naturalis/sdmdl) was used. This service tests the code of a GitHub repository every time code is modified or new code is added. Using a set of predefined instructions, Travis CI creates a remote virtual Linux environment to perform all of the following procedures automatically:
1. Install a specific version of the requested programming language; in the case of this study this was Python 3.6.
2. Install all Python package dependencies for the created package.
3. Install and run pytest, which automatically tests package code by running all scripts whose file names start with 'test_'.
4. Once the tests have completed successfully, report the test coverage.

Case study preliminaries

Literature review
Due to the nature of the case study, a literature review was required before the occurrence data could be obtained for each species. This initial literature review was performed for a list of species based on major crop types in the standard cross-cultural sample [18]. This resulted in a list of species that are considered the closest genetic wild ancestors (see Table 1). In most cases there is one ancestor per crop, although multiple ancestors can sometimes be identified in the case of complex hybridization events. The list is accompanied by a bibliography with the salient literature on the domestication of each of the crops.
Once this review was completed, a small secondary review was performed to identify the native range of each wild ancestor. These distributions were used to spatially query the occurrence data of the wild relatives, so that only occurrences from their native ranges would be included.

Occurrence data collection
Once these reviews were completed, the occurrence data was obtained for each of the crops and their closest wild relative(s). Occurrence archives downloaded from the Global Biodiversity Information Facility (GBIF) database [19] contain an occurrence table and a number of metadata files in DarwinCore archive format [20]. For the purpose of this study the only required data was the occurrence table for each species, from which the following four columns were extracted for each occurrence: (1) gbifID, a unique identifier for each occurrence in the GBIF database; (2) decimalLatitude, the decimal latitude value of an occurrence; (3) decimalLongitude, the decimal longitude value of an occurrence; (4) acceptedScientificName, the scientific species name of the occurrence as considered 'accepted' by the GBIF taxonomic backbone.
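The column extraction described above can be sketched in a few lines of Python. This is an illustrative stand-in for the package's own data handling, not the actual 'sdmdl' code; the occurrence table in a DarwinCore archive is tab-separated, and the four column names are those listed above:

```python
import csv
import io

# The four DarwinCore columns used in this study.
FIELDS = ["gbifID", "decimalLatitude", "decimalLongitude", "acceptedScientificName"]

def extract_occurrences(handle):
    """Read a tab-separated GBIF occurrence table and keep only the
    four columns needed for modeling."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [{field: row[field] for field in FIELDS} for row in reader]

# Tiny in-memory table standing in for a real occurrence.txt:
sample = (
    "gbifID\tdecimalLatitude\tdecimalLongitude\tacceptedScientificName\tbasisOfRecord\n"
    "123\t52.16\t4.49\tSecale vavilovii Grossh.\tHUMAN_OBSERVATION\n"
)
rows = extract_occurrences(io.StringIO(sample))
```

Columns that are not needed (such as basisOfRecord in the example) are simply dropped.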

Occurrence coordinate cleaning
To use the occurrence coordinates for training a model they first needed to be cleaned.
This involved removing any invalid data that could potentially cause problems during data preparation or model training, and removing any occurrences that have spatial issues. This was done using the 'CoordinateCleaner' [21] package in R [22]. To preprocess the occurrence datasets used in the case study, a pipeline of eleven steps was followed, sequentially removing occurrences if:
1. The precision of the latitude or longitude coordinate is fewer than two decimal places.
2. The latitude and longitude fall outside the valid range, which here refers to the extent of the raster layers used as input to the model (maximum latitude = 90, minimum latitude = -60, maximum longitude = 180, minimum longitude = -180).
3. The exact location is not unique.
4. The occurrence is in, or near, a country capital.
5. The occurrence is in, or near, a country or province centroid.
6. The occurrence has identical latitude and longitude values.
7. The occurrence is in, or near, the GBIF headquarters.
8. The occurrence is in, or near, a biodiversity institution.
9. The occurrence has non-terrestrial coordinates.
10. The latitude and/or longitude is exactly 0.
11. The occurrence is geographically isolated, with a distance of at least 1,000 kilometers to any other occurrence.
Once these filtering steps were executed for each species, the occurrence data was ready to be consumed by the package.
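The cleaning itself was done with the R package 'CoordinateCleaner'. Purely as an illustration of what a subset of these filters does, here is a minimal Python sketch covering steps 1, 2, 3, 6 and 10 (the remaining steps require gazetteer data such as capital and institution locations):

```python
def decimal_precision(coordinate):
    """Number of decimal places in a coordinate given as a string."""
    _, _, fraction = coordinate.partition(".")
    return len(fraction)

def clean_occurrences(records):
    """Apply a subset of the cleaning pipeline to (lat, lon) string pairs:
    precision < 2 decimals, out-of-range coordinates, lat == lon,
    zero coordinates and exact duplicates are removed."""
    seen, kept = set(), []
    for lat_text, lon_text in records:
        if decimal_precision(lat_text) < 2 or decimal_precision(lon_text) < 2:
            continue  # step 1: insufficient precision
        lat, lon = float(lat_text), float(lon_text)
        if not (-60 <= lat <= 90 and -180 <= lon <= 180):
            continue  # step 2: outside the raster extent
        if lat == lon or lat == 0 or lon == 0:
            continue  # steps 6 and 10: suspicious coordinate patterns
        if (lat, lon) in seen:
            continue  # step 3: duplicate location
        seen.add((lat, lon))
        kept.append((lat, lon))
    return kept
```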

'sdmdl' analysis
Here, the practical procedures that were followed to install and configure the package are omitted. A guide on how to set up and work with the created package is available at https://sdmdl.readthedocs.io. Briefly, to use the 'sdmdl' package, the following prerequisites needed to be met:
1. At least one occurrence table for a species of choice is provided. Here, 117,020 occurrences for 52 species were provided.
2. At least one environmental raster layer for a variable of choice is provided. Here, 59 environmental raster layers were used.
3. A copy or clone of the GitHub repository is present on the user's device.
Subsequently, two data-generating steps were taken. Firstly, presence (i.e. co-occurrence) maps were created to be used by the model as additional variables. Secondly, pseudo-absence locations were generated by random sampling. These were used to train the model to detect which environmental variables make a location suitable or unsuitable for a given species.

Creating presence maps
Presence maps are raster maps in which pixel value 0 represents the absence of a species and 1 represents its presence, or occurrence. These maps were created for every species in the case study, to serve as input variables to the model to help predict the presence of another, focal, species.

The pseudo-absences were sampled according to the amount of occurrence data for the focal species. If the species had fewer than 2,000 occurrences, 2,000 pseudo-absences were sampled randomly anywhere on (terrestrial) Earth (see Figure 4). If the species had more than 2,000 occurrences, the number of pseudo-absences sampled was equal to the number of occurrences for that species (see Figure 5). For the case study, the value 2,000 was used as the minimum number of pseudo-absences; in other cases this value can be defined by the user in a configuration file.
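This sampling rule can be sketched as follows. Note that this is an illustrative sketch, not the package's own implementation; in particular, the real package restricts samples to terrestrial pixels, which is omitted here:

```python
import random

def sample_pseudo_absences(n_occurrences, minimum=2000, seed=1):
    """Sample pseudo-absence coordinates: `minimum` points when the
    species has fewer occurrences than that, otherwise one per occurrence."""
    rng = random.Random(seed)
    n_samples = max(n_occurrences, minimum)
    # Uniform sampling within the raster extent used in the case study
    # (latitude -60..90, longitude -180..180).
    return [(rng.uniform(-60, 90), rng.uniform(-180, 180))
            for _ in range(n_samples)]
```

For a species with 500 occurrences this yields 2,000 pseudo-absences; for a species with 3,000 occurrences it yields 3,000.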

Preparing training data vectors
Once the two previous steps were completed, the training data vectors for the model were created. Preparing these consumed two types of data: first, the environmental raster layers collected for the case study, combined with the generated presence-map raster layers; second, the occurrence data that were collected and cleaned, combined with the set of randomly sampled pseudo-absences.
The input data to the model was prepared by querying the locations of all occurrences and pseudo-absences, and obtaining the raster values for each specific location (an example of this procedure is shown in Figure 6). If a raster layer contained categorical data, the values were extracted directly; if the raster layer contained numerical data, the values were scaled first.
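As a hedged sketch, assuming standard z-score scaling (subtracting the layer mean and dividing by the standard deviation, a common choice for neural network inputs; the exact formula used by the package may differ), the numerical case could look like this:

```python
import math

def scale_layer(values):
    """Z-score scale the values of a numerical raster layer.
    (Assumption: the specific scaling used by the package may differ;
    z-scoring is the most common choice for neural network inputs.)"""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]
```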

DNN analysis
The preceding data preparation steps are broadly applicable to any method for species distribution modeling that uses some form of machine learning (including maximum entropy).
The following sections describe how these data were used in an analysis that uses deep neural networks (DNNs). These sections thus include defining a DNN architecture, training it, evaluating the trained model(s) and assessing feature importance, and finally using the optimal model to predict species' potential distributions.

Creating a DNN architecture
When creating and training a deep learning model there are multiple choices that can affect the performance of the final model. The first, and arguably one of the most important, parts of this process is creating a model architecture. Deep learning models come in a range of different shapes and sizes. In principle, all deep learning networks have the same basic structure, which is illustrated in Figure 7. A deep learning model has a number of inputs (visualized in green), at least two hidden layers (visualized in blue) and one or more outputs (visualized in red); here, the output represents predicted presence or absence.
An important part of a created model are the hidden layers (represented in blue in Figure 7). A hidden layer can contain different numbers of 'nodes', or neurons, and a deep learning model can have two or more hidden layers. In general, shallower models, i.e. models with a small number of hidden layers, are better at generalizing, while deeper models with a larger number of hidden layers are able to model more complex situations (e.g. when using a large number of input variables).
The architecture used in the case study analysis, the default for the package, was relatively shallow: it contained four hidden layers with respectively 250, 200, 150 and 100 nodes (for a visual representation of the default architecture see Figure 8). Each hidden layer was subject to a procedure called 'dropout', which turned off a percentage of the layer's neurons.
This prevented overfitting behavior, since the model was unable to strengthen the same neural pathways on each training iteration. The dropout percentages were 30%, 50%, 30% and 50% for layers one through four, respectively. Other model parameters have fixed default values that are normally not exposed to the user. For more technical details on these model parameters see Rademaker et al. [13].
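The default architecture can thus be summarized as a list of (units, dropout) pairs. The sketch below uses this summary to compute the number of trainable parameters of the resulting fully connected network; the single output node is an assumption on our part, and the actual package builds the network with a deep learning framework rather than this standalone helper:

```python
# Default sdmdl hidden layers: (number of nodes, dropout fraction).
HIDDEN_LAYERS = [(250, 0.30), (200, 0.50), (150, 0.30), (100, 0.50)]

def parameter_count(n_inputs, hidden=HIDDEN_LAYERS, n_outputs=1):
    """Count trainable weights and biases of a dense feed-forward
    network; dropout layers add no parameters."""
    total, fan_in = 0, n_inputs
    for units, _dropout in hidden:
        total += fan_in * units + units  # weights + biases
        fan_in = units
    return total + fan_in * n_outputs + n_outputs  # output layer
```

Even with only ten input variables, such a network already has close to 100,000 trainable parameters, which illustrates why a substantial number of occurrences is needed for training.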

DNN training
The training procedure applied here was relatively straightforward, following these steps:

1. A training data handler that handles the input to the model was created.

Model evaluation
To obtain a viable model, multiple models are created and tested. During this process five different models are created, each initialized with a different set of weights. After their creation, these models are automatically validated to determine which has achieved the best performance. Comparing each model's predictions against the known presence and pseudo-absence labels yields counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). Using these fundamental metrics, the true positive rate (TPR) and false positive rate (FPR) were calculated as TPR = TP / (TP + FN) and FPR = FP / (FP + TN). Plotting the TPR versus the FPR as a function of the threshold value creates a graph called the ROC curve (for more details see Rademaker et al. [13]). Once this curve is defined, the AUC score is calculated by measuring the area under the ROC curve, which is 0.5 if a model is equally likely to predict one class over the other (50% accuracy) and 1 if all predictions are correct (100% accuracy). Out of the five models that were created for each species, the one that achieved the highest AUC score was serialized to file.
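The ROC/AUC computation described here can be sketched in plain Python (in practice one would use a library routine such as scikit-learn's `roc_auc_score`; tie handling is simplified in this sketch):

```python
def roc_auc(scores, labels):
    """Build ROC points (FPR, TPR) by sweeping the threshold from high
    to low, then integrate the area under the curve (trapezoid rule).
    Labels: 1 = presence, 0 = pseudo-absence."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _score, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auc
```

A model that ranks every presence above every pseudo-absence obtains an AUC of 1.0, while interleaved scores pull the value toward 0.5.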

Feature importance
After the single, optimal model was selected based on the model evaluation, additional computation was performed to determine the impact of each feature (i.e. environmental variable or co-occurrence) used by the model. To achieve this, an approach called Shapley values [25] was used. This approach, which originates from game theory, calculates the difference between a model incorporating all variables and a model incorporating all but one variable, thus testing the difference in outcome attributable to the missing variable. For a more detailed description of Shapley values see Rademaker et al. [13].
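The idea of contrasting the model with and without a variable can be illustrated with a simplified leave-one-out importance measure. Note that true Shapley values average this kind of contrast over all subsets of features; the sketch below deliberately uses only the single contrast described above, with a hypothetical `model` callable and per-feature baseline values:

```python
def leave_one_out_importance(model, features, baseline):
    """Importance of each feature as the drop in model output when that
    feature is replaced by a baseline value (a simplified stand-in for
    Shapley values, which average over all feature subsets)."""
    full_score = model(features)
    importance = {}
    for name in features:
        reduced = dict(features)
        reduced[name] = baseline[name]  # 'remove' the feature
        importance[name] = full_score - model(reduced)
    return importance
```

For a toy model f = 2a + b with a = 3 and b = 1, feature a receives importance 6 and b receives 1, reflecting their respective contributions.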

Distribution prediction
Once the models were trained, the global distribution for a species was predicted. This used a prediction dataset that contained the values of the environmental variables for every terrestrial location on the globe. The model then predicted the presence or absence of a species for each of these locations. Once all the predictions were performed, they were aggregated into a two-dimensional matrix that was then converted into a prediction map and raster layer.
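The aggregation step can be sketched as follows. This is an illustrative sketch rather than the package's code; `model` stands for the trained network's prediction function, and `None` marks non-terrestrial cells for which no environmental values exist:

```python
NODATA = -1.0  # fill value for cells without predictions (e.g. ocean)

def predict_grid(model, grid_features):
    """Predict suitability for every cell of a global grid and return
    the 2-D matrix that is later written out as a prediction raster."""
    return [[model(cell) if cell is not None else NODATA for cell in row]
            for row in grid_features]
```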

Software
During this study a python package called 'sdmdl' was created for Python 3.6. This package is open source and available under the MIT license, from the 'sdmdl' GitHub repository.
This license allows the user to: (1) Use the package commercially.
(2) Modify the source code of the package.
(3) Distribute modifications of the package.
Furthermore, it limits any liability and warranty for the use of this package. Users are free to use the package as long as they follow these guidelines and cite it correctly. The package was coded using an object-oriented programming (OOP) approach, following the PEP-8 style guide for Python code [26].
The package contains twelve classes, which can be grouped into three broad groups: Helper, Handler and Main classes. Helper classes perform very specific functions. Handler classes perform broader processes, such as managing spatial or occurrence data and model settings. The Main class limits the user's interaction to one class, which contains only the handful of functions needed for performing the entire modeling process. For an overview of each class that is part of the final package and their respective functions, see Table 2.

Testing
During this study, tests were performed to ensure the reproducibility of the package. The end results of these testing procedures can be found in Table 3.

Species with 100 or fewer occurrences
To illustrate the performance of the package on species with little occurrence data, the predicted potential distribution is discussed for four species with fewer than 100 available occurrences. These species are the ancestors of rye (Secale vavilovii), taro (Colocasia formosana), purple yam (Dioscorea hamiltonii) and banana/plantain (Musa balbisiana).

Secale vavilovii only had 10 occurrences available. This low number of occurrences has an evident effect on the model, as it is practically unable to learn which locations are suitable or unsuitable for the species (see Figure 11). Instead, the projected suitability is almost global. Moreover, the model has not only predicted worldwide suitability for the species, but has also predicted that this entire distribution is only semi-suitable.

The projection for Colocasia formosana similarly illustrates the inability of the model to learn habitat suitability for a species from a low number of occurrences (see Figure 12). As this species has slightly more occurrences (23), the model is marginally better at deciding which locations are more or less suitable for the species. When compared to the previous case, the distribution map of Colocasia formosana shows suitable areas (values close to 1) and unsuitable areas (values of 0.6 or less). Even though the model was better able to distinguish between suitable and unsuitable areas, it predicts a high suitability (suitability = 1) for locations (e.g. Central Africa, Central America and South America) that seem implausible given that the species is native to the island of Taiwan [27]. These predictions are most likely side effects of the model being trained with only 23 occurrences, and such locations should not be considered more suitable than areas that were predicted to be mostly suitable (suitability ≈ 0.9).

The last species with 100 occurrences or fewer is Musa balbisiana. The suitability projection in Figure 14 shows that the model was mostly able to learn which environmental variables are suitable for the species and which variables make an environment unsuitable. This manifests in the projection mostly covering Southeast Asia, which is where this ancestor originates from [29]. When continents outside of the native range are not considered (e.g. Africa, South and Central America), the model was able to learn what variables make a location suitable for this species.
This is especially relevant considering that this species only had 100 occurrences available for training the model.

Figure 15 shows that the model has over-predicted habitat suitability for Sorghum bicolor subsp. verticilliflorum. Not only has it over-predicted the spatial range, but also the suitability within that range all around the world. Taking a closer look, Figure 15 reveals that the model has predicted suitability for a range of different climate zones, from the arctic (e.g. North America, Greenland, Scandinavia and Siberia) to the tropics (e.g. the Amazon basin, India, Southeast Asia and Indonesia). However, the projection does show that the model has been able to learn the suitability for the species on the African continent, which is where this species originates from [30].

Eleusine africana shows results similar to the previously discussed species (see Figure 16). However, this species shows a more spatially confined projection. The model still projects habitat suitability on every continent, including a wide range of different climate zones, from arctic areas like Iceland and Scandinavia to (sub)tropical areas like the Philippines and Indonesia. When considering the native range of the species, which largely overlaps with sub-Saharan Africa, the predicted suitability more closely matches descriptions of the distribution found in the literature, the species being present in much of the drier areas of sub-Saharan Africa [31]. This has led the model to project habitat suitability in line with expectations for the species, while limiting over-prediction on other continents.

As Figure 17 shows, the main location where suitability for this species is projected is Central America (which closely matches its native distribution) [32], but there are other locations, like Ethiopia, where the model has predicted suitability for this species on a very local scale.

Feature importance
The importance each variable has in the model can be tested individually using the Shapley values [25]. However, a detailed interpretation of the ecological implications of each individual species and variable is omitted; feature importance is discussed by looking at the importance of presence instead. An example of a feature importance graph is given in Figure 23.

DISCUSSION
This paper outlines a generalized approach, presenting its results with reference to the functionality of a software package that was a major outcome of this research. The functionality is assessed in terms of the predicted habitat suitability for the species in the case study. However, due to the nature of the case study, in which species with relatively low numbers of occurrences were included, it was not possible to obtain external data sources that could be used to validate the results and test their accuracy. This means that this report is unable to provide metrics on the exact performance of the created models.
Nevertheless, the results show some of the general patterns that can be expected in the study of the biodiversity of domestication. Namely, the process of domestication suggests that crop species have been selected to survive in a wider variety of habitats. Domesticated species should thus have a wider distribution and be able to thrive in a wider range of niches than their ancestors. This pattern can be seen in a number of predicted crop species, including maize, rice, barley, common wheat, potato, sweet potato and peanut. The most plausible evidence for this pattern is found in species with more than about 500 distinct occurrence points available, as was also found by Rademaker et al. [13]. Another pattern that was observed is related to spatial autocorrelation, more precisely the use of presence maps from close genetic relatives for prediction. In species with a small number of occurrences (e.g. Manihot esculenta subsp. flabellifolia, the ancestor of cassava) this has been observed to drastically affect the predicted output of the model, essentially creating a model that almost exclusively predicts a high suitability in locations where its genetic relative is present. It is worth mentioning that this does not invalidate the model and its performance, but it has the potential to slightly distort performance metrics.
Some functionalities should be considered for implementation in the 'sdmdl' package.
The most urgent feature is improved handling of input environmental raster layers. In its current state the package is unable to use provided raster layers unless they have a specific spatial extent and resolution (5 arc minutes). These requirements can prevent people from using the package, especially users with no experience working with raster layers or other spatial data formats. Other, less urgent features are discussed and documented within the code on the GitHub repository.

CONCLUSIONS
This paper presents a proof of concept of user friendly SDM with deep learning and gives insight into how the model performs on the 'crop and ancestor' case study. This case study is used to illustrate under what circumstances the model is able to perform as expected while also discussing the model's potential weak points, which could lead to producing biased, or incorrect results.
The 'sdmdl' approach is easy to use by researchers with little experience with programming and deep learning; by using four simple commands the entire procedure can be performed. Furthermore, the model offers a variety of parameters that give the user extensive freedom to change the behavior of the model. These parameters are accessible via an editable configuration file, and allow the user to change settings regarding the species and the environmental variables included in the analysis, the architecture of the model, and several more (see documentation on the GitHub repository for more details on all the configuration parameters at https://sdmdl.readthedocs.io).
This study demonstrates that the methodology used is applicable to other groups of species as the default parameters used for performing the case study are the parameters used by Rademaker et al. [13] to predict the potential distribution of ungulates (hooved mammals). In conclusion, this study has succeeded in using the deep learning methodology devised by Rademaker et al. [13] and adapting it into a package that offers an intuitive interface and considerable freedom for the user to change how the model performs and behaves. However, in its current state, 'sdmdl' still needs a number of features that are crucial for it to be broadly applicable. These features will be discussed in the following section.

FUTURE WORK
The approach used to sample random pseudo-absences was reviewed in the discussion section above. However, a relatively simple approach could improve this methodology. As previously mentioned, the sampled locations can have a profound impact on the predicted potential distribution: sampling locations close to occurrence points can condition the model to predict suitable locations as unsuitable. One way to prevent this is to use a geodesic buffer around occurrences within which no pseudo-absences can be sampled. The size of this buffer should be relatively small, to prevent the model from overfitting, but not so small that pseudo-absences might still be sampled at locations that are likely suitable for the species.
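Such a buffer could be implemented by rejecting candidate pseudo-absences that fall within a given great-circle distance of any occurrence. A minimal sketch using the haversine formula follows; the buffer size is a free parameter here, not a value from this study:

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def outside_buffer(candidate, occurrences, buffer_km):
    """True if a candidate pseudo-absence lies outside the exclusion
    buffer around every occurrence."""
    return all(haversine_km(candidate, occ) >= buffer_km for occ in occurrences)
```

Candidates failing the check would simply be resampled until the required number of pseudo-absences is reached.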
Another way to improve the model would be to find a more suitable performance metric to the nature of the problem, i.e. a metric that does not directly rely on the ground truth labels of the pseudo-absences. However, this could prove a difficult implementation as most performance