Modelling the niches of wild and domesticated Ungulate species using deep learning

Knowledge of global biodiversity remains limited by geographic and taxonomic sampling biases. The scarcity of species data restricts our understanding of the underlying environmental factors shaping distributions, and the ability to draw comparisons among species. Species distribution models (SDMs) were developed in the early 2000s to address this issue. Although SDMs based on single layered Neural Networks have been experimented with in the past, these performed poorly. However, the past two decades have seen a strong increase in the use of Deep Learning (DL) approaches, such as Deep Neural Networks (DNNs). Despite the large improvement in predictive capacity DNNs provide over shallow networks, to our knowledge these have not yet been applied to SDM. The aim of this research was to provide a proof of concept of a DL-SDM1. We used a pre-existing dataset of the world’s ungulates and abiotic environmental predictors that had recently been used in MaxEnt SDM, to allow for a direct comparison of performance between both methods. Our DL-SDM consisted of a binary classification DNN containing 4 hidden layers and drop-out regularization between each layer. Performance of the DL-SDM was similar to MaxEnt for species with relatively large sample sizes and worse for species with relatively low sample sizes. Increasing the number of occurrences further improved DL-SDM performance for species that already had relatively high sample sizes. We then tried to further improve performance by altering the sampling procedure of negative instances and increasing the number of environmental predictors, including species interactions. This led to a large increase in model performance across the range of sample sizes in the species datasets. We conclude that DL-SDMs provide a suitable alternative to traditional SDMs such as MaxEnt and have the advantage of being both able to directly include species interactions, as well as being able to handle correlated input features. Further improvements to the model would include increasing its scalability by turning it into a multi-classification model, as well as developing a more user friendly DL-SDM Python package.


Background
Biodiversity is in strong decline across the globe (1,2 ). The main drivers are the loss and degradation of natural habitats through human activities (3 ). Loss of biodiversity negatively affects ecosystem functioning (4 ), and its conservation is therefore of high priority. However, knowledge of global biodiversity is still limited (5 ). This is partly due to the observation that the vast majority of known species occur in restricted ranges and low abundances (6, 7 ). Furthermore, data from areas with some of the highest biodiversity, such as the tropics, is relatively sparse (8,9 ). Species distribution models (SDMs), which were initially developed in the early 2000s (10, 11 ), provide a partial solution to the scarcity of species data. SDMs relate patterns in the occurrence data to a selection of environmental predictors and use this information to predict the probability of presence outside of sampled areas (12 ). Predictions based on limited or geographically skewed input data, among other things, have implications for the quality and interpretation of model output (13,14 ), and SDMs are subject to continuous improvement (12 ).
The MaxEnt software package (15, 16 ) is currently one of the most popular SDMs with > 1000 applications published since its introduction (17 ). The approach was originally developed to estimate the density of presences across the landscape (15 ). In the absence of knowledge on absolute population sizes, it provides a relative occurrence rate (ROR) per grid cell as output (18 ). However, for many species the available records cannot be seen as a random sample from the landscape, and the output will therefore not meet the assumptions for density estimation (17 ). Alternatively, using MaxEnt to predict the probability of presence in a cell requires a logistic transformation of the ROR (16 ). However, this transformation has also been criticized (17,19 ). It includes a parameter τ, representing the background probability of presence for 'average 'presence localities. The value of τ has a large influence on the predicted output probabilities, but is arbitrarily set, rather than being fitted from the data (20 ). Considering the challenges in model interpretation when estimating density or probability of presence, MaxEnt is often used in a qualitative way by interpreting the output as an index of habitat suitability (21, 22 ).
In this research we propose an alternative approach for constructing SDMs, based on Deep Learning (DL). The past two decades have seen a strong increase in the use of Deep Learning (DL) (23 ), which has been attributed mainly to increased chip processing abilities, lower hardware costs and advances in machine learning algorithms (24, 25 ). DL is a subfield of machine learning that focuses on learning high-level abstractions in data (25 ). This is achieved by using a hierarchical architecture consisting of multiple interconnected layers, which in in turn contain multiple artificial neurons. A common type of DL is the application of Deep Neural Networks (DNNs). DNNs contain >2 layers and three basic computations are performed in each of them (Fig.1). (1) The neurons in a given layer receive input values from each of the neurons in the preceding layer. For the neurons in the first layer this means that they receive the raw values for each of the input variables. Each of these input values are multiplied by a specific weight, obtained through optimization, (2) the weighted input values are subsequently summed, and (3) the weighted sum is transformed using a non-linear activation function, which is selected from a set of candidate function by comparing the network's performance using each of them. The transformed output value is passed on as input to the neurons in the next layer (26 ). Thus, what is learned by the network is the optimal set of weights for all the connections between neurons in adjoining layers, maximizing network performance. Each individual neuron is able to focus on a specific pattern in the data. For example, a neuron in the first layer might put most weight on all variables related to seasonality, and another neuron in the first layer assigns most weight to variables related to terrain and vegetation. A neuron in the second layer might then put most weight on the outputs of these particular two neurons in the first layer and thereby model the abstract concept of "seasonal lowland forest". For classification purposes, the number of neurons in the final layer equals the number of classes to predict.The output of the neurons in the final layer are passed through a softmax function (27 ), which transforms them into probabilities that sum to 1 (eqn. 1). Although shallow networks containing a single hidden layer have been available in SDMs (28, 29 ), these typically ranked at the bottom in terms of performance (30, 31 ). Harris (32 ), used a two-layer network for SDM and already noticed a large increase in performance compared to single layered models. Current methods will allow us to create much deeper models still and further improve performance.
Where K is the number of classes, s(x) is a vector containing the scores of each class for instance x and σ(s(x)) k is the estimated probability that instance x belongs to class k given the scores of each class for that instance. From Géron (26 ).
Several arguments can be made for developing DL-SDMs as an alternative to MaxEnt-SDMs. Firstly, there is the clarity in the interpretation of network's output. The output will be two probabilities for each location, a probability for that location of belonging to class 1: species occurs, and a probability of belonging to class 0: species does not occur. A second argument, more interesting from an ecological point of view, is the possibility of taking into account the presence of other species as environmental predictors in DL-SDMs. This allows for the direct inclusion of biotic interactions that is not possible in MaxEnt. Researchers now often use a two-tiered approach, first running MaxEnt, and then separately modelling the output including co-occurrence of other species (34, 35 ). Including biotic interactions considerably improves model performance (35 ). Finally, a further incentive for developing DL-SDMs is their scalability. In MaxEnt-SDMs each species to be modelled requires the selection of a separate set of appropriate and uncorrelated input variables (36, 37 ). Given appropriate model structure, DL-SDMs can take the same complete set of (correlated) input features for each species. Next to this, there is also the potential of multi-classification in DL-SDM in which the model outputs the probability of presence for all species in a single instance. This would increase scalability as the network weights only need to be trained once, rather than separately for each species. Furthermore, these pretrained weights could be transferred to a new species dataset and retrained, likely reaching an optimal solution faster than starting from the default random initialization.

Aims of the study
The aims of this exploratory research are to (1) provide a proof of concept of DL-SDM, (2) compare performance of DL-SDM to MaxEnt-SDM and (3) to provide recommendations on the large scale practical implementation of DL-SDMs.

Research questions
Based on the aims of the research, the following research questions were defined: 2 Methods

Software
All source code for this research was written using Jupyter Notebook (38 ), based on a Python 3.6 kernel (39 ). The code, together with the input and output data is publicly available via github and can be found at: https://github.com/naturalis/trait-geo-diverse-dl.

Data preparation
The research project was structured in three separate stages. Firstly, a pilot model was made utilizing the same input species, occurrences and environmental predictors as recently used by Hendrix & Vos to model the niches of the world's ungulates with MaxEnt (40 ). This choice was made to allow for a qualitative visual comparison of the results of the DL-SDM with MaxEnt-SDM. In the second stage, the number of occurrences in the pilot model was extended. This stage was used to gain deeper insight in the number of occurrences required for credible modelling performance for DL-SDM and potential improvements through changes in model architecture. Finally, in the third stage additional environmental predictors were included to assess their impact on model performance and potential improvements in model performance by changing model architecture.

Pilot study
We used the occurrence data of 154 ungulate species and raster datasets for 41 abiotic environmental predictors relating to climate, topography and soil characteristics from the online repository of Hendrix & Vos (40 ). The occurrence data originated from the Global Biodiversity Information Facility (GBIF) website (41 ) and ranged between 10 -882 observations per species (mean: 191 ± 234 sd). The climatic raster data were sourced from the widely used BIOCLIM and ENVIREM datasets (42, 43 ). The soil characteristics rasters were sourced from the Land-Atmosphere Interaction Research Group, and topography rasters from the Harmonized World Soil database (44 ). All environmental rasters were transformed to a 5 minute spatial resolution. A full list of the variable descriptions of each raster can be found in Appendix A.
Starting with a csv file with filtered occurrences for a given species, the goal is to generate a dataframe including labeled positive and negative occurrence examples and the environmental variable values at these locations. This dataframe will form the input for the DNN. As no hard data on species absences exists, typically pseudo-absences are used instead (45 ). The steps taken to generate this dataframe are visualised in Figure 3 and detailed below. The code is provided in the Stacking environmental rasters and Species and global prediction dataframes notebooks in the pilot study folder in the repository. To generate pseudo-absences, circular buffers with 1000km radius were constructed around each occurrence point. These buffers were merged into a single 'multipolygon' shapefile. The environmental variable rasters were first stacked into a single multi-band raster and then clipped based on the extent of the multipolygon shapefile. A random selection of pseudo-absence locations was generated within the raster clip based on two constraints: (1) points were not allowed to be located within the sea and (2) points were not allowed to be located within raster cells with occurrences. For species with < 1000 occurrences, 1000 random locations were generated. For species with >1000 occurrences, the number of random locations was set equal to the number of occurrences. The resulting selection of pseudo-absence points and their longitude and latitude values were added to the csv file with filtered occurrences. Next, the environmental variable values for all locations were added to this dataframe. Each band in the stacked raster clip represented one of the 41 environmental variables. For all occurrence and pseudo-absence points, the cell number in which they were located was determined. By going iteratively through the raster bands, the cell values for all variables were extracted and added to the dataframe. The environmental variable values were scaled by subtracting the band mean and dividing by the standard deviation. This formed the dataset on which to train and test the DNN. To produce global predictions of species distributions after model training and testing, a separate dataset was made containing the scaled environmental variable values of all terrestrial cells in the stacked world raster map.

Extended observations
In contrast to the pilot model, occurrence data was directly sourced from an SQL relational database containing all GBIF occurrences for the world's ungulate species. These raw occurrences were first sorted on taxonomy and then filtered based on multiple criteria (Fig. 4), the code for which can be found in the Filter GBIF records from SQL Database notebook in the data extended folder in the repository. As a first filtering step, only occurrence records with at least two decimal values for longitude and latitude and records representing a unique longitude-latitude combination were included. Next, it was determined whether each occurrence was located within the species IUCN range by uti-lizing the publicly available IUCN species distribution range shapefiles (46 ). Finally, only records collected after 1900 and species with > 10 records after filtering were included. The total number of ungulate species included was 124, and the occurrences per species ranged between 10 -58329 (mean: 1401 ± 5798 sd). The subsequent process of generating pseudo-absences and extracting environmental values was the same as in the pilot project.

Extended observations and environmental variables
The same set of occurrence data was used as in the extended observation models. However, rather than generating pseudo-absence locations from within the buffers generated around each occurrence location, as in Hendrix & Vos (40 ), these were now sampled randomly from the entire world. This was done to increase the range of environmental variable values the model was exposed to during training on pseudo-absences and improve predictive capabilities at the global scale. For species with > 2000 occurrences, 2000 random locations were generated, and for species with more occurrences the number of random locations was set equal to the number of occurrences. Next to this, multiple biotic and abiotic variables were added to the environmental predictor dataset. These variables consisted of the occurrences of the other ungulate species in the dataset, as well as maps from the Atlas of World Conservation that represented: the world's ecoregions, levels of human appropriation, human accessibility, habitat fragmentation, mammal species richness and plant species richness (47, 48 ). The code for processing, rasterizing and stacking these various additional environmental layers is listed in the Environmental Raster Layers notebook in the data GIS extended folder in the repository. The final stacked environmental raster contained 186 bands. A description for each variable is provided in Appendix B.

Model architecture and training
We first applied combinations between various learning rates, regularization functions, activation functions and optimization algorithms to a trial dataset of Capreolus capreolus to guide model construction ( Table 1). The model structure was kept fixed with two hidden layers containing 50 and 25 neurons, a batch size of 100, and 500 epochs for training. In terms of model performance. we looked at the average loss, accuracy and AUC value for each of these hyperparameters ( Table 2). The outcomes indicated the best performances were obtained using L2 or no regularization, a ReLu activation function, RMSProp or Adam optimization and a relatively high learning rate (0.001 or 0.0001). After this more systematic evaluation we attempted to further improve model performance by (1) adjusting the number of layers, (2) using drop-out as an alternative regularization method and (3) adjusting batch size and number of epochs. The final architecture of the pilot model consisted of four hidden layers with drop-out in between each layer (Fig. 5). We used Python's Keras module to build the DNN (49 ) and trained the model using a batch size of 75 for 125 epochs, with a learning rate of 0.001 and using Adam optimization. As many datasets were imbalanced, with considerably more pseudo-absences than presence-locations, datasets were randomly shuffled, then split into training (85%) and test sets (15%) using a stratified approach. Furthermore, a balanced batch generator was used during training (50 ).   The architecture for the model with extended observations was kept the same as the pilot model, as adding layers or altering drop-out rates did not seem to improve performance in the trial dataset. For the model with extended observations and variables, the number of layers and drop-out rates was kept the same, but the number of neurons in each hidden layer was increased to 250, 200, 150 and 100 neurons respectively, and the number of epochs was increased to 250 (Fig. 6). The code for training the models can be found in the Train DNN notebooks.

Model evaluation
Evaluation of the models was the same for the pilot, extended observations, and extended observations and variable models. The DNN was run five times for each species. During each run, the test loss, accuracy and AUC value were stored and 95% lower and upper confidence bounds around the AUC value were estimated using a bootstrapping procedure with 1000 repetitions. The average test loss, accuracy, AUC and associated 95% confidence intervals over the five runs were written to a text file. The model weights of the run with the highest AUC value were saved as a .h5 file, to later reconstruct it for making predictions of the species global distribution. We used the DeepExplainer function from the SHAP package developed by Lundberg (51 ) to calculate feature importance by approximating Shapley values (52,53 ). The approach computes the contribution of a target feature to a model prediction by rerunning the prediction using all possible non-target feature combinations and again for these combinations now including the target feature. It then takes the average difference in predicted outcomes. As DNNs' fixed network structure means they cannot actually exclude a feature, excluded features take on a reference value instead (54 ).

Pilot study
The DNN pilot model showed increasing performance and decreasing variation in performance with higher availability of occurrence samples (Fig 6). Compared to the MaxEnt model used by Hendrix & Vos (40 ), the pilot DNN model performed considerably worse when assessed over all species, with a large standard deviation, indicating high among species variation (Table 3). However, the difference in performance between the pilot DNN model and the MaxEnt model was relatively low if only species > 100 samples were taken into account. The predicted global distributions for a species with high, intermediate and low number of occurrence samples using both modelling approaches can be found in Figure 7. Associated variable importance for each of the individual models can be found in Appendix C.

Extended observations
Including additional observations had mixed effects on model performance. Only for species with 500 observations, was there a clear improvement in terms of reduced test loss and increased AUC values (Table 4, Fig.8). This is also reflected in the changes in the predicted global distributions ( Fig.10. a,c,e ). There was a large restriction in the predicted distribution of Alces alces, with 9966 occurrences, but not for the Ceratotherium simum and Vicugna vicugna, despite increases from 263 to 418 and from 12 to 61 occurrences respectively.

Extended observations and environmental variables
The inclusion of additional environmental variables and sampling pseudo-absences globally reduced the variation in AUC values and associated confidence intervals for species with sample sizes between ∼ 100 and 500 samples compared to the extended model (Fig.9). For species with > 500 observations, model loss, accuracy and AUC scores were all improved and there was relatively low variation in these metrics between species (Table 5). Although there was still considerable variation in performance measured across all species, the performance and predicted distribution of several species with < 500 occurrences did improve considerably, as can be seen in Figure 10. b,d,f .    Of the added environmental variables in the model, co-occurrence with another species was the most important feature for both Alces alces and Ceratotherium simum (Fig. 11). In the model for Vicugna vicugna, on the other hand, the most important features came from the same subset of abiotic variables as in the extended model, suggesting that the global pseudo-absence sampling strategy is responsible for the large improvement in model performance for this species. Three out of the five highest ranked features for Ceratotherium simum did not show a clear relationship between intermediate to high feature values and the impact on the model's predicted probability of occurrence. Dependency plots indicated interaction effects occurring with other features (Fig. 12).   The predicted probability of occurrence of Ceratotherium simum increases when going from low to intermediate temperature values in the warmest quarter and plateaus for high values. However, at intermediate levels (0 -1), having conditions that are neither very arid nor moist (-1.0 -0) increase the predicted probability of occurrence, whereas high moisture does not (Fig.12 a ). Reversing the scaling of the data shows that this corresponds to a combination of temperatures between around 21.1 -39.3 • C in the warmest quarter and moisture index levels of 53 -55. Increased mammal species richness increases the predicted probability of occurrence. At high levels of mammal species richness (1.0 -2.3; 94 -227 spec.), having a low to intermediate seasonality in potential evapotranspiration (-1.5 -0; 5.9 -73.0 mm) further increases the predicted probability of occurrence (Fig.12 b ). Finally, both relatively low and high potential evapotranspiration in the driest quarter lower the predicted probability of occurrence. In between (0.0 -1.5; 0.0 -240.6mm), having intermediate to high values for the minimum temperature of the warmest month (0.0 -1.0; 13.9 -22.2 • C) increased the predicted probability of occurrence (Fig.12 c ).

Discussion
In this research we aimed to apply a Deep Learning approach to Species Distribution Modelling (DL-SDM). We also compared its performance to the well-established MaxEnt SDM on a limited dataset of the world's ungulates.

Relationship DNN and MaxEnt
Although the mechanics are still the subject of active research and debate (55,56 ), the internal processes within a DNN share a similarity with the MaxEnt approach in that information flowing through the network converges to a maximum entropy solution (57,58 ). In MaxEnt, this solution can be described as the distribution that minimizes the distance from the uninformed prior distribution of the 'background' feature set, but maintains the maximum amount of information contained in the distribution of the target feature set, i.e. has the same feature characteristics (mean, variance) as the feature set associated with the occurrence samples (20 ). Research by Schwarz & Tishby shows that going successively through each layer in a DNN, there is a trade-off between compression, or efficient representation of the information contained in the input features, and maintaining the predictive capabilities of the network (57 ) (Appendix E.1). The generalization capacity derived from this process does not occur in single-layered networks, and might partly explain their poor performance in SDM (30, 31 ). However, these findings are still debated and the process was not observed in research by Saxe (55 ) in networks utilizing ReLu activation functions (Appendix E.2). This is the most commonly used type of activation function and was also used in the networks in this research.

Model comparison
Our model comparison based on a limited dataset of the world's ungulates showed the DNN model performing worse for species with low and intermediate sample size and similar for species with a large sample size. DNNs typically require a large amount of training data to achieve high performance, which is related to to the large amount of parameters that need to be optimized (59 ). In this respect, the MaxEnt model is much less complex and it might explain why it performs better for species with few occurrence samples. However, sample size was not the only important determinant. The results of the extended ov model showed the selection of pseudo-absences was responsible for the large improvement in the global predicted distribution of Vicugna vicugna. By sampling negative labels only from within the IUCN range of the species in the pilot and extended model, these overfitted on the peculiarities of environmental conditions and generalized poorly when exposed to different conditions in other regions of the world. This also shows that if 'pseudo-absences' are not selected appropriately, an evaluation metric like the AUC value can be misleading. Both the MaxEnt model of the Vicugna vicugna and the DNN extended model of Ceratotherium simum achieved a high AUC, but their global predicted distribution showed poor generalization.

DL-SDM improvement
One way to improve performance of the network model on small species datasets, is to apply transfer learning (60 ). Rather than learning the network weights starting from some random initialization, it is often beneficial for small datasets to use an existing model whose weights were pretrained for a similar classification task. This model can then be retrained on the small dataset, starting from the pretrained weights (61 ). This is a strategy that is often utilized in image recognition studies (62 -64 ), where well-known existing networks trained on thousands or millions of images such as Alexnet are retrained on the limited dataset available in the study. For DL-SDM, this could mean first training the model on an ecologically similar species with a large sample size and then apply transfer learning to retrain it on the target species with a small sample size. Alternatively, a single, deeper multi-classification model could be created that outputs the probability of presence for all species in a single instance. This model would then still need to be trained using resampling strategies to increase performance for species with few occurrences (65 ).

Modelling shifting distributions
The DL-SDM in the current research was used to predict the distribution of the world's ungulate species based on occurrence samples that were collected between 1900 and the present. Recently, there has been an increasing interest in modelling how species distributions might shift in the future following climate change (66 -68 ). This could be modelled in DL-SDM by exposing a pretrained version of the current model to an adjusted set of environmental data, but the model would not be able to include species co-occurrences, as the distribution of the other species would likely change as well.
Whether this is problematic would depend on the organism being modelled, for many of the ungulates in this research co-occurrences were shown to be important features, whereas one might expect plant distribution to be modelled accurately using only abiotic features.
The multi-classification model suggested earlier, which takes the occurrences of all species as inputs and also outputs the predicted occurrences of each species, could provide an approximation in two steps. In the first step the pretrained model is exposed to a new feature set including the adjusted abiotic environmental conditions, but the same set of species occurrences. The result can be framed as "the predicted distribution of species X if only climatic conditions change and the distribution of other species remains the same". In the second step this pretrained model is then exposed again to the feature set with changed climatic conditions, but the species occurrences are replaced with the newly predicted distributions of all species. However, as the new distributions have arisen from a static process, which assumed the distribution of all other species remained the same, this would still provide a very rough approximation.
An alternative solution would be to create a dynamic version of the multi-classification model in the form of a Recurrent Neural Network (RNN) (27 ). RNNs are a type of neural network suitable for modelling sequential data. They have been very successful in language processing (69, 70 ), but are also used to approximate dynamic processes in climate modelling in computationally efficient ways (71 ). In an ecological setting, Lee & Donghyun (72 ) created RNN models to predict algal blooms in South Korean river systems. For DL-SDM, a dataset could be created that starts from current abiotic conditions, where at each time-step conditions are slightly changed in line with a certain climate scenario until reaching the predicted conditions in, for example, 2050. The co-occurrence features should be updated during each time step. At the first time step the current distributions of all species are used. The model then outputs the newly predicted distribution for all species under this small change in environmental conditions. In the next step, the co-occurrence feature values should be replaced by the newly predicted distribution values and so on until the end of the sequence. A potential downside to this approach would be that the RNN would initially have to be trained on a historic time-series dataset as well. This requires an explicit temporal link between the occurrence samples and environ-mental feature values that might be difficult to establish.

Other applications
Another potential application of DL-SDMs is to combine them with image recognition techniques for automated species identification. Tools based on Convolutional Neural Networks (CNNs) are being developed to aid in species identification both in the field (73 ) and in museum collections (74 ). DL-SDM could provide an additional measure of certainty to a proposed identification by the CNN, by returning the probability that the species actually occurs at the locality the specimen was collected.
If the process can be linked to a taxonomic relational database, another closely related species with a higher probability of occurrence at the specimen's locality might then be proposed to the user.

Conclusions
In this report we provided a proof of concept of DL-SDM using both a limited and an extended dataset of occurrences of the world's ungulates. The required input consists of a selection of rasterized abiotic and biotic environmental predictor variables of the same spatial resolution. Notably, co-occurrences with other species proved an important environmental predictor for many of the ungulate species. Increasing the sample size, including species co-occurrences and improving pseudo-absence sampling resulted in large improvements in model performance and gave realistic distributions for species across a range of occurrence sample sizes. Implementing DL-SDMs on a larger scale will likely require the model to be transformed to a single multi-classification model.

Recommendations
To further explore the potential of DL-SDM, we recommend to (1) apply and adjust the current model to other groups of organisms, for example plants (2) to construct a single large multi-classification model for all species and compare it's performance against the single species models in this research.
(3) construct a temporal version of the model using a RNN framework to allow for modelling distribution shifts, for example under climate-change.

Acknowledgements
We would like to thank Elke Hendrix for discussing and clarifying the particularities of the limited ungulate dataset and MaxEnt model. Next to this, we would like to thank Dr. Anouschka Hoff from Wageningen University for the supervision provided to MR during the internship in which this work was produced. 7 Glossary 1. Accuracy. A metric for evaluating classification models, representing the fraction of predictions the model predicted correctly. In the binary classification example in this research, accuracy can be expressed as: T P + T N T P + T N + F P + F N • TP = True Positive, a positive example correctly predicted positive.
• TN = True Negative, a negative example correctly predicted negative.
• FP = False Positive, a negative example falsely predicted as positive.
• FN = False Negative, a positive example falsely predicted as negative. 3. Batch size. The number of training examples the model works through before updating its internal parameters. In the case of DNNs, this refers to the updating of the network weights.
• Balanced batch generator. In an imbalanced dataset there are more samples of certain classes than others. This can have a negative effect on the predictive capabilities of a model, as during training it can be exposed to many batches that do not contain samples from the minority class. The model will still learn to achieve a high accuracy, but does this by simply classifying each sample as belonging to the majority class. A balanced batch generator resamples the dataset, usually by undersampling the majority class, to reduce the class imbalance in the batches passed to the model during training (50 ).
4. Epochs. The number of complete passes the model makes through the entire training dataset.
5. Learning rate. First read loss function. Optimization algorithms find the optimal set of parameter values required to minimize the loss function. As reviewing the change in model loss for each potential parameter value is inefficient, a step size is defined: the learning rate, which the optimizer uses to determine the next set of candidate weights to evaluate. As seen in the left figure below of a simplified loss landscape from Baughman & Liu (77 ), using a learning rate that is too low will result in very slow convergence and risks getting stuck in local minima, while a learning rate that is too high will overshoot the global minimum. The right figure from Li et al. (78 ) better illustrates the complex loss landscape of neural networks.
6. Loss function. In training the deep neural network, we are trying to minimize errors in classification. The loss function is used to evaluate the error value for a candidate set of network weights and bias terms identified by the optimization algorithm. An often used loss function for classification models is the cross-entropy function.
• Cross entropy. Measures the performance of a classification model that outputs a probability value between 0 and 1. If there are only two classes as in the current research, the cross-entropy function equals the log-loss function (26 ), expressed for a single training instance as: Where l(θ) is the estimated loss for the candidate set of weights and bias terms θ. The function is averaged over all training instances to estimate the cost function for the whole training set.
As explained by Géron (26 ), -log(x) increases as x aproaches 0 and decreases as x approaches 1. Therefore the cost will be high if the model estimates a probability close to 0 for a positive instance and low if it estimates a probability close to 1.
7. Non-linear activation function. Functions used to introduce non-linearity into the network. As each neuron in each layer computes a weighted sum of its inputs, the output of a network would remain a linear function, irrespective of how many layers are added, if no non-linear activation function is applied.
• ReLU activation. Short for Rectified Linear Unit, ReLU is the most commonly used activation function. It takes the weighted sum z of the inputs of each neuron. If z is equal to 0, the output of the neuron will be 0, if z is larger than 0, the output of the neuron is simply the weighted sum z, as seen in the figure from Sharma (79 ) below.
ReLU (z) = max(0, z) • Sigmoid activation. Long the default activation function for neural-networks, it has now started to fall out of favour to the ReLU function. This is because for deeper networks, there is a vanishing gradient problem illustrated in the figure below the equation, from Arunava (80 ). As the input values for the sigmoid function become larger or smaller, the derivative of the function becomes close to zero. Starting from the last layer in the network, optimization algorithms computes the gradient of the loss function for each parameter in the network and uses these to update the parameters based on the learning rate. This process is called backpropagation. If a sigmoid activation function is used, the gradients get increasingly small as the algorithm goes to the lower layers in the network, meaning the lower connection weights remain unchanged and the model cannot converge to a good solution (26 ). Φ(z) = 1 1 + e −z 8. Optimization algorithms. Used to find the optimal set of parameter values (weights) in the network that minimize the loss function.
• Gradient descent optimization. The most common optimization algorithm used. Given a loss function l evaluated for a set of weights and bias terms θ, gradient descent adjusts θ using the following rule: Where ∇l (θ) is the gradient of the loss function. This gives the direction in parameter space to increase the loss. Instead, gradient descent moves in the opposite direction (-∇l (θ)) based on the step size or learning rate η (81 ).
• Gradient descent with momentum. The Gradient descent algorithm was improved by including the concept of momentum, that can speed up movement along directions of strong improvement and better avoid local minima. This was achieved by introducing two additional parameters v and µ (82, 83 ). Where v is velocity, the exponential moving average of current and past gradients up to time step t, and µ is the momentum coefficient, between 0 and 1, that restricts the velocity. The updated rule becomes: • Nesterov optimization. Variant of Gradient Descent with momentum that can speed up training and improve convergence. It measures the gradient of the cost function slightly ahead in the direction of the momentum (26 ). Notation wise, this difference is expressed in the update of the velocity vector v (83 ): • Adagrad optimization. An optimizer providing an adaptive learning rate. Whereas the previous optimization algorithms used a single learning rate η for the set of parameters θ, Adagrad uses different learning rates for every parameter at every time step (84 ). The update rule for a single parameter can be expressed as : ii , is a matrix containing the sum of the squares of the gradients of parameter θ i up to the current time step and is a smoothing term, typically 10 −10 , to prevent division by zero (85 ). For example, if three steps have been taken so far for parameter θ 1 , then the notation becomes: θ 3+1,1 = θ 3,1 − η ∇l(θ 1,1 ) 2 + ∇l(θ 2,1 ) 2 + ∇l(θ 3,1 ) 2 ∇l(θ 3,1 ) As with increasing time steps the sum of the gradients in the denominator also increases, the learning can become infinitesimally small over time and stop before the global optimum is reached (85 ).
• RMSProp optimization. Designed to handle the problem of Adagrad's increasingly small learning rates. It only accumulates the gradients from the most recent iterations. In the expression, the diagonal matrix G t is replaced by an exponentially decaying average over the past squared gradients (85 ). The parameter β is the decay rate, often set to 0.9 (26 ).
• Adam optimization. An optimizer combining the concepts of momentum and RMSProp. It stores an exponentially decaying average of both past gradients m t and of past squared gradients v t (85,86 ).
t β 1 represents the momentum decay and is usually set to 0.9, whereas the scaling decay β 2 is usually set to 0.99 (26 ). However, with these values v t and m t are biased towards zero during the first few time steps (86 ). Therefore bias corrected values are used instead in the Adam update rule: Regularization. Introduces a penalty term in the model's loss function that penalizes model complexity to prevent overfitting.
• L1 regularization. Regularization term encouraging feature sparsity by setting the weights of the least important features to zero if parameter α is sufficiently large (87 ). The regularization part of the loss function can be expressed as: | θ i | • L2 regularization. Regularization term that encourages weight values close to zero and the mean of the weights towards zero with a gaussian distribution. L2 regularization penalizes the squared values of the weights. • Drop-out regularization. The approach randomly sets the activation of a collection of neurons to zero during training, dropping all their connections in the network during a single pass through the network and weight updating (88 ). This prevents neurons in the network from over-specializing on a specific feature in the training dataset, which results in poor model generalization. Typically drop-out rates between 0.1 and 0.5 are used. An example representation can be seen in the figure below from MIT (33 ).