Toward a Monte Carlo approach to selecting climate variables in MaxEnt

MaxEnt is an important aid in understanding the influence of climate change on species distributions. There is growing interest in using IPCC-class global climate model outputs as environmental predictors in this work. These models provide realistic, global representations of the climate system and projections for hundreds of variables (including Essential Climate Variables), drawing on observations from an array of satellite, airborne, and in-situ sensors. Unfortunately, direct use of this important class of data in MaxEnt modeling has been limited by the large size of climate model output collections and the fact that MaxEnt can only operate on a relatively small set of predictors stored in a computer’s main memory. In this study, we demonstrate the feasibility of a Monte Carlo method that overcomes this limitation by finding a useful subset of predictors in a larger, externally-stored collection of environmental variables in a reasonable amount of time. Our proposed solution takes an ensemble approach wherein many MaxEnt runs, each drawing on a small random subset of variables, converge on a global estimate of the top contributing subset of variables in the larger collection. In preliminary tests, the Monte Carlo approach selected a consistent set of top six variables within 540 runs, with the four most contributory variables of the top six accounting for approximately 93% of overall permutation importance in a final model. These results suggest that a Monte Carlo approach could offer a viable means of screening environmental predictors prior to final model construction that is amenable to parallelization and scalable to very large data sets. This points to the possibility of near-real-time multiprocessor implementations that could enable broader and more exploratory use of global climate model outputs in environmental niche modeling and aid in the discovery of viable predictors.


Introduction
MaxEnt is an important aid in understanding the influence of climate change on species distributions and abundance. Based on a machine learning approach to maximum entropy modeling, the software allows researchers to construct ecological niche models (ENMs) that estimate the habitat suitability of a species using occurrence data and a set of environmental variables [1][2][3]. The need for reliable climate projections in this work is leading to greater use of global climate model (GCM) outputs as predictors [4]. While creating important new opportunities for research, this trend is also creating a "Big Data" challenge for the MaxEnt community [5]. The largest and most sophisticated GCMs, sometimes referred to as "IPCC-class" models because of the critical role they play in the work of the Intergovernmental Panel on Climate Change (IPCC), produce petabyte-scale data sets comprising hundreds of variables, a volume that vastly exceeds what is generally used in bioclimatic modeling today [6][7][8]. Moreover, the direct outputs of these systems are being transformed into derived climate data products on an unprecedented scale [9,10]. As a result, model tuning and variable selection, which are crucial aspects of any species distribution modeling effort, are becoming more complicated issues [11].
Part of the problem lies in the fact that MaxEnt, like many machine learning systems, acts on its inputs as a single piece: predictors and observations must be memory-resident for the program to work [12]. This results in run-times and space requirements that scale linearly with the size of a model's inputs. In most cases, these scaling properties pose few difficulties.
But when the number of predictors under consideration becomes large, compute times can become impractically long, models can become overly complex, and efforts to understand any particular variable's contribution to model formation, either as an aspect of model analysis or as a way of selecting subsets of variables for further model refinement, can become challenging [11,13-16]. Clearly, an effective way of dealing with large environmental data sets that preserves the many advantages of MaxEnt while overcoming its current limitations would benefit the MaxEnt community.
In this study, we investigated the potential of a Monte Carlo method to help accomplish such an outcome. Monte Carlo optimizations are a common way of finding approximate answers to problems that are solvable in principle but lack a practical means of solution [17]. Our objective was to find a useful subset of predictors in a larger collection of environmental variables in a reasonable amount of time. Our proposed solution takes an ensemble approach wherein many MaxEnt runs, each drawing on a small random subset of variables, converge on a global estimate of the top contributing subset of variables in the larger collection.
Preliminary results suggest that the method reliably selects a subset of the original predictors that is capable of producing a well-tuned, parsimonious model of high quality.
Since each model run is independent and uses a set number of variables, the method is totally parallelizable, independent of the scaling properties of MaxEnt, and amenable to implementation as an external memory algorithm. If proved to be effective, such an approach could provide a practical way of constructing MaxEnt models when there is a need to select a small set of predictors in a pool comprising a potentially very large number of predictors.
This could lead to greater use of climate model outputs by the ecological research community and aid the search for viable predictors when variable selection through ecological reasoning is not apparent.

Materials and Methods
Cassin's Sparrow (Peucaea cassinii Woodhouse, 1852) is an elusive resident of arid shrub grasslands of Middle America and the Southwestern United States [18]. Desert-adapted birds, such as Cassin's Sparrow, appear to be especially vulnerable to climate change [19,20]. We chose Cassin's Sparrow as a target for our study as an example of a species whose study could benefit from the technical advances described here. Occurrence data was obtained from the Global Biodiversity Information Facility (GBIF) for the year 2016 [21].
After removing replicates, a total of 1865 records were acquired. To limit spatial extent and avoid pseudo-replication, we thinned the points to a radius of 16 km, which resulted in a total of 609 observations. For predictors, we used WorldClim's standard 19 bioclimatic (bioclim) environmental variables at a resolution of 2.5 arc-minutes throughout (Table 1) [22]. We did not attempt to minimize collinearity by removing variables, both because the current study focuses on an assessment of stochastic down-selection from a full variable set and because MaxEnt has a demonstrated ability to account well for redundant variables [23].
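The distance-based thinning step can be illustrated with a minimal sketch. This is not the workflow actually used in the study (dedicated R packages exist for this task); the function names and the greedy strategy below are our own, shown only to make the 16 km rule concrete:

```python
import math
import random

def haversine_km(p, q):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def thin_points(points, radius_km=16.0, seed=0):
    """Greedy spatial thinning: keep a point only if it lies at least
    radius_km from every point already kept; shuffle first so the
    retained set is not biased by input order."""
    pts = list(points)
    random.Random(seed).shuffle(pts)
    kept = []
    for p in pts:
        if all(haversine_km(p, q) >= radius_km for q in kept):
            kept.append(p)
    return kept
```

Greedy thinning of this kind does not maximize the number of retained points, but it guarantees that no two kept occurrences fall within the chosen radius of one another.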
We used MaxEnt Version 3.4.1 [24], R Version 4.0.1 [25], the ENMeval Version 0.3.0 R package [26], RStudio Version 1.2.5033 [27], and ENMTools Version 1.4.4 [28], running on a 2.8 GHz Intel Core i7 MacBook Pro with 16 GB of memory. First, we developed a baseline model using the stand-alone MaxEnt program operated through its graphical user interface (GUI). MaxEnt users can apply various combinations of five mathematical transformations ('feature classes' or FCs) to predictor variables to enable more complex fits to the observational data. The available feature types for continuous variables are linear (L), quadratic (Q), hinge (H), product (P), and threshold (T) [1]. Users can also adjust a regularization multiplier (RM) to maximize predictive accuracy and offset the overfitting that FC adjustments can introduce. We applied MaxEnt's default FC and RM settings (i.e., the "Auto features" setting) with 10-replicate cross-validation and jackknife evaluation of variable importance. By default, MaxEnt uses all feature classes and a regularization multiplier of 1.0 when there are more than 80 training samples, which was the case here [24]. Ten thousand background points were selected from across the study area following the recommendations of Phillips et al. [29] and Fourcade et al. [30]. We determined the average permutation importance for each variable in three replicated runs.
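MaxEnt computes permutation importance internally; the sketch below only illustrates the general idea behind the statistic with a generic scoring function. `score_fn` is a stand-in of our own devising, not MaxEnt's API: each predictor column is shuffled in turn, the resulting drop in model score is recorded, and the drops are normalized to percentages as MaxEnt reports them.

```python
import numpy as np

def permutation_importance(X, score_fn, n_repeats=5, seed=0):
    """Importance of each predictor (column of X) as the average drop
    in score_fn(X) when that column is randomly permuted, normalized
    so the importances sum to 100."""
    rng = np.random.default_rng(seed)
    base = score_fn(X)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])          # permute one column in place
            drops[j] += max(base - score_fn(Xp), 0.0)
    drops /= n_repeats
    total = drops.sum()
    return 100.0 * drops / total if total > 0 else drops
```

A predictor whose permutation leaves the score unchanged contributes nothing to the model under this measure, which is what makes the statistic useful for ranking candidate variables.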
The top six predictors in the three-run ensemble were used to develop the final MaxEnt baseline model.
We then developed an alternative method to select the top six variables that is based on random sampling. We implemented our Monte Carlo approach as an R script that invokes MaxEnt through ENMeval, which provides convenient control over model settings, built-in evaluation metrics, and improved performance [15,26]. To reduce variability and isolate outcomes as much as possible to the effects of the sampling process, we adopted a fixed feature class and regularization multiplier configuration for all runs. The ensemble was organized into "sprints" of ten model runs each, with a tally table recording each variable's permutation importance and the number of times it was sampled. To process a sprint, we initialized each of its ten model runs with a random subset of environmental variables read from the filesystem. Random integers drawn from a uniform distribution over the range 1-19, corresponding to the 19 bioclim predictors, were used to make the selection. At the conclusion of each model run, the tally table was updated appropriately. At the conclusion of each sprint, we computed a MaxEnt model using the six predictors in the original starting set having the highest average permutation importance values at that point.
This sprint process was repeated 100 times to produce a complete ensemble of 1000 model runs. We assessed the algorithm's performance in two such ensembles. In the first, we chose two random variables for each sprint run; in the second, six random variables were used for each run. This resulted in an overall total of 2000 model runs.
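The sprint-and-tally procedure can be sketched as follows. The study's actual implementation was an R script driving MaxEnt through ENMeval; here `run_maxent` is a hypothetical stand-in for a fitted model that returns per-variable permutation importances, and the structure (100 sprints of 10 runs, random k-variable subsets, averaged tallies) mirrors the description above:

```python
import random
from collections import defaultdict

N_VARS, RUNS_PER_SPRINT, N_SPRINTS, TOP_N = 19, 10, 100, 6

def monte_carlo_select(run_maxent, k, seed=0):
    """Tally permutation importances over N_SPRINTS sprints of
    RUNS_PER_SPRINT runs, each fit on a random k-variable subset,
    and return the TOP_N variables by average importance.
    run_maxent(subset) must return {variable_index: importance}."""
    rng = random.Random(seed)
    totals = defaultdict(float)  # summed permutation importance per variable
    counts = defaultdict(int)    # number of times each variable was sampled
    for _ in range(N_SPRINTS * RUNS_PER_SPRINT):
        subset = rng.sample(range(N_VARS), k)
        for var, imp in run_maxent(subset).items():
            totals[var] += imp
            counts[var] += 1
    avg = {v: totals[v] / counts[v] for v in totals}
    return sorted(avg, key=avg.get, reverse=True)[:TOP_N]
```

Because each iteration of the loop is independent of every other, the loop body is exactly the unit of work that could be farmed out to separate processors.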
The predictive distribution maps produced by the models were judged for reasonableness based on first-hand knowledge of the species, its habitat preferences, and known range [31]. We further compared model predictions to observational records from Cornell Lab's eBird citizen-scientist database [32]. We used the area under the receiver operating characteristic curve (AUC) [33] as an indication of a model's classification accuracy (higher values indicating greater accuracy) and the Akaike information criterion corrected for small sample size (AICc) [34] as a measure of relative explanatory power (lower values indicating less information loss). Model similarity was compared with Warren's I-statistic [35] and Schoener's D statistic [36] (higher values in both indicating greater similarity) using ENMTools. Single-processor run times were recorded to aid our understanding of algorithm performance and help identify opportunities for multiprocessor parallelization.
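Both overlap statistics have simple closed forms over a pair of suitability surfaces normalized to sum to one: Schoener's D = 1 - (1/2)Σ|p_i - q_i| and Warren's I = 1 - (1/2)Σ(√p_i - √q_i)². A minimal illustration (our own code, not ENMTools, which was used in the study):

```python
import numpy as np

def niche_overlap(s1, s2):
    """Schoener's D and Warren's I for two habitat-suitability maps,
    each normalized to sum to 1 and compared cell by cell; both
    statistics range from 0 (no overlap) to 1 (identical surfaces)."""
    p = np.asarray(s1, dtype=float).ravel()
    q = np.asarray(s2, dtype=float).ravel()
    p, q = p / p.sum(), q / q.sum()
    D = 1.0 - 0.5 * np.abs(p - q).sum()
    I = 1.0 - 0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()
    return D, I
```

Identical surfaces yield D = I = 1, while surfaces with no overlapping suitable cells yield D = I = 0.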

Results
Across the three baseline runs, bio02, bio05, and bio14 each appeared among the ranked contributors in only one run, at 10th place, while bio18 showed strong dominance throughout. When performance was averaged across all three runs, the top six contributory variables in the ensemble collectively accounted for 65% of overall permutation importance (ensemble average). In descending order of importance, the top six predictors were bio18, bio03, bio10, bio15, bio11, and bio06. When these six top-contributing variables were used in a final MaxEnt run, the model's four most contributory variables (bio18, bio03, bio10, and bio15) accounted for approximately 86% of overall permutation importance, and its predicted habitat suitability distribution corresponded well with what is known about the natural history of the species and observational records for Cassin's Sparrow for the year 2016 (Fig 1). Greater variability in the composition of the top six subset was seen in Ensemble #1, where two random variables at a time were selected for each sprint run (Table 2, Fig 2). In Ensemble #2, where six random variables at a time were selected for the MaxEnt runs, the top six variables were identified by the 25th sprint and had settled into their final rank order by sprint 54 (Fig 3). Ensemble #2 appeared to produce the best overall results and shared four variables in common with the top six selected by the MaxEnt baseline (bio03, bio06, bio11, and bio18) (Table 2). Ensemble #2's final model had the lowest overall AICc, and its four most contributory variables accounted for approximately 93% of overall permutation importance, the highest attained overall.
Ensemble #1 had only one variable in common with the top six selected by both the baseline run and Ensemble #2. What accounts for this difference is not immediately apparent; however, we speculate that the random pair-wise comparisons occurring in Ensemble #1 may alter the relative global influence of the collinearities known to exist in the bioclim variables [37][38][39]. The average number of times a variable was sampled appeared to have a marginal, positive influence on resulting model quality once an adequate minimum was attained.
Ensemble #2 results suggest that at least 80 uniformly distributed samples per starting-set variable are needed to identify a reasonable top six set of variables; the best overall model resulted from over 300 samples per variable (Table 2).
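These sampling counts follow directly from the design: when every run draws k of the 19 variables uniformly at random, each variable is expected to be drawn n·k/19 times over n runs. A quick consistency check against the figures above:

```python
N_VARS = 19  # size of the bioclim predictor pool

def expected_samples(n_runs, k):
    """Expected number of times each variable is drawn when every
    run samples k of the N_VARS variables uniformly at random."""
    return n_runs * k / N_VARS

print(round(expected_samples(1000, 6)))  # Ensemble #2 (1000 runs): 316 per variable
print(round(expected_samples(250, 6)))   # by sprint 25 (250 runs): 79, near the 80-sample threshold
print(round(expected_samples(1000, 2)))  # Ensemble #1 (1000 runs): 105 per variable
```

The "over 300 samples per variable" figure thus corresponds to the full 1000-run Ensemble #2, and the roughly 80-sample threshold to the point where its top six had stabilized.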

Discussion
The most striking outcome of the study is the similarity in results. Final models in the MaxEnt baseline and the Monte Carlo ensembles produced predicted habitat suitability distributions that are nearly indistinguishable from one another (Fig 1). Internal metrics likewise reveal little difference in outcomes, with AUC values ranging only from 0.801 to 0.818 and AICc ranging from 12,152 to 12,222, suggesting that the traditional MaxEnt runs and the Monte Carlo approach both produced reasonable models (Table 2). The two approaches also each identified four variables that collectively contributed more than 80% to the formulation of their respective models. Across the board, models showed a high degree of similarity in Schoener's D and the I-statistic (Table 3). Because each model run in a Monte Carlo ensemble is independent, the workload is what is sometimes referred to as "embarrassingly parallel," which makes it relatively straightforward to implement in a cluster computing environment. If 1000 processors were recruited into service, which is becoming increasingly convenient with the proliferation of multiprocessor, high-performance cloud computing, a 1000-run ensemble could conceivably take about as long as a single MaxEnt run.
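Because the runs share no state, distributing an ensemble across processors requires no coordination between workers. A minimal sketch using Python's standard multiprocessing pool, with a placeholder in place of an actual MaxEnt fit (the study itself ran single-processor R code; this only illustrates the parallel pattern):

```python
import random
from multiprocessing import Pool

def one_run(seed):
    """One independent ensemble member: draw a random six-variable
    subset (a real implementation would fit a MaxEnt model here
    and return its permutation importances)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(19), 6))

if __name__ == "__main__":
    # Runs are independent, so the pool needs no inter-worker communication;
    # wall-clock time scales roughly as n_runs / n_processors.
    with Pool() as pool:
        results = pool.map(one_run, range(100))
    print(len(results))  # 100 completed runs
```

Seeding each worker from its run index keeps the ensemble reproducible regardless of how the runs are scheduled across processors.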
The potential significance of this advantage becomes apparent when one considers the method's use with large collections of environmental data. The Monte Carlo approach could make screening such collections practical, particularly if implemented as a high-performance cloud service.
The use of IPCC-class climate model outputs in efforts to assess the impacts of climate change on biodiversity and other ecosystem processes is growing. Exploring the potential of these massive data sets, expanded use of ensemble modeling, and the actual work of fitting models for the thousands of species scientists wish to study will require hundreds to thousands of projections [4,5]. An improved capacity to use large environmental data sets in MaxEnt modeling would benefit this work. We are encouraged to think that innovative use of Monte Carlo techniques might provide a helpful means of meeting this challenge.

Conclusions
This small-scale, proof-of-concept study leaves many practical and theoretical questions unanswered. Preliminary results, however, suggest that a Monte Carlo method could offer a viable means of selecting environmental predictors for MaxEnt models that is amenable to parallelization and scalable to large data sets, including externally-stored collections. This points to the possibility of near-real-time multiprocessor implementations that could enable broader and more exploratory use of global climate model outputs in environmental niche modeling and aid in the discovery of viable predictors. Next steps will focus on implementing this capability in NASA's Advanced Data Analytics Platform (ADAPT) science cloud, evaluating the method's behavior using products generated by the Goddard Earth Observing System, Version 5 (GEOS-5) modeling system, extending stochasticity to feature class and regularization multiplier selection, developing automatic stopping rules, and evaluating the method's effectiveness in addressing research questions relating to climate change influences on Cassin's Sparrow abundance and distribution.