Unsupervised Machine Learning for Species Delimitation, Integrative Taxonomy, and Biodiversity Conservation

Integrative taxonomy combining data from multiple axes of biologically relevant variation is a major recent goal of systematics. Ideally, such taxonomies would be backed by similarly integrative species-delimitation analyses. Yet, most current methods rely solely or primarily on molecular data, with other layers often incorporated only in a post hoc qualitative or comparative manner. A major limitation is the difficulty of deriving and implementing quantitative parametric models linking different datasets in a unified ecological and evolutionary framework. Machine Learning methods offer flexibility in this arena by learning high-dimensional associations between observations (e.g., individual specimens) across a wide array of input features (e.g., genetics, geography, environment, and phenotype) to delineate statistical clusters. Here, I implement an unsupervised method using Self-Organizing (or “Kohonen”) Maps (SOMs). Recent extensions called SuperSOMs can integrate an arbitrary number of layers, each of which exerts independent influence on the two-dimensional output clustering via empirically estimated weights. These output clusters can then be delimited into K significant units that are interpreted as species or other entities. I show an empirical example in Desmognathus salamanders with layers representing alleles, space, climate, and traits. Simulations reveal that the SOM/SuperSOM approach can detect K=1, does not over-split, reflects contributions from all layers with signal, and does not allow layer size (e.g., large genetic matrices) to overwhelm other datasets, desirable properties addressing major concerns from previous methods. Finally, I suggest that these and similar methods could integrate conservation-relevant layers such as population trends and human encroachment to delimit management units from an explicitly quantitative framework grounded in the ecology and evolution of species limits and boundaries.


Introduction
Machine Learning (ML) has had a decades-long impact on ecology and evolution ( parametric model-based inference versus algorithmic estimates which treat the data-generating 51 mechanism as unknown (Breiman, 2001). Classical statistics in biology has historically relied on 52 data models, which are typically parameterized based on conceptual understanding of the 53 relevant evolutionary mechanisms and genetic processes. In contrast, the rise of ML methods 54 corresponds with the ascendance of "big" datasets and the recognition that empirical datasets 55 often represent complex, non-linear interactions between variables. The power and flexibility of 56 various ML approaches can easily characterize higher-order patterns in data without constraint 57 from an a priori data model that may not include relevant parameters because they are unknown 58 to investigators or difficult to visualize and implement. While this does not obviate the need for 59 pertinent biological mechanisms to draw robust scientific conclusions, it offers the potential for 60 rapid assessment of significant patterns. Such results can then be used for more precise 61 parametric hypothesis-testing and may themselves also reveal complex empirical patterns and 62 potential mechanisms that are nearly invisible to simpler classical approaches. 63 the matrix of input features is divided up across multiple distinct layers, each of which 162 contributes to the singular output grid (Wehrens and Buydens, 2007). A distance weight is 163 calculated for each layer separately, rescaling contributions from the layers such that differing 164 scales between layers do not overwhelm contributions to the outputs from any one layer, and the 165 relative importance of each layer is estimated directly. I employ this strategy to implement UML 166 species delimitation incorporating data from SNPs, spatial autocorrelation, climate variables, and 167 potentially other features such as traits or phenotypes. contributions, I expand on this concern here. Previous studies have found that Gaussian 173 neighborhoods (rather than bubble or heuristic) and linear learning-rates typically yield optimal 174 results under a wide range of conditions (Natita et al., 2016; Stefanovič and Kurasova, 2011). 175 Consequently, I employ these as the default conditions. The layer distance weights (which I term 176 'w') are calculated automatically to normalize contributions to the final output grid. In practice 177 they vary by orders of magnitude, with important layer like alleles near zero, and large values for 178 unimportant layers. To scale them for comparison, I therefore used sqrt(1/w). 179 The learning rate as implemented in 'kohonen' takes an initial and final value for a that 180 declines linearly over 'rlen' updates, the run length in steps. Learning strategies other than linear 181 are possible, but generally have little effect (Natita et al., 2016; Stefanovič and Kurasova, 182 2011)and are not implemented in 'kohonen.' Given that the change from a0 to a1 is relative to 183 the number of steps, the run length typically does not have much of an influence provided that it exceeds whatever initial period is needed to learn the global structure of the data. Therefore, 185 longer runs typically do not yield increased learning accuracy; there is little additional benefit 186 from a longer "burn-in" as in other traditional types of analyses such as MCMC sampling. 187 Nevertheless, I varied rlen and a by orders of magnitude to determine any possible effects, by 188 comparing their output in terms of mean distance to nearest unit across runs for different 189 combinations of each variable. Similar to previous authors (Natita et al., 2016; Stefanovič and 190 Kurasova, 2011), for rlen=100 I tested a0.1,0.1, a0.5,0.1, a0.9,0.1, a0.9,0.01, a0.5,0.01, and a0.5,0.5. Finally, I calculated species assignment probabilities from the classification frequency of 195 each individual into the range of sampled K, again employing the logic of DAPC (Jombart et al.,196 2010; see Pyron et al., 2023) to model uncertainty in assigning observed input features to 197 hypothetical delimitation models. In a DNA-only SOM, these are analogous to individual 198 genomic ancestry coefficients (e.g., Beugin et al., 2018;Frichot et al., 2014;Pritchard et al., 199 2000). In a SuperSOM that reflects the influence of multiple data layers, these are not solely 200 indicative of genetic ancestry and might be better thought of as "species coefficients" which may 201 be useful for a variety of further ecological and evolutionary hypothesis testing. New data can be 202 mapped to an existing SOM, but retraining is fast and will likely represent the simplest solution 203 for updated datasets in most instances rather than attempting to map across replicates. 204 In a small change from the original implementation (Pyron et al., 2023), I now calculate 205 BIC for the values of 'K' in each learning replicate, rather than relying on raw WSS. When K=1 206 (lowest BIC), this value was chosen. Otherwise, the optimal 'K' for each run was estimated using 207 the "elbow" method under the 'diffNgroup' criterion employed by DAPC (Jombart et al. 2010) 208 from the BIC estimate for grid-cell neighbor distances as described above. I also include DBIC 209 using an approximation of the second-order rate of change proposed by Evanno et al. (2005) for 210 use as heuristic to determine when larger values of K result in non-zero net improvement, but this 211 is only for comparison and is not used to calculate optimal K across runs. Consequently, K can 212 therefore also vary across runs, and one can represent this uncertainty in the summarized output. 213 Consequently, each K and the estimated ancestries thereof are sampled in rough proportion to 214 their occurrence in the "posterior" distribution of potential learning outcomes for a two-215 dimensional map. This strategy allows one to incorporate variation in both K across runs and the 216 uncertainty in assigning individuals to each of the K clusters within runs. 217  For the problem of label matching across runs, I performed an initial cluster assignment for a 228 range of K using DAPC to obtain a fixed reference point, though this was not used in the SOM 229 modeling. After obtaining the SOM replicates, I synchronized labels using an approximate 230 implementation of a non-shortcut CLUMPP-like algorithm (Jakobsson and Rosenberg, 2007) 231 which minimized the distance between the DAPC labels and each SOM replicate. contributions from space, climate, and traits ( Fig. 1, 2). Of the latter three, phenotype has a larger 284 impact than climate, and both are substantially more important than space alone. Converting learning rate and run length revealed relatively little variation outcomes (Fig. 3). Decreased 290 performance (increased relative distances to closest unit) is observed for very low and very high 291 values (a0.5,0.01 and a0.5,0.5) of final learning rate, but increasing run length has little effect from 292 100 to 1,000. Finally, comparison of individual ancestry coefficients from sNMF and the DNA-293 only SOM reveals a tight linear relationship for hybrids between ~30-70% ancestry (Fig. 4a). 294 Outside of this range, SOM coefficients are essentially bimodal in predicting ~100% identity. In 295 contrast, trait based SuperSOM "species coefficients" are essentially binary, predicting dominant 296 group membership at ~100% identity (Fig. 4b) based on integrated datasets.

Simulated performance 327
The simulations reveal several desirable properties for a species-delimitation method in 328 comparison with existing approaches. First, conversion to optimal cluster selection using BIC 329 yields strong support for K=1 when no genetic structure is present, a major advantage over many for higher values did not differ from 0 ( Fig. 5a-c). Second, this advantage is maintained even 333 when paired with significant geographic, environmental, and phenotypic structure that would not 334 otherwise likely be indicative of speciation (Fig. 5d-f). Third, the size of the alleles layer alone 335 does not necessarily overwhelm the contribution of other layers when relatively little genetic 336 divergence is present. The 4-layer trait based SuperSOM including the null alleles layer 337 estimated K=2-5 with K2=52% of samples and the highest contributions again coming from 338 traits, climate, and space, with alleles having the smallest weight (Fig. 5g-l). 339  a44  a42  a35  a30  a29  a49  a34  a5  a65  a64  a32  a68  a51  a6  a60  a40  a27  a23  a21  a19  a16  a14  a12  a9  a1  a31  a59  a25  a17  a70  a71  a57  a55  a54   l) previous researchers addressing the problem of species delimitation, particularly for algorithmic 353 methods relying solely or primarily on genetic data (Luo et al., 2018). The method proposed here 354 has several major advantages over existing approaches. As with most ML algorithms, it is 355 extremely flexible and can be implemented over a wide variety of data types and formats without to exhibit a more sigmoidal relationship with the sNMF estimates; outside of ~30-70% mixed 366 ancestry, SOMs predict ~100% parental identity given that there is very little uncertainty in 367 cluster assignment in the tails. SuperSOMs heighten this effect, producing nearly binary species 368 coefficients for most populations. This is unsurprising given that even admixed populations near 369 the hybrid zone tend to be either montane or Piedmont and have the strongly diagnostic character 370 of 4-5 versus 6-7 larval spots in D. monticola compared to D. cheaha (Pyron et al., 2023). I do 371 note that the sample size here is relatively small, as individual ancestry coefficients were already 372 sharply bimodal with relatively few hybrid or admixed individuals. 373 The approach described here is also robust to a wide range of hyperparameter values, 374 suggesting that learning performance will be trustworthy across a range of empirical conditions. . Second is the concern that 394 common evolutionary models will over-split species based on population-level divergence 395 (Sukumaran and Knowles, 2017). Yet, while datasets will almost always predominantly reflect 396 signal from large multi-locus or genome-scale markers (Fujita et al., 2012), there is no inherent limit on the amount of informative environmental and phenotypic data (or behavior, physiology, 398 ecological interaction, etc.) that can be gathered for most taxa. 399 Third is the more durable issue of disagreement between various species-delimitation 400 methods which often induce hesitancy or indecision for researchers (Carstens et al., 2013). These 401 maybe minor (e.g., support for 6 versus 7 species), or represent major conflict or contradictory 402 outcomes. In contrast to concerns of "inflation" or "over-splitting" which are often easily 403 addressed by integrative analyses in numerous taxonomic groups, the problem of congruence has 404 not been widely solved. Yet, I predict it will abate in the future for three major reasons. One is 405 that datasets continue to grow; historically, conflicts have often arisen when sampling only a few 406 loci and traits. Two, a great deal of conflict has arisen from methods employing different strict 407 criteria, thresholds, or models for delimiting species based on barcode gaps, branch lengths, or 408 Consequently, I propose the "taxonomic convergence hypothesis." For most groups that 415 have a well-characterized natural history, integrative taxonomic approaches that combine 416 multiple data types across a broad range of biologically relevant axes of variation will eventually 417 tend to yield congruent delimitation models as datasets grow in size. The use of ML approaches 418 such as the SuperSOM algorithm implemented here will likely help to facilitate and accelerate 419 that convergence. Situations where the TCH may not hold include poorly known taxa and 420 regions (e.g., under-explored tropical biomes, the deep sea) for which such data are unavailable 421 (Cordier et al., 2022;Goodwin et al., 2015), "dark" taxa and other groups (e.g., microbes) that 422 are difficult to characterize by ecology and phenotype (Page, 2016;Sanford et al., 2021), and 423 complex hybrid interactions that present misleading genotypic evidence (Chan et al., 2021) or 424 strain the concept of "species" itself (Bogart, 2019). 425 426

Directions for future research 427
The third factor listed above (complex genomic ancestry) is perhaps the most 428 immediately relevant for many groups and of concern to a wide variety of researchers. What

Conclusions 471
The SuperSOM method implemented here has several desirable properties, including the ability 472 to detect K=1 when population structure is absent, a tendency not to over-split even when larger 473 values of K have partial support, reflection of contributions from all layers with significant 474 signal, and control for layer size (e.g., large-scale genetic matrices) not to overwhelm other 475 datasets. This represents a substantial advance in the potential for integrative species delimitation 476 leading to integrative taxonomy by merging multiple biologically relevant axes of ecological and 477 evolutionary variation into a single unified analysis. While challenges remain for taxa and areas 478 that lack such data, groups that can't be easily quantified beyond genetics, and complex patterns 479 of genomic ancestry including hybridization, future research can extend these methods and 480 address such concerns. One possibility is merging supervised and unsupervised approaches, such 481 as pre-training classifiers on "good" species and hybrid genotypes to recognize such instances. 482 Furthermore, I suggest that this and similar approaches can easily be extended into the realm of 483 biodiversity conservation by including relevant data layers such as population trends and human 484 encroachment to delimit significant management units. While many systematists (myself 485 included) would not treat these as "species," the conceptual merger of biodiversity conservation 486 and species delimitation will allow for quantitative analysis of conservation units in a taxonomic 487 framework grounded in empirical estimation of species limits and extinction risk. 488

Acknowledgments 490
This research was supported in part by GW UFF awards and US NSF grants DBI-0905765, 491 DEB-1441719, and DEB-1655737. This work was completed in part with resources provided by 492 the High Performance Computing Cluster at The George Washington University, Information 493 Technology, Research Technology Services. Thanks to G. Bradburd (UMich) and R. Wehrens 494 (WUR) for assistance with theory and algorithms. 495