ABSTRACT
Model-based approaches to species delimitation are constrained by computational capacities as well as frequently violated algorithmic assumptions applied to biologically complex systems. An alternate approach employs machine learning to derive species limits without explicitly defining an underlying species model. Herein, we demonstrate the capacity of these approaches to identify phylogenomically relevant groups in North American box turtles (Terrapene spp.). We invoked several machine learning-based species delimitation algorithms and a multispecies coalescent approach to parse a large ddRAD sequencing SNP dataset. We highlight two major findings: 1) Machine learning delimitations were variable among replicates, but heterogeneity only occurred within major species tree clades; 2) in this sense unsupported splits echoed patterns of phylogenetic discordance among several species-tree methods. Discordance, as corroborated by previously observed patterns of differential introgression, may reflect biogeographic history, gene flow, incomplete lineage sorting, or their combinations. Our study underscores machine learning as a species delimitation method, and provides insight into how commonly observed patterns of phylogenetic discordance may similarly affect machine learning classification.
1. INTRODUCTION
Delineating species is undeniably crucial for systematics, ecology, and the study of evolutionary processes. Species are the currency of biodiversity, yet what constitutes a species is applied inconsistently (‘multiplicity’ of species definitions; Zachos 2018). This creates downstream issues for conservation (Mace 2004), where spurious ‘splitting’ or ‘lumping’ of taxa impedes the equitable allocation of limited resources. For example, over-splitting may redundantly allow threatened/endangered taxa to proliferate (Zachos et al. 2013; Sullivan et al. 2014), or conflate recovery goals more appropriately managed at separate scales along the species-population continuum (Coates et al. 2018).
On the other hand, inappropriate lumping can mask potential extinctions and the recognition of adaptive differentiation (Stanton et al. 2019). This can bias ‘true’ diversity, as reflected by regional or clade-specific differences in taxonomic ‘culture’ (e.g. biases in trait-delimitation or species-concepts), or ‘inertia’ (i.e. persistent knowledge gaps; Gippoliti et al. 2018). Both disproportionately promote ‘species at peril’ and subsequently drive inefficient resource allocation (Morrison et al. 2009; Garnett & Christidis 2017), viewed divisively as ‘taxonomic inflation’ (Agapow et al. 2004; Isaac et al. 2004). Nevertheless, species definitions/delineations are a critical dimension in conservation’s ‘agony of choice’ regarding resource allocation (Vane-Wright et al. 1991; Stanton et al. 2019). Delimiting species impacts not only finite resource allocation across programs but also efforts to recover and protect biodiversity.
Earlier work on species delimitation relied on few genes (or markers), resulting in limited scope. Although genomic approaches have shown promise (Allendorf et al. 2010), conflicting genome-wide signals from incomplete lineage sorting (ILS) and gene flow (Funk & Omland 2003) are still apparent. Contemporary species delimitation relies upon a probabilistic approach to model gene tree conflicts (i.e. multispecies coalescent; MSC) (Yang & Rannala 2010). However, some models assume all such conflicts stem from ILS, and thus ignore other sources such as introgressive hybridization.
Two popular packages, BPP and BFD*/SNAPP (Yang & Rannala 2010; Leaché et al. 2014a), are not only intractable with large datasets, but also seemingly over-split in the presence of high population structure (Sukumaran & Knowles 2017) or when broad, continuous geographic distributions are involved (Chambers & Hillis 2019). Therein lies the difficulty when species delimitation explicitly assumes an underlying process of speciation (i.e. not effectively modeled as an aspect of high-dimensionality data; Chafin et al. 2019). Here, we advocate recently developed machine learning algorithms as an alternative that does not rely upon a priori assumptions regarding the speciation process, but instead evaluates the process in a relatively unrestricted manner.
Machine learning is broadly divided into two components: supervised (SML) and unsupervised (UML). The former requires that a model be ‘trained’ with a priori designations, from which a classification model is derived and optimized for assignment of ‘unknown’ data. A popular SML approach invokes support vector machines (SVM), which partition groups using linear or non-linear vectors in multi-dimensional space. However, the requirement of an a priori classification scheme from which to train the model limits its applicability, particularly when the purpose is to define groups, as in species delimitation. Additionally, SVM is often computationally demanding, and hence slow with respect to alternatives (Suryachandra & Reddy 2016). UML, on the other hand, requires no a priori classification, and relies instead upon inherent patterns in the data.
Several popular UML classifiers lend themselves to the species delimitation problem, including: Random Forest (RF; Breiman 2001), t-distributed stochastic neighbor embedding (t-SNE; Maaten & Hinton 2008), and variational autoencoders (VAE; Derkarabetian et al. 2019), each with inherent strengths and weaknesses. For example, RF uses randomly replicated data subsets (in the form of pairwise distances) as a mechanism to develop binary ‘decision trees’ for a classification model. All randomly seeded decision trees are aggregated (=‘forest’), with classification decisions parsed as a majority vote amongst all trees. The random sub-setting approach is relatively robust to correlations among features (=summary statistics or principal components used for prediction) and model overfitting (=over-training the model where it does not generalize well with new data). One stipulation is that features must be of low occupancy and without undue noise (Rodriguez-Galiano et al. 2012). By contrast, the goal of t-SNE is to create diagnosable clusters in reduced-dimension space, typically a 2D plane extracted from a distillation of multi-dimensional data. Thus, it conceptually resembles methods such as principal components analysis [(PCA) (Maaten & Hinton 2008)].
Alternatively, VAE uses neural networks in an attempt to ‘learn’ or reconstruct multidimensional data patterns from a compressed, low-dimensionality (=‘encoded’) representation. Again, the approach conceptually resembles the dimensionality-reduction employed by various ordination techniques, but without linear and orthogonal constraints being imposed upon the informative components. This approach may also be more statistically interpretable (Derkarabetian et al. 2019), and thus more appropriate for the capture of variability within highly complex data. Yet, careful consideration must be paid to the derivation of parameters (e.g. neural network ‘depth’) that control the encoding process (Livingstone et al. 1997).
UML methods do not require a priori designations from which to train a classification model, yet may still be sensitive to priors and parameter settings. Thus, guidelines for appropriate application must be clearly defined, particularly regarding complex, empirical datasets. Two metrics that can influence the support of a given species delimitation hypothesis are concordance among algorithms (Carstens et al. 2013), and the susceptibility of the underlying algorithms to common sources of phylogenetic discordance. Some machine learning algorithms are robust to processes such as gene flow (Derkarabetian et al. 2019; Newton et al. 2020; Smith & Carstens 2020), but more empirical tests in complex systems are warranted. For example, performance can vary among datasets, with potential influences including data quality (e.g. missing data proportions) and size (Newton et al. 2020), historical demography, evolutionary history, and coalescent processes such as incomplete lineage sorting (Austerlitz et al. 2009). Thus, we empirically apply some recently developed software packages (CLADES: Pei et al. 2018; RF, t-SNE, VAE: Derkarabetian et al. 2019) and discuss their capacity for evaluating a group of species historically recalcitrant to taxonomic resolution.
1.1. The convoluted evolutionary history of Terrapene
North American box turtles (Emydidae: Terrapene) are primarily terrestrial, with a common name based on an anterior ventral hinge that allows the plastron (bottom part of shell) to dorsally close against the carapace (Dodd 2001). There are five currently recognized species (Minx 1996; Iverson et al. 2017): Eastern (Terrapene carolina), Ornate (T. ornata), Florida (T. bauri), Coahuilan (T. coahuila), and Spotted (T. nelsoni), with a sixth (T. mexicana) proposed (Martin et al. 2013, 2014). Terrapene carolina includes two subspecies (Woodland: T. c. carolina; Gulf Coast: T. c. major) that inhabit the eastern U.S. from the Mississippi River to the Atlantic Ocean, and south through the Gulf Coastal Plain (Fig. 1). The putative T. mexicana contains three subspecies (Three-toed: T. m. triunguis; Mexican: T. m. mexicana; Yucatan: T. m. yucatana) ranging across the southeastern and midwestern United States, the Mexican state of Tamaulipas, and the Yucatan Peninsula. The Ornate (T. ornata ornata) and Desert (T. o. luteola) box turtles inhabit the Midwest and Southwest U.S. plus the Northwest corner of México, while the Southern and Northern Spotted box turtles (T. nelsoni nelsoni and T. n. klauberi) occupy the Sonoran Desert in western México. Terrapene coahuila is semi-aquatic and restricted to Cuatro Ciénegas (Coahuila, México), and the Florida box turtles occur in Peninsular Florida.
Morphological analyses delineate T. carolina/mexicana as a single species, sister to T. coahuila (Minx 1992, 1996), with anecdotal support from a subset of genetic studies (Feldman & Parham 2002; Stephens & Wiens 2003). Alternatively, Martin et al. (2013) proposed the elevation of T. mexicana as a separate species, with T. coahuila as a subgroup within T. carolina. In this latter study, T. c. carolina was sister to T. c. major/T. coahuila, although potential gene flow was suspected between T. c. carolina and T. c. major due to mito-nuclear discordance. Accordingly, T. c. major was recently demoted to an intergrade population and its subspecific status removed (Butler et al. 2011; Iverson et al. 2017), but Martin et al. (2013) disagreed and a more recent study identified two potentially pure T. c. major populations in the Florida and Mississippi panhandles (Martin et al. 2020). Likewise, T. bauri (formerly T. carolina bauri) was recently elevated to a distinct species (Butler et al. 2011; Iverson et al. 2017), a possibility that Martin et al. (2013) acknowledged, albeit cautiously as weak statistical support and inconsistent phylogenetic placement were evident. For the sake of clarity, we herein follow the recommendations of Martin et al. (2013, 2014), considering T. c. major a distinct entity and bauri as a subspecies within T. carolina. The monophyly of T. o. ornata/luteola has also been questioned; Herrmann and Rosen (2009) suggested distinct lineages using microsatellite analyses, whereas Martin et al. (2013) suggested polyphyly and a lack of phylogenetic structure using mitochondrial (mt)DNA and nuclear (n)DNA sequences.
One likely reason for the historically enigmatic classification of T. carolina and T. mexicana is contemporary hybridization and introgression occurring within a hybrid zone in the southeastern U.S., with four taxa potentially involved (Auffenberg 1958, 1959; Milstead & Tinkle 1967; Milstead 1969). Some researchers (Fritz & Havaš 2013, 2014) interpret reproductive semi-permeability as evidence for lumping the southeastern taxa as a single species. However, divergent selection reinforcing species boundaries in some southeastern Terrapene has been suggested as a reason for re-examining their classificatory status, despite ongoing gene flow (Martin et al. 2014, 2020). By contrast, the close phylogenetic relationship between T. c. major and T. coahuila is less well understood. This may result from ‘ghost’ admixture of T. coahuila and/or T. c. major with the extinct T. c. putnami (Martin et al. 2013).
Herein, we evaluate the classification of Terrapene within the context of both UML and coalescent model-based species delimitation approaches. In doing so, we empirically validate the use of machine learning approaches with complex genetic datasets that, upon analysis, support a well-characterized phylogenetic hypothesis. Of note, observed species delimitation classifications are consistent with patterns of phylogenetic discordance, demonstrating an empirical application where the sources for such discordance may similarly affect machine learning.
2. MATERIALS AND METHODS
2.1. Sample collection, storage, and DNA extraction
Tissue samples were obtained from various museums, organizations, agencies, and volunteers (Table S1), then stored in 70%-95% ethanol or DMSO (di-methyl sulfoxide) buffer. Non-invasive samples were also acquired from live specimens, with those more invasive (e.g. toes, muscle) taken from road-kills. Upon receipt, samples were stored at −20°C. Genomic DNA was extracted via the following spin-column kits: DNeasy Blood and Tissue Kits (QIAGEN), QIAamp Fast DNA Tissue Kit (QIAGEN), and E.Z.N.A. Tissue DNA Kits (Omega Bio-tek). Extracted DNA was quantified using Qubit (Thermo Fisher Scientific) broad-range dsDNA fluorometry and tested for high-molecular weight DNA using gel electrophoresis.
2.2. DNA library preparation
We first estimated the expected number of loci recovered via ddRAD sequencing (ddRADseq) through in silico digestion (Chafin et al. 2018) of the painted turtle (Chrysemys picta) genome (Shaffer et al. 2013). This was done to optimize choice of base-cutters, size-selection bounds, and multiplex-size, thus maximizing loci coverage while promoting high sequencing depth. We also used the in silico digest to identify a candidate size-selection that avoids restriction sites lying within repetitive genomic elements (Chafin et al. 2018). The expected number of ddRADseq loci and depth of coverage were empirically verified by performing a restriction enzyme digest on 1,000ng of DNA for a representative panel of 24 samples, followed by fragment analysis (Agilent 4200 TapeStation).
Samples with sufficient DNA quantity (≥50 ng/µL) were processed via ddRADseq protocol (Peterson et al. 2012). Between 500 and 1,000 ng of genomic DNA per sample was digested using two restriction enzymes, PstI (5’-CTGCA|G-3’) and MspI (5’-C|CGG-3’). Following a digestion at 37°C for 24 hours, 5 µL of each sample was visualized on a 2% agarose gel via electrophoresis to verify DNA fragmentation. Samples were purified using an AMPure XP (Beckman Coulter) solution at a concentration of 1.5X (relative to DNA volume), then standardized at 100 ng of DNA per sample. Unique barcoded adapters were ligated to each individual before pooling 48 samples into a library. Taxa were spread across multiple libraries to mitigate potential batch effects, and libraries were size-selected on a Pippin Prep (Sage Science) using the in silico optimized range [378-433 base pairs (bp), excluding adapters]. Lastly, a twelve-cycle polymerase chain reaction (PCR) was run with Phusion DNA Polymerase (New England BioLabs), followed by 1×100 single-end sequencing on the Illumina Hi-Seq 4000, pooling two indexed libraries (=96 individuals) per lane. Sequencing and additional quality control (fragment visualization and qPCR) were performed at the Genomics and Cell Characterization Core Facility, University of Oregon/Eugene.
2.3. Sequence quality control and assembly
FASTQC v. 0.11.5 was used to assess sequence quality (Andrews 2010), with IPYRAD v0.7.28 employed to demultiplex the raw sequences and align reads (Eaton & Overcast 2020). Demultiplexed reads were allowed a strict maximum of one barcode mismatch, given that barcodes were designed with a minimum two-base distance. Reads with low PHRED quality scores (<33) were excluded, with additional filtering to remove adapter sequences. We then performed reference-guided assembly using the Terrapene m. mexicana reference genome (GenBank Accession #: GCA_002925995.2) with a minimum identity threshold of 0.85. Unmapped reads were removed, and retained loci exhibited ≥20X coverage depth to reduce sequencing error bias (Nielsen et al. 2011) and maximize phylogenetically informative sites in the alignment (Eaton et al. 2017). Loci were further excluded if they displayed <50% individual occupancy, excessive heterozygosity (≥75% of individual SNPs), or more than two alleles per sample (the latter two instances indicating over-merged paralogs).
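The locus-level filters described above can be summarized in a short sketch. This is an illustration only, not the IPYRAD implementation: the `Locus` record, its field names, and the reading of the heterozygosity threshold as a per-locus fraction of heterozygous SNP calls are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Locus:
    """Hypothetical per-locus summary; field names are illustrative."""
    name: str
    n_genotyped: int     # individuals with data at this locus
    mean_depth: float    # mean coverage depth (X)
    het_fraction: float  # assumed: fraction of heterozygous SNP calls
    max_alleles: int     # most alleles observed in any single sample


def keep_locus(locus, n_individuals,
               min_occupancy=0.50, min_depth=20.0,
               max_het=0.75, max_alleles=2):
    """Return True if the locus passes all four filters in the text."""
    occupancy = locus.n_genotyped / n_individuals
    return (
        occupancy >= min_occupancy              # drop <50% occupancy
        and locus.mean_depth >= min_depth       # retain >=20X coverage
        and locus.het_fraction < max_het        # drop >=75% heterozygosity
        and locus.max_alleles <= max_alleles    # drop >2 alleles per sample
    )


def filter_loci(loci, n_individuals):
    """Keep only the loci that pass every filter."""
    return [loc for loc in loci if keep_locus(loc, n_individuals)]
```

The last two filters target over-merged paralogs, so they are applied jointly with the occupancy and depth thresholds rather than as optional steps.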
2.4. Phylogenomic inference
To assess differences in phylogenetic inference, we generated species trees using three contemporary algorithms. Admixture across Terrapene hybrid zones has been well-characterized (Butler et al. 2011; Martin et al. 2013, 2020). Thus, to mitigate the impact of contemporary gene flow on phylogenetic inference, we only utilized individuals confirmed to be parental types (characterized in Martin et al. 2019), as modelled using NewHybrids (Anderson & Thompson 2002). In so doing, we partitioned T. c. major into two subsets comprising two putative parental populations.
Maximum likelihood phylogenies have been commonly produced for decades, yet the increased use of large-scale SNP datasets often inflates bootstrap support for concatenated phylogenomic datasets (Salichos & Rokas 2013; Simmons & Goloboff 2014). Coalescent-based approaches that account for independent gene tree histories are more applicable for SNP analysis, and thus we employed SVDquartets [(Chifman & Kubatko 2014), implemented in PAUP* v4.0a164 (Swofford 2003)] to produce a species tree with individuals grouped into populations. Unrooted four-taxon gene trees were generated to assess legitimate splits, then assembled to form the full species tree. SVDquartets performs better for concatenated SNP datasets than do species tree methods utilizing summary statistics (Chou et al. 2015), and importantly works well with the large amount of missing data typically produced by ddRADseq (Leaché et al. 2015).
To reduce linkage bias and because independent gene tree histories are assumed for each site, only one SNP from each ddRADseq locus was included in the SVDquartets alignment. To assess sampling variance, we ran 100 bootstrap replicates and considered nodes resampled at >70% as strongly supported. Taxon partitions were grouped at the lowest level of field identification (i.e. subspecific designations, when available), and by U.S. and Mexican state locality. Blanding’s (Emydoidea blandingii) and spotted (Clemmys guttata) turtles were included as outgroups. An exhaustive search of all possible quartets was performed, with the consensus tree visualized in FigTree v1.4.2 (Rambaut 2014).
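Retaining a single SNP per ddRADseq locus can be sketched as below. The text does not state how the retained SNP was chosen, so random selection with a fixed seed is an assumption here, and the `(locus_id, position)` input format is hypothetical.

```python
import random


def thin_to_one_snp_per_locus(snps, seed=0):
    """Pick one SNP per locus from (locus_id, position) tuples.

    Random choice is an assumption; other rules (e.g. the first SNP,
    or the SNP with the least missing data) are equally plausible.
    """
    rng = random.Random(seed)
    by_locus = {}
    for locus_id, position in snps:
        by_locus.setdefault(locus_id, []).append(position)
    # Sort loci so the result is reproducible for a given seed.
    return {
        locus_id: rng.choice(positions)
        for locus_id, positions in sorted(by_locus.items())
    }
```

The result holds exactly one site per locus, which is what the assumption of independent gene tree histories for each site requires.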
We also employed a polymorphism-aware model (PoMo; Schrempf et al. 2016), as implemented in IQ-TREE v1.6.9 (Nguyen et al. 2015), to generate a second species tree. We did so because PoMo allows within-population polymorphism to account for ILS. The full IPYRAD alignment, including invariant sites, was input into PoMo and executed with 1,000 ultrafast bootstrap (UFB) replicates (Hoang et al. 2017) and a maximum virtual population size of 19. The discrete gamma rate model was applied (N=4), and clades with bootstrap support ≥95% were considered strongly supported.
Finally, we generated a lineage-tree phylogeny (IQ-TREE v1.7.12; Nguyen et al. 2015) to contrast with our species-trees. An edge-linked partition model with 1,000 UFB replicates was run using Modelfinder (Kalyaanamoorthy et al. 2017) to determine the optimal substitution model for each separate ddRADseq locus. Given computational constraints, model selection was restricted to only the general time reversible (GTR) model. Following tree reconstruction, IQ-TREE was used to calculate site-wise concordance factors (sCF; Minh et al. 2018) for each branch because they are less susceptible than traditional bootstrapping to over-inflation (Philippe et al. 2011). The sCF were calculated from 100 quartets randomly sampled from internal branches of the tree, as recommended by IQ-TREE for stable sCF values. UFB≥95% and sCF≥50% were considered as strong support (per IQ-TREE documentation).
For statistical topology tests, we generated lineage trees with IQ-TREE under the topological constraints supported by four species-tree hypotheses derived from: (a) SVDquartets and (b) PoMo topologies, as generated herein; (c) Sanger sequencing with mtDNA and nuclear introns (Martin et al. 2013); and (d) Morphological data (Minx 1996). Modelfinder was again employed to optimize substitution models for each locus, as partitioned in a concatenated supermatrix, using a hierarchical clustering algorithm to minimize computational burden in IQ-TREE (-rcluster). We also toggled the -bnni and -opt-gamma-inv options to reduce the impact of severe model violation and more thoroughly explore gamma and invariant site parameters. Nodal confidence of individual trees was assessed using 1,000 UFB. We then compared support for the concatenated supermatrix among constraint trees using the following topological tests, with 10,000 re-samplings: (a) Raw log-likelihoods; (b) bootstrap proportion test using the RELL approximation (Kishino et al. 1990); (c) Kishino-Hasegawa test (Kishino & Hasegawa 1989); (d) Shimodaira-Hasegawa test (SH; Shimodaira & Hasegawa 1999); (e) Approximately Unbiased test (Shimodaira 2002); and (f) Expected Likelihood Weights (Strimmer & Rambaut 2002). To visualize support for each topology across the genome, site-likelihood probabilities and pairwise site-likelihood score differences (ΔSLS) were calculated between the best-supported versus remaining trees.
2.5. Species delimitation
We employed the multispecies coalescent Bayes Factor Delimitation approach [BFD*; (Leaché et al. 2014a)] as a baseline against which to compare the machine learning-based methods. Because BFD* is computationally intensive, each taxon was subset to a maximum of five individuals that contained the least amount of missing data (N=37, plus outgroups), with sampling locations varied (excepting T. c. bauri and the extremely rare T. m. mexicana and T. coahuila, which occur exclusively in Peninsular Florida, the Mexican state of Tamaulipas, and Cuatro Ciénegas (Coahuila, México), respectively). For consistency, the same subset of individuals was used across all approaches. Details for BFD* prior selection and additional data filtering steps can be found in Supplemental Appendix 1.
For each BFD* model, SNAPP employed 48 path-sampling steps, 200,000 burn-in, plus 400,000 MCMC iterations, with sampling every 1,000 generations. The path-sampling steps were conducted with 200,000 burn-in, 300,000 MCMC generations, α=0.3, 10 cross-validation replicates, and 100 repeats. Trace plots were visualized (Tracer v1.7.1) to confirm parameter convergence and compute effective sample sizes (ESS; Rambaut et al. 2018). Bayes factors (BF) were calculated as BF = 2 × (MLE1 − MLE2) from the normalized marginal likelihood estimates (MLE). We considered the following scheme for BF model support: 0 < BF < 2 = no model differentiation; 2 < BF < 6 = positive; 6 < BF < 10 = strong; and BF > 10 = decisive support (Kass & Raftery 1995).
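The Bayes factor calculation and the support scheme above amount to the following arithmetic. The function names are hypothetical, and because the scheme leaves the exact boundary values (2, 6, 10) unassigned, placing them in the lower category is an assumption of this sketch.

```python
def bayes_factor(mle_1, mle_2):
    """BF = 2 x (MLE1 - MLE2), using log marginal likelihood estimates."""
    return 2.0 * (mle_1 - mle_2)


def support_category(bf):
    """Map a Bayes factor onto the Kass & Raftery (1995) scheme.

    The magnitude is used so that model order does not matter.
    Boundary values fall into the lower category (an assumption).
    """
    bf = abs(bf)
    if bf > 10:
        return "decisive"
    if bf > 6:
        return "strong"
    if bf > 2:
        return "positive"
    return "no model differentiation"
```

For example, two hypothetical marginal likelihood estimates of −1000.0 and −1004.0 give BF = 8, which falls in the ‘strong’ category.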
The RF and t-SNE algorithms (Breiman 2001; Maaten & Hinton 2008) were run and visualized using an R script developed by Derkarabetian et al. (2019). The data were represented as scaled principal components (N=37 axes) generated in Adegenet v2.1.1 (Jombart & Ahmed 2011) in R v3.5.1 (R Development Core Team 2018). We averaged 100,000 majority-vote decision trees over 10,000 bootstrap replicates to generate RF predictions. Clustered RF output was visualized using both classic and isotonic multidimensional scaling procedures (cMDS and isoMDS; Shepard et al. 1972; Kruskal & Wish 1978). We ran t-SNE for 10,000 iterations, within which equilibrium of the clusters was visually confirmed. Perplexity, which limits the effective number of t-SNE neighbors, was tested at values of five and ten.
2.6. Determining optimal K for random forests and t-SNE
Two common clustering algorithms, as implemented in the aforementioned R scripts (Derkarabetian et al. 2019), were used to derive optimal K for both the RF and t-SNE analyses. The first [Partitioning Around Medoids (PAM); Kaufman and Rousseeuw 1987] attempts to minimize the distance between the center point (medoid) of each of K clusters and all other points assigned to it. The program requires K to be defined a priori, and thus K=1-10 were tested, with the gap statistic and highest mean silhouette widths [(MSW) (Rousseeuw 1987; Tibshirani et al. 2001)] determining optimal K. The second [Hierarchical Agglomerative Clustering (HAC); Fraley and Raftery 1998] merges points with minimal dissimilarity metrics (based on pairwise distances) until all are clustered.
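Choosing K by the highest mean silhouette width can be illustrated with a small, self-contained sketch. It is not the R code used in the study: the medoids are found by exhaustive search (feasible only for toy data, whereas PAM uses a swap heuristic), Euclidean distance is assumed, and all function names are hypothetical. Silhouette widths are undefined for K=1, so candidate values start at two.

```python
from itertools import combinations
from math import dist


def k_medoids_exhaustive(points, k):
    """Exact k-medoids by brute force; a stand-in for PAM on tiny data."""
    medoids = min(
        combinations(range(len(points)), k),
        key=lambda m: sum(min(dist(p, points[i]) for i in m) for p in points),
    )
    # Label each point with the index of its nearest medoid.
    return [
        min(range(k), key=lambda j: dist(p, points[medoids[j]]))
        for p in points
    ]


def mean_silhouette_width(points, labels):
    """Average of s(i) = (b - a) / max(a, b) over all points."""
    widths = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # singleton cluster: s(i) is defined as 0
            widths.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom else 0.0)
    return sum(widths) / len(widths)


def best_k(points, k_values):
    """Return the K with the highest mean silhouette width, plus all scores."""
    scores = {
        k: mean_silhouette_width(points, k_medoids_exhaustive(points, k))
        for k in k_values
    }
    return max(scores, key=scores.get), scores
```

On three well-separated toy groups, merging two groups (K=2) or splitting one (K=4) both lower the mean silhouette width, so K=3 is selected.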
2.7. Variational autoencoders
The VAE UML approach (Derkarabetian et al. 2019) employs neural networks and deep learning to infer the marginal likelihood distribution of sample means (μ) and standard deviations [(σ) (i.e. ‘latent variables’)]. Clusters with non-overlapping σ are interpreted as distinct clusters, or ‘species.’ Data were input as 80% training/20% validation, with model loss (∼error) visualized to determine the optimal number of ‘epochs’ (=cycles through the training dataset). VAE should ideally be terminated when model loss converges on a minimum value between training and validation datasets [(i.e. the ‘Goldilocks zone’; Fig. S1) (Al’Aref et al. 2019)]. An escalating model loss in the validation dataset indicates overfitting, whereas a failure to acquire a minimum value points to underfitting (i.e. inability to generalize across both training and unseen data).
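The stopping criterion described above was applied by inspecting the loss plot. An automated analogue is patience-based early stopping, sketched below as a generic technique rather than the procedure used in the study; the function name, the `patience` parameter, and the loss values in the example are hypothetical.

```python
def stopping_epoch(val_loss, patience=3):
    """Return the epoch (0-based) with the lowest validation loss seen
    before the loss fails to improve for `patience` consecutive epochs.

    A rising validation loss after this epoch suggests overfitting.
    """
    best_epoch, best_loss = 0, val_loss[0]
    for epoch, loss in enumerate(val_loss):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


# Hypothetical validation losses: improvement, then escalation.
example_loss = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.4]
example_stop = stopping_epoch(example_loss, patience=3)
```

In this example training halts at epoch 3 even though a lower loss appears later, which is the usual trade-off of a patience rule: it guards against escalating validation loss at the cost of possibly missing a late improvement.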
2.8. Support vector machines
The CLADES software (Pei et al. 2018) derives six summary statistics for SVM: 1) Proportion of private alleles; 2) a folded site-frequency spectrum (SFS); 3) pairwise FST values within populations; 4) pairwise FST values among populations; 5) the pairwise difference ratio (dbetween/dwithin); and 6) the longest shared tract (longest string shared by two sequences). More extensive methodological descriptions of the UML and SML components of machine learning are found in Supplemental Appendix 1.
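As an illustration of the fifth statistic, the pairwise difference ratio can be computed as the mean number of differences between sequences from different populations divided by the mean number within populations. The sketch below assumes aligned sequences of equal length and simple mismatch counts; the exact definition used by CLADES may differ, and the function names are hypothetical.

```python
from itertools import combinations, product


def n_differences(seq_a, seq_b):
    """Count mismatched sites between two aligned sequences."""
    return sum(x != y for x, y in zip(seq_a, seq_b))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pairwise_difference_ratio(pop_a, pop_b):
    """d_between / d_within for two populations of aligned sequences.

    Assumes each population has at least two sequences and that some
    within-population variation exists (otherwise d_within is zero).
    """
    d_between = mean(n_differences(a, b) for a, b in product(pop_a, pop_b))
    d_within = mean(
        n_differences(a, b)
        for pop in (pop_a, pop_b)
        for a, b in combinations(pop, 2)
    )
    return d_between / d_within
```

Ratios well above one indicate that sequences differ more between populations than within them, which is the signal an SVM can use to separate ‘same species’ from ‘different species’ comparisons.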
3. RESULTS
3.1. Sampling and data processing
We sequenced 214 geographically widespread Terrapene (Fig. 1; Table S1) including all recognized species and subspecies, save the exceptionally rare T. nelsoni klauberi. When possible, we included a minimum of 10 individuals per taxon, though fewer were used for rare clades (T. m. yucatana, T. m. mexicana, T. coahuila, T. n. nelsoni, T. o. luteola, and T. c. bauri). The IPYRAD pipeline recovered 134,607 variable sites across 13,353 loci that mapped to the T. m. mexicana genome, with 90,777 being parsimony-informative. The mean per-individual coverage depth was 56.3X (Fig. S2).
3.2. Species tree inferences
The sCF tree contained N=214 tips (Fig. 2), whereas SVDquartets and PoMo (Fig. 3) grouped individuals into N=26 populations, again based on locality and subspecies (when provided). The SVDquartets alignment contained 10,299 unlinked SNPs, with 87,395,061 quartets employed to assemble the species tree (Fig. 3a). Concatenated ddRADseq loci were included in the PoMo tree (Fig. 3b), to include both invariable and variable sites (Nsites=1,163,463). All trees clearly delineated eastern versus western clades, with T. mexicana, T. carolina, and T. coahuila composing the eastern clade and the west represented by the monophyletic T. ornata and T. nelsoni. However, some differences among methodologies were apparent within these clades.
All phylogenies clearly delineated the western T. ornata and T. nelsoni. However, SVDquartets paraphyletically nested T. o. luteola within T. o. ornata, whereas IQ-TREE and PoMo represented them as distinct monophyletic clades. In the eastern clade, SVDquartets displayed two subdivisions: Terrapene mexicana (all subspecies) and T. carolina (all subspecies) + T. coahuila. PoMo did likewise, but also placed T. m. triunguis as paraphyletic in T. mexicana. Furthermore, SVDquartets, PoMo, and IQ-TREE each differed regarding the placement of T. c. bauri, T. coahuila, and two previously recognized clades within T. c. major (Martin et al. 2013, 2020). Specifically, SVDquartets depicted T. c. bauri as ancestral in the bauri/major/coahuila/carolina clade, whereas PoMo placed T. c. major from MS/coahuila as ancestor to T. c. major (FL)/bauri/carolina. However, IQ-TREE placed 1) T. c. bauri sister to all of T. carolina/T. mexicana, and 2) T. coahuila/T. c. major (MS) sister to T. c. carolina/T. c. major (FL). IQ-TREE also placed one T. c. major individual within the T. m. triunguis clade, and one T. c. carolina as ancestral to the Floridian T. c. major and remaining T. c. carolina.
3.3. Species tree reconciliation
Trees representing Sanger data and SVDquartets were in agreement when we contrasted our topology tests, whereas morphology-based and PoMo trees were both significantly rejected (Table 1). Although the SVDquartets tree was ranked the highest, site-likelihood scores indicated that each topology was determined by a small number of loci (Fig. S3), whereas the remaining majority was relatively uninformative.
3.4. Species delimitation methods compared
BFD* supported two top models (Table 2): All taxa delimited (K=9), and all as distinct save T. o. ornata/T. o. luteola (K=8; Fig. 4). BF did not distinguish between the top models (<2), although both were decisively better than all others (BF>10). Convergence was confirmed for the likelihood traces, and the mean per-model ESS were >300 (Table S2).
The majority of the RF and t-SNE runs (Fig. 4) also grouped T. o. ornata and T. o. luteola. However, the remaining clusters were split conservatively relative to BFD*. All runs clearly delineated T. ornata, T. carolina and T. mexicana ssp., with some also delimiting as distinct entities T. c. carolina, two T. c. bauri clusters, and T. m. mexicana. Of note, the runs and clustering algorithms exhibited high within- but not among-clade variability for T. carolina and T. mexicana, excepting MSW using isoMDS.
Each clustering algorithm and ordination technique displayed its own inherent characteristics. Essentially, cMDS and the gap statistic were inclined to split subclades of T. carolina and T. mexicana, isoMDS and MSW were the most conservative, and t-SNE and HAC were intermediate, though HAC oscillated in agreement with MSW and the gap statistic (Fig. 4). RF, but not t-SNE, varied among the 100 replicates, which was most pronounced for cMDS. Heightened cMDS run variation highlights its inherent sensitivity to low among-group variability (Olteanu et al. 2013). Finally, t-SNE optimal K increased with perplexity.
VAE initially agreed with BFD* in recognizing K=8, clumping T. o. ornata/T. o. luteola and splitting all other taxa (Fig. 5a). However, assessments of model-loss indicated overfitting in the sense that given enough epochs, the predictive model can perfectly ‘learn’ the training dataset, with predictive capacity rapidly decreasing for unseen test data. To mitigate this, we identified in the model loss plot the transition point, or ‘elbow’ (Fig. 5b), where predictive accuracy falls off for the test data, such that test versus training sets diverge in accuracy. This occurred at a much lower number of sampled epochs (N=2,000), and the model was subsequently re-run with this new termination point. Once overfitting was eliminated, an optimal K=3 was derived (Figs. 4, 5c, 5d), in agreement with other UML methods. The model was also tested with N=1,000 epochs (not shown), for which K=3 clusters again persisted.
3.5. Supervised machine learning
CLADES yielded optimal K=2 (P=1.44e−4; Fig. 4; Table S3), but with highly discordant clusters compared with prior results and phylogenomic findings: Terrapene c. carolina/T. c. bauri emerged as one species, and the remaining seven taxa (T. ornata, T. mexicana, and the remaining T. carolina) as a consistently paraphyletic second species (Figs. 2-3). The possibility of outliers misleading the delimitations was also explored by removing two T. c. bauri and North Carolina T. c. carolina that, in a subset of UML runs, either formed a potential second cluster or clustered instead with T. c. bauri. However, CLADES again provided output lacking phylogenetic cohesiveness (K=2; P=6.88e−6), with T. c. bauri/T. c. major (MS population) as one species and the remainder forming the second. In both cases, the estimated probability for optimal K was quite low.
3.6. Relative performance among approaches
All UML species delimitation methods converged on K=3 if considering RF and t-SNE classifications that did not inter-mix. Three Terrapene species (plus T. nelsoni) were corroborated (Martin et al. 2013, 2020), whereas the clumping of T. mexicana and T. carolina (Minx 1996) was rejected. Machine learning approaches were also markedly faster than BFD*. For example, RF, t-SNE, and VAE required ∼10-30 min run time on a desktop PC utilizing one Intel i5-3570 CPU core and 16 GB RAM. Comparatively, the twenty BFD* runs required ∼4,000 total wall-time hours (∼200 hours/model), parallelized across 24-48 threads and utilizing 200 GB RAM/model.
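The ‘did not inter-mix’ criterion can be made concrete with a short sketch. It is offered only as an illustration under stated assumptions, not as the procedure used in the study: the function name `non_mixing_groups` and the toy cluster assignments are hypothetical. A priori taxon labels are merged whenever individuals from two taxa share a cluster in any replicate, so the groups that remain are those that never overlapped across runs.

```python
from collections import defaultdict


def non_mixing_groups(taxa, replicates):
    """Merge taxa that share a cluster in any replicate.

    taxa: a priori taxon label for each individual.
    replicates: one list of cluster assignments per run, each the same
        length as `taxa`.
    Returns a sorted list of sorted taxon groups that never inter-mixed.
    """
    parent = {t: t for t in taxa}  # union-find forest over taxon labels

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for assignment in replicates:
        if len(assignment) != len(taxa):
            raise ValueError("each replicate needs one assignment per individual")
        first_taxon_in_cluster = {}
        for taxon, cluster in zip(taxa, assignment):
            # Link every taxon in a cluster to the first taxon seen there.
            anchor = first_taxon_in_cluster.setdefault(cluster, taxon)
            parent[find(taxon)] = find(anchor)

    groups = defaultdict(set)
    for t in parent:
        groups[find(t)].add(t)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy assignments (not study data); one entry per individual.
taxa = ["ornata", "luteola", "carolina", "carolina",
        "bauri", "major", "triunguis", "mexicana"]
replicates = [
    [0, 0, 1, 1, 2, 1, 3, 3],  # run 1: bauri forms its own cluster
    [0, 0, 1, 1, 1, 1, 2, 3],  # run 2: bauri joins carolina; mexicana splits
    [0, 0, 1, 2, 2, 1, 3, 3],  # run 3: one carolina individual joins bauri
]

consensus = non_mixing_groups(taxa, replicates)
print(len(consensus), consensus)
```

With these toy runs, within-group assignments vary from run to run, yet three groups never exchange members, which mirrors the logic of treating heterogeneous classification as intra-specific.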
4. DISCUSSION
We observed substantial heterogeneity among machine learning species delimitation approaches in resolving the southeastern Terrapene taxa, echoing previous morphological and single-gene results (Milstead 1967, 1969; Milstead & Tinkle 1967; Butler et al. 2011; Martin et al. 2013). However, groups exhibiting such heterogeneity may indicate that the involved taxa are one species, whereas groups lacking it may support distinctiveness. Additionally, as argued below, these patterns were interpreted as a more appropriate reflection of taxon-specific biological patterns. Our results represent an empirical test for the de novo application of these software packages to other taxonomically complex systems.
4.1. Species Delimitation Approaches Reconciled in Terrapene
Species trees provide a necessary phylogenetic context for species delimitation by outlining hypothetical species compositions and identities. In our case, they underscored classic discordance (Figs. 2-3), previously hypothesized via single-gene sequencing (Martin et al. 2013). Differences were apparent in the ancestral progression of taxa, and in transitions between monophyly and paraphyly. Persistent uncertainties include: 1) placement of T. c. bauri; 2) monophyly of T. mexicana; and 3) subspecies status of T. o. luteola. Additionally, two individuals were placed in unexpected clades, which was a far smaller proportion than previously seen in single-gene datasets. We suggest the latter are examples of admixture, as both were collected near a southeastern US hybrid zone (Martin et al. 2020), and suspect the other idiosyncrasies represent either violations of the model or methodological artifacts.
Impacts of interspecific gene flow on species tree inference are well-characterized, yet surprisingly, seldom modeled explicitly (Leaché et al. 2014b; Leaché & Oaks 2017). POMO, for example, constrains all nodes to the same Ne, a potentially poor assumption given contemporary and possibly historical admixture (Martin et al. 2013, 2020). An examination of the species trees alone reiterates previous single-gene taxonomic assessments. However, the species delimitation approaches applied here provide a far more robust phylogenetic classification.
UML species delimitation inferences were consistent with the most recent phylogenetic hypotheses (Martin et al. 2013; this study). Terrapene o. ornata/luteola, T. c. carolina/bauri/major/coahuila, and T. m. mexicana/triunguis represent what we would consider as species-level variants, each of which encompasses group assignment heterogeneity. Terrapene m. yucatana falls within T. mexicana, and T. nelsoni as sister to T. ornata, although their extreme rarity and concomitantly limited sampling (N=1) precluded them from species delimitation analysis. Importantly, the variability in RF and t-SNE results primarily echoed uncertainty found in the species tree analyses, including the distinctiveness of T. o. luteola and subspecific relationships within T. carolina. This variation also corresponded with the proclivities of each algorithm. RF, for example, invokes a randomized process, with stochasticity perhaps exacerbated by the phylogenetic discordance within T. carolina. t-SNE was influenced by its perplexity parameter, with a second T. c. major group from Mississippi being a minor addition for perplexity=5. This could underscore population structure among Mississippi and Florida T. c. major. Finally, delimitations for both were strongly affected by the clustering algorithm, closely paralleling each algorithm's known tendencies. For example, the gap statistic often over-estimates K (Dudoit & Fridlyand 2002; Yan & Ye 2007), MSW under-splits (Şenbabaoğlu et al. 2014), and outliers and noise particularly impact HAC (Kim et al. 2009; Şenbabaoğlu et al. 2014). This, in turn, may explain the varying extent of agreement between HAC and either MSW or the gap statistic.
We suggest the variability observed among RF and t-SNE runs was due to a lack of divergence within the more variable groups. Mixed classification was not observed among the T. mexicana, T. ornata, and T. carolina groups, excepting RF isoMDS based on MSW, which only differentiated T. ornata versus T. carolina (Fig. 4). The more conservative nature of isoMDS was reflected in several original empirical tests, which suggested a restriction to two clustering dimensions may be more sensitive to higher genetic divergences (Derkarabetian et al. 2019), as seems to be the case here. Otherwise, variability among taxa was constrained within respective subspecific units.
VAE initially recovered results identical to BFD* (K=8), delimiting all taxa except T. o. ornata/T. o. luteola. To ensure model training occurred appropriately, we more closely inspected model loss and observed overfitting (Fig. 5b). The VAE script includes dropout regularization methods, which randomly thin neural network nodes during model training to reduce overfitting (Srivastava et al. 2014). However, regularization parameters can be sensitive to dataset properties (e.g. large versus small/noisy versus tidy), and may not perform well for every dataset (Gal & Ghahramani 2016; Derkarabetian et al. 2019). In model loss exploration, overfitting was mitigated by early termination of model training when loss was at its minimum, though this could also be accomplished by tuning dropout parameters. After correcting for overfitting, VAE also delimited K=3 (i.e. Terrapene mexicana, T. carolina, and T. ornata ssp.), much like RF and t-SNE if considering classification heterogeneity to indicate intra-specific relationships.
4.2. Phylogenetic and biological support of species delimitations
We suggest that identifying machine learning groups that consistently lack classification overlap is one criterion to delimit species. In our case, these were corroborated as major species tree clades, highlighting their complementary nature. In contrast, inconsistent species delimitation assignments reflect many of the phylogenetic discordances observed in this and previous studies (Butler et al. 2011; Martin et al. 2013). Potential underlying biological processes include incomplete lineage sorting, ongoing primary divergence, hybridization, and/or complex phylogeographic history (e.g. isolation followed by secondary contact; Mayr 1963; Barton & Hewitt 1985; Rieseberg et al. 1999, 2007; Coyne & Orr 2004; Sousa & Hey 2013). Divergent selection can counteract such processes and reinforce species boundaries (Feder et al. 2013). Our species delimitation results are consistent with previously observed divergent selection at candidate loci across T. carolina and T. mexicana, whereas it was absent for T. c. carolina and T. c. major (Martin et al. 2020). Thus, T. carolina and T. mexicana may exhibit signatures of secondary contact, whereas T. c. major and T. c. carolina may be earlier in the divergence process. Alternatively, T. c. major could be an intergrade population between T. c. carolina and T. m. triunguis (Butler et al. 2011), though the species trees disagree and two putative parental populations persist (Martin et al. 2020). In this sense, T. c. major displays fairly disparate habitat preferences, favoring salt marshes on the Gulf Coastal Plain, whereas T. c. carolina and T. m. triunguis occupy mesic woodlands. The low differentiation between T. c. major and T. c. carolina may result from T. c. major being restricted to the southeast. Here, T. c. carolina possibly blocked northward expansion of T. c. major, with gene flow persisting across much of T. c. major’s smaller range. Alternatively, it may have diverged more recently and now reflects ongoing primary divergence.
4.3. Comparisons to other empirical studies
The capability of machine learning species delimitation algorithms to discount population structure while isolating higher-level differentiation is corroborated by other recent studies (Derkarabetian et al. 2019; Hedin et al. 2020). However, Derkarabetian et al. (2019) and Newton et al. (2020) emphasized the importance of integrative approaches, as they were able to identify cryptic species by considering both VAE species delimitation and ecological niche modeling. Given the increasing availability of geological resources, such integrative taxonomic considerations may prove to be invaluable.
Excepting CLADES, the machine learning software used herein also seems robust to hierarchical levels of genetic variation, having differentiated T. carolina versus T. ornata and the less divergent T. m. triunguis. However, this hierarchical robustness may have limits, as one recent geometric morphometric image-based deep learning study favored inter-generic over inter-specific delimitations (Boer & Vos 2018). On the contrary, another recent study was more accurate in recovering species-level delimitations rather than across genera, which they suggested stemmed from less informed model training in low-diversity families with many unique species. Recent and future work may also illuminate the impact of gene flow and population demography on observed delimitations, processes that MSC approaches do not consider. For example, delimitR incorporates models of secondary contact and divergence with gene flow into RF classifiers for species delimitation (Smith & Carstens 2020). However, empirical tests of delimitR tended to agree with the species delimited by BFD* and BP&P, whereas for Terrapene, RF, t-SNE, and VAE were more conservative than the MSC approaches. It may be that Terrapene exhibits stronger population structure than the species included in the delimitR applications, which can influence BFD* (Sukumaran & Knowles 2017). Finally, machine learning frameworks may illuminate other potential sources of species tree discordance, with recent applications predicting discordant species trees (Roettger et al. 2009), assessing historical introgression despite ongoing gene flow (Burbrink & Gehara 2018), and identifying ILS (Burbrink et al. 2020). Nevertheless, RF, t-SNE, and VAE are reported to at least be robust to gene flow, with recent applications showing that they place admixed individuals between parental clusters (Derkarabetian et al. 2019; Hedin et al. 2020; Newton et al. 2020). However, in our case we avoided hybrids (as characterized by NewHybrids; Martin et al. 2020) due to frequent introgression in the southeastern Terrapene hybrid zone.
4.4. Conclusions
UML approaches attempt to identify groups based on inherent structure in the data, and accordingly are a natural extension to the species delimitation problem. In our case, a consensus among UML approaches corroborated other axes of differentiation, whereas MSC-based delimitations over-partitioned the data. Specifically, groups that were not supported by RF, t-SNE, and VAE echoed classic patterns of phylogenetic uncertainty seen among our species trees, which may be affected by previously observed genome-wide differential introgression. Furthermore, in our case it seems likely that the phylogenetic signals affecting discordance are similarly affecting the machine learning algorithms, which may include a combination of historical biogeographic processes, gene flow, and incomplete lineage sorting. What is clear is that delimiting almost every Terrapene taxon, as supported by BFD*, is probably not biologically appropriate. Though MSC methods are undoubtedly still extremely useful, machine learning provides a promising alternative for resolving long-standing biological problems. This may particularly be the case for species that violate MSC model assumptions, as demonstrated by our study system.
DATA ACCESSIBILITY
The raw ddRADseq data are available on the GenBank Nucleotide Database at https://www.ncbi.nlm.nih.gov/bioproject/563121 (BioProject ID: 563121) [to be made public upon publication]. Additional sequence alignments, Supplemental Appendix 1, and supplementary materials will be available from a Dryad Digital Repository.
AUTHOR CONTRIBUTIONS
BTM and TKC designed the research, laboratory protocols, and scripts. BTM conducted the lab work and bioinformatic analyses, analyzed the data, and wrote the manuscript. MRD and MED were the study supervisors, guided the study design, and provided funding. JSP facilitated the collection of thousands of Terrapene tissues and provided methodological expertise. RDB collected hundreds of Terrapene tissues from southeastern North America and facilitated the collection of many additional individuals. CAP provided many of the T. ornata tissues and provided sampling expertise. All authors edited and revised the manuscript.
ACKNOWLEDGEMENTS
We would like to thank the numerous volunteers, organizations, and agencies that contributed tissue samples (Table S1). We also thank both current students and alumni of the Douglas Lab, as well as University of Arkansas faculty, for support, advice, and guidance, including: A. Alverson, W. Anthonysamy, M. Bangs, J. Beaulieu, J. Koukl, S. Mussmann, J. Pummill, and Z. Zbinden. Sample collections were approved under three Institutional Animal Care and Use Committee (IACUC) protocols: #113 (University of Texas at Tyler), and #16160 and #18000 (University of Illinois/Champaign-Urbana). Funding sources included the Lucille F. Stickle Fund of the North American Box Turtle Committee, the American Turtle Observatory (ATO), and two University of Arkansas endowments [the Bruker Professorship in Life Sciences (MRD), and the 21st Century Chair in Global Change Biology (MED)]. The Arkansas High Performance Computing Cluster (AHPCC) and the Jetstream Cloud Service (NSF-XSEDE Research Allocation TG-BIO160065) graciously supplied analytical resources.
Footnotes
Disclosure statement: Authors have nothing to disclose
Major revisions to Discussion.