Abstract
Bayesian methods can be used to accurately estimate species tree topologies, times and other parameters, but only when the models of evolution which are available and utilized sufficiently account for the underlying evolutionary processes. Multispecies coalescent (MSC) models have been shown to accurately account for the evolution of genes within species in the absence of strong gene flow between lineages, and fossilized birth-death (FBD) models have been shown to estimate divergence times from fossil data in good agreement with expert opinion. Until now dating analyses using the MSC have been based on a fixed clock or informally derived node priors instead of the FBD. On the other hand, dating analyses using an FBD process have concatenated all gene sequences and ignored coalescence processes. To address these mirror-image deficiencies in evolutionary models, we have developed an integrative model of evolution which combines both the FBD and MSC models. By applying concatenation and the MSC (without employing the FBD process) to an exemplar data set consisting of molecular sequence data and morphological characters from the dog and fox subfamily Caninae, we show that concatenation causes predictable biases in estimated branch lengths. We then applied concatenation using the FBD process and the combined FBD-MSC model to show that the same biases are still observed when the FBD process is employed. These biases can be avoided by using the FBD-MSC model, which coherently models fossilization and gene evolution, and does not require an a priori substitution rate estimate to calibrate the molecular clock. We have implemented the FBD-MSC in a new version of StarBEAST2, a package developed for the BEAST2 phylogenetic software.
Introduction
We have vastly more data on biological organisms than at any point in the past; whole genome sequences, ancient DNA, morphological characters and fossil occurrences all contain a fingerprint of past evolutionary processes. With this wealth of data, we should expect coherent estimates of the pattern and timing of evolutionary events. Yet the story told by genomes and molecular clocks is often difficult to reconcile with morphological data and the fossil record (Meyer et al. 2012; O’Leary et al. 2013; Jarvis et al. 2014; dos Reis et al. 2014; Mitchell et al. 2015). These debates are often described as “rocks versus clocks” (Donoghue and Benton 2007) with famous examples including the timing of the origin of placental mammals (O’Leary et al. 2013; dos Reis et al. 2014), birds (Jarvis et al. 2014; Mitchell et al. 2015), flowering plants (Beaulieu et al. 2015), and the Cambrian Explosion (Lee et al. 2013). Most disturbingly, these debates persist even for evolutionarily recent and intensively studied questions like the timing of the human-chimp split, where fossils (Brunet et al. 2002; White et al. 2009; Wood and Harrison 2011; White et al. 2015) give different results than genomic data (Patterson et al. 2006; Langergraber et al. 2012; Meyer et al. 2012; Scally et al. 2012; Scally and Durbin 2012; Callaway 2015; Lipson et al. 2015).
Bayesian inference, the gold standard in estimating evolutionary history (Huelsenbeck et al. 2001; Ronquist and Huelsenbeck 2003; Nylander et al. 2004; Drummond et al. 2012; Bouckaert et al. 2014; Höhna et al. 2016), provides a theoretical framework that supports the integration of multiple data sources. So-called “total-evidence” analyses integrate molecular sequence and morphological character data. Where a fossil record is available, total-evidence data sets can be used with “tip-dating” methods to estimate time-calibrated species trees (Ronquist et al. 2012; Gavryushkina et al. 2014; Zhang et al. 2016; Gavryushkina et al. 2017).
Tip-dating makes an advance over previous methods such as node-dating or a fixed clock by treating fossils as data. Node-dating, where researchers propose parametric prior distributions for the dates of particular nodes based on expert opinion and intuition, can result in misleading node ages (Gavryushkina et al. 2017). An alternative to tip- or node-dating is a fixed molecular clock. Fixing the molecular clock at 1 means that only relative divergence times can be estimated, while using a value from a previous study assumes that the a priori rate is accurate for the species and loci in the new study.
Previous implementations of tip-dating have so far made the assumption of a single phylogeny encompassing all molecular loci and morphological characters. This assumption is known as “concatenation” because it is equivalent to concatenating several multiple sequence alignments into a single alignment, and it has been demonstrated to cause biases and overestimated precision when inferring species trees from molecular data (Liu et al. 2015; Ogilvie et al. 2016; Ogilvie et al. 2017). To enable the combined use of molecular, morphological and fossil data with the advantage of tip-dating and without the known problems of concatenation, we propose combining models of genealogical evolution, morphological evolution, and of speciation, extinction and fossilization.
The Fossilized Birth-Death Process
Explicitly including fossils in stochastic models of phylogenies became possible with the birth-death-serial-sampling model (Stadler 2010). This model has three macroevolutionary parameters: the fossil sampling rate ψ, the speciation rate λ and the extinction rate μ. A version of this model, named the fossilized birth-death (FBD) process, allows for sampled ancestors (Gavryushkina et al. 2014; Heath et al. 2014; Zhang et al. 2016); each fossil may be either a direct ancestor of other samples, or a tip branch if no descendants have been sampled. The “skyline” extension to the FBD (Stadler et al. 2013) allows the macroevolutionary parameters to vary through time in an arbitrary and independent fashion.
The Multispecies Coalescent
Modern phylogenetic inference distinguishes between high level phylogenetic relationships across species described by a species tree and relationships between individual alleles described by gene trees. It is now well understood that failure to take this into account can significantly bias results due to the effects of incomplete lineage sorting (ILS) and other processes (Liu et al. 2015; Linkem et al. 2016; Mendes and Hahn 2016; Mendes and Hahn 2018).
*BEAST (Heled and Drummond 2010), StarBEAST2 (Ogilvie et al. 2017), BEST (Liu 2008) and BPP (Yang 2015; Rannala and Yang 2017) are all examples of Bayesian software that explicitly sample the joint posterior distribution over both species and gene trees under the multispecies coalescent (MSC) model, as described by Maddison (1997) and Degnan and Rosenberg (2009). These methods account for the hierarchical nature of the evolutionary process and explicitly model ILS. However none of these implementations allow fossils or other ancestral samples to be placed directly on the species tree, meaning that tip-dating approaches are not possible.
Integrative Models of Species Evolution
Integrative models are desirable because they can integrate over uncertainty rather than assuming fixed parameters, and they can also directly utilize more sources of data than simpler models. In this paper we describe an integrative Bayesian phylogenetic model for estimating species trees and divergence times, capable of analyzing multilocus genetic data, fossil occurrence data and morphological data in a coherent probabilistic inference framework. The model reconciles molecular and fossil evidence by explicitly distinguishing two evolutionary processes, with the FBD process describing the distribution over species trees and the MSC model describing the probability distribution of molecular genealogies conditional on the species tree.
The FBD branching model of macroevolution accounts for speciation, extinction and fossilization. The species tree is modeled using the FBD process, with the morphology of all species arising from a stochastic process of evolution that proceeds down the branches of this species tree.
The MSC has become the standard model for describing the relationship between molecular genealogies and species trees. The molecular sequence data (sampled from extant individuals or as ancient DNA) are modeled by multiple independent gene trees, which may differ from each other due to processes such as ILS, but must be consistent with the shared species tree that they have all evolved within (Fig. 1).
The BEAST2 phylogenetic software features “StarBEAST2” — a recent implementation of the MSC — and an implementation of the FBD prior (Gavryushkina et al. 2014). We have updated StarBEAST2 to combine the MSC model with the FBD process, henceforth “FBD-MSC”. To demonstrate the utility of the FBD-MSC model, we applied the latest version of StarBEAST2 to an exemplar data set of the dog and fox subfamily Caninae.
Estimates made under the FBD-MSC model are compared with estimates made using FBD with concatenation (henceforth “FBD-concatenation”), the MSC with a fixed molecular clock instead of an FBD prior, and concatenation with a fixed clock. FBD-MSC results were generally in agreement with fixed clock MSC estimates. Concatenation overestimated tip branch lengths, species divergence times, and the timing of diversification leading to extant Caninae, even when fossil data was incorporated using the FBD model.
Methods
Integrative Model Probability
The integrative model combining the MSC, the FBD process, and morphological evolution can be expressed by combining the component likelihoods. The likelihood of a gene tree is the phylogenetic likelihood (Felsenstein 1981) $\Pr(D_i \mid G_i)$, where $D_i$ is the multiple sequence alignment (MSA) for the $i$th gene tree $G_i$. The MSC probability for that gene tree is $P(G_i \mid S)$, where $S$ is the species tree. The likelihood contribution to the species tree of a morphological character is the phylogenetic likelihood $\Pr(C_j \mid S)$, where $C_j$ is the vector of states for the $j$th character. The prior probability of the species tree under the FBD process is $P(S \mid \theta)$, where $\theta$ is a vector of FBD parameters as described by Gavryushkina et al. (2014). Combining the likelihoods for the integrative model we get the probability of the species tree given the molecular, morphological and fossil data:
$$P(S \mid D, C) = Z \cdot P(S \mid \theta) \prod_j \Pr(C_j \mid S) \prod_i \Pr(D_i \mid G_i) \, P(G_i \mid S)$$
where $Z = 1/\Pr(D, C)$ is the reciprocal of the marginal likelihood, an unknown normalizing constant that does not need to be computed when using Markov chain Monte Carlo (MCMC) to sample from the posterior distribution.
Under this model, the MSAs inform the species tree through the gene trees, whereas the morphological characters inform the species tree directly. Ultimately both the MSAs and morphological characters inform the FBD parameters through the species tree (Fig. 2).
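The way the component terms combine can be illustrated in log space, which is how such products are usually handled numerically. The following is a minimal sketch and not the StarBEAST2 implementation: the function name and argument layout are hypothetical, and each log term is assumed to have been computed elsewhere.

```python
def log_unnormalized_posterior(log_seq_liks, log_msc_probs,
                               log_morph_liks, log_fbd_prior):
    """Sum the component log-probabilities of the integrative model.

    log_seq_liks   -- log Pr(D_i | G_i), one value per locus
    log_msc_probs  -- log P(G_i | S), one value per locus
    log_morph_liks -- log Pr(C_j | S), one value per character
    log_fbd_prior  -- log P(S | theta)
    The normalizing constant Z is omitted, as MCMC does not need it.
    """
    if len(log_seq_liks) != len(log_msc_probs):
        raise ValueError("one MSC term is needed per gene tree likelihood")
    gene_part = sum(d + g for d, g in zip(log_seq_liks, log_msc_probs))
    morph_part = sum(log_morph_liks)
    return gene_part + morph_part + log_fbd_prior
```

Because the MSAs enter only through the per-locus terms and the morphological characters only through the species tree terms, the sum separates cleanly into a gene tree part and a species tree part.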
Sampling and Simulating Trees from the Prior
We tested our implementation of the FBD-MSC model by using it to jointly sample species trees with a single embedded gene tree from the prior, and comparing those distributions with FBD trees and with gene trees produced by direct simulation. Three- and four-taxon FBD trees were sampled from the prior using the “SA” package in BEAST2 (Gavryushkina et al. 2014). Following the parameterization in Gavryushkina et al. (2014), these trees were conditioned on an origin time $t_{or}$ of 3, a birth rate λ of 1, a death rate μ of 0.5, a sampling rate ψ of 0.1, a removal probability r of 0 and a present-day sampling probability ρ of 0.1.
The sampled taxa for three-taxon FBD trees were labelled A, B and C, and had fixed ages of 0, 1, and 1.5 respectively. The sampled taxa for the four-taxon FBD trees were labelled A, B, C and D, and had fixed ages of 0, 0, 0.5 and 2 respectively.
MCMC chains to sample FBD trees were run for 100 million steps, and 100,000 trees sampled at a rate of 1 per 1,000 steps. One gene tree was simulated for each sample using custom Java code available as part of the StarBEAST2 package, assuming effective population sizes fixed at 1 for each branch.
When jointly sampling FBD-MSC species and gene trees from the prior using StarBEAST2, identical parameters were used but MCMC chains were run for 500 million steps. Species and gene trees were sampled at a rate of 1 per 5,000 steps for 100,000 species trees and the same number of gene trees.
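The core of direct gene tree simulation is drawing coalescent waiting times. The sketch below is not the custom Java code mentioned above; it is a simplified Python illustration for a single, unbounded population, so it ignores species tree branch boundaries. The function name is hypothetical, and the rate convention (k lineages coalescing at total rate k(k-1)/(2N)) is an assumption that may differ from the scaling used by a given package.

```python
import random


def simulate_coalescent_times(n_lineages, pop_size=1.0, rng=None):
    """Return cumulative times of the n_lineages - 1 coalescent events.

    Assumption: k lineages coalesce at total rate k*(k-1)/(2*pop_size).
    """
    rng = rng or random.Random()
    times = []
    t = 0.0
    k = n_lineages
    while k > 1:
        rate = k * (k - 1) / (2.0 * pop_size)
        t += rng.expovariate(rate)  # exponential waiting time to next event
        times.append(t)
        k -= 1
    return times
```

For example, `simulate_coalescent_times(4, 1.0, random.Random(1))` returns three increasing event times; a full MSC simulator would additionally stop each population at the top of its species tree branch and pass surviving lineages to the parent branch.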
Compiling Caninae Data
Unphased molecular sequences were retrieved from NCBI GenBank. Sequences from Bardeleben et al. (2005a) had accession numbers AY609082–AY609158. Sequences from Bardeleben et al. (2005b) had accession numbers AY885308–AY885426. Sequences from Lindblad-Toh et al. (2005) had accession numbers DQ239439–DQ239486 and DQ240289–DQ240817. Outgroup (non-Caninae) and domestic dog sequences were discarded. Canis aureus was renamed Canis anthus following Koepfli et al. (2015). For each locus, we aligned those sequences to produce an MSA using PRANK (Löytynoja and Goldman 2005). Phased MSAs were generated by duplicating each aligned sequence and randomly phasing heterozygous sites.
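The random phasing step can be sketched as follows. This is an illustration only, assuming heterozygous sites are encoded with the two-base IUPAC ambiguity codes and that every other symbol (including gaps and N) is copied unchanged to both haplotypes; the function name is hypothetical.

```python
import random

# IUPAC codes for sites heterozygous for exactly two bases
HET_CODES = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def random_phase(sequence, rng=None):
    """Duplicate an unphased sequence and randomly phase heterozygous sites."""
    rng = rng or random.Random()
    hap_a, hap_b = [], []
    for base in sequence.upper():
        if base in HET_CODES:
            first, second = HET_CODES[base]
            if rng.random() < 0.5:  # assign the two bases in random order
                first, second = second, first
            hap_a.append(first)
            hap_b.append(second)
        else:
            hap_a.append(base)
            hap_b.append(base)
    return "".join(hap_a), "".join(hap_b)
```

Each heterozygous site is phased independently, so linkage between sites is not preserved; this is what "randomly phasing" implies here.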
Coded morphological data, character names, character state names and tip dates from Slater (2015) were retrieved from Dryad (https://doi.org/10.5061/dryad.9qd51). This data set built on previous monographs (Wang 1994; Wang et al. 1999; Tedford et al. 2009).
Outgroup characters and characters invariable within Caninae were discarded. Canis aureus was again renamed Canis anthus, and Cuon javanicus was renamed Cuon alpinus, a synonym used in the molecular sequence data. For species with molecular sequences but no morphological data, all characters were treated as missing data. An extant-only data set was produced by discarding fossil taxon characters, and characters invariable within extant Caninae. BEAST2-compatible NEXUS files were generated containing the coded data and names.
MSC and Concatenation Analyses
The MSC (in practice, StarBEAST2) was configured to estimate a constant population size separately for each branch, with a maximum effective population size of 2, and a 1/X prior on the mean population size. Phased sequences were used with StarBEAST2, and unphased sequences with concatenation. For both StarBEAST2 and concatenation, we set the uniform priors U(0, 2) and U(0,1) on the diversification rate λ − μ and on the turnover parameter μ ÷ λ respectively.
The mean substitution rate was either fixed at 8 × 10⁻⁴ substitutions per site per million years, or estimated with a lognormal prior which had a mean of 7.5 × 10⁻⁴ and a standard deviation of the log rate of 0.6. Substitution rates among loci were allowed to vary with a flat Dirichlet prior. The HKY substitution model was used for molecular data (Hasegawa et al. 1985), and transition/transversion ratios estimated separately for each locus. The Mkv model (Lewis 2001) was used to model the evolution of morphological characters, assuming character state frequencies and transition rates are all equal. A morphological clock was estimated with a 1/X prior and an upper bound of 1.
FBD analyses were conditioned on $t_{or}$, which was estimated with a uniform prior U(0, 1000). The sampling proportion ψ ÷ (ψ + μ) was also estimated with a uniform prior U(0, 1). The other FBD parameters r and ρ were fixed at 0 and 1 respectively.
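The priors are placed on the diversification rate λ − μ, the turnover μ ÷ λ and the sampling proportion ψ ÷ (ψ + μ), which can be converted back to the rates λ, μ and ψ by simple algebra. The helper below is a sketch with a hypothetical name, assuming turnover and sampling proportion are strictly below 1.

```python
def fbd_rates(diversification, turnover, sampling_proportion):
    """Convert (lambda - mu, mu / lambda, psi / (psi + mu)) to (lambda, mu, psi)."""
    if not 0.0 <= turnover < 1.0:
        raise ValueError("turnover must be in [0, 1)")
    if not 0.0 <= sampling_proportion < 1.0:
        raise ValueError("sampling proportion must be in [0, 1)")
    birth = diversification / (1.0 - turnover)   # lambda
    death = turnover * birth                     # mu
    sampling = death * sampling_proportion / (1.0 - sampling_proportion)  # psi
    return birth, death, sampling
```

As a check, a diversification rate of 0.5, a turnover of 0.5 and a sampling proportion of 1/6 recover λ = 1, μ = 0.5 and ψ = 0.1. Note that when μ is zero the sampling proportion no longer determines ψ, and this sketch then returns ψ = 0.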
For each fixed clock analysis, we ran 20 independent MCMC chains of 400 million states each, sampling once every 200,000 states, and discarded the first 10% of samples as burnin. For each fossilized birth-death analysis, we ran 20 independent MCMC chains of 15 billion states each, sampling once every 2 million states, and discarded the first 4% of samples from each chain as burnin. For each type of analysis, the independent chains were concatenated and subsampled for a combined sample of 2,000 states.
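The number of post-burnin samples available before subsampling follows from the chain settings. The following is a small bookkeeping sketch with a hypothetical function name; it assumes the initial state is not counted as a sample and that burnin is rounded to a whole number of samples.

```python
def retained_samples(chain_length, sample_interval, burnin_fraction, n_chains):
    """Count post-burnin samples pooled across independent chains."""
    per_chain = chain_length // sample_interval
    discarded = round(per_chain * burnin_fraction)
    return (per_chain - discarded) * n_chains


fixed_clock_total = retained_samples(400_000_000, 200_000, 0.10, 20)
fbd_total = retained_samples(15_000_000_000, 2_000_000, 0.04, 20)
```

Under these assumptions the fixed clock analyses pool 36,000 samples and the FBD analyses 144,000, so a combined sample of 2,000 states corresponds to keeping every 18th and every 72nd pooled sample respectively.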
Posterior Predictive Simulations
For half (1,000) of the fixed clock StarBEAST2 posterior samples, we resimulated molecular and morphological data. For each locus a gene tree was simulated according to the MSC using DendroPy (Sukumaran and Holder 2010), embedded within the species tree (topology, times and per-branch population sizes) for that sample, with two alleles per extant species. An MSA was simulated for each gene tree using Seq-Gen (Rambaut and Grassly 1997), based on the HKY model with the estimated κ ratio and substitution rate of the locus from the posterior sample, and of the same length as the original locus. Unphased per-species sequences were generated using ambiguity codes for heterozygous sites.
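Collapsing two simulated alleles into one unphased sequence can be sketched as below. This is an illustration with a hypothetical function name; it assumes both alleles are aligned to the same length, and the handling of any pairing other than two different nucleotides (for example a gap against a base) as N is an assumption.

```python
# IUPAC ambiguity codes for each unordered pair of distinct bases
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def unphase(allele_a, allele_b):
    """Merge two aligned alleles, coding heterozygous sites as ambiguities."""
    if len(allele_a) != len(allele_b):
        raise ValueError("alleles must have the same aligned length")
    merged = []
    for a, b in zip(allele_a.upper(), allele_b.upper()):
        if a == b:
            merged.append(a)
        else:
            # unknown pairings fall back to N (assumption)
            merged.append(IUPAC.get(frozenset((a, b)), "N"))
    return "".join(merged)
```

For example, `unphase("ACGT", "ACAT")` gives `"ACRT"`, since the third site is heterozygous for G and A.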
Morphological data was resimulated by simulating a 1,000 character MSA along the posterior sample’s species tree with 20 states per character, again using Seq-Gen. Base frequencies and transition rates were all equal, and the substitution rate set to 0.03. Then for each morphological character in the original data set, we sampled without replacement one of the simulated characters with a matching number of observed states.
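The matching step can be sketched as follows. This is not the script used for the study; the function names and the representation of a character as a string of per-taxon state symbols (with `?` and `-` treated as unobserved) are assumptions made for illustration.

```python
import random
from collections import defaultdict


def n_observed_states(character):
    """Number of distinct observed states, ignoring missing data symbols."""
    return len(set(character) - {"?", "-"})


def match_characters(original, simulated, rng=None):
    """For each original character, pick (without replacement) the index of a
    simulated character with the same number of observed states."""
    rng = rng or random.Random()
    pools = defaultdict(list)
    for index, character in enumerate(simulated):
        pools[n_observed_states(character)].append(index)
    chosen = []
    for character in original:
        pool = pools[n_observed_states(character)]
        if not pool:
            raise ValueError("no simulated character left with matching states")
        chosen.append(pool.pop(rng.randrange(len(pool))))
    return chosen
```

Simulating many more characters than are needed (1,000 here) makes it unlikely that any pool of matching characters is exhausted.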
Each simulation was reanalyzed using concatenation with the same model and priors as for the original data set. However only one chain of 200 million states was run for each simulation, sampling once every 80,000 states, and 20% of samples were discarded as burnin.
Calculating Summary Statistics
Summary statistics were calculated for each estimated distribution of trees using DendroPy. These included the maximum clade credibility (MCC) tree, branch lengths, node heights, branch support and node support. For the purpose of calculating support values and internal node heights, a node is defined as the root of a subtree containing all of, and only, a given set of extant taxa. A branch is defined as the direct connection between parent and child nodes as defined above. Lineages-through-time (LTT) curves for FBD analyses were calculated using a custom script. Summary statistics and LTT plots were visualized using ggplot2 (Wickham 2016) and ggtree (Yu et al. 2017).
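The node definition above lends itself to a simple calculation of support and conditional node height. The sketch below does not use DendroPy; it assumes each posterior tree has already been reduced to a mapping from a set of extant taxa to the height of the node subtending exactly those taxa, and the function name is hypothetical.

```python
def clade_summary(trees, clade):
    """Return (support, mean node height) for a clade across sampled trees.

    trees -- list of dicts mapping frozenset of extant taxon names to height
    clade -- iterable of extant taxon names defining the node
    The mean height is conditional on the clade being present.
    """
    key = frozenset(clade)
    heights = [tree[key] for tree in trees if key in tree]
    support = len(heights) / len(trees)
    mean_height = sum(heights) / len(heights) if heights else None
    return support, mean_height
```

Keying nodes by their extant descendants is what allows trees with different fossil placements, or with fossils pruned, to be compared node by node.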
Results
FBD-MSC Implementation Correctness
To test the correctness of our FBD-MSC implementation, we first compared distributions of three- and four-taxon FBD trees drawn from the prior using BEAST2 without the MSC, to distributions drawn from the prior using the FBD-MSC model in StarBEAST2. The marginal divergence time (Supplementary Fig. S1,S2) and topology (Supplementary Fig. S3,S4) distributions thus generated were found to be identical between implementations. As the BEAST2 implementation of the FBD model has been previously verified (Gavryushkina et al. 2014), this is strong evidence that the new implementation is also correct.
Gene trees were also sampled from the prior under the FBD-MSC model in StarBEAST2, and were compared to a distribution of gene trees simulated evolving within the FBD trees that were drawn from the prior absent StarBEAST2. The distributions of gene tree coalescent times (Supplementary Fig. S5,S6) and topologies (Supplementary Fig. S7,S8) were identical for either method, further supporting the correctness of our implementation.
Compiling an Exemplar Dataset
To demonstrate the effects of estimating species divergence times without accounting for coalescent processes, as when using concatenation, we compiled a data set by combining 19 previously published Caninae nuclear locus sequences from extant Caninae taxa (Table 1) with morphological characters and times from extant and fossil Caninae (Slater 2015). The combined data set included 21 extant taxa with molecular data only, 9 extant taxa with molecular and morphological data, and 31 fossil taxa with tip dates and morphological data. After removing characters with no variation within Caninae, there were 72 morphological characters remaining for FBD analyses. After further removing characters with no variation among the 9 taxa with both molecular and morphological data, there were 55 remaining for fixed clock analyses.
Calibrating Species Trees Using a Fixed Clock
In the absence of a fossil record for a clade of interest, divergence times can be estimated using a fixed molecular clock. This scales the tree by an a priori chosen substitution rate, or a set of substitution rates for a set of genes. Substitution rates have been previously estimated for the nuclear RAG1 gene across multiple tetrapod clades, and for mammals the rate is approximately 1 × 10⁻³ substitutions per site per million years (Hugall et al. 2007). Exploratory analyses suggested that RAG1 evolves around 25% more quickly than the mean rate for all genes in our study, so we used a substitution rate fixed at 8 × 10⁻⁴ for analyses calibrated with a fixed clock.
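As a worked check of the fixed rate, assuming that "25% more quickly" means the RAG1 rate is 1.25 times the mean rate across loci (written here as $\bar{r}$, a symbol introduced only for this illustration):

```latex
\bar{r} \approx \frac{r_{\mathrm{RAG1}}}{1.25}
        = \frac{1 \times 10^{-3}}{1.25}
        = 8 \times 10^{-4}
        \ \text{substitutions per site per million years}
```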
We compared the posterior distribution of species trees inferred under the MSC and concatenation without any fossil data, including nuclear loci and morphological characters only from extant taxa, and using a birth-death prior for the species tree. The estimated lengths of all tip branches and some internal branches were longer when using concatenation (Fig. 3). A few internal branches were shorter, most of all the 1–2, 5–A and E–J branches.
To understand whether failing to account for neutral coalescent processes could cause the observed branch length differences, we used posterior predictive simulations to model the expected differences. For 1,000 species tree samples in the fixed clock MSC posterior distribution, we resimulated gene trees according to the MSC. For each simulated gene tree, we simulated an MSA based on that sample’s substitution rates and transition/transversion ratios. A set of morphological characters were also simulated along the species tree for each sample. Posterior distributions of species trees using concatenation were then inferred from the simulated data.
For a given branch, we calculated the distribution of differences in branch length between the true length l of a branch b and the concatenation estimate $\hat{l}$. This calculation was based on the replicates where the species tree used for simulation contained b. The estimate $\hat{l}$ is the expectation marginalized over all samples containing b. In the case of phylogenetic cherries, only one branch was included, because their lengths are always equal in an ultrametric tree.
All observed differences in branch lengths fell within expectations (Fig. 4). This suggests that the failure to account for neutral coalescent processes, as modeled by the MSC, is responsible for the observed differences.
Calibrating Species Trees Using Fossil Data
Using a fixed molecular clock conditions the estimated divergence times on the accuracy of the a priori chosen substitution rate. The rate of molecular evolution is inversely associated with body size in mammals (Bromham 2011) so the substitution rate used for, say, baleen whales would likely be too slow when applied to Muridae. Instead the molecular substitution rate can be inferred jointly with the species tree topology and times by including fossil data and applying an FBD prior to the species tree.
We reran our concatenation and MSC analyses of Caninae after including morphological data with tip dates (fossils), and applied FBD priors to the species trees. The placement of fossil taxa was very uncertain, so to make the FBD results interpretable we pruned the posterior distributions of species trees to include only extant taxa. This also enables direct comparisons of the FBD and fixed clock results. The MCC tree topology inferred by FBD-MSC was identical to fixed clock MSC (Figs. 3, 5).
The differences in branch lengths observed for FBD-concatenation compared to FBD-MSC were very similar to those seen in the fixed clock scenario (Fig. 6). All branches with longer estimated lengths using concatenation and a fixed clock compared to MSC and a fixed clock also had longer estimated lengths using FBD-concatenation compared to the corresponding FBD-MSC estimates. The same applied to branches with shorter estimated lengths using concatenation compared to the MSC (Figs. 3, 5).
Similar estimates were made of macroevolutionary parameters using the MSC and concatenation models, as long as the same species tree prior was used (Table 2). When using the FBD prior, the molecular clock rate highest posterior densities (HPDs) included the a priori rate of 8 × 10⁻⁴ with either the MSC or concatenation. The only non-overlapping HPDs were for the morphological clock rate, which was inferred to be slower when using the FBD compared to a fixed clock. The lower bound for turnover (extinction relative to speciation) was approximately zero when fossil data was not included, but was higher when fossils were explicitly included for FBD analyses.
Clade Ages and Uncertainty
For all clades in the FBD-MSC MCC tree with at least 0.5% support, the divergence time for the root node of that clade according to the FBD-MSC was younger than when estimated using FBD-concatenation (Fig. 7). While the HPD intervals of the two estimates often overlapped substantially, those for the A node (the MRCA of extant sampled Canis, Cuon and Lycaon) and the D node (nested within the A node and excluding Canis mesomelas and C. adustus) did not, and the FBD-concatenation estimates of those species divergence times were about 2 Myr older than FBD-MSC.
The Tempo of Caninae Evolution
If species divergence times are always overestimated using concatenation, even when using fossil data and an FBD prior to calibrate the species trees, this is likely to affect macroevolutionary analyses. As an example, we present LTT curves of Caninae diversification estimated using FBD-MSC and FBD-concatenation (Fig. 8).
For both methods the LTT curves are convex, as expected for a birth-death model of evolution with good taxon sampling (Stadler 2008). However the diversification leading to extant Caninae occurs earlier for the FBD-concatenation LTT curve compared to the FBD-MSC curve. The FBD-concatenation estimate also suggests a diversification slowdown during the last ≈ 2 million years, which is not suggested by the FBD-MSC curve. Diversification slowdown is a predicted spurious effect of concatenation (Ogilvie et al. 2017).
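For a single tree pruned to extant taxa, an LTT curve can be derived from the divergence times alone. The sketch below is not the custom script used for the figures; it assumes a bifurcating tree of n extant taxa with n − 1 node ages measured backwards from the present, and it counts a single stem lineage at times older than the root.

```python
def lineages_through_time(node_ages, times):
    """Number of lineages ancestral to the extant taxa at each time before present.

    At time t there is one lineage plus one for every divergence older than t.
    """
    return [1 + sum(1 for age in node_ages if age > t) for t in times]
```

With node ages of 1.0, 2.5 and 4.0 and query times of 0, 1, 2, 3 and 5, this gives 4, 3, 3, 2 and 1 lineages. Because every node age enters the count directly, a method that overestimates divergence times shifts the whole curve towards the past, which is the pattern described above for FBD-concatenation.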
Support for Specific Clades
We considered clade support contradictory between analyses if that clade was highly supported (> 95%) in any analysis and unsupported (< 5%) in any other analysis. Only the R clade met this criterion, which is the clade that unites Cuon and Lycaon (Table 3).
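The criterion can be stated compactly in code. The sketch below uses a hypothetical function name and made-up support values for placeholder clades X and Y; it does not reproduce the values in Table 3.

```python
def contradictory_clades(support, high=0.95, low=0.05):
    """Return clades highly supported in one analysis and unsupported in another.

    support -- dict mapping clade label to {analysis name: posterior support}
    """
    return [
        clade
        for clade, by_analysis in support.items()
        if max(by_analysis.values()) > high and min(by_analysis.values()) < low
    ]


# Hypothetical values for illustration only
example = {
    "X": {"analysis 1": 0.99, "analysis 2": 0.01},
    "Y": {"analysis 1": 0.99, "analysis 2": 0.60},
}
flagged = contradictory_clades(example)
```

Here only clade X is flagged, because clade Y is never below the 5% threshold.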
The R clade was highly supported by the MSC regardless of whether a fixed clock or FBD was used to calibrate the species tree. To understand whether this support was driven by coalescent processes alone or by interactions with morphological data, we reran our fixed clock analyses with only molecular data. Without the morphological data there was no support for this clade even when using the MSC, suggesting that unmodeled processes such as selection for convergent morphological evolution might be increasing support for this clade.
Discussion
Concatenated Likelihood Methods Are Inaccurate
Several recent studies have demonstrated that methods which use phylogenetic likelihood to estimate species trees from concatenated loci – “concatenated likelihood” for short – are inaccurate under realistic conditions. These studies have been based on simulation and analytical results, and have covered both maximum likelihood (ML) and Bayesian concatenation.
Mendes and Hahn (2016) showed that ML concatenation is systematically biased when estimating the lengths of particular branches on an asymmetric species tree. This is due to substitutions produced by ILS (SPILS), which are artificial substitutions on discordant species tree branches. Mendes and Hahn (2018) went on to show that SPILS is also responsible for the statistical inconsistency of ML concatenation when estimating species tree topologies, even outside of the so-called “anomaly zone” of short branch lengths where the most probable gene tree topology is discordant with the species tree.
Other studies have shown that Bayesian concatenation can be grossly inaccurate when estimating species trees. Bayesian concatenation can overestimate the lengths of tip branches by as much as 350%, and is less accurate than Bayesian MSC using the same number of loci (Ogilvie et al. 2016). Bayesian concatenation is also less accurate at estimating the lengths of internal branches, and reports overly precise credible intervals and support values which can exclude the true values and topologies a majority of the time (Ogilvie et al. 2017).
We have built on previous results by studying the effect of concatenation on an empirical data set of Caninae. Using posterior predictive simulations, we have shown that the observed differences in species tree branch lengths between the MSC and concatenation are expected and caused by a failure to account for coalescent processes. Consistent with previous studies, tip branch lengths were always overestimated, and internal branch lengths were sometimes inaccurate in either direction (Figs. 3, 4).
FBD-MSC Results Are More Plausible
Researchers may wonder if the known problems of concatenation are relevant to dated trees inferred using an FBD process. Our study showed that for Caninae, dated species trees inferred using a fixed clock are very similar to dated species trees inferred using an FBD process. We further demonstrated that the differences between MSC and concatenation estimates made under a birth-death process without fossil data are very similar to those made under an FBD process with fossil data (Fig. 6).
Considering coalescent theory and the totality of our results, the FBD-MSC results are more plausible than the FBD-concatenation results. The posterior predictive simulations show that the observed differences in branch lengths between the MSC and concatenation are expected due to a failure to account for coalescent processes.
This has important implications for downstream analyses, as seen in the LTT plots (Fig. 8) where the FBD-concatenation LTT curve suggests a slowdown in Caninae diversification during the past ≈ 2 million years. In contrast, the FBD-MSC LTT curve shows a burst of diversification in the same time frame.
In this study the estimated clock rate of Caninae using the FBD was consistent with the rate inferred by Hugall et al. (2007). Despite this consistency, FBD models are still necessary to account for the correct amount of uncertainty in clock rates, and because the a priori clock rate will not always be accurate. If we had studied a different mammalian clade, it would not necessarily have a mean substitution rate consistent with Hugall et al. (2007).
Some unexplored possibilities are that FBD-concatenation would approach FBD-MSC given a morphological matrix covering more taxa and/or when using a relaxed clock. These are hypothetically interesting questions but in practice morphological data sets are usually quite limited in the number of taxa and characters. Concatenation with a relaxed clock is much slower than StarBEAST2 with a strict clock, without any evidence of improved error rates (Ogilvie et al. 2017).
Morphological and Molecular Discordance
We observed that the inclusion or omission of morphological data completely changes the support of the Lycaon+Cuon clade from 100% to 0% respectively when using MSC models (Table 3). Support for this clade is ubiquitous in morphological phylogenetic studies of Caninae (Tedford et al. 2009; Prevosti 2010) and probably is due to their specialized dentitions. A previous study of Caninae which combined morphological characters and mitochondrial sequence alignments found that support for this clade came only from the morphological data, and proposed that the responsible characters are likely convergent due to the hypercarnivory of these two species (Zrzavý and Řičánková 2004).
Molecular phylogeneticists should be aware of the potential for morphological model violations when conducting total-evidence studies, and be appropriately cautious when interpreting results. A potential avenue for future research is the development of improved models of morphological evolution, which allow for convergence across many characters at once due to selection. New models could either rule in or out support for Lycaon+Cuon by ascribing their similar morphology to convergent evolution. Alternatively, support for this putative clade could be further scrutinized through expanded sampling of fossil representatives of these lineages.
The molecular signal could also be potentially misleading due to unmodeled processes, for example introgression. This could be addressed by integrating the FBD with the multispecies network coalescent, which unlike the MSC does allow for introgression and hybridization (Wen and Nakhleh 2017; Zhang et al. 2017).
Integrative Models Are the Future
The development and implementation of the integrative FBD-MSC model demonstrates how integrative models are made possible within a Bayesian framework. Unlike previous Bayesian implementations of the MSC which are ultrametric and hence limited to contemporary sources of data, using the FBD-MSC we can incorporate morphological and timing information from excavated fossils. The FBD-MSC is a first step, and the future will see further development of integrative models in theory, and the development and use of new implementations in practice.
Funding
This research was funded by a Royal Society of New Zealand Marsden award granted to AJD, DW, NJM, TGV and TS (16-UOA-277). HAO was supported by an Australian Laureate Fellowship awarded to Craig Moritz by the Australian Research Council (FL110100104). TS was supported in part by the European Research Council under the Seventh Framework Programme of the European Commission (PhyPD: grant agreement number 335529).
Acknowledgments
This research was undertaken with the assistance of resources from the National Computational Infrastructure (NCI), which is supported by the Australian Government. We thank Craig Moritz for his advice on preparing the manuscript, and the late Colin Groves for his insight into the Caninae fossil record.
Footnotes
* huw.ogilvie{at}anu.edu.au