Abstract
Dedication To the memory of Rossiter H. Crozier (1943-2009), an evolutionary biologist, who with his generosity and inquisitiveness inspired many students and scientists, in Australia and abroad.
Molecular phylogenetics plays a key role in comparative genomics and has an increasingly-significant impact on science, industry, government, public health, and society. We report that the current phylogenetic protocol is missing two critical steps, and that their absence allows model misspecification and confirmation bias to unduly influence the phylogenetic estimates. Based on the potential offered by well-established but under-used procedures (i.e., assessment of phylogenetic assumptions and test of goodness-of-fit), we introduce a new phylogenetic protocol that will reduce confirmation bias and increase the accuracy of phylogenetic estimates.
Molecular phylogenetics plays a pivotal role in the analysis of genomic data and has already had a significant, wide-reaching impact in science, industry, government, public health, and society (Table 1). Although the science and methodology behind applied phylogenetics is increasingly well understood within parts of the scientific community42, there is still a worryingly large body of research where the phylogenetic component was done with little attention to the consequences of a statistical misfit between the phylogenetic data and the assumptions embedded in the phylogenetic methods.
One reason for this is that molecular phylogenetics relies extensively on mathematics, statistics, and computer science, and many users of phylogenetic methods find the relevant subsections of these disciplines challenging to comprehend. A second reason is that methods and software often are chosen because they are easy to use and comprehend, or simply already popular, rather than because they are the most appropriate for the scientific questions and phylogenetic data at hand. A third reason is that much of the phylogenetic research done so far has relied on phylogenetic protocols43-48, which have evolved to become a standard to which it seems sensible to adhere. Although these protocols vary, they have, at their core, a common set of sensible features that are linked in a seemingly logical manner (see below).
Here we report that, although the current phylogenetic protocol has many useful features, it is missing two crucial components whereby the quality of fit between the data and models applied is assessed. This means that using the phylogenetic protocol in its current form may lead to biased conclusions. We suggest a modification to the protocol that will make it more robust and reliable.
The current phylogenetic protocol
Phylogenetic analysis of alignments of nucleotides or amino acids usually follows a protocol like that in Figure 1. Initially, the phylogenetic data are chosen on the basis of the assumption that the data will allow the researchers to solve their particular scientific questions. This choice of sequence data is often based on prior knowledge, developed locally or gleaned from the literature, and recommendations. Then, a multiple sequence alignment (MSA) method is chosen, often on the basis of prior experience with a specific method. The sequences are then aligned—the aim is to obtain an MSA, wherein homologous characters (i.e., nucleotides or amino acids) are identified and aligned. In practice, it is often necessary to insert gaps between the characters in some of the sequences to obtain an optimal MSA, and, in some cases, there may be sections of the MSA that cannot be aligned reliably.
Then follows the task of selecting the sites that will be used in the phylogenetic analysis. The rationale behind doing so is to maximize the signal-to-noise ratio in the MSA. By deleting poorly-aligned and highly-variable sections of the MSA, which are thought to create noise due to the difficulty of establishing homology, it is hoped that the resulting sub-MSA retains a strong historical signal49 that will allow users to obtain a well-supported and well-resolved phylogeny. The choice of sites to retain is made by visual inspection of the MSA or by using purpose-built software50-54. The automated ways of filtering MSAs have been questioned55.
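To make the idea of automated site selection concrete, the following is a minimal Python sketch that drops alignment columns in which the proportion of gaps exceeds a threshold. The function name `mask_gappy_columns`, the 0.5 threshold, and the toy alignment are assumptions made for illustration only; this is not the algorithm used by any of the purpose-built programs cited above, which apply more elaborate criteria.

```python
def mask_gappy_columns(msa, max_gap_fraction=0.5, gap_chars="-?"):
    """Return a sub-MSA without the columns whose gap fraction exceeds the threshold.

    msa is a list of equal-length strings (one aligned sequence per string).
    The threshold is a hypothetical default, not a recommended value.
    """
    if not msa or len({len(seq) for seq in msa}) != 1:
        raise ValueError("expected a non-empty alignment of equal-length sequences")
    n_seqs = len(msa)
    kept = []
    for column in zip(*msa):  # iterate over alignment columns (sites)
        gap_fraction = sum(ch in gap_chars for ch in column) / n_seqs
        if gap_fraction <= max_gap_fraction:
            kept.append(column)
    if not kept:  # every column was masked
        return [""] * n_seqs
    # Transpose the retained columns back into sequences.
    return ["".join(chars) for chars in zip(*kept)]


toy_msa = ["AC-T", "A--T", "ACGT"]
sub_msa = mask_gappy_columns(toy_msa)
print(sub_msa)
```

In this toy case only the third column (two gaps out of three sequences) is removed. A different threshold would yield a different sub-MSA from the same MSA, so the choice of threshold directly affects which sites reach the phylogenetic analysis.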
Having obtained a sub-MSA, the next step in the protocol is to select a phylogenetic method for analysis of the data. Importantly, this means that it is assumed that the sequences have diverged along the edges of a single bifurcating tree (the tree-likeness assumption) and that the evolutionary processes operating at the variable sites are independent and identically-distributed processes (the IID assumption). If model-based molecular phylogenetic methods are chosen, it is also assumed that the evolutionary processes operating at the variable sites can be modelled accurately by time-reversible Markov models, and that the processes were stationary, reversible and homogeneous56-58 over time (the assumption of SRH conditions). In practice, the choice is one between methods assuming that the underlying evolutionary process can be modelled using a Markov model of nucleotide or amino-acid substitutions (i.e., distance methods59-64, likelihood methods59, 61, 63-69, Bayesian methods70-75), or using non-parametric phylogenetic methods (i.e., parsimony methods59, 61, 63, 64, 76-78). In reality, most researchers analyze their data by using different model-based phylogenetic methods, and reports that only use parsimony methods are increasingly rare. Depending on the chosen phylogenetic method, researchers may have to select a suitable model of sequence evolution (i.e., a substitution model and a rate-across-sites model) to apply to the sub-MSA. This choice is often made by using model-selection methods6, 79-90.
Having chosen the phylogenetic method and, in relevant cases, a suitable model of sequence evolution, the next step involves obtaining accurate estimates of the tree and evolutionary processes that led to the data. There is a plethora of programs that implement phylogenetic methods59-78. Depending on the methods chosen, users often also obtain the nonparametric bootstrap probability91 or clade credibility92 to measure support for individual divergence events in the phylogeny. These estimates are often thought of as measures of the accuracy of the phylogenetic estimate or the confidence we might have in the inferred divergence events. Doing so might be unwise because they are only measures of consistency93 (e.g., a phylogenetic estimate may consistently point to an incorrect tree).
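The resampling step behind the nonparametric bootstrap can be sketched as follows. This is a minimal illustration in Python: the function name `bootstrap_replicate`, the seed, and the toy alignment are assumptions, and the tree inference that would be repeated on each replicate is not shown.

```python
import random


def bootstrap_replicate(msa, rng):
    """Resample alignment columns with replacement to build one pseudo-replicate."""
    n_sites = len(msa[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are applied to every sequence, so sites stay intact.
    return ["".join(seq[i] for i in picks) for seq in msa]


rng = random.Random(1)  # fixed seed so that the sketch is repeatable
toy_msa = ["ACGTAC", "ACGTTC", "ATGTAC"]
replicates = [bootstrap_replicate(toy_msa, rng) for _ in range(100)]
```

Each replicate would then be analysed with the same phylogenetic method and model as the original data, and the support for a divergence event is the proportion of replicate trees that contain it. Because every replicate is drawn from the same data and analysed under the same assumptions, a high proportion shows that the method returns the same answer repeatedly, not that the answer is correct.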
Having inferred the phylogeny, with or without bootstrap probability or clade credibility for all of the internal branches (edges), the final step in the protocol is to interpret the result. Under some conditions—most commonly the inclusion of out-group sequences—the tree can be drawn and interpreted as a rooted phylogeny, in which case the order of divergence events and the lengths of the individual edges may be used to infer, for example, the tempo and mode of evolution of the data. The inferred phylogeny often confirms an earlier-reported or assumed evolutionary relationship. Often, too, there are surprises that are difficult to understand and explain. If the phylogenetic estimate is convincing and newsworthy, the discoveries may be reported, for example, through papers in peer-reviewed journals.
On the other hand, if the surprises are too numerous or do not appear credible, the researchers will begin the task of finding out what may have ‘gone wrong’ during the phylogenetic analysis. This process is illustrated as dashed feedback loops in Figure 1. The researchers may examine the data using alternative methods: use other Markov models, employ different phylogenetic methods, use a different sub-MSA, align the sequences differently, use a different alignment method, or use another data set. Given enough patience, the researchers may reach a conclusion about the data, and they may decide to publish their results.
Problems with the current phylogenetic protocol
Although the current phylogenetic protocol has led to many fine scientific discoveries, it also has left many scientists with strong doubts about or, alternatively, unduly strong confidence in the estimates. The literature is rife with examples where analyses of the same data have led to disagreements among experts about what is the ‘right’ phylogeny (cf. e.g., 94-96). Such disagreements can be confusing, especially for non-experts and the public. To avoid this, it is necessary to understand the challenges that applied phylogenetic research still faces.
While it is clear that the right data are needed to answer the scientific question at hand, making that choice is not always as trivial as it might seem. In some cases, the sequences may have evolved too slowly and/or be too short, in which case there may not be enough information in the data, or they have evolved so fast that the historical signal has largely been lost97. In rarely-reported cases, the data are not what they purport to be98.
Next, there is no consensus on what constitutes an optimal MSA. Clearly, what is required is an accurate MSA where every site is a correct homology statement. Currently, however, there is no automatic procedure for homology assessment, so to infer an accurate MSA is still more an art than science99, putting the whole phylogenetic protocol into jeopardy. One way to mitigate this problem is to rely on simulation-based comparisons and reviews of MSA methods99-107, but they appear to have had less impact than deserved.
The choice to remove poorly-aligned or highly-variable sites is confounded by the fact that the different MSA methods frequently return different MSAs, implying different homology statements—they cannot all be right. However, the choice of what sites to retain depends not only on the MSA method used but also on how difficult it is to identify the sites (e.g., it is impractical to visually inspect and edit MSAs with over ∼50 sequences and ∼400 sites). In the past, expert knowledge about the data was often applied (e.g., structural information about the gene or gene product), but automated methods50-54 are now typically used. However, these methods often produce different sub-MSAs from the same MSA, leaving confusion and doubt.
The choice of what phylogenetic method to use for the data is rated by many as the most challenging one to make (e.g., because the assumptions of each method are often poorly understood), and it is often solved by using several phylogenetic methods. If these methods return the same phylogenetic tree, many authors feel confident that they have inferred the ‘true’ phylogeny and would go on to report their discoveries. However, while this approach may have led to correct trees, it is perhaps more due to luck than to scientific rigor that the right tree was identified. This is because every phylogenetic method is based on assumptions (see above), and if these assumptions are not violated too strongly by the data, the true tree has a high probability of being found. On the other hand, if these violations are strong, there is currently no way of knowing whether the true tree was found. Indeed, strong violation of phylogenetic assumptions could lead to similar but nevertheless wrong trees being inferred using different phylogenetic methods49, 108.
Over the last two decades, the choice of a suitable model of sequence evolution has often been made by using purpose-built model-selection methods6, 79-90. Assuming a tree, these methods step through an array of predefined models, evaluating each of them, one by one, until the list of models is exhausted. This is sensible if the true or most appropriate model is included in the set of predefined models. On the other hand, if this model is not included in the set of predefined models, the popular model-selection methods may never be able to return an accurate estimate. They will return an optimal estimate, but it will be conditional on the models considered. Importantly, most popular model-selection methods only consider time-reversible Markov models. If the data have evolved on a single tree but under more complex conditions, then there is no way that a simple, time-reversible Markov model is sufficient to approximate the evolutionary processes across all edges of the tree110. Hence, it is worrying that researchers still ignore or dismiss the implication of compositional heterogeneity across sequences108: it implies that the evolutionary process for a set of sites has changed over time (e.g., the third position of codons has evolved under different evolutionary processes across time, requiring multiple models of sequence evolution for these data). This implication must be taken seriously when data are analyzed phylogenetically; typically, it is not.
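The conditional nature of model selection can be illustrated with a minimal Python sketch. It assumes that candidate models are ranked by the Akaike information criterion (AIC = 2k - 2 lnL), which is one common choice and not necessarily the criterion used by every method cited above. The log-likelihood values are hypothetical, and the parameter counts cover the substitution model only (edge lengths are shared by all candidates on a fixed tree and so do not change the ranking).

```python
def aic(log_likelihood, n_free_params):
    """Akaike information criterion; lower values indicate a better trade-off."""
    return 2 * n_free_params - 2 * log_likelihood


def best_model(fits):
    """Return the name of the candidate with the lowest AIC.

    fits maps a model name to (maximised log-likelihood, number of free parameters).
    """
    return min(fits, key=lambda name: aic(*fits[name]))


# Hypothetical log-likelihoods for three time-reversible models on one tree.
hypothetical_fits = {
    "JC": (-1050.0, 0),   # equal rates, equal base frequencies
    "HKY": (-1010.0, 4),  # 1 transition/transversion ratio + 3 base frequencies
    "GTR": (-1008.0, 8),  # 5 relative rates + 3 base frequencies
}
print(best_model(hypothetical_fits))
```

The procedure always returns a winner, but only from the models supplied to it. If the data evolved under non-time-reversible conditions, no entry in a list like this one can be the right model, however clear the ranking looks.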
The choice of phylogenetic program is often driven more by prior experiences and transaction costs (i.e., the time it takes to become a confident and competent user of the software) than by a profound understanding of the strengths, limitations, and weaknesses of the available software. This may not substantially impact the accuracy of the phylogenetic estimate, as long as the data are consistent with the phylogenetic assumptions of the methods and the methods thoroughly search tree space and model space.
Finally, once a well-supported phylogenetic estimate has been obtained, a researcher’s prior expectations are likely to influence whether the results are considered both reliable and newsworthy. In some cases, where information on the phylogeny is known (e.g., serially-sampled viral genomes), not meeting the prior expectations may signal an issue with the phylogenetic analysis. However, if a researcher’s expectations are confirmed by the phylogenetic estimates, it is more likely that a report will be written without a thorough assessment of what might have gone wrong during the analysis. This tendency to let prior expectations influence the interpretation of phylogenetic estimates is called confirmation bias. Confirmation bias is not discussed in phylogenetics, even though it is recognized as a critical factor in other disciplines (e.g., psychology and social science111), so it is timely that the phylogenetic community takes on board its serious implications.
The new phylogenetic protocol
Although the current phylogenetic protocol has many shortcomings, it has many desirable attributes, including that it is easy to follow and implement as a pipeline. However, to mitigate its limitations, it is necessary to redesign the protocol to accommodate well-established but largely-ignored procedures and new feedback loops.
Figure 2 shows our proposal for a new phylogenetic protocol. It shares many features found in the current protocol (e.g., the first four steps), but the fifth step (assess phylogenetic assumptions) is novel. Because all phylogenetic methods are based on assumptions, it is sensible to validate these assumptions at this point in the protocol. Since many phylogenetic methods assume that the data (e.g., different genes) have evolved over the same tree, and that the chosen data partitions have evolved independently under the same time-reversible Markovian conditions, it is wise to survey the sub-MSA for evidence that the sequences actually have evolved under these conditions. If the data violate the phylogenetic assumptions of some methods, then it would be wise to avoid these methods and to employ others whose assumptions the data do not violate. Alternatively, it may be worth following the relevant feedback loops in Figure 2, as something may have led to a biased sub-MSA. The relevance and benefits of this step are illustrated using a case study (Box 1), which focuses on determining whether a data set is consistent with the phylogenetic assumption of evolution under time-reversible conditions. Assessments of other phylogenetic assumptions require other types of tests and surveys (it is beyond the scope of this review to discuss these here).
Next follows the choice of phylogenetic method, but this choice is now made on the basis of the previous step, rather than cultural or computational reasons. If the sequences have evolved on a single tree under time-reversible Markovian conditions, there is a large set of phylogenetic methods to choose from59-78. On the other hand, if these data have evolved under more complex Markovian conditions, the number of suitable phylogenetic methods is frustratingly limited5, 57, 112-136, and most of these methods are aimed at finding the optimal model of sequence evolution for a given tree rather than finding the optimal tree. Users of phylogenetic methods therefore are sometimes confronted by a dilemma: Do they abandon their data set because it has evolved under non-time-reversible conditions and because there are no phylogenetic methods for such data, or do they take the risk and use robust time-reversible phylogenetic methods? Fortunately, there is a way around this dilemma (see below).
Having inferred the phylogeny using model-based phylogenetic methods, it is possible to test the fit between tree, model and data (step 10 of the new protocol). A suitable test of goodness-of-fit137 was proposed in 1993 (Fig. 4). In brief, using the inferred optimal tree (with edge lengths included), it is possible to simulate data sets under the null model (i.e., the inferred optimal model of sequence evolution with its parameter values included). This is called a parametric bootstrap. Given this tree and this model of sequence evolution, several sequence-generating programs5, 126, 138, 139 facilitate procurement of pseudo-data. Having generated, say, n = 1,000 pseudo-data sets, the next step involves finding, for the real data and for each pseudo-data set, the difference (δ) between the unconstrained (i.e., without assuming a tree and a model) and constrained (i.e., assuming a tree and a model) log-likelihoods (i.e., δ = lnL(D) - lnL(D|T, M), where D is the data, T is the tree, and M is the model of sequence evolution). If the estimate of δ is greater for the real data than for the pseudo-data, then that result reveals a poor fit between tree, model, and data109. The approach described here works well for likelihood-based phylogenetic analysis, and a similar approach is available for Bayesian-based phylogenetic analysis140. Parametric bootstrapping is computationally expensive and time-consuming, so it should only be done if the data appear to meet the assumptions of the phylogenetic method. The advantage of using such a goodness-of-fit test is that it allows users to determine if the lack of fit is large enough to not be due to chance. It does not say anything about whether or not the lack of fit matters. If the fit is poor, then the relevant feedback loops should be followed (Fig. 2), as a biasing factor may have been missed. If the phylogenetic tree and model of sequence evolution are found to fit the data, then that implies that these estimates represent a plausible explanation of the data.
It is these estimates that should be reported, but only as one plausible explanation, not as the only possible explanation. This is because there may be other plausible explanations of the data that never were considered during the analysis.
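The arithmetic of the goodness-of-fit test can be sketched in Python as follows. The sketch assumes that the unconstrained log-likelihood is the multinomial log-likelihood over site patterns (each distinct alignment column with count n_i among N sites contributes n_i ln(n_i/N)), and that the constrained log-likelihoods lnL(D|T, M) are supplied by whichever phylogenetic program was used. The function names and the +1 correction in the p-value are illustrative choices, not part of the protocol.

```python
import math
from collections import Counter


def unconstrained_lnl(alignment):
    """Multinomial log-likelihood of the site patterns (no tree, no model)."""
    n_sites = len(alignment[0])
    pattern_counts = Counter(zip(*alignment))  # one pattern per alignment column
    return sum(c * math.log(c / n_sites) for c in pattern_counts.values())


def delta(alignment, constrained_lnl):
    """delta = lnL(D) - lnL(D|T, M); constrained_lnl comes from external software."""
    return unconstrained_lnl(alignment) - constrained_lnl


def bootstrap_p_value(delta_observed, delta_simulated):
    """Proportion of pseudo-data sets whose delta is at least the observed delta.

    The +1 terms count the observed data as one draw from the null model,
    which prevents a p-value of exactly zero (an assumed convention).
    """
    exceed = sum(1 for d in delta_simulated if d >= delta_observed)
    return (exceed + 1) / (len(delta_simulated) + 1)
```

In use, `delta` would be computed once for the real alignment and once for each simulated alignment, with the tree and model re-optimised on each pseudo-data set, and `bootstrap_p_value` would then place the observed value within the simulated distribution. A small p-value indicates that the lack of fit is unlikely to be due to chance alone.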
The future: Areas in most need of methodological research
Adherence to the new phylogenetic protocol would undoubtedly lead to improved accuracy of phylogenetic estimates and a reduction of confirmation bias. The advantage of the fifth step in the new phylogenetic protocol (i.e., assess phylogenetic assumptions) is that users are able to decide how to do the most computationally-intensive parts of the phylogenetic study without wasting valuable time on, for example, a high-performance computer centre. Model selection, phylogenetic analysis, and parametric bootstrapping are computationally-intensive and time-consuming, and there is a need for new, computationally efficient strategies that can be used to analyse sequences that have evolved under complex phylogenetic conditions.
The advantage of the tenth step in the new phylogenetic protocol (i.e., test goodness-of-fit) is its ability to answer whether an inferred phylogeny explains the data well, or not. In so doing, this step tackles the issue of confirmation bias head-on. Clearly, without information gleaned from the fifth step, the parametric bootstrap might return an unwanted answer (i.e., the inferred tree and model of sequence evolution do not fit the data well), so to avoid such disappointments it is better to embrace the new phylogenetic protocol in full.
Results emerging from studies that rely on the new phylogenetic protocol might well call into question published phylogenetic research, but there is also a chance that research might gain stronger support. This is good for everyone concerned, especially since it will become easier to defend the notion that the research was done without prejudice or preference for a particular result. Objectivity should be restored in phylogenetics—it is no longer reasonable to defend phylogenetic results on the basis that they were obtained using the best available tools; if these tools do not model the evolutionary processes accurately, then that should be reported rather than be hidden away. This is critical as it increases transparency and aids other researchers to understand the nature of the challenges encountered.
Notwithstanding the likely benefits offered by the new phylogenetic protocol and the methods supporting it, it would be unwise to assume that further development of phylogenetic methods will no longer be needed. On the contrary, there is a lot of evidence that method development will be needed in different areas:
MSA Methods — There is a dire need for MSA methods that are accurate in the sense of homology statements. Likewise, there is a great need for methods that allow users (i) to determine how accurate different MSA methods are and (ii) to select MSA methods that are most suitable for the data at hand.
Methods for Masking MSAs — Assuming an MSA has been inferred, there is a need for a set of strategies that can be used to identify and distinguish between poorly-aligned and highly-variable regions of the MSA. Well-aligned but highly-variable regions of MSAs may be more informative than poorly-aligned regions of such MSAs, so to delete them may be unwise.
Model-selection Methods — Model selection is important when parametric phylogenetic methods are used. However, the model-selection methods currently employed may not be accurate, especially for sequences that have evolved under complex conditions (e.g., heterotachous, covarion, or non-time-reversible conditions). Critically, the evolutionary process may have to be considered an evolving entity in its own right.
Phylogenetic Methods — While there is a plethora of accurate phylogenetic methods for analysis of data that have evolved under time-reversible Markovian conditions, there is a dearth of accurate phylogenetic methods suitable for analysis of data that have evolved under complex conditions. Added to this challenge is the need for methods that accurately account for incomplete lineage sorting of genetic markers and for the special conditions associated with the analysis of SNP data.
Goodness-of-fit Tests — Although suitable goodness-of-fit tests are available, there is a need not only for a wider understanding of the merits of these tests, but also for a better understanding of how they can be tailored to suit different requirements. In particular, there is a need for programs that can generate pseudo-data under extremely complex evolutionary conditions. Some programs are available5, 126, but they only cater for a limited set of conditions.
Analysis of Residuals — Although goodness-of-fit tests can tell you whether or not the lack of fit observed is potentially due to chance, they do not answer the more useful question of whether or not that lack of fit matters or how the lack of fit arises141, 142. For this reason, residual diagnostic tools that can inform the user about the way in which their model fails to fit the data would be very useful.
In summary, while calls for better phylogenetic methods and more careful considerations of the data have occurred110, we believe there is a need for a comprehensive overhaul of the current phylogenetic protocol. The proposed new phylogenetic protocol is unlikely to be the final product; rather, it is probably a first, but important, step towards a more scientifically sound phylogenetic protocol, which will lead not only to more accurate phylogenetic estimates but also to a reduction in the likelihood of confirmation bias.
Conclusions
The Holy Grail in molecular phylogenetics is clearly being able to obtain accurate, reproducible, transparent, and trustworthy phylogenetic estimates from the phylogenetic data. We are not there yet, but encouraging progress is being made not only in the design of the phylogenetic protocol but also in phylogenetic methodology based on the likelihood and Bayesian optimality criteria.
Notwithstanding this progress, a quantum shift in attitudes and habits will be needed within the phylogenetic community—it is no longer sufficient to obtain an optimal phylogenetic estimate. The fit between trees, models, and data must be evaluated before the phylogenetic estimates can be considered newsworthy. We owe it to the community and wider public to be as rigorous as we can—the attitude “She’ll be alright, mate” is no longer appropriate in this discipline.
AUTHOR CONTRIBUTIONS
L.S.J. conceived the new phylogenetic protocol. R.A.C., B.R.H., and L.S.J. wrote the paper.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
CASE STUDY
To illustrate the relevance and benefits of the fifth step in the new phylogenetic protocol, we examined the phylogenetic data used to infer the evolution of insects3. The tetrahedral plots in Figure 3a-3c reveal that the nucleotide composition at the three codon positions is heterogeneous, implying that the evolutionary processes that operated at these positions are unlikely to have been time-reversible. However, the plots are deceptive because the presence of constant sites (i.e., sites with the same nucleotide or amino acid) in the data can mask how compositionally dissimilar the sequences actually are. To learn how to resolve this issue, it is necessary to focus on the evolution of two sequences on a tree (Fig. 3d) and the corresponding divergence matrix at time 0 (Fig. 3e) and at time t (Fig. 3f). At time 0, the two sequences are beginning to diverge from one another, so the off-diagonal elements of the divergence matrix are all zero. Later, the divergence matrix may look like that in Figure 3f. All the off-diagonal elements are now greater than zero, and the so-called matching off-diagonal elements of the divergence matrix might differ (i.e., xij ≠ xji). The degree of divergence between the two sequences can be inferred by comparing the off-diagonal elements to the diagonal elements, while the degree of difference between the two evolutionary processes can be inferred by comparing the above-diagonal elements to the below-diagonal elements. If the two evolutionary processes were the same, the matching off-diagonal elements in Figure 3f would be similar. A lack of symmetry (i.e., xij ≠ xji) implies that the evolutionary processes along the two descendant lineages may be different. A matched-pairs test of symmetry143 can be used to determine whether this observed deviation from symmetry is statistically significant. Figures 3g-3i show the distributions of the observed and expected p values from these tests for the data assessed in Figures 3a-3c. 
Because the dots in these plots do not fall along the diagonal line (where they would be expected to fall if the lack of symmetry were not statistically significant), there is overwhelming evidence that the evolutionary processes at these positions cannot have been time-reversible. The same is the case for the corresponding amino acid alignment (not shown). Consequently, it would be unwise to assume that the data evolved under time-reversible conditions. A far more complex evolutionary process is likely to explain the data, so the time-reversible phylogenetic methods used by Misof et al.3 were clearly not suitable for analysis of these data. However, phylogenetic methods suited to such complex processes were not available at the time, and that is still the case!
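The divergence matrix and the symmetry test described in this case study can be sketched in Python as follows. The sketch assumes that the matched-pairs test of symmetry is Bowker's test, in which the squared differences between matching off-diagonal counts, (x_ij - x_ji)^2 / (x_ij + x_ji), are summed over all pairs with a non-zero total and compared with a chi-square distribution whose degrees of freedom equal the number of such pairs. All function names and the toy sequences are illustrative.

```python
import math
from itertools import combinations


def divergence_matrix(seq_a, seq_b, alphabet="ACGT"):
    """Count site patterns: entry [i][j] is the number of sites with i in a, j in b."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned (equal length)")
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = [[0] * len(alphabet) for _ in alphabet]
    for a, b in zip(seq_a, seq_b):
        if a in index and b in index:  # skip gaps and ambiguous characters
            matrix[index[a]][index[b]] += 1
    return matrix


def chi2_sf(x, df):
    """Upper tail of the chi-square distribution for integer df >= 1."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0 - 1e-9:  # recurrence Q(a+1, z) = Q(a, z) + z^a e^-z / Gamma(a+1)
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def symmetry_test(matrix):
    """Return (statistic, degrees of freedom, p-value) for the test of symmetry."""
    stat, df = 0.0, 0
    for i, j in combinations(range(len(matrix)), 2):
        total = matrix[i][j] + matrix[j][i]
        if total > 0:
            stat += (matrix[i][j] - matrix[j][i]) ** 2 / total
            df += 1
    p_value = chi2_sf(stat, df) if df > 0 else 1.0  # no informative pairs: no evidence
    return stat, df, p_value


# Toy pair: eight sites with A in the first sequence and C in the second.
seq_a = "A" * 8 + "ACGT" * 3
seq_b = "C" * 8 + "ACGT" * 3
stat, df, p_value = symmetry_test(divergence_matrix(seq_a, seq_b))
```

Applied to every pair of sequences in an alignment, a function like `symmetry_test` yields the set of observed p values that can be compared with their expected values, as in the plots discussed above. Constant sites fall on the diagonal of the divergence matrix and therefore do not enter the statistic, which is why this test is not masked by them in the way that compositional plots can be.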
ACKNOWLEDGEMENTS
L.S.J. thanks the University College Dublin for its generous hospitality. We thank D. Higgins, A. Locatelli, and K. H. Wolfe for their constructive feedback.
References
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
- 29.
- 30.
- 31.
- 32.
- 33.
- 34.
- 35.
- 36.
- 37.
- 38.
- 39.
- 40.
- 41.
- 42.
- 43.
- 44.
- 45.
- 46.
- 47.
- 48.
- 49.
- 50.
- 51.
- 52.
- 53.
- 54.
- 55.
- 56.
- 57.
- 58.
- 59.
- 60.
- 61.
- 62.
- 63.
- 64.
- 65.
- 66.
- 67.
- 68.
- 69.
- 70.
- 71.
- 72.
- 73.
- 74.
- 75.
- 76.
- 77.
- 78.
- 79.
- 80.
- 81.
- 82.
- 83.
- 84.
- 85.
- 86.
- 87.
- 88.
- 89.
- 90.
- 91.
- 92.
- 93.
- 94.
- 95.
- 96.
- 97.
- 98.
- 99.
- 100.
- 101.
- 102.
- 103.
- 104.
- 105.
- 106.
- 107.
- 108.
- 109.
- 110.
- 111.
- 112.
- 113.
- 114.
- 115.
- 116.
- 117.
- 118.
- 119.
- 120.
- 121.
- 122.
- 123.
- 124.
- 125.
- 126.
- 127.
- 128.
- 129.
- 130.
- 131.
- 132.
- 133.
- 134.
- 135.
- 136.
- 137.
- 138.
- 139.
- 140.
- 141.
- 142.
- 143.
- 144.