Abstract
Summarizing individual gene trees into species phylogenies using coalescent-based methods has become a standard approach in phylogenomics. However, gene tree estimation error (GTEE) arising from a combination of reasons (ranging from analytical factors to more biological causes, as in short gene sequences) can potentially impact the accuracy of phylogenomic inference. We, for the first time, introduce the problem of correcting the quartet distribution induced by a set of estimated gene trees, which involves updating the weights of the quartets to better reflect their relative importance within the gene tree distribution. We present QT-WEAVER, the first method of its kind, which learns the conflicts within the quartet distribution induced by a given set of gene trees and generates an updated quartet distribution by adjusting the weights accordingly. QT-WEAVER is a general-purpose technique needing no explicit modeling of the subject system or reasons for GTEE or gene tree heterogeneity. Experimental studies on a collection of simulated and empirical data sets suggest that QT-WEAVER can effectively account for GTEE, which results in a substantial improvement in the species tree accuracy. Additionally, the concept of quartet conflicts and related algorithmic and combinatorial innovations introduced in this study will benefit various quartet-based computations. Therefore, QT-WEAVER advances the state-of-the-art in species tree estimation from gene trees in the face of GTEE. QT-WEAVER is freely available in open-source form at https://github.com/navidh86/QT-WEAVER.
1 Introduction
Species tree estimation from genes sampled across the whole genome has become routine with the advent of high-throughput sequencing technologies, generating genome-wide datasets that include hundreds or even thousands of loci. Species tree estimation from multiple genes is commonly done through concatenation (also known as “combined analysis”), which combines sequence alignments from different loci into a single supermatrix and then computes a tree on the supermatrix. While concatenation can produce accurate species trees when gene trees are concordant, it can be statistically inconsistent [1, 2], and produce incorrect trees with high support [3] when gene trees differ from the species tree due to various biological processes such as incomplete lineage sorting (ILS), gene duplication and loss (GDL), horizontal gene transfer (HGT), etc. As a result, summary methods [4–14] that combine estimated gene trees while explicitly accounting for gene tree discordance have drawn substantial interest among systematists. However, summary methods are susceptible to gene tree estimation errors (GTEE), which can arise from various reasons including short gene sequences, inaccurate alignments, and the limitations of the models and algorithms used for inferring gene trees from sequence data.
Due to the growing awareness that GTEE is a major contributor to inaccuracies of summary methods, there has been great interest in developing tools [15–24] to account for GTEE for improved species tree estimations. Most of these methods are species tree aware as they use a reference species tree in addition to the input gene trees. These species tree aware methods essentially reconcile/modify gene trees to make them closer in distance to the species tree by minimizing a species tree aware cost function. However, gene trees could be discordant and are not always expected to closely match the species tree. Additionally, “integrative” methods such as ProfileNJ and TreeFix use available sequence data as well as the estimated gene trees and reference species trees. Obtaining a reasonably accurate reference species tree despite substantial amounts of GTEE is difficult, and the inaccuracy in the reference tree may have cascading effects on gene tree corrections. More importantly, species tree aware methods, specifically TreeFix and TRACTION, have been criticized for their potential to increase GTEE where gene trees are discordant with species trees [25]. Bayesian techniques for co-estimating both gene trees and species trees such as BEST [26], *BEAST [27], and PHYLDOG [28] can produce substantially more accurate gene and species trees than other methods, but are not scalable to genome-level analyses [29–33]. Therefore, despite significant attempts to account for GTEE, substantial challenges remain.
In this study, we address the problem of GTEE in the context of species tree estimation by formulating the Quartet Distribution Correction (QDC) problem for the first time, where we seek to “correct/adjust” the distribution of quartets induced by a given set of estimated gene trees. QDC attempts to account for GTEE without resorting to any reference species tree. Quartet-based summary methods have drawn considerable attention because quartets (unrooted trees on four taxa) avoid the “anomaly zone” [34–36], where the most likely gene tree topology may differ from the true species tree topology, and can therefore produce highly accurate species trees. ASTRAL, the most widely used summary method, infers a species tree by maximizing the number of quartets in the gene trees that are consistent with the species tree. Another class of methods (known as quartet amalgamation techniques [12,37–40]), such as wQFM [12] and wQMC [40], involves inferring weighted quartets for every group of four taxa (with weights corresponding to the relative importance of the quartets) and then combining them into a single species tree. In this study, we present QT-WEAVER (Quartet Weight Adjustment by Verifying, Estimating, and Rectifying quartet weights), a novel method that learns the weighted quartet distribution induced by a set of gene trees to identify and verify certain patterns of quartet conflicts and subsequently update the weights accordingly. We introduce the novel concept of “quartet conflicts” and prove key analytical results that underpin QT-WEAVER’s ability to learn and adjust quartet distributions without relying on any reference species tree or sequence data, offering a robust approach to improving the accuracy of species tree inference.
The quartet distributions corrected by QT-WEAVER often align more closely with the true gene tree quartets than the uncorrected distributions induced by the estimated gene trees do, thereby effectively accounting for GTEE. Our experimental results, based on a diverse set of simulated and real biological datasets covering a wide range of challenging model conditions, demonstrate that amalgamating QT-WEAVER’s corrected quartets significantly improves species tree inference accuracy. Moreover, QT-WEAVER is a general-purpose approach that does not require explicit modeling of the reasons for gene tree heterogeneity or GTEE, making it more resilient to model misspecification. QT-WEAVER is the first method of its kind and advances the state-of-the-art in species tree estimation in the presence of GTEE.
2 Quartet Distribution Correction Problem
2.1 Problem Definition
Let 𝒢 = {g1, g2, …, gk} be a set of k gene trees, where each gi is a tree on taxon set Si ⊆ S (i.e., a gene tree gi can be on the full set S of n taxa or on a subset Si of the taxa, in which case the gene tree is incomplete). For a set of four taxa a, b, c, d ∈ S, the quartet tree ab|cd denotes the unrooted quartet tree in which the pair a, b is separated from the pair c, d by an edge. Note that there are three alternative quartet topologies (ab|cd, ac|bd, ad|bc) for four taxa. Let 𝒬i be the set of quartets in gi. Therefore, 𝒬 = 𝒬1 ∪ 𝒬2 ∪ … ∪ 𝒬k is the multi-set of quartets present in 𝒢. Note that there are $\binom{|S_i|}{4}$ quartets in 𝒬i (for a fully resolved gi), and at most $3\binom{n}{4}$ quartets in 𝒬 are unique, as there are three alternative quartet topologies for a set of four taxa. The gene tree frequency (GTF) based weighted quartet distribution 𝒬𝒟 of 𝒬 contains all possible quartets on n taxa along with their frequencies in the gene trees. That is, it is defined as the set of tuples 𝒬𝒟 = {(qi, wi)}, where qi ranges over all possible quartets on S and wi is the frequency of qi in 𝒬 (i.e., the number of gene trees in 𝒢 that induce qi).
Let 𝒢𝒯 and 𝒢ε be the set of true and estimated gene trees, respectively. Consequently, the definitions of 𝒬𝒯, 𝒬ε, 𝒬𝒟𝒯, 𝒬𝒟ε extend in an obvious way. We now define the quartet distribution correction (QDC) problem as follows.
Problem Quartet Distribution Correction (QDC)
Input A weighted quartet distribution 𝒬𝒟ε = {(qi, wi)} induced by a set 𝒢ε of estimated gene trees.
Output A weighted quartet distribution 𝒬𝒟′ = {(qi, w′i)} with updated weights w′i for the quartets qi, such that the divergence/difference between 𝒬𝒟′ and the true distribution 𝒬𝒟𝒯 is minimized.
While we have defined the GTF-based weighted quartet distribution here, this concept extends to other weighting schemes as well.
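The GTF-based distribution defined in Section 2.1 can be illustrated with a minimal Python sketch. It is not the QT-WEAVER implementation: the nested-tuple tree encoding and the helper names (`quartet`, `leaves`, `clades`, `induced_quartets`, `gtf_distribution`) are assumptions made for this example, and quartets absent from every gene tree are simply reported with frequency 0.

```python
from collections import Counter
from itertools import combinations

def quartet(a, b, c, d):
    """Return the quartet ab|cd as an unordered pair of unordered pairs."""
    return frozenset([frozenset([a, b]), frozenset([c, d])])

def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def clades(tree):
    """Yield the leaf set below every internal node (one side of each split)."""
    if isinstance(tree, str):
        return
    yield frozenset(leaves(tree))
    for child in tree:
        yield from clades(child)

def induced_quartets(tree):
    """Return the set of quartets induced by a tree."""
    splits = list(clades(tree))
    quartets = set()
    for four in combinations(sorted(leaves(tree)), 4):
        for side in splits:
            inside = side & set(four)
            if len(inside) == 2:  # this edge separates two of the taxa from the other two
                quartets.add(frozenset([frozenset(inside), frozenset(set(four) - inside)]))
                break
    return quartets

def gtf_distribution(gene_trees):
    """Map each induced quartet to the number of gene trees that contain it."""
    weights = Counter()
    for tree in gene_trees:
        weights.update(induced_quartets(tree))
    return weights

# Hypothetical toy input: three gene trees on the taxa a-e.
gene_trees = [
    (("a", "b"), "c", ("d", "e")),
    (("a", "b"), "c", ("d", "e")),
    (("a", "c"), "b", ("d", "e")),
]
dist = gtf_distribution(gene_trees)
print(dist[quartet("a", "b", "c", "d")])  # ab|cd appears in two gene trees -> 2
print(dist[quartet("a", "c", "b", "d")])  # ac|bd appears in one gene tree  -> 1
```

Because a `Counter` returns 0 for missing keys, the third topology ad|bc on {a, b, c, d} is reported with weight 0, which matches the convention that the distribution covers all three topologies of every four-taxon set.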
2.2 Proposed Methodology: Identifying Quartet Conflicts
Our proposed approach is based on detecting patterns of “conflicts” among quartets, where a set of quartets is considered conflicting if they cannot be simultaneously satisfied by a single tree. Therefore, we try to learn the inherent patterns of conflicts within the quartets induced by the estimated gene trees and update the weights accordingly. We call the three alternative topologies (ab|cd, ac|bd, and ad|bc) for a set {a, b, c, d} of four taxa the topological variants of each other, and clearly, they are conflicting. We now formalize the concept of conflicts between quartets in Theorem 1, which generalizes how the number of common taxa between two quartets affects the number of trees that can simultaneously satisfy both quartets.
The leaves in a tree T are denoted by L(T). Every edge e in an unrooted leaf-labeled tree T defines a bipartition (or split) π(e) on the leaves L(T) (induced by the deletion of e).
Theorem 1. Let q1 and q2 be two quartets drawn from a set of taxa. The number of binary trees on L(q1) ∪ L(q2) that simultaneously satisfy both q1 and q2 is determined by the number of taxa shared between them. Specifically, let ct denote the number of taxa common to both quartets. Then the possible values of ct are 0, 1, 2, 3, or 4, and the number of binary trees that satisfy both q1 and q2 is as follows:
Case 1: ct = 0, 1, 2 When the number of common taxa between q1 and q2 is less than 3, i.e., ct = 0, 1, 2, there will always be multiple binary trees on L(q1) ∪ L(q2) that satisfy both quartets.
Case 2: ct = 3 When q1 and q2 share exactly three common taxa (ct = 3), meaning each quartet has one unique taxon relative to the other, two sub-cases arise:
– If the unique taxon in q1 replaces the unique taxon in q2 (meaning that when the unique taxon in quartet q1 is replaced by the unique taxon in the other quartet q2, the resulting topology matches that of q2), there are exactly three distinct binary trees on L(q1) ∪ L(q2) that satisfy both quartets q1 and q2.
– In all other configurations where the quartets share three common taxa, there is exactly one binary tree on L(q1) ∪ L(q2) that satisfies both quartets.
Case 3: ct = 4 When q1 and q2 share all four taxa (ct = 4) but differ in topology, no binary tree can satisfy both quartets.
Proof. We examine each case based on the number of common taxa between q1 and q2. Due to space constraints, we present the complete proof of this theorem in Appendix A. However, we present the proof for Case 2 (ct = 3) here as our proposed algorithm for QT-WEAVER is based on this particular case.
Case 2: ct = 3 (three common taxa) When q1 and q2 share exactly three taxa, the number of trees that can satisfy both quartets depends on the relationship between the unique taxon in each quartet. Two distinct sub-cases arise:
– Sub-case 1 (replacement of unique taxa): If the unique taxon in q1 simply replaces the unique taxon in q2 (preserving the relationship between the shared taxa) and vice versa, then two of the common taxa will be sister species in both q1 and q2. In this case, exactly three distinct binary trees on the set L(q1) ∪ L(q2) of five leaves satisfy both quartets. These three trees correspond to the possible ways to arrange the unique taxa from q1 and q2 with respect to the shared taxa. For example, q1 = ab|cd and q2 = ab|ce satisfy this condition, where d and e are the unique taxa in q1 and q2 respectively, and replacing d with e in q1 (or e with d in q2) makes q1 and q2 identical. Moreover, a and b are closer to each other than they are to other species in both q1 and q2. In this case, we can place e on three branches in q1 (as shown in Figure 1a), resulting in three different trees on five taxa {a, b, c, d, e} that satisfy both q1 and q2. Similarly, we can place d on three branches in q2 (as shown in Figure 1b), producing three different trees that satisfy both q1 and q2. Note that the set of three trees on five taxa resulting from different placements of e in Figure 1a is identical to the set of three trees resulting from different placements of d in Figure 1b. We now consider the two remaining branches (those incident on a and b) in ab|cd. First, if taxon e is placed as the sister to a, one of the resulting induced quartets is ae|bc, which does not satisfy ab|ce as e is now the sister to a. Similarly, if e is placed as the sister to b, the quartet ac|be is formed, which also is not consistent with ab|ce. Thus, there are exactly three trees on {a, b, c, d, e} that satisfy both q1 and q2.
– Sub-case 2 (other configurations): In all other scenarios where the unique taxon in q1 does not simply replace the unique taxon in q2 (and vice versa), there is exactly one binary tree on the set L(q1) ∪ L(q2) of five taxa that satisfies both quartets. This is because the three common taxa, along with their relationships to the unique taxa, fully constrain the tree structure. Here, unlike sub-case 1, no pair of taxa among the three common taxa are sisters in both quartets. Consider the two quartets q1 = ab|cd and q2 = ae|bd, which satisfy this condition. In order to find a tree that satisfies both q1 and q2, we can insert e into q1 by making it a sister of a, as shown in Figure 2a. Similarly, we can insert c into q2 by making it the sister of d, as shown in Figure 2b. Note that these two trees are identical (shown in Figure 3a). Furthermore, the placements of e on the four other edges in q1 (the edges incident on b, c, d, and the internal edge) result in trees that do not support q2 (ae|bd), as b becomes closer to a than e is to a. Similarly, the placements of c on the four other edges in q2 do not produce any trees that satisfy q1 (ab|cd), as b becomes closer than c to d. Thus, there is exactly one tree (as shown in Figure 3a) that satisfies both q1 and q2.
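The counts used in the proof above can be checked by brute force. The following Python sketch is an illustration rather than part of QT-WEAVER: it assumes that every unrooted binary tree on five taxa has two cherries and one middle taxon, which gives 15 trees, and it represents each tree by the five quartets it induces. The helper names (`quartet`, `five_taxon_trees`, `count_satisfying_trees`) are hypothetical.

```python
def quartet(a, b, c, d):
    """Return the quartet ab|cd as an unordered pair of unordered pairs."""
    return frozenset([frozenset([a, b]), frozenset([c, d])])

def five_taxon_trees(taxa):
    """Return all 15 unrooted binary trees on five taxa, each as its set of five quartets.

    A tree is described by a middle taxon m and two cherries (x, y) and (z, w).
    """
    trees = []
    for m in taxa:
        x, *rest = [t for t in taxa if t != m]
        for y in rest:                      # choose the partner of x; the other two form a cherry
            z, w = [t for t in rest if t != y]
            trees.append(frozenset([
                quartet(x, y, z, w),
                quartet(x, y, m, z), quartet(x, y, m, w),
                quartet(z, w, m, x), quartet(z, w, m, y),
            ]))
    return trees

def count_satisfying_trees(q1, q2, taxa):
    """Count the five-taxon trees on `taxa` that induce both q1 and q2."""
    return sum(1 for tree in five_taxon_trees(taxa) if q1 in tree and q2 in tree)

taxa = list("abcde")
ab_cd = quartet("a", "b", "c", "d")
print(count_satisfying_trees(ab_cd, quartet("a", "b", "c", "e"), taxa))  # Sub-case 1 -> 3
print(count_satisfying_trees(ab_cd, quartet("a", "e", "b", "d"), taxa))  # Sub-case 2 -> 1
print(count_satisfying_trees(ab_cd, quartet("a", "c", "b", "d"), taxa))  # two variants -> 0
```

The last line checks two topological variants within the five-taxon trees; no tree contains both, which is consistent with the statement for quartets that share all four taxa.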
We define a canonical quartet pair as a pair of quartets (q1, q2) that can be satisfied by exactly one tree T, where L(T) = L(q1) ∪ L(q2), as in Sub-case 2 of Case 2 in Theorem 1. Thus, a canonical quartet pair (q1, q2) uniquely represents a tree on L(q1) ∪ L(q2) that satisfies both q1 and q2. For example, q1 = ab|cd and q2 = ae|bd is one such canonical quartet pair, which is satisfied by exactly one tree T on {a, b, c, d, e} as shown in Figure 3a. There are five quartets (ae|bd, ae|bc, ae|cd, ab|cd, and be|cd) in T, including the canonical pair corresponding to this tree (ab|cd and ae|bd). The canonical pair of quartets will clearly be in conflict with the topological variants of the three other quartets, ae|bc, ae|cd, and be|cd. Note that each quartet has two other topologically distinct variants. Thus, we have the following Corollary 1.1.
Corollary 1.1. Every canonical pair of quartets (q1, q2) is in conflict with six quartets on L(q1) ∪ L(q2).
We now have Corollaries 1.2–1.5. The proofs and related discussions are presented in Appendix A.
Corollary 1.2. Every binary unrooted tree T with five taxa contains a total of four canonical pairs of quartets (q1, q2), (q2, q3), (q3, q4) and (q4, q1), with each of the four quartets q1, q2, q3, q4 being present in two canonical pairs.
Corollary 1.3. Every quartet q1 is part of eight canonical pairs of quartets whose other members have a particular unique taxon with respect to q1.
Corollary 1.4. Every quartet q is part of 28 unique conflicting sets as a member of canonical pairs for each unique taxon x ∉ L(q).
Corollary 1.5. For a set S of n taxa, every quartet q, where L(q) ⊆ S, is in 28 × (n − 4) unique conflicting sets as a member of canonical pairs.
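The counts stated in these corollaries (six conflicting quartets per canonical pair, four canonical pairs per five-taxon tree, eight canonical pairs and 28 conflicting sets per quartet and extra taxon, and 28 × (n − 4) sets overall) can be reproduced by exhaustive enumeration. The sketch below is only a sanity check under the assumption that a canonical pair is a pair of quartets that co-occurs in exactly one five-taxon tree; all function names are hypothetical and the enumeration is not meant to be efficient.

```python
from itertools import combinations

def quartet(a, b, c, d):
    """Return the quartet ab|cd as an unordered pair of unordered pairs."""
    return frozenset([frozenset([a, b]), frozenset([c, d])])

def leaf_set(q):
    """Return the four taxa of a quartet."""
    return frozenset().union(*q)

def variants(q):
    """Return the two other topologies on the taxa of q."""
    a, b, c, d = sorted(leaf_set(q))
    return {quartet(a, b, c, d), quartet(a, c, b, d), quartet(a, d, b, c)} - {q}

def five_taxon_trees(taxa):
    """All 15 unrooted binary trees on five taxa, each as its set of five quartets."""
    trees = []
    for m in taxa:                          # m is the middle taxon
        x, *rest = [t for t in taxa if t != m]
        for y in rest:                      # (x, y) and (z, w) are the two cherries
            z, w = [t for t in rest if t != y]
            trees.append(frozenset([
                quartet(x, y, z, w),
                quartet(x, y, m, z), quartet(x, y, m, w),
                quartet(z, w, m, x), quartet(z, w, m, y),
            ]))
    return trees

def canonical_pairs(trees):
    """Map each pair of quartets that co-occurs in exactly one tree to that tree."""
    hosts = {}
    for tree in trees:
        for pair in combinations(tree, 2):
            hosts.setdefault(frozenset(pair), []).append(tree)
    return {pair: found[0] for pair, found in hosts.items() if len(found) == 1}

def conflicting_sets(q, pairs):
    """Unique sets {q, qj, qk} where (q, qj) is a canonical pair in conflict with qk."""
    sets = set()
    for pair, tree in pairs.items():
        if q in pair:
            for other in tree - pair:       # the three remaining quartets of the tree
                for v in variants(other):   # each has two conflicting variants
                    sets.add(pair | {v})
    return sets

def conflict_pairs(q, taxa):
    """The pairs obtained by removing q from every conflicting set, over all extra taxa."""
    z = set()
    for x in taxa:
        if x not in leaf_set(q):
            trees = five_taxon_trees(sorted(leaf_set(q) | {x}))
            z |= {s - {q} for s in conflicting_sets(q, canonical_pairs(trees))}
    return z

trees = five_taxon_trees(list("abcde"))
pairs = canonical_pairs(trees)
q = quartet("a", "b", "c", "d")
print(sum(1 for p in pairs if q in p))          # canonical pairs containing q -> 8
print(len(conflicting_sets(q, pairs)))          # unique conflicting sets for taxon e -> 28
print(len(conflict_pairs(q, list("abcdefg"))))  # n = 7 taxa -> 28 * 3 = 84
```

The eight canonical pairs give 48 (pair, conflicting quartet) combinations, but 20 of the resulting sets arise twice because both of their non-q members are canonical partners of q, which leaves 28 unique sets.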
Let {qi, qj, qk} be a conflicting set, where qi and qj form the canonical pair in conflict with qk. Thus, while qi, qj, and qk cannot all coexist in the same tree, any two of them can. Consequently, the presence of qi in a species tree becomes less likely if the weights of both qj and qk increase (since higher weights for qj and qk indicate the possibility that both of them are present in the species tree). We now define the conflict score of a quartet as the sum of the products of the weights of pairs of quartets that are in conflict with it. Formally, for a given quartet qi and a set Zi containing the pairs of quartets that are in conflict with qi, the conflict score C(qi) is calculated as follows:
$$C(q_i) = \sum_{(q_j, q_k) \in Z_i} w_j \cdot w_k \qquad (1)$$
There are 28·(n−4) unique conflicting sets in each of which qi is present (Corollary 1.5). By removing qi from all these sets, we obtain a set of pairs Zi, where |Zi| = 28 · (n − 4).
Note that in addition to the product of weights (PoW) based definition of the conflict score (Equation 1), we can define the conflict score using a minimum of weights (MoW) based approach as in Equation 2.
$$C(q_i) = \sum_{(q_j, q_k) \in Z_i} \min(w_j, w_k) \qquad (2)$$
The rationale behind this MoW-based definition is that even if one of wj or wk is very large, it does not necessarily challenge the presence of qi as long as the other is quite low. Therefore, considering the minimum of the two values (wj and wk) as a contributor to the conflict score of qi is a natural choice.
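The two scoring schemes can be compared on a toy example. This is a minimal sketch, assuming that the conflicting pairs are given as a list of (qj, qk) identifiers and that a quartet missing from the weight table has weight 0; the function name `conflict_score` and the placeholder quartet names are illustrative only.

```python
def conflict_score(conflict_pairs, weights, scheme="MoW"):
    """Sum a per-pair contribution over all quartet pairs in conflict with a quartet.

    conflict_pairs: iterable of (qj, qk) identifiers
    weights: dict mapping quartet identifiers to weights (missing entries count as 0)
    scheme: "PoW" multiplies the two weights, "MoW" takes their minimum
    """
    if scheme == "PoW":
        combine = lambda wj, wk: wj * wk
    elif scheme == "MoW":
        combine = min
    else:
        raise ValueError("scheme must be 'PoW' or 'MoW'")
    return sum(combine(weights.get(qj, 0), weights.get(qk, 0)) for qj, qk in conflict_pairs)

# Hypothetical weights for four quartets that form two conflicting pairs.
weights = {"q2": 10, "q3": 1, "q4": 4, "q5": 5}
z = [("q2", "q3"), ("q4", "q5")]
print(conflict_score(z, weights, "PoW"))  # 10*1 + 4*5 = 30
print(conflict_score(z, weights, "MoW"))  # min(10, 1) + min(4, 5) = 5
```

In the first pair one weight is large and the other is small. PoW still lets the large weight scale that pair's contribution (10), whereas MoW caps it at the smaller weight (1), which is the behaviour the MoW rationale describes.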
2.3 Algorithm: QT-WEAVER
Let 𝒬𝒟ε = {(qi, wi)} be the set of weighted quartets of a set 𝒢ε of estimated gene trees, S be the set of all taxa present in 𝒢ε, and 𝒬 be the set of all quartets in 𝒬𝒟ε. The algorithm takes the set 𝒬𝒟ε as input and, for each quartet qi ∈ 𝒬𝒟ε, adjusts the weight wi to an updated weight w′i using the concept of quartet conflict introduced in Section 2.2. We maintain a set 𝒜 ⊆ 𝒬, which represents the set of quartets whose weights have already been adjusted. Initially, 𝒜 = ∅, indicating that no quartet weights have been adjusted at the start of the algorithm.
We iterate over the quartets q ∈ 𝒬. If a quartet q has already been adjusted, i.e., q ∈ 𝒜, we skip it. Otherwise, let 𝒯 = {q1, q2, q3} be the set of the three possible quartets on L(q). We then adjust the weights of the quartets in 𝒯 and insert them into 𝒜, i.e., 𝒜 ← 𝒜 ∪ 𝒯. To adjust the weight of each quartet qi in 𝒯 (i ∈ {1, 2, 3}), we construct Zi, the set of all quartet pairs conflicting with qi.
We do this by iterating over the set of taxa not present in qi, i.e., S \ L(qi). For each t ∈ S \ L(qi), we obtain a set of 28 quartet pairs in conflict with the quartet qi (Corollary 1.4). We add all such pairs to Zi. Thus, Zi contains 28 × |S \ L(qi)| = 28 × (n − 4) pairs of quartets. Next, we compute the conflict score for each of the quartets in 𝒯 using the PoW- or MoW-based formula in Equation (1) or Equation (2) (this is a user-defined configuration in our algorithm).
Note that the conflict score of a quartet qi indicates the level of inconsistency of qi with respect to other quartets in the distribution. Therefore, a quartet with a higher conflict score should have a lower readjusted weight (i.e., we want to decrease its relative importance as this quartet has a higher degree of conflict with others). Therefore, we obtain an unnormalized adjusted weight w̃i by dividing wi by its conflict score C(qi):
$$\tilde{w}_i = \frac{w_i}{C(q_i)}$$
This operation on the three quartets in 𝒯 denormalizes their weights, whose sum must be restored to the original sum s = w1 + w2 + w3. To normalize these weights, we multiply each w̃i by s / (w̃1 + w̃2 + w̃3) so that their sum equals the original sum s. The final normalized weight w′i is given by:
$$w'_i = \tilde{w}_i \cdot \frac{s}{\tilde{w}_1 + \tilde{w}_2 + \tilde{w}_3}$$
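The adjustment and normalization steps can be sketched as follows. This is not the authors' pseudo-code (which is in Appendix B): the function names are hypothetical, the conflict scores are supplied by a caller-provided function, grouping quartets by their taxa plays the role of the bookkeeping set of already-adjusted quartets, and the handling of a zero conflict score (weights left unchanged) is an assumption of this sketch because the text does not specify it.

```python
from collections import defaultdict

def quartet(a, b, c, d):
    """Return the quartet ab|cd as an unordered pair of unordered pairs."""
    return frozenset([frozenset([a, b]), frozenset([c, d])])

def leaf_set(q):
    """Return the four taxa of a quartet."""
    return frozenset().union(*q)

def adjust_triplet(weights, scores):
    """Divide each weight by its conflict score, then rescale to the original sum."""
    if any(c == 0 for c in scores):
        return list(weights)      # assumption: skip the adjustment when a score is zero
    raw = [w / c for w, c in zip(weights, scores)]
    total = sum(raw)
    if total == 0:
        return list(weights)      # all weights are zero, nothing to rescale
    s = sum(weights)              # original sum that the adjusted weights must keep
    return [r * s / total for r in raw]

def correct_distribution(weights, conflict_score):
    """Apply one correction pass to every group of topologies on the same four taxa.

    weights: dict mapping quartets to weights
    conflict_score: callable returning the conflict score of a quartet
    """
    by_taxa = defaultdict(list)
    for q in weights:
        by_taxa[leaf_set(q)].append(q)
    corrected = {}
    for group in by_taxa.values():
        scores = [conflict_score(q) for q in group]
        adjusted = adjust_triplet([weights[q] for q in group], scores)
        corrected.update(zip(group, adjusted))
    return corrected

# Hypothetical weights and conflict scores for the three topologies on {a, b, c, d}.
w = {quartet("a", "b", "c", "d"): 6, quartet("a", "c", "b", "d"): 3, quartet("a", "d", "b", "c"): 1}
c = {quartet("a", "b", "c", "d"): 2, quartet("a", "c", "b", "d"): 1, quartet("a", "d", "b", "c"): 1}
out = correct_distribution(w, c.get)
print(out[quartet("a", "b", "c", "d")])  # 6/2 = 3, rescaled by 10/7
print(sum(out.values()))                 # the original sum of 10 is preserved (up to rounding)
```

In this example the dominant quartet ab|cd has the highest conflict score, so its weight falls from 6 to about 4.29 while ac|bd rises from 3 to about 4.29, and the three weights still sum to 10.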
The pseudo-code for the QT-WEAVER algorithm is presented in Appendix B.
3 Experimental Study
3.1 Datasets
We evaluated the performance of QT-WEAVER using a collection of previously studied simulated and biological datasets. We studied two simulated datasets: a 37-taxon mammalian dataset based on biological data [41] and a 15-taxon dataset, both generated in prior studies [42, 43]. These datasets were created through a multi-stage simulation process, starting with a species tree, followed by the simulation of gene trees under the multi-species coalescent model (which can result in gene trees topologically distinct from the species tree), and finally, the simulation of gene sequence alignments down the gene trees. The datasets exhibit varying degrees of incomplete lineage sorting (ILS), ranging from moderate to high levels, and differ in the number of genes and the extent of gene tree estimation errors (controlled by sequence lengths). Thus, these simulated datasets provide a wide array of conditions under which we assessed the performance of QT-WEAVER. Table A3 in Appendix D presents a summary of these datasets. We also evaluated QT-WEAVER on a challenging biological avian dataset from Jarvis et al. [44] comprising 14,446 genes sampled from 48 birds.
3.2 Species tree estimation methods
We used wQFM [12], the best existing weighted quartet amalgamation technique, to estimate species trees from weighted quartets. We also used wQMC, which is another well-known weighted quartet amalgamation technique. We ran GTF-based wQFM and wQMC on uncorrected embedded quartets in the input gene trees with weights reflecting the frequencies of the quartets. wQFM (and wQMC) was also run on the adjusted/corrected weighted quartets, generated by QT-WEAVER, to demonstrate the impact of quartet weight correction on species tree estimation. We refer to this variant as wQFM-corrected or wQFM+QT-WEAVER interchangeably. We compared wQFM with ASTRAL-III [5, 45] (v. 5.7.8), which is the leading quartet-based species tree method. Note that ASTRAL cannot take a set of weighted quartets as input, and as such, we cannot evaluate ASTRAL on corrected quartet distributions.
3.3 QT-WEAVER configurations
As discussed in Sections 2.2 and 2.3, we can define the conflict score of a quartet q using PoW- or MoW-based approaches (Equations 1 and 2). Additionally, beyond using all 28 conflicting sets that a quartet q belongs to for each taxon e ∉ L(q), we explore subsets of these conflicting sets, analyzing subsets of sizes four, six, and eight. Thus, we have explored eight configurations (two weighting schemes combined with four conflicting set subsets). Our empirical exploration using the 15-taxon datasets (results presented in Appendix C) suggests that the MoW-based approach is preferable to the PoW-based scheme. Moreover, utilizing six conflicting sets is both computationally faster and yields better performance across all conditions. Therefore, for the remaining experiments, we run QT-WEAVER with the configuration that uses 6 (out of 28) conflicting sets and the MoW-based approach to update quartet weights.
3.4 Measurements
We evaluated the accuracy of the estimated trees on simulated datasets by comparing them to the model species tree using the normalized Robinson-Foulds (RF) distance [46]. Additionally, we assessed quartet scores, reflecting the number of quartets from the gene trees that are consistent with the estimated species tree. For the biological dataset, we compared the inferred species trees with those reported in the scientific literature. Multiple replicates under different model conditions were analyzed, and statistical significance between methods was determined using a two-sided Wilcoxon signed-rank test (with α = 0.05). To assess the accuracy of quartet distributions, we compared both corrected and uncorrected distributions to the true quartet distributions derived from the true gene trees using Jensen-Shannon divergence [47].
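The two main measurements can be sketched in a few lines of Python. The code below is illustrative only and rests on stated assumptions: trees are nested tuples with at least four leaves, the RF distance is normalized by 2(n − 3) (the maximum for two binary trees on n taxa), and the Jensen-Shannon divergence uses base-2 logarithms on two weight vectors listed over the same quartets in the same order. How the study assembles those vectors from the quartet distributions is not specified here, so that step is left to the caller.

```python
import math

def leaves(tree):
    """Return the leaf labels of a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def bipartitions(tree):
    """Return the non-trivial bipartitions of an unrooted tree."""
    all_leaves = leaves(tree)
    found = set()
    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(frozenset([side, all_leaves - side]))
        for child in node:
            visit(child)
    visit(tree)
    return found

def normalized_rf(tree1, tree2):
    """Size of the symmetric difference of the bipartition sets, divided by 2(n - 3)."""
    n = len(leaves(tree1))
    return len(bipartitions(tree1) ^ bipartitions(tree2)) / (2 * (n - 3))

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two aligned non-negative weight vectors."""
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

t1 = (("a", "b"), "c", ("d", "e"))
t2 = (("a", "c"), "b", ("d", "e"))
print(normalized_rf(t1, t2))          # one of two internal splits differs in each tree -> 0.5
print(js_divergence([1, 0], [0, 1]))  # disjoint distributions -> 1.0
```

With base-2 logarithms the divergence lies between 0 and 1, so it can be reported as a percentage, as is done for the divergences discussed in Section 4.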
4 Results and Discussion
4.1 Results on 37-taxon dataset
The average RF rates of wQFM using both uncorrected and corrected (by QT-WEAVER) quartet distributions and ASTRAL on various model conditions in the 37-taxon dataset are shown in Figure 4(a). We vary the gene tree estimation error (by varying the sequence length from 50 to 1000 bp), the amount of ILS (by multiplying or dividing all internal branch lengths in the model species tree by two, producing three model conditions referred to as 1X (moderate ILS), 0.5X (high ILS) and 2X (low ILS)), and the number of genes (from 100 to 500). In general, wQFM+QT-WEAVER is more accurate than wQFM across all model conditions, clearly demonstrating the benefit and positive impact of correcting quartet distributions for GTEE by QT-WEAVER. wQFM is, in general, more accurate than ASTRAL (which was also reported by prior studies [12, 48]). However, the improvements of wQFM over ASTRAL are often not statistically significant. Remarkably, when wQFM is run on weighted quartet distributions corrected by QT-WEAVER, wQFM+QT-WEAVER becomes notably better than ASTRAL and in most cases (six out of nine), the improvements are statistically significant (p ≪ 0.05). Moreover, the accuracy achieved by ASTRAL and wQFM on true gene trees (i.e., without any estimation error) was matched by wQFM+QT-WEAVER even when using estimated gene trees derived from sequences as short as 500 bp. In fact, wQFM+QT-WEAVER with 1000 bp sequences is slightly more accurate than ASTRAL on true gene trees. These results clearly show the power and efficacy of QT-WEAVER in accounting for gene tree estimation error. Even though there is no GTEE in true gene trees, the limited number of genes can result in a weighted quartet distribution that may fail to represent the true species trees. Indeed, wQFM+QT-WEAVER outperformed both ASTRAL and wQFM on true gene trees, indicating that adjusting the weighted quartet distribution of true gene trees can still lead to more accurate species trees.
We also evaluated the performance of wQMC, another well-known weighted quartet amalgamation technique, on quartets corrected by QT-WEAVER. Although wQMC is generally less accurate than wQFM [12, 48], we assessed wQMC+QT-WEAVER to demonstrate the usability of the corrected quartets generated by QT-WEAVER across different amalgamation methods. As shown in Figure A5 in Appendix E.1, wQFM is better than wQMC (both for corrected and uncorrected data) and wQMC+QT-WEAVER consistently outperforms wQMC – further demonstrating the effectiveness of QT-WEAVER in adjusting quartet distributions and the superiority of wQFM over wQMC.
ASTRAL has evolved and improved over successive versions in both accuracy and scalability, offering the theoretical guarantee of identifying the species tree that maximizes quartet support within the search space defined by the bipartitions of the input gene trees, thereby leaving limited (or no) scope for further enhancements in quartet score optimization. Thus, addressing GTEE becomes a critical avenue for pushing the accuracy frontier of quartet-based methods. In this context, the substantial improvements achieved by wQFM+QT-WEAVER over ASTRAL are particularly noteworthy.
Next, we performed a series of experiments to further investigate the impact and quality of the quartet distributions produced by QT-WEAVER. First, to assess the similarity between the estimated quartet distributions (before and after correction) and the true quartet distributions (inferred from the true gene trees), we compare the Jensen-Shannon divergence between estimated and true quartet distributions in Figure 4(d)-(f) and Table A4 in Appendix E.1. The corrected quartet distributions almost always have lower divergence than the uncorrected distributions, except for model conditions with very low divergence to begin with (less than 4 percent; 0.5x-200gt-500bp and 1X-200gt-1000bp conditions). The model conditions with relatively long sequence length (1000bp) have low amounts of GTEE and have less divergence to begin with (less than 4 percent), and correcting it does not seem to further reduce the divergence (rather making the distribution slightly more diverged). However, the corrected distributions overall result in better species trees as reflected by the RF rates (Figure 4(c)). Overall, these results suggest that QT-WEAVER effectively bridges the gap with the overall nature of the true distribution.
Next, we computed the quartet scores of different estimated species trees and the true species tree with respect to both estimated and true gene trees (see Tables A5 and A6). When the quartet scores are computed based on original (uncorrected) estimated quartet distribution (Table A5), even though ASTRAL-estimated trees are less accurate than wQFM and wQFM+QT-WEAVER, ASTRAL achieves higher quartet scores than wQFM since ASTRAL is guaranteed to maximize the quartet score within a constrained search space. However, it “overshoots” the quartet score as it returns trees with higher quartet scores than the quartet score of the true tree. This is mostly because the statistical consistency of quartet score maximization criterion may not hold in the presence of gene tree estimation errors and limited number of genes. The true tree having the lower quartet scores across all model conditions supports this claim (see also [49–51] for more related discussions). The quartet scores of wQFM, especially wQFM+QT-WEAVER, with respect to the uncorrected estimated gene trees, are closer to the true quartet score than the scores of ASTRAL-estimated trees are to the true score. These further explain the superior performance of wQFM+QT-WEAVER over wQFM and ASTRAL. We also report the quartet score of wQFM+QT-WEAVER with respect to the quartet distribution corrected by QT-WEAVER. Notably, wQFM+QT-WEAVER has the highest quartet score when the score is computed based on the corrected quartet distribution. We note that these scores are not directly comparable to other methods’ quartet scores as other reported scores are computed with respect to a different quartet distribution (i.e., original uncorrected distribution). Therefore, we cannot consider these scores as “over-estimation” compared to true quartet scores. 
Rather, they suggest that the weight adjustment/correction by QT-WEAVER results in quartet distributions that may lead to species trees with higher quartet consistency scores (w.r.t. corrected distributions), which were not attainable with the uncorrected distributions.
Finally, we examined the quartet scores with respect to true gene trees with no estimation errors (Table A6). The superiority of wQFM as a quartet-based tree estimation method and the efficacy of QT-WEAVER as a quartet correction method are even more evident from these quartet scores. In most cases, wQFM and wQFM+QT-WEAVER achieve higher and closer (to true quartet score) quartet scores than ASTRAL. This implies that the corrected distributions result in species trees that better correspond to the true gene tree distribution – supporting the trends observed in RF rates (Figure 4(a)).
4.2 Results on 15-taxon dataset
In the simulated 15-taxon datasets, we evaluated the performance on varying gene tree estimation error using 100bp and 1000bp sequence lengths and on varying numbers of gene trees (100 and 1000). Similar to the 37-taxon dataset, wQFM+QT-WEAVER consistently outperforms ASTRAL and wQFM across all four model conditions (Figure 5). In particular, wQFM+QT-WEAVER is significantly better (p ≪ 0.05) than ASTRAL on challenging model conditions with short sequences (100bp), i.e., high GTEE.
Similar to 37-taxon dataset, we investigated the divergence of uncorrected and corrected quartet distributions from true quartet distributions (Figure 5(c)-(d) and Table A7) and the quartet scores of different estimated species trees with respect to both estimated and true gene trees (see Appendix E.2). The Jensen-Shannon divergence clearly improves after correction when the gene tree estimation error is high (100bp). Thus, the quartet distributions inferred from gene trees with high estimation error realign with the true distribution after correction using QT-WEAVER. On the other hand, when the distribution has low amounts of GTEE with relatively long (1000bp) sequences and thus very low divergence to begin with (≤5%), correcting it does not seem to reduce the difference. Interestingly, even though the divergence increases slightly for the more accurate distributions (1000bp), the corrected distribution as a whole seems to be more representative of the true distribution, as supported by the better RF rates.
As with the 37-taxon dataset, we assessed the quartet scores (Tables A8 and A9) and observed similar trends. wQFM+QT-WEAVER achieves higher and closer (to true quartet score) quartet scores than ASTRAL and wQFM across all four model conditions when the quartet scores are computed based on true gene trees, further demonstrating the efficacy of QT-WEAVER in accounting for GTEE.
5 Iterative corrections using QT-WEAVER
The demonstrated effectiveness of using QT-WEAVER in adjusting quartet distributions raises the question: what happens if we iteratively correct the adjusted distribution, using the output of QT-WEAVER as input to QT-WEAVER for the next iteration? To investigate this, we performed 50 iterations and analyzed how different evaluation metrics evolve (Figure 6). Notably, in model conditions with high gene tree estimation errors (100bp), RF rates steadily and dramatically improve (Figure 6a). For the 100bp-100gene condition, the RF rate drops from 25% to 11% after 15 iterations of weight corrections, and for the 100bp-1000gene condition, the RF rate decreases from 16% to as low as 2% after 10 iterations. Note that the RF rates of ASTRAL on these two model conditions are 31% and 23%, respectively, as indicated by the horizontal lines in Figure 6a. However, in model conditions with well-estimated gene trees (1000bp sequences) where the initial RF rates are already low (∼5%), repeated iterations may initially improve accuracy (e.g., a drop from 7% to 1% in the 1000bp-100gene condition) but eventually lead to performance degradation. This occurs because continuously adjusting well-estimated and corrected quartet distributions (as in cases with long sequences and large numbers of genes) can distort the weights to an extent that can mislead the tree search algorithm toward less accurate trees.
In Figure 6(b), we show the changes in quartet scores (w.r.t. the true gene tree distributions) for wQFM+QT-WEAVER over multiple iterations. As we can see, the increase and decrease in the quartet scores over multiple iterations are closely aligned with the decrease and increase of the RF rates, respectively. When the GTEE is high, the scores maintain a rising trend. On the other hand, on the low GTEE model conditions, the quartet scores start to degrade after a few iterations in a pattern that mirrors the RF rate degradation. Additionally, we investigated the Jensen-Shannon divergence and quartet scores (relative to corrected distributions) across iterations (Figure A6 and related discussion in Appendix E.3).
Thus, performing multiple iterations of weight adjustment shows potential, though improvements may not be consistent across all datasets. For instance, in the 37-taxon dataset, no further improvement was seen beyond the first iteration, as in most cases, the RF rates dropped below 5% after one iteration, and successive iterations distorted the distributions to an extent that resulted in worse trees. Nonetheless, QT-WEAVER’s iterative approach shows promise and warrants further investigation to develop a robust stopping criterion and scoring scheme for selecting the best tree from successive iterations.
6 Results on biological avian dataset
We have reanalyzed the avian biological dataset from [44], which has 14,446 genes (including exons, introns, and UCEs) across 48 taxa. This dataset contains high levels of gene tree discordance likely driven by rapid radiation events in the evolutionary history of these species. Mahbub et al. [12] compared ASTRAL, wQFM, and wQMC with the binned MP-EST tree which was presented in Jarvis et al. [44]. We include wQFM+QT-WEAVER in this comparison (Figure A7 in Appendix F).
All three estimated trees are highly congruent with the reference binned tree, with wQFM+QT-WEAVER being the most congruent and ASTRAL the least. The wQFM trees (on uncorrected and corrected distributions) are almost identical, with mainly one noticeable difference: wQFM+QT-WEAVER was able to correctly resolve the Cursores clade (crane and killdeer), which both wQFM and ASTRAL failed to recover. ASTRAL failed to recover Otidimorphae (bustard, turaco, and cuckoo), whereas both wQFM and wQFM+QT-WEAVER reconstructed this clade.
All three methods successfully reconstructed the well-established Australaves clade (passeriformes, parrots, falcons, and seriemas). They also recovered the Afroaves, Core Waterbird, and Caprimulgimorphae clades successfully. All three failed to recover Columbea but were able to recover the constituent clades Columbimorphae (mesite, sandgrouse, and pigeon) and Phoenicopterimorphae (flamingo and grebe).
7 Running Time
The running time of QT-WEAVER depends solely on the number of taxa n, which determines the number of unique quartets in the weighted quartet distribution, and not on the number of genes. All analyses were run on the same machine with an AMD Ryzen 7 5800H CPU (8 cores), 16GB RAM, and an NVIDIA GeForce RTX 3060 GPU (6GB memory). The simulated 15-taxon and 37-taxon datasets took 1.65 and 242 seconds on average per replicate, respectively. The biological avian dataset with 48 taxa took 983 seconds to run.
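To give a sense of how the input size grows with n, the short sketch below counts the four-taxon subsets and the maximum number of distinct quartet topologies (three per subset) for the three dataset sizes used here. The function name is ours and purely illustrative; the counts are upper bounds, since a particular set of gene trees need not induce all three topologies for every subset.

```python
from math import comb

def quartet_counts(n):
    """Return (number of four-taxon subsets, maximum number of quartet topologies)."""
    four_taxon_sets = comb(n, 4)
    return four_taxon_sets, 3 * four_taxon_sets

for n in (15, 37, 48):
    print(n, quartet_counts(n))
# 15 (1365, 4095)
# 37 (66045, 198135)
# 48 (194580, 583740)
```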
8 Conclusions
This study, for the first time, introduces the quartet distribution correction problem and shows the impact and clear benefit of using quartet distributions corrected by QT-WEAVER for improved species tree estimation. QT-WEAVER learns the overall quartet distribution based on the pattern of quartet conflicts (a concept that we have introduced in this study), and seeks to update the weights to better reflect their relative importance. The concept of quartet conflict and related theoretical results have broad applicability and will be valuable for a range of quartet-based computational methods. Our experimental study shows that QT-WEAVER may result in substantial improvements over the leading method ASTRAL. Therefore, the idea of estimating species trees by correcting quartet distributions has merit and should be pursued and used in future phylogenomic studies. As a future study, we plan to evaluate QT-WEAVER on a diverse set of real biological datasets, as the patterns of quartet conflicts are complex and heterogeneous across datasets. Another important research direction is the automatic selection of QT-WEAVER configurations. Additionally, considering the dramatic improvements achieved from multiple iterations of QT-WEAVER on certain datasets and model conditions, future studies need to investigate how to automatically identify an appropriate QT-WEAVER configuration for a given input, based on the input gene tree topologies, to ensure iterative enhancement.
Appendix A Identifying Quartet Conflicts
Theorem 1. Let q1 and q2 be two distinct quartets drawn from a set of taxa. The number of binary trees on L(q1) ∪ L(q2) that simultaneously satisfy both q1 and q2 is determined by the number of taxa shared between them. Specifically, let ct denote the number of taxa common to both quartets. Then the possible values of ct are 0, 1, 2, 3, or 4, and the number of binary trees that satisfy both q1 and q2 is as follows:
Case 1: ct = 0, 1, 2 When the number of common taxa between q1 and q2 is less than 3, i.e., ct = 0, 1, 2, there will always be multiple binary trees on L(q1) ∪ L(q2) that satisfy both quartets.
Case 2: ct = 3 When q1 and q2 share exactly three common taxa (ct = 3), meaning each quartet has one unique taxon relative to the other, two sub-cases arise:
If the unique taxon in q1 replaces the unique taxon in q2 (meaning that when the unique taxon in quartet q1 is replaced by the unique taxon in the other quartet q2, the resulting topology will match that of q2), there are exactly three distinct binary trees on L(q1) ∪ L(q2) that satisfy both quartets q1 and q2.
In all other configurations where the quartets share 3 common taxa, there is exactly one binary tree on L(q1) ∪ L(q2) that satisfies both quartets.
Case 3: ct = 4 When q1 and q2 share all 4 taxa (ct = 4), no binary tree can satisfy both quartets.
Proof. We examine each case based on the number of common taxa between q1 and q2.
Case 1: ct = 0, 1, 2 When q1 and q2 share fewer than 3 taxa, there are always multiple binary trees that satisfy both quartets, because the two quartets together do not impose sufficient constraints to determine a single tree on L(q1) ∪ L(q2).
- Sub-case ct = 0 Since there are no shared taxa, the quartets are completely independent and can be placed independently within a tree on L(q1) ∪ L(q2), leading to multiple distinct trees that satisfy both quartets.
- Sub-case ct = 1 Let q1 = ab|cd and q2 = ax|yz be two quartets with a as the common taxon. Here, {a, b} and {a, x} are separated from {c, d} and {y, z} in q1 and q2, respectively. Then any binary tree on L(ab|cd) ∪ L(ax|yz) = {a, b, c, d, x, y, z} containing an edge e that induces a bipartition π(e) = abx|cdyz, separating {a, b} ∪ {a, x} = {a, b, x} from {c, d} ∪ {y, z} = {c, d, y, z}, is consistent with both ab|cd and ax|yz. Since there are 3 possible ways to arrange {a, b, x} on one side of e and 15 possible ways to arrange {c, d, y, z} on the other side, there are at least 3 × 15 = 45 trees that satisfy both quartets.
- Sub-case ct = 2 Let the two common taxa in q1 and q2 be a and b, and let the other four taxa (two from each quartet) be c, d and x, y, respectively. Now, depending on the relative placements of a and b in q1 and q2, three cases may arise.
∗ Sub-sub-case 1 (a and b are sister taxa in both quartets): Let q1 and q2 be ab|cd and ab|xy, respectively. Any binary tree on L(q1) ∪ L(q2) containing the bipartition ab|cdxy satisfies both quartets.
∗ Sub-sub-case 2 (a and b are sisters in exactly one quartet): Without loss of generality, let q1 and q2 be ab|cd and ax|by, with a and b being sisters in q1. Then, any binary tree that contains both the bipartitions ax|bcdy and abx|cdy satisfies ab|cd and ax|by. Similarly, any binary tree that contains both the bipartitions by|acdx and aby|cdx satisfies ab|cd and ax|by.
∗ Sub-sub-case 3 (a and b are not sisters in either quartet): Let q1 and q2 be ac|bd and ax|by, respectively. Any binary tree on L(q1) ∪ L(q2) containing an edge e that induces a bipartition π(e) = acx|bdy, separating {a, c} ∪ {a, x} = {a, c, x} from {b, d} ∪ {b, y} = {b, d, y}, is consistent with both ac|bd and ax|by. Note that there are 9 such trees (3 arrangements of {a, c, x} on one side of e times 3 arrangements of {b, d, y} on the other).
Case 2: ct = 3 (three common taxa) When q1 and q2 share exactly three taxa, the number of trees that can satisfy both quartets depends on the relationship between the unique taxon in each quartet. Two distinct sub-cases arise:
- Sub-case 1 (replacement of unique taxa): If the unique taxon in q1 simply replaces the unique taxon in q2 (preserving the relationship between the shared taxa) and vice versa, then two of the common taxa will be sister species in both q1 and q2. In this case, exactly three distinct binary trees on the set L(q1) ∪ L(q2) of five leaves satisfy both quartets. These three trees correspond to the possible ways to arrange the unique taxa from q1 and q2 with respect to the shared taxa. For example, q1 : ab|cd and q2 : ab|ce satisfy this condition where d and e are the unique taxa in q1 and q2 respectively and replacing d with e in q1 (or e with d in q2) makes q1 and q2 identical. Moreover, a and b are closer to each other than they are to other species in both q1 and q2. In this case, we can place e on three branches in q1 (as shown in Figure A1a), resulting in three different trees on five taxa {a, b, c, d, e} that satisfy both of them. Similarly, we can place d on three branches in q2 (as shown in Figure A1b), producing three different trees that satisfy both q1 and q2. Note that the three trees in Figure A1a are identical to the trees depicted in Figure A1b. For instance, placing taxon e on the middle branch in Figure A1a yields the same tree as placing taxon d as the sister to taxon c in Figure A1b.
Let us now consider the two additional cases where taxon e is placed on the other two branches of the quartet ab|cd. First, if taxon e is placed as the sister taxon to a, one of the resulting induced quartets is ae|bc, which does not satisfy ab|ce as e is now the sister to a. Similarly, if e is placed as the sister to b, the quartet ac|be is formed, which also is not consistent with ab|ce. Thus, there are exactly three trees on {a, b, c, d, e} that satisfy both q1 and q2.
- Sub-case 2 (other configurations): In all other scenarios where the unique taxon in q1 does not simply replace the unique taxon in q2 (and vice versa), there is exactly one binary tree on the set L(q1) ∪ L(q2) of five taxa that satisfies both quartets. This is because the three common taxa, along with their relationships to the unique taxa, fully constrain the tree structure. Here, unlike sub-case 1, no pair of taxa among the three common taxa are sisters in both quartets. Consider the two quartets, q1 = ab|cd and q2 = ae|bd that satisfy this condition. In this context, unique taxon c in q1 and e in q2 do not replace each other. In order to find a tree that satisfies both q1 and q2, we can insert e in q1 by making it a sister of a, as shown in Figure A2a. Similarly, we can insert c into q2 by making it the sister of d as shown in Figure A2b. Note that these two trees are identical. Furthermore, the placements of e on four other edges in q1 (the edges incident on b, c, d, and the internal edge) result in trees that do not support q2 (ae|bd) as b becomes closer to a than e is to a. Similarly, the placements of c on four other edges in q2 do not produce any trees that satisfy q1 (ab|cd) as b becomes closer than c to d. Thus, there is exactly one tree (as shown in Figure A3a) that satisfies both q1 and q2.
Case 3: ct = 4 (all taxa are shared) When two distinct quartets q1 and q2 share all four taxa, they are topologically distinct variants of each other on the same set of taxa. Since a single binary tree induces exactly one topology on any four of its taxa, no tree can simultaneously satisfy both quartets.
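The case analysis above can be checked by exhaustive enumeration on small examples. The following sketch is not part of QT-WEAVER; it is a brute-force check, under the assumption that taxon names are single characters, that enumerates every unrooted binary tree on L(q1) ∪ L(q2) and counts those that display both quartets. An unrooted tree is represented by rooting it at its first leaf, so that each edge corresponds to one clade of the resulting rooted tree. All function names are ours.

```python
def parse(q):
    """Turn 'ab|cd' into a pair of frozensets (single-character taxon names)."""
    left, right = q.split("|")
    return frozenset(left), frozenset(right)

def insert(tree, leaf):
    """Yield every rooted tree obtained by attaching `leaf` to one edge of `tree`."""
    yield (tree, leaf)                      # edge above this subtree
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert(left, leaf):
            yield (new_left, right)
        for new_right in insert(right, leaf):
            yield (left, new_right)

def rooted_trees(leaves):
    """Yield all rooted binary trees on `leaves` as nested tuples."""
    if len(leaves) == 1:
        yield leaves[0]
        return
    for tree in rooted_trees(leaves[:-1]):
        yield from insert(tree, leaves[-1])

def clades(tree):
    """Return (leaf set of `tree`, list of the leaf sets of all its subtrees)."""
    if not isinstance(tree, tuple):
        leaf_set = frozenset([tree])
        return leaf_set, [leaf_set]
    left_set, left_clades = clades(tree[0])
    right_set, right_clades = clades(tree[1])
    both = left_set | right_set
    return both, left_clades + right_clades + [both]

def displays(clade_list, quartet):
    """A tree displays xy|zw if some clade contains exactly one pair of the four taxa."""
    pair1, pair2 = quartet
    four = pair1 | pair2
    return any((c & four) in (pair1, pair2) for c in clade_list)

def count_trees(q1, q2):
    """Count unrooted binary trees on L(q1) ∪ L(q2) that display both quartets."""
    q1, q2 = parse(q1), parse(q2)
    taxa = sorted(q1[0] | q1[1] | q2[0] | q2[1])
    count = 0
    for tree in rooted_trees(taxa[1:]):     # taxa[0] acts as the root leaf
        _, clade_list = clades(tree)
        if displays(clade_list, q1) and displays(clade_list, q2):
            count += 1
    return count

print(count_trees("ab|cd", "ab|ce"))   # ct = 3, replacement sub-case
print(count_trees("ab|cd", "ae|bd"))   # ct = 3, other configurations
print(count_trees("ab|cd", "ac|bd"))   # ct = 4
```

The three calls cover Case 2 (both sub-cases) and Case 3; calls such as count_trees("ab|cd", "ax|yz") can be used in the same way to check the lower bounds given for Case 1.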
We define a canonical quartet pair as a pair of quartets (q1, q2) that can be satisfied by exactly one tree T, where L(T) = L(q1) ∪ L(q2), as in Sub-case 2 of Case 2 in Theorem 1. Thus, a canonical quartet pair (q1, q2) uniquely represents a tree on L(q1) ∪ L(q2) that satisfies both q1 and q2. For example, q1 = ab|cd and q2 = ae|bd is one such canonical quartet pair, which is satisfied by exactly one tree T on {a, b, c, d, e} as shown in Figure A3a. There are five quartets (ae|bd, ae|bc, ae|cd, ab|cd, and be|cd) in T, including the canonical pair corresponding to this tree (ab|cd and ae|bd). The canonical pair of quartets is clearly in conflict with the topological variants of the three other quartets (ae|bc, ae|cd, and be|cd). Note that each quartet has two topologically distinct variants, giving 3 × 2 = 6 conflicting quartets. Thus, we have the following Corollary 1.1.
Corollary 1.1. Every canonical pair of quartets (q1, q2) is in conflict with six quartets on L(q1) ∪ L(q2).
Note that the tree in Figure A3a is constructed using q1 = ab|cd as the backbone, with the unique taxon e in q2 = ae|bd placed as a sister to a. Consequently, both q2 = ae|bd and ae|bc are induced by T. Therefore, if we instead consider q2 = ae|bc (rather than ae|bd), q1 = ab|cd and q2 = ae|bc form another canonical pair, which is also satisfied by the same tree in Figure A3a. Thus, two canonical pairs, (ab|cd, ae|bd) and (ab|cd, ae|bc), that contain ab|cd as one member and the other member having e as the unique taxon with respect to ab|cd correspond to the tree shown in Figure A3a. Additionally, two other canonical pairs, (ae|bc, be|cd) and (ae|bd, be|cd), both sharing be|cd as a common member and excluding ab|cd, also represent the same tree in Figure A3a. Consequently, the tree shown in Figure A3a contains a total of four canonical pairs of quartets (q1, q2), (q2, q3), (q3, q4) and (q4, q1), where q1 = ab|cd, q2 = ae|bd, q3 = be|cd, q4 = ae|bc. Notably, the quartet ae|cd, induced by the tree in Figure A3a, does not appear in any canonical pair, as there is no other quartet that can pair with ae|cd to place taxon b on the internal branch of ae|cd. This is because, when two quartets form a canonical pair, the unique taxon from one quartet is positioned as a sibling of one of the taxa in the other quartet. All other quartets in the tree, except ae|cd, are part of two canonical pairs.
Additionally, using q1 = ab|cd as the backbone and placing e as the sister to b, we obtain a tree T (shown in Figure A3b) that satisfies two other canonical quartet pairs, (ab|cd, ac|be) and (ab|cd, ad|be), that contain ab|cd as one member. Similarly, e can be placed as a sister to the other taxa in q1 (i.e., c and d). Thus, for a given quartet q1 = ab|cd, we can form 4 × 2 = 8 canonical pairs of quartets with q1 = ab|cd being one member and the other member q2 having e as the unique taxon with respect to q1. These lead to the following Corollaries 1.2 and 1.3 (these are Corollaries 1.2 and 1.3 in the main text, respectively).
Corollary 1.2. Every binary unrooted tree T with five taxa contains a total of four canonical pairs of quartets (q1, q2), (q2, q3), (q3, q4) and (q4, q1) with each of the four quartets q1, q2, q3, q4 being present in two canonical pairs.
Corollary 1.3. Every quartet q1 is a part of eight canonical pairs of quartets with the other members of these pairs having a particular unique taxon with respect to q1.
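The two counts above can be confirmed by enumerating all 15 unrooted binary trees on five taxa. The sketch below is an illustrative check, not the QT-WEAVER implementation: it represents each five-taxon tree by the set of its five induced quartets (a tree with cherries (a, b) and (c, d) and middle leaf m), and calls a pair of quartets canonical when exactly one tree induces both. Function names and the single-character taxon labels are our own assumptions.

```python
from collections import Counter
from itertools import combinations

def quartet(p, q):
    """Canonical form of the quartet p|q: a frozenset of two 2-taxon frozensets."""
    return frozenset([frozenset(p), frozenset(q)])

def five_taxon_trees(taxa):
    """Yield each unrooted binary tree on five taxa as the set of its induced quartets."""
    for m in taxa:                                   # m is the middle leaf
        w, x, y, z = [t for t in taxa if t != m]
        for (a, b), (c, d) in [((w, x), (y, z)), ((w, y), (x, z)), ((w, z), (x, y))]:
            yield frozenset([
                quartet(a + b, c + d),               # m removed
                quartet(b + m, c + d),               # a removed
                quartet(a + m, c + d),               # b removed
                quartet(a + b, m + d),               # c removed
                quartet(a + b, m + c),               # d removed
            ])

def canonical_pairs(taxa):
    """Map each canonical pair (as a frozenset of two quartets) to its unique tree."""
    trees = list(five_taxon_trees(taxa))
    all_quartets = set().union(*trees)
    pairs = {}
    for q1, q2 in combinations(all_quartets, 2):
        hosts = [t for t in trees if q1 in t and q2 in t]
        if len(hosts) == 1:
            pairs[frozenset([q1, q2])] = hosts[0]
    return pairs

pairs = canonical_pairs("abcde")
pairs_per_tree = Counter(pairs.values())
q1 = quartet("ab", "cd")
partners = [p for p in pairs if q1 in p]

print(sorted(set(pairs_per_tree.values())))   # canonical pairs hosted by each tree
print(len(partners))                          # canonical pairs containing ab|cd
```

Under these definitions, every five-taxon tree hosts four canonical pairs and ab|cd belongs to eight of them for the unique taxon e, in agreement with Corollaries 1.2 and 1.3.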
As stated in Corollary 1.1, every canonical pair containing a quartet q1 = ab|cd is in conflict with six other quartets. Additionally, according to Corollary 1.3, q1 is a member of eight canonical pairs of quartets for a given unique taxon e. Table A1 lists the six conflicting quartets for each of these eight canonical pairs of quartets, where q1 = ab|cd is one member and the other member contains the unique taxon e. Thus, every canonical pair together with each of its six conflicting quartets forms a set of three quartets that cannot coexist in a tree, resulting in 8 × 6 = 48 such sets. We call each such trio of quartets a conflicting set. Ignoring the duplicates, as marked in Table A1, 28 unique conflicting sets remain.
Every quartet q is part of 28 unique conflicting sets as a member of canonical pairs for each unique taxon x ∉ L(q).
For a given set S of n taxa and a quartet q with L(q) ⊆ S, there are n − 4 unique taxa in S with respect to q. Thus, the following corollary follows naturally.
For a set S of n taxa, every quartet q, where L(q) ⊆ S, is in 28 × (n − 4) unique conflicting sets as a member of canonical pairs.
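The counts 48, 28, and 28 × (n − 4) can be reproduced by a small enumeration. The sketch below is an illustrative check rather than the procedure used inside QT-WEAVER: for a fixed quartet q1 and one extra taxon, it lists every trio formed by q1, a canonical partner q2, and a topological variant of one of the three remaining quartets of the unique tree satisfying (q1, q2). Function names and single-character taxon labels are our own assumptions.

```python
def quartet(p, q):
    """Canonical form of the quartet p|q: a frozenset of two 2-taxon frozensets."""
    return frozenset([frozenset(p), frozenset(q)])

def five_taxon_trees(taxa):
    """Yield each unrooted binary tree on five taxa as the set of its induced quartets."""
    for m in taxa:                                   # m is the middle leaf
        w, x, y, z = [t for t in taxa if t != m]
        for (a, b), (c, d) in [((w, x), (y, z)), ((w, y), (x, z)), ((w, z), (x, y))]:
            yield frozenset([
                quartet(a + b, c + d), quartet(b + m, c + d), quartet(a + m, c + d),
                quartet(a + b, m + d), quartet(a + b, m + c),
            ])

def variants(q):
    """The two alternative topologies on the same four taxa as q."""
    a, b, c, d = sorted(set().union(*q))
    three = {quartet(a + b, c + d), quartet(a + c, b + d), quartet(a + d, b + c)}
    return three - {q}

def conflicting_sets(q1, extra):
    """List (with duplicates) all conflicting trios containing q1 for one extra taxon."""
    taxa = "".join(sorted(set().union(*q1))) + extra
    trees = list(five_taxon_trees(taxa))
    trios = []
    for tree in trees:
        if q1 not in tree:
            continue
        for q2 in tree - {q1}:
            hosts = [t for t in trees if q1 in t and q2 in t]
            if len(hosts) != 1:                      # (q1, q2) is not canonical
                continue
            for q3 in tree - {q1, q2}:
                for v in variants(q3):               # v contradicts the tree of (q1, q2)
                    trios.append(frozenset([q1, q2, v]))
    return trios

q1 = quartet("ab", "cd")
trios = conflicting_sets(q1, "e")
print(len(trios), len(set(trios)))                   # with and without duplicates

# n = 7 taxa {a, ..., g}: three possible unique taxa with respect to q1.
all_sets = set()
for x in "efg":
    all_sets |= set(conflicting_sets(q1, x))
print(len(all_sets))
```

Conflicting sets obtained for different unique taxa never coincide, because the partner q2 always contains the unique taxon; this is why the per-taxon count of 28 simply multiplies by n − 4.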
Appendix B QT-WEAVER: pseudo-code
Appendix C QT-WEAVER configurations
The comparison of the eight configurations of QT-WEAVER (the two weighting schemes combined with four different conflicting subsets) along with wQFM on the original distribution is presented in Figure A4. The original distribution is derived by taking the frequency of each quartet in all gene trees (GTF: Gene Tree Frequency). We vary both the number of genes (100 and 1000) and the amount of gene tree estimation error (by varying the sequence lengths from 100bp to 1000bp). As expected, the performance of the methods improves when the number of genes increases or when the gene tree estimation error decreases. The MoW-based weighting scheme outperforms the PoW variant. Furthermore, subsets of four or six conflicting sets tend to perform better than all conflicting sets.
These results raise the question – why does considering a small subset of all conflicting sets tend to yield better results? We hypothesize that using all 28 conflicting sets may impose excessive topological constraints on a quartet, making it challenging for QT-WEAVER to effectively distinguish among different topological variants.
In contrast, the six selected conflicting sets (as listed in Table A2) appear suitable for assessing conflict levels across quartet topologies, as indicated by our experimental results. While this specific choice of six sets lacks theoretical backing, we anticipate that the optimal subset, both in size and composition, may vary by dataset. Thus, an important research avenue involves automating the selection of these subsets, tailored to the topological features of the input gene trees, to optimize performance.
Appendix D Dataset
Appendix E Additional Results
E.1 Additional results on 37-taxon dataset
E.2 Additional results on 15-taxon dataset
E.3 Additional results on iterative corrections using QT-WEAVER
We investigated how the Jensen-Shannon divergence evolves across iterations (see Figure A6(a)). Initially, the divergence decreases during the first one or two iterations but then begins to increase. This occurs because, for the three alternative quartet topologies (ab|cd, ac|bd, ad|bc) on four taxa, QT-WEAVER tends to prioritize the quartet topology it identifies as “correct” (the one with the least conflict score) by increasing its weight while reducing the weights of the alternative topologies. With successive iterations, the weight of the “correct” quartet topology continues to rise (toward 100%) while the weights of the other two alternatives keep decreasing (toward 0%). This overestimation of the weight of the “correct” quartets and the underestimation of the weights of the “incorrect” ones lead the adjusted distribution to diverge from the true weighted distribution. Despite this divergence, the iterative adjustments may still guide the tree search algorithm toward more accurate trees (up to a certain point) by emphasizing and amalgamating the quartets with higher weights, as evidenced by the gradual decrease in the RF rates up to 10-15 iterations. For the same reason, when examining the quartet scores of wQFM+QT-WEAVER with respect to the corresponding corrected quartet distribution (instead of the original quartet distribution), the scores continue to increase, approaching nearly 100% (see Figure A6(b)). This occurs because, in the corrected distribution, the weights of the quartets present in the estimated species trees steadily rise, resulting in higher quartet scores for the estimated trees.
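The qualitative behaviour described above (divergence that first falls and then rises as the dominant topology approaches 100%) can be reproduced with a toy model. The sketch below does not implement QT-WEAVER's conflict-based update; as a loudly labelled stand-in, each round simply raises the three weights to a power greater than 1 and renormalises, which likewise rewards the currently dominant topology. The true and estimated weights are hypothetical numbers chosen for illustration.

```python
import math

def sharpen(weights, gamma=1.5):
    """Toy stand-in for one correction round (NOT the QT-WEAVER update rule)."""
    powered = [w ** gamma for w in weights]
    total = sum(powered)
    return [w / total for w in powered]

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two aligned weight vectors."""
    div = 0.0
    for pk, qk in zip(p, q):
        mk = (pk + qk) / 2
        if pk > 0:
            div += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            div += 0.5 * qk * math.log2(qk / mk)
    return div

true_weights = [0.60, 0.20, 0.20]   # hypothetical true weights of ab|cd, ac|bd, ad|bc
weights = [0.45, 0.30, 0.25]        # hypothetical weights from estimated gene trees

history = [weights]
for _ in range(10):
    weights = sharpen(weights)
    history.append(weights)

divergences = [js_divergence(w, true_weights) for w in history]
best = divergences.index(min(divergences))   # iteration closest to the true weights

for i, (w, d) in enumerate(zip(history, divergences)):
    print(i, [round(x, 3) for x in w], round(d, 4))
```

In this toy setting the weight of the dominant topology grows at every round, while the divergence from the true weights reaches its minimum at an intermediate iteration and increases afterwards, which is the pattern that motivates a stopping criterion.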