Polynomial Phylogenetic Analysis of Tree Shapes

Phylogenetic trees are a central tool in evolutionary biology. They demonstrate evolutionary patterns among species, genes, and with modern sequencing technologies, patterns of ancestry among sets of individuals. Phylogenetic trees usually consist of tree shapes, branch lengths and partial labels. Comparing tree shapes is a challenging aspect of comparing phylogenetic trees as there are few tools to describe tree shapes in a quantitative, accurate, comprehensive and easy-to-interpret way. Current methods to compare tree shapes are often based on scalar indices reflecting tree imbalance, and on frequencies of small subtrees. In this paper, we present tree comparisons and applications based on a polynomial that fully characterizes trees. Polynomials are important tools to describe discrete structures and have been used to study various objects including graphs and knots. There are also polynomials that describe rooted trees. We use tree-defining polynomials to compare tree shapes randomly generated by simulations and tree shapes reconstructed from data. Moreover, we show that the comparisons can be used to estimate parameters and to select the best-fit model that generates specific tree shapes.

T is P (T, x, y) = x(x 2 + y) + y = x 3 + xy + y, as T has two subtrees adjacent to the root, a 95 trivial tree T 1 with P (T 1 , x, y) = x and a cherry T 2 with P (T 2 , x, y) = x 2 + y. It is proved 96 in (Liu, 2021) that the polynomial distinguishes unlabeled rooted trees and can be 97 generalized to distinguish unlabeled unrooted trees. A rooted tree can be reconstructed 98 from its polynomial by computing its Newick code, which can be obtained by recursively 99 subtracting y and factoring the rest of the polynomial. Methods to factor large multivariate 100 polynomials can be found in (Monagan, 2018). The coefficients of a tree polynomial can be 101 written as a matrix. Let T be a rooted tree with n tips. Its coefficient matrix C(T ) or 102 (c (a,b) ) is displayed as follows, where c (a,b) is the coefficient in the term c (a,b) x a y b . x c (1,0) c (1,1) c (1,2) . . . c (1,n) x 2 c (2,0) c (2,1) c (2,2) . . . c (2,n) . . . . . . . . . . . . . . . . . .
x n c (n,0) c (n,1) c (n,2) . . . c (n,n) Let T be a rooted tree with n tips. The coefficient c (a,b) in the term c (a,b) x a y b of C(T ) can 105 be interpreted as the number of ways in T to choose b clades (with more than one tip) 106 such that these clades include n − a tips of T in total. The clade size distribution of a tree 107 T is the vector whose i-th element is the number of clades in T containing i tips. The 108 second column in the matrix C(T ) is the clade size distribution of the tree T , where Tree metrics 114 In this paper, we use three tree metrics or distances. The first is a tree metric based 115 on the Laplacian spectrum. The metric is the Jensen-Shannon distance over the spectrum 116 densities introduced in (Lewitus, 2016). We call it Lewitus-Morlon metric. The second 117 metric is based on the subtree size distribution. The subtree size distribution of a tree is 118 defined as a vector whose n-th entry is the number of n-tip subtrees in the tree. The 119 metric is defined using the Manhattan distance over the subtree size distribution vectors. 120 We name it the "subtree-Manhatttan metric". The third metric is based on the 121 polynomial. Let T 1 , T 2 be two trees and C(T 1 ) = (c This metric is not only defined for trees of the same size, but also for trees of 127 different sizes where it's natural to assign a coefficient of 0 to each term that is absent in a 128 polynomial.
129 Parameter estimation and model selection 130 To estimate parameters for trees, we use the polynomial metric or the for the set of observed trees T .

144
We use naive Bayes classifiers (Rish, 2001)  Beta splitting trees The beta splitting random trees used in this paper are 150 generated by the beta-splitting model introduced in (Aldous, 1996). At each branching 151 event, the probability of one child clade containing i tips and the other child clade 152 containing n − i tips is given by the following formula.
153 p(i|n) = 1 a n (β) The Γ(z) in the formula is the Gamma function and a n (β) is a normalizing constant.

155
Our sets of n-tip modeled beta splitting trees consist of trees generated with β = 0, 156 β = −1, and β = −1.5, and there are 100 trees for each parameter. These choices of β Explosive radiation trees The explosive radiation trees were simulated with a 162 modification of the birth-death model proposed by Steel (2001). Steel's model builds on 163 the traditional constant birth-death model by setting lineage-specific speciation rates.

164
More precisely, the rate of speciation events on a given lineage is a function of t, the time 165 to the last speciation event on that lineage. This time t is reset to 0 at every speciation, 166 and the birth (λ i ) and death (µ i ) rates of a given lineage i are then defined as follows: where λ A , λ B , µ and τ are parameters of the model.

176
Trait evolution trees This data set was simulated following the birth-death model 177 proposed by Heard (1996). In this model, each lineage has an associated trait value (x) 178 which is "inherited" at speciation events with some stochastic change. The model for trait  The birth (λ i ) and death (µ i ) rates are defined as λ i = x and µ i = µ, respectively.

184
Similar to the explosive radiation model, the death rate µ is constant in time and across 185 lineages. Notice that there are numerous ways to produce trees with a given number of species from an evolutionary model (Hartmann, 2010). For all evolutionary models used in 187 our analysis, trees are simulated forward in time until n tips are first reached. Our data 188 sets of n-tip trait evolution trees contain 1,000 random trees generated with the initial 189 birth rate fixed at 1.0 (per arbitrary time unit), and the birth rate variation at a speciation 190 event and the death rate uniformly randomly chosen from the interval [0, 1].

191
We do not down-sample the simulated trees despite the fact that the data we use  substitution. We use a GTRCAT model for rate variation among sites. Each tree was 216 based on a random sample of 100 sequences. We use a subtype D sequence as an outgroup 217 to root HIV-1 subtype B phylogenies.

218
Our influenza virus trees were previously described in  We developed an R package named treenomial, which is available at CRAN. We  high-performance likelihood-free inference methods could utilize. 315 We also generate a set of 750-tip explosive radiation trees and use the set of random 316 trees as observed data in the parameter estimation method to estimate the birth rate λ B 317 and death rate µ for 100 explosive radiation trees with 750 tips. another tree generator based on the birth-death process. Figure 2 shows that the 330 polynomial has the potential to distinguish different tree generating models. In this 331 section, we use the polynomial together with naive Bayes classifiers to estimate which 332 model is used to generate a tree. 333 We generate a set of 500-tip beta splitting trees, a set of 500-tip explosive radiation 334 trees, and a set of 500-tip trait evolution trees. We use these sets of random trees as 335 observed data together with the naive Bayes classifiers to classify random trees generated 336 by these three models. generator that fits a given tree, as not only are trees from different random processes 349 distinguished well, but the two different birth-death processes are also well distinguished. 350 We also use the subtree size distribution and the standard naive Bayes classifiers to    Figure   413 5A shows a visualization of the polynomial distances among these trees. Figure 5B shows  We also introduced some basic methods for tree analysis using the polynomial. The  Figure 6).

468
Our polynomial is not the only one that uniquely represents rooted binary trees.

469
Other polynomials, such as the ones introduced in (Andrén, 2009), (Chaudhary, 1991), 470 (Negami, 1996) and (Botti, 1993), (Matsen, 2012)  permutations of a given size, which can be computationally heavy, while the polynomial 475 used in this paper has a recursive formula which makes the computation more efficient. 476 To compare trees with different sizes is another challenge in tree comparison. In this with different sizes, but for trees whose sizes are drastically different, the sizes naturally 482 remain a dominating factor in the resulting tree comparisons.

483
Because polynomial coefficients can be treated as vectors, and vectors give rise to 484 metrics, there are alternative metrics that can be defined using tree polynomials (both 485 those used here and others (Andrén, 2009;Chaudhary, 1991;Negami, 1996) The polynomial metric We prove that the polynomial metric is a genuine metric.

508
It is easy to check that d(T 1 , T 2 ) = 0 if and only if T 1 T 2 , and d(T 1 , T 2 ) = d(T 2 , T 1 ). We

509
show that the triangular inequality is true for the metric, that is, We only need to prove the following inequality holds for 511 a, b, c 0.
Note that if a c b or c a b, we have This is equivalent to b 2 + ac ab + bc, which is true because ac − bc ab − b 2 . Similarly, 518 the equality also holds when c b a.

519
If b a c, we have 520 a − c a + c This is equivalent to ab(b − a) + 3c(b 2 − a 2 ) + c 2 (b − a) 0, which is true as b a.

522
Similarly, the equality also holds when b c a. Therefore the polynomial metric is a 523 genuine metric.    Figure 2. A-B: the comparisons between the real parameters and the estimated parameters of the explosive radiation random trees with 250 tips using polynomials. C-D: the comparisons between the real parameters and the estimated parameters of the explosive radiation random trees with 500 tips using polynomials.E: the comparisons between the real parameter and the estimated parameter of the beta splitting random trees with 750 using polynomials.
Polynomial binary differences Binary differences, based on presence and absence 546 of components, though in general not metrics, are one of the commonly used indices in, for 547 example, taxonomic, ecologic, biogeographic comparison and classification (Choi, 2010).

548
They provide effective insights about clusters though they are not metrics in general. We 549 define the polynomial binary differences used in this paper by the number of terms that 550 are present in the polynomial of one tree but are absent in the polynomial of the other.

551
More precisely, the binary difference of two trees T 1 and T 2 are calculated by counting the number of terms that are present in P (T 1 , x, y) but are absent in P (T 2 , x, y), or the 553 number of terms that are present in P (T 2 , x, y) but are absent in P (T 1 , x, y). This provides 554 another way to compare polynomials (trees). Supplementary Figure 6 shows the results of 555 k-medoids clustering on the binary differences of the influenza trees and the HIV trees, 556 which are better than the polynomial metric in this task.

557
WHO influenza clades For clade 3c3.B, the 95% confidence interval of the birth 558 rate λ B is (0.56, 0.60) and the 95% confidence interval of the death rate µ is (0.58, 0.62).