STAG: Species Tree Inference from All Genes

Species tree inference is fundamental to our understanding of the evolution of life on earth. However, species tree inference from molecular sequence data is complicated by gene duplication events that limit the availably of suitable data for phylogenetic reconstruction. Here we propose a novel method for species tree inference called STAG that is specifically designed to leverage data from multi-copy gene families. By application to 12 real species datasets sampled from across the eukaryotic domain we demonstrate that species trees inferred from multi-copy gene families are comparable in accuracy to species trees inferred from single-copy orthologues. We further show that the ability to utilise data from multi-copy gene families increases the amount of data available for species tree inference by an average of 8 fold. We reveal that on real species datasets STAG has higher accuracy than other leading methods for species tree inference; including concatenated alignments of protein sequences, ASTRAL & NJst. Finally we show that STAG is fast, memory efficient and scalable and thus suitable for analysis of large multispecies datasets.


Introduction
The correct species tree is fundamental to understanding the diversity and history of life on earth. 27 To infer species trees, researchers typically combine sequence data from sets of orthologous 28 sequences (Jarvis, et al. 2014 ;Mao, et al. 2015;Ruhfel, et al. 2014). Often this is sets of protein 29 coding genes, but can also include conserved non-transcribed elements as well as RNA genes and 30 other nucleotide sequences. A common approach to integrating the information contained within 31 multiple different genes is to join them together to form a concatenated multiple sequence alignment 32 (CMSA). This is often preceded by checks to search for conflicts between the individual trees 33 inconsistency under the multi-species coalescent model of incomplete lineage sorting (Roch and 39 Steel 2015). Methods such as NJst (Liu and Yu 2011) and ASTRAL (Mirarab, et al. 2014) have been 40 developed that bypass the concatenation problem and infer a species tree from a set of gene trees. 41 However, both concatenation methods such as NJst and ASTRAL require data from groups of single-42 copy orthologues genes, and the number of such genes in a group of species can be limited due to 43 differing patterns of gene duplication and loss in the species sets being considered. 44 Species tree inference methods such as PHYLDOG (Boussau, et al. 2013) and Guenomu (Martins, 45 et al. 2016) are not restricted to one-to-one orthologues and can therefore use more of the available 46 whole-genome data. For example, PHYLDOG (Boussau, et al. 2013) jointly infers the species tree 47 and the gene trees under a maximum likelihood model that combines a model for sequence evolution 48 with a model for gene duplication and loss. However, the method is not suited to large datasets: to 49 analyse 36 species and at most 100 genes per gene family, the method used 3000 processors of 50 one of the top 500 supercomputers at the time (Boussau, et al. 2013). Thus, species tree inference 51 methods that use single-copy genes are restricted in the amount of data they can access, and 52 methods that can use multi-copy genes require computational resources that are beyond the reach 53 of most research groups. 54 Here we present STAG (Species Tree inference from All Genes), a novel algorithm for inferring a 55 species tree from sets of multi-copy gene trees. Through application to 12 real world datasets 56 sampled throughout the eukaryotic domain we show that STAG is the most accurate method for 57 species tree inference. Moreover, we demonstrate that consensus species trees generated using 58 STAG have properties such as realistic support values and branch lengths that are suitable for 59 downstream comparative analysis. 60 62 Species tree inference methods are commonly tested on simulated sequence data (Liu and Yu 2011;63 Martins, et al. 2016;Mirarab, et al. 2014). However, the statistical similarity of these datasets to real 64 biological datasets is not examined and the performance of the tested methods is not compared to 65 expert curation of known species trees. To rectify this and provide comparative evaluation of species 66 tree inference methods on real biological data, a collection of 12 diverse clades of species were 67 sampled from throughout the eukaryotic domain (Table 1)  was identified that provides a reference topology for the species tree derived from expert curation of 72 molecular data (Table 2, Supplemental File S1). In all cases this reference topology is taken as true 73 and is used as a benchmark for species tree inference method evaluation. 74

Problem definition and benchmark datasets
Sets of one-to-one orthologues are rare, and decrease as a function of cumulative 75 evolutionary distance between sampled species 76 To provide a common input dataset on which to test multiple species tree inference methods 77 orthogroups were inferred for each of these 12 datasets using OrthoFinder (Emms and Kelly 2015). 78 Several of the methods under consideration here require datasets of one-to-one orthologues that 79 are present in all (or most) of the species being investigated. Thus the number of genes per species 80 in each orthogroup was analysed. For each species set, there are large numbers of orthogroups in 81 2015). In this way each multi-copy gene tree is able to provide an estimate of the underlying species 111 tree. 112 To test the accuracy of these species trees inferred from multi-copy genes, the complete set of 113 species trees inferred from multi-copy gene trees from the 12 species sets were subject to 114 topological analysis. Here, each multi-copy gene orthogroup was subject to multiple sequence 115 alignment using MAFFT L-INS-i (Katoh and Standley 2013) and phylogenetic inference using IQ-116 TREE (Nguyen, et al. 2015). Species trees were inferred from each multi-copy orthogroup gene tree 117 using STAG, and the Robinson-Foulds distance (Robinson and Foulds 1981) between the STAG 118 tree and the published species tree was evaluated. To place these results in context, the complete 119 set of gene trees inferred by IQ-TREE from single copy orthogroups for the same species dataset 120 was also subject to the same Robinson-Foulds distance analysis. Single copy gene trees produced 121 in this manner are conventionally used for species tree inference as they each contain an estimate 122 of the underlying species tree. Comparison of the two approaches revealed that species trees 123 inferred from multi-copy gene orthogroups using STAG were of comparable accuracy to species 124 trees inferred from single copy gene orthogroups using conventional methods (Figure 3, 125 Supplemental File S1 Table S1). 126 Across the 12 biological datasets, inclusion of STAG trees extracted from multi copy gene 127 orthogroups increased the mean RF distance of the set of trees available for species tree inference 128 by ~3% (Table 2). However, the average amount of data available for species tree inference 129 increased on average by 800% (Table 2). Thus, for a small increase in the amount of error per tree, 130 there is a substantial increase in the data available for tree inference. 131 Consensus species trees inferred using STAG are more accurate than other methods 132 on real biological datasets 133 Given STAG provides a substantial increase in data availability with only a small penalty to mean 134 dataset accuracy it was determined how this would affect the overall accuracy of species tree 135 inference when these trees are integrated. To investigate this, a consensus species tree for each of 136 the 12 species datasets was inferred by taking the greedy consensus (Felsenstein 2005; Swofford 137 and Sullivan 2009) of each estimate of the STAG trees inferred from the sets of orthogroups with all 6 quantify the proportion of input trees in which that bipartition occurs. The number of input trees used 140 for species inference for each group is given in Table 1; the number of orthogroups with all species  141 present. 142 To place these results in context, species trees for the same species datasets were also inferred 143 with a number of leading methods for species tree inference (Supplemental File S1 Table S2). This 144 set comprised ASTRAL (Mirarab and Warnow 2015), Concatenated MSA (CMSA), NJst (Liu and Yu 145 2011) and Guenomu (Martins, et al. 2016). Each method was run using best practice approaches 146 for these methods (see Methods). The species tree produced by STAG agreed with the published 147 species tree more frequently than any of the other methods ( Figure 4, Table 4). Moreover, the 148 median and mean Robinson-Foulds distance between the STAG consensus species tree and the 149 published species tree was lower than for any other method ( Figure 4, Table 4). An example tree 150 where all of the tested methods disagree is provided in Figure 5. Here, partitions that are incorrect 151 in the STAG consensus species tree receive low support values in contrast to the other tested 152 methods ( Figure 5). 153

Branch lengths obtained from STAG consensus trees are comparable to those
154 obtained concatenated multiple sequence alignments 155 Given that STAG performed well on the tree topological tests described above it was investigated 156 whether the branch lengths of the STAG consensus trees accurately represented molecular 157 phylogenetic distances between species. To provide an unbiased comparison, only the species trees 158 where all methods were correct and thus had 100% agreement on topology were analysed ( Figure  159 6). The branch lengths obtained from STAG consensus trees correlate well with those produced by 160 other methods (mean r 2 = 0.72), and are essentially identical to those produced using concatenated 161 multiple sequence alignments (r 2 = 0.99, p = 10 -77 , Figure 6 & Supplemental File S1  165 To demonstrate the performance characteristics of STAG the time and RAM usage across the 12 166 species datasets was analysed. The maximum time and RAM usage for STAG to infer a consensus 7 respectively (Table 5). Across the 12 datasets the run time was linearly dependent on the number of 169 species and number of trees being analysed (Supplemental File S1 Figure S1). Similar the RAM 170 requirements were linearly dependent on the number of species (Supplemental File S1 Figure S1). 171

172
With fully sequenced genomes available for many species, there is abundant sequence data from 173 which to infer species trees. However, the majority of species tree inference methods are restricted 174 to use one-to-one orthologous sequences that are present in all species in the analysis. Such groups 175 of sequences are available only if gene duplication or loss has not occurred during the divergence 176 of that gene family. In this work it is shown that such one-to-one orthogroups are comparatively rare 177 in real biological datasets, and become rarer as species tree length increases (a product of increased 178 divergence time and increased species sampling). We presented a novel method called STAG 179 (Species Tree inference from All Genes). The method constructs a consensus species tree from 180 trees from all orthogrous in which all species are present, irrespective of the gene copy number per-181 species in the orthogroup. On real species datasets STAG out-performed species trees inferred by 182 comparable methods such as ASTRAL, NJst, and Guenomu as well as maximum likelihood trees 183 inferred from concatenated multiple sequence alignments (CMSA). 184 The testing was performed using 12 real biological datasets sampled from throughout the eukaryotic 185 domain. At the time of writing, this is the largest collection of biological datasets for testing species 186 tree inference that has been assembled. The use of real biological datasets in species tree inference 187 evaluation eliminates modelling assumptions that are required in order to generate simulated test 188 datasets. Moreover, widely sampled real biological datasets reflect the true disparity in real species 189 datasets and thus accurately represent the kinds of datasets to which the method will be applied. 190 The disadvantage of biological data is that the ground truth is not known. Thus, for each of these 12 191 clades of species, a published study that inferred a species tree from expert curation was accepted 192 as true in order to facilitate comparison and benchmarking of the methods tested here. The tests 193 should not bias the results in favour of any of the tested species tree inference methods as they were 194 not used to generate the references trees. Moreover, these tests should not bias towards STAG as 8 here represent a significant increase over previous analyses that only used 1 (Liu and Yu 2011;197 Martins, et al. 2016)  give the final STAG species tree. The gene trees that were used were all automatically generated 208 and involved no expert curation. 209 Although STAG is fully automated and intended for use with large datasets, it is equally well-suited 210 to the inference of the species tree from a carefully curated set of gene trees, as is common for 211 studies that aim to resolve challenging clades of the tree of life. Thus careful filtering and processing 212 assemblages of multi-gene family alignments and trees prior to running STAG will likely aid in 213 increasing the accuracy of species tree inference. In this way, STAG will aid expert curation and 214 analysis of phylogenetic datasets enabling substantial increases in data availability. 215 The units for the branch lengths in the STAG species tree are the same as the units in the input gene 216 trees. In most cases, these will be the number of substitutions per site. The tree branch lengths in 217 the STAG consensus tree are the average branch lengths across all the individual trees inferred 218 from each gene family. Thus, the branch lengths represent the average number of substitutions per 219 sites across a large range of gene families. This is an important feature of STAG trees, as these 220 branch lengths can be used directly in downstream analyses such as ancestral state reconstruction 221 and time calibration. Equivalent utility is not present in species trees generated using ASTRAL, NJst, 222 or Guenomu. Furthermore, the support values for each bipartition in a consensus STAG tree are the are therefore lower, in general, than those given by the concatenation method, which can quickly 225 reach 100% support even for bipartitions believed to be incorrect (Salichos and Rokas 2013) see 226 also Figure 5. 227

228
Algorithm overview 229 STAG obtains an estimate for divergence between each species pair from orthologous gene pairs 230 in a given gene tree. Within a gene tree these estimates do not need to come from the same ortholog, 231 but rather the closest estimate for each species pair is taken to mitigate against problems such as 232 hidden paralogy. The input to STAG is a set of unrooted gene trees (Figure 7). For each gene tree 233 containing all species, STAG identifies the closest pair of genes from those species as those that 234 are separated by the shortest branch length. These shortest distances are used to construct an inter-235 species distance matrix, and a tree is inferred from this distance matrix using FastME (Lefort,  in some individual estimates of the species tree, the gene pairs used to estimate inter-species 243 distance are paralogues descended from a gene duplication event followed by the differential loss 244 of the orthologue for each duplicate. This is known at hidden paralogy (Martin and Burg 2002) and 245 is a problem that affects all of the other methods tested here and is thus not specific to STAG. The 246 assumption that one-to-one genes in a tree are orthologues is common to all methods that infer trees 247 from presumed orthologues. 248 249 We used the 12 biological datasets and literature-derived species tree topologies from a previous 250 study (Emms and Kelly 2017). This consisted of a diverse set of species sampled from throughout 251 the eukaryotic domain. This included every named group of eukaryotes on Ensembl Genomes 252 containing >4 genera as well as sets of 47 Birds, 42 Green Plants and 16 Kinetoplastids. For each 253 dataset, the species tree topology was taken the best available from a published study (Table 2,  254 Supplemental File S1). 255

Datasets for evaluation and benchmarking
Orthogroups were inferred for each species set using OrthoFinder (Emms and Kelly 2015). Multiple 256 sequence alignments (MSA) were inferred for each orthogroup using MAFFT L-INS-i (Katoh and 257 Standley 2013) and gene trees were inferred from these MSAs using IQTree (Nguyen, et al. 2015). 258 Appropriate subsets of this data were used to evaluate all methods presented in this study according 259 to best practices described the following section. 260 261 There is no consensus best practice for construction of concatenated multiple sequence alignments 262 for species tree inference (Supplemental File S1). To infer the CMSA species tree, each alignment 263 was trimmed to include only those columns present in 50% or more of the species. The species tree 264 was inferred from the concatenated alignment using IQ-TREE, as was done for the individual gene 265 trees. We used ASTRAL version 5.5.9 with default parameters. We used the implementation of NJst 266 available at https://github.com/adamallo/NJstM (last updated Dec 2016) using the original method. We ran these datasets on the University of Oxford HPC ARCUS-B in parallel with 16 cores but they 275 timed out at 120 hours without completing the initial 5000 burn-in generations (for comparison the 276 longest runtime for STAG was 95.1s on a single core on a desktop machine). Thus the bird and plant 277 datasets were recorded as 'Timeout'. 278

Implementation of comparative methods
For species were gene duplication and loss were common, few single-copy orthogroups were 279 identified with all species present (Figure 1 & Table 1). In particular, no single-copy orthogroups with 280 all species present were identified in the Plants. This makes it impossible to infer a species tree 281 using only single-copy orthologues. To overcome this problem, and allow us to infer a species tree 282 using ASTRAL, NJst and concatenated multiple sequence alignments (CMSA), we relaxed the data 283 selection criteria allowing selecting orthogroups that were single-copy for a proportion of species. 284 This method allowed a smaller proportion of species to be present and single-copy in an orthogroup 285 if it resulted in a proportionally greater increase in the number of orthogroups that could be used 286 (Supplemental File S1 Figure S2, Supplemental File S1 Table S4). In all cases, only the single-copy 287 orthologues in these orthogroups were used for species tree inference and multiple copy genes were 288 removed from the alignment. This made it possible to infer a species tree with CMSA, ASTRAL and 289 NJst for the plants dataset. 290 This relaxed data selection criterion improved the accuracy of CMSA and ASTRAL on average 291 across the 12 datasets used in this study and thus these trees were used for CMSA and ASTRAL 292 when comparing against STAG. STAG and Guenomu have no requirement for single-copy 293 orthogroups and so this orthogroup selection method was not required for these methods. 294 Software availability 295 STAG is written in python. The source code and precompiled binary are available in the 296 Supplemental material (Supplemental File S2) and at https://github.com/davidemms/STAG. The 297 software is operated via command line interface and can be used on Linux operating systems. 298 Disclosure declaration 299 The authors declare no competing interests.     Correlation between the branch lengths returned by STAG, CMSA, ASTRAL & NJst for the datasets 441 for which all four methods returned the same (correct) species tree topology. Line of best fit is for 442 the linear least-squares regression, with the r 2 value given above each plot. 443 'FastmeTree' calls the FastME program to calculate a tree from a distance matrix; 'ConsensusTree' 447 calculates the greed consensus tree from a set of trees. 448