Phylogenetic context using phylogenetic outlines

Microbial studies typically involve the sequencing and assembly of draft genomes for individual microbes or whole microbiomes. Given a draft genome, one first task is to determine its phylogenetic context, that is, to place it relative to the set of related reference genomes. We provide a new interactive graphical tool that addresses this task using Mash sketches to compare against all bacterial and archaeal representative genomes in the GTDB taxonomy, all within the framework of SplitsTree5. The phylogenetic context of the query sequences is then displayed as a phylogenetic outline, a new type of phylogenetic network that is more general that a phylogenetic tree, but significantly less complex than other types of phylogenetic networks. We propose to use such networks, rather than trees, to represent phylogenetic context, because they can express uncertainty in the placement of taxa, whereas a tree must always commit to a specific branching pattern. We illustrate the new method using a number of draft genomes of different assembly quality.


Introduction
indication of uncertainty or alternative groupings. 94 We provide a fast and interactive implementation for exploring phylogenetic context 95 of a set of microbial sequences of interest. The user loads one or more files of query DNA 96 sequences and then requests that all similar reference genomes are determined. Then a 97 threshold is set for the maximum distance of reference genomes, or number of reference 98 genomes, to be considered. These are downloaded and a Mash comparison of the query 99 sequences and all similar reference genomes is performed, the neighbor-net method is run 100 and the result is presented as a phylogenetic outline. (The user can also choose to use a 101 tree-building method such as neighbor-joining [Saitou and Nei, 1987]).

102
To illustrate our method, we applied it to a number of draft genomes reported in 103 [Arumugam et al., 2019]. These draft genomes, or metagenomic assembly bins, contain 104 assembled contigs of long-read microbiome sequences obtained from a bio-reactor enriched 105 for polyphosphate accumulation. The paper reports a taxonomic assignment for each bin 106 that is based on an analysis of the contained protein-coding genes and confirmed using 16S 107 rRNA sequences, when present. For each of the 14 reported bins, we calculated a 108 phylogenetic outline to display the phylogenetic context of the closest reference genomes 109 below a certain distance. Three are displayed in Figure 1 (one "high-quality", one 110 "medium-quality" and one "low-quality" draft genome, respectively, as defined in [Bowers 111 et al., 2017]) and the other 11 are show in the Supplement. closest, Candidatus Accumulibacter phosphatis UBA5574, Mash distance 0.07, is not 121 represented by protein sequences in NCBI and was thus was not part of the database used 122 by Arumugam et al. [2019]. 123 Unexpectedly, the species Xanthomonadales bacterium UBA2790, which comes from 124 a different taxonomic class, also appears in the phylogenetic context of B2, with a Mash 125 distance of 0.2. Note that this metagenome assembled genome (MAG) comes from a 126 sample of granular sludge that also gave rise to two Candidatus Accumulibacter reference 127 genomes [Parks et al., 2017] and we suspect that it might be contaminated with 128 Candidatus Accumulibacter contigs or sequence. 129 In the case of draft genome B8, all references genomes displayed in the phylogenetic 130 context are Thauera species, except one unclassified Betaproteobacteria bacterium, which 131 is, however, a member of the genus Thauera, and this supports the assignment to the 132 genus Thauera. The closest reference genome Thauera aminoaromatica S2 has a Mash 133 distance of 0.2.

134
Finally, in [Arumugam et al., 2019], the draft genome B12 was classified as a 135 member of the Betaproteobacteria class, suggesting that there did not exist a closely 136 related reference at the time of the publication of the dataset. In the phylogenetic context 137 computed by SplitsTree, B12 is placed closest to Candidatus Accumulibacter sp. 66-26 with 138 a Mash distance of 0.14. In addition, some species of the genera Azonexus and 139 Dechloromonas can be found that are similar to B12, with Mash distances below 0.18.

140
These two genera belong to the family of Azonexaceae in the NCBI taxonomy, while 141 Candidatus Accumulibacter does not have a defined family or order. All three genera 142 belong to the family Rhodocyclaceae in the GTDB taxonomy. Although B12 does not have 143 many closely related reference genomes, the phylogenetic outline produced by SplitsTree5 144 suggests that B12 belongs to the Rhodocyclaceae family, which is more specific that the 145 assignment suggested by Arumugam et al. [2019].

146
In each of the three examples, it took between one and three minutes to determine all reference genomes whose sketches have a distance of at most 0.3 to the sketch of the 148 draft genome, and then to compute and display the phylogenetic outlines for the 10 most 149 similar references. Computations were carried out on a laptop with 8 cores (at 2.4 GHz) 150 and 32GB of memory. Reference genomes are downloaded (and cached) on demand, which 151 takes additional time. The distance thresholds used to select the closest reference genomes 152 for each bin were chosen interactively and are reported in the Supplementary file.

153
To illustrate the improved practical performance of the outline algorithm on a larger 154 dataset, we computed the phylogenetic context for bin B12 using the 1,000 closest reference 155 genomes. Running the neighbor-net algorithm on this data takes 90 seconds and results in 156 4,516 splits. The equal-angle algorithm [Dress and Huson, 2004] produces a splits network 157 with 108,640 nodes and 212,762 edges, and requires about seven minutes to compute and 158 show the network. In contrast, our new outline algorithm produces a splits network with 159 8,028 nodes and 8,028 edges and requires only two seconds for this (not shown here).

160
For the purpose of comparison, we applied GTDB-Tk [Chaumeil et al., 2019] in  In the case of draft genome B8, the phylogenetic context included all members of 171 the genus Thauera from GTDB-Tk, and placed the query bin next to Thauera 172 aminoaromatica S2. The tree produced by GTDB-Tk contains the same references and has           we process all events in sorted order, constructing either 0 or 1 new nodes and/or edges, 249 per event.

250
For each split S we define two events, an outbound event S + , crossing over to the 251 other side of S from the side that contains x 1 , and an inbound event S − , returning back to 252 the side of S that contains x 1 . We will sort these events and then use them to construct 253 the phylogenetic outline.

254
We define a total ordering on all events as follows (see Figure   We now describe how to create the nodes and edges of the outline (see Figure 5d).

267
We will use p to denote the current location, initially set to (0, 0). Place taxon x 1 on 268 a new node v 1 at location π(v 1 ) = p. We will use Σ(v) to denote the set of splits that 269 separates a node v from the node v 1 .
x 1 x 2 x 3 x 4 In the example in Figure 5b, this is perpendicular to the chord associated with S.

272
We process all events as described in the following two paragraphs. Let v be the 273 current node, initially set to v 1 .

274
To process an outbound event for a split S, move the current location p in the The number m of circular splits on n taxa is bounded by O(n 2 ). As discussed 284 above, there will be at most 2m nodes and 2m edges in the network, and therefore the size worst-case requirements of the equal angle algorithm [Dress and Huson, 2004], which is 290 currently used to visualize the output of the neighbor-net algorithm in SplitsTree4 [Huson 291 and Bryant, 2006].

292
We now discuss how to compute a rooted phylogenetic outline. For midpoint 293 rooting, we proceed as follows. We first determine two taxa, a and b, that maximize the . For rooting by out-group, the root is placed in the middle of a split that separates 300 the out-group from the rest of the taxa and is minimal with respect to that property.

301
Graphical user interface 302 We have implemented the approach described here in our program SplitsTree5. To 303 compute a phylogenetic outline displaying the phylogenetic context for one or more 304 prokaryotic sequences, select the File → Analyze Genomes... menu item. This will open 305 a dialog with three tabs. The first tab is used to select the input file(s) and output file, and 306 to determine whether all sequences in a given file are to be concatenated, or are to be 307 treated separately (see Figure 6a). In addition, one can set a minimum sequence length 308 (here set to 100,000 bp). The second tab is used to edit the names for the sequences (see 309 Figure 6b). The third tab is used to perform a Mash-based search in the GTDB database 310 and to select which reference genomes should be included in the phylogenetic outline,

311
based on their distances to the input sequences (see Figure 6c).

312
The example presented is a medium quality draft genome consisting of 25 contigs assembled from long read sequences, designated bin B8 in [Arumugam et al., 2019] with 314 taxonomic assignment to the genus Thauera. In Figure 6d we use a phylogenetic outline to 315 show the phylogenetic context of the draft genome, involving the 30 closest reference 316 genomes.

317
In Figure 6e we show the phylogenetic context for the 15 (of 25) input contigs 318 whose length achieves the set threshold of 100,000 bp.  filters for fast sequence comparison; and the neighbor-net method and our new concept of 339 phylogenetic outlines for visualization. We thus provide a fast heuristic for establishing the 340 phylogenetic context for one or more prokaryotic genomes or DNA sequences. We 341 demonstrated that our approach can be applied to usefully determine and visualize the 342 phylogenetic context of bacterial draft genomes at different levels of assembly quality. 343 We believe that the use of a phylogenetic outline, rather than a phylogenetic tree, 344 to represent phylogenetic context is more suitable because outlines can express vagueness 345 in the placement of taxa with respect to each, whereas trees suggest a specific branching 346 pattern. For example, in Figure 4a we show the unrooted, resolved phylogenetic tree computed using the neighbor-joining algorithm [Saitou and Nei, 1987]. Both the splits 348 network ( Figure 4b) and the phylogenetic outline ( Figure 4c)  representation.

352
The mash-based calculation of phylogenetic outlines presented here is, on the one 353 hand, a form of ab initio phylogenetic analysis, in which we infer evolutionary relationships 354 from data. On the other hand, the aim is to visualize the phylogenetic context of 355 sequences, which is similar to the goal of phylogenetic placement. We provide a graphical 356 user interface that allows the user to interactively compute and explore the context of 357 sequences. While GTDB-tk provides methods for both phylogenetic placement and ab 358 initio phylogenetic analysis, these calculations are performed using a script and the output 359 is presented as text files. While the resulting trees can viewed using third party tools, the 360 leaves of the tree are labeled by GTDB accessions, whereas our approach provides the 361 option of labeling leaves by the associated NCBI names.

362
Using a phylogenetic outline to represent phylogenetic context does not replace 363 careful alignment and sophisticated phylogenetic analysis when the goal is to understand 364 the evolutionary history of a set of taxa in detail. Nevertheless, we believe that our 365 approach will prove to be a useful addition to the biologists' computational toolbox.