Abstract
Phylogenetic placement, the problem of placing sequences into phylogenetic trees, has been limited either by the number of sequences placed in a single run or by the size of the placement tree. The most accurate scalable phylogenetic placement method with respect to the number of query sequences placed, EPA-ng, has a runtime that scales sub-linearly with the number of query sequences. However, larger phylogenetic trees cause an increase in EPA-ng’s memory usage, limiting the method to placement trees of up to 10,000 sequences. Our recently designed SCAMPP framework has been shown to scale EPA-ng to larger placement trees of up to 200,000 sequences by building a subtree for the placement of each query sequence. The approach of SCAMPP does not take advantage of EPA-ng’s parallel efficiency since it only places a single query for each run of EPA-ng. Here we present BATCH-SCAMPP, a new technique that overcomes this barrier and enables EPA-ng and other phylogenetic placement methods to scale to ultra-large backbone trees and many query sequences. BATCH-SCAMPP is freely available at https://github.com/ewedell/BSCAMPP_code.
1 Introduction
Phylogenetic placement is the problem of placing one or more query sequences into a phylogenetic “backbone” tree, which may be a maximum likelihood tree on a multiple sequence alignment for a single gene, a taxonomy with leaves labeled by sequences for a single gene [29], or a species tree [14]. When the backbone tree is a tree estimated on a single gene, the most accurate techniques for phylogenetic placement are likelihood-based, and can be computationally intensive when the backbone trees are large [10]. Phylogenetic placement into gene trees occurs when updating existing gene trees with newly observed sequences, but can also be applied in the “bulk” context, where many sequences are placed at the same time into the backbone tree. For example, it can be applied when seeking to taxonomically characterize shotgun sequencing reads generated for an environmental sample in metagenomic analysis [25, 29, 4].
Thus, phylogenetic placement is a general tool with applications in both incremental tree building and taxon identification and abundance profiling in microbiome analysis [11, 21, 25, 29, 9, 6]. In these analyses, however, the number of sequences placed into the backbone tree can be very large, for example, when attempting to place all the shotgun sequencing reads in a sample into a taxonomic tree. Even approaches that only place the subset of reads that map to marker genes (i.e., genes that are considered single-copy and universal) can require placing thousands to tens of thousands of reads into the taxonomic tree for each marker gene in a single run [25]. Moreover, such approaches can face additional computational challenges, as the taxonomic trees they place into may be very large, with perhaps many thousands or hundreds of thousands of leaves.
Thus, there are two aspects of size that impact runtime (and potentially accuracy) of phylogenetic placement methods: the size of the backbone tree and the number of sequences being placed into the tree. Since many studies have shown the improvement in accuracy for abundance profiling, phylogenetic tree estimation, etc., when placing into large backbone trees (e.g., [29]), the importance of using phylogenetic placement methods that can run on large datasets is well established.
However, not all phylogenetic placement methods can scale to large backbone trees (and many methods are computationally intensive on large trees) [35, 34, 3], which has led to the development of methods, such as APPLES-2 [1], which use distances to place into large trees. There are also methods for phylogenetic placement that are alignment-free, such as RAPPAS [18] and App-SpaM [7]. These methods are potentially faster and more scalable than likelihood-based analyses, such as pplacer [20] and EPA-ng [4]. In general, both the number of query sequences (i.e., sequences that need to be placed into trees) and the size of the backbone tree represent challenges for phylogenetic placement methods, especially when seeking to use likelihood-based placement methods.
Previously we introduced the SCAMPP framework [35] to enable both pplacer and EPA-ng to perform phylogenetic placement into ultra-large backbone trees, and we demonstrated its utility for placing into backbone trees with up to 200,000 sequences. By using maximum likelihood methods pplacer or EPA-ng within the SCAMPP framework, the resulting placements are more accurate than with APPLES-2, with the most notable accuracy improvement for fragmentary sequences, and are computationally similar for single query sequence placement [35]. However, SCAMPP was designed to incrementally update a large tree, one query sequence at a time, and was not optimized for the other uses of phylogenetic placement, where batch placement of many sequencing reads is required.
EPA-ng is optimized for scaling the number of query sequences and is capable of placing millions of sequences into phylogenetic trees of up to a few thousand sequences [4]. EPA-ng performs this batch placement very efficiently, achieving sublinear runtime in the number of query sequences (see Figure 2 from [1]). However, other studies have shown that EPA-ng has a large memory requirement that depends on the size of the backbone tree; hence, the backbone trees used with EPA-ng are typically limited to a few thousand sequences.
Here we introduce BATCH-SCAMPP, a technique that improves scalability in both dimensions: the number of query sequences being placed into the backbone tree and the size of the backbone tree. Furthermore, BATCH-SCAMPP is specifically designed to improve EPA-ng’s scalability to large backbone trees.
BATCH-SCAMPP is based on SCAMPP, our prior method for scaling phylogenetic placement methods to large backbone trees. However, by its design, SCAMPP is not suitable for batch placement. Specifically, SCAMPP operates one query sequence at a time. Given the query sequence, SCAMPP finds and extracts a single (small) placement subtree out of the large backbone tree and then uses the selected phylogenetic placement method to place the query sequence into that placement subtree. Then, by superimposing the placement tree into the backbone tree and considering branch lengths, SCAMPP can identify the correct location within the original backbone tree to add the query sequence. While this approach enables phylogenetic placement methods to scale to ultra-large backbone trees, the fact that a different subtree is selected for each query sequence makes this approach ineffective for batch placement. Hence, BATCH-SCAMPP uses a substantially modified design in order to be able to take advantage of EPA-ng’s ability to place many query sequences efficiently.
The BATCH-SCAMPP method operates by allowing the input set of query sequences to suggest and then vote on placement subtrees, thus enabling many query sequences to select the same placement subtree. We pair BATCH-SCAMPP with EPA-ng to explore the capability of this approach for scaling to many query sequences. We show that this combination of techniques (which we call BSCAMPP+EPA-ng, or BSCAMPP(e)) not only provides high accuracy and scalability to large backbone trees, matching that of SCAMPP used with EPA-ng (i.e., SCAMPP(e)), but also achieves the goal of scaling sublinearly in the number of query sequences. Moreover, it is much more scalable than EPA-ng and faster than SCAMPP+EPA-ng: when placing 10,000 sequences into a backbone tree of 50,000 leaves, EPA-ng is unable to run due to memory issues, SCAMPP+EPA-ng requires 1421 minutes, and BSCAMPP(e) places all sequences in 7 minutes (all given the same computational resources, see Table 1).
The rest of the paper is organized as follows. We begin in Section 2 with preliminary experiments evaluating prior phylogenetic placement methods, motivating the need for a novel divide-and-conquer strategy to improve the scalability of EPA-ng. We then present our design of BSCAMPP(e) in Section 3. The experimental study where we design and evaluate BSCAMPP is described in Section 4, and the results are presented in Section 5. We provide a discussion of results in Section 6, and we finish with conclusions in Section 7. Due to space limitations, some of the results are provided in the Supplementary Materials of the full paper, available at [36].
2 Preliminary Experiments
We performed two sub-experiments: Experiment 0a examined App-SpaM (v1.03) and RAPPAS (v1.21), two methods that do not require that the query sequences be aligned to the backbone sequences. Both methods performed well in a prior study [7] in comparison to APPLES-2, but that study was limited to relatively small backbone trees (i.e., all but one of the backbone trees had at most 797 leaves, and the largest had 3748 leaves). Furthermore, App-SpaM and RAPPAS have not been compared to the SCAMPP-enabled likelihood-based methods. Therefore, we used Experiment 0a to determine if RAPPAS or App-SpaM were sufficiently scalable to include them in the comparison to BSCAMPP in our subsequent experiments. Experiment 0b studied EPA-ng, both in terms of its accuracy and its scalability (with the number of query sequences or the size of the backbone tree). The results of this experiment motivate the goals in developing a technique to enable EPA-ng to scale to large backbone trees, and also provide some insight into how to design such a strategy. For these experiments, we used two simulated datasets from prior studies evaluating phylogenetic placement methods: two subsets of RNASim [23] with up to 180,000 sequences in the backbone tree (placing at most 20,000 query sequences) and nt78 [28] with 68,132 sequences in the backbone tree (placing 10,000 query sequences). We explore phylogenetic placement error using the “delta error” (see Section 4.7) and runtime.
2.1 Experiment 0a: Evaluating App-SpaM and RAPPAS
Our own prior studies had already established that SCAMPP(e), SCAMPP(p), and APPLES-2 can scale to backbone trees with at least 200K sequences within a 64 GB memory limit [10, 35]. Therefore, in Experiment 0a, we selected App-SpaM and RAPPAS, the two methods studied in [7] with the best accuracy, to evaluate their suitability for placing into large trees, and compared them to APPLES-2, SCAMPP(p), and SCAMPP(e). This experiment was performed on three of our datasets: nt78, RNASim 50K, and RNASim 200K. All query sequences are fragmentary, with lengths 10% of the full sequence length. As with all our experiments, the methods were given 64 GB of memory and were limited to 4 hours of runtime.
The results from Experiment 0a when query sequences are aligned using UPP [26] are shown in Figure 1 (see Figures S1 and S2 in the Supplementary Materials in [36] for results given true alignments of query sequences). These results showed that RAPPAS failed to complete on any of these datasets due to high memory requirements. App-SpaM was able to complete on the nt78 and RNASim 50K datasets, but was less accurate than APPLES-2 on both (in turn, APPLES-2 was much less accurate than the SCAMPP methods). App-SpaM was the fastest method, since the other methods require the query sequences to be aligned to the backbone alignment; indeed, the vast majority of the APPLES-2 runtime is spent computing this alignment (see Tables S1 and S2 in the Supplementary Materials of the full paper [36]). However, App-SpaM was unable to complete on the RNASim 200K dataset, and had higher memory requirements than the other tested methods on these datasets. Given these results, we did not include RAPPAS or App-SpaM in the subsequent experiments.
2.2 Experiment 0b: Understanding EPA-ng
Here we focus on EPA-ng for placing short sequences (the metagenomics context; see the Supplementary Materials in [36] for results when also placing full-length sequences). We extracted a subset of the RNASim training dataset (see Section 4.4) for this experiment: a clade with 10,200 leaves, from which we randomly selected 200 leaves as queries. From the remaining 10,000 leaves, we randomly selected between 1,000 and 10,000 leaves to form the backbone tree. The queries were made into fragments ranging from 10% to 50% of the full length. All queries were placed into the backbone trees using EPA-ng in one of two modes: “batch”, where we place all 200 queries at once (a feature provided by EPA-ng), and “one-by-one”, where we place each query independently by running EPA-ng on each query. We show placement error (using delta error; see Section 4.7 for the definition) for various backbone tree sizes and placement strategies (batch vs. one-by-one). Results for 10%- and 25%-length queries are shown in Figure 2 (see Figure S3 in the Supplementary Materials in [36] for the full set of results).
For both query sequence lengths, accuracy improves for one-by-one placement as the backbone size increases, indicating the beneficial impact of taxon sampling. Interestingly, this is not true for batch placement: there, accuracy depends strongly on query length, and very fragmentary sequences are placed well only on small backbone trees (up to 2,000 leaves).
Since our interest is in placing short sequences (corresponding to read placement), we focus now on the results for queries of 10%-length. For these very short query sequences, we see a jump in delta error for batch placement as the backbone tree size increases from 2,000 to 5,000 leaves and beyond; high accuracy is thus only obtained when placing into smaller trees. Hence, EPA-ng performs very differently in batch mode (the case where it obtains a runtime advantage, according to prior studies) than in one-by-one mode. When the backbone tree is small enough, batch mode can be very accurate, but on larger backbone trees the one-by-one mode is more accurate for placing fragmentary sequences. The key lesson is that if we wish to obtain good accuracy when placing fragmentary sequences, we need to either use EPA-ng in one-by-one mode or keep the backbone tree to at most 2,000 sequences.
We next performed an experiment to understand the impact of varying the number of query sequences placed with EPA-ng and pplacer. Query sequences are all made fragmentary to 10% of the full sequence length. We show results on a 1000-leaf backbone tree for various numbers of query sequences (Figure 3). Note that EPA-ng takes less time to place 100 and 1000 sequences than it does to place 10 sequences, and more generally that its runtime grows sublinearly in the number of query sequences. Furthermore, neither its peak memory usage nor the delta error notably increase when placing more queries. In contrast, pplacer’s runtime grows much more quickly than EPA-ng’s.
These trends show that EPA-ng is fast when run in batch mode (instead of one-by-one), but only when placing into relatively small trees. Moreover, when placing into relatively small trees (i.e., less than 5000 leaves) using batch mode, its runtime is sublinear in the number of query sequences. These two observations motivate the development of BATCH-SCAMPP, the subject of the next section.
3 BATCH-SCAMPP
BATCH-SCAMPP (which we will call BSCAMPP for short) was designed with the goal of developing a divide-and-conquer strategy that allows EPA-ng to run on datasets with large backbone trees; however, the approach can be used with any phylogenetic placement method. We assume that we are given an input backbone tree T, its associated multiple sequence alignment of the sequences at the leaves of the tree, and a set Q of query sequences that we will place into T.
At a high level, our approach has four stages. Stage 1: the sequences in Q vote for their top v placement subtrees, each of bounded size B (both parameters specified by the user); Stage 2: we construct a set 𝒯 of placement subtrees using the votes and assign each query in Q to one of the placement subtrees; Stage 3: we allow reassignments of query sequences to subtrees; and Stage 4: we run EPA-ng on the placement subtrees to add the assigned query sequences into the subtrees, and then use branch lengths to find the appropriate positions in the backbone tree T.
This four-stage approach is an elaboration of the SCAMPP technique, except that in SCAMPP each query sequence picks a single placement subtree, and there can be as many placement subtrees as there are query sequences.
In a collection of initial experiments (all on the same training data), we explored four different variants of this four-stage approach (described in the Supplementary Section S3.1 in the Supplementary Materials of the full paper [36]). These variants differ mainly in Stage 2, which uses the votes to determine the subtree to assign each query sequence. In comparing the different strategies, we optimized both accuracy and computational effort, and selected the variant that had the best combined accuracy and speed. We refer to the selected variant as BATCH-SCAMPP or BSCAMPP, noting that it has two algorithmic parameters (subtree size B and the number v of votes).
The input to BSCAMPP is a backbone tree T with edge lengths, a set Q of query sequences, and a phylogenetic placement method. When used with EPA-ng (the default context), we refer to the combination as BSCAMPP(e). BSCAMPP has two parameters: B (the size of the subtrees it creates) and v (the number of votes per query sequence). BSCAMPP(e) has the following four stages (a code sketch of Stages 1–3 appears after the listing):
Stage 1 (Initialization):
− We compute the Hamming distance between every query sequence q in Q and every leaf in the tree T. Every query sequence q votes for the v leaves closest to it in Hamming distance; its single closest leaf is denoted closest(q).
− We initialize 𝒯 to the empty set (∅).
Stage 2 (Constructing 𝒯, the set of subtrees, and initial assignment of query sequences):
− While there are some query sequences not yet assigned to any subtree, DO:
* Choose the leaf x with the most votes to use as a seed sequence, and build a subtree around x using a breadth-first search based on branch length until B leaves are in the subtree; call this tree tx. We refer to x as the seed for tx. Add subtree tx to the collection 𝒯.
* For every query sequence q, if the leaf closest(q) is in tx, then assign q to tx and remove the votes cast by q.
Stage 3 (Allow reassignments of query sequences):
− For every query sequence q and for every tree tx ∈ 𝒯,
* compute the path distance (taking branch lengths into account) from closest(q) to the seed x. Let yq be the seed with the minimum such distance to closest(q), and reassign q to tyq if q is not already assigned to that subtree.
Stage 4 (Perform Placement):
− For every subtree tx ∈ 𝒯,
* Run EPA-ng on tx and its assigned query sequences to produce t′x.
* Use the technique from SCAMPP to add the query sequences in t′x to T (the backbone tree); this technique uses branch lengths in both the placement subtree and the full backbone tree.
− Return a file containing all placements, with their associated confidence scores, distal lengths, placement edge numbers, etc.
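To make the stage descriptions above concrete, the following is a minimal Python sketch of Stages 1–3, assuming the Hamming distances have already been computed and the backbone tree is given as an adjacency dictionary with branch lengths. It is an illustration of the logic, not the BSCAMPP implementation; all function names and data structures here are assumptions made for the example. Stage 4 then simply runs the placement method on each subtree with its assigned queries and relocates the results into the full tree, as in SCAMPP.

import heapq
from collections import Counter

def stage1_vote(hamming, v):
    """hamming[q] maps each backbone leaf to its Hamming distance to query q.
    Each query votes for its v closest leaves; closest[q] is its nearest leaf."""
    votes, closest, ballots = Counter(), {}, {}
    for q, dists in hamming.items():
        ranked = sorted(dists, key=dists.get)[:v]
        ballots[q], closest[q] = ranked, ranked[0]
        votes.update(ranked)
    return votes, closest, ballots

def grow_subtree(adj, seed, B):
    """Collect up to B backbone leaves near `seed`, visiting nodes in order of
    branch-length distance; adj[u] maps each neighbor of u to the branch length."""
    leaves, seen, heap = set(), {seed}, [(0.0, seed)]
    while heap and len(leaves) < B:
        dist, node = heapq.heappop(heap)
        if len(adj[node]) == 1:        # degree-1 node: a leaf of the backbone tree
            leaves.add(node)
        for nbr, blen in adj[node].items():
            if nbr not in seen:
                seen.add(nbr)
                heapq.heappush(heap, (dist + blen, nbr))
    return leaves

def path_distances(adj, source):
    """Branch-length path distances from `source` to every node (Dijkstra)."""
    dist, heap = {source: 0.0}, [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue
        for nbr, blen in adj[node].items():
            nd = d + blen
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

def assign_queries(adj, hamming, v, B):
    votes, closest, ballots = stage1_vote(hamming, v)            # Stage 1
    subtrees, assignment = {}, {}                                # seed -> leaf set, query -> seed
    unassigned = set(hamming)
    while unassigned:                                            # Stage 2
        seed = votes.most_common(1)[0][0]                        # most-voted leaf seeds a subtree
        subtrees[seed] = grow_subtree(adj, seed, B)
        for q in list(unassigned):
            if closest[q] in subtrees[seed]:
                assignment[q] = seed
                votes.subtract(ballots[q])                       # retire this query's votes
                unassigned.remove(q)
        del votes[seed]                                          # do not reuse this seed
    seed_dists = {s: path_distances(adj, s) for s in subtrees}   # Stage 3: reassign each query
    for q in assignment:
        assignment[q] = min(subtrees, key=lambda s: seed_dists[s][closest[q]])
    return subtrees, assignment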
Implementation details
BSCAMPP(e) is written in Python, with certain parts implemented in C++ using OpenMP, on top of EPA-ng’s own parallel implementation. Since the Hamming distance computation is a computationally intensive portion of the BSCAMPP(e) framework (requiring 𝒪(rql) time for q queries of length l compared against r reference leaves), the parallel OpenMP implementation allows for efficient batch processing of queries.
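As an illustration of this kernel (and not the actual C++/OpenMP code), a vectorized NumPy version of the all-against-all Hamming distance computation might look as follows; treating gap characters in the query as non-mismatches, and materializing the full comparison array in memory, are simplifying assumptions of this sketch.

import numpy as np

def hamming_matrix(queries, references):
    """queries: (q, l) uint8 array; references: (r, l) uint8 array of aligned
    characters. Returns a (q, r) matrix of Hamming distances, skipping
    positions where the query has a gap ('-') so fragments are not penalized."""
    gap = np.uint8(ord("-"))
    mismatch = (queries[:, None, :] != references[None, :, :]) & (queries[:, None, :] != gap)
    return mismatch.sum(axis=2)

# Toy example: two 4-character queries compared against two references.
Q = np.frombuffer(b"AC-TGGTT", dtype=np.uint8).reshape(2, 4)
R = np.frombuffer(b"ACGTGCTT", dtype=np.uint8).reshape(2, 4)
print(hamming_matrix(Q, R))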
4 Performance Study Design
4.1 Overview
We evaluate placement methods for use with short sequences, an application similar to placing short reads into phylogenetic trees or taxonomies. Since many reads are characterized in each run, scalability to large numbers of reads is a relevant concern. We use both simulated and biological datasets in this study, dividing them into training data and testing data.
Our training data are simulated datasets with known true trees, and the testing data are both biological datasets (with estimated maximum likelihood trees) as well as simulated datasets. We report placement error using “delta error”, a standard metric used in prior studies [24, 3] (see Section 4.7).
In addition to the preliminary experiments (i.e., Experiment 0, reported in Section 2), we ran the following experiments for this study:
Experiment 1 is the design of BSCAMPP(e) (i.e., BSCAMPP used with EPA-ng) on training data.
Experiment 2 compares BSCAMPP(e) to other phylogenetic placement methods on testing data.
Experiment 3 evaluates BSCAMPP(e) with a large backbone tree (180,000 leaves) from the test data and varying the number of query sequences.
See Appendix B for additional details about datasets and commands used in our experimental study.
4.2 Methods
In Experiments 1–3, we compare EPA-ng (v0.3.8), pplacer, APPLES-2 (v2.0.11), SCAMPP (v2.1.1) used with pplacer or EPA-ng (i.e., SCAMPP(p) and SCAMPP(e)), and BSCAMPP used with EPA-ng (i.e., BSCAMPP(e)). Within SCAMPP(p) we use pplacer (v1.1.alpha19) with taxtastic [12] for placement, since this improves accuracy and scalability according to [10, 1].
Some phylogenetic placement methods require numeric parameters to be estimated for the backbone trees. All backbone tree numeric parameters (branch lengths, 4 × 4 substitution rate matrix, etc.) are re-estimated according to the specifications of the phylogenetic placement method: RAxML-ng (v1.0.3) [15] for EPA-ng, SCAMPP(e), and BSCAMPP(e), and FastTree-2 (v2.1.11) [28] under Minimum Evolution for APPLES-2.
4.3 Computational resources
All methods are given four hours of runtime with 64 GB of memory and 16 cores. These analyses were run on the UIUC Campus Cluster, which is heterogeneous (i.e., some machines are older and hence slower than others); thus, although all methods are given 16 cores and 64 GB, different analyses may run on machines with different performance. For SCAMPP(e) and SCAMPP(p), when the placement time exceeded four hours (which occurred in all experiments with 10,000 query sequences), the query sequences were split into 40 subsets of 250 sequences each, and SCAMPP(e) and SCAMPP(p) were then run once on each subset (40 runs in total).
4.4 Datasets
All datasets are from prior studies and are freely available in public repositories.
4.4.1 RNASim
We use samples from the RNASim dataset [23], which is a simulated dataset of 1,000,000 sequences that evolve down a model tree under a biophysical model to preserve RNA secondary structure. Subsets of the million-sequence simulated dataset were used in prior studies evaluating phylogenetic placement methods [3, 1, 35], and provide a substantial challenge due to the dataset size. For this study, we split this dataset into two subsets by taking the model tree and splitting it into two clades, with one having approximately 600,000 sequences and the other having approximately 400,000 sequences. This defines two sets of sequences, with the smaller one used for training (Experiment 1) and the larger one for testing (Experiments 2 and 3). We place into a maximum likelihood tree on the true alignment (estimated using FastTree2), using the true alignment.
4.4.2 nt78
We also use the nt78 dataset, which was simulated for the FastTree-2 study [28]; it contains 10 replicates, simulated with Rose [32], each with 78,132 sequences, a multiple sequence alignment, and the simulated (true) tree. We picked one replicate at random, using 68,132 sequences for the backbone and 10,000 sequences as query sequences. We placed the query sequences into a maximum likelihood tree estimated using FastTree-2 [28] on the true alignment. This dataset is used in the training phase (Experiment 1).
4.4.3 16S.B.ALL
For biological dataset analysis, we use 16S.B.ALL, a large biological dataset with 27,643 sequences and a reference alignment based on structure from The Comparative Ribosomal Website (CRW) [8]. 16S.B.ALL contains some duplicate sequences; these were removed before analysis, producing a dataset with 25,093 sequences. Of these, 5,093 sequences were randomly selected as query sequences and the remaining were made backbone sequences. A maximum likelihood tree for this dataset was computed for the SATé-II [19] study on this reference alignment using RAxML [31] and serves as the backbone tree into which we place the query sequences. When computing delta error, we used the 75% bootstrap tree (i.e., the result of collapsing all edges with support below 75%) as the reference topology. The maximum likelihood tree and the 75% bootstrap tree are available at [22].
4.5 Fragmentary sequence generation
We generated fragments from the full-length sequences starting at a randomly selected location. The fragmentary sequences are a mean of 10% of the original ungapped sequence length with a standard deviation of 10 nucleotides.
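A minimal sketch of this step (the study itself used the get_frag_aln.py script listed in Appendix B; the clipping behavior at the sequence ends here is an assumption of the example):

import random

def make_fragment(seq, mean_fraction=0.10, sd_nt=10.0):
    """Return a random fragment of `seq` (gaps removed), with length drawn from
    a normal distribution with mean 10% of the ungapped length and standard
    deviation 10 nt, starting at a randomly selected position."""
    ungapped = seq.replace("-", "")
    frag_len = int(round(random.gauss(mean_fraction * len(ungapped), sd_nt)))
    frag_len = max(1, min(frag_len, len(ungapped)))
    start = random.randrange(0, len(ungapped) - frag_len + 1)
    return ungapped[start:start + frag_len]

# Example: a roughly 100 nt fragment of a 1,000 nt sequence.
print(len(make_fragment("ACGT" * 250)))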
4.6 Additional details about experiments
In Experiment 1, we use ∼400,000 sequences from the RNASim dataset, from which we randomly select 50,000 sequences to define the backbone sequences and 10,000 sequences as query sequences; the remaining 340,000 sequences are not used. Sites containing more than 95% gaps in the true alignment restricted to these 60,000 sequences are removed. To define the backbone tree on the backbone sequences, we run FastTree-2 under the GTRGAMMA model on the true alignment.
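A minimal sketch of the column-masking step (the study used the build_dataset.py script listed in Appendix B; the list-of-strings alignment representation is an assumption of this example):

def mask_gappy_columns(alignment, max_gap_fraction=0.95):
    """Remove alignment columns in which the fraction of gap characters
    exceeds max_gap_fraction (here, columns with more than 95% gaps)."""
    n_rows = len(alignment)
    keep = [c for c in range(len(alignment[0]))
            if sum(row[c] == "-" for row in alignment) / n_rows <= max_gap_fraction]
    return ["".join(row[c] for c in keep) for row in alignment]

# Example: the middle column (all gaps) is removed.
print(mask_gappy_columns(["A-C", "A-C", "T-G"]))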
For the nt78 datasets, we use the third replicate for the experiment. As the true alignment is not very gappy, we do not perform any masking. We pick 10,000 sequences at random for the query sequences and use the remaining 68,132 sequences as backbone sequences. We also define the backbone tree by running FastTree-2 on the true alignment of the backbone sequences.
4.7 Evaluation Criteria
We report placement error using average delta error [24, 35, 1], where the delta error for a single query sequence is the increase in the number of missing branches (FN) produced by adding the query sequence into the backbone tree, and hence is always non-negative; this is the same as the node distance when the backbone tree is the true tree. This requires the definition of the “true tree”, which is the model tree for the simulated data and the published reference tree for the biological data. The methods are also evaluated with respect to runtime and peak memory usage.
5 Results
5.1 Experiment 1: BSCAMPP(e) design
In designing BSCAMPP, we considered four different strategies, described in the Supplementary Materials Section S3.1 of the full paper at [36]. We used the training data for this exploration. In comparing the variants for speed and accuracy, we found that variant 4 (see Section 3) provided accuracy comparable to that of the most accurate variants while having better computational performance (Figures S4 and S5 in the Supplementary Materials of the full paper at [36]). Based on this, we selected variant 4.
Having selected variant 4, we then performed additional experiments on the two training datasets to set the values of the two algorithmic parameters: the size of the subtrees and the number of votes per query sequence. We varied the subtree size over 1,000, 2,000, 3,000, 5,000, and 10,000 leaves. For each subtree size, we ran BSCAMPP(e) with 5 and 25 votes per query sequence. Results for this experiment are shown in Figure 4 (see also Tables S3 and S4 in the Supplementary Materials of the full paper [36]).
Note that when the subtree size increases beyond 2,000 leaves, BSCAMPP(e) has more than twice the error it has for 1,000- and 2,000-leaf subtrees. The number of votes does not produce a large change in delta error for either dataset, but increasing the number of votes can noticeably increase the runtime in some cases. The runtime trends are dataset-dependent: on the RNASim dataset, placements for all subtree sizes and vote settings completed in under 15 minutes, whereas the nt78 dataset required over 50 minutes for some settings. On the nt78 dataset, where the runtime was more affected by parameter settings, BSCAMPP(e) was fastest at 14.1 minutes with 5 votes and a subtree size of 1,000. For both subtree sizes where the delta error is low, using 5 votes for parameter v results in faster or similar runtimes.
Based on these trends, a subtree size of 2,000 and 5 votes were chosen as the defaults for parameters B and v, respectively. Using these parameters, we compare BSCAMPP(e) to other methods on the training datasets in Figure 5. On both training datasets, BSCAMPP(e) is nearly as accurate as SCAMPP(e) and SCAMPP(p) while taking a fraction of the runtime. SCAMPP(p) is the most accurate method but takes 27 and 23 hours to complete on RNASim and nt78, respectively. APPLES-2 is the fastest method across both datasets but shows the highest error. EPA-ng is unable to run due to out-of-memory errors. In contrast, BSCAMPP(e) places all 10,000 sequences accurately and quickly, while using only 3 GB of memory.
5.2 Experiment 2: Comparing BSCAMPP(e) to other methods
The purpose of Experiment 2 is to evaluate BSCAMPP(e) in comparison to other placement methods on the test datasets, which contain large backbone trees (i.e., trees with at least 20,000 leaves and up to 180,000 leaves) and large numbers of fragmentary query sequences (i.e., 10% the length of the full sequence from which the query is obtained). We compare BSCAMPP(e) to SCAMPP(e), SCAMPP(p), APPLES-2, and EPA-ng when placing fragmentary sequences into a backbone tree.
We show results for RNASim with 50,000 sequences in the backbone tree and 16S.B.ALL with 20,000 sequences in the backbone tree in Figure 6. EPA-ng failed to run on the RNASim dataset given 64 GB of memory, reporting an out-of-memory issue; however, EPA-ng was able to place all sequences for the 16S.B.ALL dataset within the memory limit. An examination of the memory usage shows that EPA-ng had a much higher memory requirement (37 GB) on the 16S.B.ALL dataset than the other methods (none of which exceeded 3 GB). Thus, EPA-ng demonstrates challenges on these two datasets due to memory requirements. All other methods completed on these datasets within the available time.
The methods differed substantially in the runtime needed to place all the query sequences (10,000 for RNASim and 5,093 for 16S.B.ALL). APPLES-2 never used more than 9 minutes on either dataset; BSCAMPP(e) never used more than 8 minutes on either dataset; and EPA-ng used 17 minutes on 16S.B.ALL. In contrast, SCAMPP(e) used 214 and 466 minutes on 16S.B.ALL and RNASim, respectively, while SCAMPP(p) used 952 and 1421 minutes on these two datasets. Thus, APPLES-2, BSCAMPP(e), and EPA-ng are very fast, completing in just a few minutes on these datasets, while SCAMPP(e) and SCAMPP(p) are much slower.
The methods also differed in terms of placement accuracy. Specifically, APPLES-2 had higher error than the likelihood-based methods for both datasets (delta error above 1.0 in both cases), and EPA-ng had higher delta error than the other likelihood-based methods on the one dataset on which it ran (16S.B.ALL). The comparison between BSCAMPP(e), SCAMPP(e), and SCAMPP(p) shows very close accuracy on both datasets (all close to 0.5), with perhaps a very slight edge to SCAMPP(p) on the RNASim dataset.
5.3 Experiment 3: Method comparison on ultra-large trees
Our final experiment is designed to verify that BSCAMPP(e) scales to both large query sets and ultra-large backbone trees, and to check that it achieves sub-linear runtime scaling with respect to the number of query sequences. We use RNASim 200K for this study, with a backbone tree of 180,000 leaves, placing from 20 to 20,000 query sequences into the tree. We limited this study to BSCAMPP(e) and APPLES-2, as these were the only methods able to place into the 50,000-leaf tree within 1 hour.
Results in Figure 7 show that BSCAMPP(e) is able to place 20,000 sequences in just 25 minutes, whereas placing 20 sequences requires 1 minute. The runtime for BSCAMPP(e) increases by factors of 6.7, 2.5, and 1.5 for each tenfold increase in the number of queries. Similarly, APPLES-2 shows a sub-linear curve; its runtime increases by factors of 3.6, 3.8, and then 6.9 for each tenfold increase. APPLES-2 is faster for 20, 200, and 2,000 query sequences, but BSCAMPP(e) is faster at 20,000 query sequences. Moreover, both methods complete in under 1 hour, even when placing 20,000 query sequences into this large tree with 180,000 leaves. Finally, both methods use under 3 GB of memory, well under the limit of 64 GB.
The results in Figure 7 also allow us to compare the methods for accuracy and to see how the number of query sequences affects it. We see that the delta error of BSCAMPP(e) is not impacted by the number of query sequences, and once at least 200 sequences are placed, the delta error for APPLES-2 is also not impacted, as expected. In addition, BSCAMPP(e) shows much lower delta error than APPLES-2 (consistent with trends in the previous experiments).
5.4 Computational performance
When restricting our attention to APPLES-2 and BSCAMPP(e), we find that both were very fast, with no clear dominance of one over the other (Table 1). When considering only the time to do phylogenetic placement (and so assuming the query sequences are given in an alignment to the backbone sequences), APPLES-2 never used more than 47.1 minutes and BSCAMPP(e) never used more than 25.5 minutes. We also see that although SCAMPP(e) and SCAMPP(p) were able to run on all the datasets, they were by far the most computationally intensive of the tested methods. Furthermore, SCAMPP(e) used from 30 to 70 times as much time as BSCAMPP(e).
We also note that BSCAMPP(e) is able to run on datasets where EPA-ng fails due to out-of-memory issues. Indeed, the only dataset in our study where EPA-ng was able to complete was the 16S.B.ALL dataset, and on this dataset, EPA-ng has higher error than BSCAMPP(e). The memory usage was never that high for any method that succeeded in running (the most was EPA-ng, at 37 GB), but as no method had access to more than 64 GB, it is possible that EPA-ng might have succeeded in running if it had access to a larger amount of memory.
6 Discussion
Our study established that RAPPAS, App-SpaM, and EPA-ng had high memory requirements, making all of these methods unable to scale to the largest dataset we examined (RNASim 200K, with 180K sequences in the backbone tree and 20,000 query sequences). Indeed, EPA-ng and RAPPAS were unable to run on the RNASim 50K dataset due to their memory requirements. Thus, these three methods had higher memory requirements than the remaining methods, which included APPLES-2 and the SCAMPP- or BSCAMPP-enabled methods. This finding is perhaps to be expected: the SCAMPP-, BSCAMPP-, and APPLES-2-based methods all use divide-and-conquer, limiting the phylogenetic placement effort to small datasets (at most 2,000 leaves in the relevant placement subtree).
APPLES-2, BSCAMPP(e), and App-SpaM were the fastest methods, with a definite advantage to App-SpaM when the time to compute the alignment of the query sequences to the backbone alignment is considered. However, as noted above, App-SpaM did not complete on the largest dataset. EPA-ng was the slowest of the methods we tested, and it too did not complete on large datasets (due to memory requirements).
Our experiments consistently showed that the likelihood-based placement methods had the lowest delta error of the tested methods, with relatively minor differences between them (although EPA-ng was less accurate than the other likelihood-based methods in some cases), making computational performance the main distinguishing feature. Hence, the finding that BSCAMPP(e) is much faster than the other likelihood-based methods on large datasets is noteworthy. We also observed that BSCAMPP(e) improves on the accuracy of APPLES-2, in many cases dramatically, and even improves on speed when the number of query sequences is large enough (for smaller numbers of query sequences, APPLES-2 is faster).
To understand these trends, we consider what we learned about EPA-ng’s scalability on large backbone trees, in terms of both computational performance and accuracy. Prior studies have suggested limiting EPA-ng to relatively small backbone trees for computational reasons [35, 1, 3], but our Experiment 1 showed that EPA-ng had a jump in placement delta error as we increased the subtree size on the nt78 and RNASim datasets. Thus, our study suggests that EPA-ng may have numeric issues when placing into very large trees that result in increased placement error, a trend previously observed for pplacer [35]. We therefore hypothesize that when placing into very large trees, even if EPA-ng manages to finish, it may have reduced accuracy due to numeric issues, and that BSCAMPP(e) can have higher accuracy than EPA-ng because it restricts the placement subtrees given to EPA-ng to be relatively small (at most 2,000 sequences).
From a computational viewpoint, the main issue for EPA-ng is its large memory usage, which can explain why EPA-ng is limited to backbone trees of at most moderate size. Thus, by limiting the placement tree size to at most 2000 sequences, not only does BSCAMPP(e) succeed in maintaining good accuracy, but it also ensures good scalability to very large backbone trees.
Our study suggests that BSCAMPP(e) may be able to scale to much larger numbers of query sequences while maintaining its sublinear runtime. The basis for this optimism is that EPA-ng has been shown to place millions of queries into smaller trees of a few hundred sequences [4], whereas we have limited this study to at most 20,000 query sequences (placed into much larger backbone trees). Moreover, Experiment 3 suggests that the full efficiency of BSCAMPP(e) has not yet been reached at 20,000 query sequences, and that it may be able to place even larger numbers of query sequences with runtimes that continue to grow sublinearly, even for backbone trees of this size.
7 Conclusions
Phylogenetic placement into large backbone trees is fundamental to several bioinformatics problems, including microbiome analysis (e.g., taxonomic characterization of shotgun sequencing reads) and updating large phylogenetic trees. Yet only a small number of methods provide acceptable accuracy and scalability. SCAMPP [35] was designed to improve the scalability of likelihood-based phylogenetic placement methods to large backbone trees, and it enabled good accuracy in this context. However, by design SCAMPP did not ensure that many query sequences would be placed into the same subtree, which meant that it could not take advantage of EPA-ng’s ability to place many query sequences with a runtime that is sublinear in the number of query sequences. Our redesign, which we call BATCH-SCAMPP (or BSCAMPP for short), matches the accuracy of SCAMPP and its scalability to large backbone trees, and when used with EPA-ng (i.e., BSCAMPP(e)) achieves sublinear runtime in the number of query sequences. Moreover, BSCAMPP is extremely fast: nearly as fast as APPLES-2 when placing a moderate number of query sequences into a large backbone tree, and slightly faster when placing large numbers of query sequences into very large backbone trees. Thus, BSCAMPP is the first likelihood-based phylogenetic placement method that is both highly accurate (due to its use of likelihood) and scalable in both dimensions, the number of query sequences and the size of the backbone tree (due to its use of divide-and-conquer and its reliance on EPA-ng for placing many query sequences into small subtrees).
This study leaves several directions for future research. A more extensive study involving additional methods (e.g., UShER [33]) is clearly called for. Such a study should also explore phylogenetic placement of full-length sequences, and possibly also consider the problem of placing genome-length sequences. This particular direction raises issues of heterogeneity across the genome, a problem that is addressed by the DEPP [14] method for phylogenetic placement. In addition, while BSCAMPP is fast, a possible improvement for the runtime can be explored by implementing parallel processing of subtrees (i.e., running instances of EPA-ng in parallel for different query-subtree sets). This might be particularly helpful in cases with few queries per subtree. Future work should include an analysis of the runtime and memory usage impacts of running EPA-ng concurrently on multiple compute nodes, in addition to running multiple instances of EPA-ng using fewer threads on a single compute node.
There are applications of phylogenetic placement methods that could be improved through the use of larger backbone tree sizes and many query sequences, now enabled with better accuracy through BSCAMPP. One such application is that of de novo phylogenetic tree building. Studies have shown that scalable phylogeny estimation methods have suffered in the presence of sequence length heterogeneity, and phylogenetic placement may provide a more accurate alternative [13]. For example, the divide-and-conquer tree estimation pipeline GTM (Guide Tree Merger) [30] could benefit by using BSCAMPP. BSCAMPP can facilitate the initial tree decomposition of GTM for better placement of shorter, fragmentary sequences into an initial tree containing the longer full-length sequences, potentially leading to better final tree estimation.
Another application with very large backbone trees with many query sequences is taxon identification and abundance profiling. For example, TIPP [25] and its improved version TIPP2 [29] use phylogenetic placement (based on pplacer) to place query sequences into gene-specific taxonomies. As shown in [29], the accuracy improvement of TIPP2 over TIPP is mainly due to the increased size of the backbone trees in TIPP2 over TIPP, suggesting that the use of BSCAMPP(e) on large trees might lead to increased accuracy and reduced runtime compared to the current use of pplacer and smaller backbone trees.
8 Data availability
All datasets used in this study are from prior publications.
A Delta Error Definition
The delta error computed for this study is always measured for a single query, and the mean for all queries in a query set is reported. We begin with some necessary notation.
A.1 Notation
T : the backbone tree used as input for the phylogenetic placement method with leafset S′
q: the query sequence to be added to T
S: the set S′ ∪ {q}
P : the tree on the leaf set S that results from adding q to T using a phylogenetic placement method.
T *: the true tree on S
T *|S′ : the true tree restricted to the leafset S′, i.e., the result of removing q from T *
Note that P and T* have the same leaf set S, and that T*|S′ and T have the same leaf set S′. Thus we can define the missing branches of P with respect to T*, and the missing branches of T with respect to T*|S′, as we now do.
A.2 The missing branches and bipartitions
Each edge in a tree defines a bipartition on the leaf set, produced by deleting the edge but not its endpoints. The set of bipartitions of a tree t is denoted by B(t). The set of missing branches of an estimated tree t′ with respect to a true tree t on the same leaf set is therefore B(t) \ B(t′).
A.3 Formula for delta error
The delta error for a query sequence q is defined as |B(T*) \ B(P)| − |B(T*|S′) \ B(T)|. In other words, the delta error is the increase in the number of missing branches produced by adding the new taxon q. This value is always non-negative.
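A small sketch of this computation, operating directly on bipartition sets (the study used the measure.sh script described in Appendix B.4; how bipartitions are extracted from the trees is assumed to be handled elsewhere):

def missing_branches(true_biparts, estimated_biparts):
    """Number of bipartitions of the true tree that are absent from the
    estimated tree, i.e., |B(true) - B(estimated)|."""
    return len(true_biparts - estimated_biparts)

def delta_error(b_true_full, b_placed, b_true_restricted, b_backbone):
    """Delta error for one query q: |B(T*) - B(P)| minus |B(T*|S') - B(T)|;
    each bipartition is encoded, e.g., as a frozenset of two frozensets of
    leaves, so orientation does not matter. Always non-negative."""
    return (missing_branches(b_true_full, b_placed)
            - missing_branches(b_true_restricted, b_backbone))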
A.4 Comments
By restricting the edges that count toward the delta error to those errors introduced by the phylogenetic placement itself, rather than those already present in the estimated backbone tree T, the performance of the placement method can be measured directly, without bias from error in the estimated backbone tree.
The delta error is an extension of the node distance (the number of nodes separating each query sequence’s true and target placement), originally used in [20, 5, 17], in that it removes the bias due to estimated tree error when the true tree topology is known. For biological datasets, where the true tree is unknown, we use the estimated tree as the true tree and place into the estimated tree, in which case the delta error is the same as the simple node distance [35].
B Datasets and Commands
All code is available on GitHub at https://github.com/ewedell/BSCAMPP_code.
B.1 Dataset Preparation
The datasets were built using the following commands, with scripts included on the GitHub site. For the RNASim dataset, all steps are followed. For the nt78 dataset, 10,000 queries were randomly selected, and bullets 1 and 2 below were skipped. In addition, the 16S.B.ALL dataset contained duplicate sequences; we removed these from the reference and “true topological” tree and alignments before running any method.
To generate replicates, choose query sequences, and remove sites with over 95% gaps: python3 ./build_dataset.py -m ./true.fasta -t ./true.tt -o ./ -x a
To convert from RNA to DNA: sed ‘/^[^>]/ y/uU/tT/’ aln.fa > aln_dna.fa
To separate query and reference alignments: python3 ./build_dataset.py -m aln_dna.fa -o ./ -x b -q query.txt
To estimate the backbone tree before queries are pruned using FastTree v2.1.11 single precision [27]: FastTreeMP -nosupport -gtr -gamma -nt -log ./true.fasttree.log < ./aln_dna.fa > ./true.fasttree
For 16S.B.ALL where the published reference topology was used, we re-estimated the branch lengths for pplacer with FastTree v2.1.11 using the following command: FastTreeMP -nosupport -gtr -gamma -nt -mllen -nome -intree reference.tree -log true.fasttree.log < aln_dna.fa > true.fasttree
To re-estimate branch lengths for use with EPA-ng using RAxML-ng v1.0.3 [16]: /raxml-ng/bin/raxml-ng --evaluate --msa ref.fa --model GTR+G --tree tree --brlen scaled
To build fragmentary sequences: python3 get_frag_aln.py -i query.fa -p 1 -a .10 -s 10 -o a_higher_frag.fa -of ua_higher_frag.fa
To re-estimate branch lengths under Minimum Evolution for use with APPLES-2 using FastTree v2.1.11 single precision [27]: FastTreeMP -nosupport -nt -nome -noml -log fasttree.log -intree reference.nwk < alignment.fasta > fasttree.nwk
To build a reference package with taxtastic taxit [12] for use with SCAMPP(p) (backbone_pplacer.tree is the tree with query sequences pruned, preserving branch lengths from the FastTree-2 estimated tree, and true.fasttree.log is the log file created by running bullet 4 or 5 above): taxtastic-env/bin/taxit create -P my.refpkg -l 16S --aln-fasta ref.fa --tree-file backbone_pplacer.tree --tree-stats true.fasttree.log
B.2 Phylogenetic Placement Commands
The phylogenetic placement methods in this study are called using the following commands:
BSCAMPP(e) (v1.0.0) python3 EPA-ng-BSCAMPP.py -t backbone_epa.tree -a ref.fa -q a_higher_frag.fa -b {subtreesize} -i aln_dna.fa.raxml.bestModel -d {dir} -V {votes} -f True
SCAMPP(e) (v2.1.1) python3 EPA-ng-SCAMPP.py -t backbone_epa.tree -a ref.fa -q a_higher_frag.fa -b {subtreesize} -i aln_dna.fa.raxml.bestModel -d {dir} -f True
SCAMPP(p) (v2.1.1) python3 pplacer-SCAMPP-tax.py -t backbone_pplacer.tree -a ref.fa -q a_higher_frag.fa -b {subtreesize} -i true.fasttree.log -d {dir} -f True
EPA-ng (v0.3.8) epa-ng --ref-msa ref.fa --tree backbone_epa.tree --query a_higher_frag.fa --outdir {dir} --model aln_dna.fa.raxml.bestModel --redo -T 16
APPLES (v2.0.11) python3 run_apples.py -t backbone_app.tree -s ref.fa -q a_higher_frag.fa -T 16 -o APPLES-2-f.jplace -X -D
RAPPAS (v1.21) java -Xmx64G -jar RAPPAS.jar -p b -s nucl -b ./raxml-ng/bin/raxml-ng -w ./ -r ref.fa -t backbone_epa.tree --use_unrooted
App-SpaM (v1.03) appspam -t backbone_epa.tree -s ref.fa -q ua_higher_frag.fa -o appspam-f.jplace
Scripts with commands used to run each method are also available on GitHub.
B.3 Query sequence alignment
Alignments of the query sequences to the backbone were estimated using UPP (v4.5.1), treating the backbone alignment as a constraint and aligning the query sequences (i.e., short sequences) to the backbone alignment. We use the UPP output alignment in which the insertion columns (marked with lowercase letters) are masked. This was performed using the following command:
run_upp.py -x 16 -s [unaligned query sequences] -a ref.fa -t backbone_epa.tree --molecule dna -A 100 -p {dir} -o est.aln
B.4 Delta Error calculation
The Delta Error was computed using measure.sh, adapted from the script for computing Delta Error from [2], and available on GitHub at https://github.com/ewedell/BSCAMPP_code.
9 Acknowledgments
The authors acknowledge the financial support of the Department of Computer Science. EW was supported by a Siebel Scholars scholarship, a SURGE Fellowship, and a Wing Kai Cheng Fellowship. CS was supported by NSF grant 2006069 (to TW).