An improved mode of running PASTA

PASTA is a method for estimating alignments and trees that has been able to provide excellent accuracy on large sequence datasets. By design, PASTA operates using iteration, in which the tree from the previous iteration is used to inform a divide-and-conquer strategy during which a new alignment is computed on the sequence dataset, and then a new maximum likelihood tree is estimated on the new alignment. In its default setting, PASTA runs for three iterations and returns that alignment/tree pair from the last iteration. Here we use both biological and simulated nucleotide datasets to show that returning the alignment/tree pair that has the best maximum likelihood score improves on the default usage.


Introduction
Multiple sequence alignment (MSA) is a fundamental step in many biological studies, and large-scale MSA is particularly difficult. Several methods have been designed for large-scale MSA, including Clustal-Omega (Sievers et al., 2011), KAlign3 (Lassmann, 2019), PASTA , and UPP . Furthermore, sequence length heterogeneity also presents challenges for alignment (Smirnov and Warnow, 2020), a problem that is addressed by UPP. Specifically, when given a dataset with sequence length heterogeneity, UPP extracts a set of sequences that are "full length" and uses PASTA to align them, thus forming the "backbone alignment". The remaining sequences are then added into the backbone alignment using an ensemble of Hidden Markov Models (Durbin et al., 1998). Thus, UPP is an extension of PASTA to enable the accurate estimation of alignments given sequence length heterogeneity. Hence, here we focus on PASTA, due to its high accuracy on large datasets with high rates of evolution and applicability to both nucleotide alignment and protein alignment.
PASTA uses a combination of divide-and-conquer and iteration, so that each iteration operates by taking the alignment and tree from the previous iteration, dividing the sequence input into subsets using the tree, aligning the subsets using the selected base method, and then merging the alignments together. A maximum likelihood tree is then estimated on the alignment using FastTree2 (Price et al., 2010). This process repeats for several iterations (default 3), and then the last alignment/tree pair is returned. While this default usage has performed well in many studies, here we explore the potential for additional improvement in accuracy by returning the alignment/tree pair that had the best maximum likelihood score, as computed using FastTree2.

Methods
We compare two variants of PASTA, the default version and a variant where we return the alignment/tree pair with the best maximum likelihood score, on a collection of biological and simulated datasets with up to 1000 sequences, each run for four iterations, and we report the impact on alignment error. All datasets are available in public repositories from prior studies.
PASTA-BestML differs from the default settings for PASTA in two ways: it does not mask sites that are highly gappy (as is performed in PASTA-Default, in order to speed up the analysis on large datasets) and it returns the alignment/tree pair that has the best maximum likelihood score across all the reported iterations (instead of the last alignment/tree pair). Since PASTA has some randomness in its execution, we ran several independent experiments on each dataset. Therefore, we ran 100 independent experiments for each biological dataset and 50 independent experiments for each replicate of each simulated dataset.

Datasets
We include both biological nucleotide datasets and simulated nucleotide datasets from prior studies. We use a standard protocol to preprocess these datasets to reduce to a set of sequences that are full-length: we discard all sequences whose length is different from the median sequence length by over 20%. We use two sources for simulated datasets: the 1000-and 500-sequence datasets used in Liu et al. (2009Liu et al. ( , 2011 to evaluate SATé and SATé-II in comparison to other methods, which were simulated using ROSE (Stoye et al., 1998), and some subsets of the RNASim dataset (studied in Mirarab et al. (2015)) with 1000 sequences; the ROSE datasets are available at Mirarab (2020) and the RNASim datasets are available at Smirnov (2020).
All the simulated datasets evolve with substitutions and indels, but differ from each other in various ways. The ROSE datasets from Mirarab (2020) evolve under a modification of the GTRGAMMA model to allow for indels, and the RNASim datasets evolve with substitutions and indels under a biophysical model that reflects the selective pressure of RNA structure conservation (see Mirarab et al. (2015)). The ROSE datasets include a wider range of sequence evolution and thus include some conditions that are challenging to align, while the RNASim datasets are also challenging due to highly variable rates of evolution across the length of the sequences. Each of the simulated datasets has 20 replicates. We include six biological nucleotide datasets from the Comparative Ribosomal Website (CRW) (Cannone et al., 2002), available from Mirarab (2020). They are 16S.M, 16S.M.aa_ag, 23S.M, 23S.M.aa_ag, 23S.E and 23S.E.aa_ag; these range in size from 96 to 740 sequences, and from 930 nucleotides to 3599 nucleotides in average sequence length. See Tables 1 and 2 for empirical statistics about these datasets.

Criteria
We report alignment error using SPFP and SPFN, calculated by FASTSP (Mirarab and Warnow, 2011), where SPFN is the percentage of the pairwise homologies in the reference alignment that do not appear in the estimated alignment and SPFP represents the percentage of the homologies in the estimated alignment that are missing from the reference alignment.    As seen in Table 3 and Table 4, PASTA-BestML has lower alignment error than PASTA-Default for all biological datasets and nearly all simulated datasets. Thus, running PASTA with BestML typically improves alignment accuracy compared to default PASTA. The difference is generally small, but because it is nearly universal, it suggests that PASTA-BestML may be a more reliable technique for using PASTA than its default mode. This improvement also suggests that the maximum likelihood score may be a valuable criterion to use when selecting an alignment from a set of candidate alignments.

Conclusions
Multiple sequence alignment is a basic step in many bioinformatics pipelines, with phylogeny estimation one of the applications of interest. Here we have shown that a small change to the PASTA pipeline yields a consistent improvement in alignment accuracy on a collection of biological and simulated datasets. Although the improvement was small, it is suggestive of a trend that could lead to bigger improvements on larger datasets. More generally, it suggests also the possibility of using criteria, such as maximum likelihood scores, to select between competing multiple sequence alignments computed for the same dataset. In general, future work is needed in order to better understand how to estimate multiple sequence alignments.