Abstract
Multi-gene phylogenies constructed from multiplexed and Sanger sequencing data are regularly used in mycology and other disciplines as a cost-effective way of species identification and as a first means to investigate genetic diversity samples.
Today, a number of tools exist for each of the steps in this analysis, including quality control and trimming, the generation of a multiple sequence alignment (MSA), extraction of informative sites, and the construction of the final phylogenetic tree. A BLAST search in a reference database is often performed to identify sequences of type specimens to compare the samples with in the phylogeny. Made over the past decades, these tools are all independent from and often not perfectly adapted to one another.
We present AB12PHYLO, an integrated pipeline that can perform all necessary steps from reading in raw Sanger sequencing data through visualizing and editing phylogenies. In addition, AB12PHYLO can calculate basic summary statistics for each gene in the phylogeny.
AB12PHYLO is designed as a wrapper of several open access and commonly used tools for each of the intermediate stages, and intended to simplify the phylogenetic pipeline while still allowing a high degree of access. It comes as a command-line version for the highest reproducibility and an intuitive graphical user interface (GUI) for easy adoption by IT-agnostic end-users. The use of AB12PHYLO significantly reduces the hands-on working time for these analyses.
Main text
Multi-gene phylogenies obtained through Sanger sequencing are a cost-effective and fast way method to aid species identification of fungal samples collected in the field, get the first insight into their genetic diversity, and as a means to select subsets of samples for more expensive and labor-intensive whole genome sequencing approaches. Such analyses often use barcode sequences, specific genic or intergenic fragments of well-defined genes, that have been widely used over the past decades. Examples are regions of the Internally Transcribed Spacer (ITS) (White et al. 1990), Elongation Factor 1 alpha (EF1) (Carbone and Kohn 1999) or RNA polymerase II subunit (RBP2) (Liu et al. 1999). These genes have been sequenced for a large number of type specimens, and sequence comparison of the samples in question with stored type specimens either through direct local alignments or database searches such as NCBI-BLAST (Johnson et al. 2008), can help confirm species identity. Often, sequence data of a single barcode gene is not sufficient to specifically determine fungal identity on species level, whereas a combination of three or more barcodes can reliably determine which species the sample belongs to (see e.g. Woudenberg et al. 2015). In another example, construction of multi-gene phylogenies formed the basis for phylogenetic reclassification: the fungal plant pathogen genus Ulocladium appears morphologically different from the genus Alternaria, but multi-gene phylogenies did not result in monophyletic clades, suggesting to rename the Ulocladium spp, which now fall under the broader Alternaria genus (Woudenberg et al. 2013). The method is also in use to get better insights into pathogens in the field: Two recent studies used multi-gene phylogenetic analyses to confirm the nature and relationship of the pathogens A. alternata and A. solani in potato fields in Wisconsin or similarly in Brazil (Adhikari et al. 2020; Ding et al. 2020). Other recent studies used the method to identify and compare Colletotrichum spp. on tea (Orrock et al. 2019), strawberry (Chen et al. 2019), and a variety of hosts (He et al. 2019), to re-assess the taxonomic classification of Mycosphaerella spp on persimmon.(Hassan and Chang 2018) or to get first insights in the diversity of Phytophthora spp. in the amazon forest (Legeay et al. 2020).
The analyses presented above often involve manual inspection of the sequence quality, followed by manual data trimming. Some tools exist that automate sequence file inspection to a certain extent (see e.g. Singh and Bhatia 2016; Rausch et al. 2020), yet these tools do not help the user with the subsequent steps, such as alignment with reference sequences or phylogenetic reconstruction, Whereas such steps often require additional hands-on work as well, if only to prepare the output of one tool as input for the next. Manual editing of input and output files would also be the case when using popular web-based phylogeny tools like NGPhylogeny.fr (Lemoine et al. 2019). All manual processing slows down analysis and hampers reproducibility in general, as many parameters or small conversion steps are often not properly recorded. In order to speed up data analyses of this kind and increase their reproducibility, we constructed a fully customizable pipeline which we call AB12PHYLO.
AB12PHYLO is developed as a Python 3 package around widely-used open source tools. It takes raw ab1 (ABI) files. Additionally, the user can provide a template specifying the corresponding sample names, which can be formatted in a 96-well plate format, to represent the way the samples are often loaded for sequencing. When no sample template is specified, AB12PHYLO uses regular expressions that the user can modify to search for and extract the file and gene names.
Its command-line version is assembled from the following three parts (with eight main steps: Part A - Sequence assessment: i) File input: After the command line is supplemented with default configurations, the tables mapping plate coordinates to sample IDs are read to memory. Ab1 trace files are read using Biopython Bio.SeqIO (Cock et al. 2009), matched to their original sample ID and gene, and passed to quality control. Reference sequences are saved to the respective per-gene dataset. ii) Quality control was modeled after SeqTrace (Stucky 2012): Read ends are trimmed until a user-defined proportion of characters in the chromatogram have a phred quality score at or above another user-defined threshold, with 8 / 10 and 30 the pre-set default values. End trimming can discard reads. Consecutive stretches of characters with a score below the phred threshold will be replaced by an equal-length stretch of unknown N characters if they are longer than the last user-definable limit in trace processing; pre-set at 5. Reverse reads are replaced by their reverse complement. Part B – Sequence alignment: iii) Edited sequences are passed to a multiple sequence alignment tool in per-gene datasets. AB12PHYLO is able to interface with local installations of MAFFT (Katoh et al. 2002), Clustal Omega (Sievers et al. 2011) or MUSCLE (Edgar 2004); or an EMBL-EBI online service for any of them (Notredame et al. 2000) at https://www.ebi.ac.uk/Tools/msa.iv) The alignments are trimmed with Gblocks (Castresana 2000). Requirements for a conserved site can be set at four different levels, from 90% identiy to the most relaxed permissible parameters, and a fifth option skipping trimming entirely. The per-gene MSAs are concatenated into a supermatrix alignment. v) A BLAST similarity search of data from the first gene in the analysis is carried out to identify source species. If this search is to be run locally, AB12PHYLO employs BLAST+ (Camacho et al. 2009), which will download, update or check a user-defined database before searching it. Per default, AB12PHYLO will query the NCBI nucleotide database for sequences not found in the local database with Biopython Bio.Blast (Cock et al. 2009), and BLAST can also be run entirely via the public NCBI BLAST API, but this approach is not suitable for large datasets. Two more directly related options are available: Skip BLAST altogether, or parse one or several XML files from a previous analysis or a web BLAST. Part C – Phylogenies: vi) By default, a maximum likelihood (ML) tree is inferred from the concatenated alignment with RAxML-NG (Kozlov et al. 2019). While the evolutionary model is pre-set to GTR+ Γ and the numbers of ML tree searches to 10 with random or parsimony starting trees each, these parameters can be user-defined. Also, the number of parallel threads can be limited. Alternatively, trees can be inferred using IQ-tree (Minh et al. 2020), which allows automated model selection. Moreover, IQ-tree can also be executed in the windows version or AB12PHYLO. vii) With RaxML-NG or IQ-tree, bootstrap replicates are generated from the best ML phylogeny found in the previous step. FBP and TBE support values for the best ML tree are computed from the bootstrap trees constructed in parallel threads. viii) Output: The generated phylogeny is plotted with Toytree (Eaton 2020) and shown alongside other results in an HTML results page. A CGI script allows interactive searching of taxa and selecting populations while computing diversity statistics and the Tajima’s D neutrality test.An overview of the main features of AB12PHYLO is shown in Figure 1. A more detailed model of the command-line AB12PHYLO program flow is shown in Figure S1.
As a proof of concept, we obtained the data from two of the the above-mentioned studies: Ding et al. (2020) and Legeay et al. (2020) to reconstruct their phylogenies. To repeat the study by Ding et al (2020), we ran AB12PHYLO with default settings, providing both the raw ab1 files and the sequence data of the type specimens as used by Ding et al (2020). Two samples did not pass the default quality controls. With the remaining 74 samples, we resolved a phylogenetic tree similar to the one in the original work, in which the same genotype groups can be annotated (Figure S2). The MSA used for the phylogeny was 1822 bp long and included 74 samples. Our analysis was run on 12 threads on a system with 64 GB RAM. The total run time for this analysis, including parallelized bootstrapping and BLAST was less than 10 minutes. To repeat the analyses by Legeay et al (2020) we again ran AB12PHYLO with default settings. Again, several samples did not pass the default quality controls. With the remaining samples, we resolved a phylogenetic tree similar to the published one, with a few important differences (Figure S3A), therefore we also remade the phylogeny with the sequence data as deposited on NCBI (Figure S3B) and constructed a phylogeny with all data combined (Figure S3C). These two analyses revealed…
The MSA used for the phylogenies for A, B, C contained 14, 16 and 30 samples respectively, with 2038, 2319 and 2144 sites used for tree inference. On our system, the first two take less than a minute, C around 5 minutes.
Thus, we conclude that AB12PHYLO can produce high-quality multi-gene phylogenies rapidly. The use of AB12PHYLO significantly reduces hands-on working time for these analyses, and overall runtime by parallelization of computation-heavy maximum likelihood tree inference. Moreover, the fact that we observed minor differences between published phylogenies and our re-analyses highlights the importance of reproducible analyses.
Data Availability
AB12PHYLO is published under the GPLv3 license. It runs on standard desktop computers either under Linux, MacOS or Windows operating systems and can be installed via the pip or conda package-managment systems, the latter also allowing easy installation of an environment with all external tools. Installation instructions and source code are available at https://github.com/lkndl/ab12phylo
Supplementary Materials for
Acknowledgments
We thank Shunping Ding and Marc Buée and colleagues for providing the raw ab1 sequence data from their studies and Tamara Schmey for testing AB12PHYLO.
The project was funded by the German Science Foundation (DFG). Some analyses were performed on the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A537B, 031A533A, 031A538A, 031A533B, 031A535A, 031A537C, 031A534A, 031A532B).
Footnotes
Minor corrections in the main text to - better place the work in the context of other tools - reflect the additional functionality (by implementing IW-tree - describe a test run on an additional validation data set, highlighting the value of AB12PHYLO