sangerFlow, a Sanger sequencing-based bioinformatics pipeline for pests and pathogens identification

Sequencing of a Polymerase Chain Reaction product (amplicon) is called amplicon sequencing. Amplicon sequencing allows for reliable identification of an organism by amplifying, sequencing, and analysing a single conserved marker gene or DNA barcode. As this approach generally involves a single gene, it is a light-weight protocol compared to multi-locus or whole genome sequencing for diagnostic purposes; yet considerably reliable. Therefore, Sanger-based high-quality amplicon sequencing is widely deployed for species identification and high-throughput biosecurity surveillance. However, keeping up with the data analysis in a large-scale surveillance or diagnostic settings could be a limiting factor because it involves manual quality control of the raw sequencing data, alignment of the forward and reverse reads, and finally web-based Blastn search of all the amplicons. Here, we present a bioinformatics pipeline that automates the entire analysis. As a result, the pipeline is scalable with high-volume of samples and reproducible. Furthermore, the pipeline leverages the modern open-source Nextflow and Singularity concept, thus it does not require software installation except Nextflow and Singularity, software subscription, or programming expertise from the end users making it widely adaptable. Availability and implementation sangerFlow source code and documentation are freely available for download at GitHub, implemented in Nextflow and Singularity.


Introduction
Sanger sequencing 1 of single-target PCR product (amplicon) is widely used as a low-cost molecular diagnostic tool 2 .For diagnostic purposes, it is primarily used to identify a species or microorganism by performing a sequence similarity analysis using National Center for Biotechnology Information (NCBI)'s Basic Local Alignment Search Tool for nucleotide (Blastn) 3 .For example, mitochondrial CO1 for insect species identification 4 , 16S rRNA for pathogenic bacterial identification or 28S Internal Transcribed ribosomal gene Spacer (28S-ITS) for fungal identification 5 .The Sanger sequencing data of the forward and reverse reads can be analysed using Graphical User Interface (GUI)-based commercial software such as Geneious 6 and CLCbio 7 or non-commercial software like SnackVar 8 , GLASS 9 , and SeqTrace 10 .While these GUI-based tools are generally user-friendly, they are designed for manual processing of the data.In a large-scale biosecurity surveillance setting, the GUI-based data processing could be a time-and resource-limiting factor.However, some command line based open-source tools such SangeR 11 , ASAP 12 have been developed to analyse sanger sequencing data.However, SangeR 11 has been developed to perform mutation analysis while ASAP 12 can perform protein alignment.To the best of our knowledge, command line based open-source tools that automatically processes Sanger sequencing raw data and performs a Blastn search to identify the subject organism are rare.Although such a tool is in high demand for a biosecurity incident where wide-ranging surveillance is underway so that i) large volumes of samples can be processed rapidly, and ii) species identification can be streamlined.Here, we present a bioinformatics pipeline based on Nextflow and Singularity called sangerFlow.sangerFlow takes Sanger amplicon sequencing raw data and returns the targeted results i.e., the Blastn hits for each sample.Furthermore, the Blastn hits are sorted based on the highest bitscore and lowest e-value and produced in a concise Tab-Separated Values (TSV) format per sample for the convenience of quick species identification.In addition, the pipeline leverages Nextflow and Singularity containers.Therefore, the benefits of sangerFlow also include no-installation of software except for Nextflow and Singularity.Furthermore, the pipeline scripts are hosted in GitHub (https://github.com/asadprodhan/sangerFlow)keeping them away from the general users.This makes sangerFlow less dependent on users' programming expertise so that it can be implemented easily with minimal personnel training.All these features make the pipeline ideal for large-scale Sanger sequencing data analysis and user-friendly.

Workflow description
Figure 1 shows the sangerFlow workflow.The pipeline takes both .seqand .fastafile from Sanger sequencing as input data.The forward and reverse reads are fed into the workflow as two individual data channels.In the forward-read channel, the reads are re-named as per their sample names using awk commands followed by trimming of the ambiguous nucleotides using sed commands.In the reverse-read channel, the reverse reads are re-named as per their samples, converted to their reversecomplement strand using bioawk 13 , and trimmed for ambiguous nucleotides.Then, the trimmed reads are concatenated before subjecting them to the pairwise alignment using Clustal Omega 14 .The most representative sequence is extracted as .fastafile from the alignment file using The European Molecular Biology Open Software Suite (EMBOSS) Cons 15 .The fasta-headers are re-named with sample names before passing them as query sequence to the Blastn search 16 .The pipeline produces Blastn hits as xml, html and tsv file formats for the convenience of visualising the Blastn results.xml files are compatible to visualise in MEGAN 17 while the html files can be opened in any internet browser.The tsv files present the Blastn results for individual samples in the order of highest to lowest similarities.In addition, the pipeline compiles a master Blastn result sheet comprising the user-defined topmost Blastn hits of all the samples that allows for quick checking of the results.Users can also set the evalue and cpu numbers for the Blastn analysis as described in the following section as well as in https://github.com/asadprodhan/sangerFlow.

Software implementation and availability
sangerFlow is written using the Nextflow workflow manager coupled with Singularity containers.
Therefore, sangerFlow can be run with only two softwares-Nextflow and Singularity-installed on the working computer.The pipeline requires only one script called main.nf and one resources configuration file called nextflow.config.The main.nf script contains all the codes to carry out the individual analyses in a step-wise fashion.The nextflow.configcontains Singularity container links for each computational analysis.Furthermore, to simplify the implementation of the workflow, main.nf and nextflow.configare hosted in GitHub (https://github.com/asadprodhan/sangerFlow).
Consequently, the sangerFlow can be implemented with only the following coding snippet: sangerFlow nextflow run asadprodhan/sangerFlow -r VERSION-NUMBER --evalue=User_defined --topHits= User_defined --cpus= User_defined --db="/path/to/your/blastn_database" The sangerFlow workflow is accompanied with an intuitive user guide hosted in sangerFlow GitHub page (https://github.com/asadprodhan/sangerFlow).The results directory contains the final results while the temp directory includes all the intermediate files generated by sangerFlow in case any sample requires manual inspection for clarity.

Use case
We have successfully applied sangerFlow to process hundreds of surveillance samples in-house.To demonstrate the utility here, we ran sangerFlow with two publicly available datasets.The test datasets can be found in Table 1 and 2. To run sangerFlow, the users just supply the data and run sangerFlow (nextflow run asadprodhan/sangerFlow -r 18e5299 --evalue=0.05--topHits=1 --cpus=16 --db="/media/Disk/blastn_db" ) as depicted in Fig. 2 (Step 1 & 2, respectively).sangerFlow will be automatically launched from its GitHub repository and process all the data supplied without any manual intervention.In the end, there will be three new directories called 'results', 'temp, and 'work' (Fig. 2, Step 3).All the Blastn hits will be in the 'results' directory.Temp directory will contain the intermediate files like contigs in case some samples require revisiting the intermediate files.The 'work' directory is auto-generated by Nextflow that can be discarded.For details, see the user protocol at https://github.com/asadprodhan/sangerFlow

Results
SangerFlow is a command line-based bioinformatics tool that is designed to automatically analyse the Sanger sequencing data of PCR amplicons.Our state-wide biosecurity surveillance for exotic pests, often uses PCR assays resulting in a large influx of Sanger PCR amplicon sequencing data per day.This warrants a high-throughput analysis and led to developing this pipeline, sangerFlow.sangerFlow automatically carried out data quality control, analysis, and Blastn search.Consequently, our species identification was streamlined.We validated the accuracy of the pipeline by comparing the results from the pipeline and those from manual analysis.
To demonstrate the sangerFlow performance here, we used two test datasets that were comprised of PCR Sanger sequencing forward and reverse reads.First, we manually removed the ambiguous nucleotides from the forward and reverse sequences using Geneious 6 (Table 5), aligned them, extracted the consensus sequence, and finally searched them in the NCBI database using the web Blastn 16 via Geneious 6 .Then, we ran sangerFlow pipeline on the fasta files of the same datasets that automatically provided with Blastn outputs for each sample (Table 6).However, as the input and output files in sangerFlow were fasta format, there were no visualisation of the trimmed sequences.Finally, we compared the sangerFlow-derived Blastn outputs with those derived from manual processing (Table 7).The comparison of the results demonstrated consistency between manual analysis and sangerFlow (Table 7).

Table 4 . Sequencing quality stats of the reference dataset 2 (chromatograms)
%MQ , and %LQ are the percentages of untrimmed bases in a sequence that are considered high, medium, and low quality, respectively.Q20, Q30, and Q40 (%) are the percentage of nucleotides with quality scores of at least 20, 30, and 40, respectively.