Abstract
Viruses have high mutation rates and generally exist as a mixture of variants in biological samples. Next-generation sequencing (NGS) approach has surpassed Sanger for generating long viral sequences, yet how variants affect NGS de novo assembly remains largely unexplored. Our results from >15,000 simulated experiments showed that presence of variants can turn an assembly of one genome into tens to thousands of contigs. This “variant interference” (VI) is highly consistent and reproducible by ten most used de novo assemblers, and occurs independent of genome length, read length, and GC content. The main driver of VI is pairwise identities between viral variants. These findings were further supported by in silico simulations, where selective removal of minor variant reads from clinical datasets allow the “rescue” of full viral genomes from fragmented contigs. These results call for careful interpretation of contigs and contig numbers from de novo assembly in viral deep sequencing.
Introduction
For many years, Sanger sequencing has been used to complement classical epidemiological and laboratory methods for investigating viral infections.1 As technologies have evolved, the emergence of next-generation sequencing (NGS), which drastically reduced the cost per base to generate sequence data for complete viral genomes, has allowed scientists to apply viral sequencing on a grander scale.2 Genomic sequencing is ideal for elucidating viral transmission pathways, characterizing emerging viruses, and locating genomic regions which are functionally important for evading the host immune system or antivirals.3
Genomic surveillance of viruses is particularly important in light of their rapid rate of evolution. Viruses have higher mutation rates than cellular-based taxa, with RNA viruses having mutation rates as high as 1.5 × 10−3 mutations per nucleotide, per genomic replication cycle.4 Due to this high mutation rate, it is well established that most RNA viruses exist as a swarm of quasispecies,5 with each quasispecies containing unique single nucleotide polymorphisms (SNPs). The presence of these variants plays a key role in viral adaptation.
Due to viruses’ rapid evolution, a single clinical sample often contains a mixture of many closely related viruses. Viral quasispecies are mainly derived from intra-host evolution, with RNA viruses such as poliovirus, human immunodeficiency virus (HIV), hepatitis C (HCV), influenza, dengue, and West Nile viruses maintaining diverse quasispecies populations within a host.6, 7, 8, 9, 10, 11, 12, 13 Conversely, the term “viral strains” often refers to different lineages of viruses found in separate hosts, or a co-infection of viruses in the same host due to multiple infection events. As a result, sequence divergence is usually higher when comparing viral strains compared to quasispecies. In this study, we use the term “variant” to encompass both quasispecies and strains regardless of how the variants originated in the biological samples.
Since many sequencing technologies produce reads that are significantly shorter than the target genome size, a process to construct contigs, scaffolds, and full-length genomes is needed. Reference-mapping and de novo assembly are the two primary bioinformatic strategies for genome assembly. Reference-mapping requires a closely-related genome as input to align reads, while de novo assembly generates contigs without the use of a reference genome, and therefore is the most suitable strategy for analyzing underexplored taxa14 or for viruses with high mutation and/or recombination rates.
In this study, we first examined how often NGS and de novo assembly were applied in viral sequencing in GenBank Nucleotide entries (www.ncbi.nlm.nih.gov/nucleotide/). Then we investigated how the presence of variants affected assembly results - simulated and clinical NGS datasets were analyzed using multiple assembly programs to explore the effects of genome variant relatedness, read length, and genome length on the resulting contig distribution.
Results
The rise of NGS and de novo assembler use in GenBank viral sequences
GenBank viral entries from 1982-2017 were collected and analyzed, with extensive analyses performed to evaluate technologies and bioinformatics programs cited in records deposited between 2011 and 2017. Through 2017, there were over 2.3 million viral entries in GenBank; however, over 70% (1.7 million) do not specify a sequencing technology [Supplement Table S1] due to the looser data requirement in earlier years. When looking at recently deposited records (2014-2017), the Illumina sequencing platform was the most common NGS platform used for viral sequencing, with about a 2-fold increase over the next most popular NGS platform [Figure 1d & e]. When long sequences (≥2,000 nt) are considered, NGS technologies surpassed Sanger in 2017 as the dominant strategy for sequencing, comprising 53.8% (14,653/27,217) of entries compared to 46.2% of entries (12,564/27,217) for Sanger [Figure 1f and Supplement Table S2].
Hybrid sequencing approaches, where researchers use more than one sequencing technology to generate complete viral sequences, have also become more common over the past several years. The most common combination observed was 454 and Sanger (18,002 entries), likely due to the early emergence of the 454 technology compared to other NGS platforms [Figure 1c and Supplement Table S3]. However, combining Illumina with various other sequencing platforms is quite commonplace (>10,000 entries).
De novo assembly programs (ABySS, BWA, Canu, Cap3, IDBA, MIRA, Newbler, SOAPdenovo, SPAdes, Trinity, and Velvet) have increased from less than 1% of viral sequence entries in 2012, to 20% of all viral sequence entries in 2017 [Figure 1h & i]. A similar increase was observed for reference-mapping programs (i.e., Bowtie and Bowtie2), from 0.03% in 2012 to 6.5% in 2017. Multifunctional programs (Suppl. Information) that offer both assembly options were the most common programs cited for the years 2013-2017, but since the exact sequence assembly strategy used for these records is unknown, the contributions of de novo assembly are likely underestimated. An expanded summary of the sequencing technologies and assembly approaches used for viral GenBank records is available in Supplement Tables S1–S6.
Effect of variant assembly using popular de novo assemblers
After establishing the growing use of NGS technologies for viral sequencing, we next focused on understanding how the presence of viral variants may influence de novo assembly output. We generated 247 simulated viral NGS datasets representing a continuum of pairwise identity (PID) between two viral variants, from 75% PID (one nucleotide difference every 4 nucleotides), to 99.6% PID (one nucleotide difference every 250 nucleotides) [Figure 2]. For Experiment 1, these datasets were assembled using 10 of the most used de novo assembly programs [Figure 2 and Supplement Figure S1a] to evaluate their ability to assemble the two variants into their own respective contigs as the PID between the variants increases.
One key observation is that the assembly result can change from two (correct) contigs to many (unresolvable) contigs simply by having variant reads; the presence of viral variants affected the contig assembly output of all 10 assemblers tested. The output of the SPAdes, MetaSPAdes, ABySS, Cap3, and IDBA assemblers shared a few commonalities, demonstrated by a conceptual model in Figure 3A. First, below a certain PID, when viral variants have enough distinct nucleotides to resolve the two variant contigs, the de novo assemblers produced two contigs correctly [Figure 3]. We refer to this as “variant distinction” (VD), with the highest pairwise identity where this occurs as the VD threshold. Above this threshold, the assemblers produced tens to thousands of contigs [Figure 3], a phenomenon we define as “variant interference” (VI). As PID between the variants continue to increase, the de novo assemblers can no longer distinguish between the variants and assembled all the reads into a single contig, a phenomenon we define as “variant singularity” (VS). [Figure 3]. The lowest pairwise identity where a single contig is assembled is the VS threshold.
Slight differences in the variant interference patterns (relative to the canonical variant interference model) were observed for the 10 assemblers investigated. VD was observed for SPAdes, MetaSPAdes, and ABySS assemblers. While it was not observed with Cap3 and IDBA with the current simulated data parameters, we speculate that VD may occur at a lower PID level for these assemblers than tested in this study. The PID range where VI was observed was distinct for each de novo assembler [Figure 3]. During VI, SPAdes produced as many as 134 contigs and ABySS produced 3,076 contigs, while MetaSPAdes, Cap3, and IDBA produced up to 10.
A different pattern was observed for Mira, Trinity, and SOAPdenovo2 assemblers. The average number of contigs generated by Mira, Trinity, and SOAPdenovo2 was 5, 36, and 283, respectively across all variant PIDs from 75%–99.96%. Specifically, Mira and Trinity generated fewer contigs at low PID, but produced many contigs when the two variants reach 97.1% PID and 96.0% PID, respectively. For SOAPdenovo2, a larger number of contigs were produced regardless of the PID. This indicates that these assemblers generally have major challenges producing a single genome; this has been observed in previous studies comparing assembly performance.15
Finally, Geneious and CLC were the least affected by VI in the simulated datasets tested, returning only 1–5 contigs for all pairwise identities. CLC’s assembly algorithm primarily returned a single contig over the range of PIDs tested (218/247 simulations; 88.3%), thus favoring VS. In comparison, Geneious predominantly distinguished the two variants (234/247 simulations; 94.7%), favoring VD.
Effect of GC content and genome length on variant assembly
For Experiment 2, we focused our study on evaluating whether VI observed in SPAdes de novo assembly is influenced by the GC content or genome length of the pathogen. Two datasets were used for the evaluation: reads generated from four artificial genomes ranging in length from 2 Kb to 1 Mb, as well as from genome sequences of poliovirus (NC_002058; 7,440 nt in length) and coronavirus (NC_002645; 27,317 nt in length). No discernable correlation was observed between the GC content of variant genomes and the degree of VI for any of the simulated datasets [Supplemental Dataset S2, p < 0.0001]. Therefore, for subsequent analyses examining the effects of genome length on VI, the number of contigs at each PID level was obtained by averaging the 13 GC simulations.
Notably, no matter the genome length, SPAdes produced vastly more contigs (i.e., VI) in a constant, narrow range of PID [99%–99.21% ; Figure 4a & b]. The effect of variants on assembly was characterized by the three distinct intervals described previously: VD at lower PIDs, VI [Figure 4b], and VS at higher PIDs for all genome lengths. For example, during VS, a single contig was generated when the two variants shared ≥99.22% PID, but tens to thousands of contigs were generated at a slightly lower PID of 99.21%. This PID threshold, 99.21%, marked the drastic transition from VS to VI, whereas the transition from VI to VD (i.e., the VD threshold) occurred at 98.99% PID [Figure 4b]. A correlation was observed between genome length and the number of contigs produced during VI, where longer genomes returned proportionally more contigs as expected as total VI occurrence should increase with length [r2 = 0.967; p <0.0001 Figure 4b and 4c].
Effect of read length on variant assembly
The read length of a given NGS dataset will vary depending on the sequencing platform and kits utilized to generate the data. Since read length is an important factor for de novo assembly success,16 we hypothesized that it may also influence the ability to distinguish viral variants. For Experiment 3, using SPAdes we investigated assemblies with four typical read lengths: 50, 100, 150, and 250 nt. At longer read lengths, the VD threshold occurred at higher PIDs [Figure 4d & e]. Also, with increasing read length, the width of the PID window where VI occurs gradually decreased from a 1.52% spread to a 0.21% spread [Figure 4e]. This indicates that longer reads are better for distinguishing viral variants with high PIDs.
In silico experiments examining variant assembly with NGS data derived from clinical samples
For clinical samples, assembly of viral genomes is affected by multiple factors other than the presence of variants, including sequencing error rate, host background reads, depth of genome coverage, and the distribution (i.e., pattern) of genome coverage. We next utilized viral NGS data generated from four picornavirus-positive clinical samples (one coxsackievirus B5, one enterovirus A71, and two parechovirus A3) to explore VI in datasets representative of data that may be encountered during routine NGS. The NGS data for each sample was partitioned into four bins of read data: (1) total reads after quality control (T); (2) major variants only (M); (3) major and minor variants only (Mm); and (4) major variants and background non-viral reads only (MB) [Figure 5]. These binned datasets were then assembled separately using three assembly programs: SPAdes, Cap3, and Geneious. By comparing these manipulations, we aimed to test the hypothesis that minor variants directly affect the performance of assembly through VI in real clinical NGS data.
Even with an adequate depth of coverage for genome reconstruction, assembly of total reads (T) in 11/12 experiments resulted in unresolved genome construction – resulting in numerous fragmented viral contigs [Figure 6]. The only exception was one experiment where one single PeV-A3 (S1) genome was assembled using Cap3. When only reads from the major variant were assembled (M), full genomes were obtained for all datasets using SPAdes and Cap3, and for the CV-B5 sample using Geneious. Conversely, assembly of the read bins containing major and minor variants (Mm) resulted in an increased number of contigs for 9 of the 12 sample and assembly software combinations tested [Figure 6], indicating that VI due to the addition of the minor variant reads likely adversely affected the assembly. The presence of background reads with major variant reads (MB) did not appear to affect viral genome assembly, as the UG50% value, a performance metric which only considers unique, non-overlapping contigs for target viruses17, was similar between M and MB datasets.
Discussion
Our analysis of the GenBank quantified the decade-long expansion of NGS technologies and de novo assembly for viral sequencing [Figure 1]. As the number of viral sequences in public databases continues to grow, an important question that naturally arises is how well current de novo assembly programs perform for datasets with viral variants. Viral variants are expected in biological samples, with the number of variants and the extent of the sequence divergence between variants related to the mutation rate of the virus and the types of specimens that are being investigated. For example, samples containing rapidly evolving RNA viruses, such as poliovirus, HIV, and HCV7, 9, 18, environmental samples,19 and clinical samples from immunosuppressed individuals20, 21 usually harbor many variants. The ability to accurately distinguish variants is imperative to inform treatments (in the case of HIV and HCV), or determine whether a subpopulation of a more virulent variant is present.
Several experiments using simulated and clinical sample NGS data were performed to evaluate the ability of genome assembly programs to distinguish genome variants. All assemblers investigated generated fragmented assemblies when the data contained reads from two closely related variants due to “variant interference” (VI). Changes in pairwise identity (PID) as small as 0.01% between the two variants triggered an assembler to change from producing one or two contigs to producing hundreds of contigs. A quintessential example of this phenomenon was the SPAdes assembly of EV-A71 sequences during the in silico experiments with clinical NGS data. Assembly of major variant reads resulted in one full length contig [Figure 5], whereas assembly of datasets containing the major and minor variant reads (Mm and T) were characterized by a number of contigs, resulting in “cobwebs” of contig fragments when visualized using Bandage [Supplement Figure S2].22 Even though the de novo assembly graph linked the different contig fragments, the assembly could not differentiate the multiple routes of possible contig construction. We speculate this is the main reason why VI occurs in the context of de Bruijn graph assemblers.
The simulated experiments suggested that genome length and read length influence VI; A longer genome length will produce proportionally more contigs during VI, whereas a longer read length decreases the PID range where VI occurs [Figure 4]. While longer read length improves assembly, unfortunately, platforms that produce long reads such as Oxford Nanopore and PacBio have higher error rates.23 Until long reads can be produced at high fidelity, researchers must continue to rely on combining long- and short-read NGS datasets, and genome polishing techniques.23
The large number of contigs generated due to VI may be overwhelming for most researchers, and for viral ecology studies, could lead to over-estimation of species richness for methods that use contig spectra to infer richness, such as PHACCS or CatchAll.24, 25, 26 This phenomenon may also impact studies differently depending on the overall goal for generating viral sequence data. For example, some researchers may only be concerned with generating a single major consensus genome, even when variants are detected in the data. This is common during outbreak responses for pathogens such as Ebola virus or Middle East respiratory syndrome coronavirus, where detection of SNPs (indicative of minor variants) is not immediately important. On the other hand, some investigations could favor distinguishing variants, such as for investigating the presence of vaccine-derived poliovirus, where a small number of SNPs may distinguish a vaccine-derived strain from a normal vaccine strain genome.21
The effects of VI could potentially be mitigated by running multiple assembly programs. A previous study testing bioinformatics strategies for assembling viral NGS data found that employing sequential use of de Bruijn graph and overlap-layout-consensus assemblers produced better assemblies.15 We speculate that this “ensemble strategy”15 may perform better because the multiple assemblers complement one another by having different VI PID thresholds. Future assembly approaches could also consider resolving the VI problem by possibly discriminating the major and minor variant reads first (perhaps by coverage or SNP analysis), and then assembling major and minor variant reads separately.
Since we observed VI occurring in simulated data from 2 Kb to 1 Mb genome lengths, we speculate that it may not only affect viral data but also larger draft contigs of bacteria and other microorganisms. Even though bacterial mutation rates are much lower than those of most viruses, bacterial variants are common. For environmental studies, bacterial metagenomes are known to contain many related taxa and variants 27,28, 29, 30, and in clinical investigations, minor bacterial variants can harbor SNPs that provide resistance against antimicrobials. This warrants future investigation into how the presence of variants may impact the assembly of other microbial datasets.
This study aimed to understand how variants affect assembly. As an initial investigation, many confounding factors were simplified for experimentation. Simulated variants studied here only depicted periodic mutations, set at regular intervals. However, in real viral data, SNPs are never evenly distributed across the genome, with zones of divergence and similarity.31, 32 Other important factors which influence genome assembly include sequencing error rates, presence of repetitive regions, and coverage depth. We limited our experiments to keep these factors constant in order to investigate the sole effect of VI. Through this exploration, we demonstrated that reads from related genome variants adversely affect de novo assembly. As NGS and de novo assembly have become essential for generating full-length viral genomes, future studies should investigate the combined effects of the number and relative proportion of minor variants, as well as additional assembly factors (e.g., error rates) to supplement this work.
Methods
Analyzing NGS and assembler usage in the virus nucleotide collection in GenBank
Viral sequence entries from the GenBank non-redundant nucleotide collection were obtained by downloading all sequences under the virus taxonomy through the end of 2017. A total of 2,338,775 GenBank entries were investigated.
The total number of viral sequences submitted annually in GenBank through December 2017 was calculated by filtering GenBank submissions by “virus,” followed by application of the following additional filtering steps: “genomic DNA/RNA” was selected and a “release date: Jan 1 through Dec 31” was applied to find the total number of viruses for a given year. A custom script was used to filter and count all documented sequencing technologies and assembly methods used for each GenBank entry.
Creation of simulated variant genomes and reads
Simulated genomes were generated using custom scripts that randomly assign each nucleotide over a designated genome length with a weighted distribution dependent on the GC content [Supplement Figure S1]. The random genomes were then screened using NCBI BLAST to insure no similarity/identity existed to any classified organism (i.e., no BLAST hits). These simulated genomes served as the initial variant genome (variant 1). To generate the mutated variant genomes (variant 2), a custom script was used to systematically introduce evenly distributed random mutations at rates from 1 mutation in every 4 nucleotides (75% PID) to 1 mutation in every 250 nucleotides (99.6% PID), incrementing by 1 nucleotide.
Following the generation of initial and mutated variant genomes, high-quality fastq reads were generated using ART,33 simulating Illumina MiSeq paired-end runs at 50× coverage with 250 nt reads, DNA/RNA mean fragments size of 500, and quality score of 93. Fastq reads were combined in equal numbers for the initial and mutated variants, and used as input for subsequent de novo assembly experiments [Supplement Figure S1]. The same process was utilized to generate the artificial genomes, initial and mutated variant genomes, and reads for each of the experiments.
Experiment 1: Analyzing simulated reads from variants using different de novo assembly programs
The simulated datasets containing reads from two variant genomes with nucleotide pairwise identity ranging from 75%–99.6% were analyzed using 10 different genome assembly programs. The de novo assembly algorithms used were either overlap-layout-consensus (OLC) [Cap34 and Mira35, 36], de Bruijn graph (DBG) [ABySS37, IDBA38, MetaSPAdes39, SOAPdenovo240, SPAdes41, and Trinity42], or commercial software packages [CLC (https://www.qiagenbioinformatics.com/) and Geneious43] whose assembly algorithms are proprietary [Supplement Table S6]. The simulation settings for the reads were single-end reads, 250 nt read length, and 50× coverage. A total of 2,470 assemblies (247 datasets per genome × 10 assemblers) were analyzed [Supplement Figure S1a].
Experiment 2: Simulated data by varying genome length and GC content
Artificial genomes were constructed for four genome lengths: 2 Kb, 10 Kb, 100 Kb, and 1 Mb, with varying GC content from 20%–80%, in 5% increments [Supplement Figure S1b]. Datasets derived using one poliovirus genome (NC_002058) and one coronavirus genome (NC_002645) were also included in this analysis, representing the lower and upper genome length range typical of RNA viruses. The original GC content was kept constant for the poliovirus and coronavirus genomes. For all of these genomes, simulated reads for initial and mutated variants were generated as above.
A total of 13,338 SPAdes assemblies were generated, which included 12,844 assemblies for the four artificial genomes (247 datasets per genome × 4 artificial genome lengths × 13 GC content proportions × 1 assembler) and 494 assemblies for the poliovirus and coronavirus datasets (247 datasets per genome × 2 genomes × 1 assembler) [Supplement Figure S1b]. JMP v13.0.0 (www.sas.com) was used to calculate Pearson’s correlation and Spearman’s ρ values to compare the association between percent GC levels and the number of contigs produced at each PID level. Since there was little statistical difference when comparing the contig numbers generated at varying percent GC for each of the four genome length datasets (Spearman’s ρ = 0.8299 to 0.9801, p<0.001) [Supplement Excel file], the final contig number was averaged across the 13 GC percentages at a given PID. The average contig number was used for plotting the contig assembly results vs percent PID for each simulated genome length [Figures 4a-b].
Experiment 3: Simulated data by varying read length
Genome variants were generated as described above (“Creation of simulated variant genomes and reads”) for a genome of size 100 Kb with 50% GC; this was the starting initial variant genome. In this simulation, initial and mutation variant reads at four sequencing read lengths (50, 100, 150, and 250 nt) were created using ART. A total of 538 SPAdes assemblies were generated (47, 97, 147, and 247 datasets for the 50, 100, 150 and 250 nt read lengths, respectively) [Supplement Figure S1c].
Evaluation of NGS datasets from clinical samples
Four datasets derived from clinical samples containing picornaviruses (one enterovirus A71 [EV-A71], one coxsackievirus B5 [CV-B5] and two parechovirus A3 [PeV-A3]) were analyzed for this experiment, as previous sequencing analysis using Geneious indicated the presence of genome variants. The datasets were analyzed using an in-house pipeline (VPipe),18 which performs various quality control (QC) steps and de novo assembly using SPAdes. The post-QC reads were considered total reads (T)and mapped to their respective reference genome in order to determine the major and minor variants present in each sample. Total reads which mapped with high similarity (≥99%) to the major variant were categorized as reads representing the major variant (M). Unbinned reads from the major variant reference recruitment were used to construct the minor variant consensus using a second round of reference recruitment, and these reads were categorized as the minor variant (m). Remaining reads from the previous two steps were considered background (B) reads.
De novo assembly for each of the four clinical samples was performed for the following binned NGS datasets: (1) total reads only (T); (2) major variants only (M); (3) major and minor variants only (Mm); and (4) major variants and background reads only (MB). This was repeated with three assembly programs: SPAdes, Cap3, and Geneious. The length of the longest contig produced from each assembly and the performance metric UG50%.17 were calculated to compare the results for these 48 assemblies (4 experiments × 4 viruses × 3 assemblers).
Data Availability
Sequencing reads for the experiments conducted using clinical specimens are available through the NCBI Sequence Read Archive (SRA) accession PRJNA577924. Reads from simulated datasets (Experiments 1-3) are available upon request.
Funding Information
This work was supported in part by Federal appropriations to the Centers for Disease Control and Prevention, through the Advanced Molecular Detection Initiative line item.
Author contributions
All authors contributed to the conceptualization, data analysis, preparation, and review of this manuscript. C.J.C, R.L.M., and T.F.F.N. wrote this manuscript.
Competing interests
The authors declare no competing interests.
Supplemental Information
Analysis of viral GenBank records
The advent of NGS fuels viral sequencing
As of December 2017, GenBank’s non-redundant nucleotide database had grown to more than 2.3 million virus sequences, with the annual number of new sequences deposited increasing by 270% between 2007 and 2017 [Figure 1a and Supplement Table S1]. GenBank entries started incorporating information on the sequencing technology platform used in 2011. Through 2018, 144,712 viral entries (22%) had documented utilization of NGS sequencing technology, compared to 500,027 entries (77%) utilizing Sanger methods [Figure 1b and Supplement Table S1]. Illumina was the most common NGS platform used for viral sequencing, with >2-fold the number of entries compared to the next most popular NGS platform (31,000 viral entries in 2017 [Figure 1d & e]). Although NGS usage has risen tremendously, Sanger sequencing still contributed the majority of all viral sequences. This is likely because Sanger is still attractive for generating short viral sequences over genotyping windows or other informative regions. If only long sequences (≥2000 nt) are considered, NGS technologies surpassed Sanger as the dominant strategy for sequencing in 2017 [Figure 1f and Supplement Table S2].
A total of 27,217 counts of sequencing technologies were listed for the 25,344 long viral GenBank entries in 2017, as some sequences were generated using two or more sequencing technologies. NGS technologies were listed in 53.8% (14,653/27,217) of entries, versus 46.2% of entries (12,564/27,217) for Sanger. Illumina was identified as the most dominant NGS technology, accounting for 12,615/14,653 entries (86.1%) [Figure 1g and Supplement Table S2].
Multiple sequencing technologies may be used to generate viral sequence for one entry. The most common combination observed was 454 and Sanger (18,002 entries), likely due to the early emergence of the 454 technology compared to other NGS platforms [Figure 1c and Supplement Table S3]. This is followed by Illumina and Sanger (3,475), Illumina and 454 (3,299), Illumina and Ion Torrent (2,600), and Illumina and PacBio (997). Interestingly, more recently released longer-read platforms like PacBio and Oxford Nanopore tended to be paired with Illumina more frequently compared to traditional Sanger sequencing. A small number of studies even combined three or four different sequencing technologies (530 and 6 entries, respectively) [Supplement Table S4]. Some users employed a combined approach to circumvent the inherent flaws of one sequencing platform, particularly for genome finishing.44 For example, after NGS has been used to generate the majority of a RNA virus genome, RACE (Rapid amplification of cDNA ends) is typically performed with Sanger to obtain the 5’ or 3’ termini.45, 46
De novo assembly plays a major role in analyzing long viral sequences
We analyzed the assembly methods used for GenBank entries of long sequences (≥2000 nt) from 2012 to 2017 when NGS usage become relevant [Figure 1h & i and Supplement Table S5]. The number of programs used to assemble viral sequences has steadily increased over time (a >2-fold increase from 2012-2017). With new sequencing technologies emerging and computational power continually improving, the development of new and better assembly programs always follows suite. The use of specifically-designed de novo assembly programs (ABySS, BWA, Canu, Cap3, IDBA, MIRA, Newbler, SOAPdenovo, SPAdes, Trinity, and Velvet) has increased from less than 1% of viral sequence entries in 2012, to 20% of all viral sequence entries in 2017. A similar increase was observed for reference-mapping software (i.e., Bowtie and Bowtie2), from 0.03% in 2012 to 6.5% in 2017. Multifunctional programs that offer both assembly options, including CLC Genomics Workbench (CLC), DNA Baser, DNASTAR, Geneious, and Sequencher, were by far the most popular option for the years 2013-2017. However, since these commercial software packages can perform both de novo and reference-mapping assembly, the exact sequence assembly strategy used for these records is unknown, and thus the contributions of both de novo assembly and reference recruitment are likely underestimated.
Acknowledgements
We thank Dr. Steve Oberste for thoughtful suggestions on this work.