Skip to main content
bioRxiv
  • Home
  • About
  • Submit
  • ALERTS / RSS
Advanced Search
New Results

A Comparison of Performance for Different SARS-Cov-2 Sequencing Protocols

View ORCID ProfileJuanjo Bermúdez
doi: https://doi.org/10.1101/2021.03.01.433428
Juanjo Bermúdez
1Contignant Technologies SL c/Emigrant 30, Barcelona, Spain (08906)
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Juanjo Bermúdez
  • For correspondence: juanjo75es@gmail.com
  • Abstract
  • Full Text
  • Info/History
  • Metrics
  • Data/Code
  • Preview PDF
Loading

Abstract

SARS-Cov-2 genome sequencing has been identified as a fundamental tool for fighting the COVID-19 pandemic. It is used, for example, for identifying new variants of the virus and for elaborating phylogenetic trees that help to trace the spread of the virus. In the present study we provide a comprehensive comparison between the quality of the assemblies obtained from different sequencing protocols. We demonstrate how some protocols actively promoted by different high-level administrations are inefficient and how less-used alternative protocols show a significant increased performance. This increase of performance could lead to cheaper sequencing protocols and therefore to a more convenient escalation of the sequencing efforts around the world.

Introduction

There are two basic strategies to recreate a genome departing from the data obtained by the actually available sequencing machines:

  1. Recreate the genome with no prior knowledge using de novo sequence assembly

  2. Recreate the genome using prior knowledge with reference based alignment/mapping

It is generally accepted that each strategy has its own advantages and drawbacks. The quality of reference-based assembly is heavily dependent upon the choice of a closeenough reference: identification of some variantss can be missed if the sample is not close enough to the reference. In the other hand, de novo genome assembly is more computationally exigent and not always possible from the available data.

“Current variant discovery approaches often rely on an initial read mapping to the reference sequence. Their effectiveness is limited by the presence of gaps, potential misassemblies, regions of duplicates with a high-sequence similarity and regions of high-sequence divergence in the reference. Also, mapping-based approaches are less sensitive to large INDELs and complex variations” (1)

“We document that 18.6% of SNP genotype calls in HLA genes are incorrect and that allele frequencies are estimated with an error greater than ±0.1 at approximately 25% of the SNPs in HLA genes. We found a bias toward overestimation of reference allele frequency for the 1000G data, indicating mapping bias is an important cause of error in frequency estimation in this dataset.” (2)

“Detecting indels is challenging for several reasons: (1) reads overlapping the indel sequence are more difficult to map and may be aligned with multiple mismatches rather than with a gap; (2) irregularity in capture efficiency and nonuniform read distribution increase the number of false positives; (3) increased error rates makes their detection very difficult within microsatellites; and (4) localization, near identical repetitive sequences can create high rates of false positives” (3)

In an ideal scenario, researchers should have both options available: reference-mapping and de-novo assembly. If one of these is missed, the results do not count with the maximal possible reliability. And if there is the possibility to have both at the same cost, there is absolutely no reason for not having both.

For that reason it is important that the libraries for sequencing SARS-Cov-2 are designed with de novo genome assembly in mind.

Some studies have already been developed to assess the performance of the most commonly used protocols (4), but these are exclusively focused on the obtained coverage of the reads and not in the quality of the de novo assemblies. This study will establish a comparison of protocols based on the quality of the de novo assembly, which is a more exigent metric to asses the performance of the protocols. The performance of mapping to a reference genome will not be analyzed as this has already been analyzed in previous studies and a superior performance in de novo assembly is already strongly correlated to a superior performance in reference-mapping.

Method

I used different search patterns at the NCBI SRA (5) website to find SARS-Cov-2 sequencing data obtained using different protocols. Despite this is not a totally reliable method (some search terms are ambiguous) I think it can help to understand the proportions.

Table 2 shows the number of matches found for every sequencing hardware technology. Despite some protocols were developed for some specific hardware, we can see how these are being used for other hardware too. for example, there are many more ARTIC (6) results for Illumina than for Nanopore despite the protocol was initially designed for Nanopore.

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 1.

Queries at the NCBI portal

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 2.

Sequencing runs found for every protocol and hardware

See how results corresponding to the ARTIC protocol roughly correspond to 41% of all available SARS-Cov-2 runs in the SRA archive.

From the results for these queries I randomly selected some runs and downloaded the data sets. Then I assembled the data sets using the best performing genome assembly software from SPAdes (7), rnaSPAdes (8) and metaSPAdes (I will note as xSPAdes the best result obtained from these). In case the runs contained long reads Flye and Canu (9) was also applied. I finally assembled some of the short-read runs with Contignant s-aligner (10).

SPAdes, rnaSPAdes and metaSPAdes have been demonstrated to be the best-performing open-source software for viral genome de-novo assembly in different previous studies. Flye and canu are considered the best-performing assembly utilities for long-read data. Meanwhile, s-aligner is a new de novo genome assembler that has recently demonstrated superior performance for viral-genome assembly over the previous short-read assemblers.

Results

Table 3 show the results obtained.

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 3.

Sequencing results for randomly selected data-sets

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 4.

Sequencing results for runs obtained from the ARTIC protocol

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 5.

Sequencing results for runs obtained with random primers amplification

From these results, some observations can be extracted.

A. Short-read data-sets outperform long-read ones

I still have not found a long-read data-set that completes a perfect assembly. Doesn’t matter the library design or the technology employed (Nanopore or PacBio). The mean NG50 for long-read data-sets is 7.622 while any protocol using short-reads at least doubles that.

In addition, the obtained sequences have a higher misassembly rate, which makes that data less feasible for variant detection.

B. The ARTIC protocol is far from delivering optimal results

Despite being widely used (41% of runs in the SRA archive) its performance is low and far from the bestperforming protocols. If we only consider results for shortread data the mean NG50 is 16.712, which is a quite bad result.

C. The ARTIC protocol doesn’t outperform other protocols

When making use exclusively of open-source assembly software, the ARTIC protocol doesn’t even significantly outperform results from other protocols. Its NG50 mean is similar to the NG50 overall mean of all protocols using opensource software: 16.712 with ARTIC vs 15.865 overall, and slightly lower than protocols using random primers (17.220).

D. Library designs with random primers largely outperform designs with fixed primers when using s-aligner

When making use of all available software options, not only open-source, designs with random primer selection largely outperform designs with fixed primer selection, like ARTIC. If we compare the NG50 mean from results for shortread data employing ARTIC and SPAdes (16.712), it is a 71% lower than the NG50 mean obtained from random-primer data and s-aligner assembler (28.654). Indeed, the combination of s-aligner plus random-primer data guarantees in most cases an almost perfect assembly of the virus genome. Thirteen out of fifteen cases got as result an almost-perfect assembly.

This observation is corroborated by the frequent presence of gaps in the reference-mapping of runs obtained from fixed-primers designs. This is, indeed, something that could be expected from designs based on fixed primers. That limitation is already recognized by the WHO (11).

E. The ARTIC protocol under-performs even when using s-aligner as software for genome assembly

S-aligner is, in general, a better tool for viral genome assembly. But even when using it, the ARTIC protocol underperforms compared to other protocols. The average NG50 using s-aligner for ARTIC data-sets is 16.757, which is similar to the average NG50 with open-source software (16.712), but far from the average NG50 obtained with s-aligner for random-primer protocols (28.654).

F. No paired-read performance benefit over single-read

When using s-aligner as assembly software with randomprimer library designs, there is no significant difference between using paired-end data or single-read data: 28.394 (single) vs 28.654 (overall).

Conclusions

There are significant differences of performance between different protocols for sequencing the SARS-Cov-2 (figure 1). The difference of performance between using the ARTIC protocol with short-read technologies and using a randomprimer design with s-aligner is statistically significant, with p-vakue <0.00001. The difference of performance in the NG50 metric is on average 71,5%. In addition, when evaluating the perfect-assembly ratio, we find that ARTIC has a 33,3% success rate, while the s-aligner-based protocol has a 86,7% success rate. With long-read data-sets, the success rate of ARTIC is 0% and NG50 can’t even be calculated because of lack of data.

Fig. 1.
  • Download figure
  • Open in new tab
Fig. 1.

NG50 for different clusters of runs.

These results suggest that the hundreds of thousands of genome sequencing’s being done in the world to trace the spread of the virus and detect new variants are not making use of the most reliable and efficient methods. The low NG50 and perfect-assembly ratio suggest that these methods are even far from being reliable if de novo genome assembly is considered a need, as it is suggested by previous studies on the efficacy of only-mapping assembly. Mapping the data to a reference genome is usually considered a necessary but insufficient step, and it is always preferable to have a de novo assembly, being the only reason for not preferring that the unavailability of that possibility. We demonstrate in this study that there are protocols that reliably permit us to obtain de novo genome sequencing’s of SARS-Cov-2: a tool that would improve the quality of the actual efforts to trace the virus worldwide.

Discussion

Another factor for considering which protocols to use for sequencing SARS-Cov-2 is the cost. ARTIC was specifically designed to be low-cost for that reason.

When evaluating the costs of different sequencing protocols three aspects should be considered.

  1. The cost of the sequencing hardware

  2. The cost of the products per sample

  3. The overall time expended per sample

Unfortunately, I don’t have the necessary experience nor access to materials to evaluate these costs. For that reason I contacted several public-health organizations, warning them of the significant lack of performance of some protocols and offering them cooperation to find better ones. You can see on Annex I a list of entities that were contacted. None of them have acceded to cooperate at the moment of writing this manuscript. One can guess what their motivations are, but some motivations can be firmly discarded: they are not rejecting that because they are already developing equivalent studies nor because they already have the answers that such study would bring.

Even though I lack the experience to make a full analysis of the cost-effectiveness of different protocols for sequencing SARS-Cov-2, some clues can be extracted from the data in this study. We see how we can obtain reliable, almost-complete, de novo genome assemblies from data-sets under 10MB (therefore largely multiplexable), obtained with less-expensive hardware like Ion Torrent or BGI. Also with Illumina, we can establish cost-effective protocols making use of less data and single-read technology. That suggests that costeffective protocols are possible that are also reliable under a-de-novo assembly perspective and not only under a referencemapping one. The increase of performance also suggests that a higher percentage of sequencing efforts will end-up in conclusive results, therefore eliminating the cost of most inconclusive results. All that information suggest that overall more cost-effective protocols than ARTIC are possible and desirable.

Data availability statement

The data underlying this article are available as DOI: 10.5281/zenodo.4558343.

The s-aligner software is available for free at https://contignant.com for first-time users. It’s free to use for 15 days after installation. No personal identification is required but a contact email must actually be provided for downloading it.

Competing interests

I am the developer and the owner of all the rights of the s-aligner software.

Supplementary Note A: Institutions invited to cooperate

See in Table 6 the list of public institutions that were contacted whether to warn them of a possible inefficiency in the applied protocols for sequencing SARS-Cov-2 (including an offer to cooperate) or to warn them of the existence of a new tool that could have an impact on the protocols for sequencing SARS-Cov-2 (offering them also cooperation).

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 6.

Public institutions contacted to warn them of a possible improvement in public protocols for the management of the COVID-19 crisis.

Footnotes

  • https://contignant.com

  • https://doi.org/10.5281/zenodo.4558343

REFERENCES

  1. 1.↵
    Shulan Tian, Huihuang Yan, Eric W Klee, Michael Kalmbach, and Susan L Slager. Comparative analysis of de novo assemblers for variation discovery in personal genomes. Briefings in Bioinformatics, 19(5):893–904, April 2017. doi: 10.1093/bib/bbx037.
    OpenUrlCrossRef
  2. 2.↵
    Débora Y. C. Brandt, Vitor R. C. Aguiar, Bárbara D. Bitarello, Kelly Nunes, Jérôme Goudet, and Diogo Meyer. Mapping bias overestimates reference allele frequencies at the HLA Genes in the 1000 genomes project phase i data. G3&amp#58 Genes\ Genomes\ Genetics, 5(5): 931–941, March 2015. doi: 10.1534/g3.114.015784.
    OpenUrlAbstract/FREE Full Text
  3. 3.↵
    Giuseppe Narzisi, Jason A O’Rawe, Ivan Iossifov, Han Fang, Yoon ha Lee, Zihua Wang, Yiyang Wu, Gholson J Lyon, Michael Wigler, and Michael C Schatz. Accurate de novo and transmitted indel detection in exome-capture data using microassembly. Nature Methods, 11(10):1033–1036, August 2014. doi: 10.1038/nmeth.3069.
    OpenUrlCrossRefPubMedWeb of Science
  4. 4.↵
    John R Tyson, Phillip James, David Stoddart, Natalie Sparks, Arthur Wickenhagen, Grant Hall, Ji Hyun Choi, Hope Lapointe, Kimia Kamelian, Andrew D Smith, Natalie Prystajecky, Ian Goodfellow, Sam J Wilson, Richard Harrigan, Terrance P Snutch, Nicholas J Loman, and Joshua Quick. Improvements to the ARTIC multiplex PCR method for SARS-CoV-2 genome sequencing using nanopore. September 2020. doi: 10.1101/2020.09.04.283077.
    OpenUrlAbstract/FREE Full Text
  5. 5.↵
    R. Leinonen, H. Sugawara, and M. Shumway and. The sequence read archive. Nucleic Acids Research, 39(Database):D19–D21, November 2010. doi: 10.1093/nar/gkq1019.
    OpenUrlCrossRefPubMedWeb of Science
  6. 6.↵
    Artic protocol. https://artic.network/ncov-2019.
  7. 7.↵
    Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, Sergey I. Nikolenko, Son Pham, Andrey D. Prjibelski, Alexey V. Pyshkin, Alexander V. Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max A. Alekseyev, and Pavel A. Pevzner. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology, 19(5):455–477, May 2012. doi: 10.1089/cmb.2012.0021.
    OpenUrlCrossRefPubMed
  8. 8.↵
    Sergey Nurk, Dmitry Meleshko, Anton Korobeynikov, and Pavel A. Pevzner. metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5):824–834, March 2017. doi: 10.1101/gr.213959.116.
    OpenUrlAbstract/FREE Full Text
  9. 9.↵
    Sergey Koren, Brian P. Walenz, Konstantin Berlin, Jason R. Miller, Nicholas H. Bergman, and Adam M. Phillippy. Canu: scalable and accurate long-read assembly via adaptivek-mer weighting and repeat separation. Genome Research, 27(5):722–736, March 2017. doi: 10.1101/gr.215087.116.
    OpenUrlAbstract/FREE Full Text
  10. 10.↵
    Juanjo Bermúdez. s-aligner: a greedy algorithm for non-greedy de novo genome assembly. February 2021. doi: 10.1101/2021.02.02.429443.
    OpenUrlAbstract/FREE Full Text
  11. 11.↵
    World Health Organization. Genomic sequencing of SARS-CoV-2: a guide to implementation for maximum impact on public health, 8 January 2021. World Health Organization, 2021.
  12. 12.
    Contignant s-aligner. https://contignant.com/.
Back to top
PreviousNext
Posted March 01, 2021.
Download PDF
Data/Code
Email

Thank you for your interest in spreading the word about bioRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Enter multiple addresses on separate lines or separate them with commas.
A Comparison of Performance for Different SARS-Cov-2 Sequencing Protocols
(Your Name) has forwarded a page to you from bioRxiv
(Your Name) thought you would like to see this page from the bioRxiv website.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Share
A Comparison of Performance for Different SARS-Cov-2 Sequencing Protocols
Juanjo Bermúdez
bioRxiv 2021.03.01.433428; doi: https://doi.org/10.1101/2021.03.01.433428
Reddit logo Twitter logo Facebook logo LinkedIn logo Mendeley logo
Citation Tools
A Comparison of Performance for Different SARS-Cov-2 Sequencing Protocols
Juanjo Bermúdez
bioRxiv 2021.03.01.433428; doi: https://doi.org/10.1101/2021.03.01.433428

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
  • Tweet Widget
  • Facebook Like
  • Google Plus One

Subject Area

  • Genetics
Subject Areas
All Articles
  • Animal Behavior and Cognition (4859)
  • Biochemistry (10803)
  • Bioengineering (8046)
  • Bioinformatics (27325)
  • Biophysics (13987)
  • Cancer Biology (11130)
  • Cell Biology (16074)
  • Clinical Trials (138)
  • Developmental Biology (8792)
  • Ecology (13300)
  • Epidemiology (2067)
  • Evolutionary Biology (17371)
  • Genetics (11690)
  • Genomics (15932)
  • Immunology (11037)
  • Microbiology (26115)
  • Molecular Biology (10657)
  • Neuroscience (56614)
  • Paleontology (420)
  • Pathology (1735)
  • Pharmacology and Toxicology (3005)
  • Physiology (4552)
  • Plant Biology (9646)
  • Scientific Communication and Education (1615)
  • Synthetic Biology (2691)
  • Systems Biology (6979)
  • Zoology (1511)