bioRxiv
New Results

Open-source mapping and variant calling for large-scale NGS data from original base-quality scores

Olga Krasheninina, Yih-Chii Hwang, Xiaodong Bai, Aleksandra Zalcman, Evan Maxwell, Jeffrey G. Reid, William J. Salerno Jr.
doi: https://doi.org/10.1101/2020.12.15.356360
Olga Krasheninina
1Regeneron Genetics Center, Tarrytown, NY 10591, USA
Yih-Chii Hwang
2DNAnexus, Mountain View, CA 94040, USA
Xiaodong Bai
1Regeneron Genetics Center, Tarrytown, NY 10591, USA
Aleksandra Zalcman
2DNAnexus, Mountain View, CA 94040, USA
Evan Maxwell
1Regeneron Genetics Center, Tarrytown, NY 10591, USA
Jeffrey G. Reid
1Regeneron Genetics Center, Tarrytown, NY 10591, USA
William J. Salerno Jr.
1Regeneron Genetics Center, Tarrytown, NY 10591, USA
  • For correspondence: william.salerno@regeneron.com

Abstract

Standardized genome informatics protocols minimize reprocessing costs and facilitate harmonization across studies if implemented in a transparent, accessible and reproducible manner. Here we define the OQFE protocol, a lossless read-mapping protocol that retains key features of existing NGS standard methods. We demonstrate that variants can be called directly from NovaSeq OQFE data without the need for base quality score recalibration and describe a large-scale variant calling protocol for OQFE data. The OQFE protocol is open-source and a containerized implementation is provided.

Introduction

Public genomic initiatives such as the UK Biobank [1], the 1000 Genomes Project [2], and the Human Genome Diversity Project [3] have established study populations and data resources that support a wide range of research, from human health to technology development, both as standalone data and as large-scale assets to complementary research programs. Given the size of these datasets, increasingly measured in hundreds of thousands of samples and petabytes of data, routine or custom reprocessing is infeasible for all but the most well-resourced users. However, such reprocessing is inevitable in the long term as methods improve and the datasets themselves grow.

Standardized genomic data protocols can obviate reprocessing by allowing users to harmonize their own data with large resources, ensuring the interoperability of datasets. In 2017, researchers defined functionally equivalent (FE) pipelines for sequence read mapping that were implemented across multiple large-scale sequencing projects, harmonizing more than 400,000 whole-genome samples' worth of data with a three-fold reduction in size. This reduction was achieved largely through lossless reference-based compression (CRAM) and a lossy quality-score binning from the native HiSeq X 8-value scheme to a recalibrated 4-value scheme [4].

The original quality functionally equivalent (OQFE) protocol presented here adapts the FE protocol so that the original raw read data (i.e. FASTQ files) can be recovered from the resulting CRAM files. Applied to NovaSeq data, which have natively 4-valued quality scores, OQFE CRAMs are comparable in size to FE CRAMs. Minor updates of constituent programs are made to resolve known issues. Variants can be called directly from these CRAM files, as demonstrated with the DeepVariant [5] and GLnexus [6,7] protocol described below.

Methods

OQFE Protocol

The OQFE protocol maps raw reads (FASTQ) with BWA-MEM to the GRCh38 reference in a deterministic manner, retaining all supplementary alignments. Mate tags are added with samblaster as specified in the FE protocol. OQFE CRAMs contain all reads from the input FASTQs and meet all FE tag specifications. Duplicate reads are then marked with Picard 2.21.2, which resolves a known issue with the FE version of Picard (2.4.1), in which the representative read in a set of duplicate reads can depend on the sequence input order, potentially resulting in an order-dependent set of supplementary alignment duplicates. The final OQFE CRAM is compressed with samtools, without any base quality score recalibration or binning. OQFE CRAMs are thus forward compatible with the FE quality score recalibration and binning steps. Table 1 details the software versions, references and commands for each step and notes differences from the FE protocol.

Table 1

OQFE DeepVariant Protocol

Variants were called on each CRAM with DeepVariant 0.10.0 [5] using a deep learning model retrained on exome data sequenced with the same protocol as was used to sequence the UK Biobank samples [8]. Variant calls were restricted to the exome capture region and the 100 base-pairs flanking each capture target, resulting in a gVCF (genomic VCF) for each sample containing all variant genotypes and compressed representations of reference regions without called variant genotypes.
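The restriction to capture targets plus 100 flanking base-pairs can be illustrated with a minimal sketch. It assumes BED-style, 0-based half-open intervals; the function name `pad_and_merge` and the example coordinates are hypothetical and are not taken from the protocol's resource files. Chromosome lengths are not known here, so padded ends are not clipped at chromosome boundaries.

```python
from typing import List, Tuple

# (chrom, start, end) in BED-style 0-based, half-open coordinates.
Interval = Tuple[str, int, int]


def pad_and_merge(targets: List[Interval], flank: int = 100) -> List[Interval]:
    """Extend each target by `flank` bp on both sides and merge overlaps."""
    # Pad every interval, clamping the start at 0 (ends are left unclipped).
    padded = sorted((c, max(0, s - flank), e + flank) for c, s, e in targets)
    merged: List[Interval] = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical capture targets, for illustration only.
targets = [("chr1", 1000, 1200), ("chr1", 1350, 1500), ("chr2", 50, 80)]
for chrom, start, end in pad_and_merge(targets):
    print(f"{chrom}\t{start}\t{end}")
```

The first two example targets are 150 bp apart, so their padded regions overlap and are reported as a single calling region. Chromosomes are ordered lexicographically here, which may differ from reference order.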

The OQFE protocol was applied to the 200,000 UK Biobank (UKB 200K) exome samples [9] with the containerized OQFE pipeline (https://hub.docker.com/r/dnanexus/oqfe). Per-sample gVCFs were generated via the DeepVariant protocol described above and merged with GLnexus 1.2.6 using the default ‘DeepVariantWES’ parameters [6,7]. Table 2 provides exact commands and access to all required resource files.
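As a rough sketch of the merging step, the snippet below only assembles a `glnexus_cli` command line using the ‘DeepVariantWES’ configuration named above; it does not run anything. The helper name `glnexus_command`, the file names, and the optional use of `--bed` are assumptions for illustration, and the exact commands used in this work are those given in Table 2.

```python
import shlex
from typing import List, Optional


def glnexus_command(gvcfs: List[str], bed: Optional[str] = None) -> List[str]:
    """Build (but do not execute) a glnexus_cli joint-calling command."""
    if not gvcfs:
        raise ValueError("at least one gVCF is required")
    cmd = ["glnexus_cli", "--config", "DeepVariantWES"]
    if bed is not None:
        # Optionally restrict joint calling to regions in a BED file.
        cmd += ["--bed", bed]
    # Sort inputs so the assembled command is the same for any input order.
    return cmd + sorted(gvcfs)


# Hypothetical file names, for illustration only.
cmd = glnexus_command(["sampleB.g.vcf.gz", "sampleA.g.vcf.gz"], bed="targets.bed")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

`glnexus_cli` writes the merged calls as BCF to standard output, so in practice the command would be redirected to a file or piped into a converter.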

Table 2

HG002 benchmark data

Two sets of NovaSeq exome sequence data were generated from the HG002 control sample [10] via the exome sequencing protocol applied to UK Biobank samples [8] and then mapped via the OQFE protocol. Two additional CRAMs were created from each HG002 OQFE CRAM by recalibrating the base qualities (+BQSR CRAM) and then applying the FE binning strategy (+BQSR+FEbin CRAM) as described in the FE protocol [4]. Original quality scores are not retained in either type of derived CRAM. Variants were called on all HG002 CRAMs with the OQFE DeepVariant protocol within the exome capture regions and evaluated with hap.py 0.3.8 [11] against the Genome in a Bottle HG002 high-confidence variants (v3.3.2) within the corresponding HG002 high-confidence regions [12,13].

Results

NovaSeq OQFE CRAMs retain original quality scores with only a modest increase in size (10-12%) compared to FE CRAMs and are approximately one-third the size of CRAMs with recalibrated quality scores (Table S2). The UKB 200K exome CRAMs (n=200,643) average 838 MB per sample, totaling approximately 174 TB. Compared to native NovaSeq data (i.e. read-name-sorted and compressed FASTQs), OQFE CRAMs maintain the three-fold reduction in size offered by FE CRAMs (Table S2).

To demonstrate that variants can be called directly from NovaSeq exome OQFE CRAMs without a loss of quality, we compared HG002 variant performance at two coverages (45x and 70x) and three quality score binning strategies: OQFE (native 4-valued), +BQSR (40-valued), +BQSR+FEbin (non-linear 4-valued). On average 21.6k SNVs and 880 indels per CRAM were compared to 22,587 high-confidence HG002 variants (21,675 SNVs and 912 indels), providing precision, recall and F1 scores for each of the six experiments (Table S1). As shown in Figure 1, variant performance varies more with coverage and with variant type than it does with quality-score binning. Summing across variant types and coverages (Table S1), we observe that OQFE has slightly fewer false negatives (FN=384) and false positives (FP=127) than either +BQSR (FN=390, FP=130) or +BQSR+FEbin (FN=394, FP=163).
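For readers less familiar with these metrics, the sketch below shows the standard definitions of precision, recall and F1 in terms of true positives (TP), false positives (FP) and false negatives (FN). The counts in the usage example are hypothetical round numbers, not values from Table S1, and the sketch ignores the finer points of how hap.py matches and counts variants.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from TP/FP/FN counts."""
    # Precision: fraction of called variants that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of benchmark variants that were called.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=10)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, a small shift in either FP or FN moves it only slightly when TP is large, which is consistent with the small F1 differences reported between binning strategies.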

Figure 1: Comparison of quality score binning strategies by variant type. F1 scores for SNVs differ by less than 0.0008 between all binning strategies at each coverage, and OQFE indel F1 scores are within 0.0012 of +BQSR values.

Discussion

Public genomic data represent significant investments in money, human effort and subject participation, all of which demand the data be both equitably actionable in the short term and durable in the long term. The size and complexity of public genomic datasets present a barrier to many users, even if the data are freely accessible. At the same time, the data must be amenable to current and future research that requires reprocessing (e.g. a new reference genome). While OQFE CRAMs are lossless relative to FASTQs and thus a durable long-term resource, they are reference-coordinate sorted and compressed. If FASTQs are required, we recommend OQFE CRAMs be name-sorted prior to conversion to avoid reference-specific correlation in the FASTQ read order.
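The name-sort-then-convert recommendation could look roughly like the sketch below, which only assembles two samtools command lines and does not execute them. The helper name `cram_to_fastq_commands`, the file names and the thread count are hypothetical, and this is not the validated procedure from the OQFE container. Note that `samtools fastq` skips secondary and supplementary alignments by default, so each read is written once.

```python
import shlex
from typing import List


def cram_to_fastq_commands(cram: str, reference: str, prefix: str,
                           threads: int = 4) -> List[List[str]]:
    """Build commands to name-sort a CRAM and then extract paired FASTQs."""
    name_sorted = f"{prefix}.namesorted.bam"
    # Step 1: sort by read name so output order no longer follows
    # reference coordinates.
    sort_cmd = ["samtools", "sort", "-n", "-@", str(threads),
                "--reference", reference, "-o", name_sorted, cram]
    # Step 2: write read 1 / read 2 to separate files; discard reads
    # flagged as neither/both (-0) and singletons (-s); -n leaves read
    # names without /1 and /2 suffixes.
    fastq_cmd = ["samtools", "fastq", "-@", str(threads),
                 "-1", f"{prefix}_R1.fastq.gz",
                 "-2", f"{prefix}_R2.fastq.gz",
                 "-0", "/dev/null", "-s", "/dev/null",
                 "-n", name_sorted]
    return [sort_cmd, fastq_cmd]


# Hypothetical inputs, for illustration only.
for cmd in cram_to_fastq_commands("sample.oqfe.cram", "GRCh38.fa", "sample"):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting read order follows samtools' name ordering and is not claimed to match the order of the original FASTQ files; the aim is only to remove the coordinate-sorted structure.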

We also note that the OQFE protocol avoids potential overbinning of NovaSeq quality scores. The FE protocol assigns all recalibrated quality scores greater than 23 (PHRED scale) a value of 30. When applied to the native NovaSeq quality scores of 2, 11, 25 and 37, the FE binning would both fail to distinguish between the two highest quality scores and deflate the highest value. While the OQFE+DV results described here are largely similar across quality-score binning strategies, we recommend that users with NovaSeq data evaluate any quality-score processing with respect to their variant calling protocol prior to analysis.
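The overbinning effect can be seen in a small sketch that applies only the rule stated above (scores greater than 23 become 30) to the four native NovaSeq values. How the FE scheme treats lower scores is not described in this section, so the function leaves them unchanged as a placeholder assumption; the function name `fe_top_bin` is hypothetical.

```python
# Native NovaSeq quality values, as listed in the text.
NOVASEQ_SCORES = (2, 11, 25, 37)


def fe_top_bin(q: int) -> int:
    """Apply only the stated FE rule: any score above 23 is set to 30.

    Scores of 23 or below are returned unchanged; this is a placeholder,
    not the FE protocol's actual handling of lower values.
    """
    return 30 if q > 23 else q


binned = {q: fe_top_bin(q) for q in NOVASEQ_SCORES}
print(binned)
print("distinct values before:", len(set(NOVASEQ_SCORES)))
print("distinct values after: ", len(set(binned.values())))
```

Under this rule the native scores 25 and 37 collapse to the same value, and the highest score drops from 37 to 30, which is the loss of resolution and deflation described above.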

Lastly, we recognize that the cost to egress, store and reprocess data is compounded by the expertise required to maintain, optimize and execute genomic software at scale. To this end, all methods described here rely only on open-source software, and we provide a single containerized OQFE pipeline with all required source and validation files that can be executed on any local or cloud infrastructure that supports Docker containers. This ‘open-source-first’ policy combined with standardized descriptions ensures that users can execute these exact methods autonomously on standard hardware while also enabling commercial providers to facilitate accelerated and at-scale processing with specialized technology.

Footnotes

  • https://hub.docker.com/r/dnanexus/oqfe

References

  1. Sudlow C, Gallacher J, Allen N, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12(3):e1001779. doi:10.1371/journal.pmed.1001779
  2. Auton A, Abecasis GR, Altshuler DM, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. doi:10.1038/nature15393
  3. Cavalli-Sforza LL. The Human Genome Diversity Project: past, present and future. Nat Rev Genet. 2005;6(4):333–340. doi:10.1038/nrg1596
  4. Regier AA, Farjoun Y, Larson DE, et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nat Commun. 2018;9(1):4038. doi:10.1038/s41467-018-06159-4
  5. Poplin R, Chang P-C, Alexander D, et al. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol. 2018;36(10):983–987. doi:10.1038/nbt.4235
  6. Yun T, Li H, Chang P-C, Lin MF, Carroll A, McLean CY. Accurate, scalable cohort variant calls using DeepVariant and GLnexus. bioRxiv. Published online May 2, 2020:2020.02.10.942086. doi:10.1101/2020.02.10.942086
  7. Lin MF, Rodeh O, Penn J, et al. GLnexus: joint variant calling for large cohort sequencing. bioRxiv. Published online June 11, 2018:343970. doi:10.1101/343970
  8. Van Hout CV, Tachmazidou I, Backman JD, et al. Exome sequencing and characterization of 49,960 individuals in the UK Biobank. Nature. 2020;586(7831):749–756. doi:10.1038/s41586-020-2853-0
  9. Szustakowski JD, Balasubramanian S, Sasson A, et al. Advancing Human Genetics Research and Drug Discovery through Exome Sequencing of the UK Biobank. medRxiv. Published online November 4, 2020:2020.11.02.20222232. doi:10.1101/2020.11.02.20222232
  10. Zook JM, Catoe D, McDaniel J, et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data. 2016;3(1):160025. doi:10.1038/sdata.2016.25
  11. Krusche P. Illumina/Hap.Py. Illumina; 2020. Accessed December 6, 2020. https://github.com/Illumina/hap.py
  12. Zook JM, McDaniel J, Olson ND, et al. An open resource for accurately benchmarking small variant and reference calls. Nat Biotechnol. 2019;37(5):561–566. doi:10.1038/s41587-019-0074-6
  13. Krusche P, Trigg L, Boutros PC, et al. Best Practices for Benchmarking Germline Small Variant Calls in Human Genomes. Nat Biotechnol. 2019;37(5):555–560. doi:10.1038/s41587-019-0054-x
  14. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinforma Oxf Engl. 2009;25(14):1754–1760. doi:10.1093/bioinformatics/btp324
  15. Faust GG, Hall IM. SAMBLASTER: fast duplicate marking and structural variant read extraction. Bioinforma Oxf Engl. 2014;30(17):2503–2505. doi:10.1093/bioinformatics/btu314
  16. Tarasov A, Vilella AJ, Cuppen E, Nijman IJ, Prins P. Sambamba: fast processing of NGS alignment formats. Bioinforma Oxf Engl. 2015;31(12):2032–2034. doi:10.1093/bioinformatics/btv098
  17. Picard Toolkit. Broad Institute; 2019. http://broadinstitute.github.io/picard/
  18. https://github.com/broadinstitute/picard/pull/1236
  19. Hsi-Yang Fritz M, Leinonen R, Cochrane G, Birney E. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 2011;21(5):734–740. doi:10.1101/gr.114819.110
Posted December 16, 2020.