bioRxiv
New Results

de novo variant calling identifies cancer mutation profiles in the 1000 Genomes Project

Jeffrey K. Ng, Pankaj Vats, Elyn Fritz-Waters, Stephanie Sarkar, Eleanor I. Sams, Evin M. Padhi, Zachary L. Payne, Shawn Leonard, Marc A. West, Chandler Prince, Lee Trani, Marshall Jansen, George Vacek, Mehrzad Samadi, Timothy T. Harkins, Craig Pohl, Tychele N. Turner
doi: https://doi.org/10.1101/2021.05.27.445979
Jeffrey K. Ng
1Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA
Pankaj Vats
2NVIDIA Corporation, 2788 San Tomas Expressway Corporation, Santa Clara, CA 95051
Elyn Fritz-Waters
3Research Infrastructure Services, Washington University School of Medicine, St. Louis, MO 63110, USA
Stephanie Sarkar
1Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA
Eleanor I. Sams
1Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA
Evin M. Padhi
1Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA
Zachary L. Payne
1Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA
Shawn Leonard
3Research Infrastructure Services, Washington University School of Medicine, St. Louis, MO 63110, USA
Marc A. West
2NVIDIA Corporation, 2788 San Tomas Expressway Corporation, Santa Clara, CA 95051
Chandler Prince
3Research Infrastructure Services, Washington University School of Medicine, St. Louis, MO 63110, USA
Lee Trani
4McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO 63110, USA
Marshall Jansen
3Research Infrastructure Services, Washington University School of Medicine, St. Louis, MO 63110, USA
George Vacek
2NVIDIA Corporation, 2788 San Tomas Expressway Corporation, Santa Clara, CA 95051
Mehrzad Samadi
2NVIDIA Corporation, 2788 San Tomas Expressway Corporation, Santa Clara, CA 95051
Timothy T. Harkins
2NVIDIA Corporation, 2788 San Tomas Expressway Corporation, Santa Clara, CA 95051
Craig Pohl
3Research Infrastructure Services, Washington University School of Medicine, St. Louis, MO 63110, USA
Tychele N. Turner
1Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA
  • For correspondence: tychele@wustl.edu

ABSTRACT

Detection of de novo variants (DNVs) is critical for studies of disease-related variation and mutation rates. We developed a GPU-based workflow (HAT) to rapidly call DNVs and demonstrated its effectiveness by applying it to 4,216 Simons Simplex Collection (SSC) whole-genome sequenced parent-child trios with DNA derived from blood. In our SSC DNV data, we identified 78 ± 15 DNVs per individual, 18% ± 5% at CpG sites, 75% ± 9% phased to the paternal chromosome of origin, and an average allele balance of 0.49. These values are all in line with DNV expectations. We sought to build a control DNV dataset by running HAT on 602 whole-genome sequenced parent-child trios with DNA derived from lymphoblastoid cell lines (LCLs) from the publicly available 1000 Genomes Project (1000G). In our 1000G DNV data, we identified 740 ± 967 DNVs per individual, 14% ± 4% at CpG sites, 61% ± 11% phased to the paternal chromosome of origin, and an average allele balance of 0.41. Of the 602 trios, 80% had > 100 DNVs, and we hypothesized that the excess DNVs were cell line artifacts. Several lines of evidence in our data suggest that this is true and that 1000G does not appear to be a static reference. By mutation profile analysis, we tested whether these cell line artifacts were random and found that 40% of individuals in 1000G did not have random DNV profiles; rather, they had DNV profiles matching B-cell lymphoma. Furthermore, we saw a significant excess of protein-coding DNVs in 1000G in the gene IGLL5, which has already been implicated in this cancer. As a result of cell line artifacts, 1000G has variants present in DNA repair genes and at ClinVar pathogenic or likely-pathogenic sites. Our study elucidates important implications of the use of sequencing data from LCLs for both reference-building projects and disease-related projects in which these data are used in variant filtering steps.
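
The per-individual quantities reported in the abstract (DNV count, fraction at CpG sites, fraction phased to the paternal chromosome, and mean allele balance) can be illustrated with a minimal Python sketch. This is not the HAT workflow: the `DNV` record, its field names, and the `summarize` helper are hypothetical, allele balance is assumed to be alternate reads divided by total reads at the site, the paternal fraction is assumed to be taken over phased DNVs only, and the 100-DNV cutoff simply mirrors the "> 100 DNVs" figure quoted above.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class DNV:
    """Hypothetical record for one de novo call in one child."""
    sample: str
    is_cpg: bool
    ref_reads: int
    alt_reads: int
    parent_of_origin: Optional[str] = None  # "paternal", "maternal", or None if unphased


def allele_balance(v: DNV) -> float:
    # Assumed definition: fraction of reads supporting the alternate allele.
    # Assumes the site has at least one read.
    return v.alt_reads / (v.ref_reads + v.alt_reads)


def summarize(dnvs, excess_threshold=100):
    """Group calls by sample and compute the per-individual summary values."""
    by_sample = defaultdict(list)
    for v in dnvs:
        by_sample[v.sample].append(v)

    summary = {}
    for sample, calls in by_sample.items():
        phased = [c for c in calls if c.parent_of_origin is not None]
        paternal = sum(c.parent_of_origin == "paternal" for c in phased)
        summary[sample] = {
            "n_dnvs": len(calls),
            "cpg_fraction": sum(c.is_cpg for c in calls) / len(calls),
            # Assumption: the paternal fraction is computed over phased DNVs only.
            "paternal_fraction": paternal / len(phased) if phased else None,
            "mean_allele_balance": mean(allele_balance(c) for c in calls),
            # Flag samples above the illustrative excess cutoff.
            "excess": len(calls) > excess_threshold,
        }
    return summary
```

Under these assumptions, a blood-derived sample would be expected to show a mean allele balance near 0.5, while a lower value such as the 0.41 reported for 1000G would be consistent with variants present in only a subset of cells.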

Competing Interest Statement

Competing interests: P.V., M.W., G.V., and T.T.H. are full-time employees of NVIDIA.

Footnotes

  • These are the updates to the paper: We analyzed whole-genome sequencing data from an additional 4,216 trios (n = 8,922 individuals) in which the sequenced DNA was derived from blood. This included calling DNVs with our new workflow and phasing the data. The sample size of the paper increased from 1,806 individuals to 10,728 individuals (~6x more genomes). These data also give us the opportunity to demonstrate the performance of our de novo variant caller on DNA derived from blood. With this information, we have added a new component to the study that compares the blood-derived DNA results with the cell-line-derived DNA results. We made a CPU-based version of the code, to be released freely and publicly. We performed Sanger sequencing on 25 random DNVs from the NA12878 trio, and 92% were confirmed as real DNVs. This adds to the manual DNV scoring we did by visualizing read data at 3,980 DNVs.
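
The sample-size figures in this update note can be cross-checked with a few lines of Python. This is only an arithmetic sketch: it assumes the original 1,806 individuals correspond to the 602 1000G trios at three distinct individuals per trio, and that the 92% Sanger confirmation rate refers to a whole number of the 25 tested DNVs. The variable names are illustrative.

```python
# Figures quoted in the abstract and in the update note.
trios_1000g = 602
ssc_individuals = 8_922
sanger_tested = 25
sanger_rate = 0.92

# Assumption: each 1000G trio contributes three distinct individuals.
individuals_1000g = trios_1000g * 3

# Combined sample size after adding the blood-derived SSC genomes.
total_individuals = individuals_1000g + ssc_individuals

# Fold increase relative to the original submission.
fold_increase = total_individuals / individuals_1000g

# Number of Sanger-confirmed DNVs implied by the reported rate.
sanger_confirmed = round(sanger_rate * sanger_tested)

print(individuals_1000g, total_individuals, round(fold_increase, 2), sanger_confirmed)
```

Under these assumptions the totals agree with the note: 1,806 plus 8,922 gives 10,728 individuals, roughly a sixfold increase, and 92% of 25 corresponds to 23 confirmed DNVs.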

  • https://app.globus.org/file-manager?origin_id=3eff453a-88f4-11eb-954f-752ba7b88ebe&origin_path=%2F

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted October 27, 2021.
bioRxiv 2021.05.27.445979; doi: https://doi.org/10.1101/2021.05.27.445979
Subject Area

  • Genomics