New Results

Efficient compression and analysis of large genetic variation datasets

Ryan M. Layer, Neil Kindlon, Konrad J. Karczewski, Exome Aggregation Consortium, Aaron R. Quinlan
doi: https://doi.org/10.1101/018259
Ryan M. Layer
1 Departments of Human Genetics and Biomedical Informatics, University of Utah, Salt Lake City, UT
Neil Kindlon
2 Department of Public Health Sciences, University of Virginia, Charlottesville, VA
Konrad J. Karczewski
3 Analytical and Translational Genetics Unit, Harvard Medical School, Boston, MA
Aaron R. Quinlan
1 Departments of Human Genetics and Biomedical Informatics, University of Utah, Salt Lake City, UT

ABSTRACT

The falling cost of human genome sequencing has catalyzed ambitious efforts to interrogate the genomes of large cohorts in search of deeper insight into the genetic basis of disease. This manuscript introduces Genotype Query Tools (GQT), a new indexing strategy and toolset that enables interactive analyses based on genotypes, phenotypes, and sample relationships. GQT achieves its speed by operating directly on a compressed index, without decompressing it. GQT's compression ratios improve as cohort size grows, so, by avoiding data inflation, its analysis performance relative to other tools improves in kind. We demonstrate substantial query performance improvements over state-of-the-art tools using datasets from the 1000 Genomes Project (46-fold), the Exome Aggregation Consortium (443-fold), and simulated datasets of up to 100,000 genomes (218-fold). Moreover, our genotype indexing strategy complements existing formats and toolsets, providing a framework for current and future analyses of massive genome datasets.
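To illustrate what operating on a compressed index without decompression can look like, here is a minimal sketch in Python. It is not GQT's implementation: the abstract does not specify the encoding, so plain run-length encoding is assumed as a stand-in, and the function names (`rle_encode`, `rle_and`, `rle_count`) and the toy bitmaps are hypothetical. Each bitmap marks, per variant, whether a sample carries a non-reference allele; the AND of two bitmaps is computed run by run, so the work scales with the number of runs rather than the number of variants.

```python
from itertools import groupby


def rle_encode(bits):
    """Encode a sequence of 0/1 values as a list of (bit, run_length) pairs."""
    return [(bit, sum(1 for _ in group)) for bit, group in groupby(bits)]


def rle_and(a, b):
    """AND two run-length-encoded bitmaps of equal total length.

    The inputs are never expanded back into individual bits: each step
    consumes the overlap between the current run of `a` and of `b`.
    """
    out = []
    i = j = 0
    rem_a = a[0][1] if a else 0  # bits left in the current run of a
    rem_b = b[0][1] if b else 0  # bits left in the current run of b
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            # Extend the previous output run instead of starting a new one.
            out[-1] = (bit, out[-1][1] + step)
        else:
            out.append((bit, step))
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    return out


def rle_count(runs):
    """Count set bits directly from the runs."""
    return sum(length for bit, length in runs if bit)


# Hypothetical toy data: one bit per variant, 1 = sample carries a
# non-reference allele at that variant.
sample_a = rle_encode([0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1])
sample_b = rle_encode([0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0])

shared = rle_and(sample_a, sample_b)
print(shared)             # runs describing variants carried by both samples
print(rle_count(shared))  # number of such variants
```

Under this assumed encoding, long stretches of identical genotypes collapse into single runs, which is one way a compressed index could become relatively smaller, and queries over it relatively cheaper, as the data grow.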

URLs: All source code for the GQT toolkit is available at https://github.com/ryanlayer/gqt. All commands used for the experiments in this study are available at https://github.com/ryanlayer/gqt_paper.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted April 20, 2015.
Subject Area

  • Genomics