bioRxiv
New Results

Integrating long-range connectivity information into de Bruijn graphs

Isaac Turner, Kiran V Garimella, Zamin Iqbal, Gil McVean
doi: https://doi.org/10.1101/147777
Isaac Turner
Wellcome Trust Centre for Human Genetics, Oxford, OX3 7BN, UK
Kiran V Garimella
Wellcome Trust Centre for Human Genetics, Oxford, OX3 7BN, UK; Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, Oxford, OX3 7LF, UK
Zamin Iqbal
Wellcome Trust Centre for Human Genetics, Oxford, OX3 7BN, UK; European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK
Gil McVean
Wellcome Trust Centre for Human Genetics, Oxford, OX3 7BN, UK; Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, Oxford, OX3 7LF, UK

Abstract

Motivation: The de Bruijn graph is a simple and efficient data structure used in many areas of sequence analysis, including genome assembly, read error correction and variant calling. It has a single parameter k, is straightforward to implement and is tractable for large genomes with high sequencing depth. It also allows multiple samples to be represented simultaneously, which facilitates comparison. However, unlike the string graph, a de Bruijn graph does not retain the long-range information inherent in the read data. As a result, applications that rely on de Bruijn graphs can produce results that are sub-optimal given their input.
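To make the loss of long-range information concrete, the following is a minimal Python sketch of a de Bruijn graph built from two toy reads. The function name `build_de_bruijn`, the reads and the choice k = 3 are invented for illustration, and the sketch ignores reverse complements and sequencing errors, which real tools must handle.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph

# Two hypothetical reads that share the k-mer "CCC".
reads = ["AACCCGG", "TTCCCAA"]
graph = build_de_bruijn(reads, k=3)

# "CCC" has two successors, and the graph no longer records which
# branch each read took after passing through the shared k-mer.
print(sorted(graph["CCC"]))  # ['CCA', 'CCG']
```

Both "ACC" and "TCC" lead into "CCC", so the graph also admits the walk "AACCCAA", which appears in neither read. That ambiguity is the kind of lost read information the paragraph above refers to.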

Results: We present a novel assembly graph data structure, the Linked de Bruijn Graph (LdBG). Constructed by adding annotations on top of a de Bruijn graph, it stores long-range connectivity information through the graph. We show that, with error-free data, it is possible to losslessly store and recover sequence from a Linked de Bruijn Graph. Using assembly simulations, we demonstrate that the LdBG data structure outperforms both the de Bruijn graph and the String Graph Assembler (SGA). Finally, we apply the LdBG to Klebsiella pneumoniae short-read data to make large (12 kbp) variant calls, which we validate using PacBio sequencing data, and to characterise the genomic context of drug-resistance genes.
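The abstract does not specify how the annotations are stored, so the sketch below is only one hypothetical way to layer connectivity annotations over a de Bruijn graph; it is not the McCortex implementation. It assumes error-free toy reads, ignores reverse complements, and the names `build`, `add_links` and `walk` are invented. Each read leaves a "link" at its first k-mer listing the branch it took at every fork, and a traversal then consumes those choices.

```python
from collections import defaultdict

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build(reads, k):
    """Plain de Bruijn graph: k-mer -> set of successor k-mers."""
    graph = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        for a, b in zip(ks, ks[1:]):
            graph[a].add(b)
    return graph

def add_links(reads, k, graph):
    """Hypothetical annotation: at a read's first k-mer, store the
    branch the read takes at each fork it passes through."""
    links = defaultdict(list)
    for read in reads:
        ks = kmers(read, k)
        choices = [b for a, b in zip(ks, ks[1:]) if len(graph[a]) > 1]
        if choices:
            links[ks[0]].append(choices)
    return links

def walk(start, graph, links, max_steps=1000):
    """Extend from `start`, using a stored link (if any) to resolve forks."""
    pending = list(links[start][0]) if links.get(start) else []
    seq, node = start, start
    for _ in range(max_steps):
        successors = graph.get(node, set())
        if len(successors) == 1:
            node = next(iter(successors))
        elif len(successors) > 1 and pending:
            node = pending.pop(0)
        else:
            break  # dead end, or a fork with no link to resolve it
        seq += node[-1]
    return seq

reads = ["AACCCGG", "TTCCCAA"]  # invented toy reads
graph = build(reads, k=3)
links = add_links(reads, k=3, graph=graph)

print(walk("AAC", graph, {}))     # AACCC   (stops at the fork)
print(walk("AAC", graph, links))  # AACCCGG (link resolves the fork)
```

In this toy setting the annotated graph recovers each read exactly, whereas the unannotated graph stops at the first fork. This is only meant to illustrate the idea of lossless recovery with error-free data under the stated assumptions.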

Availability: Linked de Bruijn Graphs and their associated algorithms are implemented in McCortex, which is available under the MIT license at https://github.com/mcvean/mccortex.

Contact: turner.isaac{at}gmail.com.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted June 08, 2017.
Subject Area

  • Bioinformatics