New Results

MetaGraph-MLA: Label-guided alignment to variable-order De Bruijn graphs

Harun Mustafa, Mikhail Karasikov, Gunnar Rätsch, André Kahles
doi: https://doi.org/10.1101/2022.11.04.514718
Harun Mustafa
1ETH Zurich, Rämistrasse 101, 8092 Zurich, Switzerland
2University Hospital Zurich, Rämistrasse 100, 8091 Zurich, Switzerland
3Swiss Institute of Bioinformatics, Zurich, Switzerland
  • For correspondence: harun.mustafa@inf.ethz.ch
Mikhail Karasikov
1ETH Zurich, Rämistrasse 101, 8092 Zurich, Switzerland
2University Hospital Zurich, Rämistrasse 100, 8091 Zurich, Switzerland
3Swiss Institute of Bioinformatics, Zurich, Switzerland
Gunnar Rätsch
1ETH Zurich, Rämistrasse 101, 8092 Zurich, Switzerland
2University Hospital Zurich, Rämistrasse 100, 8091 Zurich, Switzerland
3Swiss Institute of Bioinformatics, Zurich, Switzerland
4ETH AI Center, Binzmühlestrasse 13, 8092 Zurich, Switzerland
André Kahles
1ETH Zurich, Rämistrasse 101, 8092 Zurich, Switzerland
2University Hospital Zurich, Rämistrasse 100, 8091 Zurich, Switzerland
3Swiss Institute of Bioinformatics, Zurich, Switzerland
  • For correspondence: harun.mustafa@inf.ethz.ch

Abstract

The amount of data stored in genomic sequence databases is growing exponentially, far exceeding traditional indexing strategies’ processing capabilities. Many recent indexing methods organize sequence data into a sequence graph to succinctly represent large genomic data sets from reference genome and sequencing read set databases. These methods typically use De Bruijn graphs as the graph model or the underlying index model, with auxiliary graph annotation data structures to associate graph nodes with various metadata. Examples of metadata can include a node’s source samples (called labels), genomic coordinates, expression levels, etc.

An important property of these graphs is that the set of sequences spelled by graph walks is a superset of the set of input sequences. Thus, when aligning to graphs indexing samples derived from low-coverage sequencing sets, sequence information present in many target samples can compensate for missing information resulting from a lack of sequence coverage. Aligning a query against an entire sequence graph (as in traditional sequence-to-graph alignment) using state-of-the-art algorithms can be computationally intractable for graphs constructed from thousands of samples, potentially searching through many non-biological combinations of samples before converging on the best alignment.
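
The superset property can be illustrated with a toy example. The following is a minimal sketch, assuming a node-centric De Bruijn graph in which any two k-mers overlapping by k - 1 characters are implicitly connected; the function names and the two toy samples are hypothetical and are not taken from MetaGraph.

```python
from collections import defaultdict


def build_annotated_dbg(samples, k):
    """Map each k-mer (graph node) to the set of sample labels containing it."""
    annotation = defaultdict(set)
    for label, seq in samples.items():
        for i in range(len(seq) - k + 1):
            annotation[seq[i:i + k]].add(label)
    return dict(annotation)


def is_spelled_by_walk(annotation, seq, k):
    """True if every k-mer of seq is a node, so seq is spelled by a graph walk.

    Assumption: edges are implicit (node-centric graph), so consecutive
    k-mers of seq are always connected when both are present as nodes.
    """
    if len(seq) < k:
        return False
    return all(seq[i:i + k] in annotation for i in range(len(seq) - k + 1))


def walk_labels(annotation, seq, k):
    """Labels of each node visited while spelling seq."""
    return [sorted(annotation[seq[i:i + k]]) for i in range(len(seq) - k + 1)]


# Two hypothetical samples that share only the k-mer CGT.
samples = {"A": "ACGTT", "B": "TCGTA"}
graph = build_annotated_dbg(samples, k=3)

# ACGTA occurs in neither sample, yet the walk ACG -> CGT -> GTA spells it.
recombined = "ACGTA"
print(is_spelled_by_walk(graph, recombined, 3))
print(walk_labels(graph, recombined, 3))
```

The walk starts in a node labeled only with sample A and ends in a node labeled only with sample B, which is the kind of sample mixture a label-aware alignment has to reason about.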

To address this problem, we propose a novel alignment strategy called multi-label alignment (MLA) and an algorithm implementing this strategy using annotated De Bruijn graphs within the MetaGraph framework, called MetaGraph-MLA. MLA extends current sequence alignment scoring models with additional label change operations for incorporating mixtures of samples into an alignment, penalizing mixtures that are dissimilar in their sequence content. To overcome disconnects in the graph that result from a lack of sequencing coverage, we further extend our graph index to utilize a variable-order De Bruijn graph and introduce node length change as an operation. In this model, traversal between nodes that share a suffix of length < k – 1 acts as a proxy for inserting nodes into the graph.
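
To make the node length change operation concrete, the sketch below classifies a transition between two k-mers by the length of their longest suffix-prefix overlap. It is only an illustration of the idea stated above, assuming plain string k-mers; the function names are hypothetical, and it does not reproduce how MetaGraph-MLA represents or scores variable-order traversals.

```python
def longest_overlap(u, v):
    """Length of the longest proper suffix of u that is also a prefix of v."""
    for m in range(min(len(u), len(v)) - 1, 0, -1):
        if u[-m:] == v[:m]:
            return m
    return 0


def transition_kind(u, v):
    """Classify a transition between two k-mers of equal length k."""
    k = len(u)
    m = longest_overlap(u, v)
    if m == k - 1:
        return "standard edge"            # ordinary De Bruijn graph edge
    if m > 0:
        return "reduced-order traversal"  # shared suffix shorter than k - 1
    return "no overlap"


print(transition_kind("ACGT", "CGTA"))
print(transition_kind("ACGT", "GTAC"))
print(transition_kind("ACGT", "AAAA"))
```

In the second call the two nodes share only the suffix GT (length 2 < k - 1 = 3), so moving between them stands in for the node CGTA that a better-covered sample would have contributed.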

MetaGraph-MLA constructs an MLA of a query by chaining single-label alignments using sparse dynamic programming. We evaluate MetaGraph-MLA on simulated data against state-of-the-art sequence-to-graph aligners. We demonstrate increases in alignment lengths for simulated viral Illumina-type (by 6.5%), PacBio CLR-type (by 6.2%), and PacBio CCS-type (by 6.7%) sequencing reads, and show that the graph walks incorporated into our MLAs originate predominantly from samples of the same strain as the reads’ ground-truth samples. We envision MetaGraph-MLA as a step towards establishing sequence graph tools for sequence search against a wide variety of target sequence types.
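
The chaining step can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it uses a plain quadratic dynamic program rather than sparse dynamic programming, and it charges a fixed penalty for every label change, whereas the model described above penalizes label changes according to how dissimilar the samples are. The `Alignment` type, the scores, and the penalty values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    label: str   # sample label of this single-label alignment
    start: int   # query start (inclusive)
    end: int     # query end (exclusive), assumed > start
    score: int   # alignment score


def chain(alignments, label_change_penalty):
    """Best-scoring chain of non-overlapping alignments in query order.

    Assumption: a fixed penalty is subtracted whenever consecutive
    alignments in the chain carry different labels.
    """
    alns = sorted(alignments, key=lambda a: (a.end, a.start))
    if not alns:
        return 0, []
    best, prev = [], []
    for i, a in enumerate(alns):
        best_i, prev_i = a.score, None          # chain consisting of a alone
        for j in range(i):
            b = alns[j]
            if b.end <= a.start:                 # b precedes a on the query
                penalty = label_change_penalty if b.label != a.label else 0
                cand = best[j] + a.score - penalty
                if cand > best_i:
                    best_i, prev_i = cand, j
        best.append(best_i)
        prev.append(prev_i)
    # Backtrack from the best chain end.
    i = max(range(len(alns)), key=best.__getitem__)
    total, out = best[i], []
    while i is not None:
        out.append(alns[i])
        i = prev[i]
    return total, out[::-1]


candidates = [
    Alignment("X", 0, 50, 50),
    Alignment("Y", 50, 100, 45),
    Alignment("X", 60, 100, 30),
]
print(chain(candidates, label_change_penalty=10))
print(chain(candidates, label_change_penalty=20))
```

With a small penalty the chain switches from label X to label Y to cover the whole query; with a larger penalty it stays within label X and accepts a shorter, lower-scoring second alignment.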

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • harun.mustafa@inf.ethz.ch, mikhaika@inf.ethz.ch, gunnar.raetsch@inf.ethz.ch, andre.kahles@inf.ethz.ch

  • https://github.com/ratschlab/mla

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.
Posted November 05, 2022.
Subject Area

  • Bioinformatics