bioRxiv
New Results

Algorithms in Stringomics (I): Pattern-Matching against “Stringomes”

Paolo Ferragina, Bud Mishra
doi: https://doi.org/10.1101/001669
Paolo Ferragina
Dipartimento di Informatica, University of Pisa, Italy. Email: ferragina@di.unipi.it
  • For correspondence: ferragina@di.unipi.it
Bud Mishra
Courant Institute, New York University, USA. Email: mishra@nyu.edu. The research described here was supported by an NSF Expedition in Computing Grant.
  • For correspondence: mishra@nyu.edu

Abstract

This paper reports an initial design of new data structures that generalize the idea of pattern-matching in stringology, from its traditional usage on an (unstructured) set of strings to the arena of a well-structured family of strings. In particular, the object of interest is a family of strings composed of blocks/classes of highly similar “stringlets,” which thus mimics a population of genomes made by concatenating haplotype-blocks, further constrained by haplotype-phasing. Such a family of strings, which we dub “stringomes,” is formalized in terms of a multi-partite directed acyclic graph with a source and a sink. Probably the most interesting property of stringomes is that they can be represented efficiently, with compression up to their k-th order empirical entropy, while ensuring that the compression does not hinder pattern-matching counting and reporting queries, whether internal to a block or spanning two (or a constant number of) adjacent blocks. The solutions proposed here have immediate applications to next-generation sequencing technologies, base-calling, expression profiling, variant-calling, population studies, onco-genomics, cyber-security trace analysis, and text retrieval.
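
As an informal illustration of the two query types named in the abstract, the following sketch models a stringome naively as a list of blocks of stringlets plus a set of allowed links between stringlets of adjacent blocks. It is not the compressed data structure of the paper: it scans every stringlet and every link directly, and the names `internal_occurrences`, `spanning_occurrences`, `blocks`, and `edges` are hypothetical choices made for this example only. The source and sink of the graph are left implicit.

```python
from typing import List, Set, Tuple

# A node is (block index, stringlet index within that block).
Node = Tuple[int, int]


def internal_occurrences(blocks: List[List[str]],
                         pattern: str) -> List[Tuple[int, int, int]]:
    """Report (block, stringlet, offset) for every match lying inside one stringlet."""
    hits = []
    for i, block in enumerate(blocks):
        for a, stringlet in enumerate(block):
            pos = stringlet.find(pattern)
            while pos != -1:
                hits.append((i, a, pos))
                pos = stringlet.find(pattern, pos + 1)
    return hits


def spanning_occurrences(blocks: List[List[str]],
                         edges: Set[Tuple[Node, Node]],
                         pattern: str) -> List[Tuple[Node, Node, int]]:
    """Report (u, v, k) where pattern[:k] ends stringlet u and pattern[k:] starts v.

    Only links listed in `edges` are considered, which is how this toy model
    stands in for the phasing constraint between adjacent blocks.
    """
    hits = []
    for u, v in sorted(edges):
        left = blocks[u[0]][u[1]]
        right = blocks[v[0]][v[1]]
        # Try every way of splitting the pattern across the block boundary.
        for k in range(1, len(pattern)):
            if left.endswith(pattern[:k]) and right.startswith(pattern[k:]):
                hits.append((u, v, k))
    return hits


# Toy stringome: two blocks of two stringlets each, with three allowed links.
blocks = [["ACGT", "ACGA"], ["TTGC", "TAGC"]]
edges = {((0, 0), (1, 0)), ((0, 0), (1, 1)), ((0, 1), (1, 1))}

print(internal_occurrences(blocks, "CG"))
print(spanning_occurrences(blocks, edges, "GTT"))
```

The counting query is the length of either result list; the reporting query is the list itself. The running time of this sketch grows with the total size of the stringlets and the number of links, which is the cost a compressed index of the kind described in the abstract is meant to avoid.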

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY Unported 3.0 license.
Posted January 02, 2014.
Subject Area

  • Bioinformatics