bioRxiv
New Results

Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees

Brad Solomon, Carleton Kingsford
doi: https://doi.org/10.1101/017087
Brad Solomon
Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Carleton Kingsford
Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
  • For correspondence: carlk@cs.cmu.edu

Abstract

Enormous databases of short-read RNA-seq experiments, such as the NIH Sequence Read Archive (SRA), are now available. However, these collections remain difficult to use because they cannot be searched for a particular expressed sequence. A natural question is which of these experiments contain reads that indicate the expression of a particular sequence, such as a gene isoform, lncRNA, or uORF. At present, answering this question is computationally demanding at the scale of these databases.

We introduce an indexing scheme, the Sequence Bloom Tree (SBT), to support sequence-based querying of terabase-scale collections of thousands of short-read sequencing experiments. We apply SBT to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2652 publicly available RNA-seq experiments contained in the NIH SRA for breast, blood, and brain tissues, comprising 5 terabytes of sequence. SBTs of this size can be queried for a 1000 nt sequence in 19 minutes using less than 300 MB of RAM, over 100 times faster than standard usage of SRA-BLAST and 119 times faster than STAR. SBTs allow for fast identification of experiments with expressed novel isoforms, even if these isoforms were unknown at the time the SBT was built. We also provide some theoretical guidance about appropriate parameter selection in SBT and propose a sampling-based scheme for potentially scaling SBT to even larger collections of files. Although SBT can handle any set of reads, we demonstrate its effectiveness by searching a large collection of blood, brain, and breast RNA-seq files for all 214,293 known human transcripts to identify tissue-specific transcripts.

The implementation used in the experiments below is in C++ and is available as open source at http://www.cs.cmu.edu/~ckingsf/software/bloomtree.
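
The abstract does not spell out how an SBT is laid out, so the following Python sketch is only an illustration under stated assumptions, not the authors' C++ implementation. It assumes that each leaf holds a Bloom filter of the k-mers in one experiment, that each internal node holds the union of its children's filters, and that a query descends into a subtree only when at least a fraction theta of its k-mers appear in that node's filter. The names BloomFilter, Node, leaf, build, and query, the naive pairwise tree construction, and the values of K, theta, and the filter size are all hypothetical choices made for this example.

```python
import hashlib

K = 20  # k-mer length; an assumed default for this sketch


class BloomFilter:
    """Small Bloom filter backed by a Python int used as a bit vector."""

    def __init__(self, size=1 << 16, num_hashes=1):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, kmer):
        # One salted BLAKE2b hash per hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.size

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits |= 1 << pos

    def __contains__(self, kmer):
        return all((self.bits >> pos) & 1 for pos in self._positions(kmer))

    def union(self, other):
        merged = BloomFilter(self.size, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


def kmers(seq, k=K):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class Node:
    def __init__(self, bloom, name=None, children=()):
        self.bloom = bloom
        self.name = name  # experiment name for leaves, None otherwise
        self.children = list(children)


def leaf(name, reads, k=K):
    """Build a leaf whose filter holds every k-mer of one experiment."""
    bloom = BloomFilter()
    for read in reads:
        for kmer in kmers(read, k):
            bloom.add(kmer)
    return Node(bloom, name=name)


def build(leaves):
    """Pair nodes bottom-up; each parent stores the union of its children."""
    level = list(leaves)
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 1:
                parents.append(pair[0])
            else:
                merged = pair[0].bloom.union(pair[1].bloom)
                parents.append(Node(merged, children=pair))
        level = parents
    return level[0]


def query(node, query_kmers, theta=0.9):
    """Return names of leaves holding at least theta of the query k-mers."""
    if not query_kmers:
        return []
    hits = sum(1 for kmer in query_kmers if kmer in node.bloom)
    if hits < theta * len(query_kmers):
        return []  # prune: no experiment below this node can match
    if not node.children:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(query(child, query_kmers, theta))
    return found


# Toy usage with short reads and k = 5 (far smaller than real data).
toy_leaves = [
    leaf("expA", ["ACGTACGTTAGC"], k=5),
    leaf("expB", ["GGGGGCCCCCAAAAA"], k=5),
    leaf("expC", ["TTACGTACGTTAGCAA"], k=5),
    leaf("expD", ["CCCCCGGGGG"], k=5),
]
root = build(toy_leaves)
matches = query(root, kmers("ACGTACGTTAGC", k=5))
```

In this toy run, `matches` is expected to name `expA` and `expC`, the two experiments whose reads contain the query. Because a Bloom filter has no false negatives, an experiment that truly contains enough of the query k-mers is never pruned; false positives can only add extra candidates, and how often they do depends on the filter size and number of hash functions.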

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted March 26, 2015.
Subject Area

  • Bioinformatics