bioRxiv
New Results

MetaSRA: normalized sample-specific metadata for the Sequence Read Archive

Matthew N. Bernstein, AnHai Doan, Colin N. Dewey
doi: https://doi.org/10.1101/090506
Matthew N. Bernstein
1Department of Computer Sciences
AnHai Doan
1Department of Computer Sciences
3Center for Predictive Computational Phenotyping, University of Wisconsin-Madison
Colin N. Dewey
1Department of Computer Sciences
2Department of Biostatistics and Medical Informatics
3Center for Predictive Computational Phenotyping, University of Wisconsin-Madison

Abstract

Motivation The NCBI’s Sequence Read Archive (SRA) promises great biological insight if its data can be analyzed in aggregate; however, the data remain largely underutilized, in part because of the poor structure of the metadata associated with each sample. The rules governing submissions to the SRA do not dictate a standardized set of terms for describing the biological samples from which the sequencing data are derived. As a result, the metadata include many synonyms, spelling variants, and references to outside sources of information. Furthermore, manual annotation of the data is intractable because of the large number of samples in the archive. For these reasons, it has been difficult to perform large-scale analyses that study the relationships between biomolecular processes and phenotype across the diverse diseases, tissues, and cell types present in the SRA.

Results We present MetaSRA, a database of normalized SRA sample-specific metadata following a schema inspired by the metadata organization of the ENCODE project. This schema involves mapping samples to terms in biomedical ontologies, labeling each sample with a sample-type category, and extracting real-valued properties. We automated these tasks via a novel computational pipeline.
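
To make the three parts of the schema concrete, the following is a minimal Python sketch of what a normalized record could look like. It is not the MetaSRA pipeline: the `NormalizedSample` class, the `normalize` function, the `TOY_ONTOLOGY` table with its placeholder `TOY:` identifiers, and the sample-type rule are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical synonym table. The "TOY:" IDs are placeholders,
# not identifiers from any real biomedical ontology.
TOY_ONTOLOGY = {
    "liver": "TOY:0001",
    "hepatic tissue": "TOY:0001",
    "breast cancer": "TOY:0002",
    "breast carcinoma": "TOY:0002",
}

# Matches values such as "25 years" or "3.5 mg" (a number followed by a unit).
NUMBER_WITH_UNIT = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*$")


@dataclass
class NormalizedSample:
    """Illustrative record with the three parts named in the abstract."""
    ontology_terms: set = field(default_factory=set)
    sample_type: str = "unknown"
    real_valued: dict = field(default_factory=dict)


def normalize(raw: dict) -> NormalizedSample:
    """Normalize raw key-value metadata using toy, assumed rules."""
    sample = NormalizedSample()
    for key, value in raw.items():
        text = value.strip().lower()
        # 1. Map free text to an ontology term via exact synonym lookup.
        term = TOY_ONTOLOGY.get(text)
        if term is not None:
            sample.ontology_terms.add(term)
            continue
        # 2. Extract a real-valued property as (number, unit).
        match = NUMBER_WITH_UNIT.match(text)
        if match:
            name = key.strip().lower()
            sample.real_valued[name] = (float(match.group(1)), match.group(2))
    # 3. Assign a sample-type category with a deliberately naive rule.
    if any(k.strip().lower() == "cell line" for k in raw):
        sample.sample_type = "cell line"
    return sample


if __name__ == "__main__":
    print(normalize({"Tissue": "Hepatic tissue", "age": "25 years"}))
```

In this sketch, the synonyms "liver" and "hepatic tissue" resolve to the same placeholder term, which is the kind of normalization that makes samples from different submitters comparable. A real system would need far more than exact string lookup to handle spelling variants and references to outside sources.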

Availability The MetaSRA database is available at http://deweylab.biostat.wisc.edu/metasra. Software implementing our computational pipeline is available at https://github.com/deweylab/metasra-pipeline.

Contact cdewey{at}biostat.wisc.edu

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Posted December 19, 2016.

Subject Area

  • Bioinformatics