bioRxiv
New Results

Curation of over 10,000 transcriptomic studies to enable data reuse

Nathaniel Lim, Stepan Tesar, Manuel Belmadani, Guillaume Poirier-Morency, Burak Ogan Mancarci, Jordan Sicherman, Matthew Jacobson, Justin Leong, Patrick Tan, Paul Pavlidis
doi: https://doi.org/10.1101/2020.07.13.201442
Nathaniel Lim
1 Genome Science and Technology Graduate Program, University of British Columbia, Vancouver, BC V6T1Z4, Canada
4 Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T1Z4, Canada
Stepan Tesar
4 Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T1Z4, Canada
Manuel Belmadani
4 Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T1Z4, Canada
Guillaume Poirier-Morency
4 Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T1Z4, Canada
Burak Ogan Mancarci
2 Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC V6T1Z4, Canada
4 Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T1Z4, Canada
Jordan Sicherman
2 Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC V6T1Z4, Canada
4 Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T1Z4, Canada
Matthew Jacobson
4 Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T1Z4, Canada
Justin Leong
4 Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T1Z4, Canada
Patrick Tan
4 Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T1Z4, Canada
Paul Pavlidis
3 Department of Psychiatry, University of British Columbia, Vancouver, BC V6T1Z4, Canada
4 Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T1Z4, Canada
For correspondence: paul@msl.ubc.ca

Abstract

Vast amounts of transcriptomic data reside in public repositories, but effective reuse remains challenging. Issues include unstructured dataset metadata, inconsistent data processing and quality control, and inconsistent probe-gene mappings across microarray technologies. Thus, extensive curation and data reprocessing are necessary prior to any reuse. The Gemma bioinformatics system was created to help address these issues. Gemma consists of a database of curated transcriptomic datasets, analytical software, a web interface, and web services. Here we present an update on Gemma’s holdings, data processing and analysis pipelines, curation guidelines, and software features. As of June 2020, Gemma contains 10,811 manually curated datasets (primarily human, mouse, and rat), over 395,000 samples, and hundreds of curated transcriptomic platforms (both microarray and RNA-sequencing). Dataset topics were represented with 10,215 distinct terms from 12 ontologies, for a total of 54,316 topic annotations (mean topics/dataset = 5.2). While Gemma has broad coverage of conditions and tissues, it captures a large majority of available brain-related datasets, which account for 34% of its holdings. Users can access the curated data and differential expression analyses through the Gemma website, a RESTful service, and an R package.

Database URL: https://gemma.msl.ubc.ca/home.html

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • 1. Minor editorial fixes (typos, dead links); 2. Updated co-authors' ORCIDs

  • https://gemma.msl.ubc.ca/home.html

  • https://github.com/PavlidisLab/Gemma/

  • https://github.com/PavlidisLab/GemmaPaper-2020

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted July 24, 2020.
Subject Area

  • Bioinformatics