New Results

Compositional Data Analysis is necessary for simulating and analyzing RNA-Seq data

Warren A. McGee, Harold Pimentel, Lior Pachter, Jane Y. Wu
doi: https://doi.org/10.1101/564955
Warren A. McGee
1Northwestern University, Department of Neurology (Chicago, IL, United States)
Harold Pimentel
2Stanford University, Department of Genetics (Stanford, CA, United States)
3Howard Hughes Medical Institute (Stanford, CA, United States)
Lior Pachter
4California Institute of Technology, Division of Biology and Biological Engineering (Pasadena, CA, United States)
Jane Y. Wu
1Northwestern University, Department of Neurology (Chicago, IL, United States)
  • For correspondence: jane-wu@northwestern.edu

Abstract

*Seq techniques (e.g. RNA-Seq) generate compositional datasets, i.e. the number of fragments sequenced is not proportional to the sample’s total RNA content. Thus, datasets carry only relative information, even though absolute RNA copy numbers are of interest. Current normalization methods assume most features do not change, which can lead to misleading conclusions when there are many changes. Furthermore, there are few real datasets and no simulation protocols currently available that can directly benchmark methods when many changes occur.
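To make the compositional point concrete, here is a minimal Python sketch. The gene names and copy numbers are hypothetical, chosen only for illustration, and the `to_proportions` helper is not part of absSimSeq. It shows how a feature whose absolute copy number is constant can still appear to decrease once the data are reduced to proportions, because the total RNA content changed.

```python
def to_proportions(copies):
    """Close absolute copy numbers to proportions that sum to 1."""
    total = sum(copies.values())
    return {gene: n / total for gene, n in copies.items()}

# Hypothetical absolute RNA copy numbers per cell (illustrative values only).
control = {"geneA": 100, "geneB": 100, "geneC": 200}
treated = {"geneA": 100, "geneB": 300, "geneC": 600}  # geneA is unchanged

p_control = to_proportions(control)
p_treated = to_proportions(treated)

# In absolute terms geneA does not change at all.
absolute_fold_change = treated["geneA"] / control["geneA"]

# In relative terms geneA appears to fall, because total RNA rose 2.5-fold.
relative_fold_change = p_treated["geneA"] / p_control["geneA"]

print(f"absolute fold change of geneA: {absolute_fold_change:.2f}")
print(f"relative fold change of geneA: {relative_fold_change:.2f}")
```

In this toy case sequencing would only ever see the proportions, so geneA would look down-regulated unless some external reference fixes the scale.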

We present absSimSeq, an R package that simulates compositional data in the form of RNA-Seq reads. Using absSimSeq, we benchmarked several existing tools used for RNA-Seq differential analysis: sleuth, DESeq2, edgeR, limma, and ALDEx2 (which explicitly takes a compositional approach). We compared the standard normalization of each tool to either “compositional normalization”, which uses log-ratios to anchor the data on a set of negative control features, or RUVSeq, another tool that directly uses negative control features.
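The idea of anchoring on negative controls with log-ratios can be sketched as follows. This is an illustration only, not the implementation used by absSimSeq, sleuth, or ALDEx2: the function name `control_log_ratios`, the feature names, and the proportions are all hypothetical, and the sketch assumes every value is strictly positive (zeros would need separate handling).

```python
import math

def control_log_ratios(values, control_ids):
    """Log of each feature relative to the geometric mean of the controls."""
    log_ref = sum(math.log(values[c]) for c in control_ids) / len(control_ids)
    return {f: math.log(v) - log_ref for f, v in values.items()}

controls = ["spike1", "spike2"]  # hypothetical negative control features

# Hypothetical observed proportions. In absolute terms only geneY changed
# (60 -> 160 copies); geneX and the spike-ins stayed constant.
sample_a = {"spike1": 0.10, "spike2": 0.10, "geneX": 0.20, "geneY": 0.60}
sample_b = {"spike1": 0.05, "spike2": 0.05, "geneX": 0.10, "geneY": 0.80}

lr_a = control_log_ratios(sample_a, controls)
lr_b = control_log_ratios(sample_b, controls)

# Difference of log-ratios = log fold change relative to the controls.
delta = {f: lr_b[f] - lr_a[f] for f in sample_a}

for feature, value in delta.items():
    print(f"{feature}: log fold change vs controls = {value:+.3f}")
```

Although geneX's raw proportion halves between the two samples, its log-ratio to the controls is unchanged, while geneY recovers its absolute log fold change of log(160/60).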

Our analysis shows that common normalizations result in reduced performance with current methods when there is a large change in the total RNA per cell. Performance improves when spike-ins are included and used with a compositional approach, even if the spike-ins have substantial variation. In contrast, RUVSeq, which normalizes count data rather than compositional data, has poor performance. Further, we show that previous criticisms of spike-ins did not take into consideration the compositional nature of the data. We demonstrate that absSimSeq can generate more representative datasets for testing performance, and that spike-ins should be more frequently used in a compositional manner to minimize misleading conclusions in differential analyses.

Author Summary

A critical question in biomedical research is “Is there any change in the RNA transcript abundance when cellular conditions change?” RNA Sequencing (RNA-Seq) is a powerful tool that can help answer this question, but two critical parts of obtaining accurate measurements are (A) understanding the kind of data that RNA-Seq produces, and (B) “normalizing” the data between samples to allow for a fair comparison. Most tools assume that RNA-Seq data is count data, but in reality it is “compositional” data, meaning only percentages/proportions are available, which cannot directly answer the critical question. This leads to distorted results when attempting to simulate or analyze data that has a large global change.

To address this problem, we designed a new simulation protocol called absSimSeq that can more accurately represent RNA-Seq data when there are large changes. We also proposed a “compositional normalization” method that can utilize “negative control” features that are known to not change between conditions to anchor the data. When there are many features changing, this approach improves performance over commonly used normalization methods across multiple tools. This work highlights the importance of having negative control features available and of treating RNA-Seq data as compositional.
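The contrast with normalizations that assume most features are unchanged can be illustrated with a toy calculation. The sketch below uses a plain median of per-feature ratios as a simplified stand-in for that assumption; it is not the procedure of any specific tool, and the `fold_changes` helper, the feature names, and the copy numbers are hypothetical.

```python
import math
from statistics import median

def fold_changes(before, after, reference_ids=None):
    """Per-feature ratios rescaled by a reference.

    With no reference_ids, the median ratio across all features is the
    scale (assumes most features are unchanged). Otherwise the geometric
    mean ratio of the given negative controls is the scale.
    """
    ratios = {f: after[f] / before[f] for f in before}
    if reference_ids is None:
        scale = median(ratios.values())
    else:
        logs = [math.log(ratios[f]) for f in reference_ids]
        scale = math.exp(sum(logs) / len(logs))
    return {f: r / scale for f, r in ratios.items()}

def to_proportions(copies):
    total = sum(copies.values())
    return {f: n / total for f, n in copies.items()}

# Hypothetical absolute copies: four of five genes double, g5 and the
# negative control "ctrl" stay constant.
before_abs = {"ctrl": 10, "g1": 100, "g2": 100, "g3": 100, "g4": 100, "g5": 100}
after_abs = {"ctrl": 10, "g1": 200, "g2": 200, "g3": 200, "g4": 200, "g5": 100}

before, after = to_proportions(before_abs), to_proportions(after_abs)

median_scaled = fold_changes(before, after)
control_anchored = fold_changes(before, after, reference_ids=["ctrl"])

for f in before:
    print(f"{f}: median-scaled {median_scaled[f]:.2f}, "
          f"control-anchored {control_anchored[f]:.2f}")
```

Under these assumptions the median-based scaling reports the four doubled genes as unchanged and the constant gene as halved, whereas anchoring on the negative control recovers the absolute fold changes.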

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted March 02, 2019.
Subject Area

  • Bioinformatics