bioRxiv
New Results

Differential gene expression analysis tools exhibit substandard performance for long non-coding RNA–sequencing data

Alemu Takele Assefa, Katrijn De Paepe, Celine Everaert, Pieter Mestdagh, Olivier Thas, Jo Vandesompele
doi: https://doi.org/10.1101/220129
Author affiliations
  • Alemu Takele Assefa: Ghent University (for correspondence: alemutakele.assefa@ugent.be)
  • Katrijn De Paepe: Global Advanced Analytics Group, Bain & Company Belgium, Inc.
  • Celine Everaert: Ghent University
  • Pieter Mestdagh: Ghent University
  • Olivier Thas: Ghent University
  • Jo Vandesompele: Ghent University

Abstract

Background: Protein-coding RNAs (mRNAs) have been the primary target of most transcriptome studies in the past, but in recent years attention has expanded to include long non-coding RNAs (lncRNAs). lncRNAs are typically expressed at low levels and are inherently highly variable, which poses a fundamental challenge for differential expression (DE) analysis. In this study, the performance of 14 popular tools for testing DE in RNA-seq data, along with their normalization methods, is comprehensively evaluated, with a particular focus on lncRNAs and low-abundance mRNAs.

Results: Thirteen performance metrics were used to evaluate DE tools and normalization methods using simulations and analyses of six diverse RNA-seq datasets. Non-parametric procedures were used to simulate gene expression data in such a way that realistic levels of expression and variability are preserved in the simulated data. Throughout the assessment, we kept track of the results for mRNAs and lncRNAs separately. All statistical models exhibited inferior performance for lncRNAs compared to mRNAs across all simulated scenarios and in the analysis of benchmark RNA-seq datasets. No single tool uniformly outperformed the others.

Conclusion: Overall, linear modeling with empirical Bayes moderation (limma) and the non-parametric approach (SAMSeq) showed the best performance: good control of the false discovery rate (FDR) and reasonable sensitivity. However, achieving a sensitivity of at least 50% requires more than 80 samples when studying expression levels in realistic clinical settings such as cancer research. About half of the methods showed a severe excess of false discoveries, making these methods unreliable for differential expression analysis and jeopardizing reproducible science. The detailed results of our study can be consulted through a user-friendly web application, http://statapps.ugent.be/tools/AppDGE/.
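The two headline metrics named in the abstract, FDR and sensitivity, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function name `fdp_and_sensitivity` and the toy gene labels are hypothetical. It assumes the true DE status of each gene is known, as it would be in a simulation. For a single analysis it computes the false discovery proportion (FDP); the FDR is the expectation of this quantity, which a simulation study would typically estimate by averaging over many simulated datasets.

```python
def fdp_and_sensitivity(called, truth):
    """Return (false discovery proportion, sensitivity) for one DE analysis.

    called[i] is True if gene i was declared differentially expressed.
    truth[i] is True if gene i is truly DE (known only in simulations).
    """
    if len(called) != len(truth):
        raise ValueError("called and truth must have the same length")
    tp = sum(1 for c, t in zip(called, truth) if c and t)      # true positives
    fp = sum(1 for c, t in zip(called, truth) if c and not t)  # false positives
    fn = sum(1 for c, t in zip(called, truth) if not c and t)  # false negatives
    # By convention, report 0 when there are no discoveries or no true DE genes.
    fdp = fp / (tp + fp) if (tp + fp) > 0 else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return fdp, sensitivity


# Hypothetical toy data: 8 genes, 4 truly DE, 3 declared DE by some tool.
truth = [True, True, True, True, False, False, False, False]
called = [True, True, False, False, True, False, False, False]
fdp, sens = fdp_and_sensitivity(called, truth)
print(f"FDP = {fdp:.3f}, sensitivity = {sens:.3f}")
```

Under these definitions, a tool with "good control of the FDR" keeps the average FDP at or below the nominal level, while sensitivity measures the fraction of truly DE genes it recovers.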

Copyright 
The copyright holder for this preprint is the author/funder. All rights reserved. No reuse allowed without permission.
  • Posted November 16, 2017.

Subject Area

  • Genomics