Normalization, testing, and false discovery rate estimation for RNA-sequencing data

Jun Li; Daniela M Witten; Iain M Johnstone; Robert Tibshirani

doi:10.1093/biostatistics/kxr031

Biostatistics. 2012 Jul;13(3):523-38. doi: 10.1093/biostatistics/kxr031. Epub 2011 Oct 14.

Authors

Jun Li¹, Daniela M Witten, Iain M Johnstone, Robert Tibshirani

Affiliation

¹ Department of Statistics, Stanford University, Stanford, CA 94305, USA. junli07@stanford.edu

Abstract

We discuss the identification of genes that are associated with an outcome in RNA sequencing and other sequence-based comparative genomic experiments. RNA-sequencing data take the form of counts, so models based on the Gaussian distribution are unsuitable. Moreover, normalization is challenging because different sequencing experiments may generate quite different total numbers of reads. To overcome these difficulties, we use a log-linear model with a new approach to normalization. We derive a novel procedure to estimate the false discovery rate (FDR). Our method can be applied to data with quantitative, two-class, or multiple-class outcomes, and the computation is fast even for large data sets. We study the accuracy of our approaches for significance calculation and FDR estimation, and we demonstrate that our method has potential advantages over existing methods that are based on a Poisson or negative binomial model. In summary, this work provides a pipeline for the significance analysis of sequencing data.
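
The abstract names a log-linear model for count data but does not give its form. As a rough illustration only, the sketch below assumes a Poisson log-linear null model, log(mu_ij) = log(d_i) + log(beta_j), where d_i is the sequencing depth of sample i and beta_j the expression level of gene j. Depth is estimated here by simple total-count scaling, which is an assumption and is not the new normalization approach the paper proposes. The two-class statistic is a standard conditional binomial z-score, not necessarily the authors' test. All function names are hypothetical.

```python
import math


def depth_estimates(counts):
    """Estimate relative sequencing depth per sample by total-count scaling.

    counts[j][i] is the read count for gene j in sample i.
    Assumption: this is plain total-count scaling, not the paper's method.
    """
    n_samples = len(counts[0])
    totals = [sum(row[i] for row in counts) for i in range(n_samples)]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


def null_expected(counts, depths):
    """Expected counts under the null model mu_ij = d_i * beta_j.

    With depths summing to 1, the Poisson maximum-likelihood estimate of
    beta_j is the total count of gene j across samples.
    """
    return [[d * sum(row) for d in depths] for row in counts]


def two_class_scores(counts, labels):
    """Per-gene z-score for a two-class outcome (labels[i] is 0 or 1).

    Conditional on a gene's total count, its reads in class 1 are binomial
    under the Poisson null, with success probability equal to the share of
    sequencing depth that belongs to class 1.
    """
    depths = depth_estimates(counts)
    expected = null_expected(counts, depths)
    scores = []
    for row, exp_row in zip(counts, expected):
        obs1 = sum(n for n, y in zip(row, labels) if y == 1)
        exp1 = sum(e for e, y in zip(exp_row, labels) if y == 1)
        exp0 = sum(e for e, y in zip(exp_row, labels) if y == 0)
        total = exp1 + exp0
        var = (exp1 * exp0 / total) if total > 0 else 0.0
        scores.append((obs1 - exp1) / math.sqrt(var) if var > 0 else 0.0)
    return scores


# Toy data: three genes, four samples, two samples per class.
toy_counts = [
    [10, 10, 20, 20],
    [10, 10, 20, 20],
    [40, 40, 20, 20],
]
toy_labels = [0, 0, 1, 1]
toy_scores = two_class_scores(toy_counts, toy_labels)
```

In this toy example every sample has the same total, so the depth estimates are equal; genes with more reads in class 1 than expected get a positive score and genes with fewer get a negative one. Real RNA-sequencing data are often overdispersed relative to the Poisson, so a sketch like this would likely overstate significance if applied directly.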

Publication types

Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Data Interpretation, Statistical*
Humans
Models, Statistical*
RNA, Messenger / chemistry
RNA, Messenger / genetics*
Reverse Transcriptase Polymerase Chain Reaction
Sequence Analysis, DNA / methods*

Substances

RNA, Messenger
