Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment

Marek Gierliński; Christian Cole; Pietà Schofield; Nicholas J Schurch; Alexander Sherstnev; Vijender Singh; Nicola Wrobel; Karim Gharbi; Gordon Simpson; Tom Owen-Hughes; Mark Blaxter; Geoffrey J Barton

doi:10.1093/bioinformatics/btv425

Bioinformatics. 2015 Nov 15;31(22):3625-30. doi: 10.1093/bioinformatics/btv425. Epub 2015 Jul 23.

Affiliations

¹ Division of Computational Biology and Centre for Gene Regulation and Expression, College of Life Sciences, University of Dundee, Dow Street Dundee, DD1 5EH, UK.
² Division of Computational Biology and.
³ Centre for Gene Regulation and Expression, College of Life Sciences, University of Dundee, Dow Street Dundee, DD1 5EH, UK.
⁴ Edinburgh Genomics and.
⁵ Edinburgh Genomics and Institute of Evolutionary Biology, Ashworth Laboratories, University of Edinburgh, Edinburgh, UK.
⁶ Division of Plant Sciences and.
⁷ Division of Computational Biology and Centre for Gene Regulation and Expression, College of Life Sciences, University of Dundee, Dow Street Dundee, DD1 5EH, UK, Biological Chemistry and Drug Discovery, College of Life Sciences, University of Dundee, Dow Street Dundee, DD1 5EH, UK.

Abstract

Motivation: High-throughput RNA sequencing (RNA-seq) is now the standard method to determine differential gene expression. Identifying differentially expressed genes crucially depends on estimates of read-count variability. These estimates are typically based on statistical models such as the negative binomial distribution, which is employed by the tools edgeR, DESeq and cuffdiff. Until now, the validity of these models has usually been tested on either low-replicate RNA-seq data or simulations.

Results: A 48-replicate RNA-seq experiment in yeast was performed and data tested against theoretical models. The observed gene read counts were consistent with both log-normal and negative binomial distributions, while the mean-variance relation followed the line of constant dispersion parameter of ∼0.01. The high-replicate data also allowed for strict quality control and screening of 'bad' replicates, which can drastically affect the gene read-count distribution.
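To illustrate what a constant dispersion of about 0.01 implies, here is a minimal sketch assuming the common negative binomial parameterisation in which variance = mean + dispersion × mean². The function names and the six example counts are hypothetical and are not taken from the paper or its data.

```python
import math
from statistics import mean, variance


def nb_variance(mu, dispersion):
    """Negative binomial variance under the assumed form: mu + dispersion * mu^2."""
    return mu + dispersion * mu ** 2


def moment_dispersion(counts):
    """Method-of-moments dispersion estimate: (sample variance - mean) / mean^2."""
    m = mean(counts)
    v = variance(counts)  # unbiased sample variance (n - 1 denominator)
    return (v - m) / m ** 2


DISPERSION = 0.01  # approximate value reported in the abstract

# Trace the mean-variance relation along the line of constant dispersion.
for mu in (10, 100, 1000, 10000):
    var = nb_variance(mu, DISPERSION)
    cv = math.sqrt(var) / mu  # coefficient of variation across replicates
    print(f"mean={mu:>6}  variance={var:>12.1f}  CV={cv:.3f}")

# Hypothetical read counts for one gene across six replicates (made-up numbers).
example_counts = [90, 110, 95, 105, 120, 80]
print(f"estimated dispersion = {moment_dispersion(example_counts):.4f}")
```

Under this parameterisation the Poisson term dominates at low counts, while for highly expressed genes the coefficient of variation approaches the square root of the dispersion, roughly 0.1 for a dispersion of 0.01.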

Availability and implementation: RNA-seq data have been submitted to the European Nucleotide Archive (ENA) under project ID PRJEB5348.

Contact: g.j.barton@dundee.ac.uk.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Base Sequence
Binomial Distribution
Gene Expression Profiling
Models, Statistical*
Reproducibility of Results
Saccharomyces cerevisiae / genetics
Sequence Analysis, RNA / methods*
