Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation

Michael I Love; John B Hogenesch; Rafael A Irizarry

doi:10.1038/nbt.3682

Nat Biotechnol. 2016 Dec;34(12):1287-1291. doi: 10.1038/nbt.3682. Epub 2016 Sep 26.

Authors

Michael I Love¹,², John B Hogenesch³, Rafael A Irizarry¹,²

Affiliations

¹ Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, Massachusetts, USA.
² Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, Massachusetts, USA.
³ Department of Pharmacology, Institute for Translational Medicine and Therapeutics, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA.

Abstract

We find that current computational methods for estimating transcript abundance from RNA-seq data can lead to hundreds of false-positive results. We show that these systematic errors stem largely from a failure to model fragment GC content bias. Sample-specific biases associated with fragment sequence features lead to misidentification of transcript isoforms. We introduce alpine, a method for estimating sample-specific bias-corrected transcript abundance. By incorporating fragment sequence features, alpine greatly increases the accuracy of transcript abundance estimates, enabling a fourfold reduction in the number of false positives for reported changes in expression compared with Cufflinks. Using simulated data, we also show that alpine retains the ability to discover true positives, similar to other approaches. The method is available as an R/Bioconductor package that includes data visualization tools useful for bias discovery.
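To make the term "fragment GC content" concrete, here is a minimal sketch of how that feature could be computed for fragments aligned to a transcript and tabulated into bins. It is an illustration only, not the alpine implementation or its bias model: the function names (`gc_content`, `fragment_gc`, `gc_histogram`), the 0-based end-exclusive coordinates, and the number of bins are all assumptions made for this example.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Return the fraction of G or C bases in a sequence (case-insensitive)."""
    if not seq:
        raise ValueError("empty sequence")
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def fragment_gc(transcript: str, start: int, end: int) -> float:
    """GC content of the fragment transcript[start:end] (0-based, end-exclusive)."""
    return gc_content(transcript[start:end])


def gc_histogram(transcript: str, fragments, n_bins: int = 4):
    """Count fragments per GC-content bin.

    Bins are equal-width over [0, 1]; a fragment with GC content exactly 1.0
    is placed in the last bin.
    """
    counts = Counter()
    for start, end in fragments:
        gc = fragment_gc(transcript, start, end)
        bin_index = min(int(gc * n_bins), n_bins - 1)
        counts[bin_index] += 1
    return [counts[b] for b in range(n_bins)]


# Hypothetical toy transcript and fragment coordinates, for illustration only.
transcript = "AAAAGGGGCCCCTTTT"
fragments = [(0, 4), (4, 12), (2, 6), (0, 16)]
histogram = gc_histogram(transcript, fragments)
```

Comparing such per-bin fragment counts with the counts expected from transcript sequence alone is one simple way to see whether a sample under- or over-represents fragments of a given GC content. The sample-specific correction described in the abstract goes beyond this tabulation.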

MeSH terms

Algorithms
Artifacts*
Base Composition / genetics*
Computer Simulation
Models, Genetic*
Models, Statistical*
RNA / genetics*
Reproducibility of Results
Sensitivity and Specificity
Sequence Analysis, RNA
Software
Transcription Factors / genetics*

Substances

Transcription Factors
RNA
