A field-wide assessment of differential high throughput sequencing reveals widespread bias

Taavi Päll; Hannes Luidalepp; Tanel Tenson; Ülo Maiväli

doi:10.1101/2021.01.04.424681

Abstract

Here we assess reproducibility and inferential quality in the field of differential HT-seq, based on an analysis of datasets submitted to the NCBI GEO data repository between 2008 and 2019. Analysis of GEO submission file structures places an overall upper limit of 56% on reproducibility without querying other sources. We further show that only 23% of experiments resulted in theoretically expected p value histogram shapes, although both reproducibility and p value distributions show marked improvement over time. Uniform p value histograms, indicative of fewer than 100 true effects, were extremely rare. Our calculations of π₀, the fraction of true nulls, showed that 36% of experiments have π₀ < 0.5, meaning that in over a third of experiments most RNAs were estimated to change their expression level upon experimental treatment. Both the proportions of the different p value histogram types and the π₀ values are strongly associated with the software that the original authors used to calculate these p values, indicating widespread bias.
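
To illustrate the quantity π₀ mentioned above, the following minimal Python sketch estimates π₀ from a set of p values with a fixed-threshold estimator of the Storey type: p values above a cutoff λ are assumed to come almost entirely from true nulls, whose p values are uniformly distributed. This is an illustration under stated assumptions and not necessarily the procedure used in this study; the function names, the choice λ = 0.5, and the simulated p values are all hypothetical.

```python
import random


def estimate_pi0(p_values, lam=0.5):
    """Fixed-threshold (Storey-type) estimate of pi0, the fraction of true nulls.

    Under the null, p values are uniform on [0, 1], so the share of p values
    above `lam` is roughly pi0 * (1 - lam). Solving for pi0 gives the estimate.
    `lam` = 0.5 is an illustrative choice, not a value taken from the paper.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    m = len(p_values)
    if m == 0:
        raise ValueError("p_values must not be empty")
    n_above = sum(1 for p in p_values if p > lam)
    # Cap at 1.0 because a proportion cannot exceed one.
    return min(1.0, n_above / ((1.0 - lam) * m))


def simulate_p_values(n_null, n_effect, seed=0):
    """Hypothetical p value set: uniform nulls plus effects skewed toward zero."""
    rng = random.Random(seed)
    nulls = [rng.random() for _ in range(n_null)]
    # Raising a uniform draw to the 8th power concentrates mass near zero,
    # mimicking the left-hand peak expected from true effects.
    effects = [rng.random() ** 8 for _ in range(n_effect)]
    return nulls + effects


if __name__ == "__main__":
    p_values = simulate_p_values(n_null=8000, n_effect=2000)
    print(f"estimated pi0: {estimate_pi0(p_values):.3f}")
```

With 80% simulated nulls, the estimate should land close to, and slightly above, 0.8, because a small share of non-null p values also exceeds λ. An estimate below 0.5 would correspond to the case described above, in which most features are estimated to change.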

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

- We implemented three major changes, resulting in the removal of a panel from Fig 4 and the addition of two new figures to the main text (Figs 5 and 6).
http://doi.org/10.5281/zenodo.4469911
https://doi.org/10.5281/zenodo.4046422
http://doi.org/10.5281/zenodo.4463804
https://github.com/rstats-tartu/geo-htseq-paper
https://github.com/rstats-tartu/simulate-rnaseq
https://github.com/rstats-tartu/geo-htseq
https://gin.g-node.org/tpall/geo-htseq-paper

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.