RT Journal Article
SR Electronic
T1 Modelling dropouts for feature selection in scRNASeq experiments
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 065094
DO 10.1101/065094
A1 Tallulah S. Andrews
A1 Martin Hemberg
YR 2017
UL http://biorxiv.org/content/early/2017/05/25/065094.abstract
AB A key challenge of single-cell RNASeq (scRNASeq) is the many genes with zero reads in some cells, but high expression in others. In full-transcript datasets, modelling zeros using the Michaelis-Menten equation provides an equal or superior fit to existing scRNASeq datasets compared to other approaches and enables fast and accurate identification of features corresponding to differentially expressed genes without prior identification of cell subpopulations. For datasets tagged with unique molecular identifiers, we introduce a depth-adjusted negative binomial (DANB) to perform dropout-rate-based feature selection. Applying our method to mouse preimplantation embryos revealed clusters corresponding to the inner cell mass and trophectoderm of the blastocyst. Our feature selection method overcomes batch effects to cluster cells from five different datasets by developmental stage rather than experimental origin. Author Summary: Feature selection is a powerful approach for improving the signal-to-noise ratio in high-dimensional datasets. We present two unsupervised feature selection methods for single-cell RNASeq data which, unlike all previous methods, are based on dropout rate rather than variance: M3Drop, tailored to full-transcript scRNASeq protocols, and DANB, tailored to data tagged with unique molecular identifiers (UMI). Using differentially expressed genes defined from bulk RNASeq, we perform the first comparison of feature selection quality for both full-transcript and UMI-tagged scRNASeq data. We show that dropout-based methods outperform their variance-based counterparts on both real and simulated data due to lower sampling errors.
Finally, we demonstrate that mouse embryo datasets produced with different protocols by different research groups can be merged using only feature selection and library size normalization.