Abstract
Clustering is routinely applied to microarray, RNA-seq, and other genomic data to help ascertain biological processes, disease subtypes, and cell identities. More recently, single-cell RNA-seq (scRNA-seq) has been used to generate large numbers of gene expression profiles from unlabeled single cells. Without prior knowledge, clustering algorithms are used to classify these cells, and the resulting cluster memberships ultimately determine cell identities. However, how can we evaluate whether the cluster memberships, and hence the cell identities, are correctly assigned? To this end, we introduce the jackstraw methods for unsupervised classification, which rigorously test the assignments of genomic features to their clusters. By learning the uncertainty inherent in clustering noisy data, the proposed jackstraw methods can identify statistically significant genomic features that truly make up the corresponding clusters. We apply the proposed methods to scRNA-seq data from a mixture of Jurkat and 293T cell lines, where individual cell identities are unknown. The jackstraw methods evaluate the cluster membership assignments of 3381 unlabeled single cells, such that the majority of multiplets are identified in an unsupervised manner. We propose posterior inclusion probabilities (PIPs) for cluster membership to help select and visualize reliable features in reduced dimensions. Additionally, we consider clustering 5981 yeast genes over the cell cycle. When clustering is used in high-dimensional genomic data analysis, the proposed jackstraw tests enable rigorous evaluation of membership assignments, which readily improves feature selection and visualization.
Software: The jackstraw R package is available at https://github.com/ncchung/jackstraw