PT - JOURNAL ARTICLE
AU - Florian Schmidt
AU - Bobby Ranjan
AU - Quy Xiao Xuan Lin
AU - Vaidehi Krishnan
AU - Ignasius Joanito
AU - Mohammad Amin Honardoost
AU - Zahid Nawaz
AU - Prasanna Nori Venkatesh
AU - Joanna Tan
AU - Nirmala Arul Rayan
AU - S. Tiong Ong
AU - Shyam Prabhakar
TI - Robust clustering and interpretation of scRNA-seq data using reference component analysis
AID - 10.1101/2021.02.16.431527
DP - 2021 Jan 01
TA - bioRxiv
PG - 2021.02.16.431527
4099 - http://biorxiv.org/content/early/2021/02/17/2021.02.16.431527.short
4100 - http://biorxiv.org/content/early/2021/02/17/2021.02.16.431527.full
AB - Motivation: The transcriptomic diversity of the hundreds of cell types in the human body can be analysed in unprecedented detail using single cell (SC) technologies. Though clustering of cellular transcriptomes is the default technique for defining cell types and subtypes, single cell clustering can be strongly influenced by technical variation. In fact, the prevalent unsupervised clustering algorithms can cluster cells by technical, rather than biological, variation. Results: Compared to de novo (unsupervised) clustering methods, we demonstrate using multiple benchmarks that supervised clustering, which uses reference transcriptomes as a guide, is robust to batch effects. To leverage the advantages of supervised clustering, we present RCA2, a new, scalable, and broadly applicable version of our RCA algorithm. RCA2 provides a user-friendly framework for supervised clustering and downstream analysis of large scRNA-seq data sets. RCA2 can be seamlessly incorporated into existing algorithmic pipelines. It incorporates various new reference panels for human and mouse, supports generation of custom panels and uses efficient graph-based clustering and sparse data structures to ensure scalability. We demonstrate the applicability of RCA2 on SC data from human bone marrow, healthy PBMCs and PBMCs from COVID-19 patients.
Importantly, RCA2 facilitates cell-type-specific QC, which we show is essential for accurate clustering of SC data from heterogeneous tissues. In the era of cohort-scale SC analysis, supervised clustering methods such as RCA2 will facilitate unified analysis of diverse SC datasets. Availability: RCA2 is implemented in R and is available at github.com/prabhakarlab/RCAv2. Competing Interest Statement: The authors have declared no competing interest.