Abstract
Single-cell RNA-seq (scRNA-seq) has remarkably advanced our understanding of cellular heterogeneity and dynamics in tissue development, diseases, and cancers. Integrated data analysis can often uncover molecular and cellular links among individual datasets and thus provide new biological insights, such as developmental relationships. However, the integration of multiple scRNA-seq datasets is challenging because of differences in experimental platforms and biological sample batches. To address this, we developed a novel computational method for robust integration of scRNA-seq (RISC) datasets using principal component regression (PCR). Because of the natural compatibility of eigenvectors between the PCR model and dimension reduction, RISC can accurately integrate scRNA-seq datasets and avoid over-integration. Compared to existing software, RISC shows particular improvement in integrating datasets that contain cells of the same types (more accurately, clusters) but in distinct functional states. To demonstrate the value of RISC in finding small groups of cells common to otherwise heterogeneous datasets, we applied it to scRNA-seq datasets of normal and malignant cells and successfully identified small clusters of cells in healthy kidney tissues that may be related to the origin of renal tumors.