Abstract
An ever-increasing deluge of single-cell RNA-sequencing (scRNA-seq) data has been generated, often involving different time points, laboratories or sequencing protocols. Batch effect correction is recognized as indispensable when integrating scRNA-seq data from multiple batches. A recent study proposed an effective correction method based on mutual nearest neighbors (MNN) across batches. However, the MNN method is unsupervised in that it ignores the cluster label information of single cells. Such cluster or cell type label information can further improve the effectiveness of batch effect correction, particularly under realistic scenarios where true biological differences are not orthogonal to the batch effect. With this motivation, we propose SMNN, which performs supervised mutual nearest neighbor detection for batch effect correction of scRNA-seq data. SMNN either takes cluster/cell-type label information as input or, in the absence of such information, infers cell types by clustering the scRNA-seq data. It then detects mutual nearest neighbors within matched cell types and corrects the batch effect accordingly. Our extensive evaluations on simulated and real datasets show that SMNN provides improved merging within the corresponding cell types across batches, leading to reduced differentiation across batches compared with MNN. Furthermore, SMNN retains more cell type-specific features after correction. Differentially expressed genes (DEGs) identified between cell types after SMNN correction are biologically more relevant, and the DEG true positive rates improve by up to 841%. SMNN is implemented in R and is freely available at https://yunliweb.its.unc.edu/SMNN/ and https://github.com/yycunc/SMNNcorrect.
Author summary
The presence of batch effects poses grand challenges to the integrative analysis of scRNA-seq data from multiple sources. One powerful tool, MNN, corrects batch effects in scRNA-seq data based on mutual nearest neighbors across batches. However, this method makes the critical assumption that the batch effect is orthogonal to true biological differences. In practice, this assumption can easily be violated. When that happens, MNN suffers from biases introduced by wrongly matched pairs of cells. To overcome this shortcoming, we present a new method, SMNN, which performs supervised mutual nearest neighbor detection for batch effect correction. We benchmark the performance of SMNN using both simulations and real data, and demonstrate that, compared to MNN, SMNN can better mix cells of the same type/state across batches. More importantly, SMNN more effectively retains biologically relevant features, and thereby provides improved cell type clustering and enhanced power for detecting differentially expressed genes (DEGs) between different cell types.
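To make the contrast between unsupervised and supervised neighbor detection concrete, the following is a minimal Python sketch of the idea described above. It is not the SMNNcorrect R implementation: the function names `mutual_nearest_neighbors` and `supervised_mnn`, the use of Euclidean distance, the parameter `k`, and the toy coordinates are all assumptions made for illustration, and the subsequent batch correction step is omitted. The toy data place the batch shift along the same axis that separates the two cell types, so the batch effect is not orthogonal to the biological difference.

```python
import numpy as np

def mutual_nearest_neighbors(a, b, k=1):
    """Return pairs (i, j) where a[i] and b[j] are in each other's k nearest neighbors."""
    # Pairwise Euclidean distances, shape (n_a, n_b). Assumed metric for this sketch.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    k_ab = min(k, b.shape[0])
    k_ba = min(k, a.shape[0])
    nn_ab = np.argsort(d, axis=1)[:, :k_ab]    # for each cell in a: nearest cells in b
    nn_ba = np.argsort(d, axis=0)[:k_ba, :].T  # for each cell in b: nearest cells in a
    pairs = []
    for i in range(a.shape[0]):
        for j in nn_ab[i]:
            if i in nn_ba[j]:
                pairs.append((i, int(j)))
    return pairs

def supervised_mnn(a, labels_a, b, labels_b, k=1):
    """Search for mutual nearest neighbors only within cell types shared by both batches."""
    pairs = []
    for cell_type in np.intersect1d(labels_a, labels_b):
        idx_a = np.flatnonzero(labels_a == cell_type)
        idx_b = np.flatnonzero(labels_b == cell_type)
        # Map indices within the cell-type subset back to indices in the full batch.
        for i, j in mutual_nearest_neighbors(a[idx_a], b[idx_b], k):
            pairs.append((int(idx_a[i]), int(idx_b[j])))
    return pairs

# Hypothetical toy data: two cell types (A, B) separated along x,
# with batch 2 shifted by +3 along the same axis.
batch1 = np.array([[0.0, 0.0], [4.0, 0.0]])
labels1 = np.array(["A", "B"])
batch2 = np.array([[3.0, 0.0], [7.0, 0.0]])
labels2 = np.array(["A", "B"])

unsupervised_pairs = mutual_nearest_neighbors(batch1, batch2)
supervised_pairs = supervised_mnn(batch1, labels1, batch2, labels2)
print(unsupervised_pairs)  # [(1, 0)]: type B in batch 1 paired with type A in batch 2
print(supervised_pairs)    # [(0, 0), (1, 1)]: pairs restricted to matching cell types
```

In this toy case the unsupervised search pairs a type B cell with a type A cell, which is the kind of wrongly matched pair described above, whereas restricting the search to matched cell types recovers one correct pair per type.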