RT Journal Article
SR Electronic
T1 Integrating scRNA-seq data of multiple donors increases cell-type identification accuracy
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2020.09.17.301911
DO 10.1101/2020.09.17.301911
A1 Hanbin Lee
A1 Chanwoo Kim
A1 Juhee Jeong
A1 Keehoon Jung
A1 Buhm Han
YR 2020
UL http://biorxiv.org/content/early/2020/09/19/2020.09.17.301911.abstract
AB Integrating scRNA-seq data of multiple donors is challenging. Multiple samples may exhibit strong heterogeneity and batch effects, which need to be properly corrected. Many previous methods focused on integrating multi-sample data at the cluster level, but it was challenging to quantitatively measure the benefit of integration. We present scIntegral, a scalable method to integrate scRNA-seq data from hundreds of donors. Our method aims to identify the cell types of the cells in a semi-supervised fashion, using marker list information as a prior. scIntegral is extremely efficient and takes only an hour to integrate ten thousand donor data, while fully accounting for heterogeneity with covariates. We quantify the benefit of multi-sample integration in terms of accuracy with respect to the gold standard cell labels, and prove that integrating multiple donors can significantly reduce the error rate in cell-type identification. scIntegral is more accurate than existing methods and can precisely identify very rare (<0.5%) cell populations, suggesting utility for in-silico cell extraction. scIntegral is freely available at https://github.com/hanbin973/scIntegral. Competing Interest Statement: Buhm Han is the CTO of Genealogy Inc.