PT - JOURNAL ARTICLE
AU - Genevieve L. Stein-O’Brien
AU - Brian S. Clark
AU - Thomas Sherman
AU - Cristina Zibetti
AU - Qiwen Hu
AU - Rachel Sealfon
AU - Sheng Liu
AU - Jiang Qian
AU - Carlo Colantuoni
AU - Seth Blackshaw
AU - Loyal A. Goff
AU - Elana J. Fertig
TI - Decomposing cell identity for transfer learning across cellular measurements, platforms, tissues, and species
AID - 10.1101/395004
DP - 2018 Jan 01
TA - bioRxiv
PG - 395004
4099 - http://biorxiv.org/content/early/2018/08/20/395004.1.short
4100 - http://biorxiv.org/content/early/2018/08/20/395004.1.full
AB - New approaches are urgently needed to glean biological insights from the vast amounts of single-cell RNA sequencing (scRNA-Seq) data now being generated. To this end, we propose that cell identity should map to a reduced set of factors that describe both exclusive and shared biology of individual cells, and that the dimensions which contain these factors reflect biologically meaningful relationships across different platforms, tissues, and species. To find a robust set of dependent factors in large-scale scRNA-Seq data, we developed a Bayesian non-negative matrix factorization (NMF) algorithm, scCoGAPS. Application of scCoGAPS to scRNA-Seq data obtained over the course of mouse retinal development identified gene expression signatures for factors associated with specific cell types and continuous biological processes. To test whether these signatures are shared across diverse cellular contexts, we developed projectR to map biologically disparate datasets into the factors learned by scCoGAPS. Because projecting these dimensions preserves relative distances between samples, biologically meaningful relationships/factors will stratify new data consistent with their underlying processes, allowing labels or information from one dataset to be used for annotation of the other, a machine learning concept called transfer learning.
With projectR, data from multiple datasets were used to annotate latent spaces and reveal novel parallels between developmental programs in other tissues, species, and cellular assays. Using this approach, we are able to transfer cell type and state designations across datasets to rapidly annotate cellular features in a new dataset without a priori knowledge of their type, identify a species-specific signature of microglial cells, and identify a previously undescribed subpopulation of neurosecretory cells within the lung. Together, these algorithms define biologically meaningful dimensions of cellular identity, state, and trajectories that persist across technologies, molecular features, and species.
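
The factor-then-project idea in the abstract can be illustrated with a minimal sketch. This is not scCoGAPS (which is Bayesian) and not the projectR API: it assumes a plain multiplicative-update NMF and an ordinary least-squares projection as stand-ins, and the function names `nmf` and `project`, the matrix sizes, and the synthetic data are all hypothetical.

```python
import numpy as np

def nmf(D, k, n_iter=500, seed=0, eps=1e-9):
    """Plain multiplicative-update NMF (a stand-in, not scCoGAPS).

    Factors D (genes x cells) as A (genes x k) @ P (k x cells),
    keeping both factors non-negative.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = D.shape
    A = rng.random((n_genes, k)) + eps
    P = rng.random((k, n_cells)) + eps
    for _ in range(n_iter):
        # Each update rescales a factor by a non-negative ratio,
        # so non-negativity is preserved.
        P *= (A.T @ D) / (A.T @ A @ P + eps)
        A *= (D @ P.T) / (A @ P @ P.T + eps)
    return A, P

def project(A, D_new):
    """Least-squares projection of new samples into learned factors.

    D_new must share the gene axis (rows) with the matrix used to learn A.
    Returns a (k x new_cells) matrix of factor weights.
    """
    P_new, *_ = np.linalg.lstsq(A, D_new, rcond=None)
    return P_new

# Synthetic "source" and "target" datasets that share gene-level signatures.
rng = np.random.default_rng(1)
true_A = rng.random((50, 3))                # 50 genes, 3 hidden signatures
source = true_A @ rng.random((3, 40))       # 40 source cells
target = true_A @ rng.random((3, 25))       # 25 target cells

A, P = nmf(source, k=3)                     # learn factors on the source
P_target = project(A, target)               # place target cells in that space

# Relative error of reconstructing the target from the source factors.
rel_err = np.linalg.norm(target - A @ P_target) / np.linalg.norm(target)
print(P_target.shape, rel_err)
```

In this sketch, labels attached to source cells could be carried over by comparing the columns of `P` and `P_target`, which live in the same factor space. That comparison step is left out because the abstract does not specify how it is done.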