Abstract
The field of single-cell genomics is seeing a marked increase in cohort-level studies that include hundreds of samples and feature complex designs. These data have tremendous potential for discovering how sample or tissue-level phenotypes relate to cellular and molecular composition. However, current analyses are based on simplified representations of these data that average information across cells. We present MrVI, a deep generative model designed to realize the potential of cohort studies at the single-cell level. MrVI tackles two fundamental and intertwined problems: stratifying samples into groups and evaluating the cellular and molecular differences between groups, both without requiring a priori grouping of cells into types or states. Due to its single-cell perspective, MrVI is able to detect clinically relevant stratifications of patients in COVID-19 and inflammatory bowel disease (IBD) cohorts that are only manifested in certain cellular subsets, thus enabling new discoveries that would otherwise be overlooked. Similarly, we demonstrate that MrVI can identify de novo groups of small molecules with similar biochemical properties and evaluate their effects on cellular composition and gene expression in large-scale perturbation studies. MrVI is available as open-source software at scvi-tools.org.
Competing Interest Statement
This work was supported by a Chan-Zuckerberg Initiative Seed Networks for the Human Cell Atlas grant (CZF2019-002452) and NIAID Grant R01 AI169075 to N.Y. J.H. was supported by grant number 2022-253560 from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation. A.G. is currently an employee of Google DeepMind. Google DeepMind has not directed any aspect of this study nor exerts any commercial rights over the results.
Footnotes
This revision revisits the methods as well as the experiments of this work. In particular, MrVI now relies on a different generative model, which uses cross-attention to model sample effects and a mixture-of-Gaussians prior. The performance of the algorithm has been validated on a much more extensive suite of semi-synthetic and real experiments.