Abstract
A complete understanding of biological processes requires synthesizing information across heterogeneous modalities, such as age, disease status, or gene/protein expression. Until recently, single-cell profiling experiments could measure only a single modality, leading to analysis focused on integrating information across separate experiments. However, researchers can now measure multiple modalities simultaneously in a single experiment, providing a new data paradigm that enables biological discovery but also requires new conceptual and analytic models. We therefore present Schema, an algorithm that leverages a principled metric learning strategy to synthesize multimodal information from the same experiment. To demonstrate the flexibility and power of our approach, we use Schema to infer cell types by integrating gene expression and chromatin accessibility data, perform differential gene expression analysis while accounting for batch effects and developmental age, estimate evolutionary pressure on peptide sequences, and synthesize spliced and unspliced mRNA data to infer cell differentiation. Schema can synthesize arbitrarily many modalities and capture sophisticated relationships between them, is computationally efficient, and provides a valuable conceptual model for exploring and understanding complex biology.
Competing Interest Statement
The authors have declared no competing interest.