Abstract
The advent of single-cell multi-omics sequencing technology makes it possible for researchers to leverage multiple modalities for individual cells and explore cell heterogeneity. However, the high-dimensional, discrete, and sparse nature of the data makes the downstream analysis particularly challenging. Most of the existing computational methods for single-cell data analysis are either limited to a single modality or lack flexibility and interpretability. In this study, we propose an interpretable deep learning method called multi-omic embedded topic model (moETM) to effectively perform integrative analysis of high-dimensional single-cell multimodal data. moETM integrates multiple omics data via a product-of-experts in the encoder for efficient variational inference and then employs multiple linear decoders to learn the multi-omic signatures of the gene regulatory programs. Through comprehensive experiments on public single-cell transcriptome and chromatin accessibility data (i.e., scRNA+scATAC), as well as scRNA and proteomic data (i.e., CITE-seq), moETM demonstrates superior performance compared with six state-of-the-art single-cell data analysis methods on seven publicly available datasets. By applying moETM to the scRNA+scATAC data in human bone marrow mononuclear cells (BMMCs), we identified sequence motifs corresponding to the transcription factors that regulate immune gene signatures. Applying moETM analysis to CITE-seq data from COVID-19 patients revealed not only known immune cell-type-specific signatures but also composite multi-omic biomarkers of critical conditions due to COVID-19, thus providing insights from both biological and clinical perspectives.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
We observed significantly high correlations between transcripts and proteins of the same gene, and between transcripts and in-cis chromatin accessibility regions, in terms of topic scores and reconstructed values across cells inferred from the CITE-seq and Multiome datasets, respectively (Supplementary Figure S8). We identified additional regulatory networks based on the inferred topics from the BMMC Multiome data (Supplementary Figure S7). These analyses reveal meaningful regulatory relationships across genes, motifs, and pathways. We further compared, both qualitatively and quantitatively, the cell topic mixtures inferred from individual modalities with the cell topic mixture integrated via the Product of Gaussians approach and observed improved cell clustering in the joint space (Supplementary Figure S6). We added more comprehensive pathway enrichment and motif enrichment analyses over the topics inferred from some of the datasets to demonstrate their biological meaningfulness (Supplementary Figure S9).
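As an illustration of the Product of Gaussians integration mentioned above, the following is a minimal NumPy sketch, not the moETM implementation. It assumes each modality's encoder outputs a diagonal Gaussian (mean and log-variance) over K topics per cell, and that the joint Gaussian is the precision-weighted combination of these experts. The function names `product_of_gaussians` and `topic_mixture`, the optional standard-normal prior expert, and the softmax mapping to topic proportions are assumptions made for this example.

```python
import numpy as np


def product_of_gaussians(mus, logvars, include_prior=False):
    """Combine per-modality diagonal Gaussians into one joint Gaussian.

    mus, logvars: arrays of shape (M, N, K) for M modalities, N cells, K topics.
    include_prior: if True, add a standard-normal expert (an assumption here,
    not something stated in the source text).
    Returns (joint_mu, joint_var), each of shape (N, K).
    """
    mus = np.asarray(mus, dtype=float)
    logvars = np.asarray(logvars, dtype=float)
    if include_prior:
        # A standard-normal expert has mean 0 and log-variance 0.
        zeros = np.zeros((1,) + mus.shape[1:])
        mus = np.concatenate([zeros, mus], axis=0)
        logvars = np.concatenate([zeros, logvars], axis=0)
    precisions = np.exp(-logvars)                # 1 / variance per expert
    joint_var = 1.0 / precisions.sum(axis=0)     # precisions add up
    joint_mu = joint_var * (precisions * mus).sum(axis=0)  # precision-weighted mean
    return joint_mu, joint_var


def topic_mixture(mu):
    """Map Gaussian means to topic proportions with a softmax (assumed here)."""
    shifted = mu - mu.max(axis=-1, keepdims=True)  # for numerical stability
    expd = np.exp(shifted)
    return expd / expd.sum(axis=-1, keepdims=True)


# Toy example: 2 modalities (e.g. RNA and protein), 1 cell, 2 topics.
rna_mu, rna_logvar = np.array([[0.0, 2.0]]), np.zeros((1, 2))
prot_mu, prot_logvar = np.array([[2.0, 0.0]]), np.zeros((1, 2))
joint_mu, joint_var = product_of_gaussians(
    [rna_mu, prot_mu], [rna_logvar, prot_logvar]
)
theta = topic_mixture(joint_mu)
```

In this toy case both experts are equally confident, so the joint mean is the average of the two modality means and the joint variance is halved. A modality with a smaller variance would pull the joint mean toward its own estimate.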