Abstract
Single-cell transcriptomics has revolutionized our understanding of cellular heterogeneity, yet modeling ultra-long transcriptome sequences, whose length is set by the number of genes, remains a significant computational challenge. In this study, we introduce SC-MAMBA2, the first application of the recent MAMBA2 architecture, which is built on state-space models (SSMs), to single-cell transcriptome modeling. Unlike traditional Transformer-based language models, SC-MAMBA2 leverages the efficiency and scalability of SSMs, enabling it to handle longer transcriptome sequences with reduced computational overhead. We introduce design adaptations tailored specifically to transcriptome sequences and implement a bidirectional modeling approach within the SSM framework, facilitating comprehensive analysis of whole-genome transcriptome sequences. With over 150 million parameters, SC-MAMBA2 is the largest model in the single-cell transcriptomics domain and can process transcriptome sequences covering more than 60,000 genes. The model was trained on a dataset of 57 million cells, making it the most comprehensive solution for handling ultra-long sequences to date. In extensive benchmarking across a range of downstream tasks, SC-MAMBA2 consistently outperforms state-of-the-art models in both accuracy and computational efficiency. These results underscore the effectiveness of SC-MAMBA2 and position it as a pivotal tool for future single-cell transcriptome studies.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
* This work was completed during an internship at XtalPi.
yalong.zhao{at}xtalpi.com, bowen.zhao{at}mail.mcgill.ca, fan.zhang{at}xtalpi.com, chenfeng.he{at}xtalpi.com, wuwendao{at}stu.pku.edu.cn