Abstract
Large-scale pretrained models have become foundation models, leading to breakthroughs in natural language processing and related fields. Developing foundation models in life science, aimed at deciphering the "languages" of cells and facilitating biomedical research, is challenging yet promising. We developed a large-scale pretrained model, scFoundation, for this purpose. scFoundation was trained on over 50 million human single-cell transcriptomic profiles, which provide high-throughput observations of the complex molecular features in all known cell types. scFoundation is currently the largest model in terms of trainable parameter size, gene dimensionality and the number of cells used in pretraining. Experiments showed that scFoundation can serve as a foundation model for single-cell transcriptomics, achieving state-of-the-art performance in a diverse array of downstream tasks, such as gene expression enhancement, tissue drug response prediction, single-cell drug response classification, and single-cell perturbation prediction.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
+ Work was done while interning at BioMap.