Abstract
Artificial intelligence (AI) plays a crucial role in functional genomic analysis, offering great potential for understanding biological phenomena such as heredity, development, disease, and evolution. However, developing AI models requires substantial labeled data, and the resulting models are typically task-specific, with limited generalizability across applications. Here, we develop Genomics-FM, a genomic-vocabulary-driven foundation model that enables versatile and label-efficient functional genomic analysis. Specifically, Genomics-FM is first pretrained with an ensemble genomic vocabulary on vast unlabeled data to learn comprehensive and generalizable representations, and is then finetuned with a specific genomic vocabulary on limited labeled data to selectively activate and adapt the pretrained knowledge for each task. We show that Genomics-FM substantially reduces the dependence on labeled data and outperforms existing models across a comprehensive suite of tasks, including genome annotation, epigenomic and expression profile prediction, and variant effect assessment. Remarkably, Genomics-FM also shows zero-shot predictive capability across diverse species and tissues and exhibits notable adaptability to RNA-related tasks. By remaining effective in data-scarce and even cross-domain biological scenarios, Genomics-FM will promote the broad application of AI in genomics and empower researchers to tackle previously insurmountable challenges, paving the way for groundbreaking research and discoveries.
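The sketch below is a minimal, hypothetical illustration of the two-stage paradigm summarized above: masked-token pretraining that samples from an ensemble of genomic vocabularies, followed by finetuning with one specific vocabulary on a small labeled set. All class names, vocabularies, and hyperparameters here are illustrative assumptions, not the authors' actual implementation.

```python
# Conceptual sketch (assumed, not Genomics-FM's real code): ensemble-vocabulary
# pretraining followed by specific-vocabulary finetuning of a tiny Transformer.
import random
import torch
import torch.nn as nn

# Two toy genomic vocabularies: single nucleotides and overlapping 3-mers.
BASES = ["A", "C", "G", "T"]
KMER3 = [a + b + c for a in BASES for b in BASES for c in BASES]
VOCAB_SIZE = len(BASES) + len(KMER3)
MASK_ID = VOCAB_SIZE  # extra id reserved for the mask token

def tokenize(seq: str, vocab: str):
    """Map a DNA sequence to token ids under the chosen vocabulary (toy scheme)."""
    if vocab == "base":
        return [BASES.index(ch) for ch in seq]
    # Overlapping 3-mers, offset so both vocabularies share one id space.
    return [len(BASES) + KMER3.index(seq[i:i + 3]) for i in range(len(seq) - 2)]

class TinyGenomicsFM(nn.Module):
    """Minimal Transformer encoder standing in for the foundation model."""
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mlm_head = nn.Linear(d_model, VOCAB_SIZE)  # pretraining head
        self.task_head = nn.Linear(d_model, 1)          # finetuning head

    def forward(self, ids):
        return self.encoder(self.embed(ids))

def random_dna(n: int) -> str:
    return "".join(random.choice(BASES) for _ in range(n))

model = TinyGenomicsFM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stage 1: masked-token pretraining on unlabeled sequences, sampling a
# vocabulary from the ensemble for each sequence.
for _ in range(10):
    ids = torch.tensor([tokenize(random_dna(32), random.choice(["base", "kmer3"]))])
    mask = torch.rand(ids.shape) < 0.15
    if not mask.any():
        continue
    masked = ids.clone()
    masked[mask] = MASK_ID
    logits = model.mlm_head(model(masked))
    loss = nn.functional.cross_entropy(logits[mask], ids[mask])
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: finetuning with one specific vocabulary on limited labeled data
# (toy binary labels), reusing the pretrained encoder.
for _ in range(10):
    ids = torch.tensor([tokenize(random_dna(32), "base")])
    label = torch.tensor([[1.0]])
    pred = model.task_head(model(ids).mean(dim=1))
    loss = nn.functional.binary_cross_entropy_with_logits(pred, label)
    opt.zero_grad(); loss.backward(); opt.step()
```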
Competing Interest Statement
The authors have declared no competing interest.