Abstract
In recent years, pre-trained foundational models have driven significant advances in Natural Language Processing (NLP), paving the way for similar AI technologies to interpret the language of biology. In this research, we introduce “LucaOne”, a novel pre-trained foundational model designed to learn integratively from the genetic and proteomic languages, trained on data from 169,861 species encompassing DNA, RNA, and proteins. This work illuminates the potential for creating a biological language model aimed at universal bioinformatics application. Remarkably, through few-shot learning, this model efficiently learns the central dogma of molecular biology and demonstrably outperforms competing models. Furthermore, in tasks requiring inputs of DNA, RNA, proteins, or a combination thereof, LucaOne exceeds state-of-the-art performance using a streamlined downstream architecture, thereby providing empirical evidence and innovative perspectives on the potential of foundational models to comprehend complex biological systems.
Competing Interest Statement
Yong He, Zhaorong Li, Pan Fang, and Jieping Ye have filed an application for a patent covering the work presented. The other authors declare no competing interests.