Abstract
The central dogma is a fundamental framework for understanding the flow and expression of genetic information in living organisms, and it connects biological sequences across molecule types. In this study, we present CD-GPT (Central Dogma Generative Pretrained Transformer), a generative biological foundation model with 1 billion parameters that aims to capture the sequence relationships between DNA, RNA, and proteins. We model sequences in a unified representational space and employ a shared, multi-molecule vocabulary to effectively narrow their distances in the embedding space. Through extensive pretraining on nucleotide and amino acid sequence data, CD-GPT exhibits exceptional performance in a wide range of predictive and generative downstream tasks, including mono-molecular and multi-molecular analyses. Notably, CD-GPT excels in predictive tasks such as genomic element detection, protein property prediction, and RNA-protein interaction identification, as well as generative tasks such as protein generation and reverse translation. The versatility of CD-GPT opens up promising avenues for advanced multi-omics analysis.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Figures 1 and 2 revised; author affiliations updated.