PT - JOURNAL ARTICLE
AU - Chen, Bo
AU - Cheng, Xingyi
AU - Geng, Yangli-ao
AU - Li, Shen
AU - Zeng, Xin
AU - Wang, Boyan
AU - Gong, Jing
AU - Liu, Chiming
AU - Zeng, Aohan
AU - Dong, Yuxiao
AU - Tang, Jie
AU - Song, Le
TI - xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
AID - 10.1101/2023.07.05.547496
DP - 2023 Jan 01
TA - bioRxiv
PG - 2023.07.05.547496
4099 - http://biorxiv.org/content/early/2023/07/06/2023.07.05.547496.short
4100 - http://biorxiv.org/content/early/2023/07/06/2023.07.05.547496.full
AB - Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited to either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently. This paper proposes a unified protein language model, xTrimoPGLM, that addresses both types of tasks simultaneously through an innovative pre-training framework. Our key technical contribution is an exploration of the compatibility and the potential for joint optimization of the two types of objectives, which has led to a strategy for training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens. Our extensive experiments reveal that xTrimoPGLM significantly outperforms other advanced baselines in diverse protein understanding tasks (13 out of 15 tasks across four categories) and generates novel protein sequences that are structurally similar to natural ones. Furthermore, using the same xTrimoPGLM framework, we train an antibody-specific model (xTrimoPGLM-Ab) with 1 billion parameters. This model sets a new record in predicting antibody naturalness and structures, both essential to antibody-based drug design, and demonstrates significantly faster inference than AlphaFold2. These results highlight the substantial capability and versatility of xTrimoPGLM in understanding and generating protein sequences. Competing Interest Statement: The authors have declared no competing interest.