PT   - JOURNAL ARTICLE
AU   - Shao, Bin
TI   - A long-context language model for deciphering and generating bacteriophage genomes
AID  - 10.1101/2023.12.18.572218
DP   - 2024 Jan 01
TA   - bioRxiv
PG   - 2023.12.18.572218
4099 - http://biorxiv.org/content/early/2024/02/07/2023.12.18.572218.short
4100 - http://biorxiv.org/content/early/2024/02/07/2023.12.18.572218.full
AB   - Inspired by the success of large language models, we develop a long-context generative model for genomes. Our multiscale transformer model was pre-trained on unannotated bacteriophage genomes with byte-level tokenization. We demonstrate the foundational capabilities of our model, including the prediction of essential genes, genetic variant effects, regulatory element activity and taxonomy of unannotated sequences. Furthermore, it generates de novo sequences up to 96K base pairs, which contain functional regulatory elements and novel proteins with phage-related functions.
CI   - Competing Interest Statement: The authors have declared no competing interest.