RT Journal Article
SR Electronic
T1 Language models of protein sequences at the scale of evolution enable accurate structure prediction
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2022.07.20.500902
DO 10.1101/2022.07.20.500902
A1 Zeming Lin
A1 Halil Akin
A1 Roshan Rao
A1 Brian Hie
A1 Zhongkai Zhu
A1 Wenting Lu
A1 Allan dos Santos Costa
A1 Maryam Fazel-Zarandi
A1 Tom Sercu
A1 Sal Candido
A1 Alexander Rives
YR 2022
UL http://biorxiv.org/content/early/2022/07/21/2022.07.20.500902.abstract
AB Large language models have recently been shown to develop emergent capabilities with scale, going beyond simple pattern matching to perform higher level reasoning and generate lifelike images and text. While language models trained on protein sequences have been studied at a smaller scale, little is known about what they learn about biology as they are scaled up. In this work we train models up to 15 billion parameters, the largest language models of proteins to be evaluated to date. We find that as models are scaled they learn information enabling the prediction of the three-dimensional structure of a protein at the resolution of individual atoms. We present ESMFold for high accuracy end-to-end atomic level structure prediction directly from the individual sequence of a protein. ESMFold has similar accuracy to AlphaFold2 and RoseTTAFold for sequences with low perplexity that are well understood by the language model. ESMFold inference is an order of magnitude faster than AlphaFold2, enabling exploration of the structural space of metagenomic proteins in practical timescales. Competing Interest Statement: The authors have declared no competing interest.