RT Journal Article
SR Electronic
T1 ProGen: Language Modeling for Protein Generation
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2020.03.07.982272
DO 10.1101/2020.03.07.982272
A1 Madani, Ali
A1 McCann, Bryan
A1 Naik, Nikhil
A1 Keskar, Nitish Shirish
A1 Anand, Namrata
A1 Eguchi, Raphael R.
A1 Huang, Po-Ssu
A1 Socher, Richard
YR 2020
UL http://biorxiv.org/content/early/2020/03/08/2020.03.07.982272.abstract
AB Generative modeling for protein engineering is key to solving fundamental problems in synthetic biology, medicine, and material science. We pose protein engineering as an unsupervised sequence generation problem in order to leverage the exponentially growing set of proteins that lack costly, structural annotations. We train a 1.2B-parameter language model, ProGen, on ∼280M protein sequences conditioned on taxonomic and keyword tags such as molecular function and cellular component. This provides ProGen with an unprecedented range of evolutionary sequence diversity and allows it to generate with fine-grained control as demonstrated by metrics based on primary sequence similarity, secondary structure accuracy, and conformational energy.