RT Journal Article
SR Electronic
T1 GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2022.10.10.511571
DO 10.1101/2022.10.10.511571
A1 Maxim Zvyagin
A1 Alexander Brace
A1 Kyle Hippe
A1 Yuntian Deng
A1 Bin Zhang
A1 Cindy Orozco Bohorquez
A1 Austin Clyde
A1 Bharat Kale
A1 Danilo Perez-Rivera
A1 Heng Ma
A1 Carla M. Mann
A1 Michael Irvin
A1 J. Gregory Pauloski
A1 Logan Ward
A1 Valerie Hayot-Sasson
A1 Murali Emani
A1 Sam Foreman
A1 Zhen Xie
A1 Diangen Lin
A1 Maulik Shukla
A1 Weili Nie
A1 Josh Romero
A1 Christian Dallago
A1 Arash Vahdat
A1 Chaowei Xiao
A1 Thomas Gibbs
A1 Ian Foster
A1 James J. Davis
A1 Michael E. Papka
A1 Thomas Brettin
A1 Rick Stevens
A1 Anima Anandkumar
A1 Venkatram Vishwanath
A1 Arvind Ramanathan
YR 2022
UL http://biorxiv.org/content/early/2022/11/23/2022.10.10.511571.abstract
AB We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pretraining on over 110 million prokaryotic gene sequences and finetuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represent one of the first whole-genome-scale foundation models which can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and peak of 850 PFLOPS.
We present initial scientific insights from examining GenSLMs in tracking evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data. Competing Interest Statement: The authors have declared no competing interest.