RT Journal Article
SR Electronic
T1 GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2022.10.10.511571
DO 10.1101/2022.10.10.511571
A1 Maxim Zvyagin
A1 Alexander Brace
A1 Kyle Hippe
A1 Yuntian Deng
A1 Bin Zhang
A1 Cindy Orozco Bohorquez
A1 Austin Clyde
A1 Bharat Kale
A1 Danilo Perez-Rivera
A1 Heng Ma
A1 Carla M. Mann
A1 Michael Irvin
A1 J. Gregory Pauloski
A1 Logan Ward
A1 Valerie Hayot-Sasson
A1 Murali Emani
A1 Sam Foreman
A1 Zhen Xie
A1 Diangen Lin
A1 Maulik Shukla
A1 Weili Nie
A1 Josh Romero
A1 Christian Dallago
A1 Arash Vahdat
A1 Chaowei Xiao
A1 Thomas Gibbs
A1 Ian Foster
A1 James J. Davis
A1 Michael E. Papka
A1 Thomas Brettin
A1 Rick Stevens
A1 Anima Anandkumar
A1 Venkatram Vishwanath
A1 Arvind Ramanathan
YR 2022
UL http://biorxiv.org/content/early/2022/11/23/2022.10.10.511571.abstract
AB We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pretraining on over 110 million prokaryotic gene sequences and finetuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represents one of the first whole genome scale foundation models which can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data. Competing Interest Statement: The authors have declared no competing interest.