PT - JOURNAL ARTICLE
AU - Sargun Nagpal
AU - Ridam Pal
AU - Ashima
AU - Ananya Tyagi
AU - Sadhana Tripathi
AU - Aditya Nagori
AU - Saad Ahmad
AU - Hara Prasad Mishra
AU - Rintu Kutum
AU - Tavpritesh Sethi
TI - Genomic Surveillance of COVID-19 Variants with Language Models and Machine Learning
AID - 10.1101/2021.05.25.445601
DP - 2021 Jan 01
TA - bioRxiv
PG - 2021.05.25.445601
4099 - http://biorxiv.org/content/early/2021/06/07/2021.05.25.445601.short
4100 - http://biorxiv.org/content/early/2021/06/07/2021.05.25.445601.full
AB - The global efforts to control COVID-19 are threatened by the rapid emergence of novel variants that may display undesirable characteristics such as immune escape or increased pathogenicity. Early prediction of emerging strains could be vital to pandemic preparedness but remains an open challenge. Here, we derive Dimensions of Concern (DoCs) in the latent space of SARS-CoV-2 mutations and demonstrate their potential to provide a lead time for predicting the increase of new cases. We modeled viral DNA sequences as documents, with codons treated as words, to learn unsupervised word embeddings. We discovered that "blips" in latent dimensions of the learned embeddings were associated with mutations. Latent dimensions harboring blips that consistently preceded, and were predictive of, new caseloads were analyzed further as DoCs. The DoCs captured the CGG, CTG, AGG, AGT, GAC, and CAC codons associated with major global VoCs L452R, R190S, and D1118H, thus validating our approach biologically. Tracking these DoCs can provide a practical approach to predict country-specific emergence and spread of viral strains for genomic surveillance and is extensible to related challenges such as immune escape, pathogenicity modeling, and antimicrobial resistance. Competing Interest Statement: The authors have declared no competing interest.
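
The abstract's idea of treating a sequence as a document whose words are codons can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `codon_tokens`, the fixed reading frame starting at position 0, and the toy sequence are all assumptions made for illustration, and the record does not state which embedding model or preprocessing was used.

```python
from collections import Counter


def codon_tokens(sequence: str) -> list[str]:
    """Split a nucleotide sequence into non-overlapping codons.

    Assumption: the reading frame starts at position 0; trailing
    bases that do not fill a whole codon are dropped.
    """
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3  # length of the part divisible into codons
    return [seq[i:i + 3] for i in range(0, usable, 3)]


# Toy sequence (made up for illustration, not real SARS-CoV-2 data).
toy_sequence = "ATGCGGCTGAGGAGTGACCACTA"

# The "document": a list of codon "words".
document = codon_tokens(toy_sequence)

# Word frequencies, the kind of input a word-embedding trainer consumes.
codon_counts = Counter(document)
```

A corpus built this way (one token list per sequence) could then be passed to any word-embedding trainer, which would assign each codon a vector whose dimensions play the role of the latent dimensions described in the abstract.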