Abstract
The global efforts to control COVID-19 are threatened by the rapid emergence of novel SARS-CoV-2 variants that may display undesirable characteristics such as immune escape or increased pathogenicity. Early prediction of emerging strains could be vital to pandemic preparedness but remains an open challenge. Here, we developed Strainflow to learn the latent dimensions of 0.9 million high-quality SARS-CoV-2 genome sequences, and used machine learning algorithms to predict upcoming COVID-19 caseloads. In the Strainflow model, SARS-CoV-2 genome sequences were treated as documents and codons as words, so that codon embeddings (latent dimensions) could be learned without supervision. We discovered that codon-level changes lead to changes in the entropy of the latent dimensions. We used a machine learning algorithm to find the most relevant latent dimensions of SARS-CoV-2 spike genes, termed Dimensions of Concern (DoCs), and demonstrated their potential to provide a lead time for predicting new caseloads in several countries. The DoCs capture codons associated with global Variants of Concern (VOCs) and Variants of Interest (VOIs), and may be surveilled to predict the country-specific emergence and spread of SARS-CoV-2 variants.
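To make the "sequences as documents, codons as words" framing concrete, the following minimal sketch splits a nucleotide sequence into codon tokens. The function name `codon_tokens`, the non-overlapping in-frame reading, and the choice to drop a trailing partial codon are assumptions made for illustration; they are not details stated in this abstract.

```python
def codon_tokens(sequence, frame=0):
    """Split a nucleotide sequence into non-overlapping codons ("words").

    Assumptions (illustrative only): reading starts at offset `frame`,
    codons do not overlap, and a trailing partial codon is dropped.
    """
    seq = sequence.upper()
    # Step through the sequence three bases at a time; stopping at
    # len(seq) - 2 guarantees every slice is a full three-base codon.
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


# One "document" per genome or gene sequence, one "word" per codon.
document = codon_tokens("atggccttta")
```

A corpus built this way (one token list per sequence) is the kind of input a word-embedding method expects.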
Highlights
We developed Strainflow, a genomic surveillance model for SARS-CoV-2 genome sequences in which sequences were treated as documents and codons as words, and used the skip-gram algorithm to learn the codon context of 0.9 million spike genes.
Time series analysis of the information content (entropy) of the latent dimensions learned by Strainflow shows a leading relationship with monthly COVID-19 cases for seven countries, including the USA, Japan, and India.
Machine learning models of the entropy of the latent dimensions allowed us to develop an epidemiological early warning system for COVID-19 caseloads.
The top codons associated with the most relevant latent dimensions (DoCs) were linked to SARS-CoV-2 variants, and these DoCs may be used as surrogates to track the country-specific spread of the variants.
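The skip-gram algorithm mentioned in the highlights learns embeddings by predicting the words surrounding each center word. The sketch below only generates the (center, context) training pairs for a list of codon tokens; it does not train a model. The function name `skipgram_pairs` and the window size are hypothetical choices for illustration, and the settings actually used for Strainflow are not given here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs as used to train a skip-gram model.

    `window` is the number of tokens considered on each side of the
    center token (an illustrative value, not the authors' setting).
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # clip the window at the start
        hi = min(len(tokens), i + window + 1)   # and at the end of the sequence
        for j in range(lo, hi):
            if j != i:                          # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["ATG", "GCC", "TTT"], window=1)
```

In practice the pairs are consumed by an embedding library rather than built by hand; the point here is only that each codon's vector is shaped by the codons that occur around it.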
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
In this revised version, we have implemented fast tSNE for qualitative inspection of the latent dimensions (LDs) of 0.9 million SARS-CoV-2 spike genes. Quantitative analyses of the LDs were performed using the fast sample entropy method, which we now also use instead of 'blips' to model COVID-19 caseloads.
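For readers unfamiliar with sample entropy, the sketch below implements the standard definition in its straightforward quadratic-time form, not the fast variant referred to in this footnote. The function name `sample_entropy`, the default template length `m=2`, and the use of an absolute tolerance `r` are assumptions for illustration; in practice `r` is often set relative to the standard deviation of the series.

```python
import math


def sample_entropy(series, m=2, r=0.2):
    """Plain O(n^2) sample entropy, SampEn(m, r) = -ln(A / B).

    B counts pairs of length-m templates within Chebyshev distance r,
    A counts the same for length m + 1. Self-matches are excluded.
    `r` is an absolute tolerance here (an illustrative choice).
    """
    n = len(series)
    if n <= m + 1:
        raise ValueError("series is too short for the chosen m")

    def count_matches(length):
        # Use the same n - m starting points for both template lengths.
        count = 0
        for i in range(n - m):
            for j in range(i + 1, n - m):
                dist = max(abs(series[i + k] - series[j + k])
                           for k in range(length))
                if dist <= r:
                    count += 1
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return math.inf  # undefined: no matching templates were found
    return math.log(b / a)  # equal to -ln(A / B)


regular = sample_entropy([0.0, 1.0] * 10, m=2, r=0.1)
```

A perfectly regular series yields a value of zero, and less predictable series yield larger values, which is why the measure can serve as a summary of the information content of a latent dimension over time.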