Abstract
We describe a new deep learning approach for the imputation of SARS-CoV-2 variants. Our model, ImputeCoVNet, consists of a 2D ResNet autoencoder that efficiently imputes missing genetic variants in SARS-CoV-2 sequences. We show that ImputeCoVNet yields accurate results at minor allele frequencies as low as 0.0001. When compared with an approach based on Hamming distance, ImputeCoVNet achieved comparable results with significantly less computation time. We also show that providing geographical metadata (e.g., exposed country) to the decoder increases imputation accuracy. Additionally, by visualizing the embeddings of SARS-CoV-2 variants, we show that the embeddings produced by the trained ImputeCoVNet encoder recapitulate viral clade information, suggesting that they could be used for predictive tasks based on viral sequence analysis.
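To make the Hamming-distance comparison concrete, the following is a minimal sketch of how a nearest-neighbor imputation of this kind could work. The abstract does not specify the baseline's details, so everything here is an assumption for illustration: the integer encoding of variants, the `MISSING` sentinel, the function name `hamming_impute`, and the choice to copy alleles from the single closest reference sequence.

```python
import numpy as np

MISSING = -1  # hypothetical sentinel marking an unobserved variant


def hamming_impute(query, references):
    """Fill missing entries of `query` from the nearest reference sequence.

    Illustrative sketch only; not the baseline implementation from the paper.
    """
    query = np.asarray(query)
    references = np.asarray(references)

    # Positions where the query has an observed value.
    observed = query != MISSING

    # Hamming distance to each reference, counted over observed positions only.
    distances = (references[:, observed] != query[observed]).sum(axis=1)

    # Copy the closest reference's alleles into the missing positions.
    nearest = references[np.argmin(distances)]
    imputed = query.copy()
    imputed[~observed] = nearest[~observed]
    return imputed


# Toy reference panel: rows are sequences, columns are variant sites (0/1).
references = np.array([
    [0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
])
query = np.array([1, MISSING, 0, 1, MISSING])

result = hamming_impute(query, references)
print(result.tolist())
```

In this sketch each query is compared against every reference sequence, so the work per query grows linearly with the size of the reference panel.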
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
julie.hussin{at}umontreal.ca
* Alternative Affiliation: Mila, 6666 St Urbain St, Montreal, Quebec H2S 3H1 (pesarana{at}mila.quebec)
15th Conference in Machine Learning in Computational Biology (MLCB 2020), Vancouver, Canada.