Abstract
As of February 8, 2020, the 2019 Novel Coronavirus (2019-nCoV) had spread to 29 countries, with 725 deaths and more than 34,000 confirmed cases. 2019-nCoV is being compared to the infamous SARS coronavirus, which, between November 2002 and July 2003, resulted in 8,098 confirmed cases and 774 deaths worldwide, a death rate of 9.6%. Although 2019-nCoV has a death rate of 2% as of February 8, the 34,963 cases confirmed within a few weeks (December 8, 2019 to February 8, 2020) are alarming, and cases are likely under-reported given the comparatively longer incubation period. Such outbreaks demand elucidation of the taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. This paper proposes the use of a machine learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of whole 2019-nCoV genomes. Specifically, we classify 2019-nCoV using MLDSP and MLDSP-GUI, alignment-free methods that use Machine Learning (ML) and Digital Signal Processing (DSP) for genome analyses. These tools are used to analyze a large dataset of unique viral genomic sequences, totalling 61.8 million bp, with a “decision tree” approach for successive refinements of taxonomic classification. Our results support the hypothesis of a bat origin and classify 2019-nCoV as Sarbecovirus, within Betacoronavirus. We use Spearman’s rank correlation analysis to confirm the relatedness of the 2019-nCoV sequences to the known genera of the family Coronaviridae and the known sub-genera of the genus Betacoronavirus. Our method achieves high levels of classification accuracy and discovers the most relevant relationships among over 5,000 viral genomes within seconds, ab initio, using raw DNA sequence data alone, and without any specialized biological knowledge, training, or gene or genome annotations.
This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.
Author summary
Analyzing over 5,000 diverse complete viral genomes, we obtained a 100% accuracy score for classification of 2019-nCoV as Coronaviridae, Betacoronavirus, and finally as belonging to the sub-genus Sarbecovirus, using an alignment-free, supervised machine learning approach. Genomes identified as closely related to 2019-nCoV are bat betacoronaviruses within the same genus, which supports the hypothesis of a bat origin for this novel coronavirus. This alignment-free analysis of genomic signatures using machine learning requires no prior knowledge of genic or regulatory content, and accurately classifies genomes of unknown taxonomy, potentially to genus-level resolution, within minutes. This suggests that, for novel viral and pathogen genome sequences, such alignment-free machine-learning analyses can provide a reliable real-time option for taxonomic classification.
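To make the idea of an alignment-free, signal-processing-based classification more concrete, the following is a minimal sketch of one way such a pipeline could look. It is not the MLDSP implementation. The purine/pyrimidine numeric mapping, the fixed signal length, the Pearson-correlation-based distance, the 1-nearest-neighbour rule, and all function names and toy sequences are assumptions chosen for illustration only.

```python
import cmath
import math

# Assumed numeric mapping (purine = -1, pyrimidine = +1); illustrative only.
PP_MAP = {"A": -1.0, "G": -1.0, "C": 1.0, "T": 1.0}


def to_signal(seq, length):
    """Map a DNA string to numbers, then truncate or zero-pad to `length`."""
    values = [PP_MAP.get(base, 0.0) for base in seq.upper()][:length]
    return values + [0.0] * (length - len(values))


def magnitude_spectrum(signal):
    """Magnitudes of the discrete Fourier transform (naive O(n^2) version)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def pearson_distance(u, v):
    """Distance in [0, 1] derived from the Pearson correlation r: (1 - r) / 2."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.5  # r is undefined for a constant vector; treat as uncorrelated
    r = max(-1.0, min(1.0, cov / (norm_u * norm_v)))  # guard against rounding
    return (1.0 - r) / 2.0


def classify(query, labelled, length=64):
    """Return the label of the labelled sequence whose spectrum is closest."""
    q = magnitude_spectrum(to_signal(query, length))
    nearest = min(
        labelled,
        key=lambda item: pearson_distance(
            q, magnitude_spectrum(to_signal(item[0], length))),
    )
    return nearest[1]


# Toy, made-up sequences and labels (not real viral genomes).
training = [("ACGT" * 16, "group_1"), ("AAAACCCC" * 8, "group_2")]
print(classify("ACGT" * 15 + "ACGA", training))  # prints "group_1"
```

No sequence alignment or annotation is used in this sketch: each sequence is compared only through the magnitude spectrum of its numeric representation. A realistic setting would need a fast Fourier transform, a principled length normalization, and a properly trained and cross-validated classifier in place of the nearest-neighbour rule shown here.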