New Results

Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study

Gurjit S. Randhawa, Maximillian P.M. Soltysiak, Hadi El Roz, Camila P.E. de Souza, Kathleen A. Hill, Lila Kari
doi: https://doi.org/10.1101/2020.02.03.932350
Gurjit S. Randhawa
1Department of Computer Science, The University of Western Ontario, London, ON, Canada
  • For correspondence: grandha8@uwo.ca
Maximillian P.M. Soltysiak
2Department of Biology, The University of Western Ontario, London, ON, Canada
Hadi El Roz
2Department of Biology, The University of Western Ontario, London, ON, Canada
Camila P.E. de Souza
3Department of Statistical and Actuarial Sciences, The University of Western Ontario, London, ON, Canada
Kathleen A. Hill
2Department of Biology, The University of Western Ontario, London, ON, Canada
Lila Kari
4School of Computer Science, University of Waterloo, Waterloo, ON, Canada

Abstract

As of February 20, 2020, the 2019 novel coronavirus (renamed COVID-19) had spread to 30 countries, with 2130 deaths and more than 75500 confirmed cases. COVID-19 is being compared to the infamous SARS coronavirus, which, between November 2002 and July 2003, caused 8098 confirmed cases and 774 deaths worldwide, a death rate of 9.6%. Although COVID-19 has a lower death rate of 2.8% as of February 20, the 75752 cases confirmed within a few weeks (December 8, 2019 to February 20, 2020) are alarming, and cases are likely under-reported given the comparatively longer incubation period. Such outbreaks demand elucidation of the taxonomic classification and origin of the virus genomic sequence for strategic planning, containment, and treatment. This paper identifies an intrinsic COVID-19 genomic signature and uses it together with a machine learning-based alignment-free approach for ultra-fast, scalable, and highly accurate classification of whole COVID-19 genomes. The proposed method combines supervised machine learning with digital signal processing for genome analyses, augmented by a decision tree approach to the machine learning component and by a Spearman’s rank correlation coefficient analysis for result validation. These tools are used to analyze a large dataset of over 5000 unique viral genomic sequences, totalling 61.8 million bp. Our results support a hypothesis of a bat origin and classify COVID-19 as Sarbecovirus, within Betacoronavirus. Our method achieves high levels of classification accuracy and discovers the most relevant relationships among over 5000 viral genomes within a few minutes, ab initio, using raw DNA sequence data alone, and without any specialized biological knowledge, training, or gene or genome annotations. This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.
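
To make the alignment-free idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not say how the genomic signature is computed, so the sketch assumes overlapping k-mer counts, a common alignment-free representation, and compares two count vectors with Spearman's rank correlation coefficient, the statistic the abstract names for result validation. The function names (`kmer_counts`, `average_ranks`, `spearman`) and the choice k = 3 are hypothetical, and the digital signal processing and supervised classification steps are not shown.

```python
from itertools import product


def kmer_counts(seq, k=3):
    """Count overlapping k-mers over the alphabet ACGT, in a fixed order.

    Assumption: k-mer counts stand in for the paper's genomic signature.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # windows containing N etc. are skipped
        if j is not None:
            counts[j] += 1
    return counts


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for t in range(pos, end + 1):
            ranks[order[t]] = mean_rank
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined when one vector is constant")
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    # Toy sequences for illustration only; these are not real viral genomes.
    a = kmer_counts("ATGCGATACGCTTGAGCTAGCTAGGATCCA")
    b = kmer_counts("ATGCGATACGCTTGAGCTAGCTAGGTTCCA")
    print(round(spearman(a, b), 3))
```

Because only raw sequence and counting are involved, no alignment or annotation is needed, which is the property the abstract emphasises. A value of rho near 1 indicates that two sequences rank their k-mers similarly.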

Footnotes

  • nCoV-2019 renamed to COVID-19; updated confirmed cases and deaths reported as of Feb 20; reformatted text for journal submission

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted February 20, 2020.
Subject Area

  • Bioinformatics