bioRxiv
New Results

Data-driven approaches for genetic characterization of SARS-CoV-2 lineages

Fatima Mostefai, Isabel Gamache, Jessie Huang, Arnaud N’Guessan, Justin Pelletier, Ahmad Pesaranghader, David Hamelin, Carmen Lia Murall, Raphaël Poujol, Jean-Christophe Grenier, Martin Smith, Etienne Caron, Morgan Craig, Jesse Shapiro, Guy Wolf, Smita Krishnaswamy, Julie G. Hussin
doi: https://doi.org/10.1101/2021.09.28.462270
Fatima Mostefai
1Research Centre, Montreal Heart Institute, 5000 Belanger, H1T 1C8, Quebec, Canada
2Département de Biochimie et Médecine Moléculaire, Université de Montréal, Quebec, Canada
Isabel Gamache
1Research Centre, Montreal Heart Institute, 5000 Belanger, H1T 1C8, Quebec, Canada
2Département de Biochimie et Médecine Moléculaire, Université de Montréal, Quebec, Canada
Jessie Huang
3Department of Computer Science, Yale University, New Haven, CT, USA
Arnaud N’Guessan
1Research Centre, Montreal Heart Institute, 5000 Belanger, H1T 1C8, Quebec, Canada
4Department of Microbiology and Immunology, McGill University, 740 avenue Dr. Penfield, H3A 0G1, Quebec, Canada
Justin Pelletier
1Research Centre, Montreal Heart Institute, 5000 Belanger, H1T 1C8, Quebec, Canada
2Département de Biochimie et Médecine Moléculaire, Université de Montréal, Quebec, Canada
Ahmad Pesaranghader
5Mila - Quebec Artificial Intelligence Institute, Quebec, Canada
David Hamelin
1Research Centre, Montreal Heart Institute, 5000 Belanger, H1T 1C8, Quebec, Canada
2Département de Biochimie et Médecine Moléculaire, Université de Montréal, Quebec, Canada
6Research Centre, CHU Ste-Justine, Quebec, Canada
Carmen Lia Murall
4Department of Microbiology and Immunology, McGill University, 740 avenue Dr. Penfield, H3A 0G1, Quebec, Canada
Raphaël Poujol
1Research Centre, Montreal Heart Institute, 5000 Belanger, H1T 1C8, Quebec, Canada
Jean-Christophe Grenier
1Research Centre, Montreal Heart Institute, 5000 Belanger, H1T 1C8, Quebec, Canada
Martin Smith
2Département de Biochimie et Médecine Moléculaire, Université de Montréal, Quebec, Canada
6Research Centre, CHU Ste-Justine, Quebec, Canada
Etienne Caron
6Research Centre, CHU Ste-Justine, Quebec, Canada
7Department of Pathology and Cellular Biology, Université de Montréal, Quebec, Canada
Morgan Craig
6Research Centre, CHU Ste-Justine, Quebec, Canada
8Département de Mathématiques et Statistique, Université de Montréal, Quebec, Canada
Jesse Shapiro
4Department of Microbiology and Immunology, McGill University, 740 avenue Dr. Penfield, H3A 0G1, Quebec, Canada
Guy Wolf
5Mila - Quebec Artificial Intelligence Institute, Quebec, Canada
7Department of Pathology and Cellular Biology, Université de Montréal, Quebec, Canada
Smita Krishnaswamy
3Department of Computer Science, Yale University, New Haven, CT, USA
9Department of Genetics, Yale University, New Haven, CT, USA
Julie G. Hussin
1Research Centre, Montreal Heart Institute, 5000 Belanger, H1T 1C8, Quebec, Canada
10Département de Médecine, Université de Montréal, Quebec, Canada
  • For correspondence: julie.hussin@umontreal.ca

Abstract

The genome of the Severe Acute Respiratory Syndrome coronavirus 2 (SARS-CoV-2), the pathogen that causes coronavirus disease 2019 (COVID-19), has been sequenced at an unprecedented scale, producing a very large volume of viral genome sequence data. To understand the evolution of this virus in humans, and to assist in tracing infection pathways and designing preventive strategies, we present a set of computational tools that span phylogenomics, population genetics and machine learning approaches. To illustrate the utility of this toolbox, we detail an in-depth analysis of the genetic diversity of SARS-CoV-2 in the first year of the COVID-19 pandemic, using 329,854 high-quality consensus sequences published in the GISAID database during the pre-vaccination phase. We demonstrate that, compared to standard phylogenetic approaches, haplotype networks can be computed efficiently on much larger datasets, enabling real-time analyses. Furthermore, changes in Tajima’s D over time provide a powerful metric of population expansion. Unsupervised learning techniques further highlight key steps in variant detection and facilitate the study of the role of this genomic variation in the context of SARS-CoV-2 infection, with the Multiscale PHATE methodology identifying fine-scale structure in the SARS-CoV-2 genetic data that underlies the emergence of key lineages. The computational framework presented here is useful for real-time genomic surveillance of SARS-CoV-2 and could be applied to any pathogen that threatens the health of worldwide populations of humans and other organisms.
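
To make the Tajima’s D metric mentioned above concrete, the following is a minimal sketch of the standard Tajima (1989) statistic for a single alignment. It is not the authors’ pipeline: the function name `tajimas_d` and the toy alignments are illustrative assumptions, and the code assumes aligned, equal-length sequences with no gaps or ambiguous bases. A time series would be obtained by applying it to the sequences sampled in each time window.

```python
import math
from collections import Counter


def tajimas_d(seqs):
    """Tajima's D for a list of aligned, equal-length sequences (illustrative helper)."""
    n = len(seqs)
    if n < 4:
        # The variance term is zero for n < 4, so D is undefined.
        raise ValueError("need at least 4 sequences")
    if any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("sequences must be aligned to the same length")

    seg_sites = 0          # S: number of segregating (polymorphic) sites
    pairwise_diffs = 0.0   # total differences summed over all sequence pairs
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            # Number of sequence pairs that differ at this site.
            pairwise_diffs += (n * n - sum(c * c for c in counts.values())) / 2

    if seg_sites == 0:
        return math.nan  # no variation: D is undefined

    # pi: mean number of pairwise differences.
    pi = pairwise_diffs / (n * (n - 1) / 2)

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    # D compares pi with Watterson's estimator S / a1, scaled by its standard deviation.
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


# Toy alignment in which every variant is a singleton (excess of rare variants).
singletons = ["TAAAA", "ATAAA", "AATAA", "AAATA", "AAAAT"]
print(tajimas_d(singletons))
```

An excess of rare variants, as in the toy alignment, pulls the mean pairwise difference below Watterson’s estimator and gives a negative D, which is the pattern usually associated with population expansion.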

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • † Jointly supervised work.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.
Posted September 29, 2021.
Subject Area

  • Bioinformatics