bioRxiv
New Results

Machine learning to predict the source of campylobacteriosis using whole genome data

Nicolas Arning, Samuel K. Sheppard, David A. Clifton, Daniel J. Wilson
doi: https://doi.org/10.1101/2021.02.23.432443
Nicolas Arning
1 Big Data Institute, Nuffield Department of Population Health, University of Oxford, Li Ka Shing Centre for Health Information and Discovery, Old Road Campus, Oxford, OX3 7LF, UK
  • For correspondence: nicolas.arning@bdi.ox.ac.uk
Samuel K. Sheppard
2 The Milner Centre of Evolution, Department of Biology & Biochemistry, University of Bath, Claverton Down, Bath, BA2 7AZ, UK
David A. Clifton
3 Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, OX3 7DQ, UK
Daniel J. Wilson
1 Big Data Institute, Nuffield Department of Population Health, University of Oxford, Li Ka Shing Centre for Health Information and Discovery, Old Road Campus, Oxford, OX3 7LF, UK

Abstract

Campylobacteriosis is among the world’s most common foodborne illnesses, caused predominantly by the bacterium Campylobacter jejuni. Effective interventions require determination of the infection source, which is challenging because transmission occurs via multiple sources such as contaminated meat, poultry, and drinking water. Strain variation has allowed source tracking based upon allelic variation in multi-locus sequence typing (MLST) genes, so that isolates from infected individuals can be attributed to specific animal or environmental reservoirs. However, the accuracy of probabilistic attribution models has been limited by the ability to differentiate isolates based upon just 7 MLST genes. Here, we broaden the input data spectrum to include core genome MLST (cgMLST) and whole genome sequences (WGS), and implement multiple machine learning algorithms, allowing more accurate source attribution. We increase attribution accuracy from 64% using the standard iSource population genetic approach to 71% for MLST, 85% for cgMLST and 78% for kmerized WGS data using machine learning. To gain insight beyond the source model prediction, we use Bayesian inference to analyse the relative affinity of C. jejuni strains to infect humans and identify potential differences in source-human transmission ability among clonally related isolates in the most common disease-causing lineage (ST-21 clonal complex). By providing generalizable, computationally efficient methods based upon machine learning and population genetics, we offer a scalable approach to global disease surveillance that can continuously incorporate novel samples for source attribution and identify fine-scale variation in transmission potential.
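
The abstract mentions "kmerized WGS data" without describing the encoding. As a minimal sketch of what kmerizing generally means, and not the authors' pipeline, the Python below splits each sequence into overlapping substrings of length k and records which k-mers each isolate carries. The function names, the toy sequences and the choice k = 3 are all illustrative assumptions; real analyses use far longer k-mers on whole genomes.

```python
from collections import Counter


def kmerize(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer in a DNA sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def presence_absence(genomes: dict[str, str], k: int) -> tuple[list[str], dict[str, list[int]]]:
    """Build a binary isolate-by-k-mer table from named sequences."""
    counts = {name: kmerize(seq, k) for name, seq in genomes.items()}
    # The shared vocabulary is every k-mer seen in at least one isolate.
    vocabulary = sorted(set().union(*counts.values()))
    matrix = {
        name: [int(kmer in isolate_counts) for kmer in vocabulary]
        for name, isolate_counts in counts.items()
    }
    return vocabulary, matrix


# Hypothetical toy sequences that differ at a single position.
toy_genomes = {"isolate_A": "ATGCGATG", "isolate_B": "ATGCGTTG"}
vocabulary, matrix = presence_absence(toy_genomes, k=3)
```

Each row of the resulting table is a fixed-length feature vector, which is the kind of input a generic classifier could be trained on alongside a source label for each isolate.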

Author summary

C. jejuni is the most common cause of food-borne bacterial gastroenteritis, but the relative contributions of different sources are incompletely understood. We traced the origin of human C. jejuni infections using machine learning algorithms that compare the DNA sequences of bacteria sampled from infected people, contaminated chickens, cattle, sheep, wild birds and the environment. This approach improved the accuracy of source attribution by 33% over existing methods that use only a subset of genes within the genome, and provided evidence for the relative contribution of different infection sources. Sometimes even very similar bacteria showed differences, demonstrating the value of basing analyses on the entire genome when developing this algorithm, which can be used for understanding the global epidemiology of C. jejuni and other important bacterial infections.
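
The text does not say how the 33% figure is derived. One reading, assumed here rather than stated by the authors, is that it is the relative gain from the 64% iSource accuracy to the 85% cgMLST accuracy reported in the abstract. The short Python check below shows that arithmetic; the function name is illustrative.

```python
def relative_improvement(baseline_pct: float, improved_pct: float) -> float:
    """Return the gain over the baseline as a percentage of the baseline."""
    return (improved_pct - baseline_pct) / baseline_pct * 100


isource_accuracy = 64  # percent, from the abstract
cgmlst_accuracy = 85   # percent, from the abstract

absolute_gain = cgmlst_accuracy - isource_accuracy  # percentage points
relative_gain = relative_improvement(isource_accuracy, cgmlst_accuracy)
```

Under this reading the absolute gain is 21 percentage points, while the relative gain is about 32.8%, which rounds to the quoted 33%.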

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted February 23, 2021.
bioRxiv 2021.02.23.432443; doi: https://doi.org/10.1101/2021.02.23.432443

Subject Area

  • Bioinformatics