bioRxiv
New Results

Genomic diversity affects the accuracy of bacterial SNP calling pipelines

Stephen J. Bush, Dona Foster, David W. Eyre, Emily L. Clark, Nicola De Maio, Liam P. Shaw, Nicole Stoesser, Tim E. A. Peto, Derrick W. Crook, A. Sarah Walker
doi: https://doi.org/10.1101/653774
Stephen J. Bush
1Nuffield Department of Medicine, University of Oxford, Oxford, UK
2National Institute for Health Research Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance at University of Oxford in partnership with Public Health England, Oxford, UK
  • For correspondence: stephen.bush@roslin.ed.ac.uk
Dona Foster
1Nuffield Department of Medicine, University of Oxford, Oxford, UK
3National Institute for Health Research Oxford Biomedical Research Centre, Oxford, UK
David W. Eyre
1Nuffield Department of Medicine, University of Oxford, Oxford, UK
Emily L. Clark
4The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh, UK
Nicola De Maio
1Nuffield Department of Medicine, University of Oxford, Oxford, UK
Liam P. Shaw
1Nuffield Department of Medicine, University of Oxford, Oxford, UK
Nicole Stoesser
1Nuffield Department of Medicine, University of Oxford, Oxford, UK
Tim E. A. Peto
1Nuffield Department of Medicine, University of Oxford, Oxford, UK
2National Institute for Health Research Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance at University of Oxford in partnership with Public Health England, Oxford, UK
3National Institute for Health Research Oxford Biomedical Research Centre, Oxford, UK
Derrick W. Crook
1Nuffield Department of Medicine, University of Oxford, Oxford, UK
2National Institute for Health Research Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance at University of Oxford in partnership with Public Health England, Oxford, UK
3National Institute for Health Research Oxford Biomedical Research Centre, Oxford, UK
A. Sarah Walker
1Nuffield Department of Medicine, University of Oxford, Oxford, UK
2National Institute for Health Research Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance at University of Oxford in partnership with Public Health England, Oxford, UK
3National Institute for Health Research Oxford Biomedical Research Centre, Oxford, UK

Abstract

Background Accurately identifying SNPs from bacterial sequencing data is essential for using genomics to track transmission and predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to which reads are aligned; this choice, together with the bioinformatic pipeline, affects the accuracy and completeness of the resulting SNP calls.

This study evaluates the performance of 41 SNP calling pipelines using simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally sourced, genomically diverse isolates within the genera Citrobacter, Enterobacter, Escherichia and Klebsiella.

Results We evaluated the performance of 41 SNP calling pipelines, aligning reads to genomes of the same or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, both the sensitivity and the precision of a pipeline were strongly and inversely related to the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic bacteria such as Escherichia coli, but less pronounced for clonal species such as Mycobacterium tuberculosis.
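The two quantities compared in the Results can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names, the representation of a SNP as a (position, alternate base) tuple, and the example values are assumptions made for illustration only. The Mash distance function uses the published Mash formula, D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard index estimated between two k-mer sketches and k = 21 is the Mash default.

```python
import math


def precision_and_sensitivity(called, truth):
    """Score a set of called SNPs against a truth set.

    Each SNP is assumed (for illustration) to be a (position, alt_base) tuple.
    Precision   = true positives / all called SNPs.
    Sensitivity = true positives / all true SNPs (also known as recall).
    """
    called, truth = set(called), set(truth)
    true_pos = len(called & truth)
    precision = true_pos / len(called) if called else 0.0
    sensitivity = true_pos / len(truth) if truth else 0.0
    return precision, sensitivity


def mash_distance(jaccard, k=21):
    """Convert a Jaccard index estimate into a Mash distance.

    Implements D = -(1/k) * ln(2j / (1 + j)); a Jaccard index of 0
    is reported as the maximum distance of 1.
    """
    if jaccard <= 0:
        return 1.0
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / k)


# Hypothetical example: four true SNPs, three calls, two of them correct.
truth = {(10, "A"), (20, "C"), (30, "G"), (40, "T")}
called = {(10, "A"), (20, "C"), (50, "G")}
precision, sensitivity = precision_and_sensitivity(called, truth)
```

In this hypothetical example, two of the three calls are correct and two of the four true SNPs are recovered. A lower Jaccard index between reads and reference gives a larger Mash distance, which is the direction in which the abstract reports both metrics declining.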

Conclusions The accuracy of SNP calling for a given species is compromised by increasing intra-species diversity. When reads were aligned to the genome from which they were sequenced, Novoalign/GATK was among the highest-performing pipelines. However, across the full range of (divergent) genomes, Snippy was among the consistently highest-performing pipelines.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted May 31, 2019.

Subject Area

  • Bioinformatics