bioRxiv
New Results

On the (Im)possibility to Reconstruct Plasmids from Whole Genome Short-Read Sequencing Data

Sergio Arredondo-Alonso, Willem van Schaik, Rob J. Willems, Anita C. Schürch
doi: https://doi.org/10.1101/086744
Sergio Arredondo-Alonso
1University Medical Center Utrecht - Department of Medical Microbiology, Utrecht, The Netherlands
Willem van Schaik
1University Medical Center Utrecht - Department of Medical Microbiology, Utrecht, The Netherlands
2University of Birmingham - Institute of Microbiology and Infection, Birmingham, UK
Rob J. Willems
1University Medical Center Utrecht - Department of Medical Microbiology, Utrecht, The Netherlands
Anita C. Schürch
1University Medical Center Utrecht - Department of Medical Microbiology, Utrecht, The Netherlands

Abstract

Plasmids are autonomous extra-chromosomal elements in bacterial cells that can carry genes important for bacterial survival. To benchmark algorithms for automated plasmid sequence reconstruction from short-read sequencing data, we selected 42 publicly available complete bacterial genome sequences that were assembled from a combination of long- and short-read data. The selected bacterial genome sequence projects span 12 genera and contain 148 plasmids. We predicted plasmids from short-read data with four different programs (PlasmidSPAdes, Recycler, cBar and PlasmidFinder) and compared the outcome to the reference sequences.

PlasmidSPAdes reconstructs plasmids based on coverage differences in the assembly graph. It reconstructed most of the reference plasmids (recall = 0.82), but approximately a quarter of the predicted plasmid contigs were false positives (precision = 0.76). PlasmidSPAdes merged 83% of the predictions from genomes with multiple plasmids into a single bin. Recycler searches the assembly graph for sub-graphs corresponding to circular sequences; it correctly predicted small plasmids but failed with long plasmids (recall = 0.12, precision = 0.30). cBar, which applies pentamer frequency composition analysis to detect plasmid-derived contigs, showed an overall recall and precision of 0.78 and 0.64, respectively. However, cBar only categorizes contigs as plasmid-derived and does not assign them to the individual plasmids within a bacterial isolate. PlasmidFinder, which searches for matches in a replicon database, had the highest precision (1.0) but was restricted by the contents of its database and the contig length obtained from de novo assembly (recall = 0.36).
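The recall and precision values reported above can be read through the standard definitions: recall = TP / (TP + FN) and precision = TP / (TP + FP). The following minimal sketch illustrates these definitions at the level of contig identifiers. The counting unit is an assumption made here for illustration (the abstract does not state whether the metrics were computed per contig or per base), and the function name and contig labels are hypothetical.

```python
def precision_recall(predicted, reference):
    """Return (precision, recall) for two collections of contig IDs.

    Assumption: a prediction counts as a true positive when its ID
    appears in the reference plasmid set. The study itself may have
    used a different unit (for example, aligned bases).
    """
    predicted = set(predicted)
    reference = set(reference)
    tp = len(predicted & reference)   # predicted and truly plasmid-derived
    fp = len(predicted - reference)   # predicted but not plasmid-derived
    fn = len(reference - predicted)   # plasmid-derived but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: four reference plasmid contigs, four predictions,
# three of which are correct.
reference_contigs = {"c1", "c2", "c3", "c4"}
predicted_contigs = {"c1", "c2", "c3", "c9"}
p, r = precision_recall(predicted_contigs, reference_contigs)
print(p, r)
```

Under this reading, a tool such as PlasmidFinder with precision 1.0 and recall 0.36 makes no false-positive calls but misses most of the reference plasmid material.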

Surprisingly, PlasmidSPAdes and Recycler detected single isolated components corresponding to putative novel small plasmids (<10 kbp) which were also predicted as plasmids by cBar.
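As a rough illustration of the pentamer frequency composition that cBar relies on, the sketch below computes normalized 5-mer frequencies for a single contig. It only shows the feature extraction step under simplifying assumptions (overlapping k-mers, one strand, k-mers containing ambiguous bases ignored); it does not reproduce cBar's classifier, and the function name is hypothetical.

```python
from collections import Counter
from itertools import product


def pentamer_frequencies(seq, k=5):
    """Return a dict mapping each of the 4**k DNA k-mers to its relative
    frequency among the overlapping k-mers of seq.

    Assumption: only k-mers made of A, C, G and T are counted; windows
    containing other symbols (such as N) are skipped.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


# Toy sequence with six overlapping pentamers.
freqs = pentamer_frequencies("ACGTACGTAC")
print(len(freqs), freqs["ACGTA"])
```

A vector of this kind, computed per contig, is the sort of input a composition-based classifier could use to separate plasmid-derived from chromosome-derived contigs.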

This study shows that it is possible to automatically predict plasmid sequences, but only for small plasmids. The reconstruction of large plasmids (>50 kbp) containing repeated sequences remains challenging and limits the high-throughput analysis of WGS data.

Author Summary

Short-read sequencing of bacterial DNA is often used to understand characteristics such as antibiotic resistance. However, the assembly of short-read sequencing data, with the goal of reconstructing a complete genome, is often fragmented and leaves gaps. Therefore, independently replicating DNA fragments called plasmids cannot easily be identified from an assembly. Recently, a number of programs have been developed to enable the automated prediction of plasmid sequences. Here we tested these programs by comparing their outcomes with complete genome sequences. None of the tested programs was able to fully and unambiguously predict distinct plasmid sequences. All programs performed best when predicting plasmids smaller than 50 kbp. Larger plasmids were only correctly predicted if they were present as a single contig in the assembly. While predictions by PlasmidSPAdes and cBar contained most of the plasmids, these were merged with, or indistinguishable from, other plasmids and sometimes chromosome sequences. PlasmidFinder missed most plasmids, but all its predictions were correct. Without manual steps or long-read sequencing information, plasmid reconstruction from short-read sequencing data remains challenging.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted March 28, 2017.
Subject Area

  • Microbiology