RT Journal Article SR Electronic T1 On the (Im)possibility to Reconstruct Plasmids from Whole Genome Short-Read Sequencing Data JF bioRxiv FD Cold Spring Harbor Laboratory SP 086744 DO 10.1101/086744 A1 Arredondo-Alonso, Sergio A1 van Schaik, Willem A1 Willems, Rob J. A1 Schürch, Anita C. YR 2017 UL http://biorxiv.org/content/early/2017/03/28/086744.abstract AB Plasmids are autonomous extra-chromosomal elements in bacterial cells that can carry genes that are important for bacterial survival. To benchmark algorithms for automated plasmid sequence reconstruction from short read sequencing data, we selected 42 publicly available complete bacterial genome sequences which were assembled by a combination of long- and short-read data. The selected bacterial genome sequence projects span 12 genera, containing 148 plasmids. We predicted plasmids from short-read data with four different programs (PlasmidSPAdes, Recycler, cBar and PlasmidFinder) and compared the outcome to the reference sequences.PlasmidSPAdes reconstructs plasmids based on coverage differences in the assembly graph. It reconstructed most of the reference plasmids (recall = 0.82) but approximately a quarter of the predicted plasmid contigs were false positives (precision = 0.76). PlasmidSPAdes merged 83 % of the predictions from genomes with multiple plasmids in a single bin. Recycler searches the assembly graph for sub-graphs corresponding to circular sequences and correctly predicted small plasmids but failed with long plasmids (recall = 0.12, precision = 0.30). cBar, which applies pentamer frequency composition analysis to detect plasmid-derived contigs, showed an overall recall and precision of 0.78 and 0.64. However, cBar only categorizes contigs as plasmid-derived and does not bin the different plasmids correctly within a bacterial isolate. PlasmidFinder, which searches for matches in a replicon database, had the highest precision (1.0) but was restricted by the contents of its database and the contig length obtained from de novo assembly (recall = 0.36).Surprisingly, PlasmidSPAdes and Recycler detected single isolated components corresponding to putative novel small plasmids (<10 kbp) which were also predicted as plasmids by cBar.This study shows that it is possible to automatically predict plasmid sequences, but only for small plasmids. The reconstruction of large plasmids (>50 kbp) containing repeated sequences remains challenging and limits the high-throughput analysis of WGS data.Author Summary Short read sequencing of the DNA of bacteria is often used to understand characteristics such as antibiotic resistance. However the assembly of short read sequencing data with the goal of reconstructing a complete genome is often fragmented and leaves gaps. Therefore independently replicating DNA fragments called plasmids cannot easily be identified from an assembly. Lately a number of programs have been developed to enable the automated prediction of the sequences of plasmids. Here we tested these programs by comparing their outcomes with complete genome sequences. None of the tested programs were able to fully and unambiguously predict distinct plasmid sequences. All programs performed best with the prediction of plasmids smaller than 50 kbp. Larger plasmids were only correctly predicted if they were present as a single contig in the assembly. While predictions by PlasmidSPAdes and cBar contained most of the plasmids, they were merged with or indistinguishable from other plasmids and sometimes chromosome sequences. PlasmidFinder missed most plasmids but all its predictions were correct. Without manual steps or long-read sequencing information, plasmid reconstruction from short read sequencing data remains challenging.