ABSTRACT
Summary Plasmids can horizontally transmit genetic traits, enabling rapid bacterial adaptation to new environments and hosts. Short-read whole-genome sequencing data are often used in large-scale bacterial comparative genomics projects, but the reconstruction of plasmids from these data faces severe limitations, such as the inability to distinguish the plasmids within a bacterial genome from each other. We developed gplas, a new approach to reliably separate plasmid contigs into discrete components using sequence composition, coverage, assembly graph information and clustering based on a pruned network of plasmid unitigs. Gplas facilitates the analysis of large numbers of bacterial isolates and allows a detailed analysis of plasmid epidemiology based solely on short-read sequence data.
Availability and implementation Gplas is written in R and Bash and uses Snakemake as its workflow management system. Gplas is available under the GNU General Public License v3.0 at https://gitlab.com/sirarredondo/gplas.git
Contact a.c.schurch{at}umcutrecht.nl
1 INTRODUCTION
A single bacterial cell can harbor several distinct plasmids; however, current tools for plasmid prediction from short-read whole-genome sequencing (WGS) often have a binary outcome (plasmid or chromosome). To bin predicted plasmids into discrete entities, we built a new method based on the following concepts: i) contigs of the same plasmid have a uniform sequence coverage 1, 10; ii) plasmid paths in the assembly graph can be searched for using a greedy approach 8; and iii) removal of repeat units from the plasmid graphs disconnects the graph into independent components 12.
Here, we refine these ideas and introduce the concept of unitig co-occurrence to create a pruned plasmidome network. Using an unsupervised approach, the network is queried to find highly connected nodes corresponding to sequences that belong to the same discrete plasmid unit, i.e. a single plasmid. We show that our approach outperforms other de-novo and reference-based tools and fully automates the reconstruction of plasmids.
2 MATERIALS AND METHODS
2.1 Gplas algorithm
Given a short-read assembly graph (GFA format), segments (nodes) and edges (links) are extracted from the graph. Gplas uses mlplasmids (version 1.0.0, prediction threshold = 0.5) or plasflow (version 1.1, prediction threshold = 0.7) to classify segments as plasmid- or chromosome-derived and selects segments with an in- and out-degree of 1 (unitigs) 2, 7. The k-mer coverage standard deviation (k-mer sd) of the chromosome-derived unitigs is computed to quantify the fluctuation in the coverage of segments belonging to the same replicon unit. Plasmid-derived unitigs are then used to search for plasmid walks with a similar coverage and composition using a greedy approach (Supplementary Methods). Gplas creates a plasmidome network (undirected graph) in which nodes correspond to plasmid unitigs and edges are drawn based on the co-existence of the nodes in the solution space, using the R packages igraph 4 and ggraph (https://github.com/thomasp85/ggraph.git). The Markov clustering algorithm is used to query the plasmidome network and retrieve clusters corresponding to discrete plasmid units in an unsupervised fashion 11. The output consists of plasmid contigs binned into distinct components, representing the different plasmids present in the bacterial isolate. A complete description of the algorithm is available in Supplementary Methods.
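The network construction and clustering steps can be illustrated with a minimal Python sketch. This is not the gplas implementation, which is written in R and uses igraph. The function names, the `min_count` pruning threshold, the inflation value and the number of iterations are all assumptions made for illustration. The sketch counts how often two plasmid unitigs co-occur in the same walk, keeps pairs seen at least `min_count` times as weighted edges, and runs a basic Markov clustering (alternating expansion and inflation) on the resulting undirected graph.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(walks, min_count=1):
    """Count how often two unitigs appear together in the same walk.

    `walks` is a list of walks, each a list of unitig names.
    Pairs seen fewer than `min_count` times are pruned (hypothetical threshold).
    """
    counts = Counter()
    for walk in walks:
        for a, b in combinations(sorted(set(walk)), 2):
            counts[(a, b)] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}


def markov_cluster(weights, inflation=2.0, iterations=50):
    """Minimal Markov clustering on an undirected weighted graph.

    `weights` maps (node_a, node_b) pairs to edge weights.
    Unitigs that appear in no retained pair are not part of the network.
    """
    nodes = sorted({n for pair in weights for n in pair})
    index = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)

    # Adjacency matrix with self-loops, as is customary for Markov clustering.
    m = [[0.0] * size for _ in range(size)]
    for i in range(size):
        m[i][i] = 1.0
    for (a, b), w in weights.items():
        m[index[a]][index[b]] = m[index[b]][index[a]] = float(w)

    def normalize(mat):
        # Rescale every column so that it sums to 1 (column-stochastic matrix).
        for j in range(size):
            total = sum(mat[i][j] for i in range(size))
            for i in range(size):
                mat[i][j] /= total
        return mat

    m = normalize(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(size)) for j in range(size)]
             for i in range(size)]
        # Inflation: raise entries to a power and renormalize the columns.
        m = normalize([[v ** inflation for v in row] for row in m])

    # Each row with non-zero entries lists the members of one cluster.
    clusters = set()
    for i in range(size):
        members = frozenset(nodes[j] for j in range(size) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)
```

For the toy walks `[["u1", "u2", "u3"], ["u2", "u3", "u1"], ["u4", "u5"], ["u5", "u4"]]`, the sketch returns two clusters, `["u1", "u2", "u3"]` and `["u4", "u5"]`, one per putative plasmid. On real graphs the clusters may overlap, and the result depends on the inflation value.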
2.2 Benchmarking dataset
Gplas was benchmarked against existing tools for binning plasmid contigs from short-read WGS: i) plasmidSPAdes (de-novo based approach, version 3.12) 1, ii) mob-recon (reference-based approach, version 1.4.9.1) 9 and iii) hyasp (hybrid approach, version 1.0.0) 8. To evaluate the binning tools, we selected a set of 28 genomes with both short- and long-read WGS available, comprising 106 plasmids from 10 different bacterial species that were not present in the databases or training sets of the tools (Supplementary Methods, Supplementary Table S1) 3, 5, 6, 13.
For each component reported by gplas, let n be the total number of nodes present in the component. We then define C as:
$$C = \frac{n(n-1)}{2}$$
C corresponds to the total number of pairwise connections between nodes of a particular component. We consider as true positive connections (TPC) the pairwise connections linking nodes that belong to the same replicon sequence, in contrast to false positive connections (FPC), which link nodes from different replicon sequences. Let npc be the total number of nodes from the most predominant replicon sequence present in the component and nrep the total number of nodes forming that replicon sequence. We then define two metrics commonly used in metagenomics for binning evaluation: precision and completeness (Supplementary Methods).
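The quantities defined here can be computed per component as in the following Python sketch. The exact formulas for precision and completeness are given in the Supplementary Methods and are not reproduced in this text, so the sketch makes two assumptions: that precision is TPC divided by C, and that completeness is npc divided by nrep. The function name and the input layout are also hypothetical.

```python
from collections import Counter


def component_metrics(node_replicons, replicon_sizes):
    """Evaluate one predicted component.

    node_replicons: dict mapping each node of the component to its true replicon.
    replicon_sizes: dict mapping each replicon to the number of nodes forming it (nrep).
    """
    n = len(node_replicons)
    # C: number of unordered node pairs in the component.
    total_connections = n * (n - 1) // 2

    per_replicon = Counter(node_replicons.values())
    # TPC: pairs whose two nodes come from the same replicon.
    tpc = sum(k * (k - 1) // 2 for k in per_replicon.values())
    # FPC: every remaining pair links nodes from different replicons.
    fpc = total_connections - tpc

    # npc: node count of the most predominant replicon in the component.
    predominant, npc = per_replicon.most_common(1)[0]
    nrep = replicon_sizes[predominant]

    # Assumed definitions (not confirmed by the main text):
    # precision = TPC / C, undefined for single-node components;
    # completeness = npc / nrep.
    precision = tpc / total_connections if total_connections else None
    completeness = npc / nrep
    return {"C": total_connections, "TPC": tpc, "FPC": fpc,
            "precision": precision, "completeness": completeness}
```

For example, a component holding three nodes of a four-node plasmid plus one node from another replicon has C = 6, TPC = 3 and FPC = 3, giving a precision of 0.5 and a completeness of 0.75 under these assumed definitions.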


3 RESULTS
Gplas in combination with mlplasmids obtained the highest average precision (0.85), indicating that the predicted components were mostly formed by nodes belonging to the same discrete plasmid unit (Table 1 and Figure S1). The reported average completeness value (0.73) showed that most of the nodes from a single plasmid were recovered as a discrete plasmid component by gplas (Table 1 and Figure S2). We observed a decline in the performance of gplas in combination with mlplasmids (precision = 0.71, completeness = 0.68) when considering only complex components (> 1 connection), which indicated merging problems for large plasmids with a similar k-mer coverage (Figure S3, Supplementary Results). However, in all cases gplas in combination with mlplasmids performed better than the other de-novo and reference-based tools tested here (Table 1). To show the potential of gplas in combination with mlplasmids, we showcase the performance of our approach in two distinct bacterial isolates (Supplementary Results).
Table 1. Gplas benchmarking. *Components with > 1 connection.
Mlplasmids only contains a limited range of species models (Supplementary Methods). For other bacterial species, we observed that plasflow probabilities in combination with gplas performed better than the other de-novo approaches but also introduced bias when wrongly predicting chromosome contigs as plasmid nodes (Table 1 and Figure S1), thereby creating chromosome and plasmid chimeras (precision = 0.63).
4 DISCUSSION
We present a new tool called gplas, which bins binary-classified plasmid contigs into discrete plasmid units and enables their detailed analysis by relying on the structure of the assembly graph, k-mer information and clustering of a pruned plasmidome network. A limitation of the presented approach is the generation of chimeras from plasmids that have similar k-mer profiles and sequence coverage and share a repeat unit, such as a transposase or an IS element. These cases cannot be resolved unambiguously. Here, we integrated and extended features used to predict plasmid sequences and exploited the information present in short-read graphs to automate the reconstruction of plasmids.
FUNDING
SA and RJLW were supported by the Joint Programming Initiative in Antimicrobial Resistance (JPIAMR Third call, STARCS, JPIAMR2016-AC16/00039). JC was funded by the European Research Council (grant no. 742158).
ACKNOWLEDGEMENTS
We would like to thank Dr. Bryan Wee for testing and contributing to the development of gplas.