Abstract
The spread of antimicrobial resistance among bacteria by horizontal plasmid transmissions poses a major challenge for clinical microbiology. Here, we evaluate a new real-time plasmid transmission detection pipeline implemented in the SeqSphere+ (Ridom GmbH, Münster, Germany) software.
Within the pipeline, a local Mash plasmid database is created and Mash searches with a distance threshold of 0.001 are used to trigger plasmid transmission early warning alerts (EWA). Clonal transmissions are detected using cgMLST allelic differences. The integrated tools MOB-suite, NCBI AMRFinderPlus, CGE MobileElementFinder, pyGenomeViz, and MUMmer are used to characterize plasmids and for visual pairwise plasmid comparisons, respectively. We evaluated the pipeline using published hybrid assemblies (Oxford Nanopore Technology/Illumina) of a surveillance and outbreak dataset with plasmid transmissions. To emulate prospective usage, samples were imported in chronological order of sampling date. Different combinations of the user-adjustable parameters sketch size (1,000 vs 10,000) and plasmid size correction were tested and discrepancies between resulting clusters were analyzed with Quast.
When using a sketch size of 1,000 with size correction turned on, the SeqSphere+ pipeline agreed with the published data and produced the same clonal and carbapenemase-carrying plasmid clusters. EWAs were in the correct chronological order.
In summary, the developed pipeline presented here is suitable for integration into clinical microbiology settings with limited bioinformatics knowledge due to its automated analyses and alert system, which are combined with the GUI-based SeqSphere+ platform. Thus, with its integrated sample database, (near) real-time plasmid transmission detection is within reach in bacterial routine-diagnostic settings when long-read sequencing is employed.
Importance Plasmid-mediated spread of antimicrobial resistance (AMR) is a major challenge for clinical microbiology and monitoring of potential plasmid transmissions is essential to combat further dissemination. Whole-genome sequencing (WGS) is often used to surveil nosocomial transmissions but usually limited to the detection of clonal transmissions (based on chromosomal markers). Recent advances in long-read sequencing technologies enable full reconstruction of plasmids and the detection of very similar plasmids but so far easy-to-use bioinformatic tools for this purpose were missing. Here we present an evaluation of an innovative real-time plasmid transmission detection pipeline. It is integrated into the GUI-based SeqSphere+ software, which already offers cgMLST based pathogen outbreak detection. It requires very limited bioinformatics knowledge, and its database, automated analyses, and alert system make it well suited for prospective clinical application.
Introduction
Antimicrobial resistance (AMR) is a major challenge in clinical microbiology as it is limiting the therapeutic options to manage infectious diseases (1). This is largely mediated by the rapid spread of resistance genes among bacterial populations via horizontal gene transfer (HGT) facilitated by mobile genetic elements (MGE) such as plasmids (2). Plasmids are extrachromosomal DNA elements that can serve as a vehicle to exchange genes between different bacterial strains and even species (3).
Recently, multi-species hospital outbreaks of carbapenem-resistant bacteria have been reported (4, 5). Whole-genome sequencing (WGS) based surveillance has become an important tool to monitor nosocomial transmissions in many hospitals (6), but these activities are usually limited to the detection of clonal transmissions, i.e., based on chromosomal markers, of single species. Thus, tools are needed that enable the detection of similar plasmids from WGS data. A variety of software tools has been developed for plasmid reconstruction and characterization (reviewed in (7)). Among the most popular are currently CGE PlasmidFinder (8), COPLA (9), PLACNET (10), and MOB-suite (11), which relies on Mash (12) for comparing k-mers of plasmids. All these tools require some form of bioinformatic expertise, which is often lacking in routine microbiology laboratories. Another major limitation is that plasmids are generally difficult to fully reconstruct from short-reads due to their high number of repetitive regions (7). However, with the recent advances in long-read sequencing, e.g., Oxford Nanopore Technology (ONT), accurate long-read data will become more available to clinical microbiology laboratories (13).
The GUI-based Ridom SeqSphere+ software (14) is widely used in clinical microbiology laboratories. One of its main purposes is the detection of clonal transmissions using core-genome multi-locus sequence typing (cgMLST) based similarity of bacterial isolates. Here, we evaluated the newly implemented real-time plasmid detection pipeline to enable users to detect potential plasmid transmissions within their own WGS datasets complementary to clonal transmissions. We used two already published datasets of hybrid assemblies of carbapenem-resistant bacteria from a United Kingdom (UK) (15) and a United States (US) hospital (16), respectively.
Materials and Methods
SeqSphere+ real-time plasmid transmission detection pipeline
We developed the pipeline as an extra module for the SeqSphere+ (Ridom GmbH, Münster, Germany) software. The first steps involve the reconstruction and characterization of plasmids from assembled data. These steps are based on the MOB-suite (v3.18) software package (11) and work for both short- and long-read data. Short-read and non-circular long read data go into the MOB-recon module first to identify and reconstruct plasmids. The reconstructed plasmids are then forwarded to MOB-typer, which assigns a primary and secondary cluster ID (Mash distance cluster threshold = 0.006 and 0.025, respectively), rep and relaxase type(s), and mobility prediction to each plasmid. For long-read data, the FASTA headers are checked for circularity terms ([topology = circular]) and only non-circular contigs a forwarded to MOB-recon. Circular contigs that are above 500kb in size are considered to be chromosomes, smaller contigs are considered to be plasmids and directly go into MOB-typer. Optionally, the MOB-recon module can also be skipped using a [non-recon] term in the FASTA header. AMR targets are determined using the NCBI AMRFinderPlus (v3.11.26) (17). Integrated mobile genetic elements (iMGEs) are detected with CGE MobileElementFinder (v1.1.2) (18). The typing results from MOB-typer, AMRFinderPlus, and MobileElementFinder are then summarized in a tabular “Chromosome & Plasmid” overview. From here, plasmid contigs can be exported as FASTA files for downstream analyses.
AMRFinderPlus results are grouped into two columns, one of them only showing priority AMR genes, which are defined as targets that might confer resistance to carbapenems, colistin, vancomycin, or methicillin, and genes that contain ESBL or AmpC in their name. The detected iMGE(s) are also shown in a separate column if they enclose a priority AMR gene.
Long-read data plasmid transmission analysis module
With the recently introduced version 10 of the SeqSphere+ software with the long-read data plasmid transmission analysis module, potential plasmid transmissions can be detected using Mash (v2.1) comparisons (12). The identified plasmids can be used to build a local Mash plasmid database from the users’ own data. Pairwise Mash distances can then be utilized to detect very similar plasmids. This information can be used in several ways. First, if new plasmids are imported via the pipeline mode, they are compared with all existing plasmids in the database and if there are matches below a certain threshold (default = 0.001), an early warning alert (EWA) is triggered and shown on the SeqSphere+ start screen. No EWA is generated if the respective samples contain a cgMLST scheme and have a clonal (i.e., cgMLST allelic distance below the species-specific threshold) relationship. Second, the complete plasmid database can be used to perform an all-against-all comparison to create an exportable distance matrix and/or a single-linkage clustering. Third, the database can be searched for similar plasmids from a single plasmid of interest. This plasmid search can be triggered from both the “Chromosome & Plasmid” overview and EWA reports. Within the resulting table, sample metadata and plasmid typing results for each plasmid are included. Finally, partially annotated plasmids (origin(s) of replication, priority AMR gene(s), other AMR gene(s), and iMGE(s)) can also be pairwise aligned and visualized using MUMmer v3.23 (19) and pyGenomeViz (v0.4.4; https://moshi4.github.io/pyGenomeViz/). To create a better visualization, there is an optional fixstart function to change the start and orientation of contigs. It uses dnaA for chromosomes and for plasmids the first occurrence of a rep gene as the start of the reoriented contig. Figure 1 shows an overview of the complete pipeline.
SeqSphere+ allows for configuration of multiple parameters including sketch size, sequence length compensation, and Mash distance threshold for matches (default:0.001). For the sketch size the user has the choice between 1,000 and 10,000. The k-mer size of 21 is fixed. Due to the large variability in plasmid length and as Mash distances favor contigs of similar size we have implemented a scaling factor for comparing plasmids with difference in unique content where the match threshold is lowered for each 1% of size difference. The default is a lowering of 0.0003 per 1% size difference up to a limit of 40% size difference but both the lowering value and the upper limit can be adjusted by the user.
Evaluation of the pipeline using two published datasets
To test the pipeline functionality and evaluate parameter settings, the pipeline was applied to two different published datasets. The first was a set of 85 isolates of phenotypically carbapenem-resistant Gram-negative bacteria from a single hospital in the UK (Addenbrooke’s Hospital, Cambridge) that were collected over a period of six years (Addenbrookes dataset) (15). We excluded four samples without a detected carbapenemase gene leading to a total number of 81 samples in our test dataset. These isolates were originally sequenced using a combination of Illumina HiSeq and Oxford Nanopore Technologies (ONT) MinION R9.4.1 technologies. Original hybrid assemblies were created using Flye (v2.8) (20), Unicycler (v0.4.8) (21) and Canu (v2.1.1) (22). These hybrid assemblies were downloaded in FASTA format from the European Nucleotide Archive (ENA), where they had been previously deposited under the study accession PRJEB30134. A Mash search against the NCBI RefSeq genomes was performed to confirm species assignment and rule out potential contamination. From the information in the original publication, we did assume that each contig, except the chromosome, was a single plasmid. However, since we did not have circularity information, we used a [non-recon] term in the FASTA headers to skip the MOB-recon part as additional plasmid reconstruction was not done in the original publication. The study was done retrospectively but to test the EWAs, we emulated a prospective analysis by importing the files in the chronological order of sampling date. Sampling date and patient information were taken from the supplemental material provided with the original publication.
Plasmid clusters were defined as plasmids with a Mash distance ≤ 0.001. In addition to clusters, the number of potential transmissions was counted. If a sample carried multiple plasmids that belonged to different plasmid clusters, it was assessed whether all of these plasmids were transferred to the same or different receiving samples. If all plasmids were shared with the same sample, this was counted as one co-transmission event. If plasmids of one sample were connected to different samples, each connection was counted as a single plasmid transmission event. If two samples had a clonal relation, the respective plasmids were not counted as a plasmid but as a clonal transmission event.
To assess clonal relations, we used cgMLST allelic distances. For species with public cgMLST schemes hosted on the Ridom nomenclature server (www.cgMLST.org), we used the species-specific default threshold. For species without a public scheme, we created ad hoc schemes, using the NCBI Genomes reference sequence of each species as seed genome and a threshold of 15 alleles to determine clonal clusters. To check for additional transmissions of only iMGEs, a user-defined task template was created from an allele database that contained all iMGE with a carbapenemase gene that were found using the CGE MobileElementFinder with thresholds of ≥90% identity and ≥95% alignment length.
For parameter evaluation, we tested four combinations: a sketch size of 1,000 and 10,000, each with and without size compensation applied to the Mash distance. A cluster threshold of ≤ 0.001 was used for all four approaches. Plasmid clusters with discrepancies between these different runs were analyzed with Quast (v5.2) (23). Here, the longest plasmid of each cluster was set as reference. Mismatches, InDels, and larger gaps were analyzed and counted. . Finally, the pipeline was applied to the non carbapenemase-carrying plasmids of this dataset as well.
To prove that the pipeline with the determined Mash parameters also works in another setting with different strains, we evaluated a second dataset. This one contained 19 isolates of a multispecies (n = 7) outbreak of blaNDM-5-producing Enterobacterales in a US hospital (UPMC Presbyterian dataset) (16). The isolates were collected from February 2021 to February 2023 from 15 patients. All isolates were sequenced on an Illumina NextSeq 550 and ONT MinION with R9.4.1 flow cells. Hybrid assemblies were created using Unicycer v0.5.0 and deposited by the original authors for download at NCBI under BioProject PRJNA981541. In the original paper, outbreak plasmids were defined as IncX3 plasmids harboring blaNDM-5, “that had sequence coverage values (i.e., the proportion of each plasmid’s sequence that was found in the Illumina genome) of at least 95%, and a sequence identity of <15 single nucleotide polymorphisms (SNPs) per 100 kb of plasmid-sequence compared with the fully resolved, circular plasmid sequence” (16). For our evaluation, the same analysis as described above was applied to the carbapenemase-carrying plasmids of this dataset.
Results
Analysis of clonal clusters in the Addenbrookes dataset
Clonal clusters were determined based on cgMLST allelic distances with species-specific thresholds (15 alleles difference for species without public cgMLST schemes). In total, seven clonal clusters were detected, three for Klebsiella pneumoniae, two for Escherichia coli, one for Pseudomonas putida, and one for Enterobacter hormaechei (Table 1). Except one K. pneumoniae cluster comprising eleven isolates (kleb_2), all clusters contained two isolates. The allelic distances were between 0 and 11. In addition, we found two E. roggenkampii isolates with a distance of one allele but since these isolates were from the same patient, they were denoted as intra-host variability.
Analysis of carbapenemase-carrying plasmids in the Addenbrookes dataset
Initially, we only analyzed the plasmids carrying a carbapenemase gene. Out of the 81 isolates, eleven isolates contained chromosomally encoded carbapenemase genes. The other 70 isolates had carbapenemase-carrying plasmids. One isolate comprised two carbapenemase-carrying plasmids leading to a total number of 71 analyzed plasmids.
By re-sequencing strains from documented and published plasmid transmission events with ONT and Pacific Biosciences sequencing methodology (4, 24), we determined a Mash distance of 0.001 as suitable for recovering nearly identical plasmids (data not shown). By varying the unique content of those plasmids, we came up with a scaling factor of 0.0003 per 1% plasmid size difference. Next, using a sketch size of 10,000, a Mash distance threshold of 0.001, and a 0.0003 size compensation, we found nine plasmid clusters (Table 1). One additional plasmid cluster consisted of only clonal isolates (kleb_2). Three plasmid clusters (Clust_148, Clust_9, Clust_72) partly contained clonal isolates. The clusters comprised between two and eight isolates from one to three different species. The relatedness of the 71 carbapenemase-carrying plasmids is visualized with a population snapshot using the exported clustering distance matrix (Figure 2).
To evaluate the Mash parameters, we repeated this analysis with a lower sketch size of 1,000 and tested both sketch sizes with and without size compensation and compared the resulting clusters. Out of the nine plasmid clusters, four showed discrepancies between the different methods (Table 2). One plasmid pair (Clust_54) was missed with a sketch size of 1,000 but only when size compensation was turned off. In Clust_9, one additional sample was included in the cluster for both sketch sizes when size compensation was applied. Clust_148 also contained an additional plasmid with size compensation, but only with a sketch size of 10,000. Finally, Clust_98/Clust_217 was lumped together and contained nine samples with size compensation but without size compensation it was divided in two clusters with six and three samples, respectively. We counted the number of differences between the disputable plasmid and the reference plasmid of each cluster.
For Clust_54 there were three, for Clust_148 five, and for Clust_9 13 differences, respectively. For Clust_98/Clust_217 we compared all samples of Clust_217 with one reference of Clust_98, which resulted in two differences.
In one cluster (Clust_72), we detected a case, where one plasmid (size = 12,282 bp) contained a duplicated sequence, which was otherwise completely identical with another plasmid (size = 6,141 bp) that only contained the single sequence, i.e. their uncorrected Mash distance was 0. We checked for additional transmissions of iMGEs by retrieving the sequences of all iMGEs containing a carbapenemase gene and searching for these sequences in all isolates utilizing a user-defined task template. No additional transmissions of iMGEs were found using this approach.
Analysis of all plasmids in the Addenbrookes dataset
After analysis of the carbapenemase-carrying plasmids, in contrast to the original publication, we also had a look at the overall plasmid landscape of these samples. In total, 324 plasmids were found in 78 of the 81 samples (three samples did not contain plasmids). Each sample carried 1-10 plasmids with an average of four plasmids per sample. Of these, 175 plasmids from 58 samples belonged to a cluster (Mash distance 0.001). On average, three plasmids per isolate belonged to a cluster with more than one plasmid. A total of 51 clusters (including clonal and carbapenemase-carrying plasmid clusters) was detected with 2-11 isolates per cluster (median = 2, mean = 3.4).
In addition to clonal and plasmid clusters, we counted the number of potential transmission events. In total, 44 single transmission events were counted. Of these, twelve were clonal transmissions, 18 were single plasmid transmissions, and 14 were co-transmissions of two or more plasmids to the same sample. Three of the single and five of the co-plasmid-transmissions were intra-host transmissions. Fifteen isolates transferred more than one plasmid; for nine of them, all plasmids were transferred to the same and for six of them to different receiving samples. Furthermore, we detected two cases where an isolate received plasmids from a clonal and an additional plasmid transmission. One of these involved two different species, K. pneumoniae and E. coli (Figure 3). Here, the two K. pneumoniae samples had identical MLST sequence types (STs) and cgMLST allelic profiles. However, the sample that was isolated later carried additional plasmids. Two of these plasmids (size = 1.8 kb & 5.2 kb) clustered with plasmids of an E. coli isolate that also shared a 145 kb blaNDM1-carrying plasmid with both K. pneumoniae samples.
Analysis of carbapenemase-carrying plasmids in the UPMC Presbyterian dataset
As a further evaluation, we analyzed the carbapenemase-carrying plasmids of a second dataset, comprising 19 isolates of a multispecies blaNDM-5 outbreak (16). According to the original publication, all isolates belonged to a single outbreak, where blaNDM-5 was encoded on a 46.2 kb IncX3 family plasmid. The plasmids differed by only 0-2 SNPs with the exception of one shorter (45.2 kb) and one larger (55.8 kb) plasmid that lost or acquired a transposase gene, respectively.
Using our pipeline with a sketch size of 1,000 and size correction turned on, all 19 plasmids were put into a single plasmid cluster. However, when size correction was turned off, the larger (17% size difference) plasmid did not cluster with the remaining plasmids (uncorrected Mash distance = 0.003). Both, blaNDM-5 and IncX3 were detected on all plasmids by AMRFinderPlus and MOB-suite, respectively. In addition, among other information, relaxase type MOBP was assigned by MOB-typer and the plasmids were predicted to be conjugative.
Discussion
In this study, we used two published datasets of plasmids from carbapenem-resistant bacteria to evaluate a new plasmid detection pipeline implemented in the SeqSphere+ software. We determined a Mash distance of 0.001 and a scaling factor of 0.0003 per 1% plasmid size difference as suitable for recovering nearly identical plasmids. Thresholds like these are always somewhat arbitrary and need to be a compromise between sensitivity and specificity. However, the developers of the MOB-suite came up with a distance threshold of 0.025 for short-read sequencing data (25). With long-read data this value can be lower. Independent from our own considerations, Roberts et al. came up with the same threshold of 0.001 (15). For both datasets, our results are in complete agreement with the findings from the original studies, in particular, the same clonal and plasmid clusters were found. Moreover, the same resistance genes and plasmid rep-types were detected, despite using different tools and databases. In the Roberts et al. study, Abricate (https://github.com/tseemann/abricate) was used with the CARD (26) and PlasmidFinder (8) databases, while our pipeline is based on NCBI AMRFinderPlus (17) and MOB-suite (11). In both the Roberts et al. and our study, Mash with a sketch size of 10,000 and a distance threshold of 0.001 was used to detect plasmid results, leading not surprisingly to the same cluster results. However, in contrast to a file-based Mash approach, our automated, database-based pipeline enables a prospective real-time surveillance of plasmid transmissions.
Raabe et al. defined outbreak plasmids based on Illumina sequence coverage values of at least 95%, and a sequence identity of <15 SNPs per 100 kb of plasmid sequence when compared against the complete reference plasmid sequence. They also used the CGE PlasmidFinder for rep-typing. Different approaches were also used for clonal cluster detection. While Roberts et al. used a combination of SNPs and split k-mer analysis (SKA), the SeqSphere+ pipeline is based on cgMLST allelic distances.
We compared the clustering results of Mash databases constructed with a sketch size of 1,000 and 10,000. Both are values used for plasmids in the literature (15, 25). While the higher sketch size increases the sensitivity, it also requires more computational memory and power and enlarges the database size. We found that a sketch size of 1,000 was generally enough to detect the same clusters at a threshold of 0.001, with one exception. The sensitivity can be increased without a significant increase in computational power requirements by applying a size compensation. The size compensation allows to detect potential transmissions of plasmids that share the same backbone but may have lost or gained larger unique segments. This was shown with the UPMC Presbyterian dataset, where one insertion of a transposase gene increased the size of the otherwise identical plasmid by 17 %. The Mash distance between this larger plasmid and the other outbreak plasmids was above the clustering threshold without size correction. However, size compensation must have an upper limit to prevent too many false positives just resulting from shared promiscuous iMGEs. An interesting finding was a plasmid pair, where one plasmid consisted of two exact copies of another plasmid. This is likely a result of multimer formation. To account for such phenomena, we decided to limit the size compensation to a maximum of 40% plasmid size differences but keep the uncorrected Mash distance value for plasmids with a larger size difference instead of ignoring plasmid pairs above 40% size difference. Thus, plasmids with large size differences but very similar content, as for example multimers, can still be detected.
When analyzing the whole plasmid landscape of the Addenbrookes dataset, we found that the reconstruction of potential transmission scenarios can be complicated due to complex co-transmission patterns. Based on genetic clustering, the number of plasmid transmission events within the analyzed dataset is estimated to be twice as common as clonal transmission of pathogens but detailed epidemiological information is needed to confirm these results. In particular, it is still largely unknown how often very similar plasmids occur without a transmission, i.e. as “core plasmids” of certain lineages or species. As other plasmid comparison tools, the pipeline only works with well reconstructed, ideally circular plasmids and thus requires long-read sequences. Moreover, Mash as a k-mer-based approach does not take synteny into account. Although same content is usually also same synteny in a local context, it is possible to check for rearrangements between plasmid sequences by utilizing pyGenomeViz, which is implemented in SeqSphere+, as a visual approach. For more diverse data, alternatively, a new tool such as Pling (27), which is alignment-based and thereby more compute-intensive, could be used. Currently the literature is still lacking analyses of plasmid transmissions based on long-read data. However, with long-read sequencing becoming more and more achievable for clinical laboratories, the need for suitable analysis tools is growing. We presented here a retrospective re-analysis of published datasets. However, in a prospective surveillance scenario, every single transmission would be captured by the EWA system as soon as the isolates are imported, allowing for a near real-time plasmid transmission detection.
In conclusion, the new SeqSphere+ plasmid transmission detection pipeline allows the detection of plasmid clusters in both retrospective (single linkage clustering of plasmid databases) and prospective (Early Warning Alerts) studies. The automated, database-based approach enables near real-time surveillance and allows for rapid informed actions.
Disclaimers
NS, JR, and TW are (part-time) employees of Ridom GmbH. JR, TW, and DH are shareholders of Ridom GmbH. AM declares no conflict of interest.
Acknowledgments
We thank Leah Roberts (CIIC, Queensland University of Technology) for providing help with sequence and epidemiological data. Further, we thank James Robertson and Kyrylo Bessonov (both National Microbiology Laboratory, Public Health Agency of Canada) for critical reading and improvement of the manuscript.
Footnotes
A second dataset (UPMC presbytarian dataset) was included to evaluate the pipeline. Moreover, further details on the Mash parameters were added. Several changes of the wording were made to enhance clarity.