Abstract
Antimicrobial resistance poses a great danger to humanity, in part due to the widespread horizontal transfer of plasmids via conjugation. Modeling of plasmid transfer is essential to uncovering the fundamentals of resistance transfer and for development of predictive measures to limit the spread of resistance. However, a major limitation in the current understanding of plasmids is the inadequate characterization of the DNA transfer mechanisms, which conceals the actual potential for plasmid transfer in nature. Here, we consider that the plasmid-borne origin-of-transfer substrates encode specific DNA structural properties that can facilitate finding these regions in large datasets, and develop a DNA structure-based alignment procedure for typing the transfer substrates that outperforms mere sequence-based approaches. We identify thousands of yet undiscovered DNA transfer substrates, showing that actual plasmid mobility can in fact be 2-fold higher and span almost 2-fold more host species than is currently known. Over half of all mobile plasmids contain the means to transfer between different mobility groups, which links previously confined host ranges across ecological habitats into a robust plasmid transfer network. We show that this network in fact serves to transfer antimicrobial resistance from the environmental genetic reservoirs to human pathogens, which might be an important driver of the observed rapid resistance development in humans and thus an important point of focus for future prevention measures.
1. Introduction
Horizontal gene transfer of antimicrobial resistance (AMR) genes occurs via the processes of transformation and conjugation. The former mediates especially narrow range, intra-genus transfers 1,2, whereas the latter is implicated in a wider range of transfer hosts 3,4 and potentially enables AMR to overcome the toughest phylogenetic and ecological transmission barriers 5–10. Consequently, the interaction between conjugative relaxase enzymes and their DNA origin-of-transfer (oriT) substrates facilitates the majority of all AMR transfers in nature 11,12 and is especially important for ones related to human infection complications 13. However, the current knowledge on conjugative transfer mechanisms and systems 4,14–16 is unable to describe the unprecedented amount of observed horizontal transfer 7,8,12,17 that seems to transcend all transfer barriers between resistance reservoirs and human hosts 5–7,9,10,12,13,18.
The standard approach for characterization of plasmid mobility involves classification of conjugation and mobilization genes 14, especially typing of relaxase enzymes into the respective mobility (Mob) groups 15,19. However, besides the possibility of yet unidentified enzymes and mobility groups 15,20–24, multiple new processes have recently been uncovered that might confer additional mobility to plasmids and involve the origin-of-transfer (oriT) DNA substrate. These include (i) broadened relaxase binding specificities to multiple different oriT sequence variants 25–29, which, according to the evolutionary theory of such DNA regions 4,30,31, indicates the possibility of plasmids carrying multiple functional secondary oriTs, and (ii) trans-mobilization of plasmids carrying oriTs triggered by relaxases from co-resident plasmids acting in trans on the non-cognate oriTs 32–35. The latter mechanism demonstrates that oriT regions are the only elements of the conjugation machinery required in cis 21 and suggests that many plasmids classified as non-mobile due to the absence of putative relaxases may in fact be mobilizable 33. However, although typing the oriT enzymatic substrates instead of the genetic scaffolds might present improvements to the current understanding of plasmid mobility, no systematic studies of oriTs across sequenced plasmids have yet been performed, likely due to the lack of available data and tools that would enable such oriT typing.
A major problem with uncovering oriT regions is that, apart from being experimentally laborious, it is computationally challenging due to multiple molecular mechanisms and a variety of DNA sequence elements present and coevolving in the DNA substrate 4, even among plasmids belonging to a single species such as Staphylococcus aureus 32. OriT typing requires algorithms beyond simple sequence-based alignment 36,37, which can recognise and process more complex molecular motifs such as inverted repeats and hairpins 38–40 as well as underlying DNA physicochemical and conformational features. These underpin key protein-DNA readout and activity mechanisms 4,41–43 as well as define conserved niches of structural variants that enable good resolution between Mob groups and subgroups 4. The use of DNA structural representations has indeed led to improvements in algorithms for identification of other regulatory regions, such as promoters and replication origins 44–48. Despite this, instead of using tools that probe the actual relaxase-oriT interaction potential by identifying molecular properties that are the basis of such interactions, conventional approaches for oriT analysis still rely on mere nucleotide sequence-based methods 32,36,37.
Hence, here we prototype a DNA structure-based alignment algorithm for finding oriT variants, enabling us to locate and Mob-type oriT regions across thousands of sequenced plasmids. Based on the newly uncovered oriT variants, since they can facilitate both in cis and in trans plasmid transfer, we re-analyse the amount of potential mobile plasmids and mobile plasmid-carrying host species. We then evaluate if and how the uncovered fraction of oriTs helps to overcome the known barriers to horizontal gene transfer, by reconstructing and analysing the network of potential AMR transfers between different species and habitats, especially those from the environmental reservoir to the human flora.
2. Results
Structural alignment algorithm improves oriT-typing performance
Since a DNA alignment algorithm performs multiple comparisons between a query and a target sequence by evaluating a distance function, we first developed a structural distance function, termed s-distance (Fig S1-1, Methods M2). This enabled comparison between structurally encoded oriTs and non-encoded ones, where the ungapped p-distance was used. For the comparison we used a balanced dataset of 64 oriT regions from 4 Mob groups 4 (Methods M1), where region sizes were varied stepwise to cover a single relaxase enzymatic site of 40 bp up to the whole oriT region of 220 bp containing multiple binding sites (Fig 1A). Furthermore, the comparison included discrimination of both (i) positive oriTs (aligned to nic site) from negative non-oriT sequences and (ii) Mob groups. With structurally encoded DNA, significant (Anova p-value < 1e-4) discrimination of both positive/negative and Mob groups was achieved in the whole oriT size range (Fig 1B,C), compared to the bare nucleotide sequences, where results were significant (Anova p-value < 0.05) only with oriT regions equal to or shorter than 120 bp. This was corroborated with the Jaccard distance, which significantly (Ranksum p-value < 1e-16) decreased by 40% with increasing oriT region size when using structurally encoded k-mers (Methods M2, Fig S1-2), whereas it increased with nucleotide k-mers (Ranksum p-value < 1e-9). The results suggested that our structural encoding approach leveraged the chemical information in longer query regions and could thus improve multiple sequence comparisons with alignments by increasing the statistical depth (Fig S1-3).
We next prototyped an alignment framework (Fig 1D) that employed the s-distance measure to find target hits to query oriTs, where p-values were obtained via permutation tests (Fig S1-1, Methods M2). A 4 Mob group dataset of 106 query oriT regions and 170 testing plasmids 15,36,49 was compiled (Fig S1-4, Table S1-1, Methods M1) to test the algorithm’s performance by assessing the oriT/nic location and Mob type of the highest scoring alignment hits. By using full-length 220 bp query regions, on average, 25% more significant (Permutation test p-value < 1e-14) oriT hits were recovered and Mob group predictions increased 28% compared to using 40 bp query size (Fig 1E, Fig S1-5). This corroborated that the use of longer queries indeed led to improved algorithm performance (Fig 1B,C). Furthermore, compared to Blast 37 and to non-encoded DNA using the p-distance (Methods M2), our approach uncovered, on average, 27% more significant (Permutation test p-value < 1e-10) oriT hits and correctly predicted 22% more Mob groups (Fig 1E, Fig S1-5). By analysing machine learning metrics to better understand the algorithm’s performance (Methods M3), we observed that a marked ∼40% increase was achieved with Recall (Fig 1F) at a relatively constant Precision and Specificity, which corresponded to recovering a larger amount of the correct oriTs (Fig S1-6). Our method thus correctly located and Mob typed, on average, 55% of oriTs in the cross-validation and testing datasets (Fig S1-7, S1-8). We further validated the capability of the algorithm to identify specifically nic sites using a plasmid dataset with experimentally determined oriT regions that was not Mob typed 36 (Methods M1). Out of 13 such plasmids with 14 oriT sites, it correctly identified (Permutational test p-value < 1e-12) 6 oriT regions in 5 plasmids with 100% sequence identity and aligned to within ±1 bp of the nic sites 49 (Fig S1-7, Table S1-2). This indicated that the lack of diversity in the query dataset altogether missed certain oriTs in the testing datasets, which was also confirmed by using smaller query datasets that lowered the algorithm’s performance especially for locating oriT regions (Fig S1-9). Nevertheless, despite the limited oriT data availability, the results experimentally verified the algorithm’s capacity for oriT typing.
OriT typing reveals a two-fold increase in the number of known mobile plasmids
We used the structural alignment algorithm to explore the diversity of oriT regions in natural plasmids. To cover all available oriT regions, we expanded the query dataset to 112 unique oriTs from 6 Mob groups that included, besides oriTs from the major Mob groups F, P, Q and V, also 3.6% and 0.9% of elements from groups C and T, respectively (Fig 2A, Methods M1). The query dataset covered 59 unique host species with the majority (88.4%) from the phyla Proteobacteria and Firmicutes (Fig S2-1). The target dataset comprised 4602 natural plasmids with Mob groups determined by relaxase amino acid homology analysis 50 (Fig 2B). Here, 28.7% of plasmids were mobile (1307 plasmids that contained 1377 distinct relaxases) with the highest represented Mob groups F, P, Q and V (Fig 2A, Fig S2-2). The target dataset contained 893 distinct host species from 22 distinct phyla, with the mobile plasmids harboured by 40% of the distinct species with 88.1% from the phyla Proteobacteria and Firmicutes (Fig S2-1).
Contrary to the expectations based on the 1377 distinct relaxases found in the target dataset (Fig 2B) that indicated at least a similar amount of oriT regions should be recoverable, with our alignment method we identified 11,497 significant (q-value < 1e-8, E-value < 0.01) oriT hits (Methods M4). The oriT hits were uncovered with 91% (102/112) of the query regions and covered all Mob groups and subgroups (Fig 2C). They contained sequence features, including sequence homology and inverted repeats (IR) that facilitate relaxase recognition and binding 4, similar to those of the query regions and in accordance with published findings showing that relaxases can function with relaxed specificity on non-cognate oriTs with ∼60% sequence homology 25,26,32. Indeed, sequence homologies between an oriT hit and its closest-associated query oriT region were above 60% within all the 40 bp relaxase-binding sites and with the majority (> 99.6%) of the larger sized subregions (Fig 2D). Median sequence homologies were well above the ones of the query dataset, where all pairwise seq. homologies within each Mob group were measured (Fig 2D, Methods M4). Furthermore, all the different sizes of oriT subregions were strongly correlated (Pearson’s r > 0.730, p-value < 1e-16, Table S2-1) between each other as well as to the oriT structural similarity, s-distance (Pearson’s r > 0.817, p-value < 1e-16, Fig S2-3). In accordance with the analysis of IRs in the query oriTs, where in the 60 bp upstream of nic an IR of at least 6 bp could be identified (Fig S2-4A: avg. size was 10 bp, Methods M4), in the oriT hits we found that in fact all but 6 oriTs (0.05%) carried IRs with similar properties (Fig S2-4B: at least 6 bp with an avg. size of 10 bp). Finally, we explored if the uncovered oriTs were located in any specific coding or non-coding regions in the plasmids by obtaining and analysing the CDS records of each plasmid (Methods M3). Indeed, we observed a significant (Fisher’s exact test p-value < 1e-16) 4.6-fold increase of oriT presence in non-coding areas and 2-fold decrease in coding ones as well as significant (Fisher’s exact test p-value < 1e-16) enrichment in genes related to horizontal mobility, namely conjugation, transposition and integration (Table S2-2).
Out of the 1309 plasmids that carried a relaxase (Fig 2B: 28.4% of target dataset), we identified oriTs in 907 plasmids (69.2% of relaxase-typed plasmids) as well as in an additional 1177 plasmids that were previously not specified as mobile (25.6% of target dataset). Therefore, the uncovered oriT regions combined with the previously typed relaxases resulted in a total of 2486 mobile plasmids (Fig 2B: 54% of target dataset), which represented a 1.9-fold increase in the number of mobile plasmids in the target dataset compared to the initial relaxase typing. Similarly, a 1.7-fold increase in mobile plasmid-carrying host species was identified, when comparing species from the whole set of mobile plasmids (Fig 2B: 607 out of 893 species, 58.2%) with previous relaxase typed ones (356 species, 39.9%). This also corresponded to a 1.4-fold increase in Phyla, with mobile plasmids representing 19 out of the 23 Phyla compared to merely 14 with relaxase typing (Fig S2-1). Furthermore, out of the 907 plasmids where both oriTs and relaxases were identified, the same Mob group, indicating that the oriT was cognate to the relaxase, was identified in 75% of cases (Fig 2E). In the remaining 25% of these plasmids, the uncovered orits could have been secondary oriTs 30,31 or corresponded to either unknown or in trans acting 32 relaxases. The distribution of the oriT-identified Mob groups was found to be comparable to the one expected according to relaxase typing (Fig 2E), whereas a larger diversity of oriTs may exist within certain Mob subgroups than others (Table S2-3). Multiple oriT regions were identified in over 63% of both the relaxase-typed as well as untyped mobile plasmids, where on average, 2 orits from 2 different Mob groups were identified per plasmid (Fig 2F). This supported the notion that, besides secondary and in trans oriTs, the untyped plasmids likely carried un-identified relaxases 15,20–24.
Since the amount of query oriTs was the limiting factor in our analysis of plasmid mobility, we explored what effect a larger query dataset could have on the findings. Briefly, we simulated the results with a larger query dataset by performing curve fitting on results obtained with 10 repetitions of random 10-fold dilutions of the present dataset (Methods M5). An approximate linear rule was observed between the size of the query dataset and the amount of uncovered oriT-hits, as each order of magnitude increase in oriT hits required likewise an order of magnitude larger query dataset (Fig S2-5A: e.g. 1e5 hits are obtained with a query set of ∼975 oriTs). Consequently, with each order of magnitude increase of the size of the query dataset, approximately 1500 more mobile plasmids (Fig S2-5B: starting from an initial value of 500 with 10 oriT queries) and 250 more mobile plasmid-carrying species were uncovered (Fig S2-5C). Additionally, we observed that in order to achieve a full overlap with the relaxase-typed plasmids, a considerably larger query dataset than is currently available is required, comprising 415 oriTs (95% lower and upper bounds were 328 and 532, respectively, Fig S2-5D). The demonstrated limitations of the query data suggest that the present published results 50,51 and our findings might still be an underestimation of the true plasmid mobility present in nature.
Over one half of mobile plasmids can transfer between different Mob groups
A large part of the newly uncovered oriTs were additional regions to the primary ones that corresponded to the plasmid cognate relaxases (Fig 2F, Fig 3A inset). This resulted in 1331 multi-oriT plasmids (64% of all oriT-typed mobile plasmids, of which 43% were relaxase-typed) that carried on average 5 oriTs (Fig S3-1). First, we analysed the connectivity (co-occurrence network) between the different oriT regions (nodes), when they were shared across the multi-oriT plasmids (edges), which represented an undirected multi-edged graph (Methods M6). Since each oriT hit was characterised only by its closest-associated query oriT, the actual oriT node diversity was limited to the 102 query oriTs that returned hits (Fig 2E). The graph contained a total of 79,004 connections between 552 interacting oriT node pairs (Fig 3B), with an average of 6 connections and up to 3528 connections per oriT node pair (ie. oriTs were shared across this many plasmids, Fig 3A). Among the multiple small connected components in this graph, one was larger than 2 nodes and comprised 76 oriT nodes (75% of all oriTs, Fig 3B) that proportionally represented all 6 Mob groups except Mob V, of which only 38% was in the fully connected subgraph (Fig 2E). The observations of a power law of the degree distribution (Fig S3-2: degree exponent was 0.97) indicated that this oriT network obeyed the laws of natural biological scale-free networks, while the decrease of the clustering coefficient with increasing degree hinted at the possibility of a hierarchical topology (Fig S3-2: global clustering coefficient was 0.47) 52. In fact, we observed that specific oriT regions acted as hubs and connected multiple other regions across the Mob groups (Fig 3B). The most highly connected hubs were for instance pNL1-, BNC1 Plasmid 1- and pBBR1-like oriTs from Mob F, Q and V, respectively, that connected over 50 unique oriTs from all 6 Mob groups via ∼1500 connections each (see details in Table S3-1).
We next focused on the connectivity of Mob groups, since over 90% of the multi-oriT plasmids contained on average 2 unique Mob groups and 3 unique Mob subgroups (Fig S3-1). Indeed, the number of different identified Mob groups per plasmid increased proportionally to the amount of oriT regions (Pearson’s r = 0.799, p-value < 1e-16). Compared to relaxase-typing, we observed on average a 7-fold increase of identified oriT elements per Mob group, which led to a 75-fold increase in the amount of connections, with the highest increases in Mob groups F, Q, and C (above 75-fold increase) and lowest in Mob groups P and V (11 and 13-fold increases, respectively, Table S3-2). Therefore, by switching the nodes of the above oriT graph with the corresponding Mob groups, we were able to analyse the high connectivity between Mob groups within the plasmids (Fig 3C, Table S3-3, Methods M6). In correspondence with the most frequently co-occurring Mob groups F, Q and V (Fig S3-2), the highest flux was also observed between these groups. The most strongly connected and highly promiscuous was Mob Q, as 35,062 connections were measured within Mob Q, 15,081 connections between Q and F, and 12,637 connections between Q and V (Table S3-3). Due to the large amount of intra-Mob fluxes, we also analysed the connectivity of Mob subgroups. This showed that the largest amounts of connections frequently involved the unknown subgroups, specifically Mob Qu with Q2, Fu, V2 and P7 (Table S3-4), pointing to possible insufficient typing and underestimation of the true mobility within the Mob groups. Overall, the results demonstrate how each multi-oriT plasmid contains the initial means of transfer between multiple Mob subgroups and groups, and can thereby in principle connect multiple different plasmid host species either transiently, such as via transfer hosts 53,54, or via replicative hosts 3,4.
High connectivity across habitats facilitates transfer of resistance genes to humans
We explored how the increased potential for plasmid transfer between different host species (Fig 3) contributes to the connectivity across different environmental habitats and whether it facilitates the spread of AMR genes to humans. For this task, we expanded the plasmid and oriT data by mapping the host species across 9 habitat supertypes (Table S4-1) according to published data 55–60 (Methods M6). This retained 43% (227 of 532) of the unique species carrying plasmids, where multiple oriTs and habitat sizes reflected those of the full habitat dataset (according to the amount of unique species) but were on average 8-fold smaller (Fig S4-1). Next, we constructed a directed graph representation of habitat nodes connected by potential plasmid transfers as edges, where habitats of donor hosts carrying the mobile plasmids (outbound connections) connected to habitats of potential acceptor hosts deduced from the query oriTs (inbound connections, Fig 4A, Fig S4-2, Methods M6). The network comprised 141,395 connected habitat node pairs, comprising 1,600,978 plasmid connections between the habitats (Fig 4A). On average, each plasmid was present in 6 habitats and had access to 5 unique host species (Fig 4B) as well as 6 habitats (Fig S4-3). The number of mobile plasmids and oriTs across the habitats were both strongly correlated to the number of unique species (Pearson’s r was 0.961 and 0.720, p-value < 0.03, respectively), with an average of 3 oriTs per plasmid (Fig S4-4 and S4-5). Interestingly, among the highest amount of oriTs per plasmid was found in industrial (includes food production and water treatment facilities) and animal habitats, known to harbor resistance 61 (Fig S4-4).
The number of outbound and inbound connections differed across the habitat types (Fig S4-6), as they were significantly correlated with the number of unique species (Pearson’s r was 0.895 and 0.916, p-value < 2e-3, respectively, Fig S4-5). However, despite the strong correlation (Pearson’s r = 0.894, p-value < 2e-3) between outbound and inbound connections across habitats, we observed a marked increase in the ratio of inbound vs. outbound connections in the Human and Animal habitats compared to the others (Fig 4C: 61% and 29% increase, respectively). Conversely, the Thermal habitat displayed 43% less inbound than outbound connections, which might also have been due to the lack of representation of plasmids from that habitat 55 or the lowest amount of species diversity (Fig S4-1). In order to analyse the transfers to and within the Human habitat more comprehensively, we expanded the habitat taxonomy to include commensal and pathogen types as well as tissue subtypes (Table S4-1, Methods M6). Strikingly, a 2-fold higher amount of inbound vs. outbound connections was observed with commensals (Fig 4C), whereas pathogens displayed a mere 11% increase. The highest amounts of connections to humans were from the Freshwater and Soil habitats (Fig S4-7A), especially via the gut and oral microbiomes (Fig S4-8A). Notably, all tissues except respiratory ones displayed an increase of inbound connections compared to outbound ones, with unassigned bacteria harboring some of the most highly receptive species (Fig S4-8B). Further analysis of the transfer within the human system itself showed an approximately equal rate of outbound and inbound connections (Fig S4-7B), suggesting that commensal bacteria act as the main interface for horizontal uptake of genes from the environment 62, which they then further disseminate to the pathogens within the human body (Fig 4C inset).
By predicting the AMR genes in the plasmids (Methods M6), we found a moderate positive correlation (Pearson’s r = 0.462, p-value < 1e-16), between the number of oriTs and the number of resistance genes in a plasmid (Fig 4D). Thus, 33% of the multi-oriT plasmids carried on average 4 genes from 3 different AMR classes (Fig 4D). When viewed from the perspective of the above cross-habitat transfer network (Fig 4A), the amount of inbound connections of resistance genes from the environment to the human flora indicated that, as expected, the most abundant flow of genes corresponded to the oldest and most widely used classes of antibiotics, for which also resistance is most developed and widespread 63 (Fig 4A, Fig S4-8). Moreover, in this case, the amount of inbound connections to pathogens surpassed that of commensals by almost 20% (Fig 4E), supporting that AMR transfer routes generally serve a different portion of microbes compared to plasmid transfer in general (Fig 4C).
3. Discussion
Here, we explore the potential for horizontal transfer of natural plasmids by attempting to identify all conjugative DNA origin-of-transfer substrates coded within them. By prototyping a structural alignment approach to find and characterize oriT regions across plasmids (Fig 1D), we discover an almost 8-fold larger amount of oriTs than expected according to relaxase typing (Fig 2B). Analysis of these regions suggests that the actual number of transferable plasmids can be as much as 2-fold higher and span almost 2-fold more host species than is currently known (Fig 2B), with the amount of plasmid-borne AMR genes in fact proportional to this assessment of plasmid mobility (Fig 4D). We show that over half of all mobile plasmids contain the means to transfer between different mobility groups (Fig 3C), potentially linking multiple host ranges that were previously thought to be confined 3,4. By analysing this network of potential plasmid transfers across ecological habitats (Fig 4A), we find that human and animal microbiomes display a large amount of plasmid connections with the environment. The considerably larger influx of plasmids to humans and animals compared to other environmental habitats (Fig 4C) might be a consequence of the increased amount of AMR transfers to these organisms 63. In accordance with previous findings 62, the network shows that human commensals might act as the main interface for horizontal uptake of genes from the environment in general (Fig 4C), whereas, unsurprisingly, the transfer of the specific widespread AMR genes is generally more highly targeted at pathogens (Fig 4E). Nevertheless, the potential increase in plasmid mobility uncovered here might be an important driver of the observed rapid resistance development in humans 63,64 and thus an important point of focus for future prevention measures.
Our oriT-typing procedure is a result of rationally expanding DNA alignment algorithms to incorporate enzymatically-relevant properties of the oriT substrates (Fig 1A), where the conservation of structural properties is detected across the whole 220 bp region compared to mere ∼40 bp of nucleotide sequence in the core relaxase-binding site 4 (Fig 1C,D). By allowing the use of at least 2-fold longer query sequences, structural alignment achieves a much larger statistical depth than sequence alignment (Fig S1-3), which means that oriTs can be efficiently sought across whole plasmids instead of just the vicinity of relaxases, as is the constraint with current methods 36 (Fig 1E). Since, due to the nature of the conserved structural properties, each enzymatic substrate corresponds to multiple possible sequence variants, the benefit of the DNA structural encoding is that it exposes these sequence variants by accessing the search space of the enzymatic co-evolutionary constraints (i.e. DNA structural background) 4 (Fig 1A and 2D). The identified candidate oriT regions serve as starting points that can be further verified by typing other known molecular features 32, such as inverted repeats 65,66 (Fig S2-4) and nucleotide sequence properties of the core enzymatic binding 38,67 and nicking sites 49,68 (Fig 2D).
Besides finding the majority of expected oriT regions of known cognate relaxases (Figs 2E), almost ⅔ of the oriT-bearing plasmids in fact carry multiple oriTs (Fig 2F). The newly discovered oriTs are frequently located where they are expected, in non-coding regions and within genes related to horizontal mobility (conjugation, transposition and integration) 69. However, the amount of Mob groups and depth of enzymatic substrate diversity that could be analysed within each group was constrained by the size of the query set of available oriT regions and nic sites 19 (Fig S1-9 and S2-5). By simulating the availability of a larger set of query sequences, a linear relationship between the amount of uncovered and query oriTs indicates that our current sampling is likely still an underestimation of the actual plasmid transfer potential that could possibly span all plasmids 51,70 (Fig 2I). Our results of a 2-fold higher plasmid mobility compared to relaxase typing, with an almost similar increase in the amount of mobile plasmid-bearing host species (Fig 2B), point to multiple possibilities that further undermine the paradigm of a one relaxase-one oriT conjugative plasmid system spanning less than ⅓ of plasmids 50: (i) a massive underidentification of relaxase enzymes 51,71, (ii) relaxase promiscuity 21,31 and oriT evolutionary mechanisms 30 leading to many functional secondary oriTs, and (iii) a system-wide adoption of relaxase in trans mechanisms 21,72.
Since plasmids are vehicles for transfer and long-term storage of ‘common goods’ that include, besides AMR, also virulence, heavy metal resistance and other genes 73, one can expect the tendency of the global plasmid transfer network to possess: (i) the capacity of at least some elements to bypass key transfer barriers, including phylogenetic 2,74, host range (defined by Mob and Inc/Rep groups, respectively) 3,4,50 as well as ecological habitat constraints 2,75, and (ii) a robust fault tolerant system that increases the guarantee for transfer as well as information storage 70,76. Firstly, we found that the plasmids bearing multiple oriTs connect different Mob groups and subgroups (Fig 3C), thus transcending phylogenetic and host range barriers as well as potentially creating a global cross-habitat network that is reminiscent of a dense information network 70,77 (Fig 4A). This network interconnects environmental genetic reservoirs and carriers as well as commensal vectors 62, which likely act as long-term and shorter-term storage units 75, respectively, with known human pathogens 64 (Fig 4C). Secondly, with a glimpse of the oriT network topology via the closest-associated query oriTs reminiscent of scale free and even hierarchical networks (Fig 3B and S3-2), the transfer networks display robust fault tolerant properties. As sparsely connected nodes without many direct neighbors are linked to highly connected hubs, even in case of absence of a large number of nodes, the remaining ones are likely still well connected 52,78. The finding that specific types of oriTs serve as hubs linking to most of the other oriT types (Fig 3B,C) suggests that certain types of conjugative transfer mechanisms and their corresponding hosts act as transfer hubs that can bypass horizontal gene transfer barriers (Fig 4A) to ensure the flow of genetic information among the different global microbiomes 64,79.
4. Methods
M1. Datasets used for alignments
The full query dataset comprised 112 distinct oriTs from 118 plasmids (merged below sequence similarity of 15%) and 6 Mob groups {F,P,Q,V,C,T} (26 Mob subgroups) with known nic sites (Figure S1-1A, Dataset S1). The dataset included (i) 48 experimentally verified oriT regions, of which 34 contained experimentally verified nicking sites and 14 contained putative nicking sites, and (ii) 59 oriT regions with computationally predicted nicking sites. For the initial development and testing of the structural alignment algorithm, the following query and testing datasets were used (Table S1-1). Due to the lack of a sufficient number of elements from Mob groups C, H and T for correct testing (Figure S1-4: below 10 elements per group), a 4 Mob group {F,P,Q,V} version of the query dataset with 106 elements was used. The balanced dataset from 4 Mob groups {F,P,Q,V} used for s-distance testing was a subset of the query data containing approx. 16 elements from each Mob group 4. A set of negative examples was obtained for each element by extracting sequences from the neighboring vicinity of oriTs. Specifically, the negative examples were selected randomly from a region 200 to 800 bp upstream and downstream from experimental nic sites (defined by size of smallest plasmid), thus containing different non-orit coding and non-coding regions with low sequence similarity (p-distance > 0.6). The testing dataset of 170 plasmids 15,36,49 was compiled from 3 separate datasets that included: (i) 106 plasmid sequences of the 4 Mob query dataset, (ii) 51 plasmids with known oriT locations and Mob sites but unknown nic sites and (iii) 13 plasmids with 14 experimentally determined nic/oriT sites but unknown Mob groups, obtained from the OriTFinder database 36.
M2. Development and testing of alignment algorithms
We developed and tested a DNA structure based alignment algorithm, termed Strast, based on (i) encoding DNA into a structural representation 4 and (ii) a procedure that finds the most similar segments of target sequences to query sequences using a structural distance measure. In the first stage of Strast, for each DNA sequence, its set of kmers is transformed into a precalculated structural encoding, termed s-mers 4 (Fig S1-1B). The s-mers are based on clustered principal components of values obtained from 64 models of physicochemical and conformational DNA properties (Table S1-3). Here, the s-mer size was set to 7 bp, the number of principal components was 18 (out of 64) to capture over 0.99 of data variance, and the k-means clustering algorithm with k = 128 clusters was used. The s-distance between two DNA sequences is defined as the sum of Euclidean distances between each consecutive pair of s-mers in the DNA sequence encodings (Fig S1-1B). The Jaccard distance between structurally encoded DNA sequences is obtained by using s-mers instead of k-mers. The second stage of Strast queries a target sequence for regions with the most similar and statistically significant s-distance (Fig S1-1C: see algorithm pseudocode, Methods M3).
For comparison of Strast to traditional non-encoded sequence based alignments two algorithms were used: (i) Blast v2.24 37 (www.ncbi.com) and (ii) the Strast procedure using the p-distance measure instead of s-distance. To assess the performance of the alignment algorithms for oriT-typing in target sequences, we evaluated the correctness of both (i) oriT and nic location finding to within ± 1 bp 49 and (ii) typing of Mob groups and subgroups. Additionally, the effect of the length of query regions in both types of tests was assessed (Fig 1B).
M3. Statistical analysis and machine learning metrics
The F-test was performed using Permanova 80 with sequence bootstraps. The statistical significance of s-distance scores was evaluated using permutational tests, where bootstrap resampling (n_bootstraps = 1e6 per sequence) of randomly selected query oriT sequences (n_seq = 10) was used to estimate the s-distance scores at different p-value cutoffs (from 1e-6 to 1e-1). Next, to obtain a mapping function of s-distance to permutational p-values in the whole range of 1e-132 to 1e-1 (Fig S1-1D), least squares curve fitting to a second order polynomial function was performed, where the theoretical limit of ∼1e-132 was set to correspond to an s-distance of 0. For additional statistical hypothesis testing, the Python package Scipy v1.1.0 was used with default settings.
The following machine learning performance metrics were used to assess alignment algorithm performance: Precision, Recall/Sensitivity, Specificity, Accuracy, F1-score and Matthews correlation coefficient (Table S1-4). To calculate these metrics, true and false positive and negative counts were obtained from the alignment tests (Methods M2) by considering only the most significant hit per alignment. A true or false positive value was assigned if the result was above a specified significance cutoff and corresponded or did not correspond, respectively, to the known value (nic location, Mob group or subgroup), and alternatively, a false or true negative value was assigned to results below the significance cutoff that corresponded or did not correspond, respectively, to the known value.
M4. Analysis of alignment hits
The newly uncovered regions were analysed by comparing the features of the oriT alignment hits with those of the query dataset, which included sequence properties and inverted repeats. Sequence homology analysis involved (i) calculation of the sequence homologies of the oriT query dataset within each Mob group (ii) calculation of sequence homologies between each oriT hit and its closest-associated query oriT region, (iii) comparison of the sequence homologies of the oriT query and alignment hit datasets, across the different sized oriT subregions. OriT hits with sequences of their relaxase-binding 40 bp subregions that deviated below 60% seq. homology from their query counterparts were removed. Sequence homology was calculated with the ratio function (python-Levenshtein package v0.12), where it equalled the Levenshtein (edit) distance divided by the length of the sequence. Similarly, analysis of the inverted repeats (IRs) involved computation of imperfect IRs in both the oriT query and alignment hit datasets, and oriT hits lacking IRs similar to those in the query set were removed. The Matlab package detectIR v2016-01-19 81 was used with IR size limits of (6, 15) bp and containing at most 2 mismatches. From the initially identified 20,255 oriT hits, 11,497 (57%) were retained (Dataset S2).
M5. Simulations of plasmid mobility
To estimate the results that would be obtained with a larger oriT query dataset, the following procedure was applied. The oriT alignment results with the 4602 target plasmid dataset were diluted according to 10-fold dilutions of the 102 query regions (Fig 2C) used to identify the hits (10 repetitions were used). Least-squares curve fitting was performed (Python package Scipy v1.1.0) using a linear function and the dataset dilutions - specifically between the size of the query oriT dataset and the variables corresponding to the numbers of oriT hits, mobile plasmids, mobile plasmid carrying species and overlap with relaxase-typed plasmids.
M6. Analysis of transfer between environmental microbiomes
Host species of the oriT alignment results with the subset of multi-oriT plasmids were mapped across 9 habitat supertypes (see habitat groupings in Tables S4-1, S4-2) according to published data on environmental 55 and human microbiomes 57–60. The habitat taxonomy was further expanded to include commensal and pathogen types 57 as well as human tissue subtypes 55. For typing antimicrobial resistance genes in the plasmids, the webserver version of ResFinder v3.2 82 was used with default settings.
To study the connectivity between the different oriT regions or Mob groups as nodes, shared across the multi-oriT plasmids as edges, an undirected multi-edged graph was constructed. To study the connectivity and AMR transfer potential between different habitats, a directed graph was constructed, where habitat nodes were connected by multi-oriT plasmids as edges. Habitats of donor hosts carrying the mobile plasmids (outbound connections) were connected to habitats of potential acceptor hosts, deduced from the query oriTs (inbound connections). The Python package NetworkX v2.2 was used.
M7. Software
Matlab v2018 (www.mathworks.com) and Python v3.6 (www.python.org) were used.
Acknowledgements
This work was supported by the Slovenian Research Agency under grant agreement n° [Z2-7257] and was in part carried out at the Faculty of Health Sciences, University of Primorska, Izola, Slovenia. I kindly thank Maria Pilar Garcillán-Barcia, Fernando de la Cruz (UNICAN, Spain) and Joshua Ramsey (Curtin Univ., Australia) for sharing and discussing data, Filip Buric (Chalmers Univ. of Tech., Sweden) for commenting on the manuscript, as well as Tomaz Pisanski (UP-FAMNIT, Slovenia) and Ales Lapanje (IJS, Slovenia) for helpful discussions in their respective fields of research.
Footnotes
Formatting has been updated in all sections. New results have been added.
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.
- 7.↵
- 8.↵
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.
- 23.
- 24.↵
- 25.↵
- 26.↵
- 27.
- 28.
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.
- 40.↵
- 41.↵
- 42.
- 43.↵
- 44.↵
- 45.
- 46.
- 47.
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.
- 57.↵
- 58.
- 59.
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵