bioRxiv
New Results

The genetic and ecological landscape of plasmids in the human gut

Michael K. Yu, Emily C. Fogarty, A. Murat Eren
doi: https://doi.org/10.1101/2020.11.01.361691
Michael K. Yu
1Toyota Technological Institute at Chicago; Chicago, IL 60637, USA
  • For correspondence: mikeyu@ttic.edu meren@uchicago.edu
Emily C. Fogarty
2Department of Medicine, University of Chicago; Chicago, IL 60637, USA
3Graduate Program in the Biological Sciences, University of Chicago; Chicago, IL 60637, USA
A. Murat Eren
2Department of Medicine, University of Chicago; Chicago, IL 60637, USA
3Graduate Program in the Biological Sciences, University of Chicago; Chicago, IL 60637, USA
4Josephine Bay Paul Center for Comparative Molecular Biology and Evolution, Marine Biological Laboratory; Woods Hole, MA 02543, USA
5Helmholtz Institute for Functional Marine Biodiversity, Ammerländer Heerstraße 231, 26129 Oldenburg, Germany
  • For correspondence: mikeyu@ttic.edu meren@uchicago.edu

Abstract

Plasmids are mobile genetic elements found across all domains of life. As plasmids often encode determinants of fitness, their evolution is intertwined with their hosts. However, naturally occurring plasmids remain far less understood than their hosts due to the lack of frameworks to recognize plasmids and to classify them into evolutionary groups. Here we trained a machine learning model that recognizes plasmids based on genetic architecture with state-of-the-art accuracy. We applied this model to a global collection of human gut metagenomes to identify 68,350 unique plasmids, 13,280 of which had a very high model confidence and represent more than an order of magnitude increase over the number of known plasmids that we detected in this environment. To understand the evolution of these plasmids, we developed a generalizable approach that enabled us to define 1,169 ‘plasmid systems’. Each system consists of plasmids that share a backbone sequence containing core plasmid functions, such as replication and conjugation, but vary in cargo genes that are often critical to the host, such as antibiotic resistance, amino acid biosynthesis, and tRNA modification. Members of the same system are often found in geographically distinct human populations, revealing cargo genes that likely respond to environmental selection. The ecological patterns of plasmids we observed could not be explained by microbial taxonomy. This work uncovers the tremendous diversity of plasmids and demonstrates the need to characterize them as a separate component of microbiomes distinct from their hosts.

Main

Plasmids are a type of mobile genetic element1 that occurs in all domains of life2, 3. They typically exist as extrachromosomal and circular DNA, replicate semi-independently of their hosts, and often transfer between cells as a mechanism of horizontal gene transfer4–8. A hallmark of plasmids is their remarkably diverse capacity to impact their microbial hosts, by carrying fitness-determining functions4–6 such as antibiotic resistance genes9, 10 and virulence factors11, 12. Plasmids also exhibit many interesting genetic properties, such as frequent recombination, which can result in recurrent “backbone” sequences that are shared by multiple plasmids13–15. These backbone sequences often encode core replication and transfer machinery13, 14, 16–18 that can determine their host range18, 19 as well as copy number in a specific host20. Experiments in model systems and cultured organisms have revealed the critical impact of plasmids on microbial phenotypes and survival, especially for pathogens with medical significance. Yet, our understanding of the diversity, ecology, and genetic architecture of naturally occurring plasmids is far from complete.

Recent advances in metagenomics offer unprecedented access to the entire DNA content of an environment without the need for cultivation21. In particular, metagenomic assembly and binning strategies have enabled the reconstruction and characterization of microbial genomes de novo22, including those in the human gut23 where microbes have been associated with health and disease states24, 25. Metagenomic approaches have also been applied to study plasmid content26, but this application has been limited to enriching for plasmid DNA through library preparation techniques or to surveying only a small handful of metagenomes at a time27–32. Over the past decade, the number of publicly available metagenomes has rapidly increased, numbering in the tens of thousands, creating an opportunity to study plasmids at an unprecedented scale in complex ecosystems.

Comprehensive insights into plasmid ecology and evolution require effective computational strategies for de novo identification of plasmids, which remains a challenge33. Several computational strategies have been developed to identify plasmids in sequence collections17, 34, 35. Many of these approaches rely on k-mer patterns learned from reference plasmid sequences30, 31, 36, exploit known functions such as replication or conjugation genes37–39, or use a combination of these features35. While these features can help identify plasmids similar to those in public databases, they are of limited utility to recognize novel plasmids. Other approaches focus on circularity of sequences during (meta)genomic assembly32, 40, 41; however, this strategy overlooks plasmids that are linear, integrated, or found as assembly fragments, and may confuse other types of circular mobile elements for plasmids.

Here, we present (1) a machine learning approach to identify plasmids in complex microbial ecosystems and (2) a novel algorithm to gain insights into plasmid evolution at scale. Specifically, we identify a collection of 68,350 non-redundant plasmids in the human gut microbiome that were more genetically diverse than reference plasmids and substantially more prevalent across global human populations. Using a novel network partitioning algorithm, we organize this large-scale sequence collection into ‘plasmid systems’ based on shared backbone sequences, and demonstrate that plasmid systems provide a framework for studying the selection of plasmids by environmental pressures.

Results

A plasmid classification system based on de novo gene families

To enable a systematic study of plasmid sequences for machine learning, we compiled a reference set of 16,827 plasmids and 14,367 chromosomal sequences from public databases (Figure 1A, Table S1). In these sequences, we identified 51.2 million open reading frames and annotated them with functions defined in the Clusters of Orthologous Genes (COG)42 and Pfam43 databases. We also used MMseqs244 to organize genes de novo into 2,322,750 gene families and removed those that contained only one gene. The remaining 1,090,132 de novo families enabled a more comprehensive analysis by accounting for 95% of all plasmid genes (Figures 1B and S1A). Using de novo gene families also substantially increased our ability to identify genes that were enriched in plasmids, independently of available gene function databases (Figures S1B and S1C).

Figure 1. A machine learning model for classifying plasmids.

(A) Our pangenomics workflow to characterize gene functions in a reference set of plasmids and chromosomes. (B) The fraction of all plasmid or all chromosomal genes that are annotated by using known families (blue), de novo families (orange), or a combination of both (green). (C) Training of PlasX. Reference sequences are sliced into 10kb windows and then prediction scores are made by a logistic regression that sums the contributions of gene families within a sequence. (D) Precision-recall curves comparing PlasX, Platon, PlasClass, and PPR-Meta. Except for PPR-Meta, every method was trained and evaluated using 4-fold cross-validation and an informed split. AUCPR was calculated using sequence weights for normalization. The arrows indicate the performance of PlasX using a score threshold of either >0.5 or >0.9. (E) The 200 gene families with the highest PlasX coefficients and thus most important for identifying plasmids. Gene families are ranked by their coefficient. (F) Maximum-likelihood phylogenetic tree of genes that are in PF10609 and also in either the plasmid-specific de novo subfamily mmseqs_5_1535552 (red) or chromosome-specific de novo subfamily mmseqs_70_40217271 (blue). (G) Sequence alignment of 10 representative genes from each subfamily (arrows in F).

We used this reference database to train a machine learning model, PlasX, that distinguishes between plasmids and chromosomes based on genetic architecture (Figure 1C). PlasX is a logistic regression model that assigns a positive or negative coefficient to each gene family, depending on whether the family is more likely to originate from sequences of plasmid or non-plasmid origin, respectively. The coefficients of gene families within a sequence are summed and passed through the logistic function to calculate a prediction score, ranging from 0 to 1, where a score of >0.5 designates that a sequence is more likely to be a plasmid than not. To improve performance, PlasX uses a technique called elastic net regularization, which identifies gene families with redundant or noisy signals and then minimizes the usage of these families by setting their coefficients equal or close to zero. Consequently, only a non-redundant and informative set of gene families can impact predictions by having coefficients far from zero (Figure S1D). For training and evaluating PlasX, we used 10kb slices of the reference sequences, to normalize for the fact that chromosomes are generally much longer than plasmids and to improve downstream application of PlasX on sequence collections that may contain a large number of fragmented sequences, such as assembled metagenomes.
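
The scoring rule described above can be illustrated with a minimal sketch. It is not the PlasX implementation: the function names, the zero intercept, and the choice to count every listed gene family once are assumptions made for illustration, while the four coefficients are the values quoted elsewhere in this section.

```python
import math

def plasmid_score(gene_families, coefficients, intercept=0.0):
    """Sum the coefficients of the gene families in a sequence, then squash to (0, 1)."""
    # Families unknown to the model contribute nothing, like a zero coefficient.
    total = intercept + sum(coefficients.get(f, 0.0) for f in gene_families)
    return 1.0 / (1.0 + math.exp(-total))

def slice_windows(length, window=10_000):
    """Return (start, end) coordinates of consecutive windows covering a sequence."""
    return [(start, min(start + window, length)) for start in range(0, length, window)]

# Coefficients quoted in the text; the intercept of 0 is an assumption.
COEFFICIENTS = {
    "PF05714": 1.678,
    "mmseqs_5_1535552": 0.455,
    "PF10609": -0.023,
    "mmseqs_70_40217271": -0.198,
}

# Hypothetical gene content for two 10kb slices.
plasmid_like = plasmid_score(["PF05714", "mmseqs_5_1535552", "PF10609"], COEFFICIENTS)
chromosome_like = plasmid_score(["mmseqs_70_40217271", "PF10609"], COEFFICIENTS)
```

With these inputs the first slice scores above 0.5 and the second below it, showing how a near-zero family such as PF10609 barely moves the result while its subfamilies carry the signal.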

Benchmarking the efficacy of a plasmid prediction algorithm is a non-trivial task. Evaluating an algorithm’s performance on sequences that are similar to those used during training, or comparing it to older approaches that were trained on a much smaller number of sequences are common pitfalls that inflate accuracy estimates. Here we implemented a more realistic evaluation framework to compare PlasX to three state-of-the-art algorithms, PlasClass31, PPR-Meta45, and Platon46. We first evaluated performance in 4-fold cross-validation, using a ‘naive’ randomized splitting of sequences into training and test data (Figure S1E). PlasX achieved nearly perfect accuracy, with the highest area under the precision-recall curve (AUCPR=0.99) compared to all other methods (Figure S1F). While naive splitting is a common evaluation technique, it is not a fair strategy as it can separate very similar sequences into training and test data, especially given the redundancy of sequences in public databases, and thus inflate the accuracy of classification. As a more accurate benchmark, we (1) designed an ‘informed’ split by first clustering plasmid and chromosomal sequences into subtypes and then keeping all sequences in the same subtype together in either the training or test data to better evaluate the ability of recognizing novel sequences and (2) assigned normalized weights to sequences to prevent well-studied plasmids from influencing the prediction ability disproportionately (see Methods). This advanced benchmark revealed a greater performance divide between PlasX (weighted AUCPR=0.70) and all other methods, with the next best method performing substantially worse (Platon, weighted AUCPR=0.23) (Figure 1D).
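
As a rough illustration of the informed split, the sketch below keeps every subtype inside a single fold and assigns per-sequence weights. The greedy fold balancing and the rule that each subtype carries a total weight of 1 are assumptions made here; the exact procedure is described in the Methods.

```python
from collections import defaultdict

def informed_folds(subtype_of, n_folds=4):
    """Assign sequences to folds so that a subtype never spans two folds."""
    members = defaultdict(list)
    for seq_id, subtype in subtype_of.items():
        members[subtype].append(seq_id)
    folds = [[] for _ in range(n_folds)]
    # Place the largest subtypes first, always into the currently smallest fold.
    for subtype in sorted(members, key=lambda s: (-len(members[s]), s)):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(members[subtype])
    return folds

def subtype_weights(subtype_of):
    """Give every subtype a total weight of 1, shared equally by its members (assumed scheme)."""
    counts = defaultdict(int)
    for subtype in subtype_of.values():
        counts[subtype] += 1
    return {seq_id: 1.0 / counts[subtype] for seq_id, subtype in subtype_of.items()}

# Hypothetical sequences: subtype A is heavily sampled, the others are singletons.
example = {"p1": "A", "p2": "A", "p3": "A", "p4": "B", "p5": "C", "p6": "D", "p7": "E"}
folds = informed_folds(example)
weights = subtype_weights(example)
```

Under this weighting the three copies of subtype A together count as much as the single member of subtype B, which is the intent of preventing well-studied plasmids from dominating the evaluation.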

Plasmids can be difficult to distinguish from other mobile or integrated genetic elements because they share common features, including being extrachromosomal32, 47, facilitating horizontal gene transfer48, 49, or encoding traditional core functions like replication and mobilization38, 50. To determine PlasX’s ability to distinguish plasmids from other mobile genetic elements, we ran PlasX on all integrative and conjugative element (ICE) sequences from the ICEberg database51 (n=552) and all prophage sequences from the NCBI viral database (n=445). PlasX correctly classified 92.2% of ICEs and 93.2% of prophage sequences as not plasmids (Tables S9 and S10). Platon could also distinguish prophages from plasmids (99.6% accuracy), but its classification accuracy for ICEs was much lower than PlasX’s, as Platon classified 37.1% of ICEs as plasmids. Next, we ran PlasX on 21,012 plasmids that were added to PLSDB after we had already trained the model. PlasX performed very well, identifying 81.5% (17,128) of these sequences as plasmids (Table S11). Finally, we evaluated the performance of PlasX and other methods on a novel and recently characterized plasmid of Wolbachia, pWCP52. Since pWCP was not present in the training data for any of the plasmid prediction tools, it provided a unique opportunity to investigate whether this plasmid, which remained elusive until recently, could have been discovered through a de novo plasmid survey. PlasX was able to predict pWCP as a plasmid (score = 0.73), while all other methods, PlasClass31, PPR-Meta45, Platon46 and Deeplasmid35, were unable to classify it as a plasmid, either labeling it incorrectly as a chromosome or reporting high uncertainty in their prediction (Table S8). Overall, these results suggest that PlasX, with its reliance on gene families rather than strictly defined sequence features, is unique in its ability to predict, with high accuracy, novel plasmids that are not present in existing databases.

PlasX’s accuracy suggests it has learned insights into defining a “plasmid”. To broaden our understanding of a plasmid, we used PlasX’s coefficients to rank gene families by their importance in de novo identification of plasmids (Table S7). Among the 200 most important gene families, 19 were COGs and Pfams whose functional descriptions can be immediately recognized as being plasmid-associated because they contain keywords such as “plasmid”, “replication”, or “conjugation” (Figure 1E). However, another 9 COGs and Pfams did not have such a recognizable description. For example, a family of lipoproteins (PF05714) has been studied for conferring virulence in a few plasmids53, 54, but it is not generally thought of as a common plasmid function. Nonetheless, this family had the 17th highest coefficient of 1.678, consistent with its enrichment in 168 plasmids (36 plasmid subtypes) but only 2 chromosomes. While these results show that coefficients provide an approximate guide to understanding PlasX’s logic, we caution that interpreting each coefficient by itself can be complicated as PlasX often sums up the coefficients of several families in a sequence to make a prediction. Further curation is necessary to understand which high-coefficient families are truly characteristic of plasmids.

De novo families also provided two types of novel insights about plasmids. One insight is that 56.1% of de novo families can be thought of as ‘subfamilies’ that group together a subset of genes within a COG or Pfam. As many COGs and Pfams contain both plasmid and chromosomal genes, these subfamilies can provide a deeper resolution of the bacterial gene pool by delineating plasmid- or chromosome-evolved lineages. For example, the Pfam PF10609 is a broad family of genes related to parA, a gene that drives the partitioning of chromosomes55 and plasmids56 during cell division. As this family is found on 35% of plasmids and 95% of chromosomes, it alone is not informative for identifying plasmids and thus has a coefficient close to zero (-0.023). However, PF10609 can be further dissected into plasmid-specific subfamilies, such as mmseqs_5_1535552 (coefficient +0.455), and chromosome-specific subfamilies, such as mmseqs_70_40217271 (coefficient -0.198), which become informative for PlasX to distinguish plasmids from chromosomes. Indeed, the maximum likelihood phylogenetic tree that relates the genes in these two subfamilies shows a divergence of plasmids and chromosomes into monophyletic groups (Figure 1F), which is also reflected in their sequence alignment (Figure 1G). The second insight is that 35.5% (398,174) of de novo families have no overlap with any COG or Pfam. Many of these families have highly positive coefficients (e.g. 12,076 families have coefficients >0.1) that make a sequence appear substantially more like a plasmid and thus could represent fundamental but unexplored plasmid functions.

PlasX unveils a large database of new plasmids from the human gut microbiome

Having verified PlasX’s ability to identify plasmid sequences, we applied it to survey naturally occurring plasmids in the human gut microbiome, an environment which harbors a diverse range of microbes and mobile genetic elements57. We assembled 36 million contigs from 1,782 human gut metagenomes, spanning culturally and geographically distinct human populations (Table S2). Running PlasX on these data resulted in a total of 226,194 predicted plasmids with a model score above 0.5 (Figures 2A and S2A, Table S3). Our predictions spanned a wide range of lengths, including 135 sequences that were longer than 100 kbp, but they were generally shorter than reference plasmids with a median length of 2.6 kbp versus 53.3 kbp, respectively (Figure S2B). This discrepancy can be partly explained by fragmentation during metagenomic assembly, as the median length of the entire set of contigs was 2.1 kbp, and only 50,310 (0.14%) contigs were longer than 100 kbp. To minimize this issue, we removed predictions that were likely assembly fragments because they were subsequences of other predictions in our collection and also did not appear to be circular elements themselves. This filter retained 100,719 predictions for downstream analyses (see Methods, Figure S13). While PlasX identifies contigs that are likely plasmids or plasmid fragments, throughout this manuscript we refer to these predicted sequences simply as ‘plasmids’ for practical reasons.

Figure 2. Plasmid prediction from metagenomes.

(A) Number of plasmids predicted from each country. (B) Diagram of paired-end reads mapping to a linear versus a circular contig. Linear contigs have forward-reverse reads only, while circular contigs also have reverse-forward reads concentrated on the ends due to an artifact in contig assembly. (C) Orthogonal support for and novelty of the 100,719 non-fragment predictions.

To determine the circularity of predicted plasmids, we analyzed the orientation of paired-end reads recruited from metagenomes. This is a powerful strategy because if a contig occurred as a circular element in the environment, then matching reads from the same pair would be recruited to opposite ends of the contig in a ‘reverse-forward’ orientation, instead of the typical ‘forward-reverse’ orientation (Figure 2B). With this approach we found that 19,652 plasmid sequences were circular, and we designated them as high-confidence plasmids for downstream analyses. These circular plasmids spanned a range of sizes, with a median length of 4.4 kbp and with 854, 378, and 47 plasmids longer than 25, 50, and 100 kbp, respectively. An additional 14,151 sequences were not circular themselves but were highly similar to a circular sequence. Together, these two types of sequences defined a set of ‘circular-associated’ sequences representing 33.6% (33,803/100,719) of predictions. Multiple factors can explain the lack of signal for circularity for the remaining plasmids, including insufficient sequencing depth to observe reverse-forward pairs, fragmented contigs, or the non-circular nature of some plasmids that occur linearly58 or are integrated in a chromosome3. There were 154,680 contigs that were not predicted to be plasmids but still appeared circular; however, these contigs tended to have a smaller number of supporting reverse-forward reads relative to their coverage (Figure S2E), which may indicate that they are other types of mobile elements such as viruses or ICEs that temporarily circularize.
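
The paired-end logic can be sketched as follows. This is an illustration only: the tuple layout of a read pair, the 500 bp end window, and the minimum of five supporting pairs are hypothetical choices, not the thresholds used in this study.

```python
def pair_orientation(pos_a, is_reverse_a, pos_b, is_reverse_b):
    """Classify a read pair mapped to one contig as 'FR', 'RF', or 'other'."""
    # Order the two mates by their leftmost mapped coordinate.
    (_, left_rev), (_, right_rev) = sorted(
        [(pos_a, is_reverse_a), (pos_b, is_reverse_b)]
    )
    if not left_rev and right_rev:
        return "FR"  # mates point toward each other: the usual layout
    if left_rev and not right_rev:
        return "RF"  # mates point away from each other
    return "other"   # both mates on the same strand

def looks_circular(pairs, contig_length, end_window=500, min_rf_pairs=5):
    """Require several RF pairs whose mates sit near opposite contig ends."""
    supporting = 0
    for pos_a, rev_a, pos_b, rev_b in pairs:
        if pair_orientation(pos_a, rev_a, pos_b, rev_b) != "RF":
            continue
        left, right = min(pos_a, pos_b), max(pos_a, pos_b)
        if left < end_window and right >= contig_length - end_window:
            supporting += 1
    return supporting >= min_rf_pairs

# Hypothetical pairs on a 5,000 bp contig: (pos_a, is_reverse_a, pos_b, is_reverse_b).
junction_pairs = [(20, True, 4980, False)] * 6
ordinary_pairs = [(100, False, 400, True)] * 6
```

A pair that spans the junction of a circular molecule maps with its reverse-strand mate near the start and its forward-strand mate near the end of the linearized contig, which is why only reverse-forward pairs at the contig ends are counted.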

Beyond circularity, confirming in silico whether a novel sequence represents a plasmid is a significant challenge. In the absence of single-copy core genes that have been vital to assess the completeness of non-plasmid and non-viral genomes assembled from metagenomes22, our understanding of the canonical features of plasmids is limited to a relatively small set of well-studied genes that are primarily derived from plasmids of model organisms in culture37, 38. For instance, MOB-suite38 identified canonical features for plasmid replication and conjugation in only 16.3% of the 16,827 PLSDB reference plasmid sequences used to train PlasX. This relatively small percentage reveals the limits of conventional approaches to identify plasmid features and thus foreshadows their limited utility to survey novel plasmids. Indeed, MOB-suite identified canonical features in only 10.1% of our predictions. Given this narrow sensitivity, we developed orthogonal data-driven strategies to increase confidence in our predictions.

For the remaining 89.8% (90,446/100,719) of predicted plasmid sequences in which MOB-suite did not find any canonical plasmid features, we performed several types of analyses to assess how many are true plasmids or novel sequences (Figure 2C). We found that 24.5% (24,689) of predictions were circular-associated sequences. Another 26.7% (26,921) were ‘keyword-recognizable’, as they contained a COG or Pfam function with the words ‘plasmid’ or ‘conjugation’. Finally, 4.0% (3,996) were highly similar to a known plasmid sequence in NCBI, while 64.7% (65,117) were novel sequences with no hits to any sequence in NCBI. As these different subsets of plasmids are partially overlapping, we took their union to find that 49.4% (49,739) of predictions had some orthogonal support for being a plasmid, by MOB-suite or any of the first three types of analyses, and 28.5% (28,658) had such support and were novel (Table S3). Overall, these findings suggest that our collection of predicted plasmids includes not only sequences that match known plasmids, facilitating the study of their diversity and gene pool in natural habitats, but also novel sequences that can further advance plasmid biology.
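
Because the evidence categories overlap, the combined counts come from a set union rather than from adding percentages. The sketch below shows that bookkeeping on hypothetical contig identifiers; the dictionary keys and function name are illustrative assumptions.

```python
def summarize_support(evidence, novel):
    """Union overlapping evidence sets and intersect the result with novel predictions."""
    supported = set().union(*evidence.values())
    return {"supported": supported, "supported_and_novel": supported & novel}

# Hypothetical contig identifiers for each line of orthogonal evidence.
evidence = {
    "mob_suite": {"c1", "c2"},
    "circular_associated": {"c2", "c3", "c4"},
    "keyword_recognizable": {"c4", "c5"},
    "known_plasmid_hit": {"c1"},
}
novel = {"c3", "c5", "c6"}  # no hit to any database sequence
summary = summarize_support(evidence, novel)
```

Here five distinct contigs have support even though the four sets hold eight entries in total, and two of the five are also novel.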

We further investigated the subset of predictions that were highly similar to a sequence in NCBI and categorized matches as either known plasmids (26.9%), chromosomes (21.3%), viruses (0.6%), or an unclear type of sequence (51.2%). A total of 189 predictions matched a known virus. Of these, 110 were recognized as plasmids by MOB-suite or keywords but also contained virus-related COG or Pfam functions, as indicated by the keywords ‘virus’, ‘viral’, and ‘phage’. These predictions carry both plasmid and viral features, a phenomenon that has previously been reported59–62. Surprisingly, 808 predictions that matched a known chromosome were also circular-associated and recognized by MOB-suite or plasmid keywords. One explanation of these data is that these plasmids can switch between an extrachromosomal and a chromosome-integrated state.

While we have identified plasmids based on a score of >0.5, a stricter threshold could be used to filter for more confident predictions. For example, we identified a subset of 24,614 predictions with a score of >0.9. These high-scoring predictions were more likely to match known plasmids and less likely to match known chromosomes in NCBI, compared with predictions with a lower score between 0.5 and 0.9 (Figure S2C). High-scoring predictions also tended to be longer and enriched for circular sequences (Figure S14), suggesting that they are less likely to be assembly fragments. Nonetheless, a stricter threshold comes with an inevitable cost of not only removing noise but also bona fide plasmids. This tradeoff is most visible in cross-validation, where a threshold of >0.5 lies at an inflection point in the precision-recall curve of Figure 1D (with a precision of 0.850 and recall of 0.500). While applying a stricter threshold of >0.9 would provide a modest increase of 13% in precision (to 0.920), it would substantially decrease recall by 44% (to 0.280). As our understanding of plasmid diversity in metagenomes is greatly underdeveloped, we decided that a threshold of >0.5 provides a reasonable balance between precision and recall, such that the resulting predictions still contain potentially many novel plasmids to advance the field. A good example of this is the long-missed Wolbachia plasmid52, which has a score of 0.73. Furthermore, we found that 31.6% (31,847/100,719) of plasmids with lower prediction scores (between 0.5 and 0.9) had orthogonal support for being plasmids (Table S3).
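
The threshold tradeoff can be made concrete with a small sketch that computes precision and recall at a given score cutoff, optionally with sequence weights. The scores and labels below are made up for illustration and are unrelated to the cross-validation results reported here.

```python
def precision_recall(scores, is_plasmid, threshold, weights=None):
    """Weighted precision and recall when calling score > threshold a plasmid."""
    if weights is None:
        weights = [1.0] * len(scores)
    tp = fp = fn = 0.0
    for score, truth, weight in zip(scores, is_plasmid, weights):
        predicted = score > threshold
        if predicted and truth:
            tp += weight
        elif predicted:
            fp += weight
        elif truth:
            fn += weight
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical scores and ground-truth labels for ten sequences.
scores = [0.95, 0.92, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, True, False, True, False, True, False, True, False]

loose = precision_recall(scores, labels, threshold=0.5)
strict = precision_recall(scores, labels, threshold=0.9)
```

In this toy set the stricter cutoff removes both false positives but also drops half of the true plasmids that the looser cutoff recovered, mirroring the direction of the tradeoff described above.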

Plasmids predicted from metagenomes are found in isolate genomes and can transfer between microbial populations

To experimentally validate our metagenome-derived predictions as true plasmids of the human gut, we developed a pipeline for identifying predictions that (1) are present in human gut microbial isolates, (2) are circular in those isolates, and (3) can be naturally transferred to other microbes. First, we detected 127 of our predicted plasmids in 14 Bacteroides isolate genomes that we sequenced in a previous study63 (Figure 3A). Short-read sequencing of two of these isolates suggested that the predicted plasmids pFIJ0137_1 and pENG0187_1 were circular based on paired-end orientation (Figure 2C). We further confirmed their circularity using long-read sequencing. Following a previously described approach52, we identified and manually confirmed 500 long reads that align completely to a plasmid but not to the host chromosome (Figures 3A and S3). Some of these long reads align across the artificial contig breakpoint, indicating these plasmids are extrachromosomal and circular (see Methods).

Figure 3. Experimental validation of plasmid predictions.

(A) We recruited reads from the sequenced genomes of 14 Bacteroides isolates to determine which isolates contain our predicted plasmids. We further confirmed the presence and circularity of a predicted plasmid (pFIJ0137_1) in the isolate B. fragilis 214 by long read sequencing. Grey circles represent 7 (of 500) long reads that align to pFIJ0137_1. Red triangles designate the beginning of a long read. (B) Transfer of pFIJ0137_1 from B. fragilis 214 to B. fragilis 638R via conjugation and selection on erythromycin- and rifampicin-containing media. (C) Coverage plots showing read recruitment of B. fragilis whole-genome sequencing reads to the pFIJ0137_1 reference sequence, confirming transfer of pFIJ0137_1. Grey are forward-reverse reads, while blue are reverse-forward reads that indicate the circularity of pFIJ0137_1.

Finally, we tested the ability of pFIJ0137_1 to transfer from its host, B. fragilis 214 (one of the 14 isolates from ref. 63), to a well-known laboratory strain, B. fragilis 638R. We designed an experimental setup that takes advantage of the naturally encoded erythromycin resistance (ermR) on pFIJ0137_1 and the rifampicin resistance (rifR) of B. fragilis 638R. Specifically, we first mated isolates in the absence of antibiotics, and then selected for transconjugants on media containing both antibiotics (Figure 3B). While this plasmid lacks conjugation machinery, it contains two relaxases (blue genes in Figure 3A) and thus could be mobilized by different conjugative apparatus in the host cell. Through short-read sequencing of the donor, recipient, and resulting transconjugants, and by employing a read recruitment analysis, we confirmed that pFIJ0137_1 transferred from B. fragilis 214 to B. fragilis 638R (Figure 3C). This analysis also confirmed the circularity of pFIJ0137_1 in B. fragilis 214 and both B. fragilis 638R transconjugants. Apart from a 68 bp deletion, pFIJ0137_1 in B. fragilis 214 (isolated in Chicago, USA) was identical to the pFIJ0137_1 version assembled from a Fijian metagenome, suggesting a relatively recent transfer of this plasmid between unrelated human populations. These experimental results show that while PlasX identifies plasmids solely based on genetic architecture, it is capable of predicting plasmids that have canonical features of being extrachromosomal, circular, or transmissible between cells.

Novel plasmids are highly prevalent, reflect human biogeography, and are unexplained by microbial taxonomy

Next, we sought to characterize the ecology of plasmids across human populations through metagenomic read recruitment. For this task, we first dereplicated the entire collection of reference and predicted plasmid sequences, where we assumed that any pair of plasmid sequences was redundant if at least 90% of either sequence aligned to the other with over 90% sequence identity. This analysis found 68,350 and 11,121 non-redundant sequences in the set of 226,194 predicted and 16,827 reference plasmids, respectively. Then, we used the non-redundant sets of plasmids to recruit reads from the 1,782 globally distributed human gut metagenomes. We labeled a plasmid as present in a metagenome if its ‘detection’ was ≥0.95, where detection is the fraction of the sequence covered by at least one read (see Methods).
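
The two criteria in this paragraph, redundancy and detection, can be written down as a minimal sketch. It assumes a single pairwise alignment summarized by one aligned length and one identity value, and a per-base coverage vector for each plasmid; both input formats are assumptions made for illustration.

```python
def is_redundant(aligned_len, identity, len_a, len_b, min_cov=0.90, min_identity=0.90):
    """Redundant if at least 90% of either sequence aligns with over 90% identity."""
    covered = max(aligned_len / len_a, aligned_len / len_b)
    return covered >= min_cov and identity > min_identity

def detection(per_base_coverage):
    """Fraction of positions covered by at least one read."""
    if not per_base_coverage:
        return 0.0
    covered = sum(1 for depth in per_base_coverage if depth >= 1)
    return covered / len(per_base_coverage)

def is_present(per_base_coverage, min_detection=0.95):
    """Label a plasmid as present in a metagenome when detection >= 0.95."""
    return detection(per_base_coverage) >= min_detection

# Hypothetical coverage vectors for a 100 bp toy sequence.
well_covered = [3] * 96 + [0] * 4
patchy = [1] * 94 + [0] * 6
```

Detection ignores how deep the coverage is, so a plasmid recruiting many reads to a small conserved region is not called present, while a low-abundance plasmid covered along its whole length is.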

Our read recruitment analysis revealed that predicted plasmids were much more prevalent across human populations than reference plasmids. For instance, only 1.9% (211) of reference plasmids were present in at least two individuals in our dataset, suggesting the limited ecological relevance of reference plasmids to naturally occurring gut microbial communities. Indeed, many reference plasmids were isolated from a relatively small number of pathogens, such as Escherichia coli, Salmonella enterica, Pseudomonas aeruginosa, Klebsiella pneumoniae, and Vibrio cholerae, which are unlikely to be abundant in healthy humans. In contrast, 63.1% (43,114) of the predicted plasmids were present in at least two individuals (Figure S2D). Moreover, of the most highly prevalent plasmids found in ≥100 individuals, 99.7% (5,400/5,414) were predicted plasmids while only 0.3% (14/5,414) were reference plasmids.

The prevalence of predicted plasmids suggests that they capture the biogeography and lifestyles of human populations more effectively than reference plasmids. To confirm this, we performed agglomerative clustering to construct a dendrogram that organizes metagenomes based on their plasmid content. Using reference plasmids for this clustering, we found that only 50.2% of metagenomes were arranged next to another metagenome from the same country. In contrast, using predicted plasmids (Figure 4B) resulted in 74.0% of metagenomes arranged that way. We also organized metagenomes using a dimensionality reduction of predicted plasmids. This analysis shows that industrialized versus non-industrialized metagenomes can be distinguished solely by their plasmid content (Figure 4C). Dimensionality reduction also showed country-specific clustering (Figure S4).
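
One way to read the statistic above is as the fraction of dendrogram leaves that have an adjacent leaf from the same country. The sketch below computes that quantity from a leaf ordering; treating "arranged next to" as adjacency in the leaf order is an assumption, and the country labels are hypothetical.

```python
def same_country_neighbor_fraction(countries_in_leaf_order):
    """Fraction of leaves with at least one adjacent leaf from the same country."""
    n = len(countries_in_leaf_order)
    if n < 2:
        return 0.0
    hits = 0
    for i, country in enumerate(countries_in_leaf_order):
        left = countries_in_leaf_order[i - 1] if i > 0 else None
        right = countries_in_leaf_order[i + 1] if i < n - 1 else None
        if country == left or country == right:
            hits += 1
    return hits / n

# Hypothetical leaf order of six metagenomes.
leaf_order = ["USA", "USA", "Fiji", "China", "China", "Fiji"]
fraction = same_country_neighbor_fraction(leaf_order)
```

In this toy ordering the two USA and two China leaves sit beside a compatriot while both Fiji leaves do not, giving 4 of 6.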

Figure 4. Global plasmid ecology.

(A) Read recruitment of human gut metagenomes to 11,121 non-redundant reference plasmids. The heatmap shows the 338 plasmids that are present in at least one metagenome (≥0.95 detection). (B) Read recruitment to 68,350 non-redundant predicted plasmids. The heatmap shows the 1,000 most prevalent plasmids that are present in at least one metagenome and have PlasX score ≥0.75. In A and B, column colors indicate country of origin and lifestyle (industrialized or non-industrialized). (C) Clustering of metagenomes based on the predicted plasmids that are present, using the UMAP dimensionality reduction method73. Metagenomes from industrialized or non-industrialized populations are colored red or blue, respectively.

These results have parallels with previous studies that found associations between gut microbiota, as characterized by microbial taxonomy, and the geography and lifestyles of human populations64–67. As geography is correlated to both plasmids and taxonomy, we wondered how many of our 68,350 plasmids are ecologically associated with and therefore explained by taxonomy. On one hand, such associations may be strong because plasmids are symbionts that rely on host machinery for replication and can have a narrow host range. On the other hand, such associations may be weak or nonexistent for two reasons. Some plasmids are known to have a range of multiple hosts68–70, which might not be neatly defined by a single species or even higher taxonomic category such as genus or phylum. Additionally, plasmids are often nonessential elements that can be gained or lost, such that nearly identical microbes can differ by the presence or absence of a plasmid or in the number of plasmid copies. Here, we systematically examined the ecological associations between plasmids and taxonomy to determine if plasmids comprise an independent component of microbial systems.

For every plasmid, we inferred its most likely host as the taxonomic group that had the most similar ecological distribution (see Methods). We surveyed taxonomic groups across all levels, from subspecies and species to class and phylum. We used two different formulas to calculate the ecological similarity between a plasmid and potential host: (1) the correlation in the abundance levels of the plasmid and host across metagenomes, and (2) how often the plasmid and host are found together in the same metagenome, measured by the Jaccard index. Although some predicted plasmids had a high ecological similarity with their best matching taxonomic group, the vast majority of predicted plasmids had low similarity scores (median correlation = 0.04, median Jaccard = 0.21) (Figures S5A and S5B). We also observed low similarity scores even for reference plasmids that were isolated from a defined microbial host (Figures S5C and S5D). For example, the plasmid pDOJH10S and its cognate host, Bifidobacterium longum, were present together in 10 metagenomes; however, 27 and 69 metagenomes contained only the plasmid or only the host, respectively (Figure S5E).
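The two similarity measures described above can be sketched as follows. This is a minimal illustration, not the exact formulas from the Methods: it assumes a Pearson correlation for abundance and a standard Jaccard index for co-occurrence, and the function names are hypothetical. The co-occurrence counts for pDOJH10S and B. longum are the ones quoted in the text.

```python
from math import sqrt


def jaccard(both, only_a, only_b):
    """Jaccard index from co-occurrence counts across metagenomes."""
    union = both + only_a + only_b
    return both / union if union else 0.0


def pearson(x, y):
    """Pearson correlation of two equal-length abundance vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / sqrt(var_x * var_y)


# pDOJH10S and B. longum: together in 10 metagenomes,
# plasmid only in 27, host only in 69 (counts from the text).
pdojh10s_jaccard = jaccard(both=10, only_a=27, only_b=69)
```

With these counts the Jaccard index is 10/106, or about 0.09, which illustrates how weakly the plasmid tracks its cognate host.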

Overall, our findings suggest that plasmids are a highly complex and prevalent feature of microbiomes (Figures 4A, 4B, and S2D), forming an ecological dimension that can stratify human populations (Figures 4C and S4). With current methods of analysis, this stratification cannot be explained by microbial taxonomy alone (Figure S5). While high-throughput analyses of human gut microbiomes often focus on taxonomic features, it has been challenging to find significant or reproducible taxonomic associations that distinguish health and disease states71, 72. As plasmids often carry key determinants for survival in an environment, we propose that systematic analysis of plasmid ecology is necessary to develop a complete understanding of the human microbiome.

Plasmid systems organize evolutionarily related plasmids by distinguishing backbone versus cargo content

Our large collection of predicted plasmids provides an unprecedented opportunity to study evolutionary patterns in plasmids and the extent to which they occur ecologically. Due to frequent genetic rearrangements, a hallmark of plasmid evolution is the reuse of a backbone and emergence of varying cargo/accessory genes13, 15, 17, 18. The backbone typically encodes machinery necessary for plasmid maintenance, while the cargo represents additional genetic content, such as antibiotic resistance or other fitness-determining functions. While backbones can be examined experimentally, most studies have identified them computationally. Nonetheless, there are four major challenges to this computational task. First, there is a lack of consensus across studies on how to identify a plasmid backbone, with varying definitions based on nucleotide identity13, 14, 75, gene similarity18, or gene annotations76–78. Second, these methods do not verify that an identified backbone encodes a sufficient set of functions for plasmid replication. Third, these methods are typically designed to analyze a small set of plasmids in a single study or dataset. Finally, scaling methods to identify backbones in metagenomic data introduces extra complications related to plasmid redundancy and assembly fragments that could inflate the number of predicted backbones.

We designed a scalable algorithm called MobMess (Mobile Element Systems) to study backbone structure in our collection of plasmids. Compared to previous methods, MobMess has the advantage of being able to simultaneously compare sequences without relying on gene annotations and to handle metagenomic issues of redundancy and fragmentation (see Methods). First, MobMess calculates pairwise alignments across all plasmids to build a sequence similarity network, in which a directed edge represents the containment of one plasmid within another (defined by ≥90% sequence identity and ≥90% coverage of the smaller plasmid) (Figures 5A and S8). Next, MobMess recognizes and collapses redundancy between plasmids. Finally, MobMess analyzes patterns of connectivity in the network to define and identify ‘backbone plasmids’ that satisfy two criteria. First, the backbone plasmid must be a circular element, inferred here by paired end orientation (Figure 2B), to ensure that it is not an assembly fragment and, importantly, that the genes present are sufficient for plasmid replication. Second, a backbone plasmid must be found as a subsequence within one or more ‘compound plasmids’. These compound plasmids are composed of the backbone and additional cargo, indicating the ability to acquire or lose genes.
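The three steps above can be sketched on a toy containment network. This is a simplified illustration under stated assumptions, not the MobMess implementation: containment edges are taken as given (already thresholded at ≥90% identity and coverage), plasmids contained in each other in both directions are treated as redundant and merged, and a cluster counts as a backbone if any of its members is circular and it is contained in a different cluster. All names and the toy data are hypothetical.

```python
def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def plasmid_systems(circular, contained_in):
    """Map each backbone cluster to the compound clusters that contain it.

    circular: dict of plasmid name -> True if the contig is circular.
    contained_in: set of (small, large) pairs, small contained in large.
    """
    parent = {name: name for name in circular}
    # Step 1: containment in both directions means redundancy; merge.
    for small, large in contained_in:
        if (large, small) in contained_in:
            a, b = find(parent, small), find(parent, large)
            if a != b:
                parent[b] = a
    # Step 2: record members and circularity of each collapsed cluster.
    members, has_circular = {}, {}
    for name, is_circ in circular.items():
        root = find(parent, name)
        members.setdefault(root, []).append(name)
        has_circular[root] = has_circular.get(root, False) or is_circ
    label = {root: tuple(sorted(names)) for root, names in members.items()}
    # Step 3: a circular cluster contained in another cluster is a backbone.
    systems = {}
    for small, large in contained_in:
        a, b = find(parent, small), find(parent, large)
        if a != b and has_circular[a]:
            systems.setdefault(label[a], set()).add(label[b])
    return systems


# Toy data (hypothetical): one backbone with a duplicate, two compound
# plasmids, and a linear fragment that must not be called a backbone.
circular = {"bb1": True, "bb1_dup": False, "cmp1": False,
            "cmp2": True, "frag": False}
contained_in = {("bb1", "bb1_dup"), ("bb1_dup", "bb1"),
                ("bb1", "cmp1"), ("bb1", "cmp2"),
                ("bb1_dup", "cmp1"), ("frag", "cmp1")}
systems = plasmid_systems(circular, contained_in)
```

The linear fragment is contained in a compound plasmid but is excluded because it is not circular, which mirrors the first backbone criterion in the text.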

Figure 5. Identification of plasmid systems.

(A) Network diagram of a plasmid system. (B) Distribution of model coefficients for backbone vs. cargo genes in the non-redundant set of 68,350 predicted plasmids. We excluded genes that lacked gene family annotations and thus have a coefficient of zero by default. We also excluded genes that were labeled as backbone with respect to some systems but cargo in others. (C) Network of all plasmid systems that contain ≥3 non-redundant and high-confidence plasmids. Only these types of plasmids are shown. (D) Genetic architecture of plasmids in PS486, encased by a red box in C. Two plasmids in C are excluded. The system’s backbone (assembled from metagenome MON0062) encodes 5 backbone genes (colored gray). Rib.syn.=riboflavin biosynthesis, CT=conjugative transfer, mob=mobilization, T=toxin, AT=anti-toxin, tet=tetracycline resistance, erm=erythromycin resistance, transp.=transposon, hist. kin.=histidine kinase.

Together, we define a backbone and its compound plasmids as an evolutionary unit called a ‘plasmid system’ (Figure 5A). This definition of plasmid systems enables a formal categorization of plasmids into evolutionarily cohesive groups and facilitates analyses of backbone versus cargo content and their ecology, much in the same way that pangenomes enable studies of core versus accessory gene content in microbial genomes. However, plasmid systems are a specific case of pangenomics, as it is unlikely to find a naturally occurring microbial genome composed only of core genes. In contrast, backbone plasmids represent a minimal entity that can propagate using only backbone genes. MobMess provides an automated framework and standardized vocabulary to study this concept across different studies and datasets.

To define containment of plasmids within each other, we found that ≥90% alignment identity and coverage was a natural threshold for two reasons. First, we examined the histogram of similarities between all pairs of predicted plasmids, revealing an average nucleotide identity (ANI) “valley” at around 85–90% identity (Figure S12A), although, similar to viruses79, this drop was not as pronounced as those observed in the ANI between distinct bacterial taxa80. Second, we re-ran MobMess using varying thresholds. As the threshold was made stricter, plasmids gradually separated into distinct clusters, and consequently the number of non-redundant plasmids increased (Figure S12B-C). This growth in non-redundant plasmids occurred at a mostly constant rate from a threshold of 10% to 90%, but it suddenly accelerated from 90% to 100%. These results suggest that a threshold stricter than ≥90% (e.g. ≥95% or ≥99%) would split highly similar plasmids into separate clusters.

Other methods have recently been developed to cluster thousands of plasmids81, 82, but unlike MobMess, they are not designed to identify plasmid systems or analyze metagenomic data. To compare methods, we ran MobMess on the same set of 9,894 reference plasmids analyzed by Redondo-Salvo et al81 (Figure S6). In their study, Redondo-Salvo et al. constructed a plasmid similarity network with 79,727 edges. However, these edges span a wide range of similarity levels, where 66.5% of edges represent an alignment that covers <90% of either sequence (≥10% is not aligned) and 19.0% of edges have <70% alignment coverage (≥30% is not aligned). In contrast, MobMess applies a stricter threshold of ≥90% coverage to construct a smaller but more refined set of 39,680 edges (connecting 25,270 unique pairs of plasmids). Moreover, Redondo-Salvo et al.’s edges are undirected, while MobMess’s edges are directed to track smaller versus larger sequences. Retaining this extra information allowed MobMess to distinguish between the 10,860 pairs (43.0%) with unidirectional connections, representing a backbone contained in a compound plasmid, versus the 14,410 pairs (57.0%) with bidirectional connections, representing nearly identical plasmids.
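The distinction between unidirectional and bidirectional connections can be made concrete with a short sketch. It assumes directed edges are given as (contained, container) pairs with no self-loops; the function name and toy edges are hypothetical. The final lines check that the counts reported above are mutually consistent: each bidirectional pair contributes two directed edges and each unidirectional pair contributes one.

```python
def classify_pairs(edges):
    """Split connected plasmid pairs into unidirectional and bidirectional."""
    edges = set(edges)
    unidirectional, bidirectional = set(), set()
    for a, b in edges:
        pair = frozenset((a, b))
        if (b, a) in edges:
            bidirectional.add(pair)   # nearly identical plasmids
        else:
            unidirectional.add(pair)  # one contained in the other
    return unidirectional, bidirectional


# Toy network (hypothetical): A and B are near-identical; C sits in both.
toy_edges = [("A", "B"), ("B", "A"), ("C", "A"), ("C", "B")]
uni, bi = classify_pairs(toy_edges)

# Consistency of the reported counts: pairs and directed edges.
reported_pairs = 10_860 + 14_410
reported_edges = 10_860 + 2 * 14_410
```

Here `reported_pairs` equals 25,270 and `reported_edges` equals 39,680, matching the numbers in the text.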

Besides network construction, these methods also diverge in how they conceptually organize plasmids. MobMess dereplicates the 9,894 plasmids into 7,132 non-redundant sequences and then organizes them into 1,044 plasmid systems. In contrast, Redondo-Salvo et al. identified 641 clusters, or ‘PTUs’81. We found that 135 PTUs did correspond one-to-one to a plasmid system in MobMess, but the other PTUs spanned a wide range of evolutionary relations. At one extreme, 251 PTUs were simple sets of nearly identical plasmids, representing recent and strong relations. At the other extreme, 45 PTUs were complex mixtures of distinct plasmid systems, representing distant and weak relations. For example, the largest PTU contained 2,460 plasmids, which MobMess further dissected into 1,481 non-redundant plasmids and 461 plasmid systems. Figure S7 demonstrates one such plasmid system, where MobMess precisely connects the system’s backbone to its compound plasmids in a “star”-like topology, while the approach by Redondo-Salvo et al. connects almost every pair of these plasmids to each other, which obfuscates the internal organization of the plasmid system. Perhaps this is in part because the method by Redondo-Salvo et al. and another related method by Acman et al.82 have only been tested on reference plasmids that have been completely assembled, while MobMess is designed to handle metagenomic data by distinguishing between fragmented versus complete (circular) plasmids.

MobMess identifies 1,169 plasmid systems with conserved backbones and a wide repertoire of cargo functions

We ran MobMess on our predicted plasmids and identified a total of 1,169 plasmid systems, naming them PS1 (plasmid system #1) to PS1169. While plasmid systems captured a small fraction of the genetic diversity among non-redundant plasmids (6.5%, or 4,424/68,350), they captured a large fraction of all circular plasmid contigs (72.7%, or 14,285/19,652) (see Methods, Table S4). Plasmids that were part of a system tended to be longer and were more likely to be circular than plasmids that were not part of any system (Table S12). The requirement to be included in a plasmid system is that the sequence must not only be predicted as a plasmid (with score >0.5), but that there must also be at least one other predicted plasmid that shares the same backbone. Thus, while we previously applied a loose score threshold of >0.5 instead of >0.9 to identify plasmids, MobMess provides an independent de novo filter for plasmids with higher confidence. Indeed, we found that 16,663 plasmids with scores between 0.5 and 0.9 are part of a system.

Plasmid systems were highly heterogeneous in their genetic complexity. 37 plasmid systems contained sequences that could be classified among 7 different plasmid incompatibility types (Inc11, Inc18, IncFIB, IncFIC, IncI-gamma/K1, IncK2/Z, IncW) (Table S4). 602 plasmid systems contained at least 2 non-redundant compound plasmids, with the largest system containing 168 non-redundant compound plasmids (Figure 5C). For example, pFIJ1037_1, the plasmid we isolated and transferred between B. fragilis organisms, was part of PS486, a system containing 24 non-redundant plasmids and found across a total of 127 metagenomes. PS486’s backbone consists of a replication protein and a toxin-antitoxin system, and the cargo genes include beta-lactamases, erythromycin resistance, tetracycline resistance and riboflavin biosynthesis (Figure 5D, Table S5).

To understand how much genetic content is typically conserved or variable in a plasmid system, we calculated the percentage of genes on compound plasmids that were backbone genes versus cargo genes (see Methods). Plasmid systems spanned a wide range of cargo gene percentages between 0% and 100%, with a median value of 40% (Figure S15). Conversely, the median backbone percentage was 60%. PlasX often assigned higher model coefficients to backbone genes in the non-redundant set of predicted plasmids, suggesting these genes define the ‘essence’ of a plasmid by encoding essential functions that promote the ability of a plasmid to exist as a distinct element from the chromosome, such as the genes for plasmid replication, repA (PF01051), and mobilization, mobA (PF03432) (Figure 5B). In contrast, PlasX assigned lower coefficients to cargo genes, suggesting they encode functions that are not universally essential but important for specific niches, such as nitrogenase reductase, nifH (PF00142), and membrane transport, ompA (PF00691). Indeed, 24.1% (2,169/8,995) of backbone genes versus 13.4% (3,229/24,168) of cargo genes encoded COG and Pfam functions with descriptions related to plasmid replication, transfer, and maintenance (see Methods).
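A minimal sketch of the cargo percentage calculation follows. It assumes that each gene on a system's compound plasmids has already been labeled as backbone or cargo and that genes are pooled per system; the Methods may aggregate differently. The system names and gene labels are hypothetical toy data.

```python
from statistics import median


def cargo_percentage(gene_labels):
    """Percentage of genes labeled 'cargo' among a system's pooled genes."""
    if not gene_labels:
        return 0.0
    cargo = sum(1 for label in gene_labels if label == "cargo")
    return 100.0 * cargo / len(gene_labels)


# Toy systems (hypothetical labels, not from the study).
toy_systems = {
    "PS_a": ["backbone"] * 3 + ["cargo"] * 2,
    "PS_b": ["backbone"] * 4 + ["cargo"] * 1,
    "PS_c": ["backbone"] * 1 + ["cargo"] * 4,
}
percentages = {name: cargo_percentage(g) for name, g in toy_systems.items()}
median_cargo = median(percentages.values())
median_backbone = 100.0 - median_cargo
```

Because every gene is either backbone or cargo, the two percentages for a given system sum to 100%.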

The most frequent type of function encoded on cargo genes was antibiotic resistance, including efflux pumps, which can provide general resistance to multiple antibiotics, and genes targeting specific classes of antibiotics, such as glycopeptides and beta-lactams (Figure 6A). This large-scale observation is consistent with numerous examples of known plasmids encoding resistance and further illustrates how the widespread presence of these plasmids poses a public health threat85–88.

Figure 6. Functional and ecological variation of plasmid systems.

(A-B) The number of compound plasmids that encode antibiotic resistance (A) and COG pathways (B) in cargo genes. Also shown are the numbers of metagenomes and countries that contain those plasmids. In B, we show the 20 COG pathways that have the highest number of compound plasmids. To avoid redundancy with A, we exclude COG pathways that occur in cargo genes encoding antibiotic resistance. (C) Prevalence of plasmid systems versus the individual plasmids in those systems. (D) Distribution of plasmid systems based on the number of industrialized and non-industrialized countries they are found in. (E) Recoloring of the network of plasmid systems shown in Figure 5C. Colors indicate whether a plasmid occurred in only industrialized, only non-industrialized, or both types of countries. A green ring indicates a plasmid encoding antibiotic resistance. (F) Compound plasmids from PS974 that encode resistance to chloramphenicol (chlor), tetracycline (tet), or erythromycin (erm). 6/9 plasmids are circular. Dark gray genes are the backbone; light gray are cargo not related to antibiotic resistance. AGAT=aminoglycoside adenylyltransferase. OD=oxaloacetate decarboxylase. PS974 is found in 22 Fijian and 1,408 non-Fijian metagenomes. The pictogram on the right-hand side represents these metagenomes using two shapes: circles (Fijian) and triangles (non-Fijian). For each plasmid, circles are colored blue to represent the proportion of the 22 Fijian metagenomes that contain the plasmid. Similarly, triangles are colored red to represent the proportion of the 1,408 non-Fijian metagenomes that contain the plasmid.

Other highly prevalent cargo functions included a wide diversity of cellular and metabolic pathways defined in the COG (Figure 6B) and KEGG databases (Figure S9). The most enriched among these was tRNA modification, encoded in 35 compound plasmids within different systems. For example, the globally prevalent system PS1110 (present in 739 metagenomes) contained 291 compound plasmids (27 non-redundant), three of which encoded an enzyme that performs tRNA Gm18 2’-O-methylation (COG0566) and were collectively present in 498 metagenomes (Figure S10, Table S5). This enzyme is thought to reduce the immuno-stimulatory nature of bacterial tRNA, which is detected by Toll-like receptors (TLR7) of the mammalian innate immune system89, 90. While plasmids in some pathogens are known to facilitate bacterial evasion of the mammalian immune system by regulating surface proteins91, the high prevalence of tRNA modification enzymes in our data suggests a previously unappreciated role for plasmids in increasing the fitness of their bacterial hosts against the surveillance of the human immune system. Distinguishing between the fundamental structure of a plasmid (backbone genes) versus the genetic currency that is exchanged (cargo genes) allowed us to organize the extensive plasmid diversity within plasmid systems and to recognize the recurrent evolution of the same cargo functions across different systems.

Plasmid systems adapt their cargo genes to specific environments

Thus far, we have observed that our collection of plasmids is highly heterogeneous in their ecological distributions (Figure 4B), yet they can also be organized by evolutionary relations into plasmid systems. To understand how ecology and evolution are intertwined phenomena, we asked whether plasmid systems span a single ecological niche or multiple niches. We assumed that every country represents a different niche, as countries are known to differ in microbial composition67, 92–97 and we observed that countries also differ in plasmid composition (Figure S4). We found that while individual plasmids are often present in a single country, a plasmid system frequently spans multiple countries (Figure 6C, Table S4). Indeed, 2,005 individual plasmids within a system were unique to a single country, yet 1,794 (89.5%) were part of more geographically diverse systems that were present in at least two countries. In fact, 84.0% (982/1,169) of systems were present in at least two countries, and 9 systems were even in 15 of the 16 countries represented in our data. We also found that plasmid systems are typically mutually exclusive, i.e. plasmids from the same system generally do not occur in the same individual human hosts. Consequently, metagenomes often have at most one plasmid from every system (see Methods, Figure S11A).

One explanation of this mutual exclusivity pattern is that a person may have never been exposed to multiple plasmids from the same system. This scenario is plausible for rare systems (e.g. present in <10 metagenomes). Indeed, we observed a trend where more prevalent systems exhibit less mutual exclusivity (R² = 0.39) (Figure S11B). However, we identified many prevalent systems (e.g. present in >100 metagenomes) that are more mutually exclusive than expected. For instance, PS961 is almost perfectly mutually exclusive across 131 metagenomes (Figure S11C). For such prevalent systems, a more likely explanation lies in either the backbone or cargo content of a system. As 24.1% of backbone genes encoded Pfams and COGs related to replication, transfer, or maintenance of plasmids, the competition for these resources within a cell can lead to incompatibility between plasmids of the same system98. With a large number of plasmid sequences at hand, our dataset is a new resource that can support experimental investigations of the impact of incompatibility on plasmid replication99, 100, especially its contribution relative to external selective pressure on cargo genes. However, here we focused our attention on investigating ecological associations between environmental selection and cargo gene content.

We propose that our algorithmic definition of plasmid systems can be used to study how ecological pressures on plasmids drive the evolution of cargo genes. Plasmid systems are akin to the concept of a genetic ‘delivery van’ (backbone) with the flexibility to disseminate variable ‘packages’ (cargo genes) to microbes. The dynamic pool of cargo genes could serve as a means for a plasmid, or its microbial host, to increase fitness in different environmental conditions. To investigate this hypothesis, we compared the evolution of plasmids in industrialized versus non-industrialized countries, an environmental difference that was reflected in plasmid content (Figure 4C). While many plasmid systems were exclusive to one or the other type of country, 396 of them were present in both types (Figure 6D). These global systems provide a unique in silico framework to explain environmental differences by variation in plasmid cargo genes.

To demonstrate this framework, we examined antibiotic usage, an extreme environmental difference that is well known to exert selective pressures on microbes, often causing them to maintain plasmids with antibiotic resistance101–108. The well-studied nature of antibiotic resistance also provides a testbed to demonstrate that plasmid systems can identify cargo genes under selection. In our data, the evolution of antibiotic resistance in a plasmid system coincided with the ecological variation of compound plasmids in the system. Specifically, we identified 24 high-confidence, compound plasmids that encoded antibiotic resistance in cargo genes and were exclusively present in either non-industrialized or industrialized countries (Figure 6E). Among non-industrialized metagenomes, one of the most common types of antibiotic resistance is chloramphenicol (Figure 6A). For instance, PS974 is highly diverse with 97 non-redundant plasmids; however, this system possesses chloramphenicol resistance (conferred via an acetyltransferase) only in plasmids assembled from Fiji (Figure 6F, Table S5). When we searched for these resistance plasmids across the global set of 1,430 metagenomes that contain PS974, we found them in 19/22 Fijian metagenomes but only in 1/1,408 non-Fijian metagenomes (p = 1.1 × 10⁻¹³, Fisher’s exact test) (Figure 6F, pictogram). Chloramphenicol is routinely prescribed in Fiji to treat eye infections, central nervous system infections, periodontitis, shigellosis, typhoid and paratyphoid fevers, and diabetic foot infections, but it is rarely used in North America and Europe109–112. Thus, chloramphenicol resistance in this system likely reflects the increased exposure of Fijians to this antibiotic. While we observed that chloramphenicol-resistant plasmids appeared specific to Fiji, more extensive sampling may reveal the presence of these plasmids in other non-industrialized countries that also have high usage of this antibiotic.

Besides chloramphenicol resistance, PS974 also contained non-Fijian compound plasmids that carry tetracycline resistance (171/1,408 metagenomes) or erythromycin resistance (429/1,408 metagenomes) (Figure 6F). In an attempt to find an alternative explanation for the distribution of resistance plasmids in this system, we revisited our earlier question “Can plasmid ecology be simply explained by taxonomy?”. By searching for these plasmids among known sequences in NCBI, we determined that possible microbial hosts include Firmicutes, such as Blautia hydrogenotrophica. However, neither these hosts nor any other microbial taxon had an ecological distribution similar to that of any of the resistance plasmids (highest Jaccard index=0.37 across all plasmid-taxon comparisons). These results suggest that compound plasmids in systems have acquired antibiotic resistance to respond to lifestyle-specific usage of antibiotics.

While the connection between antibiotic usage and resistance is expected given previous studies101–108, plasmid systems in general can be used to determine if cargo genes are under selection, even when the functions of those genes or the environmental pressures driving the selection are unknown. For example, the cargo gene encoding tRNA modification in system PS1110 may provide some immunoevasive function (Figure S10), but there was not a clear geographic or lifestyle-specific association with this function. This is an example where we know the cargo function but not the environmental pressure and motivates collecting additional data about the environment. Overall, our work provides a computational roadmap for generating new hypotheses about plasmid evolution on an omics scale.

Discussion

Our work greatly expands the number of known plasmids by mining a global collection of metagenomes using machine learning. This expansion provides the community with a new resource to study fundamental concepts in plasmid biology. While there are many applications of our resource, here we focused on organizing plasmids into cohesive units known as plasmid systems to gain deeper insights into plasmid ecology and evolution. For instance, the diversity captured by our large collection of 1,169 plasmid systems reveals the great extent to which plasmids in complex ecosystems like the human gut are not static entities but actively evolving in response to the environment. By revealing likely determinants of fitness, such as the acquisition of specific antibiotic resistance genes in Fiji as a response to a commonly used antibiotic, plasmid systems serve as a hypothesis generation and testing tool to study forces that drive plasmid evolution and influence the ecology of hosts that carry them. Our study has focused on geographical or lifestyle-based environmental differences, but more generally, our analysis of plasmid systems can be applied to other contexts such as discerning cargo genes that distinguish healthy vs. disease states of the gut microbiome.

During the past few decades, our ability to bypass the limitations of cultivation and study microbial genomes derived from metagenomes has led to key biotechnological insights113–116. The malleability of plasmids is a desirable property in bioengineering and often has motivated the repurposing of naturally occurring plasmids into major tools for genetically modifying organisms. In this vein, we propose computational prediction and analysis of plasmids as an attractive approach to expand the toolkit of available plasmids for genetic engineering, particularly if they can be found in isolates that currently lack tools to make them genetically tractable.

PlasX and other plasmid recognition systems30, 31, 33, 36, 37, 46, 117–119, along with MobMess to characterize plasmid systems, present a roadmap for a detailed characterization of naturally occurring plasmids. Historically, plasmids and other genetic elements have been characterized on the basis of qualitative properties and descriptions. In contrast, PlasX and machine learning approaches provide an “operational definition” of a plasmid that can be universally and objectively applied. As some of our predicted plasmids contain virus-like or ICE-like signatures, our work can be used to study the spectrum of mobile elements that blur traditional labels and complements recent efforts to characterize viruses120–123 and horizontally transferred elements124–126.

To expand the scope of our work, we intentionally designed PlasX using a broad collection of reference sequences, so that it can be applied to study any environment and can include additional training sequences to improve accuracy. These methods provide a complementary approach to frequently used state-of-the-art workflows to study the taxonomic composition or functional potential of environmental or host-associated microbiomes through amplicon sequences or metagenomes. Overall, our findings suggest that high-throughput recognition and characterization of plasmids in microbiome studies are necessary for more complete insights into the ecology of naturally occurring microbial systems.

Methods

Compiling and annotating a reference set of plasmids and chromosomes

We obtained a list of 16,168 plasmids from the March 5, 2019 version of PLSDB127. We also downloaded the entire collection of 13,471 complete bacterial genome assemblies from NCBI RefSeq on October 26, 2019, using instructions at https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#allcomplete. The RefSeq assemblies contained 26,376 contigs, of which we discarded 11,350 that are also in PLSDB. The reference set of 16,827 plasmids consisted of 16,168 PLSDB contigs, as well as 659 contigs from the RefSeq assemblies that were labeled as ‘Plasmid’ in the ‘Assigned-Molecule-Location/Type’ field of the NCBI assembly report. The reference set of chromosomes was the remaining 14,367 RefSeq contigs.

To identify and annotate genes in these sequences, we used the program ‘anvi-run-workflow’ with ‘--workflow contigs’ implemented128 in anvi’o129 v7.1, which uses Snakemake130 to execute previously defined steps (https://merenlab.org/anvio-workflows/) to generate anvi’o contigs-db files (https://anvio.org/m/contigs-db). These steps include first running Prodigal131 to call genes and then running DIAMOND v2.0132 and HMMER v3.3133 on amino acid sequences to determine gene functions against the Cluster of Orthologous Groups of proteins (COGs)42 and Protein Family Database models (Pfams) v32.043, respectively. To minimize noise, we used an e-value cutoff of 10⁻¹⁰ for COGs and the default model scores for Pfams.

Modeling de novo gene families

We inferred de novo gene families by running MMseqs2134 v10.6d92c on all amino acid sequences in our reference plasmids and chromosomes. First, we ran ‘mmseqs clusthash’ to collapse identical sequences into a non-redundant set for faster execution of the next step; the collapsing was inverted at the end to annotate all genes. Next, we ran ‘mmseqs cluster’ to calculate pairwise alignments and then cluster genes that are aligned above a minimum sequence identity threshold (parameter ‘--min-seq-id’). We ran this program multiple times with different thresholds (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05) to infer a wide range of possible families. Families from different thresholds can be redundant, so we merged nested families, i.e. if family X contains all genes in family Y, then we keep X and discard Y. We also discarded any family that contains only one gene. In theory, families inferred from a higher threshold (e.g. 0.9) should always nest within a family inferred from a lower threshold (e.g. 0.05), such that we would discard all families from higher thresholds. In practice, however, families do not always nest within each other; some overlap only partially. After merging, our final model used the following number of families from each threshold.

[Table not reproduced here: number of gene families used from each sequence identity threshold.]

In total, our model used 1,090,132 gene families, which annotated 162,783,114 genes. Note that because these gene families can still overlap with each other, a gene may have multiple annotations. This analysis took advantage of MMseqs2’s parallelism, taking ∼6 hours using 256 CPU cores.
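The merging of nested families described in this section can be sketched as follows. This is a minimal illustration that assumes each family is available as a set of gene identifiers; the function name and the toy family names are hypothetical, and the quadratic subset check shown here would need a more efficient index to scale to about a million families.

```python
def merge_nested(families):
    """Drop single-gene families and families nested within another family.

    families: dict of family name -> iterable of gene identifiers.
    """
    candidates = {name: frozenset(genes)
                  for name, genes in families.items()
                  if len(set(genes)) > 1}
    # Visit larger families first so a family can only nest in a kept one.
    ordered = sorted(candidates.items(),
                     key=lambda item: (-len(item[1]), item[0]))
    kept = {}
    for name, genes in ordered:
        if any(genes <= other for other in kept.values()):
            continue  # all of its genes are already in a kept family
        kept[name] = genes
    return kept


# Toy families (hypothetical) from two identity thresholds.
toy = {
    "t0.5_a": {"g1", "g2", "g3"},
    "t0.9_a": {"g1", "g2"},   # nested in t0.5_a, discarded
    "t0.9_b": {"g3", "g4"},   # overlaps t0.5_a only partially, kept
    "t0.9_c": {"g5"},         # single gene, discarded
}
merged = merge_nested(toy)
```

The family `t0.9_b` survives even though it comes from a higher threshold, which is the partial-overlap case noted above and the reason a gene can carry several annotations.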

We refer to a de novo family as a ‘subfamily’ if 90% or more of its amino acid sequences are also annotated to a specific COG or Pfam. Note that this definition provides a small tolerance such that a subfamily does not need to be a perfect subset of a COG or Pfam. For the example about Pfam PF10609 (Figures 1F and 1G), we gathered the 253 amino acid sequences annotated to PF10609 and the subfamily mmseqs_5_1535552. We also gathered the 1,391 sequences annotated to PF10609 and the subfamily mmseqs_70_40217271. We collapsed 100% identical sequences to yield a total collection of 142 and 310 sequences from mmseqs_5_1535552 and mmseqs_70_40217271, respectively. We aligned all of these sequences together using muscle v3.8.1551 (default parameters)135 and then constructed a maximum likelihood phylogenetic tree using IQ-TREE v2.1.2 (parameters -m TEST -bb 1000 -alrt 1000 -T AUTO)136. We then rooted the tree using the midpoint method.

Subtypes and slicing of reference sequences

To group reference sequences into subtypes, we used mash v2.2.2137 (command ‘mash dist’, sketch size 100000, kmer size 21) to calculate a distance score of 0 to 1 between every pair of sequences. Next, we created an undirected graph, where sequences are nodes and two sequences are connected if their distance is ≤0.1. We defined a ‘subtype’ as one of the 7,326 connected components in the graph. Of these, 3,935 subtypes contained only plasmids, 3,355 contained only chromosomes, and 36 contained both plasmids and chromosomes.
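
As an illustration of how subtypes follow from pairwise distances, the sketch below groups sequences into connected components with a small union-find structure. The function name, the sequence names, and the distance values are hypothetical; in practice the distances would come from the ‘mash dist’ output.

```python
def subtypes_from_distances(names, distances, max_dist=0.1):
    """Connected components of the graph whose edges have distance <= max_dist.

    `distances` maps (name_a, name_b) pairs to a distance between 0 and 1.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), dist in distances.items():
        if dist <= max_dist:
            parent[find(a)] = find(b)  # merge the two components

    components = {}
    for name in names:
        components.setdefault(find(name), set()).add(name)
    # Sort components so the output order is deterministic.
    return sorted(components.values(), key=lambda c: sorted(c))


names = ["p1", "p2", "p3", "c1"]
distances = {
    ("p1", "p2"): 0.05,
    ("p2", "p3"): 0.09,
    ("p1", "p3"): 0.20,  # too distant for a direct edge, but linked through p2
    ("p3", "c1"): 0.40,
}
subtypes = subtypes_from_distances(names, distances)
```

Note that p1 and p3 fall in the same subtype even though their direct distance exceeds 0.1, because connected components are transitive.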

We sliced reference sequences into 10kb slices by sliding a 10kb window at 5kb increments. The first window starts at the beginning of the sequence, and the final window stops at the end of the sequence. For instance, a 23kb sequence would be sliced at 0-10kb, 5-15kb, 10-20kb, and 13-23kb. A slice was annotated with any gene that was entirely or partly inside the slice. In total, we generated 10,453,279 slices from the reference chromosomes and 343,246 slices from the reference plasmids.
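
The slicing rule can be written as a short function. This is a sketch rather than the original implementation; the function name is hypothetical, and returning a single slice for sequences shorter than the window is an assumption that the text does not specify.

```python
def slice_windows(length, window=10_000, step=5_000):
    """Return (start, end) coordinates of sliding windows over a sequence.

    The final window is shifted so that it stops at the end of the sequence.
    """
    if length <= window:
        # Assumption: a short sequence yields one slice covering all of it.
        return [(0, length)]
    windows = []
    start = 0
    while start + window <= length:
        windows.append((start, start + window))
        start += step
    if windows[-1][1] < length:
        # Add a last window that ends exactly at the end of the sequence.
        windows.append((length - window, length))
    return windows


# The 23kb example from the text.
example = slice_windows(23_000)
```

For the 23kb example, this yields windows at 0-10kb, 5-15kb, 10-20kb, and 13-23kb.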

Assessing model performance in cross-validation

To perform cross-validation, we randomly divided the 10kb slices into four groups. For each fold of cross-validation, three groups formed the training data, and the fourth group formed the testing data. In a naive split, we keep all slices from the same reference sequence together in either training or testing data. In an informed split, we keep all slices from the same subtype together.

We assigned weights to the 10kb slices when calculating precision and recall performance (Figure S1F and 1D). Consider the following notation to represent sequences: Embedded Image And consider the following notation to represent weights: Embedded Image We defined two different scenarios for assigning weights. Scenario A satisfies the following conditions:

  1. All slices from the same sequence have the same weight Embedded Image

  2. The weight of every sequence is equal to 1 Embedded Image

Scenario B satisfies the following conditions:

  1. All slices from the same sequence have the same weight Embedded Image

  2. All plasmid (or chromosome) sequences in the same subtype have equal weight Embedded Image

  3. All subtypes have equal weight Embedded Image

  4. The sum of weights across all slices equals the total number of slices Embedded Image

Each scenario implies a unique assignment of weight values. Scenario A requires that every sequence has the same weight. Importantly, this ensures that long sequences, which have disproportionately more slices, have equal weight as shorter sequences. Scenario B further requires that every subtype has the same weight. Importantly, this ensures that subtypes that contain a disproportionately large number of sequences (e.g. subtypes that represent commonly studied bacteria, such as Escherichia, Salmonella, and Klebsiella) have equal weight as subtypes with fewer sequences.
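
To make the two weighting schemes concrete, the following sketch computes slice weights for a toy dataset. It is an illustration under stated assumptions, not the original code: the function names and identifiers are hypothetical, and for Scenario B the sketch treats all sequences in a subtype alike, without handling plasmids and chromosomes separately in mixed subtypes.

```python
from collections import Counter


def scenario_a_weights(slice_to_seq):
    """Scenario A: every sequence has total weight 1, shared by its slices."""
    slices_per_seq = Counter(slice_to_seq.values())
    return {s: 1.0 / slices_per_seq[seq] for s, seq in slice_to_seq.items()}


def scenario_b_weights(slice_to_seq, seq_to_subtype):
    """Scenario B: equal weight per subtype, per sequence within a subtype,
    and per slice within a sequence; weights sum to the number of slices."""
    slices_per_seq = Counter(slice_to_seq.values())
    seqs_per_subtype = Counter(seq_to_subtype[seq] for seq in slices_per_seq)
    subtype_weight = len(slice_to_seq) / len(seqs_per_subtype)
    weights = {}
    for s, seq in slice_to_seq.items():
        seq_weight = subtype_weight / seqs_per_subtype[seq_to_subtype[seq]]
        weights[s] = seq_weight / slices_per_seq[seq]
    return weights


# Toy data: three sequences in two subtypes, six slices in total.
slice_to_seq = {"a1": "seqA", "a2": "seqA", "a3": "seqA",
                "b1": "seqB", "c1": "seqC", "c2": "seqC"}
seq_to_subtype = {"seqA": "st1", "seqB": "st1", "seqC": "st2"}

wa = scenario_a_weights(slice_to_seq)
wb = scenario_b_weights(slice_to_seq, seq_to_subtype)
```

In the toy data, each subtype receives a total weight of 3 under Scenario B, so the single sequence of subtype st2 carries as much weight as the two sequences of st1 combined.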

We evaluated performance under two different cross-validation and weighting scenarios. Figure S1F shows the result of training models using a ‘naive’ cross-validation split and calculating precision/recall using weights from Scenario A. Figure 1D shows the results of training models using an ‘informed’ cross-validation split and calculating precision/recall using weights from Scenario B. We calculated precision/recall using the function sklearn.metrics.precision_recall_curve from the scikit-learn Python package138, with the parameter sample_weight set to the weights of the slices. We calculated AUCPR with the function sklearn.metrics.average_precision_score.

PlasX implementation

We implemented PlasX as a logistic regression using the SGDClassifier class from scikit-learn138. Regardless of how we evaluated PlasX, we always trained it with weights defined by Scenario B and based on only slices in the training data. To implement elastic net regularization, we performed a grid search of hyperparameters, with the regularization parameter alpha ranging from 10^8 to 10^-3 in multiplicative increments of √10 and the parameter l1_ratio being 0.0, 0.25, 0.5, 0.75, or 1.0. For each evaluation scenario, we selected the hyperparameters that produced the best performance. We used the best hyperparameters from the ‘informed’ cross-validation and the weights defined by Scenario B (alpha=3.16×10^-6, l1_ratio=0.0) to retrain PlasX on all 10kb slices and create the final model that we used to predict plasmids from metagenomes.

Execution of other plasmid prediction tools

We downloaded PlasClass31 from https://github.com/Shamir-Lab/PlasClass (v0.1.0-2-gb80a4f4). We downloaded PPR-Meta45 from https://github.com/zhenchengfang/PPR-Meta (v1.0-14-gab99c91). We downloaded Platon46 from https://github.com/oschwengers/platon, and then modified the code to more efficiently parallelize across many CPUs (modifications at https://github.com/michaelkyu/platon). We used Platon’s RDS score as its final prediction score, ignoring whether it found other features like conjugation and replication genes.

To ensure a fair comparison of models in cross-validation (Figure S1F and 1D), we retrained PlasClass and Platon using the same training sequences as we used for PlasX in each cross-validation fold. We trained PlasClass on 10kb slices, and we trained Platon on the entire non-sliced sequences. We did not train PlasClass and Platon with sequence weights because they don’t take in weights as input, but we did calculate precision and recall with weights. PPR-Meta45 and Deeplasmid35 do not provide software interfaces for retraining new models, so we ran the pretrained versions of these models that were published in their original studies (and thus were trained on different sequence datasets).

We downloaded the four sequence versions of the Wolbachia plasmid pWCP from https://doi.org/10.6084/m9.figshare.6380015 (Table S8). We made predictions of pWCP using the original pretrained and published versions of PlasClass, Platon, PPR-Meta, and Deeplasmid, and we ran the final PlasX model that was trained on all 10kb slices.

We downloaded the collection of all ICE sequences (n=552) from ICEberg51 2.0 at https://db-mml.sjtu.edu.cn/ICEberg/ on September 30, 2022. We also downloaded 455 prophage sequences from the NCBI Virus data portal (https://www.ncbi.nlm.nih.gov/labs/virus) on September 30, 2022. To download them, we selected the “Bacteriophages” subset from the “>Find Data” menu bar, and then we applied filters of “Only” for the “Provirus” option and “complete” for the “Nucleotide Completeness” option. We made predictions of these ICE and prophage sequences using the original pretrained and published version of Platon, using Platon’s default ‘accuracy’ mode (Tables S9 and S10). We also ran the final PlasX model that was trained on all 10kb slices.

We ran Deeplasmid using the Docker image of the CPU implementation, following instructions at https://github.com/wandreopoulos/deeplasmid (version sha256:10809927e2c8a14cf86231801b804b0bd4bddf600821d17fd8b7e41a15c562c0). While we were able to run Deeplasmid on the Wolbachia plasmid pWCP, it was prohibitively slow to run on the entire set of 10kb slices used for cross-validation evaluation. In particular, we found that Deeplasmid running on a MacOS laptop takes ∼3 hours for 1,000 slices, so we estimated it would take ∼3.7 years to run on all slices. While the GPU implementation of Deeplasmid might be able to run faster, we were unable to execute its prebuilt Docker image (version sha256:f3a22993fb765a7f9678b174245b64976e7e52a4dce85570060900b794af5e43). We suspect that this image is incompatible with modern machine setups, like ours, because Deeplasmid depends on software that is several years old. For example, it requires the CNTK library, for which development was abandoned over 3 years ago (https://docs.microsoft.com/en-us/cognitive-toolkit/releasenotes/cntk_2_7_release_notes). We were also unable to build a new Docker image to run the GPU implementation, despite attempts to modify the Docker build file (see the issue we raised at https://github.com/wandreopoulos/deeplasmid/issues/3).

Predicting plasmids from metagenomic assemblies

We downloaded fastq files for 1,782 short-read and paired-end metagenomes from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) using the program ‘fastq-dump’. The countries represented are Austria139, Australia (https://www.ncbi.nlm.nih.gov/bioproject/PRJEB6092), Bangladesh140, Canada141, China142, 143, Denmark144, England145, Ethiopia66, Fiji126, Israel146, Italy147, Madagascar66, Mongolia148, Spain95, Tanzania147 and USA93, 149. Some samples were sequenced multiple times (i.e. multiple records in SRA), in which case we concatenated the fastq files together. We have separated these multiple accessions using the delimiter ‘|’. We labeled Tanzania, Ethiopia, Bangladesh, Madagascar, and Fiji as non-industrialized and the other countries as industrialized.

All steps of quality filtering, metagenomic assembly, read recruitment and profiling were automated using snakemake130 workflows in anvi’o150. The ‘illumina-utils’151 commands ‘iu-gen-configs’ and ‘iu-filter-quality-minoche’ with the flag ‘--ignore-deflines’ were used to quality filter the raw paired-end reads. Each metagenome was assembled individually using IDBA_UD152 with default settings, except the flag ‘--min_contig 1000’.

We annotated COGs and Pfams in all assembled contigs using the same procedure as for the reference plasmids and chromosomes. To annotate de novo families, we first used ‘mmseqs result2profile’ (default parameters) to represent the sequence conservation in each de novo family as a profile. We then used ‘mmseqs search’ (default parameters) to search for profiles across all genes. We kept hits where the alignment coverage was ≥80% of both the gene and the profile and where the alignment identity was ≥X-0.05, where X is the minimum identity threshold used to originally construct the family (parameter --min-seq-id). For example, if a family was constructed using an identity threshold of 0.8, then we kept hits with an identity ≥0.75. Using these gene annotations, we ran PlasX to assign a score to every contig. We kept contigs intact, rather than slicing them into 10kb windows. Contigs with score >0.5 were classified as plasmids.
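
The hit-filtering rule can be expressed as a small predicate. The sketch below is illustrative only: the function and parameter names are hypothetical, and coverage and identity are assumed to be given as fractions between 0 and 1.

```python
def keep_hit(identity, gene_cov, profile_cov, family_min_seq_id,
             min_cov=0.8, slack=0.05):
    """Decide whether a profile-to-gene hit is kept.

    A hit is kept if the alignment covers at least `min_cov` of both the gene
    and the profile, and if its identity is at least the family's original
    identity threshold minus `slack`.
    """
    covered = gene_cov >= min_cov and profile_cov >= min_cov
    similar = identity >= family_min_seq_id - slack
    return covered and similar


# A family built at --min-seq-id 0.8 accepts hits with identity >= 0.75.
hit_kept = keep_hit(identity=0.76, gene_cov=0.9, profile_cov=0.85,
                    family_min_seq_id=0.8)
hit_low_identity = keep_hit(identity=0.70, gene_cov=0.9, profile_cov=0.85,
                            family_min_seq_id=0.8)
hit_low_coverage = keep_hit(identity=0.95, gene_cov=0.5, profile_cov=0.85,
                            family_min_seq_id=0.8)
```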

Detection and circularity of plasmids across metagenomes

We recruited short reads from our collection of metagenomes using Bowtie2 v.2.0.5153. We used the snakemake workflows in anvi’o to automate execution of bowtie and post-processing to calculate ‘detection’, i.e. the proportion of a sequence that is covered by at least one read. We ran bowtie2 using the three following combinations of parameters and input files.

First, to identify circular contigs, we recruited each metagenome’s reads to a fasta file that contained only the contigs assembled from that metagenome. For computational efficiency, we ran bowtie2 with its default behavior to align every read at most once. We then analyzed the orientation of paired-end reads (Figure 2B). During assembly, circular sequences are broken by an artificial breakpoint to represent them as linear contigs. Consequently, DNA sequencing that occurred across this breakpoint will produce paired-end reads that align in a reverse-forward orientation to the ends of the contig. In contrast, if a sequence is not circular, then all paired-end reads are expected to align in a forward-reverse orientation. To illustrate this intuition, suppose the upstream read of a paired-end maps to positions 200-300 of a contig and the downstream read maps to 500-600. If the upstream read maps with a reverse complement strandedness (i.e. ‘reverse’) and the downstream read maps with the same strandedness as the way the contig is written (i.e. ‘forward’), then the paired-end is in a reverse-forward orientation. In other words, if the contig is written 5’-to-3’, then the upstream read maps 3’-to-5’ and the downstream read maps 5’-to-3’. Inversely, the paired-end is in a forward-reverse orientation if the upstream read maps 5’-to-3’ and the downstream read maps 3’-to-5’. Next, we defined the gap (or insert) size of a paired-end to be the distance between the closest (or farthest) aligned positions between its two reads. In our example, the gap size is 500-300=200 and the insert size is 600-200=400. Let D be the contig’s length minus three times the median insert size of all forward-reverse paired-ends that are aligned to the contig. Finally, we labeled a contig as circular if (1) its detection was ≥0.95 and (2) it had at least one reverse-forward paired-end with a gap greater than or equal to D. This approach of examining reverse-forward paired-ends was inspired by154.
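
The circularity criterion can be sketched in a few lines of Python. This is an illustration of the rule as described, not the original implementation: the function names are hypothetical, and each paired-end is assumed to be given as two (start, end) tuples for the upstream and downstream reads.

```python
from statistics import median


def gap_size(upstream, downstream):
    """Distance between the closest aligned positions of the two reads."""
    return downstream[0] - upstream[1]


def insert_size(upstream, downstream):
    """Distance between the farthest aligned positions of the two reads."""
    return downstream[1] - upstream[0]


def is_circular(contig_len, detection, fr_pairs, rf_pairs, min_detection=0.95):
    """Apply the two conditions: high detection, and at least one
    reverse-forward pair whose gap is at least D."""
    if detection < min_detection or not fr_pairs or not rf_pairs:
        return False
    # D = contig length minus three times the median forward-reverse insert size.
    d = contig_len - 3 * median(insert_size(u, v) for u, v in fr_pairs)
    return any(gap_size(u, v) >= d for u, v in rf_pairs)


# Hypothetical 5kb contig with three forward-reverse pairs (median insert 400)
# and one reverse-forward pair that maps to the two ends of the contig.
fr_pairs = [((100, 200), (400, 500)),
            ((1000, 1100), (1300, 1400)),
            ((2000, 2100), (2320, 2420))]
rf_pairs = [((50, 150), (4800, 4900))]
circular = is_circular(5000, 0.99, fr_pairs, rf_pairs)
```

Here D is 5000-3×400=3800, and the reverse-forward pair has a gap of 4650, so the contig is labeled as circular.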

Second, to study the ecological distribution of all plasmids and plasmid systems at the same time, we recruited each metagenome’s reads to a fasta file that contained either the non-redundant set of 68,350 predicted plasmids or the non-redundant set of 11,121 reference plasmids. For computational efficiency, every read was aligned at most once (i.e. the default behavior of bowtie). We designated a plasmid as being present in a metagenome if its detection was ≥0.95. To compare metagenomes based on their plasmid content in Figure 4 and Figure S4, we ran UMAP v0.5.173 with parameters ‘n_neighbors=30, n_components=2, min_dist=0.15, metric=’jaccard’, random_state=1’. The heatmaps in Figure 4 were generated using the ‘heatmap.2’ package in R, with agglomerative clustering using median linkage on Euclidean distances.

Third, to study the specific plasmids from PS974 and PS1110 that were shown in Figure 6F and Figure S10 (see Table S5 for contig names), we ran bowtie2 on each sequence separately. This setup allowed every read to align to potentially multiple sequences, resulting in a more complete estimation of which metagenomes contained a plasmid. For the backbone sequences of these systems, we designated them as present in a metagenome if their detection was ≥0.95. For compound plasmids in PS1110 that encoded a chloramphenicol resistance gene, we designated them as present in a metagenome if they satisfied an additional criterion that ≥0.95 of the resistance gene was covered by at least one read.

Estimation of microbial taxonomy in metagenomes

We estimated taxonomic abundances in every metagenome by running kraken2155 v2.1.2 with its standard database (https://github.com/DerrickWood/kraken2) and then refined the abundances using bracken156 v2.5 (https://github.com/jenniferlu717/Bracken) with database parameters ‘-k 35 -l 96’. We ran bracken (parameter ‘-r 96’) a separate time for every taxonomic rank: S1 (subspecies/strain), S (species), G (genus), F (family), O (order), C (class), P (phylum), D (domain). The output of this analysis is a count of how many reads originated from each taxon.

To compare the metagenomic presence/absence of plasmids versus taxa, we calculated $M_P$, the set of metagenomes where a plasmid $P$ is detected at ≥95%, and $M_T^r$, the set of metagenomes in which at least $r$ reads originated from taxon $T$. For each plasmid, we attempted to find the best explanation of its ecological distribution by comparing the plasmid to every taxon using the Jaccard index, and by scanning many possible read thresholds. More exactly, we used the following formula to represent the best possible explanation of a plasmid’s ecological distribution: $\max_{T,\,r} J(M_P, M_T^r)$, where $J(A, B) = |A \cap B| \,/\, |A \cup B|$ is the Jaccard index.

We evaluated 29 values for the threshold r, ranging from 1 read to 10 million reads in multiplicative increments of 10^(1/4). We ignored plasmids that were present in fewer than 5 metagenomes, i.e. |M_P|<5, because it was likely that these plasmids would have a high Jaccard similarity to some taxon by chance. For instance, we observed that many pairs of plasmids and taxa occur in exactly one and the same metagenome, and thus they have a Jaccard index of 1.
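
The search for the taxon and read threshold that best explain a plasmid's distribution can be sketched as follows. The function names and the toy read counts are hypothetical; the sketch assumes that read counts per taxon and metagenome are available as nested dictionaries.

```python
def jaccard(a, b):
    """Jaccard index of two sets (defined here as 0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def read_thresholds(n=29):
    """Thresholds from 1 to 10 million reads in multiplicative steps of 10^(1/4)."""
    return [10 ** (k / 4) for k in range(n)]


def best_explanation(plasmid_metagenomes, taxon_reads, thresholds):
    """Return (best Jaccard index, taxon, threshold) over all taxa and thresholds.

    `taxon_reads` maps a taxon to a dict of metagenome -> read count.
    """
    best = (0.0, None, None)
    for taxon, reads in taxon_reads.items():
        for r in thresholds:
            present = {m for m, n in reads.items() if n >= r}
            score = jaccard(plasmid_metagenomes, present)
            if score > best[0]:
                best = (score, taxon, r)
    return best


thresholds = read_thresholds()
plasmid_metagenomes = {"m1", "m2", "m3", "m4", "m5"}
taxon_reads = {
    # Abundant in the same five metagenomes, with a few stray reads in m6.
    "taxonA": {"m1": 5000, "m2": 5000, "m3": 5000, "m4": 5000, "m5": 5000, "m6": 20},
    "taxonB": {"m1": 100, "m2": 100},
}
best = best_explanation(plasmid_metagenomes, taxon_reads, thresholds)
```

In the toy data, scanning thresholds matters: at r=1 the stray reads in m6 lower the Jaccard index of taxonA to 5/6, whereas a threshold above 20 reads gives an index of 1.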

To compare continuous-valued abundances, we defined a plasmid’s abundance in a metagenome as the sum of coverage values across all sequence positions divided by sequence length, and we defined a taxon’s abundance as the number of reads originating from the taxon. If a plasmid had detection of <95%, then we set its abundance to 0. If a taxon had fewer than 1000 reads, then we set its abundance to 0. We ignored plasmids and taxa that had non-zero abundances in fewer than 5 metagenomes. For every pair of plasmid and taxon, we estimated the Pearson correlation between their abundance levels across metagenomes using FastSpar157 v1.0.0 (https://github.com/scwatts/fastspar), which is an improved implementation of SparCC74. This method accounts for the compositional nature of the data (in which abundances reflect relative instead of absolute quantities) by assuming that correlations in the data are sparse. We ran FastSpar on the non-redundant set of predicted plasmids, and ran it separately on the non-redundant set of reference plasmids.

Additional validation and annotations of plasmids

To determine if a predicted plasmid has canonical plasmid features, we ran MOB-suite38. This tool searches a sequence for known examples of four types of features: plasmid replicon (e.g. replication genes), relaxase, mating pair formation, and origin of transfer. We installed MOB-suite v3.0.1 using pip, in an Anaconda Python environment that has mash v2.2. We ran the MOB_typer subroutine (command ‘mob_typer’) using default parameters and followed the execution instructions at https://github.com/phac-nml/mob-suite.

To determine if a predicted plasmid is a novel sequence, we searched plasmids against NCBI using the blast package (v2.9.0, installed from the bioconda Anaconda repository). On October 13, 2021, we downloaded version 5 of the NCBI databases non-redundant nucleotide (nt), ref_prok_rep_genomes, ref_viroids_rep_genomes, and ref_viruses_rep_genomes, and then integrated them into a single database using the ‘blastdb_aliastool’ command. We then searched every predicted plasmid against this combined database, using the ‘blastn’ tool with the ‘-task megablast’ parameter for efficient searching. For each plasmid, we examined all matching NCBI sequences (called ‘subjects’) and chose the one with the highest ‘qcovs’ (query coverage per subject), which represents the fraction of the plasmid sequence that is covered by all high-scoring segment pairs (HSPs). Ties were broken by sorting subjects by the maximum bitscore of their HSPs. If the qcovs of the best matching sequence was ≥90%, then we considered the predicted plasmid as found in NCBI and further categorized the matching sequence by searching for the keywords ‘plasmid’, ‘virus’, ‘chromosome’ (in that order, disregarding capitalization) in its NCBI description. For example, if the description of the matching sequence contained the word ‘plasmid’, then we said the predicted plasmid matched a known plasmid on NCBI. Similarly, if the description contained ‘chromosome’ but neither ‘plasmid’ nor ‘virus’, then we said that the predicted plasmid matched a known chromosome on NCBI. If the qcovs of the best sequence was <90%, then we labeled the predicted plasmid as not found in NCBI.
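
The selection and categorization of the best-matching subject can be sketched as below. The dictionary keys, function names, and the label ‘other’ (for a match whose description contains none of the keywords) are hypothetical choices for illustration; the sketch assumes that the blastn results were already summarized into one record per subject.

```python
def best_subject(hits):
    """Pick the subject with the highest qcovs; break ties by maximum bitscore."""
    if not hits:
        return None
    return max(hits, key=lambda h: (h["qcovs"], h["max_bitscore"]))


def categorize(hits, min_qcovs=90):
    """Label a predicted plasmid by the description of its best subject."""
    best = best_subject(hits)
    if best is None or best["qcovs"] < min_qcovs:
        return "not found"
    description = best["description"].lower()  # disregard capitalization
    # Keywords are checked in this order, so 'plasmid' takes precedence.
    for keyword in ("plasmid", "virus", "chromosome"):
        if keyword in description:
            return keyword
    return "other"  # assumption: label for matches without any keyword


hits = [
    {"qcovs": 97, "max_bitscore": 800, "description": "Bacteroides sp. chromosome"},
    {"qcovs": 97, "max_bitscore": 950, "description": "Bacteroides sp. Plasmid pX"},
    {"qcovs": 60, "max_bitscore": 990, "description": "Some phage, complete genome"},
]
label = categorize(hits)
```

In this toy example, two subjects tie at a qcovs of 97, and the one with the higher bitscore determines the label.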

To visualize pFIJ1037_1 (Figure 3A) and pENG0187_1 (Figure S3A), we manually imported COG functions into the plasmid maps produced by snapgene (Insightful Science; snapgene.com). We manually curated functions in genes without COGs using NCBI BLASTx.

Keyword analysis of COGs and Pfams for plasmid functions

We labeled a COG or Pfam as being a plasmid-associated function (Figure 1E) if its database description contains any of the following keywords as a substring: ‘plasmid’, ‘toxin’, ‘replicat’, ‘integrase’, ‘transpos’, ‘recombinase’, ‘resolvase’, ‘relaxase’, ‘recombination’, ‘partitioning’, ‘mobilis’, ‘mobiliz’, ‘type iv’, ‘conjugal’, ‘conjugat’, ‘segregat’, ‘MobA’, ‘ParA’, ‘ParB’, ‘BcsQ’. We labeled backbone and cargo genes as being related to plasmid replication, transfer, or maintenance if they were annotated to any plasmid-associated COG or Pfam (see section “Classification of cargo and backbone genes”).

To determine if a predicted plasmid is ‘keyword-recognizable’ (Figure 2C), we searched the plasmids for COGs and Pfams using a more restricted set of keywords (just “plasmid” and “conjugation”) instead of the keywords above.

MobMess algorithm to dereplicate plasmids, remove assembly fragments, and discover plasmid systems (see Figure S8)

The MobMess algorithm performs three tasks. It de-replicates plasmids that are nearly redundant to each other; it removes plasmids that appear to be assembly fragments; and finally it organizes plasmids together into evolutionary groups called plasmid systems. MobMess consists of several steps described below.

MobMess first performs an all-vs-all pairwise alignment of sequences using the MUMmer alignment package (v4.0.0rc1)80. All sequences are placed into a single fasta file and then aligned with ‘nucmer’ (parameters ‘--maxmatch --minmatch=16’) to calculate local alignment blocks.

Alignments are specified asymmetrically such that one sequence is designated as the query q and the other is the reference r. For every q and r, the alignment blocks calculated by ‘nucmer’ are written to a separate file, and then a subset of blocks is identified using ‘delta-filter’ (parameters ‘-q -r’) to create a one-to-one alignment.

Next, MobMess constructs a directed graph G where vertices are sequences and edges represent the containment of one sequence within another (Figure S8A). Formally, consider a query q and reference r. Let |q| be the length of q. For the ith alignment block between q and r, let si, ei, and δi be the start position in q, end position in q, and number of alignment mismatches and indels, respectively. The following values summarize the information across all alignment blocks between q and r.

Embedded Image

MobMess creates a directed edge (q,r) in G if Ilocal and C are above user-specified thresholds. In this study, we applied thresholds of Ilocal≥0.9 and C≥0.9. In Figure S12, we re-ran MobMess using various thresholds on Ilocal and C (the same threshold was applied to Ilocal and C at the same time).

MobMess clusters sequences according to strongly connected components in G, calculated with igraph v0.8.2158 in Python. That is, two sequences x and y are placed in the same cluster if there exists a directed path from x to y and another from y to x in G. Intuitively, a cluster represents a set of sequences that are nearly identical to each other across nearly their entire lengths. MobMess then reduces G to another graph H, called the condensation graph, by contracting every cluster of sequences into a single vertex. A directed edge (u,v) exists in H if and only if there are sequences x ∈ u and y ∈ v where edge (x,y) exists in G. Note that H does not have any cycles. As proof by contradiction, if there were a cycle of clusters, then those clusters would have been in the same strongly connected component in G and hence would have been merged into a single, larger cluster.

MobMess labels every cluster in H as one of the three following types: (1) a ‘backbone cluster’ if it has an outgoing edge and at least one of its member sequences is circular, (2) a ‘fragment cluster’ if it has an outgoing edge but none of its member sequences are circular, or (3) a ‘maximal cluster’ if it does not have any outgoing edges. Intuitively, a maximal cluster represents the longest version of a plasmid observed in the data. In contrast, a backbone or fragment cluster represents a set of plasmids that are subsequences of other plasmids in a maximal cluster. The only difference between backbone and fragment clusters is that backbone clusters contain at least one circular plasmid (implying complete assembly), while fragment clusters do not contain any circular plasmids (suggesting they are assembly fragments of the maximal cluster).
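
The clustering and labeling steps can be illustrated with a self-contained sketch. It is not the MobMess implementation (which uses igraph): here strongly connected components are found with Kosaraju's algorithm, and the function names and toy sequences are hypothetical. An edge (q, r) means that q is contained in r.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm; returns a dict mapping node -> component id."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
    # Pass 1: iterative depth-first search recording the finishing order.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(fwd[root]))]
        while stack:
            node, neighbors = stack[-1]
            nxt = next((n for n in neighbors if n not in seen), None)
            if nxt is None:
                order.append(node)
                stack.pop()
            else:
                seen.add(nxt)
                stack.append((nxt, iter(fwd[nxt])))
    # Pass 2: sweep the reversed graph in reverse finishing order.
    comp = {}
    for root in reversed(order):
        if root in comp:
            continue
        comp[root] = root
        stack = [root]
        while stack:
            node = stack.pop()
            for n in rev[node]:
                if n not in comp:
                    comp[n] = root
                    stack.append(n)
    return comp


def label_clusters(nodes, edges, circular):
    """Label each cluster as 'backbone', 'fragment', or 'maximal'."""
    comp = strongly_connected_components(nodes, edges)
    members = defaultdict(set)
    for node, c in comp.items():
        members[c].add(node)
    # A cluster has an outgoing edge in the condensation graph if any member
    # points to a sequence in a different cluster.
    has_out = {c: False for c in members}
    for u, v in edges:
        if comp[u] != comp[v]:
            has_out[comp[u]] = True
    labels = {}
    for c, seqs in members.items():
        if not has_out[c]:
            labels[frozenset(seqs)] = "maximal"
        elif seqs & circular:
            labels[frozenset(seqs)] = "backbone"
        else:
            labels[frozenset(seqs)] = "fragment"
    return labels


# Toy graph: 'a' and 'a2' are near-identical and both contained in 'big';
# 'frag' is a non-circular piece of 'big'.
nodes = ["a", "a2", "big", "frag"]
edges = [("a", "a2"), ("a2", "a"), ("a", "big"), ("a2", "big"), ("frag", "big")]
labels = label_clusters(nodes, edges, circular={"a", "big"})
```

In the toy graph, the cluster {a, a2} is a backbone cluster because one of its members is circular, whereas the cluster {frag} would be discarded as an assembly fragment.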

To dereplicate sequences, MobMess discards all fragment clusters and then chooses a representative sequence from every maximal and backbone cluster. A cluster’s representative is the sequence with the highest global sequence identity (Iglobal), averaged across the set of alignments where that sequence is the reference and other sequences in the same cluster are the queries.

MobMess defines a plasmid system as a specific backbone cluster together with its ‘compound’ clusters, which are the set of non-fragment clusters connected to the backbone in H. Thus, there is a one-to-one correspondence between backbone clusters and plasmid systems. Note that systems can be nested within each other, because backbone clusters can be connected to each other in H. Thus, a backbone cluster can be the backbone that forms a given plasmid system, and at the same time, it can also be a compound cluster with respect to an even smaller backbone that forms a different system. As another note, a maximal cluster can be a ‘compound’ cluster of a system, but it is also possible that some maximal clusters are not found in any system because they are not connected to any backbone clusters in H.

We ran MobMess to analyze the 226,194 predicted plasmid contigs. MobMess grouped the contigs into a total of 132,616 clusters. 64,266 clusters were ‘fragment clusters’ that contained 125,475 contigs, which we interpreted as assembly fragments of other predicted plasmids. We discarded these fragments from further analysis. The other 68,350 clusters were non-fragment clusters (i.e. 1,169 backbone and 67,181 maximal clusters) and contained 100,719 contigs, which we further analyzed for the existence of orthogonal support for being plasmids (Figure 2C). Finally, MobMess identified 1,169 plasmid systems, which together represent 1,169 backbone and 3,255 maximal clusters (63,926 maximal clusters were excluded). See Figure S13 for a diagram of these numbers.

We ran MobMess separately on the 16,827 reference plasmid sequences, yielding 11,121 clusters. We assumed that all reference plasmids were circular, and thus there were no fragment clusters. We visualized networks with Cytoscape83 v3.8 and laid nodes out using the prefuse directed force layout84. While we have focused on plasmids, MobMess could be applied to dereplicate and organize other mobile genetic elements into systems.

Classification of cargo and backbone genes

We classified all genes on the backbone plasmids of a plasmid system as backbone genes. For genes on compound plasmids, we tested whether the genes shared any de novo family annotations with the genes on the backbone plasmids. If so, we classified those genes as backbone genes, otherwise as cargo genes. For this analysis, we used the 1,090,132 de novo families that we constructed from reference plasmids and chromosomes in order to train PlasX, and we also used an additional set of 439,584 de novo families that we constructed by running the command MMseqs2 (--min-seq-id 0.05) on the genes from all plasmid sequences in this study (16,827 reference and 226,194 predicted plasmids). These additional families allowed us to capture gene families that might be absent in reference sequences but are conserved in predicted plasmids. Note that the classification of genes as backbone or cargo depends on which plasmid system is being considered. It is possible for a gene to be classified as a backbone gene with respect to one plasmid system and, at the same time, as a cargo gene with respect to another system. This is because a plasmid can be a backbone plasmid of a system and also be a compound plasmid of a different system (see Methods subsection on MobMess algorithm).

For every non-redundant compound plasmid in the system, we calculated the fraction of genes in the plasmid that were cargo genes. We then averaged this fraction across all non-redundant compound plasmids in the system to define the “cargo gene percentage” of the system (Figure S15). Because every gene is either backbone or cargo, the percentage of backbone genes is 100% minus the cargo gene percentage.
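
The classification rule and the cargo gene percentage can be sketched as follows. The function names, gene names, and family names are hypothetical; each gene is assumed to be represented by the set of de novo families it is annotated to.

```python
def classify_genes(gene_families, backbone_families):
    """Label each gene on a compound plasmid as 'backbone' or 'cargo'.

    A gene is a backbone gene if it shares at least one de novo family with
    the genes on the backbone plasmids; otherwise it is a cargo gene.
    """
    return {gene: "backbone" if families & backbone_families else "cargo"
            for gene, families in gene_families.items()}


def cargo_gene_percentage(compound_plasmids, backbone_families):
    """Average, over compound plasmids, of the fraction of cargo genes."""
    fractions = []
    for gene_families in compound_plasmids.values():
        labels = classify_genes(gene_families, backbone_families)
        n_cargo = sum(label == "cargo" for label in labels.values())
        fractions.append(n_cargo / len(labels))
    return 100.0 * sum(fractions) / len(fractions)


# Toy system: families observed on the backbone plasmids.
backbone_families = {"f1", "f2"}
compound_plasmids = {
    "compound1": {"g1": {"f1"}, "g2": {"f9"}, "g3": set(), "g4": {"f2", "f7"}},
    "compound2": {"g5": {"f1"}, "g6": {"f3"}, "g7": {"f4"}, "g8": {"f5"}},
}
percentage = cargo_gene_percentage(compound_plasmids, backbone_families)
```

In the toy system, the two compound plasmids have cargo fractions of 0.5 and 0.75, giving a cargo gene percentage of 62.5%. A gene without any family annotation (g3) is counted as cargo under this rule.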

For Figure 5B and to analyze the content of backbone/cargo genes, we used a non-redundant and unambiguous set of 8,995 backbone and 24,168 cargo genes. To derive these sets of genes, we first considered the 47,172 genes encoded on the 4,424 non-redundant plasmids that were part of at least one plasmid system. Of these 47,172 genes, we used the 8,995 genes that were classified as backbone genes because they were encoded on a backbone plasmid and that were never classified as cargo genes in any plasmid system. 24.1% (2,169/8,995) of these genes had a plasmid-associated keyword in their COG/Pfam annotations (see Methods section “Keyword analysis of COGs and Pfams for plasmid functions”). We also used the 24,168 genes that were always classified as cargo genes and never backbone genes in any plasmid system. 13.4% (3,229/24,168) of these genes had a plasmid-associated keyword. We excluded from analysis the 1,917 genes that were sometimes classified as backbone genes and other times cargo genes, depending on the system. We also excluded 12,092 genes that were on compound plasmids but were classified as backbone genes, as these genes are redundant with the backbone genes that were encoded on the backbone plasmid.

Mutual exclusivity of plasmid systems

To measure the extent of mutual exclusivity in a plasmid system S, we defined two sets of metagenomes. $M_s^{\text{any}}$ is the set of metagenomes that contain S, i.e. where one or more plasmids in S have detection ≥0.95. $M_s^{\text{solo}}$ is the set of metagenomes where one and only one plasmid in S has detection ≥0.95. Then, we calculated the mutual exclusivity score $E_s$ as the ratio $|M_s^{\text{solo}}| / |M_s^{\text{any}}|$. This score is similar to a test statistic used by methods, such as CoMEt159, for studying mutual exclusivity of gene alterations in cancer.

We observed that if a compound plasmid was present in a metagenome (detection ≥0.95), then its backbone plasmid was often also present. This could be due to (1) both plasmids being present as separate entities in the same microbiome, or (2) a read recruitment artifact arising from the sequence similarity between the plasmids. To minimize artifacts, we assumed the second scenario is always happening, and we corrected for it by assuming that a backbone plasmid is absent from a metagenome (regardless of its detection) whenever any compound plasmids are present in the metagenome. To formalize this correction procedure, let $P_m$ be the set of plasmids in system S with detection ≥0.95 in metagenome m. We created the induced subgraph $H_m$ that is formed by subsetting the vertices of $P_m$ in the plasmid similarity graph H (see MobMess section of the Methods). Then, we defined $P_m' \subseteq P_m$ as the subset of plasmids that are maximal with respect to $H_m$ (i.e. they don’t have outgoing edges in $H_m$). We used $P_m'$, rather than $P_m$, in order to calculate $M_s^{\text{any}}$, $M_s^{\text{solo}}$, and $E_s$.
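
The corrected mutual exclusivity score can be sketched in Python as follows. The function names and toy data are hypothetical; the sketch assumes that presence calls (detection ≥0.95) are given per metagenome, and that edges point from a contained plasmid to the plasmid that contains it, as in the graph H.

```python
def maximal_present(present, edges):
    """Keep only the present plasmids that have no outgoing edge to another
    present plasmid, i.e. those that are maximal in the induced subgraph."""
    return {p for p in present
            if not any(u == p and v in present for u, v in edges)}


def mutual_exclusivity(presence, edges):
    """Ratio of metagenomes with exactly one (corrected) plasmid of the system
    to metagenomes with at least one.

    `presence` maps a metagenome to the set of system plasmids detected in it.
    """
    n_any = n_solo = 0
    for present in presence.values():
        corrected = maximal_present(present, edges)
        if corrected:
            n_any += 1
            n_solo += len(corrected) == 1
    return n_solo / n_any if n_any else None


# Toy system: backbone 'b' is contained in compound plasmids 'c1' and 'c2'.
edges = [("b", "c1"), ("b", "c2")]
presence = {
    "m1": {"b", "c1"},   # corrected to {c1}: counts as solo
    "m2": {"c1", "c2"},  # two compound plasmids: not solo
    "m3": {"b"},         # backbone alone: solo
    "m4": set(),         # system absent: ignored
}
score = mutual_exclusivity(presence, edges)
```

In this toy example, the system is present in three metagenomes and solo in two of them, so the score is 2/3.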

Identification of antibiotic resistance genes

We annotated antibiotic resistance genes using two databases. First, we searched against a database of resistance protein family HMMs from Resfams160 (v1.2, dated 2015-01-27, ‘Core’ database at http://www.dantaslab.org/resfams). We used ‘anvi-run-hmms’ from anvi’o129 to automate running ‘hmmsearch’ from HMMER133 3.3.2 and apply an e-value cutoff of 10^-10. Second, we ran rgi (v5.2.0, https://github.com/arpcard/rgi) to search for similarity in the CARD database of resistance genes161. We removed CARD hits that were labeled as ‘Loose’ and kept those labeled as ‘Perfect’ or ‘Strict’. We removed any Resfams or CARD hits that contained the keywords ‘transcription’, ‘regulat’, or ‘modulat’ in their database description, to avoid cases (e.g. TetR protein) where the hit is a gene that regulates the expression of another resistance gene but doesn’t itself perform the molecular process that confers resistance. We categorized hits into major antibiotic resistance classes by searching for the following keywords in their functional descriptions: ‘lincosamide’, ‘macrolide’, ‘erythromycin’, ‘chloramphenicol’, ‘aminoglycoside’, ‘streptothricin’, ‘glycopeptide’, ‘efflux pump’, ‘beta-lactamase’, ‘nitroimidazole’, ‘tetracycline’, ‘quinolone’, ‘sulfonamide’. Additionally, we searched the extra keywords ‘Van’ and ‘VanZ’ to identify glycopeptide resistance; ‘efflux’, ‘permease’, and ‘pump’ to identify efflux pumps; and ‘TetX’ to identify tetracycline resistance.

High molecular weight (HMW) DNA extraction, long-read sequencing, and determination of circularity through long-reads

We employed a long-read sequencing strategy on two Bacteroides fragilis cultivars from two patients (p-214 and n-216 previously described in Vineis et al63). We extracted total genomic HMW DNA by one of two methods. For B. fragilis p-214, we used the Qiagen Genomic Tip 20/G procedure (also known as Method #4/GT) as previously described162 on a 10 mL overnight BHIS broth culture. For B. fragilis n-216, we used a Phenol Chloroform protocol on 25 mL overnight BHIS broth cultures. Libraries were prepared with the Rapid Barcoding Kit (SQK-RBK004) and the standard protocols from Oxford Nanopore Technologies (ONT) with a few modifications. For B. fragilis p-214, DNA fragmentation was performed on 6 µg DNA using 5 passes through a 22G needle in a 30 µL volume. The gDNA input was 1.5 µg (Table S6), based on sample availability in a 7.5 µL volume, with 2.5 µL Fragmentation mix added. We sequenced for 72 hours using a single R9.4/FLO-MIN106 flow cell (ONT). For B. fragilis n-216, DNA fragmentation was performed on 10 µg DNA using 10 passes through a 22G needle in a 250 µL volume. The gDNA input was 0.32 to 0.44 µg, based on sample availability in an 8.5 µL volume, with 1.5 µL Fragmentation mix added per sample. We sequenced for 72 hours using a single R9.4/FLO-MIN106 flow cell. We used Guppy (v4.0.15) for all post-run base calling, sample de-multiplexing and the conversion of raw FAST5 to FASTQ files.

To determine circularity, we used BLAST to align the long reads with a minimum quality score of 7 to our predicted plasmid sequences. During assembly, short reads are always assembled into linear sequences, even when they derive from circular elements. Circular elements therefore carry an artificial breakpoint that represents them as linear sequences, and this breakpoint can fall anywhere on the sequence depending on the assembly method. We tested for the presence of an artificially introduced breakpoint by aligning 500 long reads and then visualizing these alignments on the sequence as if it were a circular element (Figure S3). If the sequence is indeed circular, the long reads would overlap each other and “wrap around” the entire circumference of the sequence. In other words, all nucleotide positions of the sequence would be covered by at least one read and there would also exist a read that spans the breakpoint by aligning to both sides of the breakpoint. This property ensures the breakpoint is artificial, and hence the sequence is a circular element. Conversely, this property does not hold when the breakpoint is not artificial (i.e. the sequence is actually an assembly fragment or linear element).
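The two conditions above (full coverage and at least one read spanning the breakpoint) can be expressed as a small check. This is only a sketch under stated assumptions: alignments are given per read as 0-based, half-open (start, end) intervals on the linear contig, and a read is taken to span the breakpoint when one of its aligned segments reaches the contig end and a different segment starts at the contig start. The function name and the `end_tolerance` parameter are hypothetical; the analysis in the text relied on visual inspection of BLAST alignments.

```python
def supports_circularity(contig_length, read_alignments, end_tolerance=0):
    """Check whether long-read alignments are consistent with a circular contig.

    read_alignments: dict mapping read name -> list of (start, end) aligned
    segments on the linear contig (0-based, half-open coordinates).
    """
    covered = [False] * contig_length
    spans_breakpoint = False
    for segments in read_alignments.values():
        at_start, at_end = [], []
        for index, (start, end) in enumerate(segments):
            # Mark every contig position covered by this segment.
            for pos in range(max(start, 0), min(end, contig_length)):
                covered[pos] = True
            if start <= end_tolerance:
                at_start.append(index)
            if end >= contig_length - end_tolerance:
                at_end.append(index)
        # A read spans the breakpoint if two different segments touch both contig ends.
        if any(i != j for i in at_start for j in at_end):
            spans_breakpoint = True
    return all(covered) and spans_breakpoint


reads = {"r1": [(0, 60)], "r2": [(50, 100)], "r3": [(80, 100), (0, 30)]}
print(supports_circularity(100, reads))  # True
```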

Transfer of predicted plasmid between microbial populations

In duplicate, we streaked B. fragilis 214 (donor, erythromycin resistant due to pFIJ0137_1) and B. fragilis 638R (recipient, rifampicin resistant) onto plates with brain-heart infusion agar supplemented with hemin and vitamin K (BHIS) and incubated them in 5 mL BHIS media anaerobically at 37 ℃ for 20 hours. To mate the donor to the recipient, 250 μL of donor cells were pelleted in a centrifuge at 5,000 × g. We discarded the supernatant and resuspended the donor in 1 mL of the recipient culture. Again, cells were pelleted at 5,000 × g, then resuspended in 25 μL of BHIS media. Cells were spotted onto BHIS agar plates and incubated anaerobically for 24 hours. After incubation, cells were resuspended in 1 mL BHIS. 250 μL of this suspension was plated onto BHIS plates containing 8 μg/mL rifampicin and 25 μg/mL erythromycin to select for B. fragilis 638R recipients of pFIJ0137_1. Duplicate plates each had approximately 300 colonies. Plating the donor or recipient alone resulted in zero colonies, confirming that the transconjugants were not spontaneous mutants resistant to either antibiotic. Two transconjugant colonies were restreaked onto fresh BHIS plates containing 8 μg/mL rifampicin and 25 μg/mL erythromycin.

Short-read sequencing of isolate genomes and confirmation of plasmid transfer

We grew 20-hour cultures of B. fragilis 214 donor, naive B. fragilis 638R, and B. fragilis 638R transconjugants containing pFIJ0137_1. Using the QIAseq FX DNA library kit (Qiagen), libraries of these strains were prepared with 100 ng of genomic DNA. DNA was fragmented enzymatically into smaller fragments and the desired insert size was achieved by adjusting fragmentation conditions. Fragmented DNA was end repaired and ‘A’s were added to the 3’ ends to stage inserts for ligation. During the ligation step, Illumina compatible Unique Dual Index (UDI) adapters were added to the inserts and the prepared library was PCR amplified. Amplified libraries were cleaned up, and QC was performed using a TapeStation. Libraries were sequenced on the Illumina MiSeq platform using a v2 cassette to generate 2x250bp reads. To confirm the transfer of pFIJ0137_1, we individually recruited reads from the B. fragilis 214 donor, naive B. fragilis 638R, and B. fragilis 638R transconjugants to the pFIJ0137_1 reference sequence. We used anvi’o to create contigs and profile databases (as described above) and visualized these results with the command ‘anvi-interactive’. We independently confirmed the presence of pFIJ0137_1 by assembling genomes using SPAdes163 with default parameters.

Data availability

Reproducible analyses of reference plasmids and chromosomes are available at doi:10.5281/zenodo.5732024. The PlasX model as well as our analyses of known and predicted plasmids are available at doi:10.5281/zenodo.5843600. For all metagenomes, we have compiled the contigs, taxonomic abundances, and PlasX scores at doi:10.5281/zenodo.5730607, gene calls at doi:10.5281/zenodo.5730987, and gene annotations at doi:10.5281/zenodo.5731658. We have deposited long and short sequencing reads from B. fragilis isolates into the NCBI Sequence Read Archive (PRJNA782184).

Code availability

We have released two open-source packages, PlasX (https://github.com/michaelkyu/plasx) and MobMess (https://github.com/michaelkyu/mobmess), along with detailed installation and usage instructions.

Funding

Center for Data and Computing, at the University of Chicago (MKY, AME)

National Institutes of Health NIDDK grant RC2 DK122394 (AME)

Simons Foundation grant #687269 (AME)

Sloan Foundation (AME)

Author information

Contributions

Conceptualization: MKY, AME

Methodology: MKY, ECF, AME

Investigation: MKY, ECF, AME

Visualization: MKY, ECF

AME Funding acquisition: MKY, AME

Project administration: MKY, AME

Supervision: MKY, AME

Writing – original draft: MKY, ECF

Writing – review & editing: MKY, ECF, AME

Data curation: MKY, ECF

Formal Analysis: MKY

ECF Resources: ECF

MKY Software: MKY

Validation: ECF

Ethics declarations

Competing interests

The authors declare no competing interests.

Supplemental Figures

Figure S1. Additional analysis of PlasX.

(A) Histograms of reference sequences, based on the fraction of genes that have known or de novo family annotations. (B-C) Two-dimensional histograms of known (B) and de novo (C) gene families, based on the number of plasmid and chromosomal subtypes that each family is found in. The number of gene families is log-scaled. Only the gene families that are enriched in plasmid subtypes (i.e. bottom-right triangle) are shown. (D) Histograms of the coefficients learned by PlasX, showing that the vast majority of coefficients are close to zero. (E) Diagrams of different training-test split configurations for cross-validation. A random ’naive’ split of plasmids and chromosomal sequences results in training and test sets that have similar sequences, due to the existence of plasmid and chromosomal subtypes that contain highly similar sequences. An ’informed’ split assigns all sequences of the same subtype to either training or test, creating a more representative evaluation of a model’s ability to generalize to unseen sequences. Colors and edges represent sequences that are in the same subtype. (F) Precision-recall curves using 4-fold cross-validation and a naive split.

Figure S2. Additional analysis of predicted plasmids.

(A) Model scores of all contigs assembled from all 1,782 metagenomes. 226,194 plasmids were predicted by applying a score threshold of >0.5. Of these, 50,163 plasmids were high-scoring (≥0.9 score). (B) The sequence length of known and predicted plasmids. (C) Model scores of predicted plasmids that matched a sequence in NCBI (≥90% alignment identity and ≥90% coverage of the predicted plasmid). Predictions are labeled as a known ’plasmid’, ’virus’, or ’chromosome’ based on the presence of these words in the description of the matching NCBI sequence. We searched NCBI for only the filtered set of 100,719 non-fragment predictions. (D) The prevalence of reference and predicted plasmids across all metagenomes. (E) We calculated a “circularity coverage ratio” as the number of supporting reverse-forward reads divided by the average coverage of a contig. All circular contigs are shown, and they are colored if they were predicted by PlasX as plasmids (orange) or not plasmids (blue).

Figure S3. Long read circularity.

(A) The process to identify circular plasmids using long read sequences. Contigs are always assembled as linear sequences even when originally circular in the environment. We can determine their original configuration by aligning long reads around the entire sequence. (B) B. fragilis 216 long reads aligned to pENG0187_1, demonstrating circularity. 4 of 500 reads are shown for simplicity. Red triangles designate the beginning of a long read.

Figure S4. UMAP plot as in Figure 4C.

Metagenomes have been partitioned to show clustering within each country.

Figure S5. Comparison of the ecological distributions of plasmids and microbial taxonomy.

We measured the association between every plasmid and taxon by calculating the correlation between their abundance levels across metagenomes, using the SparCC technique76. As another association measure, we applied thresholds to the abundance levels and then calculated the Jaccard similarity between the metagenomes containing the plasmid versus those containing the taxon. We restricted analyses to plasmids that were present in at least 5 metagenomes. (A-B) For every predicted plasmid, we identified the taxon with the highest correlation (A) or Jaccard similarity (B). (C-D) We did the same to identify the best matching taxa of reference plasmids. Blue lines indicate the median of each distribution. (E) Venn diagram showing the discordance between the metagenomes containing a plasmid pDOJH10S and those containing its cognate host, a B. longum strain.

Figure S6. Conceptual differences in constructing plasmid similarity networks.

We ran MobMess on the set of 9,894 reference plasmids analyzed by Redondo-Salvo et al.81. MobMess constructs a network with directed edges, by aligning plasmids and determining if one plasmid is found as a subsequence within another. Redondo-Salvo et al. construct a network with undirected edges, by determining whether two plasmids contain partial homology. (A-B) Visualization of the similarity networks. We used Cytoscape83 and the Prefuse directed layout algorithm84 to lay out the nodes in the MobMess network (A), and then we applied the same layout to the Redondo-Salvo et al. network (B). The red boxes represent the example shown in Figure S7.

Figure S7. Comparison of MobMess versus Redondo-Salvo et al. for studying a plasmid system.

(A-B) An example from the similarity networks in Figure S6, showing the connections between 17 plasmids from the same plasmid system. MobMess further collapses its network to dereplicate plasmids and reveal the plasmid system’s “star”-like topology, where a backbone connects to its compound plasmids. Redondo-Salvo et al. did recognize that these plasmids are related (represented by a cluster called “G3”), but they connected almost every pair of these plasmids in a “hairball” topology, obfuscating the system’s internal organization. (C) Alignments of plasmids in the MobMess system. Subregions in every sequence are colored gray or green to represent backbone or cargo content, respectively. Ribbons between sequences represent the alignment of subregions. The barcharts show the total breakdown of each plasmid into backbone versus cargo, as well as the fraction of the backbone sequence (‘NC_008385.1’) that is found within the plasmid.

Figure S8. The MobMess algorithm and application to predicted plasmids.

(A) Diagram of the MobMess algorithm for dereplicating plasmids and discovering plasmid systems. All-vs-all sequence alignments and circularity information are used to construct a similarity network of plasmid contigs. Similar contigs are clustered, and every cluster is labeled as either a backbone, fragment, compound, or non-compound maximal. A plasmid system consists of a backbone cluster and the compound clusters connected to the backbone. This example shows two systems: one system has G as the backbone (H is the compound plasmid), and another system has D, E, and F as the backbone (B, C, H, and K are the compound plasmids). To dereplicate, fragment clusters are discarded and a representative sequence is chosen for every non-fragment cluster. (B) Network of clusters of predicted plasmids. All clusters are shown except those that are not connected to any other cluster.

Figure S9. Functional annotation of cargo genes to KEGG modules, similar to Figures 6A and 6B.

This plot excludes KEGG modules that occur in only one plasmid system. To avoid redundancy with Figure 6A, this plot also excludes modules that occur in cargo genes annotated to antibiotic resistance.

Figure S10. Plasmid system PS1110.

Compound plasmids in this system contain a gene that encodes two enzymes, a tRNA Gm18 2’-O-methylase (yellow, ’tRNA mod.’) and a Ribosomal protein S18 acetylase (red). Backbone plasmids contain a similar gene that encodes the S18 acetylase but lacks the tRNA methylase. Backbone genes have a thick, black outline.

Figure S11. Mutual exclusivity of plasmids in the same system.

For every plasmid system with two or more compound plasmids, we defined a ‘mutual exclusivity score’ to quantify how often its plasmids segregated to different metagenomes. We defined this score as the number of metagenomes that have exactly one of the system’s plasmids divided by the number of metagenomes that have any of the system’s plasmids. (A) Histogram of mutual exclusivity. (B) The inverse relation between the mutual exclusivity and prevalence of a system. Red triangles represent examples of plasmid systems that are highly prevalent (present in >100 metagenomes) but are more mutually exclusive than expected by a linear regression (blue line). For easier visualization, the x- and y-coordinates of systems with >90% mutual exclusivity were randomly jittered within ±5% of the axes lengths. (C) Visualization of mutual exclusivity in PS961, which is one of the examples in B.
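As a worked illustration of the mutual exclusivity score defined in this caption, the following minimal sketch computes it from a presence table. The input layout (a dict mapping each metagenome to the set of the system’s plasmids detected in it) and the function name are assumptions made for the example.

```python
def mutual_exclusivity(presence):
    """Fraction of plasmid-containing metagenomes that contain exactly one plasmid.

    presence: dict mapping metagenome name -> set of the system's plasmids present in it.
    Returns None when no metagenome contains any plasmid of the system.
    """
    with_any = [plasmids for plasmids in presence.values() if len(plasmids) >= 1]
    if not with_any:
        return None
    with_exactly_one = [plasmids for plasmids in with_any if len(plasmids) == 1]
    return len(with_exactly_one) / len(with_any)


presence = {
    "metagenome_1": {"plasmid_a"},
    "metagenome_2": {"plasmid_a", "plasmid_b"},
    "metagenome_3": {"plasmid_b"},
    "metagenome_4": set(),
}
print(mutual_exclusivity(presence))  # two of the three plasmid-containing metagenomes
```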

Figure S12. Choosing a similarity threshold for MobMess.

(A) Histogram of similarities between every pair of the 226,194 predicted plasmid contigs. (B) We ran MobMess using different thresholds on the similarity, and then we calculated the number of non-redundant plasmids generated. (C) The derivative of the curve in B. The blue dashed lines represent our current ≥90% similarity threshold.

Figure S13. Workflow of predicting plasmids with PlasX and organizing them with MobMess.

Figure S14. Relation between PlasX score and the length and circularity of predicted plasmids.

(A-B) The distribution of plasmid lengths, for plasmids with a score of ≤0.9 (A) or >0.9 (B). (C-D) The distribution of PlasX scores for circular (C) or non-circular plasmids (D).

Figure S15. Backbone and cargo composition of plasmid systems.

(A) For every plasmid system and compound plasmid in the system, we calculated the percentage of genes on the compound plasmid that were classified as cargo versus backbone genes (see Methods). We then averaged the cargo gene percentages across all compound plasmids in the system (x-axis). The vertical blue line shows the median at 40%. (B) Scatterplot of the cargo gene percentage versus the size of a plasmid system, showing a lack of correlation (R2 = 0.03). We defined the size as the number of non-redundant compound plasmids.

Supplemental Tables

Table S1. Summary of reference plasmids and chromosomes

Table S2. Names, accession numbers, and metadata of metagenomes

Table S3. Summary of predicted plasmids (model scores, orthogonal support, circularity, and NCBI blast results)

Table S4. Summary of plasmid systems

Table S5. Gene sequences for plasmid systems shown in Figure 5D, Figure S10 and Figure 6F

Table S6. DNA extraction and sequencing parameters for long read sequencing of isolate genomes

Table S7. COGs and Pfams ranked by their PlasX coefficients

Table S8. Prediction of a Wolbachia plasmid

Table S9. Prediction of ICEs as plasmids by PlasX and Platon

Table S10. Prediction of prophages as plasmids by PlasX and Platon

Table S11. Prediction of plasmids in the latest version of PLSDB (2020_06_23_v2) by PlasX

Table S12. The length and percent circularity of plasmids that are part of a system versus plasmids that are not part of any system.

Acknowledgments

We thank Karen Lolans (ORCiD:0000-0003-1903-756X) for performing the long-read sequencing and for providing feedback on the manuscript. We thank Samuel Miller (0000-0002-2836-1401) and Marcus Foo (0000-0003-3436-1632) for their insights into tRNA modification genes. We also thank other members of the Meren Lab at the University of Chicago for their feedback. MKY acknowledges support from Toyota Technological Institute at Chicago.

Footnotes

  • Updated several sections for clarity and to add new results; Updated Figure 1; Updated Supplementary Figure S2; Added Supplementary Figures S12-S15; Updated Supplementary Tables S3 and S4; Added Supplementary Tables S9-S12

  • https://github.com/michaelkyu/plasx

  • https://github.com/michaelkyu/mobmess

References

  1.
    Frost, L. S., Leplae, R., Summers, A. O. & Toussaint, A. Mobile genetic elements: the agents of open source evolution. Nat. Rev. Microbiol. 3, 722–732 (2005).
  2.
    Kumar, D., Prajapati, H. K., Mahilkar, A., Ma, C.-H., Mittal, P., Jayaram, M. & Ghosh, S. K. The selfish yeast plasmid utilizes the condensin complex and condensed chromatin for faithful partitioning. PLoS Genet. 17, e1009660 (2021).
  3.
    Kazlauskas, D., Varsani, A., Koonin, E. V. & Krupovic, M. Multiple origins of prokaryotic and eukaryotic single-stranded DNA viruses from bacterial and archaeal plasmids. Nat. Commun. 10, 1–12 (2019).
  4.
    del Solar, G., Giraldo, R., Ruiz-Echevarría, M. J., Espinosa, M. & Díaz-Orejas, R. Replication and Control of Circular Bacterial Plasmids. Microbiol. Mol. Biol. Rev. 62, 434 (1998).
  5.
    Khan, S. A. Rolling-circle replication of bacterial plasmids. Microbiology and molecular biology reviews : MMBR 61, 442–455 Preprint at https://doi.org/10.1128/.61.4.442-455.1997 (1997)
  6.
    Lilly, J. & Camps, M. Mechanisms of Theta Plasmid Replication. Microbiol Spectr 3, PLAS–0029–2014 (2015).
  7.
    Thomas, C. M. Horizontal Gene Pool: Bacterial Plasmids and Gene Spread. (CRC Press, 2003).
  8.
    Summers, D. The Biology of Plasmids. (1993).
  9.
    Jacob, A. E. & Hobbs, S. J. Conjugal transfer of plasmid-borne multiple antibiotic resistance in Streptococcus faecalis var. zymogenes. J. Bacteriol. 117, 360–372 (1974).
  10.
    Poyart-Salmeron, C., Carlier, C., Trieu-Cuot, P., Courtieu, A. L. & Courvalin, P. Transferable plasmid-mediated antibiotic resistance in Listeria monocytogenes. Lancet 335, 1422–1426 (1990).
  11.
    Lan, R., Stevenson, G. & Reeves, P. R. Comparison of Two Major Forms of the Shigella Virulence Plasmid pINV: Positive Selection Is a Major Force Driving the Divergence. Infect. Immun. 71, 6298 (2003).
  12.
    Meletzus, D., Bermphol, A., Dreier, J. & Eichenlaub, R. Evidence for plasmid-encoded virulence factors in the phytopathogenic bacterium Clavibacter michiganensis subsp. michiganensis NCPPB382. J. Bacteriol. 175, 2131–2136 (1993).
  13.
    Sen, D., Van der Auwera, G. A., Rogers, L. M., Thomas, C. M., Brown, C. J. & Top, E. M. Broad-host-range plasmids from agricultural soils have IncP-1 backbones with diverse accessory genes. Appl. Environ. Microbiol. 77, 7975–7983 (2011).
  14.
    Holt, K. E., Thomson, N. R., Wain, J., Phan, M. D., Nair, S., Hasan, R., Bhutta, Z. A., Quail, M. A., Norbertczak, H., Walker, D., Dougan, G. & Parkhill, J. Multidrug-resistant Salmonella enterica serovar paratyphi A harbors IncHI1 plasmids similar to those found in serovar typhi. J. Bacteriol. 189, 4257–4264 (2007).
  15.
    Fernandez-Lopez, R., Redondo, S., Garcillan-Barcia, M. P. & de la Cruz, F. Towards a taxonomy of conjugative plasmids. Curr. Opin. Microbiol. 38, 106–113 (2017).
  16.
    Oliva, M., Calia, C., Ferrara, M., D’Addabbo, P., Scrascia, M., Mulè, G., Monno, R. & Pazzani, C. Antimicrobial resistance gene shuffling and a three-element mobilisation system in the monophasic Salmonella typhimurium strain ST1030. Plasmid 111, 102532 Preprint at https://doi.org/10.1016/j.plasmid.2020.102532 (2020)
  17.
    Orlek, A., Stoesser, N., Anjum, M. F., Doumith, M., Ellington, M. J., Peto, T., Crook, D., Woodford, N., Walker, A. S., Phan, H. & Sheppard, A. E. Plasmid Classification in an Era of Whole-Genome Sequencing: Application in Studies of Antibiotic Resistance Epidemiology. Front. Microbiol. 0, (2017).
  18.
    Norberg, P., Bergström, M., Jethava, V., Dubhashi, D. & Hermansson, M. The IncP-1 plasmid backbone adapts to different host bacterial species and evolves through homologous recombination. Nat. Commun. 2, 268 (2011).
  19.
    Heuer, H. & Smalla, K. Plasmids foster diversification and adaptation of bacterial populations in soil. FEMS Microbiol. Rev. 36, 1083–1104 (2012).
  20.
    Sota, M., Yano, H., Hughes, J. M., Daughdrill, G. W., Abdo, Z., Forney, L. J. & Top, E. M. Shifts in the host range of a promiscuous plasmid through parallel evolution of its replication initiation protein. ISME J. 4, 1568–1580 (2010).
  21.
    Handelsman, J. Metagenomics: application of genomics to uncultured microorganisms. Microbiol. Mol. Biol. Rev. 68, 669–685 (2004).
  22.
    Chen, L.-X., Anantharaman, K., Shaiber, A., Eren, A. M. & Banfield, J. F. Accurate and complete genomes from metagenomes. Genome Res. 30, 315–333 (2020).
  23.
    The Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012).
  24.
    Manor, O., Dai, C. L., Kornilov, S. A., Smith, B., Price, N. D., Lovejoy, J. C., Gibbons, S. M. & Magis, A. T. Health and disease markers correlate with gut microbiome composition across thousands of people. Nat. Commun. 11, 1–12 (2020).
  25.
    de Vos, W. M., Tilg, H., Van Hul, M. & Cani, P. D. Gut microbiome and health: mechanistic insights. Gut 71, 1020–1032 (2022).
  26.
    Smalla, K., Jechalke, S. & Top, E. M. Plasmid detection, characterization and ecology. Microbiology spectrum 3, (2015).
  27.
    Jones, B. V. & Marchesi, J. R. Transposon-aided capture (TRACA) of plasmids resident in the human gut mobile metagenome. Nat. Methods 4, 55–61 (2007).
  28.
    Delaney, S., Murphy, R. & Walsh, F. A Comparison of Methods for the Extraction of Plasmids Capable of Conferring Antibiotic Resistance in a Human Pathogen From Complex Broiler Cecal Samples. Front. Microbiol. 9, 1731 (2018).
  29.
    Brown Kav, A., Sasson, G., Jami, E., Doron-Faigenboim, A., Benhar, I. & Mizrahi, I. Insights into the bovine rumen plasmidome. Proc. Natl. Acad. Sci. U. S. A. 109, 5452–5457 (2012).
  30.
    Krawczyk, P. S., Lipinski, L. & Dziembowski, A. PlasFlow: predicting plasmid sequences in metagenomic data using genome signatures. Nucleic Acids Res. 46, e35 (2018).
  31.
    Pellow, D., Mizrahi, I. & Shamir, R. PlasClass improves plasmid sequence classification. PLoS Comput. Biol. 16, e1007781 (2020).
  32.
    Antipov, D., Raiko, M., Lapidus, A. & Pevzner, P. A. Plasmid detection and assembly in genomic and metagenomic data sets. Genome Res. 29, 961–968 (2019).
  33.
    Hou, S., Cheng, S., Chen, T., Fuhrman, J. A. & Sun, F. DeepMicrobeFinder Sorts Metagenomes into Prokaryotes, Eukaryotes and Viruses, with Marine Applications. Research Square (2021). doi:10.21203/rs.3.rs-1016976/v1
  34.
    Arredondo-Alonso, S., Willems, R. J., van Schaik, W. & Schürch, A. C. On the (im)possibility of reconstructing plasmids from whole-genome short-read sequencing data. Microb Genom 3, e000128 (2017).
  35.
    Andreopoulos, W. B., Geller, A. M., Lucke, M., Balewski, J., Clum, A., Ivanova, N. N. & Levy, A. Deeplasmid: deep learning accurately separates plasmids from bacterial chromosomes. Nucleic Acids Res. 50, e17 (2022).
  36.
    Zhou, F. & Xu, Y. cBar: a computer program to distinguish plasmid-derived from chromosome-derived sequence fragments in metagenomics data. Bioinformatics 26, 2051–2052 (2010).
  37.
    Carattoli, A., Zankari, E., García-Fernández, A., Larsen, M. V., Lund, O., Villa, L., Aarestrup, F. M. & Hasman, H. In Silico Detection and Typing of Plasmids using PlasmidFinder and Plasmid Multilocus Sequence Typing. Antimicrobial Agents and Chemotherapy 58, 3895–3903 Preprint at https://doi.org/10.1128/aac.02412-14 (2014)
  38.
    Robertson, J. & Nash, J. H. E. MOB-suite: software tools for clustering, reconstruction and typing of plasmids from draft assemblies. Microb Genom 4, (2018).
  39.
    Garcillán-Barcia, M. P., Francia, M. V. & de la Cruz, F. The diversity of conjugative relaxases and its application in plasmid classification. FEMS Microbiol. Rev. 33, 657–687 (2009).
  40.
    Rozov, R., Brown Kav, A., Bogumil, D., Shterzer, N., Halperin, E., Mizrahi, I. & Shamir, R. Recycler: an algorithm for detecting plasmids from de novo assembly graphs. Bioinformatics 33, 475–482 (2017).
  41.
    Pellow, D., Zorea, A., Probst, M., Furman, O., Segal, A., Mizrahi, I. & Shamir, R. SCAPP: an algorithm for improved plasmid assembly in metagenomes. Microbiome 9, 144 (2021).
  42.
    Galperin, M. Y., Makarova, K. S., Wolf, Y. I. & Koonin, E. V. Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res. 43, D261–9 (2015).
  43.
    El-Gebali, S., Mistry, J., Bateman, A., Eddy, S. R., Luciani, A., Potter, S. C., Qureshi, M., Richardson, L. J., Salazar, G. A., Smart, A., Sonnhammer, E. L. L., Hirsh, L., Paladin, L., Piovesan, D., Tosatto, S. C. E. & Finn, R. D. The Pfam protein families database in 2019. Nucleic Acids Res. 47, D427–D432 (2019).
  44.
    Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 35, 1026–1028 Preprint at https://doi.org/10.1038/nbt.3988 (2017)
  45.
    Fang, Z., Tan, J., Wu, S., Li, M., Xu, C., Xie, Z. & Zhu, H. PPR-Meta: a tool for identifying phages and plasmids from metagenomic fragments using deep learning. Gigascience 8, (2019).
  46.
    Schwengers, O., Barth, P., Falgenhauer, L., Hain, T., Chakraborty, T. & Goesmann, A. Platon: identification and characterization of bacterial plasmid contigs in short-read draft assemblies exploiting protein sequence-based replicon distribution scores. Microb Genom 6, (2020).
  47.
    Wozniak, R. A. F. & Waldor, M. K. Integrative and conjugative elements: mosaic mobile genetic elements enabling dynamic lateral gene flow. Nat. Rev. Microbiol. 8, 552–563 (2010).
  48.
    Wang, G. H., Sun, B. F., Xiong, T. L., Wang, Y. K., Murfin, K. E., Xiao, J. H. & Huang, D. W. Bacteriophage WO Can Mediate Horizontal Gene Transfer in Endosymbiotic Wolbachia Genomes. Front. Microbiol. 0, (2016).
  49.
    Cuecas, A., Kanoksilapatham, W. & Gonzalez, J. M. Evidence of horizontal gene transfer by transposase gene analyses in Fervidobacterium species. PLoS One 12, e0173961 (2017).
  50.
    Garcillán-Barcia, M. P., Alvarado, A. & de la Cruz, F. Identification of bacterial plasmids based on mobility and plasmid population biology. FEMS Microbiol. Rev. 35, 936–956 (2011).
  51.
    Liu, M., Li, X., Xie, Y., Bi, D., Sun, J., Li, J., Tai, C., Deng, Z. & Ou, H.-Y. ICEberg 2.0: an updated database of bacterial integrative and conjugative elements. Nucleic Acids Res. 47, D660–D665 (2019).
  52.
    Reveillaud, J., Bordenstein, S. R., Cruaud, C., Shaiber, A., Esen, Ö. C., Weill, M., Makoundou, P., Lolans, K., Watson, A. R., Rakotoarivony, I., Bordenstein, S. R. & Eren, A. M. The Wolbachia mobilome in Culex pipiens includes a putative plasmid. Nat. Commun. 10, 1–11 (2019).
  53.
    Sukupolvi, S. & O’Connor, C. D. TraT lipoprotein, a plasmid-specified mediator of interactions between gram-negative bacteria and their environment. Microbiol. Rev. 54, 331–341 (1990).
  54.
    Norris, S. J., Carter, C. J., Howell, J. K. & Barbour, A. G. Low-passage-associated proteins of Borrelia burgdorferi B31: characterization and molecular cloning of OspD, a surface-exposed, plasmid-encoded lipoprotein. Infect. Immun. 60, 4662–4672 (1992).
  55.
    Jalal, A. S. B. & Le, T. B. K. Bacterial chromosome segregation by the ParABS system. Open Biol. 10, 200097 (2020).
  56.
    Bouet, J.-Y. & Funnell, B. E. Plasmid Localization and Partition in Enterobacteriaceae. EcoSal Plus 8, (2019).
  57.
    Carr, V. R., Shkoporov, A., Hill, C., Mullany, P. & Moyes, D. L. Probing the Mobilome: Discoveries in the Dynamic Microbiome. Trends Microbiol. 29, 158–170 (2021).
  58.
    Meinhardt, F., Schaffrath, R. & Larsen, M. Microbial linear plasmids. Appl. Microbiol. Biotechnol. 47, 329–336 (1997).
  59.
    Chen, Z., Zhong, L., Shen, M., Fang, P. & Qin, Z. Characterization of Streptomyces plasmid-phage pFP4 and its evolutionary implications. Plasmid 68, 170–178 (2012).
  60.
    Oliva, M. A., Martin-Galiano, A. J., Sakaguchi, Y. & Andreu, J. M. Tubulin homolog TubZ in a phage-encoded partition system. Proc. Natl. Acad. Sci. U. S. A. 109, (2012).
  61.
    Dokland, T. Molecular Piracy: Redirection of Bacteriophage Capsid Assembly by Mobile Genetic Elements. Viruses 11, (2019).
  62.
    Pfeifer, E., Moura de Sousa, J. A., Touchon, M. & Rocha, E. P. C. Bacteria have numerous distinctive groups of phage–plasmids with conserved phage and variable plasmid gene repertoires. Nucleic Acids Res. 49, 2655–2673 (2021).
  63.
    Vineis, J. H., Ringus, D. L., Morrison, H. G., Delmont, T. O., Dalal, S., Raffals, L. H., Antonopoulos, D. A., Rubin, D. T., Eren, A. M., Chang, E. B. & Sogin, M. L. Patient-Specific Bacteroides Genome Variants in Pouchitis. MBio 7, (2016).
  64.
    Monaghan, T. M., Sloan, T. J., Stockdale, S. R., Blanchard, A. M., Emes, R. D., Wilcox, M., Biswas, R., Nashine, R., Manke, S., Gandhi, J., Jain, P., Bhotmange, S., Ambalkar, S., Satav, A., Draper, L. A., Hill, C. & Kashyap, R. S. Metagenomics reveals impact of geography and acute diarrheal disease on the Central Indian human gut microbiome. Gut Microbes 12, 1752605 (2020).
  65.
    Mancabelli, L., Milani, C., Lugli, G. A., Turroni, F., Ferrario, C., van Sinderen, D. & Ventura, M. Meta-analysis of the human gut microbiome from urbanized and pre-agricultural populations. Environ. Microbiol. 19, 1379–1390 (2017).
  66.
    Pasolli, E., Asnicar, F., Manara, S., Zolfo, M., Karcher, N., Armanini, F., Beghini, F., Manghi, P., Tett, A., Ghensi, P., Collado, M. C., Rice, B. L., DuLong, C., Morgan, X. C., Golden, C. D., Quince, C., Huttenhower, C. & Segata, N. Extensive Unexplored Human Microbiome Diversity Revealed by Over 150,000 Genomes from Metagenomes Spanning Age, Geography, and Lifestyle. Cell 176, 649–662.e20 (2019).
  67.
    Yatsunenko, T., Rey, F. E., Manary, M. J., Trehan, I., Dominguez-Bello, M. G., Contreras, M., Magris, M., Hidalgo, G., Baldassano, R. N., Anokhin, A. P., Heath, A. C., Warner, B., Reeder, J., Kuczynski, J., Caporaso, J. G., Lozupone, C. A., Lauber, C., Clemente, J. C., Knights, D., Knight, R. & Gordon, J. I. Human gut microbiome viewed across age and geography. Nature 486, 222–227 (2012).
  68.
    Klümper, U., Riber, L., Dechesne, A., Sannazzarro, A., Hansen, L. H., Sørensen, S. J. & Smets, B. F. Broad host range plasmids can invade an unexpectedly diverse fraction of a soil bacterial community. ISME J. 9, (2015).
  69.
    Kohler, V., Vaishampayan, A. & Grohmann, E. Broad-host-range Inc18 plasmids: Occurrence, spread and transfer mechanisms. Plasmid 99, 11–21 (2018).
  70.
    Bishé, B., Taton, A. & Golden, J. W. Modification of RSF1010-Based Broad-Host-Range Plasmids for Improved Conjugation and Cyanobacterial Bioprospecting. iScience 20, 216–228 (2019).
  71.
    Lloyd-Price, J., Arze, C., Ananthakrishnan, A. N., Schirmer, M., Avila-Pacheco, J., Poon, T. W., Andrews, E., Ajami, N. J., Bonham, K. S., Brislawn, C. J., Casero, D., Courtney, H., Gonzalez, A., Graeber, T. G., Hall, A. B., Lake, K., Landers, C. J., Mallick, H., Plichta, D. R., Prasad, M., Rahnavard, G., Sauk, J., Shungin, D., Vázquez-Baeza, Y., White, R. A., 3rd., IBDMDB Investigators, Braun, J., Denson, L. A., Jansson, J. K., Knight, R., Kugathasan, S., McGovern, D. P. B., Petrosino, J. F., Stappenbeck, T. S., Winter, H. S., Clish, C. B., Franzosa, E. A., Vlamakis, H., Xavier, R. J. & Huttenhower, C. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569, 655–662 (2019).
  72.
    Schloss, P. D. Identifying and Overcoming Threats to Reproducibility, Replicability, Robustness, and Generalizability in Microbiome Research. MBio 9, (2018).
  73.
    McInnes, L., Healy, J. & Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv [stat.ML] (2018). at <http://arxiv.org/abs/1802.03426>
  74.
    Friedman, J. & Alm, E. J. Inferring correlation networks from genomic survey data. PLoS Comput. Biol. 8, e1002687 (2012).
  75.
    Fernández-López, R., Garcillán-Barcia, M. P., Revilla, C., Lázaro, M., Vielva, L. & de la Cruz, F. Dynamics of the IncW genetic backbone imply general trends in conjugative plasmid evolution. FEMS Microbiol. Rev. 30, 942–966 (2006).
  76.
    Garcillán-Barcia, M. P., Ruiz del Castillo, B., Alvarado, A., de la Cruz, F. & Martínez-Martínez, L. Degenerate primer MOB typing of multiresistant clinical isolates of E. coli uncovers new plasmid backbones. Plasmid 77, 17–27 (2015).
  77.
    Carattoli, A., Bertini, A., Villa, L., Falbo, V., Hopkins, K. L. & Threlfall, E. J. Identification of plasmids by PCR-based replicon typing. J. Microbiol. Methods 63, 219–228 (2005). https://doi.org/10.1016/j.mimet.2005.03.018
  78.
    Carattoli, A., Zankari, E., García-Fernández, A., Larsen, M. V., Lund, O., Villa, L., Aarestrup, F. M. & Hasman, H. In silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing. Antimicrob. Agents Chemother. 58, 3895–3903 (2014).
  79.
    Bobay, L.-M. & Ochman, H. Biological species in the viral world. Proc. Natl. Acad. Sci. U. S. A. 115, 6040–6045 (2018).
  80.
    Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018).
  81.
    Redondo-Salvo, S., Fernández-López, R., Ruiz, R., Vielva, L., de Toro, M., Rocha, E. P. C., Garcillán-Barcia, M. P. & de la Cruz, F. Pathways for horizontal gene transfer in bacteria revealed by a global map of their plasmids. Nat. Commun. 11, 3602 (2020).
  82.
    Acman, M., van Dorp, L., Santini, J. M. & Balloux, F. Large-scale network analysis captures biological features of bacterial plasmids. Nat. Commun. 11, 1–11 (2020).
  83.
    Shannon, P., Markiel, A., Ozier, O., Baliga, N. S., Wang, J. T., Ramage, D., Amin, N., Schwikowski, B. & Ideker, T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
  84.
    Heer, J., Card, S. K. & Landay, J. A. prefuse: a toolkit for interactive information visualization. in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems 421–430 (Association for Computing Machinery, 2005).
  85.
    Vrancianu, C. O., Popa, L. I., Bleotu, C. & Chifiriuc, M. C. Targeting Plasmids to Limit Acquisition and Transmission of Antimicrobial Resistance. Front. Microbiol. 11, 761 (2020).
  86.
    MacLean, R. C. & San Millan, A. The evolution of antibiotic resistance. Science 365, 1082–1083 (2019).
  87.
    World Health Organization. WHO report on surveillance of antibiotic consumption: 2016-2018 early implementation. (World Health Organization, 2018). at <https://apps.who.int/iris/bitstream/handle/10665/277359/9789241514880-eng.pdf>
  88.
    Centers for Disease Control and Prevention. Antibiotic Use in the United States, 2021 Update: Progress and Opportunities. (2021). at <https://www.cdc.gov/antibiotic-use/stewardship-report/current.html>
  89.
    Gehrig, S., Eberle, M.-E., Botschen, F., Rimbach, K., Eberle, F., Eigenbrod, T., Kaiser, S., Holmes, W. M., Erdmann, V. A., Sprinzl, M., Bec, G., Keith, G., Dalpke, A. H. & Helm, M. Identification of modifications in microbial, native tRNA that suppress immunostimulatory activity. J. Exp. Med. 209, 225–233 (2012).
  90.
    Galvanin, A., Vogt, L.-M., Grober, A., Freund, I., Ayadi, L., Bourguignon-Igel, V., Bessler, L., Jacob, D., Eigenbrod, T., Marchand, V., Dalpke, A., Helm, M. & Motorin, Y. Bacterial tRNA 2’-O-methylation is dynamically regulated under stress conditions and modulates innate immune response. Nucleic Acids Res. 48, 12833–12844 (2020).
  91.
    Embers, M. E., Alvarez, X., Ooms, T. & Philipp, M. T. The failure of immune response evasion by linear plasmid 28-1-deficient Borrelia burgdorferi is attributable to persistent expression of an outer surface protein. Infect. Immun. 76, 3984–3991 (2008).
  92.
    Gupta, V. K., Paul, S. & Dutta, C. Geography, Ethnicity or Subsistence-Specific Variations in Human Microbiome Composition and Diversity. Front. Microbiol. 8, 1162 (2017). https://doi.org/10.3389/fmicb.2017.01162
  93.
    Obregon-Tito, A. J., Tito, R. Y., Metcalf, J., Sankaranarayanan, K., Clemente, J. C., Ursell, L. K., Xu, Z. Z., Van Treuren, W., Knight, R., Gaffney, P. M., Spicer, P., Lawson, P., Marin-Reyes, L., Trujillo-Villarroel, O., Foster, M., Guija-Poma, E., Troncoso-Corzo, L., Warinner, C., Ozga, A. T. & Lewis, C. M. Subsistence strategies in traditional societies distinguish gut microbiomes. Nat. Commun. 6, 1–9 (2015).
  94.
    Gomez, A., Petrzelkova, K. J., Burns, M. B., Yeoman, C. J., Amato, K. R., Vlckova, K., Modry, D., Todd, A., Jost Robinson, C. A., Remis, M. J., Torralba, M. G., Morton, E., Umaña, J. D., Carbonero, F., Gaskins, H. R., Nelson, K. E., Wilson, B. A., Stumpf, R. M., White, B. A., Leigh, S. R. & Blekhman, R. Gut Microbiome of Coexisting BaAka Pygmies and Bantu Reflects Gradients of Traditional Subsistence Patterns. Cell Rep. 14, 2142–2153 (2016).
  95.
    Li, J., Jia, H., Cai, X., Zhong, H., Feng, Q., Sunagawa, S., Arumugam, M., Kultima, J. R., Prifti, E., Nielsen, T., Juncker, A. S., Manichanh, C., Chen, B., Zhang, W., Levenez, F., Wang, J., Xu, X., Xiao, L., Liang, S., Zhang, D., Zhang, Z., Chen, W., Zhao, H., Al-Aama, J. Y., Edris, S., Yang, H., Wang, J., Hansen, T., Nielsen, H. B., Brunak, S., Kristiansen, K., Guarner, F., Pedersen, O., Doré, J., Ehrlich, S. D., Bork, P. & Wang, J. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. 32, 834–841 (2014).
  96.
    Xia, Y., Zhu, Y., Li, Q. & Lu, J. Human gut resistome can be country-specific. PeerJ 7, e6389 (2019).
  97.
    Sonnenburg, E. D. & Sonnenburg, J. L. The ancestral and industrialized gut microbiota and implications for human health. Nat. Rev. Microbiol. 17, 383–390 (2019).
  98.
    Thomas, C. M. Plasmid Incompatibility. Molecular Life Sciences 1–3 (2014). https://doi.org/10.1007/978-1-4614-6436-5_565-2
  99.
    Novick, R. P. Plasmid incompatibility. Microbiol. Rev. 51, 381–395 (1987). https://doi.org/10.1128/mmbr.51.4.381-395.1987
  100.
    Velappan, N., Sblattero, D., Chasteen, L., Pavlik, P. & Bradbury, A. R. M. Plasmid incompatibility: more compatible than previously thought? Protein Eng. Des. Sel. 20, 309–313 (2007).
  101.
    Svara, F. & Rankin, D. J. The evolution of plasmid-carried antibiotic resistance. BMC Evol. Biol. 11, 130 (2011).
  102.
    Sykes, R. The 2009 Garrod lecture: the evolution of antimicrobial resistance: a Darwinian perspective. J. Antimicrob. Chemother. 65, 1842–1852 (2010).
  103.
    Cantón, R. & Morosini, M.-I. Emergence and spread of antibiotic resistance following exposure to antibiotics. FEMS Microbiol. Rev. 35, 977–991 (2011).
  104.
    Baquero, F. Low-level antibacterial resistance: a gateway to clinical resistance. Drug Resist. Updat. 4, 93–105 (2001).
  105.
    San Millan, A., Escudero, J. A., Gifford, D. R., Mazel, D. & MacLean, R. C. Multicopy plasmids potentiate the evolution of antibiotic resistance in bacteria. Nat Ecol Evol 1, 10 (2016).
  106.
    Alonso, A., Sanchez, P. & Martinez, J. L. Environmental selection of antibiotic resistance genes. Minireview. Environ. Microbiol. 3, 1–9 (2001). https://doi.org/10.1046/j.1462-2920.2001.00161.x
  107.
    Xiong, W., Sun, Y., Ding, X., Wang, M. & Zeng, Z. Selective pressure of antibiotics on ARGs and bacterial communities in manure-polluted freshwater-sediment microcosms. Front. Microbiol. 6, (2015).
  108.
    Ma, H. & Bryers, J. D. Non-invasive determination of conjugative transfer of plasmids bearing antibiotic-resistance genes in biofilm-bound bacteria: effects of substrate loading and antibiotic selection. Appl. Microbiol. Biotechnol. 97, 317–328 (2012).
  109.
    Berendsen, B., Stolker, L., de Jong, J., Nielen, M., Tserendorj, E., Sodnomdarjaa, R., Cannavan, A. & Elliott, C. Evidence of natural occurrence of the banned antibiotic chloramphenicol in herbs and grass. Anal. Bioanal. Chem. 397, 1955 (2010).
  110.
    Both, L., Botgros, R. & Cavaleri, M. Analysis of licensed over-the-counter (OTC) antibiotics in the European Union and Norway, 2012. Euro Surveill. 20, 30002 (2015).
  111.
    Balbi, H. J. Chloramphenicol: A Review. Pediatr. Rev. 25, 284–288 (2004). https://doi.org/10.1542/pir.25-8-284
  112.
    Ministry of Health and Medical Services, Government of Fiji. Fiji Antibiotic Guidelines. (Government of Fiji, 2019).
  113.
    Delmont, T. O., Quince, C., Shaiber, A., Esen, Ö. C., Lee, S. T., Rappé, M. S., McLellan, S. L., Lücker, S. & Eren, A. M. Nitrogen-fixing populations of Planctomycetes and Proteobacteria are abundant in surface ocean metagenomes. Nat Microbiol 3, 804–813 (2018).
  114.
    van Kessel, M. A. H. J., Speth, D. R., Albertsen, M., Nielsen, P. H., Op den Camp, H. J. M., Kartal, B., Jetten, M. S. M. & Lücker, S. Complete nitrification by a single microorganism. Nature 528, 555–559 (2015).
  115.
    Edwards, R. A., Vega, A. A., Norman, H. M., Ohaeri, M., Levi, K., Dinsdale, E. A., Cinek, O., Aziz, R. K., McNair, K., Barr, J. J., Bibby, K., Brouns, S. J. J., Cazares, A., de Jonge, P. A., Desnues, C., Díaz Muñoz, S. L., Fineran, P. C., Kurilshikov, A., Lavigne, R., Mazankova, K., McCarthy, D. T., Nobrega, F. L., Reyes Muñoz, A., Tapia, G., Trefault, N., Tyakht, A. V., Vinuesa, P., Wagemans, J., Zhernakova, A., Aarestrup, F. M., Ahmadov, G., Alassaf, A., Anton, J., Asangba, A., Billings, E. K., Cantu, V. A., Carlton, J. M., Cazares, D., Cho, G.-S., Condeff, T., Cortés, P., Cranfield, M., Cuevas, D. A., De la Iglesia, R., Decewicz, P., Doane, M. P., Dominy, N. J., Dziewit, L., Elwasila, B. M., Eren, A. M., Franz, C., Fu, J., Garcia-Aljaro, C., Ghedin, E., Gulino, K. M., Haggerty, J. M., Head, S. R., Hendriksen, R. S., Hill, C., Hyöty, H., Ilina, E. N., Irwin, M. T., Jeffries, T. C., Jofre, J., Junge, R. E., Kelley, S. T., Khan Mirzaei, M., Kowalewski, M., Kumaresan, D., Leigh, S. R., Lipson, D., Lisitsyna, E. S., Llagostera, M., Maritz, J. M., Marr, L. C., McCann, A., Molshanski-Mor, S., Monteiro, S., Moreira-Grez, B., Morris, M., Mugisha, L., Muniesa, M., Neve, H., Nguyen, N.-P., Nigro, O. D., Nilsson, A. S., O’Connell, T., Odeh, R., Oliver, A., Piuri, M., Prussin, A. J., Ii, Qimron, U., Quan, Z.-X., Rainetova, P., Ramírez-Rojas, A., Raya, R., Reasor, K., Rice, G. A. O., Rossi, A., Santos, R., Shimashita, J., Stachler, E. N., Stene, L. C., Strain, R., Stumpf, R., Torres, P. J., Twaddle, A., Ugochi Ibekwe, M., Villagra, N., Wandro, S., White, B., Whiteley, A., Whiteson, K. L., Wijmenga, C., Zambrano, M. M., Zschach, H. & Dutilh, B. E. Global phylogeography and ancient evolution of the widespread human gut virus crAssphage. Nat Microbiol 4, 1727–1736 (2019).
  116.
    Hug, L. A., Baker, B. J., Anantharaman, K., Brown, C. T., Probst, A. J., Castelle, C. J., Butterfield, C. N., Hernsdorf, A. W., Amano, Y., Ise, K., Suzuki, Y., Dudek, N., Relman, D. A., Finstad, K. M., Amundson, R., Thomas, B. C. & Banfield, J. F. A new view of the tree of life. Nat Microbiol 1, 16048 (2016).
  117.
    Royer, G., Decousser, J. W., Branger, C., Dubois, M., Médigue, C., Denamur, E. & Vallenet, D. PlaScope: a targeted approach to assess the plasmidome from genome assemblies at the species level. Microb Genom 4, (2018).
  118.
    Arredondo-Alonso, S., Rogers, M. R. C., Braat, J. C., Verschuuren, T. D., Top, J., Corander, J., Willems, R. J. L. & Schürch, A. C. Mlplasmids: A user-friendly tool to predict plasmid- and chromosome-derived sequences for single species. Microb. Genom. 4, (2018).
  119.
    Gomi, R., Wyres, K. L. & Holt, K. E. Detection of plasmid contigs in draft genome assemblies using customized Kraken databases. Microb Genom 7, (2021).
  120.
    Guo, J., Bolduc, B., Zayed, A. A., Varsani, A., Dominguez-Huerta, G., Delmont, T. O., Pratama, A. A., Gazitúa, M. C., Vik, D., Sullivan, M. B. & Roux, S. VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome 9, 37 (2021).
  121.
    Al-Shayeb, B., Sachdeva, R., Chen, L.-X., Ward, F., Munk, P., Devoto, A., Castelle, C. J., Olm, M. R., Bouma-Gregson, K., Amano, Y., He, C., Méheust, R., Brooks, B., Thomas, A., Lavy, A., Matheus-Carnevali, P., Sun, C., Goltsman, D. S. A., Borton, M. A., Sharrar, A., Jaffe, A. L., Nelson, T. C., Kantor, R., Keren, R., Lane, K. R., Farag, I. F., Lei, S., Finstad, K., Amundson, R., Anantharaman, K., Zhou, J., Probst, A. J., Power, M. E., Tringe, S. G., Li, W.-J., Wrighton, K., Harrison, S., Morowitz, M., Relman, D. A., Doudna, J. A., Lehours, A.-C., Warren, L., Cate, J. H. D., Santini, J. M. & Banfield, J. F. Clades of huge phages from across Earth’s ecosystems. Nature 578, 425–431 (2020).
  122.
    Shkoporov, A. N. & Hill, C. Bacteriophages of the Human Gut: The ‘Known Unknown’ of the Microbiome. Cell Host Microbe 25, 195–209 (2019).
  123.
    Antipov, D., Raiko, M., Lapidus, A. & Pevzner, P. A. Metaviral SPAdes: assembly of viruses from metagenomic data. Bioinformatics 36, 4126–4129 (2020).
  124.
    Smillie, C. S., Smith, M. B., Friedman, J., Cordero, O. X., David, L. A. & Alm, E. J. Ecology drives a global network of gene exchange connecting the human microbiome. Nature 480, 241–244 (2011).
  125.
    Groussin, M., Poyet, M., Sistiaga, A., Kearney, S. M., Moniz, K., Noel, M., Hooker, J., Gibbons, S. M., Segurel, L., Froment, A., Mohamed, R. S., Fezeu, A., Juimo, V. A., Lafosse, S., Tabe, F. E., Girard, C., Iqaluk, D., Nguyen, L. T. T., Shapiro, B. J., Lehtimäki, J., Ruokolainen, L., Kettunen, P. P., Vatanen, T., Sigwazi, S., Mabulla, A., Domínguez-Rodrigo, M., Nartey, Y. A., Agyei-Nkansah, A., Duah, A., Awuku, Y. A., Valles, K. A., Asibey, S. O., Afihene, M. Y., Roberts, L. R., Plymoth, A., Onyekwere, C. A., Summons, R. E., Xavier, R. J. & Alm, E. J. Elevated rates of horizontal gene transfer in the industrialized human microbiome. Cell 184, 2053–2067.e18 (2021).
  126.
    Brito, I. L., Yilmaz, S., Huang, K., Xu, L., Jupiter, S. D., Jenkins, A. P., Naisilisili, W., Tamminen, M., Smillie, C. S., Wortman, J. R., Birren, B. W., Xavier, R. J., Blainey, P. C., Singh, A. K., Gevers, D. & Alm, E. J. Mobile genes in the human microbiome are structured from global to individual scales. Nature 535, 435 (2016).
  127.
    Galata, V., Fehlmann, T., Backes, C. & Keller, A. PLSDB: a resource of complete bacterial plasmids. Nucleic Acids Res. 47, D195–D202 (2019).
  128.
    Shaiber, A., Willis, A. D., Delmont, T. O., Roux, S., Chen, L.-X., Schmid, A. C., Yousef, M., Watson, A. R., Lolans, K., Esen, Ö. C., Lee, S. T. M., Downey, N., Morrison, H. G., Dewhirst, F. E., Mark Welch, J. L. & Eren, A. M. Functional and genetic markers of niche partitioning among enigmatic members of the human oral microbiome. Genome Biol. 21, 292 (2020).
  129.
    Eren, A. M., Kiefl, E., Shaiber, A., Veseli, I., Miller, S. E., Schechter, M. S., Fink, I., Pan, J. N., Yousef, M., Fogarty, E. C., Trigodet, F., Watson, A. R., Esen, Ö. C., Moore, R. M., Clayssen, Q., Lee, M. D., Kivenson, V., Graham, E. D., Merrill, B. D., Karkman, A., Blankenberg, D., Eppley, J. M., Sjödin, A., Scott, J. J., Vázquez-Campos, X., McKay, L. J., McDaniel, E. A., Stevens, S. L. R., Anderson, R. E., Fuessel, J., Fernandez-Guerra, A., Maignien, L., Delmont, T. O. & Willis, A. D. Community-led, integrated, reproducible multi-omics with anvi’o. Nat Microbiol 6, 3–6 (2021).
  130.
    Köster, J. & Rahmann, S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012).
  131.
    Hyatt, D., Chen, G.-L., Locascio, P. F., Land, M. L., Larimer, F. W. & Hauser, L. J. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).
  132.
    Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
  133.
    Eddy, S. R. Accelerated Profile HMM Searches. PLoS Comput. Biol. 7, e1002195 (2011).
  134.
    Steinegger, M. & Söding, J. Clustering huge protein sequence sets in linear time. Nat. Commun. 9, 1–8 (2018).
  135.
    Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
  136.
    Minh, B. Q., Schmidt, H. A., Chernomor, O., Schrempf, D., Woodhams, M. D., von Haeseler, A. & Lanfear, R. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol. Biol. Evol. 37, 1530–1534 (2020).
  137.
    Ondov, B. D., Treangen, T. J., Melsted, P., Mallonee, A. B., Bergman, N. H., Koren, S. & Phillippy, A. M. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132 (2016).
  138.
    Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
  139.
    Feng, Q., Liang, S., Jia, H., Stadlmayr, A., Tang, L., Lan, Z., Zhang, D., Xia, H., Xu, X., Jie, Z., Su, L., Li, X., Li, X., Li, J., Xiao, L., Huber-Schönauer, U., Niederseer, D., Xu, X., Al-Aama, J. Y., Yang, H., Wang, J., Kristiansen, K., Arumugam, M., Tilg, H., Datz, C. & Wang, J. Gut microbiome development along the colorectal adenoma–carcinoma sequence. Nat. Commun. 6, 1–13 (2015).
  140.
    David, L. A., Weil, A., Ryan, E. T., Calderwood, S. B., Harris, J. B., Chowdhury, F., Begum, Y., Qadri, F., LaRocque, R. C. & Turnbaugh, P. J. Gut Microbial Succession Follows Acute Secretory Diarrhea in Humans. MBio 6, (2015).
  141.
    Raymond, F., Ouameur, A. A., Déraspe, M., Iqbal, N., Gingras, H., Dridi, B., Leprohon, P., Plante, P.-L., Giroux, R., Bérubé, È., Frenette, J., Boudreau, D. K., Simard, J.-L., Chabot, I., Domingo, M.-C., Trottier, S., Boissinot, M., Huletsky, A., Roy, P. H., Ouellette, M., Bergeron, M. G. & Corbeil, J. The initial state of the human gut microbiome determines its reshaping by antibiotics. ISME J. 10, 707 (2016).
  142.
    Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K. S., Manichanh, C., Nielsen, T., Pons, N., Levenez, F., Yamada, T., Mende, D. R., Li, J., Xu, J., Li, S., Li, D., Cao, J., Wang, B., Liang, H., Zheng, H., Xie, Y., Tap, J., Lepage, P., Bertalan, M., Batto, J.-M., Hansen, T., Le Paslier, D., Linneberg, A., Nielsen, H. B., Pelletier, E., Renault, P., Sicheritz-Ponten, T., Turner, K., Zhu, H., Yu, C., Li, S., Jian, M., Zhou, Y., Li, Y., Zhang, X., Li, S., Qin, N., Yang, H., Wang, J., Brunak, S., Doré, J., Guarner, F., Kristiansen, K., Pedersen, O., Parkhill, J., Weissenbach, J., Bork, P., Ehrlich, S. D. & Wang, J. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59–65 (2010).
  143.
    Wen, C., Zheng, Z., Shao, T., Liu, L., Xie, Z., Le Chatelier, E., He, Z., Zhong, W., Fan, Y., Zhang, L., Li, H., Wu, C., Hu, C., Xu, Q., Zhou, J., Cai, S., Wang, D., Huang, Y., Breban, M., Qin, N. & Ehrlich, S. D. Quantitative metagenomics reveals unique gut microbiome biomarkers in ankylosing spondylitis. Genome Biol. 18, (2017).
  144.
    Le Chatelier, E., Nielsen, T., Qin, J., Prifti, E., Hildebrand, F., Falony, G., Almeida, M., Arumugam, M., Batto, J. M., Kennedy, S., Leonard, P., Li, J., Burgdorf, K., Grarup, N., Jørgensen, T., Brandslund, I., Nielsen, H. B., Juncker, A. S., Bertalan, M., Levenez, F., Pons, N., Rasmussen, S., Sunagawa, S., Tap, J., Tims, S., Zoetendal, E. G., Brunak, S., Clément, K., Doré, J., Kleerebezem, M., Kristiansen, K., Renault, P., Sicheritz-Ponten, T., de Vos, W. M., Zucker, J. D., Raes, J., Hansen, T., Bork, P., Wang, J., Ehrlich, S. D. & Pedersen, O. Richness of human gut microbiome correlates with metabolic markers. Nature 500, (2013).
  145.
    Xie, H., Guo, R., Zhong, H., Feng, Q., Lan, Z., Qin, B., Ward, K. J., Jackson, M. A., Xia, Y., Chen, X., Chen, B., Xia, H., Xu, C., Li, F., Xu, X., Al-Aama, J. Y., Yang, H., Wang, J., Kristiansen, K., Wang, J., Steves, C. J., Bell, J. T., Li, J., Spector, T. D. & Jia, H. Shotgun Metagenomics of 250 Adult Twins Reveals Genetic and Environmental Impacts on the Gut Microbiome. Cell Syst. 3, 572 (2016).
  146.
    Zeevi, D., Korem, T., Zmora, N., Israeli, D., Rothschild, D., Weinberger, A., Ben-Yacov, O., Lador, D., Avnit-Sagi, T., Lotan-Pompan, M., Suez, J., Mahdi, J. A., Matot, E., Malka, G., Kosower, N., Rein, M., Zilberman-Schapira, G., Dohnalová, L., Pevsner-Fischer, M., Bikovsky, R., Halpern, Z., Elinav, E. & Segal, E. Personalized Nutrition by Prediction of Glycemic Responses. Cell 163, 1079–1094 (2015).
  147.
    Rampelli, S., Schnorr, S. L., Consolandi, C., Turroni, S., Severgnini, M., Peano, C., Brigidi, P., Crittenden, A. N., Henry, A. G. & Candela, M. Metagenome Sequencing of the Hadza Hunter-Gatherer Gut Microbiota. Curr. Biol. 25, 1682–1693 (2015).
  148.
    Liu, W., Zhang, J., Wu, C., Cai, S., Huang, W., Chen, J., Xiaoxia, X. I., Liang, Z., Hou, Q., Zhou, B., Qin, N. & Zhang, H. Unique Features of Ethnic Mongolian Gut Microbiome revealed by metagenomic analysis. Sci. Rep. 6, (2016).
  149.
    Turnbaugh, P. J., Ley, R. E., Hamady, M., Fraser-Liggett, C. M., Knight, R. & Gordon, J. I. The Human Microbiome Project. Nature 449, 804–810 (2007).
  150.
    Shaiber, A. & Eren, A. M. Anvi’o snakemake workflows. (2018). at <http://merenlab.org/2018/07/09/anvio-snakemake-workflows/>
  151.
    Eren, A. M., Vineis, J. H., Morrison, H. G. & Sogin, M. L. A Filtering Method to Generate High Quality Short Reads Using Illumina Paired-End Technology. PLoS One 8, e66643 (2013).
  152.
    Peng, Y., Leung, H. C. M., Yiu, S. M. & Chin, F. Y. L. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28, 1420–1428 (2012).
  153.
    Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357 (2012).
  154.
    Jørgensen, T. S., Xu, Z., Hansen, M. A., Sørensen, S. J. & Hansen, L. H. Hundreds of circular novel plasmids and DNA elements identified in a rat cecum metamobilome. PLoS One 9, e87924 (2014).
  155.
    Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 1–13 (2019).
  156.
    Lu, J., Breitwieser, F. P., Thielen, P. & Salzberg, S. L. Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci. 3, e104 (2017).
  157.
    Watts, S. C., Ritchie, S. C., Inouye, M. & Holt, K. E. FastSpar: rapid and scalable correlation estimation for compositional data. Bioinformatics 35, 1064–1066 (2018).
  158.
    Csardi, G. & Nepusz, T. The igraph software package for complex network research. InterJournal Complex Systems 1695, 1–9 (2006).
  159.
    Leiserson, M. D. M., Wu, H.-T., Vandin, F. & Raphael, B. J. CoMEt: a statistical approach to identify combinations of mutually exclusive alterations in cancer. Genome Biol. 16, 160 (2015).
  160.
    Gibson, M. K., Forsberg, K. J. & Dantas, G. Improved annotation of antibiotic resistance determinants reveals microbial resistomes cluster by ecology. ISME J. 9, 207–216 (2015).
  161.
    Alcock, B. P., Raphenya, A. R., Lau, T. T. Y., Tsang, K. K., Bouchard, M., Edalatmand, A., Huynh, W., Nguyen, A.-L. V., Cheng, A. A., Liu, S., Min, S. Y., Miroshnichenko, A., Tran, H.-K., Werfalli, R. E., Nasir, J. A., Oloni, M., Speicher, D. J., Florescu, A., Singh, B., Faltyn, M., Hernandez-Koutoucheva, A., Sharma, A. N., Bordeleau, E., Pawlowski, A. C., Zubyk, H. L., Dooley, D., Griffiths, E., Maguire, F., Winsor, G. L., Beiko, R. G., Brinkman, F. S. L., Hsiao, W. W. L., Domselaar, G. V. & McArthur, A. G. CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Res. 48, D517–D525 (2020).
  162.
    Trigodet, F., Lolans, K. & Fogarty, E. High molecular weight DNA extraction strategies for long-read sequencing of complex metagenomes. Mol. Ecol. Resour. (2022). at <https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13588>
  163.
    Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., Lesin, V. M., Nikolenko, S. I., Pham, S., Prjibelski, A. D., Pyshkin, A. V., Sirotkin, A. V., Vyahhi, N., Tesler, G., Alekseyev, M. A. & Pevzner, P. A. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J. Comput. Biol. 19, 455 (2012).
Posted December 18, 2022.
The genetic and ecological landscape of plasmids in the human gut
Michael K. Yu, Emily C. Fogarty, A. Murat Eren
bioRxiv 2020.11.01.361691; doi: https://doi.org/10.1101/2020.11.01.361691
Subject Area

  • Bioinformatics