Interrogation of the integrated mobile genetic elements in gut-associated Bacteroidaceae with a consensus prediction approach

Exploration of mobile genetic element (MGE) diversity and relatedness is vital to understanding microbial communities, especially the gut microbiome, where the mobilization of antibiotic resistance and pathogenicity genes has important clinical consequences. Current MGE prediction tools are biased toward elements similar to previously-identified MGEs, especially tailed phages of proteobacterial hosts. Further, there is a need for methods to examine relatedness and gene sharing among MGEs. We present VICSIN, a consensus approach for MGE prediction and clustering of predictions to provide classification. Testing of VICSIN on datasets of Pseudomonas aeruginosa and Bacteroides fragilis genomes suggests VICSIN is the optimal approach to predict integrated MGEs from poorly-explored host taxa, because of its increased sensitivity and accuracy. We applied VICSIN to a dataset of gut-associated Bacteroidaceae genomes, identifying 816 integrated MGEs falling into 95 clusters, most of which are novel. VICSIN’s fast and simple network-building scheme revealed a high degree of gene sharing within and between related MGE clusters. Shared gene functions across MGEs include core mobilization functions and accessory gene content, such as type VI secretion systems and antibiotic resistance genes. The MGEs identified here encode a large portion of unknown gene content, emphasizing the fact that the full diversity of MGEs and the factors they encode remain very poorly understood. Together, this work motivates more exploration of the gut mobilome, which is likely one of the most potent drivers of microbial evolution in the human microbiome. IMPORTANCE Mobile genetic elements (MGEs), including phages and integrative and conjugative elements (ICEs), drive the diversity and function of microbial communities through horizontal gene transfer. Current tools to predict MGEs in genomic sequence data are highly focused on phages, and are biased against the discovery of novel MGEs. We present VICSIN, a consensus approach to MGE prediction that is able to find a diversity of MGEs, particularly in poorly-understood bacterial taxa. By applying VICSIN to a large database of diverse Bacteroidaceae genomes, we have been able to get a distinct view of the gut mobilome, extending beyond the phageome. These novel MGEs belong to related groups, sharing a significant amount of functional gene content within and between groups, supporting a mosaic model of evolution for ICEs. Understanding how phages evolve in Bacteroidaceae hosts, however, remains elusive and highlights the need for more experimental research.

We present VICSIN, a consensus approach for MGE prediction and clustering of predictions to provide classification. Testing of VICSIN on datasets of Pseudomonas aeruginosa and Bacteroides fragilis genomes suggests VICSIN is the optimal approach to predict integrated MGEs from poorly-explored host taxa, because of its increased sensitivity and accuracy. We applied VICSIN to a dataset of gut-associated Together, this work motivates more exploration of the gut mobilome, which is likely one of the most potent drivers of microbial evolution in the human microbiome.
MGEs encode their own mechanisms that allow for their transfer between and their maintenance within cellular hosts, thereby circumventing two of the main barriers to HGT: cell entry and persistence (4). HGT occurs at especially high rates in the human gut microbiome (8,9), which has been attributed to both the high density of microbes (10,11) and the activity of some MGEs to host-associated conditions, such as inflammation (12,13). MGE activity has important consequences for all microbial hosts and communities, including in the human gut. Within the gut microbiome, MGEs disseminate antibiotic resistance genes, virulence factors, and other advantageous traits, both in symbiotic and pathogenic contexts (3,(14)(15)(16)(17). Characterizing MGE diversity is crucial for understanding how genetic diversity, genes of interest, and relevant traits are maintained within and disseminated between microbes.
The process of characterizing MGE diversity presents two major challenges: (i) accurate and sensitive detection of MGEs, and (ii) MGE classification. An increasingly large number of tools have been developed to search genomic sequence data for putative MGEs using distinct computational approaches (18). Each of these individual approaches to MGE prediction has advantages and limitations. Similarity-based search tools are biased towards the identification of MGEs similar to previously characterized MGEs, and therefore perform poorly at identifying novel MGEs (18,19). De novo prediction tools use generic models of MGE sequence characteristics but exhibit variable accuracy and sensitivity depending on the microbial host (18,19). Genome alignment tools identify all non-core genomic elements, which include MGEs but can also include other accessory genomic features, such as recombination hotspots (20), cell surface modification loci (20,21), and CRISPR-Cas systems (22). Machine learning and deep learning-based tools may require large dataset inputs or customized training data, which may not be available for most microbial systems (18,23,24). Furthermore, MGE identification is most thoroughly tested on clinically-relevant and well-studied host microbes, especially those in the phylum Proteobacteria (19,(23)(24)(25)(26)(27), and is largely tailored to identifying integrated viruses (i.e., prophages), neglecting the numerous other types of MGEs (18,28). Although testing these tools on lesser-understood microbial hosts and MGEs is intrinsically difficult, the current lack of representation across host and MGE diversity has resulted in biases in MGE prediction (27).
The second major challenge to understanding MGE diversity is understanding how they are related, and their further classification into related groups. Although many types of MGEs have been identified, we will focus on viruses and integrative and conjugative elements (ICEs), which appear to be the predominant means of HGT in the gut microbiome (29). Traditionally, viral classification has been accomplished by the International Committee for the Taxonomy of Viruses (ICTV), which relies on a combination of viral characteristics to classify new viruses, including viral particle morphology, genome composition, and homology (30). There is no universally accepted method for the classification of most other MGEs. One method for ICE classification is based on integrase gene homology and synteny of ICE backbone features (31), however the mosaic nature of ICEs suggests classification based heavily on a single gene may not be accurate (32). The mosaicism within ICEs and across all MGE types is a result of recombination events between otherwise unrelated MGEs coexisting within a single host cell (32,33). While the continuous nature of MGE diversity is perhaps the biggest challenge to their classification, it also calls for a generalizable classification scheme that does not currently exist, in order to better understand their diversity and relatedness.
The genus Bacteroides represents one of the most common and abundant microbes associated with the human gut (34), having coevolved with hominid hosts for millions of years (35). Bacteroides genomes encode diverse suites of accessory genomic elements including polysaccharide utilization loci (36,37), capsular polysaccharide synthesis loci (38), CRISPR-Cas systems (39,40), and a diversity of integrated MGEs (41,42). Although Bacteroides species engage in beneficial interactions with hosts (e.g., production of short chain fatty acids, immune system development), they can also be opportunistic pathogens outside and inside the gut lumen. Bacteroides species commonly carry ICEs encoding antibiotic resistance genes (ARGs) that can be transferred among symbiotic and pathogenic microbes in the gut, thus acting as important antibiotic resistance reservoirs (15,43,44  MGEs from the phylum Proteobacteria, and Bacteroides fragilis, a lesser-studied organism from the divergent phylum Bacteroidetes. Benchmark analyses suggest that although VICSIN performs comparably to its constituent tools for P. aeruginosa genomes, VICSIN is more accurate and sensitive for the B. fragilis benchmark dataset.
Thus, we suggest VICSIN is a superior tool for MGE discovery in poorly characterized hosts. We applied VICSIN to a large dataset of Bacteroides genomes, revealing a highly diverse set of related MGEs that has gone unexplored.

VICSIN predicts and networks integrated MGEs
We developed VICSIN as a tool for (i) comprehensive prediction of MGEs integrated in microbial genomes and (ii) clustering of MGEs as a method of classification ( Fig. 1). VICSIN uses a consensus approach to MGE prediction, utilizing the pre-existing MGE prediction tools VirSorter (19), PhiSpy (25), and AGEnt (26). VICSIN is designed to take groups of ≥5 related bacterial genomes as input. To test VICSIN, two benchmark datasets were developed for the divergent bacterial species P. aeruginosa and B. fragilis (Fig. 2). Each benchmark dataset is comprised of 6 closely related genomes (Fig. 2). In order to prevent drops in prediction precision, VICSIN requires input genomes be grouped below the species level, using a cutoff of ≥70% PLA by Spine, approximately corresponding to >98% average nucleotide identity (ANI) in both benchmark datasets (Fig. S1). For both datasets, "known MGEs" integrated in each genome were identified based on literature reports or manual annotation (Table S1), identifying 40 known MGEs in the P. aeruginosa genomes, and 38 known MGEs in the B. fragilis genomes. Though VICSIN can use known MGEs for prediction by BLAST, this option was not used for benchmarking purposes.
Using the known MGEs in these two benchmark datasets, we compared the accuracy and sensitivity of VICSIN to its individual constituent tools (Fig. 3). We used four main metrics to evaluate the effectiveness of the tools: precision (i.e., the number of true positive predictions, divided by the total number of predictions), recall (i.e., the number of true positive predictions, divided by the total number of known MGEs), coverage (i.e., the percent length of a known MGE covered by a true positive prediction), overprediction (i.e., the excess length of a true positive prediction that does not cover a known MGE), and total false positive length (i.e., the sum of the lengths of all false positive predictions in a genome). True positive predictions were defined to be any prediction overlapping a known MGE in a benchmark genome.
Analysis of the B. fragilis dataset found that VICSIN is more accurate and sensitive than its constituent tools VirSorter, PhiSpy, and AGEnt for all six genomes Together, this analysis suggests VICSIN is at least as accurate and sensitive as its constituent tools for P. aeruginosa host genomes, and exceeds the accuracy and precision of those constituent tools for B. fragilis. This may be a result of P. aeruginosa genomes being key elements in the testing and training of VirSorter, PhiSpy, and AGEnt [19,25,26]. We suggest VICSIN is the optimal approach to MGE prediction in poorlycharacterized host genomes, especially those outside the phylum Proteobacteria.
Classification of predicted MGEs is a critical step for subsequent analyses of evolutionary history, mode of transfer, and host range. The existing tools ClustAGE Importantly, VICSIN offers several practical advantages while achieving similar results to vConTACT. Prior to running vConTACT the user must also predict protein encoded on their MGEs, and then generate protein clusters using programs in the vConTACT toolbox. This process, however, is not designed to handle multiple MGEs and requires additional work to reformat and compile protein data inputs. VICSIN eliminates these additional steps by operating directly on the nucleotide sequences. VICSIN are both built on BLAST and MCL, making them equally scalable. Because of its practical advantages and its reduced susceptibility to error and compute time during protein calling and data format conversion, we conclude that VICSIN is a superior tool for MGE clustering.

Application of VICSIN to characterize the Bacteroidaceae integrated mobilome
A subset of genomes was selected from a database of 341 Bacteroides genomes accessible on NCBI or sequenced in-house (Table S2). As was performed for the benchmark datasets, genomes were compared by pairwise Spine-PLA (Fig. S3). From the similarity matrix, two groups of interest were chosen for analysis with VICSIN: (1) 61 B. fragilis genomes, (2) and 27 P. vulgatus or P. dorei genomes. B. fragilis is commonly associated with extraintestinal infections and perhaps the best-studied Bacteroides species (53). As a result, many known MGEs have been characterized for it, including ICEs (54-60) and phages (61)(62)(63)(64)(65). P. vulgatus and P. dorei constitute a clade divergent from that of B. fragilis, and together have fewer known MGEs (66,67). This scheme was chosen in order to characterize MGEs both within host groups and also to sample across the diverse Bacteroidaceae family. The B. fragilis and P. vulgatus groups were arbitrarily split into subgroups of 7-10 genomes, in order to keep AGEnt sensitivity high and compute times low. Because VICSIN prediction does not scale, each subgroup was run independently for prediction, and those predictions were later aggregated for VICSIN MGE cluster analysis.
VICSIN identified a total of 816 MGEs across the 88 host genomes (Table S3). This constitutes 42.3 Mb of predicted mobile sequence out of 468 Mb of total genome length (9.0%). This is greater than the mobile content of the initial P. aeruginosa (4.3%) predictions, reflecting other accessory genomic elements. Therefore, we recommend manual examination of predictions, as for all MGE prediction tools, especially from large datasets (≥50 genomes). However, it should be noted that 9% of a genome consisting of MGEs is well within the range what has been observed for other organisms (68,69).
All 816 MGE predictions were pooled with a set of known Bacteroidaceae MGEs including phages and ICEs (Table S4) (Table S3, S5).

Genetic recombination in VICSIN-predicted phage clusters
It was observed that three clusters with known phages are isolated in the network having low/no PLA above 20% with other MGEs (Fig. 4, S4). This includes the crAssphages (Cluster 21), Hankyphage and its putative relatives (Cluster 29), and two siphophages (Cluster 77), but not Salyersviridae temperate phages (Cluster 23). Given the low connectivity of some phage clusters in the network analysis, it suggests that the rate of recombination between MGEs may vary by MGE type or cluster. Notably, this pattern is not determined by phage lifestyle (i.e. obligately lytic versus temperate).
Hankyphage is a Mu-like phage capable of being stably integrated, but we did not While further sampling of MGEs may find more evidence of gene sharing between phages and other MGEs in Bacteroidaceae hosts, these results suggest it is less common, and certainly much rarer than the rate of gene sharing among ICE-like MGEs (Fig. 4).

Encoded protein functions of VICSIN-predicted MGEs
We hypothesized that the gene content uniting an MGE cluster would be comprised of core genes encoding the essential functions for MGE transmission and persistence, and the gene content shared between MGE clusters would be comprised of accessory genes that benefit the Bacteroidaceae host. To examine gene sharing at a functional level, predicted MGE genes were annotated using Pfams (Fig. 5).
This analysis supports our hypothesis that sharing within MGE clusters would be defined by core transmission genes. Many of the most frequently shared genes between ICEs within a cluster (ICE-ICE within) are those that encode the Type IV secretion system (T4SS). Likewise, some the most frequently shared between phages within a cluster (phage-phage within) encode structural components of the phage capsid, or the machinery for packaging the phage genome. Gene sharing between ICEs and phages within a cluster (ICE-phage within) is largely driven by functions not readily assignable to an MGE type, or to accessory genes like those associated with capsular polysaccharide synthesis (CPS) loci or polysaccharide utilization loci (PUL). This is The Pfam analysis, however, only partially supports our hypothesis that gene sharing between clusters would be driven by accessory genes. Functional gene sharing between MGE clusters is dominated by the same core genes which drive gene sharing within those clusters. This sharing of core genes between MGE clusters does support the hypothesis that recombination is a powerful contributor to MGE evolution, leading to their modular and mosaic nature. While some accessory genes, such as Type VI secretion system (T6SS) genes, are shared between clusters, their frequencies are dwarfed by that of core genes. This is likely due to several factors, including difficulty in annotating potential accessory genes using Pfams and the potential for a high diversity of accessory genes compared to core functional genes.

Extensive synteny within VICSIN MGE clusters
The sharing of genetic content was also examined through representative clusters. Clusters of interest were chosen on the basis of (i) having a known MGE member (Fig. S5), and (ii) having members from more than one species (Fig. S6). We postulated that clusters with MGEs from more than one host species are least likely to contain false positive predictions, by virtue of being recently mobile. Two clusters each representing ICEs and phages were selected; no attempt was made to examine clusters composed of unknown MGE types, which are difficult to annotate. Cluster 1 (CTn86-like elements) and Cluster 3 (CTnDOT-like elements) were chosen as representative of  (Fig. 6A, 6B), and both show extensive sharing of core functional genes. Shared genes within these clusters include the tra operon, which encodes the T4SS, and can include the mob operon, encoding the relaxosome. Further, accessory genes in ICEs (i.e. bile salt hydrolase, xylanase, antibiotic resistances) are variable within their respective clusters. Representative phage clusters were Cluster 23 (Salyersviridae-like elements) and Cluster 29 (Hankyphage-like elements) (Fig. 6C, 6D). Both phage clusters exhibit a greater degree of similarity within the cluster than do the representative ICE-like clusters, covering the majority of these phage genomes. This is likely because more of a phage genome can be considered core; phages encode a large number of genes involved in capsid structure and assembly, genome packaging, host lysis, DNA replication, and regulation.

Detection of clinically-relevant traits in VICSIN-predicted MGEs
Finally, functional gene sharing was further interrogated by searching for specific clinically relevant functions among the VICSIN predictions. First, we examined antibiotic resistance genes (ARGs) (Fig. 7A). Twenty of the 816 VICSIN predicted MGEs were determined to encode a total 24 resistance genes. All were characterized resistance genes common among the Bacteroidetes: cfxA3, ermF, ermG, tetX, tetQ and linA, each conferring cefoxitin, clindamycin, erythromycin, tetracycline, tetracycline, and lincomycin resistance, respectively. We hypothesized that antibiotic resistance genes would be associated with particular MGE clusters due to strong selection for MGEs to maintain their ARGs over and the 20% PLA cutoff for clustering, we did not expect the sharing of ARGs to drive MGE clustering. However, our initial hypothesis was true only for linA, which was detected exclusively in two MGEs in Cluster 2 ( Table 2). Further sampling may reveal that linA is not restricted to a single MGE cluster. We determined that all of the other ARGs detected were distributed in multiple MGE clusters. The tetracycline resistance gene tetQ is common and highly conserved in Cluster 3, and although it is also present in two other clusters, is found in 8 out of 39 predictions (21%) within the group. This suggests that although tetQ is maintained over evolutionary time by some members of a cluster, stochastic loss events may cause its absence in other MGEs of the cluster. Two found to encode predicted T6SS. T6SSs primarily occurred in three clusters: Cluster 0, Cluster 5 (CTnBST-like ICEs), and Cluster 6 (P. vulgatus-associated ICEs). Evidence of sharing of these T6SS genes is apparent between clusters (i.e. between Clusters 0 and 9, and between Clusters 6 and 52), suggesting they may also move between MGEs by recombination.

DISCUSSION
Here we describe VICSIN, a tool for the prediction and networking of integrated MGEs. VICSIN uses a consensus approach to MGE prediction that results in greater sensitivity than its constituent tools. It is especially powerful for poorly-studied host genomes such as those of the family Bacteroidaceae. Our data suggests that this may be a result of these tools being developed and tested using extant model organisms like those of P. aeruginosa and E. coli (18) 18 (66,(71)(72)(73). Although the mechanisms causing the entirety of these events are largely unknown, one of the principle drivers of HGT is likely uncharacterized integrated MGEs.

Presently, between the NCBI and ICEberg databases only have 27 annotated
Bacteroidaceae-specific phage or ICE elements. Our results found that 60% (57/95) of VICSIN-predicted MGE clusters lack a known MGE. Further, we show evidence of recombination within and between clusters of predicted MGEs, and hypothesize at least some of these MGEs are capable of transfer across a broad host range. It has been hypothesized that a combination of these factors is an especially potent driver of HGT across the Bacteroidaceae family (71,73).
We found very limited gene sharing between known phages and integrated MGE predictions in the current Bacteroidaceae dataset. This suggests some barrier to recombination exists between some MGEs. We do not believe the barrier is caused by short infection times for these phages; Hankyphages are Mu-like viruses capable of stable integration into host chromosomes (74), and cells infected with CrAssphages have been observed to enter a "carrier state" where lysis and progeny release can be delayed for long periods of time (75). Part of this could be caused by structural constraints of phages versus ICEs. Whereas phages have limited space within the capsid to store their genome, ICEs which transfer directly from cell to cell via the T4SS have fewer constraints on genome size. Known ICEs can be hundreds of kilobases in length (e.g. PAIst in Streptomyces turgidiscabies >674 Kb) (76). Further, the current data reinforces the view that ICEs are likely the primary drivers of HGT in these genomes, though this may be due to a lack of knowledge about lysogenic phages infecting Bacteroidaceae hosts. Phages are known to drive HGT in other bacterial Bacteroidaceae experimental dataset. From the NCBI web interface, 300 Bacteroidaceae genomes were downloaded (Table S2). An additional 18 genomes were added to the dataset that were sequenced as part of this work (Table S7). Known MGEs integrated in the networking steps were downloaded from NCBI or ICEberg databases (31).  (Table S5). Phage-associated Pfams were defined as any with the words "terminase," "portal," "capsid," "tail," "base plate," "spike," "neck," "head," or "phage protein" in their description. Reads were trimmed and assembled using the A5ud pipeline (83). Sequencing methods and assembly data are summarized in Table S7.

ACKNOWLEDGEMENTS
We thank Ted Pacyga for his invaluable help with the genome_grouper.py program, Nadja Shoemaker and Abigail Salyers for access to an impressive collection of Bacteroidaceae isolates, and Alvaro Hernandez and Chris Wright for DNA sequencing.