Density-based binning of gene clusters to infer function or evolutionary history using GeneGrouper

Motivation Identifying gene clusters of interest in phylogenetically proximate and distant taxa can help to infer phenotypes of interest. Conserved gene clusters may differ by only a few genes, which can be biologically meaningful, such as the formation of pseudogenes or insertions interrupting regulation. These qualities may allow for unsupervised clustering of similar gene clusters into bins that provide a population-level understanding of the genetic variation in similar gene clusters. Results We developed GeneGrouper, a command-line tool that uses a density-based clustering method to group gene clusters into bins. GeneGrouper demonstrated high recall and precision in benchmarks for the detection of the 23-gene Salmonella enterica LT2 Pdu gene cluster and four-gene Pseudomonas aeruginosa PAO1 Mex gene cluster in 435 genomes containing mixed taxa. In a subsequent application investigating the diversity and impact of gene complete and incomplete LT2 Pdu gene clusters in 1130 S. enterica genomes, GeneGrouper identified a novel, frequently occurring pduN pseudogene. When replicated in vivo, disruption of pduN with a frameshift mutation negatively impacted microcompartment formation. We next demonstrated the versatility of GeneGrouper by clustering both distant homologous gene clusters and variable gene clusters found in integrative and conjugative elements. Availability GeneGrouper software and code are publicly available at https://github.com/agmcfarland/GeneGrouper.


Background
Physically proximate groups of genes, called gene clusters, are present in many microbial taxa (1). Gene clusters can include genes that form biosynthetic pathways or efflux, secretion, and signaling systems (1)(2)(3)(4)(5). Some gene clusters are arranged into one or multiple operons (6). Microbial genomes are under constant gene flux, driven by gene gain, loss, and rearrangements (5,7,8). The identification of intact, conserved gene clusters across different genomes can allow for inferences to be made as to the gene cluster's functionality, stability, and taxonomic distribution (6,9).
There are different, overlapping approaches used for the identification and classification of gene clusters. Approaches that incorporate a reference database of curated gene clusters include DOOR2, T346Hunter, TADB2.0, and MetaCRAST (10)(11)(12)(13). Synteny-based approaches are generally split into those that identify all gene clusters at the genome level compared to within a defined genomic window. Examples of genome-level synteny tools are CSBFinder, GECKO3, and Mauve and genomic window tools include SynFind and SimpleSynteny (14)(15)(16)(17)(18).
Gene cluster homology search approaches, like MultiGeneBlast or SLING, identify gene clusters that contain all or some of the gene cluster query genes (19,20). After identifying a set of gene clusters, some tools will further aggregate gene clusters into bins using sequence similarity networks or clustering (14,20).
A challenge in analyzing large numbers of gene clusters is that many conserved gene clusters will display little variation in gene content, but that variation may nevertheless be biologically significant, for example an insertion disrupting key genes in a biosynthetic operon (6,21). A population-level understanding of gene cluster content can help to identify which genes are typically located in a gene cluster, and which are variable. Most tools require either custom analysis of tabular outputs or manual inspection of gene cluster synteny plots to identify variations in gene cluster content and their distribution within the analyzed genomes.
We developed GeneGrouper to identify, quantify, contextualize, and visualize the degree of similarity for gene clusters that contain a queried gene of interest in a population of usersupplied genomes. It is designed to work on thousands of genomes and is suitable for use on a personal computer. We demonstrate the utility of GeneGrouper by comparing its unsupervised clustering accuracy with existing tools in the identification of two distinct gene clusters, the 23gene catabolic microcompartment Pdu gene cluster found in Salmonella enterica LT2 and the four-gene MexR/MexAB-OprM Resistance-Nodulation-Division (RND)-type efflux pump gene cluster from Pseudomonas aeruginosa PAO1 in 435 taxonomically diverse genomes (22,23).
GeneGrouper was next used to examine the diversity and distribution of gene complete and incomplete LT2 Pdu gene clusters in 1130 S. enterica genomes. Using GeneGrouper's visual and tabular outputs, we identify a novel pseudogene present in a subset of otherwise genecomplete LT2 Pdu gene clusters. We replicate the pseudogene in vivo and find that it negatively impacts microcompartment formation.

Input and pre-processing
GeneGrouper requires two inputs: genome files and a translated seed gene sequence.
Genome files must be in GenBank file format like those from the NCBI Refseq database (31).
All genome files have coding sequence features extracted and stored in an SQLite database. A BLAST database is constructed from all extracted amino acid sequences.

Seed homology searching
A BLASTp search for the translated seed gene is performed using user-specified identity and coverage thresholds (Fig. 1A i). Upstream and downstream genomic regions of lengths corresponding to user-specified base-pair distances are extracted. In instances where extracted regions overlap, the region with the highest E-value is chosen. The genomic position of hits and the amino acid sequences within the defined genomic region are written to a seed gene-specific SQLite database. All sequences within the defined genomic region are extracted and stored as a FASTA file.

Orthology inference and assignment
GeneGrouper assigns orthology to all sequences extracted from the defined genomic region using an internal pipeline. The internal orthology identification scheme takes as input a FASTA file generated during the pre-processing phase containing all detected amino acid sequences. Sequences are clustered using mmseqs2 linclust to generate a set of proximate orthology relationships, producing a set of representative amino acid sequences in FASTA format. An all-vs-all BLAST search is performed, with the resulting hits table filtered for identity, coverage, E-value, and desired number of matches . The E-values from the filtered hits table is used as an input for Markov Graph Clustering with MCL. MCL is run over multiple inflation values, with the lowest inflation value containing the highest count of unique orthologs selected by default. The MCL and mmseqs2 linclust ortholog group assignments are transferred to every sequence and stored (Fig. 1A ii).

Genomic region clustering
Pairwise Jaccard distances are calculated for all genomic regions (Fig. 1A iii) (25). The DBSCAN algorithm is then run on the resulting dissimilarity matrix using a fixed minimum cluster size value over increasing epsilon values (32). For each epsilon value, the number of clusters, noise, silhouette score, and Calinksi-Harabasz score are calculated (33). The epsilon value demonstrating the best separation of clusters (defaulting to the highest Calinksi-Harabasz score) is selected. The previously constructed Jaccard distance matrix is subsetted for regions within each DBSCAN cluster label. For each cluster label, the mean dissimilarity for each region is calculated. The region with the lowest mean dissimilarity is selected as the representative for that cluster label.

Outputs
Tabular outputs containing the cluster label, region identifier, mean cluster label dissimilarity, and relative dissimilarity to the cluster representative is generated. Each unique cluster is assigned a numeric label. All gene regions that could not be assigned to any cluster are grouped into cluster label 'c-1'. Three main visualizations are produced: The representative gene regions for each cluster with gene annotations (Fig. 1B), the dissimilarity contained within each gene region, and the unique variants found in each gene region along with their count.
Users can query specific clusters and generate a fourth visualization type showing the count, dissimilarity, and structure of each unique gene cluster within that queried cluster. Mex gene cluster was selected to test whether GeneGrouper could specifically detect a short gene cluster with multiple homologs within a species, and across all five Gram-negative taxa in our collection of genomes.

Identification of full or partial Pdu and MexAB-OprM gene clusters using different tools
We compared the capacities of GeneGrouper, MultiGeneBlast, and CSBFinder to detect full or partial LT2 Pdu and PAO1 Mex gene clusters in all 435 genomes. Each tool was limited to only detecting gene clusters, and no internal scoring or clustering algorithm was used. Each individual gene cluster was then scored in a standard manner as 'full' (100% of expected gene clusters present), 'partial' (<100-70% present), or 'other' (<70% present). In this manner, the capacity for different tools to detect gene clusters was standardized to allow for direct comparison of the detection of specific gene clusters and limit idiosyncrasies in clustering/scoring approaches.
For the detection of the LT2 Pdu gene cluster, GeneGrouper was run using the translated S. enterica LT2 PduA sequence as a seed, and a genomic search space of 2,000 bp downstream and 18,000 bp upstream to capture the two genes downstream and 22 genes upstream of pduA. PduA was selected as the seed because it is a member of the pfam00936 protein family, which is the hallmark indication of BMC loci (9). For the detection of the PAO1 Mex gene cluster, the P. aeruginosa PAO1 MexB sequence was used as a seed with a uniform search space of 10,000 bp upstream and downstream. For both searches a ≥ 60% identity and ≥ 80% coverage threshold was used. The orthology assignments for each gene in the Pdu or PAO1 Mex gene clusters were then used to score gene clusters.
MultiGeneBlast was run in search mode with default settings on each individual genome using an input FASTA file that contained all the translated gene sequences belonging to either LT2 Pdu or PAO1 Mex gene clusters. BLAST results for each identified gene cluster were filtered such that each individual query gene was matched to its single best hit, with a coverage cutoff of ≥ 80%, and no identity cutoff to allow for phylogenetically distant hits to be preserved.
Each gene cluster was then scored.
CSBFinder inputs were pre-processed prior to gene cluster searching. The proteomes for all six genera were generated by clustering with mmseqs2 linclust (34). Afterwards, orthology identification was performed using OrthoFinder with default settings (35). Genomes with orthology assignments were then converted into the CSBFinder format. The orthology assignments for each gene present in either the LT2 Pdu or PAO1 Mex gene clusters were converted to a 'patterns' file and used to search all genomes for the respective gene clusters and then scored. and Enterobacter.
Using our standardized methods for the detection of either the LT2 Pdu or PAO1 Mex gene cluster, all tools compared similarly in the capacity to detect genomes carrying a full gene cluster. CSBFinder reported lower numbers of full or partial gene clusters in phylogenetically distant genomes. This is likely due to the orthology pre-processing step using OrthoFinder.
MultiGeneBlast had higher numbers of partial gene clusters detected, especially for PAO1 Mex gene clusters. The large number of partial gene clusters was likely due to the BLAST-based scoring system that did not use an identity cutoff, compared to the reference orthology group approach used by GeneGrouper and CSBFinder. Importantly, these results demonstrate that GeneGrouper detects similar numbers of full or partial gene clusters compared to existing methods using a standardized scoring method.

Accuracy of GeneGrouper automated gene cluster binning
GeneGrouper uses an unsupervised learning approach to aggregate each individual gene cluster into a cluster label. Each cluster label should contain gene clusters that have similar, but not always identical, gene content, over a defined distance. Therefore, a cluster label will likely contain both full and partial gene clusters but should not contain unrelated gene clusters. We tested whether this was the case by comparing the results of our standardized gene cluster identification with the clustering results produced by GeneGrouper. Prior to testing, the ground truth of each genome for the presence of a full, partial, or absent LT2 Pdu/ PAO1 Mex gene cluster was determined.
The LT2 Pdu gene cluster was searched for in all genomes with GeneGrouper using the same parameters as before. GeneGrouper assigned 654 different gene clusters to four different cluster labels and had 12 gene clusters with no clustering solution that were placed in cluster label 'c-1' ( Table 2, Fig. S1, Fig. S2A). Cluster label 'c0' contained the Pdu gene cluster from S. enterica LT2 and was designated as being the cluster label that contained all expected instances of full or partial LT2 Pdu gene clusters (from here on referred to as GG-cluster).
Overall, the precision and recall scores of GG-cluster compared favorably with the scores from the standardized approaches (Fig. 3A). GG-cluster had a lower precision when used to predict the presence of only full LT2 Pdu gene clusters. However, the precision increased to 1 when predicting full or partial Pdu gene clusters. Comparatively, the recall remained almost unchanged, with a score of 1 when identifying full Pdu gene clusters and 0.99 when identifying full or partial Pdu gene clusters. GG-cluster missed one instance of a partial Pdu gene cluster that was assigned to cluster label 'c-1', which contains gene clusters for which no clustering solution was found. This Pdu cluster was split in two in the referenced assembly, being present at the start and end of different contigs. These results demonstrate that GG-cluster accurately identified almost all LT2 Pdu gene clusters of either full or partial status and did not incorrectly identify non-Pdu gene clusters as LT2 Pdu gene clusters.
PAO1 Mex gene clusters were searched for using GeneGrouper as previously described, identifying 2213 gene clusters contained within 40 cluster labels ( Table 2 Table 2). The search returned four distinct cluster labels with distinct gene clusters and two total unclustered gene clusters, which were visualized using GeneGrouper's visualization command (Fig. 1B). GeneGrouper reports the Jaccard dissimilarities of each region within a cluster relative to the region representative so that differences in gene content can be efficiently quantified and assessed. Cluster label 'c0' contained the S. enterica LT2 strain LT2 Pdu gene cluster, which had zero dissimilarity with the representative region of cluster label 'c0'. In total, cluster label 'c0' contained 1120 regions with a 0 and 0.076 dissimilarity at the 50th and 95th percentiles, respectively. These low dissimilarities indicated that cluster label 'c0' had very little variation in gene content relative to its representative region.
To examine the variability in gene content within cluster label 'c0', GeneGrouper's cluster inspection command was run to visualize the count of identical occurrences of each gene cluster (Fig. S4). 48 separate identical gene clusters were present, the majority of which had all 23 LT2 Pdu genes. The tabular output was queried to reveal that of the gene clusters identified, 920 (81.4%) carried all 23 LT2 Pdu genes, 10 (0.88%) did not have a LT2 Pdu gene cluster identified, and the remaining 200 (17.6%) had predicted LT2 Pdu gene clusters with between one and five pseudogenes. Interestingly, gene clusters carrying a pduN pseudogene but otherwise complete were the most common non-complete gene cluster observed (Fig. 4A).
A whole-genome phylogeny of all 1130 genomes was created using Phylophlan 3.0 with the provided 400 marker sequence database and visualized with ggtree along with a presenceabsence matrix of each Pdu component extracted from GeneGrouper's tabular output (36,37).
We found that genomes with pduN pseudogenes were present almost entirely in the same section of the phylogenetic tree (Fig. 4B). This is a surprising finding, as PduN is a necessary contained a nucleotide deletion at position 68 that resulted in a frame-shift mutation (Fig. S5).
The effects of this deletion on microcompartment formation were tested in S. enterica genes pduA and pduJ (ΔAΔJ) (Fig. 6) (40). Microcompartment formation was tested using a GFP encapsulation assay, in which GFP is targeted to microcompartments using an N-terminal signal sequence sufficient for microcompartment targeting (41,42). We found that strains

Additional gene cluster searches using GeneGrouper
To demonstrate the applicability of GeneGrouper to other gene cluster types and use cases, we searched for an additional two different seed genes ( Table 1) genomes. Identity and coverage parameters were lowered to 15% and 70%, respectively, and only the best BLAST hit from each genome was kept ( Table 2). A total of 393 gene clusters were binned into six cluster labels (Fig. S6A,B). Interestingly, S. enterica, K. pneumoniae, and Enterobacter had similar gene cluster arrangements, even between clusters, with the main difference being the genes upstream of the Pst gene cluster. Interestingly, the Pst gene cluster has been described in Clostridium and verified to be a homolog of the E. coli Pst gene cluster (43). In this search, only one Clostridium genome has a Pst gene cluster identified and placed alone in cluster label 'c-1', suggesting that the Pst gene cluster may not be carried by all Clostridium. Another unexpected finding was that only 62.8% of P. aeruginosa genomes carried a homolog of pstS in a conserved gene cluster assigned to cluster label 'c0'. However, gene clusters in this cluster label lacked other members of Pst and instead were associated with type II secretion genes. This context suggests that the pstS gene in cluster label 'c0' may serve a different functional role compared pstS found in the Pst gene cluster in E. coli (45,46).
In another example use case, traC, a type IV secretion system (T4SS) gene found in integrative and conjugative elements (ICEs) was searched for in all 435 genomes ( Table 2) (47). ICEs have highly variable gene content across both its cargo genes and the components necessary for integration and conjugation (48). To search for traC, identity and coverage parameters were maintained as above, with unlimited numbers of hits per genome. A 20000 upstream/downstream genomic range was used, which is on the lower end of ICE sizes (shown to range in size from 37-143 kb) (49). Clustering returned 74 separate gene clusters binned in four different cluster labels, and 10 gene clusters assigned to cluster label 'c-1' (Fig. S7A,B).
Expectedly, there was high dissimilarity within cluster labels. However, one cluster label, 'c3', exhibited low mean dissimilarity and was present in both S. enterica and K. pneumoniae genomes, suggesting these genomes carry the same ICE. Cluster label 'c0' was found in 46% of all K. pneumoniae genomes, raising the possibility of a particular mobile or ancient ICE acquisition.

Conclusions
We demonstrate that GeneGrouper is a simple and accurate tool for identifying gene clusters of interest in a large number of genomes using a single seed gene and a specified genomic window. The use of a gene cluster representative for each cluster label allows for a more intuitive understanding of the diversity of gene clusters that exist in a population, especially when coupled with additional visual metadata, and allow for easy identification and comparison of biologically relevant features. The provided tabular outputs allows for researchers to further probe identified gene clusters for their own specific questions. There exist some limitations in our approach, namely the absence of gene clusters that do not have the seed gene and the presence of incomplete gene clusters due to low-quality assembly genomes.
Despite these limitations, comparisons with existing gene cluster detection tools demonstrates that GeneGrouper's automated clustering and overall approach provides similarly accurate predictions.
In an example application, GeneGrouper was used to determine whether the LT2 Pdu               T  o  t  a  l  g  e  n  e  c  l  u  s  t  e  r  s  f  o  u  n  d   T  o  t  a  l  c  l  u  s  t  e  r  l  a  b  e  l  s   T  o  t  a  l  u  n  c  l  u  s  t  e  r  e  d  g  e  n  e  c  l  u  s  t  e  r  s   G  e  n  o  m  e  s  w  i  t  h  h  i  t  D  a  t  a  s  e  t  R  u  n  t  i  m  e  (  h  :  m