Abstract
Depending on the scale of observation, bacterial genomes are both organized and fluid. While individual bacterial genomes show signatures of organization (e.g. operons), pan-genomes reveal genome fluidity, both in terms of gene content and order (synteny). Here we ask how mutational forces (including recombination and horizontal gene transfer) combine with selection and gene interactions to shape genome organization and variation both within and across strains. We first build an evolutionary simulation model to assess the impact of gene interactions on pan-genome structure. A neutral evolutionary model can produce transient co-segregation of initially linked genes but is vulnerable on longer time-scales to perturbing mutational events. In contrast, incorporation of modular gene fitness interactions can produce sustainable clusters of linked and co-segregating genes, with the network of co-segregation recapitulating the defined simulation ‘ground-truth’ network of gene interactions. To test our model predictions, we exploit the increasing number of closed genomes in model species to investigate gene interactions in the pan-genomes of Escherichia coli and Pseudomonas aeruginosa. Using these highly curated pan-genomes, we show that the co-segregation networks for P. aeruginosa and E. coli are modular, associate with physical linkage, and most closely map known metabolic networks (for P. aeruginosa) and regulatory networks (for E. coli). The results imply that co-segregation networks can contribute to accessory genome annotation, and more generally that gene interactions are the primary force shaping genome structure and operon evolution.
Significance Statement Bacterial pan-genomes represent extraordinary genomic diversity yet remain relatively unexplored due to a research focus on lab reference strains. We exploit the growing availability of closed genomes to build pan-genomes where we can track the physical linkage of all genes. Through a combination of evolutionary simulations and data-analysis, we ask how mutation, selection and gene interactions combine to shape genome structural organization (linkage) and variation (co-segregation) across strains. We show that co-segregation networks are modular, associate with physical linkage, and map to metabolic (for P. aeruginosa) and regulatory networks (for E. coli). The results imply that modular gene interactions are sufficient to guide the evolution of persistent gene clusters and are the primary force shaping genome structural evolution.
Introduction
Individual bacterial genomes show hallmarks of organization (1). For example, genes involved in related biochemical processes are often clustered together on the chromosome (2). The organization of clustered genes can be refined to the point where they are transcribed together as operons, with a shared promoter and operator sequences. Regulatory genes are also often found close to the genes they regulate. A classic example is the lacI repressor gene, which resides near the lacZYA operon in Escherichia coli (3).
While organized, genome structure is under constant assault from an array of mutational processes that scramble gene content and order. Genes can be gained or lost. Whole sequences of genes can be flipped (inversions), moved along a chromosome (translocation, recombination) or shuttled to a new individual (horizontal gene transfer). As a result of these perturbing forces, genomic organization is variable across bacterial isolates; genes that are clustered in one species are often reordered in another (2, 4). For example, multiple alignment of different strains of E. coli and Yersinia pestis reveal large changes in gene ordering (synteny) across the chromosome of each species (5, 6), where fragments of different length undergo rearrangement due to inversion and translocation events. In addition to variation in ordering, gene content is also variable across strains of the same species. As little as 11% of all Escherichia coli genes are present in all strains (7, 8) and an even smaller 1% in Pseudomonas aeruginosa (9). These ever-present genes are described as the core genome, and the larger set of variably present/absent genes as the accessory genome. Together, the entire set of genes across a species is termed the pan-genome.
The forces that contribute to bacterial genome organization in the face of perturbing mutational pressures have been the focus of extensive debate, leading to a menu of alternate models. The influential selfish operon model (10) views gene clustering as a selfish property of an operon, as clustering of physiologically related units enhances the probability that an entire operon will be horizontally transferred together as a functional unit, and therefore overcome selection for the loss of multi-gene functions that are under weak or fluctuating selection. A central prediction of the selfish operon model is that operons are enriched for peripheral (versus core) metabolic functions subject to fluctuating selection, which is supported by anecdotal reports of accessory functions carried on plasmids (11). Systematic whole-genome analyses, however, have not generally supported this prediction; highlighting a greater frequency of clustering for essential genes in E. coli (12, 13).
Other models of operon evolution highlight the efficiency gains of sharing a single promotor. For example, by reducing the target size for loss-of-function mutations (14), or by offering more complex temporal co-ordination of co-regulated processes (15, 16). Moreover, gene clustering, even without a shared promotor, can offer efficiency gains by spatially co-localizing within the cell gene products that engage in related biochemical tasks (17–19).
In this study we propose that the models above represent special cases of a broader gene-interaction model (e.g. metabolic or regulatory connections between genes). The potential importance of gene interactions has been proposed before in the context of operon evolution (20, 21), following Fisher’s (1930) foundational work on epistasis, recombination, and the evolution of linkage. Fisher reasoned that physical linkage (clustering) between genes can occur, at higher rates than expected by chance, due to positive epistatic interactions. The relevance of this logic to bacteria has subsequently been questioned (10) due to the generally low rates of homologous recombination in clonally reproducing microbes (2, 10, 23). Here we return to Fisher’s model and apply it considering our broader understanding of the peculiarities of bacterial genomic organization and processes of change beyond recombination.
Our hypothesis is that gene interactions introduce selective effects on genome-modifying processes which lead to the formation of co-segregating clusters of interacting genes. We test this hypothesis via an iterative combination of evolutionary simulation models and pan-genome data resulting in clear predictions regarding the effect of gene-interaction on genomic architecture. Using closed genome data for 100s of E. coli and P. aeruginosa genomes, we identify clusters of physically linked and co-segregating genes and show that the resulting co-segregation networks map onto underlying gene-regulatory and metabolic gene interaction networks.
Results
Simulation models (1): Neutral model of genome evolution
Our overarching hypothesis is that gene interactions ultimately drive the organization of genomes into modular clusters of physically linked genes. In the following sections, we develop a model of pan-genome evolution, based on interactions that impact bacterial fitness. First, we establish a baseline neutral model case where there are no interactions (or where gene interactions have no impact on fitness). In this simplified world, genes still experience chromosomal structure, with genes arranged on a circular chromosome. In this neutral world, it is still intuitively reasonable to expect genome structure to persist as a result of initial conditions: genes that are initially close together will tend to stay close together, as they experience a reduced likelihood of gene separation via recombination or other genome modifying functions.
To test this neutral (no selection) hypothesis of genome inertia, we conduct in silico experimental evolution on 100 interacting lineages of bacteria (see methods for details). In brief, each lineage has an average genome size of 2000 genes, and is subject to genomic perturbations of gene gain, loss, and gene rearrangement (via inversion/translocation). Gene gain for a focal lineage is sourced from the pan-genome (the genomes of all lineages), therefore linking lineages together via horizontal gene transfer. Our simulation tracks time in fixation events (e.g., fixation within 1 lineage of a gene-loss event, see methods for simulation details).
In our first simulation we track a single gene pair (x,y) that are initially adjacent in 10% of all lineages (absent in other lineages) and follow their resulting organization (linkage within genomes and co-segration across genomes). In figure 1A we track linkage via the average chromosomal distance dxy (the number of intervening genes on the shortest chromosomal path between x and y, or d generally for any arbitrary pair of genes), averaged across all genomes where both genes are found. Initially, dxy is zero, reflecting our initial conditions focused on two adjacent genes. Figure 1A illustrates that while linked genes can persist in a state of close linkage for 100s of fixation events, given sufficient time (note log scale on x axis), genetic perturbations will eventually separate two focal genes. In Figure 1B we see that by 10,000 fixation events, the distribution of dxy approaches a uniform distribution (Figure 1B), consistent with a simple diffusion model of genetic drift.
The distance metric dxy is only applicable for lineages that contain both genes x and y. Initially, x and y were always found together (in 10% of lineages; both absent elsewhere) yet processes of gene gain and loss could potentially alter this tight pattern of co-segregation. To assess the extent to which a pair of genes co-segregate, we next estimated the mutual information (Ixy, or I for any arbitrary pair of genes) between genes x, y across the pan-genome (mutual information is analogous to correlation without the need to use real-valued variables (24, 25)). In Figure 1C we see that the initial simulation conditions produce an initial mutual information co-segregation value I = 0.127. Our results show that over short timeframes (less than 100 fixation events), initially linked genes maintain high mutual information, however as linkage breaks down due to inevitable neutral mutational events, so does co-segregation (Figure 1C; note I takes on discrete values due to the discrete nature of presence /absence values used in the pangenome characterization). At equilibrium we observe a skewed distribution of I values, with a peak at zero (Figure 1D). Figures 1A-D indicate that under a neutral model, gene pair distance d tends to drift and co-segregation peaks at zero.
Focal gene interaction model
The absence of persistent linkage in our neutral simulation model (Figure 1) is not consistent with examples of highly conserved linkage across bacterial isolates, indicating a role for selection. For example the peg (polyethylene glycol-degradative) operon is conserved across multiple lineages of Sphingomonas (26), likewise several E. coli operons are conserved in different lineages (27). Therefore, to assess the role of selection in our simulation model, we develop a model to include gene interactions that impact fitness. In this model, events of gene gain, gene loss, and gene rearrangement are now potentially selectively non-neutral if they perturb a gene interaction with defined importance for fitness. As in our previous simulation we track a single pair of genes, but now assume this pair has a positive interaction effect on fitness when found together in a genome (and set all other gene interaction effects to zero). In contrast to Figure 1, we reverse our initial conditions, so that our 2 focal genes are initially not closely linked (Figure 2A) and not co-segregating (Figure 2B).
Figure 2A shows that distance decreases with simulation time, and although stochastic effects can cause instances where distance is high, these are transient events, and the distribution of distances after 200,000 fixation events is stably low, with a mode of zero (Fig 2B). The shortest time to reach a distance of zero (i.e., the focal pair becoming adjacent genes) is ~40,000 fixation events (Fig 2A) which translates to around 400 fixation events per genome in the collection. Figure 2C shows that during the simulation values of Ixy steadily increases, because Ixy is a global measurement of the linkage of a pair of genes xy in the genome collection it is less susceptible to stochastic effects. After ~15,000 fixation events (150 per genome events) the value of I stabilizes around 0.27 and remains the same for the course of the simulation.
Genome network interaction model
Figure 2 illustrates that given selection on one gene pair in isolation, close linkage (low dxy) and high co-segregation Ixy between these the genes x,y can emerge, despite ongoing perturbing forces of gene loss, gain, translocation, etc. Yet in real-world genomes, movement of two positively interacting genes together might in turn pull apart other positively interacting genes. To assess how evolution proceeds given multiple gene interactions, we again extend our simulation model to incorporate networks of gene interactions, spanning both random network models (28) and modular interaction network models (small world (29) and preferential attachment (30) models). Specifically, we hypothesize that modular gene interaction networks will drive repeated clustering of interacting genes at a pangenome level.
On a pairwise scale, repeated clustering of interacting genes implies that clustered (low d) gene pairs will tend to co-segregate more often (higher I), leading to a predicted negative correlation between d and I. Figure 3A-D plots d against I under the various interaction models, including the ‘no interaction’ neutral case (Figure 3A). In both the neutral and random interaction case we can reject the hypothesis of a negative association between d and I (neutral model: r = −2.3X10−3, p = 0.75; random interaction model, r =- 1.4X10−3; p = 0.84). In contrast, when we introduce modular networks, we find support for a negative association (small world model: r = −0.05; p = 6.2X10−3; preferential attachment model: r = −0.12; p = 0.01).
While the modular simulations (Figures 3C, D) reveal significant statistical support for a negative association, the effect size is relatively weak and the distribution of simulation results appears to be sparse for higher values of I, indicating a potential problem of under-sampling. We note that panels 3C, D summarize the behavior of approximately 10 million gene pairs, so we are not under-sampling. We conjecture that the sparse boundary reflects that the simulated data is a composite of a large ‘background’ of non-interacting pairs (see neutral simulation case) overlaid with a smaller set of interacting gene pairs that are biased towards high I and low d values (Figure 2). Consistent with this conjecture, the gene interaction networks were all sparse (around 0.1% of the possible interactions are included in the network). To further test this conjecture, we identify values of I that are outside of the central distribution of I values (‘outlier I values’) and repeat our statistical tests of association between d and outlier values of I. We then go on to assess whether outlier values of I can be used to recover the underlying architecture of gene interactions in our defined simulation system.
To isolate ‘outlier’ values of I from a background of neutral interactions we use mean absolute deviation (MAD) to establish a threshold above which values of I are to be considered outliers (red lines, Figure 3). We find that for our neutral model (Figure 3A) (r = −3.4X10-4; p = 0.97) and our random network model (Figure 3B) (r = −4.1X10−3; p = 0.67) outliers have no correlation with distance. Whereas in all models of modular gene interaction we find a significant and more substantial negative correlation between outlier values of I and d. (Small world network, Figure 3C: r = −0.33; p = 2.6X10−12; Preferential attachment network, Figure 3D: r = −0.62; p = 5.2X10−16).
Gene co-segregation network can recover the underlying defined ‘ground truth’ gene interaction network in our simulation model
Our pan-genome evolutionary simulations with modular gene interactions (Figures 3C,D) indicates that evolution will produce networks of co-segregating (high I) and physically clustered (low d) genes. Following our arguments above, we hypothesize that the networks of co-segregating genes will recapitulate the underlying modular gene interaction networks, which in this simulation world are defined and accessible as a ‘ground truth’ case.
To test this hypothesis, we first recovered the underlying gene interaction network over which the fitness effects of our simulation are based on (our ground truth). Next, we built a network based on the outlier I values of co-segregation (the co-segregation network, see methods). In both networks nodes represent genes, and edges represent interaction effects. To measure similarity between the two networks we used Hamming distance (i.e., number of changes a network must undergo to be identical to the other; vertical blue lines in Figure 4). To place these similarity metrics in context, we asked how often we would see the same degree of network similarity (Hamming distance), given a randomly re-sampled gene interaction network. By re-sampling randomly 1000 times from networks with the same underlying algorithm and parameters as the specific ‘ground truth’ case (see methods) we are able to assess the extent to which the observed network similarity diverges from chance expectations. Figure 4 shows that modular gene interaction networks such as small world or preferential attachment produce co-segregation networks that recover with high fidelity the ground truth network (particularly for preferential attachment, where the observed Hamming distance is in the 1st percentile of the reference distribution). In contrast, a random gene-interaction network does not produce a discernable signal in the resulting co-segregation network. The results in Figure 4 indicate that outlier co-segregation values can be a reliable predictor of underlying gene interactions, given a modular organization of gene interactions. We further assess the extent to which repeated simulations identify the same outliers. By fixing the ground truth network and repeating the evolutionary simulation 100 times we find that the repeatability of ‘ground truth’ discovery is maximal, given modular fitness networks (Suppl Table S1).
Gene clusters
Our simulation results to date examine pairwise measures of genome structure (pairwise distance d and pairwise co-segregation I), which does not directly capture the extent of larger-scale gene clustering. In a final simulation analysis, we define a gene cluster as a set of genes that are repeatedly adjacent to each other, across a set of genomes. In Figure 5 we plot distributions of cluster sizes, as a function of different threshold values for ‘repeatability’ (defined by prevalence – i.e., the fraction of the genomes in which a cluster is observed). In the absence of gene interactions, we find no clusters larger than gene pairs, even with a permissive prevalence cuttoff of 40% (Figure 5A). In contrast, all our simulations with gene interaction effects show cluster sizes following a geometric distribution, with abundant small clusters and rare large clusters (Figure 5B-D). Unsurprisingly, when gene interactions follow a random network model, clusters are small (Figure 5B). In contrast, small world and preferential attachment networks can lead to bigger cluster sizes (Figure 5C, D).
Model assessment using pan-genomes of closed P. aeruginosa and E coli genomes
Our simulation models produce several predictions that we now test using genome data on the model species Escherichia coli and Pseudomonas aeruginosa. We examine genetic linkage, cluster formation and their relationship to genetic networks to evaluate our model predictions
To capture information on chromosomal linkage, our analyses require the use of completed genomes only. Our results are based on 329 E. coli and 179 P. aeruginosa closed genomes from the NCBI database that met our criteria for inclusion (see methods). For each individual genome we predicted the coding sequences (CDS), finding on an average 5654 CDS for E. coli and 6159 for P. aeruginosa. We then compared the predicted CDS using BLAST reciprocal best match to determine orthologous groups (OGs, i.e., genes with a common evolutionary origin within a species) with Markov clustering. Given our focus on completed genomes only, the estimated pan-genomes (38731 OGs across all genomes of E. coli and 16795 for P. aeruginosa) are smaller than estimates using all available sequence data (9, 31). Because most of the genes are found in the accessory genome, a matrix of presence/absence of genes in each genome is highly sparse and bears the characteristic histogram U shape (Figure S1, Baumdicker, Hess, & Pfaffelhuber, 2012; Collins & Higgs, 2012; Lobkovsky, Wolf, & Koonin, 2013). A key aspect of our analysis is to examine the extent of co-segregation among gene pairs, which we can only calculate for variably present accessory genes. We therefore focus in this study on the accessory genome of E coli (5035 genes) and P. aeruginosa (4973 genes).
Testing cluster formation predictions
Our previous simulation results (Figure 5A) predict that under the neutral model clusters of size 2 are rare and larger clusters are effectively absent. To quantify the proportional abundance of clustered genes in the neutral simulation model, we find that 0.02% of the pan-genome are found in persistent gene clusters of size 2 or more (measured with prevalence cutoff of 0.4, purple line in Figure 5A). In contrast, the selection models allow for larger and geometrically distributed clusters, particularly under modular interactions. Specifically, we found the proportion of clustered genes to be 0.9%, 6% and 8% under the random, small-world and preferential attachment models, respectively (Figure 5B-D).
To assess these predictions against data, we examine the cluster distribution in our pangenome data collections, using the same prevalence thresholds and defining adjacency as a set of genes that are within 10KB of each other. Like our simulated results, cluster size approximates a geometric distribution (Figure 6A). Indicating that large gene clusters although present are rare in nature. Combining large and small clusters together we find that the total fraction of persistently clustered genes (7% for E. coli and 12% for P. aeruginosa) is in agreement with our modular network simulation models.
Next, we used experimentally validated operons as a reference for the distribution of cluster size in a single genome. We consulted the RegulonDB for E. coli (35, 36) which includes experimentally validated operons for the strain K12. The study on the transcriptome of P. aeruginosa by Wurtzel et al, (37) was used to estimate operons of genes in the strain PA14. In the latter an operon is defined as a set of genes transcribed together. In both cases the distribution of sizes approximates a geometric distribution (Figure 6B) in agreement with a modular network model (Figure 5C,D).
Co-segregation versus linkage
Figures 3 A-D predicts a negative relationship between d and outlier values of I, given modular gene interactions (Figures 3C, D). Turning to our genomic data (Figure 7), we see a predominance of low I values, regardless of chromosomal distance d. A simple correlation of I and d values reveals a significant yet small negative association ((Figure 7A E. coli: Pearson correlation r =-0.02; p = 0.003. (Figure 7B P. aeruginosa: r = −0.012; p = 0.001). To analyze the associations with d for outlier values of I, we proceed with the same data analyses as in our simulation models. After identifying outliers via the median absolute deviation (38) we find a more substantial negative association between outlier I and d values (Figure S2; E. coli: r = −0.34; p = 2.2X10−6. P. aeruginosa: r = −0.23; p = 1.5X10−6), consistent with our modular network simulation results. Figure 7 shows a pattern of sparse outlier values of I that resembles our modular interaction models (Figure 3C, D), demonstrating a bias towards the highest measures of co-segregation all occurring among genes in close association (see Figure S2 for a higher resolution plot of this region).
Defining the underlying gene interaction mechanism in E. coli and P. aeruginosa
In light of the simulation benchmarking results in Figure 4, we next turn to our measured outlier I values for E. coli and P. aeruginosa and ask: what is the network structure of observed I values, and does this structure recapitulate candidate gene-interaction networks, namely published metabolic networks and regulatory networks for these two model organisms?
In our simulation models we had defined ‘ground truth’ gene interaction networks. In our empirical analysis, we turn to candidate interaction networks, beginning with established gene regulatory networks for E. coli and P. aeruginosa (39–42), involving 1258 accessory genes from E. coli and 987 for P. aeruginosa. We next assess the similarity of this network with our co-segregation network via Hamming distances, and again contrast this observed network distances with randomized gene interaction network expectations (Figure 8A, B). The Hamming distance to the regulatory network is in the 1st percentile for E. coli and 25th percentile for P. aeruginosa (Fig 8 A, B).
In addition to evaluating regulatory networks, we performed the same analyses for established metabolic networks for both species (43, 44), which contain 1538 accessory genes of E. coli and 1278 genes of P. aerugiunosa. In this case we consider an edge between two genes if they participate in the same metabolic reaction or an adjacent reaction (i.e., the products of one reaction are reactants in the adjacent). We observe the Hamming distance for E. coli to be around the 10th percentile and less than the 1st percentile for P. aeruginosa (Fig 8 C, D).
To further test these conclusions, we assessed the extent to which the entire co-segregation network (not just outlier values of I) can serve as a predictor gene interaction network structure, for both metabolic and regulatory gene interaction networks (AUROC analyses, Supplementary Figure S3 A,B). Consistent with Figure 8, we find that the E. coli co-segregation network is most predictive of the E. coli regulatory gene network (AUROC = 0.86) and the P. aeruginosa co-segregation network is most predictive of the P. aeruginosa metabolic network (AUROC = 0.96). Altogether, the results in Figures 8 and S3 show that co-segregating genes in the accessory genome are more closely associated with direct regulatory links in E. coli (Figure 8A, S3A) and direct metabolic links in P. aeruginosa (Figure 8D, S3D).
Discussion
Our pan-genomic results reveal that co-segregating genes (high I) tend to be located closer on the genome (low d), for two model bacterial systems (Figure 7). While physical linkage provides an intuitive explanation for this result, our neutral evolutionary simulations show that initial linkage provides only a transient support for this association (Figure 1). Introducing selection on gene interactions into our simulation models can produce persistent linkage and co-segregation for a focal gene pair (Figure 2). When gene interactions are distributed modularly across the genome, we see the emergence of persistent clusters of chromosomally linked and co-segregating genes (Figure 3, 5), producing patterns of co-segregation that can reveal the underlying defined gene interaction network (Figure 4). Consistent with our simulation results, we find that E. coli and P. aeruginosa pan-genomes contain clusters of linked and co-segregating genes (Figures 6,7). Assessing our gene co-segregation networks for these two species against candidate gene interaction models we find support for pan-genome evolution primarily driven by metabolic (P. aeruginosa) and regulatory (E. coli) gene interactions (Figures 8, S3).
The identification of gene interactions is a major topic in biology and has resulted in numerous algorithms to define interactions from comparative genome data (45–50). A key differentiating factor in our work is the use of bacterial closed genomes, which allows us to analyze the dependency of genome structure on gene interactions. Our results have broad implications in both applied and basic biology contexts. In the context of basic biology, our results indicate how pan-genome structure is causally connected to gene interaction networks. On an applied scale, this causal connection can be used to aid gene function annotation. High co-segregation values can be taken as evidence for positive gene interactions, providing a tool to infer functional links from genes of unknown function to genes that are already annotated (e.g., established regulatory or metabolic functions). This approach can be viewed as an extension to the common ‘guilt by association’ framework (51), now leveraging repeated associations on the pan-genome scale. Current functional network models, including our baseline regulatory and metabolic networks examined above, are heavily biased towards the gene content of model strains(36, 42–44, 52). Our approach offers a path to infer and extend network structures beyond the well-studied domain of reference genomes.
We note that our results are potentially dependent on several limiting assumptions, for example we currently only examine the impact of positive versus neutral fitness interactions. Although this assumption is sufficient to generate gene clusters in our simulation, avenues for future work could include examination of the effects of negative and positive interactions on cluster formation, and/or higher-order fitness interactions (i.e., pairwise effects dependent on genetic background)(47, 53, 54). By focusing on orthologous groups of genes, gene-dosing effects of gene duplication are also omitted from our analysis (55, 56). Our focus on accessory genomes captures the large majority of the pan-genome of E. coli and P. aeruginosa, but omits the core genome (1.2 % and 4.5 % of E coli and P.aeruginosa pangenomes, respectively). It is reasonable to expect that core genes have interactions with the variable genome which we are not considering.
By modeling time via fixation events, we do not require parameterization of absolute rates of fixation. We do need to make assumptions however on the relative rates of distinct fixation events in our simulations. Beginning with an assumption of genome size stability (i.e. no directional change in lineage genome size through time, (54, 57)), we set rates of lineage gene gain and gene loss to be equal. Following previous work ((58–60)), we assume that genome rearrangement fixation events are substantially (three orders of magnitude) lower than lineage rates of gene gain and gene loss. While consistent with prior data and simulations, the fixation of genome rearrangement events is reported to vary widely across species (Ballouz et al. 2010). To examine the impact of varying the relative rates of fixation events, we find that our results are robust to order-of-magnitude variation in the relative rate of genome rearrangement fixation events. Specifically, relative rates of between 10−3 and 10−5 rearrangement fixations per gain (or loss) fixation result in stable cluster formation (Figure S4).
We have shown that gene interactions can introduce systematic biases to pan-genome evolution. In particular, modular gene interactions are sufficient to guide the evolution of persistent gene clusters, and we argue are the first step in the evolutionary process of operon formation. Although several other models have offered working alternatives to the clustering/operon problem, prior work lacks explicit treatment of the pan-genome. By integrating simulation and data analysis at the pan-genome scale, our work indicates that genome interactions are the central organizing principle to understand the origins and maintenance of pan-genome structure and open a new path for gene annotation across the major pool of bacterial genomic diversity – the accessory genome.
Materials and Methods
Supplementary Methods
Model of gene clustering
A collection G of 100 genomes is simulated through evolutionary time (each genome through time forms an evolutionary lineage, coupled to other lineages by processes of gene transfer). For simulation purposes a genome is an ordered set of genes, where each element represents a gene. The collection represents a polymorphic sampling of strains from the species, and a genome is a unique arrangement of this genetic pool. Each genome has an average size (S) of 1000 genes. Each gene is allocated to a set of genomes by sampling a beta distribution, such that the number of times each gene appears in the collection (Ng) is given by:
This distribution is like the one found in E. coli and P. aeruginosa pangenomes and is common in bacterial pangenomes (1). G has an associated matrix W, which describes the interactions between genes. The W(i,j) element represents the associated cost or benefit to a genome of carrying both gene i and gene j. The simulation considers events of gene gain, gene loss, and genome rearrangement (i.e., genomic inversions and translocations) as independent events the genome can suffer, these events can be influenced by a selection criterion or can be free of selection. The model however does not consider the population dynamics that lead to fixation of genomic changes. It is assumed that (given enough time) a genome lineage will suffer one of the following three categories possible fixation events: gene gain, gene loss or gene inversion / translocation. Each iteration step of the model represents a fixation event for one of these processes in one of the 100 genome lineages. The simulation algorithm can be summarized as follows:
Initialize the set (G, W) as described above.
Chose an individual genome at random.
A fixation event occurs:
Gene gain.
Gene loss.
Gene inversion/translocation.
Distance and mutual information are calculated for the genes in the set.
Repeat steps 2-4 until the end of the simulation.
Record the values and exit
Gene interaction and selection effects
Selection is only considered in step 3 of the above-described algorithm. Assumptions on the impact of selection are described for each event as follows:
For a gene gain event, the available pangenome pool is sampled, the insert size (i.e., number of genes to be inserted) is calculated according to a Poisson distribution. W is used to calculate the insertion site in the receiver genome. For each joint in the genome (i.e., edge between two genes) a weight is calculated such that
Where Aij is the weight for the joint between genes i and j, Wij is the interaction coefficient between the two genes, and S is the genome size. An insertion site is selected by sampling from the distribution of values A. We recover the neutral model (Figure 2) given Wij = 1.
For a gene loss event, fragment size is selected the same way as with gene gain. A fragment to be removed is selected by calculating its weight, by considering the sum of the weights for the genes in the fragment.
The fragment to be removed is sampled form the set of all possible weights R.
Finally, for a genome rearrangement event. A fragment to be moved is selected same as for gene loss, and the fragment is inserted into a site selected according to gene gain.
Simulation parameterization
For all simulations we used a genome collection of 100 separate genome lineages. 1000 genes (assumed to be orthologous groups) are allocated to the collection such that the probability of each gene (P(og)) follows a beta distribution with positive equal shape parameters (α,β), such that (P(og) Beta(α = 0.5, β = 0.5). Genome size is initially 2000 genes for all genes. Fixation events rates are defined as: gene gain (Pg), gene loss (Pl), and gene rearrangement (Pr).
With Pg = Pl, although individual genome changes during the simulation, average genome size is constant.
For our model of interactions, a network is defined for all genes (G). We use three network generated models:
-A random network according to the Erdős–Rényi random graph model (2). With edge sampling parameter p =0.0125.
-A small-world network according to the Watts–Strogatz model (3), with an expectation of 1.2X106 edges and an average degree (K) of 3.
-A preferential attachment network according Barabási–Albert model (4), with a degree distribution parameter (γ) equal to 3 and a growth rate parameter of 1.2.
This parameter selection ensured we have networks with similar number of nodes and edges such that the simulations are comparable. Because both the Watts–Strogatz and the Barabási–Albert generate sparse networks we use a small value of p in our random network. Specifically small world networks require that the number of edges I is smaller than the number of nodes (n) such that, n ≫ e ≫ ln (e) (3). While preferential attachment networks have a distribution of degree following a power law, they generate sparse network with, n ≫ e (5).
Co-segregation and chromosomal distance
A presence/absence matrix was built for the selected genes (OGs). In this numeric matrix (1 representing presence and 0 representing absence), each gene is treated as a random variable and its probability distribution is given by its prevalence in the collection. For each gene pair mutual information is calculated according to:
Where X and Y are two separate genes, p(x), p(y) are the marginal probabilities for gene X and Y respectively (e.g., p(x)probability of finding gene X in a single genome), and p(x,y) is the joint probability distribution of gene X and Y respectively.
Chromosomal distance (d) is calculated between all pairs of selected genes. Such that:
Where Dg(x,y) is the set of 2 distances in the circular chromosome between genes x and y for the gth genome and G is the total number of genomes in the collection.
Outlier detection and network inference
We used median absolute deviation (MAD) to determine outlier measurements of co-segregation according to (6):
Where I is the set of measurements of co-segregation, and C is a consistency constant. A given value is considered an outlier if it has an absolute deviation from the maximum deviation larger than 2.
A network is inferred using outlying values of co-segregation. Where for a gene pair x and y, an edge is drawn between them if their co-segregation (Ixy) is larger than the MAD cutoff. Comparison between networks is performed by calculating the Hamming distance that is; the number of transformations (i.e., edge additions and subtractions necessary for the two networks to be identical, this is expressed as:
Where E(Gi) is the number of edges in network i, E(Gj) is the number of edges in network j.
E. coli and P. aeruginosa genome processing
To avoid sampling issues in the gene content analysis only completed genomes were included, this includes sequences labeled as either “Complete genome” (which includes the chromosome and plasmid information) and Chromosome in the database. Additionally, we removed any genome that had gaps in sequences or unspecified nucleotides (N). We found 179 Pseudomonas aeruginosa and 329 Escherichia coli genomes from the NCBI database that met our criteria for inclusion.
Coding sequences for each genome were predicted using GeneMarkS v.1.11 (7). For each pair of genomes reciprocal protein BLAST v.2.10.1 (8) was performed. Reciprocal best matches are used to determine orthologous groups (OGs) with Markov clustering, using and inflation value of 1.5 (9). We exclude core genes (i.e., OGs present in all the genomes in the collection), and OGs with a single instance in the collection to produce a dataset of accessory genes (the variable genome).
Cluster size calculation
To calculate cluster size, we selected 4 different prevalence thresholds (0.4, 0.5, 0.6, 0.7) and selected all genes with prevalence equal or higher in our genome collection. For all gene pairs we consider a cluster if the distance between them is below our selection criteria (10Kbp for bacterial genomes, 0 gene for simulated genomes) so that adjacent and overlapping genes small intergenic. For each pair, above our prevalence threshold, their distance is calculated to all other genes a cluster is determined according to the same criteria. This process is repeated iteratively until a cluster size of 10.
Supplementary Information
Acknowledgments
We thank Peng Qiu, Marvin Whiteley, Gabriel Perron, Tim Read, Rohan Mehta, the Brown lab and members of the Center for Microbial Dynamics and Infection for valuable feedback on the manuscript. We thank the NIH (5R21AI156817-02) and the Cystic Fibrosis Foundation (BROWN21P0) for funding this project.
Footnotes
Competing Interest Statement: The authors declare no competing interest.