## Abstract

While bacteria divide clonally, occasional homologous recombination is known to be an important contributor to their evolution. However, the details of how the competition between clonal inheritance and recombination shapes genome diversification, population structure, and species stability remains poorly understood. Using a computational model, we propose two evolutionary regimes and identify two composite parameters that dictate the fate of bacterial species. In the *divergent* regime, characterized by either a low recombination frequency or strict barriers to recombination, cohesion due to recombination is not sufficient to overcome the mutational drift. As a consequence, the divergence between any pair of genomes in the population steadily increases in the course of evolution. The species as a whole lacks coherence at the population level with sub-populations continuously formed and dissolved. In contrast, in the *metastable* regime, characterized by a high recombination frequency combined with low barriers to recombination, genomes continuously recombine with the rest of the population. The population remains genetically cohesive and stable over time. We demonstrate that the transition between these two regimes can be affected by relatively small changes in evolutionary parameters. Using the data from Multi Locus Sequence Typing (MLST) analysis we classify a number of well-studied bacterial species to be either the divergent or the metastable type. Mechanisms that allow bacterial species to transition from one regime to another are discussed. Generalizations of the framework to understand adaptive populations, horizontal gene transfer of non-homologous regions, and spatial correlations in diversity along the chromosome are also discussed.

## Introduction

Bacterial genomes are extremely variable, comprising both a consensus ‘core’ genome which is present in the majority of strains in a population, and an ‘auxiliary’ genome, comprising genes that are shared by some but not all strains (1–7).

Multiple factors shape the diversification of the core bacterial genome. Bacteria divide clonally thereby inheriting the entirety of their mother’s genome. The balance between this vertical inheritance and random fixation of single nucleotide polymorphisms (SNPs), generated at a rate *μ* per base pair per generation, limits the typical population diversity to *θ* = 2*μN _{e}* where

*N*is the

_{e}*effective*population size of the species (8). During the last two decades, genetic exchange between closely related organisms through homologous recombination, attempted at a rate p per base pair per generation, has also been recognized as a significant factor in evolution (5, 6, 9–14). Notably recombination between genetically distant bacteria is suppressed, the probability

*p*

_{success}~

*e*

^{−δ/δTE}of successful recombination of foreign DNA into a recipient genome decays exponentially with

*δ*, the

*local*divergence between the donor DNA and the recipient (12, 15–17). The effective barrier

*δ*

_{TE}to successful recombination, referred here as the

*transfer efficiency*, is shaped at least in part by the biophysical mechanisms of homologous recombination (15, 16).

While clonal inheritance with mutations imposes a clonal structure on the population, recombination acts as an homogenizing force, keeping populations homogeneous and potentially destroying the genetic signatures of clonality (6, 17, 18). There are two principal components to the interplay between mutations and recombinations. First is the competition between the diversity within the population *θ* and the maximal diversity within one subpopulation *δ*_{TE} uniformly capable of successful recombination. If *δ*_{TE} < *θ*, one expects spontaneous fragmentation of the entire population into several transient sexually isolated sub-populations that rarely exchange genetic material between each other. In contrast, if *δ*_{TE} > *θ*, unhindered exchange of genetic fragments may result in a single cohesive population. Second is the competition between the recombination transfer rate *ρ* and the mutation rate *μ*. The typical time between consecutive recombination events in any local genomic region is 1/2*ρ* × *l*_{tr} where *l*_{tr} is the typical length of transferred regions. In the same time, the total divergence accumulated in this region is *δ*_{mut} ~ 2*μ*/2*ρ* × *l*_{tr}). If *δ*_{mut} ≫ *δ*_{TE}, the pair of genomes may become sexually isolated from each other between successive recombination events. In contrast, if *δ*_{mut} < *δ*_{TE}, frequent recombination events may disallow sexual isolation resulting in a homogeneous population.

What qualitative dynamical regimes in bacterial evolution emerge from the competition and balance between these two factors and which evolutionary parameters dictate the evolutionary fate of bacterial genome diversification remains poorly understood. Importantly, even the question of whether bacteria can retain their clonal inheritance in the presence of recombination and whether signatures of clonal structure and recombination can be inferred from population genetic data is still heavily debated (17, 19–21).

Nonetheless, some aspects of this interplay have been explored before. In their pioneering study Vetsigian and Goldenfeld (22) investigated the effects of nonrecombining segments (for example, a large scale inversion or insertion) on recombination events in their chromosomal neighborhoods’ vicinity and how it may result in divergence spreading along the chromosome. Falush et al. (23) suggested that a low *transfer efficiency δ*_{TE} leads to sexual isolation in *Salmonella enterica*. Fraser et al. (18), working with a *θ* = 0.4% (lower than those observed in typical bacterial species) and a *transfer efficiency δ*_{TE} ≈ 2.4% concluded that realistic values of sexual isolation in bacterial species is insufficient to cause speciation with realistic recombination frequencies. Doroghazi and Buckley (24), working with a fixed *transfer efficiency* but a very small population size (limit of *θ* → 0), studied how the competition between mutations and recombination affects the cohesion of two isolated subpopulations.

In this work, using a computational model and mathematical calculations, we show that the two composite parameters identified above, *θ*/δ_{TE} and *δ*_{mut}/*δ*_{TE}, determine qualitative evolutionary dynamics of a given bacterial species. Furthermore, we identify two principal regimes of this dynamics. In the *divergent regime*, characterized by a high *δ*_{mut}/*δ*_{TE}, local genomic regions acquire multiple mutations between successive recombination events and rapidly isolate themselves from the rest of the population. The population remains mostly clonal where transient sexually isolated sub-populations are continuously formed and dissolved. In contrast, in the *metastable regime*, characterized by a low *δ*_{mut}/*δ*_{TE} and a low *θ*/*δ*_{TE}), local genomic regions recombine repeatedly before ultimately escaping the pull of recombination (hence the name “metastable”). At the population level, in this regime all genomes can exchange genes with each other resulting in a sexually cohesive and temporally stable population. Notably, our analysis suggests that only a small change in evolutionary parameters can have a substantial effect on evolutionary fate of bacterial genomes and populations.

We also show how to classify bacterial species using the conventional measure of the relative strength of recombination over mutations, *r*/*m* (defined as the ratio of the number of single nucleotide polymorphisms (SNPs) brought by recombinations and those brought by point mutations in a pair of closely related strains), and our second composite parameter *θ*/*δ*_{TE}. Based on our analysis of the existing MLST data, we find that different bacterial species bacteria belong to either divergent and metastable regimes. We discuss possible molecular mechanisms and evolutionary forces that decide the role of recombination in a species’s evolutionary fate. We also discuss possible extensions of our analysis to include adaptive evolution and genome modifications such as insertions, deletions, and inversions.

## The computational model

We consider a population of *N _{e}* strains. The population evolves with non-overlapping generations and in each generation the strains choose their parents randomly (8). The genome of each strain has

*G*indivisible and non-overlapping transferable units. For simplicity, in what follows we refer to these units as

*genes*but note that while protein coding genes in a typical bacteria are ~ 1000 base pair long, we use

*l*

_{tr}= 5000 base pairs mimicking genetic transfers longer than individual protein coding genes (6, 10). These genes acquire point mutations at a rate

*μ*per base pair per generation and recombinations are attempted on a recipient genome from a randomly selected donor in the population at a rate p per base pair per generation. The mutations and recombination events are assumed to have no fitness effects. Finally, the probability of a successful integration of a donor gene decays exponentially,

*p*

_{success}~

*e*

^{−δ/δTE}, with the

*local*divergence

*δ*between the donor and the recipient.

In order to avoid simulating extremely large bacterial population sizes, we focus on the evolution of divergence between two randomly chosen genomes labeled *X* and *Y* in a co-evolving population. *X* and *Y* start diverging from each other as identical twins at time *t* = 0 (when their mother divides). We denote by *δ _{i}*(

*t*), the divergence in the

*i*

^{th}gene between

*X*and

*Y*at time

*t*and by the average genome-wide divergence. Based on population genetic and biophysical considerations, we derive the probability

*E*(

*δ*|

_{a}*δ*) = 2

_{b}*μM*(

*δ*|

_{a}*δ*) + 2

_{b}*ρl*

_{tr}

*R*(

*δ*|

_{a}*δ*) (

_{b}*a*for after and

*b*for before) that the divergence in any gene changes from

*δ*to

_{b}*δ*in one generation (see supplementary materials for details) (6). Briefly, there are two components to the probability,

_{a}*M*and

*R*. Point mutations, represented by

*M*(

*δ*|

_{a}*δ*), occur at a rate 2 ×

_{b}*μ*per base pair per generation and increase the divergence in a gene by 1/

*l*

_{tr}. Unlike point mutations, after a recombination event (represented by

*R*(

*δ*|

_{a}*δ*)), the divergence can change suddenly, taking values either larger or smaller than the current divergence (6). Note that recombinations from highly diverged members in the population are suppressed exponentially and consequently not all recombination attempts are successful. Intuitively, the time evolution of

_{b}*p*(

*δ*|

*t*) of the probability of observing a divergence

*δ*in a gene at time t can be written at

## Evolution of *local* divergence has large fluctuations

In Fig. 1 we show a color-coded typical trajectory of evolution of the *local* divergence *δ*(*t*) of a single gene in a pair of genomes. We have used *θ* = 1.5% and *δ*_{TE} = 1%, values typically observed in bacterial species (6, 18). To keep the simulation times manageable, the mutation and the recombination rates used here are 4–5 orders of magnitude higher compared to those observed in real bacteria (*μ* = 10^{-5} per base pair per generation and *ρ* = 5 × 10^{-6} per base pair per generation, *δ*_{mut}/*δ*_{TE} = 0.04) (25, 26) while keeping the ratio of the rates realistic (5, 6, 13, 27). Alternatively, the unit of time in these simulations is considerably longer than one single generation.

The time evolution of *δ*(*t*) is noisy; mutational drift events that gradually increase the divergence linearly with time (red) are frequently interspersed with homologous recombination events (green if they increase *δ*(*t*) and blue if they decrease it) that suddenly change the divergence to typical values seen in the population (see Eq. A1 in the appendix). Eventually, either through the gradual mutational drift or a sudden recombination event, *δ*(*t*) increases beyond the integration barrier set by the transfer efficiency, *δ*(*t*) ≫ *δ*_{TE}. Beyond this point, the two strains belong to two different sexually isolated sub-clades. Any further recombination events on any one of the strains are limited to their own sub-clades and do not change the divergence between the two strains. Consequently, the mutational drift keeps on driving them further apart indefinitely.

## Genome-wide divergence

Because genomic regions in our model evolve independently of each other, the genome-wide average divergence Δ(*t*) can be calculated as the mean of *G* independent realizations *local* divergences *δ*(*t*). Consequently, since the number *G* of genes in the genome is large, the law of large numbers implies that the fluctuations in the dynamics of Δ(*t*) are substantially suppressed compared to more noisy δ(*t*) seen in Fig. 1.

In Fig. 2, we plot the time evolution of Δ(*t*) between a pair of strains (as % difference). We have used *θ* = 0.25%, *δ*_{TE} = 1%, and *δ*_{mut}/*δ*_{TE} = 2, 0.5,0.04, and 2 × 10^{-3} respectively. When *δ*_{mut}/*δ*_{TE} is large (either a due to low p or a low *δ*_{TE}), in any local genomic region, multiple mutations are acquired between two successive recombination events. Consequently, individual genes escape the pull of recombination rapidly and *Δ*(*t*) increases roughly linearly with time at a rate 2*μ*. For smaller values of *δ*_{mut}/*δ*_{TE}, the rate of change of Δ(*t*) in the long term decreases as many of the individual genes repeatedly recombine with the population. However, even then the fraction of genes that have escaped the integration barrier slowly increases over time, eventually leading to a linear increase in Δ(*t*) with time albeit with a slope different than 2*μ*. The repeated resetting of δ(*t*) after homologous recombination (see Fig. 1) generally results in a Δ(*t*) that increases extremely slowly with time.

At the shorter time scale, the trends in genome divergence are opposite of those at the longer time scale. At a fixed *θ*, a low value of *δ*_{mut}/*δ*_{TE} implies faster divergence and vice versa. When recombination rate is high, genomes of strains quickly ‘equilibrate’ with the population and the genome-wide average divergence between a pair of strains reaches the population average diversity ~ *θ* (see the red trajectory in Fig. 2). From here, any new mutations that increase the divergence are constantly wiped out through repeated recombination events with the population. Over time, a rare event, wherein mutations accumulate and no recombinations take place for a sufficiently long time, allows individual ‘genes’ to escape the pull of recombination. Subsequently, genes on any pair of genomes escape the pull of recombination one after the other and the genomes can diverge indefinitely.

Computational algorithms that build phylogenetic trees from multiple sequence alignments often rely on the assumption that the sequence divergence, for example between a pair of strains (at the level of individual genes or at the level of genomes), faithfully represents the time that has elapsed since their most recent common ancestor (MRCA). However, Fig. 1 and Fig. 2 serve as a cautionary tale. Notably, after just a single recombination event the *local* divergence at the level of individual genes does not at all reflect time elapsed since divergence but rather depends on statistics of divergence within a recombining population (see (6) for more details). At the level of genomes, when *δ*_{mut}/*δ*_{TE} is large (e.g. the blue trajectory in Fig. 2), the time since MRCA of two strains is directly correlated with the number of mutations that separate their genomes. In contrast, when *δ*_{mut}/*δ*_{TE} is small (see pink and red trajectories in Fig. 2), frequent recombination events repeatedly erase the memory of the clonal ancestry. Nonetheless, individual genomic regions slowly escape the pull of recombination at a fixed rate. Thus, the time since MRCA is reflected not in the total divergence between the two genomes but in the fraction of the length of the total genomes that has escaped the pull of recombination. As described above one uses a very different rate of accumulation of divergence to estimate evolutionary time from genome-wide average divergence.

## Quantifying metastability

How does one quantify the *metastable* behavior described above? At the level of individual genes it is manifested through constant resetting of *δ*(*t*) to typical population values and at the level of entire genomes through a very slow increase in Δ(*t*) when *δ*_{mut}/*δ*_{TE} is small. Fig. 2 suggests that high rates of recombination prevent pairwise divergence from increasing beyond the typical population divergence ~ *θ* at the whole-genome level. Thus, for any set of evolutionary parameters, *μ, ρ, θ*, and *δ*_{TE}, the time it takes for a pair of genomes to diverge far beyond the typical population diversity *θ* can serve as a quantifier for metastability.

In Fig. 3, we plot the time *t*_{div} required for the genome-wide average divergence Δ(*t*) between a pair of genomes to exceed twice the typical population diversity 2 × *θ* as a function of *θ*/*δ*_{TE} and *δ*_{mut}/*δ*_{TE}. Note that in the absence of recombination, it takes *t*_{div} = 2*N _{e}* generations before Δ(

*t*) exceeds 2

*θ*. While we work with an arbitrary threshold of twice the average diversity, we do not expect that our results will change for any other non-typical divergences >

*θ*.

We observe two distinct regimes in the behavior of *t*_{div} as a function of *θ*/*δ*_{TE} and *δ*_{mut}/*δ*_{TE}. In the *divergent* regime, after a few recombination events, the divergence *δ*(*t*) at the level of individual genes quickly escapes the integration barrier and increases indefinitely. Consequently, Δ(*t*) increases linearly with time (see *δ*_{mut}/*δ*_{TE} = 2 in Fig. 2) and reaches Δ(*t*) = 2*θ* within ~ 2*N _{e}* generations. In contrast, it takes extremely long for Δ(

*t*) to reach 2

*θ*as

*δ*

_{mut}/

*δ*

_{TE}decreases in the

*metastable*regime. Here, genes get repeatedly exchanged with the population and Δ(

*t*) between a pair of strains appears to remain constant over large periods of time (see

*δ*

_{mut}/

*δ*

_{TE}= 2 × 10

^{-3}in Fig. 2). Notably, a small perturbation in evolutionary parameters can change evolutionary dynamics from divergent to metastable and vice versa near the boundary region between the two regimes.

## Population structure

Can we understand the phylogenetic structure of the population by studying the dynamics of evolution of divergence between a pair of strains in the population? Every pair of genomes diverges indefinitely when sufficient amount of time has elapsed since their MRCA. But, in a finite population of size *N _{e}*, the

*average*probability of observing a pair of strains whose MRCA existed

*t*generations ago is exponentially distributed, (the line indicates averaging over multiple realizations of the coalescent process) (28–30). Thus, while it may be possible a pair of genomes considered above to diverge indefinitely from each other (see Fig. 2), it becomes more and more unlikely to find such a a pair in a finite sized population.

Let us define *π*(Δ) as the probability that the genomes of two randomly picked strains in a population have diverged from each other by Δ. Let be the time average of *π*(Δ). We have

In Eq. 2, *p _{c}*(

*t*) is the probability of that a pair of strains share their MRCA

*t*generations ago in any time snapshot of the population and

*p*(Δ|

*t*) is the probability that a pair of strains have diverged by Δ at time

*t*. Given that Δ(

*t*) is the average of

*G*≫ 1 independent realizations of

*δ*(

*t*), we can approximate it as a Gaussian distribution with average and variance . The angular brackets and the subscript

*G*indicate an average over the entire genome.

Unlike the time averaged distribution , the instantaneous distribution *π*(Δ) is accessible from genome sequences. However, we expect substantial differences between the two distributions even in the large population limit given that *p _{c}*(

*t*) is extremely noisy and does not resemble its long-time average (29, 30). In panels a) to d) of Fig. 4, we show

*π*(Δ) for the four cases shown in Fig. 2. We fixed the population size to

*N*= 500. We changed

_{e}*δ*

_{mut}/

*δ*

_{TE}by changing the recombination rate

*ρ*. The solid lines represent a time snapshot obtained by numerically sampling

*p*(

_{c}*t*) in a Fisher-Wright population of size

*N*= 500. The dashed black line represents the time average .

_{e}In the divergent regime of Fig. 3, at high *δ*_{mut}/*δ*_{TE} values, the *instantaneous* snapshot distribution *π*(Δ) has multiple peaks indicating the spontaneous formation of clonal sub-populations *even in a homogeneous population* that exchange genetic material within the clade but not outside of it, either because of a low recombination frequency or because of a low transfer efficiency. In this case, the time averaged distribution has a long exponential tail and, as expected, does not agree with the instantaneous distribution *π*(Δ).

Notably, in the metastable regime, at lower values of *δ*_{mut}/*δ*_{TE}, the exponential tail shrinks into a Gaussian-like peak. The width of this peak relates to fluctuations in Δ(*t*) around its mean value which in turn are related to the total number of genes *G*. Moreover, the difference between the instantaneous and the time averaged distributions decreases as well. In this limit, all strains in the population exchange genetic material with each other. Thus, in the metastable regime, frequent recombination events can successfully eiliminate the clonal structure of the population leading to a sexually cohesive and temporally stable population.

## Analysis of bacterial species

How are bacterial species placed on the divergent-metastable diagram? Instead of *δ*_{mut}/*δ*_{TE} as defined here, population genetic studies of bacteria quantify the relative ‘strength’ of recombination over mutations as *r*/*m*. In our framework, *r*/*m* can be estimated as *r*/*m* = *ρ*_{succ}/*μ* × *l*_{tr} × *δ*_{tr} where *ρ*_{succ} < *ρ* is the rate of successful recombination events and *δ*_{tr} is the average divergence in transferred regions. Both *ρ*_{succ} and *δ*_{tr} depend on the evolutionary parameters (see appendix for a detailed description of the calculations).

In Fig. 5, we re-plot the ‘phase diagram’ in Fig. 3 in terms of *θ*/*δ*_{TE} and *r*/*m* and place multiple bacterial species on it. We estimated *θ* from MLST data (31) and used *r*/*m* values that were determined previously by Vos and Didelot (13). We assumed that the transfer efficiency *δ*_{TE} was approximately equal to *δ*_{TE} ~ 2.26% (18). There are three striking features. First, our analysis identifies both *r*/*m* and *θ*/*δ*_{TE} as important evolutionary parameters and suggests that individually *r*/*m* or *θ*/*δ*_{TE} alone cannot determine population structure. Second, the sharp transition between the divergent and the metastable phase is also observed in the modified phase diagram as well, implying that a small change in either *r*/*m* or *θ*/*δ*_{TE} can change the evolutionary fate of a bacterial species. And finally, we observe that real bacterial species can indeed be both divergent as well as metastable.

Can bacteria change their evolutionary fates? There are multiple biophysical and ecological processes by which bacterial species may move from the metastable to the divergent regime and vice versa in Fig. 3. For example, if population size remains constant, a change in mutation rate changes *δ*_{mut}/*δ*_{TE} as well as *θ*. A change in the expression of the mismatch repair (MMR) system or the type of the MMR system can change *δ*_{TE} (15). Loss of co-infecting phages or the SOS system and the number of restriction-modification (RM) systems (32) can change the rate of recombination.

Adaptive and ecological events are regularly inferred from population genomics data only after rejecting models of *neutral* evolution. However, the range of qualitative behaviors explained by the neutral models of recombination-driven evolution of bacterial species was not entirely quantified leading to potentially unwarranted conclusions, as illustrated in (33).

Consider *E. coli* as an example. Known strains of *E. coli* are usually grouped into 5-6 different sub-clades. It is thought that inter-clade sexual exchange is lower compared to intra-clade exchange (6, 27). Ecological niche separation and/or selective advantage are usually implicated as initiators of such putative speciation events (17). In our previous analysis of 32 fully sequenced *E. coli* strains, we estimated *θ*/*δ*_{TE} > 3 and *r*/*m* ~ 8 – 10 (6) implying that *E. coli* resides in the divergent regime in Fig. 5. Thus, the analysis presented here indicates that *E. coli* strains should spontaneously form sexually isolated sub-populations even in the absence of selective pressures or ecological niche separation. Consequently, careful analysis is needed to reject neutral models of evolution in the study of population genetics of bacteria.

## Conclusions

While recombination is now recognized as the chief contributor to the observed genome diversity in many bacterial species(5, 6, 9–13), its effect on population structure and species stability is still heavily debated (17–21).

In this work, we have shown that recombination-driven bacterial genome evolution can be understood as a balance between two important competitions. We identified the two dimensionless parameters *θ*/*δ*_{TE} and *δ*_{mut}/*δ*_{TE} that dictate this balance and result in two qualitatively different regimes in bacterial evolution, separated by a sharp transition.

As seen in Fig. 3 and Fig. 5, in the divergent regime, the pull of recombination is insufficient to homogenize individual genes and entire genomes leading to a temporally unstable and sexually fragmented population. Notably, understanding divergence between a pair of genomes allows us to study the structure of the population as well. As shown in Fig. 4, genomes of members of divergent population form a clonal population structure. On the other hand, in the metastable regime, individual genomes repeatedly recombine genetic fragments with each other leading to a sexually cohesive and temporally stable population. As seen in Fig. 5, real bacterial species appear to belong to both of these regimes as well as in the cross-over region between the regimes.

## Extending the framework

Recombination in bacterial species is thought to be essential in order to minimize the fitness loss due to Muller’s ratchet (34) and to minimize clonal interference (35). Thus, it is likely that both recombination frequency and transfer efficiency are under selection (34). Could one include fitness effects in our theoretical framework? Here, we considered the dynamics of *neutrally* evolving bacterial population of effective population size *N _{e}*. The effective population size is incorporated in our framework via the coalescent time distribution of a neutral population, exp(–

*T*/

*N*) (see supplementary materials). Neher and Hallatschek (36) recently showed that while pairwise coalescent times in adaptive populations are

_{e}*not*exponentially distributed, this distribution has a pronounced exponential tail with an

*effective*population size

*N*weakly related to the actual population size and largely determined by the variance of mutational fitness effects (36). Our results could be generalized for adaptive populations by incorporating the effects of such non-exponential coalescent time distribution.

_{e}In this work, we assumed that in a recipient genome, recombinations bring *non-overlapping* segments of length ltr that recombine in their entirety. In real bacteria, different recombined segments have variable lengths and partially overlap with each other thereby creating a *mosaic* of clonal and transferred regions along a chromosome (6, 10, 11). Overlapping transfers can affect evolutionary dynamics. In particular, when a local region within a genome has diverged above the threshold imposed by transfer efficiency *δ*_{TE} it reduces the likelihood of successful homologous recombination near both of its boundaries leading to a gradual expansion of the highly diverged region along the chromosome (22). Vetsigian and Goldenfeld proposed (22) that non-core genome segments e.g. horizontally acquired pathogenicity genomic islands could nucleate such propagating fronts of diversity and ultimately give rise to a new species. Genome rearrangements such as large-scale inversions are also expected to reduce the local rate of recombination in their vicinity. One expects that a long stretch of a genome that diverged above transfer efficiency threshold can also result in two propagating wave fronts of divergence along the chromosome. In our future studies we plan to explore these and other extensions on top of the basic mathematically tractable model described here.

## A1. APPENDIX

### A. The computational model

We consider a model population of *N _{e}* bacteria. The population evolves with non-overlapping generations. In each generations, offsprings choose their parent randomly. Genome of every bacteria consists of

*G*genes. The genes can mutate and can be transferred

*one at a time*in their entirety. The genes in this model are in fact indivisible units of homologous recombination. We denote by

*l*

_{tr}the length of each gene. We use

*l*

_{tr}= 5000 base pairs reflecting transfers larger than individual genes (6, 10). The mutation rate is

*μ*per base pair per generation and the rate at which recombinations are attempted is

*ρ*per base pair per generation. We assume that recombinations always start at the first base pair of each gene.

In this co-evolving population, we focus on the divergence between a pair of strains *X* and *Y* that at time *t* = 0 start as identical twins. The divergence *δ*(*t*) on any one of the genes between these two pairs evolves stochastically as a function of time. With probability 2*μ*, the divergence increases by 1/*l*_{tr}. Recombinations are attempted from the population into one of the genomes (say X) at a rate 2 × *ρ*. The divergence after a recombination *δ _{a}* (

*a*for after) event can either remain the same, decrease, or increase compared to the divergence before recombination

*δ*(

_{b}*b*for before). The three probabilities are given by (see Fig. A1 for an illustration). Here,

*Di*(

*x*) is the Dirac Delta function and

*H*(

*x*) is the Heaviside theta function. The full evolutionary kernel

*E*(

*δ*|

_{a}*δ*) is the combination of mutational events and recombination events.

_{b}### B. Estimating *r*/*m*

As mentioned in the main text, *r*/*m* is defined in a pair of strains as the ratio of SNPs brought in by recombination events and the SNPs brought in by point mutations. Clearly, *r*/*m* will depend on a strain-to-strain comparison however, usually it is reported as an average over all pairs of strains. How do we compute *r*/*m* in our framework? We have

Thus, in order to compute *r*/*m*, we need two quantities. First, we need to compute the rate of successful recombinations *ρ*_{succ} < *ρ*. We can calculate *ρ*_{succ} as
where *p*_{succ} is the success probability that a gene that has diverged by *δ* will have a successful recombination event. The integration over exponentially distributed pairwise coalescent times averages over the population. *p*_{succ} can be computed from Eq. A1 by integrating over all possible scenarios of successful recombinations. We have
where *δ*^{*} = *δ*/*δ*_{TE} and *θ*^{*} = *θ*/*δ*_{TE} are normalized divergences and *p*(*δ*|*t*) is the distribution of *local* divergences at time *t*. In practice, *r*/*m* can only estimated by analyzing statistics of distribution of SNPs on the genomes of closely related strain pairs where both clonally inherited and recombined parts of the genome can be identified (6, 27). Here, we limit the time-integration in Eq. A3 to times *t* < min(*N _{e}* =

*θ*/2

*μ*,

*δ*

_{TE}/2

*μ*).

Second, we need to compute the average divergence in transferred segments, *δ*_{tr}. We have
where *δ _{t}* (

*δ*) is the average divergence

*after*a recombination event if the divergence before transfer was

*δ*.

### C. Computing *θ* from MLST data

Except for *E. coli* where we used our previous analysis (6), we downloaded MLST sequences of multiple organisms from the MLST database (31). For each of the 7 genes present in the MLST database, we performed a pairwise alignment between strains. *θ* for each gene was calculated as the average of pairwise SNPs. The *θ* for the species was estimated as average of the *θ*s calculated for each of the 7 genes.

## Footnotes

↵* Email: maslov{at}illinois.edu