## Abstract

Although homologous recombination is accepted to be common in bacteria, so far it has been challenging to accurately quantify its impact on genome evolution within bacterial species. We here introduce methods that use the statistics of single-nucleotide polymorphism (SNP) splits in the core genome alignment of a set of strains to show that, for many bacterial species, recombination dominates genome evolution. Each genomic locus has been overwritten so many times by recombination that it is impossible to reconstruct the clonal phylogeny and, instead of a consensus phylogeny, the phylogeny typically changes many thousands of times along the core genome alignment.

We also show how SNP splits can be used to quantify the relative rates with which different subsets of strains have recombined in the past. We find that virtually every strain has a unique pattern of recombination frequencies with other strains and that the relative rates with which different subsets of strains share SNPs follow long-tailed distributions. Our findings show that bacterial populations are neither clonal nor freely recombining, but structured such that recombination rates between different lineages vary along a continuum spanning several orders of magnitude, with a unique pattern of rates for each lineage. Thus, rather than reflecting clonal ancestry, whole genome phylogenies reflect these long-tailed distributions of recombination rates.

## Introduction

The only illustration that appears in Darwin’s Origin of species [1] is of a phylogenetic tree. Indeed, the tree has become the archetypical concept representing biological evolution. Since every biological cell that has ever lived was the result of a cell division, all cells are connected through cell divisions in a giant tree that stretches all the way back to the earliest cells that existed on earth. Thus, the study of biological evolution in some sense corresponds to the study of the structure of this giant cell-division tree. Indeed, virtually all models of evolutionary dynamics formulate the dynamics as occurring along the branches of a tree, and many mathematical and computational methods have been developed for inference and modeling of evolutionary dynamics along the branches of a tree, e.g. [2,3].

It is thus natural that the first step in the analysis of a set of related biological sequences is to reconstruct the phylogenetic tree that reflects the cell division history of the sequences, i.e. their ‘ancestral phylogeny’. Once the ancestral relationships between the sequences are known, the evolution of the sequences can then be modeled along the branches of this tree. This strategy has been employed from the earliest days of sequence analysis [4] and is almost invariably applied in the analysis of microbial genome sequences, which is the main topic of this work.

A second key concept in models of evolutionary dynamics is the idea of a ‘population’ of organisms that are mutually competing for resources and that, for purposes of mathematical modeling, can be considered exchangeable in the sense that they are subjected to the same environment. Indeed, populations of exchangeable individuals form the basis of almost all mathematical population genetics models (see e.g. [5]), including coalescent models for phylogenies [6]. Although it is of course well recognized that, in the real world, populations are structured into sub-populations with varying degrees of interaction between them, population genetics models almost by definition assume that at some level there are sub-populations of exchangeable individuals sharing a common environment.

In this paper we present evidence that we believe challenges the usefulness of applying these two concepts for describing genome evolution in prokaryotes. First, we find that for most bacterial species recombination is so frequent that, within an alignment of strains, each genomic locus has been overwritten by recombination many times and the phylogeny typically changes tens of thousands of times along the genome. Moreover, for most pairs of strains, none of the loci in their pairwise alignment derives from their ancestor in the ancestral phylogeny, and the vast majority of genomic differences result from recombination events, even for very close pairs. Consequently, the ancestral phylogeny cannot be reconstructed from the genome sequences using currently available methods and, more generally, the strategy of modeling microbial genome evolution as occurring along the branches of an ancestral phylogeny breaks down.

Second, we show that the structure represented in whole genome phylogenies of microbial strains does not reflect ancestry, but instead the relative rates with which different lineages have recombined in the past. Whereas almost every short genomic segment follows a different phylogeny, these phylogenies are not uniformly randomly sampled from all possible phylogenies, but sampled from highly biased distributions. In particular, the relative frequencies with which particular sub-clades of strains occur in the phylogenies at different loci follow roughly power-law distributions, and each strain has a distinct distribution of co-occurrence frequencies with the other strains. Since each strain has a unique ‘fingerprint’ of recombination rates with the lineages of other strains, the assumption that at some level strains can be considered as exchangeable members of a population also fundamentally breaks down.

The structure of the paper is as follows. To present our analyses, we will focus on a collection of 91 wild *E. coli* strains that were isolated over a short period from a common habitat [7]. After introducing these strains, we introduce the main puzzle of bacterial whole genome phylogeny: although the phylogenies of individual genomic loci are all distinct, the phylogeny inferred from any large collection of genomic loci converges to a common structure, e.g. [8–10]. We first study recombination by studying pairs of strains, extending a recent approach by Dixit et al. [11] to model each pairwise alignment as a mixture of ancestrally inherited and recombined regions. We show that, as the distance to the pair’s common ancestor increases, the fraction of the genome covered by recombined segments increases, and at some pairwise distance all clonally inherited DNA disappears. Importantly, this distance is far below the typical divergence of pairs of strains such that for the vast majority of pairs, none of the DNA in their genome alignment stems from their common ancestor.

Much of the new analysis methodology that we introduce is based on bi-allelic SNPs (which constitute almost all SNPs in the core alignment). Although bi-allelic SNPs have been studied to estimate the number of recombinations along alignments of sexually reproducing species [12], as far as we are aware they have received very little attention in the study of prokaryotic genomes. We show that virtually all bi-allelic SNPs correspond to single mutational events in the history of their genomic locus, so that each SNP provides a bi-partition that occurs in the phylogeny at that locus. We show various ways in which these bi-allelic SNPs can be used to investigate which SNPs are consistent with given phylogenies, and with each other, and use them to quantify the amount of phylogenetic variation along the alignment. We use these SNPs to show that there is no consensus phylogeny, that the phylogeny changes every few SNPs along the core alignment, and to estimate a lower bound on the ratio of recombination to mutation events in a genome alignment.

We then show how these SNPs can also be used to quantify the relative rates with which different lineages share mutations, and show that these rates follow approximately scale free distributions, indicating that there is ‘population structure’ on every scale. Finally, we define entropy profiles of phylogenetic variability of each strain and show that these entropy profiles provide a unique phylogenetic fingerprint of each strain.

We close by showing how all the statistics that we developed for *E. coli* apply to a set of other bacterial species including: *B. subtilis, H. pylori, M. tuberculosis, S. enterica*, and *S. aureus*. We show that, with the exception of *M. tuberculosis* where all strains are very closely related and no pair has yet been fully recombined, all other species follow the same general behavior as *E. coli*. Thus, for almost all bacterial species that we studied, there is no common or consensus phylogeny, but many thousands of different phylogenies along the core genome. These phylogenies are drawn from a distribution with scale-free properties, and each strain has a unique fingerprint of recombination with the others. We feel that these observations necessitate a new way of thinking about how to model genome evolution in prokaryotes.

## Results

To illustrate our methods we focus on the SC1 collection of wild *E. coli* isolates that were collected in 2003 – 2004 near the shore of St. Louis river in Duluth, Minnesota [7]. We sequenced 91 strains from this collection together with the K12 MG1655 lab strain as a reference. In a companion paper [13] we discuss this collection in more detail and extensively analyze the evolution of gene content and phenotypes of this collection. Here we focus on sequence evolution in the core genome of these strains. Although the SC1 strains were collected from a common habitat over a short period of time, they show a remarkable diversity, with no two identical strains, all known major groups of *E. coli* represented, as well as an ‘out group’ of 9 strains that are more than 8% diverged at the nucleotide level from other *E. coli* strains (see Suppl. Fig. S1 for a phylogenetic tree constructed using maximum likelihood on the joint core genome of the SC1 strains and 189 reference strains, [13]).

### Phylogenies of individual loci disagree with the phylogeny of the core genome

To construct a core genome alignment of the SC1 strains and K12 MG1655 we used the REALPHY software [14] (see Methods), resulting in a multiple alignment across all 92 strains that is 2′756′541 base pairs long. REALPHY used PhyML [15] to reconstruct a phylogeny from the core genome alignment, and we will refer to this tree as the *core tree* from here on (Fig. 1).

We first checked to what extent the alignments of individual genomic loci are statistically consistent with the core tree. For each 3 kilobase block of the core alignment we used PhyML to reconstruct a phylogeny and then compared its log-likelihood with the log-likelihood that can be obtained when the phylogeny is constrained to have the topology of the core tree. We find that essentially all 3 Kb alignment blocks reject the core tree (Suppl. Fig. S2, left panel). Moreover, each alignment block rejects the topologies that were constructed from all other alignment blocks (Suppl. Fig. S2, right panel).

Although it thus appears that the phylogeny at each genomic locus is statistically significantly distinct, it is still possible that all these phylogenies are highly similar. In order to quantify the differences between the core tree and the phylogenies of 3Kb blocks we calculated, for each split in the core tree, the fraction of 3Kb blocks for which the same split occurred in the phylogeny reconstructed from that alignment block. As shown in the top row of Fig. 1, the phylogenies of individual blocks differ substantially from the core tree: roughly two-thirds of the splits in the core tree occur in less than half of 3Kb block phylogenies and half of the core tree splits occur in less than a quarter of all 3Kb block phylogenies. Especially the splits higher up in the core tree do not occur in the large majority of block phylogenies.

These observations are not particularly novel. There is by now a vast and sometimes contentious literature on the role of recombination in prokaryotic genome evolution, which is beyond the scope of this article to review. We thus focus on a few key points that are central to the questions and methods we study here. First, systematic studies of complete microbial genomes have shown that horizontal gene transfer is relatively common and can significantly affect phylogenies of individual loci, e.g. [16, 17]. Such observations caused some researchers to question whether trees can be meaningfully used to describe genome evolution [18]. However, many researchers in the field feel that, given careful study, a major phylogenetic backbone can be extracted from genomic data. For example, it has been observed that, whenever a phylogeny is reconstructed from the alignments of a large number of genomic loci, one obtains the same or highly similar phylogenies, e.g. [8–10]. We also observe this behavior for our strains. Phylogenies reconstructed from a random sample of 50% of all 3Kb blocks look highly similar to the core tree, i.e. with two thirds of the core tree’s splits occurring in *all* phylogenies (Fig. 1, bottom row).

How should we interpret this convergence of phylogenies to the core phylogeny as increasing numbers of genomic loci are included? One interpretation, proposed by some researchers, is that once a large number of genomic segments is considered, effects of horizontal transfer are effectively averaged out, and the phylogeny that emerges corresponds to the clonal ancestry of the strains, e.g. [8,19]. Based on this idea, several tools have been developed that detect recombination events by comparing local phylogenies with the overall reference phylogeny constructed from the entire genome [20,21]. In contrast, some recent studies have argued that recombination is so common in some bacterial species that it is impossible to meaningfully reconstruct the clonal ancestry from the genome sequences, and that these species should be considered freely recombining, e.g. [22]. However, if members of the species are freely recombining, one would expect the core tree to take on a star-like structure as opposed to the clear and consistent phylogenetic structure that phylogenies converge to as more genomic regions are included in the analysis. Addressing this puzzle is one of the topics of this work.

### Quantifying recombination through analysis of pairs of strains

As a first analysis of the impact of recombination, we follow an approach recently proposed by Dixit et al. based on the pairwise comparison of strains [11]. The simplest measure of the distance between a pair of strains is their nucleotide divergence, i.e. the fraction of mismatching nucleotides in the core genome alignment of the two strains. For pairs of strains with very low divergence, e.g. D6 and F2 at 4 × 10^{−4} divergence (Fig. 2A), the effects of recombination are almost directly visible in the pattern of SNP density along the genome. While the SNP density is very low along most of the genome, i.e. 0 – 2 SNPs per kilobase, there are a few segments, typically tens of kilobases long, where the SNP density is much higher and similar to the typical SNP density between random pairs of *E. coli* strains, i.e. 10 – 30 SNPs per kilobase. These high SNP density regions almost certainly result from horizontal transfer events in which a segment of DNA from another *E. coli* strain, for example carried by a phage, made it into one of the ancestor cells of this pair, and was incorporated into the genome through homologous recombination. For pairs of increasing divergence, e.g. the pair C10-D7 with divergence 0.002 in Fig. 2B, the frequency of these recombined regions increases, until eventually the majority of the genome is covered by such regions (pair D6-H10 in Fig. 2C).
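The two quantities used in this pairwise analysis, the nucleotide divergence and the SNP density per block, can be sketched as follows. This is a minimal illustration and not the pipeline used in the paper: the function names, the toy sequences, and the choice of non-overlapping blocks are assumptions, and the sequences are assumed to be aligned and gap-free.

```python
def pairwise_divergence(seq_a, seq_b):
    """Fraction of mismatching positions between two aligned, gap-free sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def snp_counts_per_block(seq_a, seq_b, block_size=1000):
    """Number of mismatches in consecutive, non-overlapping blocks.

    A trailing partial block is ignored (an illustrative choice).
    """
    counts = []
    for start in range(0, len(seq_a) - block_size + 1, block_size):
        block_a = seq_a[start:start + block_size]
        block_b = seq_b[start:start + block_size]
        counts.append(sum(a != b for a, b in zip(block_a, block_b)))
    return counts


# Toy example with a block size of 4 instead of 1 kilobase.
toy_a = "AAAAAAAACCCC"
toy_b = "AAAAAAATCGGC"
toy_divergence = pairwise_divergence(toy_a, toy_b)
toy_blocks = snp_counts_per_block(toy_a, toy_b, block_size=4)
```

With real data and `block_size=1000`, the list of per-block counts is the raw material for SNP density profiles and histograms of the kind shown in Fig. 2.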

For close pairs, the histograms of SNP densities also clearly separate into two components: a majority of clonally inherited regions with up to at most 3 SNPs per kilobase, and a long tail of recombined regions with up to 50 or 60 SNPs per kilobase (Fig. 2D-E). As explained in the methods, we can accurately model the distributions of SNP densities as a mixture of a Poisson distribution for the clonally inherited regions plus a negative binomial for the recombined regions (solid line fits in Fig. 2D-F). In this way we can estimate, for each pair of strains, the fraction of the genome that is clonally inherited, and the number of SNPs that fall in clonally inherited versus recombined regions. Using a Hidden Markov model on close pairs, we also estimated the distribution of lengths of recombined regions (see Methods), finding that recombined blocks are typically in the range of 10 – 70 kilobases long (Fig. 2J).
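To make the mixture idea concrete, the sketch below evaluates a two-component mixture of a Poisson and a negative binomial distribution over the number of SNPs per block, and the resulting posterior probability that a block is clonally inherited. The parameterization of the negative binomial (shape `r`, success probability `p`), the function names, and all parameter values are assumptions chosen for illustration; the actual fitting procedure is described in the Methods and is not reproduced here.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of k SNPs in a block with mean lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def negbin_pmf(k, r, p):
    """Negative binomial probability of k SNPs (shape r, success probability p)."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)


def mixture_pmf(k, w_clonal, lam, r, p):
    """Mixture: clonal blocks are Poisson, recombined blocks negative binomial."""
    return w_clonal * poisson_pmf(k, lam) + (1 - w_clonal) * negbin_pmf(k, r, p)


def posterior_clonal(k, w_clonal, lam, r, p):
    """Posterior probability that a block with k SNPs is clonally inherited."""
    clonal = w_clonal * poisson_pmf(k, lam)
    recombined = (1 - w_clonal) * negbin_pmf(k, r, p)
    return clonal / (clonal + recombined)


# Hypothetical parameters: 90% clonal blocks with 0.5 SNPs per block on average,
# recombined blocks with a mean of r * (1 - p) / p = 18 SNPs per block.
params = dict(w_clonal=0.9, lam=0.5, r=2.0, p=0.1)
total_probability = sum(mixture_pmf(k, **params) for k in range(500))
```

Under these illustrative parameters, blocks without SNPs are assigned almost entirely to the clonal component, whereas blocks with tens of SNPs are assigned to the recombined component, which mirrors the two-component structure of the histograms.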

From this analysis we see that, whenever the pairwise divergence is less than 0.001, the large majority of blocks is clonally inherited, which is indicated as the light-green segment in Fig. 2G. However, over a narrow range of divergence between 0.001 and 0.01 the fraction of clonally inherited DNA drops dramatically (yellow segment in Fig. 2G) and at a divergence of about 0.014 essentially the entire alignment has been overwritten by recombination and all clonally inherited DNA is lost (blue segment in Fig. 2G). Notably, 80% of all pairs of strains lie in this fully recombined regime (Fig. 2I). Thus, for the large majority of pairs of strains, none of the DNA in their alignment derives from their clonal ancestor, making it impossible to estimate the distance to their clonal ancestor from comparing their DNA. Moreover, as shown in Fig. 2H, even for pairs that are so close that most of their genomes are clonally inherited, the large majority of the SNPs derives from the recombined regions.

For later comparison with the data on other species, we summarize our observations from the pairwise analysis by a few key statistics. First, half of the genome is recombined at a *critical divergence* of 0.0032. Second, at this critical divergence, the fraction of all SNPs that is in recombined regions is 0.95. Third, the fraction of mostly clonal pairs is 0.077, and finally, the fraction of fully recombined pairs is 0.78 (see Methods). All these statistics suggest that pairwise divergences between strains are almost entirely driven by recombination and do not reflect distances to their clonal ancestors. To understand how a consistent phylogenetic structure can still emerge when the full core genomes of all strains are compared, we need to go beyond studying pairs.

### SNPs in the core genome alignment correspond to splits in the local phylogeny

Whereas there may not be a single phylogeny that captures the evolution of our genomes, we will assume that each *single position* in the core genome alignment, i.e. each alignment column, has evolved according to some phylogenetic tree. A key insight is that our set of strains is sufficiently closely related that, for almost all of these alignment columns, the number of substitutions that have occurred in their evolutionary history is either zero or one. In particular, of the 2′457′464 columns in the core genome alignment, only 10.85% are polymorphic. Moreover, almost all of these SNP columns are bi-allelic, i.e. for 93.6% of the SNPs only 2 nucleotides appear, 6.3% have 3 nucleotides, and in 0.2% all 4 nucleotides occur, suggesting that most positions have not undergone any substitutions, and that columns with multiple substitutions are rare. Notably, these statistics are still inflated due to the occurrence of an outgroup of 9 strains that is far removed from the other strains (the clade from B5 to B3 visible on the right in Fig. 1). We observe that almost 36% of all SNPs correspond to SNPs in which all 9 strains of this outgroup have one nucleotide, and all other 83 strains have another nucleotide. If we remove the outgroup from our alignment, the fraction of SNP positions in the alignment drops from 10.85% to 6.7%, and the fraction of SNPs that are bi-allelic increases to 95.5%.
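The column statistics quoted above amount to counting, for each alignment column, how many distinct nucleotides occur. A minimal sketch, assuming a gap-free alignment given as a list of equal-length strings (the function name and the toy alignment are illustrative and not taken from the paper):

```python
from collections import Counter


def allele_count_spectrum(alignment):
    """Count alignment columns by their number of distinct nucleotides (1 to 4)."""
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all sequences must have the same length")
    spectrum = Counter()
    for column in zip(*alignment):  # iterate over columns, one tuple per position
        spectrum[len(set(column))] += 1
    return spectrum


# Toy alignment of 4 strains and 5 columns.
toy_alignment = ["ACGTA", "ACGAA", "ATGCA", "ATGGA"]
spectrum = allele_count_spectrum(toy_alignment)
n_columns = sum(spectrum.values())
polymorphic_fraction = (n_columns - spectrum[1]) / n_columns
biallelic_fraction = spectrum[2] / (n_columns - spectrum[1])
```

Here `spectrum[1]` counts invariant columns, `spectrum[2]` counts bi-allelic SNPs, and the two derived fractions correspond to the fraction of polymorphic columns and the fraction of SNPs that are bi-allelic.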

We analyzed the frequencies of columns with 1, 2, 3 and 4 different nucleotides that are expected under a simple substitution model, separately analyzing positions that are under least selection (third positions of 4-fold degenerate codons) and positions under most selection (second positions in codons), and either including or excluding the outgroup (see Methods). These analyses indicate that around 98% of all bi-allelic SNP columns correspond to columns in which only a single substitution took place.

Since almost all bi-allelic SNPs correspond to a single substitution, each such SNP provides an important piece of information about the phylogeny at that position in the alignment: whatever this phylogeny is, it must contain a split, i.e. a branch bipartitioning the set of strains, such that all strains with one letter occur on one side of the split, and all strains with the other letter on the other side (Fig. 3).

As illustrated in Fig. 3, pairs of SNPs can either be consistent with a common phylogeny, i.e. columns *X* and *Y* or columns *Y* and *Z*, or they can be inconsistent with a common phylogeny, i.e. columns *X* and *Z*. The pairwise comparison of SNP columns for consistency with a common phylogeny is known as the four-gamete test in the literature on sexual species [12] but has so far rarely been used for quantifying recombination in bacteria (see [23] for the only exception we are aware of). In the rest of this paper we show how analysis of bi-allelic SNPs (which from now on we will just call SNPs) can be systematically used to quantify recombination in bacterial species.
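The four-gamete test itself is short to state in code: two bi-allelic columns are consistent with a common phylogeny exactly when fewer than four of the four possible allele combinations occur across the strains. The sketch below uses hypothetical toy columns named `X`, `Y`, and `Z` that reproduce the pattern of consistency described above; they are not the actual columns of Fig. 3.

```python
def four_gamete_compatible(col_x, col_y):
    """Four-gamete test for two bi-allelic SNP columns over the same strains.

    Each column is a string with one nucleotide per strain. The columns are
    compatible with a common tree if fewer than 4 distinct allele pairs occur.
    """
    if len(col_x) != len(col_y):
        raise ValueError("columns must cover the same strains")
    gametes = set(zip(col_x, col_y))
    return len(gametes) < 4


# Hypothetical columns over 5 strains.
X = "CCTTT"
Y = "AAAGG"
Z = "GAGAA"
xy = four_gamete_compatible(X, Y)  # splits {1,2}|{3,4,5} and {1,2,3}|{4,5}
yz = four_gamete_compatible(Y, Z)  # splits {1,2,3}|{4,5} and {1,3}|{2,4,5}
xz = four_gamete_compatible(X, Z)  # splits {1,2}|{3,4,5} and {1,3}|{2,4,5}
```

In this toy example `X` and `Y` are compatible, `Y` and `Z` are compatible, and `X` and `Z` are not, so no single tree can contain all three splits.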

### SNP statistics are inconsistent with a single consensus phylogeny

As a first test, we investigated to what extent the SNPs support the branches in the core tree. Since each branch in the core tree corresponds to a split, we calculated what fraction of SNPs correspond to a branch in the core tree, and what fraction are inconsistent with the core tree. Overall, 58% of the SNPs that are shared by at least 2 strains correspond to a branch of the core tree, whereas 42% clash with it (SNPs that occur in only a single strain are consistent with any phylogeny). However, this relatively high fraction results almost entirely from SNPs on the single branch connecting the outgroup to the other strains, which is responsible for almost 36% of all SNPs. When the outgroup is removed, only 27.4% of all SNPs are consistent with the core tree. Since the core tree was constructed using a maximum likelihood approach that assumes the entire alignment follows one common tree, we investigated to what extent the number of tree supporting SNPs can be improved by specifically constructing a tree to maximize the number of supporting SNPs (see Methods). However, such trees only marginally improve the number of supporting SNPs by 0.1%.

To assess the extent to which SNPs are consistent with individual branches of the core tree we counted, for each branch, the number of supporting SNPs *S* that match the split, and the number of clashing SNPs *C* that are inconsistent with the split, to calculate the fraction *f* = *S*/(*S* + *C*) of SNPs supporting the branch. Figure 4 shows that, for two-thirds of the branches, there are more clashing than supporting SNPs. Moreover, for as many as half of the branches in the core tree, the fraction of supporting SNPs is less than 5%, i.e. there are 20-fold more clashing than supporting SNP columns.
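The per-branch statistic *f* = *S*/(*S* + *C*) can be sketched by representing both the branch and each SNP as a split of the strain set. In this minimal illustration (the names, the set representation, and the toy data are assumptions), a SNP supports the branch if it induces the same split, clashes if all four intersections of the two splits are non-empty, and is otherwise compatible with the branch without supporting it.

```python
def classify_snp(branch, snp, all_strains):
    """Label a SNP split relative to a branch split: 'support', 'clash' or 'neutral'.

    branch and snp are the sets of strains on one side of each split.
    """
    all_strains = frozenset(all_strains)
    branch, snp = frozenset(branch), frozenset(snp)
    branch_sides = (branch, all_strains - branch)
    snp_sides = (snp, all_strains - snp)
    if snp in branch_sides:
        return "support"
    # Two splits are incompatible iff all four pairwise intersections are non-empty.
    if all(b & s for b in branch_sides for s in snp_sides):
        return "clash"
    return "neutral"


def branch_support(branch, snp_splits, all_strains):
    """Return (S, C, f) with f = S / (S + C), or f = None if S + C == 0."""
    labels = [classify_snp(branch, snp, all_strains) for snp in snp_splits]
    n_support = labels.count("support")
    n_clash = labels.count("clash")
    total = n_support + n_clash
    return n_support, n_clash, (n_support / total if total else None)


# Toy example: 5 strains, a branch separating strains {0, 1} from the rest.
strains = {0, 1, 2, 3, 4}
toy_snps = [{0, 1}, {2, 3, 4}, {1, 2}, {3, 4}, {0, 2}]
S, C, f = branch_support({0, 1}, toy_snps, strains)
```

In the toy data two SNPs match the branch (one given by its complement), two clash with it, and the split {3, 4} is compatible but uninformative about this branch, so it enters neither *S* nor *C*.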

Besides the branch to the outgroup, the only branches for which supporting SNPs outnumber clashing SNPs are branches toward groups of highly similar strains near the bottom of the tree. We thus wondered if it would be possible to construct well supported subtrees for clades of closely-related strains near the bottom of the tree. We devised a method that builds subtrees bottom-up by iteratively fusing clades so as to minimize the number of clashing SNPs at each step (see Methods and Suppl. Fig. S3, left panel). As shown in Suppl. Fig. S3, while the fraction of clashing SNPs is initially low, it rises quickly as soon as the average divergence within the reconstructed subtrees exceeds 10^{−4}, which is more than 100-fold below the typical pairwise distance between *E. coli* strains. Thus, while some groups of very closely-related strains can be unambiguously identified, only a minute fraction of sequence divergence falls within these groups, and the bulk of the sequence variation between the strains is not consistent with a single phylogeny.

It is also conceivable that there is a single dominant phylogeny for most strains, but that this is concealed from view when analyzing the full alignment because of a subset of strains with aberrant behavior. To investigate this, we focused on the smallest subsets of strains that have meaningfully different phylogenetic tree topologies. For a quartet of strains (*I*, *J*, *M*, *N*), there are 3 possible binary trees, i.e. with (*I*, *J*) and (*M*, *N*) as nearest neighbors, with (*I*, *M*) and (*J*, *N*), or with (*I*, *N*) and (*J*, *M*) (see Suppl. Fig. S4). We selected quartets of roughly equidistant strains and checked, for each quartet, whether the SNPs clearly supported one of the three possible topologies. However, we find that alternative topologies are always supported by a substantial fraction of the SNPs, and that for most quartets the most supported topology is supported by less than half of the SNPs (Suppl. Fig. S4).
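For a quartet, only SNPs in which two strains carry one nucleotide and the other two carry another are informative, and each such SNP supports exactly one of the three topologies. A minimal sketch of this tally, assuming SNP columns are strings indexed by strain position (the function name, topology labels, and toy columns are illustrative):

```python
from collections import Counter


def quartet_support(columns, quartet):
    """Count SNPs supporting each of the 3 topologies of a quartet (i, j, m, n)."""
    i, j, m, n = quartet
    support = Counter({"IJ|MN": 0, "IM|JN": 0, "IN|JM": 0})
    for column in columns:
        a, b, c, d = column[i], column[j], column[m], column[n]
        if len({a, b, c, d}) != 2:
            continue  # invariant or multi-allelic within the quartet
        if a == b and c == d:
            support["IJ|MN"] += 1
        elif a == c and b == d:
            support["IM|JN"] += 1
        elif a == d and b == c:
            support["IN|JM"] += 1
        # otherwise a 3-versus-1 pattern, which fits every topology
    return support


# Toy columns over 4 strains.
toy_columns = ["AATT", "ATAT", "ATTA", "AAAT", "AATT"]
toy_support = quartet_support(toy_columns, (0, 1, 2, 3))
best_fraction = max(toy_support.values()) / sum(toy_support.values())
```

The quantity `best_fraction` corresponds to the fraction of informative SNPs that support the most supported topology, which is the statistic discussed for the quartets above.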

Thus, consistent with our analysis of pairs of strains, all these results show that the core tree does not capture the sequence relationships between the strains. In fact, rather than a single phylogeny representing the evolutionary relationships between the strains, the SNP data suggest a large number of different phylogenies across the core genome alignment. It may thus seem all the more puzzling that, when trees are constructed from sufficiently many genomic loci, the core tree reliably emerges (Fig. 1, bottom). To underscore this puzzle, we observed that if we remove all SNP columns from the core genome alignment that correspond to branches of the core tree, and then reconstruct a phylogeny from this edited alignment, the resulting tree is still highly similar to the core tree (Suppl. Fig. S5). However, almost *all* SNPs of this edited alignment clash with the tree that is reconstructed from it. Thus, the core tree reconstructed from an alignment does not need to match the phylogeny of any of the genomic loci. Rather, the core tree represents some sort of *average* of the distribution of phylogenies across the genome. Note that, whenever a quantity *x* has a multi-modal distribution, it can easily occur that there is almost no probability for any sample of *x* to occur near its average 〈*x*〉. Similarly, the actual phylogenies occurring across the core genome alignment may all be very different from the global ‘average’ phylogeny that the core tree represents.

### Phylogeny changes every few dozens of base pairs along the core alignment

So far we have analyzed SNP consistency without regard to their relative positions. We now analyze to what extent mutually consistent SNPs are clustered along the alignment. In particular, we calculate the lengths of segments along the alignment that are consistent with a single phylogeny.

We first assessed the length-scale over which phylogenies are correlated by calculating a standard linkage measure as a function of distance along the alignment (Fig. 5A and Methods). Linkage drops quickly over the first 100 base pairs and becomes approximately constant at distances beyond 200 – 300 base pairs, indicating that segments of correlated phylogenies are much shorter than the typical length of a gene. Very short linkage profiles were recently also observed in thermophilic cyanobacteria isolated in Yellowstone National Park [22].

We next determined the lengths of segments consistent with a common phylogeny. Starting from each SNP *s*, we determined the number of consecutive SNPs *n* that are all mutually consistent with a common phylogeny. As shown in Fig. 5B, the distribution of tree-compatible stretches has a mode at *n* = 4, and stretches are very rarely longer than 20 consecutive SNPs. In terms of number of base pairs along the genome, tree-compatible segments are typically just a few tens of base pairs long, and very rarely more than 300 base pairs (Fig. 5C). Thus, stretches of tree-compatible segments are very short. For comparison, we also calculated the distribution of tree-compatible segment lengths in an alignment where the positions of all columns have been completely randomized and observe that these are still a bit shorter (blue distributions in Fig. 5). Thus, while there is some evidence that neighboring SNPs are more likely to be compatible than random pairs of SNPs, this compatibility is lost very quickly, typically within a handful of SNPs.
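The run-length statistic can be sketched as follows. For bi-allelic SNPs, a set of splits fits on a single tree exactly when all pairs of splits are compatible, so it suffices to extend a run until the next SNP clashes with any SNP already in the run. The function names and toy columns are illustrative, and counting the starting SNP itself as part of *n* is an assumption about the convention used.

```python
def compatible(col_x, col_y):
    """Four-gamete test: bi-allelic columns are compatible if < 4 allele pairs occur."""
    return len(set(zip(col_x, col_y))) < 4


def compatible_run_lengths(columns):
    """For each SNP, the number of consecutive mutually compatible SNPs starting there.

    columns: bi-allelic SNP columns in alignment order, one string per SNP.
    The starting SNP is included in the count (an assumed convention).
    """
    runs = []
    for start in range(len(columns)):
        end = start + 1
        while end < len(columns) and all(
            compatible(columns[end], columns[k]) for k in range(start, end)
        ):
            end += 1
        runs.append(end - start)
    return runs


# Hypothetical columns over 5 strains; the first and third clash with each other.
toy_columns = ["CCTTT", "AAAGG", "GAGAA"]
toy_runs = compatible_run_lengths(toy_columns)
```

Converting run lengths to base pairs only requires the alignment positions of the first and last SNP in each run, and the randomized control corresponds to shuffling `columns` before calling the function.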

### A lower bound on the ratio of recombination to substitution events

Every time inconsistent SNP columns are encountered as one moves along the core genome alignment, the local phylogeny must change. For example, somewhere between columns *X* and *Z* in Fig. 3 the phylogeny must change. This in turn implies that at least one recombination event must occur between columns *X* and *Z*. By going along the core genome, and determining the minimum number of times the phylogeny must change, one can thus derive a lower bound on the total number of recombination events [12] (see Methods). Using this we find that the phylogeny must change at least *C* = 43′575 times along the core genome alignment, i.e. there are at least *C* recombination events. If we denote by *R* the true total number of recombination events, then we can write *C* = *Rf*, where *f* ≤ 1 can be thought of as the fraction of recombination events that are detected by SNP inconsistencies in the alignment. As we argued previously, almost all SNP columns correspond to a single substitution event, such that the total number of SNP columns *M* is a good estimate of the total number of substitutions in the alignment. Consequently, the ratio of phylogeny changes *C* to SNPs *M* provides a lower bound on the ratio of recombinations to mutations in the alignment, i.e.

$$\frac{R}{M} = \frac{C}{fM} \geq \frac{C}{M}.$$

Figure 6 shows the ratio *C/M* for random subsets of our 92 strains as a function of the number of strains in the subset.

We see that, for small subsets of strains, the recombination to mutation ratio *C/M* shows substantial fluctuations. For example, for subsets of *n* = 10 strains, the recombination to mutation ratio *C/M* ranges from 0.036 to 0.167, with a median of 0.1. However, as the number of strains in the subset increases, the recombination to mutation ratio converges to a value of *C/M* ≈ 0.155. In particular, whenever there is a substantial fraction of the strains, i.e. *n* ≥ 50, the ratio *C/M* is highly consistent across the subsets. Thus, the ratio *C/M* gives a highly informative summary statistic of the relative rate of recombination to mutation events along the alignment.
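One way to obtain such a lower bound is a greedy left-to-right scan: keep extending the current segment while the next SNP is compatible with every SNP in it, and count a phylogeny change whenever a clash forces a new segment to start. Because any sub-segment of a tree-compatible segment is itself tree-compatible, this greedy scan yields the minimum number of breakpoints for a partition into tree-compatible segments. The sketch below is an illustration under these assumptions and is not necessarily the procedure described in the Methods; the names and toy columns are hypothetical.

```python
def compatible(col_x, col_y):
    """Four-gamete test: bi-allelic columns are compatible if < 4 allele pairs occur."""
    return len(set(zip(col_x, col_y))) < 4


def min_phylogeny_changes(columns):
    """Minimum number of phylogeny changes along an ordered list of bi-allelic SNPs."""
    changes = 0
    segment = []  # SNP columns of the current tree-compatible segment
    for column in columns:
        if all(compatible(column, other) for other in segment):
            segment.append(column)
        else:
            changes += 1        # a clash: the phylogeny must change before this SNP
            segment = [column]  # start a new segment at the clashing SNP
    return changes


# Hypothetical SNP columns over 5 strains, in alignment order.
toy_columns = ["CCTTT", "AAAGG", "GAGAA", "CCTTT"]
C_toy = min_phylogeny_changes(toy_columns)
M_toy = len(toy_columns)
ratio_toy = C_toy / M_toy
```

Applying the same function to the SNP columns of random subsets of strains would give curves of the ratio *C/M* against subset size of the kind summarized above.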

These results confirm that, also on the level of the entire alignment, the strains are in a regime where each position has been affected by recombination. For example, given the ratio *C/M* = 0.155, and the overall SNP rate of 0.1085, the average length of alignment segments between changes in phylogeny is 1/(0.155 × 0.1085) ≈ 59.4 base pairs. From the analysis of close pairs we saw that the typical length of a recombined segment is about 20′000 base pairs (Fig. 2J). Thus, as an order of magnitude estimate, a given position in the genome has been overwritten roughly 20′000/59.4 ≈ 337 times by recombination events. Moreover, since we only detect a fraction of the phylogeny changes across the alignment, the true number of times each locus has been overwritten by recombination is likely considerably higher.
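The order-of-magnitude estimate can be retraced in a few lines. The three input values are the ones quoted in the text; the variable names are illustrative.

```python
# Values quoted in the text.
c_over_m = 0.155                    # phylogeny changes per SNP column (C/M)
snp_rate = 0.1085                   # SNP columns per alignment column
recombined_segment_length = 20_000  # typical recombined segment, in base pairs

# Phylogeny changes per base pair, and the mean distance between changes.
changes_per_bp = c_over_m * snp_rate
mean_segment_length = 1 / changes_per_bp

# Each recombined segment spans this many phylogeny changes, which serves as a
# rough estimate of how often a given position has been overwritten.
overwrites_per_position = recombined_segment_length / mean_segment_length
```

Without intermediate rounding this gives a mean segment length between 59 and 60 base pairs and a few hundred overwrites per position, in line with the figures quoted above.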

Although it is tempting to interpret the ratio *C/M* as an estimate of the relative rates of recombination and mutation in the evolution of the strains, this would require defining a specific evolutionary model, and even for simple models, e.g. a Kingman coalescent with a fixed rate of recombination [24], the relationship between the ratio of recombination and mutation rates and the observed ratio *C/M* would be nontrivial. Moreover, we will see below that the data in fact suggest that recombination rates vary over a wide range across lineages.

### Recombination rates across lineages follow scale-free distributions

The analyses above have shown that the core alignment consists of tens of thousands of short segments with different phylogenies. Thus, one approximate way of thinking about the core genome alignment is that the phylogeny at every genomic locus is drawn from some distribution of possible phylogenies. In a freely recombining population, every strain would be equally likely to recombine with any other, leading to a uniform distribution over all possible phylogenies. However, under such a model each pair of strains would become approximately equidistant and phylogenies built from large numbers of genomic loci would take a star shape, which is clearly at odds with our observations. This suggests that the phylogenies along the genome are drawn from a highly non-uniform distribution in which some lineages are more likely to have recombined recently than others.

The distribution of observed SNP types in fact contains extensive information about the relative frequencies with which different lineages have recombined at different times in the past. For example, imagine a SNP where two strains share a nucleotide which differs from the nucleotide that all other strains possess. We will denote such SNPs as 2-SNPs or pair-SNPs. If, at some genomic locus *g*, we find a 2-SNP shared by strains *s*_{1} and *s*_{2}, then it follows that, whatever the phylogeny is at locus *g*, the strains *s*_{1} and *s*_{2} must be nearest neighbors in the tree, and the SNP corresponds to a mutation that occurred on the branch connecting the ancestor of *s*_{1} and *s*_{2} to all other strains.

Thus, to quantify to what extent the lineage of a strain *s* has recently recombined with the lineages of the other strains, we can extract all 2-SNPs in which *s* shares a letter with one other strain *s*′ and compare their frequencies. For example, Fig. 7A graphically shows the frequencies of all pair-SNPs (A1, *s*) in which A1 shares a SNP with one other strain *s*. Note that, if there was a dominant clonal phylogeny, then A1 should essentially only have 2-SNPs with its nearest neighbor in this dominant phylogeny. However, we see that A1 shares 2-SNPs with 17 of the 92 strains in our collection. If, on the other hand, A1 were freely recombining with all other strains, then we would expect roughly equal frequencies of all possible 2-SNPs (A1, *s*). However, we see that A1 shares 2-SNPs with some strains much more often than with others. For example, whereas 2-SNPs with strains A2, A11, and D8 are the most frequent and occur almost 200 times each, for 11 of the 17 strains the number of occurrences is 10 or less, and for 4 strains a 2-SNP with A1 is observed only once.
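To make the definition of *n*-SNP types concrete, the following minimal Python sketch counts them in a toy alignment. It assumes the alignment is given as a dictionary from strain name to an ungapped sequence of equal length, and it considers only bi-allelic columns. The function names `nsnp_types` and `pair_snps_with`, the tie-breaking rule for evenly split columns, and the toy sequences are hypothetical and not taken from the analysis pipeline.

```python
from collections import Counter

def nsnp_types(alignment):
    """Count bi-allelic SNP types in {strain: sequence}.

    Returns a Counter mapping frozenset(strains carrying the minority
    nucleotide) -> number of alignment columns with that split.
    """
    strains = sorted(alignment)
    length = len(alignment[strains[0]])
    counts = Counter()
    for pos in range(length):
        groups = {}
        for s in strains:
            groups.setdefault(alignment[s][pos], []).append(s)
        if len(groups) != 2:
            continue  # skip invariant and multi-allelic columns
        a, b = groups.values()
        # Minority group; ties are broken by strain names (an arbitrary choice).
        minority = min(a, b, key=lambda g: (len(g), g))
        counts[frozenset(minority)] += 1
    return counts

def pair_snps_with(counts, strain):
    """Frequencies of all 2-SNPs that `strain` shares with one other strain."""
    return {
        next(iter(group - {strain})): c
        for group, c in counts.items()
        if len(group) == 2 and strain in group
    }

# Toy alignment (hypothetical data): five strains, five columns.
toy = {
    "A1": "AACGT",
    "A2": "AACGA",
    "B1": "CACTA",
    "B2": "CGCTA",
    "C1": "CACTA",
}
toy_counts = nsnp_types(toy)
print(pair_snps_with(toy_counts, "A1"))
```

In this toy example, columns 1 and 4 are both 2-SNPs shared by A1 and A2, so the pair (A1, A2) is counted twice, while the remaining SNP columns are 1-SNPs.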

Figure 7B shows a graph representation of all observed pair-SNPs, with the thickness of the edges proportional to the logarithm of the frequency of occurrence of the 2-SNP type. We see that each strain is connected through 2-SNPs to a substantial number of other strains, indicating a high diversity of recent recombination events across the strains. At the same time, the large variability in the thickness of the edges indicates that some pairs occur much more frequently than others. Figure 7C shows the reverse cumulative distribution of the frequencies of all observed 2-SNPs, i.e. the distribution of the thickness of the edges in Fig. 7B (blue dots). Note that, if the strains were to recombine freely, each 2-SNP would be equally likely to occur, and the distribution of 2-SNPs would be peaked around a typical number of occurrences per type. Instead, we see that 2-SNP frequencies *f* vary over more than 3 orders of magnitude, i.e. from an occurrence of just *f* = 1 for many 2-SNPs to *f* = 2965 occurrences for the most common 2-SNP. Moreover, the reverse cumulative distribution of 2-SNP frequencies follows an approximately straight line in a log-log plot. In other words, the distribution of frequencies *P*(*f*) is approximately power-law, i.e. *P*(*f*) ∝ *f*^{−α}. Fitting the 2-SNP data to a power-law (see Methods) we find that the exponent equals approximately *α* ≈ 1.41 (blue line in Fig. 7C). Importantly, this means that there is no clear most common 2-SNP partner for each strain, and that one cannot naturally divide 2-SNPs into common and rare types. Instead, the distribution of 2-SNP frequencies is approximately scale-free.
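As an illustration of these two quantities, the following Python sketch computes a reverse cumulative distribution from a list of *n*-SNP frequencies and estimates a power-law exponent by maximum likelihood. This is not the fitting procedure referred to in the Methods. It assumes a pure discrete power-law *P*(*f*) = *f*^{−α}/ζ(*α*) on *f* = 1, 2, …, approximates the zeta function by a partial sum plus an integral tail, and scans a grid of exponents. All function names are hypothetical.

```python
import math

def reverse_cumulative(freqs):
    """Return [(f, fraction of SNP types with frequency >= f)] for each observed f."""
    n = len(freqs)
    ordered = sorted(freqs)
    out, idx = [], 0
    for value in sorted(set(freqs)):
        while idx < n and ordered[idx] < value:
            idx += 1
        out.append((value, (n - idx) / n))
    return out

def zeta(alpha, terms=1000):
    """Approximate zeta(alpha) for alpha > 1: partial sum plus integral tail."""
    head = sum(k ** -alpha for k in range(1, terms + 1))
    tail = (terms + 0.5) ** (1.0 - alpha) / (alpha - 1.0)
    return head + tail

def fit_power_law(freqs, grid=None):
    """Grid-search maximum-likelihood exponent under the assumed discrete power-law."""
    grid = grid or [1.01 + 0.01 * i for i in range(300)]
    n = len(freqs)
    sum_log = sum(math.log(f) for f in freqs)

    def log_likelihood(alpha):
        return -alpha * sum_log - n * math.log(zeta(alpha))

    return max(grid, key=log_likelihood)

# Synthetic frequencies (hypothetical data) with counts proportional to f^-2.
synthetic = [f for f in range(1, 101) for _ in range(round(10000 / f**2))]
print(fit_power_law(synthetic))
print(reverse_cumulative([1, 1, 2, 5]))
```

On a log-log plot of the output of `reverse_cumulative`, a power-law with exponent *α* appears as an approximately straight line with slope 1 − *α*.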

Beyond SNPs shared by pairs of strains, we can of course also look at SNPs shared by triplets, quartets, and so on. Besides the distribution of 2-SNP frequencies, Fig. 7C also shows the reverse cumulative distributions of 3-SNPs (orange dots), 4-SNPs (green dots), and 12-SNPs (red dots). We see that all these distributions follow approximately straight lines in a log-log plot and can be fitted with power-law distributions (solid lines). The *n*-SNP distributions drop more steeply as *n* increases. Figure 7D shows the exponent α of the *n*-SNP distribution as a function of *n*, showing that the exponents range from *α* ≈ 1.25 for singlets, i.e. *n* = 1, to *α* ≈ 2.8 for *n* ≥ 20.

We find that essentially all *n*-SNP distributions are approximately scale-free, i.e. can be fitted with power-laws. Thus, while some subgroups of *n* strains share a common ancestor much more often than other subgroups of *n* strains, their frequencies fall along a scale-free continuum, so that there is no natural way of dividing the strains into groups of ‘highly recombining’ clades. Note also that each *n*-SNP corresponds to a mutation that occurred in the branch leading to the ancestor of a group of *n* strains. Therefore, *n*-SNPs for larger *n* typically correspond to mutational events that occurred further back in time. The fact that *n*-SNP distributions become steeper as *n* increases means that the average number of occurrences per *n*-SNP decreases as *n* increases. Thus, the diversity of *n*-SNPs tends to be larger further back in time (see Suppl. Fig. S6).

### Phylogenetic entropy profiles of individual strains

Another way to think about the structure evident in the *n*-SNP distributions is to quantify, for each strain *s*, how diverse the phylogenies are that *s* occurs in at different *n*. In particular, for a given *n*, all *n*-SNP types in which *s* is one of the strains sharing the minority nucleotide, are all mutually inconsistent with a common tree. For example, if the strain *s* occurs in 10 different quartets of strains, i.e. in 10 different 4-SNP types, then each of these 10 quartets must correspond to different phylogenies and the diversity of quartets among which *s* occurs can be quantified by the entropy of the frequency distribution of 4-SNP types in which *s* occurs. That is, for each strain *s*, and each *n*, we can extract all *n*-SNPs in which *s* occurs and calculate their relative frequencies, and then summarize the diversity of phylogenies by the *entropy* of this distribution across *n*-SNPs. In this way, for each strain *s* we can calculate an entropy profile *H*_{s}(*n*) that contains the entropies of the *n*-SNP distributions in which strain *s* occurs, as a function of *n* (see Methods). Supplementary Fig. S7 shows the entropy profiles for 5 example strains, as well as the distribution of entropy profiles across all strains. We see that the entropy generally increases as *n* increases, again indicating that the diversity of phylogenies increases as one goes further back in time. The entropy profiles are highly diverse, e.g. for strains like A10 and H6 the entropy increases quickly to 5 – 6 bits, while for the strain G8, which belongs to a cluster of 20 strains that are extremely closely related, the entropy only increases for *n* > 20. Most significantly, each strain *s* has an essentially unique entropy profile *H*_{s}(*n*), showing that each strain has its own ‘fingerprint’ of the frequencies with which its lineage shares recent ancestors with the other strains. Finally, the entropy profiles become more similar as *n* increases, and for large *n* the entropy converges to roughly 7.5 bits, which corresponds to effectively 180 different possible ancestries per strain.
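The entropy profile of a strain can be sketched in a few lines of Python. The example below assumes that the *n*-SNP counts are available as a dictionary mapping each set of strains sharing the minority nucleotide to its number of occurrences; this data layout and the function name `entropy_profile` are hypothetical. Entropies are computed in bits (base-2 logarithm), so that 2 raised to the entropy gives an effective number of *n*-SNP types.

```python
import math
from collections import defaultdict

def entropy_profile(nsnp_counts, strain):
    """Entropy (in bits) of the n-SNP type frequencies involving `strain`, per n.

    nsnp_counts: dict mapping frozenset(strains) -> number of occurrences.
    Returns {n: entropy} for every n at which the strain occurs.
    """
    counts_by_n = defaultdict(list)
    for group, count in nsnp_counts.items():
        if strain in group:
            counts_by_n[len(group)].append(count)
    profile = {}
    for n, counts in sorted(counts_by_n.items()):
        total = sum(counts)
        profile[n] = -sum(c / total * math.log2(c / total) for c in counts)
    return profile

# Toy counts (hypothetical data).
toy_nsnps = {
    frozenset({"A", "B"}): 2,
    frozenset({"A", "C"}): 2,
    frozenset({"B", "C"}): 7,
    frozenset({"A", "B", "C"}): 5,
}
profile_a = entropy_profile(toy_nsnps, "A")
print(profile_a)
```

Here strain A occurs in two equally frequent 2-SNP types (1 bit) and a single 3-SNP type (0 bits). By the same conversion, an entropy of 7.5 bits corresponds to 2^{7.5} ≈ 181 effective types, in line with the roughly 180 ancestries quoted above.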

### Other species of bacteria exhibit qualitatively similar statistics

To investigate to what extent the observations we made for *E. coli* generalize to other species of bacteria, we selected 5 additional species from different bacterial groups for which sufficiently many complete genome sequences of strains were available, and used REALPHY to obtain a core genome alignment of the strains for each species (see Table S1 for a list of the species, the number of strains, and other core genome statistics for each species). We then performed most of the analyses that we presented above for *E. coli* on each of these core alignments. Figure 8 presents a summary of the results that we observe across the species.

Figure 8A shows the cumulative distributions of pairwise divergences between strains for all species. We see that, while among our *E. coli* strains that were sampled from a common habitat there is a small percentage of very close pairs with divergence around 10^{−6}, for the strains of the other species the closest pairs are at divergence 10^{−5}. With the exception of *M. tuberculosis*, where the median pair divergence is around 10^{−4}, the median pairwise divergence in all other species is around 10^{−2} or larger. The vertical lines in Fig. 8A indicate the critical divergences, for each species, where half of the alignment is recombined. With the exception of *M. tuberculosis*, where all pairs are mostly clonal, the critical divergences lie in a fairly narrow range of 0.003 – 0.01. Figure 8B shows the reverse cumulative distributions, across pairs of strains, of the fraction of the alignment that is clonally inherited, i.e. as for Fig. 2I for *E. coli*. Note that, for all species except *M. tuberculosis*, the large majority of the pairs are fully recombined. For *H. pylori* the fraction of pairs that still contain clonally inherited DNA is almost zero, whereas for *S. aureus* the fraction of pairs with a substantial fraction of clonally inherited DNA is largest. Thus, we see that for almost all species the situation is similar to what we observed in *E. coli*: for most pairs the distance to their common ancestor cannot be estimated from their alignment, because the entire alignment has been overwritten by recombination events. Note also that, for all species, there is only a relatively small fraction of pairs that lie in the partially recombined regime (yellow segment in Fig. 8B).

Figure 8C shows, for each species, the fraction of all SNPs that derive from recombination, for pairs of strains that are at the critical divergence where half of the alignment is recombined. Even though this critical divergence occurs for pairs that are relatively close compared to the typical distance between pairs, for all species more than 90% of the SNPs derive from recombination. That is, we also see that for all 5 species the divergence between close strains is dominated by SNPs that are introduced through recombination.

Figure 8D summarizes the distributions of support of the branches of the core tree as violin plots, i.e. as shown for *E. coli* as a cumulative distribution in Fig. 4. In *E. coli* most branches have many more SNPs that reject the split than support it, and even stronger rejection of the branches of the core tree is observed for *B. subtilis* and *H. pylori*. For the other three species, including *M. tuberculosis*, an almost uniform distribution of branch support is observed, i.e. for these species there are roughly as many branches that are strongly supported by the SNPs, strongly rejected by the SNPs, or supported and rejected by roughly equally many SNPs.

Figure 8E summarizes, for each species, the distribution of distances between SNPs along the core alignment as box-whisker plots (green) as well as the distribution of distances between phylogeny breakpoints (blue), i.e. as shown in Fig. 5C for *E. coli*. The figure shows that, with the exception of *M. tuberculosis*, the inter-SNP distances range from a few to a few dozen basepairs, with a median inter-SNP distance of 4 (*H. pylori*) to 15 (*S. aureus*) base pairs. For these 5 species, the median distances between phylogeny breakpoints range from around 10 (*H. pylori*) to about 100 base pairs for *S. aureus*. Note that, for all species, the tail of the distributions stretches to very short distances between breakpoints, whereas distances between breakpoints of more than 200 bps are very rare for all these 5 species. Thus, for these species the segments that are consistent with a single phylogeny are always much shorter than the typical length of a gene. In contrast, for *M. tuberculosis* both the distances between SNPs and the distances between breakpoints are almost two orders of magnitude larger.

Finally, Fig. 8F shows box-whisker plots for the distribution of the number of consecutive SNPs between breakpoints, as was shown for *E. coli* in Fig. 5B. We see that for all species, including *M. tuberculosis*, there are typically fewer than a handful of SNPs in a row before a phylogeny breakpoint occurs, and very rarely more than a dozen SNPs. The smallest number of SNPs per breakpoint is observed for *H. pylori*, i.e. typically fewer than 2 SNPs per breakpoint, but the range of SNPs per breakpoint is very similar across all species.

We next investigated whether the *n*-SNPs of the other species also exhibit approximately power-law distributions, as observed in *E. coli*. Supplementary Fig. S8 shows the reverse cumulative distributions of 2-SNPs, 3-SNPs, 4-SNPs, and 12-SNPs across all 6 species together with power-law fits. Although the curves often deviate substantially from simple straight lines, they all exhibit long tails and range over several orders of magnitude, i.e. up to 5 orders of magnitude for 2-SNPs in *S. enterica*. Note that, since in *M. tuberculosis* the total number of different *n*-SNP types is small, only the 2-SNP and 3-SNP distributions can be reasonably defined. Figure 9 (left panel) shows the fitted exponents of the power-law distributions of *n*-SNPs as a function of *n* for all species. With the exception of *M. tuberculosis*, for which the exponents are small for all *n*, we see that the exponents generally increase with *n*, indicating that the phylogenetic diversity generally increases as one moves further back in time, i.e. to larger *n*. Consistent with other observations, *H. pylori* shows the highest exponents, i.e. the highest diversity, and *M. tuberculosis* the lowest. While the exponents become roughly constant for *n* > 20 for *E. coli*, *H. pylori* and *S. aureus*, *B. subtilis* and *S. enterica* exhibit more complex patterns with sudden drops in exponent at particular values of *n*, suggesting more complex population structures for these species.

As an aside, we decided to investigate what the distribution of *n*-SNP frequencies looks like for a sexually reproducing organism with complex population structure such as humans. We extracted SNP data for chromosome 21 for 2504 humans from the 1000 Genomes Project [25] and calculated the frequency distributions of *n*-SNP types. Supplementary Fig. S9 shows examples of the *n*-SNP distributions for humans together with the fitted exponents for *n* ranging from 1 to 30. Interestingly, the *n*-SNP distributions in humans are all well fit by power-law distributions but, instead of exponents that systematically increase with *n*, as we observed for the bacteria, for humans the exponent is slightly larger than 3 and independent of *n*.

Returning to the bacterial species, Supplementary Fig. S10 shows the entropy profiles *H*_{n}(*s*) for all strains *s* in each of the species. As we observed in *E. coli*, essentially every strain *s* exhibits a unique entropy profile *H*_{n}(*s*), showing that also in these other species each strain has a unique ‘fingerprint’ of frequencies with which its lineage shares ancestors with those of other lineages. Although the entropy rises quickly to values in the range 4 – 8 for most strains, we also see strains for which the entropy only rises after *n* exceeds some fairly large value, e.g. at *n* = 10 for some strains in *H. pylori*, and at *n* = 24 and *n* = 62 for some *S. enterica* strains, suggesting that these strains are part of groups of closely-related strains. Note also that these events appear to correspond to the sudden drops in the exponents of the *n*-SNP distributions of those strains (Fig. 9, left panel), reiterating that these *n*-SNP statistics encode extensive information about the population structure of each species. To summarize the entropy profiles of each species, the right panel of Fig. 9 shows the mean and standard-error of the entropy profiles, averaged over all strains, as a function of *n*. As for most other statistics, *M. tuberculosis* is an outlier whose strains generally only show low phylogenetic entropy. For all other species, the average entropy clearly increases as *n* increases, indicating again that the phylogenetic diversity increases further back in the past. For 4 of the 6 species, the mean entropy at large *n* falls in a narrow range between 5 and 6, suggesting that the effective number of ancestries far back in the past is relatively similar for these species.

## Discussion

In this work we have introduced new methods to analyze prokaryotic genome evolution from multiple alignments of the core genomes of strains from a species. In particular, showing that almost all bi-allelic SNPs in the core genome alignment correspond to single mutations in the history of that position in the alignment, we showed several new ways in which these SNPs can be used to quantify phylogenetic structures and the role of recombination in genome evolution within prokaryotic species.

Our analysis shows that, for the species studied here, evolution of the core genome is almost entirely driven by recombination. That is, even for very closely related pairs of strains, the large majority of mutations that separate them derive from recombination events. Moreover, for the large majority of pairs of strains, none of the DNA in their pairwise alignment derives from their common ancestor, and each position in the core alignment has been overwritten many times by recombination. Given this, it seems highly unlikely that the ancestral phylogeny of the strains can be reconstructed from the core genome alignment.

Although we cannot completely exclude that sufficient information about the ancestral phylogeny is still encoded in some way into the core alignment, it is clear that currently no method exists that is capable of extracting this information, and we suspect that it is in fact impossible, i.e. that recombination has destroyed the necessary information. However, even if it were possible to reconstruct the ancestral phylogeny, it is not clear how useful this clonal phylogeny would be for understanding core genome evolution. Our analysis of SNP compatibility along the core alignment shows that the phylogeny changes every few dozen basepairs (and every handful of SNPs), so that the core alignment fragments into many thousands of short segments with different phylogenies. Thus, modeling sequence evolution in the core genome as occurring along the branches of a fixed phylogenetic tree is clearly inappropriate.

One might infer from these statistics that bacterial species are quasi-sexual and recombining freely, but this is inconsistent with the observation that strains do not appear roughly equidistant and that phylogenies built from large numbers of genomic loci clearly converge to a well-defined average phylogeny. To understand how this phylogenetic structure emerges in the face of rampant recombination we developed several methods for using bi-allelic SNPs for quantifying population structure from the core genome alignment. In particular, although recombination is evident across the ancestral lineages of almost all strains, we find that some lineages recombine much more frequently than others, and that the relative rates with which different groups of strains share a recent common ancestor vary over 3 – 5 orders of magnitude and follow roughly power-law distributions. Thus, the phylogeny built from the core genome alignment does not reflect the clonal history of the strains, but rather reflects the rates with which different lineages have recombined in the past. Notably, since the *n*-SNP distributions follow smooth long-tailed distributions that do not appear to have a characteristic scale, it is not possible to naturally subdivide a species into subspecies of freely recombining groups of strains. Rather, there is a large continuum of relative rates. As an aside, given that recombination rates vary over orders of magnitude across different lineages, the idea of an effective recombination rate for a species seems inherently misleading, and approaches that fit the data to a model that assumes a constant rate of recombination within a species, e.g. [26], seem inappropriate.

Essentially all population genetics and coalescent models start from assuming one or more populations of individuals that, for the purpose of the model, are exchangeable. However, using the entropy profiles of the *n*-SNP distributions of each strain, we observed that every strain has unique relative rates with which its lineage shares common ancestors with the lineages of other strains. That is, each strain has unique recombination statistics. These observations thus suggest that models that assume individuals are exchangeable are inappropriate by definition.

Given that models that assume either a single consensus tree, a fixed rate of recombination across strains, or even just exchangeable individuals, are all clearly at odds with the data on prokaryotic genome evolution, this raises the question of what would be an appropriate mathematical ‘null model’ that can capture the statistics that we observed here. In such a model, each lineage must have different rates of recombination with all other lineages, these rates must vary over multiple orders of magnitude, and the model should reproduce the roughly power-law distributions of *n*-SNP frequencies, ideally with exponents that can be tuned by parameters in the model. It is currently unclear how to construct such a model.

Given that recombination rates between different lineages appear to vary over several orders of magnitude, it also raises the question as to what sets these relative recombination rates. For example, it is not even clear whether these rates are shaped by natural selection, e.g. that due to epistatic interactions only recombinant segments from other strains with similar ‘ecotypes’ are not removed by purifying selection, or whether recombination rates may rather be set by parameters such as the frequency with which lineages co-occur at the same geographical location. It is also conceivable that phages are a major source of transfer of DNA between strains, so that recombination rates may reflect the rates at which different lineages are infected by the same types of phages. It is also noteworthy that homologous recombination requires sufficient homology between the endpoints of the DNA fragment and the homologous segment in the host genome. Thus, recombination rates will intrinsically decrease with the nucleotide divergence between strains, and previous studies have estimated that the rate of successful recombination decreases exponentially with nucleotide divergence [27,28]. In this regard it is interesting that the critical divergence at which half of the genome is recombined varies over a relatively small range, i.e. from 0.003 to 0.01 (Fig. 8A). It is thus conceivable that a species is essentially defined by the collection of strains that are sufficiently close to allow efficient recombination [29]. However, the statistics reported here seem to suggest a much larger range of recombination rates than such a simple DNA-homology based model would predict.

While we here studied the frequency distribution of *n*-SNP types as well as the entropies *H*_{n}(*s*) of the *n*-SNP distributions for each strain, it appears to us that this is just the tip of the iceberg of possible ways in which *n*-SNPs can be used to study the evolution of a set of strains from their core genome alignment. Our analyses indicate that prokaryotic genome evolution is driven by recombination that occurs at a very wide distribution of different rates between different lineages, and there is now a strong need for the development of new mathematical tools and models that can accurately describe this kind of genome evolution.

## Methods

### Data

The *E. coli* sequences analysed here can be accessed on NCBI BioProject via the accession number PRJNA432505 [7,13]. Strain names and details for the reference strains used for Figure S1 can be found in Table S2.

Genome sequences for all other species were downloaded from ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/. All strain names and download dates are listed in Table S3.

### Core genome alignment and core tree

To build a core genome alignment for the SC1 strains we used the REALPHY tool [14] with default parameters and Bowtie 2 [30] for the alignments. REALPHY used PhyML [15] with parameters -m GTR -b 0 to infer trees from the whole core alignment and from parts of it. The tree visualizations were made using the FigTree software [31].

### Analysis of core alignment blocks

For each 3 Kb block of the core alignment we ran PhyML with the option -c 1 to infer a phylogeny while restricting the number of relative substitution rate categories to one. Furthermore, to calculate the log-likelihood of a given 3 Kb block under the tree topologies of other blocks, we reran PhyML using the -o ‘lr’ option, which only optimizes the branch lengths as well as the substitution rate parameters but does not alter the topology of the phylogeny.

### Pairwise analysis and mixture modeling

For each pair of strains we slide a 1 Kb window over the core genome alignment of the pair, shifting by 100 bp at a time, and build a histogram of the number of SNPs per kilobase by counting the number of SNPs in each window. That is, we obtain the distribution *P*_{n} of the fraction of 1 Kb windows that have *n* SNPs. We then assumed that the one kilobase blocks can be separated into a fraction *f*_{a} of ‘ancestral blocks’, i.e. regions that were inherited from the clonal ancestor of the pair, and a fraction (1 – *f*_{a}) that have been recombined since the pair diverged from a common ancestor. Although in previous work a simple *ad hoc* scheme was used in which it was assumed that blocks with less than a particular number of SNPs are ancestral and blocks with more SNPs are recombined [11], we found that this approach is not satisfactory and results significantly depend on the cut-off chosen.
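The windowing step described above can be sketched as follows. This minimal Python example assumes that the pairwise alignment is available as two ungapped strings of equal length; the function name `window_snp_distribution` is hypothetical, and prefix sums are used simply to avoid recounting SNPs in overlapping windows.

```python
from collections import Counter

def window_snp_distribution(seq_a, seq_b, window=1000, step=100):
    """Fraction of sliding windows that contain n SNPs, as {n: fraction}."""
    diffs = [int(a != b) for a, b in zip(seq_a, seq_b)]
    # prefix[i] = number of SNPs in the first i alignment columns
    prefix = [0]
    for d in diffs:
        prefix.append(prefix[-1] + d)
    counts = Counter()
    for start in range(0, len(diffs) - window + 1, step):
        counts[prefix[start + window] - prefix[start]] += 1
    total = sum(counts.values())
    return {n: c / total for n, c in sorted(counts.items())}

# Tiny toy example (hypothetical data) with a window of 4 and a step of 2.
toy_dist = window_snp_distribution("AAAAAAAAAA", "AAATAAAAAT", window=4, step=2)
print(toy_dist)
```

In the toy example there are four windows, three of which contain one SNP and one of which contains none.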

We thus decided to employ a more principled mixture model approach. For the ancestral regions, the number of SNPs per kilobase should follow a simple Poisson distribution *P*_{n} = *μ*^{n}*e*^{−μ}/*n*!, with *μ* the expected number of mutations per block. For the recombined regions, we note that these regions themselves will consist of mosaics of subregions that have been recombined. Consequently, the recombined regions will consist of a mixture of Poisson distributions with different rates. It is well-known that mixtures of Poisson distributions with rates that are (close to) Gamma-distributed follow a negative binomial distribution and we found empirically that negative binomial distributions give excellent fits to the observed SNP distributions in our data. For the recombined regions we thus assume a negative binomial distribution of the form

$$P_n = \frac{\Gamma(n+\alpha)}{n!\,\Gamma(\alpha)} (1-\lambda)^{\alpha} \lambda^{n},$$

where 0 ≤ λ ≤ 1 and *α* ≥ 1 are parameters of the distribution. We thus fit the observed distribution of SNPs per block *P*_{n} using the following mixture:

$$P_n = f_a \frac{\mu^{n} e^{-\mu}}{n!} + (1 - f_a) \frac{\Gamma(n+\alpha)}{n!\,\Gamma(\alpha)} (1-\lambda)^{\alpha} \lambda^{n},$$

where *f*_{a} is the fraction of the genome that is ancestral. Fits were obtained using maximum likelihood. While expectation maximization was used to fit the parameters *f*_{a}, *μ*, and *λ*, a grid search was employed to find the optimal dispersion parameter *α*.

Note that, in terms of the fitted parameters, the total number of mutations in ancestral blocks is *μf*_{a}, and the number of mutations in recombined blocks is (1 – *f*_{a})*α*λ/(1 – λ).
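As a numerical sanity check of these expressions, the following Python sketch evaluates a Poisson plus negative binomial mixture and compares its mean with *μf*_{a} + (1 – *f*_{a})*α*λ/(1 – λ). The negative binomial parameterization used here, with probability mass Γ(*n*+*α*)/(*n*! Γ(*α*)) (1 – λ)^{α} λ^{n}, is an assumption chosen because its mean *α*λ/(1 – λ) matches the expression in the text. The function names and parameter values are hypothetical, and the expectation-maximization fit itself is not reproduced.

```python
import math

def poisson_pmf(n, mu):
    """Poisson probability of n SNPs in an ancestral block (mu > 0)."""
    return math.exp(n * math.log(mu) - mu - math.lgamma(n + 1))

def negbin_pmf(n, alpha, lam):
    """Negative binomial probability of n SNPs in a recombined block (0 < lam < 1)."""
    log_p = (math.lgamma(n + alpha) - math.lgamma(n + 1) - math.lgamma(alpha)
             + alpha * math.log(1.0 - lam) + n * math.log(lam))
    return math.exp(log_p)

def mixture_pmf(n, f_a, mu, alpha, lam):
    """Mixture of ancestral (Poisson) and recombined (negative binomial) blocks."""
    return f_a * poisson_pmf(n, mu) + (1.0 - f_a) * negbin_pmf(n, alpha, lam)

def log_likelihood(window_counts, f_a, mu, alpha, lam):
    """Log-likelihood of {n: number of windows with n SNPs} under the mixture."""
    return sum(c * math.log(mixture_pmf(n, f_a, mu, alpha, lam))
               for n, c in window_counts.items())

def expected_mean(f_a, mu, alpha, lam):
    """Mean SNPs per block: ancestral part plus recombined part."""
    return f_a * mu + (1.0 - f_a) * alpha * lam / (1.0 - lam)

# Illustrative parameter values (not fitted to any data).
params = dict(f_a=0.6, mu=0.5, alpha=2.0, lam=0.8)
total = sum(mixture_pmf(n, **params) for n in range(500))
mean = sum(n * mixture_pmf(n, **params) for n in range(500))
print(total, mean, expected_mean(**params))
```

A function such as `log_likelihood` could serve as the objective in a grid search over *α*, with the remaining parameters optimized for each grid value.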

To estimate the lengths of recombination events, we first extracted pairs that are sufficiently close (divergence less than 0.002) such that multiple overlapping recombination events are unlikely. We then used a two-state HMM with the same two components, i.e. a Poisson and a negative binomial component corresponding to ancestral and recombined segments, and having fixed rates of transitioning from ancestral to recombined and vice versa, to parse the pairwise alignment into ancestral and recombined segments. We took the distribution of the lengths of recombined segments in these alignments as the length distribution of recombination events.

We define mostly clonal pairs as pairs with more than 90% of the alignment classified as ancestral, fully recombined pairs as pairs with less than 10% of the alignment classified as ancestral, and all other pairs as transition pairs. In order to estimate the critical divergence at which half of the genome is recombined we fit a linear model to the observed relationship between divergence and clonal fraction in all transition pairs, and define the critical divergence as the divergence at which the linear fit has a clonal fraction of 50%. To calculate the fraction of mutations that derive from recombined segments at the critical divergence we compute the fraction of mutations in recombined segments for all transition pairs (using the results from the mixture model) and fit a linear model to the observed dependence between the ancestral fraction and the fraction of mutations in recombined segments. We then define the fraction of mutations in recombined regions at the critical divergence as the fraction of mutations in the linear fit when the ancestral fraction is 50%.
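The estimate of the critical divergence can be illustrated with an ordinary least-squares fit. The Python sketch below assumes that each pair is summarized by a (divergence, clonal fraction) tuple and that the linear model is fitted to the untransformed divergence; whether a transformation was applied in the actual analysis is not stated here. The function names and the toy data are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def critical_divergence(pairs):
    """Divergence at which the fitted clonal fraction equals 50%.

    pairs: list of (divergence, clonal_fraction). Only transition pairs,
    i.e. pairs with clonal fraction between 10% and 90%, enter the fit.
    """
    transition = [(d, f) for d, f in pairs if 0.1 <= f <= 0.9]
    slope, intercept = linear_fit([d for d, _ in transition],
                                  [f for _, f in transition])
    return (0.5 - intercept) / slope

# Toy pairs (hypothetical data): the first and last are excluded from the fit.
toy_pairs = [(0.001, 0.95), (0.002, 0.8), (0.004, 0.6),
             (0.006, 0.4), (0.008, 0.2), (0.02, 0.0)]
print(critical_divergence(toy_pairs))
```

The fraction of mutations in recombined segments at the critical divergence can be obtained in the same way, by fitting that fraction against the ancestral fraction and evaluating the fit at 50%.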

### Estimating the fraction of SNPs that correspond to single mutational events

The relatively low frequency of SNPs and the fact that almost all SNPs are bi-allelic strongly suggest that almost all bi-allelic SNPs correspond to single mutational events. Here we use a simple model to estimate the fraction of bi-allelic SNPs that correspond to single mutational events. To do this we will analyze the observed frequencies of columns with 1, 2, 3, and 4 different nucleotides under a simple model. To assess the effects of selection, we will consider these frequencies both for the subset of positions that should be under relatively little selection, i.e. third positions in fourfold degenerate codons, and positions that should be under relatively strong selection, i.e. second positions in codons. We will also do this separately for all strains, and all strains minus the 9 strains of the outgroup.

For a given position in the alignment, let *μ* denote the product of the mutation rate times the total length of the branches in the phylogeny at that position. The variable *μ* thus corresponds to the expected number of mutations at this position. The probability that *n* mutations took place at this position is given by a Poisson distribution:

$$P(n|\mu) = \frac{\mu^{n} e^{-\mu}}{n!}.$$

We will assume that, every time a mutation occurs, each of the 3 possible target nucleotides is equally likely. Let *d* denote the number of different nucleotides in the column and let *T*(*d′*|*d*) be the matrix of probabilities that, under a single mutation, the number of different nucleotides transitions from *d* to *d′*. We have *T*(2|1) = 1, *T*(2|2) = 1/3, *T*(3|2) = 2/3, *T*(3|3) = 2/3, *T*(4|3) = 1/3, *T*(4|4) = 1, and all other transition probabilities are zero. Starting from a single nucleotide in the column, the probability *P*(*d*|*n*) to end up with *d* different nucleotides after *n* mutations is given by the *n*-th power of the transition matrix *T*, i.e. *P*(*d*|*n*) = *T*^{n}(*d*|1). From this we can work out the probability *P*(*d*|*μ*) to end up with *d* different nucleotides as a function of the expected number of mutations *μ* as

$$P(d|\mu) = \sum_{n=0}^{\infty} T^{n}(d|1) \frac{\mu^{n} e^{-\mu}}{n!}.$$

The infinite sums can all be evaluated analytically and we find

$$P(d=1|\mu) = e^{-\mu}, \qquad P(d=2|\mu) = 3 e^{-\mu}\left(e^{\mu/3} - 1\right),$$

and

$$P(d=3|\mu) = 3 e^{-\mu}\left(e^{\mu/3} - 1\right)^{2}, \qquad P(d=4|\mu) = e^{-\mu}\left(e^{\mu/3} - 1\right)^{3}.$$

Assume we observe *c _{d}* columns with d different nucleotides, with

*d*running from 1 to 4. The log-likelihood of this count data given

*μ*is

Maximizing the log-likelihood with respect to *μ* we find that the optimal value of *μ* given these counts as

Finally, given *μ*_{*}, the fraction *f _{sm}* of bi-allelic SNPs that correspond to single mutations is given by
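As a numerical illustration, the estimate can be sketched in a few lines of Python. The sketch assumes the closed forms that follow from the transition matrix *T*, written with x = exp(−*μ*/3) as *P*(1|*μ*) = x³, *P*(2|*μ*) = 3x²(1 − x), *P*(3|*μ*) = 3x(1 − x)², and *P*(4|*μ*) = (1 − x)³; the function names are hypothetical.

```python
import math


def column_probs(mu):
    """Probabilities of observing 1, 2, 3, or 4 distinct nucleotides in a
    column, given the expected number of mutations mu (assumed closed forms)."""
    x = math.exp(-mu / 3.0)
    return [x ** 3, 3 * x ** 2 * (1 - x), 3 * x * (1 - x) ** 2, (1 - x) ** 3]


def estimate_mu(c1, c2, c3, c4):
    """Maximum-likelihood estimate of mu from the counts c_d of columns
    with d distinct nucleotides."""
    total = c1 + c2 + c3 + c4
    return 3.0 * math.log(3.0 * total / (3 * c1 + 2 * c2 + c3))


def single_mutation_fraction(mu):
    """Fraction of bi-allelic columns that arose from exactly one mutation,
    i.e. P(n = 1 | mu) / P(d = 2 | mu). Requires mu > 0."""
    return mu / (3.0 * math.expm1(mu / 3.0))
```

For small *μ* the fraction behaves approximately as 1 − *μ*/6, so values of *f*_{sm} above 95% correspond to fewer than roughly 0.3 expected mutations per column.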

Table 1 shows the estimated expected number of mutations per column *μ*_{*} and the estimated fraction of bi-allelic SNPs that correspond to single mutations *f*_{sm} for the 5 different subsets of columns. We see that, for all 5 subsets, the fraction *f*_{sm} is over 95% and close to 100% for the second positions in codons.

In addition, Supplementary Fig. S11 shows a comparison of the observed and predicted frequencies of columns with 1, 2, 3, and 4 letters. Since effects of selection are likely least for the synonymous positions, we expect the simple model to fit the data best there, and we indeed observe that, for the synonymous positions, the simple model fits the observed frequencies reasonably accurately; even for the set of all alignment columns the fits are quite accurate (Suppl. Fig. S11). In contrast, for the second positions in codons, we can see the effects of selection in that, from the larger fraction of columns without SNPs, the model infers a lower *μ*_{*}, and this leads to an underestimation of the number of columns with 4 nucleotides. Thus, the true fraction *f*_{sm} is more likely close to the values inferred from the synonymous positions. Note that *f*_{sm} = 0.953 when including the outgroup and *f*_{sm} = 0.975 when the outgroup is excluded. The difference between these two estimates derives from the very high fraction of SNP columns in which the 9 strains of the outgroup have another nucleotide than all other strains. For this subset of SNPs the fraction of columns that have more than one mutation is much higher than for any other SNP column. Thus, for all other SNP columns, the estimate that 97.5% correspond to single mutations is likely the most accurate.

### Constructing a tree that maximizes the number of compatible SNPs

We classify all SNPs in the core genome alignment into *SNP types* as follows. For each bi-allelic SNP, we map the majority nucleotide to 0 and the minority nucleotide to 1, and sort the resulting bits according to the alphabetic order of the strain names. In this way, each SNP is mapped to a binary sequence of length 92. This binary sequence defines the SNP type. Note that a SNP type corresponds to a particular bipartition of the strains.
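A minimal sketch of this encoding is given below. It assumes a column is available as a mapping from strain name to nucleotide; the function name `snp_type` is hypothetical, and the tie-breaking rule for columns with equally frequent nucleotides is an assumption, since the text does not specify one.

```python
from collections import Counter


def snp_type(column):
    """column: dict mapping strain name -> nucleotide for one bi-allelic SNP.

    Returns a bit string ordered by strain name, with 0 for the majority
    nucleotide and 1 for the minority nucleotide.
    """
    counts = Counter(column.values())
    if len(counts) != 2:
        raise ValueError("column is not bi-allelic")
    (major, n_major), (minor, n_minor) = counts.most_common(2)
    if n_major == n_minor:
        # Assumption: on a tie, the alphabetically smaller nucleotide maps to 0.
        major, minor = sorted((major, minor))
    return "".join("0" if column[s] == major else "1" for s in sorted(column))
```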

We next counted the number of occurrences *n*_{t} of each SNP type *t* and sorted the SNP types from most to least common. We then used the following greedy algorithm to collect a subset *T* of mutually compatible SNP types that accounts for as many SNPs as possible. We seed *T* with the most common SNP type, i.e. the SNP type occurring at the top of the list. We then go down the list of SNP types, iteratively adding to *T* each SNP type *t* that is compatible with all types already in *T*.
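The greedy selection can be sketched as follows. The sketch assumes SNP types are given as equal-length bit strings and that pairwise compatibility is decided by the standard four-gamete condition (two bipartitions are compatible if at least one of the patterns 00, 01, 10, 11 never occurs); both function names are hypothetical.

```python
from collections import Counter


def compatible(a, b):
    """Four-gamete test on two equal-length bit strings."""
    return len(set(zip(a, b))) < 4


def greedy_compatible_types(snp_types):
    """snp_types: iterable of bit strings, one entry per SNP.

    Returns (chosen, covered): the greedily selected mutually compatible
    SNP types and the number of SNPs they account for.
    """
    counts = Counter(snp_types)
    chosen = []
    # most_common() yields types from most to least frequent; the first
    # type is always accepted, which corresponds to seeding the set.
    for t, _ in counts.most_common():
        if all(compatible(t, u) for u in chosen):
            chosen.append(t)
    return chosen, sum(counts[t] for t in chosen)
```

Because the procedure is greedy, it is not guaranteed to find the globally largest compatible set; it only guarantees that no rejected type could be added without creating a conflict with a more common one.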

### Bottom up tree building

In this procedure we build phylogenies of subclades in a bottom-up manner, starting from the full set of 92 strains and iteratively fusing pairs, minimizing the number of incompatible SNPs at each step.

For any subset of strains *S*, we define the number of supporting SNPs *n*_{S} as the number of SNPs that fall on the branch between the subset *S* and the other strains, i.e. the number of SNPs in which all strains in *S* have one letter, and all other strains another letter. Similarly, we define the number of clashing SNPs *c*_{S} as the number of SNPs that are incompatible with the strains in *S* forming a subclade in the tree.
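The two counts can be illustrated with a short sketch. It assumes bi-allelic SNPs encoded as bit strings, strains identified by their index in the string, and the four-gamete condition as the definition of incompatibility; the function name `support_and_clash` is hypothetical.

```python
def support_and_clash(snp_types, subset, n_strains):
    """snp_types: iterable of bit strings, one per bi-allelic SNP.
    subset: set of strain indices forming the candidate subclade S.

    Returns (n_S, c_S): the numbers of supporting and clashing SNPs.
    """
    member = "".join("1" if i in subset else "0" for i in range(n_strains))
    complement = "".join("0" if i in subset else "1" for i in range(n_strains))
    n_s = c_s = 0
    for t in snp_types:
        if t == member or t == complement:
            n_s += 1  # SNP falls exactly on the branch separating S
        elif len(set(zip(member, t))) == 4:
            c_s += 1  # all four gametes occur: SNP clashes with S as a clade
    return n_s, c_s
```

Under this definition a single strain can never have clashing SNPs, which is consistent with starting the merging from individual strains.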

The iterative merging procedure is initiated with each of the 92 strains forming a subclade *S*. At each step of the iteration we calculate, for each pair of existing subclades *S*_{1}, *S*_{2}, the number of clashing SNPs *c*_{S} and supporting SNPs *n*_{S} for the set of strains *S* = *S*_{1} ∪ *S*_{2} consisting of the union of the strains in *S*_{1} and *S*_{2}. We then merge the pair (*S*_{1}, *S*_{2}) that minimizes the clashes *c*_{S} and, when there are ties, maximizes the number of supporting SNPs *n*_{S}. At each step of the calculation we keep track of the total number of SNPs on the branches of the subtrees built so far, as well as the total number of SNPs that are inconsistent with the subtrees built so far. In addition, we calculate the average pairwise divergences of the strains within the subclades. Supplementary Fig. S3 shows the ratio of clashing to supporting SNPs as a function of the divergence within the subclades.

### Quartet analysis

Quartets were assembled in the following way. We construct a grid of 50 target distances *d*, starting at 0.00001 and spaced 0.0005 apart. For every target distance *d* we scan the alignment for four strains whose pairwise distances are all within 1.25-fold of *d*. Every target distance *d* for which no quartet fulfilling these criteria can be found is ignored.

For each quartet we extract all SNP columns where two strains have a specific nucleotide and the other two strains have another nucleotide. Every such SNP column unambiguously supports one out of three possible tree topologies for this quartet. For each quartet we determine which topology has the largest number of supporting SNPs, and what the fraction of SNPs is that support this topology.
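The topology count for one quartet can be sketched as below. The sketch assumes each alignment column restricted to the quartet is given as a 4-character string of nucleotides in a fixed strain order; the function name and the topology labels such as `"01|23"` are hypothetical.

```python
from collections import Counter


def quartet_support(columns):
    """columns: iterable of 4-character strings (nucleotides of the 4 strains).

    Returns (best_topology, fraction, counts), where fraction is the share of
    informative (2 vs 2) SNP columns that support the best topology.
    """
    counts = Counter()
    for col in columns:
        a, b, c, d = col
        if len(set(col)) != 2:
            continue  # not bi-allelic within the quartet
        if a == b and c == d:
            counts["01|23"] += 1
        elif a == c and b == d:
            counts["02|13"] += 1
        elif a == d and b == c:
            counts["03|12"] += 1
        # 3 vs 1 columns fall through: they support no topology
    total = sum(counts.values())
    if total == 0:
        return None, 0.0, counts
    best, n_best = counts.most_common(1)[0]
    return best, n_best / total, counts
```

If all informative columns derived from a single tree, the returned fraction would be 1; a fraction near 1/3 indicates that the three topologies are supported about equally.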

### Linkage Disequilibrium measure

A standard measure of linkage disequilibrium of SNPs at a given distance is given by the average squared correlation of the genotypes at these positions [32]. For a pair of loci with bi-allelic SNPs there are 4 possible genotypes, which we indicate as binary patterns 00, 01, 10, and 11. If the frequencies of these genotypes are *f*_{00}, *f*_{01}, *f*_{10}, and *f*_{11}, then the squared correlation is calculated as

$$r^2 = \frac{\left(f_{11} f_{00} - f_{10} f_{01}\right)^2}{f_{1.}\, f_{0.}\, f_{.1}\, f_{.0}},$$

where the variables with dots correspond to marginal probabilities, e.g. *f*_{1.} = *f*_{10} + *f*_{11}, *f*_{.1} = *f*_{01} + *f*_{11}, and so on.
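For a single pair of loci, the squared correlation can be computed as in the following sketch, which uses the standard form r² = (*f*_{11} *f*_{00} − *f*_{10} *f*_{01})² / (*f*_{1.} *f*_{0.} *f*_{.1} *f*_{.0}). It assumes both loci are encoded as equal-length bit strings over the same strains; the function name `r_squared` is hypothetical.

```python
from collections import Counter


def r_squared(locus_a, locus_b):
    """Squared genotype correlation for two bi-allelic loci given as
    equal-length bit strings (one character per strain)."""
    n = len(locus_a)
    pattern = Counter(zip(locus_a, locus_b))
    f00 = pattern[("0", "0")] / n
    f01 = pattern[("0", "1")] / n
    f10 = pattern[("1", "0")] / n
    f11 = pattern[("1", "1")] / n
    # Marginal frequencies of allele 1 and allele 0 at each locus.
    f1_, f_1 = f10 + f11, f01 + f11
    f0_, f_0 = f00 + f01, f00 + f10
    denom = f1_ * f0_ * f_1 * f_0
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return (f11 * f00 - f10 * f01) ** 2 / denom
```

The linkage disequilibrium at a given genomic distance is then the average of this quantity over all pairs of SNPs separated by that distance.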

### Minimum number of phylogeny switches

We iterate over all SNP columns in their order along the core genome alignment and add the current SNP to a list if it is pairwise compatible with all SNPs currently in the list. If it is incompatible with at least one SNP in the list, we empty the list, re-initialize it with the current SNP, and increase the phylogeny counter by one.
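This scan can be sketched as follows, assuming SNP columns are encoded as bit strings in alignment order and that pairwise compatibility is the four-gamete condition. The function names are hypothetical, and starting the counter at zero (so that it counts switches rather than phylogenies) is an assumption.

```python
def compatible(a, b):
    """Four-gamete test: compatible if at most 3 of the patterns
    00, 01, 10, 11 occur across strains."""
    return len(set(zip(a, b))) < 4


def count_phylogeny_switches(snp_columns):
    """snp_columns: bit strings of bi-allelic SNPs in alignment order.

    Returns a lower bound on the number of phylogeny changes along the
    alignment.
    """
    current = []
    switches = 0
    for col in snp_columns:
        if all(compatible(col, other) for other in current):
            current.append(col)
        else:
            current = [col]  # start a new block with the conflicting SNP
            switches += 1
    return switches
```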

### Power-law fits of *n*-SNP distributions

We extract each *n*-SNP from the core genome alignment and count the frequency, i.e. the number of occurrences, *f*_{t} of each *n*-SNP type *t*, as well as the total number *T* of *n*-SNP types that occur at least once. We assume the *n*-SNP type occurrences are drawn from a power law of the form

$$P(f \mid \alpha) = \frac{f^{-\alpha}}{\zeta(\alpha)},$$

where *ζ*(*α*) is the Riemann zeta function defined by

$$\zeta(\alpha) = \sum_{f=1}^{\infty} f^{-\alpha}.$$

The log-likelihood of the frequencies *f*_{t} as a function of *α* is given by

$$L(\alpha) = -T \left( \alpha \langle \log[f] \rangle + \log[\zeta(\alpha)] \right),$$

where 〈log[*f*]〉 is the average of the logarithm of the SNP-type frequencies. Using a uniform prior on *α*, the posterior distribution of *α* is simply proportional to the likelihood function. The optimal exponent *α*_{*} is the solution of

$$\frac{\zeta'(\alpha_*)}{\zeta(\alpha_*)} = -\langle \log[f] \rangle.$$

To calculate error bars on the fitted exponents we approximate the posterior by a Gaussian by expanding the log-likelihood to second order around the optimal exponent *α*_{*}. We then find for the standard deviation of the posterior distribution:

$$\sigma = \left[ T \left( \frac{\zeta''(\alpha_*)}{\zeta(\alpha_*)} - \left( \frac{\zeta'(\alpha_*)}{\zeta(\alpha_*)} \right)^2 \right) \right]^{-1/2}.$$
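A self-contained numerical sketch of this fit is shown below. It assumes the power law *P*(*f*|*α*) = *f*^{−*α*}/*ζ*(*α*) over integer frequencies *f* ≥ 1, evaluates the zeta sums by direct summation with an integral tail correction, and solves the stationarity condition by bisection. The function names, the number of summed terms, and the search bracket for *α* are illustrative choices, not part of the original analysis.

```python
import math


def zeta_moments(alpha, n_terms=10_000):
    """Return (S0, S1, S2) with S_k = sum over f >= 1 of (ln f)^k f^(-alpha),
    for alpha > 1. S0 is zeta(alpha), S1 is -zeta'(alpha), S2 is zeta''(alpha).
    Terms below n_terms are summed directly; the tail is approximated by its
    integral plus half the boundary term (Euler-Maclaurin)."""
    s0 = s1 = s2 = 0.0
    for f in range(1, n_terms):
        w = f ** -alpha
        lf = math.log(f)
        s0 += w
        s1 += w * lf
        s2 += w * lf * lf
    a = alpha - 1.0
    ln = math.log(n_terms)
    base = n_terms ** -a
    half = 0.5 * n_terms ** -alpha
    s0 += base / a + half
    s1 += base * (ln / a + 1 / a ** 2) + half * ln
    s2 += base * (ln ** 2 / a + 2 * ln / a ** 2 + 2 / a ** 3) + half * ln ** 2
    return s0, s1, s2


def fit_power_law(freqs, lo=1.0001, hi=20.0, tol=1e-10):
    """freqs: occurrence counts f_t (integers >= 1), one per n-SNP type.

    Returns (alpha_star, sigma): the maximum-likelihood exponent and the
    standard deviation of the Gaussian approximation to the posterior.
    """
    t = len(freqs)
    mean_log = sum(math.log(f) for f in freqs) / t
    if mean_log <= 0:
        raise ValueError("all frequencies equal 1: exponent is unbounded")
    # S1/S0 is the expected log-frequency and decreases with alpha,
    # so bisection on S1/S0 - mean_log finds the unique root.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        s0, s1, _ = zeta_moments(mid)
        if s1 / s0 > mean_log:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    s0, s1, s2 = zeta_moments(alpha)
    variance = s2 / s0 - (s1 / s0) ** 2
    return alpha, 1.0 / math.sqrt(t * variance)
```

The bracket limits the sketch to exponents between 1.0001 and 20; data whose optimum lies outside this range would return a value at the boundary.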

### Entropy profiles of *n*-SNP distributions

For a given strain *X* we first extract all SNP types *t* for which *X* is one of the strains that shares the minority nucleotide. We then further stratify these SNP types by the number *n* of strains sharing the minority nucleotide. For each *n* we thus obtain a set *S*(*X*, *n*) of *n*-SNPs in which strain *X* is one of the strains sharing the SNP. We denote the number of occurrences of a SNP of type *t* by *f*_{t} and the total number of *n*-SNPs within set *S*(*X*, *n*) as *F*(*X*, *n*), i.e.

$$F(X, n) = \sum_{t \in S(X, n)} f_t.$$

The entropy *H*(*X*, *n*) of the *n*-SNP distribution of strain *X* is then defined as

$$H(X, n) = -\sum_{t \in S(X, n)} \frac{f_t}{F(X, n)} \log\left[\frac{f_t}{F(X, n)}\right].$$
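An entropy profile for one strain can be sketched as follows, taking the entropy to be the Shannon entropy of the normalized type frequencies *f*_{t}/*F*(*X*, *n*). The sketch assumes SNPs are encoded as bit strings with 1 marking the minority nucleotide and uses the natural logarithm, which is an assumption since the base is not stated in the text; the function name `entropy_profile` is hypothetical.

```python
import math
from collections import Counter


def entropy_profile(snp_types, strain_index):
    """snp_types: iterable of bit strings, one per SNP (1 = minority allele).

    Returns a dict mapping n to H(X, n) for the strain at strain_index,
    where n is the number of strains sharing the minority nucleotide.
    """
    by_n = {}
    for t in snp_types:
        if t[strain_index] == "1":
            by_n.setdefault(t.count("1"), Counter())[t] += 1
    profile = {}
    for n, counts in by_n.items():
        total = sum(counts.values())  # F(X, n)
        profile[n] = sum(c / total * math.log(total / c)
                         for c in counts.values())
    return profile
```

An entropy of zero means that strain *X* shares all its *n*-SNPs with one fixed set of partners, while the maximum value log *K* is reached when *X* shares *n*-SNPs equally often with *K* different sets.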

## Acknowledgments

The authors thank Olin Silander and Frederic Bertels for useful discussions during the development of this work. This work was supported by the Swiss National Science Foundation grant No. 31003A_135397. In addition, this work was done in part while the authors were visiting the Simons Institute for the Theory of Computing and was thus supported in part by NSF Grant No. PHY17-48958, NIH Grant No. R25GM067110, and the Gordon and Betty Moore Foundation Grant No. 2919.01. Calculations were performed at sciCORE (http://scicore.unibas.ch/) scientific computing core facility at University of Basel.