Abstract
Bacteria can exchange and acquire new genetic material from other organisms directly and via the environment. This process, known as bacterial recombination, has a strong impact on the evolution of bacteria, for example leading to the spread of antibiotic resistance across clades and species, and to the avoidance of clonal interference. Recombination hinders phylogenetic and transmission inference because it creates patterns of substitutions that are not consistent with the hypothesis of a single evolutionary tree (homoplasies). Bacterial recombination is typically modelled as statistically akin to the gene conversion process of eukaryotes, i.e., using the coalescent with gene conversion (CGC). However, this model can be very computationally demanding as it requires accounting for the correlations of evolutionary histories of even distant loci. So, with the increasing popularity of whole genome sequencing, the need has emerged for a new and faster approach to model and simulate bacterial evolution at genomic scales. We present a new model that approximates the coalescent with gene conversion: the bacterial sequential Markov coalescent (BSMC). Our approach is based on a similar idea to the sequential Markov coalescent (SMC), an approximation of the coalescent with recombination. However, bacterial recombination poses hurdles to a sequential Markov approximation, as it leads to strong correlations and linkage disequilibrium across very distant sites in the genome. Our BSMC overcomes these difficulties and shows both a considerable reduction in computational demand compared to the exact CGC, and very similar patterns in the simulated data. We use the BSMC within an Approximate Bayesian Computation (ABC) inference scheme and show that we can correctly recover parameters simulated under the exact CGC, which further showcases the accuracy of our approximation.
We also use this ABC approach to infer recombination rate, mutation rate, and recombination tract length from a whole genome alignment of Bacillus cereus. Lastly, we implemented our BSMC model within a new simulation software FastSimBac. In addition to the decreased computational demand compared to previous bacterial genome evolution simulators, FastSimBac also provides a much more general set of options for evolutionary scenarios, allowing population structure with migration, speciations, population size changes, and recombination hotspots. FastSimBac is available from https://bitbucket.org/nicofmay/fastsimbac and is distributed as open source under the terms of the GNU General Public Licence.
Introduction
Whole-genome bacterial sequencing has rapidly replaced multilocus sequence typing for population analyses of bacterial pathogens thanks to its fast and cost-effective provision of higher resolution genetic information (Didelot et al., 2012; Wilson, 2012). Methods using genomic data to infer epidemiological, phylogeographic, phylodynamic and evolutionary patterns are often hampered by recombination (e.g. Schierup and Hein, 2000; Posada and Crandall, 2002), and bacterial recombination is no exception (Hedge and Wilson, 2014). Recombination, in fact, causes different sites of the genome to have different inheritance histories. For these reasons, in recent years many methods have been proposed to measure, identify and account for bacterial recombination (e.g. Didelot and Falush, 2007; Marttinen et al., 2008; Tang et al., 2009; Didelot et al., 2010; Marttinen et al., 2012; Croucher et al., 2014; Didelot and Wilson, 2015). Among these, simulators of bacterial evolution (e.g. Didelot et al., 2009b; Mostowy et al., 2014; Brown et al., 2015) are used for parameter inference and hypothesis testing (e.g. Fearnhead et al., 2005; Fraser et al., 2005; Wilson et al., 2009; Ansari and Didelot, 2014) and for benchmarking (e.g. Falush et al., 2006; Didelot and Falush, 2007; Turner et al., 2007; Buckee et al., 2008; Marttinen et al., 2012; Hedge and Wilson, 2014).
Simulating bacterial evolution poses specific difficulties as the process of bacterial recombination is very different from that of other organisms. Eukaryotic recombination is predominantly modelled as a cross-over process, with recombination events breaking a chromosome into two parts with different ancestries (Figure 1). While it is possible to simulate eukaryotic evolution with recombination forward in time (Peng and Kimmel, 2005; Carvajal-Rodríguez, 2008; Hernandez, 2008; Arenas, 2013), coalescent-based (Kingman, 1982) backward in time models (Hudson, 1983; Griffiths and Marjoram, 1997; Wiuf and Hein, 1999) are usually more computationally efficient (e.g. Hudson, 2002; Arenas and Posada, 2007, 2010; Ewing and Hermisson, 2010; Excoffier and Foll, 2011). Yet, the coalescent with recombination itself may not be sufficiently fast when large genomic segments are considered (McVean and Cardin, 2005). One of the reasons is that the structure describing the evolutionary history of all positions (ancestral recombination graph, or ARG) seems to grow subexponentially with genome size and recombination rate (Wiuf and Hein, 1999). For this reason, a faster approximation to the coalescent with recombination, the sequential Markov coalescent (SMC, see McVean and Cardin, 2005; Marjoram and Wall, 2006) has been proposed. Similar to the sequential coalescent with recombination (Wiuf and Hein, 1999), the SMC starts by considering one evolutionary tree on the left end of the genome, and generates new trees affected by recombination as it moves towards the right end. However, the SMC does not generate an ARG, but rather a sequence of local trees under the simplifying assumption that if the local tree for the current position is known, all previous local trees can be ignored in the next steps. This model has been further extended to include population history (Chen et al., 2009), increased accuracy (Wang et al., 2014), and increased computational efficiency (Staab et al., 2015).
Bacterial recombination is different from eukaryotic recombination (Smith et al., 1993, 2000), and is generally modelled as a gene conversion process, such that in a bacterial recombination event only a relatively small fragment of DNA is imported from a donor, whereas most of the genome is inherited clonally (Figure 1). As a result, sites located very far apart in the genome can be very tightly linked genetically. In fact, a single genealogy, the clonal frame (Milkman and Bridges, 1990), represents the evolutionary history of all non-recombining sites, no matter how far they are from each other on the genome. For this reason, most methods used to describe and simulate eukaryotic recombination cannot be applied in bacteria. While bacterial evolution can also be simulated forward in time, backward in time coalescent methods are usually more efficient, and are generally based on the coalescent with gene conversion (CGC, see Wiuf and Hein, 2000, and Figure 2A). Recently, efficient methods implementing the CGC have been developed for simulating bacterial evolution (Didelot et al., 2009b; Brown et al., 2015). However, these approaches still struggle in simulating whole genomes at high recombination rates (e.g. requiring up to hours to simulate a single bacterial genome-wide alignment with ρ > 0.01, see Brown et al., 2015, and Results) because, similar to the coalescent with recombination, the CGC also generates large ARGs.
Here, we present a new model of recombination (Figure 2B) that, inspired by the SMC, efficiently and accurately approximates bacterial recombination. We explicitly model the clonal frame, and simulate the coalescent and recombination processes along the genome conditional on the clonal frame, but “forgetting” recombination events that occur at distant positions. This approach differs from other approximations to the CGC (Didelot et al., 2010; Ansari and Didelot, 2014) as we can simulate entire genomes while allowing recombining lineages to coalesce with one another, and recombination events to split the ancestral material of recombinant lineages. Ignoring these complexities leads to biases when considering elevated recombination rates (Didelot et al., 2010), and by accounting for them we aim to specify a model that adheres more closely to the CGC. We call this model the bacterial sequential Markov coalescent (BSMC), which we implement within a new simulation software called FastSimBac. FastSimBac is faster than previous methods (between about one and two orders of magnitude for typical bacterial genome size and recombination rates). Also, by building on top of the popular simulators ms (Hudson, 2002) and MaCS (Chen et al., 2009), our software can simulate more general evolutionary scenarios, allowing migration, speciation, demographic changes, recombination hotspots, and between-species recombination. We show that the BSMC can accurately approximate the exact CGC by inferring recombination parameters simulated under the CGC using Approximate Bayesian Computation (ABC) implementing BSMC simulations with FastSimBac. We also showcase its applicability by using it to infer recombination and mutation parameters via ABC from a whole genome alignment of Bacillus cereus.
Materials and Methods
BSMC algorithm
We assume that a given set of parameters is specified in advance: λ is the mean length of a recombining segment, G is the total genome length, and ρ is the recombination rate. λ and G are considered in terms of base pairs, while ρ = 2NeGr is the per-individual, per-generation, per-base pair gene conversion initiation rate r scaled by the effective population size Ne and genome length G. Our BSMC algorithm, while inspired by the SMC (McVean and Cardin, 2005; Marjoram and Wall, 2006) in that it crosses the genome from left to right and discards previous local trees, also keeps track of and conditions on the clonal frame, and so has several important differences from the SMC. All lineages with ancestral material exclusively on the left of the currently considered genomic position xcur are forgotten (removed from the current local ARG A(xcur)), while all lineages with ancestral material on the right of xcur are stored in memory (included in A(xcur)). All lineages in A(xcur) are possible targets of new recombination events and coalescent events. Recombination events and coalescent events are instead not allowed on lineages that have been forgotten (not in A(xcur)). In order to decide which lineage is in A(xcur) and which needs to be removed, we record and update for each lineage l its ancestral material on the right of xcur: al(xcur). Updating the ancestral material of each lineage in A(xcur) after a new recombination has been added to A(xcur) is one of the most complex routines in our algorithm. One aim of the algorithm is to generate the sequence of local trees along the genome. For a given position xcur, the local tree (or marginal tree) T(xcur) is the coalescent tree describing the inheritance history of site xcur. T(xcur) can be obtained from A(xcur) by removing all branches that are not ancestral at xcur. While a simple graphical example of the algorithm is given in Figure S2, the list of BSMC algorithmic steps is the following:
1. Initialisation: xcur = 0 (current position, maximum is 1), and Tcf (the clonal frame) is simulated under the coalescent without recombination. The initial local ARG A(xcur), and local tree T(xcur), are set to T(0) = A(0) = Tcf. The ancestral material of every lineage l in A(0) is set to al(0) = [xcur,1] = [0,1], the whole genome. The list of recombination end points E (the right ends of recombination segments) is initialised as empty: E = ().
2. Position of new event: The distance from xcur to the next recombination initiation xnew is drawn from an exponential distribution whose rate is set by ρ and by the sum of all branch lengths in A(xcur), expressed in units of 2Ne generations. If xnew > E0, where E0 is the first (and smallest) element of the list E of recombination end points (if E is empty then E0 = ∞), then xnew = E0, E0 is removed from E, and the next event is a recombination termination, so go to step 4. If xnew ≥ 1, and E is empty, terminate the algorithm. Otherwise the next event is a new recombination, so go to step 3.
3. New recombination event: sample a lineage l randomly from A(xcur) proportionally to branch length. Then sample a time t uniformly along the time spanned by l. The new recombination happens at time t on branch l, and a new lineage l’ is created, with its more recent end joining l at time t. A new coalescent time and coalescing lineage is sampled for l’ conditional on A(xcur) (under the algorithm of Wiuf and Hein, 1999). The right end of the recombining interval xend is sampled from the distribution (xend − xnew) ∼ Geom(λ)/G, where Geom(λ) is the geometric distribution with mean λ. If xend < 1, it is added to E in such a way as to keep E sorted in increasing order. The new local ARG is defined as A(xnew) = A(xcur) ∪ l’ and the ancestral material of all lineages in A(xnew) is updated (ancestral material on the left of xnew is deleted). All lineages with no ancestral material on the right side of xnew are removed from A(xnew). The new local tree T(xnew) is defined from A(xnew) and is printed to file. The current position is updated: xcur = xnew. Go back to step 2.
4. Terminate a recombination event: the ancestral material of all lineages in A(xnew) is updated (ancestral material on the left of xnew is deleted). All lineages with no ancestral material on the right side of xnew are removed from A(xnew). The new local tree T(xnew) is defined from A(xnew) and is printed to file. The current position is updated: xcur = xnew. Go back to step 2.
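The position bookkeeping of the steps above (drawing the next initiation point, comparing it with the smallest stored end point, and drawing tract lengths) can be sketched in Python. This is only an illustrative sketch and not the FastSimBac implementation: the function names are hypothetical, the exponential rate is assumed to be ρ times the total branch length, and that total branch length is held constant, whereas the algorithm recomputes it from A(xcur) after every event. No trees or ancestral material are tracked.

```python
import heapq
import math
import random


def geometric_length(rng, mean_length):
    """Draw a tract length in bp from a geometric distribution with mean mean_length (> 1)."""
    p = 1.0 / mean_length
    u = 1.0 - rng.random()  # uniform on (0, 1]
    return int(math.log(u) / math.log(1.0 - p)) + 1


def sketch_event_positions(rho, lam, G, total_branch_length, seed=0):
    """Return the ordered list of ("start" | "end", position) events on [0, 1).

    ASSUMPTIONS (not taken from the paper): the initiation rate is
    rho * total_branch_length, and total_branch_length never changes.
    """
    rng = random.Random(seed)
    rate = rho * total_branch_length
    x_cur = 0.0
    ends = []    # min-heap playing the role of the sorted list E
    events = []
    while True:
        # Step 2: propose the position of the next recombination initiation.
        x_new = x_cur + rng.expovariate(rate)
        if ends and x_new > ends[0]:
            # A stored end point comes first: recombination termination (step 4).
            x_new = heapq.heappop(ends)
            events.append(("end", x_new))
        elif x_new >= 1.0:
            # Past the right end of the genome with E empty: stop.
            break
        else:
            # Step 3: new recombination; draw the right end of the tract.
            x_end = x_new + geometric_length(rng, lam) / G
            if x_end < 1.0:
                heapq.heappush(ends, x_end)
            events.append(("start", x_new))
        x_cur = x_new
    return events
```

For example, `sketch_event_positions(rho=20.0, lam=500, G=100000, total_branch_length=1.0)` returns a list of start and end events in increasing genomic order; the parameter values here are arbitrary and chosen only to produce a handful of events.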
A large part of the complexity of the algorithm goes into the process of updating the ancestral material of lineages after a new recombination event is added to the local ARG. This step is described in more detail in the Supplement. Our algorithm and model differ from the approximation of the CGC used by Didelot et al. (2010) and Ansari and Didelot (2014) in that we allow recombinant lineages to be affected by recombination, and to coalesce with each other if they have overlapping ancestral material. To increase the realism of the model, we use the first positions simulated by the algorithm (generally 10λ bases) as burn-in, that is, they are simulated but not written to output or considered part of the genome length. While we simulate a linear genome, bacterial genomes are typically circular, so we assume that a genome start position has been arbitrarily chosen. The version of the algorithm above conveys the basics of the model of within-population recombination, and does not describe many additional events that we have included in our simulation software FastSimBac and that are described in the Supplement: mutations, migration, speciations, demographic changes, recombination hotspots and between-species recombination.
Performance Testing
We simulated bacterial genome evolution under the coalescent with gene conversion using SimBac (Brown et al., 2015). We always simulated 50 contemporaneous samples. We performed simulations under four different recombination intensities:
ρ= 2Ner= 0.001,0.002,0.005,0.01, with ρ the population-scaled per generation per site recombination initiation rate. We also used four different genome sizes: G=1Mbp, 2Mbp, 5Mbp, 10Mbp. The mean recombination tract length was fixed to λ= 500. These values encompass a range of typical biologically relevant scenarios for bacteria (Vos and Didelot, 2009; Didelot and Maiden, 2010). We simulated 10 replicates for each combination of parameters, and for each replicate the collection of local trees, and the clonal frame, were stored. Sequence data was generated using the local trees and SeqGen (Rambaut and Grassly, 1997) under an HKY85 model (Hasegawa et al., 1985) with transition/transversion rate ratio κ= 3. Some of the parameter combinations were too computationally demanding for SimBac: (ρ= 0.005, G=10Mbp), (ρ= 0.01, G=5Mbp), (ρ= 0.01, G=10Mbp). For all the replicates for which we could run SimBac, we used the clonal frame simulated by SimBac as an input for our new software FastSimBac. In fact, the clonal frame is a major source of variation in sequence patterns between simulations (Ansari and Didelot, 2014). By using the same clonal frames in the two methods we expect less variance in the difference of summary statistics between the two methods; in particular, we eliminate the variance associated with the clonal frame, and this allows us to perform fewer simulations to compare the methods. For all scenarios in which we could not run SimBac, the clonal frame was generated randomly within FastSimBac. Again, we generated local trees in FastSimBac and used these to generate alignments in SeqGen as before.
Approximate Bayesian Computation Inference
We performed Approximate Bayesian Computation (ABC) inference with the local-linear regression approach (Beaumont et al., 2002) as implemented in the R package abc (Csilléry et al., 2012). To test the performance of an ABC scheme based on our BSMC model, we used it with FastSimBac simulations to infer parameters from datasets simulated under the CGC using SimBac. We used a uniform prior distribution over [0,0.005] for the recombination rate ρ, and over [10,1000] for the mean length λ of recombining intervals. The same priors were used for simulating datasets and for performing inference. The aim of the ABC analyses was to infer ρ and λ. For simplicity, the clonal frame simulated in SimBac was assumed to be known (see also Ansari and Didelot, 2014; Hedge and Wilson, 2014), as was the mutation rate θ = 0.005. The genome size was fixed to 1Mbp, and the number of samples to 20. For each true data set simulated with SimBac, we simulated 10,000 approximate datasets under the BSMC in FastSimBac. Only 1% of the simulations in FastSimBac were retained for parameter inference (the 1% with closest summary statistics to the true dataset, see Beaumont et al., 2002). We used two summary statistics: the proportion of incompatible sites (G4) between neighbouring SNPs, and the G4 between SNPs at least 20kbp away. More precisely, we considered the simulated alignment starting from the left end of the genome, and, for the first summary statistic, for each SNP we selected the first SNP occurring on its right; for the second summary statistic, for each SNP we selected the first SNP on its right at least 20kbp away. The idea is that G4 (and linkage disequilibrium) at very short distances (≪ λ) will mostly depend only on the recombination rate ρ, while G4 at long distances (≫ λ) will mostly depend on the product ρλ, so that these two summary statistics together will give sufficient information to estimate ρ and λ.
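The pairing rule used for the two G4 statistics can be illustrated with a short Python sketch. It assumes biallelic SNPs coded as 0/1, each SNP stored as a list of alleles across samples, and SNP positions sorted in increasing order; the function names are hypothetical and are not part of FastSimBac.

```python
from bisect import bisect_left


def four_gamete_incompatible(site_a, site_b):
    """True if all four two-locus haplotypes (00, 01, 10, 11) are observed."""
    return len(set(zip(site_a, site_b))) == 4


def g4_proportion(snps, positions, min_distance=1):
    """Proportion of incompatible SNP pairs.

    Each SNP is paired with the first SNP on its right that lies at least
    min_distance bp away (min_distance=1 gives neighbouring SNPs).
    SNPs with no such partner are skipped.
    """
    n_pairs = 0
    n_incompatible = 0
    for i, pos in enumerate(positions):
        j = bisect_left(positions, pos + min_distance, lo=i + 1)
        if j == len(positions):
            continue
        n_pairs += 1
        if four_gamete_incompatible(snps[i], snps[j]):
            n_incompatible += 1
    return n_incompatible / n_pairs if n_pairs else float("nan")
```

Calling `g4_proportion(snps, positions)` gives the short-range statistic, and `g4_proportion(snps, positions, min_distance=20000)` gives the long-range one under the 20kbp threshold described above.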
We also performed Approximate Bayesian Computation (ABC) inference on a real Bacillus cereus genome alignment (Didelot et al., 2010; Ansari and Didelot, 2014) with the ABC-MCMC scheme (Marjoram et al., 2003). We used uniform prior distributions on [0.0,0.25] for ρ, on [1,10000] for λ, and on [0.01,0.2] for θ (the per-bp per-individual per-generation mutation rate scaled by 2Ne). These are the 3 parameters that we attempt to infer. We simulated entire genome alignments of 13 samples and 5240935 bp, as for the real dataset. We use 7 summary statistics: number of polymorphic sites (real value 629942); G4 (proportion of SNP pairs that are not consistent, breaking the 4-gamete rule) for consecutive SNPs (real value 0.167) and for SNPs at least 2kbp away (real value 0.297); mean linkage disequilibrium (LD, measured as r² = (pAB − pA pB)² / [pA (1 − pA) pB (1 − pB)], where pA is the frequency of allele A in the first SNP, pB the frequency of B in the second SNP, and pAB the frequency of the AB haplotype) for consecutive SNPs (real value 0.396) and for SNPs at least 2kbp away (real value 0.274); and mean number of haplotypes (considering a certain number of SNPs at a time) for pairs of consecutive SNPs (real value 3.003) and for groups of 4 SNPs made of 2 pairs of consecutive SNPs, the two pairs being at a distance of at least 2kbp. While the number of SNPs is informative of the mutation rate, the three summary statistics at short range are informative of the recombination initiation rate, while the three summary statistics at long range are informative of the product ρλ. Number of SNPs, G4 and r2 were also used as summary statistics by Ansari and Didelot (2014). The fact that we are able to generate entire genomes (instead of SNP pairs as in Ansari and Didelot, 2014) allows us also to include summary statistics on groups of SNPs, such as numbers of haplotypes. For simplicity we fixed the clonal frame to the one estimated and used by Didelot et al. (2010) and Ansari and Didelot (2014).
However, we also correct for the branch lengths estimation error caused by recombination. In fact, with increasing recombination, all genetic distances between samples converge to a unique value. We discuss this bias and our corresponding correction in the Supplement. Lastly, in an attempt to further increase the realism of our model, we account for invariable sites. In fact, a large proportion of the sites is polymorphic (about 1 bp every 6 after removing sites with limited coverage) and a large proportion of the genome is expected to be coding; so, in principle, one would expect many homoplasies (site patterns not consistent with the clonal frame and the infinite sites assumption) to occur just due to multiple mutations at one site, and not necessarily involve recombinations. Using back-of-the-envelope calculations (see Supplement) we estimated about half of the genome to be invariant (48.44% of sites) and a transition-transversion ratio of about 5.21. We used these estimates as fixed values within an HKY (Hasegawa et al., 1985) substitution model with invariant sites, instead of the basic JC model (Jukes and Cantor, 1969) implemented in our basic inference and in Ansari and Didelot (2014). This model, together with the local trees simulated by FastSimBac, was used in SeqGen to simulate the alignment from which summary statistics were extracted at each step of the ABC-MCMC. Each run consisted of 10000 ABC-MCMC steps, of which 1000 were used as burn-in.
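As an illustration of the LD summary statistic, the following Python sketch computes r2 for a pair of biallelic SNPs and averages it over consecutive SNP pairs. It assumes the standard definition of r2 from the allele frequencies pA and pB and the haplotype frequency pAB, with alleles coded 0/1 and allele 1 playing the role of A and B; the function names are hypothetical.

```python
def r_squared(site_a, site_b):
    """r2 between two biallelic SNPs, each given as a list of 0/1 alleles per sample."""
    n = len(site_a)
    p_a = sum(site_a) / n
    p_b = sum(site_b) / n
    # Frequency of the haplotype carrying allele 1 at both SNPs.
    p_ab = sum(1 for a, b in zip(site_a, site_b) if a == 1 and b == 1) / n
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        raise ValueError("both sites must be polymorphic")
    return (p_ab - p_a * p_b) ** 2 / denom


def mean_r_squared_consecutive(snps):
    """Mean r2 over all pairs of consecutive SNPs."""
    values = [r_squared(snps[i], snps[i + 1]) for i in range(len(snps) - 1)]
    return sum(values) / len(values)
```

Two identical SNP columns give r2 = 1, while two columns whose alleles are statistically independent in the sample give r2 = 0.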
Results and Discussion
Computational efficiency of BSMC
Thanks to our BSMC approximation that simplifies the coalescent with gene conversion (CGC) by considering many small local ARGs, instead of a unique, large, global ARG, FastSimBac shows great computational improvement in simulating typical bacterial genome evolution. Compared to SimBac, to the best of our knowledge currently the most efficient software, the speed improvements of FastSimBac range from about one order of magnitude for low recombination rate (ρ = 0.001) and genome size (10^6 bp), to two orders of magnitude for more elevated recombination rate (ρ = 0.01) and genome size (10^7 bp), as shown in Figure 3. Also, FastSimBac allows simulation of scenarios with both high recombination rate and genome size which are currently out of reach of other methods, due to excessive requirements in time and RAM. In fact, we see that the performance of FastSimBac relative to the exact coalescent with gene conversion improves as we increase either genome size or recombination rate (Figure 3). As expected, the running time of FastSimBac appears linear with genome size, while this is not true for SimBac. Another benefit of FastSimBac is that, by avoiding the generation of a global ARG, it has small RAM usage, which allows it to efficiently run in parallel on multiple cores.
Accuracy of BSMC
Next, we compare the simulated patterns of genetic variation and local tree features between the exact CGC simulated under SimBac, and the BSMC simulated with FastSimBac. Looking at linkage disequilibrium (LD, measured as r2) and site pair incompatibility (or four-gamete test, G4), we notice that, as expected, LD decreases and G4 increases considerably with increasing recombination rate (Figure 4). There is a lot of variation across different replicates in mean LD, but this is also expected as each replicate has a distinct clonal frame, and the clonal frame influences site patterns of the whole genome. LD and G4 at 1kbp are already very close to those at longer distances, suggesting that a distance of 2λ is sufficient to reach nearly as much LD as any arbitrary distance. Most importantly, we notice that values simulated under the BSMC mimic very closely those simulated under the exact CGC, suggesting that indeed, even at high recombination rates and short distances, the BSMC is a very accurate approximation (Figure 4). Similar results are also observed at different genome sizes (Figure S3).
Additionally, looking at the number of haplotypes present in non-overlapping windows of 10 SNPs, we observe the expected increase with recombination rate (Figure 5A). More importantly, the BSMC again very closely mimics the exact CGC. The genomic variation in number of haplotypes (Figure S4A) is very slightly underestimated, probably because long-range correlations in local trees (after conditioning on the clonal frame) are ignored in the BSMC, while present in the CGC. The mean pairwise genetic distance between samples appears unaffected by recombination and by the model used for simulations (Figure 5B), but recombination does affect the variance of genetic distances over sample pairs (Figure S4B) because it tends to break down the relatedness of samples. Again, both patterns in the CGC are very closely approximated by the BSMC. Looking at mean local tree height (Figure 5C) and mean local tree total branch length (Figure 5D) we see that these are highly variable depending on the simulated clonal frame, but are not considerably affected by the simulation parameters. Again, BSMC and CGC values are very close.
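The haplotype count per window can be made concrete with a small Python sketch. It assumes that SNPs are stored as a list of columns (one list of alleles per SNP, in genomic order) and that a trailing incomplete window is discarded; both choices, and the function name, are our own assumptions for illustration.

```python
def haplotypes_per_window(snps, window=10):
    """Number of distinct haplotypes in each non-overlapping window of `window` SNPs."""
    counts = []
    for start in range(0, len(snps) - window + 1, window):
        block = snps[start:start + window]
        # Transpose the block so that each tuple is one sample's haplotype.
        haplotypes = set(zip(*block))
        counts.append(len(haplotypes))
    return counts
```

Averaging the returned list gives a mean number of haplotypes per window, and its spread along the genome gives the genomic variation in haplotype number.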
BSMC-based ABC inference
We investigated the accuracy and applicability of the BSMC approximation by performing ABC inference of parameters. First, we reconstructed parameters simulated under the exact CGC. We use summary statistics based on incompatibilities indicative of recombination between pairs of sites (G4). Despite the fact that the exact CGC was used to create the original datasets, while our BSMC was used for the ABC, inference was accurate. 95% posterior confidence intervals for ρ and λ (respectively the population-scaled recombination rate and the mean length of recombining intervals) contain the simulated values in both our replicates (Figure 6 and S5). This supports the idea that sequential Markov approximations of the CGC can be used for accurately inferring bacterial evolutionary parameters.
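The rejection step that underlies this kind of ABC analysis can be sketched in Python as follows. This is a simplified illustration and not the analysis pipeline used here: it keeps the fraction of simulations closest to the observed summary statistics but omits the local-linear regression adjustment, the toy simulator is a hypothetical stand-in for FastSimBac (its first statistic depends on ρ and its second on ρλ), and the noise levels and scaling constants are arbitrary assumptions. Only the prior ranges are taken from the text.

```python
import random


def prior_sample(rng):
    """Uniform priors: rho on [0, 0.005], lambda on [10, 1000]."""
    return (rng.uniform(0.0, 0.005), rng.uniform(10.0, 1000.0))


def toy_simulator(params, rng):
    """Hypothetical stand-in for a simulator: returns two noisy summary statistics."""
    rho, lam = params
    short_range = rho + rng.gauss(0.0, 0.0002)       # driven by rho only
    long_range = rho * lam + rng.gauss(0.0, 0.05)    # driven by rho * lambda
    return (short_range, long_range)


def rejection_abc(observed, simulate, sample_prior, scales,
                  n_sims=10000, keep_frac=0.01, seed=0):
    """Keep the parameter draws whose scaled statistics are closest to `observed`."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_sims):
        params = sample_prior(rng)
        stats = simulate(params, rng)
        dist = sum(((s - o) / sc) ** 2
                   for s, o, sc in zip(stats, observed, scales)) ** 0.5
        draws.append((dist, params))
    draws.sort(key=lambda d: d[0])
    n_keep = max(1, int(round(n_sims * keep_frac)))
    return [params for _, params in draws[:n_keep]]
```

With `n_sims=10000` and `keep_frac=0.01`, the 100 retained draws form a crude approximation of the posterior sample; a regression adjustment would then be applied to them in a scheme like the one of Beaumont et al. (2002).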
As an additional example of the applicability of our model and software, we used an ABC-MCMC approach (Marjoram et al., 2003) to infer ρ, λ, and the scaled mutation rate θ for the Bacillus cereus bacterial group. Bacteria of the B. cereus group mostly live in the soil, feeding on dead organic matter, but they can occasionally infect humans and cause a range of diseases, from food poisoning up to deadly anthrax (Arnesen et al., 2008). Disagreement has been found between B. cereus species designation and MLST clade structure and population history, probably due to the contribution of plasmids and genetic recombination to the bacterial phenotype (Priest et al., 2004; Sorokin et al., 2006; Didelot et al., 2009a; Zwick et al., 2012). Furthermore, analyses of MLST data showed discordant results regarding the prevalence of recombination in B. cereus relative to mutation, with estimates ranging from ρ/θ ≈ 0.05 (Hanage et al., 2006), to ρ/θ ≈ 0.2 (Didelot et al., 2009a), to ρ/θ ≈ 0.3 (Didelot and Falush, 2007), up to ρ/θ ≈ 2 (Pérez-Losada et al., 2006), leading to present uncertainty regarding the contribution of recombination to B. cereus evolution. Improving our understanding of recombination in B. cereus would help us recognise the effect of homologous recombination on epidemiological inference and species delimitation (Didelot and Maiden, 2010), and predict the acquisition and spread of infectivity and resistance factors (Perron et al., 2011). In this respect, genome-wide data from multiple strains provide a greater opportunity to study recombination in detail, and here we consider the genome alignment described in Didelot et al. (2010) and Ansari and Didelot (2014), and comprising 13 genomes from the B. cereus group.
Didelot et al. (2010) performed MCMC inference on this dataset using an approximate coalescent model with bacterial recombination (the ClonalOrigin model) that did not allow recombinant lineages to be affected by further recombination, or recombinant lineages to coalesce with each other. They inferred a mean recombination tract length of λ = 171bp with interquartile range [168,175], and ρ/θ = 0.21 with interquartile range [0.20,0.23]. Ansari and Didelot (2014) again used a model similar to the ClonalOrigin one within an ABC-MCMC approach, and accounted for the propensity of lineages to recombine with more closely related lineages than with distantly related ones. They inferred ρ = 0.077 with confidence interval CIρ = [0.036,0.127], λ = 152bp with CIλ = [74,279], and θ = 0.0528 with CIθ = [0.0437,0.0640]. The ClonalOrigin model used by these methods approximates the coalescent with gene conversion, but less closely than the BSMC; in fact, their model leads to overestimation of the recombination rate ρ at elevated recombination and mutation rates which are relevant in this scenario (Didelot et al., 2010). Our BSMC-based ABC-MCMC approach instead allows recombining lineages to coalesce with one another, and recombination events to split the ancestral material of recombinant lineages. Furthermore, differently from these previous methods we account for differences in transition and transversion rates, for invariant sites, and for biases in tree branch length estimation (see Materials and Methods and Supplement).
With our BSMC-based approach, we inferred a higher mean recombination tract length λ (median 592bp and interquartile range [336,885]) than previous estimates (171bp and 152bp from Didelot et al. (2010) and Ansari and Didelot (2014) respectively); this estimate is closer to values inferred from genome-wide likelihood-based analyses in Clostridium difficile (Didelot and Wilson, 2015). We also inferred a considerably lower contribution of recombination relative to mutation (ρ/θ, median 0.0065 and interquartile range [0.004,0.011]) than previous genome-wide studies (0.21 and ≈1.46 from Didelot et al. (2010) and Ansari and Didelot (2014) respectively); this means that recombination has a much lower contribution to evolution in B. cereus than previously thought, and that in fact these bacteria are considerably clonal, although, due to variation in recombination rates between B. cereus clades, our results do of course not apply to all species within the B. cereus group (Sorokin et al., 2006). These results were confirmed by an additional independent estimation run (Figure S7), and can be explained by the fact that we account for invariant sites and for different transition and transversion rates. In fact, invariant sites and a high transition/transversion rate ratio cause more homoplasies than expected under a homogeneous substitution rate; these homoplasies, if unaccounted for, can be interpreted as short recombinant fragments, biasing downward estimates of λ, and upward estimates of ρ/θ. Supporting our interpretation, when we ran our method without accounting for invariant sites we estimated lower λ and higher ρ/θ (Figure S8). Another factor that can explain our larger λ estimate is that in our BSMC model we allow recombination events to interfere with each other, breaking recombinant segments into smaller pieces as expected in the CGC, and this process, if unaccounted for, could lead to a downward bias in the estimation of λ.
We also found correlation between ρ and λ, suggesting that while the total impact of recombination ρλ is easier to estimate, identifying the two individual parameters is harder. We found no correlation instead between other pairs of parameters (Figure S6 A-C, see also Ansari and Didelot, 2014). While our ABC-MCMC seems to capture well the complexity of real data for 5 out of 7 summary statistics, for two of them (G4 at large distances and r2 at short distances) there seem to be discrepancies (Figure S6 D-J and Figure S7 E-K). This might suggest the existence of some complexities that we did not account for in our model, for example the larger rate of recombination between closely related lineages (see Ansari and Didelot, 2014), variable recombination rate between B. cereus clades (Sorokin et al., 2006), prevalent non-homologous recombination (Didelot and Maiden, 2010), population structure (such as due to niche adaptation; Sorokin et al., 2006), recombination with other bacterial groups, variable selective pressure and mutation rate, and alignment errors.
In conclusion, the BSMC offers not only a very computationally convenient approximation to the CGC, but also an accurate one. Our implementation of the BSMC model in the simulation software FastSimBac allows faster simulations, and therefore parameter inference, of bacterial genome evolution, and under a broader range of parameter values. FastSimBac also allows specification of the clonal frame upon which to condition simulations, which can grant simulations a closer fit to particular phylogenies reconstructed from real datasets. But more importantly, by virtue of building on top of the popular simulators ms (Hudson, 2002) and MaCS (Chen et al., 2009), our software includes many evolutionary scenario options that have been included in previous eukaryotic coalescent simulators (Hudson, 2002; Chen et al., 2009) but have remained precluded from bacterial coalescent simulations, such as population structure and migration, speciation histories, changes in population sizes, and recombination hotspots. FastSimBac is available as open source from https://bitbucket.org/nicofmay/fastsimbac. Applications of our model and software are not necessarily restricted to simulations, but, as we have shown, also include inference of recombination rate and other parameters of bacterial evolution. Our simulations suggest that our approach gives results very close to those obtained with the exact CGC, but at considerably reduced computational cost. Our analysis of recombination in the B. cereus group showcases the applicability of our method for inference from genome-wide alignments. Possible further applications of our model include the study and inference of recombination patterns and events, or the estimation of the clonal frame while accounting for recombination. We thus believe that the BSMC and FastSimBac will prove very useful both for benchmarking and for statistical inference on bacterial whole genome sequence data.
Acknowledgements
We are grateful to the creators of MaCS, on which the FastSimBac code is partly based. We also thank Xavier Didelot for sharing the B. cereus dataset. N.D.M. was supported by a James Martin Research Fellowship of the Oxford Martin School. D.J.W. is a Sir Henry Dale Fellow, jointly funded by the Wellcome Trust and the Royal Society (grant 101237/Z/13/Z).