A scalable analytical approach from bacterial genomes to epidemiology

Recent years have seen a remarkable increase in the practicality of sequencing whole genomes from large numbers of bacterial isolates. The availability of this data has huge potential to deliver new insights into the evolution and epidemiology of bacterial pathogens, but the scalability of the analytical methodology has been lagging behind that of the sequencing technology. Here we present a step-by-step approach for such large-scale genomic epidemiology analyses, from bacterial genomes to epidemiological interpretations. A central component of this approach is the dated phylogeny, which is a phylogenetic tree with branch lengths measured in units of time. The construction of dated phylogenies from bacterial genomic data needs to account for the disruptive effect of recombination on phylogenetic relationships, and we describe how this can be achieved. Dated phylogenies can then be used to perform fine-scale or large-scale epidemiological analyses, depending on the proportion of cases for which genomes are available. A key feature of this approach is computational scalability and in particular the ability to process hundreds or thousands of genomes within a matter of hours. This is a clear advantage of the step-by-step approach described here. We discuss other advantages and disadvantages of the approach, as well as potential improvements and avenues for future research. This article is part of a discussion meeting issue ‘Genomic population structures of microbial pathogens’.


Introduction
Over the past decade, the cost and time required to sequence whole bacterial genomes have reduced dramatically [1]. Sequencing is frequently applied to many or all isolates in local outbreaks, or to a high proportion of cases in more endemic situations, as well as large retrospective and longitudinal collections. This genomic data has huge potential to deliver new insights into the evolution and epidemiology of bacterial pathogens, which can lead to better control measures. However, the lack of scalable methodology for analysis of this genomic data represents an important bottleneck for the realization of their full potential.
A gold standard for the analysis of pathogen genomic data has been set by the integrated phylogenetic frameworks implemented for example in BEAST [2] and BEAST2 [3]. These phylodynamic tools were originally conceived for viral genetics and are still mostly used for that purpose, but have also been increasingly applied to bacterial genomic data [4]. One of the strengths of these tools is that they can infer a dated phylogeny by combining the genomic data with the dates of isolation, resulting in estimates for the dates of the common ancestors in the phylogeny. Such dated phylogenies are extremely useful to draw epidemiological interpretations from the genomic data, as we will see. Another advantage of the integrated phylogenetic frameworks is that they include a number of powerful extensions, for example, to use relaxed clock models [5], to estimate past population dynamics [6], geographical spread [7][8][9] or transmission between hosts [10,11]. This integrated approach has many natural advantages but also limitations especially in terms of scalability to analyse larger datasets.
These limitations of the integrated approach are especially important in bacterial genomics, where the genomes are orders of magnitude longer than in viral genetics and often subject to recombination. The ClonalOrigin model [12] of bacterial evolution has been integrated into BEAST2 [13], but the resulting algorithm is too computationally intense to be applied to whole genome datasets. Here we present an alternative step-by-step approach.
The step-by-step approach is illustrated in figure 1. In the first step, a phylogeny is constructed from a genomic alignment in a way that accounts for recombination events. In the second step, this phylogeny is dated. In the third step, the dated phylogeny is interpreted in terms of a number of epidemiological properties. Many software packages are available to perform each of these steps, including but not limited to the ones named in figure 1, although it is worth noting that many of these tools have emerged only in the past few years, and so are still work in progress and expected to improve in the near future. In this article we review each of the steps of this approach in turn. We also pay special attention to the 'cracks' between the steps, since these are often ignored in articles that focus on each of the steps rather than the whole step-by-step approach. Finally, we demonstrate the usability of this approach by applying it to a complete collection of Staphylococcus aureus ST239 genomes.

Recombination-aware phylogenetic analysis
Even a relatively low amount of recombination can invalidate the results of phylogenetic tools if not accounted for [14,15]. It is therefore essential to detect recombination events to correctly reconstruct the clonal genealogy, that is the phylogenetic relationship between genomes when the ancestral lines of recipient cells rather than donor cells are followed for each ancestral recombination event. Special phylogenetic methods have been developed for this purpose, including Gubbins [16] and ClonalFrameML [17] which is based on the ClonalFrame model [18]. However, these tools are often underexploited, typically to build a recombination-corrected tree without paying attention to the recombination events and regions that have been detected.
A lot can be learnt from studying the inferred recombination events themselves. Recombination is useful to help us understand how species are being formed [19] and the population structure within species, especially when the origin of recombination events is being investigated [20]. These recombination patterns often reflect important driving evolutionary forces such as ecology [21], adaptation [22] or selective pressures [23]. For example in Streptococcus pneumoniae, recombination events have been shown to be driven by antibiotic usage in a localized dataset [24] and by immune pressure in a global collection of the PMEN1 lineage [25]. The latter study also represents a good example of how the temporal signal can become much clearer once recombination is correctly accounted for [25,26]. Recombination is also useful for the analysis of genome-wide associations between genotypes and phenotypes, since it separates new genetic variants from their original genomic background [27].
Accounting for recombination when reconstructing phylogenies is an important starting point for many epidemiological studies. A method often used is to extract from the genomic alignment the sites that have not been affected by recombination and to build a phylogeny using these sites only. Both Gubbins and ClonalFrameML are often used in this way, to create a recombination-free alignment which is then passed on to BEAST. However, this method works only if relatively few recombination events happened throughout the tree. For example, consider the simulated dataset shown in figure 2. The true clonal genealogy is shown in figure 2a and the true recombination events that happened on each of the branches are shown in figure 2b. These data were simulated using a standard coalescent model for the phylogeny [28], a strict clock model of mutation with rate θ/2 = 0.005 per site, a model of recombination coming from external sources [18] with initiation rate ρ/2 = 0.001 per site, average length of recombination δ = 1500 bp and distance of the source v = 0.05. For clarity we used a relatively small dataset of 20 sequences of 100 000 bp each. In this simulated dataset, there was not a single site that was not affected by recombination on at least one of the branches. On the other hand, every branch had some sites unaffected by recombination (figure 2b).
We applied ClonalFrameML [17] to this dataset using a PhyML tree [29] as starting point. The reconstructed clonal genealogy is shown in figure 2c and the inferred recombination events are shown in figure 2d, and they are in very good agreement with the true simulated tree and events shown in figure 2a,b. ClonalFrameML correctly inferred that there was not a single site unaffected by recombination on at least one of the branches. Therefore an alignment containing only the non-recombinant sites would contain no sites, and could not be used as a starting point for further analysis. On the other hand, the inferred clonal genealogy shown in figure 2c can be used in our proposed step-by-step approach. It has the same topology as the true clonal genealogy (figure 2a) and very similar branch lengths, with a weighted Robinson-Foulds distance [30] of 0.005 between the true and ClonalFrameML trees. Gubbins [16] was also applied to this dataset using RAxML [31] as a tree builder. The correct topology was inferred, with a weighted Robinson-Foulds distance of 0.03 between the true and Gubbins trees.

Dating the ancestors in a phylogeny
Once a recombination-corrected tree has been reconstructed, it is possible to study the temporal signal in this tree and to date the common ancestors in the tree. Multiple software tools have recently been developed to perform dating on a phylogeny, including BactDating [26] which is specifically aimed at bacterial genomes, but also LSD [32], treedater [33] and TreeTime [34]. BactDating uses Bayesian statistics, whereas treedater and TreeTime are based on maximum likelihood, which is equivalent to Bayesian maximum a-posteriori (MAP) inference under a uniform prior on dates [35] so that the posterior distribution is proportional to the likelihood. It is often important to use a relaxed clock model in this step that allows the evolutionary rate to vary between lineages [5]. An additive relaxed clock model has recently been developed which is more biologically realistic and leads to better dating of pathogen phylogenies than the previous relaxed clock model [36].
In our proposed step-by-step approach, the reconstruction of a dated phylogeny and its epidemiological interpretation are separated. One disadvantage of this is that the prior (or lack of ) on dates used to reconstruct the dated phylogeny is not the same as the one that would be implied by the epidemiological models used in subsequent analyses. This statistical issue could be resolved by considering the difference between the probability of a tree in the two models used for the dating and the epidemiology. For example, an importance sampler could be applied to postprocess the results, and produce corrected results that account for this difference between the two tree distributions [37]. Alternatively, the role of the prior can be assessed by comparing inference with and without the genomic data, or by comparing inference under different prior models to make sure that the posterior distributions remain consistent [38]. However, this issue is often small enough in practice to make little difference to the results [39], especially if the method used to build the dated phylogeny was based on the likelihood only, or if a mild prior was used such as the coalescent with constant population size [28].
To illustrate this, we simulated five years of an outbreak model [40] with within-host diversity N e g = 0.25 year, basic reproduction number R 0 = 2, generation time distribution Exponential(1) in years and sampling proportion π = 0.1. A total of 59 cases were sampled in this outbreak, with the samples being related to each other as shown in the 'true' dated phylogeny in figure 3a. We applied a strict clock model to this true dated phylogeny in order to produce an undated 'observed' phylogeny. We used a rate μ = 5 substitutions per year which is of the same order of magnitude as many bacterial pathogens [41]. The undated 'observed' phylogeny was then used, along with the known dates of sampling, to infer an 'estimated' dated phylogeny using Bact-Dating [26] with prior set to the coalescent with constant population size [28]. This prior is completely different from the outbreak model that was used to generate the phylogeny [40], which is not coalescent due to the host structure and where the population size is clearly growing since the reproduction number was greater than one. Figure 3b shows the 'estimated' dated phylogeny, which is in good agreement with the 'true' dated phylogeny in figure 3a despite the complete difference between the epidemic model used for simulation and the coalescent model used for inference.
At the same time as dating is performed, the substitution rate is typically estimated which provides a useful value to compare with previous estimates [41] in order to make sure that the dating is working as expected. Statistical methods can also be used to ensure that the temporal signal is significant, in particular by using a date-randomization test [42,43]. This test involves making sure that the inferred substitution rate is larger when using the correct dates for the genomes than when the dates are permutated in at least 20 randomized datasets [42]. This method requires several runs of the dating method to be performed, and it is, therefore, useful for this to be as fast as possible, which is achieved in our step-by-step method by separating the phylogenetic inference from the dating.
Furthermore, the root of the phylogeny is typically estimated during the dating step, since the trees generated by standard phylogenetic tools are not rooted whereas dated trees are always rooted by definition, with the date of the root being the date of the last common ancestor of the whole sample. If the root has already been determined robustly, for example using one or ideally several closely related outgroups [44], then this information can be preserved during the dating. This can be achieved by using as input into the dating software a phylogeny that does not include the outgroups but that is rooted as was informed by the outgroups, and turning off the estimation of root position [26,33]. If on the other hand the root is undetermined, or arbitrarily selected for example using the midpoint method [45], then the fact that dating the phylogeny simultaneously performs rooting provides an additional reason for dating the tree, which becomes much more informative in terms of epidemiology once it is dated and rooted.

From dated phylogeny to epidemiology
A dated phylogeny is very useful to learn about the epidemiology of the bacteria under study, and sometimes the dating directly provides answers to questions of interest beyond the age of pathogens [46]. For example, several antibiotic-resistant lineages have been dated to have emerged around the time when the corresponding antibiotics were starting to be used, highlighting the link between consumption and resistance [47,48]. As another example, the dating of the common ancestors between pairs of Clostridium difficile patients in a hospital allowed transmission for many pairs to be ruled out, concluding that nosocomial transmission was less frequent than previously thought [49].
It can often be useful to identify clusters of significantly similar genomes in a dataset. The most commonly used approach is to use a separate dedicated algorithm that uses the genomic data for this purpose, such as HierBAPS [50], fastbaps [51] or PopPunk [52]. These methods do not make use of a phylogeny, but it is often useful to show their results overlaid on a dated phylogeny using colours for example. Another approach is to use additional non-genomic data to do the clustering given the phylogeny, as performed for example by AdaptML [53], treebreaker [54] and treeSeg [55]. Finally a third option is to try and identify directly on the dated phylogeny the lineages that seem to be ruled by different dynamics, for example using treestructure which does not rely on an explicit phylodynamic model [56] or CaveDive [57] which is focused on the detection of clonal expansions.
The dated phylogeny can also be used as a starting point for further analysis. In particular, past variations in the bacterial population size have a direct effect on the shape of the dated phylogeny, so that the population size through time can be estimated and presented as a skyline plot [58]. The methodology for performing such an analysis was originally developed within BEAST, which simultaneously estimates the dated phylogeny and the demographic function [6]. However, the step-by-step approach requires estimation of the demographic function from a dated phylogeny (figure 1), and several software tools have recently been released for this purpose including phylodyn [59,60], skygrowth [61] and mlesky [62]. Beyond a simple model of varying population size, it is also possible to fit an epidemiological compartmental model such as the susceptibleinfected-recovered model [63], and therefore to estimate the parameters of this model such as the transmission rate or removal rate. Such an inference can be achieved by formulating a structured coalescent model that corresponds to the compartmental model [64,65]. Existing software for fitting such a model to a given dated phylogeny include rcolgem [66] and phydyn [67]. The same methods based on the structured coalescent can also be applied to a dated phylogeny in order to reconstruct past geographical migrations [9], although such phylogeographic inference is much more often based on discrete trait analysis, for example using the ace command from the R package ape [68] or in the Next-Strain platform [69]. The worldwide spread of the current pandemic of Vibrio cholerae has been described using such techniques [70,71]. BEAST can also be useful in this step from dated phylogeny to epidemiology, by inputting the dated tree estimated in the previous step as the initial tree to be used in the Markov chain Monte Carlo (MCMC) and deactivating the updates on the tree. This results in a run in which the dated phylogeny is fixed, so that the run time is typically much faster. This strategy can be used for example to reconstruct a demographic function or to perform a trait analysis on large datasets [72,73].
When the genomes are densely sampled within an epidemic, it can be useful to try and reconstruct the transmission tree of who infected whom [74]. Within-host diversity and evolution is significant for many bacterial pathogens which blurs the relationships between transmission tree and phylogeny [75]. However, TransPhylo can infer the transmission tree from a dated phylogeny in a way that accounts for within-host evolution [40,76,77]. Significant uncertainty typically remains in the inferred transmission tree, which is captured by the use of Bayesian statistics within TransPhylo. More precise inference can sometimes be obtained by combining the genomic inference with epidemiological data [78].
A drawback of separating the dating step from the interpretation step is that the uncertainty in dating is typically not passed on to the epidemiological analysis. This can be achieved by running on multiple samples from the posterior of dated phylogeny and averaging the results [79], or reweighting according to the posterior probability in the epidemiological analysis [37], but in practice the phylogenetic uncertainty is usually not accounted for. However, this is not often a significant issue in practice. To illustrate this, we simulated a dataset for a small outbreak with just 10 cases, using an epidemic model [40] with basic reproduction number R 0 = 1, within-host diversity N e g = 0.25 year, mean generation time of 1 year, sampling proportion of π = 0.5 and a strict clock model with rate μ = 5 substitutions per year. The dated phylogeny was inferred using BactDating [26] and we extracted the first (after burnin) and the last trees sampled by the MCMC, as shown in figure 4a,c. We then reconstructed the transmission events using TransPhylo [40] separately for each of these two dated trees, as shown in figure 4b,d. These analyses used default parameters except that the parameters for the generation time distribution, offspring distribution, within-host diversity and sampling proportion were assumed known. In spite of small differences in the two dated phylogenies, the inferred results in terms of transmission chains were very similar.

Example of application
To illustrate the use of the step-by-step approach from bacterial genomes to epidemiology, we apply it to a state-of-the-art dataset, using only a standard laptop computer and paying particular attention to the time taken by each step. We collected all available genomes of Staphylococcus aureus ST239 (electronic supplementary material, table S1). This collection is made of 521 assembled genomes, only small subsets of which had been comparatively analysed in previous studies [80][81][82][83]. The genomes were collected between 1982 and 2010 from all parts of the world (451 from Asia, 46 from Europe, 18 from Americas, 2 from Africa, 2 from Oceania and 2 unknown). All genomes were aligned using MuMMER v. 3.1 [84] against the reference genome TW20 which is a member of ST239 [85] and therefore included in the collection. This resulted in a reference-anchored alignment that took only a few minutes to generate, since each pairwise alignment against the reference genome can be performed in parallel. Alternatively, assembly pipelines are often based on reference-based mapping of the sequencing reads, for example using BWA [86] and SamTools [87]. This can also be performed in parallel and results in a similar referenceanchored alignment. Note that the whole genome alignment was used as input for phylogenetic reconstruction, rather than an alignment of variant sites only, as would be produced for example using the software SNP-sites [88]. Alignments of variant sites can be used for standard phylogenetic inference if some correction is applied [89], but they cannot be used for recombination-aware phylogenetics since the genomic distance between variant sites becomes an important factor [18].
A first phylogeny was built using PhyML v. 3.3 [29] which took approximately 3 h. This was used as the starting point to build a recombination-corrected phylogeny using ClonalFra-meML v. 1.12 [17], which took approximately two days to run. The same analysis using Gubbins v. 2.4.1 [16] gave very similar results and took approximately one day to run. The PhyML, ClonalFrameML and Gubbins analyses used default parameters. This step currently represents a clear bottleneck in the application of the step-by-step approach, which should be addressed in the near future through the development of new parallelised algorithms. Significant recombination was found, with a total of 198 recombination events detected throughout the phylogeny. The relative rate of recombination versus mutation was estimated to be R/θ = 0.144, meaning that on average mutation events were about f7 times more frequent than recombination events. The mean length of royalsocietypublishing.org/journal/rstb Phil. Trans. R. Soc. B 377: 20210246 recombination events was estimated to be δ = 619 bp which is in good agreement with previous estimates for S. aureus [17,90,91]. The mean distance between donor and recipient was estimated to be ν = 0.31%, which corresponds approximately to the distance between ST239 and some of its closest relatives such as CC8 [92]. The relative effect of recombination versus mutation was therefore estimated to be r/m = R/θ × ν × δ = 0.28, so that 3 to 4 times more substitutions are caused by mutation than by recombination. These results confirm that recombination plays a role in S. aureus evolution, although not as dramatic as in some other bacterial pathogens [14,93,94].
We detected a strong temporal signal in the recombination-corrected phylogeny on the basis of a regression analysis of root-to-tip distances against isolation dates (R 2 = 0.57) and using the date-randomization test [42] (with 100 replicates and CR2 criterion). We, therefore, computed a dated phylogeny using BactDating v. 1.1 [26] with default parameters including use of the additive relaxed clock model [36]. This step took approximately 3 h to run for 10 6 MCMC iterations, and the inferred dated phylogeny is shown in figure 5a. The isolation dates were unknown for 36 of the 521 genomes (electronic supplementary material, table S1), but BactDating can accommodate this by treating the missing dates as additional parameters that are inferred simultaneously as the dates of the common ancestors [26]. The evolutionary rate was estimated to be 7.05 substitutions per year throughout the genome, with credible interval between 6.43 and 7.67. This estimate is in good agreement with several previous estimates in ST239 [80,82] and other lineages of S. aureus [48]. The root of the ST239 phylogeny was estimated to have existed in 1958, with credible interval ranging between 1951 and 1965. This is again in good agreement with previous estimates and coincides with penicillins being increasingly used to treat bacterial infections [80,82,95].
We used the dated phylogeny as input into treestructure v. 0.1.2 [56] with default parameters to determine whether there were significant differences in the phylodynamic properties of sublineages within the tree. This analysis took less  royalsocietypublishing.org/journal/rstb Phil. Trans. R. Soc. B 377: 20210246 than a minute to perform and found no significant differences, which means that the tree can be treated as a whole in phylodynamic reconstructions [56]. We therefore applied skygrowth v. 0.3.1 [61] to the whole dated tree using the maximum a-posteriori method. This analysis took less than a minute and the estimated demographic function is shown in figure 5b, with an approximately exponential rise of the effective population size between 1960 and 1995, and a plateau between 1995 and 2010. This is in good agreement with previous skyline analyses of ST239 [82,95]. We do not seek to say more about the epidemiological dynamics of ST239 since our aim with this application was to test the applicability of the step-by-step method to a relatively large dataset, rather than study it in detail.

Discussion
The step-by-step approach described above (figure 1) has several drawbacks compared to an integrated approach. A practical disadvantage is that multiple tools need to be applied one after the other, with the need to make sure that the output of one tool is a suitable input for the next tool. The software tools have been developed separately, and format conversion is sometimes required when combining them, which introduces a risk of error being made. Method developers should make every effort to minimize this risk, for example by providing practical examples of source code combining new tools with pre-existing ones, and including verifications in each tool that the input is formatted as expected. Furthermore, any results from a step-by-step approach are only as good as the software components involved, many of which are imperfect and under development, which highlights the importance of thorough testing using both simulated and real datasets.
Another concern with the step-by-step approach relates to statistical soundness. In an integrated approach, a complex model is formed by combining multiple simpler models into a consistent whole, for example a model describing how the pathogen population size varied over time, another  Figure 5. Example of application of the approach to a collection of Staphylococcus aureus ST239 genomes. The dated phylogeny was inferred using BactDating (a) and the past population size dynamics was inferred using skygrowth (b).
royalsocietypublishing.org/journal/rstb Phil. Trans. R. Soc. B 377: 20210246 model describing how these fluctuations affect the genealogy and yet another model describing how mutation and recombination events affect the genomes given the genealogy. Inference is then performed on the combined model, with all uncertainties being accounted for simultaneously and in all directions: for example the uncertainty on a mutation event will feed into the uncertainty on the past population size, and vice versa. By contrast, in the step-by-step approach, each of the tools makes separate modelling assumptions, which may not always be consistent with each other. An example of this was discussed in §3, where the prior used for the reconstruction of a dated phylogeny was not the correct one, but figure 3 showed that the result can still be correct. Furthermore, the uncertainty can only be passed from one tool to the next in the order that they are being applied, and even in this direction it is frequent to use the best estimate from one method as the starting point of the next, without passing on any uncertainty. Again this is not necessarily a problem in practice, as illustrated in figure 4 where the uncertainty on the phylogeny had little effect on the uncertainty of the transmission tree. From a statistical point of view, the integrated approach, therefore, represents a gold standard, although statisticians have recently noted that joint inference under a combined model carries the risk that misspecification in any of the model parts can affect estimates from the others in unpredictable ways [96]. Further research is needed on this in the context of genomic epidemiology, as well as research on how to avoid the statistical issues described above with the step-by-step approach.
A key advantage to the step-by-step approach we described is that by breaking down the problem into simple steps, it becomes easier to solve, a strategy often called 'divide and conquer' in the computer science literature. The running time is greatly improved compared to an integrated approach, which quickly becomes intractable as more model components are combined into a large model. An example of this concerns the difficulty to integrate recombination into a phylodynamic framework [13]. A similar situation occurs when aligning sequences and building a phylogeny: in principle alignment and phylogeny would benefit from being performed simultaneously [97,98] but in practice this is too computationally challenging. The lower running time of the step-by-step approach also means that it is more scalable to the large numbers of bacterial genomes currently available, and this scalability is probably the main reason for a recent increase in popularity [39,69]. The step-by-step approach can deal with hundreds or even thousands of genomes, although some of the steps can become slow. This is particularly true for the time taken to reconstruct a recombination-corrected phylogeny, which depends not just on the number of genomes but also on their lengths, diversity, and the recombination rate. For very large datasets, it can be useful to divide them into lineages which can be analysed separately and in parallel [49,99]. New tools have recently emerged to deal with the very large numbers of sequences of SARS-CoV-2 genomes [100,101] which are likely to provide inspiration for how to improve the future scalability of bacterial genome analysis.
Perhaps even more importantly, a counterintuitive advantage of the step-by-step approach is that it is less automatic than the integrated approach. Although this may seem like a disadvantage, the fact that several software tools have to be applied one after the other brings great benefits. It allows the user to check after each step that the result makes sense before carrying out the next step. For example, if a phylogeny is clearly wrong due to contamination during sequencing, there is no point trying to apply dating of the nodes or interpreting the phylogeny in terms of epidemiology. Since each tool is focused on a simpler task, it is easier for the user to check the validity of the assumptions made, and if needed to compare models or the results of several software tools, or apply more complex models, since each step is relatively quick. These checks and refinements provide the user with a better understanding of their data and the analysis process, rather than relying on 'black-box' or 'turnkey' analysis. This is one of the most important advantages of the step-by-step approach, since it creates good conditions for a balanced interpretation of the data and results.
Data accessibility. Our paper uses data available from the NCBI website.
All accession numbers are provided in electronic supplementary material, table S1 [102].