phastSim: efficient simulation of sequence evolution for pandemic-scale datasets

Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, as well as being part of some inference methods. The ongoing surge in available sequence data is however testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software is available from https://github.com/NicolaDM/phastSim and allows easy integration with other Python packages as well as a variety of evolutionary models, including indel models and new hypermutatability models that we developed to more realistically represent SARS-CoV-2 genome evolution.

Sequence evolution simulation is important for many aspects of bioinformatics [1]. Its 2 most ubiquitous applications are for testing and comparing the performance of various 3 essential tools (such as alignment, phylogenetic, and molecular evolution inference 4 software, see e.g. [2][3][4]). However, simulating sequence evolution is also often used for 5 testing hypotheses (e.g. [5]) and for inference, either for example through Approximate 6 Bayesian Computation [6,7], see e.g. [8,9], or, more recently, using deep learning, see 7 e.g. [10][11][12]. 8 Many simulators address the task of simulating gene trees, or ancestral 9 recombination graphs, as well as simulating evolution along these trees (e.g. [13][14][15][16]). 10 Instead, here we focus on the problem of generating sequences given an input tree, as 11 done by "phylogenetic" simulators (e.g. [17][18][19]). Realistic simulation of sequence 12 evolution along a phylogenetic tree is essential, for example, for assessing and improving 13 our methods for inference of SARS-CoV-2 phylogenies, which is a largely still open 14 problem [20]. One important factor is the large numbers of available genome sequences 15 for SARS-CoV-2 (> 3, 000, 000 in the GISAID database [21] as of September 2021). 16 Despite this, there are currently no available simulation frameworks capable of 17 simulating the scale and complex evolutionary features of SARS-CoV-2 and similar 18 genome datasets. For this reason, we focus on the issue of simulating realistic 19 substitution patterns for large datasets of closely related samples, as broadly observed 20 in genomic epidemiology sequence data, and for arbitrarily complex substitution and 21 indel models. 22 Here we show that sequence simulation for such large numbers of genomes is 23 exceedingly computationally demanding for existing software. Complex evolutionary 24 models, for example codon substitution models and rate variation, can cause significant 25 further slow-downs. Furthermore, many existing methods do not allow the simulation of 26 mutational patterns realistic for SARS-CoV-2, such as non-stationary and highly 27 variable mutational processes [22][23][24], or don't allow the simulation of indels. We 28 propose a new approach to efficiently simulate the evolution of many closely related 29 genomes along a phylogenetic tree and under general sequence evolution models. Our 30 approach simulates one mutation (substitution or indel) at a time using the Gillespie 31 method [25], and is further tailored to reduce time and memory demand by efficiently 32 representing and storing information regarding non-mutated positions of the genome. 33 Furthermore, we use a multi-layered search tree structure to efficiently sample mutation 34 events along the genome even when each position has its own mutation rate, and to 35 efficiently traverse the phylogenetic tree and avoid redundant operations. Our approach 36 empowers extremely flexible and fine-grained evolutionary models. For example, Materials and methods 42 We consider the problem of simulating evolution of a DNA (or RNA) sequence along a 43 specified input phylogenetic tree, and under a given evolutionary model. Our simulation 44 approach is based on the Gillespie method [25], as is typically used in molecular 45 evolution simulators [18,19]. We assume that each position of the genome (either 46 nucleotide or codon) evolves independently of the others, and under a time-homogeneous 47 substitution process; that is, the rates of evolution at each position are initially specified 48 by the user or are sampled randomly by the simulator. We focus on the efficient 49 simulation of sequence evolution for large phylogenetic trees with short branches: we 50 assume that only a few mutations happen on each branch across the genome, which is 51 typical for genomic epidemiology, and in particular for SARS-CoV-2 [26]. 52 "Vanilla" approach 53 If we assume that evolutionary rates are homogeneous across the genome, it is simple to 54 use the Gillespie approach efficiently in this scenario by adopting an efficient 55 representation of ancestral genomes in terms of differences with respect to a root 56 genome [27]. As a very simplified example, let's consider the case in which there is no 57 selective force at play, mutation rates are constant across the genome, there are no 58 indels, and all bases mutate into all other bases at the same rate (JC69 model [28] with 59 equal nucleotide frequencies). Throughout the manuscript, we will not assume 60 equilibrium or stationarity in sequence evolution, but instead assume that we are given 61 a genome at the root of the phylogeny, which we then evolve down the tree according to 62 given rates. 63 In this simplified "vanilla" scenario, the total mutation rate across the genome is 64 equal to the mutation rate for one base, 3r, times the genome length (which we assume 65 constant), L. Starting from the root and its genome, we visit each branch of the tree 66 one at the time in preorder traversal. For each branch of the tree, we consider its length 67 t b , and we recursively sample a time for the next mutation from an exponential 68 distribution with parameter 1 3rL . If the sampled time t is over t b , we move to the next 69 branch. Otherwise, we decrease t b by t and we sample a mutation event. In the 70 considered scenario, this simply means sampling one position of the genome at random 71 (a random integer number 1 ≤ i ≤ L), and then a random allele b, different from the 72 current allele at position i, to mutate into. Additional steps are also required to keep 73 track of mutations which have already occurred and allow them to further mutate, for 74 example, possibly reversing a mutated allele back to the reference allele. We track each 75 sampled mutation by adding it to a list of mutations for the current branch. It is worth 76 noting however that there are more efficient ways to keep track of mutations that have 77 already occurred, which we discuss in subsequent sections. A pseudocode description of 78 the algorithm is given in Algorithm 1. So overall, the total cost of this "vanilla" previous mutation events at each new mutation, and this can be significantly reduced as 86 explained in the next section. There is a caveat however. The default output of our 87 software phastSim is a text file where each sample name is followed by a list of 88 differences of the simulated sample genome with respect to the reference. We also allow 89 users to print a Newick format tree, annotated with the simulated mutation events. If, 90 however, we want to produce a file containing the full alignment, the memory and time 91 cost of the algorithm will become O(N L), since this is the size of the alignment. For 92 this reason, we provide the option for the user to generate a FASTA or PHYLIP 93 alignment output, but by default we only generate the more concise version consisting 94 of a list of differences, which usually leads to a very considerable reduction in time and 95 memory demand.

96
Algorithm 1 Vanilla algorithm for one phylogenetic branch. Here evolution on one branch is considered. t b is initialized as the length of the considered branch. r is the mutation rate from one nucleotide to any other nucleotide. L is genome size. ref [i] is the reference allele at position i. "Node" is the child node of the currently considered branch.
Sample t (the time to next mutation event) from an exponential distribution with Sample a random integer 0 < i ≤ L if i is not a position previously mutated in an ancestor of Node then a =ref[i]. else a is the current allele for Node end if Sample a random new allele b = a. Add mutation (i, a, b) to the list of mutations of Node. Update current allele for Node at position i as b.
Sample t from an exponential distribution with parameter 1 3rL . end while In classical implementations of sequence evolution simulators [17], for each node of 97 the tree we need to update each base of the genome one at the time, therefore incurring 98 in cost O(N L). Therefore, when the number of expected mutations is M N L we 99 expect an advantage in using this approach.

100
A considerable limitation of the above "vanilla" approach is that we assume that 101 rates are the same across the genome, and this is hardly realistic [29,30]. We 102 implemented a simple extension of this algorithm above which accounts for both an 103 arbitrary nucleotide substitution model (UNREST [31]), and for rate variation across 104 the genome in terms of a finite number of rate categories. To achieve this, we extended 105 the algorithm above to keep track of which positions of the genome have which rates.

106
This allows us to efficiently calculate total mutation rates for each class of sites, and to 107 efficiently sample sites within a class. 108 We also implemented a new model of rate variation in order to better fit the  [22,23]. 116 However, as the number of site classes increases, and as the number of alleles 117 increase (for example when considering codon models), the efficiency of the extension of 118 the vanilla approach described above deteriorates, especially when each site of the genome is given different evolutionary rates. For this reason, we developed a more 120 complex algorithm that remains efficient in light of rate variation, with only a small 121 efficiency sacrifice relative to the vanilla method in the scenario of no rate variation. We 122 allow phastSim users to choose between the vanilla approach or the more complex one, 123 that we call "hierarchical" and describe below. Advanced features, for example 124 simulation of indels, are only implemented with the hierarchical algorithm.

125
"Hierarchical" approach 126 Binary search "genome" tree 127 We first describe the structure and algorithm that allow us to efficiently sample a 128 mutation event along the genome when each position might have a distinct mutation 129 rate. This structure needs to be efficiently updatable following a mutation event; in fact, 130 a mutation event changes the allele at a position of the genome, and therefore also its 131 mutation rate. This is very similar to the problem of sampling from a categorical 132 distribution with many elements, where the probabilities can be slightly modified at 133 each sample [32]. A Huffman tree [33] would be close to optimal for this task, however, 134 here we implement a binary search tree, which has a slightly higher expected cost [32] 135 but allows us to more efficiently model blocks of contiguous nucleotides, and therefore 136 to efficiently simulate indels.

137
In our "genome" search tree (which is distinct from, and should not be confused 138 with, the phylogenetic tree), each node corresponds to a contiguous block of nucleotides 139 along the genome. The root node represents the whole genome, and contains a rate 140 value corresponding to the global mutation rate of the whole genome. The two children 141 of the root correspond to the first and the second half of the genome, respectively.

142
There is no overlap between the regions considered by each child node, and their union 143 gives the region considered by the parent node. Consequently, the sum of the rates of 144 the children of a node is equal to the rate of the node. Given this structure, we also 145 refer to this binary search tree as the "genome" tree. A terminal node of the genome 146 tree corresponds to one unit of the genome, either a base or a codon, depending on the 147 model we choose for simulations. A terminal node contains not only information about 148 the position of the unit along the genome, but also the reference allele at this position 149 and the mutation rates associated with it, to allow sampling of a specific mutation event 150 at the given position/node. A graphical representation of an example genome tree is   Example genome tree and genome tree search. An example genome search tree for ancestral genome ACGGT. Blue nodes are terminal and red nodes are internal. Inside each node we represent on top the genome positions represented by the node; at the center inside terminal nodes we show the allele of the node; at the bottom of nodes is their total rate. Under each terminal node we show the example relevant mutation rates. The black arrows show an example sampling of one mutation event. A parameter R is assigned an initial random number sampled uniformly between 0 and the total rate 8.1, in this case it is R = 4.7. As we move downward, the value of R can decrease, as described in Algorithm 2, determining which site will mutate and how. Here, an initial R = 4.7 results in the sampling of a G→T mutation at genome position 4.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 23, 2021. ; Algorithm 2 Sampling of a substitution event along a genome tree, and updating the genome tree. We assume we are given node "Root", the root of a genome tree structure. For any given genome node, "Node", Node.rate represents the corresponding total subtree rate. Node.children represents the list of its child nodes (2 children for internal nodes, 0 for terminal nodes). Node.parent is the parent node of "Node". Node.allele is its current allele. Node.rates (which is only allocated for terminal nodes) represents the 2-dimensional matrix of mutation rates (from any allele to any other allele) at the given position, and for simplicity here we assume that a rate of an allele into itself (a matrix diagonal entry) is 0; for codon models, for efficiency the rows of the matrix are only allocated and filled when they are needed for the first time. Node.position (only defined for terminal nodes) refers to the genome position represented by the node. The list "mutations" is used to record the mutation events simulated on the considered phylogenetic branch. The function "sample(rates)" samples a rate from a list proportional to its value. We are also given a random number 0 ≤ R <Root.rate for which we want to sample the corresponding mutation event.
Node=Root As mentioned before, once a mutation event is sampled, we need to modify the 170 sampling process so that the change in allele at the mutated position is taken into 171 account, since this change usually affects local and global mutation rates (a rare 172 exception is for example when substitution rates are all equal). Modifying our genome 173 tree following a substitution event is both simple and efficient: we simply need to 174 modify the rates and allele at the mutated terminal node, and then update the rate of 175 all ancestors of this terminal node accordingly. Algorithm 2 for example describes how 176 to sample a substitution event from a genome tree as well as how to update the genome 177 tree accordingly. Again, this can be done in log(L) time for each new substitution event 178 sampled. However, while this is efficient for simulating evolution along a temporal line, 179 that is, along a single branch of the phylogeny descendant from the root, it becomes 180 inefficient for simulating evolution along a phylogenetic tree. This is because, if we 181 modify the tree, then we cannot use it as it is for the sibling nodes. In other words, 182 September 23, 2021 7/26 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 23, 2021. ; https://doi.org/10.1101/2021.03.15.435416 doi: bioRxiv preprint when we reach a split in the phylogenetic tree, and we have two children of the same 183 phylogenetic node, we need to pass the same genome tree of the phylogenetic parent 184 node to both phylogenetic children. However, we can't only pass a pointer to the same 185 tree to both children, because evolving along one branch leading to one sibling would 186 modify the genome tree also for the other sibling. If we take the approach of duplicating 187 the genome tree at each phylogenetic split, we end up with a cost O(N L), which we are 188 trying to avoid. For this reason, we devise an alternative, hierarchical, multi-layered 189 approach to evolving a genome tree, described below. Later on in the text we also 190 describe the extension of our approach and of the genome tree structure to simulate 191 indels.

192
Hierarchical, multi-layer evolving genome tree 193 In order to use our genome tree structure to sample mutations along a phylogenetic tree, 194 we add a further "vertical" dimension to it. At each branch of the phylogenetic tree, 195 instead of modifying a genome tree, we take the approach of building on it, without 196 modifying the starting genome tree nodes, so that the original genome tree is not lost 197 but instead is preserved at "layer 0" of our multi-layer structure. When we sample a 198 mutation, we create a few new genome tree nodes in the corresponding layer of the 199 structure, instead of not an entire new genome tree. By doing this, we can effectively 200 adapt a (multi-layer) genome tree as new mutations are sampled without losing the 201 original genome tree. This means that when we can pass the same genome tree to two 202 children of a phylogenetic node without needing to duplicate the genome tree structure. 203 Instead, we simply remove (de-allocate, or ignore) the genome tree nodes that have been 204 added to other layers by the descendants of the first child node, and pass the same 205 genome tree structure to the two considered child nodes. A graphical representation of 206 an example multi-layer genome tree and its evolution is given in Fig 2.

207
We start with a genome tree at the phylogenetic root node; additional nodes are 208 then added at further layers. A genome tree layer n represents the genome nodes 209 specific to a particular depth of the phylogenetic tree; phylogenetic nodes closer to the 210 root (in terms of number of branches that need to be traversed from the root) will be 211 associated with a lower n, and those more distant from the root with higher n. All the 212 initial nodes of the original genome tree belong to layer 0, the layer corresponding to 213 the phylogenetic root. Then, as we move from the phylogenetic root to its first child, we 214 add nodes to the tree in layer 1, representing the consequences of mutation events 215 happening along the branch between the phylogenetic root and the first child. Nodes in 216 layer 0 only point to nodes in layer 0, and never to nodes in other layers. More 217 generally, nodes in layer n only point to nodes in layers m ≤ n. Every time the 218 multi-layer genome tree is passed from phylogenetic parent (layer n) to child (layer 219 n + 1), new nodes are added to the corresponding layer (n + 1) if mutation events occur 220 on the corresponding phylogenetic branch. 221 We traverse the phylogenetic tree in preorder traversal, so, starting from the root, we 222 move to the first child, to which we pass the initial genome tree, add new layers, then 223 do the same for this child's children. For each new mutation occurring on this 224 phylogenetic branch connecting the root and its first child, we traverse the genome tree, 225 and every time we would modify the genome tree (to update the mutation rates 226 following a change of allele at a position) we instead create new genome tree nodes in 227 the child layer. Once we have traversed the whole phylogenetic subtree of the first child 228 of the root, we have to move to second child of the root. This operation does not incur 229 the cost of duplicating any part of the genome tree, as we only need to pass to the 230 second child the pointer to the root of layer 0 of the hierarchical genome tree. Similarly, 231 at any internal phylogenetic node at layer n, to both children we pass the pointer to the 232 root of layer n of the genome tree. The only additional step which might be required is 233 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made  Example of multi-layer genome tree and its evolution. We track the evolution of the multi-layer genome tree starting from the genome tree of Fig 1. Colors for the genome tree are the same as in Fig 1. In green, on the left side of each panel, we show an extract of the phylogenetic tree containing three nodes ("P" for parent, which in this example is the root of the phylogeny, and "L" and "R" for left and right node). "L" has further descendants, but we don't show them here and only focus on this triplet of nodes as an example. The orange arrow along the phylogenetic tree shows the current step of the preorder traversal being considered by the given panel. Black arrows show past steps. Vertical dashed black lines in the multi-layer genome tree connect nodes that represent the same portions of the genome but that are in different layers. "L0" stands for "Layer 0" and "L1" for "Layer 1". A At the phylogenetic root "P" we initialize the genome tree for layer 0. B As we move to child "L", a new substitution is sampled (as in Fig 1) and 3 corresponding genome nodes are created in layer 1. These nodes correspond to the nodes in the original genome tree whose rate is affected by the new mutation. C As we traverse the subtree of the descendants of L, new nodes and mutations might be added in the layers below. D We are finished traversing the subtree of the descendants of L, and we return to L, at which point all nodes in layer below 1 have either been removed or have become irrelevant. E We return to P, at which point the genome tree nodes previously added layer 1 are also ignored or deleted. F We move from P to R, and in doing so new mutation events might be sampled and the corresponding genome nodes might be added to layer 1 (new genome tree nodes corresponding to 1 new substitution are shown in the new layer 1).
the de-allocation of nodes in layer n + 1 as we move from one node to its sibling (thanks 234 to our preorder traversal, the nodes currently in and below this layer will not be used 235 again), but this step at most only slows simulations by a small constant factor.

236
At the start of the simulations for each branch, moving from layer n to n + 1, we 237 first create a new genome root node for layer n + 1. This root initially points to the 238 same children as the genome root at layer n, and it also has the same total rate. After 239 creating a new layer root, we sample mutation events for the current phylogenetic 240 branch. To sample mutations, we follow the binary search tree determined by the root 241 of layer n + 1. As a new mutation event is picked, we either create new layer n + 1 242 nodes, or modify existing layer n + 1 nodes. When sampling a new mutation, every time 243 we reach a node in the genome tree, we either modify the rate of the node, if it's in layer 244 n + 1, or we create a new layer n + 1 node, if the original node was in a different layer. 245 The new node is given at first the same children as the original node. When a terminal 246 node is reached, we calculate its new rates (unless they have already been created before 247 for some other node in the phylogenetic tree, in which case we just retrieve them from a 248 dictionary) and total rate, and we pass the new total rate to its parent node, which uses 249 it to update its own total rate, and so on. In total, the cost of sampling a new mutation 250 event and updating the multi-layered structure is O(log(L)). A sketch of the mutation 251 sampling process and multi-layer genome tree update is given in (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made Algorithm 3 SampleMutation(Node,Layer,R): Sampling of a mutation event along a multi-layer genome tree. This function is initially run on the root node "Root" of a genome tree for layer "Layer". Parameters are as in Algorithm 2; in addition, Node.layer represents the layer of the considered node. While below we simplify a few details, in reality we don't recalculate rates at every mutation, but we only calculate them the first time they are needed, and then store them in dictionaries.
if Algorithm 4 SimulatePhyloNode(Node,GenomeNode,Layer): Hierarchical algorithm for simulating sequence evolution along a branch of the phylogenetic tree. Here evolution on a branch of the phylogenetic tree is considered. The branch is passed through Node, which represents the child node of the branch. The branch length is Node.length. Simulation of the whole phylogeny is performed by calling SimulatePhyloNode(Root,GenomeRoot,0), where Root is the root of the phylogenetic tree (we assume Root.length=0) and GenomeRoot is the root of the initial genome tree for layer 0. This layer 0 genome tree is created by considering the genome of the phylogenetic root, which is typically either sampled at random or read from a reference genome.
Sample t (the time to next mutation event) from an exponential distribution with parameter 1/GenomeNode.rate CurrentTime= t while CurrentTime<Node.length do Sample a random uniform vaule 0 ≤ R <GenomeNode.rate GenomeNode=SampleMutation(GenomeNode,Layer,R) {note that this is the function defined by Algorithm 3} Sample t (the time to the next mutation event) from an exponential distribution with parameter 1/GenomeNode.rate CurrentTime=CurrentTime+t end while for Child in Node.children do run SimulatePhyloNode(Node,GenomeNode,Layer+1) {note that this function is the one defined in the current Algorithm, which is therefore recursive in nature} end for if needed, de-allocate all nodes of layer Layer from GenomeNode down to its descendants. {Because genome nodes with layer Layer are descendant only from nodes with layer Layer, we do not need to traverse the whole multi-layer genome tree, but only its layer Layer.}

259
We further extended the multi-layer genome tree approach to efficiently simulate 260 insertions and deletions. Each leaf on the genome tree is assigned a deletion rate, 261 insertion rate, and substitution rate, denoted R d , R i , and R s respectively, and the total 262 mutation rate for the leaf will be R d + R i + R s . The substitution rate R s itself is the 263 sum of all substitution rates from the current allele of the leaf. Insertions are modeled 264 as occurring on the right (3 end) of the sampled position; to model insertion at the 5 265 end of the genome, a dummy terminal genome tree node is employed representing the 266 leftmost end of the genome, and is initialized with R s = R d = 0 but with non-zero R i . 267 Just as with the substitution rates, which can be site specific, R d and R i can be drawn 268 from a gamma distribution, or can be constant across the genome. When a mutation 269 event is sampled at a node, it will be sampled as a deletion, an insertion, or a 270 substitution proportionally to R d , R i and R s .

271
Our software allows for indels with lengths drawn from a number of parametric 272 distributions following the options allowed with INDELible, see Table 1 for an overview 273 of the various distributions that have been implemented. Sampled indels have always 274 length ≥ 1.

275
Below we explain in more detail the algorithm used to efficiently simulate insertions 276 and deletions using multi-layered genome trees. In short, insertion events are simulated 277 by adding new small subtrees to the genome tree in the current layer. Deletion events 278 are instead simulated by setting substitution and indel rates to 0 in deleted nodes.

280
The algorithms for simulating insertions and deletions mostly proceed as the one 281 simulating substitutions (Algorithm 3) in that we traverse the genome tree to find the 282 terminal node "Leaf" affected by the next sampled mutation. We then sample the type 283 of the next mutation event (insertion, deletion, or substitution) proportional to the 284 corresponding mutation rates R i , R d and R s of "Leaf". The process for simulating a 285 substitution remains the same as before. If instead a new insertion event is simulated, 286 we sample a length l for the inserted material from the corresponding prior distribution, 287 and then add a new subtree to the genome tree as detailed in Algorithms 5 and 6.

288
Algorithm 5 insertNode(Node, l): this function inserts a new genome subtree at the given terminal genome tree node "Node" at which the insertion is sampled, given the insertion length l. We assume that Node is part of the current layer, and that new nodes are created at the current layer. insertionRootNode = populateGenomeTree(l) {This calls algorithm 6 to generate an insertion subtree of size l.} Create a new genome tree internal node newInternalNode newInternalNode.parent = Node.parent Replace Node with newInternalNode as child of Node.parent . Node.parent = newInternalNode insertionRootNode.parent = newInternalNode newInternalNode.children = [Node, insertionRootNode] newInternalNode.rate = Node.rate + insertionRootNode.rate Note that the addition of new subtrees to the genome tree will typically make it less 289 balanced, and a potentially less efficient search tree. In typical scenarios considered 290 here, that is, when divergence is low and all genomes are closely related to the root, the 291 effect of this imbalance on the overall search-efficiency of the genome tree will be 292 extremely minor. If the next mutation event at Leaf will instead be a deletion, again, we first sample a 295 deletion length l, and then we proceed to set to 0 the total mutation rate for node Leaf 296 and its following l − 1 positions of the genome in the current layer, ignoring positions 297 September 23, 2021 13/26 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made  (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 23, 2021. ; https://doi.org/10.1101/2021.03.15.435416 doi: bioRxiv preprint Algorithm 7 deleteNodes(Rand,GenomeNode,Layer,RemainingDeletions): recursive algorithm for deleting nodes of the genome tree following a deletion event. It returns the number of positions deleted. "Rand" is the random number that has been used to sample the deletion event -here it's used to direct the search to the first deleted positions and the following ones. Thanks to our algorithm, we can allow any substitution model without incurring a 306 dramatic increase in computational demand, and without risking numerical instability 307 (which can sometimes be a problem with classical matrix exponentiation approaches).

308
Users can easily specify different nucleotide substitution matrices (e.g. JC [28], 309 HKY [34], or GTR [6]). By default, we adopt the most general nucleotide substitution 310 model, UNREST [31], using as default rates those we estimated from SARS-CoV-2 [22]. 311 We also implemented codon models, which, with our hierarchical approach, come at 312 only a small additional computational demand compared to nucleotide models. To and separately model the nucleotide mutation process and the amino acid selection one. 315 Unlike GY94 (which assumes an HKY nucleotide mutation process), we allow any 316 general nucleotide mutation process as defined by an UNREST matrix. Then, 317 nonsynonymous mutations rates are modified by a single factor ω (see next section for 318 variation of ω across codons). Under this model, a substitution from codon c 1 to codon 319 c 2 therefore has rate: if c 1 and c 2 are synonymous and differ only by nucleotides n 1 and n 2 at a position , ωm n1→n2 , if c 1 and c 2 are non-synonymous and differ only by nucleotides n 1 and n 2 at a position , 0 , if c 1 and c 2 differ by more than one nucleotide , where m n1→n2 is the mutation rate from nucleotide n 1 to nucleotide n 2 .

321
We don't allow, at this stage, instantaneous multi-nucleotide mutation events, or 322 amino acid substitution models, but we plan to address them in future extensions. A 323 description of currently implemented models and a comparison with those in other 324 similar simulation software is given in Table 2. We consider four types of variation in rates across the genome. These types can be used 327 in combination, or separately, as required.

328
The first type of variation is changes in the position-specific mutation rate across the 329 genome. Every nucleotide position i in the genome (even when using a codon model) is 330 assigned its own mutation rate scaling factor γ i . This means that, at position i, the 331 mutation rate from any nucleotide n 1 to any other nucleotide n 2 becomes γ i m n1→n2 . 332 We allow two ways to sample values of γ i for each i. One way is to sample them from a 333 continuous Gamma distribution with parameters Γ(α, α), with α specified by the user; 334 this results in each genome position having a distinct γ i . Alternatively, we allow the 335 definition of discrete categories, with a finite number of categories, each with its own 336 proportion of sites and γ rate.

337
The second type of variation we model is variation in ω, with each codon position i 338 across the genome being given its own ω i . As with γ i , values of ω i can either be 339 sampled from a continuous Gamma or a finite categorical distribution. 340 Lastly, to accommodate the strong variation in mutation rates observed in 341 SARS-CoV-2 [22,23] attributable to APOBEC, ADAR, or ROS activity, we introduce a 342 new model of rate variation. This model allows, for a certain position, to have one 343 specific mutation rate (from one specific nucleotide to another specific nucleotide) 344 enhanced by a certain amount µ. In this case we only allow a categorical distribution, 345 with the first category having no enhancement (µ = 1) and the other categories having 346 µ > 1. For any nucleotide position i that is assigned a hypermutable category and 347 therefore has µ i > 1, we then sample uniformly a start nucleotide n s and a destination 348 nucleotide n d . The mutation rates m i n1→n2 for position i then become: if n 1 = n s or n 2 = n d , γ i µ i m n1→n2 , otherwise . (2) Rate normalization 350 We assume that, for the given input phylogenetic tree, branch lengths represent 351 expected numbers of substitutions per nucleotide -no matter if a nucleotide or a codon 352 model is used. However, as mutations accumulate across the phylogeny, the total 353 mutation rate of the genome might slightly change. This is because we allow 354 substitution models that are not reversible and not at equilibrium. Therefore, we always 355 consider the mutation rates at the root genome to normalize the mutation rates. This 356 means that while branch lengths near the root accurately represent the expected 357 numbers of substitutions per nucleotide, lower down the tree this might slightly change. 358

359
As default, our software creates an output file where it stores information about which 360 genome position evolved under which rate. It also creates a file where each tip name is 361 listed together with the mutations it contains that distinguish its genome from the 362 reference genome. In scenarios similar to SARS-CoV-2 datasets (where each genome is 363 very similar to the reference), this format requires much less space and time to generate 364 than FASTA or PHYLIP formats (see the "vanilla" approach subsection).

365
An optional output format that our software can create is a tree in Newick format, 366 where each branch of the input phylogeny is annotated with a list of mutation events 367 that occurred on that branch. This format is richer than the others, as it provides 368 information regarding each mutation event, even those that might be over-written by 369 other mutations at the same position; it is also more efficient than multiple sequence 370 alignment formats in the scenario of short branch lengths considered here. We also 371 allow a binary analogue of this annotated Newick tree, called a MAT (mutation 372 annotated tree) [37], which is compatible with the phylogenetic software UShER [27]. 373 Finally, we also allow the creation of unaligned FASTA output. However, note that 374 the creation of a FASTA file costs O(N L) in time and space. In the case simulations are 375 performed without indels, we also allow the generation of a PHYLIP format alignment 376 output.  To assess the performance of our approach compared to existing evolutionary simulators, 385 we consider different scenarios typical for genomic epidemiology. First, we consider the 386 simulation of a scenario similar to SARS-CoV-2 evolution. We simulate trees with a 387 custom script and under a Yule process with birth rate equal to genome length (29,903), 388 so to have in the order of one mutation per branch. Evolution is simulated under an 389 UNREST model [31] with rates inferred from SARS-CoV-2 data [22] where possible (for 390 phastSim, pyvolve [36] and INDELible [18]) and a GTR model [6] otherwise (for 391 Seq-Gen [17]). We run INDELible both using method 1 ("INDELible-m1", which uses 392 matrix exponentiation to model substitutions) and method 2 ("INDELible-m2", which 393 instead uses the Gillespie approach for the same task). For now we ignore sequence rate 394 variation. While phastSim and pyvolve are both Python implementations, therefore 395 sharing similar benefits (high compatibility with other packages and ease of extensions) 396 and draw-backs (reduced efficiency compared to some other languages), we see that the 397 two approaches have dramatically different computational demands (Fig 3): simulating 398 50 sequences under pyvolve requires on average more time than simulating 500,000 in 399 phastSim. We can also see that INDELible-m2 is marginally more efficient than simulating 500,000 sequences in phastSim (Fig 3), despite the fact that INDELible is 404 coded in C++. Seq-Gen appears to be very efficient, but it's still more than one order 405 of magnitude slower than phastSim on large phylogenetic trees in this scenario. Also 406 note that, for large trees considered here, we can reduce computational demand in 407 phastSim by more than 5-fold by not producing a FASTA output alignment; this way 408 we can also save very significant amounts of memory demand. Regarding small trees 409 (< 10 4 tips) most of the demand in phastSim is associated with initializing the 410 simulations (loading packages and initializing the genome tree structure); these 411 initialization costs do not depend on tree size, and instead depend on genome size, and 412 they are why phastSim is relatively less efficient on small trees. If simulation on small 413 trees are indeed of interest, these initialization costs could be reduced by re-using the 414 same genome tree structure over multiple replicates, or, in the case of simple 415 evolutionary models, by using our "vanilla" simulation approach.

416
Bacterial scenario 417 To demonstrate a scenario in which we are interested in simulating bacterial genome 418 evolution within one outbreak, we use the E. Coli reference genome 419 (https://www.ncbi.nlm.nih.gov/nuccore/U00096.3 [40], 4,641,652 nucleotides) as 420 our root genome sequence. For now we still focus on the simple scenario of a nucleotide 421 model without rate variation. We again assume a scenario typical for genomic 422 epidemiology, that the birth rate of the simulated tree is equal to the genome length.

423
The number of mutations simulated is therefore comparable to the number of branches 424 in the tree.

425
As genome size increases, time and memory demand of traditional simulators is 426 expected to grow linearly. Indeed, we now see that Seq-Gen takes considerably more 427 time to simulate the same number of genomes than in the SARS-CoV-2 scenario (Fig 4). 428 phastSim also has an increased computational demand, but only in terms of the initial 429 step of generating an initial genome tree. This initial cost is linear with respect to 430 genome length, but does not increase with the number of samples or with the number of 431 mutations simulated. In total, in this scenario phastSim can simulate sequence evolution 432 along trees with more than 1000 times more samples than Seq-Gen. A further reduction 433 in computational demand, in particular in terms of the initial cost of generating a 434 genome tree, can be obtained by using the "vanilla" algorithm (Fig 4), which however 435 comes at the cost of narrowing the choice of evolutionary models to less complex ones. 436 September 23, 2021 18/26 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made  On the Y axis we show the number of seconds it takes to perform simulations using different software. On the X axis is the number of tips simulated. Each boxplot represents ten replicates. We do not run the most demanding simulators when each replicate would take substantially more than 1 minute to run. In blue is the computational demand for generating the random trees with a customised version of NGESH [39] distributed within the phastSim package; sequence simulation is performed conditional on these simulated trees. In red is the time to run phastSim with a concise output, and in orange is the time for phastSim with additionally generating a FASTA format output. In green is the demand of pyvolve, and in purple of Seq-Gen. In yellow and brown are respectively the time for running INDELible with method 1 (matrix exponentiation) and method 2 (Gillespie approach). On the Y axis we show the number of seconds it takes to perform simulations using different software. On the X axis is the number of tips simulated. Each boxplot represents ten replicates. We do not run Seq-Gen for more than 1000 tips due to high computational demand. In red is the time to run phastSim, and in orange is the time for phastSim with the "vanilla" approach. In purple is the time demand of Seq-Gen.

Evolutionary and Indel models 437
One of the advantages of the approach we present here is that simulating evolution 438 under increasingly complex models comes at almost no additional computational cost 439 (Fig 5). It can be seen, for example, that INDELible-m1 and Seq-Gen incur a 440 significantly higher cost when using a continuous variation in mutation rate, and that

441
INDELible-m2 is more demanding when simulating discrete rate categories.
442 Surprisingly, running INDELible with a codon model appears to come with no 443 additional computational demand, similarly to phastSim (Fig 5). For these comparisons 444 we have considered the SARS-CoV-2 simulation scenario.

445
Our algorithm also allows efficient simulation of insertion and deletion events 446 (indels). Among the other simulators considered here, only INDELible can simulate 447 indels. In the SARS-CoV-2 scenario, phastSim can simulate substitutions and indels for 448 considerably larger phylogenies (about 10 times larger) than INDELible for the same 449 computational run time (Fig 6).

450
The impact of branch lengths

451
Probably the main limiting factor in the applicability of the approach presented here 452 are tree branch lengths. Since the demand of our approach is affected linearly by the 453 number of mutation events, and as we scale up the length of the tree we need to 454 simulate more mutation events, then the length of the phylogenetic branches will 455 significantly affect the performance of our approach. We can see that, in the 456 SARS-CoV-2 scenario, the impact is not strictly linear (Fig 7). This is because there On the Y axis we show the number of seconds it takes to perform simulations using different software. On the X axis is the model used for simulations: "nucleotide" is a nucleotide substitution model without variation; "nuc+10cat" is a nucleotide model with 10 rate categories; "nuc+alpha" is a nucleotide model with continuous variation in rate (each site has a distinct rate sampled from a Gamma distribution); "codon" represents a codon substitution model; "codon+10cat" represents a codon substitution model with 10 categories for ω; "codon+alpha" is a codon model with continuous rate variation in mutation rate and in ω (only allowed in phastSim). Each boxplot represents ten replicates. Seq-Gen does not allow codon models. Colors are as in Fig 3.   We have introduced a new approach to simulating sequence evolution that is 468 particularly efficient when used on phylogenies with many tips and with short branches. 469 Our software phastSim implements this new algorithm and is implemented in Python, 470 allowing it to be easily extended and combined with other Python packages. phastSim 471 relies on the ETE 3 tree phylogenetic structure, and in particular it uses ETE 3 to read 472 input phylogenetic trees. This allows flexibility in the phylogenetic tree input format. 473 Furthermore, thanks to the fact that the efficiency of the algorithm is not affected by 474 the complexity of the substitution model used, we allow a broad choice of evolutionary 475 models, such as codon models with position-specific mutation rates and selective 476 pressures. We also implement a new model of hypermutability to more realistically 477 describe the mutational process in SARS-CoV-2. Also, we can efficiently simulate indel 478 events, which are rarely modeled by other simulation packages. 479 We show that, compared with other simulators, phastSim is more efficient in the 480 scenarios common to genomic epidemiology, that is, when simulating many closely 481 related bacterial or viral genomes. Its particular efficiency with bacterial genomes 482 means that it ideally matches the needs of software that simulate bacterial ancestral 483 recombination graphs (e.g. [9,41]). phastSim can also be easily run using the output of 484 September 23, 2021 22/26 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted September 23, 2021. On the Y axis we show the number of seconds it takes to perform simulations using different software. On the X axis is the rescaling factor we use to make the phylogenetic tree branch lengths longer or shorter. Colors are as in Fig 3. To aid the visual comparison, we use trees of different sizes for different simulators: 1000 tips for INDELible; 5,000 tips for Seq-Gen; 100,000 tips for phastSim.
phylogenetic simulator, most relevantly VGsim [42] which allows fast simulations of very 485 large and short phylogenies typical of SARS-CoV-2 and other genomic epidemiological 486 scenarios, and which also allows the simulation of the effects of selection on the 487 phylogenetic tree shape. phastSim is implemented as a Python package, which allows 488 for easy integration into other Python pipelines.

489
In the future, it would be possible, and of interest, to expand the features of 490 phastSim, in particular allowing a broader spectrum of models, for example allowing 491 column-specific amino acid fitness profiles; also, it could be possible to implement the 492 described algorithm in more efficient programming languages.

493
In conclusion, we have presented a novel algorithm, and corresponding software 494 implementation phastSim, to efficiently simulate sequence evolution along large trees of 495 closely related sequences. This new approach considerably outperforms other methods 496 in the scenarios of genomic epidemiology, for example when simulating SARS-CoV-2 497 genome sequence datasets. This approach also allows for more realistic models of 498 sequence evolution, allowing more efficient and accurate sequence data simulation and 499 inference.

501
We are very thankful to Vladimir Shchur for the valuable suggestions on our work.