SiPhyNetwork: An R package for Simulating Phylogenetic Networks

Gene flow is increasingly recognized as an important macroevolutionary process. The many mechanisms that contribute to gene flow (e.g., introgression, hybridization, lateral gene transfer) uniquely affect the diversification of dynamics of species, making it important to be able to account for these idiosyncrasies when constructing phylogenetic models. Existing phylogenetic-network simulators for macroevolution are limited in the ways they model gene flow. We present SiPhyNetwork, an R package for simulating phylogenetic networks under a birth-death-hybridization process. Our package unifies the existing birth-death-hybridization models while also extending the toolkit for modeling gene flow. This tool can create patterns of reticulation such as hybridization, lateral gene transfer, and introgression. Specifically, we model different reticulate events by allowing events to either add, remove, or keep constant the number of lineages. Additionally, we allow reticulation events to be trait-dependent, creating the ability to model the expanse of isolating mechanisms that prevent gene flow. This tool makes it possible for researchers to model many of the complex biological factors associated with gene flow in a phylogenetic context.

Network simulation using SiPhyNetwork relies on three core functions: sim.bdh.age(), sim.bdh.taxa.ssa(), and sim.bdh.taxa.gsa(), collectively referred to as the sim.bdh() functions. All three functions use the same 126 simulation algorithm but have different stopping conditions (discussed below). With the exception of arguments 127 setting the stopping condition, each simulation function takes the same set of arguments to specify the model (Table   128 2). These functions generate evonet objects from the ape software package (Paradis and Schliep, 2019), which are 129 themselves extensions of the phylo objects used for phylogenetic trees. These objects can then be stored in the ex-  We model the branching process of the phylogeny with speciation, extinction, and hybridization events. The lineage 138 diversification process has exponentially distributed waiting times for the events, with constant-rate parameters λ 139 for speciation, µ for extinction, and ν for hybridization. Since hybridization requires two lineages, we consider the 140 rate of hybridization (ν) on each species pair, whereas speciation (λ) and extinction (µ) are rates on each lineage. 141 For a phylogeny with N taxa, a rate on each species pair means that hybridization events occur at an effective rate 142 of N 2 ν. The overall waiting time until the next event of the birth-death-hybridization process is exponentially 143 distributed with rate where the probability for each event is weighted by its effective rate for N taxa (i.e., N λ for speciation, N µ for 145 extinction, and N 2 ν for hybridization). In SiPhyNetwork users specify a value for each rate (lambda, mu, and nu) 146 by providing values in the sim.bdh() functions.

147
For each hybridization event, we denote the genetic contributions of the two parental lineages-also called 148 inheritance probabilities-as γ and 1 − γ. Inheritance probabilities are drawn from a user-defined distribution 149 that draws values from 0 to 1, allowing for asymmetric inheritance, where one parent contributes more genetic  SiPhyNetwork has a versatile system for simulating many modes of gene flow by allowing for reticulation events 178 with different macroevolutionary patterns (see Figure 1). We denote each type of event by the net change in the

195
Existing phylogenetic network simulators allow different subsets of these reticulation patterns (Table 1).  of the genetic distance d ij between taxa i and j at a given time: where the relationship between hybridization and genetic distance is user-specified. However, in practice we use the where P i denotes the set of paths from the taxon i to the root, γ e is the inheritance probability of the associated 236 edge e, and e is the edge length of e. This formulation is identical to the covariance computation of Bastide et al.

237
(2018) with the exception that we take the sum of edge lengths that are not shared across paths, instead of taking 238 the edges that are shared.

239
In SiPhyNetwork genetic-distance dependence is modeled by providing a genetic-distance function for the 240 hyb.rate.fxn argument. Users have the flexibility to define any arbitrary function that relates hybridization 241 success to genetic distance. This function takes the genetic distance as an argument and should return a number 242 that represents the probability of hybridization success, as shown in the example below.

243
## Hybridization fails if the distance is greater than 2.5 Additionally, we have implemented the same decreasing functions as Woodhams et al. (2016), i.e., linear decay, exponential decay, snowballing decay, and polynomial decay: Here, s and t are values set by the user that are used to affect the shape and rate of decay for each function. We    (Figure 4). If frac is less than one, then that proportion of extant taxa will be sampled.

347
Further, if the argument stochsampling=TRUE, then each extant taxon will be sampled with probability frac.  parameterizations (e.g., µ > λ). Similarly, for the birth-death-hybridization process, the specific combinations of 393 λ, µ, ν values will affect the probability that a simulation reaches its stopping condition.  Table 1: Simulation model features present in existing tools employing the birth-death-hybridization process. Reticulation type is denoted as generative (denoted as m-type in Janssen and Liu, 2021), neutral (n-type), and degenerative (y-type), referring to the net gain, stasis, or reduction in the number of lineages from an event, respectively (see Figure 1). A distribution on the inheritance probability γ enables users to specify an arbitrary distribution to determine the inheritance proportions of the parental lineages at hybridization. Hybridization dependence allows the success of hybridization to rely on either genetic distance or on a trait that evolves on the phylogenetic network.  A logical that if TRUE, starts the process with two lineages that share a common ancestor, else starts the process with one lineage complete A logical to return the complete or reconstructed network Figure 1: Macroevolutionary patterns of gene flow. Orange circles denote the parental nodes that lead to the reticulate node, while dashed orange arrows indicate the two parental lineages that contribute to gene flow. Lineage generative hybridization (m-type) occurs when a reticulation event results in a gain of one lineage. Lineage neutral hybridization (n-type) results in the a net-zero change in the number of lineages. Lineage degenerative hybridization (y-type) reduces the number of lineages by one. Reticulation events on phylogenetic networks are typically drawn using one of two conventions: A) having all parental nodes lateral to the hybrid node, or B) fusing any in-degree 1 out-degree 1 nodes, potentially placing parent nodes at different times than the hybrid node. Both depictions, however, have the same topological relationships.