Genealogical distances under low levels of selection

For a panmictic population of constant size evolving under neutrality, Kingman’s coalescent describes the genealogy of a population sample in equilibrium. However, for genealogical trees under selection, not even expectations for most basic quantities like height and length of the resulting random tree are known. Here, we give an analytic expression for the distribution of the total tree length of a sample of size n under low levels of selection in a two-alleles model. We can prove that trees are shorter than under neutrality under genic selection and if the beneficial mutant has dominance h < 1/2, but longer for h > 1/2. The difference to neutrality is for genic selection with selection intensity α and for other modes of dominance.


Introduction
Understanding population genetic models, e.g. the Wright-Fisher or the Moran model, can be achieved in various ways. Classically, allelic frequencies are described by diffusions in the large population limit, and for simple models such as two-alleles models, the theory of onedimensional diffusions leads to prediction for virtually all quantities of interest (Ewens, 2004). Moreover, starting with Kingman (1982) and Hudson (1983), genealogical trees started to play a big role in the understanding of the models as well as of DNA data from a population sample. Most importantly, all variation seen in data can be mapped onto a genealogical tree. Under neutral evolution, the mutational process is independent of the genealogical tree. As a consequence, the length of the genealogical tree is proportional to the total amount of polymorphic sites in the sample.
Genealogies under selection have long been an interesting object to study (see e.g. Wakeley, 2010 for a review). Starting with Krone and Neuhauser (1997) and Neuhauser and Krone (1997), genealogical trees under selection could be described using the Ancestral Selection Graph (ASG). In addition to coalescence events, the fitness differences make it necessary to study the history of all possible ancestors, leading to splitting events in the ASG. The disadvantage of these splitting events is that they make this genealogical structure far more complicated to study than the coalescent for neutral evolution.
In recent years, much progress has been made in the simulation of genealogical trees under selection. Mostly, these simulation algorithms use the approach of the structured coalescent, which is based on Kaplan et al. (1988). Here, the allelic frequency path is generated first, and conditional on this path, coalescence events are carried out. (See also Barton et al., 2004 for a formal derivation of this approach.) Simulation approaches based on this idea include the inference method by Coop and Griffiths (2004), msms by Ewing and Hermisson (2010), and discoal Kern and Schrider (2016). However, the structured coalescent approach can hardly be used to obtain analytical insights (with some notable exceptions, see e.g. Taylor, 2007).
Recently, genealogies under selection have been studied by Depperschmidt et al. (2012) using Markov processes taking values in the space of trees, i.e. the genealogical tree is modelled as a stochastic process which is changing as the population evolves. As for many Markov processes, the equilibrium of this process gives the equilibrium tree and can be studied using stationary solutions of differential equations. In our manuscript, we will make use of this theory in order to compute an approximation for the total tree length under a general bi-allelic selection scheme, which is assumed to be weak; see Section 4. Our results are extensions of Theorem 5 of Depperschmidt et al. (2012), where an approximation of the Laplace-transform of the genealogical distance of a pair of individuals under bi-allelic mutation and low levels of selection was computed.
The paper is structured as follows: In Section 2, we introduce the model we are going to study, i.e. genealogies in the large population limit for a Moran model under genic or incomplete dominance, over-or under-dominant selection. The last three cases we collect under the term other modes of dominance. We give recursions for the Laplace transform (and the expectation) of the tree length of a sample in Theorem 1 and Corollary 4 for genic selection, and in Theorem 2 and Corollary 7 for other modes of dominance. In Section 3, we discuss our findings and also provide some plots on the change of tree lengths under selection. Section 4 gives some preliminaries for the proofs. In particular, we give a brief review of the construction of evolving genealogies from Depperschmidt et al. (2012). Finally, Section 5 contains all proofs.

Model and main results
We will obtain approximations for the tree length under selection. While Theorem 1 and its corollaries describe the case of genic selection, Theorem 2 and its corollaries deals with other modes of dominance. All proofs are found in Section 5.
2. Every line is hit by a mutation event from • to • at rate θ • > 0, and by a mutation event from • to • at rate θ • > 0.
3. Every line of type • places an offspring to a randomly chosen line at rate α.
Mutation leads to an expected change dX of θ Recall that the frequency X of • in the population follows -in the limit N → ∞ -the SDE for some Brownian motion W; see e.g. (5.6) in Ewens (2004).  Figure 1: In a population of size N following Moran dynamics with resampling, mutation, and genic selection in equilibrium, it is possible to study genealogical trees as given on the right. Here, the full tree of all individuals in the model is drawn, but it is as well possible to study the tree of a population sample.
In the sequel, we will rely here on the possibility to pick a sample from the Moran population in the large population limit at equilibrium and describe its genealogical tree, which is given by (i) genealogical distances between any pair of individuals, resulting in an ultra-metric tree and (ii) marks on the tree which describe mutation events from • to • or from • to •; see also Figure 1. This possibility is implicitly made by the ancestral selection graph from Neuhauser and Krone (1997), and formally justified by some results obtained in Depperschmidt et al. (2012); precisely, their Theorem 4 states that the genealogical tree under selection has a unique equilibrium.
We will write P α [.] for the distribution of genealogical trees under the selection coefficient α and E α [.] for the corresponding expectation. In particular, P 0 [.] and E 0 [.] are reserved for neutral evolution, α = 0. Within the genealogical tree, we pick a sample of size n and let L n be the (random) length of its genealogy. We note that in the absence of selection, L n does not depend on the mutational mechanism and L n d = n k=2 kT k , where T k ∼ exp k 2 , k = 2, ..., n are the coalescence times in the tree; see e.g. (3.25) in Wakeley (2008). In particular, for λ ≥ 0, (2) with f 1 = 1 since the empty product is defined to be 1. We are now ready to state our first main result, which gives a recursion for an approximation of the Laplace transform of the tree length under selection for small α.
Remark 2 (Comparing neutral and selective genealogies). 1. Note that for α = 0, (3) gives precisely (2). Moreover, there is no linear term in α in the recursion (3). This finding is reminiscent of Theorem 4.26 in Krone and Neuhauser (1997) and Theorem 5 in Depperschmidt et al. (2012), but we note that for other models of dominance, a linear term arises; see Theorem 2.
2. Let us compare tree lengths under neutrality and under selection qualitatively. Crucially, the quantity d n as given in (10) is positive. As consequences, by the recursions, e n from (7) is positive, c n from (6) is positive, b n from (5) is positive, and a n from (4) is positive. The effect is that x n for small α is larger than under neutrality, i.e. E α [e −λL n ] > E 0 [e −λL n ] for small α, which implies that genealogical trees are generally shorter (in the so-called Laplace-transform-order) under selection. In particular, we have shown the intuitive result that expected tree lengths are shorter under selection.
3. While x n , a n are quantities within the selected genealogies, all other quantities can be computed under neutrality, α = 0. However, if one would like to obtain finer results, i.e. specify the O(α 3 )-term in (3), more quantities within selected genealogies would have to be computed. In principle, this is straight-forward using our approach of the proof of Theorem 1.
Since we can directly obtain expected tree lengths from the Laplace transforms in Theorem 1, we obtain also a recursion for expected tree lengths.
Corollary 5 (Genealogical distance of two individuals under genic selection).

Other modes of dominance
In a diploid population, (1) only models the frequency of • correctly if selection is genic, i.e. if the selective advantage of an individual which is homozygous for • is twice the advantage of a heterozygote. For other modes of dominance, we have to introduce a dominance coefficient h ∈ (−∞, ∞) and replace the selective events in the Moran model by the following transitions: 3'. Let X be the frequency of • in the population. Every line of type • places an offspring to a randomly chosen line at rate α(X + h(1 − X)). Every line of type • places an offspring to a randomly chosen line at rate αhX.
Note that 3'. is best understood by assuming that every line picks a random partner and if the pair is a heterozygote, it has fitness advantage αh, and if it is homozygous for •, it has fitness advantage α. (Here, we have assumed that h ≥ 0, but some modifications of 3'. also allow for We will write P α,h [.] for the distribution of genealogical trees and allele frequencies under this scenario, and E α,h [.] for the corresponding expectation. With this notation, we have P α [.] = P 2α,1/2 [.]. We note that h = 0 means a positively selected recessive allele, while h = 1 refers to a dominant selectively favoured allele. Again, we obtain an approximation of the Laplacetransform of the tree length of a sample of size n. Theorem 2 (Genealogical distances under any form of dominance). Let h ∈ (−∞, ∞) and Then, y 1 , y 2 , ... satisfy the recursion y 1 = 0 and where h 1 , h 2 , ... satisfy the recursion h 1 = 0 and and e n was given in Theorem 1.
Remark 6 (Comparing genealogies). 1. Most interestingly, neutral trees differ from trees under genic selection only in order α 2 , whereas the difference is in order α for other forms of dominance. While this may be counter-intuitive at first sight, it can be easily explained. Note that the model actually does not change if we replace α by −α and h by 1 − h at the same time. By doing so, we just interchange the roles of allele • and •. For h = 1/2, this means that our results have to be identical for α and −α, leading to a vanishing linear term in (3). For h 1/2, this symmetry does not have to hold, leading to a linear term in α.
2. Similar to our reasoning in Remark 2.2, the sign of h n in the recursion for y n determines if tree lengths are shorter or longer under selection. We see that the behaviour changes at h = 1/2. By construction, h n is positive, so if h < 1/2, y n is positive as well and we see that trees are shorter under selection (in the Laplace-transform order). If h > 1/2, the reverse is true and trees are longer under selection. This result is not surprising for over-dominant selection, h > 1, since the advantage of the heterozygote leads to maintenance of heterozygosity or balancing selection, which in turn is known to produce longer genealogical trees.
Corollary 8 (Genealogical distance of two individuals under any form of dominance).

Discussion
A fundamental question in population genetics is: How does selection affect genealogies of a sample of individuals? We have added to this question an analysis of tree lengths under low levels of selection, both for genic selection and for other modes of dominance. While our results are only given through recursions, these give valuable insights. Collecting previously stated interpretations of our results: • The selection coefficient enters the change in tree length linearly for h 1/2, but only quadratically for genic selection (h = 1/2); see Remark 6.1.
• The tree length are shorter under genic selection; see Remark 2.2. For other modes of dominance, tree lengths are shorter only for h < 1/2, but longer for h > 1/2; see Remark 6.2.
In addition, from Theorems 1 and 2, it is possible to study the effect in large samples. Recall that under neutrality, where Z is Gumbel distributed (see p. 255 of Wiuf and Hein, 1999). In particular, for large n, where γ e ≈ 0.57 is the Euler-Mascheroni constant. We show in the appendix that the change of L n under low levels of selection relative to neutral evolution is O(1), precisely This then shows that the order of magnitude in the change due to selection is much smaller than the length of the full tree for large samples. Note that this finding is in line with Przeworski et al. (1999), where simulations of the ancestral selection graph are used to find that the overall tree shape under selection is not too different from neutrality for low levels of selection.
In order to get more quantitative insights, we have numerically solved the recursions and plotted the effect on the tree length for various scenarios. Figure 2(A) analyses the effect of genic selection in large samples (i.e. n = 50). Interestingly, there is some mutation rateθ ≈ 0.5, which gives the largest effect. This is clear since very little mutation implies almost no change in the genealogy relative to neutrality since almost always only the beneficial type is present in the population, and very high mutation rate implies that selection is virtually inefficient, leading to nearly neutral trees. Moreover, we can see here that Θ(1 − Θ) enters the recursion for the change in tree length only linearly. In Figure 2(B), we display the change in tree length for h = 0. Since 1 − 2h enters the recursion only linearly, the graph looks qualitatively the same for other dominance coefficients.
In principle, the approach we use here is comparable to the ancestral selection graph in the sense that all events happening within the ASG are also implemented in our construction. However, within the ASG, when a splitting event occurs, it is not clear which of the two lines is the true ancestor, so bot lines are followed. As a consequence, in any case the ASG looks longer than a neutral coalescent due to the splitting events, and only when the ASG is pruned to become the right genealogical tree, the tree length can be computed. In our approach, the information, which we need within a splitting event is different, since only the type of the additional line, and the type of an individual within the sample is needed.
The approach we use here to study genealogies under selection can be used for other statistics than the total tree length. In principle, every quantity of a sample tree can be described. The reason why we chose the tree length is its simple structure due to coalescence events: If two lines in a sample of size n coalesce, all that remains is a tree of n − 1 individuals, already implying the recursive structure for the tree lengths which is apparent in Theorems 1 and 2. In addition, in order to describe the effect of selection, we rely on a description of the sample which was already used in the mathematics literature for the so-called Fleming-Viot process (which generalizes the Wright-Fisher diffusion); see e.g. Etheridge (2001) and Depperschmidt et al. (2012). In principle, our approach can be extended e.g. to include more than two alleles, population structure, recombination etc. However, probably one would still have to find a recursive structure, which will often be feasible only in the case of weak selection.

Preliminaries for the proofs
We here present the construction of a tree-valued process in a nutshell, leaving out various technical details. All details of the construction are given in Depperschmidt et al. (2012). Any genealogical tree is uniquely given by all genealogical distances of pairs on (haploid) individuals. So, in order to describe the evolution of genealogical trees, it suffices to describe the evolution of all pairwise distances. We also note that the tree length which we consider in all our results, is a function of pairwise distances. (See e.g. Section 8 of Depperschmidt et al. (2012).) Consider a sample of size n taken from the Moran model at time t as described in Section 2, and let R(t) := (R i j (t)) i< j be the pairwise genealogical distances (note that R i j = R ji ), and U(t) := (U 1 (t), ..., U n (t)) the allelic types, either • or •. We will consider some smooth, bounded function Φ : R ( n 2 ) + × {•, •} n → R and are going to describe the change in E α [Φ(R(t), U(t))] due to the evolution of the Moran model. We have to take into account several mechanisms: 0. Growth of the tree: During times when no event happens, all genealogical distances grow deterministically and linearly (with speed 2). In time dt, the change is 1. Resampling: If a pair of the N individuals within the Moran model resamples, there is either none, one, or two of them within the sample tree of size n. If none, the sample tree is not affected. If one, and this one reproduces, the sample tree is not affected as well. If one, and this one is replaced by the individual outside of the sample, the effect is the same as if we would have picked the other individual to begin with. Since in the E α [...], we average over all possibilities which samples of size n we take, there is also no resulting effect. If two, i reproduces and j dies, say, the effect is that distances to individual j are replaced by distances to individual i, and the new type of individual j is the type of i. Since all pairs within the sample resample at rate 1, the change in time dt is 2. Mutation: We note that the mutational mechanism can also be described by saying that every line in the Moran model mutates at rateθ, with outcome • with probability Θ and outcome • with probability 1−Θ. Since mutation only affects the alleles of the individuals in the sample, we have in time dt and analogously for U i,• .
3. Selection: By the dynamics of the Moran model, we have to accept that selective events depend on the type of individuals. We say that the kth individual has fitness αχ k := αχ(U k ), and χ is the fitness function. For genic selection and other modes of dominance, the fitness function is given by respectively. Here, for other modes of dominance, m is some randomly picked (haploid) individual. As for resampling, selective events occur in a pair of one individual i giving birth and the other, individual j, dying, as given through the function θ i j from above. Consequently, we find that in time dt, since n N, and all individuals outside the sample bring the same effect, for genic selection In the ≈ (which is exact in the limit N → ∞), we have used that the effect of selective events within the sample can be neglected. For the last equality, we have used that we can permute sampling orders in E α [.] since all genealogies are sampled with the same probability and by changing individuals j and n + 1 in the sample in the first term.
Lemma 9. For n ≥ 2, and with the convention that Φ n −1, j = Φ n i,−1 = 0, for genic selection, and the last two terms change to for other modes of dominance.
Remark 10. Simple algebra shows that the α-terms in (22) and (23) agree for h = 1/2. For future reference, note that for i = j = 0, the α-term in (23) gives . Proof of Lemma 9. The effect of tree growth on E α [Φ n i j ] is that the tree growth by ndt in time dt, i.e. in time dt −nλ · E α [Φ n i j (t)]. Let I = {1, ..., i}, H = {i + 1, ..., n} and J = {n + 1, ..., n + j}. For resampling, we distinguish between events among I, events among I ∪ H, with at most one partner within I, events with one partner within I and the second among J, events with one partner in H and the second among J, and events with two partners in J. Only if two among I ∪ H coalesce, n decreases. This gives For mutation, we note that for h ∈ H, Φ n i j (R, U h,• ) = Φ n i j (R, U h,• ) = Φ n i j , therefore the effect of mutation is Last, for selection, we have to distinguish the cases of genic selection and other modes of dominance. For genic selection, we have that χ k = 1(U k = •) and we note that for i ∈ I and j ∈ J, Φ n i j χ i = Φ n i j χ j = Φ n i j , therefore the effect is ] For other modes of dominance, Therefore, the effect of selection is here The next result is collected from Theorem 4 and Lemma 8.1 in Depperschmidt et al. (2012).
This shows (5). For c n , we use the same coalescent, and by distinguishing the six cases, we write which shows (6). For d n , let B ∈ {2, ..., n} be the number of lines in a coalescent starting with n lines, just before lines 1 and 2 coalesce. Then, Then, and we see that d n = f n−1 − f n − g n−1 + g n .
Finally, for e n , we again use a recursion. Consider a coalescent with n + 1 lines and make a first-step-analysis. In this first step, we distinguish four cases: 1. Coalescence of lines 1 or 2 with one of 3, ..., n; rate n 2 − 1 2. Coalescence of lines 1 and 2; rate 1 3. Coalescence of lines n + 1 and 1; rate 1 4. Coalescence of lines n + 1 and one of 2, ..., n; rate n − 1 This shows (7).
Proof of Corollary 4. We have to compute x n := ∂ ∂λ x n | λ=0 . A close inspection of the recursions for x n reveals that (i) x n is a sum of products, and in each summand, some factor d k enters and (ii) d k = O(λ) for small λ for all k, which is best seen from (10). As a consequence, we can compute the derivative with respect to λ at λ = 0 in each summand which enters x n by taking the derivative of d k with respect to to λ and set λ = 0 in all other factors. Summing up, we have the same recursions as in Theorem 1 with (i) λ = 0 in all terms except d n , and (ii) replace d n with the derivative according to λ at λ = 0. This gives the recursions as given in the corollary and d n = ∂ ∂λ f n − f n−1 − g n−1 + g n λ=0 = 2 n − 1 + g n−1 − g n with g n = − ∂ ∂λ Proof of Corollary 5. Applying Theorem 1, we get 2λ(1 + 2λ + 4θ + 4λθ + 4θ 2 − 1 + 2λ) (1 + 2θ)(1 + 2λ)(1 + 2λ + 2θ) = 8λθ(1 + λ +θ) (1 + 2θ)(1 + 2λ)(1 + 2λ + 2θ) , This gives the first assertion. The second follows since a 2 is O(λ) and the derivative with respect to λ at λ = 0 is easily computed.
Proof of Corollary 7. The proof is basically the same as for Corollary 4: The recursions for y n is a sum of products, where each factor comes with a factor d n , and d n = O(λ). Therefore, the derivative according to λ at λ = 0 is performed by taking derivatives only of d n and setting λ = 0 in all other instances.