Moment-based approximations for the Wright-Fisher model of population dynamics under natural selection at two linked loci

Properly modelling genetic recombination and local linkage has been shown to bring significant improvement to the inference of natural selection from time series data of allele frequencies under a Wright-Fisher model. Existing approaches that can account for genetic recombination and local linkage are built on either the diffusion approximation or a moment-based approximation of the Wright-Fisher model. However, methods based on the diffusion approximation are likely to require much higher computational cost, whereas moment-based approximations may suffer from the distribution support issue: for example, the normal approximation can seriously affect computational accuracy. In the present work, we introduce two novel moment-based approximations of the Wright-Fisher model on a pair of linked loci, both subject to natural selection. Our key innovation is to extend existing methods to account for both the mean and (co)variance of the two-locus Wright-Fisher model with selection. We devise two approximation schemes, using a logistic normal distribution and a hierarchical beta distribution, respectively, by matching the first two moments of the Wright-Fisher model and the approximating model. As compared with the diffusion approximation, our approximations enable the approximate computation of the transition probabilities of the Wright-Fisher model at a far smaller computational cost. We can also avoid the distribution support issue found in the normal approximation.


Introduction
s A ∈ [0, 1] is the selection coefficient and h A ∈ [0, 1] is the dominance parameter. For example, 90 the viability of an individual carrying the A 1 B 2 /A 2 B 1 genotype is w 23 = (1 − h A s A )(1 − h B s B ). 91 Let X i (k) denote the frequency of gamete i in a population of N individuals in generation 92 k ∈ N for i = 1, 2, 3, 4, and X(k) be the vector of the frequencies of the four possible gametes. 93 In the Wright-Fisher model, the population size N is assumed to be fixed over time, and at each 94 generation gametes are randomly sampled from an effectively infinite gamete pool reflecting the 95 parental gamete frequencies. Under multinomial sampling, we have where p is the vector of the parental gamete frequencies of an effectively infinite population in 97 generation k − 1 and can be written down as a function of the gamete frequencies x, 98 p 1 = (1 − r)q 1 + r(q 1 + q 2 )(q 1 + q 3 ) p 2 = (1 − r)q 2 + r(q 1 + q 2 )(q 2 + q 4 ) p 3 = (1 − r)q 3 + r(q 3 + q 4 )(q 1 + q 3 ) p 4 = (1 − r)q 4 + r(q 3 + q 4 )(q 2 + q 4 ) (2) for i = 1, 2, 3, 4. In Eq. (3), the numeratorw i is the marginal fitness of haplotype i and the 100 denominatorw is the mean fitness.
approximations to the transition probabilities in the Wright-Fisher model based on parametric 110 families have been developed, which could be traced back to Cavalli-Sforza & Edwards (1967). 111 However, existing moment-based approximations either suffer from the distribution support is- these issues, we introduce two moment-based approximations for the two-locus Wright-Fisher 115 model with selection. We first derive approximations for the mean and (co)variance, and then, 116 to approximate the transition probabilities, we describe a logistic normal approximation and a 117 hierarchical beta approximation.  = p (4) respectively, where diag(p) is the operator that constructs a diagonal matrix and places the elements of the sampling probability vector p on the main diagonal.
Let µ k and Σ k denote the mean and (co)variance of the gamete frequencies X(k) given the 135 gamete frequencies X(0) for k ∈ N + , respectively. Using the law of total mean and (co)variance, 136 we have Following Lacerda & Seoighe (2014), we apply the delta method with the first-order Taylor 138 expansion of the term µ(X(k − 1)) about the mean µ k−1 . The mean µ k and the (co)variance 139 Σ k can then be approximated as 140 µ k = µ(µ k−1 ) (8) where J µ (x) is the Jacobian matrix defined over the function µ(x) in Eq. (4). By substituting 141 Eqs.
(2) and (3) into Eq. (4) and taking the derivative with respect to the gamete frequencies 142 x, we have With the first-order Taylor expansion of the term µ(X(k − 1)) about the mean µ k−1 , we can 149 approximate the mean µ k and the (co)variance Σ k as With the second-order Taylor expansion of the term µ(X(k − 1)) about the mean µ k−1 , we can 151 approximate the mean µ k and the (co)variance Σ k as Using the iteration formula in Eqs. (8) and (9) or in Eqs. (10) and (11) where µ k and Σ k are the mean and (co)variance of the two-locus Wright-Fisher model with 167 selection, respectively, which can be obtained by running the iteration formula in Eqs. (8) and

168
(9) or in Eqs. (10) and (11) or in Eqs. (12) and (13) and introduce a stochastic process, denoted by Y = {Y (k), k ∈ N}, to approximate the trans- where m k and V k are the mean and (co)variance of the normal distribution. The unique inverse mapping given by yields our moment-based approximation of the Wright-Fisher model X, where the inverse map-190 ping φ −1 is commonly known as the additive logistic transformation. Given the initial gamete 191 frequencies X(0), our moment-based approximation φ −1 (Y (k)) suggests that the gamete fre-192 quencies X(k) approximately follows a logistic normal distribution which we refer to as the logistic normal approximation of the two-locus Wright-Fisher model 194 with selection.

195
Our task now is to find proper values for the mean and (co)variance of the normal distri- The transition probability density function of our logistic normal approximation can be explicitly  Figure 1 for an illustration). In Section 4, we 214 will explore how such a discrepancy affects the performance of our logistic normal approximation 215 through extensive simulations. with selection into two conditionally independent beta distributed levels. More specifically, we 234 consider any three linearly independent gamete frequencies out of the four, which is illustrated 235 below for X 1 (k), X 2 (k) and X 3 (k), and construct a transformation, denoted by ψ, to map the 236 gamete frequencies X(k) to the state space [0, 1] 3 : We introduce a stochastic process, denoted by Z = {Z(k), k ∈ N}, to approximate the trans-

238
formed Wright-Fisher model ψ(X). More concretely, let where α i,k and β i,k are the shape parameters of each beta distribution in Eqs. (20)-(22). Notice the haplotypes with the A 1 allele, and Z 3 (k) is the frequency of the A 2 B 1 haplotype in the 243 haplotypes with the A 2 allele. The unique inverse mapping given by yields our moment-based approximation of the Wright-Fisher model X. We call the procedure 245 we just outlined the hierarchical beta approximation of the two-locus Wright-Fisher model with 246 selection.

247
Our task now is to find proper values for the shape parameters α i,k and β i,k of each beta Similarly, the mean of the other three gamete frequencies can be written down as Solving the system of coupled equations given by Eqs.

262
With the law of total variance, the variance of the frequency of the A 1 B 1 gamete under the 263 Wright-Fisher model X can be represented in terms of the mean m i,k and the variance V i,k of 264 each beta distribution as Similarly, the variance of the other three gamete frequencies can be expressed as Solving the system of coupled equations given by Eqs. (32)-(34) yields where m i,k is the mean of each beta distribution given by Eqs.
where the shape parameters α i,k and β i,k of each beta distribution can be obtained by substi- Using the law of total covariance, we can write down the covariance between four gamete 283 frequencies under our hierarchical beta approximation in terms of the mean m i,k and the variance 284 V i,k of each beta distribution. For example, the covariance between the frequencies of the A 1 B 1 285 gamete and the A 1 B 2 gamete under our hierarchical beta approximation can be represented as Similarly, we can write down the other covariance between four gamete frequencies under our 287 hierarchical beta approximation as We will discuss how the discrepancy of the (co)variance structure affects the performance of our 289 hierarchical beta approximation through extensive simulations in Section 4.   tribution on the state space Ω X , and the empirical CDF is evaluated at these sampled gamete 374 frequencies at every 10 generations from the 1,000,000 simulated gamete frequency trajectories. 375 We measure the difference between the Wright-Fisher model and its approximation using the 376 root mean squared difference (RMSD) between the two empirical CDFs. tion rates but shows a sharp increase for large recombination rates. We also see from Figure 8 417 that the failure proportion slightly increases with the decrease in the population size, which 418 becomes significant for large recombination rates. Such a computational instability issue could 419 be caused by that our moment approximation gets less accurate with time and our hierarchical 420 beta approximation fails to model fixation or loss of alleles.

421
To resolve the computational instability issue of our hierarchical beta approximation, if the 422 three selected gamete frequencies, e.g., X 1 (k), X 2 (k) and X 3 (k), fail to produce positive shape In summary, the moment-based approximations enable us to compute the approximate tran- under natural selection at a single locus. A major advantage of our logistic normal and hier-457 archical beta approximations, compared to e.g., the Gaussian approximation of Terhorst et al.

458
(2015), is that we avoid the distribution support issue.