Numerical simulation of the two-locus Wright-Fisher stochastic differential equation with application to approximating transition probability densities

Over the past decade there has been an increasing focus on the application of the Wright-Fisher diffusion to the inference of natural selection from genetic time series. A key ingredient for modelling the trajectory of gene frequencies through the Wright-Fisher diffusion is its transition probability density function. Recent advances in DNA sequencing techniques have made it possible to monitor genomes in great detail over time, which presents opportunities for investigating natural selection while accounting for genetic recombination and local linkage. However, most existing methods for computing the transition probability density function of the Wright-Fisher diffusion are only applicable to one-locus problems. To address two-locus problems, in this work we propose a novel numerical scheme for the Wright-Fisher stochastic differential equation of population dynamics under natural selection at two linked loci. Our key innovation is that we reformulate the stochastic differential equation in a closed form that is amenable to simulation, which enables us to avoid boundary issues and reduce computational costs. We also propose an adaptive importance sampling approach based on the proposal introduced by Fearnhead (2008) for computing the transition probability density of the Wright-Fisher diffusion between any two observed states. We show through extensive simulation studies that our approach can achieve comparable performance to the method of Fearnhead (2008) but can avoid manually tuning the parameter ρ to deliver superior performance for different observed states.

(3), the numeratorw i is the marginal fitness of gamete i and the 120 denominatorw is the mean fitness. 121 We define the two-locus Wright-Fisher model with selection to be the stochastic process dX(t) = µ(X(t))dt + ν(X(t))dW (t), where the drift term µ(x) is a four-dimensional vector being the diffusion term ν(x) is a four by three matrix satisfying 143 ν(x)ν (x) = Σ(x), for i, j = 1, 2, 3, 4, and W (t) is a three-dimensional standard Brownian motion.

145
The proof of Theorem 1 follows in the similar manner to that employed for neutral popu-  The term x 1 x 4 − x 2 x 3 in Eq.
(3) is a measure of the linkage disequilibrium between loci A and 150 B, which quantifies the non-random association of the alleles at the two loci. We refer to the 151 diffusion process X satisfying the SDE in Eq. (4) as the two-locus Wright-Fisher diffusion with selection, which evolves in the state space 153 Ω X = x ∈ [0, 1] 4 : Moreover, for any x in the interior of the state space Ω X , the infinitesimal mean vector µ(x) 154 and the infinitesimal covariance matrix Σ(x) are twice continuously differentiable with respect 155 to x with bounded derivatives, and the infinitesimal covariance matrix Σ(x) is also positive 156 definite. Therefore, for any x and x in the interior of the state space Ω X , the transition prob-157 ability distribution function P (t, x, x ) of the Wright-Fisher diffusion X is strongly continuous 158 with respect to Lebesgue's measure (Stroock & Varadhan, 1979), and the transition probability for fixed x and the Kolmogorov forward equation (KFE) for fixed x, where the infinitesimal mean vector µ(x) and the infinitesimal covariance matrix 162 Σ(x) are given by Eqs. (5) and (7) which will be numerically unstable when x 1 is close to 1. More precisely, even though in Eq. (11)

190
(1 − x 1 − x 2 ) 1/2 will be tiny, which cancels out the magnitude of (1 − x 1 ) −1/2 , any round-off 191 error will be magnified when computing ( where the diffusion term σ(x) can be explicitly written down as for all x, x ∈ Ω X , where | · | denotes the Euclidean vector norm or the Frobenius matrix norm 209 as appropriate and for some C > 0, K > 0 and β > 1. Here we can take C = 1, K = 1 and β = 2, respectively. Ac-

218
The proof of Theorem 2 can be found in Appendix A. 3.2. Euler-Maruyama scheme 220 We now turn to the development of a numerical simulation scheme for the TLWFDS-SDE 221 with the decomposition of the infinitesimal covariance matrix Σ(x) we proposed in Section 3.1.

222
Assuming that the closed-form expression of the diffusion coefficient matrix σ(x) is available, we first represent the Wright-Fisher SDE in the integral component form as for i = 1, 2, 3, 4. Using Itô's formula, we then have where L 0 and L k are differential operators defined as for i = 1, 2, 3, 4, where We can get a truncated stochastic Itô-Taylor expansion by keeping the first three constant and normally distributed with mean 0 and variance ∆τ u−1 for j = 1, 2, . . . , 6.

246
The stochastic Taylor scheme we adopt here is rather simple, commonly known as the Euler-

247
Maruyama scheme, which is one of the most popular stochastic Taylor schemes in practice due 248 to its high efficiency and low complexity. As demonstrated in Kloeden & Platen (1992), the 249 Euler-Maruyama scheme is numerically stable and strongly consistent, and the convergence   (1) where we defineX(τ u ) = ξ u for u = 0, 1, . . . , U , and use the convention ξ 0 = x 0 and ξ U = x T 297 to conserve notation.

298
From Theorem 2 in Pedersen (1995b), we have However, it is infeasible to evaluate the integrals in Eq. (22) analytically, and unfortunately,  Maruyama approximationp where {ξ is the sample of the realisations generated by setting ξ 0 = x 0 and ξ U = x T , 308 and successively drawing for u = 1, 2, . . . , U − 1, and V is the total number of the realisations in the sample. In Figure 3,  (1) (2012), which can force the simulated sample paths generated from the importance sampler to 330 terminate at state x T almost surely through the so-called guiding drift term defined by Note that we use G to refer to a generic guided process in the following.

332
The first G-based importance sampler we consider was developed by Delyon & Hu (2006), 333 which suggests generating the simulated sample paths from the guided process, denoted by G 1 , 334 satisfying the following proposal SDE of the form where the drift term

341
More specifically, using the Euler-Maruyama scheme, we set ξ 0 = x 0 and ξ U = x T , and 342 successively draw for u = 1, 2, . . . , U − 1. We show a set of simulated sample paths generated from the G 1 -based importance sampler with two observed states x 0 and x T as an example in Figure 5. The second G-based importance sampler we consider was proposed by Fearnhead (2008), 346 which suggests generating the simulated sample paths from the guided process, denoted by G 2 , 347 satisfying the following proposal SDE of the form where the drift term

354
More specifically, using the Euler-Maruyama scheme, we set ξ 0 = x 0 and ξ U = x T , and 355 successively draw for u = 1, 2, . . . , U − 1. In particular, when we set ρ = 0 in Eq. (27), the G 2 -based importance  We propose an adaptive G 2 -based importance sampler, which suggests generating the sim-369 ulated sample paths from the guided process, denoted by G 3 , satisfying the following proposal 370 SDE of the form 371 dX(t) = µ G 3 (t, X(t))dt + σ(X(t))dW (t), where the drift term for a predetermined non-negative function ρ(x T ), the diffusion term is given by Eq. (13), and 373 W (t) is a six-dimensional standard Brownian motion. Compared to the drift term µ G 2 (t, x) of 374 the G 2 , the ρ-term is a function of the observed state x T in the drift term µ G 3 (t, x) of the G 3 , 375 which is defined as wherex T is the value at time T of the solution to the following ordinary differential equation More specifically, using the Euler-Maruyama scheme, we set ξ 0 = x 0 and ξ U = x T , and 387 successively draw for u = 1, 2, . . . , U − 1. Figure 8 illustrates a set of simulated sample paths generated from the 389 G 3 -based importance sampler with two observed states x 0 and x T as an example.

390
Furthermore, the modification made in the G 2 -based importance sampler can be naturally 391 applied to the G 3 -based importance sampler, which suggests setting ξ 0 = x 0 and ξ U = x T , and 392 successively drawing for u = 1, 2, . . . , U − 1. In Figure 9, we illustrate a set of simulated sample paths generated from 394 the modified G 3 -based importance sampler with two observed states x 0 and x T as an example. (1) We can thereby approximate the Euler-Maruyama approximationp is the sample of the realisations generated from the G-based 432 importance sampler, and V is the total number of the realisations in the sample.   To further improve the importance sampling approximation, we can adopt the asymptoti-    respectively. Figure 18 gives an example when the positively selected allele is initially in low 536 frequency. Figure 19 shows an example of non-additive gene action. Figure  to the fixation of a certain gamete type A i B j , but not necessarily the A 1 B 1 gamete. In fact, 544 which gamete type becomes fixed is a random event with a corresponding fixation probability 545 determined by the population size, the strength of natural selection and genetic recombination, 546 gene actions and the initial state of the population (see Figures 16-19). By contrast, when at 547 least one of the two loci is neutral, the fixation of a certain gamete type A i B j may not occur 548 (see Figure 20).

549
The scope of our results in the present work can be extended in various directions. A direct 550 extension is to certain multi-locus problems. As we have stated above, in theory, our method 551 can be naturally extended to high-dimensional problems, but may suffer from the potential 552 issue of computational efficiency in practice. We have two potential directions to improve the 553 performance of our method for multi-locus problems. One direction is to consider how to exactly Af Let A * be the generator associated with the SDE in Eq. (12), then where 717 a ij (x) = 6 k=1 σ ik (x)σ kj (x).
From (13), we have which is the same as Σ ij (x) = x i (δ ij − x j ) given in (7). This shows the two SDE's in Eqs. where the drift term µ(z) is a three-dimensional vector being ∂ 2 (Σ ij (z )p(t, z, z )) ∂z i ∂z j − 3 i=1 ∂(µ i (z )p(t, z, z )) ∂z i (B.6) for fixed z, where the infinitesimal mean vector µ(z) and the infinitesimal covariance matrix