## Abstract

Forward Wright-Fisher simulations are powerful in their ability to model complex demography and selection scenarios, but suffer from slow execution on the CPU, thus limiting their usefulness. The single-locus Wright-Fisher forward algorithm is, however, exceedingly parallelizable, with many steps which are so-called *embarrassingly parallel*, consisting of a vast number of individual computations that are all independent of each other and thus capable of being performed concurrently. The rise of modern Graphics Processing Units (GPUs) and programming languages designed to leverage the inherent parallel nature of these processors has allowed researchers to dramatically speed up many programs that have such high arithmetic intensity and intrinsic concurrency. The presented GPU Optimized Wright-Fisher simulation, or *GO Fish* for short, can be used to simulate arbitrary selection and demographic scenarios while running over 250-fold faster than its serial counterpart on the CPU. Even modest GPU hardware can achieve an impressive speedup of well over two orders of magnitude. With simulations so accelerated, one can not only do quick parametric bootstrapping of previously estimated parameters, but also use simulated results to calculate the likelihoods and summary statistics of demographic and selection models against real polymorphism data - all without restricting the demographic and selection scenarios that can be modeled or requiring approximations to the single-locus forward algorithm for efficiency. Further, as many of the parallel programming techniques used in this simulation can be applied to other computationally intensive algorithms important in population genetics, *GO Fish* serves as an exciting template for future research into accelerating computation in evolution. *GO Fish* is part of the Parallel PopGen Package available at: http://dl42.github.io/ParallelPopGen/

## Introduction

The Graphics Processing Unit (GPU) is commonplace in today’s consumer and workstation computers and provides the main computational throughput of the modern supercomputer. A GPU differs from a computer’s Central Processing Unit (CPU) in a number of key respects, but the most important differentiating factor is the number and type of computational units. While a CPU for a typical consumer laptop or desktop will contain anywhere from 2-4 very fast, complex cores, GPU cores are in contrast relatively slow and simple. However, there are typically hundreds to thousands of these slow and simple cores in a single GPU. Thus CPUs are low latency processors that excel at the serial execution of complex, branching algorithms. Conversely, the GPU architecture is designed to provide high computational bandwidth, capable of executing many arithmetic operations in parallel.

The historical driver for the development of GPUs was increasingly realistic computer graphics for computer games. However, researchers quickly latched on to their usefulness as tools for scientific computation – particularly for problems that were simply too time consuming on the CPU due to the sheer number of operations that had to be computed, but where many of those operations could in principle be computed simultaneously. Eventually programming languages were developed to exploit GPUs as massive parallel processors and, over time, the GPU hardware has likewise evolved to be more capable for both graphics and computational applications.

Population genetics analysis of single nucleotide polymorphisms (SNPs) is exceptionally amenable to acceleration on the GPU. Beyond the study of evolution itself, such analysis is a critical component of research in medical and conservation genetics, providing insight into the selective and mutational forces shaping the genome as well as the demographic history of a population. One of the most common analysis methods is the site frequency spectrum (SFS), a histogram where each bin is a count of how many mutations are at a given frequency in the population.

SFS analysis is based on the precepts of the Wright-Fisher process [1, 2], which describes the probabilistic trajectory of a mutation’s frequency in a population under a chosen evolutionary scenario. The defining characteristic of the Wright-Fisher process is forward time, non-overlapping, discrete generations with random genetic drift modeled as a binomial distribution dependent on the population size and the frequency of a mutation [1, 2]. On top of this foundation can be added models for selection, migration between populations, mate choice & inbreeding, linkage between different loci, etc. For simple scenarios, an approximate analytical expression for the expected proportion of mutations at a given frequency in the population, the expected SFS, can be derived [1-5]. This expectation can then be compared to the observed SFS of real data, allowing for parameter fitting and model testing [5-7]. However, more complex scenarios do not have tractable analytical solutions, approximate or otherwise. One approach is to simulate the Wright-Fisher process forwards in time to build the expected frequency distribution or other population genetic summary statistics [8-11]. Because of the flexibility inherent in its construction, the Wright-Fisher forward simulation can be used to model any arbitrarily complex demographic and selection scenario [8-13]. Unfortunately, because of the computational cost, the use of such simulations to analyze polymorphism data is often prohibitively expensive in practice [12, 13]. The coalescent looking backwards in time and approximations to the forward single-locus Wright-Fisher algorithm using diffusion equations provide alternative, computationally efficient methods of modeling polymorphism data [14, 15]. However, these effectively limit the selection and demographic models that can be ascertained and approximate the Wright-Fisher forward process [12, 13, 15, 16].
Thus by speeding up forward simulations, we can use more complex and realistic demographic and selection models to analyze within-species polymorphism data.

Single-locus Wright-Fisher simulations based on the Poisson Random Field model [4] ignore linkage between sites and simulate large numbers of individual mutation frequency trajectories forwards in time to construct the expected SFS. Exploiting the naturally parallelizable nature of the single-locus Wright-Fisher algorithm, these forward simulations can be greatly accelerated on the GPU. Written in the programming language CUDA v6.5 [17], a C/C++ derivative for NVIDIA GPUs, the GPU Optimized Wright-Fisher simulation, *GO Fish*, allows for accurate, flexible simulations of SFS at speeds orders of magnitude faster than comparable serial programs on the CPU. *GO Fish* can both be run as a standalone executable and integrated into other programs as a library to accelerate single-locus Wright-Fisher simulations used by those tools.

## Algorithm

In a single-locus Wright-Fisher simulation, a population of individuals can be represented by the set of mutations segregating in that population – specifically by the frequencies of the mutant, derived alleles in the population. Under the Poisson Random Field model, these mutations are completely independent of each other and new mutational events only occur at non-segregating sites in the genome (i.e. no multiple hits) [4].

Figure 1 sketches the algorithm for a typical, serial Wright-Fisher simulation, starting with the initialization of an array of mutation frequencies. From one discrete generation time step to the next, the change in any given mutation’s frequency is dependent on the strength of selection on that mutation, migration from other populations, the percent of inbreeding, and genetic drift. Unlike the others listed, inbreeding is not directly a force for allele frequency change, but rather it modifies the effectiveness of selection and drift. Frequencies of 0 (lost) and 1 (fixed) are absorbing boundaries such that if a mutation becomes fixed or lost across all extant populations, it is removed from the next generation’s mutation array. New mutations arising stochastically throughout the genome are then added to the mutation array of the offspring generation, replacing those mutations lost and fixed by selection and drift. As the offspring become the parents of the next generation, the cycle repeats until the final generation of the simulation.

While the details of how a GPU organizes computational work are quite intricate [17], the vastly oversimplified version is that a serial set of operations is called a thread and the GPU can execute many such threads in parallel. With completely unlinked sites, every simulated mutation frequency trajectory is independent of every other mutation frequency trajectory in the simulation. Therefore, the single-locus Wright-Fisher algorithm is trivially parallelized by simply assigning a thread to each mutation in the mutation array: when simulating each discrete generation, both calculating the new frequency of alleles in the next generation and adding new mutations to the next generation are *embarrassingly parallel* operations (Figure 2A). This is the parallel ideal because no communication across threads is required to make these calculations. A serial algorithm has to calculate the new frequency of each mutation one by one – and the problem is multiplied where there are multiple populations, as these new frequencies have to be calculated for each population. For example, in a simulation with 100,000 mutations in a given generation and 3 populations, 300,000 sequential passes through the functions governing migration, selection, and drift are required. However, in the parallel version, this huge number of iterations can theoretically be compressed to a single step in which all the new frequencies for all mutations are computed simultaneously. Similarly, if there are 5,000 new mutations in a generation, a serial algorithm has to add each of those 5,000 new mutations one at a time to the simulation. The parallel algorithm can, in theory, add them all at once. Of course, a GPU only has a finite number of computational resources to apply to a problem and thus this ideal of executing all processes in a single time step is never truly realizable for a problem of any substantial size.
Even so, parallelizing migration, selection, drift, and mutation on the GPU results in dramatic speedups relative to performing those same operations serially on the CPU. This is the main source of *GO Fish*’s improvement over serial, CPU-based Wright-Fisher simulations.
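To make the independence between mutations concrete, the following is a minimal serial sketch in Python, not the CUDA implementation of *GO Fish*. It applies migration (as a weighted average of frequencies across populations) and binomial drift to each mutation, and omits selection, so it corresponds to a neutral scenario. The function names, the `migration[i][j]` layout with each column summing to 1, and the simple O(*N_{e}*) binomial sampler are assumptions made for illustration only.

```python
import random

def binomial_draw(n, p, rng):
    """Number of successes in n Bernoulli(p) trials (simple O(n) sampler)."""
    return sum(1 for _ in range(n) if rng.random() < p)

def next_frequencies(x, migration, n_e, rng):
    """Advance one mutation by one generation in every population.

    x[i]            -- current frequency of the mutation in population i
    migration[i][j] -- assumed weight of population i in population j's
                       gene pool (each column sums to 1, non-migrants included)
    n_e             -- effective number of chromosomes in each population
    """
    n_pops = len(x)
    new_x = []
    for j in range(n_pops):
        # Migration: weighted average of the frequency across populations.
        x_mig = sum(migration[i][j] * x[i] for i in range(n_pops))
        x_mig = min(1.0, max(0.0, x_mig))  # guard against rounding error
        # Drift: binomial sampling of n_e chromosomes (no selection here).
        new_x.append(binomial_draw(n_e, x_mig, rng) / n_e)
    return new_x

def step_generation(mutations, migration, n_e, rng):
    """Serial pass over the mutation array, one call per mutation.

    No call reads another mutation's result, which is why a GPU could
    assign each call to its own thread.
    """
    return [next_frequencies(x, migration, n_e, rng) for x in mutations]
```

Because `next_frequencies` depends only on its own mutation's frequencies, the list comprehension in `step_generation` could be evaluated in any order, or all at once, without changing the distribution of the results.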

One challenge to the parallelization of the Wright-Fisher algorithm is the treatment of mutations that become fixed or lost. When a mutation reaches a frequency of 0 (in all populations, if multiple) or 1 (in all populations, if multiple), that mutation is forever lost or fixed. Such mutations are no longer of interest to maintain in memory or process from one generation to the next. Without removing lost and fixed mutations from the simulation, the number of mutations being stored and processed would simply continue to grow as new mutations are added each generation. When processing mutations one at a time in the serial algorithm, removing mutations that become lost or fixed is as trivial as simply not adding them to the next generation and shortening the mutation array in the next generation by 1 each time. This becomes more difficult when processing mutations in parallel. As stated before: the different threads for different mutations do not communicate with each other when calculating the new mutation frequencies simultaneously. Therefore any given mutation/thread has no knowledge of how many other mutations have become lost or fixed that generation. This in turn means that when attempting to remove lost and fixed mutations while processing mutations in parallel, there is no way to determine the size of the next generation’s mutation array or where in the offspring array each mutation should be placed.

One solution to the above problems is the algorithm *compact* [18], which can filter out lost and fixed mutations while still taking advantage of the parallel nature of GPUs (Figure 2C). However, compaction is not *embarrassingly parallel*, as communication between the different threads for different mutations is required, and it involves a lot of moving elements around in GPU memory rather than intensive computation. Thus, it is a less efficient use of the GPU as compared to calculating allele frequencies. As such, a nuance in optimizing *GO Fish* is how frequently to remove lost and fixed mutations from the active simulation. Despite the fact that computation on such mutations is wasted, calculating new allele frequencies is so fast that not filtering out lost and fixed mutations every generation and temporarily leaving them in the simulation actually results in faster runtimes. Eventually of course, the sheer number of lost and fixed mutations overwhelms even the GPU’s computational bandwidth and they must be removed. How often to compact for optimal simulation speed can be ascertained heuristically and is dependent on the number of mutations each generation in the simulation and the attributes of the GPU the simulation is running on. Figure 3 illustrates the algorithm for *GO Fish*, which combines parallel implementations of migration, selection, drift, and mutation with a compacting step run every X generations and again before the end of the simulation.
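One common way to build a compaction step is from a flagging pass, a prefix sum (scan), and a scatter. The Python sketch below illustrates that construction for a single population. It is a serial stand-in written for this explanation, not the *compact* implementation that *GO Fish* uses, and the function name is hypothetical.

```python
from itertools import accumulate

def compact_segregating(freqs):
    """Return only the frequencies strictly between 0 and 1, order preserved."""
    # 1. Flag: each mutation decides independently whether it is kept.
    keep = [1 if 0.0 < x < 1.0 else 0 for x in freqs]
    # 2. Scan: an exclusive prefix sum of the flags gives every surviving
    #    mutation its index in the output array. This is the step that
    #    needs communication between threads on a GPU.
    inclusive = list(accumulate(keep))
    offsets = [total - flag for total, flag in zip(inclusive, keep)]
    new_size = inclusive[-1] if inclusive else 0
    # 3. Scatter: each surviving mutation writes itself to its own slot.
    out = [0.0] * new_size
    for i, x in enumerate(freqs):
        if keep[i]:
            out[offsets[i]] = x
    return out
```

The scan answers both questions raised above: its last element is the size of the next generation's mutation array, and its other elements are the positions at which each surviving mutation is placed.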

### The Population Genetics Model of GO Fish

A more detailed description of the implementation of the Wright-Fisher algorithm underlying *GO Fish*, with derivations of the equations below, can be found in the Appendix. Table 1 provides a glossary of the variables used in the simulation.

The simulation can start with an empty initial mutation array, with the output of a previous simulation run, or with the frequencies of the initial mutation array in mutation-selection equilibrium. Starting a simulation as a blank canvas provides the most flexibility in the starting evolutionary scenario. However, to reach an equilibrium start point requires a “burn-in”, which may be quite a large number of generations [11]. To save time, if a starting scenario is shared across multiple simulations, then the post-burn-in mutation array can be simulated beforehand, stored, and input as the initial mutation array for the next set of simulations. Alternatively, the simulation can be initialized in a calculable, approximate mutation-selection equilibrium state, allowing the simulation of the evolutionary scenario of interest to begin essentially immediately. λ_{μ}(*x*) is the expected (mean) number of mutations at a given frequency, *x*, in the population at mutation-selection equilibrium and can be calculated via the following equation:

The derivation for eq. 1 can be found in the Appendix (eq. 1-6 in the Appendix). The numerical integration required to calculate λ_{μ}(*x*) has been parallelized and accelerated on the GPU. To start the simulation, the actual number of mutations at each frequency is determined by draws from the Inverse Poisson distribution with mean and variance λ_{μ}(*x*). This numerical initialization routine can handle most of the equilibrium evolutionary scenarios the main simulation is capable of itself – a major exception being those cases with migration between multiple populations. Given the number of cases covered by the above integration technique, this is likely to be the primary method to start a *GO Fish* simulation in a state of mutation-selection equilibrium.

After initialization begins the cycle of adding new mutations to the population and calculating new frequencies for currently segregating mutations. The number of new mutations introduced in each population *j*, for each generation *t*, is Poisson distributed with mean *N_{e}μL* in accordance with the assumptions of the Poisson Random Field Model. These new mutations start at frequency 1/*N_{e}* in the simulation. Meanwhile, the SNP frequencies of the extant mutations in the current generation, *t*, and population, *j*, are modified by the forces of migration (I.), selection (II.), and drift (III.) to produce the new frequencies of those mutations in generation *t+1*.

I. *GO Fish* uses a conservative model of migration [19] where the new allele frequency, *x_{mig}*, in population *j* is the average of the allele frequency in all the populations weighted by the migration rate from each population to population *j*.

II. Selection further modifies the expected frequency of the mutations in population *j* according to eq. 2 below:

The derivation for eq. 2 can be found in the Appendix (eq. 8-13 in the Appendix). The variable *x_{mig,sel}* represents the expected frequency of an allele in generation *t+1*.

III. Drift, which is modeled as a binomial random deviation with mean *N_{e}x_{mig,sel}* and variance *N_{e}x_{mig,sel}*(1-*x_{mig,sel}*), then acts on top of the deterministic forces of migration and selection to produce the ultimate frequency of the allele in the next generation, *t+1*, in population *j*, *x_{t+1,j}*. Then the cycle repeats.

## Results and Discussion

To test the speed improvements from parallelizing the Wright-Fisher algorithm, *GO Fish* was compared to a serial Wright-Fisher simulation written in C++. Each program was run on two computers: an iMac and a self-built Linux-box with equivalent Intel Haswell CPUs, but very different NVIDIA GPUs. Constrained by the thermal and space requirements of laptops and all-in-one machines, the iMac’s NVIDIA 780M GPU (1536 cores@823 MHz) is slower and older than the NVIDIA 980 (2048 cores@1380MHz) in the Linux-box. For a given number of simulated populations and number of generations, a key driver of execution time is the number of mutations in the simulation. Thus many different evolutionary scenarios will have similar runtimes if they result in similar numbers of mutations being simulated each generation. As such, to benchmark the acceleration provided by parallelization and GPUs, the programs were run using a basic evolutionary scenario while varying the number of expected mutations in the simulation. The utilized scenario is a simple, neutral simulation, starting in mutation-selection equilibrium, of a single, haploid population with a constant population size of 200,000 individuals over 1,000 generations and a mutation rate of 1x10^{-9} mutations per generation per individual per site. With these other parameters held constant, varying the number of sites in the simulation adjusts the number of expected mutations for each of the benchmark simulations.

As shown in Figure 4, accelerating the Wright-Fisher simulation on a GPU results in massive performance gains on both an older, mobile GPU like the NVIDIA 780M and a newer, desktop-class NVIDIA 980 GPU. For example, when simulating the frequency trajectories of ~500,000 mutations over 1,000 generations, *GO Fish* takes ~0.2s to run on a 780M as compared to ~18s for its serial counterpart running on the Intel i5/i7 CPU (@3.9 GHz), a speedup of 88-fold. On a full, modern desktop GPU like the 980, *GO Fish* runs this scenario ~176x faster than the strictly serial simulation and only takes about 0.1s to run. As the number of mutations in the simulation grows, more work is tasked to the GPU and the relative speedup of GPU to CPU increases logarithmically. Eventually though, the sheer number of simulated SNPs saturates even the computational throughput of the GPUs, producing linear increases in runtime for increasing SNP counts, like for serial code. Thus, eventually, there is a flattening of the fold performance gains. This plateau occurs earlier for the 780M than for the more powerful 980 with its more and faster cores. Executed serially on the CPU, a huge simulation of ~4x10^{7} SNPs takes roughly 24min to run versus only ~13s/5.7s for *GO Fish* on the 780M/980, an acceleration of more than 109/250-fold. While not benchmarked here, the parallel Wright-Fisher algorithm is also trivial to partition over multi-GPU setups in order to further accelerate simulations.

Tools employing the single-locus Wright-Fisher framework are widely used in population genetics analyses to estimate selection coefficients and infer demography (see [11, 20-24] for examples). Often these tools employ either a numerically solved diffusion approximation, or even the simple analytical function, to generate the expected SFS of a given evolutionary scenario, which can then be used to calculate the likelihood of producing an observed SFS (ref). The model parameters of the evolutionary scenario are then fit to the data by maximizing the composite likelihood (ref). With *GO Fish*, forward simulation can generate the expected spectra. To validate these expected spectra, the results of *GO Fish* simulations were compared against *δaδi* [15] for a complex evolutionary scenario involving a single population splitting into two, exponential growth, selection, and migration (Figure 5). The spectra generated by each program are identical. Interestingly, the two programs also had essentially identical runtimes for this scenario and hardware (Figure 5). In general, the relative compute time will vary depending on the simulation size for *GO Fish*, the grid size & time-step for *δaδi* [15], and the simulation scenario & hardware for both.

For maximum-likelihood and Bayesian statistics as for parametric bootstraps and confidence intervals, hundreds, thousands, even tens of thousands of distinct parameter values may need to be simulated to yield the needed statistics for a given model. Multiplying this by the need to often consider multiple evolutionary models as well as nonparametric bootstrapping of the data, a single serial simulation run on a CPU taking only 18s, as in the simple simulation of ~500,000 SNPs presented in Figure 4, can add up to hours, even days of compute time. Moreover, and in contrast to the approximating analytical or numerical solutions typically employed, simulating the expected SFS introduces random noise around the “true” SFS of the scenario being modeled. Figure S1 demonstrates how increasing the number of simulated SNPs increases the precision of the simulation – and therefore of the ensuing likelihood calculations. Simulating tens of millions of SNPs, wherein a single run on the CPU can take nearly half-an-hour, can be imperative to obtain a high-precision SFS needed for certain situations. Thus, the speed boost from parallelization on the GPU in calculating the underlying, expected SFS greatly enhances the practical utility of simulation for many current data analysis approaches. The speed and validation results demonstrate that, now with *GO Fish*, one can not only track allele trajectories in record time, but also generate SFS by using forward simulations in roughly the same time-frame as by solving diffusion equations. Just as importantly, *GO Fish* achieves the increase in performance without sacrificing flexibility in the evolutionary scenarios it is capable of simulating.

*GO Fish* can simulate mutations across multiple populations for comparative population genomics, with no limits to the number of populations allowed. Population size, migration rates, inbreeding, dominance, and mutation rate are all user-specifiable functions capable of varying over time and between different populations. Selection is likewise a user-specifiable function parameterized not only by generation and population, but also by allele frequency, allowing for the modeling of frequency-dependent selection as well as time-dependent and population-specific selection. By tuning the inbreeding and dominance parameters, *GO Fish* can simulate the full range of single-locus dynamics for both haploids and diploids with everything from outbred to inbred populations and overdominant to underdominant alleles. GPU-accelerated Wright-Fisher simulations thus provide extensive flexibility to model unique and complex demographic and selection scenarios beyond what many current site frequency spectrum analysis methods can employ.

Paired with a coalescent simulator, *GO Fish* can also accelerate the forward simulation component in forwards-backwards approaches (see [16, 25]). In addition, *GO Fish* is able to track the age of mutations in the simulation providing an estimate of the distribution of the allele ages, or even the age by frequency distribution, for mutations in an observed SFS. Further, the age of mutations is one element of a unique identifier for each mutation in the simulation, which allows the frequency trajectory of individual mutations to be tracked through time. This ability to sample ancestral states and then track the mutations throughout the simulation can be used to contrast the population frequencies of polymorphisms from ancient DNA with those present in modern populations for powerful population genetics analyses [26]. By accelerating the single-locus forward simulation on the GPU, *GO Fish* broadens the capabilities of SFS-analysis approaches in population genetic studies.

Across the field of population genetics and evolution, there exist a wide range of computationally intensive problems that could benefit from parallelization. The algorithms presented and discussed in Figure 2 represent a subset of the essential parallel algorithms, which more complex algorithms modify or build upon. Applications of these parallel algorithms are already wide-ranging in bioinformatics: motif finding [27], global and local DNA and protein alignment [28-31], short read alignment and SNP calling [32, 33], haplotyping and the imputation of genotypes [34], analysis for genome-wide association studies [35, 36], and mapping phenotype to genotype and epistatic interactions across the genome [37, 38]. In molecular evolution, the basic algorithms underlying the building of phylogenetic trees and analyzing sequence divergence between species have likewise been GPU-accelerated [39, 40]. Further, there are parallel methods for general statistical and computational methods, like Markov Chain Monte Carlo and Bayesian analysis, useful in computational evolution and population genetics [41, 42].

Future work on the single-locus Wright-Fisher algorithm will include extending the parallel structure of *GO Fish* to allow for multiple alleles as well as multiple mutational events at a site, relaxing one of the key assumptions of the Poisson Random Field [4]. At present, neither running simulations with long divergence times between populations nor any scenario where the number of extant mutations in the simulation rises to too high a proportion of the total number of sites is theoretically consistent with the Poisson Random Field model underpinning the current version of *GO Fish*. Beyond *GO Fish*, solving Wright-Fisher diffusion equations in programs like *δaδi* [15] can likewise be sped up through parallelization on the GPU [43-46].

Unfortunately, while the effects of linkage and linked selection across the genome can be mitigated in analyses using a single-locus framework [15, 24, 47], these effects cannot be examined and measured whilst assuming independence amongst sites. Expanding from the study of independent loci to modeling the evolution of haplotypes and chromosomes, simulations with the coalescent framework or forward Wright-Fisher algorithm with linkage can also be accelerated on GPUs. The coalescent approach has already been shown to benefit from parallelization over multiple CPU cores (see [48]). While Montemuiño et al. achieved their speed boost by running multiple independent simulations concurrently, they noted that parallelizing the coalescent algorithm itself may also accelerate individual simulations over GPUs [48]. Likewise, multiple independent runs of the full forward simulation with linkage can be run concurrently over multiple cores and the individual runs might themselves be accelerated by parallelization of the forward algorithm. The forward simulation with linkage has many *embarrassingly parallel* steps, as well as those that can be refactored into one of the core parallel algorithms. The closely related *genetic algorithm*, used to solve difficult optimization problems, has already been parallelized and, under many conditions, greatly accelerated on GPUs [49-51]. However, not all algorithms will benefit from parallelization and execution on GPUs – the real-world performance of any parallelized algorithm will depend on the details of the implementation [50, 51]. While the extent of the performance increase will vary from application to application, each of these represents a key algorithm whose potential acceleration could provide huge benefits for the field [12, 13].

These potential benefits extend to lowering the cost barrier for students and researchers to run intensive computational analyses in population genetics. The *GO Fish* results demonstrate how powerful even an older, mobile GPU can be at executing parallel workloads, which means that *GO Fish* can be run on everything from GPUs in high-end compute clusters to a GPU in a personal laptop and still achieve a great speedup over traditional serial programs. A batch of single-locus Wright-Fisher simulations that might have taken a hundred CPU-hours or more to complete on a cluster can be done, with *GO Fish*, in an hour on a laptop. Moreover, graphics cards and massively parallel processors in general are evolving quickly. While this paper has focused on NVIDIA GPUs and CUDA, the capability to take advantage of the massive parallelization inherent in the Wright-Fisher algorithm is the key to accelerating the simulation and in the High Performance Computing market there are several avenues to achieve the performance gains presented here. For instance, OpenCL is another popular low-level language for parallel programming and can be used to program NVIDIA, AMD, Altera, Xilinx, and Intel solutions for massively parallel computation, which include GPUs, CPUs, and even Field Programmable Gate Arrays (FPGAs) [52-54]. The parallel algorithm of *GO Fish* can be applied to all of these tools. Whichever platform(s) or language(s) researchers choose to utilize, the future of computation in population genetics is massively parallel and exceedingly fast.

## Acknowledgements

The author would like to thank Nandita Garud, Heather Machado, Philipp Messer, Kathleen Nguyen, Sergey Nuzhdin, Peter Ralph, Kevin Thornton, and two anonymous reviewers for providing feedback and helpful suggestions to improve this paper.

## Appendix – Parallel Wright-Fisher Simulation Details

### Simulation Initialization

Simulations can be initialized in one of three ways: 1) a blank canvas, 2) from the results of a previous simulation, and 3) mutation-selection equilibrium. Starting a simulation as a blank canvas provides the most flexibility in what evolutionary state the simulation begins and thus any evolutionary scenario can be simulated from the beginning. However, as the simulation starts with no mutations present, a “burn-in” time is necessary to reach the point where the simulation of the scenario of interest can begin. The number of “burn-in” generations may be quite long, particularly to reach any kind of equilibrium state where selection, mutation, migration, and drift are all in balance and the number of mutations being fixed and lost is equal to the number of new mutations in the population(s). To save time, if a starting scenario is shared across multiple simulations, then the post-burn-in mutation array can be simulated beforehand, stored, and input as the initial mutation array for the next set of simulations.

Another way to jump start the simulation is by assuming all extant populations are in mutation-selection balance at the beginning of the simulation. Under general mutation-selection equilibrium (MSE), the proportion of mutations at every frequency in the population can be calculated via numerical integration over a continuous frequency diffusion approximation (see [3]). While this constrains the starting evolutionary state to mutation-selection equilibrium, this allows one to then start simulating the selection and demographic scenario of interest immediately. Due to current limitations of the MSE model in *GO Fish*, the mutation-selection equilibrium scenario does not, as of yet, include migration from other populations or random fluctuations in selection intensity – nor can the code calculate the number of generations ago a mutation at frequency *x* is expected to have arisen. Instead, all mutations in the initial mutation array are said to have arisen at time t = 0. The model is detailed below:

Using the glossary from Table 1, for any given population j at time t = 0:

*μ* = *μ*(*j*,0), s(*x*) = s(*j*,0,*x*), etc…

From Kimura p. 220-222 [3]:
where $N_e = 2N/(1+F)$

$\lambda_{\mu}(x)$ is the expected (mean) number of mutations at a given frequency, $x$, in the population at mutation-selection equilibrium. $V(x)$ and $M(x)$ are the contributions of drift and selection, respectively, to the rate of change of a mutation’s frequency at frequency $x$ in the population. Since this is an allele-based simulation, I use the equilibrium value of the effective number of chromosomes, $N_e$, to account for inbreeding amongst $N$ individuals.

The total rate of frequency change is the average of the rate of change of the effective haploid proportion of the population and the effective diploid proportion of the population weighted by *F*.

Substituting eq. 3 and 5 into eq. 1 yields:

More familiar versions of eq. 6 can be derived by assuming neutrality or by assuming no frequency-dependent selection and either codominance or haploid/completely inbred individuals.

- if $s(x) = 0 \;\forall\, x \in (0,1)$ (neutral) → $\lambda_{\mu}(x) = 2\mu L / x$
- if $s(x) = s \;\forall\, x \in (0,1)$ and ($h = 0.5$ or $F = 1$) →
  - where if $h = 0.5$ (codominant) → $N_e = 2N/(1+F)$
  - where if $F = 1$ (haploid) → $N_e = N$

I approximate the integrals in eq. 6 using trapezoidal numerical integration and use the *scan* parallel algorithm implemented in CUB 1.6.4 [57] to accelerate the integration on the GPU*. λ_{μ}(*x*) is the expected (mean) number of mutations. To determine the actual number of mutations at a given frequency, *x*, I generate random numbers from the Inverse Poisson distribution with mean λ_{μ}(*x*) using the following procedure:

- Random number generator Philox [58] generates a uniform random number between 0 and 1.
- If $\lambda_{\mu}(x) \le 6$, then that uniform variable is fed into the exact Inverse Poisson CDF.
- If $\lambda_{\mu}(x) > 6$, then a Normal approximation to the Poisson is used.
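The procedure above can be sketched in a few lines of serial Python. This is only an illustration under stated assumptions: Python's `random.Random` stands in for Philox, the function name `poisson_from_uniform` and the `max_k` safety cap are hypothetical, and the rounding used in the Normal branch is one reasonable choice rather than the one *GO Fish* necessarily makes.

```python
import math
import random
from statistics import NormalDist

def poisson_from_uniform(u, lam, threshold=6.0, max_k=1000):
    """Map a uniform draw u in [0, 1) to a Poisson(lam) count (illustrative helper)."""
    if lam <= threshold:
        # Exact inverse CDF: smallest k such that P(X <= k) >= u.
        k = 0
        pmf = math.exp(-lam)          # P(X = 0)
        cdf = pmf
        while cdf < u and k < max_k:  # max_k guards against round-off near u = 1
            k += 1
            pmf *= lam / k            # P(X = k) obtained from P(X = k - 1)
            cdf += pmf
        return k
    # Normal approximation with mean lam and variance lam, rounded to a count.
    u = min(max(u, 1e-12), 1.0 - 1e-12)   # keep the argument of inv_cdf inside (0, 1)
    z = NormalDist().inv_cdf(u)
    return max(0, round(lam + math.sqrt(lam) * z))

rng = random.Random(42)  # assumption: stand-in for the Philox generator
counts = [poisson_from_uniform(rng.random(), lam) for lam in (0.5, 3.0, 50.0)]
```

Because each count depends only on its own uniform draw and its own mean, every frequency class can be handled by a separate thread.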

Adding all the new mutations at every frequency to the starting mutation array is an embarrassingly parallel problem. Thus, combined with the parallel numerical integration for the definite integral components of eq. 6, initializing the simulation at mutation-selection equilibrium is overall greatly accelerated on the GPU relative to serial algorithms on the CPU.
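To make the link between trapezoidal integration and a scan concrete, the following serial sketch computes the running integral at every grid point with a prefix sum. It is an illustration only: `itertools.accumulate` plays the role of the parallel scan from CUB, the function names are hypothetical, and the integrand in the usage lines is a placeholder rather than the integrand of eq. 6.

```python
from itertools import accumulate

def cumulative_trapezoid(f, a, b, n):
    """Grid points and the running trapezoidal integral of f from a to each point."""
    h = (b - a) / n
    xs = [a + i * h for i in range(n + 1)]
    ys = [f(x) for x in xs]   # independent evaluations, one per grid point
    areas = [0.5 * h * (ys[i] + ys[i + 1]) for i in range(n)]  # one trapezoid per interval
    running = [0.0] + list(accumulate(areas))  # inclusive prefix sum (the "scan")
    return xs, running

def tail_integrals(running):
    """Integral from each grid point up to the upper limit b."""
    total = running[-1]
    return [total - r for r in running]

# Placeholder integrand 2y on [0, 1], whose running integral is y squared.
xs, running = cumulative_trapezoid(lambda y: 2.0 * y, 0.0, 1.0, 1000)
tails = tail_integrals(running)
```

A single scan yields the integral up to every frequency at once, which is why the definite integrals for all frequency classes can be obtained together rather than one at a time.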

***An Aside About Numerical Precision, GPUs, and Numerical Integration:** For a bit of background, CPUs employ a Floating-point Processor Unit with 80-bits of precision for serial floating-point computation, which then quickly translates the result into double-precision (64-bit) for the CPU registers. Thus, CPU programs, including the serial Wright-Fisher simulation, are often written with double-precision performance in mind. In contrast, most consumer GPU applications are geared towards single-precision (32-bit) computation (e.g. graphics) and many consumer GPUs have relatively poor double-precision performance. More expensive, professional-grade workstation GPUs often have far better double-precision performance than their consumer counterparts. As the Wright-Fisher simulation does not actually require 64-bits of precision for its calculations, *GO Fish* has been written with 32-bits of precision computation in mind. This is even true of the MSE Integration step where the naturally pair-wise summation of parallel *scanning* mitigates the round-off error when performing large numbers of consecutive sums in 32-bit [59]. That said, the mutation frequencies stored in the simulation have only single-precision floating-point accuracy. Experiments using CPU serial Wright-Fisher simulations showed consistent results between storing mutation frequencies with 32-bits vs. 64-bits of precision.
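The effect of summation order on 32-bit round-off can be seen in a small experiment. The sketch below emulates single precision in pure Python by rounding through `struct` after every addition, then compares a left-to-right running sum against a tree-shaped (pairwise) sum. The helper names are hypothetical, and the constant addend is chosen only to make the drift easy to see; it is a favorable case for the pairwise order, not a benchmark of the actual scan.

```python
import struct

def f32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

def serial_sum32(values):
    """Left-to-right accumulation, rounding to 32 bits after every add."""
    total = f32(0.0)
    for v in values:
        total = f32(total + v)
    return total

def pairwise_sum32(values):
    """Tree-shaped accumulation: sum each half, then add the two partial sums."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return f32(pairwise_sum32(values[:mid]) + pairwise_sum32(values[mid:]))

values = [f32(0.1)] * 65536   # 2**16 identical single-precision terms
exact = 6553.6                # reference value computed by hand
serial_err = abs(serial_sum32(values) - exact)
pairwise_err = abs(pairwise_sum32(values) - exact)
```

In the serial sum each small term is added to an ever larger total, so the rounding error of every step accumulates; in the pairwise sum the operands of each addition have similar magnitude, which keeps the error growth much slower.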

#### Steps to *Calculate New Allele Frequencies*

Migration, selection, and drift determine the frequency of an allele in the next generation, $x_{t+1}$, based on its current frequency, $x_t$. Migration and selection are deterministic forces whereas drift introduces binomial random chance. While these three steps can, in principle, be done in any order, their order in the simulation is as follows:

1. Migration
2. Selection (with Inbreeding)
3. Drift (with Inbreeding)

##### I. Migration

Using the glossary from Table 1, in population *j* at time *t*:

$m(k) = m(k, j, t)$, $x_{t,k}$ ≡ freq. of allele in pop. $k$ at time $t$, $x_{mig} = x_{mig,j}$ ≡ freq. of allele in pop. $j$ after migration,

where

*GO Fish* uses a conservative model of migration [19]. The new allele frequency in population *j* is the average of the allele frequencies in all the populations, weighted by the migration rate from each population to population *j*. The migration rate is specified as the proportion of chromosomes in population *j* that come from population *k*.
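The weighted average described above can be written out as a short serial sketch. The function name `migrate` and the nested-list layout `mig[k][j]` are hypothetical. The sketch also assumes that the contributions into each population, including the resident share with *k* = *j*, sum to 1, which is how a proportion of chromosomes would behave.

```python
def migrate(freqs, mig):
    """Post-migration allele frequencies.

    freqs[k]  : allele frequency in population k before migration
    mig[k][j] : proportion of population j's chromosomes that come from population k
    """
    n_pops = len(freqs)
    new_freqs = []
    for j in range(n_pops):
        # Assumption: shares into population j (including k == j) sum to 1.
        total_weight = sum(mig[k][j] for k in range(n_pops))
        if abs(total_weight - 1.0) > 1e-9:
            raise ValueError("migration shares into each population should sum to 1")
        # Weighted average of the source frequencies.
        new_freqs.append(sum(mig[k][j] * freqs[k] for k in range(n_pops)))
    return new_freqs

# Two populations: population 0 receives 10% of its chromosomes from population 1,
# and population 1 receives 20% of its chromosomes from population 0.
x_mig = migrate([0.5, 0.1], [[0.9, 0.2], [0.1, 0.8]])
```

Each population's new frequency depends only on the pre-migration frequencies, so all populations and all mutations can be updated independently.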

##### II. Selection (with Inbreeding)

In population *j* at time *t*:

$x_{mig} = x_{mig,j}$ ≡ freq. of allele after migration, $y_{mig} = 1 - x_{mig}$, $x_{mig,sel} = x_{mig,sel,j}$ ≡ freq. of allele after migration and selection, $P_{AA}$, $P_{Aa}$, $P_{aa}$ ≡ frequency of genotype $AA$, $Aa$, and $aa$,

$s(x) = s(j, t, x)$, $h = h(j, t)$, $w = w_j$ ≡ average pop. $j$ fitness, $n = n_j$ ≡ average pop. $j$ fitness of allele $A$

Like with M(*x*) in eq. 4, *w* and *n* are a weighted average of the effective haploid (inbred) and diploid (outbred) portions of the chromosome population. Diploid genotype frequencies assume random mating and Hardy-Weinberg equilibrium [60, 61].

Following the same logic as above:

Substituting eq. 11*d* and 12*b* into eq. 10 yields:

Again, like for eq. 6, more familiar forms of eq. 13 may be derived under certain assumptions such as neutrality, haploid/inbred individuals, and completely outbred diploids.

- if $s(x_{mig}) = 0 \;\forall\, x_{mig} \in (0,1)$ (neutral) → $x_{mig,sel} = x_{mig}$
- if $F = 1$ (haploid) →
- if $F = 0$ (diploid) →
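As a rough illustration of how a selection step with inbreeding can be computed, the sketch below weights an outbred diploid portion and an inbred, effectively haploid portion by *F*. It rests on assumptions that are not taken from the text: genotype fitnesses of 1 + *s*, 1 + *hs*, and 1 for *AA*, *Aa*, and *aa*, haploid fitnesses of 1 + *s* and 1, and a selection coefficient already evaluated at the current frequency. The exact parameterization of eqs. 10-13 may differ, and the function name `select` is hypothetical.

```python
def select(x, s, h, F):
    """Allele frequency after selection, given the post-migration frequency x."""
    y = 1.0 - x
    # Outbred diploid portion under Hardy-Weinberg proportions.
    # Assumed fitnesses: AA = 1 + s, Aa = 1 + h*s, aa = 1.
    w_dip = x * x * (1 + s) + 2 * x * y * (1 + h * s) + y * y   # mean fitness
    n_dip = x * (1 + s) + y * (1 + h * s)                       # mean fitness of allele A
    # Inbred portion treated as haploid. Assumed fitnesses: A = 1 + s, a = 1.
    w_hap = x * (1 + s) + y
    n_hap = 1 + s
    # Weight the two portions by the inbreeding coefficient F.
    w = F * w_hap + (1 - F) * w_dip
    n = F * n_hap + (1 - F) * n_dip
    return x * n / w

x_neutral = select(0.3, 0.0, 0.5, 0.2)   # no selection: frequency is unchanged
x_haploid = select(0.5, 0.1, 0.5, 1.0)   # F = 1: only the haploid terms remain
x_diploid = select(0.5, 0.1, 0.5, 0.0)   # F = 0: only the diploid terms remain
```

Under these assumptions the neutral case returns the input frequency, and setting *F* to 1 or 0 reduces the update to the standard haploid or random-mating diploid recursion.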

##### III. Drift (with Inbreeding)

For population *j* in generation *t*:

The variable $x_{mig,sel}$ represents the expected frequency of the allele in generation $t+1$. Drift is the random deviation of the actual frequency of the allele from this expectation. To determine the actual frequency of the allele in the next generation, $x_{t+1,j}$, I generate random numbers from the Inverse Binomial distribution with mean $N_e x_{mig,sel}$ and variance $N_e x_{mig,sel}(1 - x_{mig,sel})$ using the following procedure:

- Random number generator Philox [58] generates a uniform random number between 0 and 1.
- If $N_e x_{mig,sel} \le 6$, then that uniform variable is fed into the exact Inverse Poisson CDF as an approximation to the Binomial.
- If $N_e x_{mig,sel} > 6$, then a Normal approximation to the Binomial is used.

As $N_e = 2N/(1+F)$, inbreeding affects drift as well as selection.
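The drift step can be sketched serially as a mapping from one uniform draw to a next-generation frequency. The function name `drift`, the rounding in the Normal branch, and the clamping of the count to the range from 0 to the number of chromosomes are assumptions made for illustration; the uniform draw would come from Philox in the actual simulation.

```python
import math
from statistics import NormalDist

def drift(u, x_expected, n_e, threshold=6.0):
    """Map a uniform draw u in [0, 1) to the allele frequency in the next generation.

    x_expected : frequency after migration and selection
    n_e        : effective number of chromosomes (an integer in this sketch)
    """
    mean = n_e * x_expected
    if mean <= threshold:
        # Poisson(mean) inverse CDF stands in for the Binomial when copies are rare.
        count = 0
        pmf = math.exp(-mean)
        cdf = pmf
        while cdf < u and count < n_e:
            count += 1
            pmf *= mean / count
            cdf += pmf
    else:
        # Normal approximation using the Binomial's mean and variance.
        u = min(max(u, 1e-12), 1.0 - 1e-12)   # keep inv_cdf's argument inside (0, 1)
        sd = math.sqrt(mean * (1.0 - x_expected))
        count = round(mean + sd * NormalDist().inv_cdf(u))
    count = min(max(count, 0), n_e)           # a count must lie in [0, n_e]
    return count / n_e

x_common = drift(0.5, 0.5, 1000)     # median draw at an intermediate frequency
x_rare = drift(0.5, 0.001, 1000)     # median draw for a mutation with one expected copy
x_lost = drift(0.1, 0.001, 1000)     # a low draw loses the rare mutation
```

A returned frequency of 0 or 1 corresponds to loss or fixation of the mutation in that population.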

#### Adding New Mutations

Using the glossary from Table 1, for population *j* in generation *t*:

$\mu = \mu(j, t)$, $N_e = 2N(j, t)/(1+F)$

14) $\lambda_{\mu} = N_e \mu L$

starting frequency, $x = 1/N_e$

The Poisson Random Field shares an important assumption with Watterson’s infinite sites model in that regardless of how many sites are currently polymorphic, mutations will never strike a currently polymorphic site and the number of monomorphic sites that a mutation can occur at is always the total number of sites, *L* [4, 62]. Eq. 14 defines the expected number of mutations in population *j* for generation *t+1*. The actual number of new mutations is drawn from the Inverse Poisson distribution using the same procedure detailed in *Simulation Initialization*. New mutations can be added to generation *t+1* in parallel and simultaneously with the new frequency calculations. Each new mutation is given a 4-part unique ID consisting of the thread and compute device that birthed it (if more than one graphics card is used) as well as the generation and population in which it first arose.
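The bookkeeping for new mutations can be sketched as follows. Eq. 14 and the starting frequency are taken from the text; everything else is a hypothetical layout chosen for illustration, including the `MutationID` tuple, the function names, and the use of the loop index as the thread component of the ID. The number of new mutations is passed in as an argument and is assumed to have been drawn from the Poisson distribution beforehand.

```python
from collections import namedtuple

# Four-part ID: creating thread, compute device, generation, and population of origin.
MutationID = namedtuple("MutationID", ["thread", "device", "generation", "population"])

def expected_new_mutations(n_individuals, inbreeding, mu, num_sites):
    """Eq. 14: lambda_mu = N_e * mu * L, with N_e = 2N / (1 + F)."""
    n_e = 2.0 * n_individuals / (1.0 + inbreeding)
    return n_e * mu * num_sites, n_e

def add_new_mutations(num_new, n_e, generation, population, device=0):
    """Return (frequency, ID) pairs for num_new mutations, each starting as a single copy."""
    start_freq = 1.0 / n_e
    # Assumption: the loop index plays the role of the GPU thread creating the mutation.
    return [(start_freq, MutationID(i, device, generation, population))
            for i in range(num_new)]

lam, n_e = expected_new_mutations(n_individuals=1000, inbreeding=0.0,
                                  mu=1e-8, num_sites=1_000_000)
new_muts = add_new_mutations(num_new=3, n_e=n_e, generation=5, population=0)
```

Since every new mutation starts at the same frequency and receives an ID that depends only on where and when it arose, the entries can be written into the mutation array independently of one another.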