## Abstract

Extracting insight from population genetic data often demands computationally intensive modeling. dadi is a popular program for fitting models of demographic history and natural selection to such data. Here, I show that running dadi on a Graphics Processing Unit (GPU) can speed computation by orders of magnitude compared to the CPU implementation, with minimal user burden. This speed increase enables the analysis of more complex models, which motivated the extension of dadi to four- and five-population models. Remarkably, dadi performs almost as well on inexpensive consumer-grade GPUs as on expensive server-grade GPUs. GPU computing thus offers large and accessible benefits to the community of dadi users. This functionality is available in dadi version 2.1.0, https://bitbucket.org/gutenkunstlab/dadi/.

Population genetic data contain much information about the history of the sampled populations, but computationally intensive modeling is often necessary to extract that information. dadi is widely used for inferring models of demographic history (Gutenkunst et al., 2009) and natural selection (Kim et al., 2017) from population genetic data summarized in the form of an allele frequency spectrum. In a typical dadi analysis, the user specifies a model with parameters that represent population sizes, migration rates, divergence times, and/or selection coefficients. For a given set of parameters, dadi computes the expected allele frequency spectrum, from which the composite likelihood of the sample data can be calculated. Nonlinear optimization is then used to find parameter values that maximize that likelihood, and the maximum-likelihood values are then interpreted to gain insight into past population genetic processes. During optimization, the model will be evaluated hundreds or thousands of times, leading to substantial computational expense. Here, I show that computing on Graphics Processing Units (GPUs) can massively speed dadi model computation and thus inference.
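To make the likelihood step above concrete, here is a minimal sketch of a composite log-likelihood in which each frequency-spectrum entry is treated as an independent Poisson variable. This is an illustration of the calculation, not dadi's actual API; the function name and flattened-list representation are my own.

```python
import math

def poisson_composite_ll(model, data):
    """Composite log-likelihood of an observed frequency spectrum
    (`data`, flattened to a list) given an expected one (`model`),
    treating each entry as an independent Poisson draw and dropping
    the constant log-factorial term. A sketch, not dadi's actual API."""
    return sum(d * math.log(m) - m for m, d in zip(model, data))
```

During optimization, a routine like this would be evaluated once per trial parameter set, after the (expensive) computation of the expected spectrum.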

Modern graphics processing units (GPUs) provide enormous computing power for data-parallel tasks, in which the same operations are applied to many entries in memory (Owens et al., 2008). But exploiting this power often demands new algorithms. For example, in computational biology there has been extensive research into GPU algorithms for sequence alignment (Manavski & Valle, 2008) and search (Vouzis & Sahinidis, 2011). In genomics, GPU algorithms have been developed for variant calling (Luo et al., 2013) and secondary analysis (Luo et al., 2014) of short-read sequencing data. But GPU computing has rarely been applied to population genetic simulation or inference. Recently, Lawrie (2017) developed a GPU implementation of the single-locus Wright-Fisher model, finding speedups of over 250 times compared to a CPU implementation. Previously, Zhou et al. (2015) implemented a subset of the IM program for inferring isolation-with-migration demographic models (Hey & Nielsen, 2004) on a GPU, demonstrating speedups of around 50 times. Here, I show that GPU computing can dramatically speed up dadi analyses.

For dadi, the limiting computation is the numerical solution of a partial differential equation (PDE) to model the dynamics of the population distribution of allele frequencies *ϕ* (Kimura, 1964). In dadi, this PDE is solved using an alternating direction implicit scheme based on the Crank-Nicolson method (Fig. 1A; Press et al. (2007)). Evolving *ϕ* forward in time then reduces to solving a large number of tridiagonal linear systems. Model parameters, including population sizes, migration rates, and selection coefficients, affect the *a, b*, and *c* diagonal vectors of these systems (Fig. 1A). Unlike general matrix equations, the tridiagonal structure enables these systems to be solved in linear time, using the serial Thomas algorithm (Press et al., 2007).
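To illustrate why the tridiagonal structure matters, here is a minimal Python sketch of the serial Thomas algorithm, the textbook forward-sweep/back-substitution solver (this is a generic illustration, not dadi's internal implementation):

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system in O(n) time via the Thomas algorithm.

    a: sub-diagonal (a[0] unused), b: main diagonal,
    c: super-diagonal (c[-1] unused), d: right-hand side.
    Returns the solution vector x.
    """
    n = len(d)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    # Forward sweep: eliminate the sub-diagonal.
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    # Back substitution.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Because each entry is visited a constant number of times, the cost is linear in the system size, versus cubic for a general dense solve.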

Because diffusion PDEs similar to those in dadi are encountered in many fields, substantial research has been done on GPU algorithms for tridiagonal systems. Early work by Pixar employed the parallel cyclic-reduction algorithm (Hockney, 1965) to solve tridiagonal systems that arise in computer graphics on the GPU (Kass et al., 2006). Later work demonstrated that optimizing such algorithms is complex and that some parallel algorithms are numerically unstable (Zhang et al., 2010). Recently, Valero-Lara et al. (2018) showed that when the number of linear systems to be solved is large, the serial Thomas algorithm can be more efficient on the GPU than these parallel algorithms. This is because the Thomas algorithm enables more efficient memory access, if the diagonal elements *a, b*, and *c* are stored in dense matrices (Fig. 1A).

Programming GPUs demands special-purpose frameworks and libraries. Two frameworks dominate the field. The Open Computing Language (OpenCL) is a standard for writing parallel code that is portable across CPUs, GPUs, field-programmable gate arrays, and other specialized hardware (Stone et al., 2010). The Compute Unified Device Architecture (CUDA) is developed by the Nvidia Corporation specifically for use on its GPUs (Nickolls et al., 2008). In general, these frameworks offer similar performance and programming convenience (Holm et al., 2020). OpenCL is available on more platforms, particularly because CUDA is no longer supported on macOS, but more libraries are available for CUDA than OpenCL. In particular, the Valero-Lara et al. (2018) algorithm for efficiently solving batches of tridiagonal systems was recently integrated into the CUDA standard library. Thus, the GPU implementation of dadi requires a CUDA-compatible GPU from Nvidia. dadi is primarily written in Python; to interface with CUDA I used the PyCUDA (Klöckner et al., 2012) and scikit-cuda (Givon et al., 2019) libraries, which can be easily installed from the Python Package Index using pip.

For the end user, dadi GPU usage is transparent, requiring only a single call to `dadi.cuda_enabled(True)`.
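In practice, this is a one-line change at the top of an analysis script (shown here as a sketch; it assumes a CUDA-compatible Nvidia GPU with the PyCUDA and scikit-cuda libraries installed):

```python
import dadi

# Route all subsequent model integrations to the GPU.
# Requires an Nvidia GPU plus PyCUDA and scikit-cuda.
dadi.cuda_enabled(True)
```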

To evaluate the performance of the dadi GPU implementation, I compared times to compute the population distribution of allele frequencies. The key parameter governing computation time is the number of grid points, *pts*, used to approximate the numerical solution of the PDE. In practice, the number of grid points must be larger than the largest data sample size, and many more points are needed when simulating strong selection (Kim et al., 2017). The *ϕ* matrix has dimensionality *P*, the number of populations modeled, so the number of elements in *ϕ* scales as *pts^P*, and so does the expected computation time. I tested models with differing numbers of populations, drawing where possible on the stdpopsim resource (Adrion et al., 2020), on multiple CPUs and GPUs. The benchmarking code is available in the dadi source repository: https://bitbucket.org/gutenkunstlab/dadi/src/master/examples/CUDA.
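This scaling is worth spelling out, since it drives all of the benchmarks that follow (the helper name here is my own, purely for illustration):

```python
def phi_elements(pts, P):
    """Number of entries in the phi array: one axis of length pts per
    population, so pts**P elements, and roughly proportional work per
    time step."""
    return pts ** P

# Doubling pts in a 3-population model multiplies the work by 2**3 = 8.
```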

dadi is most often used with two- or three-population models, and in both cases the GPU implementation can be substantially faster than the CPU implementation (Fig. 1B). My two-population test model was the three-epoch model estimated for African and European Drosophila melanogaster by Li & Stephan (2006). My three-population test model was the Out-of-Africa model for humans estimated by Gutenkunst et al. (2009). The server-grade Tesla P100 GPU is over 400 times faster than the corresponding system CPU for two populations and many grid points. Even the mid-range consumer-grade GeForce GTX 1650 Super GPU is over 10 times faster than the system CPU for large 2D and 3D systems. These speed differences are illustrated more directly in Fig. S1, which shows the ratios of the CPU and GPU times on the same systems. For all systems tested, the GPU began to outperform the CPU at around 150 grid points for two populations and at around 50 grid points for three populations, values regularly used in typical data analyses.

Given the dramatic speed up provided by GPU computing, I extended dadi to four- and five-population models. Tests with the four-population New World model from Gutenkunst et al. (2009) and the five-population archaic admixture model from Ragsdale & Gravel (2019) again showed that GPUs can substantially outperform CPUs.

Keys to efficient GPU computing include minimizing data transfer and managing memory usage. For dadi, the *ϕ* matrix is copied to the GPU at the outset of each integration function. All computations then remain on the GPU through potentially many time steps, until the *ϕ* matrix is copied back when the integration function exits. One subtlety is that during each time step the *ϕ* matrix must be transposed and reshaped between integration directions, to maintain the expected alignment of the *a*, *b*, *c*, and *ϕ* matrices for the batch tridiagonal solver algorithm. For example, integrating a three-population scenario with 100 grid points involves simultaneously solving 10,000 tridiagonal systems of size 100, so the natural 100×100×100 *ϕ* array must be reshaped to 100×10,000, then reshaped for the next direction of integration, and so forth. Memory is typically more limited on GPUs than on the host systems. For dadi, the full dense *a*, *b*, *c*, and *ϕ* matrices are stored in double precision. For *P* populations and *pts* grid points, memory usage is thus roughly 4 × 8 × *pts^P*/1024^3 gigabytes (GB). For 3 populations, a modern mid-range GPU with 4 GB of RAM can thus scale to *pts* = 400, while a high-end GPU with 24 GB of RAM can scale to *pts* = 900. A future implementation of dadi on the GPU could potentially compute only portions of the *a*, *b*, and *c* matrices at a time, reducing memory usage but substantially increasing code complexity.
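The memory bound above is easy to evaluate when planning a run. A minimal helper following that formula (the function name is my own):

```python
def gpu_memory_gb(pts, P):
    """Approximate GPU memory in GB for the four dense double-precision
    matrices (a, b, c, and phi): 4 matrices x 8 bytes per element x
    pts**P elements each, converted from bytes to gigabytes."""
    return 4 * 8 * pts ** P / 1024 ** 3
```

For three populations, this gives about 1.9 GB at *pts* = 400 and about 21.7 GB at *pts* = 900, consistent with the 4 GB and 24 GB cards mentioned above.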

GPU programming can involve unusual performance decisions. For example, if parameter values are constant during an integration, then the *a, b*, and *c* matrices do not change between time steps and could be cached. Remarkably, for systems large enough to favor the GPU over the CPU, it is faster to recalculate those matrices for each time step, rather than caching them (Fig. S2). In CUDA, computational threads are organized into blocks that can share local memory, and optimizing block size can be important for maximal performance (Ryoo et al., 2008). For dadi, no communication between threads is needed, so the optimal block size is large (Fig. S3).

For many users, the ultimate benefit of GPU computing is high performance at low cost. In this respect, the performance of the mid-range consumer-grade GeForce GTX 1650 Super GPU is remarkable. As of writing, the GeForce costs roughly $200, compared to roughly $2,500 for the high-end consumer-grade Titan RTX and roughly $6,000 for the server-grade Tesla P100. Yet the performance of the GeForce is within a factor of three of the high-end GPUs (Fig. S4), even though the P100 theoretically has a 30-fold advantage in double-precision operations per second. This suggests that the dadi workload is bound by memory bandwidth rather than arithmetic. It also shows that the massive benefits of GPU computing for dadi are easily accessible to end users.

GPU computing offers substantial performance benefits for dadi, with minimal user burden. These performance improvements increase dadi’s competitiveness with alternative methods for calculating the allele frequency spectrum, such as moments (Jouganous et al., 2017). They also make analysis of more complex models, including those with four and five populations, computationally feasible. Lastly, the large benefits of even consumer-grade GPUs for dadi suggest that GPU computing may also be worth considering for other software in computational population genetics.

## Acknowledgments

This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health (R01GM127348 to R.N.G.). This material is based upon High Performance Computing (HPC) resources supported by the University of Arizona TRIF, UITS, and Research, Innovation, and Impact (RII) and maintained by the UArizona Research Technologies department. I thank Xin Huang for benchmarking assistance and Andreas Klöckner for guidance to the Python CUDA ecosystem.