## Abstract

Inference for population genetics models is hindered by computationally intractable likelihoods. While this issue is tackled by likelihood-free methods, these approaches typically rely on handcrafted summary statistics of the data. In complex settings, designing and selecting suitable summary statistics is problematic, and results are very sensitive to such choices. In this paper, we introduce the first exchangeable feature representation for population genetic data, enabling likelihood-free inference to work directly with genotype data. This is achieved by means of a novel Bayesian likelihood-free inference framework, in which a permutation-invariant convolutional neural network learns the inverse functional relationship from the data to the posterior. We leverage access to scientific simulators to learn such likelihood-free function mappings, and establish a general framework for inference in a variety of simulation-based tasks. We demonstrate the power of our method on the recombination hotspot testing problem, outperforming the state-of-the-art.

## 1. Introduction

Statistical inference in complex population genetics models is challenging, as the likelihood is often both analytically and computationally intractable. These models are usually based on the coalescent (Kingman, 1982), a stochastic process describing the distribution over genealogies of a random sample of chromosomes from a large population. Unfortunately, standard coalescent-based likelihoods require integrating over a large set of correlated high-dimensional combinatorial objects, rendering classical inferential techniques inapplicable. This limitation can be overcome by likelihood-free methods such as Approximate Bayesian Computation (ABC) (Beaumont et al., 2002) and deep learning (Sheehan & Song, 2016). These approaches leverage scientific simulators to draw samples from the generative model, and reduce population genetic data to a suite of summary statistics prior to performing inference. However, hand-engineered feature representations typically are not statistically sufficient for the parameter of interest, leading to loss in accuracy. In addition, these statistics are often based on the intuition of the user, need to be modified for each new task, and, in the case of ABC, are not amenable to hyperparameter optimization strategies since the quality of the approximation is unknown.

Deep learning offers the possibility to avoid the need for hand-designed summary statistics in population genetic inference and work directly with genotype data. The goal of this work is to develop a scalable general-purpose inference framework for raw genetic data, without the need for summary statistics. We achieve this by designing a neural network which exploits the exchangeability in the underlying data to learn feature representations that can approximate the posterior accurately.

As a concrete example, we focus on the problem of recombination hotspot testing. Recombination is a biological process of fundamental importance, in which the reciprocal exchange of DNA during cell division creates new combinations of genetic variants. Experiments have shown that some species exhibit *recombination hotspots*, that is, short segments of the genome with high intensity recombination rates (Petes, 2001). The task of recombination hotspot testing is to predict the location of recombination hotspots given genetic polymorphism data. Accurately localizing recombination hotspots would illuminate the biological mechanism that underlies recombination, and could help geneticists map the alleles causing genetic diseases (Hey, 2004). We demonstrate in experiments that we achieve state-of-the-art performance on the hotspot detection problem.

Our contributions focus on addressing major inferential challenges of complex population genetic inference. In Section 2 we review relevant lines of work in both the fields of machine learning and population genetics. In Section 3 we propose a scalable Bayesian likelihood-free inference framework for exchangeable data, which may be broadly applicable to many population genetic problems as well as more general simulator-based machine learning tasks. The application to population genetics is detailed in Section 4. In particular, we show how this allows for direct inference on the raw population genetic data, bypassing the need for *ad hoc* summary statistics. In Section 5 we run experiments to validate our method and demonstrate state-of-the-art performance in the hotspot detection problem.

## 2. Related Work

Likelihood-free methods like ABC have been widely employed in population genetics (Beaumont et al., 2002; Boitard et al., 2016; Wegmann et al., 2009; Sousa et al., 2009). In ABC the parameter of interest is simulated from its prior distribution, and data are subsequently simulated from the generative model and reduced to a pre-chosen set of summary statistics. These statistics are compared to the summary statistics of the real data, and the simulated parameter is weighted according to the similarity of the statistics to derive an empirical estimate of the posterior distribution. However, choosing summary statistics for ABC is challenging because there is a trade-off between loss of sufficiency and computational tractability. In addition, there is no direct way to evaluate the accuracy of the approximation.

Other likelihood-free approaches have emerged from the machine learning community and have been applied to population genetics, such as support vector machines (SVMs) (Schrider & Kern, 2015; Pavlidis et al., 2010), single-layer neural networks (Blum & François, 2010), and deep learning (Sheehan & Song, 2016). The connection between likelihood-free Bayesian inference and neural networks has also been studied previously by Jiang et al. (2015) and Papamakarios & Murray (2016). An attractive property of these methods is that, unlike ABC, they can be applied to multiple datasets without repeating the training process, which is commonly referred to as amortized inference. However, current practice in population genetics collapses the data to a set of summary statistics before passing it through the machine learning models. Therefore, the performance still rests on the ability to laboriously hand-engineer informative statistics, and must be repeated from scratch for each new problem setting.

The inferential accuracy and scalability of these methods can be improved by exploiting symmetries in the input data. Permutation-invariant models have previously been studied in machine learning for SVMs (Shivaswamy & Jebara, 2006) and have recently gained a surge of interest in the deep learning literature. Recent work on designing architectures for exchangeable data includes Ravanbakhsh et al. (2016), Guttenberg et al. (2016), and Zaheer et al. (2017), which exploit parameter sharing to encode invariances. To our knowledge, no prior work has been done on learning feature representations for exchangeable population genetic data.

We demonstrate these ideas on the problem of recombination hotspot testing. To this end, several methods have been developed (Fearnhead, 2006; Li et al., 2006; Wang & Rannala, 2009). However, none of these are scalable to the whole genome, with the exception of `LDhot` (Auton et al., 2014; Wall & Stevison, 2016), so we limit our comparison to this latter method. `LDhot` relies on a composite likelihood, which can be seen as an approximate likelihood for summaries of the data. It can be computed only for a restricted set of models (i.e., an unstructured population with piecewise constant population size), is unable to capture dependencies beyond those summaries, and scales at least cubically with the number of DNA sequences. The method we propose in this paper scales linearly in the number of sequences while using raw genetic data directly.

## 3. Methodology

In this section we propose a flexible framework to address the shortcomings of current likelihood-free methods. Although motivated by population genetics, we first lay out the ideas that generalize beyond this application. We describe the exchangeable representation in Section 3.1 and the training algorithm in Section 3.2, which are combined into a general likelihood-free inference framework in Section 3.3. The statistical properties of the method are studied in Section 3.4.

### 3.1. Feature Representation for Exchangeable Data

Population genetic datapoints x^{(i)} typically take the form of a binary matrix, where rows correspond to individuals and columns indicate the presence of a Single Nucleotide Polymorphism (SNP), namely a nucleotide variation at a given location of the DNA. For unstructured populations the order of individuals carries no information, hence the rows are exchangeable. More generally, given data **X** = (x^{(1)},…,x^{(N)}) where each x^{(i)} = (x^{(i)}_{1},…,x^{(i)}_{n}) ∈ ℝ^{n×d} is a collection of *n* rows x^{(i)}_{j} ∈ ℝ^{d}, we call **X** *exchangeably-structured* if, for every *i*,

ℙ(x^{(i)}_{1},…,x^{(i)}_{n}) = ℙ(x^{(i)}_{σ(1)},…,x^{(i)}_{σ(n)}),

for all permutations *σ* of the indices {1,…,*n*}.

To obtain an exchangeable feature representation of genotype data, we proceed as follows. Let Φ : ℝ^{d} → ℝ^{d_1} be a feature mapping applied to each row. We apply a symmetric function *g* : ℝ^{n×d_1} → ℝ^{d_2} to the feature-mapped datapoint to obtain (*g* ∘ Φ)(x^{(i)}) := *g*(Φ(x^{(i)}_{1}),…,Φ(x^{(i)}_{n})), a feature representation of the exchangeably-structured data. This representation is very general and can be adapted to various machine learning settings. For example, Φ could be some *a priori* fixed feature mapping (e.g., a kernel or summary statistics), in which case *g* should be chosen such that the resulting feature representation remains informative. More commonly, the mapping Φ needs to be learned (such as in kernel logistic regression or a deep neural network), hence we choose some fixed *g* such that subgradients can be backpropagated through *g* to Φ. Examples of such a function *g* include the element-wise sum, element-wise max, lexicographical sort, and higher-order moments. Throughout the paper, we parameterize Φ with a neural network and choose *g* to be the element-wise max function, so that (*g* ∘ Φ)(x^{(i)}) = max(Φ(x^{(i)}_{1}),…,Φ(x^{(i)}_{n})), with the max taken coordinate-wise. A variant of this representation was proposed by Ravanbakhsh et al. (2016) and Zaheer et al. (2017).

This embedding of exchangeably-structured data into a vector space is suitable for many tasks, such as regression or clustering. We focus on inference, in which the objective is to learn the function x^{(i)} ↦ ℙ(*θ* | x^{(i)}), where Θ is the space of all parameters *θ* and 𝒫(Θ) is the space of all probability distributions on Θ. Endowed with our exchangeable feature representation, a function *h* : ℝ^{d_2} → 𝒫(Θ) can be composed with our symmetric mapping to get (*h* ∘ *g* ∘ Φ)(x^{(i)}). For simplicity, throughout the rest of the paper we focus on binary classification, where *θ* ∈ {0,1}, so that (*h* ∘ *g* ∘ Φ)(x^{(i)}) can be parameterized by ℙ(*θ* = 1 | x^{(i)}, *ϕ*), where *ϕ* are nuisance parameters and *h* is parameterized as a neural network such that both *h* and Φ can be learned via backpropagation with a cross-entropy loss. Specifically, we will apply this construction to infer the presence of recombination hotspots, indicated by the parameter *θ*. The posterior ℙ(*θ* = 1 | x^{(i)}, *ϕ*) is estimated by applying a softmax so that the output lies in [0,1].

This exchangeable representation has many advantages. While it could be argued that a sufficiently flexible machine learning model could learn the structured exchangeability of the data, encoding exchangeability explicitly allows for faster per-iteration computation and improved learning efficiency, since data augmentation for exchangeability scales as *O*(*n*!). Enforcing exchangeability implicitly reduces the size of the input space from ℝ^{n×d} to the quotient space ℝ^{n×d}/*S*_{n}, where *S*_{n} is the symmetric group on *n* elements. A factorial reduction in input size leads to much more tractable inference for large *n*. In addition, choices of *g* for which *d*_{2} is independent of *n* (e.g., element-wise operations with output dimension independent of *n*) yield a representation that is robust to a differing number of exchangeable variables between train and test time. This property is particularly desirable for constructing feature representations of fixed dimension even in the presence of missing data.
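To make the construction concrete, here is a minimal numpy sketch of the representation *g* ∘ Φ, with Φ standing in as a fixed random one-layer feature map rather than a trained network; all names and dimensions are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: n exchangeable rows, d features per row, d1 mapped features.
n, d, d1 = 8, 5, 16

# Phi: a random one-hidden-layer feature map applied row-wise
# (W and b stand in for learned parameters).
W = rng.normal(size=(d, d1))
b = rng.normal(size=d1)

def phi(x):             # x: (n, d) -> (n, d1)
    return np.tanh(x @ W + b)

def g(features):        # element-wise max over the exchangeable axis
    return features.max(axis=0)    # (n, d1) -> (d1,)

def representation(x):  # (g ∘ Φ)(x)
    return g(phi(x))

x = rng.normal(size=(n, d))
perm = rng.permutation(n)

# The representation is exactly invariant to any permutation of the rows,
# and its dimension d1 does not depend on n.
assert representation(x).shape == (d1,)
assert np.allclose(representation(x), representation(x[perm]))
```

Because the output dimension is independent of *n*, the same sketch accepts matrices with any number of rows at test time, mirroring the robustness property described above.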

### 3.2. Simulation-on-the-fly

In statistical decision theory, the Bayes risk for a prior *π*(*θ*) is defined as *R*_{π} := inf_{T} 𝔼_{θ∼π}𝔼_{x|θ}[*l*(*θ*, *T*(x))], with *l* being the loss function and *T* an estimator. The excess risk over the Bayes risk resulting from an algorithm *A* with model class ℱ can be decomposed as

*R*(*f*_{A}) − *R*_{π} = (*R*(*f*_{A}) − *R*(*f̂*)) + (*R*(*f̂*) − inf_{f∈ℱ} *R*(*f*)) + (inf_{f∈ℱ} *R*(*f*) − *R*_{π}),

where *f*_{A} and *f̂* are the function obtained via algorithm *A* and the empirical risk minimizer within ℱ, respectively. The terms on the right-hand side are referred to as the optimization, estimation, and approximation errors, respectively. Often the goal of statistical decision theory is to minimize the excess risk, motivating algorithmic choices to control the three sources of error. For example, in supervised learning, overfitting is a consequence of large estimation error. Typically, for a sufficiently expressive neural network optimized via stochastic optimization techniques, the excess risk is dominated by the optimization and estimation errors.

When we have access to scientific simulators, the amount of training data available is limited only by the computational time available for simulation, so we propose simulating each training datapoint afresh such that there is exactly one epoch over the training data, i.e., no training example is ever seen twice. We refer to this as *simulation-on-the-fly*. A similar setting is commonly used in the reinforcement learning literature and was a key to recent successes of deep reinforcement learning in applications such as games (Silver et al., 2016; 2017), though it is rarely utilized in supervised learning since access to simulators is usually unavailable. In this setting, the algorithm can be run for as many iterations as needed until the estimation error is sufficiently small, eliminating the pitfalls of overfitting; with fixed training data, by contrast, additional iterations after the first epoch are not guaranteed to further reduce the estimation error. Furthermore, simulation-on-the-fly guarantees *R*(*f̂*) − inf_{f∈ℱ} *R*(*f*) → 0, and since inf_{f∈ℱ} *R*(*f*) − *R*_{π} → 0 for sufficiently large architectures by the Universal Approximation Theorem (Cybenko, 1989), we can conclude that *R*(*f*_{A}) → *R*_{π}. The population risk surface is also much smoother than the empirical risk surface for fixed training sets (shown empirically in Section 5). This reduces the number of poor local minima and, consequently, the optimization error.
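A minimal sketch of such a training loop, using a toy Gaussian simulator and a linear classifier in place of the coalescent simulator and the exchangeable network; every name, distribution, and hyperparameter here is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10           # data dimension of the toy simulator
w = np.zeros(d)  # classifier weights (a linear stand-in for h ∘ g ∘ Φ)

def simulate(batch):
    """Draw theta from the prior, then data from a toy simulator."""
    theta = rng.integers(0, 2, size=batch)                    # theta ~ Bernoulli(0.5)
    x = rng.normal(size=(batch, d)) + (2 * theta[:, None] - 1) * 0.5
    return x, theta

# Simulation-on-the-fly: every batch is freshly simulated, so each gradient
# step is an unbiased estimate of the *population* risk gradient and no
# example is ever revisited (exactly one epoch).
for t in range(500):
    x, theta = simulate(64)
    p = 1.0 / (1.0 + np.exp(-(x @ w)))            # predicted posterior
    w -= 0.1 * (x.T @ (p - theta)) / len(theta)   # cross-entropy SGD step

# No held-out set is needed: test data is simply simulated afresh.
x_test, theta_test = simulate(2000)
acc = np.mean((x_test @ w > 0) == theta_test)
assert acc > 0.85
```

The fixed-training-set regime would differ only in reusing one pre-simulated `(x, theta)` pool across all 500 iterations, which is what introduces the estimation error discussed above.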

An alternative viewpoint on the simulation-on-the-fly paradigm, through the lens of stochastic optimization, is to compare the gradients of the two training procedures when the algorithm is restricted to first-order stochastic approximation methods. In order to make explicit the optimization algorithm at hand, we parameterize the model class ℱ by weights *w* ∈ 𝒲, where 𝒲 is the space of all possible neural network weights for a fixed architecture. Denote the population risk with respect to the prior *π*(*θ*) as *R*_{π}(*w*) and the corresponding empirical risk over the *N* training examples as *R̂*_{π}(*w*). In the simulation-on-the-fly regime, the *t*-th iteration approximates the gradient of the population risk at *w*_{t} by the unbiased random vector

*g*_{sim}(*w*_{t}) = ∇*R*_{π}(*w*_{t}) + *ξ*_{t},   (1)

where *ξ*_{t} is a random vector such that 𝔼(*ξ*_{t}) = 0 and 𝔼(*ξ*_{t} | *g*_{sim}(*w*_{1}),…,*g*_{sim}(*w*_{t−1})) = 0. On the other hand, for the fixed training set regime, the *t*-th iteration of the algorithm approximates the empirical risk gradient at *w*_{t} by the unbiased random vector

*g*_{fixed}(*w*_{t}) = ∇*R̂*_{π}(*w*_{t}) + *ξ*_{t},   (2)

where once again 𝔼(*ξ*_{t}) = 0 and 𝔼(*ξ*_{t} | *g*_{fixed}(*w*_{1}),…,*g*_{fixed}(*w*_{t−1})) = 0. A key point is that while *g*_{fixed}(*w*_{t}) is unbiased with respect to the empirical risk gradient ∇*R̂*_{π}(*w*_{t}), it is biased with respect to the population risk gradient ∇*R*_{π}(*w*_{t}). Using the formulation in (2), the fixed training data setting performs stochastic optimization on the empirical risk and converges to the empirical Bayes risk for a decaying learning rate and a suitably expressive ℱ. On the other hand, simulation-on-the-fly in (1) performs stochastic optimization directly on the population Bayes risk *R*_{π}, circumventing the bias incurred from using the empirical Bayes risk as a proxy for the population Bayes risk.

### 3.3. Likelihood-Free Inference Framework

With an exchangeable feature representation and an optimization procedure in hand, we can now combine these ingredients into an inference scheme. Let x, *θ*, *ϕ*, and *γ* be the observed data, the latent parameter of interest, the nuisance parameters, and the prior hyperparameters, respectively. The latent parameter *θ* can be inferred by drawing samples from the prior distribution *θ*^{(i)}, *ϕ*^{(i)} ~ *π*(*θ*, *ϕ* | *γ*) and from the density x^{(i)} ~ *P*(x | *θ*^{(i)}, *γ*, *ϕ*^{(i)}), while stochastic optimization under the simulation-on-the-fly paradigm fits (*h* ∘ *g* ∘ Φ)(x^{(i)}) to *θ*^{(i)} in an online manner.

This Bayesian inference framework marginalizes over the uncertainty of the nuisance parameters. As neural networks have been empirically shown to interpolate well between examples, we recommend choosing a diffuse prior, which makes our trained model robust to model misspecification.
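As an illustration of this sampling scheme, the following sketch draws one training batch from a hypothetical hierarchical prior; the lognormal nuisance prior and the toy simulator are stand-ins for *π*(*θ*, *ϕ* | *γ*) and the coalescent, and all names are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_training_batch(batch, d=20):
    """Draw (theta, phi) from the prior, then x from a toy simulator.

    theta : parameter of interest (e.g. a hotspot indicator)
    phi   : nuisance parameter (e.g. a background intensity), given a
            deliberately diffuse prior so the network is trained on --
            and hence implicitly marginalizes over -- many plausible values.
    """
    theta = rng.integers(0, 2, size=batch)
    phi = rng.lognormal(mean=0.0, sigma=1.0, size=batch)  # diffuse nuisance prior
    # Toy simulator standing in for P(x | theta, gamma, phi): the nuisance
    # phi rescales the noise around the signal carried by theta.
    x = rng.normal(size=(batch, d)) * phi[:, None] + theta[:, None]
    return x, theta

x, theta = sample_training_batch(32)
assert x.shape == (32, 20)
assert np.all((theta == 0) | (theta == 1))
```

Training the classifier on such draws, with *ϕ* resampled every batch and never shown to the network, is what produces a posterior marginalized over the nuisance parameters.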

Another question when utilizing machine learning models for Bayesian inference is whether the posteriors are calibrated, since neural networks have been empirically shown to be overconfident in their predictions. Guo et al. (2017) showed that common deep learning practices cause neural networks to poorly represent aleatoric uncertainty, namely the uncertainty due to the noise inherent in the observations. These calibration issues are a byproduct of the fixed training set regime and do not apply to simulation-on-the-fly. The softmax probabilities are calibrated for a correctly specified model under simulation-on-the-fly, since for a sufficiently expressive neural network the minimizer approximates the true posterior. However, under large model misspecification, softmax probabilities should not be used directly as posteriors, since they do not properly quantify epistemic uncertainty (uncertainty in the model) and may overconfidently classify outliers dissimilar to the training set. For recombination hotspot testing, we found that the summary statistics of the 1000 Genomes dataset (1000 Genomes Project Consortium, 2015) were similar to those of the simulated data, so for simplicity we use the softmax probabilities as the posterior.

### 3.4. Statistical Properties

Our deep learning method exhibits asymptotic properties similar to those of ABC, with additional guarantees for unobserved values of x.

In the simulation-on-the-fly setting, convergence to a global minimum implies that a sufficiently large neural network architecture represents the true posterior within *ϵ*-error in the following sense: for any fixed error *ϵ* > 0, there exist *H*_{0} and *N*_{0} such that the trained neural network produces a posterior ℙ̂_{w}(*θ* | x) which satisfies

𝔼_{x}[*KL*(ℙ(*θ* | x) ‖ ℙ̂_{w}(*θ* | x))] < *ϵ*   (3)

for all *H* > *H*_{0} and *N* > *N*_{0}, where *H* is the minimum number of hidden units across all neural network layers, w the weights parameterizing the network, and *KL* the Kullback–Leibler divergence between the true posterior and its neural network approximation. Then the following proposition holds.

**Proposition 1.** *For all* x, *H* > *H*_{0}, *and N* > *N*_{0}, *and for any fixed error δ* > 0,

*KL*(ℙ(*θ* | x) ‖ ℙ̂_{w^{*}}(*θ* | x)) < *δ*   (4)

*with probability at least* 1 − *ϵ*/*δ*, *where* w^{*} *is the minimizer of* (3).

We can get stronger guarantees in the discrete setting common to population genetic data.

**Proposition 2.** *Under the same conditions, if* x *is discrete and* ℙ(x) > 0 *for all* x, *the KL divergence appearing in* (4) *converges to* 0 *as H, N → ∞, uniformly for all* x.

The proofs are given in the Appendix. Note that it is computationally infeasible to train a neural network such that *H* → ∞. Instead, we restrict the number of units to some fixed constant *H*, inducing a restricted model class of learnable functions. Our training procedure in the asymptotic regime for fixed *H* minimizes the objective function in (3) as *N* → ∞. Similarly, in the finite sample regime, the training procedure directly minimizes the projected population risk for the restricted model class. An important property of neural networks with a finite number of hidden units is that this restricted model class is quite large and has been empirically shown to approximate many functions well, so a finite *H* introduces only minimal error. Furthermore, deep learning has been empirically shown to converge in only a few thousand iterations for many real-world high-dimensional datasets (Zhang et al., 2016), hence *N* need not approach infinity to obtain a good approximation of the posterior. We later confirm this finding experimentally. Hyperparameter optimization in neural networks can be performed by comparing the relative loss of the network under a variety of optimization and architecture settings. ABC, on the other hand, has no such theoretical or empirical results in the finite sample regime outside of toy examples in which the likelihood can be approximated, and it has been shown empirically that its iteration complexity scales exponentially in the dimension of the summary statistics due to the curse of dimensionality.

We now show that our neural network learns statistics which are asymptotically sufficient. While several variants of sufficiency in the Bayesian context have been defined in the literature (Kolmogoroff, 1942; Furmańczyk & Niemiro, 1998), we focus on the following.

**Definition 1.** A statistic *T*(x) is called *prior-dependent Bayes sufficient* if, for a parameter *θ* with fixed prior *π*(*θ*), the posterior satisfies, for all *θ* and x,

ℙ(*θ* | *T*(x)) = ℙ(*θ* | x).

**Proposition 3.** *Each layer of the neural network trained via the likelihood-free framework is prior-dependent Bayes sufficient with respect to π(θ) as H → ∞, assuming the optimization for each H converges to the global minimum.*

This is proved in the Appendix. The sufficiency of the exchangeable feature representation ensures that no inferential accuracy has been sacrificed while reducing the data to an exchangeable feature representation. Each layer of the neural network is sufficient, allowing this representation to be used for other tasks. While this notion of sufficiency does not cover a finite architecture, it allows us to compare against the asymptotic results of ABC. More details on the properties of ABC are given in the Appendix.

Another desirable property is having unbiased uncertainty estimates, namely posterior calibration. Fearnhead & Prangle (2012) note that ABC is asymptotically calibrated as its kernel bandwidth goes to 0, but not calibrated in general. Similarly, our deep learning procedure is calibrated as the number of hidden units *H* → ∞. While neural networks are difficult to analyze in fixed architecture settings with nonconvex loss surfaces, we empirically find that our neural network is calibrated in Section 5.

## 4. Population Genetics Application

The framework we established overcomes many challenges posed by population genetic inference. In this setting, each observation x is encoded as follows. Let x_{S} be the binary *n* × *d* allele matrix, with 0 and 1 as the major and minor alleles respectively, where *n* is the number of individuals and *d* is the number of SNPs. Let x_{D} be the *n* × *d* matrix storing the distances between neighboring SNPs, where each row of x_{D} is identical and the rightmost distance is set to 0. Define x as the *n* × *d* × 2 tensor obtained by stacking x_{S} and x_{D}. To improve the conditioning of the optimization problem, the distances are normalized so that they lie approximately in [0,1]. As mentioned in Section 3.1, this is an instance of exchangeably-structured data.
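The encoding just described can be sketched as follows; the allele matrix, SNP positions, and the `encode` helper are made up for illustration:

```python
import numpy as np

def encode(alleles, positions):
    """Stack the n x d binary allele matrix with inter-SNP distances
    into the n x d x 2 input tensor described above."""
    n, d = alleles.shape
    # Distances between neighboring SNPs; the rightmost distance is set to 0.
    dist = np.append(np.diff(positions), 0.0)
    dist = dist / dist.max()             # normalize to roughly [0, 1]
    dist_rows = np.tile(dist, (n, 1))    # every row of x_D is identical
    return np.stack([alleles, dist_rows], axis=-1)

alleles = np.array([[0, 1, 0, 1],
                    [1, 1, 0, 0],
                    [0, 0, 1, 1]], dtype=float)   # 3 individuals, 4 SNPs
positions = np.array([100.0, 250.0, 400.0, 900.0])
x = encode(alleles, positions)

assert x.shape == (3, 4, 2)              # n x d x 2
assert np.all(x[:, -1, 1] == 0.0)        # rightmost distance is 0
```

Because the distance channel is identical across rows, permuting individuals permutes only the allele channel, so the tensor remains exchangeably-structured.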

The standard generative model for such data is the coalescent, a stochastic process describing the distribution over genealogies relating samples from a population of individuals. The coalescent with recombination (Griffiths, 1981; Hudson, 1983) extends this model to describe the joint distribution of genealogies along the chromosome. The recombination rate between two DNA locations tunes the correlation between their corresponding genealogies. Population genetic data derived from the coalescent obeys translation invariance along a sequence, conditioned on local recombination and mutation rates also obeying translation invariance. In order to take full advantage of parameter sharing, our chosen architecture is a convolutional neural network with tied weights for each row preceding the exchangeable layer, which is in turn followed by a fully connected neural network. We choose *g* as the element-wise max, and the architecture is depicted in Figure 1.

### 4.1. Recombination Hotspot Testing

Recombination hotspots are short regions of the genome (≈ 2 kb in humans) with a high recombination rate relative to the background. To apply our framework to the hotspot detection problem, we define the overall graphical model in Figure 2. Denote by *w* a small window (typically < 25 kb) of the genome, such that *X*_{w} is the population genetic data in that window and *X*_{−w} is the rest. Similarly, let *ρ*_{w} and *ρ*_{−w} be the recombination maps inside and outside of the window, respectively. Let **q** be the relative proportion of the sample possessing each mutation, *η* the population size function, *μ* the mutation rate, and *h* the indicator variable for whether the window contains a hotspot. While *ρ*_{w} and *ρ*_{−w} have a weak dependence (dashed line) on *X*_{−w} and *X*_{w} respectively, this dependence decreases rapidly and is ignored for simplicity. Similarly, conditioned on **q**, *η* is only weakly dependent on *X*_{w}. The shaded nodes represent the observed variables.

We define our prior as follows. We sample the hotspot indicator variable *h* ~ Bernoulli(0.5) and the local recombination maps from the released fine-scale recombination maps of HapMap (Gibbs et al., 2003). In addition, the demography *η* is inferred via `SMC`++ (Terhorst et al., 2017) and fixed in an empirical Bayes style throughout training for simplicity. The human mutation rate is fixed to that experimentally found in Kong et al. (2012). Since `SMC`++ is robust to changes in any small fixed window, inferring *η* from *X* has minimal dependence on *ρ*_{w}.

To test for recombination hotspots, we first simulate a batch of *h* and *ρ*_{w} from the prior, and *X*_{w} from `msprime` (Kelleher et al., 2016). We then feed the batch of training examples into the network, and repeat until convergence or for a fixed number of iterations. At test time, we slide a window along the genome to infer posteriors over *h*.
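The test-time scan can be sketched as below, with `posterior_fn` a hypothetical stand-in for the trained network *h* ∘ *g* ∘ Φ (here replaced by a dummy statistic rather than a real posterior); window and step sizes are illustrative:

```python
import numpy as np

def sliding_posteriors(x, posterior_fn, window=20, step=5):
    """Slide a fixed-width SNP window along the chromosome and record
    the hotspot posterior for each window start position."""
    n, d, _ = x.shape
    starts = range(0, d - window + 1, step)
    return [(s, posterior_fn(x[:, s:s + window, :])) for s in starts]

# Dummy stand-in for the trained network: fraction of minor alleles
# in the window (illustrative only, not a real posterior).
def dummy_posterior(xw):
    return float(xw[..., 0].mean())

rng = np.random.default_rng(3)
x = np.stack([rng.integers(0, 2, size=(50, 100)),   # allele channel
              rng.random((50, 100))],               # distance channel
             axis=-1).astype(float)

scan = sliding_posteriors(x, dummy_posterior)
assert len(scan) == (100 - 20) // 5 + 1
assert all(0.0 <= p <= 1.0 for _, p in scan)
```

In practice each window's posterior would come from one forward pass of the trained network, so the scan costs one pass per window.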

## 5. Experiments

In this section, we study the accuracy of our framework to test for recombination hotspots. As very few hotspots have been experimentally validated, we primarily evaluate our method on simulated data, with parameters set to match a human-like setting. The presence of ground truth allows us to benchmark our method and compare against `LDhot`. Unless otherwise specified, for all experiments we use a mutation rate of *μ* = 1.1 × 10^{−8} per generation per nucleotide, a convolution patch length of 5 SNPs, 32 and 64 convolution filters for the first two convolution layers, 128 hidden units for both fully connected layers, and 20-SNP length windows. The experiments comparing against `LDhot` used sample size *n* = 64 so that the lookup tables for `LDhot` could be constructed quickly. All other experiments use *n* = 198, matching the size of the CEU population (i.e., Utah Residents with Northern and Western European ancestry) in the 1000 Genomes dataset. All simulations were performed using `msprime` (Kelleher et al., 2016). Gradient updates were performed using Adam (Kingma & Ba, 2014) with learning rate 1 × 10^{−3} × 0.9^{b/10000}, *b* being the batch count.

### 5.1. Evaluation of Exchangeable Representation

We compare the behavior of an explicitly exchangeable architecture to a nonexchangeable architecture that takes 2D convolutions with varying patch heights. The accuracy under human-like population genetic parameters with varying 2D patch heights is shown in Figure 3. Since each training point is simulated on-the-fly, data augmentation is performed implicitly in the nonexchangeable version without having to explicitly permute the rows of each training point. As expected, directly encoding the permutation invariance leads to more efficient training and higher accuracy while also benefiting from a faster per-batch computation time. Furthermore, the slight accuracy decrease when increasing the patch height confirms the difficulty of learning permutation invariance as *n* grows. Another advantage of exchangeable architectures is the robustness to the number of individuals at test time. As shown in Figure 4, the accuracy remains robust during test time for sample sizes roughly 0.5–4× the train sample size.

### 5.2. Evaluation of Simulation-on-the-fly

Next, we analyze the effect of simulation-on-the-fly in comparison to the standard fixed training set. In the fixed setting, a training set of size 10000 was used for 20000 training batches and evaluated on a test set of size 5000. For the network using simulation-on-the-fly, 20000 training batches were run and evaluated on the same test set. The weights were initialized with a fixed random seed in both settings, with 20 replicates. Figure 5 shows that the fixed training set setting has both higher bias and higher variance than simulation-on-the-fly. The bias can be attributed to the estimation error of a fixed training set, in which the empirical risk surface is not a good approximation of the population risk surface. The variance can be attributed to an increase in the number of poor-quality local optima in the fixed training set case.

We next investigated posterior calibration, which measures whether there is any bias in the uncertainty estimates output by the neural network. We evaluated the calibration of simulation-on-the-fly against a fixed training set of 10000 datapoints. The calibration curves were generated by evaluating 25000 datapoints at test time, binning their posteriors, and computing the fraction of true labels for each bin. A perfectly calibrated curve is the dashed black line shown in Figure 6. In accordance with the theory in Section 3.2, simulation-on-the-fly is much better calibrated, with an increasing number of training examples leading to a better calibrated function. The fixed training procedure, on the other hand, is poorly calibrated.
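The binning procedure behind these curves can be sketched as follows, run here on synthetic, perfectly calibrated predictions (the toy data are illustrative, not the experiment's output):

```python
import numpy as np

def calibration_curve(posteriors, labels, n_bins=10):
    """Bin predicted posteriors and compute the fraction of true labels
    per bin; a well-calibrated model tracks the diagonal."""
    bins = np.minimum((posteriors * n_bins).astype(int), n_bins - 1)
    centers, fractions = [], []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            centers.append(posteriors[mask].mean())   # mean prediction in bin
            fractions.append(labels[mask].mean())     # empirical frequency in bin
    return np.array(centers), np.array(fractions)

# Toy model that is calibrated by construction: each label is drawn
# with exactly its stated probability.
rng = np.random.default_rng(4)
p = rng.random(25000)
y = rng.random(25000) < p

centers, fracs = calibration_curve(p, y)
assert np.all(np.abs(centers - fracs) < 0.05)   # curve hugs the diagonal
```

An overconfident model would instead show bin frequencies pulled toward 0.5 relative to the predictions, which is the pattern the fixed-training-set regime exhibits in Figure 6.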

### 5.3. Comparison to `LDhot`

We compared our method against `LDhot` in two settings: (i) sampling empirical recombination rates from the HapMap recombination map for CEU and YRI (i.e., Yoruba in Ibadan, Nigeria) (Gibbs et al., 2003) to set the background recombination rate, and then using this background to simulate a flat recombination map with a 10–100× relative hotspot intensity; and (ii) sampling segments of the HapMap recombination map for CEU and YRI, classifying them as hotspots according to our definition, and then simulating from the drawn variable map.

The ROC curves for both settings are shown in Figure 7. Under the bivariate empirical background prior regime, where there is a flat background rate and a flat hotspot, both methods performed quite well, as shown in the top panel of Figure 7. We note that the slight performance decrease for YRI when using `LDhot` is likely due to hyperparameters that require tuning for each demography. This bivariate setting is the precise likelihood ratio test for which `LDhot` tests. However, as flat background rates and hotspots are not realistic, we sample windows from the HapMap recombination map and label them according to a more suitable hotspot definition that ensures locality and rules out negligible recombination spikes (the details are given in the Appendix). The bottom panel of Figure 7 uses the same hotspot definition in the training and test regimes, and is strongly favorable towards the deep learning method. Under a sensible definition of recombination hotspots and realistic recombination maps, our method still performs well while `LDhot` performs almost randomly. We believe that the true performance of `LDhot` lies somewhere between the first and second settings, with performance dominated by the deep learning method. Importantly, this improvement is achieved without access to any problem-specific summary statistics.

Our approach reached 90% accuracy in fewer than 2000 iterations, taking approximately 0.5 hours on a 64-core machine, with the computational bottleneck being the `msprime` simulation (Kelleher et al., 2016). For `LDhot`, computing the two-locus lookup table for variable demography with the `LDpop` fast approximation (Kamm et al., 2016) took 9.5 hours on a 64-core machine (downsampling *n* = 198 from *N* = 256). The lookup table has a computational complexity of *O*(*N*^{3}), while per-iteration training of the neural network scales as *O*(*n*), allowing for much larger sample sizes.

## 6. Discussion

We developed the first likelihood-free inference method for population genetics that does not rely on handcrafted summary statistics. To achieve this, we designed a family of neural networks that learn an exchangeable representation of genotype data, which is in turn mapped to the posterior distribution over the parameter of interest. State-of-the-art accuracy was demonstrated on the challenging problem of recombination hotspot testing. Furthermore, we analyzed and developed general-purpose machine learning methods that can leverage scientific simulators to improve over preexisting likelihood-free inference schemes.

The theoretical and empirical results for simulation-on-the-fly illustrate the attractiveness of fields with model simulators as a testbed for new neural network methods. For instance, this approach allows the researcher to diagnose whether regularization or convergence to poor local minima is limiting performance. We believe the simulator paradigm has much to offer for furthering our theoretical understanding of neural networks.

Quantifying uncertainty over a continuous parameter could be of interest in many other population genetic tasks, in which case softmax probabilities are inapplicable. Future work could adapt our method with ideas from the Bayesian neural networks literature to obtain posterior distributions over continuous parameters (Hernández-Lobato & Adams, 2015; Blundell et al., 2015; Gal & Ghahramani, 2016; Kingma et al., 2015).

## Acknowledgements

We thank Ben Graham for helpful discussions. This research is supported in part by an NSF Graduate Research Fellowship (JC); EPSRC grants EP/L016710/1 (VP) and EP/L018497/1 (PJ); an NIH grant R01-GM094402 (JC, JPS, SM, and YSS); and a Packard Fellowship for Science and Engineering (YSS). YSS is a Chan Zuckerberg Biohub investigator.

## Appendix

## A. Statistical Properties of ABC

Understanding the statistical properties of ABC enables us to highlight the theoretical benefits of our approach. Variants of ABC are among the most widely-used likelihood-free inference techniques in the scientific literature. ABC simulates *N* draws of the parameter *θ*^{(i)} ~ *π*(*θ*) and data x^{(i)} ~ ℙ(x | *θ*^{(i)}) for *i* = 1,…, *N*, then approximates the posterior conditioned on the observed summary statistics *s*_{obs} = *S*(x_{obs}) by

$$\widehat{\mathbb{P}}(\theta \mid s_{\mathrm{obs}}) = \frac{\sum_{i=1}^{N} K\!\left(\frac{\lVert S(\mathbf{x}^{(i)}) - s_{\mathrm{obs}}\rVert}{u}\right) \delta_{\theta^{(i)}}(\theta)}{\sum_{i=1}^{N} K\!\left(\frac{\lVert S(\mathbf{x}^{(i)}) - s_{\mathrm{obs}}\rVert}{u}\right)},$$

where *S*: *𝒳* → *𝒱* is a summary statistic of the data and *K*: 𝒱 → ℝ is a density kernel that integrates to 1 with bandwidth *u* > 0. Let a denote the *N*-dimensional vector of kernel weights *K*(·), one for each simulated parameter *θ*^{(i)}. A common choice of *K* is the uniform kernel, though many variants exist. Intuitively, ABC can be interpreted as locally smoothing the empirical likelihood estimates of points near the observed data in the summary statistic space.
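A minimal rejection-style sketch of this scheme, using a uniform kernel so that the kernel weights a are simply accept/reject indicators. The toy Gaussian model and all function names are our own illustrative assumptions, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def abc_posterior(s_obs, prior_sample, simulate, summary, n_sims=10_000, u=0.2):
    """Uniform-kernel ABC: keep draws theta_i whose simulated summary
    lands within bandwidth u of the observed summary s_obs."""
    thetas = np.array([prior_sample() for _ in range(n_sims)])
    dists = np.array([abs(summary(simulate(t)) - s_obs) for t in thetas])
    accepted = thetas[dists <= u]   # uniform-kernel weights a_i in {0, 1}
    return accepted                 # empirical ABC posterior sample

# toy conjugate example: theta ~ N(0, 1), x | theta ~ N(theta, 1), S(x) = x
post = abc_posterior(
    s_obs=1.0,
    prior_sample=lambda: rng.normal(0.0, 1.0),
    simulate=lambda t: rng.normal(t, 1.0),
    summary=lambda x: x,
)
```

For this conjugate toy model the true posterior is N(0.5, 0.5), so the accepted draws concentrate around 0.5 as *u* shrinks and *N* grows.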

The ABC posterior asymptotically converges to the true posterior conditioned on the observed data x_{obs} for sample size *n* under suitable regularity conditions, so that

$$\widehat{\mathbb{P}}(\theta \mid s_{\mathrm{obs}}) \to \mathbb{P}(\theta \mid \mathbf{x}_{\mathrm{obs}})$$

for any choice of sufficient statistic *S*. See Frazier et al. (2016) for a formal treatment. However, in the finite-sample regime for fixed *N*, the hyperparameters of the ABC algorithm should be chosen such that the expected KL divergence

$$\mathbb{E}_{\mathbf{x}_{\mathrm{obs}}}\!\left[ D_{\mathrm{KL}}\!\left( \mathbb{P}(\theta \mid \mathbf{x}_{\mathrm{obs}}) \,\middle\|\, \widehat{\mathbb{P}}(\theta \mid s_{\mathrm{obs}}) \right) \right] \tag{5}$$

is minimized. Based on this formulation, the computational and statistical tradeoffs induced by the choice of *u* and *S* as a function of the computational budget *N* are made explicit. Unfortunately, hyperparameter optimization over *u* and *S* cannot be performed since the expected KL in (5) cannot be compared without access to the true posterior. Instead, practitioners often optimize a surrogate objective similar to

$$\min_{u,\,S}\; J(u, S) \quad \text{subject to} \quad m(\mathbf{a}) \geq \tau,$$

where τ is a sampling threshold and *m*(a) is a user-defined function of the kernel weights, such as the number of accepted samples or the effective sample size. *J*(*u*, *S*) is a user-defined positive function that is decreasing in *u* and satisfies *J*(*u*, *S*_{1}) ≤ *J*(*u*, *S*_{2}) for all *S*_{1}, *S*_{2} with *𝒱*_{1} ⊆ *𝒱*_{2}. Note that ABC practitioners are typically not explicit about their surrogate objective function when tuning ABC; however, for clarity, we specify the general surrogate objective above, and remark that many modifications do not affect the underlying tradeoffs stated below.

Intuitively, the surrogate objective encourages practitioners to choose *u* close to 0 and *S* close to sufficient, while generating enough large kernel weights within the computational budget to obtain an empirical posterior. This procedure ignores the posterior completely and could result in arbitrarily poor approximations of it. Poor approximations can arise for many reasons, including lack of information in *S*, large *u*, an insufficient number of generated samples, an insufficient computational budget, or an incorrect choice of kernel *K* or norm ‖ · ‖ for the geometry of the posterior. Furthermore, this procedure must be re-run for each new dataset x_{obs}, leaving a smaller computational budget *N* per dataset when dealing with multiple datasets. There are no guarantees that previous values of *S* and *u* remain good choices for a new dataset, since these parameters depend on x_{obs}.

## B. Statistical Properties of Our Method: Proofs

By the Universal Approximation Theorem and the interpretation of simulation-on-the-fly as minimizing the expected KL divergence between the posterior and the neural network, there exist *H*_{0} and *N*_{0} such that for every ϵ > 0, *H* > *H*_{0}, and *N* > *N*_{0}, the training procedure drives the objective below ϵ:

$$\mathbb{E}_{\mathbf{x}}\!\left[ D_{\mathrm{KL}}\!\left( \mathbb{P}(\theta \mid \mathbf{x}) \,\middle\|\, q_{\mathbf{w}}(\theta \mid \mathbf{x}) \right) \right] < \epsilon,$$

where *q*_{w}(*θ* | x) denotes the function learned by a neural network with *H* hidden units and weights w trained on *N* examples.

Let w* be a minimizer of the above expectation. By Markov's inequality, for every δ > 0, and for all *H* > *H*_{0} and *N* > *N*_{0},

$$D_{\mathrm{KL}}\!\left( \mathbb{P}(\theta \mid \mathbf{x}) \,\middle\|\, q_{\mathbf{w}^*}(\theta \mid \mathbf{x}) \right) < \delta$$

with probability at least 1 − ϵ/δ over x ~ ℙ(x).

As above, we have

$$\mathbb{E}_{\mathbf{x}}\!\left[ D_{\mathrm{KL}}\!\left( \mathbb{P}(\theta \mid \mathbf{x}) \,\middle\|\, q_{\mathbf{w}^*}(\theta \mid \mathbf{x}) \right) \right] < \epsilon$$

for all ϵ > 0, *H* > *H*_{0}, and *N* > *N*_{0}. Furthermore, for all x, the KL is bounded at the minimizer: since ℙ(x) > 0 for all x, we obtain

$$D_{\mathrm{KL}}\!\left( \mathbb{P}(\theta \mid \mathbf{x}) \,\middle\|\, q_{\mathbf{w}^*}(\theta \mid \mathbf{x}) \right) < \frac{\epsilon}{\min_{\mathbf{x}} \mathbb{P}(\mathbf{x})},$$

a bound independent of x. Thus, the training procedure results in a function mapping that uniformly converges to the posterior ℙ(*θ* | x).

For each *H* and *N*, the neural network is trained to find the w that minimizes the expected negative log-likelihood

$$\mathbb{E}_{\theta \sim \pi(\theta)}\,\mathbb{E}_{\mathbf{x} \sim \mathbb{P}(\mathbf{x} \mid \theta)}\!\left[ -\log q_{\mathbf{w}}(\theta \mid \mathbf{x}) \right].$$

As *H* → ∞ and *N* → ∞, this quantity converges to its global minimum, which by the Universal Approximation Theorem is achieved when the function learned by the neural network is the posterior ℙ(*θ* | x). Thus, each layer of the neural network can be viewed as a statistic *T*(x) of the input data x. In other words, each layer of the trained neural network is prior-dependent Bayes sufficient: ℙ(*θ* | x) = ℙ(*θ* | *T*(x)) for our chosen prior *π*(*θ*).

## C. Recombination Hotspot Details

Recombination hotspots are short regions of the genome with high recombination rate relative to the background. In order to develop accurate methodology, a precise mathematical definition of a hotspot needs to be specified in accordance with the signatures of biological interest. We use the following definition.

**Definition** (Recombination Hotspot). Let a window over the genome be subdivided into three subwindows *w* = (*w*_{l}, *w*_{h}, *w*_{r}) with physical distances *α*_{l}, *α*_{h}, and *α*_{r}, respectively, and *w*_{l}, *w*_{h}, *w*_{r} ∈ 𝒢, where 𝒢 is the space of all possible subwindows of the genome. Let a mean recombination map *R*: 𝒢 → ℝ_{+} be a function that maps a subwindow of the genome to the mean recombination rate per base pair in that subwindow. A recombination hotspot for a given mean recombination map *R* is a window *w* which satisfies the following properties:

1. Elevated local recombination rate:

$$R(w_{h}) > k \cdot \max\left(R(w_{l}),\, R(w_{r})\right)$$

2. Large absolute recombination rate:

$$R(w_{h}) > k \cdot \tilde{r},$$

where $\tilde{r}$ is the median (at a per base pair level) genome-wide recombination rate and *k* is the relative hotspot intensity.

The first property is necessary to enforce the locality of hotspots and rule out large regions of high recombination rate, which are typically not considered hotspots by biologists. The second property rules out regions of minuscule background recombination rate in which sharp relative spikes in recombination still remain too small to be biologically interesting. The median is chosen here to be robust to the right skew of the distribution of recombination rates. Typically, for the human genome we use *α*_{l} = *α*_{r} = 13 kb, *α*_{h} = 2 kb, and *k* = 10 based on experimental findings.
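The two properties can be expressed as a small predicate. This is our own sketch of the definition above, with `R_med` standing for the genome-wide median per-base-pair rate:

```python
def is_hotspot(R_l, R_h, R_r, R_med, k=10.0):
    """Hotspot test per the definition in the text: the centre subwindow
    must have (i) rate more than k times either flank and (ii) rate
    exceeding k times the genome-wide median per-bp rate R_med."""
    elevated_local = R_h > k * max(R_l, R_r)   # property 1: locality
    large_absolute = R_h > k * R_med           # property 2: absolute size
    return elevated_local and large_absolute
```

For instance, a 2 kb centre at 50× a typical flanking rate qualifies, while a sharp spike over a minuscule background does not.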

The most widely-used technique for recombination hotspot testing is `LDhot`, as described in Auton et al. (2014). The method performs a generalized composite likelihood ratio test using the two-locus composite likelihood based on Hudson (2001) and McVean et al. (2004). The composite two-locus likelihood approximates the joint likelihood of a window of SNPs *w* by a product of pairwise likelihoods

$$\mathrm{CL}(w) = \prod_{0 < j - i \leq z} L(X_{ij},\, \rho_{ij}),$$

where *X*_{ij} denotes the data restricted to SNPs *i* and *j*, and *ρ*_{ij} denotes the recombination rate between those sites. Only pairs of SNPs within some distance, say *z* = 50 SNPs, are considered.
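The composite log-likelihood computation can be sketched as follows, with `pair_loglik` standing in for lookups into a precomputed two-locus table (the function names here are hypothetical, not `LDhot`'s API):

```python
def composite_log_likelihood(pair_loglik, positions, rho_per_bp, z=50):
    """Sum pairwise two-locus log-likelihoods over all SNP pairs (i, j)
    with 0 < j - i <= z, as in the composite likelihood above;
    pair_loglik(i, j, rho_ij) stands in for a two-locus table lookup."""
    n = len(positions)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, min(i + z + 1, n)):
            # recombination rate between SNPs i and j, scaled by distance
            rho_ij = rho_per_bp * (positions[j] - positions[i])
            total += pair_loglik(i, j, rho_ij)
    return total
```

In practice the per-pair likelihoods come from a lookup table such as the one produced by `LDpop`, which is where the *O*(*N*^{3}) cost mentioned in Section 5.3 arises.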

Two-locus likelihoods are computed via an importance sampling scheme under a constant demography (*η* = 1), as in McVean et al. (2004). The likelihood ratio test uses a null model of a constant recombination rate and an alternative model of a differing recombination rate in the center of the window under consideration:

$$\Lambda = 2\left(\sup_{\rho_{\mathrm{bg}},\, \rho_{\mathrm{hot}}} \log \mathrm{CL}\!\left(w;\, \rho_{\mathrm{bg}},\, \rho_{\mathrm{hot}}\right) - \sup_{\rho_{\mathrm{bg}}} \log \mathrm{CL}\!\left(w;\, \rho_{\mathrm{bg}}\right)\right).$$

The two-locus likelihood can only be applied to single panmictic populations with constant demography, constant mutation rate, and no natural selection. Furthermore, the two-locus likelihood is an uncalibrated approximation of the true joint likelihood. In addition, the experiments in Wall & Stevison (2016) and Auton et al. (2014) do not demonstrate the efficacy of `LDhot` against a realistic variable background recombination rate, as its null hypothesis leads to a comparison against a biologically unrealistic flat background rate. In order to fairly compare our likelihood-free approach against the composite likelihood-based method in realistic human settings, we extended the `LDhot` methodology to apply to a piecewise constant demography using two-locus likelihoods computed by the software `LDpop` (Kamm et al., 2016). Unlike the method described in Wall & Stevison (2016), our implementation of `LDhot` uses windows defined in terms of SNPs rather than physical distance in order to measure accuracy via ROC curves, since the likelihood ratio test is a function of the number of SNPs. Note that computing the approximate two-locus likelihoods for a grid of recombination values is at least *O*(*n*^{3}), which can be prohibitive for large sample sizes.

## D. Additional Experiments

**Regularization** The simulation-on-the-fly paradigm obviates the need for modern regularization techniques such as dropout. This is because there is no notion of overfitting: each training point is used only once, and a large number of examples is drawn from the population distribution. As shown in Figure 8, dropout does not improve the accuracy of our method and, in fact, leads to a minor decrease in performance. As expected, directly optimizing the population risk circumvents the problem of overfitting.

**Phasing** Deconvolving two haplotypes from genotype data is a challenging statistical problem, commonly referred to as phasing. Phasing without a high-quality reference panel introduces significant bias into downstream inference. Our approach can flexibly perform inference directly on haplotype or genotype data, the latter being a challenge for model-based approaches. Inference directly on genotype data allows us to implicitly integrate over possible phasings, reducing the bias introduced by fixing the data to a single phasing. In the case of recombination hotspots, we found only a minor decrease in accuracy for small sample sizes, corresponding to the reduction in statistical signal when inference is performed on genotype data. We quantified the effect of having accurately phased data in comparison to genotype data. Specifically, inference was run by simulating haplotype data and randomly pairing haplotypes to construct genotype data, so that the height of the genotype image is half that of the haplotype image. We ran the experiment for *n* = 16, 32, 64 as shown in Figure 9 and found that our method is robust, remaining highly accurate for unphased data.
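The random-pairing construction used in this experiment can be sketched as follows (a generic illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def haplotypes_to_genotypes(H):
    """Randomly pair rows of a 0/1 haplotype matrix and sum each pair,
    giving an unphased 0/1/2 genotype matrix with half as many rows."""
    n, _ = H.shape
    assert n % 2 == 0, "need an even number of haplotypes"
    perm = rng.permutation(n)
    return H[perm[0::2]] + H[perm[1::2]]

H = rng.integers(0, 2, size=(32, 100))   # 32 haplotypes, 100 SNPs
G = haplotypes_to_genotypes(H)           # 16 unphased genotypes
```

The genotype matrix carries the same per-site allele counts as the haplotypes but discards which chromosome each allele sits on, which is exactly the information lost without phasing.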

**Missing Data** Biological data typically contain significant amounts of missing data. The missingness results from a number of factors, such as repetitive regions of the chromosome that are difficult to map, or low read coverage. Fortunately, haplotype data in population genetics are mostly missing completely at random; that is, the locations of missingness are independent of the data values. However, there is a strong correlation structure between the missingness of spatially close SNPs. To improve the robustness of our method to missing data, we sample the missingness patterns from empirical data during training.
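Sampling empirical missingness patterns during training can be sketched as follows; the mask bank here is a hypothetical stand-in for patterns extracted from real data:

```python
import numpy as np

rng = np.random.default_rng(0)

def apply_empirical_missingness(X, masks, missing_value=-1):
    """Draw one missingness pattern (True = missing) from a bank of
    empirical masks and overwrite those entries of a copy of X."""
    mask = masks[rng.integers(len(masks))]
    X = X.copy()
    X[mask] = missing_value
    return X

# toy mask bank mimicking spatially correlated missingness
masks = [np.zeros((8, 20), dtype=bool) for _ in range(2)]
masks[0][:, 3:6] = True     # a run of three adjacent missing SNPs
masks[1][2, :] = True       # one haplotype entirely missing
X_miss = apply_empirical_missingness(np.ones((8, 20), dtype=int), masks)
```

Because the masks are drawn from data rather than generated i.i.d., the spatial correlation between nearby missing SNPs is preserved at training time.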