Deep learning for population size history inference: design, comparison and combination with approximate Bayesian computation

Over the past decades, simulation-based likelihood-free inference methods have enabled researchers to address numerous population genetics problems. As the richness and amount of simulated and real genetic data keep increasing, the field has a strong opportunity to tackle tasks that current methods hardly solve. However, high data dimensionality forces most methods to summarize large genomic datasets into a relatively small number of handcrafted features (summary statistics). Here we propose an alternative to summary statistics, based on the automatic extraction of relevant information using deep learning techniques. Specifically, we design artificial neural networks (ANNs) that take as input single nucleotide polymorphisms (SNPs) found in individuals sampled from a single population and infer the past effective population size history. First, we provide guidelines to construct artificial neural networks that comply with the intrinsic properties of SNP data, such as invariance to permutation of haplotypes, long-range interactions between SNPs and variable genomic length. Thanks to a Bayesian hyperparameter optimization procedure, we evaluate the performance of multiple networks and compare them to well-established methods like Approximate Bayesian Computation (ABC). Even without the expert knowledge of summary statistics, our approach compares fairly well to an ABC based on handcrafted features. Furthermore, we show that combining deep learning and ABC can improve performance while taking advantage of both frameworks. Finally, we apply our approach to reconstruct the effective population size history of cattle breed populations.

1 Introduction

2.1 Summary statistics

We computed the site frequency spectrum and the linkage disequilibrium as a function of the distance between SNPs, averaged over 19 distance bins, for a total of 68 summary statistics. Our python script is partly based on the scikit-allel python module (Miles et al., 2019). These predefined summary statistics constitute the training, validation and test sets for all methods based on summary statistics or on their combination with SNP matrices.

2.2 Baselines

We compared our approach to five baselines: an ABC approach and a MLP, both using linkage disequilibrium and the site frequency spectrum as summary statistics, and another MLP, a custom CNN and a CNN from Flagel et al. (2018), all using genomic data directly. We evaluated four ABC methods (rejection, local linear regression, local ridge regression and a single-hidden-layer neural network). Their performance represents a baseline of commonly used methods to be compared to ours. The second framework, a MLP processing the same summary statistics as ABC, is similar in spirit to the one previously proposed for predicting three demographic parameters alongside selection (Sheehan and Song, 2016). It is the baseline for methods using deep learning based on summary statistics. The third and fourth baselines consist of a MLP and a CNN directly processing the first 400 SNPs of a 2Mb-long genomic region instead of summary statistics. The MLP takes as input the SNP data and locations flattened into a single vector. In this configuration, the spatial information between SNPs, the link between the SNPs and their locations, and the link between alleles of the same individual are lost. The fourth baseline is a newly designed CNN with rectangular filters that cover more than one haplotype (i.e. more than one row). It takes as input a matrix in which rows are the haploid individuals and columns are the genetic markers. An additional row contains the SNP positions, encoded as the distances between SNPs. The input size is (number of haplotypes + 1) × number of kept SNPs. This custom CNN is a first step towards an architecture tailored to raw genomic data, because spatial information is preserved, as for recent ANNs applied to population genetics (Chan et al., 2018, Flagel et al., 2018, Torada et al., 2019), but also because asymmetrical filters of various sizes account for the heterogeneous nature of the two axes (haplotype versus SNP, rather than pixel versus pixel).

Finally, we adapted and re-trained four networks among the top-ranked CNNs proposed by Flagel et al. (2018) so that they could reconstruct a 21-epoch model of instantaneous effective population size rather than the three-epoch model initially investigated by the authors; for brevity we call them Flagel CNNs.

Each method is evaluated using its prediction error, given by the following mean squared error:

\[
E = \frac{1}{I \times J} \sum_{i=1}^{I} \sum_{j=1}^{J} \left( \Theta_{ij} - \hat{\Theta}_{ij} \right)^2
\]

where $\Theta_{ij}$ and $\hat{\Theta}_{ij}$ are respectively the true and predicted standardized population size for the time window $i$ and scenario $j$.

MLPs Our first MLP processes the 68 predefined summary statistics and has a total of 2,986 trainable parameters. Our second MLP is based on 'raw' genomic data and takes as input a matrix of 50 haplotypes (rows) for 400 SNPs (columns) and its associated vector of distances between SNPs. Its hidden layers respectively have 20, 20, and 10 neurons, which gives it 408,981 trainable parameters.
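To make the error metric defined above concrete, here is a minimal numpy sketch, assuming the true and predicted standardized population sizes are stored as arrays of shape (number of scenarios, number of time windows); the names are illustrative only.

```python
# Prediction error as the MSE over time windows i and scenarios j.
import numpy as np

def prediction_error(theta_true: np.ndarray, theta_pred: np.ndarray) -> float:
    """theta_true, theta_pred: (J scenarios, I time windows) standardized values."""
    return float(np.mean((theta_true - theta_pred) ** 2))

# Example: 100 scenarios, 21 time windows (as in our demographic model).
rng = np.random.default_rng(0)
theta_true = rng.standard_normal((100, 21))
theta_pred = theta_true + 0.1 * rng.standard_normal((100, 21))
print(prediction_error(theta_true, theta_pred))  # ~0.01
```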

Custom CNN Our convolutional neural network takes as input the same matrix of 400 SNPs and has 2-dimensional filters of various shapes. The first layer consists of 5 kernels with rectangular shapes (2×2, 5×4, 3×8, 2×10, 20×1) applied to the SNP matrix X. Each kernel creates 50 filters, which amounts to 250 feature maps after the first layer.

The SNP distance vector d is treated by the 5 associated kernel shapes (1×2, 1×4, 1×8, 1×10, 1×1) with 20 filters each, making 100 filters in total. The results of the first convolutional layer are then concatenated so that the second convolutional layer will couple information from X and d in a way that emphasizes the original location of the SNPs along the genome. The outputs of this second layer are then combined and go through 5 convolutional layers and 2 fully connected layers. Adding convolutional layers one after the other allows our network to combine patterns and reduce the size of the data without adding too many weights to our model. This network has a total of 131,731 trainable parameters.
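The following PyTorch sketch illustrates this first layer; the kernel shapes and filter counts match the description above, but how feature maps of different spatial sizes are aligned before concatenation is our assumption (here, cropping to the smallest map), and may differ from the published code.

```python
# First layer of the custom CNN: rectangular kernels on the SNP matrix,
# 1×k kernels on the distance vector. Alignment by cropping is an assumption.
import torch
import torch.nn as nn

snp_kernels = [(2, 2), (5, 4), (3, 8), (2, 10), (20, 1)]   # 50 filters each -> 250 maps
dist_kernels = [(1, 2), (1, 4), (1, 8), (1, 10), (1, 1)]   # 20 filters each -> 100 maps

snp_convs = nn.ModuleList([nn.Conv2d(1, 50, k) for k in snp_kernels])
dist_convs = nn.ModuleList([nn.Conv2d(1, 20, k) for k in dist_kernels])

X = torch.randn(8, 1, 50, 400)   # (batch, channel, haplotypes, SNPs)
d = torch.randn(8, 1, 1, 400)    # distances between SNPs

snp_maps = [conv(X) for conv in snp_convs]
dist_maps = [conv(d) for conv in dist_convs]

# Crop all maps to a common spatial size, then stack along the channel axis.
h = min(m.shape[2] for m in snp_maps)
w = min(m.shape[3] for m in snp_maps + dist_maps)
snp_out = torch.cat([m[:, :, :h, :w] for m in snp_maps], dim=1)    # (8, 250, h, w)
dist_out = torch.cat([m[:, :, :1, :w] for m in dist_maps], dim=1)  # (8, 100, 1, w)
print(snp_out.shape, dist_out.shape)
```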

Flagel network We reused the code associated with the repository of the first paper using a CNN for demographic inference (Flagel et al., 2018) and adapted it to our dataset and task. We trained the network with the exact same architecture as the one published, except that we changed the last layer to allow the prediction of our 21 population size parameters. We parametrized the network with the set of hyperparameters leading to the best performance in the previous work, for two different types of SNP encoding (0/255 or -1/1). It is noteworthy that the actual encoding in their code is 0/-1 and not 0/255; we kept the same encoding to be able to compare performance. The networks were trained with the same procedure of 10 epochs, with early stopping in case of no improvement of the loss after 3 epochs.

The batch size is 200. The input data had 50 haplotypes and either 400 SNPs, as processed by our custom CNN, or the data downsampled to one in every ten SNPs as done in the original work, leading to input SNP matrices 1,784 SNPs wide.

This width corresponds to one tenth of the largest SNP matrix in our dataset. Smaller simulations are padded with zeros.
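A minimal sketch of our reading of this preprocessing, assuming a 0/1 haplotype matrix: keep one SNP in every ten, then zero-pad each matrix on the right up to the fixed width of 1,784 SNPs. The function and constant names are illustrative.

```python
# Downsampling and zero-padding for the Flagel networks (our interpretation).
import numpy as np

TARGET_WIDTH = 1784  # one tenth of the largest SNP matrix in our dataset

def downsample_and_pad(snp_matrix: np.ndarray) -> np.ndarray:
    """snp_matrix: (n_haplotypes, n_snps) 0/1 array."""
    kept = snp_matrix[:, ::10]                      # one in every ten SNPs
    n_hap, n_kept = kept.shape
    padded = np.zeros((n_hap, TARGET_WIDTH), dtype=kept.dtype)
    padded[:, :min(n_kept, TARGET_WIDTH)] = kept[:, :TARGET_WIDTH]
    return padded

example = (np.random.default_rng(1).random((50, 9000)) < 0.3).astype(np.int8)
print(downsample_and_pad(example).shape)  # (50, 1784)
```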

All parameters can be found in table S1.

A major difficulty that arises with genomic data is that the number of SNPs varies from one dataset to another, or from one genomic region to another, due to the stochasticity of biological and demographic processes (and of their corresponding genetic simulations). Therefore, we use convolution layers, as they can handle data of variable size while keeping the number of network weights constant. A filter can be repeated more or fewer times to cover the whole input entering each layer, letting the network adapt itself to the data. Consequently, the output size of each convolution layer will vary depending on the input size. This prevents the use of fully connected layers directly after a convolution layer, as is often the case with CNNs. Instead, we use fully-connected layers only after operations that are independent of the input size and have a fixed output size, namely mean functions over the column and row dimensions (Figure 2).
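The principle can be shown in a few lines of PyTorch: convolutions accept inputs of any width, and averaging over the row and column dimensions produces a fixed-size vector that a fully-connected layer can accept. This toy network is a sketch of the idea only, not the SPIDNA architecture itself.

```python
# Size-invariant pattern: convolution -> mean over spatial dims -> dense layer.
import torch
import torch.nn as nn

class SizeInvariantNet(nn.Module):
    def __init__(self, n_features: int = 32, n_outputs: int = 21):
        super().__init__()
        self.conv = nn.Conv2d(1, n_features, kernel_size=(1, 3))
        self.fc = nn.Linear(n_features, n_outputs)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.conv(x))  # (B, F, M, S') -- S' depends on input width
        h = h.mean(dim=(2, 3))        # (B, F) -- fixed size whatever M and S are
        return self.fc(h)             # (B, 21) population size parameters

net = SizeInvariantNet()
print(net(torch.randn(2, 1, 50, 400)).shape)   # torch.Size([2, 21])
print(net(torch.randn(2, 1, 50, 1200)).shape)  # same output size
```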

For our permutation-invariant architecture, we designed three variations. The first one uses batch normalization after each convolution layer, and therefore takes as input a fixed number of 400 SNPs, similarly to two of the baselines.

The second one is invariant to the number of SNPs and uses instance normalization after each convolution layer, to normalize layer inputs per data instance instead of per batch (as batch normalization does). The last variation is also invariant to the number of SNPs, but uses two separate instance normalization steps, as well as the correlation control parameter α.

Except for these differences, these three variations have the same architecture, represented in Figure 2. At each step $i$ of the network, we consider that the data has four dimensions: $B$ the batch dimension, $M$ the row dimension (also the haplotype/genotype dimension before the first layer), $S$ the column dimension (also the SNP dimension before the first layer) and $F$ the feature dimension (only one feature before the first layer). A first function $\phi$ applies a convolution followed by an activation (B1), and a second function $\rho$ takes the mean over the dimension $M$ (B2). Thus $\rho(\phi(X_{i-1}))$ has size $B_{i-1} \times (S_{i-1} - 2) \times F_{i-1}/2$, which is repeated $M$ times to maintain the same dimension as $\phi(X_{i-1})$. Then $\rho(\phi(X_{i-1}))$ and $\phi(X_{i-1})$ are concatenated over the feature dimension.

Network weight initialization is a difficult task that can lead to vanishing or exploding gradients when not carefully done and associated with a poor learning rate (Bengio et al., 1994, Glorot and Bengio, 2010). Most initialization schemes try to force the outputs of each layer to follow some distribution, assuming normalized input data. Batch normalization solves this problem by normalizing layer outputs over the whole batch during training and computing a running mean and variance for the evaluation steps. We used this type of normalization for our networks that take as input a fixed number of SNPs. For the networks invariant to the number of SNPs, we could not concatenate all batch data into the same tensor because of their varying sizes. Therefore, we use instance normalization, which computes both mean and variance over the feature dimension.

Hyperparameters were tuned with a Bayesian optimization procedure; the parameter α was given a log-uniform prior between 0.5 and 1. The search was performed for 3 budget steps and replicated 5 times, leading to a total of 83 successfully trained networks.
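The following sketch, assuming a PyTorch implementation, shows one such block: φ is a convolution with a kernel of height 1 (so haplotypes are not mixed), ρ averages over the haplotype dimension M, and the averaged features are repeated and concatenated with φ's output. The kernel size, activation and omitted normalization are illustrative choices, not the exact SPIDNA settings.

```python
# One permutation-invariant block: phi (B1), rho (B2), repeat-and-concatenate (B4).
import torch
import torch.nn as nn

class InvariantBlock(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # kernel (1, 3): mixes neighbouring SNPs without mixing haplotypes
        self.phi = nn.Conv2d(in_features, out_features // 2, kernel_size=(1, 3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.phi(x))           # (B, F/2, M, S-2)
        g = h.mean(dim=2, keepdim=True)       # (B, F/2, 1, S-2): rho, row-order invariant
        g = g.expand(-1, -1, h.shape[2], -1)  # repeated M times
        return torch.cat([h, g], dim=1)       # (B, F, M, S-2)

block = InvariantBlock(1, 50)
x = torch.randn(4, 1, 50, 400)
out = block(x)
out_perm = block(x[:, :, torch.randperm(50), :])  # shuffle the haplotypes
# the row-averaged features are identical whatever the haplotype order
print(torch.allclose(out.mean(dim=2), out_perm.mean(dim=2)))  # True
```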

For the ABC baselines, the tolerance rates ranged from 0.05 to 0.3 in steps of 0.05 and were optimized independently for 12 ABC algorithms (4 correction methods × 3 types of inputs: predefined summary statistics, SPIDNA outputs, or both).
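For readers less familiar with ABC, here is a toy Python sketch of the rejection step and the role of the tolerance rate; the actual analyses rely on established ABC implementations with the correction methods listed above, and all names below are illustrative.

```python
# Rejection ABC: keep the `tolerance` fraction of simulations closest to the
# observed statistics (which can be summary statistics, SPIDNA outputs, or both).
import numpy as np

def abc_rejection(obs_stats, sim_stats, sim_params, tolerance=0.05):
    """obs_stats: (d,); sim_stats: (n, d); sim_params: (n, p).
    Returns accepted parameters, approximating a sample from the posterior."""
    # standardize each statistic so distances are comparable across statistics
    mu, sd = sim_stats.mean(axis=0), sim_stats.std(axis=0) + 1e-12
    dist = np.linalg.norm((sim_stats - mu) / sd - (obs_stats - mu) / sd, axis=1)
    n_keep = max(1, int(tolerance * len(sim_stats)))
    return sim_params[np.argsort(dist)[:n_keep]]

rng = np.random.default_rng(2)
sim_params = rng.uniform(size=(10_000, 21))          # 21 population sizes
sim_stats = sim_params @ rng.normal(size=(21, 68))   # 68 summary statistics
obs_stats = sim_stats[0]
posterior_sample = abc_rejection(obs_stats, sim_stats, sim_params, 0.05)
print(posterior_sample.shape)  # (500, 21)
```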

As the training time of the MLP using summary statistics was short, we optimized its hyperparameters with a random search, drawing 27 configurations from uniform distributions and training a network for each configuration during 6 epochs. The batch size was drawn between 10 and 100, the learning rate between 5·10⁻⁵ and 1·10⁻², and the weight decay between 5·10⁻⁵ and 1·10⁻².

For the application to real data, we trained ABC, SPIDNA and a combination of both with the best hyperparameter configurations on the modified simulated data and performed the inference. The best version of SPIDNA without ABC is non-adaptive and therefore uses 400 SNPs from each segment, which represents 10% of the total number of SNPs in the cattle data and 67% for the training dataset.
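The random search described above reduces to a few lines; treating the batch size as an integer draw is our assumption, and the function name is illustrative.

```python
# Random search over the MLP hyperparameters, with uniform draws as stated.
import numpy as np

rng = np.random.default_rng(3)

def draw_configuration() -> dict:
    """One hyperparameter configuration drawn from uniform distributions."""
    return {
        "batch_size": int(rng.integers(10, 101)),   # 10 to 100
        "learning_rate": rng.uniform(5e-5, 1e-2),
        "weight_decay": rng.uniform(5e-5, 1e-2),
    }

configurations = [draw_configuration() for _ in range(27)]
print(configurations[0])
```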

All computational resources used for this study are described in the Supplementary Text.

The configuration with the lowest loss generated by the hyperparameter optimization procedure used 400 SNPs with SPIDNA, batch normalization, a weight decay of 2.069·10⁻², a learning rate of 1.416·10⁻² and a batch size of 78 (Figure S1). Configurations with large batch sizes tended to have lower losses (Figure S1), which is expected as large batches provide a better approximation of the full training set gradient. However, a batch size too close to the training set size can lead to overfitting of the training set. Here, we did not observe overfitting for any run when monitoring training and validation losses. The best configurations also tended to have low learning rates and weight decays (Figure S1).

These low values slow down convergence, but usually decrease the final prediction error if the budget (i.e. number of training epochs) is high enough for the network to reach convergence. For each architecture, we selected the best configuration obtained with the hyperparameter optimization procedure and trained it for a greater budget (i.e. 10 epochs), allowing an in-depth comparison. We found no strong decrease of prediction errors after this longer training compared to their counterparts with a 10⁷ budget (10⁷ training SNP matrices, i.e. 5.57 epochs) (Figures 3 and S1). Prediction errors for the validation set (used in the hyperparameter optimization procedure) and the test set are shown in table S2. In the following paragraphs, each method is designated by its index in the table.

We first compared the optimized neural networks to optimized ABC approaches based on predefined summary statistics.

The prediction error achieved by ABC rejection (index 0, i.e. without regression correction) was 0.496.

On average, a moderate selective pressure (s = 100–400) did not decrease the performance (Figure S3, top row). ABC inference for declining population datasets was the only one negatively impacted (increased error for s = 200 and 400). In fact, in multiple cases increasing s decreased the mean prediction error. Very strong selection (s = 800), on the other hand, led to an increased mean prediction error in all cases except for the declining histories inferred by SPIDNA. In addition, the 95% quantile and standard deviations of the prediction errors tend to increase with s (Figure S3), indicating that predictions should be interpreted more cautiously in the presence of strong positive selection.

This variance is systematically smaller for SPIDNA than for ABC. In particular, a handful of histories reconstructed with ABC are far off, while SPIDNA prediction errors remain comparatively low for all scenarios (Figure S4).

In this paper, we introduced a deep learning approach to infer the detailed size history of a single population directly from genomic data given an unknown recombination rate. This consisted in jointly inferring 21 population size parameters. We not only increased the complexity of the demographic model with respect to previous works such as Flagel et al. (2018), but also compared the performance of our architecture to other methods including ABC, and applied our approach to real data sets. We found that our approach compared competitively with one of the best approaches to date, with the added advantage of not relying on summary statistics. A robustness analysis based on a subset of demographic scenarios also indicated that SPIDNA might be more robust than ABC to the presence of positive selection in the data. Finally, we reconstructed the effective population size fluctuations of three cattle breeds and confirmed that they all had similar sizes when they were part of the same ancestral species Bos taurus and underwent a decline likely linked to their domestication, although the estimated strength of this decline depended on the inference method.

To interpret and compare the results, let us first note that in Figure 3, an error of 0 means perfect prediction, while an error of 1 means that no information is extracted from the input. Indeed, a function always outputting the same value, for all samples, can at best predict the average target value over the dataset, in which case the error is the standard deviation of the value to predict over the dataset, which is normalized to 1 in our setup.
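This normalization can be checked numerically; in the sketch below (with made-up target values), the constant predictor equal to the dataset mean achieves a mean squared error of 1 on standardized targets, as stated above.

```python
# On standardized targets, a constant (mean) predictor has an MSE of 1.
import numpy as np

rng = np.random.default_rng(4)
raw_targets = rng.lognormal(mean=10.0, sigma=1.0, size=100_000)
theta = (raw_targets - raw_targets.mean()) / raw_targets.std()  # standardized

constant_prediction = theta.mean()  # ~0 after standardization
print(np.mean((theta - constant_prediction) ** 2))  # ~1.0
```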

Processing the SNP and distance matrices with a MLP led to high prediction errors, especially for recent population sizes. This is not surprising, since genomic information is encoded as a simple list of values whose order has no meaning from the MLP point of view, so the network cannot exploit the information given by the data structure. In summary, an MLP configuration has several drawbacks: (i) the number of network parameters to estimate is high; (ii) the MLP can only retrieve the geometry of the data through training, with no guarantee that it will learn the spatial structure of the genome (i.e. the column order and distance between SNPs) or distinguish which individual each SNP comes from.

On the contrary, CNN layers process input elements by groups, allowing close SNPs to be processed together.

Among our permutation-invariant architectures, the best one (SPIDNA using batch normalization) had a smaller prediction error than our custom CNN. However, it is not clear whether this improvement is directly linked to its built-in permutation-invariance property, or to other differences between the two networks. Controlling the speed at which invariance is reached thanks to the parameter α improved the performance of the instance normalization SPIDNA, but not significantly the performance of the instance normalization adaptive SPIDNA (see table S2).

Robustness to the number of individuals
Importantly, SPIDNA adapts to the number of individuals, which is an advantageous property compared to many methods relying on summary statistics. SPIDNA can be trained on data sets having similar or varying sample sizes and, once trained, it can be directly applied to a dataset whose sample size is reasonably close but was not observed during training.

This was expected because this specific network was not exposed to diverse sampling sizes during training.

This small performance drop is likely due to differences in normalization rather than to the adaptive feature. Indeed, the best non-adaptive SPIDNA uses batch normalization, while the adaptive versions use instance normalization, as there is currently no implementation of batch normalization for batches with inputs of mixed sizes. We think that the adaptive architecture could greatly benefit from an optimized implementation of adaptive batch normalization or from an implementation of batches with mixed data sizes. Nonetheless, SPIDNA networks with instance normalization performed much better when using all SNPs rather than only the first 400 SNPs, which suggests that adaptability is a useful feature (see table S2).

Our adaptive architecture provides an alternative to data compression based on computer vision algorithms: since such compression is not optimized for the task of interest, it could induce information loss by reducing the data prematurely.

In this sense, it echoes the evolution of the computer vision field, which replaced handcrafted descriptors and processing pipelines (e.g., SIFT features to describe image keypoints (Lowe, 2004), and the "bag of visual words" pipeline (Sivic and Zisserman, 2003) to build an exploitable representation of them through clustering and histograms) by ones that can be optimized. It is also an alternative to padding, a technique that consists in completing the SNP and distance matrices at the edges so that they all match the biggest simulated SNP matrix; it is then left to the neural network to guess where the real genetic data stops and where the padding starts, which may make the task harder.

In a transfer learning setting, a new network will benefit from the embedding already learned for the previous task, improving error and learning time. We also highlight that, as for most ABC methods, the parameters are inferred jointly, a major point as the common population genetics model parameters almost never have an independent impact on shaping genetic diversity. We noted that for highly fluctuating population sizes, SPIDNA estimated smooth histories. Smoothing can be seen as a good byproduct and was, for example, achieved on purpose by SMC++ thanks to a spline regularization scheme (Terhorst et al., 2017). A tentative explanation for SPIDNA's smoothing effect, although no regularizer was used, is that it is easy for neural networks to express smoothing filters in their last layer. As, in our task, smoothing is correlated with lower prediction variance, the training of SPIDNA naturally tends to smooth out its predictions. This could be seen as a tendency to favor low variance in the bias/variance trade-off.

We found that adding an ABC step to the deep learning approach increased its performance. This ABC step takes as input the SPIDNA predictions, treated as summary statistics. Finally, we showed that applying ABC to SPIDNA predictions combined with precomputed summary statistics led to an error 4.7% smaller than the one of a regular ABC and 6.0% smaller than the one of SPIDNA. This indicates that the information retrieved by SPIDNA does not completely overlap the one encoded in the predefined summary statistics, but does not fully replace it either.

Applying a method trained on simulated data to a real dataset can be a difficult task. Here we show that the estimated effective population sizes of the three cattle breeds were qualitatively similar across the different methods used. All of them were able to recover the large ancestral population size shared by the three breeds, followed by its decline after domestication. However, the methods produced size estimates that were quantitatively different, notably in the strength of the decline and the recent population sizes. For quality reasons, inference was done using genotypes pruned of low-frequency alleles rather than haplotypes. The architecture and hyperparameters were optimized based on simulated haplotypes, and the network was trained on simulated genotypes. It is possible that an architecture designed with a new hyperparameter optimization procedure calibrated for filtered genotypes would decrease SPIDNA's error rate.

However, the discrepancy between ABC and SPIDNA reconstructions in the last 10,000 years might also have other causes.

We addressed a challenging task in population genetics, that is, reconstructing effective population size through time.

We showed that this demographic inference could be done for unknown recombination rates.

Figure 2: Schematic of the SPIDNA architecture. SPIDNA takes as input a SNP matrix associated with its vector of distances between SNPs (in blue). A convolution layer is applied to the SNPs (A1) and another convolution layer is applied to the distances (A2). Results of A2 are repeated to be concatenated with results from A1 (A3). The output is passed to a series of seven SPIDNA blocks (A4). Each SPIDNA block starts with a convolution layer (B1) followed by the mean over rows of the convolution layer result (B2) and the mean over columns of the B2 result (B3). The concatenation of B1 and B2 results (B4) is processed by a max pooling layer (B5) and passed to the next SPIDNA block. In parallel, the output of B3 is processed by a fully-connected layer (B6). The prediction vector (in green) is updated at each SPIDNA block with a sum (C1) of its previous value and the B6 results. It is finally output by the last block as the predicted demographic parameters (C2).

Figure 3: Prediction errors on the test set of the best run of each method after the hyperparameter optimization. The best configurations of each ANN (MLP, custom CNN and SPIDNA) have been retrained for 10 epochs. Traditional ABC methods are depicted in yellow, deep MLPs and CNNs in red and orange, SPIDNA ANNs in blue, and combinations of ANNs and ABC in green. Methods are grouped into 4 families: "Summary statistics" (processed by ABC or ANN), "SNP matrices" (processed by ANN), "SPIDNA outputs" (processed by ABC, no summary statistic used), and "Summary statistics and SPIDNA outputs" (processed by ABC). Vertical black lines on top of each bar represent the 95% confidence interval of prediction errors. The horizontal dashed line indicates the lowest error obtained (adaptive SPIDNA + ABC with local linear regression using summary statistics and SPIDNA outputs).

Figure S1: Population size prediction error for each run of the hyperparameter optimization procedure. X-axes indicate the hyperparameter (batch size, learning rate, weight decay and alpha) or budget values, and colors indicate the type of network used for the run (MLP, custom CNN and multiple SPIDNA architectures). For each network the best run is surrounded by a square.

Figure S2: Robustness to the simulator tool. Distributions of SPIDNA predictive errors per replicate (i.e., per independent genomic region) for four demographic scenarios and three different genetic simulators (msprime, ms, msms). The SPIDNA batch norm. network was trained on simulated datasets generated with msprime. The test datasets were generated by different simulators, based on the same demographic parameters and under neutrality. X-axes: simulator for the test set; Y-axes: predictive error for each region/replicate (i.e. for each matrix of size 50 samples × 400 SNPs) averaged over the 21 time steps. Each violin describes 100 replicates.

Figure S3: Robustness to the presence of positive selection. ABC and SPIDNA (batch norm.) predictive errors computed from 100 2Mb-long regions for three demographic scenarios (Medium constant size, Decline and Expansion) under various selective pressures (with additive fitness effect). The reference/training set is the same as the one used throughout the paper (neutral simulations generated with msprime from a prior distribution on recombination rate and population sizes).
The test datasets were simulated using msms with multiple values of selection strength, starting time and initial frequency of the beneficial allele. X-axes: selection coefficient SAa. Y-axes: mean (top), 95% quantile (middle row) and variance (bottom row) estimators of the predictive error (across 30 test sets for SAa = 0 and 144 test sets for any other SAa value). Vertical bars correspond to 95% confidence intervals computed via bootstrap.

Figure S5: Robustness to sample size. Distributions of SPIDNA predictive errors per replicate (i.e., per independent genomic region) for four demographic scenarios and different sampling sizes. The SPIDNA (batch norm.) network was trained on simulated datasets containing exactly 50 samples. The test datasets were simulated with msprime based on the same four demographic parameter sets but with different sample sizes (ranging from 10 to 150 haplotypes). X-axes: sample size M of the targeted region; Y-axes: predictive error for each replicate (i.e. for each matrix of size M samples × 400 SNPs) averaged over the 21 time steps. Each violin describes 100 replicates.

Figure S6: Effective population size of three cattle breeds inferred by the best SPIDNA architecture (SPIDNA batch normalization) and by ABC based on SPIDNA outputs. Boxplots show the dispersion of SPIDNA predictions (over replicates). For each history inferred by SPIDNA combined with ABC, we display the posterior median (plain line) and the 95% credible interval. Domestication is estimated to have occurred 10,000 years ago (vertical dotted line).