Revisiting the Out of Africa event with a novel Deep Learning approach

Anatomically modern humans evolved around 300 thousand years ago in Africa1. Modern humans started to appear in the fossil record outside of Africa about 100 thousand years ago, although other hominins existed throughout Eurasia much earlier2-4. Recently, several researchers have argued in favour of a single out of Africa event for modern humans based on whole-genome sequence analyses5-7. However, the single out of Africa model is at odds with some findings from the fossil record, which support two out of Africa events8,9, and with uniparental data, which suggest a back to Africa movement10,11. Here, we used a novel deep learning approach coupled with Approximate Bayesian Computation and Sequential Monte Carlo to revisit these hypotheses from the whole-genome sequence perspective. Our results support the back to Africa model over the alternatives. We estimate two successive splits between African and out of Africa populations around 60-80 thousand years ago, separated by 12-13 thousand years. One of the populations resulting from the more recent split largely replaced the older West African population, while the other founded the out of Africa populations.


In the last few decades, the development of efficient and powerful computing infrastructure has allowed substantial progress in the machine learning field, especially for computationally demanding algorithms such as Neural Networks (NN)12,13 and Bayesian methods14,15. NNs have been demonstrated to be a useful tool for specific types of task, such as classification or natural language processing12-14,16. However, NNs require a large amount of data as a training set. Simulated datasets are one strategy to overcome this limitation: the simulation of synthetic genetic data can substantially mitigate this problem17,18. NNs have already been adopted in population genomics studies to interpret genomic data in terms of underlying demography19-21 and positive selection22,23. However, unlike classical approaches, it is still challenging to measure the significance of a prediction performed by a NN, given that it is a black-box approach. Approximate Bayesian Computation (ABC) can be used to weigh the accuracy of a NN-based prediction from the data itself, without knowing the likelihood function19,20,24.

Recent fossil record analyses suggest that anatomically modern humans appeared around 300 thousand years ago (kya) in Africa1. This hypothesis is corroborated by genetic data25, which place the deepest splits between modern human populations in a similar time interval. Although fossil records suggest that there might have been multiple Out Of Africa (OOA) events for modern humans26, recent genetic studies revealed that all modern non-African (OOA) populations fit a model characterised by a single OOA event5-7.
This conclusion indicates that older OOA migrations, documented by archaeological records, might not have contributed much to modern human populations, with the possible exception of Oceania (Papuan populations)27 and some archaic hominins28.

While the single OOA model finds support in both autosomal and uniparental data11,29,30, there is some evidence for a more complicated scenario. Most uniparental haplogroups are closer to each other in OOA populations than African haplogroups are (thus having a shorter time to the most recent common ancestor [TMRCA]), corroborating a single clean OOA model, apart from the sister Y haplogroups D and E. Haplogroup D is found in isolated populations in Asia (e.g., Andamanese, Tibetan and Japanese), whereas haplogroup E is ubiquitous in sub-Saharan African populations. These two haplogroups are slightly closer to each other than either is to any other haplogroup found in OOA populations11,31. This observation might be explained by a back to Africa migration11 or a more complicated scenario32. Some autosomal analyses also suggest that the separation between African and OOA populations might not have been a single split event33-36. Testing these hypotheses (single out of Africa, back to Africa and two out of Africa) is challenging due to the strong bottleneck of non-African populations37.

The general workflow for ABC-DLS (both for model selection and parameter estimation) includes the following steps. First, we simulated multiple sets of genetic data18 for each tested model using demographic parameters sampled from a uniform distribution within prior ranges (Table 1). Next, we converted these data into a joint site frequency spectrum (SFS) (although potentially any other summary statistics (SS) could be used) and split the data into a training and a testing subset. We then trained the NN (implemented using TensorFlow44 with a Keras backend45) on the training dataset either to select between demographic models or to estimate the demographic parameters. The resulting NN is applied to the testing dataset as well as to the observed SS data (see below and Methods for more details). Next, we apply ABC to estimate support for the NN prediction on the observed data by comparing the NN prediction outcome between the observed data and the testing dataset (see Methods, Supplementary Figure 2 and our previous paper19). Finally, in cases where SMC is used, we iterate the parameter estimation step by SMC. When estimating the posterior range for the parameters using ABC, we kept the top five percent (equal to the tolerance level) of simulations from the testing dataset that best matched the observed data. We then used the parameters of those simulations to update our prior range and passed it to the next iteration until convergence was reached (Supplementary Figures 2 and 3). […] the prediction certainty varied between methods (Table 3). While DLS returned 100% probability for model B, DL and RF gave lower support. Also, when 10 independent runs were tested, model B won 10 times out of 10 using DLS and 9 out of 10 using DL. Moreover, the Bayes factor between model B and model S predicted by DL was 6.69. This suggests that we cannot completely reject model S with DL.
This difference in prediction certainty was likely due to the better power of DLS to differentiate between the three models compared to the other methods (Table 3).

The DLS results were reproduced under different data filtering strategies and different datasets (Supplementary […]). Our inference suggests that there was first a separation between the Ancient African population (AA) and a population ancestral to both the Back-to-Africa and the actual Out-of-Africa populations (OOAʹ) around 72.2 (CI 70.6-73.7) kya, followed by a split between the back to Africa (B2A) and OOA populations 59 (CI 57.4-60.5) kya and an admixture between AA and B2A 47.7 (CI 46.4-49.1) kya. The Neanderthal introgression into OOA happened much later, 39.9 (CI 38.9-40.9) kya, suggesting that this back to Africa migration cannot explain the Neanderthal ancestry found in modern African populations55. Our method predicted the admixture proportion from B2A to be as high as 92% (CI 89.66-93.08%), suggesting a massive replacement of the AA population.

Our results also comply with the Y-chromosomal phylogeny and support back to Africa as proposed before11. However, our estimated separation times between populations are much younger than those reported for Y chromosomes. One explanation might be that we used a slightly higher mutation rate (1.45×10⁻⁸ per bp per generation)57 instead of a slightly slower alternative (1.25×10⁻⁸ per bp per generation)58,59. When we used the slower mutation rate, our estimates for most of the event times increased (Supplementary Table 11). Indeed, the separation time between the B2A and OOA populations then corresponds to 67 (CI 66.3-67.6) kya, which is close to the estimate of the TMRCA between haplogroups D and E (72 kya11).
To independently validate our results, we compared effective population size (Ne) trajectories and cross-coalescence rates obtained by applying Relate32 to real data as well as to data simulated under each of the three models using the mean posterior parameters (Table 2 and Supplementary Tables 2 and 3) predicted by DLS (following the flowchart in Supplementary Figure 3a)34. We observe a close match between the estimates for the real data and our best predicted model (Figure 2), which suggests that our parameter estimation is accurate. This similarity is particularly interesting given that we did not use any LD-based SS to optimize those parameters. On the other hand, neither the Ne trajectory nor the cross-coalescence rate over time is informative enough to differentiate between the three models (data not shown). Specifically, the gradual separation between African and OOA populations, which was observed before with Relate and similar methods33,34, cannot be directly explained by the back to Africa or two out of Africa migration, as this separation is also matched by our model S (Supplementary Figure 4).

Discussion

We have shown here that ABC analysis can be substantially improved by using NN coupled with the SMC approach. Our methodology can test any hypothesis that can be simulated, including scenarios that cannot be extensively tested by other methods (especially admixture from ghost populations for which ancient genomes are unavailable), and can accommodate any kind of SS. In this study, we used the SFS as SS because it is straightforward to calculate and carries sufficient information37,60. Our results might be further improved by using some LD-based SS53,61, but we opted out, as they are computationally demanding to produce and the improvement in the results is minimal (at least for the tested scenario).
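The joint SFS used as SS throughout can be illustrated concretely. Below is a minimal numpy sketch under simplifying assumptions (biallelic sites, known ancestral state, phased 0/1 haplotypes, two populations, no missing data); it is a schematic illustration, not the pipeline's own code, and the function name is hypothetical:

```python
import numpy as np

def joint_sfs(geno, pop_sizes):
    """geno: (sites, haplotypes) 0/1 matrix; pop_sizes: haplotypes per population.
    Returns the 2D joint site frequency spectrum: entry [i, j] counts sites
    where the derived allele appears i times in pop 1 and j times in pop 2."""
    n1, n2 = pop_sizes
    c1 = geno[:, :n1].sum(axis=1)          # derived-allele counts in pop 1
    c2 = geno[:, n1:n1 + n2].sum(axis=1)   # derived-allele counts in pop 2
    sfs = np.zeros((n1 + 1, n2 + 1), dtype=int)
    np.add.at(sfs, (c1, c2), 1)            # tally each site into its SFS cell
    return sfs

# Toy example: 3 sites, 2 haplotypes per population.
g = np.array([[1, 0, 0, 0],   # derived counts (1, 0)
              [1, 1, 1, 0],   # derived counts (2, 1)
              [0, 0, 1, 1]])  # derived counts (0, 2)
print(joint_sfs(g, (2, 2)))
```

For the NN, a spectrum like this would be flattened into one row of summary statistics per simulation.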
Although our approach (DLS) is fast enough, the main bottleneck is currently the production of the simulated SS data.

In our models, we have not included any migration rates between populations, although our approach can accommodate them. This is because we found that our approach (parameter estimation using DLS) predicted non-zero migration rates when we used mock observed SS data coming from a pulse model with no migration (mean values from […]).

In msprime, admixture events were represented as MassMigration events (the fraction of a population replaced by another population in a single generation). In contrast, migration rates under island models (where applicable) were represented as continuous migration rates (the fraction of a population replaced by another population per generation, over several generations).

The ABC-DLS analysis is efficient enough to be run on a single computer. The main bottleneck of the whole approach is the production of the SFS data. Msprime is fast, but the total amount of data that needs to be simulated for the NN is impossible to produce on a single computer. We used a snakemake pipeline to produce the SFS on a cluster74.

Demographic models

Simple out of Africa (model S)

In this simulation model, we modeled a simple OOA event (Supplementary Figure 1) […] for that simulation, with the rest of the columns representing SFS elements. We ran parameter estimation on this CSV file to retrieve the parameters predicted on observed data for the given model. We used the known parameters as labels for training the NN (y), and we used the SFS as input (x). Thus, the NN can be thought of here as an inverse function of the simulation. We kept 10,000 random lines for the testing dataset and ABC analysis, and the rest were used for training the NN. All the columns of the SFS and the parameters were normalized with a MinMax scaler75, so that the whole dataset lies between 0.0 and 1.0 per column.

We used four hidden layers of a dense NN (Supplementary Figure 5) with relu activation, and a linear output layer with the same number of units as the number of parameters. We used a Masking layer at the beginning to remove cells with zero from the learning algorithm, followed by Gaussian noise injection of 0.05 to introduce some noise (Supplementary Figure 5). We used logcosh as the loss function and Nadam as the optimizer. The NN ran through the training dataset several times (epochs) to increase accuracy. We used EarlyStopping on the validation loss with a patience of 100, and used ModelCheckpoint to keep the model with the lowest validation loss. We also used ReduceLROnPlateau with a factor of 0.2 to reduce the learning rate if the loss plateaued for several epochs (10 by default).

After training was done, we used the testing dataset to predict the parameters from the SFS, which were then used for cross-validation tests and parameter prediction using loclinear from the R abc package76 with a tolerance of 0.01.

This approach is similar to what we published before as ABC-DL19, although we have used the latest tools (TensorFlow, Keras, and hdf5).
The whole approach is presented in a flowchart (Supplementary Figure 2).
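Two of the ingredients above, the per-column MinMax normalisation and the logcosh loss, are simple enough to illustrate directly. This is a numpy sketch of the same formulas implemented by sklearn's MinMaxScaler and Keras's logcosh, not the training code itself:

```python
import numpy as np

def minmax_scale(x):
    """Scale each column to [0, 1], as sklearn's MinMaxScaler does."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (x - lo) / span

def logcosh(y_true, y_pred):
    """log(cosh(error)): roughly quadratic for small errors and linear for
    large ones, making it less sensitive to outliers than squared error."""
    return np.mean(np.log(np.cosh(y_pred - y_true)))

x = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
print(minmax_scale(x))  # each column now spans exactly 0..1
print(logcosh(np.zeros(3), np.array([0.1, 0.0, -0.1])))  # small loss near 0
```

Normalising both the SFS columns and the parameter labels to the same [0, 1] range keeps the regression targets on a scale the network can fit easily.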

Model Selection with DL
Here we describe model selection using NN and ABC. After running simulations for the three demographic models (sample numbers are the same as above per model), we produced three corresponding CSV files. These CSV files are used together as input for model selection.

We used the SFS as input (x) to the NN and the model names as the output (y), and removed the parameters, as they are not necessary for this step. We used the MinMax scaler from sklearn75 only on the SFS data, as above, and the names of the models were converted to one-hot encoding using pandas.Categorical and keras.utils.to_categorical. After concatenating the files coming from all the competing models, we randomized the rows with custom code77. We left around 10,000 random lines per model to test the power of the NN (as a testing dataset) and for the ABC analysis, and used the rest to train the NN (as a training dataset). The rest of the approach is exactly the same as before.

We used two hidden layers of a dense NN with relu activation (Supplementary Figure 5). We used softmax for the output layer, with the same number of units as the number of trained models. We added a Masking layer and a noise injection layer as above. We used a 1% dropout layer after every dense layer to make the model more robust. We used categorical_crossentropy from Keras as the loss function and adam as the optimizer.

After the training was done, we used the testing dataset to predict the models from the simulated SFS, which were then used to perform the cross-validation test using ABC with a tolerance of 0.0033 (which corresponds to 100 samples for three models) using mnlogistic. We performed the model selection (abc.postpr) using the real data. See the schematic representation in Supplementary Figure 2. This approach is also similar to our previous study19.
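The one-hot encoding of model labels mentioned above amounts to the following. This is a minimal sketch in plain numpy (the manuscript uses pandas.Categorical with keras.utils.to_categorical); the model labels in the example are placeholders:

```python
import numpy as np

def one_hot(labels, classes=None):
    """Convert model-name labels to one-hot rows (one softmax unit per model)."""
    if classes is None:
        classes = sorted(set(labels))           # fix a stable class order
    index = {c: i for i, c in enumerate(classes)}
    out = np.zeros((len(labels), len(classes)), dtype=float)
    for row, lab in enumerate(labels):
        out[row, index[lab]] = 1.0              # mark the label's class column
    return out

y = one_hot(["S", "B", "B", "S"])  # classes sorted -> ["B", "S"]
print(y)
```

Each row sums to one and marks exactly one class, matching the shape expected by a softmax output layer trained with categorical_crossentropy.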

Parameter Estimation with DLS
This method uses the classic parameter estimation strategy of DL (described above) together with the SMC algorithm used for recursion. The approach here is close to the classic SMC approach47, but not exactly the same.

We used the rejection method in ABC for parameter estimation, as it generates a posterior within the prior range, with a tolerance of 0.05. We obtained the posterior range by taking the minimum and maximum values from the ABC output. This range was then used as the prior range for the next iteration. This cycle was repeated until the shrinking ratio for all parameters exceeded 0.95, suggesting that convergence had been reached.

If the shrinking ratio is more than 0.95 for a parameter, the new posterior estimate is rejected for that parameter and the prior is kept for another cycle instead. This strategy ensures that the posterior stops shrinking unless the NN has found an accurate prediction for the parameter from the SFS. In every cycle, we kept the simulations that fall inside the new posterior range, so that simulations from a previous cycle can be reused in the next one. A flow chart of this strategy can be found in Supplementary Figure 2c.

We used 10,000 simulations as a training dataset and 10,000 simulations for testing. The NN model is exactly as before (as used in DL, Supplementary Figure 5). To make the procedure more efficient, we started by simulating a total length of 100 Mbp (each simulated region being 1 Mbp long) and then increased it stepwise (i.e., 0.5, 1.5, and 3 Gbp). The priors for the 100 Mbp regions are the same as presented in Table 1. The final posterior (after convergence was reached) of the 100 Mbp step was used as the prior for the 0.5 Gbp simulations, and so on. We multiplied the observed SFS by the corresponding fraction (frac) to scale it to the simulated region length.
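The recursion above (keep the best-matching tolerance fraction of simulations, take the min/max of their parameters as the next prior, and reject updates that barely shrink) can be sketched as follows. This is a schematic numpy illustration under our reading of the rule, not the ABC-DLS code; in the real pipeline the comparison uses NN predictions rather than a raw Euclidean distance, and all names are hypothetical:

```python
import numpy as np

def update_prior(params, summaries, observed, tolerance=0.05):
    """Rejection step: keep the `tolerance` fraction of simulations whose
    summaries are closest to the observed data, and return the min/max of
    their parameters as the candidate posterior range."""
    dist = np.linalg.norm(summaries - observed, axis=1)
    keep = np.argsort(dist)[: max(1, int(tolerance * len(params)))]
    return params[keep].min(axis=0), params[keep].max(axis=0)

def accept_posterior(prior, posterior, threshold=0.95):
    """Per parameter, accept the new range only if it genuinely shrank the
    prior (width ratio <= threshold); otherwise keep the prior for another
    cycle. Converged when no parameter shrinks any further."""
    ranges, converged = [], True
    for (p_lo, p_hi), (q_lo, q_hi) in zip(prior, posterior):
        if (q_hi - q_lo) / (p_hi - p_lo) > threshold:
            ranges.append((p_lo, p_hi))      # barely shrank: reject the update
        else:
            ranges.append((q_lo, q_hi))      # accepted: not yet converged
            converged = False
    return ranges, converged

# Toy cycle: one parameter whose summary statistic tracks it closely.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 10.0, size=(1000, 1))          # prior draws
ss = theta + rng.normal(0.0, 0.1, size=theta.shape)     # simulated summaries
lo, hi = update_prior(theta, ss, observed=np.array([5.0]))
ranges, done = accept_posterior([(0.0, 10.0)], [(lo[0], hi[0])])
# The range tightens around 5 and the loop would continue (done is False).
```

In the full procedure this cycle repeats, with fresh simulations drawn from each accepted range, until every parameter's update is rejected.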
After convergence was reached with 3 Gbp in total, we finalized the analysis by running 50,000 training and 10,000 testing simulations with the DL method, using loclinear with a tolerance of 0.01. The flowchart of the method is presented in Supplementary Figure 3.
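The loclinear step is ABC's local-linear regression adjustment: the parameters of accepted simulations are regressed on their summary statistics, and each accepted draw is shifted to its predicted value at the observed statistics. Below is a schematic numpy version of the idea behind the R abc package's "loclinear" method (unweighted here, whereas the package uses kernel-weighted regression); it is an illustration, not the package itself:

```python
import numpy as np

def loclinear_adjust(theta, ss, ss_obs):
    """Fit theta ~ ss by least squares over accepted simulations and shift
    each accepted theta toward the observed summaries:
    theta* = theta - B @ (ss - ss_obs)."""
    X = ss - ss_obs                              # centre on the observation
    X1 = np.column_stack([np.ones(len(X)), X])   # add an intercept term
    coef, *_ = np.linalg.lstsq(X1, theta, rcond=None)
    B = coef[1:]                                 # slopes w.r.t. each statistic
    return theta - X @ B

# Toy example where the summary is (almost) twice the parameter.
rng = np.random.default_rng(1)
theta = rng.uniform(0.0, 1.0, size=(200, 1))
ss = 2.0 * theta + rng.normal(0.0, 0.01, size=theta.shape)
adj = loclinear_adjust(theta, ss, ss_obs=np.array([1.0]))
# Adjusted draws concentrate near 0.5 although raw draws span [0, 1].
```

The adjustment sharpens the posterior without extra simulations, which is why it is worth applying once at the final, largest-data step.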

Model Selection with DLS
Here we describe model selection using NN, ABC and SMC together. In principle, we could directly use the final output of parameter estimation by DLS for every model and then use it for the ABC classification approach. However, this would be inefficient, given that only one model is likely for our real dataset, and spending considerable resources on optimizing parameters for unlikely scenarios does not make sense. Instead, we used the output of the 100 Mbp parameter optimization from the DLS approach as a prior for every model, and then applied the model selection strategy of DL, as described above. We found that 100 Mbp of optimization already gives enough power to distinguish between the models.

We tested the real SFS against the three simulated models using a similar ABC approach but with Random Forest78 as the inferential tool, as implemented in the abcrf R package48,79. First, we trained our model using the bagging method with the function abcrf, with no Linear Discriminant Analysis and 2,000 decision trees, using 150,000 simulations (50,000 for each tested model). We then evaluated the performance of ABC-RF on a cross-validation dataset composed of 10,000 simulations for each tested model using the function predict.abcrf. The same function and settings were used to infer the best-supported model using the SFS obtained from the real data described above. We performed parameter estimation for the most supported scenario applying regression as implemented in the regAbcrf function, using 1,000 decision trees.

CI is the 2.5%-97.5% confidence interval of the respective parameter. ky means thousand years and kya means thousand years ago.

Here we compare effective population sizes (a) and relative cross-coalescence rates (b) estimated using Relate for the real data and for data simulated under model B. a) The x axis is in kya (thousand years ago) and the y axis is the effective population size for the corresponding populations presented in the inset; both axes are on a log scale. b) The x axis is in kya and the y axis shows the relative cross-coalescence rate for the corresponding population pairs presented in the inset; the x axis is on a log scale.