## Abstract

Ancient genome sequencing technologies now provide the opportunity to study natural selection in unprecedented detail. Rather than making inferences from indirect footprints left by selection in present-day genomes, we can directly observe whether a given allele was present or absent in a particular region of the world at almost any period of human history within the last 10,000 years. Methods for studying selection using ancient genomes often rely on partitioning individuals into discrete time periods or regions of the world. However, a complete understanding of natural selection requires more nuanced statistical methods which can explicitly model allele frequency changes in a continuum across space and time. Here we introduce a method for inferring the spread of a beneficial allele across a landscape using two-dimensional partial differential equations. Unlike previous approaches, our framework can handle time-stamped ancient samples, as well as genotype likelihoods and pseudohaploid sequences from low-coverage genomes. We apply the method to a panel of published ancient West Eurasian genomes, to produce dynamic maps showcasing the inferred spread of candidate beneficial alleles over time and space. We also provide estimates for the strength of selection and diffusion rate for each of these alleles. Finally, we highlight possible avenues of improvement for accurately tracing the spread of beneficial alleles in more complex scenarios.

## Introduction

Understanding the dynamics of the spread of a beneficial allele through a population is one of the fundamental problems in population genetics (Ewens, 2012). We are often interested in knowing the location where an allele first arose and the way in which it spread through a population, but this is often unknown, particularly in natural, non-experimental settings where genetic sampling is scarce and uneven.

Patterns of genetic variation can be used to estimate how strongly natural selection has affected the trajectory of an allele and to fit the parameters of the selection process. The problem of estimating the age of a beneficial allele, for example, has yielded a rich methodological literature (Slatkin & Rannala, 2000), and recent methods have exploited fine-scale haplotype information to produce highly accurate age estimates (Mathieson & McVean, 2014; Platt *et al*., 2019; Albers & McVean, 2020). In contrast, efforts to infer the geographic origins of beneficial mutations are scarcer. These include Novembre *et al*. (2005), who developed a maximum likelihood method to model the origin and spread of a beneficial mutation and applied it to the *CCR5*-Δ32 allele, which was, at the time, considered to have been under positive selection (Stephens *et al*., 1998; Sabeti *et al*., 2005; Novembre & Han, 2012). Similarly, Itan *et al*. (2009) developed an approximate Bayesian computation (ABC) approach using demic simulations, in order to find the geographic and temporal origins of a beneficial allele, based on present-day allele frequency patterns.

As ancient genome sequences become more readily available, they are increasingly being used to understand the process of natural selection (see reviews in Malaspinas *et al*. (2012); Dehasque *et al*. (2020)). However, few studies have used ancient genomes to fit spatial dynamic models of the spread of an allele over a landscape. Most spatiotemporal analyses which included ancient genomes have used descriptive modelling in order to learn the spatiotemporal covariance structure of allele frequencies (Ségurel *et al*., 2020) or hidden ancestry clusters (Racimo *et al*., 2020b), and then used that structure to hindcast these patterns onto a continuous temporally evolving landscape. In contrast to descriptive approaches, dynamic models have the power to infer interpretable parameters from genomic data and perhaps reveal the ultimate causes for these patterns (Wikle *et al*., 2019).

Dynamic models can also contribute to ongoing debates about the past trajectories of phenotypically important loci. For example, the geographic origin of the rs4988235(T) allele, upstream of the *LCT* gene and associated with adult lactase persistence in most of Western Eurasia (Enattah *et al*., 2002), remains elusive, as does the way in which it spread (an extensive review can be found in Ségurel & Bon, 2017). The allele has been found in different populations, with frequencies ranging from 5% up to almost 100%, and its selection coefficient has been estimated to be among the highest in human populations (Bersaglieri *et al*., 2004; Enattah *et al*., 2008; Tishkoff *et al*., 2007). However, the exact causes for its adaptive advantage are contested (Szpak *et al*., 2019), and it has been suggested that the selection pressures acting on the allele may have been different in different parts of the continent (Gerbault *et al*., 2009). Ancient DNA evidence shows that the allele was rare in Europe during the Neolithic (Burger *et al*., 2007; Gamba *et al*., 2014; Allentoft *et al*., 2015; Mathieson *et al*., 2015) and only became common in Northern Europe after the Iron Age, suggesting a rise in frequency during this period, perhaps mediated by gene flow from regions east of the Baltic where this allele was more common during the onset of the Bronze Age (Krüttli *et al*., 2014; Margaryan *et al*., 2020). Itan *et al*. (2009) deployed their ABC approach to model the spatial spread of the rs4988235(T) allele and estimated that it was first under selection among farmers around 7,500 years ago, possibly between the central Balkans and central Europe. Others have postulated a steppe origin for the allele (Allentoft *et al*., 2015), given that the rise in frequency appears to have occurred during and after the Bronze Age migration of steppe peoples into Western Eurasia (Haak *et al*., 2015; Allentoft *et al*., 2015). However, the allele is at low frequency in genomes of Bronze Age individuals associated with Corded Ware and Bell Beaker assemblages in Central Europe who have high steppe ancestry (Mathieson *et al*., 2015; Margaryan *et al*., 2020), complicating the story further (Ségurel & Bon, 2017).

The origins and spread dynamics of large-effect pigmentation-associated SNPs in ancient Eurasians have also been intensely studied (Ju & Mathieson, 2020). Major loci of large effect on skin, eye and hair pigmentation have been documented as having been under recent positive selection in Western Eurasian history (Voight *et al*., 2006; Sabeti *et al*., 2007; Pickrell *et al*., 2009; Lao *et al*., 2007; Mathieson *et al*., 2015; Alonso *et al*., 2008; Hudjashov *et al*., 2013). These include genes *SLC45A2*, *OCA2*, *HERC2*, *SLC24A5* and *TYR*. While there is extensive evidence supporting the adaptive significance of these alleles, debates around their exact origins and spread are largely driven by comparisons of allele frequency estimates in population groups which are almost always discretized in time and/or space. Among these, selection at the *TYR* locus is thought to have occurred particularly recently, over the last 5,000 years (Stern *et al*., 2019), driven by a recent mutation (Albers & McVean, 2020) that may have spread rapidly in Western Eurasia.

Here, we develop a method to model the spread of a recently selected allele across both space and time, avoiding artificial discretization schemes to more rigorously assess the evidence for or against a particular dispersal process. We begin with the model proposed by Novembre *et al*. (2005), and adapt it in order to handle ancient low-coverage genomic data, and explore more complex models that allow for both diffusion and advection (i.e. directional transport) in the distribution of allele frequencies over space, as well as for a change in these parameters at different periods of time. We apply the method to alleles in two of the aforementioned loci in the human genome, which have been reported to have strong evidence for recent positive selection: *LCT/MCM6* and *TYR*. We focus on Western Eurasia during the Holocene, where ancient genomes are most densely sampled, and infer parameters relevant to the spread of these alleles, including selection, diffusion and advection coefficients.

## Results

### Summary of model

We based our statistical inference framework on a model proposed by Novembre *et al*. (2005) to fit allele frequencies in two dimensions to present-day genotype data spread over a densely sampled map. We extend this model in several ways:

- We incorporate temporally sampled data (ancient genomes) to better resolve changes in frequency distributions over time.
- We make use of genotype likelihoods and pseudohaploid genotypes to incorporate low-coverage data into the inference framework.
- We permit more general dynamics by including advection parameters.
- We allow the selection, advection and diffusion parameters to be different in different periods of time. Specifically, to reflect changes in population dynamics and mobility before and after the Bronze Age (Loog *et al*., 2017; Racimo *et al*., 2020a), we partitioned the model fit into two time periods: before and after 5,000 years BP.
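As an illustration of how low-coverage data could enter such a framework, the following minimal sketch computes per-sample log-likelihoods of a local allele frequency. It is not the implementation used in this study: the function names are hypothetical, and it assumes Hardy-Weinberg genotype proportions for diploid samples and a single randomly drawn allele for pseudohaploid samples.

```python
import math

EPS = 1e-12  # assumption: floor that keeps log() finite when a probability is 0


def hwe_genotype_probs(p):
    """Probabilities of carrying 0, 1 or 2 derived alleles under Hardy-Weinberg."""
    return (1 - p) ** 2, 2 * p * (1 - p), p ** 2


def diploid_log_lik(p, genotype_likelihoods):
    """Log-likelihood of frequency p given P(data | genotype) for genotypes 0, 1, 2."""
    total = sum(gl * pr for gl, pr in zip(genotype_likelihoods, hwe_genotype_probs(p)))
    return math.log(max(total, EPS))


def pseudohaploid_log_lik(p, derived):
    """Log-likelihood of p for one sampled allele (1 = derived, 0 = ancestral)."""
    prob = p if derived else 1 - p
    return math.log(max(prob, EPS))


def total_log_lik(freqs, samples):
    """Sum log-likelihoods over samples.

    freqs[i] is the modelled allele frequency at the place and time of sample i;
    samples[i] is ("diploid", (gl0, gl1, gl2)) or ("pseudohaploid", allele).
    """
    total = 0.0
    for p, (kind, data) in zip(freqs, samples):
        if kind == "diploid":
            total += diploid_log_lik(p, data)
        else:
            total += pseudohaploid_log_lik(p, data)
    return total
```

In this sketch, a confidently called genotype is simply the special case where one of the three genotype likelihoods is 1 and the others are 0.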

We explored the performance of two different spread models, which are extensions of the original model by Novembre *et al*. (2005), hereafter called model A. This is a diffusion model containing a selection coefficient $s$ (determining the rate of local allele frequency growth) and a single diffusion term ($\sigma$). A more general diffusion model, hereafter model B, allows for two distinct diffusion parameters for latitudinal ($\sigma_y$) and longitudinal ($\sigma_x$) spread. Finally, model C is even more general and includes two advection terms ($v_x$ and $v_y$), allowing the center of mass of the allele’s frequency to diverge from its origin over time. The incorporation of advection is meant to account for the fact that population displacements and expansions could have led to allele frequency dynamics that are poorly explained by diffusion alone.

In order to establish a starting time point for our diffusion process, we used previously published allele age estimates obtained from a non-parametric approach leveraging the patterns of haplotype concordance and discordance around the mutation of interest (Albers & McVean, 2020). In the case of the allele in the *LCT/MCM6* region, we also used age estimates based on an approximate Bayesian computation approach (Itan *et al*., 2009).
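To make the relationship between models A, B and C concrete, the sketch below advances an allele frequency grid by one explicit finite-difference time step. It is an illustration only, not the solver used in this study. It assumes a reaction-diffusion-advection form with logistic growth `s * p * (1 - p)`, diffusion terms `0.5 * sigma**2`, the standard transport sign convention (a negative `v_x` moves frequency towards lower `x`, i.e. westward), a square grid spacing `dx`, and fixed boundary cells. All names are hypothetical.

```python
def step(p, dt, dx, s, sigma_x, sigma_y, v_x=0.0, v_y=0.0):
    """One explicit Euler step on a grid p[i][j] (i indexes y, j indexes x).

    Model A: sigma_x == sigma_y and v_x == v_y == 0.
    Model B: sigma_x != sigma_y, no advection.
    Model C: model B plus non-zero advection (v_x, v_y).
    Assumption: dt * sigma**2 / dx**2 is kept well below 1 for numerical stability.
    """
    ny, nx = len(p), len(p[0])
    new = [row[:] for row in p]  # boundary cells are left unchanged
    for i in range(1, ny - 1):
        for j in range(1, nx - 1):
            # Central differences for second and first spatial derivatives.
            d2x = (p[i][j + 1] - 2 * p[i][j] + p[i][j - 1]) / dx ** 2
            d2y = (p[i + 1][j] - 2 * p[i][j] + p[i - 1][j]) / dx ** 2
            d1x = (p[i][j + 1] - p[i][j - 1]) / (2 * dx)
            d1y = (p[i + 1][j] - p[i - 1][j]) / (2 * dx)
            growth = s * p[i][j] * (1 - p[i][j])  # local logistic increase
            dp = (0.5 * sigma_x ** 2 * d2x + 0.5 * sigma_y ** 2 * d2y
                  - v_x * d1x - v_y * d1y + growth)
            # Frequencies are clamped to the valid range [0, 1].
            new[i][j] = min(1.0, max(0.0, p[i][j] + dt * dp))
    return new


# Usage: a single focal cell at frequency 0.1 on an otherwise empty 5 x 5 grid.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 0.1
model_a = step(grid, dt=0.1, dx=1.0, s=0.1, sigma_x=1.0, sigma_y=1.0)
model_c = step(grid, dt=0.1, dx=1.0, s=0.1, sigma_x=1.0, sigma_y=1.0, v_x=-1.0)
```

Under these assumptions, the pure diffusion step spreads the allele symmetrically around the focal cell, whereas the step with negative `v_x` places more frequency in the western neighbour than in the eastern one.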

### Performance on deterministic simulations

To characterize the accuracy of our inference method under different parameter choices we first generated deterministic simulations from several types of diffusion models. First, we produced an allele frequency surface map with a specified set of parameters from which we drew 1,040 samples matching the ages, locations and genotype calling format (diploid vs. pseudo-haploid) of the 1,040 genomes that we analyze below when studying the rs1042602(A) allele.
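The sampling step described above can be sketched as follows. This is a hedged illustration rather than the code used for the simulations: the function name is hypothetical, and it assumes that a diploid genotype is two independent draws at the local frequency and that a pseudohaploid genotype is a single draw.

```python
import random


def draw_genotype(p, pseudohaploid, rng):
    """Draw the number of derived alleles for one simulated sample.

    p is the modelled allele frequency at the sample's location and age.
    Pseudohaploid samples yield 0 or 1; diploid samples yield 0, 1 or 2.
    """
    n_draws = 1 if pseudohaploid else 2
    return sum(1 for _ in range(n_draws) if rng.random() < p)


# Usage with a fixed seed so that a simulated panel is reproducible.
rng = random.Random(42)
panel = [draw_genotype(0.3, pseudohaploid=(i % 2 == 0), rng=rng) for i in range(10)]
```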

We generated six different simulations with different diffusion coefficients and afterwards ran our method assuming model B. The results (simulations B1-B6) are summarised in figures 1, S1, S2, S3, S4, S5 and table S1. Overall, the model is more accurate at correctly inferring the parameters for the time period before 5,000 years BP (figure 1b), with decreased performance when longitudinal diffusion is high (figure S5).

Next, we investigated the performance of model C, which includes advection coefficients. We generated four different simulations including advection (simulations C1-C4: Figure 2, supplementary figures S6, S7, S8 and table S2). We found that our method is generally able to estimate the selection coefficient accurately. However, in some of the simulations, we found discrepancies between the estimated and true diffusion and advection coefficients, often occurring because of a misestimated origin forcing the other parameters to adjust in order to better fit the allele frequency distribution in later stages of the allele’s spread (Figure 2). Despite the disparities between the true and inferred parameter values, the resulting surface plots become very similar as we approach the present, suggesting that different combinations of parameters can produce similar present-day allele frequency distributions.

### Spatially-explicit forward simulations

In addition to drawing simulated samples from a diffusion model, we performed spatially explicit individual-based forward-in-time simulations of selection acting on a beneficial allele using a new simulation framework implemented in the R package *slendr* (Petr, 2021). This package makes it possible to define spatiotemporal population models in R and then feeds them into the forward population genetic simulator SLiM (Haller & Messer, 2019) for generating genotype data.

We introduced a single beneficial additive mutation in a single individual and let it evolve across the European landscape. Before applying our method to the simulated data, we sampled 1,040 individuals whose ages were log-uniformly distributed, to ensure that there were more samples closer to the present, as in the real data. We transformed the diploid genotypes to pseudohaploid genotypes by assigning a heterozygous individual an equal probability of carrying the ancestral or the derived genotype. The parameter values estimated by our model for the simulations described in this section are summarised in table S3. We can see that the origin of the allele inferred by the model closely corresponds to the first observation of the derived allele in the simulation (figure 3). The inferred selection coefficient is only slightly higher than the true value from the simulation (0.0366 vs. 0.030). In general, the model accurately captures the spread of the allele centered in central Europe, though we observe some discrepancies due to differences between the model assumed in the simulation (which, for example, accounts for local clustering of individuals, figure S9), and that assumed by our diffusion-based inference.
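The two sampling choices described above can be sketched in a few lines. This is an illustrative sketch, not the simulation pipeline itself: the function names and the age bounds used in the example are assumptions, and the minimum age must be greater than zero for the logarithm to be defined.

```python
import math
import random


def sample_log_uniform_ages(n, min_age, max_age, rng):
    """Draw n sample ages whose logarithm is uniform on [log(min_age), log(max_age)].

    This concentrates samples towards the present (small ages).
    """
    lo, hi = math.log(min_age), math.log(max_age)
    return [math.exp(rng.uniform(lo, hi)) for _ in range(n)]


def to_pseudohaploid(genotype, rng):
    """Convert a diploid genotype (0, 1 or 2 derived alleles) to a single allele.

    Homozygotes keep their allele; heterozygotes yield ancestral (0) or
    derived (1) with equal probability.
    """
    if genotype == 0:
        return 0
    if genotype == 2:
        return 1
    return rng.randint(0, 1)


# Usage: 1,040 ages between 1 and 10,000 years (hypothetical bounds).
rng = random.Random(7)
ages = sample_log_uniform_ages(1040, 1.0, 10000.0, rng)
calls = [to_pseudohaploid(g, rng) for g in (0, 1, 2, 1)]
```

With these hypothetical bounds, most of the sampled ages fall in the younger half of the time range, which mimics the denser sampling of recent periods.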

### Dynamics of the rs4988235(T) allele

Having tested the performance of our method on simulated data, we set out to infer the allele frequency dynamics of the rs4988235(T) allele (associated with adult lactase persistence) in ancient Western Eurasia. For our analysis, we used a genotype dataset compiled by Ségurel *et al*. (2020), which amounts to 1,434 genotypes from ancient Eurasian individuals, and a set of 36,659 genotypes from present-day Western and Central Eurasian genomes (Ségurel & Bon, 2017; Heyer *et al*., 2011; Marchi *et al*., 2018; Liebert *et al*., 2017; Gallego Romero *et al*., 2012; Itan *et al*., 2010; Charati *et al*., 2019). After filtering out individuals falling outside of the range of the geographic boundaries considered in this study, we retained 1,332 ancient individuals. The locations of ancient and present-day individuals used in the analysis to trace the spread of rs4988235(T) are shown in figure 4.

We used a two-period scheme by allowing the model to have two sets of estimates for the selection coefficient and the diffusion and advection coefficients in two different periods of time: before and after 5,000 years ago, reflecting the change in population dynamics and mobility before and after the Bronze Age transition (Loog *et al*., 2017; Racimo *et al*., 2020a). We used two allele age estimates as input: a relatively young one (7,441 years ago) obtained from Itan *et al*. (2009), and a relatively old one (20,106 years ago) obtained from Albers & McVean (2020). The results obtained for fitting the model on rs4988235(T) are summarised in tables S4 and S5, and in figures 5b (younger age) and S12 (older age).

Assuming the age estimate from Itan *et al*. (2009), the origin of the allele is estimated to be north of the Caucasus, around what is now southwestern Russia and eastern Ukraine (Figure 5b). Given that this age is relatively young, our method fits a very strong selection coefficient (approximately 0.1) during the first period in order to accommodate the early presence of the allele in various points throughout Eastern Europe, and a weaker (but still strong) selection coefficient (approximately 0.03) in the second period. We also estimate stronger diffusion in the second period than in the first, to accommodate the rapid expansion of the allele throughout Western Europe, and a net westward advection parameter, indicating movement of the allele frequency’s center of mass to the west as we approach the present.

Assuming the older age estimate from Albers & McVean (2020), the origin of the allele is estimated to be in the northeast of Europe (figure S12), which is at a much higher latitude than the first occurrence of the allele, in Ukraine. A comparison of the parameters related to the allele expansion inferred for the two time periods shows that the allele initially expands at a much higher rate in the latitudinal direction relative to the longitudinal direction (table S5). This difference greatly decreases in the second time period. The model appears to restrict the expansion of the allele in the region with a lower density of available aDNA data and thus avoids an overlap of the increasing allele frequencies with individuals who do not carry the derived rs4988235(T) allele (see figure 5a). The rapid expansion in the southern direction allows the model to eventually reach the sample carrying the derived variant in Ukraine. As the rs4988235(T) allele becomes more widely distributed after 5,000 years BP, the longitudinal diffusion and advection parameters in the second period are higher than in the first.

### Dynamics of the rs1042602(A) allele

Next, we investigated the spatiotemporal dynamics of the spread of an allele at a pigmentation-associated SNP in the *TYR* locus (rs1042602(A)), which has been reported to be under recent selection in Western Eurasian history (Stern *et al*., 2019). For this purpose, we applied our method to the Allen Ancient DNA Resource data (AADR: https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data), which contains randomly sampled pseudohaploid genotypes from 1,513 published ancient Eurasian genomes (listed in Supplementary Text 1), from which we extracted those genomes that had genotype information at this locus in Western Eurasia. We merged this dataset with diploid genotype information from high-coverage present-day West Eurasian genomes from the Human Genome Diversity Panel (HGDP) (Bergström *et al*., 2020), which resulted in a total of 1,040 individuals with genotype information at rs1042602, which were used as input to our analysis. Geographic locations of individuals in the final dataset are shown in Figure 6.

Similarly to our analysis of the spread of the rs4988235(T) allele, we inferred the dynamics of the rs1042602(A) allele separately for the time periods before and after 5,000 years BP. The inferred parameters for both time periods are summarised in table S6 and the allele frequency surface maps generated using these parameters are shown in figure 7b. The inferred origin of the rs1042602(A) allele corresponds closely to the region where the allele initially starts to segregate in the time period between 7,500 and 10,000 years BP, as seen in figure 7a. Estimates of the selection coefficient for both time periods (0.0221 and 0.0102 for the period before and after 5,000 years BP, respectively) suggest that selection acting on the allele decreased after 5,000 years BP.

### Robustness of parameters to the inferred geographic origin of allele

We carried out an analysis to characterize how sensitive the selection, diffusion and advection parameters are to changes in the assumed geographic origin of the allele. For the rs4988235(T) allele, we forced the origin of the allele to be 10 degrees away from our inferred origin in each cardinal direction, while assuming the allele age from Itan *et al*. (2009) (table S7). In figures S13, S14, S15 and S16, we can see the allele frequency dynamics of these four scenarios, respectively. We also forced the allele origin to be at the geographic origin estimated in Itan *et al*. (2009) (table S8, figure S17), which is westward of our estimate. In all five cases, during the period prior to 5,000 years BP, the allele is inferred to expand in the direction of the first sample observed to carry the rs4988235(T) allele, which is located in Ukraine. During the time period after 5,000 years BP, the patterns produced by the model are rather similar, although the parameters associated with diffusion and advection differ, in order to account for the different starting conditions.

We also investigated how the results are affected when the estimated geographic origin of the rs1042602(A) allele is moved with respect to the initial estimate. We set the allele origin to be 10 degrees east, 10 degrees north and 10 degrees south of the original estimate, as shown in figures S18, S19 and S20, respectively. We did not look at a scenario in which the origin of the allele is moved to the west, since it would either end up in the Black Sea or farther west than 10 degrees. The selection coefficient remains similar to the original estimate throughout all three scenarios. The way the allele spreads across the landscape is also similar in all cases and, as in the case of rs4988235(T), the model accounts for the different origins of the allele by adjusting the diffusion and advection coefficients in the time period after 5,000 years BP.

### Robustness of parameters to the assumed age of the allele

In order to investigate how sensitive our inferences are to the point estimates of allele ages we obtained from the literature (Albers & McVean, 2020; Itan *et al*., 2009), we also fitted our model using the upper and lower ends of the 95% confidence intervals or credible intervals for each age estimate (depending on whether the inference procedure in the literature was via a maximum likelihood or a Bayesian approach). For the rs4988235(T) allele, the reported 95% credible interval for the Itan *et al*. (2009) age spans 8,683 to 6,256 years BP. For the rs1042602(A) allele, the reported 95% confidence interval for the age spans 27,315 to 25,424 years BP (Albers & McVean, 2020).

When re-fitting the model for the rs4988235(T) allele, we found that the inferred selection coefficient is slightly lower when the allele age is assumed to be at the lower bound of the 95% credible interval and slightly higher when assumed to be at the upper bound (table S4 and figures S21 and S22). This occurs because the selection intensity must be higher or lower when there is more or less time, respectively, for the allele to reach the allele frequencies observed in the data. In the case of the rs1042602(A) allele, this only affects the earlier time period (table S6). The rs4988235(T) allele’s geographic distribution in the more recent time periods is also less extended geographically when the age is assumed to be young. The inferred geographic origin of both alleles slightly differs under different assumed ages (figures S23 and S24).

## Discussion

A spatially explicit framework for allele frequency diffusion can provide new insights into the dynamics of selected variants across a landscape. We have shown that under the conditions of strong, recent selection, our method can infer selection and dispersal parameters, using a combination of ancient and present-day human genomic data. However, when allowing for advection, the inferred location tends to become less accurate. This suggests that migration events early in the dispersal of the selected allele could create difficulties in finding the true allele origin if net directional movement (i.e. via major migratory processes) had a large effect in this dispersal. This issue could be alleviated with the inclusion of more ancient genomes around the time of the mutational origin, perhaps in combination with a more fine-scaled division into periods where advection may have occurred in different directions.

The inferred geographic origin of the rs4988235(T) allele reflects the best guess of our framework given the constraints provided by its input, namely the previously inferred age of the allele and the observed instances of this allele throughout Western Eurasia. We are also assuming that the allele must have arisen somewhere within the bounding box of our studied map. When assuming a relatively young allele age (7,441 years ago, Itan *et al*. (2009)), the origin of the allele is placed north of the Caucasus, perhaps among steppe populations that inhabited the area at this time (Haak *et al*., 2015; Allentoft *et al*., 2015). This origin is further east than the geographic origin estimate from Itan *et al*. (2009), likely reflecting additional ancient DNA information that is available to us, and indicates an early presence of the allele in eastern Europe. When assuming a relatively old allele age (20,106 years ago, Albers & McVean (2020)), the origin is placed in northeast Europe, perhaps among Eastern hunter-gatherer groups that inhabited the region in the early Holocene. We note that the number of available genomes for eastern and northeastern Europe during the early Holocene is scarce, so our confidence in the exact location of this origin is necessarily low. Regardless of the assumed age, we estimate a net westward displacement of the allele frequency’s center of mass, and a rapid diffusion, particularly in the period after 5,000 years ago.

Various studies have estimated the selection coefficient for the rs4988235(T) allele, and these range from as low as 0.014 to as high as 0.19 (Enattah *et al*., 2008; Mathieson & Mathieson, 2018; Mathieson, 2020; Stern *et al*., 2019; Burger *et al*., 2020; Peter *et al*., 2012; Gerbault *et al*., 2009; Itan *et al*., 2009; Bersaglieri *et al*., 2004). Recent papers incorporating ancient DNA estimate the selection coefficient to be as low as 0 (in certain regions of Southern Europe) and as high as 0.06 (Mathieson & Mathieson, 2018; Mathieson, 2020; Burger *et al*., 2020). It is also likely that the selection coefficient was different for different regions of Europe, perhaps due to varying cultural practices (Mathieson, 2020). In our case, the estimated selection coefficient during the first period - before 5,000 years ago - depends strongly on the assumed allele age (s = 0.0993 vs. s = 0.0285). As in the case of the geographic origin, these estimates should be taken with caution as the number of available allele observations in the early Holocene is fairly low. The estimates for the second period - after 5,000 years ago - are more robust to the assumed age: s = 0.0328 (95% CI: 0.0327–0.0329) if we assume the younger allele age (7,441 years ago) and s = 0.0255 (95% CI: 0.0252–0.0258) if we assume the older allele age (20,106 years ago). These estimates are also within the range of previous estimates.

In the case of the rs1042602(A) allele, our estimated selection coefficients of 0.0221 (95% CI: 0.0216–0.0227) and 0.0102 (95% CI: 0.0083–0.0120) for the time periods before and after 5,000 years BP, respectively, are generally in agreement with previous results. Wilde *et al*. (2014) used a forward simulation approach to infer a point estimate of 0.026. Another study using an approximate Bayesian computation framework (Nakagome *et al*., 2019) estimated the strength of selection acting on rs1042602 to be 0.013 (0.002–0.029). Although both studies utilized ancient DNA data, the estimates were obtained without explicitly modelling the spatial dimension of the selection process.

Our estimates of the longitudinal advection parameter are negative for both the SNPs in the *TYR* and *LCT* loci: the mutation origins are always to the east of the center of mass of the allele frequency distribution seen in present-day data. This perhaps reflects common migratory processes, like the large-scale Neolithic and Bronze Age population movements from east to west, affecting the allele frequencies at these loci across the Eurasian landscape (Allentoft *et al*., 2015; Haak *et al*., 2015). As a form of regularization, we kept the range of explored values for the advection parameters to be small (−2.5 to 2.5 km per generation), while allowing the diffusion parameters to be explored over a much wider range of values. In certain cases, like the second period of the rs4988235(T) spread when the allele age is assumed to be young (table S4), we find that the advection parameters are fitted at the boundary of the explored range, because the allele needs to spread very fast across the landscape to fit the data. A future improvement to our method could include other forms of regularization that better account for the joint behavior of the advection and diffusion processes, or the use of priors for these parameters under a Bayesian setting, which could be informed by realistic assumptions about the movement of individuals on a landscape.

When investigating the robustness of the geographic origin of both rs4988235(T) and rs1042602(A), we found that parameters related to the beneficial allele’s expansion change in response to different assumed origins of the allele. The resulting allele frequency surface plots, however, appear very similar throughout the later stages of the process, showing that the model tends to adjust the diffusion and advection coefficients in a way such that the allele will end up expanding into the same areas regardless of the origin.

As we apply these methods to longer time scales and broader geographic areas, the assumptions of spatiotemporal homogeneity of the parameters seem less plausible. There may be cases where the allele may have been distributed over a wide geographic area but remained at low frequencies for an extended period of time, complicating the attempts to pinpoint the allele’s origin. In our study, we estimated diffusion and selection coefficients separately for two time periods, before and after 5,000 years ago, to account for changes in mobility during the Bronze Age transition, but this approach may still be hindered by uneven sampling, especially when the allele in question exists at very low frequencies. Notably, our results for the spread of the rs4988235(T) allele during the older time period should be interpreted with caution, since they may be affected by sparse sampling in the early Holocene.

Potential future extensions of our method could incorporate geographic features and historical migration events that create spatially or temporally varying moderators of gene flow. An example of this type of process is the retreat of glaciers after the Last Glacial Maximum, which allowed migration of humans into Scandinavia (Günther *et al*., 2018). These changing geographic features could lead to changes in the rate of advection or diffusion across time or space. They could also serve to put more environmentally-aware constraints on the geographic origin of the allele, given that it cannot have existed in regions uninhabitable by humans, and to extend our analyses beyond the narrow confines of the Western Eurasian map chosen for this study. One could also envision incorporating variation in population densities over time, or known migration processes in the time frames and regions of interest. These might have facilitated rapid, long-range dispersal of beneficial alleles (Bradburd *et al*., 2016; Hallatschek & Fisher, 2014) or caused allelic surfing on the wave of range expansions (Klopfstein *et al*., 2006). Additional information like this could come, for example, from previously inferred spatiotemporal demographic processes (e.g. Racimo *et al*. (2020b)).

As described above, our model only accounts for diffusion in two directions. Further extension of our model could therefore incorporate anisotropic diffusion (Othmer *et al*., 1988; Painter & Hillen, 2018). Another possibility could be the introduction of stochastic process components, in order to convert the partial differential equations into stochastic differential equations (Brown *et al*., 2000). Stochastic components could serve to induce spatial autocorrelation and capture local patterns of allele frequency covariance in space that might not be well modeled by the deterministic PDEs (Cressie & Wikle, 2015). They could also serve to induce stochasticity in allele frequency changes over time as a consequence of genetic drift (Crow *et al*., 1970), allowing one to model the dynamics of more weakly selected variants, where drift plays an important role. Eventually, one could perhaps combine information across loci to jointly model the spatiotemporal frequency surfaces at multiple loci associated with the same trait. This could help clarify the dynamics of polygenic adaptation and negative selection on complex traits (Irving-Pease *et al*., 2021), and perhaps hindcast the genetic value of traits across a landscape.

The availability of hundreds of ancient genomes (Marciniak & Perry, 2017) and the increasing interest in spatiotemporal method development (Bradburd & Ralph, 2019), such as the one described in this manuscript, will likely lead researchers to posit new questions and hypotheses about the behavior of natural selection. In the case of a beneficial allele spreading on a landscape, new ontologies and vocabulary for describing positive selection in time and space will be needed. Abundant terms exist to classify the initial conditions and dynamics of a selective sweep in a single population (hard sweep, multiple origin soft sweep, single origin soft sweep, partial sweep) (Hermisson & Pennings, 2005; Pritchard & Di Rienzo, 2010; Hermisson & Pennings, 2017). In contrast, there is a lack of vocabulary for distinguishing a scenario of strong selection that is locally constrained in space from a scenario of widespread selection extended over a landscape, or from a model of neutral diffusion in space followed by parallel non-neutral increases in frequency at multiple locations. For example, Ralph & Coop (2010) showed how multiple localized hard sweeps may be seen as a soft sweep at a larger population-wide scale. Existing vocabulary for spatiotemporal genetic processes is clearly not enough, limiting the types of questions or hypotheses we can pose about them.

Population genetic models that explicitly account for space and time are an important area of future methodological development (Bradburd & Ralph, 2019). We believe that methods such as the one described in this study show great promise at broadening the horizon of our understanding of natural selection across space and time in humans and other species. As in the case of demographic reconstruction (Ray & Excoffier, 2009), spatiotemporal information can greatly help improve our knowledge of how natural selection operated in the past.

## Methods

### The model

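To describe the allele frequency dynamics in time and space, we first begin by using a deterministic model based on a two-dimensional partial differential equation (PDE) (Fisher, 1937; Kolmogorov *et al*., 1937; Novembre *et al*., 2005). This PDE represents the distribution *p*(*x, y, t*) of the allele frequency across a two dimensional (*x, y*) landscape at time *t*:

$$\frac{\partial p}{\partial t} = \frac{1}{2}\sigma^{2}\left(\frac{\partial^{2} p}{\partial x^{2}} + \frac{\partial^{2} p}{\partial y^{2}}\right) + \gamma(p; s, d) \tag{1}$$

where *γ*(*p*; *s*, *d*) is the reaction term describing the local change in allele frequency due to selection.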

Here, *σ* is the diffusion coefficient, *s* is the selection coefficient, and *d* is the dominance coefficient (Novembre *et al*., 2005). We assumed an additive model and fixed *d* = 2*s* in all analyses below. We call this “model A”, but we also evaluated the fit of our data under more complex models which are more flexible, and are described below.

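Model B is a more general diffusion-reaction model, which incorporates distinct diffusion terms in the x and y axes (*σ*_{x} and *σ*_{y}, respectively):

$$\frac{\partial p}{\partial t} = \frac{1}{2}\sigma_{x}^{2}\frac{\partial^{2} p}{\partial x^{2}} + \frac{1}{2}\sigma_{y}^{2}\frac{\partial^{2} p}{\partial y^{2}} + \gamma(p; s, d)$$

Model C is a generalization of model B that incorporates advection terms in the x and y directions (see e.g. Cantrell & Cosner (2004) for a motivation of this type of model in the context of spatial ecology):

$$\frac{\partial p}{\partial t} = \frac{1}{2}\sigma_{x}^{2}\frac{\partial^{2} p}{\partial x^{2}} + \frac{1}{2}\sigma_{y}^{2}\frac{\partial^{2} p}{\partial y^{2}} + v_{x}\frac{\partial p}{\partial x} + v_{y}\frac{\partial p}{\partial y} + \gamma(p; s, d) \tag{4}$$

Here, *v*_{x} and *v*_{y} represent the coefficients for advective velocity along the *x* and *y* axes respectively.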

In the Appendix, we motivate the construction of these equations using model C as an example, and show that equation 4 can be obtained by taking an infinitesimal limit of a random walk on a two-dimensional lattice, after including a reaction term due to selection. Models A and B are then shown to be special cases of model C.
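To make the structure of model C concrete, the following is a minimal sketch of one explicit finite-difference time step on a rectangular grid. It is not the solver used in this study (which relies on R libraries). It assumes reflecting (zero-flux) boundaries, a forward Euler update, and an illustrative additive reaction term *γ* = *s p*(1 − *p*). The function names `gamma` and `step_model_c` are hypothetical.

```python
def gamma(p, s):
    # Assumed additive reaction term, used for illustration only.
    return s * p * (1.0 - p)


def step_model_c(p, dt, dx, dy, sigma_x, sigma_y, v_x, v_y, s):
    """Advance the grid p[i][j] (i indexes y, j indexes x) by one Euler step."""
    ny, nx = len(p), len(p[0])
    new = [[0.0] * nx for _ in range(ny)]
    for i in range(ny):
        for j in range(nx):
            c = p[i][j]
            # Zero-flux boundaries: a missing neighbour mirrors the cell itself.
            left = p[i][j - 1] if j > 0 else c
            right = p[i][j + 1] if j < nx - 1 else c
            down = p[i - 1][j] if i > 0 else c
            up = p[i + 1][j] if i < ny - 1 else c
            # Diffusion terms: (1/2) sigma^2 times the second differences.
            diffusion = (0.5 * sigma_x ** 2 * (left - 2.0 * c + right) / dx ** 2
                         + 0.5 * sigma_y ** 2 * (down - 2.0 * c + up) / dy ** 2)
            # Advection terms: v times the centred first differences.
            advection = (v_x * (right - left) / (2.0 * dx)
                         + v_y * (up - down) / (2.0 * dy))
            value = c + dt * (diffusion + advection + gamma(c, s))
            # Keep the result a valid frequency.
            new[i][j] = min(1.0, max(0.0, value))
    return new


# Usage: a single cell at frequency 0.5 spreading on a 5 x 5 grid.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 0.5
grid = step_model_c(grid, dt=1.0, dx=1.0, dy=1.0,
                    sigma_x=0.5, sigma_y=0.5, v_x=0.0, v_y=0.0, s=0.0)
```

Setting `v_x = v_y = 0` gives model B, and additionally setting `sigma_x = sigma_y` gives model A. An explicit scheme of this kind is only stable for small enough time steps, which is one reason a dedicated PDE solver is preferable in practice.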

For evaluating the likelihood of the observed data, we use a binomial genotype sampling model. Let *g*_{i} ∈ {0, 1, 2} be the genotype of individual *i* at the locus of interest, let *a*_{i} be the number of reads carrying ancestral alleles, and let *d*_{i} be the number of reads carrying derived alleles. Let (*x*_{i}, *y*_{i}) be the coordinates of the location from which individual *i* was sampled, and *t*_{i} its estimated age (e.g. from radiocarbon dating). Then, the likelihood for individual *i* can be computed as follows:

$$L_{i} = \sum_{h=0}^{2} P[d_{i}, a_{i} \mid g_{i} = h] \; P[g_{i} = h \mid p(x_{i}, y_{i}, t_{i})] \tag{5}$$

Here, *p*(*x*_{i}, *y*_{i}, *t*_{i}) is the solution to one of the partial differential equations described above (model A, B or C, depending on the process model chosen), evaluated at location (*x*_{i}, *y*_{i}) and time *t*_{i}. In turn, *P*[*d*_{i}, *a*_{i} | *g*_{i} = *h*] is the likelihood of genotype *h* for individual *i*. Furthermore, *P*[*g*_{i} = *h* | *p*(*x*_{i}, *y*_{i}, *t*_{i})] is a binomial distribution, where *n* represents the ploidy level, which in this case is 2:

$$P[g_{i} = h \mid p(x_{i}, y_{i}, t_{i})] = \binom{n}{h} \, p(x_{i}, y_{i}, t_{i})^{h} \, \bigl(1 - p(x_{i}, y_{i}, t_{i})\bigr)^{n-h}$$

Then, the likelihood of the entire data can be computed as

$$L(\mathbf{d}, \mathbf{a}) = \prod_{i=1}^{M} L_{i}$$

where *M* is the total number of individuals for which we have data, **d** is the vector containing the derived read count for each individual and **a** is the vector containing the ancestral read count for each individual. We computed genotype likelihoods directly on the BAM file read data, using the SAMtools genotype model (Li, 2011) implemented in the software ANGSD (Korneliussen *et al*., 2014).

When only randomly sampled pseudohaploid allele counts are available, we used a Bernoulli sampling likelihood (conditional on the genotype *g*_{i}) in place of the genotype likelihood *P*[*d*_{i}, *a*_{i} | *g*_{i} = *h*] in equation 5. Briefly, assume that the probability of an individual having genotype *g* at a particular locus given the underlying allele frequency *p* follows a binomial distribution, and that the probability of sampling a read given the genotype of an individual follows a Bernoulli distribution with probability of success ½*g*. Then, after summing over genotypes, the probability of sampling a read given the allele frequency follows a Bernoulli distribution with probability of success *p*.
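The likelihood computation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used in the study: genotype likelihoods are assumed to be supplied as three numbers per individual (for 0, 1 and 2 derived alleles), and the function names and the `freq_at` callback, which returns *p*(*x*, *y*, *t*), are hypothetical.

```python
import math


def genotype_prior(h, p, n=2):
    """Binomial probability of carrying h derived alleles given frequency p."""
    return math.comb(n, h) * p ** h * (1.0 - p) ** (n - h)


def individual_likelihood(geno_liks, p):
    """Sum over h of P[reads | g = h] * P[g = h | p] for one individual."""
    return sum(geno_liks[h] * genotype_prior(h, p) for h in range(3))


def pseudohaploid_likelihood(derived, p):
    """Likelihood of one randomly sampled allele (derived=True or False)."""
    # Given genotype h, a derived allele is sampled with probability h / 2.
    liks = [(h / 2.0) if derived else (1.0 - h / 2.0) for h in range(3)]
    return individual_likelihood(liks, p)


def log_likelihood(samples, freq_at):
    """Sum of log-likelihoods over (geno_liks, x, y, t) records."""
    return sum(math.log(individual_likelihood(gl, freq_at(x, y, t)))
               for gl, x, y, t in samples)
```

Summing the pseudohaploid likelihood over genotypes recovers a Bernoulli probability equal to the allele frequency itself, so `pseudohaploid_likelihood(True, p)` equals `p` up to floating-point error.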

### Map

We restricted the geographic area explored by our model fit to be between 30°N and 75°N, and between 10°W and 80°E. For numerical calculations, we used a grid constructed using a resolution of approximately 1 grid cell per degree of latitude and longitude. We used Haversine functions to transform the distance between two geographic points from degrees to kilometers. The diffusion of the allele frequency was disallowed in the map regions where the topography is negative (i.e. regions under water), based on ETOPO5 data (NOAA, 1988). For this reason we added land bridges between the European mainland and Sardinia, and between the mainland and Great Britain, in order to allow the allele to diffuse in these regions (see figure S10).
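For reference, the Haversine conversion from coordinates in degrees to a great-circle distance in kilometers can be sketched as follows. The mean Earth radius of 6371 km and the function name `haversine_km` are assumptions of this sketch, not values taken from the study.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Width of a one-degree grid cell in the east-west direction at two latitudes.
width_south = haversine_km(30.0, 0.0, 30.0, 1.0)
width_north = haversine_km(75.0, 0.0, 75.0, 1.0)
```

Because meridians converge towards the pole, a one-degree cell is considerably narrower at 75°N than at 30°N, which is why grid distances need to be converted cell by cell.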

### Parameter search

Parameter optimization was done via maximum likelihood estimation with a two-layer optimization set-up. The first layer consists of a simulated annealing approach (Bélisle, 1992) starting from 50 random points in the parameter space. The initial 50 points are sampled using Latin hypercube sampling to ensure an even spread across the parameter space. The output of this fit was then fed to the L-BFGS-B algorithm (Byrd *et al*., 1995) to refine the parameter estimates around the obtained maximum and obtain confidence intervals for the selection, diffusion and advection parameters.

The parameters optimised were:

- the selection coefficient (*s*), restricted to the range 0.001-0.1
- two dispersal parameters *σ*_{x} and *σ*_{y} in the longitudinal and latitudinal directions respectively, restricted to the range of 1-100 square-kilometers per generation
- the longitudinal and latitudinal advection coefficients *v*_{x} and *v*_{y} respectively. As a form of regularization, we set the range of explored values to be narrowly centered around zero: −2.5 to 2.5 kilometers per generation
- the geographic origin of the allele, which is randomly initialized to be any of the 28 spatial points shown in Figure S11 at the start of the optimization process

The latitude and longitude are discretized in our model in order to solve the differential equations numerically, thus the origin of a mutation is measured in terms of discrete units. For this reason, when using the L-BFGS-B algorithm, we fixed the previously estimated origin of the allele, and did not explore it during this second optimization layer. Time was measured in generations, assuming 29 years per generation. During the optimization we scaled the time and the parameters by a factor of 10, which allowed us to decrease the execution time of the model.
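The first optimization layer can be illustrated with a short, self-contained sketch: Latin hypercube sampling of starting points followed by simulated annealing within box constraints. This is not the study's implementation. The Gaussian proposal width, the logarithmic cooling schedule, the number of iterations and all function names are assumptions chosen for illustration, and the second, gradient-based L-BFGS-B layer is omitted.

```python
import math
import random


def latin_hypercube(n, bounds, rng):
    """Draw n points; each parameter range is cut into n strata, each used once."""
    columns = []
    for lo, hi in bounds:
        strata = [(k + rng.random()) / n for k in range(n)]
        rng.shuffle(strata)
        columns.append([lo + u * (hi - lo) for u in strata])
    return [list(point) for point in zip(*columns)]


def anneal(neg_log_lik, start, bounds, rng, n_iter=2000, temp0=1.0):
    """Simulated annealing with Gaussian proposals clipped to the bounds."""
    cur, cur_val = list(start), neg_log_lik(start)
    best, best_val = list(cur), cur_val
    for k in range(n_iter):
        temp = temp0 / math.log(k + math.e)  # assumed logarithmic cooling
        prop = [min(hi, max(lo, x + rng.gauss(0.0, 0.1 * (hi - lo))))
                for x, (lo, hi) in zip(cur, bounds)]
        val = neg_log_lik(prop)
        # Always accept improvements; accept worse moves with Metropolis odds.
        if val < cur_val or rng.random() < math.exp((cur_val - val) / temp):
            cur, cur_val = prop, val
            if val < best_val:
                best, best_val = list(prop), val
    return best, best_val


def fit(neg_log_lik, bounds, n_starts=50, seed=1):
    """Run annealing from each Latin hypercube start and keep the best run."""
    rng = random.Random(seed)
    runs = [anneal(neg_log_lik, start, bounds, rng)
            for start in latin_hypercube(n_starts, bounds, rng)]
    return min(runs, key=lambda run: run[1])
```

In a full analysis, `neg_log_lik` would solve the PDE for the proposed parameters and return the negative log-likelihood of the data, and the best point would then be handed to a bounded quasi-Newton optimizer with the origin held fixed.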

We initialized the grid by setting the initial allele frequency to be *p*_{0} in the grid cell where the allele originates and 0 elsewhere. *p*_{0} was calculated as 1/(2*DA*), where *D* is the population density and is equal to 2.5 inhabitants per square-kilometer, which is the estimated population density in Europe in 1000 B.C. (Colin McEvedy, 1978; Novembre *et al*., 2005). In the equation, *D* is multiplied by 2 because we assume that the allele originated in a single chromosome in a diploid individual. *A* is the area in square-kilometers of the grid cell where the allele emerged.
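As a worked illustration of this initialization, the sketch below computes *p*_{0} and seeds a grid with it. The cell area of 10,000 square-kilometers and the grid dimensions are hypothetical round numbers, and the function names are not taken from the published code.

```python
def initial_frequency(density_per_km2, cell_area_km2):
    """Frequency of one new allele copy among 2 * D * A chromosomes."""
    return 1.0 / (2.0 * density_per_km2 * cell_area_km2)


def seed_grid(n_rows, n_cols, origin, p0):
    """Grid of zeros with frequency p0 in the cell where the allele arises."""
    grid = [[0.0] * n_cols for _ in range(n_rows)]
    row, col = origin
    grid[row][col] = p0
    return grid


# D = 2.5 inhabitants per km^2 and a hypothetical cell of 10,000 km^2.
p0 = initial_frequency(2.5, 10000.0)
grid = seed_grid(45, 90, origin=(20, 30), p0=p0)
```

With these example values the cell holds 25,000 individuals, or 50,000 chromosomes, so a single new copy corresponds to a starting frequency of 1/50,000.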

Asymptotic 95% confidence intervals for a given parameter *θ*_{j} were calculated using the equation

$$\hat{\theta}_{j} \pm 1.96 \sqrt{\left[F(\hat{\boldsymbol{\theta}})^{-1}\right]_{jj}}$$

where *F*(**θ**) is an estimate of the observed Fisher information matrix (Fisher, 1922; Efron & Hastie, 2016; Casella & Berger, 2021).
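For a single parameter, an asymptotic interval of this kind can be sketched as follows, assuming the usual Wald form in which the standard error is the square root of the inverse observed information. The observed information is approximated here by a central finite difference of the negative log-likelihood at the maximum. The step size, the function names and the binomial toy example are assumptions for illustration. With several parameters, the scalar inverse is replaced by the corresponding diagonal element of the inverse information matrix.

```python
import math


def observed_information(neg_log_lik, theta_hat, eps=1e-4):
    """Second derivative of the negative log-likelihood at theta_hat."""
    return (neg_log_lik(theta_hat + eps)
            - 2.0 * neg_log_lik(theta_hat)
            + neg_log_lik(theta_hat - eps)) / eps ** 2


def wald_interval(neg_log_lik, theta_hat, z=1.96):
    """Asymptotic interval theta_hat +/- z * sqrt(1 / information)."""
    se = math.sqrt(1.0 / observed_information(neg_log_lik, theta_hat))
    return theta_hat - z * se, theta_hat + z * se


# Toy example: 30 derived alleles among 100 sampled chromosomes.
def binomial_nll(p, k=30, n=100):
    return -(k * math.log(p) + (n - k) * math.log(1.0 - p))


lower, upper = wald_interval(binomial_nll, 0.3)
```

In the toy example the interval agrees with the textbook binomial result, 0.3 ± 1.96 × sqrt(0.3 × 0.7 / 100).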

### Implementation

The model described above was implemented in R version 3.6. To numerically solve the differential equations and obtain maximum likelihood estimates, we used the libraries *deSolve* (Soetaert *et al*., 2010), *ReacTran* (Soetaert & Meysman, 2012) and *bbmle* (Bolker & R Development Core Team, 2020). Scripts containing the code used in this paper are available on GitHub: https://github.com/RasaMukti/stepadna

### Individual-based simulations

For the individual-based spatiotemporal forward simulations, we first defined a spatial boundary for a population spread across a broad geographic region of Europe. In order to ensure a reasonably uniform distribution of individuals across this spatial range throughout the course of the simulation, we set the maximum distance for spatial competition and mating choice between individuals to 250 km (translated, on a SLiM level, to the interaction parameter *maxDistance*), and the standard deviation of the normal distribution governing the spread of offspring from their parents at 25 km (leveraged in SLiM’s *modifyChild()* call-back function) (Haller & Messer, 2019). We note that we have chosen the values of these parameters merely to ensure a uniform spread of individuals across a simulated landscape. They are not intended to represent realistic estimates for these parameters at any time in human history.

After defining the spatial context of the simulations and ensuring the uniform spread of individuals across their population boundary, we introduced a single beneficial additive mutation in a single individual. In order to test how accurately our model can infer the parameters of interest, we simulated a scenario in which the allele appeared in Central Europe 15,000 years ago with the selection coefficient of the beneficial mutation set to 0.03. Over the course of the simulation, we recorded the location on a two-dimensional map of each individual that ever lived, as well as its genotype (i.e. zero, one, or two copies of the beneficial allele). We then used this complete information about the spatial distribution of the beneficial allele at each time point to study the accuracy of our model in inferring the parameters of interest.
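The offspring dispersal step of such a simulation can be illustrated outside SLiM with a few lines of Python. This sketch assumes a rectangular boundary in kilometers and redraws positions that fall outside it, which is a simplification. The function name, the boundary and the retry limit are hypothetical and do not reproduce the SLiM script used here.

```python
import random


def disperse(parent_xy, sd_km, bounds, rng, max_tries=100):
    """Place an offspring at a normally distributed offset from its parent."""
    (xmin, xmax), (ymin, ymax) = bounds
    for _ in range(max_tries):
        x = parent_xy[0] + rng.gauss(0.0, sd_km)
        y = parent_xy[1] + rng.gauss(0.0, sd_km)
        # Keep only positions inside the population boundary.
        if xmin <= x <= xmax and ymin <= y <= ymax:
            return (x, y)
    return parent_xy  # fall back to the parental position


rng = random.Random(42)
bounds = ((0.0, 1000.0), (0.0, 1000.0))
offspring = [disperse((10.0, 500.0), 25.0, bounds, rng) for _ in range(1000)]
```

With a standard deviation of 25 km, as in the simulations described above, a parent close to the edge of the range produces offspring that stay within the boundary because out-of-range draws are rejected.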

## Appendix

Here, we motivate the construction of model C as a large scale limit of a random walk model on a lattice (Karlin & Taylor, 1975; Cantrell & Cosner, 2004). We think of the allele frequency as a variable *p* that can increase in magnitude due to its inherent advantage (selection), spread across a landscape (diffusion) or move directionally as a consequence of migration (advection). We imagine a lattice composed of small square cells of size Δ*x* × Δ*y*, where a certain amount of allele frequency *p* can occur at a given time point *t*. At each small time step (of duration Δ*t*), inflow and outflow of *p* can occur in the x-direction with probability *h* or in the y-direction with probability 1 − *h*, and the magnitude of these flows depends on the amount of *p* present in neighboring cells. If flow of *p* is along the x-axis, it does so in the positive direction with probability *α* and in the negative direction with probability 1 − *α*. If flow of *p* is along the y-axis, it does so in the positive direction with probability *β* and in the negative direction with probability 1 − *β*. The allele frequency can also increase in magnitude locally, via a function *γ*() that depends on its dominance (*d*), selection coefficient (*s*) and current magnitude (*p*(*x, y, t*)). Then, we obtain:

$$\begin{aligned} p(x, y, t + \Delta t) = {} & h\alpha\, p(x - \Delta x, y, t) + h(1 - \alpha)\, p(x + \Delta x, y, t) \\ & + (1 - h)\beta\, p(x, y - \Delta y, t) + (1 - h)(1 - \beta)\, p(x, y + \Delta y, t) \\ & + \gamma(p(x, y, t); s, d)\, \Delta t \end{aligned}$$

We can also write this as:

$$\begin{aligned} p(x, y, t + \Delta t) - p(x, y, t) = {} & h\alpha\, [p(x - \Delta x, y, t) - p(x, y, t)] + h(1 - \alpha)\, [p(x + \Delta x, y, t) - p(x, y, t)] \\ & + (1 - h)\beta\, [p(x, y - \Delta y, t) - p(x, y, t)] + (1 - h)(1 - \beta)\, [p(x, y + \Delta y, t) - p(x, y, t)] \\ & + \gamma(p(x, y, t); s, d)\, \Delta t \end{aligned}$$

If we divide both sides by Δ*t* and take the limit of infinitesimally small Δ*x*, Δ*y* and Δ*t*, while assuming that, in this limit, *λ*_{x} = (Δ*x*)^{2}/Δ*t* and *λ*_{y} = (Δ*y*)^{2}/Δ*t* are finite (Okubo *et al*., 1980), we obtain:

$$\frac{\partial p}{\partial t} = \frac{h \lambda_{x}}{2} \frac{\partial^{2} p}{\partial x^{2}} + \frac{(1 - h) \lambda_{y}}{2} \frac{\partial^{2} p}{\partial y^{2}} + h (1 - 2\alpha)\, u_{x} \frac{\partial p}{\partial x} + (1 - h)(1 - 2\beta)\, u_{y} \frac{\partial p}{\partial y} + \gamma(p; s, d)$$

where

$$u_{x} = \frac{\Delta x}{\Delta t}, \qquad u_{y} = \frac{\Delta y}{\Delta t}$$

If we let *σ*^{2}_{x} = *hλ*_{x}, *σ*^{2}_{y} = (1 − *h*)*λ*_{y}, *v*_{x} = *u*_{x}*h*(1 − 2*α*) and *v*_{y} = *u*_{y}(1 − *h*)(1 − 2*β*), then we obtain equation 4. Thus, we can see that the squared diffusion coefficient *σ*^{2}_{x} depends on the square of the length of the cells in the x-axis relative to the duration of a time step (*λ*_{x}), and on the probability that flow occurs in the x-axis at a given time step (*h*). Similarly, the squared diffusion coefficient *σ*^{2}_{y} depends on the square of the length of the cells in the y-axis relative to the duration of a time step (*λ*_{y}), and on the probability that flow occurs in the y-axis at a given time step (1 − *h*). The advection coefficient *v*_{x} depends on the advective velocity along the x-axis (*u*_{x}) as well as on the probability of flow occurring along the x-axis (*h*) and the directional bias 1 − 2*α*, which depends on the probability that flow occurs in the positive x-direction (*α*). Finally, the advection coefficient *v*_{y} depends on the advective velocity along the y-axis (*u*_{y}) as well as on the probability of flow occurring along the y-axis (1 − *h*) and the directional bias 1 − 2*β*, which depends on the probability that flow occurs in the positive y-direction (*β*).

We can recover model B as a special case of model C if we fix *α* = *β* = ½, assuming isotropy in the two directions, so that Δ*x* = Δ*y*. We can also recover model A if we additionally fix *h* = ½.
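The special cases above can be checked numerically with a small sketch that maps lattice parameters to PDE coefficients. The mapping used here (squared diffusion coefficients equal to *h* and 1 − *h* times the squared cell length per time step, and advection coefficients equal to the velocity times the axis probability times the directional bias) follows the verbal description in this Appendix and should be read as an assumption of the sketch. The function name is hypothetical.

```python
def pde_coefficients(h, alpha, beta, dx, dy, dt):
    """Map random-walk lattice parameters to the coefficients of model C."""
    lam_x, lam_y = dx ** 2 / dt, dy ** 2 / dt   # squared cell length per step
    u_x, u_y = dx / dt, dy / dt                 # advective velocities
    return {
        "sigma2_x": h * lam_x,
        "sigma2_y": (1.0 - h) * lam_y,
        "v_x": u_x * h * (1.0 - 2.0 * alpha),
        "v_y": u_y * (1.0 - h) * (1.0 - 2.0 * beta),
    }


# No directional bias: advection vanishes, leaving model B.
model_b = pde_coefficients(h=0.3, alpha=0.5, beta=0.5, dx=1.0, dy=2.0, dt=1.0)

# Equal cell sizes and h = 1/2 in addition: a single diffusion term, model A.
model_a = pde_coefficients(h=0.5, alpha=0.5, beta=0.5, dx=1.0, dy=1.0, dt=1.0)
```

In the first case both advection coefficients are zero while the two diffusion coefficients differ, and in the second case the two diffusion coefficients coincide.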

## Acknowledgments

We thank Graham Gower, Evan Irving-Pease, Montgomery Slatkin and the members of the Racimo group for helpful comments and advice. FR and RM were funded by a Villum Fonden Young Investigator award to FR (project no. 00025300). FR was also supported by a Lundbeckfonden grant (R302-2018-2155) and a Novo Nordisk Fonden grant (NNF18SA0035006) to the GeoGenetics Centre. TSK was funded by a Carlsberg grant (CF19-0712). JN was funded by NIH grant R01 GM132383.
