## Abstract

Understanding population dynamics from the analysis of molecular and spatial data requires sound statistical modeling. Current approaches assume that populations are naturally partitioned into discrete demes, thereby failing to be relevant in cases where individuals are scattered on a spatial continuum. Other models predict the formation of increasingly tight clusters of individuals in space, which, again, conflicts with biological evidence. Building on recent theoretical work, we introduce a new genealogy-based inference framework that alleviates these issues. This approach effectively implements a stochastic model in which the distribution of individuals is homogeneous and stationary, thereby providing a relevant null model for the fluctuation of genetic diversity in time and space. Importantly, the spatial density of individuals in a population and their range of dispersal during the course of evolution are two parameters that can be inferred separately with this method. The validity of the new inference framework is confirmed with extensive simulations and the analysis of influenza sequences collected over five seasons in the USA.

## 1 Introduction

Kingman’s coalescent [24] is a cornerstone of population genetics. It provides a mathematical framework in which the effective size of a population can be estimated through the comparative analysis of genetic data from a sample of individuals. The simplicity and utility of the coalescent explain its popularity in biology (see [32] for a review). In its simplest form, the coalescent defines the probability density of a genealogy of individuals sampled from a constant size, panmictic population. However, the panmixia assumption becomes problematic when considering the spatial distribution of individuals, as the degree of kinship is generally correlated with geographic distance [37, 34, 25].

The coalescent was thus extended to incorporate spatial information. Under the so-called structured coalescent [21, 33], the population is partitioned into demes, each deme corresponding to a geographic entity. Sub-populations within each deme are panmictic and only individuals in the same deme can coalesce. Migrations between demes are governed by a homogeneous Markov process with the migration rate assumed to be small and estimated from the combination of spatial and genetic data. The structured coalescent has obvious connections with standard models in population genetics, namely the island model [52, 30] and the stepping stone models [29, 23] for which mathematical properties are well understood. Inference under the structured coalescent using maximum-likelihood [7, 8] and Bayesian techniques [15, 6, 46] has led to important advances in biology [e.g., 40], but is limited for computational reasons to a relatively small number of demes (typically fewer than ten) which are assumed known a priori [46].

But many natural populations are not subdivided into discrete demes. Instead, they display a gradient of kinship across a continuous landscape. In seminal works, Wright [53] and Malécot [28] proposed an extension of the Wright-Fisher model that incorporates continuous spatial information. Under the so-called “isolation-by-distance” (IBD) model, individuals are uniformly distributed on the landscape and the locations of offspring are random draws from a Normal distribution with mean given by the parental position. Mathematical expressions were derived for the probability that two alleles are identical by descent as a function of their spatial distance. However, Felsenstein [16] showed that some of the assumptions of the IBD model are inconsistent. A population evolving under this process displays ‘clumping’ of individuals, which contradicts the uniformity assumption. The IBD model may provide a suitable inference framework when considering short time scales over which clumping can safely be ignored. In the general case, however, it is preferable that the spatial distribution of the population is described by a stationary process.

Sawyer and Felsenstein [43] addressed this issue in a modified version of the IBD model where the spatial distribution of individuals is governed by a Poisson random field with density constant in time. However, their model relies on the assumption that each pair of parents produces exactly two offspring, which is constraining from a biological perspective. Moreover, their approach applies to the special case of a one-dimensional habitat and generalization to two dimensions leads to mathematical difficulties [31].

Wilkins and Wakeley [51] proposed a different approach in which a population is uniformly distributed along a one-dimensional finite habitat with the location of parents correlated to that of offspring. Importantly, each individual occupies an interval inversely proportional to the size of the population, thereby ensuring the population density is regulated at all points in space and time. This model assumes that the spatial position of each lineage is subject to a diffusion process backward in time, with the habitat having reflecting boundaries. The authors were able to derive an analytic formula for the distribution of the time to coalescence of a pair of sampled lineages. Wilkins [50] later proposed a generalization of this model to two-dimensional landscapes. However, Barton, Etheridge and Véber (2010b) suggested that this approach is sampling inconsistent, i.e., the distribution of the time to coalescence of a pair of sampled lineages depends on the size of the sample under consideration. Estimates of parameters of this model may thus be difficult to interpret in practice.

More recently, Lemey et al. (2009) proposed a model whereby spatial location is considered as a discrete character evolving along lineages according to a continuous-time Markov chain. Unfortunately, this approach suffers serious limitations. First, estimates of rates of migration are influenced by spatial variations in sampling intensity. Moreover, non-uniformity of population density is ignored when calculating the density of the genealogy. Also, this model is a discrete approximation of the IBD model when the migration process is isotropic and thus suffers from the same shortcomings. Altogether, while this approach is efficient from a computational perspective, it provides biased estimates of demographic parameters in particular simulation settings [13] and should thus be used with great caution.

In a recent series of articles [14, 9, 3, 5, 47, 4], Barton, Etheridge and colleagues described a new process, called the spatial Λ-Fleming-Viot process, for studying the evolution of populations on a continuous landscape. Malécot’s approach and related models consider that the time of death and reproduction of individuals are governed by a random process running along every lineage in the evolving population. The new model assumes instead that the time and position of these events are independent of the spatial location of lineages. The authors describe the forward-in-time dynamics of a population evolving on an unbounded spatial continuum, and the corresponding backward-in-time process that characterizes the genealogy of sampled individuals.

The mathematical properties of the spatial Λ-Fleming-Viot model have been studied extensively [9, 3, 47, 4]. In particular, it has been shown that this model does not suffer from sampling inconsistency or clumping issues. Barton et al. [2] showed that the analysis of pairs of sub-populations provides information about neighborhood size, i.e., the product of the effective population density by the dispersal intensity. These last two quantities are relevant from a biological perspective and, ideally, one would like to estimate each of them separately instead of their product.

In this study, we perform Bayesian inference under the spatial Λ-Fleming-Viot model applied to multiple individuals taken jointly. Using extensive simulations, we demonstrate the accuracy and precision with which the parameters of this model can be estimated. We compare our estimates to those obtained using two popular inference techniques, i.e., the regression on fixation index (Fst) values [41] and the structured coalescent [21, 33]. Our results demonstrate the good performance of our approach in these conditions. They also indicate that the proposed framework permits the estimation of population density and dispersal intensity as two separate (i.e., identifiable) parameters, thereby going beyond pairwise analyses. We further illustrate the validity of this new technique through the analysis of H1N1 sequences collected over five flu seasons in the USA. We show that the 2009 flu pandemic had distinct population dynamics compared to more recent seasons with smaller than usual neighborhood size and larger than usual dispersal distance.

## 2 The model

The spatial Λ-Fleming-Viot model (noted as ΛV from here on for the sake of brevity) assumes that reproduction, dispersal and death of lineages result from ‘events’ that occur at locations independent from that of individuals forming the population under scrutiny. In the following, we refer to these events as reproduction/extinction or REX events. From a biological perspective, one REX corresponds to either (i) a single reproduction event accompanied by extinction of the parent with the offspring dispersing over long distances or (ii) a sum of multiple reproduction and extinction events, each reproduction accompanied by dispersal of the offspring over short distances. In any case, the average time between two successive REX in a given lineage is proportional to the generation time of the species under scrutiny.

### 2-a Forward-in-time dynamics of the population

We assume that a population inhabits a finite habitat represented by a rectangle *R*(*h, ω*), with height *h* and width *ω* known *a priori*. Migrations crossing the boundaries of the rectangle are forbidden. This differs slightly from Barton, Etheridge and colleagues who assume that the population is distributed on a two-dimensional torus or on ℝ^{2}.

Initial locations of individuals are determined by a homogeneous Poisson point process with intensity *ρ*, producing a uniform distribution over *R*(*h, ω*). REXs occur at exponentially distributed waiting times with rate parameter λ. The center of each REX is chosen uniformly at random across the habitat. Each lineage then dies with some probability which depends on its distance to the center. New lineages are born at an intensity that also depends on the distance to the center. The choice of kernel modeling the dependency of the rates of birth and death as a function of distance from the center is flexible [see 4]. We choose a Normal kernel.

Let *c*_{i} := (*x*_{i}, *y*_{i}) be the center of the REX occurring at time *t*_{i}. Suppose lineage *k* has location *l*_{k} at time *t*_{i}^{−}, i.e., just before the event occurs (going forward in time). That lineage dies at *t*_{i} with probability

$$u(l_k, c_i) := \mu \exp\left(-\frac{\lVert l_k - c_i \rVert^{2}}{2\theta^{2}}\right),$$

where 0 < *μ* ≤ 1. We will refer to *μ* and *θ* as the death and radius parameters from here on.

Individuals are born at time *t*_{i} and location *l* according to an inhomogeneous Poisson point process with intensity *ρu*(*l,c*_{i})*dl*. Therefore, the expected size of the population does not change after each REX event and the spatial distribution of the population is still uniform, making this process stationary.

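Finally, all individuals born at *t*_{i} share a common parent who is chosen from among the individuals existing immediately before the REX event. The probability that lineage *k* with location *l*_{k} is selected as parent is then proportional to *u*(*l*_{k}, *c*_{i}), i.e.,

$$\frac{u(l_k, c_i)}{\sum_{j} u(l_j, c_i)},$$

where the sum runs over all lineages alive immediately before the event.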
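
The forward-in-time step described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the phyrex implementation: it assumes that the Normal kernel takes the form *u*(*l*, *c*) = *μ* exp(−‖*l* − *c*‖²/(2*θ*²)) and that the parent is drawn with probability proportional to that kernel. The function names (`kernel`, `poisson`, `forward_rex`) are hypothetical.

```python
import math
import random


def kernel(loc, center, mu, theta):
    """Assumed Normal kernel: u(l, c) = mu * exp(-||l - c||^2 / (2 * theta^2))."""
    d2 = (loc[0] - center[0]) ** 2 + (loc[1] - center[1]) ** 2
    return mu * math.exp(-d2 / (2.0 * theta ** 2))


def poisson(mean, rng):
    """Poisson draw obtained by counting unit-rate exponential arrivals before `mean`."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def forward_rex(pop, rho, mu, theta, width, height, rng):
    """Apply one REX event to `pop`, a list of (x, y) locations."""
    # The center of the event is uniform over the rectangular habitat.
    center = (rng.uniform(0.0, width), rng.uniform(0.0, height))
    weights = [kernel(loc, center, mu, theta) for loc in pop]
    # Parent: one lineage alive just before the event, drawn with
    # probability proportional to the kernel (assumed form).
    parent = None
    if sum(weights) > 0.0:
        parent = rng.choices(pop, weights=weights, k=1)[0]
    # Deaths: each lineage dies independently with probability u(l, c).
    survivors = [loc for loc, w in zip(pop, weights) if rng.random() >= w]
    # Births: Poisson process with intensity rho * u(l, c). Over the whole
    # plane its mass is rho * mu * 2 * pi * theta^2 and points are
    # Normal(c, theta^2 I); dropping points that fall outside the rectangle
    # restricts the process to the habitat.
    n_births = poisson(rho * mu * 2.0 * math.pi * theta ** 2, rng)
    births = []
    for _ in range(n_births):
        x, y = rng.gauss(center[0], theta), rng.gauss(center[1], theta)
        if 0.0 <= x <= width and 0.0 <= y <= height:
            births.append((x, y))
    return survivors + births, center, parent
```

Because deaths and births use the same kernel, the expected numbers of deaths and births balance when the local density equals *ρ*, which is the property that keeps the population size stable on average.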

### 2-b Backward-in-time dynamics of a sample

When considering a sample from the present population (corresponding to time *t*_{0} = 0), its ancestry is traced towards the past (corresponding to *t* < 0) as follows. REXs occur at the same rate λ as in the forward-in-time process and centers (i.e., values of *c*_{i}) are still uniform on *R*(*h, ω*). Suppose a REX event takes place at time *t*_{i} and that lineage *k* has position *l*_{k} at time *t*_{i}^{+}, i.e., immediately before the event (going backward in time). Lineage *k* changes location at the event, i.e., it is ‘hit’ by the REX event, with probability *u*(*l*_{k}, *c*_{i}). When it is hit, *k* jumps to a new, ancestral, location *l* with probability density proportional to *u*(*l*, *c*_{i}). When multiple sampled lineages are hit by the event, all of them coalesce at time *t*_{i} and at the location of their common parent.
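
The backward-in-time step can be sketched in the same spirit. This is an illustrative sketch only: it assumes the Normal kernel *u*(*l*, *c*) = *μ* exp(−‖*l* − *c*‖²/(2*θ*²)), draws the ancestral location by rejection from a Normal distribution centered on the event and truncated to the habitat, and uses hypothetical names (`hit_probability`, `draw_parent_location`, `backward_rex`).

```python
import math
import random


def hit_probability(loc, center, mu, theta):
    """Assumed Normal kernel u(l, c), used here as the probability of being hit."""
    d2 = (loc[0] - center[0]) ** 2 + (loc[1] - center[1]) ** 2
    return mu * math.exp(-d2 / (2.0 * theta ** 2))


def draw_parent_location(center, theta, width, height, rng):
    """Density proportional to u(l, c) on the rectangle: a truncated Normal(c, theta^2 I)."""
    while True:
        x, y = rng.gauss(center[0], theta), rng.gauss(center[1], theta)
        if 0.0 <= x <= width and 0.0 <= y <= height:
            return (x, y)


def backward_rex(lineages, mu, theta, width, height, rng):
    """Apply one REX event to `lineages`, a dict mapping lineage id to (x, y).

    Returns the updated dict and the list of ids that were hit.
    """
    center = (rng.uniform(0.0, width), rng.uniform(0.0, height))
    hit = [k for k, loc in lineages.items()
           if rng.random() < hit_probability(loc, center, mu, theta)]
    if not hit:
        return dict(lineages), hit
    parent_loc = draw_parent_location(center, theta, width, height, rng)
    remaining = {k: loc for k, loc in lineages.items() if k not in hit}
    # All lineages hit by the same event merge into a single ancestral
    # lineage located at the parental position; it reuses the first id hit.
    remaining[hit[0]] = parent_loc
    return remaining, hit
```

Repeatedly calling `backward_rex` until a single lineage remains traces the sample back to its most recent common ancestor; events that hit no sampled lineage leave the sample unchanged.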

### 2-c Interpretation of parameters

The ΛV model has three parameters: λ, *μ* and *θ*. λ has a straightforward interpretation: it is the rate of REX events taking place in the whole habitat. It is also common to refer to the rate per unit area λ′ := λ/s where *s* := *ωh* is the total area. Although the value of this parameter is linked to the expected time elapsed between birth and the age of first reproduction, the precise relationship between λ and generation time depends on the biology of the species under scrutiny.

The probability density of the location of a lineage immediately after being hit (going backward in time) given its location before the event is a bivariate normal density with covariance matrix 2*θ*^{2}**I**. 2*θ*^{2} is thus the expected squared distance, measured along one axis, between a parent and one of its immediate offspring (equivalently, half the expected squared Euclidean distance in two dimensions).

Also, considering the backward-in-time process here again, the probability that a lineage is hit by a REX is approximated by 2*πθ*^{2}*μ/s*. This probability can be interpreted as the ratio between the rate of events that hit lineages and the rate for both types of events. Therefore, given a fixed radius, the higher *μ*, the higher this ratio.
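
The approximation 2*πθ*^{2}*μ/s* can be checked numerically by averaging the kernel over uniformly placed event centers. The sketch below assumes the Normal kernel *u*(*l*, *c*) = *μ* exp(−‖*l* − *c*‖²/(2*θ*²)); the function names and the parameter values used in the usage note are illustrative only.

```python
import math
import random


def hit_probability_mc(loc, mu, theta, width, height, n_events, rng):
    """Monte Carlo estimate: average of u(l, c) over uniform REX centers c."""
    total = 0.0
    for _ in range(n_events):
        cx, cy = rng.uniform(0.0, width), rng.uniform(0.0, height)
        d2 = (loc[0] - cx) ** 2 + (loc[1] - cy) ** 2
        total += mu * math.exp(-d2 / (2.0 * theta ** 2))
    return total / n_events


def hit_probability_approx(mu, theta, width, height):
    """Closed-form approximation 2 * pi * theta^2 * mu / s, with s = width * height."""
    return 2.0 * math.pi * theta ** 2 * mu / (width * height)
```

For a lineage at the center of a 10 by 10 habitat with *θ* = 0.5, the two quantities should agree closely. The agreement is expected to degrade for lineages close to the boundary or for radii that are large relative to the habitat, since part of the kernel mass then falls outside the rectangle.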

Wright’s neighborhood size (*𝒩*) and dispersal intensity (*σ*^{2}) are two standard parameters in population genetics that can be expressed as functions of λ, *μ* and *θ*. Assuming small values of *θ*, the distribution of the location of a lineage at time *t* conditional on its location at time 0 is normal with variance *σ*^{2}*t*, where *σ*^{2} := 4*θ*^{4}λ′*πμ*. In the limit where λ → ∞ and *θ*→ 0, i.e., the jumps become increasingly frequent and small, the backward-in-time motion of a single lineage is a Brownian process with diffusion parameter *σ*^{2}.

Considering again small values of the radius, the probability that any two lineages coalesce in a REX event is simply the probability that both of them are hit, i.e., (2*πθ*^{2}*μ/s*)^{2} = 4*π*^{2}*θ*^{4}*μ*^{2}/*s*^{2}. The rate of these events is thus 4*π*^{2}*θ*^{4}*μ*^{2}λ/*s*^{2}. We define this rate as 1/(2*N*_{e}), where *N*_{e} is the effective size of the (diploid) population. Following the definition of the neighborhood size used in the Wright-Malécot model, i.e., *𝒩* := 4*πN*_{e}*σ*^{2}/*s*, and using the definitions of *σ*^{2} and *N*_{e} given above, we obtain *𝒩* = 2/*μ*. The neighborhood size, which can be understood as the expected number of individuals participating in reproduction in a disk of radius 2*σ*, is thus inversely proportional to *μ*. More details on the derivations of the results in this section are given in SI 12 and [5].
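
The algebra behind *𝒩* = 2/*μ* can be spelled out in a few lines. The following is a sketch that only combines the quantities defined in this section (the hit probability 2*πθ*^{2}*μ/s*, the per-axis jump variance 2*θ*^{2} and the pairwise coalescence rate); it is not a substitute for the derivations in SI 12.

$$
\begin{aligned}
\sigma^{2} &= \underbrace{\lambda \,\frac{2\pi\theta^{2}\mu}{s}}_{\text{hit rate per lineage}} \times \underbrace{2\theta^{2}}_{\text{variance per jump and axis}} = 4\pi\theta^{4}\lambda'\mu, \\
\frac{1}{2N_e} &= \lambda \left(\frac{2\pi\theta^{2}\mu}{s}\right)^{2} \quad\Longrightarrow\quad N_e = \frac{s^{2}}{8\pi^{2}\theta^{4}\mu^{2}\lambda}, \\
\mathcal{N} &= \frac{4\pi N_e \sigma^{2}}{s} = \frac{4\pi}{s} \cdot \frac{s^{2}}{8\pi^{2}\theta^{4}\mu^{2}\lambda} \cdot \frac{4\pi\theta^{4}\lambda\mu}{s} = \frac{2}{\mu}.
\end{aligned}
$$

Both λ and *θ* cancel in the last line, which is why the neighborhood size depends on *μ* only.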

## 3 Likelihood

We now derive the likelihood of the ΛV model for data consisting of the locations of a sample of present-day individuals and their ancestors. Let *g* be the timed genealogy describing the ancestral relationships of the sampled lineages. Suppose there are *m* REX events between t0 and the time of the most recent common ancestor (MRCA, the root of g) and that the i-th event occurs at time *t*_{i} with center *c*_{i}. Since we are considering the backward-in-time process, *t*_{i} > *t*_{i+1} for *i* = 0,…, *m* −1. Let and be the vector of locations of the ancestral lineages at time and respectively. Also, let be the known location data. Knowing *g* allows one to determine which locations in correspond to lineages that were born at *t*_{i} and which location in is the parental one.

The likelihood for the observed data l_{0} and imputed data , *t*_{i} (with *i* =1,… *m*) and *g* is then

Note that , i.e., the locations do not convey any information about the genealogy directly (only genetic sequences do). Let *B*_{i} be the set of indices of sampled lineages born at *t*_{i}. We have:
where *l*(*p*, *c*_{i}) is the location of the parent of all offspring lineages born at *t*_{i} and **I**(|*B*_{i}| > 1) is an indicator function taking value 0 if no offspring were born at REX *i* and 1 otherwise. The likelihood for the observed location data only can be obtained via marginalization which we perform via MCMC sampling.

## 4 Simulations

### 4-a Range of parameter values

The habitat is a 10 by 10 square in our simulations. Individuals belonging to the population of interest are never found outside this area. The effective size of the population, *N*_{e}, was sampled from a uniform distribution on [100, 5000]. Values of the neighborhood size were then obtained by sampling uniformly in [*N*_{e} × 10^{−3}, *N*_{e} × 10^{−2}]. Values of *θ* were sampled uniformly in [1.5, 4]. When *θ* = 1.5, the probability that the offspring falls at a distance from its parent smaller than or equal to 1.0 is approximately 0.5 (the distance is measured here along a single axis). When *θ* = 4, this probability is approximately 0.25. We thus considered this range of values for *θ* to be broad enough to illustrate medium- and long-range dispersal patterns respectively. The values of *N*_{e}, *θ* and *𝒩* fully determine those of *μ*, *σ* and λ.
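
The way *N*_{e}, *θ* and *𝒩* determine the remaining parameters can be made explicit by inverting the relations of Section 2-c. The sketch below is an assumption-laden illustration: the function name `lv_parameters` is hypothetical, and raising an error when the implied *μ* exceeds 1 is a choice made here, since the text does not describe how such combinations are handled.

```python
import math


def lv_parameters(n_e, neigh, theta, area):
    """Return (mu, lam, sigma2) implied by N_e, neighborhood size, theta and area s."""
    # From N = 2 / mu.
    mu = 2.0 / neigh
    if not 0.0 < mu <= 1.0:
        raise ValueError("neighborhood size must be at least 2 for mu to lie in (0, 1]")
    # From 1 / (2 N_e) = 4 pi^2 theta^4 mu^2 lam / s^2.
    lam = area ** 2 / (8.0 * math.pi ** 2 * theta ** 4 * mu ** 2 * n_e)
    # From sigma^2 = 4 theta^4 (lam / s) pi mu.
    sigma2 = 4.0 * theta ** 4 * (lam / area) * math.pi * mu
    return mu, lam, sigma2
```

For instance, `lv_parameters(1000.0, 5.0, 2.0, 100.0)` gives *μ* = 0.4, and plugging the returned values back into *𝒩* := 4*πN*_{e}*σ*^{2}/*s* recovers the neighborhood size of 5.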

Nucleotide sequences of length 500 bp evolved along the genealogies under the Kimura-2-parameter model [22] with a transition/transversion ratio set to 4.0. The 5% and 95% quantiles of the nucleotide diversity estimated from the sequence alignments hence generated are 0.44% and 1.56% respectively, well in line with that observed in *Drosophila melanogaster* for instance [1].
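
Evolving a sequence along one edge of a genealogy under the Kimura-2-parameter model can be sketched as follows. The sketch assumes that the ratio of 4.0 refers to *κ*, the ratio of the transition rate to the rate of each transversion; under the other common convention (expected number of transitions over expected number of transversions) the value passed as `kappa` would differ. The names `k2p_probabilities` and `evolve` are hypothetical.

```python
import math
import random

TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def k2p_probabilities(d, kappa):
    """Return (p_same, p_transition, p_each_transversion) for an edge of length d.

    d is the expected number of substitutions per site; rates are scaled so
    that alpha + 2 * beta = 1, with alpha = kappa * beta.
    """
    beta = 1.0 / (kappa + 2.0)
    alpha = kappa * beta
    e1 = math.exp(-4.0 * beta * d)
    e2 = math.exp(-2.0 * (alpha + beta) * d)
    p_same = 0.25 + 0.25 * e1 + 0.5 * e2
    p_ts = 0.25 + 0.25 * e1 - 0.5 * e2
    p_tv = 0.25 - 0.25 * e1
    return p_same, p_ts, p_tv


def evolve(seq, d, kappa, rng):
    """Evolve each site of `seq` independently along an edge of length d."""
    p_same, p_ts, _ = k2p_probabilities(d, kappa)
    out = []
    for base in seq:
        u = rng.random()
        if u < p_same:
            out.append(base)
        elif u < p_same + p_ts:
            out.append(TRANSITION[base])
        else:
            # The two possible transversions are equally likely.
            out.append(rng.choice(TRANSVERSIONS[base]))
    return "".join(out)
```

Applying `evolve` to every edge of a timed genealogy, from the root towards the tips, yields one simulated alignment column set per genealogy.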

### 4-b Data collection process

In real experiments, collection of data is rarely uniform over the habitat. Instead, samples are often obtained from disconnected and seemingly randomly scattered regions. In an attempt to mimic these patterns, the sampling scheme used in our simulations relies on throwing random triangles on the habitat. Lineages are then sampled uniformly at random from each of these triangles (see SI 10).

The evolution of a population counting 5,000 individuals was simulated under the forward-in-time process for each simulation. Each of these stopped after 100,000 REXs. These particular values for the population size and number of REXs were selected so that computation could be performed with a reasonable amount of memory. In fact, the size of the population used here is not relevant to the effective population size, which is a function of *λ*, *μ*, *θ* and *s* only.

Sampling sites were then randomly scattered on the landscape, as just explained. A sample of 50 individuals was obtained by selecting individuals uniformly at random within the available sites. In cases where fewer than 50 individuals happened to be within the sampling regions, new regions and individuals were drawn. This procedure was repeated until a sample of size 50 was obtained. In all simulations, the sampled individuals did coalesce in the time period considered (i.e., the time required to reach 10,000 REXs).
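
A sampling scheme of this kind might be sketched as below. The details of the actual procedure are given in SI 10; here the triangles are assumed to have vertices drawn uniformly over the habitat, and the function names (`random_triangle`, `in_triangle`, `sample_individuals`) are hypothetical.

```python
import random


def random_triangle(width, height, rng):
    """Triangle with three vertices drawn uniformly over the habitat (an assumption)."""
    return [(rng.uniform(0.0, width), rng.uniform(0.0, height)) for _ in range(3)]


def in_triangle(p, tri):
    """Same-side test: p is inside if the three edge cross products share a sign."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    d1 = (p[0] - x2) * (y1 - y2) - (x1 - x2) * (p[1] - y2)
    d2 = (p[0] - x3) * (y2 - y3) - (x2 - x3) * (p[1] - y3)
    d3 = (p[0] - x1) * (y3 - y1) - (x3 - x1) * (p[1] - y1)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)


def sample_individuals(pop, n_regions, sample_size, width, height, rng):
    """Redraw the sampling regions until they hold at least `sample_size` individuals."""
    while True:
        regions = [random_triangle(width, height, rng) for _ in range(n_regions)]
        eligible = [p for p in pop if any(in_triangle(p, t) for t in regions)]
        if len(eligible) >= sample_size:
            return rng.sample(eligible, sample_size), regions
```

With two regions, the sampled individuals cluster in at most two patches of the habitat, which mimics the uneven spatial coverage described above.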

## 5 Bayesian parameter estimation

Samples of genetic sequences, s, along with locations of sampled lineages, l_{0}, are used to infer the parameters of interest, namely λ, *μ* and *θ*. The model also involves multiple nuisance parameters which we impute, the genealogy of the genetic sequences under study and the spatial coordinates of ancestral lineages being the main ones. Random draws from the joint posterior distribution of all these parameters are obtained using MCMC techniques. The posterior density is as follows
where *η* is the substitution rate. In what follows, we assume that this rate does not vary across lineages nor sites of the alignment and its value is known *a priori*. The parameters (*t*_{0},… ,*t*_{m}, *η*) fully specify the edge lengths in g in terms of expected number of substitutions (between nucleotides, amino-acids or codons) per position along a sequence. The probability Pr(s|g,*t*_{0}, …, *t*_{m}, *η*) is calculated using Felsenstein’s (1981) pruning algorithm. The joint prior density *p*(*θ*, λ, *μ|h, ω*) is given by the product of the three densities *p*(*θ|h, ω*), *p*(λ|*h, ω*) and *p*(*μ*|*h, ω*). The prior distributions of *μ*, λ and *θ* are uniform on [0,1], [10^{−6},10^{+2}] and [0,5] respectively.

Random draws from the joint posterior distribution were obtained using the Metropolis-Hastings algorithm. A total of fifteen operators updating the values of every model parameter, including the tree topology, were implemented (see SI 11). We validated the implementation and verified the correctness of the data simulation algorithm by comparing the distribution of summary statistics in simulated data to that inferred using our sampling technique (see SI 13). For each (real and simulated) data set, the chain was sampled for a maximum of 100 hours on a computer server equipped with 2.7-2.8 GHz CPUs. The sampling halted when the effective sample sizes of λ, *μ* and *σ*^{2} all exceeded 100.
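
The stopping rule based on effective sample sizes can be illustrated with a simple estimator. The sketch below uses a common autocorrelation-based estimate, truncated at the first non-positive autocorrelation; it is not necessarily the estimator used in phyrex, and the function name `effective_sample_size` is hypothetical.

```python
import random


def effective_sample_size(chain):
    """ESS = n / (1 + 2 * sum of positive-lag autocorrelations).

    The sum is truncated at the first lag whose autocorrelation is
    non-positive, a simple and widely used heuristic.
    """
    n = len(chain)
    mean = sum(chain) / n
    centered = [x - mean for x in chain]
    var = sum(c * c for c in centered) / n
    if var == 0.0:
        # A chain that never moves carries no information.
        return 0.0
    rho_sum = 0.0
    for lag in range(1, n):
        acov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = acov / var
        if rho <= 0.0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)
```

A run would then be halted once `effective_sample_size` exceeds 100 for each of the sampled traces of λ, *μ* and *σ*^{2}, or when the time limit is reached.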

## 6 Results

### 6-a Population density

Neighborhood size (*𝒩*), a quantity proportional to the product of population density and dispersal intensity, can be inferred from pairs of sequences and their spatial coordinates. Indeed, estimates of *𝒩* are often derived from the slope of the regression of Fst values for pairs of sequences on the corresponding geographic distances (see [41] and SI 14). In our simulations, Pearson’s correlations between true neighborhood sizes and estimates obtained using this technique are equal to 0.074 and −0.006 for two and ten sampling regions respectively. For the ΛV model, estimates of neighborhood sizes are taken as the posterior medians. The correlation between true and estimated values is equal to 0.631 and 0.669 for two and ten sites respectively. A more detailed analysis of the posterior distributions estimated from the simulated data indicates that accurate and precise inference of this parameter is achievable using our technique, at least for values of *𝒩* smaller than ~20 (see SI 15).
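
The regression-based estimator can be sketched as follows. The exact procedure used in this study is described in SI 14 and may differ; the sketch assumes the form commonly used for two-dimensional habitats, in which Fst/(1 − Fst) is regressed on the logarithm of geographic distance and the inverse of the slope is taken as the estimate of *𝒩*. The function names are hypothetical.

```python
import math


def ols_slope(xs, ys):
    """Slope of the ordinary least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def neighborhood_from_fst(distances, fsts):
    """Estimate the neighborhood size as 1 / slope of Fst/(1 - Fst) on ln(distance)."""
    xs = [math.log(d) for d in distances]
    ys = [f / (1.0 - f) for f in fsts]
    return 1.0 / ols_slope(xs, ys)


def pearson(xs, ys):
    """Pearson correlation, e.g., between true and estimated neighborhood sizes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)
```

An estimator of this form becomes unstable when the fitted slope is close to zero or negative, which can happen when few sampling regions, and hence few pairwise distances, are available.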

The structured coalescent is often used to estimate the effective size of populations in a context similar to that of our simulations: each deme corresponds to a sampling region as opposed to a genuine element of a partitioned population. We used MultiTypeTree [46] from the BEAST2 package [10] to estimate the parameters of the structured coalescent model applied to our simulated data (see SI 17). Posterior medians and 95% credibility intervals for the population densities estimated with the structured coalescent and the ΛV models are presented in Figure 1. The structured coalescent generally overestimates population densities with the strongest biases observed with two demes and more accurate estimates obtained with ten. Estimates obtained under the ΛV model are better overall, both in terms of accuracy and, to a lesser extent, precision, with very little difference between two and ten demes.

### 6-b Dispersal intensity

Figure 2 displays the simulated values of dispersal intensity (*σ*^{2}) against the posterior medians inferred using the Bayesian approach. Parameter inference is fairly accurate overall but precision is limited, especially for large values of this parameter. Increasing the number of sampling regions improves the quality of inference, however. Overall, these results suggest that it is possible to extract valuable information about dispersal patterns using our approach, although obtaining precise estimates requires sampling a large number of sites.

### 6-c Computation times

Each simulated data set was allocated a maximum of 100 hours of computation time. Effective sample sizes (ESS) for the parameters λ, *μ* and *σ*^{2} were monitored during this period. For two sampling sites, 60%, 62% and 95% of the data sets had ESS greater than 100 for λ, *μ* and *σ*^{2} respectively. For ten sites, the corresponding percentages are 56, 80 and 99. Usable estimates of the ΛV model parameters are thus generally obtained in a reasonable, yet substantial, computation time.

### 6-d Influenza seasons in the USA

Homologous nucleotide sequences from the NA segment of the Influenza A virus (H1N1 sub-type) were retrieved from the Influenza Research Database [45]. Five flu seasons (2009-2010 to 2013-2014) in the USA were considered in five separate analyses. Hawaii and Alaska were excluded from the analysis in order to approximate the shape of the habitat with a rectangle. Multiple sequences are generally available for each state and season. Two distinct sets of sequences (with a single sequence per state for each set) were analyzed for each season, thereby providing two independent biological replicates for each of the five flu seasons. Each sequence alignment was analyzed using the HKY [19] model of nucleotide substitution with the FreeRate model of rate variation across sites [44] and a covarion-like model of site-specific rate variation across lineages [18].

Figure 3 gives the posterior distributions of the neighborhood size and the radius parameter for the five seasons and two replicates. We focused on the radius *θ* rather than the dispersal intensity parameter *σ*^{2} as the latter requires knowledge about the rate of nucleotide substitution which we could not infer because of the lack of calibration information in the data considered.

The comparison of parameter estimates across seasons and replicates shows interesting features. First, the posterior distributions of parameters are similar in the two independent biological replicates. This observation suggests that variation of parameter estimates due to sampling is negligible. Second, the evolutionary dynamics observed for the 2009-2010 season (corresponding to the 2009 flu pandemic in the USA) are clearly distinct from those observed for other seasons. Comparatively smaller neighborhood sizes and larger radii are inferred for this season. These two observations are indicative of a virus with limited infection rate (low *𝒩*) but good ability to proliferate under various climatic conditions (large *θ*). The fact that the 2009-2010 season lasted for a substantially longer period of time compared to subsequent seasons and had a comparatively mild incidence rate (see SI 16) corroborates this conclusion.

## 7 Discussion

Understanding the forces shaping genetic diversity in space is a key objective in ecology and population genetics. Recent years have seen the rise of methods that essentially aim at visualizing the correlation between genetic and geographic distances [35, 34]. These exploratory methods can reveal interesting patterns in the data such as long-distance admixture [12], migration corridors or barriers [36]. The present study focuses instead on inferring the parameters of a stochastic model of population dynamics. This approach is well suited to testing biological hypotheses and therefore provides a relevant complement to more exploratory techniques.

Our results indicate that the spatial Λ-Fleming-Viot (ΛV) model [14, 9, 3, 5, 47, 4] is amenable to parameter inference under biologically realistic conditions. Using a Bayesian inference technique that relies on augmenting the data with the ancestral locations of sampled lineages, we show that accurate information about neighborhood size and dispersal intensity can be recovered from geo-referenced genetic data. It is a significant step forward in the analysis of the spatial distribution of genetic diversity as partitioning populations into discrete demes (as in the structured coalescent) or assuming a non-homogeneous distribution of individuals in space and time (as in the Wright-Malécot model) is not required with this new technique.

Estimates of neighborhood sizes obtained with the traditional approach based on fixation indices (Fst) show virtually no correlation with the true values of this parameter in our simulations. This inference technique was originally designed for the analysis of diploid individuals and multiple unlinked loci with each locus evolving under an infinite-allele model. It is robust to misspecification of the mutation model [26] and is relevant in a broad range of experimental conditions [49]. In our simulations however, all loci evolved along the same genealogy while the number of alleles was limited to four nucleotides and only one sequence per individual was considered. In these circumstances, which correspond to standard experimental conditions, our Bayesian inference technique returns precise estimates of neighborhood sizes, thereby providing a relevant alternative to Fst-based methods. More importantly, while the traditional approach only estimates the product of population density and dispersal intensity (i.e., the two parameters are not identifiable in the standard inference framework, see e.g., [42]), both parameters can be estimated separately using the proposed technique.

The structured coalescent provides estimates of population density. Although this method assumes that each deme corresponds to a sub-population, it is commonplace to equate a sampling region with a deme. In these circumstances, our results suggest that the structured coalescent overestimates the population density when the generative model is ΛV. The bias decreases as the number of sampling regions increases, however. Yet, our Bayesian estimation technique clearly outperforms the structured coalescent here. In cases where the population of interest is not strongly structured spatially but rather continuously distributed, we thus recommend that estimation is conducted under the ΛV model.

Bayesian inference methods are generally computationally intensive compared to other estimation techniques. Our approach is no exception, although stable estimates of the three main model parameters were obtained after about four days of computation in the majority of simulated data sets. Our implementation of the MCMC sampler fitting the ΛV model was extensively tested and optimized. Nonetheless, new operators complementing the fifteen considered here might improve the speed of convergence. Also, a substantial fraction of the computation is spent on evaluating the likelihood of the model for REX events that do not affect the location of any lineage. Integrating over the locations and times of these events could potentially be done analytically, thereby decreasing the computational burden.

The analysis of H1N1 sequences from five flu seasons in the USA provides insight into the dynamics of the infection that is coherent with the variation in the incidence of flu-related diseases. In particular, the patterns inferred for the 2009 pandemic, both in terms of neighborhood size and range of dispersal, are notably distinct from those observed in other ‘regular’ flu seasons. The larger than usual estimate of dispersal distance might be one of the factors explaining why this season lasted for a longer period of time compared to other years. Also, the incidence rate for the 2009 season was relatively mild, which is consistent with the small neighborhood size estimated here.

Our implementation of the ΛV model relies on several assumptions that require careful consideration. First, the size of the population and its habitat are considered as fixed. Detecting expansion or contraction of population sizes during the course of evolution is at the core of important questions in ecology and population genetics (see e.g., [20]). Accommodating deterministic changes of population size in the ΛV framework is therefore of utmost interest and will be the focus of future research. Second, our simulations assume a homogeneous landscape. This assumption is not realistic in instances where mountains, rivers, human activity, etc., impede migration of individuals. Relaxing the assumption of isotropic migrations in the ΛV model presents a technical challenge that needs to be addressed. Third, our simulations and inference method target multiple linked loci. Extending our approach to accommodate recombination is an interesting research prospect. In fact, Barton et al. (2013) recently showed that recombination patterns convey information about the distribution of relative parent position and neighborhood size under the ΛV model. Fourth, the boundaries of the habitat are considered as known a priori in the present study. Treating the area of the habitat as a random variable would be relevant for the analysis of most real data sets. Further work on this question should also assess the impact of over- or under-estimating this area on the accuracy and precision of model parameter estimates.

Despite these limitations, the proposed approach is a relevant complement to the exploratory analyses mentioned above and the inferential techniques based on the structured coalescent or the fixation index. Fitting a ΛV model is particularly relevant in cases where the spatial distribution of individuals in the population of interest does not display well-defined demes. Despite the substantial computational burden involved with the Bayesian inference under this model, the opportunity to infer the density of a population and characterize dispersal distances from the combined analysis of genetic and spatial data should help improve our understanding of the mechanisms underlying spatial distribution of populations and species.

## 8 Software availability

The software phyrex implementing the MCMC algorithm for parameter inference under the ΛV model is available as part of the PhyML package from the following URL: https://github.com/stephaneguindon/phyml.

## 9 Acknowledgments

We thank Prof. Nick Barton and François Rousset for constructive feedback on early versions of the manuscript. We also wish to acknowledge the Centre for eResearch at the University of Auckland and NeSI high-performance computing facilities.

## Footnotes

^{1}This notation slightly deviates from the one used in the main text, where the time of the MRCA is noted as *t*_{m}.

## Bibliography

- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].