Demographic inference in a spatially-explicit ecological model from genomic data: a proof of concept for the Mojave Desert Tortoise

In this paper, we study the general problem of extracting information from spatially explicit genomic data to inform inference of ecologically and geographically realistic population models. We describe methods and apply them to simulations motivated by the demography of the Mojave desert tortoise (Gopherus agassizii). The tortoise is an example of a long-lived, threatened species for which we have an excellent understanding of range, habitat preference, and certain aspects of demography, but inadequate information on other life history components that are important for conservation management. We use an individual-based model on a discretized geographic landscape with overlapping generations and age and sex-specific dispersal, fecundity, and mortality to develop and test a method that uses genomic data to infer demographic parameters. We do this by seeking parameters that best match a set of spatial statistics of genomes, which we introduce and discuss. We find that for inferring only overall population density and mean migration distance, a simple statistical learning method performs well using simulated training data, inferring parameters to within 10% accuracy. In the process, we introduce spatial analogues of common population genetics statistics, and discuss how and why they are expected to contain signal about the geography of population dynamics that are key for ecological modeling generally and conservation of endangered taxa.

other. Since we use these statistics in a nonstandard way, we now define them and motivate their use as 151 informative spatial statistics.

152
For a set of genomes, denoted A, we write the genetic diversity of A, i.e., the mean density of nucleotide 153 differences between a randomly chosen pair of genomes from A, as π(A) = E[|a 1 − a 2 |], where a 1 and a 2 are 154 alleles at a random site in the genome, coded as 0 or 1, from genomes randomly chosen without replacement 155 from A. We also denote the genetic divergence between two groups, A and B, as the mean density of 156 nucleotide differences between randomly chosen individuals from the two groups, which can be written as if we interpret the latter as sampling without replacement, as we did for π(A, B); for this reason, subsequently we write all F statistics using this format (and dropping the subscript '4'). 166 We also introduce analogous statistics that depend on choices of three genomes. We will write these in if they all come from a single randomly mating population), because in this case each topology is equally 193 frequent and has the same distribution of branch lengths, so the contributions of each topology cancel. Figure   194 2 also shows formulas for the statistics in terms of divergence. physically between the regions of groups B and C, then Y (A; B, C) will be negative due to a deficit of trees 204 with topology (A, (B, C)), which has overall positive weight, as shown in Figure 2B. Conversely, if B or C lie 205 between the other two, then Y (A; B, C) will be positive. Figure 3 shows several geographic configurations of i.e., there is no recent coalescence -all three rooted topologies in Figure 2B are roughly equally likely and 208 have equal branch lengths on average (Wilkins 2004), and will therefore cancel out.

209
Population size also affects the probability of first coalescence. For example, F 2 (1, 2) = F (1, 2; 1, 2), which implying that F 2 (1, 2) > 0. However, an increased population density in one region reduces the magnitude of 213 F 2 , since it makes it more likely that two lineages in that region trace back to ancestors outside the region 214 before they coalesce, decreasing this bias. Thus, increased population density is expected to decrease the 215 value of the statistic. Figure 4 shows how varying region size (i.e., group population size) affects F and Y .

216
How many statistics? If we have divided our samples into k groups we can compute a large number of 217 statistics. The statistics we consider here are averages over choices of two, three, or four genomes chosen 218 from varying numbers of groups. We call a statistic that depends on k groups a "k-point statistic" since we 219 imagine each group standing in for a spatial location. "Groups" may be arbitrary: single genomes, single 220 diploids, or larger collections. The total number of statistics of each type is:  data, the statistics must be computed using sequence differences, but these are directly comparable to the 248 branch length statistics after a rescaling. values. Second, we compute statistics on each output. After this procedure (shown in Figure 1D) we obtain 255 a table containing inputs (values of ρ and σ) and corresponding outputs, i.e., the statistics we calculate, 256 denoted generically here as (s 1 . . . s n ). We seek to infer the relationship between inputs and outputs and do 257 so using inverse interpolation.  (A, B) with the path between a sample from A and a sample from B yields the weights. For instance, in T Y 1 the weight of -1/2 on the branch above b (ii) is obtained by adding +1/2 (because it is on the path from a to b) to -1 (since it is on the path from b to c). Figure 3: Effects of differing spatial configurations of regions in isotropic space on the signs of Y and F statistics. In all cases the sign is derived by reasoning about the probability of the first coalescence involving two regions and then connecting this to the probability that topologies occur in Figure 2. All population sizes (i.e., areas) are equal. Larger distances between regions in isotropic space result in decreased probability that the first coalescence involves these regions. Equivalent distances between equal-sized regions results in equal probability that the first coalescence involves these regions. If the actual sign is in doubt, the comparator is marked by '?'. Recall that F 3 (1; 2, 3) = F (1, 2; 1, 3) and that F 2 (1, 2) = F (1, 2; 1, 2). See Supplement A for justification. Figure 4: Effects of differing sizes and configurations of regions in isotropic space on Y and F statistics. In all cases the sign is derived by reasoning about the probability of the first coalescence involving two regions and then connecting this to the probability that topologies occur in Figure 2. All distances between centroids of adjacent populations are assumed equal. Regions denoted by larger circles have twice the population size (e.g., twice the population density) of those denoted by smaller circles. The probability of the first coalescence involving two members from a larger population is less than that of the first coalescence involving two members of a smaller population. Recall that F 3 (1; 2, 3) = F (1, 2; 1, 3) and that F 2 (1, 2) = F (1, 2; 1, 2). See Supplement A for justification.

Inference via inverse interpolation
Consider using real genomic data to infer migration. In a classic Wright-Fisher model there is a clean 260 parametric dependence of a genetic statistic, F ST , on migration rate m and population size N . For our more 261 complex model there is an unknown analytic relationship between σ, ρ, and the statistics we introduce above.

262
More generally, simulations give us noisy observations of an unknown function f (θ) that maps parameters 263 (here, θ = (σ, ρ)) to the n statistics: for each simulation, run with parameters θ i , we can think of the resulting 264 set of statistics as where Σ is an unknown covariance matrix defining how the noise i is correlated across observations i. Given 266 a new set of statisticsS, we then seek to estimate the corresponding parameters,θ.

267
For our purposes, it suffices to take an average over the known parameter values θ i , each weighted according 268 to the proximity of its associated statistics S i to the observedS: We choose the bandwidth, ω, by k-fold crossvalidation. To do this, we randomly divide the data S and θ into then compute the mean relative error in the i th as: We choose the bandwidth ω that minimizes the median relative error across all k validation blocks. parameters for any spatial population model-dispersal-is also one of the most challenging to estimate 278 empirically. In G. agassizii adults are generally dependent on burrows and therefore relatively stationary.

279
But tortoises are also long-lived, so rare adult dispersal may provide significant, but rarely observed, genetic survival and fecundity by age class), density-dependence, movement in space, and maturity (which increases 293 survival). Our IBM simulates a closed population, which could be an entire species. 294 We implement the model by discretizing continuous space into a grid of discrete patches, where the patches 295 are contiguous and each represents a subpopulation ( Figure 1B). In the simulation below, each of these 296 includes approximately 50-500 individuals. Within each patch, we assume that individuals mate randomly. 297 We also implement population regulation within the patches.

298
Movement between patches is parameterized by the standard deviation of the dispersal kernel, σ. We compute 299 the migration matrix M whose (i, j)th entry is the probability that an individual in patch i moves to patch j 300 in a given year. This is computed as the probability that an individual uniformly located in patch i moves a 301 random, Gaussian-distributed distance with mean zero and standard deviation σ and ends up in patch j; if 302 the corresponding geographic regions are denoted A i and A j then this can be computed as This computation is done by numerical integration using the R package landsim (Ralph 2017).

304
Our simulation performs one step in the life cycle for each year, as follows:  the geographic and biological setting to give us more nearly independent information. We refer to these as i ) based on the location of I i (solid arrow), 3. for each of j simulations draw the n replicates (connected by gray lines). Samples I i could be empirical data or individuals in a reference simulation, in which case their location is simply their patch.

Model behavior 387
The IBM simulations produced a population with age-structure and dynamics similar to that seen in the     values. Thus, these methods provide a way to estimate parameters for which it is difficult to obtain data 415 within the context of relatively well-known parameters.

416
In addition, we have introduced spatial analogues of common population genetic statistics and showed how  A, B, C, D). In each case we 467 describe the labelling of the tips with a dictionary-like notation: {key : value}. We also denote the probability 468 that the first coalescence involves individuals from groups A and B, for example, asĀB.

469
For Y statistics our heuristic isBC − 1 2 (ĀC +ĀB). For F -statistics our heuristic isĀC +BD − (ĀD +BC). 470 We conjecture that for both of these heuristics, if the value is positive, so is the statistic. For example, if and so without other changes under the first labelling statistic must be less positive. However under the second labelling, coalescence within group 2 has no effect and so it should be equal to the statistic 501 on equal-sized groups.

502
• ii): For Y , coalescence within the group 1 has no effect and so it should be equal to the statistic on 503 equal-sized groups.

504
• iii): For Y , coalescence within the group 1 has no effect and so it should be equal to the statistic on 505 equal-sized groups.  (M,  nsamples=500, length=1e6, Ne=1e4, recombination_rate=1e-8, mutation_rate=1e-8, num_replicates=5)) tsl = tuple(run_sim(M, nsamples=500, length=1e6, Ne=1e4, recombination_rate=1e-8, mutation_rate=1e-10, num_replicates=5)) nstats = int(n * (n-1) / 2 / 2) C Biologically-motivated custom statistics 521 The first custom statistic aims to capture the overall timescale over which sampled alleles find a common 522 ancestor; for this we use divergence averaged across all comparisons. The next three custom statistics aim to 523 quantify the timescale over which individual alleles sampled from opposite sides of the population range find 524 a common ancestor; for this we apply the three-point statistics to groups spanning the range, with the focal 525 group in one corner and the two comparison groups on the opposite end of the landscape. For example, the Y 526 statistic with a focal group in the northwest corner and the comparison groups on the east side is Y 3 (A; C, G). 527 Specifically, we computed Y 3 with focal groups in the west, Y W 3 = 1 2 (Y 3 (A; C, G) + Y 3 (D; C, G)); and east (G; A, D)). We also computedF corners 3 , the average of the four F 3 statistics having 529 a corner population as focal and the two most distant corner populations as the other two arguments.

530
The final three custom statistics aim to quantify differences in timescales over which alleles find common 531 ancestors depending on whether they are sampled from i) within the same region, ii) neighboring regions, or 532 iii) non-neighboring regions. To do this, we categorized every divergence statistic π(A, B) according to these 533 three categories, i.e., whether i) A = B, ii) A and B were neighbors, or iii) otherwise, and averaged mean 534 divergences within each of these three categories, denoting these quantities π w , π n , and π nn , respectively.

535
Then, we took differences of these average divergences to create three statistics: between neighbors and 536 within-region, π n − π w ; between non-neighbors and within-region, π nn − π w ; and between neighbors and 537 non-neighbors, π n − π nn . The neighbors and non-neighbors for each group in Figure 1C (using a King's 538 neighborhood) are listed in  Figure B.1: Comparing F 4 and Y 3 calculated using sequence data (y-axes) and using branch lengths of marginal genealogies (x-axes) for high (circles and red lines) and low (gray dots and blue lines) mutations rates. The slopes of the lines in each plot are the mutation rates. thus Y 3 has a larger value. This is similar to Figure 4i where the three-point statistic decreases in magnitude 549 when the non-focal population is larger: Y (1; 2, 2) > Y (1; 2 , 2 ) because group 2 has a larger population size 550 than 2.

552
Simulations were run for 15000 years, starting from the age distribution shown in Figure D.3, which is 553 roughly at equilibrium. The mean realized lifetime fitness (number of offspring) versus age for both males and females is shown in