Effects of fine-scale population structure on the distribution of 1 heterozygosity in a long-term study of Antirrhinum majus 2

13 Many studies have quantified the distribution of heterozygosity and relatedness in natural 14 populations, but few have examined the demographic processes driving these patterns. In this 15 study, we take a novel approach by studying how population structure affects both pairwise 16 identity and the distribution of heterozygosity in a natural population of the self-incompatible 17 plant Antirrhinum majus . Excess variance in heterozygosity between individuals is due to 18 identity disequilibrium (ID), which reflects the variance in inbreeding between individuals; it 19 is measured by the statistic g 2 . We calculated g 2 together with F ST and pairwise relatedness 20 (F ij ) using 91 SNPs in 22,353 individuals collected over 11 years. We find that pairwise F ij 21 declines rapidly over short spatial scales, and the excess variance in heterozygosity between 22 individuals reflects significant variation in inbreeding. Additionally, we detect an excess of 23 individuals with around half the average heterozygosity, indicating either selfing or matings 24 between close relatives. We use two types of simulation to ask whether variation in 25 heterozygosity is consistent with fine-scale spatial population structure. First, by simulating 26 offspring using parents drawn from a range of spatial scales, we show that the known pollen 27 dispersal kernel explains g 2 . Second, we simulate a 1000-generation pedigree using the 28 known dispersal and spatial distribution and find that the resulting g 2 is consistent with that 29 observed from the field data. In contrast, a simulated population with uniform density 30 underestimates g 2 , indicating that heterogeneous density promotes identity 31 disequilibrium. Our study shows that heterogeneous density and leptokurtic dispersal can 32 together explain the distribution of heterozygosity. 33


38
For most organisms, gene dispersal and therefore relatedness are spatially structured, such 39 that individuals closer in space are more likely to mate, and be more closely related, than 40 individuals further apart [1], [2]. Such spatial population structure causes decreasing genetic 41 similarity with geographic distance (isolation-by-distance [3]); this reduces the mean 42 heterozygosity of the whole population relative to a well-mixed population. Despite the 43 ubiquity of these patterns in nature, the role of demography and gene dispersal in determining 44 the spatial pattern of genetic variation has not been thoroughly explored. Commonly used 45 spatial models typically assume discrete demes and/or a uniform population density. 46 However, natural populations are typically patchy, with heterogeneity in both the distribution 47 and density of individuals. Patchy and heterogeneous spatial distributions within natural 48 populations should result in spatial variation in inbreeding and, consequently, excess variance 49 in heterozygosity. Despite this prediction, the effect of spatial heterogeneity on 50 heterozygosity has rarely been examined in the population structure literature. Moreover, it is 51 the interplay of heterogeneous density and dispersal that likely shapes the spatial structuring 52 of genetic relatedness between individuals. This highlights the importance of understanding 53 the factors (e.g., life history, demography, population structure) that contribute to shaping the 54 full distribution of heterozygosity and relatedness in a spatially structured population. 55 56 Understanding the drivers of variation in inbreeding within populations is fundamental, given 57 its importance to genetic diversity and to fitness. Quantifying variation in inbreeding and 58 combining this with measures of fitness (or fitness proxies) makes it possible, in principle, to 59 estimate inbreeding depression either through pedigrees [4], [5] or heterozygosity-fitness 60 correlations (HFCs). For HFCs, inbreeding depression is estimated by comparing proxy 61 measures of fitness against heterozygosity, with the expectation that offspring from related 62 individuals will have lower heterozygosity. Variance in inbreeding is therefore essential for 63 HFCs to be detected [6]. In addition, variance in inbreeding is interesting per se because it 64 depends on both demographic history (e.g., [7]) and mating system (selfing, partial selfing or 65 outcrossing) [8]. Outcrossing species, with generally low levels of inbreeding, provide an 66 opportunity to examine factors other than mating system variation that may affect inbreeding 67 variation, and thus, variance in heterozygosity. 68 69 If there is variation in inbreeding between individuals, heterozygosity at different loci will be 70 correlated. The covariance between loci in heterozygous state is termed identity 71 disequilibrium (ID), by analogy with linkage disequilibrium, which is the covariance in 72 allelic state between loci. ID can be calculated across individuals and divided by the square of 73 the mean heterozygosity to calculate the population statistic g2, which is a measure of 74 variance in identity by descent [6] amongst individuals. For an outcrossing organism with 75 fine-scale population structure, spatial patterns of density and mating could have strong 76 effects on the degree of mating with related individuals, and thus affect identity 77 disequilibrium and g2. Furthermore, as sessile organisms, mating and offspring dispersal in 78 plants are mediated by external vectors (pollinators and seed dispersal mechanisms) [9]. 79 Consequently, the shape of the distribution of dispersal of both pollen and seed will also have 80 an impact on g2. Additionally, as partial selfing will produce identity disequilibria across loci 81 for selfed individuals, g2 can be used to estimate the selfing rate of a population, with this 82 estimator being robust to null alleles and biparental inbreeding [10] individual in ~5 km stretches of two parallel roads that cross the Vall de Ribès, dubbed the 132 "lower road" (GIV-4016; ~1150 m elevation) and "upper road" (N-260; ~1350 m) (Fig. 1). 133 We also sampled along small side roads, railroad embankments, rivers, and hiking trails. The 134 plants grow preferentially along exposed areas such as roads, therefore, density was very low 135 away from these disturbed areas between the main sampling sites of the lower and upper 136 roads. In some years, we were limited to genotyping only in the core area, ~1 km along each 137 road. We calculated multilocus heterozygosity for each individual pooling across all years, denoted 211 here by H, defined as the fraction of heterozygous loci in an individual. In this system 212 "generations" cannot be clearly defined because of seed dormancy and perenniality. 213 However, pooling data across years only reduced H by 0.08%. 214 215 We observed an excess of individuals with around half the mean heterozygosity (see Results). 216 To check whether the pattern was consistent with rare selfing, we compared the likelihood of 217 a single Gaussian to a mixture of two Gaussian distributions, one with the observed mean and 218 variance and the other with half its mean and variance. 219 220 The variance in individual heterozygosity consists of two components. The first is due to the 221 variance in whether an individual locus is heterozygous, and decreases in proportion to the 222 number of SNP, n: it equals (1 − ) 2 2 (1 − 2 ) ̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅ / . The second is due to covariance in 223 heterozygosity between loci, which is termed the identity disequilibrium (ID Additionally, g2 can be used to estimate selfing rate within a population [10]. Using the 251 software SPAGeDi [22], which implements the g2-based selfing rate calculation described in 252 [10], the selfing rate was estimated for the full population using the 91 SNP data. 253 254 Effects of pollen dispersal on heterozygosity 255 256 With isolation by distance, the distribution of heterozygosity is expected to depend on the 257 distance between parents: heterozygosity of offspring from nearby parents will have a lower 258 mean and higher variance compared to offspring from distant parents. To test this prediction, 259 we simulated offspring using all field individuals as mothers and choosing fathers from a 260 given distance away (detail in SM1.3). Then we compared the distribution of H between the 261 field data and offspring simulated from matings with three models of pollen dispersal: the 262 nearest neighbor to the mother, a Gaussian distribution (σ = 300 m), and a leptokurtic 263 dispersal kernel sampled from 1463 empirical measurements of pollen dispersal, estimated as 264 the distances between assigned parents (electronic supplementary material; D. Field, 265 unpublished data). A CDF of the latter distribution (SM1.3: Fig. S3 with the local population structure that determines the variation in inbreeding amongst 299 individuals within an area of a few km 2 , which will equilibrate rapidly [25]. The spatial 300 pedigree was generated by choosing parents for each individual according to a backwards 301 dispersal distribution measured empirically. The seed and pollen dispersal distances are 302 estimated respectively as the distance between offspring and nearest parent (assumed to be 303 the mother) and between parents (electronic supplementary material; D. Field, unpublished 304 data). For every offspring, the mother and father are chosen from randomly drawn distances 305 from the seed and pollen dispersal distributions. To choose a parent from a distance r, 6 306 points are assigned randomly on a circle of radius r centred at the focal individual and the 307 nearest individual to each of them are found. The closest individual to any of these points is 308 then chosen as the parent. The accuracy of our algorithm is verified by comparing the 309 specified and realised seed and pollen dispersal distributions for the simulated pedigrees 310 (SM1.4: Fig. S7, Table S4). The same procedure is repeated for the father, taking the mother 311 as the starting point. Since A. majus is self-incompatible, the mother and father are not 312 allowed to be the same individual. 313 314 Once the spatial pedigree is generated, 10 replicate sets of genotypes are assigned by 315 dropping genes down the pedigree, starting with equal expected frequencies of both alleles at 316 each of 91 loci. In fact, one could start with any initial frequencies, since FST-like measures 317 are independent of them. Population size was adjusted so that FST matched the empirical data 318 for the simulated sampling area. This was done by first simulating the population with an 319 initial population size (N) and then repeating the process with higher or lower N until the 320 desired FST is attained. 321 322 Next, we simulated a population with realistic heterogeneous spatial structure by using the 323 individual locations available for the years 2009 to 2019 in the A. majus focal population 324 (SM1.4: Fig. S5). There were fewer individuals from 2017-2018, so these were merged, 325 giving distribution data for 10 time points. We randomly sample from the ten consecutive 326 time points, and repeat for 100 cycles, thus iterating for 1000 generations. We sub-sample 327 from these locations to maintain a constant population size (N). If N is greater than the 328 number of plants available in a given time point, say k, all k plants are first included and the 329 remaining N-k locations were re-sampled from the same time point, displaced at a random 330 angle on a circle of radius 3m to avoid having plants in the same location. This naïve 331 approach allows us to simulate a spatial structure that is realistic over at least small scales. 332 We then generated a pedigree following the procedure used for the uniform population, again 333 adjusting population size to match the empirically observed FST. Ten replicate sets of 334 genotypes were run for each of five replicate pedigrees. 335 336 Patterns of isolation by distance, heterozygote deficit (FIS) and identity disequilibrium were 337 compared between the two simulation types and the field data (calculated from the simulated 338 sub-area of the field site). As the fitted population sizes were large (see Results), obtaining 339 direct estimates of identity by descent and thus FST from the pedigrees was not feasible. 340 Instead, FST was obtained for a pedigree as the average of replicate genotype sets generated 341 from that pedigree. FIS was calculated from the observed and expected heterozygosity. Values 342 of g2 were calculated for each replicate from each pedigree using InbreedR (in R version 343 3.6.1 [20]). 344 345

347
Isolation by distance 348 If we consider pairs of individuals within 20m of each other, the average FST over the eleven 349 years is 0.0244; however, this is an average over a quantity that depends strongly on distance. 350 The average pairwise Fij was calculated each year for individuals separated by different 351 distance classes and then averaged across years. Pairwise relatedness (pairwise Fij) between 352 individuals decreased rapidly with geographic distance, showing isolation by distance (Fig.  353  2A). The sharp decline in pairwise identity over short spatial scales corresponds precisely to a 354 rapid increase in H with distance between parents (SM1.3: Fig. S1), since heterozygosity is 355 determined by the probability of identity by descent between the genes from each parent. 356 Note that over large separations (>1Km), pairwise Fij values are necessarily negative, because 357 distant individuals are less closely related than the average for the whole population. 358

Variation in inbreeding 359
Excess variance in the distribution of individual heterozygosity (H) in the field data shows 360 that there is variance in inbreeding in the population (Fig. 2B). Furthermore, there is an 361 excess of individuals with around half the mean heterozygosity (i.e., with H~0.22, rather than 362 0.44; Fig. 2B, blue, lower left). These might be due to a low rate of selfing, and using the g2 363 estimator calculated with SPAGeDi, the selfing rate for the population is estimated to be 364 1.2%. Indeed, a mixture between two Gaussian with means ~ 0.22 and 0.44, and variances in 365 the same ratio, fits significantly better than a single Gaussian (Fig. 2B, compare red and black 366 to blue) with an increased likelihood of 11.3. However, we shall see in the next section that 367 this excess is also consistent with matings between close relatives, without the need to invoke 368 a breakdown in self-incompatibility. 369 370 To examine whether the observed distribution of heterozygosity is significantly different to a 371 distribution taken from a population with zero identity disequilibrium (ID), we compared the 372 field data with heterozygous values shuffled across individuals, which eliminates ID by 373 removing correlations between loci. We found greater variance in heterozygosity in the 374 observed compared to the randomly shuffled field data (Fig. 2B, gray). For both data sets, the 375 mean heterozygosity (0.44602) necessarily remains the same, but the observed variance in the 376 field data (var(H) = 0.00336) was significantly higher than the average variance in The overall ID, as measured by g2, is due to correlations in heterozygosity between all pairs 391 of loci, most of which are unlinked. We expect stronger correlations between linked loci, 392 because relatives will share blocks of genome. We found that the mean covariance in 393 heterozygosity between SNP on the same linkage group is substantially stronger than the 394 overall mean (0.00265 vs. 0.00056). If we restrict attention to those individuals with H<0.3, 395 we find that the covariance in heterozygosity between SNP on the same linkage group is still 396 higher (0.00649), as expected if close relatives share long blocks of genome IBD. This higher 397 covariance in heterozygosity translates to higher mean g2, which is seen within linkage 398 groups compared to the overall value (SM1.2: Table S1). 399 400 Effects of pollen dispersal on heterozygosity 401 The heterozygosity of simulated offspring depends on distance between their parents, with a 402 rapid increase in mean H with distance (SM1.3: Fig. S1). We compared the observed 403 distribution of heterozygosity with three alternative scenarios for pollen dispersal. There was 404 no significant difference between the mean and variance of heterozygosity between the field 405 data and offspring simulated from the observed leptokurtic dispersal . However, the mean and 406 variance of heterozygosity differed between the field data and simulated matings with either 407 nearest neighbors, or with Gaussian dispersal (Fig. 3A, SM1.3: Tables S2-S3). While all three 408 dispersal schemes differed in the distribution tail as assessed by Kolmogorov-Smirnov tests, 409 Gaussian and nearest neighbour matings are very different from the field data compared to 410 the leptokurtic distribution (SM1.3: Table S3). These comparisons were made for a single 411 replicate, but because each involves 22,353 individuals, there was little variation in the mean 412 and variance between replicates. 413 We next examined deviations in the left tail of the distribution, where an excess of low 414 heterozygosity individuals might arise from selfing or from matings between close relatives. 415 We focused on the leptokurtic dispersal curve, which was the distribution closest to the field 416 data. We estimated the increase in likelihood between fitting a single versus mixed Gaussian 417 distribution (see "Variation in inbreeding") for 100 replicate simulations. We found that the 418 mixed Gaussian was a better fit than a single Gaussian, with an increase in log likelihood 419 greater than 2 for 69 of 100 replicates. The estimated fraction of putatively "selfed" 420 individuals was 0.00043, averaged over replicates, which is about half the estimate from the 421 actual data, 0.00086. In comparison, only 4/100 replicates gave higher estimates than that 422 observed (SM1.3: Fig. S2). This suggests that the excess of individuals with low 423 heterozygosity can to a large extent be explained by matings between relatives under 424 leptokurtic pollen dispersal. Nevertheless, there is a marginally significant excess of such 425 individuals, with twice as many being seen as expected from our simulations. There is 426 considerable variation in fit between replicates, simply because deviations in the tail involve 427 few individuals. 428 The coefficient g2 reflects excess variation due to identity disequilibrium, and showed similar 429 patterns as the variance in H. Here, we found no significant difference between g2 from field 430 data and offspring from simulated matings with leptokurtic pollen dispersal. However, g2 431 from Gaussian and neighbor matings were 80% higher than g2 from field data and leptokurtic 432 matings. This nominally represents a significant difference given that the 95% confidence 433 intervals between these groups do not overlap (Fig. 3B). However, as we discuss below, these 434 confidence intervals only include sampling error, and not the additional variance due to 435 random evolutionary realizations. 436 437 438 Figure 3. A: Probit transform of the CDF of multilocus heterozygosity, H, for the field data 439 (blue) versus a single replicate of offspring simulated from Gaussian pollen dispersal 440 (orange), nearest neighbor matings (gray), and leptokurtic pollen dispersal (green). A normal 441 distribution (black) with the same mean and standard deviation as the field data is included 442 for comparison B: Identity disequilibrium (g2) for the same data as above indicating mean 443 and 95% CI. 444 445 Heterozygosity in a simulated spatial pedigree 446 447 In the previous section, we simulated offspring across one generation. To examine whether 448 the observed heterozygosity is consistent with a spatially structured model, we simulated 449 pedigrees over 1000 generations with uniform and heterogeneous density, conditioned on the 450 locations of individuals observed over ten years, repeated over 100 cycles for the latter case. 451 The realized seed and pollen dispersal matched the empirical seed and pollen dispersal 452 distribution for both density types (SM1.4: Fig. S7, Table S4). We required N = 15500 453 individuals for the heterogeneous density model and 40000 individuals for the uniform 454 density, in order to match the observed FST ~ 0.022 calculated over a 20m scale from the 455 simulated sub-area of the field site (SM1.4: Table S5). Up to distances of 1km, the decline in 456 pairwise identity with distance matched between the field data and the five replicate 457 pedigrees simulated with heterogeneous density (Fig. 4A, SM1.4: Fig. S8A). High variation 458 among replicates suggests that many more SNPs would be needed to match the pattern from 459 the pedigree (SM1.4: Fig. S8B); moreover, linkage would increase this variance to some 460 extent. We also compared the pattern of isolation by distance from the field data to that from 461 the pedigrees generated for both the heterogeneous and uniform density scenarios (Fig 4B,  462 SM1.4: Fig. S9); the heterogeneous density is a much better fit than the uniform density 463 (SM1.4: Table S5). 464 465 466 Figure 4. A:. Isolation by distance compared between the field data (blue) and five replicate 467 simulated pedigrees (gray) based on a heterogeneous population density. B: Isolation by 468 distance from the field data (blue) compared between the simulated pedigree with a 469 heterogeneous (gray) and uniform (orange) population density. 470 471 Identity disequilibrium (g2) estimates from the genotypes from pedigrees simulated with 472 heterogeneous density showed substantial variation between the five simulated pedigrees, and 473 between the ten draws of 91 SNPs from each pedigree (Fig. 5). The average g2 estimated 474 from the five pedigrees (each with 10 replicates) is 0.00264, which is consistent with the 475 observed mean annual g2 from the field of 0.00262. On the other hand, when assuming a 476 uniform density, the average g2 of 0.00171 is significantly lower than the field data. Note that 477 the confidence limits for the field data, generated by InbreedR, only include error due to 478 sampling a limited number of individuals. These errors do not account for sampling a limited 479 number of SNPs, or the random variation between evolutionary realizations (see Discussion). 480 481 482 Figure 5. Identity disequilibrium (g2) calculated from field data versus simulated pedigrees. 483 Five of ten replicates per pedigree are shown (gray: heterogeneous density, with five 484 simulated pedigrees; orange: uniform density, with one simulated pedigree). Mean from the 485 field (light blue) is across 2009-2019, while mean from the heterogeneous (black) and 486 uniform (yellow) simulations is across all replicates. The final year of field data (dark blue) is 487 comparable to g2 calculated from the final year of pedigree replicates (gray and orange). 488

491
An enduring problem in evolutionary biology is understanding how demographic processes, 492 such as heterogeneous density and dispersal, interact with spatial structure to determine the 493 distribution of heterozygosity within populations. In this study of a long-term dataset, 494 including more than 20,000 plants sampled over 11 years, we combine field data and 495 simulations to address questions central to understanding how demography can influence 496 patterns of heterozygosity. Namely, can we predict the distribution of heterozygosity for an 497 outcrossing species from key demographic parameters? To address this question, we first 498 confirmed that there was significant correlation in heterozygosity between markers (g2, a 499 measure of identity disequilibrium), which implies variation in inbreeding. By simulating 500 offspring from matings between geo-referenced, genotyped individuals, we show that the 501 mean heterozygosity increases, and the variance of heterozygosity decreases, with increasing 502 distance between parents; strikingly, these changes occur over very short scales (~10m, 503 SM1.3: Fig. S1). We found that the observed distribution of heterozygosity is consistent with 504 the known leptokurtic distribution of pollen dispersal. We also simulate the population over 505 1000 generations using the actual seed and pollen dispersal kernels, and the observed 506 heterogeneous density. We found that this model matches the observed identity 507 disequilibrium, whereas a model with uniform density substantially underestimates the 508 observed patterns. Thus, we explain the distribution of heterozygosity (mean, variance and 509 tails) using known features of the population. Moreover, our results also highlight the 510 limitations of making theoretical predictions from simulations that only assume simple 511 demographies. Taken together, our findings highlight the potential for using the observed 512 demography to explain the distribution of genetic diversity, and specifically the variance in 513 inbreeding in spatially continuous populations. 514 515 Variation in heterozygosity within populations provides the potential for selection to reduce 516 the frequency of less fit, inbred individuals. The association between inbreeding and fitness is 517 often tested through heterozygosity-fitness correlations (HFC), which quantify inbreeding 518 depression in natural populations by correlating measures of fitness with heterozygosity [6]. 519 Many studies that test for HFCs find that the excess variation in heterozygosity, g2, which 520 arises from identity disequilibrium, is low and rarely significant [26]. In our study, we 521 estimate a significant g2 of 0.0029 (95% CI: 0.0026-0.0033). Although low, this estimate is of 522 the same order as most of the g2 values found across 105 vertebrate populations in a meta-523 analysis of 50 HFC studies (average of 0.007) [26], and on the same order as ~60% of the 524 local populations surveyed in a long-lived tree [27]. Our estimate of significant variation in 525 heterozygosity provides the opportunity to examine potential drivers of this variance and 526 examine how density, spatial structure and dispersal contribute to a non-uniform distribution 527 of heterozygosity. 528 529 In our study, beyond simply estimating identity disequilibrium, we use two types of 530 simulation to explore how demography shapes variation in inbreeding. The first simulation 531 shows how the spatial pattern of pollen dispersal affects the distribution of heterozygosity. 532 Simulated matings with the empirically measured leptokurtic pollen dispersal curve were 533 consistent with the actual g2, compared to matings with nearest neighbors or a Gaussian 534 pollen dispersal. This result is somewhat surprising because we did not include the 535 complexities of the mating system of A. majus. Antirrhinum majus has a gametophytic self-536 incompatibility system (GSI [28]), whereby the pollen detoxifies secretions from the style 537 unless the pollen and style genotypes share alleles at the S-locus [29]. This system not only 538 prevents selfing, but also reduces mating among relatives (i.e., biparental inbreeding) because 539 related plants are more likely to share S-alleles [4], [30]. Thus, we might expect that our 540 simulated matings would have lower mean heterozygosity than the empirical measurement; 541 yet we found no evidence for such an effect. Indeed, we found that the excess of individuals 542 with low heterozygosity, around half the mean, can be explained largely by a small amount of 543 bi-parental inbreeding with leptokurtic pollen dispersal (Fig. 3, SM1.3: Fig. S2). However, 544 we have little statistical power to distinguish this from rare selfing, which can occur in self-545 incompatible species. In fact, using the g2 estimator of selfing rate from eq. 9 in [10], our 546 significant g2 value would imply a selfing rate of 1.2% for this population. However, as 547 shown by [11], this estimate could be within the bounds of the upward bias of the estimator if 548 strong biparental inbreeding is present, hence, this does not necessarily imply a breakdown of 549 self-incompatibility. We believe that our method, fitting a model of two Gaussians, is a more 550 robust way to estimate selfing than using g2, since it focuses on the low-H individuals rather 551 than the whole variance. However, it is still challenging to distinguish selfing from close 552 inbreeding. 553 554 Our second simulation approach asked whether heterogeneous density promotes variation in 555 inbreeding, given strong fine-scale population structure indicated by a rapid decay in pairwise 556 Fij (over a few metres, Fig. 2A). We only provide a proof-of-principle, by asking whether a 557 plausible model of spatial structure can explain the observed heterozygosity. We do not 558 include all features of the actual populationin particular, we extrapolate by repeatedly 559 sampling ten years of spatial distributions; we ignore linkage; we simplify the self-560 incompatibility system; and we assume an annual life cycle (no perenniality or seed bank). 561 Indeed, simulated pedigrees with uniformly distributed plants gave less identity 562 disequilibrium than we observed. In contrast, simulated pedigrees conditioned on the actual, 563 heterogeneous density of plants were consistent with identity disequilibrium measured in the 564 field. This indicates that patchiness combined with leptokurtic dispersal shapes the 565 distribution of heterozygosity. Simulations with heterogeneous density also better capture 566 empirical isolation-by-distance patterns than those with a uniform density (Fig 4B, SM1.4: 567 Fig. S9). However, the effective population size of 15,500 individuals in the heterogeneous-568 density simulations is an order of magnitude larger than the average number of plants 569 observed in a year (~2500). We believe that most plants are sampled each year, so that this 570 discrepancy is more likely to be due to a seed bank, which is expected to substantially 571 increase the effective population size [31]. Nevertheless, despite simplifications such as non-572 overlapping generations, no seed bank, and a simple SI system, the heterogeneous-density 573 simulation accurately captures patterns of identity disequilibrium and isolation-by-distance. 574 575 Our estimation of identity disequilibrium illustrates a general problem with statistical 576 comparisons in evolutionary biology. There are three sources of error in estimating g2: firstly, 577 error generated from sampling a limited number of individuals, secondly, from sampling a 578 limited number of SNPs, and thirdly from random variation between evolutionary realizations 579 or trajectories. In our study, the first source (a limited number of individuals) is shown by the 580 confidence intervals in Fig. 5, obtained by bootstrapping across individuals [21]. The second 581 source of error (a limited number of SNPs) is shown by the substantial variation in g2 of the 582 ten replicates of each of five pedigrees. Here, variation is generated by random meiosis 583 amongst unlinked markers on a fixed pedigree. This variation could be reduced by increasing 584 the number of SNPs, but the effective number of segregating sites that can be included in the 585 analysis is fundamentally limited by the length of the genetic map. Finally, there is additional 586 variation between pedigrees, due to the random assignment of parents in the simulations, 587 which generates a random pedigree. The wide variation in estimates of g2 due to random 588 meiosis, and to the random generation of the pedigree (Fig. 5) is an important reminder that 589 estimates of parameters are typically limited by the randomness of evolution. The 590 stochasticity of evolution can potentially generate error variance far higher than that due to 591 the limited number of individuals or SNPs sampled. 592 593 In addition to analyzing the effect of population structure on the distribution of 594 heterozygosity, our study highlights the potential of utilizing multiple statistics to estimate 595 population structure. We have shown that the variance of heterozygosity due to identity 596 disequilibrium can distinguish alternative dispersal and density distributions, which implies 597 that in combination with pairwise Fij as a function of distance, g2 can help estimate the 598 demography. Genetic data contain far more information than is described by FST and g2; for 599 example, the mean squared disequilibrium can be used to estimate effective population size 600 [32], [33], and this extends naturally to the covariance of pairwise linkage disequilibrium as a 601 function of distance. We could simply use a set of such statistics to inform demographic 602 inference via ABC [34]. However, our preference would be to first develop a theoretical 603 understanding of how realistic demographies influence statistical measures of spatial 604 covariance in allele frequency, identity disequilibria, and linkage disequilibria. 605 606 The distribution of heterozygosity has often been measured to estimate inbreeding depression 607 and examine correlation with fitness. Yet, this type of data has rarely been used to investigate 608 population structure per se and as a complement to the more widely used pairwise identity, 609 FST. By bringing together local inbreeding and isolation-by-distance, our study provides a 610 novel assessment of how dispersal and population density can explain both pairwise identity 611 and the distribution of heterozygosity in spatially continuous populations. However, we have 612 only begun to investigate how the distribution of heterozygosity can be shaped by population 613 structure and demographic parameters. Our future work will focus on understanding how 614 other features such as a seed bank influence genetic diversity, with the ultimate goal of 615 deriving information about demographic history from the distribution of heterozygosity in 616 populations that have fewer measured parameters. New models that include these 617 complexities, as well as ecological, mating system and life history factors are required to 618 extend our understanding of the drivers of population structure in natural populations. 619 620

622
All data and code used to generate simulated data and carry out analysis is available at: