bioRxiv
New Results

Haplotype-based inference of recent effective population size in modern and ancient DNA samples

Romain Fournier, David Reich, Pier Francesco Palamara
doi: https://doi.org/10.1101/2022.08.03.501074
Romain Fournier
1Department of Statistics, University of Oxford, Oxford, UK
  • For correspondence: romain.fournier@stats.ox.ac.uk palamara@stats.ox.ac.uk
David Reich
2Department of Genetics, Harvard Medical School, Harvard, Boston, USA
3Broad Institute of Harvard and MIT, Cambridge, USA
4Department of Human Evolutionary Biology, Harvard University, Cambridge, USA
5Howard Hughes Medical Institute, Harvard Medical School, Boston, USA
Pier Francesco Palamara
1Department of Statistics, University of Oxford, Oxford, UK
6Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
  • For correspondence: romain.fournier@stats.ox.ac.uk palamara@stats.ox.ac.uk

1 Abstract

Individuals sharing recent ancestors are likely to co-inherit large identical-by-descent (IBD) genomic regions. The distribution of these IBD segments in a population may be used to reconstruct past demographic events such as effective population size variation, but accurate IBD detection is difficult in ancient DNA (aDNA) data and in underrepresented populations with limited reference data. In this work, we introduce an accurate method for inferring effective population size variation during the past ~2,000 years in both modern and aDNA data, called HapNe. HapNe infers recent population size fluctuations using either IBD sharing (HapNe-IBD) or linkage disequilibrium (HapNe-LD), which does not require phasing and can be computed in low coverage data, including data sets with heterogeneous sampling times. HapNe showed improved accuracy in a range of simulated demographic scenarios compared to currently available methods for IBD-based and LD-based inference of recent effective population size, while requiring fewer computational resources. We applied HapNe to several modern populations from the 1,000 Genomes Project, the UK Biobank, the Allen Ancient DNA Resource, and recently published samples from Iron Age Britain, detecting multiple instances of recent effective population size variation across these groups.

2 Introduction

The increasing availability of high-quality genomic data for both modern and ancient samples is creating exciting new opportunities for data-driven investigation of key evolutionary parameters. Among these, the effective size of a population plays an essential role in population biology1. A population’s effective size is defined as the number of individuals in an idealized evolutionary model2,3, and the ability to infer it from genomic data has a wide range of applications, including the study of past demographic events4,5 and cultural practices6, the quantification of the effectiveness of natural selection1,7, and the prediction of viability in conservation biology8.

Several statistical tools have been developed to reconstruct the trajectory of effective population size from genomic data9, each leveraging different genomic features and enabling the analysis of different data types. Methods that rely on the site frequency spectrum (SFS) of a sample10–13 avoid modeling recombination and are thus scalable, but require high-quality sequencing data to estimate the SFS and have been observed to be statistically inefficient14. Methods that model both mutation and recombination processes15–19, on the other hand, tend to scale to smaller sample sizes and require high-quality genome sequencing data. Recent approaches enable simultaneous modeling of recombination and allele frequencies in unphased sequencing data18, or scaling to larger sample sizes for accurately phased sequencing data20. Finally, several methods that focus on capturing the signature of recombination through the sharing of identical-by-descent (IBD) haplotypes21–25 or linkage disequilibrium (LD)26–29 have been developed.

Inference of recent population size fluctuations is particularly appealing because it provides unique insights into demographic and evolutionary processes that are specific to the analyzed population. IBD-based methods have been used to infer recent demographic history21–23,25 in SNP array and sequencing data. A key limitation of these methods is that they rely on accurate detection of IBD regions30–33. The performance of these algorithms depends on accurate long-range computational phasing, which may be hard to obtain, particularly in low coverage ancient DNA data. While being a less direct measure of the signature of past recombination events, LD-based summary statistics can be computed in unphased samples, including SNP array and ancient DNA data. LD has been extensively modeled34–38 and applied to infer effective population size26–29,38,39. The most recent methods for IBD- and LD-based inference, IBDNe25 and GONE29, enable inference of population size fluctuations in time, without assuming a strictly parametrized demographic model. This strategy, however, poses additional challenges, due to the need to adequately regularize the inferred models23,25 to avoid reporting spurious fluctuations, while preserving manageable computational costs.

Here, we present a new method, called HapNe, that enables flexible inference of recent effective population size fluctuations using IBD or LD summary statistics, and can be used to analyze both phased and unphased SNP array or sequencing data, including low coverage or ancient DNA data with heterogeneous sampling time. Using extensive coalescent simulations, we show that HapNe accurately and efficiently infers recent demographic history, while regularizing the model to control for spurious oscillations in recent generations. We applied HapNe to reconstruct recent demographic history in both modern and ancient data, including populations from the 1,000 Genomes Project and different postcodes from the U.K. Biobank data set, where we observed a bottleneck in the Late Middle Ages corresponding to the period of the Black Death. We also analyzed ancient individuals from the Caribbean, Scandinavian Vikings, and individuals who lived in England during the Iron Age, observing isolation and expansion events that are consistent with past historical events, such as the transition from the Archaic to the Ceramic culture in the Caribbean.

3 Results

3.1 Overview of the HapNe algorithm

The HapNe algorithm infers recent effective population size using either IBD or LD data (see Methods and Supplementary Note for a detailed description of the algorithm). We refer to these two approaches as HapNe-IBD and HapNe-LD, respectively. HapNe-IBD uses IBD sharing information to compute summary statistics related to the count of IBD segments of different lengths. However, accurate detection of IBD segments typically relies on phasing information and modeling of haplotype sharing to differentiate between identical-by-state (IBS) and truly IBD regions. Accurate phasing and haplotype modeling may not be possible if the analyzed genomes are not of high quality or not well represented in reference panels. HapNe-LD, on the other hand, leverages summary statistics related to long-range LD (Pearson correlation between sites). These LD statistics are easy to compute and do not require genotypes to be either phased or of high quality, enabling the analysis of past demographic events in low coverage or aDNA data.
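To make the notion of an LD statistic based on the Pearson correlation between sites concrete, the following minimal sketch computes the squared correlation between unphased genotype dosages at two sites. It is an illustration only and not HapNe's estimator: the function name `genotype_r2`, the 0/1/2 dosage encoding, and the use of `None` for missing calls are assumptions made for this example.

```python
import math

def genotype_r2(site_a, site_b):
    """Squared Pearson correlation between genotype dosages at two sites.

    Each input holds one entry per individual: 0, 1 or 2 copies of the
    alternate allele, or None for a missing call (hypothetical encoding).
    No phasing is needed because only dosages are used.
    """
    # Keep only individuals genotyped at both sites.
    pairs = [(a, b) for a, b in zip(site_a, site_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return float("nan")  # monomorphic site: correlation is undefined
    return cov * cov / (var_a * var_b)
```

Because missing calls are simply skipped, the same function can be applied to sparse, low coverage genotype vectors, at the cost of noisier estimates.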

HapNe-IBD and HapNe-LD both optimize a composite likelihood. To ensure that the model is appropriately regularized, HapNe utilizes a prior on the effective population size Ne(t) that favors models with minimal population size fluctuations. When the analyzed IBD or LD data does not contain sufficient signal, this regularization mechanism prevents inferring spurious variation in Ne(t), which may be incorrectly interpreted as past demographic events. The resulting approximate posterior is optimized to compute a maximum-a-posteriori (MAP) estimator of Ne(t) and bootstrap resampling is used to provide estimates of uncertainty through approximate 95% confidence intervals. Both methods automatically exclude genomic regions harboring unusually large amounts of IBD or LD, which may be caused by natural selection or the presence of structural variation rather than past demographic events. In addition, HapNe-LD implements a test to detect the presence of possible biases due to the presence of strong LD caused by past admixture events (admixture LD) and can handle samples originating from different time points. The HapNe program is freely available as an open-source software package (see Code Availability).
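The bootstrap step can be illustrated with a generic percentile bootstrap. The sketch below is not taken from the HapNe software: the resampling unit (called `blocks` here, for example per-region summary statistics) and the `estimator` callback are placeholders, and the percentile rule is one common way to form approximate 95% intervals.

```python
import random

def percentile_bootstrap_ci(blocks, estimator, n_boot=1000, alpha=0.05, seed=0):
    """Approximate (1 - alpha) confidence interval by percentile bootstrap.

    blocks    : list of resampling units (hypothetical, e.g. per-region values)
    estimator : function mapping a list of blocks to a single number
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample the same number of blocks with replacement.
        resample = [rng.choice(blocks) for _ in blocks]
        estimates.append(estimator(resample))
    estimates.sort()
    lo_idx = int((alpha / 2) * (n_boot - 1))
    hi_idx = int((1 - alpha / 2) * (n_boot - 1))
    return estimates[lo_idx], estimates[hi_idx]
```

In a demographic setting the estimator would return Ne at a given generation, and the procedure would be repeated for each time point to obtain a confidence band.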

3.2 Performance on simulated modern data

We used extensive coalescent simulations to benchmark HapNe-IBD and HapNe-LD against other recent methods for haplotype-based inference of recent effective population size. To this end, we considered several demographic scenarios (Figure 1a, dotted black lines), including: a constant population size of Ne(t) = 20,000; an exponentially expanding population with 200,000 haploid individuals at t = 0 and 20,000 at t = 50 generations; an exponentially collapsing population with 2,000 living individuals at t = 0 and 20,000 at t = 100; and a population undergoing a strong bottleneck, evolving from 200,000 haploid individuals at t = 0 to 2,000 at t = 25, and then growing back to 20,000 at t = 50. For each of these populations, we simulated 256 diploid individuals. We generated realistic SNP-array data and used the simulated ancestral recombination graph to extract ground truth IBD segments longer than 1cM (see Methods).
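The four scenarios can be written down as piecewise trajectories of Ne(t). The sketch below encodes the sizes and times quoted above as knots; the exponential (log-linear) interpolation between knots and the constant size beyond the last knot are assumptions of this example, since the exact parametrization is given in Methods and not reproduced here. The names `SCENARIOS` and `ne_at` are hypothetical.

```python
import math

# (generation, haploid size) knots taken from the scenario descriptions above.
SCENARIOS = {
    "constant":   [(0, 20_000)],
    "expansion":  [(0, 200_000), (50, 20_000)],
    "collapse":   [(0, 2_000), (100, 20_000)],
    "bottleneck": [(0, 200_000), (25, 2_000), (50, 20_000)],
}

def ne_at(knots, t):
    """Effective size t generations ago.

    Assumes exponential change between consecutive knots and a constant
    size before the first and after the last knot.
    """
    if t <= knots[0][0]:
        return float(knots[0][1])
    for (t0, n0), (t1, n1) in zip(knots, knots[1:]):
        if t <= t1:
            frac = (t - t0) / (t1 - t0)
            log_n = math.log(n0) + frac * (math.log(n1) - math.log(n0))
            return math.exp(log_n)
    return float(knots[-1][1])
```

For example, `ne_at(SCENARIOS["bottleneck"], 25)` returns the minimum size of about 2,000 under these assumptions.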

Figure 1: Benchmarks in simulated modern populations.

(a) Comparison of HapNe-IBD, IBDNe, HapNe-LD, and GONE on simulated SNP-array data (256 individuals) for four different demographic scenarios. (b) Accuracy of the different methods on the “Bottleneck” demographic model as a function of sample size. Error bars correspond to 1.96 × SE computed using 10 independent simulations. (c) Total running time for each method (including IBD segment detection and within-chromosome LD estimation, see Methods).

We initially considered the performance of HapNe-IBD and IBDNe31 in an idealized setting where ground truth IBD sharing information is available (see Supplementary Figure S1). In this scenario, HapNe-IBD generally produced lower error than IBDNe, measured using the root mean squared log-error (RMSLE) over the past 50 generations (see Methods). HapNe-IBD produced stable estimates of effective population size in the very recent past, whereas IBDNe tended to output spurious oscillations, a caveat that was highlighted by the authors31. We next inferred and analyzed LD summary statistics from the simulated array data using HapNe-LD. Because the LD signal reflects the presence of underlying IBD segments (see Supplementary Note), analysis of ground truth IBD data may be seen as an upper bound on the accuracy of HapNe-LD. We observed the RMSLE of HapNe-LD applied to SNP array data to be close to that of HapNe-IBD using ground truth IBD data, suggesting that HapNe-LD achieves close to optimal performance in these simulations, despite not utilizing phasing information (see Supplementary Figure S1b). We also tested the performance of GONE29, a recent LD-based method, and observed larger RMSLE in the past 50 generations (see Figure 1b). Due to its regularization procedure, HapNe-LD tended to infer smooth changes in population size, whereas GONE inferred more rapid fluctuations (see Figure 1a). GONE did not produce bootstrap confidence intervals in these simulations, due to an insufficient number of available SNPs (see Methods).
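The error metric used in these comparisons can be sketched as follows. The precise definition is given in Methods; this example assumes one value of Ne per generation over the window of interest and base-10 logarithms, both of which are assumptions of this sketch. The function name `rmsle` is hypothetical.

```python
import math

def rmsle(ne_true, ne_hat):
    """Root mean squared log-error between two Ne trajectories.

    Both inputs list Ne at the same generations (e.g. the past 50).
    Base-10 logs are an assumption; another base only rescales the value.
    """
    if len(ne_true) != len(ne_hat) or not ne_true:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = [(math.log10(h) - math.log10(t)) ** 2
               for t, h in zip(ne_true, ne_hat)]
    return math.sqrt(sum(squared) / len(squared))
```

Working on the log scale means that overestimating Ne by a factor of 10 is penalized as much as underestimating it by a factor of 10, which is convenient when sizes span several orders of magnitude.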

We next considered a more realistic scenario for the application of IBD-based methods (HapNe-IBD and IBDNe), where we inferred IBD sharing from simulated SNP array data (assuming perfect phasing, see Methods). We detected IBD sharing using the FastSMC program32; similar results for IBDNe were obtained by using the recommended HapIBD software33 (see Supplementary Figure S2). Figure 1a shows the output of all four methods on a data set of 256 diploid samples and results for other sample sizes are summarized in Figure 1b (also see supplementary figures S3 and S4). In most cases, the noise introduced by inferring IBD from the data resulted in biases in the inferred effective population sizes; IBDNe tended to underestimate recent effective population size, while HapNe-IBD tended to overestimate ancestral population size (Supplementary Figure S3). We observed the error in IBD detection to be dependent on several factors, including demographic history and the length of the inferred segments (see Supplementary Figure S5). We note that additional biases due to genotyping and phasing errors are likely to be present in real data, further affecting the quality of IBD-based analyses.

We finally benchmarked the computational speed of these methods and observed HapNe-IBD and HapNe-LD to be more computationally efficient than IBDNe and GONE (see Figure 1c). Computing LD scales only linearly with the number of analyzed samples, while detecting pairwise IBD sharing requires computation that is quadratic in the number of samples, making LD-based analyses more scalable. Unlike IBDNe, which requires more time to fit larger samples, HapNe-IBD only computes a fixed-size vector of the IBD segment lengths, significantly reducing computational costs for larger samples. The difference in computational time between HapNe-IBD and HapNe-LD is mainly driven by differences in the time required to compute IBD and LD summary statistics.
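The idea of summarizing IBD sharing as a fixed-size vector can be illustrated with a simple histogram of segment lengths. This is only a sketch: the bin edges and the function name `ibd_length_histogram` are hypothetical, and the actual summary statistics used by HapNe-IBD are described in Methods.

```python
from bisect import bisect_right

def ibd_length_histogram(lengths_cm, edges):
    """Count IBD segments per length bin.

    lengths_cm : iterable of segment lengths in centimorgans
    edges      : sorted bin edges; bin i covers [edges[i], edges[i + 1])
    The output length depends only on the number of bins, not on how
    many samples or segments were analyzed.
    """
    counts = [0] * (len(edges) - 1)
    for length in lengths_cm:
        if edges[0] <= length < edges[-1]:
            counts[bisect_right(edges, length) - 1] += 1
    return counts
```

Once the segments have been reduced to such a vector, the cost of fitting no longer grows with the number of sample pairs, which is consistent with the running times discussed above.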

Overall, HapNe-IBD and HapNe-LD provided improved accuracy and substantially reduced computational times compared to existing methodologies. Although IBD-based inference of effective population sizes is potentially more accurate than LD-based analysis, the need to accurately detect IBD sharing is likely to introduce substantial biases in the inferred population sizes. HapNe-LD’s performance was observed to be close to that of IBD-based methods applied to ground truth IBD data and may be applied in the analysis of large sample sizes, providing several practical advantages over IBD-based methods in the analysis of real data sets.

3.3 Performance on simulated aDNA data

HapNe-LD does not require phased or high coverage data, making it especially suitable for the analysis of effective population sizes of ancient populations, where phase determination can be poor. However, LD-based analysis suffers from several limitations and potential confounders, some specific to aDNA data. First, analyses based on aDNA data sets tend to contain fewer samples sequenced at relatively low coverage compared with modern panels. Furthermore, different sequencing strategies balancing sample size and coverage might lead to different performances in effective population size inferences. Next, an important confounder is the potential presence of admixture in the analyzed samples, which is often encountered in real populations as a result of past demographic interactions and induces long-range correlations among genomic variants40. Finally, individuals sampled at a site are unlikely to have lived at the same time, with a few notable exceptions41,42. If not modeled, this source of time heterogeneity may lead to biased effective size estimates.

We set out to test HapNe-LD’s robustness to these sources of confounding. We first created synthetic aDNA samples by generating pseudo-diploid individuals with different levels of missingness m, mimicking the effects of reduced sequencing coverage C, with m ≈ e^(−C) (see Methods). We tested the relative impact of the simulated sample size s and coverage on HapNe-LD’s inference accuracy (see Figure 2a and Supplementary Figure S6 for additional demographic scenarios). As expected, RMSLE decreases when more samples are available and when coverage increases (see Figure 2b and Supplementary Figure S7). We then tested whether HapNe-LD would perform better when analyzing a larger number of low-coverage samples rather than a smaller number of high-coverage samples. To this end, we performed simulations where the overall number of sequencing reads is kept approximately constant, while the number of analyzed samples and their coverage are varied (see Figure 2c and Supplementary Figure S7). We considered an analysis involving 256 individuals and observed that reducing coverage from 30x to 1.4x had no significant impact on the performance while requiring only about 5% of the reads. Using an equivalent number of reads to perform high coverage (30x) sequencing would only allow sequencing 16 individuals, resulting in significantly higher RMSLE. These results suggest that sequencing at a coverage higher than 1-2x does not lead to significant improvements in HapNe-LD’s performance, and that HapNe-LD is more accurate when a larger number of individuals is sequenced at lower coverage compared to settings in which a smaller number of high coverage samples is analyzed.
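The relation between coverage and missingness quoted above, in which the missing fraction is approximately the exponential of minus the coverage, matches the probability of observing zero reads at a site when read depth is Poisson distributed with mean C. The sketch below evaluates it and applies random masking to a genotype vector. The function names and the `None` encoding for missing calls are assumptions of this example, not the simulation code used in the paper.

```python
import math
import random

def missing_fraction(coverage):
    """Expected fraction of sites with no read at mean coverage C."""
    return math.exp(-coverage)

def coverage_for_missingness(m):
    """Inverse relation: mean coverage giving missing fraction m."""
    return -math.log(m)

def mask_genotypes(genotypes, m, rng):
    """Independently set each call to None with probability m."""
    return [None if rng.random() < m else g for g in genotypes]

if __name__ == "__main__":
    for c in (0.5, 1.4, 30.0):
        print(f"C = {c:>4}: m = {missing_fraction(c):.4g}")
    # Lowering coverage from 30x to 1.4x uses about 1.4 / 30 of the reads.
    print(f"read fraction = {1.4 / 30:.3f}")
```

At 1.4x coverage roughly a quarter of sites are expected to be missing per individual, while at 30x essentially none are.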

Figure 2: Results in simulated aDNA data.

(a) HapNe-LD inference results for simulated aDNA-like data under the “Bottleneck” demographic scenario (dashed lines) where the number s of simulated samples and fraction m of missing SNPs, or equivalently the coverage C, are varied (see Methods). (b) RMSLE over the first 50 generations for different coverage levels. Error bars correspond to 1.96 × SE computed using 10 independent simulations. (c) Comparison of the accuracy of HapNe-LD based on two sequencing strategies. The red line reports RMSLE for high coverage data (m = 0, C = 30) with varying sample size s. The blue line reports RMSLE for fixed s = 256 and varying coverage. Error bars correspond to 1.96 × SE computed using 10 independent simulations. (d) HapNe-LD results under the IM and ICF models of recent admixture, depicted on the left. For both models, we set tm = 50 generations. For ICF simulations, we sampled all individuals from one population and selected a migration rate μ such that ancestors of a sampled individual are located in the second population with probability close to 1/3 (see Methods). (e) HapNe-LD and GONE inference results for a simulation where individuals from a population of constant size of Ne = 20,000 are uniformly sampled over an interval ΔT = 10 generations (red shaded area).

We next simulated a population affected by recent admixture (see Supplementary Note) by considering two demographic scenarios (similar to those used in 43). In these scenarios, two isolated populations first separate and then either merge again (IM model) or experience continuous gene flow (ICF model, see Figure 2d). All simulated models had a constant number of 20,000 haploid individuals within each population; the interaction time tm was set to 50 generations. Simulation results for other values of tm are shown in Supplementary Figure S8. For the ICF model, we sampled all individuals from one population and selected a migration rate μ such that at time tm the ancestral lineages of all individuals are located in the second population with a probability close to 1/3. Figure 2d shows that HapNe-LD results under these models do not strongly deviate from the true underlying effective population size (see Supplementary Note). Some ICF simulations resulted in an increase in the inferred recent population size (see Supplementary Figure S8), likely due to model regularization, indicating that larger sample sizes are needed to infer subtle population size variation at these time scales. Taken together, these results suggest that HapNe-LD is robust to reasonable levels of admixture LD. The HapNe-LD software implements a statistical test for admixture LD, warning the user if significant admixture LD is detected.

Lastly, we considered potential biases arising due to heterogeneous sampling times of the analyzed aDNA individuals. We used analytical modeling (see Methods and Supplementary Note) to confirm that, if not accounted for, heterogeneous sampling times lead to biased recent effective population size estimates. We performed simulations of aDNA samples originating from heterogeneous time locations under a constant demographic history, uniformly drawing the time offset of each sample between 0 and ΔT generations in the past (see Methods). In this setting, we observed that using GONE to infer effective population size leads to the spurious inference of a recent population expansion, consistent with analytical predictions under unmodeled time heterogeneity (see Figure 2e). The HapNe-LD algorithm allows utilizing prior knowledge of sampling times (e.g. from radiocarbon dating or archeological context) in the form of a user-provided time interval for each analyzed individual (see Methods). Using simulations, we verified that this approach effectively removes recent biases due to time heterogeneity.

3.4 Inference of recent effective population sizes in the UK Biobank and 1,000 Genomes Project data sets

We used HapNe-IBD and HapNe-LD to analyze recent effective population size variation within the UK Biobank data set. Accurate inference of recent demographic events requires a combination of large sample sizes and small effective population sizes, which make it possible to estimate recent coalescent rates. In this case, large recent effective population sizes generally present across the UK are balanced by the large sample sizes available in the UK Biobank data set. In order to mitigate the impact of admixture LD, we focused on the larger group of samples with self-reported white British ancestry, and only considered unrelated individuals to avoid biasing demographic inference in recent generations. We grouped individuals based on the postcode of their self-reported birthplace and report analyses for three of these postcodes (see Figure 3a, Methods). We also used FastSMC to detect IBD segments within each of these postcodes. Regions with unusually high LD or IBD sharing were excluded using HapNe’s filter (Supplementary Figure S9).

Figure 3: HapNe-IBD and HapNe-LD estimates of recent effective population sizes in modern populations.

(a) Inference results for three postcodes: Glasgow (G), s = 14,724; Edinburgh (EH), s = 9,981; and Llandudno (LL), s = 2,089 from the UK Biobank data set. The vertical dashed line corresponds to the estimated date of the Black Death in the UK (1348,44). HapNe results are converted to years assuming 29 years per generation. The shaded grey area depicts how the placement of the Black Death would shift with respect to the inferred demographic models if values between 23 and 35 years per generation were assumed. (b) Inference results for three populations (Finnish, FIN, s = 99; Kinh in Ho Chi Minh City, Vietnam, KHV, s = 99; Yoruba in Ibadan, Nigeria, YRI, s = 107) from the 1,000 Genomes Project.

Effective size trajectories inferred from these regions in the UK all exhibit a bottleneck event during the Late Middle Ages, which roughly corresponds to the period of the Black Death (Figure 3a, vertical dashed line). The effective population size inferred for individuals from the Llandudno postcode is significantly smaller than those inferred for Glasgow and Edinburgh. Such a smaller effective size offers a stronger source of recent demographic signal, allowing inference to be performed using a smaller sample size (s = 2,089 for Llandudno, s = 14,724 for Glasgow, and s = 9,981 for Edinburgh). In contrast, detecting the more subtle contraction to a larger minimum bottleneck size in Glasgow required a substantially larger sample size, as highlighted when we downsampled data from this postcode to 2,000 individuals (see Supplementary Figure S10). In this experiment, the bottleneck was only apparent in the output of HapNe-IBD, suggesting that LD-based analysis may lead to comparably lower statistical efficiency in cases where high-quality IBD signal is available. Demographic models inferred by HapNe-IBD and HapNe-LD are broadly consistent, although HapNe-IBD tends to report a larger effective population size, with a significant shift towards more remote times. These observations are compatible with the presence of underlying IBD segments that are undetected or broken into smaller segments, due to the presence of phasing or genotyping errors in the data.
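The conversion between generations and calendar years used to place the Black Death on these trajectories can be sketched as below. The 29 years per generation and the 23 to 35 range come from the Figure 3 caption; the reference year of 1950 standing for the present-day sample is a purely hypothetical value chosen for illustration, as are the function names.

```python
def generations_to_year(generations_ago, reference_year, years_per_generation=29.0):
    """Calendar year corresponding to a number of generations before the sample."""
    return reference_year - generations_ago * years_per_generation

def year_to_generations(year, reference_year, years_per_generation=29.0):
    """Number of generations separating a calendar year from the sample."""
    return (reference_year - year) / years_per_generation

if __name__ == "__main__":
    REFERENCE_YEAR = 1950  # hypothetical; not taken from the paper
    for gen_time in (23.0, 29.0, 35.0):
        g = year_to_generations(1348, REFERENCE_YEAR, gen_time)
        print(f"{gen_time:.0f} years/generation: Black Death about {g:.1f} generations ago")
```

Under these assumptions the event falls between roughly 17 and 26 generations before the sample, which shows how sensitive the dating of an inferred bottleneck is to the assumed generation time.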

We next applied HapNe-IBD and HapNe-LD to data from the 1,000 Genomes Project (1kGP,45). Unlike the UK Biobank, most 1kGP groups contain a small number of samples, which originate from large populations. Furthermore, several groups represented in the 1kGP data set are known to have undergone recent admixture, which complicates LD-based analyses45. We therefore expected analysis of recent effective population sizes to only be possible in a small subset of 1kGP populations. We used HapNe-LD to compute LD for each population and estimated recent IBD sharing using the FastSMC algorithm32 (see Methods). We used HapNe’s filters to exclude populations that were flagged as either not containing sufficient recent demographic signals or exhibiting strong admixture LD (19/26). We then inferred recent effective population sizes using the HapNe-LD and HapNe-IBD methods.

Figure 3b shows results for three populations that passed these filters. Results for all populations without significant admixture LD are shown in Supplementary Figure S11, which also reports results obtained by running the IBDNe algorithm. Supplementary Figure S12 shows two additional populations passing these filters for a less stringent significance cutoff and Supplementary Figure S13 displays the remaining 19 groups. Again, the demographic history inferred using IBD data consistently resulted in larger effective population sizes compared to LD-based results, particularly for recent generations, and were more strongly regularized due to reduced signal. These effects were more pronounced in these groups compared to the UK Biobank analysis, likely due to smaller sample sizes leading to lower phasing and IBD detection quality. HapNe-LD suggests a recent expansion for the individuals from the Kinh population in Ho Chi Minh City, Vietnam (KHV) and the Yoruba population in Ibadan, Nigeria (YRI) and infers a bottleneck at 1,000 CE for the FIN population, consistent with previous reports25,29,46. These demographic events are inferred to have an earlier onset using IBD data, likely also a result of noisy IBD detection. We also observed that IBD-based methods inferred strong bottlenecks in many African and South American populations around 1,000 CE, which is likely due to biases in the IBD-detection (see Supplementary Figure S13).

Overall, these results suggest that HapNe-LD and HapNe-IBD provide similar results when large samples and high-quality IBD data are available. HapNe-LD, however, provides more robust results than HapNe-IBD in data sets where phasing and IBD detection accuracy are reduced, at the cost of an only slightly reduced statistical efficiency. HapNe-LD may produce biased estimates for data sets including a history of strong recent admixture, as highlighted for some populations in Supplementary Figure S13. These biases usually result in an apparent population collapse in the recent past; in these analyses, however, HapNe-LD implements tests to flag populations where strong admixture is likely to result in such a spurious recent bottleneck.

3.5 Inference of recent demographic history in ancient populations

We applied the HapNe-LD method to aDNA sampled from four different sites for which large cohorts from similar time strata were available (see Methods and supplementary tables S1–S7).

We first analyzed a group of recently published individuals excavated in Pocklington, Yorkshire, UK47 (see Figure 4a). The archeological context suggests that this group belongs to the Arras culture, which is distinctive relative to other Iron Age cultures in the UK but shows similarities with contemporary cultures in the Paris Basin and Ardennes/Champagne regions of France. These individuals were found to be unusually highly drifted from nearby groups, although their F-statistics do not highlight significantly divergent admixture histories47. This suggests that these groups share common origins but may have been isolated for some time. To test this, we compared the effective population size for 24 individuals from the Arras culture to that of 49 other Iron Age individuals from Southern England (supplementary tables S2 and S3). For the Arras, we detected a significant recent population contraction, starting between 500 and 1,000 BCE, which was not observed in individuals from Southern England. This is consistent with isolation of the Arras group from other Iron Age individuals in the South of England, possibly also reflecting isolation by distance due to the stronger geographic localization for the Arras samples. Admixture LD for these groups was found to be negligible, suggesting that the observed demographic signature is not due to admixture (see Supplementary Table S1). The small population size of the Arras group might also explain why this population was found to be unusually highly drifted from nearby groups. The recent effective population size inferred for individuals in the South of England was compatible with population size estimates obtained for modern UK Biobank individuals, although confidence intervals were large over the first 1,000 years due to a reduced sample size.

Figure 4:

(a) Analysis of 49 Middle to Late Iron Age individuals from South England, compared to 24 individuals related to the Arras culture near Yorkshire. (b) Inference based on 22 Viking samples found in modern Norway (blue) and 28 found in Gotland, a Swedish island (red). (c) Effective population size inference based on 71 unrelated individuals from the Caribbean Ceramic clade and 18 from the Dominican South-East coast subclade. The grey shaded area corresponds to the estimated date for the transition from Archaic to the Ceramic culture in the region.

We next analyzed 22 genetically similar individuals from the Viking Age buried in Norway, together with 28 individuals from the south-east Swedish island of Gotland41 (Figure 4b and supplementary tables S4 and S5). Norwegian Vikings have been observed to have a slightly smaller proportion of ancestry from Neolithic farmers from Anatolia compared to Swedish Vikings. On the other hand, Vikings from Gotland have a relatively higher estimated fraction of ancestry shared with Bronze Age individuals from the Baltic region. Despite these differences, the demographic histories inferred by HapNe-LD for the recent past of these individuals substantially overlap, and both trajectories show a significant expansion during the Iron Age (−500 to 800 CE).

Finally, we focused on 71 unrelated individuals from the Caribbean, first analyzed in Fernandes et al.48 (n=62) and Nägele et al.49 (n=9) spanning ~1,149 to ~1,440 CE (supplementary tables S6 and S7). For these samples, HapNe-LD infers a weak sign of a bottleneck occurring around 1 CE, followed by a significant expansion, as shown in Figure 4c (blue line). This pattern may reflect the transition from the Archaic to Ceramic context about 2,500-2,300 years ago (Figure 4c, grey area), which has been associated with migration events in the region48. We also extracted and separately analyzed a subgroup of individuals from South-East Dominican sites (Figure 4c, red). These individuals are part of a subclade previously identified in 48. The population size inferred for this group matches that of the broader Caribbean group in the deep past, consistent with common origins, but shows a distinctive sign of contraction in the more recent past. Admixture LD is detectable in these individuals, which may partially explain the observed contraction, as observed in some 1kGP populations (see Supplementary Figure S13 and Supplementary Table S1). Nevertheless, the sizes inferred by HapNe-LD in the recent past roughly match those inferred using runs of homozygosity50, supporting the possibility of a population contraction starting after the transition from the Archaic to the Ceramic Culture48. As in the case of the Arras and Southern England individuals, these demographic patterns may also be due to isolation by distance, where samples originating from different islands result in a larger effective size when considered together.

4 Discussion

We developed an algorithm, called HapNe, that leverages the count of IBD segments of different lengths (HapNe-IBD) or long-range LD (HapNe-LD) to infer recent effective population size fluctuations in modern or ancient DNA data. HapNe-IBD and HapNe-LD implement a number of preprocessing steps, as well as tests to verify that sufficient recent demographic signal is present in the data and to detect the presence of admixture LD. Both methods maximize a power likelihood based on an analytic link between observed summary statistics and the effective population size and use regularization to avoid producing spurious oscillations. We used extensive simulations to show that both HapNe methods were more accurate and computationally faster than available algorithms for IBD-based and LD-based inference of recent demographic history, producing lower error and fewer spurious oscillations. These simulations also showed that while HapNe-LD does not require high quality or phased data and scales better with sample size, its performance can be close to that of IBD-based methods applied to ground truth IBD information. Finally, we applied HapNe to several modern and aDNA data sets, detecting evidence for recent past demographic events across these populations. These include population size contractions corresponding to the period of the Black Death in different regions of the UK, as well as bottleneck and expansion events in 1,000 Genomes Project populations. In aDNA data, these analyses provided evidence for divergence and isolation events, as well as shared demographic histories in subgroups from several ancient populations with diverse geographic and temporal origins.

Our analyses suggest that LD-based inference of recent demographic variation provides a route to circumventing biases that may arise in IBD-based demographic inference. Although the spectrum of shared IBD haplotypes is an effective source of information for analyses of past demographic events, accurately estimating IBD sharing is complicated in low coverage and aDNA data and may lead to biased results. This may also be the case in modern populations when limited data availability prevents accurate phase estimation. Although summary statistics of LD rely on less direct observation of historical recombination events, they may be effectively computed in unphased and low coverage data sets. This enables analyzing recent demographic events in samples from poorly represented populations and, coupled with modeling of heterogeneous sampling time, in aDNA data sets. Performing both IBD-based and LD-based analyses may offer validation for an inferred demographic model and allow testing for the presence of biases in either approach. An additional source of potential bias in methods for demographic inference is linked to the need to make assumptions about the type of demographic model being inferred. In this context, approaches that avoid relying on a predefined set of models provide more flexibility, but require further tuning strategies to balance the desired sensitivity to past demographic events with the need to prevent the inference of spurious fluctuations. Our work suggests that the use of self-tuning regularization mechanisms helps mitigate the risk of spurious inferred fluctuations. Finally, our analyses highlight the importance of accurately preprocessing both IBD and LD signals before performing demographic inference, as results may vary significantly if unfiltered data are used. Key preprocessing steps include testing for the presence of admixture LD and systematically filtering out regions of the genome that harbor unusually high IBD sharing or LD (see e.g. Supplementary Figure S9). These may be due to natural selection or the presence of structural variation and lead to biases in analyses of demographic history and selection if not accounted for.

We outline several limitations and directions of future development for this work. First, HapNe-LD assumes that the LD signal observed in the data is solely due to past population size fluctuations. In some instances, residual admixture LD can be present in the data after filtering, causing a spurious bottleneck in the recent past and creating the need to carefully interpret models that resemble this type of signature. Similarly, HapNe-IBD currently only relies on the observed spectrum of IBD sharing, which may be biased due to inaccurate IBD detection. Future work may allow explicit modeling of type-1 and type-2 errors in IBD detection, mitigating biases in the inferred demographic models. Second, while regularization helps prevent the inference of spurious demographic fluctuation, it leads to favoring constant and exponential demographic histories that lack fluctuations if these are not supported by the data. When interpreting demographic models inferred by HapNe, it is important to note that an inferred constant growth rate may reflect insufficient evidence for past demographic variation (see e.g. Figure S10). Finally, HapNe-LD makes several model simplifications, including the assumption that the analyzed samples come from a single population. HapNe may be extended to explicitly account for multiple populations, improving the analysis of more complex demographic models such as those involving isolation by distance, divergence, and admixture. Similarly, HapNe-LD is currently focused on the inference of recent demographic history, but may be extended to the analysis of deeper time scales by modeling variation in allele frequencies, which are currently assumed to be constant in time. Despite these limitations, we expect that the HapNe framework developed in this work will offer valuable insights into past demographic events in both modern and ancient DNA data.

5 Methods

5.1 Simulated genetic data

We used the ARGON simulator51 (version 0.1.160415) to generate synthetic genotypes and ground truth IBD data for modern and ancient populations. Simulations with time heterogeneity were performed using msprime52 (version 1.1.1). We simulated genomes of 36.23 Morgans, split into 39 independent regions corresponding to human chromosome arms. We used a mutation rate of μ = 1.65 × 10^−8 and a recombination rate of ρ = 1 × 10^−8 per generation per base pair. To simulate SNP data, we then downsampled sequencing data to match the genotype density and allele frequency spectrum observed using Chromosome 2 of the UK Biobank data set, using 50 evenly spaced MAF bins. We generated unphased diploid individuals by randomly pairing simulated haplotypes. Ancient data was generated using a similar procedure, with two additional steps to simulate low coverage data. We first transformed the data into pseudodiploid individuals by randomly sampling one haplotype at each site. We then set each site as missing with probability m, related to a simulated coverage parameter C through the relationship m ≈ e^(−C), further described below.

5.2 Simulation of missingness and coverage

We simulated low coverage data by discarding a proportion m of the SNPs of each individual, but often report results referring to corresponding sequencing coverage parameters. To this end, we assumed a simple model where a genome of length G is sequenced using N reads of length L. Using this notation, the probability that a randomly selected site along the genome is not spanned by a read is: Embedded Image where Embedded Image represents the coverage parameter. This relation can also be used to obtain a link between m and the number of reads: Embedded Image where Embedded Image and s is the number of sampled individuals with missingness m.
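The link between missingness and coverage can be illustrated with a short sketch. It assumes that reads are placed uniformly at random and that the coverage parameter is C = NL/G, the usual definition, which is consistent with the relationship m ≈ e^(−C) given in Section 5.1. The function names and the numerical values of G, L, and N are hypothetical and are not taken from the HapNe software.

```python
import math

def missing_probability(n_reads: int, read_length: int, genome_length: int) -> float:
    """Probability that a given site is spanned by none of the reads,
    assuming reads are placed independently and uniformly (an assumption)."""
    return (1.0 - read_length / genome_length) ** n_reads

def missingness_from_coverage(coverage: float) -> float:
    """Approximation m = exp(-C) for the per-site missingness."""
    return math.exp(-coverage)

def coverage_from_missingness(m: float) -> float:
    """Inverse relation C = -ln(m)."""
    return -math.log(m)

# Hypothetical example: 15 million reads of 100 bp on a 3 Gb genome.
G, L, N = 3_000_000_000, 100, 15_000_000
C = N * L / G                          # assumed definition of coverage
exact = missing_probability(N, L, G)   # (1 - L/G)^N
approx = missingness_from_coverage(C)  # e^(-C)
```

Under this model, a missingness of m = 0.8 corresponds to a coverage of about 0.22, since −ln(0.8) ≈ 0.223.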

5.3 Computation of LD

We consider a panel of s individuals, M sites and genotypes Embedded Image for individual i at site x with minor allele frequency px. We first standardize the genotypes by computing Embedded Image, where Embedded Image is the estimated allele frequency. The LD between two sites x and y is computed as the R2 statistic: Embedded Image

The computation of this statistic scales linearly with the number of samples Embedded Image. Note that this estimator is biased due to the use of Embedded Image instead of the unknown allele frequency px during the normalization step. We describe a procedure used at runtime to debias these estimates in the Supplementary Note. The LD of pseudo-diploid individuals is computed using the same approach, with Embedded Image.
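As an illustration of the standardization and of the linear cost in the number of samples, the following sketch computes a squared correlation between two sites from unphased genotypes. The estimator used here (the squared mean product of genotypes standardized with the binomial variance 2p(1 − p)) is an assumption made for the example. It omits the debiasing step and the handling of missing data, so it should not be read as the exact estimator used by HapNe-LD.

```python
import math

def standardize(genotypes):
    """Standardize diploid genotypes (0/1/2) with the estimated allele frequency."""
    p_hat = sum(genotypes) / (2 * len(genotypes))
    if p_hat <= 0.0 or p_hat >= 1.0:
        raise ValueError("site is monomorphic in this sample")
    sd = math.sqrt(2.0 * p_hat * (1.0 - p_hat))
    return [(g - 2.0 * p_hat) / sd for g in genotypes]

def r_squared(geno_x, geno_y):
    """Squared mean product of standardized genotypes at two sites.
    One pass over the samples, hence linear in the sample size."""
    x = standardize(geno_x)
    y = standardize(geno_y)
    r = sum(a * b for a, b in zip(x, y)) / len(x)
    return r * r
```

Because the allele frequency is estimated from the same samples, this quantity is biased upwards in small panels, which is the effect the debiasing procedure in the Supplementary Note addresses.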

5.4 Detection of IBD segments

We ran FastSMC32 (version 1.2) using parameters min_m = 0.5 (minimum cM length) and t = 100 (IBD time threshold). Decoding quantities were generated based on 30 samples using a European demographic history. FastSMC was run using multiple jobs, so that each job considers at most 100 haploid samples. We also used IBD segments obtained by running the HapIBD software31 (version 1.0), using recommended parameters for SNP-array data analysis (default parameters).

5.5 HapNe-IBD and HapNe-LD algorithms

We developed two algorithms to infer recent effective population size fluctuations Ne(t) from a set of s samples, called HapNe-IBD and HapNe-LD. Both approaches take summary statistics {Yi,b} as input and maximize a pseudo-posterior function for Ne(t). The input data set {Yi,b} is split into 39 genomic regions corresponding to chromosome arms indexed by i, using 0.5cM long bins indexed by b.

HapNe-IBD takes as input a list of IBD segments of length Embedded Image. Input data {Yi,b} corresponds to the count of IBD segments in region i whose length lies in bin b. Bins start at 2 cM and end at the largest detected IBD segment. We assume that each of these counts is the realization of a Poisson random variable, with demographic-dependent mean parameter μb(Ne(t))Li where Li is the length of the ith region (μb(Ne(t)) is described in the Supplementary Note). To handle overdispersion, we used a quasi-likelihood approach to compute a weight parameter Embedded Image that multiplies the variance in each bin.
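The construction of the IBD counts for one region can be sketched as follows. The bin width of 0.5 cM and the 2 cM starting point follow the text, while the function name and the convention that the lower bin edge is inclusive are assumptions made for the example.

```python
def bin_ibd_lengths(lengths_cm, start_cm=2.0, width_cm=0.5):
    """Count IBD segments per length bin for a single genomic region.

    Bins are [start, start + width), [start + width, start + 2 * width), ...
    up to the bin containing the longest segment. Segments shorter than
    start_cm are ignored.
    """
    kept = [length for length in lengths_cm if length >= start_cm]
    if not kept:
        return []
    n_bins = int((max(kept) - start_cm) // width_cm) + 1
    counts = [0] * n_bins
    for length in kept:
        counts[int((length - start_cm) // width_cm)] += 1
    return counts

# Hypothetical segment lengths (in cM) detected in one chromosome arm.
example_counts = bin_ibd_lengths([1.5, 2.0, 2.4, 2.5, 3.2])
```

Repeating this for each of the 39 chromosome arms would give the table of counts indexed by region and bin that serves as input to the model.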

HapNe-LD uses average R2 statistics as input data {Yi,b}. This input is computed in Embedded Image, where m is the total number of loci. We assumed that these observations are realizations of a Normal random variable, with a distance-dependent mean parameter μb(Ne(t)) (see Supplementary Note for a detailed description of μb(Ne(t))). The variance parameters Embedded Image were estimated using the usual variance estimator within each bin.

Given a set of IBD or LD observations {Yi,b} for the ith genomic region and bth bin, HapNe aims to maximize P(Ne(t)|{Yi,b}) under the following assumptions. First, Ne(t) is a piece-wise exponential function from t = 0 to t = tmax generations, and remains constant afterwards. In all our analyses, we used tmax = 125 generations. The lengths of the time intervals are iteratively tuned so that each time interval contains the same number of expected ancestors of IBD segments (see Supplementary Note). Second, we assume that there exists a prior on the effective population size pNe(θ), where θ represents the set of parameters defining Ne(t). A discussion about the choice of this prior can be found in the Supplementary Note. Third, we assume that the covariance across consecutive bins can be modeled using a power likelihood Embedded Image. In the Supplementary Note, we show that under these assumptions the MAP estimator of Ne(t) depends on a single hyperparameter cσ2, which we automatically tune using a heuristic model selection rule (see Supplementary Note).

Once the time intervals and the value of the regularization parameter are fixed, HapNe assesses the uncertainty of the prediction by performing 100 bootstrap iterations. For each iteration, HapNe samples chromosome arms with replacement to create new input data, and estimates the effective population size. The 2.5th, 25th, 75th, and 97.5th percentiles are reported at each generation to obtain 50% and 95% confidence intervals.
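The bootstrap over chromosome arms can be sketched as below. The `estimator` argument is a hypothetical stand-in for the full inference step: it takes a list of per-arm inputs and returns one effective size per generation. The percentile interpolation rule is also an assumption, since the text does not specify one.

```python
import random

def percentile(sorted_values, q):
    """Percentile q (0 to 100) of a sorted list, with linear interpolation."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac

def bootstrap_intervals(arm_data, estimator, n_boot=100, seed=0):
    """Resample chromosome arms with replacement and summarize the estimates.

    Returns a dict mapping each percentile to a list with one value per generation.
    """
    rng = random.Random(seed)
    n_arms = len(arm_data)
    replicates = []
    for _ in range(n_boot):
        resampled = [arm_data[rng.randrange(n_arms)] for _ in range(n_arms)]
        replicates.append(estimator(resampled))
    bands = {q: [] for q in (2.5, 25, 75, 97.5)}
    for g in range(len(replicates[0])):
        column = sorted(rep[g] for rep in replicates)
        for q in bands:
            bands[q].append(percentile(column, q))
    return bands
```

The 25th and 75th percentiles bound the 50% interval and the 2.5th and 97.5th percentiles bound the 95% interval at each generation.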

5.6 Comparisons to other methods

To perform method comparisons, we simulated genotypes based on the demographic models shown in Figure 1 and used the methodology described above to compute summary statistics. We ran HapNe-IBD, HapNe-LD, IBDNe (version 23Apr20.ae9), and GONE (Jun 21, 2021 commit) using their default parameters. The simulated SNP array data did not contain enough sites to perform the SNP bootstrapping strategy used by GONE to produce confidence intervals in sequencing data. All computations were run on an Intel Skylake 2.6 GHz architecture on the Oxford Biomedical Research Computing cluster.

We reported the root mean squared log-error (RMSLE) over the first 50 generations as a measure of accuracy. If Ne(t) and Embedded Image denote the true and predicted demographic models, the accuracy is defined as: Embedded Image
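A minimal sketch of this accuracy measure is given below. It assumes natural logarithms and that both trajectories are listed per generation starting from the most recent one. The base of the logarithm and the exact indexing are assumptions made for the example.

```python
import math

def rmsle(true_ne, predicted_ne, n_generations=50):
    """Root mean squared log-error over the first n_generations generations."""
    pairs = list(zip(true_ne, predicted_ne))[:n_generations]
    squared = [(math.log(pred) - math.log(true)) ** 2 for true, pred in pairs]
    return math.sqrt(sum(squared) / len(squared))
```

On the log scale, over- and underestimating the population size by the same multiplicative factor are penalized equally.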

We performed 10 independent sets of simulations and computed error bars reported in each plot as 1.96 × se.

5.7 Filtering of high IBD and LD regions

To mitigate the impact of natural selection and structural variation, HapNe applies a filtering algorithm to exclude chromosome arms with unusual amounts of IBD sharing or LD. For LD data, parameters of a normal distribution are computed for each bin using the median and quantiles of the observed data. We used this quantile-based approach instead of moment-based estimators so that the inference is robust in the presence of the outlier regions we aim to filter out. Then, each genomic region is discarded using the following two heuristic rules. First, the deviation between the observed LD in the region and the median must be within 6 standard deviations. Second, the observed values must cross the median at least once, i.e. a region cannot have all its observations above or below the median. The IBD data is filtered using a similar approach. For each region, the mean of the Poisson distribution and the dispersion factors are computed for each bin using all other regions. The region is discarded if the sum of its squared deviance residuals is in the upper or lower α-quantile of the underlying χ2 distribution, with α = 10^−12. The procedure is performed a second time, without considering the discarded regions, to prevent outliers from affecting the final result.
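The two heuristic rules for LD data can be sketched as follows. The text does not state which quantiles are used, so this example assumes the interquartile range divided by 1.349 as the robust standard deviation. The function names are hypothetical and the second filtering pass is omitted.

```python
import statistics

def robust_normal_params(values):
    """Median and a quantile-based standard deviation (IQR / 1.349, an assumption)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return statistics.median(values), (q3 - q1) / 1.349

def keep_regions(ld, n_sd=6.0):
    """ld[i][b] is the average R^2 of region i in bin b.

    A region is kept if every bin lies within n_sd robust standard deviations
    of the per-bin median and its values fall on both sides of the median.
    """
    n_bins = len(ld[0])
    params = [robust_normal_params([row[b] for row in ld]) for b in range(n_bins)]
    flags = []
    for row in ld:
        dev = [row[b] - params[b][0] for b in range(n_bins)]
        within = all(abs(d) <= n_sd * params[b][1] for b, d in enumerate(dev))
        crosses = any(d > 0 for d in dev) and any(d < 0 for d in dev)
        flags.append(within and crosses)
    return flags
```

Using the median and quartiles keeps the per-bin reference values stable even when a few regions carry extreme LD.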

5.8 LD-based admixture test

Admixture creates long-range LD between unlinked pairs of sites. HapNe allows testing for the presence of admixture LD by computing cross-chromosome LD (CCLD). In the absence of CCLD, we expect the correlations between two sites x and y located on different chromosomes to be only due to finite sample sizes (see Supplementary Note): Embedded Image where Nx and Ny are the number of observed haplotypes on sites x and y, respectively. Because the LD is only computed between pairs of sites containing at least 2 overlapping observations, Nx and Ny are not independent variables. HapNe-LD computes the empirical mean of Eq. 5 for each pair of chromosomes and then performs a t-test to check for deviation from the 0-mean hypothesis. If the hypothesis is rejected, the levels of admixture LD might cause a recent collapse in the effective population size, as shown in Supplementary Figure S13.
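The final testing step can be sketched as a one-sample test on the per-chromosome-pair averages. The input list is assumed to already contain, for each pair of chromosomes, the mean difference between the observed cross-chromosome LD and its finite-sample expectation. The p-value below uses a normal approximation to the t distribution, which is a simplification chosen to keep the example dependency-free; `scipy.stats.ttest_1samp` would give the exact t-based value.

```python
import math
from statistics import NormalDist, mean, stdev

def zero_mean_t_test(differences):
    """Two-sided test of the hypothesis that the mean of `differences` is zero.

    Returns the t statistic and a p-value based on a normal approximation,
    which is reasonable when many chromosome pairs are available.
    """
    n = len(differences)
    t_stat = mean(differences) / (stdev(differences) / math.sqrt(n))
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value
```

A small p-value indicates residual cross-chromosome correlation, in which case a recent contraction in the inferred trajectory should be interpreted with caution.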

5.9 Time heterogeneity in the set of analyzed samples

Most aDNA data sets contain samples originating from different time points, with an estimated date range spanning many generations when the archeological context is used to date the samples. We thus extended HapNe-LD to account for time heterogeneity and uncertainty. The user can provide a date range for each sample. This information is used by HapNe to compute the density of the ages of a randomly selected pair of individuals. This density is then used to marginalize out the age of the oldest sample and the generation gap between the two individuals under the SMC approximation, resulting in an unbiased estimator of the effective population size (see Supplementary Note).

5.10 Inference of demographic history in the UK Biobank

We analyzed the subset of 305,784 unrelated samples with self-reported White British ancestry, corresponding to the individuals reported in Bycroft et al.53 that did not withdraw from the study and whose birth location can be assigned to a postcode in the U.K. (13,995 were removed because of this last condition). The autosomal variants were phased using Beagle 5.154. We then grouped the individuals based on their self-reported birth location, labeling each of them with the first 1 or 2 letters of their corresponding postcode. We randomly picked postcodes with different sample sizes to infer population sizes. LD computations and IBD detection steps were performed using the procedure described above.

5.11 Inference of demographic history in the 1,000 Genomes Project

Starting with N = 2,504 samples from the 1,000 Genomes Project data set, we removed related individuals (up to 3rd degree) based on publicly available pedigree information. The remaining 2,460 samples were split according to population labels. Before running FastSMC, we downsampled the genotypes to match the UK Biobank SNP array data, using the procedure described above. LD computations were run using all loci with MAF > 0.25.

5.12 Inference of demographic history in ancient data

We downloaded version 50.0 of the Allen Ancient DNA Resource (AADR) dataset55. For each analysis, we started by removing related individuals reported in the annotation files present in the dataset. For each family, the individual with the highest coverage was kept. Information about sample ages was also extracted from the annotation file and used as input for HapNe-LD. We then removed variants and individuals with low coverage (m > 0.8). Specific information about each population is present in the supplementary tables S1–S7.

6 Data availability

Genomic data sets and annotations analyzed in this study include: UK Biobank http://www.ukbiobank.ac.uk/, genetic maps ftp://1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20110106_recombination_hotspots/, 1000 Genomes Project phase three https://www.internationalgenome.org/data/ and the Allen Ancient DNA Resource https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data

7 Code availability

The HapNe software package is freely available at https://palamaralab.github.io/software/

Supplementary Information

1 Supplementary Note

1.1 Derivation of the IBD and LD models

This note describes the models used to infer effective population size from IBD and LD summary statistics. We first describe a link between the effective population size and the probability that two sites are spanned by an IBD segment under the SMC’ model1, as well as computationally tractable approximations used in several derivations. Related work on calculations presented in this section may be found in2–11. We then provide details on how these models are used to perform inference based on IBD and LD summary statistics. We conclude by describing further details of the LD model related to low coverage data, time-heterogeneity, and admixture LD.

1.1.1 Notation

We aim to infer the effective population size Ne(t) based on the genotype of s samples consisting of m markers. For simplicity, we will assume that t is a continuous variable, with t = 1 corresponding to 1 generation. Note that Ne(t) refers to haploid individuals in the population. Although Ne(t) is the quantity of interest, we will derive several expressions in terms of its inverse Embedded Image, the coalescent rate, as well as the cumulative coalescent rate Embedded Image.

1.1.2 Survival function for a change of ancestor

Using the above notation, the distribution of the age of the most recent common ancestor (TMRCA) under the coalescent12 may be expressed as: Embedded Image which for a constant coalescent rate takes the form of an exponential waiting time f(t) = γe−γt, leading to Embedded Image.
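For reference, the notation of Section 1.1.1 and the TMRCA density can be written out as follows. This is a sketch based on standard coalescent definitions and on the constant-rate form f(t) = γe^(−γt) quoted above; the integration variable τ is a symbol introduced here for the example.

```latex
\gamma(t) = \frac{1}{N_e(t)}, \qquad
\Gamma(t) = \int_0^t \gamma(\tau)\,\mathrm{d}\tau, \qquad
f(t) = \gamma(t)\, e^{-\Gamma(t)} .
% Constant rate: \Gamma(t) = \gamma t, so f(t) = \gamma e^{-\gamma t}
% and the mean TMRCA is 1/\gamma = N_e generations.
```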

Given the MRCA at site x, with TMRCA = t, we are interested in the genetic distance U at which a change of ancestor is observed. This requires a recombination event, which occurs at rate 2t (see e.g.13). When a recombination event happens, a new lineage is created at a time V ~ Uniform(0, t). This new lineage will not lead to a change of ancestor if it coalesces back to the lineage from which it branched out between V and t. We refer to this kind of coalescent event as a “healing” event and denote its probability by ph(t). To derive an expression for ph(t), we note that the coalescent rate of the new lineage is given by f2(t) = 2γ(t)e^(−2Γ(t)), with a factor 2 appearing because the new lineage can coalesce with either of the two original ones. Healing requires the new lineage to coalesce between v and t, which happens with probability Embedded Image. It also requires the new lineage to coalesce to the original lineage, which happens with probability Embedded Image. Together, these terms lead to the following expression, also derived in7: Embedded Image

For a constant demographic history with coalescent rate γ, this becomes: Embedded Image

Thus, the waiting distance for a change of ancestor is exponentially distributed with rate 2t(1 – ph(t)) and its survival function is given by: Embedded Image
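For a constant coalescent rate, the ingredients listed above can be checked numerically. The closed form used below, ph(t) = 1/2 − (1 − e^(−2γt))/(4γt), is our own evaluation of the stated model: the new lineage starts at V ~ Uniform(0, t), coalesces at rate 2γ, and picks its original lineage with probability 1/2. It is offered as a sketch and should be compared against the expression in the text. Function names are hypothetical and u is measured in Morgans.

```python
import math
import random

def healing_probability(t, gamma):
    """Closed form for constant gamma, obtained by averaging
    0.5 * (1 - exp(-2 * gamma * (t - v))) over v uniform on (0, t)."""
    x = 2.0 * gamma * t
    return 0.5 - (1.0 - math.exp(-x)) / (2.0 * x)

def simulate_healing(t, gamma, n_draws=200_000, seed=1):
    """Monte Carlo estimate of the same probability, following the verbal model."""
    rng = random.Random(seed)
    healed = 0
    for _ in range(n_draws):
        v = rng.uniform(0.0, t)           # time of the recombination event
        w = rng.expovariate(2.0 * gamma)  # waiting time to coalescence
        if v + w < t and rng.random() < 0.5:
            healed += 1
    return healed / n_draws

def conditional_survival(u, t, gamma):
    """S(u | t) = exp(-2 t (1 - p_h(t)) u), with u in Morgans."""
    return math.exp(-2.0 * t * (1.0 - healing_probability(t, gamma)) * u)
```

Since ph(t) is positive, the conditional survival is always larger than the value e^(−2tu) obtained when every recombination event changes the ancestor.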

We obtain S(u) by marginalizing the TMRCA, Embedded Image

For a constant population size, this expression becomes: Embedded Image where Γe denotes the (incomplete) Euler gamma function Embedded Image and Embedded Image. This survival function, also derived in14, assumes an underlying SMC’ model1, but does not lead to a closed-form solution when a piece-wise constant function γ(t) is utilized. To obtain a tractable expression, we introduce an approximation of the SMC’ model. Using a Taylor expansion, Eq. 4 may be written in the form: Embedded Image where Embedded Image is the probability density function of the sum of k exponential random variables with rate 2t. In the last sum, k can be interpreted as the number of healing events observed within a distance u. The SMC approximation, where each recombination event leads to a change of ancestor15, is recovered by only considering the first term and discarding the sum: Embedded Image

For a constant demographic history, the survival function becomes: Embedded Image

Note that this recovers the expression derived in 16 using a different approach. This approximation may become poor when working with small populations and short genetic distances. For example, considering u = 1cM and Embedded Image leads to a relative error Embedded Image. Taking into account a single recombination and healing event leads to increased accuracy (see e.g.3 for a related approach). Using the above formulation, this amounts to considering the first term of the sum. Under a constant demographic model, the survival function is now: Embedded Image which greatly reduces the relative error compared to the SMC approximation (e.g. ~ 10 × lower using the previous example). This approach thus provides a good balance between accuracy and computational cost, as it allows multiple expressions to be computed analytically if γ(t) is approximated by a piece-wise constant function.
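The size of the gap between the two models can be explored numerically for a constant population size. In this sketch, the SMC survival function is evaluated as γ/(γ + 2u), which follows from integrating e^(−2tu) against the exponential TMRCA density. The SMC’ survival function is obtained by midpoint-rule integration, using the constant-rate healing probability ph(t) = 1/2 − (1 − e^(−2γt))/(4γt) derived from the model described above. Both expressions are our own evaluation and are offered as a sketch. The integration settings are arbitrary choices and u is in Morgans.

```python
import math

def survival_smc(u, gamma):
    """SMC approximation for a constant coalescent rate: gamma / (gamma + 2u)."""
    return gamma / (gamma + 2.0 * u)

def survival_smc_prime(u, gamma, n_steps=200_000):
    """Numerical integral of f(t) * exp(-2 t (1 - p_h(t)) u) over t.

    For constant gamma, 2 t (1 - p_h(t)) = t + (1 - exp(-2 gamma t)) / (2 gamma).
    """
    t_max = 50.0 / (gamma + u)  # the integrand is negligible beyond this point
    dt = t_max / n_steps
    total = 0.0
    for k in range(n_steps):
        t = (k + 0.5) * dt
        rate = t + (1.0 - math.exp(-2.0 * gamma * t)) / (2.0 * gamma)
        total += gamma * math.exp(-gamma * t - u * rate) * dt
    return total

def relative_error(u, gamma):
    """Relative difference between the SMC approximation and the SMC' value."""
    exact = survival_smc_prime(u, gamma)
    return abs(survival_smc(u, gamma) - exact) / exact
```

Calling `relative_error(0.01, 1.0 / n_e)` for a range of haploid sizes `n_e` shows how the discrepancy depends on population size at a distance of 1 cM.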

1.1.3 IBD model

We aim to model the number of IBD segments of particular lengths shared between pairs of individuals from a population. We denote the probability density function of the length of an IBD segment by fseg(l|γ(t)), dropping the γ(t) term for clarity. We first consider the length of an IBD segment spanning a given site x along the genome. The probability density function for the length of such a segment, fsite(l), is related to fseg(l) through the following relation2: Embedded Image where Embedded Image represents the expected length of a randomly selected IBD segment. The TMRCA of the two haplotypes at site x is distributed according to f(t). Conditioned on a TMRCA t, the length of the IBD segments spanning x is the sum of the distances to the next change of ancestor on either side of the site. By allowing at most one healing event within the IBD segment as described above, the density takes the form: Embedded Image where the first term accounts for the case of no healing events and the second term allows for one recombination event. Marginalizing t, we obtain: Embedded Image

For a constant demographic history, this becomes: Embedded Image Neglecting the probability of healing leads to the SMC approximation for a constant demographic history: Embedded Image

Conditioned on the total number of IBD segments Ns shared in a region, the expected count of IBD segments within a length bin delimited by ui and ui+1 is Embedded Image. Furthermore, Embedded Image, with Lc denoting the genomic length of the current region. Thus, the expected value of the number of segments within the ith bin Yi is given by: Embedded Image

Note that we neglect issues due to finite size chromosomes, which we found to have a negligible effect. For a constant demographic history, this quantity becomes: Embedded Image

Eq. 16 provides the first moment of the distribution of Yi. Note that the approximation introduced in Eq. 10 allows this expression to be computed analytically when the demographic model γ(t) is a piece-wise constant function. Previous expressions derived under the full SMC’, on the other hand, required the use of special functions or numerical integration7.

Poisson distributions provide a natural way of describing “count data” such as Yi. However, when using the Poisson model, we encountered bin-dependent overdispersion, particularly for smaller bins, where IBD segments originate from older coalescence events that likely involve multiple samples. We thus used a quasi-likelihood approach17, adding a dispersion parameter ϕi: Embedded Image where Embedded Image and the Poisson mass function is recovered for ϕi = 1. The dispersion parameters ϕi are set so that the variance of the deviance residuals is 1.

1.1.4 LD model

Rather than relying on the direct observation of IBD data, HapNe-LD leverages long-range correlations that are induced by shared segments, which may be detected using unphased data. To describe the LD model used by HapNe, we begin by noting that alleles found at high frequency in a sample are typically older than ancestors transmitting large IBD segments (also see Section 1.2.1 for calculations related to the age of IBD segments). This implies that high frequency mutations found on long IBD segments are also likely to be carried by the shared ancestor transmitting the segment. We restrict our analysis to sites with MAF > 0.25. Given one such high frequency site x, we assume that the haplotypes of two individuals i and j spanned by a large (> 0.5 cM) IBD segment satisfy Embedded Image and that the same haplotypes will be independent if not spanned by an IBD segment, i.e. Embedded Image

The presence of IBD segments therefore leads to correlation in the observed genotypes, which HapNe-LD aims to leverage for the inference of effective population size variation. The input for HapNe-LD is a set of unphased genotypes Embedded Image, where i ∈ {1, …, s} denote individuals in the panel, and x ∈ {1, …, M} denote sites. Embedded Image and Embedded Image represent the (hidden) haplotypes of sample i at site x, with Embedded Image where px is the population’s allele frequency at site x. For simplicity, we consider standardized input data: Embedded Image where Embedded Image is the estimator of the allele frequency at site x, which is assumed to remain constant in the recent past.

HapNe-LD starts by computing the LD for different bins b. Unless otherwise specified, these bins are 0.5cM long and range from 0.5 to 10cM. For every bin b, we compute Embedded Image as the average of all Embedded Image values estimated for pairs of sites (x, y) whose distance is within bin b: Embedded Image

Note that this requires Embedded Image computation.

We now aim to relate these correlation statistics to the effective population size. The first moment of Embedded Image is given by: Embedded Image

We can group the 16 terms of the sum into different categories, according to the number of distinct haplotypes involved in each of these terms. In particular, the 4 terms where α = γ and β = δ involve two distinct haplotypes, i.e. haplotype α for individual i and β for individual j. For these 4 terms, we can use equations 10, 19, and 20 to write: Embedded Image where u denotes the distance between the two sites x and y. Note that we neglect issues due to finite sample sizes and admixture LD, which are addressed later. With this assumption, we have Embedded Image and Embedded Image.

The 12 other terms of the sum of Eq. 22 involve either 3 or 4 haplotypes. For example, a term with α ≠ γ and β = δ involves both haplotypes for individual i and haplotype β for individual j. In these cases, correlations induced by IBD require at least two pairs of haplotypes to be shared IBD, leading to Embedded Image contributions, which we neglect.

Together, these expressions enable obtaining the first moment of Embedded Image. If bin b is delimited by ui and uj, we have: Embedded Image

To complete the model, we assume that Embedded Image and estimate Embedded Image using Embedded Image estimates obtained across chromosome arms.

1.1.5 Correcting for finite sample size

Working with finite sample sizes induces correlations in the data which, if not accounted for, lead to bias in the inferred effective population size. These correlations arise as a result of the use of an empirical allele frequency Embedded Image instead of the unknown px. As a first step to debias the estimator of R2, we consider the ratio of the expected values as an approximation to the expected value of the ratio, which has been shown to be a good approximation for common alleles18: Embedded Image

If sx haplotypes are observed at site x, the numerator becomes: Embedded Image

Similarly, the denominator is given by: Embedded Image

It follows that: Embedded Image

When working with low coverage data, sx becomes a random quantity, Sx. Because computing LD between x and y requires that at least two individuals are sequenced at both sites, Sx and Sy are not independent for the (x, y) pairs considered when computing LD. We therefore average realizations of Embedded Image over pairs of sites (x, y) to compute an estimate Embedded Image for the following quantity in Eq. 23: Embedded Image which is also relevant for the detection of admixture LD, as discussed later. We use the same pairs (x, y) to similarly obtain an estimate Embedded Image for the quantity Embedded Image and use these terms to obtain a corrected estimate for Embedded Image Embedded Image

Note that the factor 4 is due to the Embedded Image terms in Eq. 22 that also cause finite-sample size correlations.

1.1.6 Correcting for time heterogeneity

Ancient DNA samples in a data set often originate from different time points. Due to the uncertainty in obtaining precise time estimates, their origins are often reported as a time range. Time heterogeneity across the set of analyzed samples causes a reduction in LD, due to the effects of recombination on the underlying haplotypes. If not modeled, this leads to an upwards bias in the estimated effective population size. HapNe-LD implements a correction to prevent these biases using the reported sample ages, which are obtained via radio-carbon dating or using the archeological context.

Consider two individuals i and j sampled at times Ti and Tj. Assume, without loss of generality, that Ti > Tj and define ΔT ≡ Ti – Tj > 0. Following the lineage of individual j at a site x, we denote by k the ancestor living at generation Ti. The LD between individuals i and k, both of them living at generation Ti, can be computed using Eq. 7 by replacing γ(t) with γo(t) = γ(t + Ti). The LD between individuals i and j is obtained by multiplying the LD between individuals i and k by the probability that the haplotype is not broken by a recombination event when transmitted from k to j, which decays exponentially with rate ΔT. Under the SMC approximation, this probability is given by e^(−ΔT·u). In practice, Ti and Tj are not known exactly but provided as a range. If the density functions of Ti and Tj are available, both times can be marginalized in the above calculations of LD. HapNe supports user-provided time intervals for each sample and assumes that the true time is uniformly distributed within these intervals.
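The effect of date uncertainty on the decay factor alone can be illustrated with a small Monte Carlo sketch. Sample ages are assumed to be expressed in generations and drawn uniformly within each reported range, and u is in Morgans. The function name is hypothetical. The full correction also shifts the coalescent rate to the age of the older sample, which is not shown here.

```python
import math
import random

def expected_decay(range_i, range_j, u, n_draws=100_000, seed=0):
    """Average of exp(-|T_i - T_j| * u) when each sample age is uniform in its range."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        t_i = rng.uniform(*range_i)
        t_j = rng.uniform(*range_j)
        total += math.exp(-abs(t_i - t_j) * u)
    return total / n_draws
```

Two samples dated exactly 10 generations apart lose a factor e^(−0.5) ≈ 0.61 of their LD at a distance of 5 cM, and wider date ranges average this factor over the possible gaps.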

1.1.7 Admixture LD

Admixture causes correlation due to differences in allele frequencies across diverged populations. This correlation, often referred to as admixture LD, may lead to biases in the inferred demographic models. We use Eq. 30 to detect the presence of admixture LD and partially correct for it. For each pair of distinct chromosomes i and j, we compute the average difference between both sides of Eq. 30 and use a two-sided t-test to verify that they do not significantly deviate from 0. To mitigate the effects of admixture LD, we estimate Embedded Image by averaging realizations of XiXjYiYj for loci located on different chromosomes, and use this value as an estimate of β in Eq. 32. Note that, because all pairs of chromosomes are used to compute the t-test, the samples are not strictly independent, making this approach slightly conservative. An alternative approach consists of considering only disjoint pairs of chromosomes, which however leads to higher variance in the estimates for β.
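The test step can be sketched as follows, assuming the per-chromosome-pair average differences have already been computed (the input list and the function name are hypothetical). The t statistic is the standard one-sample statistic; the p-value uses a normal approximation to the t distribution, which is a simplification of this sketch that is reasonable when many chromosome pairs are available (231 pairs for 22 autosomes).

```python
import math
from statistics import NormalDist, mean, stdev


def one_sample_t(differences):
    """One-sample t statistic for H0: mean(differences) == 0.

    Returns (t, p), where p is a two-sided p-value obtained from a normal
    approximation to the t distribution (assumption of this sketch).
    """
    n = len(differences)
    standard_error = stdev(differences) / math.sqrt(n)
    t = mean(differences) / standard_error
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p


# Hypothetical differences centred on zero: no evidence of admixture LD.
t_null, p_null = one_sample_t([0.1, -0.1] * 50)
# Hypothetical differences shifted away from zero: strong evidence.
t_shift, p_shift = one_sample_t([1.1, 0.9] * 50)
```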

1.1.8 Effective population size in IM and ICF models

We used the backward-in-time Markov chain introduced in19 to convert coalescence rates for the IM and ICF multi-population models into effective sizes for an equivalent single-population model. In particular, given a demographic model involving multiple populations, we used a Markov chain to compute the probability that two lineages coalesce at generation t, conditioned on not having coalesced up to generation t – 1, and took the inverse of this probability to be the effective population size for an equivalent single-population model.
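As an illustration of this conversion, the sketch below tracks two lineages backward in time in a toy model with two demes of constant size and symmetric migration. The model, the function name, and the use of haploid units (coalescence probability 1/N per generation for two lineages in the same deme) are assumptions made for illustration; they are not the IM or ICF models used in the simulations.

```python
import math


def effective_size_two_demes(n1, n2, m, generations, start=(0, 0)):
    """Equivalent single-population size per generation for two lineages.

    n1, n2: deme sizes in haploid units (assumption). m: probability that a
    lineage switches deme per generation, backward in time. start: demes in
    which the two lineages are sampled.
    """
    sizes = (n1, n2)

    def move(src, dst):
        return m if src != dst else 1.0 - m

    # probs[(a, b)]: probability that the lineages sit in demes a and b
    # and have not yet coalesced.
    probs = {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
    probs[start] = 1.0
    ne = []
    for _ in range(generations):
        # Migration step: each lineage moves independently.
        after = {state: 0.0 for state in probs}
        for (a, b), p in probs.items():
            for a2 in (0, 1):
                for b2 in (0, 1):
                    after[(a2, b2)] += p * move(a, a2) * move(b, b2)
        not_coalesced = sum(after.values())
        coalescing = sum(after[(k, k)] / sizes[k] for k in (0, 1))
        # Inverse of P(coalesce now | not coalesced before).
        ne.append(not_coalesced / coalescing if coalescing > 0 else math.inf)
        # Remove the probability mass that has just coalesced.
        for k in (0, 1):
            after[(k, k)] *= 1.0 - 1.0 / sizes[k]
        probs = after
    return ne


isolated = effective_size_two_demes(1000, 500, 0.0, 5)
well_mixed = effective_size_two_demes(800, 800, 0.5, 5)
```

Without migration the result equals the size of the sampled deme, while with m = 0.5 and equal deme sizes the two demes behave as a single population of twice the size.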

1.2 Additional details on the inference procedure

We provide additional details on the use of quantiles of the IBD segment age distribution to discretize the time intervals and on the regularized loss function minimized by HapNe to infer Ne(t).

1.2.1 Parameterization of Ne(t)

HapNe aims to infer the demographic model given by Ne(t). We parameterize this function by assuming it to be piece-wise exponential, with parameters described by a vector, θ. In more detail, we divide the time axis into M consecutive intervals and for each interval i assume that Ne(t) varies according to a constant exponential rate λi. We set λM = 0, implying that the population size remains constant from the last predicted time to infinity. Ne(t) is thus fully determined by a set of M values Embedded Image. Time intervals are automatically selected so that each of them contains the same expected number of IBD segments (as also done in e.g.20). Let fage(t|l > umin) denote the probability density function of the age of IBD segments whose length satisfies l > umin. We define time intervals so that they coincide with quantiles of this density, which we compute using Embedded Image where fseg(u) is defined in Eq. 13 and Embedded Image. To derive fage(t|l), we note that it represents the distribution of the TMRCA of a randomly selected site spanned by an IBD segment of length l. Using Bayes’ rule and the SMC approximation, Embedded Image

For a constant coalescent rate γ, this becomes Embedded Image i.e. an Erlang-3 and Erlang-2 distribution, respectively (also see6,9). Because time intervals depend on Ne(t), HapNe recomputes them at each iteration using the current population size estimates.
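The two ingredients of this parameterization can be sketched in a few lines. Both functions are illustrative and their names are hypothetical. The sign convention (log Ne changes with slope λi going backward in time) and the numerical quantile search on a truncated grid are assumptions, and the density passed to the second function stands in for fage(t|l > umin), whose expression is not reproduced here.

```python
import math


def piecewise_exponential_ne(t, n0, starts, rates):
    """Ne at t generations ago for a piece-wise exponential trajectory.

    starts: start time of each interval, beginning with 0. rates: one rate
    per interval, the last being 0 so that Ne stays constant afterwards.
    """
    log_ne = math.log(n0)
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else math.inf
        if t <= end:
            return math.exp(log_ne + rates[i] * (t - start))
        log_ne += rates[i] * (end - start)


def equal_mass_boundaries(density, t_max, n_intervals, n_grid=20000):
    """Start times of n_intervals intervals carrying equal mass on [0, t_max].

    density may be unnormalised; the midpoint rule is used for integration.
    """
    h = t_max / n_grid
    masses = [density((k + 0.5) * h) * h for k in range(n_grid)]
    target = sum(masses) / n_intervals
    boundaries = [0.0]
    accumulated = 0.0
    for k, mass in enumerate(masses):
        accumulated += mass
        while len(boundaries) < n_intervals and accumulated >= target * len(boundaries):
            boundaries.append((k + 1) * h)
    return boundaries


ne_at_20 = piecewise_exponential_ne(20, 1000.0, [0, 10, 20], [0.1, -0.05, 0.0])
quartiles = equal_mass_boundaries(lambda t: math.exp(-0.1 * t), 200.0, 4)
```

Because the boundaries are quantiles of an age density that itself depends on Ne(t), they would have to be recomputed whenever the size estimates change.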

Note that a slightly more accurate closed-form solution under a constant population size can be obtained by allowing a single recombination event to heal, replacing fsite in Eq. 34 with the expression of Eq. 12, leading to: Embedded Image

1.2.2 Loss function

We aim to find the best set of parameters θ based on correlated observations Y = {yr,b}, where yr,b represents LD or IBD summary statistics computed for the bth bin of the rth independent genomic region. Due to the presence of correlations in the data, rather than using standard likelihood calculations we work with the approximated power likelihood Embedded Image where 0 ≤ c ≤ 1 is a hyperparameter and fb is the probability mass or density function derived in equations 18 and 25. Optimizing Eq. 37 over θ is an ill-posed problem, in which small changes in the input data might lead to significant changes in the inferred parameter Embedded Image (also see e.g.4). To improve convergence and restrict the parameter space we thus impose the following prior on the {λ} coefficients of the piece-wise exponential function Ne(t): Embedded Image where Δti denotes the length of the ith time interval and λi the growth rate in the same interval. Because the numerator corresponds to the length of log Ne(t) between t = 0 and the last predicted time, this choice of prior favors trajectories with reduced fluctuations.

Combining these expressions leads to the following posterior: Embedded Image where Z is a normalizing constant.

We aim to find the MAP of θ: Embedded Image

This requires tuning a single hyperparameter κ = cσ², using the approach described in the next section.
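One objective consistent with this description is sketched below. It is a reading of the text rather than a transcription of Eqs. 37 to 40, which are not reproduced here: we assume that the log-prior is proportional to −Σ|λi|Δti/σ², so that dividing the negative log-posterior by c leaves a penalty weighted by 1/κ. Under this reading a small κ regularizes strongly and a large κ hardly at all. The function and argument names are hypothetical.

```python
def regularised_loss(log_likelihood, rates, interval_lengths, kappa):
    """Negative log-likelihood plus a penalty on fluctuations of log Ne(t).

    The penalty sum(|rate_i| * dt_i) is the total variation of log Ne(t)
    over the modelled intervals; its weight 1 / kappa is an assumption
    derived from kappa = c * sigma^2.
    """
    penalty = sum(abs(rate) * dt for rate, dt in zip(rates, interval_lengths))
    return -log_likelihood + penalty / kappa


# Same fit and trajectory under weak and strong regularization.
weak = regularised_loss(-10.0, [0.1, -0.2, 0.0], [5, 10, 20], kappa=100.0)
strong = regularised_loss(-10.0, [0.1, -0.2, 0.0], [5, 10, 20], kappa=1e-5)
```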

1.2.3 Numerical optimization

We used SciPy’s implementation of the L-BFGS-B optimiser21 to minimize Eq. 40. Each minimization step is run 5 times using different starting points. The solution yielding the smallest loss is kept.
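A multi-start strategy of this kind can be sketched with `scipy.optimize.minimize`. How the starting points are drawn is not stated above, so the Gaussian draws, the seed, and the function name below are assumptions, and the quadratic loss is only a stand-in for Eq. 40.

```python
import numpy as np
from scipy.optimize import minimize


def multi_start_minimise(loss, n_params, n_starts=5, scale=1.0, seed=0):
    """Run L-BFGS-B from several random starting points and keep the best.

    Starting points are drawn from a normal distribution (assumption).
    Returns the scipy OptimizeResult with the smallest loss value.
    """
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        x0 = rng.normal(0.0, scale, size=n_params)
        result = minimize(loss, x0, method="L-BFGS-B")
        if best is None or result.fun < best.fun:
            best = result
    return best


# Stand-in loss with a known minimum at (1, 1, 1).
best = multi_start_minimise(lambda x: float(np.sum((x - 1.0) ** 2)), n_params=3)
```

Restarting from several points reduces the risk of reporting a poor local minimum, at the cost of a proportional increase in run time.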

1.3 Model selection

HapNe performs a grid-search over different values of the hyperparameter κ, ranging from a strong regularization κ0 = 10^−5 to an almost unregularized model with parameter κmax = 100. For each of these parameters, HapNe infers the MAP Embedded Image by optimizing Eq. 40, as well as the associated pseudo-likelihood Embedded Image. HapNe then computes the “pseudo-deviance” Embedded Image. The smallest value of κ satisfying D(κ) < τ is selected as the best hyperparameter. Since the parameter c handling correlations between bins is neglected when computing the “pseudo-deviance”, we cannot use asymptotic theory about the distribution of D to fix the value of τ in a principled way. Instead, we fixed the thresholds τ for both HapNe-LD and HapNe-IBD by training them using three sets of simulations that used different demographic models than the ones presented in this work.
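The selection rule can be sketched as follows. The grid spacing (one value per power of ten), the fallback when no value passes the threshold, and the toy deviance function are assumptions; in practice the deviance would come from refitting the model at each κ.

```python
def select_kappa(kappas, deviance, tau):
    """Return the smallest kappa whose pseudo-deviance is below tau.

    deviance: callable mapping kappa to D(kappa) (hypothetical interface).
    Falls back to the largest kappa if none qualifies (assumption).
    """
    for kappa in sorted(kappas):
        if deviance(kappa) < tau:
            return kappa
    return max(kappas)


# Log-spaced grid from 1e-5 (strong regularization) to 100 (almost none).
grid = [10.0 ** exponent for exponent in range(-5, 3)]
# Toy deviance that decreases as the regularization is relaxed.
chosen = select_kappa(grid, lambda kappa: 1.0 / kappa, tau=50.0)
```

Scanning from the smallest κ upward returns the most strongly regularized model whose fit is still acceptable.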

1.4 Supplementary Figures

Figure S1: Accuracy of HapNe-IBD and IBDNe using ground truth IBD sharing information, and HapNe-LD using inferred LD.

(a) Simulated demographic models (dotted black lines), predictions based on ground truth IBD sharing for both HapNe-IBD (red) and IBDNe (green), and HapNe-LD results based on simulated SNP-array data (blue). (b) Error as a function of sample size for corresponding demographic models in (a), measured as the RMSLE over the first 50 generations (see Methods). HapNe-IBD and IBDNe were run using ground truth IBD sharing information. Error bars correspond to 1.96 × SE computed using 10 independent simulations.
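For reference, error bars of this form can be computed as below, taking SE to be the sample standard deviation divided by the square root of the number of replicates (the usual definition, assumed here). The function name and the input values are hypothetical.

```python
import math
from statistics import mean, stdev


def mean_with_error_bar(values, z=1.96):
    """Return (mean, half-width), with half-width = z * standard error."""
    standard_error = stdev(values) / math.sqrt(len(values))
    return mean(values), z * standard_error


# Hypothetical RMSLE values from independent simulations.
centre, half_width = mean_with_error_bar([1.0, 2.0, 3.0, 4.0, 5.0])
```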

Figure S2: Impact of IBD detection on the accuracy of IBDNe and HapNe-IBD.

(a) RMSLE as a function of sample size for IBDNe and (b) HapNe-IBD using different sources of IBD sharing. Ground Truth refers to the IBD segments obtained from the ARGON simulator; FastSMC and HapIBD were applied as described in the Methods section. Error bars correspond to 1.96 × SE computed using 10 independent simulations.

Figure S3: Effect of sample size variation (rows) across several demographic models (columns).

HapNe-IBD was run using IBD segments detected by FastSMC and IBDNe using segments detected by HapIBD. LD methods were run using their standard pipeline. The y-axis is truncated for readability in simulations that resulted in very large values.

Figure S4: Inference accuracy as a function of sample size.

Accuracy was measured using RMSLE over the first 50 generations for each simulated demographic history and sample size (see Methods). IBD segments for HapNe-IBD and IBDNe were computed using FastSMC and HapIBD, respectively. Error bars correspond to 1.96 × SE computed using 10 independent simulations.

Figure S5: Relative error in IBD detection.

We computed the relative difference between the true and inferred number of IBD segments for different sample sizes (rows) and demographic models (columns) for FastSMC. Positive/negative values indicate a depletion/excess of detected segments.

Figure S6: Effect of coverage and sample size.

(a) Output of HapNe-LD on simulated aDNA for 256 individuals, with m = 0 (C ≈ 30) and m = 0.25 (C ≈ 1.4). (b) Output of HapNe-LD on simulated aDNA for 16 individuals with m = 0 (C ≈ 30) and 256 individuals with m = 0.75 (C ≈ 0.3).

Figure S7: Accuracy of HapNe-LD as a function of sample size and coverage.

(a) RMSLE for HapNe-LD as a function of sample size for three different levels of coverage (line color) and different demographic models (column). The different levels of coverage, 30×, 1.4× and 0.7×, approximately correspond to m = 0, m = 0.25 and m = 0.5, respectively (see Methods). (b) RMSLE obtained when keeping the number of samples constant (s = 256) and decreasing the coverage (blue line), compared to the RMSLE obtained when keeping the coverage constant at 30× and decreasing the sample size.
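The coverage values quoted in these captions are numerically consistent with m = exp(−C), the probability that a site receives no read when read depth is Poisson-distributed with mean C. This is an observation about the quoted numbers, not a restatement of the definition of m given in the Methods; the helper below is hypothetical and is undefined for m = 0, for which the captions quote 30×.

```python
import math


def coverage_from_missing_fraction(m):
    """Mean depth C such that exp(-C) == m (Poisson depth assumption)."""
    return -math.log(m)


# Reproduces the approximate coverages quoted in the captions.
coverages = {m: round(coverage_from_missing_fraction(m), 1) for m in (0.25, 0.5, 0.75)}
```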

Figure S8: Inference based on demographic models involving multiple populations.

(a-c) Results for the IM and ICF models for different values of tm (see Methods).

Figure S9: Filtering of high LD regions.

The LD at different distances u (in Morgans, M) was computed by randomly selecting individuals from the UK Biobank. Unusually elevated LD was observed in the HLA region on Chromosome 6 (blue line) and on Chromosome 8 (orange line), corresponding to a known large inversion polymorphism.

Figure S10: Downsampling analysis for the Glasgow postcode in the UK Biobank.

Effective population size inferred using unrelated individuals with self-reported white British ancestry whose birth location is in the Glasgow (G) postcode area. The numbers above each plot correspond to the sample size used in each analysis.

Figure S11: Inferred demographic models for 1,000 Genomes Project populations where no significant admixture LD was detected.

Results for populations for which the admixture LD test was not significant (p > 0.05). Numbers in parentheses correspond to −log10(p). IBD segments for IBDNe and HapNe-IBD were computed using FastSMC.

Figure S12: Inferred demographic models for 1,000 Genomes Project populations where significant admixture LD was detected (0.05/26 < p < 0.05).

Results for populations for which the admixture LD test was significant at 0.05/26 < p < 0.05. Numbers in parentheses correspond to −log10(p). IBD segments for IBDNe and HapNe-IBD were computed using FastSMC.

Figure S13: Inferred demographic models for 1,000 Genomes Project populations where significant admixture LD was detected (p < 0.05/26).

Results for populations for which the admixture LD test was significant at p < 0.05/26. Numbers in parentheses correspond to −log10(p). IBD segments for IBDNe and HapNe-IBD were computed using FastSMC.
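The grouping used in Figures S11 to S13 can be expressed as a small helper. Reading the divisor 26 as a Bonferroni correction over the 26 populations of the 1,000 Genomes Project is an assumption; the function names and category labels are hypothetical.

```python
import math


def admixture_ld_category(p_value, n_tests=26, alpha=0.05):
    """Assign a p-value to one of the three groups used in the figures."""
    if p_value < alpha / n_tests:
        return "significant after correction"
    if p_value < alpha:
        return "nominally significant"
    return "not significant"


def neg_log10(p_value):
    """Value reported in parentheses in the figure panels."""
    return -math.log10(p_value)


category = admixture_ld_category(0.01)
reported = neg_log10(0.01)
```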

1.5 Supplementary Tables

Table S1: Further information on populations analyzed in Figure 4.

Sample size s, average coverage, estimated age of the most recent and distant samples (given in years before 1950), and approximate p-value for the CCLD test for each analyzed ancient population.

Table S2: Samples used in the Arras analysis

Genotypes were downloaded from published supplementary materials.

Table S3: Samples used in the South England MIA-LIA analysis

Genotypes were downloaded from published supplementary materials.

Table S4: Samples used in the Norway Viking analysis.

Genotypes were downloaded from V50 of the Allen Ancient DNA Resource.24

Table S5: Samples used in the Gotland Viking analysis.

Genotypes were downloaded from V50 of the Allen Ancient DNA Resource.24

Table S6: Samples used in the Caribbean Ceramic analysis.

Genotypes were downloaded from V50 of the Allen Ancient DNA Resource.24

Table S7: Samples used in the South East Coast Dominican Republic Ceramic analysis.

Genotypes were downloaded from V50 of the Allen Ancient DNA Resource.24

8 Acknowledgements

We thank Juba Nait Saada and Fergus Cooper for helpful discussions and suggestions; Arjun Biddanda and Shai Carmi for comments on an early version of the manuscript; Brian Zhang and Arjun Biddanda for sharing code used for various parts of the analysis. This work was supported by the Angus McLeod Scholarship (to R.F.); NIH grant R21-HG010748-01 (to P.F.P.); and ERC Starting Grant ARGPHENO 850869 (to P.F.P.). D.R. is an investigator of the Howard Hughes Medical Institute and this work was also supported by grants from the National Institutes of Health (GM100233 and HG012287), and the John Templeton Foundation (grant 61220). This work was conducted using the UK Biobank resource (Application #43206). We thank the participants of the UK Biobank project. Computation used the Oxford Biomedical Research Computing (BMRC) facility, a joint development between the Wellcome Centre for Human Genetics and the Big Data Institute supported by Health Data Research UK and the NIHR Oxford Biomedical Research Centre. Financial support was provided by the Wellcome Trust Core Award Grant Number 203141/Z/16/Z. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.

Footnotes

  • † Jointly supervised this work

References

  1. Charlesworth, B. Effective population size and patterns of molecular evolution and variation. Nature Reviews Genetics 10 (2009).
  2. Wright, S. Evolution in Mendelian populations. Genetics 16 (1931).
  3. Wright, S. Inbreeding and homozygosis. Proceedings of the National Academy of Sciences 19 (1933).
  4. Pickrell, J. K. & Reich, D. Toward a new history and geography of human genes informed by ancient DNA. Trends in Genetics 30 (2014).
  5. Nielsen, R. et al. Tracing the peopling of the world through genomics. Nature 541 (2017).
  6. Sikora, M. et al. Ancient genomes show social and reproductive behavior of early Upper Paleolithic foragers. Science 358 (2017).
  7. Kondrashov, A. S. Contamination of the genome by very slightly deleterious mutations: why have we not died 100 times over? Journal of Theoretical Biology 175 (1995).
  8. Franklin, I. R. & Frankham, R. How large must populations be to retain evolutionary potential? Animal Conservation 1 (1998).
  9. Schraiber, J. G. & Akey, J. M. Methods and models for unravelling human evolutionary history. Nature Reviews Genetics 16 (2015).
  10. Gutenkunst, R. N., Hernandez, R. D., Williamson, S. H. & Bustamante, C. D. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genetics 5 (2009).
  11. Excoffier, L., Dupanloup, I., Huerta-Sánchez, E., Sousa, V. C. & Foll, M. Robust demographic inference from genomic and SNP data. PLoS Genetics 9 (2013).
  12. Bhaskar, A., Wang, Y. R. & Song, Y. S. Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data. Genome Research 25 (2015).
  13. Kamm, J., Terhorst, J., Durbin, R. & Song, Y. S. Efficiently inferring the demographic history of many populations with allele count data. Journal of the American Statistical Association 115 (2020).
  14. Terhorst, J. & Song, Y. S. Fundamental limits on the accuracy of demographic inference based on the sample frequency spectrum. Proceedings of the National Academy of Sciences 112 (2015).
  15. Li, H. & Durbin, R. Inference of human population history from individual whole-genome sequences. Nature 475, 493–496 (2011).
  16. Sheehan, S., Harris, K. & Song, Y. S. Estimating variable effective population sizes from multiple genomes: A sequentially Markov conditional sampling distribution approach. Genetics 194 (2013).
  17. Schiffels, S. & Durbin, R. Inferring human population size and separation history from multiple genome sequences. Nature Genetics 46, 919–925 (2014).
  18. Terhorst, J., Kamm, J. A. & Song, Y. S. Robust and scalable inference of population history from hundreds of unphased whole genomes. Nature Genetics 49 (2017).
  19. Steinrücken, M., Kamm, J., Spence, J. P. & Song, Y. S. Inference of complex population histories using whole-genome sequences from multiple populations. Proceedings of the National Academy of Sciences 116 (2019).
  20. Speidel, L., Forest, M., Shi, S. & Myers, S. R. A method for genome-wide genealogy estimation for thousands of samples. Nature Genetics 51 (2019).
  21. Palamara, P. F., Lencz, T., Darvasi, A. & Pe’er, I. Length distributions of identity by descent reveal fine-scale demographic history. American Journal of Human Genetics 91, 809–822 (2012).
  22. Palamara, P. F. & Pe’er, I. Inference of historical migration rates via haplotype sharing. Bioinformatics 29 (2013).
  23. Ralph, P. & Coop, G. The geography of recent genetic ancestry across Europe. PLoS Biology 11, e1001555 (2013).
  24. Harris, K. & Nielsen, R. Inferring demographic history from a spectrum of shared haplotype lengths. PLoS Genetics 9 (2013).
  25. Browning, S. R. & Browning, B. L. Accurate non-parametric estimation of recent effective population size from segments of identity by descent. American Journal of Human Genetics 97, 404–418 (2015).
  26. Sved, J. Linkage disequilibrium and homozygosity of chromosome segments in finite populations. Theoretical Population Biology 2 (1971).
  27. Tenesa, A. et al. Recent human effective population size estimated from linkage disequilibrium. Genome Research 17, 520–526 (2007).
  28. McEvoy, B. P., Powell, J. E., Goddard, M. E. & Visscher, P. M. Human population dispersal "Out of Africa" estimated from linkage disequilibrium and allele frequencies of SNPs. Genome Research 21, 821–829 (2011).
  29. Santiago, E. et al. Recent demographic history inferred by high-resolution analysis of linkage disequilibrium. Molecular Biology and Evolution 37, 3642–3653 (2020).
  30. Gusev, A. et al. Whole population, genome-wide mapping of hidden relatedness. Genome Research 19 (2008).
  31. Browning, B. L. & Browning, S. R. Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics 194 (2013).
  32. Saada, J. N. et al. Identity-by-descent detection across 487,409 British samples reveals fine scale population structure and ultra-rare variant associations. Nature Communications 11 (2020).
  33. Zhou, Y., Browning, S. R. & Browning, B. L. A fast and simple method for detecting identity-by-descent segments in large-scale data. The American Journal of Human Genetics 106 (2020).
  34. Hill, W. G. Estimation of linkage disequilibrium in randomly mating populations. Heredity 33 (1974).
  35. Weir, B. S. Inferences about linkage disequilibrium. Biometrics 35 (1979).
  36. Excoffier, L. & Slatkin, M. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Molecular Biology and Evolution (1995).
  37. Waples, R. S. A bias correction for estimates of effective population size based on linkage disequilibrium at unlinked gene loci*. Conservation Genetics 7 (2006).
  38. Ragsdale, A. P. & Gravel, S. Models of archaic admixture and recent history from two-locus statistics. PLOS Genetics 15 (2019).
  39. Mezzavilla, M. NeON: An R package to estimate human effective population size and divergence time from patterns of linkage disequilibrium between SNPs. Journal of Computer Science & Systems Biology 8 (2015).
  40. Loh, P.-R. et al. Inferring admixture histories of human populations using linkage disequilibrium. Genetics 193, 1233–1254 (2013).
  41. Margaryan, A. et al. Population genomics of the Viking world. Nature 585, 390–396 (2020).
  42. Novak, M. et al. Genome-wide analysis of nearly all the victims of a 6200 year old massacre. PLOS ONE 16, e0247332 (2021).
  43. Pfaff, C. et al. Population structure in admixed populations: Effect of admixture dynamics on the pattern of linkage disequilibrium. The American Journal of Human Genetics 68, 198–207 (2001).
  44. Aberth, J. The Black Death 1348 - 1350: A brief history with documents. The Bedford Series in History and Culture (St Martin’s Press, New York, NY, 2005), 1 edn.
  45. The 1000 Genomes Project Consortium (Auton, A. et al.). A global reference for human genetic variation. Nature 526, 68–74 (2015).
  46. Kere, J. Human population genetics: lessons from Finland. Annual Review of Genomics and Human Genetics 2, 103–128 (2001).
  47. Patterson, N. et al. Large-scale migration into Britain during the Middle to Late Bronze Age. Nature (2021).
  48. Fernandes, D. M. et al. A genetic history of the pre-contact Caribbean. Nature 590, 103–110 (2021).
  49. Nägele, K. et al. Genomic insights into the early peopling of the Caribbean. Science 369, 456–460 (2020).
  50. Ringbauer, H., Novembre, J. & Steinrücken, M. Human parental relatedness through time - detecting runs of homozygosity in ancient DNA (2020).
  51. Palamara, P. F. ARGON: fast, whole-genome simulation of the discrete time Wright-Fisher process. Bioinformatics 32, 3032–3034 (2016).
  52. Kelleher, J., Etheridge, A. M. & McVean, G. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Computational Biology 12, e1004842 (2016).
  53. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
  54. Browning, S. R. & Browning, B. L. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. The American Journal of Human Genetics 81, 1084–1097 (2007).
  55. Allen Ancient DNA Resource (AADR): Downloadable genotypes of present-day and ancient DNA data, version 50.0. URL https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data.

References

  1. Marjoram, P. & Wall, J. D. Fast "coalescent" simulation. BMC Genetics 7 (2006).
  2. Palamara, P. F., Lencz, T., Darvasi, A. & Pe’er, I. Length distributions of identity by descent reveal fine-scale demographic history. American Journal of Human Genetics 91, 809–822 (2012).
  3. Harris, K. & Nielsen, R. Inferring demographic history from a spectrum of shared haplotype lengths. PLoS Genetics 9 (2013).
  4. Ralph, P. & Coop, G. The geography of recent genetic ancestry across Europe. PLoS Biology 11, e1001555 (2013).
  5. Schiffels, S. & Durbin, R. Inferring human population size and separation history from multiple genome sequences. Nature Genetics 46, 919–925 (2014).
  6. Palamara, P. F. Population genetics of identity by descent (Columbia University, 2014).
  7. Carmi, S., Wilton, P. R., Wakeley, J. & Pe’er, I. A renewal theory approach to IBD sharing. Theoretical Population Biology 97, 35–48 (2014).
  8. Browning, S. R. & Browning, B. L. Accurate non-parametric estimation of recent effective population size from segments of identity by descent. American Journal of Human Genetics 97, 404–418 (2015).
  9. Palamara, P. F. et al. Leveraging distant relatedness to quantify human mutation and gene-conversion rates. The American Journal of Human Genetics 97, 775–789 (2015).
  10. Wilton, P. R., Carmi, S. & Hobolth, A. The SMC’ is a highly accurate approximation to the ancestral recombination graph. Genetics 200, 343–355 (2015).
  11. Biddanda, A., Steinrücken, M. & Novembre, J. Properties of 2-locus genealogies and linkage disequilibrium in temporally structured samples. Genetics 221 (2022).
  12. Kingman, J. The coalescent. Stochastic Processes and their Applications 13, 235–248 (1982).
  13. Wiuf, C. & Hein, J. Recombination as a point process along sequences. Theoretical Population Biology 55, 248–259 (1999).
  14. Eriksson, A., Mahjani, B. & Mehlig, B. Sequential Markov coalescent algorithms for population models with demographic structure. Theoretical Population Biology 76, 84–91 (2009).
  15. McVean, G. A. & Cardin, N. J. Approximating the coalescent with recombination. Philosophical Transactions of the Royal Society B: Biological Sciences 360, 1387–1393 (2005).
  16. Sved, J. Linkage disequilibrium and homozygosity of chromosome segments in finite populations. Theoretical Population Biology 2 (1971).
  17. Davison, A. C. Statistical Models (Cambridge University Press, Cambridge, 2003).
  18. Hudson, R. R. The sampling distribution of linkage disequilibrium under an infinite allele model without selection. Genetics 109, 611–631 (1985).
  19. Wang, K., Mathieson, I., O’Connell, J. & Schiffels, S. Tracking human population structure through time from whole genome sequences. PLOS Genetics 16, e1008552 (2020). URL https://doi.org/10.1371/journal.pgen.1008552.
  20. Terhorst, J., Kamm, J. A. & Song, Y. S. Robust and scalable inference of population history from hundreds of unphased whole genomes. Nature Genetics 49 (2017).
  21. Virtanen, P. et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17, 261–272 (2020).
  22. Patterson, N. et al. Large-scale migration into Britain during the Middle to Late Bronze Age. Nature (2021).
  23. Margaryan, A. et al. Population genomics of the Viking world. Nature 585, 390–396 (2020).
  24. Allen Ancient DNA Resource (AADR): Downloadable genotypes of present-day and ancient DNA data, version 50.0. URL https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data.
  25. Fernandes, D. M. et al. A genetic history of the pre-contact Caribbean. Nature 590, 103–110 (2021).
  26. Nägele, K. et al. Genomic insights into the early peopling of the Caribbean. Science 369, 456–460 (2020).
Posted August 04, 2022.

Subject Area

  • Genetics