Abstract
Heterozygous sites are not uniformly distributed along a diploid genome. Rather, their density varies as a result of recombination events, and their local density reflects the time to the last common ancestor of the maternal and paternal copies of a genomic region. The distribution of the density of heterozygous sites therefore carries information about the history of the population size. Despite previous efforts, an exact derivation of the distribution of heterozygous sites is still lacking. As a consequence, the estimation of population size variation is difficult and requires several simplifying assumptions. Using a novel theoretical framework, we are able to derive an analytical formula for the distribution of distances between heterozygous sites. Our theory can account for arbitrary demographic histories, including bottlenecks. In the case of a constant population size the distribution follows a simple function and exhibits a power-law tail proportional to r^α with α = −3, where r is the distance between heterozygous sites. This prediction is accurately validated when considering heterozygous sites in individuals of African descent. Other populations migrated out of Africa and underwent at least one bottleneck which left a distinct mark on their interval distribution between heterozygous sites, i.e., an overrepresentation of intervals between 10 and 100 kbp in length. Our analytical theory for non-constant population sizes reproduces this behavior and can be used to study historical changes in population size with high accuracy. The simplicity of our approach facilitates the analysis of demographic histories for diploid species, requiring only a single unphased genome.
1 Introduction
The evolution of genomes is driven by the interplay of several evolutionary forces. Mutagenesis initially introduces genomic variation into the genome of a single individual, which is then subject to natural selection and genetic drift. The relative strength of these two mechanisms depends on the selective advantage or disadvantage conferred by a mutation and the size of the population. In large populations, genetic drift is weak and natural selection is the dominant evolutionary force, whereas in small populations, genetic drift is strong and natural selection is weak. Therefore, demographic events such as population bottlenecks can have a strong effect on genome evolution and leave characteristic traces in genomes.
In diploid species that reproduce sexually, genetic recombination is another important force shaping genome evolution. In these species, chromosomes come in pairs, one inherited from the father and one from the mother. However, as a result of crossing over events during meiosis, chromosomes recombine, and a single gamete passed on to the next generation carries a combination of maternal and paternal genetic information (see Fig. 1(A)). Thus, over several generations, each of an individual’s two homologous chromosomes becomes a mosaic of segments of genetic material from its ancestors.
In a finite population, however, the ancestral lineages of each such segment will eventually coalesce at the time of the most recent common ancestor (TMRCA), see Fig. 1(B). These segments are therefore said to be identical by descent (IBD). At the time point of coalescence, these segments were also identical by state (IBS), i.e. their genomic sequence was 100% identical. Subsequently, both copies accumulated mutations over evolutionary timescales. These mutations are detected as heterozygous sites in a diploid genome that break IBD segments into shorter IBS tracts that are themselves homozygous, see Fig. 1(B). Even if one assumes that these mutations occur uniformly along individual IBD segments, on longer length scales the density of heterozygous sites along chromosomes becomes non-uniform, because different IBD segments have different TMRCAs. In fact, it has been observed that the density of single nucleotide polymorphisms (SNPs) varies along genomes by orders of magnitude, much more than expected for a random Poisson process [1, 2, 3, 4].
In the past, the density of SNPs along chromosomes was already analyzed to infer properties of the coalescent process, in particular the effective population size. Several methods based on the Sequential Markovian Coalescent (SMC) have been developed to infer the demographic history of a genomic sample from SNP data. These methods have in common that they model the Ancestral Recombination Graph, i.e. the genealogies of the observed sample along the genome, assuming that the genealogies follow a Markov process [5]. The first such method, the pairwise sequentially Markovian coalescent (PSMC) [6], was used to infer demographic history from a single diploid genome. Later, this method was extended to analyze multiple pairs of haploid genomes from a population in a method known as multiple sequentially Markovian coalescent (MSMC) [7]. To enable the analysis of large population samples of hundreds of individuals, a novel regularization scheme that reduces estimation error was introduced with SMC++ [8]. Many other extensions [9, 10, 11, 12] have been proposed to address the limitations of SMC-based algorithms for inferring the demographic history of populations (see [13] for a review). In parallel, a competing approach to infer the demographic history was proposed by Harris and Nielsen [14]. They proposed to examine the length distribution of IBS tracts and derived an approximate closed-form formula for the expected length distribution of IBS tracts under different demographic scenarios. However, due to the mathematical complexity of their method, it is rarely used compared to PSMC or similar approaches.
Related to the density of heterozygous sites is the length of homozygous segments. Historically, and due to the fact that long Runs of Homozygosity (RoH) have been associated with inbreeding and genetic diseases, researchers first used restriction fragment length polymorphisms to identify long RoH and map disease loci [15]. Data from about 8,000 short tandem repeat polymorphisms [16] and later high-density genome-wide scans using SNP microarrays interrogating 3 million SNPs [17] allowed RoH to be found at higher resolutions (up to the 1 kbp scale) and showed that RoH are common in human populations [18]. Finally, short-read whole genome sequencing now surpasses microarray scans in resolution and precision, and phased genotype data are available for more than 1000 individuals in several human populations [19].
The aim of this article is to develop a mathematical model for the length distribution of RoH or IBS tracts. We propose to revisit the approach of Harris and Nielsen [14] and show for the first time that the length distribution of IBS tracts follows a simple function and exhibits a power-law tail r^α with exponent α = −3, where r is the length of an IBS tract. Here we provide a simple theoretical framework for computing this distribution. We test our theory using empirical data from the 1000 Genomes Project [19, 20] and find that it agrees very well with empirical data for humans of African ancestry. We also derive an extension of our model for individuals who have undergone a population bottleneck, and show that we can use this extended model to date and estimate the magnitude of population bottlenecks.
2 Results
The first type of question we want to answer concerns the number and distribution of heterozygous sites in a genome. The number of heterozygous sites of an individual provides valuable information about the genetic diversity in a population. A higher number of heterozygous sites indicates greater genetic diversity within a population. The total number of heterozygous sites in a diploid genome of length L is well approximated by Lθ with the scaled mutation rate θ = 4Neμ, the mutation rate per bp and generation, μ, and the effective size of a panmictic population in equilibrium without selection, Ne [21, 22, 23, 24].
For humans we have θ ≪ 1 and expect one heterozygous site every 1/θ nucleotides along the genome on average. However, it is known that the local density of heterozygous sites along the chromosomes is not uniform [1], see Fig. 2(A). Very similar density distributions can also be observed when considering SNPs rather than heterozygous sites. Note, however, that we will not consider homozygous SNPs in an individual in the following, as their presence depends on the underlying reference genome.
To get a better understanding of the density distribution of heterozygous sites we consider the same data in a rainfall plot [25], see Fig. 2(B). In this plot we clearly see that genomes are a mosaic of regions with various densities of heterozygous sites. Each “piece” of this mosaic is an IBD segment which dates back to a different most recent common ancestor and therefore had the potential to accumulate more or fewer mutations [1, 14].
Here we want to focus on the distribution of distances between heterozygous sites, i.e. the length distribution of IBS tracts. This distribution for an individual of African ancestry is shown in Fig. 2(C) and is very different from the distribution one would obtain if heterozygous sites were uniformly distributed along the genome (see Fig. S1). Such a distribution of long IBS tracts was also observed previously [14]. Here we find that this distribution has a specific shape and exhibits a power-law tail M(r) ≃ r^α with α = −3. This intriguing observation can be understood by a mathematical model with only a few simplifying assumptions, as deduced below.
Heterozygous sites of an individual genome represent sites where the maternal and paternal alleles differ. Each IBD segment has a different time to the most recent common ancestor (TMRCA). For an individual genome, the TMRCA distribution depends on the structure of the population. For a stationary population with N diploid individuals, the distribution of TMRCAs is given by the probability of coalescence:
$$p(\tau) = \frac{1}{2N}\, e^{-\tau/(2N)}, \tag{2}$$
where τ is the TMRCA [24, 26]. For a segment with TMRCA τ, the length distribution of IBS tracts is given by a stick-breaking process [27, 28] and follows
$$m(r \mid \tau) = K\,(2\mu\tau)^2\, e^{-2\mu\tau r}, \tag{3}$$
where r is the length of an IBS tract, μ the mutation rate, and K the length of the IBD segment, which is assumed to be larger than a typical IBS tract. Combining Eqs. (2) and (3), and assuming that recombination acts on length scales longer than a typical IBS tract, it follows that the genome-wide length distribution of IBS tracts M is
$$M(r) = L \int_0^\infty p(\tau)\,(2\mu\tau)^2\, e^{-2\mu\tau r}\, d\tau = \frac{2L\theta^2}{(1+\theta r)^3}, \tag{4}$$
where L is the length of the considered genome and θ = 4Nμ. This distribution has a mean value 1/θ and shows the characteristic power-law tail ∝ r^α with exponent α = −3 as observed in Fig. 2(C). Besides the sequence length L, the model has only one more free parameter, θ, and the observed empirical distribution for individuals of African ancestry can be fitted very well by our model, see Fig. 2(C). Note that the power-law exponent α does not depend on any model parameter. Assuming a mutation rate of μ = 2.36·10^−8 per generation and bp [12], the fitted value θ = 0.00106066 corresponds to an effective population size of N0 = 11,280 ± 80.
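The properties quoted for the stationary model can be checked numerically. The following Python sketch assumes the closed form M(r) = 2Lθ²/(1 + θr)³, which is the form consistent with the stated total number Lθ, mean 1/θ and r^−3 tail. The function names and the genome length of 3·10^9 bp are illustrative assumptions, not part of the original analysis.

```python
def ibs_tract_density(r, theta, genome_length):
    """Assumed stationary closed form M(r) = 2 L theta^2 / (1 + theta r)^3."""
    return 2.0 * genome_length * theta**2 / (1.0 + theta * r) ** 3


def tracts_longer_than(r, theta, genome_length):
    """Integral of M from r to infinity: L theta / (1 + theta r)^2."""
    return genome_length * theta / (1.0 + theta * r) ** 2


def local_slope(r, theta):
    """Log-log slope d ln M / d ln r, which approaches -3 for theta * r >> 1."""
    return -3.0 * theta * r / (1.0 + theta * r)


def effective_population_size(theta, mu):
    """Invert theta = 4 N mu for N."""
    return theta / (4.0 * mu)


theta = 0.00106066   # fitted value quoted in the text
mu = 2.36e-8         # mutation rate per bp and generation quoted in the text
L = 3.0e9            # illustrative genome length in bp (assumption)

total_tracts = tracts_longer_than(0.0, theta, L)   # equals L * theta
mean_length = L / total_tracts                     # equals 1 / theta
N_estimate = effective_population_size(theta, mu)
```

Evaluating `local_slope` at increasing r shows how quickly the distribution approaches its asymptotic exponent, independently of θ and L.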
However, considering individuals with different ancestry, especially those who are assumed to have undergone a population bottleneck while moving out of Africa, we observe significant deviations from our predictions (Fig. 3(C-F) and Fig. 5). Notably, we observe an up to twofold excess of IBS tracts of lengths between 10^4 and 10^5 bp. Such deviations were previously shown to be the result of bottleneck events [14].
To calculate analytically how demographic events such as bottlenecks affect the IBS tract length distribution, we first compute statistical properties of the coalescence times of IBD segments in a population of varying size.
Let us consider a population of diploid individuals evolving in time t with a varying population size N(t). Each individual diploid genome is partitioned into IBD segments, i.e. genomic regions for which all sites of the maternal and paternal haploid copy coalesce at the same most recent common ancestor, whose time we denote by τ. In a population of size N there are 2N haploid genomes, and (2N)²/2 possible pairs of haploid genomes. We define n(τ, t) as the number of IBD segments with TMRCA τ in all pairs of haploid genomes in the population, normalized by the length of the genome L, at time t. In other words, the probability to coalesce at TMRCA τ for two homologous base pairs in the population is n(τ, t)/(2N²). Extending previous considerations [29, 30] we find that the time evolution of n(τ, t) is given by
$$\frac{\partial n(\tau,t)}{\partial t} = -\frac{\partial n(\tau,t)}{\partial \tau} + N(t)\,\delta(\tau) - \frac{n(\tau,t)}{2N(t)}, \tag{5}$$
where the three terms on the r.h.s. describe contributions to the evolution of n(τ, t) in time due to (i) pre-existing pairings with finite TMRCA τ, (ii) new pairings with τ = 0 appearing due to coalescent events, and (iii) the loss of pairings due to the death of individuals. Note that N(t) might vary in time and that the normalisation of n(τ, t) is such that the total number of pairings at each time equals
$$\int_0^\infty n(\tau,t)\, d\tau = 2N(t)^2, \tag{6}$$
which is the total number of possible pairs assuming that N(t) ≫ 1. Considering the Laplace transform of n(τ, t) in τ,
$$\tilde n(s,t) = \int_0^\infty n(\tau,t)\, e^{-s\tau}\, d\tau, \tag{7}$$
the differential Eq. (5) takes the form
$$\frac{\partial \tilde n(s,t)}{\partial t} = N(t) - \left(s + \frac{1}{2N(t)}\right)\tilde n(s,t) \tag{8}$$
and can be solved by integration. From a solution n(τ, t) one can compute the length distribution of IBS tracts in a genome of size L using the stick-breaking process with stick length distribution (3):
$$M(r) = \frac{L}{2N(t)^2}\int_0^\infty n(\tau,t)\,(2\mu\tau)^2\, e^{-2\mu\tau r}\, d\tau. \tag{9}$$
Due to the relationship of the integral in Eq. (9) to Laplace transformations, we can also compute the length distribution of IBS tracts directly as the second derivative of the Laplace transform of n:
$$M(r) = \frac{2L\mu^2}{N(t)^2}\,\left.\frac{\partial^2 \tilde n(s,t)}{\partial s^2}\right|_{s=2\mu r}. \tag{10}$$
With these computational simplifications we are now in the position to compute the length distribution of IBS tracts in closed form.
As an example consider again a stationary population of constant size N. The number of pairs and its Laplace transform ñ(s, t) do not depend on time, i.e. ∂ñ(s, t)/∂t = 0, and we solve Eq. (8) by
$$\tilde n(s) = \frac{2N^2}{1+2Ns}. \tag{11}$$
Using Eq. (10), the length distribution of IBS tracts is therefore
$$M(r) = \frac{2L\theta^2}{(1+\theta r)^3}, \qquad \theta = 4N\mu, \tag{12}$$
as found above using simpler arguments leading to Eq. (4).
Let us now consider a scenario in which the population size is piecewise constant over given time intervals, see Fig. 4. We start with a stationary population of size N_0. At time t = 0 this population expands or shrinks to size N_1 for a duration of time T_1. Subsequently the population size changes to N_i for a time T_i, for a total of E epochs i = 1, 2,…, E. We define t_k = Σ_{i≤k} T_i as the time points of population size changes. Finally, we observe and analyse the genome of an individual in a population of size N_E at time t_E.
The solution of the differential Eq. (8) can be computed iteratively by considering an initially stationary population at time t_0 = 0 with
$$\tilde n(s,t_0) = \frac{2N_0^2}{1+2N_0 s}. \tag{13}$$
At the time points t_i for i = 0, 1,…, E − 1 the population of size N_i with pairings ñ(s, t_i) is subjected to either a reduction or an expansion of the population size.
In case of (i) an instantaneous reduction of the population size, i.e. if N_{i+1} ≤ N_i, we just take a sample of N_{i+1} individuals. This changes the normalisation of the function n, i.e. at the time point t_i^+ just after the population size change:
$$\tilde n(s,t_i^+) = \gamma_i^2\, \tilde n(s,t_i^-), \tag{14}$$
where γ_i = N_{i+1}/N_i ≤ 1 and t_i^- denotes a time just before t_i.
In contrast, in case of (ii) an instantaneous expansion of the population size, i.e. N_{i+1} > N_i, we model that each gamete will generate several offspring to make up the next generation. We assume that the number of offspring, p, for each gamete is distributed according to a truncated Poisson distribution:
$$P(p) = \frac{\lambda_i^{p}}{p!\,\left(e^{\lambda_i}-1\right)} \tag{15}$$
for p ≥ 1 and vanishes for p = 0, i.e. we assume P(0) = 0 and that no gamete is lost. On average each gamete will have γ_i = N_{i+1}/N_i > 1 offspring, which fixes the rate λ_i through γ_i = λ_i/(1 − e^{−λ_i}), and the resulting population at the time t_i^+ just after the expansion will be of size N_{i+1}. The number of pairings in Laplace space is
$$\tilde n(s,t_i^+) = \gamma_i^2\,\tilde n(s,t_i^-) + N_i\,\gamma_i\,\lambda_i, \tag{16}$$
which is a simple affine transformation of the number of pairings before the population size increase. In sum, both a decrease and an increase of the population size amount to an affine transformation of ñ(s, t), in the following denoted by 𝒜_{N_i,N_{i+1}}. Note that if the population size does not change, N_i = N_{i+1}, the solution is consistent and we have 𝒜_{N,N} = id.
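Between consecutive changes of the population size, during the time interval T_i, we need to propagate the solution to a time t_i^- just before the next change, with constant population size N_i. Using Eq. (8) we have
$$\tilde n(s,t_i^-) = \frac{2N_i^2}{1+2N_i s} + \left[\tilde n(s,t_{i-1}^+) - \frac{2N_i^2}{1+2N_i s}\right] e^{-\left(s+\frac{1}{2N_i}\right)T_i}, \tag{17}$$
where t_{i-1}^+ denotes a time just after the preceding change. We denote this transformation by 𝒫_{N_i,T_i}.
In summary, by combining these three simple transformations of the initially stationary distribution at t_0 = 0 we can iteratively compute the distribution ñ at time t_E under the influence of the intermediate population size changes as
$$\tilde n(s,t_E) = \mathcal{P}_{N_E,T_E}\,\mathcal{A}_{N_{E-1},N_E}\cdots\mathcal{P}_{N_1,T_1}\,\mathcal{A}_{N_0,N_1}\,\tilde n(s,t_0). \tag{18}$$
In the end the length distribution of IBS tracts M is computed as the second derivative of the final ñ(s, t_E) following Eq. (10).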
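The iterative scheme can be sketched in a few lines of Python. The sketch rests on assumptions that should be checked against the original equations: a stationary transform 2N²/(1 + 2Ns), relaxation towards it at rate s + 1/(2N) during an epoch, rescaling by γ² at a size change, an extra constant N_i γ λ for an expansion (λ being the rate of a zero-truncated Poisson with mean γ), and M(r) = (2Lμ²/N_E²) ∂²ñ/∂s² at s = 2μr. All function names are hypothetical, and a central finite difference stands in for automatic differentiation.

```python
import math


def stationary(s, N):
    """Assumed Laplace transform of the pairings for constant size N."""
    return 2.0 * N**2 / (1.0 + 2.0 * N * s)


def truncated_poisson_rate(gamma):
    """Solve lam / (1 - exp(-lam)) = gamma by bisection (gamma > 1)."""
    lo, hi = 0.0, gamma
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        mean = mid / (-math.expm1(-mid)) if mid > 0.0 else 1.0
        if mean < gamma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def resize(n_before, N_old, N_new):
    """Affine map applied at an instantaneous population size change."""
    gamma = N_new / N_old
    if gamma <= 1.0:                      # reduction: sub-sample the population
        return gamma**2 * n_before
    lam = truncated_poisson_rate(gamma)   # expansion: add new pairs with tau = 0
    return gamma**2 * n_before + N_old * gamma * lam


def propagate(n_start, s, N, T):
    """Relax towards the stationary transform during an epoch of length T."""
    n_eq = stationary(s, N)
    return n_eq + (n_start - n_eq) * math.exp(-(s + 1.0 / (2.0 * N)) * T)


def n_tilde(s, N0, epochs):
    """Apply resize and propagate for each epoch (N_i, T_i) in order."""
    n, N_prev = stationary(s, N0), N0
    for N_i, T_i in epochs:
        n = propagate(resize(n, N_prev, N_i), s, N_i, T_i)
        N_prev = N_i
    return n


def ibs_density(r, mu, L, N0, epochs=()):
    """M(r) from a finite-difference second derivative of n_tilde in s."""
    sizes = [N0] + [N for N, _ in epochs]
    s = 2.0 * mu * r
    h = 1e-3 * (s + 1.0 / (2.0 * max(sizes)))   # step stays clear of the poles
    d2 = (n_tilde(s + h, N0, epochs) - 2.0 * n_tilde(s, N0, epochs)
          + n_tilde(s - h, N0, epochs)) / h**2
    return 2.0 * L * mu**2 / sizes[-1] ** 2 * d2
```

For example, `ibs_density(5e4, 2.36e-8, 3e9, 11280, [(1500, 1000), (10000, 1000)])` evaluates a two-epoch bottleneck scenario with illustrative parameter values, and calling the function with an empty epoch list recovers the stationary model.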
Our results above allow us to efficiently compute the length distribution of IBS tracts M for relevant evolutionary scenarios. We find that the following model can accurately describe the observed length distribution of IBS tracts for populations that migrated out of Africa. We consider a model with E = 2 epochs: an initial population with N_0 individuals goes through a bottleneck such that its population size shrinks to N_1 for a time T_1 and then grows again to a population of size N_2 for a time T_2. With this model we obtain very accurate fits of IBS tract length distributions for populations with different ancestries, see Fig. 3, where we combined IBS tracts for all individuals in an ancestry group (see Methods). Importantly, our model recovers the overrepresentation of long IBS tracts in populations that have migrated out of Africa, and can also be applied to single individuals, see Figs. 5(A) and 5(B) for exemplary individuals of African and European ancestries.
Comparing the two models (with and without a bottleneck) for an individual of African ancestry, the one with a bottleneck shows no significant improvement (see Fig. 5(A)), although the bottleneck model has 4 more parameters than the model with a stationary population. We further consider bootstrapped samples of the length distributions to compare the two models. For an individual of European ancestry we observe a significant increase (p < 10^−20) of the likelihoods for the model with a bottleneck, Fig. 6. In contrast, the likelihood distributions of the fits for an African individual overlap, indicating that the bottleneck model is not justified for African populations.
Next we fitted the stationary population model and the E = 2 epochs bottleneck model to all individuals in the 1000 genomes dataset [19]. As observed in the example (see Fig. 5(B)), we see significant improvements of the fits using the bottleneck model for South American, European, and Asian populations relative to the stationary population model, Fig. S2. The fitted demographic parameters are shown in Fig. 7. The estimates for N_0 are all very consistent within a 2% relative error margin and point to an ancestral effective population size of about 11,280 ± 80 individuals, see also Tab. S1.
Where appropriate, i.e. for South American, European, and Asian populations, we also show estimates for the duration and population size during the bottleneck, T1 and N1, and the times and population sizes after the bottleneck, T2 and N2, respectively. We observe no significant correlation between our estimates for these quantities and the estimate for N0. The lack of correlation gives us confidence that our estimates of the population sizes at different times are independent of each other.
For European and Asian populations, all estimates are consistent and point to a bottleneck lasting roughly 1,000 generations at a population size between 1,000 and 2,000 individuals, followed by population expansions to about 10,000 individuals over the following 1,000 generations. These estimates are close to what we know about the timing of the out-of-Africa bottleneck [31, 32]. Notably, the Peruvian ancestry group shows a significantly longer duration of the bottleneck period, offset by a shorter post-bottleneck period. This again is consistent with our knowledge of how humans migrated out of Africa and settled around the world. Some other individuals of South American ancestry show a similar pattern, although it is likely to be distorted by recent admixture events [7].
3 Discussion
In this article, we have studied the distribution of distances between heterozygous sites, or IBS tract lengths, in human genomes. We show that this distribution has a power-law tail with exponent −3. Using simple arguments and combining results for the distribution of times to the most recent common ancestor from coalescent theory and the stick-breaking process, we can analytically derive this power-law behavior.
Applied to a stationary population, we show that the length distribution of IBS tracts can be computed in closed form and depends on only two parameters: the size of the genome and θ = 4Nμ. This function fits very well to empirical data from genomes of African ancestry. Empirical data for populations that have undergone a bottleneck show significant deviations in the length distribution of IBS tracts. A generalization of our model to a demographic scenario with piecewise constant population sizes allows us to capture these deviations. We use these results to efficiently estimate population sizes before, during, and after the bottleneck, as well as the timing of the bottleneck for populations that migrated out of Africa. Interestingly, using our methodology, we do not see sufficient evidence to include a bottleneck in the model for African populations. This is in contrast to previous studies [6, 7], which inferred a mild bottleneck also for African individuals.
Some parts of the IBS tract length distributions are not well fitted by our model. Notably the upswing of the empirical distribution in the length range from 1 to 10 bp is not captured. This over-representation is due to multi-nucleotide substitutions, i.e. mutation events that change two nearby nucleotides at the same time. Such events are not accounted for in our model and generate many very short IBS tracts [33, 34]. Furthermore, it turns out that a large fraction of these events are also associated with small insertions and deletions in their neighborhood [34, 35], and we did not include such more complex changes of the DNA sequence in our modeling.
In addition, if we compare the empirical length distributions of IBS tracts for all individuals in a population (Figs. S3-S8), we observe that they are very consistent for lengths up to 100 kbp and become noisy beyond this length scale. This noise is due to the fact that only a few such tracts are observed in an individual’s genome, sometimes leading to over- or under-representation of very long IBS tracts. However, in certain populations, especially individuals of Gambian, Punjabi, Tamil, and Telugu ancestry, see Fig. S3(B) and S8(C-E), a large fraction of individuals show an upswing in the IBS tract length distribution for tracts longer than 100 kbp. This overrepresentation of very long IBS tracts might reflect inbreeding, which is more common in the above populations as previously reported [36]. Quantifying this effect could allow one to define an inbreeding coefficient for single individuals without knowledge of a pedigree.
Here we only analyzed the distribution of heterozygous sites in humans. Of note, our method requires neither a reference genome (except for the purpose of efficiently calling SNPs from short-read sequencing data) nor the phase of SNPs along the maternal and paternal chromosomes. It is therefore straightforward to apply our methodology to other diploid species, and to haploid species if the genomes of two individuals are available. It can thus be of great use in species for which only little genomic information is available, for instance endangered or extinct species.
4 Materials and Methods
Files containing calls for phased variants against the hg38 reference genome of 2491 individuals have been downloaded from the website ftp.1000genomes.ebi.ac.uk of the 1000 genomes project [19, 20]. We excluded several individuals from the original 2504 individuals since they were reported to be children of other individuals in the same data set.
Insertions and deletions have been disregarded. The length of an IBS tract was computed as the difference ℓ = p_{i+1} − p_i of the positions p_i of two consecutive heterozygous sites, irrespective of which of the parental chromosomes carries the variant. We disregarded homozygous SNPs, which appear on both chromosomes of an individual. Note that an IBS tract of length ℓ has ℓ − 1 exactly matching base pairs in the two parental chromosomes. IBS tracts spanning centromeres and those at the very tips of chromosomes are disregarded as well. In our considerations the phase of a SNP is irrelevant.
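The tract extraction described above can be sketched as follows. The function name and the representation of excluded regions (for example a centromere) as (start, end) intervals are assumptions made for illustration. Tracts at the chromosome tips never appear because only distances between two observed heterozygous sites are recorded.

```python
def ibs_tract_lengths(het_positions, excluded_regions=()):
    """Distances between consecutive heterozygous sites on one chromosome.

    het_positions: positions of heterozygous sites, in any order.
    excluded_regions: (start, end) intervals; tracts overlapping one of
    them, such as a centromere, are dropped.
    """
    positions = sorted(set(het_positions))
    lengths = []
    for left, right in zip(positions, positions[1:]):
        overlaps = any(left < end and right > start
                       for start, end in excluded_regions)
        if not overlaps:
            lengths.append(right - left)
    return lengths


# Toy example with a hypothetical excluded region between 400 and 600.
example = ibs_tract_lengths([300, 100, 105, 1000], excluded_regions=[(400, 600)])
```

Pooling the lists returned for all autosomes then gives the per-individual length distribution.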
The length distributions of all autosomes have been pooled after verifying that data from individual chromosomes do not deviate significantly from each other.
For graphical representation of count data in double-logarithmic length distributions, we have binned the data into bins of equal size along the logarithmic x-axis. We plot the density of counts on the y-axis, which is computed as the number of counts in an interval normalized by the length of the interval.
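A minimal sketch of such a logarithmic binning is given below. The number of bins per decade and the use of the geometric bin center as the plotting position are assumptions, as the text does not specify them.

```python
import math
from collections import Counter


def log_binned_density(lengths, bins_per_decade=10):
    """Return (bin center, count density) pairs for logarithmically spaced bins.

    The density is the number of tracts in a bin divided by the bin width,
    so that the values are comparable across bins of different width.
    """
    counts = Counter()
    for r in lengths:
        counts[math.floor(math.log10(r) * bins_per_decade)] += 1
    density = []
    for k in sorted(counts):
        lower = 10 ** (k / bins_per_decade)
        upper = 10 ** ((k + 1) / bins_per_decade)
        center = math.sqrt(lower * upper)   # geometric center of the bin
        density.append((center, counts[k] / (upper - lower)))
    return density


binned = log_binned_density([2, 2, 3, 50], bins_per_decade=1)
```

Empty bins are simply omitted, which is convenient for double-logarithmic plots where zero densities cannot be drawn.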
For graphs of the length distributions of IBS tracts for an ancestry group we aggregated the corresponding distributions for all individuals in that group and normalized the density by the number of individuals in that group.
All computations are performed using the Julia programming language [37]. The Laplace transform of the distribution of pairings ñ(s, t) was computed as described in the text, see Eq. (18). The second derivative, see Eq. (10), was computed using automatic differentiation [38].
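To illustrate how a second derivative can be obtained by automatic differentiation, the following Python sketch propagates a value together with its first and second derivative through arithmetic operations. It is not the Julia implementation used for the analysis. The class `Jet2` and the helper `jet_exp` are hypothetical names, and the stationary transform 2N²/(1 + 2Ns) used in the example is an assumption.

```python
import math


class Jet2:
    """Value, first and second derivative with respect to one variable."""

    def __init__(self, v, d1=0.0, d2=0.0):
        self.v, self.d1, self.d2 = v, d1, d2

    @staticmethod
    def lift(x):
        return x if isinstance(x, Jet2) else Jet2(float(x))

    def __add__(self, other):
        o = Jet2.lift(other)
        return Jet2(self.v + o.v, self.d1 + o.d1, self.d2 + o.d2)

    __radd__ = __add__

    def __neg__(self):
        return Jet2(-self.v, -self.d1, -self.d2)

    def __sub__(self, other):
        return self + (-Jet2.lift(other))

    def __rsub__(self, other):
        return Jet2.lift(other) + (-self)

    def __mul__(self, other):
        o = Jet2.lift(other)
        return Jet2(self.v * o.v,
                    self.d1 * o.v + self.v * o.d1,
                    self.d2 * o.v + 2.0 * self.d1 * o.d1 + self.v * o.d2)

    __rmul__ = __mul__

    def reciprocal(self):
        inv = 1.0 / self.v
        return Jet2(inv, -self.d1 * inv**2,
                    2.0 * self.d1**2 * inv**3 - self.d2 * inv**2)

    def __truediv__(self, other):
        return self * Jet2.lift(other).reciprocal()

    def __rtruediv__(self, other):
        return Jet2.lift(other) * self.reciprocal()


def jet_exp(x):
    """Exponential of a Jet2 using the chain rule up to second order."""
    e = math.exp(x.v)
    return Jet2(e, e * x.d1, e * (x.d2 + x.d1**2))


# Second derivative of the assumed stationary transform at s = 1e-3.
N = 1000.0
s = Jet2(1e-3, 1.0, 0.0)            # seed: ds/ds = 1, d2s/ds2 = 0
n_stationary = 2.0 * N**2 / (1.0 + 2.0 * N * s)
decay = jet_exp(-500.0 * s)         # factor of the kind that appears during an epoch
```

Because every operation carries exact derivative rules, `n_stationary.d2` is free of the step-size trade-off of finite differences.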
To fit the model to empirical data we minimized the loss function
$$\mathcal{L} = -\sum_{r} \log P_{M(r)}\big(C(r)\big),$$
where C(r) is the count of IBS tracts of length r, P_λ denotes the Poisson distribution with mean λ, and lengths r ≤ 1,000,000 are considered. The modelled IBS tract length distribution M(r) depends on the sequence length, L, the initial population size, N_0, as well as the lengths and population sizes, T_i and N_i, of subsequent epochs, i = 1,…, E, see Eq. (18). The steady state model has E = 0.
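A Poisson loss of this kind can be sketched as below, assuming that the expected count of tracts of integer length r is given directly by the model value M(r) and that lengths without any observed tract contribute with a count of zero. The names `poisson_loss` and `model` are illustrative.

```python
import math


def poisson_loss(counts, model, r_max):
    """Negative Poisson log-likelihood summed over lengths 1..r_max.

    counts: dict mapping tract length r to the observed count C(r).
    model:  callable returning the expected count M(r) > 0.
    """
    loss = 0.0
    for r in range(1, r_max + 1):
        expected = model(r)
        observed = counts.get(r, 0)
        # log P_lambda(c) = c log(lambda) - lambda - log(c!)
        loss -= (observed * math.log(expected) - expected
                 - math.lgamma(observed + 1))
    return loss


toy_counts = {1: 5, 2: 3}
loss_at_counts = poisson_loss(toy_counts, lambda r: float(toy_counts[r]), 2)
loss_off_counts = poisson_loss(toy_counts, lambda r: 1.5 * toy_counts[r], 2)
```

In the toy example the loss is smallest when the model reproduces the counts exactly, which is the property exploited when fitting the demographic parameters.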
We use Markov Chain Monte Carlo sampling to find the minimum of the loss function in the parameter space. Specifically, we use the No-U-Turn Sampler (NUTS) to adaptively set the path lengths in Hamiltonian Monte Carlo [39, 40]. The reported parameter values are mean values of at least 10 MCMC chains with 10,000 samples after 10,000 burn-in iterations. We verified that the MCMC chains converged. As an example we show the cumulative density distributions for all model parameters and the log likelihood for several chains in Fig. S9.
To assess the stability of our parameter estimates and to compare models with different numbers of parameters (see Fig. 6) we considered 2,000 bootstrap samples of IBS tract lengths and re-estimated all parameters.
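A bootstrap over tract lengths can be sketched as resampling with replacement. This is an assumption about the resampling unit, which the text does not specify further, and the function name and seed handling are illustrative.

```python
import random


def bootstrap_samples(lengths, n_samples, seed=0):
    """Yield n_samples resamples of the tract lengths, drawn with replacement."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        yield rng.choices(lengths, k=len(lengths))


resamples = list(bootstrap_samples([120, 35, 4800, 17, 950], n_samples=3))
```

Each resample would then be fitted in the same way as the original data, and the spread of the refitted parameters and likelihoods indicates their stability.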
A Supplementary Material
Normalisation of M
The length distribution of IBS tracts M is normalized such that their total number is
$$\int_0^\infty M(r)\, dr = L\theta$$
and their total length is
$$\int_0^\infty r\, M(r)\, dr = L.$$
Therefore, the mean length of IBS tracts is L/(Lθ) = 1/θ. However, due to the power-law nature of the length distribution with exponent −3, the standard deviation of tract lengths diverges.
The number of pairings for a population size increase
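As described in the main text, for an instantaneous expansion of the population size at time t_i, i.e. N_{i+1} > N_i, we model that each present gamete will generate several offspring to make up the next generation. We assume that the number of offspring, p, for each gamete is distributed according to a truncated Poisson distribution:
$$P(p) = \frac{\lambda_i^{p}}{p!\,\left(e^{\lambda_i}-1\right)}$$
for p ≥ 1 and vanishes for p = 0, i.e. we assume P(0) = 0 and that no gamete is lost. On average each gamete will have γ_i = N_{i+1}/N_i > 1 offspring, so that λ_i is fixed by γ_i = λ_i/(1 − e^{−λ_i}). A gamete that has p offspring will generate p(p − 1)/2 pairs with τ = 0. For the whole population of 2N_i gametes the expected number of pairings with τ = 0 is
$$2N_i \left\langle \frac{p(p-1)}{2} \right\rangle = N_i\,\gamma_i\,\lambda_i.$$
The resulting population at the time t_i^+ just after the expansion will be of size N_{i+1} and the number of pairings is
$$n(\tau,t_i^+) = \gamma_i^2\, n(\tau,t_i^-) + N_i\,\gamma_i\,\lambda_i\,\delta(\tau),$$
where t_i^- denotes a time just before the expansion. In Laplace space we recover Eq. (16).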
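The moments of the offspring distribution can be verified numerically. The sketch below assumes the standard zero-truncated Poisson with rate λ, for which the mean is λ/(1 − e^−λ) and the expected number of offspring pairs per gamete is half of λ times the mean. The function names and the truncation of the sums at `p_max` are illustrative choices.

```python
import math


def truncated_poisson_pmf(p, lam):
    """Zero-truncated Poisson probability of p offspring (zero for p < 1)."""
    if p < 1:
        return 0.0
    return math.exp(p * math.log(lam) - math.lgamma(p + 1)) / math.expm1(lam)


def offspring_moments(lam, p_max=200):
    """Mean offspring number and expected number of offspring pairs p(p-1)/2."""
    mean = sum(p * truncated_poisson_pmf(p, lam) for p in range(1, p_max + 1))
    pairs = sum(0.5 * p * (p - 1) * truncated_poisson_pmf(p, lam)
                for p in range(1, p_max + 1))
    return mean, pairs


rate = 2.0
mean_offspring, mean_pairs = offspring_moments(rate)
```

As the rate tends to zero, every gamete has exactly one offspring and no new pairs arise, which matches the requirement that an unchanged population size leaves the pairings untouched.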
A model for an instantaneous bottleneck
Although the function M(r) can be computed analytically, the actual expressions become quite long due to repeated applications of the transformations, see Eq. (18), and especially due to the second derivative, see Eq. (10). However, the function can easily be computed programmatically; especially for the second derivative, automatic differentiation can be used, see Methods.
Here we only want to discuss M in an E = 2 epoch model with N_0 > N_1 < N_2, assuming that the time spent in the bottleneck vanishes, i.e. T_1 → 0. In this limit, ñ(s, t_2) follows from Eq. (18) and M(r) from Eq. (10). The humped part of this distribution is dominated by a term proportional to exp(−2μT_2 r), and the hump itself is located at r_hump = 3/(2μT_2). Therefore older bottlenecks will be responsible for humps at smaller IBS tract lengths. Likewise, from the position of a hump one can roughly estimate the time of a bottleneck to be T_2 = 3/(2μ r_hump). This general behavior will also hold for finite-time bottlenecks.
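The relation between hump position and bottleneck time is easy to evaluate. The sketch below assumes that the hump is measured relative to the r^−3 tail, i.e. that it sits at the maximum of r³·exp(−2μT_2 r), which gives r_hump = 3/(2μT_2). The function names and the example values are illustrative.

```python
def hump_position(mu, T2):
    """Tract length at which r^3 * exp(-2 mu T2 r) is maximal."""
    return 3.0 / (2.0 * mu * T2)


def time_since_bottleneck(mu, r_hump):
    """Invert the hump position to estimate the time since the bottleneck."""
    return 3.0 / (2.0 * mu * r_hump)


mu = 2.36e-8                       # mutation rate per bp and generation from the text
r_hump = hump_position(mu, 1000.0)  # bottleneck that ended 1,000 generations ago
T2_back = time_since_bottleneck(mu, r_hump)
```

With these values the hump falls between 10 and 100 kbp, the range in which the excess of IBS tracts is reported for populations that migrated out of Africa.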