Abstract
Heterozygous sites are not uniformly distributed along a diploid genome. Rather, their density varies as a result of recombination events, and their local density reflects the time to the last common ancestor of the maternal and paternal copies of a genomic region. The distribution of the density of heterozygous sites therefore carries information about the history of the population size. Despite previous efforts, an exact derivation of the distribution of heterozygous sites is still lacking. As a consequence, the estimation of population size variation is difficult and requires several simplifying assumptions. Using a novel theoretical framework, we derive an analytical formula for the distribution of distances between heterozygous sites. Our theory can account for arbitrary demographic histories, including bottlenecks. In the case of a constant population size, the distribution follows a simple function and exhibits a power-law tail proportional to r^α with α = −3, where r is the distance between heterozygous sites. This prediction agrees closely with the distribution observed for heterozygous sites in individuals of African descent. Other populations migrated out of Africa and underwent at least one bottleneck, which left a distinct mark on their distribution of intervals between heterozygous sites, namely an overrepresentation of intervals between 10 and 100 kbp in length. Our analytical theory for non-constant population sizes reproduces this behavior and can be used to study historical changes in population size with high accuracy. The simplicity of our approach facilitates the analysis of demographic histories for diploid species, requiring only a single unphased genome.
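To make the central quantity concrete, the following is a minimal sketch of how the empirical distribution of distances r between consecutive heterozygous sites could be tabulated from a list of site positions on one chromosome, and how a tail exponent could be roughly estimated from it. It is not the analytical method described in this abstract: the function names (`het_intervals`, `log_binned_density`, `loglog_slope`), the logarithmic binning, and the least-squares fit in log-log space are all assumptions chosen for illustration, and a log-log fit is only a crude estimator of a power-law exponent.

```python
import math

def het_intervals(positions):
    """Distances between consecutive heterozygous sites (one chromosome)."""
    pos = sorted(positions)
    return [b - a for a, b in zip(pos, pos[1:])]

def log_binned_density(intervals, bins_per_decade=5):
    """Empirical density of interval lengths on logarithmically spaced bins.

    Returns (centers, density), where centers are geometric bin midpoints.
    The binning scheme is an illustrative assumption.
    """
    vals = [r for r in intervals if r > 0]  # drop zero-length intervals
    if not vals:
        return [], []
    lo = math.floor(math.log10(min(vals)))  # exponent of the first bin edge
    idx = [int((math.log10(r) - lo) * bins_per_decade) for r in vals]
    counts = [0] * (max(idx) + 1)
    for k in idx:
        counts[k] += 1
    centers, density = [], []
    for k, c in enumerate(counts):
        left = 10 ** (lo + k / bins_per_decade)
        right = 10 ** (lo + (k + 1) / bins_per_decade)
        centers.append(math.sqrt(left * right))
        # normalise by total count and bin width to obtain a density
        density.append(c / (len(vals) * (right - left)))
    return centers, density

def loglog_slope(centers, density, r_min):
    """Least-squares slope of log10(density) against log10(r) for r >= r_min."""
    pts = [(math.log10(c), math.log10(d))
           for c, d in zip(centers, density) if c >= r_min and d > 0]
    if len(pts) < 2:
        raise ValueError("need at least two non-empty bins above r_min")
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    return sxy / sxx
```

Under these assumptions, a slope close to −3 over the large-r bins would be consistent with the constant-population-size prediction, while an excess of density in the 10 to 100 kbp range would show up as a bump above that straight line.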
Competing Interest Statement
The authors have declared no competing interest.