Abstract
The site frequency spectrum in human populations is not accurately modeled by an infinite-sites model, which assumes that every mutation arises only once at a given site. Despite the pervasiveness of recurrent mutations, we lack computational methods to identify these events at specific sites in population sequencing data. Rare alleles that are identical-by-descent (IBD) are expected to segregate on a long, shared haplotype background that descends from a common ancestor. In contrast, alleles introduced by recurrent mutation or by non-crossover gene conversion are identical-by-state and are expected to share a shorter haplotype background. We hypothesized that this expected difference in shared haplotype background length can distinguish IBD from non-IBD variants in population sequencing data without pedigree information. We implemented a Bayesian hierarchical model and used Gibbs sampling to estimate the posterior probability of IBD state for rare variants, and we used simulations to demonstrate that our approach accurately distinguishes rare IBD and non-IBD variants. Applying our method to whole-genome sequencing data from 3,621 individuals in the UK10K consortium, we found that non-IBD variants were correlated with higher local mutation rates and with genomic features such as replication timing. Using a heuristic to categorize non-IBD variants as gene conversions or recurrent mutations, we found that potential gene conversions had expected properties such as elevated local GC content. Identifying recurrent mutations allows us to better understand the spectrum of recent mutations in human populations, which is a source of the genetic variation that drives evolution and a key factor in understanding recent demographic history.
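To illustrate the intuition behind the approach, the following is a minimal sketch and not the model used in this work. It assumes, purely for illustration, that shared haplotype lengths follow a two-component exponential mixture (a long-haplotype IBD component and a short-haplotype non-IBD component) with conjugate Beta and Gamma priors. The function name `gibbs_ibd_posterior`, its parameters, and the choice of priors are all hypothetical.

```python
import math
import random


def gibbs_ibd_posterior(lengths, n_iter=600, burn_in=200, seed=0):
    """Toy Gibbs sampler for P(variant is IBD | shared haplotype length).

    Assumption (not from the paper): lengths are drawn from a mixture of two
    exponentials, and the IBD component is the one with the smaller rate
    (i.e. the longer mean shared haplotype).
    """
    rng = random.Random(seed)
    n = len(lengths)
    total = sum(lengths)
    mean_len = total / n

    # Starting values: IBD haplotypes longer than average, non-IBD shorter.
    rate_ibd, rate_non = 0.5 / mean_len, 2.0 / mean_len
    pi_ibd = 0.5
    ibd_counts = [0] * n

    for it in range(n_iter):
        # Step 1: sample each variant's IBD indicator given current parameters.
        z = []
        for x in lengths:
            log_odds = (math.log(pi_ibd) - math.log(1.0 - pi_ibd)
                        + math.log(rate_ibd) - math.log(rate_non)
                        - (rate_ibd - rate_non) * x)
            log_odds = max(-50.0, min(50.0, log_odds))  # avoid overflow
            p_ibd = 1.0 / (1.0 + math.exp(-log_odds))
            z.append(rng.random() < p_ibd)

        # Step 2: sample the IBD fraction from its Beta(1, 1)-prior posterior.
        n_ibd = sum(z)
        pi_ibd = rng.betavariate(1 + n_ibd, 1 + n - n_ibd)
        pi_ibd = min(max(pi_ibd, 1e-9), 1.0 - 1e-9)

        # Step 3: sample each rate from its Gamma posterior. The prior is
        # Gamma(shape=1, rate=mean_len); gammavariate takes a SCALE argument.
        sum_ibd = sum(x for x, is_ibd in zip(lengths, z) if is_ibd)
        sum_non = total - sum_ibd
        rate_ibd = rng.gammavariate(1 + n_ibd, 1.0 / (mean_len + sum_ibd))
        rate_non = rng.gammavariate(1 + n - n_ibd, 1.0 / (mean_len + sum_non))

        # Simple ordering constraint to guard against label switching.
        if rate_ibd > rate_non:
            rate_ibd, rate_non = rate_non, rate_ibd
            pi_ibd = 1.0 - pi_ibd

        # After burn-in, tally how often each variant was assigned to IBD.
        if it >= burn_in:
            for i, is_ibd in enumerate(z):
                ibd_counts[i] += is_ibd

    return [c / (n_iter - burn_in) for c in ibd_counts]
```

Calling `gibbs_ibd_posterior` on a list of shared haplotype lengths (in any consistent unit) returns one estimated posterior IBD probability per variant. Under these toy assumptions, variants on long shared haplotypes receive probabilities near 1 and those on short haplotypes receive probabilities near 0. A realistic model would also need to account for allele frequency, local recombination rate, and demographic history.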
Competing Interest Statement
The authors have declared no competing interest.