PT - JOURNAL ARTICLE
AU - Noto, Keith
AU - Ruiz, Luong
TI - Accurate Genome-Wide Phasing from IBD Data
AID - 10.1101/2022.04.11.487932
DP - 2022 Jan 01
TA - bioRxiv
PG - 2022.04.11.487932
4099 - http://biorxiv.org/content/early/2022/04/12/2022.04.11.487932.short
4100 - http://biorxiv.org/content/early/2022/04/12/2022.04.11.487932.full
AB - As genotype databases increase in size, so too does the number of detectable segments of identity by descent (IBD): segments of the genome where two individuals share an identical copy of one of their two parental haplotypes, due to shared ancestry. We show that given a large enough genotype database, these segments of IBD collectively overlap entire chromosomes and can be used to phase them accurately. Furthermore, primarily using instances of IBD that span multiple chromosomes, we can accurately phase an entire genome. We are able to separate the DNA inherited from each parent completely, across the entire genome, with 98% median accuracy in a test set of 30,000 individuals. We estimate the IBD data requirements for accurate phasing, and we propose a method for estimating confidence in the resulting phase. We show that our methods do not require the genotypes of close family, and that they are robust to genotype errors and missing data. In fact, our method can impute missing data accurately and correct genotype errors. Author summary: We present a method for phasing an entire genome, that is, separating the DNA inherited from each parent, using short segments of DNA that are shared between the genome of the person we wish to phase and those of distant cousins.
Essentially, if we have enough of these distant cousins’ genotypes available, we can piece together overlapping segments until we have phased the entire genome. We have developed a method that can phase large numbers of individuals, and that makes special considerations for potential close family and potential genotype errors in data. We report results on experiments phasing 30,000 individuals. We analyze how many such segments are required to phase accurately, and propose a model-based method to estimate accuracy. We also show that our method can accurately impute missing data and correct genotype errors. Competing Interest Statement: The authors declare competing financial interests: authors affiliated with AncestryDNA may have equity in Ancestry. The work described in this manuscript is covered by one or more patents including U.S. Patent Application Publication No. 2021/0034647, titled "Clustering of Matched Segments to Determine Linkage of Dataset in a Database."