An effective filter for IBD detection in large data sets

Lin Huang; Sivan Bercovici; Jesse M Rodriguez; Serafim Batzoglou

doi:10.1371/journal.pone.0092713

PLoS One. 2014 Mar 25;9(3):e92713. doi: 10.1371/journal.pone.0092713. eCollection 2014.

Authors

Lin Huang¹, Sivan Bercovici¹, Jesse M Rodriguez¹, Serafim Batzoglou¹

Affiliation

¹ Department of Computer Science, Stanford University, Stanford, California, United States of America.

Abstract

Identity by descent (IBD) inference is the task of computationally detecting genomic segments that are shared between individuals by means of common familial descent. Accurate IBD detection plays an important role in various genomic studies, ranging from mapping disease genes to exploring ancient population histories. The majority of recent work in the field has focused on improving the accuracy of inference, targeting shorter genomic segments that originate from a more ancient common ancestor. The accuracy of these methods, however, is achieved at the expense of high computational cost, resulting in a prohibitively long running time when applied to large cohorts. To enable the study of large cohorts, we introduce SpeeDB, a method that facilitates fast IBD detection in large unphased genotype data sets. Given a target individual and a database of individuals that potentially share IBD segments with the target, SpeeDB applies an efficient opposite-homozygous filter, which excludes chromosomal segments from the database that are highly unlikely to be IBD with the corresponding segments from the target individual. The remaining segments can then be evaluated by any IBD detection method of choice. When examining simulated individuals sharing 4 cM IBD regions, SpeeDB filtered out 99.5% of genomic regions from consideration while retaining 99% of the true IBD segments. Applying the SpeeDB filter prior to detecting IBD in simulated fourth cousins resulted in an overall running time that was 10,000x faster than inferring IBD without the filter and retained 99% of the true IBD segments in the output.
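The opposite-homozygous idea described in the abstract can be illustrated with a small sketch. This is not SpeeDB's implementation. It assumes unphased genotypes coded as 0, 1, or 2 copies of the alternate allele, fixed-size windows of sites, and a tolerance for genotyping error. The names filter_windows, window_size, and max_opposite are hypothetical and chosen only for illustration. The underlying observation is that two individuals who are IBD over a region share at least one haplotype there, so, barring genotyping error, they cannot be homozygous for different alleles at the same site.

```python
from typing import List, Sequence


def opposite_homozygous(a: int, b: int) -> bool:
    """Return True when one genotype is 0 and the other is 2.

    Genotypes are assumed to be alternate-allele counts (0, 1, or 2).
    Any other pairing, including heterozygous calls, is not opposite.
    """
    return {a, b} == {0, 2}


def filter_windows(
    target: Sequence[int],
    candidate: Sequence[int],
    window_size: int = 4,   # hypothetical window length, in sites
    max_opposite: int = 0,  # hypothetical tolerance for genotyping errors
) -> List[bool]:
    """Return one flag per window: True means keep for full IBD analysis."""
    if len(target) != len(candidate):
        raise ValueError("genotype vectors must cover the same sites")
    keep = []
    for start in range(0, len(target), window_size):
        t = target[start:start + window_size]
        c = candidate[start:start + window_size]
        # Count sites where the two individuals cannot share an allele.
        n_opposite = sum(opposite_homozygous(a, b) for a, b in zip(t, c))
        keep.append(n_opposite <= max_opposite)
    return keep


target = [0, 1, 2, 2, 0, 0, 1, 2]
candidate = [0, 2, 2, 1, 2, 0, 1, 0]
print(filter_windows(target, candidate))  # [True, False]
```

In this toy input the second window contains two opposite-homozygous sites, so it is excluded, while the first window is passed on to whichever IBD detection method is used downstream. Raising max_opposite trades filtering power for robustness to genotyping error.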

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Databases, Nucleic Acid*
Datasets as Topic
Sequence Analysis, DNA / methods*

Grants and funding

T15 LM007033/LM/NLM NIH HHS/United States