A fast and simple method for detecting identity by descent segments in large-scale data

Ying Zhou; Sharon R. Browning; Brian L. Browning

doi:10.1101/2019.12.12.874685

Abstract

Segments of identity by descent (IBD) are used in many genetic analyses. We present a method for detecting identical-by-descent haplotype segments that is optimized for large-scale genotype data. Our method, called hap-IBD, combines a compressed representation of genotype data, the positional Burrows-Wheeler transform, and multi-threaded execution to produce very fast analysis times. An attractive feature of hap-IBD is its simplicity: the input parameters clearly and precisely define the IBD segments that are reported, so that program correctness can be confirmed by users.

We evaluate hap-IBD and four state-of-the-art IBD segment detection methods (GERMLINE, iLASH, RaPID, and TRUFFLE) using UK Biobank chromosome 20 data and simulated sequence data. We show that hap-IBD detects IBD segments faster and more accurately than competing methods, and that hap-IBD is the only method that can rapidly and accurately detect short 2-4 cM IBD segments in the full UK Biobank data. Analysis of 485,346 UK Biobank samples using hap-IBD with 12 computational threads detects 231.5 billion autosomal IBD segments with length ≥2 cM in 24.4 hours.

Introduction

Segments of identity by descent (IBD) are genomic regions over which a pair of individuals share a haplotype due to inheritance from a recent common ancestor. IBD segments are useful in a wide variety of applications because they capture information about genetic relationships between individuals. Correlation between pairwise IBD and phenotypic similarity can be used to detect genomic regions harboring trait-affecting variants^1-6 and to estimate heritability.^7-10 IBD segments are also used to estimate kinship coefficients,¹¹ detect close relationships,^12-14 and identify fine-scale population structure.^15-20

Recent demographic history can be inferred from IBD segments.^{15; 16; 21-24} Populations with smaller effective population size have more IBD sharing because individuals are more closely related on average. Short segments have a larger time to the most recent common ancestor (TMRCA), and thus are informative about less recent effective size, while long segments have a smaller TMRCA and are informative about very recent effective size. Similarly, IBD segments shared between populations are informative about migration rates. Approximately the past 100 generations of demographic history can be inferred from IBD segments.²³

IBD segments are also useful for estimating population genetic parameters, including mutation rates^25-28 and recombination rates,²⁹ and for detecting regions undergoing recent selection.^{10; 20; 30-32} The mutation rate is estimated from the observed discordance rate in IBD haplotypes. Recombination rate maps can be estimated using the rate of IBD segment endpoints. Selection is detected by looking for genomic regions with higher rates of IBD sharing.

There are several classes of methods for detecting IBD segments. The first class of methods are probabilistic. These methods include PLINK,² Beagle IBD,³³ and others.^{4; 10; 34-39} For these methods, the unobserved IBD status for a pair (or set) of individuals at a locus takes two (IBD/non-IBD) or more possible states. Typically, a hidden Markov model is used to infer the IBD state at each marker. In the context of a pedigree, with shared haplotypes inherited only through pedigree founders, this IBD-state approach makes sense. However, in population samples, the concept of “IBD state” is ill-defined. Two haplotypes are identical by descent if they are descended from a common ancestor, which is trivially true for all pairs of haplotypes at each position in the genome.

The second class of methods, which includes all the methods presented in this paper, look for long segments of identical-by-state allele sharing either in phased or in unphased genotype data. These identity-by-state (IBS) methods include GERMLINE⁴⁰ and others.^41-44 In contrast to most of the probabilistic methods, these methods do not dichotomize pairwise haplotypic sharing into “IBD” and “non-IBD”, but instead dichotomize it into “long IBD” and “not long IBD”, which better fits the realities of population-based IBD sharing. Ideally, reported IBD segments should primarily represent IBD from a single common ancestor, rather than a conflation of segments from multiple ancestors, and this is achieved when the length threshold is relatively long.⁴⁵ A drawback of these methods is that the handling of allelic discordances within IBD segments tends to be ad hoc.

For IBS methods, the requirement that two individuals share a haplotype is more stringent than the requirement that the two individuals share at least one allele in their genotypes across a given region. Thus, haplotype-based methods can detect short IBD segments (e.g., 2-10 cM in length) with much greater accuracy than genotype-based methods. However, haplotype-based methods can break up a long IBD segment into a sequence of shorter IBD segments if there are haplotype phasing errors in the long IBD segment. For some downstream applications, it is necessary to perform a merging step after detecting IBD segments in order to recover the original long IBD haplotype. On the other hand, genotype-based methods do not require accurately phased genotype data, and they can detect long segments (≥ 15 cM) with high accuracy, which is sufficient for highly accurate detection of first- and second-degree relatives.¹³

A third class of methods are those that combine aspects of probabilistic modelling and length-based thresholding on IBS. Typically these methods detect candidate long shared segments, and then form a likelihood ratio for IBD versus non-IBD.^{11; 46-48} These methods tend to be more computationally efficient than the full probabilistic methods, but they cannot analyze large, biobank data sets, unlike some of the purely length-based IBS methods.

Although “identity by descent” implies allelic identity, in fact positions of discordance will be observed. Causes of this discordance include mutation or gene conversion since the common ancestor, and genotype error. Probabilistic methods allow for these discordances via an error term in the modelling, while length-based methods allow for short, infrequent gaps in allele sharing.

The genome-wide average mutation rate in humans (1.3 × 10⁻⁸ per basepair per meiosis²⁸) is similar to the genome-wide average crossover rate (1.2 × 10⁻⁸ per basepair per meiosis⁴⁹). Consequently, IBD segments from sequence data will each contain an average of approximately one discordance due to mutation.

Gene conversions occur at a rate of 6 × 10⁻⁶ per basepair per meiosis in humans,⁵⁰ while crossovers occur at a rate of 1.2 × 10⁻⁸ per basepair per meiosis on average. Thus, an average of around five hundred basepairs per IBD segment will have been subject to gene conversion since the common ancestor. Within a gene conversion tract in an IBD segment, allelic discordance occurs only at positions that were heterozygous in the individual who underwent gene conversion at the locus. Mean heterozygosity in human populations is generally less than or equal to 1 heterozygote per kb.⁵¹ Thus five hundred basepairs of gene-converted DNA will result in an average of less than 0.5 discordances per IBD segment (depending on the heterozygosity of the population).
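The back-of-envelope arithmetic in the two preceding paragraphs can be reproduced in a few lines of Python. This is an illustrative sketch only: the function name `events_per_segment` is hypothetical, and it assumes, as the text appears to, that the expected number of events per IBD segment is approximated by the ratio of the per-basepair event rate to the per-basepair crossover rate.

```python
# Rates quoted in the text (per basepair per meiosis).
MUTATION_RATE = 1.3e-8
CROSSOVER_RATE = 1.2e-8
GENE_CONVERSION_RATE = 6e-6
# Upper bound on mean heterozygosity quoted in the text: 1 heterozygote per kb.
MAX_HETEROZYGOSITY_PER_BP = 1.0 / 1000


def events_per_segment(event_rate, crossover_rate=CROSSOVER_RATE):
    """Approximate events per IBD segment as the ratio of the two rates (assumption)."""
    return event_rate / crossover_rate


# Roughly one mutational discordance per segment.
mutations_per_segment = events_per_segment(MUTATION_RATE)

# Roughly five hundred gene-converted basepairs per segment.
converted_bp_per_segment = events_per_segment(GENE_CONVERSION_RATE)

# Only heterozygous positions in the converted tract can become discordant,
# so this is an upper bound of about 0.5 discordances per segment.
gc_discordance_upper_bound = converted_bp_per_segment * MAX_HETEROZYGOSITY_PER_BP

print(round(mutations_per_segment, 2),
      round(converted_bp_per_segment),
      round(gc_discordance_upper_bound, 2))
```

Under these assumptions, mutation and gene conversion together contribute on the order of one to two discordant positions per IBD segment, which is why a small number of short gaps is enough to accommodate them.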

Genotype error rates vary greatly across data sets. Data from two recent studies give genotype error rate estimates of 0.008 per Mb per individual in a large SNP array study⁵² and 25 per Mb per individual for single nucleotide variants in a large sequencing study⁵³ (error rates estimated as half the discordance between duplicate samples after quality control filtering, multiplied by the average number of called/assayed variants per Mb). Exclusion of rare variants can decrease the genotype error rate,⁵⁴ particularly for sequence data.⁵³

With increasingly large data sets, computational issues become significant. The detection of sets of shared haplotypes can be reduced to linear computational complexity by means of hashing^{40; 44} or by use of the positional Burrows-Wheeler transform (PBWT).⁴³ However, the generation of pairwise IBD segments from these sets scales quadratically with sample size, because the number of pairs of individuals grows quadratically with sample size. Consequently, detecting IBD segments in biobank-scale data is challenging. As well as computation time being an issue, some algorithms require infeasibly large amounts of computer memory to analyze such data sets.

In this work, we present hap-IBD, which scales to biobank-sized data, provides greater accuracy than competing methods, and is notable for the simplicity of its algorithm and tuning parameters. Hap-IBD utilizes the PBWT⁵⁵ and parallel computation to reduce computing time, and it uses data compression to reduce memory requirements.⁵⁶ It addresses the issue of allele discordance in IBD segments by requiring that a reported segment have a central core (the “seed”) that is free of discordance, while allowing extension beyond the seed after a short gap containing discordance. The key parameters for hap-IBD are the minimum seed length, the minimum extension length, the maximum gap length, and the minimum length of reported IBD segments. These parameters directly control which IBD segments are detected and reported. The hap-IBD program is open-source and freely available for academic and commercial use.

Methods and Materials

The hap-IBD algorithm

The hap-IBD method employs a simple seed-and-extend algorithm. A seed is an IBS segment with genetic length greater than a specified minimum value. The hap-IBD algorithm finds all seed segments, and extends each seed if possible. A seed segment is extended if there is another long IBS segment for the same pair of haplotypes that is separated from the seed segment by a short non-IBS gap. The maximum number of base pairs between the first and last markers in the non-IBS gap and the minimum genetic length of the extension IBS segment are specified by the user. A segment may be extended multiple times. When it is no longer possible to extend the segment, the segment is written to the output file if its genetic length is greater than a specified minimum output length.
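The seed-and-extend logic described above can be sketched for a single pair of haplotypes as follows. This is a simplified illustration and not the hap-IBD implementation: all function and parameter names are hypothetical, length comparisons use `>=` as an assumption, and every maximal IBS run between two discordant markers is treated as a candidate extension, so a gap here never contains matching markers. The sketch also discards a seed whose left extension is itself a seed, so that each extended segment is reported once.

```python
def ibs_runs(h1, h2):
    """Maximal runs (start, end), inclusive marker indices, where h1 and h2 match."""
    runs, start = [], None
    for m, (a, b) in enumerate(zip(h1, h2)):
        if a == b:
            if start is None:
                start = m
        elif start is not None:
            runs.append((start, m - 1))
            start = None
    if start is not None:
        runs.append((start, len(h1) - 1))
    return runs


def seed_and_extend(h1, h2, cm, bp, min_seed, min_extend, max_gap, min_output):
    """Return extended segments (start, end) for one haplotype pair.

    cm and bp give the genetic (cM) and physical (basepair) position of each marker.
    """
    runs = ibs_runs(h1, h2)

    def cm_len(run):
        return cm[run[1]] - cm[run[0]]

    def gap_bp(left, right):
        # Distance between the first and last markers of the non-IBS gap.
        return bp[right[0] - 1] - bp[left[1] + 1]

    def can_join(left, right, extension):
        return gap_bp(left, right) <= max_gap and cm_len(extension) >= min_extend

    reported = []
    for i, seed in enumerate(runs):
        if cm_len(seed) < min_seed:
            continue
        # Extend to the left; stop and discard if an earlier seed is reached.
        lo, duplicate = i, False
        while lo > 0 and can_join(runs[lo - 1], runs[lo], runs[lo - 1]):
            lo -= 1
            if cm_len(runs[lo]) >= min_seed:
                duplicate = True
                break
        if duplicate:
            continue
        # Extend to the right as many times as possible.
        hi = i
        while hi + 1 < len(runs) and can_join(runs[hi], runs[hi + 1], runs[hi + 1]):
            hi += 1
        start, end = runs[lo][0], runs[hi][1]
        if cm[end] - cm[start] >= min_output:
            reported.append((start, end))
    return reported


# 21 markers spaced 0.25 cM (250 kb) apart, with one discordant allele at marker 10.
cm = [0.25 * i for i in range(21)]
bp = [250_000 * i for i in range(21)]
h1 = [0] * 21
h2 = [0] * 21
h2[10] = 1
print(seed_and_extend(h1, h2, cm, bp, min_seed=2.0, min_extend=1.0,
                      max_gap=1000, min_output=2.0))
```

In the usage example the single discordant marker splits the haplotypes into two 2.25 cM IBS runs; the first run is a seed, it is extended across the one-marker gap, and the second run is skipped as a duplicate seed.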

Allowing short non-IBS gaps provides robustness to three sources of discordant alleles in IBD segments: genotype error, gene conversion, and mutation since the most recent common ancestor. Genotype error and mutation will typically introduce a single discordant allele in an IBD segment. Gene conversion will generally produce a very short interval containing one or a few discordant alleles in an IBD segment. When the phasing of the surrounding alleles is correct, the mis-matching alleles on the pair of IBD haplotypes result in two IBS segments for the same pair of haplotypes, separated by a single marker, or at most a few markers in the case of gene conversion. Our method allows these breaks in IBS sharing to be detected and for the IBS segments on each side of the break to be included in the same reported IBD segment.

Two or more distinct IBS seed segments can result in the same IBD haplotype after each seed is extended. If an IBS segment that extends the seed segment to the left is itself a valid seed segment, we stop the extension process and discard the seed segment that is being extended because the same IBD haplotype will be generated by a seed segment that occurs earlier on the chromosome.

The hap-IBD algorithm also has an optional min-markers parameter that requires seed IBS segments to have a minimum number of markers. The min-markers parameter can be useful for ensuring a minimum level of evidence for IBD in genomic regions having low marker density. When a min-markers parameter is specified, IBS segments that extend a seed are also required to have a minimum number of markers. We set the minimum number of markers in an extension to be the product of the min-markers parameter and the ratio of the minimum extension length to the minimum seed length.

We first describe a single-threaded implementation of the preceding algorithm and then describe how the single-threaded implementation is modified to permit parallel computation.

Computationally efficient detection of seed segments

After the genotype data for a chromosome are read into memory we apply the positional Burrows-Wheeler transform (PBWT)⁵⁵. The PBWT sweeps through the markers in chromosome order, and at each marker sorts the reverse haplotype prefixes in lexicographic order (the reverse haplotype prefix at the m-th marker is the sequence of alleles at markers m − 1, m − 2, …). At marker m we generate a “divergence” array that stores the first marker of the IBS segment containing marker m − 1 for each pair of haplotypes that are adjacent after sorting.⁵⁵ The divergence array is used to efficiently identify all seed IBS segments that end at marker m (see Durbin’s Algorithm 3).⁵⁵ After a seed is identified, it is extended if possible by comparing the alleles on the two IBS haplotypes in the regions preceding and succeeding the seed segment as described above.
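For readers unfamiliar with the PBWT, the sweep described above can be sketched as follows. This is a minimal illustration of the positional prefix and divergence array update from Durbin's PBWT paper, assuming diallelic markers coded 0/1 and haplotypes stored as plain Python lists; the function name `pbwt_arrays` is hypothetical, and the sketch omits the seed-reporting step and all of hap-IBD's optimizations.

```python
def pbwt_arrays(haplotypes):
    """Return (order, divergence) after sweeping all markers.

    order[j] is the index of the haplotype in position j after sorting by
    reverse prefix. divergence[j] is the first marker of the IBS segment
    shared by haplotypes order[j - 1] and order[j] that ends at the last
    marker (it equals the number of markers when the two differ there).
    """
    n_haps = len(haplotypes)
    n_markers = len(haplotypes[0])
    order = list(range(n_haps))
    divergence = [0] * n_haps
    for k in range(n_markers):
        zeros, ones = [], []          # haplotypes carrying allele 0 / allele 1 at marker k
        div_zeros, div_ones = [], []  # matching divergence values
        p = q = k + 1                 # k + 1 means "no match ending at marker k"
        for hap, match_start in zip(order, divergence):
            # Track the latest match start seen since the last 0 (p) or 1 (q).
            p = max(p, match_start)
            q = max(q, match_start)
            if haplotypes[hap][k] == 0:
                zeros.append(hap)
                div_zeros.append(p)
                p = 0
            else:
                ones.append(hap)
                div_ones.append(q)
                q = 0
        # Stable partition: allele-0 haplotypes first, then allele-1 haplotypes.
        order = zeros + ones
        divergence = div_zeros + div_ones
    return order, divergence


example = [
    [0, 1, 0, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 1, 1],
    [1, 0, 0, 1, 1],
    [0, 0, 1, 0, 1],
]
order, divergence = pbwt_arrays(example)
print(order, divergence)
```

Because haplotypes that share a long IBS segment ending at the current marker are adjacent in `order`, a seed for a pair can be recognized from `divergence` alone by checking whether the genetic distance from marker `divergence[j]` to the current marker reaches the minimum seed length.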

Memory-efficient computation

The hap-IBD program takes phased genotype data in VCF format as input.⁵⁷ As the genotype data are read into memory, the data are immediately converted to binary reference format (version 3).⁵⁶ Binary reference format compresses low frequency variants by storing only the indices of the haplotypes carrying non-major alleles. Higher frequency variants are compressed by storing unique allele sequences in a region, along with a vector that maps haplotype indices to the allele sequence carried by the haplotype. We use binary reference format because it permits data for an entire chromosome to be stored compactly in memory, and it allows rapid queries of alleles carried by haplotypes at each marker.
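The idea behind the compression of low frequency variants can be illustrated with a toy encoder. This sketch is not the binary reference format itself: it assumes diallelic markers coded 0/1, and the names `sparse_encode` and `allele_at` are hypothetical. It only shows why storing carrier indices is compact for rare variants while still permitting fast allele queries.

```python
from bisect import bisect_left


def sparse_encode(alleles):
    """Encode one marker as (major_allele, sorted indices of non-major carriers)."""
    major = 0 if 2 * sum(alleles) <= len(alleles) else 1
    carriers = [i for i, allele in enumerate(alleles) if allele != major]
    return major, carriers


def allele_at(encoded, hap_index):
    """Return the allele carried by haplotype hap_index using binary search."""
    major, carriers = encoded
    pos = bisect_left(carriers, hap_index)
    is_carrier = pos < len(carriers) and carriers[pos] == hap_index
    return 1 - major if is_carrier else major


marker = [0, 0, 1, 0, 0, 0, 0, 1]
encoded = sparse_encode(marker)
print(encoded)
print([allele_at(encoded, i) for i in range(len(marker))])
```

For a variant carried by a handful of haplotypes out of hundreds of thousands, the carrier list is far shorter than the full allele vector, which is the property the text relies on.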

The PBWT requires only two additional arrays of stored information, each with length equal to the number of haplotypes. Seed IBS segments are extended as soon as they are identified by the PBWT. After extension, segments that are longer than the minimum output segment length are immediately printed to an output buffer, which is flushed to disk when full. Consequently, only a limited number of IBD segments are stored in memory at any time.

Parallelization

The hap-IBD algorithm is parallelized by applying the PBWT concurrently in overlapping marker windows. If L is the genetic distance between the first and last markers on the chromosome, S is the minimum seed genetic length, and T is the number of computational threads, we sequentially define T overlapping marker windows W_1, W_2, …, W_T that each have length approximately equal to ((L − S)/T + S) cM, and that have approximately S cM overlap between adjacent windows. The first window W_1 begins at the first marker on the chromosome and ends at the first marker after genetic position ((L − S)/T + S) whose index is greater than the minimum number of markers required for an IBS seed segment. The first marker in W_{i+1} is the first marker in W_i that cannot be the start of a seed IBS segment contained within W_i because the number of markers or genetic distance separating the marker from the last marker in W_i is too small. The last marker in W_{i+1} is the first marker that is ≥ (L − S)/T cM away from the last marker in window W_i. With these definitions every seed IBS segment will be detected in at least one of the overlapping windows.
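The window layout can be sketched in terms of genetic positions alone. This is a simplification under stated assumptions: it ignores the marker-based adjustments described above (windows in hap-IBD begin and end at markers, and the min-markers requirement can shift the boundaries), and the function name `overlapping_windows` is hypothetical.

```python
def overlapping_windows(first_cm, last_cm, min_seed_cm, n_threads):
    """Return n_threads (start_cm, end_cm) windows with min_seed_cm overlap."""
    total = last_cm - first_cm
    if total <= min_seed_cm or n_threads < 2:
        # Too short to split, or only one thread: use a single window.
        return [(first_cm, last_cm)]
    step = (total - min_seed_cm) / n_threads  # (L - S) / T
    return [
        (first_cm + i * step, first_cm + (i + 1) * step + min_seed_cm)
        for i in range(n_threads)
    ]


# A 110 cM chromosome, 2 cM minimum seed length, 12 threads.
windows = overlapping_windows(0.0, 110.0, 2.0, 12)
for start, end in windows:
    print(f"{start:6.1f} - {end:6.1f} cM")
```

Each window has length (L − S)/T + S and overlaps its successor by S, so any seed of the minimum length that starts in the first (L − S)/T cM of a window lies entirely inside that window.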

We run the PBWT algorithm in each overlapping window in parallel. When a seed IBS segment is found, we ignore the window boundaries when we extend the segment, so that the extension process is the same as for the single-threaded case. If multiple seeds result in the same maximal IBD segment after extension, we keep the maximal IBD segment generated by the first seed, and discard the duplicate IBD segments generated by seeds in later windows.

Input and output data

The input is a VCF file⁵⁷ with phased, non-missing genotype data, and a PLINK-format genetic map.² Linear interpolation is used to estimate the genetic map positions for any marker whose position is not on the genetic map. Although the use of a genetic map is recommended, hap-IBD can be used with Mb units simply by supplying a genetic map with a constant recombination rate of 1 cM per Mb.
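Linear interpolation of genetic positions can be sketched as below. The function name `interpolate_cm` is hypothetical, the map is assumed to contain at least two points with strictly increasing base positions, and the handling of markers that lie outside the map (extrapolating with the slope of the nearest interval) is an assumption of this sketch rather than a documented hap-IBD behavior.

```python
from bisect import bisect_left


def interpolate_cm(map_bp, map_cm, query_bp):
    """Estimate the cM position of query_bp from a sorted genetic map."""
    i = bisect_left(map_bp, query_bp)
    if i < len(map_bp) and map_bp[i] == query_bp:
        return map_cm[i]  # marker is on the map
    # Clamp to the first or last interval so out-of-range queries extrapolate.
    i = min(max(i, 1), len(map_bp) - 1)
    x0, x1 = map_bp[i - 1], map_bp[i]
    y0, y1 = map_cm[i - 1], map_cm[i]
    return y0 + (query_bp - x0) * (y1 - y0) / (x1 - x0)


map_bp = [1_000_000, 2_000_000, 4_000_000]
map_cm = [0.0, 1.0, 2.0]
print(interpolate_cm(map_bp, map_cm, 1_500_000))
print(interpolate_cm(map_bp, map_cm, 3_000_000))

# A two-point map with a constant rate of 1 cM per Mb yields Mb units.
print(interpolate_cm([0, 1_000_000], [0.0, 1.0], 2_500_000))
```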

Two output files are produced: one containing within-individual segments of homozygosity by descent (HBD) and one containing between-individual IBD segments. Each output line contains the pair of samples, the specific haplotypes, the starting base position, the ending base position, and the genetic length of the HBD or IBD segment.

Analysis parameters

The minimum seed length parameter has a large influence on computation time. Increasing the minimum seed length reduces computation time because fewer seed IBS segments will be considered. Decreasing the minimum seed length can increase power to detect short IBD segments that have discordant alleles on the pair of shared haplotypes.

The maximum gap length and minimum extension length allow reported IBD segments to contain discordant alleles due to genotype error, mutation, or gene conversion. The hap-IBD software also has an option for excluding input markers having low minor allele count.

The minimum markers parameter controls the minimum number of markers in IBS seed and extension segments. The number of reported IBD segments should be approximately constant throughout the genome; however, regions with low marker density can produce local spikes in the number of reported IBD segments (see Results). These spikes contain many IBS segments that satisfy the genetic length requirements, but which contain relatively few markers. The spikes can be reduced or eliminated by post-processing,^{23; 58} or by requiring seed and extension IBS segments to contain a minimum number of markers.

UK Biobank genotype data

We downloaded the UK Biobank genotype data from the European Genome-phenome Archive⁵⁹ (Dataset accession: EGAD00010001497). The UK Biobank data contain 488,377 individuals and 784,256 autosomal markers.⁵² We excluded markers with more than 5% missing genotypes (n = 70,246), markers that had only one individual carrying a minor allele (n = 5,123), and markers that failed one or more of the UK Biobank’s batch quality control tests (n = 1,527).⁵² After excluding 72,601 markers that failed one or more of these filters, there were 711,655 autosomal markers.

We then excluded 968 individuals that were identified by the UK Biobank as being outliers for their proportion of missing genotypes or proportion of heterozygous genotypes, and we excluded 9 individuals that were identified by the UK Biobank as showing third degree or closer relationships with more than 200 individuals (indicating sample contamination).⁵² After these exclusions there were 487,400 individuals.

We identified parent-offspring trios using the kinship coefficients and the proportion of markers that share no alleles (IBS0) that are reported by the UK Biobank.^{52; 60} First degree relatives were considered to be pairs of individuals with kinship coefficient between 2^−2.5 and 2^−1.5. Among first degree relatives, parent-offspring relationships were assumed to be the first-degree relative pairs with IBS0 < 0.0012. These are the same kinship coefficient and IBS0 thresholds used by the UK Biobank to identify parent-offspring relationships.⁵² We considered an individual to be the offspring in a parent-offspring trio if the individual had a parent-offspring relationship with exactly one male and one female individual, and if the male and female first-degree relatives are not in the set of related pairs of individuals reported by the UK Biobank, which is the set of pairs of individuals with estimated kinship coefficient greater than 2^−4.5. In this case, we considered the male and female first-degree relatives to be the offspring’s parents. Using this procedure, we identified 1064 parent-offspring trios.
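The trio identification rule can be sketched as follows. The thresholds are those stated above; the data layout (a dictionary of reported related pairs keyed by an unordered pair of sample identifiers, and a dictionary of sex codes "M" and "F") and all function names are hypothetical choices made for this illustration.

```python
FIRST_DEGREE_MIN = 2 ** -2.5        # lower kinship bound for first degree relatives
FIRST_DEGREE_MAX = 2 ** -1.5        # upper kinship bound for first degree relatives
PARENT_OFFSPRING_MAX_IBS0 = 0.0012  # parent-offspring pairs have IBS0 below this


def is_parent_offspring(kinship, ibs0):
    first_degree = FIRST_DEGREE_MIN < kinship < FIRST_DEGREE_MAX
    return first_degree and ibs0 < PARENT_OFFSPRING_MAX_IBS0


def find_trios(related_pairs, sex):
    """Return (offspring, father, mother) tuples.

    related_pairs maps frozenset({id1, id2}) to (kinship, ibs0) for every
    reported related pair; pairs absent from the mapping count as unrelated.
    """
    partners = {}
    for pair, (kinship, ibs0) in related_pairs.items():
        if is_parent_offspring(kinship, ibs0):
            a, b = tuple(pair)
            partners.setdefault(a, set()).add(b)
            partners.setdefault(b, set()).add(a)
    trios = []
    for child in sorted(partners):
        males = sorted(r for r in partners[child] if sex[r] == "M")
        females = sorted(r for r in partners[child] if sex[r] == "F")
        if len(males) == 1 and len(females) == 1:
            # The putative parents must not be reported as related to each other.
            if frozenset((males[0], females[0])) not in related_pairs:
                trios.append((child, males[0], females[0]))
    return trios


related_pairs = {
    frozenset(("C", "F")): (0.25, 0.0002),  # parent-offspring
    frozenset(("C", "M")): (0.25, 0.0001),  # parent-offspring
    frozenset(("C", "S")): (0.25, 0.02),    # first degree, but IBS0 too high
}
sex = {"C": "F", "F": "M", "M": "F", "S": "M"}
print(find_trios(related_pairs, sex))
```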

The 1064 trio offspring have 2,054 distinct parents. We excluded these parents from the data before phasing and IBD segment detection so that phasing accuracy in the trio offspring would more closely match phasing accuracy in unrelated individuals. After excluding the trio parents, there were 485,346 remaining individuals. We listed the 1064 trio offspring followed by the remaining samples in random order. We created five telescoping genotype data sets that included 5000, 15,000, 50,000, 150,000, and all 485,346 individuals by taking the corresponding number of samples from the top of this list. We then phased each data set with Beagle 5.1.⁶¹

We used the parental genotype data to determine true phase in 850 trio offspring. We selected the 850 trio offspring by computing the number of autosomal sites with Mendelian inconsistent genotypes in each of the 1064 trios (range: 57-5102 inconsistencies) and taking the offspring from the trios having the smallest number of Mendelian inconsistent genotypes (range: 57-456 inconsistencies).⁶² We phased the 850 trio offspring at all heterozygous genotypes for which phase could be determined from parental genotypes and Mendelian inheritance constraints (82.4% of heterozygous genotypes), and we masked genotypes at Mendelian inconsistent sites in this phased data. We used these estimated haplotypes to evaluate false-positive and false-negative rates for IBD segment detection as described below.

After excluding trio parents, there were 43 remaining parent-offspring pairs who were not part of a trio in the 50,000 individual subset of the UK Biobank data. We used these 43 remaining pairs to compute the mean proportion of chromosome 20 covered by detected IBD segments in parent-offspring pairs.

Simulated Data

In order to test the performance of hap-IBD and other methods on sequence data, we generated 60 Mb of data for 50,000 individuals from a demographic model that simulates the present UK European population.⁴⁷ This model has a population size of 24,000 in the distant past, a reduction to 3,000 occurring 5,000 generations ago, growth at rate 1.4% per generation starting 300 generations ago, and growth at rate 25% beginning ten generations ago.

We used forward simulation with SLiM⁶³ to simulate the ancestral recombination graph for the most recent 5000 generations. Gene conversion tracts were initiated at a rate of 2 × 10⁻⁸ per bp per generation, and had geometrically distributed lengths with mean 300 bp, giving an overall gene conversion rate of 6 × 10⁻⁶ per bp per generation.^{27; 50} A constant recombination rate of 1 × 10⁻⁸ per bp per generation was used. We then used msprime’s coalescent simulation to add mutations (at rate 1.38 × 10⁻⁸ per bp per generation) and simulate the more distant past.⁶⁴ This hybrid strategy of using SLiM and msprime enables utilization of msprime’s computational efficiency for large data sets, while incorporating biologically realistic settings such as gene conversion that are implemented in SLiM but not currently implemented in msprime.⁶⁵ Our simulation only includes gene conversion events in the most recent 5000 generations, but it is the more recent gene conversions that have the greatest potential impact on haplotype phase accuracy and that can create discordances between identical by descent haplotypes.

We determined the true IBD segments for 1000 simulated individuals from the simulated ancestral recombination graphs. IBD segments are required to have the same ancestral node along their length, except for short breaks due to gene conversion.

We added genotype error at a rate of 0.02%, which is the error rate that produces the observed 0.04% rate of discordance at SNVs passing quality control in the TOPMed Freeze 5 whole genome sequence data.⁵³ We then removed variants with frequency less than 0.10 and phased the remaining genotypes using Beagle v5.1.⁶¹ We also separately phased a subset of 5000 individuals with the same minor allele frequency threshold of 0.10. We found that phasing accuracy for common variants improves when rare variants are excluded, and that the improved phasing accuracy more than offsets the loss of information from excluding low frequency variants. Low frequency variants are not very informative for IBD because most individuals are homozygous for the major allele, and because allele discordance at low frequency variants in IBD segments could be due to genotype error, recent mutation, or phasing error, rather than indicating non-IBD. Other methods for IBD detection in sequence data have used a minor allele frequency filter. The application of GERMLINE to the Genomes of the Netherlands whole genome sequence data used a minor allele frequency filter of 1%.⁵⁸ The TRUFFLE analysis of 1000 Genomes sequence data used minor allele frequency filters of 5% and 10%.⁴²

Parameter settings

Each method has an option for setting the minimum length of reported IBD segments. All methods, except TRUFFLE, measure genetic distance in cM units. For TRUFFLE, we substituted Mb units for cM units. All genetic distances are interpolated from the HapMap genetic map.⁶⁶

We required all UK Biobank chromosome 20 analyses to complete within two days of wall-clock time on the compute nodes used for these analyses. We set the minimum output segment length to 2.0 cM, unless a higher output segment length was required for analyses to finish within two days of wall-clock time. Parameter settings for analysis of UK Biobank and simulated sequence data are based on previously published analyses of SNP array^{5; 42-44} and sequence data.^{42; 43; 58} Parameter settings for each method are reported in Tables S1 and S2. We do not evaluate iLASH on the sequence data because the published description of this method does not include analyses of sequence data.

Table S1: Parameters used for analysis of UK Biobank data with 2 cM minimum IBD segment length.

Parameters that control the output IBD segment length are in red.

Table S2: Parameters used for analysis of simulated sequence data with minimum 2 cM IBD segment length.

Parameters that control the output IBD segment length are in red.

Comparison of methods

For the simulated data, coalescent trees for 1000 simulated samples were used to determine true IBD segments exceeding 1.5 cM in length for those samples. For the UK Biobank data, we considered true IBD segments to be IBS segments exceeding 1.5 cM in length among the 850 trio offspring which were phased using parental genotypes and Mendelian inheritance rules.

False-positive rate estimation

We divided detected IBD segments into bins according to the detected segment length (2-3, 3-4, 4-6, 6-10, 10-18, and >18 cM). For each detected IBD segment, we identified the cM length of the portion of the detected segment that is not covered by any true IBD segment with length >1.5 cM, and we calculated the sum of these false-positive segment lengths. The false-positive rate for a bin is the sum of the false-positive segment lengths divided by the sum of the detected segment lengths.

False-negative rate estimation

We divided true IBD segments into bins according to the true segment length. The length bins and number of true IBD segments in each length bin are: 2.5-3 cM (2492), 3-4 cM (1360), 4-6 cM (551), 6-10 cM (160), 10-18 cM (55), and >18 cM (64). For each true IBD segment, we identified the cM length of the portion of the true segment that is not covered by any detected IBD segment with length >2.0 cM, and we calculated the sum of these false-negative segment lengths. The false-negative rate for the bin is the sum of the false-negative segment lengths divided by the sum of the true segment lengths.
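Both error rates reduce to the same interval computation: the fraction of total segment length in a bin that is left uncovered by a second set of segments for the same haplotype pair. The sketch below illustrates this for one haplotype pair; the function names are hypothetical, segments are represented as (start_cM, end_cM) tuples, and the binning and length filtering of the input lists are assumed to have been done beforehand.

```python
def uncovered_length(segment, covering):
    """cM length of segment that is not covered by any interval in covering."""
    start, end = segment
    clipped = sorted(
        (max(s, start), min(e, end))
        for s, e in covering
        if e > start and s < end
    )
    uncovered, cursor = 0.0, start
    for s, e in clipped:
        if s > cursor:
            uncovered += s - cursor
        cursor = max(cursor, e)
    return uncovered + max(0.0, end - cursor)


def uncovered_rate(segments, covering):
    """Sum of uncovered lengths divided by the sum of segment lengths."""
    total = sum(end - start for start, end in segments)
    missing = sum(uncovered_length(seg, covering) for seg in segments)
    return missing / total


# False-positive rate: detected segments in a bin versus true segments > 1.5 cM.
detected_in_bin = [(10.0, 13.0)]
true_segments = [(9.0, 11.0), (12.5, 14.0)]
print(uncovered_rate(detected_in_bin, true_segments))

# False-negative rate: swap the roles, using detected segments > 2.0 cM as the cover.
true_in_bin = [(9.0, 11.0)]
detected_segments = [(10.0, 13.0)]
print(uncovered_rate(true_in_bin, detected_segments))
```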

ROC analysis

In order to account for inter-method differences in determining IBD end-points, which affect the reported length of IBD segments, we calculated false positive and false negative rates for each method over a range of detected segment length thresholds (1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, and 2.4 cM). We calculated the false positive rate for each threshold as described above using all true segments having length ≥ 1.5 cM, and we calculated the false negative rate as described above using all true segments having length ≥ 2.5 cM. We then generated a receiver operating characteristic (ROC) curve for each method that shows the true-positive rate (which is one minus the false-negative rate) and false-positive rate for each detected segment length threshold. For length thresholds < 2.0 cM, the hap-IBD minimum seed length was set to 1.6.

We also generated ROC curves for each method for 5 cM segments. For this analysis, we used detected segment length thresholds of 4.6, 4.8, 5.0, 5.2, and 5.4. We calculated false positive rates using all true segments having length ≥ 4.5 cM, and we calculated false negative rates using all true segments having length ≥ 5.5 cM.

Computation time

All analyses were run on a 12-core 2.6 GHz computer with Intel Xeon E5-2630 processors and 128 GB of memory. Computation time was measured using the Unix time command, which returns a “real”, a “system”, and a “user” time. The wall-clock time is the “real” time, which is the length of time the program was running. The CPU time is the sum of the “system” and “user” times. For multi-threaded compute jobs, the CPU time includes the sum of the CPU time for each computational thread, so that it represents the total CPU resources consumed by the program. A maximum of 2 days of wall-clock time was allowed for each analysis of the UK Biobank chromosome 20 data or of the 60 Mb of simulated sequence data, with no results reported if the analysis did not complete within this time frame.

Results

Computational feasibility

Figure 1 shows CPU times for subsets of the UK Biobank chromosome 20 data (5000 to 485,346 individuals) for 2 cM and 5 cM output length thresholds. For the full UK Biobank data with 485,346 individuals, hap-IBD detected 3.43 billion IBD segments on chromosome 20 at the 2 cM threshold, and 106 million segments at the 5 cM threshold. GERMLINE could not analyze subsets of size 50,000 or more individuals because it required more than 128 GB of memory. TRUFFLE could not analyze subsets of size 150,000 or more individuals on our compute nodes. iLASH could not analyze subsets of 150,000 or more individuals at the 2 cM output threshold because it needed more than 128 GB of memory, but it could analyze the full data set at the 5 cM output threshold. RaPID could not analyze the full chromosome 20 data set at the 2 cM threshold within the permitted two days of wall-clock time, but it could analyze the full data set at the 5 cM output threshold.

Figure 1. CPU time.

CPU time for detecting IBD segments with length ≥ 2 cM (left panel) and ≥ 5 cM (right panel) on chromosome 20 in samples of 5000, 15,000, 50,000, 150,000, and 485,346 individuals from the UK Biobank. CPU time is the sum of the computation time for each CPU core. All programs used 12 computational threads, except RaPID and GERMLINE which are limited to one computational thread.

Three of the five methods have parallelization implemented, and running these methods on a 12-core computer leads to an approximate 10-fold reduction in computing time compared to single-threaded analysis (Figure S1). This degree of speedup is important for analysis of large data sets. For example, the single-threaded RaPID program required 223.6 minutes of wall-clock time to output IBD segments > 5 cM for all samples on chromosome 20, but hap-IBD required only 13.4 minutes when using 12 computational threads.

Figure S1. Wall-clock compute time.

Wall-clock time for multi-threaded programs when using 12 CPU cores for detecting IBD segments with length ≥ 2 cM (left panel) and ≥ 5 cM (right panel) on chromosome 20 in samples of 5000, 15,000, 50,000, 150,000, and 485,346 individuals from the UK Biobank. All programs used 12 computational threads.

Overall, hap-IBD is the fastest program except when analyzing the smallest sample size (5000 individuals) with the largest output threshold (5 cM); for this combination iLASH is faster. In our experiments, hap-IBD was the only method that could analyze the full UK Biobank chromosome 20 data on our compute servers in less than 2 days when using a 2 cM output length threshold.

We also performed a genome-wide analysis of the 22 autosomes for the UK Biobank data. Analysis of 485,346 UK Biobank samples using hap-IBD with 12 computational threads detected 231.5 billion autosomal IBD segments in 24.4 hours.

Accuracy

Several methods have a low false-positive rate (Figures 2 and S2) but a high false-negative rate (Figures 3 and S3), or vice versa. The methods apply different algorithms for determining the end-points of IBD segments, and this results in different methods reporting different lengths for a true IBD segment. Since false-positive rates and false-negative rates can be traded off by changing the output length threshold, we constructed ROC curves by varying the output IBD segment length threshold for each method in order to assess the true-positive vs false-positive trade-off. The true-positive rate is one minus the false-negative rate. An ideal method would have a true-positive rate of 1 and a false-positive rate of 0. For 2 cM IBD segments (Figure 4) and for 5 cM IBD segments (Figure S4) hap-IBD shows the best performance on these ROC curves. In particular, hap-IBD has much lower false-positive rates than RaPID and much higher true-positive rates than iLASH. The IBD segment detection method for unphased genotype data (TRUFFLE) has high error rates for these short IBD segments.
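The per-bin error rates defined in the captions of Figures 2 and 3 can be illustrated with a short Python sketch for a single pair of haplotypes. This is not the evaluation code used for the figures. It assumes that "covered" means any overlap between two intervals, that bins are closed on the left, and that segments are (start, end) positions in cM; the function and variable names are hypothetical.

```python
from bisect import bisect_right

# Left bin edges in cM, taken from the Figure 2 and Figure 3 captions.
# The last bin is open-ended (>18 cM).
FP_BIN_EDGES = [2.0, 3.0, 4.0, 6.0, 10.0, 18.0]
FN_BIN_EDGES = [2.5, 3.0, 4.0, 6.0, 10.0, 18.0]


def overlaps_any(segment, others):
    """Return True if segment = (start, end) overlaps any interval in others."""
    start, end = segment
    return any(s < end and start < e for s, e in others)


def uncovered_rates(query, reference, min_ref_len, edges):
    """Per-bin proportion of query segments not overlapped by a reference
    segment of length >= min_ref_len. Bins without segments yield None.

    Assumption: "covered" is treated as any overlap.
    """
    long_refs = [r for r in reference if r[1] - r[0] >= min_ref_len]
    counts = [[0, 0] for _ in edges]  # [uncovered, total] for each bin
    for seg in query:
        length = seg[1] - seg[0]
        if length < edges[0]:
            continue  # shorter than the first bin, so not evaluated
        b = bisect_right(edges, length) - 1
        counts[b][1] += 1
        if not overlaps_any(seg, long_refs):
            counts[b][0] += 1
    return [bad / total if total else None for bad, total in counts]
```

Under these assumptions, false-positive rates are obtained with `uncovered_rates(detected, true, 1.5, FP_BIN_EDGES)` and false-negative rates with `uncovered_rates(true, detected, 2.0, FN_BIN_EDGES)`. The true-positive rate plotted on a ROC curve is one minus the false-negative rate.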

Figure S2. False-positive IBD segment detection in UK Biobank chromosome 20 data.

As for Figure 2, but zoomed out to show the full range of false-positive rates. False-positive rates for IBD segment detection for 5000, 50,000, and 485,346 UK Biobank samples. IBD segments with length ≥ 2 cM were detected with each method. Detected IBD segments were assigned into bins of 2-3, 3-4, 4-6, 6-10, 10-18, and >18 cM according to their segment length. The false-positive rate is the proportion of detected IBD segments in a bin that are not covered by any true IBD segment ≥ 1.5 cM in length. Hap-IBD is the only method shown for the full UK Biobank analysis (485,346 individuals) because other methods were unable to complete the analysis with a 2 cM output threshold within the memory and time constraints (see Computational Feasibility Results). The x-coordinate of each data point is the left bin end point (e.g. 2 cM for the 2-3 cM bin).

Figure S3. False-negative IBD segment detection in UK Biobank chromosome 20 data.

As for Figure 3, but zoomed out to show the full range of false-negative rates. False-negative rates for IBD segment detection for 5000, 50,000, and 485,346 UK Biobank samples. IBD segments with length ≥ 2 cM were detected with each method. The right column of plots shows false negative rates. True IBD segments with length ≥ 2.5 cM were assigned into bins of 2.5-3, 3-4, 4-6, 6-10, 10-18, and >18 cM according to their segment length. The false-negative rate is the proportion of true IBD segments in a bin that are not covered by any detected IBD segment ≥ 2 cM in length. Hap-IBD is the only method shown for the full UK Biobank analysis (485,346 individuals) because other methods were unable to complete the analysis with a 2 cM output threshold within the memory and time constraints (see Computational Feasibility Results). The x-coordinate of each data point is the left bin end point (e.g. 2.5 cM for the 2.5-3 cM bin).

Figure S4. ROC curves for 5 cM IBD segment detection in UK Biobank chromosome 20 data.

As for Figure 4, but for 5 cM segments. The right panel is a zoomed in version of the left panel. False-positive and false-negative rates for detection of IBD segments over a range of output length thresholds around 5 cM for 5000 UK Biobank samples. False positives are assessed using true segments of length ≥ 4.5 cM, and false negatives are assessed using true segments of length ≥ 5.5 cM in order to allow for some discrepancy between reported and true lengths. IBD segments were detected with each method using length thresholds of 5 cM (plotted symbol) and with other thresholds between 4.6 and 5.4 cM (lines; see Methods).

Figure 2. False-positive IBD segment detection in UK Biobank chromosome 20 data.

False-positive rates for IBD segment detection for 5000, 50,000, and 485,346 UK Biobank samples. IBD segments with length ≥ 2 cM were detected with each method. Detected IBD segments were assigned into bins of 2-3, 3-4, 4-6, 6-10, 10-18, and >18 cM according to their segment length. The false-positive rate is the proportion of detected IBD segments in a bin that are not covered by any true IBD segment ≥ 1.5 cM in length. Hap-IBD is the only method shown for the full UK Biobank analysis (485,346 individuals) because other methods were unable to complete the analysis with a 2 cM output threshold within the memory and time constraints (see Computational Feasibility Results). The x-coordinate of each data point is the left bin end point (e.g. 2 cM for the 2-3 cM bin). For the full range of y-coordinate values see Figure S2.

Figure 3. False-negative IBD segment detection in UK Biobank chromosome 20 data.

False-negative rates for IBD segment detection for 5000, 50,000, and 485,346 UK Biobank samples. IBD segments with length ≥ 2 cM were detected with each method. The right column of plots shows false negative rates. True IBD segments with length ≥ 2.5 cM were assigned into bins of 2.5-3, 3-4, 4-6, 6-10, 10-18, and >18 cM according to their segment length. The false-negative rate is the proportion of true IBD segments in a bin that are not covered by any detected IBD segment ≥ 2 cM in length. Hap-IBD is the only method shown for the full UK Biobank analysis (485,346 individuals) because other methods were unable to complete the analysis with a 2 cM output threshold within the memory and time constraints (see Computational Feasibility Results). The x-coordinate of each data point is the left bin end point (e.g. 2.5 cM for the 2.5-3 cM bin). For the full range of y-coordinate values see Figure S3.

Figure 4. ROC curves for 2 cM IBD segment detection in UK Biobank chromosome 20 data.

False-positive and false-negative rates for detection of IBD segments over a range of output length thresholds around 2 cM for 5000 UK Biobank samples. False positives are assessed using true segments of length ≥ 1.5 cM, and false negatives are assessed using true segments of length ≥ 2.5 cM in order to allow for some discrepancy between reported and true lengths. IBD segments were detected with each method using length thresholds of 2 cM (plotted symbol) and with other thresholds between 1.6 and 2.4 cM (lines; see Methods). Figure S4 shows a similar plot for 5 cM.

We also investigated the proportion of chromosome 20 that was covered by detected IBD segments with length ≥ 2 cM in 43 parent-offspring pairs in the set of 50,000 UK Biobank samples. The proportions were 0.978 for iLASH, 0.987 for hap-IBD, 0.994 for RaPID, and 1.0 for TRUFFLE. GERMLINE was not evaluated because it could not analyze 50,000 individuals on our compute server. All methods detected IBD across all or nearly all of the chromosome in the parent-offspring pairs. For haplotype-based methods, the methods with higher false-positive rates (Figure 2) detected slightly higher amounts of IBD in the parent-offspring pairs. Genotype-based methods are not affected by haplotype phase errors and the genotype-based method (TRUFFLE) had the highest detection rate for these chromosome-length shared haplotypes.
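A covered proportion of this kind can be computed as the length of the union of a pair's detected segments divided by the chromosome length. The Python sketch below shows one way to do this for a single parent-offspring pair. It assumes that segments are (start, end) intervals in the same units as the chromosome bounds and that segments on different haplotypes of the pair are pooled; the function name is hypothetical and the code is not taken from the analysis pipeline.

```python
def covered_proportion(segments, chrom_start, chrom_end):
    """Fraction of [chrom_start, chrom_end) covered by the union of segments."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(segments):
        # Clip each segment to the chromosome bounds.
        start, end = max(start, chrom_start), min(end, chrom_end)
        if start >= end:
            continue
        if cur_end is None or start > cur_end:
            # A new disjoint block begins; close the previous block.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or abutting segment; extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (chrom_end - chrom_start)
```

Averaging this quantity over the pairs gives a single proportion per method. Overlapping segments are counted once, so the value cannot exceed 1.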

In genome-wide analysis of the UK Biobank data, we find regions in which IBD detection methods report inflated levels of IBD segments. These are generally regions with large gaps in marker coverage, or very low marker density, and often occur around centromeres. Figure 5 shows results for chromosomes 1 and 20 for the methods with the highest accuracy for short IBD segments (the four haplotype-based methods), for the 5000 individual UK Biobank data. Around the chromosome 1 centromere the methods are finding IBD segments at a rate 40 to 3000 times greater than the baseline level. The inflation is worse for RaPID and iLASH than for GERMLINE and hap-IBD. Figure S5 shows that the inflated detection can be reduced by increasing hap-IBD’s min-markers parameter. However, the use of overly high values of this parameter will reduce power to detect short IBD segments. Alternatively, regions with high rates of IBD segment discovery can be identified after IBD segment detection and excluded.⁶⁷

Figure S5: Effect of marker thresholds on IBD segment detection in UK Biobank.

The hap-IBD program was run on 5000 UK Biobank samples on chromosome 1 (left panel) and chromosome 20 (right panel) with the min-markers parameter set to 1, 100, 200, and 400 markers. The min-markers parameter controls the minimum number of markers that must be present in a reported seed IBD segment. All other hap-IBD parameters were set at their default values. Each chromosome is divided into non-overlapping 10 kb intervals. For each interval, the IBD segments intersecting the interval are weighted by the proportion of the 10 kb interval that is covered by the IBD segment, and the sum of weights is plotted as the IBD coverage.

Figure 5: Chromosome-wide IBD segment detection in UK Biobank.

The methods were run on 5000 UK Biobank samples on chromosome 1 and chromosome 20 to detect IBD segments of length ≥ 2 cM. Each chromosome is divided into non-overlapping 10 kb intervals. For each interval, the IBD segments intersecting the interval are weighted by the proportion of the 10 kb interval that is covered by the IBD segment, and the sum of weights is plotted as the IBD coverage.
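The IBD coverage statistic described in the captions of Figures 5 and S5 can be sketched in Python as follows. The sketch assumes that segments are half-open (start, end) base-pair intervals and that a final partial window is still weighted relative to the full 10 kb window; the function name is hypothetical and this is an illustration, not the plotting code used for the figures.

```python
def ibd_coverage(segments, chrom_length, window=10_000):
    """Sum, for each window, the fraction of the window covered by each segment."""
    n_windows = -(-chrom_length // window)  # ceiling division
    coverage = [0.0] * n_windows
    for start, end in segments:
        # Clip the segment to the chromosome.
        start, end = max(start, 0), min(end, chrom_length)
        if start >= end:
            continue
        first = start // window
        last = (end - 1) // window
        for w in range(first, last + 1):
            w_start = w * window
            w_end = w_start + window
            overlap = min(end, w_end) - max(start, w_start)
            coverage[w] += overlap / window
    return coverage
```

Windows whose coverage is far above the chromosome-wide baseline, such as those around the chromosome 1 centromere, can then be flagged for exclusion after IBD segment detection.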

We also assessed accuracy using simulated sequence data. There are several important differences between the UK Biobank analysis and the simulated sequence data analysis. First, the approach to assessing accuracy differs. In the UK Biobank, we determine true phase of trio offspring and use that to determine identity by state at the haplotype level, which we use as a proxy for true IBD. The genotype error rate is extremely low in these data (with a duplicate discordance rate of 6.7 × 10⁻⁵),⁵² but genotype errors can disrupt both the true IBD and the estimated IBD in the UK Biobank analysis. In contrast, in the simulated data the true IBD status is obtained directly from the simulation (defined as no change in common ancestor across a segment except in tracts of gene conversion), and mis-called alleles may disrupt the estimated IBD but do not affect the ascertainment of true IBD. Second, the marker density is much higher for the simulated sequence data. Although we remove markers with minor allele frequency < 0.1 (see Methods), the marker density is still five times greater than that of the UK Biobank (97,890 markers with minor allele frequency ≥ 0.1 in the simulated 60 Mb region, compared with 18,424 total UK Biobank markers on chromosome 20). Third, the genotype error rate in the simulated sequence data is much higher than for the UK Biobank data. With current technology, error rates tend to be higher for sequence data than for SNP array data, even with high sequence coverage and careful processing. We added genotype error to the simulated sequence data at a rate that generates the level of duplicate discordance observed in the TOPMed data, which is 4 × 10⁻⁴ for SNPs passing quality control.⁵³ This level of duplicate discordance is six times higher than for the UK Biobank SNP data. 
There are also important similarities between the two analyses, which include the length of the region (approximately 60 Mb for the simulated analysis and for the UK Biobank chromosome 20 analysis), large sample size (up to 50,000 for the simulated data and up to 485,346 for the UK Biobank data), and demographic history (UK-like simulation versus actual UK population).

In the simulated sequence data, we compared methods for which the authors have published analyses of sequence data (GERMLINE, RaPID, and TRUFFLE), and we replicated settings from those published analyses (see Methods for details). To compare accuracy, we produced ROC curves for detection of 2 cM segments. We considered sample sizes of 5000 (Figure 6) and 50,000 (Figure S6). We find that hap-IBD and GERMLINE have a very similar accuracy profile (for the 5000 individuals only, because GERMLINE could not analyze the 50,000 individuals with the available computer memory). TRUFFLE had very low power to detect 2 cM IBD segments, while RaPID had a high false-positive rate. Overall these results are similar to those seen in the UK Biobank analysis, except that the relative accuracy of GERMLINE is improved in these simulated sequence data. The parameters that we used for GERMLINE in the simulated sequence analysis may be a better match for these data than were the parameters that we used for the UK Biobank data, although we used published parameter settings in both instances.

Figure S6. ROC curves for IBD segment detection in simulated sequence data.

False-positive and false-negative rates for detection of IBD segments over a range of output length thresholds around 2 cM for 50,000 simulated samples. False positives are assessed using true segments of length ≥ 1.5 cM, and false negatives are assessed using true segments of length ≥ 2.5 cM in order to allow for some discrepancy between reported and true lengths. IBD segments were detected with each method using length thresholds of 2 cM (plotted symbol) and with other thresholds between 1.6 and 2.4 cM (lines; see Methods).

Figure 6. ROC curves for IBD segment detection in simulated sequence data.

False-positive and false-negative rates for detection of IBD segments over a range of output length thresholds around 2 cM for 5000 simulated samples. False positives are assessed using true segments of length ≥ 1.5 cM, and false negatives are assessed using true segments of length ≥ 2.5 cM in order to allow for some discrepancy between reported and true lengths. IBD segments were detected with each method using length thresholds of 2 cM (plotted symbol) and with other thresholds between 1.6 and 2.4 cM (lines; see Methods). Figure S6 shows a similar plot for 50,000 samples.

Discussion

We have presented an IBD segment detection method for large-scale genotype data that is substantially faster and more accurate than four state-of-the-art competing methods (GERMLINE, iLASH, RaPID, and TRUFFLE). We applied hap-IBD to 485,346 samples from the UK Biobank⁵² and detected 231.5 billion autosomal IBD segments with length ≥ 2 cM in 24.4 hours of wall-clock time on a compute server with 12 CPU cores.

An attractive feature of hap-IBD is its simplicity. All seed IBS segments that exceed a specified length are identified and then extended if possible. The extension process allows for sporadic non-IBS alleles due to mutation, genotype error, or gene conversion. The hap-IBD parameters define the minimum length of IBS seed and extension segments and the maximum length of non-IBS gaps. These parameters have a simple and direct relationship to the IBD segments that are reported, which enables the correctness of the results to be confirmed. In contrast, some methods utilize a large number of tuning parameters which have only an indirect relationship to output IBD segments, such as iLASH’s seven parameters for controlling locality-sensitive hashing: perm_count, shingle_size, shingle_overlap, bucket_count, match_threshold, interest_threshold, and minhash_threshold.⁴⁴
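The seed-and-extend logic described above can be illustrated for a single pair of haplotypes with the following Python sketch. It is a simplified illustration, not the hap-IBD implementation, which uses the positional Burrows-Wheeler transform to find seeds across all haplotypes at once. The function names, the parameter names, and the choice to measure a gap as the base-pair distance between the flanking IBS markers are assumptions made for this example.

```python
def ibs_runs(hap1, hap2):
    """Maximal runs of consecutive markers with identical alleles,
    returned as (first_index, last_index) pairs."""
    runs, start = [], None
    for i, (a, b) in enumerate(zip(hap1, hap2)):
        if a == b:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(hap1) - 1))
    return runs


def seed_and_extend(hap1, hap2, cm, bp, min_seed, min_extend, max_gap, min_output):
    """Report (first_marker, last_marker) index pairs for one haplotype pair.

    cm and bp hold the genetic and physical position of each marker.
    """
    runs = ibs_runs(hap1, hap2)

    def cm_len(run):
        return cm[run[1]] - cm[run[0]]

    def joinable(left, right, extension):
        # Assumed gap definition: bp distance between the flanking IBS markers.
        gap_bp = bp[right[0]] - bp[left[1]]
        return cm_len(extension) >= min_extend and gap_bp <= max_gap

    segments, next_free = [], 0
    for k, run in enumerate(runs):
        if k < next_free or cm_len(run) < min_seed:
            continue  # already absorbed by an earlier seed, or too short to seed
        lo = hi = k
        while lo > next_free and joinable(runs[lo - 1], runs[lo], runs[lo - 1]):
            lo -= 1
        while hi + 1 < len(runs) and joinable(runs[hi], runs[hi + 1], runs[hi + 1]):
            hi += 1
        next_free = hi + 1
        first, last = runs[lo][0], runs[hi][1]
        if cm[last] - cm[first] >= min_output:
            segments.append((first, last))
    return segments
```

Every reported segment can be checked directly against the four thresholds, which is the sense in which the parameters have a direct relationship to the output.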

The hap-IBD method shares some similarities with the GERMLINE method: both methods search for long IBS segments via a seed and extend algorithm and both methods allow for the presence of some discordant alleles in a reported IBD segment.⁴⁰ However, hap-IBD achieves much greater computational efficiency and greater accuracy than GERMLINE by employing the positional Burrows-Wheeler transform instead of a hash table and by identifying seeds that exceed a specific genetic length rather than a specified number of markers.

In our tests, hap-IBD consistently required less CPU time than competing methods. The hap-IBD method includes internal parallelization that can yield wall-clock compute times that are a fraction of the total CPU time on multi-core processors.

The hap-IBD method requires phased genotype data. In practice, nearly all large genotype data sets are phased because phased data are required to obtain the highest accuracy for many downstream analyses including IBD segment detection, relationship inference,¹³ local ancestry inference,⁶⁸ population demography inference,^21-23 and detection of selection.¹⁰ Phased data are also required for computationally efficient and accurate genotype imputation.^{56; 69} With state-of-the-art methods, the effort and computational cost required to phase large data sets is modest when using a small compute cluster. We phased the UK Biobank genome-wide data with Beagle 5.1 in less than two days using 16 compute servers, each with 20 CPU cores.

Our results confirm that IBD segment detection methods for phased genetic data can detect much shorter IBD segments than methods for unphased genetic data. In our tests, the method for unphased data could not accurately detect segments with length < 10 cM, but most methods for phased data could accurately detect IBD segments with length > 2 cM (Figure 4). Furthermore, haplotype-based methods identify the shared allele sequence, whereas genotype methods cannot identify the shared allele when the individuals both carry an identical heterozygote genotype. However, genotype-based IBD detection methods have some advantages. Genotype-based methods can detect first and second degree relationships with high accuracy before haplotypes are estimated,^{13; 42} which can be useful during initial data quality control. In addition, genotype-based methods can be used when genotype data cannot be accurately phased due to non-uniform marker coverage or high rates of genotype error, such as can be the case for exome and low-coverage sequence data.

The hap-IBD method performs well across a range of haplotype switch error rates. In the UK Biobank data, the switch error rate for 5000 samples is more than an order of magnitude higher than the switch error rate for 485,346 samples.⁶² However, even for the 5000 sample subset of the UK Biobank, the IBD-detection accuracy is very high and is sufficient to identify close relatives in the data. Furthermore, one can increase the accuracy of phase estimates in small samples by phasing the samples together with a reference panel of sequenced individuals.⁷⁰

A general limitation of IBD segment detection methods that rely on IBS is that there is some degree of error in determination of segment end points. The IBS interval can extend beyond the end-points of a contained IBD segment. Consequently, IBD detection methods that report the full IBS interval will often over-extend the IBD segment ends. Such methods can also miss some regions at the end of IBD segments when genotype error, mutation, or gene conversion near the end of the IBD segment causes the IBS segment to end before the actual end of the IBD segment. If the genetic distance between the truncated end of the IBS segment and the true end of the IBD region is short, it is not possible to determine with confidence whether or not the IBD segment extends past the end of the IBS segment. Development of IBD segment detection methods that are robust to genotype error, recent mutation, and gene conversion that occur near the ends of IBD segments is an area for future research.

Supplemental Data

Supplemental Data include 2 tables and 6 figures.

Web resources

hap-IBD, https://github.com/browning-lab/hap-ibd

Beagle, http://faculty.washington.edu/browning/beagle/beagle.html

The UK Biobank data, https://www.ebi.ac.uk/ega/home

Acknowledgements

Research reported in this publication was supported by the National Human Genome Research Institute and National Institute of General Medical Sciences of the National Institutes of Health under award numbers HG008359, HG005701, and GM075091. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This research has been conducted using the UK Biobank Resource under Application Number 19934.

Footnotes

http://github.com/browning-lab/hap-ibd

References

1.
Houwen, R.H., Baharloo, S., Blankenship, K., Raeymaekers, P., Juyn, J., Sandkuijl, L.A., and Freimer, N.B. (1994). Genome screening by searching for shared segments: mapping a gene for benign recurrent intrahepatic cholestasis. Nat Genet 8, 380.
2.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A., Bender, D., Maller, J., Sklar, P., de Bakker, P.I., Daly, M.J., et al. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81, 559–575.
3.
Kenny, E.E., Gusev, A., Riegel, K., Lütjohann, D., Lowe, J.K., Salit, J., Maller, J.B., Stoffel, M., Daly, M.J., and Altshuler, D.M. (2009). Systematic haplotype analysis resolves a complex plasma plant sterol locus on the Micronesian Island of Kosrae. Proceedings of the National Academy of Sciences 106, 13886–13891.
4.
Moltke, I., Albrechtsen, A., Hansen, T.V., Nielsen, F.C., and Nielsen, R. (2011). A method for detecting IBD regions simultaneously in multiple individuals--with applications to disease genetics. Genome Res 21, 1168–1180.
5.
Gusev, A., Kenny, E.E., Lowe, J.K., Salit, J., Saxena, R., Kathiresan, S., Altshuler, D.M., Friedman, J.M., Breslow, J.L., and Pe’er, I. (2011). DASH: a method for identical-by-descent haplotype mapping uncovers association with recent variation. Am J Hum Genet 88, 706–717.
6.
Browning, S.R., and Thompson, E.A. (2012). Detecting rare variant associations by identity-by-descent mapping in case-control studies. Genetics 190, 1521–1531.
7.
Price, A.L., Helgason, A., Thorleifsson, G., McCarroll, S.A., Kong, A., and Stefansson, K. (2011). Single-tissue and cross-tissue heritability of gene expression via identity-by-descent in related or unrelated individuals. PLoS Genet 7, e1001317.
8.
Zuk, O., Hechter, E., Sunyaev, S.R., and Lander, E.S. (2012). The mystery of missing heritability: Genetic interactions create phantom heritability. Proceedings of the National Academy of Sciences 109, 1193–1198.
9.
Browning, S.R., and Browning, B.L. (2013). Identity-by-descent-based heritability analysis in the Northern Finland Birth Cohort. Hum Genet 132, 129–138.
10.
Palamara, P.F., Terhorst, J., Song, Y.S., and Price, A.L. (2018). High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability. Nat Genet 50, 1311–1317.
11.
Browning, B.L., and Browning, S.R. (2011). A fast, powerful method for detecting identity by descent. Am J Hum Genet 88, 173–182.
12.
Huff, C.D., Witherspoon, D.J., Simonson, T.S., Xing, J., Watkins, W.S., Zhang, Y., Tuohy, T.M., Neklason, D.W., Burt, R.W., and Guthery, S.L. (2011). Maximum-likelihood estimation of recent shared ancestry (ERSA). Genome Res 21, 768–774.
13.
Ramstetter, M.D., Dyer, T.D., Lehman, D.M., Curran, J.E., Duggirala, R., Blangero, J., Mezey, J.G., and Williams, A.L. (2017). Benchmarking relatedness inference methods with genome-wide data from thousands of relatives. Genetics 207, 75–82.
14.
Qiao, Y., Sannerud, J., Basu-Roy, S., Hayward, C., and Williams, A.L. (2019). Distinguishing pedigree relationships using multi-way identical by descent sharing and sex-specific genetic maps. BioRxiv, 753343.
15.
Gusev, A., Palamara, P.F., Aponte, G., Zhuang, Z., Darvasi, A., Gregersen, P., and Pe’er, I. (2012). The architecture of long-range haplotypes shared within and across populations. Mol Biol Evol 29, 473–486.
16.
Ralph, P., and Coop, G. (2013). The geography of recent genetic ancestry across Europe. PLoS Biol 11, e1001555.
17.
Fu, W., Browning, S.R., Browning, B.L., and Akey, J.M. (2016). Robust inference of identity by descent from exome-sequencing data. The American Journal of Human Genetics 99, 1106–1116.
18.
Han, E., Carbonetto, P., Curtis, R.E., Wang, Y., Granka, J.M., Byrnes, J., Noto, K., Kermany, A.R., Myres, N.M., and Barber, M.J. (2017). Clustering of 770,000 genomes reveals post-colonial population structure of North America. Nat Commun 8, 14238.
19.
Taylor, A.R., Schaffner, S.F., Cerqueira, G.C., Nkhoma, S.C., Anderson, T.J., Sriprawat, K., Phyo, A.P., Nosten, F., Neafsey, D.E., and Buckee, C.O. (2017). Quantifying connectivity between local Plasmodium falciparum malaria parasite populations using identity by descent. PLoS Genet 13, e1007065.
20.
Henden, L., Lee, S., Mueller, I., Barry, A., and Bahlo, M. (2018). Identity-by-descent analyses for measuring population dynamics and selection in recombining pathogens. PLoS Genet 14, e1007279.
21.
Palamara, P.F., Lencz, T., Darvasi, A., and Pe’er, I. (2012). Length distributions of identity by descent reveal fine-scale demographic history. Am J Hum Genet 91, 809–822.
22.
Palamara, P.F., and Pe’er, I. (2013). Inference of historical migration rates via haplotype sharing. Bioinformatics 29, i180–188.
23.
Browning, S.R., and Browning, B.L. (2015). Accurate Non-parametric Estimation of Recent Effective Population Size from Segments of Identity by Descent. Am J Hum Genet 97, 404–418.
24.
Browning, S.R., Browning, B.L., Daviglus, M.L., Durazo-Arvizu, R.A., Schneiderman, N., Kaplan, R.C., and Laurie, C.C. (2018). Ancestry-specific recent effective population size in the Americas. PLoS Genet 14, e1007385.
25.
Narasimhan, V.M., Rahbari, R., Scally, A., Wuster, A., Mason, D., Xue, Y., Wright, J., Trembath, R.C., Maher, E.R., and van Heel, D.A. (2017). Estimating the human mutation rate from autozygous segments reveals population differences in human mutational processes. Nat Commun 8, 303.
26.
Campbell, C.D., Chong, J.X., Malig, M., Ko, A., Dumont, B.L., Han, L., Vives, L., O’Roak, B.J., Sudmant, P.H., Shendure, J., et al. (2012). Estimating the human mutation rate using autozygosity in a founder population. Nat Genet 44, 1277–1281.
27.
Palamara, P.F., Francioli, L.C., Wilton, P.R., Genovese, G., Gusev, A., Finucane, H.K., Sankararaman, S., Sunyaev, S.R., de Bakker, P.I., and Wakeley, J. (2015). Leveraging distant relatedness to quantify human mutation and gene-conversion rates. The American Journal of Human Genetics 97, 775–789.
28.
Tian, X., Browning, B.L., and Browning, S.R. (2019). Estimating the Genome-wide Mutation Rate with Three-Way Identity by Descent. Am J Hum Genet 105, 883–893.
29.
Zhou, Y., Browning, B.L., and Browning, S. (2019). Population-specific recombination maps from segments of identity by descent. bioRxiv, 868091.
30.
Albrechtsen, A., Moltke, I., and Nielsen, R. (2010). Natural selection and the distribution of identity-by-descent in the human genome. Genetics 186, 295–308.
31.
Cai, Z., Camp, N.J., Cannon-Albright, L., and Thomas, A. (2011). Identification of regions of positive selection using Shared Genomic Segment analysis. Europ J Hum Genet 19, 667.
32.
Han, L., and Abney, M. (2013). Using identity by descent estimation with dense genotype data to detect positive selection. Europ J Hum Genet 21, 205.
33.
Browning, S.R., and Browning, B.L. (2010). High-resolution detection of identity by descent in unrelated individuals. Am J Hum Genet 86, 526–539.
34.
Leutenegger, A.-L., Prum, B., Génin, E., Verny, C., Lemainque, A., Clerget-Darpoux, F., and Thompson, E.A. (2003). Estimation of the inbreeding coefficient through use of genomic data. The American Journal of Human Genetics 73, 516–523.
35.
Browning, S.R. (2008). Estimation of pairwise identity by descent from dense genetic marker data in a population sample of haplotypes. Genetics 178, 2123–2132.
36.
Albrechtsen, A., Sand Korneliussen, T., Moltke, I., van Overseem Hansen, T., Nielsen, F.C., and Nielsen, R. (2009). Relatedness mapping and tracts of relatedness for genome-wide data in the presence of linkage disequilibrium. Genet Epidemiol 33, 266–274.
37.
Han, L., and Abney, M. (2011). Identity by descent estimation with dense genome-wide genotype data. Genet Epidemiol 35, 557–567.
38.
Brown, M.D., Glazner, C.G., Zheng, C., and Thompson, E.A. (2012). Inferring coancestry in population samples in the presence of linkage disequilibrium. Genetics 190, 1447–1460.
39.
Thompson, E. (2008). The IBD process along four chromosomes. Theor Popul Biol 73, 369–373.
40.
Gusev, A., Lowe, J.K., Stoffel, M., Daly, M.J., Altshuler, D., Breslow, J.L., Friedman, J.M., and Pe’er, I. (2009). Whole population, genome-wide mapping of hidden relatedness. Genome Res 19, 318–326.
41.
Kong, A., Masson, G., Frigge, M.L., Gylfason, A., Zusmanovich, P., Thorleifsson, G., Olason, P.I., Ingason, A., Steinberg, S., Rafnar, T., et al. (2008). Detection of sharing by descent, long-range phasing and haplotype imputation. Nat Genet.
42.
Dimitromanolakis, A., Paterson, A.D., and Sun, L. (2019). Fast and Accurate Shared Segment Detection and Relatedness Estimation in Un-phased Genetic Data via TRUFFLE. Am J Hum Genet 105, 78–88.
43.
Naseri, A., Liu, X., Tang, K., Zhang, S., and Zhi, D. (2019). RaPID: ultra-fast, powerful, and accurate detection of segments identical by descent (IBD) in biobank-scale cohorts. Genome biology 20, 143.
44.
Shemirani, R., Belbin, G.M., Avery, C.L., Kenny, E.E., Gignoux, C.R., and Ambite, J.L. (2019). Rapid detection of identity-by-descent tracts for mega-scale datasets. bioRxiv, 749507.
45.
Chiang, C.W., Ralph, P., and Novembre, J. (2016). Conflation of short identity-by-descent segments bias their inferred length distribution. G3: Genes, Genomes, Genetics 6, 1287–1296.
46.
Browning, B.L., and Browning, S.R. (2013). Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics 194, 459–471.
47.
Browning, B.L., and Browning, S.R. (2013). Detecting identity by descent and estimating genotype error rates in sequence data. Am J Hum Genet 93, 840–851.
48.
Rodriguez, J.M., Bercovici, S., Huang, L., Frostig, R., and Batzoglou, S. (2015). Parente2: a fast and accurate method for detecting identity by descent. Genome Res 25, 280–289.
49.
Kong, A., Thorleifsson, G., Gudbjartsson, D.F., Masson, G., Sigurdsson, A., Jonasdottir, A., Walters, G.B., Jonasdottir, A., Gylfason, A., and Kristinsson, K.T. (2010). Fine-scale recombination rate differences between sexes, populations and individuals. Nature 467, 1099.
50.
Williams, A.L., Genovese, G., Dyer, T., Altemose, N., Truax, K., Jun, G., Patterson, N., Myers, S.R., Curran, J.E., and Duggirala, R. (2015). Non-crossover gene conversions show strong GC bias and unexpected clustering in humans. Elife 4, e04637.
OpenUrl CrossRef PubMed
51.↵
The 1000 Genomes Project Consortium. (2012). An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65.
OpenUrl CrossRef PubMed Web of Science
52.↵
Bycroft, C., Freeman, C., Petkova, D., Band, G., Elliott, L.T., Sharp, K., Motyer, A., Vukcevic, D., Delaneau, O., and O’Connell, J. (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203.
OpenUrl CrossRef PubMed
53.↵
Taliun, D., Harris, D.N., Kessler, M.D., Carlson, J., Szpiech, Z.A., Torres, R., Taliun, S.A.G., Corvelo, A., Gogarten, S.M., and Kang, H.M. (2019). Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. bioRxiv, 563866.
54.↵
Weedon, M.N., Jackson, L., Harrison, J.W., Ruth, K.S., Tyrrell, J., Hattersley, A.T., and Wright, C.F. (2019). Very rare pathogenic genetic variants detected by SNP-chips are usually false positives: implications for direct-to-consumer genetic testing. bioRxiv, 696799.
55.↵
Durbin, R. (2014). Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Bioinformatics 30, 1266–1272.
OpenUrl CrossRef PubMed Web of Science
56.
Browning, B.L., Zhou, Y., and Browning, S.R. (2018). A one-penny imputed genome from next-generation reference panels. The American Journal of Human Genetics 103, 338–348.
57.
Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., et al. (2011). The variant call format and VCFtools. Bioinformatics 27, 2156–2158.
58.
The Genome of the Netherlands Consortium. (2014). Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat Genet 46, 818–825.
59.
Lappalainen, I., Almeida-King, J., Kumanduri, V., Senf, A., Spalding, J.D., Ur-Rehman, S., Saunders, G., Kandasamy, J., Caccamo, M., Leinonen, R., et al. (2015). The European Genome-phenome Archive of human data consented for biomedical research. Nat Genet 47, 692–695.
60.
Manichaikul, A., Mychaleckyj, J.C., Rich, S.S., Daly, K., Sale, M., and Chen, W.-M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873.
61.
Browning, S.R., and Browning, B.L. (2007). Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet 81, 1084–1097.
62.
Delaneau, O., Zagury, J.F., Robinson, M.R., Marchini, J.L., and Dermitzakis, E.T. (2019). Accurate, scalable and integrative haplotype estimation. Nat Commun 10, 5436.
63.
Haller, B.C., and Messer, P.W. (2016). SLiM 2: Flexible, interactive forward genetic simulations. Mol Biol Evol 34, 230–240.
64.
Kelleher, J., Etheridge, A.M., and McVean, G. (2016). Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Comp Biol 12, e1004842.
65.
Haller, B.C., Galloway, J., Kelleher, J., Messer, P.W., and Ralph, P.L. (2019). Tree-sequence recording in SLiM opens new horizons for forward-time simulation of whole genomes. Molecular Ecology Resources 19, 552–566.
66.
The International HapMap Consortium. (2007). A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851–861.
67.
Li, H., Glusman, G., Hu, H., Caballero, J., Hubley, R., Witherspoon, D., Guthery, S.L., Mauldin, D.E., Jorde, L.B., and Hood, L. (2014). Relationship estimation from whole-genome sequence data. PLoS Genet 10, e1004144.
68.
Maples, B.K., Gravel, S., Kenny, E.E., and Bustamante, C.D. (2013). RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am J Hum Genet 93, 278–288.
69.
Howie, B., Fuchsberger, C., Stephens, M., Marchini, J., and Abecasis, G.R. (2012). Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet 44, 955–959.
70.
Loh, P.-R., Danecek, P., Palamara, P.F., Fuchsberger, C., Reshef, Y.A., Finucane, H.K., Schoenherr, S., Forer, L., McCarthy, S., and Abecasis, G.R. (2016). Reference-based phasing using the Haplotype Reference Consortium panel. Nat Genet 48, 1443.


Posted December 12, 2019.


Subject Area

Bioinformatics


[1] 1.↵
Houwen, R.H., Baharloo, S., Blankenship, K., Raeymaekers, P., Juyn, J., Sandkuijl, L.A., and Freimer, N.B. (1994). Genome screening by searching for shared segments: mapping a gene for benign recurrent intrahepatic cholestasis. Nat Genet 8, 380.
OpenUrl CrossRef PubMed Web of Science

[2] 2.↵
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A., Bender, D., Maller, J., Sklar, P., de Bakker, P.I., Daly, M.J., et al. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81, 559–575.
OpenUrl CrossRef PubMed

[3] 3.
Kenny, E.E., Gusev, A., Riegel, K., Lütjohann, D., Lowe, J.K., Salit, J., Maller, J.B., Stoffel, M., Daly, M.J., and Altshuler, D.M. (2009). Systematic haplotype analysis resolves a complex plasma plant sterol locus on the Micronesian Island of Kosrae. Proceedings of the National Academy of Sciences 106, 13886–13891.
OpenUrl Abstract/FREE Full Text

[4] 4.↵
Moltke, I., Albrechtsen, A., Hansen, T.V., Nielsen, F.C., and Nielsen, R. (2011). A method for detecting IBD regions simultaneously in multiple individuals--with applications to disease genetics. Genome Res 21, 1168–1180.
OpenUrl Abstract/FREE Full Text

[5] 5.↵
Gusev, A., Kenny, E.E., Lowe, J.K., Salit, J., Saxena, R., Kathiresan, S., Altshuler, D.M., Friedman, J.M., Breslow, J.L., and Pe’er, I. (2011). DASH: a method for identical-by-descent haplotype mapping uncovers association with recent variation. Am J Hum Genet 88, 706–717.
OpenUrl CrossRef PubMed

[6] 6.↵
Browning, S.R., and Thompson, E.A. (2012). Detecting rare variant associations by identity-by-descent mapping in case-control studies. Genetics 190, 1521–1531.
OpenUrl Abstract/FREE Full Text

[7] 7.↵
Price, A.L., Helgason, A., Thorleifsson, G., McCarroll, S.A., Kong, A., and Stefansson, K. (2011). Single-tissue and cross-tissue heritability of gene expression via identity-by-descent in related or unrelated individuals. PLoS Genet 7, e1001317.
OpenUrl CrossRef PubMed

[8] 8.
Zuk, O., Hechter, E., Sunyaev, S.R., and Lander, E.S. (2012). The mystery of missing heritability: Genetic interactions create phantom heritability. Proceedings of the National Academy of Sciences 109, 1193–1198.
OpenUrl Abstract/FREE Full Text

[9] 9.
Browning, S.R., and Browning, B.L. (2013). Identity-by-descent-based heritability analysis in the Northern Finland Birth Cohort. Hum Genet 132, 129–138.
OpenUrl CrossRef PubMed

[10] 10.↵
Palamara, P.F., Terhorst, J., Song, Y.S., and Price, A.L. (2018). High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability. Nat Genet 50, 1311–1317.
OpenUrl CrossRef

[11] 11.↵
Browning, B.L., and Browning, S.R. (2011). A fast, powerful method for detecting identity by descent. Am J Hum Genet 88, 173–182.
OpenUrl CrossRef PubMed

[12] 12.↵
Huff, C.D., Witherspoon, D.J., Simonson, T.S., Xing, J., Watkins, W.S., Zhang, Y., Tuohy, T.M., Neklason, D.W., Burt, R.W., and Guthery, S.L. (2011). Maximum-likelihood estimation of recent shared ancestry (ERSA). Genome Res 21, 768–774.
OpenUrl Abstract/FREE Full Text

[13] 13.↵
Ramstetter, M.D., Dyer, T.D., Lehman, D.M., Curran, J.E., Duggirala, R., Blangero, J., Mezey, J.G., and Williams, A.L. (2017). Benchmarking relatedness inference methods with genome-wide data from thousands of relatives. Genetics 207, 75–82.
OpenUrl Abstract/FREE Full Text

[14] 14.↵
Qiao, Y., Sannerud, J., Basu-Roy, S., Hayward, C., and Williams, A.L. (2019). Distinguishing pedigree relationships using multi-way identical by descent sharing and sex-specific genetic maps. BioRxiv, 753343.

[15] 15.↵
Gusev, A., Palamara, P.F., Aponte, G., Zhuang, Z., Darvasi, A., Gregersen, P., and Pe’er, I. (2012). The architecture of long-range haplotypes shared within and across populations. Mol Biol Evol 29, 473–486.
OpenUrl CrossRef PubMed Web of Science

[16] 16.↵
Ralph, P., and Coop, G. (2013). The geography of recent genetic ancestry across Europe. PLoS Biol 11, e1001555.
OpenUrl CrossRef PubMed

[17] 17.
Fu, W., Browning, S.R., Browning, B.L., and Akey, J.M. (2016). Robust inference of identity by descent from exome-sequencing data. The American Journal of Human Genetics 99, 1106–1116.
OpenUrl PubMed

[18] 18.
Han, E., Carbonetto, P., Curtis, R.E., Wang, Y., Granka, J.M., Byrnes, J., Noto, K., Kermany, A.R., Myres, N.M., and Barber, M.J. (2017). Clustering of 770,000 genomes reveals post-colonial population structure of North America. Nat Commun 8, 14238.
OpenUrl CrossRef PubMed

[19] 19.
Taylor, A.R., Schaffner, S.F., Cerqueira, G.C., Nkhoma, S.C., Anderson, T.J., Sriprawat, K., Phyo, A.P., Nosten, F., Neafsey, D.E., and Buckee, C.O. (2017). Quantifying connectivity between local Plasmodium falciparum malaria parasite populations using identity by descent. PLoS Genet 13, e1007065.
OpenUrl CrossRef

[20] 20.↵
Henden, L., Lee, S., Mueller, I., Barry, A., and Bahlo, M. (2018). Identity-by-descent analyses for measuring population dynamics and selection in recombining pathogens. PLoS Genet 14, e1007279.
OpenUrl CrossRef PubMed

[21] 21.↵
Palamara, P.F., Lencz, T., Darvasi, A., and Pe’er, I. (2012). Length distributions of identity by descent reveal fine-scale demographic history. Am J Hum Genet 91, 809–822.
OpenUrl CrossRef PubMed

[22] 22.
Palamara, P.F., and Pe’er, I. (2013). Inference of historical migration rates via haplotype sharing. Bioinformatics 29, i180–188.
OpenUrl CrossRef PubMed

[23] 23.↵
Browning, S.R., and Browning, B.L. (2015). Accurate Non-parametric Estimation of Recent Effective Population Size from Segments of Identity by Descent. Am J Hum Genet 97, 404–418.
OpenUrl CrossRef PubMed

[24] 24.↵
Browning, S.R., Browning, B.L., Daviglus, M.L., Durazo-Arvizu, R.A., Schneiderman, N., Kaplan, R.C., and Laurie, C.C. (2018). Ancestry-specific recent effective population size in the Americas. PLoS Genet 14, e1007385.
OpenUrl CrossRef

[25] 25.↵
Narasimhan, V.M., Rahbari, R., Scally, A., Wuster, A., Mason, D., Xue, Y., Wright, J., Trembath, R.C., Maher, E.R., and van Heel, D.A. (2017). Estimating the human mutation rate from autozygous segments reveals population differences in human mutational processes. Nat Commun 8, 303.
OpenUrl CrossRef

[26] 26.
Campbell, C.D., Chong, J.X., Malig, M., Ko, A., Dumont, B.L., Han, L., Vives, L., O’Roak, B.J., Sudmant, P.H., Shendure, J., et al. (2012). Estimating the human mutation rate using autozygosity in a founder population. Nat Genet 44, 1277–1281.
OpenUrl CrossRef PubMed

[27] 27.↵
Palamara, P.F., Francioli, L.C., Wilton, P.R., Genovese, G., Gusev, A., Finucane, H.K., Sankararaman, S., Sunyaev, S.R., de Bakker, P.I., and Wakeley, J. (2015). Leveraging distant relatedness to quantify human mutation and gene-conversion rates. The American Journal of Human Genetics 97, 775–789.
OpenUrl CrossRef PubMed

[28] 28.↵
Tian, X., Browning, B.L., and Browning, S.R. (2019). Estimating the Genome-wide Mutation Rate with Three-Way Identity by Descent. Am J Hum Genet 105, 883–893.
OpenUrl

[29] 29.↵
Zhou, Y., Browning, B.L., and Browning, S. (2019). Population-specific recombination maps from segments of identity by descent. bioRxiv, 868091.

[30] 30.↵
Albrechtsen, A., Moltke, I., and Nielsen, R. (2010). Natural selection and the distribution of identity-by-descent in the human genome. Genetics 186, 295–308.
OpenUrl Abstract/FREE Full Text

[31] 31.
Cai, Z., Camp, N.J., Cannon-Albright, L., and Thomas, A. (2011). Identification of regions of positive selection using Shared Genomic Segment analysis. Europ J Hum Genet 19, 667.
OpenUrl CrossRef PubMed

[32] 32.↵
Han, L., and Abney, M. (2013). Using identity by descent estimation with dense genotype data to detect positive selection. Europ J Hum Genet 21, 205.
OpenUrl CrossRef PubMed

[33] 33.↵
Browning, S.R., and Browning, B.L. (2010). High-resolution detection of identity by descent in unrelated individuals. Am J Hum Genet 86, 526–539.
OpenUrl CrossRef PubMed Web of Science

[34] 34.↵
Leutenegger, A.-L., Prum, B., Génin, E., Verny, C., Lemainque, A., Clerget-Darpoux, F., and Thompson, E.A. (2003). Estimation of the inbreeding coefficient through use of genomic data. The American Journal of Human Genetics 73, 516–523.
OpenUrl CrossRef PubMed Web of Science

[35] 35.
Browning, S.R. (2008). Estimation of pairwise identity by descent from dense genetic marker data in a population sample of haplotypes. Genetics 178, 2123–2132.
OpenUrl Abstract/FREE Full Text

[36] 36.
Albrechtsen, A., Sand Korneliussen, T., Moltke, I., van Overseem Hansen, T., Nielsen, F.C., and Nielsen, R. (2009). Relatedness mapping and tracts of relatedness for genome-wide data in the presence of linkage disequilibrium. Genet Epidemiol 33, 266–274.
OpenUrl CrossRef PubMed Web of Science

[37] 37.
Han, L., and Abney, M. (2011). Identity by descent estimation with dense genome-wide genotype data. Genet Epidemiol 35, 557–567.
OpenUrl CrossRef PubMed

[38] 38.
Brown, M.D., Glazner, C.G., Zheng, C., and Thompson, E.A. (2012). Inferring coancestry in population samples in the presence of linkage disequilibrium. Genetics 190, 1447–1460.
OpenUrl Abstract/FREE Full Text

[39] 39.↵
Thompson, E. (2008). The IBD process along four chromosomes. Theor Popul Biol 73, 369–373.
OpenUrl CrossRef PubMed Web of Science

[40] 40.↵
Gusev, A., Lowe, J.K., Stoffel, M., Daly, M.J., Altshuler, D., Breslow, J.L., Friedman, J.M., and Pe’er, I. (2009). Whole population, genome-wide mapping of hidden relatedness. Genome Res 19, 318–326.
OpenUrl Abstract/FREE Full Text

[41] 41.↵
Kong, A., Masson, G., Frigge, M.L., Gylfason, A., Zusmanovich, P., Thorleifsson, G., Olason, P.I., Ingason, A., Steinberg, S., Rafnar, T., et al. (2008). Detection of sharing by descent, long-range phasing and haplotype imputation. Nat Genet.

[42] 42.↵
Dimitromanolakis, A., Paterson, A.D., and Sun, L. (2019). Fast and Accurate Shared Segment Detection and Relatedness Estimation in Un-phased Genetic Data via TRUFFLE. Am J Hum Genet 105, 78–88.
OpenUrl

[43] 43.↵
Naseri, A., Liu, X., Tang, K., Zhang, S., and Zhi, D. (2019). RaPID: ultra-fast, powerful, and accurate detection of segments identical by descent (IBD) in biobank-scale cohorts. Genome biology 20, 143.
OpenUrl

[44] 44.↵
Shemirani, R., Belbin, G.M., Avery, C.L., Kenny, E.E., Gignoux, C.R., and Ambite, J.L. (2019). Rapid detection of identity-by-descent tracts for mega-scale datasets. bioRxiv, 749507.

[45] 45.↵
Chiang, C.W., Ralph, P., and Novembre, J. (2016). Conflation of short identity-by-descent segments bias their inferred length distribution. G3: Genes, Genomes, Genetics 6, 1287–1296.
OpenUrl

[46] 46.↵
Browning, B.L., and Browning, S.R. (2013). Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics 194, 459–471.
OpenUrl Abstract/FREE Full Text

[47] 47.↵
Browning, B.L., and Browning, S.R. (2013). Detecting identity by descent and estimating genotype error rates in sequence data. Am J Hum Genet 93, 840–851.
OpenUrl CrossRef PubMed

[48] 48.↵
Rodriguez, J.M., Bercovici, S., Huang, L., Frostig, R., and Batzoglou, S. (2015). Parente2: a fast and accurate method for detecting identity by descent. Genome Res 25, 280–289.
OpenUrl Abstract/FREE Full Text

[49] 49.↵
Kong, A., Thorleifsson, G., Gudbjartsson, D.F., Masson, G., Sigurdsson, A., Jonasdottir, A., Walters, G.B., Jonasdottir, A., Gylfason, A., and Kristinsson, K.T. (2010). Fine-scale recombination rate differences between sexes, populations and individuals. Nature 467, 1099.
OpenUrl CrossRef PubMed Web of Science

[50] 50.↵
Williams, A.L., Genovese, G., Dyer, T., Altemose, N., Truax, K., Jun, G., Patterson, N., Myers, S.R., Curran, J.E., and Duggirala, R. (2015). Non-crossover gene conversions show strong GC bias and unexpected clustering in humans. Elife 4, e04637.
OpenUrl CrossRef PubMed

[51] 51.↵
The 1000 Genomes Project Consortium. (2012). An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65.
OpenUrl CrossRef PubMed Web of Science

[52] 52.↵
Bycroft, C., Freeman, C., Petkova, D., Band, G., Elliott, L.T., Sharp, K., Motyer, A., Vukcevic, D., Delaneau, O., and O’Connell, J. (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203.
OpenUrl CrossRef PubMed

[53] 53.↵
Taliun, D., Harris, D.N., Kessler, M.D., Carlson, J., Szpiech, Z.A., Torres, R., Taliun, S.A.G., Corvelo, A., Gogarten, S.M., and Kang, H.M. (2019). Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. bioRxiv, 563866.

[54] 54.↵
Weedon, M.N., Jackson, L., Harrison, J.W., Ruth, K.S., Tyrrell, J., Hattersley, A.T., and Wright, C.F. (2019). Very rare pathogenic genetic variants detected by SNP-chips are usually false positives: implications for direct-to-consumer genetic testing. bioRxiv, 696799.

[55] 55.↵
Durbin, R. (2014). Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Bioinformatics 30, 1266–1272.
OpenUrl CrossRef PubMed Web of Science

[56] 56.↵
Browning, B.L., Zhou, Y., and Browning, S.R. (2018). A one-penny imputed genome from next-generation reference panels. The American Journal of Human Genetics 103, 338–348.
OpenUrl CrossRef PubMed

[57] 57.↵
Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., Depristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., et al. (2011). The variant call format and VCFtools. Bioinformatics 27, 2156–2158.
OpenUrl CrossRef PubMed Web of Science

[58] 58.↵
The Genome of the Netherlands Consortium. (2014). Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat Genet 46, 818–825.
OpenUrl CrossRef PubMed

[59] 59.↵
Lappalainen, I., Almeida-King, J., Kumanduri, V., Senf, A., Spalding, J.D., Ur-Rehman, S., Saunders, G., Kandasamy, J., Caccamo, M., Leinonen, R., et al. (2015). The European Genome-phenome Archive of human data consented for biomedical research. Nat Genet 47, 692–695.
OpenUrl CrossRef PubMed

[60] 60.↵
Manichaikul, A., Mychaleckyj, J.C., Rich, S.S., Daly, K., Sale, M., and Chen, W.-M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873.
OpenUrl CrossRef PubMed Web of Science

[61] 61.↵
Browning, S.R., and Browning, B.L. (2007). Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet 81, 1084–1097.
OpenUrl CrossRef PubMed Web of Science

[62] 62.↵
Delaneau, O., Zagury, J.F., Robinson, M.R., Marchini, J.L., and Dermitzakis, E.T. (2019). Accurate, scalable and integrative haplotype estimation. Nat Commun 10, 5436.
OpenUrl

[63] 63.↵
Haller, B.C., and Messer, P.W. (2016). SLiM 2: Flexible, interactive forward genetic simulations. Mol Biol Evol 34, 230–240.
OpenUrl

[64] 64.↵
Kelleher, J., Etheridge, A.M., and McVean, G. (2016). Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Comp Biol 12, e1004842.
OpenUrl

[65] 65.↵
Haller, B.C., Galloway, J., Kelleher, J., Messer, P.W., and Ralph, P.L. (2019). Tree - sequence recording in SLiM opens new horizons for forward - time simulation of whole genomes. Molecular ecology resources 19, 552–566.
OpenUrl

[66] 66.↵
The International HapMap Consortium. (2007). A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851–861.
OpenUrl CrossRef PubMed Web of Science

[67] 67.↵
Li, H., Glusman, G., Hu, H., Caballero, J., Hubley, R., Witherspoon, D., Guthery, S.L., Mauldin, D.E., Jorde, L.B., and Hood, L. (2014). Relationship estimation from whole-genome sequence data. PLoS Genet 10, e1004144.
OpenUrl CrossRef PubMed

[68] 68.↵
Maples, B.K., Gravel, S., Kenny, E.E., and Bustamante, C.D. (2013). RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am J Hum Genet 93, 278–288.
OpenUrl CrossRef PubMed

[69] 69.↵
Howie, B., Fuchsberger, C., Stephens, M., Marchini, J., and Abecasis, G.R. (2012). Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet 44, 955–959.
OpenUrl CrossRef PubMed

[70] 70.↵
Loh, P.-R., Danecek, P., Palamara, P.F., Fuchsberger, C., Reshef, Y.A., Finucane, H.K., Schoenherr, S., Forer, L., McCarthy, S., and Abecasis, G.R. (2016). Reference-based phasing using the Haplotype Reference Consortium panel. Nat Genet 48, 1443.
OpenUrl CrossRef PubMed