A performance assessment of relatedness inference methods using genome-wide data from thousands of relatives

Inferring relatedness from genomic data is an essential component of genetic association studies, population genetics, forensics, and genealogy. While numerous methods exist for inferring relatedness, thorough evaluation of these methods in real data has been lacking. Here, we report an assessment of 11 state-of-the-art relatedness inference methods using a dataset with 2,485 individuals contained in several large pedigrees that span up to six generations. We find that all methods have high accuracy (~93% – 99%) when reporting first and second degree relationships, but their accuracy dwindles to less than 60% for fifth degree relationships. However, the inferred relationships were correct to within one relatedness degree at a rate of 83% – 99% across all methods and considered relationship degrees. Furthermore, most methods infer unrelated individuals correctly at a rate of ~99%, suggesting a low rate of false positives. Overall, the most accurate methods were ERSA 2.0 and approaches that classify relationships using the IBD segments inferred by Refined IBD and IBDseq. Combining results from the most accurate methods provides little accuracy improvement, indicating that novel approaches for relatedness inference may be needed to achieve a sizeable jump in performance.

The recent explosive growth in sample sizes of genetic datasets has led to an increasing proportion of close relatives hidden within these large studies, necessitating relatedness detection. Inferring relatedness between samples 1-3 is an essential step in performing genetic association studies [4][5][6] and linkage analysis [7][8][9] , is a powerful tool for forensic genetics 1,10,11 , and is needed to account for or remove relatives in population genetic analyses [12][13][14] . Relatedness estimation has also drawn the interest of the general public via companies such as 23andMe and AncestryDNA which advertise their ability to find and report relatives, allowing individuals to explore their ancestry and genealogy. The broad utility of relatedness estimation has motivated the development of numerous methods for such inference. These methods work by estimating the proportion of the genome shared identical by descent (IBD) between individuals 1,3 or a closely-related quantity, where an allele in two or more individuals' genomes is said to be IBD if those individuals inherit it from a recent common ancestor 2 . As previously shown, the distributions of IBD proportions for different relatedness classes (such as first cousins and half-first cousins) are expected to overlap 2,15 , posing a challenge for these inference procedures.
Here, we present a rigorous evaluation of 11 state-of-the-art methods that can scale to large study sizes, including seven that directly infer genome-wide relatedness measures [16][17][18][19][20][21][22] and four IBD segment detection methods 23-26 that we utilized to infer these quantities. To assess each of these methods, we used SNP array genotypes from Mexican American individuals contained in large pedigrees from the San Antonio Mexican American Family Studies (SAMAFS) [27][28][29] . Our analysis sample included 2,485 individuals genotyped at 521,184 SNPs (Supplemental Note) within pedigrees that span up to six generations with genotype data from as many as five generations of individuals. Given this large sample, including 13 pedigrees with >50 individuals (Supplemental Figure 1), numerous close relatives exist, and we used these to evaluate each of the inference methods. In particular, there are >4,500 pairs of individuals within each of the first through fifth degree relatedness classes that we evaluated, and we further considered more than three million pairs of individuals that are in distinct pedigrees and hence assumed unrelated (Table 1). Prior analyses of relatedness inference methods either considered simulated data 17,18,20-22 (which may not fully capture the complexities of real data) or used small sample sizes 17,18,22,30 . Our analysis using real data for large numbers of up to fifth degree relatives provides a comprehensive evaluation of these relatedness inference methods.
Our analysis considered each method's ability to correctly infer the degree of relatedness between the pairs of samples based on their reported relationships. These reported relationships are extremely reliable and in most cases we can validate them via first degree connections among samples in the densely-genotyped SAMAFS pedigrees. Some methods directly infer the degree of relatedness 19 while others infer a kinship coefficient 17,18,20 , a coefficient of relatedness 16,22 (which is two times the kinship coefficient 31 ), or instead detect IBD segments [23][24][25][26] (Table 2). To infer the degree of relatedness from an estimated kinship coefficient for a pair of samples, we use the ranges of estimated kinship values from the KING method 17 (Table 3). These ranges use differences in powers of two for the relatedness degree intervals, which is generally consistent with simulations 3 . For IBD detection methods that report the number of IBD segments shared at a locus 23,26 (denoted IBD0, IBD1, and IBD2 for the corresponding number of copies that are IBD), it is straightforward to calculate a kinship coefficient 2 . This coefficient, $\phi_{ij}$, between a pair of samples $i, j$ denotes the probability that a randomly selected allele in individual $i$ is IBD with a randomly selected allele from the same genomic position in $j$. Let $p^{(1)}_{ij}$ and $p^{(2)}_{ij}$ denote the proportion of their genomes that individuals $i, j$ share IBD1 and IBD2 respectively; then the kinship coefficient is
$$\phi_{ij} = \frac{p^{(1)}_{ij}}{4} + \frac{p^{(2)}_{ij}}{2},$$
where $p^{(1)}_{ij}$ and $p^{(2)}_{ij}$ are simply the sum of the genetic lengths of the IBD1 and IBD2 segments, respectively, between samples $i, j$ divided by the total genetic length of the genome analyzed. (Note if $i = j$, then $\phi_{ii} = \frac{1}{2}(1 + f_i)$ where $f_i$ is the kinship coefficient between the parents of $i$, which is equivalent to the inbreeding coefficient of individual $i$.)
For the IBD detection methods that do not distinguish between regions that are IBD1 from IBD2 24,25 , the proportion of the genome that is inferred to be IBD0 provides an alternate means of estimating the degree of relatedness (Table 3), with the ranges of values here again from the KING paper 17 . We classified individuals with lower kinship coefficients or higher IBD0 rates than indicated for the fifth degree range as unrelated.
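The kinship-based classification described above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the study: it applies the standard relation between kinship and the IBD1 and IBD2 proportions, $\phi = p^{(1)}/4 + p^{(2)}/2$, and assumes power-of-two bins in which degree $d$ covers kinship values in $[2^{-(d+1.5)}, 2^{-(d+0.5)})$. The exact boundaries in Table 3 may differ, and the function names are hypothetical.

```python
def kinship_from_ibd(p1: float, p2: float) -> float:
    """Kinship coefficient from the genome proportions shared IBD1 (p1) and IBD2 (p2)."""
    return p1 / 4.0 + p2 / 2.0


def degree_from_kinship(phi: float, max_degree: int = 5):
    """Map a kinship estimate to a relatedness degree.

    Assumed bins (power-of-two spacing): degree d covers
    [2^-(d+1.5), 2^-(d+0.5)). Returns 0 for values above the first
    degree bin (duplicate or monozygotic twin range) and None for
    pairs below the lowest bin, which are treated as unrelated.
    """
    if phi >= 2 ** -1.5:
        return 0
    for d in range(1, max_degree + 1):
        if phi >= 2 ** -(d + 1.5):
            return d
    return None


if __name__ == "__main__":
    # Expected sharing for full siblings: half the genome IBD1, a quarter IBD2.
    phi = kinship_from_ibd(0.5, 0.25)
    print(phi, degree_from_kinship(phi))
```

With expected sharing values, full siblings and parent-offspring pairs both give a kinship of 0.25 and fall in the first degree bin, which is why kinship alone does not separate those two relationship types.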
Using the SAMAFS sample, we assessed the performance of each program by using them to classify all pairs of individuals. Figure 1 shows the proportion of sample pairs inferred to be within each of the degree classes that we considered (first through fifth degree and unrelated), with results separated according to the reported and inferred relatedness degrees of the pairs. All methods perform well when inferring first and second degree relatives, with the accuracy ranging from 98.4% to 99.5% for first degree relatives, and from 93% to 98.6% for second degree relatives. For more distant relatedness, the IBD-based methods have higher accuracy than those that rely on allele frequencies of independent markers; for example, for fifth degree relatives, the top performing IBD-based method has 59.4% accuracy while the highest performing allele frequency-based method has only 53.8% accuracy. Overall, the most accurate programs are ERSA 2.0, Refined IBD, and IBDseq. The improved accuracy of IBD-based methods may be due to their focus on identifying long stretches of identical segments that more readily discriminate recent shared relatedness from chance sharing of alleles.
Table 2: Properties of the 11 relationship inference methods we analyzed. Type indicates the inference methodology the program uses. Runtime is wall clock time to run the program; we ran parallelized programs using the numbers of cores indicated in parentheses; total compute time for the parallelized programs is the runtime multiplied by the number of cores used. Input required from outside program indicates extraneous information needed to run the program. Programs that use either principal components or ancestral population proportions are indicated as accounting for population structure. "Y" indicates yes, "N" indicates no, and "NA" indicates not applicable. Runtimes are from a machine with four AMD Opteron 6176 2.30 GHz processors (64 cores total) and 256 GB memory.
Noting that the SAMAFS consist of admixed Mexican American individuals, we examined the accuracy results among the allele frequency-based methods, of which several account for population structure. Of all these methods, PC-Relate has the highest accuracy across all levels of relatedness, and it does account for population structure using principal components. Overall, the results are mixed with regards to accounting for population structure and accuracy, with PC-Relate, REAP, RelateAdmix, and KING all incorporating population structure into their models, and PREST-plus and PLINK ignoring this structure. Because relatedness structure can confound methods that detect population structure, we employed a procedure designed to locate true ancestral population proportions for the input supplied to REAP and RelateAdmix (Supplemental Note). PC-Relate, by contrast, addresses these concerns by performing population structure analysis internally using a set of samples with low levels of relatedness. However, IBD detection methods do not directly account for population structure and generally have the best performance.
The inference accuracy of all methods decreases for higher relatedness degrees, likely due to the exponential drop in mean pairwise IBD shared and an increased coefficient of variation as relatedness decreases 15,32,33 . In particular, for fifth degree relatives, the accuracy rates for all methods are very low at less than 60%. However, in nearly all cases (≥ 83.8%), the programs correctly inferred the degree of relatedness to within one degree of that reported in the SAMAFS pedigrees. IBDseq has the highest within-one-degree accuracy for reported fourth degree pairs (the relationship class with the lowest accuracies for off-by-one inference) at 98.7%. At the same time, the methods classify an average of 97.9% of pairs of unrelated individuals correctly, averaged across all programs (99.7% when PLINK is excluded), with few instances of fifth or greater degree of relatedness inferred for these pairs. These results suggest that, when methods do detect relatedness, even as distant as fifth degree, the individuals are likely to be truly related.
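The two accuracy measures discussed above (exact degree and within one degree) can be computed as in the following sketch. It is an illustration under stated assumptions rather than the evaluation code used here: degrees are integers 1 through 5, `None` marks an unrelated call, and an unrelated call is treated as one step beyond fifth degree (rank 6), which is a convention chosen only for this example. The function name is hypothetical.

```python
def degree_accuracy(reported, inferred, unrelated_rank: int = 6):
    """Return (exact, within_one) accuracy over paired degree labels.

    reported and inferred are equal-length sequences of degrees
    (ints 1..5) or None for unrelated. None is mapped to
    unrelated_rank, an assumption made for this sketch.
    """
    if len(reported) != len(inferred):
        raise ValueError("reported and inferred must have the same length")
    if not reported:
        raise ValueError("at least one pair is required")

    def rank(degree):
        return unrelated_rank if degree is None else degree

    n = len(reported)
    # Absolute difference in degree rank for every pair of samples.
    diffs = [abs(rank(r) - rank(i)) for r, i in zip(reported, inferred)]
    exact = sum(d == 0 for d in diffs) / n
    within_one = sum(d <= 1 for d in diffs) / n
    return exact, within_one


if __name__ == "__main__":
    # Four reported fourth degree pairs with toy inferred calls.
    print(degree_accuracy([4, 4, 4, 4], [4, 3, 5, None]))
```

In the toy input, one of four calls is exact and three of four are within one degree, which mirrors how a method can have modest exact accuracy but high off-by-one accuracy.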
Because the SAMAFS data consist of many closely related individuals, the allele frequencies derived from it have the potential to be biased. Furthermore, haplotype phasing and therefore IBD inference accuracy might be greater than would be achieved in a more outbred sample. To ensure the performance results presented here also apply to analyses of non-pedigree datasets, we identified a set of unrelated individuals using FastIndep 34 and merged these samples with pairs of related individuals to form 1,000 datasets that include different pairs of relatives (Supplemental Note). Each reduced dataset contains at most one pair of samples from any distinct SAMAFS pedigree, limiting the potential for bias. When classifying the related individuals included in at least one of these reduced datasets, PLINK's inference accuracy differs by less than 3% compared to the full dataset (Supplemental Figure 2), suggesting that allele frequency biases are small and only minimally impact inference accuracy. In order to test the IBD detection methods, we further merged 580 HapMap samples 35 with each of the reduced datasets (Supplemental Note). Results from running IBD detection methods on these datasets show a reduction in accuracy that ranges between 0% − 8%, yet the results are still consistent with those of the larger analysis (Supplemental Figure 3). Specifically, the IBD segment-finding methods tend to have higher performance than allele frequency-based methods, supporting the conclusion that IBD segment-based methods provide the highest accuracy. This is true even in the reduced datasets that have no more than 1,204 samples and therefore are subject to a relatively high level of phasing errors.
We examined the pairs of samples that were inferred to be related but were reported as unrelated (in distinct pedigrees) in the SAMAFS dataset. ERSA 2.0, Refined IBD, and IBDseq all inferred a small number of first through third degree relationships that connect individuals from different pedigrees within SAMAFS (Figure 2). Overall, we found 48 pairs of pedigrees with at least five pairs of relatives between them which all three methods unanimously infer to have the same degree of relatedness. Additionally, these three methods agreed on the inference of 374 and 1,632 pairs of fourth and fifth degree relatives between the pedigrees (not shown). These results highlight the importance of checking for relatedness among samples in all cohorts, and indicate that there can be sizable numbers of relatives across a range of degrees even in well-studied samples.
As current methods provide only moderate accuracy when classifying third through fifth degree relatives, we evaluated the potential for increasing performance by combining inference results from the top three programs. We used an approach that calls the degree of relatedness for a pair only when all three programs unanimously agree on the relatedness degree, providing no classification for other pairs. The resulting inference accuracy increased only negligibly (0.15%, 0.22%, 1.6%, 3.1%, 1.8%, and 0.01%, respectively for first through fifth degree and unrelated pairs) in comparison to the most accurate method's performance in each degree class. We also considered a majority vote between the three programs, discarding the cases in which all three programs inferred a different degree (only two cases were of this class). With this approach, there is a slight decrease in performance overall (-0.46%, -0.26%, -1.4%, -1.5%, +0.28%, +0.01%). These results suggest that while there is room for improvement in the specificity of relatedness inference methods, dramatic improvement is likely to be achieved only with novel approaches and not composites of current methods.
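The two combination rules described above, unanimous agreement and majority vote, can be sketched as follows. This is a minimal illustration with hypothetical function names and label conventions: each program's call is an integer degree or the string "unrelated", and `None` is returned when no call is made.

```python
from collections import Counter


def unanimous_call(calls):
    """Return the shared label if every program agrees, otherwise None (no call)."""
    first = calls[0]
    return first if all(c == first for c in calls) else None


def majority_call(calls):
    """Return the label chosen by a strict majority of programs, otherwise None."""
    label, count = Counter(calls).most_common(1)[0]
    return label if count > len(calls) / 2 else None


if __name__ == "__main__":
    # Toy calls from three programs for a single pair of samples.
    calls = (3, 3, 4)
    print(unanimous_call(calls), majority_call(calls))
```

The unanimous rule leaves every disputed pair unclassified, whereas the majority rule discards a pair only when all three programs disagree, which is the case excluded in the analysis above.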
We have presented a detailed comparison of state-of-the-art relatedness inference methods using thousands of pairs of individuals that range from first to fifth degree relatives as well as numerous individuals that are reported to be unrelated. All the methods we assessed reliably identify first and second degree relatives as well as unrelated pairs (accuracy ~93% – 99%), but their accuracy falls precipitously when classifying third to fifth degree relatives. This is unsurprising given the increased coefficient of variation as well as greater skewness in the proportion of genome shared as the meiotic distance between two relatives increases. Despite these challenges, the inferred relationship was within one degree of the reported relationship at a rate of 83% – 99% for all programs and relationship degrees (Figure 1). Misreported or unknown relationships in the SAMAFS dataset likely explain some of the inference errors, particularly since even some confidently inferred first degree relationships were likely misreported as a more distant relationship (Supplemental Table 4) or as unrelated (Figure 2). We find that IBD-based methods outperform other approaches for more distantly-related pairs, though notably these packages require substantially more compute time to run, which may limit their utility in some applications (Table 2). While the precise performance results presented here are specific to the SAMAFS sample, we find that reducing the sample size still produces similar results, with methods that leverage IBD segments having greater accuracy than other approaches. Therefore, the results presented here should be generalizable and indicate overall properties of relationship inference methodologies: approaches that use IBD segments outperform other methods for third degree and more distant relatives; and the specificity of relatedness inference, even in a dataset where phase accuracy may be relatively high, is inhibited for all but the closest relatives.
Figure 1: Performance comparison of the evaluated methods using the SAMAFS dataset. Bar plots indicate the percentage of pairs of samples that are reported to have a given degree of relatedness and who are inferred to be in each degree class. The bar plots are separated on the horizontal axis by the reported relatedness degree and on the vertical axis by inferred relatedness degree. For clarity, the plots list above each bar the percentage number that the corresponding bar depicts. Program names listed in red are IBD-based methods while those in black utilize allele frequencies for inference.
Figure 2: Relationships discovered between individuals from different SAMAFS pedigrees. Bands on the perimeter of the elliptical plot indicate distinct pedigrees within SAMAFS with band size proportional to the number of individuals in the pedigree. Curves between two bands correspond to discovered relative pairs with color indicating the degree of relatedness: red for first degree, green for second degree, and blue for third degree. Points where the curves end correspond to specific individuals, and a single point may have multiple curves running to it, indicating several relationships between that individual and others in the dataset.