Correcting statistical bias in correlation-based kinship estimators

Accurate estimate of relatedness is important for genetic data analyses, such as association mapping and heritability estimation based on data collected from genome-wide association studies. Inaccurate relatedness estimates may lead to spurious associations and biased heritability estimations. Individual-level genotype data are often used to estimate kinship coefficient between individuals. The commonly used sample correlation-based genomic relationship matrix (scGRM) method estimates kinship coefficient by calculating the average sample correlation coefficient among all single nucleotide polymorphisms (SNPs), where the observed allele frequencies are used to calculate both the expectations and variances of genotypes. Although this method is widely used, a substantial proportion of estimated kinship coefficients are negative, which are difficult to interpret. In this paper, through mathematical derivation, we show that there indeed exists bias in the estimated kinship coefficient using the scGRM method when the observed allele frequencies are regarded as true frequencies. This leads to negative bias for the average estimate of kinship among all individuals, which explains the estimated negative kinship coefficients. Based on this observation, we propose an unbiased estimation method, UKin, which can reduce the bias. We justify our improved method with rigorous mathematical proof. We have conducted simulations as well as two real data analyses to demonstrate that both bias and root mean square error in kinship coefficient estimation can be reduced by using UKin. Further simulations indicate that the power in association mapping can also be improved by using our unbiased kinship estimates to adjust for cryptic relatedness. Author summary Inference of relatedness plays an important role in genetic data analysis. Many methods have been proposed to estimate kinship coefficients, including the commonly used genomic relationship matrix method. However, a substantial proportion of the kinship coefficients estimated by this method are negative, which is difficult to interpret. In this paper, through mathematical derivation, we show that there indeed exists a negative bias in this approach. To correct for this bias, we propose a new kinship coefficient estimation method, UKin, which is unbiased without requiring extra genetic information nor added computational complexity. The better performance of UKin in reducing bias and root mean squared error is demonstrated through theory, simulations and analysis of data from the young-onset breast cancer and familial intracranial aneurysm studies.

Accurate estimation of relatedness among individuals is important in genetic data analysis. For 2 example, in both population-based and family-based genome-wide association studies (GWAS) 3 with uncertain relationships among study subjects, it is critical to appropriately account for 4 cryptic relatedness because incorrect estimates can decrease power and inflate false positive 5 rates of association tests [1][2][3]. Several methods have been proposed to adjust for relatedness in 6 GWAS, such as introducing a genomic relationship matrix (GRM) as an augment into 7 2/24 well-developed linear mixed model (LMM) [4][5][6][7]. It has been demonstrated that proper 8 consideration of genetic relatedness can also benefit heritability estimation based on GWAS data 9 in the presence of pedigree structures [8,9]. 10 In order to adjust for cryptic relatedness in genetic studies like association mapping and 11 heritability estimation, individual-level genotype data are often used to estimate pairwise kinship 12 coefficients. The sample correlation-based genomic relationship matrix (scGRM) method 13 estimates kinship coefficient by calculating the average sample correlation coefficient among all 14 genetic variants, in which the observed allele frequencies are used for the calculation of both 15 expectation and variance of genotypes [10][11][12]. We note that most association mapping and 16 heritability estimation packages use this method as their default setting for calculating GRM, 17 such as GCTA, GEMMA and FaSTLMM [6,8,13]. Although this method is widely used, 18 researchers have noted that a substantial proportion of the estimated kinship coefficients are 19 negative. As kinship coefficient is defined to be a positive number (see in Materials and 20 Methods), it is difficult to interpret these negative estimates [14][15][16]. 21 In this paper, through mathematical derivation, we first show that there indeed exists bias in 22 the estimated kinship coefficients using the scGRM method. The bias exists because the 23 observed allele frequencies are regarded as true frequencies. We also prove analytically that the 24 bias essentially results in a negative average for all estimates, which explains the large 25 proportion of negative values. Based on this observation, we propose an improved kinship 26 estimation method, UKin, which can remove bias. We provide a mathematical proof for the 27 unbiasedness of the UKin estimator. Simulations and real data analyses also demonstrate that 28 both bias and root mean square error (RMSE) can be reduced by replacing the scGRM method 29 with our UKin method. For real data analyses, we apply our method to two studies, young-onset 30 breast cancer (BC) and familial intracranial aneurysm (FIA), which have pedigree information to 31 evaluate our results. Finally, as an application of our method in association mapping, we conduct 32 a simulation study to show the power of detecting genetic associations can be improved by 33 correcting cryptic relatedness using our unbiased kinship estimates. 34 The paper is organized as follows. In the Results section, we evaluate the performance of 35 UKin through two simulations and two real data sets in BC and FIA to validate our theoretical 36 derivation and demonstrate the effectiveness of UKin estimator in reducing bias and RMSE. In 37 the Materials and Methods section, we present the theoretical details which show the scGRM 38 method is biased, propose our UKin estimation method and give the correctness proof, as well as 39 its connection with the scGRM estimator. Lastly, we conclude with a simulation study in the 40 Discussion section to demonstrate that UKin method can further improve the power of 41 association mapping. Technical details such as mathematical derivations are provided in S1 42 Appendix. An illustrative example 46 We start our discussion with a simple but extreme example. In this experiment, we assumed that there were 500 full siblings from the same family. Although unlikely to exist in reality, this example serves as a good illustration of our theoretical derivation. As every two individuals selected from the same family were full siblings, the true value of their kinship coefficient should be 0.25 (see in Table 5). However, following Property 3 in the Materials and Methods section, their average kinship coefficient estimated by scGRM, denoted byφ , should have the expectation: where n is the sample size andρ is the average of their true genetic correlation coefficients.

47
Property 1 together with Table 5 in the Materials and Methods section suggest thatρ = 0.5 for 48 full siblings.

49
This result shows the unexpected phenomenon that although all individuals in our simulated 50 samples are full siblings to each other, the average of the estimated kinship coefficients has a 51 negative value. To illustrate Property 3 in practice, we simulated 200 unrelated families each 52 consisting of 500 full siblings with the method provided by the package CorBin [17]. Each ρ andρ represent the average of correlation coefficients between full siblings from the same 67 family, estimated by the UKin method and the scGRM method respectively, i.e.
As there was a linear relationship between kinship coefficient and correlation coefficient (see 69 Property 1 in Materials and Methods), the distributions of the average kinship coefficients 70 estimated by the two methods should have the same shape.     with breast cancer and another unaffected daughter. For each family, only the diseased daughter 106 and her unaffected full sister were genotyped for analysis. As for data quality control, we 107 removed individuals with more than 10% missing genotypes as well as SNPs with a missing 108 genotype rate greater than 5% or a minor allele frequency less than 5%. After further removing 109 individuals with missing phenotypes, we got 1,983 subjects (1,458 cases and 525 controls) with 110 614,310 variants in total. The processed data included 511 pairs of full sisters, with one affected 111 by breast cancer. We assumed individuals from different families were unrelated, then the true 112 values of all estimated kinship coefficients should be either 0.25 (511 full sister pairs) or 0 (all 113 the other individual pairs). 114 We first applied the scGRM method to estimate the kinship coefficients, which had poor 115 performance. For the 511 within-family pairs (full sister pairs), only 473 pairs were estimated to 116 have kinship coefficients between 2 −5/2 and 2 −3/2 , which means 7.4% of full sisters were 117 incorrectly inferred to be other kinds of relative pairs. Estimation of 1, 964, 642 = 1,983 2 − 511 118 between-family pairs with the scGRM method also performed poorly, as 0.45% unrelated pairs 119 between families were misspecified as 1st-degree relative pairs (such as sibling pairs), 120 2nd-degree relative pairs (such as half-sibs, avuncular pairs and grandparent-grandchild pairs) or 121 3rd-degree relative pairs (such as first cousins). In contrast, our UKin method had more accurate 122 estimates with 501 out of 511 (98.0%) full sisters pairs correctly estimated and only 177 123 unrelated pairs (less than 0.01%) were misspecified as 3rd-degree relative pairs (   results were 0.248 and 6.72 × 10 −4 , respectively. To visualize the difference between UKin and 140 scGRM, we also draw the scatter plot of the estimated kinship coefficients for the 511 full sister 141 pairs between the two methods (Fig 4). The scatter plot demonstrates that while the distribution 142 of UKin estimates is more concentrated at its true value, scGRM tends to overestimate the 143 kinship coefficients for many full sister pairs. These results clearly show the better performance 144 of UKin than scGRM. To further investigate the effectiveness of the UKin method in kinship coefficient estimation, 147 we applied UKin to infer pedigree structure using genotype data from the FIA linkage study 148 (dbGaP Study Accession: phs000293.v1.p1). This study recruited 400 families with multiple 149 individuals who have an intracranial aneurysm (IA) through 23 (25) referral centers throughout 150 North America, Australia, and New Zealand that represent 35 (40) recruitment sites. After a 151 standardized procedure of quality control and discarding subjects with missing phenotype, we 152 obtained 990 individuals from 371 families and each of them was genotyped at 5,505 SNPs. In 153 this FIA dataset, the confirmed relationships include 137 1st-degree relative pairs (including 19 154 full siblings and 118 parent-child pairs). 155 We compared the performance of UKin and scGRM in identifying these 1st-degree relative 156 pairs and estimating their kinship coefficients. UKin was able to correctly recognize all the 137 157 1st-degree pairs (with estimated kinship coefficients between 2 −2.5 and 2 −1.5 ), while scGRM 158 misspecified one parent-child pair as monozygotic twins, with an estimated kinship coefficient 159 137 1st-degree pairs in the IA data set (Fig 6). We further calculated the bias from the true value 164 (0.25) and RMSE of the estimated coefficients for each estimator. As summarized in Table 4, the 165 estimation bias of UKin was 1/6 of the bias estimated by scGRM, while the RMSE of UKin was 166 half of scGRM. We also note that scGRM misspecified 15 parent-child pairs or unrelated pairs 167 as MZ twins, while UKin only made five such mistakes which were all included in the 168 misspecified pairs of scGRM. These results demonstrate that our UKin method achieves more  inference.

191
In our theoretical derivations and simulation studies, we made assumptions like linkage equilibrium (LE) and absence of inbreeding, that is, genotypes at different markers are independent. During our derivation, we used the same weights for all SNPs, and our simulated datasets were also generated under this assumption. Although there is linkage disequilibrium (LD) in reality, empirical results from the analyses of the BC and FIA family data show that the 10/24 bias and RMSEs can also be reduced greatly with the application of UKin to real data. To consider the problem of LD in practice, we can give different weights based on LD to these SNPs. Following the approach of Wang (2017) [15], these LD weights w = (w 1 , w 2 , ..., w m ) T can be calculated by solving the following minimization problem: lk ] is the matrix of squared LD correlations. Theoretically, this result can be 192 directly applied to UKin by assigning the correlation coefficient at each SNP marker its 193 corresponding weight, which might make our approach adapt to LD situation.

194
Another assumption throughout our study is a homogeneous population so that the allele 195 frequencies can be calculated once and applied to all subjects. Some methods have been 196 proposed to estimate kinship coefficients in admixed populations, where the assumption of 197 population homogeneity is untenable [11,18,19]. However, as most of these methods are based 198 on the scGRM method, they are also likely to be biased estimators, too. How to extend our UKin 199 method to deal with admixed populations is a topic for future studies.

200
More accurate kinship estimation will improve the performance of different genetic analyses 201 such as association mapping. In recent years, GWAS have seen great success in identifying 202 genetic loci contributing to complex human traits [20,21]. By studying a genome-wide data set 203 of genetic variants in different individuals, GWAS looks for SNPs correlated with traits in the 204 samples. Accurate specification of familial relationships is expected to bring more powerful 205

11/24
association results in GWAS with unknown (or unrecognized) family structure. We have 206 investigated whether association mapping can be improved by applying UKin to account for 207 cryptic relatedness. 208 We conducted a simulation study to compare the performance of UKin with scGRM in 209 GWAS. In our experiments, we simulated 4,000 samples including 2,000 cases and 2,000 210 controls. We included subjects with various pairwise kinship coefficients in both cases and 211 controls. More specifically, we simulated 250 1st-degree relative pairs, 250 2nd-degree relative 212 pairs, 250 3rd-degree relative pairs, and 500 unrelated subjects. The total number of SNPs  We applied GEMMA [13], which was developed to implement the genome-wide mixed 219 model association algorithm for a standard linear mixed model for association analysis. In our 220 simulations, we performed likelihood ratio tests in a univariate LMM for marker association 221 mappings with a single phenotype. PLINK binary file format was [22] adopted as input files 222 containing phenotypes and genetic information. A standardized relatedness matrix file estimated 223 by either scGRM or UKin was included to appropriately account for relatedness among subjects. 224 We applied GEMMA to analyze the simulated GWAS dataset and selected all SNPs with 225 P-value below the threshold 5 × 10 −4 . Statistical power and type I error rate were calculated to 226 evaluate the performance of marker association tests when the relatedness matrix used in LMMs 227 was estimated by scGRM and UKin, respectively. The results suggest that the type I error rate 228 was appropriately controlled at a low level (less that 5 × 10 −4 ) for both methods. We compared 229 the power of association mapping which suggests that for the two risk SNP proportions

238
Alleles are said to be identical by descent (IBD) if they are inherited from a common ancestor. 239 To describe the average amount of IBD sharing at the genome level, we often adopt the concept 240 of kinship coefficient [12]. For two individuals indexed by a and b, their kinship coefficient, φ ab , 241 is defined as the probability that two alleles sampled at random from two individuals at the same 242 autosomal locus are IBD. Let k 0ab , k 1ab , k 2ab denote the probability that individuals a and b 243 share zero, one and two alleles IBD, respectively. The definition of kinship coefficient indicates 244 that φ ab can be expressed as a function of those IBD-sharing probabilities, to be more explicit, 245 reference alleles (with label A) for individual i at SNP marker j. Thus X i j takes values 0, 1, or 2 251 according to whether individual i has, respectively, 0,1, or 2 copies of allele A at marker j.

252
To simplify the illustration, we denote µ j and σ 2 j as the expectation and variance of X i j , 253 respectively. In other words, E(X i j ) = µ j ,Var(X i j ) = σ 2 j . We assume the population variance 254 for each marker is already known throughout our derivation. In practice, we can use sample 255 variance, an unbiased estimator of population variance, as a substitute. Now we consider a pair 256 of individuals i and i . We use ρ ii , j to denote the correlation coefficient between X i j and X i j .

257
Besides, we letρ j be the average of ρ ii , j among all the individual pairs, i.e.
If we further assume all individuals are sampled from a homogeneous population, we can 259 derive the following relationship among those correlations: 260 Property 1. Assume all individuals are sampled from a homogeneous population, then for This property has also been mentioned in other articles, for example, see [11]. A proof of 261 this property is given in S1 Appendix. Now we summarize the conclusions of this property as 262 follows:

263
Result i. implies that the correlation between X i j and X i j is irrelevant to which SNP we  Estimating kinship coefficient by calculating the average sample pairwise correlation among 269 all genetic variants has been taken by many methods. Following this principle, a natural 270 estimator of ρ ii is (1) X i j is the average counts of reference alleles (with label A) at SNP j in the 272 whole population. We callφ ii = 1 2ρii the scGRM estimator.

273
However, as we are going to demonstrate,ρ ii is actually a biased estimator of ρ ii . To 274 illustrate this, we need the following property:

275
Property 2. For 1 ≤ i, i ≤ n, 1 ≤ j ≤ m, the estimated correlation coefficient between X i j 276 and X i j has a systematic bias from ρ ii . More specifically, we have The proof is given in S1 Appendix.

278
Equation (2) also reveals that the expected value of 1 is not related to which SNP we select. Now we consider the expectation of estimator (1), it comes to the conclusion that Ifρ ii is an unbiased estimator of ρ ii , then we should have Eρ ii = ρ ii . However, the result 279 we derive is obviously contradictory to it. The existence of bias means a systematic error when 280 we estimate kinship coefficient via the scGRM method mentioned above. To make this fact 281 clearer, we sum the expectation of 1 The proof is given in S1 Appendix. 284

14/24
Recall that Eρ ii = E 1 From Property 1 we knowρ is the theoretical mean value of correlations between pair-wise 285 individuals, therefore it must take the value between 0 and 1. This fact together with Property 3 286 reveals that the mean value of estimatorρ ii is negative on average, which explains the empirical 287 observation that a substantial proportion of estimated kinship coefficients are negative.

288
This bias problem makesφ ii less desirable as an estimator of kinship between individuals i 289 and i . We can design an improved kinship estimation method which can eliminate the bias for 290 each pair of individuals based on the scGRM estimatorφ ii . The improved estimation method, 291 UKin, which stands for the unbiased kinship estimator, solves the bias problem without adding 292 much computational complexity. To understand how this method guarantees the unbiasedness, 293 we need the following property: 294 Property 4. For every SNP marker j, 1 ≤ j ≤ m, and every pair of individuals i and i , 1 ≤ i, i ≤ n, we have The proof is given in S1 Appendix.

295
For ease of presentation, we set Using (3), we also conclude that the expectation of u j ii does not depend on which SNP we 296 select. Based on this fact, a reasonable estimator of ρ ii is As Property 4 shows Eu j ii = ρ ii holds for every 1 ≤ j ≤ m, the expectation ofρ ii is still ρ ii . 298 In other words,ρ ii is an unbiased estimator of ρ ii , thusφ ii = 1 2ρii is an unbiased kinship 299 estimator. Besides, as we can observe from the expression of Eu j ii ,ρ ii is the sum of a group of 300 scGRM estimatorsρ ii and a few correction terms, which means the UKin estimator relies on the 301 same information we need for calculating the scGRM estimatorφ ii . Thus the implementation of 302 the UKin method doesn't require extra data.

303
It is worth noting that there exists some relationship between the scGRM and UKin estimator. 304

15/24
Substituting the expression of u j ii into (4), we get Equation (5) indicates that the UKin estimatorφ ii is a linear combination of some scGRM 306 estimatorsφ ii and constants. Thusφ ii andφ ii are based on the same genetic information.

307
Besides, this conclusion also shows that the UKin method won't bring a significant increase in 308 computational complexity than the scGRM method.

309
Throughout our above analysis, we make assumptions of no inbreeding, LE and population 310 homogeneity. In the Discussion we have analyzed these assumptions in detail.

311
Supporting information 312 S1 Appendix. Mathematical derivations of the properties in the Materials and Methods 313 section.

S1 Appendix
Proof of Property 1.
For the j-th single nucleotide polymorphism (SNP) (1 ≤ j ≤ m), let f j be the frequency of the reference allele (with label A) at that SNP. Consider a pair of individuals i and i whose kinship coefficient is denoted by φ ii , we derive the covariance of X i j and X i j from two different aspects. Recall that we denote ρ ii , j to be the correlation between X i j and X i j , thus we have On the other hand, X i j can be treated as the sum of two independent Bernoulli random variables. That is, X i j = B i j(1) + B i j (2) . For k = 1, 2, With this expression of X i j , we have As we denote f j to be the probability that a random allele chosen from the j-th SNP is A. Notice that B i j(k) B i j(k ) = 1 only when the two alleles selected from i and i at this marker are both with label A, under this circumstance, these two reference alleles are either identical by descent (IBD) or not. For simplicity, let A i j (k) represent the k-th alleles from individual i at SNP j, if we assume IBD genes have the same allelic types and non-IBD genes have independent allelic types, we obtain Consider the definitions of f j and φ ii , we obtain P(A i j (k) and A i j (k ) are IBD).
Substituting them into (S.2), we get As X i j is the sum of two i.i.d. Bernoulli random variables whose probability of success is f j , we can derive that σ 2 j = 2 f j (1 − f j ). Together with (S.1) and (S.3), we have Equation (S.4) also reveals that the value of correlation ρ ii , j doesn't depend on which SNP is selected, thus we get Proof of Property 2.
To demonstrate this property, we need a few preparations: i. Consider the result Cov(X i j , X i j ) = ρ ii σ 2 j , we have Directly applying this result yields ii. Based on (1)-(3) stated above, we have With these preparations, we now work on the demonstration of Property 2.

20/24
Directly expand the expression on the left side of (S.2), we have Substituting (S.5) and (S.6) into this expansion, we have Thus we derive the conclusion in Property 2.

22/24
Besides, if we change the sequence of summation, we have 1 2n Substituting them into the expansion, we get Thus we finish the proof of Property 3.

23/24
Proof of Property 4. At the start, we focus on a part of the expression on the left side of (3): Substituting (S.7), together with (S.9), into the whole expansion, we get Here we have proved the conclusion in Property 4.