Is it reasonable to use of a kinship matrix for best linear unbiased prediction?

The linear mixed model (LMM) is characterized to account for the variance-covariance among entities in a population toward calculating the best linear unbiased prediction (BLUP). Animal and plant breeders widely use the LMM because it is perceived that the a BLUP estimate informs an estimated breeding value (EBV), so to speak a combining ability as a parent, obtained by relating each entity to his/her relatives using the variance-covariance. The LMM practice routinely substitutes an external kinship matrix for the variance-covariance. The challenge relevant to the LMM practice is the fact that it is unrealistic to validate the EBVs because the real breeding values are not measurable but conceptual. This unreality actually means that the EBVs are vague. Although some previous studies measured correlations between the EBVs and empirical combining abilities, they are not sufficient to remove the vagueness of EBVs because uncontrollable environmental factors might interfere with phenotypic observations for measuring the combining abilities. To overcome the challenge, this study scrutinized the soundness of the routine LMM practice from the mathematical perspective. As a result, it was demonstrated that the BLUP estimates resulting from the routine LMM practice mislead the breeding values. The genuine BLUP represents the arithmetic means of multiple phenotypic observations per each entity, given all phenotypic observations adjusted to the mean of zero.


Introduction
A breeding value means a combining ability of an entity as a parent. The linear mixed model (LMM) is a statistical method that is widely used for estimating breeding values (Piepho, 1994;Panter and Allen, 1995a;Panter and Allen, 1995b;Choi et al, 2017). The fixed-effect variable and random-effect variable are two parameters of LMM. Of these, the random-effect variable has its own variance-covariance, which characterizes the LMM differently from the linear model (linear regression) that is restricted to estimating the fixed-effect variable. The resultant randomeffect variable contains estimated breeding values (EBVs) that the variance-covariance implies.
It is generally perceived that the variance-covariance in the LMM corrects the breeding-value underestimations and overestimations by accounting for the relationship between entities.
Routine LMM practices substitute an external kinship matrix resized toward obtaining the least square for the variance-covariance (Robinson, 1991). A number of previous studies validated the effectiveness of BLUP taking advantage of the kinship matrix by empirically measuring the combining abilities through field tests and computer simulations (Belonsky and Kennedy, 1988;Piepho, 1994;Panter and Allen, 1995a;Panter and Allen, 1995b;Bauer et al, 2006;Nielsen et al, 2011;Choi et al, 2017;Manzanila-Pech et al, 2017). However, the empirical validations necessarily limit thorough evaluation of the quality of EBVs because it is impossible to obtain exact combining abilities to be compared with the EBVs. To overcome the limitation, this study investigated the integrity of BLUP from the mathematical perspective.

Naive-BLUP
The LMM is denoted as: where is the phenotypic variable; is the design matrix for fixed effect; = the design matrix for random effect; = the unknown fixed-effect variable; = the unknown random-effect variable; = the unknown error-term variable.
Equation 1 assumes ~ (0, ). The u contains EBVs and can be calculated as follows: The ( ) can be calculated as follows: The ( ) can calculated as follows: Let us substitute Equation 4 for Equation 3 as follows: demonstrates that the ( ′ ) in Equation 3 is negligible because it will be cancelled by ( ). For simplicity, therefore, Equation 3 can omit the ( ′ ) so that: Equation 6 can be written as follows: where is a vector containing the averages of .
Equation 7 is equivalent to the following equation: Equation 9 is derived from statistical basics. Hereafter, the resulting from Equation 9 is called Naive-BLUP. Every value of for Equation 9 represents the arithmetic mean of multiple phenotypic observations per each entity, given all phenotypic observations adjusted to the mean of zero. Once is known, for Equation 1 can be calculated using the following equation:

Average-vector
If the Naive-BLUP does not adjust all phenotypic observations to the mean of zero, the resultant vector will contain the arithmetic means of multiple phenotypic observations per each entity, which can be calculated as follows: Hereafter, the resulting from Equation 11 is called Average-vector.

K-BLUP
The equation for the routine LMM is as follows: Equation 12 was originally derived by Henderson (1975). The exact year of its first release is unclear. The mathematical derivation of Equation 12 from Equation 1 is available in a lecture note by Searle (1980).

In Equation 12
, the can be obtained using the row operation as follows (Kim et al. 2018): Therefore, Equation 13 can be transformed into the following form: Equation 14 comprises the K (kinship matrix). Hereafter, the resulting from Equation 14 is called K-BLUP. Equation 10 can be commonly used for calculating the . This study fitted the K-BLUP using the expectation-maximization (EM) algorithm using GWASpro (Kim et al, 2018).
More details about the EM algorithm can be found in the related paper.

Rice data set
To compare between two 's obtained by the Naive-BLUP and K-BLUP, an open-access rice data set comprising phenotypic and SNP (single nucleotide polymorphism) data was used, which was downloaded from http://ricediversity.org/data/index.cfm. The data set was originally prepared, analyzed and released by Spindel et al. (2015). Therefore, more details can be found in the related paper. The phenotypic trait used in this study was grain yield (kg/ha), which had the structure of three dimensions: four years (2009,2010,2011,2012) missing phenotypic observations or were absent in the SNP data. Accordingly, the SNP data set was aligned with the phenotypic data, which resulted in 107 accessions genotyped with 108,024 SNPs. The SNP data set was used to calculate a using Numericware i (Kim and Beavis, 2017).

Validating the integrity of K-BLUP
To examine the integrity of K-BLUP, the following proposition was established: The process from Equations 1 to 7 shows that the ( )in the LMM must hold the property of Equation 7 which is In order for the K-BLUP to be correct, therefore, it must be satisfied Meanwhile, the Naive-BLUP is based on Equation 7 so that the property of Equation 7 is naturally held. To compare the two ( )s calculated by and Equation 7, the Pearson correlation coefficient between their lower triangle matrices was calculated.

Repository of input data and computer code
All input data and computer code are freely available at https://github.com/bongsongkim/BLUP.

Comparison between Naive-BLUP, K-BLUP and Average-vector
The Naive-BLUP, K-BLUP and Average-vector were calculated using the given rice data set. Figure 1A shows that the correlation coefficient between the resultant K-BLUP and Naive-BLUP was 0.9526. Figure 1B shows that the correlation coefficient between the resultant Naive-BLUP and Average-vector was 1.0, illustrating the complete proportionality between them. Figure 1C shows that the correlation coefficient (0.9526) between the resultant K-BLUP and Averagevector was the same as the case of Figure 1A, which is due to the complete proportionality between the Naive-BLUP and Average-vector. Table 1 summarizes the resultant K-BLUP, Naive-BLUP and Average-vector and shows that the grand mean of Naive-BLUP is zero. Figure   1B and the mean of zero indicate that every value for Naive-BLUP represents the average of multiple phenotypic observations per each entity, given all phenotypic observations adjusted to the mean of zero.

( )s resulting from K-BLUP and Naive-BLUP
Two ( )s resulting from the Naive-BLUP and K-BLUP were computed using the same rice data set. Sub-matrices of the former and the latter are shown in Tables 2 and 3, Tables 2 and 3 showed the two ( )s are different, leading to the conclusion that Proposition 1 is not satisfied. Figure 2 shows the correlation plot between two lower triangle matrices of ( ) s calculated by and Equation 7. The resultant correlation coefficient was 0.099, indicating that the two ( )s were hardly correlated.

Discussion
The mathematical process from Equations 1 through 9 for the Naive-BLUP shows the whole property of BLUP. Of the process, the relationship between ( ) and ( ) in Equation 7 indicates that depends on . This means that the direction of is influenced by , in which it is important to note that the direction of implies the ranking of entities by breeding value. This is a crucial property of BLUP because the phenotypic variable ( ) must be a key input for the breeding-value variable ( ). Equation 9 shows that, given all phenotypic observations adjusted to the mean of zero, the Naive-BLUP results in a vector containing the arithmetic means of multiple phenotypic observations per each entity. Figure 1B suggests that the Naive-BLUP is proportional to the Average-vector, illustrating that the latter implies the former. Considering that the average of multiple phenotypic observations can represent how well an entity adapts to multiple environments, its association with the breeding value is reasonable (Piepho, 1994). This is because the meaning of breeding value must comprehend phenotypic variation for an entity. In other words, a single phenotypic observation cannot be eligible for a breeding value because the phenotypic variation is missing.
Meanwhile, the K-BLUP assumes that ( ) = other than Equation 7. Considering that the is a scholar, the direction of is solely determined by . This causes a critical problem in that cannot influence the direction of . This problem illustrates that the assumption of K-BLUP fundamentally has a flaw. In actuality, however, the dissatisfaction of Proposition 1 means that does not result in ( ), indicating that the assumption of K-BLUP cannot be satisfied. This doubles the flaw of K-BLUP with regard to the assumption that ( ) = .
The discrepancy between two ( )s resulting from the Naive-BLUP and K-BLUP is shown in the comparison between Tables 2 and 3 as well as Figure 2. Another flaw of K-BLUP is the dilemma that ( ) must be known to estimate the unknown , which is unrealistic. To overcome this unreality, the routine practice of K-BLUP substitutes for ( ) (Yu et al, 2006;Bradbury et al, 2007;Gilmour et al, 2009;Lipka et al, 2012;Kim et al, 2018). However, merely implies either the genomic similarity based on DNA-sequence homology or genealogical co-ancestry based on pedigrees, other than the variance-covariance among EBVs (Emik and Terrill, 1949;Henderson, 1976;VanRaden, 2008;Kim et al, 2016;Kim and Beavis, 2017). Therefore, the use of as the substitution for the unknown ( ) in the K-BLUP misleads the breeding values.

Conclusion
The ( ) in the Naive-BLUP soundly secures the integrity of BLUP as well as correct breeding values. The soundness indicates that the Naive-BLUP is the genuine BLUP.