The use of a genetic relationship matrix biases the best linear unbiased prediction

The best linear unbiased prediction (BLUP), derived from the linear mixed model (LMM), has been popularly used to estimate animal and plant breeding values (BVs) for a few decades. Conventional BLUP has a constraint that BVs are estimated from the assumed covariance among unknown BVs, namely conventional BLUP assumes that its covariance matrix is a λK\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \lambda K $$\end{document}, in which λ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \lambda $$\end{document} is a coefficient that leads to the minimum mean square error of the LMM, and K\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ K $$\end{document} is a genetic relationship matrix. The uncertainty regarding the use of λK\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \lambda K $$\end{document} in conventional BLUP was recognized by past studies, but it has not been sufficiently investigated. This study was motivated to answer the following question: is it indeed reasonable to use a λK\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \lambda K $$\end{document} in conventional BLUP? The mathematical investigation concluded: (i) the use of a λK\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \lambda K $$\end{document} in conventional BLUP biases the estimated BVs, and (ii) the objective BLUP, mathematically derived from the LMM, has the same representation as the least squares.


Introduction
A breeding value (BV) refers to the combining ability of an entity as a parent. Estimating the BVs is purposeful in seed and livestock industries, aiming to save time and resources in increasing genetic gains by maximizing selection accuracies and efficiencies. The best linear unbiased prediction (BLUP), derived from the linear mixed model (LMM), has been popularly used to estimate BVs for a few decades (Piepho 1994;Panter and Allen 1995a, b;Meuwissen et al. 2001;Choi et al. 2017). Conventional BLUP has a constraint that the estimated breeding values (EBVs) are derived from the assumed covariance among unknown BVs. Conventional BLUP uses a matrix representing pairwise genetic variations for an assumed covariance matrix (Henderson 1975). Pedigrees or genomic fingerprints are typically required for calculating the assumed covariance matrix (Emik and Terrill 1949;VanRaden 2008;Kim et al. 2016;Kim and Beavis 2017).
Previous studies reported that EBVs obtained by conventional BLUP have a high correlation with empirically observed combining abilities, based on field tests and computer simulations (Belonsky and Kennedy 1988;Piepho 1994;Panter and Allen 1995a, b;Bauer et al. 2006;Nielsen et al. 2011;Choi et al. 2017;Manzanilla-Pech et al. 2017). However, no studies clarified two uncertainties regarding conventional BLUP. First, there are no clues that the assumed covariance matrix is objective. If the assumed and objective covariance matrices are not equal, conventional BLUP is biased. Second, BVs are not physically measurable but conceptual. In fact, the combining ability for an entity can vary according to its gametes, mating partners and environments.
In this study, I investigated the aforementioned uncertainties from a mathematical perspective. As a result, this study discovered the following facts: (i) the objective covariance matrix can be estimated; (ii) conventional BLUP is biased in that the assumed and objective covariance matrices are different; and (iii) the unbiased BLUP, mathematically derived from the LMM, is equal to the least squares.

Naïve BLUP
The LMM is denoted as where y is the phenotypic variable; X is the design matrix for fixed effect; Z is the design matrix for random effect; b is the unknown fixed-effect variable; u is the unknown randomeffect variable; and e is the error-term variable. Equation (1) assumes e $ N ð0; Ir 2 ). The u represents a variable for BVs and can be calculated as follows: The var u ð Þ can be calculated as follows: The var y ð Þ can be calculated as follows: Let us substitute equation (4) for equation (3): Equation (5) demonstrates that r 2 Z 0 Z ð Þ À1 in equation (3) is negligible because it will be cancelled by var y ð Þ. For the sake of simplicity, therefore, equation (3) can omit r 2 Z 0 Z ð Þ À1 so that: Because var y ð Þ ¼ 1 nÀ1 y À y ð Þ y À y ð Þ 0 , equation (6) can be rewritten as: where n is the number of values in u; and y is a vector containing the arithmetic averages of y.
Equation (7) is in the same structure as var u Hereafter, equation (8) is called naïve BLUP. This is the unbiased solution of the LMM and has the same representation as the least squares. Every value ofû in equation (8) represents the arithmetic average of multiple phenotypic observations for each entity. Given theû, theb for equation (1) can be calculated using the following equation: Conventional BLUP Henderson et al. (1959) derived the following formula from the LMM: In equation (10),û can be obtained using the row operation (Kim et al. 2019) as follows: Henderson (1975) assumed that var u ð Þ ¼ kK, where k is a coefficient that leads to the minimum mean square error (MMSE) of the LMM, and K is a genetic relationship matrix (Robinson 1991). Therefore, equation (11) can be rewritten as:û Equation (12) is a mathematical representation of conventional BLUP. Because k = r 2 u r 2 e , conventional BLUP can be fitted by estimating r 2 e and r 2 u , in which the former and the latter are the variance components for e and u, respectively. In this study, equation 12 was fitted using the expectationmaximization (EM) algorithm (Kim et al. 2019). Given thê u, theb can be calculated using equation 9.

Rice dataset
To compare between two sets of EBVs, obtained by naïve BLUP and conventional BLUP, an open-access rice data set (Spindel et al. 2015) comprising phenotypic data and singlenucleotide polymorphism (SNP) data was used, which was downloaded from http://ricediversity.org/data/index.cfm. The phenotypic trait used in this study was grain yield (kg/ ha), which was recorded over four years (2009,2010,2011,2012) Meanwhile, the SNP data consisted of 108,024 SNPs, which was used to calculate a genetic relationship matrix (K) using Numericware i (Kim and Beavis 2017). The common entities between the SNP data and phenotypic data were 107, which was subjected to statistical analyses.

Condition of conventional BLUP being unbiased
The following proposition was established to judge whether conventional BLUP is biased.
Proof. Conventional BLUP assumes that var u ð Þ ¼ kK. According to the process from equations (1) through (7), the objective var u ð Þ is defined as var u ð Þ ¼ 1

R code and data availability
All R code and data are available at https://github.com/ bongsongkim/BLUP.

Results
Comparison between naïve BLUP and conventional BLUP Figure 1 shows the correlation between two sets of EBVs, obtained by naïve BLUP and conventional BLUP. Its Pearson correlation coefficient (q) is 0.9526. Table 1 summarizes the two sets of EBVs. Figure 1 and table 1 show that the two sets of EBVs deviate from each other. Naïve BLUP is equal to the least squares, suggesting that its resulting EBVs are statistically unbiased. The deviation between the two sets of EBVs indicates that conventional BLUP is biased due to the assumption that var u ð Þ ¼ kK.

Comparison between the objective and assumed covariance matrices
The objective and assumed covariance matrices were computed using the same rice dataset. The submatrices of the former and the latter are shown in tables 2 and 3, respectively. The objective covariance matrix was calculated by equation (7). The assumed covariance matrix was calculated by kK. The k is defined as r 2 u r 2 e , in which r 2 e and r 2 u are the variance components for e and u, respectively. In this study, the resulting estimates for r 2 e and r 2 u were 80019.630 and 86076.655, respectively. It led to k ¼ 1:076. Therefore, table 3 represents a submatrix of 1.076*K. The comparison between tables 2 and 3 shows that the two covariance matrices are different. Figure 2 shows a low correlation (q = 0.099) between the two covariance matrices. The discrepancy between the two covariance matrices led to the conclusion that kK 6 ¼ 1 nÀ1 Z 0 Z ð Þ À1 Z 0 y À y ð Þ y À y ð Þ 0 Z Z 0 Z ð Þ À1 . According to Proposition 1, therefore, conventional BLUP is biased.

Discussion
Naïve BLUP is based on the objective covariance matrix defined as equation (7), which implies that y influences the direction of u. The fact that u depends on y is crucial because the phenotypic variable (y) must impact the BV variable (u). Naïve BLUP is equal to the least squares, i.e. each naïve BLUP estimate represents the arithmetic average of multiple phenotypic observations per entity. This means that naïve BLUP estimates are unbiased because the arithmetic average indicates the point where the variance among phenotypic  observations per entity is at minimum. Essentially, the naïve BLUP estimate for an entity represents a genetic performance that implies the adaptability to environments. Considering that a parent with high adaptability has high chances to transmit environmental stress-resistant genes to their offspring, naïve BLUP estimates can be EBVs. Conventional BLUP has two fatal flaws regarding its constraint, var u ð Þ ¼ kK. First, conventional BLUP ignores the fact that the direction of u must depend on y. According to the constraint, k just multiplies K. This means that the direction of u is solely determined by K. Second, the constraint per se is unrealistic. In order for the constraint to be true, it must be satisfied that kK ¼ 1 nÀ1 Z 0 Z ð Þ À1 Z 0 y À y ð Þ y À y ð Þ 0 Z Z 0 Z ð Þ À1 . However, figure 2 and the comparison between tables 2 and 3 demonstrate that kK 6 ¼ 1 . This means that conventional BLUP violates the constraint. According to Proposition 1, therefore, conventional BLUP is biased.

Conclusion
The uncertainty regarding the use of kK in conventional BLUP was recognized by past studies (Blasco 2001;Postma 2006;Hadfield et al. 2010). Blasco (2001) argued that conventional BLUP may not be entitled to being called best and unbiased. Nevertheless, the use of kK in conventional BLUP has been considered as the best practice to overcome the unknowable var u ð Þ. Notably, this study derived naïve BLUP from the linear mixed model, and two facts were found therefrom. First, the objective covariance matrix can be defined as var u ð Þ ¼ 1 nÀ1 Z 0 Z ð Þ À1 Z 0 y À y ð Þ y À y ð Þ 0 Z Z 0 Z ð Þ À1 : Second, naïve BLUP is equal to the least squares. Considering that each naïve BLUP estimate represents the arithmetic average of multiple phenotypic observations per entity, the application of naïve BLUP is suited for autogamous species. This is because multiple entities being genetically identical can be made in autogamous species. The naïve BLUP estimate for each entity indicates a genetic performance that implies the adaptability to environments. From this perspective, naïve BLUP estimates can be EBVs. However, the use of naïve BLUP estimates is at a breeder's discretion. This is because the combining ability for a parent must have a large variability due to unknown complex polygenetic effects such as hybrid vigour or inbreeding depression. Figure 2. Correlation between the assumed and objective covariance matrices.